An Uberproblem
Prerequisites
Abstract
This article presents an interesting graph counting problem and its equally interesting solution. You may find the original statement in Romanian here. I personally think that, equipped with this solution, ubergraph is one of the most interesting problems I've ever encountered in my competitive programming journey.
The Problem Statement
We define an ubergraph as a directed acyclic graph with the property that no two nodes have the same set of out-neighbors. Given a number N (\(\le\) 300), find the number of unlabelled ubergraphs of N nodes.
*For simplicity, let's call the property of our DAG that no two nodes have the same out-neighbors the distinction property.
**We shall also denote the set of nodes by V, the set of directed edges by E, and the set of out-neighbors of node \(x\) by \(g(x)\).
***Because of the lack of nontrivial automorphisms, which we'll see later, an ubergraph is just a made-up term for an asymmetric DAG.
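To get a feel for the definition, here is a small brute-force sketch (an illustrative addition of mine, not part of the original solution) that enumerates every labelled digraph on \(n\) nodes and keeps the acyclic ones satisfying the distinction property:

```python
def count_labelled_ubergraphs(n):
    """Brute force: count labelled DAGs on n nodes in which
    no two nodes share the same set of out-neighbors."""
    pairs = [(a, b) for a in range(n) for b in range(n) if a != b]
    count = 0
    for bits in range(1 << len(pairs)):
        edges = [pairs[i] for i in range(len(pairs)) if bits >> i & 1]
        out = [set() for _ in range(n)]
        indeg = [0] * n
        for a, b in edges:
            out[a].add(b)
            indeg[b] += 1
        # acyclicity check: Kahn's algorithm on in-degrees
        order = [v for v in range(n) if indeg[v] == 0]
        for v in order:
            for u in out[v]:
                indeg[u] -= 1
                if indeg[u] == 0:
                    order.append(u)
        if len(order) != n:
            continue  # the edge set contains a cycle
        # distinction property: all out-neighbor sets pairwise different
        if len({frozenset(s) for s in out}) == n:
            count += 1
    return count
```

For \(n = 1, 2, 3\) this yields 1, 2 and 12 labelled ubergraphs, and each count is divisible by \(n!\), in line with the lack of nontrivial automorphisms discussed next.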
My Thought Process and Solution
As this is a counting problem, there are two natural approaches: a direct combinatorial formula or a DP. The limit, though, seems quite small for any popular combinatorics technique, and this is a strong suggestion that the solution is a DP. Even if knowledge of such formalities isn't required to solve the problem, we can think of the restriction as a way of saying that the graph has no automorphisms except the identity function (an automorphism being a function \( f:V \mapsto V, f \text{ bijective, s.t. } \forall (a,b) \in E, (f(a), f(b)) \in E \)).
This means, in simpler terms, that if we decide to label the DAG, there is no way of permuting the labels such that for every pair of labels \((a, b)\), the nodes carrying these labels after the permutation have an edge between them iff they had one in the original placement.
An implication of this is that the number of labelled ubergraphs is equal to the number of unlabelled ubergraphs times N factorial.
These observations seem quite promising for building a solution. With a little experience, they suggest that we can count labelled ubergraphs instead of unlabelled ones if we find a restriction that yields a unique labelling for each unlabelled one. This allows us to count structures with stronger conditions, which may dramatically reduce the difficulty of the problem. This may sound a little counterintuitive, but examples of such difficulty reductions are everywhere (e.g. the number of groups with N elements vs the number of abelian groups with N elements; the number of unlabelled trees vs the number of unlabelled complete binary trees, etc.).
Now we basically have to find a way of uniquely labelling a DAG, preferably respecting the topological order (i.e., if two nodes are labelled \(x, y\) with \(x < y\), there may not be any path from \(y\) to \(x\) in the DAG; yes, I know this feels backwards, but we will append to our dp from the "sink" up), so that we may try a natural DP approach of the form \(\text{dp[#nodes][some additional information]}\) where we append nodes in order of their label.
Let's look at the ubergraphs with four nodes to get an idea of such a labelling:
So far, we've managed to find the labels for two nodes, which are imposed anyway by the fact that we want our labels to respect the topological order 🥳 🍾. Now let's find something useful for the general case: suppose we have two nodes \(x\) and \(y\) such that \(x\) is not reachable from \(y\), \(y\) is not reachable from \(x\), and we have found the labels of all their out-neighbors. Which one of \(x\) and \(y\) should have the smaller label? There is obviously no direct answer, as so far the definition of what we are looking for is "something that feels useful". Now is the moment to think of a way of labelling the nodes that may lead to a DP approach. The way I did this was to just write down some random criteria, try elaborating on each, and stop when I found something promising; and the one I found is the following: define the "weight" of a node \(x\) as \(w(x)=\sum_{v \in g(x)} 2^{\text{label}(v)} \) and just give the smaller label to the one with the smaller weight. Keep in mind this weight function, as it will be a crucial part of our DP.
Let's now wrap it all up and see some pseudocode that will label an unlabelled ubergraph using the criteria we've stated before.
# the procedure is basically Kahn's algorithm run from the sink "up"
layer = [("the sink node")]
gt = ("the transposed graph")
outdeg = ["the outdegrees of the nodes"]
label = ["empty list of length N"]
label[("the sink node")] = 1
label_index = 2  # at each step we will increment this value to assign new labels
while layer != []:
    newlayer = []  # we will store here the nodes that are not yet labelled but have all of their out-neighbors labelled
    for node in layer:
        for v in gt[node]:
            outdeg[v] -= 1
            if outdeg[v] == 0:
                newlayer.append(v)
    sort newlayer by w  # we sort the yet-unlabelled nodes by the weight criterion, as they are interchangeable in the topological order
    for node in newlayer:  # we finally label the new nodes
        label[node] = label_index
        label_index += 1
    layer = newlayer  # we continue from the freshly labelled nodes
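For concreteness, here is a runnable Python version of the pseudocode above; the node names, the example graphs, and the dictionary-based representation are my own illustrative choices:

```python
def canonical_labels(nodes, edges):
    """Label an ubergraph: Kahn's algorithm run from the unique sink 'up',
    breaking ties inside a layer by the weight w(x) = sum of 2^label(v)
    over the out-neighbors v of x."""
    out = {u: set() for u in nodes}
    gt = {u: [] for u in nodes}  # the transposed graph
    for a, b in edges:
        out[a].add(b)
        gt[b].append(a)
    outdeg = {u: len(out[u]) for u in nodes}
    # the sink is unique: only one node may have the empty out-neighbor set
    sink = next(u for u in nodes if outdeg[u] == 0)
    label = {sink: 1}
    label_index = 2
    layer = [sink]
    while layer:
        newlayer = []
        for node in layer:
            for v in gt[node]:
                outdeg[v] -= 1
                if outdeg[v] == 0:
                    newlayer.append(v)
        # all out-neighbors of these nodes are labelled, so w is well defined
        newlayer.sort(key=lambda u: sum(1 << label[v] for v in out[u]))
        for node in newlayer:
            label[node] = label_index
            label_index += 1
        layer = newlayer
    return label
```

On the transitive tournament with edges a→b, a→c, b→c, for instance, this assigns c the label 1, b the label 2 and a the label 3.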
Now the fact that this procedure works is proof that our labelling method is valid and uniquely determines a labelling for an unlabelled ubergraph (ahem! labelling, labelling, labelling!). Now let's look at the properties of this labelling:
- \(\text{label}(x) < \text{label}(y) \implies w(x) < w(y) \)
- \(w(\text{node})\) basically encodes the adjacency list of \(\text{node}\), because \((\text{node}, v) \in E\) iff the bit of weight \(2^{\text{label}(v)}\) is one in the binary representation of \(w(\text{node})\)
- \(w(\text{node}) < 2^{\text{label(node)}} \)
By these properties, the canonical labelling turns an ubergraph into the strictly increasing sequence of weights \(w_0 < w_1 < \dots < w_{N-1}\) taken in label order, with \(w_i < 2^i\), and conversely any such sequence encodes a valid ubergraph through the adjacency lists. So we've just reduced our problem to counting these sequences, which seems much easier. I won't go into very much detail on how to efficiently compute this, but the DP goes something like this: $$ \text{dp[length of the sequence][index of the most significant bit of the value of the last element]} \\ \text{dp[0][0]}=\text{dp[1][1]}=1 \\ \text{dp[i][j]}=\sum^{j}_{k=0} \binom{2^{j-1}}{i-k} \text{dp[k][j-1]} $$ *Note that the labels start from zero, not from one as explained before. This may be computed in \(O(N^3)\) by just coding the recurrence above, or in \(O(N^2 \log N)\) by optimizing with FFT, but I have my doubts that this is the best attainable complexity.
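As a sanity check, here is one way to code a DP of this flavor in Python. Note that the indexing is my own equivalent reformulation (grouping weights by the position of their most significant bit), not the author's exact recurrence, so treat it as a sketch:

```python
from math import comb

def count_ubergraphs(n):
    """Count sequences 0 = w_0 < w_1 < ... < w_{n-1} with w_i < 2^i,
    which (per the reduction above) is the number of unlabelled
    ubergraphs on n >= 1 nodes.

    D[i] = number of valid increasing prefixes of length i whose weights
    come from "classes" 0..b, where class 0 = {0} and class c (c >= 1)
    holds the 2^(c-1) values in [2^(c-1), 2^c).  A class-c weight may
    only occupy positions >= c (since w_i < 2^i), so appending i-k
    class-b weights after a length-k prefix requires k >= b."""
    D = [1, 1] + [0] * (n - 1)  # class 0 alone: only the weight 0 fits
    for b in range(1, n):
        size = 1 << (b - 1)  # number of values whose MSB has index b-1
        newD = [0] * (n + 1)
        for i in range(n + 1):
            for k in range(i + 1):
                if i == k or k >= b:
                    newD[i] += D[k] * comb(size, i - k)
        D = newD
    return D[n]
```

For \(n = 1 \dots 5\) this produces 1, 1, 2, 9, 88, matching a direct enumeration of the sequences for small \(n\).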
A Bitwise Convolution Tutorial
Prerequisites
An introduction
It is an important and popular fact that the things we classify as FFT on codedrills are not always actual fast Fourier transforms (if we're being overly technical). Suppose you are given two arrays \(A\) and \(B\) and an operation \(*\). We will call the \(*\) convolution of \(A\) and \(B\) the array \(C\), such that:
$$
C_p = (A*B)_p = \sum_{i*j=p}{A_i B_j}
$$
In the case of actual FFT, \(*\) is addition, but it might as well be something else, like AND, OR, XOR, or something like gcd or lcm, which require techniques far less advanced than the ones presented in this article. The article intends to introduce the (*sighs* rare) reader to techniques that allow us to efficiently compute such convolutions by linear-algebraic means.
I will also present the solutions to some problems that involve these techniques such as Random Nim Generator.
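The definition above can be prototyped directly. Here is a tiny brute-force sketch of mine (quadratic time, illustrative only) that works for any binary operation `op` mapping index pairs back into range:

```python
def convolve(op, A, B):
    """C_p = sum of A_i * B_j over all pairs (i, j) with op(i, j) == p.
    Assumes op(i, j) is always a valid index, e.g. bitwise ops on
    power-of-two-length arrays."""
    C = [0] * len(A)
    for i, a in enumerate(A):
        for j, b in enumerate(B):
            C[op(i, j)] += a * b
    return C
```

For example, `convolve(lambda i, j: i | j, A, B)` is the OR convolution; since every index pair contributes exactly once, `sum(C)` always equals `sum(A) * sum(B)`, which is a handy sanity check for the fast algorithms later.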
Throwback to FFT
As you should know, FFT is basically a divide-and-conquer algorithm designed to reduce the complexity of the multiplication of a vector with a specific Vandermonde matrix: $$ V_n = \begin{bmatrix} \omega^{0 \cdot 0} & \omega^{0 \cdot 1} & \omega^{0 \cdot 2} & \dots & \omega^{0 \cdot (n-1)} \\ \omega^{1 \cdot 0} & \omega^{1 \cdot 1} & \omega^{1 \cdot 2} & \dots & \omega^{1 \cdot (n-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \omega^{(n-1) \cdot 0} & \omega^{(n-1) \cdot 1} & \omega^{(n-1) \cdot 2} & \dots & \omega^{(n-1) \cdot (n-1)} \end{bmatrix} \large{ , \omega=e^{ \frac{2\pi i}{n} } } $$ Let's recap the FFT process a bit:
- We define the product operation \(*\) on two vectors as follows: given two vectors \(A\) and \(B\), \(A*B=C\), where \(C=A+B-1\) and $$ C_p = \sum_{k=0}^{p}{A_k B_{p-k}} $$ Note: despite the horrific notation collision, in the remainder of the article \(V\) will denote the number of elements in the corresponding array of vector \(V\).
- Using the definition directly, the operation takes \(O(A \cdot B)\) time, so we resize our vectors to the smallest power of two larger than \(A+B\) (let that power of two be \(n=2^\text{some exponent}\)) and multiply each of these vectors with \(V_n\) to obtain \(A'\) and \(B'\). We define \(C'=A' \odot B'\), where \(\odot\) denotes element-by-element multiplication (e.g. \( (A \odot B)_i = A_i \cdot B_i \) ), and we multiply \(C'\) with \(V_n^{-1}\) to obtain the desired \(C\). In short: $$ C=V_n^{-1}((V_n A) \odot (V_n B)) $$ where \(A\) and \(B\) have already been resized to \(n\). This works because our \(*\) operation is equivalent to polynomial multiplication, and multiplying a vector \(P=[P_0, P_1, \dots, P_{n-1}]\) with \(V_n\) essentially evaluates a polynomial at the roots of unity: $$ (V_nP)_t = \sum_{k=0}^{n-1}{P_k \omega^{tk}} $$ So \( ((V_n A) \odot (V_n B))_k \) is equal to the product of the corresponding polynomials of \(A\) and \(B\) evaluated at \(\omega^k\), and multiplying the \(((V_n A) \odot (V_n B))\) vector with \(V_n^{-1}\) basically interpolates the polynomial from the values evaluated at all these roots of unity.

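The \(C=V_n^{-1}((V_n A) \odot (V_n B))\) identity can be checked numerically with a naive \(O(n^2)\) DFT standing in for the fast algorithm; this sketch of mine verifies the identity itself, not the fast divide and conquer:

```python
import cmath

def dft(v, sign):
    """Multiply v by the Vandermonde matrix over the n-th roots of unity
    (sign=+1), or by n times its inverse (sign=-1)."""
    n = len(v)
    return [sum(v[k] * cmath.exp(sign * 2j * cmath.pi * t * k / n)
                for k in range(n))
            for t in range(n)]

def poly_multiply(A, B):
    """Integer polynomial product via evaluate / pointwise multiply / interpolate."""
    n = 1
    while n < len(A) + len(B) - 1:
        n *= 2
    fa = dft(A + [0] * (n - len(A)), +1)
    fb = dft(B + [0] * (n - len(B)), +1)
    # pointwise product, then inverse transform (the 1/n factor inverts dft)
    C = dft([a * b for a, b in zip(fa, fb)], -1)
    return [round((c / n).real) for c in C[:len(A) + len(B) - 1]]
```

For instance, `poly_multiply([1, 2], [3, 4])` computes \((1+2x)(3+4x) = 3+10x+8x^2\) as `[3, 10, 8]`.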
Now let's see how we multiply such a vector with \(V_n\) fast, or, because FFT is among the prerequisites, why the algorithm works from a linear-algebraic point of view.
int rev(int x, int l) { // mirror the bits (e.g. 01011 -> 11010)
    int res = 0;
    for (int bit = 0; bit < l; ++bit) {
        res = res * 2 + (x & 1);
        x /= 2;
    }
    return res;
}
void fft(vector<Complex> &pol, int inverse = 1) { // pass inverse = -1 for the inverse transform
    int n = size(pol);
    for (int i = 0; i < n; ++i)
        if (i < rev(i, log2[n]))
            swap(pol[i], pol[rev(i, log2[n])]);
    for (int step = 1; step < n; step *= 2) {
        Complex root;
        root.r = cos(inverse * PI / step); // real part
        root.i = sin(inverse * PI / step); // imaginary part
        for (int pos = 0; pos < n; pos += step * 2) {
            Complex omega = 1;
            for (int i = 0; i < step; ++i) {
                Complex x = pol[pos + i];
                Complex y = pol[pos + i + step] * omega;
                omega *= root;
                pol[pos + i] = x + y;
                pol[pos + i + step] = x - y;
            }
        }
        if (inverse == -1) // halving at every level divides by n overall
            for (int i = 0; i < n; ++i)
                pol[i] *= 0.5;
    }
}

Basically, after applying a permutation to the initial vector, at each level (each value of step) it divides the sequence into buckets of length step*2 and multiplies the vector \(S\) corresponding to each bucket with a matrix \(M_\text{step}\) that looks like this:
$$
M_\text{step} =
\begin{bmatrix}
1 & 0 & 0 & 0 & \dots & 0 & \omega^{0} & 0 & 0 & 0 & \dots & 0 \\
0 & 1 & 0 & 0 & \dots & 0 & 0 & \omega^{1} & 0 & 0 & \dots & 0 \\
0 & 0 & 1 & 0 & \dots & 0 & 0 & 0 & \omega^{2} & 0 & \dots & 0 \\
0 & 0 & 0 & 1 & \dots & 0 & 0 & 0 & 0 &\omega^{3} & \dots & 0 \\
\vdots & \vdots & \vdots & \vdots &\ddots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & 0 & \dots & 1 & 0 & 0 & 0 & 0 & \dots & \omega^{\text{step} - 1} \\
1 & 0 & 0 & 0 & \dots & 0 & -\omega^{0} & 0 & 0 & 0 & \dots & 0 \\
0 & 1 & 0 & 0 & \dots & 0 & 0 & -\omega^{1} & 0 & 0 & \dots & 0 \\
0 & 0 & 1 & 0 & \dots & 0 & 0 & 0 & -\omega^{2} & 0 & \dots & 0 \\
0 & 0 & 0 & 1 & \dots & 0 & 0 & 0 & 0 & -\omega^{3} & \dots & 0 \\
\vdots & \vdots & \vdots & \vdots &\ddots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & 0 & \dots & 1 & 0 & 0 & 0 & 0 & \dots & -\omega^{\text{step} - 1} \\
\end{bmatrix}
\large{
, \omega=e^{\frac{2\pi i}{2 \cdot \text{step}}} }
$$
Multiplying \(M_\text{step}\) with \(S\) takes \(O(\text{step})\) time, because the matrix is sparse. This multiplication is found in the
for (int pos = 0; ...)
loop. Now, as you might expect, FFT basically decomposes \(V_n\) into a product of these matrices and a permutation matrix. So $$ V_n = (I_2^{\otimes (\log{n}-1)} \otimes M_1) (I_2^{\otimes (\log{n}-2)} \otimes M_2) \dots (I_2^{\otimes 1} \otimes M_\frac{n}{4}) M_\frac{n}{2} (\text{"The permutation matrix"}) $$ where '\( \otimes \)' denotes the Kronecker product, a key operation throughout the whole of this article; you must know at least its definition to grasp the remainder of this article. In this particular case, you may interpret it as saying that \(M_\text{step}\) is applied to all the buckets (i.e. the Kronecker product distributes the \(M_\text{step}\) multiplication over each of the buckets). The proof that this decomposition equals \(V_n\) is based on induction and quite beyond the scope of this article.
The Bitwise Convolutions Matrices
Now that we're done with the short "classical FFT" recap, we may dive into XOR, AND and OR convolutions, and there is good news! They are much easier to understand, don't require complex numbers or roots of unity in some field, run much faster and are shorter to code. All that matrix-decomposition math was there to point out that most convolution algorithms (though not the gcd convolution, for example) are based on such matrix decompositions, and to present a familiar example, even though it is one of the most complex ones from this perspective.
Remember the \(C=V_n^{-1}((V_n A) \odot (V_n B))\) relation from the beginning of the article? Let's change \(V_n\) to \(T_n\) (for "transformation") to avoid confusion. Now the meaning of \(C=T_n^{-1}((T_n A) \odot (T_n B))\) is that we apply some transformation to each of the vectors, obtain another vector through element-by-element multiplication ('\(\odot\)') and then reverse the transformation in order to get the result of the convolution. Looking at this equation, \(T_n\) (the transformation) is the unknown. In the case of "classical FFT", the Vandermonde matrix over the roots of unity worked, but now we must find another that yields, for example, the XOR convolution. Let's see how we find the values of \(T_n\) in that case!
We will use \( \oplus \) as the operator symbol of XOR, as it is shorter and looks fancier and \($\) as the XOR convolution symbol. Also, \(n\) will be a power of two to ensure that everything is contained.
Let's see how this convolution would work for vectors of size 2.
By definition:
$$
(A$B)_p=\sum_{k=0}^{n  1}{A_k B_{k \oplus p}}
$$
Additionally:
$$
A$B=T_n^{-1}((T_nA) \odot \ (T_nB))
$$
And we'll set
$$
A=
\begin{bmatrix}
a_0 \\
a_1 \\
\end{bmatrix},
B=
\begin{bmatrix}
b_0 \\
b_1 \\
\end{bmatrix},
T_2=
\begin{bmatrix}
w & x \\
y & z \\
\end{bmatrix}
\\ \text{so} \\
A$B=
\begin{bmatrix}
a_0b_0 + a_1b_1 \\
a_0b_1 + a_1b_0 \\
\end{bmatrix}
$$
So now we can start solving for \(T_2\).
$$
A$B = T_n^{-1}((T_nA) \odot \ (T_nB))
\implies
T_n (A$B) = (T_nA) \odot \ (T_nB)
\\ \implies \\
\begin{bmatrix}
w & x \\
y & z \\
\end{bmatrix}
\begin{bmatrix}
a_0b_0 + a_1b_1 \\
a_0b_1 + a_1b_0 \\
\end{bmatrix}
=
\begin{bmatrix}
w & x \\
y & z \\
\end{bmatrix}
\begin{bmatrix}
a_0 \\
a_1 \\
\end{bmatrix}
\odot
\begin{bmatrix}
w & x \\
y & z \\
\end{bmatrix}
\begin{bmatrix}
b_0 \\
b_1 \\
\end{bmatrix}
\\ \implies \\
\begin{bmatrix}
w (a_0b_0 + a_1b_1) + x (a_0b_1 + a_1b_0) \\
y (a_0b_0 + a_1b_1) + z (a_0b_1 + a_1b_0) \\
\end{bmatrix}
=
\begin{bmatrix}
w a_0 + x a_1 \\
y a_0 + z a_1 \\
\end{bmatrix}
\odot
\begin{bmatrix}
w b_0 + x b_1 \\
y b_0 + z b_1 \\
\end{bmatrix}
\\ \implies \\
\begin{bmatrix}
w (a_0b_0 + a_1b_1) + x (a_0b_1 + a_1b_0) \\
y (a_0b_0 + a_1b_1) + z (a_0b_1 + a_1b_0) \\
\end{bmatrix}
=
\begin{bmatrix}
(w a_0 + x a_1) (w b_0 + x b_1) \\
(y a_0 + z a_1) (y b_0 + z b_1) \\
\end{bmatrix}
\\ \implies \\
\begin{bmatrix}
w (a_0b_0 + a_1b_1) + x (a_0b_1 + a_1b_0) \\
y (a_0b_0 + a_1b_1) + z (a_0b_1 + a_1b_0) \\
\end{bmatrix}
=
\begin{bmatrix}
w^2 a_0b_0 + x^2 a_1b_1 + wx (a_0b_1 + a_1b_0) \\
y^2 a_0b_0 + z^2 a_1b_1 + yz (a_0b_1 + a_1b_0) \\
\end{bmatrix}
$$
Now, because \(a_0, a_1, b_0, b_1\) are variables, this gives us a system of equations:
$$
\begin{cases}
w^2=w,
x^2=w,
x=wx \\
y^2=y,
z^2=y,
z=yz \\
\end{cases}
$$
Which has the following solutions:
$$
(w,x) \in \{(0, 0), (1, 1), (1, -1) \} \\
\land \\
(y,z) \in \{(0, 0), (1, 1), (1, -1)\}
$$
But there is a catch: \(T_2\) must be invertible, that is, \(\det{T_2} \neq 0\), which leaves us with these solutions:
$$
T_2 \in \left \{
\begin{bmatrix}
1 & 1 \\
1 & -1 \\
\end{bmatrix},
\begin{bmatrix}
1 & -1 \\
1 & 1 \\
\end{bmatrix}
\right \}
$$
And indeed, these are the matrices whose corresponding transformation gives us the XOR convolution for arrays of length \(n=2\). So we'd better figure out a way to extend this to larger powers of two, say \(n=2^k\). Then the transformation matrix is \(T_n=T_2^{\otimes k}\), which in our case is the Hadamard matrix. The proof of this, again, is based on induction (and is actually not that hard) and will appear in case of popular demand. But the main principle behind it is that for XOR, OR and AND the bits are independent, and the Kronecker product has a distributive multiplication effect, as in the case of "classical FFT", on the sequence partitioned into buckets of size 2, then 4, then 8, etc.
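We can verify the claim \(T_n=T_2^{\otimes k}\) numerically for small sizes. The sketch below (my own check, not the fast algorithm) builds the Kronecker power of \(T_2 = \begin{bmatrix}1&1\\1&-1\end{bmatrix}\) explicitly and compares \(T_n^{-1}((T_n A)\odot(T_n B))\) against the definition of the XOR convolution; for this particular \(T_2\), the Kronecker powers are symmetric and satisfy \(T_n^{-1} = T_n / n\):

```python
def kron(M, N):
    """Kronecker product of two square matrices given as lists of lists."""
    return [[M[i][j] * N[k][l] for j in range(len(M)) for l in range(len(N))]
            for i in range(len(M)) for k in range(len(N))]

def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def xor_convolution(A, B):
    """C_p = sum of A_i * B_j over all pairs with i XOR j == p,
    computed as T^{-1}((T A) element-wise-times (T B))."""
    n = len(A)
    T = [[1, 1], [1, -1]]
    while len(T) < n:
        T = kron([[1, 1], [1, -1]], T)
    fa, fb = mat_vec(T, A), mat_vec(T, B)
    # T^{-1} = T / n here, so apply T once more and divide (division is exact)
    return [c // n for c in mat_vec(T, [a * b for a, b in zip(fa, fb)])]
```

For `A = [1, 2, 3, 4]` and `B = [5, 6, 7, 8]` this matches the brute-force definition of the XOR convolution entry by entry.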
There's just one more thing to point out about the Kronecker product:
$$
(M^{\otimes n})^{-1} = (M^{-1})^{\otimes n}
$$
Let's see how we efficiently multiply these Kronecker products with vectors if
$$
T_2=
\begin{bmatrix}
w & x \\
y & z \\
\end{bmatrix}
\implies
T_2^{-1}=
\frac{1}{wz-xy}
\begin{bmatrix}
z & -x \\
-y & w \\
\end{bmatrix}
$$
void transform(vector<double> &pol) {
    int n = size(pol);
    for (int step = 1; step < n; step *= 2) {
        for (int pos = 0; pos < n; pos += step * 2) {
            for (int i = 0; i < step; ++i) {
                // replace the pair pol[pos + i], pol[pos + i + step] with its product with T_2
                double a = pol[pos + i];
                double b = pol[pos + i + step];
                pol[pos + i] = w * a + x * b;
                pol[pos + i + step] = y * a + z * b;
            }
        }
    }
}
void inverse_transform(vector<double> &pol) {
    const double determinant = w * z - x * y;
    int n = size(pol);
    for (int step = 1; step < n; step *= 2) {
        for (int pos = 0; pos < n; pos += step * 2) {
            for (int i = 0; i < step; ++i) {
                // replace the pair pol[pos + i], pol[pos + i + step] with its product with the inverse of T_2
                double a = pol[pos + i];
                double b = pol[pos + i + step];
                pol[pos + i] = (z * a - x * b) / determinant;
                pol[pos + i + step] = (w * b - y * a) / determinant;
            }
        }
    }
}
And in case you want an XOR convolution, just set \(w, x, y, z\) to the values from the specific XOR transformation matrix.
Some Pen and Paper Exercises
- Find \(T_2\) for the OR and AND transforms.
- Given an operation that "ORs" the first bit, "ANDs" the second bit, "XORs" the third one and then repeats, find the transformation matrix for such a convolution on arrays of \(k\)-bit values (of length \(n=2^k\)).
- Find a transformation matrix where the operation is XOR in base 3 (addition without carry).
Random Nim Generator
"How many sequences of length \(n\) containing numbers from \(0\) to \(k\) have their total XOR sum greater than \(0\)?"
Let's see a dp approach!
$$
\text{dp[a][b]} = \sum_{i=0}^{k}{dp[a-1][i \oplus b]}
$$
where \(\text{a}\) is the number of elements taken so far and \(\text{b}\) is their current XOR sum. The dp is initialized with \(\text{dp[1][i]}=1, \forall i \in [0..k]\), and \(0\) everywhere else.
Let's define \(K=[1, 1, 1, ..., 1, 0, 0, ..., 0]\) (the first \(k+1\) values are \(1\) and the rest are \(0\), the array's length being the smallest power of \(2\) greater than \(k\)). Now we can write the recurrence as
$$
\text{dp[a]} = \text{dp[a-1]}$K \\
\text{but} \\
\text{dp[1]}=K \\
\text{so} \\
\text{dp[a]} = K$K$K$ \dots $K \text{ (a times)}
$$
So you just have to apply the XOR transform to \(K\), raise each transformed value to the \(n\)-th power (which convolves \(K\) with itself \(n\) times), apply the inverse transform, and output the sum of the values at the nonzero positions of the resulting vector.
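The steps above can be rendered compactly in Python; this is an illustrative port of mine (using the integer Walsh–Hadamard / XOR transform), not the linked C++ source:

```python
def wht(v, invert=False):
    """In-place XOR (Walsh-Hadamard) transform, T_2 = [[1, 1], [1, -1]]."""
    n = len(v)
    step = 1
    while step < n:
        for pos in range(0, n, step * 2):
            for i in range(pos, pos + step):
                a, b = v[i], v[i + step]
                v[i], v[i + step] = a + b, a - b
        step *= 2
    if invert:
        for i in range(n):
            v[i] //= n  # exact here: the forward pass scales everything by n

def nim_sequences(n, k):
    """Number of length-n sequences over {0..k} whose XOR sum is > 0."""
    size = 1
    while size <= k:
        size *= 2
    K = [1] * (k + 1) + [0] * (size - k - 1)
    wht(K)
    K = [t ** n for t in K]  # pointwise power = convolving K with itself n times
    wht(K, invert=True)
    return sum(K[1:])  # every position except XOR sum 0
```

For example, with \(n=3, k=2\) there are \(3^3 - 7 = 20\) sequences whose XOR sum is positive.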
Here is my source code.
AND closure
Just apply a "self AND-convolution" to the frequency array \(\log\) times and output the positions with nonzero corresponding values.
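A sketch of this idea in Python: the AND transform is the superset-sum (zeta) transform, and the clipping of values to 0/1 after each squaring is my own addition to keep the numbers small, since only non-zeroness matters here:

```python
def and_transform(v, invert=False):
    """Superset-sum (zeta) transform: v'[s] = sum of v[t] over t a bitwise
    superset of s; with invert=True, the inverse (Moebius) transform."""
    n = len(v)
    step = 1
    while step < n:
        for pos in range(0, n, step * 2):
            for i in range(pos, pos + step):
                if invert:
                    v[i] -= v[i + step]
                else:
                    v[i] += v[i + step]
        step *= 2

def and_closure(values, bits):
    """All values obtainable as the AND of a non-empty subset of `values`
    (each value assumed to fit in `bits` bits)."""
    freq = [0] * (1 << bits)
    for x in values:
        freq[x] = 1
    for _ in range(bits):  # enough rounds to cover subsets of any size
        and_transform(freq)
        freq = [t * t for t in freq]  # self AND-convolution (pointwise square)
        and_transform(freq, invert=True)
        freq = [1 if t else 0 for t in freq]  # clip: only non-zeroness matters
    return [i for i, t in enumerate(freq) if t]
```

For instance, the closure of \(\{5, 6\}\) under AND is \(\{4, 5, 6\}\), since \(5 \text{ AND } 6 = 4\).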
Here is my source code.