"The Trick from Aliens"

Prerequisites

Abstract

The scope of this article is to present a very useful DP optimization technique, introduced in the problem Aliens at IOI 2016. The technique is used to reduce dimensions in particular DP configurations by exploiting the convex nature of some cost functions. We will introduce the technique by starting with a simpler DP problem, show the optimization from \( O(N^2) \) to \(O(N \log \text{VAL})\), then reveal the full solution of the original problem. Apparently, the official name of this optimization technique is "parameter search" and the Chinese call it "wqs binary search".

A problem example

You are given an array \(v\) of integers (possibly negative) of length \(N\) (\(\le 10^5\)) and a number \(K\) (\(\le N\)). Select at most \(K\) disjoint subarrays of the initial sequence such that the sum of the elements included in the subarrays is maximized.
The standard approach to such a problem would be a DP of the form:
$$ \text{dp[n][k]=["the solution for the prefix of length n of the given array, with k subarrays taken"]} \\ \text{dp[n][k]}= \max \{ \text{dp[n-1][k], } \max^{n}_{i=k} \{ \text{dp[i-1][k-1]} + (\Sigma^{n}_{j=i} v_j) \} \} \\ $$ Implementing this recurrence directly would be \( O(N^4) \), supposing that \(K\) is comparable in size to \(N\). It is left as an exercise to the reader (preferably one at least capable of solving a div2 C problem) to find a way to optimize this recurrence to \(O(N^2)\).
The trick behind the "aliens optimization" is that we can add a cost (penalty), which we will denote by \(\lambda\), for each taken subarray. If \(\lambda=0\) the solution would be taking a subarray for each positive element, but by increasing the value of \(\lambda\), the optimum solution shifts to taking fewer subarrays. Now we just have to find a \(\lambda\) that allows us to take as many subarrays as possible, but still at most \(K\).
To do a small recap, \(\lambda\) is the cost we assign to adding a new subarray, and increasing \(\lambda\) will decrease the number of subarrays in an optimal solution or keep it the same, but never increase it. That suggests that we could just binary search the smallest value of \(\lambda\) that yields an optimal solution with at most \(K\) subarrays.
$$ \text{dp}_{\lambda} \text{[n]=["The solution for the prefix of length n of our initial array v, where adding a subarray comes with cost }\lambda \text{"]} \\ \text{dp}_{\lambda} \text{[n]} = \max \{ \text{dp}_{\lambda}\text{[n-1], } \max^{n}_{i=1} \{ (\Sigma^{n}_{j=i} v_j) + \text{dp}_{\lambda}\text{[i - 1]} - \lambda \} \} $$ Besides the dp itself, we will store another auxiliary array: $$ \text{cnt}_{\lambda} \text{[n]=["How many subarrays does dp}_{\lambda} \text{[n] employ in its solution"]} $$ These recurrences are implementable in \(O(N)\), but I won't go into detail, as the target audience of this article is supposed to be able to solve this on their own.
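Still, to make the \(O(N)\) part concrete, here is one way that evaluation might look for a fixed \(\lambda\) (a rough sketch in Python; the function and variable names are my own, and ties are resolved toward opening fewer subarrays):

def solve_for_lambda(v, lam):
    # returns (dp, cnt): the best penalized value for the whole array
    # and the number of subarrays used by (one) optimal solution
    n = len(v)
    dp = [0] * (n + 1)     # dp[i]  = best value for the prefix of length i
    cnt = [0] * (n + 1)    # cnt[i] = subarrays used by dp[i]
    pre = 0                # prefix sum v[0] + ... + v[i - 1]
    best, best_cnt = 0, 0  # max over j < i of dp[j] - pre[j], and its cnt
    for i in range(1, n + 1):
        pre += v[i - 1]
        # option 1: position i does not end a subarray
        dp[i], cnt[i] = dp[i - 1], cnt[i - 1]
        # option 2: a subarray ends at i and starts right after the best j
        cand = best + pre - lam
        if cand > dp[i]:
            dp[i], cnt[i] = cand, best_cnt + 1
        # state i becomes available as a future starting point
        if dp[i] - pre > best:
            best, best_cnt = dp[i] - pre, cnt[i]
    return dp[n], cnt[n]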
The pseudocode of the solution would look like this:

minbound = 0, maxbound = 1e18
while maxbound - minbound > 1e-6:
    λ = (minbound + maxbound) / 2
    #compute dp and cnt for λ
    if cnt[n] <= k:
        maxbound = λ #λ is large enough, try a smaller one
    else:
        minbound = λ #too many subarrays taken, λ must grow
λ = maxbound
#compute dp and cnt for the final λ
return dp[n] + cnt[n] * λ #note that if there are fewer than k positive values, then cnt[n] < k

Proof and Formal Requirements

In the case of our initial problem, the fact that increasing \(\lambda\) never increases the number of subarrays taken was probably very intuitive, but we'd like an actual proof that this works, and a general criterion for using this peak-setting optimization to reduce DP dimensions. That criterion is, in a way, concavity (or convexity). Let's denote by \(\text{ans[k]}\) the answer for the problem when using exactly \(k\) subarrays. The key observation in proving that our solving method is correct is that the \(\text{ans[k]}\) sequence is concave, that is, \(\text{ans[k]} - \text{ans[k - 1]} \le \text{ans[k - 1]} - \text{ans[k - 2]}\). A more natural way of thinking about this, and the way most people "feel" the concavity/convexity, is this: an extra subarray helps me more when I currently have \(k\) of them than when I have \(k+1\). Now let's see how this concavity helps us prove the correctness of our solution!
Suppose \(\lambda=0\). Our solution will just find the global maximum of our concave sequence, say \(\text{ans[p]}\). Notice that no matter the value of \(\lambda\), the fact that our sequence is concave won't change. Let's shift our attention for a bit from concave sequences to concave functions. \(f(x)=\lambda x-x^2\) is a fine example. By changing \(\lambda\), we can move the peak of the function wherever we want and the function will remain concave. Now let's go back to our more "discrete" sequence. We have an algorithm that finds \(p\) and \(\text{ans[p]}\) such that \(\text{ans[p]}\) is the maximum of the sequence, but we don't want the maximum of the sequence, we want \(\text{ans[k]}\) for some given \(k\). So... we can force \(k\) to be the maximum of the sequence by adding a linear function to our sequence (\(\text{ans[k]} \rightarrow \text{ans[k]}+\lambda k \), with \(\lambda\) possibly negative), just as we changed the peak of our continuous function, which is exactly what we did in our solution (there, the per-subarray penalty plays the role of \(-\lambda\)). The algorithm will yield that the maximum of the sequence is at \(k\) with the value \(\text{ans[k]}+\lambda k\) and we just need to subtract \(\lambda k\) to obtain our desired value: \(\text{ans[k]}\).
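To make the peak-shifting concrete, here is a tiny worked example with numbers of my own choosing. Take the concave sequence $$ \text{ans} = (0, 10, 14, 16, 17), \qquad \text{differences } 10, 4, 2, 1, $$ whose global maximum sits at \(k=4\). To force the peak to \(k=2\) we need \(4+\lambda \ge 0 \ge 2+\lambda\), so any \(\lambda \in [-4, -2]\) works; taking \(\lambda=-3\) gives $$ \text{ans[k]}+\lambda k = (0, 7, 8, 7, 5), $$ whose maximum is indeed at \(k=2\) with value \(8\), and \(8 - \lambda \cdot 2 = 8 + 6 = 14 = \text{ans[2]}\).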
As for the general criterion, you might have already guessed it: if \((\text{ans[k]})_{1 \le k \le n}\) is the sequence of answers for given \(k\)s, the sequence must be convex or concave, that is: $$ \forall i \in (1..n), \text{ans[i]} - \text{ans[i-1]} \le \text{ans[i+1]} - \text{ans[i]} \\ \text{or} \\ \forall i \in (1..n), \text{ans[i]} - \text{ans[i-1]} \ge \text{ans[i+1]} - \text{ans[i]} \\ $$

Reconstruction Issues

Let's get back to our initial problem (there are \(N\) integers, you have to choose \(K\) subarrays, yada yada, blah blah) and let's change the statement: instead of selecting at most \(K\) subarrays, you have to select exactly \(K\). The difference is quite subtle, and the actual result is different iff there are fewer than \(K\) non-negative integers in the sequence (note that to force more subarrays than the unconstrained optimum would take, \(\lambda\) must also be allowed to go negative, so the lower bound of the binary search should be a large negative value instead of \(0\)). In this case, we just have to replace return dp[n] + λ * cnt[n] with return dp[n] + λ * k. This may seem weird and quite unintuitive as to why it works, so let's look at a few properties of our algorithm.
First of all, it may not be the case that every \(k\) has a corresponding set of \(\lambda\)s. That is, if \(\lambda_0\) is the smallest \(\lambda\) for which taking \(p\) subarrays is optimal, then the solution for \(\lambda_0 - \varepsilon\), where \(\varepsilon\) is an arbitrarily small positive value, may use more than \(p + 1\) subarrays, i.e. there may be some number of subarrays that no choice of \(\lambda\) makes optimal. This may seem like a deal-breaker for our technique, but let's look at the cause of this issue. Looking back at the Proof paragraphs, we are given the condition: $$ \forall i \in (1..n), \text{ans[i]} - \text{ans[i-1]} \le \text{ans[i+1]} - \text{ans[i]} $$ In case of equality here, we may have the following situation: $$ \text{ans[i + 1]} = \text{ans[i]} + t \\ \text{ans[i + 2]} = \text{ans[i]} + 2t \\ \dots $$ If the \(\lambda\) we choose equals \(t\), then all of these solutions will look equally good. In fact, if a subarray of answers \(\text{ans[a]}, \text{ans[a + 1]}, \dots, \text{ans[b]}\) forms an arithmetic progression, there is no choice of \(\lambda\) for which the optimum uses a number of subarrays strictly between \(a\) and \(b\): the algorithm will only ever report \(a\) or \(b\).
However, the fact that our solution fails on arithmetic progressions inside our sequence (i.e. when the sequence is not strictly convex/concave) is the very thing that will help us solve this issue. Suppose we find the smallest \(\lambda\) that makes the solution employ \(\le k\) subarrays, and say it uses \(p\) of them. This means \(\text{ans[p]} = \text{dp[n]} + \lambda p\), and since this \(\lambda\) sits exactly on the boundary, the answers between \(p\) and \(k\) (the fixed number of subarrays we want to use) form an arithmetic progression with common difference \(\lambda\) (i.e. \(\text{ans[k]} = \text{ans[p]} + \lambda(k - p)\)). So if the answer for \(p\) is \(\text{dp[n]}+\lambda p\), then the answer for \(k\) must be \(\text{dp[n]}+\lambda p+(k-p)\lambda=\text{dp[n]}+\lambda k \). This is quite weird: by finding a solution for \(p\), we also find the answer for \(k\), even though \((\text{cnt[n]}=p) \neq k\). The downside of this workaround is that even though we can find the value of the answer, a general method of reconstructing the solution (finding out which subarrays we should select) probably doesn't exist.
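Here is a small example of this tie situation, again with numbers of my own choosing. Suppose $$ \text{ans} = (0, 5, 10, 15, 16), \qquad \text{differences } 5, 5, 5, 1, $$ and we want exactly \(k=2\) subarrays. For any \(\lambda\) slightly below \(5\), the penalized optimum \(\max_p \{\text{ans[p]} - \lambda p\}\) uses \(3\) subarrays, and at \(\lambda = 5\) the counts \(0, 1, 2, 3\) are all tied with value \(\text{dp[n]} = 0\); depending on tie-breaking, the algorithm may report \(\text{cnt[n]} = 0\) or \(3\), but never \(2\). Still, \(\text{dp[n]} + \lambda k = 0 + 5 \cdot 2 = 10 = \text{ans[2]}\), exactly as the argument above promises.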

Integral Lambda Search

You might have noticed that we are binary searching a floating-point \(\lambda\), not an integer-valued one. The reason is that if the prerequisites for applying the optimization are satisfied, we have only proved that some \(\lambda\) exists, not that it has an integral value. The thing is, in most DP problems the optimization works just as well with integers; it's just not that obvious to prove why.
Let's consider a convex sequence of \(N\) elements, call it \(v_{1 \dots N}\). Now let's consider the set of points \( P = \{ p_i=(i, v_i) \mid i \in [1 \dots N] \} \); once drawn, together with the line segments between consecutive points (which will bear great importance in the following steps), you will see a convex lower/upper hull. A key observation now is that when looking at our convex sequence geometrically, the "peak" of the sequence will be the unique point that has segments with different signs of the slope to its left and to its right (with the exception of the edge cases where the optimum is the first or last element of the sequence). In our drawn example, that peak is the 6th point, with a segment with negative slope on its left and positive slope on its right. Another useful observation is that if we have two lines \(a_1 x + b_1\) and \(a_2 x + b_2\), adding a constant value \(\lambda\) to both their slopes doesn't change the x coordinate of their intersection: $$ \frac{b_2 - b_1}{(a_1 + \lambda) - (a_2 + \lambda)} = \frac{b_2 - b_1}{a_1 - a_2} $$ In our context, this means that if we add a constant value to all the slopes of the segments, the x coordinates of the intersection points will remain the same, so the peak's x coordinate will still be integral. Now all that is left to do is "say" this: if we want to force a position \(t\) to be the global optimum of this sequence, the slope of the segment to the left of the point is \(l\), the slope of the segment to its right is \(r\), and \( \{l, r\} \subset \mathbb{Z} \), then there exists at least one value \(\lambda \in \mathbb{Z} \) such that \(l + \lambda\) and \(r + \lambda\) have different signs. Translating this directly into terms of our convex sequence of answers, where our "slopes" are just the differences between two adjacent answers (i.e. \(\text{slope[k]} = \text{ans[k + 1]} - \text{ans[k]}\)): if the values of \(\text{ans}\) are integral, then obviously the differences (slopes) will also be integral, so if the answers to our problem are integral, we can always binary search \(\lambda\) as an integer.
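Assuming integral answers as argued above, the whole search can therefore be done over integer \(\lambda\). Here is a sketch reusing the hypothetical solve_for_lambda from earlier (names and bounds are my own):

def max_sum_at_most_k(v, k):
    # find the smallest integer lambda with cnt(lambda) <= k; cnt never increases with lambda
    lo, hi = 0, sum(abs(x) for x in v) + 1  # at hi no subarray is worth taking, so cnt == 0
    while lo < hi:
        mid = (lo + hi) // 2
        _, cnt = solve_for_lambda(v, mid)
        if cnt <= k:
            hi = mid
        else:
            lo = mid + 1
    dp, _ = solve_for_lambda(v, lo)
    # per the "Reconstruction Issues" section, dp + lo * k recovers ans[k] even when
    # ties make the reported count smaller than k; for lo == 0 it is just the global optimum
    return dp + lo * k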

Aliens

I find it quite amusing that when the IOI introduces some totally new technique for 99% of its contestants (like the convex hull trick with the problem Batch Scheduling at IOI 2002), the technique is usually merely a subproblem of a task that is quite difficult on its own, even without the fact that the contestant is required to rediscover some then-obscure technique. So is the case with Aliens (IOI 2016): even if you know the optimization, it's still quite a tricky convex hull trick problem. The solution goes something like this:
First of all, notice that if a point lies below the main diagonal, we can just replace it with its transpose (i.e. \((x, y) \rightarrow (y, x)\)), as any photo that captures \((x, y)\) will also capture \((y, x)\). After this transformation, notice that we might be left with a lot of useless points which, if removed, will not change our answer. That is because if we have two points \(p_1 = (x_1, y_1)\) and \(p_2 = (x_2, y_2)\) such that \(x_1 \le x_2\) and \(y_1 \ge y_2\), any photo that captures \(p_2\) will also capture \(p_1\), but not all photos capturing \(p_1\) will capture \(p_2\); given that we must capture all points (so \(p_2\) must be captured), we might as well remove \(p_1\), because it would have been captured by the photo containing \(p_2\) anyway. We can remove these useless points using a stack, and we'll be left with a sequence of points \((x_1, y_1), (x_2, y_2), \dots, (x_b, y_b) \) that, if sorted in increasing order by the x coordinate, is also sorted in decreasing order by the y coordinate. From now on, we will consider the sequence of points in this order.
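A possible implementation of this preprocessing step (a sketch with my own conventions: points are given as \((x, y)\) pairs and normalized so that \(y \le x\)):

def normalize(points):
    # 1) reflect points from below the diagonal so that y <= x for every point
    pts = [(max(x, y), min(x, y)) for x, y in points]
    # 2) drop dominated points: (x1, y1) is useless if some (x2, y2)
    #    has x1 <= x2 and y1 >= y2 (any photo covering p2 also covers p1)
    pts.sort(key=lambda p: (p[0], -p[1]))  # x increasing; for equal x, y decreasing
    stack = []
    for x, y in pts:
        # pop earlier points whose y is >= ours: they are dominated by (x, y)
        while stack and stack[-1][1] >= y:
            stack.pop()
        stack.append((x, y))
    return stack  # x strictly increasing, y strictly decreasing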
Let's define \(\text{dp[k][n]=["the answer for the first n points, using only k photos"]}\). Given the title of the article and the many paragraphs above, I don't feel the need to explain that we can get rid of the first dimension of the dp, or how.
So we're left with \(\text{dp}_\lambda\text{[n]=["the answer for the first n points, taking into account the } \lambda \text{ added cost"]}\). The recurrence of this DP should go something like this (with \(\text{dp}_\lambda\text{[0]}=0\)): $$ \text{dp}_\lambda\text{[n]} = \min_{t=0}^{n-1} \{ \text{["minimum area of a square that covers all the points from t+1 to n"]} + \text{dp}_\lambda\text{[t]} \\ - \text{["the area of the intersection between this square and the ones used in dp}_\lambda\text{[t]"]} + \lambda \} $$ And that is: $$ \text{dp}_\lambda\text{[n]} = \min_{t=0}^{n-1} {\{ (x_n - y_{t+1} + 1)^2 + \text{dp}_\lambda\text{[t]} - \max ( x_t - y_{t+1} + 1, 0)^2 + \lambda \}} $$ (for \(t=0\) there is no previous square, so the intersection term is \(0\)), which can be computed in linear time using the convex hull trick.
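Leaving the convex hull trick aside, a direct \(O(b^2)\) evaluation of this recurrence for a fixed \(\lambda\) could look like the sketch below (my own names; it also tracks the number of squares, which the outer binary search on \(\lambda\) needs):

def aliens_eval(pts, lam):
    # pts: output of normalize() -- x strictly increasing, y strictly decreasing
    # returns (cost, count): minimum penalized cost and the number of squares used
    b = len(pts)
    INF = float("inf")
    dp = [INF] * (b + 1)
    cnt = [0] * (b + 1)
    dp[0] = 0
    for n in range(1, b + 1):
        for t in range(n):
            # the last square covers points t+1 .. n (1-indexed), i.e. pts[t .. n-1]
            side = pts[n - 1][0] - pts[t][1] + 1
            cost = dp[t] + side * side + lam
            if t > 0:
                # subtract the overlap with the previous square, which ends at x_t
                overlap = max(pts[t - 1][0] - pts[t][1] + 1, 0)
                cost -= overlap * overlap
            if cost < dp[n]:
                dp[n], cnt[n] = cost, cnt[t] + 1
    return dp[b], cnt[b]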

Other problems:


Note: I feel obliged to say that during the writing of this article, BlueDiamond stated that...
ThE bEsT ShUFfLe TeChNiQuE IS MeRgE-sHufFlE .


An Uberproblem

Prerequisites

Abstract

The scope of this article is to present an interesting graph counting problem and its equally interesting solution. You may find the original statement in Romanian here. I personally think that, equipped with this solution, ubergraph is one of the most interesting problems I've encountered in my competitive programming journey.

The Problem Statement

We define an ubergraph as a directed acyclic graph with the property that no two nodes have the same set of out-neighbors. Given a number N (\(\le\) 300), find the number of unlabelled ubergraphs on N nodes.

*For simplicity, let's call the property of our DAG that no two nodes have the same out-neighbors the distinction property.
**We shall also denote the set of nodes as V, the set of directed edges as E, and the set of out-neighbors of node \(x\) as \(g(x)\).
***Because of the lack of automorphisms that we'll see later, an ubergraph is just a made-up term for an asymmetric DAG.

My Thought Process and Solution

As this is a counting problem, we have two possible approaches: a direct combinatorial formula or a DP. The limit, though, seems quite small for any popular combinatorics technique, and this is a strong suggestion that the solution is a DP.
Even if knowledge of such formalities isn't required for solving the problem, we can think about the restriction as a way of saying that the graph has no automorphisms except for the identity function (a function \( f:V \mapsto V, f \text{ bijective, s.t. } \forall (a,b) \in E, (f(a), f(b)) \in E \)).
This means, in simpler terms, that if we decide to label the DAG, there is no way of permuting the labels such that for all pairs of labels \((a, b)\), the nodes with these labels after the permutation is performed will have an edge between them iff they had one in their original placement.
An implication of this is that the number of labelled ubergraphs is equal to the number of unlabelled ubergraphs times N factorial.
These observations seem quite promising in elaborating a solution. With a little experience, they suggest that we can count labelled ubergraphs instead of unlabelled ones if we find a restriction that gives us a unique labelling for an unlabelled one. This allows us to count structures with stronger conditions, which may dramatically reduce the difficulty of the problem. This may sound a little bit counterintuitive but examples of such difficulty reductions are everywhere (e.g. the number of groups with N elements vs the number of abelian groups with N elements; the number of unlabelled trees vs the number of unlabelled complete binary trees etc).
Now we basically have to find a way of uniquely labelling a DAG, preferably respecting the topological order (i.e. if there are two nodes labelled \(x, y\) where \(x < y\), there may not be any path from \(x\) to \(y\) in the DAG; yes, I know this feels backwards, but we will append to our dp from the "sink" up), so that we may use a natural DP approach of the form \(\text{dp[#nodes][some additional information]}\) where we could append nodes in order of their label.
Let's look at the ubergraphs with four nodes to get an idea of such a labelling:

Now call me a madman, but I think it's a good idea to assign the node \(x\) with \(g(x)=\emptyset\) the label one, and we will denote this with a function \(\text{label}\), where \(\text{label}(x):=1\). Notice that there may only be one node with out-degree equal to zero (by the distinction property), so all nodes can reach this node. Also, notice that all the ubergraphs in the picture have a node pointing only to the sink node (the one with out-degree zero). The reason behind this is just that there is only one sink and the graph is a DAG: the second-to-last node in any topological order has nowhere to point but the sink. It would also not be an absurd idea to label this node \(x\) with two: \(g(x)=\{1\} \implies \text{label}(x):=2\).
So far, we've managed to find the labels for two nodes, which are imposed anyway by the fact that we want our labels to respect the topological order 🥳 🍾. Now let's find something useful for the general case: suppose we have two nodes \(x\) and \(y\) such that \(x\) is not reachable from \(y\), \(y\) is not reachable from \(x\), and we have found the labels of all their out-neighbors. Which one of \(x\) and \(y\) should get the smaller label? There is obviously no direct answer, as so far the definition of what we are looking for is "something that feels useful". Now is the moment to think of a way of labelling the nodes that may lead to a DP approach. The way I did this was to write down some candidate criteria, try elaborating on each, and stop when I found something promising; the one I've settled on is the following: define the "weight" of a node \(x\) as \(w(x)=\sum_{v \in g(x)} 2^{\text{label}(v)} \) and simply give the smaller label to the node with the smaller weight. Keep this weight function in mind, as it will be a crucial part of our DP.
Let's now wrap it all up and see some pseudocode that will label an unlabelled ubergraph using the criteria we've stated before.
# the procedure is basically Kahn's algorithm ran from the sink "up"
layer = [("the sink node")]
gt = ("the transposed graph")
outdeg = ["the out-degrees of the nodes"]
label = ["empty list of length N"]

label[("the sink node")] = 1
label_index = 2 # at each step we will increment this value to assign new labels

while layer != []: 
    newlayer = [] # we will store here the nodes we have yet to label and have all of their out-neighbors labelled
    for node in layer:
        for v in gt[node]:
            outdeg[v]-= 1
            if outdeg[v] == 0:
                newlayer.append(v)
    newlayer.sort(key=w) # we sort the yet-to-be-labelled nodes by the weight criterion, as they are interchangeable in the topological order
    for node in newlayer: # we finally label the new nodes
        label[node] = label_index
        label_index+= 1
    layer = newlayer # we continue from the freshly labelled nodes
Now, the fact that this procedure works is proof in itself that our labelling method is valid and uniquely determines a labelling for an unlabelled ubergraph (ahem! labelling, labelling labelling labelling!). Now let's look at the properties of this labelling:
1. if we list the weights in label order, \(w_1, w_2, \dots, w_N\) (where \(w_i\) is the weight of the node labelled \(i\)), the sequence is strictly increasing: weights are distinct by the distinction property, and the procedure hands out labels in increasing order of weight;
2. the binary representation of \(w(x)\) tells us exactly which labels form \(g(x)\), so the weights determine the whole graph;
3. every out-neighbor of \(x\) has a label strictly smaller than \(\text{label}(x)\), so \(w(x) < 2^{\text{label}(x)}\).
These seem quite promising. After analyzing them, you may notice the following quite mesmerizing thing: any \(w\) function respecting the first and last properties corresponds to a unique ubergraph, which you can reconstruct using the second property. In other words, the first and last properties are necessary and sufficient conditions, so the number of unlabelled ubergraphs is equal to the number of strictly increasing sequences of length N, for which the element on position \(i\) is included in \([0...2^{i})\).
So we've just reduced our problem to counting these sequences, which seems much easier. I won't go into very much detail on how to efficiently compute this, but the DP goes something like this: $$ \text{dp[length of the sequence][index of the most significant bit of the value of the last element]} \\ \text{dp[0][0]}=\text{dp[1][1]}=1 \\ \text{dp[i][j]}=\sum^{j}_{k=0} \binom{2^{j-1}}{i-k} \text{dp[k][j-1]} $$ *Note that the labels start from zero, not from one as explained before. This may be computed in \(O(N^3)\) by just coding the recurrence above, or in \(O(N^2 \log N)\) by optimizing with FFT, but I have my doubts that this is the best attainable complexity.
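As a quick sanity check of the reduction above, here is a tiny brute force over the sequence characterization (my own sketch; it is exponential and only usable for very small N):

def count_ubergraphs_bruteforce(n):
    # count strictly increasing sequences w[0..n-1] with w[i] in [0, 2^i);
    # by the bijection above, this equals the number of unlabelled ubergraphs on n nodes
    def rec(pos, prev):
        if pos == n:
            return 1
        return sum(rec(pos + 1, val) for val in range(prev + 1, 1 << pos))
    return rec(0, -1)

print([count_ubergraphs_bruteforce(n) for n in range(1, 6)])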