The Two-Towers Problem

By leonardo maffi
V.1.2, August 31 2012
Keywords: programming, Python, Psyco, D language, C, gcc, llvm

[Go back to the article index]

This is a little problem from the ebook "Data Structures in Java, for the Principled Programmer", page 191:
http://www.cs.williams.edu/~bailey/JavaStructures/Book_files/JavaStructures.pdf

A smaller PDF with just the problem:
http://www.cs.williams.edu/~jeannie/cs136/labs/towers.pdf

From page 191, 8.7 Laboratory: The Two-Towers Problem:

"Suppose that we are given n uniquely sized cubic blocks and that each block has a face area between 1 and n. Build two towers by stacking these blocks. How close can we get the heights of the two towers?
The following two towers built by stacking 15 blocks, for example, differ in height by only 129 millionths of an inch (each unit is one-tenth of an inch):"



It looks related to the 0-1 Knapsack problem:
http://en.wikipedia.org/wiki/0-1_Knapsack_problem#0-1_knapsack_problem
But the dynamic programming algorithm requires discrete (and bounded) values, while in the two-towers problem the "weights" are reals (floating point values).

There can be smart algorithms to solve this problem, but all the following code uses a brute-force strategy. The code below shows that even a supposedly simple enumerative algorithm can be implemented in quite different ways. If you know smarter ways to solve this problem please send me an email or leave a comment on LiveJournal (my email address is at the top of the home page).

All the following code can be found here:
http://www.fantascienza.net/leonardo/js/two_towers.zip

With Python a small script is enough to solve the problem (in my first version I created 'sqrs' with xrange(1, n), which is a bug: it misses the largest block). The code is simple, and it uses itertools.combinations (new in Python 2.6) from the standard library, which speeds it up a lot, even compared to many other kinds of code run under Psyco. So the end result isn't slow:
# two_towers1_py.py
from sys import argv
from itertools import combinations

def main(n):
    sqrs = [n ** 0.5 for n in xrange(1, n+1)]
    half = sum(sqrs) / 2
    sol = min((abs(sum(sub)-half), sub)
              for nit in xrange(1, n+1)
              for sub in combinations(sqrs, nit))[1]
    print "err:", abs(sum(sol) - sum(set(sqrs).difference(sol)))
    print "n:", n, [int(round(x * x)) for x in sol]

n = int(argv[1]) if len(argv) == 2 else 10
main(n)

That program (with no arguments) prints:
err: 0.0147326613754
n: 10 [1, 2, 3, 4, 6, 7]

Using my dlibs, that Python code can be translated to similar D code (it's short, but programming in D is less smooth than in Python, so it takes a few tries to get it right):
// two_towers1_d.d
import std.math: sqrt, abs, round;
import std.conv, d.func, d.comb;
import d.sets: set;

void main(string[] argv) {
    int n = argv.length == 2 ? toInt(argv[1]) : 10;
    auto sqrs = map((double d){ return sqrt(d); }, xrange(1, n+1));
    double half = sum(sqrs) / 2;
    auto sol = min(xsubsets(sqrs), (double[] s){ return abs(sum(s)-half); });
    putr("err: ", abs(sum(sol) - sum(set(sqrs).difference(sol))));
    putr("n: ", n, " ", map((double x) { return cast(int)round(x * x); }, sol) );
}

This D code is actually closer to my first Python version (not shown here). I am using the DMD compiler V.1.042.
That prints:
err: 0.0147326
n: 10 [1, 4, 5, 8, 10]

You can find xsubsets() and the other things inside my dlibs:
http://www.fantascienza.net/leonardo/so/libs_d.zip

A few timings (on a Core 2 at 2 GHz, WinXP):
two_towers1_py.py 20
err: 0.0000716919350303
n: 20 [2, 5, 6, 7, 8, 11, 13, 14, 17, 20]
1.34 seconds

two_towers1_py.py 25
err: 3.47015006952e-06
n: 25 [1, 2, 4, 5, 6, 9, 11, 16, 17, 21, 23, 24, 25]
34.69 seconds

two_towers1_d.exe 25
err: 3.47015e-06
n: 25 [3, 7, 8, 10, 12, 13, 14, 15, 18, 19, 20, 22]
23.91 seconds

The timings of the D version are lower, but not by much, because that Python version is already quite fast: it searches for the minimum error by summing about 2^25 ~= 33.5 million different subsets!

You can see that the D and Python versions give complementary solutions. The difference:
>>> from math import sqrt
>>> a = [1, 2, 4, 5, 6, 9, 11, 16, 17, 21, 23, 24, 25]
>>> b = [3, 7, 8, 10, 12, 13, 14, 15, 18, 19, 20, 22]
>>> da = sum(sqrt(x) for x in a)
>>> db = sum(sqrt(x) for x in b)
>>> da, db
(42.816888402464038, 42.816891872614107)
>>> abs(da - db)
3.4701500695177856e-06
It's not difficult to write a much faster D version that uses just Phobos; it's lower-level than the first D program, just a bit higher-level than C code:
// two_towers2_d.d
import std.math: sqrt, abs, round;
import std.conv: toInt;
import std.stdio: writefln;

int[] finder(int n) {
    double data[60];
    double target = 0.0;
    for (int i = 1; i <= n; i++) {
        data[i - 1] = sqrt(cast(double)i);
        target += data[i - 1];
    }
    target /= 2;

    uint index_min;
    double current_min = double.max;

    for (uint index = 0; index < (1 << n); index++) {
        auto i = index;
        double current_tot = 0.0;
        uint position;

        while (i) {
            if (i & 1)
                current_tot += data[position];
            i >>= 1;
            position++;
        }

        double current_diff = abs(current_tot - target);
        if (current_diff < current_min) {
            current_min = current_diff;
            index_min = index;
        }
    }

    int[] result;
    uint position;
    while (index_min) {
        if (index_min & 1)
            result ~= position + 1;
        index_min >>= 1;
        position++;
    }

    return result;
}

void main(string[] argv) {
    int n = argv.length == 2 ? toInt(argv[1]) : 10;
    auto solution = finder(n);

    double tot1 = 0.0;
    double tot2 = 0.0;
    for (int i = 1; i <= n; i++) {
        bool found = false;
        foreach (j; solution)
            if (i == j)
                found = true;
        if (found)
            tot1 += sqrt(cast(double)i);
        else
            tot2 += sqrt(cast(double)i);
    }
    double err = abs(tot1 - tot2);

    writefln("err: %.16f", err);
    writefln("n: ", n, "  ", solution);
}

In D you can also use the FP type "real" (80-bit FP) to get better precision, with a small slowdown.
The run times are much better, more than four times faster than the first D version:

two_towers2_d.exe 25
err: 0.0000034701500695
n: 25 [1,2,4,5,6,9,11,16,17,21,23,24,25]
5.26 seconds

This solution is still more or less O(n 2^n), because for each of the 2^n subset indexes the inner while loop scans up to n bits.
It can also be translated to plain C, lowering the level a bit more:
// two_towers1_c.c
#include "math.h"
#include "stdio.h"
#include "stdlib.h"

int main(int argc, char** argv) {
    int n = argc == 2 ? atoi(argv[1]) : 10;

    double data[60];
    unsigned int i;
    for (i = 0; i < 60; i++)
        data[i] = 0.0;

    double target = 0.0;
    for (i = 0; i < n; i++) {
        data[i] = sqrt((double)(i + 1));
        target += data[i];
    }
    target /= 2;

    unsigned int index_min = 0;
    double min_difference = 1e30;

    unsigned int index;
    for (index = 0; index < (1 << n); index++) {
        i = index;
        double current_tot = 0.0;
        unsigned int position = 0;

        while (i) {
            if (i & 1)
                current_tot += data[position];
            i >>= 1;
            position++;
        }

        double current_diff = fabs(current_tot - target);
        if (current_diff < min_difference) {
            min_difference = current_diff;
            index_min = index;
        }
    }

    double tot1 = 0.0;
    double tot2 = 0.0;
    for (i = 0; i < n; i++)
        if (index_min & (1 << i))
            tot1 += sqrt((double)(i + 1));
        else
            tot2 += sqrt((double)(i + 1));
    double err = fabs(tot1 - tot2);
    printf("err: %.16f\n", err);

    printf("n: %d ", n);
    unsigned int position = 0;
    while (index_min) {
        if (index_min & 1)
            printf("%d ", position + 1);
        index_min >>= 1;
        position++;
    }
    printf("\n");

    return 0;
}

I have compiled it with llvm-gcc (gcc version 4.2.1, llvm 2.5):
llvm-gcc -Wall -O3 -s -fomit-frame-pointer -msse3 -march=core2 two_towers1_c.c -o two_towers1_c
Here GCC gives more approximated results than llvm-gcc (at first I feared I had found a bug in GCC; it seems current_diff < min_difference is computed with lower precision by GCC), and the GCC binary is also slower.
Some timings:

two_towers1_c.exe 25
err: 0.0000034701500695
n: 25 [1 2 4 5 6 9 11 16 17 21 23 24 25 ]
3.49 seconds

two_towers1_c.exe 31
err: 0.0000000552507515
n: 31 [2 4 5 6 7 10 11 12 15 18 19 22 26 27 28 29 ]
270.39 seconds

This is the (cleaned up) asm code produced by llvm-gcc for the two inner loops of the program; it's quite short and nice:
LBB1_10:
	testl	%esi, %esi
	jne	LBB1_20
LBB1_11:
	pxor	%xmm0, %xmm0
	movapd	%xmm0, %xmm2
	jmp	LBB1_24
LBB1_12:
	testl	$1, %ecx
	je	LBB1_14
LBB1_13:
	addsd	(%edx), %xmm1
LBB1_14:
	addl	$8, %edx
	shrl	%ecx
	testl	%ecx, %ecx
	jne	LBB1_12
LBB1_15:
	subsd	%xmm0, %xmm1
	andpd	LCPI1_2, %xmm1
	ucomisd	%xmm1, %xmm2
	cmova	%eax, %edi
	incl	%eax
	cmpl	%ebx, %eax
	minsd	%xmm2, %xmm1
	je	LBB1_10

To show better the approximations GCC introduces, I have written the following reduced C program; you can compile it with -O0 and with -O1 and compare the results.

Even if you compile with -O0 plus all the optimization flags enabled by -O1, the output is the same (correct) one as with plain -O0.

The -O1 flags are: -fauto-inc-dec -fcprop-registers -fdce -fdefer-pop -fdelayed-branch -fdse -fguess-branch-probability -fif-conversion2 -fif-conversion -finline-small-functions -fipa-pure-const -fipa-reference -fmerge-constants -fsplit-wide-types -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-copyrename -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-fre -ftree-sra -ftree-ter -funit-at-a-time
// gcc_approx.c
#include "math.h"
#include "stdio.h"

#define N 20

int main() {
    double data[60];
    int i;
    double target = 0.0;
    for (i = 0; i < N; i++) {
        data[i] = sqrt((double)i + 1);
        target += data[i];
    }
    target /= 2;

    double min_difference = 1e30;

    unsigned int index;
    for (index = 0; index < (1 << N); index++) {
        unsigned int i = index;
        double current_tot = 0.0;
        unsigned int position = 0;

        while (i) {
            if (i & 1)
                current_tot += data[position];
            i >>= 1;
            position++;
        }

        double current_diff = fabs(current_tot - target);
        if (current_diff < min_difference) {
            min_difference = current_diff;
            printf("%.18f\n", min_difference);
        }
    }

    return 0;
}

I have tried to speed up the D code by leaving the inner loop as soon as the partial sum becomes too much larger than the half sum, but in all my experiments the resulting performance is worse.

An obvious optimization is to avoid summing a whole subset of items at each iteration.

One way is to keep a running sum and change only one value per iteration, enumerating the subsets according to a Gray code: successive codes differ in a single digit, so each iteration only needs to add or subtract one item:
0
1
11
10
110
111
101
100
1100
1101
1111
1110
1010
1011
1001
1000
11000
11001
11011
...

But that can't be used here, because we are working with floating point values, and after a while the accumulated rounding error of the running sum grows too much.
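
To make the idea concrete, here is a minimal Python sketch of the Gray-code scan with a running sum (my own illustration, not one of the timed programs and not included in the zip; for small n it finds the same solutions, but the running sum slowly accumulates rounding error):
# gray_sketch.py
from math import sqrt

def gray_scan(n):
    sqrs = [sqrt(i) for i in xrange(1, n + 1)]
    half = sum(sqrs) / 2
    total = 0.0                     # running sum of the current subset
    subset = 0                      # current subset as a Gray-code bit mask
    best_err, best_set = half, 0    # start from the empty subset
    for i in xrange(1, 1 << n):
        # going from gray(i-1) to gray(i) flips exactly one bit:
        # the bit at the position of the lowest set bit of i
        bit, j = 0, i
        while not (j & 1):
            j >>= 1
            bit += 1
        subset ^= 1 << bit
        if subset & (1 << bit):
            total += sqrs[bit]      # the item has entered the subset
        else:
            total -= sqrs[bit]      # the item has left the subset
        err = abs(total - half)
        if err < best_err:
            best_err, best_set = err, subset
    # the difference between the towers is twice the distance from the half sum
    return 2 * best_err, [b + 1 for b in xrange(n) if best_set & (1 << b)]

print gray_scan(10)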

If instead you scan the subsets as binary numbers counting up from 0, going from one number to the next changes only the lower bits, so the partial sum of the items corresponding to the unchanged higher bits can be kept and reused. This turns the computation into a tree of partial sums: far fewer item sums are computed, and the rounding errors don't accumulate across iterations.
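
To show the shape of that tree in a generic way, here is a small recursive Python sketch (again my own illustration, not included in the zip); the hard-coded nested loops shown next are what actually runs fast under Psyco:
# tree_sketch.py
from math import sqrt

def find_best(n):
    sqrs = [sqrt(i) for i in xrange(1, n + 1)]
    half = sum(sqrs) / 2
    best = [1e20, 0]            # [smallest |partial - half|, chosen bit mask]

    def descend(pos, partial, mask):
        # 'partial' is the sum of the items chosen among the first 'pos' ones;
        # deeper calls reuse it instead of re-summing the whole subset
        if pos == n:
            err = abs(partial - half)
            if err < best[0]:
                best[0], best[1] = err, mask
            return
        descend(pos + 1, partial, mask)                           # skip block pos+1
        descend(pos + 1, partial + sqrs[pos], mask | (1 << pos))  # take block pos+1

    descend(0, 0.0, 0)
    return 2 * best[0], [b + 1 for b in xrange(n) if best[1] & (1 << b)]

print find_best(10)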

This idea leads to code like the following, hard-coded for n=6 (I have used Psyco for speed; note that Psyco is available for Python 2.6 too):
import sys, psyco

def main(n):
  sqrs = [n ** 0.5 for n in xrange(1, n+1)]
  half = sum(sqrs) / 2
  tot0 = -half
  current_min = 1e20
  best = None

  for i1 in xrange(2):
    tot1 = tot0 + 1.0 if i1 else tot0
    for i2 in xrange(2):
      tot2 = tot1 + 1.4142135623730951 if i2 else tot1
      for i3 in xrange(2):
        tot3 = tot2 + 1.7320508075688772 if i3 else tot2
        for i4 in xrange(2):
          tot4 = tot3 + 2.0 if i4 else tot3
          for i5 in xrange(2):
            tot5 = tot4 + 2.2360679774997898 if i5 else tot4
            for i6 in xrange(2):
              tot6 = tot5 + 2.4494897427831779 if i6 else tot5
              if abs(tot6) < current_min:
                current_min = abs(tot6)
                best = [i1, i2, i3, i4, i5, i6]

  sol = [s for s, ok in zip(sqrs, best) if ok]
  print "err: %.16f" % abs(sum(sol) - sum(set(sqrs).difference(sol)))
  print "n:", n, [int(round(x * x)) for x in sol]

n = 6
psyco.bind(main)
main(n)

With Python you can create a generic program that works with other values of n. It builds the nested inner loops as a string, uses exec to create the function, binds/compiles it with Psyco, and then calls it:
# two_towers2_py.py
import sys, psyco

n = int(sys.argv[1]) if len(sys.argv) == 2 else 20
code = []
for i in xrange(1, n+1):
    indent = " " * 2 * i
    code.append(indent + "for i%d in xrange(2):" % i)
    code.append(indent + "  tot%d = tot%d + %r if i%d else tot%d" % (i, i-1, i**0.5, i, i-1))
indent = " " * 2 * (n + 1)
code.append(indent + "if abs(tot%d) < current_min:" % n)
code.append(indent + "  current_min = abs(tot%d)" % n)
code.append(indent + "  best = [%s]" % ", ".join("i%d" % i for i in xrange(1, n+1)))
code = "\n".join(code)

exec """
def main(n):
  sqrs = [n ** 0.5 for n in xrange(1, n+1)]
  half = sum(sqrs) / 2
  tot0 = -half
  current_min = 1e20
  best = None
XXX
  sol = [s for s, ok in zip(sqrs, best) if ok]
  print "err: %.16f" % abs(sum(sol) - sum(set(sqrs).difference(sol)))
  print "n:", n, [int(round(x * x)) for x in sol]""".replace("XXX", code)

psyco.bind(main)
main(n)
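
For illustration, this is what the generated string looks like for n = 2, before it gets spliced into the function body in place of XXX (obtained by just printing the 'code' variable):
  for i1 in xrange(2):
    tot1 = tot0 + 1.0 if i1 else tot0
    for i2 in xrange(2):
      tot2 = tot1 + 1.4142135623730951 if i2 else tot1
      if abs(tot2) < current_min:
        current_min = abs(tot2)
        best = [i1, i2]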

This algorithm is now O(2^n), and the timings are very good:

two_towers2_py.py 20
err: 0.0000716919350268
n: 20 [5, 6, 7, 11, 13, 14, 17, 18, 20]
0.47 seconds

Unfortunately if you try to run this two_towers2_py.py program with n>20 you receive:
SystemError: too many statically nested blocks
CPython doesn't allow more than 20 statically nested blocks. I think this limit is too low, and roughly doubling it would be better; I don't know how to modify the Python interpreter to raise it.
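
A quick throwaway check of the limit (my own snippet, not part of the programs in the zip) is to compile a function with one nested loop too many:
# nesting_limit_check.py
src = "def f():\n"
for depth in xrange(21):                       # 21 nested for loops, one too many
    src += "  " * (depth + 1) + "for v%d in xrange(2):\n" % depth
src += "  " * 22 + "pass\n"
try:
    exec src
except SystemError, e:
    print e                                    # too many statically nested blocks

With 20 nested loops the same code compiles without errors.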

In D you can create code at compile time and then mix it in with a mixin, so you can write similar code (without the nested-block limit), but now n must be known at compile time. In the following program all the compile-time strings are built with functional-style recursive templates, but often you can also use normal functions run at compile time (CTFE):
// two_towers3_d.d
import std.math: sqrt, abs;
import std.conv: toInt;
import std.stdio: writefln;
import std.metastrings: Format, ToString;

const int n = 25; // this can be changed

template SeriesGen1(string txt, string separator, int max, int min=0) {
    static if (min > max)
        const SeriesGen1 = "";
    else static if (min == max)
        const SeriesGen1 = Format!(txt, ToString!(max));
    else
        const SeriesGen1 = SeriesGen1!(txt, separator, max-1, min) ~ separator ~
                           Format!(txt, ToString!(max));
}

template GenCore(int i, string arr) {
    const string GenCore = Format!("
if (abs(tot%s) < current_min) {
  current_min = abs(tot%s);
  best = [%s];
}", i, i, arr);
}

template GenLoop(int i) {
    static if (i <= 0)
        const string GenLoop = "";
    else
        const string GenLoop = GenLoop!(i - 1) ~
                               Format!("for (int i%s; i%s < 2; i%s++) {\n" ~
                                       "  double tot%s = i%s ? tot%s + sqrs[%s] : tot%s;\n",
                                        i, i, i, i, i, i-1, i-1, i-1);
}

template GenClosing(int i) {
    static if (i <= 0)
        const string GenClosing = "";
    else
        const string GenClosing = GenClosing!(i - 1) ~ "}";
}

template FullGenerator(int i) {
    const string FullGenerator = GenLoop!(i) ~
                                 GenCore!(i, SeriesGen1!("i%s", ", ", i, 1)) ~
                                 "\n"
                                 ~ GenClosing!(i);
}

int[] finder(int n)() {
  double sqrs[60];
  double half = 0.0;
  for (int i = 1; i <= n; i++) {
    sqrs[i - 1] = sqrt(cast(double)i);
    half += sqrs[i - 1];
  }
  half /= 2;

  double tot0 = -half;
  double current_min = 1e20;
  auto best = new int[n];

  mixin(FullGenerator!(n));

  int[] result;
  for (int i; i < n; i++)
    if (best[i])
      result ~= (i + 1);
  return result;
}

void main() {
  auto solution = finder!(n)();

  double tot1 = 0.0;
  double tot2 = 0.0;
  for (int i = 1; i <= n; i++) {
    bool found = false;
    foreach (j; solution)
      if (i == j)
        found = true;
    if (found)
      tot1 += sqrt(cast(double)i);
    else
      tot2 += sqrt(cast(double)i);
  }
  double err = abs(tot1 - tot2);

  writefln("err: %.16f", err);
  writefln("n: ", n, "  ", solution);
}

The timings are quite good:

two_towers3_d.d.exe
err: 0.0000716919350267
n: 20 [5,6,7,11,13,14,17,18,20]
0.08 seconds

two_towers3_d.d.exe
err: 0.0000034701500695
n: 25 [1,2,4,5,6,9,11,16,17,21,23,24,25]
0.62 seconds

two_towers3_d.d.exe
err: 0.0000000552507515
n: 31 [2,4,5,6,7,10,11,12,15,18,19,22,26,27,28,29]
38.94 seconds

In both the Python and D versions the innermost one, two or three loops can be unrolled by hand to gain a bit of performance, but I have seen that unrolling just one inner loop doesn't give much extra speed and makes the code generation more complex.

I have also tried to leave the inner loops as soon as the current sum is significantly larger than the half sum, but all my experiments have led to equal or lower performance; I don't know why.

Finally I have created a C version of that code. After a few tries I have settled on the simpler solution: my templat.py Python program, which allows templating files of "any" language using snippets of Python that just use print to generate code (within a single file the state is kept from one snippet to the next). Notes about templat.py:
http://www.fantascienza.net/leonardo/so/templat.html

The C code with two blocks of templated Python code inside; note that the first block sets n, which is also used by the second block:
// two_towers2_c.ct
#include "stdio.h"
#include "math.h"
#include "string.h"

{{
n = 41 # this can be changed
print "#define N %d" % n
}}

int main() {
  #define M 60
  double sqrs[M];
  int i;
  for (i = 0; i < M; i++)
      sqrs[i] = 0.0;

  double half = 0.0;
  for (i = 1; i <= N; i++) {
    sqrs[i - 1] = sqrt((double)i);
    half += sqrs[i - 1];
  }
  half /= 2;

  double tot0 = -half;
  double current_min = 1e20;
  int best[M];
  for (i = 0; i < M; i++)
      best[i] = 0;

{{
print "  // generated code for n = %d ------------------" % n
print "  #if N != %d" % n
print "    #error The defined value of N is wrong for the generated code."
print "  #endif"

for i in xrange(1, n+1):
    indent = " " * 2 * i
    print indent + "int i%d; for (i%d = 0; i%d < 2; i%d++) {" % (i, i, i, i)
    print indent + "  double tot%d = i%d ? tot%d + sqrs[%d] : tot%d;" % (i, i, i-1, i-1, i-1)

indent = " " * 2 * (n + 1)
print indent + "if (fabs(tot%d) < current_min) {" % n
print indent + "  current_min = fabs(tot%d);" % n
for i in xrange(0, n):
    print indent + "  best[%d] = i%d;" % (i, i+1)
print indent + "}"

for i in xrange(n, 0, -1):
    indent = " " * 2 * i
    print indent + "}"

print "  // end generated code ------------------"
}}

  double tot1 = 0.0;
  double tot2 = 0.0;
  for (i = 0; i < N; i++)
    if (best[i])
      tot1 += sqrt((double)(i + 1));
    else
      tot2 += sqrt((double)(i + 1));
  double err = fabs(tot1 - tot2);
  printf("err: %.16f\n", err);

  printf("N: %d [", N);
  for (i = 0; i < N; i++)
    if (best[i])
      printf("%d ", i + 1);
  printf("]\n");

  return 0;
}


Using llvm-gcc the performance is much higher than that of the D code compiled with DMD (I have not tried the LDC D compiler; it would probably give much better performance than DMD):

subs4_c_llvm.exe
err: 0.0000000552507515
N: 31 [2 4 5 6 7 10 11 12 15 18 19 22 26 27 28 29 ]
9.21 seconds

two_towers2_c_llvm.exe
err: 0.0000000011096404
N: 36 [1 3 4 5 7 8 10 12 13 16 20 21 23 24 26 29 33 34 35 ]
5 minutes 49.13 seconds

two_towers2_c_llvm.exe
err: 0.0000000000449774
N: 38 [1 5 6 7 9 10 12 13 14 15 17 18 19 21 24 27 29 30 35 38 ]
20 minutes 12.83 seconds

two_towers2_c_llvm.exe
err: 0.0000000144697481
N: 39 [1 4 6 9 11 13 15 16 17 18 19 21 22 23 25 26 27 31 33 36 ]
6 minutes 34.71 seconds

two_towers2_c_llvm.exe
err: 0.0000000001917329
N: 40 [1 3 4 6 8 9 10 11 12 15 16 20 21 25 28 31 33 35 36 37 40 ]
1 hour 35 minutes 40 seconds

two_towers2_c_llvm.exe
err: 0.0000000001424638
N: 41 [3 7 9 10 12 15 16 17 18 19 22 23 25 26 28 30 31 36 37 41 ]
3 hours 11 minutes 8 seconds

Note that 2^41 = 2_199_023_255_552 :-)


Update August 31 2012:

The Reddit user Ledrug has sent me a faster C solution, shown here slightly reformatted and with a few names renamed:

// two_towers4_c.c
#include "stdio.h"
#include "stdlib.h"
#include "math.h"

typedef unsigned int Bits;

typedef struct {
    double v;
    Bits bits;
} Combo;

int cmp_combo(const void *aa, const void *bb) {
    const Combo *a = aa;
    const Combo *b = bb;
    return (a->v > b->v) ? 1 : (a->v == b->v) ? 0 : -1;
}

Combo * mkbits(const double *v, const Bits nbits) {
    Bits i;
    Combo *r = malloc(sizeof(*r) * (1 << nbits));
    r[0].v = 0;
    r[0].bits = 0;

    for (i = 0; i < nbits; i++)
        r[1 << i].v = v[i];

    for (i = 1; i < 1U << nbits; i++) {
        const Bits b = i & -(int)i;
        r[i].v = r[i ^ b].v + r[b].v;
        r[i].bits = i;
    }

    qsort(r, 1 << nbits, sizeof(*r), cmp_combo);
    return r;
}

int main(const int argc, const char **argv) {
    int m = (argc == 2) ? atoi(argv[1]) : 10;
    if (m <= 0)
        m = 10;

    const int n = m - 1;

    double *values = malloc(sizeof(double) * m);
    double sum = 0;

    int i, j;
    for (i = 0; i <= n; i++)
        sum += (values[i] = sqrt(i + 1));
    sum /= 2;

    const Combo *lo = mkbits(values + 1, n / 2);
    const Combo *hi = mkbits(values + 1 + n / 2, n - n / 2);

    double best_v = sum;
    Bits best_i = 0;

    for (i = (1 << n / 2), j = 0; i--; ) {
        const double s = lo[i].v;

        while ((j < 1 << (n - n / 2)) && hi[j].v + s < sum)
            j++;

        if (j) {
            const double v = fabs(hi[j - 1].v + s - sum);
            if (v < best_v) {
                best_v = v;
                best_i = lo[i].bits | (hi[j - 1].bits << n / 2);
            }
        }

        if (j < 1 << (n - n / 2)) {
            const double v = fabs(hi[j].v + s - sum);
            if (v < best_v) {
                best_v = v;
                best_i = lo[i].bits | (hi[j].bits << n / 2);
            }
        }
    }

    printf("n = %d:", m);
    for (i = 0; i < n; i++)
        if (best_i & (1 << i))
            printf(" %d", i + 2);

    printf(", diff: %g\n", best_v * 2);

    return 0;
}

A comment from Ledrug:

Here's a brief complexity analysis: given n numbers, splitting all of them into two piles and comparing sums takes O(2^n). The way I did it divides them into halves and produces all possible sums for each half, which is O(2^(n/2)). Let's call 2^(n/2) "M"; each group of sums gets sorted, that's O(M log M). Then the two lists are scanned linearly once, that's again O(M). So overall the complexity is O(M log M) = O(n * 2^(n/2)), which is still exponential, but a much smaller one than the original 2^n. You are correct in saying that qsort is the critical part (that's the M log M term there), and it is indeed possible to make the code faster by various tweaks, though I don't see a way to make it faster beyond a constant factor.
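
The same strategy can also be written compactly (and much more slowly) in Python; this sketch is mine, not Ledrug's code: it uses bisect instead of his linear scan of the two sorted arrays, and it skips his trick of always leaving the first block in the other tower:
# meet_in_the_middle_sketch.py
from math import sqrt
from bisect import bisect_left

def two_towers(n):
    sqrs = [sqrt(i) for i in xrange(1, n + 1)]
    half = sum(sqrs) / 2
    a, b = sqrs[:n // 2], sqrs[n // 2:]

    def all_sums(items):
        # (sum, bit mask) for every subset of 'items', sorted by sum
        sums = [(0.0, 0)]
        for pos, x in enumerate(items):
            sums += [(s + x, m | (1 << pos)) for s, m in sums]
        sums.sort()
        return sums

    lo, hi = all_sums(a), all_sums(b)
    hi_sums = [s for s, _ in hi]

    best = (2 * half, 0, 0)        # (tower difference, mask in a, mask in b)
    for s, ma in lo:
        # the best partner in 'hi' is one of the two sums around half - s
        j = bisect_left(hi_sums, half - s)
        for k in (j - 1, j):
            if 0 <= k < len(hi):
                diff = abs(2 * (s + hi[k][0]) - 2 * half)
                if diff < best[0]:
                    best = (diff, ma, hi[k][1])

    diff, ma, mb = best
    sol = [i + 1 for i in xrange(len(a)) if ma & (1 << i)]
    sol += [len(a) + i + 1 for i in xrange(len(b)) if mb & (1 << i)]
    return diff, sol

print two_towers(25)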
Later I have written a C-like D translation, and mysteriously it's faster than the C code (compiled with GCC 4.7.1, -Ofast -s -flto). Maybe it's just the C library qsort that is faster. In D I have tried to replace qsort with the Phobos sort, and with a quite fast sort I have written, but both were slower than qsort, so I have kept qsort, even though its usage is not idiomatic D.
// two_towers5_d.d
import core.stdc.stdio: printf;
import core.stdc.stdlib: malloc, qsort;
import std.math: sqrt, abs;
import std.conv: to;

alias uint Bits; // or size_t?

struct Combo {
    double v;
    Bits bits;
}

Combo* mkbits(in double* v, in int nbits) nothrow {
    Combo* r = cast(Combo*)malloc(Combo.sizeof * (1 << nbits));
    r[0].v = 0;
    r[0].bits = 0;

    foreach (Bits i; 0 .. nbits)
        r[1 << i].v = v[i];

    foreach (Bits i; 1 .. 1 << nbits) {
        immutable Bits b = i & -cast(int)i;
        r[i].v = r[i ^ b].v + r[b].v;
        r[i].bits = i;
    }

    static extern(C) int cmpCombo(const void* aa, const void* bb) pure nothrow {
        const Combo* a = cast(Combo*)aa;
        const Combo* b = cast(Combo*)bb;
        return (a.v > b.v) ? 1 : (a.v == b.v) ? 0 : -1;
    }

    // much faster than std.algorithm.sort and even of fastSort
    qsort(r, 1 << nbits, (*r).sizeof, &cmpCombo);
    return r;
}

void main(in string[] args) {
    int m = (args.length == 2) ? to!int(args[1]) : 10;
    if (m <= 0)
        m = 10;
    immutable int n = m - 1;

    double* values = cast(double*)malloc(double.sizeof * m);
    double sum = 0;
    foreach (i; 0 .. m)
        sum += (values[i] = sqrt(cast(real)(i + 1)));
    sum /= 2;

    const Combo* lo = mkbits(values + 1, n / 2);
    const Combo* hi = mkbits(values + 1 + n / 2, n - n / 2);

    double best_v = sum;
    Bits best_i = 0;

    for (int i = (1 << n / 2), j = 0; i--; ) {
        immutable double s = lo[i].v;

        while ((j < 1 << (n - n / 2)) && hi[j].v + s < sum)
            j++;

        if (j) {
            immutable double v = abs(hi[j - 1].v + s - sum);
            if (v < best_v) {
                best_v = v;
                best_i = lo[i].bits | (hi[j - 1].bits << n / 2);
            }
        }

        if (j < 1 << (n - n / 2)) {
            immutable double v = abs(hi[j].v + s - sum);
            if (v < best_v) {
                best_v = v;
                best_i = lo[i].bits | (hi[j].bits << n / 2);
            }
        }
    }

    printf("n = %d:", m);
    foreach (i; 0 .. n)
        if (best_i & (1 << i))
            printf(" %d", i + 2);

    printf(", diff: %g\n", best_v * 2);
}

Some timings:

GCC V. 4.7.1 -Ofast -s -flto
dmd V. 2.061alpha, -O -release -inline -noboundscheck

Output N = 41 (for both the C and D versions):
n = 41: 5 6 9 11 13 14 18 20 21 24 27 29 32 33 37 38 41, diff: 1.42428e-10

Output N = 45:
n = 45: 2 6 7 9 11 14 17 18 20 22 25 26 28 29 30 31 33 34 38 39 41 43, diff: 4.1922e-012

Timings, N=41, seconds:
  C4: 1.13
  D5: 0.85

Timings, N=45, seconds:
  C4: 4.87
  D5: 3.64

This means that with N=41 the D5 code is about 13_000 times faster than the fastest preceding version; a nice improvement.

[Go back to the article index]