2 Buckets in 1!
A radix-2 sort processes keys one bit at a time, splitting the current partially sorted list – based on the current bit – into two lists - often called buckets - that must normally be concatenated to form a single list before the next pass.
The following code avoids the space and time overhead of the final concatenation step by using opporiste ends of the same array as the buckets. When splitting is complete, the array is full.
void sort(unsigned *a, unsigned len)
{
unsigned *s = a, *d = malloc(len * sizeof *d), *t;
unsigned bit, is, id0, id1;
for (bit = 1; bit; bit <<= 1, t = s, s = d, d = t)
for (is = id0 = 0, id1 = len; id1 > id0; ++is)
if (((s[is] >> 1) ^ s[is]) & bit)
d[--id1] = s[is];
else
d[id0++] = s[is];
free(d);
}
The trick is that the list is being built back-to-front at the end of the destination buffer — reverse order! What’s going on here?
To see it, first consider this simpler algorithm:
void gray_code_order_sort(unsigned *a, unsigned len)
{
unsigned *s = a, *d = malloc(len * sizeof *d), *t;
unsigned bit, is, id0, id1;
for (bit = 1; bit; bit <<= 1, t = s, s = d, d = t)
for (is = id0 = 0, id1 = len; id1 > id0; ++is)
if (s[is] & bit)
d[--id1] = s[is];
else
d[id0++] = s[is];
free(d);
}
We start with the low order bit and make as many passes as the key has bits (32 on my machine), moving to successively higher-order bit positions for each pass. The pass copies all the keys with a 0 at the current bit to the low end of the destination buffer and those with a 1 to the high end. Original order is of course maintained for the 0 keys, but the 1 keys’ order is reversed. After each pass, source and destination buffer pointers are swapped.
It isn’t hard to see that the resulting sort order is gray code! A nice little proof by induction can show this rigorously. We let this as an exercise.
The final trick is to convert the keys to gray code for the ordinal given by the key “on the fly” as the sort occurs.
Happily this is an efficient bit-parallel computation! For the ordinal N, a corresponding Gray code is just:
(N >> 1) ^ N)
This explains the rather elaborate test in the conditional of the first code above. By converting the keys to gray code and then sorting in gray code order, the final sort is in normal arithmetic order! Pretty cool.
Even though it’s theoretically O(n) in performance for a list of n elements, this sort is not normally super fast. A good qucksort beats it sorting up to 10 million 32-bit unsigned integers for lists of up to 10 million or so.
My thought, however, is that this sort might work well in hardware, where its simplicity ought to be very helpful. How about it hardware hackers? Will someone build an FPGA and let me know?