Contest: fastest way to sort a big array of Gaussian-distributed data
[+23] [0] static_rtti
[2011-06-07 12:58:58]
[ algorithm performance sorting statistics programming-contest ]
[ http://stackoverflow.com/questions/6265525] [DELETED]

Following the interest in this question [1], I thought it would be interesting to make answers a bit more objective and quantitative by proposing a contest.

The idea is simple: I have generated a binary file containing 50 million Gaussian-distributed doubles (mean 0, standard deviation 1). The goal is to make a program that will sort these in memory as fast as possible. A very simple reference implementation in Python takes 1m4s to complete. How low can we go?

The rules are as follows: answer with a program that opens the file "gaussian.dat" and sorts the numbers in memory (no need to output them), and instructions for building and running the program. The program must be able to work on my Arch Linux machine (meaning you can use any programming language or library that is easily installable on this system).

The program must be reasonably readable, so that I can make sure it is safe to launch (no assembler-only solution, please!).

I will run the answers on my machine (quad core, 4 GB of RAM). The fastest solution will get the accepted answer and a 100-point bounty :)

The program used to generate the numbers:

#!/usr/bin/env python
import random
from array import array
from sys import argv
count=int(argv[1])
a=array('d',(random.gauss(0,1) for x in xrange(count)))
f=open("gaussian.dat","wb")
a.tofile(f)
f.close()

The simple reference implementation:

#!/usr/bin/env python
from array import array
from sys import argv
count=int(argv[1])
a=array('d')
a.fromfile(open("gaussian.dat","rb"),count)
print "sorting..."
b=sorted(a)

EDIT: only 4 GB of RAM, sorry

EDIT#2: Note that the point of the contest is to see if we can use prior information about the data. It is not supposed to be a pissing match between different programming language implementations!
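To illustrate the kind of prior-information trick the contest is after, here is a minimal sketch of a distribution-aware bucket sort: map each value through the standard normal CDF so the buckets receive roughly equal counts, then sort each small bucket. The bucket count and the use of `math.erf` are my own choices for illustration, not part of the question.

```python
import math
import random

def gaussian_cdf(x):
    # Phi(x) for a standard normal, expressed via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bucket_sort_gaussian(values, num_buckets=1000):
    # The CDF maps the data to an approximately uniform [0, 1) range,
    # so buckets end up roughly balanced regardless of the bell shape.
    buckets = [[] for _ in range(num_buckets)]
    for v in values:
        idx = min(int(gaussian_cdf(v) * num_buckets), num_buckets - 1)
        buckets[idx].append(v)
    result = []
    for b in buckets:
        b.sort()  # each bucket is small, so this is cheap
        result.extend(b)
    return result

data = [random.gauss(0, 1) for _ in range(10000)]
assert bucket_sort_gaussian(data) == sorted(data)
```

Whether this beats a tuned comparison sort in practice depends on the cost of evaluating the CDF per element, which is exactly the trade-off discussed in the comments below.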

(4) there is a site for this: codegolf.SE - yoda
(3) @yoda: this is not a code golf question nor a programming puzzle. - static_rtti
(4) @static_rtti: well, AFAIK questions of this sort belong there... anyway, I'm not entirely sure so I'll let the community decide. - yoda
I'm a bit lost concerning your implementation. I saved the first snippet as generator.py and ran python generator.py 1000000. However, when I saved the second snippet as sort.py and entered python sort.py 1000000 in the console, I got an EOFError. I know nearly nothing about Python, so if you could help... Edit: The gaussian.dat file is indeed at the right place, but the serialization made it look like garbage in my editor. Normal behaviour with Python's tofile()? - Fezvez
@Fezvez: it's supposed to look like garbage since it's stored as raw binary data. However, what you did is supposed to work... What version of Python did you use? I used Python 2.4, which is rather old. Also, how big is the resulting file? - static_rtti
I used Python 2.7, and the file size was 8M. If anyone cared to check with this version of Python, it would help. Anyway, that's not the point; I'll generate the data myself, and sort it myself too. It's just that I wondered why I couldn't make the baseline work for me. - Fezvez
@Fezvez: I'll try tonight on a newer machine. In the meantime, can you report the exact size in bytes? If it's exactly 8,000,000 bytes, then the file is probably good. - static_rtti
Hehe : 7,62 MB (8 000 000 bytes)! In the meantime, I'll work on the sorting algorithm =) - Fezvez
@Fezvez: I just tried it again with Python 2.7, and it works fine... - static_rtti
I believe that using the Gaussian distribution needlessly complicates things, if you want to do smart calculations. If we had the uniform distribution, the point (using the knowledge of the distribution) would stay the same, and the arithmetic would get much simpler... - xofon
@xofon: I picked the Gaussian distribution because it's what is most widely encountered in practice. - static_rtti
@xofon: I would be surprised if the direct indexing method described by Mike didn't win hands down over such a method, though in theory this approach could support a strategy that avoids the cost of the phi function. - Guvante
(1) Take each value and move it directly to its "expected" position, repeat for the displaced value. Not sure how to resolve a couple of issues with that. When done, bubble sort until complete (a couple of passes should do). - phkahler
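The comment above can be sketched roughly as follows: drop each value into the slot its CDF rank predicts, then finish with a single insertion-sort pass over the nearly sorted result. The Phi-based placement and the choice of insertion sort (rather than bubble sort) for the cleanup pass are my interpretation, not phkahler's code.

```python
import math
import random

def phi(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def expected_position_sort(values):
    n = len(values)
    # Place each value at its "expected" index; collisions chain
    # into per-slot lists instead of displacing existing values.
    slots = [[] for _ in range(n)]
    for v in values:
        slots[min(int(phi(v) * n), n - 1)].append(v)
    out = [v for s in slots for v in s]
    # The data is now nearly sorted, so one insertion-sort pass
    # (cheap on nearly sorted input) finishes the job.
    for i in range(1, n):
        v = out[i]
        j = i - 1
        while j >= 0 and out[j] > v:
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = v
    return out

data = [random.gauss(0, 1) for _ in range(10000)]
assert expected_position_sort(data) == sorted(data)
```

The cleanup pass is what makes this correct even when the placement is off by a few positions; its cost grows with how far values land from their true ranks.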
Vote to close, off topic, not a real question for SO. - sixlettervariables
(1) I will post a bucket sort solution tomorrow evening if this hasn't been closed by then :) - jonderry
(1) @static_rtti - as a heavy CG user, this is exactly the sort of thing "we" like to hack on at CG.SE. To any reading mods, move this to CG, don't close it. - Arrdem
(1) It's not code golf though. The point is that most of the sorting algorithms we've all heard of don't take into account the statistical distribution of the data. The question is whether we can do better if we know the distribution. Totally unrelated to code golfing. I think the only reason people want to close this is because it's phrased as a contest. It's essentially the same question as the one that inspired it, just phrased in a way that encourages actual code samples. - MatrixFrog
To mods: if SO is really a community-driven site, should you really close one of the highest-voted questions of the day? Please don't migrate it to CG, which is very low traffic and would kill the question. Please think of the people who participated... - static_rtti