Over at the NVIDIA forums, one enterprising researcher has optimized the radix sort algorithm to run on NVIDIA GPUs (a GTX480 with CUDA) and put it up against previously optimized and published results from NVIDIA. The results are staggering.

This project implements a very fast, efficient radix sorting method for CUDA-capable devices. For sorting large sequences of fixed-length keys (and values), we believe our GPU sorting primitive to be the fastest available for any fully-programmable microarchitecture: our stock NVIDIA GTX480 sorting results exceed the Giga-keys/sec average sorting rate (i.e., one billion 32-bit keys sorted per second). Our results demonstrate a range of 2x-4x speedup over the current Thrust and CUDPP sorting implementations, and we operate on keys of any C/C++ numeric type. Satellite values are optional, and can be any arbitrary payload structure (within reason).

On a quad core i7 from Intel: 240M 32-bit Keys per second.

On a 32-core Knights Ferry MIC (the successor to Larrabee): 560M 32-bit Keys per second.

On the GTX480: 1,005M 32-bit Keys per second.

What makes this particularly impressive is that Intel has long argued that GPU-class algorithm performance is achievable on CPUs if enough care is taken with optimization.  Intel was proud of those optimized results on its own hardware, yet the NVIDIA GPU still delivered nearly double the throughput of Knights Ferry and more than four times that of the quad-core i7.

via SRTS Radix Sort: High Performance and Scalable GPU Radix Sorting – NVIDIA Forums.