# Why is it that sorting an array takes longer than sorting an array?

## Problem

I’m doing a simple “between” search on a list of 500000 randomly created Tuplelong,long,string> objects:

var data = new List<Tuple<long,long,string>>(500000);
...
var cnt = data.Count(t => t.Item1 <= x && t.Item2 >= x);

The searches for 100 randomly generated values of x take roughly four seconds after I construct my random array and execute my search. Knowing how useful sorting is for searching, I opted to sort my data first by Item1, then by Item2, and then by Item3 before conducting my 100 searches. Because of branch prediction, I expected the sorted version to perform a little faster: my idea was that once we get to the point where Item1 == x, all subsequent tests of t.Item1 = x would properly predict the branch as “no take,” speeding up the tail section of the search. The searches took twice as long on a sorted array, much to my surprise!

I tried running my trials in a different order and with a different seed for the random number generator, but the result was the same: searches in an unsorted array were nearly twice as quick as searches in the identical array that were sorted!

Is there a good explanation for this peculiar effect? My tests’ source code is below; I’m using.NET 4.0.

private const int TotalCount = 500000;
private const int TotalQueries = 100;
private static long NextLong(Random r) {
var data = new byte[8];
r.NextBytes(data);
return BitConverter.ToInt64(data, 0);
}
private class TupleComparer : IComparer<Tuple<long,long,string>> {
public int Compare(Tuple<long,long,string> x, Tuple<long,long,string> y) {
var res = x.Item1.CompareTo(y.Item1);
if (res != 0) return res;
res = x.Item2.CompareTo(y.Item2);
return (res != 0) ? res : String.CompareOrdinal(x.Item3, y.Item3);
}
}
static void Test(bool doSort) {
var data = new List<Tuple<long,long,string>>(TotalCount);
var random = new Random(1000000007);
var sw = new Stopwatch();
sw.Start();
for (var i = 0 ; i != TotalCount ; i++) {
var a = NextLong(random);
var b = NextLong(random);
if (a > b) {
var tmp = a;
a = b;
b = tmp;
}
var s = string.Format("{0}-{1}", a, b);
}
sw.Stop();
if (doSort) {
data.Sort(new TupleComparer());
}
Console.WriteLine("Populated in {0}", sw.Elapsed);
sw.Reset();
var total = 0L;
sw.Start();
for (var i = 0 ; i != TotalQueries ; i++) {
var x = NextLong(random);
var cnt = data.Count(t => t.Item1 <= x && t.Item2 >= x);
total += cnt;
}
sw.Stop();
Console.WriteLine("Found {0} matches in {1} ({2})", total, sw.Elapsed, doSort ? "Sorted" : "Unsorted");
}
static void Main() {
Test(false);
Test(true);
Test(false);
Test(true);
}
Populated in 00:00:01.3176257
Found 15614281 matches in 00:00:04.2463478 (Unsorted)
Populated in 00:00:01.3345087
Found 15614281 matches in 00:00:08.5393730 (Sorted)
Populated in 00:00:01.3665681
Found 15614281 matches in 00:00:04.1796578 (Unsorted)
Populated in 00:00:01.3326378
Found 15614281 matches in 00:00:08.6027886 (Sorted)

Asked by Sergey Kalinichenko

## Solution #1

All tuples are accessed in memory-order when utilizing the unsorted list. They were allotted to RAM in that order. CPUs prefer sequential memory access because they can speculatively request the next cache line, ensuring that it is always available when needed.

Because your sort keys are produced at random, you put the list in random order when sorting it. This means that tuple member memory accesses are unpredictable. Because the CPU can’t prefetch memory, practically every tuple access is a cache miss.

This is a good illustration of one of the benefits of GC memory management: data structures that have been allocated and used together perform very well. They have a lot of local knowledge.

In this situation, the penalty for cache misses outweighs the penalty for saved branch prediction.

Try using a struct-tuple instead. Because no pointer-dereference is required at runtime to access tuple members, speed will be restored.

“For TotalCount around 10,000 or fewer, the sorted version does perform faster,” Chris Sinclair writes in the comments. This is due to the fact that a tiny list can be stored entirely in the CPU cache. Although memory accesses are unpredictably random, the target is always cached. Even a load from cache takes some cycles, therefore I suppose there is still a minor cost. However, this does not appear to be a problem because the CPU can handle numerous outstanding loads at once, resulting in increased throughput. When the CPU encounters a memory delay, it will continue to advance along the instruction stream in order to queue as many memory operations as possible. This method is used to conceal latency.

This type of behaviour demonstrates how difficult it is to estimate performance on current processors. The fact that switching from sequential to random memory access is just 2x slower tells me how much is going on under the hood to disguise memory latency. A memory access can cause the CPU to stall for up to 200 cycles. When random memory accesses are introduced, one would expect the application to become >10x slower.