Almost every modern language comes with sorting procedures. Is there any reason to dive into?

Very often it doesn’t matter which sorting procedure we select. Especially when we’re sorting a relatively small set of data, say, less than 1000 records, and the performance is not critical. Moreover, some languages give no options but one sorting procedure (taking into consideration only basic language facilities).

But what if we have gigabytes to be sorted? Let’s check out that case.

## Disclaimer

Before we go further, I’d like to clarify some things.

In this test I use only basic facilities of a given language so specialized libraries are not present here. Moreover, in real life you should consider using appropriate tools when dealing with gigabytes of data. Good examples are specialized libraries like Numpy or Panda, memory cached systems (Memcached, Redis), NoSQL databases, embedded databases (Berkeley DB, LevelDB), big data processing tools like Apache Spark, Hadoop framework to name a few.

This is neither a scientific experiment nor a proof of superiority one language over the others. This is only a brief description and short test of built-in sorting procedures.

The tests were executed on a laptop with i7 processor and 64 GB RAM. The source code is available here.

## Data (generator)

In order to generate a representative set of data I prepared a special procedure. I could have used randomly generated data but I wanted to have the same data across all tests.

The procedure is just a simple linear congruential generator.

x_1 = seed x_n+1 = (a x_n + c) mod m

with the following factors:

a = 48271 c = 0 m = 2147483647 seed = 1

Using this procedure I’ll generate 1 000 000 000 records (or less if that number cannot be handled).

Each record is a structure containing two 64-bit unsigned integers. The first field acts as a simple identifier, the second field is a value to be sorted. In C and C++ the record looks as follows:

struct { uint64_t id; uint64_t value; };

Although this is only an example, you can imagine the structure as being a simple index record, pointing to some real data. However, in real life you’d rather use a specialized tool or library instead of reinventing the wheel.

## C++

C++ comes with two sorting procedures; the first one called `std::sort`

is guaranteed to have *O(N log(N))* complexity. This constraint is valid since C++11 (before C++11, it was *O(N log(N))* on average). Engineers who implement the Standard Library are free to take any sorting algorithm provided it conforms the requirements. In case of libstdc++ (used by e.g. gcc) it is the introsort, that is, kind of quick-sort with limited branching (by switching to the heap sort) to avoid the worst case scenario which is *O(N²)*.

The second sorting procedure is `std::stable_sort`

. As the name suggests, sorting is stable, that is, equal elements preserve its relative order. The complexity is *O(N log²(N))* but it can be reduced to *O(N log(N))* if additional memory is available.

So the title of this paragraph might be misleading as we are going to test sorting procedures implemented in libstdc++ and compiled by gcc v5.4.1, in this particular case.

Let’s check the results of sorting 1 000 000 000 (one billion) records.

Sorting procedure |
Memory |
Creation |
Sorting |
Checking result |
Remarks |

std::sort | 15 GB | 5 seconds | 112 seconds | 1 second | -O2/-O3 |

std::sort | 15 GB | 21 seconds | 614 seconds | 13 second | optimizations turned off |

std::stable_sort | 22 GB | 5 seconds | 140 seconds | 1 second | -O2/-O3 |

std::stable_sort | 22 GB | 21 seconds | 689 seconds | 12 second | optimizations turned off |

Also, I checked which comparator is faster, lambda or struct based – there is no difference. Similar story with optimization level O2 vs O3 – no difference.

## C

The standard C library offers the `qsort`

procedure. The implementation is not defined (like in C++). Moreover, there are no complexity requirements and `qsort`

doesn’t have to be stable. Theoretically, it could be even the bubble sort (*O(n²)*) but fortunately this is very unlikely.

The implementation can be tricky, for example, there might be leveraged more than one algorithm depending on collection size and whether „additional” memory is available.

In this particular execution (linux, glibc, gcc v5.4.1), the merge sort variant was chosen by the runtime taking additional memory (about +7GB). If that additional memory was not available, quick-sort would be selected.

Sorting 1 000 000 000 records…

Sorting procedure |
Memory |
Creation |
Sorting |
Checking result |
Remarks |

qsort | 22 GB | 4 seconds | 277 seconds | < 1 second | -O2/-O3 |

qsort | 22 GB | 12 seconds | 355 seconds | 2 seconds | optimizations turned off |

Like in C++, optimization levels O2 and O3 don’t differ too much.

C++ seems to be a bit faster, probably due to the fact, that the comparison function was inlined, and possibly, different algorithms and implementations also matters.

## Python

As Python is not intended to be a high-performance language, I had to limit data set to 100 000 000 (a tenth of the initial number).

Python’s built-in sort is implemented in C and the algorithm is called timsort. This is a kind of the merge sort supported by the insertion sort. Requires additional memory; *O(n)* in the worst case. The complexity is *O(n log(n))* – the worst case and average case as well.

Sorting procedure |
Memory |
Creation |
Sorting |
Checking result |
Remarks |

sort | 43 GB | 129 seconds | 340 seconds | 50 seconds |

Pretty good numbers as for a scripting language. Python allows playing with reasonable big data sets.

## Java

Java (>=7) also uses the timsort algorithm, at least for sorting objects. For sorting primitives, heuristics are leveraged to select right algorithm.

The distinction in sorting between objects and primitive types has been introduced to take advantage of some properties allowing further optimizations. One property is that for primitive types, the sorting algorithm doesn’t have to be stable. Another observation is that in order to compare objects, the function `Object.compare()`

is called. It doesn’t have to be expensive, but there is a certain cost related anyway. Primitive types can be compared using faster and inline code.

Let’s sort 1 000 000 000 records…

Sorting procedure |
Memory |
Creation |
Sorting |
Checking result |
Remarks |

sort | 41 GB | 60 seconds | 788 seconds | 22 seconds | * |

The test can’t be completed if one uses only default VM parameters. The first thing is to set Xms and Xmx parameters to large enough values. In my test I used the concurrent collector as it gives relatively good results in this case.

*) -Xmx50g -Xms50g -XX:+UseConcMarkSweepGC -XX:NewSize=5g

## Conclusions

Be aware that some sorting algorithms might take additional memory (e.g. the merge sort and its derivatives).

Some sorting procedures take advantage of more than one sorting algorithm. They may select the best algorithm depending on the input (size, if almost sorted, if array of built-in types, etc).

C and C++. Always take into consideration compiler’s optimization options as it gives a significant speed boost. Not only when dealing with sorting procedures.

Java. Be aware the default VM settings won’t always be a good choice.

The more dynamic language, the bigger memory consumption. Usually.

When dealing with huge data sets, consider using specialized tools.

Regarding the performance – always measure, don’t make blind assumptions.