[c++] Performance of qsort vs std::sort?


Answers

I am surprised that no one mentions caches.

In your code, you start by touching ary and *ary_copy* so they are resident in the cache at the time of qsort. During qsort, *ary_copy* might get evicted. At the time of std::sort, the elements would have to be fetched from memory or a larger (read slower) cache level. This will of course depend on your cache sizes.

Try to reverse the test, i.e., start by running std::sort.

As some people have pointed out; making the array larger will make the test more fair. The reason is that a large array is less likely to fit in cache.

Question

According Scott Meyers, in his Effective STL book - item 46. He claimed that std::sort is about 670% faster than std::qsort due to the fact of inline. I tested myself, and I saw that qsort is faster :( ! Could anyone help me to explain this strange behavior?

#include <iostream>
#include <vector>
#include <algorithm>

#include <cstdlib>
#include <ctime>
#include <cstdio>

const size_t LARGE_SIZE = 100000;

struct rnd {
    int operator()() {
        return rand() % LARGE_SIZE;
    }
};

int comp( const void* a, const void* b ) {
    return ( *( int* )a - *( int* )b );
}

int main() {
    int ary[LARGE_SIZE];
    int ary_copy[LARGE_SIZE];
    // generate random data
    std::generate( ary, ary + LARGE_SIZE, rnd() );
    std::copy( ary, ary + LARGE_SIZE, ary_copy );
    // get time
    std::time_t start = std::clock();
    // perform quick sort C using function pointer
    std::qsort( ary, LARGE_SIZE, sizeof( int ), comp );
    std::cout << "C quick-sort time elapsed: " << static_cast<double>( clock() - start ) / CLOCKS_PER_SEC << "\n";
    // get time again
    start = std::clock();
    // perform quick sort C++ using function object
    std::sort( ary_copy, ary_copy + LARGE_SIZE );
    std::cout << "C++ quick-sort time elapsed: " << static_cast<double>( clock() - start ) / CLOCKS_PER_SEC << "\n";
}

This is my result:

C quick-sort time elapsed: 0.061
C++ quick-sort time elapsed: 0.086
Press any key to continue . . .

Update

Effective STL 3rd Edition ( 2001 )
Chapter 7 Programming with STL
Item 46: Consider function objects instead of functions as algorithm parameters.

Best regards,




Writing accurate benchmarks is difficult, so let's get Nonius to do it for us! Let's test qsort, std::sort with no inlining, and std::sort with inlining (the default) on a vector of a million random integers.

// sort.cpp
#define NONIUS_RUNNER
#include <nonius.h++>
#include <random>
#include <algorithm>

// qsort
int comp(const void* a, const void* b) {
    const int arg1 = *static_cast<const int*>(a);
    const int arg2 = *static_cast<const int*>(b);

    // we can't simply return a - b, because that might under/overflow
    return (arg1 > arg2) - (arg1 < arg2);
}

// std::sort with no inlining
struct compare_noinline {
    __attribute__((noinline)) bool operator()(const int a, const int b) {
        return a < b;
    }
};

// std::sort with inlining
struct compare {
    // the compiler will automatically inline this
    bool operator()(const int a, const int b) {
        return a < b;
    }
};

std::vector<int> gen_random_vector(const size_t size) {

    std::random_device seed;
    std::default_random_engine engine{seed()};
    std::uniform_int_distribution<int> dist{std::numeric_limits<int>::min(), std::numeric_limits<int>::max()};

    std::vector<int> vec;
    for (size_t i = 0; i < size; i += 1) {
        const int rand_int = dist(engine);
        vec.push_back(rand_int);
    }

    return vec;
}

// generate a vector of a million random integers
constexpr size_t size = 1'000'000;
static const std::vector<int> rand_vec = gen_random_vector(size);

NONIUS_BENCHMARK("qsort", [](nonius::chronometer meter) {

    // Nonius does multiple runs of the benchmark, and each one needs a new
    // copy of the original vector, otherwise we'd just be sorting the same
    // one over and over
    const size_t runs = static_cast<size_t>(meter.runs());
    std::vector<std::vector<int>> vectors{runs};
    std::fill(vectors.begin(), vectors.end(), rand_vec);

    meter.measure([&](const size_t run) {

        std::vector<int>& current_vec = vectors[run];

        std::qsort(current_vec.data(), current_vec.size(), sizeof(int), comp);

        return current_vec;
    });
});

NONIUS_BENCHMARK("std::sort noinline", [](nonius::chronometer meter) {

    const size_t runs = static_cast<size_t>(meter.runs());
    std::vector<std::vector<int>> vectors{runs};
    std::fill(vectors.begin(), vectors.end(), rand_vec);

    meter.measure([&](const size_t run) {

        std::vector<int>& current_vec = vectors[run];

        std::sort(current_vec.begin(), current_vec.end(), compare_noinline{});

        return current_vec;

    });
});

NONIUS_BENCHMARK("std::sort inline", [](nonius::chronometer meter) {

    const size_t runs = static_cast<size_t>(meter.runs());
    std::vector<std::vector<int>> vectors{runs};
    std::fill(vectors.begin(), vectors.end(), rand_vec);

    meter.measure([&](const size_t run) {

        std::vector<int>& current_vec = vectors[run];

        std::sort(current_vec.begin(), current_vec.end(), compare{});

        return current_vec;

    });
});

Compiling with Apple Clang 7.3.0,

$ clang++ -std=c++14 -stdlib=libc++ -O3 -march=native sort.cpp -o sort
$ ./sort

and running it on my 1.7 GHz i5 Macbook Air, we get

qsort                211 ms +/- 6 ms
std::sort noinline   127 ms +/- 5 ms
std::sort inline      87 ms +/- 4 ms

So std::sort with no inlining is about 1.7x faster than qsort (perhaps due to different sorting algorithms), and inlining bumps that up to about 2.4x faster. Certainly an impressive speedup, but much less than 670%.




Another reason that qsort may perform much better than expected is that newer compilers can inline and optimize through the function pointer.

If the C header defines an inline implementation of qsort instead of implementing it inside of a library and the compiler supports indirect function inlining, then qsort can be just as fast as std::sort.




Links