For the kind of micro-optimisation that you are trying to do, nothing will beat real-life usage patterns, with your algorithm embedded in a real program, consuming and producing real data. There are tricks that you can use to get synthetic benchmarks that produce results that somewhat resemble reality, but for the most part they're a waste of time. Profile your program with the algorithm under real conditions, and measure how much time your program spends in the algorithm.
Books and careers have been made on this subject, so I'll only mention the two most egregious errors in your methodology:
Compiler
First, a simple error. For benchmarks, you always want to turn all compiler optimisations on; otherwise you're testing something different from the real thing. The problem comes when you isolate an algorithm in a simple looping test like this one: if the algorithm does something really simple and has few side effects, the compiler may simplify parts of it away entirely!
There are tricks you can use to prevent the compiler from making your benchmark artificially fast, but rather than fight the compiler, it's best if you measure a real program and not a synthetic benchmark.
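To make that concrete, here is a minimal sketch of one such trick: an empty inline-asm "sink" in the spirit of Google Benchmark's DoNotOptimize (GCC/Clang syntax; do_not_optimize and sum_to are illustrative names, not anything from the question):

```cpp
#include <chrono>
#include <cstdint>
#include <iostream>

// Illustrative "sink": the empty asm statement tells GCC/Clang that `value`
// is used, so the computation feeding it cannot be discarded as dead code.
template <typename T>
inline void do_not_optimize(T const& value) {
    asm volatile("" : : "g"(value) : "memory");
}

// Stand-in for the algorithm under test.
std::uint64_t sum_to(std::uint64_t n) {
    std::uint64_t s = 0;
    for (std::uint64_t i = 1; i <= n; ++i) s += i;
    return s;
}

int main(int argc, char**) {
    // Derive the input from argc so the optimizer cannot fold it to a constant.
    const std::uint64_t n = 1000 + static_cast<std::uint64_t>(argc - 1);

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000000; ++i) {
        // Without the sink, -O2 is free to delete the call (and the whole loop).
        // Note: a clever compiler may still hoist the loop-invariant call out of
        // the loop - which is exactly why fighting the optimizer is a losing game.
        do_not_optimize(sum_to(n));
    }
    double s = std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - start).count();
    std::cout << "elapsed: " << s << " s\n";
}
```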
Cache
How fast an algorithm runs on "modern" CPUs very much depends on the state of the cache. The CPU has a hierarchy of memory caches, each faster than the next but with smaller capacity. If it has to perform many different operations on different data sets, it has to constantly load and unload data to and from those caches. On the other hand, if it's running a single algorithm on a small dataset, chances are most of the time it doesn't have to do any loads/stores beyond the initial one. So when you run an algorithm one million times in a loop, you'll get performance that's much faster than if you run the algorithm once, and multiply the result by one million.
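A rough sketch of that effect (the array sizes are illustrative and the actual boundaries depend on your CPU's cache hierarchy): summing a small array a hundred thousand times in a row reports a much lower per-element cost than a single pass over an array too big for any cache.

```cpp
#include <chrono>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

// Average time, in nanoseconds, spent per element while summing `data`
// `repeats` times in a row.
double ns_per_element(const std::vector<std::uint64_t>& data, int repeats) {
    std::uint64_t total = 0;
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        total += std::accumulate(data.begin(), data.end(), std::uint64_t{0});
    double ns = std::chrono::duration<double, std::nano>(
                    std::chrono::steady_clock::now() - start).count();
    volatile std::uint64_t sink = total;   // keep the summing from being optimized out
    (void)sink;
    return ns / (double(repeats) * double(data.size()));
}

int main() {
    std::vector<std::uint64_t> small(4 * 1024, 1);          // 32 KB: fits in L1 on most CPUs
    std::vector<std::uint64_t> large(32 * 1024 * 1024, 1);  // 256 MB: far bigger than any cache

    // Tight loop over a small working set: after the first pass it lives in L1.
    std::cout << "small array, 100000 passes: "
              << ns_per_element(small, 100000) << " ns/element\n";
    // Single streaming pass: every cache line has to come from main memory.
    std::cout << "large array, 1 pass:        "
              << ns_per_element(large, 1) << " ns/element\n";
}
```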
Furthermore, performance is sensitive to the size of the dataset: if it fits entirely into L1 cache it'll be super fast; if it only fits into L2 it'll be slower, slower still if it spills into L3, and so on. Take a sorting algorithm, for example: if 100 elements fit in L1 but 200 only fit in L2, then in the first case the CPU only ever touches L1, whereas in the second case it also has to use L2 occasionally. So sorting two 100-element collections will be faster than sorting a single 200-element collection. If you plot time against dataset size, you get a step-like function.

Some more relevant information: http://stackoverflow.com/q/8547778/2038264
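If you want to see those steps on your own machine, a rough sketch like the one below sweeps the working set from a few KB to tens of MB and normalises the sort time by n*log2(n) so the algorithmic growth is factored out. It takes a single, unwarmed measurement per size, so expect noise, and how sharp the plateaus look depends on the hardware:

```cpp
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

// Sorts n random 32-bit values and returns nanoseconds per element, divided by
// log2(n) so the O(n log n) growth is removed and cache effects stand out.
double sort_cost(std::size_t n, std::mt19937& rng) {
    std::vector<std::uint32_t> v(n);
    for (auto& x : v) x = rng();
    auto start = std::chrono::steady_clock::now();
    std::sort(v.begin(), v.end());
    double ns = std::chrono::duration<double, std::nano>(
                    std::chrono::steady_clock::now() - start).count();
    return ns / (double(n) * std::log2(double(n)));
}

int main() {
    std::mt19937 rng(42);
    // Working set sweeps from a few KB (deep inside L1) to ~64 MB (well past L3).
    for (std::size_t n = std::size_t(1) << 10; n <= (std::size_t(1) << 24); n *= 2)
        std::cout << n << " elements: " << sort_cost(n, rng) << " ns per n*log2(n)\n";
}
```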
If you don't control the size of your dataset and it happens to sit near one of those step boundaries, you can get wildly fluctuating results; and if your test dataset doesn't match real-life scenarios, you'll get performance that's wildly out of whack with reality.

How your algorithm uses data also matters a heck of a lot: CPUs guess what data they will need to load next, and they are very good at it when the pattern is predictable - for example, when data is read sequentially, the CPU can simply look ahead and prefetch it. This is also the reason why, for most use cases, std::vector vastly outperforms std::list:

http://programmers.stackexchange.com/q/186966/81527
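A quick sketch of that difference: traverse the same elements once stored contiguously in a std::vector and once as individually allocated std::list nodes. In this toy setup the list nodes are even allocated back to back, so in a long-running program with a fragmented heap the gap is usually wider still.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <list>
#include <numeric>
#include <vector>

// Time, in milliseconds, to sum every element of a container.
template <typename Container>
double sum_time_ms(const Container& c) {
    auto start = std::chrono::steady_clock::now();
    std::uint64_t total = std::accumulate(c.begin(), c.end(), std::uint64_t{0});
    double ms = std::chrono::duration<double, std::milli>(
                    std::chrono::steady_clock::now() - start).count();
    volatile std::uint64_t sink = total;  // keep the sum from being optimized away
    (void)sink;
    return ms;
}

int main() {
    const std::size_t n = 10000000;
    std::vector<std::uint32_t> vec(n, 1); // contiguous: the prefetcher can stream ahead
    std::list<std::uint32_t> lst(n, 1);   // one heap node per element, reached by pointer chasing

    std::cout << "vector traversal: " << sum_time_ms(vec) << " ms\n"
              << "list traversal:   " << sum_time_ms(lst) << " ms\n";
}
```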
So if you want to use a synthetic benchmark, you need to understand how your algorithm uses data and mimic that. For example, when I worked on algorithmic trading software, the typical usage pattern was to sit idle 99% of the time and run once off a signal, so I constructed a benchmark that explicitly cleared the CPU caches before every run. Failing to do so produced results that were too fast by a factor of 3. For more complex algorithms those conditions can be much harder to set up, which is why, if you can, you should always test with real data in real scenarios.