For the kind of micro-optimisation that you are trying to do, nothing will beat real-life usage patterns, with your algorithm embedded in a real program, consuming and producing real data. There are tricks that you can use to get synthetic benchmarks that produce results that somewhat resemble reality, but for the most part they're a waste of time. Profile your program with the algorithm under real conditions, and measure how much time your program spends in the algorithm.
Books and careers have been made on this subject, so I'll only mention the two most egregious errors in your methodology:
Compiler
First, a simple error. For benchmarks, you always want to turn all compiler optimisations on; otherwise you're testing something different from the real thing. The problem comes when you isolate an algorithm in a simple looping test like this one: if the algorithm does something really simple and has few side effects, the compiler may simplify parts of it away entirely!
There are tricks you can use to prevent the compiler from making your benchmark artificially fast, but rather than fight the compiler, it's best if you measure a real program and not a synthetic benchmark.
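To make that concrete, here is a minimal sketch of one such trick: an empty inline-asm "sink" in the spirit of Google Benchmark's DoNotOptimize (GCC/Clang syntax; do_not_optimize and sum_to are illustrative names, not anything from the question):

```cpp
#include <chrono>
#include <cstdint>
#include <iostream>

// Illustrative "sink": the empty asm statement tells GCC/Clang that `value`
// is used, so the computation feeding it cannot be discarded as dead code.
template <typename T>
inline void do_not_optimize(T const& value) {
    asm volatile("" : : "g"(value) : "memory");
}

// Stand-in for the algorithm under test.
std::uint64_t sum_to(std::uint64_t n) {
    std::uint64_t s = 0;
    for (std::uint64_t i = 1; i <= n; ++i) s += i;
    return s;
}

int main(int argc, char**) {
    // Derive the input from argc so the optimizer cannot fold it to a constant.
    const std::uint64_t n = 1000 + static_cast<std::uint64_t>(argc - 1);

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000000; ++i) {
        // Without the sink, -O2 is free to delete the call (and the whole loop).
        // Note: a clever compiler may still hoist the loop-invariant call out of
        // the loop - which is exactly why fighting the optimizer is a losing game.
        do_not_optimize(sum_to(n));
    }
    double s = std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - start).count();
    std::cout << "elapsed: " << s << " s\n";
}
```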
Cache
How fast an algorithm runs on "modern" CPUs very much depends on the state of the cache. The CPU has a hierarchy of memory caches, each faster than the next but with smaller capacity. If it has to perform many different operations on different data sets, it has to constantly load and unload data to and from those caches. On the other hand, if it's running a single algorithm on a small dataset, chances are most of the time it doesn't have to do any loads/stores beyond the initial one. So when you run an algorithm one million times in a loop, you'll get performance that's much faster than if you run the algorithm once, and multiply the result by one million.
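A rough sketch of that effect (the array sizes are illustrative and the actual boundaries depend on your CPU's cache hierarchy): summing a small array a hundred thousand times in a row reports a much lower per-element cost than a single pass over an array too big for any cache.

```cpp
#include <chrono>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

// Average time, in nanoseconds, spent per element while summing `data`
// `repeats` times in a row.
double ns_per_element(const std::vector<std::uint64_t>& data, int repeats) {
    std::uint64_t total = 0;
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        total += std::accumulate(data.begin(), data.end(), std::uint64_t{0});
    double ns = std::chrono::duration<double, std::nano>(
                    std::chrono::steady_clock::now() - start).count();
    volatile std::uint64_t sink = total;   // keep the summing from being optimized out
    (void)sink;
    return ns / (double(repeats) * double(data.size()));
}

int main() {
    std::vector<std::uint64_t> small(4 * 1024, 1);          // 32 KB: fits in L1 on most CPUs
    std::vector<std::uint64_t> large(32 * 1024 * 1024, 1);  // 256 MB: far bigger than any cache

    // Tight loop over a small working set: after the first pass it lives in L1.
    std::cout << "small array, 100000 passes: "
              << ns_per_element(small, 100000) << " ns/element\n";
    // Single streaming pass: every cache line has to come from main memory.
    std::cout << "large array, 1 pass:        "
              << ns_per_element(large, 1) << " ns/element\n";
}
```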
Furthermore, performance is sensitive to the size of the dataset: if it fits entirely into L1 cache it'll be super fast; if it only fits into L2 it'll be slower, slower still if it spills into L3, and so on. Take a sorting algorithm, for example: if 100 elements fit in L1 but 200 only fit in L2, then in the first case the CPU only ever touches L1, whereas in the second case it also has to use L2 occasionally. So sorting two 100-element collections will be faster than sorting a single 200-element collection. If you plot time against dataset size, you get a step-like function.

Some more relevant information: http://stackoverflow.com/q/8547778/2038264
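If you want to see those steps on your own machine, a rough sketch like the one below sweeps the working set from a few KB to tens of MB and normalises the sort time by n*log2(n) so the algorithmic growth is factored out. It takes a single, unwarmed measurement per size, so expect noise, and how sharp the plateaus look depends on the hardware:

```cpp
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

// Sorts n random 32-bit values and returns nanoseconds per element, divided by
// log2(n) so the O(n log n) growth is removed and cache effects stand out.
double sort_cost(std::size_t n, std::mt19937& rng) {
    std::vector<std::uint32_t> v(n);
    for (auto& x : v) x = rng();
    auto start = std::chrono::steady_clock::now();
    std::sort(v.begin(), v.end());
    double ns = std::chrono::duration<double, std::nano>(
                    std::chrono::steady_clock::now() - start).count();
    return ns / (double(n) * std::log2(double(n)));
}

int main() {
    std::mt19937 rng(42);
    // Working set sweeps from a few KB (deep inside L1) to ~64 MB (well past L3).
    for (std::size_t n = std::size_t(1) << 10; n <= (std::size_t(1) << 24); n *= 2)
        std::cout << n << " elements: " << sort_cost(n, rng) << " ns per n*log2(n)\n";
}
```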
If you don't control the size of your dataset and it happens to sit near one of those step boundaries, you can get wildly fluctuating results; and if your test dataset doesn't match real-life scenarios, you'll get performance that's wildly out of whack with reality.

How your algorithm uses data also matters a heck of a lot: CPUs guess what data they will need to load next, and they are very good at it when the pattern is predictable - for example, when data is read sequentially, the CPU can simply look ahead and prefetch it. This is also the reason why, for most use cases, std::vector vastly outperforms std::list:

http://programmers.stackexchange.com/q/186966/81527
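A quick sketch of that difference: traverse the same elements once stored contiguously in a std::vector and once as individually allocated std::list nodes. In this toy setup the list nodes are even allocated back to back, so in a long-running program with a fragmented heap the gap is usually wider still.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <list>
#include <numeric>
#include <vector>

// Time, in milliseconds, to sum every element of a container.
template <typename Container>
double sum_time_ms(const Container& c) {
    auto start = std::chrono::steady_clock::now();
    std::uint64_t total = std::accumulate(c.begin(), c.end(), std::uint64_t{0});
    double ms = std::chrono::duration<double, std::milli>(
                    std::chrono::steady_clock::now() - start).count();
    volatile std::uint64_t sink = total;  // keep the sum from being optimized away
    (void)sink;
    return ms;
}

int main() {
    const std::size_t n = 10000000;
    std::vector<std::uint32_t> vec(n, 1); // contiguous: the prefetcher can stream ahead
    std::list<std::uint32_t> lst(n, 1);   // one heap node per element, reached by pointer chasing

    std::cout << "vector traversal: " << sum_time_ms(vec) << " ms\n"
              << "list traversal:   " << sum_time_ms(lst) << " ms\n";
}
```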
So if you want to use a synthetic benchmark, you need to understand how your algorithm uses data and mimic that. For example, when I worked on algorithmic trading software, the typical usage pattern was to sit idle 99% of the time and run once off a signal, so I constructed a benchmark that explicitly cleared the CPU caches before every run. Failing to do so produced results that were too fast by a factor of 3. For more complex algorithms those conditions can be much harder to set up, which is why, if you can, you should always test with real data in real scenarios.