Problem
Consider:
#include <time.h>
#include <unistd.h>
#include <iostream>
using namespace std;

const int times = 1000;
const int N = 100000;

void run() {
    for (int j = 0; j < N; j++) {
    }
}

int main() {
    clock_t main_start = clock();
    for (int i = 0; i < times; i++) {
        clock_t start = clock();
        run();
        cout << "cost: " << (clock() - start) / 1000.0 << " ms." << endl;
        //usleep(1000);
    }
    cout << "total cost: " << (clock() - main_start) / 1000.0 << " ms." << endl;
}
As written (with the usleep commented out), run costs roughly 0.4 ms per call for the first 26 iterations of the timing loop, then drops to about 0.2 ms.
With the usleep uncommented, the delay loop takes about 0.4 ms on every iteration and never speeds up. Why?
The delay loop isn’t optimised away because the code is compiled with g++ -O0 (no optimisation). It runs on an Intel(R) Core(TM) i3-3220 CPU @ 3.30 GHz, under kernel 3.13.0-32-generic on Ubuntu 14.04.1 LTS (Trusty Tahr).
Asked by phyxnj
Solution #1
After 26 repeats, Linux ramps the CPU up to the maximum clock speed because your process has used its full time slice a few times in a row.
If you measured with performance counters instead of wall-clock time, you’d find that the number of core clock cycles per delay loop stays constant, confirming that this is purely a DVFS effect (which all modern CPUs use to run at a more energy-efficient frequency and voltage most of the time).
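To see that directly, here is a minimal sketch (my own illustration, not part of the original answer) that counts core clock cycles around the delay loop with Linux’s perf_event_open. It assumes a Linux system where unprivileged perf access is allowed, and it uses a volatile counter so the loop survives optimisation; the per-call cycle count should stay roughly constant even while the wall-clock time drops.

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>
#include <iostream>

// glibc has no perf_event_open() wrapper, so call the raw syscall.
static long perf_event_open(perf_event_attr* attr, pid_t pid, int cpu,
                            int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main() {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;   // core clock cycles, not wall time
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);  // this process, any CPU
    if (fd < 0) { std::cerr << "perf_event_open failed\n"; return 1; }

    for (int i = 0; i < 1000; i++) {
        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        for (volatile int j = 0; j < 100000; j++) {}   // the delay loop
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        std::uint64_t cycles = 0;
        read(fd, &cycles, sizeof(cycles));
        std::cout << "cycles: " << cycles << '\n';
    }
}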
Ramp-up would be substantially faster on a Skylake with kernel support for the new power-management mode (where the hardware takes complete control of the clock speed).
On an Intel CPU with Turbo, if you leave it running for a while, the time per iteration will likely increase again once thermal limits force the clock speed back down to the maximum sustained frequency. (See Why can’t my CPU maintain peak performance in HPC for more about Turbo.) Turbo lets the CPU run faster than it can sustain under high-power workloads.
Adding a usleep prevents Linux’s CPU frequency governor from ramping up the clock speed, because the process isn’t generating 100 percent load even at the minimum frequency. (That is, the kernel’s heuristic decides that the CPU is already fast enough for the workload it is handling.)
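One way to watch this happen is to print the current frequency while the loop generates load. The sketch below is my own illustration, assuming a Linux system that exposes the usual cpufreq sysfs file for cpu0 (pinning the process with taskset would make the reading more reliable); with the usleep uncommented, the reported frequency should stay near the minimum.

#include <unistd.h>
#include <fstream>
#include <iostream>
#include <string>

int main() {
    for (int i = 0; i < 50; i++) {
        for (volatile int j = 0; j < 100000; j++) {
            // burn some CPU so the governor sees load
        }
        // Current frequency of cpu0 in kHz, as reported by the cpufreq driver.
        std::ifstream f("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");
        std::string khz;
        if (f >> khz)
            std::cout << "iteration " << i << ": cpu0 at " << khz << " kHz\n";
        //usleep(1000);   // uncomment to keep the governor at a low frequency
    }
}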
Comments on other theories:
Regarding David’s hypothesis that a potential context switch from usleep might pollute the caches: that’s not a bad idea in general, but it doesn’t help explain this code.
Cache / TLB pollution is unimportant for this experiment. Apart from the end of the stack, essentially nothing touches memory inside the timing window. Most of the time is spent in a tiny loop (one line of instruction cache) that touches only one int of stack memory. Any potential cache pollution during usleep is a tiny fraction of the total time for this code (real code will be different)!
For x86, here’s more information:
The call to clock() itself might miss in cache, but a code-fetch cache miss delays the starting-time measurement rather than being part of what is measured. The second call to clock() will almost never be delayed, because it should still be hot in cache.
The run function may be in a different cache line from main (gcc marks main as “cold”, so it gets optimised less and is placed with other cold functions/data), so one or two instruction-cache misses are likely. They are still in the same 4k page, though, so main will have triggered the potential TLB miss before entering the timed region of the program.
gcc -O0 compiles the OP’s code to something like the following (Godbolt compiler explorer), keeping the loop counter in memory on the stack.
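A sketch of roughly what the inner loop looks like at -O0 (labels, stack offsets, and the exact compare constant will differ between gcc versions):

        movl    $0, -4(%rbp)        # int j = 0, kept in a stack slot
        jmp     .L3
.L4:
        addl    $1, -4(%rbp)        # j++ : read-modify-write of the stack slot
.L3:
        cmpl    $99999, -4(%rbp)    # j < 100000 ?
        jle     .L4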
Because the empty loop keeps its counter in stack memory, it runs at about one iteration per 6 cycles on the OP’s IvyBridge CPU (as on typical Intel x86 CPUs), thanks to the store-forwarding latency that is part of an add with a memory destination (read-modify-write). 100k iterations * 6 cycles/iteration is 600k cycles, which dominates the contribution of at most a couple of cache misses (~200 cycles each for code-fetch misses, which prevent further instructions from issuing until they’re resolved).
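As a rough sanity check (assuming the i3-3220’s minimum frequency is around 1.6 GHz, which is typical for Ivy Bridge desktop parts), 600k cycles at the two frequencies lines up with the observed timings:

    600,000 cycles / 1.6 GHz ≈ 0.37 ms   (at the minimum frequency)
    600,000 cycles / 3.3 GHz ≈ 0.18 ms   (at the rated maximum)

These figures match the observed ~0.4 ms and ~0.2 ms per call.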
Out-of-order execution and store-forwarding should mostly hide the potential cache miss on accessing the stack (as part of the call instruction).
100k cycles is a lot, even if the loop-counter is held in a register.
Answered by Peter Cordes
Solution #2
Calling usleep may or may not result in a context switch; if it does, the call will take longer than if it doesn’t.
Answered by David Schwartz
Post is based on https://stackoverflow.com/questions/38299023/why-does-this-delay-loop-start-to-run-faster-after-several-iterations-with-no-sl