Issue
I have an Intel CPU with 4 HT cores (8 logical CPUs) and I built two simple processes.
The first one:
int main()
{
for(int i=0;i<1000000;++i)
for(int j=0;j<100000;++j);
}
The second one:
int main()
{
while(1);
}
Both are compiled with gcc
without special options. (I.e. with the default of -O0
: no optimization debug mode, keeping variables in memory instead of registers.)
When I run the first one on the first logical CPU (CPU0), and when the other logical CPUs have a load charge near 0%, the execution time of this first process is:
real 2m42,625s
user 2m42,485s
sys 0m0,070s
However, when I run the second process (the infinite loop) on CPU4 (CPU0 and CPU4 are on the same core but not on the same hardware thread), the execution time of the first process is
real 2m25,412s
user 2m25,291s
sys 0m0,047s
I expected a longer time since there are two processes on the same core, instead of only one. But it is actually faster. Why does this happen?
EDIT:
the P-states driver is intel_pstate. C-states are fixed by using processor.max_cstate=1 intel_idle.max_cstate=0
.
The frequency governor is set to performance (cpupower frequency-set -g performance
) and turbo is disabled (cat /sys/devices/system/cpu/intel_pstate/no_turbo
gives 1)
Solution
Both are compiled with gcc without special options. (I.e. with the default of -O0: no optimization debug mode, keeping variables in memory instead of registers.)
Unlike a normal program, the version with int i,j
loop counters bottlenecks completely on store-forwarding latency, not front-end throughput or back-end execution resources or any shared resource.
This is why you never want to do real benchmarking with -O0
debug-mode: the bottlenecks are different than with normal optimization (-O2
at least, preferably -O3 -march=native
).
On Intel Sandybridge-family (including @uneven_mark's Kaby Lake CPU), store-forwarding latency is lower if the reload doesn't try to run right away after the store, but instead runs a couple cycles later. Adding a redundant assignment speeds up code when compiled without optimization and also Loop with function call faster than an empty loop both demonstrate this effect in un-optimized compiler output.
Having another hyperthread competing for front-end bandwidth apparently makes this happen some of the time.
Or maybe the static partitioning of the store buffer speeds up store-forwarding? Might be interesting to try a minimally-invasive loop running on the other core, like this:
// compile this with optimization enabled
// and run it on the HT sibling of the debug-mode nested loop
#include <immintrin.h>
int main(void) {
while(1) {
_mm_pause(); _mm_pause();
_mm_pause(); _mm_pause();
}
}
pause
blocks for about 100 cycles on Skylake, up from about 5 on earlier CPUs.
So if the benefit to store-forwarding is in uops from the other thread having to issue/execute, this loop will do less of that and the run-time will be closer to when it has a physical core in single-thread mode.
But if the benefit is just from partitioning the ROB and store buffer (which could plausibly speed up the time for a load to probe it for stores), we'd still see the full benefit.
Update: @uneven_mark tested on Kaby Lake and found that this reduced the "speedup" to ~2%, down from ~8%. So apparently competing for front-end / back-end resources was an important part of the infinite loop in stopping the other loop from reloading too soon.
Perhaps using up BOB (branch-order-buffer) slots was the main mechanism in stopping the other thread's branch uops from issueing into the out-of-order back-end. Modern x86 CPUs snapshot the RAT and other backend state to allow fast recovery when they detect branch mispredicts, allowing rollback to the mispredicted branch without waiting for it to reach retirement.
This avoids waiting for independent work before the branch, and letting out-of-order execution of it continue while recovering. But it means fewer branches can be in flight. At least fewer conditional/indirect branches? IDK if a direct jmp
would use a BOB entry; its validity is established during decode. So maybe this guess doesn't hold water.
The while(1){}
loop has no local vars in the loop so it doesn't bottleneck on store-forwarding. It's just a top: jmp top
loop that can run at 1 cycle per iteration. That's a single-uop instruction on Intel.
i5-8250U is a Kaby Lake, Skylake-derived, so has its loop buffer (LSD) disabled by microcode. So it can't unroll itself in the LSD/IDQ (queue feeding the issue/rename stage) and has to fetch the jmp
uop separately from the uop cache every cycle. But the IDQ does buffer that, only needing an issue/rename cycle every 4 cycles to issue a group of 4 jmp uops for that logical core.
But anyway, on SKL / KBL these two threads together more than saturate uop cache fetch bandwidth and do compete with each other that way. On a CPU with the LSD (loopback buffer) enabled (e.g. Haswell / Broadwell, or Coffee Lake and later), they wouldn't. Sandybridge/Ivybridge don't unroll tiny loops to use more of their LSD so you'd have the same effect there. I'm not sure if that's significant. Testing on Haswell or Coffee Lake would be interesting.
(An unconditional jmp
always ends a uop-cache line, and it's not a trace cache anyway so one uop-cache fetch can't give you more than one jmp
uop.)
I have to correct my confirmation from above: I compiled all programs as C++ (g++), which gave the roughly 2% difference. If I compile everything as C, I get about 8%, which is closer to OPs roughly 10%.
That's interesting, gcc -O0
and g++ -O0
do compile the loops differently. This is a quirk of the GCC's C vs. C++ front-ends feeding GCC's back-end different GIMPLE/RTL, or something like that, and -O0
not making the back-end fix the inefficiency. This is not anything fundamental about C vs. C++ or that you could expect from other compilers.
The C version still transforms to an idiomatic do{}while()
style loop with a cmp/jle
at the bottom of the loop, right after a memory-destination add. (The left pane on this Godbolt compiler explorer link). Why are loops always compiled into "do...while" style (tail jump)?
But the C++ version uses an if(break)
style of looping with the condition at the top, then the memory-destination add. Funny that separating the memory-destination add
from the cmp
reload by only one jmp
instruction makes that big a difference.
# inner loop, gcc9.2 -O0. (Actually g++ -xc but same difference)
jmp .L3
.L4: # do {
add DWORD PTR [rbp-8], 1 # j++
.L3: # loop entry point for first iteration
cmp DWORD PTR [rbp-8], 99999
jle .L4 # }while(j<=99999)
Apparently the add/cmp back to back make this version suffer more from slower store-forwarding on Skylake / Kaby/Coffee Lake
vs. this one which isn't affected as much:
# inner loop, g++9.2 -O0
.L4: # do {
cmp DWORD PTR [rbp-8], 99999
jg .L3 # if(j>99999) break
add DWORD PTR [rbp-8], 1 # j++
jmp .L4 # while(1)
.L3:
cmp [mem], imm
/ jcc might still micro and/or macro-fuse, but I forget which. IDK if that's relevant, but if the loop is more uops it can't issue as fast. Still, with the execution bottleneck of 1 iteration per 5 or 6 cycles (memory-destination add
latency), the front-end is easily going to stay ahead of the back-end even if it has to compete with another hyperthread.
Answered By - Peter Cordes Answer Checked By - Willingham (WPSolving Volunteer)