Issue
I have a minimally reproducible sample which is as follows -
#include <iostream>
#include <chrono>
#include <cstdlib>      // aligned_alloc
#include <immintrin.h>
#include <vector>
#include <numeric>

template<typename type>
void AddMatrixOpenMP(type* matA, type* matB, type* result, size_t size){
    // Element-wise addition of two size*size matrices
    for(size_t i = 0; i < size * size; i++){
        result[i] = matA[i] + matB[i];
    }
}

int main(){
    size_t size = 8192;
    //std::cout<<sizeof(double) * 8<<std::endl;
    auto matA   = (float*) aligned_alloc(sizeof(float), size * size * sizeof(float));
    auto matB   = (float*) aligned_alloc(sizeof(float), size * size * sizeof(float));
    auto result = (float*) aligned_alloc(sizeof(float), size * size * sizeof(float));

    for(int i = 0; i < size * size; i++){
        *(matA + i) = i;
        *(matB + i) = i;
    }

    auto start = std::chrono::high_resolution_clock::now();
    for(int j = 0; j < 500; j++){           // timed repeat loop
        AddMatrixOpenMP<float>(matA, matB, result, size);
    }
    auto end = std::chrono::high_resolution_clock::now();

    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
    std::cout << "Average Time is = " << duration / 500 << std::endl;
    std::cout << *(result + 100) << " " << *(result + 1343) << std::endl;
}
I experiment as follows: I time the code with a #pragma omp for simd directive on the loop in the AddMatrixOpenMP function, and then time it without the directive. I compile the code as follows:
g++ -O3 -fopenmp example.cpp
Upon inspecting the assembly, both variants generate vector instructions, but when the OpenMP pragma is explicitly specified, the code runs about 3 times slower. I am not able to understand why.
Edit - I am running GCC 9.3 and OpenMP 4.5. This is on an i7 9750h (6C/12T) running Ubuntu 20.04. I ensured no major processes were running in the background. The CPU frequency held more or less constant during the run for both versions (minor variations from 4.0 to 4.1 GHz).
TIA
Solution
The non-OpenMP vectorizer is defeating your benchmark with loop inversion.
Make your function __attribute__((noinline, noclone)) to stop GCC from inlining it into the repeat loop. For cases like this, where the function is large enough that call/ret overhead is minor and constant propagation isn't important, this is a pretty good way to make sure the compiler doesn't hoist work out of the repeat loop.
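A minimal sketch of what that looks like, applied to the function from the question (the attribute syntax is GCC/Clang-specific):

template<typename type>
__attribute__((noinline, noclone))   // keep the call inside the repeat loop
void AddMatrixOpenMP(type* matA, type* matB, type* result, size_t size){
    for(size_t i = 0; i < size * size; i++){
        result[i] = matA[i] + matB[i];
    }
}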
And in future, check the asm, and/or make sure the benchmark time scales linearly with the iteration count: e.g. increasing 500 up to 1000 should give the same average time in a benchmark that's working properly, but it won't with -O3. (Although it's surprisingly close here, so that smell test doesn't definitively detect the problem!)
After adding the missing #pragma omp simd to the code, yeah I can reproduce this. On i7-6700k Skylake (3.9 GHz with DDR4-2666) with GCC 10.2 -O3 (without -march=native or -fopenmp), I get an average of 18266 us, but with -O3 -fopenmp I get an average of 39772 us.
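For reference, this is where the directive goes in the function from the question (a sketch: the measurements above added #pragma omp simd to the inner loop, while the question mentions #pragma omp for simd; either way it needs -fopenmp, or -fopenmp-simd to enable only the SIMD directives):

template<typename type>
void AddMatrixOpenMP(type* matA, type* matB, type* result, size_t size){
    #pragma omp simd    // ask for OpenMP-directed vectorization of this loop
    for(size_t i = 0; i < size * size; i++){
        result[i] = matA[i] + matB[i];
    }
}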
With the OpenMP vectorized version, if I look at top while it runs, memory usage (RSS) is steady at 771 MiB. (As expected: init code faults in the two inputs, and the first iteration of the timed region writes to result, triggering page faults for it, too.)

But with the "normal" vectorizer (not OpenMP), I see the memory usage climb from ~500 MiB until it exits just as it reaches the max 770 MiB.
So it looks like gcc -O3 performed some kind of loop inversion after inlining and defeated the memory-bandwidth-intensive aspect of your benchmark loop, only touching each array element once.
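In source terms, the net effect is roughly this (my reconstruction of the transformed loop nest from the asm below, not anything GCC actually outputs):

// Effective shape of the code after inlining + loop inversion (sketch)
for(size_t i = 0; i < size * size; i++){    // touch each element once
    float sum = matA[i] + matB[i];          // one scalar add per element
    for(int j = 0; j < 500; j++){           // repeat loop is now empty;
        // GCC keeps it as a delay loop instead of deleting it
    }
    result[i] = sum;
}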
The asm shows the evidence: GCC 9.3 -O3 on Godbolt doesn't vectorize, and it leaves an empty inner loop instead of repeating the work.
.L4:                                        # outer loop
    movss   xmm0, DWORD PTR [rbx+rdx*4]
    addss   xmm0, DWORD PTR [r13+0+rdx*4]   # one scalar operation
    mov     eax, 500
.L3:                                        # do {
    sub     eax, 1                          #   empty inner loop after inversion
    jne     .L3                             # } while(--i);
    add     rdx, 1
    movss   DWORD PTR [rcx], xmm0
    add     rcx, 4
    cmp     rdx, 67108864
    jne     .L4
This is only 2 or 3x faster than fully doing the work, probably because it's not vectorized and is effectively running a delay loop instead of optimizing away the empty inner loop entirely, and because modern desktops have very good single-threaded memory bandwidth.
Bumping up the repeat count from 500 to 1000 only improved the computed "average" from 18266 to 17821 us per iter. An empty loop still takes 1 iteration per clock, and its trip count is the repeat count, so the total time still scales with the repeat count and the per-repeat "average" barely moves. Normally scaling linearly with the repeat count is a good litmus test for broken benchmarks, but this is close enough to be believable.
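A quick sanity check of that number (a back-of-the-envelope sketch; it assumes exactly 1 cycle per empty-loop iteration at ~3.9 GHz and ignores the scalar add and page faults):

#include <cstdio>

int main(){
    // 8192*8192 elements, each paying a 500-iteration empty delay loop:
    double elements = 8192.0 * 8192.0;            // 67108864
    double seconds  = elements * 500.0 / 3.9e9;   // ~8.6 s for the whole timed region
    double avg_us   = seconds / 500.0 * 1e6;      // ~17200 us per "repeat"
    std::printf("%.0f us\n", avg_us);             // same ballpark as the measured 18266
}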
There's also the overhead of page faults inside the timed region, but the whole thing runs for multiple seconds so that's minor.
The OpenMP vectorized version does respect your benchmark repeat-loop. (Or to put it another way, doesn't manage to find the huge optimization that's possible in this code.)
Looking at memory bandwidth while the benchmark is running:

Running intel_gpu_top -l while the proper benchmark (OpenMP, or with __attribute__((noinline, noclone))) is running shows the following. IMC is the Integrated Memory Controller on the CPU die, shared by the IA cores and the GPU via the ring bus; that's why a GPU-monitoring program is useful here.
$ intel_gpu_top -l
Freq MHz IRQ RC6 Power IMC MiB/s RCS/0 BCS/0 VCS/0 VECS/0
req act /s % W rd wr % se wa % se wa % se wa % se wa
0 0 0 97 0.00 20421 7482 0.00 0 0 0.00 0 0 0.00 0 0 0.00 0 0
3 4 14 99 0.02 19627 6505 0.47 0 0 0.00 0 0 0.00 0 0 0.00 0 0
7 7 20 98 0.02 19625 6516 0.67 0 0 0.00 0 0 0.00 0 0 0.00 0 0
11 10 22 98 0.03 19632 6516 0.65 0 0 0.00 0 0 0.00 0 0 0.00 0 0
3 4 13 99 0.02 19609 6505 0.46 0 0 0.00 0 0 0.00 0 0 0.00 0 0
Note the ~19.6 GB/s read / 6.5 GB/s write. Read ~= 3x write since it's not using NT stores for the output stream: regular stores have to read each destination cache line (RFO) before writing it, so the read traffic is the two input arrays plus the output, while the write traffic is just the output.
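For illustration, a version of the loop that uses NT stores for the output stream would avoid those RFO reads (a sketch with AVX intrinsics, not code from the question; it assumes result is 32-byte aligned, size*size is a multiple of 8, and compilation with -mavx or -march=native):

#include <immintrin.h>
#include <cstddef>

// Element-wise add with non-temporal (streaming) stores for the output.
void AddMatrixNT(const float* matA, const float* matB, float* result, size_t size){
    for(size_t i = 0; i < size * size; i += 8){
        __m256 a = _mm256_loadu_ps(matA + i);
        __m256 b = _mm256_loadu_ps(matB + i);
        _mm256_stream_ps(result + i, _mm256_add_ps(a, b));   // bypass the cache, no RFO
    }
    _mm_sfence();   // order the NT stores before any later stores
}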
But with -O3 defeating the benchmark, with a 1000 repeat count, we see only near-idle levels of main-memory bandwidth.
Freq MHz IRQ RC6 Power IMC MiB/s RCS/0 BCS/0 VCS/0 VECS/0
req act /s % W rd wr % se wa % se wa % se wa % se wa
...
8 8 17 99 0.03 365 85 0.62 0 0 0.00 0 0 0.00 0 0 0.00 0 0
9 9 17 99 0.02 349 90 0.62 0 0 0.00 0 0 0.00 0 0 0.00 0 0
4 4 5 100 0.01 303 63 0.25 0 0 0.00 0 0 0.00 0 0 0.00 0 0
7 7 15 100 0.02 345 69 0.43 0 0 0.00 0 0 0.00 0 0 0.00 0 0
10 10 21 99 0.03 350 74 0.64 0 0 0.00 0 0 0.00 0 0 0.00 0 0
vs. a baseline of 150 to 180 MB/s read, 35 to 50MB/s write when the benchmark isn't running at all. (I have some programs running that don't totally sleep even when I'm not touching the mouse / keyboard.)
Answered By - Peter Cordes