Issue
My problem: I am trying to get GCC to vectorize a nested loop.
Compiler flags I added to the basic flags:
-fopenmp
-march=native
-msse2 -mfpmath=sse
-ffast-math
-funsafe-math-optimizations
-ftree-vectorize
-fopt-info-vec-missed
Variables:
// local variables
float RTP, RPE, RR=0;
int SampleLoc, EN;
float THr;
long TN;
// global variables
extern float PR[MaxX - MinX][MaxY - MinY][MaxZ - MinZ];
extern float SampleOffset;
extern float SampleInt;
// global defines
Speed
Loops start at line 500. The code is parallelized but not vectorized. Here is the code:
#pragma omp parallel for
for (TN = 0; TN < NumXmtrFoci; TN++) {
    for (XR = XlBound; XR <= XuBound; XR++) {
        for (YR = YlBound; YR <= YuBound; YR++) {
            for (ZR = ZlBound; ZR <= ZuBound; ZR++) {
                for (EN = 0; EN < NUM_RCVR_ELE; EN++) {
                    RPE = REP[XR - MinX][YR - MinY][ZR - MinZ][EN];
                    RTP = RPT[XR - MinX][YR - MinY][ZR - MinZ][TN];
                    RR = RPE + RTP + ZT[TN];
                    SampleLoc = (int)(floor(RR/(SampleInt*Speed) + SampleOffset));
                    THr = TimeHistory[SampleLoc][EN];
                    PR[XR-MinX][YR-MinY][ZR-MinZ] += THr;
                } /*for EN*/
            } /*for ZR*/
        } /*for YR*/
    } /*for XR*/
} /*for TN*/
The loop bounds are all #define and range from -64 to 128. Loop iterator variables are of type int. Inner loop variables are of type float.
A sampling of the GCC 'NOT VECTORIZED' messages relevant to this loop; some are repeated many times at many places in the code.
502|note: not vectorized: multiple nested loops.|
505|note: not vectorized: not suitable for gather load _61 = TimeHistory[_59][_85];|
500|note: not vectorized: no grouped stores in basic block.|
500|note: not vectorized: not enough data-refs in basic block.|
MY QUESTION IS: What do the GCC compiler messages tell me to focus on to get the loop to vectorize?
I do not understand the messages adequately, and searching online has not provided the answers so far. I thought multiple nested loops were not a problem. Also, what are gather loads, grouped stores, and data-refs?
Solution
The main point of the GCC report is that the expression TimeHistory[SampleLoc][EN] cannot be easily vectorized unless gather instructions are used. Indeed, SampleLoc is a variable that can contain non-linear values, so consecutive iterations read from unrelated memory locations. On x86, gather instructions are only available in the AVX2/AVX-512 instruction sets. Your compilation flags indicate that you use SSE2 and possibly a more advanced instruction set available on the target platform through -march=native (but the platform is not stated). Without AVX2/AVX-512, GCC cannot vectorize the loops because of this non-contiguous access pattern. In fact, AVX2 gather instructions are a bit limited compared to general reads, so the compiler may not be able to use them even when they are available. You can see the list of intrinsics/instructions here.
Additionally, the compiler needs to be sure that TimeHistory is not modified in the loop. This seems trivial here, but in practice arrays can theoretically alias, so the compiler may assume a possible dependence between each read and the following writes and refuse to vectorize the gather loads. Replicating the last loop may help. Using the restrict keyword and const can also help.
Answered By - Jérôme Richard Answer Checked By - Timothy Miller (WPSolving Admin)