Issue
Suppose I have this function:
#include <stddef.h>

void test32(int* a, int* b, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        a[i] = a[i] + b[i];
    }
}
Clang and GCC both produce 256-bit SIMD instructions when this is compiled with -O3 -march=core-avx2 (godbolt).
Now suppose I have this function:
#include <immintrin.h>
#include <stddef.h>

void test128(__m128i* a, __m128i* b, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        a[i] = _mm_add_epi32(a[i], b[i]);
    }
}
With the same CFLAGS, Clang and GCC both refuse to widen this to 256-bit vectors (godbolt).
The naive, auto-vectorized code therefore processes twice as many elements per vector instruction as the manually vectorized SSE2 code. How does that make sense? Is there a way to instruct the compiler to widen 128-bit SIMD intrinsics to 256-bit operations when AVX2 is available?
Solution
Unfortunately no, I don't know of a compiler option to re-vectorize intrinsics (or GNU C native vectors) to a wider type. That's one reason not to manually vectorize in the first place for cases that easily auto-vectorize.
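For illustration, GNU C native vectors get taken just as literally as intrinsics; a minimal sketch (the typedef and function name are mine, not from the question):

#include <stddef.h>

/* GNU C native vector type: 4 x int32 in 16 bytes. Compilers take
 * this width just as literally as intrinsics and won't widen the
 * loop to 256-bit vectors. */
typedef int v4si __attribute__((vector_size(16)));

void test_native(v4si* a, v4si* b, size_t n) {
    for (size_t i = 0; i < n; ++i)
        a[i] += b[i];   /* element-wise add on each 128-bit vector */
}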
It's sometimes useful to be able to tell the compiler what vectorization strategy you want it to use, and that's what intrinsics are for.
If compilers rewrote them too aggressively, that would be bad in some cases. For example, in a cleanup loop after a loop with wider vectors, you might use 128-bit vectors on purpose to leave fewer scalar elements to handle at the end (see the sketch below). Or maybe you have 16-byte alignment but not 32, and you care about Sandybridge specifically (where misaligned 32-byte loads/stores are quite bad). Or you're on a Haswell server where 256-bit AVX can reduce max turbo (at least for FP math instructions), so you only want 256-bit vectors in functions that run during certain phases of your program.
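To make the first case concrete, here is a minimal sketch of a hand-written loop that uses 256-bit vectors for the bulk of the work but deliberately keeps a 128-bit cleanup pass (unaligned loads/stores and the function name are my assumptions, not from the question):

#include <immintrin.h>
#include <stddef.h>

/* 256-bit main loop with a deliberate 128-bit cleanup loop, so at
 * most 3 scalar elements remain instead of up to 7. */
void add_arrays(int* a, int* b, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {          /* 8 ints per 256-bit vector */
        __m256i va = _mm256_loadu_si256((__m256i*)(a + i));
        __m256i vb = _mm256_loadu_si256((__m256i*)(b + i));
        _mm256_storeu_si256((__m256i*)(a + i), _mm256_add_epi32(va, vb));
    }
    for (; i + 4 <= n; i += 4) {          /* 128-bit cleanup: 4 ints */
        __m128i va = _mm_loadu_si128((__m128i*)(a + i));
        __m128i vb = _mm_loadu_si128((__m128i*)(b + i));
        _mm_storeu_si128((__m128i*)(a + i), _mm_add_epi32(va, vb));
    }
    for (; i < n; ++i)                     /* at most 3 scalar leftovers */
        a[i] += b[i];
}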
Basically it's a tradeoff between how close to writing in asm it is, letting clever humans specify exactly what they want, vs. just giving a way to tell the compiler about the program logic in a form it can understand and optimize (like the + operator: that doesn't mean you'll get an asm add instruction).
MSVC and ICC tend to take intrinsics even more literally than GCC/clang, not doing constant propagation through them. GCC/clang's choice of how to treat intrinsics is sensible in a lot of ways.
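As a small illustration of that difference (a hypothetical function, reflecting GCC/Clang behavior as I understand it):

#include <immintrin.h>

/* GCC/Clang typically constant-propagate through intrinsics: this
 * compiles to returning the constant 3, with no vector add emitted.
 * MSVC/ICC are more likely to emit the literal paddd. */
int fold_demo(void) {
    __m128i v = _mm_add_epi32(_mm_set1_epi32(1), _mm_set1_epi32(2));
    return _mm_cvtsi128_si32(v);   /* extract low 32-bit element: 3 */
}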
If you have a trivially vectorizable problem like this (no loop-carried dependencies or shuffles), and you want your code to be compilable for future wider vector instructions, use OpenMP #pragma omp simd to tell the compiler you definitely want it vectorized. This kind of problem is what OpenMP is for, if you don't enable full optimization to get compilers to try to auto-vectorize every loop. (That kicks in at gcc -O3, or clang -O2; also at -O2 for GCC 12 and later.) OpenMP can also get the compiler to vectorize FP math in ways that it normally wouldn't be allowed to without -ffast-math, e.g. FP reductions like summing an array, as sketched below.
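A sketch of both points (compile with -fopenmp-simd or -fopenmp; the function names are illustrative):

#include <stddef.h>

/* #pragma omp simd asks for vectorization regardless of the
 * optimizer's usual cost model. */
void test_omp(int* a, int* b, size_t n) {
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        a[i] += b[i];
}

/* A reduction clause lets the compiler reassociate the FP sum,
 * vectorizing it without needing -ffast-math. */
float sum_omp(float* x, size_t n) {
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; ++i)
        sum += x[i];
    return sum;
}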
Answered By - Peter Cordes Answer Checked By - Senaida (WPSolving Volunteer)