Issue
I built a small program (~1000 LOC) using GCC 11.1 and ran it for many iterations both with and without -march=native, but overall there was no difference in execution time (measured in milliseconds). Why? Is it because the program is single-threaded? Or is my stone-age hardware (1st gen i5, Westmere microarchitecture with no AVX) not capable enough?
A few lines from my Makefile:
CXX = g++
CXXFLAGS = -c -std=c++20 -Wall -Wextra -Wpedantic -Wconversion -Wshadow -O3 -march=native -flto
LDFLAGS = -O3 -flto
Here (Compiler Explorer) is a free function from the program for which GCC does not generate SSE instructions:
[[ nodiscard ]] size_t
tokenize_fast( const std::string_view inputStr, const std::span< std::string_view > foundTokens_OUT,
               const size_t expectedTokenCount ) noexcept
{
    size_t foundTokensCount { };
    if ( inputStr.empty( ) ) [[ unlikely ]]
    {
        return foundTokensCount = 0;
    }
    static constexpr std::string_view delimiter { " \t" };
    size_t start { inputStr.find_first_not_of( delimiter ) };
    size_t end { };
    for ( size_t idx { }; start != std::string_view::npos && foundTokensCount < expectedTokenCount; ++idx )
    {
        end = inputStr.find_first_of( delimiter, start );
        foundTokens_OUT[ idx ] = inputStr.substr( start, end - start );
        ++foundTokensCount;
        start = inputStr.find_first_not_of( delimiter, end );
    }
    if ( start != std::string_view::npos )
    {
        return std::numeric_limits<size_t>::max( );
    }
    return foundTokensCount;
}
I want to know why. Is it because such code cannot be vectorized?
Another thing worth mentioning: the size of the final executable did not change at all. I even tried -march=westmere and -march=alderlake to see if it made any difference in size, but GCC generated an executable of the same size.
Solution
I think you should specify -march=native as part of LDFLAGS as well, so that -flto targets the same machine.
But it seems your code-gen is respecting the arch you specified, since you say -march=alderlake made code that crashed with SIGILL, probably on an AVX encoding of a vector instruction.
It's quite possible that -mtune=generic makes the same tuning decisions as -march=native, and that nothing in your code benefits from anything beyond SSE2. Your CPU supports SSE4.2 and popcnt, but the baseline for x86-64 is already SSE2: the same vector width, just missing some instructions, especially for dword and qword element sizes (like packed min/max).
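To check which instruction-set extensions a given -march setting actually enables, one option is to look at the compiler's predefined macros. The sketch below assumes GCC or clang, which define macros such as __SSE2__ and __SSE4_2__ when the corresponding extension is allowed. The function name enabled_isa_extensions is made up for illustration and is not part of the program in the question.

```cpp
#include <cassert>
#include <string>

// True when compiling for x86-64, where SSE2 is part of the baseline.
#if defined(__x86_64__)
inline constexpr bool targetIsX86_64 = true;
#else
inline constexpr bool targetIsX86_64 = false;
#endif

// Hypothetical helper: lists the ISA extensions the compiler was allowed
// to use for this translation unit, based on GCC/clang predefined macros.
inline std::string enabled_isa_extensions()
{
    std::string out;
#if defined(__SSE2__)
    out += "sse2 ";
#endif
#if defined(__SSSE3__)
    out += "ssse3 ";
#endif
#if defined(__SSE4_1__)
    out += "sse4.1 ";
#endif
#if defined(__SSE4_2__)
    out += "sse4.2 ";
#endif
#if defined(__POPCNT__)
    out += "popcnt ";
#endif
#if defined(__AVX__)
    out += "avx ";
#endif
#if defined(__AVX2__)
    out += "avx2 ";
#endif
    return out;  // empty on targets that define none of these macros
}
```

Printing the result once from a build with -O3 and once from a build with -O3 -march=native shows what changed between the two. On a Westmere machine I would expect the second build to add entries up to sse4.2 and popcnt, but no avx. The same information is available without writing code via gcc -march=native -dM -E - < /dev/null, which dumps every predefined macro.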
GCC/clang can't auto-vectorize search loops (only loops whose trip count can be computed at run time before the first iteration), so inputStr.find_first_of either compiles to a one-byte-at-a-time search or calls memchr, which only benefits from SSE2 anyway but can dispatch dynamically based on CPU features since it lives in a shared library.
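To make the distinction concrete, here is a minimal sketch of the two loop shapes. Both function names are invented for illustration. The first loop always runs s.size() iterations, so the trip count is known before it starts. The second can return at any iteration depending on the data, which is the "search loop" shape that GCC 11 does not auto-vectorize.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Countable loop: always runs exactly s.size() iterations, so the compiler
// knows the trip count before the first iteration. A candidate for
// auto-vectorization at -O3.
inline std::size_t count_delimiters(std::string_view s) noexcept
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        n += (s[i] == ' ' || s[i] == '\t') ? 1u : 0u;
    }
    return n;
}

// Search loop: leaves at the first match, so the number of iterations
// depends on the data and is not known up front.
inline std::size_t first_delimiter(std::string_view s) noexcept
{
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        if (s[i] == ' ' || s[i] == '\t')
        {
            return i;
        }
    }
    return std::string_view::npos;
}
```

Compiling with GCC's -fopt-info-vec-optimized and -fopt-info-vec-missed options reports which loops were vectorized and which were not, so you can check this against your own compiler version.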
(Glibc overloads the dynamic linking process with a "resolver" function that decides which implementation of memchr is best on the current machine, either SSE2 or AVX2. Both versions are hand-written asm. A few functions like strstr have SSE4.2 versions that your CPU can take advantage of, but this choice doesn't depend on -march compile-time settings; it is made purely at run time by the dynamic linker + glibc.)
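The idea of choosing an implementation at run time can be sketched in user code with the GCC/clang builtin __builtin_cpu_supports. This is only an illustration of the principle, not how glibc implements its resolvers. The function pick_search_variant and the strings it returns are hypothetical.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical helper: reports which variant a run-time dispatcher could
// pick on the machine the program is currently running on. The result
// depends on the CPU, not on the -march setting used to compile this file.
inline std::string_view pick_search_variant() noexcept
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  // harmless if the detection code has already run
    if (__builtin_cpu_supports("avx2"))
    {
        return "avx2";
    }
    if (__builtin_cpu_supports("sse4.2"))
    {
        return "sse4.2";
    }
    return "baseline";
#else
    return "generic";  // assumption: no x86 feature detection available
#endif
}
```

On the Westmere CPU from the question I would expect this to report sse4.2 whether or not the program was built with -march=native, which is the same reason the library functions behave identically in both builds.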
If you want to see where your program is spending most of its time, use perf record ./a.out and then perf report -Mintel (the default is AT&T syntax disassembly; I prefer Intel).
If it's in library functions, different tuning options and new instructions available probably aren't helping your main code. If it's in your program proper, not libs, then apparently the baseline instruction-set for x86-64 and the "generic" tuning options are fine, or GCC doesn't know how to get any use out of SSSE3 / SSE4.x for your code.
I didn't look much at what your code is doing to see what manual vectorization might be possible.
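For a rough idea of what manual vectorization of the delimiter search could look like, here is a sketch using SSE2 intrinsics. It is an assumption about one possible approach, not something measured against the program in the question. The helper find_space_or_tab is hypothetical, it hard-codes the two delimiters ' ' and '\t', and it falls back to a scalar loop when SSE2 is not enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical helper: index of the first ' ' or '\t' at or after pos,
// or std::string_view::npos if there is none.
inline std::size_t find_space_or_tab(std::string_view s, std::size_t pos = 0) noexcept
{
    if (pos >= s.size())
    {
        return std::string_view::npos;
    }
    std::size_t i = pos;
#if defined(__SSE2__)
    const __m128i space = _mm_set1_epi8(' ');
    const __m128i tab = _mm_set1_epi8('\t');
    // Compare 16 bytes per iteration while a full 16-byte chunk remains.
    for (; i + 16 <= s.size(); i += 16)
    {
        const __m128i chunk =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(s.data() + i));
        const __m128i hit = _mm_or_si128(_mm_cmpeq_epi8(chunk, space),
                                         _mm_cmpeq_epi8(chunk, tab));
        const int mask = _mm_movemask_epi8(hit);  // one bit per matching byte
        if (mask != 0)
        {
            // The lowest set bit is the first matching byte in this chunk.
            return i + static_cast<std::size_t>(
                           __builtin_ctz(static_cast<unsigned>(mask)));
        }
    }
#endif
    // Scalar tail, and the whole search when SSE2 is unavailable.
    for (; i < s.size(); ++i)
    {
        if (s[i] == ' ' || s[i] == '\t')
        {
            return i;
        }
    }
    return std::string_view::npos;
}
```

Since this uses only SSE2, it would not need -march=native on x86-64 at all. Whether it beats the library call depends on token length: for short tokens the setup cost may outweigh the gain, so it is worth profiling before adopting anything like it.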
Answered By - Peter Cordes Answer Checked By - Mildred Charles (WPSolving Admin)