Sunday, May 15, 2022

[SOLVED] Why can't GCC optimize the logical / bitwise AND pair in "x && (x & 4242)" to "x & 4242"?

May 15, 2022 c++, compiler-optimization, gcc, optimization

Issue

Here are two functions which I claim do exactly the same thing:

bool fast(int x)
{
  return x & 4242;
}

bool slow(int x)
{
  return x && (x & 4242);
}

Logically they do the same thing, and just to be 100% sure I wrote a test that ran all four billion possible inputs through both of them, and they matched. (x & 4242 is only non-zero if it has set bits in specific positions, which means x has a non-zero value, so testing x!=0 separately as the other side of a logical && is redundant.) But the assembly code is a different story:

fast:
    andl    $4242, %edi
    setne   %al
    ret

slow:
    xorl    %eax, %eax
    testl   %edi, %edi
    je      .L3
    andl    $4242, %edi
    setne   %al
.L3:
    rep
    ret

I was surprised that GCC could not make the leap of logic to eliminate the redundant test. I tried g++ 4.4.3 and 4.7.2 with -O2, -O3, and -Os, all of which generated the same code. The platform is Linux x86_64.

Can someone explain why GCC shouldn't be smart enough to generate the same code in both cases?

Edit to add test harness:

#include <cstdlib>
#include <vector>
using namespace std;

int main(int argc, char* argv[])
{
    // make vector filled with numbers starting from argv[1]
    int seed = atoi(argv[1]);
    vector<int> v(100000);
    for (int j = 0; j < 100000; ++j)
        v[j] = j + seed;

    // count how many times the function returns true
    int result = 0;
    for (int j = 0; j < 100000; ++j)
        for (int i : v)
            result += slow(i); // or fast(i), try both

    return result;
}

I tested the above with clang 5.1 on Mac OS with -O3. It took 2.9 seconds using fast() and 3.8 seconds using slow(). If I instead use a vector of all zeros, there is no significant difference in performance between the two functions.

Other compilers:

mainline clang 3.7 and later do the optimization even for &&, clang 3.6 and earlier don't. https://godbolt.org/z/v5bjrvrP1
latest GCC trunk (march 2022) and 11.2 still don't.
Current MSVC does both parts with branches, not using setcc.
ICC makes asm like GCC's, LLVM-based ICX is like clang. https://godbolt.org/z/cjKfr8r5b

Solution

You are correct that this appears to be a deficiency, and possibly an outright bug, in the optimizer.

Consider:

bool slow(int x)
{
  return x && (x & 4242);
}

bool slow2(int x)
{
  return (x & 4242) && x;
}

Assembly emitted by GCC 4.8.1 (-O3):

slow:
    xorl    %eax, %eax
    testl   %edi, %edi
    je      .L2
    andl    $4242, %edi
    setne   %al
.L2:
    rep ret

slow2:
    andl    $4242, %edi
    setne   %al
    ret

In other words, slow2 is misnamed.

I have only contributed the occasional patch to GCC, so whether my point of view carries any weight is debatable :-). But it is certainly strange, in my view, for GCC to optimize one of these and not the other. I suggest filing a bug report.

[Update]

Surprisingly small changes appear to make a big difference. For example:

bool slow3(int x)
{
  int y = x & 4242;
  return y && x;
}

...generates "slow" code again. I have no hypothesis for this behavior.

You can experiment with all of these on multiple compilers here.

Answered By - Nemo

Answer Checked By - Pedro (WPSolving Volunteer)

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Sunday, May 15, 2022

[SOLVED] Why can't GCC optimize the logical / bitwise AND pair in "x && (x & 4242)" to "x & 4242"?

Issue

Solution

Popular Posts

Labels