Issue
I'm writing a chess engine in C, and speed is essential. The engine is based on unsigned long long (which I will denote as u64) and relies heavily on a least-significant-bit scan. Until now I have been using the GCC builtin __builtin_ctzll, which has done the job just fine. However, I generated assembly for this isolated function with gcc -S -O2, and it gave me the following:
xorl %eax, %eax
rep bsfq %rdi, %rax
cltq
ret
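The isolated function itself is essentially just a wrapper around the builtin, something like this (u64 being my typedef for unsigned long long; note the 64-bit return type):

typedef unsigned long long u64;

/* Wrapper compiled in isolation with gcc -S -O2; the 64-bit return type
   is what produces the cltq shown above. */
u64 ctzll(u64 x) {
    return __builtin_ctzll(x);
}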
After some investigation, however, it seems that the code
rep bsfq %rdi, %rax
ret
does exactly the same thing in my chess program, yet it is now ~20% slower, even though fewer instructions should be faster. The original __builtin_ctzll, though, is inlined in my C code. Is this the reason my custom assembly runs slower than the original? When I declare the function ctzll in C, I of course cannot mark it inline unless I have a definition (which exists only in assembly).
Is there another way to optimize the assembly, or should I try inlining my new assembly directly in C with asm?
Solution
Conclusion first: use __builtin_ctzll without converting the result to 64-bit. The following can be useful if you want to force the compiler to use tzcnt, or if you want to make your own built-in.
Since @user1937198 explained all the basics, here's some code that is reasonably fast and portable.
static inline int u64_bsf(u64 x) {
    u64 r = 0;
    __asm__ ("rep bsfq\t%1, %0" : "+r"(r) : "r"(x));
    return r;
}
/*
u64_bsf:
xorl %eax, %eax
rep bsfq %rdi, %rax
ret
*/
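For context, here's how such a scan is typically used in a bitboard engine: a pop-LSB loop that serializes the set bits. This is only an illustrative sketch (the helper name is made up, u64 is the question's unsigned long long typedef, and u64_bsf is the function defined above), not code from the asker's engine.

typedef unsigned long long u64;

/* Collect the indices of all set bits, least significant first,
   clearing the scanned bit on every iteration. The while (bb) guard
   matters: the scan result is only meaningful when at least one bit
   is set. Assumes u64_bsf as defined above. */
static inline int serialize_bitboard(u64 bb, int squares[64]) {
    int n = 0;
    while (bb) {
        squares[n++] = u64_bsf(bb);   /* index of the least significant set bit */
        bb &= bb - 1;                 /* clear that bit */
    }
    return n;
}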
You can change the return type to unsigned if you want. On most platforms, including x86, using int or unsigned produces the fastest code (except sometimes for array indices). On x86 specifically, using 64-bit integers causes software emulation in 32-bit mode, and larger code size (plus slower division) in 64-bit mode. Your use of a 64-bit return type also confused the compiler into emitting a redundant cltq (which is the badly made-up AT&T-syntax name for cdqe).
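As a quick illustration (hypothetical function names; the assembly in the comments reflects typical gcc -O2 output on x86-64 without -mbmi), a 64-bit return type forces an extra sign extension of the 32-bit result of __builtin_ctzll:

typedef unsigned long long u64;

/* int return type: no conversion needed after the scan. */
int ctz_int(u64 x) {
    return __builtin_ctzll(x);
    /* xorl %eax, %eax ; rep bsfq %rdi, %rax ; ret */
}

/* 64-bit return type: the int result of __builtin_ctzll is sign-extended
   to 64 bits, which is where the redundant cltq comes from. */
u64 ctz_u64(u64 x) {
    return __builtin_ctzll(x);
    /* xorl %eax, %eax ; rep bsfq %rdi, %rax ; cltq ; ret */
}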
rep bsf is decoded to tzcnt on machines that support it, and the rep prefix is discarded on machines that don't. tzcnt is better in that it (usually) doesn't take the output register as input (see @user1937198's answer), and it runs a lot faster on AMD CPUs.
If you are targeting machines with tzcnt, write:
static inline int u64_tzcnt(u64 x) {
    u64 r;
    __asm__ ("tzcntq\t%1, %0" : "=r"(r) : "r"(x));
    return r;
}
/*
u64_tzcnt:
tzcntq %rdi, %rax
ret
*/
Read GCC's online documentation for its inline assembly syntax (https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html).
@zwol's answer mentioned constant folding. Here's the final code that can handle constant propagation and is inlined for a non-constant input.
static inline int u64_bsf(u64 x) {
    if (__builtin_constant_p(x)) {
        return __builtin_ctzll(x);
    }
    u64 r = 0;
    __asm__ ("rep bsfq\t%1, %0" : "+r"(r) : "r"(x));
    return r;
}
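As a quick usage sketch (the bitboard value is just an arbitrary example, and the definition above is repeated so the snippet compiles on its own): a runtime argument goes through the rep bsf path, while a literal argument can fold away at compile time.

#include <stdio.h>

typedef unsigned long long u64;

static inline int u64_bsf(u64 x) {
    if (__builtin_constant_p(x)) {
        return __builtin_ctzll(x);
    }
    u64 r = 0;
    __asm__ ("rep bsfq\t%1, %0" : "+r"(r) : "r"(x));
    return r;
}

int main(void) {
    u64 bb = 0x0000000000A80000ULL;   /* example bitboard: bits 19, 21, 23 set */
    printf("%d\n", u64_bsf(bb));      /* prints 19 */
    /* With optimization enabled, a literal argument such as u64_bsf(0x100ULL)
       can be folded to 8 at compile time via the __builtin_ctzll branch. */
    return 0;
}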
At this point, I've basically reimplemented __builtin_ctzll, but you can always make your own intrinsic function this way.
Answered By - xiver77