Wednesday, August 31, 2022

[SOLVED] How to prevent gcc from reordering x86 frame pointer saving/setup instructions?

Issue

During my profiling with flamegraph, I found that callstacks are sometimes broken even when all codebases are compiled with the -fno-omit-frame-pointer flag. By checking the binary generated by gcc, I noticed gcc may reorder x86 frame pointer saving/setup instructions (i.e., push %rbp; move %rsp, %rbp), sometimes even after ret instructions of some branches. As shown in the example below, push %rbp; move %rsp, %rbp are put at the bottom of the function. It leads to incomplete and misleading callstacks when perf happens to sample instructions in the function before frame pointers are properly set.

C code:

int flextcp_fd_slookup(int fd, struct socket **ps)
{
  struct socket *s;

  if (fd >= MAXSOCK || fhs[fd].type != FH_SOCKET) {
    errno = EBADF;
    return -1;
  }

  uint32_t lock_val = 1;
  s = fhs[fd].data.s;

  asm volatile (
      "1:\n"
      "xchg %[locked], %[lv]\n"
      "test %[lv], %[lv]\n"
      "jz 3f\n"
      "2:\n"
      "pause\n"
      "cmpl $0, %[locked]\n"
      "jnz 2b\n"
      "jmp 1b\n"
      "3:\n"
      : [locked] "=m" (s->sp_lock), [lv] "=q" (lock_val)
      : "[lv]" (lock_val)
      : "memory");

  *ps = s;
  return 0;
}

CMake Debug Profile:

0000000000007c73 <flextcp_fd_slookup>:
    7c73:   f3 0f 1e fa             endbr64 
    7c77:   55                      push   %rbp
    7c78:   48 89 e5                mov    %rsp,%rbp
    7c7b:   48 83 ec 20             sub    $0x20,%rsp
    7c7f:   89 7d ec                mov    %edi,-0x14(%rbp)
    7c82:   48 89 75 e0             mov    %rsi,-0x20(%rbp)
    7c86:   81 7d ec ff ff 0f 00    cmpl   $0xfffff,-0x14(%rbp)
    7c8d:   7f 1b                   jg     7caa <flextcp_fd_slookup+0x37>
    7c8f:   8b 45 ec                mov    -0x14(%rbp),%eax
    7c92:   48 98                   cltq   
    7c94:   48 c1 e0 04             shl    $0x4,%rax
    7c98:   48 89 c2                mov    %rax,%rdx
    7c9b:   48 8d 05 86 86 00 00    lea    0x8686(%rip),%rax        # 10328 <fhs+0x8>
    7ca2:   0f b6 04 02             movzbl (%rdx,%rax,1),%eax
    7ca6:   3c 01                   cmp    $0x1,%al
    7ca8:   74 12                   je     7cbc <flextcp_fd_slookup+0x49>
    7caa:   e8 31 b9 ff ff          callq  35e0 <__errno_location@plt>
    7caf:   c7 00 09 00 00 00       movl   $0x9,(%rax)
    7cb5:   b8 ff ff ff ff          mov    $0xffffffff,%eax
    7cba:   eb 53                   jmp    7d0f <flextcp_fd_slookup+0x9c>
    7cbc:   c7 45 f4 01 00 00 00    movl   $0x1,-0xc(%rbp)
    7cc3:   8b 45 ec                mov    -0x14(%rbp),%eax
    7cc6:   48 98                   cltq   
    7cc8:   48 c1 e0 04             shl    $0x4,%rax
    7ccc:   48 89 c2                mov    %rax,%rdx
    7ccf:   48 8d 05 4a 86 00 00    lea    0x864a(%rip),%rax        # 10320 <fhs>
    7cd6:   48 8b 04 02             mov    (%rdx,%rax,1),%rax
    7cda:   48 89 45 f8             mov    %rax,-0x8(%rbp)
    7cde:   48 8b 55 f8             mov    -0x8(%rbp),%rdx
    7ce2:   8b 45 f4                mov    -0xc(%rbp),%eax
    7ce5:   87 82 c0 00 00 00       xchg   %eax,0xc0(%rdx)
    7ceb:   85 c0                   test   %eax,%eax
    7ced:   74 0d                   je     7cfc <flextcp_fd_slookup+0x89>
    7cef:   f3 90                   pause  
    7cf1:   83 ba c0 00 00 00 00    cmpl   $0x0,0xc0(%rdx)
    7cf8:   75 f5                   jne    7cef <flextcp_fd_slookup+0x7c>
    7cfa:   eb e9                   jmp    7ce5 <flextcp_fd_slookup+0x72>
    7cfc:   89 45 f4                mov    %eax,-0xc(%rbp)
    7cff:   48 8b 45 e0             mov    -0x20(%rbp),%rax
    7d03:   48 8b 55 f8             mov    -0x8(%rbp),%rdx
    7d07:   48 89 10                mov    %rdx,(%rax)
    7d0a:   b8 00 00 00 00          mov    $0x0,%eax
    7d0f:   c9                      leaveq 
    7d10:   c3                      retq     

CMake Release Profile:

0000000000007d80 <flextcp_fd_slookup>:
    7d80:   f3 0f 1e fa             endbr64 
    7d84:   81 ff ff ff 0f 00       cmp    $0xfffff,%edi
    7d8a:   7f 44                   jg     7dd0 <flextcp_fd_slookup+0x50>
    7d8c:   48 63 ff                movslq %edi,%rdi
    7d8f:   48 8d 05 6a 85 00 00    lea    0x856a(%rip),%rax        # 10300 <fhs>
    7d96:   48 c1 e7 04             shl    $0x4,%rdi
    7d9a:   48 01 c7                add    %rax,%rdi
    7d9d:   80 7f 08 01             cmpb   $0x1,0x8(%rdi)
    7da1:   75 2d                   jne    7dd0 <flextcp_fd_slookup+0x50>
    7da3:   48 8b 17                mov    (%rdi),%rdx
    7da6:   b8 01 00 00 00          mov    $0x1,%eax
    7dab:   87 82 c0 00 00 00       xchg   %eax,0xc0(%rdx)
    7db1:   85 c0                   test   %eax,%eax
    7db3:   74 0d                   je     7dc2 <flextcp_fd_slookup+0x42>
    7db5:   f3 90                   pause  
    7db7:   83 ba c0 00 00 00 00    cmpl   $0x0,0xc0(%rdx)
    7dbe:   75 f5                   jne    7db5 <flextcp_fd_slookup+0x35>
    7dc0:   eb e9                   jmp    7dab <flextcp_fd_slookup+0x2b>
    7dc2:   31 c0                   xor    %eax,%eax
    7dc4:   48 89 16                mov    %rdx,(%rsi)
    7dc7:   c3                      retq   
    7dc8:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
    7dcf:   00 
    7dd0:   55                      push   %rbp
    7dd1:   48 89 e5                mov    %rsp,%rbp
    7dd4:   e8 b7 b7 ff ff          callq  3590 <__errno_location@plt>
    7dd9:   c7 00 09 00 00 00       movl   $0x9,(%rax)
    7ddf:   b8 ff ff ff ff          mov    $0xffffffff,%eax
    7de4:   5d                      pop    %rbp
    7de5:   c3                      retq   
    7de6:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
    7ded:   00 00 00 

Is there any way to prevent gcc from reordering these two instructions?

Edit: I use the default toolchain (gcc-11.2.0 + glibc 2.35) on Ubuntu 22.04. Sorry that a reproducible example is not available. Edit: Add source code of the example function.


Solution

Try -fno-shrink-wrap


This looks like "shrink-wrap" optimization: only doing the function prologue in a code path where it's needed. The usual benefit is to run an early-out check before the prologue, not saving/restoring a bunch of registers on that path through the function.

But here, GCC decided to only do the prologue (setting up a frame pointer) if it had to call another function. That function is __errno_location in the error-return path. Oops. :P (And GCC correctly realized that's the uncommon case, and put it out-of-line after the ret through the fast path. So the fast path can be a straight line with no taken branches, other than inside your asm(). It's not a separate function, it's just tail-duplication of the one you showed source for.)

The main path through the function is very tiny, just a few C assignment statements and an asm() statement. GCC doesn't have a clear idea of how big an asm block is (although I think has some heuristics, but is still rather willing to inline one). And it has no idea if there might be loops or any significant time spent in an asm block.


This is a known issue, GCC bug #98018 suggested that GCC should have an option to force frame-pointer setup at the actual top of a function. Because there currently isn't an option that's 100% reliable, other than disabling optimization which is not usable. (Thanks to @Margaret Bloom for finding & linking this.)

As comment 6 on that GCC bug mentions, disabling shrink-wrapping is part of what's necessary to make sure GCC sets up the frame pointer at the top of the function itself, not just inside some if that needs the prologue.

That GCC issue seems to be considering a feature that would stop function inlining, so backtraces would fully reflect the C abstract machine's nesting of function calls. That goes beyond what you're looking for, which I think is just to have frame pointers set up on entry to functions that exist in the asm after optimization.

Disabling shrink-wrapping will force the whole prologue to happen there, including push of other regs, if there were any. Not just the frame pointer. But here there aren't any others. Still, with optimization enabled in general, losing shrink-wrapping is probably pretty minor.



Answered By - Peter Cordes
Answer Checked By - Cary Denson (WPSolving Admin)