Issue
During my profiling with flamegraph, I found that callstacks are sometimes broken even when all codebases are compiled with the -fno-omit-frame-pointer
flag. By checking the binary generated by gcc, I noticed gcc may reorder x86 frame pointer saving/setup instructions (i.e., push %rbp; move %rsp, %rbp
), sometimes even after ret
instructions of some branches. As shown in the example below, push %rbp; move %rsp, %rbp
are put at the bottom of the function. It leads to incomplete and misleading callstacks when perf happens to sample instructions in the function before frame pointers are properly set.
C code:
int flextcp_fd_slookup(int fd, struct socket **ps)
{
struct socket *s;
if (fd >= MAXSOCK || fhs[fd].type != FH_SOCKET) {
errno = EBADF;
return -1;
}
uint32_t lock_val = 1;
s = fhs[fd].data.s;
asm volatile (
"1:\n"
"xchg %[locked], %[lv]\n"
"test %[lv], %[lv]\n"
"jz 3f\n"
"2:\n"
"pause\n"
"cmpl $0, %[locked]\n"
"jnz 2b\n"
"jmp 1b\n"
"3:\n"
: [locked] "=m" (s->sp_lock), [lv] "=q" (lock_val)
: "[lv]" (lock_val)
: "memory");
*ps = s;
return 0;
}
CMake Debug
Profile:
0000000000007c73 <flextcp_fd_slookup>:
7c73: f3 0f 1e fa endbr64
7c77: 55 push %rbp
7c78: 48 89 e5 mov %rsp,%rbp
7c7b: 48 83 ec 20 sub $0x20,%rsp
7c7f: 89 7d ec mov %edi,-0x14(%rbp)
7c82: 48 89 75 e0 mov %rsi,-0x20(%rbp)
7c86: 81 7d ec ff ff 0f 00 cmpl $0xfffff,-0x14(%rbp)
7c8d: 7f 1b jg 7caa <flextcp_fd_slookup+0x37>
7c8f: 8b 45 ec mov -0x14(%rbp),%eax
7c92: 48 98 cltq
7c94: 48 c1 e0 04 shl $0x4,%rax
7c98: 48 89 c2 mov %rax,%rdx
7c9b: 48 8d 05 86 86 00 00 lea 0x8686(%rip),%rax # 10328 <fhs+0x8>
7ca2: 0f b6 04 02 movzbl (%rdx,%rax,1),%eax
7ca6: 3c 01 cmp $0x1,%al
7ca8: 74 12 je 7cbc <flextcp_fd_slookup+0x49>
7caa: e8 31 b9 ff ff callq 35e0 <__errno_location@plt>
7caf: c7 00 09 00 00 00 movl $0x9,(%rax)
7cb5: b8 ff ff ff ff mov $0xffffffff,%eax
7cba: eb 53 jmp 7d0f <flextcp_fd_slookup+0x9c>
7cbc: c7 45 f4 01 00 00 00 movl $0x1,-0xc(%rbp)
7cc3: 8b 45 ec mov -0x14(%rbp),%eax
7cc6: 48 98 cltq
7cc8: 48 c1 e0 04 shl $0x4,%rax
7ccc: 48 89 c2 mov %rax,%rdx
7ccf: 48 8d 05 4a 86 00 00 lea 0x864a(%rip),%rax # 10320 <fhs>
7cd6: 48 8b 04 02 mov (%rdx,%rax,1),%rax
7cda: 48 89 45 f8 mov %rax,-0x8(%rbp)
7cde: 48 8b 55 f8 mov -0x8(%rbp),%rdx
7ce2: 8b 45 f4 mov -0xc(%rbp),%eax
7ce5: 87 82 c0 00 00 00 xchg %eax,0xc0(%rdx)
7ceb: 85 c0 test %eax,%eax
7ced: 74 0d je 7cfc <flextcp_fd_slookup+0x89>
7cef: f3 90 pause
7cf1: 83 ba c0 00 00 00 00 cmpl $0x0,0xc0(%rdx)
7cf8: 75 f5 jne 7cef <flextcp_fd_slookup+0x7c>
7cfa: eb e9 jmp 7ce5 <flextcp_fd_slookup+0x72>
7cfc: 89 45 f4 mov %eax,-0xc(%rbp)
7cff: 48 8b 45 e0 mov -0x20(%rbp),%rax
7d03: 48 8b 55 f8 mov -0x8(%rbp),%rdx
7d07: 48 89 10 mov %rdx,(%rax)
7d0a: b8 00 00 00 00 mov $0x0,%eax
7d0f: c9 leaveq
7d10: c3 retq
CMake Release
Profile:
0000000000007d80 <flextcp_fd_slookup>:
7d80: f3 0f 1e fa endbr64
7d84: 81 ff ff ff 0f 00 cmp $0xfffff,%edi
7d8a: 7f 44 jg 7dd0 <flextcp_fd_slookup+0x50>
7d8c: 48 63 ff movslq %edi,%rdi
7d8f: 48 8d 05 6a 85 00 00 lea 0x856a(%rip),%rax # 10300 <fhs>
7d96: 48 c1 e7 04 shl $0x4,%rdi
7d9a: 48 01 c7 add %rax,%rdi
7d9d: 80 7f 08 01 cmpb $0x1,0x8(%rdi)
7da1: 75 2d jne 7dd0 <flextcp_fd_slookup+0x50>
7da3: 48 8b 17 mov (%rdi),%rdx
7da6: b8 01 00 00 00 mov $0x1,%eax
7dab: 87 82 c0 00 00 00 xchg %eax,0xc0(%rdx)
7db1: 85 c0 test %eax,%eax
7db3: 74 0d je 7dc2 <flextcp_fd_slookup+0x42>
7db5: f3 90 pause
7db7: 83 ba c0 00 00 00 00 cmpl $0x0,0xc0(%rdx)
7dbe: 75 f5 jne 7db5 <flextcp_fd_slookup+0x35>
7dc0: eb e9 jmp 7dab <flextcp_fd_slookup+0x2b>
7dc2: 31 c0 xor %eax,%eax
7dc4: 48 89 16 mov %rdx,(%rsi)
7dc7: c3 retq
7dc8: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
7dcf: 00
7dd0: 55 push %rbp
7dd1: 48 89 e5 mov %rsp,%rbp
7dd4: e8 b7 b7 ff ff callq 3590 <__errno_location@plt>
7dd9: c7 00 09 00 00 00 movl $0x9,(%rax)
7ddf: b8 ff ff ff ff mov $0xffffffff,%eax
7de4: 5d pop %rbp
7de5: c3 retq
7de6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
7ded: 00 00 00
Is there any way to prevent gcc from reordering these two instructions?
Edit: I use the default toolchain (gcc-11.2.0 + glibc 2.35) on Ubuntu 22.04. Sorry that a reproducible example is not available. Edit: Add source code of the example function.
Solution
Try -fno-shrink-wrap
This looks like "shrink-wrap" optimization: only doing the function prologue in a code path where it's needed. The usual benefit is to run an early-out check before the prologue, not saving/restoring a bunch of registers on that path through the function.
But here, GCC decided to only do the prologue (setting up a frame pointer) if it had to call another function. That function is __errno_location
in the error-return path. Oops. :P (And GCC correctly realized that's the uncommon case, and put it out-of-line after the ret
through the fast path. So the fast path can be a straight line with no taken branches, other than inside your asm()
. It's not a separate function, it's just tail-duplication of the one you showed source for.)
The main path through the function is very tiny, just a few C assignment statements and an asm()
statement. GCC doesn't have a clear idea of how big an asm
block is (although I think has some heuristics, but is still rather willing to inline one). And it has no idea if there might be loops or any significant time spent in an asm block.
This is a known issue, GCC bug #98018 suggested that GCC should have an option to force frame-pointer setup at the actual top of a function. Because there currently isn't an option that's 100% reliable, other than disabling optimization which is not usable. (Thanks to @Margaret Bloom for finding & linking this.)
As comment 6 on that GCC bug mentions, disabling shrink-wrapping is part of what's necessary to make sure GCC sets up the frame pointer at the top of the function itself, not just inside some if
that needs the prologue.
That GCC issue seems to be considering a feature that would stop function inlining, so backtraces would fully reflect the C abstract machine's nesting of function calls. That goes beyond what you're looking for, which I think is just to have frame pointers set up on entry to functions that exist in the asm after optimization.
Disabling shrink-wrapping will force the whole prologue to happen there, including push of other regs, if there were any. Not just the frame pointer. But here there aren't any others. Still, with optimization enabled in general, losing shrink-wrapping is probably pretty minor.
Answered By - Peter Cordes Answer Checked By - Cary Denson (WPSolving Admin)