Issue
I'm reading "Computer Systems: A Programmer's Perspective, 3/E" (CS:APP3e) and the following code is an example from the book:
long call_proc() {
long x1 = 1;
int x2 = 2;
short x3 = 3;
char x4 = 4;
proc(x1, &x1, x2, &x2, x3, &x3, x4, &x4);
return (x1+x2)*(x3-x4);
}
The book gives the assembly code generated by GCC:
long call_proc()
call_proc:
; Set up arguments to proc
subq $32, %rsp ; Allocate 32-byte stack frame
movq $1, 24(%rsp) ; Store 1 in &x1
movl $2, 20(%rsp) ; Store 2 in &x2
movw $3, 18(%rsp) ; Store 3 in &x3
movb $4, 17(%rsp) ; Store 4 in &x4
leaq 17(%rsp), %rax ; Create &x4
movq %rax, 8(%rsp) ; Store &x4 as argument 8
movl $4, (%rsp) ; Store 4 as argument 7
leaq 18(%rsp), %r9 ; Pass &x3 as argument 6
movl $3, %r8d ; Pass 3 as argument 5
leaq 20(%rsp), %rcx ; Pass &x2 as argument 4
movl $2, %edx ; Pass 2 as argument 3
leaq 24(%rsp), %rsi ; Pass &x1 as argument 2
movl $1, %edi ; Pass 1 as argument 1
; Call proc
call proc
; Retrieve changes to memory
movslq 20(%rsp), %rdx ; Get x2 and convert to long
addq 24(%rsp), %rdx ; Compute x1+x2
movswl 18(%rsp), %eax ; Get x3 and convert to int
movsbl 17(%rsp), %ecx ; Get x4 and convert to int
subl %ecx, %eax ; Compute x3-x4
cltq ; Convert to long
imulq %rdx, %rax ; Compute (x1+x2) * (x3-x4)
addq $32, %rsp ; Deallocate stack frame
ret ; Return
I can understand this code: the compiler allocates 32 bytes of space on the stack, of which the first 16 bytes hold the arguments passed to proc
and the last 16 bytes hold 4 local variables.
Then I tested this code on GCC 11.2, using the optimization flag -Og
, and got this assembly code:
call_proc():
subq $24, %rsp
movq $1, 8(%rsp)
movl $2, 4(%rsp)
movw $3, 2(%rsp)
movb $4, 1(%rsp)
leaq 1(%rsp), %rax
pushq %rax
pushq $4
leaq 18(%rsp), %r9
movl $3, %r8d
leaq 20(%rsp), %rcx
movl $2, %edx
leaq 24(%rsp), %rsi
movl $1, %edi
call proc(long, long*, int, int*, short, short*, char, char*)
movslq 20(%rsp), %rax
addq 24(%rsp), %rax
movswl 18(%rsp), %edx
movsbl 17(%rsp), %ecx
subl %ecx, %edx
movslq %edx, %rdx
imulq %rdx, %rax
addq $40, %rsp
ret
I noticed that gcc first allocated 24 bytes for 4 local variables. Then it uses pushq
to add 2 arguments to the stack, so the final code uses addq $40, %rsp
to free stack space.
Compared to the code in the book, GCC allocates 8 more bytes of space here, and it doesn't seem to use the extra space. Why does it need the extra space?
Solution
(This answer is a summary of comments posted above by Antti Haapala, klutt and Peter Cordes.)
GCC allocates more space than "necessary" in order to ensure that the stack is properly aligned for the call to proc
: the stack pointer must be adjusted by a multiple of 16, plus 8 (i.e. by an odd multiple of 8). Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment?
What's strange is that the code in the book doesn't do that; the code as shown would violate the ABI and, if proc
actually relies on proper stack alignment (e.g. using aligned SSE2 instructions), it may crash.
So it appears that either the code in the book was incorrectly copied from compiler output, or else the authors of the book are using some unusual compiler flags which alter the ABI.
Modern GCC 11.2 emits nearly identical asm (Godbolt) using -Og -mpreferred-stack-boundary=3 -maccumulate-outgoing-args
, the former of which changes the ABI to maintain only 2^3 byte stack alignment, down from the default 2^4. (Code compiled this way can't safely call anything compiled normally, even standard library functions.) -maccumulate-outgoing-args
used to be the default in older GCC, but modern CPUs have a "stack engine" that makes push/pop single-uop so that option isn't the default anymore; push for stack args saves a bit of code size.
One difference from the book's asm is a movl $0, %eax
before the call, because there's no prototype so the caller has to assume it might be variadic and pass AL = the number of FP args in XMM registers. (A prototype that matches the args passed would prevent that.) The other instructions are all the same, and in the same order as whatever older GCC version the book used, except for choice of registers after call proc
returns: it ends up using movslq %edx, %rdx
instead of cltq
(sign-extend with RAX).
CS:APP 3e global edition is notorious for errors in practice problems introduced by the publisher (not the authors), but apparently this code is present in the North American edition, too. So this may be the author's mistake / choice to use actual compiler output with weird options. Unlike some of the bad global edition practice problems, this code could have come unmodified from some GCC version, but only with non-standard options.
Related: Why does GCC allocate more space than necessary on the stack, beyond what's needed for alignment? - GCC has a missed-optimization bug where it sometimes reserves an additional 16 bytes that it truly didn't need to. That's not what's happening here, though.
Answered By - Nate Eldredge Answer Checked By - Mildred Charles (WPSolving Admin)