Tuesday, April 12, 2022

[SOLVED] Why does GCC allocate more stack memory than needed?

Issue

I'm reading "Computer Systems: A Programmer's Perspective, 3/E" (CS:APP3e) and the following code is an example from the book:

long call_proc() {
    long  x1 = 1;
    int   x2 = 2;
    short x3 = 3;
    char  x4 = 4;
    proc(x1, &x1, x2, &x2, x3, &x3, x4, &x4);
    return (x1+x2)*(x3-x4);
}

The book gives the assembly code generated by GCC:

long call_proc()
call_proc:
    ; Set up arguments to proc
    subq    $32, %rsp           ; Allocate 32-byte stack frame
    movq    $1, 24(%rsp)        ; Store 1 in &x1
    movl    $2, 20(%rsp)        ; Store 2 in &x2
    movw    $3, 18(%rsp)        ; Store 3 in &x3
    movb    $4, 17(%rsp)        ; Store 4 in &x4
    leaq    17(%rsp), %rax      ; Create &x4
    movq    %rax, 8(%rsp)       ; Store &x4 as argument 8
    movl    $4, (%rsp)          ; Store 4 as argument 7
    leaq    18(%rsp), %r9       ; Pass &x3 as argument 6
    movl    $3, %r8d            ; Pass 3 as argument 5
    leaq    20(%rsp), %rcx      ; Pass &x2 as argument 4
    movl    $2, %edx            ; Pass 2 as argument 3
    leaq    24(%rsp), %rsi      ; Pass &x1 as argument 2
    movl    $1, %edi            ; Pass 1 as argument 1
    ; Call proc
    call    proc
    ; Retrieve changes to memory
    movslq  20(%rsp), %rdx      ; Get x2 and convert to long
    addq    24(%rsp), %rdx      ; Compute x1+x2
    movswl  18(%rsp), %eax      ; Get x3 and convert to int
    movsbl  17(%rsp), %ecx      ; Get x4 and convert to int
    subl    %ecx, %eax          ; Compute x3-x4
    cltq                        ; Convert to long
    imulq   %rdx, %rax          ; Compute (x1+x2) * (x3-x4)
    addq    $32, %rsp           ; Deallocate stack frame
    ret                         ; Return

I can understand this code: the compiler allocates 32 bytes of space on the stack, of which the first 16 bytes hold the arguments passed to proc and the last 16 bytes hold 4 local variables.

Then I tested this code on GCC 11.2, using the optimization flag -Og, and got this assembly code:

call_proc():
        subq    $24, %rsp
        movq    $1, 8(%rsp)
        movl    $2, 4(%rsp)
        movw    $3, 2(%rsp)
        movb    $4, 1(%rsp)
        leaq    1(%rsp), %rax
        pushq   %rax
        pushq   $4
        leaq    18(%rsp), %r9
        movl    $3, %r8d
        leaq    20(%rsp), %rcx
        movl    $2, %edx
        leaq    24(%rsp), %rsi
        movl    $1, %edi
        call    proc(long, long*, int, int*, short, short*, char, char*)
        movslq  20(%rsp), %rax
        addq    24(%rsp), %rax
        movswl  18(%rsp), %edx
        movsbl  17(%rsp), %ecx
        subl    %ecx, %edx
        movslq  %edx, %rdx
        imulq   %rdx, %rax
        addq    $40, %rsp
        ret

I noticed that GCC first allocates 24 bytes for the 4 local variables, then uses pushq twice to put the 2 stack arguments on the stack, so the final code uses addq $40, %rsp to free the stack space.

Compared to the code in the book, GCC allocates 8 more bytes of space here, and it doesn't seem to use the extra space. Why does it need the extra space?


Solution

(This answer is a summary of comments posted above by Antti Haapala, klutt and Peter Cordes.)

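GCC allocates more space than "necessary" in order to ensure that the stack is properly aligned for the call to proc. The x86-64 System V ABI requires %rsp to be a multiple of 16 at the point of a call instruction. On entry to call_proc the return address has just been pushed, so %rsp is 8 bytes away from a 16-byte boundary. To restore alignment before calling proc, the stack pointer must therefore be adjusted by a multiple of 16, plus 8 (i.e. by an odd multiple of 8). GCC 11.2's total of 24 + 2*8 = 40 bytes satisfies this; the book's 32 bytes does not. See also: Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment?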

What's strange is that the code in the book doesn't do that; the code as shown would violate the ABI and, if proc actually relies on proper stack alignment (e.g. using aligned SSE2 instructions), it may crash.

So it appears that either the code in the book was incorrectly copied from compiler output, or else the authors of the book are using some unusual compiler flags which alter the ABI.

Modern GCC 11.2 emits nearly identical asm (Godbolt) using -Og -mpreferred-stack-boundary=3 -maccumulate-outgoing-args, the former of which changes the ABI to maintain only 2^3 byte stack alignment, down from the default 2^4. (Code compiled this way can't safely call anything compiled normally, even standard library functions.) -maccumulate-outgoing-args used to be the default in older GCC, but modern CPUs have a "stack engine" that makes push/pop single-uop so that option isn't the default anymore; push for stack args saves a bit of code size.

One difference from the book's asm is a movl $0, %eax before the call: because there's no prototype for proc, the caller has to assume it might be variadic and pass AL = the number of FP args in XMM registers. (A prototype that matches the args passed would prevent that.) The other instructions are all the same, and in the same order as whatever older GCC version the book used, except for the choice of registers after call proc returns: it ends up using movslq %edx, %rdx instead of cltq (which sign-extends EAX into RAX).


CS:APP 3e global edition is notorious for errors in practice problems introduced by the publisher (not the authors), but apparently this code is present in the North American edition, too. So this may be the authors' mistake, or their choice to use actual compiler output produced with unusual options. Unlike some of the bad global edition practice problems, this code could have come unmodified from some GCC version, but only with non-standard options.


Related: Why does GCC allocate more space than necessary on the stack, beyond what's needed for alignment? - GCC has a missed-optimization bug where it sometimes reserves an additional 16 bytes that it truly didn't need to. That's not what's happening here, though.



Answered By - Nate Eldredge