Thursday, February 3, 2022

[SOLVED] GNU C compiler sabotages undefined behaviour

Issue

I have an embedded project that at some point requires me to write to address 0. So naturally I try:

*(int*)0 = 0 ;

But at optimisation level 2 or higher, the gcc compiler rubs its hands and says, in effect, "That is undefined behaviour! I can do what I like! Bwahaha!" and emits an invalid instruction to the code stream!

Here is my source file:

void f (void)
  {
  *(int*)0 = 0 ;
  }

and here is the output listing:

    .file   "bug.c"
    .text
    .p2align 4,,15
    .globl  _f
    .def    _f; .scl    2;  .type   32; .endef
_f:
LFB0:
    .cfi_startproc
    movl    $0, 0
    ud2                <-- Invalid instruction!
    .cfi_endproc
LFE0:
    .ident  "GCC: (i686-posix-dwarf-rev0, Built by MinGW-W64 project) 7.3.0"

My question is: Why would anybody do this? What possible benefit could accrue from sabotaging code like this? Surely the obvious course of action is to issue a warning and carry on compiling?

I know the compiler is allowed to do this, I just wonder about the motivation of the compiler writer. It cost me two days and four engineering samples to track this down, so I'm a little peeved.

Edited to add: I have worked around this by using assembly language. So I'm not looking for solutions. I'm just curious why anybody would think this compiler behaviour was a good idea.
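
For the curious, the inline-assembly workaround can look roughly like this (a sketch of the idea, not my exact code):

void f (void)
  {
  /* GCC doesn't analyse the asm body, so its undefined-behaviour
     reasoning never applies to the store */
  __asm__ volatile ("movl $0, 0" ::: "memory") ;
  }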


Solution

(Disclaimer: I'm not an expert on GCC internals, and this is more of a "post hoc" attempt to explain its behavior. But maybe it will be helpful.)

the gcc compiler rubs its hands and says, in effect, "That is undefined behaviour! I can do what I like! Bwahaha!" and emits an invalid instruction to the code stream!

I won't deny that there are cases where GCC does more or less that, but here there's a little more going on, and there is some method to its madness.

As I understand it, GCC isn't treating the null dereference as totally undefined here; it is making some assumptions about what it does. Its handling of null dereferences is controlled by a flag called -fdelete-null-pointer-checks, which is enabled by default on most targets, with the optimization passes that rely on it kicking in as you raise the optimization level. From the manual:

-fdelete-null-pointer-checks

Assume that programs cannot safely dereference null pointers, and that no code or data element resides at address zero. This option enables simple constant folding optimizations at all optimization levels. In addition, other optimization passes in GCC use this flag to control global dataflow analyses that eliminate useless checks for null pointers; these assume that a memory access to address zero always results in a trap, so that if a pointer is checked after it has already been dereferenced, it cannot be null.

Note however that in some environments this assumption is not true. Use -fno-delete-null-pointer-checks to disable this optimization for programs that depend on that behavior.

This option is enabled by default on most targets. On Nios II ELF, it defaults to off. On AVR, CR16, and MSP430, this option is completely disabled.

Passes that use the dataflow information are enabled independently at different optimization levels.

So, if you are intending to actually access address 0, or if for some other reason your code will go on executing after the dereference, then you want to disable this with -fno-delete-null-pointer-checks. That will achieve the "carry on compiling" part of what you want. It will not give you warnings, however, presumably under the assumption that such dereferences are intentional.
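
For instance, recompiling the file above with

    gcc -O2 -fno-delete-null-pointer-checks -S bug.c

should produce roughly this for the body of f (a sketch; exact output varies by GCC version and target):

_f:
    movl    $0, 0      <-- The store you asked for
    ret                <-- ...and execution carries on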


But under default options, why are you seeing the generated code that you do, with the undefined instruction, and why isn't there a warning? I would guess that GCC's logic is running as follows:

  • Because -fdelete-null-pointer-checks is in effect, the compiler assumes that execution will not continue past the null dereference, but instead will trap. How the trap will be handled, it doesn't know: maybe program termination, maybe a signal or exception handler, maybe a longjmp up the stack. The null dereference itself is emitted as requested, perhaps under the assumption that you are intentionally exercising your trap handler. But either way, whatever code comes after the null dereference is now unreachable.

  • So now it does what any reasonable optimizing compiler does with unreachable code: it doesn't emit it. In your case, that's nothing but a ret, but whatever it is, as far as GCC is concerned it would just be wasted bytes of memory, and should be omitted.

    You might think you should get a warning here, but GCC has a longstanding design decision not to warn about unreachable code, on the grounds that such warnings tended to be inconsistent and the false positives would do more harm than good. See for instance https://gcc.gnu.org/legacy-ml/gcc-help/2011-05/msg00360.html.

  • However, as a safety feature, GCC emits an undefined instruction (ud2 on x86) in place of the omitted unreachable code. The idea, I believe, is that just in case execution somehow does continue past the null dereference, it is better for the program to die, than to go off into the weeds and try to execute whatever memory contents happen to come next. (And indeed this can happen even on systems that do unmap the zero page; for instance, if you do struct huge *p = NULL; p->x = 0;, GCC understands this as a null dereference, even though p->x may not be on the zero page at all, and could conceivably be located at an accessible address.)
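
To make that last point concrete, here is a minimal sketch (the struct layout is invented for illustration):

struct huge {
    char pad[1u << 20];   /* about 1 MB of padding... */
    int x;                /* ...so x lives about 1 MB past the start */
};

void g(void) {
    struct huge *p = 0;
    p->x = 0;   /* GCC treats this as a null dereference, even though the
                   store actually targets offsetof(struct huge, x), an
                   address which may well be mapped and writable */
}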

There is a warning flag, -Wnull-dereference, that will trigger a warning on your blatant null dereference. However, it only works if -fdelete-null-pointer-checks is enabled.
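
For example, something like

    gcc -O2 -Wnull-dereference -c bug.c

should flag the store in f, although the exact wording of the diagnostic varies between GCC versions.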


When would GCC's behavior be useful? Here's an example, maybe contrived, but it might get the idea across. Imagine your program has some allocation function that might fail:

struct foo *p = get_foo();
// do other stuff for a while
if (!p) {
    // 5000 lines of elaborate backup plan in case we can't get a foo
}
frob(p->bar);

Now imagine that you redesign get_foo() so that it can't fail. You forget to take out your "backup plan" code, but you go ahead and use the returned object right away:

struct foo *p = get_foo();
frob(p->bar);
// do other stuff for a while
if (!p) {
    // 5000 lines of elaborate backup plan in case we can't get a foo
}

The compiler doesn't know, a priori, that get_foo() will always return a valid pointer. But it can see that you've dereferenced it, and thus can assume that execution will only continue past that point if the pointer was not null. Therefore, it can tell that the elaborate backup plan is unreachable and should be omitted, which will save you a lot of bloat in your binary.
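
In other words, the compiler effectively rewrites the second version into something like this (an illustrative sketch of the transformation, not literal GCC output):

struct foo *p = get_foo();
frob(p->bar);   /* p is dereferenced here... */
// do other stuff for a while
/* ...so past this point p cannot be null (we would have trapped),
   meaning the test if (!p) is provably false and the entire
   5000-line backup plan can be dropped from the binary */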


Incidentally, a word on the situation with clang. Although, as Eric Postpischil points out, you do get a warning there, what you don't get is the actual store to address 0: clang omits it and just emits ud2. This is what "doing whatever it likes" would really look like, and if you were hoping to exercise your page-zero trap handler, you are out of luck.
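
For reference, here is roughly what that looks like (x86-64 clang at -O2; exact output varies by version):

f:
    ud2                <-- No store at all; just the trap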



Answered By - Nate Eldredge
Answer Checked By - Gilberto Lyons (WPSolving Admin)