Tuesday, February 6, 2024

[SOLVED] Linux kernel memory model pointer access and dereference

Issue

I was reading through the guide to using the various memory barriers provided by Linux and came upon the example below.

I was curious why CPU 2 will load P into Q before issuing the load of *Q, and whether this is actually guaranteed to always be the case.

I do not see it explicitly stated, but I assume there is a memory-ordering guarantee that all writes to a pointer variable will occur before any subsequent dereference of that pointer variable. Can anyone confirm this interpretation, or provide evidence from the Linux-kernel memory model that justifies this behavior?

As a further example, consider this sequence of events:

        CPU 1           CPU 2
        =============== ===============
        { A == 1, B == 2, C == 3, P == &A, Q == &C }
        B = 4;          Q = P;
        P = &B;         D = *Q;

There is an obvious address dependency here, as the value loaded into D depends
on the address retrieved from P by CPU 2.  At the end of the sequence, any of
the following results are possible:

        (Q == &A) and (D == 1)
        (Q == &B) and (D == 2)
        (Q == &B) and (D == 4)

Note that CPU 2 will never try and load C into D because the CPU will load P
into Q before issuing the load of *Q.


Solution

The Linux kernel's use of volatile is essentially the same as ISO C with memory_order_relaxed. (Or slightly stronger, because compile-time reordering of volatile operations with respect to each other isn't allowed, even to different addresses.) In Linux kernel code this would be WRITE_ONCE(B, 4); and so on, unless you actually declared the shared variables as volatile.

Within the same thread, sequencing and coherence rules apply, so shared = 1; tmp = shared is guaranteed to read 1 or some later value in the modification order of shared.

Except it's even simpler in this case: Q isn't touched by CPU 1 in this example, so it's not a shared variable; it's effectively local to CPU 2. The as-if rules of compiler optimization and out-of-order execution require preserving single-threaded correctness, such as the guarantee that variables have their new values after we write them.

Dereferencing an atomic pointer involves reading it into a temporary (a register, on all(?) current ISAs that Linux supports) and then dereferencing that temporary. In this case, Q is that temporary, despite the unclear naming convention.

The read itself (of Q) is sequenced after the write (Q = ...) from the same thread.


The pointed-to data from D = *Q; is dependency-ordered after the read of Q, like C/C++ memory_order_consume, except on DEC Alpha AXP. All other ISAs that Linux runs on have hardware memory-dependency ordering, and a compiler can't plausibly break this code by proving there's only one possible value for Q and substituting a constant that has no data dependency on the load of P, or by using a branch (control dependencies can be speculated past, unlike data dependencies).

But we can still observe (Q == &B) && (D == 2) because CPU 1 didn't do a release store. If it had, then even though CPU 2 doesn't do an acquire load, the data dependency through the address would stop any real-world CPU from loading *Q until after it knows the address from Q.

If you want to rely on this in Linux kernel code, you should use smp_read_barrier_depends() between Q = P and *Q; it's a no-op on everything except Alpha. (In current kernels, READ_ONCE() provides this ordering itself, and the standalone smp_read_barrier_depends() macro has since been removed.)

  • Memory order consume usage in C11

  • C++11: the difference between memory_order_relaxed and memory_order_consume

  • Paul E. McKenney's CppCon 2016 talk, "C++ Atomics: The Sad Story of memory_order_consume: A Happy Ending At Last?" - he describes how Linux effectively uses relaxed and avoids patterns like tmp = shared; array[tmp - tmp] as a way to get one load ordered after another, because compilers will optimize that to a constant 0 with no dependency.

    (That's why C++'s memory_order_consume had to exist: it does let you do stuff like that with formal guarantees. On real ISAs like AArch64, compilers would have to emit asm that generates a 0 with a data dependency on tmp, e.g. with a sub or xor instruction. Fun fact: non-x86 CPUs aren't even allowed to optimize away xor-zeroing of registers in such cases, because their ISA rules say xor and sub carry a dependency. That's also why memory_order_consume proved so hard to support correctly: compilers gave up and promoted consume to acquire rather than risk their optimizers breaking code like this. Linux kernel code can rely on "we know what we're doing well enough, with a limited set of compilers, that we can write code those compilers shouldn't optimize in ways that break consume".)



Answered By - Peter Cordes
Answer Checked By - Mildred Charles (WPSolving Admin)