Issue
I was reading through the guide to using the various memory barriers provided by Linux and came upon the example below.
I was curious why CPU 2 will load P into Q before issuing the load of *Q, and whether this is actually guaranteed to always be the case.
I do not see it explicitly stated, but I assume there is a memory-ordering guarantee that all writes to a pointer variable will occur before any subsequent dereference of that pointer variable. Can anyone confirm this to be an accurate interpretation, or provide evidence from the Linux-kernel memory model that justifies this behavior?
As a further example, consider this sequence of events:
CPU 1                     CPU 2
===============           ===============
{ A == 1, B == 2, C == 3, P == &A, Q == &C }
B = 4;                    Q = P;
P = &B;                   D = *Q;
There is an obvious address dependency here, as the value loaded into D depends
on the address retrieved from P by CPU 2. At the end of the sequence, any of
the following results are possible:
(Q == &A) and (D == 1)
(Q == &B) and (D == 2)
(Q == &B) and (D == 4)
Note that CPU 2 will never try and load C into D because the CPU will load P
into Q before issuing the load of *Q.
Solution
The Linux kernel with volatile is essentially the same as ISO C with memory_order_relaxed. (Or stronger, because compile-time reordering of volatile operations wrt. each other isn't allowed, even to different addresses.) (In Linux kernel code this would be WRITE_ONCE(B, 4); and so on, unless you actually declared the shared variables as volatile.)
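For illustration, a sketch of how the example might look with those accessors (same variables and initial state as above; not a quote from the documentation):

    CPU 1                       CPU 2
    ===============             ===============
    WRITE_ONCE(B, 4);           Q = READ_ONCE(P);
    WRITE_ONCE(P, &B);          D = READ_ONCE(*Q);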
Within the same thread, sequencing and coherence rules apply, so shared = 1; tmp = shared; is guaranteed to read 1 or some later value in the modification order of shared.
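A minimal sketch of that rule in ISO C (hypothetical names, not from the original example):

    #include <stdatomic.h>

    atomic_int shared;

    void same_thread(void)
    {
        atomic_store_explicit(&shared, 1, memory_order_relaxed);
        /* Sequenced after the store above, so even with relaxed
           ordering, coherence guarantees tmp reads 1 or some later
           value in the modification order of shared. */
        int tmp = atomic_load_explicit(&shared, memory_order_relaxed);
        (void)tmp;
    }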
Except it's even simpler in this case: Q isn't touched by CPU 1 in this example, so it's not a shared variable; it's effectively local to CPU 2. The as-if rules of compiler optimization and out-of-order execution require preserving single-threaded correctness, such as that variables hold their new values after we write them.
Dereferencing an atomic pointer involves reading it into a temporary (a register on all(?) current ISAs that Linux supports) and then dereferencing that temporary. In this case, Q is that temporary, despite the unclear naming convention.
The read itself (of Q) is sequenced after the write (Q = ...) from the same thread.
The pointed-to data from D = *Q; is dependency-ordered after the read of Q, like C/C++ memory_order_consume, except on DEC Alpha AXP. All other ISAs that Linux runs on have hardware memory-dependency ordering, and a compiler can't plausibly break this code by knowing there's only one possible value for Q and using a constant for it, which wouldn't have a data dependency on the load result of P. Nor by using a branch (control dependencies can be speculated past, unlike data dependencies).
But we can still read (Q == &B) && (D == 2), because CPU 1 didn't do a release store. If it had, then even though CPU 2 didn't do an acquire load, the data dependency through the address would stop any real-world CPU from loading from *Q until after it knows the address from Q.
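In kernel terms, CPU 1 could make the publication safe with a release store; a sketch using the kernel's smp_store_release():

    /* CPU 1 */
    B = 4;
    smp_store_release(&P, &B); /* orders the store to B before the
                                  store that publishes the pointer P */

With that, any reader that observes Q == &B through the address dependency is also guaranteed to observe D == 4 (on Alpha, the reader still needs the dependency barrier discussed next).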
If you want to rely on this in Linux kernel code, you should use smp_read_barrier_depends() between Q = P and *Q. It's a no-op on everything except Alpha.
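A sketch of the reader side with that barrier (later kernels folded it into READ_ONCE() on Alpha, so this reflects the API as described here):

    /* CPU 2 */
    Q = READ_ONCE(P);
    smp_read_barrier_depends(); /* no-op on everything except Alpha */
    D = *Q;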
Related: C++11: the difference between memory_order_relaxed and memory_order_consume
Paul E. McKenney's CppCon 2016 talk, C++ Atomics: The Sad Story of memory_order_consume: A Happy Ending At Last?, describes how Linux effectively uses relaxed and avoids doing stuff like tmp = shared; array[tmp - tmp] as a way to get one load ordered after another, because compilers will optimize that to a constant 0 with no dependency.
(That's why C++'s memory_order_consume had to exist: it does let you do stuff like that with formal guarantees. On real ISAs like AArch64, compilers would have to emit asm that generates a 0 with a data dependency on tmp, e.g. with a sub or xor instruction. Fun fact: non-x86 CPUs aren't even allowed to optimize xor-zeroing of registers, because their ISA rules say xor and sub carry a dependency. Anyway, that's also why memory_order_consume proved so hard to support correctly: compilers gave up and promoted consume to acquire instead of possibly having their optimizer break code like this. Linux kernel code can rely on stuff like "we know what we're doing well enough, with a limited set of compilers, that we can write code that compilers shouldn't optimize in ways that break consume".)
Answered By - Peter Cordes