Issue
I am trying to use a shared index to indicate that data has been written to a shared circular buffer. Is there an efficient way to do this on ARM (arm gcc 9.3.1 for Cortex-M4 with -O3) without using the discouraged volatile keyword?
The following C functions work fine on x86:
void Test1(int volatile* x) { *x = 5; }
void Test2(int* x) { __atomic_store_n(x, 5, __ATOMIC_RELEASE); }
Both compile efficiently and identically on x86:
0000000000000000 <Test1>:
0: c7 07 05 00 00 00 movl $0x5,(%rdi)
6: c3 retq
7: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
e: 00 00
0000000000000010 <Test2>:
10: c7 07 05 00 00 00 movl $0x5,(%rdi)
16: c3 retq
However, on ARM the __atomic builtin generates a Data Memory Barrier, while volatile does not:
00000000 <Test1>:
0: 2305 movs r3, #5
2: 6003 str r3, [r0, #0]
4: 4770 bx lr
6: bf00 nop
00000000 <Test2>:
0: 2305 movs r3, #5
2: f3bf 8f5b dmb ish
6: 6003 str r3, [r0, #0]
8: 4770 bx lr
a: bf00 nop
How do I avoid the memory barrier (or similar inefficiencies) while also avoiding volatile?
Solution
The volatile assignment isn't a release-store, and doesn't even give you StoreStore ordering, which might be all you need here.
volatile is basically equivalent to __ATOMIC_RELAXED ordering, except that it prevents compile-time reordering with other volatile accesses. It does nothing to prevent run-time reordering, which CPU memory models other than x86's do allow. (As for actual atomicity: with narrow enough types you do get it in practice from certain compilers, like GCC and Clang; the Linux kernel depends on this when it uses volatile to roll its own atomics, along with inline asm for fences.)
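For comparison, here is a minimal sketch of a relaxed atomic store (a hypothetical Test3, using the same GNU C builtin as Test2). It should compile to a plain str with no dmb, just like the volatile version, while still being a well-defined concurrent access:
void Test3(int* x)
{
    // Relaxed: atomic wrt. other atomic accesses, but no ordering
    // guarantees, so no barrier instruction is needed or emitted.
    __atomic_store_n(x, 5, __ATOMIC_RELAXED);
}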
See also When to use volatile with multi threading? - never: volatile doesn't give you anything you can't get with atomics for the purposes of multi-threading. Use GNU C builtins, or C++20 std::atomic_ref with memory_order_relaxed, instead of volatile if you need non-atomic access to the variable in other parts of your program. Or, more simply, use C11 stdatomic.h _Atomic int or C++11 std::atomic<> if you never need to point a plain int* at it.
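A minimal C11 sketch of that last option (shared_idx is a hypothetical name for the question's shared index):
#include <stdatomic.h>

_Atomic int shared_idx;   // no plain int* can point at this

void publish(int idx)
{
    // Plain assignment to an _Atomic object is seq_cst, which costs
    // barriers on ARM; an explicit relaxed store avoids them.
    atomic_store_explicit(&shared_idx, idx, memory_order_relaxed);
}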
dmb ishst is at least a StoreStore barrier, so in asm you could get release semantics with respect to earlier stores, but not earlier loads. That isn't sufficient for std::memory_order_release aka __ATOMIC_RELEASE (which also requires LoadStore ordering), so there's no way to get a compiler to emit it for you. (None of the ops or fences in https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html map to that.)
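If StoreStore ordering really is all the producer needs, you could hand-roll it with GNU C inline asm. This is only a sketch, not something a compiler will generate for you; it assumes a core where dmb ishst acts as a StoreStore barrier (ARMv7-A), and reuses the hypothetical buffer / shared_idx names from above:
buffer[idx] = xyz;
asm volatile("dmb ishst" ::: "memory");  // orders earlier stores before later stores,
                                         // but NOT earlier loads: weaker than release
atomic_store_explicit(&shared_idx, idx, memory_order_relaxed);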
So unfortunately, on ARMv7 and earlier you need a full barrier (dmb ish) for any standard C / C++ memory_order other than relaxed. ARMv8 fixed that.
With -mcpu=cortex-a53 or other ARMv8 CPUs, stl is available as a release-store (and lda as an acquire-load) even in AArch32 state. So use those to avoid an expensive dmb ish full barrier for release stores or acquire loads: https://godbolt.org/z/1hzvGMbon
# GCC -O2 -mcpu=cortex-a53 (or -march=armv8-a)
Test2(int*):
movs r3, #5
stl r3, [r0] // release store
bx lr
Single-core systems
On your single-core Cortex M4, all "threads" will run on the same core, so run-time memory reordering isn't possible. An interrupt leading to a context-switch is equivalent to a signal handler in the C11 / C++11 memory models.
You can use atomic_signal_fence to roll your own same-core acquire / same-core release for relaxed loads/stores.
#include <stdatomic.h>

extern int buffer[];            // plain, non-atomic data
extern _Atomic int shared_idx;  // index published from writer to reader

// writer
buffer[idx] = xyz;
atomic_signal_fence(memory_order_release); // prevent compile-time reordering, no run-time cost
atomic_store_explicit(&shared_idx, idx, memory_order_relaxed);

// reader
int idx = atomic_load_explicit(&shared_idx, memory_order_relaxed);
atomic_signal_fence(memory_order_acquire); // prevent compile-time reordering, no run-time cost
int tmp = buffer[idx];
Porting such code to multi-core by changing atomic_signal_fence to atomic_thread_fence is safe, but worse for performance on some ISAs, notably ARMv8, where a separate barrier instruction is expensive but a release-store operation can just use stl.
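For instance, the writer from the single-core sketch above, made multi-core safe with a release store instead of a separate fence:
// multi-core-safe writer: the release store orders buffer[idx] before
// the index publish at run time too; on ARMv8 (even AArch32) it can
// compile to a single stl rather than dmb ish + str.
buffer[idx] = xyz;
atomic_store_explicit(&shared_idx, idx, memory_order_release);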
Answered By - Peter Cordes