The original code applies a workaround to compilers that don't need it,
Indeed, those compilers don't need it.
On the other hand, the atomic operations with compiler-only barriers, which you describe as a "workaround," *are* correct for both old and new compilers AFAIK.
so yeah arguably I think this improves correctness.
Correctness is a "yes or no" question. If the code is already correct, you cannot further improve its correctness.
Maybe you actually meant future compatibility--whether it will stay correct in the future. This requires that the following assumptions continue to hold:
1. The atomic macros assume that volatile memory access (load / store) operations emitted by the compiler implicitly have both acquire and release semantics on all x86 implementations (microarchitectures) that GCC and clang target. Exceptions are SSE non-temporal memory operations, which are not cache-coherent and thus cannot be used for implementing C-like generic memory operations in the first place.
2. The atomic macros assume that `__asm__ __volatile__("" ::: "memory");` is a compiler memory barrier. It's safe to assume that this will stay true in the future, since the Linux kernel also uses this syntax.
Besides that, I am pretty sure load acquire/store release means more than `__asm__ __volatile__( "" ::: "memory" );`, which prevents compiler reordering but not processor reordering.
A barrier (let it be a compiler-only or hardware instruction) by itself has nothing to do with Load-Acquire or Store-Release. It is the combination of memory accesses and barriers that can implement either semantics of the abstract memory model.
If you believe there's more to acquire/release, would you provide an example?
Granted it's not needed on x86,
Right. On non-x86 architectures, we assume newer GCC.
but I think it's also an argument for correctness.
I'm not sure if this stands as a valid argument for corrrectness.