Subtitle: Speeding Up, Slowing Up, Maybe Even Blowing Up
It has always been my intention to fully support memory barriers in LibQ. However, as LibQ is mainly something I work on when I'm feeling inspired, it shouldn't be too surprising that some features take a while to get implemented (especially at the times when I don't feel like coding anything for a few months). But while lack of interest on my part may have been the biggest factor, there was also the fact that I hadn't, until recently, decided on the semantics of LibQ memory barriers.
Stand-alone memory barriers (memory barriers in the middle of a sequence of normal code) are (mostly) straightforward, both in concept and in implementation. However, the exact method of achieving a memory barrier differs drastically by architecture. For example, the x86 guarantees that stores will be performed in order. As such, write barriers are not needed at all (or you could use an SFENCE instruction, if you want to play it safe and assume that this guarantee of write ordering won't be around forever). However, the x86 has no read barrier or full barrier instructions prior to the Pentium 4 (LFENCE and MFENCE instructions, respectively); the conventional wisdom has been to use a serializing instruction whenever these functions are needed (CPUID seems like a fairly good candidate on Pentium 4, while LOCK CMPXCHG seems faster on other CPUs).
The Alpha deserves special mention, as it does something that, as best I and the Linux people know, is truly novel (thank God). For reasons that appear to be a consequence of the split cache model (having multiple cache banks per core, which may be locked separately to avoid memory stalls), the Alpha does not have true memory barriers. While its memory barriers will order instructions for that CPU, there are no guarantees that the access order will hold on other processors. Amazingly (and not in a good way), this even applies to pointer dereferences; that is, if you set a shared pointer after initializing the structure, it's possible that another CPU could get the new pointer, yet pull uninitialized data through the pointer. This means that not only must you have a write barrier between two writes if they must appear in order, but you must also have a read barrier between the reads of those two variables; the read barrier will ensure that the caches will be brought into sync on the reading CPU.
Moving tangentially, different architectures have different semantics for their atomic functions. On a Pentium 3 (and other Intel CPUs before Pentium 4), LOCKed atomic operations (required to be truly atomic with multiple cores and/or CPUs) are synchronizing instructions (effectively full barriers). However, in Pentium 4 they have been demoted to simply being write barriers. As best as I can tell, on the PowerPC atomic instructions aren't memory barriers at all. Nor are atomic x86 instructions that do not use LOCK (not using LOCK is much faster, but only atomic with respect to a single core/CPU).
This is made even more complicated by the fact that there are three possible configurations for a memory barrier to be in, with respect to an atomic operation. In the acquire configuration (think of acquiring a mutex), the memory barrier appears to be just after the atomic operation - the atomic operation can be executed earlier than where it appears, but no I/O after it (inside the mutex) may be executed before it. In the release configuration, the barrier appears to be just before the atomic instruction - I/O after the atomic operation may be executed before the atomic operation, but the atomic operation cannot occur before any I/O in front of it (inside the mutex). Finally, there's the sandwich configuration, where instructions are serialized around the atomic operation, as if there was one barrier before the operation, and one after - the atomic operation cannot be executed before any preceding I/O, and no subsequent I/O may be executed before the atomic operation.