Okay, so those are the constraints within which LibQ must work. This brings me to my first critical design decision regarding memory barriers: LibQ will use hand-coded assembly functions, and client programs will call these functions through pointers (well, the functions will indeed be called through pointers, but this is transparent to the client program), allowing the fastest option to be chosen at run time based on the system.
Okay, that was fairly easy. This second decision took a lot more thinking through to reach a conclusion: what to do with respect to atomic functions? There are at least three options that are fairly obvious. The first is to provide the weakest possible guarantee: that all atomic functions are not memory barriers. This ensures that users will use memory barriers manually, and even on architectures where atomic instructions are not memory barriers the correct behavior is achieved.
However, evaluation of this possibility suggests that it is not a good option. While non-barrier atomic operations will run at maximum speed, particularly on weak architectures like PowerPC, adding in additional memory barriers on architectures like the x86 quickly kills performance. Consider the worst case scenario: an atomic operation that must have a full sandwich barrier, on a Pentium 3 (which has no full memory barrier instruction, and so must use a serializing instruction) - perhaps 100 cycles for the LOCK, another maybe 150 cycles to flush buffer contents, and two CPUID instructions at about 65 cycles each; all together, that's 385 cycles. This could be reduced by about 1/3 by leaving out the CPUIDs, as LOCK instructions are full barriers on the Pentium 3.
Alternately, we could provide the strongest possible guarantee: that all atomic instructions are full sandwich barriers. This means that users don't need to worry about putting memory barriers around atomic instructions at all, and ensures barrier safety.
However, this also is not a particularly good solution for a cross-platform library like LibQ. This strategy is optimal for the Pentium 3, where that's exactly how atomic instructions work. On Pentium 4 it's only slightly worse, as two MFENCE (or LFENCE) instructions would only add only a couple dozen cycles to the atomic operation and memory flush. However, this could be bad on weaker architectures like PowerPC. Memory barrier instructions there are comparable to the cost of a CPUID, and at least one is unnecessary most of the time (not often do you need a full sandwich barrier around an atomic operation).
The third option is to make special versions of LibQ atomic functions for each variation. While this has the advantage of providing maximum speed on all architectures (and removing function call overhead wherever possible), it would be a huge pain in the ass for me and anyone else who writes all the functions. Let's think about this: 3 types of memory barriers, 3 barrier configurations, times how many atomic function? That's like 50 or 100 different functions for each platform. Mr. Flibbles doesn't much care for that.
So, what should I do? Fortunately, there's a variation on one of these (or perhaps a hybrid of two of them) that is both easy for me to code and efficient with respect to CPU cycles (not 100% optimal, but tolerably close): option number one. Atomic instructions are not guaranteed to be memory barriers of any kind. HOWEVER, there are special memory barrier functions used to promote atomic instructions to memory barriers.
This requires only 6 functions - 3 different types of barriers, 2 configurations (sandwich is just top + bottom barriers). Using this system, these functions may function as necessary, and be no-oped as necessary, to provide (almost) maximum speed. The only waste is the overhead of one or two functions calls; my benchmarks suggest that this is a minimal increase, not taking more than 10 cycles per function - 4 to 12% waste for two barriers, when neither are necessary.