
Friday, January 12, 2007

Tricks of the Trade - Collocation

Way, way back I explained what "fast" means with regard to synchronization objects. Briefly, a fast object is one that executes entirely in user mode, save for two cases: when a thread needs to go to sleep, and when a thread needs to be woken up.

But that isn't fast enough for some people. As discussed previously, fast objects work by using atomic instructions and/or memory barriers. While atomic instructions in and of themselves aren't THAT bad, speed-wise (in the best case taking only about 20 cycles), on some architectures (like x86) they imply either a write barrier or a full barrier. By definition these are constriction points, which bring the various parts of the system into sync, and require waiting for all pending writes/operations to complete. In the best case (no queued I/O) this may be almost instant. In the worst case (many queued I/O accesses), it may take hundreds of cycles. Even though this only applies to one CPU/core, that's a pretty large amount of dead time (for example, you might be able to call malloc in that time).

But is it possible to avoid the use of atomic operations altogether? Well, yes and no. It's impossible to have a multiprocessing system that shares resources completely without atomic operations (well, I suppose you could, but it would be hideously ugly and slow). There must be some way to synchronize code or data between threads and processors - atomic operations provide this ability.

However, in very specialized circumstances (and only from some threads), we can get around using atomic operations. One way to achieve this is called collocation. Collocation is the storage of two (or more) pieces of data - one accessed by one thread, the other accessed by another thread - inside a single word. A thread writes its own piece with a small (sub-word) write, then reads the entire word at once.

Collocation relies on data dependency to cause a partial processor stall, without the need for a full memory barrier. Most processors are self-consistent, with instruction ordering rules stating that a write to a memory location smaller than a word must be committed to memory (actually to cache, but CPU caches are typically coherent, meaning all CPUs will see modifications to the cache of one CPU) before a larger read from that same location. This can be done using nonatomic operations, and does not require a memory barrier - it forces ordering on only those two instructions, while allowing the processor to optimally reorder and buffer all other instructions. In the worst case this causes a full pipeline flush (20-30 cycles), and it may cost less - still significantly cheaper than the worst case of a full memory barrier.

However, this method is not an exact substitute for atomic instructions. It only ensures that the writing and reading threads will see each other's writes - it does not provide a way for either thread to know in what order the writes occurred, the way atomic instructions can. So, what use is it? Well, if you apply some creativity, it's still useful in some cases. To take advantage of collocation, the threads sharing data must be fundamentally asymmetric - each thread must write to a separate field in the word, and only one thread may use collocation to avoid atomic operations (the other threads must use atomic operations). This implies where collocation is useful: when one thread performs the large majority of the writes, and other threads make only occasional writes.
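To make this concrete, here's a minimal sketch of the pattern. The layout, names, and fast-path logic are my own illustration (not from any particular library), and it assumes the coherent-cache, sub-word-write-then-wide-read ordering described above:
C/C++ - Code
#include <stdint.h>

// Two single-byte fields collocated in one 16-bit word. The "fast"
// thread owns 'mine'; other threads may only touch 'theirs', and they
// must do so with atomic operations (not shown here).
union Collocated
{
    struct
    {
        uint8_t mine;   // written only by the fast thread
        uint8_t theirs; // written (atomically) by the other threads
    } fields;
    uint16_t whole;     // the entire word, read as a single unit
};

// Fast-path check with no atomic instruction and no memory barrier.
// The sub-word write followed by the wider read of the same location
// forces the CPU to order the two (at worst a pipeline flush), so the
// snapshot is guaranteed to contain our own write alongside whatever
// the other threads had already made visible.
bool FastPathCheck(volatile Collocated &c)
{
    c.fields.mine = 1;        // small write to our own field

    Collocated snapshot;
    snapshot.whole = c.whole; // wide read of the entire word

    return snapshot.fields.theirs == 0; // has another thread signaled?
}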

This is just an introduction to collocation as a lock-free structure tool; the posts on super-fast mutexes and condition variables with timeout will give some real-world examples of how atomic operations can be avoided using collocation.

Monday, January 08, 2007

Tools of the Trade - Atomic Primitives

In the wake of the recent memory barriers series, today we continue looking at topics related to multiprogramming by taking a survey of the various atomic primitives. This particular topic is more for informational value than practical use (at least as far as LibQ is concerned, which does not support all of these, so you do not need to know the differences).

Atomic primitives are arguably the most important tool in programming that requires coordination between multiple threads and/or processors/cores. There are four basic types of atomic primitives: swap, fetch and phi, compare and swap, and load linked/store conditional. Usually these operations are performed on values the same size as or smaller than a word (the size of the processor's registers), though sometimes consecutive (single address) multiword versions are provided.

The most primitive (pun not intended) is the swap operation. This operation exchanges a value in memory with a value in a register, atomically setting the value in memory and returning the original value in memory. This is not actually very useful, in terms of multiprogramming, with only a single practical use: the construction of test and set (TAS; or test and test and set - TATAS) spinlocks. In these spinlocks, each thread attempting to enter the spinlock spins, exchanging 1 into the spinlock value in each iteration. If 0 is returned, the spinlock was free, and the thread now owns the spinlock; if 1 is returned, the spinlock was already owned by another thread, and the thread attempting to enter the spinlock must keep spinning. Ultimately, the owning thread leaves the spinlock by setting the spinlock value to 0.
C/C++ - Code
// ATOMIC marks pseudocode whose body executes as a single atomic operation
ATOMIC word swap(volatile word &value, word newValue)
{
    word oldValue = value;
    value = newValue;
    return oldValue;
}

void EnterSpinlock(volatile word &value)
{
    // Test and test and set: spin on a plain read, and attempt the
    // (expensive) atomic swap only when the lock looks free
    while (value == 1 || swap(value, 1) == 1) {}
}

void LeaveSpinlock(volatile word &value)
{
    value = 0;
}
This makes for an extremely crude spinlock. The fact that all threads share a single value means that spinning can create a lot of cache coherency traffic, as all spinning processors write to the same address, continually invalidating each other's caches. Furthermore, this mechanism precludes any kind of order preservation, as it's impossible to distinguish when a given thread began waiting; threads may enter the spinlock in any order, regardless of how long each has been waiting.

Next up the power scale is the fetch and phi family. All members of this family follow the same basic process: a value is atomically read from memory, modified by the processor, and written back, with the original value returned to the program. The modification performed can be almost anything; one of the most useful modifications is the add operation (in this case it's called fetch and add). The fetch and add operation is notably more useful than the swap operation, but is still less than ideal; in addition to test and set spinlocks, fetch and add can be used to create thread-safe counters, and spinlocks that both preserve order and (potentially) greatly reduce cache coherency traffic.
C/C++ - Code
ATOMIC word fetch_and_add(volatile word &value, word addend)
{
    word oldValue = value;
    value += addend;
    return oldValue;
}

// Ticket-style spinlock: each thread takes a sequence number on entry,
// then spins until the currently served number reaches its own. Threads
// enter in arrival order, and each spins on a plain read.
void EnterSpinlock(volatile word &seq, volatile word &cur)
{
    word mySeq = fetch_and_add(seq, 1);
    while (cur != mySeq) {}
}

void LeaveSpinlock(volatile word &cur)
{
    // Only the spinlock owner writes cur, so a plain increment suffices
    cur++;
}
Next is the almost ubiquitous compare and swap (CAS) operation. In this operation, a value is read from memory, and if it matches a comparison value in the processor, a third value in the processor is atomically written to the memory location; in either case, the original value in memory is returned to the program. This operation is very popular because it allows you to perform almost any operation on a single memory location atomically, by reading the value, modifying it, and then compare-and-swapping it back to memory. For this reason, it is considered to be a universal atomic primitive.
C/C++ - Code
// Writes newValue to memory only if the current value matches the
// comparand; in all cases, returns the value that was in memory
ATOMIC word compare_and_swap(volatile word &value, word newValue, word comparand)
{
    word oldValue = value;
    if (oldValue == comparand)
        value = newValue;
    return oldValue;
}
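To illustrate that read-modify-CAS pattern, here's a sketch of an atomic add built on nothing but the compare_and_swap above:
C/C++ - Code
// Atomically add to a value using only compare and swap: read the
// value, compute the new value, then try to swap it in. If another
// thread changed the value in the meantime, the comparison fails and
// we retry from the freshly observed value.
word cas_add(volatile word &value, word addend)
{
    for (;;)
    {
        word oldValue = value;
        if (compare_and_swap(value, oldValue + addend, oldValue) == oldValue)
            return oldValue; // the swap took effect
        // someone else got in first; loop and try again
    }
}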
Some architectures (such as the x86) support double-width compare and swap (atomic compare and swap of two consecutive words). While this is convenient, it should not be relied upon in portable software, as many architectures do not support it. Note that double-width compare and swap is NOT the same as double compare and swap (which we'll look at soon).

Another universal atomic primitive is the load linked/store conditional (LL/SC) pair of instructions. In this model, when a value is load linked into a register, a reservation is placed on its memory address. If the reservation still exists when the store conditional is performed, the store succeeds. If another thread writes to the memory location between the load linked and the store conditional, the reservation is cleared and the store conditional fails, skipping the store altogether. If the store conditional fails, the program must loop back and retry the operation, beginning with the load linked.
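In pseudocode the pattern looks like this (load_linked and store_conditional are stand-ins for the real instructions - lwarx/stwcx. on the PowerPC, for example):
C/C++ - Code
// Hypothetical wrappers around the underlying instructions
word load_linked(volatile word &value);                      // load and reserve the address
bool store_conditional(volatile word &value, word newValue); // store only if the reservation survived

// Atomic add built from LL/SC: if any other thread wrote to the
// location between our load and our store, the store conditional
// fails (no store occurs) and we loop back to try again
word llsc_add(volatile word &value, word addend)
{
    word oldValue;
    do
    {
        oldValue = load_linked(value);
    } while (!store_conditional(value, oldValue + addend));
    return oldValue;
}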

In theory, the load linked/store conditional primitive is superior to the compare and swap operation. As any access to the memory address will clear the reservations of other processors, the total number of reads is 1, in contrast to the 2 reads with compare and swap (1 read to get the value initially, and a second read to verify the value is unchanged). Furthermore, the LL/SC operation may distinguish whether a write has occurred, regardless of whether the value was changed by the write (we'll come back to why this is a good thing when we discuss hazard pointers). Unfortunately, my research indicates that most implementations of LL/SC do not guarantee that a write will be distinguished if the written value is the same as the value at the time of the reserve.

Finally, the wild card: the double compare and swap (DCAS). In a double compare and swap, the compare and swap operation is performed simultaneously on the values at two memory addresses. Obviously this provides dramatically more power than any of the previous operations, which operate only on a single address. Unfortunately, support for this primitive is extremely rare in real-world processors, and it is typically used only by lock-free algorithm designers who are unable to reduce their algorithm to single-address compare and swap operations.
C/C++ - Code
ATOMIC void double_compare_and_swap(volatile word &value1, word newValue1, word comparand1, word &oldValue1,
                                    volatile word &value2, word newValue2, word comparand2, word &oldValue2)
{
    oldValue1 = value1;
    oldValue2 = value2;
    // Both comparisons must succeed for either location to be written
    if (oldValue1 == comparand1 && oldValue2 == comparand2)
    {
        value1 = newValue1;
        value2 = newValue2;
    }
}
Lastly, note that I'm using a rather loose classification of these primitives. Technically you could place the swap (fetch and store) operation in the fetch and phi family as well, but it seemed more intuitive to me to separate them.

Sunday, December 31, 2006

Memory Barriers 3 - The LibQ Way

Okay, so those are the constraints within which LibQ must work. This brings me to my first critical design decision regarding memory barriers: LibQ will use hand-coded assembly functions, called through pointers (transparently - the client program never sees the indirection), allowing the fastest option to be chosen at run time based on the system.

Okay, that was fairly easy. The second decision took a lot more thinking to reach a conclusion: what to do with respect to atomic functions? There are at least three fairly obvious options. The first is to provide the weakest possible guarantee: that no atomic function is a memory barrier. This forces users to insert memory barriers manually, so the correct behavior is achieved even on architectures where atomic instructions are not memory barriers.

However, evaluating this possibility suggests that it is not a good option. While non-barrier atomic operations run at maximum speed, particularly on weakly ordered architectures like the PowerPC, adding extra memory barriers on architectures like the x86 quickly kills performance. Consider the worst-case scenario: an atomic operation that must be a full sandwich barrier, on a Pentium 3 (which has no full memory barrier instruction, and so must use a serializing instruction) - perhaps 100 cycles for the LOCK, maybe another 150 cycles to flush buffered writes, and two CPUID instructions at about 65 cycles each; all together, that's about 380 cycles. This could be reduced by about a third by leaving out the CPUIDs, as LOCKed instructions are full barriers on the Pentium 3.

Alternately, we could provide the strongest possible guarantee: that all atomic instructions are full sandwich barriers. This means that users don't need to worry about putting memory barriers around atomic instructions at all, and ensures barrier safety.

However, this also is not a particularly good solution for a cross-platform library like LibQ. This strategy is optimal for the Pentium 3, where that's exactly how atomic instructions work. On the Pentium 4 it's only slightly worse, as two MFENCE (or LFENCE) instructions would add only a couple dozen cycles to the atomic operation and memory flush. However, it could be bad on weakly ordered architectures like the PowerPC. Memory barrier instructions there are comparable in cost to a CPUID, and at least one of them is unnecessary most of the time (you rarely need a full sandwich barrier around an atomic operation).

The third option is to make special versions of the LibQ atomic functions for each variation. While this has the advantage of providing maximum speed on all architectures (and removing function call overhead wherever possible), it would be a huge pain in the ass for me and anyone else who writes all the functions. Let's think about this: 3 types of memory barriers, times 3 barrier configurations, times how many atomic functions? That's like 50 or 100 different functions for each platform. Mr. Flibbles doesn't much care for that.

So, what should I do? Fortunately, there's a variation on one of these (or perhaps a hybrid of two of them) that is both easy for me to code and efficient with respect to CPU cycles (not 100% optimal, but tolerably close): option number one. Atomic instructions are not guaranteed to be memory barriers of any kind. HOWEVER, there are special memory barrier functions used to promote atomic instructions to memory barriers.

This requires only 6 functions - 3 types of barriers times 2 configurations (sandwich is just top + bottom barriers). Using this system, each function can do real work where necessary and be no-oped where it isn't, providing (almost) maximum speed. The only waste is the overhead of one or two function calls; my benchmarks suggest this is a minimal increase, not more than 10 cycles per function - 4 to 12% waste for two barriers, when neither is necessary.
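In rough outline, the interface looks something like this (the names are made up for this post, not necessarily what LibQ will end up calling them):
C/C++ - Code
// 3 barrier types x 2 positions = 6 promotion functions. On an
// architecture where atomic instructions already provide the given
// barrier (e.g. LOCKed instructions on the Pentium 3), the matching
// function is a no-op; elsewhere, it emits a real barrier.
void AtomicTopReadBarrier();
void AtomicBottomReadBarrier();
void AtomicTopWriteBarrier();
void AtomicBottomWriteBarrier();
void AtomicTopFullBarrier();
void AtomicBottomFullBarrier();

// Example: promote a bare atomic swap (the swap primitive from the
// atomic primitives post above) to a full sandwich barrier
word AtomicSwapSandwich(volatile word &value, word newValue)
{
    AtomicTopFullBarrier();                // nothing above may move below the swap
    word oldValue = swap(value, newValue); // not a barrier by itself
    AtomicBottomFullBarrier();             // nothing below may move above the swap
    return oldValue;
}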

Wednesday, December 13, 2006

Memory Barriers 2

Subtitle: Speeding Up, Slowing Up, Maybe Even Blowing Up

It has always been my intention to fully support memory barriers in LibQ. However, as LibQ is mainly something I work on when I'm feeling inspired, it shouldn't be too surprising that some features take a while to get implemented (especially at the times when I don't feel like coding anything for a few months). But while lack of interest on my part may have been the biggest factor, there was also the fact that I hadn't, until recently, decided on the semantics of LibQ memory barriers.

Stand-alone memory barriers (memory barriers in the middle of a sequence of normal code) are (mostly) straightforward, both in concept and in implementation. However, the exact method of achieving a memory barrier differs drastically by architecture. For example, the x86 guarantees that stores will be performed in order, so write barriers are not needed at all (or you could use an SFENCE instruction, if you want to play it safe and assume that this guarantee of write ordering won't be around forever). However, the x86 had no read barrier or full barrier instructions prior to the Pentium 4 (which introduced LFENCE and MFENCE, respectively); the conventional wisdom has been to use a serializing instruction wherever these are needed (CPUID seems like a fairly good candidate on the Pentium 4, while LOCK CMPXCHG seems faster on other CPUs).
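For reference, here's roughly what those options look like in GCC-style inline assembly (my own sketch; the "memory" clobbers double as compiler barriers):
C/C++ - Code
// Write barrier: x86 already orders stores, so the compiler barrier
// alone would do, but SFENCE is the play-it-safe option
inline void WriteBarrier()
{
    __asm__ __volatile__ ("sfence" ::: "memory");
}

// Full barrier on the Pentium 4 and later
inline void FullBarrierP4()
{
    __asm__ __volatile__ ("mfence" ::: "memory");
}

// A serializing instruction as a full barrier (the CPUID option).
// CPUID trashes eax/ebx/ecx/edx, so they're declared as outputs.
inline void FullBarrierCpuid()
{
    unsigned a, b, c, d;
    __asm__ __volatile__ ("cpuid"
                          : "=a"(a), "=b"(b), "=c"(c), "=d"(d)
                          : "a"(0)
                          : "memory");
}

// The LOCKed alternative (in the spirit of the LOCK CMPXCHG mentioned
// above): a LOCKed operation on a dummy location is also a full barrier
inline void FullBarrierLocked()
{
    int dummy = 0;
    __asm__ __volatile__ ("lock; addl $0, %0" : "+m"(dummy) :: "memory");
}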

The Alpha deserves special mention, as it does something that, as best I and the Linux people can tell, is truly novel (thank God). For reasons that appear to be a consequence of its split cache model (multiple cache banks per core, which may be locked separately to avoid memory stalls), the Alpha does not have true memory barriers. While its memory barriers order instructions for the issuing CPU, there is no guarantee that the access order will hold on other processors. Amazingly (and not in a good way), this even applies to pointer dereferences: if you set a shared pointer after initializing the structure it points to, another CPU could get the new pointer, yet pull uninitialized data through it. This means that not only must the writer have a write barrier between two writes that must appear in order, but the reader must also have a read barrier between the reads of those two variables; the read barrier ensures that the caches are brought into sync on the reading CPU.
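In code, the pattern that bites you looks like this (a sketch; wmb and rmb are placeholders for whatever write and read barrier functions the platform provides):
C/C++ - Code
void wmb(); // placeholder: write barrier
void rmb(); // placeholder: read barrier

struct Thing { int a, b; };
Thing *volatile sharedPtr = 0; // shared between writer and reader

// Writer: initialize the structure, then publish the pointer
void Publish(Thing *t)
{
    t->a = 1;
    t->b = 2;
    wmb();         // write barrier: initialization before publication
    sharedPtr = t;
}

// Reader: on most CPUs the data dependency (reading through the
// pointer) is ordering enough, but on the Alpha a read barrier is
// needed between loading the pointer and dereferencing it - without
// it, the reader can see the new pointer yet read stale fields
int Consume()
{
    Thing *t = sharedPtr;
    if (t == 0)
        return -1; // nothing published yet
    rmb();         // read barrier: required on the Alpha
    return t->a;   // guaranteed initialized only after the rmb
}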

Moving tangentially: different architectures have different semantics for their atomic operations. On the Pentium 3 (and other Intel CPUs before the Pentium 4), LOCKed atomic operations (required to be truly atomic with multiple cores and/or CPUs) are serializing instructions (effectively full barriers). In the Pentium 4, however, they have been demoted to simply being write barriers. As best as I can tell, on the PowerPC atomic instructions aren't memory barriers at all. Nor are x86 atomic instructions that do not use LOCK (not using LOCK is much faster, but only atomic with respect to a single core/CPU).

This is made even more complicated by the fact that there are three possible configurations for a memory barrier, with respect to an atomic operation. In the acquire configuration (think of acquiring a mutex), the memory barrier appears to sit just after the atomic operation: the atomic operation may be executed earlier than where it appears, but no access after it (inside the mutex) may be executed before it. In the release configuration, the barrier appears to sit just before the atomic operation: accesses after the atomic operation may be executed before it, but the atomic operation cannot occur before any access in front of it (inside the mutex). Finally, there's the sandwich configuration, where accesses are serialized around the atomic operation, as if there were one barrier before the operation and one after: the atomic operation cannot be executed before any preceding access, and no subsequent access may be executed before the atomic operation.
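Mapped onto a mutex (using the swap-based spinlock from the atomic primitives post above, with hypothetical barrier_before/barrier_after placeholders), the first two configurations look like this:
C/C++ - Code
void barrier_before(); // placeholder: barrier preceding the operation
void barrier_after();  // placeholder: barrier following the operation

// Acquire configuration: the barrier sits just after the atomic
// operation, so nothing inside the critical section can float up
// above the point where the lock is taken
void MutexLock(volatile word &lock)
{
    while (swap(lock, 1) == 1) {} // take the lock
    barrier_after();              // acquire
}

// Release configuration: the barrier sits just before the operation
// that releases the lock, so nothing inside the critical section can
// sink down below the unlock
void MutexUnlock(volatile word &lock)
{
    barrier_before();             // release
    lock = 0;                     // release the lock
}

// The sandwich configuration is simply both at once: one barrier
// before and one after the atomic operation.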