Way, way back I explained what "fast" means with regard to synchronization objects. Briefly, a fast object is one that executes entirely in user mode, save for two cases: when a thread needs to go to sleep, and when a thread needs to be woken up.
But that isn't fast enough for some people. As discussed previously, fast objects work by using atomic instructions and/or memory barriers. While atomic instructions in and of themselves aren't THAT bad, speed-wise (in the best case taking only about 20 cycles), on some architectures (like x86) they imply either a write barrier or a full barrier. By definition these are constriction points, which bring the various parts of the system into sync, and require waiting for all pending writes/operations to complete. In the best case (no queued I/O) this may be almost instant. In the worst case (many queued I/O accesses), it may take hundreds of cycles. Even though this only applies to one CPU/core, that's a pretty large amount of dead time (for example, you might be able to call malloc in that time).
But is it possible to avoid the use of atomic operations altogether? Well, yes and no. It's impossible to have a multiprocessing system that shares resources completely without atomic operations (well, I suppose you could, but it would be hideously ugly and slow). There must be some way to synchronize code or data between threads and processors - atomic operations provide this ability.
However, in very specialized circumstances (and only from some threads), we can get around using atomic operations. One way to achieve this is called collocation. Collocation is the storage of two (or more) pieces of data - one accessed by one thread, the other accessed by another thread - inside a single word. A thread writes its own field with a small (sub-word) write, then reads the entire word at once.
Collocation relies on data dependency to cause a partial processor stall without the need to perform a full memory barrier. Most processors are always self-consistent and have instruction ordering rules that state that a write to a memory location smaller than a word must be committed to memory (actually, cache, but CPU caches are typically coherent, meaning all CPUs will see modifications to the cache of one CPU) before a larger read from that same location. This can be done using nonatomic operations, and does not require a memory barrier - it forces ordering only on two instructions, while allowing the processor to optimally reorder and buffer all other instructions. In the worst case this causes a full pipeline flush (20-30 cycles), and it may take less; even that worst case is significantly faster than a full memory barrier.
However, this method is not an exact substitute for atomic instructions. It only ensures that both the writing and reading threads will see the writes made by each other - it does not provide a way for either thread to know in what order the writes occurred, the way atomic instructions can. So, what use is it? Well, if you apply some creativity, it's still useful in some cases. To be able to take advantage of collocation, the threads sharing data must be fundamentally asymmetric - each thread must write to a separate field in the word, and only one thread may use collocation to avoid atomic operations (the other threads must use atomic operations). This implies where collocation is useful: when one thread performs a large portion of the writes, and other threads only make occasional writes.
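To make the idea concrete, here's a minimal sketch in C++ (GCC-style, little-endian assumed; the type and function names are invented for illustration, not taken from any library). One "fast" thread owns the low half of a 32-bit word and updates it with a plain 16-bit store followed by a 32-bit load of the whole word; every other thread must fall back to a real atomic compare-and-swap to update its half.

    #include <stdint.h>

    // One 32-bit word holding two collocated 16-bit fields.
    union CollocatedWord
    {
        volatile uint32_t whole;        // always read as one unit
        struct
        {
            volatile uint16_t mine;     // written only by the fast thread
            volatile uint16_t theirs;   // written only by other threads
        } parts;
    };

    // Fast thread: small write, then a larger read of the same location.
    // Self-consistency forces the 16-bit store to be committed before the
    // 32-bit load is satisfied - no atomic instruction or barrier needed.
    uint32_t fast_thread_update(CollocatedWord &w, uint16_t value)
    {
        w.parts.mine = value;   // plain 16-bit store
        return w.whole;         // sees our new value plus the other field
    }

    // Any other thread: must use a genuine atomic operation (here a GCC
    // __sync builtin) to update its half of the word.
    void other_thread_update(CollocatedWord &w, uint16_t value)
    {
        uint32_t oldw, neww;
        do
        {
            oldw = w.whole;
            neww = (oldw & 0x0000FFFFu) | ((uint32_t)value << 16);
        } while (!__sync_bool_compare_and_swap(&w.whole, oldw, neww));
    }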
This is just an introduction to collocation as a lock-free structure tool; the posts on super-fast mutexes and condition variables with timeout will give some real-world examples of how atomic operations can be avoided using collocation.
Friday, January 12, 2007
Sunday, December 31, 2006
Memory Barriers 3 - The LibQ Way
Okay, so those are the constraints within which LibQ must work. This brings me to my first critical design decision regarding memory barriers: LibQ will use hand-coded assembly functions, and client programs will call them through pointers (the indirection is transparent to the client program), allowing the fastest implementation to be chosen at run time based on the system.
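As a rough sketch of what that looks like (not LibQ's actual code - the function names and the detection step are invented for illustration), the library can detect the CPU once at startup and point a function pointer at the cheapest implementation available; x86 shown here:

    // Full memory barrier, selected at run time.
    typedef void (*BarrierFn)();

    // Pentium 4 and later: a real fence instruction (requires SSE2).
    static void barrier_mfence()
    {
        __asm__ __volatile__("mfence" ::: "memory");
    }

    // Older x86: fall back to a serializing instruction (CPUID).
    static void barrier_serializing()
    {
        int a, b, c, d;
        __asm__ __volatile__("cpuid"
                             : "=a"(a), "=b"(b), "=c"(c), "=d"(d)
                             : "a"(0)
                             : "memory");
    }

    // The pointer clients end up calling through (wrapped by the library).
    static BarrierFn q_full_barrier = barrier_serializing;

    void q_init_barriers(bool cpu_has_sse2)   // detection done elsewhere
    {
        if (cpu_has_sse2)
            q_full_barrier = barrier_mfence;
    }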
Okay, that was fairly easy. This second decision took a lot more thinking through to reach a conclusion: what to do with respect to atomic functions? There are at least three options that are fairly obvious. The first is to provide the weakest possible guarantee: that all atomic functions are not memory barriers. This forces users to issue memory barriers manually, so the correct behavior is achieved even on architectures where atomic instructions are not memory barriers.
However, evaluation of this possibility suggests that it is not a good option. While non-barrier atomic operations will run at maximum speed, particularly on weak architectures like PowerPC, adding the extra memory barriers on architectures like the x86 quickly kills performance. Consider the worst case scenario: an atomic operation that must have a full sandwich barrier, on a Pentium 3 (which has no full memory barrier instruction, and so must use a serializing instruction) - perhaps 100 cycles for the LOCK, maybe another 150 cycles to flush buffer contents, and two CPUID instructions at about 65 cycles each; all together, that's roughly 380 cycles. This could be reduced by about a third by leaving out the CPUIDs, as LOCK instructions are full barriers on the Pentium 3.
Alternately, we could provide the strongest possible guarantee: that all atomic instructions are full sandwich barriers. This means that users don't need to worry about putting memory barriers around atomic instructions at all, and ensures barrier safety.
However, this also is not a particularly good solution for a cross-platform library like LibQ. This strategy is optimal for the Pentium 3, where that's exactly how atomic instructions work. On the Pentium 4 it's only slightly worse, as two MFENCE (or LFENCE) instructions would add only a couple dozen cycles to the atomic operation and memory flush. However, this could be bad on weaker architectures like PowerPC. Memory barrier instructions there are comparable to the cost of a CPUID, and at least one is unnecessary most of the time (it's not often that you need a full sandwich barrier around an atomic operation).
The third option is to make special versions of the LibQ atomic functions for each variation. While this has the advantage of providing maximum speed on all architectures (and removing function call overhead wherever possible), it would be a huge pain in the ass for me and anyone else who writes all the functions. Let's think about this: 3 types of memory barriers, times 3 barrier configurations, times however many atomic functions? That's like 50 or 100 different functions for each platform. Mr. Flibbles doesn't much care for that.
So, what should I do? Fortunately, there's a variation on one of these (or perhaps a hybrid of two of them) that is both easy for me to code and efficient with respect to CPU cycles (not 100% optimal, but tolerably close): option number one. Atomic instructions are not guaranteed to be memory barriers of any kind. HOWEVER, there are special memory barrier functions used to promote atomic instructions to memory barriers.
This requires only 6 functions - 3 types of barriers times 2 configurations (a sandwich is just a top barrier plus a bottom barrier). Using this system, each function can be a real barrier where the architecture needs one and a no-op where it doesn't, providing (almost) maximum speed. The only waste is the overhead of one or two function calls; my benchmarks suggest this is minimal, no more than 10 cycles per call - 4 to 12% waste for two barriers when neither is necessary.
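Here's a sketch of how such promotion functions might be used around a trivial spinlock; the names below (QAtomicExchange and the two barrier calls) are hypothetical stand-ins, not LibQ's real API:

    // Hypothetical prototypes: one atomic op plus two of the six
    // barrier-promotion functions (full barrier, top and bottom positions).
    extern long QAtomicExchange(volatile long *target, long value);
    extern void QAtomicBottomFullBarrier();   // no-op where the atomic op is already a barrier
    extern void QAtomicTopFullBarrier();

    // Acquire: the barrier goes *after* the atomic operation ("bottom"),
    // so nothing inside the lock can be hoisted above the exchange.
    void spin_lock(volatile long *l)
    {
        while (QAtomicExchange(l, 1) != 0)
            ;   // spin until we swap 0 -> 1
        QAtomicBottomFullBarrier();
    }

    // Release: the barrier goes *before* the atomic operation ("top"),
    // so the exchange cannot drift above the writes made inside the lock.
    void spin_unlock(volatile long *l)
    {
        QAtomicTopFullBarrier();
        QAtomicExchange(l, 0);
    }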
Labels: atomicfunctions, libq, memorybarriers, programming, smp
Wednesday, December 13, 2006
Memory Barriers 2
Subtitle: Speeding Up, Slowing Up, Maybe Even Blowing Up
It has always been my intention to fully support memory barriers in LibQ. However, as LibQ is mainly something I work on when I'm feeling inspired, it shouldn't be too surprising that some features take a while to get implemented (especially at the times when I don't feel like coding anything for a few months). But while lack of interest on my part may have been the biggest factor, there was also the fact that I hadn't, until recently, decided on the semantics of LibQ memory barriers.
Stand-alone memory barriers (memory barriers in the middle of a sequence of normal code) are (mostly) straightforward, both in concept and in implementation. However, the exact method of achieving a memory barrier differs drastically by architecture. For example, the x86 guarantees that stores will be performed in order. As such, write barriers are not needed at all (or you could use an SFENCE instruction, if you want to play it safe and assume that this guarantee of write ordering won't be around forever). However, the x86 has no read barrier or full barrier instructions prior to the Pentium 4 (LFENCE and MFENCE instructions, respectively); the conventional wisdom has been to use a serializing instruction whenever these functions are needed (CPUID seems like a fairly good candidate on Pentium 4, while LOCK CMPXCHG seems faster on other CPUs).
The Alpha deserves special mention, as it does something that, as best I and the Linux people know, is truly novel (thank God). For reasons that appear to be a consequence of the split cache model (having multiple cache banks per core, which may be locked separately to avoid memory stalls), the Alpha does not have true memory barriers. While its memory barriers will order instructions for that CPU, there are no guarantees that the access order will hold on other processors. Amazingly (and not in a good way), this even applies to pointer dereferences; that is, if you set a shared pointer after initializing the structure, it's possible that another CPU could get the new pointer, yet pull uninitialized data through the pointer. This means that not only must you have a write barrier between two writes if they must appear in order, but you must also have a read barrier between the reads of those two variables; the read barrier will ensure that the caches will be brought into sync on the reading CPU.
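A sketch of the scenario just described (the barrier functions are placeholders for whatever the platform provides; on the Alpha the read barrier is mandatory, while on most other CPUs the data dependency alone would be enough):

    struct Widget { int a; int b; };

    extern void q_write_barrier();   // placeholder barrier functions
    extern void q_read_barrier();

    Widget *volatile g_widget = 0;   // shared pointer, initially null

    // Publisher: initialize the structure, then expose the pointer.
    void publish()
    {
        Widget *w = new Widget;
        w->a = 1;
        w->b = 2;
        q_write_barrier();   // initialization must be visible before the pointer
        g_widget = w;
    }

    // Consumer: on the Alpha, the read barrier between loading the pointer
    // and dereferencing it is required; without it the dependent loads may
    // return stale (uninitialized) data from an out-of-date cache bank.
    void consume()
    {
        Widget *w = g_widget;
        if (w != 0)
        {
            q_read_barrier();
            int sum = w->a + w->b;
            (void)sum;
        }
    }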
Moving tangentially, different architectures have different semantics for their atomic functions. On a Pentium 3 (and other Intel CPUs before Pentium 4), LOCKed atomic operations (required to be truly atomic with multiple cores and/or CPUs) are synchronizing instructions (effectively full barriers). However, in Pentium 4 they have been demoted to simply being write barriers. As best as I can tell, on the PowerPC atomic instructions aren't memory barriers at all. Nor are atomic x86 instructions that do not use LOCK (not using LOCK is much faster, but only atomic with respect to a single core/CPU).
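For the x86 case, the difference is literally one instruction prefix. A quick sketch with GCC inline assembly (the function names are mine, chosen for illustration):

    // Atomic across all CPUs/cores: the LOCK prefix asserts ownership of the
    // cache line (or bus) for the duration of the read-modify-write.
    int atomic_add_locked(volatile int *p, int v)
    {
        __asm__ __volatile__("lock; xaddl %0, %1"
                             : "+r"(v), "+m"(*p)
                             :
                             : "memory");
        return v;   // returns the value *p held before the add
    }

    // Atomic only with respect to the current CPU (e.g. against its own
    // interrupts); much cheaper, but unsafe for data shared between processors.
    int atomic_add_unlocked(volatile int *p, int v)
    {
        __asm__ __volatile__("xaddl %0, %1"
                             : "+r"(v), "+m"(*p)
                             :
                             : "memory");
        return v;
    }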
This is made even more complicated by the fact that there are three possible configurations for a memory barrier to be in, with respect to an atomic operation. In the acquire configuration (think of acquiring a mutex), the memory barrier appears to be just after the atomic operation - the atomic operation can be executed earlier than where it appears, but no I/O after it (inside the mutex) may be executed before it. In the release configuration, the barrier appears to be just before the atomic instruction - I/O after the atomic operation may be executed before the atomic operation, but the atomic operation cannot occur before any I/O in front of it (inside the mutex). Finally, there's the sandwich configuration, where instructions are serialized around the atomic operation, as if there was one barrier before the operation, and one after - the atomic operation cannot be executed before any preceding I/O, and no subsequent I/O may be executed before the atomic operation.
Labels: atomicfunctions, libq, memorybarriers, programming, smp
Tuesday, December 05, 2006
Memory Barriers
Well, I felt like writing something; however, this post might not turn out that great, as I've got a lot of material to cover, which I may or may not be able to elegantly separate into multiple topics (as it's too much to cover in one post).
Conceptually, computers execute instructions in single file. An instruction and its parameters (if applicable) are read from memory, the instruction is executed, and the results are written back to memory (if applicable). For some CPUs (that's all I can say, given my limited knowledge of various CPUs), such as the x86, that's exactly how they originally worked. However, the x86 has gotten dramatically more complex over the last ten years or so. The 486 (I'm giving these milestones to the best of my memory, but I can't say for certain they're all correct) added support for multiple x86 microprocessors to run in parallel in the same system. With the Pentium, a second arithmetic logic unit was added, allowing the processor to execute two instructions in parallel. The Pentium Pro added support for speculative memory reads (prefetching of data used by instructions not yet executed), and added a memory write buffer that stores memory writes before they even make it to the processor's internal cache. The Pentium 4 allows two threads to be executed on a single processor, by means of instruction stream interleaving. With the Pentium D, CPUs began to contain two cores in a single chip, and Intel is about to launch a Core 2 CPU containing four cores on a single chip.
What all this comes down to is that it is no longer possible for a programmer to know the exact order instructions will be executed in a program. Thanks to speculative reads and the uncertainty of exactly what is in the cache at any point, execution is nondeterministic, and it is even impossible for a nerd with a calculator and an x86 optimization manual to calculate exactly what order a set of assembly language instructions would be executed in (at least not in the general case; in highly serialized code it might be possible). Moreover, even implementations of x86, such as the Pentium 4, Core 2, and Athlon 64, differ in implementation details, such as execution time of particular instructions.
However, as the processor (actually core) is always self-consistent, this is normally completely transparent to the programmer. The result of a calculation will always be deterministic, and strictly dictated by instruction order, even if the actual order of events inside the processor to arrive at the end result differs wildly. The only time a programmer must be concerned with such details is when processor self-consistency is not sufficient - that is, when they are writing a program that must synchronize execution with something outside the core, such as a piece of hardware, another chip or processor on the motherboard, or even another core. While largely irrelevant for everyone else, writers of hardware-interface code (drivers) and of core operating system parts must be able to ensure that the internal state of the processor remains consistent with the world outside the processor. At least, those that don't work for Creative Labs.
There are many ways of accomplishing different aspects of this requirement, and the methods often vary by processor. Main memory is one of the things most commonly shared by the processor and other hardware, so it is necessary that the hardware hear exactly what the processor is trying to tell it. The x86 processor orders its memory accesses relatively strongly. The Pentium 4 guarantees that writes will be performed in the order they appear in the program; however, writes may be buffered before being committed to the processor's cache, and there may be further delays before the data is written to main memory. Furthermore, it makes no guarantees about the order of reads from memory (and, remember, reads can be performed even before the instruction to perform the read is executed). It could be disastrous for another processor (including processors on hardware devices) to mix in its own data with what one processor is writing, or read data that one processor is still in the process of writing.
Memory barriers (also called fences) are used to prevent this. They instruct the processor to create a memory bottleneck at the memory barrier instruction, which some class of memory accesses may not cross. A read barrier in between two reads will ensure that the second read (or any later reads) may not be executed before the first; the same goes for write barriers and full barriers.
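The classic use is pairing a write barrier in the producer with a read barrier in the consumer. A sketch (the barrier functions here are placeholders for whatever the platform provides):

    extern void write_barrier();   // placeholder barrier functions
    extern void read_barrier();

    volatile int g_data = 0;
    volatile int g_ready = 0;

    // Producer: the write barrier keeps the data store from becoming
    // visible after the flag store.
    void produce()
    {
        g_data = 42;
        write_barrier();
        g_ready = 1;
    }

    // Consumer: the read barrier keeps the data load from being performed
    // (perhaps speculatively) before the flag load.
    void consume_data()
    {
        while (g_ready == 0)
            ;   // wait for the producer
        read_barrier();
        int value = g_data;   // guaranteed to see 42
        (void)value;
    }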
Serializing instructions take this one step further, not only guaranteeing the order of memory accesses with relation to the serializing instruction, but also preventing ALL execution of subsequent instructions until the preceding instructions have entirely finished executing and the results have been written to memory. This is something you want to avoid whenever possible, as it's a major performance killer, with the potential to soak up hundreds of cycles in dead time (although, to be fair, fences are also a performance hazard, though not as large a one, as other instructions and some types of memory accesses may still execute across the fence).
Labels: libq, memorybarriers, programming, smp