Search This Blog

Sunday, December 31, 2006

Moshikashite Kore Sugoku Warui

Be afraid. Be very afraid. Q has just learned about Starfish. Even if it's old as dirt, this can't possibly be a good thing. Some of the stuff it generated:

Oh, and for the curious, that phrase is from Azumanga Daioh, one of my favorite mangas. Naturally translated, it means "This could be really bad": もしかして [possibly] これ [this] 凄く [terribly] 悪い [is bad].

Memory Barriers 3 - The LibQ Way

Okay, so those are the constraints within which LibQ must work. This brings me to my first critical design decision regarding memory barriers: LibQ will use hand-coded assembly functions, and client programs will call these functions through pointers (well, the functions will indeed be called through pointers, but this is transparent to the client program), allowing the fastest option to be chosen at run time based on the system.

Okay, that was fairly easy. This second decision took a lot more thinking through to reach a conclusion: what to do with respect to atomic functions? There are at least three options that are fairly obvious. The first is to provide the weakest possible guarantee: that all atomic functions are not memory barriers. This ensures that users will use memory barriers manually, and even on architectures where atomic instructions are not memory barriers the correct behavior is achieved.

However, evaluation of this possibility suggests that it is not a good option. While non-barrier atomic operations will run at maximum speed, particularly on weak architectures like PowerPC, adding in additional memory barriers on architectures like the x86 quickly kills performance. Consider the worst case scenario: an atomic operation that must have a full sandwich barrier, on a Pentium 3 (which has no full memory barrier instruction, and so must use a serializing instruction) - perhaps 100 cycles for the LOCK, another maybe 150 cycles to flush buffer contents, and two CPUID instructions at about 65 cycles each; all together, that's 385 cycles. This could be reduced by about 1/3 by leaving out the CPUIDs, as LOCK instructions are full barriers on the Pentium 3.

Alternately, we could provide the strongest possible guarantee: that all atomic instructions are full sandwich barriers. This means that users don't need to worry about putting memory barriers around atomic instructions at all, and ensures barrier safety.

However, this also is not a particularly good solution for a cross-platform library like LibQ. This strategy is optimal for the Pentium 3, where that's exactly how atomic instructions work. On Pentium 4 it's only slightly worse, as two MFENCE (or LFENCE) instructions would only add only a couple dozen cycles to the atomic operation and memory flush. However, this could be bad on weaker architectures like PowerPC. Memory barrier instructions there are comparable to the cost of a CPUID, and at least one is unnecessary most of the time (not often do you need a full sandwich barrier around an atomic operation).

The third option is to make special versions of LibQ atomic functions for each variation. While this has the advantage of providing maximum speed on all architectures (and removing function call overhead wherever possible), it would be a huge pain in the ass for me and anyone else who writes all the functions. Let's think about this: 3 types of memory barriers, 3 barrier configurations, times how many atomic function? That's like 50 or 100 different functions for each platform. Mr. Flibbles doesn't much care for that.

So, what should I do? Fortunately, there's a variation on one of these (or perhaps a hybrid of two of them) that is both easy for me to code and efficient with respect to CPU cycles (not 100% optimal, but tolerably close): option number one. Atomic instructions are not guaranteed to be memory barriers of any kind. HOWEVER, there are special memory barrier functions used to promote atomic instructions to memory barriers.

This requires only 6 functions - 3 different types of barriers, 2 configurations (sandwich is just top + bottom barriers). Using this system, these functions may function as necessary, and be no-oped as necessary, to provide (almost) maximum speed. The only waste is the overhead of one or two functions calls; my benchmarks suggest that this is a minimal increase, not taking more than 10 cycles per function - 4 to 12% waste for two barriers, when neither are necessary.

Saturday, December 30, 2006

Another Exercise for the Reader

Something I'm currently coding in LibQ (*shock and awe*) is the capacity for timeouts in waits on a condition variable, a feature that was left out of LibQ originally, because of an insurmountable race condition which would be created by the way LibQ emulated condition variables on Windows. This challenge is very succinct: how is it possible to create a fast (remember the definition of fast, with regard to synchronization objects) condition variable that supports timeout, and what constraint(s) would exist on the semantics of this condition variable? Ready, set, go!

Oh, one more thing: if you find that challenge too easy, you can try to do it without the use of atomic operations (the way LibQ does it).

Wednesday, December 13, 2006

Memory Barriers 2

Subtitle: Speeding Up, Slowing Up, Maybe Even Blowing Up

It has always been my intention to fully support memory barriers in LibQ. However, as LibQ is mainly something I work on when I'm feeling inspired, it shouldn't be too surprising that some features take a while to get implemented (especially at the times when I don't feel like coding anything for a few months). But while lack of interest on my part may have been the biggest factor, there was also the fact that I hadn't, until recently, decided on the semantics of LibQ memory barriers.

Stand-alone memory barriers (memory barriers in the middle of a sequence of normal code) are (mostly) straightforward, both in concept and in implementation. However, the exact method of achieving a memory barrier differs drastically by architecture. For example, the x86 guarantees that stores will be performed in order. As such, write barriers are not needed at all (or you could use an SFENCE instruction, if you want to play it safe and assume that this guarantee of write ordering won't be around forever). However, the x86 has no read barrier or full barrier instructions prior to the Pentium 4 (LFENCE and MFENCE instructions, respectively); the conventional wisdom has been to use a serializing instruction whenever these functions are needed (CPUID seems like a fairly good candidate on Pentium 4, while LOCK CMPXCHG seems faster on other CPUs).

The Alpha deserves special mention, as it does something that, as best I and the Linux people know, is truly novel (thank God). For reasons that appear to be a consequence of the split cache model (having multiple cache banks per core, which may be locked separately to avoid memory stalls), the Alpha does not have true memory barriers. While its memory barriers will order instructions for that CPU, there are no guarantees that the access order will hold on other processors. Amazingly (and not in a good way), this even applies to pointer dereferences; that is, if you set a shared pointer after initializing the structure, it's possible that another CPU could get the new pointer, yet pull uninitialized data through the pointer. This means that not only must you have a write barrier between two writes if they must appear in order, but you must also have a read barrier between the reads of those two variables; the read barrier will ensure that the caches will be brought into sync on the reading CPU.

Moving tangentially, different architectures have different semantics for their atomic functions. On a Pentium 3 (and other Intel CPUs before Pentium 4), LOCKed atomic operations (required to be truly atomic with multiple cores and/or CPUs) are synchronizing instructions (effectively full barriers). However, in Pentium 4 they have been demoted to simply being write barriers. As best as I can tell, on the PowerPC atomic instructions aren't memory barriers at all. Nor are atomic x86 instructions that do not use LOCK (not using LOCK is much faster, but only atomic with respect to a single core/CPU).

This is made even more complicated by the fact that there are three possible configurations for a memory barrier to be in, with respect to an atomic operation. In the acquire configuration (think of acquiring a mutex), the memory barrier appears to be just after the atomic operation - the atomic operation can be executed earlier than where it appears, but no I/O after it (inside the mutex) may be executed before it. In the release configuration, the barrier appears to be just before the atomic instruction - I/O after the atomic operation may be executed before the atomic operation, but the atomic operation cannot occur before any I/O in front of it (inside the mutex). Finally, there's the sandwich configuration, where instructions are serialized around the atomic operation, as if there was one barrier before the operation, and one after - the atomic operation cannot be executed before any preceding I/O, and no subsequent I/O may be executed before the atomic operation.

Saturday, December 09, 2006

Exercise for the Reader

So, in one class I'm writing implementations of the dining philosophers problem (note that a variation uses chopsticks and rice) using both semaphores (it is possible to prevent deadlock if you're careful) and mutex/conditions. Running tests on it, I noticed something odd - a semi-deadlock state in the semaphore version; that is, a single philosopher was taking his time eating, while none of the others could eat (they couldn't get a hold of a pair of chopsticks). It took me a little bit to figure out what was going on (because I knew that if the teacher saw that he'd ask me if it was broken).

So, here are the questions for you:
1. How do you implement a deadlock-free dining philosophers system using semaphores?
2. How could a single philosopher block all four others all by himself?

Tuesday, December 05, 2006

Memory Barriers

Well, I felt like writing something; however, this post might not turn out that great, as I've got a lot of material to cover, which I may or may not be able to elegantly separate into multiple topics (as it's too much to cover in one post).

Conceptually, computers execute instructions in single file. An instruction and its parameters (if applicable) are read from memory, the instruction is executed, and the results are written back to memory (if applicable). For some CPUs (that's all I can say, due to my indiverse knowledge of various CPUs), such as the x86, that's exactly how they originally worked. However, the x86 has gotten dramatically more complex over the last ten years or so. The 486 (I'm giving these milestones to the best of my memory, but I can't say for certain they're all correct) added support for multiple x86 microprocessors to run in parallel in the same system. With the Pentium, a second arithmetic logic unit was added, allowing the processor to execute two instructions in parallel. The Pentium Pro added support for speculative memory reads (prefetching of data used by instructions not yet executed), and added a memory write buffer that stores memory writes before they even make it to the processor's internal cache. The Pentium 4 allows two threads to be executed on a single processor, by means of instruction stream interleaving. With the Pentium D, CPUs began to contain two cores in a single chip, and Intel is about to launch a Core 2 CPU containing four cores on a single chip.

What all this comes down to is that it is no longer possible for a programmer to know the exact order instructions will be executed in a program. Thanks to speculative reads and the uncertainty of exactly what is in the cache at any point, execution is nondeterministic, and it is even impossible for a nerd with a calculator and an x86 optimization manual to calculate exactly what order a set of assembly language instructions would be executed in (at least not in the general case; in highly serialized code it might be possible). Moreover, even implementations of x86, such as the Pentium 4, Core 2, and Athlon 64, differ in implementation details, such as execution time of particular instructions.

However, as the processor (actually core) is always self-consistent, this is normally completely transparent to the programmer. The result of a calculation will always be deterministic, and strictly dictated by instruction order, even if the actual order of events inside the processor to arrive at the end result differs wildly. The only time when a programmer must be concerned with such details is when processor self-consistency is not sufficient - that is, when they are writing a program that much synchronize execution with something outside the core, such as a piece of hardware, another chip or processor on the motherboard, or even another core. While largely irrelevant for everyone else, writers of hardware-interface code (drivers) and of core operating system parts must be able to ensure that the internal state of the processor remains consistent with the world outside the processor. At least, those that don't work for Creative Labs.

There are many ways of accomplishing different aspects of this requirement, and the methods often vary by processor. Main memory is one of the things most commonly shared by the the processor and other hardware, so it is necessary that the hardware hear exactly what the processor is trying to tell it. The x86 processor orders its memory accesses relatively strongly. The Pentium 4 guarantees that writes will be performed in the order they appear in the program; however, writes may be buffered before being committed to the processor's cache, and there may be further delays before the data is written to main memory. Furthermore, it makes no guarantees about the order of reads from memory (and, remember, reads can be performed even before the instruction to perform the read is executed). It could be disastrous for another processor (including processors on hardware devices) to mix in its own data with what one processor is writing, or read data that one processor is still in the process of writing.

Memory barriers (also called fences) are used to prevent this. They instruct the processor to create a memory bottleneck at the memory barrier instruction, which some class of memory accesses may not cross. A read barrier in between two reads will ensure that the second read (or any later reads) may not be executed before the first; the same goes for write barriers and full barriers.

Serializing instructions take this one step further, not only guaranteeing the order of memory access with relation to the serializing instructions, but also preventing ALL execution of subsequent instructions until the preceding instructions have entirely finished executing and the results have been written to memory. This is something you want to avoid whenever possible, as it's a major performance killer, with the potential to soak up hundreds of cycles in dead time (although, to be fair, fences also a performance hazard, though not as large of one, as other instructions and some types of memory accesses may still execute across the fence).