
Tuesday, December 05, 2006

Memory Barriers

Well, I felt like writing something; however, this post might not turn out that great, as I've got a lot of material to cover - too much for one post - and I may or may not be able to separate it elegantly into multiple topics.

Conceptually, computers execute instructions in single file. An instruction and its parameters (if applicable) are read from memory, the instruction is executed, and the results are written back to memory (if applicable). For some CPUs (that's all I can say, given my limited knowledge of the various architectures), such as the x86, that's exactly how they originally worked. However, the x86 has gotten dramatically more complex over the last ten years or so. The 486 (I'm giving these milestones to the best of my memory, but I can't say for certain they're all correct) added support for multiple x86 microprocessors running in parallel in the same system. With the Pentium, a second arithmetic logic unit was added, allowing the processor to execute two instructions in parallel. The Pentium Pro added support for speculative memory reads (prefetching of data used by instructions not yet executed), and added a memory write buffer that stores memory writes before they even make it to the processor's internal cache. The Pentium 4 allows two threads to be executed on a single processor, by means of instruction stream interleaving. With the Pentium D, CPUs began to contain two cores in a single chip, and Intel is about to launch a Core 2 CPU containing four cores on a single chip.

What all this comes down to is that it is no longer possible for a programmer to know the exact order in which the instructions of a program will be executed. Thanks to speculative reads and the uncertainty of exactly what is in the cache at any point, execution is nondeterministic, and it is impossible even for a nerd with a calculator and an x86 optimization manual to calculate exactly what order a set of assembly language instructions will be executed in (at least not in the general case; in highly serialized code it might be possible). Moreover, different implementations of the x86, such as the Pentium 4, Core 2, and Athlon 64, differ in details such as the execution time of particular instructions.

However, as the processor (actually core) is always self-consistent, this is normally completely transparent to the programmer. The result of a calculation will always be deterministic, and strictly dictated by instruction order, even if the actual order of events inside the processor to arrive at the end result differs wildly. The only time a programmer must be concerned with such details is when processor self-consistency is not sufficient - that is, when they are writing a program that must synchronize execution with something outside the core, such as a piece of hardware, another chip or processor on the motherboard, or even another core. While this is largely irrelevant for everyone else, writers of hardware-interface code (drivers) and of core operating system parts must be able to ensure that the internal state of the processor remains consistent with the world outside the processor. At least, those that don't work for Creative Labs.

There are many ways of accomplishing different aspects of this requirement, and the methods often vary by processor. Main memory is one of the things most commonly shared by the processor and other hardware, so it is necessary that the hardware hear exactly what the processor is trying to tell it. The x86 processor orders its memory accesses relatively strongly. The Pentium 4 guarantees that writes will be performed in the order they appear in the program; however, writes may be buffered before being committed to the processor's cache, and there may be further delays before the data is written to main memory. Furthermore, it makes no guarantees about the order of reads from memory (and, remember, reads can be performed even before the instruction to perform the read is executed). It could be disastrous for another processor (including processors on hardware devices) to mix in its own data with what one processor is writing, or to read data that one processor is still in the process of writing.
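To make that concrete, here's a rough sketch (my own made-up example, not anything from a real driver; it uses pthreads, and the variable names mean nothing) of the classic situation the previous paragraph warns about. Each thread writes its own flag and then reads the other thread's flag. Because the stores can sit in the write buffer while the later loads go ahead, it's possible on a real x86 machine for both threads to read 0 - an outcome that could never happen if all four accesses occurred strictly in program order.

#include <pthread.h>
#include <stdio.h>

volatile int x = 0, y = 0;      /* each thread's "flag"             */
volatile int r1, r2;            /* what each thread saw             */

static void *thread1(void *arg)
{
    (void)arg;
    x = 1;      /* store my flag...                                 */
    /* With no fence here, the load below can complete while the    */
    /* store above is still sitting in the write buffer.            */
    r1 = y;     /* ...then read the other thread's flag             */
    return NULL;
}

static void *thread2(void *arg)
{
    (void)arg;
    y = 1;      /* the same thing, mirrored                         */
    r2 = x;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* r1 == 0 && r2 == 0 is a perfectly legal outcome here. */
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}

(You'd have to run the two threads in a tight loop many times to actually catch the 0/0 case, but it does happen.)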

Memory barriers (also called fences) are used to prevent this. They instruct the processor to create a memory bottleneck at the memory barrier instruction, which some class of memory accesses may not cross. A read barrier placed between two reads ensures that the second read (and any later reads) cannot be executed before the first; the same goes for write barriers with respect to writes, and for full barriers with respect to both.
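Here's roughly what that looks like in code. This is just an illustrative sketch of my own: the sfence, lfence, and mfence instructions are the real x86 fences, but the macro names and the producer/consumer functions are made up for the example.

#define write_barrier() __asm__ __volatile__("sfence" ::: "memory")
#define read_barrier()  __asm__ __volatile__("lfence" ::: "memory")
#define full_barrier()  __asm__ __volatile__("mfence" ::: "memory")

volatile int data;      /* the payload                              */
volatile int ready;     /* the flag saying the payload is valid     */

void producer(void)
{
    data = 42;          /* 1: write the payload                     */
    write_barrier();    /* the store to ready may not pass this     */
    ready = 1;          /* 2: only then publish the flag            */
}

int consumer(void)
{
    while (!ready)      /* 1: wait until the flag is set            */
        ;
    read_barrier();     /* the read of data may not be performed    */
                        /* ahead of the read of ready that saw 1    */
    return data;        /* 2: only then read the payload            */
}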

Serializing instructions take this one step further, not only guaranteeing the order of memory accesses with relation to the serializing instruction, but also preventing ALL execution of subsequent instructions until the preceding instructions have entirely finished executing and the results have been written to memory. This is something you want to avoid whenever possible, as it's a major performance killer, with the potential to soak up hundreds of cycles in dead time (although, to be fair, fences are also a performance hazard, though not as large a one, as other instructions and some types of memory accesses may still execute across the fence).
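For the record, the classic serializing instruction on the x86 is CPUID; a sketch like this (GCC-style inline assembly; the wrapper name is mine, not from any particular kernel) is the usual way to force one:

static inline void serialize(void)
{
    unsigned int eax = 0, ebx, ecx, edx;
    /* CPUID is serializing: everything before it must fully retire,    */
    /* with its results written back, before anything after it starts.  */
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                         :
                         : "memory");
}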
