Following the Standard Operating Procedure for deciding what to write about for my blog posts (namely, whatever I have a whim to write at a particular moment; explains a lot, doesn't it?), today I'm gonna talk about the different types of multiprocessors. Multiprocessing refers to the parallel execution of one or more programs (or multiple instances of one program); more specifically, I'm going to talk about architectures where you explicitly code for parallelism, as opposed to implicit parallelism, like the multiple arithmetic units present in most modern CPUs. There are lots of different types of multiprocessors (speaking from an architecture standpoint), but traditionally (before EDGE came along) they fall into one of the categories I'm going to describe. These categories form a neat gradient from single processors to processor clusters.
The class of processors most closely resembling single processors is Single Instruction Multiple Data (SIMD) processors - processors which operate on multiple data items using the same sequence of instructions. Vector processors, the simplest SIMD processors, allow you to store multiple data items in large registers (vector registers), then perform a mathematical operation on all of those data items with a single instruction. I gave an example of this a while ago, when I made an MMX function to convert a string to uppercase in blocks of eight characters per instruction (MMX registers are 64 bits). The eight characters were loaded from (contiguous) memory in one instruction, several mathematical operations were performed, each in one instruction, and ultimately the block of upper-case characters was written back to (contiguous) memory in one more instruction.
The problem with vector processors is that they can only read vectors from memory at regular intervals - either contiguous memory, like MMX/SSE, or every X bytes, as I'm told Cray computers can do. If you don't like it, you can manually load each value into a general register, then push it into a vector. Obviously this will be slow, as it requires a couple of instructions per data item, as opposed to loading a whole vector in one instruction. Though you may not have much choice if all you've got is a Core 2 with SSE3.
Stream processors, taking the SIMD idea one step further, answer this problem. While vector processors are like a single CPU operating on an array of data items, stream processors are like an array of processors, each with individual word-sized registers, working on the same sequence of instructions. While both operate on the same sequence of instructions, each pipe (for lack of a better word) in a stream processor has its own set of registers. If, for example, a particular register holds a pointer, and each pipe has a different value in that register, each pipe will load a different value from memory and operate on it; what's more, the reads and writes to memory need not be contiguous. This is how Graphics Processing Unit shaders work.
Of course, stream processors still have the limitation that there's only a single program, even if it's being executed in parallel. To address this, we next have Multiple Instruction Multiple Data (MIMD) processors, or, more specifically, Symmetric MultiProcessing (SMP). SMP systems consist of two or more relatively independent processors or cores sharing a common bus and memory, each executing its own program. 'Symmetric' refers to the fact that all CPUs have equal access to the shared hardware; (with some exceptions) a request to something like memory will be processed exactly the same, regardless of which processor made the request. This is what current x86 CPUs are, though I've heard that AMD's next x86 architecture will be NUMA.
The problem with this is that there's an obvious bottleneck: the shared bus (or, for more innovative systems like the Opteron, which lack a shared bus, the memory itself). As the number of processors and memory or other hardware accesses increases, this becomes a big problem for performance.
Consequently, systems with more than just a few processors often use a Non-Uniform Memory Access (NUMA) architecture. In a NUMA system, there are two or more banks of memory, located at different places in the system. There may be one bank per X CPUs (perhaps those X CPUs and their memory bank reside on an expansion card), or one bank per CPU. In either case, each processor has access to all memory in the system, just like with SMP; however, the time to access a particular location in memory varies, based on which bank that address resides in. If each CPU has the data for its program relatively self-contained in its local memory bank, so that requests to other memory banks are rare, this allows significantly better performance in parallel systems.
The last bottleneck remaining in NUMA is the fact that any processor can request reads or writes from any memory bank. While this is good from a programming perspective, as the size of the system continues to grow, there comes a point where you can't afford to make remote memory requests - it's more efficient to perform explicit communication between processors (either synchronously or asynchronously) and keep a copy of all the data a processor needs in its own memory bank. This setup is called a cluster, and is now very common in distributed systems. Each processor (an entire computer, in the case of distributed computing) runs independently, but uses explicit communication to optimize costly transfers of data between systems. This is perhaps the hardest model to program when dealing with a program that can't be neatly decomposed by domain (having each computer process a portion of a common data set), but is becoming increasingly necessary.
However, while clusters are often composed of whole computers, that isn't a requirement. For a rather peculiar example of a cluster-like system on a single chip, we need only look at the Cell - the CPU of the PlayStation 3. The Cell is composed of one main core, which has access to main memory, and 8 mini-cores, each with (and limited to) its own local memory bank; blocks of memory are copied between the local banks and main memory by explicit asynchronous Direct Memory Access (DMA), analogous to asynchronous network I/O in network clusters. The combination of the 8 mini-core memory banks actually residing on the chip itself (cache, in other words) and memory copying done asynchronously allows the Cell to process at blazing speed, with all memory accesses taking only as long as a cache hit (by definition). But between the architecture itself and the fact that game data tends to be interdependent - which makes breaking games down into that many threads very difficult - the Cell is incredibly (and notoriously) hard to make full use of. This makes the Cell a peculiar choice for a game system; it would probably be better suited to scientific applications, with simpler computations in large quantities.