Monday, January 22, 2007
This is a bit disturbing. I just received a registered survey from the Democratic National Committee. A few assorted lines from the thing:
"You have been selected to represent your city in the 2007 Grassroots Survey of Democratic Leaders. Survey documents registered in your name are enclosed."
"Dear Fellow Democrat,
...
I am asking you to take part in this survey because, as a Democratic leader in your community, your insights about the critical issues facing our nation are important to us."
This is disturbing because, in fact, I am not registered with any party, am not a political leader of any kind, and am not even registered to vote right now. It makes me wonder who's been voting under my name...
Sunday, January 14, 2007
The Rules of Slashdot
Saw this post on Slashdot, and thought it was humorous:
1. We love Apple (especially when they do something just like Microsoft, and even more if their product is vaporware).
2. We hate Microsoft (especially when they do something just like Apple, and even more when their product is vaporware).
3. Steve Jobs can do no wrong (especially when he does the same as Bill Gates).
4. Bill Gates can do no right (especially when he does the same as Steve Jobs).
5. Any story that is positive about Bill Gates or Microsoft will get tagged "fud" or "troll".
6. Any story that is negative about Steve Jobs or Apple will get tagged "fud" or "troll".
7. "One Laptop Per Child" is the second coming of Christ.
8. Nicholas Negroponte is Christ.
9. We ignore Sun (especially when they dominate any specific industry).
10. We adore Java (even though it was developed by Sun).
11. It has been "The Year That Linux Takes Over the Desktop" for about 8 years.
12. It has been "The Year That Microsoft Dies" for about 15 years.
13. It has been "The Year That Apple Overtakes Microsoft" for about 10 years.
14. Ubuntu is God's chosen Linux distribution.
15. All other distributions of Linux are wannabes...especially the ones that have been around longer than Ubuntu.
Friday, January 12, 2007
I Win
So, yesterday my program at work (a configuration tool for the company's server product) went into official Q/A, and I've acquired a list of bugs (six so far). Some of these are absolutely hysterical. Take this one, for instance:
On two of the different servers we were setting up for our test networks, the computers froze after clicking finish with the connect now checkbox checked. This happened on both servers that made it that far.
One of the servers that froze seems to have completely screwed up the subnet it was on until it was unplugged from that router. In other words, workstations on the same subnet could not pass traffic through the router or renew their IP address from the dhcp server until the computer was unplugged from the router.
I was laughing out loud for about 20 minutes straight after I read that bug report. Besides the obvious hilarity of a single computer downing an entire subnet, my program runs entirely in user mode, and it would be a huge security hole in the OS if a user-mode program could wreak this kind of havoc. I suspect this is actually a bug in the server (which has a driver component) or the backend (which feeds instructions to the server), though I haven't started investigating yet.
Tricks of the Trade - Collocation
Way, way back I explained what "fast" is, with regards to synchronization objects. Briefly, a fast object is one that executes entirely in user mode, save for two times: when a thread needs to go to sleep, and when a thread needs to be woken up.
But that isn't fast enough for some people. As discussed previously, fast objects work by using atomic instructions and/or memory barriers. While atomic instructions in and of themselves aren't THAT bad, speed-wise (in the best case taking only about 20 cycles), on some architectures (like x86) they imply either a write barrier or a full barrier. By definition these are constriction points, which bring the various parts of the system into sync, and require waiting for all pending writes/operations to complete. In the best case (no queued I/O) this may be almost instant. In the worst case (many queued I/O accesses), it may take hundreds of cycles. Even though this only applies to one CPU/core, that's a pretty large amount of dead time (for example, you might be able to call malloc in that time).
But is it possible to avoid the use of atomic operations altogether? Well, yes and no. It's impossible to have a multiprocessing system that shares resources while avoiding atomic operations entirely (well, I suppose you could, but it would be hideously ugly and slow). There must be some way to synchronize code or data between threads and processors - atomic operations provide this ability.
However, in very specialized circumstances (and only from some threads), we can get around using atomic operations. One way to achieve this is called collocation. Collocation is the storage of two (or more) pieces of data - one accessed by one thread, the other accessed by another thread - inside a single word. A thread writes its own field using a small (sub-word) write, then reads the entire word at once.
Collocation relies on data dependency to cause a partial processor stall without the need to perform a full memory barrier. Most processors are always self-consistent and have instruction ordering rules that state that a write to a memory location smaller than a word must be committed to memory (actually, cache, but CPU caches are typically coherent, meaning all CPUs will see modifications to the cache of one CPU) before a larger read from that same location. This can be done using nonatomic operations, and does not require a memory barrier - it forces ordering only on two instructions, while allowing the processor to optimally reorder and buffer all other instructions. In the worst case this causes a full pipeline flush (20-30 cycles), and it may take less - still significantly faster than the worst case of a full memory barrier.
However, this method is not an exact substitute for atomic instructions. It only ensures that both writing and reading threads will see the writes made by each other - it does not provide a way for either thread to know in what order the writes occurred, the way atomic instructions can. So, what use is it? Well, if you apply some creativity, it's still useful in some cases. To be able to take advantage of collocation, the threads sharing data must be fundamentally asymmetric - each thread must write to a separate field in the word, and only one thread may use collocation to avoid atomic operations (the other threads must use atomic operations). This implies where collocation is useful: when one thread performs the large majority of the writes, and other threads only make occasional writes.
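To make the idea concrete, here's a minimal sketch of collocation - my own illustration, not code from the upcoming posts. It assumes a little-endian, x86-like machine (where a narrow store followed by a wider load of the same location is ordered by the hardware), tolerates union type-punning (which mainstream compilers accept but the C++ standard doesn't strictly bless), and uses GCC's __sync_val_compare_and_swap intrinsic for the slow path.

#include <cstdint>

// One 32-bit word holding two collocated fields:
//   low byte  - flag, written only by the fast thread (plain store)
//   high word - waiter count, written only by other threads (via CAS)
union SharedWord
{
    volatile uint32_t whole;            // read as a unit by the fast thread
    struct
    {
        volatile uint8_t  flag;         // the fast thread's field
        uint8_t           padding;
        volatile uint16_t waiters;      // the other threads' field
    } parts;
};

// Fast thread: set its flag with a small write, then read the whole word.
// The narrow-store-then-wide-load dependency stalls the pipeline briefly,
// but no atomic read-modify-write and no memory barrier are needed.
bool FastSetAndCheck(SharedWord &w)
{
    w.parts.flag = 1;                   // small write to our own field
    uint32_t snapshot = w.whole;        // wide read sees both fields
    return (snapshot >> 16) == 0;       // true if no waiters are recorded
}

// Other threads: still must use an atomic compare and swap, so that bumping
// the waiter count can never overwrite a concurrent write to the flag byte.
void RegisterWaiter(SharedWord &w)
{
    uint32_t oldValue, newValue;
    do
    {
        oldValue = w.whole;
        newValue = oldValue + 0x10000;  // increment the waiters field
    } while (__sync_val_compare_and_swap(&w.whole, oldValue, newValue) != oldValue);
}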
This is just an introduction to collocation as a lock-free structure tool; the posts on super-fast mutexes and condition variables with timeout will give some real-world examples of how atomic operations can be avoided using collocation.
Labels:
atomicfunctions,
memorybarriers,
programming,
smp
Monday, January 08, 2007
Tools of the Trade - Atomic Primitives
In the wake of the recent memory barriers series, today we continue looking at topics related to multiprogramming by taking a survey of the various atomic primitives. This particular topic is more for informational value than practical use (at least as far as LibQ is concerned, which does not support all of these, so you do not need to know the differences).
Atomic primitives are arguably the most important tool in programming that requires coordination between multiple threads and/or processors/cores. There are four basic types of atomic primitives: swap, fetch and phi, compare and swap, and load linked/store conditional. Usually these operations are performed on values the same size as or smaller than a word (the size of the processor's registers), though sometimes consecutive (single address) multiword versions are provided.
The most primitive (pun not intended) is the swap operation. This operation exchanges a value in memory with a value in a register, atomically setting the value in memory and returning the original value in memory. This is not actually very useful, in terms of multiprogramming, with only a single practical use: the construction of test and set (TAS; or test and test and set - TATAS) spinlocks. In these spinlocks, each thread attempting to enter the spinlock spins, exchanging 1 into the spinlock value in each iteration. If 0 is returned, the spinlock was free, and the thread now owns the spinlock; if 1 is returned, the spinlock was already owned by another thread, and the thread attempting to enter the spinlock must keep spinning. Ultimately, the owning thread leaves the spinlock by setting the spinlock value to 0.
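To make that concrete, here's a minimal sketch of such a test-and-set spinlock. It uses C++11's std::atomic and its exchange method (the swap primitive by another name), which didn't exist when this was written; the class and its names are just my illustration.

#include <atomic>

class TasSpinlock
{
    std::atomic<int> lockValue{0};      // 0 = free, 1 = owned

public:
    void Enter()
    {
        // Spin, exchanging 1 into the lock value; a returned 0 means the
        // lock was free and this thread now owns it.
        while (lockValue.exchange(1) != 0) {}
    }

    void Leave()
    {
        lockValue.store(0);             // hand the lock back
    }
};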
This makes for an extremely crude spinlock. The fact that there is only a single value all threads share means that spinning can create a lot of cache coherency traffic, as all spinning processors will be writing to the same address, continually invalidating each other's caches. Furthermore, this mechanism precludes any kind of order preservation, as it's impossible to distinguish when a given thread began waiting; threads may enter the spinlock in any order, regardless of how long any thread has been waiting to enter the spinlock.
Next up the power scale is the fetch and phi family. All members of this family follow the same basic process: a value is atomically read from memory, modified by the processor, and written back, with the original value returned to the program. The modification performed can be almost anything; one of the most useful modifications is the add operation (in this case it's called fetch and add). The fetch and add operation is notably more useful than the swap operation, but is still less than ideal; in addition to test and set spinlocks, fetch and add can be used to create thread-safe counters, and spinlocks that both preserve order and (potentially) greatly reduce cache coherency traffic.
// Pseudocode: the ATOMIC marker means the body executes as one indivisible
// operation; "word" is a machine word.
ATOMIC word fetch_and_add(volatile word &value, word addend)
{
    word oldValue = value;
    value += addend;
    return oldValue;                      // the caller sees the value before the add
}

// A ticket (order-preserving) spinlock built on fetch and add: each thread
// takes the next ticket from seq, then spins until cur reaches its ticket.
void EnterSpinlock(volatile word &seq, volatile word &cur)
{
    word mySeq = fetch_and_add(seq, 1);   // take a ticket
    while (cur != mySeq) {}               // wait for our turn
}

void LeaveSpinlock(volatile word &cur)
{
    cur++;                                // admit the next ticket holder
}
Further up the power scale is compare and swap (CAS), a universal atomic primitive: the value in memory is replaced with a new value only if it currently equals a supplied comparand, and the original value is returned either way, so the caller can tell whether the swap took place (and retry if it didn't).

ATOMIC word compare_and_swap(volatile word &value, word newValue, word comparand)
{
    word oldValue = value;
    if (oldValue == comparand)            // only store if the value hasn't changed
        value = newValue;
    return oldValue;                      // lets the caller detect success or failure
}
Another universal atomic primitive is the load linked/store conditional (LL/SC) pair of instructions. In this model, when a value is loaded linked into a register, a reservation is placed on that memory address. If that reservation still exists when the store conditional operation is performed, the store will be performed. If another thread writes to the memory location between a load linked and store conditional, the reservation is cleared, and any other threads will fail their store conditional (resulting in skipping the store altogether). If the store conditional fails, the program must loop back and try the operation again, beginning with the load linked.
In theory, the load linked/store conditional primitive is superior to the compare and swap operation. As any access to the memory address will clear the reservations of other processors, the total number of reads is 1, in contrast to the 2 reads with compare and swap (1 read to get the value initially, and a second read to verify the value is unchanged). Furthermore, the LL/SC operation may distinguish whether a write has occurred, regardless of whether the value was changed by the write (we'll come back to why this is a good thing when we discuss hazard pointers). Unfortunately, my research indicates that most implementations of LL/SC do not guarantee that a write will be distinguished if the written value is the same as the value at the time of the reserve.
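To show the shape of the retry loop, here's a sketch of fetch and add expressed in terms of LL/SC. No portable C++ API exposes these operations (on real hardware they're instruction pairs such as PowerPC's lwarx/stwcx. or ARM's ldrex/strex), so the load_linked and store_conditional functions below are hypothetical stand-ins.

#include <cstdint>

typedef std::uintptr_t word;

// Hypothetical helpers standing in for the hardware instruction pair.
word load_linked(volatile word *address);                     // places a reservation
bool store_conditional(volatile word *address, word value);   // fails if the reservation was lost

// Fetch and add built on LL/SC: if any other processor touches the address
// between the load and the store, the store fails and the loop retries from
// the load linked.
word FetchAndAddLLSC(volatile word *address, word addend)
{
    word oldValue;
    do
    {
        oldValue = load_linked(address);
    } while (!store_conditional(address, oldValue + addend));
    return oldValue;
}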
Finally, the wild card: the double compare and swap (DCAS). In a double compare and swap, the compare and swap operation is performed simultaneously on the values in two memory addresses. Obviously this provides dramatically more power than any of the previous operations, which only operate on single addresses. Unfortunately, support for this primitive is extremely rare in real-world processors, and it is typically only used by lock-free algorithm designers who are unable to reduce their algorithm to single-address compare and swap operations.
ATOMIC void double_compare_and_swap(volatile word &value1, word newValue1, word comparand1, word &oldValue1,
                                    volatile word &value2, word newValue2, word comparand2, word &oldValue2)
{
    oldValue1 = value1;
    oldValue2 = value2;
    if (oldValue1 == comparand1 && oldValue2 == comparand2)
    {
        // Both comparisons succeeded, so both locations are updated together
        value1 = newValue1;
        value2 = newValue2;
    }
}
Labels:
atomicfunctions,
programming,
smp
Thursday, January 04, 2007
Philosophers at a Table
Creating a deadlock-free semaphore-based implementation of the dining philosopher problem isn't particularly difficult. All you need to do is implement a chopstick pickup order that does not allow deadlock. The method we used (although you could just as easily do the opposite) is to have each philosopher pick up their even-numbered chopstick first, then their odd-numbered chopstick afterwards. This guarantees that, regardless of the number of philosophers n at the table (as long as there're two chopsticks, smartass), at least one will be able to eat at any point, with a maximum of n / 2 (rounded down) being able to eat at once.
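For reference, here's a minimal sketch of that pickup order - my own illustration, not the actual implementation - with one std::mutex standing in for each chopstick's binary semaphore.

#include <mutex>
#include <vector>

// Philosopher i uses chopsticks i and (i + 1) % n and grabs whichever of
// the two is even-numbered first. When both are even (the last philosopher
// at an odd-sized table), this sketch arbitrarily takes chopstick i first -
// the problem case discussed below.
void Dine(std::vector<std::mutex> &chopsticks, int i)
{
    int n = static_cast<int>(chopsticks.size());
    int left = i;
    int right = (i + 1) % n;
    int first = (left % 2 == 0) ? left : right;    // the even chopstick
    int second = (first == left) ? right : left;   // then the other one

    chopsticks[first].lock();
    chopsticks[second].lock();
    // ... eat ...
    chopsticks[second].unlock();
    chopsticks[first].unlock();
}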
I initially thought (and indeed said in my report) that the condition we were seeing, where only one of five philosophers was able to eat, was a consequence of there being an odd number of philosophers. While I was correct that the odd number had an important part in what we were seeing, the problem was more general than I thought. This diagram of an eight-philosopher table shows the worst case scenario for an even number of philosophers:
What's happening here is as follows:
1. Philosopher 0 picks up chopsticks 0 and 1, and begins eating
2. Philosopher 1 picks up chopstick 2, and blocks on chopstick 1
3. Philosopher 2 blocks on chopstick 2
4. Philosopher 7 blocks on chopstick 0
In this order (or a few variations of it), one philosopher has managed to prevent three others in a cluster from eating. The same pattern can be repeated on the other side of the table, so that a second eating philosopher blocks the remaining three; two philosophers eating while six are blocked is the worst-case scenario for this table.
You can draw it out for yourself (I drew it on paper, along with several other sizes, but don't feel like making more diagrams online) and prove that at a table of only six philosophers, at least two can always eat. This is logical: one eating philosopher and the cluster blocked behind it can only hold three chopsticks, leaving three chopsticks available; one of the last two philosophers will be able to eat.
So, four philosophers is too few (for two philosophers to eat), six is enough, but what about five? This was the classic case, and the one we implemented. What happened is this, shown in words and figure:
1. Philosopher 1 picks up chopsticks 2 and 1, and begins eating
2. Philosopher 2 blocks on chopstick 2
3. Philosopher 0 picks up chopstick 0, and blocks on chopstick 1
4. Philosopher 4 picks up chopstick 4, and blocks on chopstick 0
5. Philosopher 3 blocks on chopstick 4
As you can see, the problem is that one philosopher (#4) has two even-numbered chopsticks (#4 and #0), meaning that regardless of the order in which he tries to pick them up, it's gonna screw up the ordering. In fact, what happens is that this allows an additional philosopher to block on one eating philosopher, bringing the total in the cluster to five - the number of philosophers at the table.
Oh, and for those keeping score, the minimum number of philosophers able to eat at any time appears to be (n + 2) / 4, rounding down - which gives 1 for the five-philosopher table and 2 for the six- and eight-philosopher tables above.