Way, way back I explained what "fast" means with regard to synchronization objects. Briefly, a fast object is one that executes entirely in user mode, except at two times: when a thread needs to go to sleep, and when a thread needs to be woken up.
But that isn't fast enough for some people. As discussed previously, fast objects work by using atomic instructions and/or memory barriers. While atomic instructions in and of themselves aren't THAT bad, speed-wise (in the best case taking only about 20 cycles), on some architectures (like x86) they imply either a write barrier or a full barrier. By definition these are constriction points, which bring the various parts of the system into sync, and require waiting for all pending writes/operations to complete. In the best case (no queued I/O) this may be almost instant. In the worst case (many queued I/O accesses), it may take hundreds of cycles. Even though this only applies to one CPU/core, that's a pretty large amount of dead time (for example, you might be able to call malloc in that time).
But is it possible to avoid the use of atomic operations altogether? Well, yes and no. It's impossible to have a multiprocessing system that shares resources completely without atomic operations (well, I suppose you could, but it would be hideously ugly and slow). There must be some way to synchronize code or data between threads and processors - atomic operations provide this ability.
However, in very specialized circumstances (and only from some threads), we can get around using atomic operations. One way to achieve this is called collocation. Collocation is the storage of two (or more) pieces of data - one accessed by one thread, the other accessed by another thread - inside a single word. A thread writes its own piece of data using a small (sub-word) write, then reads the entire word at once.
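To make the layout concrete, here is a minimal sketch in C of a collocated word and the small-write/wide-read pair. The type and function names (`colloc_word`, `owner_publish_and_read`, `others_from_snapshot`) and the 16/32-bit split are assumptions made for illustration; they are not taken from any real library. Note also that ISO C formally treats concurrent non-atomic access as a data race, so this only illustrates the instruction pattern and relies on the hardware behaving as described in this post.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: two 16-bit fields sharing one 32-bit word. */
typedef union {
    struct {
        uint16_t owner;   /* written only by the one "fast" thread */
        uint16_t others;  /* written only by the other threads */
    } half;
    uint32_t whole;       /* both fields, readable with a single load */
} colloc_word;

static_assert(sizeof(colloc_word) == sizeof(uint32_t),
              "the two fields must fill exactly one word");

/* Small write to this thread's own field, followed by one wide read of
   the whole word. volatile only stops the compiler from merging or
   dropping the two accesses; it is not a memory barrier. */
static uint32_t owner_publish_and_read(volatile colloc_word *w, uint16_t value)
{
    w->half.owner = value;  /* 16-bit store */
    return w->whole;        /* 32-bit load that overlaps the store */
}

/* Pull the other threads' field out of a snapshot without assuming
   a particular byte order. */
static uint16_t others_from_snapshot(uint32_t snapshot)
{
    colloc_word tmp;
    tmp.whole = snapshot;
    return tmp.half.others;
}
```

The snapshot returned by `owner_publish_and_read` contains both the value just written and whatever the other threads' field held at the time of the wide load.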
Collocation relies on data dependency to cause a partial processor stall without the need to perform a full memory barrier. Most processors are always self-consistent and have instruction ordering rules stating that a write smaller than a word must be committed to memory (actually, to cache, but CPU caches are typically coherent, meaning all CPUs will see modifications to the cache of one CPU) before a larger read covering that same location can complete. This can be done using nonatomic operations, and does not require a memory barrier - it forces ordering on only those two instructions, while allowing the processor to optimally reorder and buffer all other instructions. At worst, this causes a full pipeline flush (20-30 cycles), and it may take less - significantly faster than the worst case for a full memory barrier.
However, this method is not an exact substitute for atomic instructions. It only ensures that both writing and reading threads will see the writes made by each other - it does not provide a way for either thread to know in what order the writes occurred, the way atomic instructions can. So, what use is it? Well, if you apply some creativity, it's still useful in some cases. To be able to take advantage of collocation, the threads sharing data must be fundamentally asymmetric - each thread must write to a separate field in the word, and only one thread may use collocation to avoid atomic operations (the other threads must use atomic operations). This implies where collocation is useful: when one thread performs a large portion of the writes, and other threads only make occasional writes.
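As a rough illustration of that asymmetry, here is a Dekker-style handshake sketched in C. This is my own minimal example, not the mutex from the later posts: the names (`owner_try_enter`, `other_try_enter`, and so on) are hypothetical, the `__atomic_*` calls are the GCC/Clang builtins, and the whole thing assumes the hardware orders the small write before the wide read as described above. The frequent writer uses only plain accesses; the occasional writers pay for atomic operations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical collocated word: one field per side. */
typedef union {
    struct {
        uint16_t owner;   /* written only by the frequent thread */
        uint16_t others;  /* count of occasional threads that announced */
    } half;
    uint32_t whole;
} colloc_word;

static_assert(sizeof(colloc_word) == sizeof(uint32_t),
              "the two fields must fill exactly one word");

/* Frequent thread: plain 16-bit store, then plain 32-bit load.
   Returns true if no other thread had announced itself in the snapshot. */
static bool owner_try_enter(volatile colloc_word *w)
{
    colloc_word snap;
    w->half.owner = 1;
    snap.whole = w->whole;
    return snap.half.others == 0;
}

/* A real lock would also need release ordering for the protected data;
   that is left out of this sketch. */
static void owner_leave(volatile colloc_word *w)
{
    w->half.owner = 0;
}

/* Occasional thread: atomic read-modify-write on its own field, then an
   atomic load of the whole word. Returns true if the owner was not
   announced. On false, the caller should back off with other_leave(). */
static bool other_try_enter(colloc_word *w)
{
    colloc_word snap;
    __atomic_fetch_add(&w->half.others, 1, __ATOMIC_SEQ_CST);
    snap.whole = __atomic_load_n(&w->whole, __ATOMIC_SEQ_CST);
    return snap.half.owner == 0;
}

static void other_leave(colloc_word *w)
{
    __atomic_fetch_sub(&w->half.others, 1, __ATOMIC_SEQ_CST);
}
```

Neither side learns which write happened first; the only property the sketch leans on is that the two sides cannot both miss each other's announcement, so they can never both get `true` at the same time. Both getting `false` is possible, which is why a side that fails has to retract its announcement and retry or wait.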
This is just an introduction to collocation as a lock-free structure tool; the posts on super-fast mutexes and condition variables with timeout will give some real-world examples of how atomic operations can be avoided using collocation.