As I already mentioned, a round trip to kernel mode and back is expensive in terms of CPU usage. But there's more to it than that. Once a thread has surrendered the CPU in a wait, the CPU will be given to another thread. Even if the waiting thread is woken very quickly, it is unlikely to get the CPU back for quite some time - perhaps more than 100 ms. If that thread needs to be responsive, that's a really long time. For this reason, something called a spinlock is often employed.
A spinlock is simply code that loops, repeatedly testing some condition without going into a wait. This wastes CPU cycles, but the benefit can be worth it. If another thread is executing concurrently on another processor and signals the object, the spinning thread can avoid going into a wait completely, saving a substantial amount of lag time. On a single-CPU system, however, the spinning is just dead CPU time: no other thread can be executing concurrently, so nothing can signal the object while the spinner holds the CPU. Spinning is also pointless if the synchronization object is likely to stay unsignalled for a long period of time (such as one protecting a file handle during reads). But if something like a mutex is likely to be locked very frequently and only for a few - or even a few hundred - cycles at a time, this can make a huge difference in performance.
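
To make the idea concrete, here is a minimal sketch of a spin-then-wait lock in standard C++. It is only an illustration, not how any particular operating system implements its locks. The class name SpinThenWaitLock, the helper run_counter_demo and the spin limit of 4000 are all made up for this example, and the single-CPU check assumes that std::thread::hardware_concurrency() reports a meaningful value (it is allowed to return 0 when it cannot tell).

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical illustration: try a bounded number of non-blocking attempts
// (the "spin"), and only fall back to a real blocking wait if they all fail.
class SpinThenWaitLock {
public:
    explicit SpinThenWaitLock(int spin_limit) : spin_limit_(spin_limit) {
        assert(spin_limit >= 0);
        // Assumption: if exactly one hardware thread is reported, no other
        // thread can release the lock while we spin, so do not spin at all.
        if (std::thread::hardware_concurrency() == 1) {
            spin_limit_ = 0;
        }
    }

    void lock() {
        for (int i = 0; i < spin_limit_; ++i) {
            if (mutex_.try_lock()) {
                return;  // acquired without ever going into a wait
            }
        }
        mutex_.lock();  // spinning did not pay off; block until available
    }

    void unlock() { mutex_.unlock(); }

private:
    std::mutex mutex_;
    int spin_limit_;
};

// Hypothetical demo: several threads increment one counter under the lock.
// The critical section is only a few cycles long, which is the case where
// spinning is most likely to help.
long run_counter_demo(int threads, int iterations) {
    SpinThenWaitLock lock(4000);  // arbitrary spin limit for this sketch
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&lock, &counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                std::lock_guard<SpinThenWaitLock> guard(lock);
                ++counter;
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return counter;
}
```

The spin limit is the knob that matters here. Set it too low and the thread goes into a wait just before the lock would have been released; set it too high and the thread burns cycles it could have given to someone else. A sensible value depends on how long the lock is typically held, so it is best chosen by measuring.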