Saturday, June 11, 2005

Conditions - Windows Implementation

The Windows version of the condition is fairly straightforward (and I don't know about you, but I'm getting a bit bored with all these almost-alike-but-not-quite fast synchronization objects), and more than a bit resembles the fast mutex. Here, we have a slow semaphore that waiting threads wait on, and a counter of how many waiting threads there are. The use of atomic functions to access the waiting threads counter allows us to signal from either inside or outside the mutex paired with the condition. Really, the only major difference between the condition and the mutex implementation is that with the condition, a paired mutex must be exited before waiting, and reentered after waiting.

In doing so, it's essential that the counter be incremented before leaving the mutex to wait (despite the fact that the counter is accessed atomically). Were this not the case, we'd create a race between the data (which is protected by the mutex) and the counter (which is accessed atomically), and could put a thread to sleep for an indefinite period of time. The problem does not exist in the other direction: we can safely access the counter outside the mutex if we're decrementing it (that is, waking waiting threads).

The code:

class CCondition // Win32 version of CCondition (fast)
{
private:
    CMutex &m_pairedMutex; // The paired mutex
    CSemaphore m_sem; // Semaphore waiters will wait on

    // Atomically accessed, so that Signal may be called outside the mutex
    int32 m_nWaiters; // Number of threads waiting on the semaphore

public:
    // Associates a mutex with the condition. From this point on, no other mutex may be used to access the condition. However, multiple conditions may be associated with the same mutex.
    inline CCondition(CMutex &mutex) : m_pairedMutex(mutex), m_sem(0)
    {
        m_nWaiters = 0;
    }

    inline ~CCondition()
    {
        if (m_nWaiters)
            m_sem.Post(m_nWaiters);
    }

    // Signal that the condition is true and wake waiting threads; has no effect if no threads are waiting. Can either release one thread or all threads. May be called inside the mutex or outside.
    inline void Signal(bool bSignalAll = false)
    {
        int32 nWaiters, nReleaseThreads;

        do
        {
            nWaiters = m_nWaiters;
            assert(nWaiters >= 0);

            if (!nWaiters)
                return; // Do nothing if no threads are waiting

            nReleaseThreads = (bSignalAll ? nWaiters : 1);
        } while (AtomicCompareExchange(&m_nWaiters, nWaiters - nReleaseThreads, nWaiters) != nWaiters);

        m_sem.Post(nReleaseThreads); // Set them loose
    }

    // Waits for the condition to become true. Must be called from inside the mutex, and the thread will be back inside the mutex at return. HOWEVER, return does not GUARANTEE that the condition is true. You must verify that the condition has indeed become true upon return.
    inline void Wait()
    {
        // Absolutely essential that this get done before we leave the mutex, or we create a race condition
        AtomicIncrement(&m_nWaiters); // One more waiter

        m_pairedMutex.Leave(); // Leave mutex
        m_sem.Wait(); // Wait for signal
        m_pairedMutex.Enter(); // Reenter mutex after signal
    }
};

Conditions - POSIX Implementation

As with the mutex class, the POSIX version of the condition is trivial, due to the native support by the OS. The POSIX APIs allow you to specify a mutex to use each time you wait on the condition. However, as I can't see any use for this, and it leaves more room for error, I've decided to bind the condition to its mutex on construction. The code:

class CCondition
{
private:
    CMutex &m_pairedMutex; // Paired mutex
    pthread_cond_t m_cond; // Condition variable

public:
    inline CCondition(CMutex &mutex) : m_pairedMutex(mutex)
    {
        if (pthread_cond_init(&m_cond, NULL) != 0)
            throw std::runtime_error("Unable to create condition"); // std::bad_alloc has no string constructor
    }

    inline ~CCondition()
    {
        verify(pthread_cond_destroy(&m_cond) == 0);
    }

    inline void Signal(bool bSignalAll = false)
    {
        int nRetVal;
        if (bSignalAll)
            nRetVal = pthread_cond_broadcast(&m_cond); // Signal all
        else
            nRetVal = pthread_cond_signal(&m_cond); // Signal one

        assert(nRetVal == 0);
    }

    inline void Wait()
    {
        verify(pthread_cond_wait(&m_cond, &m_pairedMutex.GetMutex()) == 0);
    }
};

Conditions - Description

Conditions are a lot like events - they indicate that some condition is or isn't true, and they may be waited on by threads that have nothing to do until the condition becomes true. However, unlike an event, they are integrally paired with a mutex that protects some data. A thread waits on the condition from inside the mutex, and when it gets woken, it will still be inside the mutex. Yet the mutex does not remain locked while the thread sleeps, so other threads can enter the mutex, modify the data, and signal that the condition has become true.

So, why do we implement both? Events are a Windows thing, as Windows doesn't support conditions. But POSIX doesn't support events - it only supports conditions. This seemed like a curious design decision when I was first learning the POSIX API (as I've always programmed on Windows, and events were all I knew), but I understand the reasons better now.

While there are times where a simple "fire and forget" event may be useful, usually an event is associated with the state of some data. For example, one of my programs was an audio-streaming library. This system used three worker threads for its tasks (the rationale for the exact number is rather lengthy, so I won't cover it here) - a streaming (decoding) thread, a prebuffering thread, and a prefetching thread. The job of the prebuffering thread is to prebuffer new streams before they begin playing; the rest of the time, it sleeps, while it waits for new requests. When a request arrives, the condition gets flagged, and the thread goes to work. When this happens, the thread must lock the mutex protecting the list of streams needing prebuffering (as it's accessed by more than 1 thread), dequeue one stream, leave the mutex, then do its prebuffering outside the mutex. When it's done prebuffering, it adds the stream to the playing streams list (inside the mutex, of course), tells the OS that the stream has data to play, and checks if there are any more streams needing prebuffering; if no more streams need prebuffering, the thread goes back to waiting on the condition.

The point is that signaling of the condition will always be followed by the locking of the mutex. This is where condition variables come in. By combining the locking/unlocking with the wait on the condition, this leaves the OS free to do any desired optimizations, which could further improve performance. One example of a potential optimization would be a direct context switch of the CPU from the thread signaling the condition to the thread waiting on the condition, with 0 latency. From what I've heard, Windows internally supports something like this, called event pairs (my information is second-hand, as I know very little about how the Windows kernel works internally, apart from what I've heard from more knowledgeable friends and my Inside Windows NT book). In any case, the goal of LibQ is to provide a platform-independent set of interfaces, which take advantage of features that are optimized on each platform. Because the condition/mutex model is so common, that means we have to support it.

As a footnote, I should mention a couple of sticky points about conditions. Unlike events, conditions do not retain their value. If a thread signals a condition, absolutely nothing happens if there are no threads waiting on that condition (unlike for events, where the event will remain set until reset). The reason for this is that threads are expected to check the data associated with a condition to determine whether the condition is already true (since they'll already be inside the mutex). Thus, the data itself indicates whether the condition is true or false.

Also, according to the POSIX standard, conditions are not guaranteed to be watertight. That is, a thread waiting on a condition may get woken when the condition is not actually true. This can happen for several reasons; for example, a thread may be waiting on the mutex before a condition becomes true, and so will get the mutex before the woken thread gets to it, changing the value of the data so that the condition is no longer true. Beware of this - when Wait returns, always check that the condition is indeed true; if not, call Wait again.
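In code, that guard looks like a loop around the wait. Here's a minimal sketch of the pattern, written with C++11's std::condition_variable rather than LibQ's classes so it stands alone (the queue-of-ints "condition" and the function names are just illustrations):

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

std::mutex g_mutex;
std::condition_variable g_cond;
std::queue<int> g_queue;

// Consumer: waits until the queue is non-empty, rechecking after every
// wakeup, because being woken does not guarantee the condition still holds.
int DequeueOne()
{
    std::unique_lock<std::mutex> lock(g_mutex);
    while (g_queue.empty())   // always a loop, never a plain if
        g_cond.wait(lock);    // releases the mutex while sleeping
    int nItem = g_queue.front();
    g_queue.pop();
    return nItem;
}

// Producer: modify the data under the mutex, then signal.
void EnqueueOne(int nItem)
{
    {
        std::lock_guard<std::mutex> lock(g_mutex);
        g_queue.push(nItem);
    }
    g_cond.notify_one();      // signaling outside the mutex is allowed
}
```

The same shape applies to CCondition above: call Wait inside a while loop that tests the data, never once.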

Friday, June 10, 2005

Mutexes - Windows Implementation

Really, there isn't a great deal to say about the Windows version of the mutex, either. It's pretty similar to the fast event and fast semaphore already written. In this case, the atomic variable is a count of the number of would-be threads inside the mutex - at most one will be inside the mutex, and the rest will be waiting on the semaphore. Thus, Enter increments the count, waiting if the count was nonzero; Leave decrements the count, and releases a thread if the count is nonzero after decrementing. Of course, if possible, Enter spins to try to grab the mutex before ultimately incrementing the counter and going into wait. The code:

class CMutex // Win32 version of CMutex (fast). This implementation is actually platform-independent, but on POSIX we need to use the native mutex type so that we can use condition variables.
{
private:
    CSemaphore m_sem; // Semaphore threads waiting on the mutex will wait on

    unsigned int m_nSpinCount; // Constant. Number of times to spin before waiting on the semaphore

    int32 m_nLockCount; // Interlocked. Number of threads trying to lock the mutex (also the number of threads waiting on the semaphore + 1)

public:
    // The spin count is used on multi-processor systems (and will in fact do more harm than good on single-processor systems). It allows a "grace period" where the mutex will spin on the CPU waiting for another processor to release the mutex. This can be much faster than waiting on the semaphore, but just wastes CPU cycles on a single-processor system. Spinning may not be useful if threads typically hold the mutex for very long time periods.
    inline CMutex(unsigned int nSpinCount = 0) : m_sem(0)
    {
        m_nSpinCount = nSpinCount;

        m_nLockCount = 0;
    }

    inline ~CMutex()
    {
    }

    // Tries to enter the mutex, but will fail if the mutex cannot be entered without waiting. Returns true if the mutex was entered.
    inline bool TryEnter()
    {
        // If the lock count was 0, we now own the mutex
        return (AtomicCompareExchange(&m_nLockCount, 1, 0) == 0);
    }

    // Waits indefinitely for the mutex. This function is NOT reentrant. Calling it twice in a row in a thread without calling Leave will cause the thread to permanently deadlock. Always make sure your Enter is paired with a Leave (or use the auto-lock below).
    void Enter()
    {
        // Spin first, in case we can get it without waiting
        unsigned int nSpinCount = m_nSpinCount;
        while (nSpinCount-- > 0)
        {
            if (TryEnter())
                return;
        }

        // Try to get the mutex
        int32 nLockCount = AtomicIncrement(&m_nLockCount);

        if (nLockCount == 1)
            return; // No thread held the mutex. We do, now.

        // Another thread held the mutex. We have to wait.
        m_sem.Wait();
    }

    inline void Leave()
    {
        int32 nLockCount = AtomicDecrement(&m_nLockCount);

        if (nLockCount > 0)
            m_sem.Post(); // Somebody's waiting on the mutex. Release one waiter.
    }
};
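The "auto-lock" mentioned in the Enter comment isn't shown in this post; here's a sketch of what such a guard might look like (CScopedLock and the toy CCountingLock are illustrative names, not necessarily LibQ's):

```cpp
#include <cassert>

// Generic scoped lock: Enter on construction, Leave on destruction,
// so every Enter is paired with a Leave even if the scope exits early.
template <typename Lock>
class CScopedLock
{
private:
    Lock &m_lock;

public:
    explicit CScopedLock(Lock &lock) : m_lock(lock) { m_lock.Enter(); }
    ~CScopedLock() { m_lock.Leave(); }

private:
    CScopedLock(const CScopedLock &);            // non-copyable
    CScopedLock &operator=(const CScopedLock &);
};

// Toy lock that just counts calls, to demonstrate the pairing.
struct CCountingLock
{
    int nEnters, nLeaves;
    CCountingLock() : nEnters(0), nLeaves(0) {}
    void Enter() { ++nEnters; }
    void Leave() { ++nLeaves; }
};
```

Because the destructor runs on any exit path, a thrown exception can't leave the mutex held.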

Mutexes - POSIX Implementation

Really there isn't much to say, here. The POSIX version of our mutex is nothing more than a thin wrapper around a pthread_mutex, with profuse use of our verify macro.

class CCondition; // Circular reference incoming. We don't have any choice but to forward declare it.

class CMutex // POSIX version of CMutex: pthread_mutex (fast?)
{
private:
    pthread_mutex_t m_mutex;

    inline pthread_mutex_t &GetMutex()
    {
        return m_mutex;
    }

public:
    // Compatible prototype, even though the spin count is unused
    inline CMutex(unsigned int nSpinCount = 0)
    {
        if (pthread_mutex_init(&m_mutex, NULL) != 0)
            throw std::runtime_error("Unable to create mutex");
    }

    inline ~CMutex()
    { verify(pthread_mutex_destroy(&m_mutex) == 0); }

    inline bool TryEnter()
    { return (pthread_mutex_trylock(&m_mutex) == 0); }
    inline void Enter()
    { verify(pthread_mutex_lock(&m_mutex) == 0); }
    inline void Leave()
    { verify(pthread_mutex_unlock(&m_mutex) == 0); }

    // Allow CCondition to access the mutex directly
    friend class CCondition;
};

Sorry About That

Just learned that anonymous comments weren't allowed. Fixed, now :P

Events - New POSIX Implementation

The new POSIX version of the event is similar to, but a bit more complicated than, the fast semaphore. In fact, the POSIX version of the standard event derives from a platform-independent fast event. It's based on the same principle as the fast semaphore: an atomic variable in combination with a slow semaphore that never rises above 0. It's a bit more complicated, though, because now our atomic variable is a compound value, rather than a single counter.

The reason for this is that we can only operate on a single variable atomically, yet we have two values that must be tracked - the status of the event (set or cleared), and the number of waiters (if cleared). The compound value structure consists of the highest bit indicating that the event is set, and the low bits indicating the number of waiters. Logically, it's impossible for there to be any waiters when the event is set.

With this configuration, it should be fairly obvious how to implement this. The implementation will rely heavily on the fact that we can read a variable and then change it at a later time, as long as we can verify that the variable hasn't been changed in between (which we can, using AtomicCompareExchange - if the variable has been changed when we try to write it back, we simply read it again and repeat). This allows us to read the status value, perform logic on it, and then write it back with some computed new value, and still be totally thread-safe in the whole operation.
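That read-compute-write-back idiom maps directly onto compare-and-exchange. As a standalone sketch (using C++11 std::atomic rather than LibQ's AtomicCompareExchange, with made-up names), here's the auto/manual-reset decision from Set expressed as such a loop - compare_exchange_weak reloads the expected value on failure, which is exactly the "read it again and repeat" step:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Compound status word: high bit = "event is set", low 31 bits = waiter
// count (only meaningful while the event is clear).
const uint32_t SET = 0x80000000u;
const uint32_t WAITER_MASK = 0x7FFFFFFFu;

std::atomic<uint32_t> g_nStatus(0);

// Read-compute-compare-exchange loop: retry until no other thread has
// changed the word between our read and our write-back.
// Returns the number of waiters to release (0 if there's nothing to do).
uint32_t SignalSet(bool bAutoReset)
{
    uint32_t nStatus = g_nStatus.load();
    for (;;)
    {
        if (nStatus == SET)
            return 0;                      // already set

        uint32_t nNewStatus;
        if (nStatus != 0 && bAutoReset)
            nNewStatus = nStatus - 1;      // release one waiter, stay clear
        else
            nNewStatus = SET;              // set, releasing any waiters

        // On failure, nStatus is reloaded with the current value and we loop.
        if (g_nStatus.compare_exchange_weak(nStatus, nNewStatus))
            return nStatus - (nNewStatus & WAITER_MASK);
    }
}
```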

The Reset and TryWait methods are trivial - a single atomic function call - so I won't bother explaining them. Set and Wait aren't much more complicated, but I'll mention them a bit.

The only real logic that Set requires is what to do if the event is cleared. If it's a manual-reset event, all waiting threads get released, and the status becomes set. If it's an auto-reset event, what happens depends on whether there are waiting threads. If there are, then one thread is released, and the status remains cleared (although there's one less waiting thread). If no threads are waiting, the status becomes set, just like with the manual-reset event.

The logic in Wait is essentially the same. If the status isn't set, then the waiting thread count gets incremented, and the thread waits on the semaphore. If it's set and it's a manual-reset event, Wait returns immediately, and the status remains set. If the status is set and it's an auto-reset event, the status is cleared, and Wait returns (at this point no threads are waiting on the event).

The code for CFastEvent and the POSIX CEvent:

class CFastEvent
{
private:
    CSemaphore m_sem; // Semaphore that threads will wait on

    unsigned int m_nSpinCount; // Constant. The number of times to attempt to check if the event is set before waiting

    bool m_bAutoReset; // Constant

    int32 m_nStatus; // Atomically accessed. The sign bit indicates that the event is currently set. The rest of the bits are the number of threads waiting on the event when the event isn't set.

    // Various local constants
    enum
    {
        WAITER_MASK = 0x7FFFFFFFL,
        SET = 0x80000000L
    };

public:
    inline CFastEvent(bool bAutoReset, bool bSet, unsigned int nSpinCount = 0) : m_sem(0)
    {
        m_nSpinCount = nSpinCount;

        m_bAutoReset = bAutoReset;

        m_nStatus = (bSet ? SET : 0);
    }

    inline ~CFastEvent()
    {
    }

    void Set()
    {
        int32 nStatus, nNewStatus;
        do
        {
            nStatus = m_nStatus;

            if (nStatus == SET)
                return; // Already set

            if (nStatus && m_bAutoReset)
                nNewStatus = nStatus - 1; // If there are waiters and it's an auto-reset event, release 1 and leave unset
            else
                nNewStatus = SET; // Else set and release all
        } while (AtomicCompareExchange(&m_nStatus, nNewStatus, nStatus) != nStatus);

        // If there were any waiters, release the appropriate number
        if (nStatus)
            m_sem.Post(nStatus - (nNewStatus & WAITER_MASK));
    }

    inline void Reset()
    {
        // If it's set, reset it; else do nothing
        AtomicCompareExchange(&m_nStatus, 0, SET);
    }

    inline bool TryWait()
    {
        // New event status if we successfully get the event
        int32 nNewStatus = (m_bAutoReset ? 0 : SET);

        // If it was previously set, we got it. Clear it or leave it set based on whether it's an auto or manual-reset event.
        return (AtomicCompareExchange(&m_nStatus, nNewStatus, SET) == SET);
    }

    // Wait indefinitely for the event
    void Wait()
    {
        // Don't wait if we can get it by spinning
        unsigned int nSpinCount = m_nSpinCount;
        while (nSpinCount--)
        {
            if (TryWait())
                return;
        }

        // Check if the event is set, queue as a waiter if not
        int32 nStatus, nNewStatus;
        do
        {
            nStatus = m_nStatus;

            if (nStatus == SET)
                nNewStatus = (m_bAutoReset ? 0 : SET); // Clear it or leave it set
            else
                nNewStatus = nStatus + 1; // One more waiter
        } while (AtomicCompareExchange(&m_nStatus, nNewStatus, nStatus) != nStatus);

        if (nStatus == SET)
            return; // It was set

        // It wasn't set. Wait for it.
        m_sem.Wait();
    }
};

// Simply provide the CFastEvent class with an interface compatible with CEvent
class CEvent : public CFastEvent
{
public:
    inline CEvent(bool bAutoReset, bool bSet) : CFastEvent(bAutoReset, bSet)
    { }
};

How to Surrender to MS in Style

I just saw this post on Slashdot (yes, I do read Slashdot), and found it so dazzlingly idiotic I just had to pass it around.
Testing is only a priority on closed source apps - Windows and IE being no exception. The very fact that users have neither access to the source code nor the ability to build the application sources means that any testing must be done "in-house". This is going to slow down the release cycle by exactly the amount of time it would take to run all the regression tests.

With Open Source, a patch can be released right away and users can compile in the new sources themselves. Any issues can be immediately identified and reported back to the maintainers, often with both the offending source code and potential fixes to the patch. Without the lengthy QA cycle, Open Source patches are much more immediate than any Closed Source shop could ever hope to achieve.

What's that I hear? Why, I do believe it's the sound of people stampeding to replace their OSS products with MS stuff after hearing how OSS (or at least this guy) does business. You just became MS's MVP of the week, dude; congrats!

Events - New Windows Implementation

This one's almost identical to the last. All I did was remove the timed Wait and replace it with a TryWait, for compatibility with the new POSIX version.

class CEvent // Win32 version of CEvent (slow)
{
private:
    HANDLE m_hEvent;

    inline HANDLE GetEvent()
    {
        return m_hEvent;
    }

public:
    // bAutoReset controls how the event reacts when it gets set. An auto-reset event will automatically go back to the unset state after one thread gets released from waiting because of the event being set. A non-auto-reset event will stay set until reset explicitly. CEvent throws if it cannot create the event.
    inline CEvent(bool bAutoReset, bool bSet)
    {
        m_hEvent = CreateEvent(NULL, !bAutoReset, bSet, NULL);
        if (!m_hEvent)
            throw std::runtime_error("Unable to create event");
    }

    inline ~CEvent()
    {
        verify(CloseHandle(m_hEvent));
    }

    inline void Set()
    {
        verify(SetEvent(m_hEvent));
    }

    inline void Reset()
    {
        verify(ResetEvent(m_hEvent));
    }

    // Returns true if the event was set, otherwise false
    inline bool TryWait()
    {
        DWORD nRetVal = WaitForSingleObject(m_hEvent, 0);
        assert(nRetVal != WAIT_FAILED);

        return (nRetVal != WAIT_TIMEOUT);
    }

    // Waits indefinitely for the event to be set
    inline void Wait()
    {
        verify(WaitForSingleObject(m_hEvent, INFINITE) != WAIT_FAILED);
    }
};

Ooooh. Shiny.

Have a look at what VC++ compiled a few lines of ThingyTron as, demonstrating the new atomic and endian function intrinsics (you can also see some calls that aren't intrinsic, but work nonetheless). Note that this is optimized code, so there isn't always a 1:1 relationship between the C++ line and the assembly lines below it.

int32 moosquish = 15, mooblah = 79;
00403EC0 mov dword ptr [esp+20h],0Fh

mooblah = AtomicExchangeAdd(&moosquish, mooblah);
00403EC8 mov eax,4Fh
00403ECD lea ecx,[esp+20h]
00403ED1 lock xadd dword ptr [ecx],eax

mooblah = AtomicExchange(&moosquish, mooblah);
00403ED5 mov edx,ecx
00403ED7 xchg eax,dword ptr [edx]
00403ED9 mov ecx,eax

mooblah = AtomicCompareExchange(&moosquish, mooblah, -15);
00403EDB mov eax,0FFFFFFF1h
00403EE0 lock cmpxchg dword ptr [edx],ecx

mooblah = AtomicBitExchange(&moosquish, 2, 0);
00403EE4 push 0
00403EE6 push 2
00403EE8 mov eax,edx
00403EEA push eax
00403EEB call _AtomicBitExchange@12 (401060h)

mooblah = AtomicBitExchangeCompliment(&moosquish, 7);
00403EF0 push 7
00403EF2 lea ecx,[esp+24h]
00403EF6 push ecx
00403EF7 call _AtomicBitExchangeCompliment@8 (401090h)

mooblah = AtomicSignedExchangeAdd(&moosquish, 29, false);
00403EFC push 0
00403EFE lea edx,[esp+24h]
00403F02 push 1Dh
00403F04 push edx
00403F05 call Q::AtomicSignedExchangeAdd (4010E0h)
00403F0A add esp,0Ch

void *squishyptr = (void *)0x1234567,
*fatptr = AtomicCompareExchangePointer(&squishyptr, (void *)0x76543210, (void *)0x1234567);
00403F0D push 1234567h
00403F12 mov dword ptr [esp+6Ch],eax
00403F16 push 76543210h
00403F1B lea eax,[esp+74h]
00403F1F push eax
00403F20 mov dword ptr [esp+78h],1234567h
00403F28 call _AtomicCompareExchangePointer@12 (401040h)

int64 x = ltoh(0x0011223344556677), y = btoh(x);
00403F2D mov ecx,44556677h
00403F32 bswap ecx

printf("mooblah = %d, mooblah = %d, squishyptr = 0x%X, fatptr = 0x%X, x = %i64X, y = %i64X", moosquish, mooblah, squishyptr, fatptr, x, y);
00403F34 push ecx
00403F35 mov esi,112233h
00403F3A bswap esi
00403F3C push esi
00403F3D push esi
00403F3E push ecx
00403F3F push eax
00403F40 mov ecx,dword ptr [esp+80h]
00403F47 mov edx,dword ptr [esp+7Ch]
00403F4B mov eax,dword ptr [esp+34h]
00403F4F push ecx
00403F50 push edx
00403F51 push eax
00403F52 push offset string "mooblah = %d, mooblah = %d, squi"... (40D4C0h)
00403F57 call printf (405697h)

Thursday, June 09, 2005

Silly Compiler Tricks

So, while I was writing those atomic functions in assembly, I figured I might as well go ahead and write the endian conversion functions. The functions themselves are trivial to write, using rol for 16-bit flipping, and bswap for 32-bit and 64-bit flipping. But after I got done with that, Merlin asked me why I didn't just use the VC++ flipping functions (_byteswap_ushort, _byteswap_ulong, and _byteswap_uint64). Well, the truth was that I didn't know about those. So, I go back and add a special case for the endian functions, when VC++ is used. So, after writing all that, I go to test the code (or #defines, in this case). Of course, I first build the test program (called ThingyTron, in this case). Stepping through the code leads me to the implementation of the _byteswap_uint64 function, which looks like this:

unsigned __int64 __cdecl _byteswap_uint64(unsigned __int64 i)
{
    unsigned __int64 j;
    j =  (i << 56);
    j += (i << 40) & 0x00FF000000000000;
    j += (i << 24) & 0x0000FF0000000000;
    j += (i << 8)  & 0x000000FF00000000;
    j += (i >> 8)  & 0x00000000FF000000;
    j += (i >> 24) & 0x0000000000FF0000;
    j += (i >> 40) & 0x000000000000FF00;
    j += (i >> 56);
    return j;
}

And if you think that's ugly, you should see the assembly the compiler generates from it. Fortunately, Merlin was right - in release build, these functions are translated into intrinsics (basically identical to the assembly I wrote, only inlined in the calling function). Because of this, not only are they as fast as the functions I wrote, but they don't have the overhead of my function calls, making them extremely fast. And so, everything worked out beautifully in the end. I do have to wonder, though, what on earth prompted them to use such a horrible flipping algorithm for the non-intrinsic version.
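For reference, the whole job is just a shift-and-mask per byte; here's a portable sketch equivalent to what bswap (or the intrinsics) accomplish in a single instruction (the function names are mine, not the CRT's):

```cpp
#include <cassert>
#include <cstdint>

// Portable 32-bit byte swap: what the bswap instruction (or the
// _byteswap_ulong intrinsic) does in one cycle.
uint32_t SwapEndian32(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) |
           ((x & 0x0000FF00u) << 8)  |
           ((x & 0x00FF0000u) >> 8)  |
           ((x & 0xFF000000u) >> 24);
}

// 16-bit swap: equivalent to a rotate by 8 (rol) on x86.
uint16_t SwapEndian16(uint16_t x)
{
    return (uint16_t)((x << 8) | (x >> 8));
}
```

Swapping twice round-trips, which makes these easy to sanity-check in a unit test.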

Silly MASM

So, I was just stepping through my new atomic functions code (the x86-32 version built with MASM) and noticed something kind of amusing. Take a look at the disassembly (after it's been assembled):
_AtomicExchange@8:
00420A30 8B 54 24 04 mov edx,dword ptr [esp+4]
00420A34 8B 44 24 08 mov eax,dword ptr [esp+8]
00420A38 F0 87 02 lock xchg eax,dword ptr [edx]
00420A3B C2 08 00 ret 8
00420A3E 8B FF mov edi,edi
_AtomicExchangeAdd@8:
00420A40 8B 54 24 04 mov edx,dword ptr [esp+4]
00420A44 8B 44 24 08 mov eax,dword ptr [esp+8]
00420A48 F0 0F C1 02 lock xadd dword ptr [edx],eax
00420A4C C2 08 00 ret 8
00420A4F 90 nop
_AtomicCompareExchange@12:
00420A50 8B 4C 24 04 mov ecx,dword ptr [esp+4]
00420A54 8B 44 24 0C mov eax,dword ptr [esp+0Ch]
00420A58 8B 54 24 08 mov edx,dword ptr [esp+8]
00420A5C F0 0F B1 11 lock cmpxchg dword ptr [ecx],edx
00420A60 C2 0C 00 ret 0Ch
00420A63 8D A4 24 00 00 00 00 lea esp,[esp]
00420A6A 8D 9B 00 00 00 00 lea ebx,[ebx]

See how it pads out the functions, to align new functions on paragraph (16 byte) boundaries? Well, that's exactly what I told it to do. But what's amusing is WHAT it pads the functions with. It seems to like to fill the gaps with do-nothing instructions that consume the fewest possible cycles. No idea what the point of this is, but I thought it was humorous.

Backpedaling

Okay, enough procrastination; it's time to write this thing.

So, it's not been a good week, as far as coding has been concerned. Rewriting the atomic functions is turning out to be a major pain, as I'm not really sure what the optimal way to structure my code is. I want to support at least x86-32, x86-64, PPC32, and PPC64. I also want to support at least VC++ and GCC (which uses AT&T style assembly). So, how many copies of the functions do I have to write? Well, I'd rather not think about that :P Then there's the complication that VC++ does not support x86-64 inline assembly, meaning that at the very least I have to write those functions in an ASM file and compile it with ML64 (MASM 64). As well, as of last night, the x86-32 functions were in an ASM file to be compiled with ML (MASM).

As if that wasn't enough trouble, Merlin just brought to my attention that VS 2005 contains compiler intrinsics for the atomic functions. Intrinsics, of course, mean that the compiler could insert the instructions directly into the calling functions, without the need to call another function such as mine. As this is a serious benefit to performance, and performance is one of my biggest goals with LibQ, I'll definitely want to use those instead of my own assembly functions, when compiled with VS 2005.
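GCC offers the same capability via its __sync builtins, which likewise compile down to inline lock-prefixed instructions instead of a function call. Here's a sketch of wrappers matching LibQ's argument convention (destination, new value, comparand) - the wrapper names mirror LibQ's functions, but the bodies are my own guess at how they'd map:

```cpp
#include <cassert>
#include <cstdint>

// Atomically add nAdd to *pValue, returning the previous value.
// __sync_fetch_and_add compiles to a single lock xadd on x86.
int32_t AtomicExchangeAdd(volatile int32_t *pValue, int32_t nAdd)
{
    return __sync_fetch_and_add(pValue, nAdd);
}

// Atomically set *pValue to nNew if it currently equals nComparand,
// returning the previous value. Compiles to lock cmpxchg on x86.
// Note the builtin takes (ptr, oldval, newval), the reverse of LibQ's order.
int32_t AtomicCompareExchange(volatile int32_t *pValue,
                              int32_t nNew, int32_t nComparand)
{
    return __sync_val_compare_and_swap(pValue, nComparand, nNew);
}
```

This is GCC/Clang-specific, so it would sit behind the same kind of compiler #ifdef as the VS 2005 intrinsics.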

So, what else could go wrong? Well, plenty, it turns out. First of all, I noticed a design flaw in my POSIX event implementation. It wasn't immediately obvious because the event works most of the time. The problem comes when one thread tries to reset the event just after another thread set it. Setting it should release at least one thread waiting on the event, even if the event is subsequently reset. But in our implementation, in order for a thread to be released from a wait, the event must remain signaled until that thread gets the CPU and checks the status of the event. If a second thread resets the event before this occurs (which is not unreasonable, given the latency associated with bringing threads out of wait), the threads would remain waiting indefinitely. So condition variables are not the way to go. I've written an alternate version of the event class that does not have this problem; due to the nature of the implementation, it also lacks something else: support for wait timeouts. So, I guess that means I'm gonna have to remove timed waits from both the Windows and POSIX implementations, to keep the feature set identical.

Lastly, some features were removed from the mutex class. I had originally said that, because the POSIX mutex lacked some features that we wanted - namely, the ability to spin before going into wait - we'd simply write our own implementation with the desired features. The problem with this is that I still want to support condition variables in LibQ, and POSIX condition variables require a mutex - a POSIX mutex. So, I don't have much choice but to use a POSIX mutex for the POSIX version of the mutex class, to allow it to interoperate with the condition variable later. I chose to leave the spinning feature in the Windows version, however, to benefit from it where possible, and simply ignore it in the POSIX version (leaving the interface identical in both cases).

Thursday, June 02, 2005

New and Improved

Well, I'm actually getting pretty excited about the x86-64 architecture, now that I'm learning about it. The reason I'm only now learning about it is that I've never needed to know before. For quite some time x86-64 was only available on high-end server systems, which I would never have, and it was not expected that my programs (mostly related to games) would be used on such a system. But now x86-64 is available on the desktop, and Windows is ready to support it.

So, let me give a brief overview of how x86-64 differs from x86-32 (legacy x86), and why I'm excited about it.

x86-64 is a 64-bit incarnation of the x86 architecture, supporting the same instruction set and features (most of the features, anyway). In practical terms, that means that it can do math on 64-bit numbers in 64-bit registers with single instructions. This means that, on average, code to work with 64-bit (or larger) numbers will be half as large, and take half as many cycles to execute (as 32-bit processors would have to do multiple operations to emulate 64-bit math).

Associated with 64-bit registers is the ability to use 64-bit pointers. 64-bit pointers, of course, mean a vastly greater amount of memory that may be addressed by the processor (2^32 times the 4-gig limit of 32-bit pointers). But even more importantly, you don't need more than 4 gigs of memory to benefit from this. Most OSes these days use virtual address spaces, where the memory pointed to by a pointer doesn't need to be at that same address in physical memory. Some of the other things used for the address space besides physical memory (and empty space) are memory mapped ports and files. A larger address space, even with the same amount of physical memory, can allow larger regions of (or even more) files or device memory to be made available at once. Now, instead of continually mapping and unmapping small pieces of a 500 gig database, you can map the whole thing at once (theoretical example only; I'm not a database coder, and I have no idea if you would actually do this in a high-performance, commercial database system).
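As a concrete (POSIX-only) sketch of the mapping idea - the path and contents are made up, and error handling is reduced to a bool - this reads a file's bytes straight through a pointer instead of issuing read() calls:

```cpp
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Write a file, then read it back through a memory mapping rather than
// read() calls. Returns true if the mapped bytes match what was written.
bool MappedReadMatches(const char *szPath)
{
    const char szData[] = "hello mapped world";

    int fd = open(szPath, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return false;
    write(fd, szData, sizeof(szData));

    // The file's bytes now appear directly in the address space at pData
    void *pData = mmap(NULL, sizeof(szData), PROT_READ, MAP_SHARED, fd, 0);
    bool bMatch = (pData != MAP_FAILED) &&
                  (memcmp(pData, szData, sizeof(szData)) == 0);

    if (pData != MAP_FAILED)
        munmap(pData, sizeof(szData));
    close(fd);
    unlink(szPath);
    return bMatch;
}
```

With a 64-bit address space, the same call could map a file far larger than 4 gigs in one shot.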

Yet despite this, the x86-64 architecture remains backward compatible. In addition to supporting the legacy CPU modes (real mode, protected mode, and virtual 8086 mode), which run either 16-bit or 32-bit programs and OS, it also supports two new modes: long compatibility mode and long 64-bit mode. Long 64-bit mode is exactly what you'd expect a 64-bit processor to do: run programs and OS natively in 64-bit mode, with 64-bit registers, pointers, stack, etc. But this requires that both the OS and program be compiled for x86-64.

Long compatibility mode is what's really cool. It's a hybrid of the legacy modes and long mode, allowing a 64-bit OS, which takes full advantage of the 64-bit x86-64 features, to natively run old 16-bit and 32-bit programs. This means that x86-64 supports the best of both worlds: you can have a new, 64-bit OS, but still run old programs on it, natively.

Okay, so that's progress. But there are two other reasons I'm excited about x86-64, that don't have anything to do with the fact that it's 64-bit. In addition to supporting the 8 64-bit integer registers (RAX, RBX, RCX, RDX, RSI, RDI, RSP, RBP), the 8 64-bit floating point/MMX registers (MMX0-7/FPR0-7), and 8 128-bit XMM registers (XMM0-7), x86-64 supports 16 totally new registers. 8 of these are new integer registers, named R8-15 (kinda reminiscent of the PowerPC registers), and the other 8 are new XMM registers (XMM8-15). To me this is highly significant, as the limited number of registers (especially the integer registers) was one of the key limitations of the x86 architecture (so I believe). Having twice as many registers to work with means that code can do much more complicated math without needing to access memory to fetch/store variables.

Another significant limitation of the x86 architecture, I believe, was the inability to access memory relative to the current instruction. Branches and calls are generally relative to the current instruction, but memory accesses are always absolute (that is, they go to the same memory location regardless of the location of the current instruction).

If I had to make an educated guess as to why the x86 architecture did not initially support relative data access, I'd say it's due to the fact that originally x86 programs were segmented. You had a 64 KB code segment, a 64 KB data segment, a 64 KB stack segment, and a few other optional segments (don't ask what happens when you need more than 64 KB for your code or data, you don't want to know), which could be anywhere in memory, and were pointed to by segment registers. Branches and calls were made relative not only to the current instruction, but also to the base offset of the code segment. Memory accesses were made relative to the base offset of the data segment.

But things aren't done that way anymore. Segments are rarely used in modern OSes, replaced mainly by virtual address spaces and memory mapped files. In Windows, an executable (or library) is mapped as a single piece (although there can be holes). All the addresses in that executable are fixed relative to the base address (unlike with segments, where you could have each segment be anywhere in physical memory). Yet the base address of an executable is not fixed; it can be loaded almost anywhere in the virtual address space. So you now have lots of pointers that point to global variables at fixed addresses, when the variables themselves could be anywhere in memory. This requires the OS to correct all these pointers when it loads the executable into memory, which takes time. This problem is elegantly remedied by the use of pointers relative to the current instruction, as the relative offset never changes, no matter where the executable gets loaded.

Whoa!

EDMONTON - A team from the University of Alberta has proven for the first time that a single molecule can switch electrical currents off and on, a puzzle that scientists worldwide have been trying to crack for decades.

The finding could revolutionize the field of electronics, providing a leap ahead for everything from computers to batteries to medical equipment.

"We've made a tiny, tiny, maybe the ultimate tiniest transistor," said Robert Wolkow, a physics professor and the principal investigator who headed the research team from the National Institute for Nanotechnology at the U of A.

"This has been a key goal of researchers in this field for nearly 20 years and we have done it. ... Molecular electronics are indeed possible, no longer just a futuristic theory."

Edmonton Journal

Cool and Uncool

So I've been reading about the x86-64 architecture, and programming it. It's pretty cool. The 8 additional integer registers alone make it substantially better than x86-32. Also, the fastcallish (or perhaps you'd prefer 'RISCish') calling convention is nice.

The functions themselves look trivial to write. Not so cool is the fact that VC++ doesn't support inline assembly. So I guess I'm gonna have to learn to use the standalone assembler for x86-64. Things are going to get reaaally unfun (and unportable) if I have to decorate the function names manually.

On an unrelated note, I should make a transacted file class at some point - that is, a file that remembers all the modifications you make to it (and will recall them if you read from a modified area), but doesn't actually alter the file on disk until you close it. That would support the discarding of changes on close, in addition to saving.

Wednesday, June 01, 2005

Blinding Flash of the Obvious

So yeah, I had one of those. I was working on modifying the CMutex class to support delayed semaphore allocation (which would reduce resource usage for rarely used mutexes), and noticed a couple of glaring deficiencies in the atomic functions set. First, there's no AtomicCompareExchange function for pointers. This is important because a pointer can vary in size - either 32 or 64 bits. That brought to light a second epiphany: NONE of those atomic functions I wrote support 64-bit pointers. Blindingly obvious, but quite frankly, I've never had a 64-bit CPU to work on, and thus it's not exactly second nature for me to work with 64-bit assembly (particularly x64 assembly). Guess it's time to grab those AMD x86-64 manuals off my shelf and write some atomic functions...

And while I'm at it, I should probably make some defines for 8, 16, 32, and 64-bit data types, rather than using short, long, and long long...