Tuesday, May 31, 2005

Random Fact

I do most of this posting while in transit from one area to another in World of Warcraft.

(Fast) Mutexes - Implementation Rationale

Both Windows and POSIX support mutexes natively, but I chose not to use either, for different reasons.

On Windows, a mutex is a kernel mode object - that is, a slow mutex. This is much too slow for us; a fast mutex will almost always be sufficient, and much, much faster. Windows also offers a fast mutex called a critical section. I chose not to use it because it does not offer one of the features we desire: the ability to try-enter the mutex (try-enter was only introduced with NT 4.0, and is not available on Windows 9x systems). As well, support for spinning is not available on Windows 95 or NT 4.0 pre-Service Pack 3.

POSIX mutexes, on the other hand, are fast. But they, too, lack features that we desire. While POSIX mutexes support try-enter, they do not support spinning. As mutexes are perhaps the place where spinning is most beneficial (often a thread will only hold the mutex long enough to update a variable or two), this is undesirable for us.

So, we're going to write our own fast mutex.

(Fast) Mutexes - Description

Our atomic functions work beautifully for performing thread-safe access to a single integer variable. But what if we have a more complex data type, such as a string, or a linked list? Similarly, what if we have more than one variable which must be updated at the same time (where updating one but not the other would cause bad things to happen)? This is what a mutex is for.

A mutex is used to protect access to one or more variables across multiple threads. Threads enter the mutex before accessing the variables, and leave the mutex when they're done. If, while one thread is inside the mutex, a second thread tries to enter the mutex, the second thread will get in line and go into wait until the first thread leaves the mutex. Thus, a mutex ensures MUTually EXclusive access to shared variables.

It is worth noting that only variables which are shared across threads need protection from concurrent access. Variables which are only used by a single thread do not need protection, be it by atomic variables or a mutex. Both atomic variables and use of a mutex are slower than thread-unsafe access, and so are not desirable where not needed.
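As a concrete (if contrived) illustration of why a single atomic variable isn't enough, here's a minimal sketch of protecting two variables that must stay consistent with each other. It uses raw pthread calls for concreteness, not LibQ's own mutex class; the names (g_queueLock and so on) are made up for the example.

#include <pthread.h>

// Two variables that must always be updated together
static long g_nBytesQueued = 0;
static long g_nPacketsQueued = 0;
static pthread_mutex_t g_queueLock = PTHREAD_MUTEX_INITIALIZER;

void QueuePacket(long nBytes)
{
    pthread_mutex_lock(&g_queueLock); // Enter the mutex

    // No other thread can ever see one of these updated and the other not
    g_nBytesQueued += nBytes;
    g_nPacketsQueued++;

    pthread_mutex_unlock(&g_queueLock); // Leave the mutex
}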

Sunday, May 29, 2005

Asynchronous I/O - Options

I've seen three options for asynchronous I/O that are at least halfway viable. The difference between these options lies not in how I/O requests are made, but in how the application is notified that an I/O request has completed (be it successfully or unsuccessfully).

The first is event-based notification. Event-based notification requires that the calling program create an event for each read or write request. The OS will remember which event goes with which request, and then set that event when the operation has completed. Afterwards, the application may call a function to get the status of the request (whether or not it has failed, and how much was read or written).

The usefulness of this isn't readily apparent. Obviously it would be far too cumbersome for most asynchronous I/O, as it would require that the application remember every request it made and then poll the events periodically, not to mention the huge waste of resources of allocating a large number of events for this purpose.

Yet it is useful for one thing: simulating synchronous I/O. I don't know if this is the case with the POSIX APIs, but in Windows, files are opened for either synchronous or asynchronous access, and you cannot mix calls for a file. Consequently, if you have a place in the application where you need to do I/O synchronously on a file opened asynchronously (such as verifying the header for a file is valid), this allows you to do so.
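On Windows, that simulated-synchronous pattern looks roughly like the sketch below. This is only a sketch: error handling is trimmed, and the file name is made up.

#include <windows.h>

void ReadHeaderSynchronously()
{
    // Open the file for asynchronous (overlapped) access
    HANDLE hFile = CreateFile(TEXT("data.bin"), GENERIC_READ, FILE_SHARE_READ,
        NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);

    // Issue the read with an event to be set on completion
    OVERLAPPED overlapped = { 0 };
    overlapped.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL); // Manual-reset
    char header[64];
    ReadFile(hFile, header, sizeof(header), NULL, &overlapped);

    // Wait for the event, then ask how the request turned out - effectively
    // a synchronous read on a file opened for asynchronous access
    WaitForSingleObject(overlapped.hEvent, INFINITE);
    DWORD nBytesRead;
    GetOverlappedResult(hFile, &overlapped, &nBytesRead, FALSE);

    CloseHandle(overlapped.hEvent);
    CloseHandle(hFile);
}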

The second method of asynchronous I/O notification is completion callback routines. In this case, the method of notification is a callback function which the application passes to the read/write function, and the OS calls when the I/O completes, with the parameters and the completion status of the request. This is a very generic, very versatile method, but the way it's implemented in various OSes leaves something to be desired.

The final method is something of a cross between event-based and completion callback notifications: the notification port. A notification port is essentially a message queue to which I/O completion notifications are posted. Threads can retrieve a notification from the port in a first-in-first-out manner, and can wait on the port if no notifications are remaining.

The really cool thing about notification ports (although it's common to other message queues, as well) is that more than one thread can wait on one port simultaneously. This makes it trivial to set up thread pools whose threads process completed I/O requests.

Windows NT introduced kernel support for notification ports (called I/O completion ports) in version 3.5. While possible to implement a message queue (or to marshal I/O completion notifications to a message queue) in user mode, doing it in the kernel gives some very nice features, such as optimization of CPU use. It's usually considered a good idea to create more threads in a thread pool than CPUs in the computer. The reason is that the threads may go into wait for something not related to the notification port, such as doing synchronous I/O while processing a notification. If there are more threads than CPUs, that ensures that even if some threads go into wait, other threads will be ready to keep all CPUs busy.

The problem with that is that it is wasteful in a different way. It's moderately expensive for a CPU to stop working on one thread and start working on another - this happens every time a thread's time-slice expires. If there are more threads running than CPUs to run on, this will happen more often - every 60 to 120 ms, on Windows NT. I/O completion ports were designed to solve both problems, by tracking (in the kernel) exactly when threads go into wait. If one thread from the notification pool goes into wait while processing a notification message, another thread will be run from the pool; yet no more threads in the pool than CPUs will be allowed to run concurrently. This is possible because the kernel knows exactly how many threads are running, and how many are waiting; this is something that can't reasonably be done without kernel support.
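A bare-bones sketch of the Windows flavor of this - an I/O completion port draining a worker pool - might look like the following. This is illustrative only, not LibQ code; the function names and thread-count choice are mine.

#include <windows.h>

// Each pool thread runs this loop, pulling completion notifications in FIFO order
DWORD WINAPI IoWorkerThread(LPVOID lpParam)
{
    HANDLE hPort = (HANDLE)lpParam;

    for (;;)
    {
        DWORD nBytes;
        ULONG_PTR nKey;
        OVERLAPPED *pOverlapped;

        // Blocks until an I/O completion (or explicit post) arrives
        BOOL bOK = GetQueuedCompletionStatus(hPort, &nBytes, &nKey, &pOverlapped, INFINITE);
        if (!bOK && pOverlapped == NULL)
            return 0; // The port itself failed or was closed

        // ... process the completed (or failed) request identified by nKey/pOverlapped ...
    }
}

HANDLE CreatePoolAndPort(unsigned int nThreads)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);

    // Let the kernel run at most one pool thread per CPU at a time,
    // even though we create more threads than that
    HANDLE hPort = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0,
        si.dwNumberOfProcessors);

    for (unsigned int i = 0; i < nThreads; i++)
        CloseHandle(CreateThread(NULL, 0, IoWorkerThread, hPort, 0, NULL));

    return hPort; // Associate files via CreateIoCompletionPort(hFile, hPort, key, 0)
}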

Asynchronous I/O - Introduction

I just felt like taking a break from discussing code that I've already written (which is everything on the blog so far) and discuss some code I'm currently writing. In addition to threading, one of the features I most wanted to support in LibQ was asynchronous I/O. The reason for this is that, like threading, asynchronous I/O is fiercely platform-dependent (even more so than threads, actually); as well, some platforms don't support asynchronous I/O at all (Windows 9x comes readily to mind - it supports asynchronous I/O for sockets, but not for files).

Synchronous I/O is the most common kind of file I/O. With synchronous I/O functions like fread, fwrite, etc., the function waits for the kernel to perform the I/O, and does not return until the I/O is completed (or failed). Asynchronous I/O is just the opposite - the I/O functions return immediately, before the I/O has been performed. The I/O will then be performed at some later time, while the application continues executing.

Asynchronous I/O is extremely useful when significant amounts of slow I/O are being performed, as it allows the application to continue using the CPU while the I/O is performed in parallel. Depending on how much the kernel needs the CPU to communicate with the device, the savings from asynchronous I/O can vary greatly. In the worst case, you may only gain a little by allowing the CPU and bus to be used in parallel; in the best case, the speedup can be hundreds of percent, if multiple independent devices are accessed simultaneously.

Saturday, May 28, 2005

Fast Semaphores - Implementation

The explanation I gave previously about how semaphores work should give you a pretty good idea of how we're going to do this. At the heart of this thing is the semaphore count; this counter will be accessed with our atomic functions, so access to it is thread-safe (and fast). When this count is positive, it represents the number of threads that can enter the semaphore without waiting. When it's negative, it represents the negative of the number of threads currently waiting. New threads trying to enter the semaphore will have to wait if it's nonpositive (0 or less). Each thread that tries to enter the semaphore (and is willing to wait for it) decrements the counter, to register the fact that it is in line (or in the semaphore; either way).

Threads that need to wait will do so on a slow semaphore, as waiting is something we can't do without kernel support. This slow semaphore will always be at a count of 0 or less, as threads will only try to grab it when they have to wait; when threads need to be released, this semaphore will be posted, but never increased above 0.

As we did with the slow semaphore, in addition to supporting a method that waits indefinitely for the semaphore to become available, we'll also support a try-wait method, which fails immediately if the semaphore cannot be obtained (the reason the slow semaphore did this was that POSIX semaphores, unlike Windows semaphores, don't support timed waits).

Finally, our fast semaphore supports spinning in addition to waiting. When you call the constructor, you tell it the number of times it should try to spin before waiting on the slow semaphore - this number should be 0 for single-CPU systems, and something above that for multi-CPU systems (the exact number for a particular application is probably best found by trial and error). When the Wait function is called, the fast semaphore will spin up to that many times; if it can't get the semaphore while spinning, it will wait on the slow semaphore. The try-wait function does not spin; it simply fails immediately if the semaphore cannot be grabbed.

I think that's all the important explanation, so here's the code:

class CFastSemaphore
{
private:
CSemaphore m_sem; // The slow semaphore that threads will wait on

unsigned int m_nSpinCount; // Constant. The number of times to attempt to grab the semaphore before waiting

long m_nSemCount; // Atomically accessed. Current count the semaphore is at. Positive numbers indicate the number of threads that can still claim the semaphore; negative numbers indicate the number of threads waiting on the semaphore.

public:
inline CFastSemaphore(long nInitialCount, unsigned int nSpinCount = 0) : m_sem(0)
{
assert(nInitialCount >= 0);

m_nSpinCount = nSpinCount;

m_nSemCount = nInitialCount;
}

inline ~CFastSemaphore()
{
}

inline bool TryWait()
{
long nSemCount;

do
{
nSemCount = m_nSemCount;

// If there are no free slots available, fail
if (nSemCount <= 0)
return false;
// Try to claim a slot
} while (AtomicCompareExchange(&m_nSemCount, nSemCount - 1, nSemCount) != nSemCount);

// Got it
return true;
}

void Wait()
{
// Spin first, in case we can get it without waiting
unsigned int nSpinCount = m_nSpinCount;
while (nSpinCount-- > 0)
{
if (TryWait())
return;
}

long nSemCount = AtomicDecrement(&m_nSemCount);

if (nSemCount >= 0)
return; // We got it; no need to wait

// We have to wait
m_sem.Wait();
}

inline void Post(long nCount = 1)
{
assert(nCount > 0);

long nSemCount = AtomicExchangeAdd(&m_nSemCount, nCount);

// Were there any waiters?
if (nSemCount < 0)
m_sem.Post(min(nCount, -nSemCount)); // Release as many as possible
}
};
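To make the spin-count parameter concrete, here's a tiny usage sketch. The numbers and the buffer-pool scenario are made up for the example; tune the spin count for your own workload.

// A semaphore guarding a pool of 4 identical buffers, spinning up to
// 4000 times on multi-CPU machines before falling back to a kernel wait
CFastSemaphore bufferPool(4, 4000);

void UseBuffer()
{
    bufferPool.Wait(); // Claim a buffer slot (spins, then waits if necessary)

    // ... grab a free buffer, fill it, hand it off ...

    bufferPool.Post(); // Release the slot for another thread
}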

Fast Semaphores - Spinlocks

As I already mentioned, a round trip to kernel mode and back is expensive in terms of CPU usage. But there's more to it than that. Once a thread has surrendered the CPU in a wait, the CPU will be given to another thread. Even if it gets woken very quickly, the waiting thread is likely not to get the CPU back for quite some time - perhaps more than 100 ms. If that thread needs to be responsive, that's a really long time. For this reason, something called a spinlock is employed.

Spinlocks are simply code that loops, repeatedly testing some condition without going into wait. Of course, this wastes CPU cycles, but the benefit can be worth it. If another thread is executing concurrently on another processor and signals the object, the spinning thread can avoid going into wait completely, saving a substantial amount of lag time. Of course, if this is a single-CPU system, the spinning is just dead CPU time, as there are no concurrently executing threads, period. As well, this is pointless if the synchronization object is likely to not be signalled for a long period of time (such as protecting a file handle during reads). But if something like a mutex is likely to be locked very frequently and only for a few - or even a few hundred - cycles, this can make a huge difference in performance.
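Just to make the idea concrete, a bare spinlock can be built on nothing but the AtomicCompareExchange and AtomicExchange from the earlier posts. This is only a sketch - it has no waiting fallback at all, so it's only sane for very short critical sections, and the class name is made up.

class CSpinLock
{
private:
    long m_nLocked; // 0 = free, 1 = held. Atomically accessed.

public:
    inline CSpinLock() { m_nLocked = 0; }

    inline void Enter()
    {
        // Loop until we're the thread that flips 0 to 1
        while (AtomicCompareExchange(&m_nLocked, 1, 0) != 0)
            ; // Spin - burns CPU, but never goes into wait
    }

    inline void Leave()
    {
        AtomicExchange(&m_nLocked, 0); // Release the lock
    }
};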

Friday, May 27, 2005

Fast Semaphores - Faster Than Slow

Okay. We have our semaphore, we have our event, and we have our atomic functions. Now what do we do with them? Well, you can do lots of things. Just to give you a little demonstration, we'll start off with something odd: a fast semaphore.

So what's fast? Well, fast and slow are terms used to describe where synchronization objects do their work. Remember that there are two modes the processor can be in: user mode and kernel mode. Kernel mode is where the OS (specifically the kernel) and drivers live, and user mode is where running programs live. To access kernel functionality through the OS API, your program must call into kernel mode, and the kernel jumps back to your program when it's done.

These jumps are not cheap. In fact, even a simple call into the kernel and back can take 1000 cycles on Windows on x86 (I don't have benchmark data for other platforms). While 1000 cycles isn't much, given the current processor speed of several billion cycles per second, when you've got something like a mutex (or semaphore) that you may lock just to do a few cycles of shared variable changing, and need to lock it many times each second, 2000 cycles (one call to enter the mutex, and another to leave it) can be quite a hefty price.

Both the event and semaphore classes we've created so far have been slow - every call goes into kernel mode. But with atomic functions, we can do better. We're going to make fast versions of the semaphore and event that take only some 20-30 cycles in the common case. Sound better? Well, here's how.

A semaphore is just a counter that can control threads; likewise, an event is a boolean that can control threads. The variables themselves we can handle in user mode with atomic functions. But there's one thing we'll never be able to do in user mode, and that's alter the run status of a thread - make it go to sleep or wake up; that can only be done by the kernel. Thus, by definition, a fast synchronization object stays in user mode, except to go to sleep or to wake up another thread. If a thread never needs to do either of those, it stays totally in user mode. The result, if you're locking something very frequently and for very short periods of time, is significant performance gains.

Events - POSIX Version

POSIX, on the other hand, does not support events; rather, it supports something called condition variables. Condition variables are a way of signalling waiting thread(s) that a condition protected by a mutex has become true (generally when a variable takes a certain value). You do this by calling pthread_cond_signal to signal one waiting thread, or pthread_cond_broadcast to signal all waiting threads. However, condition variables make you do a bit more yourself: you are expected to check (inside the mutex) whether the condition is already true before trying to wait on it, and signalling a condition that no threads are waiting on has no effect whatsoever.

What's cool about conditions, though, is that they are tied to the mutex protecting the variable. To check the condition and wait for it to become true, you'd do the following:
1. Enter the mutex
2. Check the value of the variable to see that the condition isn't already true
3. Wait on the condition to be signalled
4. Verify that the variable is now set so that the condition is true
5. Do whatever you want when the condition is true
6. Unlock the mutex

If you're a Windows programmer used to using a separate mutex and event for this kind of thing, you'll notice a conspicuous absence: the mutex is not left before waiting on the condition, nor is it re-entered after waiting. You'd quickly learn the error of your ways if you tried this with an event and mutex, as you'd deadlock the thread. With a condition variable, the mutex doesn't remain locked while the thread is waiting, but it is relocked when the thread comes out of the wait.

Anyway, I'm getting sidetracked. So, here's the game plan (not that it's anything remarkable): a bool variable protected by a mutex, and a condition for threads to wait on. Threads waiting on the event will first check the bool, and if it's not set, wait on the condition. Threads setting the event will check if the event is false, and if so, set it and signal as many threads as are appropriate (1 for auto-reset events, all for manual-reset events). As an optimization to prevent the signal if no threads are waiting, we tack on a waiting-threads counter. And of course some crap to convert the relative time to wake to the absolute time to wake. The code:

class CEvent // POSIX version of event: condition variable (fast?)
{
private:
bool m_bAutoReset; // Constant

pthread_mutex_t m_mutex; // Mutex
pthread_cond_t m_cond; // Condition variable

// Protected by m_mutex
bool m_bSet; // Set or clear

unsigned int m_nWaitingThreads; // Number of threads waiting on the event

// As the name implies, this must be called inside the mutex
// Does the wait. The parameter specifies when the thread should wake up, should the event not get set before then. If this is NULL, the thread will wait indefinitely on the event. Returns whether the event got set (if not, the timeout must have expired).
bool InnerWait(const timespec *abstime)
{
if (!m_bSet)
{
m_nWaitingThreads++;

int nRetVal; // Declared outside the loop so the while condition can test it

do
{
// Do the wait, either timed or indefinite
if (abstime)
nRetVal = pthread_cond_timedwait(&m_cond, &m_mutex, abstime);
else
nRetVal = pthread_cond_wait(&m_cond, &m_mutex);

assert(nRetVal == 0 || nRetVal == ETIMEDOUT);
} while (!m_bSet && nRetVal != ETIMEDOUT); // Loop until the event gets set or the timeout expires

m_nWaitingThreads--;
}

// Did the event get set?
bool bSuccess = m_bSet;
// If the event is set and it's an auto-reset event, reset it now that we're awake
if (m_bSet && m_bAutoReset)
m_bSet = false;

return bSuccess;
}

public:
CEvent(bool bAutoReset, bool bSet)
{
if (pthread_mutex_init(&m_mutex, NULL) != 0)
throw std::bad_alloc("Unable to create mutex");
else if (pthread_cond_init(&m_cond, NULL) != 0)
{
pthread_mutex_destroy(&m_mutex);

throw std::bad_alloc("Unable to create condition variable");
}

m_bAutoReset = bAutoReset;
m_bSet = bSet;

m_nWaitingThreads = 0;
}

inline ~CEvent()
{
pthread_cond_destroy(&m_cond);
pthread_mutex_destroy(&m_mutex);
}

void Set()
{
pthread_mutex_lock(&m_mutex);

if (!m_bSet) // If it's already set, do nothing
{
m_bSet = true; // Set the event

// Check if there are any waiters, and release them appropriately
if (m_nWaitingThreads)
{
if (m_bAutoReset)
pthread_cond_signal(&m_cond); // Release one thread
else
pthread_cond_broadcast(&m_cond); // Release all threads
}
}

pthread_mutex_unlock(&m_mutex);
}

inline void Reset()
{
pthread_mutex_lock(&m_mutex);

m_bSet = false; // Ding

pthread_mutex_unlock(&m_mutex);
}

bool Wait(unsigned int nTimeoutMS)
{
// Calculate the time to wake based on the time to sleep. I hope I understand how this is supposed to work on POSIX.
timespec now, timeout, later;

clock_gettime(CLOCK_REALTIME, &now);

timeout.tv_sec = nTimeoutMS / 1000; // Seconds
timeout.tv_nsec = (nTimeoutMS % 1000) * 1000000L; // Nanoseconds

later.tv_sec = now.tv_sec + timeout.tv_sec;
later.tv_nsec = now.tv_nsec + timeout.tv_nsec;
if (later.tv_nsec >= 1000000000L)
{
later.tv_nsec -= 1000000000L;
later.tv_sec++;
}

pthread_mutex_lock(&m_mutex);

bool bSuccess = InnerWait(&later);

pthread_mutex_unlock(&m_mutex);

return bSuccess;
}

inline void Wait()
{
pthread_mutex_lock(&m_mutex);

InnerWait(NULL);

pthread_mutex_unlock(&m_mutex);
}
};

Events - Windows Version

As with semaphores, Windows supports events directly. In fact, the interface for using events is very similar to using semaphores; you create them with CreateEvent, close them with CloseHandle, wait on them with WaitForSingleObject, set them with SetEvent, and reset them with ResetEvent. Intuitive, no?

The code, which does not particularly need any further explanation:


class CEvent // Win32 version of CEvent (slow)
{
private:
HANDLE m_hEvent;

inline HANDLE GetEvent()
{
return m_hEvent;
}

public:
// bAutoReset controls how the event reacts when it gets set. An auto-reset event will automatically go back to the unset state after one thread gets released from waiting because of the event being set. A non-auto-reset event will stay set until reset explicitly. CEvent throws bad_alloc if it cannot create the event.
inline CEvent(bool bAutoReset, bool bSet)
{
m_hEvent = CreateEvent(NULL, !bAutoReset, bSet, NULL);
if (!m_hEvent)
throw std::bad_alloc("Unable to create event");
}

inline ~CEvent()
{
verify(CloseHandle(m_hEvent));
}

inline void Set()
{
verify(SetEvent(m_hEvent));
}

inline void Reset()
{
verify(ResetEvent(m_hEvent));
}

// Returns true if the event was set, false if the timeout expired
inline bool Wait(unsigned int nTimeoutMS)
{
DWORD nRetVal = WaitForSingleObject(m_hEvent, nTimeoutMS);
assert(nRetVal != WAIT_FAILED);

if (nRetVal == WAIT_TIMEOUT)
return false;

return true;
}

// Waits indefinitely for the event to be set
inline void Wait()
{
verify(Wait(INFINITE));
}
};

Ding!

Just finished my last final, and no school over the summer. Hopefully I'll have more time to work on the blog and on coding. If not, at least I'll have lots more time to play World of Warcraft :P

Wednesday, May 25, 2005

Events - Introduction

Okay, so we've got our semaphore class done. Turns out, that'll be the easiest of the thread-synchronization classes, as it's the only one that is supported directly on both platforms. Thus, we're gonna have to start doing some emulating. Depending on your personality, this may be where things get interesting.

The next class we're going to write is an encapsulation of an event. An event is basically a boolean that threads can wait on - it is either set or cleared (if you're familiar with POSIX programming, an event is very similar to a condition variable, but not exactly the same). When the event gets set, threads waiting on the event get released.

Now, there are two types of events: manual-reset and auto-reset. A manual-reset event functions exactly like a boolean: you set it, and it remains set until you clear it, and vice versa. All threads that are waiting when the event gets set will be released, and any threads attempting to wait on a set event will not wait.

An auto-reset event is actually more useful. An auto-reset event will remain cleared until it gets set, but once set, will automatically get cleared after exactly one thread is released from waiting. If a thread is waiting when the event gets set, that thread will immediately get released, and the event will remain cleared; but if no threads are waiting when the event gets set, the event will remain set until a thread tries to wait on it, at which time the thread will not wait, and the event will get cleared. Really, you can imagine an event as being a semaphore whose value never goes above 1, no matter how many times it is posted; alternatively, you could consider it a thread queue.
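As a quick illustration of the auto-reset flavor, here's a tiny sketch using the CEvent class from the posts above. The work-queue scenario and thread functions are made up for the example.

CEvent workAvailable(true, false); // Auto-reset, initially clear

void ProducerThread()
{
    // ... push an item onto a (separately protected) queue ...
    workAvailable.Set(); // Wakes exactly one waiting worker
}

void WorkerThread()
{
    for (;;)
    {
        workAvailable.Wait(); // Released one thread at a time as items arrive
        // ... pop and process any queued items ...
    }
}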

Semaphore - Part 3

And lastly, the POSIX semaphore version. As I mentioned last post, I decided to use XSI-style semaphores, rather than the newer semaphores option. Since everything is supported by the OS, all we need to do is wrap the functions; this will be very straightforward. The class:

class CSemaphore // POSIX version of CSemaphore: XSI semaphore (slow)
{
private:
int m_iSem; // Semaphore index

inline int &GetSemaphore()
{
return m_iSem;
}

public:
inline CSemaphore(long nInitialCount)
{
assert(nInitialCount >= 0);

semun ctl;
ctl.val = nInitialCount;
assert(ctl.val == nInitialCount); // Make sure there wasn't any truncation

// Create the semaphore (semget returns -1 on failure)
m_iSem = semget(IPC_PRIVATE, 1, S_IRUSR | S_IWUSR);
if (m_iSem < 0)
throw std::bad_alloc("Unable to create semaphore");

// Initialize the semaphore count
verify(semctl(m_iSem, 0, SETVAL, ctl) == 0);
}
inline ~CSemaphore()
{
verify(semctl(m_iSem, 0, IPC_RMID) == 0);
}

inline bool TryWait()
{
sembuf op;

op.sem_num = 0;
op.sem_op = -1;
op.sem_flg = IPC_NOWAIT; // Don't wait if we can't lock it now

return (semop(m_iSem, &op, 1) == 0);
}

inline void Wait()
{
sembuf op;

op.sem_num = 0;
op.sem_op = -1;
op.sem_flg = 0; // Wait if necessary

verify(semop(m_iSem, &op, 1) == 0);
}

inline void Post(long nCount = 1)
{
assert(nCount > 0);

sembuf op;

op.sem_num = 0;
op.sem_op = nCount;
op.sem_flg = 0;

assert(op.sem_op == nCount); // Make sure there wasn't any truncation
verify(semop(m_iSem, &op, 1) == 0);
}
};
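One portability note on the class above: the Single UNIX Specification requires the application, not the system headers, to declare union semun itself, so on many systems the code also needs a declaration along these lines (field names per the spec):

union semun
{
    int val;               // Value for SETVAL
    struct semid_ds *buf;  // Buffer for IPC_STAT, IPC_SET
    unsigned short *array; // Array for GETALL, SETALL
};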

Monday, May 23, 2005

Semaphore - Part 2

Okay, so now that we know what a semaphore is, how do we implement it? Well, pretty uninterestingly, actually. As mentioned previously, both Windows and POSIX have semaphore APIs right out of the box (well, POSIX with the semaphores extension). So this won't be much work at all.

First of all, let me explain one design decision common to both the Windows and the POSIX implementations. The semaphore class does not allow the use of named (inter-process) semaphores, even though both Windows and POSIX support them. The reason for this is that, for the moment, the LibQ functions are intended to be process-local. While many classes are designed to work for thread synchronization, they are not designed for inter-process synchronization. I might go back later and add some classes for inter-process communication, but not now.

Now, Windows implements semaphores as kernel objects (HANDLEs). We create the semaphore with CreateSemaphore, destroy it with CloseHandle, increment it with ReleaseSemaphore, and wait on it with the ubiquitous WaitForSingleObject.

Looking at the documentation for those functions, you can see that I left one thing out of the semaphore class: a maximum value limit. The reason for this is that, to my knowledge, the POSIX semaphore API doesn't support this. Since LibQ is supposed to function identically on Windows and POSIX, that means we can't use it.

You'll also notice that I use WaitForSingleObject, rather than an alertable call to WaitForSingleObjectEx. The reason for this is that I thought it was best to not deal with the complications of alertable waits, since we'd just have to hide it from the application anyway, and we may or may not be able to guarantee similar behavior on POSIX. There is actually another reason for this, which is due to planning for asynchronous file access, but that's getting a bit ahead of the blog. We'll come back to that.

Looking at the class itself, you'll notice there's a private GetSemaphore function which returns the HANDLE. The reason this is private is that it's something programs should NOT be using, since it's platform-dependent. The reason it exists at all is so that other LibQ classes can use it internally in the Windows version (in some places this is necessary).

You'll notice that I use inlining profusely. I intended LibQ classes to be as lightweight as possible. Given how short these functions all are, inlining was the natural choice. In fact, a sizeable number of the functions in all of the LibQ classes are inline, to avoid the overhead of a function call.

Lastly, you can see that I use bad_alloc to indicate the inability to create the semaphore. I chose to use this because it's convenient, and there's no need to go creating exception classes for every error that could possibly occur in the constructors of any of these classes. And besides that, bad_alloc does fit the description, roughly.

Without further ado, the Windows version of CSemaphore (sorry, I'm not familiar with posting code online; it lost all my indentations):



class CSemaphore // Win32 version of CSemaphore (slow)
{
private:
HANDLE m_hSem; // The semaphore

// Internal function to be used by LibQ itself
inline HANDLE GetSemaphore()
{ return m_hSem; }

public:
// nInitialCount is the initial semaphore count. Must be nonnegative
inline CSemaphore(long nInitialCount)
{
assert(nInitialCount >= 0);

m_hSem = CreateSemaphore(NULL, nInitialCount, LONG_MAX, NULL);
if (!m_hSem)
throw std::bad_alloc("Unable to create semaphore");
}

inline ~CSemaphore()
{
verify(CloseHandle(m_hSem));
}

// Try to get the semaphore without waiting. Fails if can't.
inline bool TryWait()
{
DWORD nRetVal = WaitForSingleObject(m_hSem, 0);
assert(nRetVal == WAIT_OBJECT_0 || nRetVal == WAIT_TIMEOUT);

if (nRetVal == WAIT_TIMEOUT)
return false;

return true;
}

// Waits indefinitely for the semaphore
inline void Wait()
{
verify(WaitForSingleObject(m_hSem, INFINITE) == WAIT_OBJECT_0);
}

// Increases the semaphore's count by nCount. Usually the default of 1 is desired. This count must be positive.
inline void Post(long nCount = 1)
{
assert(nCount > 0);

verify(ReleaseSemaphore(m_hSem, nCount, NULL));
}
};


UPDATE: I've been thinking it over, and I've decided to remove the timed wait function. The reason is that I've decided to change the POSIX version of CSemaphore to use XSI semaphores, rather than the Semaphores-option semaphores; XSI semaphores do not support timed waits. The reason for that change is that Semaphores-option semaphores don't support posting more than one count at a time. Because each post requires a transition to kernel mode, this could be a serious performance issue if it's common to release many threads at once (which I think will matter more than timed waits). So now I need to rewrite the POSIX version of CSemaphore :P

Also, I added the macro verify(x). This is similar to assert, but in non-debug builds it gets substituted with (x), so that the expression still gets evaluated. I'll be using that lots in the future.
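For reference, a minimal sketch of what that macro might look like (LibQ's actual definition may differ):

#include <assert.h>

#ifdef NDEBUG
// Release build: keep evaluating the expression, just discard the result
#define verify(x) ((void)(x))
#else
// Debug build: behave exactly like assert
#define verify(x) assert(x)
#endif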

Sunday, May 22, 2005

Semaphores - Part 1

Okay, let's start working on the more interesting stuff. Well, kind of. The semaphore is the only one of the thread synchronization objects that's supported on both Windows and POSIX (at least that we'll be using). That means that this class will be fairly uninteresting, as it just wraps OS functionality.

So, what is a semaphore? Well, it's kind of hard to describe, which is partly why I took so long to get around to writing this (although I'd be lying if I said laziness didn't play a part). Conceptually, you can think of a semaphore as a thread gatekeeper. Threads line up to get access to some resource, and the semaphore either lets some go (without altering the line order - first come, first served) or makes them wait. In effect, it's like a thread waiting queue.

Programmatically, a semaphore is a counter. This counter represents the number of threads that may yet pass through the semaphore without waiting. When it's at 0 or lower, threads will line up and wait their turn.

There are two semaphore operations: wait and post. Wait decrements the semaphore counter; if the result of the operation is negative (in other words, the counter was 0 or negative to begin with), the thread will wait its turn. Post does just the opposite: increases the counter by 1 or more, allowing that many threads to proceed. If there are fewer threads waiting than the number posted, all waiting threads will be released, and the counter will end up being nonnegative.
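To make that concrete, here's a tiny sketch of the classic use: limiting how many threads touch a scarce resource at once, using the CSemaphore class wrapped in Parts 2 and 3. The connection-pool scenario and names are just for the example.

CSemaphore connectionPool(3); // At most 3 threads may hold a connection at once

void DoQuery()
{
    connectionPool.Wait(); // Counter above 0: pass right through; otherwise get in line

    // ... use one of the 3 database connections ...

    connectionPool.Post(); // Give the slot back, releasing one waiter (if any)
}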

Friday, May 20, 2005

Mini-Status Report

Just to mention what is already coded, although not blogged (I've been coding for a week or two longer than I've been blogging) in the project so far:
- CSyncFile
- IAllocator
- CHeap
- CEvent
- CFastEvent
- CSemaphore
- CFastSemaphore
- CMutex
- CMutexLock
- CCondition
- CReadWriteLock
- CThreadLocalValue
- CThread

Atomic Functions - Part 4

And finally, for icing on the cake, the Mac version (not that it's ever been compiled and tested on a Mac...), with self-documenting comments:

asm
long // <- Initial value of *addr (r3).
AtomicCompareExchange(
long volatile *addr, // -> Location to update (r3).
long next, // -> New value (r4).
long prev) // -> Previous value (r5).
{
retry:
lwarx r6, 0, r3 // current = *addr;
cmpw r6, r5 // if( current != prev )
bne fail // goto fail;
stwcx. r4, 0, r3 // if( reservation == *addr ) *addr = next;
bne- retry // else goto retry;
mr r3, r6 // Return current.
blr // We're outta here.
fail:
stwcx. r6, 0, r3 // Clear reservation.
mr r3, r6 // Return current.
blr // We're outta here.
}

Note that this was taken from IBM's site; I modified it a little bit to match my own applications.

You might notice that the rest of the atomic functions aren't here. I left them out in the hopes of avoiding further pages of boring code, when they're all essentially the same as the AtomicSignedExchangeAdd in the last post (they all use AtomicCompareExchange exclusively).
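To give the flavor, here's roughly what one of those emulated functions looks like when built on AtomicCompareExchange alone - a sketch of the pattern, not necessarily the exact LibQ code:

// Adds nAddend to *lpnNumber and returns the old value, using only
// AtomicCompareExchange: read, compute, try to swap, retry on interference
long AtomicExchangeAdd(long volatile *lpnNumber, long nAddend)
{
    long nValue;

    do
    {
        nValue = *lpnNumber;
    } while (AtomicCompareExchange(lpnNumber, nValue + nAddend, nValue) != nValue);

    return nValue;
}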

All this inline assembly is pretty boring. Things should get more interesting once I get into the classes, how they work, and how they may be emulated on some platforms.

Atomic Functions - Part 3

Oh yeah, I forgot one function: AtomicSignedExchangeAdd. While this is admittedly a pretty odd function, it's quite important, as it's the first place you get to see how practical use of atomic functions works (at least on this blog).

// This could probably be made much faster with an inline assembly version
long AtomicSignedExchangeAdd(long volatile *lpnDest, long nAddend, bool bSigned)
{
assert(lpnDest);

long nValue;

// The basic principle of atomic access on RISC CPUs, and the principle of doing complex atomic operations at all: read, modify, try-write, repeat until we can pull it off without somebody screwing with it.
do
{
// Read the destination value and sign
nValue = *lpnDest;
bool bValueSign = (nValue < 0);

// Check if the sign matches the desired sign. If not, fail - we don't want to do the addition if the sign isn't correct.
if (bValueSign != bSigned)
break;

// Try to change the value from what it was to what we want (the specified value added). If the value has been changed since we read it, loop back and read it again.
} while (AtomicCompareExchange(lpnDest, nValue + nAddend, nValue) != nValue);

return nValue;
}

The comments weren't in the original code. I just added them in the hopes of making it more understandable than if I commented about how it works here :P

Atomic Functions - Part 2

Okay, let's write these functions. Starting with the first, a quick look at the Intel IA-32 manual reveals how it's done:
XADD—Exchange and Add
Exchanges the first operand (destination operand) with the second operand (source operand),
then loads the sum of the two values into the destination operand. The destination operand can
be a register or a memory location; the source operand is a register.
This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically.

That makes things pretty straightforward:
long AtomicExchangeAdd(long volatile *lpnNumber, long nAddend)
{
assert(lpnNumber);

__asm
{
push ebx

mov eax, nAddend
mov ebx, lpnNumber

lock xadd [ebx], eax

pop ebx
};
}

On to XCHG:
Exchanges the contents of the destination (first) and source (second) operands. The operands
can be two general-purpose registers or a register and a memory location. If a memory operand
is referenced, the processor’s locking protocol is automatically implemented for the duration of
the exchange operation, regardless of the presence or absence of the LOCK prefix or of the value
of the IOPL. (See the LOCK prefix description in this chapter for more information on the
locking protocol.)
This instruction is useful for implementing semaphores or similar data structures for process
synchronization. (See “Bus Locking” in Chapter 7 of the IA-32 Intel Architecture Software
Developer’s Manual, Volume 3, for more information on bus locking.)
The XCHG instruction can also be used instead of the BSWAP instruction for 16-bit operands.

long AtomicExchange(long volatile *lpnDest, long nSource)
{
assert(lpnDest);

__asm
{
push ebx

mov eax, nSource
mov ebx, lpnDest

lock xchg [ebx], eax

pop ebx
};
}

CMPXCHG:
Compares the value in the AL, AX, or EAX register (depending on the size of the operand) with
the first operand (destination operand). If the two values are equal, the second operand (source
operand) is loaded into the destination operand. Otherwise, the destination operand is loaded
into the AL, AX, or EAX register.
This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically.
To simplify the interface to the processor’s bus, the destination operand receives a write
cycle without regard to the result of the comparison. The destination operand is written back if
the comparison fails; otherwise, the source operand is written into the destination. (The
processor never produces a locked read without also producing a locked write.)

long AtomicCompareExchange(long volatile *lpnDest, long nSource, long nComparand)
{
assert(lpnDest);

__asm
{
push ebx
push edx

mov eax, nComparand
mov ebx, lpnDest
mov edx, nSource

lock cmpxchg [ebx], edx

pop edx
pop ebx
};
}

BTS:
Selects the bit in a bit string (specified with the first operand, called the bit base) at the bit position designated by the bit offset operand (second operand), stores the value of the bit in the CF flag, and sets the selected bit in the bit string to 1. The bit base operand can be a register or a memory location; the bit offset operand can be a register or an immediate value. If the bit base operand specifies a register, the instruction takes the modulo 16 or 32 (depending on the register size) of the bit offset operand, allowing any bit position to be selected in a 16- or 32-bit register, respectively (see Figure 3-1). If the bit base operand specifies a memory location, it represents the address of the byte in memory that contains the bit base (bit 0 of the specified byte) of the bit string (see Figure 3-2). The offset operand then selects a bit position within the range −2^31 to 2^31 − 1 for a register offset and 0 to 31 for an immediate offset.
Some assemblers support immediate bit offsets larger than 31 by using the immediate bit offset field in combination with the displacement field of the memory operand. See "BT—Bit Test" in this chapter for more information on this addressing mechanism.
This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically.

BTR:
Selects the bit in a bit string (specified with the first operand, called the bit base) at the bit position designated by the bit offset operand (second operand), stores the value of the bit in the CF flag, and clears the selected bit in the bit string to 0. The bit base operand can be a register or a memory location; the bit offset operand can be a register or an immediate value. If the bit base operand specifies a register, the instruction takes the modulo 16 or 32 (depending on the register size) of the bit offset operand, allowing any bit position to be selected in a 16- or 32-bit register, respectively (see Figure 3-1). If the bit base operand specifies a memory location, it represents the address of the byte in memory that contains the bit base (bit 0 of the specified byte) of the bit string (see Figure 3-2). The offset operand then selects a bit position within the range −2^31 to 2^31 − 1 for a register offset and 0 to 31 for an immediate offset.
Some assemblers support immediate bit offsets larger than 31 by using the immediate bit offset field in combination with the displacement field of the memory operand. See "BT—Bit Test" in this chapter for more information on this addressing mechanism.
This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically.

long AtomicBitExchange(long volatile *lpnDest, long nBit, long bSet)
{
assert(lpnDest);
assert(nBit >= 0 && nBit < 32);

__asm
{
push ebx
push edx

mov eax, bSet
mov ebx, lpnDest
mov edx, nBit

test eax, eax // Test if bSet is true or false
jz ClearBit

SetBit:
lock bts [ebx], edx
jmp Done

ClearBit:
lock btr [ebx], edx

Done:
setc al
movzx eax, al

pop edx
pop ebx
};
}

BTC:
Selects the bit in a bit string (specified with the first operand, called the bit base) at the bit position designated by the bit offset operand (second operand), stores the value of the bit in the CF flag, and complements the selected bit in the bit string. The bit base operand can be a register or a memory location; the bit offset operand can be a register or an immediate value. If the bit base operand specifies a register, the instruction takes the modulo 16 or 32 (depending on the register size) of the bit offset operand, allowing any bit position to be selected in a 16- or 32-bit register, respectively (see Figure 3-1). If the bit base operand specifies a memory location, it represents the address of the byte in memory that contains the bit base (bit 0 of the specified byte) of the bit string (see Figure 3-2). The offset operand then selects a bit position within the range −2^31 to 2^31 − 1 for a register offset and 0 to 31 for an immediate offset.
Some assemblers support immediate bit offsets larger than 31 by using the immediate bit offset field in combination with the displacement field of the memory operand. See "BT—Bit Test" in this chapter for more information on this addressing mechanism.
This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically.

long AtomicBitExchangeCompliment(long volatile *lpnDest, long iBit)
{
assert(lpnDest);
assert(iBit >= 0 && iBit < 32);

__asm
{
push ebx

mov eax, iBit
mov ebx, lpnDest

lock btc [ebx], eax

setc al
movzx eax, al

pop ebx
};
}

So lots and lots of text, but pretty trivial.

Atomic Functions - Part 1

Okay, so. As one of the primary uses of LibQ will be to support threads and thread synchronization across the Windows and POSIX platforms (I'm particularly interested in supporting Linux and OS X, but I'd be happy if it worked elsewhere as well), and you can't do fast thread synchronization without atomic functions, that was the obvious place to start work on LibQ.

As I was already well aware of the interlocked functions on Windows (InterlockedCompareExchange, etc.), I started out by looking at the POSIX standard. Well, unless I'm missing something, there's nothing there for this purpose. So, that means we have to write it, ourselves.

This actually isn't as redundant as it may sound at first. Despite the fact that Windows has the interlocked series of functions, we can't really use them. A look at the requirements for InterlockedCompareExchange reveals the following:
Client: Included in Windows XP, Windows 2000 Professional, Windows NT Workstation 4.0, Windows Me, and Windows 98.
Server: Included in Windows Server 2003, Windows 2000 Server, and Windows NT Server 4.0.

No Windows 95, no NT pre-4.0. While I'm not so concerned about the lack of NT pre-4.0, I am concerned about the lack of Windows 95 support. No matter; it's not difficult to write the functions ourselves. Unfortunately, since we're going to be using assembly, we'll need a version of the functions for each architecture we want to support. For now, I'll let this slide - I can write the x86 version myself.

A quick look through the Intel x86 instruction reference shows us our palette of instructions (we don't really need to worry about using instructions other processors might not have, because most other processors are RISC, and operate in a more generic way). We have xchg (exchange), xadd (exchange and add), cmpxchg (compare and exchange), bts (bit test and set), btr (bit test and reset), btc (bit test and complement), and lock (lock the bus for the duration of the instruction). These look like plenty to build a robust atomic operations system (although in reality, all we absolutely need is the compare exchange and the bus lock - everything else can be emulated using those; this will become important later).

So, the list of prototypes I whipped up look like this:
// Adds nAddend to lpnNumber and returns old value of lpnNumber
long AtomicExchangeAdd(long volatile *lpnNumber, long nAddend);
// Sets lpnDest to nSource, and returns old value of lpnDest
long AtomicExchange(long volatile *lpnDest, long nSource);
// Sets lpnDest to nSource if lpnDest equals nComparand, and returns old value of lpnDest
long AtomicCompareExchange(long volatile *lpnDest, long nSource, long nComparand);

// Sets bit nBit in lpnDest to bSet, and returns old value of the bit (nonzero if set)
long AtomicBitExchange(long volatile *lpnDest, long nBit, long bSet);
// Toggles bit nBit of lpnDest, and returns old value of the bit (nonzero if set)
long AtomicBitExchangeCompliment(long volatile *lpnDest, long iBit);
// Adds nAddend to lpnDest if lpnDest's sign (high bit) is bSigned, and returns old value of lpnDest
long AtomicSignedExchangeAdd(long volatile *lpnDest, long nAddend, bool bSigned);

(the last one is a special purpose emulated function; we'll see what uses it has later)
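As a trivial example of what these are for - a counter that several threads bump without any locking. The function names match the prototypes above; the statistics scenario is made up.

static long g_nRequestsServed = 0; // Shared across threads

void OnRequestServed()
{
    // Safe even if many threads call this at once; no mutex needed
    AtomicExchangeAdd(&g_nRequestsServed, 1);
}

long GetAndResetCount()
{
    // Atomically grab the current total and zero it in one step
    return AtomicExchange(&g_nRequestsServed, 0);
}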

Q & Stuff

Okay, so, it's Friday. Not only that, I missed my one class today after driving 45 minutes each way on the freeway, I didn't sleep any last night, and I'm bored. Gosh, I guess I'll make a blog.

Oh, but what shall it be about, and more importantly, what shall I call it? Well, I've been working on LibQ - a lightweight, cross-platform class library of very commonly used but platform-specific features - for the last couple of weeks, so maybe I could do a design journal for that. Maybe I'll call it the LibQ Developer's Journal. But wait, I might want to post other inanity on it too, so that's no good. Maybe The Daily Flamebait? No, that's no good either; I don't plan on posting daily, nor do I intend the majority of stuff to be rantish/flamebait material (unlike my posting habits on Slashdot). Hmm, maybe I'll just call it Q & Stuff; that'd be more than generic enough for anything I'd care to post. Okay, I guess that settles it.

More inanity - and maybe, if you're really lucky, something useful - to come, soon (hopefully).