Monday, July 18, 2005

Depravity

So I had an evil, perverted urge to learn MMX assembly (I would learn SSE2, but I don't have that on my computer), yet nothing to use MMX for. Then I got the idea to write a function to convert a string from lower case to upper case using MMX (if the length is known in advance to be a multiple of 8 bytes, MMX can be used to calculate all 8 bytes at once). The general idea is this (since obviously branches are out in SIMD programming):

1. Read a block of chars (8 chars - the size of an MMX register)
2. Generate comparison mask of chars that are greater than or equal to 'a' (0x61)
3. Generate comparison mask of chars that are less than or equal to 'z' (0x7a)
4. AND masks to get a mask where each byte is 0xFF if the char is lowercase, otherwise 0
5. AND that mask with a vector of all 0x20 bytes ('a' - 'A') to form a subtraction mask, where bytes corresponding to lower case chars are 0x20, otherwise 0
6. Subtract that mask from the block of chars
7. Write block of chars back
8. Rinse and repeat

This sounds like a major pain, and really it is. But hopefully it'll be faster than traditional char-by-char conversion (and if nothing else, it's an introduction to MMX assembly).

But before we get into the actual function, we need something to compare it to. The following is a simple branchless function I just wrote; it takes 13 cycles per char on a Pentium 4, plus the memory accesses, which will generally be cached:

push ebx
xchg ecx, edx ; ecx is now the length, edx is the string

test ecx, ecx
je Done ; nothing to do for an empty string

Begin:
mov al, [edx] ; load one char
cmp al, 'a'
setl bl ; bl = 1 if the char is below 'a'
cmp al, 'z'
setg bh ; bh = 1 if the char is above 'z'
or bl, bh ; bl = 1 if the char is NOT lowercase
sub bl, 1 ; bl = 0 if not lowercase, 0FFh if lowercase
and bl, 20h ; bl = 20h ('a' - 'A') only for lowercase chars
sub al, bl ; lowercase chars become uppercase; others are unchanged
mov [edx], al ; write the char back

add edx, 1
sub ecx, 1
jne Begin

Done:
pop ebx
ret

This process could be made faster, under certain circumstances, by using a simple lookup table; a lookup-table version (sketched below) takes about 6 cycles per char. The problem is that it also produces many memory accesses that will take dozens of cycles each if uncached. Hence the particular circumstance where this is beneficial: when the function is called so frequently that the table stays entirely in the processor's data cache. If this condition is not met, the function will be significantly slower than the previous nonbranching function.
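For reference, here's a minimal sketch of what such a lookup-table version might look like in C++ (the table construction is my own illustration; it's the idea that matters, not these exact names):

// 256-entry translation table: lowercase chars map to their uppercase
// equivalents, everything else maps to itself
unsigned char g_upperTable[256];

void InitUpperTable()
{
    for (int i = 0; i < 256; i++)
        g_upperTable[i] = (i >= 'a' && i <= 'z') ? (unsigned char)(i - 0x20) : (unsigned char)i;
}

void ToUpperTable(char *lpsz, unsigned int nLength)
{
    while (nLength--)
    {
        *lpsz = (char)g_upperTable[(unsigned char)*lpsz];
        lpsz++;
    }
}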

Now, the MMX version. This function is more than a little cryptic because I did some optimizations on it. One bottleneck in this function is that movq instructions (move quadword - 64 bits) are very slow - they take about 6 cycles to complete (although other instructions may execute after 1 cycle, provided they don't use the result of the movq instruction). All other instructions used here take 2 cycles, but another instruction may execute after 1 cycle, provided it does not depend on the results of the previous instruction.

The biggest oddity in how I wrote this function, however, is due to the fact that there are no instructions to load an MMX register with an immediate, and memory access is slow. Yet we need 3 masks: 1 of 0x20 bytes ('a' - 'A'), 1 of 0x60 bytes ('a' - 1), and 1 of 0x7b bytes ('z' + 1), which correspond to the compares we're going to make. What I decided to do (because it was by far the fastest solution I could come up with) was to load a 16-bit value from a normal register into an MMX register, then use the shuffle instruction (pshufw) to spam that same value into the other 48 bits of the register. This is much faster than loading three MMX registers from memory, given that we can load 2 16-bit values at once with movd (move doubleword - 32 bits), and even that only takes 2 cycles.

push ebx
test edx, edx ; edx is the number of 8-byte blocks on entry
xchg ecx, edx ; ecx is now number of blocks, edx is the string
je Done ; nothing to do for zero blocks

mov eax, 7b7b6060h ; low word contains the lower bounds mask, high word contains the upper bounds mask
mov ebx, 20202020h ; the subtraction mask

Begin:
movq mm1, [edx] ; the packed characters
movd mm2, eax
movd mm3, ebx
movq mm0, mm1 ; keep a second copy of the characters
pshufw mm4, mm2, 00000000b ; the lower bounds mask ('a' - 1 in every byte)
pshufw mm3, mm3, 00000000b ; the subtraction mask (20h in every byte)
pshufw mm2, mm2, 01010101b ; the upper bounds mask ('z' + 1 in every byte)
pcmpgtb mm1, mm4 ; 0FFh in each byte where char > 'a' - 1
pcmpgtb mm2, mm0 ; 0FFh in each byte where 'z' + 1 > char
pand mm3, mm1 ; combine the two conditions...
pand mm3, mm2 ; ...leaving 20h only in the lowercase bytes
psubb mm0, mm3 ; convert lowercase chars to uppercase
movq [edx], mm0 ; write the block back

add edx, 8
sub ecx, 1
jnz Begin

Done:
emms ; clear the MMX state before any floating-point code runs
pop ebx
ret

As you can see, once we've computed the two conditional masks, we reach a bottleneck - we can no longer shove new instructions into the processor every cycle. Still, the overall time is pretty nice. The non-MMX instructions before the loop are negligible, and the instructions inside the loop take about 23 cycles + 1 memory access (and these accesses will be cached a majority of the time, so the delay is usually small). That's about 3 cycles and 1/8 of a memory access per char, which is about twice as fast as the lookup table version (in ideal conditions) and four times as fast as the nonbranching non-MMX version.

So, was the speed increase worth the effort? Probably not, but I certainly learned some stuff in the process, and that's worth the effort.

Oh, and one last thing: all the information I used on the speed of instructions is based on the Intel Pentium 4 optimization manual. Other processors may be faster or slower for each instruction. This means that the way I've got the instructions ordered for maximum speed may not be optimal on non-P4 processors (like mine).
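For anyone who'd rather let the compiler do the instruction scheduling, here's a rough C++ equivalent of the routine above, using the MMX compiler intrinsics from mmintrin.h. This is my own sketch of the same technique, not the original hand-tuned code:

#include <mmintrin.h> // MMX intrinsics

// nBlocks is the string length in 8-byte blocks, as in the assembly version
void ToUpperMMX(char *lpsz, unsigned int nBlocks)
{
    const __m64 lowerBound = _mm_set1_pi8(0x60); // 'a' - 1 in every byte
    const __m64 upperBound = _mm_set1_pi8(0x7b); // 'z' + 1 in every byte
    const __m64 subMask = _mm_set1_pi8(0x20);    // 'a' - 'A' in every byte

    __m64 *lpBlock = (__m64 *)lpsz;

    while (nBlocks--)
    {
        __m64 chars = *lpBlock;

        // 0FFh in each byte that is both > 'a' - 1 and < 'z' + 1
        __m64 isLower = _mm_and_si64(
            _mm_cmpgt_pi8(chars, lowerBound),
            _mm_cmpgt_pi8(upperBound, chars));

        // Subtract 20h from the lowercase bytes only
        *lpBlock++ = _mm_sub_pi8(chars, _mm_and_si64(isLower, subMask));
    }

    _mm_empty(); // emms - clear the MMX state
}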

Sunday, July 17, 2005

The Art of Breaking and Entering - Windows Hooks

Okay, so now we know how to inject code and data into a foreign process on both Windows NT and 9x. But suppose we actually want to, say, RUN the code we injected; what then? Well, there are ways of accomplishing that, too. Three ways, in fact; at least, three ways that are suitable for general use.

The first method we'll examine is "the easy way": windows hooks. Note the absence of capitalization: we're talking about windows on screen, not Windows the OS. A window hook is a function that gets called in some circumstance involving a window. A hook function could be called when the program gets a message from its message queue, when the window's message handler gets called, when a key on the keyboard gets pressed, or a variety of other times, depending on the type of hook you set. In any case, you set the hook using SetWindowsHookEx, and remove the hook using UnhookWindowsHookEx.

So, how does this help us? Well, as it turns out, windows hooks are executed in the process that owns the hooked window. Due to the protected address spaces for each process, that requires that your code gets loaded into the process that owns the window. Windows takes responsibility for injecting the code into the process; in other words, Windows does all the work for you. For this to work, the hook function must be in a DLL, as it is the DLL that Windows loads into the process.

However, this ease of use comes with some significant drawbacks. Most significantly, in order to set a windows hook, there must be a window to hook. This implies two things: before you can inject your DLL, the program must be running, and the program must create a window. If the program doesn't create a window, you're out of luck; similarly, if you need to execute your code before the program starts up, you're also out of luck.

Another noteworthy point, although it can be overcome, is how your code gets executed. When the DLL is first loaded, it receives a DLL_PROCESS_ATTACH notification in DllMain. At this point you can't do much, as many Windows API functions are not safe to call from DllMain, due to how Windows calls it. Not only that, but you don't know whether your DLL is being loaded in the source process (your program) or the target (the program you're invading), as DllMain will get called for both, just the same.

These problems can be circumvented by delaying initialization until the first time the hook function gets called. But at the same time, this adds a new complication: the hook function must get called after you set the hook, before your code can actually get executed.

This can similarly be worked around by setting a hook that you can ensure will be called, such as a get message hook. This hook will be called every time the program snags a message for that thread using GetMessage, which you can ensure will get called by posting a message with PostMessage or PostThreadMessage (SendMessage won't work in this case, because GetMessage does not return messages sent with SendMessage). The usual message to use to accomplish this (although this won't work for all hook types) is WM_NULL, which does nothing, but calls the hook function just the same.

Okay, so that's the basic procedure; however, there are still a couple of details that need to be addressed. To start with, your hook must, after performing whatever it needs to do, call the next hook with CallNextHookEx. This is something that must be done manually (at least for certain types of hooks; ours is one such type). This requires that the target process know the HHOOK that your process got from SetWindowsHookEx; the HHOOK value must be communicated between processes. Fortunately, we already have a way of accomplishing this: named memory mapped files.

The next detail is that your window hook will ONLY get called if the event that is hooked occurs while the hook is in place. In our case, that means that GetMessage must return while the hook is set; if the hook is removed before the window thread gets a chance to run, your hook will never get called (wouldn't it be nice if SendMessage worked for get message hooks?). There are a couple solutions, depending on how lazy you are. The lazy solution would be to simply leave the hook installed indefinitely. However, this would also require that we leave the memory mapped file open indefinitely as well, as we wouldn't know when the target process is done with it.

A better way is to wait until the target process is done with the file mapping. This is, in fact, trivial, given that we already have a file mapping to pass data between the two processes. The classy way would be to create an event to wait on, duplicate the event HANDLE into the target process, and wait for the event to get signalled. I'm too lazy for this, given that there are easier (though less elegant) ways of getting the same result. I use a spin loop; but not just any spin loop - a Sleep-spin loop! In other words, we have a loop that checks for a return value and Sleeps if it's not there. This isn't as efficient as waiting on an event, but it works just as well. Lazy as I am, I decided to use the HHOOK value itself (the copy in the file mapping) as the return value. The target process will read the HHOOK, then set it to NULL before closing the file mapping. Thus, your process needs only to wait for the value to become NULL; once it does, you know the hook function has been called, and the hook can be removed.

The full code (note that I have all of this in the "ThingyDLL" DLL - InjectIntoWindowProcess is a function that is exported for your program to call):
HINSTANCE g_hDLL = NULL;

bool g_bInitialized = false;
HHOOK g_hHook = NULL;

BOOL APIENTRY DllMain(HINSTANCE hModule, DWORD ul_reason_for_call, LPVOID lpReserved)
{
 if (ul_reason_for_call == DLL_PROCESS_ATTACH)
   g_hDLL = (HINSTANCE)hModule;  // Save the HINSTANCE

 return TRUE;
}

// Gets the HHOOK we have on the main window from the file mapping
HHOOK GetHHOOK()
{
 // Open the file mapping - these should not fail for any reasonable reason
 HANDLE hMapping = OpenProcessSection("ThingyDLL");
 assert(hMapping != NULL);

 volatile HHOOK *lphHook = (volatile HHOOK *)MapViewOfFile(hMapping, FILE_MAP_WRITE, 0, 0, 0);
 assert(lphHook);

 // There's a very small, but real, chance that the patching thread will get interrupted after setting the hook, before it can write the HHOOK to the file mapping. Sleep-spin wait for the HHOOK to be set.
 while (*lphHook == NULL)
   Sleep(10);

 // Get the HHOOK
 HHOOK hHook = *lphHook;

 // Signal the injector that we've received the HHOOK
 *lphHook = NULL;

 // Close the mapping
 UnmapViewOfFile((void *)lphHook);
 CloseHandle(hMapping);

 return hHook;
}

LRESULT CALLBACK GetMessageHookProc(int nCode, WPARAM wParam, LPARAM lParam)
{
 // Are we getting called for the first time?
 if (!g_bInitialized)
 {
   // We need to set this RIGHT now, because we're going to indirectly generate messages in our MessageBox call, which would put us in an infinite loop.
   g_bInitialized = true;

   // Get the HHOOK for this hook
   g_hHook = GetHHOOK();

   // Do our stuff. For demonstration, display a message box for the hooked window, to show we're inside the process.

   // lParam in this case is a MSG * for the message being retrieved
   MSG *lpMsg = (MSG *)lParam;

   MessageBox(lpMsg->hwnd, "Hello from the inside!", "GetMessageHookProc", MB_OK | MB_ICONEXCLAMATION);
 }

 // Call the next hook
 return CallNextHookEx(g_hHook, nCode, wParam, lParam);
}

__declspec(dllexport) bool __stdcall InjectIntoWindowProcess(HWND hWnd, DWORD nTimeoutMS)
{
 // Lookup the thread and process ID for the window
 DWORD nProcessID, nThreadID = GetWindowThreadProcessId(hWnd, &nProcessID);
 if (!nThreadID)
   return false;

 // Create the file mapping to share the HHOOK with the patched process
 HANDLE hMapping = CreateProcessSection(sizeof(HHOOK), "ThingyDLL", nProcessID);
 assert(hMapping);

 // Open the file mapping
 volatile HHOOK *lphHook = (volatile HHOOK *)MapViewOfFile(hMapping, FILE_MAP_WRITE, 0, 0, 0);
assert(lphHook);

 *lphHook = NULL;

 bool bSuccess = false;  // Failed until proven otherwise

 // Set the window hook
 HHOOK hHook = SetWindowsHookEx(WH_GETMESSAGE, GetMessageHookProc, g_hDLL, nThreadID);

 if (hHook)
 {
   // Tell patched process the hook
   *lphHook = hHook;

   // Queue a message that will activate the hook function
   if (PostThreadMessage(nThreadID, WM_NULL, 0, 0))
   {
     // Wait for the hook function to reply
      for (DWORD ms = 0; ms < nTimeoutMS; ms += 10)
     {
       if (*lphHook == NULL)
       {
         bSuccess = true;
         break;
       }

       Sleep(10);
     }
   }

   // Release the hook now that the DLL has been injected and been initialized
   UnhookWindowsHookEx(hHook);
 }

 // Close the file mapping
 UnmapViewOfFile((void *)lphHook);
 CloseHandle(hMapping);

 return bSuccess;
}
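Finally, a hypothetical usage example (the window class and timeout are arbitrary placeholders):

// Find a window owned by another process and inject into it, giving the
// hook function five seconds to report back
HWND hWnd = FindWindow("Notepad", NULL);

if (hWnd != NULL && InjectIntoWindowProcess(hWnd, 5000))
    printf("DLL injected and initialized\n");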

Saturday, July 16, 2005

Bravo, VC++ - Updated

So I'm reading a web comic and chatting with friends on MSN Messenger, and Ladik (a.k.a. Ladislav Zezula, of StormLib fame) messages me. He told me that he'd found a compiler bug that causes StormLib to generate invalid data in release - but not debug - builds using VC++ 2003. He sent me the following test code that demonstrates the error he isolated:

#include <stdio.h>
#include <conio.h>

#define MAGIC_VALUE 0x10    // Must be at least 0x10 (?)

void TestFunction(unsigned char * srcbuff3)
{
   unsigned char * pin27CC = srcbuff3 + 2;
   unsigned long x;

   pin27CC++;
   srcbuff3++;

   for(x = 0; x < MAGIC_VALUE; x++)
   {
       pin27CC++;
       srcbuff3++;
       if(*pin27CC != *srcbuff3)
           break;
   }

   printf("Result is: %s\n", pin27CC);
}

int main()
{
    TestFunction((unsigned char *)"123456789");
    getch();

    return 0;
}


A quick compile and execute reveals that what he said is true: the debug build correctly displays "Result is: 56789", while the release build displays "Result is: 456789". So, what went wrong? Well, a look at the assembly generated reveals a number of very strange optimizations:

00401000 mov ecx,dword ptr [esp+4]
00401004 push ebx
00401005 lea eax,[ecx+3]
00401008 push esi
00401009 inc ecx
0040100A xor esi,esi
0040100C add ecx,2
0040100F nop
00401010 mov dl,byte ptr [eax+1]
00401013 cmp dl,byte ptr [ecx-1]
00401016 jne TestFunction+67h (401067h)
00401018 mov dl,byte ptr [eax+2]
0040101B cmp dl,byte ptr [ecx]
0040101D jne TestFunction+50h (401050h)
0040101F mov dl,byte ptr [eax+3]
00401022 cmp dl,byte ptr [ecx+1]
00401025 jne TestFunction+64h (401064h)
00401027 mov dl,byte ptr [eax+4]
0040102A mov bl,byte ptr [ecx+2]
0040102D add eax,4
00401030 cmp dl,bl
00401032 jne TestFunction+67h (401067h)
00401034 add esi,4
00401037 add ecx,4
0040103A cmp esi,10h
0040103D jb TestFunction+10h (401010h)
0040103F push eax
00401040 push offset string "Result is: %s\n" (40710Ch)
00401045 call printf (401095h)
0040104A add esp,8
0040104D pop esi
0040104E pop ebx
0040104F ret
00401050 add eax,2
00401053 push eax
00401054 push offset string "Result is: %s\n" (40710Ch)
00401059 call printf (401095h)
0040105E add esp,8
00401061 pop esi
00401062 pop ebx
00401063 ret
00401064 add eax,3
00401067 push eax
00401068 push offset string "Result is: %s\n" (40710Ch)
0040106D call printf (401095h)
00401072 add esp,8
00401075 pop esi
00401076 pop ebx
00401077 ret

You can see right away that the compiler unrolled his loop into 4 iterations of 4 separate compares each, using offset-based MOVs to read the characters sequentially. Interestingly, the pointers are not updated after each compare, but rather once per unrolled iteration. To compensate for this, 4 loop break points are generated, resulting in 3 separate calls to printf, each adding an appropriate amount to the pointer.

Now, this is where things get really weird. The source dictates that by the time the first compare gets executed, pin27CC (eax) will have 4 added to it, and srcbuff3 (ecx) will have 2 added. Yet that's not what the assembly does. pin27CC in fact gets 3 added, and srcbuff3 3 as well. Bizarre as this is, the loop accounts for it by accessing [eax+1] to [eax+4], and [ecx-1] to [ecx+2]. Corresponding to these values, the 4 loop break points update eax before calling printf, adding as necessary 4, 3, 2, or... 0. No, that's not a typo. The first break point goes to 00401067, which is part of the +3 case, but after the addition, so that no alteration of eax is performed (this is why there are only 3 calls to printf, despite there being 4 break points in the loop). This, of course, leaves eax off by 1, making the output incorrect.

Ah, the joys of clever-but-not-quite-clever-enough optimizing compilers.

By the way, thanks to Buster for the code formatting/colorizing script.

Update:
This bug has been fixed in VC++ 2005, along with another oddity that showed up in this code (namely the use of both eax and ecx as pointers, even though they always held the same address). As well, several nice new optimizations appear, such as reusing bytes already read from the string. The assembly from VC++ 2005:

00401520 push ebx
00401521 push esi
00401522 mov esi,offset ___xi_z+3Fh (4020CFh)
00401527 xor ecx,ecx
00401529 mov eax,esi
0040152B jmp TestFunction+10h (401530h)
0040152D lea ecx,[ecx]
00401530 mov bl,byte ptr [eax+1]
00401533 cmp bl,byte ptr [eax-1]
00401536 jne TestFunction+73h (401593h)
00401538 mov dl,byte ptr [eax+2]
0040153B cmp dl,byte ptr [eax]
0040153D jne TestFunction+49h (401569h)
0040153F cmp byte ptr [eax+3],bl
00401542 jne TestFunction+5Eh (40157Eh)
00401544 add esi,4
00401547 cmp byte ptr [eax+4],dl
0040154A jne TestFunction+76h (401596h)
0040154C add ecx,4
0040154F add eax,4
00401552 cmp ecx,10h
00401555 jb TestFunction+10h (401530h)
00401557 push esi
00401558 push offset ___xi_z+2Ch (4020BCh)
0040155D call dword ptr [__imp__printf (402070h)]
00401563 add esp,8
00401566 pop esi
00401567 pop ebx
00401568 ret
00401569 add esi,2
0040156C push esi
0040156D push offset ___xi_z+2Ch (4020BCh)
00401572 call dword ptr [__imp__printf (402070h)]
00401578 add esp,8
0040157B pop esi
0040157C pop ebx
0040157D ret
0040157E add esi,3
00401581 push esi
00401582 push offset ___xi_z+2Ch (4020BCh)
00401587 call dword ptr [__imp__printf (402070h)]
0040158D add esp,8
00401590 pop esi
00401591 pop ebx
00401592 ret
00401593 add esi,1
00401596 push esi
00401597 push offset ___xi_z+2Ch (4020BCh)
0040159C call dword ptr [__imp__printf (402070h)]
004015A2 add esp,8
004015A5 pop esi
004015A6 pop ebx
004015A7 ret

Friday, July 15, 2005

The Art of the Inside Job - Addendum

Okay, so we've got memory that we can share across processes. Now, what can we put in it? Well, as a general rule, you can put in it whatever you can put in it - that is, whatever is contained completely in the shared memory. Let me clarify.

Any basic value data types can be put in shared memory: ints, floats, arrays, etc. Structs can be placed in shared memory so long as all the members of the struct fulfill the same guidelines as anything else in a shared memory area. While you probably could get it to work if you're very careful, I wouldn't recommend putting classes or anything else that has associated functions in a shared memory region.

Pointers may never be put in shared memory. Remember that each process has a separate address space, so a pointer to something in one process will almost certainly not point to the same thing in another process. This applies even to pointers that point to data inside the shared memory itself: with the exception of Windows 9x (as noted earlier), there is no guarantee that a shared memory region will be mapped at the same address in multiple processes. Offsets to data in the shared memory, relative to the base address of the shared memory, are safe to share between processes (see the sketch below), as an offset won't change regardless of where the shared memory gets mapped.
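To illustrate the offset approach, here's a minimal sketch (the structure and function names are my own inventions, purely for demonstration; assumes windows.h and string.h):

// A header at the start of the shared memory describes where the payload is,
// as an offset rather than a pointer
struct SharedHeader
{
    DWORD nDataOffset; // Offset of the payload from the start of the mapping
    DWORD nDataSize;   // Size of the payload, in bytes
};

// Writer: store the payload after the header, recording only its offset
void WritePayload(void *lpBase, const void *lpData, DWORD nSize)
{
    SharedHeader *lpHeader = (SharedHeader *)lpBase;
    lpHeader->nDataOffset = sizeof(SharedHeader);
    lpHeader->nDataSize = nSize;
    memcpy((char *)lpBase + lpHeader->nDataOffset, lpData, nSize);
}

// Reader: lpBase may be a completely different address in this process, but
// base + offset still finds the payload
const void *GetPayload(const void *lpBase)
{
    const SharedHeader *lpHeader = (const SharedHeader *)lpBase;
    return (const char *)lpBase + lpHeader->nDataOffset;
}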

HANDLEs are not safe to put in shared memory sections - at least not directly. A HANDLE (at least, a real HANDLE - some 'HANDLE's are really user-mode pointers cast to the HANDLE type) is a reference to a kernel-mode object. Specifically, they are indices into the process' handle table, in which each entry maps to a kernel-mode pointer to the object. As all kernel memory is shared among all processes, the objects themselves are accessible from any process, but the HANDLEs remain process specific. It is possible, however, to create a new HANDLE in a foreign process which points to the same object that a HANDLE in your process points to. This is what the DuplicateHandle function is for. In this way it is possible to share HANDLEs between processes; but remember that you now have two separate HANDLEs - one in each process - and the object they point to won't be freed until both HANDLEs are closed.
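For illustration, a sketch of sharing an event this way (the helper is hypothetical, and the target process handle must have been opened with PROCESS_DUP_HANDLE access):

// Creates a HANDLE in the target process that refers to the same kernel event
// object as hEvent does in this process. The returned value is only meaningful
// INSIDE the target process, so it still has to be communicated to it somehow
// (through shared memory, for instance).
HANDLE ShareEventWithProcess(HANDLE hTargetProcess, HANDLE hEvent)
{
    HANDLE hRemoteEvent = NULL;

    if (!DuplicateHandle(GetCurrentProcess(), hEvent, hTargetProcess,
        &hRemoteEvent, 0, FALSE, DUPLICATE_SAME_ACCESS))
        return NULL;

    return hRemoteEvent;
}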

Lastly, HWNDs (handle to a window) and HHOOKs (handle to a window hook - we'll get to what these are in a post or two) are process-independent, and safe to share across processes.

There are probably a few other things that are safe to share between processes, but I believe I've covered almost all of the sharable types. When in doubt you should assume that something is not safe to share between processes.

Monday, July 11, 2005

Dear Mr. Stroustrup

The following is an e-mail I will send to Bjarne Stroustrup, in regards to his recent Design of C++0x article. I've posted it here so that it may be critiqued before I send it in.

Hello. My name is Justin Olbrantz, and I have been using C++ as my primary programming language for more than 5 years now. As it happens, I've recently been working on writing a lightweight (as close as possible to 0-overhead), platform-independent class library of fiercely platform-dependent features (atomic functions, endian conversion functions, threads and thread synchronization, files, sockets, etc.) for my own and public use, making your paper on the Design of C++0x (and the declaration that you are accepting suggestions) extremely timely, as I've been putting a significant amount of thought into the matter. My requests are actually pretty minor, as I believe most of the stuff my library does would be best left up to third-party developers such as myself; however, there are a few things that I would really like to see either in the language or in the standard libraries.

First of all are fixed-size data types: that is, things like int16, int32, int64, etc. While it is possible to write portable code using the regular C/C++ data types, it's crippling to not be sure of what size a variable will be on a given machine, and overly cumbersome to write code that will always work properly with different data type sizes (particularly when you have to interface with the OS, which expects things to be certain sizes). While this could be implemented in the standard library (much like my library does), regardless of where it is implemented, I believe that it is necessary to preserve the existing signed/unsigned syntax for these types (i.e. being able to do 'unsigned int32' or 'signed int64').
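For context, the best a library can do today is something along these lines (sizes assume a typical 32-bit x86 compiler). Note that plain typedefs cannot support the 'unsigned int32' syntax, since 'unsigned' can't be applied to a typedef - which is exactly why language or standard library support is needed:

typedef signed char    int8;
typedef short          int16;
typedef int            int32;
typedef long long      int64; // __int64 on older MSVC

typedef unsigned char      uint8;
typedef unsigned short     uint16;
typedef unsigned int       uint32;
typedef unsigned long long uint64;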

In addition, a 'word' data type would be useful, which is the effective word size of the target processor (i.e. 32 bits on x86-32 or PPC32, 64 bits on Itanium, PPC64, etc.), as well as a pointer-sized integer type, which also follows the signed/unsigned semantics. In each case it is not necessary that the features be implemented in the language itself (the standard library would suffice), but they should, for ease of use, support the signed/unsigned syntax of normal types.

Atomic functions. I saw that you're already aware of the need for such functions, but I'll list them again so that I can give my rationale for designing them. Building support for atomic data types into the language itself would be ideal, due to the typical read w/reserve-compute-conditional write pattern used on non-x86 systems. However, I should think that writing an optimizing compiler which is able to take advantage of this pattern would be rather difficult (although compiler design is not my specialty). If that isn't feasible, I see no reason to not simply provide standard library functions such as atomic_add, atomic_exchange, atomic_compare_exchange, etc. While I suppose that programmers who simply want things to work without knowing the details might appreciate an atomic variable template class, I personally would not be likely to use such a thing, as in many cases it would be slower than simple primitive functions (and without the individual primitive functions an atomic data type template would be of limited use).
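To make the request concrete, here are hypothetical declarations for the primitives I have in mind (illustrative signatures, not an existing API):

// Each operates atomically on *target
int32 atomic_add(volatile int32 *target, int32 value);      // Returns the new value
int32 atomic_exchange(volatile int32 *target, int32 value); // Returns the old value

// Writes newValue to *target only if *target == comparand; returns the old
// value, so the caller can tell whether the write happened
int32 atomic_compare_exchange(volatile int32 *target, int32 newValue, int32 comparand);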

Byte order support, including conversion functions and macros to determine target machine byte order. In my library I chose to create 4 inline template functions: ltoh (little endian to host), btoh (big endian to host), htol (host to little endian), and htob (host to big endian) (all of these take value parameters and return values). I chose to implement these as template functions for two reasons. Creating a set of functions with distinct names (such as the network conversion functions) seemed far too cumbersome to use, particularly in macros or template classes. As well, I chose not to use simple function overloading with these 4 functions because that would prevent the programmer from explicitly specifying the type the compiler should treat the data as (as well as creating the possibility of incorrect guesses as to the intended data type by the compiler).
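As an illustration of the intended usage, a sketch along the lines of my implementation (not the actual LibQ code; byte_swap and the endianness macro here are hypothetical):

// Generic byte reversal; a real implementation would specialize by size
template <typename T>
inline T byte_swap(T value)
{
    unsigned char *lpBytes = (unsigned char *)&value;

    for (unsigned int i = 0; i < sizeof(T) / 2; i++)
    {
        unsigned char chTemp = lpBytes[i];
        lpBytes[i] = lpBytes[sizeof(T) - 1 - i];
        lpBytes[sizeof(T) - 1 - i] = chTemp;
    }

    return value;
}

// Big endian to host: a no-op on big endian machines, a swap on little endian.
// ltoh, htol, and htob follow the same pattern.
template <typename T>
inline T btoh(T value)
{
#ifdef HOST_BIG_ENDIAN // Hypothetical configuration macro
    return value;
#else
    return byte_swap(value);
#endif
}

// The explicit template parameter lets the caller state the intended type:
// unsigned long nValue = btoh<unsigned long>(nRawValue);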

Vector support. This is something I want to add to my library, but frankly, I'm not yet sure whether I can do it well enough in a library (it may require compiler support). What I'd like to have is a structure that acts like an array for member access, but can be used in vector math functions. The real trick about this (and the major obstacle in my own path) is that the size of this structure would differ not only by target processor, but also by data type. x86 CPUs with SSE2 can pack 4 32-bit numbers into a single vector register, but without SSE2 it can only pack 2, and without MMX only 1 (in this case the 'vector structure' is really just using the integer unit and a normal GPR). Similarly, a processor with SSE (but lacking SSE2) can use 128-bit registers for floating point, and so could compute 4 float (assuming float is 32 bits) values in a single operation, but can only use 64-bit registers for integer operations, and so could only compute 2 int (assuming int is 32 bits) values in a single operation. Such vector support should be totally transparent; that is, once code is written to use it, it should always work, regardless of the features of the processor (i.e. if the code is compiled for SSE2 it will use 128-bit registers, but if it's compiled for a non-MMX x86 it will only use 32-bit GPRs/FPRs). This is what I would consider ideal vector support. Because it's used explicitly, maximum parallelization may be performed, yet because the vectors are of variable size, the presence or absence of vector support in the processor (and the vector capabilities of the processor) is completely transparent to the coder; the same 32-bit math code (assuming the code is fundamentally parallel, and so can be done in a vector structure) will run at 1x speed on a Pentium, and at 4x speed on a Pentium 4.

I believe those are the major things. Some items of lesser importance:
- Typed enums: the ability to specify the data type an enum will occupy. This was previously suggested; I just wanted to register my vote for it.

- A private heap allocator class. That is, support for heaps which are in isolated memory areas. This is useful, for instance, to make allocations/frees very quick by having a given heap only allocate a single type of structure (thus all entries in the heap will be exactly the same size).

- A pool template class. That is, a container which retains and reuses "freed" instances of the contained classes, so that actual allocations and deletions are rare.

- Functions to convert native data types to and from some "standard form", such as IEEE floats, big endian two's complement integers, or whatever (exactly what the standard form is isn't as important as the ability to convert to and from it using universally available functions), for transmission over the internet or storage on disk.

I believe that is everything I wish to submit for your consideration; thank you for your time (and lots of it, given the length of this message...).

Sunday, July 03, 2005

Oh, By the Way

On Thursday I wrote up a spec sheet for the MoPaQ archive format used by Blizzard games.

Saturday, July 02, 2005

Bear with Me

Working on getting a nice code-displaying script working with the blog. Don't mind things like the blog layout disappearing from time to time :P

The Art of the Inside Job

Before we get into the really fun stuff (getting injected code running in a foreign process, which is really my specialty), we need to go over one more topic: passing data to your code in another process. At first thought you might wonder how that differs from the data injection we just went over, but really it's something else entirely.

Remember that whatever you stick in memory of another process has to be found by the process, otherwise there's no point in sticking it there to begin with. In some methods of executing code in foreign processes it's possible to pass parameters to your code, but not in others. In the latter case, you need another way of passing data across processes. There are a number of methods of accomplishing this, but we'll only discuss one, as it's so well suited to our purposes: memory mapped files.

That's right, it's what we just saw in the Windows 9x injection method; however, this time we're using it a bit differently - that is, we're using it like it's supposed to be used. In addition to being identified by a handle that's local to your process, a memory mapped file can be identified by a name string. Once named, other processes in the system may open it by name. Note that, in this case, you must open the file mapping in every process that wants to access it (the way shared file mappings are supposed to be done). As this is how they were made to work, this method works exactly the same on both Windows NT and Windows 9x.

So, what do we call the memory mapped file? Well, you can call it anything you like, provided it meets a couple of restrictions. First, it must be globally unique; things are gonna get really painful if two copies of your program try to use the same file mapping at the same time. Second, the foreign process must be able to determine the name without any outside help. That is, you can't simply choose a random string and tell the process what it is (if you could do that, we wouldn't need to use file mappings at all...). I'd recommend using a string that contains both the name of your program and the process ID of the process you're sticking code into; this is a simple way of meeting both requirements.

#include <windows.h>
#include <stdio.h>
#include <assert.h>

const char *pszSectionNameFormat = "%s_xyz_%08X";

// Creates a memory mapped file of the specified size, giving it a name that can be found by the target process. Note that 'section' is another name for memory mapped file.
HANDLE CreateProcessSection(DWORD nSize, const char *lpszProgramName, DWORD nProcessID)
{
 assert(nSize);
 assert(lpszProgramName);

 // We're using a fixed-size buffer, here. Care should be taken to not use an extremely large string for the program name that would overflow this buffer.
 char szSectionName[128];

 // Construct the name
 sprintf(szSectionName, pszSectionNameFormat, lpszProgramName, nProcessID);

 // Create the file mapping, backed by the swap file rather than a disk file
 return CreateFileMapping(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE, 0, nSize, szSectionName);
}

// Opens the file mapping which is targeted at this process
HANDLE OpenProcessSection(const char *lpszProgramName)
{
 assert(lpszProgramName);

 // Construct the name, same as before
 char szSectionName[128];

 sprintf(szSectionName, pszSectionNameFormat, lpszProgramName, GetCurrentProcessId());

 return OpenFileMapping(FILE_MAP_WRITE, FALSE, szSectionName);
}

Friday, July 01, 2005

The Art of Imperialism - Windows 9x

So, we have a really nice method of injecting code or data into a foreign process in Windows NT. The only problem is that Windows 9x doesn't HAVE VirtualAllocEx; thus, different methods are required. In fact, Windows 9x has no way of specifically allocating memory in a foreign process; however, a trick of the trade can be used to effectively do just that.

File mapping is a technique on various operating systems that allows a file on disk to appear to the program as if it were a region of memory, accessed by a pointer. On Windows NT, file mappings act very much like regular memory - that is, only those processes that open a file mapping will have it in their address space. Windows 9x, however, works more peculiarly. In Windows 9x, the upper 2 gigabytes of each process' virtual address space are shared among all processes. The kernel stores its data there (Windows NT also does this, although just about everything in the upper gigabytes is accessible only from kernel mode). In Windows 9x, however, all file mappings live in this shared memory space: they are not only accessible to all processes, whether or not they mapped the file themselves, but are also at the same address in every process. When one process opens a file mapping, all other processes instantly have access to that mapping.

However, in addition to mapping a disk file into a process' address space, it is possible to create a file mapping that does not use a file. In this case, the file that gets mapped is the system's swap file. That is, you're using virtual memory. As creating a file with our data would be significantly more work, we will use the swap file to store our data.

Mapping a file involves two steps in Windows: the creation of the kernel file mapping object (which you receive as a handle) with CreateFileMapping, and the mapping of that mapping object into the process' address space with MapViewOfFile.

In CreateFileMapping, you must specify the file to map, the security attributes for the file mapping, and the protection for the file map (read, write, etc.). Since we're using Windows 9x, which ignores most of the security attribute stuff, we can just specify NULL for this, which gives the mapping the default security attributes. And since we're using the swap file, we specify INVALID_HANDLE_VALUE for the file handle. As before, we'll want both read and write access, so the protection will be PAGE_READWRITE (file mappings don't support execution, so we don't specify it; this is okay, because Windows 9x doesn't support execution protection at all, so we'll still be able to execute code in the file mapping).

When calling MapViewOfFile, you must again specify the protection for the memory that is mapped. This will simply be FILE_MAP_WRITE, which specifies both read and write access. Again, execution access isn't needed, because Windows 9x doesn't support that option anyway (memory is always executable).

However, there's a snag in this process. The file mapping is created and mapped in the current process. While mapping the file in the current process maps it in all processes in Windows 9x, this mapping will be closed when the owning process (yours) exits. I'm currently investigating a possible solution to this problem, but getting info about Windows 9x is pretty difficult these days, given that Microsoft no longer supports 9x. For now, you'll have to keep your process alive until the foreign process no longer needs the mapping you created.

// Makes the string "Squish!" available to all processes
bool InjectSquish9x()
{
 const char *pszSquish = "Squish!"; // The string to write
 const SIZE_T nSquishSize = strlen(pszSquish) + sizeof('\0'); // Length of the string, including the terminating null

 // Create the file mapping kernel object, backed by the swap file
 HANDLE hMapping = CreateFileMapping(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE, 0, nSquishSize, NULL);
 if (!hMapping)
  return false;

 // Map the view of the file
 void *lpMemory = MapViewOfFile(hMapping, FILE_MAP_WRITE, 0, 0, nSquishSize);
 if (lpMemory)
 {
  memcpy(lpMemory, pszSquish, nSquishSize); // Write the string

  // Leave the mapping open, so that the memory will remain available to the foreign process
  return true;
 }
 else
 {
  // We couldn't map the view, so close the mapping
  CloseHandle(hMapping);

  return false;
 }
}

The Art of Imperialism - Windows NT

While not by any means a common thing to do, occasionally it is necessary to inject code or data into another process, without the knowledge of that process. On Windows this is fairly easy to do (this is one of those fiercely platform-dependent features, and I only know how to do it on Windows). In fact, there are several API functions that exist for this purpose; however, different methods are required for the Windows NT and Windows 9x kernels.

Remember that in 32-bit Windows, each process has an isolated memory space. A pointer to something in your process will not work to access memory in another process, and vice-versa. For this reason, special methods must be used to allocate and access memory in other processes.

For Windows NT, this process is very straightforward, and uses the VirtualAllocEx and WriteProcessMemory functions. VirtualAllocEx is just like VirtualAlloc - it allocates a region of virtual address space - except that VirtualAlloc allocates memory in the current process, while VirtualAllocEx takes a handle to the process to allocate in. WriteProcessMemory writes a buffer to memory in the process whose handle you supply it.

However, before you can use either of these functions, you have to obtain a handle for the process you want to access, using OpenProcess. This is, really, the hard part, because it's where permissions checking gets done; you must have permission to access the process. When calling OpenProcess, you must supply the process ID of the process, and the access to the process that is desired. As stated in MSDN, to use VirtualAllocEx and WriteProcessMemory, only the PROCESS_VM_OPERATION and PROCESS_VM_WRITE permissions are necessary (PROCESS_VM_READ as well, if you want to also read the process' memory). With Windows NT's security model, it's best to request as little access as possible to get the job done; the more access you request, the more likely it is that you'll be denied.

When calling VirtualAllocEx, you must specify the allocation type and memory protection to apply to the allocated memory. Since you want actual memory, and not simply to reserve a portion of the address space for future use, you want to use MEM_COMMIT as the allocation type. The memory protection depends on what you want to do with it. Obviously you'll need it to be both readable and writable. However, if you're going to write executable code to the memory, you'll also need the memory to be executable, which is specified in the memory protection as well. Thus, use PAGE_EXECUTE_READWRITE for executable code memory, and PAGE_READWRITE for memory which will not be executed.

// Allocates memory in the specified process, and writes the string "Squish!" to it.
bool InjectSquishNT(DWORD nProcessID)
{
 const char *pszSquish = "Squish!"; // The string to write
 const SIZE_T nSquishSize = strlen(pszSquish) + sizeof('\0'); // Length of the string, including the terminating null

 // Open the process
 HANDLE hProcess = OpenProcess(PROCESS_VM_OPERATION | PROCESS_VM_WRITE, FALSE, nProcessID);
 if (!hProcess)
  return false; // Failed

 bool bSuccess = false; // Failed until proven otherwise

 // Allocate the memory
 void *lpMemory = VirtualAllocEx(hProcess, NULL, nSquishSize, MEM_COMMIT, PAGE_READWRITE);
 if (lpMemory)
 {
  // Write the string
  SIZE_T nBytesWritten;

  if (WriteProcessMemory(hProcess, lpMemory, pszSquish, nSquishSize, &nBytesWritten) && nBytesWritten == nSquishSize)
   bSuccess = true;
  else
   VirtualFreeEx(hProcess, lpMemory, 0, MEM_RELEASE); // If the write failed, free the allocated memory. If the write succeeded, leave it.
 }

 // Close the process handle
 CloseHandle(hProcess);

 return bSuccess;
}
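A hypothetical usage example (the process ID here is a placeholder; in practice you'd get it from CreateProcess, a process enumeration, or the like):

DWORD nProcessID = 1234; // Placeholder - some process we have access to

if (InjectSquishNT(nProcessID))
    printf("\"Squish!\" written into process %u\n", nProcessID);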

Saturday, June 11, 2005

Conditions - Windows Implementation

The Windows version of the condition is fairly straightforward (and I don't know about you, but I'm getting a bit bored with all these almost-alike-but-not-quite fast synchronization objects), and more than a bit resembles the fast mutex. Here, we have a slow semaphore that waiting threads wait on, and a counter of how many waiting threads there are. The use of atomic functions to access the waiting threads counter allows us to signal from either inside or outside the mutex paired with the condition. Really, the only major difference between the condition and the mutex implementation is that with the condition, a paired mutex must be exited before waiting, and reentered after waiting.

In doing so, it's essential that the counter be incremented before leaving the mutex to wait (despite the fact that the counter is accessed atomically). Were this not the case, we'd create a race between the data (which is protected by the mutex) and the counter (which is accessed atomically), and could put a thread to sleep for an indefinite period of time. This problem does not exist in the other direction: we can safely access the counter outside the mutex when we're decrementing it (that is, when we're waking waiting threads).

The code:

class CCondition // Win32 version of CCondition (fast)
{
private:

CMutex &m_pairedMutex; // The paired mutex
CSemaphore m_sem; // Semaphore waiters will wait on

// Atomically accessed, so that Signal may be called outside the mutex
int32 m_nWaiters; // Number of threads waiting on the semaphore

public:
// Associates a mutex with the condition. From this point on, no other mutex may be used to access the condition. However, multiple conditions may be associated with the same mutex.
inline CCondition(CMutex &mutex) : m_pairedMutex(mutex), m_sem(0)
{
m_nWaiters = 0;
}

inline ~CCondition()
{
if (m_nWaiters)
m_sem.Post(m_nWaiters);
}

// Signal that the condition is true and signal threads waiting; has no effect if no threads are waiting. Can either release one thread or all threads. May be called inside the mutex or outside.
inline void Signal(bool bSignalAll = false)
{
int32 nWaiters, nReleaseThreads;

do
{
nWaiters = m_nWaiters;
assert(nWaiters >= 0);

if (!nWaiters)
return; // Do nothing if no threads are waiting

nReleaseThreads = (bSignalAll ? nWaiters : 1);
} while (AtomicCompareExchange(&m_nWaiters, nWaiters - nReleaseThreads, nWaiters) != nWaiters);

m_sem.Post(nReleaseThreads); // Set them loose
}

// Waits for the condition to become true. Must be called from inside the mutex, and the thread will be back inside the mutex at return. HOWEVER, return does not GUARANTEE that the condition is true. You must verify that the condition has indeed become true upon return.
inline void Wait()
{
// Absolutely essential that this get done before we leave the mutex or we create a race condition
AtomicIncrement(&m_nWaiters); // One more waiter

m_pairedMutex.Leave(); // Leave mutex
m_sem.Wait(); // Wait for signal
m_pairedMutex.Enter(); // Reenter mutex after signal
}
};

Conditions - POSIX Implementation

As with the mutex class, the POSIX version of the condition is trivial, due to the native support by the OS. The POSIX APIs allow you to specify a mutex to use each time you wait on the condition. However, as I can't see any use for this, and it leaves more room for error, I've decided to bind the condition to its mutex on construction. The code:

class CCondition
{
private:
CMutex &m_pairedMutex; // Paired mutex
pthread_cond_t m_cond; // Condition variable

public:
inline CCondition(CMutex &mutex) : m_pairedMutex(mutex)
{
if (pthread_cond_init(&m_cond, NULL) != 0)
throw std::bad_alloc("Unable to create condition");
}

inline ~CCondition()
{
verify(pthread_cond_destroy(&m_cond) == 0);
}

inline void Signal(bool bSignalAll = false)
{
int nRetVal;
if (bSignalAll)
nRetVal = pthread_cond_broadcast(&m_cond); // Signal all
else
nRetVal = pthread_cond_signal(&m_cond); // Signal one

assert(nRetVal == 0);
}

inline void Wait()
{
verify(pthread_cond_wait(&m_cond, &m_pairedMutex.GetMutex()) == 0);
}
};

Conditions - Description

Conditions are a lot like events - they indicate that some condition is or isn't true, and they may be waited on by threads that have nothing to do until the condition becomes true. However, unlike an event, they are integrally paired with a mutex that protects some data. A thread waits on the condition from inside the mutex, and when it gets woken, it will still be inside the mutex. Yet the mutex does not remain locked while the thread sleeps, so other threads can enter the mutex, modify the data, and signal that the condition has become true.

So, why do we implement both? Events are a Windows thing, as Windows doesn't support conditions. But POSIX doesn't support events - it only supports conditions. This seemed like a curious design decision to me when I first was learning about the POSIX API (as I've always programmed on Windows, and events were all that I knew), but I better understand the reasons, now.

While there are times where a simple "fire and forget" event may be useful, usually an event is associated with the state of some data. For example, one of my programs was an audio-streaming library. This system used three worker threads for its tasks (the rationale for the exact number is rather lengthy, so I won't cover it here) - a streaming (decoding) thread, a prebuffering thread, and a prefetching thread. The job of the prebuffering thread is to prebuffer new streams before they begin playing; the rest of the time, it sleeps, while it waits for new requests. When a request arrives, the condition gets flagged, and the thread goes to work. When this happens, the thread must lock the mutex protecting the list of streams needing prebuffering (as it's accessed by more than 1 thread), dequeue one stream, leave the mutex, then do its prebuffering outside the mutex. When it's done prebuffering, it adds the stream to the playing streams list (inside the mutex, of course), tells the OS that the stream has data to play, and checks if there are any more streams needing prebuffering; if no more streams need prebuffering, the thread goes back to waiting on the condition.

The point is that signaling of the condition will always be followed by the locking of the mutex. This is where condition variables come in. By combining the locking/unlocking with the wait on the condition, this leaves the OS free to do any desired optimizations, which could further improve performance. One example of a potential optimization would be a direct context switch of the CPU from the thread signaling the condition to the thread waiting on the condition, with 0 latency. From what I've heard, Windows internally supports something like this, called event pairs (my information is second-hand, as I know very little about how the Windows kernel works internally, apart from what I've heard from more knowledgeable friends and my Inside Windows NT book). In any case, the goal of LibQ is to provide a platform-independent set of interfaces, which take advantage of features that are optimized on each platform. Because the condition/mutex model is so common, that means we have to support it.

As a footnote, I should mention a couple of sticky points about conditions. Unlike events, conditions do not retain their value. If a thread signals a condition, absolutely nothing happens if there are no threads waiting on that condition (unlike for events, where the event will remain set until reset). The reason for this is that threads are expected to check the data associated with a condition to determine whether the condition is already true (since they'll already be inside the mutex). Thus, the data itself indicates whether the condition is true or false.

Also, according to the POSIX standard, conditions are not guaranteed to be watertight. That is, a thread waiting on a condition may get woken when the condition is not actually true. This can happen for several reasons; for example, a thread may be waiting on the mutex before a condition becomes true, and so will get the mutex before the woken thread gets to it, changing the value of the data so that the condition is no longer true. Beware of this - when Wait returns, always check that the condition is indeed true; if not, call Wait again.
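In code, that defense is the classic wait loop. A minimal sketch using the CCondition and CMutex interfaces from these posts (the work queue and its methods are hypothetical):

// Re-test the predicate every time Wait returns, since a return does not
// guarantee the condition is true
mutex.Enter();

while (!queue.HasWork()) // Hypothetical predicate on the data the mutex protects
    condition.Wait(); // Leaves the mutex while sleeping, reenters on wake

CWorkItem item = queue.Dequeue(); // Safe: we hold the mutex and the predicate is true

mutex.Leave();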

Friday, June 10, 2005

Mutexes - Windows Implementation

Really, there isn't a great deal to say about the Windows version of the mutex, either. It's pretty similar to the fast event and fast semaphore already written. In this case, the atomic variable is a count of the number of would-be threads inside the mutex - at most one will be inside the mutex, and the rest will be waiting on the semaphore. Thus, Enter increments the count, waiting if the count was nonzero; Leave decrements the count, and releases a thread if the count is nonzero after decrementing. Of course, if possible, Enter spins to try to grab the mutex before ultimately incrementing the counter and going into wait. The code:

class CMutex // Windows version of CMutex (fast). This is actually platform-independent, but we need to use the native POSIX mutex type so that we can use condition variables.
{
private:
CSemaphore m_sem; // Semaphore threads waiting on the mutex will wait on

unsigned int m_nSpinCount; // Constant. Number of times to spin before waiting on the semaphore

int32 m_nLockCount; // Interlocked. Number of threads trying to lock the mutex (also the number of threads waiting on the semaphore + 1)

public:
// The spin count is used on multi-processor systems (and will in fact do more harm than good on single-processor systems). It allows a "grace period" where the mutex will spin on the CPU waiting for another processor to release the mutex. This can be much faster than waiting on the semaphore, but just wastes CPU cycles on a single-processor system. Spinning may not be useful if threads typically hold the mutex for very long time periods.
inline CMutex(unsigned int nSpinCount = 0) : m_sem(0)
{
m_nSpinCount = nSpinCount;

m_nLockCount = 0;
}

inline ~CMutex()
{
}

// Tries to enter the mutex, but will fail if the mutex cannot be entered without waiting. Returns true if the mutex was entered.
inline bool TryEnter()
{
// If the lock count was 0, we now own the mutex
return (AtomicCompareExchange(&m_nLockCount, 1, 0) == 0);
}

// Waits indefinitely for the mutex. This function is NOT reentrant. Calling it twice in a row in a thread without calling Leave will cause the thread to permanently deadlock. Always make sure your Enter is paired with a Leave (or use the auto-lock below).
void Enter()
{
// Spin first, in case we can get it without waiting
unsigned int nSpinCount = m_nSpinCount;
while (nSpinCount-- > 0)
{
if (TryEnter())
return;
}

// Try to get the mutex
int32 nLockCount = AtomicIncrement(&m_nLockCount);

if (nLockCount == 1)
return; // No thread held the mutex. We do, now.

// Another thread held the mutex. We have to wait.
m_sem.Wait();
}

inline void Leave()
{
int32 nLockCount = AtomicDecrement(&m_nLockCount);

if (nLockCount > 0)
m_sem.Post(); // Somebody's waiting on the mutex. Release one waiter.
}
};

Mutexes - POSIX Implementation

Really there isn't much to say, here. The POSIX version of our mutex is nothing more than a thin wrapper around a pthread_mutex, with profuse use of our verify macro.

class CCondition; // Circular reference incoming. We don't have any choice but to forward declare it.

class CMutex // POSIX version of CMutex: pthread_mutex (fast?)
{
private:
pthread_mutex_t m_mutex;

inline pthread_mutex_t &GetMutex()
{
return m_mutex;
}

public:
// Compatible prototype, even though the spin count is unused
inline CMutex(unsigned int nSpinCount = 0)
{
if (pthread_mutex_init(&m_mutex, NULL) != 0)
throw std::bad_alloc("Unable to create mutex");
}
inline ~CMutex()
{ verify(pthread_mutex_destroy(&m_mutex) == 0); }

inline bool TryEnter()
{ return (pthread_mutex_trylock(&m_mutex) == 0); }
inline void Enter()
{ verify(pthread_mutex_lock(&m_mutex) == 0); }
inline void Leave()
{ verify(pthread_mutex_unlock(&m_mutex) == 0); }

// Allow CCondition to access the mutex directly
friend class CCondition;
};

Sorry About That

Just learned that anonymous comments weren't allowed. Fixed, now :P

Events - New POSIX Implementation

The new POSIX version of the event is similar to, but a bit more complicated than, the fast semaphore. In fact, the POSIX version of the standard event is inherited from a platform-independent fast event. It's based on the same principle as the fast semaphore: use of an atomic variable in combination with a slow semaphore that never increases above 0. A bit more complicated, though, because now our atomic variable is a compound value, rather than a single counter.

The reason for this is that we can only operate on a single variable atomically, yet we have two values that must be tracked - the status of the event (set or cleared), and the number of waiters (if cleared). The compound value structure consists of the highest bit indicating that the event is set, and the low bits indicating the number of waiters. Logically, it's impossible for there to be any waiters when the event is set.

With this configuration, it should be fairly obvious how to implement this. The implementation will rely heavily on the fact that we can read a variable and then change it at a later time, as long as we can verify that the variable hasn't been changed in between (which we can, using AtomicCompareExchange - if the variable has been changed when we try to write it back, we simply read it again and repeat). This allows us to read the status value, perform logic on it, and then write it back with some computed new value, and still be totally thread-safe in the whole operation.

The Reset and TryWait methods are trivial - a single atomic function call - so I won't bother explaining them. Set and Wait aren't much more complicated, but I'll mention them a bit.

The only real logic that Set requires is what to do if the event is cleared. If it's a manual-reset event, all waiting threads get released, and the status becomes set. If it's an auto-reset event, what happens depends on whether there are waiting threads. If there are, then one thread is released, and the status remains cleared (although there's one less waiting thread). If no threads are waiting, the status becomes set, just like with the manual-reset event.

The logic in Wait is essentially the same. If the status isn't set, then the waiting thread count gets incremented, and the thread waits on the semaphore. If it's set and it's a manual-reset event, Wait returns immediately, and the status remains set. If the status is set and it's an auto-reset event, the status is cleared, and Wait returns (at this point no threads are waiting on the event).

The code for CFastEvent and the POSIX CEvent:

class CFastEvent
{
private:
    CSemaphore m_sem; // Semaphore that threads will wait on

    unsigned int m_nSpinCount; // Constant. The number of times to check whether the event is set before waiting

    bool m_bAutoReset; // Constant

    int32 m_nStatus; // Atomically accessed. The sign bit indicates that the event is currently set. The rest of the bits are the number of threads waiting on the event, when the event isn't set.

    // Various local constants
    enum
    {
        WAITER_MASK = 0x7FFFFFFFL,
        SET = 0x80000000L
    };

public:
    inline CFastEvent(bool bAutoReset, bool bSet, unsigned int nSpinCount = 0) : m_sem(0)
    {
        m_nSpinCount = nSpinCount;

        m_bAutoReset = bAutoReset;

        m_nStatus = (bSet ? SET : 0);
    }

    inline ~CFastEvent()
    {
    }

    void Set()
    {
        int32 nStatus, nNewStatus;
        do
        {
            nStatus = m_nStatus;

            if (nStatus == SET)
                return; // Already set

            if (nStatus && m_bAutoReset)
                nNewStatus = nStatus - 1; // If there are waiters and it's an auto-reset event, release 1 and leave unset
            else
                nNewStatus = SET; // Else set and release all
        } while (AtomicCompareExchange(&m_nStatus, nNewStatus, nStatus) != nStatus);

        // If there were any waiters, release the appropriate number
        if (nStatus)
            m_sem.Post(nStatus - (nNewStatus & WAITER_MASK));
    }

    inline void Reset()
    {
        // If it's set, reset it; else do nothing
        AtomicCompareExchange(&m_nStatus, 0, SET);
    }

    inline bool TryWait()
    {
        // New event status if we successfully get the event
        int32 nNewStatus = (m_bAutoReset ? 0 : SET);

        // If it was previously set, we got it. Leave it set or clear it, based on whether it's a manual or auto-reset event.
        return (AtomicCompareExchange(&m_nStatus, nNewStatus, SET) == SET);
    }

    // Wait indefinitely for the event
    void Wait()
    {
        // Don't wait if we can get it by spinning
        unsigned int nSpinCount = m_nSpinCount;
        while (nSpinCount--)
        {
            if (TryWait())
                return;
        }

        // Check if the event is set, queue as a waiter if not
        int32 nStatus, nNewStatus;
        do
        {
            nStatus = m_nStatus;

            if (nStatus == SET)
                nNewStatus = (m_bAutoReset ? 0 : SET); // Leave it set (manual) or clear it (auto)
            else
                nNewStatus = nStatus + 1; // One more waiter
        } while (AtomicCompareExchange(&m_nStatus, nNewStatus, nStatus) != nStatus);

        if (nStatus == SET)
            return; // It was set

        // It wasn't set. Wait for it.
        m_sem.Wait();
    }
};

// Simply provide the CFastEvent class with an interface compatible with CEvent
class CEvent : public CFastEvent
{
public:
    inline CEvent(bool bAutoReset, bool bSet) : CFastEvent(bAutoReset, bSet)
    { }
};

How to Surrender to MS in Style

I just saw this post on Slashdot (yes, I do read Slashdot), and found it so dazzlingly idiotic I just had to pass it around.
Testing is only a priority on closed source apps - Windows and IE being no exception. The very fact that users have neither access to the source code nor the ability to build the application sources means that any testing must be done "in-house". This is going to slow down the release cycle by exactly the amount of time it would take to run all the regression tests.

With Open Source, a patch can be released right away and users can compile in the new sources themselves. Any issues can be immediately identified and reported back to the maintainers, often with both the offending source code and potential fixes to the patch. Without the lengthy QA cycle, Open Source patches are much more immediate than any Closed Source shop could ever hope to achieve.

What's that I hear? Why, I do believe it's the sound of people stampeding to replace their OSS products with MS stuff after hearing how OSS (or at least this guy) does business. You just became MS's MVP of the week, dude; congrats!

Events - New Windows Implementation

This one's almost identical to the last. All I did was remove the timed Wait and replace it with a TryWait, for compatibility with the new POSIX version.

class CEvent // Win32 version of CEvent (slow)
{
private:
    HANDLE m_hEvent;

    inline HANDLE GetEvent()
    {
        return m_hEvent;
    }

public:
    // bAutoReset controls how the event reacts when it gets set. An auto-reset event will automatically go back to the unset state after one thread gets released from waiting because of the event being set. A non-auto-reset event will stay set until reset explicitly. CEvent throws bad_alloc if it cannot create the event.
    inline CEvent(bool bAutoReset, bool bSet)
    {
        m_hEvent = CreateEvent(NULL, !bAutoReset, bSet, NULL);
        if (!m_hEvent)
            throw std::bad_alloc("Unable to create event");
    }

    inline ~CEvent()
    {
        verify(CloseHandle(m_hEvent));
    }

    inline void Set()
    {
        verify(SetEvent(m_hEvent));
    }

    inline void Reset()
    {
        verify(ResetEvent(m_hEvent));
    }

    // Returns true if the event was set, otherwise false
    inline bool TryWait()
    {
        DWORD nRetVal = WaitForSingleObject(m_hEvent, 0);
        assert(nRetVal != WAIT_FAILED);

        return (nRetVal != WAIT_TIMEOUT);
    }

    // Waits indefinitely for the event to be set
    inline void Wait()
    {
        verify(WaitForSingleObject(m_hEvent, INFINITE) != WAIT_FAILED);
    }
};

Ooooh. Shiny.

Have a look at what VC++ compiled a few lines of ThingyTron as, demonstrating the new atomic and endian function intrinsics (you can also see some calls that aren't intrinsic, but work nonetheless). Note that this is optimized code, so there isn't always a 1:1 relationship between the C++ line and the assembly lines below it.

int32 moosquish = 15, mooblah = 79;
00403EC0 mov dword ptr [esp+20h],0Fh

mooblah = AtomicExchangeAdd(&moosquish, mooblah);
00403EC8 mov eax,4Fh
00403ECD lea ecx,[esp+20h]
00403ED1 lock xadd dword ptr [ecx],eax

mooblah = AtomicExchange(&moosquish, mooblah);
00403ED5 mov edx,ecx
00403ED7 xchg eax,dword ptr [edx]
00403ED9 mov ecx,eax

mooblah = AtomicCompareExchange(&moosquish, mooblah, -15);
00403EDB mov eax,0FFFFFFF1h
00403EE0 lock cmpxchg dword ptr [edx],ecx

mooblah = AtomicBitExchange(&moosquish, 2, 0);
00403EE4 push 0
00403EE6 push 2
00403EE8 mov eax,edx
00403EEA push eax
00403EEB call _AtomicBitExchange@12 (401060h)

mooblah = AtomicBitExchangeCompliment(&moosquish, 7);
00403EF0 push 7
00403EF2 lea ecx,[esp+24h]
00403EF6 push ecx
00403EF7 call _AtomicBitExchangeCompliment@8 (401090h)

mooblah = AtomicSignedExchangeAdd(&moosquish, 29, false);
00403EFC push 0
00403EFE lea edx,[esp+24h]
00403F02 push 1Dh
00403F04 push edx
00403F05 call Q::AtomicSignedExchangeAdd (4010E0h)
00403F0A add esp,0Ch

void *squishyptr = (void *)0x1234567,
*fatptr = AtomicCompareExchangePointer(&squishyptr, (void *)0x76543210, (void *)0x1234567);
00403F0D push 1234567h
00403F12 mov dword ptr [esp+6Ch],eax
00403F16 push 76543210h
00403F1B lea eax,[esp+74h]
00403F1F push eax
00403F20 mov dword ptr [esp+78h],1234567h
00403F28 call _AtomicCompareExchangePointer@12 (401040h)

int64 x = ltoh(0x0011223344556677), y = btoh(x);
00403F2D mov ecx,44556677h
00403F32 bswap ecx

printf("mooblah = %d, mooblah = %d, squishyptr = 0x%X, fatptr = 0x%X, x = %i64X, y = %i64X", moosquish, mooblah, squishyptr, fatptr, x, y);
00403F34 push ecx
00403F35 mov esi,112233h
00403F3A bswap esi
00403F3C push esi
00403F3D push esi
00403F3E push ecx
00403F3F push eax
00403F40 mov ecx,dword ptr [esp+80h]
00403F47 mov edx,dword ptr [esp+7Ch]
00403F4B mov eax,dword ptr [esp+34h]
00403F4F push ecx
00403F50 push edx
00403F51 push eax
00403F52 push offset string "mooblah = %d, mooblah = %d, squi"... (40D4C0h)
00403F57 call printf (405697h)
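
For the curious, the wrappers that produce output like this might look something like the following on VS 2005 (a sketch with assumed names - the _Interlocked* intrinsics are real VC++ intrinsics, but the exact form of the LibQ wrappers is my guess):

#include <intrin.h>

#pragma intrinsic(_InterlockedExchange, _InterlockedExchangeAdd, _InterlockedCompareExchange)

inline int32 AtomicExchange(volatile int32 *pVal, int32 nNew)
{ return _InterlockedExchange((volatile long *)pVal, nNew); }

inline int32 AtomicExchangeAdd(volatile int32 *pVal, int32 nAdd)
{ return _InterlockedExchangeAdd((volatile long *)pVal, nAdd); }

inline int32 AtomicCompareExchange(volatile int32 *pVal, int32 nNew, int32 nComparand)
{ return _InterlockedCompareExchange((volatile long *)pVal, nNew, nComparand); }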

Thursday, June 09, 2005

Silly Compiler Tricks

So, while I was writing those atomic functions in assembly, I figured I might as well go ahead and write the endian conversion functions. The functions themselves are trivial to write, using rol for 16-bit flipping, and bswap for 32-bit and 64-bit flipping. But after I get done with that, Merlin asks me why I didn't just use the VC++ flipping functions (_byteswap_ushort, _byteswap_ulong, and _byteswap_uint64). Well, the truth was that I didn't know about those. So, I go back and add a special case for the endian functions, for when VC++ is used. So, after writing all that, I go to test the code (or #defines, in this case). Of course, I first build the test program (called ThingyTron, in this case). Stepping through the code leads me to the implementation of the _byteswap_uint64 function, which looks like this:

unsigned __int64 __cdecl _byteswap_uint64(unsigned __int64 i)
{
    unsigned __int64 j;
    j =  (i << 56);
    j += (i << 40) & 0x00FF000000000000;
    j += (i << 24) & 0x0000FF0000000000;
    j += (i << 8)  & 0x000000FF00000000;
    j += (i >> 8)  & 0x00000000FF000000;
    j += (i >> 24) & 0x0000000000FF0000;
    j += (i >> 40) & 0x000000000000FF00;
    j += (i >> 56);
    return j;
}

And if you think that's ugly, you should see the assembly the compiler generates from it. Fortunately, Merlin was right - in release build, these functions are translated into intrinsics (basically identical to the assembly I wrote, only inlined in the calling function). Because of this, not only are they as fast as the functions I wrote, but they don't have the overhead of my function calls, making them extremely fast. And so, everything worked out beautifully in the end. I do have to wonder, though, what on earth prompted them to use such a horrible flipping algorithm for the non-intrinsic version.
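
For comparison, here's roughly what the hand-written versions look like as VC++ x86-32 inline assembly (a sketch - the function names are mine, and the fall-off-the-end return-in-EAX idiom triggers VC++ warning C4035):

inline unsigned short FlipEndian16(unsigned short nValue)
{
    __asm
    {
        mov ax, nValue
        rol ax, 8 ; Swap the two bytes
    }
    // Result is returned in AX
}

inline unsigned long FlipEndian32(unsigned long nValue)
{
    __asm
    {
        mov eax, nValue
        bswap eax ; Reverse all four bytes
    }
    // Result is returned in EAX
}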

Silly MASM

So, I was just stepping through my new atomic functions code (the x86-32 version built with MASM) and noticed something kind of amusing. Take a look at the disassembly (after it's been assembled):
_AtomicExchange@8:
00420A30 8B 54 24 04 mov edx,dword ptr [esp+4]
00420A34 8B 44 24 08 mov eax,dword ptr [esp+8]
00420A38 F0 87 02 lock xchg eax,dword ptr [edx]
00420A3B C2 08 00 ret 8
00420A3E 8B FF mov edi,edi
_AtomicExchangeAdd@8:
00420A40 8B 54 24 04 mov edx,dword ptr [esp+4]
00420A44 8B 44 24 08 mov eax,dword ptr [esp+8]
00420A48 F0 0F C1 02 lock xadd dword ptr [edx],eax
00420A4C C2 08 00 ret 8
00420A4F 90 nop
_AtomicCompareExchange@12:
00420A50 8B 4C 24 04 mov ecx,dword ptr [esp+4]
00420A54 8B 44 24 0C mov eax,dword ptr [esp+0Ch]
00420A58 8B 54 24 08 mov edx,dword ptr [esp+8]
00420A5C F0 0F B1 11 lock cmpxchg dword ptr [ecx],edx
00420A60 C2 0C 00 ret 0Ch
00420A63 8D A4 24 00 00 00 00 lea esp,[esp]
00420A6A 8D 9B 00 00 00 00 lea ebx,[ebx]

See how it pads out the functions, to align new functions on paragraph (16 byte) boundaries? Well, that's exactly what I told it to do. But what's amusing is WHAT it pads the functions with. It seems to like to fill the gaps with do-nothing instructions that consume the fewest possible cycles. No idea what the point of this is, but I thought it was humorous.
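
For reference, the padding comes from something like this in the MASM source (a sketch, not the exact LibQ file):

.586
.model flat
.code

align 16 ; Pad to the next paragraph boundary
_AtomicExchange@8:
    mov edx, dword ptr [esp+4] ; Address of the variable
    mov eax, dword ptr [esp+8] ; New value
    lock xchg eax, dword ptr [edx]
    ret 8

end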

Backpedaling

Okay, enough procrastination; it's time to write this thing.

So, it's not been a good week, as far as coding has been concerned. Rewriting the atomic functions is turning out to be a major pain, as I'm not really sure what the optimal way to structure my code is. I want to support at least x86-32, x86-64, PPC32, and PPC64. I also want to support at least VC++ and GCC (which uses AT&T style assembly). So, how many copies of the functions do I have to write? Well, I'd rather not think about that :P Then there's the complication that VC++ does not support x86-64 inline assembly, meaning that at the very least I have to write those functions in an ASM file and compile it with ML64 (MASM 64). As well, as of last night, the x86-32 functions were in an ASM file to be compiled with ML (MASM).
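
To give an idea of how different the flavors are, here's what one of these functions might look like in GCC's AT&T-style inline assembly for x86-32 (a sketch; the actual LibQ version may differ):

static inline int32 AtomicExchangeAdd(volatile int32 *pVal, int32 nAdd)
{
    __asm__ __volatile__(
        "lock; xaddl %0, %1" // Atomically: temp = *pVal; *pVal += nAdd; nAdd = temp
        : "+r" (nAdd), "+m" (*pVal)
        :
        : "memory", "cc");
    return nAdd; // xadd leaves the previous value of *pVal in the register
}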

As if that wasn't enough trouble, Merlin just brought to my attention that VS 2005 contains compiler intrinsics for the atomic functions. Intrinsics, of course, mean that the compiler could insert the instructions directly into the calling functions, without the need to call another function such as mine. As this is a serious benefit to performance, and performance is one of my biggest goals with LibQ, I'll definitely want to use those instead of my own assembly functions, when compiled with VS 2005.

So, what else could go wrong? Well, plenty, it turns out. First of all, I noticed a design flaw in my POSIX event implementation. It wasn't immediately obvious, because the event works most of the time. The problem comes when one thread resets the event just after another thread sets it. Setting the event should release at least one waiting thread, even if the event is subsequently reset. But in the condition variable implementation, for a thread to be released from its wait, the event must remain signaled until that thread gets the CPU and checks the status of the event. If a second thread resets the event before this occurs (which is not unreasonable, given the latency associated with bringing threads out of wait), the waiting threads remain blocked indefinitely. So condition variables are not the way to go. I've written an alternate version of the event class which does not have this problem; due to the nature of the implementation, though, it lacks one feature: support for wait timeouts. So, I guess that means I'm gonna have to remove timed waits from both the Windows and POSIX implementations, to keep the feature set identical.

Lastly, some features were removed from the mutex class. I had originally said that, because the POSIX mutex lacked some features that we wanted - namely, the ability to spin before going into wait - we'd simply write our own implementation with the desired features. The problem with this is that I still want to support condition variables in LibQ, and POSIX condition variables require a mutex - a POSIX mutex. So, I don't have much choice but to use a POSIX mutex for the POSIX version of the mutex class, to allow it to interoperate with the condition variable later. I chose to leave the spinning feature in the Windows version, however, to benefit from it where possible, and simply ignore it in the POSIX version (leaving the interface identical in both cases).

Thursday, June 02, 2005

New and Improved

Well, I'm actually getting pretty excited about the x86-64 architecture, now that I'm learning about it. The reason I'm just now learning about it is that I've never really needed to know before. For quite some time x86-64 was only available on high-end server systems, which I would never have, and it was not expected that my programs (mostly related to games) would be used on such a system. But now x86-64 is available on the desktop, and Windows is ready to support it.

So, let me give a brief overview of how x86-64 differs from x86-32 (legacy x86), and why I'm excited about it.

x86-64 is a 64-bit incarnation of the x86 architecture, supporting the same instruction set and features (most of the features, anyway). In practical terms, that means it can do math on 64-bit numbers in 64-bit registers with single instructions. This means that code working with 64-bit (or larger) numbers will be roughly half as large, and take roughly half as many cycles to execute, since a 32-bit processor has to use multiple operations to emulate each 64-bit operation.
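
For example, adding two 64-bit integers (a sketch, in Intel syntax):

; x86-32: the values live in register pairs, and the add takes two
; dependent instructions (EDX:EAX += ECX:EBX)
add eax, ebx
adc edx, ecx ; Add the high dwords, plus the carry from the low dwords

; x86-64: one register each, one instruction (RAX += RBX)
add rax, rbx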

Associated with 64-bit registers is the ability to use 64-bit pointers. 64-bit pointers, of course, mean a vastly greater amount of memory may be addressed by the processor - 2^32 times as much. But even more importantly, you don't need more than 4 gigs of memory (the maximum that may be addressed by 32-bit pointers) to benefit from this. Most OSes these days use virtual address spaces, where the memory pointed to by a pointer doesn't need to be at that same address in physical memory. Some of the other things used for the address space besides physical memory (and empty space) are memory mapped ports and files. A larger address space, even with the same amount of physical memory, can allow larger regions of (or even more) files or device memory to be made available at once. Now, instead of continually mapping and unmapping small pieces of a 500 gig database, you can map the whole thing at once (theoretical example only; I'm not a database coder, and I have no idea if you would actually do this in a high-performance, commercial database system).

Yet despite this, the x86-64 architecture remains backward compatible. In addition to supporting the legacy CPU modes (real mode, protected mode, and virtual 8086 mode), which run 16-bit or 32-bit programs and OSes, it also supports two new modes: long compatibility mode and long 64-bit mode. Long 64-bit mode is exactly what you'd expect a 64-bit processor to do: run programs and OS natively in 64-bit mode, with 64-bit registers, pointers, stack, etc. But this requires that both the OS and the program be compiled for x86-64.

Long compatibility mode is what's really cool. It's a hybrid of the legacy modes and long mode, allowing a 64-bit OS, which takes full advantage of the 64-bit x86-64 features, to natively run old 16-bit and 32-bit programs. This means that x86-64 supports the best of both worlds: you can have a new, 64-bit OS, but still run old programs on it, natively.

Okay, so that's progress. But there are two other reasons I'm excited about x86-64 that don't have anything to do with the fact that it's 64-bit. In addition to supporting the 8 64-bit integer registers (RAX, RBX, RCX, RDX, RSI, RDI, RSP, RBP), the 8 64-bit floating point/MMX registers (MMX0-7/FPR0-7), and 8 128-bit XMM registers (XMM0-7), x86-64 supports 16 totally new registers. 8 of these are new integer registers, named R8-15 (kinda reminiscent of the PowerPC registers), and the other 8 are new XMM registers (XMM8-15). To me this is highly significant, as the limited number of registers (especially the integer registers) was one of the key limitations of the x86 architecture (so I believe). Having twice as many registers to work with means that code can do much more complicated math without needing to access memory to fetch/store variables.

Another significant limitation of the x86 architecture, I believe, was the inability to access memory relative to the current instruction. Branches and calls are generally relative to the current instruction, but memory accesses are always absolute (that is, they go to the same memory location regardless of the location of the current instruction).

If I had to make an educated guess as to why the x86 architecture did not initially support relative data access, I'd say it's due to the fact that originally x86 programs were segmented. You had a 64 KB code segment, a 64 KB data segment, a 64 KB stack segment, and a few other optional segments (don't ask what happens when you need more than 64 KB for your code or data, you don't want to know), which could be anywhere in memory, and were pointed to by segment registers. Branches and calls were made relative not only to the current instruction, but also to the base offset of the code segment. Memory accesses were made relative to the base offset of the data segment.

But things aren't done that way, anymore. Segments are rarely used in modern OSes, replaced mainly by virtual address spaces and memory mapped files. In Windows, an executable (or library) is mapped as a single piece (although there can be holes). All the addresses in that executable are fixed, relative to the base address (unlike with segments, where you could have each segment be anywhere in physical memory). Yet the base address of an executable is not fixed; it can be loaded almost anywhere in the virtual address space. So you now have lots of pointers that point to global variables at fixed addresses, when the variables themselves could be anywhere in memory. This requires the OS to correct all these pointers when it loads the executable into memory, which takes time. This problem is elegantly remedied by the use of pointers relative to the current instruction, as the relative offset would never change, no matter where the executable gets loaded.
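
In assembly terms, the difference looks like this (a sketch in NASM-style syntax; SomeGlobal is a hypothetical variable):

; x86-32: absolute address, which the loader must patch if the
; executable isn't loaded at its preferred base address
mov eax, [SomeGlobal]

; x86-64: the address is encoded as an offset from the current
; instruction, so it's valid no matter where the executable is loaded
mov eax, [rel SomeGlobal]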

Whoa!

EDMONTON - A team from the University of Alberta has proven for the first time that a single molecule can switch electrical currents off and on, a puzzle that scientists worldwide have been trying to crack for decades.

The finding could revolutionize the field of electronics, providing a leap ahead for everything from computers to batteries to medical equipment.

"We've made a tiny, tiny, maybe the ultimate tiniest transistor," said Robert Wolkow, a physics professor and the principal investigator who headed the research team from the National Institute for Nanotechnology at the U of A.

"This has been a key goal of researchers in this field for nearly 20 years and we have done it. ... Molecular electronics are indeed possible, no longer just a futuristic theory."

Edmonton Journal

Cool and Uncool

So I've been reading about the x86-64 architecture, and programming it. It's pretty cool. The 8 additional integer registers alone make it substantially better than x86-32. Also, the fastcallish (or perhaps you'd prefer 'RISCish') calling convention is nice.

The functions themselves look trivial to write. Not so cool is the fact that VC++ doesn't support inline assembly. So I guess I'm gonna have to learn to use the standalone assembler for x86-64. Things are going to get reaaally unfun (and unportable) if I have to decorate the function names manually.

On an unrelated note, I should make a transacted file class at some point - that is, a file that remembers all the modifications you make to it (and will recall them if you read from a modified area), but doesn't actually alter the file on disk until you close it. That would support the discarding of changes on close, in addition to saving.
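
Something like this, interface-wise (entirely hypothetical; nothing like it exists in LibQ yet):

class CTransactedFile
{
public:
    void Open(const char *szFileName);

    // Reads check the in-memory change list first, then fall back to the file on disk
    size_t Read(void *pBuffer, size_t nBytes);
    // Writes go into the in-memory change list, not to the disk
    size_t Write(const void *pBuffer, size_t nBytes);

    void Seek(int64 nPosition);

    void Commit(); // Write all recorded changes to the file
    void Rollback(); // Discard all recorded changes

    inline ~CTransactedFile() { } // Closing without Commit discards the changes
};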

Wednesday, June 01, 2005

Blinding Flash of the Obvious

So yeah, I had one of those. I was working on modifying the CMutex class to support delayed semaphore allocation (that would reduce resource usage for rarely used mutexes), and noticed a couple of glaring deficiencies in the atomic functions set. First, there's no AtomicCompareExchange function for pointers. This is important because a pointer can vary in size - either 32 bits or 64 bits. That brought to light a second epiphany: NONE of those atomic functions I wrote support 64-bit pointers. Blindingly obvious, but quite frankly, I've never had a 64-bit CPU to work on, and thus it's not exactly second nature for me to work with 64-bit assembly (particularly x64 assembly). Guess it's time to grab those AMD x86-64 manuals off my shelf and write some atomic functions...

And while I'm at it, I should probably make some defines for 8, 16, 32, and 64-bit data types, rather than using short, long, and long long...
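
Something along these lines, presumably (a sketch with assumed names, covering only VC++ and GCC):

#if defined(_MSC_VER)
typedef __int8 int8;
typedef __int16 int16;
typedef __int32 int32;
typedef __int64 int64;
typedef unsigned __int8 uint8;
typedef unsigned __int16 uint16;
typedef unsigned __int32 uint32;
typedef unsigned __int64 uint64;
#else // GCC and friends
typedef signed char int8;
typedef short int16;
typedef int int32;
typedef long long int64;
typedef unsigned char uint8;
typedef unsigned short uint16;
typedef unsigned int uint32;
typedef unsigned long long uint64;
#endif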