
Wednesday, July 27, 2005

Infatuation

So I just finished watching Elfen Lied. It's good! At least, I really liked it. I suppose I'm a sucker for happy endings, even if dozens of bad guys have to get dismembered in the process. If you can stomach the graphic violence (i.e. limbs getting ripped off, severed bones, etc.) you should definitely watch it. The anime itself is licensed in English, but the manga is not; so if you want something you can ethically (though the legality is questionable) download, you can read the manga version (note: I haven't read the manga, so I don't know how it compares to the anime). There's also an anime trailer available which isn't as graphic as the series itself.




Incidentally, my new desktop.

Tuesday, July 26, 2005

The Art of Breaking and Entering - Thread Hijacking Variation

Well, after thinking about it for a while, I've decided to cover one more method of code injection into a foreign process. The reason for this comes from the fact that the previous method, while almost perfect, has a trait that can be undesirable in some cases: it overwrites the code in the executable, and leaves this code overwritten until after the DLLs are initialized. While this is, indeed, the best way to accomplish the task of injecting code, it leaves a tell-tale sign that the process has been tampered with. It would be trivial to implement a simple hack-detector that checks the first few bytes of the executable's entry point from a load-time DLL.

The fourth method of code injection does not have this limitation, but has new limitations of its own. This method is actually very similar to the previous method; however, instead of waiting until the executable's main function is about to be called, this method executes code as soon as the process is created.

This works by virtue of the fact that when a thread is suspended, it is possible to read and write its register state (called its context) using the GetThreadContext and SetThreadContext functions. In this way we can alter the instruction pointer to point to our loader code, which will then jump back to the original instruction pointer. Doing this leaves no readily detectable signs that the process has been altered.
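Here is a minimal sketch of that redirection step, assuming 32-bit x86. The names RedirectInstructionPointer and Ctx32, and the two callbacks, are made up for illustration so the logic can be followed without any Windows headers. On Windows the callbacks would wrap GetThreadContext and SetThreadContext on a CONTEXT whose ContextFlags includes CONTEXT_CONTROL, and the Eip member would play the part of ip.

```cpp
#include <cstdint>

// Stand-in for the part of a thread context we care about (hypothetical type;
// on 32-bit Windows this role is played by the Eip member of CONTEXT).
struct Ctx32
{
    std::uint32_t ip;  // Instruction pointer of the suspended thread
};

// Points a suspended thread at loaderAddr and reports where it would have resumed.
// getCtx and setCtx are callbacks that return true on success.
// Returns false, without touching originalIp, if the context cannot be read.
template <class GetCtx, class SetCtx>
bool RedirectInstructionPointer(GetCtx getCtx, SetCtx setCtx, std::uint32_t loaderAddr, std::uint32_t &originalIp)
{
    Ctx32 ctx = {};
    if (!getCtx(ctx))
        return false;

    originalIp = ctx.ip;  // The loader must jump back here when it finishes
    ctx.ip = loaderAddr;  // Resume inside the injected loader instead

    return setCtx(ctx);
}
```

The saved originalIp has to reach the loader somehow (for example through its parameter block) before the thread is resumed, since that is the address the loader jumps back to.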

The problem is that I don't fully know the limits of this method. In Windows NT this method is safe, because process initialization (including DLL initialization) is done in a user-mode asynchronous procedure call (APC), which will be preemptively executed before the code at the instruction pointer of the thread's context (where code would be injected), regardless of whether the process is created suspended.

On Windows 9x, however, things aren't so clear-cut. When Windows 9x creates a new process (which always starts out running), the code that gets executed for the initial thread resides in Kernel32.dll. This code performs early process initialization, then makes a system call to suspend the process if the process was created suspended (this is where injected code would get executed). When the thread is resumed, the system call returns, a substantial amount of late initialization code is executed (DLLs are initialized here, among other things), and the executable's entry point is called.

The fact that the DLLs have not yet been initialized by the time injected code would be executed isn't really a problem, since you can force them to be initialized by calling LoadLibrary for the ones that are required (Kernel32 doesn't have a DllMain, so this is safe). The problem is that there's so much late initialization code that I don't know what it does, so I'm not really comfortable with doing anything complex at this point, when the code is intended to run on 9x as well as NT.

But that's the way it's done. This is actually the method used by LMPQAPI and MPQDraft. Both get around this late initialization problem in the same way: hooking an API function in the initialization procedure, then performing the full-scale patching when that function gets called by the executable (ensuring that all process initialization will have been performed by that point). I haven't bothered to write any example code, because the process is nearly identical to the previous method, and so would be easy to modify. I might write some code later, if I feel any less lazy.

Thursday, July 21, 2005

The Art of Breaking and Entering - Thread Hijacking

While the first two mechanisms of DLL injection I've shown have used well documented Windows API functions, the third and final method is quite a bit more exotic. This method consists literally of hijacking a (the, to be exact) thread that already exists in the target process and making it execute code we injected using methods discussed previously.

The trick, here, is the fact that new processes can be created suspended. When CreateProcess is called with CREATE_SUSPENDED, Windows begins the usual way: creating the process's address space, loading the module, preparing the kernel for the new process, and creating the initial thread. In reality, processes are nothing more than an environment for threads to run in; what's really suspended is the initial thread. When run, this initial thread does several things, most notably preparing the executable for execution (including loading all required DLLs), calling the executable's entry point function (main or WinMain), and then calling ExitThread with the return value of the entry point (if there are no other threads running in the process, ExitThread has the effect of destroying the process).

While this thread is suspended, we have access to the process, allowing us to do any number of evil things. There are a number of possible ways to go about hijacking the thread, but I'll only present the best one (the most robust and with the highest reliability): overwriting the entry point. Here, we overwrite the first few bytes of the entry point with a JMP instruction, to jump to our injected code, which will load your DLL, call a patching function, and then jump back to the application.
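To make the JMP concrete, here is a small sketch of how the five bytes written over the entry point can be computed. The function name EncodeJmpRel32 is made up for this example. The only fact it relies on is that the displacement of an x86 JMP rel32 (opcode 0xE9) is measured from the end of the instruction, that is, from the address of the JMP plus 5.

```cpp
#include <array>
#include <cstdint>

// Builds the 5 bytes of an x86 "JMP rel32" placed at address `from` that lands on `to`.
// The displacement is relative to the instruction following the JMP, i.e. from + 5.
std::array<std::uint8_t, 5> EncodeJmpRel32(std::uint32_t from, std::uint32_t to)
{
    // Unsigned wrap-around yields the two's complement form for backward jumps
    const std::uint32_t rel = to - (from + 5u);

    std::array<std::uint8_t, 5> bytes = {{
        0xE9,  // JMP rel32 opcode
        static_cast<std::uint8_t>(rel),        // Displacement, little-endian
        static_cast<std::uint8_t>(rel >> 8),
        static_cast<std::uint8_t>(rel >> 16),
        static_cast<std::uint8_t>(rel >> 24)
    }};
    return bytes;
}
```

For an entry point at 0x00401000 and a loader at 0x00500000, the displacement is 0x00500000 - 0x00401005 = 0x000FEFFB, so the bytes are E9 FB EF 0F 00.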

There are numerous advantages to this technique over the others. Unlike CreateRemoteThread, this method does not mandate Windows NT (I should note, in case you don't realize, that "NT" refers to the NT platform, which includes NT Workstation/Server, 2000, XP, and Server 2003). As well, it is the only method that not only allows synchronous operation, but also allows your code to be executed before the target executable begins running.

This sounds fairly simple, but it turns out to be a major hassle to get right (I seriously doubt I could have gotten the code for this post working on the first try had I not been doing this kind of thing for years). This is especially true when you intend to create a version which works on both Windows 9x and NT, which is a very nice feature.

The first complication of this method is rather severe: you must be sure that you get EVERYTHING you need in your injected loader code into the process, both code and data. Among other things, that implies that you must write your loader code in assembly, and you may not call imported API functions (because your loader code doesn't have an import table). If you wish to call any API functions (which you will, considering that you'll at least need LoadLibrary), you must pass the address of the functions to your loader from the parent process.

There are also numerous smaller complications. If you intend to support both 9x and NT, you must ensure that you can inject either via allocated memory (for NT) or a file mapping (for 9x). And in the case of 9x, you must ensure that the mapping does not get closed before the loader has finished executing (this is tricky because the mapping was created in the parent process, and if the parent process closes it, the mapping will disappear from the target process, as well).

I've been putting a LOT of effort into researching this method. As far as I've been able to tell, it has only one inherent limitation. As the loader code executes before main/WinMain, the executable will not have been initialized, and so you cannot call any functions in it. This may be worked around by hooking some function the executable imports, and then delaying your initialization until that function is called (this is what LMPQAPI does to create a server using MPQ editing functions in StarEdit.exe).
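The deferral trick can be sketched without touching a real import table. In the simulation below a plain function pointer stands in for one import table entry, and every name (ImportedFn, HookedImport, InstallDeferredInit, and so on) is hypothetical. This is not the LMPQAPI code, just the shape of the idea.

```cpp
#include <cstdint>

// Signature of the (hypothetical) imported function we intercept.
typedef int (*ImportedFn)(int);

static ImportedFn g_original = nullptr;  // Saved target of the hooked slot
static bool g_initialized = false;       // Set once the deferred patching has run
static int g_initCount = 0;              // How many times the deferred patching ran

// Stand-in for the real patching work that is unsafe to do from the loader.
static void FullInitialization()
{
    ++g_initCount;
}

// Replacement that runs the deferred work on the first call, then forwards to the original.
static int HookedImport(int arg)
{
    if (!g_initialized)
    {
        g_initialized = true;
        FullInitialization();
    }
    return g_original(arg);
}

// Swaps the function pointer held in `slot` (standing in for an import table entry) for the hook.
void InstallDeferredInit(ImportedFn *slot)
{
    g_original = *slot;
    *slot = HookedImport;
}

// Example import used for demonstration: doubles its argument.
int ExampleImport(int x)
{
    return x * 2;
}
```

After InstallDeferredInit(&slot), the first call through slot runs FullInitialization once and every call still returns what ExampleImport would. A real hook would also have to make the import table page writable and cope with several threads hitting the hook at once, neither of which this sketch attempts.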

Two more limitations are imposed by my implementation. First, the executable must load at its preferred address (not be relocated), as that's where the injector expects it to be. Second, because the patching process is architecture-specific, it is limited to what I wrote: a 32-bit process patching a 32-bit process. It is likely that these problems can both be fixed, but I'm too lazy to do it, at the moment.
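To show where the "preferred address" comes from, here is a portable sketch that pulls ImageBase and AddressOfEntryPoint out of the headers of a 32-bit (PE32) image held in a byte buffer. The function names are made up, and the code assumes a little-endian host. The field offsets are the standard ones: e_lfanew at 0x3C, a 4-byte signature and 20-byte file header, then AddressOfEntryPoint at offset 16 and ImageBase at offset 28 of the optional header.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Reads a value of type T at `offset`, or returns false if it would run past the end.
template <class T>
static bool ReadAt(const std::vector<std::uint8_t> &image, std::size_t offset, T &value)
{
    if (offset > image.size() || image.size() - offset < sizeof(T))
        return false;
    std::memcpy(&value, image.data() + offset, sizeof(T));  // Assumes a little-endian host, as on x86
    return true;
}

// Computes ImageBase + AddressOfEntryPoint from the headers of a PE32 (32-bit) image.
bool PreferredEntryPoint(const std::vector<std::uint8_t> &image, std::uint32_t &entryPoint)
{
    std::uint16_t mz = 0, optMagic = 0;
    std::uint32_t peOffset = 0, peSig = 0, entryRva = 0, imageBase = 0;

    if (!ReadAt(image, 0, mz) || mz != 0x5A4D)                   // "MZ"
        return false;
    if (!ReadAt(image, 0x3C, peOffset))                          // e_lfanew
        return false;
    if (!ReadAt(image, peOffset, peSig) || peSig != 0x00004550)  // "PE\0\0"
        return false;

    const std::size_t opt = static_cast<std::size_t>(peOffset) + 4 + 20;  // Skip signature and file header
    if (!ReadAt(image, opt, optMagic) || optMagic != 0x010B)     // PE32 only, not PE32+
        return false;
    if (!ReadAt(image, opt + 16, entryRva) || !ReadAt(image, opt + 28, imageBase))
        return false;

    entryPoint = imageBase + entryRva;
    return true;
}
```

With an ImageBase of 0x00400000 and an entry point RVA of 0x1000, the result is 0x00401000. If the image were relocated, the RVA would have to be added to the actual load address instead, which is presumably where a fix for the first limitation would start.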

// Amount of space to reserve for the loader function that gets injected
#define LOADER_MAX_SIZE 192
#define PATCHER_DATA_ALIGNMENT 16  // Alignment to use for the patcher data

// Rounds an offset up to the nearest PATCHER_DATA_ALIGNMENT boundary
#define ALIGN_PATCHER_DATA(x) (((UINT_PTR)(x) + PATCHER_DATA_ALIGNMENT - 1) & ~(PATCHER_DATA_ALIGNMENT - 1))

typedef LPVOID (WINAPI *VirtualAllocExPtr)
(
 HANDLE hProcess,
 LPVOID lpAddress,
 SIZE_T dwSize,
 DWORD flAllocationType,
 DWORD flProtect
);

typedef BOOL (WINAPI *VirtualFreeExPtr)
(
 HANDLE hProcess,
 LPVOID lpAddress,
 SIZE_T dwSize,
 DWORD dwFreeType
);

// The JMP rel32 instruction
#include <pshpack1.h>
struct JMP32
{
 BYTE byOpcode;  // 0xE9
 DWORD nRelOffset;  // Offset relative to the instruction AFTER this JMP

 inline JMP32()
 { byOpcode = 0xE9; }
};
#include <poppack.h>

// The parameters that will get injected into the target process
struct LOADERFUNCTIONPARAMS
{
 BOOL bCompleted;  // Whether the loader has finished
 DWORD nErrCode;  // GetLastError value when the loader succeeds/fails

 HANDLE hParamsSection;  // If the parameter block is in a file mapping, HANDLE of the mapping; NULL otherwise.

 FARPROC lpfnLoadLibraryA;  // Functions that the loader will call
 FARPROC lpfnMapViewOfFile;
 FARPROC lpfnGetLastError;
 FARPROC lpfnExitProcess;

 UINT_PTR nReturnAddress;  // The address that our loader function will return to

 JMP32 jmpOverwritten;  // The data we overwrite in the WinMain function with the JMP to the loader

 UINT_PTR nPatcherRVA;  // RVA of patcher entry point in DLL
 size_t nPatcherDataLen;  // Length of data to be passed to patcher

 char szDLLFilePath[MAX_PATH];  // Name of patcher DLL

 BYTE fnLoaderFunction[LOADER_MAX_SIZE];  // Loader function code

 BYTE byPatcherData[PATCHER_DATA_ALIGNMENT];  // Patcher data of variable length
};

// The loader function for x86-32. This function will return (on success) to the start function for the process' initial thread.
void __declspec(naked) __stdcall LoaderFunction86_32()
{
 __asm {
   ; Use CALL to generate the return address we need to overwrite with the entry point's address
   call Loader

Loader:
   push ebp
   mov ebp, esp
   pushad
   ; int 3  ; Uncomment this for debugging the loader function

   ; Compute the address of the LOADERFUNCTIONPARAMS block. It will be at the page boundary beneath this code
   mov ebx, [ebp+4]
   and ebx, 0xFFFFF000

   ; If the parameter block is in a file mapping, lock it, first
   mov edx, [ebx]LOADERFUNCTIONPARAMS.hParamsSection

   test edx, edx
   jz LoadDLL

   push 0
   push 0
   push 0
   push FILE_MAP_WRITE
   push edx
   call [ebx]LOADERFUNCTIONPARAMS.lpfnMapViewOfFile

   test eax, eax
   jz Failure

LoadDLL:  ; Call LoadLibraryA to load DLL.
   lea edx, [ebx]LOADERFUNCTIONPARAMS.szDLLFilePath
   push edx
   call [ebx]LOADERFUNCTIONPARAMS.lpfnLoadLibraryA

   test eax, eax
   jz Failure

LibraryLoaded:  ; Now call the patcher entry point, if there is one
   cmp [ebx]LOADERFUNCTIONPARAMS.nPatcherRVA, 0
   je RewriteEntryPoint

   lea ecx, [ebx]LOADERFUNCTIONPARAMS.byPatcherData
   add ecx, (PATCHER_DATA_ALIGNMENT - 1)  ; Align the data on a 16 byte boundary
   and ecx, ~(PATCHER_DATA_ALIGNMENT - 1)
   mov edx, [ebx]LOADERFUNCTIONPARAMS.nPatcherDataLen
   add eax, [ebx]LOADERFUNCTIONPARAMS.nPatcherRVA
   push edx
   push ecx
   call eax

   test eax, eax
   jz Failure

RewriteEntryPoint:  ; Put the original bytes from the entry point back
   mov edx, [ebx]LOADERFUNCTIONPARAMS.nReturnAddress
   lea esi, [ebx]LOADERFUNCTIONPARAMS.jmpOverwritten
   mov edi, edx
   mov ecx, size JMP32
   rep movsb
   mov [ebp+4], edx  ; Set the return address to the entry point

Done:  ; Patching completed successfully. Acknowledge success and return to the entry point.
   mov [ebx]LOADERFUNCTIONPARAMS.nErrCode, NO_ERROR
   mov [ebx]LOADERFUNCTIONPARAMS.bCompleted, TRUE

   popad
   mov esp, ebp
   pop ebp
   ret

Failure:  ; Save GetLastError value and call ExitProcess
   call [ebx]LOADERFUNCTIONPARAMS.lpfnGetLastError
   mov [ebx]LOADERFUNCTIONPARAMS.nErrCode, eax
   push 0
   ;mov [ebx]LOADERFUNCTIONPARAMS.bCompleted, TRUE
   call [ebx]LOADERFUNCTIONPARAMS.lpfnExitProcess
 };
}

// Get the entry point for a module from its file path
bool FindModuleEntryPoint(LPCSTR lpszFilePath, UINT_PTR &lpfnEntryPoint)
{
 assert(lpszFilePath);

 // Map the module as a data file (essentially as a memory mapped file)
 HMODULE hModule = LoadLibraryEx(lpszFilePath, NULL, LOAD_LIBRARY_AS_DATAFILE);
 if (!hModule)
   return false;

 bool bSuccess = false;

 // Wrap code in a try-except block, since we're going to be working with unverified pointers
 __try
 {
   // Find the DOS header. An HMODULE is a pointer to the module in memory, but LoadLibrary stores flags in the lower bits of the HMODULE.
   IMAGE_DOS_HEADER *lpDosHeader = (IMAGE_DOS_HEADER *)((UINT_PTR)hModule & ~(UINT_PTR)0xFFF);

   if (lpDosHeader->e_magic == IMAGE_DOS_SIGNATURE && lpDosHeader->e_lfanew)
   {
     // Locate the NT headers
     DWORD *lpNTSignature = (DWORD *)((UINT_PTR)lpDosHeader + lpDosHeader->e_lfanew);
     IMAGE_FILE_HEADER *lpNTHeader = (IMAGE_FILE_HEADER *)((UINT_PTR)lpNTSignature + sizeof(DWORD));
     IMAGE_OPTIONAL_HEADER32 *lpOptHeader = (IMAGE_OPTIONAL_HEADER32 *)((UINT_PTR)lpNTHeader + IMAGE_SIZEOF_FILE_HEADER);
     
     if (*lpNTSignature == IMAGE_NT_SIGNATURE)
     {
       lpfnEntryPoint = lpOptHeader->AddressOfEntryPoint + lpOptHeader->ImageBase;

       bSuccess = true;
     }
   }
 }
 __except (EXCEPTION_EXECUTE_HANDLER)
 { }

 FreeLibrary(hModule);

 return bSuccess;
}

// Finds the entry point of the target executable, saves the entry point data, and overwrites the entry point with the JMP instruction
bool HookModuleEntryPoint32(LPCSTR lpszFilePath, HANDLE hProcess, LOADERFUNCTIONPARAMS *lpParamsBlock, UINT_PTR &lpfnEntryPoint, JMP32 &jmpOverwritten)
{
 assert(lpParamsBlock);

 // Find the entry point for the module
 if (!FindModuleEntryPoint(lpszFilePath, lpfnEntryPoint))
   return false;

 // Protect against access violations
 __try
 {
   // Unprotect where we need to read/write
   DWORD nOldProtect;
   if (!VirtualProtectEx(hProcess, (void *)lpfnEntryPoint, sizeof(JMP32), PAGE_EXECUTE_READWRITE, &nOldProtect))
     return false;

   // Get the old entry point
   SIZE_T nBytesRead;

   if (!ReadProcessMemory(hProcess, (void *)lpfnEntryPoint, &jmpOverwritten, sizeof(JMP32), &nBytesRead) || nBytesRead != sizeof(JMP32))
     return false;

   // Write the JMP to the entry point
   SIZE_T nBytesWritten;
   JMP32 jmp;

   // Compute the relative offset of the loader function
   DWORD nLoaderAddress = (DWORD)&lpParamsBlock->fnLoaderFunction;

   jmp.nRelOffset = nLoaderAddress - (lpfnEntryPoint + sizeof(jmp));

   if (!WriteProcessMemory(hProcess, (void *)lpfnEntryPoint, &jmp, sizeof(jmp), &nBytesWritten) || nBytesWritten != sizeof(jmp))
     return false;

   return true;
 }
 __except (EXCEPTION_EXECUTE_HANDLER)
 { return false; }
}

// Wait until the loader function, for better or worse, has finished. Return value is the error code from the process
bool GetLoaderErrorCode(HANDLE hProcess, LOADERFUNCTIONPARAMS *lpParamsMemory, DWORD &nErrCode)
{
 // The plan is very simple: poll the parameter block every 10 ms to check for completion. Also watch the process HANDLE for termination.
 SIZE_T nBytesRead;

 while (WaitForSingleObject(hProcess, 10) != WAIT_OBJECT_0)
 {
   // Read the completion indicator flag
   BOOL bCompleted;

   if (!ReadProcessMemory(hProcess, &lpParamsMemory->bCompleted, &bCompleted, sizeof(bCompleted), &nBytesRead) || nBytesRead != sizeof(bCompleted))
     return false;

   if (bCompleted)
     break;
 }

 // Read the error code and return
 if (!ReadProcessMemory(hProcess, &lpParamsMemory->nErrCode, &nErrCode, sizeof(nErrCode), &nBytesRead) || nBytesRead != sizeof(nErrCode))
   return false;

 return true;
}

// May fail for two reasons: unable to allocate the memory, or this is a Windows 9x machine. If the latter, bIsNT will be false
bool InjectDLLAndResumeProcessNT(HANDLE hProcess, HANDLE hThread, LPCSTR lpszFilePath, LOADERFUNCTIONPARAMS &params, const void *lpPatcherData, size_t nPatcherDataLen, bool &bIsNT, DWORD &nErrCode)
{
 if (nPatcherDataLen)
   assert(lpPatcherData);

 // We don't know if we're on NT or 9x, and the version APIs can be easily fooled. Do it by trial and error: try to use VirtualAllocEx, and fall back to file mappings if VirtualAllocEx isn't available.
 bIsNT = false;

 HMODULE hKernel32 = GetModuleHandle("Kernel32");

 VirtualAllocExPtr lpfnVirtualAllocEx = (VirtualAllocExPtr)GetProcAddress(hKernel32, "VirtualAllocEx");
 VirtualFreeExPtr lpfnVirtualFreeEx = (VirtualFreeExPtr)GetProcAddress(hKernel32, "VirtualFreeEx");

 if (!lpfnVirtualAllocEx || !lpfnVirtualFreeEx)
   return false;

 // Windows 9x usually has stubs for VirtualAllocEx and VirtualFreeEx, so we still don't know if they're really there. Try to allocate the memory.
 LOADERFUNCTIONPARAMS *lpParamsMemory = (LOADERFUNCTIONPARAMS *)lpfnVirtualAllocEx(hProcess, 0, sizeof(LOADERFUNCTIONPARAMS) + nPatcherDataLen, MEM_COMMIT, PAGE_EXECUTE_READWRITE);

 // The moment of truth: NT or 9x?
 if (lpParamsMemory || GetLastError() != ERROR_CALL_NOT_IMPLEMENTED)
   bIsNT = true;

 if (!lpParamsMemory)
   return false;

 bool bSuccess = false;

 // This is Windows NT
 // Hook the entry point
 if (HookModuleEntryPoint32(lpszFilePath, hProcess, lpParamsMemory, params.nReturnAddress, params.jmpOverwritten))
 {
   // Compute the offset to write the patcher data at.
   BYTE *lpPatcherDataMemory = (BYTE *)ALIGN_PATCHER_DATA(lpParamsMemory->byPatcherData);

   // Write the parameters and patcher data
   SIZE_T nBytesWritten;

   if (WriteProcessMemory(hProcess, lpParamsMemory, &params, sizeof(params), &nBytesWritten) && nBytesWritten == sizeof(params))
   {
     if (!nPatcherDataLen || (WriteProcessMemory(hProcess, lpPatcherDataMemory, lpPatcherData, nPatcherDataLen, &nBytesWritten) && nBytesWritten == nPatcherDataLen))
     {
       // It's all set. Let it run until the loader function finishes.
       if (ResumeThread(hThread) != (DWORD)-1)
         bSuccess = GetLoaderErrorCode(hProcess, lpParamsMemory, nErrCode);
     }
   }
 }
 
 // Free the memory
 lpfnVirtualFreeEx(hProcess, lpParamsMemory, 0, MEM_RELEASE);

 return bSuccess;
}

bool InjectDLLAndResumeProcess9x(HANDLE hProcess, HANDLE hThread, LPCSTR lpszFilePath, LOADERFUNCTIONPARAMS &params, const void *lpPatcherData, size_t nPatcherDataLen, DWORD &nErrCode)
{
 if (nPatcherDataLen)
   assert(lpPatcherData);

 // We're on 9x. Use a file mapping.
 HANDLE hMapping = CreateFileMapping(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE, 0, sizeof(LOADERFUNCTIONPARAMS) + nPatcherDataLen, NULL);
 if (!hMapping)
   return false;

 bool bSuccess = false;

 // Map the file mapping so we can write to it
 LOADERFUNCTIONPARAMS *lpParamsMemory = (LOADERFUNCTIONPARAMS *)MapViewOfFile(hMapping, FILE_MAP_WRITE, 0, 0, 0);
 if (lpParamsMemory)
 {
   // Overwrite the entry point and get the old one
   if (HookModuleEntryPoint32(lpszFilePath, hProcess, lpParamsMemory, params.nReturnAddress, params.jmpOverwritten))
   {
     // Duplicate the file mapping HANDLE into the target process
     if (DuplicateHandle(GetCurrentProcess(), hMapping, hProcess, &params.hParamsSection, 0, FALSE, DUPLICATE_SAME_ACCESS))
     {
       BYTE *lpPatcherDataMemory = (BYTE *)ALIGN_PATCHER_DATA(lpParamsMemory->byPatcherData);

       // Copy the parameters and, if there is any, the patcher data
       memcpy(lpParamsMemory, &params, sizeof(params));
       if (nPatcherDataLen)
         memcpy(lpPatcherDataMemory, lpPatcherData, nPatcherDataLen);

       // Let the loader run
       if (ResumeThread(hThread) != (DWORD)-1)
         bSuccess = GetLoaderErrorCode(hProcess, lpParamsMemory, nErrCode);
     }
   }

   // Unmap the view
   UnmapViewOfFile(lpParamsMemory);
 }

 // Close the file mapping
 CloseHandle(hMapping);

 return bSuccess;
}

// Allocates the parameter struct in the foreign process and sets the members
bool InjectDLLAndResumeProcess(HANDLE hProcess, HANDLE hThread, LPCSTR lpszExecPath, LPCSTR lpszDLLFilePath, UINT_PTR nPatcherRVA, const void *lpPatcherData, size_t nPatcherDataLen, DWORD &nErrCode)
{
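 assert(hProcess);
 assert(lpszExecPath);
 assert(lpszDLLFilePath);

 // Reject over-long paths even in release builds, where assert compiles to nothing and the strcpy below would overflow
 if (strlen(lpszDLLFilePath) >= MAX_PATH)
   return false;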

 HMODULE hKernel32 = GetModuleHandle("Kernel32");

 // Construct a local copy of the param block and initialize it
 LOADERFUNCTIONPARAMS params;

 params.hParamsSection = NULL;

 params.bCompleted = FALSE;

 params.lpfnLoadLibraryA = GetProcAddress(hKernel32, "LoadLibraryA");
 params.lpfnMapViewOfFile = GetProcAddress(hKernel32, "MapViewOfFile");
 params.lpfnGetLastError = GetProcAddress(hKernel32, "GetLastError");
 params.lpfnExitProcess = GetProcAddress(hKernel32, "ExitProcess");

 params.nPatcherRVA = nPatcherRVA;
 params.nPatcherDataLen = nPatcherDataLen;

 strcpy(params.szDLLFilePath, lpszDLLFilePath);

#ifdef _DEBUG
 // In debug build in VC++, "LoaderFunction86_32" is actually a JMP stub. Find the real function.
 JMP32 *pJmpStub = (JMP32 *)LoaderFunction86_32;
 LPBYTE lpbyLoaderFunction = (LPBYTE)(pJmpStub->nRelOffset + (DWORD)LoaderFunction86_32 + sizeof(JMP32));

 memcpy(&params.fnLoaderFunction, lpbyLoaderFunction, LOADER_MAX_SIZE);
#else
 memcpy(&params.fnLoaderFunction, LoaderFunction86_32, LOADER_MAX_SIZE);
#endif

 // The patcher data will be written directly into the process, because it occupies extra data after the struct

 // Try to patch using the NT method first. If it's not NT, use the 9x method.
 bool bIsNT = false;

 if (InjectDLLAndResumeProcessNT(hProcess, hThread, lpszExecPath, params, lpPatcherData, nPatcherDataLen, bIsNT, nErrCode))
   return true;  // Successfully patched with the NT method
 else if (!bIsNT && InjectDLLAndResumeProcess9x(hProcess, hThread, lpszExecPath, params, lpPatcherData, nPatcherDataLen, nErrCode))
   return true;

 return false;  // Patching failed
}
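Two bits of address arithmetic in the listing above are easy to get wrong, so here they are in isolation as a portable sketch. AlignUp mirrors what ALIGN_PATCHER_DATA does, and PageBase mirrors the "and ebx, 0xFFFFF000" in the loader. Both function names are made up for this example, and the 4 KB page size is the x86 one assumed by that mask.

```cpp
#include <cstdint>

// Rounds addr up to the next multiple of alignment (alignment must be a power of two).
constexpr std::uintptr_t AlignUp(std::uintptr_t addr, std::uintptr_t alignment)
{
    return (addr + alignment - 1) & ~(alignment - 1);
}

// Rounds addr down to the start of its 4 KB page by clearing the low 12 bits.
constexpr std::uint32_t PageBase(std::uint32_t addr)
{
    return addr & 0xFFFFF000u;
}

static_assert(AlignUp(0x1001, 16) == 0x1010, "rounds up to the next boundary");
static_assert(AlignUp(0x1000, 16) == 0x1000, "aligned values are unchanged");
static_assert(PageBase(0x00A51234u) == 0x00A51000u, "clears the low 12 bits");
```

The page trick only finds the parameter block if the block starts on a page boundary and the loader code sits within the first page of it, which holds here because the members in front of fnLoaderFunction take up only a few hundred bytes.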

The Art of Breaking and Entering - Remote Threads - Updated

Next up on our list of DLL injection methods is "the Windows NT way". Like many of the other features available on Windows NT but not 9x, this method is easy, elegant, and versatile. Just as VirtualAlloc has a cross-process counterpart in VirtualAllocEx, CreateThread has one in CreateRemoteThread, which can create a thread in a foreign process.

CreateRemoteThread is almost identical to CreateThread, and it is not surprising that CreateRemoteThread requires the thread function it will execute to be in the process the thread gets created in. While you could inject some assembly to load the DLL using VirtualAllocEx, there is an easier way, in this case. It just so happens that, on 32-bit Windows, the prototype of the thread function CreateRemoteThread will execute is compatible with that of LoadLibrary: both take a single pointer-sized parameter and return a 32-bit value (either the ANSI or Unicode version will do, so long as you use the appropriate string). The new thread will thus call LoadLibrary, loading the DLL and executing DllMain, then set the return value of LoadLibrary (and indirectly that of DllMain) as the thread exit code, which your program can retrieve with GetExitCodeThread.

Of course, this only loads the DLL and executes DllMain, which, as previously mentioned, does not permit a great deal of activity. This is where creating an initialization thread from DllMain comes in handy, as mentioned in the last post. However, there's one more thing to be mentioned. From DllMain you cannot tell whether you're in the patcher process or the target process; at least not by any methods inherent to the process. One simple solution to this is to check if there's a memory mapped file corresponding to the current process. If such a file mapping exists, then the process is a target process, and initialization should be performed; otherwise, you're in the patcher process. The use of a file mapping is particularly convenient, because you can pass data to and from the target process in the very same file mapping.
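One way to make a file mapping "correspond to" a process is to put the process ID in its name. The sketch below shows only that naming step. MakeProcessSectionName and the underscore-plus-ID scheme are assumptions for illustration, and the section helpers used in the code for this post may well build their names differently.

```cpp
#include <cstdio>
#include <string>

// Builds a name that is unique to one target process by appending its process ID.
// (Hypothetical naming scheme; the helpers used elsewhere in this post may differ.)
std::string MakeProcessSectionName(const char *prefix, unsigned long processId)
{
    char suffix[32];
    std::snprintf(suffix, sizeof(suffix), "_%lu", processId);
    return std::string(prefix) + suffix;
}
```

The patcher would create a named mapping using the target's process ID, and DllMain would try to open the name built from GetCurrentProcessId. In the patcher's own process the open simply fails, which is the signal that no initialization is wanted there.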

Oh, and one last thing to mention: getting the address of LoadLibrary. This may be accomplished simply by using GetModuleHandle and GetProcAddress. While it's true that almost all the time you can't be sure that a DLL will be loaded in exactly the same place in two different processes, Kernel32.dll and NTDLL.dll are the exceptions to this rule. Windows has some built-in checks to ensure that Kernel32 and NTDLL will always get loaded at their preferred address, guaranteeing that their base addresses will be the same for all processes.

DWORD APIENTRY InitializationFunction(void *lpParam)
{
 MessageBox(NULL, "Hello from the inside!", "InitializationFunction", MB_OK | MB_ICONEXCLAMATION);

 return 0;
}

BOOL APIENTRY DllMain(HINSTANCE hModule, DWORD ul_reason_for_call, LPVOID lpReserved)
{
 if (ul_reason_for_call == DLL_PROCESS_ATTACH)
 {
   g_hDLL = (HINSTANCE)hModule;  // Save the HINSTANCE

   // If we're a target process, execute the initialization thread
   HANDLE hMapping = OpenProcessSection("InjectIntoProcessNT");
   if (hMapping)
   {
     // Close the indicator mapping. If there was any actual data in the mapping, we would need to pass the file mapping HANDLE to the initialization function instead of closing it.
     CloseHandle(hMapping);

     // Create the initialization thread
     HANDLE hThread = CreateThread(NULL, 0, InitializationFunction, 0, 0, NULL);
     if (!hThread)
       return FALSE;

     // Close the thread (it'll keep on running)
     CloseHandle(hThread);
   }
 }

 return TRUE;
}

__declspec(dllexport) bool __stdcall InjectIntoProcessNT(DWORD nProcessID, DWORD nTimeoutMS)
{
 // Get this DLL's path
 char szDLLPath[MAX_PATH + 1] = { 0 };  // Zeroed so the path is always null-terminated, even if GetModuleFileName truncates it
 GetModuleFileName((HMODULE)g_hDLL, szDLLPath, MAX_PATH);

 // Get the address of LoadLibrary(A)
 HMODULE hKernel32 = GetModuleHandle("Kernel32");
 FARPROC lpfnLoadLibraryA = GetProcAddress(hKernel32, "LoadLibraryA");

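 // Open a HANDLE to the process. We'll need access to create the loader thread, as well as allocate memory for and write the DLL path. CreateRemoteThread is also documented as needing PROCESS_QUERY_INFORMATION and PROCESS_VM_READ, and may fail without them on some platforms.
 HANDLE hProcess = OpenProcess(PROCESS_CREATE_THREAD | PROCESS_QUERY_INFORMATION | PROCESS_VM_OPERATION | PROCESS_VM_WRITE | PROCESS_VM_READ, FALSE, nProcessID);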
 if (!hProcess)
   return false;

 bool bSuccess = false, bTimedOut = false;  // You know the drill

 // Create the "you are a target process" file mapping
 HANDLE hMapping = CreateProcessSection(1, "InjectIntoProcessNT", nProcessID);
 if (hMapping)
 {
   // Allocate memory for the DLL path
   void *lpDLLPathMemory = VirtualAllocEx(hProcess, NULL, MAX_PATH + 1, MEM_COMMIT, PAGE_READWRITE);
   if (lpDLLPathMemory)
   {
     // Write the path. If this fails there is nothing for LoadLibrary to load, so don't create the thread.
     SIZE_T nBytesWritten;
     if (WriteProcessMemory(hProcess, lpDLLPathMemory, szDLLPath, MAX_PATH + 1, &nBytesWritten) && nBytesWritten == MAX_PATH + 1)
     {
       // Create the loader thread
       DWORD nThreadID;
       HANDLE hThread = CreateRemoteThread(hProcess, NULL, 0, (LPTHREAD_START_ROUTINE)lpfnLoadLibraryA, lpDLLPathMemory, 0, &nThreadID);
       if (hThread)
       {
         // Wait for the loader thread to terminate (it will terminate when DllMain returns). If it's necessary to verify that the initialization thread has successfully completed, an alternate waiting method, such as the one we used for windows hooks, will be necessary, here.
         bTimedOut = (WaitForSingleObject(hThread, nTimeoutMS) != WAIT_OBJECT_0);
         if (!bTimedOut)
         {
           // Get the thread's exit code, which is the return value of LoadLibrary: nonzero if the DLL loaded and DllMain succeeded
           DWORD nExitCode = 0;
           if (GetExitCodeThread(hThread, &nExitCode) && nExitCode != 0)
             bSuccess = true;  // DllMain executed successfully
         }

         // Close the thread. It will continue to run if it hasn't terminated.
         CloseHandle(hThread);
       }
     }

     // Free the memory for the DLL path
     if (!bTimedOut)
       VirtualFreeEx(hProcess, lpDLLPathMemory, 0, MEM_RELEASE);
   }

   // Close the file mapping
   if (!bTimedOut)
     CloseHandle(hMapping);
 }

 // Close the target process
 CloseHandle(hProcess);

 return bSuccess;
}


Update:
Note that despite the theoretical possibility of creating a thread in a process during startup (while the main thread is suspended with CREATE_SUSPENDED) so that the patcher can execute synchronously, this does not work in practice, because CSRSS (the Win32 subsystem process) freaks out if the first thread to execute isn't the first thread to be created, and kills the process.