Thursday, July 21, 2005

The Art of Breaking and Entering - Thread Hijacking

While the first two mechanisms of DLL injection I've shown used well-documented Windows API functions, the third and final method is quite a bit more exotic. This method consists literally of hijacking a (the, to be exact) thread that already exists in the target process and making it execute code we injected using the methods discussed previously.

The trick here is that new processes can be created suspended. When CreateProcess is called with CREATE_SUSPENDED, Windows begins in the usual way: creating the process' address space, loading the module, preparing the kernel for the new process, and creating the initial thread. In reality, processes are nothing more than an environment for threads to run in; what's really suspended is the initial thread. When run, this initial thread does several things, most notably preparing the executable for execution (including loading all required DLLs), calling the executable's entry point function (typically the C runtime startup routine, which in turn calls main or WinMain), and then calling ExitThread with the return value of the entry point (if there are no other threads running in the process, ExitThread has the effect of destroying the process).

While this thread is suspended, we have access to the process, allowing us to do any number of evil things. There are a number of possible ways to go about hijacking the thread, but I'll only present the best one (the most robust and with the highest reliability): overwriting the entry point. Here, we overwrite the first few bytes of the entry point with a JMP instruction, to jump to our injected code, which will load your DLL, call a patching function, and then jump back to the application.
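To make the entry point overwrite concrete, here is a minimal, portable sketch of how the five bytes of a JMP rel32 are computed. The helper names (EncodeJmp32, DecodeJmp32Target) are made up for illustration and the addresses are plain 32-bit integers; nothing here touches a real process. The point is only that the displacement is measured from the end of the JMP, not from its start.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: builds the 5 bytes of an x86 "JMP rel32" that sits at
// address `from` and transfers control to address `to`.
std::array<std::uint8_t, 5> EncodeJmp32(std::uint32_t from, std::uint32_t to)
{
    std::array<std::uint8_t, 5> bytes{};
    bytes[0] = 0xE9;  // JMP rel32 opcode

    // The displacement is relative to the instruction AFTER the JMP (from + 5).
    // Unsigned arithmetic wraps modulo 2^32, which also covers backward jumps.
    const std::uint32_t rel = to - (from + 5);

    // x86 stores the displacement in little-endian byte order.
    for (int i = 0; i < 4; ++i)
        bytes[1 + i] = static_cast<std::uint8_t>((rel >> (8 * i)) & 0xFF);

    return bytes;
}

// Hypothetical helper: recovers the jump target from an encoded JMP rel32.
std::uint32_t DecodeJmp32Target(std::uint32_t from, const std::array<std::uint8_t, 5> &bytes)
{
    std::uint32_t rel = 0;
    for (int i = 0; i < 4; ++i)
        rel |= static_cast<std::uint32_t>(bytes[1 + i]) << (8 * i);

    return from + 5 + rel;
}
```

For example, a JMP placed at 0x00401000 that targets 0x00A50134 needs a displacement of 0x0064F12F, so the bytes written over the entry point would be E9 2F F1 64 00.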

There are numerous advantages to this technique over the others. Unlike CreateRemoteThread, this method does not mandate Windows NT (I should note, in case you don't realize, that "NT" refers to the NT platform, which includes NT Workstation/Server, 2000, XP, and Server 2003). As well, it is the only method that not only allows synchronous operation, but also allows your code to be executed before the target executable begins running.

This sounds fairly simple, but it turns out to be a major hassle to get right (I seriously doubt I could have gotten the code for this post working on the first try had I not been doing this kind of thing for years). This is especially true when you intend to create a version which works on both Windows 9x and NT, which is a very nice feature.

The first complication of this method is rather severe: you must be sure that you get EVERYTHING you need in your injected loader code into the process, both code and data. Among other things, that implies that you must write your loader code in assembly, and you may not call imported API functions (because your loader code doesn't have an import table). If you wish to call any API functions (which you will, considering that you'll at least need LoadLibrary), you must pass the address of the functions to your loader from the parent process.
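The idea of shipping code and data together can be sketched in portable C++. This is only an illustration under stated assumptions: DemoParams, ParamsFromCodeAddress and RunDemoLoader are hypothetical names, the page size is assumed to be 4096 bytes, and the alignas on the struct stands in for the page-aligned memory an allocation in the target process would give you. The loader knows nothing except its own address; it rounds that down to the page boundary to find the block, then calls only through pointers stored in the block.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::uintptr_t kPageSize = 4096;  // assumed page size

// Hypothetical, simplified stand-in for an injected block: data first, then
// the space where the loader's machine code would be copied.
struct alignas(4096) DemoParams
{
    int (*pfnLoad)(const char *path);  // stands in for an API address supplied by the parent
    char szPath[64];                   // stands in for the DLL path
    unsigned char code[32];            // where the loader code would live
};

// Given any address inside the code area, recover the start of the block by
// rounding down to the page boundary.
inline DemoParams *ParamsFromCodeAddress(const void *addrInsideCode)
{
    const std::uintptr_t p = reinterpret_cast<std::uintptr_t>(addrInsideCode);
    return reinterpret_cast<DemoParams *>(p & ~(kPageSize - 1));
}

// What a loader does, expressed in C++: locate the block, then call only
// through the function pointers it carries (no import table involved).
inline int RunDemoLoader(const void *addrInsideCode)
{
    DemoParams *params = ParamsFromCodeAddress(addrInsideCode);
    return params->pfnLoad(params->szPath);
}
```

The design constraint this illustrates is that the block must start on a page boundary and the code must fit inside that first page, otherwise the mask no longer lands on the start of the block.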

There are also numerous smaller complications. If you intend to support both 9x and NT, you must ensure that you can inject either via allocated memory (for NT) or a file mapping (for 9x). And in the case of 9x, you must ensure that the mapping does not get closed before the loader has finished executing (this is tricky because the mapping was created in the parent process, and if the parent process closes it, the mapping will disappear from the target process, as well).

I've been putting a LOT of effort into researching this method. As far as I've been able to tell, it has only one inherent limitation. As the loader code executes before main/WinMain, the executable will not have been initialized, and so you cannot call any functions in it. This may be worked around by hooking some function the executable imports, and then delaying your initialization until that function is called (this is what LMPQAPI does to create a server using MPQ editing functions in StarEdit.exe).
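The workaround of delaying initialization can be sketched without any Windows specifics. The code below is an assumption-laden model, not how LMPQAPI actually does it: a plain function pointer variable stands in for the executable's import slot, and all names (ImportFn, InstallHook, HookedImport, g_initCount) are hypothetical. The hook runs the deferred initialization once, on the first call, and then forwards to the original function.

```cpp
// Hypothetical signature of the imported function being hooked.
using ImportFn = int (*)(int);

static ImportFn g_original = nullptr;  // the function the slot pointed to before hooking
static bool g_initialized = false;     // has the deferred initialization run yet?
static int g_initCount = 0;            // counts initializations, for demonstration only

// Replacement installed in the import slot. By the time the executable calls
// it, the executable's own startup has run, so initialization is safe here.
int HookedImport(int arg)
{
    if (!g_initialized)
    {
        g_initialized = true;
        ++g_initCount;  // the real deferred initialization would go here
    }

    return g_original(arg);  // forward to the original function
}

// Swap the slot's contents for the hook, remembering the original.
void InstallHook(ImportFn *slot)
{
    g_original = *slot;
    *slot = &HookedImport;
}
```

Installing the hook from the loader costs nothing until the executable first calls through the slot; only then does the initialization run, and later calls just forward.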

Two more limitations are imposed by my implementation. First, the executable must load at its preferred address (not be relocated), as that's where the injector expects it to be. Second, because the patching process is architecture-specific, it is limited to what I wrote: a 32-bit process patching a 32-bit process. It is likely that these problems can both be fixed, but I'm too lazy to do it, at the moment.
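Both limitations come from how the entry point address is derived: the preferred image base plus the entry point RVA, read from the 32-bit optional header. As a rough, portable sketch (not the code from this post), the same calculation over a raw byte buffer looks like this. ReadLE and PreferredEntryPoint32 are hypothetical names, and the offsets are those of the PE32 layout. Unlike the listing that follows, this sketch also rejects anything that is not PE32, which is where the 32-bit-only restriction becomes visible.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Reads an n-byte (n <= 4) little-endian value; returns false if out of range.
static bool ReadLE(const std::vector<std::uint8_t> &img, std::size_t off, std::size_t n, std::uint32_t &out)
{
    if (n > 4 || off > img.size() || img.size() - off < n)
        return false;

    out = 0;
    for (std::size_t i = 0; i < n; ++i)
        out |= static_cast<std::uint32_t>(img[off + i]) << (8 * i);

    return true;
}

// Computes ImageBase + AddressOfEntryPoint for a PE32 image held in memory.
// This is the entry point only if the image loads at its preferred base.
bool PreferredEntryPoint32(const std::vector<std::uint8_t> &img, std::uint32_t &entry)
{
    std::uint32_t dosMagic = 0, lfanew = 0, sig = 0, optMagic = 0, rva = 0, base = 0;

    if (!ReadLE(img, 0, 2, dosMagic) || dosMagic != 0x5A4D)  // "MZ"
        return false;
    if (!ReadLE(img, 0x3C, 4, lfanew) || lfanew == 0)        // e_lfanew
        return false;
    if (!ReadLE(img, lfanew, 4, sig) || sig != 0x00004550)   // "PE\0\0"
        return false;

    // The optional header follows the 4-byte signature and 20-byte file header.
    const std::size_t opt = static_cast<std::size_t>(lfanew) + 4 + 20;

    if (!ReadLE(img, opt, 2, optMagic) || optMagic != 0x10B)  // PE32 only, not PE32+
        return false;
    if (!ReadLE(img, opt + 16, 4, rva) || !ReadLE(img, opt + 28, 4, base))
        return false;

    entry = base + rva;
    return true;
}
```

If the image is relocated, the base actually used differs from the ImageBase field, and an address computed this way no longer points at the entry point.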

// Amount of space to reserve for the loader function that gets injected
#define LOADER_MAX_SIZE 192
#define PATCHER_DATA_ALIGNMENT 16  // Alignment to use for the patcher data

// Rounds an offset up to the nearest PATCHER_DATA_ALIGNMENT boundary
#define ALIGN_PATCHER_DATA(x) (((UINT_PTR)(x) + PATCHER_DATA_ALIGNMENT - 1) & ~(UINT_PTR)(PATCHER_DATA_ALIGNMENT - 1))

typedef LPVOID (WINAPI *VirtualAllocExPtr)
(
 HANDLE hProcess,
 LPVOID lpAddress,
 SIZE_T dwSize,
 DWORD flAllocationType,
 DWORD flProtect
);

typedef BOOL (WINAPI *VirtualFreeExPtr)
(
 HANDLE hProcess,
 LPVOID lpAddress,
 SIZE_T dwSize,
 DWORD dwFreeType
);

// The JMP rel32 instruction
#include <pshpack1.h>
struct JMP32
{
 BYTE byOpcode;  // 0xE9
 DWORD nRelOffset;  // Offset relative to the instruction AFTER this JMP

 inline JMP32()
 { byOpcode = 0xE9; }
};
#include <poppack.h>

// The parameters that will get injected into the target process
struct LOADERFUNCTIONPARAMS
{
 BOOL bCompleted;  // Whether the loader has finished
 DWORD nErrCode;  // GetLastError value when the loader succeeds/fails

 HANDLE hParamsSection;  // If the parameter block is in a file mapping, HANDLE of the mapping; NULL otherwise.

 FARPROC lpfnLoadLibraryA;  // Functions that the loader will call
 FARPROC lpfnMapViewOfFile;
 FARPROC lpfnGetLastError;
 FARPROC lpfnExitProcess;

 UINT_PTR nReturnAddress;  // The address that our loader function will return to

 JMP32 jmpOverwritten;  // The original bytes at the entry point that we overwrite with the JMP to the loader

 UINT_PTR nPatcherRVA;  // RVA of patcher entry point in DLL
 size_t nPatcherDataLen;  // Length of data to be passed to patcher

 char szDLLFilePath[MAX_PATH];  // Name of patcher DLL

 BYTE fnLoaderFunction[LOADER_MAX_SIZE];  // Loader function code

 BYTE byPatcherData[PATCHER_DATA_ALIGNMENT];  // Patcher data of variable length
};

// The loader function for x86-32. This function will return (on success) to the start function for the process' initial thread.
void __declspec(naked) __stdcall LoaderFunction86_32()
{
 __asm {
   ; Use CALL to generate the return address we need to overwrite with the entry point's address
   call Loader

Loader:
   push ebp
   mov ebp, esp
   pushad
   ; int 3  ; Uncomment this for debugging the loader function

   ; Compute the address of the LOADERFUNCTIONPARAMS block. It will be at the page boundary beneath this code
   mov ebx, [ebp+4]
   and ebx, 0xFFFFF000

   ; If the parameter block is in a file mapping, lock it, first
   mov edx, [ebx]LOADERFUNCTIONPARAMS.hParamsSection

   test edx, edx
   jz LoadDLL

   push 0
   push 0
   push 0
   push FILE_MAP_WRITE
   push edx
   call [ebx]LOADERFUNCTIONPARAMS.lpfnMapViewOfFile

   test eax, eax
   jz Failure

LoadDLL:  ; Call LoadLibraryA to load DLL.
   lea edx, [ebx]LOADERFUNCTIONPARAMS.szDLLFilePath
   push edx
   call [ebx]LOADERFUNCTIONPARAMS.lpfnLoadLibraryA

   test eax, eax
   jz Failure

LibraryLoaded:  ; Now call the patcher entry point, if there is one
   cmp [ebx]LOADERFUNCTIONPARAMS.nPatcherRVA, 0
   je RewriteEntryPoint

   lea ecx, [ebx]LOADERFUNCTIONPARAMS.byPatcherData
   add ecx, (PATCHER_DATA_ALIGNMENT - 1)  ; Align the data on a 16 byte boundary
   and ecx, ~(PATCHER_DATA_ALIGNMENT - 1)
   mov edx, [ebx]LOADERFUNCTIONPARAMS.nPatcherDataLen
   add eax, [ebx]LOADERFUNCTIONPARAMS.nPatcherRVA
   push edx
   push ecx
   call eax

   test eax, eax
   jz Failure

RewriteEntryPoint:  ; Put the original bytes from the entry point back
   mov edx, [ebx]LOADERFUNCTIONPARAMS.nReturnAddress
   lea esi, [ebx]LOADERFUNCTIONPARAMS.jmpOverwritten
   mov edi, edx
   mov ecx, size JMP32
   rep movsb
   mov [ebp+4], edx  ; Set the return address to the entry point

Done:  ; Patching completed successfully. Acknowledge success and return to the entry point.
   mov [ebx]LOADERFUNCTIONPARAMS.nErrCode, NO_ERROR
   mov [ebx]LOADERFUNCTIONPARAMS.bCompleted, TRUE

   popad
   mov esp, ebp
   pop ebp
   ret

Failure:  ; Save GetLastError value and call ExitProcess
   call [ebx]LOADERFUNCTIONPARAMS.lpfnGetLastError
   mov [ebx]LOADERFUNCTIONPARAMS.nErrCode, eax
   push 0
   ;mov [ebx]LOADERFUNCTIONPARAMS.bCompleted, TRUE
   call [ebx]LOADERFUNCTIONPARAMS.lpfnExitProcess
 };
}

// Get the entry point for a module from its file path
bool FindModuleEntryPoint(LPCSTR lpszFilePath, UINT_PTR &lpfnEntryPoint)
{
 assert(lpszFilePath);

 // Map the module as a data file (essentially as a memory mapped file)
 HMODULE hModule = LoadLibraryEx(lpszFilePath, NULL, LOAD_LIBRARY_AS_DATAFILE);
 if (!hModule)
   return false;

 bool bSuccess = false;

 // Wrap code in a try-except block, since we're going to be working with unverified pointers
 __try
 {
   // Find the DOS header. An HMODULE is a pointer to the module in memory, but LoadLibrary stores flags in the lower bits of the HMODULE.
   IMAGE_DOS_HEADER *lpDosHeader = (IMAGE_DOS_HEADER *)((UINT_PTR)hModule & ~(UINT_PTR)0xFFF);

   if (lpDosHeader->e_magic == IMAGE_DOS_SIGNATURE && lpDosHeader->e_lfanew)
   {
     // Locate the NT headers
     DWORD *lpNTSignature = (DWORD *)((UINT_PTR)lpDosHeader + lpDosHeader->e_lfanew);
     IMAGE_FILE_HEADER *lpNTHeader = (IMAGE_FILE_HEADER *)((UINT_PTR)lpNTSignature + sizeof(DWORD));
     IMAGE_OPTIONAL_HEADER32 *lpOptHeader = (IMAGE_OPTIONAL_HEADER32 *)((UINT_PTR)lpNTHeader + IMAGE_SIZEOF_FILE_HEADER);
     
     if (*lpNTSignature == IMAGE_NT_SIGNATURE)
     {
       lpfnEntryPoint = lpOptHeader->AddressOfEntryPoint + lpOptHeader->ImageBase;

       bSuccess = true;
     }
   }
 }
 __except (EXCEPTION_EXECUTE_HANDLER)
 { }

 FreeLibrary(hModule);

 return bSuccess;
}

// Finds the entry point of the target executable, saves the entry point data, and overwrites the entry point with the JMP instruction
bool HookModuleEntryPoint32(LPCSTR lpszFilePath, HANDLE hProcess, LOADERFUNCTIONPARAMS *lpParamsBlock, UINT_PTR &lpfnEntryPoint, JMP32 &jmpOverwritten)
{
 assert(lpParamsBlock);

 // Find the entry point for the module
 if (!FindModuleEntryPoint(lpszFilePath, lpfnEntryPoint))
   return false;

 // Protect against access violations
 __try
 {
   // Unprotect where we need to read/write
   DWORD nOldProtect;
   if (!VirtualProtectEx(hProcess, (void *)lpfnEntryPoint, sizeof(JMP32), PAGE_EXECUTE_READWRITE, &nOldProtect))
     return false;

   // Get the old entry point
   SIZE_T nBytesRead;

   if (!ReadProcessMemory(hProcess, (void *)lpfnEntryPoint, &jmpOverwritten, sizeof(JMP32), &nBytesRead) || nBytesRead != sizeof(JMP32))
     return false;

   // Write the JMP to the entry point
   SIZE_T nBytesWritten;
   JMP32 jmp;

   // Compute the relative offset of the loader function
   DWORD nLoaderAddress = (DWORD)&lpParamsBlock->fnLoaderFunction;

   jmp.nRelOffset = nLoaderAddress - (lpfnEntryPoint + sizeof(jmp));

   if (!WriteProcessMemory(hProcess, (void *)lpfnEntryPoint, &jmp, sizeof(jmp), &nBytesWritten) || nBytesWritten != sizeof(jmp))
     return false;

   return true;
 }
 __except (EXCEPTION_EXECUTE_HANDLER)
 { return false; }
}

// Wait until the loader function, for better or worse, has finished. Return value is the error code from the process
bool GetLoaderErrorCode(HANDLE hProcess, LOADERFUNCTIONPARAMS *lpParamsMemory, DWORD &nErrCode)
{
 // The plan is very simple: poll the parameter block every 10 ms to check for completion. Also watch the process HANDLE for termination.
 SIZE_T nBytesRead;

 while (WaitForSingleObject(hProcess, 10) != WAIT_OBJECT_0)
 {
   // Read the completion indicator flag
   BOOL bCompleted;

   if (!ReadProcessMemory(hProcess, &lpParamsMemory->bCompleted, &bCompleted, sizeof(bCompleted), &nBytesRead) || nBytesRead != sizeof(bCompleted))
     return false;

   if (bCompleted)
     break;
 }

 // Read the error code and return
 if (!ReadProcessMemory(hProcess, &lpParamsMemory->nErrCode, &nErrCode, sizeof(nErrCode), &nBytesRead) || nBytesRead != sizeof(nErrCode))
   return false;

 return true;
}

// May fail for two reasons: unable to allocate the memory, or this is a Windows 9x machine. If the latter, bIsNT will be false
bool InjectDLLAndResumeProcessNT(HANDLE hProcess, HANDLE hThread, LPCSTR lpszFilePath, LOADERFUNCTIONPARAMS &params, const void *lpPatcherData, size_t nPatcherDataLen, bool &bIsNT, DWORD &nErrCode)
{
 if (nPatcherDataLen)
   assert(lpPatcherData);

 // We don't know if we're on NT or 9x, and the version APIs can be easily fooled. Do it by trial and error: try to use VirtualAllocEx, and fall back to file mappings if VirtualAllocEx isn't available.
 bIsNT = false;

 HMODULE hKernel32 = GetModuleHandle("Kernel32");

 VirtualAllocExPtr lpfnVirtualAllocEx = (VirtualAllocExPtr)GetProcAddress(hKernel32, "VirtualAllocEx");
 VirtualFreeExPtr lpfnVirtualFreeEx = (VirtualFreeExPtr)GetProcAddress(hKernel32, "VirtualFreeEx");

 if (!lpfnVirtualAllocEx || !lpfnVirtualFreeEx)
   return false;

 // Windows 9x usually has stubs for VirtualAllocEx and VirtualFreeEx, so we still don't know if they're really there. Try to allocate the memory.
 LOADERFUNCTIONPARAMS *lpParamsMemory = (LOADERFUNCTIONPARAMS *)lpfnVirtualAllocEx(hProcess, 0, sizeof(LOADERFUNCTIONPARAMS) + nPatcherDataLen, MEM_COMMIT, PAGE_EXECUTE_READWRITE);

 // The moment of truth: NT or 9x?
 if (lpParamsMemory || GetLastError() != ERROR_CALL_NOT_IMPLEMENTED)
   bIsNT = true;

 if (!lpParamsMemory)
   return false;

 bool bSuccess = false;

 // This is Windows NT
 // Hook the entry point
 if (HookModuleEntryPoint32(lpszFilePath, hProcess, lpParamsMemory, params.nReturnAddress, params.jmpOverwritten))
 {
   // Compute the offset to write the patcher data at.
   BYTE *lpPatcherDataMemory = (BYTE *)ALIGN_PATCHER_DATA(lpParamsMemory->byPatcherData);

   // Write the parameters and patcher data
   SIZE_T nBytesWritten;

   if (WriteProcessMemory(hProcess, lpParamsMemory, &params, sizeof(params), &nBytesWritten) && nBytesWritten == sizeof(params))
   {
     if (!nPatcherDataLen || (WriteProcessMemory(hProcess, lpPatcherDataMemory, lpPatcherData, nPatcherDataLen, &nBytesWritten) && nBytesWritten == nPatcherDataLen))
     {
       // It's all set. Let it run until the loader function finishes.
       if (ResumeThread(hThread) != (DWORD)-1)
         bSuccess = GetLoaderErrorCode(hProcess, lpParamsMemory, nErrCode);
     }
   }
 }
 
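 // Free the memory. Note that the loader sets bCompleted a few instructions before it returns to the entry point,
 // so there is a narrow window in which this could free memory the loader is still executing from.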
 lpfnVirtualFreeEx(hProcess, lpParamsMemory, 0, MEM_RELEASE);

 return bSuccess;
}

bool InjectDLLAndResumeProcess9x(HANDLE hProcess, HANDLE hThread, LPCSTR lpszFilePath, LOADERFUNCTIONPARAMS &params, const void *lpPatcherData, size_t nPatcherDataLen, DWORD &nErrCode)
{
 if (nPatcherDataLen)
   assert(lpPatcherData);

 // We're on 9x. Use a file mapping.
 HANDLE hMapping = CreateFileMapping(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE, 0, sizeof(LOADERFUNCTIONPARAMS) + nPatcherDataLen, NULL);
 if (!hMapping)
   return false;

 bool bSuccess = false;

 // Map the file mapping so we can write to it
 LOADERFUNCTIONPARAMS *lpParamsMemory = (LOADERFUNCTIONPARAMS *)MapViewOfFile(hMapping, FILE_MAP_WRITE, 0, 0, 0);
 if (lpParamsMemory)
 {
   // Overwrite the entry point and get the old one
   if (HookModuleEntryPoint32(lpszFilePath, hProcess, lpParamsMemory, params.nReturnAddress, params.jmpOverwritten))
   {
     // Duplicate the file mapping HANDLE into the target process
     if (DuplicateHandle(GetCurrentProcess(), hMapping, hProcess, &params.hParamsSection, 0, FALSE, DUPLICATE_SAME_ACCESS))
     {
       BYTE *lpPatcherDataMemory = (BYTE *)ALIGN_PATCHER_DATA(lpParamsMemory->byPatcherData);

        // Copy the parameters and the patcher data (if any)
        memcpy(lpParamsMemory, &params, sizeof(params));
        if (nPatcherDataLen)
          memcpy(lpPatcherDataMemory, lpPatcherData, nPatcherDataLen);

       // Let the loader run
       if (ResumeThread(hThread) != (DWORD)-1)
         bSuccess = GetLoaderErrorCode(hProcess, lpParamsMemory, nErrCode);
     }
   }

   // Unmap the view
   UnmapViewOfFile(lpParamsMemory);
 }

 // Close the file mapping
 CloseHandle(hMapping);

 return bSuccess;
}

// Allocates the parameter struct in the foreign process and sets the members
bool InjectDLLAndResumeProcess(HANDLE hProcess, HANDLE hThread, LPCSTR lpszExecPath, LPCSTR lpszDLLFilePath, UINT_PTR nPatcherRVA, const void *lpPatcherData, size_t nPatcherDataLen, DWORD &nErrCode)
{
 assert(hProcess);
 assert(lpszExecPath);
 assert(lpszDLLFilePath);
 assert(strlen(lpszDLLFilePath) < MAX_PATH);

 HMODULE hKernel32 = GetModuleHandle("Kernel32");

 // Construct a local copy of the param block and initialize it
 LOADERFUNCTIONPARAMS params;

 params.hParamsSection = NULL;

 params.bCompleted = FALSE;

 params.lpfnLoadLibraryA = GetProcAddress(hKernel32, "LoadLibraryA");
 params.lpfnMapViewOfFile = GetProcAddress(hKernel32, "MapViewOfFile");
 params.lpfnGetLastError = GetProcAddress(hKernel32, "GetLastError");
 params.lpfnExitProcess = GetProcAddress(hKernel32, "ExitProcess");

 params.nPatcherRVA = nPatcherRVA;
 params.nPatcherDataLen = nPatcherDataLen;

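 // assert() compiles out in release builds, so check the length again before copying
 if (strlen(lpszDLLFilePath) >= MAX_PATH)
   return false;

 strcpy(params.szDLLFilePath, lpszDLLFilePath);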

#ifdef _DEBUG
 // In debug build in VC++, "LoaderFunction86_32" is actually a JMP stub. Find the real function.
 JMP32 *pJmpStub = (JMP32 *)LoaderFunction86_32;
 LPBYTE lpbyLoaderFunction = (LPBYTE)(pJmpStub->nRelOffset + (DWORD)LoaderFunction86_32 + sizeof(JMP32));

 memcpy(&params.fnLoaderFunction, lpbyLoaderFunction, LOADER_MAX_SIZE);
#else
 memcpy(&params.fnLoaderFunction, LoaderFunction86_32, LOADER_MAX_SIZE);
#endif

 // The patcher data will be written directly into the process, because it occupies extra data after the struct

 // Try to patch using the NT method first. If it's not NT, use the 9x method.
 bool bIsNT = false;

 if (InjectDLLAndResumeProcessNT(hProcess, hThread, lpszExecPath, params, lpPatcherData, nPatcherDataLen, bIsNT, nErrCode))
   return true;  // Successfully patched with the NT method
 else if (!bIsNT && InjectDLLAndResumeProcess9x(hProcess, hThread, lpszExecPath, params, lpPatcherData, nPatcherDataLen, nErrCode))
   return true;

 return false;  // Patching failed
}

2 comments:

Ryan Govostes said...

Have you seen Unsanity's Application Enhancer for Mac OS X? It exploits some of the same ideas that you've covered here, using derivatives of Jonathan Rentzsch's mach_inject and mach_override.

I think something like APE for Windows would be pretty cool, and you've pretty much done all the work. Have you considered it?

On an unrelated note, you seem to use inline code and assembly frequently. Due to the proportional font in use and the narrow Blogger template, it can be a little awkward to read. Have you considered using a monospaced font for code, and possibly putting it in non-wrapping, scrollable text form element?

mrbrdo said...

Interesting read... However, there are some things which can be done much more efficiently. I have written my version which has the following differences:
- you CAN get the base address of the executable in memory, even if it is relocated (you just need to get the module handle, use EnumProcessModules)
- I don't use inline assembly, and my assembly code is only about 6 lines long

But thank you a lot for the idea of just writing the overwritten code back, as I first went on to use a disassembler so I could figure out the correct opcode sizes (which is not needed).
If you are interested in my version you can drop me an e-mail on mrbrdo at email dot si