Well, after thinking about it for a while, I've decided to cover one more method of injecting code into a foreign process. The previous method, while almost perfect, has a trait that can be undesirable in some cases: it overwrites code in the executable, and leaves that code overwritten until after the DLLs are initialized. While it is still the best way to inject code, it leaves a tell-tale sign that the process has been tampered with. It would be trivial to write a simple hack detector that checks the first few bytes of the executable's entry point from a load-time DLL.
The fourth method of code injection does not have this limitation, but has new limitations of its own. This method is actually very similar to the previous method; however, instead of waiting until the executable's main function is about to be called, this method executes code as soon as the process is created.
This works because, while a thread is suspended, its register state (called its context) can be read and written using the GetThreadContext and SetThreadContext functions. In this way we can change the instruction pointer to point to our loader code, which, when it has finished, jumps back to the code at the original instruction pointer. Doing this leaves no readily detectable signs that the process has been altered.
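To make the "loader code that jumps back" part concrete, here is a rough sketch of how such a stub could be assembled for 32-bit x86. It is only an illustration under several assumptions, not the code any particular tool uses: the name build_loader_stub is made up, the three addresses are placeholders the caller would have to supply, and the stub does nothing except call a LoadLibraryA-style function (stdcall, one argument) before returning to the original instruction pointer with all registers and flags restored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of the stub emitted below. */
#define LOADER_STUB_SIZE 22

/* Store a 32-bit value in little-endian order, as x86 immediates require. */
static void put_u32le(uint8_t *dst, uint32_t value)
{
    dst[0] = (uint8_t)(value & 0xFF);
    dst[1] = (uint8_t)((value >> 8) & 0xFF);
    dst[2] = (uint8_t)((value >> 16) & 0xFF);
    dst[3] = (uint8_t)((value >> 24) & 0xFF);
}

/*
 * Hypothetical helper: emit a 32-bit x86 stub that loads one DLL and then
 * resumes at original_eip. All three addresses are assumed to be valid in
 * the target process. Returns the number of bytes written, or 0 if the
 * output buffer is too small.
 */
size_t build_loader_stub(uint8_t *out, size_t out_size,
                         uint32_t original_eip,
                         uint32_t dll_path_addr,
                         uint32_t load_library_addr)
{
    size_t i = 0;

    assert(out != NULL);
    if (out_size < LOADER_STUB_SIZE)
        return 0;

    out[i++] = 0x68;                       /* push original_eip (used by ret) */
    put_u32le(out + i, original_eip);      i += 4;
    out[i++] = 0x9C;                       /* pushfd: save flags              */
    out[i++] = 0x60;                       /* pushad: save general registers  */
    out[i++] = 0x68;                       /* push dll_path_addr (argument)   */
    put_u32le(out + i, dll_path_addr);     i += 4;
    out[i++] = 0xB8;                       /* mov eax, load_library_addr      */
    put_u32le(out + i, load_library_addr); i += 4;
    out[i++] = 0xFF;                       /* call eax (callee pops the arg)  */
    out[i++] = 0xD0;
    out[i++] = 0x61;                       /* popad: restore registers        */
    out[i++] = 0x9D;                       /* popfd: restore flags            */
    out[i++] = 0xC3;                       /* ret: jump to original_eip       */

    return i;
}
```

On Windows, a launcher would presumably read the suspended thread's context with GetThreadContext, pass the saved Eip as original_eip, copy the stub into the target (for instance with VirtualAllocEx and WriteProcessMemory, which is my assumption rather than something spelled out above), point Eip at the stub with SetThreadContext, and then call ResumeThread.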
The problem is that I don't fully know the limits of this method. On Windows NT this method is safe, because process initialization (including DLL initialization) is done in a user-mode asynchronous procedure call (APC), which is executed before the code at the instruction pointer of the thread's context (where the injected code would be), regardless of whether the process is created suspended.
On Windows 9x, however, things aren't so clear-cut. Windows 9x always creates a new process running, and the code that gets executed for the initial thread resides in Kernel32.dll. This code performs early process initialization, then makes a system call to suspend the process if the process was created suspended (this is the point at which injected code would get executed). When the thread is resumed, the system call returns, a substantial amount of late initialization code is executed (DLLs are initialized here, among other things), and the executable's entry point is called.
The fact that the DLLs have not yet been initialized by the time injected code would be executed isn't really a problem, since you can force them to be initialized by calling LoadLibrary for the ones that are required (Kernel32 doesn't have a DllMain, so this is safe). The problem is that there's so much late initialization code that I don't know what it does, so I'm not really comfortable with doing anything complex at this point, when the code is intended to run on 9x as well as NT.
But that's the way it's done. This is actually the method used by LMPQAPI and MPQDraft. Both get around this late initialization problem in the same way: the injected initialization code does nothing more than hook an API function, and the full-scale patching is performed when the executable first calls that function (which ensures that all process initialization has been completed by that point). I haven't bothered to write any example code, because the process is nearly identical to the previous method, and so would be easy to modify. I might write some code later, if I feel any less lazy.
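In the meantime, the deferred-patching idea on its own can be shown with a small portable sketch. This is not the LMPQAPI or MPQDraft code; every name here is hypothetical, and a plain function pointer stands in for whatever mechanism is actually used to hook the API (an import table entry, for example). The point is only the ordering: the early code installs a one-shot hook and nothing else, and the heavy work happens the first time the executable calls through it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature of an API the executable is known to call early. */
typedef int (*api_fn)(int);

/* Stand-in for the real API implementation. */
static int real_api(int x)
{
    return x + 1;
}

/* Stand-in for the slot the executable calls through (assumption: something
 * like an import table entry). */
static api_fn import_slot = real_api;

static api_fn saved_original = NULL;
static int full_patch_count = 0;

/* Placeholder for the full-scale patching; here it only counts its calls. */
static void perform_full_patching(void)
{
    full_patch_count++;
}

/* Runs the first time the executable calls the hooked API. By then process
 * initialization is over, so complex work is safe to do. */
static int one_shot_hook(int x)
{
    import_slot = saved_original;   /* unhook first, so this runs only once */
    perform_full_patching();
    return saved_original(x);       /* forward the call to the real API     */
}

/* Called from the injected early code: do the bare minimum and return. */
void install_deferred_patch(void)
{
    assert(saved_original == NULL); /* guard against installing twice */
    saved_original = import_slot;
    import_slot = one_shot_hook;
}
```

After install_deferred_patch() returns, nothing has been patched yet. The first call through import_slot triggers perform_full_patching() exactly once and still returns the real API's result; later calls go straight to the original function.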