Search This Blog

Wednesday, August 31, 2005

Let's Make a CPU - Decision - The Stack

And so, after several days of thinking about it, I've decided how I want to do the stack and return address with my CPU. The theme of my CPU is bare bones simplicity while still being practical for use. As such, I've gone with the simplest possible solution.

At the time I thought that the link register method was the simplest solution, but this turned out to be incorrect. The link register method makes the stack a software structure, but leaves the link register up to the hardware (and the register can be accessed by software, no less). The simplest solution would be to leave both entirely up to software; the reason it wasn't obvious to me at the time was that there's no way to do this on x86-32, PPC, or MIPS.

All that's really needed for software to handle return addresses is a way to GET the return address. This can be accomplished by an instruction which loads an address relative to the current instruction pointer. Such an instruction would not be specific to return addresses, as it would be useful for other applications: it would make it possible to address data relative to code; neither x86-32, PPC, or MIPS can do this, and it would be of substantial benefit given how current operating systems work (x86-64 can, as I praised in my summary a month or two back).

Thursday, August 25, 2005

How a PS3 Does It

IBM has released a variety of technical documents on the Cell (the Playstation 3's CPU, for those of you who are totally isolated from the world of video games). This seemed like a particularly timely piece of news, considering not only that I'm doing my own series on CPUs, but also that a few days ago I was looking around for information on how the Cell works, particularly with respect to how the synergistic processing units (SPUs) are programmed.

A bunch of the specs require registration. The only one that doesn't is this brief overview of the architecture.

I haven't had a chance to read any of this, yet (I just got back from school). Maybe, if I'm in a good mood, I'll look it over and write an architecture summary of the Cell, this weekend.

Let's Make a CPU - Dilemma - The Stack

As of right now, I have two major dilemmas in constructing the architecture. The first of these is what to do about the stack and function calls. I can think of 3 different ways to do it. But first, let me explain what I've already decided about the stack.

The x86 architecture (and possibly others that I don't know about) has the concept of pushing onto and popping off of the stack, using special instructions which cause the CPU to advance the stack pointer and store the contents of a register on the stack, in a single instruction. This is not how I'm going to do it. Like other RISC architectures, my CPU will treat the stack pointer the same as any other pointer: the program, prior to making a function call, will reserve stack space for the parameters by modifying the stack pointer, then store the parameters at the memory locations just reserved, and, once the call has returned, modify the stack pointer so as to remove the parameters from the stack.

Now, the three candidate stack models for me to choose between:

Integrated Stack
The first method is used by the x86 architecture. In it, a special (hardware defined) stack pointer register exists, which the program can modify as necessary (either by treating it as a pointer, or using the PUSH/POP instructions), and write to the stack. In addition to function parameters, the CPU uses the stack to hold the return addresses in function calls: the CALL instruction pushing the return address onto the stack, and the RET instruction popping it into the instruction pointer. The model here is a single, integrated stack, shared by both hardware (the CPU) and software (the program).

This method is elegant in it's simplicity: there is only a single mechanism to cover both bases. However, there are a few negatives to this method. On a philosophical level, this method completely dissolves the line between hardware and software; both the CPU and the program access the stack, mostly independently, but this still requires the software to know what the CPU is doing, and work around it. The practical side of this is that, as the software has full access to the stack, it's possible for the software to break the hardware (that is, by altering the stack pointer so as to cause an access violation when the CPU jumps to the return address), or to interfere with it (that is, the return address can be altered by the software, either intentionally - a function modifying the return address - or unintentionally - a buffer overflow exploit). Just the same, it's possible for the hardware to break the software (try forgetting about the 4/8 bytes the return address takes on the stack in an assembly program, and see how far your program gets before it blows up!).

Minimalistic Hardware
Several architectures, such as MIPS, do exactly the opposite. The stack is totally owned by the software, and the hardware has no knowledge of the stack. Function calls are handled by the call instruction copying the return address to a register; once this register receives the return address (inside the called function), this address becomes the property of the software (which must preserve this value, if the function makes function calls of its own), until ultimately the function returns by jumping to the return address in the register. The idea here is to have the hardware do the absolute minimum necessary, which amounts to copying the return address and jumping; everything else is done in software, including the jump to the return address.

This is elegant in its own way, in that almost everything is done by the software. The software, rather than hardware, takes responsibility for program flow, allowing (potentially) greater optimization of resource (particularly register and memory) usage. It has the added benefit of resisting (though not being immune to) code injection by buffer overflow, as only places where the return address is saved on the stack (as opposed to moving it to another register) are vulnerable. The downside is that it still allows functions to modify the return address; as well, there is still some mixing of software and hardware (although much less than with the integrated stack).

Software-Hardware Isolation
The final method is used, I believe, by Itanium. In this method, the CPU is again ignorant of the stack (at least, the stack used by the program), and uses a private, system stack (pointed to by a private register) for return addresses. (which may even be stored in on-chip cache, increasing the performance of returns considerably, as it allows the CPU to know ahead of time exactly where the function will be returning to, and prefetch the initial instructions).

This has the obvious advantage that it's immune to alterations (either intentional or unintentional) of the return addresses, as software does not have access to the system stack. The philosopher would also be pleased by the complete isolation of hardware (the call stack) and software. The downside is that it adds additional complexity to the CPU as well as the system, as now the system has two stacks (even though one is implemented in software, it still exists).

Let's Make a CPU - Basic Features

So, I can't wait to get started on designing this CPU, and this is the perfect kind of thing for me to put on my blog. So, first let's go over the basic features and why I chose them:

The architecture will be RISC. While a somewhat arbitrary decision, I think that RISC is much nicer than CISC, and particularly nice for an educational use such as this.

The CPU will use fixed-length instructions. While variable length instructions are easier and more spacious to decode, they're a major pain in a variety of circumstances.

The CPU will be 32-bit, and have instructions 32 bits large. While these are independent features, I decided that, for simplicity, the register size and instruction size would be the same. The reason I chose the size of 32 bits is due to what can fit in an instruction. 32 bits is the optimal size for instructions, based on several factors. 8 or 16-bit instructions are too small to fit all the parameters into, for many instructions. 64 bits, however, would be pretty wasteful, as most instructions won't use more than 16-24 bits.

There will be 32 (integer) registers. I don't really see a need for floating point registers at this time, and it would cause several complex problems I'd rather not get into. The number 32 was the very first thing to come to mind when I considered how many registers the CPU should have; however, when I started looking at how to pack the instructions, it turned out that 32 (5 bits worth) was the ideal number of registers, given the available number of bits in each instruction (if 32 had been too much to fit in an instruction, however, I would have had no choice but to reduce it).

The CPU will be little endian. Have plenty of friends that hate little endian with a passion, but it seems most reasonable to me. While about everything else is equal between little and big endians, little endian has the advantage of being able to downcast a variable without knowing exactly how large the variable is (i.e. can cast a 32-bit variable to an 8 or 16-bit variable without having to know anything other than that the variable is at least 16 bits).

The CPU will use two's complements integers. As I can't see any significant reason to use anything else (and plenty of reason to use this), this seems like and open and shut case.

Tuesday, August 23, 2005

& Work & Play

So, this week I started school. This is gonna be a really unpleasant semester, with three labs, and a few more units than I'm used to taking. However, there is some good news (and no, I didn't just switch to Geico): in the first computer archetecture class (today), they said that during the semester we (the students) will be creating our own CPUs on an emulator. I'm going to enjoy this immensely... *evil grin*

And as if school wasn't enough to occupy my time, I've found a game that I've been playing in my free time (which has only amounted to a few hours). Go on, guess. I bet you can't guess it.



Super Mario World. Yup, the first SNES game, back in 1991. For some reason I just had an urge to play it, so I'm playing it in all my free time (which really has only amounted to several hours). I'm currently at Chocolate Island.

Saturday, August 20, 2005

Reader/Writer Lock - Introduction

Okay, enough procrastination; let's finish the thread synchronization object library. We've been over events, conditions, mutexes, semaphores, and atomic functions; that leaves us with only one left.

Suppose we have a linked list of something or other, which is accessed by multiple threads. Clearly it would be disastrous if two threads tried to add new entries to the end of the list at the same time. It would also be disastrous if one thread was iterating through the list (reading, only) while another thread modified the list in some way, such as inserting or deleting an element. Protecting the list with a mutex would handily solve the problem, as there would never be more than one thread accessing the list at once.

But consider what the consequence of two threads iterating (and reading from) the list simultaneously would be: nothing. So long as neither thread modifies the list, two threads can read from it without any problems, whatsoever. However, if the list is protected by a mutex, this is impossible, as only one thread can ever hold the mutex at a time; this is wasteful, particularly if the reads take a long time to execute.

The solution is the last kind of thread synchronization object: the reader/writer lock (also called a single writer, multiple reader guard). This type of lock can be locked two ways: for reading or for writing. When a thread has the lock for reading, other threads may enter the lock for reading simultaneously, but threads trying to enter the lock for writing must wait until all readers have left the lock. However, while a thread has the lock for writing, no more threads may enter the lock, regardless of whether they want read or write access.

A practical example of where this would be useful is in a database. Multiple threads may read from the database concurrently, but if a thread wants to write to the database, it must wait until it has exclusive access. Because reading from disk can be time-consuming, overlapping reads can significantly increase performance.

Friday, August 19, 2005

& Japanese

So this last week before the fall semester starts, I decided to improve my knowledge of Japanese writing, some. Specifically, I wanted to learn all of the hiragana and katakana (I already knew all but 6 of the hiragana), as well as 100 kanji (although that last part got torpedoed by a particular circumstance). If you have no idea what those three words mean, let me explain the basics of how Japanese writing works.

Japanese is composed of a very simple sound set: about 14 consonants and 5 vowels (note that English doesn't really have 21 consonants and 5 vowels, as many letters can be pronounced in different ways - i.e. there are about a million vowel sounds in English). Syllables in Japanese are a fixed unit of time that the syllable is spoken, regardless of which syllable it is. Syllables can be composed of a vowel alone, a consonant followed by a vowel, or a consonant alone; however, only a couple of consonant can exist without a succeeding vowel (n and s come to mind). Because of this very simple structure, it was decided that the phonetic alphabet would consist of one character per syllable. All in all this comes out to 71 characters; of these, 25 are duplicates of other characters, but carry some simple decoration to distinguish them.

There are actually two phonetic alphabets consisting of these same characters - the hiragana and katakana. Each has a character for each syllable, but the symbols for each alphabet differ. Hiragana is used to write (most) Japanese words. Katakana is used for borrowed words (that have been incorporated into Japanese - English is notorious for borrowing words) and to spell out foreign words (and when the words can't be exactly represented in the Japanese syllable set, you get the characteristic Japanese accent).

If this were all there was to the Japanese writing system, it would not be extremely difficult to learn. The problem is that, in addition to these two phonetic alphabets, Japanese usually uses something different for verbs, nouns, and sometimes adjectives/adverbs: the dreaded kanji. Kanji are Chinese pictographs, of which there are around 50,000, and the primary reason that learning Japanese is less fun than a root canal. Each kanji represents one particular word or idea, and the average Japanese adult needs to know around 2000 of these pictographs in order to get by.

Despite the use of Chinese pictographs, Japanese is not actually related to Chinese. The same kanji will generally mean the same thing in both Japanese and Chinese, but the spoken word for the kanji will differ. The Japanese simply decided to use the Chinese pictographs, rather than inventing a writing system from scratch. From what I've heard, the hiragana and katakana were developed after the adoption of the Chinese writing system, by aristocratic women with a substantial amount of free time, yet little formal education in writing (and, obviously, it takes a LOT of education to learn the Chinese writing system).

So, that's a simplistic overview of the Japanese writing system. I left out a number of things that I know, and there are surely things I don't. But that should give you an idea how it works, and why it's so difficult to learn (although still not as hard as Chinese, due to the fact that ALL words in Chinese are represented by pictographs, and so there are more of them to learn).

As a final note, it's thought by some that Japanese and Korean are sister languages; even from my very limited knowledge of the two languages, I can see a moderate number of words that are similar. Korean, however, uses an elegant phonetic alphabet in which each symbol represents a single sound, sort of like English (if I recall correctly, there are 14 consonants and 14 vowels in Korean). Each square kanji-looking Korean character is actually a single syllable, which is a composite of multiple sound symbols arranged in an order I don't really understand. As well, Korean does not have the same consonant-vowel-syllable structure as Japanese, and may have more than three sounds per syllable (three referring to a consonant and a diphthong, which I didn't discuss in this post). Best of all, Korean does not use (or rather, rarely uses, as at one point it did use) those terrible Chinese pictographs.

Thursday, August 18, 2005

& Power Tools

So, my parents, in their infinite wisdom, have decided to add on another room to the house. Not only that, but the room will be built where the patio is, and the patio roof will become the roof of the new room. That's quite an interesting (and complicated) prospect.

In fact, we started on the first step of the process last weekend: relocating the patio roof. The plans for the room require that the patio roof be moved about two and a half feet. Considering that the roof weighs more than a ton, and is anchored into the patio concrete with sturdy metal pipes that go into the 4x4 legs, this is no small task, and took us about a day, with myself and my dad.



There were a few complications in this process. One of the four 4x4 patio legs, and part of a support beam, were eaten away by carpenter ants (making the roof itself somewhat of a safety hazard). This required attaching a large 4x6 beam to the leg to act as a support during relocation (and eventually the leg itself will be completely replaced).



2x4s were added between the roof and the legs on each corner of the roof, to reinforce the roof during transport. The legs and the pipes inside them were sawed off at the concrete, so that the roof could be moved. Very carefully, we lifted each leg onto the make-shift trolley that we would use to move the roof.

The trolley itself consisted of a bit of BC technology: two 14-foot 4x6s on top of a set of rollers, which were segments of metal piping I cut with a sawzall. This worked remarkably well, given how many thousands of years old this method is. Finally, after moving the patio, we lifted it on to large blocks cut from a 4x6 (where it currently resides), which would keep the roof high enough that we could build the walls underneath it.

Now that the roof is moved, the next phase of the project will be to convert it to the roof of an indoor room. This is not extremely difficult. Each of the 2x6s supporting the roof will be complemented with an additional 2x6 on one side, and a 2x2 on the other. In addition to reinforcing the roof, this serves to reduce the width between beams to that which may be readily filled by rolled insulation (23 inches wide).



While the relocation of the roof proceeded smoothly, a couple of humorous incidents punctuated the process, not the least of which being when my dad ended up getting a - shall we say - surprise bikini wax by a runaway electric drill (that particular model of drill has enough torque to sprain your wrist).

Wednesday, August 17, 2005

Bad Apples

So, if you're one of those people using OS X 10.4 on a Mac, you might want to avoid downloading Security Update 2005-007, as I hear it broke all 64 bits applications. You just know that MS is having a holiday, everyone saying "Haha, losers!"

UPDATE:
A fix for this problem was released today (8/18/2005)

& Kajiurese

As I was watching Madlax, I couldn't help but notice one really pretty song in the soundtrack. In addition to a really pretty (though quiet) melody, the lyrics were really poetic in their rhyming and meter (the chorus actually made me think of some passages in The Highwayman: another of my mom's favorites that I'd become familiar with, over the years). Some looking online revealed that the name of the song was Lost Command, on the Madlax soundtrack 2. Having identified the song, I wanted to know what the lyrics meant; boy, did that end up being more than I bargained for.

The first candidate for locating the lyrics was the insert booklet with the soundtrack; this, however, proved to be no help, as the booklet only contains the lyrics of several Madlax tracks. The next place to look was Google; this also ended in failure.

Finally, I decided I'd transcribe the lyrics and try to find somebody to translate it, myself; that was when things got really ugly. As I transcribed the lyrics, it became apparent that it was not Japanese, like some of the other songs on the Madlax soundtrack (the ones whose lyrics were included in the soundtrack booklet). So, I sent the songs to more than a dozen people I knew who spoke various languages. While I got several "that sounds kinda like [insert language that person doesn't speak, here]" ultimately all of them proved incorrect. Native speakers attested to the fact that it was not Spanish, Portuguese, French, German, Russian, Vietnamese, Hebrew, Armenian, or Czech. Second-hand speakers told me that it probably wasn't Italian, Latin, Greek, Arabic, Chinese, or any Slavic language.

So, that ruled out pretty much all major world languages, and exhausted my list of acquaintances. Now what? Well, I decided to go posting on language newsgroups. While this did not bring an answer to the question, it did bring a few clues. One person found a Japanese site which thought the lyrics were not any actual language; it was about this time that I started referring to the language as 'Kajiurese' (named after the composer, Yuki Kajiura). Another poster suggested that it might be a Creole language, or a cross between Japanese and some African language.

That was about as far as I got, and the true identify of Kajiurese (if there is one) remains unknown. However, we can do some characterization of the language. It definitely bears a resemblance to Japanese in the structure of its syllables. Each syllable is composed of a vowel, sometimes preceded by an initial consonant, and sometimes succeeded by a terminal consonant. Unlike English, where you can expect almost any ordering of letters which can be pronounced, only a handful of consonants can act as terminal consonants (this is like Japanese, where only 'n' and 's' can come without a succeeding vowel). Additionally, despite several consonants that do not occur in Japanese, Kajiurese has only 5 vowels (and a sixth which is a diphthong of the 'o' and 'u' sounds) - the very same set in Japanese. It should also be noted that the manner in which some of the terminal consonants are voiced (with both the vowel and terminating consonant sustained for a period before the next syllable) resembles how terminal consonants are pronounced in Japanese, although as some are not pronounced this way, I'm assuming that this is simply due to the Japanese accent of the singer.

After Lost Command, several other tracks in the Madlax soundtrack were identified as probably being Kajiurese (although Margaret features two sounds not seen in any of the other songs). From these songs, the following list of sounds in Kajiurese was constructed (* indicates rare sounds):

Initial Consonant Vowel Terminating Consonant
b
ch*
d
h
k
l
m
n
r
s
sh*
st*
s
t
th
v
y
z
a
e
i
o
ou*
u
n
r*

Finally, the songs themselves. Syllables are delimited by apostrophes. Word breaks are not certain; they are merely my best guess where the breaks occur.

Lost Command:
Chorus:
kou ma're'a tan ba're di'tha
ko ba're'a san ma're di'tha
ha ba'ri'e'a ha ba'ri'e i di'tha
kou ma're'a tan ba're di'tha
so ba'ri'e'a sa'ba di'tha
e ba'de i'a'ra vi'da ya'ra

vi'do'ri men'ta
ha'va'na ya'ri'a
vi'do'ri men'ta ar o'o'ga

vi'de'ri men'ta
ar ka'di've'ta
de'sta'ri ven'to o'o'ga

Chorus

Madlax:
Chorus:
ka'ma'ni'a ma'di'a e'de'mi'ne'a ma'ri'a
ka'ma'ni'a so'di'a e'da'za
ke'ze sa'ma'ni'a ma'di'a e'ke'ze'na a'do'ri'a
kar ma'la'mi
sar ma'la'mi

kar'da me'a ho'za
se'ra vi'a di'a kio'za
e'ya me'a vi'a pe'za
ke'ze sar ma'la'mi
kar ma'la'mi

so'ri vi'a to'za
ve'a vi'da di'a ka'za
e'ra me'a vi'a to'za
e'ye sar ma'la'mi
ar'ka di'sa

Chorus

ka'ma'ni'a ma'di'a e'de'mi'ne'a ma'ri'a
e'ka'za
sa'ma'ni'a ma'di'a e'ke'ze'na a'do'ri'a
e'ka'za

Elanore:
Chorus:
e'sto'ra vi'tha e'ka'na'e ma
ha'vi'sta di i'kan'ta
a'do'o'ra e'ya i'sa'ma e'ra
i'vi'sa di e'ka'n'ta

Chorus

i'me'a de'ta a'no'o'ra
a'sto'ri de'tu a'do'o'ga
mi'i'da me'a a'mi'i'to
i'no'te'di no'te'di e'ya
(2 times)

Chorus

Margaret:
ko ko me'i'ta
ke'sta're go'shi'mi'o'is
i'vi'tha e'ya du'vi'a
no no che'i'ta
ke'da're no'me'ni'ta
i've'tha i'ya na'di'a
(2 times)

Tuesday, August 16, 2005

& Links

Added a links section to the left sidebar. All of those are either sites I visit frequently (at least daily) or the sites of my friends.

& Horror

So, yesterday I went to AnimeSuki (for those of you not into anime, AnimeSuki is one of the biggest catalogs of fan-subbed anime episodes), looking for new episodes of Mahoraba and Honey and Clover. Unfortunately, I found neither of these; but scanning the list of new releases brought something else to my eye: Great Detectives Poirot and Marple, episode 22. Heavens. Surely that can't be as bad as it sounds, right..? Unfortunately, a quick trip to ANN reveals that it is indeed an anime based on Agatha Christie's novels.

Poirot and Ms. Marple are people that I've become begrudgingly familiar with, over the years, due to my mom being a big Agatha Christie fan, and her frequently watching movie adaptations of these series on PBS, not outside my range of hearing. Somehow it disturbs me greatly that there is an anime adaptation, as well.

Monday, August 15, 2005

And There Were Two

So about 2 minutes after I posted that anime post, Jean-Francois Roy (one of my programming friends, and a legend in the Mac gaming world) informed me that his blog is now public. Depending on whether you use a Mac, Unix, or something vaguely similar, you may or may not care, as those are the primary subjects of his blog.

& Anime

Good Gord, has it been 2 1/2 weeks, already? Well I suppose I might as well post a little about my unproductive exploits, given the lack of productive ones. Let's see, where to start..? Well, predominantly, I've been watching anime, for the first time in a couple years (although I have been accumulating manga on my bookshelf over that time). After much prodding by one Magus (not that that name is distinguished enough for any of you to have any clue who I'm talking about), I watched Elfen Lied (as I already mentioned). After Elfen Lied, I watched the following (the ones that are bolded are ones I really liked, and would recommend):
- School Rumble (good series, even though the ending - or lack thereof - BLOWS)
- Air
- Madlax (interesting in a couple ways, but I wouldn't really recommend it)
- His and Her Circumstances (can't seem to find the English distributor's page for the series)
- Gunslinger Girl (while somewhat of a mass-produced anime, and lacking a central plot, I found it interesting, although I have to admit that may be because it has similarities to two of my own stories, in different ways)
- Full Metal Panic: The Second Raid (still in progress)
- Mahoraba
- Honey and Clover (still in progress; pretty amusing at times, but the lethargic pace makes it rather boring, on average)
- Banner of the Stars III
- Noir

As far as manga reading has been going, I've recently been reading the following (most of these are still in progress):
- Great Teacher Onizuka
- Cheeky Angel
- Bleach
- Yotsuba &!(this is licensed, but ADV's site is hideously outdated, and even stuff released 6 months ago isn't on there, yet)
- Elfen Lied
- Scrapped Princess (I'd recommend this anime, but as there's only one volume out, I can't really recommend the manga, yet)
- Full Metal Panic!

Of these, I want to give special recognition to Yotsuba &! That is one of the funniest series I've ever read. In fact, I've got a sample chapter uploaded here, for your viewing pleasure. Now, the series is licensed in English (and by the RIAA of the anime industry, no less...), so you can't download all the chapters (technically it isn't legal to download any of the chapters, but I figure me putting this one up here is more likely to increase their sales than decrease them...). So if you like the chapter, go buy the books!

Wednesday, July 27, 2005

Infatuation

So I just finished watching Elfen Lied. It's good! At least, I really liked it. I suppose I'm a sucker for happy endings, even if dozens of bad guys have to get dismembered in the process. If you can stomach the graphic violence (i.e. limbs getting ripped off, severed bones, etc.) you should definitely watch it. The anime itself is licensed in English, but the manga is not; so if you want something you can ethically (though the legality is questionable) download, you can read the manga version (note: I haven't read the manga, so I don't know how it compares to the anime). There's also an anime trailer available which isn't as graphic as the series itself.




Incidentally, my new desktop.

Tuesday, July 26, 2005

The Art of Breaking and Entering - Thread Hijacking Variation

Well, after thinking about it for a while, I've decided to cover one more method of code injection into a foreign process. The reason for this comes from the fact that the previous method, while almost perfect, has a trait that can be undesirable in some cases: it overwrites the code in the executable, and leaves this code overwritten until after the DLLs are initialized. While this is, indeed, the best way to accomplish the task of injecting code, it leaves a tell-tale sign that the process has been tampered with. It would be trivial to implement a simple hack-detector that checks the first few bytes of the executable's entry point from a load-time DLL.

The fourth method of code injection does not have this limitation, but has new limitations of its own. This method is actually very similar to the previous method; however, instead of waiting until the executable's main function is about to be called, this method executes code as soon as the process is created.

This works by virtue of the fact that when a thread is suspended, it is possible to read and write its register state (called its context) using the SetThreadContext and GetThreadContext functions. In this way we can alter the instruction pointer to point to our loader code, which will then jump back to the original instruction pointer code. Doing this leaves no readily detectable signs that the process has been altered.

The problem is that I don't fully know the limits of this method. In Windows NT this method is safe, because process initialization (including DLL initialization) is done in a user-mode asynchronous procedure call (APC), which will be preemptively executed before the code at the instruction pointer of the thread's context (where code would be injected), regardless of whether the process is created suspended.

On Windows 9x, however, things aren't so clear-cut. When Windows 9x creates a new running (always running, to begin with) process, the code that gets executed for the initial thread resides in Kernel32.dll. This code performs early process initialization, then calls a system call to suspend the process if the process was created suspended (this is where injected code would get executed). When the thread is resumed, the system call returns, a substantial amount of late initialization code is executed (DLLs are initialized here, among other things), and the executable's entry point is called.

The fact that the DLLs have not yet been initialized by the time injected code would be executed isn't really a problem, since you can force them to be initialized by calling LoadLibrary for the ones that are required (Kernel32 doesn't have a DllMain, so this is safe). The problem is that there's so much late initialization code that I don't know what it does, so I'm not really comfortable with doing anything complex at this point, when the code is intended to run on 9x as well as NT.

But that's the way it's done. This is actually the method used by LMPQAPI and MPQDraft. Both get around this late initialization problem in the same way: hooking an API function in the initialization procedure, then performing the full-scale patching when that function gets called by the executable (ensuring that all process initialization will have been performed by that point). I haven't bothered to write any example code, because the process is nearly identical to the previous method, and so would be easy to modify. I might write some code later, if I feel any less lazy.

Thursday, July 21, 2005

The Art of Breaking and Entering - Thread Hijacking

While the first two mechanisms of DLL injection I've shown have used well documented Windows API functions, the third and final method is quite a bit more exotic. This method consists literally of hijacking a (the, to be exact) thread that already exists in the target process and making it execute code we injected using methods discussed previously.

The trick, here, is the fact that new processes can be created suspended. When CreateProcess is called with CREATE_SUSPENDED, Windows begins the usual way: creating the process' address space, loading the module, preparing the kernel for the new process, and creating the initial thread. In reality, processes are nothing more than an environment for threads to run it; what's really suspended is the initial thread. When run, this initial thread does several things, most notably preparing the executable for execution (including loading all required DLLs) calling the executable's entry point function (main or WinMain), and then calling ExitThread with the return value of the entry point (if there are no other threads running in the process, ExitThread has the effect of destroying the process).

While this thread is suspended, we have access to the process, allowing us to do any number of evil things. There are a number of possible ways to go about hijacking the thread, but I'll only present the best one (the most robust and with the highest reliability): overwriting the entry point. Here, we overwrite the first few bytes of the entry point with a JMP instruction, to jump to our injected code, which will load your DLL, call a patching function, and then jump back to the application.

There are numerous advantages to this technique over the others. Unlike CreateRemoteThread, this method does not mandate Windows NT (I should note, in case you don't realize, that "NT" refers to the NT platform, which includes NT Workstation/Server, 2000, XP, and Server 2003). As well, it is the only method that not only allows synchronous operation, but also allows your code to be executed before the target executable begins running.

This sounds fairly simple, but it turns out to be a major hassle to get right (I seriously doubt I could have gotten the code for this post working on the first try had I not been doing this kind of thing for years). This is especially true when you intend to create a version which works on both Windows 9x and NT, which is a very nice feature.

The first complication of this method is rather severe: you must be sure that you get EVERYTHING you need in your injected loader code into the process, both code and data. Among other things, that implies that you must write your loader code in assembly, and you may not call imported API functions (because your loader code doesn't have an import table). If you wish to call any API functions (which you will, considering that you'll at least need LoadLibrary), you must pass the address of the functions to your loader from the parent process.

There are also many numerous smaller complications. If you intended to support both 9x and NT, you must ensure that you can inject either via allocated memory (for NT) or a file mapping (for 9x). And in the case of 9x, you must ensure that the mapping does not get closed before the loader has finished executing (this is tricky because the mapping was created in the parent process, and if the parent process closes it, the mapping will disappear from the target process, as well).

I've been putting a LOT of effort into researching this method. As far as I've been able to tell, it has only one inherent limitation. As the loader code executes before main/WinMain, the executable will not have been initialized, and so you cannot call any functions in it. This may be worked around by hooking some function the executable imports, and then delaying your initialization until that function is called (this is what LMPQAPI does to create a server using MPQ editing functions in StarEdit.exe).

Two more limitations are imposed by my implementation. First, the executable must load at its preferred address (not be relocated), as that's where the injector expects it to be. Second, because the patching process is architecture-specific, it is limited to what I wrote: a 32-bit process patching a 32-bit process. It is likely that these problems can both be fixed, but I'm too lazy to do it, at the moment.

// Amount of space to reserve for the loader function that gets injected
#define LOADER_MAX_SIZE 192
#define PATCHER_DATA_ALIGNMENT 16  // Alignment to use for the patcher data

// Rounds an offset up to the nearest PATCHER_DATA_ALIGNMENT boundary
#define ALIGN_PATCHER_DATA(x) (((UINT_PTR)x + PATCHER_DATA_ALIGNMENT - 1) & ~(PATCHER_DATA_ALIGNMENT - 1))

typedef LPVOID (WINAPI *VirtualAllocExPtr)
(
 HANDLE hProcess,
 LPVOID lpAddress,
 SIZE_T dwSize,
 DWORD flAllocationType,
 DWORD flProtect
);

typedef BOOL (WINAPI *VirtualFreeExPtr)
(
 HANDLE hProcess,
 LPVOID lpAddress,
 SIZE_T dwSize,
 DWORD dwFreeType
);

// The JMP rel32 instruction
#include <pshpack1.h>
struct JMP32
{
 BYTE byOpcode;  // 0xE9
 DWORD nRelOffset;  // Offset relative to the instruction AFTER this JMP

 inline JMP32()
 { byOpcode = 0xE9; }
};
#include <poppack.h>

// The parameters that will get injected into the target process
struct LOADERFUNCTIONPARAMS
{
 BOOL bCompleted;  // Whether the loader has finished
 DWORD nErrCode;  // GetLastError value when the loader succeeds/fails

 HANDLE hParamsSection;  // If the parameter block is in a file mapping, HANDLE of the mapping; NULL otherwise.

 FARPROC lpfnLoadLibraryA;  // Functions that the loader will call
 FARPROC lpfnMapViewOfFile;
 FARPROC lpfnGetLastError;
 FARPROC lpfnExitProcess;

 UINT_PTR nReturnAddress;  // The address that our loader function will return to

 JMP32 jmpOverwritten;  // The data we overwrite in the WinMain function with the JMP to the loader

 UINT_PTR nPatcherRVA;  // RVA of patcher entry point in DLL
 size_t nPatcherDataLen;  // Length of data to be passed to patcher

 char szDLLFilePath[MAX_PATH];  // Name of patcher DLL

 BYTE fnLoaderFunction[LOADER_MAX_SIZE];  // Loader function code

 BYTE byPatcherData[PATCHER_DATA_ALIGNMENT];  // Patcher data of variable length
};

// The loader function for x86-32. This function will return (on success) to the start function for the process' initial thread.
void __declspec(naked) __stdcall LoaderFunction86_32()
{
 __asm {
   ; Use CALL to generate the return address we need to overwrite with the entry point's address
   call Loader

Loader:
   push ebp
   mov ebp, esp
   pushad
   ; int 3  ; Uncomment this for debugging the loader function

   ; Compute the address of the LOADERFUNCTIONPARAMS block. It will be at the page boundary beneath this code
   mov ebx, [ebp+4]
   and ebx, 0xFFFFF000

   ; If the parameter block is in a file mapping, lock it, first
   mov edx, [ebx]LOADERFUNCTIONPARAMS.hParamsSection

   test edx, edx
   jz LoadDLL

   push 0
   push 0
   push 0
   push FILE_MAP_WRITE
   push edx
   call [ebx]LOADERFUNCTIONPARAMS.lpfnMapViewOfFile

   test eax, eax
   jz Failure

LoadDLL:  ; Call LoadLibraryA to load DLL.
   lea edx, [ebx]LOADERFUNCTIONPARAMS.szDLLFilePath
   push edx
   call [ebx]LOADERFUNCTIONPARAMS.lpfnLoadLibraryA

   test eax, eax
   jz Failure

LibraryLoaded:  ; Now call the patcher entry point, if there is one
   cmp [ebx]LOADERFUNCTIONPARAMS.nPatcherRVA, 0
   je RewriteEntryPoint

   lea ecx, [ebx]LOADERFUNCTIONPARAMS.byPatcherData
   add ecx, (PATCHER_DATA_ALIGNMENT - 1)  // Align the data on a 16 byte boundary
   and ecx, ~(PATCHER_DATA_ALIGNMENT - 1)
   mov edx, [ebx]LOADERFUNCTIONPARAMS.nPatcherDataLen
   add eax, [ebx]LOADERFUNCTIONPARAMS.nPatcherRVA
   push edx
   push ecx
   call eax

   test eax, eax
   jz Failure

RewriteEntryPoint:  ; Put the original bytes from the entry point back
   mov edx, [ebx]LOADERFUNCTIONPARAMS.nReturnAddress
   lea esi, [ebx]LOADERFUNCTIONPARAMS.jmpOverwritten
   mov edi, edx
   mov ecx, size JMP32
   rep movsb
   mov [ebp+4], edx  ; Set the return address to the entry point

Done:  ; Patching completed successfully. Acknowledge success and return to the entry point.
   mov [ebx]LOADERFUNCTIONPARAMS.nErrCode, NO_ERROR
   mov [ebx]LOADERFUNCTIONPARAMS.bCompleted, TRUE

   popad
   mov esp, ebp
   pop ebp
   ret

Failure:  ; Save GetLastError value and call ExitProcess
   call [ebx]LOADERFUNCTIONPARAMS.lpfnGetLastError
   mov [ebx]LOADERFUNCTIONPARAMS.nErrCode, eax
   push 0
   ;mov [ebx]LOADERFUNCTIONPARAMS.bCompleted, TRUE
   call [ebx]LOADERFUNCTIONPARAMS.lpfnExitProcess
 };
}

// Get the entry point for a module from its file path
bool FindModuleEntryPoint(LPCSTR lpszFilePath, UINT_PTR &lpfnEntryPoint)
{
 assert(lpszFilePath);

 // Map the module as a data file (essentially as a memory mapped file)
 HMODULE hModule = LoadLibraryEx(lpszFilePath, NULL, LOAD_LIBRARY_AS_DATAFILE);
 if (!hModule)
   return false;

 bool bSuccess = false;

 // Wrap code in a try-except block, since we're going to be working with unverified pointers
 __try
 {
   // Find the DOS header. An HMODULE is a pointer to the module in memory, but LoadLibrary stores flags in the lower bits of the HMODULE.
   IMAGE_DOS_HEADER *lpDosHeader = (IMAGE_DOS_HEADER *)((UINT_PTR)hModule & ~(UINT_PTR)0xFFF);

   if (lpDosHeader->e_magic == IMAGE_DOS_SIGNATURE && lpDosHeader->e_lfanew)
   {
     // Locate the NT headers
     DWORD *lpNTSignature = (DWORD *)((UINT_PTR)lpDosHeader + lpDosHeader->e_lfanew);
     IMAGE_FILE_HEADER *lpNTHeader = (IMAGE_FILE_HEADER *)((UINT_PTR)lpNTSignature + sizeof(DWORD));
     IMAGE_OPTIONAL_HEADER32 *lpOptHeader = (IMAGE_OPTIONAL_HEADER32 *)((UINT_PTR)lpNTHeader + IMAGE_SIZEOF_FILE_HEADER);
     
     if (*lpNTSignature == IMAGE_NT_SIGNATURE)
     {
       lpfnEntryPoint = lpOptHeader->AddressOfEntryPoint + lpOptHeader->ImageBase;

       bSuccess = true;
     }
   }
 }
 __except (EXCEPTION_EXECUTE_HANDLER)
 { }

 FreeLibrary(hModule);

 return bSuccess;
}
// Finds the entry point of the target executable, saves the entry point data, and overwrites the entry point with the JMP instruction
bool HookModuleEntryPoint32(LPCSTR lpszFilePath, HANDLE hProcess, LOADERFUNCTIONPARAMS *lpParamsBlock, UINT_PTR &lpfnEntryPoint, JMP32 &jmpOverwritten)
{
 assert(lpParamsBlock);

 // Find the entry point for the module
 if (!FindModuleEntryPoint(lpszFilePath, lpfnEntryPoint))
   return false;

 // Protect against access violations
 __try
 {
   // Unprotect where we need to read/write
   DWORD nOldProtect;
   if (!VirtualProtectEx(hProcess, (void *)lpfnEntryPoint, sizeof(JMP32), PAGE_EXECUTE_READWRITE, &nOldProtect))
     return false;

   // Get the old entry point
   SIZE_T nBytesRead;

   if (!ReadProcessMemory(hProcess, (void *)lpfnEntryPoint, &jmpOverwritten, sizeof(JMP32), &nBytesRead) || nBytesRead != sizeof(JMP32))
     return false;

   // Write the JMP to the entry point
   SIZE_T nBytesWritten;
   JMP32 jmp;

   // Compute the relative offset of the loader function
   DWORD nLoaderAddress = (DWORD)&lpParamsBlock->fnLoaderFunction;

   jmp.nRelOffset = nLoaderAddress - (lpfnEntryPoint + sizeof(jmp));

   if (!WriteProcessMemory(hProcess, (void *)lpfnEntryPoint, &jmp, sizeof(jmp), &nBytesWritten) || nBytesWritten != sizeof(jmp))
     return false;

   return true;
 }
 __except (EXCEPTION_EXECUTE_HANDLER)
 { return false; }
}

// Wait until the loader function, for better or worse, has finished. Return value is the error code from the process
bool GetLoaderErrorCode(HANDLE hProcess, LOADERFUNCTIONPARAMS *lpParamsMemory, DWORD &nErrCode)
{
 // The plan is very simple: poll the parameter block every 10 ms to check for completion. Also watch the process HANDLE for termination.
 SIZE_T nBytesRead;

 while (WaitForSingleObject(hProcess, 10) != WAIT_OBJECT_0)
 {
   // Read the completion indicator flag
   BOOL bCompleted;

   if (!ReadProcessMemory(hProcess, &lpParamsMemory->bCompleted, &bCompleted, sizeof(bCompleted), &nBytesRead) || nBytesRead != sizeof(bCompleted))
     return false;

   if (bCompleted)
     break;
 }

 // Read the error code and return
 if (!ReadProcessMemory(hProcess, &lpParamsMemory->nErrCode, &nErrCode, sizeof(nErrCode), &nBytesRead) || nBytesRead != sizeof(nErrCode))
   return false;

 return true;
}

// May fail for two reasons: unable to allocate the memory, or this is a Windows 9x machine. If the latter, bIsNT will be false
bool InjectDLLAndResumeProcessNT(HANDLE hProcess, HANDLE hThread, LPCSTR lpszFilePath, LOADERFUNCTIONPARAMS &params, const void *lpPatcherData, size_t nPatcherDataLen, bool &bIsNT, DWORD &nErrCode)
{
 if (nPatcherDataLen)
   assert(lpPatcherData);

 // We don't know if we're on NT or 9x, and the version APIs can be easily fooled. Do it by trial and error: try to use VirtualAllocEx, and fall back to file mappings if VirtualAllocEx isn't available.
 bIsNT = false;

 HMODULE hKernel32 = GetModuleHandle("Kernel32");

 VirtualAllocExPtr lpfnVirtualAllocEx = (VirtualAllocExPtr)GetProcAddress(hKernel32, "VirtualAllocEx");
 VirtualFreeExPtr lpfnVirtualFreeEx = (VirtualFreeExPtr)GetProcAddress(hKernel32, "VirtualFreeEx");

 if (!lpfnVirtualAllocEx || !lpfnVirtualFreeEx)
   return false;

 // Windows 9x usually has stubs for VirtualAllocEx and VirtualFreeEx, so we still don't know if they're really there. Try to allocate the memory.
 LOADERFUNCTIONPARAMS *lpParamsMemory = (LOADERFUNCTIONPARAMS *)lpfnVirtualAllocEx(hProcess, 0, sizeof(LOADERFUNCTIONPARAMS) + nPatcherDataLen, MEM_COMMIT, PAGE_EXECUTE_READWRITE);

 // The moment of truth: NT or 9x?
 if (lpParamsMemory || GetLastError() != ERROR_CALL_NOT_IMPLEMENTED)
   bIsNT = true;

 if (!lpParamsMemory)
   return false;

 bool bSuccess = false;

 // This is Windows NT
 // Hook the entry point
 if (HookModuleEntryPoint32(lpszFilePath, hProcess, lpParamsMemory, params.nReturnAddress, params.jmpOverwritten))
 {
   // Compute the offset to write the patcher data at.
   BYTE *lpPatcherDataMemory = (BYTE *)ALIGN_PATCHER_DATA(lpParamsMemory->byPatcherData);

   // Write the parameters and patcher data
   SIZE_T nBytesWritten;

   if (WriteProcessMemory(hProcess, lpParamsMemory, &params, sizeof(params), &nBytesWritten) && nBytesWritten == sizeof(params))
   {
     if (!nPatcherDataLen || (WriteProcessMemory(hProcess, lpPatcherDataMemory, lpPatcherData, nPatcherDataLen, &nBytesWritten) && nBytesWritten == nPatcherDataLen))
     {
       // It's all set. Let it run until the loader function finishes.
       if (ResumeThread(hThread) != (DWORD)-1)
         bSuccess = GetLoaderErrorCode(hProcess, lpParamsMemory, nErrCode);
     }
   }
 }
 
 // Free the memory
 lpfnVirtualFreeEx(hProcess, lpParamsMemory, 0, MEM_RELEASE);

 return bSuccess;
}

bool InjectDLLAndResumeProcess9x(HANDLE hProcess, HANDLE hThread, LPCSTR lpszFilePath, LOADERFUNCTIONPARAMS &params, const void *lpPatcherData, size_t nPatcherDataLen, DWORD &nErrCode)
{
 if (nPatcherDataLen)
   assert(lpPatcherData);

 // We're on 9x. Use a file mapping.
 HANDLE hMapping = CreateFileMapping(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE, 0, sizeof(LOADERFUNCTIONPARAMS) + nPatcherDataLen, NULL);
 if (!hMapping)
   return false;

 bool bSuccess = false;

 // Map the file mapping so we can write to it
 LOADERFUNCTIONPARAMS *lpParamsMemory = (LOADERFUNCTIONPARAMS *)MapViewOfFile(hMapping, FILE_MAP_WRITE, 0, 0, 0);
 if (lpParamsMemory)
 {
   // Overwrite the entry point and get the old one
   if (HookModuleEntryPoint32(lpszFilePath, hProcess, lpParamsMemory, params.nReturnAddress, params.jmpOverwritten))
   {
     // Duplicate the file mapping HANDLE into the target process
     if (DuplicateHandle(GetCurrentProcess(), hMapping, hProcess, &params.hParamsSection, 0, FALSE, DUPLICATE_SAME_ACCESS))
     {
       BYTE *lpPatcherDataMemory = (BYTE *)ALIGN_PATCHER_DATA(lpParamsMemory->byPatcherData);

       // Copy the patcher data
       memcpy(lpParamsMemory, &params, sizeof(params));
       memcpy(lpPatcherDataMemory, lpPatcherData, nPatcherDataLen);

       // Let the loader run
       if (ResumeThread(hThread) != (DWORD)-1)
         bSuccess = GetLoaderErrorCode(hProcess, lpParamsMemory, nErrCode);
     }
   }

   // Unmap the view
   UnmapViewOfFile(lpParamsMemory);
 }

 // Close the file mapping
 CloseHandle(hMapping);

 return bSuccess;
}

// Allocates the parameter struct in the foreign process and sets the members
bool InjectDLLAndResumeProcess(HANDLE hProcess, HANDLE hThread, LPCSTR lpszExecPath, LPCSTR lpszDLLFilePath, UINT_PTR nPatcherRVA, const void *lpPatcherData, size_t nPatcherDataLen, DWORD &nErrCode)
{
 assert(hProcess);
 assert(lpszExecPath);
 assert(lpszDLLFilePath);
 assert(strlen(lpszDLLFilePath) < MAX_PATH);

 HMODULE hKernel32 = GetModuleHandle("Kernel32");

 // Construct a local copy of the param block and initialize it
 LOADERFUNCTIONPARAMS params;

 params.hParamsSection = NULL;

 params.bCompleted = FALSE;

 params.lpfnLoadLibraryA = GetProcAddress(hKernel32, "LoadLibraryA");
 params.lpfnMapViewOfFile = GetProcAddress(hKernel32, "MapViewOfFile");
 params.lpfnGetLastError = GetProcAddress(hKernel32, "GetLastError");
 params.lpfnExitProcess = GetProcAddress(hKernel32, "ExitProcess");

 params.nPatcherRVA = nPatcherRVA;
 params.nPatcherDataLen = nPatcherDataLen;

 strcpy(params.szDLLFilePath, lpszDLLFilePath);

#ifdef _DEBUG
 // In debug build in VC++, "LoaderFunction86_32" is actually a JMP stub. Find the real function.
 JMP32 *pJmpStub = (JMP32 *)LoaderFunction86_32;
 LPBYTE lpbyLoaderFunction = (LPBYTE)(pJmpStub->nRelOffset + (DWORD)LoaderFunction86_32 + sizeof(JMP32));

 memcpy(&params.fnLoaderFunction, lpbyLoaderFunction, LOADER_MAX_SIZE);
#else
 memcpy(&params.fnLoaderFunction, LoaderFunction86_32, LOADER_MAX_SIZE);
#endif

 // The patcher data will be written directly into the process, because it occupies extra data after the struct

 // Try to patch using the NT method first. If it's not NT, use the 9x method.
 bool bIsNT = false;

 if (InjectDLLAndResumeProcessNT(hProcess, hThread, lpszExecPath, params, lpPatcherData, nPatcherDataLen, bIsNT, nErrCode))
   return true;  // Successfully patched with the NT method
 else if (!bIsNT && InjectDLLAndResumeProcess9x(hProcess, hThread, lpszExecPath, params, lpPatcherData, nPatcherDataLen, nErrCode))
   return true;

 return false;  // Patching failed
}