Wednesday, August 04, 2010

NWScript Stack System

Note: SW has written a new post about NWScript in general, akin to my post from yesterday (though possibly in more detail than mine; I haven't actually read it all yet).

So, now we know where we are (the NWScript system), and where we need to go (LLVM assembly). Now, how do we get there?

Well, a lot of things are simple enough, requiring only trivial use of LLVM and a parser. But after investigating the NWScript system for this purpose, I identified several areas that would be nontrivial and would require some thought and effort, which I'll discuss.

The first challenge - that is, the first nontrivial design issue - was the stack machine itself. Stack machines are easy to interpret, but harder to convert to something like LLVM assembly, which uses registers (like real assembly languages) and strongly declared, typed variables (unlike real assembly languages). Stack machines churn through a huge number of stack locations during execution, the vast majority of them (more than 96%, in one sample I looked at) temporary copies that live only until the next math (or other) instruction consumes them. From the opposite perspective, all accesses of a single variable must be mapped to the same memory location, regardless of where that variable sits relative to the top of the stack. It's even possible for the stack location of a single variable to be created at multiple separate places in the code, depending on the path execution takes.

To illustrate these problems, a few examples:

// var1 * 0x1E
CPTOPSP FFFFFFEC, 0004
CONSTI 0000001E
MULII
// + var2
CPTOPSP FFFFFFEC, 0004
ADDII
// var3 =
CPDOWNSP FFFFFFF0, 0004
MOVSP FFFFFFFC


Here, we can see two stack locations created that do not represent variables at all, but rather temporary parameters to the MUL instruction. A third temporary that is also not a real variable is created as the return value of the MUL instruction, which is then used as a parameter to the ADD instruction along with a fourth temporary copied from a variable. Finally, a fifth temporary is created for the return value of the ADD instruction, which is copied down to a result variable and destroyed. The C code for this would be var3 = (var1 * 0x1E) + var2;

// if (3 == 0)
00000045 CONSTI 00000003
0000004B CONSTI 00000000
00000051 EQUALII
00000053 JZ off_00000065
// true: var = 1
00000059 CONSTI 00000001
0000005F JMP off_0000006B
// false: var = 2
00000065 CONSTI 00000002
0000006B

In this second case, we have an optimized variable initialization where a stack location is created at two separate points (depending on whether a condition is true), but in reality represents only a single variable. This is from a script that contains int g_i = (3 == 0 ? 1 : 2);

It may also be possible to use the stack as, well, a stack: building up a variable-length list of things as they're processed in one loop, then popping them in a second loop. I can't think of a practical use for this off the top of my head (something that is both useful and couldn't be accomplished any other way), but it's possible in principle.

So, the stack system poses challenges related to identifying variables. There is a second, less obvious problem as well, though it's not as specific to stack machines: function calls. In NWScript, a call consists of:
  1. Reserve a temporary slot on the stack for the return value
  2. Push the parameters in reverse order
  3. Call the function
  4. Copy and pop the return value
As implied, the caller allocates space for the return value, and the called function removes all parameters before returning. This poses a problem for us: we need to track stack positions so that we can locate variables, but keeping the stack aligned is impossible as long as we have no idea what a function call will do to the stack pointer.
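
To make the bookkeeping concrete, here's a minimal sketch of the arithmetic in C++, with hypothetical names; it illustrates the convention, not the compiler's actual code:

// NWScript stack cells are 4 bytes each; sizes below are in cells.
struct FunctionInfo {
    int paramCells = 0;    // cells the callee pops before returning
    int returnCells = 0;   // cells the caller reserves beforehand
    bool analyzed = false; // do we know paramCells yet?
};

// Net stack effect of the call instruction itself: the callee removes
// its own parameters, so the stack shrinks by paramCells. Until the
// callee has been analyzed this value is unknown, and the analysis
// task must block (see below).
int callDisplacement(const FunctionInfo &callee) {
    return -callee.paramCells;
}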

The solution I came up with for this problem (as well as others, which I'll talk about in the future) was a two-pass system that performs analysis and intermediate code generation (intermediate code is an expanded form that deals with variable references rather than the stack, making it much easier to work with; this intermediate code is then converted to the output code type in a third pass). The two passes could have been folded into a single pass, but that would have been quite messy and needlessly memory-hungry compared to the more elegant and efficient two-pass system.

First Pass
The first pass through the script file performs a couple of tasks. It identifies functions and code flow blocks: the largest groups of instructions that execute as an atomic unit, with only a single execution path from the first to the last instruction in the block. It also determines the number of parameters and return values for each function (in NWScript, functions have only a single return value, but when the bytecode is generated, structs and other complex types are flattened to primitive types, making it look as though there are multiple return values).

This pass is quite simple. It proceeds by scanning through the code, beginning with the main function, parsing each instruction as it goes. The current stack pointer (though not the actual contents of the stack) is tracked from instruction to instruction. As well, a code flow graph is built by watching for branch instructions, creating new flow blocks and splitting old ones (if a branch jumps into the middle of an existing flow, that flow must be split, as it is not an atomic unit). Finally, entries for functions are created as they're encountered.
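
As an illustration of the splitting rule, here's a sketch in C++ (hypothetical types and container choices; the real implementation may well differ). A branch target landing inside a known flow forces that flow to be cut in two:

#include <cstdint>
#include <map>

// Hypothetical flow block: a half-open range [start, end) of bytecode
// offsets with a single execution path from first to last instruction.
struct Flow {
    uint32_t start, end;
    Flow *fallthrough = nullptr;
};

std::map<uint32_t, Flow> flows; // keyed by start offset

// Called for each branch target. If the target falls in the middle of
// an existing flow, that flow is no longer atomic: split it, with the
// first half falling through into the second.
Flow &flowAt(uint32_t target) {
    auto it = flows.upper_bound(target);
    --it; // the flow whose start <= target (assumed to exist)
    Flow &f = it->second;
    if (f.start == target)
        return f; // target already begins a flow
    Flow second{target, f.end, f.fallthrough};
    f.end = target;
    auto ins = flows.emplace(target, second).first;
    f.fallthrough = &ins->second;
    return ins->second;
}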

The biggest trick here is dealing with function calls. The effect of other operations on the stack can be trivially calculated by decoding the instruction, but we can't step past a call instruction until we know its stack displacement. The same would be true of calls to API functions, except that those form a closed set, so we already know exactly what they do to the stack.

What I decided to do was use a task queue. When a call to a new, previously unencountered function is found, a function entry and a corresponding task queue entry are created, and the current analysis task is marked as blocking on that function. Conditional branches are handled in much the same way: a new task is created for one branch (though not blocking on anything), and analysis continues down the other branch. Whenever the current task blocks (e.g. it calls an unanalyzed function), analysis switches to the first task in the queue that is not blocking on anything, and continues from there.
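
A sketch of that scheduling loop, again with hypothetical structures; the essentials are the blocked-on link and the scan for a runnable task:

#include <cstdint>
#include <deque>

struct Function {
    bool analyzed = false; // set once a return has been reached
};

// An analysis task: a resume position plus the tracked stack pointer.
struct Task {
    uint32_t pc;                   // next instruction to analyze
    int sp;                        // stack pointer, in cells
    Function *blockedOn = nullptr; // unanalyzed callee, if any
};

std::deque<Task> tasks;

// Pick the first task not waiting on an unanalyzed function. A null
// result means every task is stuck, which can only happen with an
// exit-less loop or recursion that never returns.
Task *nextRunnable() {
    for (Task &t : tasks)
        if (!t.blockedOn || t.blockedOn->analyzed)
            return &t;
    return nullptr;
}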

Tasks ultimately complete when a return is encountered. When that occurs, the number of parameters for the function is calculated from the difference in stack pointer between the beginning of the function and the end. The function is marked as analyzed, and any tasks blocking on it are unblocked and may potentially continue analysis. The number of return values is determined by observing the lowest stack address written to within the function. This requires that all paths through the function be evaluated, but as the return value does not affect the overall stack displacement, tasks blocked on the function may resume before the exact number of return values is known.
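
At the return itself, both counts fall out of the tracked stack pointer. A sketch, under the assumption that the stack pointer is measured in cells relative to the function's entry (so it starts at 0, with the parameters sitting just below it):

#include <algorithm>

struct FunctionCounts {
    int paramCells = 0;
    int returnCells = 0;
    int lowestCellWritten = 0; // most negative cell offset written,
                               // tracked across all paths
    bool analyzed = false;
};

// On reaching a return: the callee must have popped exactly its
// parameters, so the negated final stack pointer is the parameter
// count. Tasks blocked on this function can resume immediately.
void onReturn(FunctionCounts &f, int spAtReturn) {
    f.paramCells = -spAtReturn;
    f.analyzed = true;
}

// Once every path has been walked: anything written below the
// parameter cells must be the caller-reserved return slot.
void finishFunction(FunctionCounts &f) {
    f.returnCells = std::max(0, -f.lowestCellWritten - f.paramCells);
}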

While this can potentially generate a large number of blocked tasks in the queue, so long as the script does not contain a loop with no exit path or a function that calls itself recursively without a path that returns, this process is guaranteed to ultimately locate a function that can return. While functions may have multiple paths from beginning to end, and all must eventually be traversed, we need only a single path to a return to know the function's displacement, which unblocks the tasks waiting on that function. This discovery process cascades, and analysis of all routes through the script eventually completes.

Second Pass

After the completion of the first pass, we know all functions, all code flows, and the number of parameters and return values for each function. The second pass performs several tasks:
  • Determine the types and locations of local and global variables and create variable objects for them
  • Determine the types of function parameters and return values
  • Generate the intermediate code
The second pass is somewhat simpler in its code flow logic, but much more complicated in terms of the amount of work done. It too operates on a task queue, but has no concept of blocking - the queue exists merely to hold alternate paths for processing after the current path completes. It operates locally on a function, using the code flow graphs as rails to guide analysis, and processes each function completely before moving on to the next.

The code for this pass resembles a mini-interpreter (well, I suppose it is one), complete with a stack, though one filled with references to variable objects rather than values. A new variable object is created each time some manner of instruction pushes a variable onto the stack. To make things simpler during processing and final code generation, variables are explicitly created and destroyed with dedicated instructions in the intermediate code, unlike the implicit creation and destruction in the bytecode.

Because the stack is needed to determine which variables individual instructions reference, the flows of a function must, as in the first pass, be evaluated in control flow order, rather than simply walking the list of flows in arbitrary order (as we do with functions). The stack at the end of each flow is saved until analysis of the function completes, so that successor flows evaluated later can pick up the stack where their predecessor left off (and for one other reason I'll explain in a moment).
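
To give a flavor of the mini-interpreter, here's how a CPTOPSP might be handled in this scheme. This is a C++ sketch with hypothetical helpers (emitCreate and emitAssign stand in for whatever actually writes the intermediate code), not the compiler's actual source:

#include <vector>

struct Variable { /* type, merge links, etc. */ };

// The second pass's stack holds references to variable objects,
// not values.
using VarStack = std::vector<Variable *>;

void emitCreate(Variable *v);                  // emits CREATE v
void emitAssign(Variable *dst, Variable *src); // emits ASSIGN dst, src

// CPTOPSP copies a cell at a negative byte offset from the top of the
// stack back onto the top. In the second pass, that means: resolve
// which variable the offset names, then push a fresh temporary
// assigned from it. (The real instruction also takes a size; this
// sketch assumes a single 4-byte cell.)
void onCpTopSp(VarStack &stack, int byteOffset) { // e.g. -0x14 for FFFFFFEC
    Variable *src = stack[stack.size() + byteOffset / 4];
    Variable *tmp = new Variable;
    emitCreate(tmp); // explicit CREATE in the intermediate code
    emitAssign(tmp, src);
    stack.push_back(tmp);
}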

Most of the details of the second phase relate not to the topic of this post, but to later ones. One detail worth mentioning here, however, is that after all flows for a function have been evaluated, stack merging occurs. Because a stack location may be created at two separate points that together form a single variable (as in the example above), when multiple flows converge on a single successor flow, the stacks from each of the predecessor flows must be merged: for each offset in the stacks, the variables at the corresponding positions are merged, eliminating all but one copy. Because the intermediate code representation is flexible, variable merging is trivial, requiring no modification of the intermediate code itself.
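
A sketch of the merge step at a join point; mergeVariables is hypothetical, standing in for whatever folds two variable objects into one (cheap precisely because the intermediate code refers to variable objects rather than stack offsets):

#include <cassert>
#include <cstddef>
#include <vector>

struct Variable;
using VarStack = std::vector<Variable *>;

void mergeVariables(Variable *a, Variable *b); // fold b into a

// When two predecessor flows converge on one successor, their saved
// exit stacks describe the same storage, so each pair of corresponding
// entries must denote one and the same variable.
void mergeStacks(VarStack &a, const VarStack &b) {
    assert(a.size() == b.size()); // the SP must agree at the join point
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] != b[i])
            mergeVariables(a[i], b[i]);
}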

However, for this and other reasons, the compiler does not currently support variable use of the stack - the case mentioned earlier, where the stack is extended to a variable length by a loop or some such. There is no way to generate such a structure from NWScript (the scripting language), and hand-written bytecode files are somewhere between incredibly rare and nonexistent, so we considered that a reasonable omission.

One really cool thing about this system is that the host program can be a hybrid of interpretation and compilation. E.g. compilation of all scripts can be offloaded to a worker thread when a module is loaded, compiling in the background, while the host program uses the interpreter for each script until that script's compilation finishes. The interpreter can also serve as a fallback for scripts the compiler is unable to compile for one reason or another, such as those with evil stack usage.
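
For instance, a host could gate each script on an atomic flag along these lines; this is a sketch of the idea, not any particular engine's API:

#include <atomic>

struct VMState; // whatever state the interpreter runs against
using CompiledFn = void (*)(VMState &);

struct Script {
    std::atomic<CompiledFn> compiled{nullptr}; // set by the worker thread
};

void interpret(Script &s, VMState &vm); // the existing interpreter

// Run a script: use the compiled body if the background compile has
// finished (and succeeded), otherwise fall back to the interpreter.
void runScript(Script &s, VMState &vm) {
    if (CompiledFn fn = s.compiled.load(std::memory_order_acquire))
        fn(vm);
    else
        interpret(s, vm);
}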

Below is an example of a random block (a flow, to be precise) of bytecode and its corresponding intermediate code. Note that the addresses differ by 0xD because the bytecode listing was produced by a disassembler that includes the file header in its addresses:

00005282 ACTION GetTimeHour(0010), 00
00005287 CPTOPSP FFFFFF34, 0004
0000528F ACTION StringToInt(00E8), 01
00005294 ACTION GetTimeHour(0010), 00
00005299 SUBII
0000529B ADDII
0000529D CPDOWNSP FFFFFF00, 0004
000052A5 MOVSP FFFFFFFC
000052AB CONSTI 00000000
000052B1 CONSTI 00000000
000052B7 CONSTI 00000000
000052BD CPTOPSP FFFFFEF8, 0004
000052C5 ACTION SetTime(000C), 04
000052CA CPTOPSP FFFFFF04, 0004
000052D2 CONSTI 00000000
000052D8 LTII
000052DA JZ off_00005304


00005275: CREATE 01F36000 ( temp SSA int )
00005275: ACTION 0010 (0) 01F36000 ( temp SSA int )
0000527A: CREATE 01F36030 ( temp SSA string )
0000527A: ASSIGN 01F36030 ( temp SSA string ), 02102C10 ( string )
00005282: CREATE 01F36060 ( temp SSA int )
00005282: ACTION 00E8 (1) 01F36060 ( temp SSA int ), 01F36030 ( temp SSA string )
00005282: DELETE 01F36030 ( temp SSA string )
00005287: CREATE 01F36090 ( temp SSA int )
00005287: ACTION 0010 (0) 01F36090 ( temp SSA int )
0000528C: CREATE 01F360C0 ( temp SSA int )
0000528C: SUB 01F360C0 ( temp SSA int ), 01F36060 ( temp SSA int ), 01F36090 ( temp SSA int )
0000528C: DELETE 01F36090 ( temp SSA int )
0000528C: DELETE 01F36060 ( temp SSA int )
0000528E: CREATE 01F360F0 ( temp SSA int )
0000528E: ADD 01F360F0 ( temp SSA int ), 01F36000 ( temp SSA int ), 01F360C0 ( temp SSA int )
0000528E: DELETE 01F360C0 ( temp SSA int )
0000528E: DELETE 01F36000 ( temp SSA int )
00005290: ASSIGN 02102910 ( int ), 01F360F0 ( temp SSA int )
00005298: DELETE 01F360F0 ( temp SSA int )
0000529E: CREATE 01F36150 ( temp SSA int )
0000529E: ASSIGN 01F36150 ( temp SSA int ), 01F36120 ( const int )
000052A4: CREATE 01F361B0 ( temp SSA int )
000052A4: ASSIGN 01F361B0 ( temp SSA int ), 01F36180 ( const int )
000052AA: CREATE 01F36210 ( temp SSA int )
000052AA: ASSIGN 01F36210 ( temp SSA int ), 01F361E0 ( const int )
000052B0: CREATE 01F36240 ( temp SSA int )
000052B0: ASSIGN 01F36240 ( temp SSA int ), 02102910 ( int )
000052B8: ACTION 000C (4) 01F36240 ( temp SSA int ), 01F36210 ( temp SSA int ), 01F361B0 ( temp SSA int ), 01F36150 ( temp SSA int )
000052B8: DELETE 01F36240 ( temp SSA int )
000052B8: DELETE 01F36210 ( temp SSA int )
000052B8: DELETE 01F361B0 ( temp SSA int )
000052B8: DELETE 01F36150 ( temp SSA int )
000052BD: CREATE 01F36270 ( temp SSA int )
000052BD: ASSIGN 01F36270 ( temp SSA int ), 02102910 ( int )
000052C5: CREATE 01F362D0 ( temp SSA int )
000052C5: ASSIGN 01F362D0 ( temp SSA int ), 01F362A0 ( const int )
000052CB: CREATE 01F36300 ( temp SSA int )
000052CB: LT 01F36300 ( temp SSA int ), 01F36270 ( temp SSA int ), 01F362D0 ( temp SSA int )
000052CB: DELETE 01F362D0 ( temp SSA int )
000052CB: DELETE 01F36270 ( temp SSA int )
000052CD: TEST 01F36300 ( temp SSA int )
000052CD: DELETE 01F36300 ( temp SSA int )
000052CD: JZ 000052F7


And lastly, perhaps getting ahead of myself, the same segment after running it through my variable optimization pass:

00005275: CREATE 01F36000 ( temp SSA int )
00005275: ACTION 0010 (0) 01F36000 ( temp SSA int )
00005282: CREATE 01F36060 ( temp SSA int )
00005282: ACTION 00E8 (1) 01F36060 ( temp SSA int ), 02102C10 ( merged string )
00005287: CREATE 01F36090 ( temp SSA int )
00005287: ACTION 0010 (0) 01F36090 ( temp SSA int )
0000528C: CREATE 01F360C0 ( temp SSA int )
0000528C: SUB 01F360C0 ( temp SSA int ), 01F36060 ( temp SSA int ), 01F36090 ( temp SSA int )
0000528C: DELETE 01F36090 ( temp SSA int )
0000528C: DELETE 01F36060 ( temp SSA int )
0000528E: ADD 02102910 ( merged int ), 01F36000 ( temp SSA int ), 01F360C0 ( temp SSA int )
0000528E: DELETE 01F360C0 ( temp SSA int )
0000528E: DELETE 01F36000 ( temp SSA int )
000052B8: ACTION 000C (4) 02102910 ( merged int ), 01F361E0 ( merged const int ), 01F36180 ( merged const int ), 01F36120 ( merged const int )
000052CB: CREATE 01F36300 ( temp SSA int )
000052CB: LT 01F36300 ( temp SSA int ), 02102910 ( merged int ), 01F362A0 ( merged const int )
000052CD: TEST 01F36300 ( temp SSA int )
000052CD: DELETE 01F36300 ( temp SSA int )
000052CD: JZ 000052F7


For those keeping score, that's 17 instructions in the original bytecode, 43 in the raw intermediate code (26 of which are creates or deletes, and 9 of which are assigns to temporary variables), and 19 in the optimized intermediate code (10 of which are creates or deletes; if that still sounds like too many, note that all of those variables will be converted to registers by LLVM).
