Search This Blog

Wednesday, August 04, 2010

NWScript Stack System

Note: SW has written a new post, talking about NWScript in general, akin to my post yesterday (though possibly in more detail than mine; I haven't actually read it all yet).

So, now we know where we are (the NWScript system), and where we need to go (LLVM assembly). Now, how do we get there?

Well, a lot of things are simple enough, requiring only trivial use of LLVM and a parser. But after investigating the NWScript system for this purpose, I identified several areas that would be nontrivial and would require some thought and effort, which I'll discuss.

The first challenge - that is, a nontrivial design issue - was the stack machine. Stack machines are easy to interpret, but harder to convert to something like LLVM assembly that uses registers (like real assembly languages) and strongly declared and typed variables (unlike real assembly languages). Stack machines use a huge number of stack locations during execution, the vast majority of them (more than 96% of them, in one sample I looked at) temporary copies that last only until the next math (or other) instruction. From the opposite perspective, it's necessary that all accesses of a single variable be mapped to the same memory location, regardless of where the variable is, relative to the top of the stack. It's even possible that multiple separate locations in the code could create a stack location that represents only a single variable, despite being created in multiple places.

To illustrate these problems, a few examples:

// var1 * 0x1E
CONSTI 0000001E
// + var2
// var3 =

Here, we can see two stack locations are created that do not represent variables at all, but rather temporary parameters to the MUL instruction. A third temporary that also is not a real variable is created as the return value to the MUL instruction, which is then used as a parameter to the ADD instruction along with a fourth temporary copied from a variable. Finally, a fifth temporary is created for the return value from the ADD instruction, which is copied down to a result variable and destroyed. The C code for this would be var3 = (var1 * 0x1E) + var2;

// if (3 == 0)
00000045 CONSTI 00000003
0000004B CONSTI 00000000
00000051 EQUALII
00000053 JZ off_00000065
// true: var = 1
00000059 CONSTI 00000001
0000005F JMP off_0000006B
// false: var = 2
00000065 CONSTI 00000002

In this second case, we have an optimized variable initialization where a stack location is created in two separate locations (depending on whether a condition is true), but is in reality only a single variable. This is from a script that has int g_i = (3 == 0 ? 1 : 2);

It may also be possible to use the stack as, well, a stack, building up a variable-length list of things as they're processed in one loop, then popping them in a second loop. I can't think of a practical use for this off the top of my head (something that is both useful and couldn't be accomplished any other way), but it might be possible.

So, the stack system poses challenges related to identifying variables. However, a second, less obvious problem exists, although it's not as specific to stack machines. In this case, the problem relates to functions. In NWScript, a call consists of:
  1. Reserve a temporary slot on the stack for the return value
  2. Push the parameters in reverse order
  3. Call the function
  4. Copy and pop the return value
As implied, the caller allocates space for the return value, and the called function must remove all parameters before return. This causes a problem for us, because we need to keep track of stack positions so that we can locate variables, and alignment of the stack would be impossible as long as we have no idea what a function call will do to the stack.

The solution I came up with to this problem (as well as other problems, which I'll talk about in the future), was a two-pass system that performs analysis and intermediate code generation (intermediate code is an expanded form which deals with variable references rather than the stack, making it much easier to work with; this intermediate code would then be converted to the output code type in a third pass). The two passes could have been done in a single pass, but this would have been quite messy and needlessly memory-hungry, as opposed to the more elegant and efficient two-pass system.

First Pass
The first pass through the script file performs a couple tasks. It identifies functions and code flow blocks: the largest groups of instructions that execute as an atomic unit and have only a single execution path from the first to the last instruction in the block. It also determines the number of parameters and return values for each function (in NWScript, functions only have a single return value, but when the bytecode is generated, structs and other complex types are flattened to primitive types, making it look like there are multiple return values).

This pass is quite simple. It proceeds by scanning through the code, beginning with the main function, parsing each instruction as it goes. The current stack pointer (but not an actual stack) is maintained from instruction to instruction. As well, a code flow graph is created by watching for branch instructions and creating new flow blocks and splitting old ones (if a branch jumps into the middle of an existing flow, that flow must be split, as it's not an atomic unit). Finally, entries for functions are created as they're encountered.

The biggest trick here is dealing with function calls. The effect of other operations on the stack can be trivially calculated by decoding the instruction, but we can't step past a call instruction until we know the stack displacement. The same is true for calls to API functions, but we already know exactly what they do to the stack, as those are a closed set.

What I decided to do was to use a task queue. When a new, previously unencountered function is encountered in a call, a new function entry and task queue entry corresponding to the function are created, and the current analysis task is marked as blocking on that function. Conditional branches are handled more or less the same way: a new task is created for one branch, though not blocking on anything, and analysis continues at the other branch. Whenever the current task blocks (e.g. it calls an unanalyzed function), analysis switches to the first task in the queue that is not blocking on anything, and execution continues from there.

Tasks ultimately complete when a return is encountered. When that occurs, the number of parameters for the function is calculated by the difference in stack pointer from the beginning of the function to the end. The function is marked as analyzed, and any tasks blocking on the function are unblocked and may potentially continue analysis. The number of return values is determined by observing the lowest stack address written to within the function. This requires that all paths through a function be evaluated, but as the return value does not affect overall stack displacement, tasks blocking on the function may resume prior to determining the exact number of return values.

While this can potentially generate a large number of blocked tasks in the queue, so long as the script does not contain a loop that has no exit path or a function that calls itself recursively without a path that returns, this is guaranteed to ultimately locate a function that can return. While functions may have multiple paths from beginning to end, and all must eventually be traversed, we need only a single path to return to know the function's displacement, resulting in unblocking of tasks waiting on that function. This discovery process cascades, and analysis of all routes through the script eventually completes.

Second Pass

After the completion of the first pass, we now know all functions, code flows, and the number of parameters and return values for each function. The second pass performs several functions:
  • Determine the types and locations of local and global variables and create variable objects for them
  • Determine the types of function parameters and return values
  • Generate the intermediate code
The second pass is somewhat simpler in code flow logic, but much more complicated in terms of the amount of work done. It too operates on a task queue system, but has no concept of blocking - the task queue exists merely to queue up alternate paths for processing after the current path completes. It operates locally on a function using the code flow graphs as rails to guide analysis, processing each function completely before moving on to the next.

The code for this pass resembles a mini-interpreter (well, I suppose it is), including a stack filled with references to the variable objects at that location. A new variable object is created each time some manner of instruction pushes a variable on the stack. To make things simpler during processing and final code generation, variables are explicitly created and destroyed with dedicated instructions in the intermediate code, unlike the implicit generation/destruction in the bytecode.

Because the stack is necessary to determine what variables individual instructions reference, functions must, as in the first pass, be executed in control flow order, rather than simply evaluating each flow in the list in random order (as we do with functions). The stack at the end of each flow is saved until analysis of the function completes, so that successor flows evaluated later can pick up the stack where the previous flow left off (and one other reason I'll explain in a second).

Most of the details of the second phase relate not to the topic of this post, but to later ones. One detail worth mentioning, however, is that after all flows for a function have been evaluated, stack merging occurs. Because it's possible that a stack location might be created in two separate locations that together form a single variable (as in the example above), when multiple flows converge in a single successor flow, the stacks from each of the predecessor flows must be merged: for each offset in the stacks, the variables in each corresponding position are merged, eliminating all but one copy. Because the intermediate code representation is flexible, variable merging is trivial, and requires no modification of the intermediate code itself.

However, for this and other reasons, this compiler does not, as of now, support variable use of the stack: when the stack is extended to a variable length by a loop or some such, as mentioned earlier. There is no way to generate such a structure in NWScript (the scripting language), and manually-written bytecode files are anywhere between incredibly rare and non-existent, so we thought that was a reasonable omission.

One really cool thing about this system, however, is that the host program can be a hybrid of interpretation and compilation. E.g. the compilation of all scripts can be offloaded to a worker thread when a module is loaded to compile in the background, and the host program can use the interpreter for a script until compilation of that script finishes. The interpreter could also be used as a fall-back for scripts the compiler is unable to compile for one reason or another, such as those with evil stack usage.

Below is an example of a random block (a flow, to be precise) of bytecode and its corresponding intermediate code. Note that the addresses differ by 0xD because the bytecode block is generated by a disassembler that includes the file header in the address:

00005282 ACTION GetTimeHour(0010), 00
00005287 CPTOPSP FFFFFF34, 0004
0000528F ACTION StringToInt(00E8), 01
00005294 ACTION GetTimeHour(0010), 00
00005299 SUBII
0000529B ADDII
0000529D CPDOWNSP FFFFFF00, 0004
000052AB CONSTI 00000000
000052B1 CONSTI 00000000
000052B7 CONSTI 00000000
000052C5 ACTION SetTime(000C), 04
000052CA CPTOPSP FFFFFF04, 0004
000052D2 CONSTI 00000000
000052D8 LTII
000052DA JZ off_00005304

00005275: CREATE 01F36000 ( temp SSA int )
00005275: ACTION 0010 (0) 01F36000 ( temp SSA int )
0000527A: CREATE 01F36030 ( temp SSA string )
0000527A: ASSIGN 01F36030 ( temp SSA string ), 02102C10 ( string )
00005282: CREATE 01F36060 ( temp SSA int )
00005282: ACTION 00E8 (1) 01F36060 ( temp SSA int ), 01F36030 ( temp SSA string )
00005282: DELETE 01F36030 ( temp SSA string )
00005287: CREATE 01F36090 ( temp SSA int )
00005287: ACTION 0010 (0) 01F36090 ( temp SSA int )
0000528C: CREATE 01F360C0 ( temp SSA int )
0000528C: SUB 01F360C0 ( temp SSA int ), 01F36060 ( temp SSA int ), 01F36090 ( temp SSA int )
0000528C: DELETE 01F36090 ( temp SSA int )
0000528C: DELETE 01F36060 ( temp SSA int )
0000528E: CREATE 01F360F0 ( temp SSA int )
0000528E: ADD 01F360F0 ( temp SSA int ), 01F36000 ( temp SSA int ), 01F360C0 ( temp SSA int )
0000528E: DELETE 01F360C0 ( temp SSA int )
0000528E: DELETE 01F36000 ( temp SSA int )
00005290: ASSIGN 02102910 ( int ), 01F360F0 ( temp SSA int )
00005298: DELETE 01F360F0 ( temp SSA int )
0000529E: CREATE 01F36150 ( temp SSA int )
0000529E: ASSIGN 01F36150 ( temp SSA int ), 01F36120 ( const int )
000052A4: CREATE 01F361B0 ( temp SSA int )
000052A4: ASSIGN 01F361B0 ( temp SSA int ), 01F36180 ( const int )
000052AA: CREATE 01F36210 ( temp SSA int )
000052AA: ASSIGN 01F36210 ( temp SSA int ), 01F361E0 ( const int )
000052B0: CREATE 01F36240 ( temp SSA int )
000052B0: ASSIGN 01F36240 ( temp SSA int ), 02102910 ( int )
000052B8: ACTION 000C (4) 01F36240 ( temp SSA int ), 01F36210 ( temp SSA int ), 01F361B0 ( temp SSA int ), 01F36150 ( temp SSA int )
000052B8: DELETE 01F36240 ( temp SSA int )
000052B8: DELETE 01F36210 ( temp SSA int )
000052B8: DELETE 01F361B0 ( temp SSA int )
000052B8: DELETE 01F36150 ( temp SSA int )
000052BD: CREATE 01F36270 ( temp SSA int )
000052BD: ASSIGN 01F36270 ( temp SSA int ), 02102910 ( int )
000052C5: CREATE 01F362D0 ( temp SSA int )
000052C5: ASSIGN 01F362D0 ( temp SSA int ), 01F362A0 ( const int )
000052CB: CREATE 01F36300 ( temp SSA int )
000052CB: LT 01F36300 ( temp SSA int ), 01F36270 ( temp SSA int ), 01F362D0 ( temp SSA int )
000052CB: DELETE 01F362D0 ( temp SSA int )
000052CB: DELETE 01F36270 ( temp SSA int )
000052CD: TEST 01F36300 ( temp SSA int )
000052CD: DELETE 01F36300 ( temp SSA int )
000052CD: JZ 000052F7

And lastly, perhaps getting ahead of myself, the same segment after running it through my variable optimization pass:

00005275: CREATE 01F36000 ( temp SSA int )
00005275: ACTION 0010 (0) 01F36000 ( temp SSA int )
00005282: CREATE 01F36060 ( temp SSA int )
00005282: ACTION 00E8 (1) 01F36060 ( temp SSA int ), 02102C10 ( merged string )
00005287: CREATE 01F36090 ( temp SSA int )
00005287: ACTION 0010 (0) 01F36090 ( temp SSA int )
0000528C: CREATE 01F360C0 ( temp SSA int )
0000528C: SUB 01F360C0 ( temp SSA int ), 01F36060 ( temp SSA int ), 01F36090 ( temp SSA int )
0000528C: DELETE 01F36090 ( temp SSA int )
0000528C: DELETE 01F36060 ( temp SSA int )
0000528E: ADD 02102910 ( merged int ), 01F36000 ( temp SSA int ), 01F360C0 ( temp SSA int )
0000528E: DELETE 01F360C0 ( temp SSA int )
0000528E: DELETE 01F36000 ( temp SSA int )
000052B8: ACTION 000C (4) 02102910 ( merged int ), 01F361E0 ( merged const int ), 01F36180 ( merged const int ), 01F36120 ( merged const int )
000052CB: CREATE 01F36300 ( temp SSA int )
000052CB: LT 01F36300 ( temp SSA int ), 02102910 ( merged int ), 01F362A0 ( merged const int )
000052CD: TEST 01F36300 ( temp SSA int )
000052CD: DELETE 01F36300 ( temp SSA int )
000052CD: JZ 000052F7

For those keeping score, that's 17 instructions in the original bytecode, 43 in the raw intermediate code (26 that are create or destroy, 9 that are assigns to temporary variables), and 19 in the optimized intermediate code (10 that are create or destroy; if that still sounds like too many, note that all of those variables will be converted to registers in LLVM).

Monday, August 02, 2010

& NWScript

The attentive might have wondered why I bothered to spend a moderate amount of timewriting about a topic I'd already moved past, which is kind of uncharacteristic of me. As I said before, I had looked at LLVM for a couple days a couple weeks prior to the corresponding blog post, then moved on (I can't recall if there was anything in between LLVM and cache-oblivious algorithms).

Well, part of the reason I didn't spend too much time on LLVM was because while it's neat, I didn't have an immediate use for it. However, a few days before I wrote said post Skywing randomly came up to me and asked if I wanted to write a native code compiler for NWScript, the scripting language in NeverWinter Nights 1 and 2, for the client and server he's been writing. Suddenly I had an immediate and interesting use for LLVM.

NWScript (for more information see this) is a language based on C, and supports a subset of the features. It supports only a small number of data types: int, float, string, and a number of opaque types that are more or less handles to game structures accessed solely via APIs. It also supports user-defined structures, including the built-in type vector (consisting of 3 floats), though it does not support arrays, pointers, etc. All variables and functions are strongly typed, although some game objects - those referred to with handles, such as the "object" type - may not be fully described by their scripting type. NWScript has a special kind of callback pointer called an action (a term which, confusingly, has at least two additional meanings in NWScript), which is more or less a deferred function call that may be invoked in the future - specifically, by the engine - and consists of a function pointer and the set of arguments to pass to that function (a call to an action takes no parameters from the caller itself).

NWScript is compiled by the level editor into a bytecode that is interpreted by a virtual machine within the games. Like the Java and .NET VMs, the VM for NWScript is a stack system, where a strongly-typed stack is used for opcode and function parameters and return values. Parameters to assembly instructions, as with function calls, are pushed on the top of the stack, and instructions and function calls pop the parameters and, if there is one, push the result/return value onto the top of the stack. Data down the stack is copied to the top of the stack with load operations, and copied from the top of the stack to locations further down by store operations.

Unlike those VMs, however, NWScript uses a pure, integrated stack system, implementing global and local variables through the stack, as well. Global variables are accessed through instructions that load/store values at indices relative to the global variable pointer BP, which is an index into the stack. Normally, global variables are at the very bottom of the stack, pushed in a wrapper function called #globals before the script's main function is called; there are opcodes to alter BP, though this is not known to ever be used. Local variables are accessed by loads and stores relative to the top of the stack, just as with the x86 (ignoring the CPU registers). Unlike the x86, however, return addresses are not kept on the stack, but like Itanium are maintained in a separate, program-inaccessible stack.

Oh, and for those wondering, the Java and .NET VMs have separate mechanisms for global and local variables that do not use the stack. This is actually rather surprising - you would expect them to use the stack at least for local variables, as actual processor architectures do (when they run out of registers to assign).

The following show a couple random pieces of NWScript bytecode with some comments:
; Allocate int and object local variables on the stack
0000007F 02 03 RSADDI
00000081 02 06 RSADDO
; Push parameter 0
00000083 04 06 00000000 CONSTO 00000000
; Call game API function
00000089 05 00 0360 01 ACTION GetControlledCharacter(0360), 01
; Copy return value to the object local variable
0000008E 01 01 FFFFFFF8 0004 CPDOWNSP FFFFFFF8, 0004
; Pop the return value

; Allocate float on stack
00000015 02 04 RSADDF
; Incredibly stupid way of setting the float to 3.0f
00000017 04 04 40400000 CONSTF 3.000000
0000001D 01 01 FFFFFFF8 0004 CPDOWNSP FFFFFFF8, 0004
; Assign the current stack frame to BP
0000002B 2A 00 SAVEBP
; Call main()
0000002D 1E 00 00000010 JSR fn_0000003D
; Restore BP to what it was before (not that it was anything)
00000033 2B 00 RESTOREBP
; Clean up the global variables
0000003B 20 00 RETN

; Push last two parameters to FloatToString
0000043C 04 03 00000009 CONSTI 00000009
00000442 04 03 00000012 CONSTI 00000012
; Divide 5.0f by 2.0f
00000448 04 04 40A00000 CONSTF 5.000000
0000044E 04 04 40000000 CONSTF 2.000000
00000454 17 21 DIVFF
; Return value of DIVFF is the first parameter to FloatToString
00000456 05 00 0003 03 ACTION FloatToString(0003), 03
; Return value is second parameter to SendMessageToPC. Push first parameter.
0000045B 04 06 00000000 CONSTO 00000000
00000461 05 00 0176 02 ACTION SendMessageToPC(0176), 02
Stack systems - especially this one, where everything is held on the stack - make for VMs that are very simple and easy to implement, and Merlin told me in the past that they're easy to compile to native code, as well. So it's not surprising that the developers used this architecture for NWScript.

You might notice that Skywing has started a series of his own on the project, today. I actually wrote this post several weeks ago, but was waiting until the project was done to post about it. But now that it looks like he's gonna cover it, I need to start posting or I'll end up posting after him despite starting before him.