A few weeks ago - before I got into cache-oblivious algorithms - I briefly had a different topic du jour. In particular, I had just learned about the Low-Level Virtual Machine (LLVM), and spent a couple of days researching it.
LLVM is a bit hard to explain, because the boundary around what it is is rather fuzzy. To sum it up in a noun phrase, I'd say that LLVM is a modular compilation system and virtual machine. I say "compilation system" because it isn't a complete compiler by itself, and a lot of it falls outside the traditional definition of a compiler.
The core idea is that you have a program in some programming language which is compiled into a platform and language independent intermediate format. From there, the program is translated into some machine code (whatever type is used on the machine it's being run on) at a later time, either during execution (just-in-time compiling) or prior to execution (e.g. an installer converting the application to native code when the program is installed), and executed. In that sense there's some similarity to the .NET platform and Java platform, though it isn't totally analogous to those.
LLVM is a collection of modules that make the various parts of this process possible. Taken together, the components form a redundant superset of a full system (some overlap in function), so you can choose the ones you want for a particular purpose and leave the rest.
Going chronologically through that process, the first module in LLVM is its assembler. LLVM does not directly parse and compile any particular programming language; to build a working compiler, you must supply a language parser that processes a program and performs any language-specific optimizations. LLVM takes as its input a program in LLVM assembly, plus various metadata that could be useful later in the process, and assembles it into language and platform independent LLVM intermediate bytecode. Open-source parsers already exist for C, C++, Objective-C, and a number of other languages; there are even parsers under development that take .NET and Java bytecode as input.
The LLVM assembly language differs substantially from the Java and .NET bytecode languages, and bears some resemblance to RISC assembly languages such as MIPS and PPC. LLVM assembly provides an infinite set of Static Single Assignment (SSA) registers: each register can be assigned only once, and a new register must be allocated for the result of every computation. However, LLVM also stores a significant amount of metadata in the bytecode, such as call graphs and other information at a higher level than a simple assembly language (language-dependent metadata provided by the compiler is also stored for later use by language-dependent modules, e.g. if a particular language has a runtime environment that needs more information than just the native code). This combination of the assembly language design and rich metadata allows the optimizer and native code generator to act much more effectively, in a language-independent manner, than if they were working on a bare assembly language. Yet the bytecode remains compact enough that it is not significantly larger than the equivalent native code alone.
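To make the SSA idea concrete, here's a sketch of what a tiny function might look like in LLVM assembly (the exact syntax has varied across LLVM versions, so treat this as illustrative rather than definitive):

```llvm
; compute (a + b) * (a - b)
define i32 @example(i32 %a, i32 %b) {
entry:
  %sum  = add i32 %a, %b       ; each register (%sum, %diff, %prod) is
  %diff = sub i32 %a, %b       ; assigned exactly once; every new result
  %prod = mul i32 %sum, %diff  ; gets a fresh register
  ret i32 %prod
}
```

Note that no register is ever reused: rather than overwriting %sum, each computation introduces a new name, which is exactly what makes the flow of data so easy for later passes to analyze.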
The next step in the process is global optimization. LLVM implements a substantial number of language-independent optimization algorithms, and compiler writers can use some or all of them as they like. Because of the metadata stored with the assembly language, LLVM is not limited to dealing simply with assembly, and the effectiveness of its optimization algorithms resembles that of a standalone compiler (e.g. Visual C++). Even better, because optimization occurs after the entire program has been assembled, LLVM can do whole-program optimizations (known as link-time code generation in VC++ and others) that are outside the ability of simple compilers that only look at a function or a source module at a time.
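As a hypothetical illustration of what whole-program optimization buys you, suppose a function @scale (a made-up name for this sketch) lives in a different source module than its caller. A compiler that sees one module at a time has no choice but to emit a call; an optimizer that sees the whole program in bytecode can do much better (again, a sketch, not actual optimizer output):

```llvm
; Before whole-program optimization: @scale is defined in another
; module, so a per-module compiler can only emit a call to it.
declare i32 @scale(i32)

define i32 @callsite() {
  %r = call i32 @scale(i32 7)
  ret i32 %r
}

; After link-time optimization, with the whole program visible, the
; optimizer can see that @scale (say) just multiplies by 3, inline the
; call, and fold the arithmetic down to a constant:
;
;   define i32 @callsite() {
;     ret i32 21
;   }
```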
After optimization comes native code generation (or, alternately, you could interpret the bytecode without producing native code). This step takes the assembly and metadata that have undergone language and platform independent optimization, performs platform-specific optimizations, and emits executable machine code, which is then executed. Native code generators are language-independent, interchangeable modules invoked by LLVM, so given a program in LLVM bytecode, native code can be generated for any platform for which a native code generator exists (x86 and PPC are a couple that are already supported with open-source implementations).
Finally, LLVM provides a runtime system that allows various runtime tasks to be performed. We'll get to the other tasks later, but one worth mentioning here is that this runtime system allows language or platform specific runtime environments to interface with LLVM-generated native code, as well as with system or dynamic link libraries. The language-dependent metadata provided by the language compiler back at compile time is also available for use here.
So, that's the basic process. However, since I've been describing the logical process and not the details, this description is much more linear than reality. As mentioned, LLVM provides a number of modules that may be optionally used, some of them redundant, and there's also flexibility in how the modules interact.
LLVM especially provides a great deal of flexibility in the timing of optimization and native code generation. While LLVM can be used like a normal compiler, taking in LLVM assembly and generating an executable with native code that can be directly executed on the target platform, it does not need to be. It's equally possible to store the LLVM bytecode in a platform-independent "executable" and optimize it and compile it to native code at a later point, either on execution or prior to execution. LLVM can even compile individual functions to native code the first time they are called, allowing some portions of the program to potentially never be compiled at all.
Finally, LLVM also supports profiling and dynamic recompilation of a program. The LLVM runtime is capable of profiling the native code of the program at runtime, determining where the program spends most of its time, what functions are executed most often, etc. based on actual usage. It can then recompile the program from bytecode, either between executions or during idle time, using this profile data to perform optimizations based on actual usage.
So, that's squishy. But why should you care? Well, a few reasons. Probably the most attractive is that you can make a full, optimizing compiler for a language that will generate native code for many platforms simply by writing a parser and LLVM assembly generator - thousands of lines of code compared to hundreds of thousands or millions (see here for one example of this). Alternately, if you're designing a processor for some reason (work, school, or hobby), you can write a native code generator and instantly support all the languages for which parsers already exist. And then there are some less obvious uses, such as using the metadata generated by LLVM to perform complex analysis of source code for the purpose of searching for bugs, security holes, etc.
LLVM is a free open-source project that may be used in free or commercial projects, open-source or proprietary. Its largest benefactor is Apple, who uses it extensively. Other companies such as Adobe also use it in various ways. For more information, see this brief summary of the LLVM system, and the LLVM web page.