Search This Blog

Friday, December 28, 2007

Musings on the MoPaQ Format

So, Q survived finals and term papers, though not unscathed. Now for something completely different, and pseudo-random.

A couple weeks ago, BahamutZero and I were talking about the MPQ file format. I can't remember exactly what we were talking about, but he mentioned that it didn't seem all that good. It occurred to me that while I'm probably the most knowledgeable person on the format outside Blizzard (excluding, of course, the author of the format, who no longer works for Blizzard), I'd never really formed an opinion on how good or bad it was in general. That gave me the idea to write a critique of the format - at least, of the more interesting things in the format - on my blog.

Hashing

One of the more distinctive features of MPQs is the way in which files are located within the archives. The standard method of finding files in databases and file systems is the B-tree or B+-tree. For databases, there is one tree for each index in each table; for file systems, each directory has its own tree. This allows for efficient searching (O(log N) disk accesses per lookup) and the ability to keep only part of the index in memory at any time. However, archive files are generally concerned more about optimizing size than speed (ACE being the extreme example), they may use something simpler, such as a sorted or unsorted array of filenames (and if unsorted, then it is possible to just store the filenames along with the file data, itself).

The MPQ format, however, opted for something more industrial-strength: hashing. Each MPQ archive has a single large hash table (an O(1) structure), containing entries for all (readable) files in the archive. Files are identified by two 32-bit hashes of the filename, a language code, and a platform code (what exactly this is remains to be seen, as it's still unused); the index is derived from a third hash of the filename. Note that nowhere is the filename itself.

You really can't do any better than this for the original purpose MPQs were designed for: to store read-only game data in a compact form that is both fast to access and fast to load. The use of fixed-size hashes (rather than the actual filenames) makes the size of archive data in memory optimal, and the hash table structure makes file lookups optimal. The format of the hash table (specifically, how unused entries are encoded) means that the size on disc could be further reduced by compressing the hash table (it isn't compressed on disc); presumably this was not done to make opening archives faster. Apart from that, the hash table has no real weaknesses for the originally intended purpose.

However, the hash table proves to be a pain for modders because of the lack of absolute filenames in the archives. This apparently became a pain for Blizzard as well, indicated by the fact that in Starcraft: Brood War (the third game/expansion to use MPQs), Blizzard added a separate mechanism not integrated with the hash table for storing the names of files in the archive.

These two things taken together, however, suggest one way in which the MPQ format could have been improved (besides a minor alteration that would have allowed the hash table to be resized; presumably Blizzard didn't see a need for this ability when creating the MPQ format). Rather than including the hash table in the archive format itself, Blizzard could have made the hash table a global, separate entity. That is, store the filenames in some simple structure in the archives, and have the installer/patcher code create a global hash table (containing files of all archives) which would be used by the archive accessing code; this could even be specialized, such as only indexing files of the correct language. This would reduce the in-memory overhead (although increase the total size on disc) as well as potentially decrease the search time for finding a file in multiple archives. It has the added benefit of removing that complexity from the format itself, and moving it to support code. If memory serves, Neverwinter Nights took this approach.

Language/Platform Codes

Another interesting design decision in MPQs was the choice of including support for different languages/platforms in the archive format itself. As briefly mentioned previously, files in MPQs are identified, rather than by filenames, by three hashes derived from the filename, a language code, and a platform code. When a game runs, it sets the language and platform currently running. The MPQ access code then looks through the archives for a file matching that description. If no such file exists, it performs a second search, using the same hashes, but using default language and platform codes (in practice, files identified by these codes are generally American English files).

While it's nice that Blizzard put some thought into internationalization, I tend to think this was something that could have been better implemented outside the archive format itself. It would have been just as easy to use a prefix or suffix on the filename for this purpose (e.g. "platform\language\filename" or "filename.platform.language"). Both of these could have been implemented completely in code.

Alternatively, had they used a global index, they could have simply included only files of the proper language/platform (or the neutral language/platform, if no specific version exists) in the index when it is built by the installer/patcher, without the need to modify the filenames at all.

Separate Hash and Block Tables

One detail about the hash table that wasn't previously mentioned was the fact that the hash table stores only some information about each file (specifically, only that already mentioned). Additional information, such as the file size and offset, as well as file modes and flags, is stored in the block/file table ('block' is more technically accurate, but 'file' better describes its intended purpose). The hash table then stores an index into the file table for each entry in the hash table.

This is something that is common in file systems - you typically have a file table, containing the basic info about each file's data, and an index (often B/B+-tree) of filenames that index into the file table. The obvious benefit of this is that the file table can be modified less frequently, and doesn't need to be completely rewritten every time there's a change in the directory structure. A second benefit is that it's possible to create hard links - multiple 'files' in multiple directories (or in the same directory with different names) that refer to the same physical file.

File Data Blocking

Another feature of the MPQ format that resembles a file system more than an archive format is the way it structures file data. It's generally beneficial, from a compression ratio perspective, to compress a data set as a whole (though compression libraries typically feature a mechanism to indicate when it's necessary to read more data or write out compressed data, so that you don't have to load the entire file into memory at once). However, file systems that support compression don't do that. As file systems must read and write entire sectors of data, file data is broken into chunks of some number of sectors, compressed as a single block, then written out, padding out to the nearest sector; if at least one sector cannot be saved by the compression, the compressed data is discarded, and the uncompressed data is used, instead (there no point having to decompress something when compression doesn't even reduce the size).

However, the MPQ format also does this, despite the fact that it does not pad out disc sectors. The reason is that there's a second benefit of blocking data in this way. When a block of data is compressed as a stream (a single entity), it can generally only be read likewise - as a stream, from beginning to end. That is, you cannot seek in compressed data. As seeking is an essential function in general file systems (and similarly in game archive formats), this constraint is unacceptable. Blocking is thus used to reduce the size of the blocks that must be accessed atomically. To read from an arbitrary point, you find the block containing the position desired and decompress it.

One last point unique to the MPQ format is that when files are blocked, each block may use a different compression algorithm, indicated by a byte prefix with the compression algorithm.

Single Block Files

Nevertheless, as mentioned previously, it's better from a compression (and possibly performance, as well) standpoint to compress files as a single unit. In most cases this would work for arbitrary data (the type that might appear in an MPQ), as many file types will be read once and then kept in memory. To take advantage of this, the MPQ format has a second mode of file storage (I can't recall off the top of my head when this was added. Possibly in World of Warcraft): single block compression. This is basically exactly what it sounds like: the entire file is treated exactly as a single block of data - compressed in a single go, and prefixed with a byte indicating the compression algorithm, just like each block normally is.

In theory (and at the level of the file format), this isn't any different from normal, non-blocked archive formats. The problem is really with the implementation (the code, to be precise). The implementor wanted an easy way to take advantage of improved compression by not blocking data, when seeking is not necessary, so decided to use exactly the same block decompression code as for blocked files - blocks are loaded into memory whole, decrypted, and decompressed all at once. The problem here is that it requires two buffers, and requires the entire compressed block to be read before anything can be decrypted or uncompressed.

ADPCM Compression

As mentioned, the MPQ format supports multiple compression algorithms. Currently, the PKWare Implode, Zip Deflate, BZip2, and Huffman encoding algorithms are supported. There's also one more that's more unexpected: the ADPCM algorithm. I briefly discussed ADPCM compression on the E Terra blog. It's a lossy format that compresses 16-bit audio samples down to 4 bits, giving 4:1 compression ratio, at a non-negligible loss of quality.

This is used for a compression mode typically referred to as "WAV compression". This mode is used for compressing sound effects and music prior to Warcraft III. The MPQ editor scans the WAV to import, and distinguishes blocks that contain only PCM data from blocks that contain anything else. Blocks containing only PCM data are compressed with ADPCM, the others are compressed with one of the lossless compression algorithms. This way, the game sees only a PCM-encoded WAV, though it had actually been converted transparently to ADPCM and back.

This is a pretty strange feature for an archive file format - at least, as a basic compression type. Especially considering that there are better formats for this that have higher compression ratios and higher quality (MP3 comes readily to mind), without requiring an alteration to the basic archive format.

I suspect the real reason for this design decision was historical. WAV compression first appeared in Starcraft (the same time support for different compression algorithms for each block appeared, by the way). Diablo, the only game to use MPQs before that, used uncompressed WAVs, played through several Storm.dll APIs. I suspect ADPCM was used to allow the existing Storm.dll streaming code to continue to work using the new compression algorithm. Though again, it's not a particularly good solution, I think.

Extended Attributes

The last feature worth noting is a system for adding new metadata attributes to files in MPQ archives that weren't included in the original format, which I've dubbed extended attributes. Extended attributes are stored as a file (conveniently named "(attributes)") in the archive consisting of parallel arrays of the extended attributes, each array containing one entry for each entry in the block table. The extended attributes in use at present (as of World of Warcraft) are CRC32, FILETIME, and MD5; an archive may contain any combination of these. Oddly, when Storm loads these extended attributes, it merges them with the rest of the fields in the block table, creating a single large table in memory.

This design decision has both advantages and disadvantages. Storing these in parallel arrays makes it, in theory, efficient to store, load, and access them when present, and ignore them when absent; however, the implementation peculiarity mentioned previously makes it slower to load them, with the benefit to slightly improve access (as it doesn't have to check whether they're all present each time you access them).

There are a couple disadvantages, however. First, as extended attributes are stored in arrays parallel to the block table, it isn't possible to exclude entries for files where you don't need that attribute, requiring more space if only some files need certain attributes. I suppose this is only an issue for the tables when loaded in memory, as, as the attributes file is compressed in the archive, unused entries can (have have been observed to) be zeroed, allowing them to take almost no disk space after compression.

The other disadvantage is that this system is not readily extensible. Because new attributes must be known by the MPQ library (for size calculations when processing the attributes file), it's cumbersome to add new attributes, as the MPQ library must also be updated to support them. This is even worse with regard to the fact that the block table mixing code and data must be updated correspondingly. It also doesn't allow for custom metadata at all.

4 comments:

Devoid said...

Hi,

Good info on the MPQ format. Very useful.

Cheers,
Devoid

Olivier Ragain said...

Yeap, even for other archives too

Stoo42 said...

Hi Justin,

First up, cheers for all your info on the MPQ file format. I've written a C#.NET library to read and get files from inside MPQ's, based pretty much entirely on your dev/klog description of the format.

Secondly, was wondering how I actually got hold of you personally to talk more about it if possible? Maybe when my lib is more "polished" (its rough as at the moment), link from klog?

Also, I think I've found a bug in the MPQ format... I'm just not sure though, as I dont class myself as a great programmer at all! Looks as tho the first chunk of each file is stored twice. First chunk offset != the byte after the chunk offset table at the start of each file. At least, thats what it looked like.

Hopefully I'll hear from you,

Stoo

Anonymous said...

thank you mpq post, very very useful!