Thursday, August 09, 2007

Of Codes and Languages - Trans-Roman Alpha

For those who haven't played it, World of Warcraft contains 10 different playable races organized into 2 factions. Each race has its own distinct language, and one language in each faction is known by all races of that faction. Most text spoken by players or other characters in game is in a particular language. If you know that language, the text appears in English (or whatever language it was written in) exactly as it was typed/said. If you don't know that language, the text is rendered as words from the in-game language it was spoken in, making it unintelligible to you. Besides lending a touch of realism to the game, this also serves as a language barrier to prevent communication between opposing factions.

For example, the following lines are spoken by some of the enemies in one instance, in Common - the native language of humans and, as the name suggests, the common language known by all races in the Alliance:

Common: Andovis ras waldir
English: Release the hounds!

Common: Ras garde hamerung nud nud valesh noth. Hir bur dana bor.
English: The light condemns all who harbor evil. Now you will die.

If you carefully draw a line through a handful of data points, you can get an idea of how this works just from these two examples. Blizzard has created a small (several dozen words) vocabulary for each language - approximately 6 words of each length. When translating text, each word is processed individually: the word is hashed, and the hash is used to choose a translated word of the same length, in a lossy, many-to-one mapping. A simple, elegant and effective algorithm.
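The scheme inferred above can be sketched in a few lines of Python. This is only a guess at the general shape, not Blizzard's actual code: the `VOCAB` table is built solely from the Common words quoted above, the CRC32 hash is an arbitrary stand-in for whatever hash the game uses, and the nearest-length fallback and the names `translate_word` and `translate` are made up for illustration.

```python
import re
import zlib

# Hypothetical vocabulary, bucketed by word length. It contains only the
# Common words quoted in the examples; the game's real lists are unknown here.
VOCAB = {
    3: ["ras", "nud", "hir", "bur", "bor"],
    4: ["noth", "dana"],
    5: ["garde"],
    6: ["waldir", "valesh"],
    7: ["andovis"],
    8: ["hamerung"],
}


def translate_word(word: str) -> str:
    """Map one word to a vocabulary word of the same (or nearest) length."""
    # Assumption: if no bucket matches the length exactly, use the nearest one.
    length = min(VOCAB, key=lambda n: (abs(n - len(word)), n))
    bucket = VOCAB[length]
    # A deterministic hash of the word picks the entry. Many different words
    # land on the same entry, so the mapping is lossy and cannot be reversed.
    index = zlib.crc32(word.lower().encode("utf-8")) % len(bucket)
    chosen = bucket[index]
    return chosen.capitalize() if word[0].isupper() else chosen


def translate(text: str) -> str:
    """Translate each run of letters on its own, leaving punctuation alone."""
    return re.sub(r"[A-Za-z]+", lambda m: translate_word(m.group()), text)


if __name__ == "__main__":
    print(translate("Release the hounds!"))
    print(translate("hello world"))  # both words share the only 5-letter entry
```

Because the same word always hashes to the same entry, a repeated word such as "the" comes out the same every time, which is consistent with "ras" standing in for "the" in both examples above.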

This got me thinking - could you create a coding system such that you could reversibly encode data in something that looks like a foreign language? The point, of course, would be to use the fact that it looks like a language as a decoy, while the information is actually in something like an encrypted form.

My first attempt at this was an algorithm I called Trans-Roman Alpha ("trans-Roman" because it used the Roman alphabet). This was an extremely simple algorithm: reversibly convert a word into a numeric form (basically treating it as a base-26 number), then decode that number in the opposite direction, using a different mapping of "digit" to letter. A few other complications were also added, such as word fusion and splitting to hide the original word lengths. Some familiar phrases, in Trans-Roman Alpha:

"pqr bxq pgy psc dywddw jjf"
"psc dynts bl dytg djp mckgy bzz cy gcxy sdwcn jfydkltd yd htc yn r vy lzt fcypt jc"
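The core conversion described above (word to base-26 number, then number back to letters in the opposite direction with a different digit mapping) might look roughly like the following. This is a minimal sketch under stated assumptions, not the original Trans-Roman Alpha code: the `CODED` alphabet is an arbitrary permutation chosen for illustration, the function names are hypothetical, only lowercase a-z words are handled, and the word fusion and splitting steps are left out, so it will not reproduce the phrases above.

```python
import string

PLAIN = string.ascii_lowercase
# Hypothetical output mapping: any fixed permutation of the 26 letters works.
CODED = "qwertyuiopasdfghjklzxcvbnm"


def word_to_number(word: str, alphabet: str) -> int:
    """Read a word as a base-26 number, most significant letter first."""
    n = 0
    for ch in word:
        n = n * 26 + alphabet.index(ch)
    return n


def number_to_letters(n: int, length: int, alphabet: str) -> str:
    """Write n as `length` base-26 digits, least significant digit first."""
    letters = []
    for _ in range(length):
        n, digit = divmod(n, 26)
        letters.append(alphabet[digit])
    return "".join(letters)


def encode_word(word: str) -> str:
    # Keeping the length fixed preserves leading "zero" letters such as 'a',
    # which is what makes the conversion reversible.
    return number_to_letters(word_to_number(word, PLAIN), len(word), CODED)


def decode_word(coded: str) -> str:
    # Undo the direction swap, then map the digits back to the plain alphabet.
    n = word_to_number(coded[::-1], CODED)
    return number_to_letters(n, len(coded), PLAIN)[::-1]


def encode(text: str) -> str:
    return " ".join(encode_word(w) for w in text.split())


def decode(text: str) -> str:
    return " ".join(decode_word(w) for w in text.split())


if __name__ == "__main__":
    secret = encode("release the hounds")
    print(secret)
    print(decode(secret))
```

Note that when both alphabets have 26 letters, this reduces to reversing each word and substituting its letters; using an output alphabet of a different size (consonants only, say) would change the word lengths and need extra bookkeeping.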

As you can see, the algorithm is too dumb to form pronounceable syllables. The best it can do is either to work with a syllabary like Hiragana, in which each character represents one syllable, or to use an abjad: a writing system in which only consonants are written. In the latter case, the resulting phrases would be far longer than the source text, making it somewhat impractical.

While a syllabary would not have the length problem (and in fact would be about the best you could do with this algorithm), both it and the abjad solution share a more significant problem: both generate a fairly even distribution of characters, in a nearly random order. This is entirely unlike real languages, whose characters are neither evenly distributed nor randomly ordered (though the output of encryption systems is both). Such a system would be unlikely to hold up to even the most basic tests used to identify languages (or at least make a best guess), and so would not be particularly likely to fool anyone knowledgeable about the topic.
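One such basic test, offered here only as an illustration and not as something the original experiment used, is the index of coincidence: the probability that two letters picked at random from a text are the same. For evenly distributed letters it sits near 1/26 (about 0.038), while ordinary English text comes out at roughly 0.067. The function name below is hypothetical, and the sketch looks only at letter frequencies, not at letter order.

```python
from collections import Counter


def index_of_coincidence(text: str) -> float:
    """Probability that two randomly chosen letters of `text` are equal."""
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    n = len(letters)
    if n < 2:
        raise ValueError("need at least two letters")
    counts = Counter(letters)
    # Count the ordered pairs of equal letters, divided by all ordered pairs.
    return sum(k * (k - 1) for k in counts.values()) / (n * (n - 1))


if __name__ == "__main__":
    even = "abcdefghijklmnopqrstuvwxyz" * 10
    print(index_of_coincidence(even))  # close to 1/26 for an even spread
    print(index_of_coincidence("the light condemns all who harbor evil"))
```

A score close to 1/26 on a long enough sample is a strong hint that the text is not a natural language, whatever it looks like on the surface.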

1 comment:

Justin Olbrantz (Quantam) said...

For extra fun, the entire last paragraph of this post in Trans-Roman Alpha:

"mzknn cyfzgcy gmcyfcy mg dl mk hbx kykvtbp cyvhtcy tgjzcypsc dycm fbj ddd y gmcymhlt c ffyfgb dyjlcy dckl cykvtbp cycmygxwl jfy psc dysfk ty n r vy tdfq zcyqbc yw mvf cy cs gsdy mh nds frsqmy mldt ynlc yfgb dypsc dyrzdk jf yltp ggsr gjytgjzcyfcy ctcjc yvrd wgpp zm gr ry cm fbj ddd y mldt ylzt fcywhtl xbf ndy fcy mqpfz kdy nq ltdy bh q hrzw zgpvf sygmcyfdtc fj thf hkyjlcy fcy cxp fzw fy zqfqgn fylqp wgf ycs gsdy hlc yrv qc mpxc lyhvt pxsd y dsscdy mj zzt dpz wjy ftk nn cyqbc yvhtcy frh lcy xnc ynq ltdy bh q hrzw zgpvf syjlcy cs gsdy ppnx lnf ynhtc yqbc ytxgsdy nvkjg fyjlcy zqfqgn fylqp wgf ypw sgr pg ybm k lgl xrm rr ychsc mn rgyqbc ymldt yflzd dyfcy wkgvx p fykvtbp cycmyhjn n j qvqgy jqc y tghz cyqxyjqc y psc dyjvcjc y ktxs zy ptnkp fyrx lqc yjqc y wmchqq qt jymj zzt dpz wjy nmcy dpc y dswn ldyrndjc yfcy sfk ty m t tqcc y fgb dyvh cykvtbp cyvhtcy cmyr kzn tv gqqpsz ky ckhr ldfy jqc y qxhlc ymxc fjl gymgl gzrnb jnswph ygxwl jfy psc dyffb bpf"