For those who haven't played it, World of Warcraft contains ten playable races organized into two factions. Each race has its own distinct language, and each faction has one language known by all races of that faction. Most text spoken by players or other characters in game is in a particular language. If you know that language, it appears in English (or whatever language) exactly as it was typed or said. If you don't know that language, it gets translated into the language of the character that said it, making it unintelligible to you. Besides lending a touch of realism to the game, this also serves as a language barrier to prevent communication between opposing factions.
For example, the following lines are spoken by some of the enemies in one instance, in Common - the native language of humans and, as the name suggests, the common language known by all races of the Alliance:
Common: Andovis ras waldir
English: Release the hounds!
Common: Ras garde hamerung nud nud valesh noth. Hir bur dana bor.
English: The light condemns all who harbor evil. Now you will die.
If you carefully draw a line through a handful of data points, you can get an idea of how this works just from these two examples. Blizzard has created a small (several dozen words) vocabulary for each language - approximately 6 words of each length. When translating text, each word is processed individually: the word is hashed, and the hash is used to choose a translated word of the same length, in a lossy, many-to-one relationship. An elegant algorithm - simple but effective.
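The scheme described above can be sketched in a few lines of Python. To be clear, the vocabulary and the choice of hash here are entirely made up for illustration; Blizzard's actual word tables and hash function are not public.

```python
import hashlib

# Hypothetical per-length vocabulary, in the spirit of the in-game word
# lists. The real tables are Blizzard's and are not public.
VOCAB = {
    3: ["ras", "nud", "bur", "hir", "lor", "vil"],
    4: ["noth", "dana", "gola", "ashj", "bora", "melk"],
    5: ["garde", "waldi", "veska", "odess", "thorj", "ruval"],
}

def translate_word(word: str) -> str:
    """Hash a word and use the hash to pick a fake word of the same
    length: deterministic, lossy, and many-to-one."""
    candidates = VOCAB.get(len(word))
    if candidates is None:
        return "?" * len(word)  # no fake word of this length in our tables
    digest = hashlib.md5(word.lower().encode()).hexdigest()
    return candidates[int(digest, 16) % len(candidates)]

def translate(sentence: str) -> str:
    """Translate each word independently, preserving word boundaries."""
    return " ".join(translate_word(w) for w in sentence.split())
```

Because each word maps independently to a fixed fake word, the same phrase always produces the same "foreign" phrase, which is exactly why repeated lines in game are recognizable even when unintelligible.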
This got me thinking - could you create a coding system such that you could reversibly encode data in something that looks like a foreign language? The point, of course, being to use the fact that it looks like a language as a decoy, while the information is actually in something like an encrypted form.
My first attempt at this was an algorithm I called Trans-Roman Alpha ("trans-Roman" because it used the Roman alphabet). This was an extremely simple algorithm: reversibly convert a word into a numeric form (basically treating it as a base-26 number), then decode it back into letters using a different mapping of "digit" to letter. A few other complications were added as well, such as fusing and splitting words to hide the original word lengths. Some familiar phrases, in Trans-Roman Alpha:
"pqr bxq pgy psc dywddw jjf"
"psc dynts bl dytg djp mckgy bzz cy gcxy sdwcn jfydkltd yd htc yn r vy lzt fcypt jc"
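The core re-encoding step can be sketched as follows. This is only the base-26 round trip; the word fusion and splitting steps are omitted, and the cipher alphabet is an arbitrary permutation I picked for illustration, not the one Trans-Roman Alpha actually used. Bijective base-26 (letters map to 1-26 rather than 0-25) is used so that words with leading "a"s survive the round trip.

```python
PLAIN  = "abcdefghijklmnopqrstuvwxyz"
CIPHER = "qwertyuiopasdfghjklzxcvbnm"  # arbitrary permutation, for illustration

def to_int(word: str, alphabet: str) -> int:
    # Bijective base-26: letters are digits 1..26, so "a" != "" and
    # leading "a"s are not lost the way leading zeros would be.
    n = 0
    for ch in word:
        n = n * 26 + alphabet.index(ch) + 1
    return n

def from_int(n: int, alphabet: str) -> str:
    # Inverse of to_int: emit bijective base-26 digits as letters.
    out = []
    while n:
        n, r = divmod(n - 1, 26)
        out.append(alphabet[r])
    return "".join(reversed(out))

def encode_word(word: str) -> str:
    """Read the word as a number in one alphabet, write it in another."""
    return from_int(to_int(word, PLAIN), CIPHER)

def decode_word(token: str) -> str:
    return from_int(to_int(token, CIPHER), PLAIN)
```

Since the two conversions are exact inverses, every word round-trips perfectly, and the output word is always the same length as the input - which is why the extra fusion/splitting pass was needed to disguise word lengths.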
As you can see, the algorithm is too dumb to form pronounceable syllables. The best it can do is either work with a syllabary like Hiragana, where each character represents one syllable, or target an abjad: a writing system in which only consonants are written. In the latter case, the resulting phrases would be far longer than the source text, making the approach somewhat impractical.
While use of a syllabary would not have the problem of length (and in fact would be about the best you could do with this algorithm), both it and the abjad solution share a more significant problem: each generates a fairly even distribution of characters, in a nearly random order. This is entirely unlike real languages, whose character frequencies are far from uniform and whose characters do not occur in random order (though the output of encryption systems has both properties). Such a system would be unlikely to hold up to the most basic tests used to identify languages (or at least make a best guess), and so would not be particularly likely to fool anyone knowledgeable about the topic.
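One of the simplest such tests is the index of coincidence: the probability that two letters drawn at random from a text are equal. English prose typically scores around 0.065, while uniformly distributed letters score about 1/26, or 0.038 - the figures here are standard textbook values, and the sketch below is just one crude detector, not an exhaustive test.

```python
from collections import Counter

def index_of_coincidence(text: str) -> float:
    """Probability that two letters drawn at random from the text match.
    English prose scores roughly 0.065; uniformly random letters score
    about 1/26 (0.038), which is a strong hint of cipher-like output."""
    letters = [c for c in text.lower() if c.isalpha()]
    n = len(letters)
    counts = Counter(letters)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
```

Any output with a flat letter distribution, like the Trans-Roman Alpha phrases above, would land near the 0.038 end of the scale and immediately look like ciphertext rather than language.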