So, that midterm went badly. Between the fact that I didn't study at all (it was an open-book test), and thus took longer than would be ideal, and the fact that I forgot to bring the single most important piece of paper I have (the LISP function reference sheet), this won't make my list of highest test scores ever. On the plus side, I got some feedback on my ideas for a term project (the ones I posted) from my teacher, chose one candidate to submit as a proposal, and got approval for it today.
One experiment described in my AI textbook involved speech synthesis and machine learning. An AI system took in a letter from an English word, along with several letters before and after it, and produced a phone - the precise spoken sound - for that letter. The experiment constructed two separate implementations, one based on the ID3 symbolic learning algorithm and the other on neural networks trained with backpropagation, and compared the performance and characteristics of the two. Both are trained by feeding in streams of examples where the input and output are both known; the learning algorithm then adjusts the system so that its actual output matches the correct answers as closely as possible.
Using this as a model (partly because it's vaguely similar, partly because it was just a convenient model), my experiment is to construct a system which identifies the language a piece of text is in, based purely on dumb pattern recognition rather than any specific knowledge about the structure of the languages (not unlike the model experiment). How exactly I intend to accomplish this (or at least attempt to) is where things become nontrivial, although I'd be lying if I said anything I'm going to do is particularly hard.
In this project, I intend to create multiple systems based on the same learning algorithm, one or two per language, each returning a single boolean indicating whether the algorithm thinks the current input is something in its language. The decision was motivated by the fact that a single composite system, while likely more accurate, would depend on input from ALL languages. Changing the set of training words in one language would affect the output for all languages, making it a nightmare to test incrementally.
There are three general levels of structure to language that a dumb system might be able to recognize: phonology, morphology, and syntax. Phonology describes the sounds in a language and the order in which they may appear. Since we're using written material rather than spoken, replace phonology with orthography - how a language is written, which is related to phonology. Morphology is how words are constructed and modified. Finally, syntax is the order words appear in.
I intend to base detection on orthography and possibly syntax. Both are relatively easy to evaluate, while morphology is much more difficult (at least for my level of skill). In both cases the basic idea is the same: the program iterates over units of text, tests each one through each language's AI function, and counts the number of matches per language. It then compares the match counts across languages, using various statistical measures, to determine whether there's a clear conclusion. Exactly what statistical methods to use will probably require a fair amount of experimentation.
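To make the counting idea concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names `count_matches` and `pick_language`, the toy detectors (simple substring checks standing in for trained functions), and the "lead over the runner-up" rule, which is just one possible stand-in for the statistical comparison that still has to be worked out.

```python
def count_matches(units, detectors):
    """Return {language: number of units that language's detector accepted}."""
    return {
        lang: sum(1 for unit in units if accepts(unit))
        for lang, accepts in detectors.items()
    }


def pick_language(counts, total, min_margin=0.2):
    """Return the leading language if its lead over the runner-up is at
    least min_margin (as a fraction of all units); otherwise return None.

    The margin rule is a placeholder, not a chosen statistical method.
    """
    if total == 0 or not counts:
        return None
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]
    (best, best_n), (_, second_n) = ranked[0], ranked[1]
    return best if (best_n - second_n) / total >= min_margin else None


# Toy detectors: hypothetical stand-ins for the trained per-language functions.
detectors = {
    "english": lambda unit: "th" in unit,
    "german": lambda unit: "ch" in unit,
}

units = ["_the_", "that_", "_ich_"]
counts = count_matches(units, detectors)
guess = pick_language(counts, total=len(units))
```

Because each detector only ever sees its own answer, swapping out one language's detector leaves every other language's count unchanged, which is the whole point of keeping them separate.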
In both cases, the neural network implementation will require a common character set for all the languages used, because its inputs are binary; in other words, there will be many inputs - one per character in the character set for each character position in the sample. This would (likely) make it infeasible to support even UCS-2 (a 16-bit encoding of Unicode), as its 65,536 possible characters would require hundreds of thousands of inputs. I'm expecting the combined character set to be around 35-50 characters.
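A quick sketch of what that encoding might look like, and of the arithmetic behind the input counts. The 27-character `CHARSET` here (lowercase a-z plus an underscore for padding) is a toy assumption, smaller than the 35-50 characters I actually expect, and the five-character sample assumes one letter plus two on each side. The function names are made up for the example.

```python
import string

# Hypothetical toy character set: 26 lowercase letters plus "_" as padding.
CHARSET = string.ascii_lowercase + "_"


def one_hot_window(window, charset=CHARSET):
    """Encode each character of the window as len(charset) binary inputs,
    exactly one of which is 1, and concatenate the results."""
    index = {ch: i for i, ch in enumerate(charset)}
    bits = []
    for ch in window:
        slot = [0] * len(charset)
        slot[index[ch]] = 1
        bits.extend(slot)
    return bits


def input_count(charset_size, window_size):
    """Number of binary inputs: one per charset character per position."""
    return charset_size * window_size


encoded = one_hot_window("_hell")
```

With a five-character sample, a 50-character set needs `input_count(50, 5)` = 250 inputs, while UCS-2 would need `input_count(65536, 5)` = 327,680.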
Because of the limitation on character sets (and the obvious fact that character sets alone could be a dead giveaway in some cases, such as Korean), I intend to only use languages which use the Roman alphabet. Unfortunately, this rules out some cases I'd like to use, but that's the technical limitation. The fact that I'm not using morphology in this experiment suggests that the languages chosen should be primarily analytic; agglutination and fusion rely too much on morphology. Some possible languages to try: English, German, Spanish, Portuguese, Italian, Esperanto, Chinese (romanized: Mandarin via Pinyin, or Cantonese via Cantonese Pinyin or Jyutping), Trique, and Romanized Sindarin. In a couple of cases there are several closely related languages, intended to test how well this thing can distinguish relatively small differences.
Orthography is pretty straightforward. Each letter, along with several letters before and after it, is input into the orthography functions. I'm thinking two letters before and after, but we'll see. The output would then be whether the function thinks that letter fits with the letters around it.
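As a rough sketch of how those letter windows could be generated in Python: the function name `letter_windows` is made up, and padding the edges of a word with "_" is purely my assumption, since I haven't decided how to handle letters near the start or end of a word.

```python
def letter_windows(word, before=2, after=2, pad="_"):
    """Return one window per letter of the word: the letter itself plus
    `before` letters to its left and `after` letters to its right.
    Positions past the edge of the word are filled with `pad`."""
    padded = pad * before + word + pad * after
    size = before + 1 + after
    return [padded[i:i + size] for i in range(len(word))]


windows = letter_windows("cat")
```

For "cat" this gives three windows, `"__cat"`, `"_cat_"`, and `"cat__"`, with the letter under test always in the middle position.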
Syntax is substantially harder. It's infeasible to look at words as atomic units, because there is no good way of representing them as such in our algorithms (especially neural networks). So, I'm kind of having to get creative (of course this is assuming I even have time to do syntax analysis). What I'm thinking at the moment is to look at one word as well as the two words immediately before and after it. Rather than trying to process the entirety of each word (which can't readily be done, due to representation problems), I was thinking of only using the first and last so-many letters (I'm thinking three) from each of the words.
This idea isn't as arbitrary as it sounds. For the languages I'll be dealing with, many words should be six characters or fewer, meaning the entire word is considered. For longer words, whose middle can't be considered, I rely on the fact that the beginning and end of words have been shown to receive more processing than the middle, and, consequently (in a positive-feedback-like manner), they tend to contain the most important information, such as indications of the word class, inflections, and derivational morphemes.
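Here is a small Python sketch of that word representation, assuming three letters from each end and two words of context on each side. The names `word_edges` and `syntax_windows` are invented for the example, and padding short words and sentence edges with "_" is my assumption; the real handling is still undecided.

```python
def word_edges(word, n=3, pad="_"):
    """Represent a word by its first n and last n letters.
    Words of 2*n letters or fewer are kept whole and right-padded."""
    if len(word) <= 2 * n:
        return word.ljust(2 * n, pad)
    return word[:n] + word[-n:]


def syntax_windows(words, before=2, after=2, n=3, pad="_"):
    """Return one fixed-width string per word: its edge representation
    plus those of the `before` and `after` neighbouring words."""
    blank = pad * (2 * n)
    padded = [blank] * before + [word_edges(w, n, pad) for w in words] + [blank] * after
    size = before + 1 + after
    return ["".join(padded[i:i + size]) for i in range(len(words))]


edges = word_edges("language")
contexts = syntax_windows(["the", "quick", "brown", "fox"])
```

Every window comes out at 5 words x 6 letters = 30 characters, so the same fixed-size input layer works no matter how long the actual words are.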