Or rather, my conclusion about the answer to this question on the test (I posted the question to see if anyone else could figure it out) was confirmed by the teacher: there is no way to reduce the storage of an arbitrary set of points below the 96 bits per point required for the raw coordinates. There was an error in the question; specifically, the teacher forgot to mention one additional constraint on the values of the coordinates (though he hasn't said what it was yet).
As for possible solutions: if it were known that all points were coplanar, you could store the normal of the plane and the first point (or just the first three points themselves), then convert all other points to 2D coordinates. You could use any manner of 2D coordinate system (they all take the same amount of space to store), but since we know the first three points are not collinear, I'd say the easiest would be barycentric coordinates. This was the first thing that came to mind while I was taking the test, but I asked and he said that I couldn't assume all points were coplanar.
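Here's a rough sketch in Python of the coplanar idea (my own illustration, not anything from the test; it uses plain plane-basis coordinates rather than barycentric ones, but the storage works out the same):

import numpy as np

def to_plane_coords(points):
    """Express coplanar points as 2D coordinates in the plane of the first
    three points: p = a + s*(b - a) + t*(c - a), storing only (s, t)."""
    pts = np.asarray(points, dtype=np.float32)
    a, b, c = pts[0], pts[1], pts[2]
    u, v = b - a, c - a                      # in-plane basis (a, b, c not collinear)
    # 2x2 normal equations: [[u.u, u.v], [u.v, v.v]] @ [s, t] = [d.u, d.v]
    g = np.array([[u @ u, u @ v],
                  [u @ v, v @ v]], dtype=np.float32)
    coords = []
    for p in pts[3:]:
        d = p - a
        s, t = np.linalg.solve(g, np.array([d @ u, d @ v], dtype=np.float32))
        coords.append((s, t))
    return a, b, c, coords

Storage would then be 3 x 96 bits for a, b, and c, plus 2 x 32 = 64 bits for every remaining point, instead of 96 bits each.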
Alternatively, if you knew something about the locations of the points with respect to the plane, such as that they formed a regularly spaced grid, or occurred periodically along a curve, you could reduce each point to a single value - its distance from the plane - with some overhead for describing the pattern the points appear on. Given that the hint the teacher gave me after I had finished the test was "height map", I'm guessing this is what he had in mind.
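A minimal sketch of that height-map encoding (again just my illustration, assuming the in-plane grid positions are implied by the pattern description and don't need to be stored per point):

import numpy as np

def to_heights(points):
    """Store only each point's signed distance from the plane of the first
    three points - one 32-bit float per point instead of three."""
    pts = np.asarray(points, dtype=np.float32)
    a, b, c = pts[0], pts[1], pts[2]
    n = np.cross(b - a, c - a)
    n /= np.linalg.norm(n)                   # unit normal of the plane
    return np.array([np.dot(p - a, n) for p in pts[3:]], dtype=np.float32)

That gets the remaining points down to 32 bits each, plus whatever fixed overhead the pattern description needs.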
Friday, March 28, 2008
Monday, March 17, 2008
What's the Matter with MediaSentry?
Let me attempt to give a brief sketch of how P2P applications work, targeted toward non-comp-sci people, particularly with respect to the RIAA file sharing suits. This is taken from the general knowledge I have about P2P programs as a computer scientist, and in some cases I know specific things about specific applications.
Basic Architecture
In P2P applications, individual computers act as clients, servers, or both, depending on whether they're downloading, uploading, or doing both at a given time. Somewhere there is a list of the computers in a given P2P network, which users must join at some point. With eDonkey, you log into a large network when you start the program, and these large networks contain many peers who may never communicate with each other. With BitTorrent, you are logged into a small network - one containing only the peers sharing the files you are downloading/uploading (e.g. the MP3s of a single CD) - only while you are downloading/uploading, and you may be connected to many such networks at once.
Somewhere there is also a directory of users. This may be stored on a single server (I believe eDonkey was this way), or distributed across the network itself, where a computer asks another computer "who have you seen recently?", then asks that same question of all the computers returned from the query, and so on.
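For the technically inclined, here's a toy sketch of that decentralized discovery process in Python (the names are mine; this isn't the protocol any particular P2P program actually uses):

def crawl_peers(start_peer, ask_recently_seen, max_peers=1000):
    """Ask a peer who it has seen recently, then ask each of those peers the
    same question, breadth-first, until we run out of peers or hit a cap.
    ask_recently_seen(peer) stands in for the real network query."""
    seen = {start_peer}
    queue = [start_peer]
    while queue and len(seen) < max_peers:
        peer = queue.pop(0)
        for neighbor in ask_recently_seen(peer):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

This is also roughly how an investigator could crawl a network and catalog what everyone claims to be sharing.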
In all cases (unless you've got some kind of file-sharing virus, which I'm actually surprised we haven't seen before), the user voluntarily logs into and out of the network(s) through various actions. As well, files which are shared must be "voluntarily" shared, either from a shared files folder or by tracking specific files that should be shared; however, most programs will automatically share any files you download from other users (and I've heard some programs, when installed, automatically search for and share files that the program thinks would be good to share).
Depending on the system, peers and/or the directory server may not know all the files a given computer has available for sharing. Similarly, if you ask a given computer what files it's sharing, it may or may not give you a complete list. With torrents, a computer only reports the files in the particular torrent; I believe eDonkey lists all shared files. All P2P systems have a way of asking a particular computer what files it is sharing, although the completeness of the response varies.
Okay, getting to the specific legal issues.
Method of Obtaining an IP and File List
First, as there are standard and intended methods of asking a computer what files it's sharing, it's (probably) not true that MediaSentry had to do anything illegal to obtain this list, like hack into the computer. Likewise, they probably didn't have to do anything other users couldn't do (although they probably made a program to scan P2P networks and catalog all files, while the typical user would have to search for people with specific files; I wouldn't call this illegal).
The big question mark is how exactly MediaSentry verified (to the best of its knowledge) that the info they obtained is true, and without knowing this we can't give a good estimate of the false-positive rate (which is likely the reason MediaSentry won't say what their methods are; they're probably lying when they say that they have developed proprietary and novel methods of investigation that should be considered trade secrets), although previous cases have shown this rate to be > 0. There are lots of ways an investigation could go wrong (or become difficult), even if they did see what appears to be a computer sharing copyrighted files.
Outdated Cache Information
It's possible that the directory server or another computer has an outdated list of files shared by a certain computer, in which case it may say that a computer is sharing files that it isn't. One example of how this could happen: a computer was sharing some files on some network, then disconnected from the internet, and another computer logged on and was given the same IP. Such outdated data could indicate that the second computer is sharing files even though it's not (it might not even be on the P2P network at all), and in fact NOBODY at that IP has been sharing files for some time. This goes directly to the issue of not being able to positively identify a person from an IP address, even if the IP address is captured at the very moment the computer appears to be sharing (although this is highly dependent on the P2P program). This risk of false positives (and the next one or two) can more or less be eliminated by verifying that the files can actually be downloaded at the time the IP is seen "sharing" them.
Leeching
It's possible that the user is a "leecher" - somebody who downloads without allowing their computer to upload anything, by messing with their system configuration. This may be done intentionally (it's not extremely rare for people to leech so they don't have to use upstream bandwidth when all they want is to get something from someone else; such people fall under the "jackass" category) or unintentionally (P2P programs can be a huge pain to set up to work properly when you're behind a home or other type of local network, and some ISPs even block P2P uploading - but not downloading). Obviously if they're a leecher, they haven't actually made anything available, despite the computer indicating that it's sharing stuff (although intent becomes a big question if they're not intentionally leeching). While some P2P networks will ban leechers, it's possible for leechers to report false info to the server for the explicit purpose of evading leecher banning; consequently, leeching must specifically be ruled out by successfully downloading the "shared" files.
Clock Synchronization
The issue of stale data comes up again at the ISP or organization (if there's a large network, such as a school, that the violating computer is on); more importantly, there's no guarantee that the clocks on the MS computers (here I'm assuming they've actually downloaded the files from the sharer) are synchronized with the clocks at the ISP/organization. If these clocks aren't well synchronized, there's always the possibility that the account information they get from the ISP/organization isn't for the account that had the IP at the time the files were shared. Ruling this out would require explicitly testing clock synchronization between everyone involved; I'd imagine it would be troublesome to get an ISP or other organization to put that kind of effort into a response to a subpoena. Alternatively, this possibility can be reduced by the ISP/organization checking that there were no logons near the time sharing supposedly occurred; if there is a very large window of time in which no logons occurred, the probability of a false positive is probably negligible, even if the clocks aren't precisely calibrated.
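To make that last check concrete, here's the kind of test I mean, sketched in Python (purely illustrative; the names and the 30-minute skew figure are made up, not anything an actual ISP does):

from datetime import datetime, timedelta

def ip_assignment_is_unambiguous(observed_time, logon_times, max_skew_minutes=30):
    """If no logon/IP-reassignment event falls within the assumed clock-skew
    window around the observed sharing time, then imprecise clocks couldn't
    have caused the ISP to name the wrong account."""
    window = timedelta(minutes=max_skew_minutes)
    return all(abs(t - observed_time) > window for t in logon_times)

# One logon hours before the observation, none near it -> unambiguous.
obs = datetime(2008, 3, 17, 22, 15)
print(ip_assignment_is_unambiguous(obs, [datetime(2008, 3, 17, 8, 2)]))  # True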
Network Address Translation
Next, NATs pose a major problem for identifying the offending computer, because it's entirely possible that there are multiple computers using the same IP at the same time. In theory (and subject to the problem in the next paragraph) the router can distinguish which computer has which connection at what time (NATs assign unique port numbers to each computer sharing an IP address), but the probability of this information still being around by the time a suit is filed is low, even under normal (non-destruction-of-evidence-type) use. Whether an IP is a NAT or a single computer can be halfway reliably determined by investigators like MS using public info (I recall you got hung up on that point in one of your early trials). If the IP is a NAT, it's going to be significantly harder to prove which computer shared the files, and it requires forensic examination of the hard drives or someone on the network confessing (or the RIAA's preferred method: file a suit against the account holder and expect them to give up the person responsible rather than face court or settlement costs). However, this problem is short-circuited if the RIAA gets lucky enough that the P2P application uses user names (some do, some don't), and the name of the sharer is known to be used by a certain person (although I suppose someone could maliciously use the name of somebody they don't like).
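For the non-networking folks, here's a toy illustration of what a NAT does and why the router's table is the only thing tying traffic back to a specific machine (this is obviously not real router code):

class ToyNAT:
    """Several internal machines share one public IP; the router tells their
    connections apart only by the external ports it hands out.  Without this
    table - which real routers rarely keep around for long - the public IP
    alone can't say which machine made a connection."""
    def __init__(self, public_ip):
        self.public_ip = public_ip
        self.next_port = 40000
        self.table = {}        # (internal_ip, internal_port) -> external_port

    def outbound(self, internal_ip, internal_port):
        key = (internal_ip, internal_port)
        if key not in self.table:
            self.table[key] = self.next_port
            self.next_port += 1
        return (self.public_ip, self.table[key])

nat = ToyNAT("203.0.113.7")
print(nat.outbound("192.168.1.10", 5555))   # ('203.0.113.7', 40000)
print(nat.outbound("192.168.1.11", 5555))   # ('203.0.113.7', 40001) - same public IP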
IP Spoofing/ARP Poisoning
As well, I'm told by people more knowledgeable than me (I came up with the idea, then asked them to verify that it could be done in real-world networks) that, depending on the configuration of a network, it's possible to operate under the IP address of somebody close to you (perhaps somebody in your dorm). This would very likely require intent to deceive, but it might be attractive to someone who wants to download stuff without getting in trouble. I don't know if there are tools out there that make this easy enough for your average user to do, but it's definitely technologically possible, given the right network. In fact, I have a friend who is a very skilled network "hacker" (he publishes articles in security journals) who wrote a program to disconnect file sharers from his school network because they were hogging bandwidth and making his connection slow (and he did so without being a network administrator, as far as I'm aware; however, that was a simpler case than sustaining two-way communication). Network hacking is outside my field of expertise, but I'm betting this involved what I'm describing in this paragraph. It would depend on how secure the network configuration is, but I'd reason (keeping in mind that I know some about networks but they aren't my specialty - I'm just highly inquisitive, and know a moderate amount about many topics) that wireless networks are especially vulnerable to this. Ruling this out requires knowledge of the physical layout of the network the sharing computer is connected to, and of the network administration policies (I would guess that this is usually not done because of the annoyance to the network administrators, but perhaps some would). This seems like a viable defense, but I'd recommend talking to a network expert about this directly (my friend isn't online at the moment, so the confirmation of feasibility didn't come from him).
Making Available
Finally (at least I think this is the end), there's the nebulous issue of making available. Even if MS knows for sure that this person was on this computer with this IP address at the time, and MS successfully downloaded a valid copy of a copyrighted file, there isn't a guarantee that this file was actually distributed to other people. In P2P applications with very large networks, it's very possible that simply nobody other than MS ever asked for a copy of a file from a specific computer, so there was no actual distribution. In such cases, it becomes very difficult to even estimate the probability of someone else downloading the file (as I've explained, it's not enough simply to ask other computers whether they downloaded the file from that computer - even if the P2P application has a way of asking that - as those computers may be lying or propagating incorrect data; it could also be that the "sharing" computer is a leecher and only says it uploaded the file). Obviously this is only an issue if making available is not ruled to be equivalent to distribution.
Questions, comments? Is there anything (or everything) I didn't explain well enough for laymen, or does anybody technically apt want to know exactly what I'm referring to in some cases (it might not always be clear exactly what I was referring to, as I didn't explain the technical details behind that list of risks)?
And oops, I did it again - sat down to write something that was supposed to be fairly concise, and ended up writing something that looks like a judge's ruling document. But at least it made me forget about my flu for a couple hours, so that's a good thing.
Wednesday, March 12, 2008
Random Japanese Fact of the Day
Rather than have many particles that indicate relative position, such as English 'on [top of]', 'to the left of', 'above', 'in', etc., Japanese uses nouns for these things - 'top' (上), 'left' (左), 'inside' (中), etc., and has only a single word to indicate place (actually two - 'に/ni' and 'で/de' - with the context determining which is appropriate). An example from a CD booklet:
楽曲群 [set of compositions/music] の 中 [inside/medium/middle/among] で
This would literally translate to "at the inside of the music", or, more freely, "in the music".
It's a pretty elegant system. And this is of particular interest to me because of the fact that there's a clear hierarchy of word classes (in terms of importance) in Caia:
1. Nouns
2. Verbs
3. Everything else (adjectives, adverbs, particles) except conjugates (I don't really know how to rank conjugates in importance)
Tuesday, March 11, 2008
Exercise for the Reader
Here's a question right off my graphics programming test today:
You have n (greater than 3) 3D points, each represented by three 32-bit floats. The first three points a, b, and c are guaranteed not to be collinear, but nothing else is guaranteed. Devise a strategy to store the set of points in minimal space (hint: use a plane defined by a, b, and c), and give how much space would be needed to store the n points.
Monday, March 10, 2008
Agglutinative Train Wreck
While looking up some other Japanese words, I ran across this one:
ABC順に 【エービーシーじゅんに】 (adv) in alphabetical order; ED
ABC is obviously an example of the meaning of the word, 順 means order, and に is the locative particle (in this case having the meaning of "in"). I thought it was hilarious that there's a Japanese word that contains Chinese kanji, Japanese hiragana, and Roman characters all together.
& AI Class 2 - Initial Plans
So, that midterm went badly. Between not studying at all (it was an open-book test), which meant I took longer than would be ideal, and forgetting to bring the single most important piece of paper I have (the LISP function reference sheet), this won't make my list of highest test scores ever. On the plus side, I got some feedback on my ideas for a term project (the ones I posted) from my teacher, chose one candidate to submit as a proposal, and got approval for it today.
One experiment described in my AI textbook involved speech synthesis and machine learning. An AI system took in a letter from an English word, along with several letters before and after it, and produced a phone - the precise spoken sound - for that letter. The experiment constructed two separate implementations, one based on the ID3 symbolic learning algorithm, and the other based on back-propagated neural networks, and compared the performance and characteristics of both implementations. Both are trained by feeding in streams of examples where the input and output are both known, and the machine learning algorithms adapt the actual output of the system to most closely match the correct answers.
Using this as a model (partly because it's vaguely similar, partly because it was just a convenient model), my experiment is to construct a system which identifies the language a piece of text is in, based purely on dumb pattern recognition rather than any specific knowledge about the structure of the languages (not unlike the model experiment). How exactly I intend to accomplish this (or at least attempt to) is where things become nontrivial, although I'd be lying if I said anything I'm going to do is particularly hard.
In this project, I intend to create multiple systems based on the same learning algorithm, one or two per language, each returning a single boolean indicating whether the algorithm thinks the current input is something in its language. The decision was motivated by the fact that a single composite system, while likely more accurate, would depend on input from ALL languages. Changing the set of training words in one language would affect the output for all languages, making it a nightmare to test incrementally.
There are three general levels of structure to language that a dumb system might be able to recognize: phonology, morphology, and syntax. Phonology describes the sounds in a language and in what order they may appear in the language (in our case, where we're using written material rather than spoken, replace phonology with orthography - how a language is written, which is related to phonology). Morphology is how words are constructed and modified. Finally, syntax is the order words appear in.
I intend to base detection on orthography and possibly syntax. Both are relatively easy to evaluate, while morphology is much more difficult (at least for my level of skill). In both cases the basic idea is the same: the program iterates over units of text, testing each one through the AI function, and counts the number of matches. It then compares the number of matches between the functions of the different languages with various statistical functions to attempt to determine if there's a clear conclusion. Exactly what statistical methods to use will probably require a fair amount of experimentation.
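Here's roughly what I have in mind for that outer loop, sketched in Python (everything in it - the classifier functions, the unit splitter, and the trivial "pick the max" comparison - is a placeholder for pieces I haven't built yet):

def score_text(text, classifiers, make_units):
    """Count, for each language's classifier, how many units of the text it
    accepts.  classifiers maps language name -> boolean function; make_units
    turns text into the units being tested (letter windows, word windows)."""
    units = list(make_units(text))
    scores = {lang: sum(1 for u in units if accepts(u))
              for lang, accepts in classifiers.items()}
    # The real statistical comparison (still to be chosen) would go here;
    # as a stand-in, just take the language with the most matches.
    return max(scores, key=scores.get), scores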
In the neural network implementation of both cases, this will require construction of a common character set for all the languages used, as inputs will be binary; in other words, there will be many inputs - one per character in the character set per character in the sample. This would (likely) make it infeasible to support even UCS-2 (one flavor of Unicode), as that would require hundreds of thousands of inputs. I'm expecting the combined character set to be around 35-50 characters.
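In code, that binary encoding would look something like this (a sketch of the idea, with an alphabet and window size I haven't actually settled on yet):

def one_hot_window(window, alphabet):
    """Encode a character window as binary neural-net inputs: one input per
    alphabet character per position, set to 1 only where that character
    actually appears.  With ~40 characters and a 5-character window that's
    around 200 inputs; with UCS-2 it would be hundreds of thousands."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    inputs = [0] * (len(alphabet) * len(window))
    for pos, ch in enumerate(window):
        if ch in index:                       # characters outside the set stay all-zero
            inputs[pos * len(alphabet) + index[ch]] = 1
    return inputs

print(sum(one_hot_window("haus ", "abcdefghijklmnopqrstuvwxyz ")))  # 5 inputs set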
Because of the limitation on character sets (and the obvious fact that character sets alone could be a dead giveaway in some cases, such as Korean), I intend to only use languages which use the Roman alphabet. Unfortunately, this rules out some cases I'd like to use, but that's the technical limitation. The fact that I'm not using morphology in this experiment suggests that the languages chosen should be primarily analytic; agglutination and fusion rely too much on morphology. Some possible languages to try: English, German, Spanish, Portuguese, Italian, Esperanto, Chinese (romanized via Mandarin Pinyin or Cantonese Jyutping), Trique, and Romanized Sindarin. In a couple of cases there are several closely related languages, intended to test how well this thing can distinguish relatively small differences.
Orthography is pretty straightforward. Each letter, as well as several letters before and after it, is input into the orthography functions. I'm thinking two letters before and after, but we'll see. The output would then be whether the function thinks that letter fits with the letters around it.
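Generating those units is simple enough that I can sketch it now (the padding character and window size are just my current guesses):

def letter_windows(text, before=2, after=2, pad=" "):
    """Yield each letter together with the `before` letters preceding it and
    the `after` letters following it, padding past the ends with spaces."""
    padded = pad * before + text + pad * after
    for i in range(before, before + len(text)):
        yield padded[i - before:i + after + 1]

print(list(letter_windows("cat")))  # ['  cat', ' cat ', 'cat  ']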
Syntax is substantially harder. It's infeasible to look at words as atomic units, because there is no good way of representing them as such in our algorithms (especially neural networks). So, I'm kind of having to get creative (of course this is assuming I even have time to do syntax analysis). What I'm thinking at the moment is to look at one word as well as the two words immediately before and after it. Rather than trying to process the entirety of each word (which can't readily be done, due to representation problems), I was thinking of only using the first and last so-many letters (I'm thinking three) from each of the words.
This idea isn't as arbitrary as it sounds. For the languages I'll be dealing with, many words should be six characters long or less, meaning the entire word is considered. For longer words, where the middle cannot be considered, I rely on the fact that the beginnings and ends of words have been shown to receive more processing than the middles, and, consequently (in a positive-feedback-like manner), they tend to contain the most important information, such as indications of word class, inflections, and derivational morphemes.
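Here's a sketch of that word-window representation as I currently picture it (the truncation length and context size are just the values I mentioned above, nothing more settled than that):

def word_features(words, i, context=2, keep=3, pad=""):
    """Represent the word at index i for the syntax classifier: the word plus
    `context` words on each side, each reduced to its first and last `keep`
    letters (words that short are used whole)."""
    window = []
    for j in range(i - context, i + context + 1):
        w = words[j] if 0 <= j < len(words) else pad
        window.append(w if len(w) <= 2 * keep else w[:keep] + w[-keep:])
    return window

print(word_features("they completed the identification procedure yesterday morning".split(), 3))
# ['comted', 'the', 'ideion', 'proure', 'yesday']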
Labels: linguistics, programming, reallife
Sunday, March 09, 2008
& More Awesomeness
I can't recall if I mentioned this before: while looking through the AI teacher's personal collection of AI journals, I ran across an article about the creation of an AI that could generate Chinese calligraphy. You give it a variety of examples of real Chinese calligraphy, it learns from those, and then by randomizing various parameters it's able to create novel calligraphy that is acceptable to expert Chinese calligraphers. Alternately, it can be used to create personalized handwriting fonts - you give it half a dozen characters in your own handwriting, then it generates the tens of thousands of other characters by mimicking your handwriting. There's one particularly impressive example where the AI generates the character "forever" (I think it's 永, but it's so artistic that it's hard to tell) that mimics a hand-drawn sketch of a horse.
The reason I bring this up now is that, while the article is controlled by the journals and sold for $20 a copy, I've discovered that a PDF of the article can be obtained for free on the site of an unspecified university in Hong Kong, which can be found with some looking on Google. The article is Automatic Generation of Artistic Chinese Calligraphy by Songhua Xu, Francis Lau, William Cheung, and Yunhe Pan.
Saturday, March 08, 2008
Google Epic Fail
As the following indicates, Google sucks at Japanese translation.
"Work with everyone involved to meet the time that I talk directly to the image more than anything to receive a lot of opportunities, so those places will eventually work on the float's theme song has become melody It is disappointing that many of the Yes."
"Work with everyone involved to meet the time that I talk directly to the image more than anything to receive a lot of opportunities, so those places will eventually work on the float's theme song has become melody It is disappointing that many of the Yes."
Wednesday, March 05, 2008
& Intellectual Property
I just finished writing a mammoth post in the discussion of this Ars Technica article about copyright and history. This more or less serves as my thesis on copyright, intellectual property, and file sharing in general:
After reading Alfonse's posts, I thought I should provide an alternate perspective/rebuttal from an intellectual property creator. I produce works of fiction, I produce works of information (the majority of my blog, for instance, dealing with topics related to programming, linguistics, and philosophy), and I produce works of computer programming (though this list is not all-inclusive). I spend just about all of my time creating in some way, even if what I create doesn't leave my head. Given that, most of my work is done primarily for two different reasons, which apply in varying degrees to different works:
1. Because I love what I create, and I enjoy creating it. So much so, in fact, that I do it even when I never expect to profit from it. If I didn't love and enjoy it, I wouldn't spend the considerable effort to create it.
2. Because what I create has value to others - it's useful in some way to people other than myself. Now, I'm still selfish in my own way - if I'm going to create something that I don't enjoy creating, even if it benefits others, I better be getting some other kind of incentive, such as money. But most of the stuff I create for the benefit of others is also enjoyable for me.
Conspicuously absent from that list is making money. To me, getting paid for creating something is the icing on the cake - the cake itself is the content I produce. I think that, as much as is possible (hold that thought), that's the ideal of creation - not professions and commercialization. I believe that work created for its own merit is superior to work created solely for the purpose of obtaining your next paycheck; so yes, I think a world of amateurs would be great.
Of course, it's entirely possible that my beliefs are influenced by my own conditions - I have many skills, some of which can be used commercially, others not; so many, in fact, that I don't have time in my life to explore all of them (interestingly, writing is one of the less intellectual of my skills). If this is not the case for you, Alfonse, then that could certainly contribute to our difference of opinions.
Alternately (or perhaps additionally), perhaps our difference of perspective is due to different origins. When I read your posts, I can't get the image of a five year old throwing a temper tantrum when they can't get every single thing they want out of my head. I, on the other hand, have spent more than the last decade in an internet community based on free distribution of content. Content produced by others is freely used as the basis of derivation over thousands of hours of labor, and both derivative content and original content are given away without any expectation of (or request for) payment (some, such as myself, have even refused attempts by fans to pay for content that was intended to be free). In fact, I'm both a creator of derivative works and of original works.
On the other hand, creation of some things can be prohibitively expensive. You are absolutely correct in that big-budget movies (which I would love to see some of my stories made into, by the way), large computer programs, etc., would not be created unless the creators expected significant returns, because without those returns they would be impossible to produce. As well, some industries, such as the movie industry, cannot provide their work as a service, as you also pointed out.
Consequently, I'm (almost but not quite paradoxically) also a proponent of copyright (as well as patents, although that's an entirely different topic). Of course, the copyright I'm a proponent of bears little resemblance to the copyright of today. I advocate copyright terms of 10-20 years, maybe 30 at most. Furthermore, there are going to have to be some changes in the nature of copyright as well, not just the duration. This statement requires some explanation.
The world has changed. This is no longer the same world as when copyright law was established; it's not even the same world as twenty years ago. Personal computers and the internet have fundamentally changed the nature of information. I found Mark Twain's arguments in favor of copyright - specifically, that it's unfair for publishers to continue to make money off of his works when the copyright has expired and he is no longer able to make money off of them - to be convincing... for the world at that time.
At the time, there was only one reasonable way to make a book: massive and expensive printing presses. As such things were well outside the reach of the common person, this gave publishers non-exclusive production rights even without enforcement of copyright. As they controlled the production, they also could set the prices people must pay for them. Furthermore, the books produced had an inherent value by nature of the cost of producing them. This was true even as recently as twenty years ago; recordable cassette tapes existed, but were inferior to the originals, and CDs could not yet be burned on a personal computer.
This is no longer the case. Personal computers and the internet have made it possible for a work to be duplicated an infinite number of times with absolutely zero production costs (note that I'm referring to producing the copies themselves, not the content). Copies no longer have an inherent value, nor is there any limitation on who may produce these copies. Most computers now have CD and DVD burners, and information may be transferred over the internet without any special kind of hardware.
Twenty years ago, "theft" (I'm not even going to get into the debate of definition) of intellectual property was limited to those with large budgets - large-scale counterfeiter out for nothing more than profit - who could be reasonably attacked legally. Today, however, there are tens of millions (probably hundreds of millions if you count people outside the US) of IP "thieves", each wanting nothing more than to access the content themselves, with no exchange of money. Besides the fact that the fundamental nature of infringement has changed, it's quite literally impossible to target more than a small fraction of them with legal action, assuming they're even in a country you are on good terms with.
Under such conditions, copyright law enforcement will be, at best, sporadic, and at worst entirely unjust, as companies and governments are required to drastically lower accuracy of prosecution to be able to afford it at all (this is what the RIAA faces now, and it'll only get worse if personal infringement becomes criminal). Technological countermeasures (note that technology and computers are probably my #1 skill, and my current profession, so I hope you'll consider my words to carry some weight) are also doomed to failure without catastrophic side effects of the type that damage culture and invention more than piracy and counterfeiting ever could; though to be fair, such results would also occur if copyright continues to be strengthened as it has been over the last couple hundred years.
We are at the point where the morality of piracy is becoming irrelevant. Regardless of whether it's right or wrong, it's simply not possible to stop it, so you're going to have to get used to it. Even better, you could follow the lead of me and my community and thrive on it, both commercially and non-commercially. I have a job; how about you?