Friday, June 27, 2008

Sansas & Bugs

Given how much I'm into music (particularly game, anime, and movie soundtracks), it'll probably come as a complete shock to most people to know that I've never had a portable CD or MP3 player (other than the CD player in my car). Probably the biggest reason for this is that I'm cheap - I save most of the money I make, and spend very little of it, even on things you'd expect me to buy (like a computer that's less than 6 years old). Well, yesterday I bought a digital audio player: the 2 gig SanDisk Sansa c250, on sale at a price I couldn't refuse (cheaper than Amazon).

So, I spent some time playing with it yesterday, in preparation for today, when I drive my grandma to a doctor's appointment and various errands (she's had severe eye problems for the last couple months). Not a bad little sucker; though just as you might guess from the price, it didn't take long to run into problems. Naturally, as I'm too impatient to call tech support, and too inquisitive to give up on a technical challenge, this meant I had to debug the thing.

After loading almost 2 gigs of music onto it and disconnecting from the computer, it proceeded to promptly lock up on database refresh (after you modify the contents of the flash memory it scans all the files and indexes them). Wonderful. I could turn it off and on, but every time it turned on it immediately performed a database refresh, and promptly locked up. Worse, it would no longer connect to the computer, as the database refresh preempted other things, like USB port communication, meaning I couldn't delete anything that might be causing it to freeze (specifically, if you plugged it into the USB port while it was performing the database refresh, Windows would say "unrecognized USB device" after a couple seconds).

A substantial amount of experimentation revealed that it was possible to override this. Specifically, you had to have the computer send a USB signal to the device BEFORE it starts its database refresh. As the database refresh is the first thing it does when you turn it on, and plugging the USB cable in automatically turns the device on, this takes rather precise timing, and more or less requires pressing the button required to make it connect in mass storage mode*, insert the USB cable, and press "Scan for hardware changes" in Device Manager at essentially the same time (I'd say about 1/3 of a second). This will cause the USB signal from the computer to preempt the scheduled database refresh, and put it into USB storage mode.

Now that I was able to access the contents again, I spent some time fumbling around with trial and error, trying to figure out what was causing it to break; as it was 1 AM by this point, my brain wasn't in peak working condition, and this took some time. Searches on Google revealed that quite a few people had this problem and there are quite a few hypotheses as to what causes it and how to fix it, but no definitive explanation or solution (nor has SanDisk addressed this problem, despite people asking for help on their forums). As well, many of the "solutions" involved wiping the memory of the thing, and sometimes bricking it.

Through trial and error, I managed to burn through a number of hypotheses (which were either incorrect or simply not applicable to me). It appeared to be false that spaces in directory and file names caused lockups (or that bug only occurred in older versions of the firmware). I also did not observe any instances of odd characters in song titles or artists that caused this problem; to my surprise, the device even correctly handled and displayed the Japanese characters in some song and artist names (when I had first opened the package, I tried copying a single album onto it, which worked without incident; this album happened to have Japanese ID3 info). Lack of free space did not appear to cause it (I tried taking it down to 2 megs free space with good files, and it still worked fine). ID3v1 tags seemed to work fine. Even this one funky MP3 at "0 kbps" (what Explorer reports for it; I haven't looked at it with a hex editor to figure out why this is) did not cause the problem.

What ultimately ended up being the problem, at least in my instance, was that one of my game soundtrack MP3s was mislabeled as 'hard rock'. The significance of this, according to one person, is that it has a space in the genre name. Changing this to the proper genre corrected the freeze. I can't say for certain that the space in the genre is what causes the bug, but it's true that when none of my songs have a space, the player works fine, and it froze in that one case.
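If you want to check your own files for this before copying them over, something like the following could flag the suspects. This is only a sketch: it assumes you've already pulled (filename, genre) pairs out of the tags with whatever tagging tool you use, and the function name and sample data are made up for illustration. It also only tests the "space in the genre" hypothesis, which, again, I can't say for certain is the real cause.

```python
def genres_with_spaces(tracks):
    """Return the (filename, genre) pairs whose genre contains whitespace.

    `tracks` is an iterable of (filename, genre) pairs. How the genre is read
    from the ID3 tag is left to your tagging tool (assumption of this sketch).
    Leading and trailing whitespace is ignored; only inner whitespace counts.
    """
    return [
        (name, genre)
        for name, genre in tracks
        if any(ch.isspace() for ch in genre.strip())
    ]


# Hypothetical sample data, not my actual library.
sample = [
    ("track01.mp3", "Soundtrack"),
    ("track02.mp3", "hard rock"),
    ("track03.mp3", "Game"),
]

for name, genre in genres_with_spaces(sample):
    print(f"{name}: genre {genre!r} contains a space")
```

Anything this prints is a candidate for retagging before the files go onto the player.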

*The Sansa has two USB connection modes: MTP and MSC. MTP mode interfaces with media players such as Windows Media Player. This mode allows you to store media library files on the player, and make use of various features like tagging and playlists. MSC mode causes the player to act like a vanilla memory stick, allowing you to directly access the flash file system. I'd imagine it's only necessary to refresh the database in MSC mode; that's the only mode I've ever used.

Judging from Google, there are two different methods of switching between modes, which depend on what firmware you have. One method is that a USB mode option appears in the settings menu on the device. The other method (what mine has) is that the player is always in MTP mode, but connects in MSC mode if you hold the rewind button when you plug it into the USB port.

UPDATE:

Found another bug while playing around with putting DRMed WMAs on the critter (my dad also got one, and he has a bunch of DRMed WMAs to put on it, unlike my MP3s). It's only possible to load DRMed files onto the device in MTP mode, so I had to learn how to use that. It appears that my assumption was correct, that database refreshes are only necessary after adding files in MSC mode; after files are added in MTP mode, they appear in the player immediately after the player is disconnected from the computer.

While the player automatically turns on and goes into USB storage mode when you plug the USB cable in, it's possible to turn off the player by holding the power button (the same way you turn it off when it's not connected to the computer) while in USB storage mode. This is not a good idea. If you add some files to the device and then turn it off before unplugging it, it will lose track of those files, and they will not show up in the list of songs on the player (though they will still show up in the file list when it's connected to the computer in MTP mode). Adding additional files later will not cause this problem to be corrected; it is necessary to delete the files from the player and then transfer them from the computer again.

Tuesday, June 24, 2008

Random Thought of the Day

Did you ever notice that, in English, the simple past (e.g. "he wrote") and past progressive (e.g. "he was writing") are both very common, yet in the present tense, the present progressive (e.g. "he is writing") is overwhelmingly more common than the simple present (e.g. "he writes")? This fact actually leads into an important linguistic principle, which I'll probably write a post about in the future. I'll just leave it as food for thought, for now.

Monday, June 16, 2008

Cases, Ergative, & Accusative

Something that I vaguely implied previously, but I don't think actually said, was that there is a difference between roles and cases (even worse, there are multiple things that "role" could refer to). Roles are, in theory, purely rational, language-independent categories which describe how nouns relate to their clause's verb. Cases, on the other hand, are language-dependent categories representing many things, and there is rarely (if ever) a 1:1 mapping of the two for a language.

The Grammar of Discourse hypothesizes at least ten universal roles, which I'll only briefly describe.
Experiencer: the person experiencing an emotion or sensation
Patient: the one an action acts on
Agent: the one willfully performing an action
Range: an extension of the verb, such as indicating how, e.g. "Your blood smells good"
Measure: an extension of the verb indicating how much, e.g. "I was only bitten a little bit" (these examples brought to you by Vampire Knight)
Instrument: something which is used to perform an action; this can also be used for animate entities who unintentionally perform an action
Locative: the location an action occurs at
Source: the starting point of some kind of movement or transfer
Goal: the ending point of movement or transfer
Path: the path taken during movement or transfer

If we were to compare this list of roles with typical use of the Latin cases, we would get the following. Note that this list is approximate, and some of the roles like measure and range I'm not even sure how to represent in Latin.
Nominative case: agent, patient, experiencer, instrument
Genitive: unrelated to role in the sentence (roles refer to relation with the verb, not with other nouns)
Dative: goal, patient
Accusative: patient, experiencer, goal, rarely source
Ablative: source, instrument, locative, goal, path, possibly range and measure (some of those requiring prepositions)
Locative (rare): locative
Vocative: not related to role

However, while case is language-specific, some themes (common cases) occur much more often than others. Of the Latin cases, the nominative, genitive, dative, and accusative occur very frequently in all languages; this is not surprising, as these seem the most essential to language in general (though note that they are not guaranteed to mean exactly the same thing in all languages).

The nominative case is roughly defined as the subject of the verb. For transitive verbs having a direct object, the subject is the one performing the action (e.g. "He poked her"); for intransitive verbs the subject is the single argument (e.g. "He was hit"). The accusative case is the object of transitive verbs. Any language having this structure is called a nominative-accusative (or sometimes just accusative) language (which we're going to call N/A in the rest of this post).

However, two others - the ergative and the absolutive - also occur very commonly in languages. The ergative case is defined as the subject of transitive verbs. The absolutive case, however, includes both the subject of intransitive verbs and the object of transitive verbs. Languages using this system are called ergative-absolutive (or sometimes just ergative; E/A, here).
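To make the two groupings concrete, here is a small sketch of how each system assigns case. The labels S (the single argument of an intransitive verb), A (the subject of a transitive verb), and O (the object of a transitive verb) are shorthand I'm assuming for this illustration, and the table and function names are made up.

```python
# S = single argument of an intransitive verb
# A = subject of a transitive verb
# O = object of a transitive verb
# (labels assumed for this sketch)
ALIGNMENTS = {
    "N/A": {"S": "nominative", "A": "nominative", "O": "accusative"},
    "E/A": {"S": "absolutive", "A": "ergative", "O": "absolutive"},
}


def case_of(alignment, argument):
    """Return the case that the given alignment assigns to S, A, or O."""
    return ALIGNMENTS[alignment][argument]


# Each example sentence is paired with its arguments and their labels.
examples = [
    ("He fell", [("he", "S")]),
    ("He poked her", [("he", "A"), ("her", "O")]),
]

for alignment in ALIGNMENTS:
    for sentence, arguments in examples:
        marked = ", ".join(
            f"{word} = {case_of(alignment, label)}" for word, label in arguments
        )
        print(f"{alignment}: {sentence!r}: {marked}")
```

The only difference between the two tables is which transitive argument S is grouped with: A in an N/A language, O in an E/A language.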

At first this seems very strange and arbitrary - splitting the subject depending on whether the verb is transitive or intransitive. However, this is due to the fact that we don't speak an E/A language. In fact, even the word 'subject' reflects this bias in thinking. The N/A split carries the paradigm that all actions are done by somebody/something, regardless of whether the action is intentional or unintentional, or even whether there's anyone performing the action at all (e.g. in "He fell"). This is called the subject, and for transitive verbs, the one acted on is called the object; thus the N/A split actually corresponds to a subject/object division.

However, we get a different picture if we discard this assumption and look at things from the perspective of roles. In reality, with many intransitive verbs (such as the one shown above) the "subject" is not the one doing the action at all, but rather the one who is subjected to the action - the patient. Thus the E/A split is based on the paradigm that the ergative case is the doer (agent or instrument) of the action, while the absolutive case is the patient of the action - an agent/patient separation. Taking it one step further, some E/A languages even require that the ergative argument commit the action intentionally, and use a different sentence structure to indicate otherwise (e.g. split-intransitivity languages use either the ergative or absolutive case for the subject of intransitive verbs, depending on whether the action is intentional or not; others use the passive voice for unintentional actions; etc.).

Given this, both seem equally sensible, and the choice itself now seems arbitrary. It's worth noting, also, that most languages in the world are either N/A or E/A. Languages using other systems are rare, which might suggest that the N/A and E/A splits are more sensible and/or useful than other methods. But hold onto that thought.

Thursday, June 12, 2008

Case & Other Cases

One thing necessary in all languages is that the nouns in a sentence that play various roles/cases must be identifiable. While the exact amount of precision varies by language and by sentence structure (there may be more than one way to say something, or only certain structures may be used in certain cases), all languages have a way to indicate the subject, direct object, etc. (although of course the exact set of roles that exists varies by language, as well). As far as I'm aware, there are three methods of accomplishing this: dependent-marking, head-marking, and analysis (note that none of these terms refers exclusively to role; I'm merely discussing them in this one specific context).

Let's start with the easy one: analysis. This is the method English uses for its core roles: subject, direct object, and sometimes the indirect object. As I pointed out in The Decline of the English Language, Modern English has a fairly rigid word order for its core roles: Subject Verb [IndirectObject] [DirectObject], as in "The boy gave the dog a bone"; some other word orders are used by native speakers, but they're uncommon, and generally only used in certain specific contexts (e.g. the Verb Subject Complement order in "Are you an idiot?"). Thus analysis refers to the use of strict word ordering to determine what role each noun has.

As I mentioned in the same paper, English wasn't always this way: it belongs to the same language family as Latin, whose members all traditionally used dependent-marking of case. Dependent-marking refers to the fact that each word is marked to indicate its role. In the Latin version of the same sentence, "Puer [boy] cani [dog] os [bone] dabat [gave]", the four words may be placed in any order, and the meaning will still be clear, because the nouns carry the nominative, dative, and accusative cases, respectively (actually, that isn't 100% true; because some cases decline the same way, there can be some ambiguity here).
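As a rough illustration of why word order stops mattering once the nouns are marked, the sketch below looks up each word's case in a tiny hand-made lexicon and checks that every ordering of the sentence gives the same reading. The lexicon and function name are made up for this example, and it glosses over the ambiguity just mentioned by listing "os" only as accusative.

```python
from itertools import permutations

# Toy lexicon: each marked form maps to (gloss, case).
# Simplification: "os" is listed only as accusative, although the same
# form also serves as the nominative.
LEXICON = {
    "puer": ("boy", "nominative"),
    "cani": ("dog", "dative"),
    "os": ("bone", "accusative"),
}


def roles(sentence):
    """Map each case found in the sentence to the gloss of the noun carrying it."""
    found = {}
    for word in sentence.lower().split():
        if word in LEXICON:
            gloss, case = LEXICON[word]
            found[case] = gloss
    return found


words = ["puer", "cani", "os", "dabat"]

# Collect the distinct readings over all 24 orderings of the four words.
readings = {
    tuple(sorted(roles(" ".join(order)).items())) for order in permutations(words)
}

print(roles("Puer cani os dabat"))
print("distinct readings:", len(readings))
```

An analytic language would need the opposite kind of lookup: the role would come from the position of the word, not from its form.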

You might notice that English also does this for non-core roles, which corresponds to greater freedom as to word order. As dependent-marking does not require that the mark actually be attached to the word, English uses prepositions to mark non-core roles, rather than the traditional suffixes of Indo-European languages. This system is used for such roles as instrument in "The boy poked the dog with a bone" (the Latin version, "Puer canem osse pungebat", uses the ablative case for the bone, and the accusative case for the dog), the benefactor in "The boy bought a dog for her" (in the Latin version "Puer canem pro ea emebat", a preposition is used with the ablative in this case), etc. The last example also illustrates that Latin uses prepositions as well, to mark roles outside the 6 core cases.

Both of those methods are something that isn't entirely unfamiliar to English speakers. Even case still (barely) exists in the pronouns and nouns of English (having three and two cases, respectively); the third method, head-marking or agreement, is also not absolutely foreign, though it is uncommon in modern English. Verbs in Indo-European languages traditionally agree with the subject of the sentence - the verbs themselves indicate the grammatical person and number of the subject. While English has all but lost this form of agreement, you can still see vestiges of it. The verb 'am' uniquely identifies the subject as first person singular, while 'is' identifies the subject as third person singular ('are' is ambiguous, because it could refer either to second person singular or any person plural); similarly, the -s form of all other verbs (e.g. 'gives') identifies the subject as third person singular. Romance languages like Spanish still contain robust subject-verb agreement, such that it is possible to uniquely identify the subject as first, second, or third person (never mind the bad terminology for now) and singular or plural.

However, you might have noticed something: in languages like Latin that have subject agreement, marking nouns with the nominative case (used for the subject) can be redundant. Head-marking, or polysynthetic, languages do away with this use of case, and purely rely on verb agreement to indicate which nouns have each role. I can't find a good example of a sentence that would indicate how this would work without introducing other things I don't want to get into, so I'm gonna make one up:
In this example, theyare attachedit pronouns representingthem the subject, direct object, and indirect object the verb of each clause. As with English pronouns in general, theyagree the attached pronouns with number and gender of the nouns. For the verbs, iusedthem the subject-verb-object order and pronoun cases, to makethem the verbs easier to read for English speakers. However, iusedthem varying word orders for nouns in the clauses to illustrateit how itcan be used head-marking with different word orders. Typically theywould useit head-marking head-marking languages with other modifiers like possessives, as well.
Finally, the Totonac language takes polysynthesis to a ridiculous extreme. According to the examples in The Grammar of Discourse, Totonac merely lists all roles in the sentence, without using agreement to indicate which nouns have which roles. One example given (I'm kind of making up my own orthography, here) is "liiteemaktamaahua [literally 'with-passing by-from-buy'] tumin [money]", which means "As [he] passes by, [he] buys [it] from [him] with money". Amazingly (and completely against expectations), native speakers of Totonac can actually understand each other.

Friday, June 06, 2008

Beyond Godly

On Recording Industry vs. the People, in response to this story, somebody suggested:
It would be interesting to set up a 'honey-pot' node (using maybe a printer or a network monitoring box), wait for a takedown notice, and say "see you in court". It would be even more interesting to see the discovery request for the hard disk of a printer.
That idea is beyond godly - set up a honeypot network that isn't actually sharing copyrighted material, and file DMCA abuse suits for every DMCA takedown notice they receive. I suspect that would very rapidly lead to more thorough investigations before companies fire off bogus DMCA takedown notices.

Thursday, June 05, 2008

Empirical Data and the RIAA

A bit ago I wrote up a rather lengthy list of factors which could, in theory, produce false-positives in identifying users sharing copyrighted files via peer-to-peer programs. Most of these risks could be mitigated by thorough investigation, though I noted that as the RIAA clearly cuts every corner they can, it's likely that few if any of these mitigating measures are taken in actual investigations.

Now the University of Washington has demonstrated some of these risks in actual occurrence in their project Tracking the Trackers: Investigating P2P Copyright Enforcement. While they've only looked at a couple of the risks I suggested, the results show quite a few false positives, indicating that my prediction that measures to minimize these risks are not being applied was accurate.

The research paper is here, if you don't want to go through the project's web site itself. The New York Times blog has also picked up this story. They also have a cute logo/illustration:


This was actually a study I've been wanting to see done for some time. The other study that I think is very important but has not yet been done is to determine empirically how, on a system like eDonkey, where users search all peers for a certain file, the number of requests a single computer gets for a single file varies with the popularity of the file. The basis of this investigation is the claim by the RIAA and others that users could be sharing thousands or millions of copies of each copyrighted work, and that constitutional limitations on civil damage awards therefore do not apply.

Clearly files that are popular (e.g. the latest hit song) will be downloaded more (in total) than files which are unpopular. But does this mean any single computer will upload popular files significantly more often than unpopular files? I believe the answer is no, because when files are more popular, not only are they downloaded more, but they are also available from more computers. In theory, the increase in demand is accompanied by a proportionate increase in supply, keeping the ratio invariant regardless of demand. Based on this belief, I have argued on forums (one example here) that most of the people the RIAA has sued have, according to simple probability, not uploaded more than a single copy of each file, on average (so about $0.70 of damage per file, if you assume 1 download = 1 lost sale, which itself is highly suspect).
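The arithmetic behind that belief can be sketched as follows. Everything here is hypothetical: the numbers are invented, and the two assumptions (downloads are spread evenly over the computers sharing a file, and the number of sharers grows in proportion to the number of downloads) are exactly the things an empirical study would need to test.

```python
def uploads_per_sharer(total_downloads, sharers):
    """Average uploads per sharing computer, assuming downloads are spread evenly."""
    return total_downloads / sharers


# Hypothetical baseline for an unpopular file.
BASE_DOWNLOADS = 1_000
BASE_SHARERS = 1_000

# Assumption: a file that is `popularity` times as popular has that many
# times the downloads AND that many times the sharers.
for popularity in (1, 10, 100):
    downloads = BASE_DOWNLOADS * popularity
    sharers = BASE_SHARERS * popularity
    average = uploads_per_sharer(downloads, sharers)
    print(f"popularity x{popularity}: {downloads} downloads, "
          f"{sharers} sharers, {average} uploads per sharer")
```

If supply grows more slowly than demand in practice, the per-computer figure would rise with popularity, which is why the measurement matters.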