Many languages have a concept of a hierarchy of animacy/empathy (not a very informative link, I'm afraid; this looks really interesting, but I haven't had time to read it yet). In these systems, things the speaker is more empathetic toward tend to be treated as superior to things the speaker empathizes with less - e.g. first person > second person > animal > object, etc. (this is just a general example; specifics vary widely).
This hierarchy can appear in a wide variety of ways, from extremely subtle (so much so that you might not even notice it) to very obvious. Some languages, for example, place nouns with higher empathy before nouns with lower empathy. Others have the verb of the sentence agree with whichever argument has higher empathy (for comparison, in Indo-European languages like English, verbs always agree with the subject - "I am", "you are", etc.) - e.g. the subject in "I ate a fish", but the object in "the dog bit him". Still other languages have the empathy of the agent (a more specific notion than subject, referring specifically to the one performing an action on something else) dictate the voice of the verb, such that the subject is always the thing with higher empathy; for example, you would say "I was hit by the ball" rather than "the ball hit me" ("the ball" is the agent in both cases, though it's a prepositional object in the first, and the subject in the second).
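To make the agreement example concrete, here's a minimal sketch in Python of how a verb might pick its agreement target from such a hierarchy. The ranks and labels are invented purely for illustration; real hierarchies vary widely by language.

    # Hypothetical empathy ranks; lower number = higher empathy.
    EMPATHY_RANK = {"1st": 0, "2nd": 1, "3rd-human": 2, "animal": 3, "object": 4}

    def agreement_target(subject, obj):
        # The verb agrees with whichever argument ranks higher (lower number).
        return subject if EMPATHY_RANK[subject] <= EMPATHY_RANK[obj] else obj

    print(agreement_target("1st", "animal"))        # 1st - "I ate a fish": agrees with the subject
    print(agreement_target("animal", "3rd-human"))  # 3rd-human - "the dog bit him": agrees with the object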
All of the aforementioned examples were fairly obvious. Here's one that's much more subtle (in fact, I didn't even notice it when WarBringer87 was first translating some stuff for me). In Armenian, nouns are modified based on who possesses them. For example, for the noun "keerk" ("book"), "[eem] keerkus" would be "my book" ("eem" means "my", but the -us suffix also indicates that; thus "eem" is optional), "[koo] keerkud" means "your book", and "eeren keerkuh" means "his/her/its book" ("eeren" isn't optional, for reasons we're about to get to). At first glance this may seem like a simple possessive suffix agreeing with the person of the possessor. However, one additional fact proves that it's something more interesting: "keerkuh" (lacking any possessor) can also mean "the book". Thus we actually have a three-level empathy hierarchy: first-person singular possessor (-us) > second-person singular possessor (-ud) > everything else (-uh).
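Here's the same system as a trivial sketch in Python, using my ad-hoc transliterations from above (not any standard romanization):

    # First- and second-person singular possessors get their own suffixes.
    SUFFIXES = {"1sg": "us", "2sg": "ud"}

    def possessed_form(noun, possessor=None):
        # Third-person possessors and plain definiteness ("the book") collapse
        # into the same suffix, -uh - the bottom of the hierarchy.
        return noun + SUFFIXES.get(possessor, "uh")

    print(possessed_form("keerk", "1sg"))  # keerkus - my book
    print(possessed_form("keerk", "2sg"))  # keerkud - your book
    print(possessed_form("keerk", "3sg"))  # keerkuh - his/her/its book
    print(possessed_form("keerk"))         # keerkuh - the book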
In other news, now I have the urge to go look at that Trique Bible of mine to try and figure out whether the "fourth person" pronoun is something lower on the empathy hierarchy than third person.
Tuesday, January 29, 2008
Trade-offs in Linguistics
As with most things, in linguistics you occasionally have a case where there's an obviously best solution - a solution that is superior to the alternatives in all aspects (efficiency, complexity, clarity, etc.). More often than not, however, you have to make do with trade-offs between opposing factors. I ran into a nice example of such a thing a few days ago, while thinking about Caia.
For quite some time I've been thinking about how best to use noun classes (the more general form of 'gender', which is not limited to gender proper, but may group nouns in a wide variety of ways, such as 'long things', 'dangerous things', etc.) and number so as to maximize the clarity of pronoun references. Obviously the goal of this is to minimize the number of potential nouns that a given pronoun could refer to.
To get an idea of what I didn't want, we can look at Trique. Trique mainly uses four pronouns: a first-person pronoun, a second-person pronoun, and two third-person pronouns (technically one is a 'fourth-person pronoun', but I'm not exactly sure how this works); singular/dual/plural is then marked with auxiliary words. This makes for a very clean, simple pronoun system. Unfortunately, it also makes for a relatively ambiguous one.
Getting back to my objectives. A couple of days ago, I happened to realize the optimal way of structuring noun classes and number, which also served to provide an excellent example of why it's usually not acceptable to optimize for just one variable. The optimal method would be to have a large number of noun classes structured so that the distribution of nouns (more specifically, commonly-used nouns) across these classes is uniform.
This minimizes the potential for ambiguity, as it minimizes the probability that there will be more than one noun in a given class that a pronoun could potentially refer to. Unfortunately, it also maximizes complexity, requiring that you memorize the class of each noun. People would be even more of a pain: to distribute them evenly among classes, you would need to separate people into noun classes as well, based on specific attributes of each person (I'm not even sure what criteria you could use to distribute people evenly like this).
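To put some (entirely made-up) numbers on that intuition, here's a quick Python simulation of how often a pronoun reference collides, under the simplifying assumption that each recently mentioned noun falls into a uniformly random class:

    import random

    def ambiguity_rate(num_classes, nouns_in_discourse=3, trials=100_000):
        # A pronoun reference is ambiguous when two or more recently mentioned
        # nouns land in the same class (essentially the birthday problem).
        ambiguous = 0
        for _ in range(trials):
            classes = [random.randrange(num_classes) for _ in range(nouns_in_discourse)]
            if len(set(classes)) < len(classes):
                ambiguous += 1
        return ambiguous / trials

    for c in (3, 10, 30):
        print(c, ambiguity_rate(c))
    # With 3 active referents: roughly 0.78 for 3 classes, 0.28 for 10,
    # and 0.10 for 30 - more (evenly filled) classes, fewer collisions.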
So, while that provided some interesting food for thought, it ultimately only provided a good example of how not to do it in Caia.
Saturday, January 26, 2008
Tweaking
Okay, I finally got around to tweaking the blog settings some (been needing to for a while). Among other things, I've made several modifications to the sidebar, including something pretty cool: a feed of my E Terra blog. I also finally got around to adding the links box back in.
Monday, January 21, 2008
Fiction & Fiction
Something I've been meaning to do for quite some time now, and haven't yet gotten around to, is to write a blog post about some of the anime I've watched that either I started watching because the synopsis sounded a lot like one of my stories, or that I just happened to watch and turned out to be similar to my stories. Given that my recent posts have been even more random than usual, now seemed like a decent time to do it (furthermore, one new series was added to this list within the last week).
In this post I'm mainly going to talk about my own stories, how they resemble the anime series, and any notable ways in which the two differ. I'd recommend reading the linked synopses of the anime series before reading the parts about my stories.
Anime: Ai Yori Aoshi (literally "Bluer than Indigo")
Story Title: Homecoming (working title - I'm hoping to come up with something better for the official title, but this is the best I have for now)
Length: Novel (note that these are measures of length only; they don't necessarily represent the preferred medium). A stand-alone spin-off of Blood for Blood (working title), a story originally intended to be made into a role-playing game of size comparable to the Final Fantasies.
Universe: The Blood for Blood Universe
Synopsis:
After more than a decade of education at boarding school, followed by the events of Blood for Blood, with only a few brief visits home in between, Kain (probably the second-most major character in Blood for Blood) is finally able to return to his homeland to claim his position as a land-owning nobleman. Homecoming is the story of his return as he tries both to resume an ordinary life and to grow accustomed to his responsibilities, learning to manage his estate and manor, deal with the populace of his territory, and, most central to the story, deal with his fiancée, Alice.
Alice, the daughter of a successful merchant in the upper middle classes, was betrothed to Kain shortly after the two of them were born (they were born several months apart) and raised accordingly, while Kain was raised with the expectation of marrying her. Neither of them, however, is fond of the concept of a predetermined marriage to someone they barely know, arranged without any say of their own in the decision. As both are thoughtful and well-mannered, they independently adopt patient, diplomatic approaches to the matter as they get to know each other in the few weeks before their planned marriage, and attempt to reconcile their own desires and options.
Compare and Contrast:
I can't recall where exactly I saw it, but the synopsis of Ai Yori Aoshi I first saw sounded like it was closer to Homecoming than it actually was. In Ai Yori Aoshi, any ambivalence toward the somewhat archaic concept of arranged marriage is rather one-sided and fairly brief; as ambivalence about the arrangement is one of the central themes of Homecoming, this alone makes the general dynamics of the stories significantly different. Not to mention the fact that Homecoming is not really a comedy (apart from Kain's generally amusing personality), or that Ai Yori Aoshi throws in a bunch of ecchi and harem stuff totally missing from Homecoming (even if it were made into a visual medium). Nevertheless, somehow I managed to end up liking Ai Yori Aoshi, and ultimately watched/read all of it.
Anime: Gunslinger Girl (season 2 download)
Story Title: Starfall (official title)
Length: Novel. The first story in a larger work, as yet without an official name. Prequel to Eve of Tomorrow (official title).
Universe: Real world, around 20 years in the future.
Synopsis:
In the fairly near future, a secret society known as Falcata gathered an array of top scientists from around the world to conduct biological engineering research toward a project to develop a drastically more combat-capable human than any naturally occurring in humanity. A 15-year-old boy named Safir is one of the specimens created in this study. Genetically engineered to possess several remarkable brain features, then trained from birth to hone those features to perfection, Safir quickly became by far the most successful of the first-generation trials, his combat abilities exceeding the expectations of everyone involved in the experiment, with agility greater than a human's and accuracy greater than a computer's.
During the story of Starfall, Safir is sent on a series of missions to test his combat abilities, leading in the end to his final test. While the activities are so varied as to include both terrorism and military combat, the most common are missions of assassination, in increasingly difficult circumstances.
However, despite combat being the backbone of the story, the real focus of Starfall is psychological and sociological: how Safir thinks, how he acts, and how he views and relates to the world around him in his unique position, and, ultimately, how he becomes the terrifying character of Eve of Tomorrow, so different from the one initially seen in Starfall (the title Starfall is a reference to Lucifer/Satan: the star that fell from heaven).
Compare and Contrast:
This was another that I watched specifically because I saw an ad for it, looked up the synopsis, and thought it sounded like one of my stories. Apart from the fact that the girls in Gunslinger Girl are cyborgs, these two are indeed quite similar. For me the most interesting part of Gunslinger Girl was always seeing how they behaved and interacted in a world where they were raised to kill - the very same thing that I think makes Starfall interesting. I still follow the Gunslinger Girl series (manga); unfortunately, I've read all six volumes that have been written so far, so it will be a while before there's any more for me to read (and I'm pretty sure even the second anime season just starting won't go past what I've already read).
Anime: Wolf and Spice ("Spice and Wolf" would be a more literal translation of the title, but the other way seems to make more sense; download)
Story Title: The Mission (candidate official title)
Length: Series of novels
Universe: Real world, modern day
Synopsis:
The Mission is one of my more secret projects; I don't think I've told anybody much about this story other than the fact that I was probably gonna stir up some controversy with it, so don't feel bad that I'm not gonna tell you much. :P
The Mission is the story of God, in the form of an apparently late-teenage human named Simon, as he treks across the United States over quite a few months, for reasons which remain obscure until right at the end.
Compare and Contrast:
The most obvious difference between Simon and Horo is that Simon is a god, while Horo seems to me to be more of a 'great spirit' type thing. Other than that, Simon's demeanor is more passive than Horo's, preferring to deal out wisdom more subtly, and without a position of known superiority (that is, nobody knows who he is); though humor is still very much an intentional part of the story (a significant part, actually). Nevertheless, especially in the second episode of Wolf and Spice, there definitely seems to be some resemblance between their hearts (in fact, the similarity of the two series didn't really occur to me until then).
Thursday, January 17, 2008
On the Name of God
As mentioned previously, I'm a fiction writer. As a writer, I've developed storylines in a significant number of different universes, often with multiple stories in each universe. Most of these universes are original - they are not based on the work of anyone else; however, some are fan fiction of existing universes, either real or imagined. Some of the existing universes my stories reside in are the Starcraft universe (most forming a collective work known as Apocalypse), the Warcraft universe, and the real world (such as The Mission, Eve of Tomorrow, etc.).
The Warcraft storyline (as yet unnamed) consists of quite a few stories ('books', if you want to think of them that way) about the events surrounding a particular religious organization, with three characters playing the lead roles - Ambrose, the founder, and his adopted children (orphans from the events of Warcraft III), Julius and Nadia. The organization was formed for a specific purpose in the storyline: to annihilate the Scourge on Azeroth, and to banish the Burning Legion from Outland. This storyline was developed after Warcraft III was released (obviously), but before The Frozen Throne was released (I was rather annoyed to see the Scarlet Crusade in World of Warcraft since, among more obvious reasons, red was the color I was using).
I suspect I'm giving much more information than is necessary for the purpose of this post. Anyway, this organization worships a god, and obviously such an entity needs a name. While some parts of the storyline (as with most other storylines) developed more gradually, I knew the name I would use from the very beginning: the name was to be derived from 'agnostos', the Greek word meaning 'unknown' (same root as English 'agnostic').
This name was chosen for several reasons. At the time Ambrose and god first met, the god had been nearly inactive since ancient times; no followers, and virtually no written records, remained. So clearly god was very much unknown to the world at that point. The second reason is similar: Ambrose giving god this name refers to some of the particular, humorous phrasing in the first conversation between Ambrose and god (when Ambrose still didn't know who god was), and reflects the somewhat tongue-in-cheek relationship between the two. I believe there was another reason for choosing 'unknown' as the name, though I can't think of it off the top of my head. The final reason for choosing this word, as well as the reason for using Greek, are left as an exercise for the reader (good luck with that). However, unforeseeable developments made the choice of Greek work even better than I initially expected.
As I'd decided from the beginning to use Greek (specifically Ancient Greek), it would seem logical to put holy texts and mantras in Greek as well. There's just one problem... while I know enough Latin to at least be able to make coherent text with the aid of a dictionary, I know little more about Greek than the fact that its grammar kind of resembles Latin's (I know a bit more now, but still not enough that I'd try writing text in it). This led to the decision to use Latin as the "present day" (the time of the storyline) holy language.
It wasn't difficult to rationalize this choice using real-world history. There's clearly some analogy between this fictional organization and the Catholic Church, the latter often using Latin for holy texts and common sacred phrases. Similarly, much of the New Testament was written in Greek (in fact, I believe the name Jesus itself is Greek - the original Hebrew version is Joshua). Thus there was real, historical precedent for Greek being the holy language "in the past", and Latin "in the present" (never mind the fact that technically Latin was also spoken at the time of the New Testament, as it was the language of the Romans, who ruled during that time, and that Latin was or nearly was a dead language by the time the Catholic Church was formed).
This idea was then refined after I read a portion of Negima! For those not familiar with the series, it involves, among other things, battles between wizards of different schools of magic, from the Harry Potter type of western magic to ancient Japanese sorcery. I found the fact that different languages were used for spells of different schools (e.g. Latin, Japanese, Sanskrit) particularly interesting. As it turns out, the author used Latin for most western incantations (not surprising); however, for added force, he decided that older, more devastating western magic would use Ancient Greek (a common theme in fantasy seems to be that the older the magic, the stronger it is). I had previously planned on using Latin for the names of the special abilities of Ambrose, Julius, and Nadia; this gave me the idea to use Ancient Greek for the ultimates of the three.
*Ahem* getting back on track... I ultimately decided on 'Agnos' as the name of god. Agnos is, among other things, associated with fire. For example, the Virifeges Dei ("Fangs of God"; yes, I coined that word myself for this purpose), the tangible manifestation of Agnos' power, appears as a white flame. One word in Latin for fire is 'ignis' (same root as English 'ignite'). Thus 'Agnos' is the result of a dual derivation: from Greek 'agnostos' and from Latin 'ignis', words in both holy languages which were relevant to Agnos' character. I think that turned out impressively well.
This brought up a rather sizable question: how the heck am I supposed to decline (see meaning 4a) that? 'agnostos' is a Greek word, and 'Agnos' is a Greek-style analog; if the two were declined the same, you'd get Agnos in the nominative, Agnou in the genitive, Agno (or something like that; Agnoi? Agnoh?) in the dative, and Agnon in the accusative. Only, I'm writing Latin, not Greek. -os is NOT a typical Latin declension suffix.
Well, there are quite a few possibilities, based on theory alone. I could decline it like 'dominus' ('lord') but with an altered nominative: Agnos, Agni, Agno, Agnum, Agno (in the ablative), and Agne (in the vocative). I could also decline it like 'honos' (older form of 'honor', identical in English and Latin): Agnos, Agnoris, Agnori, Agnorem, Agnore, Agnos. I could even use the fourth declension.
Fortunately, some research revealed that Latin already has a mechanism for borrowing Greek words like 'agnostos': the first of the options listed above - use the second declension, but keep the -os form in the nominative. Ironically (and in a typical example of me overthinking a problem), there already exists a word 'agnos' in Latin - a type of tree - and it's declined in exactly this way. While I don't mind that so much, there's one more coincidental occurrence I'm not so sure about (and in fact only just noticed): Agnos and 'agnus' ('lamb', e.g. Agnus Dei - Lamb of God) are identical in most cases. That wasn't intentional, for anybody wondering :P
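For the curious, here's the resulting paradigm as a trivial Python sketch. It follows the dominus-style option above, keeping the Greek -os in the nominative; Greek loans in Latin sometimes also keep -on in the accusative, but I'm going with the Latinized -um here.

    # Second declension, with the Greek -os retained in the nominative.
    ENDINGS = {
        "nominative": "os",  # Greek-style; the native Latin ending would be -us
        "genitive":   "i",
        "dative":     "o",
        "accusative": "um",  # some Greek loans keep -on here instead
        "ablative":   "o",
        "vocative":   "e",
    }

    def decline(stem):
        return {case: stem + ending for case, ending in ENDINGS.items()}

    for case, form in decline("Agn").items():
        print(case, form)
    # nominative Agnos, genitive Agni, dative Agno,
    # accusative Agnum, ablative Agno, vocative Agne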
Tuesday, January 15, 2008
& Conjecture
Comcast has gotten in some pretty hot water lately over its blocking of peer-to-peer (P2P) protocols like BitTorrent (BT). I was fortunate enough to get off Comcast before this whole affair began. However, things aren't altogether different on my current ISP: SBC/Yahoo DSL. Though I haven't had much luck in the past finding others with the same problem, and while there have only been a handful (a word I've given my own definition to, meaning five or fewer) of incidents, I've observed BT blocking on my current ISP as well.
In this post I'm going to talk about some theories I have about how this blocking system works on AT&T, based on my observations. It should be noted first of all that these are drawn from a very small number of observations; statistics 101 tells you that the smaller the sample size, the less accurate conclusions are expected to be. Nevertheless, the behavior I have observed across these incidents seems more or less consistent.
The system appears to consist of a stateful process which determines how to deal with traffic in several steps. In its normal mode of operation, there is no blocking or throttling of BT traffic. However, when a large volume of traffic is observed through one internet connection, the system enters an activated state, where it inspects the connection more carefully - though it's still not blocking or throttling traffic. This state is relatively hard to get into, and from my experience requires multiple gigs of traffic over a period of time (the exact amount is unknown, but I'd guess a couple of days' worth; I believe I've only observed this when downloading one or more entire anime series at a time).
The second factor needed to trigger blocking/throttling is a smaller (but still substantial - on the order of dozens or hundreds of megs) amount of traffic on a single port (or maybe it's tracking individual TCP connections; I'm not certain at this point). I've typically observed this as taking several hours to activate.
Once this occurs, the system goes into throttle mode for that specific port. While I'm still not sure of the details of the mechanisms involved, I'll describe the symptoms I've observed when this mode is activated: effective upload bandwidth through that port gradually decreases, until no TCP connection is able to pass more than a few dozen KB before being strangled to death (I'm not sure of the exact mechanism by which connections are strangled and killed).
As throttling mode is activated per port, other ports are unaffected. If you change BT to use a different port, BT will be able to resume full use of bandwidth. However, the system is still in an activated state and watching, and will strangle that port as well if it observes too much traffic flowing through it.
I've not done experiments to determine how long it takes for a port to leave throttle mode. I have, however, observed it to take a considerable amount of time (days) to return from activated mode to normal mode. Presumably it's watching how long it's been since the last port left throttle mode.
I would hypothesize that the two levels of activation exist because of the cost of tracking per-port bandwidth: the system doesn't want to allocate that kind of computing resource to something which isn't considered a serious problem. But this is just conjecture - I don't know for certain that it only tracks port bandwidth when in an activated state; I'm merely attempting to fit a curve to a handful of points.
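Just to make the conjecture concrete, here's a rough sketch in Python of the two-level state machine I'm describing. Every threshold is a made-up placeholder; all I really have are orders of magnitude (multiple gigs to activate, dozens to hundreds of megs per port to throttle).

    from collections import defaultdict

    ACTIVATION_THRESHOLD = 3 * 1024**3  # total bytes before the connection is "activated" (guess)
    PORT_THRESHOLD = 100 * 1024**2      # per-port bytes before that port is throttled (guess)

    class ThrottleModel:
        def __init__(self):
            self.total_bytes = 0
            self.activated = False
            self.port_bytes = defaultdict(int)
            self.throttled_ports = set()

        def observe(self, port, nbytes):
            self.total_bytes += nbytes
            if self.total_bytes >= ACTIVATION_THRESHOLD:
                self.activated = True  # start watching individual ports
            if self.activated:
                # Per-port accounting only happens once activated - the
                # cost-saving conjecture described above.
                self.port_bytes[port] += nbytes
                if self.port_bytes[port] >= PORT_THRESHOLD:
                    self.throttled_ports.add(port)

        def is_throttled(self, port):
            return port in self.throttled_ports

Note that switching BT to a new port starts that port's counter from zero while the connection stays activated, which matches the evasion (and eventual re-strangling) behavior described above.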
If you find any more information on the matter, I'd be interested to hear it.
Monday, January 14, 2008
Gah
So, I'm randomly reading reviews of new anime series this season (two episodes in, now; if you want a list of them, check out Random Curiosity's). There were a couple that sounded interesting or different enough to consider, but I was still reading reviews on them and others. A summary of the episode, from the review of episode 2 of Shigofumi on the same blog, sums up what made the decision a lot harder (I was originally considering watching that series):
It might be a little premature for me to say this, but out of all the new shows I’ve seen so far this winter season, Shigofumi has been my favorite. There’s just something about the way the plot continues to surprise me with themes that are a lot more mature than I was expecting, from multiple instances of death to child pornography.
Viva Japan.
On a tangentially related note, a philosophical question occurred to me: what would anime writers use as a medium for delivery of ecchi if group bathing wasn't such a feature of Japanese culture?
Saturday, January 12, 2008
& Awesomeness
& Adjective Strategies
First of all, to preempt the inevitable question: no, this post isn't a part of the series that the post from a couple posts ago belongs to. Like most posts on Q & Stuff, the topic of this post is just what I happened to be thinking about at the time I decided to write a post. The flavor of the day is adjectives.
For the purpose of this post, I'm going to define 'adjective' in the theoretical sense, meaning a way of specifying attributes or other things about nouns. Clearly this is a very important (I might go out on a limb and say essential) part of every language. However, the means by which the theoretical purpose of adjectives is accomplished varies unusually widely between languages. I know of at least four strategies for performing this task - and bear in mind that linguistics is my hobby, not my profession, nor even an area of expertise. Of course, languages often mix and match - some attributes may use one strategy, other attributes another; some attributes may even have multiple ways of being represented.
The most obvious to us (I'm assuming most or all reading this speak some Indo-European language as their native language) are true adjectives. True adjectives, as I'm defining them in this post, are words which serve uniquely as adjectives. They attach to and modify nouns, and can serve no other purpose - other purposes require different words entirely. For a few examples from this paragraph: 'obvious', 'true', 'different'.
I've also mentioned one other type of adjective in the past: verbal adjectives, or (the more commonly used term) stative verbs - that is, verbs which represent states of being. Japanese has many such verbs, several examples of which I've given in the past: 'sugoi' ('to be incredible'), 'aoi' ('to be [the color] blue'), 'warui' ('to be bad'). As I mentioned previously, when you see a stative verb used in a true-adjective-like way (e.g. 'aoi hana' - 'blue flower'), what you are actually seeing is a relative clause structure ('flower which is blue').
However, there's a second subtype of this strategy. We just discussed verbs which have stative meanings in the active voice. What's much more common in English (though still not as common as true adjectives) is the use of passive-voice verbs as adjectives; that is, verbs which, in the active voice, denote an action, but have a stative meaning in the passive voice. Some examples in English: 'ruined' (from 'ruin'), 'afraid' (from the archaic verb 'affray'), 'enlightened' (from 'enlighten'), etc. Interestingly, Japanese does not like this, and will use warped structures to keep verbs used as adjectives in the active voice (from one of my Japanese books: 'nansen ni atta suifu' - literally 'sailor who met shipwreck').
Next up is the strategy used primarily by Caia, though languages like Latin and Japanese also do things like this: the use of abstract nouns to modify other nouns in an attributive manner by means of an attributive indicator. Take the classic Japanese word 'baka'. This word is a noun, meaning 'idiot', 'stupidity', etc. (it's kind of vague). However, using the attributive particle 'na', 'baka' can be used to mean 'stupid', as in 'baka na koto' ('stupid thing'). Latin places the attribute noun in the genitive case for this purpose (though note that the genitive is also used for other things, such as possession).
I chose to use this strategy in Caia for the purpose of simplicity. Adding in a third word type - true adjectives - would merely complicate things, without offering a substantial benefit in some other way (there are a few true adjectives in Caia, but only for things that are cumbersome to represent in the 'having [attribute]' paradigm - 'few', 'big', etc.). Initially I was hoping to have Caia represent most things as nouns, and use structures to represent adjectives and verbs (which would mean Caia has only one major word category), but I ultimately decided it would be too cumbersome to do that with verbs, and admitted a second major category.
However, just because those things are cumbersome to represent using nouns does not mean that representing them in such a way is impossible; it just requires a somewhat different strategy: the fourth I'll talk about. In contrast to using abstract nouns such as 'stupidity', this strategy represents attributes as concrete nouns possessing that attribute. For example, rather than having a word for 'stupidity', you have a word for 'idiot [someone stupid]' (although you could have both, as English does); to insult someone, you would then say "You are an idiot". I don't believe any language you've ever heard of uses this strategy much, but there are real languages that do; and, as just demonstrated, other languages can use it in addition to other strategies. In theory, by combining this strategy with the last, you could create a language where all theoretical adjectives are represented as nouns.
In other news, my post history suggests I'm forgetting things I've done in the past.
Tuesday, January 08, 2008
Come Again?
If you're unfortunate enough, you might have noticed that Windows will sometimes automatically delete files such as MP3s when you try to open them, particularly after receiving them over MSN Messenger. I just had that happen to me, and it wasn't the first time I've seen it. Following the help link Windows provides - after notifying you that it has unilaterally decided to delete (and already has deleted) the file - turns up this information:
Sending and reading e-mail is one of the most popular activities on the Internet. The widespread use of this technology, however, makes it a primary way for computer viruses to spread. Because viruses and other security threats are often contained in e-mail attachments, Microsoft Windows XP Service Pack 2 (SP2) helps protect your computer by blocking e-mail attachments that might be harmful.
In most cases, Windows XP SP2 will block files that have the potential to harm your computer if they come to you through e-mail or other communication programs. Windows will block these files if your program is running in a strong security mode. Most files that contain script or code that could run without your permission will be blocked. Some common examples of this file type are those with file names that end in .exe, .bat, and .js.
Blocking these files is very important to do, since directly opening files of this type poses a risk to your computer and personal data.
This is just baiting the Slashdot crowd. Is there a known but unfixed (major) security vulnerability in Windows Media Player that allows a malicious MP3 to execute script or executable code just by being listened to? Did the RIAA play a part in this design decision?
Thursday, January 03, 2008
Caia - Design Goals & Stuff
So, what ever happened to Caia, anyway? For those of you who forgot all about it, or never read the couple of posts I mentioned it in, Caia is my pet language - one of them, anyway. I can't remember exactly why, but at one point I decided I wanted to make my own language - perhaps because I thought I could make a more logical language than natural ones. I had several design goals for Caia.
Caia was meant to be a real, usable language; that is, you should be able to convey any meaningful thought through it. Furthermore, it should be sufficiently efficient, by various definitions, to actually be used as a native language, rivaling those naturally occurring in the world. Among other measures, it should be able to convey meaning quickly enough to be practical for real-world communication.
It was intended to be both simple (as much as a usable language can be) and intuitive. Natural languages are notorious for the amount of content that is either illogical or needlessly complex - some things you simply cannot reason out if you don't already know them; they must simply be memorized. Our brains allow us to effortlessly learn even such illogical and complex languages as children, but it can be painful to learn some languages as second languages, or to process languages with a computer. I had hoped to create a language that is much easier to learn, more subject to reasoning - as opposed to memorization - and perhaps can even be analyzed by a computer.
Similarly, it should be strongly structured, in both form and meaning. In English, there are plenty of examples of ambiguity - things such as "The man shot the soldier with the gun" - but other languages can be even worse. For example, it's common to completely omit the subject of sentences in Japanese, to save the time used to specify the subject; e.g. you'd probably say "mise ni itta" ("went to the store") instead of "watashi wa mise ni itta" ("I went to the store"), unless the previous sentences made it very likely that, if the subject were not specified, you would be talking about somebody else (random fact: this type of thing is actually appearing in English these days, especially in instant messaging and text messaging; another random fact: the common Japanese exclamation "kawaii!" is a complete sentence omitting the subject, meaning "[it] is cute"). And even the (typical) manner of specifying the subject in Japanese, when the subject is specified at all, is often ambiguous; the exact same sentence in Japanese ("[watashi wa] maika desu") could mean both "[I] am a squid" and "[I] would like the squid" (think of ordering at a restaurant). A third example would be Japanese adverbs; for example (take this one with a grain of salt, as I'm not certain my understanding of the grammar in this particular example is precisely correct), "sugoku futoi", which could mean both "is incredibly fat" and "is incredible and fat". I wanted to create a language more precise than English, and dramatically more precise than loose languages like Japanese.
Everything satisfying these rules was left for me to do as I pleased, in one way or another. In some cases, I chose what I thought would be optimal. For example, I thought it made the most sense to structure a language such that the more important words always come before the less important words; the result of this is the verb-subject-object word order and right-branching pattern seen in Caia. Similarly, I chose to make it a primarily analytic, isolating language, to make it easier for computers to parse, yet use agglutination for some things, to significantly reduce the time to convey meaning in some specific areas.
In other cases, there was no 'optimal' answer, and I merely did what I wanted. For example, the sound of Caia ("kai'ja" in IPA notation) is purely aesthetic - it was simply what I think would be nice to hear. For an example of a 'sentence' in Caia (it doesn't actually have any meaning, as I haven't defined the vocabulary of Caia, yet): "Vaga ran mezh kana sit" ("va'ga ɾan mɛʒ ka'na sit" in IPA notation). You might notice that it sounds a lot like Latin; this is thoroughly unsurprising, as I think Latin sounds both cool and moderately pretty. On the other hand, there are some additional sounds I think are pretty that don't occur in Latin, which result in it having a bit of a middle-eastern or Indian sound (or at least of stereotypes of them).
However, the freedom in creating such a language is less than you might suspect. Some of the constraints listed are very harsh, and significantly limit what I can do with it. The time efficiency constraint has been a particularly sticky point, especially given that English is my native language. It's surprisingly difficult to create a language as time-efficient as English, as English uses a number of tricks - things such as complex syllables and ablaut - to achieve quite impressive efficiency, at the cost of other things, such as complexity. A large portion of the total time I've spent thinking about Caia has gone into figuring out how to minimize the time it takes to say things, while still retaining the relative simplicity, elegance, and strong structure desired.
So, where am I going with this? Well, that'll have to wait for next post (assuming I don't lose interest in this train of thought before then), as this post is pretty long, already.
Friday, December 28, 2007
Musings on the MoPaQ Format
So, Q survived finals and term papers, though not unscathed. Now for something completely different, and pseudo-random.
A couple weeks ago, BahamutZero and I were talking about the MPQ file format. I can't remember exactly what we were talking about, but he mentioned that it didn't seem all that good. It occurred to me that while I'm probably the most knowledgeable person on the format outside Blizzard (excluding, of course, the author of the format, who no longer works for Blizzard), I'd never really formed an opinion on how good or bad it was in general. That gave me the idea to write a critique of the format - at least, of the more interesting things in the format - on my blog.
Hashing
One of the more distinctive features of MPQs is the way in which files are located within the archives. The standard method of finding files in databases and file systems is the B-tree or B+-tree. For databases, there is one tree for each index in each table; for file systems, each directory has its own tree. This allows for efficient searching (O(log N) disk accesses per lookup) and the ability to keep only part of the index in memory at any time. However, as archive formats are generally more concerned with optimizing size than speed (ACE being the extreme example), they may use something simpler, such as a sorted or unsorted array of filenames (and if unsorted, the filenames can simply be stored along with the file data itself).
The MPQ format, however, opted for something more industrial-strength: hashing. Each MPQ archive has a single large hash table (an O(1) structure), containing entries for all (readable) files in the archive. Files are identified by two 32-bit hashes of the filename, a language code, and a platform code (what exactly the platform code is remains to be seen, as it's still unused); the index into the table is derived from a third hash of the filename. Note that the filename itself is stored nowhere.
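To make that a bit more concrete, here's a minimal sketch in C++ of what a lookup in such a table looks like. This is not the actual MPQ algorithm - the hash function, field names, and empty-slot marker are all stand-ins of my own - just the general shape of the scheme:

#include <cstdint>
#include <string>
#include <vector>

// Stand-in hash: the real format uses its own algorithm; any decent
// seeded 32-bit hash illustrates the idea.
uint32_t HashWithSeed(const std::string& name, uint32_t seed) {
    uint32_t h = 2166136261u ^ seed;  // FNV-1a style
    for (unsigned char c : name) { h ^= c; h *= 16777619u; }
    return h;
}

struct HashEntry {
    uint32_t hashA;       // identity hash #1 - the filename itself is never stored
    uint32_t hashB;       // identity hash #2
    uint16_t language;    // language code
    uint16_t platform;    // platform code (unused in practice)
    uint32_t blockIndex;  // index into the block/file table
};

const uint32_t kEmptySlot = 0xFFFFFFFF;  // hypothetical marker for an unused slot

// O(1) expected lookup: start at the slot chosen by the index hash and
// probe linearly, comparing the two identity hashes and the language code.
int64_t FindFile(const std::vector<HashEntry>& table,
                 const std::string& name, uint16_t language) {
    const uint32_t size = static_cast<uint32_t>(table.size());
    const uint32_t a = HashWithSeed(name, 1);
    const uint32_t b = HashWithSeed(name, 2);
    const uint32_t start = HashWithSeed(name, 0) % size;
    for (uint32_t i = 0; i < size; ++i) {
        const HashEntry& e = table[(start + i) % size];
        if (e.blockIndex == kEmptySlot) return -1;  // empty slot: no such file
        if (e.hashA == a && e.hashB == b && e.language == language)
            return e.blockIndex;
    }
    return -1;  // probed the whole table without a match
}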
You really can't do any better than this for the original purpose MPQs were designed for: to store read-only game data in a compact form that is both fast to access and fast to load. The use of fixed-size hashes (rather than the actual filenames) makes the size of archive data in memory optimal, and the hash table structure makes file lookups optimal. The format of the hash table (specifically, how unused entries are encoded) means that the size on disc could be further reduced by compressing the hash table (it isn't compressed on disc); presumably this was not done to make opening archives faster. Apart from that, the hash table has no real weaknesses for the originally intended purpose.
However, the hash table proves to be a pain for modders because of the lack of absolute filenames in the archives. This apparently became a pain for Blizzard as well, indicated by the fact that in Starcraft: Brood War (the third game/expansion to use MPQs), Blizzard added a separate mechanism not integrated with the hash table for storing the names of files in the archive.
These two things taken together, however, suggest one way in which the MPQ format could have been improved (besides a minor alteration that would have allowed the hash table to be resized; presumably Blizzard didn't see a need for this ability when creating the MPQ format). Rather than including the hash table in the archive format itself, Blizzard could have made the hash table a global, separate entity. That is, store the filenames in some simple structure in the archives, and have the installer/patcher code create a global hash table (containing files of all archives) which would be used by the archive accessing code; this could even be specialized, such as only indexing files of the correct language. This would reduce the in-memory overhead (although increase the total size on disc) as well as potentially decrease the search time for finding a file in multiple archives. It has the added benefit of removing that complexity from the format itself, and moving it to support code. If memory serves, Neverwinter Nights took this approach.
Language/Platform Codes
Another interesting design decision in MPQs was the choice of including support for different languages/platforms in the archive format itself. As briefly mentioned previously, files in MPQs are identified, rather than by filenames, by three hashes derived from the filename, a language code, and a platform code. When a game runs, it sets the language and platform it is currently running under. The MPQ access code then looks through the archives for a file matching that description. If no such file exists, it performs a second search, using the same hashes but the default language and platform codes (in practice, files identified by these codes are generally American English files).
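In code, the fallback is just a second probe; continuing the hypothetical sketch from the hashing section (FindFile and HashEntry from above; the neutral code value is purely illustrative):

const uint16_t kNeutralLanguage = 0;  // illustrative default code

// Two-pass lookup: the requested language first, then the default.
int64_t FindFileWithFallback(const std::vector<HashEntry>& table,
                             const std::string& name, uint16_t language) {
    int64_t block = FindFile(table, name, language);
    if (block < 0 && language != kNeutralLanguage)
        block = FindFile(table, name, kNeutralLanguage);  // second search
    return block;
}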
While it's nice that Blizzard put some thought into internationalization, I tend to think this was something that could have been better implemented outside the archive format itself. It would have been just as easy to use a prefix or suffix on the filename for this purpose (e.g. "platform\language\filename" or "filename.platform.language"). Both of these could have been implemented completely in code.
Alternatively, had they used a global index, they could have simply included only files of the proper language/platform (or the neutral language/platform, if no specific version exists) in the index when it is built by the installer/patcher, without the need to modify the filenames at all.
Separate Hash and Block Tables
One detail about the hash table not previously mentioned is that it stores only some information about each file (specifically, only that already mentioned). Additional information, such as the file size and offset, as well as file modes and flags, is stored in the block/file table ('block' is more technically accurate, but 'file' better describes its intended purpose). Each hash table entry then stores an index into the file table.
This is something that is common in file systems - you typically have a file table, containing the basic info about each file's data, and an index (often B/B+-tree) of filenames that index into the file table. The obvious benefit of this is that the file table can be modified less frequently, and doesn't need to be completely rewritten every time there's a change in the directory structure. A second benefit is that it's possible to create hard links - multiple 'files' in multiple directories (or in the same directory with different names) that refer to the same physical file.
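For illustration, here's one plausible shape for the split; the field names and widths are my own, not the format's actual layout:

#include <cstdint>

// The block/file table holds everything about the data itself; the hash
// table (or filename index) only answers "which block is this name?".
struct BlockEntry {
    uint32_t offset;          // position of the file's data in the archive
    uint32_t compressedSize;  // size as stored
    uint32_t fileSize;        // size when decompressed
    uint32_t flags;           // compressed? encrypted? etc.
};

// Resolution is two steps: name -> blockIndex -> BlockEntry. Two names
// pointing at the same BlockEntry is precisely a hard link.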
File Data Blocking
Another feature of the MPQ format that resembles a file system more than an archive format is the way it structures file data. It's generally beneficial, from a compression ratio perspective, to compress a data set as a whole (though compression libraries typically feature a mechanism to indicate when it's necessary to read more data or write out compressed data, so that you don't have to load the entire file into memory at once). However, file systems that support compression don't do that. As file systems must read and write entire sectors of data, file data is broken into chunks of some number of sectors, compressed as a single block, then written out, padded to the nearest sector; if the compression cannot save at least one sector, the compressed data is discarded and the uncompressed data is used instead (there's no point in having to decompress something when compression doesn't even reduce the size).
However, the MPQ format also does this, despite the fact that it does not pad out disc sectors. The reason is that there's a second benefit of blocking data in this way. When a block of data is compressed as a stream (a single entity), it can generally only be read likewise - as a stream, from beginning to end. That is, you cannot seek in compressed data. As seeking is an essential function in general file systems (and similarly in game archive formats), this constraint is unacceptable. Blocking is thus used to reduce the size of the blocks that must be accessed atomically. To read from an arbitrary point, you find the block containing the position desired and decompress it.
One last point unique to the MPQ format is that when files are blocked, each block may use a different compression algorithm, indicated by a prefix byte specifying the algorithm.
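Here's a minimal runnable sketch of the arbitrary-offset read that blocking enables. The block size, the in-memory 'archive', and the identity 'decompressor' are all stand-ins; real code would dispatch on each block's prefix byte to the right decompression algorithm:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

const std::size_t kBlockSize = 4096;  // uncompressed bytes per block (illustrative)

// Stand-in storage: a "file" already split into blocks.
std::vector<std::vector<uint8_t>> gBlocks;

std::vector<uint8_t> ReadRawBlock(std::size_t blockIndex) {
    return gBlocks[blockIndex];
}

std::vector<uint8_t> DecompressBlock(const std::vector<uint8_t>& raw) {
    return raw;  // real code: dispatch on the prefix byte to a decompressor
}

// Random access into blocked data: only the blocks overlapping the
// requested range are decompressed, never the whole file.
std::vector<uint8_t> ReadAt(std::size_t offset, std::size_t count) {
    std::vector<uint8_t> result;
    while (count > 0) {
        std::vector<uint8_t> data = DecompressBlock(ReadRawBlock(offset / kBlockSize));
        std::size_t within = offset % kBlockSize;
        std::size_t n = std::min(count, data.size() - within);
        result.insert(result.end(), data.begin() + within, data.begin() + within + n);
        offset += n;
        count -= n;
    }
    return result;
}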
Single Block Files
Nevertheless, as mentioned previously, it's better from a compression (and possibly performance) standpoint to compress files as a single unit. In most cases this would work for arbitrary data (the type that might appear in an MPQ), as many file types will be read once and then kept in memory. To take advantage of this, the MPQ format has a second mode of file storage (I can't recall off the top of my head when this was added; possibly in World of Warcraft): single block compression. This is exactly what it sounds like: the entire file is treated as a single block of data - compressed in one go, and prefixed with a byte indicating the compression algorithm, just like each block normally is.
In theory (and at the level of the file format), this isn't any different from normal, non-blocked archive formats. The problem is really with the implementation (the code, to be precise). The implementor wanted an easy way to take advantage of the improved compression of not blocking data when seeking is not necessary, and so decided to use exactly the same block decompression code as for blocked files - blocks are loaded into memory whole, decrypted, and decompressed all at once. The problem is that this requires two buffers, and requires the entire compressed block to be read before anything can be decrypted or decompressed.
ADPCM Compression
As mentioned, the MPQ format supports multiple compression algorithms. Currently, the PKWare Implode, Zip Deflate, BZip2, and Huffman encoding algorithms are supported. There's also one that's more unexpected: the ADPCM algorithm. I briefly discussed ADPCM compression on the E Terra blog. It's a lossy format that compresses 16-bit audio samples down to 4 bits, giving a 4:1 compression ratio at a non-negligible loss of quality.
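To give a feel for how 16 bits become 4, here's a toy ADPCM-style encoder - emphatically not Blizzard's (or any standard) algorithm, just the core idea: quantize the delta from a running prediction, with a step size that adapts to the signal. A decoder applying the same prediction and step updates reconstructs the (lossy) approximation:

#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Each 16-bit sample becomes one 4-bit code: a sign bit plus a 3-bit
// quantized delta magnitude (4:1 before bit-packing).
std::vector<uint8_t> ToyAdpcmEncode(const std::vector<int16_t>& samples) {
    std::vector<uint8_t> out;
    int predicted = 0;
    int step = 16;
    for (int16_t s : samples) {
        int delta = s - predicted;
        int sign = (delta < 0) ? 8 : 0;
        int mag = std::min(std::abs(delta) / step, 7);
        out.push_back(static_cast<uint8_t>(sign | mag));
        predicted += (sign ? -mag : mag) * step;  // track what the decoder will see
        step = (mag >= 6) ? step * 2 : std::max(step / 2, 1);  // crude adaptation
    }
    return out;
}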
This is used for a compression mode typically referred to as "WAV compression". This mode is used for compressing sound effects and music prior to Warcraft III. The MPQ editor scans the WAV to import, and distinguishes blocks that contain only PCM data from blocks that contain anything else. Blocks containing only PCM data are compressed with ADPCM, the others are compressed with one of the lossless compression algorithms. This way, the game sees only a PCM-encoded WAV, though it had actually been converted transparently to ADPCM and back.
This is a pretty strange feature for an archive file format - at least, as a basic compression type - especially considering that there are better formats for this, with higher compression ratios and higher quality (MP3 comes readily to mind), that don't require altering the basic archive format.
I suspect the real reason for this design decision was historical. WAV compression first appeared in Starcraft (the same time support for different compression algorithms for each block appeared, by the way). Diablo, the only game to use MPQs before that, used uncompressed WAVs, played through several Storm.dll APIs. I suspect ADPCM was used to allow the existing Storm.dll streaming code to continue to work using the new compression algorithm. Though again, it's not a particularly good solution, I think.
Extended Attributes
The last feature worth noting is a system for adding new metadata attributes, not included in the original format, to files in MPQ archives; I've dubbed these extended attributes. They are stored in a file in the archive (conveniently named "(attributes)") consisting of parallel arrays, one array per attribute, each containing one entry for every entry in the block table. The extended attributes in use at present (as of World of Warcraft) are CRC32, FILETIME, and MD5; an archive may contain any combination of these. Oddly, when Storm loads these extended attributes, it merges them with the rest of the fields in the block table, creating a single large table in memory.
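In data-structure terms, the layout is something like the following; the names and widths are my guesses at the shape, not the actual file format:

#include <array>
#include <cstdint>
#include <vector>

// "(attributes)": parallel arrays, each with one entry per block table
// entry, plus some indication of which arrays are actually present.
struct ExtendedAttributes {
    uint32_t presentFlags;                     // bitmask: which arrays follow
    std::vector<uint32_t> crc32;               // one CRC32 per block
    std::vector<uint64_t> filetime;            // one FILETIME per block
    std::vector<std::array<uint8_t, 16>> md5;  // one MD5 digest per block
};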
This design decision has both advantages and disadvantages. Storing these in parallel arrays makes it, in theory, efficient to store, load, and access them when present, and to ignore them when absent; however, the implementation peculiarity just mentioned makes them slower to load, with the benefit of slightly faster access (as the code doesn't have to check whether each attribute is present every time it's accessed).
The first disadvantage is that, as extended attributes are stored in arrays parallel to the block table, it isn't possible to exclude entries for files that don't need a given attribute, requiring more space when only some files need certain attributes. I suppose this is only an issue for the tables when loaded in memory, since the attributes file is compressed in the archive; unused entries can (and have been observed to) be zeroed, allowing them to take almost no disk space after compression.
The other disadvantage is that this system is not readily extensible. Because new attributes must be known by the MPQ library (for size calculations when processing the attributes file), adding new attributes is cumbersome, as the MPQ library must be updated to support them. This is made worse by the fact that the code that merges the attributes into the block table must be updated correspondingly. It also doesn't allow for custom metadata at all.
Sunday, December 09, 2007
Random Fact of the Day
Q is officially starting to panic.
Stuff to do this week (last week of school):
- Finish the class version of E Terra
- Start/finish video of something for school
- Start/finish term paper
- Study for and take 4 finals
Friday, November 30, 2007
Exercise for the Reader
Okay, here's one you can apply to your real, daily lives (well, some of you, anyway). When does throwing a stronger punch/kick (something along those lines) require using less strength/energy than throwing a weaker one?
No, there's no esoteric Zen metaphysics at work, here. Nothing but classical European physics.
Edit: Also, English is a bit ambiguous for the question. I'm asking: when is it necessary to punch weaker in order to punch stronger?
Thursday, November 29, 2007
Solution
The basic steps used in the link to find the distance between P3 and P are:
1. Calculate the distance between P1 and P
2. Calculate the exact location of P
3. Calculate the distance between P and P3
Now, a little common sense suggests there are two too many steps here. But we can't compute the distance between P and P3 directly, because at this point the location of P is unknown.
Yet there's a very simple solution: we rotate P2 90 degrees around P1 to create P2', and let P' be the projection of P3 onto the new line through P1 and P2'. By construction, the distance between P1 and P' is the same as the distance between P3 and P. The calculation has exactly the same form as the version in the link, but it finds the distance between P1 and P' in the first step, instead of the distance between P1 and P. Thus we've eliminated steps 2 and 3 entirely, using exactly the same math as before:
u' = ( (x3 - x1)(x2' - x1) + (y3 - y1)(y2' - y1) ) / ( ||p2' - p1||^2 )
A visual proof: [image not reproduced]
While understanding the way the distance is calculated using the dot product requires knowledge of calculus/linear algebra, deriving this improved equation from the original in the link is just trivial high school geometry. That's why it's neat.
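And for the skeptical, a quick numeric check (the names and test points are mine): both routes give the same answer, but the second never locates P:

#include <cmath>
#include <cstdio>

struct Pt { double x, y; };

// Classic route: find the projection P of P3 onto the line P1-P2,
// then measure |P - P3|.
double DistViaProjection(Pt p1, Pt p2, Pt p3) {
    double dx = p2.x - p1.x, dy = p2.y - p1.y;
    double len2 = dx * dx + dy * dy;
    double u = ((p3.x - p1.x) * dx + (p3.y - p1.y) * dy) / len2;
    Pt p = { p1.x + u * dx, p1.y + u * dy };  // the projected point P
    return std::hypot(p.x - p3.x, p.y - p3.y);
}

// Rotated route: replace P2 with P2' (P2 rotated 90 degrees about P1,
// i.e. (dx,dy) -> (-dy,dx)); the same dot-product formula then yields
// the distance directly, as a multiple of the segment length.
double DistViaRotation(Pt p1, Pt p2, Pt p3) {
    double dx = p2.x - p1.x, dy = p2.y - p1.y;
    double len2 = dx * dx + dy * dy;
    double u = ((p3.x - p1.x) * -dy + (p3.y - p1.y) * dx) / len2;
    return std::fabs(u) * std::sqrt(len2);
}

int main() {
    Pt p1{0, 0}, p2{4, 0}, p3{1, 3};
    std::printf("%f %f\n", DistViaProjection(p1, p2, p3),
                DistViaRotation(p1, p2, p3));  // prints 3.000000 twice
    return 0;
}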
Sucky Test Teacher Strikes Again
So, just got back the graded midterms in networking class (theoretical stuff), taught by the same teacher. Not surprisingly, it was disappointing. As before, this guy could teach a class on how not to write (and grade) tests. Among smaller gripes were two main things:
1. One question (in particular) was taken straight out of the book, and was fairly simple. What could go wrong? Well, giving the same answer to the problem as in the book got you a wrong answer on the test (and no, it wasn't an essay question where theoretically he could expect more detail).
2. Throughout the semester (there's one week left), whenever math was used, it was exclusively high-school-level math (basic algebra and such; there was one area under the curve problem, but as the "curve" was a line, it could be calculated with basic algebra). Neither the book, the lectures, nor any homework has gone beyond that. So, what does he do? He puts a calculus question on the test, and makes it worth 25% of the test (note that calculus is not a prerequisite for this class; I wonder if he could get in trouble with the dean for this).
Ultimately, Q got an A on the test (as always, although it's always hard to believe in classes with this teacher). I don't know what the exact ranges for grades are, but I talked to one person who got an A- for 58%, and rumor has it that one person got a C for 18%.
Which brings me back to what I said in the previous post about this guy: CURVING THE SCORE DOES NOT MAKE UP FOR A HORRIBLE TEST.
Wednesday, November 28, 2007
Exercise for the Reader
Oh no, not another one of Q's easy but slightly tricky quiz questions. Today it's regarding finding the shortest distance between a point and a line. This explains the theory behind it, and how to calculate it.
But if you are only interested in the distance between P and P3 itself, and you don't care what P is, this math is moderately wasteful, as you have to first find P, then calculate the distance between P and P3. Surely there must be a way to calculate the distance without ever calculating P. And in fact, there is.
Find it, without cheating (looking up the answer online or through friends). It's actually pretty easy and simple. I just thought it was kind of neat.
Monday, November 05, 2007
& Insanity
*evil, maniacal laughter*
No, that's not me starting to panic about E Terra and deadlines (although I am starting to panic about E Terra and deadlines, as indicated by my physical stress symptoms that are starting to appear). Let's just say that I'm very pleased with my class schedule and related things, and gloating without being able to say exactly why.
Happy, happy months ahead (and at least one very stressful month)!
Friday, November 02, 2007
World Without Windows
Okay, so that title is a bit misleading. Anyway, this post hopes to provide some meaningful answers to the question: what would the world be like if the overwhelmingly dominant operating system were secure in ways that Windows is not? For the purposes of this discussion, I'm defining "secure" by several criteria:
1. All users run as limited users - they can't do administrative tasks or screw with the OS without explicitly logging on as admin or running a program as administrator (e.g. Windows run as or Unix sudo)
2. The system is fully isolating with respect to users - one user may not access another user's data without explicit permission
3. There are no privilege escalation exploits in the OS - tricks that limited users could use to gain administrator privilege without having to enter the administrator password
4. There are no remote exploits in the OS itself - in the kernel, standard drivers, basic services, etc.
So, we have this idealized, nonexistent operating system; let's call it Qunix. How exactly, then, would the world look if Qunix had 95% market share? Would this be, as the average Slashdotter seems to believe, a secure and malware-free utopia, where nobody knows what viruses, worms, spyware, or security breaches are, because they don't exist?
The answer, actually, is somewhat depressing: the world would look pretty similar to how it looks right now. Malware and security breaches would still be prominent, the security industry (anti-malware products) would still be big business, and the black hat industry would have similar job security. Granted, the nature of malware would be different, but that would not make it any less prolific or dangerous.
Ultimately, those four criteria I specified have one intended goal: to put everything the user does in a sandbox, where it can't harm the OS or other users (this was how Windows NT was originally envisioned, but time has proved that hope misplaced). Let's assume, for the moment, that these measures achieve that goal (we'll come back to why they don't, later). With this assumption, it becomes impossible for a piece of malware (or a hacker exploiting a buffer overflow, or some such) to invade the kernel, either to destroy the system or to merely hide its existence from the user and malware scanners (a rootkit, in other words).
Unfortunately, while there's no denying that this would make the lives of evil-doers harder, this is anything but the doom of malware/security breaches. Even without the ability to harm the OS itself, a piece of malware could still damage that user's data, and data is often more valuable than the computer it resides on.
Furthermore, the ability to invade the kernel is no requirement for a virulent piece of malware. While hiding is more difficult, creating a virus/worm/etc. that runs entirely in user mode is completely viable. Macro viruses, worms that spread through chat programs, and old-fashioned viruses that spread from a disk/e-mail to the computer and back would still be viable and common (although, amusingly, Windows is more resistant to this last type of virus than Linux). There would still inevitably be security holes in third party applications allowing an attacker to get a foothold in the computer and execute code under the user's privileges, and the user could still get (their data) owned, without the attacker ever invading the kernel.
Thus, the necessity of anti-malware products would remain. Now, it would be reasonable to assume that anti-malware products would run with administrative privileges. However, this advantage of privilege would only make life more difficult for malware authors. While it would make it impossible to completely hide from a scanner running at higher privilege, there are many ways of obfuscating, evolving, and encrypting a piece of malware such that it is not readily recognizable by a malware scanner.
Clearly this could be overcome by the malware scanner being updated to respond to a new threat... but that's exactly how the world works right now: anti-malware programs must be kept up to date, or they will not be able to protect against everything that has been analyzed (not to mention the time between when a piece of malware is released into the wild and protection is added to anti-malware products). Consequently, malware analysis labs would still be working frantically, and companies would still have support contracts with anti-malware companies to keep their computers perpetually updated with the latest malware protection.
Now, let's make one final invalid assumption, for the sake of argument: that through a combination of various methods, such as security cookies, data execution prevention, and other manners of code hardening, it's impossible for an attacker to penetrate an application running on the computer (e.g. code injection into a web server, an office application executing code in a document, etc.). That leaves one final mode of attack, one which has been used for decades with incredible success, and one which all of the aforementioned measures combined can't stop: PEBKAC; that is, user naivety.
Even if you could stop all remote and automated methods of invading a system, it will always be trivial to trick the user into running something that is actually malware. This fact nullifies every one of the defense measures proposed previously. Even if a user cannot be attacked other ways, an executed program could wipe all their data. Even if a user only runs as an administrator to install new programs/drivers and perform administrative tasks, an executed "installer" could wipe the data of all other users, and an installed "driver" could install a rootkit for future or immediate use. Similarly, even an air-gapped computer (one which has no network connection at all) still remains susceptible to infection (remember, viruses were rampant on air-gapped computers long before networks or the internet entered the average home/business).
To give you an idea how easily malware can spread relying only on tricking users into manually running it, you only need to take a brief look at the Storm worm. While this worm has been revised and updated extensively over its life, it began as a humble executable that was e-mailed to people; when run, it infected the computer. This worm is now considered to compose the largest botnet in history.
Thursday, November 01, 2007
It's That Time Again
Time to register for spring classes.
So, what's on the plate this semester? After talking to the adviser, it looks like I have exactly 8 classes (3 units each) needed to graduate (I've already finished my biology major, including GE courses, so all 8 are in computer science - third year and fourth year courses). Some of these courses are mandatory, either because they're directly required by the major, or they're required as prerequisites for courses I absolutely want to take. The ones I need specifically, along with their description from the school catalog:
Programming Languages and Translation
Introduce both basic concepts of programming languages and principles of translation. The topics include the history of programming languages and various programming paradigms, language design issues and criteria, developing practical translator for modern programming languages.
Software Engineering
Basic concepts, principles, methods, techniques and practices of software engineering. All aspects of software engineering fields will be covered briefly. Software engineering tools are recommended to use.
Artificial Intelligence
Use of computers to simulate human intelligence. Topics include production systems, pattern recognition, problem solving, searching game trees, knowledge representation, and logical reasoning. Programming in AI environments.
Principles of Computer Graphics
Examination and analysis of computer graphics; software structures, display processor organization, graphical input/output devices, display files. Algorithmic techniques for clipping, windowing, character generation and viewpoint transformation.
Advanced Game Programming
Intermediate and advanced game programming techniques including 3D game development, realtime rendering, physic simulation, etc.
Game Development Project [thesisish thingy]
Individual or team develops realistic games based on the theories and techniques, present and demonstrate their work regularly.
I definitely need to take advanced game programming and computer graphics this semester. The other two slots are open. I'm thinking of taking AI, since that would be useful for E Terra. Unfortunately, that and compilers are at the same time (so are mutually exclusive), and they would both be used for E Terra AI :P I'm hoping to be able to use E Terra for the game project, but I won't be able to take that until the fall.
Besides those, I have a couple decisions to make. I have to take one programming language other than C++ - either Visual Basic, Java, or C#. VB is definitely out of the running, but I'm not sure whether Java or C# would be better. I'm learning some C# this semester because we use it in the game programming class (with XNA), but other than that I don't have much to go on.
Finally, if my petition to drop one class (not mentioned here), which the teacher says is unnecessary, is accepted, I'll need one more upper-division class to make up the units. I'm not too sure which to take for that one. Here are the most appealing prospects, although none of them is something I'd be inclined to take if I didn't have to:
UNIX and Open Source Systems
Introduces the UNIX operating systems, various open source applications and systems, open source programming languages, and open source software development techniques.
Data Security and Encryption Techniques
System security and encryption. Current issues in security, encryption and privacy of computer based systems.
Advanced Operating Systems
The course covers internal structures of a modern operating system. The specific topics include processing, process communication, file systems, networking, and the I/O system. There are several programming assignments which include system calls, and other low level interfaces.
Web Programming and Data Management
Various techniques for developing Web-based database applications using software engineering methodology. Introduce concept and architecture of Web servers, Web database design techniques, client/server side programming, and Web application tools and techniques.
Or, I suppose there's always...
Independent Study
Special topic in Computer Science selected in consultation with and completed under the supervision of instructor.
Internship in Computer Science
Practical experience and service learning relevant to computer science in industry or organizations. Written and oral reports are required.
Hmmmm. Are you thinking what I'm thinking, Pinky?