Search This Blog

Thursday, February 25, 2010

Q's Squishy Thing of the Day

Meet ShadyURL. It encodes URLs in things that look like spam, porn, malware, and cults/racism. Some of the things I've linked to various people in the last day: (best URL so far)

And, just for fun, here's the Copyright Alliance blog:

And this blog:, which is awesome, because I'm actually 3/4 German

And yes, I have been made aware that I'm late to the party on this one.

Wednesday, February 24, 2010

& Content Filtering

Following a brief scuffle with Patrick Ross in the comments of the Copyright Alliance blog, I thought the topic I briefly discussed deserved elaboration: content filtering. That is, the analysis and identification of illicit content packets as they pass through a router, based on hashing (not so effective) or fingerprinting (more effective), usually followed by dropping the packet.

I'll discuss the full range of content filtering, though one thing at a time.

First of all, the specific type I referred to previously: filtering at the internet service provider level. Imagine an ISP that is being legally pressured to do something about file-sharing by copyright industry representatives, a situation hundreds of ISPs around the world find themselves in at this very moment.

Now, Mr. Sales Rep, from another company that makes deep packet inspection hardware, offers to sell a product to Mr. Business Suit from the ISP, that will solve the file-sharing problem, as many sales reps are currently doing. This company's tests show that the product is 95% effective at identifying and blocking traffic containing illicit copyrighted material, and has a low rate of false-positives. Naturally, Mr. Business Suit at the ISP looks at this product, and sees an amazing solution to all of their problems. Mr. Engineer at the same ISP looks at the product and sees a million-dollar paperweight, and tells Mr. Sales Rep to get out of his office. Nevertheless, convinced by Mr. Sales Rep, Mr. Business Suit purchases the product, and has Mr. Engineer install it.

Now the product has been installed, and everybody watches eagerly, as Mr. Engineer turns the new product on. Immediately the product begins logging transfers of copyrighted content by the thousands, and successfully blocks them. Yet Mr. Engineer looks at his network statistics and sees that not only is the product having 0 effect on the amount of internet traffic, there is still just as much illicit content being successfully uploaded by users of the ISP.

What could possibly have gone wrong? And why were the appraisals of the product so drastically different between Mr. Business Suit and Mr. Engineer, to begin with? Is the product defective? Did Mr. Sales Rep lie?

Well, not exactly.

What happened is that the ISP's users adapted effortlessly to the new piece of filtering hardware. While it's certainly viable, if implemented competently, to detect and block things like copyrighted content, this is only possible if you have access to the data being transmitted. The universal Achilles heel of such identification algorithms is encryption.

Modern file-sharing software supports end-to-end encryption - the same kind used to secure credit card transactions online: the uploader encrypts the data, the downloader decrypts it, and nothing in the middle can access the data in between the two, because nothing else has the encryption key. This "nothing" includes that million-dollar product our ISP just bought.

Now, this encryption is not a technology that needs to be developed, nor does it need to be downloaded and installed by the user. It's already there. If a user is able to share files through a P2P application, the encryption code is already in that P2P application; it needs only to be enabled by a user clicking a check box. And, of course, you can be certain that it will be turned on by default in future versions of said software if content filtering by ISPs becomes common.

In other words, each of those "blocked" uploads the product registers is merely the first of two attempts. A blocked upload is merely an upload that will succeed seconds later, after the user clicks the box to enable encryption (though if content filtering is widely deployed, users won't even need to do that). Thus, to make a long story short, while you have successfully prevented file-sharers from uploading unencrypted illicit content, you haven't actually prevented a single copyright infringement.

This is a theoretical problem, not an implementation issue. As such, there is no basis to hope that this is a limitation that will ever be overcome in the future.

But look on the bright side: Mr. Sales Rep got a nice commission off that million bucks the ISP paid his company, and as he technically never lied, the ISP has no legal recourse to argue fraudulent advertising.

However, the fact that ISP-level filtering is a technological dead-end should not be taken to mean that all filtering technology is useless. As stated, filtering technology can be effective, given that it has access to the data. Of course, the hackers of the world will continually work to find new ways to evade such filtering algorithms, but it should still be possible to successfully filter enough to justify the cost of the filtering hardware/software.

One example where this works to a satisfactory degree, both in theory and in practice, is YouTube. Because YouTube actually processes and decodes the content uploaded to it, it's impossible for it to not have access to the data - it couldn't function otherwise. As such, it always has access to the full, unencrypted content uploaded, at which point filtering of that content is possible, and in fact is already being performed.

Dumb file storage sites - sites like RapidShare - which store data without any regard to what type of data it is, fall somewhere in the middle. As they do not require access to the data itself, encryption is entirely possible, and would indeed be capable of evading any filtering of uploaded content done on the part of the site. However, use of this type of encryption would be much more of an inconvenience than is the case with encryption in P2P programs; in this case, encryption must be done manually, by the user, through a completely separate program (almost anything that can make ZIP files can encrypt them, for instance), and the encryption key must be distributed through other channels, such as forums that link to the encrypted file. As such, while filtering at the level of such sites will certainly not prevent such encrypted transmission of content (nor probably even a majority of total transmission), it's possible that filtering systems might reduce sharing of illicit content by some sufficiently valuable fraction by means of sheer annoyance.

Wednesday, February 10, 2010

Die! *smash*

I have squashed the mystery.

I did some chatting with Merlin (never mind who that is, just that he knows some about the electrical engineering of computers, as well as programming). He gave me a couple of theoretical possibilities what the problem could be, some of which could be tested, others were simply components on the motherboard that I had no way to individually test.

One thing was the power supply. According to him, it's possible that when data is being sent from the memory to the controller (in 8 blocks of 64 bits), if the power supply is flaky the voltage could droop over time, which could hypothetically explain why the bit only fails only once in each cache line even if the problem is in one of those 64 data lines. However, this was easily disproved, as I do have other power supplies.

More importantly, while talking to him I thought of my grandma's computer (my computer prior to 2001), which had the same type of memory. While this computer is too old to support 512 meg DIMMs (this was why I couldn't just use it to verify the DIMMs worked and be done with it), it did have some smaller DIMMs in it (256 meg). See where this is going?

Now I had more than two DIMMs, and with them I was able to demonstrate that the same bit failure occurred with any combination of two DIMMs (although the frequency of the error did vary some depending on the pair used). This proved conclusively that the DIMMs themselves were not responsible, and the problem had to reside in the common element - the motherboard or the CPU. This appears to be a problem that only shows up when both DIMM slots were full.

Now, it's still possible that it could be a software (BIOS) problem that could be fixed by updating the BIOS, but I don't care to try that, for the reason I mentioned previously.

...Now What?

A few weeks ago my parents' desktop died. This was my old Athlon XP computer I had before I bought the one I have now (about a year and a half ago). A bit of trivial diagnostics demonstrated that either the CPU or motherboard had died, and I didn't have the parts necessary to determine which. So, my dad went looking on eBay for a replacement motherboard and CPU to go with the existing memory, as they didn't really have the money to buy a new computer right now (actually, I was the one who did most of the looking at the specs, to make sure it would work).

A few days ago, the motherboard+CPU arrived, and I quickly set to determining whether or not they worked, as the amount of time to return it was limited. As my past experiences have made me a little paranoid in that respect, naturally the first thing I did was run MemTest 86 on it.

While it seemed to work for the most part, about once every pass (took a couple hours) on test #7, an error would show up. After leaving it running overnight, some patterns emerged. The error was always the 0x02000000 bit getting set when it should be cleared, and it usually occurred when the value written to the memory was 0xFDxxxxxx (less commonly, 0xF9xxxxxx). The base address of the error seemed to always be at 0x3C, 0x7C, 0xBC, or 0xFC. The errors also appeared to cluster in the low 512 megs of the memory space.

In short, it appeared to be 1 bit every 64 bytes, where the containing byte changes from 11111x01 to 11111x11.

Separately, neither of the two 512-meg DIMMs shows this problem, regardless of which of the two sockets the DIMM is in (as tested by running MemTest 86 overnight, which would be expected to produce about a dozen errors). However, when put in in one order, this error occurs. Peculiarly, when put in the opposite order, MemTest begins spewing out memory errors before soon crashing (presumably due to memory errors in the locations the program is at).

So, where is the problem? Well, I wish I knew the answer to that.

The fact that both DIMMs worked correctly when alone suggests that the DIMMs are good (these are the same ones that were used in the computer that failed, and to my knowledge there was nothing wrong with them then). The fact that each DIMM works in both sockets suggests that it's not a bad socket. The fact that it's every 64 bytes tells me that it's not one of the data lines on the DIMMs or on the motherboard, as the data width is 64 bits (64 pins). The fact that the error period isn't into the kilobytes indicates that it's not a CPU cache failure, as I saw in the past.

64 bytes is the size of the cache line for the CPU, which seems to hint that it's somewhere after where the 64-bit memory reads are assembled into whole cache lines; unfortunately, I don't know where exactly this occurs, to narrow down the possibilities. Yet the fact that it only seems to occur in the lowest 512 megs of the memory space is very suspicious, and seems to suggest something with one of the DIMMs, sockets, or data paths on the motherboard, which appear to already be ruled out from earlier tests.

So, I'm really at a loss as to what to make of this and what to do next, and I've got three days left to do it. I can't test the memory on another computer, because I don't have another computer to test it on. The BIOS could use updating, which could theoretically fix the problem, but I've been reluctant to do that because it's a trivial way to say "You broke it, I'm not giving you a refund!"

I suppose I need to try mixing and matching the CPUs and motherboards. I need to do that anyway, to find out exactly what broke on the old computer, but it should also tell me whether the problem is in the CPU or motherboard (though nothing more specific than that).

Saturday, February 06, 2010

Let's Do Some Math!

There are a couple common equations of importance for what I'm working on at the moment. Both of them involve the idea of a set of items, where each item has a probability of being chosen (and more than one may be chosen in some cases). It's trivial to go item by item, draw a random number, and see if that item is chosen, but this is obviously inefficient. I wanted equations that would allow me to stick in a single random number and it will tell me the first relative item to be chosen, with a fixed number of operations.

The first equation is based on each item in the set having an equal probability of being chosen (note that more than one may be chosen, or none may be chosen):
probability(item) = perItemProbability

The general form of this decision is (rnd <= perItemProbability), where rnd is between 0 and 1 (as is the case with all instances of rnd in this post). Trivial. The equation we want to derive, however, not quite as much.

While it's tempting to think that the probability of any of n items being chosen is (perItemProbability * n), it's not. This isn't a linear function. Rather, you have to derive it from the probability that NONE of those items will be chosen, which is exponential.

(1 - perItemProbability)^n : probability none of n items will be chosen
1 - (1 - perItemProbability)^n : probability that at least one of n items will be chosen
rnd = 1 - (1 - perItemProbability)^n : switch out probability for a random number
1 - rnd = (1 - perItemProbability)^n : arithmetic rearrangement
log(1 - rnd) = log(1 - perItemProbability) * n : take the log of both sides
log(1 - rnd) / log(1 - perItemProbability) = n : arithmetic rearrangement

This gives us what we want: the index of the first chosen item. Of course you have to watch for values that result in taking log(0). You also have to notice that the range of the result is [0, infinity]; this is correct, but if the value is greater than the upper bound of your set of items, you have to recognize that no item was chosen.

Next up is the case where the probability of each item follows a geometric series:
probability(n + 1) = probability(n) * (1/divisor)
probability(0) = p0
sum of probabilities = 1

With this series there's no misleading simple solution, so we're forced to go straight to a more complex one. We know the sum of all the probabilities is 1, but we don't know what value of p0 to use to produce that sum. We can trivially calculate this using the geometric series equation:
1 = p0 * (1 - (1/divisor)^n) / (1 - (1/divisor)) : geometric series equation
p0 = (1 - (1/divisor)) / (1 - (1/divisor)^n) : arithmetic rearrangement

Okay, that's the easy part. Now the equation itself. Here again we'll solve for n:
rnd = p0 * (1 - (1/divisor)^n) / (1 - (1/divisor)) : substitute rnd for sum of probability
rnd / p0 * (1 - (1/divisor)) = 1 - (1/divisor)^n : arithmetic rearrangement
1 - (rnd / p0) * (1 - (1/divisor)) = (1/divisor)^n : arithmetic rearrangement
log(1 - (rnd / p0) * (1 - (1/divisor))) = log(1/divisor) * n : log of both sides
n = log(1 - (rnd / p0) * (1 - (1/divisor))) / log(1/divisor) : arithmetic rearrangement

And thus the solution: a long, ugly equation to calculate something simple. Again, you have to watch for values that result in errors: divisor must be greater than 1; but that's pretty obvious: a geometric series will only sum to a finite number if (1/divisor) is less than 1. You also have to note that when rnd = 1, n = the number of items in the set, which will be out of bounds.

Isn't math fun?

Thursday, February 04, 2010

& Solid State Stuff

Pretty soon I'm going to be looking to upgrade some things on my computer, starting with the hard drive. In addition to getting a large disk to begin using for archived anime (as opposed to the large stack of DVDs I've burned), I was going to get a solid-state disk to put my Windows 7 that I won in a raffle a couple weeks ago on.

For anyone not familiar with them, SSDs have the benefit that they have 0 seek time (as opposed to about 9 ms for HDs), combined with linear read speeds several times that of HDs. For very large and contiguous files, which involve only linear access, an SSD could be 3 or 4x as fast as an HD. But for accessing large numbers of small, non-contiguous files, which involve many seeks, the difference is much, much larger. Thus SSDs can be anywhere from substantially faster to massively faster than HDs, depending on the type of access.

But what exactly do I do with the SSD when I get it? Prior to Wednesday, the answer was obvious: use it as my boot drive. That is, put Windows and frequently used programs on it, and use a separate HD to put large and infrequently used stuff on.

Today, however, there's another option, thanks to the announcement of this evil little thing: a disk controller which combines an HD and an SSD, using the SSD as a massive cache for the much larger HD. That is, instead of having the SSD serve as a dedicated boot drive, it's used to cache frequently used files from all over the HD (when talking about an 80 gig cache, "frequently" probably amounts to once every week or two).

One obvious advantage is that the superior performance of the SSD is applied not just to system files (e.g. Windows), but also to anything on the disk that gets used frequently. Similarly, you don't need to let part of the SSD go to waste if you don't have enough files to fill it up (I'm told using an SSD as your boot drive takes 30-40 gigs of SSD, while higher end SSDs go up to 256 gigs). Another big benefit is that you don't have to manually split files that are normally on the boot drive (e.g. Program Files and the Users documents folder) between the SSD and HD depending on whether you want them fast or slow; for example, I wouldn't want to put downloaded anime on my (dedicated) SSD, nor would I want my SSD bloated with the various bulky compiler intermediates.

Now, the big question is how exactly this is going to work. According to the manufacturer, it requires a driver for Windows in order to work. This suggests that caching may all be done in software, rather than hardware. This brings up the obvious question: in what cases the cache will be used? If the cache is only used within the current Windows boot (essentially serving as a gigantic system cache), this won't be very useful, at least for me (perhaps for a database server, with multiple terabytes of database, a 100+ gig cache would be substantially beneficial). I already have 8 gigs of RAM, about 4 of which (at a minimum) is available to be used by the system cache, which is a pretty decent cache for one session (and RAM is much faster than an SSD, anyway).

From the write-up, however, it sounds like the cache at least persists between Windows boots, which would mean it wouldn't quite be that bad. However, if the caching is all software-based via a Windows driver, this obviously means that it won't be of any benefit prior to loading that driver - that is, during boot up/shut down and going into/out of hibernate, which are things I was hoping to gain out of getting an SSD.

I guess we'll find out when the thing ships, in 3 weeks or so.