Q & Stuff: bugs


Wednesday, September 08, 2010

Public Service Announcement

When you call Discover card customer service, the very first thing you have to do is enter the last 4 digits of your Discover card, the last 4 digits of "your social security number", and your ZIP code. It's important to note here that "your social security number" and "your ZIP code" do NOT mean "your social security number" and "your ZIP code"; they mean the social security number and ZIP code of the primary cardholder, which may or may not be you.

Discover card apologizes for the inconvenience of refusing to even let you talk to a real person without providing something it specifically did not ask for.

Tuesday, May 25, 2010

When Size Matters

Another person (an IT professional) and I spent some time this weekend doing some volunteer work at a charity - more specifically, a homeless shelter and rehabilitation clinic, though any more detailed information isn't particularly relevant. Our task was to do "computer stuff", including basic maintenance on all their computers (those in the administration offices and those for use by residents), as well as take a look at what they had in storage.

While the computers available to residents weren't bad (about 5 years old or less), the situation in the office was more grim. Almost all of the computers were Pentium 3s and 4s, most with 256 megs of RAM and running Windows 2000 with Internet Explorer 6. That's right, Windows 2000 and IE6, which are both several years past end of life, and thus inherently insecure; oh, and did I mention that all the office staff run with admin privileges?

So, we started off with the easy stuff: chkdsk (full disk surface scan) and defrag on all of them. We intended to run virus scans on all of them, but were repeatedly thwarted by bad CD-ROM drives and stupid video cards that prevented us from using the AV boot disks I'd burned for the occasion (as I didn't trust the computers to not have rootkits), and eventually we had to resort on many computers to just running the AV from inside Windows.

While there were a few more specific problems with the computers, one universal complaint among the office staff was that the computers were anywhere from slow to extremely slow. While defrag no doubt helped a little with that (the computers were between 5 and 40% fragmented), the primary problem appeared to be something else entirely. Most of the computers had 256 megs of RAM, and on boot most of them were using anywhere from 210 to 260 megs of memory; add in 20 megs of memory for IE (I wanted to switch them to Firefox or a newer version of IE, but as both of those used significantly more memory that wasn't really an option), 30-40 megs for an Office app or two, and the observation that Windows starts disk-thrashing when you get to about 85% of physical memory used (about 220 megs), and it was clear that the computers were severely disk-thrashing.

So, memory upgrades were the obvious prescription. But that raises the question: why were the computers using so much memory to begin with? Anyone who remembers back from the day 10 years ago, including both myself and the other guy, can tell you that 256 megs should be plenty for Windows 2000 (and perhaps even XP). Given all the available evidence - that the computers ran Windows 2000 and IE6 in admin mode, our inability to do a proper virus scan on most of them, and the fact that there was a lot of unaccounted for memory (memory that was in use but not accounted for by running processes or kernel allocations) - naturally we assumed that they were severely infested with malware.

As irony would have it, the cause was actually just the opposite. On Sunday I finally got fed up with the situation, and decided to try a drastic experiment inspired by some past experience: I uninstalled the AVG Free antivirus that was on most of the computers. This experiment paid off; it turns out that AVG was in fact using a full 130 megs of memory. Without it, Windows was using about 80 megs of memory at boot, which is very much like what we remembered back from the day, when virus scanners took 30-40 megs, which would leave about 110 of the 256 megs available for running applications comfortably.

So... now what? Well, another 256 megs of memory would put an end to the problem. But in the meantime, I had a stop-gap measure: get a smaller AV. Out of my previous tests on the best of the free AVs, the best current-generation AV in this regard was Avira AntiVir Personal, weighing in at about 95 megs. So, I switched all of the computers over to that. This reduced the amount of memory in use at boot to about 180 megs, leaving about 40 megs available for comfortably running applications. While that isn't a huge amount, I'm hoping that in a case this severe it will substantially improve performance until more memory can be acquired cheaply.
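The memory budget above is easy to check with a few lines of Python. This is only a back-of-the-envelope sketch: the figures are the ones quoted in the post, the 85% thrash threshold is the rule of thumb mentioned earlier rather than a documented Windows constant, and the function name is made up for illustration.

```python
# All figures in megabytes, taken from the post.
THRASH_FRACTION = 0.85   # rule-of-thumb point where disk thrashing starts (assumption)

BASE_WINDOWS = 80        # Windows 2000 at boot with no antivirus installed
AVG_FREE = 130           # observed AVG Free footprint
AVIRA = 95               # observed Avira AntiVir Personal footprint


def headroom(physical_mb, in_use_at_boot_mb):
    """Memory left for applications before usage crosses the thrash threshold."""
    return physical_mb * THRASH_FRACTION - in_use_at_boot_mb


for label, ram, antivirus in [
    ("256 MB with AVG", 256, AVG_FREE),
    ("256 MB with Avira", 256, AVIRA),
    ("512 MB with Avira", 512, AVIRA),
]:
    free = headroom(ram, BASE_WINDOWS + antivirus)
    print(f"{label}: about {free:.0f} MB of headroom")
```

With AVG there are only a few megs left before the threshold, with Avira there are roughly 40, and a second 256 meg stick makes the question moot.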

On that topic, I actually ordered a batch of used and very cheap 256 meg sticks off eBay, and I'm hoping to be able to test them out (which will take a couple days to thoroughly test this quantity of sticks) and get them installed this weekend or next, bringing the computers up to 512 megs; assuming, of course, that these computers (most of them Dells) aren't of the extremely finicky variety that reject all but very closely-matched memory. At 512 megs, it becomes reasonable to start looking at upgrading the computers to XP (pretty sure it wouldn't be a good idea to try Vista on those computers) and Firefox, which would go a long way to improving security (and, of course, at that point it would be worth making them not run as admin).

On one last note, one thing that continues to stump both of us is how on one computer the fonts in IE are enormous. On typical web pages lines of text are often 1-1.5 inches tall, making it extremely difficult to use the web. We both independently looked at both font settings and accessibility features in both IE and Windows, but none of those proved to be the cause. Setting fonts to smallest in IE made text more reasonable, but on some pages the text was actually "smallest", making it impossibly small to read; thus this was not a sufficient solution to the problem.

Monday, August 31, 2009

& Very Old Things - Part 3

So. Thus far everything I did involved nothing more than the memory editor and cheat search. In other words, something just about anybody could do. To really say that I reverse-engineered Blaster Master, I needed something a lot bigger, and I knew just what to do: I was gonna find the cause of a 21-year-old bug. There's a well known glitch in Blaster Master that on some bosses, if you or the boss is being hit when you pause the game, you (or the boss) will continue to take damage while the game is paused. This only works on some bosses, however, leaving a big question mark as to whether it's a bug or a feature (it was especially strange that it affected both player and boss); I wanted to find out.

Unsurprisingly, it's quite a bit of work to reverse-engineer something you have absolutely no info on (e.g. in computer programs you at least know OS API calls the program makes, and can use that as a starting point), especially when you're learning both the hardware and the assembly language as you go. Ultimately, it took me an embarrassingly long time to get the job done (certainly more than it was worth), and I ended up disassembling a lot more of the game than was necessary (as is often the case when you don't have any idea what you are looking for). I'll just talk about some of the more important lines of investigation, findings, and complications.

As I didn't know anything about the game, naturally the first thing was to hit "step into" and see where I ended up. I always ended up in the same place: a busy loop that checks a single value. Given the nature of this and some idea about the hardware, I reasoned that this was a loop to wait for the next vertical sync, to start on the next frame; this was confirmed by observing that address being written to from the non-maskable interrupt (vsync) handler.

From there I stepped out of the function and took a look around, writing down the various functions called after vsync. I then went about refining the list and detailing more levels of the call tree, ultimately (after all of my disassembly) resulting in the call tree and other related notes below (note the comments have a focus on things relevant to finding this pause glitch). One thing of particular importance is that there's an object-oriented handler for each object type, and the appropriate handler is called for each non-empty object table entry.

$E936 waits for vertical blank to occur
$C971: for each object calls $C8DF, decreases hit timer if nonzero, calls $C9A4, and calls $C928
-$C8DF copies object data from $400+$56 to $46 ($C925 writes life)
-$C9A4 sets hit timer back to full and decreases life in the object buffer when hit with hit timer at 0 after decrement
--$EA3A
--$EB51 saves the object-specific handler to $7A (LE)
--jumps to the object-specific handler at $C9D3
-$C928 copies object data from $46 to $400+$56
$CA4B->$EA3A->$E61B->$E63C/$EB98 unmaps $8DF6 page [the page the object handlers are in], which gets mapped by the NMI handler

$EB7E NMI handler (does not branch)
$EB97 IRQ/BRK handler (stub)
$F7: controller 1 state (bits in reverse order)
$F8: controller 2 state (bits in reverse order)
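As a rough model of the per-frame loop in the call tree above (copy the object into a working buffer, tick the hit timer, run the type-specific handler, copy the result back), here is a small Python sketch. The dictionary fields, the handler table keyed by object type, and the function names are all invented for illustration; only the order of the steps follows the notes.

```python
# Hypothetical model of the per-frame object loop; not the game's actual code.

def run_frame(object_table, handlers):
    """Run one frame of updates over every non-empty object table entry."""
    for slot, stored in enumerate(object_table):
        if stored is None:               # empty slot: nothing to do
            continue
        current = dict(stored)           # like $C8DF: copy into the working buffer
        if current["hit_timer"] > 0:     # decrease hit timer if nonzero
            current["hit_timer"] -= 1
        handlers[current["type"]](current)   # object-specific handler
        object_table[slot] = current     # like $C928: copy the result back


def drifting_enemy(obj):
    """Made-up handler that just moves the object one pixel right."""
    obj["x"] += 1


table = [{"type": 3, "hit_timer": 2, "x": 10}, None]
run_frame(table, {3: drifting_enemy})
print(table[0])
```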

For a while I did some looking around with breakpoints on the location in the object buffer where the life is stored, looking for things that modify it. Some of the information gathered this way:

Copy of object struct for current object: 46
Buffer used to hold 16-bit pointers [LE] from various tables: 7A

$53 [life of the current object] written to directly by $8DF6 in $C9A4 when damage is taken
$8C6B-8C9F, $8DDF-8DFE (inside $8D76) only executed when hit timer is 0
During pause taking hit, $C1E6->$D7A0 returns A0/N+ to $8DD5 on boss 2, but 7F/N- to boss 5

$D7A0 seems to determine whether you get hit during pause. $7E indicates whether you're being hit or not: high bit for hit, lower bits for damage taken.
$7E is written to by $D71A in ($C141->$D711) when hit by boss, $D7B2 to clear hit per frame. $D71A is not reached when paused with boss 5 because $D6CD does not return
$D6CD returns if the current object is hitting the player, throws an "exception" returning to caller of caller if not hitting (returns from $D710)

Eventually I decided to take a more systematic look at the object handler for the catacombs (overhead view) player object:

$8C38->8D98->$8DBF catacombs player object handler paused
-$C15F
-$C0FF->$EF2B sets $3E and $3F to the X and Y coordinates (in pixels on screen) of the player for $D7A0
-$C1E6->$D7A0 returns A0/N+ to $8DD5 on boss 2, but 7F/N- to boss 5, when paused and being hit. A indicates the damage player should take. highest bit of A and the negative flag indicates whether player is hit or not.
-on N+, $C216 then handles hit (including decrease life)
-$93BD
-$EA3A
-$F029
-$E63C

From this we can see that $C1E6 is the switch we're looking for, determining whether the player takes damage; though a look through the function shows a clear lack of anything resembling hit detection - it simply reads from a struct in memory at $7C, generated who knows where. So, if that isn't determined in the player object handler, maybe it was handled in the objects you can be hit by. For that reason, and to satisfy my curiosity a bit about what makes bosses tick, I decided to take a look at a boss handler function; and as I hate the crab boss so much, naturally I went with that, first.

It was about this time that I discovered that NOPing function calls was a very effective way to quickly get an idea what a function is doing. This method works poorly on computers, because of the greater number of registers and drastically greater stack usage, making such NOPs result in crashes more often than not. On the NES (or at least in Blaster Master), however, the method works very well, with crashes very uncommon.

Crab (level 5) boss unpaused object handler ($A6D0):
-$A07B: (no discernible effect)
-$A6DE: handles movement of crab and state transitioning (but not animation)
-$A7A2: spawns bubbles when necessary
--$C1F5: spawns a bubble
-$C0FF: updates crab screen position based on crab 65 position (NOPout fixes crab position where it shouldn't be, despite crab 65 object movement). affects everything onward.
-$C153->$EB51
-for each body segment (right pincer, left pincer, back):
--$C141->$D711: detection of whether player has been hit. detection of player being hit by green sprite is elsewhere.
---$D6CD: checks for collision
--$C216->$DECC (only if hit)
-$C12C: (no discernible effect)
-$C189: draws (animates?) the green sprite
-$C144->$D697: sets player damage if player hits green sprite. also handles things damaging the boss.
--$D6CD: returns on collision (either player hit by something or boss hit by something). eats the stack frame if no collision (returns to caller of caller)
-$9EB3: animates the background layer

Okay, so $D6CD is another function that looks like a good candidate. So, let's take a look and see if it's what we're looking for or another tangentially related function.

$44 = A
$00 = $3E - ($40 / 2)
$01 = $3F - ($41 / 2)
X = 0xF
do
{
    A = $7E[X]
    if (A > 0)
    {
        $45 = A
        A = $7C[X] - $00
        if (A <= $40)
        {
            A = $7D[X] - $01
            if (A <= $41)
                return
        }
    }
    X -= 3
} while (X >= 0)

throw

Now that looks like a collision detection function, more or less. We can see the X coordinate, Y coordinate, width, and height of whatever it is we're comparing against at $3E-$41, respectively (this is a case of a memory region being used like a stack frame). We can also see an array of 5 structs containing the X coordinate, Y coordinate, and a thingy, respectively, starting at $7C. Clearly only the size of the "current" object is considered, and the array of thingies are just points.

Well, let's think about it for a minute. We know that this function is called 4 times for the crab boss, for the 4 different parts of the body (and that one of those is not like the others). What might collide with the boss that would be of interest? Well, the player and the player's projectiles. A bit of playing around proved this to be true: the first entry in the array is the player, the rest are the player's projectiles (limited to 4, as you can see). That third byte in the struct array indicates the amount of damage the projectile does (or 7F for the player).

So, now we have everything we need to figure out how this works. The boss object checks for collisions with the player and the player's projectiles. If it hits the former, it marks the player for damage, which is then applied by the player object handler (presumably next frame, since the player handler always gets executed before enemy handlers); if it's the latter, the boss takes damage, though only if the part hit is the green sprite, which is the hit box.
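For readers who would rather see the collision check in a runnable form, here is a Python sketch of the same idea. It is an interpretation, not a translation of the ROM: the function name and tuple layout are made up, the point list is passed in rather than read from $7C, and the 8-bit wraparound on subtraction is an assumption about how the 6502 code behaves.

```python
def find_collision(center_x, center_y, width, height, points):
    """Return (index, damage) for the first point inside the box, else None.

    points is a list of (x, y, damage) tuples: index 0 stands for the player,
    the rest for the player's projectiles. A damage byte of 0 marks an unused
    entry. Subtraction wraps to 8 bits (assumed 6502-style behaviour).
    """
    left = (center_x - width // 2) & 0xFF
    top = (center_y - height // 2) & 0xFF
    # The original loop walks the array from the last entry down to the first.
    for index in range(len(points) - 1, -1, -1):
        x, y, damage = points[index]
        if damage == 0:
            continue
        inside_x = ((x - left) & 0xFF) <= width
        inside_y = ((y - top) & 0xFF) <= height
        if inside_x and inside_y:
            return index, damage
    return None  # the original "throws" back to the caller's caller here


points = [(100, 100, 0x7F), (0, 0, 0), (200, 50, 4)]
print(find_collision(100, 100, 16, 16, points))  # box around the player
print(find_collision(200, 50, 16, 16, points))   # box around a projectile
print(find_collision(30, 30, 16, 16, points))    # box touching nothing
```

Note that only the box has a size; the player and projectiles are treated as single points, which matches the observation above.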

Now we know how the collision detection and damage system works, and why bosses that are susceptible to being hit while paused can also hit you while paused (the collision detection for both is one and the same). That just leaves the question of why it only operates on some bosses. Well, it was about here that I discovered that that wasn't actually true. While only some bosses can be hit (and hit you) while paused, this is not the case for projectiles. Specifically, the projectiles fired by boss 3 and boss 8-1 can hit you while paused, even though the boss cannot be hit while paused.

Well, it turns out there are two sets of object-specific handlers, one for when the game is running, one for when it's paused. Now, bullets and other things are generally not prone to the problem of dealing damage while paused, as they disappear as soon as they hit something (e.g. the bubbles spewed by the crab boss). The things that can hit you while paused appear to be exactly those objects that have pause handlers and that don't disappear after hitting you; this is the case with all susceptible bosses, plus the projectiles of bosses 3 and 8-1.

Boss object handlers:
1 (63): base $A58A, paused stub
2 (5F): $9B64
3 (61): $A196, paused stub
4 (5D): $970C
5 (65): $A6CD, paused stub
6 (5F): $9B64
7 (5D): $970C
8-1 (67): $AA34, paused stub
8-2 (69): $AC84, paused stub

While I didn't go searching for every single object handler to compare the paused and non-paused versions, I did compile a list of the ones for the bosses. Note that there is only actually a single table of handlers in memory; the address in the table points to the paused handler, while the unpaused handler is always 3 bytes afterward (3 bytes is the size of a JMP instruction). For the bosses not prone to this glitch, you can see that I've indicated the pause handler is a stub that returns immediately. For the rest, the pause handler jumps midway into the unpaused handler (after things like movement).

To illustrate this, take a look at the handler for boss 2 ($9B67):

-$A07B: (no discernible effect)
-$C01E: handles movement of boss
-if time to spawn next projectile:
--$C1F5->$D851: fork projectile from boss, giving it a new proto-object type
--$C216
-paused handler jumps directly to here
-$C0FF
-$9D77: draw and do hit detection for arms
-$C090->$D770: do hit detection for back (not hit box)
--$D711: hit detection
-$C093: do hit detection for hit box
-$9EB3: moves and animates the background
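The single-table layout described above (the table entry points at the paused handler, with the unpaused handler one JMP later) can be sketched in Python like this. The dictionary and function names are made up for illustration; the addresses are the ones from the boss list.

```python
JMP_SIZE = 3  # a 6502 absolute JMP instruction occupies 3 bytes

# Table entries copied from the boss handler list (boss number -> address).
BOSS_HANDLERS = {
    "1": 0xA58A, "2": 0x9B64, "3": 0xA196, "4": 0x970C, "5": 0xA6CD,
    "6": 0x9B64, "7": 0x970C, "8-1": 0xAA34, "8-2": 0xAC84,
}


def handler_address(table_entry, paused):
    """Entry point to call for an object, given its handler table entry."""
    return table_entry if paused else table_entry + JMP_SIZE


for boss in ("2", "5"):
    entry = BOSS_HANDLERS[boss]
    print(boss, hex(handler_address(entry, True)), hex(handler_address(entry, False)))
```

This lines up with the addresses quoted in the text: the unpaused crab handler at $A6D0 and the unpaused boss 2 handler at $9B67.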

So, that's how it is. Now we know why it happens and why it only affects some things. That just leaves one more question: is it a bug or a feature? Unfortunately, the evidence is much too ambiguous to answer that clearly. While it's within the bounds of imagination that damaging the boss during pause could have been intentionally put in as a cheat of sorts, it's very hard to imagine that the same thing against the player would be intended.

Yet it's also hard to imagine that it could be a bug; while the player being hit during pause could be eliminated by a single change to the player handler (a single mistake, in other words), doing the same for bosses would require that every boss's handler be changed. Furthermore, there's the fact that only 2 of the 7 unique boss object handlers (2 of them are used for 2 bosses each, bringing the total to 9 bosses) have pause handlers, in contrast to the player and most projectiles, and without pause handlers the bosses don't even get drawn completely (and what point is there in having a pause that lets you see the boss battle if you can't see the boss?). It seems as though a lot of stuff was left out or in arbitrarily; I have to wonder if there is a deadline lurking in here, somewhere.

Various resources used during reverse-engineering:
Nintendo Entertainment System Documentation
6502 Instruction Summary
How NES Graphics Work
Comprehensive NES Mapper Document
For more info, see this big list of documents

Saturday, August 29, 2009

& Very Old Things - Part 2

As for the hacking itself, this started out with simple stuff. Search memory, find special values (life, lives, gun power, boss items, etc.), and overwrite them. This was quite easy with an emulator that has a cheat search feature; you simply have it save the memory contents at the initial state, then repeatedly perform searches based on specified criteria until you've narrowed it down to 1 or a small number of possibilities. For example, finding player health was a trivial matter of searching for something that decreases each time you take damage and stays the same in between damage.
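The narrowing process behind a cheat search is simple enough to sketch in a few lines of Python. This is a toy model, not any particular emulator's implementation: the snapshots are made-up 4-byte "RAM" dumps and the function names are hypothetical.

```python
def narrow(candidates, before, after, predicate):
    """Keep only the addresses whose (old, new) values satisfy predicate."""
    return {addr for addr in candidates if predicate(before[addr], after[addr])}


def decreased(old, new):
    return new < old


def unchanged(old, new):
    return new == old


# Made-up RAM snapshots: initial state, right after a hit, a moment later.
snap0 = bytes([0xFF, 0x10, 0x42, 0x07])
snap1 = bytes([0xDF, 0x11, 0x42, 0x06])
snap2 = bytes([0xDF, 0x12, 0x42, 0x05])

candidates = set(range(len(snap0)))
candidates = narrow(candidates, snap0, snap1, decreased)  # we just took damage
candidates = narrow(candidates, snap1, snap2, unchanged)  # no damage since then
print(sorted(hex(addr) for addr in candidates))
```

Each pass throws away addresses that behave like timers, counters, or constants, until only the value that tracks health is left.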

With the exception of the boss items, all of the others in this list were similarly easy to find. Boss items, however, were a bit more complicated. Because you only get the items after beating a boss, naturally this is a situation where there are likely to be a lot of changes all over RAM, as it's a major transition in the gameplay; this means that you can't narrow it down as easily. I had to try several locations, starting with the ones whose values made them the most probable to be correct (specifically ones that had a bit flipped on when the item was picked up).

However, there was one other complication figuring that one out: boss items are represented by two separate bit masks that are partially redundant. I first discovered the mask at 3FC, because that's the one that changes the bullet your tank fires after you get the first boss item; as it was a visual change, it was very easy to see that it was working. Once I found that, I checked to see if it was a bit flag, and confirmed it was, finding the flags for the second and third boss items by experimentation.

That's where things got weird. While this mask contained flags for boss items, the flags for the other boss items didn't seem to actually do anything. As well, while having the hover bit flag set caused hover power to be displayed (it's hidden prior to getting the third boss item), you couldn't actually hover. The bits here also did not make the upgrades show up on the pause screen. So, I went on, with my infinite life and gun power, and killed a few more bosses (somehow I still remember most of the maps of this game).

I found that there was a second bit mask with a completely different set of flags, one for each of the 7 items in the game. For the first 2 items, these flags did nothing more than make the item show up in the pause screen. For the third item (hover) this flag allowed you to hover, but did not show the hover power bar or allow you to accumulate hover power (so you still couldn't actually fly with this flag alone). I still can't imagine why it was done this way, but at least I figured out all the flags.

One thing that I couldn't figure out with cheat searches, however, was boss life. That is, I found it - repeatedly: it was in a different place every time (including even for the same boss). I correctly reasoned that this implied that the boss's life was not a special value, but that bosses were simply common entries in the list of objects in the game. As it turns out, so is the player; you just don't notice that because the player is always at index 0 in the object array and thus always at address 0x400, while the boss was stuck in whatever slot was available at the time. This led directly to the discovery of the object array itself. From there, I started looking at the struct that represented each object, beginning with offset 0xD: life.

As I've already pointed out that you tend to have to get clever when coding on a system with so little RAM, it shouldn't be surprising that maps are stored in a similarly clever way. As far as I can tell, there is only a single map for each area in the game (8 areas), which is 128x256 or less. However, the map is diced up and jam-packed in there such that they need to do some camera tricks when transitioning between areas (teleporting through some doors) to make sure you never see other sections that are contiguous in memory.

While I was working on it, I figured I should try to answer some long-standing questions about Blaster Master. To begin with, why does the power of your gun (what it actually shoots) not always match the gun power that's displayed? For example, if you have full gun (8 bars) and get hit, you take 1 bar of damage, but the effect of your gun falls beneath what 7 bars should give you. Many years ago I hypothesized that there were two variables for gun power, one that was displayed and one that was the actual effect, and that getting hit did a bit more damage to the effect than to the display.

It turned out to be simpler than that. It's a consequence of the fact that while most hits (I can't think of any counterexamples) do an integral number of bars of damage, life, gun, and hover are stored from 0-FF, in bytes. See where this is going? One bar = 20, but 8 bars = FF, not 100. In other words, the number of bars displayed is (x + 1) / 20. When you get full gun and then take a hit, your gun drops to DF; while this still displays as 7 bars, it doesn't give you the maximum gun effect (which requires E0). It's not clear whether this is a bug or a feature, though I'm kind of leaning toward a bug (specifically, an edge case the coders didn't think of, which could be easily solved).

All numbers are in hex.

Continues: 37E
Lives: DD
Gun: C3
Hover: 92
Bosses killed: 3FB. bit x = boss x+1
Boss items 1: 3FC. same values as 3FB.
Boss items 2: 99. bit 0: hover, 1: dive, 2: wall 1, 3: wall 2, 4: crusher, 5: key, 6: hyper
Homing Missiles: 6F0
Lightning: 6F1
Thunderhead Missiles: 6F2
Pause: 15
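Decoding the two bit masks in the list above is a one-liner each in Python. The bit assignments are the ones listed for $99 and $3FB; the function names are made up, and the sketch assumes nothing about bit 7 of either byte.

```python
# Bit order for the "Boss items 2" mask at $99, lowest bit first.
BOSS_ITEMS_2 = ["hover", "dive", "wall 1", "wall 2", "crusher", "key", "hyper"]


def decode_items(mask):
    """Names of the items whose flag is set in the $99 mask."""
    return [name for bit, name in enumerate(BOSS_ITEMS_2) if mask & (1 << bit)]


def bosses_killed(mask):
    """Boss numbers flagged in the $3FB mask (bit x means boss x+1)."""
    return [bit + 1 for bit in range(8) if mask & (1 << bit)]


print(decode_items(0b0000101))
print(bosses_killed(0b00000101))
```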

Gun Levels:
20: 2 long shots
40: 3 long shots
80: circling shots
C0: white wave shots
E0: full wave shots
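The off-by-one between the displayed bars and the actual gun effect falls straight out of the numbers above. Here is a small Python sketch: the bar formula and thresholds are the ones given in this post, while the function names and the "value at or above the threshold" comparison are assumptions.

```python
BAR = 0x20  # one bar of life, gun, or hover

# Gun level thresholds from the list above, highest first.
GUN_LEVELS = [
    (0xE0, "full wave shots"),
    (0xC0, "white wave shots"),
    (0x80, "circling shots"),
    (0x40, "3 long shots"),
    (0x20, "2 long shots"),
]


def bars_displayed(value):
    """Number of bars shown for a stored value in the range 0x00-0xFF."""
    return (value + 1) // BAR


def gun_effect(value):
    """Highest listed gun level reached, or None below the first threshold."""
    for threshold, name in GUN_LEVELS:
        if value >= threshold:
            return name
    return None


full = 0xFF
after_hit = full - BAR   # one bar of damage leaves 0xDF
print(bars_displayed(full), gun_effect(full))
print(bars_displayed(after_hit), gun_effect(after_hit))
```

After one hit from full, the display shows 7 bars, but 0xDF is one short of the E0 threshold, so the gun drops a whole level.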

Beginning of Object Table: 400. 12 slots. Slot 0 is reserved for the player. Slots 1-4 are reserved for projectiles. Enemies fill slots 5-11 as needed.

Object Struct (14 bytes)
byte @0: Object type. A few of the known object types:
3: Normal
4: Dying
5: On left wall
6: On right wall
7: On ceiling
8: Diving
9: Climbing H->V convex
A: Climbing V->H convex
B: Climbing H->V concave
C: Climbing V->H concave
D/E/F 10/11/12: Entering/transitioning/leaving door right/left
1B: Outside tank
1D: Entering tank
84/85/86: Entering/transitioning/leaving door catacombs

byte @1: Orientation. The exact meaning of this seems to vary by object type.
Tank mode: Between 0 for left and F for right (the intermediates indicate that the tank is changing face), with the high bit set if the tank is pointing to the left
Catacombs mode: 0 for up, 1 for right, 2 for down, etc.
uint16 [LE] @2: Horizontal position. The high byte indicates the attribute-size tile, while the bottom byte is the position within that tile. The highest bit indicates that you're currently moving through a teleporting door.
uint16 [LE] @4: Vertical position
byte @6: X velocity
byte @7: Y velocity
byte @8: Attribute-size tile index on screen? This was always observed to be (Y * 0x11) + X where X and Y are relative to the tiles visible on screen.
byte @9: Hit timer: the number of ticks until the object can be hit again. Controls invulnerability, damage flashing, and also decreases gun by 1 for each tick above 1 (this causes gun power to drop smoothly rather than suddenly, though I'm not sure why they did it this way; life does not do this when damage is taken).
byte @A: Current AI state
byte @B: Frames till AI state change
byte @C: ?? (no observed effect)
byte @D: Life. For trivia value, normal tank shot does 4, hyper shot does 6, crusher shot does 8, normal catacomb shot does 1, catacomb shot with wave gun does 2, and grenades do 2.

Note that the AI state bytes in the struct are for simple enemies. Bosses have other state stored elsewhere. For example, in the above crab boss, the AI state and timer control movement but not spraying bubbles, which occurs entirely independently. The density of bubble spray, however, was directly controlled by the life of the boss; e.g. below 0x20 life (it starts with 50) it does the full sawtooth volley with dozens of bubbles.
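The 14-byte object struct listed above maps neatly onto Python's struct module. This is a sketch for poking at a RAM dump, not part of any existing tool: the field names are made up, and the velocity bytes are left as raw unsigned values since their signedness isn't established here.

```python
import struct
from collections import namedtuple

GameObject = namedtuple(
    "GameObject",
    "type orientation x y x_velocity y_velocity tile_index "
    "hit_timer ai_state ai_frames unknown_c life",
)

# Two bytes, two little-endian uint16 positions, then eight more bytes.
LAYOUT = struct.Struct("<BBHH8B")


def parse_object(raw):
    """Unpack one 14-byte object struct into named fields."""
    return GameObject(*LAYOUT.unpack(raw))


raw = bytes([0x03, 0x0F, 0x34, 0x12, 0x78, 0x56, 0x00, 0x00,
             0x11, 0x00, 0x01, 0x20, 0x00, 0xFF])
obj = parse_object(raw)
print(LAYOUT.size, hex(obj.x), hex(obj.y), hex(obj.life))
```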

Sunday, July 26, 2009

& That Memory Leak

I'm pretty sure that a while back I mentioned a particular Firefox plugin called Feed Sidebar. It's a nice little RSS feed plugin that does exactly what I wanted an RSS plugin to do (and all others work differently). As such, I'm rather fond of it, and have been rather tolerant of its faults.

The particular fault I believe I discussed was that it leaked memory. With the ~15 RSS feeds I have on it set to update every 10 minutes, it leaked about 20 megs of memory per hour. While many would probably not find this to be a serious problem (e.g. 8 hours x 20 megs/hour = 160 megs leaked before shutting down for the day), as I tend to leave my computer running for weeks straight, this was a moderate annoyance. Left on its own, Firefox would crash about every three days due to address space exhaustion (32-bit applications have a 2 gig address space, and fragmentation can reduce the amount actually allocatable below that; I was seeing it crash around 1.5 gigs). Much more frequently, I'd start World of Warcraft and my computer would grind to a halt, as Windows really, really doesn't like it when you get > 90% memory usage (at least up until a month or two ago, since when I've had 4 gigs of RAM), and I'd have to manually restart Firefox at that point.

On one occasion I e-mailed the author about the problem; he said that he'd been looking for the leak for a while, and was stumped. Unfortunately, as I haven't done anything with Javascript (what Firefox plugins are written in) in over a decade, and I don't know the first thing about the Firefox plugin architecture, I couldn't go looking for the problem myself as I've done so many times in the past (at least 3 or 4 incidents on the blog).

However, a couple weeks ago the problem suddenly became much worse. Immediately after installing Firefox 3.5, I noticed that it was now hemorrhaging about 150 megs/hour, more than seven times as fast as in 3.0 and earlier (in fact, at first I thought it was that 3.5 was just unstable, as it crashed about every day; then I noticed that it was allocating massive amounts of memory). After verifying both that this was consistent/reproducible and that this was indeed due to the particular addon, I e-mailed the author again, with my new data.

Apparently that got a pretty quick response. A couple days later he sent me a test build he wanted me to try. After some looking, he'd come to believe that there was a bug in Firefox that was leaking SQLite resources; resources that should have been freed when an SQLite connection was closed were being leaked (it was not clear from his e-mail whether this was a new bug with 3.5, or just that the quantity of resources leaked was greater in 3.5). This test build included a workaround for it (presumably he saved the connection instead of creating a new connection every time it updates the RSS feeds, but he didn't say).

This seemed to work well. After installing it, I immediately noticed that it was not hemorrhaging memory like it did before (previously it was leaking about 25 megs each time the RSS feeds were updated, which was now gone). Rather, after a couple of days of collecting data, I concluded that it was now only leaking 1-2 megs each update (about 10 megs/hour).
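The per-update and per-hour figures in this post are tied together by the 10-minute update interval, and the arithmetic can be checked in Python. A rough sketch only: the function names are invented, and the time-to-crash estimate ignores Firefox's baseline memory use, so it is an upper bound.

```python
UPDATES_PER_HOUR = 60 / 10  # feeds refresh every 10 minutes


def leak_per_hour(mb_per_update):
    """Hourly leak rate implied by a per-update leak."""
    return mb_per_update * UPDATES_PER_HOUR


def hours_until(limit_mb, mb_per_hour, start_mb=0):
    """Hours of leaking before usage reaches limit_mb (start_mb is the baseline)."""
    return (limit_mb - start_mb) / mb_per_hour


print(leak_per_hour(25))          # 3.5 before the workaround
print(leak_per_hour(1.5))         # test build, midpoint of 1-2 megs per update
print(hours_until(1500, 20))      # 3.0-era leak, crash observed around 1.5 gigs
print(hours_until(1500, 150))     # 3.5 before the workaround
```

The 3.0-era rate works out to roughly three days before hitting 1.5 gigs, which matches the crash interval described earlier.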

It's not clear yet whether this is true leakage (which would suggest that one leak was plugged but another remains) or that this is a side-effect of the workaround. It's conceivable that Firefox might delay freeing SQLite resources as long as a connection is in use; if he's indeed keeping an SQLite connection open the whole time, we could imagine that this "leak" is not a leak (lack of resource reclamation) so much as a deferral of resource reclamation. If this is the case, this should go away after Firefox fixes the SQLite connection leak and he goes back to creating a new connection on every update.

We'll see.

Incidentally, version 4.1 of Feed Sidebar was released today; this is the version I tried a test build of. After I've had more time to collect data I'll have to see if it's leaking the same amount as the test build, or if there have been further modifications.

Thursday, March 26, 2009

Die, .NET. Thanks.

So, just encountered an (extremely) evil quirk of the .NET platform in a bug.

Everyone who programs .NET knows that one key difference between structs and classes is that structs are (without ref specified) always passed by value, while classes are passed by reference. Apparently that rule is not limited to actual passing of structs, themselves; passing a "pointer" to a callback function for an instance of a struct causes a copy of the entire struct to be passed, and the callback is then called on that copy, not your original instance.

Example:
system.FindCollisions(collisionSet.OnPossibleCollision, workingSet);

In this line, FindCollisions receives a local copy of collisionSet. When it then calls that callback function, that callback operates on the local copy, not on collisionSet itself.

I'm not sure whether this is by design or whether it's a bug. While it's consistent with the policy of always passing structs by value, the fact that it's so counter-intuitive makes me wonder if it might not be a bug.

Wednesday, November 19, 2008

Various Thingies

First of all, I should mention that my house is fine; the fire didn't get near it. The probability of it getting here was fairly low, but we did a bit of better-safe-than-sorry packing. Though last night while driving home from school I did drive past an (unrelated) fire that filled the entire intersection with smoke in about a 50-foot radius; I still don't know what was on fire (I couldn't see it), but the smoke was very obvious, and I heard fire trucks going by.

In other news, it's been relatively difficult to collect data on Firefox after reenabling the Feed Sidebar addon. Firefox crashed after three days of logging memory usage, and then a couple days later I needed to restart it because I needed the memory for WoW (Firefox was using about a gig). But the addon definitely seems like the cause of the memory leak. From the days I gathered data, it looks like it leaks about 40 megs/hour (although that's only over a couple days; it might decrease over time).

Finally, I just noticed something that happened last year: the Starcraft soundtrack, not previously available (the compressed audio shipped with the games is 22 kHz ADPCM, which is pretty poor quality), is on iTunes for $10; the other Blizzard OSTs that were included in the collectors' editions of Diablo 2, Warcraft 3, and World of Warcraft are also available there (though unfortunately all of them are single CDs, which means they are incomplete). The music is DRM-free (although I hear they encode personally identifying information in the audio files), 256 kbps AAC (good quality), though you will have to install the Apple iTunes crapware to buy it. I'm told the M4A files should play on all PC audio players that support AAC (I know they work on WinAmp), though they are not MP3s, and will not work on MP3 audio players. That's your public service announcement for today.

Wednesday, November 12, 2008

& More Leakage

So, after writing that last post about the audio driver handle leak, I decided to log some data - specifically, the amount of memory Firefox allocates, and the number of handles in the Symantec Anti-Virus process smc.exe. It's now been about a week since I started gathering data (although unfortunately the power went out in the middle, so I ended up with two smaller replicates).

The data for smc.exe shows that it begins at approximately 450 handles on startup, and acquires an additional 3100 handles per day (although 'day' is about 14 hours, as I hibernate my computer at night; meaning about 220 handles/hour). This definitely doesn't seem normal, and I'm going to venture a guess that it's a handle leak. I also noted that the increase seems to be linear over the course of the day, so is unlikely to be related to something like automatic update.
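The per-hour figure is just the daily gain divided by the hours the machine is actually awake. A small sketch of that arithmetic, using the numbers above; the function names are only illustrative, and the one-week projection assumes the growth stays linear, which is what the data so far suggests but does not prove.

```python
def leak_rate_per_hour(gained_per_day, awake_hours_per_day):
    """Handles gained per hour of uptime."""
    return gained_per_day / awake_hours_per_day


def projected_handles(start, rate_per_hour, hours):
    """Linear projection of the handle count after the given hours of uptime."""
    return start + rate_per_hour * hours


rate = leak_rate_per_hour(3100, 14)                      # roughly 221 handles/hour
after_one_week = projected_handles(450, rate, 7 * 14)    # seven 14-hour "days"

print(f"{rate:.0f} handles/hour, about {after_one_week:.0f} handles after a week")
```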

I already knew that Firefox was hemorrhaging memory. If I recall correctly, the amount of memory allocated by Firefox increased by 200-300 MB per day. This time, I tried using Firefox for several days without two of the three addons I normally use (the third was NoScript, so I didn't want to try without that unless I had to). While this test didn't last as long as I'd hoped (thanks to that power outage), after four days, Firefox had only increased from 125 MB (when I first started it, with a lot of saved tabs) to 205 MB (now). In four days I would have predicted it would hit 600-900 megs.

This strongly suggests that one of the two addons is responsible for the massive leakage, although I'll have to watch what happens after I reenable the one most likely to be causing the leak (as the other is newly installed, and this problem has been around for longer): Feed Sidebar (version 3.1.6). So, we'll see what happens with that. Might have an answer in another 4-7 days about that.

Tuesday, November 04, 2008

& the Audio Driver Incident

Several months ago, I (finally) upgraded my computer. My old one was a 1.8 GHz Athlon XP (single 32-bit core) with 1.25 gigs RAM and a GeForce 3; in other words, it was 2002 or 2003 hardware. My new computer is a 2.4 GHz Core 2 (quad 64-bit cores) with 4 gigs RAM and a Radeon 4850; depending on the benchmarks, my new CPU is 10-18x as fast as my old one, if you count all 4 cores. After trying various voodoo to try to get my old XP installation to run on my new computer (despite the fact that it wouldn't have been able to use about a gig of my RAM), I ultimately gave up and installed Windows Server 2008 64-bit. After dealing with a whole bunch of problems getting various stuff working with 64-bit 2008, things ultimately ended up being acceptable, and I've used that ever since.

However, a couple relatively minor problems have been pretty long-standing, and continued until a few days ago. One was easy to diagnose: Firefox was leaking memory like heck. For every day I left my computer on, Firefox would grow in RAM usage by a couple hundred megs, getting up to a good 2 gigs on occasion (I usually kill it before it gets to that point). While this was certainly an annoyance, it wasn't much of a problem, as I have 4 gigs memory, and I can simply restart it to reclaim all the leaked memory whenever it gets so large it becomes a problem.

The other was much harder to diagnose, however. Something else was leaking memory in addition to Firefox, and it was not clear what was causing this. Total system memory usage would increase over days, and if you ignored Firefox, would end up using up all of my 4 gigs memory by about 2 weeks since the last reboot. Unlike with Firefox, there was no apparent problem - no single process was showing a significant accumulation of memory, nor were excess processes being created, leaving 1-2 gigs of memory I couldn't account for. So, I went several months without knowing what the problem was, usually handling it by restarting my computer every week or so.

Then, one day my dad called me from work to ask me why his computer at work was sometimes performing poorly. So I had him look through the process list and system statistics and look for memory leaks, excessive CPU usage, etc. As I don't have the exact terminology used on those pages memorized, I also opened up the listing on my computer to be sure I told him to look for the right things.

This brought something very curious to my attention: the total handle count for my computer was over 4 million. This is a VERY large number of handles; normally computers don't have more than 20-50k handles at a time - 2 orders of magnitude less than what my computer was experiencing. This was an almost certain indication that something was leaking handles on a massive scale. After adding the handles column to the process list, I found that audiodg.exe was the process with some 98% of those handles. Some looking online revealed that that process is a host for audio driver components and DRM. Some further looking for audiodg.exe and handle leaks found some reverse-engineering by one person that showed that this was due to the Sonic Focus audio driver for my Asus motherboard leaking registry handles.

Fortunately, there was an updated driver available by this time that addresses the issue. As my computer was currently at 96% RAM usage (the worst it's ever been - usually I reboot it before it gets to this point), I immediately installed the driver and restarted the audio services (of which audiodg.exe is one). This resulted in a shocking instant 1.3 gig drop in kernel memory usage to less than 400 megs total. It's been one and a half days since then, and audiodg.exe currently is using 226 handles, suggesting that the problem is either dead or drastically reduced (it has increased by like 70 handles in those 1.5 days); and even if it is still leaking handles, 50 handles a day is a tolerable leakage, as that's only like 10 KB/day.

So, this whole thing revealed that Windows is quite robust. Given that most computers never go above 50k handles, I was very surprised that Windows was able to handle 6.6 million handles (the highest I've ever seen it get to) without falling over and dying (although this wouldn't have been possible with a 32-bit version of Windows, as that 1.7 gigs of kernel memory wouldn't have fit in the 2 gig kernel address space after memory-mapped devices have memory space allocated). Traditionally, Unix has had a limit of 1024 file handles per process, though I don't know what's typical these days (I know at least some Unix OSes have that as a configurable option).
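On a Unix-like system the per-process limit can be read at run time rather than assumed. A minimal sketch using Python's standard resource module (Unix-only; it is not available on Windows); the describe helper is just an illustrative name.

```python
import resource


def describe(limit):
    """Render an rlimit value, which may be the special 'no limit' marker."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


# RLIMIT_NOFILE is the maximum number of open file descriptors for this process.
# The soft limit is what is enforced now; the hard limit is the ceiling the
# soft limit can be raised to without extra privileges.
soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)

print(f"soft limit: {describe(soft_limit)}, hard limit: {describe(hard_limit)}")
```

The numbers printed depend entirely on the system and the shell configuration it was started from.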

After pursuing that problem to its conclusion, I decided to do some more looking for handle leaks in other processes. While the average process used only 200-500 handles, a number of processes (which are not abnormally high) get as high as 2k handles. However, one process - smc.exe, a part of Symantec Antivirus - has almost 50 k handles allocated, making it a good candidate for a handle leak. Looking at the process in Process Explorer shows that a good 95% of these handles are of the same type - specifically, unnamed event handles - providing further evidence in support of handle leakage. That's as far as I've gotten so far; I haven't spent much time investigating the problem, or looking for an analysis online (though the brief searches I did didn't find anything related to this). So, that's work for the future.
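The kind of eyeballing described above can be sketched as a simple outlier check: flag any process whose handle count is far above the median. Everything here is an assumption for illustration: the function name, the factor of 10, and the sample counts, which are made-up numbers in the ranges mentioned above rather than measurements.

```python
from statistics import median


def leak_candidates(handle_counts, factor=10):
    """Return process names whose handle count exceeds factor times the median,
    largest first."""
    typical = median(handle_counts.values())
    return sorted(
        (name for name, count in handle_counts.items() if count > factor * typical),
        key=lambda name: handle_counts[name],
        reverse=True,
    )


# Made-up sample: ordinary processes in the 200-2000 range, one at about 50k.
sample = {
    "explorer.exe": 900,
    "svchost.exe": 2000,
    "firefox.exe": 500,
    "notepad.exe": 200,
    "smc.exe": 50000,
}

suspects = leak_candidates(sample)
print(suspects)
```

A high count on its own only makes a process a candidate; a count that keeps climbing over time is the stronger sign of a leak.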

Saturday, July 12, 2008

Epic Fail

So, on Friday I got a new computer. The computer consists of a quad-core Core 2 CPU, 4 gigs of memory, and a Radeon HD 4850 based video card. Although there are some known techniques for getting an existing Windows installation to work in a new computer, this install simply refused to work with the USB ports on this computer (the computer freezes up several seconds after Windows has booted; disabling the USB ports in the BIOS allows it to work, but is not an acceptable solution). So, I ultimately ended up reinstalling Windows.

I had quite a few options when it came to choosing a version of Windows. Thanks to my obsessive downloading of everything on MSDN Academic Alliance, I have legal copies of Windows 2000, Windows XP x86, Windows XP x64, Vista x86 & x64, two copies of Windows Server 2003, and Windows Server 2008 x86 & x64. For those not familiar with the Servers, 2003 is an updated server version of XP, and 2008 is an updated server version of Vista.

As Server 2008 is an updated version of Vista with additional features (and the newest of any version), I figured I'd use that, and that's what I'm writing on right now. However, this install may be short-lived. As it turns out, just about nothing works on Server 2008. In the last three hours I've encountered the following:
- The Asus motherboard driver installer for Vista x64 will not run. When run, it says "Does not support this Operating System: WNT_6.0I_64". If I understand this correctly, it's saying it doesn't support Windows NT 6.0 x64. This is curious, as this is exactly what Vista x64 is, suggesting that the installer does not run on the system it was made for. Furthermore, several pieces of motherboard hardware do not have drivers included with Server 2008, and so appear as Unknown Devices and PCI Devices (there are still a couple unknown devices left if you manually install each driver). Epic Asus fail.
- The other major driver I needed was the 4850 driver. This was especially important because the 4850 has a known issue where the fan speed stays too low, resulting in hot temperatures. So, I downloaded the latest version of the drivers and ATI Catalyst programs from the video card manufacturer (as best I can tell the ATI web site doesn't list drivers for the 4850) and installed the driver and program. Installation had no problems; running the Catalyst Control Center, however, resulted in the message "The Catalyst Control Center is not supported by the driver version of your enabled graphics adapter.". Very curious, considering that driver and the Control Center came bundled in the same ZIP file. Epic ATI fail.
- One of the programs I use most of all (by far) is Windows Live Messenger. Naturally I soon needed to install it on this computer. The Windows installer even helpfully created a Windows Live Messenger Download link in my start menu. Unfortunately, following the link, downloading the program, and double-clicking it (I'm not even mentioning the UAC and IE annoyances) brought up the error message "Sorry, Windows Live programs cannot be installed on Windows Server, Windows XP Professional x64 Edition, or Windows operating systems earlier than Windows XP Service Pack 2". By process of elimination, this appears to say that it only supports XP x86 SP2+, Vista x86, and Vista x64; curious, given the fact that Microsoft advertises support for Server 2008. Epic Microsoft fail.
- The other program I use most often is Firefox. So, that was next on the list. Download, install, so far so good. Launching Firefox, however, is a completely different story: instant crash. Epic Firefox fail.
- And just for good measure, this install has blue-screened once so far (in about 3 hours), with the PAGE_FAULT_IN_NONPAGED_AREA bugcheck. I'm not sure exactly whose failure this is, but the Asus driver problems seem the most likely suspect. Epic fail.

Friday, June 27, 2008

Sansas & Bugs

Given how big I'm into music (particularly game, anime, and movie soundtracks), it'll probably come as a complete shock to most people to know that I've never had a portable CD or MP3 player (other than the CD player in my car). Probably the biggest reason for this is that I'm cheap - I save most of the money I make, and spend very little of it, even on things you'd expect me to buy (like a computer that's less than 6 years old). Well, yesterday I just bought a digital audio player: the SanDisk Sansa c250 2 gig, on sale at a price I couldn't refuse (cheaper than Amazon).

So, I spent some time playing with it yesterday, in preparation of today, when I drive my grandma to a doctor's appointment and various errands (she's had severe eye problems for the last couple months). Not a bad little sucker; though just as you might guess from the price, it didn't take long to run into problems. Naturally, as I'm too impatient to call tech support, and too inquisitive to give up on a technical challenge, this meant I had to debug the thing.

After loading almost 2 gigs of music onto it and disconnecting from the computer, it proceeded to promptly lock up on database refresh (after you modify the contents of the flash memory it scans all the files and indexes them). Wonderful. I could turn it off and on, but every time it turned on it immediately performed a database refresh, and promptly locked up. Worse, it would no longer connect to the computer, as the database refresh preempted other things, like USB port communication, meaning I couldn't delete anything that might be causing it to freeze (specifically, if you plugged it into the USB port while it was performing the database refresh, Windows would say "unrecognized USB device" after a couple seconds).

A substantial amount of experimentation revealed that it was possible to override this. Specifically, you had to have the computer send a USB signal to the device BEFORE it starts its database refresh. As the database refresh is the first thing it does when you turn it on, and plugging the USB cable in automatically turns the device on, this takes rather precise timing, and more or less requires pressing the button that makes it connect in mass storage mode*, inserting the USB cable, and pressing "Scan for hardware changes" in Device Manager at essentially the same time (I'd say within about 1/3 of a second). This will cause the USB signal from the computer to preempt the scheduled database refresh, and put it into USB storage mode.

Now that I was able to access the contents again, I spent some time fumbling around with trial and error, trying to figure out what was causing it to break; as it was 1 AM by this point, my brain wasn't in peak working condition, and this took some time. Searches on Google revealed that quite a few people had this problem and there are quite a few hypotheses as to what causes it and how to fix it, but no definitive explanation or solution (nor has Sandisk addressed this problem, despite people asking for help on their forums). As well, many of the "solutions" involved wiping the memory of the thing, and sometimes bricking it.

Through trial and error, I managed to burn through a number of hypotheses (which were either incorrect or simply not applicable to me). It appeared to be false that spaces in directory and file names caused lockups (or that bug only occurred in older versions of the firmware). I also did not observe any instances of odd characters in song titles or artists that caused this problem; to my surprise, the device even correctly handled and displayed the Japanese characters in some song and artist names (when I had first opened the package, I tried copying a single album onto it, which worked without incident; this album happened to have Japanese ID3 info). Lack of free space did not appear to cause it (I tried taking it down to 2 megs free space with good files, and it still worked fine). ID3v1 tags seemed to work fine. Even this one funky MP3 at "0 kbps" (what Explorer reports for it; I haven't looked at it with a hex editor to figure out why this is) did not cause the problem.

What ultimately ended up being the problem, at least in my instance, was that one of my game soundtrack MP3s was mislabeled as 'hard rock'. The significance of this, according to one person, is that it has a space in the genre name. Changing this to the proper genre corrected the freeze. I can't say for certain that the space in the genre is what causes the bug, but it's true that when none of my songs have a space, the player works fine, and it froze in that one case.
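If the space-in-genre hypothesis is right, the offending files can be found before they ever reach the player. The sketch below assumes the genre tags have already been read into a dictionary by whatever tag tool you use (reading the tags themselves is not shown); the function name and the file paths are hypothetical.

```python
def files_with_spaced_genre(genres_by_file):
    """Return the files whose genre tag contains a space, in sorted order.

    Leading and trailing whitespace is ignored; only spaces inside the
    genre name itself count.
    """
    return sorted(
        path for path, genre in genres_by_file.items() if " " in genre.strip()
    )


# Hypothetical library: one track carries the mislabeled 'hard rock' genre.
library = {
    "ost/track01.mp3": "Soundtrack",
    "ost/track02.mp3": "hard rock",
    "ost/track03.mp3": "Game",
}

suspects = files_with_spaced_genre(library)
print(suspects)
```

This only tests the one hypothesis that matched my case; it says nothing about the other causes people have proposed.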

*The Sansa has two USB connection modes: MTP and MSC. MTP mode interfaces with media players such as Windows Media Player. This mode allows you to store media library files on the player, and make use of various features like tagging and playlists. MSC mode causes the player to act like a vanilla memory stick, allowing you to directly access the flash file system. I'd imagine it's only necessary to refresh the database in MSC mode; that's the only mode I've ever used.

Judging from Google, there are two different methods of switching between modes, which depend on what firmware you have. One method is that a USB mode option appears in the settings menu on the device. The other method (what mine has) is that the player is always in MTP mode, but connects in MSC mode if you hold the rewind button when you plug it into the USB port.

UPDATE:

Found another bug while playing around with putting DRMed WMAs on the critter (my dad also got one, and he has a bunch of DRMed WMAs to put on it, unlike my MP3s). It's only possible to load DRMed files onto the device in MTP mode, so I had to learn how to use that. It appears that my assumption was correct, that database refreshes are only necessary after adding files in MSC mode; after files are added in MTP mode, they appear in the player immediately after the player is disconnected from the computer.

While the player automatically turns on and goes into USB storage mode when you plug the USB cable in, it's possible to turn off the player by holding the power button (the same way you turn it off when it's not connected to the computer) while in USB storage mode. This is not a good idea. If you add some files to the device and then turn it off before unplugging it, it will lose track of those files, and they will not show up in the list of songs on the player (though they will still show up in the file list when it's connected to the computer in MTP mode). Adding additional files later will not cause this problem to be corrected; it is necessary to delete the files from the player and then transfer them from the computer again.

Friday, November 02, 2007

World Without Windows

Okay, so that title is a bit misleading. Anyway, this post hopes to provide some meaningful answers to the question: what would the world be like if the overwhelmingly dominant operating system were secure in ways that Windows is not? For the purposes of this discussion, I'm defining "secure" by several criteria:
1. All users run as limited users - they can't do administrative tasks or screw with the OS without explicitly logging on as admin or running a program as administrator (e.g. Windows run as or Unix sudo)
2. The system is fully isolating with respect to users - one user may not access another user's data without explicit permission
3. There are no privilege escalation exploits in the OS - tricks that limited users could use to gain administrator privilege without having to enter the administrator password
4. There are no remote exploits in the OS itself - in the kernel, standard drivers, basic services, etc.

So, we have this idealized, nonexistent operating system; let's call it Qunix. How exactly, then, would the world look if Qunix had 95% market share? Would this be, as the average Slashdotter seems to believe, a secure and malware-free utopia, where nobody knows what viruses, worms, spyware, or security breaches are, because they don't exist?

The answer, actually, is somewhat depressing: the world would look pretty similar to how it looks right now. Malware and security breaches would still be prominent, the security industry (anti-malware products) would still be big business, and the black hat industry would have similar job security. Granted, the nature of malware would be different, but that would not make it any less prolific or dangerous.

Ultimately, those four criteria I specified have one intended goal: to put everything the user does in a sandbox, where it can't harm the OS or other users (this was how Windows NT was originally envisioned, but time has proved that hope misplaced). Let's assume, for the moment, that these measures achieve that goal (we'll come back to why they don't, later). With this assumption, it becomes impossible for a piece of malware (or a hacker exploiting a buffer overflow, or some such) to invade the kernel, either to destroy the system or to merely hide its existence from the user and malware scanners (a rootkit, in other words).

Unfortunately, while there's no denying that this would make the lives of evil-doers harder, this is anything but the doom of malware/security breaches. Even without the ability to harm the OS itself, a piece of malware could still damage that user's data, and data is often more valuable than the computer it resides on.

Furthermore, the ability to invade the kernel is no requirement for a virulent piece of malware. While hiding is more difficult, creating a virus/worm/etc. that runs entirely in user mode is completely viable. Macro viruses, worms that spread through chat programs, and old-fashioned viruses that spread from a disk/e-mail to the computer and back would still be viable and common (although, amusingly, Windows is more resistant to this last type of virus than Linux). There would still inevitably be security holes in third party applications allowing an attacker to get a foothold in the computer and execute code under the user's privileges, and the user could still get (their data) owned, without the attacker ever invading the kernel.

Thus, the necessity of anti-malware products would remain. Now, it would be reasonable to assume that anti-malware products would run with administrative privileges. However, this advantage of privilege would only make life more difficult for malware authors. While it would make it impossible to completely hide from a scanner running at higher privilege, there are many ways of obfuscating, evolving, and encrypting a piece of malware such that it is not readily recognizable by a malware scanner.

Clearly this could be overcome by the malware scanner being updated to respond to a new threat... but that's exactly how the world works right now: anti-malware programs must be kept up to date, or they will not be able to protect against everything that has been analyzed (not to mention the time between when a piece of malware is released into the wild and protection is added to anti-malware products). Consequently, malware analysis labs would still be working frantically, and companies would still have support contracts with anti-malware companies to keep their computers perpetually updated with the latest malware protection.

Now, let's make one final invalid assumption, for the sake of argument: that through a combination of various methods, such as security cookies, data execution prevention, and other manner of code hardening, it's impossible for an attacker to penetrate an application running on the computer (e.g. code injection into a web server, an office application executing code in a document, etc.). That leaves one final mode of attack, one which has been used for decades with incredible success, and one which all of the aforementioned measures combined can't stop: PEBKAC; that is, user naivety.

Even if you could stop all remote and automated methods of invading a system, it will always be trivial to trick the user into running something that is actually malware. This fact nullifies every one of the defense measures proposed previously. Even if a user cannot be attacked other ways, an executed program could wipe all their data. Even if a user only runs as an administrator to install new programs/drivers and perform administrative tasks, an executed "installer" could wipe the data of all other users, and an installed "driver" could install a rootkit for future or immediate use. Similarly, even an air-gapped computer (one which has no network connection at all) still remains susceptible to infection (remember, viruses were rampant on air-gapped computers long before networks or the internet entered the average home/business).

To give you an idea how easily malware can spread relying only on tricking users into manually running it, you only need to take a brief look at the Storm worm. While this worm has been revised and updated extensively over its life, it began as a humble executable that was e-mailed to people; when run, it infected the computer. This worm is now considered to compose the largest botnet in history.

Thursday, October 11, 2007

Annoyance

How dost thou suck, Thursday? Let me count the ways. Well, my Visual C# Express just stopped working due to me not registering it; more specifically, due to a bug in Visual C# that makes it not accept any registration numbers, regardless of whether they are valid. Did I mention that MS offers NO technical support for the express editions (and XNA only works with Visual C# Express)? So no working on E Terra (or anything else related to video game development class, for that matter), for the time being.

On a related note, I still have no programmers who have volunteered to work with me on it. Although arguably that's my fault for waiting so long to get started. As well, the forum of the video game design club is broken and not accepting registrations, so I can't post help wanted ads on their site. In either case, doing all the coding for an unsimple RTS myself will be... interesting.

Finally, Live Spaces is broken today and not allowing any new blog posts, so I can't even blog about E Terra (I'd already written up a new entry to post).

Wednesday, September 19, 2007

Logitech: At It Again

So I just spent an enjoyable period attempting to contact Logitech customer support through their Email Us link in their support site. Shortly after sending the first one, I felt inclined to send this one:

I take it you people don't like getting support requests, and do as much as you can to discourage them. There's a rather gaping bug with your support site, where it will say invalid e-mail address/password on attempt to login, tell you that the e-mail address is not in the database when you attempt to get your password sent to you, then tell you that the e-mail address is already in the database when you try to create a new account using that e-mail address. This occurred with both my primary address and backup address. I wasn't able to manage to register until I tried this one, which isn't even mine.

As you can see, this has nothing to do with the mouse I entered (that was a different support request from a few minutes ago). I just thought you might want to know that your web site makes people want to hurt you physically, and I needed to enter product info to contact you at all.

I thought I wrote about Logitech's technical incompetence a year or so back, as well, but I guess not. Maybe that was on my todo list...

Friday, May 04, 2007

I Didn't Actually Win

A ways back I posted about my great amount of amusement at one of the bugs that showed up on my list at work. Obviously I never got around to posting about what I found when I actually had a chance to investigate the bug.

It turned out to be a mixture of several problems. What turned out to be happening is that the program was crashing (a simple user-mode crash; nothing fancy). However, because a user-mode debugger wasn't installed on that computer, the crash launched the kernel debugger (don't ask me why there was a kernel debugger but not a user mode debugger; I don't know). This kernel debugger, in fact, would halt the entire system and stop at a breakpoint in kernel mode code; debugging could then be done by linking the computer to another computer (the one with the debugger client) with a serial cable. So, thanks to the kernel debugger getting invoked, a common crash got elevated to a complete system halt, complete with hosed hardware.

Annoyed, I installed WinDbg on the computer, and tried it again, with the hope of finding what was crashing. The cause immediately became clear, to my further annoyance: IsBadReadPtr was throwing an access violation. For those not familiar with this function, it consists of establishing a structured exception handling frame, then reading from the supplied pointer. Normally, the access violation is caught by the exception handler and the function merely returns true. But in this case, something was catching the exception before the handler.

That something was AppVerifier - a program offered by MS to perform very strict code checks on a program. While these checks tend to whine a lot about stuff that isn't really a problem, they're helpful in that they can catch things that would normally result in a crash, often in rare circumstances (making the crash very difficult to debug). In this case, AppVerifier was catching the exception too early, and making a fuss about something that couldn't possibly have resulted in a crash anyway.

Unfortunately, that wasn't the end of the matter. A quick look at the stack revealed that IsBadReadPtr was being called from an internal Windows function. As this was probably the function checking for an invalid parameter passed to an API function (and thus could potentially mean that my program was passing an invalid parameter to an API function - bad), this meant that I couldn't ignore it.

It turned out to be a bug in the GUI library our company wrote and uses (the author of that library is my arch-nemesis). The list view class contains two image list classes used for checkbox and other icons. What was happening is that, because of the poor architecture of this library (which I fight with regularly), the list view class was being destructed before the list view window itself (actually all windows are like that, in this library). This meant the destruction of all child classes, including the two image lists. Unfortunately, that list view was still USING those image lists, as the class did not unselect them from the list view window before destructing. When the dialog was closed, the list view window was destroyed, and the window attempted to free the image lists (this is the default behavior for list view windows; you can set an option to not automatically free them), and of course the image list pointers were now invalid.
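The ordering problem is easier to see in a toy model. The sketch below is a Python simulation, not the library's code: all class and function names are hypothetical, a "freed twice" error stands in for the invalid pointer, and the detach_first switch shows one possible fix (unselecting the image lists before destructing).

```python
class ImageList:
    """Stand-in for an image list; freeing it twice models the stale pointer."""

    def __init__(self, name):
        self.name = name
        self.freed = False

    def free(self):
        if self.freed:
            raise RuntimeError(f"{self.name}: freed twice (stale pointer)")
        self.freed = True


class ListViewWindow:
    """Stand-in for the native window, which frees its image lists on destroy."""

    def __init__(self):
        self.image_lists = []

    def destroy(self):
        for image_list in self.image_lists:
            image_list.free()
        self.image_lists = []


class ListViewWrapper:
    """Stand-in for the library class that owns the two image lists."""

    def __init__(self, window):
        self.window = window
        self.checkboxes = ImageList("checkboxes")
        self.icons = ImageList("icons")
        window.image_lists = [self.checkboxes, self.icons]

    def destruct(self, detach_first):
        if detach_first:
            self.window.image_lists = []  # unselect from the window before freeing
        self.checkboxes.free()
        self.icons.free()


def close_dialog(detach_first):
    window = ListViewWindow()
    wrapper = ListViewWrapper(window)
    wrapper.destruct(detach_first)  # the wrapper class is destructed first
    try:
        window.destroy()            # the window is destroyed second
    except RuntimeError:
        return "crash"
    return "ok"


buggy = close_dialog(detach_first=False)
fixed = close_dialog(detach_first=True)
print(buggy, fixed)
```

Without the detach step the window is left holding references that its owner has already freed, which is the situation described above.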

Another day, another fixed bug, another few hundred calories burned laughing.

Sunday, April 15, 2007

Incompetence Weekend Extravaganza!

Well, it was a relatively uneventful weekend. Too little work, too little school, too little World of Warcraft. But there was plenty of incompetence to go around.

It's been a couple weeks since my program at work shipped (I'm a contract worker), and I hadn't logged onto the company VPN since then until yesterday. I have a new project to work on, so it's about time I got started on it. I put my user name and password into the VPN client; several seconds later, my cell phone began to ring. It was an out of state number, although I recognized the area code as being the same area as my work. I answered it, and got a recording saying something like "please enter your PIN". The fact that my VPN login failed immediately after I hung up (having no idea what PIN they were talking about, or why I was supposed to enter it) made it seem likely that my work had switched to two-factor authentication; that is, after you enter your password, it calls your phone and asks for a second password.

So, I go into the company IRC channel (yes, it is a private, encrypted channel) and start bitching; you know, stuff like "What the **** is this PIN I'm supposed to be entering?" The reply was just as disturbing as the fact that I'd never set, nor been told, my PIN: it was e-mailed to me (in fact, it was the same for everybody)... through the company e-mail system, which is only accessible from the building or through the VPN. Brilliant, Einstein.

The solution ended up being simple and easy, though just as depressing: it appears that everyone in the company had the same default PIN (a trivial number), so it was easy to find somebody to tell me what theirs was, and get back on the VPN. So, I log on to the VPN, and try to get to work, only to find that the VPN is totally not working, and back to World of Warcraft I went.

So today, I try again. After some more bitching in the IRC channel, Skywing helps me debug the problem. The specific problem was that the routes to the company resources were missing from my computer. Further investigation on my computer and by Skywing on the backend revealed that somebody or something (it actually turned out to be a misbehaving SQL script) had wiped the route settings for a significant number of people (including all of the development team) on the pre-production server (the one the company uses for testing new versions not yet released to customers), which would have been propagated in turn to the computers of VPN users.

So, Skywing fixed that, and for the first time in a couple weeks I update my CVS checkout and start coding. My latest project relies on a certain third-party library which deals with routers, so I spent quite some time playing with the thing to figure out all the stuff I needed to know that wasn't in the SDK. One of the more puzzling things (when I was first starting out) was that most functions took an "encryption password" parameter. I supposed that that was the key to encrypt the router data it stores in the registry, but at the time it seemed unimportant, as the default value was NULL (suggesting it was optional).

The very first function I called returned an error code indicating invalid key. After some time of tweaking the parameters, trying to fix a variety of things that had the highest probability of being what the error was referring to, I had no luck. Naturally, the next thing I did was get out a debugger and disassembler, and started looking for the cause of the problem in the library. A number of calls to Microsoft CryptoAPI, revolving around that very encryption password, and ultimately resulting in failure, led me to the (correct) conclusion that that value was actually a password to use the library at all. While I haven't spent the time to confirm it with certainty, it appears that this password is hard-coded into the library, and the library functions will fail if the proper password is not supplied. As far as I can tell, this password is unique to the copy of the library given to each company that licenses the library. That's pretty much a textbook case of putting a vault door on a straw hut, right there.

I periodically check Slashdot throughout the day, and today was no exception. Much to my continuing dismay, I find Sony doing something so phenomenally stupid that it just blows away any dumb things they've done before (rootkits, anyone?). Of course I'm talking about Sony Strikes Again. In short, Sony has rolled out a new form of DVD DRM that requires old players to receive upgraded firmware to be able to play it; of course, Sony won't have updated firmware for their own players for a few weeks after the new DVDs have already hit stores. I don't know about the real people, but if I were the CEO of Sony, I'd find the person responsible for shipping DVDs that my own players won't be able to play for weeks, fire him, and then call every single media company in Japan to tell them what he had done, ensuring that he would never work in that industry ever again.

My final piece of stupidity, if no further incidents appear while I'm writing this post (it's already happened once today - this very incident occurred while I was telling someone about the Sony incident), involved my further work with the third-party library. To its credit, however, this wasn't a problem with the library; I merely discovered this while observing it map ports in my router. That is, I discovered that I can view ports mapped with UPnP in the web interface to my router, as well as what programs opened them. This discovery showed me that Windows Live Messenger, for the purpose of direct file sends (of which I've not made any since Messenger was started today), had mapped 39,538 ports to my computer, and was holding them until it was closed. I think that really speaks for itself, so I won't comment further on it.

Friday, January 12, 2007

I Win

So, yesterday my program at work (a configuration tool for their server product) went into official Q/A, and I've acquired a list of bugs (six so far). Some of these are absolutely hysterical. Take this one, for instance:

On two of the different servers we were setting up for our test networks, the computers froze after clicking finish with the connect now checkbox checked. This happened on both servers that made it that far.

One of the servers that froze seems to have completely screwed up the subnet it was on until it was unplugged from that router. In other words, workstations on the same subnet could not pass traffic through the router or renew their IP address from the dhcp server until the computer was unplugged from the router.

I was laughing out loud for about 20 minutes straight after I read that bug report. Besides the obvious hilarity of a single computer downing an entire subnet, my program runs entirely in user mode, and it would be a huge security hole in the OS if a user-mode program could wreak this kind of havoc. I suspect this is actually a bug in the server (which has a driver component) or the backend (which feeds instructions to the server), though I haven't started investigating, yet.
