Search This Blog

Wednesday, February 10, 2010

...Now What?

A few weeks ago my parents' desktop died. This was my old Athlon XP computer I had before I bought the one I have now (about a year and a half ago). A bit of trivial diagnostics demonstrated that either the CPU or motherboard had died, and I didn't have the parts necessary to determine which. So, my dad went looking on eBay for a replacement motherboard and CPU to go with the existing memory, as they didn't really have the money to buy a new computer right now (actually, I was the one who did most of the looking at the specs, to make sure it would work).

A few days ago, the motherboard+CPU arrived, and I quickly set to determining whether or not they worked, as the amount of time to return it was limited. As my past experiences have made me a little paranoid in that respect, naturally the first thing I did was run MemTest 86 on it.

While it seemed to work for the most part, about once every pass (took a couple hours) on test #7, an error would show up. After leaving it running overnight, some patterns emerged. The error was always the 0x02000000 bit getting set when it should be cleared, and it usually occurred when the value written to the memory was 0xFDxxxxxx (less commonly, 0xF9xxxxxx). The base address of the error seemed to always be at 0x3C, 0x7C, 0xBC, or 0xFC. The errors also appeared to cluster in the low 512 megs of the memory space.

In short, it appeared to be 1 bit every 64 bytes, where the containing byte changes from 11111x01 to 11111x11.

Separately, neither of the two 512-meg DIMMs shows this problem, regardless of which of the two sockets the DIMM is in (as tested by running MemTest 86 overnight, which would be expected to produce about a dozen errors). However, when put in in one order, this error occurs. Peculiarly, when put in the opposite order, MemTest begins spewing out memory errors before soon crashing (presumably due to memory errors in the locations the program is at).

So, where is the problem? Well, I wish I knew the answer to that.

The fact that both DIMMs worked correctly when alone suggests that the DIMMs are good (these are the same ones that were used in the computer that failed, and to my knowledge there was nothing wrong with them then). The fact that each DIMM works in both sockets suggests that it's not a bad socket. The fact that it's every 64 bytes tells me that it's not one of the data lines on the DIMMs or on the motherboard, as the data width is 64 bits (64 pins). The fact that the error period isn't into the kilobytes indicates that it's not a CPU cache failure, as I saw in the past.

64 bytes is the size of the cache line for the CPU, which seems to hint that it's somewhere after where the 64-bit memory reads are assembled into whole cache lines; unfortunately, I don't know where exactly this occurs, to narrow down the possibilities. Yet the fact that it only seems to occur in the lowest 512 megs of the memory space is very suspicious, and seems to suggest something with one of the DIMMs, sockets, or data paths on the motherboard, which appear to already be ruled out from earlier tests.

So, I'm really at a loss as to what to make of this and what to do next, and I've got three days left to do it. I can't test the memory on another computer, because I don't have another computer to test it on. The BIOS could use updating, which could theoretically fix the problem, but I've been reluctant to do that because it's a trivial way to say "You broke it, I'm not giving you a refund!"

I suppose I need to try mixing and matching the CPUs and motherboards. I need to do that anyway, to find out exactly what broke on the old computer, but it should also tell me whether the problem is in the CPU or motherboard (though nothing more specific than that).

No comments: