Saturday, May 20, 2006

**** Me!

So, I randomly decided to rerun MemTest 86 on my computer, following one of its many recent BSODs. This proved rather... enlightening. The last time I'd run it on this computer, it was showing one or two invalid bits every run (always only on test #7 - random data, which takes about 10 minutes). Today, however, it was finding bad bits at a rate of about one every 10 seconds... yeah, that's bad. At that point I wondered why on earth I'd kept using this CPU for so long, and immediately pulled it and replaced it with an old 1 GHz Athlon, which is currently running MemTest 86 (and will be running some other diagnostics afterwards).

So, why did I think (and why do I still think) that it's a problem with the L1 cache, as opposed to the memory, motherboard, or L2 cache (listed from most to least probable for a memory error)? After running MemTest 86 for several hours (it took that long when it was only generating an error every 10 minutes or so; now I can get as much info in only 10 or 15 minutes), I noticed that the errors were always at bit 27 of the affected 32-bit word of memory - the bit was 0 when it should have been 1. The addresses, however, varied widely, and spanned all three of my DIMMs. The probability of this being a memory problem was virtually 0.
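For anyone who wants to pick the failing bit out of a MemTest-style "expected vs. actual" pair, here is a minimal Python sketch. The two data words are made up for illustration (they are not values from my logs), and I'm assuming bit 0 is the least significant bit.

```python
def failing_bits(expected, actual, width=32):
    """Return the bit positions (0 = least significant) where two words differ."""
    diff = (expected ^ actual) & ((1 << width) - 1)
    return [bit for bit in range(width) if (diff >> bit) & 1]


# Hypothetical data words: bit 27 reads back as 0 when it should be 1.
expected = 0xFFFFFFFF
actual = 0xF7FFFFFF

print(failing_bits(expected, actual))
```

A single stuck bit shows up as a one-element list; a list with several positions would point toward a different kind of fault.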

However, while there was no relation among the upper bits in the addresses where failure appeared, the pattern in the lower bits was unmistakable: the lowest 16 bits were, without exception, 0x0194 or 0x8194. That is, the lowest 15 bits were identical in all cases.
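The check I did by eye can be sketched in a few lines of Python. The addresses below are hypothetical: I made up the upper bits, and only the low 16 bits (0x0194 and 0x8194) match what MemTest 86 reported.

```python
def low_bits(address, n):
    """Keep only the lowest n bits of an address."""
    return address & ((1 << n) - 1)


# Hypothetical failing addresses: upper bits invented, low 16 bits as observed.
addresses = [0x1A2B0194, 0x03458194, 0x2FF00194, 0x0C018194]

low16 = {low_bits(a, 16) for a in addresses}
low15 = {low_bits(a, 15) for a in addresses}

print(sorted(hex(v) for v in low16))  # two distinct values
print(sorted(hex(v) for v in low15))  # a single value
```

Masking to 16 bits leaves two distinct values, while masking to 15 bits collapses them into one, which is the whole point: the failures repeat at a fixed offset within every 32 KB.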

Now, let's talk about my computer. It's an AMD Athlon XP 2200+. I looked up info on the AMD web site, and found that it has 128 KB of L1 cache, divided into a 64 KB data cache and a 64 KB code cache. Both are 2-way set associative. The L2 cache consists of 256 KB of 16-way set associative cache. This information allowed me to make some predictions for what would happen if a bit were bad in either of them. In the L2 cache, 256 KB / 16 (-way set associative) would mean that a single bad bit could show up at most once in every 16 KB of address space, always at the same offset. However, in the L1 cache, 64 KB (data cache) / 2 (-way set associative) means that a single bad bit could show up at most once in every 32 KB. Since 32 KB is 2^15 bytes, every such address shares the same lowest 15 bits - the exact pattern that was observed in MemTest 86, and even the same pattern observed in the BSOD crash dumps SW has been sorting through the last couple days.
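Here is the same arithmetic as a rough Python sketch. It assumes a simplified model where the cache is indexed purely by the low address bits and one cell at a fixed offset within a way is bad; the function names are mine, not anything from AMD's documentation.

```python
KB = 1024


def way_size(cache_bytes, ways):
    """Bytes covered by one way of a set-associative cache."""
    return cache_bytes // ways


def possible_low16(offset, stride):
    """Low-16-bit address values that map to the same offset within a way."""
    base = offset % stride
    return [base + k * stride for k in range((1 << 16) // stride)]


l1_stride = way_size(64 * KB, 2)    # L1 data cache: 64 KB, 2-way
l2_stride = way_size(256 * KB, 16)  # L2 cache: 256 KB, 16-way

print(l1_stride // KB, [hex(v) for v in possible_low16(0x0194, l1_stride)])
print(l2_stride // KB, [hex(v) for v in possible_low16(0x0194, l2_stride)])
```

Under this model a bad L1 cell allows only two low-16-bit patterns (0x0194 and 0x8194), while a bad L2 cell would allow four (0x4194 and 0xc194 as well). Only the first two ever turned up.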

Interestingly, as previously mentioned, only a single one of MemTest 86's tests showed a problem with the memory - the random number write/read test. All the rest of the tests use heuristics - methods that detect the failures common in memory chips due to the architecture of those chips. Thus it's not surprising that these tests would not be able to detect errors in the cache, with its entirely different architecture. Even so, I find it interesting to see this proven experimentally.

Update @ 6:24 PM: The new CPU has completed one full run of MemTest 86 (took about 1 3/4 hours) with no errors. While that's not a guarantee that it has no problems, it's almost certain that the bad bit problem in the previous CPU is gone, now.
