It all started yesterday morning when I did something silly. That is, I powered on my computer with the expectation that it would function just as it had a mere handful or so of hours earlier. That was silly of me, you see, because it makes no sense to think that a computer that worked roughly six hours ago would still work six hours later. I just can't seem to get past my irrational expectation that a device that functioned fine when last used might still function a short time later. Silly me.
What changed, you ask, between Monday evening and Tuesday morning? I'll tell you: my motherboard ceased to have an on-board LAN. The very last thing I did before powering down Monday night was to synchronize a bunch of files across the network between my desktop and laptop computers. The first thing I noticed Tuesday morning when I turned on my computer was that none of my networked drives were available. This seemingly small problem ultimately wasted more than an entire day of my time and gave me some pretty surprising moments.
The first thing that I suspected with regard to my lack of network connections was that our house server might be down. It wasn't. Thanks to my new KVM, which was a nightmare in its own right, I was able to test and reject that hypothesis within a few seconds. My next suspicion was that Windows XP had simply glitched, but restarting the system twice failed to correct the problem. It wasn't until I checked my network connections folder that I realized that my computer no longer had a network adapter installed.
My motherboard at the time, a Gigabyte GA-7VAXP, had its own on-board LAN, courtesy of a Realtek 8100BL Ethernet 10/100Mb LAN controller to be precise. That device no longer appeared to be a part of my system, however, as both my network connections folder and device manager clearly didn't contain any reference to it. "Aha," I thought, "the on-board LAN must somehow have become disabled in the BIOS." I didn't know how that could happen, mind you, but it seemed like a reasonable conclusion.
Unfortunately, the device wasn't merely undetected by Windows XP; it was no longer listed in my BIOS settings either. That's a new one on me. My BIOS had an "Integrated Peripherals" page, upon which were listed all of the various system devices (e.g., IDE primary, IDE secondary, RAID controller, and AC97 audio). Prior to yesterday morning, there was always an item listed for the on-board LAN. As of yesterday morning, however, it was no longer there. Like I said, that's a new one on me. After trying several things to recover it, and checking many times to make sure I simply hadn't lost my mind, I decided that my Realtek controller had gone MIA.
To get through my basic morning administrivia, I decided to install an old 3Com 10/100 Ethernet PCI adapter that I had available. I hadn't used it in some time, but I knew I could count on it; it's from 3Com, after all, and their stuff usually just works without a hitch (though not always). I installed it, powered up the machine, and, sure enough, it was detected and available for use without any effort whatsoever on my part.
It was when I tried to configure it that I ran into a thick wall of typical Microsoftian stupidity. I lease four static IP addresses from my ISP, largely because I don't have a router, among other reasons. The final bytes of my addresses range from 65 to 68, which are assigned to our house server, my desktop machine, my laptop, and my wife's computer, respectively. When I tried to assign the x.x.x.66 address to the 3Com adapter, Windows XP gave me an interesting warning.
To be more specific, it told me that the address I was trying to assign was already in use by the Realtek network adapter, which it said was hidden from view because it was no longer installed in the system. Windows XP warned me that problems might ensue, were it again to be installed, and it offered me the opportunity to change the address to something else. I didn't want to change the address, however, because I didn't have any hope that my Realtek controller was going to self-resurrect any time soon. Thus, I told Windows XP that I was happy with the choice I'd made.
Windows XP responded by leaving the 3Com adapter unconfigured. Gee, wasn't that nice of it? It was thoughtful enough to warn me of a potential problem, but it got downright paternalistic when I told it that I knew what I was doing. Apparently, Microsoft thinks their operating system knows better than the user. As with all such silly paternalistic garbage, this would work nicely were that underlying assumption true, which it clearly isn't.
To make this already long part of the story short, I couldn't assign the IP address I wanted to assign because it was already in use by a device that was no longer installed. Yeah, that makes sense. Worse, because the device was no longer installed, I could do nothing to remove it. No amount of effort on my part turned up any safe and reliable way to get rid of the bloody Realtek controller, and were it not for later developments it would probably have been lurking in my registry forever. And to think my parents told me that ghosts weren't real!
The workaround turned out to be surprisingly simple, though typical in its Microsoftian stupidity. All I had to do was temporarily assign a different IP address. Once the 3Com card was configured correctly and working with that address (viz., x.x.x.68), I could then reconfigure the card with the desired address (viz., x.x.x.66) without any trouble—and without any warning message whatsoever for that matter, which makes very little sense as far as I'm concerned. In short, Microsoft still can't code their way out of a wet paper sack when it comes to making something work intelligently. Trustworthy computing my booty. I'd settle for competent or even barely functional computing, thank you very much.
Anyway, after tending to the morning administrivia (e.g., email and daily downloads), I went digging through my receipts for purchases from last year. I was quite happily surprised to discover that my GA-7VAXP motherboard was not merely under warranty; it was also still under an extended service plan that I bought for it. I normally don't buy those things, considering them a waste of money, but I apparently did so when I purchased that motherboard because of all my previous troubles. A quick call to Fry's Electronics confirmed that all I had to do was bring my receipt and the motherboard, and they would either replace it or upgrade me free of charge with no questions asked.
For once it seemed like my luck was on an upswing! I headed off to Fry's, hoping in my heart of hearts that they might be able to upgrade me to one of the newer motherboards based on NVIDIA's nForce2 chipset. I've had so many problems with VIA-based boards that I'm itching to try something else.
Once at Fry's, I was helped by a fellow named Jeffrey, who ended up spending far too much time with me throughout the rest of the day. Foreshadowing: sign of quality films and literature everywhere. Jeffrey determined pretty quickly that Fry's no longer carries the GA-7VAXP. It didn't take much after that to convince him that the most comparable board they had was a nice, new Gigabyte 7N-400 Pro, which is indeed based on the nForce2 Ultra400 chipset. Again it seemed like my luck was on an upswing! And twice in one day!
I should have known it was too good to be true. After going through a ton of paperwork, I headed home, installed the new board very carefully, checked everything very thoroughly, and powered it on. My reward, upon hitting the power switch, was... DOOO-BEEE-DOOO-BEEE-DOOO (shutdown). Ah, yes. How lovely.
To be less obscure, my system was giving me five tones at startup, alternating in pitch with every other tone, after which it shut itself down. All of this took a mere couple of seconds, so the motherboard clearly wasn't making it into its power-on self test (POST). When I consulted the manual to see what the tones meant, I discovered that five tones indicate some kind of CPU problem.
Since I've had problems in the past with CPUs and their heat sinks, fans, etc., that was what I first suspected. I removed the heat sink and fan, removed the CPU, carefully cleaned them both, applied a new layer of thermal paste (Arctic Alumina for those who are interested), and reinstalled them both. This time it seemed that the system made it a bit further into its power-up cycle before it gave me the five annoying tones and shut itself down. I began to think that heat might very well somehow be the issue, though I honestly couldn't understand how.
After still more fussing, I discovered that if I cranked up the fan on my Volcano 9 heat sink (sans "cool mod" in my case) all the way and forcibly throttled the front-side bus (FSB) down to 100 MHz via an on-board DIP switch, I could actually get past the POST and make it into the system BIOS before the system did its musical shutdown thing. What I found unnerved me: the motherboard was consistently reporting my Athlon 2400+ XP CPU as an Athlon 1800+ XP. Worse, though everything I had discovered indeed made it seem like heat was the problem, my CPU was running at roughly 50° C. That's a bit hot, of course, but it's not all that hot when you consider that it was roughly 38° C (100° F) in my office at the time.
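In hindsight, that 1800+ reading may have been less mysterious than it looked, since the BIOS most likely derives the model from the core clock, and the core clock is just the multiplier times the FSB. What follows is only a rough sketch of that arithmetic. The multiplier and rated clocks are commonly published figures for these chips, assumed here rather than read off my BIOS screens, and the names are made up for illustration.

```python
# Assumed, commonly published figures (not taken from the BIOS in this story).
MULTIPLIER_2400 = 15.0      # clock multiplier of an Athlon XP 2400+
STOCK_FSB_MHZ = 133.33      # stock front-side bus clock
THROTTLED_FSB_MHZ = 100.0   # FSB forced down via the on-board DIP switch
RATED_1800_MHZ = 1533.0     # approximate rated clock of an Athlon XP 1800+


def core_clock_mhz(multiplier, fsb_mhz):
    """Core clock is the multiplier times the front-side bus clock."""
    return multiplier * fsb_mhz


stock = core_clock_mhz(MULTIPLIER_2400, STOCK_FSB_MHZ)
throttled = core_clock_mhz(MULTIPLIER_2400, THROTTLED_FSB_MHZ)

# How far the throttled clock sits from the 1800+'s rated clock, in percent.
gap_percent = abs(RATED_1800_MHZ - throttled) / RATED_1800_MHZ * 100

print(f"stock: {stock:.0f} MHz, throttled: {throttled:.0f} MHz")
print(f"throttled clock is within {gap_percent:.1f}% of an 1800+")
```

If those assumed figures are right, throttling the bus alone drops the chip into 1800+ territory, so the odd model name may have been a side effect of the DIP switch rather than a symptom of damage.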
No matter what I did, I couldn't keep my system alive for more than a couple of minutes at most, and if I tried to boot Windows 98 or Windows XP the system would simply restart itself after a second or two of hard drive activity. I was growing more and more puzzled by the moment. It sure seemed that heat was somehow a factor, and yet the BIOS was reporting perfectly normal temperatures for my CPU. I was stumped.
But then I remembered something important. The box for my replacement motherboard had a sticker on it, a sticker which indicated that the board had already been purchased and returned at least once before. Suddenly, it all became clear: I must have been given a bad motherboard! Chuckling at my own foolishness for not making the connection sooner, I uninstalled it, packed it up carefully, and headed back to Fry's.
When I arrived, Jeffrey was still there, and he was both surprised and disappointed to hear of my troubles. I gave him the board, and he set it up on their testing rig. You could have knocked me over with a feather when it fired up immediately for him and worked just fine for several minutes, the CPU hovering at an average temperature of about 35° C. After listening to my entire tale, Jeffrey suggested that maybe my CPU had gone flaky too. I didn't understand how or why it would have because I'm normally very careful about handling it. Still, what he said made sense, and he suggested that I bring the CPU in for testing.
So off I went back home with the 7N-400 Pro motherboard. I should point out that the day was progressing into the late afternoon, so the California traffic was becoming worse with each trip. I had already been to Fry's twice that day, but I wasn't done making such trips just yet. Suffice it to say that as the commute time waxed, my patience waned.
When I arrived home I hooked everything up one more time and was saddened to see that the problem was still with me. So I snatched up my CPU and headed back to Fry's. All the way there I was castigating myself for frying it. I've never destroyed such an electronic component through my own carelessness in the past, and I was really disappointed that it seemed like I had finally done so. Still, I was hopeful that once the CPU was diagnosed as the "bad guy", I could then buy a new CPU and get my system working soon enough.
If I was surprised before, when the 7N-400 Pro worked at Fry's after failing to work at home, I was even more surprised when my CPU checked out just fine in Jeffrey's testing rig. He plugged it into a motherboard, and it was detected immediately as an Athlon 2400+ XP CPU just like it should have been. What had failed to work at my home about thirty minutes ago worked just fine at Fry's. It even ran cooler than it did at home.
At this point, I was upset. I was pissed off because I had essentially wasted another trip. I still didn't know what the problem was, and I had stupidly left my other components at home. Had I been thinking before I left, I would have brought the motherboard, my memory, and my video card as well, since those were the only components I had been using for testing purposes. Jeffrey agreed that I should bring them all, so that we could do a full-up test and find out which one was causing the bizarre behavior.
Unfortunately, it was now 3:30 p.m. and Jeffrey was due to get off work at 5:00 p.m. I told him I would try to make it back before then, but I couldn't make any promises because of the traffic. I drove as fast as the limits of the law and safety allowed, making it back home in about thirty minutes. I quickly removed the 7N-400 Pro from my system and gathered up my CPU, my memory, and my video card in anti-static bags for the return trip.
I drove back to Fry's even more speedily, and I made it there in time despite the still-heavier traffic. I gave Jeffrey the motherboard, my CPU, my memory, and my video card. He hooked them all up on his testing bench, and... they worked just bloody fine! Seriously, I think I actually felt my jaw hitting the floor when everything worked without a hitch. He even connected a hard drive to see if he could start loading an operating system, and that worked fine too. My CPU was being detected and reported correctly, my memory was passing its tests, and my video card was obviously working nicely as well. In short, every one of the components that had been involved in my failed testing at home checked out just fine at Fry's.
So what did that leave? The only thing Jeffrey and I could think of was the power supply. When I last upgraded my system, which wasn't all that long ago (October 2002), I bought a nice Antec case with a huge (viz., 430W) power supply from Fry's because I didn't want any more power problems. Silly me again, eh?
Fortunately, I had the receipt for that case with me, and after Jeffrey discussed it with his supervisor it was agreed that I could get a replacement power supply installed if I were to bring in my Antec case. That was all that I could really accomplish that day, so I headed home frustrated. I mean, I was glad that Fry's was going to stand by their hardware—that's one reason I try to give them as much of my business as I can—but I was utterly mystified that the problem could be my power supply.
But then I got to thinking about it, and I realized that it might make sense after all. Think about it, dear reader. Assuming that the power supply was indeed the culprit, what effect would reducing the FSB speed, and thus the CPU speed, have? Aside from slowing down the system, it would also make the CPU consume less power. If the CPU had barely been scraping by in terms of power, that might explain why I could get the system to work for a few minutes longer by making those changes.
Further, though my power supply had worked just fine that morning with the GA-7VAXP motherboard, there was an interesting difference between that board and the new 7N-400 Pro. That is, the latter made use of the four-pin +12V power connector in addition to the standard twenty-pin ATX main power connector, whereas the former did not. If the supply for the four-pin connector alone were flaky, then that might explain why the previous motherboard worked just fine while the new one wouldn't even POST.
I even tried a couple of experiments to reassure myself that I wasn't losing my mind. After I headed home frustrated, I hooked up the motherboard yet again—that's what the motherboard "Hokey Pokey" is all about—and confirmed that it was still failing in exactly the same way. The very same components in the very same configuration that I had seen work forty-five minutes earlier (traffic had grown even worse) no longer worked.
So I decided to rule out a short circuit as the cause of my woes. I got some nonconducting foam along with a nonconducting anti-static bag and physically held my motherboard in the air outside the case as I turned on the system. I got five tones and a shutdown for my trouble. Obviously, it was not shorting with the case. Next, I tried doing other bad things to see if the errors changed. Sure enough, when I removed all the RAM from the machine, I got a different error code which the manual reassuringly described as indicative of a memory error.
Most telling of all, however, was what happened when I unplugged the four-pin +12V power connector from the motherboard altogether. That is, I got the very same five tones followed by system shutdown. That four-pin connector is what provides the core power for the CPU, you see, and my system was behaving as if it weren't even connected when it clearly was. That made me pretty confident that my power supply was indeed somehow at fault.
Besides, if it wasn't the power supply, then I would have to start believing that I live in some kind of unreality field that defies all explanation. Not that I haven't wondered about such a thing in the past, mind you; in fact, I've wondered about it more than once. It's just that the power supply is the only other system component that was in use at my home that wasn't in use at Fry's. If the power supply wasn't the problem, then what rational explanation could there be for the problems I had been observing?
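That process of elimination boils down to a set difference. Here is a minimal sketch of it; the component names are just labels taken from this story, not anything a diagnostic tool would report.

```python
# Components in play during the failing tests at home.
failing_at_home = {"motherboard", "cpu", "memory", "video card", "power supply"}

# Components that worked together on the test bench at Fry's.
working_at_frys = {"motherboard", "cpu", "memory", "video card"}

# Anything present only in the failing setup is the prime suspect.
suspects = failing_at_home - working_at_frys

print(sorted(suspects))
```

The method is only as good as the lists that go into it: anything left off them can never turn up as a suspect.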
Needless to say, I just about lost my mind completely when I took the entire system in to Fry's this morning. Jeffrey was there again, and he put a brand new power supply into my case, turned it on, and was greeted with the same five tones followed by shutdown that I had been seeing all along. We were both utterly stumped at that point. The computer case held only the power supply, the motherboard, the CPU, the memory, and my video card. There wasn't anything left to test.
Or was there? Jeffrey removed the whole motherboard from my case and put it on his test bench again to verify that it was still good. Unlike yesterday afternoon, however, it was now giving him exactly the same problem. Swapping out the memory didn't change anything. Swapping out the video card didn't change anything.
Swapping out the CPU, however, changed everything. All of a sudden the system was booting just fine. So maybe it was the CPU after all? He and I couldn't understand how that could be the case, when Jeffrey had been so careful with it, but the system was booting with a different chip in place. I really didn't want to buy a new CPU, but it sure seemed like that was the problem.
Or at least, it seemed like that was the problem until Jeffrey noticed something that I hadn't, namely, that the CPU heat sink had been attached incorrectly. Honestly, I have to plead complete ignorance on this one. Today I've searched the manuals of four different Socket-A motherboards, as well as the skimpy documentation that has come with each of the four different heat sink and fan combos I've used. I haven't found anything that indicates that there is only one way that a heat sink should be installed.
And yet that is clearly the case. The plastic shape of Socket-A is such that heat sinks designed for it have a small beveling, roughly one millimeter in depth. This is cut out from the heat sink so that it can be placed over the largest portion of the plastic that comprises the Socket-A interface, in which an Athlon CPU may sit. If one does not install the heat sink properly then this bevel doesn't fall where it should, the result of which is that the heat sink is not shoved flatly against the CPU core but instead comes into contact with it at an angle.
It's a very slight angle, mind you, for we're talking about a bevel that is roughly one millimeter deep. I know for a fact that I've had heat sinks and fans installed improperly before, but I have never had a motherboard so sensitive to this in the past. I guess the 7N-400 Pro must have some pretty robust internal heat-safety checking for the CPU, because that one, lousy millimeter was enough to cause my problem. That one, lousy millimeter cost me more than a full day of my time and effort.
There are lots of lessons to be drawn from this whole mess. First, I now understand why I previously saw such different performance from my heat sink and fan depending upon which way it was installed. I included this in a previous writeup on CPU coolers, but I didn't have any explanation for it. I now do. I must have been seeing the difference between installing the CPU heat sink and fan correctly as compared to incorrectly. For those who wonder why I didn't try rotating the heat sink by 180° at some point, the answer is simple: nearby capacitors on the 7N-400 Pro make it extremely difficult to get my heat sink to install in that orientation. In effect, I was installing the heat sink the only way I thought I could get it installed, and, as such, it didn't occur to me to try flipping it around.
Second, when a motherboard is telling you that the CPU is the problem, the CPU is most likely the problem. Given my great deal of experience with so many error messages that are worse than meaningless, I'm always inclined to be skeptical of them. In the case of motherboards at least, however, it seems that the messages I was getting were correct. I'll try to remember that in the future.
Third, whenever the problem seems wholly inexplicable, intermittent, or just plain weird where motherboard installation is concerned, doubt the heat sink and fan combo immediately. If only I had spent some time rigorously observing the heat sink and fan, I might have noticed the bevel and come to develop the proper understanding on my own. As it was, the problems I was getting were so strange that I suspected a flaky motherboard instead.
Fourth, vendors of motherboards and heat sink and fan combos should be absolutely ashamed of themselves. I am not a stupid man. I now hold three separate degrees, one of which is in electrical engineering. I am not an inexperienced user. I have been building systems for a long time, and this was hardly my first experience with a heat sink and fan. I am not so impatient that I never read manuals. In fact, I read them very thoroughly before I install anything.
Despite all of those facts, however, I had no idea that there was a proper way to line up the bevel on a heat sink with the underlying socket. If someone else knows where this is documented, I would really love to see it. I haven't been able to find one whit of documentation of any kind on the subject. I think vendors should be ashamed for not making this obvious, since it's clearly terribly, terribly important.
Fifth, experiences like this are what make me value local merchants in general and Fry's Electronics in particular. Despite having plenty of other things to do, Jeffrey spent a great deal of time working with me to try to solve my problem. In the final analysis, it didn't cost me a thing to get a brand new motherboard and a brand new power supply. That extended service contract I purchased was a real life-saver in this situation, and you can bet I'll be buying them more often going forward. Fry's clearly stands by their stuff, and that makes me one happy customer.
Sixth and finally, Microsoft Windows still sucks in a clinch. Yes, Windows XP is the most usable and stable version of Windows to date, but after all my hardware troubles I was disappointed to find that Windows XP would no longer boot. It wouldn't even boot in safe mode. It would make it as far as loading Mup.sys, whatever that is, before it would simply restart the machine. Heck, even Windows 98 was able to reconfigure itself correctly after changing out the motherboard, but Windows XP couldn't. Talk about adding insult to injury!
What's more, Windows XP was unable to repair my existing installation, and it was also unable to re-install over the existing installation. Oh, it copied all the files just fine, but it then gave me some ridiculous error about how it couldn't initialize the registry. Gee, I sure am glad Microsoft invented the registry. I can't tell you how many times that rotten thing has screwed me, and now it's done it again. For not only do I get to reinstall Windows XP, you see, I get to reinstall all of my other software as well, because the registry cannot simply be copied or edited from an old installation as can those simple, textual *.ini files that it replaces. (sigh) I guess that's progress in the world of trustworthy computing, eh?
In closing, I'm glad that I've found and fixed the hardware problem. It was really starting to drive me up the wall. I've still got another two to three days of work ahead of me to get my system back up to speed, of course, but at least I have reason to believe I can actually do that at this point. I hope others can learn something from this mess too and perhaps avoid the motherboard "Hokey Pokey".
08/13/2003