And elanor was not guilty after all… (a Twilight Zone episode)

Posted by – 23/02/2005

That will surely go into my “book of the strangest things I’ve seen wrt computers”. As I said before, greyhavens, the Pentium II I use as an intermediary router, was suffering from an unidentified disease that froze it so hard it would not even respond to SysRq sequences. Well… I found the cause of the illness, but first lemme tell you the whole thing…

Overnight, greyhavens, which I built with spare parts I had at home, froze with no further warning. I had updated its kernel about a week earlier, and remembered that fact, but I rebooted it and, as everything went fine, I just let it run… for the next 24 hours. Then it froze again… I booted the older kernel (just to be on the safe side) and ran some checks (all passed OK). It ran for about 3 hours and froze up again. That gave me the challenge: what was causing this sudden illness?

I checked everything. Since greyhavens is nothing but a simple router, with a 350 MB disk, there was not much to check. I cleaned it, checked the coolers and the temperature (it has been really hot these summer days), ran memtest and e2fsck, replaced the CMOS battery (OK, I was desperate), etc, etc, and it just repeated the same behavior over and over again: it ran for a while and then froze.

When I wrote the last blog entry, I was almost sure some piece of hardware was faulty… The question was which one…

Without a clue, I just adopted the “standard attitude”: I observed. Then it hit me: elanor’s LEDs were not blinking. Elanor (named after Samwise Gamgee’s daughter) is a PCI ethernet card I am very fond of. “She” has been with me for longer than I can remember (I suspect she came with the first PCI computer I had), and she has been a spare card for the last three years. Since she has both a BNC and an RJ-45 connector, she is a very useful spare piece, and every time there was a computer event, I brought her with me (I used her at the 3rd FISL and at DebConf4). When I built the router, though, she was promoted to a first-class citizen, and has been inhabiting greyhavens ever since.

“What the hell!!! Elanor is dead!” It did not make sense at all. Since greyhavens worked as expected before freezing, the idea that elanor was dead was just nonsense… Unless she was dying.

I’ve seen a lot of NICs dying, and there are always some signs: they begin to cause errors, DUPs in pings, fluctuations in response time, etc, etc. And Elanor was not showing any of these signs. In fact, she was perfectly healthy. But there it was: the LEDs were off, and that meant only one thing: my elanor was dead.
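For the curious, here is a minimal sketch of how one could look for the "errors" kind of sign on a Linux box, by reading the per-interface counters the kernel exposes in /proc/net/dev. It is not what I actually ran back then; the function names (`parse_net_dev`, `suspicious`) are made up for illustration, and the column positions assume the usual 16-counter layout of that file (8 receive counters followed by 8 transmit counters).

```python
def parse_net_dev(text):
    """Parse the contents of /proc/net/dev into a dict of error counters.

    Assumes the usual layout: two header lines, then one line per
    interface with 8 receive counters followed by 8 transmit counters.
    """
    stats = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not an interface line
        fields = rest.split()
        if len(fields) < 16:
            continue  # unexpected layout, skip rather than guess
        stats[name.strip()] = {
            "rx_errs": int(fields[2]),      # receive errors
            "tx_errs": int(fields[10]),     # transmit errors
            "tx_carrier": int(fields[14]),  # carrier losses on transmit
        }
    return stats


def suspicious(stats):
    """Return the names of interfaces with any non-zero error counter."""
    return sorted(name for name, c in stats.items() if any(c.values()))
```

To use it, read the file and pass its contents along, for example `suspicious(parse_net_dev(open("/proc/net/dev").read()))`. A healthy card should normally not show up in the result, while a dying one (or a bad cable) would be expected to push these counters up over time.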

I bought a new NIC and replaced Elanor. When I turned the power switch on, imagine my surprise when I realized that the brand new NIC also had its LEDs off!

What was going on?!? I knew the network switch was good, and I had tested the cable in every connector just to be sure. Besides, it was plain nonsense to believe that a bad switch could freeze a Linux box so badly that not even SysRq sequences did any good. Then, since I was clueless (and had spent R$ 23,00 on the new NIC), I just replaced the cat-5 cable that linked the new NIC to the switch… voilà. Greyhavens was up and running again…

I swapped the new NIC for elanor again, and rebooted greyhavens. Everything is alive again…

I just cannot believe that a faulty cat-5 cable is able to freeze a Linux box as hard as greyhavens was frozen. I have never heard of anything like that before, and I will not be surprised if none of the readers has had this experience either. Anyway, it really happened… as weird as it may seem.
