Linux Help
Server Crash Help
lussumo, Mar 23 2004, 03:39 PM (Post #1)


Grub-er
**

Group: Members
Posts: 33
Joined: 15-March 04
From: Toronto, Ontario
Member No.: 2,590



I recently installed Debian on my server (thanks to hughesjr).

Today I came into my office and found an error on the server's screen. I opened /var/log/syslog and found this:

debian kernel: MCE: The hardware reports a non fatal, correctable incident occurred on CPU 1.
debian kernel: Bank 2: 940040000000017a

I googled that error and found a range of responses from people saying it could be bad RAM or a bad CPU.
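The `Bank 2:` value is most likely the raw 64-bit MCi_STATUS register that the kernel's non-fatal MCE check dumps for that bank. The sketch below decodes only the architectural fields described in the Intel and AMD manuals (the valid/uncorrected flags and the low 16-bit MCA error code). It assumes that layout applies to the Athlon MP and ignores the model-specific bits; `decode_mci_status` is a hypothetical helper name.

```python
# Architectural bits of the 64-bit MCi_STATUS register as documented in the
# Intel SDM (Vol. 3) and AMD APM (Vol. 2); model-specific bits are ignored.
FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was NOT corrected
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MCi_MISC holds extra information
    58: "ADDRV",  # MCi_ADDR holds the error address
    57: "PCC",    # processor context corrupt
}

# Fields of the cache-hierarchy compound error code: 000F 0001 RRRR TTLL
REQUEST = ["generic", "generic read", "generic write", "data read",
           "data write", "instruction fetch", "prefetch", "eviction", "snoop"]
TXN_TYPE = ["instruction", "data", "generic", "reserved"]
LEVEL = ["L0", "L1", "L2", "generic"]


def decode_mci_status(status):
    """Decode the architectural fields of a raw MCi_STATUS value."""
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    mca_code = status & 0xFFFF  # low 16 bits: architectural MCA error code
    result = {"flags": flags, "mca_code": mca_code, "corrected": "UC" not in flags}
    if (mca_code & 0xEF00) == 0x0100:  # cache-hierarchy pattern, ignoring filter bit F (bit 12)
        rrrr = (mca_code >> 4) & 0xF
        result["cache"] = {
            "request": REQUEST[rrrr] if rrrr < len(REQUEST) else "reserved",
            "type": TXN_TYPE[(mca_code >> 2) & 0x3],
            "level": LEVEL[mca_code & 0x3],
        }
    return result


if __name__ == "__main__":
    # Value from the "Bank 2:" line in the syslog excerpt above
    print(decode_mci_status(0x940040000000017A))
```

For 940040000000017a this reports VAL, EN and ADDRV set with UC clear (a corrected error), and the MCA code 0x017a decodes as an eviction error at level L2 with a "generic" transaction type. If the architectural layout holds for this CPU, that suggests a problem in the processor's L2 cache rather than in main memory.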

The machine is a dual-processor Athlon MP 1800.

The back story: a few weeks ago it was happily running Windows 2000 Server when it suddenly started shutting down and rebooting. We reformatted it and reinstalled everything up to SP4, but it kept rebooting. A little research turned up a lot of people reporting the same thing after installing SP4. I was planning to ditch Windows anyway, so I decided to just install Linux and hope for the best.

When the machine was delivered to me (it had been co-located in Detroit), I turned it on and found that one of the CPU fans was failing. I ordered two new CPU fans and installed both before installing Linux.

So it seems likely to me that the CPU with the bad fan is now toast. But how can I be sure that is the problem? I don't want to spend money on a new CPU and find out it was the RAM, or spend money on new RAM and find out it was the CPU.

Any ideas on how I can debug this further and find the culprit?
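One way to see whether the errors keep coming from the same processor is to tally the MCE reports in syslog per CPU and bank; comparing the tallies before and after swapping the two processors between sockets then shows whether the errors follow the chip. This is only a sketch: the regular expressions are written against the two log lines quoted above and may need adjusting for other kernel versions, and `count_mces` is a hypothetical helper name.

```python
import re
import sys
from collections import Counter

# Patterns written against the syslog lines quoted above; the wording may
# differ on other kernel versions.
CPU_RE = re.compile(r"MCE: .*incident occurred on CPU (\d+)")
BANK_RE = re.compile(r"kernel: Bank (\d+): ([0-9a-fA-F]+)")


def count_mces(lines):
    """Tally MCE reports per (cpu, bank), assuming each 'Bank' line
    follows the 'incident occurred on CPU N' line it belongs to."""
    counts = Counter()
    cpu = None
    for line in lines:
        m = CPU_RE.search(line)
        if m:
            cpu = int(m.group(1))
            continue
        m = BANK_RE.search(line)
        if m and cpu is not None:
            counts[(cpu, int(m.group(1)))] += 1
    return counts


if __name__ == "__main__":
    # Usage: python count_mce.py /var/log/syslog [more log files...]
    for path in sys.argv[1:]:
        with open(path, errors="replace") as log:
            for (cpu, bank), n in sorted(count_mces(log).items()):
                print(f"{path}: CPU {cpu}, bank {bank}: {n} report(s)")
```

If the counts move to the other CPU number after the swap, the chip itself (or the fan that sits on it) is the likelier suspect; if they stay with the same CPU number, the socket, board, or cooling in that position deserves a closer look.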
lussumo, Mar 23 2004, 05:02 PM (Post #2)





Update: I did some more digging and found this:

http://www.geocrawler.com/archives/3/35/20...3/3/50/10413120

QUOTE
Looks like the L2 cache ECC checking spotted something going wrong,
and fixed it up. This can happen in cases where there is inadequate
cooling, power, or overclocking (or in rare circumstances, flaky CPUs)
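Since the quoted explanation lists cooling first, it may be worth watching CPU temperatures once the new fans or processors are in. The sketch below reads the temperature sensors that current Linux kernels expose under /sys/class/hwmon, where `temp*_input` files hold millidegrees Celsius. It assumes the board's sensor chip has a driver; older kernels exposed sensor data at different paths, and there the `sensors` command from lm_sensors is the practical route. `read_temps` is a hypothetical helper name.

```python
from pathlib import Path


def _read(path, default=None):
    """Read a small sysfs file, returning default if it is missing or unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return default


def read_temps(root="/sys/class/hwmon"):
    """Return {(chip_name, sensor_label): degrees_C} for every readable
    temp*_input file under the hwmon sysfs tree."""
    temps = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        chip_name = _read(chip / "name", chip.name)
        for inp in sorted(chip.glob("temp*_input")):
            # tempN_label is optional; fall back to the file name
            label_file = inp.with_name(inp.name.replace("_input", "_label"))
            label = _read(label_file, inp.name)
            raw = _read(inp)
            if raw is None:
                continue  # some sensors fail to read; skip them
            try:
                temps[(chip_name, label)] = int(raw) / 1000.0  # millidegrees -> C
            except ValueError:
                continue
    return temps


if __name__ == "__main__":
    for (chip, label), celsius in sorted(read_temps().items()):
        print(f"{chip} {label}: {celsius:.1f} C")
```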


I shut the machine down and took out CPU 2 (those CPU fans are HARD to get off!). After cleaning off the thermal compound, I could see that it *is* pretty browned, more than you would expect.

At this point I think I'll just replace both CPUs with newer ones.

Just thought I'd let anyone reading this know that I've resolved the issue (sort of), or will soon.