Where do I start, and where do I look for clues?
Are all the logs found in /var/log, or are there others?
In what order should I look at the logs, and what should I look for?
(!) [Thomas] It depends what you think went wrong. Essentially:
/var/log/messages
is where syslogd dumps most of its data, and so is the best place to start.
There may well be application-specific logs in /var/log as well;
XFree86.0.log is one such example.
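As a rough aid for deciding which logs to read first, here is a minimal
Python sketch that lists the files in a log directory, most recently
modified first. The function name recent_logs and its parameters are
illustrative only, not part of any standard tool. It assumes the logs are
plain files directly under the directory and does not descend into
subdirectories.

```python
import os
import time


def recent_logs(log_dir="/var/log", limit=10):
    """Return up to `limit` (mtime, path) pairs from log_dir, newest first."""
    if not os.path.isdir(log_dir):
        return []
    entries = []
    for name in os.listdir(log_dir):
        path = os.path.join(log_dir, name)
        try:
            if os.path.isfile(path):
                entries.append((os.path.getmtime(path), path))
        except OSError:
            # Skip entries that vanish or cannot be examined.
            continue
    entries.sort(reverse=True)
    return entries[:limit]


if __name__ == "__main__":
    for mtime, path in recent_logs():
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime))
        print(stamp, path)
```

The files touched closest to the time of the trouble are usually the ones
worth opening first.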
(?) Any pro-active steps I should be taking to get more info, should it
happen again?
The specifics of my case: my file server (a 750 MHz Athlon running SuSE 9)
simply locked up, and I couldn't get anything to display (GUI or command
line). I knew the machine was in trouble when it didn't respond to pings.
I had to hit the reset button to get it back (and deal with fsck,
naturally). Funny thing is, the system clock reset itself to 28 minutes
after midnight (when it should have read the middle of the afternoon), but
didn't lose the date. Odd, that. The machine's been running 24/7 for about
three weeks now (I set it up around then), and no sign of problems until
now.
(!) [Thomas] This might be framebuffer related. At the lilo/grub prompt,
type:
linux video=vga16:off
(!) [Thomas] to see if that has any effect.
There have been snippets of these effects mentioned in the past. The one
that springs to mind is:
http://linuxgazette.net/issue74/tag/9.html
(!) [K.-H] There are ways of still getting kernel info (pro active steps):
* plug an old printer into the lpX port and declare it the system console
(kernel compile parameter, and I don't know how exactly you activate it
-- maybe inittab).
* While the machine is running, switch to the system console (Alt-Ctrl-F10
on SuSE) and leave it there. A kernel oops/panic might show up there at
the next crash.
* search the SuSE config for the Magic SysRq keys -- the function should
be compiled into the kernel but has to be activated. Then you can press
weird key-combinations like Alt-SysRq-P for a register dump, Alt-SysRq-S
for a disk sync, ... see /usr/src/linux/Documentation for details.
* File server? What hardware? I had SCSI disks locking up my system for
various reasons (tagged queuing incompatibilities of individual drives,
cables that were too long, ...)
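For the SysRq item above, here is a minimal Python sketch that reports
whether the feature is currently switched on. It assumes a kernel that
exposes the setting in /proc/sys/kernel/sysrq, where 0 means disabled and a
non-zero value means enabled (root can switch it on by writing 1 to that
file). The function names sysrq_state and describe are illustrative only.

```python
def sysrq_state(path="/proc/sys/kernel/sysrq"):
    """Return the integer stored in the SysRq control file, or None."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        # File missing, unreadable, or not a number.
        return None


def describe(state):
    """Turn the raw value into a short human-readable message."""
    if state is None:
        return "SysRq control file not readable (may not be compiled in)"
    if state == 0:
        return "SysRq is compiled in but disabled"
    # Assumption: any non-zero value means at least some functions are on.
    return "SysRq is enabled (value %d)" % state


if __name__ == "__main__":
    print(describe(sysrq_state()))
```

It is worth running a check like this before the next lockup, since the
key combinations do nothing if the feature is off.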
(?) I'm going to keep your response handy -- several things to try.
Meantime, I realized I was booting the thing into runlevel 5 (rather
stupid, actually), so I've since changed it to 3. If it is, as someone
suggested, a framebuffer problem, maybe that will solve it for now. I'm
using a real old Voodoo 3 card I scrounged from my parts bin. If it happens
again, I'll have to tear the machine apart and start playing with the
memory, as someone else here suggested.
(?) Installing and configuring Linux is one thing. Learning how to do an
autopsy seems to be quite another!
(!) [Thomas] That's because generally one doesn't do it quite like that.
Problem diagnosis is situation dependent. In any given situation there is
often a small set of files and related information that you can analyse
without having to worry about the rest of the system.
Granted, this is related to how much information one is told at the time
(if you've been on this list for as long as I have, you'll come to realise
that usually we don't get any), and whether or not the person has tried to
remedy it.
In general though, poking around, taking an aspect of your system and
looking at what it does and how, is all related, and helpful when you come
to diagnose anything.
(?) Yes, well, I looked at the messages log, but saw only a gap time-wise
between cron processing around 4 in the morning, and the time of the crash.
I'm not sure which of the other logs are important in that case. Where do I
find the register dump (although I suspect it won't make much sense to me,
rather like those register dumps you get in Windows XP)?
(!) [Thomas] Syslogd might have logged it, if the problem was software
related, and indeed if the said program produced any errors. If it was
hardware, then it might not have, depending on the severity of the failure.
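To pin down the silent stretch before a crash, here is a minimal Python
sketch that scans syslog-style lines and reports large gaps between
consecutive timestamps. It assumes the traditional "Mon DD HH:MM:SS" prefix
on each line. That format carries no year, so one is supplied as a
parameter. The name find_gaps and its parameters are illustrative only.
Timestamps that jump backwards (for example after a clock reset) are not
reported as gaps.

```python
from datetime import datetime, timedelta


def find_gaps(lines, min_gap=timedelta(hours=1), year=None):
    """Yield (previous, next) timestamps that are at least min_gap apart."""
    if year is None:
        year = datetime.now().year
    prev = None
    for line in lines:
        try:
            # The first 15 characters hold e.g. "Mar  5 04:15:01".
            stamp = datetime.strptime(
                "%d %s" % (year, line[:15]), "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue  # Not a timestamped line; ignore it.
        if prev is not None and stamp - prev >= min_gap:
            yield prev, stamp
        prev = stamp


if __name__ == "__main__":
    try:
        with open("/var/log/messages", errors="replace") as f:
            for before, after in find_gaps(f):
                print("nothing logged between", before, "and", after)
    except OSError as err:
        print("could not read the log:", err)
```

The last entries before a long gap are the ones to read closely, along
with the first lines written after the reboot.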
(?) I'm using a real old Voodoo 3 card I scrounged from my parts bin. If it
happens again, I'll have to tear the machine apart and start playing with
the memory, as someone else here suggested.
(!) [Thomas] It might be memory, but as the link I gave you last time
around said, memory problems tend to be more 'visible' in the sense that
you get a lot of applications SEGFAULTing and SIGABRTing for no apparent
reason. In such instances, installing and running 'memtest86' is usually
of help.
(!) [K.-H] Most of the time I had the great luck of oopses and kernel