When the SCO UNIX kernel panics, it often provides information
about the cause of the panic in two forms: a "register dump"
displayed on the console, and a "system image" saved in the dump
device. The register dump contains the values of the CPU registers
at the time of the panic. The system image contains the entire
contents of system memory at the time of the panic. This
information can be used to examine the state of the kernel when
the panic occurred, and can often be used to determine the cause
of the panic.
Please Note: Not all system panics generate a register dump.
There are two steps in diagnosing the cause of a panic. First,
you must determine whether the panic is consistent or inconsistent.
Second, you must use the crash(ADM) utility to determine what the
kernel was executing at the time of the panic.
I. DETERMINING THE CONSISTENCY OF THE PANIC
1. Reading the Register Dump
The register dump is displayed on the system console, and is not
stored in any file. This information will be lost when the system
reboots. In order to preserve this information you need to make
sure that the system will not auto-reboot, and to make sure that
someone writes the panic message down before rebooting. (See
section I.3 below for instructions on how to prevent auto-boot.)
If a register dump is present, it will look like the following
sample register dump, with 'NNNN' replaced by the actual number
of memory pages in your machine.
=========================================================================
PANIC:
cr0 0xFFFFFFEB cr2 0x00FFFFFF cr3 0x00002000 tlb 0x00500E80
ss 0x00000038 uesp 0xD0119554 efl 0x00010282 ipl 0x00000000
cs 0x00000158 eip 0xD0070488 err 0x00000000 trap 0x0000000E
eax 0x00FFFFFF ecx 0x00000000 edx 0x00000305 ebx 0xD00CD780
esp 0xE0000D40 ebp 0xE0000D64 esi 0xD0119554 edi 0x00000038
ds 0x00000160 es 0x00000160 fs 0x00000000 gs 0x00000000
PANIC: Kernel mode trap. Type 0x0000000E
Trying to dump NNNN Pages.
...................................................................
...................................................................
NNNN Pages dumped
** Safe to Power Off **
- or -
** Press Any Key To Reboot **
=========================================================================
Note that most of the register dump consists of character strings
followed by hexadecimal values. The character strings are the names
of the registers in your 80386 or 80486 chip: the values are the
contents of those registers at the time of the panic.
Find the characters "cs" in the register dump. This is the chip's
Code Segment (CS) register. Find the characters "eip" in the
register dump. This is the chip's Instruction Pointer (IP)
register. The values in the CS and IP register combine to form the
address of the instruction the kernel was executing at the time of
the panic. This value is sometimes called the "PC" value.
Write down these values as a pair of numbers, separated by a colon,
and without leading zeros.
Please Note: Trailing zeros are important. For example, in the
register dump above, the value of CS is 0x00000158,
and the value of IP is 0xD0070488.
Therefore, the PC value you should write down is 158:D0070488.
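The rule above (drop the "0x" prefix and leading zeros, keep trailing
zeros, join with a colon) can be sketched as a small shell helper. This
is only an illustration: the names strip_hex and pc_value are made up
for this example and are not SCO utilities.

```shell
# strip_hex VALUE
# Removes an optional 0x/0X prefix and any leading zeros. Trailing
# zeros are kept. A value of zero is printed as a single 0.
strip_hex() {
    echo "$1" | sed -e 's/^0[xX]//' -e 's/^0*//' -e 's/^$/0/'
}

# pc_value CS EIP
# Prints the PC value in the form cs:eip.
pc_value() {
    cs=`strip_hex "$1"`
    ip=`strip_hex "$2"`
    echo "${cs}:${ip}"
}

# Values taken from the sample register dump above.
pc_value 0x00000158 0xD0070488      # prints 158:D0070488
```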
IMPORTANT: When your system panics, the person who reboots the
system should write down the panic message and (if present) the
PC value in your system log.
2. Hardware and Software Panics
Panics can be caused by defective hardware or by a software problem.
If the PC value varies widely from panic to panic, you almost
certainly have a hardware problem. If you have three or more panics
with the same PC value, then it is likely that you have a software
problem. Follow the instructions in section II below to try to
diagnose your problem.
IMPORTANT: It is possible for defective RAM to cause multiple panics
at the same address. This is a hardware problem, and cannot be
fixed via software.
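If the PC values from your system log are copied into a plain text
file, one value per line, a short script can tally them. The sketch
below assumes such a file exists (its name and layout are hypothetical,
as are the function names) and that the values were written down in a
consistent case. A repeated value only suggests a software problem: as
noted above, defective RAM can also produce it.

```shell
# count_pc_values FILE
# FILE is a hypothetical log with one PC value per line, for example
# 158:D0070488. Prints each distinct value preceded by the number of
# panics that had it, most frequent first.
count_pc_values() {
    sort "$1" | uniq -c | sort -rn
}

# repeated_pc_values FILE
# Prints only the PC values that were seen three or more times.
repeated_pc_values() {
    sort "$1" | uniq -c | awk '$1 >= 3 { print $2 }'
}
```

If repeated_pc_values prints nothing and count_pc_values shows many
different values, the panics are inconsistent.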
If the PC value varies widely, you can attempt to determine the
piece of hardware that is causing the problem. Strip your system
down to a minimum configuration: just the hard disk, the video
card, and a minimal amount of RAM. Then add hardware piece by
piece until the panics begin to occur. If necessary, swap parts
of the minimal system (such as the video card) with ones from a
system which does not panic. Always keep a notebook of your
efforts, and remember that this is likely to be a long, frustrating,
and difficult search.
IMPORTANT Note: One common cause of system panics is a defective
power supply or "dirty" power. SCO Support recommends the use of an
uninterruptible power supply for critical systems.
3. Preventing SCO UNIX from autobooting.
SCO UNIX can be configured to reboot automatically after a panic.
This is controlled by the value of the "PANICBOOT" variable in the
file "/etc/default/boot". To prevent auto-reboots after a panic,
edit the file so that this line reads "PANICBOOT=NO". Make sure
that the 'P' in "PANICBOOT" is the first character on the line.
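After editing the file, the setting can be checked with a one-line
sed command. The sketch below wraps it in a function whose name,
panicboot_setting, is made up for this example. It matches only lines
where the 'P' is the first character, so an indented line is reported
as missing.

```shell
# panicboot_setting FILE
# Prints the value of PANICBOOT from FILE. Prints nothing if no line
# begins with "PANICBOOT=" in the first column.
panicboot_setting() {
    sed -n 's/^PANICBOOT=//p' "$1"
}
```

Running "panicboot_setting /etc/default/boot" should print NO once
auto-reboot has been disabled.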
II. USING CRASH(ADM)
IMPORTANT Note: The following procedure is only useful if your
panics are due to a software problem. You MUST have at least three
panics with identical PC values in order for the results of this
procedure to be meaningful.
1. Obtaining a System Dump Image
You will need to save a copy of the system dump image for use
with crash(ADM). When you reboot the system, you will see a prompt:
There may be a system dump memory image in the swap device.
Do you want to save it? (y/n)>
Type 'y' and hit <Return>. Next you will see the prompt:
Use Floppy Drive 0 (/dev/rfd0) by default
Press ENTER to use default device.
Enter valid Floppy Drive number to use if different.
Enter "t" to use tape.
>
SCO Support strongly recommends that you not use floppy drives to
save system dump images. Since the typical SCO UNIX system has
many megabytes of memory, you will need to use several floppy disks
to save a single image. Problems can arise if you do not have
enough floppy disks, or if you insert them in the wrong order.
Type 't' and hit <Return>. Next you will see the prompt:
Enter choice of tape drive :
1 - /dev/rct0
2 - /dev/rctmini
n - no, QUIT
>
Type '1' or '2' as appropriate for your system. Next you will
see the prompt:
Insert tape cartridge and press return, or enter q to quit.
Insert your tape, and press <Return>. You will see a message similar
to this one:
Wait.
dd if=/dev/swap of=/dev/rct0 bs=120b count=751 skip=0
(The actual numbers may vary for your machine.) Finally, you will
see the message:
Done. Use /etc/ldsysdump to copy dump from tape or diskettes
Press return to continue >
At this point you have saved the system dump image to tape. Press
<Return> and continue with the boot process. When the boot process
has completed, you will be ready to load the system dump image
onto your hard disk. Log in as root, enter the commands:
# cd /tmp
# ldsysdump image
Note that the pound sign ('#') is a prompt from root's shell: do not
type it in. Also note that the argument to ldsysdump(ADM) is the
file name in which the system dump image is stored - it is
"image" in this example, but can be any file name. You will next
see the prompt:
Use Floppy Drive 0 (/dev/rfd0) by default.
Press ENTER to use the default.
Enter valid Floppy Drive number to use if different than default.
Enter "t" to use tape drive.
>
Type 't' and press <Return>. Next you will see the prompt:
Enter choice of tape drive :
1 - /dev/rct0
2 - /dev/rctmini
n - no, QUIT
>
Type '1' or '2' as appropriate for your system. Next you will
see the prompt:
Insert tape cartridge and press return, or enter q to quit. >
Insert your tape, and press <Return>. You will see a message
similar to this one:
Wait.
dd if=/dev/rct0 bs=120b count=751
(The actual numbers may vary for your machine.) Finally, you will
see the message:
System dump copied into image. Use crash(1M) to analyze the dump.
At this point you have a system dump image that you can use with
the crash(ADM) utility.
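The dd messages shown above also tell you how large the image file
will be, which is useful for confirming that the target filesystem has
enough free space before running ldsysdump(ADM). In dd, the "b" suffix
means 512-byte blocks, so "bs=120b count=751" is 120 x 512 x 751
bytes. A minimal sketch of the arithmetic (the function name
dump_bytes is made up for this example):

```shell
# dump_bytes BLOCKS_PER_RECORD COUNT
# Prints the size in bytes of a dump written with
# dd bs=<BLOCKS_PER_RECORD>b count=<COUNT>.
dump_bytes() {
    expr "$1" \* 512 \* "$2"
}

# Numbers taken from the sample dd message above.
dump_bytes 120 751      # prints 46141440
```

For the sample numbers this is about 44 megabytes. Substitute the
values from the dd message on your own machine.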
2. Obtaining a Stack Trace
The crash(ADM) utility will allow you to examine the state of
your system at the time of the panic. The crash(ADM) manual page
describes this command in detail. Knowledge of UNIX kernel
internals is necessary to use crash(ADM) to its full potential.
However, there are a few simple commands that can give you a good
idea about what was going on at the time of the panic.
To invoke crash(ADM), enter the command:
# crash -d image
You will then see the crash(ADM) prompt, which is a '>' character.
Note that the argument to crash(ADM) is the file in which the image
is stored - this is "image" in this example, but it can be any
file name.
IMPORTANT: crash(ADM) will only give meaningful results if the file
"/unix" is the one which generated the system dump image. If you
re-link your kernel, attempting to use the new "/unix" with the old
image file will give inaccurate results!
Three commands are particularly useful. The 'panic' command
shows a limited register dump, plus the partial contents of the
kernel stack at the time of the panic. The 'trace' command shows
the complete contents of the kernel stack at the time of the
panic. By examining the contents of the kernel stack it is often
possible to determine the cause of the panic. The 'user' command
displays information about the program which was running at the
time of the panic. This information can help determine which
kernel subsystem was being used when the panic occurred. Type
'quit' to exit from the crash(ADM) command.
3. A sample crash(ADM) session.
Here is a sample crash(ADM) session. The commands entered by the
user are in front of the '>' prompt. The rest is output from the
crash(ADM) command. In this session, the user entered a 'panic'
command, followed by a 'user' command and a 'quit' command.
=========================================================================
# crash -d image
dumpfile = image, namelist = /unix, outfile = stdout
> panic
System Messages:
WARNING:
Panic String: Kernel mode trap. Type 0x%x
Kernel Trap. Kernel Registers saved at e0000d10
ERR=0, TRAPNO=14
cs:eip=0158:d0070488 Flags=10282
ds = 0160 es = 0160 fs = 0000 gs = 0000
esi= d0119554 edi= 00000038 ebp= e0000d64 esp= e0000d40
eax= 00ffffff ebx= d00cd780 ecx= 00000000 edx= 000003d5
Kernel Stack before Trap:
STKADDR FRAMEPTR FUNCTION POSSIBLE ARGUMENTS
e0000d40 e0000d64 panopen (3800,1,2,d0119554)
e0000d6c e0000d88 s5openi (d0119554,1,1,d0119554)
e0000d90 e0000dac copen1 (d0119554,1,e0001148,d00adf20)
e0000db4 e0000de0 copen (1,403594,e0000e38,7ffffe7c)
e0000de8 e0000df8 open (403594,0,40358c,d011e274)
e0000e00 e0000e2c systrap (e0000e38)
> user
PER PROCESS USER AREA FOR PROCESS 5
USER ID's: uid: 0, gid: 0, real uid: 0, real gid: 0
PROCESS TIMES: user: 1, sys: 8, child user: 0, child sys: 0
PROCESS MISC:
command: sh, psargs: -
proc slot: 5, cntrl tty: 3,1
start: Wed May 22 11:31:10 1991
mem: 18, type: fork
proc/text lock: none
inode of current directory: 149
OPEN FILES AND POFILE FLAGS:
[1]: F#1, 5b [2]: F#1, 43 [3]: F#1, 0
[4]: F#2, 0 [5]: F#3, 0
FILE I/O:
u_base: 402d21, file offset: 3728, bytes: 672,
segment: sys, cmask: 0077, ulimit: 2097152
file mode(s): read write
SIGNAL DISPOSITION:
1: 1bfc 2: 1d50 3: 1d50 4: 1bfc
5: 1bfc 6: 1bfc 7: 1bfc 8: 1bfc
9: default 10: 1bfc 11: 1d50 12: 1bfc
13: 1bfc 14: 1d50 15: 1d50 16: 1bfc
17: 1bfc 18: default 19: default 20: default
21: default 22: default 23: default 24: default
25: default 26: default 27: default 28: default
> quit
=========================================================================
As you can see, the output of crash(ADM) contains a large amount
of highly cryptic information. Only a few pieces of it are
needed to get an idea of what is going on.
The output of the 'user' command contains information about the
process that was executing at the time of the panic. The main
thing you should look at is the "PROCESS MISC" section. Under
it you will see two fields which tell you what command was running:
"command" is the name of the command, and "psargs" gives the first
few arguments to the command. The other fields of interest in the
'user' command output are in the "USER ID's" section, which gives
the real and effective uids of the user who ran the process.
In the sample output above, the command being run at the time of
the panic was "sh" - the Bourne shell. Since the "psargs" field
consists of a "-", this shell was a login shell. Since the values
in the "USER ID's" section are all 0, this shell was being run
by root.
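If the crash(ADM) output has been saved to a file, the "command" and
"psargs" fields can be pulled out with sed rather than read by eye.
This is a sketch only: the function name panic_command is made up, and
it assumes a sed that understands the [[:space:]] character class and
a line laid out as in the sample session above.

```shell
# panic_command FILE
# FILE holds saved crash(ADM) output that includes the output of the
# 'user' command. Prints the command name and its arguments.
panic_command() {
    sed -n 's/^[[:space:]]*command:[[:space:]]*\([^,]*\),[[:space:]]*psargs:[[:space:]]*\(.*\)$/command=\1 psargs=\2/p' "$1"
}
```

For the sample session, this prints "command=sh psargs=-".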
The most important part of the 'panic' command output is the
section marked "Kernel Stack before Trap". The column labeled
"FUNCTION" contains the sequence of kernel function calls which led
up to the panic. The 'trace' command output is similar, but also
shows the sequence of function calls after the event which caused
the panic. This stack is displayed as growing upwards, so that
functions are called by functions below them in the stack, and call
functions that are higher in the stack.
In the sample output above, the kernel panicked while invoking
the function panopen(). An examination of the kernel stack shows
that this function was called (via copen(), copen1(), and
s5openi()) from the function open(). The combination of these two
facts suggests that the panic was caused by a bug in the "pan"
driver.
Note that the problem may not be in the top-most function in the
function stack. If, for example, a kernel utility function such
as physio() was invoked with improper arguments, physio() would be
at the top of the kernel stack, even though the problem was in the
function which called it.
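When comparing several panics, it can help to reduce each saved
crash(ADM) output to just the FUNCTION column. The awk sketch below
does that for the "Kernel Stack before Trap" section. The function
name stack_functions is made up, and the script assumes the column
layout shown in the sample session (lowercase hexadecimal addresses in
the first two columns, the function name in the third).

```shell
# stack_functions FILE
# FILE holds saved crash(ADM) output that includes the output of the
# 'panic' command. Prints the FUNCTION column, top of stack first.
stack_functions() {
    awk '
        /Kernel Stack before Trap/  { in_stack = 1; next }
        in_stack && $1 == "STKADDR" { next }    # skip the column header
        in_stack && $1 ~ /^[0-9a-f]+$/ && NF >= 3 { print $3; next }
        in_stack                    { in_stack = 0 }    # section ended
    ' "$1"
}
```

For the sample session this prints panopen, s5openi, copen1, copen,
open, and systrap, one per line. Identical lists from several panics
are a further sign of a consistent, software-related panic.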
4. Diagnosing the problem
Diagnosing kernel panics requires a knowledge of kernel internals
and experience in debugging 'C' code.
It is often useful to know in which file the problematic function
is located. This can sometimes tell you if the problematic
function is part of the operating system, or if it is part of a
third-party or SCO-supplied driver.
Two shell scripts are given below. Each prints the names of the
kernel component files which reference a particular function.
Script #1 requires that you have the nm(CP) program installed on
your system. If you do not have it installed, use Script #2, which
uses strings(C) in place of nm(CP).
nm(CP) is provided with the SCO UNIX Development System and the
Open Desktop Development System.
strings(C) is provided with the standard operating system.
This example searches for the function sioopen(): substitute the
appropriate function name for "sioopen". Log in as root and enter
the command:
(Script #1)
# for i in `find /etc/conf/pack.d -name '*.o' -print`
> do
> nm $i 2> /dev/null | grep sioopen > /dev/null && echo $i
> done
-or-
(Script #2)
# for i in `find /etc/conf/pack.d -name '*.o' -print`
> do
> strings $i 2> /dev/null | grep sioopen > /dev/null && echo $i
> done
The '#' and '>' characters at the left are prompts from root's
shell: do not type them in. Be sure to get the "`" and "'"
quotes right!
Write down the list of file names printed by the above command.
For this example, you will get the output:
/etc/conf/pack.d/sio/Driver.o
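If you expect to repeat this search for several function names, the
loop can be wrapped in a function that takes the name as an argument.
The sketch below is a variant, not a copy, of the scripts above: it
runs grep directly on each object file instead of going through
nm(CP) or strings(C), which assumes that your grep copes with binary
input. The function name find_function_files is made up for this
example.

```shell
# find_function_files NAME [DIR]
# Prints the names of the '*.o' files under DIR which contain the
# string NAME. DIR defaults to /etc/conf/pack.d.
find_function_files() {
    find "${2:-/etc/conf/pack.d}" -name '*.o' -print |
    while read obj
    do
        # Discard any output from grep; only its exit status matters.
        grep "$1" "$obj" > /dev/null 2>&1 && echo "$obj"
    done
}
```

For example, "find_function_files sioopen" searches the default
directory. Like the scripts above, this is a substring match, so a
short name may also match longer function names which contain it.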
You can then search for the file name in the /etc/perms files.
For this example you would type:
egrep /etc/conf/pack.d/sio/Driver.o /etc/perms/*
This will give as output:
/etc/perms/inst:LINK f644 root/sys 1 ./etc/conf/pack.d/sio/Driver.o N04
Since the name after the colon is "LINK", you know that this
kernel routine is part of the SCO-supplied link kit. If the file
does not appear in the perms files, or if the package name which
appears after the colon belongs to a third-party device driver,
then the problem is most likely with that driver.
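The package name can be picked out of the egrep output mechanically.
The sketch below assumes output in the form shown above, where egrep
has prefixed each matching line with the perms file name and a colon
(as it does when given more than one file). The function name
perms_package is made up for this example.

```shell
# perms_package
# Reads egrep output of the form "permsfile:PACKAGE mode owner ..."
# on standard input and prints the PACKAGE field of each line.
perms_package() {
    sed -n 's/^[^:]*:\([^[:space:]]*\)[[:space:]].*$/\1/p'
}

# Line taken from the sample output above.
echo '/etc/perms/inst:LINK f644 root/sys 1 ./etc/conf/pack.d/sio/Driver.o N04' |
    perms_package      # prints LINK
```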
HINTS: If a routine is mentioned in a file named Driver.o, it is
usually part of a device driver, and the driver prefix is the name
of the directory in which the Driver.o file is located. If a function
name ends in "read", "write", or "ioctl" it is usually part of a
character device driver. If a function name ends in "strategy" it
is usually part of a block device driver. Kernel utility function
names include bcopy(), copyio(), cpass(), passc(), getc(), geteblk(),
getablk(), mapphys(), memget(), putc(), fubyte(), subyte(), vtop(),
vtop2(), and vtopreg().
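The naming hints above can be written down as a shell case statement,
which is convenient when classifying every function in a stack trace.
This is a rough sketch: the function name guess_function_kind and the
exact wording it prints are made up, and the result is only a hint,
since the suffix rules are "usually" true rather than always.

```shell
# guess_function_kind NAME
# Prints a hint about what kind of kernel routine NAME is, based only
# on the naming conventions listed in the hints above.
guess_function_kind() {
    case "$1" in
        bcopy|copyio|cpass|passc|getc|geteblk|getablk|mapphys|memget|putc|fubyte|subyte|vtop|vtop2|vtopreg)
            echo "kernel utility function" ;;
        *read|*write|*ioctl)
            echo "probably a character device driver" ;;
        *strategy)
            echo "probably a block device driver" ;;
        *)
            echo "no hint" ;;
    esac
}
```

For example, a name ending in "strategy" is reported as a probable
block device driver, while panopen() from the sample session gives
"no hint" and has to be traced through its Driver.o file instead.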
5. Contacting SCO Support
If you are still unable to determine the cause of the panic, you
may want to contact SCO support. Follow the steps in section II.1
above, and then enter the following commands:
# crash -d image -w /tmp/crash.out
dumpfile = image, namelist = /unix, outfile = /tmp/crash.out
> panic
> trace
> user
> quit
As before, the '#' is a prompt from root's shell and the '>'
prompts are generated by the crash(ADM) command, so do not type
them in. Print out the file "/tmp/crash.out" and send it, via fax
or email, to your support provider.
If you are contacting SCO Support, we ask that you provide the
following information along with the crash(ADM) output.
- The exact version of the SCO UNIX System V/386 Operating System you
are running.
- The brand and model number of the computer you are using.
- A complete description of the hardware configuration, including
the brand and model number of every peripheral card in the machine.
- A listing of all device drivers that you have installed on the
machine.
- A listing of all software you have installed on the machine which
re-linked the kernel as part of the installation.
- The amount of RAM installed in your computer.
TROUBLESHOOTING:
- How do I check that the kernel contains the driver in my UNIX link kit?
For example, to verify the eeG kernel driver, enter the command:
# what /etc/conf/pack.d/eeG/Driver.o
The output shows:
/etc/conf/pack.d/eeG/Driver.o:
eeG - Intel Gigabit, SCO OSR5 MDI driver, Ver 5.1.2 (22Apr2010 18:41)
Now check the current kernel with:
# what /stand/unix | grep eeG
This should show the same version string for the booted kernel:
eeG - Intel Gigabit, SCO OSR5 MDI driver, Ver 5.1.2 (22Apr2010 18:41)
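The comparison can also be done by script if the output of each what
command is first saved to a file. The sketch below assumes two such
files (their names are up to you) and a sed that understands the
[[:space:]] character class. The function name same_driver_version is
made up for this example.

```shell
# same_driver_version PREFIX DRIVER_OUT KERNEL_OUT
# DRIVER_OUT holds the saved output of "what" run on the Driver.o
# file. KERNEL_OUT holds the saved output of "what" run on the
# kernel. Succeeds only if both contain the same "PREFIX - ..." line.
same_driver_version() {
    drv=`grep "$1 - " "$2" | sed 's/^[[:space:]]*//' | sort -u`
    krn=`grep "$1 - " "$3" | sed 's/^[[:space:]]*//' | sort -u`
    [ -n "$drv" ] && [ "$drv" = "$krn" ]
}
```

For example, after "what /etc/conf/pack.d/eeG/Driver.o > /tmp/drv.out"
and "what /stand/unix > /tmp/krn.out", the command
"same_driver_version eeG /tmp/drv.out /tmp/krn.out && echo match"
prints "match" only when the booted kernel carries the same driver
version as the link kit.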
SEE ALSO:
autoboot(ADM)
crash(ADM)
messages(M)
nm(CP)
"The Design of the Unix Operating System" by Maurice J. Bach