Troubleshooting Guide: Diagnosing Linux Kernel Panic on Production Servers
Quick Fix Summary
TL;DR: Boot from a known-good kernel, check system logs, and analyze the panic message for hardware or driver failures.
A Kernel Panic is an unrecoverable system-level error where the Linux kernel halts to prevent data corruption. It's triggered by critical failures in kernel code, hardware, or drivers that compromise system integrity.
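In production it is common to let a panicked host reboot itself after a short delay rather than hang at the console; whether yours does is controlled by a sysctl. A quick check, with an example value (not a recommendation):
# 0 = hang at the panic screen indefinitely; N = reboot automatically N seconds after a panic
sysctl kernel.panic
# Example: reboot 10 seconds after a panic (persist it via /etc/sysctl.d/ if you keep it)
sysctl -w kernel.panic=10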
Recovery Steps
Step 1: Secure Immediate Evidence from the Console
If the server is accessible, photograph or transcribe the entire panic screen. The call trace and register dump are critical for diagnosis.
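If you cannot reach the physical screen, a BMC Serial-over-LAN session can capture the console remotely; open and log it before triggering anything like the SysRq crash below. A sketch using ipmitool (the BMC address and credentials are placeholders, and the kernel must be directing its console to the serial port):
# Attach to the server's serial console via the BMC and log everything locally
ipmitool -I lanplus -H bmc.example.com -U admin -P 'changeme' sol activate | tee /tmp/panic-console.log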
# If the system is still running but unstable, force a panic to get a log (USE WITH EXTREME CAUTION)
echo c > /proc/sysrq-trigger
Step 2: Boot into a Rescue Environment & Collect Logs
Boot from a live USB/DVD or a known-good kernel. Mount the root filesystem and extract all relevant logs from the failed boot.
# Mount the root partition from the rescue environment
mount /dev/sdX1 /mnt
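If the root filesystem lives on LVM rather than a plain partition, the volume group has to be activated before it can be mounted; a minimal sketch (the vg0/root names are placeholders):
# Activate all detected LVM volume groups, then mount the root logical volume
vgchange -ay
mount /dev/vg0/root /mnt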
# Copy critical logs for analysis
mkdir -p /root/panic_analysis
cp /mnt/var/log/kern.log* /mnt/var/log/dmesg* /mnt/var/log/syslog* /root/panic_analysis/
Step 3: Analyze Kernel Logs for Oops and Panic Context
Search logs for 'Oops', 'panic', 'BUG', and the call trace. The line BEFORE the panic often indicates the culprit.
grep -B 20 -A 5 "Kernel panic" /var/log/kern.log
grep -B 10 "Oops" /var/log/kern.log
# Note: dmesg only covers the current boot, so it is mainly useful if the system survived an Oops without rebooting
dmesg | tail -100
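If the distribution runs systemd with persistent journaling enabled, the kernel messages from the boot that panicked can also be pulled straight from the journal; a sketch (requires Storage=persistent in journald.conf):
# Kernel ring buffer from the previous boot (-b -1), last 200 lines
journalctl -k -b -1 --no-pager | tail -200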
Step 4: Isolate the Faulty Component via Call Trace
Decode the call trace (the EIP/RIP register value and the function names) to determine whether the failure is in a specific driver (e.g., nvidia, e1000) or in the core kernel.
# Example: Look for module names in the trace. This points to the 'nv' driver.
# Call Trace:
#  [  123.456] [<ffffffffa0123456>] ? nv_ioctl+0x123/0x456 [nv]
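When the trace implicates a module, record exactly which build of it was loaded; out-of-tree or proprietary modules also taint the kernel, which narrows the search. A sketch reusing the 'nv' module from the example above:
# Show the on-disk file, version and vermagic of the implicated module
modinfo nv
# The panic/Oops header also records taint flags (e.g., 'P' = proprietary, 'O' = out-of-tree module)
grep -m 1 "Tainted" /var/log/kern.log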
Step 5: Perform Hardware Diagnostics
Rule out hardware failure, which is a common root cause. Test memory and CPU thoroughly.
# Test system memory from userspace (here: 2 GB for 2 passes; increase both for a more thorough run)
memtester 2G 2
# Check CPU for errors via mcelog
mcelog --client
# Check disk health
smartctl -a /dev/sda
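On servers with ECC memory, corrected-error counters can point at a failing DIMM even when a userspace memory test passes. A sketch assuming the EDAC driver for your memory controller is loaded (sysfs layout can vary by platform):
# Per-memory-controller corrected (ce) and uncorrected (ue) error counts; anything non-zero deserves attention
grep -H . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count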
Step 6: Implement a Mitigation and Restore Service
Based on analysis, blacklist a faulty module, revert a kernel update, or schedule hardware replacement. Boot with minimal modules.
# Blacklist a driver module causing panic
echo "blacklist faulty_module" >> /etc/modprobe.d/blacklist.conf
# Update initramfs and reboot
update-initramfs -u -k all
reboot
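If the panic began right after a kernel upgrade, booting the previous kernel is usually the quickest mitigation. A sketch for GRUB2 systems; the menu-entry title, kernel versions, and tools below are placeholders that differ per distribution:
# List installed kernels to pick a known-good version
ls /boot/vmlinuz-*
# Debian/Ubuntu: boot a specific menu entry once on the next reboot (needs GRUB_DEFAULT=saved)
grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 5.15.0-91-generic"
# RHEL/CentOS: make an older installed kernel the permanent default
grubby --set-default /boot/vmlinuz-5.14.0-362.el9.x86_64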
Step 7: Configure Persistent Crash Dumping (Kdump)
For future panics, configure Kdump to capture a full kernel memory dump (vmcore) to disk for offline analysis with the 'crash' utility.
# Install kdump tools
apt install kdump-tools || yum install kexec-tools
# Reserve memory for the crash kernel: append crashkernel=256M to GRUB_CMDLINE_LINUX in /etc/default/grub,
# then regenerate the GRUB config (update-grub on Debian/Ubuntu, grub2-mkconfig on RHEL) and reboot
GRUB_CMDLINE_LINUX="... crashkernel=256M"
# Enable and start the service
systemctl enable kdump.service
systemctl start kdump.service
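After the reboot that applies the crashkernel= reservation, confirm a capture kernel is actually loaded before relying on kdump for the next panic. A quick check (the helper tools differ by distribution):
# 1 means a crash (capture) kernel is loaded and kdump is armed
cat /sys/kernel/kexec_crash_loaded
# Confirm the memory reservation was honoured at boot
dmesg | grep -i crashkernel
# Distribution helpers: kdump-tools on Debian/Ubuntu, kexec-tools on RHEL
kdump-config show || kdumpctl status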
"Panics often occur minutes after the real fault. Correlate timestamps with systemd journal logs (`journalctl -S -1hour`) to find the triggering service or hardware event."
Frequently Asked Questions
What's the difference between an 'Oops' and a 'Kernel Panic'?
An 'Oops' is a non-fatal kernel error where the kernel can often continue running (though possibly corrupted). A 'Panic' is a deliberate, unrecoverable halt to prevent filesystem/data corruption from an irreparable error.
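Because a kernel that has Oopsed may already be corrupted, many production setups deliberately escalate every Oops to a panic so that kdump captures a dump and the host reboots cleanly; whether that trade-off suits you depends on the workload. The relevant sysctl:
# Turn every Oops into a full panic (pair this with kdump and kernel.panic=N for automatic recovery)
sysctl -w kernel.panic_on_oops=1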
The server is completely unresponsive after a panic. How do I get the logs?
You have three options: 1) a screenshot or transcription from the physical/IPMI console, 2) serial console output, if one was configured, or 3) a kdump vmcore, if kdump was set up prior to the crash. If neither of the last two was configured in advance, the on-screen console message is all you have.
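One more place worth checking after the fact: on many UEFI machines with pstore enabled, the kernel preserves the tail of its log across a panic in firmware-backed storage. Availability depends on kernel config and firmware, so treat this as a best-effort check:
# Panic-time log fragments, if the pstore backend captured any
ls /sys/fs/pstore/
cat /sys/fs/pstore/dmesg-* 2>/dev/null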
Should I always update the kernel after a panic?
Not immediately. First, diagnose. If the trace points to a known bug fixed in a later stable kernel, then update. Blindly updating can introduce new incompatibilities. Reverting to the last-known-good kernel is a safer first step.
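If you do revert, keep the known-good kernel installed and stop the suspect one from coming straight back on the next update run. A sketch for apt-based systems (package names are examples; dnf/yum versionlock is the RHEL-side equivalent):
# Hold the kernel meta-package so upgrades stop pulling in new kernel versions for now
apt-mark hold linux-image-generic
# Remove the suspect kernel once a known-good one is confirmed bootable
apt remove linux-image-5.15.0-92-generic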