Troubleshooting Guide: Diagnosing Linux Kernel Panic on Production Servers
Quick Fix Summary
TL;DR: Boot from a known-good kernel, check system logs, and analyze the panic message for hardware or driver failures.
A Kernel Panic is an unrecoverable system-level error where the Linux kernel halts to prevent data corruption. It's triggered by critical failures in kernel code, hardware, or drivers that compromise system integrity.
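In production it is common to let a panicked host reboot itself after a short delay rather than hang at the console; whether yours does is controlled by a sysctl. A quick check, with an example value (not a recommendation):
# 0 = hang at the panic screen indefinitely; N = reboot automatically N seconds after a panic
sysctl kernel.panic
# Example: reboot 10 seconds after a panic (persist it via /etc/sysctl.d/ if you keep it)
sysctl -w kernel.panic=10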
Recovery Steps
Step 1: Secure Immediate Evidence from the Console
If the server is accessible, photograph or transcribe the entire panic screen. The call trace and register dump are critical for diagnosis.
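If you cannot reach the physical screen, a BMC Serial-over-LAN session can capture the console remotely; open and log it before triggering anything like the SysRq crash below. A sketch using ipmitool (the BMC address and credentials are placeholders, and the kernel must be directing its console to the serial port):
# Attach to the server's serial console via the BMC and log everything locally
ipmitool -I lanplus -H bmc.example.com -U admin -P 'changeme' sol activate | tee /tmp/panic-console.log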
# If the system is still running but unstable, force a panic to get a log (USE WITH EXTREME CAUTION)
echo c > /proc/sysrq-trigger
Step 2: Boot into a Rescue Environment & Collect Logs
Boot from a live USB/DVD or a known-good kernel. Mount the root filesystem and extract all relevant logs from the failed boot.
# Mount the root partition from the rescue environment
mount /dev/sdX1 /mnt
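If the root filesystem lives on LVM rather than a plain partition, the volume group has to be activated before it can be mounted; a minimal sketch (the vg0/root names are placeholders):
# Activate all detected LVM volume groups, then mount the root logical volume
vgchange -ay
mount /dev/vg0/root /mnt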
# Copy critical logs for analysis
mkdir -p /root/panic_analysis
cp /mnt/var/log/kern.log* /mnt/var/log/dmesg* /mnt/var/log/syslog* /root/panic_analysis/
Step 3: Analyze Kernel Logs for Oops and Panic Context
Search logs for 'Oops', 'panic', 'BUG', and the call trace. The line BEFORE the panic often indicates the culprit.
grep -B 20 -A 5 "Kernel panic" /var/log/kern.log
grep -B 10 "Oops" /var/log/kern.log
# Note: dmesg only covers the current boot, so it is mainly useful if the system survived an Oops without rebooting
dmesg | tail -100
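If the distribution runs systemd with persistent journaling enabled, the kernel messages from the boot that panicked can also be pulled straight from the journal; a sketch (requires Storage=persistent in journald.conf):
# Kernel ring buffer from the previous boot (-b -1), last 200 lines
journalctl -k -b -1 --no-pager | tail -200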
Step 4: Isolate the Faulty Component via Call Trace
Decode the call trace (the EIP/RIP register value and the function names) to determine whether the failure is in a specific driver (e.g., nvidia, e1000) or in the core kernel.
# Example: Look for module names in the trace. This points to the 'nv' driver.
# Call Trace:
#  [  123.456] [<ffffffffa0123456>] ? nv_ioctl+0x123/0x456 [nv]
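When the trace implicates a module, record exactly which build of it was loaded; out-of-tree or proprietary modules also taint the kernel, which narrows the search. A sketch reusing the 'nv' module from the example above:
# Show the on-disk file, version and vermagic of the implicated module
modinfo nv
# The panic/Oops header also records taint flags (e.g., 'P' = proprietary, 'O' = out-of-tree module)
grep -m 1 "Tainted" /var/log/kern.log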
Step 5: Perform Hardware Diagnostics
Rule out hardware failure, which is a common root cause. Test memory and CPU thoroughly.
# Test system memory from userspace (here: 2 GB for 2 passes; increase both for a more thorough run)
memtester 2G 2
# Check CPU for errors via mcelog
mcelog --client
# Check disk health
smartctl -a /dev/sda
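On servers with ECC memory, corrected-error counters can point at a failing DIMM even when a userspace memory test passes. A sketch assuming the EDAC driver for your memory controller is loaded (sysfs layout can vary by platform):
# Per-memory-controller corrected (ce) and uncorrected (ue) error counts; anything non-zero deserves attention
grep -H . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count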
Step 6: Implement a Mitigation and Restore Service
Based on analysis, blacklist a faulty module, revert a kernel update, or schedule hardware replacement. Boot with minimal modules.
# Blacklist a driver module causing panic
echo "blacklist faulty_module" >> /etc/modprobe.d/blacklist.conf
# Update initramfs and reboot
update-initramfs -u -k all
reboot
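If the panic began right after a kernel upgrade, booting the previous kernel is usually the quickest mitigation. A sketch for GRUB2 systems; the menu-entry title, kernel versions, and tools below are placeholders that differ per distribution:
# List installed kernels to pick a known-good version
ls /boot/vmlinuz-*
# Debian/Ubuntu: boot a specific menu entry once on the next reboot (needs GRUB_DEFAULT=saved)
grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 5.15.0-91-generic"
# RHEL/CentOS: make an older installed kernel the permanent default
grubby --set-default /boot/vmlinuz-5.14.0-362.el9.x86_64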
Step 7: Configure Persistent Crash Dumping (Kdump)
For future panics, configure Kdump to capture a full kernel memory dump (vmcore) to disk for offline analysis with the 'crash' utility.
# Install kdump tools
apt install kdump-tools || yum install kexec-tools
# Reserve memory for the crash kernel: append crashkernel=256M to GRUB_CMDLINE_LINUX in /etc/default/grub,
# then regenerate the GRUB config (update-grub on Debian/Ubuntu, grub2-mkconfig on RHEL) and reboot
GRUB_CMDLINE_LINUX="... crashkernel=256M"
# Enable and start the service
systemctl enable kdump.service
systemctl start kdump.service
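After the reboot that applies the crashkernel= reservation, confirm a capture kernel is actually loaded before relying on kdump for the next panic. A quick check (the helper tools differ by distribution):
# 1 means a crash (capture) kernel is loaded and kdump is armed
cat /sys/kernel/kexec_crash_loaded
# Confirm the memory reservation was honoured at boot
dmesg | grep -i crashkernel
# Distribution helpers: kdump-tools on Debian/Ubuntu, kexec-tools on RHEL
kdump-config show || kdumpctl status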
"Panics often occur minutes after the real fault. Correlate timestamps with systemd journal logs (`journalctl -S -1hour`) to find the triggering service or hardware event."
Frequently Asked Questions
What's the difference between an 'Oops' and a 'Kernel Panic'?
An 'Oops' is a non-fatal kernel error where the kernel can often continue running (though possibly corrupted). A 'Panic' is a deliberate, unrecoverable halt to prevent filesystem/data corruption from an irreparable error.
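Because a kernel that has Oopsed may already be corrupted, many production setups deliberately escalate every Oops to a panic so that kdump captures a dump and the host reboots cleanly; whether that trade-off suits you depends on the workload. The relevant sysctl:
# Turn every Oops into a full panic (pair this with kdump and kernel.panic=N for automatic recovery)
sysctl -w kernel.panic_on_oops=1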
The server is completely unresponsive after a panic. How do I get the logs?
You have three options: 1) a screenshot or transcription from the physical/IPMI console, 2) serial console output, if one was configured, or 3) a kdump vmcore, if kdump was set up prior to the crash. If neither of the last two was configured in advance, the on-screen console message is all you have.
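One more place worth checking after the fact: on many UEFI machines with pstore enabled, the kernel preserves the tail of its log across a panic in firmware-backed storage. Availability depends on kernel config and firmware, so treat this as a best-effort check:
# Panic-time log fragments, if the pstore backend captured any
ls /sys/fs/pstore/
cat /sys/fs/pstore/dmesg-* 2>/dev/null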
Should I always update the kernel after a panic?
Not immediately. First, diagnose. If the trace points to a known bug fixed in a later stable kernel, then update. Blindly updating can introduce new incompatibilities. Reverting to the last-known-good kernel is a safer first step.
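If you do revert, keep the known-good kernel installed and stop the suspect one from coming straight back on the next update run. A sketch for apt-based systems (package names are examples; dnf/yum versionlock is the RHEL-side equivalent):
# Hold the kernel meta-package so upgrades stop pulling in new kernel versions for now
apt-mark hold linux-image-generic
# Remove the suspect kernel once a known-good one is confirmed bootable
apt remove linux-image-5.15.0-92-generic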