metadata
description: >-
  Comprehensive system health checkup including disk health, SMART status,
  filesystem checks, and overall system status
tags:
  - sysadmin
  - diagnostics
  - health
  - disk
  - smart
  - filesystem
  - comprehensive

Perform a comprehensive system health checkup:

  1. Disk Health (SMART): Check all disk SMART status and health indicators
  2. Filesystem Health: Check all mounted filesystems for errors
  3. System Resources: CPU, memory, swap, and load status
  4. Critical Services: Verify critical system services are running
  5. Security Updates: Check for pending security updates
  6. Disk Space: Check all mounted filesystems for space issues
  7. System Logs: Check for recent critical errors
  8. Hardware Errors: Check for hardware-related issues in logs

Run the following comprehensive diagnostic commands:

Disk Health (SMART):

  • sudo smartctl --scan to identify all drives
  • sudo smartctl -H /dev/sda for health status (repeat for all drives found)
  • sudo smartctl -A /dev/sda for SMART attributes (repeat for all drives)
  • Check for: Reallocated sectors, Current pending sectors, Offline uncorrectable sectors
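
The three sector attributes above can be checked mechanically. The following is a minimal sketch, not a definitive implementation: `flag_smart_attrs` is a hypothetical helper name, and it assumes the classic ATA attribute table printed by `smartctl -A`, where the attribute name is the second column and the raw value is the tenth. NVMe drives report health in a different layout and are not covered.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints each of
# the three sector-health attributes whose raw value (column 10) is non-zero.
# Assumes the ATA attribute table layout; NVMe output is not handled.
flag_smart_attrs() {
  awk '
    $2 == "Reallocated_Sector_Ct" ||
    $2 == "Current_Pending_Sector" ||
    $2 == "Offline_Uncorrectable" {
      if ($10 + 0 > 0) print $2 "=" $10
    }
  '
}
```

Usage would look like `sudo smartctl -A /dev/sda | flag_smart_attrs`; empty output means none of the three counters is above zero.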

Filesystem Health:

  • df -h for disk space on all filesystems
  • sudo btrfs device stats / if using BTRFS
  • Check mounted filesystems with mount | grep -E '^/dev'
  • For ext4: sudo tune2fs -l /dev/sdXY | grep -i 'state\|error' for filesystem state
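
To turn the disk space listing into a pass/fail check, the output can be filtered against a usage threshold. This sketch assumes POSIX-format output from `df -P` (capacity in column 5, mount point in column 6) and a default threshold of 90%, which is an illustrative value rather than one taken from the text above. `flag_full_filesystems` is a hypothetical name, and mount points containing spaces are not handled.

```shell
# Hypothetical helper: reads `df -P` output on stdin and prints the mount
# point and usage of every filesystem at or above a threshold.
# The default of 90 percent is an assumed value; pass another as $1.
flag_full_filesystems() {
  threshold="${1:-90}"
  awk -v limit="$threshold" '
    NR > 1 {                      # skip the header row
      use = $5
      sub(/%/, "", use)           # strip the percent sign
      if (use + 0 >= limit + 0) print $6 " " $5
    }
  '
}
```

For example, `df -P | flag_full_filesystems 85` would list filesystems that are at least 85% full.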

System Resources:

  • free -h for memory usage
  • uptime for load averages
  • top -b -n 1 | head -n 20 for process overview
  • swapon --show for swap status
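
Load averages are easier to judge relative to the number of CPUs. The sketch below applies a common rule of thumb (a 1-minute load above the CPU count suggests saturation); that rule and the helper name `load_status` are assumptions for illustration, not part of the checklist above.

```shell
# Hypothetical helper: compares the 1-minute load average with the CPU count.
# $1 is a line in /proc/loadavg format, $2 is the number of CPUs.
# "load > CPUs means high" is an assumed rule of thumb.
load_status() {
  awk -v line="$1" -v cpus="$2" 'BEGIN {
    split(line, f, " ")
    if (f[1] + 0 > cpus + 0) {
      print "high"
    } else {
      print "ok"
    }
  }'
}
```

On Linux this could be called as `load_status "$(cat /proc/loadavg)" "$(nproc)"`.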

Critical Services:

  • systemctl status systemd-journald for logging service
  • systemctl status cron or systemctl status crond for task scheduler
  • systemctl --failed for any failed services
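
For a report, only the names of failed units are usually needed. This is a small sketch that assumes the input comes from `systemctl --failed --no-legend --plain`, so that each line starts with the unit name and carries no header or bullet marker. `list_failed_units` is a hypothetical name.

```shell
# Hypothetical helper: reads `systemctl --failed --no-legend --plain` output
# on stdin and prints one failed unit name per line.
list_failed_units() {
  awk 'NF > 0 { print $1 }'
}
```

A call such as `systemctl --failed --no-legend --plain | list_failed_units` prints nothing when no units have failed.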

Updates and Security:

  • sudo apt-get update to refresh package lists
  • apt list --upgradable to check for available updates
  • grep -i security /var/log/apt/history.log | tail -n 20 for recent security updates
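
Pending security updates can be separated from ordinary ones by looking at the archive each package comes from. The sketch below assumes the Debian/Ubuntu naming convention in which the security archive name contains `-security` (for example `jammy-security`); other repositories may not follow it. `count_security_upgrades` is a hypothetical name.

```shell
# Hypothetical helper: reads `apt list --upgradable` output on stdin and
# prints how many packages come from an archive whose name contains
# "-security". The naming convention is an assumption.
count_security_upgrades() {
  # Split on "/" and space so that field 2 is the archive list.
  awk -F'[/ ]' '$2 ~ /-security/ { n++ } END { print n + 0 }'
}
```

It could be used as `apt list --upgradable 2>/dev/null | count_security_upgrades`.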

System Logs:

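  • journalctl -p 3 -b for errors and worse (priorities 0-3) in current boot
  • journalctl -p 2 -b for critical and worse only (priorities 0-2) in current boot
  • dmesg | grep -i 'error\|fail\|critical' | tail -n 20 for kernel errors (prefix with sudo if dmesg access is restricted)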
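
When only a severity signal is needed, counting matching kernel messages is enough. This sketch uses the same three keywords as the dmesg filter above; `count_kernel_errors` is a hypothetical name, and a keyword match is only a hint, since harmless messages can contain these words too.

```shell
# Hypothetical helper: reads kernel log text on stdin and prints the number
# of lines mentioning "error", "fail", or "critical" (case-insensitive).
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function usable under `set -e`.
count_kernel_errors() {
  grep -Eci 'error|fail|critical' || true
}
```

For example, `dmesg | count_kernel_errors` prints a single number that can feed the summary report.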

Hardware Status:

  • sensors for temperature monitoring (if lm-sensors installed)
  • dmesg | grep -i 'hardware error' for hardware errors
  • lspci -v | grep -i 'error' for PCIe error-related lines (a match on "Advanced Error Reporting" names a capability, not a fault)

Additional Checks:

  • Check for excessive failed login attempts: sudo grep -i 'failed password' /var/log/auth.log | tail -n 10
  • Check for disk I/O errors: dmesg | grep -i 'I/O error'
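
The last ten failed logins say little about whether attempts are excessive; grouping them by source address gives a clearer picture. This is a sketch that assumes the usual OpenSSH log format, in which the address follows the word `from`. `failed_logins_by_ip` is a hypothetical name.

```shell
# Hypothetical helper: reads auth log lines on stdin and prints
# "<count> <address>" for failed password attempts, most frequent first.
# Assumes the OpenSSH message format "... from <address> port ...".
failed_logins_by_ip() {
  awk '/[Ff]ailed password/ {
    for (i = 1; i < NF; i++) {
      if ($i == "from") { count[$(i + 1)]++; break }
    }
  }
  END { for (ip in count) print count[ip], ip }' | sort -rn
}
```

It could be run as `sudo cat /var/log/auth.log | failed_logins_by_ip | head -n 10`.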

Analyze all results and provide:

Summary Report:

  • Overall system health status (Healthy, Warning, Critical)
  • Disk health status for each drive
  • Filesystem health and space status
  • Memory and swap status
  • Any failed services or critical errors
  • Pending updates (especially security)
  • Temperature warnings if applicable
  • Specific issues found with severity levels

Recommendations:

  • Immediate actions needed (if any)
  • Preventive maintenance suggestions
  • Monitoring recommendations
  • Whether a reboot is recommended
  • Backup reminders if issues detected

Priority Issues: List any issues in order of urgency:

  1. Critical (requires immediate attention)
  2. Warning (should be addressed soon)
  3. Informational (for awareness)
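
One way to derive the overall status from the prioritized findings is to let the most severe category win. The mapping below is an assumption for illustration (any critical finding yields Critical, otherwise any warning yields Warning), and `overall_status` is a hypothetical name.

```shell
# Hypothetical helper: maps counts of critical and warning findings to the
# three status labels used in the summary report.
# $1 = number of critical findings, $2 = number of warnings.
overall_status() {
  critical="${1:-0}"
  warning="${2:-0}"
  if [ "$critical" -gt 0 ]; then
    echo "Critical"
  elif [ "$warning" -gt 0 ]; then
    echo "Warning"
  else
    echo "Healthy"
  fi
}
```

Informational findings do not change the status in this sketch; `overall_status 0 2` would report Warning.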

If smartmontools is not installed, offer to install with sudo apt-get install smartmontools. If lm-sensors is not installed and temperature monitoring is desired, offer to install with sudo apt-get install lm-sensors.
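
Whether the tools are present can be tested before any diagnostic command is run. The sketch below checks for the `smartctl` and `sensors` executables, which are the commands shipped by the smartmontools and lm-sensors packages respectively. `missing_tools` is a hypothetical name.

```shell
# Hypothetical helper: prints each named command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

Calling `missing_tools smartctl sensors` prints the names of the commands that still need their package installed, and nothing if both are available.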