---
description: Comprehensive system health checkup including disk health, SMART status, filesystem checks, and overall system status
tags: [sysadmin, diagnostics, health, disk, smart, filesystem, comprehensive]
---

Perform a comprehensive system health checkup:

1. **Disk Health (SMART)**: Check SMART status and health indicators for all disks
2. **Filesystem Health**: Check all mounted filesystems for errors
3. **System Resources**: CPU, memory, swap, and load status
4. **Critical Services**: Verify critical system services are running
5. **Security Updates**: Check for pending security updates
6. **Disk Space**: Check all mounted filesystems for space issues
7. **System Logs**: Check for recent critical errors
8. **Hardware Errors**: Check for hardware-related issues in logs

Run the following comprehensive diagnostic commands:

**Disk Health (SMART):**
- `sudo smartctl --scan` to identify all drives
- `sudo smartctl -H /dev/sda` for health status (repeat for all drives found)
- `sudo smartctl -A /dev/sda` for SMART attributes (repeat for all drives found)
- Check for nonzero raw values of: Reallocated sectors (`Reallocated_Sector_Ct`), Current pending sectors (`Current_Pending_Sector`), Offline uncorrectable sectors (`Offline_Uncorrectable`)

**Filesystem Health:**
- `df -h` for disk space on all filesystems
- `sudo btrfs device stats /` if using BTRFS
- `mount | grep -E '^/dev'` to list mounted block-device filesystems
- For ext4: `sudo tune2fs -l /dev/sdXY | grep -i 'state\|error'` for filesystem state

**System Resources:**
- `free -h` for memory usage
- `uptime` for load averages
- `top -b -n 1 | head -n 20` for process overview
- `swapon --show` for swap status

**Critical Services:**
- `systemctl status systemd-journald` for logging service
- `systemctl status cron` or `systemctl status crond` for task scheduler
- `systemctl --failed` for any failed services

**Updates and Security:**
- `sudo apt-get update` to refresh package lists
- `apt list --upgradable` to check for available updates
- `grep -i security /var/log/apt/history.log | tail -n 20` for recent security updates
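
As one possible way to single out security updates from the `apt list --upgradable` output, the sketch below counts packages whose suite name contains `-security`. The function name `count_security_upgrades` is hypothetical, and the suite naming (for example `jammy-security`) is an assumption that holds on typical Debian and Ubuntu systems but may not match every repository setup.

```shell
# Sketch: count upgradable packages that come from a security suite.
# Assumption: each package line of `apt list --upgradable` looks like
#   name/suite[,suite...] version arch [upgradable from: old-version]
# and security suites contain "-security" (e.g. jammy-security).

# Reads `apt list --upgradable` output on stdin and prints the count.
# The "Listing..." header line has no suite field and is never counted.
count_security_upgrades() {
    awk -F'[/ ]' '$2 ~ /-security/ { n++ } END { print n + 0 }'
}
```

It could be used as `apt list --upgradable 2>/dev/null | count_security_upgrades`; the `2>/dev/null` hides the warning that `apt` prints on stderr when used in a pipeline.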

**System Logs:**
- `journalctl -p 3 -b` for errors in the current boot
- `journalctl -p 2 -b` for critical issues in the current boot
- `dmesg | grep -i 'error\|fail\|critical' | tail -n 20` for kernel errors

**Hardware Status:**
- `sensors` for temperature monitoring (if lm-sensors is installed)
- `dmesg | grep -i 'hardware error'` for hardware errors
- `lspci -v | grep -i 'error'` for PCIe errors

**Additional Checks:**
- Check for excessive failed login attempts: `sudo grep -i 'failed password' /var/log/auth.log | tail -n 10`
- Check for disk I/O errors: `dmesg | grep -i 'I/O error'`

Analyze all results and provide:

**Summary Report:**
- Overall system health status (Healthy, Warning, Critical)
- Disk health status for each drive
- Filesystem health and space status
- Memory and swap status
- Any failed services or critical errors
- Pending updates (especially security)
- Temperature warnings if applicable
- Specific issues found, with severity levels

**Recommendations:**
- Immediate actions needed (if any)
- Preventive maintenance suggestions
- Monitoring recommendations
- Whether a reboot is recommended
- Backup reminders if issues are detected

**Priority Issues:**
List any issues in order of urgency:
1. Critical (requires immediate attention)
2. Warning (should be addressed soon)
3. Informational (for awareness)

If smartmontools is not installed, offer to install it with `sudo apt-get install smartmontools`. If lm-sensors is not installed and temperature monitoring is desired, offer to install it with `sudo apt-get install lm-sensors`.
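
The tool checks and the "repeat for all drives found" step could be scripted roughly as follows. This is a minimal sketch, not a definitive implementation: the function names `missing_tools`, `scan_devices`, and `check_all_drives` are hypothetical, and the parsing assumes each line of `smartctl --scan` output begins with the device path. Drives that need an explicit `-d` device type (for example behind some RAID controllers) are not handled.

```shell
# Sketch: find missing optional tools, then run the SMART checks per drive.

# Print the name of each listed command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Extract device paths from `smartctl --scan` output on stdin.
# Assumption: each scan line starts with the device path, e.g.
#   /dev/sda -d scsi # /dev/sda, SCSI device
scan_devices() {
    awk '$1 ~ /^\/dev\// { print $1 }'
}

# Run the health and attribute checks for every drive found.
# Device paths contain no whitespace, so word splitting is safe here.
check_all_drives() {
    for dev in $(sudo smartctl --scan | scan_devices); do
        echo "== $dev =="
        sudo smartctl -H "$dev"
        sudo smartctl -A "$dev"
    done
}
```

For example, `missing_tools smartctl sensors` prints whichever of the two commands is absent, which can drive the install offers above; `check_all_drives` can then be called once `smartctl` is available.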