| title: Server Failure Predictor | |
| emoji: 🦀 | |
| colorFrom: green | |
| colorTo: green | |
| sdk: gradio | |
| sdk_version: 5.49.1 | |
| app_file: app.py | |
| pinned: false | |
| license: mit | |
| short_description: Predicts the probability of server failure | |
| Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference | |
| # Server Health Sentinel AI | |
| Predicting Thermal Runaway Before It Happens | |
| This project is a Proof-of-Concept (PoC) for AIOps (Artificial Intelligence for IT Operations). It demonstrates how machine learning can move beyond simple "threshold-based" monitoring to predictive failure analysis. | |
| ### Try the Demo | |
| Adjust the sliders in the Live Telemetry Simulation panel to see how the model reacts to different stress scenarios. | |
| Scenario A (Idle): Low CPU, Low Temp -> System Normal | |
| Scenario B (Gaming/Load): High Sustained CPU, High Temp -> CRITICAL FAILURE IMMINENT | |
| Scenario C (Cool Down): Low Current CPU, but High Sustained Load and High Temp -> CRITICAL (the model accounts for residual heat) | |
| ### The Model: Random Forest Classifier | |
| Unlike simple if/else logic, this system uses a Random Forest Classifier (an ensemble of 100 decision trees) to weigh multiple factors simultaneously. | |
| It was trained on a custom dataset of 10,000+ telemetry points collected from a high-performance Linux gaming laptop (HP Victus 15 / Ryzen 5 5600H) under various real-world conditions: | |
| Idle/Web Browsing (Baseline) | |
| Compilation/Workloads (CPU Spikes) | |
| Gaming (Sekiro: Shadows Die Twice) (Sustained CPU+GPU Thermal Stress) | |
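To make the "ensemble of 100 decision trees" concrete, here is a minimal sketch of fitting a Scikit-Learn `RandomForestClassifier` with `n_estimators=100`. It is not the project's training script: the three feature names, the synthetic rows, and the labelling rule are all assumptions made up for illustration, since the real telemetry dataset is not part of this README.

```python
import random

from sklearn.ensemble import RandomForestClassifier


def make_synthetic_rows(n, seed=0):
    """Generate stand-in telemetry rows (NOT the project's real dataset)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        cpu_now = rng.uniform(0, 100)      # current CPU load, percent
        cpu_avg_60s = rng.uniform(0, 100)  # rolling 1-minute CPU load, percent
        temp_c = rng.uniform(35, 100)      # current temperature, Celsius
        X.append([cpu_now, cpu_avg_60s, temp_c])
        # Hypothetical labelling rule, used only so the sketch has two classes.
        y.append(1 if (cpu_avg_60s > 70 and temp_c > 85) else 0)
    return X, y


X, y = make_synthetic_rows(2000)

# 100 trees, as described above; random_state fixed for repeatability.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)


def failure_probability(cpu_now, cpu_avg_60s, temp_c):
    """Return the predicted probability of the 'failure' class (label 1)."""
    proba = model.predict_proba([[cpu_now, cpu_avg_60s, temp_c]])
    return float(proba[0][1])
```

Under these assumptions, `failure_probability(20, 95, 98)` mirrors the cool-down scenario: current CPU is low, but sustained load and temperature are high, so the forest should still report a high probability.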
| ### Feature Engineering | |
| The model doesn't just look at current stats. It relies on engineered trend features to understand context: | |
| Rolling Averages: A 1-minute sustained load is more dangerous than a 1-second spike. | |
| Thermal Inertia: Combining current temp with recent load history to predict "heat soak." | |
| Rate of Change: How fast the temperature is climbing. | |
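The three trend features above could be computed along these lines. This is a sketch under stated assumptions, not the project's actual feature code: it assumes one sample per second, a 60-sample window, and a made-up "thermal inertia" formula (current temperature scaled by the recent load fraction). The class name `TrendFeatures` and the dictionary keys are hypothetical.

```python
from collections import deque


class TrendFeatures:
    """Keeps a sliding window of samples and derives trend features."""

    def __init__(self, window=60):
        # With one sample per second (an assumption), 60 samples = 1 minute.
        self.cpu = deque(maxlen=window)
        self.temp = deque(maxlen=window)

    def update(self, cpu_percent, temp_c):
        self.cpu.append(cpu_percent)
        self.temp.append(temp_c)

        # Rolling average: sustained load over the window.
        cpu_avg = sum(self.cpu) / len(self.cpu)

        # Rate of change: average degrees per sample across the window.
        if len(self.temp) > 1:
            temp_rate = (self.temp[-1] - self.temp[0]) / (len(self.temp) - 1)
        else:
            temp_rate = 0.0

        # Thermal inertia (hypothetical formula): current temperature
        # weighted by how loaded the CPU has been recently.
        thermal_inertia = temp_c * (cpu_avg / 100.0)

        return {
            "cpu_now": cpu_percent,
            "cpu_avg": cpu_avg,
            "temp_rate": temp_rate,
            "thermal_inertia": thermal_inertia,
        }
```

Feeding each new reading through `update()` yields a feature row where a brief spike barely moves `cpu_avg`, while a sustained load keeps it high even after the current CPU value drops.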
| ### Performance | |
| AUC Score: ~0.99 on the test set | |
| False Positive Rate: <0.5% | |
| False Negative Rate: <1.0% | |
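For reference, the false positive and false negative rates quoted above are conventionally derived from confusion-matrix counts. The helper below is a small illustrative sketch (the function name `error_rates` is hypothetical, and the labels in the usage note are made up); it does not reproduce the project's evaluation.

```python
def error_rates(y_true, y_pred):
    """Return (false_positive_rate, false_negative_rate) for binary labels.

    Label 1 means "failure", label 0 means "normal".
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1  # false alarm
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1  # missed failure

    # FPR = FP / (FP + TN): share of normal samples flagged as failures.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    # FNR = FN / (FN + TP): share of real failures the model missed.
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr
```

For example, with true labels `[0, 0, 0, 0, 1, 1]` and predictions `[0, 0, 0, 1, 1, 0]`, one of four normal samples is a false alarm and one of two failures is missed, giving rates of 0.25 and 0.5.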
| ### Tech Stack | |
| Training: Scikit-Learn, Pandas, Psutil | |
| Deployment: Gradio, Hugging Face Spaces | |
| Hardware Target: x86_64 Linux Systems |