---
title: Server Failure Predictor
emoji: 🦀
colorFrom: green
colorTo: green
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
short_description: Predicts the probability of server failure
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference



## Server Health Sentinel AI

Predicting Thermal Runaway Before It Happens

This project is a Proof-of-Concept (PoC) for AIOps (Artificial Intelligence for IT Operations). It demonstrates how machine learning can move beyond simple "threshold-based" monitoring to predictive failure analysis.

### Try the Demo

Adjust the sliders in the Live Telemetry Simulation panel to see how the model reacts to different stress scenarios.

Scenario A (Idle): Low CPU, Low Temp -> System Normal

Scenario B (Gaming/Load): High Sustained CPU, High Temp -> CRITICAL FAILURE IMMINENT

Scenario C (Cool Down): Low Current CPU but High Sustained Load + High Temp -> CRITICAL (the model anticipates residual heat even though the instantaneous load has dropped)

### The Model: Random Forest Classifier

Unlike simple if/else logic, this system uses a Random Forest Classifier (an ensemble of 100 decision trees) to weigh multiple factors simultaneously.

It was trained on a custom dataset of 10,000+ telemetry points collected from a high-performance Linux gaming laptop (HP Victus 15 / Ryzen 5 5600H) under various real-world conditions:

Idle/Web Browsing (Baseline)

Compilation/Workloads (CPU Spikes)

Gaming (Sekiro: Shadows Die Twice) (Sustained CPU+GPU Thermal Stress)
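
As a rough illustration of the setup described above, the sketch below trains a 100-tree Random Forest with Scikit-Learn. The data is synthetic and the feature names (`cpu_now`, `cpu_avg`, `temp_c`) and labelling rule are assumptions made for this example; they are not the project's actual dataset or training script.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the project's telemetry).
# Feature names below are hypothetical.
rng = np.random.default_rng(0)
n = 2000
cpu_now = rng.uniform(0, 100, n)   # instantaneous CPU load, percent
cpu_avg = rng.uniform(0, 100, n)   # sustained (rolling average) CPU load, percent
temp_c = 35 + 0.5 * cpu_avg + rng.normal(0, 3, n)  # temperature, degrees C
X = np.column_stack([cpu_now, cpu_avg, temp_c])

# Assumed labelling rule, used only so the sketch has something to learn.
y = ((temp_c > 75) & (cpu_avg > 70)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# 100 trees, matching the ensemble size stated above.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# predict_proba returns one column per class; column 1 is P(failure).
failure_prob = model.predict_proba(X_test)[:, 1]
```

Using `predict_proba` rather than `predict` yields a probability of failure, which is what a slider-driven demo would typically display.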

### Feature Engineering

The model doesn't just look at current stats. It relies on engineered trend features to understand context:

Rolling Averages: A 1-minute sustained load is more dangerous than a 1-second spike.

Thermal Inertia: Combining current temp with recent load history to predict "heat soak."

Rate of Change: How fast is the temperature climbing?
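
The three trend features above could be computed along these lines. This is a minimal sketch, not the project's code: the class name `TrendFeatures`, the one-sample-per-second rate, and the formula used for thermal inertia (temperature scaled by sustained load) are all assumptions.

```python
from collections import deque


class TrendFeatures:
    """Tracks recent telemetry and derives trend features (hypothetical helper)."""

    def __init__(self, window=60):
        # window=60 assumes one sample per second, i.e. a 1-minute window.
        self.cpu = deque(maxlen=window)
        self.prev_temp = None

    def update(self, cpu_percent, temp_c, dt=1.0):
        self.cpu.append(cpu_percent)

        # Rolling average: mean CPU load over the retained window.
        cpu_avg = sum(self.cpu) / len(self.cpu)

        # Rate of change: degrees C per second since the previous sample.
        if self.prev_temp is None:
            temp_rate = 0.0
        else:
            temp_rate = (temp_c - self.prev_temp) / dt
        self.prev_temp = temp_c

        # Thermal inertia (assumed form): current temperature scaled by sustained load.
        thermal_inertia = temp_c * cpu_avg / 100.0

        return {
            "cpu_avg": cpu_avg,
            "temp_rate": temp_rate,
            "thermal_inertia": thermal_inertia,
        }


# Usage: CPU drops on the last sample, but the sustained load stays high.
tracker = TrendFeatures(window=3)
for cpu, temp in [(90, 70.0), (95, 74.0), (10, 76.0)]:
    features = tracker.update(cpu, temp)
```

In the usage example the final sample has low instantaneous CPU, yet the rolling average and thermal inertia remain elevated, which is the kind of context the "Cool Down" scenario depends on.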

### Performance

AUC Score: ~0.99 on the test set

False Positive Rate: <0.5%

False Negative Rate: <1.0%
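
For reference, the false positive and false negative rates follow the standard definitions FPR = FP / (FP + TN) and FNR = FN / (FN + TP). The sketch below computes them from binary labels; the function name `error_rates` and the toy labels are illustrative only and are not the project's evaluation data.

```python
def error_rates(y_true, y_pred):
    """Return (false_positive_rate, false_negative_rate) for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1

    # Guard against division by zero when a class is absent.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


# Toy labels: 1 = failure, 0 = normal.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
fpr, fnr = error_rates(y_true, y_pred)
```

For a failure predictor, a false negative (a missed failure) is usually the costlier error, so the two rates are worth tracking separately rather than folding them into a single accuracy figure.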

### Tech Stack

Training: Scikit-Learn, Pandas, Psutil

Deployment: Gradio, Hugging Face Spaces

Hardware Target: x86_64 Linux Systems