heramb04 committed on
Commit
dfbc3b3
·
verified ·
1 Parent(s): 8726b74

Update README.md

updated md contents

Files changed (1)
  1. README.md +56 -0
README.md CHANGED
@@ -12,3 +12,59 @@ short_description: Predicts the probability of server failure
12
  ---
13
 
14
  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
15
+
16
+
17
+
18
+ ## Server Health Sentinel AI
19
+
20
+ Predicting Thermal Runaway Before It Happens
21
+
22
+ This project is a Proof-of-Concept (PoC) for AIOps (Artificial Intelligence for IT Operations). It demonstrates how machine learning can move beyond simple "threshold-based" monitoring to predictive failure analysis.
23
+
24
+ ### Try the Demo
25
+
26
+ Adjust the sliders in the Live Telemetry Simulation panel to see how the model reacts to different stress scenarios.
27
+
28
+ - Scenario A (Idle): Low CPU, Low Temp -> System Normal
29
+
30
+ - Scenario B (Gaming/Load): High Sustained CPU, High Temp -> CRITICAL FAILURE IMMINENT
31
+
32
+ - Scenario C (Cool Down): Low Current CPU, but High Sustained Load and High Temp -> CRITICAL (the model anticipates residual heat)
33
+
34
+ ### The Model: Random Forest Classifier
35
+
36
+ Unlike simple if/else logic, this system uses a Random Forest Classifier (an ensemble of 100 decision trees) to weigh multiple factors simultaneously.
37
+
38
+ It was trained on a custom dataset of 10,000+ telemetry points collected from a high-performance Linux gaming laptop (HP Victus 15 / Ryzen 5 5600H) under various real-world conditions:
39
+
40
+ - Idle/Web Browsing (Baseline)
41
+
42
+ - Compilation/Workloads (CPU Spikes)
43
+
44
+ - Gaming (Sekiro: Shadows Die Twice) (Sustained CPU+GPU Thermal Stress)
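
As a rough illustration of the classifier described above, the sketch below fits a 100-tree Random Forest with scikit-learn. It is not the project's training code: the data is synthetic, the three feature names (`cpu_now`, `cpu_rolling_mean`, `temp_now`) are hypothetical, and the labelling rule is invented purely so the example runs end to end.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the telemetry dataset (assumption, not real data).
rng = np.random.default_rng(0)
n = 2000
cpu_now = rng.uniform(0, 100, n)           # instantaneous CPU load, percent
cpu_rolling_mean = rng.uniform(0, 100, n)  # sustained CPU load, percent
temp_now = rng.uniform(35, 100, n)         # current temperature, Celsius

# Invented labelling rule: "failure" means hot AND under sustained load.
y = ((temp_now > 85) & (cpu_rolling_mean > 70)).astype(int)
X = np.column_stack([cpu_now, cpu_rolling_mean, temp_now])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# An ensemble of 100 decision trees, as stated in the README.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Two hand-written inputs mirroring Scenario A and Scenario C.
idle = [[5.0, 8.0, 45.0]]        # low CPU, low sustained load, cool
cool_down = [[10.0, 90.0, 92.0]]  # low current CPU, high sustained load, hot

p_idle = model.predict_proba(idle)[0, 1]
p_cool_down = model.predict_proba(cool_down)[0, 1]
print(f"idle: {p_idle:.2f}, cool down: {p_cool_down:.2f}")
```

Because the synthetic label depends on sustained load rather than instantaneous load, the cool-down input is flagged even though its current CPU reading is low, which is the behaviour a single fixed threshold on current CPU would miss.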
45
+
46
+ ### Feature Engineering
47
+
48
+ The model doesn't just look at current stats. It relies on engineered trend features to understand context:
49
+
50
+ - Rolling Averages: A 1-minute sustained load is more dangerous than a 1-second spike.
51
+
52
+ - Thermal Inertia: Combining current temp with recent load history to predict "heat soak."
53
+
54
+ - Rate of Change: How fast is the temperature climbing?
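
The three trend features above could be computed along the following lines. This is a minimal sketch using only the standard library: the one-sample-per-second rate, the 60-sample window, the field names, and the `heat_soak` formula are all assumptions made for illustration, not the project's actual feature definitions.

```python
from collections import deque
from statistics import mean


def engineer_features(samples, window=60):
    """Derive trend features from (cpu_percent, temp_c) samples.

    Assumes one sample per second, so window=60 covers one minute.
    """
    cpu_window = deque(maxlen=window)  # most recent CPU readings
    prev_temp = None
    rows = []
    for cpu, temp in samples:
        cpu_window.append(cpu)
        rolling_cpu = mean(cpu_window)
        rows.append({
            "cpu_now": cpu,
            "temp_now": temp,
            # Rolling average: sustained load over the window.
            "cpu_rolling_mean": rolling_cpu,
            # Rate of change: degrees gained since the previous sample.
            "temp_delta": 0 if prev_temp is None else temp - prev_temp,
            # Hypothetical thermal-inertia proxy: temperature weighted
            # by recent load history.
            "heat_soak": temp * rolling_cpu / 100.0,
        })
        prev_temp = temp
    return rows


# A 1-second spike after an idle minute.
spike = [(5, 45)] * 59 + [(95, 47)]
# A full minute of sustained load.
sustained = [(95, 80)] * 60
# Cool down: load has just dropped, but the last minute was mostly busy.
cool_down = [(95, 85)] * 55 + [(10, 84)] * 5

spike_row = engineer_features(spike)[-1]
sustained_row = engineer_features(sustained)[-1]
cool_down_row = engineer_features(cool_down)[-1]
print(spike_row["cpu_rolling_mean"], sustained_row["cpu_rolling_mean"])
```

The spike and the sustained run both end at 95% CPU, yet their rolling means differ widely, and the cool-down run keeps a high rolling mean even though its current CPU reading is only 10%. That gap is the context a model needs to tell Scenario A from Scenario C.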
55
+
56
+ ### Performance
57
+
58
+ - AUC: ~0.99 on the test set
59
+
60
+ - False Positive Rate: <0.5%
61
+
62
+ - False Negative Rate: <1.0%
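
For reference, the two error rates above are conventionally defined as FPR = FP / (FP + TN) and FNR = FN / (FN + TP). The sketch below computes them from predicted and true labels; the label lists are made up for illustration and are not the project's test data.

```python
def error_rates(y_true, y_pred):
    """Return (false_positive_rate, false_negative_rate) for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1  # healthy system flagged as failing
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1  # real failure that the model missed
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


# Illustrative labels only: 1 = failure imminent, 0 = normal.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

fpr, fnr = error_rates(y_true, y_pred)
print(f"FPR={fpr:.2%}, FNR={fnr:.2%}")
```

In a monitoring setting the two errors carry different costs: a false positive is a spurious alert, while a false negative is a missed failure.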
63
+
64
+ ### Tech Stack
65
+
66
+ - Training: scikit-learn, pandas, psutil
67
+
68
+ - Deployment: Gradio, Hugging Face Spaces
69
+
70
+ - Hardware Target: x86_64 Linux Systems