Spaces: heramb04 / server_failure_predictor

---
title: Server Failure Predictor
emoji: 🦀
colorFrom: green
colorTo: green
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
short_description: Predicts the probability of server failure
---

	Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference



## Server Health Sentinel AI

	Predicting Thermal Runaway Before It Happens

	This project is a Proof-of-Concept (PoC) for AIOps (Artificial Intelligence for IT Operations). It demonstrates how machine learning can move beyond simple "threshold-based" monitoring to predictive failure analysis.

	### Try the Demo

	Adjust the sliders in the Live Telemetry Simulation panel to see how the model reacts to different stress scenarios.

- Scenario A (Idle): Low CPU, Low Temp -> System Normal
- Scenario B (Gaming/Load): High Sustained CPU, High Temp -> CRITICAL FAILURE IMMINENT
- Scenario C (Cool Down): Low Current CPU but High Sustained Load + High Temp -> CRITICAL (Predicting residual heat)

	### The Model: Random Forest Classifier

	Unlike simple if/else logic, this system uses a Random Forest Classifier (an ensemble of 100 decision trees) to weigh multiple factors simultaneously.

	It was trained on a custom dataset of 10,000+ telemetry points collected from a high-performance Linux gaming laptop (HP Victus 15 / Ryzen 5 5600H) under various real-world conditions:

- Idle/Web Browsing (Baseline)
- Compilation/Workloads (CPU Spikes)
- Gaming (Sekiro: Shadows Die Twice) (Sustained CPU+GPU Thermal Stress)
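
As a rough illustration of the classifier described above, here is a minimal sketch of a 100-tree Random Forest in scikit-learn. Everything apart from `n_estimators=100` is an assumption made for the example: the data is randomly generated (it is not the laptop telemetry), the three feature columns are hypothetical, and the labelling rule exists only so the model has something to learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the telemetry dataset (NOT the project's real data).
# Hypothetical columns: current CPU %, 1-minute average CPU %, temperature in °C.
n = 2000
cpu_now = rng.uniform(0, 100, n)
cpu_avg_1m = rng.uniform(0, 100, n)
temp_c = 35 + 0.5 * cpu_avg_1m + rng.normal(0, 5, n)
X = np.column_stack([cpu_now, cpu_avg_1m, temp_c])

# Hypothetical labelling rule: hot AND under sustained load counts as "failure".
y = ((temp_c > 75) & (cpu_avg_1m > 70)).astype(int)

# An ensemble of 100 decision trees, as stated in the README.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Two made-up readings loosely mirroring the demo scenarios.
idle = [[5.0, 5.0, 40.0]]          # low CPU, low temperature
cool_down = [[10.0, 90.0, 88.0]]   # low current CPU, high sustained load, high temperature

p_idle = model.predict_proba(idle)[0, 1]
p_cool_down = model.predict_proba(cool_down)[0, 1]
print(f"idle: {p_idle:.2f}  cool-down: {p_cool_down:.2f}")
```

The point of the sketch is that `predict_proba` returns a failure probability built from all input features at once, so a reading with low current CPU can still score high when the sustained-load and temperature columns are elevated.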

	### Feature Engineering

	The model doesn't just look at current stats. It relies on engineered trend features to understand context:

- Rolling Averages: A 1-minute sustained load is more dangerous than a 1-second spike.
- Thermal Inertia: Combining current temp with recent load history to predict "heat soak."
- Rate of Change: How fast is the temperature climbing?
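
The three trend features above could be derived along the following lines. This is a sketch using only the standard library, not the project's actual feature code: the class name, the feature names, the 60-sample window (assuming one sample per second), and in particular the `heat_soak` formula are all hypothetical choices for illustration.

```python
from collections import deque
from statistics import fmean


class TrendFeatures:
    """Keeps a short telemetry history and derives trend features from it."""

    def __init__(self, window=60):
        # window = number of samples kept; 60 assumes one sample per second.
        self.cpu = deque(maxlen=window)
        self.temp = deque(maxlen=window)

    def update(self, cpu_percent, temp_c):
        self.cpu.append(cpu_percent)
        self.temp.append(temp_c)

        # Rolling average: sustained load over the stored window.
        cpu_avg = fmean(self.cpu)

        # Rate of change: average degrees per sample across the window.
        if len(self.temp) > 1:
            temp_rate = (self.temp[-1] - self.temp[0]) / (len(self.temp) - 1)
        else:
            temp_rate = 0.0

        return {
            "cpu_now": cpu_percent,
            "cpu_avg": cpu_avg,
            "temp_rate": temp_rate,
            # Hypothetical thermal-inertia proxy: temperature weighted by recent load.
            "heat_soak": temp_c * cpu_avg / 100.0,
        }


# Usage: two seconds of heavy load followed by a sudden drop in CPU usage.
tracker = TrendFeatures(window=3)
tracker.update(100, 50)
tracker.update(100, 60)
features = tracker.update(10, 70)
print(features)
```

In the usage example the current CPU reading is low, yet `cpu_avg` and `heat_soak` remain high, which is the kind of context the "Cool Down" scenario depends on.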

	### Performance

- AUC Score: ~0.99 on the test set (near-perfect separation of failure and normal samples)
- False Positive Rate: <0.5%
- False Negative Rate: <1.0%
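
For reference, the two error rates are defined as FPR = FP / (FP + TN) and FNR = FN / (FN + TP). The sketch below computes them from true and predicted labels; the helper name `error_rates` and the toy labels are made up for the example and are unrelated to the project's test set.

```python
def error_rates(y_true, y_pred):
    """Return (false positive rate, false negative rate) for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1   # false alarm: predicted failure on a healthy sample
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1   # missed failure: predicted healthy on a failing sample

    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


# Toy labels (1 = failure, 0 = normal), invented for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
fpr, fnr = error_rates(y_true, y_pred)
print(f"FPR={fpr:.2f}  FNR={fnr:.2f}")
```

In a failure predictor the two rates carry different costs: a false positive is a spurious alert, while a false negative is a failure that goes unwarned.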

	### Tech Stack

- Training: Scikit-Learn, Pandas, Psutil
- Deployment: Gradio, Hugging Face Spaces
- Hardware Target: x86_64 Linux Systems