---
title: Argus ML Observability Platform
emoji: 🚕
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: MLOps drift detection and auto-retraining platform
---
# Argus
> **Production ML models fail silently. Argus detects the failure, explains why, and fixes it automatically.**
[![Live Demo](https://img.shields.io/badge/Live%20Demo-Hugging%20Face-FFD21E?logo=huggingface&logoColor=black)](https://huggingface.co/spaces/Hodfa71/argus-mlops)
[![Python](https://img.shields.io/badge/Python-3.10%2B-blue?logo=python&logoColor=white)](https://python.org)
[![FastAPI](https://img.shields.io/badge/FastAPI-0.111-009688?logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com)
[![Streamlit](https://img.shields.io/badge/Streamlit-1.33-FF4B4B?logo=streamlit&logoColor=white)](https://streamlit.io)
[![MLflow](https://img.shields.io/badge/MLflow-2.12-0194E2?logo=mlflow&logoColor=white)](https://mlflow.org)
[![scikit-learn](https://img.shields.io/badge/scikit--learn-1.4-F7931E?logo=scikit-learn&logoColor=white)](https://scikit-learn.org)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow)](LICENSE)
**[Try the live demo on Hugging Face Spaces](https://huggingface.co/spaces/Hodfa71/argus-mlops)**
---
## Architecture
![Architecture](assets/architecture.png)
Argus is a **production-grade ML monitoring platform** built around a NYC taxi trip-duration prediction service. It demonstrates the full MLOps lifecycle:
- Serves low-latency predictions through a **FastAPI REST API**
- Monitors model health continuously with **PSI + KS-test drift detection**
- Handles **24-hour delayed ground truth** using request-ID matching
- Explains *why* drift occurred via **root-cause analysis** (PSI × feature importance)
- Retrains automatically only when both drift *and* performance degradation are confirmed
- Runs safe **champion-challenger model promotion** (no regression allowed)
- Tracks every experiment in **MLflow**
- Visualises everything in a **dark-themed Streamlit dashboard**
---
## Live Dashboard
![Argus dashboard demo](assets/demo.gif)
---
## Dashboard Preview
![Dashboard Preview](assets/dashboard_preview.png)
---
## Drift Detection in Action
![Drift Detection](assets/drift_detection.png)
Per-feature PSI breakdown showing distribution shift between reference and live windows.
Features are color-coded: green = stable, amber = moderate, red = significant drift.
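
As a rough illustration of the per-feature score behind this chart, the sketch below computes PSI for one numeric feature using only the standard library. It is not the code in `drift_detector.py`: the quantile binning, the `eps` floor, and the 0.10 "moderate" cut-off are assumptions (a common rule of thumb), while 0.20 matches `psi_threshold` in the config.

```python
import math
import random


def psi(reference, live, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a live sample."""
    # Bin edges are reference quantiles, so each bin holds ~1/n_bins of the reference.
    ref_sorted = sorted(reference)
    edges = [ref_sorted[int(len(ref_sorted) * i / n_bins)] for i in range(1, n_bins)]

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            counts[sum(v > e for e in edges)] += 1  # edges below v = bin index
        # eps floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), eps) for c in counts]

    ref_p, live_p = bin_shares(reference), bin_shares(live)
    return sum((lp - rp) * math.log(lp / rp) for rp, lp in zip(ref_p, live_p))


def drift_label(score, moderate=0.10, significant=0.20):
    """Map a PSI score to the green / amber / red buckets (0.10 is assumed)."""
    if score >= significant:
        return "significant"
    if score >= moderate:
        return "moderate"
    return "stable"


rng = random.Random(42)  # fixed seed so the illustration is repeatable
reference = [rng.gauss(3.2, 1.0) for _ in range(5000)]
live_same = [rng.gauss(3.2, 1.0) for _ in range(5000)]
live_shifted = [rng.gauss(6.0, 1.0) for _ in range(5000)]

print(drift_label(psi(reference, live_same)))
print(drift_label(psi(reference, live_shifted)))
```

Quantile bins keep every reference bin populated, which makes the score less sensitive to long tails than equal-width bins would be.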
---
## Before vs. After Retraining
![Performance Recovery](assets/before_after_retraining.png)
RMSE degrades during the drift window, then recovers once the challenger is promoted.
The right panel shows rolling degradation percentage relative to the baseline.
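
The rolling degradation percentage can be sketched as follows. The class name, window size, and baseline value are hypothetical and chosen for illustration only; the actual logic lives in `performance_monitor.py`.

```python
import math
from collections import deque


class RollingRmse:
    """Rolling RMSE over the most recent `window` labelled predictions."""

    def __init__(self, baseline_rmse, window=500):
        self.baseline_rmse = baseline_rmse
        self._sq_errors = deque(maxlen=window)  # old errors fall out automatically

    def add(self, predicted, actual):
        self._sq_errors.append((predicted - actual) ** 2)

    @property
    def rmse(self):
        if not self._sq_errors:
            return float("nan")
        return math.sqrt(sum(self._sq_errors) / len(self._sq_errors))

    @property
    def degradation_pct(self):
        """Percentage change of the rolling RMSE relative to the baseline."""
        return 100.0 * (self.rmse - self.baseline_rmse) / self.baseline_rmse


monitor = RollingRmse(baseline_rmse=4.0, window=3)
for predicted, actual in [(10, 16), (20, 14), (30, 36)]:
    monitor.add(predicted, actual)

print(monitor.rmse, monitor.degradation_pct)
```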
---
## Feature Importance Tracking
![Feature Importance](assets/feature_importance.png)
Side-by-side champion vs. challenger feature importance with delta view.
---
## PSI Heatmap
![PSI Heatmap](assets/psi_heatmap.png)
Feature drift intensity across consecutive monitoring windows.
Dark = stable, amber = moderate, red = significant drift.
---
## Live Demo
| Service | URL |
|---------|-----|
| **API Docs (Swagger)** | `https://argus-ml-api.onrender.com/docs` |
| **Monitoring Dashboard** | `https://argus-dashboard.streamlit.app` |
| **Health Check** | `https://argus-ml-api.onrender.com/health` |
> Deploy your own instance: [deployment/DEPLOYMENT.md](deployment/DEPLOYMENT.md)
---
## Quick Start
```bash
# 1. Clone
git clone https://github.com/YOUR_USERNAME/argus.git
cd argus
# 2. Install
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
# 3. Train initial model + build reference dataset
python scripts/train_initial_model.py
# 4. Full end-to-end demo (no running API required)
python scripts/demo.py
```
**Expected demo output (excerpt):**
```
STEP 4: Inject Drift
Drift type : SUDDEN (abrupt distribution shift)
trip_distance mean: 3.21 -> 12.86 (delta=+9.65)
pickup_hour mean: 11.4 -> 16.8 (delta=+5.4)
STEP 5: Drift Detection
Feature drift detected : True
Drifted features : ['trip_distance', 'pickup_hour']
Performance degraded : True (+48.6%)
STEP 6: Root-Cause Analysis
Primary cause : trip_distance (RCA score=0.521)
Action : retrain_immediately
STEP 9: Summary
{
"drift_detected": true,
"root_cause": ["trip_distance", "pickup_hour"],
"performance_drop": "48.6%",
"action": "retrain_immediately",
"model_promoted": true,
"new_model_improvement": "11.23%"
}
```
---
## API Reference
### Start the server
```bash
uvicorn app:app --reload --port 8000
# Swagger UI: http://localhost:8000/docs
```
### Predict trip duration
```bash
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{
"passenger_count": 2,
"trip_distance": 3.5,
"pickup_hour": 8,
"pickup_dow": 1,
"pickup_month": 3,
"pickup_is_weekend": 0,
"rate_code_id": 1,
"payment_type": 1,
"pu_location_zone": 10,
"do_location_zone": 25,
"vendor_id": 1
}'
```
```json
{
"request_id": "3f4a1b2c-...",
"predicted_duration_min": 14.23,
"model_version": "a3b7f921",
"timestamp": "2024-03-15T09:42:11Z"
}
```
### Submit delayed ground truth
```bash
curl -X POST http://localhost:8000/predict/feedback \
-H "Content-Type: application/json" \
-d '{"request_id": "3f4a1b2c-...", "actual_duration_min": 16.5}'
```
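
The same predict-then-feedback round trip can be scripted from Python. This is a minimal standard-library sketch that assumes the server is running locally on port 8000 as shown above; `build_post` and `send` are hypothetical helper names, not part of Argus.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local server from "Start the server"


def build_post(path, payload):
    """Build a JSON POST request for the API without sending it."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


trip = {
    "passenger_count": 2, "trip_distance": 3.5, "pickup_hour": 8,
    "pickup_dow": 1, "pickup_month": 3, "pickup_is_weekend": 0,
    "rate_code_id": 1, "payment_type": 1, "pu_location_zone": 10,
    "do_location_zone": 25, "vendor_id": 1,
}
predict_request = build_post("/predict", trip)

# With the server running, the round trip would look like this:
# prediction = send(predict_request)
# feedback = {"request_id": prediction["request_id"], "actual_duration_min": 16.5}
# send(build_post("/predict/feedback", feedback))
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test without a live API.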
### Drift status
```bash
curl http://localhost:8000/monitor/drift
```
```json
{
"drift_detected": true,
"root_cause": ["trip_distance", "pickup_hour"],
"performance_drop": "18.3%",
"action": "retraining_triggered",
"rca_details": [
{"feature": "trip_distance", "psi": 0.312, "importance": 0.421, "rca_score": 0.444},
{"feature": "pickup_hour", "psi": 0.241, "importance": 0.187, "rca_score": 0.286}
]
}
```
### Performance metrics
```bash
curl http://localhost:8000/monitor/metrics
```
### Trigger retraining manually
```bash
curl -X POST http://localhost:8000/monitor/retrain
```
---
## Drift Simulation
```bash
# Gradual shift over 500 steps
python scripts/simulate_drift.py --drift-type gradual --steps 500
# Sudden distribution shift (severity 2x)
python scripts/simulate_drift.py --drift-type sudden --severity 2.0
# Seasonal oscillation
python scripts/simulate_drift.py --drift-type seasonal --steps 1000
# Mixed drift with 30-step feedback lag
python scripts/simulate_drift.py --drift-type mixed --feedback-lag 30
```
---
## Docker Stack
```bash
docker compose up --build
# API: http://localhost:8000
# Dashboard: http://localhost:8501
# MLflow UI: http://localhost:5000
```
---
## Project Structure
```
argus/
├── src/
│   ├── api/
│   │   ├── main.py                  # FastAPI app factory + lifespan
│   │   ├── schemas.py               # Pydantic request/response models
│   │   └── routes/
│   │       ├── predict.py           # POST /predict, POST /predict/feedback
│   │       ├── monitor.py           # GET /monitor/drift, /metrics, /history
│   │       └── health.py            # GET /health
│   │
│   ├── data/
│   │   ├── generator.py             # Synthetic NYC taxi data generator
│   │   ├── drift_simulator.py       # Gradual / Sudden / Seasonal drift engine
│   │   └── preprocessing.py         # Stateless feature pipeline
│   │
│   ├── models/
│   │   ├── trainer.py               # GradientBoosting + MLflow experiment tracking
│   │   ├── registry.py              # File-based champion/challenger registry
│   │   └── evaluator.py             # Side-by-side model comparison
│   │
│   ├── monitoring/
│   │   ├── drift_detector.py        # PSI + KS-test; performance drift detection
│   │   ├── performance_monitor.py   # Rolling window + delayed feedback matching
│   │   └── root_cause_analyzer.py   # PSI × importance ranking
│   │
│   └── retraining/
│       ├── trigger.py               # Dual-gate retraining decision logic
│       └── pipeline.py              # Retrain → compare → promote workflow
│
├── dashboard/
│   └── app.py                       # Streamlit dashboard (5 pages)
│
├── tests/
│   ├── test_drift_detection.py
│   ├── test_api.py
│   └── test_retraining.py
│
├── scripts/
│   ├── train_initial_model.py       # Bootstrap: generate data + train champion
│   ├── simulate_drift.py            # Production traffic + drift simulator
│   ├── demo.py                      # Full in-process walkthrough
│   └── generate_assets.py           # Regenerate README images
│
├── configs/
│   └── config.yaml                  # All thresholds, hyperparams, paths
│
├── assets/                          # Generated plots for README
├── deployment/                      # Render + Streamlit Cloud guides
├── Dockerfile
├── Dockerfile.dashboard
├── docker-compose.yml
└── requirements.txt
```
---
## Design Highlights
### 1. Delayed Feedback Handling
Production labels rarely arrive instantly: a taxi duration is only known after the trip ends.
A fraud label can take days to confirm. Argus handles this with a **request-ID matching system**:
```
Prediction time           Label arrival
    |                           |
    v                           v
[predict] ── 24h gap ─────► [feedback]
    |                           |
    └───────── request_id ──────┘
              (matched)
```
Performance metrics are updated only once a label arrives and has been matched, via its
request ID, to the prediction made from the features observed at request time.
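
A minimal in-memory sketch of this matching step is shown below. `FeedbackMatcher` and its method names are hypothetical, and a real deployment would presumably persist pending predictions rather than hold them in a dict.

```python
import uuid


class FeedbackMatcher:
    """Pairs delayed labels with earlier predictions by request ID."""

    def __init__(self):
        self._pending = {}  # request_id -> predicted duration, awaiting a label
        self.matched = []   # (predicted, actual) pairs ready for metric updates

    def record_prediction(self, predicted):
        request_id = str(uuid.uuid4())
        self._pending[request_id] = predicted
        return request_id

    def record_feedback(self, request_id, actual):
        """Return True if the label matched a pending prediction."""
        predicted = self._pending.pop(request_id, None)
        if predicted is None:
            return False  # unknown or already-matched ID: metrics stay untouched
        self.matched.append((predicted, actual))
        return True


matcher = FeedbackMatcher()
rid = matcher.record_prediction(14.23)        # at prediction time
accepted = matcher.record_feedback(rid, 16.5)  # when the label arrives later
duplicate = matcher.record_feedback(rid, 16.5)
print(accepted, duplicate, matcher.matched)
```

Popping the entry on first match means a replayed or duplicated label cannot be counted twice.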
---
### 2. Root-Cause Drift Analysis
Most drift detectors stop at a Boolean flag. Argus answers a more useful question:
**which features drifted, and how much do they actually matter to the model?**
```
RCA Score = PSI × (1 + model feature importance)
```
A feature with high PSI but zero importance gets no boost from the importance term, so it ranks lower and is usually safe to ignore.
A feature with moderate PSI but high importance is boosted and ranks higher, which signals an urgent retrain.
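
The ranking can be sketched in a few lines. The function names are hypothetical; the `trip_distance` and `pickup_hour` inputs are taken from the drift-status example in the API reference, and the `vendor_id` row is invented to show a drifted but unimportant feature.

```python
def rca_score(psi, importance):
    """RCA score as defined above: PSI × (1 + feature importance)."""
    return psi * (1.0 + importance)


def rank_root_causes(psi_by_feature, importance_by_feature):
    """Return features sorted by descending RCA score."""
    rows = [
        {
            "feature": name,
            "psi": psi,
            # Features missing from the importance map are treated as unimportant.
            "importance": importance_by_feature.get(name, 0.0),
            "rca_score": round(rca_score(psi, importance_by_feature.get(name, 0.0)), 3),
        }
        for name, psi in psi_by_feature.items()
    ]
    return sorted(rows, key=lambda row: row["rca_score"], reverse=True)


ranking = rank_root_causes(
    psi_by_feature={"pickup_hour": 0.241, "vendor_id": 0.25, "trip_distance": 0.312},
    importance_by_feature={"pickup_hour": 0.187, "vendor_id": 0.0, "trip_distance": 0.421},
)
for row in ranking:
    print(row["feature"], row["rca_score"])
```

Here `vendor_id` has a higher raw PSI than `pickup_hour` but ends up last, because the model does not rely on it.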
---
### 3. Dual-Gate Retraining Trigger
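Retraining fires only when **all four conditions** are met simultaneously: the two primary gates (drift and performance degradation) plus two safety checks (sample count and cooldown):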
```
Gate 1 Feature drift detected (PSI >= 0.20)
Gate 2 Performance degraded (RMSE > 1.15 × baseline)
Gate 3 Enough new samples (>= 1000 since last retrain)
Gate 4 Cooldown elapsed (>= 6 hours)
```
This prevents:
- Retraining on transient anomalies (drift without performance impact)
- Retraining when performance drops for non-drift reasons
- Oscillation loops (cooldown window)
- Retraining on statistically insufficient data
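
The decision can be expressed as a small pure function. This is a sketch rather than the contents of `trigger.py`: the names are hypothetical, and the defaults mirror the thresholds listed in the gates above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrainThresholds:
    psi: float = 0.20            # Gate 1
    degradation: float = 0.15    # Gate 2, relative to baseline RMSE
    min_samples: int = 1000      # Gate 3
    cooldown_hours: float = 6.0  # Gate 4


def evaluate_gates(max_psi, rmse, baseline_rmse, new_samples,
                   hours_since_retrain, thresholds=None):
    """Return (should_retrain, per-gate results) for the current window."""
    t = thresholds or RetrainThresholds()
    gates = {
        "feature_drift": max_psi >= t.psi,
        "performance_degraded": rmse > baseline_rmse * (1.0 + t.degradation),
        "enough_samples": new_samples >= t.min_samples,
        "cooldown_elapsed": hours_since_retrain >= t.cooldown_hours,
    }
    return all(gates.values()), gates


# Drift plus a clear RMSE regression: every gate passes.
retrain, gates = evaluate_gates(0.312, 5.9, 4.0, 1500, 8)
# Drift without performance impact: treated as a transient anomaly.
drift_only, drift_only_gates = evaluate_gates(0.312, 4.2, 4.0, 1500, 8)
print(retrain, drift_only)
```

Returning the per-gate results alongside the decision makes it easy to log why a retrain was or was not triggered.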
---
### 4. Champion-Challenger Promotion
Every retrain produces a **challenger** model evaluated against the **champion** on a held-out
set. The challenger is only promoted if it improves RMSE by at least 5%. This ensures the
production model never regresses.
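
The promotion rule reduces to a relative-improvement check. The sketch below uses hypothetical function names and assumes RMSE values measured on the same held-out set; the 5% default corresponds to `promotion_threshold` in the config.

```python
def relative_improvement(champion_rmse, challenger_rmse):
    """Fractional RMSE reduction of the challenger versus the champion."""
    return (champion_rmse - challenger_rmse) / champion_rmse


def should_promote(champion_rmse, challenger_rmse, promotion_threshold=0.05):
    """Promote only if the challenger beats the champion by the threshold."""
    return relative_improvement(champion_rmse, challenger_rmse) >= promotion_threshold


print(should_promote(5.0, 4.4))  # 12% better
print(should_promote(5.0, 4.9))  # only 2% better
print(should_promote(5.0, 5.3))  # worse than the champion
```

Requiring a margin rather than any improvement keeps noise on the held-out set from flipping the production model back and forth.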
---
## Configuration
All thresholds and paths are in [configs/config.yaml](configs/config.yaml):
```yaml
monitoring:
drift:
psi_threshold: 0.2 # flag when PSI >= 0.2
ks_pvalue_threshold: 0.05 # flag when p-value < 0.05
performance:
degradation_threshold: 0.15 # retrain when RMSE +15% vs baseline
retraining:
cooldown_hours: 6
min_samples_since_last_retrain: 1000
model:
evaluation:
promotion_threshold: 0.05 # challenger must beat champion by 5%
hyperparams:
n_estimators: 200
max_depth: 6
learning_rate: 0.05
```
---
## Tests
```bash
pytest tests/ -v
# Individual suites
pytest tests/test_drift_detection.py -v
pytest tests/test_retraining.py -v
pytest tests/test_api.py -v
```
---
## MLflow Tracking
```bash
mlflow ui --backend-store-uri data/mlruns --port 5000
# Open: http://localhost:5000
```
Each run logs hyperparameters, RMSE/MAE/R2, feature importances, the trained model
artifact, and custom tags for trigger reason and root cause.
---
## License
MIT. See [LICENSE](LICENSE).