--- title: Argus ML Observability Platform emoji: 🚕 colorFrom: blue colorTo: purple sdk: docker app_port: 7860 pinned: false license: mit short_description: MLOps drift detection and auto-retraining platform --- # Argus > **Production ML models fail silently. Argus detects the failure, explains why, and fixes it automatically.** [![Live Demo](https://img.shields.io/badge/Live%20Demo-Hugging%20Face-FFD21E?logo=huggingface&logoColor=black)](https://huggingface.co/spaces/Hodfa71/argus-mlops) [![Python](https://img.shields.io/badge/Python-3.10%2B-blue?logo=python&logoColor=white)](https://python.org) [![FastAPI](https://img.shields.io/badge/FastAPI-0.111-009688?logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com) [![Streamlit](https://img.shields.io/badge/Streamlit-1.33-FF4B4B?logo=streamlit&logoColor=white)](https://streamlit.io) [![MLflow](https://img.shields.io/badge/MLflow-2.12-0194E2?logo=mlflow&logoColor=white)](https://mlflow.org) [![scikit-learn](https://img.shields.io/badge/scikit--learn-1.4-F7931E?logo=scikit-learn&logoColor=white)](https://scikit-learn.org) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow)](LICENSE) **[Try the live demo on Hugging Face Spaces](https://huggingface.co/spaces/Hodfa71/argus-mlops)** --- ## Architecture ![Architecture](assets/architecture.png) Argus is a **production-grade ML monitoring platform** built around a NYC taxi trip-duration prediction service. It demonstrates the full MLOps lifecycle: - Serves low-latency predictions through a **FastAPI REST API** - Monitors model health continuously with **PSI + KS-test drift detection** - Handles **24-hour delayed ground truth** using request-ID matching - Explains *why* drift occurred via **root-cause analysis** (PSI × feature importance) - Retrains automatically only when both drift *and* performance degradation are confirmed - Runs safe **champion-challenger model promotion** — no regression allowed - Tracks every experiment in **MLflow** - Visualises everything in a **dark-themed Streamlit dashboard** --- ## Live Dashboard ![Argus dashboard demo](assets/demo.gif) --- ## Dashboard Preview ![Dashboard Preview](assets/dashboard_preview.png) --- ## Drift Detection in Action ![Drift Detection](assets/drift_detection.png) Per-feature PSI breakdown showing distribution shift between reference and live windows. Features are color-coded: green = stable, amber = moderate, red = significant drift. --- ## Before vs. After Retraining ![Performance Recovery](assets/before_after_retraining.png) RMSE degrades during the drift window, then recovers once the challenger is promoted. The right panel shows rolling degradation percentage relative to the baseline. --- ## Feature Importance Tracking ![Feature Importance](assets/feature_importance.png) Side-by-side champion vs. challenger feature importance with delta view. --- ## PSI Heatmap ![PSI Heatmap](assets/psi_heatmap.png) Feature drift intensity across consecutive monitoring windows. Dark = stable, amber = moderate, red = significant drift. --- ## Live Demo | Service | URL | |---------|-----| | **API Docs (Swagger)** | `https://argus-ml-api.onrender.com/docs` | | **Monitoring Dashboard** | `https://argus-dashboard.streamlit.app` | | **Health Check** | `https://argus-ml-api.onrender.com/health` | > Deploy your own instance: [deployment/DEPLOYMENT.md](deployment/DEPLOYMENT.md) --- ## Quick Start ```bash # 1. Clone git clone https://github.com/YOUR_USERNAME/argus.git cd argus # 2. Install python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate pip install -r requirements.txt # 3. Train initial model + build reference dataset python scripts/train_initial_model.py # 4. Full end-to-end demo (no running API required) python scripts/demo.py ``` **Expected demo output:** ``` STEP 4: Inject Drift Drift type : SUDDEN (abrupt distribution shift) trip_distance mean: 3.21 -> 12.86 (delta=+9.65) pickup_hour mean: 11.4 -> 16.8 (delta=+5.4) STEP 5: Drift Detection Feature drift detected : True Drifted features : ['trip_distance', 'pickup_hour'] Performance degraded : True (+48.6%) STEP 6: Root-Cause Analysis Primary cause : trip_distance (RCA score=0.521) Action : retrain_immediately STEP 9: Summary { "drift_detected": true, "root_cause": ["trip_distance", "pickup_hour"], "performance_drop": "48.6%", "action": "retrain_immediately", "model_promoted": true, "new_model_improvement": "11.23%" } ``` --- ## API Reference ### Start the server ```bash uvicorn app:app --reload --port 8000 # Swagger UI: http://localhost:8000/docs ``` ### Predict trip duration ```bash curl -X POST http://localhost:8000/predict \ -H "Content-Type: application/json" \ -d '{ "passenger_count": 2, "trip_distance": 3.5, "pickup_hour": 8, "pickup_dow": 1, "pickup_month": 3, "pickup_is_weekend": 0, "rate_code_id": 1, "payment_type": 1, "pu_location_zone": 10, "do_location_zone": 25, "vendor_id": 1 }' ``` ```json { "request_id": "3f4a1b2c-...", "predicted_duration_min": 14.23, "model_version": "a3b7f921", "timestamp": "2024-03-15T09:42:11Z" } ``` ### Submit delayed ground truth ```bash curl -X POST http://localhost:8000/predict/feedback \ -H "Content-Type: application/json" \ -d '{"request_id": "3f4a1b2c-...", "actual_duration_min": 16.5}' ``` ### Drift status ```bash curl http://localhost:8000/monitor/drift ``` ```json { "drift_detected": true, "root_cause": ["trip_distance", "pickup_hour"], "performance_drop": "18.3%", "action": "retraining_triggered", "rca_details": [ {"feature": "trip_distance", "psi": 0.312, "importance": 0.421, "rca_score": 0.444}, {"feature": "pickup_hour", "psi": 0.241, "importance": 0.187, "rca_score": 0.286} ] } ``` ### Performance metrics ```bash curl http://localhost:8000/monitor/metrics ``` ### Trigger retraining manually ```bash curl -X POST http://localhost:8000/monitor/retrain ``` --- ## Drift Simulation ```bash # Gradual shift over 500 windows python scripts/simulate_drift.py --drift-type gradual --steps 500 # Sudden distribution shift (severity 2x) python scripts/simulate_drift.py --drift-type sudden --severity 2.0 # Seasonal oscillation python scripts/simulate_drift.py --drift-type seasonal --steps 1000 # Mixed drift with 30-step feedback lag python scripts/simulate_drift.py --drift-type mixed --feedback-lag 30 ``` --- ## Docker Stack ```bash docker compose up --build # API: http://localhost:8000 # Dashboard: http://localhost:8501 # MLflow UI: http://localhost:5000 ``` --- ## Project Structure ``` argus/ ├── src/ │ ├── api/ │ │ ├── main.py # FastAPI app factory + lifespan │ │ ├── schemas.py # Pydantic request/response models │ │ └── routes/ │ │ ├── predict.py # POST /predict, POST /predict/feedback │ │ ├── monitor.py # GET /monitor/drift, /metrics, /history │ │ └── health.py # GET /health │ │ │ ├── data/ │ │ ├── generator.py # Synthetic NYC taxi data generator │ │ ├── drift_simulator.py # Gradual / Sudden / Seasonal drift engine │ │ └── preprocessing.py # Stateless feature pipeline │ │ │ ├── models/ │ │ ├── trainer.py # GradientBoosting + MLflow experiment tracking │ │ ├── registry.py # File-based champion/challenger registry │ │ └── evaluator.py # Side-by-side model comparison │ │ │ ├── monitoring/ │ │ ├── drift_detector.py # PSI + KS-test; performance drift detection │ │ ├── performance_monitor.py # Rolling window + delayed feedback matching │ │ └── root_cause_analyzer.py # PSI × importance ranking │ │ │ └── retraining/ │ ├── trigger.py # Dual-gate retraining decision logic │ └── pipeline.py # Retrain → compare → promote workflow │ ├── dashboard/ │ └── app.py # Streamlit dashboard (5 pages) │ ├── tests/ │ ├── test_drift_detection.py │ ├── test_api.py │ └── test_retraining.py │ ├── scripts/ │ ├── train_initial_model.py # Bootstrap: generate data + train champion │ ├── simulate_drift.py # Production traffic + drift simulator │ ├── demo.py # Full in-process walkthrough │ └── generate_assets.py # Regenerate README images │ ├── configs/ │ └── config.yaml # All thresholds, hyperparams, paths │ ├── assets/ # Generated plots for README ├── deployment/ # Render + Streamlit Cloud guides ├── Dockerfile ├── Dockerfile.dashboard ├── docker-compose.yml └── requirements.txt ``` --- ## Design Highlights ### 1. Delayed Feedback Handling Production labels rarely arrive instantly — a taxi duration is only known after the trip ends. A fraud label can take days to confirm. Argus handles this with a **request-ID matching system**: ``` Prediction time Label arrival | | v v [predict] ── 24h gap ─────► [feedback] | | └───────── request_id ──────┘ (matched) ``` Performance metrics are only updated when the label arrives and is correctly attributed to the prediction at the time the features were observed. --- ### 2. Root-Cause Drift Analysis Most drift detectors stop at a Boolean flag. Argus answers a more useful question: **which features drifted, and how much do they actually matter to the model?** ``` RCA Score = PSI × (1 + model feature importance) ``` A feature with high PSI but zero importance scores low — safe to ignore. A feature with moderate PSI but high importance scores high — retrain urgently. --- ### 3. Dual-Gate Retraining Trigger Retraining fires only when **all four conditions** are met simultaneously: ``` Gate 1 Feature drift detected (PSI >= 0.20) Gate 2 Performance degraded (RMSE > baseline + 15%) Gate 3 Enough new samples (>= 1000 since last retrain) Gate 4 Cooldown elapsed (>= 6 hours) ``` This prevents: - Retraining on transient anomalies (drift without performance impact) - Retraining when performance drops for non-drift reasons - Oscillation loops (cooldown window) - Retraining on statistically insufficient data --- ### 4. Champion-Challenger Promotion Every retrain produces a **challenger** model evaluated against the **champion** on a held-out set. The challenger is only promoted if it improves RMSE by at least 5%. This ensures the production model never regresses. --- ## Configuration All thresholds and paths are in [configs/config.yaml](configs/config.yaml): ```yaml monitoring: drift: psi_threshold: 0.2 # flag when PSI >= 0.2 ks_pvalue_threshold: 0.05 # flag when p-value < 0.05 performance: degradation_threshold: 0.15 # retrain when RMSE +15% vs baseline retraining: cooldown_hours: 6 min_samples_since_last_retrain: 1000 model: evaluation: promotion_threshold: 0.05 # challenger must beat champion by 5% hyperparams: n_estimators: 200 max_depth: 6 learning_rate: 0.05 ``` --- ## Tests ```bash pytest tests/ -v # Individual suites pytest tests/test_drift_detection.py -v pytest tests/test_retraining.py -v pytest tests/test_api.py -v ``` --- ## MLflow Tracking ```bash mlflow ui --backend-store-uri data/mlruns --port 5000 # Open: http://localhost:5000 ``` Each run logs hyperparameters, RMSE/MAE/R2, feature importances, the trained model artifact, and custom tags for trigger reason and root cause. --- ## License MIT — see [LICENSE](LICENSE).