| title: ARIA DevOps Incident Response | |
| emoji: 🚨 | |
| colorFrom: blue | |
| colorTo: red | |
| sdk: docker | |
| pinned: true | |
| license: apache-2.0 | |
| tags: | |
| - openenv | |
| - reinforcement-learning | |
| - devops | |
| - incident-response | |
| - rl-environment | |
| - multi-agent | |
| - llm-agent | |
| - grpo | |
| - curriculum-learning | |
| - huggingface | |
| - pytorch | |
| - meta | |
| short_description: "OpenEnv RL for incident response. 7 tasks, Llama-3.1-8B" | |
| # ARIA: DevOps Incident Response | |
| ### *The first OpenEnv RL environment for production incident response* | |
| [Open in Colab](https://colab.research.google.com/github/Twilight-13/devops-incident-response/blob/main/train_grpo.ipynb) | |
| [Hugging Face Space](https://huggingface.co/spaces/Arijit-07/devops-incident-response) | |
| [Trained Model](https://huggingface.co/Arijit-07/aria-devops-llama8b) | |
| [License](LICENSE) | |
| > **ARIA**: Adaptive Reward & Incident Architecture | |
| > Built for the Meta × PyTorch × HuggingFace OpenEnv Hackathon Finals | Bangalore, April 2026 | |
| --- | |
| ## Quick Links for Judges | |
| | Resource | Link | | |
| |---|---| | |
| | **Live Environment** | https://arijit-07-devops-incident-response.hf.space | | |
| | **Interactive API** | https://arijit-07-devops-incident-response.hf.space/docs | | |
| | **Trained Model (8B)** | https://huggingface.co/Arijit-07/aria-devops-llama8b | | |
| | **Training Curve** | https://huggingface.co/Arijit-07/aria-devops-llama8b/resolve/main/training_curve_8b.png | | |
| | **Blog Post** | https://huggingface.co/blog/Arijit-07/aria-devops-incident-response | | |
| | **GitHub** | https://github.com/Twilight-13/devops-incident-response | | |
| | **Validate** | https://arijit-07-devops-incident-response.hf.space/validate | | |
| | **About (machine-readable)** | https://arijit-07-devops-incident-response.hf.space/about | | |
| --- | |
| ## Run a Complete Episode Right Now | |
| ```bash | |
| # 1. Start an easy incident | |
| curl -X POST https://arijit-07-devops-incident-response.hf.space/reset \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"task_id": "easy", "seed": 42}' | |
| # 2. Read logs on the failing service (reward: +0.15) | |
| curl -X POST https://arijit-07-devops-incident-response.hf.space/step \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"action_type": "read_logs", "service": "payment-service"}' | |
| # 3. Diagnose (reward: +0.30) | |
| curl -X POST https://arijit-07-devops-incident-response.hf.space/step \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"action_type": "diagnose", "root_cause": "memory leak in payment-service"}' | |
| # 4. Fix it (reward: +0.40) | |
| curl -X POST https://arijit-07-devops-incident-response.hf.space/step \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"action_type": "restart_service", "service": "payment-service"}' | |
| # 5. Validate all 7 tasks pass | |
| curl https://arijit-07-devops-incident-response.hf.space/validate | |
| ``` | |
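The same walkthrough can be driven from Python. The following is a minimal sketch using only the standard library; the helper names `build_request`, `post_json`, and `run_episode` are hypothetical, and the response is only assumed to be a JSON object, since its exact fields are not described here.

```python
import json
import urllib.request

BASE_URL = "https://arijit-07-devops-incident-response.hf.space"

# (path, payload) pairs copied from the curl walkthrough above.
EASY_EPISODE = [
    ("/reset", {"task_id": "easy", "seed": 42}),
    ("/step", {"action_type": "read_logs", "service": "payment-service"}),
    ("/step", {"action_type": "diagnose", "root_cause": "memory leak in payment-service"}),
    ("/step", {"action_type": "restart_service", "service": "payment-service"}),
]


def build_request(path, payload, base_url=BASE_URL):
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path, payload, base_url=BASE_URL):
    """Send the request and decode the body (assumed to be a JSON object)."""
    with urllib.request.urlopen(build_request(path, payload, base_url), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_episode(base_url=BASE_URL):
    """Replay the easy episode step by step and print each raw response."""
    for path, payload in EASY_EPISODE:
        print(path, post_json(path, payload, base_url))
```

Calling `run_episode()` performs the four POST requests in order; pass `base_url="http://localhost:7860"` to target a local container instead.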
| --- | |
| ## The Problem | |
| Every company running microservices faces the same reality: **production incidents are expensive, stressful, and happen at 3am.** | |
| SWE-bench tests code generation. WebArena tests web navigation. Nothing trains agents to handle live production incidents: to read logs strategically, trace cascading failures, correlate subtle business anomalies, and apply precise fixes where wrong choices cause collateral damage. | |
| **ARIA fills that gap.** | |
| --- | |
| ## The 7 Tasks | |
| Task | Max Steps | Random Agent Score | Strong LLM Score | Scenario | |
| |---|---|---|---|---| | |
| `easy` | 15 | 0.05 | 0.85–1.00 | Single service OOM crash-loop | |
| `medium` | 20 | 0.03 | 0.55–0.75 | Cascading failure + red herring alert | |
| `hard` | 25 | 0.01 | 0.30–0.50 | **Silent** corruption: all services green | |
| `bonus` | 25 | 0.01 | 0.35–0.55 | Two simultaneous independent failures | |
| `security` | 20 | 0.01 | 0.40–0.60 | DDoS botnet credential stuffing | |
| `database` | 20 | 0.01 | 0.45–0.65 | Missing index causing full table scans | |
| `failover` | 25 | 0.01 | 0.35–0.55 | Multi-region network partition | |
| `generated` | 20 | 0.01 | variable | Procedural, seed-deterministic | |
| --- | |
| ## Reward Function | |
| ``` | |
| Final Score = Σ(step_rewards) | |
| + efficiency_bonus # (1 - steps/max_steps) × 0.05 | |
| + diagnosis_precision # +0.03 if ≥50% keyword overlap | |
| - noop_penalty # (noops - 3) × 0.02 | |
| ``` | |
| Clamped to **(0.001, 0.999)** for GRPO stability. | |
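The formula above can be worked through in a few lines of Python. This is a sketch under stated assumptions, not the project's grader: the function name `final_score` is hypothetical, the no-op penalty is assumed to apply only to no-ops beyond the third, and the clamp is implemented with the two bounds as hard limits.

```python
def final_score(step_rewards, steps, max_steps, noops, keyword_overlap):
    """Combine per-step rewards with the end-of-episode adjustments."""
    efficiency_bonus = (1 - steps / max_steps) * 0.05
    diagnosis_precision = 0.03 if keyword_overlap >= 0.5 else 0.0
    # Assumption: only no-ops beyond the third are penalized.
    noop_penalty = max(0, noops - 3) * 0.02
    raw = sum(step_rewards) + efficiency_bonus + diagnosis_precision - noop_penalty
    # Keep the result inside the stated range.
    return min(0.999, max(0.001, raw))


# Three-step episode on a 15-step task with a well-matched diagnosis:
# 0.85 (steps) + 0.04 (efficiency) + 0.03 (precision) = 0.92
example = final_score([0.15, 0.30, 0.40], steps=3, max_steps=15, noops=0, keyword_overlap=1.0)
print(round(example, 3))
```

Finishing in fewer steps raises the efficiency term, but it contributes at most 0.05, so the step rewards dominate the score.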
| | Action | Reward | Penalty Triggers | | |
| |---|---|---| | |
| | `read_logs` correct | +0.15 | Restart healthy service: **-0.15** | | |
| | `diagnose` full match | +0.35 | Fix without diagnosing: **-0.10** | | |
| | `restart_service` correct | +0.45 | Wrong failover (payment): **-0.25** | | |
| | `block_ip_range` | +0.40 | Excessive noops: **-0.04 each** | | |
| | `alert_oncall` (required) | +0.15 | | | |
| **Semantic matching:** diagnoses are scored by keyword overlap rather than exact string match, so LLMs that paraphrase aren't penalized. | |
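One way keyword-overlap matching could work is sketched below. The tokenizer, the stopword list, and the function names are assumptions made for illustration; the grader's actual matching rules may differ.

```python
import re

# Hypothetical stopword list; the real grader may use a different one or none.
STOPWORDS = {"a", "an", "the", "in", "on", "of", "is", "to"}


def keywords(text):
    """Lowercase the text and keep alphanumeric tokens that are not stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def keyword_overlap(predicted, ground_truth):
    """Fraction of ground-truth keywords that also appear in the prediction."""
    truth = keywords(ground_truth)
    if not truth:
        return 0.0
    return len(truth & keywords(predicted)) / len(truth)


# A paraphrase still covers 3 of the 4 ground-truth keywords.
print(keyword_overlap("the payment service is leaking memory",
                      "memory leak in payment-service"))  # 0.75
```

Under this sketch a paraphrased diagnosis clears the 50% threshold, while a diagnosis naming the wrong service and failure does not.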
| --- | |
| ## ARIA Features | |
| ### Curriculum Engine | |
| The engine keeps a rolling average score per task over the last 5 episodes. It promotes the agent when the average exceeds 0.75 and scaffolds with hints when the average falls below 0.30, so agents always train at the edge of their capability. | |
| ```bash | |
| GET /curriculum/status | |
| GET /curriculum/next | |
| POST /curriculum/record # {"task_id": "easy", "score": 0.85} | |
| ``` | |
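The promotion and scaffolding rule can be sketched as a small tracker. The class name `CurriculumTracker` and the returned labels are hypothetical, and averaging over fewer than 5 episodes when the window is not yet full is an assumption; only the window size and the two thresholds come from the description above.

```python
from collections import defaultdict, deque

WINDOW = 5             # last 5 episodes per task
PROMOTE_ABOVE = 0.75   # promote when the rolling average exceeds this
SCAFFOLD_BELOW = 0.30  # add hints when the rolling average drops below this


class CurriculumTracker:
    def __init__(self):
        # One bounded history per task; old scores fall off automatically.
        self.scores = defaultdict(lambda: deque(maxlen=WINDOW))

    def record(self, task_id, score):
        self.scores[task_id].append(score)

    def status(self, task_id):
        history = self.scores.get(task_id)
        if not history:
            return "hold"  # assumption: no data means no change
        avg = sum(history) / len(history)
        if avg > PROMOTE_ABOVE:
            return "promote"
        if avg < SCAFFOLD_BELOW:
            return "scaffold"
        return "hold"
```

A training loop would call `record` after every episode (mirroring `POST /curriculum/record`) and consult `status` before picking the next task.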
| ### Incident Generator | |
| Seeds 0–99,999 map to unique, reproducible incidents: 6 failure modes × 8 services × 3 severities × 0–3 noise alerts. | |
| ```bash | |
| GET /generate/preview?seed=1337 | |
| POST /reset # {"task_id": "generated", "seed": 1337} | |
| ``` | |
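Seed-deterministic generation can be illustrated with a local random generator. In this sketch only the dimension sizes (6, 8, 3, and 0 to 3) come from the text above; the placeholder names, the `generate_incident` function, and the returned fields are hypothetical.

```python
import random

# Placeholder values; the real failure modes and service names are not listed here.
FAILURE_MODES = [f"failure_mode_{i}" for i in range(6)]
SERVICES = [f"service_{i}" for i in range(8)]
SEVERITIES = ["low", "medium", "high"]


def generate_incident(seed):
    """Derive an incident from the seed alone, so it can be reproduced later."""
    rng = random.Random(seed)  # local RNG: no dependence on global random state
    return {
        "seed": seed,
        "failure_mode": rng.choice(FAILURE_MODES),
        "service": rng.choice(SERVICES),
        "severity": rng.choice(SEVERITIES),
        "noise_alerts": rng.randint(0, 3),  # inclusive on both ends
    }


# The four listed dimensions give 6 * 8 * 3 * 4 = 576 structural combinations.
print(len(FAILURE_MODES) * len(SERVICES) * len(SEVERITIES) * 4)
```

Because the generator is constructed from the seed inside the function, calling `generate_incident(1337)` twice returns the same incident regardless of what other code has done with the global `random` module.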
| ### Dual-Agent Mode | |
| Split observability. Agent A (Observer) sees logs and alerts. Agent B (Responder) sees metrics and dependencies. They coordinate via `share_finding`. Neither can solve the incident alone. | |
| ```bash | |
| POST /multi-agent/reset # {"task_id": "easy", "seed": 42} | |
| POST /multi-agent/step/a/{id} # {"finding": "order-service OOM"} | |
| POST /multi-agent/step/b/{id} # {"action_type": "restart_service", ...} | |
| ``` | |
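The split-observability idea can be sketched as a filter over a full observation. The field names, the `split_observation` helper, and the assumption that Agent A's shared findings are appended to Agent B's view are all illustrative; the actual session format may differ.

```python
# Hypothetical field names for each agent's slice of the observation.
AGENT_A_FIELDS = {"logs", "alerts"}            # Observer
AGENT_B_FIELDS = {"metrics", "dependencies"}   # Responder


def split_observation(full_obs, shared_findings):
    """Return (view_a, view_b); neither view contains the other's fields."""
    view_a = {k: v for k, v in full_obs.items() if k in AGENT_A_FIELDS}
    view_b = {k: v for k, v in full_obs.items() if k in AGENT_B_FIELDS}
    # Assumption: findings shared by Agent A are the only channel into B's view.
    view_b["shared_findings"] = list(shared_findings)
    return view_a, view_b


full = {
    "logs": ["OOMKilled"],
    "alerts": ["high memory"],
    "metrics": {"memory_pct": 98},
    "dependencies": {"order-service": ["db"]},
}
view_a, view_b = split_observation(full, ["order-service OOM"])
print(sorted(view_a), sorted(view_b))
```

Under this sketch the Responder never sees raw logs, so it can act on the right service only after the Observer has shared a finding.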
| --- | |
| ## Training Results | |
| **Model:** [Arijit-07/aria-devops-llama8b](https://huggingface.co/Arijit-07/aria-devops-llama8b) | |
| | Task | Baseline | Fine-tuned | **Improvement** | | |
| |---|---|---|---| | |
| | easy | 0.320 | 0.685 | **+0.365** | | |
| | medium | 0.050 | 0.378 | **+0.328** | | |
| | hard | 0.190 | 0.869 | **+0.679** | | |
| | bonus | 0.152 | 0.682 | **+0.530** | | |
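| ![Training curve](https://huggingface.co/Arijit-07/aria-devops-llama8b/resolve/main/training_curve_8b.png) | |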
| **Setup:** GRPO · Llama-3.1-8B · LoRA rank=32 · 160 episodes · NVIDIA L4 · 162 minutes · Unsloth + HuggingFace TRL | |
| **Key fix:** Group completions are scored on fresh environment snapshots, which prevents reward gate exhaustion during GRPO group generation. | |
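The snapshot fix can be illustrated with a toy environment. This is a sketch of the general idea, not the training notebook's code: `ToyEnv` and both scoring functions are hypothetical, and the "reward gate" is modeled as an action that pays out only the first time it is taken.

```python
import copy


class ToyEnv:
    """Toy stand-in: each action is rewarded only the first time it is taken."""

    def __init__(self):
        self.rewarded = set()

    def step(self, action):
        if action in self.rewarded:
            return 0.0  # gate already used up
        self.rewarded.add(action)
        return 0.15


def score_group_shared(env, completions):
    """Problem case: every completion in the group mutates the same environment."""
    return [sum(env.step(a) for a in actions) for actions in completions]


def score_group_snapshots(env, completions):
    """Fix: each completion is scored on its own deep copy of the environment."""
    scores = []
    for actions in completions:
        snapshot = copy.deepcopy(env)
        scores.append(sum(snapshot.step(a) for a in actions))
    return scores


group = [["read_logs"], ["read_logs"], ["read_logs"]]
print(score_group_shared(ToyEnv(), group))     # [0.15, 0.0, 0.0]
print(score_group_snapshots(ToyEnv(), group))  # [0.15, 0.15, 0.15]
```

With a shared environment, identical completions receive different rewards depending only on their position in the group, which distorts the group-relative advantage that GRPO computes; scoring on snapshots removes that ordering effect.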
| [Open the training notebook in Colab](https://colab.research.google.com/github/Twilight-13/devops-incident-response/blob/main/train_grpo.ipynb) | |
| --- | |
| ## API Reference | |
| | Method | Endpoint | Description | | |
| |---|---|---| | |
| | GET | `/health` | Liveness check | | |
| | GET | `/about` | Full machine-readable description | | |
| GET | `/tasks` | All 8 tasks (the 7 fixed tasks plus `generated`) | |
| | POST | `/reset` | Start episode | | |
| | POST | `/step` | Take action | | |
| | GET | `/state` | Full state + ground truth | | |
| | GET | `/validate` | Self-test all 7 tasks | | |
| | GET | `/metrics` | Aggregate statistics | | |
| | GET | `/leaderboard` | Top 10 episodes | | |
| | WS | `/ws` | WebSocket real-time | | |
| | GET | `/curriculum/status` | Per-task mastery | | |
| | GET | `/curriculum/next` | Recommended task | | |
| | POST | `/curriculum/record` | Feed training results | | |
| | GET | `/generate/preview` | Preview procedural incident | | |
| | POST | `/multi-agent/reset` | Start dual-agent session | | |
| | POST | `/multi-agent/step/a/{id}` | Agent A shares finding | | |
| | POST | `/multi-agent/step/b/{id}` | Agent B takes action | | |
| | GET | `/live` | Live NOC dashboard (real-time) | | |
| | GET | `/challenge` | Human vs Agent challenge | | |
| | GET | `/progress` | Score progression visualization | | |
| | GET | `/replays` | Episode replay list | | |
| | GET | `/replay/{id}` | Full episode replay | | |
| | GET | `/replay/{id}/html` | Replay HTML viewer | | |
| | GET | `/docs` | Swagger UI | | |
| --- | |
| ## Benchmark Comparison | |
| | Benchmark | Domain | Partial Obs | Dense Reward | Curriculum | Multi-Agent | | |
| |---|---|---|---|---|---| | |
| | SWE-bench | Code repair | β | β | β | β | | |
| | WebArena | Web navigation | β | β | β | β | | |
| | AgentBench | General tools | β | β | β | β | | |
| **ARIA** | **Incident response** | **✓** | **✓** | **✓** | **✓** | |
| --- | |
| ## Setup | |
| ```bash | |
| docker build -t aria-devops-incident . | |
| docker run -p 7860:7860 aria-devops-incident | |
| # Or local | |
| pip install -r requirements.txt | |
| uvicorn api:app --host 0.0.0.0 --port 7860 | |
| ``` | |
| --- | |
| ## Structure | |
| ``` | |
| ├── api.py / server/app.py # FastAPI: all endpoints | |
| ├── env.py # Environment dispatcher | |
| ├── models.py # Pydantic models | |
| ├── tasks/ # 7 tasks + generated | |
| ├── curriculum/engine.py # Adaptive difficulty | |
| ├── generator/ # Procedural incidents | |
| ├── multi_agent/session.py # Dual-agent mode | |
| ├── graders/grader.py # Deterministic grader | |
| ├── demo_llm.py # Live terminal demo | |
| ├── train_grpo.ipynb # Training notebook | |
| ├── BLOG.md # Project story | |
| └── openenv.yaml # OpenEnv manifest | |
| ``` | |
| Apache 2.0 · *Built solo for the Meta × PyTorch × HuggingFace OpenEnv Hackathon Finals, Bangalore, April 2026* | |