---
title: ARIA DevOps Incident Response
emoji: 🚨
colorFrom: blue
colorTo: red
sdk: docker
pinned: true
license: apache-2.0
tags:
- openenv
- reinforcement-learning
- devops
- incident-response
- rl-environment
- multi-agent
- llm-agent
- grpo
- curriculum-learning
- huggingface
- pytorch
- meta
short_description: OpenEnv RL for incident response. 7 tasks, Llama-3.1-8B
---
ARIA: DevOps Incident Response
The first OpenEnv RL environment for production incident response
ARIA: Adaptive Reward & Incident Architecture
Built for the Meta × PyTorch × HuggingFace OpenEnv Hackathon Finals | Bangalore, April 2026
Quick Links for Judges
| Resource | Link |
|---|---|
| Live Environment | https://arijit-07-devops-incident-response.hf.space |
| Interactive API | https://arijit-07-devops-incident-response.hf.space/docs |
| Trained Model (8B) | https://huggingface.co/Arijit-07/aria-devops-llama8b |
| Training Curve | https://huggingface.co/Arijit-07/aria-devops-llama8b/resolve/main/training_curve_8b.png |
| Blog Post | https://huggingface.co/blog/Arijit-07/aria-devops-incident-response |
| GitHub | https://github.com/Twilight-13/devops-incident-response |
| Validate | https://arijit-07-devops-incident-response.hf.space/validate |
| About (machine-readable) | https://arijit-07-devops-incident-response.hf.space/about |
⚡ Run a Complete Episode Right Now

    # 1. Start an easy incident
    curl -X POST https://arijit-07-devops-incident-response.hf.space/reset \
      -H "Content-Type: application/json" \
      -d '{"task_id": "easy", "seed": 42}'

    # 2. Read logs on the failing service (reward: +0.15)
    curl -X POST https://arijit-07-devops-incident-response.hf.space/step \
      -H "Content-Type: application/json" \
      -d '{"action_type": "read_logs", "service": "payment-service"}'

    # 3. Diagnose (reward: +0.30)
    curl -X POST https://arijit-07-devops-incident-response.hf.space/step \
      -H "Content-Type: application/json" \
      -d '{"action_type": "diagnose", "root_cause": "memory leak in payment-service"}'

    # 4. Fix it (reward: +0.40)
    curl -X POST https://arijit-07-devops-incident-response.hf.space/step \
      -H "Content-Type: application/json" \
      -d '{"action_type": "restart_service", "service": "payment-service"}'

    # 5. Validate all 7 tasks pass
    curl https://arijit-07-devops-incident-response.hf.space/validate
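For scripted use, the same walkthrough can be sketched in Python with only the standard library. This is a minimal sketch, not the project's official client: the helper names `build_request`, `post_json`, and `run_episode` are hypothetical, and the response schema is not documented in this section, so responses are returned as raw decoded JSON.

```python
import json
import urllib.request

BASE_URL = "https://arijit-07-devops-incident-response.hf.space"

# The same four calls as the curl walkthrough: (path, JSON payload).
EPISODE = [
    ("/reset", {"task_id": "easy", "seed": 42}),
    ("/step", {"action_type": "read_logs", "service": "payment-service"}),
    ("/step", {"action_type": "diagnose",
               "root_cause": "memory leak in payment-service"}),
    ("/step", {"action_type": "restart_service", "service": "payment-service"}),
]


def build_request(path, payload):
    """Build a JSON POST request for one endpoint (no network I/O here)."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path, payload, timeout=30):
    """Send the request and decode the JSON body of the response."""
    request = build_request(path, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def run_episode(send=post_json):
    """Replay the scripted episode and return every raw response in order."""
    return [send(path, payload) for path, payload in EPISODE]
```

Calling `run_episode()` performs the four requests against the live Space; passing a different `send` callable lets the same script run against a local container or a stub.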
🎯 The Problem
Every company running microservices faces the same reality: production incidents are expensive, stressful, and happen at 3 a.m.
SWE-bench tests code generation. WebArena tests web navigation. Nothing trains agents to handle live production incidents: reading logs strategically, tracing cascading failures, correlating subtle business anomalies, and applying precise fixes where wrong choices cause collateral damage.
ARIA fills that gap.
🔬 The 7 Tasks
| Task | Max Steps | Random | Strong LLM | Scenario |
|---|---|---|---|---|
| `easy` | 15 | 0.05 | 0.85–1.00 | Single service OOM crash-loop |
| `medium` | 20 | 0.03 | 0.55–0.75 | Cascading failure + red herring alert |
| `hard` | 25 | 0.01 | 0.30–0.50 | Silent corruption, all services green |
| `bonus` | 25 | 0.01 | 0.35–0.55 | Two simultaneous independent failures |
| `security` | 20 | 0.01 | 0.40–0.60 | DDoS botnet credential stuffing |
| `database` | 20 | 0.01 | 0.45–0.65 | Missing index causing full table scans |
| `failover` | 25 | 0.01 | 0.35–0.55 | Multi-region network partition |
| `generated` | 20 | 0.01 | variable | Procedural, seed-deterministic |
Reward Function

    Final Score = Σ(step_rewards)
                + efficiency_bonus      # (1 - steps/max_steps) × 0.05
                + diagnosis_precision   # +0.03 if ≥50% keyword overlap
                - noop_penalty          # (noops - 3) × 0.02

Clamped to (0.001, 0.999) for GRPO stability.
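The formula above can be written out as a small function. This is a sketch under stated assumptions rather than the grader's actual code: the name `final_score` and its parameters are hypothetical, the no-op penalty is assumed to apply only to no-ops beyond the third, and the diagnosis overlap is assumed to arrive as a fraction between 0 and 1.

```python
def final_score(step_rewards, steps, max_steps, noops=0, diagnosis_overlap=0.0):
    """Combine per-step rewards with the bonus and penalty terms, then clamp."""
    total = sum(step_rewards)

    # Efficiency bonus: finishing in fewer steps earns up to +0.05.
    total += (1 - steps / max_steps) * 0.05

    # Diagnosis precision: flat +0.03 once keyword overlap reaches 50%.
    if diagnosis_overlap >= 0.5:
        total += 0.03

    # No-op penalty. Assumption: only no-ops beyond the third are charged.
    total -= max(0, noops - 3) * 0.02

    # Keep the score strictly inside (0, 1), as the text says, for GRPO.
    return min(0.999, max(0.001, total))
```

For the easy walkthrough (step rewards 0.15, 0.30, 0.40 in 3 of 15 steps, with a matching diagnosis), the terms are 0.85 + 0.04 + 0.03, which comes to about 0.92.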
| Action | Reward | Penalty Triggers |
|---|---|---|
| `read_logs` correct | +0.15 | Restart healthy service: -0.15 |
| `diagnose` full match | +0.35 | Fix without diagnosing: -0.10 |
| `restart_service` correct | +0.45 | Wrong failover (payment): -0.25 |
| `block_ip_range` | +0.40 | Excessive noops: -0.04 each |
| `alert_oncall` (required) | +0.15 | |
Semantic matching: diagnoses are scored by keyword overlap rather than exact string match, so LLMs that paraphrase aren't penalized.
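Keyword-overlap matching of this kind can be sketched as follows. The tokenization (lowercased alphanumeric runs) and the choice to measure overlap against the ground-truth keywords are assumptions made for illustration; the grader's actual rules are not spelled out in this section, and `keywords` and `keyword_overlap` are hypothetical names.

```python
import re


def keywords(text):
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(predicted, ground_truth):
    """Fraction of ground-truth keywords that also appear in the prediction."""
    truth = keywords(ground_truth)
    if not truth:
        return 0.0
    return len(truth & keywords(predicted)) / len(truth)
```

With this sketch, the paraphrase "The payment-service has a memory leak" covers four of the five keywords in "memory leak in payment-service", which clears the 50% threshold used for the diagnosis precision bonus.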
ARIA Features
Curriculum Engine
Tracks a rolling average score per task over the last 5 episodes. It promotes the agent when the average exceeds 0.75 and scaffolds with hints when the average falls below 0.30, so agents always train at the edge of their capability.

    GET  /curriculum/status
    GET  /curriculum/next
    POST /curriculum/record    # {"task_id": "easy", "score": 0.85}
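The promotion rule described above can be sketched with a bounded deque per task. The class name `CurriculumTracker` and the status labels are hypothetical; only the window size and the two thresholds come from the text.

```python
from collections import defaultdict, deque


class CurriculumTracker:
    """Rolling per-task average with promote/scaffold thresholds."""

    WINDOW = 5             # last 5 episodes per task
    PROMOTE_ABOVE = 0.75   # promote when the average exceeds this
    SCAFFOLD_BELOW = 0.30  # add hints when the average falls below this

    def __init__(self):
        # deque(maxlen=...) silently drops the oldest score when full.
        self._scores = defaultdict(lambda: deque(maxlen=self.WINDOW))

    def record(self, task_id, score):
        self._scores[task_id].append(score)

    def average(self, task_id):
        scores = self._scores.get(task_id)
        return sum(scores) / len(scores) if scores else None

    def status(self, task_id):
        avg = self.average(task_id)
        if avg is None:
            return "untried"   # hypothetical label for tasks with no history
        if avg > self.PROMOTE_ABOVE:
            return "promote"
        if avg < self.SCAFFOLD_BELOW:
            return "scaffold"
        return "practice"      # hypothetical label for the middle band
```

A training loop would call `record` after each episode, mirroring a `POST /curriculum/record` call, and consult `status` before picking the next task.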
Incident Generator
Seeds 0–99,999 each produce a unique, reproducible incident, drawn from 6 failure modes × 8 services × 3 severities × 0–3 noise alerts.

    GET  /generate/preview?seed=1337
    POST /reset    # {"task_id": "generated", "seed": 1337}
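Seed-deterministic generation usually comes down to feeding the seed into a private random number generator. The sketch below illustrates that idea only: apart from `payment-service` and `order-service`, every failure mode, service name, and severity label is a hypothetical placeholder, and the real generator presumably varies more than these four fields.

```python
import random

# Hypothetical placeholders; only the list sizes (6, 8, 3) come from the text.
FAILURE_MODES = ["oom", "cpu_saturation", "disk_full",
                 "bad_deploy", "dependency_timeout", "config_drift"]
SERVICES = ["payment-service", "order-service", "auth-service", "cart-service",
            "search-service", "inventory-service", "email-service", "api-gateway"]
SEVERITIES = ["low", "medium", "high"]


def generate_incident(seed):
    """Return the same incident description every time for a given seed."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    return {
        "seed": seed,
        "failure_mode": rng.choice(FAILURE_MODES),
        "service": rng.choice(SERVICES),
        "severity": rng.choice(SEVERITIES),
        "noise_alerts": rng.randint(0, 3),  # inclusive on both ends
    }
```

Because the generator owns its `random.Random` instance, two calls with seed 1337 always agree, regardless of what other code does with the global `random` module in between.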
Dual-Agent Mode
Split observability. Agent A (Observer) sees logs and alerts. Agent B (Responder) sees metrics and dependencies. They coordinate via `share_finding`. Neither can solve the incident alone.

    POST /multi-agent/reset          # {"task_id": "easy", "seed": 42}
    POST /multi-agent/step/a/{id}    # {"finding": "order-service OOM"}
    POST /multi-agent/step/b/{id}    # {"action_type": "restart_service", ...}
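The split-observability idea can be illustrated by partitioning one observation into two views. This is a sketch, not the session code: the observation keys (`logs`, `alerts`, `metrics`, `dependencies`), the `shared_findings` field, and the function name are all assumptions chosen to match the description above.

```python
# Assumed observation keys, based on the prose description of each agent.
AGENT_A_KEYS = {"logs", "alerts"}            # Observer
AGENT_B_KEYS = {"metrics", "dependencies"}   # Responder


def split_observation(observation, shared_findings):
    """Return (view_a, view_b); B also receives whatever A has shared so far."""
    view_a = {k: v for k, v in observation.items() if k in AGENT_A_KEYS}
    view_b = {k: v for k, v in observation.items() if k in AGENT_B_KEYS}
    # Findings are the only channel through which log evidence reaches B.
    view_b["shared_findings"] = list(shared_findings)
    return view_a, view_b
```

In this sketch Agent B never sees raw logs, so a finding such as "order-service OOM" has to be shared by Agent A before B can act on it.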
Training Results
Model: Arijit-07/aria-devops-llama8b
| Task | Baseline | Fine-tuned | Improvement |
|---|---|---|---|
| easy | 0.320 | 0.685 | +0.365 |
| medium | 0.050 | 0.378 | +0.328 |
| hard | 0.190 | 0.869 | +0.679 |
| bonus | 0.152 | 0.682 | +0.530 |
Setup: GRPO · Llama-3.1-8B · LoRA rank=32 · 160 episodes · NVIDIA L4 · 162 minutes · Unsloth + HuggingFace TRL
Key fix: Each completion in a GRPO group is scored on a fresh environment snapshot, which prevents reward gate exhaustion during group generation.
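One plausible reading of the key fix is shown below with a toy environment. `ToyEnv` is a hypothetical stand-in whose log-reading reward pays out only once (a reward gate); it is not the ARIA environment, and the training notebook may implement snapshots differently.

```python
import copy


class ToyEnv:
    """Stand-in environment with a reward gate: reading logs pays out once."""

    def __init__(self):
        self.logs_read = False

    def step(self, action):
        if action == "read_logs" and not self.logs_read:
            self.logs_read = True  # gate closes after the first payout
            return 0.15
        return 0.0


def score_group_shared(env, completions):
    """Problem case: every completion mutates the same environment."""
    return [env.step(action) for action in completions]


def score_group_snapshots(env, completions):
    """Fix: score each completion on its own deep copy of the environment."""
    rewards = []
    for action in completions:
        snapshot = copy.deepcopy(env)  # fresh state for this completion only
        rewards.append(snapshot.step(action))
    return rewards
```

With four identical `read_logs` completions, the shared version rewards only the first one, while the snapshot version rewards all four equally. GRPO compares completions within a group, so identical completions should receive identical rewards.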
📡 API Reference
| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Liveness check |
| GET | `/about` | Full machine-readable description |
| GET | `/tasks` | All 8 tasks |
| POST | `/reset` | Start episode |
| POST | `/step` | Take action |
| GET | `/state` | Full state + ground truth |
| GET | `/validate` | Self-test all 7 tasks |
| GET | `/metrics` | Aggregate statistics |
| GET | `/leaderboard` | Top 10 episodes |
| WS | `/ws` | WebSocket real-time |
| GET | `/curriculum/status` | Per-task mastery |
| GET | `/curriculum/next` | Recommended task |
| POST | `/curriculum/record` | Feed training results |
| GET | `/generate/preview` | Preview procedural incident |
| POST | `/multi-agent/reset` | Start dual-agent session |
| POST | `/multi-agent/step/a/{id}` | Agent A shares finding |
| POST | `/multi-agent/step/b/{id}` | Agent B takes action |
| GET | `/live` | Live NOC dashboard (real-time) |
| GET | `/challenge` | Human vs Agent challenge |
| GET | `/progress` | Score progression visualization |
| GET | `/replays` | Episode replay list |
| GET | `/replay/{id}` | Full episode replay |
| GET | `/replay/{id}/html` | Replay HTML viewer |
| GET | `/docs` | Swagger UI |
Benchmark Comparison
| Benchmark | Domain | Partial Obs | Dense Reward | Curriculum | Multi-Agent |
|---|---|---|---|---|---|
| SWE-bench | Code repair | β | β | β | β |
| WebArena | Web navigation | β | β | β | β |
| AgentBench | General tools | β | β | β | β |
| ARIA | Incident response | β | β | β | β |
Setup

    docker build -t aria-devops-incident .
    docker run -p 7860:7860 aria-devops-incident

    # Or local
    pip install -r requirements.txt
    uvicorn api:app --host 0.0.0.0 --port 7860
Structure

    ├── api.py / server/app.py     # FastAPI, all endpoints
    ├── env.py                     # Environment dispatcher
    ├── models.py                  # Pydantic models
    ├── tasks/                     # 7 tasks + generated
    ├── curriculum/engine.py       # Adaptive difficulty
    ├── generator/                 # Procedural incidents
    ├── multi_agent/session.py     # Dual-agent mode
    ├── graders/grader.py          # Deterministic grader
    ├── demo_llm.py                # Live terminal demo
    ├── train_grpo.ipynb           # Training notebook
    ├── BLOG.md                    # Project story
    └── openenv.yaml               # OpenEnv manifest
Apache 2.0 · Built solo for the Meta × PyTorch × HuggingFace OpenEnv Hackathon Finals, Bangalore, April 2026
