---
title: ARIA DevOps Incident Response
emoji: 🚨
colorFrom: blue
colorTo: red
sdk: docker
pinned: true
license: apache-2.0
tags:
  - openenv
  - reinforcement-learning
  - devops
  - incident-response
  - rl-environment
  - multi-agent
  - llm-agent
  - grpo
  - curriculum-learning
  - huggingface
  - pytorch
  - meta
short_description: OpenEnv RL for incident response. 7 tasks, Llama-3.1-8B
---

ARIA: DevOps Incident Response

The first OpenEnv RL environment for production incident response


ARIA stands for Adaptive Reward & Incident Architecture. Built for the Meta × PyTorch × HuggingFace OpenEnv Hackathon Finals, Bangalore, April 2026.


🔗 Quick Links for Judges


⚑ Run a Complete Episode Right Now

```bash
# 1. Start an easy incident
curl -X POST https://arijit-07-devops-incident-response.hf.space/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "easy", "seed": 42}'

# 2. Read logs on the failing service (reward: +0.15)
curl -X POST https://arijit-07-devops-incident-response.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "read_logs", "service": "payment-service"}'

# 3. Diagnose the root cause (reward: +0.35 for a full match)
curl -X POST https://arijit-07-devops-incident-response.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "diagnose", "root_cause": "memory leak in payment-service"}'

# 4. Fix it (reward: +0.45)
curl -X POST https://arijit-07-devops-incident-response.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "restart_service", "service": "payment-service"}'

# 5. Validate that all 7 tasks pass
curl https://arijit-07-devops-incident-response.hf.space/validate
```
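You can also script the same walkthrough with Python's standard library. This is a minimal sketch built on two assumptions. First, the endpoints accept and return JSON exactly as in the curl calls above. Second, as in that walkthrough, no session id needs to be passed between `/reset` and `/step`. The names `build_request`, `send`, `EPISODE` and `run_episode` are illustrative and are not part of the ARIA codebase.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://arijit-07-devops-incident-response.hf.space"


def build_request(path: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build a JSON POST when a payload is given, otherwise a plain GET."""
    url = f"{BASE_URL}{path}"
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


# The same five calls as the curl walkthrough, in order.
EPISODE = [
    ("/reset", {"task_id": "easy", "seed": 42}),
    ("/step", {"action_type": "read_logs", "service": "payment-service"}),
    ("/step", {"action_type": "diagnose", "root_cause": "memory leak in payment-service"}),
    ("/step", {"action_type": "restart_service", "service": "payment-service"}),
    ("/validate", None),
]


def run_episode() -> list:
    """Replay the walkthrough against the live Space and collect each JSON response."""
    return [send(build_request(path, payload)) for path, payload in EPISODE]
```

`run_episode()` makes the five HTTP calls in order. The response shapes are not documented here, so the sketch just returns the decoded JSON without interpreting it.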

🎯 The Problem

Every company running microservices faces the same reality: production incidents are expensive, stressful, and happen at 3am.

SWE-bench tests code generation. WebArena tests web navigation. No existing benchmark trains agents to handle live production incidents: reading logs strategically, tracing cascading failures, correlating subtle business anomalies, and applying precise fixes in settings where a wrong choice causes collateral damage.

ARIA fills that gap.


🎬 The 7 Tasks (plus a procedural generator)

| Task | Max steps | Random agent score | Strong LLM score | Scenario |
|---|---|---|---|---|
| easy | 15 | 0.05 | 0.85–1.00 | Single-service OOM crash-loop |
| medium | 20 | 0.03 | 0.55–0.75 | Cascading failure + red-herring alert |
| hard | 25 | 0.01 | 0.30–0.50 | Silent corruption (all services green) |
| bonus | 25 | 0.01 | 0.35–0.55 | Two simultaneous independent failures |
| security | 20 | 0.01 | 0.40–0.60 | DDoS botnet credential stuffing |
| database | 20 | 0.01 | 0.45–0.65 | Missing index causing full table scans |
| failover | 25 | 0.01 | 0.35–0.55 | Multi-region network partition |
| generated | 20 | 0.01 | variable | Procedural, seed-deterministic |

πŸ† Reward Function

```text
Final Score = Σ(step_rewards)
            + efficiency_bonus     # (1 - steps / max_steps) × 0.05
            + diagnosis_precision  # +0.03 if ≥ 50% keyword overlap
            - noop_penalty         # max(0, noops - 3) × 0.02
```

The final score is clamped to [0.001, 0.999] for GRPO stability.
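As a worked check of the formula, here is a minimal Python sketch of the final-score computation. It makes three assumptions:

- The noop penalty applies only beyond 3 noops (otherwise it would turn into a bonus).
- `keyword_overlap` is already a fraction between 0 and 1.
- The clamp is inclusive.

The function name and signature are hypothetical and do not come from ARIA's grader.

```python
def final_score(step_rewards, steps, max_steps, keyword_overlap, noops):
    """Hypothetical re-statement of the documented final-score formula."""
    # Finishing in fewer steps earns up to +0.05.
    efficiency_bonus = (1 - steps / max_steps) * 0.05
    # +0.03 when at least half of the expected diagnosis keywords were matched.
    diagnosis_precision = 0.03 if keyword_overlap >= 0.5 else 0.0
    # Assumption: only noops beyond the third are penalized.
    noop_penalty = max(0, noops - 3) * 0.02
    raw = sum(step_rewards) + efficiency_bonus + diagnosis_precision - noop_penalty
    # Clamp into [0.001, 0.999] so GRPO never sees exactly 0 or 1.
    return min(max(raw, 0.001), 0.999)
```

Here is an example. Step rewards of 0.15 + 0.35 + 0.45 over 3 of 15 steps, with a precise diagnosis, give a raw total of 1.02 (0.95 + 0.04 + 0.03). The clamp caps that at 0.999.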

| Action | Reward | Penalty trigger | Penalty |
|---|---|---|---|
| `read_logs` (correct) | +0.15 | Restart a healthy service | -0.15 |
| `diagnose` (full match) | +0.35 | Fix without diagnosing | -0.10 |
| `restart_service` (correct) | +0.45 | Wrong failover (payment) | -0.25 |
| `block_ip_range` | +0.40 | Excessive noops | -0.04 each |
| `alert_oncall` (required) | +0.15 | | |

Semantic matching: diagnoses are scored by keyword overlap rather than exact string match, so LLMs that paraphrase are not penalized.
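One plausible way to score keyword overlap is sketched below. The tokenizer and the stop-word list are assumptions, and so is measuring overlap as the fraction of ground-truth keywords found in the diagnosis. ARIA's grader may normalize text differently. `keywords` and `keyword_overlap` are illustrative names.

```python
import re

# Assumed stop-word list; the real grader's list (if any) is not documented here.
STOPWORDS = {"a", "an", "the", "in", "on", "of", "to", "is", "and"}


def keywords(text: str) -> set:
    """Lowercase, keep hyphenated service names whole, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def keyword_overlap(diagnosis: str, ground_truth: str) -> float:
    """Fraction of ground-truth keywords that also appear in the diagnosis."""
    expected = keywords(ground_truth)
    if not expected:
        return 0.0
    return len(expected & keywords(diagnosis)) / len(expected)
```

The paraphrase "payment-service is leaking memory" matches 2 of the 3 expected keywords in "memory leak in payment-service", about 0.67. That clears the 50% bar even though "leaking" does not match "leak".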


🌟 ARIA Features

Curriculum Engine

The engine tracks a rolling average score per task over the last 5 episodes. It promotes the agent when the average exceeds 0.75 and scaffolds with hints when the average drops below 0.30. This keeps agents training at the edge of their capability.

```text
GET  /curriculum/status
GET  /curriculum/next
POST /curriculum/record  # {"task_id": "easy", "score": 0.85}
```
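The promotion and scaffolding rules fit in a few lines. This is a hypothetical in-memory model, not `curriculum/engine.py`. Several choices are assumptions made only to illustrate the documented thresholds:

- the task ordering;
- averaging over however many of the last 5 scores exist;
- returning the first unmastered task.

```python
from collections import deque

WINDOW = 5             # rolling window: last 5 episodes per task
PROMOTE_ABOVE = 0.75   # an average above this counts as mastered
SCAFFOLD_BELOW = 0.30  # an average below this triggers hints

# Hypothetical task order; the real engine's ordering is not documented here.
TASK_ORDER = ["easy", "medium", "hard", "bonus", "security", "database", "failover"]


class CurriculumSketch:
    def __init__(self) -> None:
        # deque(maxlen=5) silently drops the oldest score once it is full.
        self.history = {task: deque(maxlen=WINDOW) for task in TASK_ORDER}

    def record(self, task_id: str, score: float) -> None:
        self.history[task_id].append(score)

    def average(self, task_id: str) -> float:
        scores = self.history[task_id]
        return sum(scores) / len(scores) if scores else 0.0

    def next_task(self) -> tuple:
        """Return (task_id, give_hints) for the first task not yet mastered."""
        for task_id in TASK_ORDER:
            avg = self.average(task_id)
            if avg <= PROMOTE_ABOVE:
                # Scaffold only once the task has been attempted and is going badly.
                give_hints = bool(self.history[task_id]) and avg < SCAFFOLD_BELOW
                return task_id, give_hints
        # Every task mastered: keep training on the last task in the list.
        return TASK_ORDER[-1], False
```

Because the window slides, a run of poor scores on an already-mastered task pulls the agent back to it. That mirrors the "always at the edge" goal.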

Incident Generator

Seeds 0–99,999 → reproducible, seed-deterministic incidents drawn from 6 failure modes × 8 services × 3 severities × 0–3 noise alerts.

```text
GET  /generate/preview?seed=1337
POST /reset  # {"task_id": "generated", "seed": 1337}
```
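A common way to get seed-determinism is to draw every random choice from a private RNG seeded with the incident seed. The sketch below assumes that approach. The failure-mode and service names are placeholders sized to match the documented 6 × 8 × 3 space. `generate` and `Incident` are illustrative names, not the API of the `generator/` module.

```python
import random
from dataclasses import dataclass

# Placeholder catalogues of the documented sizes; ARIA's real names may differ.
FAILURE_MODES = ["oom_crash", "cpu_saturation", "disk_full",
                 "bad_deploy", "missing_index", "network_partition"]
SERVICES = ["api-gateway", "auth-service", "order-service", "payment-service",
            "inventory-service", "notification-service", "user-service", "search-service"]
SEVERITIES = ["low", "medium", "high"]


@dataclass(frozen=True)
class Incident:
    seed: int
    failure_mode: str
    service: str
    severity: str
    noise_alerts: tuple  # services raising irrelevant warnings (red herrings)


def generate(seed: int) -> Incident:
    if not 0 <= seed <= 99_999:
        raise ValueError("seed must be in the range 0..99,999")
    rng = random.Random(seed)  # private RNG: same seed, same draws, same incident
    failure_mode = rng.choice(FAILURE_MODES)
    service = rng.choice(SERVICES)
    severity = rng.choice(SEVERITIES)
    n_noise = rng.randint(0, 3)  # randint is inclusive: 0, 1, 2 or 3 noise alerts
    bystanders = [s for s in SERVICES if s != service]
    noise = tuple(rng.sample(bystanders, n_noise))
    return Incident(seed, failure_mode, service, severity, noise)
```

Using a dedicated `random.Random` instance, rather than the module-level functions, keeps incidents reproducible even if other code consumes global randomness.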

Dual-Agent Mode

Split observability. Agent A (Observer) sees logs and alerts. Agent B (Responder) sees metrics and dependencies. They coordinate via share_finding. Neither can solve the incident alone.

```text
POST /multi-agent/reset         # {"task_id": "easy", "seed": 42}
POST /multi-agent/step/a/{id}   # {"finding": "order-service OOM"}
POST /multi-agent/step/b/{id}   # {"action_type": "restart_service", ...}
```
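One way to picture the split is as two filtered views over one hidden state, plus a shared findings channel. This is a hypothetical sketch of the idea, not `multi_agent/session.py`. The field names are assumptions, and so is the rule that only Agent A writes findings, which is inferred from the endpoint descriptions above.

```python
from dataclasses import dataclass


@dataclass
class FullObservation:
    logs: dict
    alerts: list
    metrics: dict
    dependencies: dict


class DualAgentSessionSketch:
    def __init__(self, obs: FullObservation) -> None:
        self.obs = obs
        self.findings = []  # the only information that crosses between agents

    def view_a(self) -> dict:
        """Agent A (Observer): logs and alerts only."""
        return {"logs": self.obs.logs, "alerts": self.obs.alerts}

    def share_finding(self, finding: str) -> None:
        """Agent A's channel to Agent B."""
        self.findings.append(finding)

    def view_b(self) -> dict:
        """Agent B (Responder): metrics, dependencies, and whatever A has shared."""
        return {
            "metrics": self.obs.metrics,
            "dependencies": self.obs.dependencies,
            "findings": list(self.findings),
        }
```

Because neither view contains the whole state, Agent B can only act on log evidence that Agent A chooses to pass along.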

🧠 Training Results

Model: Arijit-07/aria-devops-llama8b

| Task | Baseline | Fine-tuned | Improvement |
|---|---|---|---|
| easy | 0.320 | 0.685 | +0.365 |
| medium | 0.050 | 0.378 | +0.328 |
| hard | 0.190 | 0.869 | +0.679 |
| bonus | 0.152 | 0.682 | +0.530 |

Training Curve

Setup: GRPO · Llama-3.1-8B · LoRA rank=32 · 160 episodes · NVIDIA L4 · 162 minutes · Unsloth + HuggingFace TRL

Key fix: each completion in a GRPO group is scored against a fresh snapshot of the environment. This prevents reward-gate exhaustion during group generation.
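To see why fresh snapshots matter, assume that "reward gates" are one-time rewards: each rewarded action pays out only once per episode. Suppose every completion in a GRPO group runs against the same environment object. The first completion claims all the one-time rewards, and identical completions after it score 0. That corrupts the group-relative advantages GRPO computes. Deep-copying the starting state for each completion avoids this.

`ToyIncidentEnv` and both scoring helpers are hypothetical. The reward values are borrowed from the reward table for illustration only.

```python
import copy


class ToyIncidentEnv:
    """Hypothetical environment: each rewarded action pays out only once."""

    REWARDS = {"read_logs": 0.15, "diagnose": 0.35, "restart_service": 0.45}

    def __init__(self) -> None:
        self.claimed = set()  # the "gates" already used up in this episode

    def step(self, action: str) -> float:
        if action in self.REWARDS and action not in self.claimed:
            self.claimed.add(action)
            return self.REWARDS[action]
        return 0.0


def score_group_shared(env, completions):
    """Buggy variant: all completions mutate one env, so later ones find gates closed."""
    return [sum(env.step(a) for a in actions) for actions in completions]


def score_group_fresh(env, completions):
    """Fixed variant: each completion runs on its own deep copy of the starting state."""
    scores = []
    for actions in completions:
        snapshot = copy.deepcopy(env)
        scores.append(sum(snapshot.step(a) for a in actions))
    return scores
```

With four identical completions, the shared variant yields roughly `[0.95, 0.0, 0.0, 0.0]`. The fresh-snapshot variant gives all four the same score, as it should.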



📑 API Reference

| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Liveness check |
| GET | `/about` | Full machine-readable description |
| GET | `/tasks` | All 8 tasks |
| POST | `/reset` | Start an episode |
| POST | `/step` | Take an action |
| GET | `/state` | Full state + ground truth |
| GET | `/validate` | Self-test all 7 tasks |
| GET | `/metrics` | Aggregate statistics |
| GET | `/leaderboard` | Top 10 episodes |
| WS | `/ws` | Real-time WebSocket interface |
| GET | `/curriculum/status` | Per-task mastery |
| GET | `/curriculum/next` | Recommended task |
| POST | `/curriculum/record` | Feed training results |
| GET | `/generate/preview` | Preview a procedural incident |
| POST | `/multi-agent/reset` | Start a dual-agent session |
| POST | `/multi-agent/step/a/{id}` | Agent A shares a finding |
| POST | `/multi-agent/step/b/{id}` | Agent B takes an action |
| GET | `/live` | Live NOC dashboard (real-time) |
| GET | `/challenge` | Human vs. agent challenge |
| GET | `/progress` | Score progression visualization |
| GET | `/replays` | Episode replay list |
| GET | `/replay/{id}` | Full episode replay |
| GET | `/replay/{id}/html` | Replay HTML viewer |
| GET | `/docs` | Swagger UI |

📊 Benchmark Comparison

| Benchmark | Domain | Partial observability | Dense reward | Curriculum | Multi-agent |
|---|---|---|---|---|---|
| SWE-bench | Code repair | ✗ | ✗ | ✗ | ✗ |
| WebArena | Web navigation | ✓ | ✗ | ✗ | ✗ |
| AgentBench | General tools | ✗ | ✗ | ✗ | ✗ |
| ARIA | Incident response | ✓ | ✓ | ✓ | ✓ |

πŸš€ Setup

```bash
# Docker
docker build -t aria-devops-incident .
docker run -p 7860:7860 aria-devops-incident

# Or run locally
pip install -r requirements.txt
uvicorn api:app --host 0.0.0.0 --port 7860
```

πŸ“ Structure

```text
├── api.py / server/app.py    # FastAPI app (all endpoints)
├── env.py                    # Environment dispatcher
├── models.py                 # Pydantic models
├── tasks/                    # 7 tasks + generated
├── curriculum/engine.py      # Adaptive difficulty
├── generator/                # Procedural incidents
├── multi_agent/session.py    # Dual-agent mode
├── graders/grader.py         # Deterministic grader
├── demo_llm.py               # Live terminal demo
├── train_grpo.ipynb          # Training notebook
├── BLOG.md                   # Project story
└── openenv.yaml              # OpenEnv manifest
```

Apache 2.0 · Built solo for the Meta × PyTorch × HuggingFace OpenEnv Hackathon Finals, Bangalore, April 2026