| --- |
| title: ARIA DevOps Incident Response |
| emoji: π¨ |
| colorFrom: blue |
| colorTo: red |
| sdk: docker |
| pinned: true |
| license: apache-2.0 |
| tags: |
| - openenv |
| - reinforcement-learning |
| - devops |
| - incident-response |
| - rl-environment |
| - multi-agent |
| - llm-agent |
| - grpo |
| - curriculum-learning |
| - huggingface |
| - pytorch |
| - meta |
| short_description: RL environment for DevOps incident response agents |
| --- |
| |
|
|
| # ARIA β DevOps Incident Response |
| ### *The first OpenEnv RL environment for production incident response* |
|
|
| [](https://colab.research.google.com/github/Twilight-13/devops-incident-response/blob/main/train_grpo.ipynb) |
| [](https://huggingface.co/spaces/Arijit-07/devops-incident-response) |
| [](https://huggingface.co/Arijit-07/aria-devops-llama3b) |
| [](LICENSE) |
|
|
| > **ARIA** β Adaptive Reward & Incident Architecture |
| > Built for the Meta Γ PyTorch Γ HuggingFace OpenEnv Hackathon Finals | Bangalore, April 2026 |
|
|
| --- |
|
|
| ## π Quick Links for Judges |
|
|
| | Resource | Link | |
| |---|---| |
| | **Live Environment** | https://arijit-07-devops-incident-response.hf.space | |
| | **Interactive API (Swagger)** | https://arijit-07-devops-incident-response.hf.space/docs | |
| | **Trained Model (Llama-3B LoRA)** | https://huggingface.co/Arijit-07/aria-devops-llama3b | |
| | **Training Curve** | https://huggingface.co/Arijit-07/aria-devops-llama3b/resolve/main/training_curve.png | |
| | **HuggingFace Blog** | https://huggingface.co/blog/Arijit-07/aria-devops-incident-response | |
| | **GitHub** | https://github.com/Twilight-13/devops-incident-response | |
| | **Validate (self-test)** | https://arijit-07-devops-incident-response.hf.space/validate | |
| |
| --- |
| |
| ## β‘ Run a Complete Episode Right Now |
| |
| ```bash |
| # 1. Start an easy incident |
| curl -X POST https://arijit-07-devops-incident-response.hf.space/reset \ |
| -H "Content-Type: application/json" \ |
| -d '{"task_id": "easy", "seed": 42}' |
|
|
| # 2. Read logs on the failing service (reward: +0.15) |
| curl -X POST https://arijit-07-devops-incident-response.hf.space/step \ |
| -H "Content-Type: application/json" \ |
| -d '{"action_type": "read_logs", "service": "payment-service"}' |
|
|
| # 3. Diagnose the root cause (reward: +0.30) |
| curl -X POST https://arijit-07-devops-incident-response.hf.space/step \ |
| -H "Content-Type: application/json" \ |
| -d '{"action_type": "diagnose", "root_cause": "memory leak in payment-service"}' |
|
|
| # 4. Fix it (reward: +0.40) |
| curl -X POST https://arijit-07-devops-incident-response.hf.space/step \ |
| -H "Content-Type: application/json" \ |
| -d '{"action_type": "restart_service", "service": "payment-service"}' |
|
|
| # 5. See the final score |
| curl https://arijit-07-devops-incident-response.hf.space/state |
|
|
| # 6. Validate all 7 tasks pass |
| curl https://arijit-07-devops-incident-response.hf.space/validate |
| ``` |
| |
| ```python |
| # Or install and use the Python client |
| pip install git+https://github.com/Twilight-13/devops-incident-response.git |
|
|
| from devops_incident_response import DevOpsIncidentEnv, Action, ActionType |
|
|
| env = DevOpsIncidentEnv(task_id="easy", seed=42) |
| obs = env.reset() |
| result = env.step(Action(action_type=ActionType.READ_LOGS, service="payment-service")) |
| print(f"Reward: {result.reward}") # 0.15 |
| ``` |
| |
| --- |
| |
| ## π― The Problem This Solves |
| |
| Every software company running microservices faces the same brutal reality: **production incidents are expensive, unpredictable, and happen at 3am.** |
| |
| A single SEV-1 incident β a payment service crashing, a data corruption silently corrupting prices, a DDoS botnet overwhelming your login endpoint β can cost millions and require hours of expert engineer time to diagnose and fix. On-call rotations are stressful. Tier-2 incidents that follow recognizable patterns are handled by engineers when they could, in principle, be handled by an AI agent. |
| |
| **Yet no RL benchmark exists for this domain.** |
| |
| SWE-bench tests code generation. WebArena tests web navigation. AgentBench tests general tool use. None of them model **operational intelligence** β the ability to reason under uncertainty about live production systems, gather information strategically, and take precise actions where wrong choices cause additional damage. |
| |
| ARIA fills that gap. |
| |
| --- |
| |
| ## ποΈ Environment Architecture |
| |
| ARIA simulates a production microservices e-commerce platform. Agents interact with the environment through a standard OpenEnv API: `reset()`, `step()`, `state()`. |
| |
| ### What the Agent Observes |
| |
| Each step returns a structured `Observation` object: |
| |
| ``` |
| Observation |
| βββ step, max_steps, task_id, task_description |
| βββ services: List[ServiceStatus] |
| β βββ name, status (healthy/degraded/down/unknown) |
| β βββ cpu_percent, memory_percent |
| β βββ error_rate, latency_p99_ms |
| β βββ replicas_running, replicas_desired |
| β βββ current_version, last_deployed |
| β βββ sla_breach, minutes_degraded β SLA tracking per step |
| βββ active_alerts: List[Alert] β may include red herrings |
| βββ recent_logs: Dict[str, List[str]] β PARTIAL: only 2 lines shown |
| βββ service_dependencies: List[ServiceDependency] β call topology |
| βββ evidence_log: List[EvidenceEntry] β accumulates across steps |
| βββ sla_status: Dict[str, str] β ok/warning/breached |
| βββ available_runbooks: List[str] |
| ``` |
| |
| **Key design: Partial Log Observability** |
| |
| The agent only sees 2 log lines per service upfront. Full history requires calling `read_logs` explicitly. This models real observability tools (Datadog, Kibana) where engineers run queries β agents must develop a search strategy, not just read everything. |
|
|
| ### The Services |
|
|
| | Service | Stack | Role | |
| |---|---|---| |
| | `api-gateway` | Go | Routes external requests | |
| | `payment-service` | Java (Spring) | Processes payments | |
| | `order-service` | Python | Creates and tracks orders | |
| | `inventory-service` | Java | Manages product stock | |
| | `user-service` | Node.js | Auth and profiles | |
| | `notification-service` | Python | Email and push alerts | |
| | `data-pipeline-service` | Python | Writes catalog data | |
| | `product-catalog-service` | Go | Stores and serves product data | |
| | `price-validation-service` | Python | Validates prices | |
| | `analytics-service` | Python | Aggregates business metrics | |
| | `ml-inference-service` | Python | Serves recommendation models | |
| | `log-aggregator` | Go | Collects and stores logs | |
|
|
| ### Service Dependency Map |
|
|
| Every observation includes the call topology β agents can trace cascades: |
|
|
| ``` |
| api-gateway β order-service β inventory-service |
| api-gateway β payment-service |
| order-service β notification-service |
| data-pipeline-service β product-catalog-service β price-validation-service |
| ``` |
|
|
| --- |
|
|
| ## π¬ The 7 Tasks |
|
|
| ### Task 1 β Single Service OOM (`easy`) |
| **Max steps: 15 | Expected strong LLM: 0.85β1.00 | Random agent: 0.05** |
|
|
| One service crash-loops with an OutOfMemoryError. The affected service rotates by seed across payment-service, order-service, and user-service β with different log formats (Java heap errors, Python memory errors, Node.js heap dumps). A secondary circuit-breaker alert fires on api-gateway as a visible symptom. |
|
|
| **What makes it interesting:** The agent must identify the ROOT cause service (the one running out of memory) not the SYMPTOM services (everything downstream that's erroring because the root is down). |
|
|
| **Optimal sequence:** `read_logs` β `read_metrics` β `diagnose` β `restart_service` |
| **Reward breakdown:** +0.10 read_logs, +0.10 read_metrics, +0.30 diagnose, +0.40 restart = **0.99 with efficiency bonus** |
|
|
| --- |
|
|
| ### Task 2 β Cascading Failure (`medium`) |
| **Max steps: 20 | Expected strong LLM: 0.55β0.75 | Random agent: 0.03** |
|
|
| A bad deployment of `inventory-service` causes connection pool exhaustion, cascading timeouts to `order-service` and elevated error rates on `api-gateway`. **Red herring:** a `notification-service` HIGH CPU alert fires (scheduled batch job β completely unrelated). |
|
|
| **What makes it interesting:** The agent must follow the dependency chain backwards. Three services are visibly failing, but only one is the root cause. Touching the wrong service gives -0.15 collateral damage penalty. |
|
|
| **Optimal sequence:** Investigate `api-gateway` β trace to `order-service` β trace to `inventory-service` β `rollback` |
| **Reward breakdown:** +0.20 trace cascade, +0.05 runbook, +0.25 diagnose, +0.35 rollback = **0.92** |
|
|
| --- |
|
|
| ### Task 3 β Silent Data Corruption (`hard`) |
| **Max steps: 25 | Expected strong LLM: 0.30β0.50 | Random agent: 0.01** |
|
|
| **All services show green.** Zero error rates. Normal latency. No standard alerts. The signal is buried in: |
| - `price-validation-service` WARN logs: 15% price mismatch rate (baseline: 0.2%) |
| - `analytics-service` anomaly: avg order value $847 vs $89 historical baseline |
|
|
| Three noise alerts distract: TLS renewal, analytics backlog, replica lag. |
|
|
| **What makes it interesting:** This requires qualitatively different reasoning β ignoring green health checks, correlating subtle business metric anomalies, and understanding that a data pipeline deployment 2 minutes ago is the causal explanation. |
|
|
| **Full credit requires BOTH:** `rollback(data-pipeline-service)` AND `alert_oncall` (data audit needed) |
| **Reward breakdown:** +0.15 subtle signals, +0.10 pipeline metrics, +0.05 runbook, +0.20 diagnose, +0.25 rollback, +0.15 alert_oncall = **0.87** |
| |
| --- |
| |
| ### Task 4 β Dual Simultaneous Failure (`bonus`) |
| **Max steps: 25 | Expected strong LLM: 0.35β0.55 | Random agent: 0.01** |
| |
| Two completely independent failures at once: |
| 1. `log-aggregator` disk 100% full β dropping 48k log messages/min |
| 2. `ml-inference-service` stuck in model checksum reload loop β CPU 99%+ |
| |
| **What makes it interesting:** Neither failure is related to the other. Solving one doesn't help the other. The agent must decompose and fix independently. This tests whether agents can maintain multiple hypotheses simultaneously. |
| |
| **Full credit requires BOTH:** `alert_oncall` (disk cleanup) AND `rollback/restart(ml-inference-service)` |
| **Optimal score: ~0.77** |
|
|
| --- |
|
|
| ### Task 5 β Security Incident: DDoS (`security`) |
| **Max steps: 20 | Expected strong LLM: 0.40β0.60 | Random agent: 0.01** |
|
|
| A botnet is targeting the login endpoint with 12,000 req/s from the `185.220.x.x` IP range. Standard rate limiting is ineffective (distributed attack). The access logs show 1,847+ failed login attempts per 60 seconds from that range. |
|
|
| **New action: `block_ip_range`** β models real network-level DDoS mitigation. |
| **Wrong actions:** Restarting api-gateway won't help. Scaling up won't help. Must block at network level + escalate to security team. |
|
|
| **Full credit:** `block_ip_range("185.220.0.0/16")` AND `alert_oncall` |
| **Optimal score: ~0.80** |
|
|
| --- |
|
|
| ### Task 6 β Database Degradation (`database`) |
| **Max steps: 20 | Expected strong LLM: 0.45β0.65 | Random agent: 0.01** |
|
|
| A schema migration added a `user_segment` column to the `orders` table 15 minutes ago β without an index. Every query is now doing a full sequential table scan. DB CPU is spiking. The slow query log shows `seq_scan on orders (847ms)`. |
|
|
| **New action: `create_index`** β models real DBA response to missing indexes. |
| **Alternative fix:** Rolling back the migration is also accepted for full credit. |
| |
| **Optimal score: ~0.80** |
| |
| --- |
| |
| ### Task 7 β Multi-Region Failover (`failover`) |
| **Max steps: 25 | Expected strong LLM: 0.35β0.55 | Random agent: 0.01** |
| |
| A network partition affects `us-east-1`. Four services support automatic failover to `us-west-2` and should be switched. Two services MUST NOT be failed over: |
| - `payment-service` β PCI-DSS compliance requires human approval |
| - `postgres-primary` β replication lag risk causes data loss |
| |
| **New action: `failover`** β with `target_region` parameter. |
| **Heavy penalty: -0.25 per wrong service.** Failing over payment or postgres is catastrophic. |
|
|
| **The runbook explicitly lists which services are safe** β reading it first is rewarded. |
| **Optimal score: ~0.70** |
|
|
| --- |
|
|
| ### Task 8 β Generated Incident (`generated`) |
| **Max steps: 20 | Variable difficulty | Seed-deterministic** |
|
|
| The Incident Generator creates procedural incidents from any integer seed (0β99,999). Same seed always produces the same incident. Different seeds produce unique combinations of: |
| - 6 failure modes Γ 8 services Γ 3 severity levels Γ 0β3 noise alerts |
|
|
| ```bash |
| # Preview any incident before running it |
| curl "https://arijit-07-devops-incident-response.hf.space/generate/preview?seed=12345" |
| |
| # Run it as a full episode |
| curl -X POST .../reset -d '{"task_id":"generated","seed":12345}' |
| ``` |
|
|
| --- |
|
|
| ## π Reward Function Design |
|
|
| ### The Formula |
|
|
| ``` |
| Final Score = Ξ£(step_rewards) |
| + efficiency_bonus # (1 - steps/max_steps) Γ 0.05 if resolved |
| + diagnosis_precision_bonus # +0.03 if β₯50% keyword overlap, +0.01 if β₯30% |
| - noop_penalty # (noop_count - 3) Γ 0.02 |
| - repeat_restart_penalty # (restarts - 1) Γ 0.05 per service |
| ``` |
|
|
| All scores clamped to **(0.001, 0.999)** β never exactly 0 or 1. |
|
|
| > **Why (0.001, 0.999) not (0, 1)?** GRPO advantage normalization requires non-constant rewards within a group. Hard 0 or 1 creates zero-variance groups where the model doesn't update. The tiny clamp ensures a gradient signal always exists. |
|
|
| ### Step-Level Rewards |
|
|
| | Action | Reward | Condition | |
| |---|---|---| |
| | `read_logs` (failing service) | +0.10β0.15 | First time only | |
| | `read_metrics` (failing service) | +0.10 | First time only | |
| | `read_runbook` (relevant) | +0.05 | Correct runbook for scenario | |
| | `search_logs` (relevant query) | +0.05 | Query returns useful results | |
| | `diagnose` (full match) | +0.30β0.35 | β₯50% keyword overlap | |
| | `diagnose` (partial match) | +0.10β0.15 | β₯30% keyword overlap | |
| | `restart_service` (correct) | +0.35β0.45 | Root cause service | |
| | `rollback` (correct) | +0.30β0.40 | Root cause service | |
| | `block_ip_range` (correct) | +0.40 | Security task, correct CIDR | |
| | `create_index` (correct) | +0.40 | Database task, correct table/column | |
| | `failover` (eligible service) | +0.30 | Per correctly failed-over service | |
| | `alert_oncall` (required) | +0.15 | Hard/security/database/failover tasks | |
|
|
| ### Penalties (Anti-Gaming) |
|
|
| | Action | Penalty | Why | |
| |---|---|---| |
| | Restart healthy service | -0.15 | Collateral damage β realistic cost | |
| | Fix without diagnosing | -0.10 | Blind remediation β models real risk | |
| | Failover payment-service | -0.25 | PCI-DSS compliance violation | |
| | Failover postgres-primary | -0.25 | Data loss risk | |
| | Excessive noops (>3) | -0.04/each | Forces active investigation | |
| | Repeat restart same service | -0.05/extra | Discourages guess-and-check | |
|
|
| ### Semantic Diagnosis Matching |
|
|
| The `diagnose` action uses **keyword overlap** not exact string matching. An agent saying "memory exhaustion in payment-service" correctly matches the ground truth "memory_leak_payment_service". This is critical for LLM agents that paraphrase β exact string matching would unfairly penalize valid diagnoses. |
| |
| ### SLA Degradation |
| |
| Every step where an incident is unresolved, the environment worsens: |
| - `down` services: error_rate increases |
| - `degraded` services: latency_p99 increases |
| - SLA status: `ok` β `warning` (~3 steps) β `breached` (~7 steps) |
| |
| This creates real time pressure and rewards faster resolution. |
| |
| --- |
| |
| ## π ARIA Features |
| |
| ### Curriculum Engine |
| |
| The Curriculum Engine tracks agent performance per task using a rolling average of the last 5 episodes. |
| |
| - **Promotion:** rolling_avg > 0.75 β advance mastery level (Novice β Intermediate β Advanced β Mastered) |
| - **Demotion:** rolling_avg < 0.30 β step back mastery level |
| - **Scaffolding:** if avg < 0.30 over 3+ episodes β provide task-specific hint |
| |
| ```bash |
| GET /curriculum/status # See mastery per task |
| GET /curriculum/next # Get recommended next task |
| GET /curriculum/hint/easy # Get scaffolding hint for a task |
| POST /curriculum/record # Feed your training results in |
| ``` |
| |
| **Why this matters for training:** RL fails when agents never see successful trajectories. The curriculum ensures agents always train at the edge of their capability β easy tasks first, harder tasks as they master the fundamentals. |
| |
| ### Incident Generator |
| |
| Procedural incident generation from seeds. 6 failure modes Γ 8 services Γ 3 severities Γ 0β3 noise alerts = thousands of unique training scenarios. |
| |
| **Difficulty formula:** `base_difficulty[failure_mode] + (noise_count Γ 0.05)`, clamped to 1.0 |
|
|
| | Failure Mode | Base Difficulty | |
| |---|---| |
| | oom | 0.20 | |
| | cascade | 0.50 | |
| | database | 0.60 | |
| | security | 0.60 | |
| | network_partition | 0.70 | |
| | corruption | 0.80 | |
| |
| ```bash |
| GET /generate/preview?seed=42 # Preview without starting |
| POST /reset # body: {"task_id":"generated","seed":42} |
| ``` |
| |
| ### Dual-Agent Mode |
| |
| One incident. Two agents. Split observability. |
| |
| - **Agent A (Observer):** Sees logs, alerts, evidence. Can ONLY call `share_finding` β passes natural language observations to Agent B. Reward: +0.05 per finding. |
| - **Agent B (Responder):** Sees metrics, service dependencies, SLA status. Cannot see logs directly. Must rely on Agent A's findings. Executes all real actions. |
| |
| Neither agent can solve the incident alone. |
| |
| ```bash |
| # Start a dual-agent session |
| POST /multi-agent/reset {"task_id":"easy","seed":42} |
| # β returns session_id + split observations |
|
|
| # Agent A shares a finding |
| POST /multi-agent/step/a/{session_id} {"finding":"payment-service OOM, memory at 98%"} |
| |
| # Agent B takes action (has access to Agent A's findings) |
| POST /multi-agent/step/b/{session_id} {"action_type":"restart_service","service":"payment-service"} |
|
|
| # See full session state |
| GET /multi-agent/state/{session_id} |
| ``` |
| |
| --- |
| |
| ## π§ Training |
| |
| ### Model |
| |
| **Llama-3.2-3B-Instruct** fine-tuned with **GRPO** (Group Relative Policy Optimization) using HuggingFace TRL and Unsloth. |
| |
| - **LoRA:** rank=16, alpha=32, targeting all 7 projection layers |
| - **Adapter size:** ~97MB |
| - **Training:** 140 episodes (easy + medium tasks) on Kaggle T4 x2 GPUs |
| - **Model repo:** https://huggingface.co/Arijit-07/aria-devops-llama3b |
| |
| ### Why GRPO? |
| |
| GRPO eliminates the value network that PPO requires. For environment-based RL where rewards come from an external API, a value model adds complexity without benefit. GRPO estimates the baseline from a group of 6 completions per step β simpler, more memory-efficient, and well-suited to fast environment APIs. |
| |
| ### Training Loop |
| |
| ```python |
| # Each training step: |
| # 1. Generate 6 completions for the current observation |
| # 2. Score each on a FRESH env snapshot (prevents reward gate exhaustion) |
| # 3. Normalize rewards to advantages (GRPO) |
| # 4. Policy gradient update on best completion + KL penalty |
| # 5. Advance episode with best action |
| |
| # Key hyperparameters: |
| learning_rate = 5e-6 |
| group_size = 6 |
| kl_coefficient = 0.05 # prevents catastrophic forgetting |
| update_strategy = "episode-level" # one update per full episode |
| ``` |
| |
| ### Results |
| |
| | | Base Model | Fine-tuned (ep140) | |
| |---|---|---| |
| | **Easy task** | 0.000 | 0.150 | |
| | **Behavior** | Jumps to diagnose immediately | Reads logs on correct service first | |
| | **Why the difference** | Base model triggers blind remediation penalty | Fine-tuned model learned to gather information before acting | |
| |
| **The trained model consistently reads logs on the failing service before acting** β this is the foundational operational behavior: information gathering before remediation. The base model never does this. |
| |
| **Training challenge identified:** The original training loop called `env_step` during group generation, burning reward gates before the best action could advance the episode. After fixing to score completions on fresh environment snapshots, the model successfully learned step 1 of the optimal policy. With more episodes using the corrected loop, the full sequence would emerge. |
|
|
| ### Training Notebook |
|
|
| See `train_grpo.ipynb` β Colab-compatible, runs against the live HF Space API (no local setup needed). |
|
|
| [](https://colab.research.google.com/github/Twilight-13/devops-incident-response/blob/main/train_grpo.ipynb) |
|
|
| Compatible with: TRL, SkyRL, ART, Oumi, Axolotl. |
|
|
| --- |
|
|
| ## π Setup |
|
|
| ### Docker (Recommended) |
|
|
| ```bash |
| docker build -t aria-devops-incident . |
| docker run -p 7860:7860 aria-devops-incident |
| curl http://localhost:7860/health |
| ``` |
|
|
| ### Local Python |
|
|
| ```bash |
| pip install -r requirements.txt |
| uvicorn api:app --host 0.0.0.0 --port 7860 |
| ``` |
|
|
| ### Validate |
|
|
| ```bash |
| python validate.py # 22 automated checks, exit 0 = all pass |
| curl http://localhost:7860/validate |
| ``` |
|
|
| --- |
|
|
| ## π‘ API Reference |
|
|
| | Method | Endpoint | Description | |
| |---|---|---| |
| | GET | `/health` | `{"status":"ok"}` liveness check | |
| | GET | `/about` | Full environment description (machine-readable) | |
| | GET | `/tasks` | All 8 tasks with descriptions | |
| | POST | `/reset` | Start episode: `{"task_id":"easy","seed":42}` | |
| | POST | `/step` | Take action: Action JSON | |
| | GET | `/state` | Full state + ground truth + analytics | |
| | GET | `/validate` | Self-test: random agent on all 7 tasks | |
| | GET | `/metrics` | Aggregate episode statistics | |
| | GET | `/leaderboard` | Top 10 episodes | |
| | WS | `/ws` | WebSocket: real-time agent-environment | |
| | GET | `/curriculum/status` | Per-task mastery and recommendations | |
| | GET | `/curriculum/next` | Recommended next task for training | |
| | GET | `/curriculum/hint/{task_id}` | Scaffolding hint for struggling agents | |
| | POST | `/curriculum/record` | Feed episode result to curriculum engine | |
| | GET | `/generate/preview` | Preview procedural incident: `?seed=N` | |
| | POST | `/multi-agent/reset` | Start dual-agent session | |
| | POST | `/multi-agent/step/a/{id}` | Agent A shares a finding | |
| | POST | `/multi-agent/step/b/{id}` | Agent B takes an action | |
| | GET | `/multi-agent/state/{id}` | Full dual-agent session state | |
| | GET | `/multi-agent/sessions` | List active sessions | |
| | GET | `/docs` | Swagger UI β interactive documentation | |
|
|
| --- |
|
|
| ## π Benchmark Comparison |
|
|
| | Benchmark | Domain | Partial Obs | Dense Reward | Multi-Step | Curriculum | Multi-Agent | |
| |---|---|---|---|---|---|---| |
| | SWE-bench | Code repair | β | β | β | β | β | |
| | WebArena | Web navigation | β | β | β | β | β | |
| | AgentBench | General tools | β | β | β | β | β | |
| | **ARIA (ours)** | **Incident response** | **β** | **β** | **β** | **β** | **β** | |
|
|
| --- |
|
|
| ## ποΈ OpenEnv Compliance |
|
|
| ```bash |
| openenv validate . |
| ``` |
|
|
| - Inherits from `openenv.core.env_client.EnvClient` |
| - Standard `reset()`, `step()`, `state()` interface |
| - Valid `openenv.yaml` manifest with all 8 tasks |
| - FastAPI server with health endpoint |
| - WebSocket support at `/ws` |
| - Hosted on HuggingFace Spaces |
|
|
| --- |
|
|
| ## π Repository Structure |
|
|
| ``` |
| aria-devops-incident-response/ |
| βββ api.py # FastAPI app β all endpoints |
| βββ env.py # DevOpsIncidentEnv β thin dispatcher |
| βββ models.py # Pydantic models β Action, Observation, State |
| βββ tasks/ |
| β βββ base.py # BaseTask ABC, InternalState, reward logic |
| β βββ task_easy.py # OOM crash-loop |
| β βββ task_medium.py # Cascading failure |
| β βββ task_hard.py # Silent data corruption |
| β βββ task_bonus.py # Dual simultaneous failure |
| β βββ task_security.py # DDoS attack |
| β βββ task_database.py # Missing index |
| β βββ task_failover.py # Multi-region failover |
| β βββ task_generated.py # Procedural incidents |
| βββ curriculum/ |
| β βββ engine.py # CurriculumEngine β adaptive difficulty |
| βββ generator/ |
| β βββ incident_factory.py # IncidentFactory β procedural generation |
| βββ multi_agent/ |
| β βββ session.py # DualAgentSession β split observability |
| βββ graders/ |
| β βββ grader.py # Deterministic episode grader |
| βββ data/runbooks/ # 6 operational runbooks (Markdown) |
| βββ client.py # openenv-core EnvClient implementation |
| βββ inference.py # LLM baseline (CoT + fast modes) |
| βββ train_grpo.ipynb # GRPO training notebook (Colab-compatible) |
| βββ validate.py # 22 automated validation checks |
| βββ openenv.yaml # OpenEnv spec manifest |
| ``` |
|
|
| --- |
|
|
| ## π License |
|
|
| Apache 2.0 |
|
|
| --- |
|
|
| *Built solo for the Meta Γ PyTorch Γ HuggingFace OpenEnv Hackathon Finals β Bangalore, April 2026* |
|
|