Space: Arijit-07/devops-incident-response

Arijit-07 committed on Apr 26

Commit bdd0439 · 1 Parent(s): bc9f59c

Finalizing ARIA for production deployment: 8B model migration, documentation polish, cleanup

Files changed (15)

BLOG.md +185 -20
README.md +104 -437
README_github.md +101 -435
debug_out.txt +0 -0
debug_steps.txt +0 -0
demo_llm.py +0 -2
openenv.yaml +3 -3
server/app.py +0 -58
training_screenshots/post_training.png +0 -0
training_screenshots/training_bonus.png +0 -0
training_screenshots/training_easy.png +0 -0
training_screenshots/training_hard.png +0 -0
training_screenshots/training_medium.png +0 -0
ui_test.py +0 -945
update_dashboard.py +0 -658

BLOG.md CHANGED

@@ -1,30 +1,195 @@
-# ARIA: Teaching AI Agents to Think Like On-Call Engineers
-In the high-stakes world of DevOps, the difference between a minor blip and a multi-million dollar outage often comes down to the speed and precision of the on-call engineer. But as systems scale into thousands of microservices, the cognitive load on humans is becoming unsustainable. At the Meta x PyTorch x Hugging Face OpenEnv Hackathon, we tackled this challenge head-on by building **ARIA: Adaptive Reward & Incident Architecture**.
-## The Problem: The "Hallucination" Gap
-Traditional LLM agents are great at writing code, but they often struggle with the messy, non-deterministic nature of live production environments. When an agent sees a 500 error, it frequently jumps to conclusions—"restart the database"—without correlating logs, metrics, and dependencies. To solve this, we needed more than just a simulator; we needed a *gym* that could teach agents the scientific method of troubleshooting.
-## The Solution: ARIA
-ARIA is a specialized OpenEnv environment that simulates a complex microservice ecosystem (Payments, Inventory, ML Inference, etc.). It doesn't just present "tasks"; it orchestrates a dynamic learning journey through three core innovations:
-### 1. The Curriculum Engine
-Most agents fail because they are thrown into the deep end. ARIA’s **Curriculum Engine** tracks an agent's mastery across 7 distinct domains (OOM, Cascading Failures, Security DDoS, etc.). Using a rolling success metric, the engine automatically injects "scaffolding"—extra diagnostic hints or simplified observations—when an agent struggles, and pulls them back as the agent gains confidence.
-### 2. The Procedural Incident Generator
-Static benchmarks are easily overfitted. ARIA features a seed-based **Procedural Generator** capable of creating infinite unique incident scenarios. By varying the root cause, affected services, and "noise" alerts, we ensure that agents are learning generalizable troubleshooting logic rather than memorizing specific patterns.
-### 3. Dual-Agent Mode (Split Observability)
-In the real world, senior engineers often pair up. ARIA's **Dual-Agent Mode** enforces a "Split Observability" constraint:
-*   **Agent A (Observer)** sees raw logs and security alerts but cannot take remediation actions.
-*   **Agent B (Responder)** sees system metrics and holds the "keys" to the infrastructure but is blind to the logs.
-This forces the agents to communicate and synthesize findings, mirroring the collaborative nature of high-performing SRE teams.
-## Results & Impact
-During our training runs, we observed that agents trained within ARIA's curriculum achieved a **42% higher resolution rate** on "Hard" tier incidents compared to those trained on static tasks. More importantly, the agents started exhibiting "diagnostic patience"—checking indices before rolling back databases and validating IP ranges before blocking traffic.
-## The Future
-ARIA isn't just a hackathon project; it's a blueprint for the next generation of autonomous infrastructure. By bridging the gap between raw LLM reasoning and the gritty reality of DevOps, we are moving one step closer to a world where "on-call" is handled by silicon, while humans focus on innovation.
 ---
-*Built for the Meta × PyTorch × HuggingFace OpenEnv Hackathon · Bangalore 2026*

+# I Built an RL Environment for Production Incidents. Here's What Happened at 3am.
+*A story about building ARIA — the first OpenEnv RL environment for production incident response — solo, overnight, for the Meta × PyTorch × HuggingFace Hackathon Finals in Bangalore.*
+---
+It starts the same way every time.
+Your phone buzzes. PagerDuty. A red notification cuts through the dark. You open your laptop, half-asleep, and somewhere in a data center, a service is dying.
+Payment-service. OOMKilled. Third time in five minutes.
+You know what to do. Read the logs. Check memory. Diagnose the leak. Restart the pod. Done. Go back to sleep. But it cost you forty minutes, a spike of cortisol, and whatever dream you were having.
+Now imagine an AI agent doing that instead. Not a chatbot. Not a code generator. An agent that reads logs strategically, traces cascading failures through dependency graphs, correlates business metric anomalies with deployment events thirty seconds ago — and fixes it. Without waking you up.
+That's what I wanted to build. And when I saw the OpenEnv Hackathon theme — *World Modeling: Professional Tasks* — I knew exactly what the world I wanted to model looked like.
+---
+## The Idea That Kept Me Up
+I spent the first hour of the hackathon rejecting ideas.
+Trading environments — boring, done to death. Game wrappers — impressive but toy. Code generation — SWE-bench already exists and does it better. What was genuinely missing?
+I kept coming back to one observation: **every major RL benchmark tests a skill that has nothing to do with running production software.** SWE-bench fixes bugs in a repository. WebArena navigates websites. AgentBench uses general tools. None of them ask the question that keeps every on-call engineer awake: *can this agent diagnose a live production incident?*
+The skill is called **operational intelligence**. And it's different from anything benchmarks currently measure.
+A production incident requires you to:
+- Read partial, noisy logs from twelve services simultaneously
+- Identify which alerts are symptoms and which are root causes
+- Trace dependency chains to find where a cascade started
+- Make precise interventions where the wrong move causes collateral damage
+- Do all of this under time pressure while SLA timers are ticking
+No existing benchmark tests this. So I built one.
+---
+## Designing the Environment
+The first design decision was the most important one: **what does the agent see?**
+I could have given it full logs. Perfect observability. Complete metrics. That would be easy to train on and useless in practice — real systems are noisy, partial, and overwhelming by design.
+So I made a choice that shaped everything else: **the agent only sees two log lines per service upfront.**
+If it wants to know more, it has to call `read_logs`. If it wants to search for a specific pattern, it has to call `search_logs`. This models exactly how Datadog and Kibana work — you don't read everything, you query strategically. This single design choice forced agents to develop information-gathering behavior instead of just pattern-matching on whatever's in front of them.
+The environment simulates a production e-commerce microservices platform — twelve services, from api-gateway and payment-service down to log-aggregator and ml-inference-service, with real dependency relationships. When inventory-service has a bad deployment, you see order-service timing out, which shows up as api-gateway errors. Three services failing, one root cause. The agent has to trace backwards.
+---
+## Seven Flavors of Pain
+I built seven tasks, each designed to test a different type of operational reasoning.
+**The easy task** is a single service OOMKilled. Straightforward — read logs, diagnose the memory leak, restart the correct service. A random agent scores 0.05. The optimal agent scores 0.99 in four steps. This is the baseline: if your agent can't solve this, it can't handle anything.
+**The medium task** introduces a cascade. A bad deployment of inventory-service exhausts its connection pool, timing out order-service, which floods api-gateway with errors. Three services visibly failing. One root cause. And a red herring: notification-service is also showing HIGH CPU — a completely unrelated scheduled batch job. Touch the wrong service and lose -0.15. The agent has to follow the dependency graph backwards, not just attack the loudest alarm.
+**The hard task** is my favorite. It's the one that separates genuine reasoning from pattern matching. All twelve services show green. Zero error rates. Normal latency. No standard alerts. The signal is buried: price-validation-service is logging WARN messages about a 15% price mismatch rate (baseline: 0.2%), and analytics-service shows average order value of $847 against an $89 historical baseline. A data pipeline deployment happened two minutes ago. Three unrelated noise alerts fire to distract you. The agent must ignore the healthy dashboards, correlate subtle anomalies, and understand that a deployment event is the causal link.
+This task requires qualitatively different thinking. And the trained 8B model scored **0.869** on it.
+**The bonus task** gives the agent two completely independent failures simultaneously — log-aggregator disk at 100% capacity, ml-inference-service stuck in a model checksum reload loop consuming 99% CPU. Neither is related to the other. Neither fix helps the other. The agent must decompose the problem, maintain two separate hypotheses, and resolve both independently. This tests something fundamental: can you hold multiple things in your head at once?
+**The security task** introduces a DDoS attack. A botnet is credential-stuffing the login endpoint from the 185.220.x.x IP range at 12,000 requests per second. Restarting the service won't help. Scaling up won't help. The agent needs to read the access logs, identify the attacking CIDR, block it at network level, and escalate to the security team. `block_ip_range` is a new action type that doesn't exist in any other RL benchmark.
+**The database task** is a missing index. A schema migration added a user_segment column to the orders table without an index. Every query is now doing a full sequential scan — postgres CPU is spiking, orders are slow. The signal is in the slow query logs. The fix is structural: `create_index('orders', 'user_segment')`. Not a restart. Not a rollback. Understanding the underlying cause.
+**The failover task** is the most constrained. A network partition hits us-east-1. Four services should be failed over to us-west-2. Two absolutely cannot be: payment-service requires human approval for PCI-DSS compliance, and postgres-primary failover risks data loss from replication lag. The runbook lists which services are safe. The agent that reads it first scores far better than the one that doesn't. Wrong failovers cost -0.25 each — modeling a real compliance violation.
+---
+## The Reward Function Is a Statement About Values
+Every number in the reward function is a design decision. Let me explain the ones that matter.
+**-0.15 for collateral damage.** If you restart a healthy service, you lose 0.15. This models the real cost — an unnecessary restart causes downtime, depletes goodwill, and occasionally triggers cascades of its own. The number is large enough to discourage random restarts, small enough that one mistake doesn't doom the episode.
+**-0.10 for blind remediation.** If you fix an incident without diagnosing it first, you lose 0.10. This is the most important penalty. In real incident response, acting without understanding is how you make things worse. Engineers who restart services hoping for the best are engineers who become a problem. The environment enforces the discipline: understand first, act second.
+**-0.25 for wrong failover.** This is catastrophic by design. Failing over payment-service without human approval is a PCI-DSS violation. Failing over postgres-primary without checking replication lag is how you lose data. These penalties model real consequences, not just suboptimal choices.
+**Semantic diagnosis matching.** The diagnose action uses keyword overlap, not exact string comparison. An agent that says "memory exhaustion in payment-service" correctly matches the ground truth "memory_leak_payment_service." This matters enormously for LLMs — they paraphrase, and penalizing correct reasoning for imperfect phrasing is wrong.
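A minimal sketch of keyword-overlap matching, assuming tokens are lowercased words split on any non-alphanumeric character and that overlap is the fraction of ground-truth keywords found in the agent's diagnosis. The function name `keyword_overlap` and the tokenization rule are illustrative assumptions, not ARIA's actual grader.

```python
import re


def keyword_overlap(diagnosis: str, ground_truth: str) -> float:
    """Fraction of ground-truth keywords that also appear in the diagnosis."""

    def tokenize(text: str) -> set:
        # Split on anything that is not a lowercase letter or digit,
        # so "payment-service" and "payment_service" yield the same tokens.
        return {t for t in re.split(r"[^a-z0-9]+", text.lower()) if t}

    truth = tokenize(ground_truth)
    if not truth:
        return 0.0
    return len(truth & tokenize(diagnosis)) / len(truth)


# The paraphrase from the text shares "memory", "payment", "service"
# with the four ground-truth keywords; only "leak" is missing.
score = keyword_overlap(
    "memory exhaustion in payment-service",
    "memory_leak_payment_service",
)
```

Under these assumptions the paraphrased diagnosis covers three of the four ground-truth keywords, which would clear a 50% threshold.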
+**Clamped to (0.001, 0.999).** Never exactly 0 or 1. GRPO advantage normalization requires non-constant rewards within a group. Hard zeros create zero-variance groups where the model doesn't learn. The tiny clamp ensures a gradient signal always exists.
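To make the variance argument concrete, here is a small sketch of the score clamp next to a group-relative advantage computation of the kind GRPO uses (reward minus group mean, divided by group standard deviation). The helper names and the `eps` stabilizer are assumptions for illustration; the exact normalization inside a given training library may differ.

```python
from statistics import mean, pstdev


def clamp_score(raw: float, lo: float = 0.001, hi: float = 0.999) -> float:
    """Keep a final score strictly inside the open interval (0, 1)."""
    return min(hi, max(lo, raw))


def group_advantages(rewards, eps: float = 1e-8):
    """Normalize rewards within one group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group where every completion earns the same reward carries no signal:
# every advantage is zero, so the policy update has nothing to push on.
flat = group_advantages([0.5, 0.5, 0.5])

# A group with spread produces positive and negative advantages.
spread = group_advantages([clamp_score(-0.3), clamp_score(1.4)])
```

In this sketch the flat group comes out as all zeros, while the two-element group yields one negative and one positive advantage.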
+---
+## Three Features That Make ARIA Adaptive
+Once the core environment worked, I built three systems that transform it from a static benchmark into something that grows with the agent.
+**The Curriculum Engine** tracks rolling average performance per task over the last five episodes. When an agent masters a task — rolling average above 0.75 — it promotes to harder tasks. When it struggles — below 0.30 for three or more episodes — it gets scaffolding hints. "Focus on the service with highest memory_percent." "Follow the dependency map backwards from the erroring service." The agent always trains at the edge of its capability.
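The promotion and scaffolding rules can be sketched as a small tracker. This is an illustration under stated assumptions: the class name `CurriculumTracker` is hypothetical, promotion is only considered once the five-episode window is full, and "below 0.30 for three or more episodes" is read as three consecutive episodes with the rolling average under 0.30.

```python
from collections import deque


class CurriculumTracker:
    """Tracks a rolling average for one task and suggests the next move."""

    def __init__(self, window=5, promote_at=0.75, scaffold_below=0.30, patience=3):
        self.scores = deque(maxlen=window)  # only the last `window` episodes
        self.promote_at = promote_at
        self.scaffold_below = scaffold_below
        self.patience = patience
        self.low_streak = 0  # consecutive episodes with a low rolling average

    def record(self, score: float) -> str:
        self.scores.append(score)
        avg = sum(self.scores) / len(self.scores)
        self.low_streak = self.low_streak + 1 if avg < self.scaffold_below else 0
        if len(self.scores) == self.scores.maxlen and avg > self.promote_at:
            return "promote"
        if self.low_streak >= self.patience:
            return "scaffold"
        return "continue"


struggling = CurriculumTracker()
struggling_decisions = [struggling.record(s) for s in (0.1, 0.1, 0.1)]

mastering = CurriculumTracker()
mastering_decisions = [mastering.record(0.9) for _ in range(5)]
```

With these inputs the struggling agent is offered scaffolding on its third episode and the mastering agent is promoted once its window fills.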
+**The Incident Generator** creates procedural incidents from seeds. Any integer from 0 to 99,999 produces a unique combination of failure mode, affected service, severity, and noise alerts. Same seed always produces the same incident — reproducible for evaluation. Different seeds produce genuinely different incidents — impossible to memorize. Six failure modes times eight services times three severities times variable noise gives thousands of unique training scenarios beyond the seven fixed tasks.
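Seed determinism of this kind is usually achieved by giving each incident its own random number generator. The sketch below assumes that approach; the placeholder names in `FAILURE_MODES` and `SERVICES` are hypothetical stand-ins, since the text gives only the counts (six modes, eight services, three severities), and the 0 to 3 range for noise alerts is taken from the README.

```python
import random

# Placeholder catalogs: only the sizes come from the text.
FAILURE_MODES = [f"failure_mode_{i}" for i in range(6)]
SERVICES = [f"service_{i}" for i in range(8)]
SEVERITIES = ["low", "medium", "high"]


def generate_incident(seed: int) -> dict:
    """Build an incident description that depends only on the seed."""
    rng = random.Random(seed)  # private generator, no shared global state
    return {
        "seed": seed,
        "failure_mode": rng.choice(FAILURE_MODES),
        "service": rng.choice(SERVICES),
        "severity": rng.choice(SEVERITIES),
        "noise_alerts": rng.randint(0, 3),
    }


first = generate_incident(12345)
second = generate_incident(12345)
```

Because the generator is seeded locally, calling the function twice with the same seed returns identical incidents regardless of any other randomness in the process.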
+**Dual-Agent Mode** is the most conceptually interesting feature. One incident, two agents, split observability. Agent A (the Observer) sees only logs and alerts. Agent B (the Responder) sees only metrics and service dependencies. Agent A can only call `share_finding` — passing natural language observations to Agent B. Neither can solve the incident alone. This models how real incident response works: one engineer reads logs on Slack, another watches dashboards, they coordinate.
+---
+## Training a Real Model
+I trained Llama-3.1-8B-Instruct using GRPO — Group Relative Policy Optimization — with Unsloth for 4-bit quantization and HuggingFace TRL. 160 episodes across four task types. NVIDIA L4. 162 minutes.
+The training loop calls the live HF Space API for every episode. No local environment. No simulation. Real rewards from a real server.
+And here's the bug that cost me hours.
+My original training loop called `env_step` during group generation. I was generating six completions per step, scoring each by calling the environment, then using the rewards for GRPO advantage estimation. The problem: calling `env_step` six times per step consumed six reward gates from the same episode state. By the time the actual training step advanced the episode, all the interesting reward gates had been burned. The model had nothing to learn from because every action it took in the main episode was rewarded with zero — the good rewards had already been consumed by the scoring phase.
+The fix was conceptually simple and took me an embarrassingly long time to see: score all group completions on **fresh environment snapshots** — reset the environment fresh for each scoring call — then advance the main episode with only the best action. The main episode stays intact. The scoring sees independent reward signals. The gradient is real.
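The shape of the fix can be shown with a toy environment. Everything here is a stand-in: `ToyEnv`, its reward values, and the use of `copy.deepcopy` on a local object are assumptions for illustration, whereas the real training loop scored candidates against freshly reset episodes over the HTTP API.

```python
import copy


class ToyEnv:
    """Stand-in environment: each action pays its reward only the first time."""

    def __init__(self):
        self.gates = {"read_logs": 0.10, "diagnose": 0.30, "restart_service": 0.40}
        self.used = set()

    def step(self, action: str) -> float:
        if action in self.used or action not in self.gates:
            return 0.0
        self.used.add(action)  # the reward gate is now consumed
        return self.gates[action]


def score_group(env, candidates):
    """Score every candidate on its own copy so the main episode is untouched."""
    return [copy.deepcopy(env).step(action) for action in candidates]


env = ToyEnv()
candidates = ["diagnose", "read_logs", "diagnose", "noop"]
rewards = score_group(env, candidates)

# Only the best candidate advances the real episode.
best = candidates[rewards.index(max(rewards))]
main_reward = env.step(best)
```

Scoring directly on `env` instead of on copies would reproduce the bug: the first `diagnose` would consume the gate, the duplicate would score zero, and the main episode step would earn nothing.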
+After the fix, training looked like this:
+| Task | Baseline | Fine-tuned | Improvement |
+|---|---|---|---|
+| easy | 0.320 | 0.685 | **+0.365** |
+| medium | 0.050 | 0.378 | **+0.328** |
+| hard | 0.190 | 0.869 | **+0.679** |
+| bonus | 0.152 | 0.682 | **+0.530** |
+The hard task improvement of +0.679 is the number I'm most proud of. The hardest scenario — the one where all services show green and the signal is buried in business metric anomalies — went from barely-better-than-random to scoring 0.869. The model learned to look past the healthy dashboards.
 ---
+## The Thing That Almost Killed the Training Run
+At 11pm, training was going beautifully. Episode 25 on the easy task: rolling average 0.900. The model was clearly learning.
+Then the HuggingFace Space crashed.
+Not the training Space — the environment Space. The keep-alive server I'd built using Gradio had a bug in Jinja2's template cache that caused a `TypeError: unhashable type: 'dict'` on every request. Gradio was dying silently, port 7860 was returning 500 errors, and HuggingFace's health checker was about to kill the entire training container.
+I had about ten minutes before the Space went down and took the training run with it.
+I replaced Gradio with a twelve-line Python `HTTPServer`. No dependencies. No templates. No Jinja2. Just raw HTTP responses with `do_GET` and `do_HEAD` methods that read the training state file and return an HTML page. It can't crash because there's nothing to crash.
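A dependency-free keep-alive server along these lines might look like the sketch below. It uses only the Python standard library; the state file name `training_state.json`, its JSON format, and the fallback status are assumptions, and this version is somewhat longer than the twelve lines described.

```python
import html
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

STATE_FILE = Path("training_state.json")  # hypothetical location of the state file


def render_page() -> bytes:
    """Read the training state and wrap it in a tiny HTML page."""
    try:
        state = json.loads(STATE_FILE.read_text())
    except (OSError, ValueError):
        state = {"status": "starting"}  # file missing or not valid JSON yet
    text = html.escape(json.dumps(state, indent=2))
    return f"<html><body><pre>{text}</pre></body></html>".encode("utf-8")


class KeepAlive(BaseHTTPRequestHandler):
    def _send_headers(self, body: bytes) -> None:
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()

    def do_GET(self):
        body = render_page()
        self._send_headers(body)
        self.wfile.write(body)

    def do_HEAD(self):
        # Health checkers often probe with HEAD: same headers, no body.
        self._send_headers(render_page())


def serve(port: int = 7860) -> None:
    """Block forever, answering GET and HEAD on the given port."""
    HTTPServer(("0.0.0.0", port), KeepAlive).serve_forever()
```

Calling `serve()` starts the server on port 7860, the port mentioned above; nothing runs at import time.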
+The Space stayed alive. The training ran to completion.
+Sometimes the boring solution is the right one.
+---
+## What I Learned
+**Reward function design is philosophy, not engineering.** Every number encodes a judgment about what matters. -0.25 for failing over the payment service isn't arbitrary — it's a statement that compliance violations are catastrophic, not just suboptimal. The reward function is the most important document in an RL environment, and it should be written like one.
+**Partial observability forces genuine reasoning.** The decision to show only two log lines was uncomfortable — it made training harder, it made evaluation harder, it made everything slower. But it produced agents that actually learned to query. The easy path is full observability. The interesting path is making agents work for information.
+**RL bugs are invisible until they aren't.** The reward gate exhaustion bug was invisible for weeks. The model seemed to train — loss went down, some rewards appeared. Only when I looked closely at the reward distribution per step did I see that the main episode was consistently getting zero after the first action. Debugging RL requires different instincts than debugging regular software. The symptom is always "the model isn't learning." The cause could be anywhere.
+**Solo hackathons are clarifying.** No coordination overhead. Every decision is made in seconds. The tradeoff is that there's no one to catch your mistakes. I would have found the training bug faster with two people. But the environment design benefited from having one coherent vision all the way through, without the friction of consensus.
+---
+## What's Next
+Three directions I'd pursue with more time:
+**A human baseline.** Time actual on-call engineers on the same tasks and compare their scores to LLM agents. This positions ARIA as a real benchmark with human reference points, not just an RL playground. The hard task — silent data corruption — would be genuinely interesting to watch experienced engineers solve.
+**Adversarial task generation.** An LLM generates new incident scenarios from operational runbooks. Infinite task variety without manual authoring. The environment would grow with the agent's capability.
+**Multi-agent cooperation at scale.** Two agents with split observability is a start. The more interesting version models the cost of communication — Slack messages take time, paging someone wakes them up. An agent that must decide *when* to share a finding, not just *what* to share, is a harder and more realistic problem.
+---
+## Try It
+The environment is live. The model is on HuggingFace. The API is open.
+```bash
+curl -X POST https://arijit-07-devops-incident-response.hf.space/reset \
+  -H "Content-Type: application/json" \
+  -d '{"task_id": "hard", "seed": 42}'
+```
+All services will show green. Good luck finding the signal.
+Can your agent handle a SEV-1 at 3am?
+---
+**Links:**
+- 🚨 Live Environment: https://huggingface.co/spaces/Arijit-07/devops-incident-response
+- 🧠 Trained Model (8B): https://huggingface.co/Arijit-07/aria-devops-llama8b
+- 💻 GitHub: https://github.com/Twilight-13/devops-incident-response
+- 📖 API Docs: https://arijit-07-devops-incident-response.hf.space/docs
+*Built solo for the Meta × PyTorch × HuggingFace OpenEnv Hackathon Finals — Bangalore, April 2026*

README.md CHANGED

@@ -19,19 +19,20 @@ tags:
   - huggingface
   - pytorch
   - meta
-short_description: RL environment for DevOps incident response agents
 ---
 # ARIA — DevOps Incident Response
 ### *The first OpenEnv RL environment for production incident response*
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Twilight-13/devops-incident-response/blob/main/train_grpo.ipynb)
 [![HF Space](https://img.shields.io/badge/🤗-Live%20Environment-orange)](https://huggingface.co/spaces/Arijit-07/devops-incident-response)
-[![Trained Model](https://img.shields.io/badge/🤗-Trained%20Model-blue)](https://huggingface.co/Arijit-07/aria-devops-llama3b)
 [![License](https://img.shields.io/badge/License-Apache_2.0-green.svg)](LICENSE)
-> **ARIA** — Adaptive Reward & Incident Architecture
 > Built for the Meta × PyTorch × HuggingFace OpenEnv Hackathon Finals | Bangalore, April 2026
 ---
@@ -41,12 +42,13 @@ short_description: RL environment for DevOps incident response agents
 | Resource | Link |
 |---|---|
 | **Live Environment** | https://arijit-07-devops-incident-response.hf.space |
-| **Interactive API (Swagger)** | https://arijit-07-devops-incident-response.hf.space/docs |
-| **Trained Model (Llama-3B LoRA)** | https://huggingface.co/Arijit-07/aria-devops-llama3b |
-| **Training Curve** | https://huggingface.co/Arijit-07/aria-devops-llama3b/resolve/main/training_curve.png |
-| **HuggingFace Blog** | https://huggingface.co/blog/Arijit-07/aria-devops-incident-response |
 | **GitHub** | https://github.com/Twilight-13/devops-incident-response |
-| **Validate (self-test)** | https://arijit-07-devops-incident-response.hf.space/validate |
 ---
@@ -63,7 +65,7 @@ curl -X POST https://arijit-07-devops-incident-response.hf.space/step \
   -H "Content-Type: application/json" \
   -d '{"action_type": "read_logs", "service": "payment-service"}'
-# 3. Diagnose the root cause (reward: +0.30)
 curl -X POST https://arijit-07-devops-incident-response.hf.space/step \
   -H "Content-Type: application/json" \
   -d '{"action_type": "diagnose", "root_cause": "memory leak in payment-service"}'
@@ -73,510 +75,175 @@ curl -X POST https://arijit-07-devops-incident-response.hf.space/step \
   -H "Content-Type: application/json" \
   -d '{"action_type": "restart_service", "service": "payment-service"}'
-# 5. See the final score
-curl https://arijit-07-devops-incident-response.hf.space/state
-# 6. Validate all 7 tasks pass
 curl https://arijit-07-devops-incident-response.hf.space/validate
 ```
-```python
-# Or install and use the Python client
-pip install git+https://github.com/Twilight-13/devops-incident-response.git
-from devops_incident_response import DevOpsIncidentEnv, Action, ActionType
-env = DevOpsIncidentEnv(task_id="easy", seed=42)
-obs = env.reset()
-result = env.step(Action(action_type=ActionType.READ_LOGS, service="payment-service"))
-print(f"Reward: {result.reward}")  # 0.15
-```
 ---
-## 🎯 The Problem This Solves
-Every software company running microservices faces the same brutal reality: **production incidents are expensive, unpredictable, and happen at 3am.**
-A single SEV-1 incident — a payment service crashing, a data corruption silently corrupting prices, a DDoS botnet overwhelming your login endpoint — can cost millions and require hours of expert engineer time to diagnose and fix. On-call rotations are stressful. Tier-2 incidents that follow recognizable patterns are handled by engineers when they could, in principle, be handled by an AI agent.
-**Yet no RL benchmark exists for this domain.**
-SWE-bench tests code generation. WebArena tests web navigation. AgentBench tests general tool use. None of them model **operational intelligence** — the ability to reason under uncertainty about live production systems, gather information strategically, and take precise actions where wrong choices cause additional damage.
-ARIA fills that gap.
----
-## 🏗️ Environment Architecture
-ARIA simulates a production microservices e-commerce platform. Agents interact with the environment through a standard OpenEnv API: `reset()`, `step()`, `state()`.
-### What the Agent Observes
-Each step returns a structured `Observation` object:
-```
-Observation
-├── step, max_steps, task_id, task_description
-├── services: List[ServiceStatus]
-│   ├── name, status (healthy/degraded/down/unknown)
-│   ├── cpu_percent, memory_percent
-│   ├── error_rate, latency_p99_ms
-│   ├── replicas_running, replicas_desired
-│   ├── current_version, last_deployed
-│   └── sla_breach, minutes_degraded        ← SLA tracking per step
-├── active_alerts: List[Alert]              ← may include red herrings
-├── recent_logs: Dict[str, List[str]]       ← PARTIAL: only 2 lines shown
-├── service_dependencies: List[ServiceDependency]  ← call topology
-├── evidence_log: List[EvidenceEntry]       ← accumulates across steps
-├── sla_status: Dict[str, str]              ← ok/warning/breached
-└── available_runbooks: List[str]
-```
-**Key design: Partial Log Observability**
-The agent only sees 2 log lines per service upfront. Full history requires calling `read_logs` explicitly. This models real observability tools (Datadog, Kibana) where engineers run queries — agents must develop a search strategy, not just read everything.
-### The Services
-| Service | Stack | Role |
-|---|---|---|
-| `api-gateway` | Go | Routes external requests |
-| `payment-service` | Java (Spring) | Processes payments |
-| `order-service` | Python | Creates and tracks orders |
-| `inventory-service` | Java | Manages product stock |
-| `user-service` | Node.js | Auth and profiles |
-| `notification-service` | Python | Email and push alerts |
-| `data-pipeline-service` | Python | Writes catalog data |
-| `product-catalog-service` | Go | Stores and serves product data |
-| `price-validation-service` | Python | Validates prices |
-| `analytics-service` | Python | Aggregates business metrics |
-| `ml-inference-service` | Python | Serves recommendation models |
-| `log-aggregator` | Go | Collects and stores logs |
-### Service Dependency Map
-Every observation includes the call topology — agents can trace cascades:
-```
-api-gateway → order-service → inventory-service
-api-gateway → payment-service
-order-service → notification-service
-data-pipeline-service → product-catalog-service → price-validation-service
-```
 ---
 ## 🎬 The 7 Tasks
-### Task 1 — Single Service OOM (`easy`)
-**Max steps: 15 | Expected strong LLM: 0.85–1.00 | Random agent: 0.05**
-One service crash-loops with an OutOfMemoryError. The affected service rotates by seed across payment-service, order-service, and user-service — with different log formats (Java heap errors, Python memory errors, Node.js heap dumps). A secondary circuit-breaker alert fires on api-gateway as a visible symptom.
-**What makes it interesting:** The agent must identify the ROOT cause service (the one running out of memory) not the SYMPTOM services (everything downstream that's erroring because the root is down).
-**Optimal sequence:** `read_logs` → `read_metrics` → `diagnose` → `restart_service`
-**Reward breakdown:** +0.10 read_logs, +0.10 read_metrics, +0.30 diagnose, +0.40 restart = **0.99 with efficiency bonus**
----
-### Task 2 — Cascading Failure (`medium`)
-**Max steps: 20 | Expected strong LLM: 0.55–0.75 | Random agent: 0.03**
-A bad deployment of `inventory-service` causes connection pool exhaustion, cascading timeouts to `order-service` and elevated error rates on `api-gateway`. **Red herring:** a `notification-service` HIGH CPU alert fires (scheduled batch job — completely unrelated).
-**What makes it interesting:** The agent must follow the dependency chain backwards. Three services are visibly failing, but only one is the root cause. Touching the wrong service gives -0.15 collateral damage penalty.
-**Optimal sequence:** Investigate `api-gateway` → trace to `order-service` → trace to `inventory-service` → `rollback`
-**Reward breakdown:** +0.20 trace cascade, +0.05 runbook, +0.25 diagnose, +0.35 rollback = **0.92**
----
-### Task 3 — Silent Data Corruption (`hard`)
-**Max steps: 25 | Expected strong LLM: 0.30–0.50 | Random agent: 0.01**
-**All services show green.** Zero error rates. Normal latency. No standard alerts. The signal is buried in:
-- `price-validation-service` WARN logs: 15% price mismatch rate (baseline: 0.2%)
-- `analytics-service` anomaly: avg order value $847 vs $89 historical baseline
-Three noise alerts distract: TLS renewal, analytics backlog, replica lag.
-**What makes it interesting:** This requires qualitatively different reasoning — ignoring green health checks, correlating subtle business metric anomalies, and understanding that a data pipeline deployment 2 minutes ago is the causal explanation.
-**Full credit requires BOTH:** `rollback(data-pipeline-service)` AND `alert_oncall` (data audit needed)
-**Reward breakdown:** +0.15 subtle signals, +0.10 pipeline metrics, +0.05 runbook, +0.20 diagnose, +0.25 rollback, +0.15 alert_oncall = **0.87**
----
-### Task 4 — Dual Simultaneous Failure (`bonus`)
-**Max steps: 25 | Expected strong LLM: 0.35–0.55 | Random agent: 0.01**
-Two completely independent failures at once:
-1. `log-aggregator` disk 100% full — dropping 48k log messages/min
-2. `ml-inference-service` stuck in model checksum reload loop — CPU 99%+
-**What makes it interesting:** Neither failure is related to the other. Solving one doesn't help the other. The agent must decompose and fix independently. This tests whether agents can maintain multiple hypotheses simultaneously.
-**Full credit requires BOTH:** `alert_oncall` (disk cleanup) AND `rollback/restart(ml-inference-service)`
-**Optimal score: ~0.77**
----
-### Task 5 — Security Incident: DDoS (`security`)
-**Max steps: 20 | Expected strong LLM: 0.40–0.60 | Random agent: 0.01**
-A botnet is targeting the login endpoint with 12,000 req/s from the `185.220.x.x` IP range. Standard rate limiting is ineffective (distributed attack). The access logs show 1,847+ failed login attempts per 60 seconds from that range.
-**New action: `block_ip_range`** — models real network-level DDoS mitigation.
-**Wrong actions:** Restarting api-gateway won't help. Scaling up won't help. Must block at network level + escalate to security team.
-**Full credit:** `block_ip_range("185.220.0.0/16")` AND `alert_oncall`
-**Optimal score: ~0.80**
 ---
-### Task 6 — Database Degradation (`database`)
-**Max steps: 20 | Expected strong LLM: 0.45–0.65 | Random agent: 0.01**
-A schema migration added a `user_segment` column to the `orders` table 15 minutes ago — without an index. Every query is now doing a full sequential table scan. DB CPU is spiking. The slow query log shows `seq_scan on orders (847ms)`.
-**New action: `create_index`** — models real DBA response to missing indexes.
-**Alternative fix:** Rolling back the migration is also accepted for full credit.
-**Optimal score: ~0.80**
----
-### Task 7 — Multi-Region Failover (`failover`)
-**Max steps: 25 | Expected strong LLM: 0.35–0.55 | Random agent: 0.01**
-A network partition affects `us-east-1`. Four services support automatic failover to `us-west-2` and should be switched. Two services MUST NOT be failed over:
-- `payment-service` — PCI-DSS compliance requires human approval
-- `postgres-primary` — replication lag risk causes data loss
-**New action: `failover`** — with `target_region` parameter.
-**Heavy penalty: -0.25 per wrong service.** Failing over payment or postgres is catastrophic.
-**The runbook explicitly lists which services are safe** — reading it first is rewarded.
-**Optimal score: ~0.70**
----
-### Task 8 — Generated Incident (`generated`)
-**Max steps: 20 | Variable difficulty | Seed-deterministic**
-The Incident Generator creates procedural incidents from any integer seed (0–99,999). Same seed always produces the same incident. Different seeds produce unique combinations of:
-- 6 failure modes × 8 services × 3 severity levels × 0–3 noise alerts
-```bash
-# Preview any incident before running it
-curl "https://arijit-07-devops-incident-response.hf.space/generate/preview?seed=12345"
-# Run it as a full episode
-curl -X POST .../reset -d '{"task_id":"generated","seed":12345}'
-```
----
-## 🏆 Reward Function Design
-### The Formula
 ```
 Final Score = Σ(step_rewards)
-            + efficiency_bonus          # (1 - steps/max_steps) × 0.05 if resolved
-            + diagnosis_precision_bonus # +0.03 if ≥50% keyword overlap, +0.01 if ≥30%
-            - noop_penalty             # (noop_count - 3) × 0.02
-            - repeat_restart_penalty   # (restarts - 1) × 0.05 per service
 ```
-All scores clamped to **(0.001, 0.999)** — never exactly 0 or 1.
-> **Why (0.001, 0.999) not (0, 1)?** GRPO advantage normalization requires non-constant rewards within a group. Hard 0 or 1 creates zero-variance groups where the model doesn't update. The tiny clamp ensures a gradient signal always exists.
-### Step-Level Rewards
-| Action | Reward | Condition |
-|---|---|---|
-| `read_logs` (failing service) | +0.10–0.15 | First time only |
-| `read_metrics` (failing service) | +0.10 | First time only |
-| `read_runbook` (relevant) | +0.05 | Correct runbook for scenario |
-| `search_logs` (relevant query) | +0.05 | Query returns useful results |
-| `diagnose` (full match) | +0.30–0.35 | ≥50% keyword overlap |
-| `diagnose` (partial match) | +0.10–0.15 | ≥30% keyword overlap |
-| `restart_service` (correct) | +0.35–0.45 | Root cause service |
-| `rollback` (correct) | +0.30–0.40 | Root cause service |
-| `block_ip_range` (correct) | +0.40 | Security task, correct CIDR |
-| `create_index` (correct) | +0.40 | Database task, correct table/column |
-| `failover` (eligible service) | +0.30 | Per correctly failed-over service |
-| `alert_oncall` (required) | +0.15 | Hard/security/database/failover tasks |
-### Penalties (Anti-Gaming)
-| Action | Penalty | Why |
 |---|---|---|
-| Restart healthy service | -0.15 | Collateral damage — realistic cost |
-| Fix without diagnosing | -0.10 | Blind remediation — models real risk |
-| Failover payment-service | -0.25 | PCI-DSS compliance violation |
-| Failover postgres-primary | -0.25 | Data loss risk |
-| Excessive noops (>3) | -0.04/each | Forces active investigation |
-| Repeat restart same service | -0.05/extra | Discourages guess-and-check |
-### Semantic Diagnosis Matching
-The `diagnose` action uses **keyword overlap** not exact string matching. An agent saying "memory exhaustion in payment-service" correctly matches the ground truth "memory_leak_payment_service". This is critical for LLM agents that paraphrase — exact string matching would unfairly penalize valid diagnoses.
-### SLA Degradation
-Every step where an incident is unresolved, the environment worsens:
-- `down` services: error_rate increases
-- `degraded` services: latency_p99 increases
-- SLA status: `ok` → `warning` (~3 steps) → `breached` (~7 steps)
-This creates real time pressure and rewards faster resolution.
 ---
 ## 🌟 ARIA Features
 ### Curriculum Engine
-The Curriculum Engine tracks agent performance per task using a rolling average of the last 5 episodes.
-- **Promotion:** rolling_avg > 0.75 → advance mastery level (Novice → Intermediate → Advanced → Mastered)
-- **Demotion:** rolling_avg < 0.30 → step back mastery level
-- **Scaffolding:** if avg < 0.30 over 3+ episodes → provide task-specific hint
 ```bash
-GET /curriculum/status    # See mastery per task
-GET /curriculum/next      # Get recommended next task
-GET /curriculum/hint/easy # Get scaffolding hint for a task
-POST /curriculum/record   # Feed your training results in
 ```
-**Why this matters for training:** RL fails when agents never see successful trajectories. The curriculum ensures agents always train at the edge of their capability — easy tasks first, harder tasks as they master the fundamentals.
 ### Incident Generator
-Procedural incident generation from seeds. 6 failure modes × 8 services × 3 severities × 0–3 noise alerts = thousands of unique training scenarios.
-**Difficulty formula:** `base_difficulty[failure_mode] + (noise_count × 0.05)`, clamped to 1.0
-| Failure Mode | Base Difficulty |
-|---|---|
-| oom | 0.20 |
-| cascade | 0.50 |
-| database | 0.60 |
-| security | 0.60 |
-| network_partition | 0.70 |
-| corruption | 0.80 |
 ```bash
-GET /generate/preview?seed=42     # Preview without starting
-POST /reset  # body: {"task_id":"generated","seed":42}
 ```
 ### Dual-Agent Mode
-One incident. Two agents. Split observability.
-- **Agent A (Observer):** Sees logs, alerts, evidence. Can ONLY call `share_finding` — passes natural language observations to Agent B. Reward: +0.05 per finding.
-- **Agent B (Responder):** Sees metrics, service dependencies, SLA status. Cannot see logs directly. Must rely on Agent A's findings. Executes all real actions.
-Neither agent can solve the incident alone.
 ```bash
-# Start a dual-agent session
-POST /multi-agent/reset  {"task_id":"easy","seed":42}
-# → returns session_id + split observations
-# Agent A shares a finding
-POST /multi-agent/step/a/{session_id}  {"finding":"payment-service OOM, memory at 98%"}
-# Agent B takes action (has access to Agent A's findings)
-POST /multi-agent/step/b/{session_id}  {"action_type":"restart_service","service":"payment-service"}
-# See full session state
-GET /multi-agent/state/{session_id}
 ```
 ---
-## 🧠 Training
-### Model
-**Llama-3.2-3B-Instruct** fine-tuned with **GRPO** (Group Relative Policy Optimization) using HuggingFace TRL and Unsloth.
-- **LoRA:** rank=16, alpha=32, targeting all 7 projection layers
-- **Adapter size:** ~97MB
-- **Training:** 140 episodes (easy + medium tasks) on Kaggle T4 x2 GPUs
-- **Model repo:** https://huggingface.co/Arijit-07/aria-devops-llama3b
-### Why GRPO?
-GRPO eliminates the value network that PPO requires. For environment-based RL where rewards come from an external API, a value model adds complexity without benefit. GRPO estimates the baseline from a group of 6 completions per step — simpler, more memory-efficient, and well-suited to fast environment APIs.
-### Training Loop
-```python
-# Each training step:
-# 1. Generate 6 completions for the current observation
-# 2. Score each on a FRESH env snapshot (prevents reward gate exhaustion)
-# 3. Normalize rewards to advantages (GRPO)
-# 4. Policy gradient update on best completion + KL penalty
-# 5. Advance episode with best action
-# Key hyperparameters:
-learning_rate = 5e-6
-group_size = 6
-kl_coefficient = 0.05   # prevents catastrophic forgetting
-update_strategy = "episode-level"  # one update per full episode
-```
-### Results
-| | Base Model | Fine-tuned (ep140) |
-|---|---|---|
-| **Easy task** | 0.000 | 0.150 |
-| **Behavior** | Jumps to diagnose immediately | Reads logs on correct service first |
-| **Why the difference** | Base model triggers blind remediation penalty | Fine-tuned model learned to gather information before acting |
-**The trained model consistently reads logs on the failing service before acting** — this is the foundational operational behavior: information gathering before remediation. The base model never does this.
-**Training challenge identified:** The original training loop called `env_step` during group generation, burning reward gates before the best action could advance the episode. After fixing to score completions on fresh environment snapshots, the model successfully learned step 1 of the optimal policy. With more episodes using the corrected loop, the full sequence would emerge.
-### Training Notebook
-See `train_grpo.ipynb` — Colab-compatible, runs against the live HF Space API (no local setup needed).
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Twilight-13/devops-incident-response/blob/main/train_grpo.ipynb)
-Compatible with: TRL, SkyRL, ART, Oumi, Axolotl.
----
-## 🚀 Setup
-### Docker (Recommended)
-```bash
-docker build -t aria-devops-incident .
-docker run -p 7860:7860 aria-devops-incident
-curl http://localhost:7860/health
-```
-### Local Python
-```bash
-pip install -r requirements.txt
-uvicorn api:app --host 0.0.0.0 --port 7860
-```
-### Validate
-```bash
-python validate.py    # 22 automated checks, exit 0 = all pass
-curl http://localhost:7860/validate
-```
 ---
 ## 📡 API Reference
 | Method | Endpoint | Description |
 |---|---|---|
-| GET | `/health` | `{"status":"ok"}` liveness check |
-| GET | `/about` | Full environment description (machine-readable) |
-| GET | `/tasks` | All 8 tasks with descriptions |
-| POST | `/reset` | Start episode: `{"task_id":"easy","seed":42}` |
-| POST | `/step` | Take action: Action JSON |
-| GET | `/state` | Full state + ground truth + analytics |
-| GET | `/validate` | Self-test: random agent on all 7 tasks |
-| GET | `/metrics` | Aggregate episode statistics |
 | GET | `/leaderboard` | Top 10 episodes |
-| WS | `/ws` | WebSocket: real-time agent-environment |
-| GET | `/curriculum/status` | Per-task mastery and recommendations |
-| GET | `/curriculum/next` | Recommended next task for training |
-| GET | `/curriculum/hint/{task_id}` | Scaffolding hint for struggling agents |
-| POST | `/curriculum/record` | Feed episode result to curriculum engine |
-| GET | `/generate/preview` | Preview procedural incident: `?seed=N` |
 | POST | `/multi-agent/reset` | Start dual-agent session |
-| POST | `/multi-agent/step/a/{id}` | Agent A shares a finding |
-| POST | `/multi-agent/step/b/{id}` | Agent B takes an action |
-| GET | `/multi-agent/state/{id}` | Full dual-agent session state |
-| GET | `/multi-agent/sessions` | List active sessions |
-| GET | `/docs` | Swagger UI — interactive documentation |
 ---
 ## 📊 Benchmark Comparison
-| Benchmark | Domain | Partial Obs | Dense Reward | Multi-Step | Curriculum | Multi-Agent |
-|---|---|---|---|---|---|---|
-| SWE-bench | Code repair | ✗ | ✗ | ✓ | ✗ | ✗ |
-| WebArena | Web navigation | ✓ | ✗ | ✓ | ✗ | ✗ |
-| AgentBench | General tools | ✗ | ✗ | ✓ | ✗ | ✗ |
-| **ARIA (ours)** | **Incident response** | **✓** | **✓** | **✓** | **✓** | **✓** |
 ---
-## 🏗️ OpenEnv Compliance
 ```bash
-openenv validate .
-```
-- Inherits from `openenv.core.env_client.EnvClient`
-- Standard `reset()`, `step()`, `state()` interface
-- Valid `openenv.yaml` manifest with all 8 tasks
-- FastAPI server with health endpoint
-- WebSocket support at `/ws`
-- Hosted on HuggingFace Spaces
 ---
-## 📁 Repository Structure
 ```
-aria-devops-incident-response/
-├── api.py                    # FastAPI app — all endpoints
-├── env.py                    # DevOpsIncidentEnv — thin dispatcher
-├── models.py                 # Pydantic models — Action, Observation, State
-├── tasks/
-│   ├── base.py               # BaseTask ABC, InternalState, reward logic
-│   ├── task_easy.py          # OOM crash-loop
-│   ├── task_medium.py        # Cascading failure
-│   ├── task_hard.py          # Silent data corruption
-│   ├── task_bonus.py         # Dual simultaneous failure
-│   ├── task_security.py      # DDoS attack
-│   ├── task_database.py      # Missing index
-│   ├── task_failover.py      # Multi-region failover
-│   └── task_generated.py     # Procedural incidents
-├── curriculum/
-│   └── engine.py             # CurriculumEngine — adaptive difficulty
-├── generator/
-│   └── incident_factory.py   # IncidentFactory — procedural generation
-├── multi_agent/
-│   └── session.py            # DualAgentSession — split observability
-├── graders/
-│   └── grader.py             # Deterministic episode grader
-├── data/runbooks/            # 6 operational runbooks (Markdown)
-├── client.py                 # openenv-core EnvClient implementation
-├── inference.py              # LLM baseline (CoT + fast modes)
-├── train_grpo.ipynb          # GRPO training notebook (Colab-compatible)
-├── validate.py               # 22 automated validation checks
-└── openenv.yaml              # OpenEnv spec manifest
 ```
----
-## 📝 License
-Apache 2.0
----
-*Built solo for the Meta × PyTorch × HuggingFace OpenEnv Hackathon Finals — Bangalore, April 2026*

   - huggingface
   - pytorch
   - meta
+short_description: >
+  OpenEnv RL environment for production incident response —
+  7 tasks, curriculum engine, dual-agent mode, trained Llama-3.1-8B
 ---
 # ARIA — DevOps Incident Response
 ### *The first OpenEnv RL environment for production incident response*
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Twilight-13/devops-incident-response/blob/main/train_grpo.ipynb)
 [![HF Space](https://img.shields.io/badge/🤗-Live%20Environment-orange)](https://huggingface.co/spaces/Arijit-07/devops-incident-response)
+[![Trained Model](https://img.shields.io/badge/🤗-Llama--3.1--8B%20Fine--tuned-blue)](https://huggingface.co/Arijit-07/aria-devops-llama8b)
 [![License](https://img.shields.io/badge/License-Apache_2.0-green.svg)](LICENSE)
+> **ARIA** — Adaptive Reward & Incident Architecture
 > Built for the Meta × PyTorch × HuggingFace OpenEnv Hackathon Finals | Bangalore, April 2026
 ---
 | Resource | Link |
 |---|---|
 | **Live Environment** | https://arijit-07-devops-incident-response.hf.space |
+| **Interactive API** | https://arijit-07-devops-incident-response.hf.space/docs |
+| **Trained Model (8B)** | https://huggingface.co/Arijit-07/aria-devops-llama8b |
+| **Training Curve** | https://huggingface.co/Arijit-07/aria-devops-llama8b/resolve/main/training_curve_8b.png |
+| **Blog Post** | https://huggingface.co/blog/Arijit-07/aria-devops-incident-response |
 | **GitHub** | https://github.com/Twilight-13/devops-incident-response |
+| **Validate** | https://arijit-07-devops-incident-response.hf.space/validate |
+| **About (machine-readable)** | https://arijit-07-devops-incident-response.hf.space/about |
 ---
   -H "Content-Type: application/json" \
   -d '{"action_type": "read_logs", "service": "payment-service"}'
+# 3. Diagnose (reward: +0.30)
 curl -X POST https://arijit-07-devops-incident-response.hf.space/step \
   -H "Content-Type: application/json" \
   -d '{"action_type": "diagnose", "root_cause": "memory leak in payment-service"}'
   -H "Content-Type: application/json" \
   -d '{"action_type": "restart_service", "service": "payment-service"}'
+# 5. Validate all 7 tasks pass
 curl https://arijit-07-devops-incident-response.hf.space/validate
 ```
 ---
+## 🎯 The Problem
+Every company running microservices faces the same reality: **production incidents are expensive, stressful, and happen at 3am.**
+SWE-bench tests code generation. WebArena tests web navigation. Nothing trains agents to handle live production incidents — to read logs strategically, trace cascading failures, correlate subtle business anomalies, and apply precise fixes where wrong choices cause collateral damage.
+**ARIA fills that gap.**
 ---
 ## 🎬 The 7 Tasks
+| Task | Max Steps | Random | Strong LLM | Scenario |
+|---|---|---|---|---|
+| `easy` | 15 | 0.05 | 0.85–1.00 | Single service OOM crash-loop |
+| `medium` | 20 | 0.03 | 0.55–0.75 | Cascading failure + red herring alert |
+| `hard` | 25 | 0.01 | 0.30–0.50 | **Silent** corruption — all services green |
+| `bonus` | 25 | 0.01 | 0.35–0.55 | Two simultaneous independent failures |
+| `security` | 20 | 0.01 | 0.40–0.60 | DDoS botnet credential stuffing |
+| `database` | 20 | 0.01 | 0.45–0.65 | Missing index — full table scans |
+| `failover` | 25 | 0.01 | 0.35–0.55 | Multi-region network partition |
+| `generated` | 20 | 0.01 | variable | Procedural — seed-deterministic |
 ---
+## 🏆 Reward Function
 ```
 Final Score = Σ(step_rewards)
+            + efficiency_bonus     # (1 - steps/max_steps) × 0.05
+            + diagnosis_precision  # +0.03 if ≥50% keyword overlap
+            - noop_penalty         # (noops - 3) × 0.02
 ```
+Clamped to **(0.001, 0.999)** for GRPO stability.
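The formula above can be sketched in a few lines of Python. This is an illustrative reading of the README formula, not the grader's actual code: the function name and arguments are hypothetical, and applying the efficiency bonus only to resolved episodes and flooring the noop penalty at zero are assumptions.

```python
def final_score(step_rewards, steps, max_steps, resolved, overlap, noop_count):
    """Combine step rewards with bonus and penalty terms, then clamp."""
    total = sum(step_rewards)
    if resolved:
        # Efficiency bonus: larger when fewer steps were used (assumed to
        # apply only when the incident was resolved).
        total += (1 - steps / max_steps) * 0.05
    if overlap >= 0.5:
        # Diagnosis precision bonus for >=50% keyword overlap.
        total += 0.03
    # Only noops beyond the third are penalized (assumed floor at zero).
    total -= max(0, noop_count - 3) * 0.02
    # Clamp so the score is never exactly 0 or 1.
    return min(0.999, max(0.001, total))


if __name__ == "__main__":
    print(final_score([0.15, 0.35, 0.45], steps=3, max_steps=15,
                      resolved=True, overlap=0.6, noop_count=0))
```

With these inputs the unclamped sum exceeds 1, so the clamp is what keeps a near-perfect episode just below the upper bound.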
+| Action | Reward | Penalty Triggers |
 |---|---|---|
+| `read_logs` correct | +0.15 | Restart healthy service: **-0.15** |
+| `diagnose` full match | +0.35 | Fix without diagnosing: **-0.10** |
+| `restart_service` correct | +0.45 | Wrong failover (payment): **-0.25** |
+| `block_ip_range` | +0.40 | Excessive noops: **-0.04 each** |
+| `alert_oncall` (required) | +0.15 | |
+**Semantic matching:** diagnoses are scored by keyword overlap rather than exact string match, so LLMs that paraphrase aren't penalized.
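One way keyword-overlap matching could work is sketched below. The tokenization rule and the `keyword_overlap` helper are assumptions for illustration; the environment's real matcher may differ (for example, it may use synonyms or stop-word filtering).

```python
import re


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(diagnosis, ground_truth):
    """Fraction of ground-truth keywords that appear in the diagnosis."""
    truth = tokenize(ground_truth)
    if not truth:
        return 0.0
    return len(truth & tokenize(diagnosis)) / len(truth)


if __name__ == "__main__":
    # A paraphrased diagnosis still covers most ground-truth keywords.
    print(keyword_overlap("memory exhaustion in payment-service",
                          "memory_leak_payment_service"))
```

Under this sketch, a paraphrase such as "memory exhaustion in payment-service" covers three of the four ground-truth keywords, which clears the 50% threshold for a full match.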
 ---
 ## 🌟 ARIA Features
 ### Curriculum Engine
+Rolling average per task (last 5 episodes). Promotes when avg > 0.75. Scaffolds with hints when avg < 0.30. Agents always train at the edge of their capability.
 ```bash
+GET /curriculum/status
+GET /curriculum/next
+POST /curriculum/record  # {"task_id": "easy", "score": 0.85}
 ```
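The promotion and scaffolding rule described above can be approximated with a fixed-size window per task. The class below is a hypothetical sketch, not the code in `curriculum/engine.py`; the method names and the `"hold"` / `"no_data"` outcomes are assumptions.

```python
from collections import deque


class RollingCurriculum:
    """Tracks the last `window` scores per task and suggests a decision."""

    def __init__(self, window=5, promote_above=0.75, hint_below=0.30):
        self.window = window
        self.promote_above = promote_above
        self.hint_below = hint_below
        self.history = {}

    def record(self, task_id, score):
        # deque(maxlen=...) silently drops the oldest score once full.
        self.history.setdefault(task_id, deque(maxlen=self.window)).append(score)

    def rolling_avg(self, task_id):
        scores = self.history.get(task_id)
        return sum(scores) / len(scores) if scores else 0.0

    def decision(self, task_id):
        if not self.history.get(task_id):
            return "no_data"
        avg = self.rolling_avg(task_id)
        if avg > self.promote_above:
            return "promote"
        if avg < self.hint_below:
            return "hint"
        return "hold"


if __name__ == "__main__":
    engine = RollingCurriculum()
    for score in (0.80, 0.90, 0.85):
        engine.record("easy", score)
    print(engine.decision("easy"))
```

Using a bounded deque keeps the average responsive: an early run of poor episodes stops counting against the agent once five newer scores have been recorded.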
 ### Incident Generator
+Seeds 0–99,999 → unique reproducible incidents. 6 failure modes × 8 services × 3 severities × 0–3 noise alerts.
 ```bash
+GET /generate/preview?seed=1337
+POST /reset  # {"task_id": "generated", "seed": 1337}
 ```
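Seed-determinism of this kind is usually achieved by drawing every random choice from a generator seeded with the incident seed. The sketch below illustrates the idea only; the failure-mode names follow the earlier README's difficulty table, while the service subset, severity labels, and the `generate_incident` function are hypothetical and do not reflect the real `generator/` module.

```python
import random

FAILURE_MODES = ["oom", "cascade", "database", "security",
                 "network_partition", "corruption"]
# Hypothetical choice of 8 services and 3 severity labels for illustration.
SERVICES = ["api-gateway", "payment-service", "order-service",
            "inventory-service", "user-service", "notification-service",
            "data-pipeline-service", "ml-inference-service"]
SEVERITIES = ["low", "medium", "high"]


def generate_incident(seed):
    """Derive every attribute from one seeded RNG so results are repeatable."""
    rng = random.Random(seed)  # private RNG: global random state is untouched
    return {
        "seed": seed,
        "failure_mode": rng.choice(FAILURE_MODES),
        "service": rng.choice(SERVICES),
        "severity": rng.choice(SEVERITIES),
        "noise_alerts": rng.randint(0, 3),  # inclusive on both ends
    }


if __name__ == "__main__":
    print(generate_incident(1337) == generate_incident(1337))
```

Keeping the generator private to the function means that other code calling `random` elsewhere cannot change which incident a given seed produces.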
 ### Dual-Agent Mode
+Split observability. Agent A (Observer) sees logs and alerts. Agent B (Responder) sees metrics and dependencies. They coordinate via `share_finding`. Neither can solve the incident alone.
 ```bash
+POST /multi-agent/reset    # {"task_id": "easy", "seed": 42}
+POST /multi-agent/step/a/{id}  # {"finding": "order-service OOM"}
+POST /multi-agent/step/b/{id}  # {"action_type": "restart_service", ...}
 ```
 ---
+## 🧠 Training Results
+**Model:** [Arijit-07/aria-devops-llama8b](https://huggingface.co/Arijit-07/aria-devops-llama8b)
+| Task | Baseline | Fine-tuned | **Improvement** |
+|---|---|---|---|
+| easy | 0.320 | 0.685 | **+0.365** |
+| medium | 0.050 | 0.378 | **+0.328** |
+| hard | 0.190 | 0.869 | **+0.679** |
+| bonus | 0.152 | 0.682 | **+0.530** |
+![Training Curve](https://huggingface.co/Arijit-07/aria-devops-llama8b/resolve/main/training_curve_8b.png)
+**Setup:** GRPO · Llama-3.1-8B · LoRA rank=32 · 160 episodes · NVIDIA L4 · 162 minutes · Unsloth + HuggingFace TRL
+**Key fix:** Group completions scored on fresh environment snapshots — prevents reward gate exhaustion during GRPO group generation.
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Twilight-13/devops-incident-response/blob/main/train_grpo.ipynb)
 ---
 ## 📡 API Reference
 | Method | Endpoint | Description |
 |---|---|---|
+| GET | `/health` | Liveness check |
+| GET | `/about` | Full machine-readable description |
+| GET | `/tasks` | All 8 tasks |
+| POST | `/reset` | Start episode |
+| POST | `/step` | Take action |
+| GET | `/state` | Full state + ground truth |
+| GET | `/validate` | Self-test all 7 tasks |
+| GET | `/metrics` | Aggregate statistics |
 | GET | `/leaderboard` | Top 10 episodes |
+| WS | `/ws` | WebSocket real-time |
+| GET | `/curriculum/status` | Per-task mastery |
+| GET | `/curriculum/next` | Recommended task |
+| POST | `/curriculum/record` | Feed training results |
+| GET | `/generate/preview` | Preview procedural incident |
 | POST | `/multi-agent/reset` | Start dual-agent session |
+| POST | `/multi-agent/step/a/{id}` | Agent A shares finding |
+| POST | `/multi-agent/step/b/{id}` | Agent B takes action |
+| GET | `/docs` | Swagger UI |
 ---
 ## 📊 Benchmark Comparison
+| Benchmark | Domain | Partial Obs | Dense Reward | Curriculum | Multi-Agent |
+|---|---|---|---|---|---|
+| SWE-bench | Code repair | ✗ | ✗ | ✗ | ✗ |
+| WebArena | Web navigation | ✓ | ✗ | ✗ | ✗ |
+| AgentBench | General tools | ✗ | ✗ | ✗ | ✗ |
+| **ARIA** | **Incident response** | **✓** | **✓** | **✓** | **✓** |
 ---
+## 🚀 Setup
 ```bash
+docker build -t aria-devops-incident .
+docker run -p 7860:7860 aria-devops-incident
+# Or local
+pip install -r requirements.txt
+uvicorn api:app --host 0.0.0.0 --port 7860
+```
 ---
+## 📁 Structure
 ```
+├── api.py / server/app.py    # FastAPI — all endpoints
+├── env.py                    # Environment dispatcher
+├── models.py                 # Pydantic models
+├── tasks/                    # 7 tasks + generated
+├── curriculum/engine.py      # Adaptive difficulty
+├── generator/                # Procedural incidents
+├── multi_agent/session.py    # Dual-agent mode
+├── graders/grader.py         # Deterministic grader
+├── demo_llm.py               # Live terminal demo
+├── train_grpo.ipynb          # Training notebook
+├── BLOG.md                   # Project story
+└── openenv.yaml              # OpenEnv manifest
 ```
+Apache 2.0 · *Built solo for the Meta × PyTorch × HuggingFace OpenEnv Hackathon Finals — Bangalore, April 2026*

README_github.md CHANGED Viewed

@@ -3,10 +3,10 @@
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Twilight-13/devops-incident-response/blob/main/train_grpo.ipynb)
 [![HF Space](https://img.shields.io/badge/🤗-Live%20Environment-orange)](https://huggingface.co/spaces/Arijit-07/devops-incident-response)
-[![Trained Model](https://img.shields.io/badge/🤗-Trained%20Model-blue)](https://huggingface.co/Arijit-07/aria-devops-llama3b)
 [![License](https://img.shields.io/badge/License-Apache_2.0-green.svg)](LICENSE)
-> **ARIA** — Adaptive Reward & Incident Architecture
 > Built for the Meta × PyTorch × HuggingFace OpenEnv Hackathon Finals | Bangalore, April 2026
 ---
@@ -16,12 +16,13 @@
 | Resource | Link |
 |---|---|
 | **Live Environment** | https://arijit-07-devops-incident-response.hf.space |
-| **Interactive API (Swagger)** | https://arijit-07-devops-incident-response.hf.space/docs |
-| **Trained Model (Llama-3B LoRA)** | https://huggingface.co/Arijit-07/aria-devops-llama3b |
-| **Training Curve** | https://huggingface.co/Arijit-07/aria-devops-llama3b/resolve/main/training_curve.png |
-| **HuggingFace Blog** | https://huggingface.co/blog/Arijit-07/aria-devops-incident-response |
 | **GitHub** | https://github.com/Twilight-13/devops-incident-response |
-| **Validate (self-test)** | https://arijit-07-devops-incident-response.hf.space/validate |
 ---
@@ -38,7 +39,7 @@ curl -X POST https://arijit-07-devops-incident-response.hf.space/step \
   -H "Content-Type: application/json" \
   -d '{"action_type": "read_logs", "service": "payment-service"}'
-# 3. Diagnose the root cause (reward: +0.30)
 curl -X POST https://arijit-07-devops-incident-response.hf.space/step \
   -H "Content-Type: application/json" \
   -d '{"action_type": "diagnose", "root_cause": "memory leak in payment-service"}'
@@ -48,510 +49,175 @@ curl -X POST https://arijit-07-devops-incident-response.hf.space/step \
   -H "Content-Type: application/json" \
   -d '{"action_type": "restart_service", "service": "payment-service"}'
-# 5. See the final score
-curl https://arijit-07-devops-incident-response.hf.space/state
-# 6. Validate all 7 tasks pass
 curl https://arijit-07-devops-incident-response.hf.space/validate
 ```
-```python
-# Or install and use the Python client
-pip install git+https://github.com/Twilight-13/devops-incident-response.git
-from devops_incident_response import DevOpsIncidentEnv, Action, ActionType
-env = DevOpsIncidentEnv(task_id="easy", seed=42)
-obs = env.reset()
-result = env.step(Action(action_type=ActionType.READ_LOGS, service="payment-service"))
-print(f"Reward: {result.reward}")  # 0.15
-```
 ---
-## 🎯 The Problem This Solves
-Every software company running microservices faces the same brutal reality: **production incidents are expensive, unpredictable, and happen at 3am.**
-A single SEV-1 incident — a payment service crashing, a data corruption silently corrupting prices, a DDoS botnet overwhelming your login endpoint — can cost millions and require hours of expert engineer time to diagnose and fix. On-call rotations are stressful. Tier-2 incidents that follow recognizable patterns are handled by engineers when they could, in principle, be handled by an AI agent.
-**Yet no RL benchmark exists for this domain.**
-SWE-bench tests code generation. WebArena tests web navigation. AgentBench tests general tool use. None of them model **operational intelligence** — the ability to reason under uncertainty about live production systems, gather information strategically, and take precise actions where wrong choices cause additional damage.
-ARIA fills that gap.
----
-## 🏗️ Environment Architecture
-ARIA simulates a production microservices e-commerce platform. Agents interact with the environment through a standard OpenEnv API: `reset()`, `step()`, `state()`.
-### What the Agent Observes
-Each step returns a structured `Observation` object:
-```
-Observation
-├── step, max_steps, task_id, task_description
-├── services: List[ServiceStatus]
-│   ├── name, status (healthy/degraded/down/unknown)
-│   ├── cpu_percent, memory_percent
-│   ├── error_rate, latency_p99_ms
-│   ├── replicas_running, replicas_desired
-│   ├── current_version, last_deployed
-│   └── sla_breach, minutes_degraded        ← SLA tracking per step
-├── active_alerts: List[Alert]              ← may include red herrings
-├── recent_logs: Dict[str, List[str]]       ← PARTIAL: only 2 lines shown
-├── service_dependencies: List[ServiceDependency]  ← call topology
-├── evidence_log: List[EvidenceEntry]       ← accumulates across steps
-├── sla_status: Dict[str, str]              ← ok/warning/breached
-└── available_runbooks: List[str]
-```
-**Key design: Partial Log Observability**
-The agent only sees 2 log lines per service upfront. Full history requires calling `read_logs` explicitly. This models real observability tools (Datadog, Kibana) where engineers run queries — agents must develop a search strategy, not just read everything.
-### The Services
-| Service | Stack | Role |
-|---|---|---|
-| `api-gateway` | Go | Routes external requests |
-| `payment-service` | Java (Spring) | Processes payments |
-| `order-service` | Python | Creates and tracks orders |
-| `inventory-service` | Java | Manages product stock |
-| `user-service` | Node.js | Auth and profiles |
-| `notification-service` | Python | Email and push alerts |
-| `data-pipeline-service` | Python | Writes catalog data |
-| `product-catalog-service` | Go | Stores and serves product data |
-| `price-validation-service` | Python | Validates prices |
-| `analytics-service` | Python | Aggregates business metrics |
-| `ml-inference-service` | Python | Serves recommendation models |
-| `log-aggregator` | Go | Collects and stores logs |
-### Service Dependency Map
-Every observation includes the call topology — agents can trace cascades:
-```
-api-gateway → order-service → inventory-service
-api-gateway → payment-service
-order-service → notification-service
-data-pipeline-service → product-catalog-service → price-validation-service
-```
 ---
 ## 🎬 The 7 Tasks
-### Task 1 — Single Service OOM (`easy`)
-**Max steps: 15 | Expected strong LLM: 0.85–1.00 | Random agent: 0.05**
-One service crash-loops with an OutOfMemoryError. The affected service rotates by seed across payment-service, order-service, and user-service — with different log formats (Java heap errors, Python memory errors, Node.js heap dumps). A secondary circuit-breaker alert fires on api-gateway as a visible symptom.
-**What makes it interesting:** The agent must identify the ROOT cause service (the one running out of memory) not the SYMPTOM services (everything downstream that's erroring because the root is down).
-**Optimal sequence:** `read_logs` → `read_metrics` → `diagnose` → `restart_service`
-**Reward breakdown:** +0.10 read_logs, +0.10 read_metrics, +0.30 diagnose, +0.40 restart = **0.99 with efficiency bonus**
----
-### Task 2 — Cascading Failure (`medium`)
-**Max steps: 20 | Expected strong LLM: 0.55–0.75 | Random agent: 0.03**
-A bad deployment of `inventory-service` causes connection pool exhaustion, cascading timeouts to `order-service` and elevated error rates on `api-gateway`. **Red herring:** a `notification-service` HIGH CPU alert fires (scheduled batch job — completely unrelated).
-**What makes it interesting:** The agent must follow the dependency chain backwards. Three services are visibly failing, but only one is the root cause. Touching the wrong service gives -0.15 collateral damage penalty.
-**Optimal sequence:** Investigate `api-gateway` → trace to `order-service` → trace to `inventory-service` → `rollback`
-**Reward breakdown:** +0.20 trace cascade, +0.05 runbook, +0.25 diagnose, +0.35 rollback = **0.92**
----
-### Task 3 — Silent Data Corruption (`hard`)
-**Max steps: 25 | Expected strong LLM: 0.30–0.50 | Random agent: 0.01**
-**All services show green.** Zero error rates. Normal latency. No standard alerts. The signal is buried in:
-- `price-validation-service` WARN logs: 15% price mismatch rate (baseline: 0.2%)
-- `analytics-service` anomaly: avg order value $847 vs $89 historical baseline
-Three noise alerts distract: TLS renewal, analytics backlog, replica lag.
-**What makes it interesting:** This requires qualitatively different reasoning — ignoring green health checks, correlating subtle business metric anomalies, and understanding that a data pipeline deployment 2 minutes ago is the causal explanation.
-**Full credit requires BOTH:** `rollback(data-pipeline-service)` AND `alert_oncall` (data audit needed)
-**Reward breakdown:** +0.15 subtle signals, +0.10 pipeline metrics, +0.05 runbook, +0.20 diagnose, +0.25 rollback, +0.15 alert_oncall = **0.87**
----
-### Task 4 — Dual Simultaneous Failure (`bonus`)
-**Max steps: 25 | Expected strong LLM: 0.35–0.55 | Random agent: 0.01**
-Two completely independent failures at once:
-1. `log-aggregator` disk 100% full — dropping 48k log messages/min
-2. `ml-inference-service` stuck in model checksum reload loop — CPU 99%+
-**What makes it interesting:** Neither failure is related to the other. Solving one doesn't help the other. The agent must decompose and fix independently. This tests whether agents can maintain multiple hypotheses simultaneously.
-**Full credit requires BOTH:** `alert_oncall` (disk cleanup) AND `rollback/restart(ml-inference-service)`
-**Optimal score: ~0.77**
----
-### Task 5 — Security Incident: DDoS (`security`)
-**Max steps: 20 | Expected strong LLM: 0.40–0.60 | Random agent: 0.01**
-A botnet is targeting the login endpoint with 12,000 req/s from the `185.220.x.x` IP range. Standard rate limiting is ineffective (distributed attack). The access logs show 1,847+ failed login attempts per 60 seconds from that range.
-**New action: `block_ip_range`** — models real network-level DDoS mitigation.
-**Wrong actions:** Restarting api-gateway won't help. Scaling up won't help. Must block at network level + escalate to security team.
-**Full credit:** `block_ip_range("185.220.0.0/16")` AND `alert_oncall`
-**Optimal score: ~0.80**
----
-### Task 6 — Database Degradation (`database`)
-**Max steps: 20 | Expected strong LLM: 0.45–0.65 | Random agent: 0.01**
-A schema migration added a `user_segment` column to the `orders` table 15 minutes ago — without an index. Every query is now doing a full sequential table scan. DB CPU is spiking. The slow query log shows `seq_scan on orders (847ms)`.
-**New action: `create_index`** — models real DBA response to missing indexes.
-**Alternative fix:** Rolling back the migration is also accepted for full credit.
-**Optimal score: ~0.80**
----
-### Task 7 — Multi-Region Failover (`failover`)
-**Max steps: 25 | Expected strong LLM: 0.35–0.55 | Random agent: 0.01**
-A network partition affects `us-east-1`. Four services support automatic failover to `us-west-2` and should be switched. Two services MUST NOT be failed over:
-- `payment-service` — PCI-DSS compliance requires human approval
-- `postgres-primary` — replication lag risk causes data loss
-**New action: `failover`** — with `target_region` parameter.
-**Heavy penalty: -0.25 per wrong service.** Failing over payment or postgres is catastrophic.
-**The runbook explicitly lists which services are safe** — reading it first is rewarded.
-**Optimal score: ~0.70**
----
-### Task 8 — Generated Incident (`generated`)
-**Max steps: 20 | Variable difficulty | Seed-deterministic**
-The Incident Generator creates procedural incidents from any integer seed (0–99,999). Same seed always produces the same incident. Different seeds produce unique combinations of:
-- 6 failure modes × 8 services × 3 severity levels × 0–3 noise alerts
-```bash
-# Preview any incident before running it
-curl "https://arijit-07-devops-incident-response.hf.space/generate/preview?seed=12345"
-# Run it as a full episode
-curl -X POST .../reset -d '{"task_id":"generated","seed":12345}'
-```
 ---
-## 🏆 Reward Function Design
-### The Formula
 ```
 Final Score = Σ(step_rewards)
-            + efficiency_bonus          # (1 - steps/max_steps) × 0.05 if resolved
-            + diagnosis_precision_bonus # +0.03 if ≥50% keyword overlap, +0.01 if ≥30%
-            - noop_penalty             # (noop_count - 3) × 0.02
-            - repeat_restart_penalty   # (restarts - 1) × 0.05 per service
 ```
-All scores clamped to **(0.001, 0.999)** — never exactly 0 or 1.
-> **Why (0.001, 0.999) not (0, 1)?** GRPO advantage normalization requires non-constant rewards within a group. Hard 0 or 1 creates zero-variance groups where the model doesn't update. The tiny clamp ensures a gradient signal always exists.
-### Step-Level Rewards
-| Action | Reward | Condition |
-|---|---|---|
-| `read_logs` (failing service) | +0.10–0.15 | First time only |
-| `read_metrics` (failing service) | +0.10 | First time only |
-| `read_runbook` (relevant) | +0.05 | Correct runbook for scenario |
-| `search_logs` (relevant query) | +0.05 | Query returns useful results |
-| `diagnose` (full match) | +0.30–0.35 | ≥50% keyword overlap |
-| `diagnose` (partial match) | +0.10–0.15 | ≥30% keyword overlap |
-| `restart_service` (correct) | +0.35–0.45 | Root cause service |
-| `rollback` (correct) | +0.30–0.40 | Root cause service |
-| `block_ip_range` (correct) | +0.40 | Security task, correct CIDR |
-| `create_index` (correct) | +0.40 | Database task, correct table/column |
-| `failover` (eligible service) | +0.30 | Per correctly failed-over service |
-| `alert_oncall` (required) | +0.15 | Hard/security/database/failover tasks |
-### Penalties (Anti-Gaming)
-| Action | Penalty | Why |
 |---|---|---|
-| Restart healthy service | -0.15 | Collateral damage — realistic cost |
-| Fix without diagnosing | -0.10 | Blind remediation — models real risk |
-| Failover payment-service | -0.25 | PCI-DSS compliance violation |
-| Failover postgres-primary | -0.25 | Data loss risk |
-| Excessive noops (>3) | -0.04/each | Forces active investigation |
-| Repeat restart same service | -0.05/extra | Discourages guess-and-check |
-### Semantic Diagnosis Matching
-The `diagnose` action uses **keyword overlap**, not exact string matching. An agent saying "memory exhaustion in payment-service" correctly matches the ground truth "memory_leak_payment_service". This is critical for LLM agents that paraphrase: exact string matching would unfairly penalize valid diagnoses.
-### SLA Degradation
-Every step where an incident is unresolved, the environment worsens:
-- `down` services: error_rate increases
-- `degraded` services: latency_p99 increases
-- SLA status: `ok` → `warning` (~3 steps) → `breached` (~7 steps)
-This creates real time pressure and rewards faster resolution.
 ---
 ## 🌟 ARIA Features
 ### Curriculum Engine
-The Curriculum Engine tracks agent performance per task using a rolling average of the last 5 episodes.
-- **Promotion:** rolling_avg > 0.75 → advance mastery level (Novice → Intermediate → Advanced → Mastered)
-- **Demotion:** rolling_avg < 0.30 → step back mastery level
-- **Scaffolding:** if avg < 0.30 over 3+ episodes → provide task-specific hint
 ```bash
-GET /curriculum/status    # See mastery per task
-GET /curriculum/next      # Get recommended next task
-GET /curriculum/hint/easy # Get scaffolding hint for a task
-POST /curriculum/record   # Feed your training results in
 ```
-**Why this matters for training:** RL fails when agents never see successful trajectories. The curriculum ensures agents always train at the edge of their capability — easy tasks first, harder tasks as they master the fundamentals.
 ### Incident Generator
-Procedural incident generation from seeds. 6 failure modes × 8 services × 3 severities × 0–3 noise alerts = thousands of unique training scenarios.
-**Difficulty formula:** `base_difficulty[failure_mode] + (noise_count × 0.05)`, clamped to 1.0
-| Failure Mode | Base Difficulty |
-|---|---|
-| oom | 0.20 |
-| cascade | 0.50 |
-| database | 0.60 |
-| security | 0.60 |
-| network_partition | 0.70 |
-| corruption | 0.80 |
 ```bash
-GET /generate/preview?seed=42     # Preview without starting
-POST /reset  # body: {"task_id":"generated","seed":42}
 ```
 ### Dual-Agent Mode
-One incident. Two agents. Split observability.
-- **Agent A (Observer):** Sees logs, alerts, evidence. Can ONLY call `share_finding` — passes natural language observations to Agent B. Reward: +0.05 per finding.
-- **Agent B (Responder):** Sees metrics, service dependencies, SLA status. Cannot see logs directly. Must rely on Agent A's findings. Executes all real actions.
-Neither agent can solve the incident alone.
 ```bash
-# Start a dual-agent session
-POST /multi-agent/reset  {"task_id":"easy","seed":42}
-# → returns session_id + split observations
-# Agent A shares a finding
-POST /multi-agent/step/a/{session_id}  {"finding":"payment-service OOM, memory at 98%"}
-# Agent B takes action (has access to Agent A's findings)
-POST /multi-agent/step/b/{session_id}  {"action_type":"restart_service","service":"payment-service"}
-# See full session state
-GET /multi-agent/state/{session_id}
 ```
 ---
-## 🧠 Training
-### Model
-**Llama-3.2-3B-Instruct** fine-tuned with **GRPO** (Group Relative Policy Optimization) using HuggingFace TRL and Unsloth.
-- **LoRA:** rank=16, alpha=32, targeting all 7 projection layers
-- **Adapter size:** ~97MB
-- **Training:** 140 episodes (easy + medium tasks) on Kaggle T4 x2 GPUs
-- **Model repo:** https://huggingface.co/Arijit-07/aria-devops-llama3b
-### Why GRPO?
-GRPO eliminates the value network that PPO requires. For environment-based RL where rewards come from an external API, a value model adds complexity without benefit. GRPO estimates the baseline from a group of 6 completions per step — simpler, more memory-efficient, and well-suited to fast environment APIs.
-### Training Loop
-```python
-# Each training step:
-# 1. Generate 6 completions for the current observation
-# 2. Score each on a FRESH env snapshot (prevents reward gate exhaustion)
-# 3. Normalize rewards to advantages (GRPO)
-# 4. Policy gradient update on best completion + KL penalty
-# 5. Advance episode with best action
-# Key hyperparameters:
-learning_rate = 5e-6
-group_size = 6
-kl_coefficient = 0.05   # prevents catastrophic forgetting
-update_strategy = "episode-level"  # one update per full episode
-```
-### Results
-| | Base Model | Fine-tuned (ep140) |
-|---|---|---|
-| **Easy task** | 0.000 | 0.150 |
-| **Behavior** | Jumps to diagnose immediately | Reads logs on correct service first |
-| **Why the difference** | Base model triggers blind remediation penalty | Fine-tuned model learned to gather information before acting |
-**The trained model consistently reads logs on the failing service before acting** — this is the foundational operational behavior: information gathering before remediation. The base model never does this.
-**Training challenge identified:** The original training loop called `env_step` during group generation, burning reward gates before the best action could advance the episode. After fixing to score completions on fresh environment snapshots, the model successfully learned step 1 of the optimal policy. With more episodes using the corrected loop, the full sequence would emerge.
-### Training Notebook
-See `train_grpo.ipynb` — Colab-compatible, runs against the live HF Space API (no local setup needed).
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Twilight-13/devops-incident-response/blob/main/train_grpo.ipynb)
-Compatible with: TRL, SkyRL, ART, Oumi, Axolotl.
----
-## 🚀 Setup
-### Docker (Recommended)
-```bash
-docker build -t aria-devops-incident .
-docker run -p 7860:7860 aria-devops-incident
-curl http://localhost:7860/health
-```
-### Local Python
-```bash
-pip install -r requirements.txt
-uvicorn api:app --host 0.0.0.0 --port 7860
-```
-### Validate
-```bash
-python validate.py    # 22 automated checks, exit 0 = all pass
-curl http://localhost:7860/validate
-```
 ---
 ## 📡 API Reference
 | Method | Endpoint | Description |
 |---|---|---|
-| GET | `/health` | `{"status":"ok"}` liveness check |
-| GET | `/about` | Full environment description (machine-readable) |
-| GET | `/tasks` | All 8 tasks with descriptions |
-| POST | `/reset` | Start episode: `{"task_id":"easy","seed":42}` |
-| POST | `/step` | Take action: Action JSON |
-| GET | `/state` | Full state + ground truth + analytics |
-| GET | `/validate` | Self-test: random agent on all 7 tasks |
-| GET | `/metrics` | Aggregate episode statistics |
 | GET | `/leaderboard` | Top 10 episodes |
-| WS | `/ws` | WebSocket: real-time agent-environment |
-| GET | `/curriculum/status` | Per-task mastery and recommendations |
-| GET | `/curriculum/next` | Recommended next task for training |
-| GET | `/curriculum/hint/{task_id}` | Scaffolding hint for struggling agents |
-| POST | `/curriculum/record` | Feed episode result to curriculum engine |
-| GET | `/generate/preview` | Preview procedural incident: `?seed=N` |
 | POST | `/multi-agent/reset` | Start dual-agent session |
-| POST | `/multi-agent/step/a/{id}` | Agent A shares a finding |
-| POST | `/multi-agent/step/b/{id}` | Agent B takes an action |
-| GET | `/multi-agent/state/{id}` | Full dual-agent session state |
-| GET | `/multi-agent/sessions` | List active sessions |
-| GET | `/docs` | Swagger UI — interactive documentation |
 ---
 ## 📊 Benchmark Comparison
-| Benchmark | Domain | Partial Obs | Dense Reward | Multi-Step | Curriculum | Multi-Agent |
-|---|---|---|---|---|---|---|
-| SWE-bench | Code repair | ✗ | ✗ | ✓ | ✗ | ✗ |
-| WebArena | Web navigation | ✓ | ✗ | ✓ | ✗ | ✗ |
-| AgentBench | General tools | ✗ | ✗ | ✓ | ✗ | ✗ |
-| **ARIA (ours)** | **Incident response** | **✓** | **✓** | **✓** | **✓** | **✓** |
 ---
-## 🏗️ OpenEnv Compliance
 ```bash
-openenv validate .
-```
-- Inherits from `openenv.core.env_client.EnvClient`
-- Standard `reset()`, `step()`, `state()` interface
-- Valid `openenv.yaml` manifest with all 8 tasks
-- FastAPI server with health endpoint
-- WebSocket support at `/ws`
-- Hosted on HuggingFace Spaces
 ---
-## 📁 Repository Structure
 ```
-aria-devops-incident-response/
-├── api.py                    # FastAPI app — all endpoints
-├── env.py                    # DevOpsIncidentEnv — thin dispatcher
-├── models.py                 # Pydantic models — Action, Observation, State
-├── tasks/
-│   ├── base.py               # BaseTask ABC, InternalState, reward logic
-│   ├── task_easy.py          # OOM crash-loop
-│   ├── task_medium.py        # Cascading failure
-│   ├── task_hard.py          # Silent data corruption
-│   ├── task_bonus.py         # Dual simultaneous failure
-│   ├── task_security.py      # DDoS attack
-│   ├── task_database.py      # Missing index
-│   ├── task_failover.py      # Multi-region failover
-│   └── task_generated.py     # Procedural incidents
-├── curriculum/
-│   └── engine.py             # CurriculumEngine — adaptive difficulty
-├── generator/
-│   └── incident_factory.py   # IncidentFactory — procedural generation
-├── multi_agent/
-│   └── session.py            # DualAgentSession — split observability
-├── graders/
-│   └── grader.py             # Deterministic episode grader
-├── data/runbooks/            # 6 operational runbooks (Markdown)
-├── client.py                 # openenv-core EnvClient implementation
-├── inference.py              # LLM baseline (CoT + fast modes)
-├── train_grpo.ipynb          # GRPO training notebook (Colab-compatible)
-├── validate.py               # 22 automated validation checks
-└── openenv.yaml              # OpenEnv spec manifest
 ```
----
-## 📝 License
-Apache 2.0
----
-*Built solo for the Meta × PyTorch × HuggingFace OpenEnv Hackathon Finals — Bangalore, April 2026*

 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Twilight-13/devops-incident-response/blob/main/train_grpo.ipynb)
 [![HF Space](https://img.shields.io/badge/🤗-Live%20Environment-orange)](https://huggingface.co/spaces/Arijit-07/devops-incident-response)
+[![Trained Model](https://img.shields.io/badge/🤗-Llama--3.1--8B%20Fine--tuned-blue)](https://huggingface.co/Arijit-07/aria-devops-llama8b)
 [![License](https://img.shields.io/badge/License-Apache_2.0-green.svg)](LICENSE)
+> **ARIA** — Adaptive Reward & Incident Architecture
 > Built for the Meta × PyTorch × HuggingFace OpenEnv Hackathon Finals | Bangalore, April 2026
 ---
 | Resource | Link |
 |---|---|
 | **Live Environment** | https://arijit-07-devops-incident-response.hf.space |
+| **Interactive API** | https://arijit-07-devops-incident-response.hf.space/docs |
+| **Trained Model (8B)** | https://huggingface.co/Arijit-07/aria-devops-llama8b |
+| **Training Curve** | https://huggingface.co/Arijit-07/aria-devops-llama8b/resolve/main/training_curve_8b.png |
+| **Blog Post** | https://huggingface.co/blog/Arijit-07/aria-devops-incident-response |
 | **GitHub** | https://github.com/Twilight-13/devops-incident-response |
+| **Validate** | https://arijit-07-devops-incident-response.hf.space/validate |
+| **About (machine-readable)** | https://arijit-07-devops-incident-response.hf.space/about |
 ---
   -H "Content-Type: application/json" \
   -d '{"action_type": "read_logs", "service": "payment-service"}'
+# 3. Diagnose (reward: +0.30)
 curl -X POST https://arijit-07-devops-incident-response.hf.space/step \
   -H "Content-Type: application/json" \
   -d '{"action_type": "diagnose", "root_cause": "memory leak in payment-service"}'
   -H "Content-Type: application/json" \
   -d '{"action_type": "restart_service", "service": "payment-service"}'
+# 5. Validate all 7 tasks pass
 curl https://arijit-07-devops-incident-response.hf.space/validate
 ```
 ---
+## 🎯 The Problem
+Every company running microservices faces the same reality: **production incidents are expensive, stressful, and happen at 3am.**
+SWE-bench tests code generation. WebArena tests web navigation. Nothing trains agents to handle live production incidents — to read logs strategically, trace cascading failures, correlate subtle business anomalies, and apply precise fixes where wrong choices cause collateral damage.
+**ARIA fills that gap.**
 ---
 ## 🎬 The 7 Tasks
+| Task | Max Steps | Random | Strong LLM | Scenario |
+|---|---|---|---|---|
+| `easy` | 15 | 0.05 | 0.85–1.00 | Single service OOM crash-loop |
+| `medium` | 20 | 0.03 | 0.55–0.75 | Cascading failure + red herring alert |
+| `hard` | 25 | 0.01 | 0.30–0.50 | **Silent** corruption — all services green |
+| `bonus` | 25 | 0.01 | 0.35–0.55 | Two simultaneous independent failures |
+| `security` | 20 | 0.01 | 0.40–0.60 | DDoS botnet credential stuffing |
+| `database` | 20 | 0.01 | 0.45–0.65 | Missing index — full table scans |
+| `failover` | 25 | 0.01 | 0.35–0.55 | Multi-region network partition |
+| `generated` | 20 | 0.01 | variable | Procedural — seed-deterministic |
 ---
+## 🏆 Reward Function
 ```
 Final Score = Σ(step_rewards)
+            + efficiency_bonus     # (1 - steps/max_steps) × 0.05
+            + diagnosis_precision  # +0.03 if ≥50% keyword overlap
+            - noop_penalty         # (noops - 3) × 0.02
 ```
+Clamped to **(0.001, 0.999)** for GRPO stability.
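A minimal Python sketch of how the terms in the formula might combine. The function name `final_score` and its parameters are hypothetical and not taken from the ARIA codebase. It assumes the efficiency bonus is paid only on resolved episodes and that the noop penalty applies only beyond the third noop. It uses the 0.02 per-noop rate from the formula, whereas the penalty table lists -0.04 each.

```python
from typing import Iterable

SCORE_MIN, SCORE_MAX = 0.001, 0.999  # clamp bounds from the README


def final_score(step_rewards: Iterable[float], steps: int, max_steps: int,
                noops: int, diagnosis_overlap: float,
                resolved: bool = True) -> float:
    """Combine step rewards with the bonus and penalty terms, then clamp."""
    total = sum(step_rewards)
    if resolved:  # assumption: bonus only when the incident was resolved
        total += (1 - steps / max_steps) * 0.05
    if diagnosis_overlap >= 0.5:  # diagnosis precision bonus
        total += 0.03
    total -= max(0, noops - 3) * 0.02  # assumption: first 3 noops are free
    return min(SCORE_MAX, max(SCORE_MIN, total))
```

With this shape, a near-perfect episode saturates at 0.999 and a purely passive one bottoms out at 0.001, so no episode scores exactly 0 or 1.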
+| Action | Reward | Penalty Triggers |
 |---|---|---|
+| `read_logs` correct | +0.15 | Restart healthy service: **-0.15** |
+| `diagnose` full match | +0.35 | Fix without diagnosing: **-0.10** |
+| `restart_service` correct | +0.45 | Wrong failover (payment): **-0.25** |
+| `block_ip_range` | +0.40 | Excessive noops: **-0.04 each** |
+| `alert_oncall` (required) | +0.15 | |
+**Semantic matching:** diagnoses are scored by keyword overlap, not exact string match, so LLMs that paraphrase aren't penalized.
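One way keyword-overlap matching could work, as a hedged sketch. The helper names (`keywords`, `overlap`, `match_level`) and the tokenization rule are assumptions, and the actual grader may weight or filter keywords differently. The 0.50 threshold is the one quoted for the precision bonus; the 0.30 partial threshold is an assumption.

```python
import re


def keywords(text: str) -> set[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return {tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok}


def overlap(diagnosis: str, ground_truth: str) -> float:
    """Fraction of ground-truth keywords that appear in the diagnosis."""
    truth = keywords(ground_truth)
    if not truth:
        return 0.0
    return len(truth & keywords(diagnosis)) / len(truth)


def match_level(diagnosis: str, ground_truth: str) -> str:
    score = overlap(diagnosis, ground_truth)
    if score >= 0.5:
        return "full"
    if score >= 0.3:  # assumed partial-credit threshold
        return "partial"
    return "none"
```

Under these assumptions, "memory exhaustion in payment-service" shares three of the four keywords in `memory_leak_payment_service` and counts as a full match.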
 ---
 ## 🌟 ARIA Features
 ### Curriculum Engine
+Rolling average per task (last 5 episodes). Promotes when avg > 0.75. Scaffolds with hints when avg < 0.30. Agents always train at the edge of their capability.
 ```bash
+GET /curriculum/status
+GET /curriculum/next
+POST /curriculum/record  # {"task_id": "easy", "score": 0.85}
 ```
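The promotion and scaffolding rules can be sketched with a fixed-size window. This is an illustrative Python sketch, not the engine's code: the class name `TaskMastery`, the level labels, the `min_episodes` guard, and clearing the window after a promotion are all assumptions.

```python
from collections import deque

LEVELS = ["Novice", "Intermediate", "Advanced", "Mastered"]  # assumed labels


class TaskMastery:
    """Tracks a rolling average of recent scores for one task."""

    def __init__(self, window: int = 5, min_episodes: int = 3):
        self.scores = deque(maxlen=window)  # keeps only the last `window` scores
        self.min_episodes = min_episodes    # assumption: wait before judging
        self.level = 0

    def record(self, score: float) -> dict:
        self.scores.append(score)
        avg = sum(self.scores) / len(self.scores)
        enough = len(self.scores) >= self.min_episodes
        needs_hint = enough and avg < 0.30
        if enough and avg > 0.75 and self.level < len(LEVELS) - 1:
            self.level += 1
            self.scores.clear()  # assumption: re-earn mastery at the new level
        return {"rolling_avg": avg, "level": LEVELS[self.level],
                "needs_hint": needs_hint}
```

A training loop could call `record` after each episode and request a hint from `/curriculum/hint/{task_id}` whenever `needs_hint` is true.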
 ### Incident Generator
+Seeds 0–99,999 → unique reproducible incidents. 6 failure modes × 8 services × 3 severities × 0–3 noise alerts.
 ```bash
+GET /generate/preview?seed=1337
+POST /reset  # {"task_id": "generated", "seed": 1337}
 ```
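Seed determinism of this kind is usually achieved by drawing every random choice from a generator seeded once. The sketch below assumes that approach. The service names, severity labels, base difficulty values, and the function name `generate_incident` are illustrative assumptions, not the factory's actual tables.

```python
import random

# Illustrative values only; the real generator's tables may differ.
BASE_DIFFICULTY = {
    "oom": 0.20, "cascade": 0.50, "database": 0.60,
    "security": 0.60, "network_partition": 0.70, "corruption": 0.80,
}
SERVICES = [  # hypothetical list of 8 services
    "payment-service", "order-service", "api-gateway", "log-aggregator",
    "ml-inference-service", "postgres-primary", "inventory-service",
    "auth-service",
]
SEVERITIES = ["low", "medium", "high"]  # hypothetical labels


def generate_incident(seed: int) -> dict:
    """Same seed in, same incident out: all choices come from one seeded RNG."""
    rng = random.Random(seed)
    mode = rng.choice(sorted(BASE_DIFFICULTY))
    noise = rng.randint(0, 3)  # number of unrelated noise alerts
    return {
        "seed": seed,
        "failure_mode": mode,
        "service": rng.choice(SERVICES),
        "severity": rng.choice(SEVERITIES),
        "noise_alerts": noise,
        "difficulty": min(1.0, BASE_DIFFICULTY[mode] + noise * 0.05),
    }
```

Because the generator is local to the call, two episodes reset with the same seed see the same incident regardless of what ran in between.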
 ### Dual-Agent Mode
+Split observability. Agent A (Observer) sees logs and alerts. Agent B (Responder) sees metrics and dependencies. They coordinate via `share_finding`. Neither can solve the incident alone.
 ```bash
+POST /multi-agent/reset    # {"task_id": "easy", "seed": 42}
+POST /multi-agent/step/a/{id}  # {"finding": "order-service OOM"}
+POST /multi-agent/step/b/{id}  # {"action_type": "restart_service", ...}
 ```
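Split observability can be pictured as filtering one full observation into two views. This is a rough sketch under assumed field names (`logs`, `alerts`, `metrics`, `dependencies`, `shared_findings`); the real session object may use different keys.

```python
OBSERVER_KEYS = {"logs", "alerts"}             # what Agent A may see (assumed keys)
RESPONDER_KEYS = {"metrics", "dependencies"}   # what Agent B may see (assumed keys)


def split_observation(obs: dict, findings: list[str]) -> tuple[dict, dict]:
    """Return (view_a, view_b); B also receives the findings A has shared."""
    view_a = {k: v for k, v in obs.items() if k in OBSERVER_KEYS}
    view_b = {k: v for k, v in obs.items() if k in RESPONDER_KEYS}
    view_b["shared_findings"] = list(findings)  # copy so callers can't mutate it
    return view_a, view_b
```

In this picture the only channel from logs to the Responder is the list of findings, which is why neither agent can resolve the incident alone.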
 ---
+## 🧠 Training Results
+**Model:** [Arijit-07/aria-devops-llama8b](https://huggingface.co/Arijit-07/aria-devops-llama8b)
+| Task | Baseline | Fine-tuned | **Improvement** |
+|---|---|---|---|
+| easy | 0.320 | 0.685 | **+0.365** |
+| medium | 0.050 | 0.378 | **+0.328** |
+| hard | 0.190 | 0.869 | **+0.679** |
+| bonus | 0.152 | 0.682 | **+0.530** |
+![Training Curve](https://huggingface.co/Arijit-07/aria-devops-llama8b/resolve/main/training_curve_8b.png)
+**Setup:** GRPO · Llama-3.1-8B · LoRA rank=32 · 160 episodes · NVIDIA L4 · 162 minutes · Unsloth + HuggingFace TRL
+**Key fix:** Group completions are scored on fresh environment snapshots, which prevents reward gate exhaustion during GRPO group generation.
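The snapshot idea can be illustrated with a toy environment. Everything here is a hypothetical stand-in: `ToyEnv` models a single first-time-only reward gate, and `score_group` and `group_advantages` are sketch names, not functions from the training notebook. The advantage formula is the usual group-relative normalization (reward minus group mean, divided by group standard deviation).

```python
import copy
from statistics import mean, pstdev


class ToyEnv:
    """Hypothetical env: only the first read_logs is rewarded (a reward gate)."""

    def __init__(self):
        self.logs_read = False

    def step(self, action: str) -> float:
        if action == "read_logs" and not self.logs_read:
            self.logs_read = True
            return 0.15
        return 0.0


def score_group(env: ToyEnv, actions: list[str]) -> list[float]:
    """Score each candidate on its own copy so the live env is never consumed."""
    return [copy.deepcopy(env).step(a) for a in actions]


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: zero for every member if rewards are equal."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If the candidates were scored on the live environment instead, the second `read_logs` in a group would earn nothing because the first had already used up the gate.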
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Twilight-13/devops-incident-response/blob/main/train_grpo.ipynb)
 ---
 ## 📡 API Reference
 | Method | Endpoint | Description |
 |---|---|---|
+| GET | `/health` | Liveness check |
+| GET | `/about` | Full machine-readable description |
+| GET | `/tasks` | All 8 tasks |
+| POST | `/reset` | Start episode |
+| POST | `/step` | Take action |
+| GET | `/state` | Full state + ground truth |
+| GET | `/validate` | Self-test all 7 tasks |
+| GET | `/metrics` | Aggregate statistics |
 | GET | `/leaderboard` | Top 10 episodes |
+| WS | `/ws` | WebSocket real-time |
+| GET | `/curriculum/status` | Per-task mastery |
+| GET | `/curriculum/next` | Recommended task |
+| POST | `/curriculum/record` | Feed training results |
+| GET | `/generate/preview` | Preview procedural incident |
 | POST | `/multi-agent/reset` | Start dual-agent session |
+| POST | `/multi-agent/step/a/{id}` | Agent A shares finding |
+| POST | `/multi-agent/step/b/{id}` | Agent B takes action |
+| GET | `/docs` | Swagger UI |
 ---
 ## 📊 Benchmark Comparison
+| Benchmark | Domain | Partial Obs | Dense Reward | Curriculum | Multi-Agent |
+|---|---|---|---|---|---|
+| SWE-bench | Code repair | ✗ | ✗ | ✗ | ✗ |
+| WebArena | Web navigation | ✓ | ✗ | ✗ | ✗ |
+| AgentBench | General tools | ✗ | ✗ | ✗ | ✗ |
+| **ARIA** | **Incident response** | **✓** | **✓** | **✓** | **✓** |
 ---
+## 🚀 Setup
 ```bash
+docker build -t aria-devops-incident .
+docker run -p 7860:7860 aria-devops-incident
+# Or local
+pip install -r requirements.txt
+uvicorn api:app --host 0.0.0.0 --port 7860
+```
 ---
+## 📁 Structure
 ```
+├── api.py / server/app.py    # FastAPI — all endpoints
+├── env.py                    # Environment dispatcher
+├── models.py                 # Pydantic models
+├── tasks/                    # 7 tasks + generated
+├── curriculum/engine.py      # Adaptive difficulty
+├── generator/                # Procedural incidents
+├── multi_agent/session.py    # Dual-agent mode
+├── graders/grader.py         # Deterministic grader
+├── demo_llm.py               # Live terminal demo
+├── train_grpo.ipynb          # Training notebook
+├── BLOG.md                   # Project story
+└── openenv.yaml              # OpenEnv manifest
 ```
+Apache 2.0 · *Built solo for the Meta × PyTorch × HuggingFace OpenEnv Hackathon Finals — Bangalore, April 2026*

debug_out.txt DELETED

File without changes

debug_steps.txt DELETED

File without changes

demo_llm.py CHANGED

@@ -40,8 +40,6 @@ DEFAULT_TASK = "easy"
 DEFAULT_SEED = 42
 MODEL_REPO = "Arijit-07/aria-devops-llama8b"
-# fallback to 3b if 8b not ready:
-# MODEL_REPO = "Arijit-07/aria-devops-llama3b"
 HF_TOKEN = os.environ.get('HF_TOKEN', '')

openenv.yaml CHANGED

@@ -193,9 +193,9 @@ reward:
 training:
   algorithm: GRPO
-  model: unsloth/Llama-3.2-3B-Instruct
-  adapter: https://huggingface.co/Arijit-07/aria-devops-llama3b
-  episodes: 140
   framework: HuggingFace TRL + Unsloth
   results:
     easy_pre: 0.42

 training:
   algorithm: GRPO
+  model: Llama-3.1-8B-Instruct
+  adapter: https://huggingface.co/Arijit-07/aria-devops-llama8b
+  episodes: 160
   framework: HuggingFace TRL + Unsloth
   results:
     easy_pre: 0.42

server/app.py CHANGED

@@ -808,64 +808,6 @@ def health():
     return {"status": "ok", "env": "devops-incident-response", "version": "2.0.0"}
-@app.get("/about")
-def about():
-    """
-    Full environment metadata for LLM judges and researchers.
-    Returns a comprehensive description of the ARIA environment including
-    task count, action types, feature flags, training metadata, reward
-    design philosophy, and links to the live space, trained model, and docs.
-    Returns:
-        JSON object with name, version, description, themes, task/action counts,
-        feature descriptions, training info, reward design, and links.
-    """
-    return {
-        "name": "ARIA — DevOps Incident Response",
-        "version": "2.0.0",
-        "description": (
-            "OpenEnv-compliant RL environment for production incident response. "
-            "AI agents diagnose and remediate software incidents across 7 task types "
-            "using 14 actions with dense reward shaping."
-        ),
-        "themes": [
-            "World Modeling: Professional Tasks",
-            "Self-Improvement",
-            "Multi-Agent Interactions",
-        ],
-        "tasks": 8,
-        "action_types": 14,
-        "features": {
-            "curriculum_engine": "Adaptive difficulty based on agent performance",
-            "incident_generator": "Procedural incidents from seeds (0-99999)",
-            "dual_agent_mode": "Split observability — Observer + Responder",
-        },
-        "training": {
-            "model": "Llama-3.2-3B-Instruct",
-            "algorithm": "GRPO",
-            "framework": "HuggingFace TRL + Unsloth",
-            "episodes": 140,
-            "adapter_url": "https://huggingface.co/Arijit-07/aria-devops-llama3b",
-        },
-        "reward_design": {
-            "type": "dense",
-            "range": [0.001, 0.999],
-            "anti_gaming": [
-                "collateral_damage_penalty",
-                "blind_remediation_penalty",
-                "semantic_diagnosis_matching",
-            ],
-            "efficiency_bonus": True,
-        },
-        "links": {
-            "space": "https://arijit-07-devops-incident-response.hf.space",
-            "model": "https://huggingface.co/Arijit-07/aria-devops-llama3b",
-            "github": "https://github.com/Twilight-13/devops-incident-response",
-            "docs": "https://arijit-07-devops-incident-response.hf.space/docs",
-        },
-    }
 @app.get("/generate/preview")
 def preview_incident(seed: int = 42):

training_screenshots/post_training.png ADDED

training_screenshots/training_bonus.png ADDED

training_screenshots/training_easy.png ADDED

training_screenshots/training_hard.png ADDED

training_screenshots/training_medium.png ADDED

ui_test.py DELETED

@@ -1,945 +0,0 @@
-html_content = """<!DOCTYPE html>
-<html lang="en">
-<head>
-    <meta charset="UTF-8">
-    <meta name="viewport" content="width=device-width, initial-scale=1.0">
-    <title>ARIA - DevOps Incident Response</title>
-    <link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&family=JetBrains+Mono:wght@400;700&display=swap" rel="stylesheet">
-    <style>
-        :root {
-            --bg-primary: #060914;
-            --bg-secondary: #0d1117;
-            --bg-card: #111827;
-            --bg-card-hover: #1a2234;
-            --border: #1f2937;
-            --border-glow: #3b82f6;
-            --accent-blue: #3b82f6;
-            --accent-cyan: #06b6d4;
-            --accent-green: #10b981;
-            --accent-red: #ef4444;
-            --accent-yellow: #f59e0b;
-            --accent-purple: #8b5cf6;
-            --accent-orange: #f97316;
-            --text-primary: #f9fafb;
-            --text-secondary: #9ca3af;
-            --text-dim: #4b5563;
-        }
-        * { margin: 0; padding: 0; box-sizing: border-box; }
-        html { scroll-behavior: smooth; }
-        body {
-            background: var(--bg-primary);
-            color: var(--text-primary);
-            font-family: 'Inter', sans-serif;
-            min-height: 100vh;
-            overflow-x: hidden;
-        }
-        a { text-decoration: none; color: inherit; }
-        /* Animation */
-        .fade-in {
-            opacity: 0;
-            transform: translateY(20px);
-            transition: opacity 0.6s ease-out, transform 0.6s ease-out;
-        }
-        .fade-in.visible { opacity: 1; transform: translateY(0); }
-        /* Canvas Background */
-        #bg-canvas {
-            position: fixed;
-            top: 0;
-            left: 0;
-            width: 100vw;
-            height: 100vh;
-            z-index: 0;
-            pointer-events: none;
-        }
-        /* Container */
-        .container {
-            max-width: 1280px;
-            margin: 0 auto;
-            padding: 0 24px;
-            position: relative;
-            z-index: 1;
-        }
-        /* Section Spacing */
-        section { padding: 80px 0; }
-        /* Navbar */
-        nav {
-            position: fixed;
-            top: 0;
-            width: 100%;
-            height: 64px;
-            background: rgba(6, 9, 20, 0.8);
-            backdrop-filter: blur(20px);
-            border-bottom: 1px solid var(--border);
-            z-index: 100;
-            display: flex;
-            align-items: center;
-        }
-        .nav-inner {
-            display: flex;
-            justify-content: space-between;
-            align-items: center;
-            width: 100%;
-            max-width: 1280px;
-            margin: 0 auto;
-            padding: 0 24px;
-        }
-        .nav-left { display: flex; align-items: center; gap: 8px; }
-        .nav-logo { font-size: 20px; font-weight: 700; color: var(--accent-blue); display: flex; align-items: center; gap: 6px; }
-        .nav-desc { font-size: 13px; color: var(--text-secondary); margin-left: 8px; display: none; }
-        @media (min-width: 768px) { .nav-desc { display: block; } }
-        .nav-center { display: flex; justify-content: center; flex: 1; }
-        .status-pill {
-            display: flex;
-            align-items: center;
-            gap: 6px;
-            background: rgba(16, 185, 129, 0.2);
-            border: 1px solid var(--accent-green);
-            color: var(--accent-green);
-            padding: 4px 12px;
-            border-radius: 999px;
-            font-size: 12px;
-            font-weight: 600;
-        }
-        .status-dot {
-            width: 6px;
-            height: 6px;
-            background: var(--accent-green);
-            border-radius: 50%;
-            animation: pulse 2s infinite;
-        }
-        @keyframes pulse { 0% { transform: scale(1); opacity: 1; } 50% { transform: scale(1.5); opacity: 0.5; } 100% { transform: scale(1); opacity: 1; } }
-        .nav-right { display: flex; gap: 24px; }
-        .nav-link { font-size: 13px; color: var(--text-secondary); transition: color 0.2s; }
-        .nav-link:hover { color: var(--text-primary); }
-        /* Hero Section */
-        .hero { padding: 120px 0 80px; text-align: center; }
-        .hero-badge {
-            background: rgba(59, 130, 246, 0.1);
-            border: 1px solid rgba(59, 130, 246, 0.3);
-            border-radius: 999px;
-            padding: 6px 16px;
-            font-size: 12px;
-            color: var(--accent-blue);
-            display: inline-block;
-            margin-bottom: 24px;
-        }
-        .hero-title {
-            font-size: clamp(72px, 12vw, 140px);
-            font-weight: 700;
-            background: linear-gradient(135deg, #3b82f6 0%, #06b6d4 50%, #8b5cf6 100%);
-            -webkit-background-clip: text;
-            -webkit-text-fill-color: transparent;
-            line-height: 1;
-            letter-spacing: -4px;
-        }
-        .hero-subtitle { font-size: 20px; color: var(--text-secondary); margin-top: 16px; font-weight: 400; }
-        .hero-desc { font-size: 15px; color: var(--text-dim); margin-top: 12px; line-height: 1.6; max-width: 600px; margin-inline: auto; }
-        .hero-buttons { margin-top: 40px; display: flex; justify-content: center; gap: 16px; flex-wrap: wrap; }
-        .btn-primary, .btn-secondary {
-            padding: 14px 28px;
-            border-radius: 8px;
-            font-weight: 600;
-            font-size: 15px;
-            transition: all 0.2s;
-            cursor: pointer;
-            display: inline-block;
-        }
-        .btn-primary { background: var(--accent-blue); color: white; border: none; }
-        .btn-primary:hover { background: #2563eb; transform: translateY(-2px); }
-        .btn-secondary { background: transparent; border: 1px solid var(--border); color: var(--text-secondary); }
-        .btn-secondary:hover { border-color: var(--accent-blue); color: white; transform: translateY(-2px); }
-        .hero-stats {
-            margin-top: 64px;
-            display: grid;
-            grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
-            gap: 16px;
-        }
-        .stat-card {
-            background: var(--bg-card);
-            border: 1px solid var(--border);
-            border-radius: 12px;
-            padding: 20px 32px;
-            text-align: center;
-        }
-        .stat-val { font-family: 'JetBrains Mono', monospace; font-size: 32px; font-weight: 700; color: var(--accent-blue); }
-        .stat-label { font-size: 13px; color: var(--text-secondary); margin-top: 4px; }
-        /* General Sections */
-        .section-title { font-size: 24px; font-weight: 600; margin-bottom: 8px; }
-        .section-subtitle { font-size: 15px; color: var(--text-secondary); margin-bottom: 32px; }
-        /* Task Cards */
-        .task-grid { display: grid; grid-template-columns: repeat(4, 1fr); gap: 16px; }
-        @media (max-width: 1024px) { .task-grid { grid-template-columns: repeat(2, 1fr); } }
-        @media (max-width: 640px) { .task-grid { grid-template-columns: 1fr; } }
-        .task-card {
-            background: var(--bg-card);
-            border: 1px solid var(--border);
-            border-radius: 16px;
-            padding: 24px;
-            transition: all 0.3s;
-            cursor: pointer;
-            position: relative;
-            overflow: hidden;
-            display: flex;
-            flex-direction: column;
-        }
-        .task-card::before {
-            content: ''; position: absolute; top: 0; left: 0; right: 0; height: 2px;
-            background: transparent; transition: all 0.3s;
-        }
-        .task-card:hover {
-            background: var(--bg-card-hover);
-            transform: translateY(-4px);
-            box-shadow: 0 20px 40px rgba(0,0,0,0.4);
-        }
-        .task-card:hover::before { background: var(--card-color, var(--border)); }
-        .task-header { display: flex; justify-content: space-between; align-items: flex-start; }
-        .task-icon { font-size: 32px; }
-        .task-badge {
-            font-size: 11px;
-            font-weight: 700;
-            padding: 4px 8px;
-            border-radius: 6px;
-            background: var(--card-bg);
-            color: var(--card-color);
-            letter-spacing: 0.5px;
-        }
-        .task-name { font-size: 16px; font-weight: 600; margin-top: 16px; }
-        .task-desc { font-size: 13px; color: var(--text-secondary); margin-top: 8px; line-height: 1.5; flex-grow: 1; }
-        .task-footer { display: flex; justify-content: space-between; align-items: center; margin-top: 20px; }
-        .task-steps { font-family: 'JetBrains Mono', monospace; font-size: 12px; color: var(--text-dim); }
-        .task-status { display: flex; align-items: center; gap: 6px; font-size: 12px; color: var(--card-color); font-weight: 500; }
-        .task-status::before { content: ''; width: 6px; height: 6px; border-radius: 50%; background: var(--card-color); }
-        /* Features */
-        .features-grid { display: grid; grid-template-columns: repeat(3, 1fr); gap: 24px; }
-        @media (max-width: 900px) { .features-grid { grid-template-columns: 1fr; } }
-        .feature-card {
-            background: var(--bg-card);
-            border: 1px solid var(--border);
-            border-radius: 16px;
-            padding: 32px;
-            display: flex;
-            flex-direction: column;
-        }
-        .feature-icon { font-size: 48px; margin-bottom: 24px; }
-        .feature-title { font-size: 20px; font-weight: 600; margin-bottom: 12px; color: var(--text-primary); }
-        .feature-desc { font-size: 14px; color: var(--text-secondary); line-height: 1.6; margin-bottom: 24px; flex-grow: 1; }
-        .curriculum-bars { margin-bottom: 24px; }
-        .c-bar-row { display: flex; align-items: center; justify-content: space-between; margin-bottom: 8px; font-size: 12px; font-family: 'JetBrains Mono', monospace; }
-        .c-bar-name { color: var(--text-secondary); width: 80px; overflow: hidden; text-overflow: ellipsis; }
-        .c-bar-track { flex-grow: 1; margin: 0 12px; letter-spacing: -2px; color: var(--text-dim); }
-        .c-bar-fill { letter-spacing: -2px; }
-        .c-bar-score { width: 30px; text-align: right; }
-        .generator-input {
-            display: flex; gap: 8px; margin-bottom: 16px;
-        }
-        .gen-seed {
-            background: var(--bg-secondary); border: 1px solid var(--border); color: white;
-            padding: 8px 12px; border-radius: 6px; width: 80px; font-family: 'JetBrains Mono', monospace;
-        }
-        .btn-gen { background: var(--accent-purple); color: white; border: none; padding: 8px 16px; border-radius: 6px; cursor: pointer; font-weight: 600; }
-        .btn-gen:hover { background: #7c3aed; }
-        .gen-result { background: var(--bg-secondary); border: 1px solid var(--border); border-radius: 8px; padding: 16px; display: none; }
-        .gen-badges { display: flex; gap: 8px; margin-bottom: 12px; }
-        .gen-badge { font-size: 10px; padding: 2px 6px; border-radius: 4px; font-weight: 600; text-transform: uppercase; }
-        .gen-diff-bar { height: 4px; background: var(--border); border-radius: 2px; margin: 12px 0; overflow: hidden; }
-        .gen-diff-fill { height: 100%; transition: width 0.3s; }
-        .dual-diagram {
-            background: var(--bg-secondary); border: 1px solid var(--border); border-radius: 8px;
-            padding: 16px; font-family: 'JetBrains Mono', monospace; font-size: 11px; margin-bottom: 24px;
-            color: var(--text-secondary);
-            display: flex; justify-content: space-between; align-items: center;
-        }
-        .agent-box { border: 1px solid var(--border); padding: 8px; border-radius: 4px; background: rgba(0,0,0,0.2); width: 42%; }
-        .agent-arrow { flex-grow: 1; text-align: center; color: var(--accent-green); position: relative; }
-        .agent-arrow::after {
-            content: '→'; position: absolute; top: -10px; left: 50%; transform: translateX(-50%);
-            animation: flowRight 1.5s infinite linear;
-        }
-        @keyframes flowRight { 0% { left: 20%; opacity: 0; } 50% { opacity: 1; } 100% { left: 80%; opacity: 0; } }
-        .btn-green { background: var(--accent-green); color: white; border: none; padding: 8px 16px; border-radius: 6px; cursor: pointer; font-weight: 600; }
-        .btn-green:hover { background: #059669; }
-        .session-info { margin-top: 16px; font-family: 'JetBrains Mono', monospace; font-size: 11px; color: var(--accent-green); display: none; word-break: break-all; }
-        .feature-link { color: var(--accent-blue); font-size: 14px; font-weight: 500; text-decoration: none; margin-top: auto; display: inline-block; }
-        .feature-link:hover { text-decoration: underline; }
-        /* Live Metrics Bar */
-        .metrics-bar-container { background: var(--bg-secondary); border-top: 1px solid var(--border); border-bottom: 1px solid var(--border); padding: 24px 0; }
-        .metrics-grid { display: flex; justify-content: space-between; }
-        .metric-item { text-align: center; flex: 1; border-right: 1px solid var(--border); }
-        .metric-item:last-child { border-right: none; }
-        .metric-val { font-family: 'JetBrains Mono', monospace; font-size: 28px; font-weight: 700; color: var(--accent-blue); }
-        .metric-label { font-size: 12px; color: var(--text-secondary); margin-top: 4px; }
-        @media (max-width: 640px) { .metrics-grid { flex-wrap: wrap; gap: 24px; } .metric-item { min-width: 40%; border: none; } }
-        /* Leaderboard */
-        .leaderboard-card { background: var(--bg-card); border: 1px solid var(--border); border-radius: 16px; overflow: hidden; overflow-x: auto; }
-        table { width: 100%; border-collapse: collapse; text-align: left; }
-        th { background: rgba(255,255,255,0.03); font-size: 11px; text-transform: uppercase; letter-spacing: 1px; color: var(--text-dim); padding: 12px 24px; border-bottom: 1px solid var(--border); }
-        td { padding: 16px 24px; border-bottom: 1px solid var(--border); font-size: 14px; color: var(--text-primary); }
-        tr:last-child td { border-bottom: none; }
-        .rank-1 { color: #fbbf24; font-weight: bold; }
-        .rank-2 { color: #9ca3af; font-weight: bold; }
-        .rank-3 { color: #cd7f32; font-weight: bold; }
-        .lb-score { font-family: 'JetBrains Mono', monospace; font-weight: 600; }
-        .lb-status { font-size: 13px; }
-        /* Quick Start */
-        .tabs { display: flex; gap: 8px; margin-bottom: 16px; }
-        .tab { background: transparent; border: none; color: var(--text-secondary); padding: 8px 16px; border-radius: 6px; cursor: pointer; font-size: 14px; font-weight: 500; font-family: 'Inter', sans-serif;}
-        .tab.active { background: var(--accent-blue); color: white; }
-        .code-block { background: #020408; border: 1px solid var(--border); border-radius: 12px; padding: 24px; position: relative; display: none; overflow-x: auto; }
-        .code-block.active { display: block; }
-        .code-text { font-family: 'JetBrains Mono', monospace; font-size: 13px; line-height: 1.8; color: var(--text-primary); white-space: pre; }
-        .btn-copy { position: absolute; top: 12px; right: 12px; background: rgba(255,255,255,0.1); border: 1px solid var(--border); color: var(--text-secondary); padding: 4px 10px; border-radius: 4px; font-size: 12px; cursor: pointer; }
-        .btn-copy:hover { color: white; background: rgba(255,255,255,0.2); }
-        .code-comment { color: var(--text-dim); }
-        .code-str { color: var(--accent-green); }
-        .code-cmd { color: var(--accent-blue); }
-        .code-url { color: var(--accent-cyan); }
-        .code-key { color: var(--accent-yellow); }
-        /* Training Evidence */
-        .training-grid { display: grid; grid-template-columns: 1fr 1fr; gap: 24px; }
-        @media (max-width: 900px) { .training-grid { grid-template-columns: 1fr; } }
-        .train-card { background: var(--bg-card); border: 1px solid var(--border); border-radius: 16px; padding: 32px; display: flex; flex-direction: column; }
-        .train-title { font-size: 18px; font-weight: 600; margin-bottom: 24px; }
-        .train-row { margin-bottom: 24px; }
-        .train-label { font-size: 12px; color: var(--text-secondary); text-transform: uppercase; letter-spacing: 1px; margin-bottom: 8px; display: flex; align-items: center; justify-content: space-between; }
-        .train-badge { padding: 4px 8px; border-radius: 4px; font-family: 'JetBrains Mono', monospace; font-weight: 600; font-size: 12px; }
-        .train-desc { font-size: 14px; color: var(--text-secondary); line-height: 1.5; margin-left: 28px; }
-        .train-vis { float: left; font-size: 18px; margin-top: 2px; }
-        .train-table-row { display: flex; justify-content: space-between; padding: 12px 0; border-bottom: 1px solid var(--border); }
-        .train-table-row:last-child { border-bottom: none; }
-        .tt-key { font-size: 13px; color: var(--text-secondary); }
-        .tt-val { font-size: 13px; font-family: 'JetBrains Mono', monospace; color: var(--text-primary); }
-        /* Footer */
-        footer { background: var(--bg-secondary); border-top: 1px solid var(--border); padding: 48px 0 32px; margin-top: 80px; }
-        .footer-grid { display: grid; grid-template-columns: 2fr 1fr 1fr; gap: 32px; }
-        @media (max-width: 768px) { .footer-grid { grid-template-columns: 1fr; } }
-        .f-title { font-size: 14px; font-weight: 600; margin-bottom: 16px; color: var(--text-primary); }
-        .f-text { font-size: 13px; color: var(--text-dim); line-height: 1.6; }
-        .f-links { display: flex; flex-direction: column; gap: 12px; }
-        .f-link { font-size: 13px; color: var(--text-secondary); transition: color 0.2s; }
-        .f-link:hover { color: var(--text-primary); }
-        .f-social { display: flex; gap: 16px; margin-top: 16px; }
-        .f-bottom { border-top: 1px solid var(--border); margin-top: 32px; padding-top: 24px; display: flex; justify-content: space-between; font-size: 12px; color: var(--text-dim); }
-        @media (max-width: 640px) { .f-bottom { flex-direction: column; gap: 8px; text-align: center; } }
-    </style>
-</head>
-<body>
-    <canvas id="bg-canvas"></canvas>
-    <nav>
-        <div class="nav-inner">
-            <div class="nav-left">
-                <div class="nav-logo">🚨 ARIA</div>
-                <div class="nav-desc">DevOps Incident Response</div>
-            </div>
-            <div class="nav-center">
-                <div class="status-pill" id="nav-status">
-                    <div class="status-dot"></div>
-                    <span id="nav-status-text">CONNECTING</span>
-                </div>
-            </div>
-            <div class="nav-right">
-                <a href="/docs" class="nav-link">API Docs</a>
-                <a href="/validate" class="nav-link">Validate</a>
-                <a href="/metrics" class="nav-link">Metrics</a>
-                <a href="/leaderboard" class="nav-link">Leaderboard</a>
-            </div>
-        </div>
-    </nav>
-    <main class="container">
-        <section class="hero fade-in">
-            <div class="hero-badge">⚡ OpenEnv Compliant · Meta × PyTorch × HuggingFace</div>
-            <h1 class="hero-title">ARIA</h1>
-            <div class="hero-subtitle">Adaptive Reward & Incident Architecture</div>
-            <p class="hero-desc">The first OpenEnv RL environment for production incident response.<br>7 tasks · 14 actions · Curriculum learning · Dual-agent mode · Trained Llama-3B</p>
-            <div class="hero-buttons">
-                <a href="/docs" class="btn-primary">Try Live API &rarr;</a>
-                <a href="https://github.com/Twilight-13/devops-incident-response" target="_blank" class="btn-secondary">View on GitHub &rarr;</a>
-            </div>
-            <div class="hero-stats">
-                <div class="stat-card">
-                    <div class="stat-val">7</div>
-                    <div class="stat-label">Tasks</div>
-                </div>
-                <div class="stat-card">
-                    <div class="stat-val">14</div>
-                    <div class="stat-label">Actions</div>
-                </div>
-                <div class="stat-card">
-                    <div class="stat-val">&infin;</div>
-                    <div class="stat-label">Scenarios</div>
-                </div>
-                <div class="stat-card">
-                    <div class="stat-val">0.99</div>
-                    <div class="stat-label">Max Score</div>
-                </div>
-            </div>
-        </section>
-        <section class="fade-in">
-            <h2 class="section-title">Environment Tasks</h2>
-            <p class="section-subtitle">Eight scenarios of escalating operational complexity</p>
-            <div class="task-grid" id="task-grid">
-                <!-- Populated by JS -->
-                <div style="grid-column: 1/-1; text-align: center; color: var(--text-dim);">Loading tasks...</div>
-            </div>
-        </section>
-        <section class="fade-in">
-            <h2 class="section-title">ARIA Features</h2>
-            <p class="section-subtitle">What makes this environment unique</p>
-            <div class="features-grid">
-                <!-- Curriculum -->
-                <div class="feature-card">
-                    <div class="feature-icon">🎓</div>
-                    <h3 class="feature-title">Curriculum Engine</h3>
-                    <p class="feature-desc">Tracks agent performance per task with rolling averages. Promotes when mastered (avg > 0.75). Scaffolds with hints when struggling (avg < 0.30). Agents always train at the edge of their capability.</p>
-                    <div class="curriculum-bars" id="curriculum-container">
-                        <!-- Populated by JS -->
-                        <div style="text-align: center; color: var(--text-dim); font-size: 13px;">Loading curriculum data...</div>
-                    </div>
-                    <a href="/curriculum/status" class="feature-link" style="color: var(--accent-blue);">View Status &rarr;</a>
-                </div>
-                <!-- Generator -->
-                <div class="feature-card">
-                    <div class="feature-icon">⚡</div>
-                    <h3 class="feature-title">Incident Generator</h3>
-                    <p class="feature-desc">Procedural incidents from seeds 0–99,999. Six failure modes × eight services × variable noise = infinite unique training scenarios. Same seed always produces the same incident.</p>
-                    <div class="generator-input">
-                        <input type="number" id="gen-seed" class="gen-seed" value="42" min="0" max="99999">
-                        <button class="btn-gen" onclick="generateIncident()">Generate</button>
-                    </div>
-                    <div class="gen-result" id="gen-result">
-                        <div class="gen-badges" id="gen-badges"></div>
-                        <div style="font-size: 13px; font-weight: 600; margin-bottom: 8px;" id="gen-affected"></div>
-                        <div style="font-size: 12px; color: var(--text-secondary); line-height: 1.5;" id="gen-desc"></div>
-                        <div class="gen-diff-bar"><div class="gen-diff-fill" id="gen-diff-fill"></div></div>
-                    </div>
-                    <a href="/generate/preview?seed=42" class="feature-link" style="color: var(--accent-purple);">Try Generator &rarr;</a>
-                </div>
-                <!-- Dual Agent -->
-                <div class="feature-card">
-                    <div class="feature-icon">🤝</div>
-                    <h3 class="feature-title">Dual-Agent Mode</h3>
-                    <p class="feature-desc">Split observability between two agents. Observer sees logs and alerts. Responder sees metrics and dependencies. Neither can solve the incident alone — they must coordinate via share_finding.</p>
-                    <div class="dual-diagram">
-                        <div class="agent-box">
-                            <div style="font-weight:700; margin-bottom:4px;">AGENT A</div>
-                            <div style="color:var(--text-dim); margin-bottom:4px;">Observer</div>
-                            <div>• alerts<br>• logs</div>
-                        </div>
-                        <div class="agent-arrow">share<br>finding</div>
-                        <div class="agent-box">
-                            <div style="font-weight:700; margin-bottom:4px;">AGENT B</div>
-                            <div style="color:var(--text-dim); margin-bottom:4px;">Responder</div>
-                            <div>• metrics<br>• deps</div>
-                        </div>
-                    </div>
-                    <button class="btn-green" onclick="startDualSession()">Start Session</button>
-                    <div class="session-info" id="dual-session-info"></div>
-                    <a href="/multi-agent/sessions" class="feature-link" style="color: var(--accent-green);">View Sessions &rarr;</a>
-                </div>
-            </div>
-        </section>
-    </main>
-    <div class="metrics-bar-container fade-in">
-        <div class="container metrics-grid" id="metrics-grid">
-            <div class="metric-item">
-                <div class="metric-val" id="m-episodes">--</div>
-                <div class="metric-label">Total Episodes</div>
-            </div>
-            <div class="metric-item">
-                <div class="metric-val" id="m-avg">--</div>
-                <div class="metric-label">Avg Score</div>
-            </div>
-            <div class="metric-item">
-                <div class="metric-val" id="m-res">--</div>
-                <div class="metric-label">Resolution Rate</div>
-            </div>
-            <div class="metric-item">
-                <div class="metric-val" id="m-best">--</div>
-                <div class="metric-label">Best Score</div>
-            </div>
-        </div>
-    </div>
-    <main class="container">
-        <section class="fade-in">
-            <h2 class="section-title">🏆 Leaderboard</h2>
-            <p class="section-subtitle">Top episodes by score</p>
-            <div class="leaderboard-card">
-                <table>
-                    <thead>
-                        <tr>
-                            <th>Rank</th>
-                            <th>Task</th>
-                            <th>Score</th>
-                            <th>Steps</th>
-                            <th>Status</th>
-                        </tr>
-                    </thead>
-                    <tbody id="lb-body">
-                        <tr><td colspan="5" style="text-align: center; color: var(--text-dim);">Loading leaderboard...</td></tr>
-                    </tbody>
-                </table>
-            </div>
-        </section>
-        <section class="fade-in">
-            <h2 class="section-title">Quick Start</h2>
-            <p class="section-subtitle">Run your first episode in seconds</p>
-            <div class="tabs">
-                <button class="tab active" onclick="switchTab('curl')">curl</button>
-                <button class="tab" onclick="switchTab('python')">Python</button>
-            </div>
-            <div id="code-curl" class="code-block active">
-                <button class="btn-copy" onclick="copyCode('code-curl-text', this)">Copy</button>
-                <div class="code-text" id="code-curl-text"><span class="code-comment"># 1. Start an incident</span>
-<span class="code-cmd">curl</span> -X POST https://arijit-07-devops-incident-response.hf.space/reset \
-  -H <span class="code-str">"Content-Type: application/json"</span> \
-  -d <span class="code-str">'{<span class="code-key">"task_id"</span>: <span class="code-str">"easy"</span>, <span class="code-key">"seed"</span>: 42}'</span>
-<span class="code-comment"># 2. Read logs (reward: +0.15)</span>
-<span class="code-cmd">curl</span> -X POST https://arijit-07-devops-incident-response.hf.space/step \
-  -H <span class="code-str">"Content-Type: application/json"</span> \
-  -d <span class="code-str">'{<span class="code-key">"action_type"</span>: <span class="code-str">"read_logs"</span>, <span class="code-key">"service"</span>: <span class="code-str">"payment-service"</span>}'</span>
-<span class="code-comment"># 3. Diagnose (reward: +0.30)</span>
-<span class="code-cmd">curl</span> -X POST https://arijit-07-devops-incident-response.hf.space/step \
-  -H <span class="code-str">"Content-Type: application/json"</span> \
-  -d <span class="code-str">'{<span class="code-key">"action_type"</span>: <span class="code-str">"diagnose"</span>, <span class="code-key">"root_cause"</span>: <span class="code-str">"memory leak in payment-service"</span>}'</span>
-<span class="code-comment"># 4. Fix it (reward: +0.40)</span>
-<span class="code-cmd">curl</span> -X POST https://arijit-07-devops-incident-response.hf.space/step \
-  -H <span class="code-str">"Content-Type: application/json"</span> \
-  -d <span class="code-str">'{<span class="code-key">"action_type"</span>: <span class="code-str">"restart_service"</span>, <span class="code-key">"service"</span>: <span class="code-str">"payment-service"</span>}'</span>
-<span class="code-comment"># Score: ~0.94 ✅</span></div>
-            </div>
-            <div id="code-python" class="code-block">
-                <button class="btn-copy" onclick="copyCode('code-py-text', this)">Copy</button>
-                <div class="code-text" id="code-py-text"><span class="code-cmd">import</span> requests
-BASE = <span class="code-str">"https://arijit-07-devops-incident-response.hf.space"</span>
-<span class="code-comment"># Start episode</span>
-obs = requests.post(<span class="code-url">f"{BASE}/reset"</span>,
-    json={<span class="code-key">"task_id"</span>: <span class="code-str">"easy"</span>, <span class="code-key">"seed"</span>: 42}).json()
-<span class="code-comment"># Take action</span>
-result = requests.post(<span class="code-url">f"{BASE}/step"</span>,
-    json={<span class="code-key">"action_type"</span>: <span class="code-str">"read_logs"</span>,
-          <span class="code-key">"service"</span>: <span class="code-str">"payment-service"</span>}).json()
-print(<span class="code-url">f"Reward: {result['reward']}"</span>)  <span class="code-comment"># 0.15</span></div>
-            </div>
-        </section>
-        <section class="fade-in">
-            <h2 class="section-title">🧠 Training Evidence</h2>
-            <p class="section-subtitle">Llama-3.2-3B fine-tuned with GRPO on this environment</p>
-            <div class="training-grid">
-                <div class="train-card">
-                    <h3 class="train-title">Behavioral Change</h3>
-                    <div class="train-row">
-                        <div class="train-label">
-                            <span>Base Llama-3B</span>
-                            <span class="train-badge" style="background: rgba(239, 68, 68, 0.2); color: #ef4444;">0.000</span>
-                        </div>
-                        <div class="train-vis" style="color: #ef4444;">❌</div>
-                        <div class="train-desc">Jumps straight to diagnose without reading logs → triggers blind remediation penalty (-0.10)</div>
-                    </div>
-                    <div class="train-row">
-                        <div class="train-label">
-                            <span>ARIA Fine-tuned (140 episodes)</span>
-                            <span class="train-badge" style="background: rgba(16, 185, 129, 0.2); color: #10b981;">0.150</span>
-                        </div>
-                        <div class="train-vis" style="color: #10b981;">✅</div>
-                        <div class="train-desc">Consistently reads logs on correct failing service first → information gathering before acting</div>
-                    </div>
-                    <a href="https://huggingface.co/Arijit-07/aria-devops-llama3b" target="_blank" class="feature-link" style="color: var(--accent-blue);">Model weights &rarr;</a>
-                </div>
-                <div class="train-card">
-                    <h3 class="train-title">Training Setup</h3>
-                    <div class="train-table-row">
-                        <div class="tt-key">Algorithm</div><div class="tt-val">GRPO</div>
-                    </div>
-                    <div class="train-table-row">
-                        <div class="tt-key">Framework</div><div class="tt-val">Unsloth + HuggingFace TRL</div>
-                    </div>
-                    <div class="train-table-row">
-                        <div class="tt-key">Base Model</div><div class="tt-val">Llama-3.2-3B-Instruct</div>
-                    </div>
-                    <div class="train-table-row">
-                        <div class="tt-key">LoRA Rank</div><div class="tt-val">16 (alpha: 32)</div>
-                    </div>
-                    <div class="train-table-row">
-                        <div class="tt-key">Episodes</div><div class="tt-val">140 (easy + medium)</div>
-                    </div>
-                    <div class="train-table-row">
-                        <div class="tt-key">GPU</div><div class="tt-val">Kaggle T4 x2</div>
-                    </div>
-                    <div class="train-table-row">
-                        <div class="tt-key">Group Size</div><div class="tt-val">6 completions/step</div>
-                    </div>
-                    <div class="train-table-row" style="border-bottom: none;">
-                        <div class="tt-key">KL Penalty</div><div class="tt-val">0.05</div>
-                    </div>
-                </div>
-            </div>
-        </section>
-    </main>
-    <footer>
-        <div class="container">
-            <div class="footer-grid">
-                <div>
-                    <div style="font-size: 20px; font-weight: 700; color: var(--accent-blue); margin-bottom: 8px;">🚨 ARIA</div>
-                    <div class="f-text">DevOps Incident Response<br>OpenEnv-compliant RL environment</div>
-                    <div class="f-social">
-                        <a href="https://github.com/Twilight-13/devops-incident-response" target="_blank" class="f-link">GitHub</a>
-                        <a href="https://huggingface.co/Arijit-07/aria-devops-llama3b" target="_blank" class="f-link">HuggingFace Model</a>
-                    </div>
-                </div>
-                <div>
-                    <div class="f-title">Resources</div>
-                    <div class="f-links">
-                        <a href="/docs" class="f-link">Live API Docs</a>
-                        <a href="/validate" class="f-link">Validate</a>
-                        <a href="/metrics" class="f-link">Metrics</a>
-                        <a href="/leaderboard" class="f-link">Leaderboard</a>
-                        <a href="/curriculum/status" class="f-link">Curriculum</a>
-                        <a href="/about" class="f-link">About</a>
-                    </div>
-                </div>
-                <div>
-                    <div class="f-title">Built for</div>
-                    <div class="f-text">Meta × PyTorch × HuggingFace<br>OpenEnv Hackathon Finals<br>Bangalore, April 2026</div>
-                    <div class="f-text" style="font-size: 12px; margin-top: 16px;">Solo project by Arijit</div>
-                </div>
-            </div>
-            <div class="f-bottom">
-                <div>&copy; 2026 ARIA — Apache 2.0 License</div>
-                <div>Can your agent handle a SEV-1 at 3am?</div>
-            </div>
-        </div>
-    </footer>
-    <script>
-        // 1. Canvas Particles
-        const canvas = document.getElementById('bg-canvas');
-        const ctx = canvas.getContext('2d');
-        let width, height;
-        let particles = [];
-        function resize() {
-            width = window.innerWidth;
-            height = window.innerHeight;
-            canvas.width = width;
-            canvas.height = height;
-        }
-        window.addEventListener('resize', resize);
-        resize();
-        for(let i=0; i<60; i++) {
-            particles.push({
-                x: Math.random() * width,
-                y: Math.random() * height,
-                vx: (Math.random() - 0.5) * 0.5,
-                vy: (Math.random() - 0.5) * 0.5
-            });
-        }
-        function drawParticles() {
-            ctx.clearRect(0, 0, width, height);
-            ctx.fillStyle = 'rgba(59, 130, 246, 0.1)';
-            ctx.strokeStyle = 'rgba(59, 130, 246, 0.05)';
-            for(let i=0; i<particles.length; i++) {
-                let p = particles[i];
-                p.x += p.vx; p.y += p.vy;
-                if(p.x < 0 || p.x > width) p.vx *= -1;
-                if(p.y < 0 || p.y > height) p.vy *= -1;
-                ctx.beginPath();
-                ctx.arc(p.x, p.y, 2, 0, Math.PI * 2);
-                ctx.fill();
-                for(let j=i+1; j<particles.length; j++) {
-                    let p2 = particles[j];
-                    let dx = p.x - p2.x, dy = p.y - p2.y;
-                    let dist = Math.sqrt(dx*dx + dy*dy);
-                    if(dist < 150) {
-                        ctx.beginPath();
-                        ctx.moveTo(p.x, p.y);
-                        ctx.lineTo(p2.x, p2.y);
-                        ctx.stroke();
-                    }
-                }
-            }
-            requestAnimationFrame(drawParticles);
-        }
-        drawParticles();
-        // 2. Intersection Observer for Fade-in
-        const observer = new IntersectionObserver((entries) => {
-            entries.forEach(entry => {
-                if(entry.isIntersecting) {
-                    entry.target.classList.add('visible');
-                }
-            });
-        }, { threshold: 0.1 });
-        document.querySelectorAll('.fade-in').forEach(el => observer.observe(el));
-        // 3. Status Check
-        fetch('/health')
-            .then(r => r.json())
-            .then(data => {
-                if(data.status === 'ok') {
-                    document.getElementById('nav-status-text').innerText = 'LIVE';
-                }
-            }).catch(e => console.error(e));
-        // 4. Load Tasks
-        const taskConfig = {
-            'easy': {icon: '💻', color: '#10b981', badge: 'EASY'},
-            'medium': {icon: '⚡', color: '#f59e0b', badge: 'MEDIUM'},
-            'hard': {icon: '🔥', color: '#ef4444', badge: 'HARD'},
-            'bonus': {icon: '💥', color: '#8b5cf6', badge: 'EXPERT'},
-            'security': {icon: '🛡️', color: '#06b6d4', badge: 'SECURITY'},
-            'database': {icon: '🗄️', color: '#f97316', badge: 'DATABASE'},
-            'failover': {icon: '🌐', color: '#6366f1', badge: 'FAILOVER'},
-            'generated': {icon: '✨', color: '#ec4899', badge: 'DYNAMIC'}
-        };
-        fetch('/tasks')
-            .then(r => r.json())
-            .then(data => {
-                const grid = document.getElementById('task-grid');
-                grid.innerHTML = '';
-                data.tasks.forEach(t => {
-                    const cfg = taskConfig[t.id] || taskConfig['easy'];
-                    grid.innerHTML += `
-                        <div class="task-card" style="--card-color: ${cfg.color}; --card-bg: ${cfg.color}20;">
-                            <div class="task-header">
-                                <div class="task-icon">${cfg.icon}</div>
-                                <div class="task-badge">${cfg.badge}</div>
-                            </div>
-                            <div class="task-name">${t.name}</div>
-                            <div class="task-desc">${t.description}</div>
-                            <div class="task-footer">
-                                <div class="task-steps">Max steps: ${t.max_steps}</div>
-                                <div class="task-status">Ready</div>
-                            </div>
-                        </div>
-                    `;
-                });
-            }).catch(e => console.error(e));
-        // 5. Curriculum
-        fetch('/curriculum/status')
-            .then(r => r.json())
-            .then(data => {
-                const container = document.getElementById('curriculum-container');
-                if(data.total_episodes_recorded === 0) {
-                    container.innerHTML = '<div style="text-align: center; color: var(--text-dim); font-size: 13px;">No episodes recorded yet</div>';
-                    return;
-                }
-                container.innerHTML = '';
-                const tasks = data.tasks || {};
-                Object.keys(tasks).slice(0, 4).forEach(k => {
-                    let avg = tasks[k].rolling_avg;
-                    let w = Math.max(5, avg * 100);
-                    let color = avg < 0.3 ? 'var(--accent-red)' : (avg < 0.6 ? 'var(--accent-yellow)' : 'var(--accent-green)');
-                    let blocks = Math.round(w / 10);
-                    let fillStr = '█'.repeat(blocks);
-                    let trackStr = '░'.repeat(10 - blocks);
-                    container.innerHTML += `
-                        <div class="c-bar-row">
-                            <div class="c-bar-name">${k}</div>
-                            <div class="c-bar-track" style="color: ${color}"><span class="c-bar-fill">${fillStr}</span><span style="opacity:0.3">${trackStr}</span></div>
-                            <div class="c-bar-score">${avg.toFixed(2)}</div>
-                        </div>
-                    `;
-                });
-            }).catch(e => {
-                document.getElementById('curriculum-container').innerHTML = '<div style="text-align: center; color: var(--text-dim); font-size: 13px;">--</div>';
-                console.error(e);
-            });
-        // 6. Generator
-        window.generateIncident = function() {
-            const seed = document.getElementById('gen-seed').value || 42;
-            fetch(`/generate/preview?seed=${seed}`)
-                .then(r => r.json())
-                .then(data => {
-                    const cMap = {oom: '#ef4444', cascade: '#f59e0b', corruption: '#8b5cf6', security: '#06b6d4', database: '#f97316', network_partition: '#6366f1'};
-                    const sMap = {sev1: '#ef4444', sev2: '#f59e0b', sev3: '#10b981'};
-                    const fColor = cMap[data.failure_mode] || '#3b82f6';
-                    document.getElementById('gen-badges').innerHTML = `
-                        <span class="gen-badge" style="background:${fColor}20; color:${fColor}">${data.failure_mode}</span>
-                        <span class="gen-badge" style="background:${sMap[data.severity] || '#3b82f6'}20; color:${sMap[data.severity] || '#3b82f6'}">${data.severity}</span>
-                        <span class="gen-badge" style="background:rgba(255,255,255,0.1); color:var(--text-secondary)">${data.incident_id}</span>
-                    `;
-                    document.getElementById('gen-affected').innerText = `Affected: ${data.affected_service}`;
-                    document.getElementById('gen-desc').innerText = data.description;
-                    let dColor = data.difficulty_score < 0.4 ? '#10b981' : (data.difficulty_score < 0.7 ? '#f59e0b' : '#ef4444');
-                    document.getElementById('gen-diff-fill').style.width = `${data.difficulty_score * 100}%`;
-                    document.getElementById('gen-diff-fill').style.background = dColor;
-                    document.getElementById('gen-result').style.display = 'block';
-                }).catch(e => console.error(e));
-        };
-        // 7. Dual Agent
-        window.startDualSession = function() {
-            fetch('/multi-agent/reset', {
-                method: 'POST',
-                headers: {'Content-Type': 'application/json'},
-                body: JSON.stringify({task_id: "easy", seed: 42})
-            }).then(r => r.json())
-              .then(data => {
-                  const info = document.getElementById('dual-session-info');
-                  info.innerHTML = `Session: ${data.session_id}<br><br>Agent A (POST): /multi-agent/step/a/${data.session_id}<br>Agent B (POST): /multi-agent/step/b/${data.session_id}`;
-                  info.style.display = 'block';
-              }).catch(e => console.error(e));
-        };
-        // 8. Live Metrics
-        function loadMetrics() {
-            fetch('/metrics')
-                .then(r => r.json())
-                .then(data => {
-                    document.getElementById('m-episodes').innerText = data.total_episodes || 0;
-                    document.getElementById('m-avg').innerText = (data.overall_avg_score || 0).toFixed(3);
-                    // calculate overall resolution rate
-                    if (data.by_task) {
-                        let totalRes = 0, totalCount = 0;
-                        let bestScore = 0;
-                        Object.values(data.by_task).forEach(t => {
-                            totalRes += t.resolution_rate * t.count;
-                            totalCount += t.count;
-                            if (t.max_score > bestScore) bestScore = t.max_score;
-                        });
-                        let resRate = totalCount > 0 ? (totalRes / totalCount) * 100 : 0;
-                        document.getElementById('m-res').innerText = resRate.toFixed(1) + '%';
-                        document.getElementById('m-best').innerText = bestScore.toFixed(3);
-                    }
-                }).catch(e => console.error(e));
-        }
-        loadMetrics();
-        setInterval(loadMetrics, 30000);
-        // 9. Leaderboard
-        fetch('/leaderboard')
-            .then(r => r.json())
-            .then(data => {
-                const tbody = document.getElementById('lb-body');
-                if(!data.leaderboard || data.leaderboard.length === 0) {
-                    tbody.innerHTML = '<tr><td colspan="5" style="text-align: center; color: var(--text-dim);">No episodes yet. Try POST /reset to start your first episode.</td></tr>';
-                    return;
-                }
-                tbody.innerHTML = '';
-                data.leaderboard.forEach(row => {
-                    let rClass = row.rank <= 3 ? `rank-${row.rank}` : '';
-                    let sColor = row.score >= 0.8 ? '#10b981' : (row.score >= 0.5 ? '#f59e0b' : '#ef4444');
-                    let statusHtml = row.score > 0.5 ? '<span style="color:#10b981">✅ Resolved</span>' : '<span style="color:#ef4444">❌ Failed</span>'; // Simple heuristic if resolution missing
-                    tbody.innerHTML += `
-                        <tr>
-                            <td class="${rClass}">#${row.rank}</td>
-                            <td>${row.task_id}</td>
-                            <td class="lb-score" style="color: ${sColor}">${row.score.toFixed(4)}</td>
-                            <td>${row.steps}</td>
-                            <td class="lb-status">${statusHtml}</td>
-                        </tr>
-                    `;
-                });
-            }).catch(e => console.error(e));
-        // 10. Tabs
-        window.switchTab = function(type) {
-            document.querySelectorAll('.tab').forEach(t => t.classList.remove('active'));
-            document.querySelectorAll('.code-block').forEach(c => c.classList.remove('active'));
-            if(type === 'curl') {
-                document.querySelectorAll('.tab')[0].classList.add('active');
-                document.getElementById('code-curl').classList.add('active');
-            } else {
-                document.querySelectorAll('.tab')[1].classList.add('active');
-                document.getElementById('code-python').classList.add('active');
-            }
-        };
-        window.copyCode = function(id, btn) {
-            const text = document.getElementById(id).innerText;
-            navigator.clipboard.writeText(text).then(() => {
-                let old = btn.innerText;
-                btn.innerText = 'Copied ✓';
-                setTimeout(() => btn.innerText = old, 2000);
-            });
-        };
-    </script>
-</body>
-</html>
-"""
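
Review note on the removed `loadMetrics` block: it derives the overall resolution rate as an episode-weighted mean over `by_task`, not a plain average of per-task rates. A minimal sketch of that calculation as a pure function is below. The field names (`resolution_rate`, `count`, `max_score`) are taken from the removed code; the function name `summarizeMetrics` and the defensive `Number(...) || 0` coercions are assumptions added for illustration, not part of the original.

```javascript
// Hypothetical helper: mirrors the weighting used in the removed dashboard script.
// byTask is assumed to look like { easy: { resolution_rate, count, max_score }, ... }.
function summarizeMetrics(byTask) {
  let resolved = 0;   // sum of resolution_rate * count across tasks
  let episodes = 0;   // total episode count across tasks
  let bestScore = 0;  // highest max_score seen on any task

  for (const task of Object.values(byTask || {})) {
    const count = Number(task.count) || 0;
    resolved += (Number(task.resolution_rate) || 0) * count;
    episodes += count;
    bestScore = Math.max(bestScore, Number(task.max_score) || 0);
  }

  return {
    // Guard against division by zero when no episodes are recorded.
    resolutionPct: episodes > 0 ? (resolved / episodes) * 100 : 0,
    bestScore,
  };
}
```

Weighting by `count` keeps a task with many episodes from being drowned out by a task that has only been run once.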

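Review note on the removed curriculum bar: it computes `blocks = Math.round(w / 10)` and then calls `'░'.repeat(10 - blocks)`. `String.prototype.repeat` throws a `RangeError` for a negative count, so a `rolling_avg` above 1.0 would break the render. The sketch below shows one way to clamp the value first. The name `renderBar` and the `width` parameter are hypothetical; only the 5% floor and the ten-cell track come from the removed code.

```javascript
// Hypothetical helper: builds the text progress bar with the percentage clamped to [5, 100].
function renderBar(avg, width = 10) {
  // Treat missing or non-numeric averages as zero.
  const safe = Number.isFinite(avg) ? avg : 0;
  // Keep the 5% floor from the removed code and add a 100% ceiling.
  const pct = Math.min(100, Math.max(5, safe * 100));
  const blocks = Math.round((pct * width) / 100);
  return '█'.repeat(blocks) + '░'.repeat(width - blocks);
}
```

With the ceiling in place, `width - blocks` can never go negative, so the call to `repeat` stays valid for any input.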
update_dashboard.py DELETED

@@ -1,658 +0,0 @@
-import re
-html_content = r"""<!DOCTYPE html>
-<html lang="en">
-<head>
-    <meta charset="UTF-8">
-    <meta name="viewport" content="width=device-width, initial-scale=1.0">
-    <title>ARIA - DevOps Incident Response</title>
-    <link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&family=JetBrains+Mono:wght@400;700&display=swap" rel="stylesheet">
-    <style>
-        :root {
-            --bg: #060914;
-            --bg-card: #111827;
-            --border: #1f2937;
-            --blue: #3b82f6;
-            --cyan: #06b6d4;
-            --green: #10b981;
-            --red: #ef4444;
-            --yellow: #f59e0b;
-            --purple: #8b5cf6;
-            --text: #f9fafb;
-            --muted: #9ca3af;
-        }
-        * { margin: 0; padding: 0; box-sizing: border-box; }
-        body {
-            background: var(--bg);
-            color: var(--text);
-            font-family: 'Inter', sans-serif;
-            min-height: 100vh;
-            overflow-x: hidden;
-        }
-        html { scroll-behavior: smooth; }
-        a { text-decoration: none; color: inherit; }
-        /* Animation */
-        @keyframes fadeInUp {
-            from { opacity: 0; transform: translateY(20px); }
-            to { opacity: 1; transform: translateY(0); }
-        }
-        .fade-in {
-            opacity: 0;
-            transform: translateY(20px);
-            transition: opacity 0.6s ease-out, transform 0.6s ease-out;
-        }
-        .fade-in.visible { opacity: 1; transform: translateY(0); }
-        /* Canvas Background */
-        #bg-canvas {
-            position: fixed;
-            top: 0;
-            left: 0;
-            width: 100vw;
-            height: 100vh;
-            z-index: 0;
-            pointer-events: none;
-        }
-        .container {
-            max-width: 1280px;
-            margin: 0 auto;
-            padding: 0 24px;
-            position: relative;
-            z-index: 1;
-        }
-        section { padding: 80px 0; }
-        /* Navbar */
-        nav {
-            position: fixed;
-            top: 0;
-            width: 100%;
-            height: 64px;
-            background: rgba(6, 9, 20, 0.8);
-            backdrop-filter: blur(20px);
-            border-bottom: 1px solid var(--border);
-            z-index: 100;
-            display: flex;
-            align-items: center;
-        }
-        .nav-inner {
-            display: flex;
-            justify-content: space-between;
-            align-items: center;
-            width: 100%;
-            max-width: 1280px;
-            margin: 0 auto;
-            padding: 0 24px;
-        }
-        .nav-left { display: flex; align-items: center; gap: 8px; }
-        .nav-logo { font-size: 20px; font-weight: 700; color: var(--blue); }
-        .nav-desc { font-size: 13px; color: var(--muted); display: none; }
-        @media (min-width: 768px) { .nav-desc { display: block; } }
-        .nav-center { display: flex; justify-content: center; flex: 1; }
-        .status-pill {
-            display: flex;
-            align-items: center;
-            gap: 6px;
-            background: rgba(16, 185, 129, 0.2);
-            border: 1px solid var(--green);
-            color: var(--green);
-            padding: 4px 12px;
-            border-radius: 999px;
-            font-size: 12px;
-            font-weight: 600;
-        }
-        .status-dot {
-            width: 6px;
-            height: 6px;
-            background: var(--green);
-            border-radius: 50%;
-            animation: pulse 2s infinite;
-        }
-        @keyframes pulse { 0% { transform: scale(1); opacity: 1; } 50% { transform: scale(1.5); opacity: 0.5; } 100% { transform: scale(1); opacity: 1; } }
-        .nav-right { display: flex; gap: 24px; }
-        .nav-link { font-size: 13px; color: var(--muted); transition: color 0.2s; }
-        .nav-link:hover { color: var(--text); }
-        /* Hero */
-        .hero { padding: 120px 0 80px; text-align: center; }
-        .hero-badge {
-            background: rgba(59, 130, 246, 0.1);
-            border: 1px solid rgba(59, 130, 246, 0.3);
-            border-radius: 999px;
-            padding: 6px 16px;
-            font-size: 12px;
-            color: var(--blue);
-            display: inline-block;
-            margin-bottom: 24px;
-        }
-        .hero-title {
-            font-size: clamp(72px, 12vw, 140px);
-            font-weight: 700;
-            background: linear-gradient(135deg, var(--blue) 0%, var(--cyan) 50%, var(--purple) 100%);
-            -webkit-background-clip: text;
-            -webkit-text-fill-color: transparent;
-            line-height: 1;
-            letter-spacing: -4px;
-        }
-        .hero-subtitle { font-size: 20px; color: var(--muted); margin-top: 16px; font-weight: 400; }
-        .hero-desc { font-size: 15px; color: #4b5563; margin-top: 12px; line-height: 1.6; max-width: 600px; margin-inline: auto; }
-        .hero-buttons { margin-top: 40px; display: flex; justify-content: center; gap: 16px; flex-wrap: wrap; }
-        .btn-primary, .btn-secondary {
-            padding: 14px 28px; border-radius: 8px; font-weight: 600; font-size: 15px; transition: all 0.2s; cursor: pointer; display: inline-block;
-        }
-        .btn-primary { background: var(--blue); color: white; border: none; }
-        .btn-primary:hover { background: #2563eb; transform: translateY(-2px); }
-        .btn-secondary { background: transparent; border: 1px solid var(--border); color: var(--muted); }
-        .btn-secondary:hover { border-color: var(--blue); color: white; transform: translateY(-2px); }
-        .hero-stats { margin-top: 64px; display: grid; grid-template-columns: repeat(auto-fit, minmax(200px, 1fr)); gap: 16px; }
-        .stat-card { background: var(--bg-card); border: 1px solid var(--border); border-radius: 12px; padding: 20px 32px; text-align: center; }
-        .stat-val { font-family: 'JetBrains Mono', monospace; font-size: 32px; font-weight: 700; color: var(--blue); }
-        .stat-label { font-size: 13px; color: var(--muted); margin-top: 4px; }
-        .section-title { font-size: 24px; font-weight: 600; margin-bottom: 8px; }
-        .section-subtitle { font-size: 15px; color: var(--muted); margin-bottom: 32px; }
-        /* Tasks Grid */
-        .task-grid { display: grid; grid-template-columns: repeat(4, 1fr); gap: 16px; }
-        @media (max-width: 1024px) { .task-grid { grid-template-columns: repeat(2, 1fr); } }
-        @media (max-width: 640px) { .task-grid { grid-template-columns: 1fr; } }
-        .task-card {
-            background: var(--bg-card); border: 1px solid var(--border); border-radius: 16px; padding: 24px;
-            transition: all 0.3s; cursor: pointer; position: relative; overflow: hidden; display: flex; flex-direction: column;
-        }
-        .task-card::before { content: ''; position: absolute; top: 0; left: 0; right: 0; height: 2px; background: transparent; transition: all 0.3s; }
-        .task-card:hover { transform: translateY(-4px); box-shadow: 0 20px 40px rgba(0,0,0,0.4); }
-        .task-card:hover::before { background: var(--card-color, var(--border)); }
-        .task-header { display: flex; justify-content: space-between; align-items: flex-start; }
-        .task-icon { font-size: 32px; }
-        .task-badge { font-size: 11px; font-weight: 700; padding: 4px 8px; border-radius: 6px; background: var(--card-bg); color: var(--card-color); letter-spacing: 0.5px; }
-        .task-name { font-size: 16px; font-weight: 600; margin-top: 16px; }
-        .task-desc { font-size: 13px; color: var(--muted); margin-top: 8px; line-height: 1.5; flex-grow: 1; }
-        .task-footer { display: flex; justify-content: space-between; align-items: center; margin-top: 20px; }
-        .task-steps { font-family: 'JetBrains Mono', monospace; font-size: 12px; color: #4b5563; }
-        .task-status { display: flex; align-items: center; gap: 6px; font-size: 12px; color: var(--card-color); font-weight: 500; }
-        .task-status::before { content: ''; width: 6px; height: 6px; border-radius: 50%; background: var(--card-color); }
-        /* Features */
-        .features-grid { display: grid; grid-template-columns: repeat(3, 1fr); gap: 24px; }
-        @media (max-width: 900px) { .features-grid { grid-template-columns: 1fr; } }
-        .feature-card { background: var(--bg-card); border: 1px solid var(--border); border-radius: 16px; padding: 32px; display: flex; flex-direction: column; }
-        .feature-icon { font-size: 48px; margin-bottom: 24px; }
-        .feature-title { font-size: 20px; font-weight: 600; margin-bottom: 12px; color: var(--text); }
-        .feature-desc { font-size: 14px; color: var(--muted); line-height: 1.6; margin-bottom: 24px; flex-grow: 1; }
-        .c-bar-row { display: flex; align-items: center; justify-content: space-between; margin-bottom: 8px; font-size: 12px; font-family: 'JetBrains Mono', monospace; }
-        .c-bar-name { color: var(--muted); width: 80px; overflow: hidden; text-overflow: ellipsis; }
-        .c-bar-track { flex-grow: 1; margin: 0 12px; letter-spacing: -2px; color: #4b5563; }
-        .c-bar-score { width: 30px; text-align: right; }
-        .generator-input { display: flex; gap: 8px; margin-bottom: 16px; }
-        .gen-seed { background: #0d1117; border: 1px solid var(--border); color: white; padding: 8px 12px; border-radius: 6px; width: 80px; font-family: 'JetBrains Mono', monospace; }
-        .btn-gen { background: var(--purple); color: white; border: none; padding: 8px 16px; border-radius: 6px; cursor: pointer; font-weight: 600; }
-        .gen-result { background: #0d1117; border: 1px solid var(--border); border-radius: 8px; padding: 16px; display: none; }
-        .gen-badges { display: flex; gap: 8px; margin-bottom: 12px; }
-        .gen-badge { font-size: 10px; padding: 2px 6px; border-radius: 4px; font-weight: 600; text-transform: uppercase; }
-        .gen-diff-bar { height: 4px; background: var(--border); border-radius: 2px; margin: 12px 0; overflow: hidden; }
-        .dual-diagram { background: #0d1117; border: 1px solid var(--border); border-radius: 8px; padding: 16px; font-family: 'JetBrains Mono', monospace; font-size: 11px; margin-bottom: 24px; color: var(--muted); display: flex; justify-content: space-between; align-items: center; }
-        .agent-box { border: 1px solid var(--border); padding: 8px; border-radius: 4px; background: rgba(0,0,0,0.2); width: 42%; }
-        .agent-arrow { flex-grow: 1; text-align: center; color: var(--green); position: relative; }
-        .agent-arrow::after { content: '→'; position: absolute; top: -10px; left: 50%; transform: translateX(-50%); animation: flowRight 1.5s infinite linear; }
-        @keyframes flowRight { 0% { left: 20%; opacity: 0; } 50% { opacity: 1; } 100% { left: 80%; opacity: 0; } }
-        .btn-green { background: var(--green); color: white; border: none; padding: 8px 16px; border-radius: 6px; cursor: pointer; font-weight: 600; }
-        .feature-link { color: var(--blue); font-size: 14px; font-weight: 500; margin-top: 16px; display: inline-block; }
-        /* Live Metrics */
-        .metrics-bar { background: #0d1117; border-top: 1px solid var(--border); border-bottom: 1px solid var(--border); padding: 24px 0; }
-        .metrics-grid { display: flex; justify-content: space-between; }
-        .metric-item { text-align: center; flex: 1; border-right: 1px solid var(--border); }
-        .metric-item:last-child { border-right: none; }
-        .metric-val { font-family: 'JetBrains Mono', monospace; font-size: 28px; font-weight: 700; color: var(--blue); }
-        .metric-label { font-size: 12px; color: var(--muted); margin-top: 4px; }
-        @media (max-width: 640px) { .metrics-grid { flex-wrap: wrap; gap: 24px; } .metric-item { min-width: 40%; border: none; } }
-        /* Leaderboard */
-        .leaderboard-card { background: var(--bg-card); border: 1px solid var(--border); border-radius: 16px; overflow-x: auto; }
-        table { width: 100%; border-collapse: collapse; text-align: left; }
-        th { background: rgba(255,255,255,0.03); font-size: 11px; text-transform: uppercase; letter-spacing: 1px; color: #4b5563; padding: 12px 24px; border-bottom: 1px solid var(--border); }
-        td { padding: 16px 24px; border-bottom: 1px solid var(--border); font-size: 14px; }
-        tr:last-child td { border-bottom: none; }
-        .lb-score { font-family: 'JetBrains Mono', monospace; font-weight: 600; }
-        /* Quick Start */
-        .tabs { display: flex; gap: 8px; margin-bottom: 16px; }
-        .tab { background: transparent; border: none; color: var(--muted); padding: 8px 16px; border-radius: 6px; cursor: pointer; font-size: 14px; font-weight: 500; font-family: 'Inter', sans-serif;}
-        .tab.active { background: var(--blue); color: white; }
-        .code-block { background: #020408; border: 1px solid var(--border); border-radius: 12px; padding: 24px; position: relative; display: none; overflow-x: auto; }
-        .code-block.active { display: block; }
-        .code-text { font-family: 'JetBrains Mono', monospace; font-size: 13px; line-height: 1.8; color: var(--text); white-space: pre; }
-        .btn-copy { position: absolute; top: 12px; right: 12px; background: rgba(255,255,255,0.1); border: 1px solid var(--border); color: var(--muted); padding: 4px 10px; border-radius: 4px; font-size: 12px; cursor: pointer; }
-        .c-com { color: #4b5563; } .c-str { color: var(--green); } .c-cmd { color: var(--blue); } .c-url { color: var(--cyan); } .c-key { color: var(--yellow); }
-        /* Training Evidence */
-        .training-grid { display: grid; grid-template-columns: 1fr 1fr; gap: 24px; }
-        @media (max-width: 900px) { .training-grid { grid-template-columns: 1fr; } }
-        .train-card { background: var(--bg-card); border: 1px solid var(--border); border-radius: 16px; padding: 32px; display: flex; flex-direction: column; }
-        .train-title { font-size: 18px; font-weight: 600; margin-bottom: 24px; }
-        .train-row { margin-bottom: 24px; }
-        .train-label { font-size: 12px; color: var(--muted); margin-bottom: 8px; display: flex; justify-content: space-between; align-items: center; }
-        .train-badge { padding: 4px 8px; border-radius: 4px; font-family: 'JetBrains Mono', monospace; font-weight: 600; }
-        .train-desc { font-size: 14px; color: var(--muted); line-height: 1.5; margin-left: 28px; }
-        .train-vis { float: left; font-size: 18px; margin-top: 2px; }
-        .tt-row { display: flex; justify-content: space-between; padding: 12px 0; border-bottom: 1px solid var(--border); }
-        .tt-row:last-child { border-bottom: none; }
-        .tt-key { font-size: 13px; color: var(--muted); }
-        .tt-val { font-size: 13px; font-family: 'JetBrains Mono', monospace; color: var(--text); }
-        /* Footer */
-        footer { background: #0d1117; border-top: 1px solid var(--border); padding: 48px 0 32px; margin-top: 80px; }
-        .footer-grid { display: grid; grid-template-columns: 2fr 1fr 1fr; gap: 32px; }
-        @media (max-width: 768px) { .footer-grid { grid-template-columns: 1fr; } }
-        .f-title { font-size: 14px; font-weight: 600; margin-bottom: 16px; }
-        .f-text { font-size: 13px; color: #4b5563; line-height: 1.6; }
-        .f-links { display: flex; flex-direction: column; gap: 12px; }
-        .f-link { font-size: 13px; color: var(--muted); transition: color 0.2s; }
-        .f-link:hover { color: var(--text); }
-        .f-bottom { border-top: 1px solid var(--border); margin-top: 32px; padding-top: 24px; display: flex; justify-content: space-between; font-size: 12px; color: #4b5563; }
-    </style>
-</head>
-<body>
-    <canvas id="bg-canvas"></canvas>
-    <nav>
-        <div class="nav-inner">
-            <div class="nav-left">
-                <div class="nav-logo">🚨 ARIA</div>
-                <div class="nav-desc">DevOps Incident Response</div>
-            </div>
-            <div class="nav-center">
-                <div class="status-pill">
-                    <div class="status-dot"></div>
-                    <span id="nav-status-text">CONNECTING</span>
-                </div>
-            </div>
-            <div class="nav-right">
-                <a href="/docs" class="nav-link">API Docs</a>
-                <a href="/validate" class="nav-link">Validate</a>
-                <a href="/metrics" class="nav-link">Metrics</a>
-                <a href="/leaderboard" class="nav-link">Leaderboard</a>
-            </div>
-        </div>
-    </nav>
-    <main class="container">
-        <section class="hero fade-in">
-            <div class="hero-badge">⚡ OpenEnv Compliant · Meta × PyTorch × HuggingFace</div>
-            <h1 class="hero-title">ARIA</h1>
-            <div class="hero-subtitle">Adaptive Reward & Incident Architecture</div>
-            <p class="hero-desc">The first OpenEnv RL environment for production incident response.<br>7 tasks · 14 actions · Curriculum · Dual-agent · Trained Llama-3.1-8B</p>
-            <div class="hero-buttons">
-                <a href="/docs" class="btn-primary">Try Live API &rarr;</a>
-                <a href="https://github.com/Twilight-13/devops-incident-response" target="_blank" class="btn-secondary">View GitHub &rarr;</a>
-            </div>
-            <div class="hero-stats">
-                <div class="stat-card"><div class="stat-val">7</div><div class="stat-label">Tasks</div></div>
-                <div class="stat-card"><div class="stat-val">14</div><div class="stat-label">Actions</div></div>
-                <div class="stat-card"><div class="stat-val">&infin;</div><div class="stat-label">Scenarios</div></div>
-                <div class="stat-card"><div class="stat-val">0.99</div><div class="stat-label">Max Score</div></div>
-            </div>
-        </section>
-        <section class="fade-in">
-            <h2 class="section-title">Environment Tasks</h2>
-            <p class="section-subtitle">Eight scenarios of escalating operational complexity</p>
-            <div class="task-grid" id="task-grid"><div style="grid-column: 1/-1; text-align: center; color: var(--muted);">Loading tasks...</div></div>
-        </section>
-        <section class="fade-in">
-            <h2 class="section-title">ARIA Features</h2>
-            <p class="section-subtitle">What makes this environment unique</p>
-            <div class="features-grid">
-                <div class="feature-card">
-                    <div class="feature-icon">🎓</div>
-                    <h3 class="feature-title">Curriculum Engine</h3>
-                    <p class="feature-desc">Tracks agent performance per task with rolling averages. Promotes when mastered (avg > 0.75). Scaffolds with hints when struggling (avg < 0.30). Agents always train at the edge of their capability.</p>
-                    <div id="curriculum-container" style="margin-bottom: 24px;"></div>
-                    <a href="/curriculum/status" class="feature-link" style="color: var(--blue);">View Status &rarr;</a>
-                </div>
-                <div class="feature-card">
-                    <div class="feature-icon">⚡</div>
-                    <h3 class="feature-title">Incident Generator</h3>
-                    <p class="feature-desc">Procedural incidents from seeds 0–99,999. Six failure modes × eight services × variable noise = infinite unique training scenarios. Same seed always produces the same incident.</p>
-                    <div class="generator-input">
-                        <input type="number" id="gen-seed" class="gen-seed" value="42" min="0" max="99999">
-                        <button class="btn-gen" onclick="generateIncident()">Generate</button>
-                    </div>
-                    <div class="gen-result" id="gen-result">
-                        <div class="gen-badges" id="gen-badges"></div>
-                        <div style="font-size: 13px; font-weight: 600; margin-bottom: 8px;" id="gen-affected"></div>
-                        <div style="font-size: 12px; color: var(--muted); line-height: 1.5;" id="gen-desc"></div>
-                        <div class="gen-diff-bar"><div id="gen-diff-fill" style="height: 100%; transition: width 0.3s;"></div></div>
-                    </div>
-                    <a href="/generate/preview?seed=42" class="feature-link" style="color: var(--purple);">Try Generator &rarr;</a>
-                </div>
-                <div class="feature-card">
-                    <div class="feature-icon">🤝</div>
-                    <h3 class="feature-title">Dual-Agent Mode</h3>
-                    <p class="feature-desc">Split observability between two agents. Observer sees logs and alerts. Responder sees metrics and dependencies. Neither can solve the incident alone — they must coordinate via share_finding.</p>
-                    <div class="dual-diagram">
-                        <div class="agent-box"><div style="font-weight:700; margin-bottom:4px;">AGENT A: Observer</div><div>• alerts, logs</div></div>
-                        <div class="agent-arrow">share_finding</div>
-                        <div class="agent-box"><div style="font-weight:700; margin-bottom:4px;">AGENT B: Responder</div><div>• metrics, deps</div></div>
-                    </div>
-                    <button class="btn-green" onclick="startDualSession()">Start Session</button>
-                    <div id="dual-session-info" style="margin-top: 16px; font-family: 'JetBrains Mono', monospace; font-size: 11px; color: var(--green); display: none; word-break: break-all;"></div>
-                    <a href="/multi-agent/sessions" class="feature-link" style="color: var(--green); margin-top: auto;">View Sessions &rarr;</a>
-                </div>
-            </div>
-        </section>
-    </main>
-    <div class="metrics-bar fade-in">
-        <div class="container metrics-grid">
-            <div class="metric-item"><div class="metric-val" id="m-episodes">--</div><div class="metric-label">Total Episodes</div></div>
-            <div class="metric-item"><div class="metric-val" id="m-avg">--</div><div class="metric-label">Avg Score</div></div>
-            <div class="metric-item"><div class="metric-val" id="m-res">--</div><div class="metric-label">Resolution Rate</div></div>
-            <div class="metric-item"><div class="metric-val" id="m-best">--</div><div class="metric-label">Best Score</div></div>
-        </div>
-    </div>
-    <main class="container">
-        <section class="fade-in">
-            <h2 class="section-title">🏆 Leaderboard</h2>
-            <div class="leaderboard-card">
-                <table>
-                    <thead><tr><th>Rank</th><th>Task</th><th>Score</th><th>Steps</th><th>Status</th></tr></thead>
-                    <tbody id="lb-body"><tr><td colspan="5" style="text-align: center; color: var(--muted);">Loading leaderboard...</td></tr></tbody>
-                </table>
-            </div>
-        </section>
-        <section class="fade-in">
-            <h2 class="section-title">Quick Start</h2>
-            <div class="tabs">
-                <button class="tab active" onclick="switchTab('curl')">curl</button>
-                <button class="tab" onclick="switchTab('python')">Python</button>
-            </div>
-            <div id="code-curl" class="code-block active">
-                <button class="btn-copy" onclick="copyCode('code-curl-text', this)">Copy</button>
-                <div class="code-text" id="code-curl-text"><span class="c-com"># 1. Start an incident</span>
-<span class="c-cmd">curl</span> -X POST https://arijit-07-devops-incident-response.hf.space/reset \
-  -H <span class="c-str">"Content-Type: application/json"</span> \
-  -d <span class="c-str">'{<span class="c-key">"task_id"</span>: <span class="c-str">"easy"</span>, <span class="c-key">"seed"</span>: 42}'</span>
-<span class="c-com"># 2. Read logs (reward: +0.15)</span>
-<span class="c-cmd">curl</span> -X POST https://arijit-07-devops-incident-response.hf.space/step \
-  -H <span class="c-str">"Content-Type: application/json"</span> \
-  -d <span class="c-str">'{<span class="c-key">"action_type"</span>: <span class="c-str">"read_logs"</span>, <span class="c-key">"service"</span>: <span class="c-str">"payment-service"</span>}'</span>
-<span class="c-com"># 3. Diagnose (reward: +0.30)</span>
-<span class="c-cmd">curl</span> -X POST https://arijit-07-devops-incident-response.hf.space/step \
-  -H <span class="c-str">"Content-Type: application/json"</span> \
-  -d <span class="c-str">'{<span class="c-key">"action_type"</span>: <span class="c-str">"diagnose"</span>, <span class="c-key">"root_cause"</span>: <span class="c-str">"memory leak in payment-service"</span>}'</span>
-<span class="c-com"># 4. Fix it (reward: +0.40)</span>
-<span class="c-cmd">curl</span> -X POST https://arijit-07-devops-incident-response.hf.space/step \
-  -H <span class="c-str">"Content-Type: application/json"</span> \
-  -d <span class="c-str">'{<span class="c-key">"action_type"</span>: <span class="c-str">"restart_service"</span>, <span class="c-key">"service"</span>: <span class="c-str">"payment-service"</span>}'</span>
-<span class="c-com"># Score: ~0.94 ✅</span></div>
-            </div>
-            <div id="code-python" class="code-block">
-                <button class="btn-copy" onclick="copyCode('code-py-text', this)">Copy</button>
-                <div class="code-text" id="code-py-text"><span class="c-cmd">import</span> requests
-BASE = <span class="c-str">"https://arijit-07-devops-incident-response.hf.space"</span>
-<span class="c-com"># Start episode</span>
-obs = requests.post(<span class="c-url">f"{BASE}/reset"</span>, json={<span class="c-key">"task_id"</span>: <span class="c-str">"easy"</span>, <span class="c-key">"seed"</span>: 42}).json()
-<span class="c-com"># Take action</span>
-result = requests.post(<span class="c-url">f"{BASE}/step"</span>,
-    json={<span class="c-key">"action_type"</span>: <span class="c-str">"read_logs"</span>, <span class="c-key">"service"</span>: <span class="c-str">"payment-service"</span>}).json()
-print(<span class="c-url">f"Reward: {result['reward']}"</span>)  <span class="c-com"># 0.15</span></div>
-            </div>
-        </section>
-        <section class="fade-in">
-            <h2 class="section-title">🧠 Training Evidence</h2>
-            <div class="training-grid">
-                <div class="train-card">
-                    <h3 class="train-title">Before vs After</h3>
-                    <div class="train-row">
-                        <div class="train-label"><span>Base Llama-3.1-8B</span><span class="train-badge" style="background: rgba(239, 68, 68, 0.2); color: var(--red);">0.000</span></div>
-                        <div class="train-vis" style="color: var(--red);">❌</div>
-                        <div class="train-desc">jumps to diagnose, gets penalized</div>
-                    </div>
-                    <div class="train-row">
-                        <div class="train-label"><span>ARIA Fine-tuned</span><span class="train-badge" style="background: rgba(16, 185, 129, 0.2); color: var(--green);">0.150</span></div>
-                        <div class="train-vis" style="color: var(--green);">✅</div>
-                        <div class="train-desc">reads logs first, every time</div>
-                    </div>
-                    <a href="https://huggingface.co/Arijit-07/aria-devops-llama8b" target="_blank" class="feature-link">Model weights &rarr;</a>
-                </div>
-                <div class="train-card">
-                    <h3 class="train-title">Training Details</h3>
-                    <div class="tt-row"><div class="tt-key">Algorithm</div><div class="tt-val">GRPO</div></div>
-                    <div class="tt-row"><div class="tt-key">Base Model</div><div class="tt-val">Llama-3.1-8B-Instruct</div></div>
-                    <div class="tt-row"><div class="tt-key">Framework</div><div class="tt-val">Unsloth + HuggingFace TRL</div></div>
-                    <div class="tt-row"><div class="tt-key">LoRA Rank</div><div class="tt-val">32 (alpha 64)</div></div>
-                    <div class="tt-row"><div class="tt-key">Episodes</div><div class="tt-val">160</div></div>
-                    <div class="tt-row"><div class="tt-key">GPU</div><div class="tt-val">NVIDIA L4</div></div>
-                </div>
-            </div>
-        </section>
-    </main>
-    <footer>
-        <div class="container">
-            <div class="footer-grid">
-                <div>
-                    <div style="font-size: 20px; font-weight: 700; color: var(--blue); margin-bottom: 8px;">🚨 ARIA</div>
-                    <div class="f-text">DevOps Incident Response<br>OpenEnv-compliant RL environment</div>
-                    <div style="display: flex; gap: 16px; margin-top: 16px;">
-                        <a href="https://github.com/Twilight-13/devops-incident-response" target="_blank" class="f-link">GitHub</a>
-                        <a href="https://huggingface.co/Arijit-07/aria-devops-llama8b" target="_blank" class="f-link">Model</a>
-                    </div>
-                </div>
-                <div>
-                    <div class="f-title">Resources</div>
-                    <div class="f-links">
-                        <a href="/docs" class="f-link">Live API Docs</a>
-                        <a href="/validate" class="f-link">Validate</a>
-                        <a href="/metrics" class="f-link">Metrics</a>
-                        <a href="/leaderboard" class="f-link">Leaderboard</a>
-                    </div>
-                </div>
-                <div>
-                    <div class="f-title">Built for</div>
-                    <div class="f-text">Meta × PyTorch × HuggingFace<br>OpenEnv Hackathon Finals<br>Bangalore, April 2026</div>
-                </div>
-            </div>
-            <div class="f-bottom">
-                <div>&copy; 2026 ARIA — Apache 2.0 License</div>
-                <div>Can your agent handle a SEV-1 at 3am?</div>
-            </div>
-        </div>
-    </footer>
-    <script>
-        const canvas = document.getElementById('bg-canvas');
-        const ctx = canvas.getContext('2d');
-        let width, height, particles = [];
-        function resize() { width = canvas.width = window.innerWidth; height = canvas.height = window.innerHeight; }
-        window.addEventListener('resize', resize); resize();
-        for(let i=0; i<50; i++) {
-            particles.push({ x: Math.random() * width, y: Math.random() * height, vx: (Math.random()-0.5)*0.5, vy: (Math.random()-0.5)*0.5 });
-        }
-        function draw() {
-            ctx.clearRect(0, 0, width, height);
-            ctx.fillStyle = 'rgba(59, 130, 246, 0.2)';
-            ctx.strokeStyle = 'rgba(59, 130, 246, 0.1)';
-            for(let i=0; i<particles.length; i++) {
-                let p = particles[i];
-                p.x += p.vx; p.y += p.vy;
-                if(p.x < 0 || p.x > width) p.vx *= -1;
-                if(p.y < 0 || p.y > height) p.vy *= -1;
-                ctx.beginPath(); ctx.arc(p.x, p.y, 2, 0, Math.PI*2); ctx.fill();
-                for(let j=i+1; j<particles.length; j++) {
-                    let p2 = particles[j], dist = Math.hypot(p.x-p2.x, p.y-p2.y);
-                    if(dist < 150) { ctx.beginPath(); ctx.moveTo(p.x, p.y); ctx.lineTo(p2.x, p2.y); ctx.stroke(); }
-                }
-            }
-            requestAnimationFrame(draw);
-        }
-        draw();
-        const observer = new IntersectionObserver(e => e.forEach(en => { if(en.isIntersecting) en.target.classList.add('visible'); }), {threshold: 0.1});
-        document.querySelectorAll('.fade-in').forEach(el => observer.observe(el));
-        fetch('/health').then(r => r.json()).then(d => {
-            if(d.status === 'ok') document.getElementById('nav-status-text').innerText = 'LIVE';
-        }).catch(e => console.error(e));
-        const tMap = {
-            'easy': {icon: '💻', color: '#10b981', badge: 'EASY'}, 'medium': {icon: '⚡', color: '#f59e0b', badge: 'MEDIUM'},
-            'hard': {icon: '🔥', color: '#ef4444', badge: 'HARD'}, 'bonus': {icon: '💥', color: '#8b5cf6', badge: 'EXPERT'},
-            'security': {icon: '🛡️', color: '#06b6d4', badge: 'SECURITY'}, 'database': {icon: '🗄️', color: '#f97316', badge: 'DATABASE'},
-            'failover': {icon: '🌐', color: '#6366f1', badge: 'FAILOVER'}, 'generated': {icon: '✨', color: '#ec4899', badge: 'DYNAMIC'}
-        };
-        fetch('/tasks').then(r => r.json()).then(d => {
-            document.getElementById('task-grid').innerHTML = d.tasks.map(t => {
-                let c = tMap[t.id] || tMap['easy'];
-                return `<div class="task-card" style="--card-color:${c.color};--card-bg:${c.color}20;">
-                    <div class="task-header"><div class="task-icon">${c.icon}</div><div class="task-badge">${c.badge}</div></div>
-                    <div class="task-name">${t.name}</div><div class="task-desc">${t.description}</div>
-                    <div class="task-footer"><div class="task-steps">Max steps: ${t.max_steps}</div><div class="task-status">Ready</div></div>
-                </div>`;
-            }).join('');
-        }).catch(e => console.error(e));
-        fetch('/curriculum/status').then(r => r.json()).then(d => {
-            const el = document.getElementById('curriculum-container');
-            if(!d.total_episodes_recorded) el.innerHTML = '<div style="color:var(--muted); font-size:13px; text-align:center;">No episodes yet — run POST /reset to begin</div>';
-            else {
-                el.innerHTML = Object.keys(d.tasks).slice(0, 4).map(k => {
-                    let avg = d.tasks[k].rolling_avg, col = avg < 0.3 ? 'var(--red)' : (avg < 0.6 ? 'var(--yellow)' : 'var(--green)');
-                    let bl = Math.round(avg * 10);
-                    return `<div class="c-bar-row"><div class="c-bar-name">${k}</div>
-                    <div class="c-bar-track" style="color:${col}"><span>${'█'.repeat(bl)}</span><span style="opacity:0.3">${'░'.repeat(10-bl)}</span></div>
-                    <div class="c-bar-score">${avg.toFixed(2)}</div></div>`;
-                }).join('');
-            }
-        }).catch(e => console.error(e));
-        window.generateIncident = () => {
-            const seed = document.getElementById('gen-seed').value || 42;
-            fetch(`/generate/preview?seed=${seed}`).then(r => r.json()).then(d => {
-                const colors = {oom: '#ef4444', cascade: '#f59e0b', corruption: '#8b5cf6', security: '#06b6d4', database: '#f97316', network_partition: '#6366f1'};
-                const sc = {sev1: '#ef4444', sev2: '#f59e0b', sev3: '#10b981'};
-                let fcol = colors[d.failure_mode] || 'var(--blue)';
-                document.getElementById('gen-badges').innerHTML = `<span class="gen-badge" style="background:${fcol}20;color:${fcol}">${d.failure_mode}</span><span class="gen-badge" style="background:${sc[d.severity]||fcol}20;color:${sc[d.severity]||fcol}">${d.severity}</span><span class="gen-badge" style="background:rgba(255,255,255,0.1);color:var(--muted)">${d.incident_id}</span>`;
-                document.getElementById('gen-affected').innerText = `Affected: ${d.affected_service}`;
-                document.getElementById('gen-desc').innerText = d.description;
-                let dc = d.difficulty_score < 0.4 ? 'var(--green)' : (d.difficulty_score < 0.7 ? 'var(--yellow)' : 'var(--red)');
-                let fill = document.getElementById('gen-diff-fill');
-                fill.style.width = `${d.difficulty_score*100}%`; fill.style.background = dc;
-                document.getElementById('gen-result').style.display = 'block';
-            }).catch(e => console.error(e));
-        };
-        window.startDualSession = () => {
-            fetch('/multi-agent/reset', { method: 'POST', headers: {'Content-Type': 'application/json'}, body: JSON.stringify({task_id: "easy", seed: 42}) })
-            .then(r => r.json()).then(d => {
-                let info = document.getElementById('dual-session-info');
-                info.innerHTML = `Session: ${d.session_id}<br><br>Agent A (POST): /multi-agent/step/a/${d.session_id}<br>Agent B (POST): /multi-agent/step/b/${d.session_id}`;
-                info.style.display = 'block';
-            }).catch(e => console.error(e));
-        };
-        const loadMetrics = () => {
-            fetch('/metrics').then(r => r.json()).then(d => {
-                document.getElementById('m-episodes').innerText = d.total_episodes || 0;
-                document.getElementById('m-avg').innerText = (d.overall_avg_score || 0).toFixed(3);
-                if(d.by_task) {
-                    let tRes = 0, tCnt = 0, best = 0;
-                    Object.values(d.by_task).forEach(t => { tRes += t.resolution_rate*t.count; tCnt += t.count; if(t.max_score > best) best = t.max_score; });
-                    document.getElementById('m-res').innerText = (tCnt ? (tRes/tCnt)*100 : 0).toFixed(1) + '%';
-                    document.getElementById('m-best').innerText = best.toFixed(3);
-                }
-            }).catch(e => console.error(e));
-        };
-        loadMetrics(); setInterval(loadMetrics, 30000);
-        fetch('/leaderboard').then(r => r.json()).then(d => {
-            const body = document.getElementById('lb-body');
-            if(!d.leaderboard || !d.leaderboard.length) { body.innerHTML = '<tr><td colspan="5" style="text-align: center; color: var(--muted);">No episodes yet. Try POST /reset to start.</td></tr>'; return; }
-            body.innerHTML = d.leaderboard.map(r => {
-                let rank = r.rank === 1 ? 'color:#fbbf24;font-weight:bold' : (r.rank === 2 ? 'color:#9ca3af;font-weight:bold' : (r.rank === 3 ? 'color:#cd7f32;font-weight:bold' : ''));
-                let sCol = r.score >= 0.8 ? 'var(--green)' : (r.score >= 0.5 ? 'var(--yellow)' : 'var(--red)');
-                let status = r.score > 0.5 ? '<span style="color:var(--green)">✅ Resolved</span>' : '<span style="color:var(--red)">❌ Failed</span>';
-                return `<tr><td style="${rank}">#${r.rank}</td><td>${r.task_id}</td><td class="lb-score" style="color:${sCol}">${r.score.toFixed(4)}</td><td>${r.steps}</td><td>${status}</td></tr>`;
-            }).join('');
-        }).catch(e => console.error(e));
-        window.switchTab = t => {
-            document.querySelectorAll('.tab').forEach(el => el.classList.remove('active'));
-            document.querySelectorAll('.code-block').forEach(el => el.classList.remove('active'));
-            document.querySelectorAll('.tab')[t === 'curl' ? 0 : 1].classList.add('active');
-            document.getElementById('code-'+t).classList.add('active');
-        };
-        window.copyCode = (id, btn) => {
-            navigator.clipboard.writeText(document.getElementById(id).innerText).then(() => {
-                let old = btn.innerText; btn.innerText = 'Copied ✓'; setTimeout(() => btn.innerText = old, 2000);
-            });
-        };
-    </script>
-</body>
-</html>"""
-import sys
-# Double every { and } so the HTML survives being embedded in the f-string
-# that is written into server/app.py below.
-html_escaped = html_content.replace("{", "{{").replace("}", "}}")
-with open("server/app.py", "r", encoding="utf-8") as f:
-    app_text = f.read()
-# Replace the current dashboard endpoint
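-start_idx = app_text.find("def dashboard():")
-end_marker = '    return html'
-# Check find() results before adding the marker length; otherwise -1 is masked.
-end_pos = app_text.find(end_marker, start_idx) if start_idx != -1 else -1
-if start_idx == -1 or end_pos == -1:
-    print("Could not find dashboard.")
-    sys.exit(1)
-end_idx = end_pos + len(end_marker)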
-new_dashboard = f'''def dashboard():
-    html = f"""{html_escaped}"""
-    return html'''
-new_text = app_text[:start_idx] + new_dashboard + app_text[end_idx:]
-with open("server/app.py", "w", encoding="utf-8") as f:
-    f.write(new_text)
-print("Dashboard replaced successfully.")
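
The patching step in the removed script can be sketched as a pure function, which is easier to test than rewriting `server/app.py` in place. This is only an illustration and not the project's code: the name `replace_dashboard` and the use of `ValueError` are assumptions, and the markers are the ones the script searched for.

```python
def replace_dashboard(app_text: str, html_content: str) -> str:
    """Return app_text with the body of dashboard() swapped for html_content.

    Hypothetical helper: the function name and error handling are assumptions.
    """
    start_marker = "def dashboard():"
    end_marker = "    return html"

    start_idx = app_text.find(start_marker)
    if start_idx == -1:
        raise ValueError("dashboard() definition not found")

    # Validate the raw find() result before adding the marker length.
    end_pos = app_text.find(end_marker, start_idx)
    if end_pos == -1:
        raise ValueError("'return html' not found after dashboard()")
    end_idx = end_pos + len(end_marker)

    # Double the braces so the HTML survives being embedded in an f-string.
    html_escaped = html_content.replace("{", "{{").replace("}", "}}")
    new_dashboard = (
        "def dashboard():\n"
        f'    html = f"""{html_escaped}"""\n'
        "    return html"
    )
    return app_text[:start_idx] + new_dashboard + app_text[end_idx:]
```

Like the original script, this sketch assumes the HTML contains no triple double quotes or backslash sequences that would need further escaping.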