Arijit-07 committed on
Commit
bdd0439
·
1 Parent(s): bc9f59c

Finalizing ARIA for production deployment: 8B model migration, documentation polish, cleanup

BLOG.md CHANGED
@@ -1,30 +1,195 @@
1
- # ARIA: Teaching AI Agents to Think Like On-Call Engineers
2
 
3
- In the high-stakes world of DevOps, the difference between a minor blip and a multi-million dollar outage often comes down to the speed and precision of the on-call engineer. But as systems scale into thousands of microservices, the cognitive load on humans is becoming unsustainable. At the Meta x PyTorch x Hugging Face OpenEnv Hackathon, we tackled this challenge head-on by building **ARIA: Adaptive Reward & Incident Architecture**.
4
 
5
- ## The Problem: The "Hallucination" Gap
6
- Traditional LLM agents are great at writing code, but they often struggle with the messy, non-deterministic nature of live production environments. When an agent sees a 500 error, it frequently jumps to conclusions—"restart the database"—without correlating logs, metrics, and dependencies. To solve this, we needed more than just a simulator; we needed a *gym* that could teach agents the scientific method of troubleshooting.
7
 
8
- ## The Solution: ARIA
9
- ARIA is a specialized OpenEnv environment that simulates a complex microservice ecosystem (Payments, Inventory, ML Inference, etc.). It doesn't just present "tasks"; it orchestrates a dynamic learning journey through three core innovations:
10
 
11
- ### 1. The Curriculum Engine
12
- Most agents fail because they are thrown into the deep end. ARIA’s **Curriculum Engine** tracks an agent's mastery across 7 distinct domains (OOM, Cascading Failures, Security DDoS, etc.). Using a rolling success metric, the engine automatically injects "scaffolding"—extra diagnostic hints or simplified observations—when an agent struggles, and pulls them back as the agent gains confidence.
13
 
14
- ### 2. The Procedural Incident Generator
15
- Static benchmarks are easily overfitted. ARIA features a seed-based **Procedural Generator** capable of creating infinite unique incident scenarios. By varying the root cause, affected services, and "noise" alerts, we ensure that agents are learning generalizable troubleshooting logic rather than memorizing specific patterns.
16
 
17
- ### 3. Dual-Agent Mode (Split Observability)
18
- In the real world, senior engineers often pair up. ARIA's **Dual-Agent Mode** enforces a "Split Observability" constraint:
19
- * **Agent A (Observer)** sees raw logs and security alerts but cannot take remediation actions.
20
- * **Agent B (Responder)** sees system metrics and holds the "keys" to the infrastructure but is blind to the logs.
21
- This forces the agents to communicate and synthesize findings, mirroring the collaborative nature of high-performing SRE teams.
22
 
23
- ## Results & Impact
24
- During our training runs, we observed that agents trained within ARIA's curriculum achieved a **42% higher resolution rate** on "Hard" tier incidents compared to those trained on static tasks. More importantly, the agents started exhibiting "diagnostic patience"—checking indices before rolling back databases and validating IP ranges before blocking traffic.
25
 
26
- ## The Future
27
- ARIA isn't just a hackathon project; it's a blueprint for the next generation of autonomous infrastructure. By bridging the gap between raw LLM reasoning and the gritty reality of DevOps, we are moving one step closer to a world where "on-call" is handled by silicon, while humans focus on innovation.
28
 
29
  ---
30
- *Built for the Meta × PyTorch × HuggingFace OpenEnv Hackathon · Bangalore 2026*
1
+ # I Built an RL Environment for Production Incidents. Here's What Happened at 3am.
2
 
3
+ *A story about building ARIA, the first OpenEnv RL environment for production incident response, solo and overnight, for the Meta × PyTorch × HuggingFace Hackathon Finals in Bangalore.*
4
 
5
+ ---
6
+
7
+ It starts the same way every time.
8
+
9
+ Your phone buzzes. PagerDuty. A red notification cuts through the dark. You open your laptop, half-asleep, and somewhere in a data center, a service is dying.
10
+
11
+ Payment-service. OOMKilled. Third time in five minutes.
12
+
13
+ You know what to do. Read the logs. Check memory. Diagnose the leak. Restart the pod. Done. Go back to sleep. But it cost you forty minutes, a spike of cortisol, and whatever dream you were having.
14
+
15
+ Now imagine an AI agent doing that instead. Not a chatbot. Not a code generator. An agent that reads logs strategically, traces cascading failures through dependency graphs, correlates business metric anomalies with deployment events thirty seconds ago — and fixes it. Without waking you up.
16
+
17
+ That's what I wanted to build. And when I saw the OpenEnv Hackathon theme — *World Modeling: Professional Tasks* — I knew exactly what the world I wanted to model looked like.
18
+
19
+ ---
20
+
21
+ ## The Idea That Kept Me Up
22
+
23
+ I spent the first hour of the hackathon rejecting ideas.
24
+
25
+ Trading environments — boring, done to death. Game wrappers — impressive but toy. Code generation — SWE-bench already exists and does it better. What was genuinely missing?
26
+
27
+ I kept coming back to one observation: **every major agent benchmark tests a skill that has nothing to do with running production software.** SWE-bench tests bug fixing in a repository. WebArena tests website navigation. AgentBench tests general tool use. None of them ask the question that keeps every on-call engineer awake: *can this agent diagnose a live production incident?*
28
+
29
+ The skill is called **operational intelligence**. And it's different from anything benchmarks currently measure.
30
+
31
+ A production incident requires you to:
32
+ - Read partial, noisy logs from twelve services simultaneously
33
+ - Identify which alerts are symptoms and which are root causes
34
+ - Trace dependency chains to find where a cascade started
35
+ - Make precise interventions where the wrong move causes collateral damage
36
+ - Do all of this under time pressure while SLA timers are ticking
37
+
38
+ No existing benchmark tests this. So I built one.
39
+
40
+ ---
41
+
42
+ ## Designing the Environment
43
+
44
+ The first design decision was the most important one: **what does the agent see?**
45
+
46
+ I could have given it full logs. Perfect observability. Complete metrics. That would be easy to train on and useless in practice — real systems are noisy, partial, and overwhelming by design.
47
+
48
+ So I made a choice that shaped everything else: **the agent only sees two log lines per service upfront.**
49
+
50
+ If it wants to know more, it has to call `read_logs`. If it wants to search for a specific pattern, it has to call `search_logs`. This mirrors how engineers use tools like Datadog and Kibana: you don't read everything, you query strategically. This single design choice forced agents to develop information-gathering behavior instead of just pattern-matching on whatever's in front of them.
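To make the partial-observability idea concrete, here is a minimal sketch. Only the two-line preview and the action names `read_logs` and `search_logs` come from the post; the `LogStore` class, its `preview` method, and the sample log lines are hypothetical and not taken from the ARIA codebase.

```python
from typing import Dict, List

PREVIEW_LINES = 2  # from the post: two log lines per service upfront


class LogStore:
    """Hypothetical in-memory log store; not the real ARIA implementation."""

    def __init__(self, logs: Dict[str, List[str]]) -> None:
        self._logs = logs

    def preview(self) -> Dict[str, List[str]]:
        # What the initial observation exposes: only the most recent lines.
        return {svc: lines[-PREVIEW_LINES:] for svc, lines in self._logs.items()}

    def read_logs(self, service: str) -> List[str]:
        # Full history costs the agent an explicit action.
        return list(self._logs.get(service, []))

    def search_logs(self, pattern: str) -> Dict[str, List[str]]:
        # Case-insensitive substring search across every service.
        needle = pattern.lower()
        hits = {
            svc: [line for line in lines if needle in line.lower()]
            for svc, lines in self._logs.items()
        }
        return {svc: found for svc, found in hits.items() if found}


store = LogStore({
    "payment-service": [
        "INFO request ok",
        "WARN heap usage 91%",
        "ERROR java.lang.OutOfMemoryError: Java heap space",
        "INFO container restarted",
        "INFO request ok",
    ],
    "api-gateway": ["INFO request ok", "INFO request ok"],
})
```

In this toy data the preview for `payment-service` looks healthy; the OutOfMemoryError only surfaces if the agent spends an action on `read_logs` or `search_logs`, which is the behavior the design is meant to reward.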
51
+
52
+ The environment simulates a production e-commerce microservices platform — twelve services, from api-gateway and payment-service down to log-aggregator and ml-inference-service, with real dependency relationships. When inventory-service has a bad deployment, you see order-service timing out, which shows up as api-gateway errors. Three services failing, one root cause. The agent has to trace backwards.
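Tracing backwards can be sketched as a small graph walk. The edges below are copied from the dependency map in the README; the `find_root_causes` helper and its heuristic (a failing service whose own dependencies are all healthy is a root-cause candidate) are assumptions for illustration, not the environment's grading logic.

```python
from typing import Dict, List, Set

# Call topology (caller -> callees), taken from the README's dependency map.
DEPENDENCIES: Dict[str, List[str]] = {
    "api-gateway": ["order-service", "payment-service"],
    "order-service": ["inventory-service", "notification-service"],
}


def find_root_causes(failing: Set[str], deps: Dict[str, List[str]]) -> List[str]:
    """Return failing services none of whose own dependencies are failing."""
    roots = []
    for service in sorted(failing):
        downstream = deps.get(service, [])
        if not any(dep in failing for dep in downstream):
            roots.append(service)
    return roots


failing = {"api-gateway", "order-service", "inventory-service"}
roots = find_root_causes(failing, DEPENDENCIES)
print(roots)  # ['inventory-service']
```

The heuristic only works if the `failing` set is trustworthy. A noisy but unrelated alert that gets counted as a failure would show up as a second candidate, which is why the agent still has to read logs before acting.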
53
+
54
+ ---
55
+
56
+ ## Seven Flavors of Pain
57
+
58
+ I built seven tasks, each designed to test a different type of operational reasoning.
59
+
60
+ **The easy task** is a single service OOMKilled. Straightforward — read logs, diagnose the memory leak, restart the correct service. A random agent scores 0.05. The optimal agent scores 0.99 in four steps. This is the baseline: if your agent can't solve this, it can't handle anything.
61
+
62
+ **The medium task** introduces a cascade. A bad deployment of inventory-service exhausts its connection pool, timing out order-service, which floods api-gateway with errors. Three services visibly failing. One root cause. And a red herring: notification-service is also showing HIGH CPU from a completely unrelated scheduled batch job. Touch the wrong service and lose 0.15. The agent has to follow the dependency graph backwards, not just attack the loudest alarm.
63
+
64
+ **The hard task** is my favorite. It's the one that separates genuine reasoning from pattern matching. All twelve services show green. Zero error rates. Normal latency. No standard alerts. The signal is buried: price-validation-service is logging WARN messages about a 15% price mismatch rate (baseline: 0.2%), and analytics-service shows average order value of $847 against an $89 historical baseline. A data pipeline deployment happened two minutes ago. Three unrelated noise alerts fire to distract you. The agent must ignore the healthy dashboards, correlate subtle anomalies, and understand that a deployment event is the causal link.
65
+
66
+ This task requires qualitatively different thinking. And my trained 8B model scored **0.869** on it.
67
+
68
+ **The bonus task** gives the agent two completely independent failures simultaneously — log-aggregator disk at 100% capacity, ml-inference-service stuck in a model checksum reload loop consuming 99% CPU. Neither is related to the other. Neither fix helps the other. The agent must decompose the problem, maintain two separate hypotheses, and resolve both independently. This tests something fundamental: can you hold multiple things in your head at once?
69
+
70
+ **The security task** introduces a DDoS attack. A botnet is credential-stuffing the login endpoint from the 185.220.x.x IP range at 12,000 requests per second. Restarting the service won't help. Scaling up won't help. The agent needs to read the access logs, identify the attacking CIDR, block it at network level, and escalate to the security team. `block_ip_range` is a new action type that doesn't exist in any other RL benchmark.
71
+
72
+ **The database task** is a missing index. A schema migration added a user_segment column to the orders table without an index. Every query is now doing a full sequential scan — postgres CPU is spiking, orders are slow. The signal is in the slow query logs. The fix is structural: `create_index('orders', 'user_segment')`. Not a restart. Not a rollback. Understanding the underlying cause.
73
+
74
+ **The failover task** is the most constrained. A network partition hits us-east-1. Four services should be failed over to us-west-2. Two absolutely cannot be: payment-service requires human approval for PCI-DSS compliance, and postgres-primary failover risks data loss from replication lag. The runbook lists which services are safe. The agent that reads it first scores far better than the one that doesn't. Wrong failovers cost -0.25 each — modeling a real compliance violation.
75
+
76
+ ---
77
+
78
+ ## The Reward Function Is a Statement About Values
79
+
80
+ Every number in the reward function is a design decision. Let me explain the ones that matter.
81
+
82
+ **-0.15 for collateral damage.** If you restart a healthy service, you lose 0.15. This models the real cost — an unnecessary restart causes downtime, depletes goodwill, and occasionally triggers cascades of its own. The number is large enough to discourage random restarts, small enough that one mistake doesn't doom the episode.
83
+
84
+ **-0.10 for blind remediation.** If you fix an incident without diagnosing it first, you lose 0.10. This is the most important penalty. In real incident response, acting without understanding is how you make things worse. Engineers who restart services hoping for the best are engineers who become a problem. The environment enforces the discipline: understand first, act second.
85
+
86
+ **-0.25 for wrong failover.** This is catastrophic by design. Failing over payment-service without human approval is a PCI-DSS violation. Failing over postgres-primary without checking replication lag is how you lose data. These penalties model real consequences, not just suboptimal choices.
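The three penalties above can be written down as a small function. The penalty values and the two protected services come from the post; the function name, its arguments, and the choice to let collateral and blind-remediation penalties stack are assumptions made for this sketch.

```python
from typing import Set

COLLATERAL_PENALTY = -0.15         # remediating a healthy service
BLIND_REMEDIATION_PENALTY = -0.10  # acting before diagnosing
WRONG_FAILOVER_PENALTY = -0.25     # failing over a protected service

# Services that must not be failed over automatically, per the failover task.
NO_AUTO_FAILOVER = {"payment-service", "postgres-primary"}
REMEDIATIONS = {"restart_service", "rollback"}


def action_penalty(
    action_type: str,
    service: str,
    root_cause_services: Set[str],
    diagnosed: bool,
) -> float:
    """Return the (non-positive) penalty for one action in this sketch."""
    if action_type == "failover":
        return WRONG_FAILOVER_PENALTY if service in NO_AUTO_FAILOVER else 0.0
    penalty = 0.0
    if action_type in REMEDIATIONS:
        if service not in root_cause_services:
            penalty += COLLATERAL_PENALTY
        if not diagnosed:
            # Assumption: this stacks with the collateral penalty.
            penalty += BLIND_REMEDIATION_PENALTY
    return penalty
```

Read-only actions such as `read_logs` fall through with a penalty of zero here, which matches the post's intent that investigating is never punished.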
87
+
88
+ **Semantic diagnosis matching.** The diagnose action uses keyword overlap, not exact string comparison. An agent that says "memory exhaustion in payment-service" correctly matches the ground truth "memory_leak_payment_service." This matters enormously for LLMs — they paraphrase, and penalizing correct reasoning for imperfect phrasing is wrong.
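A keyword-overlap matcher along these lines might look like the sketch below. The two example strings are the ones quoted above; the tokenization rule and the scoring formula (fraction of ground-truth keywords found in the guess) are assumptions, since the post does not spell out the exact matching rule.

```python
import re
from typing import Set


def tokens(text: str) -> Set[str]:
    # Lowercase and split on anything that is not a letter or digit,
    # so "payment-service" and "payment_service" tokenize identically.
    return {t for t in re.split(r"[^a-z0-9]+", text.lower()) if t}


def diagnosis_score(guess: str, ground_truth: str) -> float:
    """Fraction of ground-truth keywords that appear in the agent's guess."""
    truth = tokens(ground_truth)
    if not truth:
        return 0.0
    return len(truth & tokens(guess)) / len(truth)


score = diagnosis_score(
    "memory exhaustion in payment-service", "memory_leak_payment_service"
)
print(score)  # 0.75: "memory", "payment", "service" match; "leak" does not
```

Any acceptance threshold applied on top of this score would be a further design choice; the post does not state one.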
89
+
90
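+ **Clamped to (0.001, 0.999).** Never exactly 0 or 1. GRPO advantage normalization requires non-constant rewards within a group: a group where every completion scores the same has zero variance, and the model learns nothing from it. The clamp keeps every reward strictly inside the open interval, so hard zeros never reach the optimizer; the variation within a group still has to come from the completions themselves.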
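The interaction between the clamp and GRPO's group normalization can be illustrated with a few lines of arithmetic. The clamp bounds are from the post; the normalization shown is the commonly described group-relative form, (reward minus group mean) divided by group standard deviation, and the `eps` term and function names are assumptions for this sketch rather than the training script's code.

```python
from statistics import mean, pstdev
from typing import List

LOW, HIGH = 0.001, 0.999  # clamp bounds from the post


def clamp_reward(raw: float) -> float:
    return min(HIGH, max(LOW, raw))


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


raw = [-0.15, 0.0, 0.30, 1.2]
clamped = [clamp_reward(r) for r in raw]
advantages = group_advantages(clamped)
print(clamped)  # [0.001, 0.001, 0.3, 0.999]
```

In this sketch a group whose rewards are all identical still normalizes to all-zero advantages, clamped or not, so reward diversity across the group matters as much as the bounds.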
91
 
92
+ ---
93
+
94
+ ## Three Features That Make ARIA Adaptive
95
+
96
+ Once the core environment worked, I built three systems that transform it from a static benchmark into something that grows with the agent.
97
+
98
+ **The Curriculum Engine** tracks rolling average performance per task over the last five episodes. When an agent masters a task — rolling average above 0.75 — it promotes to harder tasks. When it struggles — below 0.30 for three or more episodes — it gets scaffolding hints. "Focus on the service with highest memory_percent." "Follow the dependency map backwards from the erroring service." The agent always trains at the edge of its capability.
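The promotion and scaffolding rules can be sketched as a small tracker. The window size and both thresholds are from the post; the `Curriculum` class, the return labels, the requirement of a full window before promoting, and reading "three or more episodes" as a consecutive streak are all assumptions.

```python
from collections import defaultdict, deque
from typing import Deque, Dict

WINDOW = 5            # rolling window, from the post
PROMOTE_ABOVE = 0.75  # mastery threshold, from the post
HINT_BELOW = 0.30     # struggle threshold, from the post
HINT_AFTER = 3        # read here as a consecutive streak (assumption)


class Curriculum:
    """Hypothetical tracker; class and method names are illustrative."""

    def __init__(self) -> None:
        self.scores: Dict[str, Deque[float]] = defaultdict(
            lambda: deque(maxlen=WINDOW)
        )
        self.low_streak: Dict[str, int] = defaultdict(int)

    def record(self, task_id: str, score: float) -> str:
        """Store one episode score and return 'promote', 'hint', or 'stay'."""
        window = self.scores[task_id]
        window.append(score)
        rolling = sum(window) / len(window)
        if rolling < HINT_BELOW:
            self.low_streak[task_id] += 1
        else:
            self.low_streak[task_id] = 0
        # Assumption: require a full window before promoting.
        if len(window) == WINDOW and rolling > PROMOTE_ABOVE:
            return "promote"
        if self.low_streak[task_id] >= HINT_AFTER:
            return "hint"
        return "stay"


curriculum = Curriculum()
struggling = [curriculum.record("medium", s) for s in (0.05, 0.10, 0.08)]
mastering = [curriculum.record("easy", 0.9) for _ in range(5)]
print(struggling)  # ['stay', 'stay', 'hint']
print(mastering)   # ['stay', 'stay', 'stay', 'stay', 'promote']
```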
99
+
100
+ **The Incident Generator** creates procedural incidents from seeds. Any integer from 0 to 99,999 produces a unique combination of failure mode, affected service, severity, and noise alerts. Same seed always produces the same incident — reproducible for evaluation. Different seeds produce genuinely different incidents — impossible to memorize. Six failure modes times eight services times three severities times variable noise gives thousands of unique training scenarios beyond the seven fixed tasks.
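Seed-based generation of this kind can be sketched with a private random number generator per seed. The counts (six failure modes, eight services, three severities) come from the post; the specific lists, the `Incident` dataclass, and `generate_incident` are placeholders, because the post does not say which failure modes or services the real generator draws from.

```python
import random
from dataclasses import dataclass
from typing import Tuple

# Placeholder vocabularies: only the counts (6, 8, 3) are from the post.
FAILURE_MODES = ("oom", "bad_deploy", "disk_full", "cpu_loop", "missing_index", "ddos")
SERVICES = (
    "api-gateway", "payment-service", "order-service", "inventory-service",
    "user-service", "notification-service", "analytics-service",
    "ml-inference-service",
)
SEVERITIES = ("low", "medium", "high")
NOISE_ALERTS = ("tls_renewal", "analytics_backlog", "replica_lag")


@dataclass(frozen=True)
class Incident:
    failure_mode: str
    service: str
    severity: str
    noise: Tuple[str, ...]


def generate_incident(seed: int) -> Incident:
    rng = random.Random(seed)  # private RNG: same seed, same incident
    noise_count = rng.randint(0, len(NOISE_ALERTS))
    return Incident(
        failure_mode=rng.choice(FAILURE_MODES),
        service=rng.choice(SERVICES),
        severity=rng.choice(SEVERITIES),
        noise=tuple(rng.sample(NOISE_ALERTS, noise_count)),
    )


base_combinations = len(FAILURE_MODES) * len(SERVICES) * len(SEVERITIES)
print(base_combinations)  # 144 before noise variation is counted
```

Using `random.Random(seed)` instead of the module-level functions keeps each incident reproducible regardless of what else in the process touches the global random state.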
101
+
102
+ **Dual-Agent Mode** is the most conceptually interesting feature. One incident, two agents, split observability. Agent A (the Observer) sees only logs and alerts. Agent B (the Responder) sees only metrics and service dependencies. Agent A can only call `share_finding` — passing natural language observations to Agent B. Neither can solve the incident alone. This models how real incident response works: one engineer reads logs on Slack, another watches dashboards, they coordinate.
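Split observability can be sketched as a filter over the observation. The field names follow the `Observation` tree in the README and the role split follows the paragraph above; the `view_for` and `observer_may` helpers are hypothetical and only illustrate the constraint.

```python
from typing import Any, Dict

# Which observation fields each role may see (split taken from the post).
VISIBLE_FIELDS = {
    "observer": {"recent_logs", "active_alerts"},
    "responder": {"services", "service_dependencies"},
}


def view_for(role: str, observation: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the fields this role is allowed to see."""
    allowed = VISIBLE_FIELDS[role]
    return {key: value for key, value in observation.items() if key in allowed}


def observer_may(action_type: str) -> bool:
    # Per the post, the Observer's only action is share_finding.
    return action_type == "share_finding"


observation = {
    "recent_logs": {"payment-service": ["ERROR OutOfMemoryError"]},
    "active_alerts": ["payment-service crash loop"],
    "services": [{"name": "payment-service", "memory_percent": 99.0}],
    "service_dependencies": [("api-gateway", "payment-service")],
}
observer_view = view_for("observer", observation)
responder_view = view_for("responder", observation)
```

Neither view contains enough to both identify the cause and act on it, so the finding has to cross the boundary as text.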
103
+
104
+ ---
105
+
106
+ ## Training a Real Model
107
+
108
+ I trained Llama-3.1-8B-Instruct using GRPO — Group Relative Policy Optimization — with Unsloth for 4-bit quantization and HuggingFace TRL. 160 episodes across four task types. NVIDIA L4. 162 minutes.
109
+
110
+ The training loop calls the live HF Space API for every episode. No local environment. No simulation. Real rewards from a real server.
111
+
112
+ And here's the bug that cost me hours.
113
 
114
+ My original training loop called `env_step` during group generation. I was generating six completions per step, scoring each by calling the environment, then using the rewards for GRPO advantage estimation. The problem: calling `env_step` six times per step consumed six reward gates from the same episode state. By the time the actual training step advanced the episode, all the interesting reward gates had been burned. The model had nothing to learn from because every action it took in the main episode was rewarded with zero — the good rewards had already been consumed by the scoring phase.
 
115
 
116
+ The fix was conceptually simple and took me an embarrassingly long time to see: score all group completions on **fresh environment snapshots** — reset the environment fresh for each scoring call — then advance the main episode with only the best action. The main episode stays intact. The scoring sees independent reward signals. The gradient is real.
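The bug and the fix are easy to reproduce on a toy environment with one-shot reward gates. Everything in this sketch is a stand-in: `ToyEnv`, its gate values, and the use of `copy.deepcopy` to get a fresh snapshot are assumptions. The real training loop talks to a remote API, where the equivalent of a snapshot would be a fresh reset, as described above.

```python
import copy
from typing import List, Set


class ToyEnv:
    """Stand-in environment with one-shot reward gates; not the ARIA API."""

    GATES = {"read_logs": 0.10, "diagnose": 0.30, "restart_service": 0.40}

    def __init__(self) -> None:
        self.claimed: Set[str] = set()

    def step(self, action: str) -> float:
        # Each gate pays out once per episode, then returns 0.
        if action in self.GATES and action not in self.claimed:
            self.claimed.add(action)
            return self.GATES[action]
        return 0.0


def score_group(env: ToyEnv, candidates: List[str]) -> List[float]:
    # Score every candidate on its own copy so the main episode's
    # gates are never consumed by the scoring pass.
    return [copy.deepcopy(env).step(action) for action in candidates]


env = ToyEnv()
candidates = ["read_logs", "noop", "read_logs", "diagnose"]
rewards = score_group(env, candidates)
best = candidates[rewards.index(max(rewards))]
main_reward = env.step(best)  # only now does the real episode advance
print(rewards, best, main_reward)  # [0.1, 0.0, 0.1, 0.3] diagnose 0.3
```

Scoring directly on `env` instead of on copies would burn the `read_logs` and `diagnose` gates during scoring, and the final `env.step(best)` would return 0.0, which is the symptom described above.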
 
117
 
118
+ After the fix, training looked like this:
 
 
 
 
119
 
120
+ | Task | Baseline | Fine-tuned | Improvement |
121
+ |---|---|---|---|
122
+ | easy | 0.320 | 0.685 | **+0.365** |
123
+ | medium | 0.050 | 0.378 | **+0.328** |
124
+ | hard | 0.190 | 0.869 | **+0.679** |
125
+ | bonus | 0.152 | 0.682 | **+0.530** |
126
 
127
+ The hard task improvement of +0.679 is the number I'm most proud of. The hardest scenario — the one where all services show green and the signal is buried in business metric anomalies — went from barely-better-than-random to scoring 0.869. The model learned to look past the healthy dashboards.
 
128
 
129
  ---
130
+
131
+ ## The Thing That Almost Killed the Training Run
132
+
133
+ At 11pm, training was going beautifully. Episode 25 on the easy task: rolling average 0.900. The model was clearly learning.
134
+
135
+ Then the HuggingFace Space crashed.
136
+
137
+ Not the training Space — the environment Space. The keep-alive server I'd built using Gradio had a bug in Jinja2's template cache that caused a `TypeError: unhashable type: 'dict'` on every request. Gradio was dying silently, port 7860 was returning 500 errors, and HuggingFace's health checker was about to kill the entire training container.
138
+
139
+ I had about ten minutes before the Space went down and took the training run with it.
140
+
141
+ I replaced Gradio with a twelve-line Python `HTTPServer`. No dependencies. No templates. No Jinja2. Just raw HTTP responses with `do_GET` and `do_HEAD` methods that read the training state file and return an HTML page. It can't crash because there's nothing to crash.
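A standard-library keep-alive server of this kind could look roughly like the sketch below. Port 7860 and the `do_GET`/`do_HEAD` pair are from the post; the state file name, its JSON format, the page layout, and the `serve` helper are assumptions, and this version is longer than the twelve lines mentioned above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

STATE_FILE = Path("training_state.json")  # hypothetical file name


def render_page() -> bytes:
    # Fall back to a placeholder if the trainer has not written state yet.
    try:
        state = json.loads(STATE_FILE.read_text())
    except (OSError, ValueError):
        state = {"status": "starting"}
    rows = "".join(f"<li>{key}: {value}</li>" for key, value in state.items())
    return f"<html><body><h1>Training</h1><ul>{rows}</ul></body></html>".encode()


class KeepAlive(BaseHTTPRequestHandler):
    def _headers(self, length: int) -> None:
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(length))
        self.end_headers()

    def do_GET(self) -> None:
        body = render_page()
        self._headers(len(body))
        self.wfile.write(body)

    def do_HEAD(self) -> None:
        # Health checkers often probe with HEAD; answer without a body.
        self._headers(len(render_page()))


def serve(port: int = 7860) -> None:
    """Block forever, answering health checks on the given port."""
    HTTPServer(("0.0.0.0", port), KeepAlive).serve_forever()
```

Calling `serve()` from the training entry point, or from a background thread, keeps the port answering while training writes its state file.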
142
+
143
+ The Space stayed alive. The training ran to completion.
144
+
145
+ Sometimes the boring solution is the right one.
146
+
147
+ ---
148
+
149
+ ## What I Learned
150
+
151
+ **Reward function design is philosophy, not engineering.** Every number encodes a judgment about what matters. -0.25 for failing over the payment service isn't arbitrary — it's a statement that compliance violations are catastrophic, not just suboptimal. The reward function is the most important document in an RL environment, and it should be written like one.
152
+
153
+ **Partial observability forces genuine reasoning.** The decision to show only two log lines was uncomfortable — it made training harder, it made evaluation harder, it made everything slower. But it produced agents that actually learned to query. The easy path is full observability. The interesting path is making agents work for information.
154
+
155
+ **RL bugs are invisible until they aren't.** The reward gate exhaustion bug stayed invisible for hours. The model seemed to train: loss went down, some rewards appeared. Only when I looked closely at the reward distribution per step did I see that the main episode was consistently getting zero after the first action. Debugging RL requires different instincts than debugging regular software. The symptom is always "the model isn't learning." The cause could be anywhere.
156
+
157
+ **Solo hackathons are clarifying.** No coordination overhead. Every decision is made in seconds. The tradeoff is that there's no one to catch your mistakes. I would have found the training bug faster with two people. But the environment design benefited from having one coherent vision all the way through, without the friction of consensus.
158
+
159
+ ---
160
+
161
+ ## What's Next
162
+
163
+ Three directions I'd pursue with more time:
164
+
165
+ **A human baseline.** Time actual on-call engineers on the same tasks and compare their scores to LLM agents. This positions ARIA as a real benchmark with human reference points, not just an RL playground. The hard task — silent data corruption — would be genuinely interesting to watch experienced engineers solve.
166
+
167
+ **Adversarial task generation.** An LLM generates new incident scenarios from operational runbooks. Infinite task variety without manual authoring. The environment would grow with the agent's capability.
168
+
169
+ **Multi-agent cooperation at scale.** Two agents with split observability is a start. The more interesting version models the cost of communication — Slack messages take time, paging someone wakes them up. An agent that must decide *when* to share a finding, not just *what* to share, is a harder and more realistic problem.
170
+
171
+ ---
172
+
173
+ ## Try It
174
+
175
+ The environment is live. The model is on HuggingFace. The API is open.
176
+
177
+ ```bash
178
+ curl -X POST https://arijit-07-devops-incident-response.hf.space/reset \
179
+ -H "Content-Type: application/json" \
180
+ -d '{"task_id": "hard", "seed": 42}'
181
+ ```
182
+
183
+ All services will show green. Good luck finding the signal.
184
+
185
+ Can your agent handle a SEV-1 at 3am?
186
+
187
+ ---
188
+
189
+ **Links:**
190
+ - 🚨 Live Environment: https://huggingface.co/spaces/Arijit-07/devops-incident-response
191
+ - 🧠 Trained Model (8B): https://huggingface.co/Arijit-07/aria-devops-llama8b
192
+ - 💻 GitHub: https://github.com/Twilight-13/devops-incident-response
193
+ - 📖 API Docs: https://arijit-07-devops-incident-response.hf.space/docs
194
+
195
+ *Built solo for the Meta × PyTorch × HuggingFace OpenEnv Hackathon Finals — Bangalore, April 2026*
README.md CHANGED
@@ -19,19 +19,20 @@ tags:
19
  - huggingface
20
  - pytorch
21
  - meta
22
- short_description: RL environment for DevOps incident response agents
 
 
23
  ---
24
 
25
-
26
  # ARIA — DevOps Incident Response
27
  ### *The first OpenEnv RL environment for production incident response*
28
 
29
  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Twilight-13/devops-incident-response/blob/main/train_grpo.ipynb)
30
  [![HF Space](https://img.shields.io/badge/🤗-Live%20Environment-orange)](https://huggingface.co/spaces/Arijit-07/devops-incident-response)
31
- [![Trained Model](https://img.shields.io/badge/🤗-Trained%20Model-blue)](https://huggingface.co/Arijit-07/aria-devops-llama3b)
32
  [![License](https://img.shields.io/badge/License-Apache_2.0-green.svg)](LICENSE)
33
 
34
- > **ARIA** — Adaptive Reward & Incident Architecture
35
  > Built for the Meta × PyTorch × HuggingFace OpenEnv Hackathon Finals | Bangalore, April 2026
36
 
37
  ---
@@ -41,12 +42,13 @@ short_description: RL environment for DevOps incident response agents
41
  | Resource | Link |
42
  |---|---|
43
  | **Live Environment** | https://arijit-07-devops-incident-response.hf.space |
44
- | **Interactive API (Swagger)** | https://arijit-07-devops-incident-response.hf.space/docs |
45
- | **Trained Model (Llama-3B LoRA)** | https://huggingface.co/Arijit-07/aria-devops-llama3b |
46
- | **Training Curve** | https://huggingface.co/Arijit-07/aria-devops-llama3b/resolve/main/training_curve.png |
47
- | **HuggingFace Blog** | https://huggingface.co/blog/Arijit-07/aria-devops-incident-response |
48
  | **GitHub** | https://github.com/Twilight-13/devops-incident-response |
49
- | **Validate (self-test)** | https://arijit-07-devops-incident-response.hf.space/validate |
 
50
 
51
  ---
52
 
@@ -63,7 +65,7 @@ curl -X POST https://arijit-07-devops-incident-response.hf.space/step \
63
  -H "Content-Type: application/json" \
64
  -d '{"action_type": "read_logs", "service": "payment-service"}'
65
 
66
- # 3. Diagnose the root cause (reward: +0.30)
67
  curl -X POST https://arijit-07-devops-incident-response.hf.space/step \
68
  -H "Content-Type: application/json" \
69
  -d '{"action_type": "diagnose", "root_cause": "memory leak in payment-service"}'
@@ -73,510 +75,175 @@ curl -X POST https://arijit-07-devops-incident-response.hf.space/step \
73
  -H "Content-Type: application/json" \
74
  -d '{"action_type": "restart_service", "service": "payment-service"}'
75
 
76
- # 5. See the final score
77
- curl https://arijit-07-devops-incident-response.hf.space/state
78
-
79
- # 6. Validate all 7 tasks pass
80
  curl https://arijit-07-devops-incident-response.hf.space/validate
81
  ```
82
 
83
- ```python
84
- # Or install and use the Python client
85
- pip install git+https://github.com/Twilight-13/devops-incident-response.git
86
-
87
- from devops_incident_response import DevOpsIncidentEnv, Action, ActionType
88
-
89
- env = DevOpsIncidentEnv(task_id="easy", seed=42)
90
- obs = env.reset()
91
- result = env.step(Action(action_type=ActionType.READ_LOGS, service="payment-service"))
92
- print(f"Reward: {result.reward}") # 0.15
93
- ```
94
-
95
  ---
96
 
97
- ## 🎯 The Problem This Solves
98
-
99
- Every software company running microservices faces the same brutal reality: **production incidents are expensive, unpredictable, and happen at 3am.**
100
 
101
- A single SEV-1 incident (a payment service crashing, a data corruption bug silently corrupting prices, a DDoS botnet overwhelming your login endpoint) can cost millions and require hours of expert engineer time to diagnose and fix. On-call rotations are stressful. Tier-2 incidents that follow recognizable patterns are handled by engineers when they could, in principle, be handled by an AI agent.
102
 
103
- **Yet no RL benchmark exists for this domain.**
104
 
105
- SWE-bench tests code generation. WebArena tests web navigation. AgentBench tests general tool use. None of them model **operational intelligence**: the ability to reason under uncertainty about live production systems, gather information strategically, and take precise actions where wrong choices cause additional damage.
106
-
107
- ARIA fills that gap.
108
-
109
- ---
110
-
111
- ## 🏗️ Environment Architecture
112
-
113
- ARIA simulates a production microservices e-commerce platform. Agents interact with the environment through a standard OpenEnv API: `reset()`, `step()`, `state()`.
114
-
115
- ### What the Agent Observes
116
-
117
- Each step returns a structured `Observation` object:
118
-
119
- ```
120
- Observation
121
- ├── step, max_steps, task_id, task_description
122
- ├── services: List[ServiceStatus]
123
- │ ├── name, status (healthy/degraded/down/unknown)
124
- │ ├── cpu_percent, memory_percent
125
- │ ├── error_rate, latency_p99_ms
126
- │ ├── replicas_running, replicas_desired
127
- │ ├── current_version, last_deployed
128
- │ └── sla_breach, minutes_degraded ← SLA tracking per step
129
- ├── active_alerts: List[Alert] ← may include red herrings
130
- ├── recent_logs: Dict[str, List[str]] ← PARTIAL: only 2 lines shown
131
- ├── service_dependencies: List[ServiceDependency] ← call topology
132
- ├── evidence_log: List[EvidenceEntry] ← accumulates across steps
133
- ├── sla_status: Dict[str, str] ← ok/warning/breached
134
- └── available_runbooks: List[str]
135
- ```
136
-
137
- **Key design: Partial Log Observability**
138
-
139
- The agent only sees 2 log lines per service upfront. Full history requires calling `read_logs` explicitly. This models real observability tools (Datadog, Kibana) where engineers run queries — agents must develop a search strategy, not just read everything.
140
-
141
- ### The Services
142
-
143
- | Service | Stack | Role |
144
- |---|---|---|
145
- | `api-gateway` | Go | Routes external requests |
146
- | `payment-service` | Java (Spring) | Processes payments |
147
- | `order-service` | Python | Creates and tracks orders |
148
- | `inventory-service` | Java | Manages product stock |
149
- | `user-service` | Node.js | Auth and profiles |
150
- | `notification-service` | Python | Email and push alerts |
151
- | `data-pipeline-service` | Python | Writes catalog data |
152
- | `product-catalog-service` | Go | Stores and serves product data |
153
- | `price-validation-service` | Python | Validates prices |
154
- | `analytics-service` | Python | Aggregates business metrics |
155
- | `ml-inference-service` | Python | Serves recommendation models |
156
- | `log-aggregator` | Go | Collects and stores logs |
157
-
158
- ### Service Dependency Map
159
-
160
- Every observation includes the call topology — agents can trace cascades:
161
-
162
- ```
163
- api-gateway → order-service → inventory-service
164
- api-gateway → payment-service
165
- order-service → notification-service
166
- data-pipeline-service → product-catalog-service → price-validation-service
167
- ```
168
 
169
  ---
170
 
171
  ## 🎬 The 7 Tasks
172
 
173
- ### Task 1 - Single Service OOM (`easy`)
174
- **Max steps: 15 | Expected strong LLM: 0.85–1.00 | Random agent: 0.05**
175
-
176
- One service crash-loops with an OutOfMemoryError. The affected service rotates by seed across payment-service, order-service, and user-service — with different log formats (Java heap errors, Python memory errors, Node.js heap dumps). A secondary circuit-breaker alert fires on api-gateway as a visible symptom.
177
-
178
- **What makes it interesting:** The agent must identify the ROOT cause service (the one running out of memory) not the SYMPTOM services (everything downstream that's erroring because the root is down).
179
-
180
- **Optimal sequence:** `read_logs` → `read_metrics` → `diagnose` → `restart_service`
181
- **Reward breakdown:** +0.10 read_logs, +0.10 read_metrics, +0.30 diagnose, +0.40 restart = **0.99 with efficiency bonus**
182
-
183
- ---
184
-
185
- ### Task 2 — Cascading Failure (`medium`)
186
- **Max steps: 20 | Expected strong LLM: 0.55–0.75 | Random agent: 0.03**
187
-
188
- A bad deployment of `inventory-service` causes connection pool exhaustion, cascading timeouts to `order-service` and elevated error rates on `api-gateway`. **Red herring:** a `notification-service` HIGH CPU alert fires (scheduled batch job — completely unrelated).
189
-
190
- **What makes it interesting:** The agent must follow the dependency chain backwards. Three services are visibly failing, but only one is the root cause. Touching the wrong service gives -0.15 collateral damage penalty.
191
-
192
- **Optimal sequence:** Investigate `api-gateway` → trace to `order-service` → trace to `inventory-service` → `rollback`
193
- **Reward breakdown:** +0.20 trace cascade, +0.05 runbook, +0.25 diagnose, +0.35 rollback = **0.92**
194
-
195
- ---
196
-
197
- ### Task 3 — Silent Data Corruption (`hard`)
198
- **Max steps: 25 | Expected strong LLM: 0.30–0.50 | Random agent: 0.01**
199
-
200
- **All services show green.** Zero error rates. Normal latency. No standard alerts. The signal is buried in:
201
- - `price-validation-service` WARN logs: 15% price mismatch rate (baseline: 0.2%)
202
- - `analytics-service` anomaly: avg order value $847 vs $89 historical baseline
203
-
204
- Three noise alerts distract: TLS renewal, analytics backlog, replica lag.
205
-
206
- **What makes it interesting:** This requires qualitatively different reasoning — ignoring green health checks, correlating subtle business metric anomalies, and understanding that a data pipeline deployment 2 minutes ago is the causal explanation.
207
-
208
- **Full credit requires BOTH:** `rollback(data-pipeline-service)` AND `alert_oncall` (data audit needed)
209
- **Reward breakdown:** +0.15 subtle signals, +0.10 pipeline metrics, +0.05 runbook, +0.20 diagnose, +0.25 rollback, +0.15 alert_oncall = **0.87**
210
-
211
- ---
212
-
213
- ### Task 4 — Dual Simultaneous Failure (`bonus`)
214
- **Max steps: 25 | Expected strong LLM: 0.35–0.55 | Random agent: 0.01**
215
-
216
- Two completely independent failures at once:
217
- 1. `log-aggregator` disk 100% full — dropping 48k log messages/min
218
- 2. `ml-inference-service` stuck in model checksum reload loop — CPU 99%+
219
-
220
- **What makes it interesting:** Neither failure is related to the other. Solving one doesn't help the other. The agent must decompose and fix independently. This tests whether agents can maintain multiple hypotheses simultaneously.
221
-
222
- **Full credit requires BOTH:** `alert_oncall` (disk cleanup) AND `rollback/restart(ml-inference-service)`
223
- **Optimal score: ~0.77**
224
-
225
- ---
226
-
227
- ### Task 5 — Security Incident: DDoS (`security`)
228
- **Max steps: 20 | Expected strong LLM: 0.40–0.60 | Random agent: 0.01**
229
-
230
- A botnet is targeting the login endpoint with 12,000 req/s from the `185.220.x.x` IP range. Standard rate limiting is ineffective (distributed attack). The access logs show 1,847+ failed login attempts per 60 seconds from that range.
231
-
232
- **New action: `block_ip_range`** — models real network-level DDoS mitigation.
233
- **Wrong actions:** Restarting api-gateway won't help. Scaling up won't help. Must block at network level + escalate to security team.
234
-
235
- **Full credit:** `block_ip_range("185.220.0.0/16")` AND `alert_oncall`
236
- **Optimal score: ~0.80**
237
 
238
  ---
239
 
240
- ### Task 6 — Database Degradation (`database`)
241
- **Max steps: 20 | Expected strong LLM: 0.45–0.65 | Random agent: 0.01**
242
-
243
- A schema migration added a `user_segment` column to the `orders` table 15 minutes ago — without an index. Every query is now doing a full sequential table scan. DB CPU is spiking. The slow query log shows `seq_scan on orders (847ms)`.
244
-
245
- **New action: `create_index`** — models real DBA response to missing indexes.
246
- **Alternative fix:** Rolling back the migration is also accepted for full credit.
247
-
248
- **Optimal score: ~0.80**
249
-
250
- ---
251
-
252
- ### Task 7 — Multi-Region Failover (`failover`)
253
- **Max steps: 25 | Expected strong LLM: 0.35–0.55 | Random agent: 0.01**
254
-
255
- A network partition affects `us-east-1`. Four services support automatic failover to `us-west-2` and should be switched. Two services MUST NOT be failed over:
256
- - `payment-service` — PCI-DSS compliance requires human approval
257
- - `postgres-primary` — replication lag risk causes data loss
258
-
259
- **New action: `failover`** — with `target_region` parameter.
260
- **Heavy penalty: -0.25 per wrong service.** Failing over payment or postgres is catastrophic.
261
-
262
- **The runbook explicitly lists which services are safe** — reading it first is rewarded.
263
- **Optimal score: ~0.70**
264
-
265
- ---
266
-
267
- ### Task 8 — Generated Incident (`generated`)
268
- **Max steps: 20 | Variable difficulty | Seed-deterministic**
269
-
270
- The Incident Generator creates procedural incidents from any integer seed (0–99,999). Same seed always produces the same incident. Different seeds produce unique combinations of:
271
- - 6 failure modes × 8 services × 3 severity levels × 0–3 noise alerts
272
-
273
- ```bash
274
- # Preview any incident before running it
275
- curl "https://arijit-07-devops-incident-response.hf.space/generate/preview?seed=12345"
276
-
277
- # Run it as a full episode
278
- curl -X POST .../reset -d '{"task_id":"generated","seed":12345}'
279
- ```
280
-
281
- ---
282
-
283
- ## 🏆 Reward Function Design
284
-
285
- ### The Formula
286
 
287
  ```
288
  Final Score = Σ(step_rewards)
289
- + efficiency_bonus # (1 - steps/max_steps) × 0.05 if resolved
290
- + diagnosis_precision_bonus # +0.03 if ≥50% keyword overlap, +0.01 if ≥30%
291
- - noop_penalty # (noop_count - 3) × 0.02
292
- - repeat_restart_penalty # (restarts - 1) × 0.05 per service
293
  ```
294
 
295
- All scores are clamped to **(0.001, 0.999)**, never exactly 0 or 1.
296
-
297
- > **Why (0.001, 0.999) not (0, 1)?** GRPO advantage normalization requires non-constant rewards within a group. Hard 0 or 1 creates zero-variance groups where the model doesn't update. The tiny clamp ensures a gradient signal always exists.
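The formula above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the environment's grader: the function name `final_score` and all of its parameter names are hypothetical, and the noop rate is the 0.02 shown in the formula.

```python
def final_score(step_rewards, steps, max_steps, resolved,
                keyword_overlap, noop_count, restarts_per_service):
    """Combine step rewards with the bonuses and penalties from the formula.

    All names here are illustrative assumptions, not the real grader API.
    """
    score = sum(step_rewards)

    # Efficiency bonus applies only when the incident was resolved.
    if resolved:
        score += (1 - steps / max_steps) * 0.05

    # Diagnosis precision bonus: +0.03 at >=50% overlap, +0.01 at >=30%.
    if keyword_overlap >= 0.5:
        score += 0.03
    elif keyword_overlap >= 0.3:
        score += 0.01

    # The first three noops are free; each extra one costs 0.02.
    score -= max(0, noop_count - 3) * 0.02

    # The first restart of a service is free; each repeat costs 0.05.
    score -= sum(max(0, n - 1) * 0.05 for n in restarts_per_service.values())

    # Clamp into the open interval described in the text.
    return min(0.999, max(0.001, score))
```

Keeping the clamp as the last step means every bonus and penalty is applied to the raw sum before the score is bounded.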
298
 
299
- ### Step-Level Rewards
300
-
301
- | Action | Reward | Condition |
302
- |---|---|---|
303
- | `read_logs` (failing service) | +0.10–0.15 | First time only |
304
- | `read_metrics` (failing service) | +0.10 | First time only |
305
- | `read_runbook` (relevant) | +0.05 | Correct runbook for scenario |
306
- | `search_logs` (relevant query) | +0.05 | Query returns useful results |
307
- | `diagnose` (full match) | +0.30–0.35 | ≥50% keyword overlap |
308
- | `diagnose` (partial match) | +0.10–0.15 | ≥30% keyword overlap |
309
- | `restart_service` (correct) | +0.35–0.45 | Root cause service |
310
- | `rollback` (correct) | +0.30–0.40 | Root cause service |
311
- | `block_ip_range` (correct) | +0.40 | Security task, correct CIDR |
312
- | `create_index` (correct) | +0.40 | Database task, correct table/column |
313
- | `failover` (eligible service) | +0.30 | Per correctly failed-over service |
314
- | `alert_oncall` (required) | +0.15 | Hard/security/database/failover tasks |
315
-
316
- ### Penalties (Anti-Gaming)
317
-
318
- | Action | Penalty | Why |
319
  |---|---|---|
320
- | Restart healthy service | -0.15 | Collateral damage: a realistic cost |
321
- | Fix without diagnosing | -0.10 | Blind remediation: models real risk |
322
- | Failover payment-service | -0.25 | PCI-DSS compliance violation |
323
- | Failover postgres-primary | -0.25 | Data loss risk |
324
- | Excessive noops (>3) | -0.04/each | Forces active investigation |
325
- | Repeat restart same service | -0.05/extra | Discourages guess-and-check |
326
-
327
- ### Semantic Diagnosis Matching
328
-
329
- The `diagnose` action uses **keyword overlap**, not exact string matching. An agent saying "memory exhaustion in payment-service" correctly matches the ground truth "memory_leak_payment_service". This is critical for LLM agents that paraphrase; exact string matching would unfairly penalize valid diagnoses.
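One plausible way to compute such an overlap is sketched below. The tokenization (lowercasing and splitting on anything that is not a letter or digit) and the function name `keyword_overlap` are assumptions for illustration; the environment's actual matcher may differ.

```python
import re


def keyword_overlap(diagnosis: str, ground_truth: str) -> float:
    """Fraction of ground-truth keywords that also appear in the diagnosis.

    Hypothetical helper: tokens are lowercase alphanumeric runs, so
    underscores, hyphens and spaces all act as separators.
    """
    def tokens(text: str) -> set:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    truth = tokens(ground_truth)
    if not truth:
        return 0.0
    return len(tokens(diagnosis) & truth) / len(truth)
```

Under this tokenization, the example from the text shares `memory`, `payment` and `service` with the four ground-truth keywords, giving an overlap of 0.75, which clears the 50% threshold for a full match.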
330
 
331
- ### SLA Degradation
332
-
333
- Every step where an incident is unresolved, the environment worsens:
334
- - `down` services: error_rate increases
335
- - `degraded` services: latency_p99 increases
336
- - SLA status: `ok` → `warning` (~3 steps) → `breached` (~7 steps)
337
-
338
- This creates real time pressure and rewards faster resolution.
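The SLA progression can be pictured as a simple threshold function. The sketch below treats the approximate step counts from the list (about 3 and about 7) as hard cut-offs, which is an assumption; `sla_status` and its parameters are hypothetical names.

```python
def sla_status(unresolved_steps: int,
               warning_after: int = 3,
               breach_after: int = 7) -> str:
    """Map the number of unresolved steps to an SLA status string.

    The thresholds are assumed from the approximate values in the text.
    """
    if unresolved_steps >= breach_after:
        return "breached"
    if unresolved_steps >= warning_after:
        return "warning"
    return "ok"
```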
339
 
340
  ---
341
 
342
  ## 🌟 ARIA Features
343
 
344
  ### Curriculum Engine
345
-
346
- The Curriculum Engine tracks agent performance per task using a rolling average of the last 5 episodes.
347
-
348
- - **Promotion:** rolling_avg > 0.75 → advance mastery level (Novice → Intermediate → Advanced → Mastered)
349
- - **Demotion:** rolling_avg < 0.30 → step back mastery level
350
- - **Scaffolding:** if avg < 0.30 over 3+ episodes → provide task-specific hint
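The promotion, demotion and scaffolding rules above could be tracked per task roughly as follows. This is a sketch, not the engine's code: `MasteryTracker` is a hypothetical name, and two behaviours are assumptions made to keep the example well defined (level changes wait for at least 3 recorded episodes, and the window is cleared after a level change).

```python
from collections import deque

LEVELS = ["Novice", "Intermediate", "Advanced", "Mastered"]
MIN_EPISODES = 3  # assumption: no level change or hint before 3 episodes


class MasteryTracker:
    """Rolling-average mastery tracker for a single task (illustrative)."""

    def __init__(self):
        self.scores = deque(maxlen=5)  # last 5 episode scores
        self.level = 0                 # index into LEVELS

    def rolling_avg(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

    def record(self, score: float) -> str:
        """Store an episode score and return the resulting mastery level."""
        self.scores.append(score)
        if len(self.scores) < MIN_EPISODES:
            return LEVELS[self.level]
        avg = self.rolling_avg()
        if avg > 0.75 and self.level < len(LEVELS) - 1:
            self.level += 1
            self.scores.clear()  # assumption: start a fresh window
        elif avg < 0.30 and self.level > 0:
            self.level -= 1
            self.scores.clear()
        return LEVELS[self.level]

    def needs_hint(self) -> bool:
        """Scaffolding rule: average below 0.30 over 3 or more episodes."""
        return len(self.scores) >= MIN_EPISODES and self.rolling_avg() < 0.30
```

A training loop would call `record` after each episode and check `needs_hint` before the next reset.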
351
 
352
  ```bash
353
- GET /curriculum/status # See mastery per task
354
- GET /curriculum/next # Get recommended next task
355
- GET /curriculum/hint/easy # Get scaffolding hint for a task
356
- POST /curriculum/record # Feed your training results in
357
  ```
358
 
359
- **Why this matters for training:** RL fails when agents never see successful trajectories. The curriculum ensures agents always train at the edge of their capability — easy tasks first, harder tasks as they master the fundamentals.
360
-
361
  ### Incident Generator
362
-
363
- Procedural incident generation from seeds. 6 failure modes × 8 services × 3 severities × 0–3 noise alerts = thousands of unique training scenarios.
364
-
365
- **Difficulty formula:** `base_difficulty[failure_mode] + (noise_count × 0.05)`, clamped to 1.0
366
-
367
- | Failure Mode | Base Difficulty |
368
- |---|---|
369
- | oom | 0.20 |
370
- | cascade | 0.50 |
371
- | database | 0.60 |
372
- | security | 0.60 |
373
- | network_partition | 0.70 |
374
- | corruption | 0.80 |
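The difficulty formula and the table translate directly into a lookup plus a clamp. The function name `incident_difficulty` is hypothetical; the base values are copied from the table above.

```python
# Base difficulty per failure mode, as listed in the table.
BASE_DIFFICULTY = {
    "oom": 0.20,
    "cascade": 0.50,
    "database": 0.60,
    "security": 0.60,
    "network_partition": 0.70,
    "corruption": 0.80,
}


def incident_difficulty(failure_mode: str, noise_count: int) -> float:
    """base_difficulty[failure_mode] + noise_count * 0.05, capped at 1.0."""
    return min(1.0, BASE_DIFFICULTY[failure_mode] + noise_count * 0.05)
```

With at most three noise alerts, the hardest case (`corruption` with three alerts) lands at about 0.95, so the cap only matters if more noise is ever allowed.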
375
 
376
  ```bash
377
- GET /generate/preview?seed=42 # Preview without starting
378
- POST /reset # body: {"task_id":"generated","seed":42}
379
  ```
380
 
381
  ### Dual-Agent Mode
382
-
383
- One incident. Two agents. Split observability.
384
-
385
- - **Agent A (Observer):** Sees logs, alerts, evidence. Can ONLY call `share_finding` — passes natural language observations to Agent B. Reward: +0.05 per finding.
386
- - **Agent B (Responder):** Sees metrics, service dependencies, SLA status. Cannot see logs directly. Must rely on Agent A's findings. Executes all real actions.
387
-
388
- Neither agent can solve the incident alone.
389
 
390
  ```bash
391
- # Start a dual-agent session
392
- POST /multi-agent/reset {"task_id":"easy","seed":42}
393
- # returns session_id + split observations
394
-
395
- # Agent A shares a finding
396
- POST /multi-agent/step/a/{session_id} {"finding":"payment-service OOM, memory at 98%"}
397
-
398
- # Agent B takes action (has access to Agent A's findings)
399
- POST /multi-agent/step/b/{session_id} {"action_type":"restart_service","service":"payment-service"}
400
-
401
- # See full session state
402
- GET /multi-agent/state/{session_id}
403
  ```
404
 
405
  ---
406
 
407
- ## 🧠 Training
408
-
409
- ### Model
410
-
411
- **Llama-3.2-3B-Instruct** fine-tuned with **GRPO** (Group Relative Policy Optimization) using HuggingFace TRL and Unsloth.
412
-
413
- - **LoRA:** rank=16, alpha=32, targeting all 7 projection layers
414
- - **Adapter size:** ~97MB
415
- - **Training:** 140 episodes (easy + medium tasks) on Kaggle T4 x2 GPUs
416
- - **Model repo:** https://huggingface.co/Arijit-07/aria-devops-llama3b
417
 
418
- ### Why GRPO?
419
-
420
- GRPO eliminates the value network that PPO requires. For environment-based RL where rewards come from an external API, a value model adds complexity without benefit. GRPO estimates the baseline from a group of 6 completions per step — simpler, more memory-efficient, and well-suited to fast environment APIs.
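The group-relative baseline can be illustrated with the standard GRPO normalization: each completion's reward is compared with the mean of its group and scaled by the group's standard deviation. The sketch below is a generic version, not code from the training notebook; `group_advantages` and `eps` are illustrative names.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize a group of rewards into advantages (generic GRPO form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When all six completions receive the same reward, every advantage is zero and the group contributes no update, which is the zero-variance case the reward design tries to avoid.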
421
-
422
- ### Training Loop
423
-
424
- ```python
425
- # Each training step:
426
- # 1. Generate 6 completions for the current observation
427
- # 2. Score each on a FRESH env snapshot (prevents reward gate exhaustion)
428
- # 3. Normalize rewards to advantages (GRPO)
429
- # 4. Policy gradient update on best completion + KL penalty
430
- # 5. Advance episode with best action
431
-
432
- # Key hyperparameters:
433
- learning_rate = 5e-6
434
- group_size = 6
435
- kl_coefficient = 0.05 # prevents catastrophic forgetting
436
- update_strategy = "episode-level" # one update per full episode
437
- ```
438
-
439
- ### Results
440
-
441
- | | Base Model | Fine-tuned (ep140) |
442
- |---|---|---|
443
- | **Easy task** | 0.000 | 0.150 |
444
- | **Behavior** | Jumps to diagnose immediately | Reads logs on correct service first |
445
- | **Why the difference** | Base model triggers blind remediation penalty | Fine-tuned model learned to gather information before acting |
446
 
447
- **The trained model consistently reads logs on the failing service before acting** — this is the foundational operational behavior: information gathering before remediation. The base model never does this.
 
 
 
 
 
448
 
449
- **Training challenge identified:** The original training loop called `env_step` during group generation, burning reward gates before the best action could advance the episode. Once completions were scored on fresh environment snapshots instead, the model learned step 1 of the optimal policy. With more episodes on the corrected loop, the full sequence is expected to emerge.
450
 
451
- ### Training Notebook
452
 
453
- See `train_grpo.ipynb`: Colab-compatible, runs against the live HF Space API (no local setup needed).
454
 
455
  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Twilight-13/devops-incident-response/blob/main/train_grpo.ipynb)
456
 
457
- Compatible with: TRL, SkyRL, ART, Oumi, Axolotl.
458
-
459
- ---
460
-
461
- ## 🚀 Setup
462
-
463
- ### Docker (Recommended)
464
-
465
- ```bash
466
- docker build -t aria-devops-incident .
467
- docker run -p 7860:7860 aria-devops-incident
468
- curl http://localhost:7860/health
469
- ```
470
-
471
- ### Local Python
472
-
473
- ```bash
474
- pip install -r requirements.txt
475
- uvicorn api:app --host 0.0.0.0 --port 7860
476
- ```
477
-
478
- ### Validate
479
-
480
- ```bash
481
- python validate.py # 22 automated checks, exit 0 = all pass
482
- curl http://localhost:7860/validate
483
- ```
484
-
485
  ---
486
 
487
  ## 📡 API Reference
488
 
489
  | Method | Endpoint | Description |
490
  |---|---|---|
491
- | GET | `/health` | Liveness check; returns `{"status":"ok"}` |
492
- | GET | `/about` | Full environment description (machine-readable) |
493
- | GET | `/tasks` | All 8 tasks with descriptions |
494
- | POST | `/reset` | Start episode: `{"task_id":"easy","seed":42}` |
495
- | POST | `/step` | Take action: Action JSON |
496
- | GET | `/state` | Full state + ground truth + analytics |
497
- | GET | `/validate` | Self-test: random agent on all 7 tasks |
498
- | GET | `/metrics` | Aggregate episode statistics |
499
  | GET | `/leaderboard` | Top 10 episodes |
500
- | WS | `/ws` | WebSocket: real-time agent-environment |
501
- | GET | `/curriculum/status` | Per-task mastery and recommendations |
502
- | GET | `/curriculum/next` | Recommended next task for training |
503
- | GET | `/curriculum/hint/{task_id}` | Scaffolding hint for struggling agents |
504
- | POST | `/curriculum/record` | Feed episode result to curriculum engine |
505
- | GET | `/generate/preview` | Preview procedural incident: `?seed=N` |
506
  | POST | `/multi-agent/reset` | Start dual-agent session |
507
- | POST | `/multi-agent/step/a/{id}` | Agent A shares a finding |
508
- | POST | `/multi-agent/step/b/{id}` | Agent B takes an action |
509
- | GET | `/multi-agent/state/{id}` | Full dual-agent session state |
510
- | GET | `/multi-agent/sessions` | List active sessions |
511
- | GET | `/docs` | Swagger UI — interactive documentation |
512
 
513
  ---
514
 
515
  ## 📊 Benchmark Comparison
516
 
517
- | Benchmark | Domain | Partial Obs | Dense Reward | Multi-Step | Curriculum | Multi-Agent |
518
- |---|---|---|---|---|---|---|
519
- | SWE-bench | Code repair | ✗ | ✗ | ✓ | ✗ | ✗ |
520
- | WebArena | Web navigation | ✓ | ✗ | ✓ | ✗ | ✗ |
521
- | AgentBench | General tools | ✗ | ✗ | ✓ | ✗ | ✗ |
522
- | **ARIA (ours)** | **Incident response** | **✓** | **✓** | **✓** | **✓** | **✓** |
523
 
524
  ---
525
 
526
- ## 🏗️ OpenEnv Compliance
527
 
528
  ```bash
529
- openenv validate .
530
- ```
531
 
532
- - Inherits from `openenv.core.env_client.EnvClient`
533
- - Standard `reset()`, `step()`, `state()` interface
534
- - Valid `openenv.yaml` manifest with all 8 tasks
535
- - FastAPI server with health endpoint
536
- - WebSocket support at `/ws`
537
- - Hosted on HuggingFace Spaces
538
 
539
  ---
540
 
541
- ## 📁 Repository Structure
542
 
543
  ```
544
- aria-devops-incident-response/
545
- ├── api.py # FastAPI app — all endpoints
546
- ├── env.py # DevOpsIncidentEnv — thin dispatcher
547
- ├── models.py # Pydantic models: Action, Observation, State
548
- ├── tasks/
549
- │ ├── base.py # BaseTask ABC, InternalState, reward logic
550
- │ ├── task_easy.py # OOM crash-loop
551
- │ ├── task_medium.py # Cascading failure
552
- │ ├── task_hard.py # Silent data corruption
553
- │ ├── task_bonus.py # Dual simultaneous failure
554
- │ ├── task_security.py # DDoS attack
555
- │ ├── task_database.py # Missing index
556
- │ ├── task_failover.py # Multi-region failover
557
- │ └── task_generated.py # Procedural incidents
558
- ├── curriculum/
559
- │ └── engine.py # CurriculumEngine — adaptive difficulty
560
- ├── generator/
561
- │ └── incident_factory.py # IncidentFactory — procedural generation
562
- ├── multi_agent/
563
- │ └── session.py # DualAgentSession — split observability
564
- ├── graders/
565
- │ └── grader.py # Deterministic episode grader
566
- ├── data/runbooks/ # 6 operational runbooks (Markdown)
567
- ├── client.py # openenv-core EnvClient implementation
568
- ├── inference.py # LLM baseline (CoT + fast modes)
569
- ├── train_grpo.ipynb # GRPO training notebook (Colab-compatible)
570
- ├── validate.py # 22 automated validation checks
571
- └── openenv.yaml # OpenEnv spec manifest
572
  ```
573
 
574
- ---
575
-
576
- ## 📝 License
577
-
578
- Apache 2.0
579
-
580
- ---
581
-
582
- *Built solo for the Meta × PyTorch × HuggingFace OpenEnv Hackathon Finals — Bangalore, April 2026*
 
19
  - huggingface
20
  - pytorch
21
  - meta
22
+ short_description: >
23
+ OpenEnv RL environment for production incident response —
24
+ 7 tasks, curriculum engine, dual-agent mode, trained Llama-3.1-8B
25
  ---
26
 
 
27
  # ARIA — DevOps Incident Response
28
  ### *The first OpenEnv RL environment for production incident response*
29
 
30
  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Twilight-13/devops-incident-response/blob/main/train_grpo.ipynb)
31
  [![HF Space](https://img.shields.io/badge/🤗-Live%20Environment-orange)](https://huggingface.co/spaces/Arijit-07/devops-incident-response)
32
+ [![Trained Model](https://img.shields.io/badge/🤗-Llama--3.1--8B%20Fine--tuned-blue)](https://huggingface.co/Arijit-07/aria-devops-llama8b)
33
  [![License](https://img.shields.io/badge/License-Apache_2.0-green.svg)](LICENSE)
34
 
35
+ > **ARIA** — Adaptive Reward & Incident Architecture
36
  > Built for the Meta × PyTorch × HuggingFace OpenEnv Hackathon Finals | Bangalore, April 2026
37
 
38
  ---
 
42
  | Resource | Link |
43
  |---|---|
44
  | **Live Environment** | https://arijit-07-devops-incident-response.hf.space |
45
+ | **Interactive API** | https://arijit-07-devops-incident-response.hf.space/docs |
46
+ | **Trained Model (8B)** | https://huggingface.co/Arijit-07/aria-devops-llama8b |
47
+ | **Training Curve** | https://huggingface.co/Arijit-07/aria-devops-llama8b/resolve/main/training_curve_8b.png |
48
+ | **Blog Post** | https://huggingface.co/blog/Arijit-07/aria-devops-incident-response |
49
  | **GitHub** | https://github.com/Twilight-13/devops-incident-response |
50
+ | **Validate** | https://arijit-07-devops-incident-response.hf.space/validate |
51
+ | **About (machine-readable)** | https://arijit-07-devops-incident-response.hf.space/about |
52
 
53
  ---
54
 
 
65
  -H "Content-Type: application/json" \
66
  -d '{"action_type": "read_logs", "service": "payment-service"}'
67
 
68
+ # 3. Diagnose (reward: +0.30)
69
  curl -X POST https://arijit-07-devops-incident-response.hf.space/step \
70
  -H "Content-Type: application/json" \
71
  -d '{"action_type": "diagnose", "root_cause": "memory leak in payment-service"}'
 
75
  -H "Content-Type: application/json" \
76
  -d '{"action_type": "restart_service", "service": "payment-service"}'
77
 
78
+ # 5. Validate all 7 tasks pass
 
 
 
79
  curl https://arijit-07-devops-incident-response.hf.space/validate
80
  ```
81
 
 
 
 
 
 
 
 
 
 
 
 
 
82
  ---
83
 
84
+ ## 🎯 The Problem
 
 
85
 
86
+ Every company running microservices faces the same reality: **production incidents are expensive, stressful, and happen at 3am.**
87
 
88
+ SWE-bench tests code generation. WebArena tests web navigation. Nothing trains agents to handle live production incidents — to read logs strategically, trace cascading failures, correlate subtle business anomalies, and apply precise fixes where wrong choices cause collateral damage.
89
 
90
+ **ARIA fills that gap.**
91
 
92
  ---
93
 
94
  ## 🎬 The 7 Tasks
95
 
96
+ | Task | Max Steps | Random | Strong LLM | Scenario |
97
+ |---|---|---|---|---|
98
+ | `easy` | 15 | 0.05 | 0.85–1.00 | Single service OOM crash-loop |
99
+ | `medium` | 20 | 0.03 | 0.55–0.75 | Cascading failure + red herring alert |
100
+ | `hard` | 25 | 0.01 | 0.30–0.50 | **Silent** corruption — all services green |
101
+ | `bonus` | 25 | 0.01 | 0.35–0.55 | Two simultaneous independent failures |
102
+ | `security` | 20 | 0.01 | 0.40–0.60 | DDoS botnet credential stuffing |
103
+ | `database` | 20 | 0.01 | 0.45–0.65 | Missing index — full table scans |
104
+ | `failover` | 25 | 0.01 | 0.35–0.55 | Multi-region network partition |
105
+ | `generated` | 20 | 0.01 | variable | Procedural — seed-deterministic |
106
 
107
  ---
108
 
109
+ ## 🏆 Reward Function
110
 
111
  ```
112
  Final Score = Σ(step_rewards)
113
+ + efficiency_bonus # (1 - steps/max_steps) × 0.05
114
+ + diagnosis_precision # +0.03 if ≥50% keyword overlap
115
+ - noop_penalty # (noops - 3) × 0.02
 
116
  ```
117
 
118
+ Clamped to **(0.001, 0.999)** for GRPO stability.
 
 
119
 
120
+ | Action | Reward | Penalty Triggers |
121
  |---|---|---|
122
+ | `read_logs` correct | +0.15 | Restart healthy service: **-0.15** |
123
+ | `diagnose` full match | +0.35 | Fix without diagnosing: **-0.10** |
124
+ | `restart_service` correct | +0.45 | Wrong failover (payment): **-0.25** |
125
+ | `block_ip_range` | +0.40 | Excessive noops: **-0.04 each** |
126
+ | `alert_oncall` (required) | +0.15 | |
 
 
 
 
 
127
 
128
+ **Semantic matching:** diagnoses are scored by keyword overlap, not exact string matching, so LLMs that paraphrase aren't penalized.
 
 
 
 
 
 
 
129
 
130
  ---
131
 
132
  ## 🌟 ARIA Features
133
 
134
  ### Curriculum Engine
135
+ Rolling average per task (last 5 episodes). Promotes when avg > 0.75. Scaffolds with hints when avg < 0.30. Agents always train at the edge of their capability.
 
 
 
 
 
136
 
137
  ```bash
138
+ GET /curriculum/status
139
+ GET /curriculum/next
140
+ POST /curriculum/record # {"task_id": "easy", "score": 0.85}
 
141
  ```
142
 
 
 
143
  ### Incident Generator
144
+ Seeds 0–99,999 → unique reproducible incidents. 6 failure modes × 8 services × 3 severities × 0–3 noise alerts.
 
 
 
 
 
 
 
 
 
 
 
 
145
 
146
  ```bash
147
+ GET /generate/preview?seed=1337
148
+ POST /reset # {"task_id": "generated", "seed": 1337}
149
  ```
150
 
151
  ### Dual-Agent Mode
152
+ Split observability. Agent A (Observer) sees logs and alerts. Agent B (Responder) sees metrics and dependencies. They coordinate via `share_finding`. Neither can solve the incident alone.
 
 
 
 
 
 
153
 
154
  ```bash
155
+ POST /multi-agent/reset # {"task_id": "easy", "seed": 42}
156
+ POST /multi-agent/step/a/{id} # {"finding": "order-service OOM"}
157
+ POST /multi-agent/step/b/{id} # {"action_type": "restart_service", ...}
 
 
 
 
 
 
 
 
 
158
  ```
159
 
160
  ---
161
 
162
+ ## 🧠 Training Results
 
 
 
 
 
 
 
 
 
163
 
164
+ **Model:** [Arijit-07/aria-devops-llama8b](https://huggingface.co/Arijit-07/aria-devops-llama8b)
165
 
166
+ | Task | Baseline | Fine-tuned | **Improvement** |
167
+ |---|---|---|---|
168
+ | easy | 0.320 | 0.685 | **+0.365** |
169
+ | medium | 0.050 | 0.378 | **+0.328** |
170
+ | hard | 0.190 | 0.869 | **+0.679** |
171
+ | bonus | 0.152 | 0.682 | **+0.530** |
172
 
173
+ ![Training Curve](https://huggingface.co/Arijit-07/aria-devops-llama8b/resolve/main/training_curve_8b.png)
174
 
175
+ **Setup:** GRPO · Llama-3.1-8B · LoRA rank=32 · 160 episodes · NVIDIA L4 · 162 minutes · Unsloth + HuggingFace TRL
176
 
177
+ **Key fix:** Scoring group completions on fresh environment snapshots prevents reward gate exhaustion during GRPO group generation.
178
 
179
  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Twilight-13/devops-incident-response/blob/main/train_grpo.ipynb)
180
 
181
  ---
182
 
183
  ## 📡 API Reference
184
 
185
  | Method | Endpoint | Description |
186
  |---|---|---|
187
+ | GET | `/health` | Liveness check |
188
+ | GET | `/about` | Full machine-readable description |
189
+ | GET | `/tasks` | All 8 tasks |
190
+ | POST | `/reset` | Start episode |
191
+ | POST | `/step` | Take action |
192
+ | GET | `/state` | Full state + ground truth |
193
+ | GET | `/validate` | Self-test all 7 tasks |
194
+ | GET | `/metrics` | Aggregate statistics |
195
  | GET | `/leaderboard` | Top 10 episodes |
196
+ | WS | `/ws` | WebSocket real-time |
197
+ | GET | `/curriculum/status` | Per-task mastery |
198
+ | GET | `/curriculum/next` | Recommended task |
199
+ | POST | `/curriculum/record` | Feed training results |
200
+ | GET | `/generate/preview` | Preview procedural incident |
 
201
  | POST | `/multi-agent/reset` | Start dual-agent session |
202
+ | POST | `/multi-agent/step/a/{id}` | Agent A shares finding |
203
+ | POST | `/multi-agent/step/b/{id}` | Agent B takes action |
204
+ | GET | `/docs` | Swagger UI |
 
 
205
 
206
  ---
207
 
208
  ## 📊 Benchmark Comparison
209
 
210
+ | Benchmark | Domain | Partial Obs | Dense Reward | Curriculum | Multi-Agent |
211
+ |---|---|---|---|---|---|
212
+ | SWE-bench | Code repair | ✗ | ✗ | ✗ | ✗ |
213
+ | WebArena | Web navigation | ✓ | ✗ | ✗ | ✗ |
214
+ | AgentBench | General tools | ✗ | ✗ | ✗ | ✗ |
215
+ | **ARIA** | **Incident response** | **✓** | **✓** | **✓** | **✓** |
216
 
217
  ---
218
 
219
+ ## 🚀 Setup
220
 
221
  ```bash
222
+ docker build -t aria-devops-incident .
223
+ docker run -p 7860:7860 aria-devops-incident
224
 
225
+ # Or local
226
+ pip install -r requirements.txt
227
+ uvicorn api:app --host 0.0.0.0 --port 7860
228
+ ```
 
 
229
 
230
  ---
231
 
232
+ ## 📁 Structure
233
 
234
  ```
235
+ ├── api.py / server/app.py # FastAPI — all endpoints
236
+ ├── env.py # Environment dispatcher
237
+ ├── models.py # Pydantic models
238
+ ├── tasks/ # 7 tasks + generated
239
+ ├── curriculum/engine.py # Adaptive difficulty
240
+ ├── generator/ # Procedural incidents
241
+ ├── multi_agent/session.py # Dual-agent mode
242
+ ├── graders/grader.py # Deterministic grader
243
+ ├── demo_llm.py # Live terminal demo
244
+ ├── train_grpo.ipynb # Training notebook
245
+ ├── BLOG.md # Project story
246
+ └── openenv.yaml # OpenEnv manifest
247
  ```
248
 
249
+ Apache 2.0 · *Built solo for the Meta × PyTorch × HuggingFace OpenEnv Hackathon Finals — Bangalore, April 2026*
 
 
 
 
 
 
 
 
README_github.md CHANGED
@@ -3,10 +3,10 @@
3
 
4
  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Twilight-13/devops-incident-response/blob/main/train_grpo.ipynb)
5
  [![HF Space](https://img.shields.io/badge/🤗-Live%20Environment-orange)](https://huggingface.co/spaces/Arijit-07/devops-incident-response)
6
- [![Trained Model](https://img.shields.io/badge/🤗-Trained%20Model-blue)](https://huggingface.co/Arijit-07/aria-devops-llama3b)
7
  [![License](https://img.shields.io/badge/License-Apache_2.0-green.svg)](LICENSE)
8
 
9
- > **ARIA** — Adaptive Reward & Incident Architecture
10
  > Built for the Meta × PyTorch × HuggingFace OpenEnv Hackathon Finals | Bangalore, April 2026
11
 
12
  ---
@@ -16,12 +16,13 @@
16
  | Resource | Link |
17
  |---|---|
18
  | **Live Environment** | https://arijit-07-devops-incident-response.hf.space |
19
- | **Interactive API (Swagger)** | https://arijit-07-devops-incident-response.hf.space/docs |
20
- | **Trained Model (Llama-3B LoRA)** | https://huggingface.co/Arijit-07/aria-devops-llama3b |
21
- | **Training Curve** | https://huggingface.co/Arijit-07/aria-devops-llama3b/resolve/main/training_curve.png |
22
- | **HuggingFace Blog** | https://huggingface.co/blog/Arijit-07/aria-devops-incident-response |
23
  | **GitHub** | https://github.com/Twilight-13/devops-incident-response |
24
- | **Validate (self-test)** | https://arijit-07-devops-incident-response.hf.space/validate |
 
25
 
26
  ---
27
 
@@ -38,7 +39,7 @@ curl -X POST https://arijit-07-devops-incident-response.hf.space/step \
38
  -H "Content-Type: application/json" \
39
  -d '{"action_type": "read_logs", "service": "payment-service"}'
40
 
41
- # 3. Diagnose the root cause (reward: +0.30)
42
  curl -X POST https://arijit-07-devops-incident-response.hf.space/step \
43
  -H "Content-Type: application/json" \
44
  -d '{"action_type": "diagnose", "root_cause": "memory leak in payment-service"}'
@@ -48,510 +49,175 @@ curl -X POST https://arijit-07-devops-incident-response.hf.space/step \
48
  -H "Content-Type: application/json" \
49
  -d '{"action_type": "restart_service", "service": "payment-service"}'
50
 
51
- # 5. See the final score
52
- curl https://arijit-07-devops-incident-response.hf.space/state
53
-
54
- # 6. Validate all 7 tasks pass
55
  curl https://arijit-07-devops-incident-response.hf.space/validate
56
  ```
57
 
58
- ```python
59
- # Or install and use the Python client
60
- pip install git+https://github.com/Twilight-13/devops-incident-response.git
61
-
62
- from devops_incident_response import DevOpsIncidentEnv, Action, ActionType
63
-
64
- env = DevOpsIncidentEnv(task_id="easy", seed=42)
65
- obs = env.reset()
66
- result = env.step(Action(action_type=ActionType.READ_LOGS, service="payment-service"))
67
- print(f"Reward: {result.reward}") # 0.15
68
- ```
69
-
70
  ---
71
 
72
- ## 🎯 The Problem This Solves
73
 
74
- Every software company running microservices faces the same brutal reality: **production incidents are expensive, unpredictable, and happen at 3am.**
75
 
76
- A single SEV-1 incident (a payment service crashing, silent data corruption skewing prices, a DDoS botnet overwhelming your login endpoint) can cost millions and require hours of expert engineer time to diagnose and fix. On-call rotations are stressful. Tier-2 incidents that follow recognizable patterns are handled by engineers when they could, in principle, be handled by an AI agent.
77
 
78
- **Yet no RL benchmark exists for this domain.**
79
-
80
- SWE-bench tests code generation. WebArena tests web navigation. AgentBench tests general tool use. None of them model **operational intelligence** — the ability to reason under uncertainty about live production systems, gather information strategically, and take precise actions where wrong choices cause additional damage.
81
-
82
- ARIA fills that gap.
83
-
84
- ---
85
-
86
- ## 🏗️ Environment Architecture
87
-
88
- ARIA simulates a production microservices e-commerce platform. Agents interact with the environment through a standard OpenEnv API: `reset()`, `step()`, `state()`.
89
-
90
- ### What the Agent Observes
91
-
92
- Each step returns a structured `Observation` object:
93
-
94
- ```
95
- Observation
96
- ├── step, max_steps, task_id, task_description
97
- ├── services: List[ServiceStatus]
98
- │ ├── name, status (healthy/degraded/down/unknown)
99
- │ ├── cpu_percent, memory_percent
100
- │ ├── error_rate, latency_p99_ms
101
- │ ├── replicas_running, replicas_desired
102
- │ ├── current_version, last_deployed
103
- │ └── sla_breach, minutes_degraded ← SLA tracking per step
104
- ├── active_alerts: List[Alert] ← may include red herrings
105
- ├── recent_logs: Dict[str, List[str]] ← PARTIAL: only 2 lines shown
106
- ├── service_dependencies: List[ServiceDependency] ← call topology
107
- ├── evidence_log: List[EvidenceEntry] ← accumulates across steps
108
- ├── sla_status: Dict[str, str] ← ok/warning/breached
109
- └── available_runbooks: List[str]
110
- ```
111
-
112
- **Key design: Partial Log Observability**
113
-
114
- The agent only sees 2 log lines per service upfront. Full history requires calling `read_logs` explicitly. This models real observability tools (Datadog, Kibana) where engineers run queries — agents must develop a search strategy, not just read everything.
115
-
116
- ### The Services
117
-
118
- | Service | Stack | Role |
119
- |---|---|---|
120
- | `api-gateway` | Go | Routes external requests |
121
- | `payment-service` | Java (Spring) | Processes payments |
122
- | `order-service` | Python | Creates and tracks orders |
123
- | `inventory-service` | Java | Manages product stock |
124
- | `user-service` | Node.js | Auth and profiles |
125
- | `notification-service` | Python | Email and push alerts |
126
- | `data-pipeline-service` | Python | Writes catalog data |
127
- | `product-catalog-service` | Go | Stores and serves product data |
128
- | `price-validation-service` | Python | Validates prices |
129
- | `analytics-service` | Python | Aggregates business metrics |
130
- | `ml-inference-service` | Python | Serves recommendation models |
131
- | `log-aggregator` | Go | Collects and stores logs |
132
-
133
- ### Service Dependency Map
134
-
135
- Every observation includes the call topology — agents can trace cascades:
136
-
137
- ```
138
- api-gateway → order-service → inventory-service
139
- api-gateway → payment-service
140
- order-service → notification-service
141
- data-pipeline-service → product-catalog-service → price-validation-service
142
- ```
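The cascade tracing described above can be sketched in a few lines. This is an illustrative heuristic only, not the environment's grader: the edges are copied from the dependency map, while `deepest_unhealthy` and the `unhealthy` set are hypothetical names introduced for the example.

```python
from collections import deque

# Call edges copied from the dependency map above: caller -> callees.
DEPENDENCIES = {
    "api-gateway": ["order-service", "payment-service"],
    "order-service": ["inventory-service", "notification-service"],
    "data-pipeline-service": ["product-catalog-service"],
    "product-catalog-service": ["price-validation-service"],
}


def deepest_unhealthy(entry, unhealthy):
    """Follow call edges from `entry` through unhealthy services.

    Returns the unhealthy services reached that have no unhealthy callee,
    which makes them root-cause candidates under this heuristic.
    """
    candidates = []
    seen = {entry}
    queue = deque([entry])
    while queue:
        svc = queue.popleft()
        bad_callees = [d for d in DEPENDENCIES.get(svc, []) if d in unhealthy]
        if svc in unhealthy and not bad_callees:
            candidates.append(svc)
        for callee in bad_callees:
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return candidates
```

A red-herring alert on a leaf service would also surface as a candidate here, so an agent would still need logs and metrics to confirm which candidate is the real root cause.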
143
 
144
  ---
145
 
146
  ## 🎬 The 7 Tasks
147
 
148
- ### Task 1: Single Service OOM (`easy`)
149
- **Max steps: 15 | Expected strong LLM: 0.85–1.00 | Random agent: 0.05**
150
-
151
- One service crash-loops with an OutOfMemoryError. The affected service rotates by seed across payment-service, order-service, and user-service — with different log formats (Java heap errors, Python memory errors, Node.js heap dumps). A secondary circuit-breaker alert fires on api-gateway as a visible symptom.
152
-
153
- **What makes it interesting:** The agent must identify the ROOT cause service (the one running out of memory), not the SYMPTOM services (everything downstream that's erroring because the root is down).
154
-
155
- **Optimal sequence:** `read_logs` → `read_metrics` → `diagnose` → `restart_service`
156
- **Reward breakdown:** +0.10 read_logs, +0.10 read_metrics, +0.30 diagnose, +0.40 restart = **0.99 with efficiency bonus**
157
-
158
- ---
159
-
160
- ### Task 2 — Cascading Failure (`medium`)
161
- **Max steps: 20 | Expected strong LLM: 0.55–0.75 | Random agent: 0.03**
162
-
163
- A bad deployment of `inventory-service` causes connection pool exhaustion, cascading timeouts to `order-service` and elevated error rates on `api-gateway`. **Red herring:** a `notification-service` HIGH CPU alert fires (scheduled batch job — completely unrelated).
164
-
165
- **What makes it interesting:** The agent must follow the dependency chain backwards. Three services are visibly failing, but only one is the root cause. Touching the wrong service gives -0.15 collateral damage penalty.
166
-
167
- **Optimal sequence:** Investigate `api-gateway` → trace to `order-service` → trace to `inventory-service` → `rollback`
168
- **Reward breakdown:** +0.20 trace cascade, +0.05 runbook, +0.25 diagnose, +0.35 rollback = **0.92**
169
-
170
- ---
171
-
172
- ### Task 3 — Silent Data Corruption (`hard`)
173
- **Max steps: 25 | Expected strong LLM: 0.30–0.50 | Random agent: 0.01**
174
-
175
- **All services show green.** Zero error rates. Normal latency. No standard alerts. The signal is buried in:
176
- - `price-validation-service` WARN logs: 15% price mismatch rate (baseline: 0.2%)
177
- - `analytics-service` anomaly: avg order value $847 vs $89 historical baseline
178
-
179
- Three noise alerts distract: TLS renewal, analytics backlog, replica lag.
180
-
181
- **What makes it interesting:** This requires qualitatively different reasoning — ignoring green health checks, correlating subtle business metric anomalies, and understanding that a data pipeline deployment 2 minutes ago is the causal explanation.
182
-
183
- **Full credit requires BOTH:** `rollback(data-pipeline-service)` AND `alert_oncall` (data audit needed)
184
- **Reward breakdown:** +0.15 subtle signals, +0.10 pipeline metrics, +0.05 runbook, +0.20 diagnose, +0.25 rollback, +0.15 alert_oncall = **0.87**
185
-
186
- ---
187
-
188
- ### Task 4 — Dual Simultaneous Failure (`bonus`)
189
- **Max steps: 25 | Expected strong LLM: 0.35–0.55 | Random agent: 0.01**
190
-
191
- Two completely independent failures at once:
192
- 1. `log-aggregator` disk 100% full — dropping 48k log messages/min
193
- 2. `ml-inference-service` stuck in model checksum reload loop — CPU 99%+
194
-
195
- **What makes it interesting:** Neither failure is related to the other. Solving one doesn't help the other. The agent must decompose and fix independently. This tests whether agents can maintain multiple hypotheses simultaneously.
196
-
197
- **Full credit requires BOTH:** `alert_oncall` (disk cleanup) AND `rollback/restart(ml-inference-service)`
198
- **Optimal score: ~0.77**
199
-
200
- ---
201
-
202
- ### Task 5 — Security Incident: DDoS (`security`)
203
- **Max steps: 20 | Expected strong LLM: 0.40–0.60 | Random agent: 0.01**
204
-
205
- A botnet is targeting the login endpoint with 12,000 req/s from the `185.220.x.x` IP range. Standard rate limiting is ineffective (distributed attack). The access logs show 1,847+ failed login attempts per 60 seconds from that range.
206
-
207
- **New action: `block_ip_range`** — models real network-level DDoS mitigation.
208
- **Wrong actions:** Restarting api-gateway won't help. Scaling up won't help. Must block at network level + escalate to security team.
209
-
210
- **Full credit:** `block_ip_range("185.220.0.0/16")` AND `alert_oncall`
211
- **Optimal score: ~0.80**
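To make the CIDR in the full-credit action concrete, here is a small check using Python's standard `ipaddress` module. It only shows that the `185.220.x.x` addresses fall inside `185.220.0.0/16`; the helper name `in_blocked_range` is hypothetical, and nothing here reflects how the environment validates `block_ip_range` internally.

```python
import ipaddress


def in_blocked_range(ip, cidr="185.220.0.0/16"):
    """Return True if `ip` falls inside the CIDR block passed to block_ip_range."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr)
```

A narrower block such as a /24 would cover only part of the range described in the scenario.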
212
-
213
- ---
214
-
215
- ### Task 6 — Database Degradation (`database`)
216
- **Max steps: 20 | Expected strong LLM: 0.45–0.65 | Random agent: 0.01**
217
-
218
- A schema migration added a `user_segment` column to the `orders` table 15 minutes ago — without an index. Every query is now doing a full sequential table scan. DB CPU is spiking. The slow query log shows `seq_scan on orders (847ms)`.
219
-
220
- **New action: `create_index`** — models real DBA response to missing indexes.
221
- **Alternative fix:** Rolling back the migration is also accepted for full credit.
222
-
223
- **Optimal score: ~0.80**
224
-
225
- ---
226
-
227
- ### Task 7 — Multi-Region Failover (`failover`)
228
- **Max steps: 25 | Expected strong LLM: 0.35–0.55 | Random agent: 0.01**
229
-
230
- A network partition affects `us-east-1`. Four services support automatic failover to `us-west-2` and should be switched. Two services MUST NOT be failed over:
231
- - `payment-service` — PCI-DSS compliance requires human approval
232
- - `postgres-primary` — replication lag risk causes data loss
233
-
234
- **New action: `failover`** — with `target_region` parameter.
235
- **Heavy penalty: -0.25 per wrong service.** Failing over payment or postgres is catastrophic.
236
-
237
- **The runbook explicitly lists which services are safe** — reading it first is rewarded.
238
- **Optimal score: ~0.70**
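The failover rule can be sketched as a lookup. The two protected services and the reward and penalty values come from this README; the function name, the `eligible` parameter, and the zero reward for services in neither set are assumptions made for illustration.

```python
# Services that must never be failed over automatically (from the task description).
PROTECTED = {"payment-service", "postgres-primary"}


def failover_reward(service, eligible):
    """Reward for one `failover` action under the rules stated above.

    `eligible` is the set of services the runbook lists as safe; the README
    does not name them, so the caller supplies them.
    """
    if service in PROTECTED:
        return -0.25  # heavy penalty: compliance violation or data loss risk
    if service in eligible:
        return 0.30  # per correctly failed-over service
    return 0.0  # assumption: no reward for services in neither set
```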
239
-
240
- ---
241
-
242
- ### Task 8 — Generated Incident (`generated`)
243
- **Max steps: 20 | Variable difficulty | Seed-deterministic**
244
-
245
- The Incident Generator creates procedural incidents from any integer seed (0–99,999). Same seed always produces the same incident. Different seeds produce unique combinations of:
246
- - 6 failure modes × 8 services × 3 severity levels × 0–3 noise alerts
247
-
248
- ```bash
249
- # Preview any incident before running it
250
- curl "https://arijit-07-devops-incident-response.hf.space/generate/preview?seed=12345"
251
-
252
- # Run it as a full episode
253
- curl -X POST .../reset -d '{"task_id":"generated","seed":12345}'
254
- ```
255
 
256
  ---
257
 
258
- ## 🏆 Reward Function Design
259
-
260
- ### The Formula
261
 
262
  ```
263
  Final Score = Σ(step_rewards)
264
- + efficiency_bonus # (1 - steps/max_steps) × 0.05 if resolved
265
- + diagnosis_precision_bonus # +0.03 if ≥50% keyword overlap, +0.01 if ≥30%
266
- - noop_penalty # (noop_count - 3) × 0.02
267
- - repeat_restart_penalty # (restarts - 1) × 0.05 per service
268
  ```
269
 
270
- All scores are clamped to **(0.001, 0.999)**, never exactly 0 or 1.
271
-
272
- > **Why (0.001, 0.999) not (0, 1)?** GRPO advantage normalization requires non-constant rewards within a group. Hard 0 or 1 creates zero-variance groups where the model doesn't update. The tiny clamp ensures a gradient signal always exists.
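As a rough sketch, the formula and clamp above could be written as follows. The constants are taken from the formula; the function name, its parameters, and the choice to floor each penalty at zero are assumptions, so treat this as a reading aid rather than the grader's actual code.

```python
def final_score(step_rewards, steps, max_steps, resolved,
                keyword_overlap, noop_count, restarts_per_service):
    """Combine step rewards, bonuses and penalties, then clamp to (0.001, 0.999)."""
    total = sum(step_rewards)

    # Efficiency bonus only applies when the incident was resolved.
    if resolved:
        total += (1 - steps / max_steps) * 0.05

    # Diagnosis precision bonus, based on keyword overlap with the ground truth.
    if keyword_overlap >= 0.5:
        total += 0.03
    elif keyword_overlap >= 0.3:
        total += 0.01

    # The first three noops are free; each one after that costs 0.02.
    total -= max(0, noop_count - 3) * 0.02

    # The first restart of a service is free; each repeat costs 0.05.
    total -= sum(max(0, n - 1) * 0.05 for n in restarts_per_service.values())

    return min(0.999, max(0.001, total))
```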
273
 
274
- ### Step-Level Rewards
275
-
276
- | Action | Reward | Condition |
277
- |---|---|---|
278
- | `read_logs` (failing service) | +0.10–0.15 | First time only |
279
- | `read_metrics` (failing service) | +0.10 | First time only |
280
- | `read_runbook` (relevant) | +0.05 | Correct runbook for scenario |
281
- | `search_logs` (relevant query) | +0.05 | Query returns useful results |
282
- | `diagnose` (full match) | +0.30–0.35 | ≥50% keyword overlap |
283
- | `diagnose` (partial match) | +0.10–0.15 | ≥30% keyword overlap |
284
- | `restart_service` (correct) | +0.35–0.45 | Root cause service |
285
- | `rollback` (correct) | +0.30–0.40 | Root cause service |
286
- | `block_ip_range` (correct) | +0.40 | Security task, correct CIDR |
287
- | `create_index` (correct) | +0.40 | Database task, correct table/column |
288
- | `failover` (eligible service) | +0.30 | Per correctly failed-over service |
289
- | `alert_oncall` (required) | +0.15 | Hard/security/database/failover tasks |
290
-
291
- ### Penalties (Anti-Gaming)
292
-
293
- | Action | Penalty | Why |
294
  |---|---|---|
295
- | Restart healthy service | -0.15 | Collateral damage has a realistic cost |
296
- | Fix without diagnosing | -0.10 | Blind remediation models a real-world risk |
297
- | Failover payment-service | -0.25 | PCI-DSS compliance violation |
298
- | Failover postgres-primary | -0.25 | Data loss risk |
299
- | Excessive noops (>3) | -0.04/each | Forces active investigation |
300
- | Repeat restart same service | -0.05/extra | Discourages guess-and-check |
301
-
302
- ### Semantic Diagnosis Matching
303
-
304
- The `diagnose` action uses **keyword overlap**, not exact string matching. An agent saying "memory exhaustion in payment-service" correctly matches the ground truth "memory_leak_payment_service". This is critical for LLM agents that paraphrase: exact string matching would unfairly penalize valid diagnoses.
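One plausible way to compute such an overlap is sketched here. The tokenization rule (lowercase alphanumeric runs) and the choice to measure overlap as a fraction of the ground-truth keywords are assumptions; the function names are hypothetical and the real matcher may differ.

```python
import re


def keyword_overlap(diagnosis, ground_truth):
    """Fraction of ground-truth keywords that also appear in the diagnosis."""
    truth = set(re.findall(r"[a-z0-9]+", ground_truth.lower()))
    said = set(re.findall(r"[a-z0-9]+", diagnosis.lower()))
    if not truth:
        return 0.0
    return len(truth & said) / len(truth)


def match_level(overlap):
    """Map an overlap fraction to the thresholds in the reward table."""
    if overlap >= 0.5:
        return "full"
    if overlap >= 0.3:
        return "partial"
    return "none"
```

Under these assumptions, the README's example diagnosis shares three of the four ground-truth keywords (`memory`, `payment`, `service`), which clears the 50% threshold.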
305
 
306
- ### SLA Degradation
307
-
308
- Every step where an incident is unresolved, the environment worsens:
309
- - `down` services: error_rate increases
310
- - `degraded` services: latency_p99 increases
311
- - SLA status: `ok` → `warning` (~3 steps) → `breached` (~7 steps)
312
-
313
- This creates real time pressure and rewards faster resolution.
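The SLA progression can be pictured as a simple threshold function. The README gives the thresholds only approximately (about 3 and about 7 steps), so the exact cutoffs and the name `sla_status` are assumptions.

```python
def sla_status(steps_unresolved, warning_after=3, breach_after=7):
    """Map the number of unresolved steps to an SLA status string."""
    if steps_unresolved >= breach_after:
        return "breached"
    if steps_unresolved >= warning_after:
        return "warning"
    return "ok"
```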
314
 
315
  ---
316
 
317
  ## 🌟 ARIA Features
318
 
319
  ### Curriculum Engine
320
-
321
- The Curriculum Engine tracks agent performance per task using a rolling average of the last 5 episodes.
322
-
323
- - **Promotion:** rolling_avg > 0.75 → advance mastery level (Novice → Intermediate → Advanced → Mastered)
324
- - **Demotion:** rolling_avg < 0.30 → step back mastery level
325
- - **Scaffolding:** if avg < 0.30 over 3+ episodes → provide task-specific hint
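A minimal sketch of these three rules, assuming the rolling average is re-evaluated after every recorded episode (the README does not say when promotion is checked). `TaskMastery` and its methods are hypothetical names, not the engine's API.

```python
from collections import deque

LEVELS = ["Novice", "Intermediate", "Advanced", "Mastered"]


class TaskMastery:
    """Tracks one task's mastery level from a rolling window of scores."""

    def __init__(self, window=5):
        self.scores = deque(maxlen=window)  # keeps only the last `window` episodes
        self.level = 0

    def rolling_avg(self):
        return sum(self.scores) / len(self.scores)

    def record(self, score):
        """Record an episode score and apply promotion or demotion."""
        self.scores.append(score)
        avg = self.rolling_avg()
        if avg > 0.75 and self.level < len(LEVELS) - 1:
            self.level += 1
        elif avg < 0.30 and self.level > 0:
            self.level -= 1
        return LEVELS[self.level]

    def needs_hint(self):
        """Scaffolding applies when the average is below 0.30 over 3+ episodes."""
        return len(self.scores) >= 3 and self.rolling_avg() < 0.30
```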
326
 
327
  ```bash
328
- GET /curriculum/status # See mastery per task
329
- GET /curriculum/next # Get recommended next task
330
- GET /curriculum/hint/easy # Get scaffolding hint for a task
331
- POST /curriculum/record # Feed your training results in
332
  ```
333
 
334
- **Why this matters for training:** RL fails when agents never see successful trajectories. The curriculum ensures agents always train at the edge of their capability — easy tasks first, harder tasks as they master the fundamentals.
335
-
336
  ### Incident Generator
337
-
338
- Procedural incident generation from seeds. 6 failure modes × 8 services × 3 severities × 0–3 noise alerts = thousands of unique training scenarios.
339
-
340
- **Difficulty formula:** `base_difficulty[failure_mode] + (noise_count × 0.05)`, clamped to 1.0
341
-
342
- | Failure Mode | Base Difficulty |
343
- |---|---|
344
- | oom | 0.20 |
345
- | cascade | 0.50 |
346
- | database | 0.60 |
347
- | security | 0.60 |
348
- | network_partition | 0.70 |
349
- | corruption | 0.80 |
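The difficulty formula and the seed determinism can be illustrated together. The base values are copied from the table; `sample_incident` is a hypothetical stand-in for the real generator and only draws a failure mode and a noise count, since the sampling details are not documented here.

```python
import random

# Base difficulty per failure mode, copied from the table above.
BASE_DIFFICULTY = {
    "oom": 0.20,
    "cascade": 0.50,
    "database": 0.60,
    "security": 0.60,
    "network_partition": 0.70,
    "corruption": 0.80,
}


def difficulty(failure_mode, noise_count):
    """base_difficulty[failure_mode] + noise_count * 0.05, capped at 1.0."""
    return min(1.0, BASE_DIFFICULTY[failure_mode] + noise_count * 0.05)


def sample_incident(seed):
    """Draw a toy incident from a seeded RNG so the same seed repeats exactly."""
    rng = random.Random(seed)
    mode = rng.choice(sorted(BASE_DIFFICULTY))
    noise = rng.randint(0, 3)
    return {"failure_mode": mode, "noise_alerts": noise,
            "difficulty": difficulty(mode, noise)}
```

Using a private `random.Random(seed)` instance rather than the module-level functions keeps each incident reproducible regardless of other random calls in the process.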
350
 
351
  ```bash
352
- GET /generate/preview?seed=42 # Preview without starting
353
- POST /reset # body: {"task_id":"generated","seed":42}
354
  ```
355
 
356
  ### Dual-Agent Mode
357
-
358
- One incident. Two agents. Split observability.
359
-
360
- - **Agent A (Observer):** Sees logs, alerts, evidence. Can ONLY call `share_finding` — passes natural language observations to Agent B. Reward: +0.05 per finding.
361
- - **Agent B (Responder):** Sees metrics, service dependencies, SLA status. Cannot see logs directly. Must rely on Agent A's findings. Executes all real actions.
362
-
363
- Neither agent can solve the incident alone.
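One way to picture the split is as two filtered views of a single observation. The field names come from the `Observation` structure in this README; which fields go to which agent is inferred from the descriptions above, and `split_observation` is a hypothetical helper, not the session implementation.

```python
# Fields each agent is described as seeing (inferred from the text above).
OBSERVER_FIELDS = ("recent_logs", "active_alerts", "evidence_log")
RESPONDER_FIELDS = ("services", "service_dependencies", "sla_status")


def split_observation(obs):
    """Return (agent_a_view, agent_b_view) from one full observation dict."""
    agent_a = {k: obs[k] for k in OBSERVER_FIELDS if k in obs}
    agent_b = {k: obs[k] for k in RESPONDER_FIELDS if k in obs}
    return agent_a, agent_b
```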
364
 
365
  ```bash
366
- # Start a dual-agent session
367
- POST /multi-agent/reset {"task_id":"easy","seed":42}
368
- # returns session_id + split observations
369
-
370
- # Agent A shares a finding
371
- POST /multi-agent/step/a/{session_id} {"finding":"payment-service OOM, memory at 98%"}
372
-
373
- # Agent B takes action (has access to Agent A's findings)
374
- POST /multi-agent/step/b/{session_id} {"action_type":"restart_service","service":"payment-service"}
375
-
376
- # See full session state
377
- GET /multi-agent/state/{session_id}
378
  ```
379
 
380
  ---
381
 
382
- ## 🧠 Training
383
-
384
- ### Model
385
-
386
- **Llama-3.2-3B-Instruct** fine-tuned with **GRPO** (Group Relative Policy Optimization) using HuggingFace TRL and Unsloth.
387
-
388
- - **LoRA:** rank=16, alpha=32, targeting all 7 projection layers
389
- - **Adapter size:** ~97MB
390
- - **Training:** 140 episodes (easy + medium tasks) on Kaggle T4 x2 GPUs
391
- - **Model repo:** https://huggingface.co/Arijit-07/aria-devops-llama3b
392
 
393
- ### Why GRPO?
394
-
395
- GRPO eliminates the value network that PPO requires. For environment-based RL where rewards come from an external API, a value model adds complexity without benefit. GRPO estimates the baseline from a group of 6 completions per step — simpler, more memory-efficient, and well-suited to fast environment APIs.
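The group baseline can be sketched as a plain normalization over one group's rewards. This is a generic illustration of group-relative advantages, not the notebook's code: the function name and the `eps` term are assumptions, and library implementations may use a different standard deviation estimator.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize one group's rewards to zero mean and unit scale."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When all six completions receive the same reward, every advantage is zero and the update carries no signal, which is the zero-variance case the reward design tries to avoid.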
396
-
397
- ### Training Loop
398
-
399
- ```python
400
- # Each training step:
401
- # 1. Generate 6 completions for the current observation
402
- # 2. Score each on a FRESH env snapshot (prevents reward gate exhaustion)
403
- # 3. Normalize rewards to advantages (GRPO)
404
- # 4. Policy gradient update on best completion + KL penalty
405
- # 5. Advance episode with best action
406
-
407
- # Key hyperparameters:
408
- learning_rate = 5e-6
409
- group_size = 6
410
- kl_coefficient = 0.05 # prevents catastrophic forgetting
411
- update_strategy = "episode-level" # one update per full episode
412
- ```
413
-
414
- ### Results
415
-
416
- | | Base Model | Fine-tuned (ep140) |
417
- |---|---|---|
418
- | **Easy task** | 0.000 | 0.150 |
419
- | **Behavior** | Jumps to diagnose immediately | Reads logs on correct service first |
420
- | **Why the difference** | Base model triggers blind remediation penalty | Fine-tuned model learned to gather information before acting |
421
 
422
- **The trained model consistently reads logs on the failing service before acting** — this is the foundational operational behavior: information gathering before remediation. The base model never does this.
 
 
 
 
 
423
 
424
- **Training challenge identified:** The original training loop called `env_step` during group generation, burning reward gates before the best action could advance the episode. After the loop was fixed to score completions on fresh environment snapshots, the model learned step 1 of the optimal policy. With more episodes on the corrected loop, we expect the rest of the sequence to emerge, though this has not been verified.
425
 
426
- ### Training Notebook
427
 
428
- See `train_grpo.ipynb`: Colab-compatible, runs against the live HF Space API (no local setup needed).
429
 
430
  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Twilight-13/devops-incident-response/blob/main/train_grpo.ipynb)
431
 
432
- Compatible with: TRL, SkyRL, ART, Oumi, Axolotl.
433
-
434
- ---
435
-
436
- ## 🚀 Setup
437
-
438
- ### Docker (Recommended)
439
-
440
- ```bash
441
- docker build -t aria-devops-incident .
442
- docker run -p 7860:7860 aria-devops-incident
443
- curl http://localhost:7860/health
444
- ```
445
-
446
- ### Local Python
447
-
448
- ```bash
449
- pip install -r requirements.txt
450
- uvicorn api:app --host 0.0.0.0 --port 7860
451
- ```
452
-
453
- ### Validate
454
-
455
- ```bash
456
- python validate.py # 22 automated checks, exit 0 = all pass
457
- curl http://localhost:7860/validate
458
- ```
459
-
460
  ---
461
 
462
  ## 📡 API Reference
463
 
464
  | Method | Endpoint | Description |
465
  |---|---|---|
466
- | GET | `/health` | Liveness check, returns `{"status":"ok"}` |
467
- | GET | `/about` | Full environment description (machine-readable) |
468
- | GET | `/tasks` | All 8 tasks with descriptions |
469
- | POST | `/reset` | Start episode: `{"task_id":"easy","seed":42}` |
470
- | POST | `/step` | Take action: Action JSON |
471
- | GET | `/state` | Full state + ground truth + analytics |
472
- | GET | `/validate` | Self-test: random agent on all 7 tasks |
473
- | GET | `/metrics` | Aggregate episode statistics |
474
  | GET | `/leaderboard` | Top 10 episodes |
475
- | WS | `/ws` | WebSocket: real-time agent-environment |
476
- | GET | `/curriculum/status` | Per-task mastery and recommendations |
477
- | GET | `/curriculum/next` | Recommended next task for training |
478
- | GET | `/curriculum/hint/{task_id}` | Scaffolding hint for struggling agents |
479
- | POST | `/curriculum/record` | Feed episode result to curriculum engine |
480
- | GET | `/generate/preview` | Preview procedural incident: `?seed=N` |
481
  | POST | `/multi-agent/reset` | Start dual-agent session |
482
- | POST | `/multi-agent/step/a/{id}` | Agent A shares a finding |
483
- | POST | `/multi-agent/step/b/{id}` | Agent B takes an action |
484
- | GET | `/multi-agent/state/{id}` | Full dual-agent session state |
485
- | GET | `/multi-agent/sessions` | List active sessions |
486
- | GET | `/docs` | Swagger UI — interactive documentation |
487
 
488
  ---
489
 
490
  ## 📊 Benchmark Comparison
491
 
492
- | Benchmark | Domain | Partial Obs | Dense Reward | Multi-Step | Curriculum | Multi-Agent |
493
- |---|---|---|---|---|---|---|
494
- | SWE-bench | Code repair | ✗ | ✗ | ✓ | ✗ | ✗ |
495
- | WebArena | Web navigation | ✓ | ✗ | ✓ | ✗ | ✗ |
496
- | AgentBench | General tools | ✗ | ✗ | ✓ | ✗ | ✗ |
497
- | **ARIA (ours)** | **Incident response** | **✓** | **✓** | **✓** | **✓** | **✓** |
498
 
499
  ---
500
 
501
- ## 🏗️ OpenEnv Compliance
502
 
503
  ```bash
504
- openenv validate .
505
- ```
506
 
507
- - Inherits from `openenv.core.env_client.EnvClient`
508
- - Standard `reset()`, `step()`, `state()` interface
509
- - Valid `openenv.yaml` manifest with all 8 tasks
510
- - FastAPI server with health endpoint
511
- - WebSocket support at `/ws`
512
- - Hosted on HuggingFace Spaces
513
 
514
  ---
515
 
516
- ## 📁 Repository Structure
517
 
518
  ```
519
- aria-devops-incident-response/
520
- ├── api.py # FastAPI app — all endpoints
521
- ├── env.py # DevOpsIncidentEnv — thin dispatcher
522
- ├── models.py # Pydantic models: Action, Observation, State
523
- ├── tasks/
524
- │ ├── base.py # BaseTask ABC, InternalState, reward logic
525
- │ ├── task_easy.py # OOM crash-loop
526
- │ ├── task_medium.py # Cascading failure
527
- │ ├── task_hard.py # Silent data corruption
528
- │ ├── task_bonus.py # Dual simultaneous failure
529
- │ ├── task_security.py # DDoS attack
530
- │ ├── task_database.py # Missing index
531
- │ ├── task_failover.py # Multi-region failover
532
- │ └── task_generated.py # Procedural incidents
533
- ├── curriculum/
534
- │ └── engine.py # CurriculumEngine — adaptive difficulty
535
- ├── generator/
536
- │ └── incident_factory.py # IncidentFactory — procedural generation
537
- ├── multi_agent/
538
- │ └── session.py # DualAgentSession — split observability
539
- ├── graders/
540
- │ └── grader.py # Deterministic episode grader
541
- ├── data/runbooks/ # 6 operational runbooks (Markdown)
542
- ├── client.py # openenv-core EnvClient implementation
543
- ├── inference.py # LLM baseline (CoT + fast modes)
544
- ├── train_grpo.ipynb # GRPO training notebook (Colab-compatible)
545
- ├── validate.py # 22 automated validation checks
546
- └── openenv.yaml # OpenEnv spec manifest
547
  ```
548
 
549
- ---
550
-
551
- ## 📝 License
552
-
553
- Apache 2.0
554
-
555
- ---
556
-
557
- *Built solo for the Meta × PyTorch × HuggingFace OpenEnv Hackathon Finals — Bangalore, April 2026*
 
3
 
4
  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Twilight-13/devops-incident-response/blob/main/train_grpo.ipynb)
5
  [![HF Space](https://img.shields.io/badge/🤗-Live%20Environment-orange)](https://huggingface.co/spaces/Arijit-07/devops-incident-response)
6
+ [![Trained Model](https://img.shields.io/badge/🤗-Llama--3.1--8B%20Fine--tuned-blue)](https://huggingface.co/Arijit-07/aria-devops-llama8b)
7
  [![License](https://img.shields.io/badge/License-Apache_2.0-green.svg)](LICENSE)
8
 
9
+ > **ARIA** — Adaptive Reward & Incident Architecture
10
  > Built for the Meta × PyTorch × HuggingFace OpenEnv Hackathon Finals | Bangalore, April 2026
11
 
12
  ---
 
16
  | Resource | Link |
17
  |---|---|
18
  | **Live Environment** | https://arijit-07-devops-incident-response.hf.space |
19
+ | **Interactive API** | https://arijit-07-devops-incident-response.hf.space/docs |
20
+ | **Trained Model (8B)** | https://huggingface.co/Arijit-07/aria-devops-llama8b |
21
+ | **Training Curve** | https://huggingface.co/Arijit-07/aria-devops-llama8b/resolve/main/training_curve_8b.png |
22
+ | **Blog Post** | https://huggingface.co/blog/Arijit-07/aria-devops-incident-response |
23
  | **GitHub** | https://github.com/Twilight-13/devops-incident-response |
24
+ | **Validate** | https://arijit-07-devops-incident-response.hf.space/validate |
25
+ | **About (machine-readable)** | https://arijit-07-devops-incident-response.hf.space/about |
26
 
27
  ---
28
 
 
39
  -H "Content-Type: application/json" \
40
  -d '{"action_type": "read_logs", "service": "payment-service"}'
41
 
42
+ # 3. Diagnose (reward: +0.30)
43
  curl -X POST https://arijit-07-devops-incident-response.hf.space/step \
44
  -H "Content-Type: application/json" \
45
  -d '{"action_type": "diagnose", "root_cause": "memory leak in payment-service"}'
 
49
  -H "Content-Type: application/json" \
50
  -d '{"action_type": "restart_service", "service": "payment-service"}'
51
 
52
+ # 5. Validate all 7 tasks pass
 
 
 
53
  curl https://arijit-07-devops-incident-response.hf.space/validate
54
  ```
55
 
 
 
 
 
 
 
 
 
 
 
 
 
56
  ---
57
 
58
+ ## 🎯 The Problem
59
 
60
+ Every company running microservices faces the same reality: **production incidents are expensive, stressful, and happen at 3am.**
61
 
62
+ SWE-bench tests code generation. WebArena tests web navigation. Nothing trains agents to handle live production incidents: to read logs strategically, trace cascading failures, correlate subtle business anomalies, and apply precise fixes where wrong choices cause collateral damage.
63
 
64
+ **ARIA fills that gap.**
65
 
66
  ---
67
 
68
  ## 🎬 The 7 Tasks
69
 
70
+ | Task | Max Steps | Random | Strong LLM | Scenario |
71
+ |---|---|---|---|---|
72
+ | `easy` | 15 | 0.05 | 0.85–1.00 | Single service OOM crash-loop |
73
+ | `medium` | 20 | 0.03 | 0.55–0.75 | Cascading failure + red herring alert |
74
+ | `hard` | 25 | 0.01 | 0.30–0.50 | **Silent** corruption — all services green |
75
+ | `bonus` | 25 | 0.01 | 0.35–0.55 | Two simultaneous independent failures |
76
+ | `security` | 20 | 0.01 | 0.40–0.60 | DDoS botnet credential stuffing |
77
+ | `database` | 20 | 0.01 | 0.45–0.65 | Missing index — full table scans |
78
+ | `failover` | 25 | 0.01 | 0.35–0.55 | Multi-region network partition |
79
+ | `generated` | 20 | 0.01 | variable | Procedural — seed-deterministic |
80
 
81
  ---
82
 
83
+ ## 🏆 Reward Function
 
 
84
 
85
  ```
86
  Final Score = Σ(step_rewards)
87
+ + efficiency_bonus # (1 - steps/max_steps) × 0.05
88
+ + diagnosis_precision # +0.03 if ≥50% keyword overlap
89
+ - noop_penalty # (noops - 3) × 0.02
 
90
  ```
91
 
92
+ Clamped to **(0.001, 0.999)** for GRPO stability.
 
 
93
 
94
+ | Action | Reward | Penalty Triggers |
95
  |---|---|---|
96
+ | `read_logs` correct | +0.15 | Restart healthy service: **-0.15** |
97
+ | `diagnose` full match | +0.35 | Fix without diagnosing: **-0.10** |
98
+ | `restart_service` correct | +0.45 | Wrong failover (payment): **-0.25** |
99
+ | `block_ip_range` | +0.40 | Excessive noops: **-0.04 each** |
100
+ | `alert_oncall` (required) | +0.15 | |
 
 
 
 
 
101
 
102
+ **Semantic matching:** diagnoses are scored by keyword overlap, not exact string matching, so LLMs that paraphrase aren't penalized.
 
 
 
 
 
 
 
103
 
104
  ---
105
 
106
  ## 🌟 ARIA Features
107
 
108
  ### Curriculum Engine
109
+ Rolling average per task (last 5 episodes). Promotes when avg > 0.75. Scaffolds with hints when avg < 0.30. Agents always train at the edge of their capability.
 
 
 
 
 
110
 
111
  ```bash
112
+ GET /curriculum/status
113
+ GET /curriculum/next
114
+ POST /curriculum/record # {"task_id": "easy", "score": 0.85}
 
115
  ```
116
 
 
 
117
  ### Incident Generator
118
+ Seeds 0–99,999 → unique reproducible incidents. 6 failure modes × 8 services × 3 severities × 0–3 noise alerts.
 
 
 
 
 
 
 
 
 
 
 
 
119
 
120
  ```bash
121
+ GET /generate/preview?seed=1337
122
+ POST /reset # {"task_id": "generated", "seed": 1337}
123
  ```
124
 
125
  ### Dual-Agent Mode
126
+ Split observability. Agent A (Observer) sees logs and alerts. Agent B (Responder) sees metrics and dependencies. They coordinate via `share_finding`. Neither can solve the incident alone.
 
 
 
 
 
 
127
 
128
  ```bash
129
+ POST /multi-agent/reset # {"task_id": "easy", "seed": 42}
130
+ POST /multi-agent/step/a/{id} # {"finding": "order-service OOM"}
131
+ POST /multi-agent/step/b/{id} # {"action_type": "restart_service", ...}
 
 
 
 
 
 
 
 
 
132
  ```
133
 
134
  ---
135
 
136
+ ## 🧠 Training Results
 
 
 
 
 
 
 
 
 
137
 
138
+ **Model:** [Arijit-07/aria-devops-llama8b](https://huggingface.co/Arijit-07/aria-devops-llama8b)
139
 
140
+ | Task | Baseline | Fine-tuned | **Improvement** |
141
+ |---|---|---|---|
142
+ | easy | 0.320 | 0.685 | **+0.365** |
143
+ | medium | 0.050 | 0.378 | **+0.328** |
144
+ | hard | 0.190 | 0.869 | **+0.679** |
145
+ | bonus | 0.152 | 0.682 | **+0.530** |
146
 
147
+ ![Training Curve](https://huggingface.co/Arijit-07/aria-devops-llama8b/resolve/main/training_curve_8b.png)
148
 
149
+ **Setup:** GRPO · Llama-3.1-8B · LoRA rank=32 · 160 episodes · NVIDIA L4 · 162 minutes · Unsloth + HuggingFace TRL
150
 
151
+ **Key fix:** Group completions are scored on fresh environment snapshots, which prevents reward gate exhaustion during GRPO group generation.
152
 
153
  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Twilight-13/devops-incident-response/blob/main/train_grpo.ipynb)
154
 
155
  ---
156
 
157
  ## 📡 API Reference
158
 
159
  | Method | Endpoint | Description |
160
  |---|---|---|
161
+ | GET | `/health` | Liveness check |
162
+ | GET | `/about` | Full machine-readable description |
163
+ | GET | `/tasks` | All 8 tasks |
164
+ | POST | `/reset` | Start episode |
165
+ | POST | `/step` | Take action |
166
+ | GET | `/state` | Full state + ground truth |
167
+ | GET | `/validate` | Self-test all 7 tasks |
168
+ | GET | `/metrics` | Aggregate statistics |
169
  | GET | `/leaderboard` | Top 10 episodes |
170
+ | WS | `/ws` | Real-time WebSocket |
171
+ | GET | `/curriculum/status` | Per-task mastery |
172
+ | GET | `/curriculum/next` | Recommended task |
173
+ | POST | `/curriculum/record` | Feed training results |
174
+ | GET | `/generate/preview` | Preview procedural incident |
 
175
  | POST | `/multi-agent/reset` | Start dual-agent session |
176
+ | POST | `/multi-agent/step/a/{id}` | Agent A shares finding |
177
+ | POST | `/multi-agent/step/b/{id}` | Agent B takes action |
178
+ | GET | `/docs` | Swagger UI |
 
 
179
 
180
  ---
181
 
182
  ## 📊 Benchmark Comparison
183
 
184
+ | Benchmark | Domain | Partial Obs | Dense Reward | Curriculum | Multi-Agent |
185
+ |---|---|---|---|---|---|
186
+ | SWE-bench | Code repair | ✗ | ✗ | ✗ | ✗ |
187
+ | WebArena | Web navigation | ✓ | ✗ | ✗ | ✗ |
188
+ | AgentBench | General tools | ✗ | ✗ | ✗ | ✗ |
189
+ | **ARIA** | **Incident response** | **✓** | **✓** | **✓** | **✓** |
190
 
191
  ---
192
 
193
+ ## 🚀 Setup
194
 
195
  ```bash
196
+ docker build -t aria-devops-incident .
197
+ docker run -p 7860:7860 aria-devops-incident
198
 
199
+ # Or local
200
+ pip install -r requirements.txt
201
+ uvicorn api:app --host 0.0.0.0 --port 7860
202
+ ```
 
 
203
 
204
  ---
205
 
206
+ ## 📁 Structure
207
 
208
  ```
209
+ ├── api.py / server/app.py # FastAPI — all endpoints
210
+ ├── env.py # Environment dispatcher
211
+ ├── models.py # Pydantic models
212
+ ├── tasks/ # 7 tasks + generated
213
+ ├── curriculum/engine.py # Adaptive difficulty
214
+ ├── generator/ # Procedural incidents
215
+ ├── multi_agent/session.py # Dual-agent mode
216
+ ├── graders/grader.py # Deterministic grader
217
+ ├── demo_llm.py # Live terminal demo
218
+ ├── train_grpo.ipynb # Training notebook
219
+ ├── BLOG.md # Project story
220
+ └── openenv.yaml # OpenEnv manifest
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
221
  ```
222
 
223
+ Apache 2.0 · *Built solo for the Meta × PyTorch × HuggingFace OpenEnv Hackathon Finals — Bangalore, April 2026*
 
 
 
 
 
 
 
 
debug_out.txt DELETED
File without changes
debug_steps.txt DELETED
File without changes
demo_llm.py CHANGED
@@ -40,8 +40,6 @@ DEFAULT_TASK = "easy"
40
  DEFAULT_SEED = 42
41
 
42
  MODEL_REPO = "Arijit-07/aria-devops-llama8b"
43
- # fallback to 3b if 8b not ready:
44
- # MODEL_REPO = "Arijit-07/aria-devops-llama3b"
45
 
46
  HF_TOKEN = os.environ.get('HF_TOKEN', '')
47
 
 
40
  DEFAULT_SEED = 42
41
 
42
  MODEL_REPO = "Arijit-07/aria-devops-llama8b"
 
 
43
 
44
  HF_TOKEN = os.environ.get('HF_TOKEN', '')
45
 
openenv.yaml CHANGED
@@ -193,9 +193,9 @@ reward:
193
 
194
  training:
195
  algorithm: GRPO
196
- model: unsloth/Llama-3.2-3B-Instruct
197
- adapter: https://huggingface.co/Arijit-07/aria-devops-llama3b
198
- episodes: 140
199
  framework: HuggingFace TRL + Unsloth
200
  results:
201
  easy_pre: 0.42
 
193
 
194
  training:
195
  algorithm: GRPO
196
+ model: Llama-3.1-8B-Instruct
197
+ adapter: https://huggingface.co/Arijit-07/aria-devops-llama8b
198
+ episodes: 160
199
  framework: HuggingFace TRL + Unsloth
200
  results:
201
  easy_pre: 0.42
server/app.py CHANGED
@@ -808,64 +808,6 @@ def health():
808
  return {"status": "ok", "env": "devops-incident-response", "version": "2.0.0"}
809
 
810
 
811
- @app.get("/about")
812
- def about():
813
- """
814
- Full environment metadata for LLM judges and researchers.
815
-
816
- Returns a comprehensive description of the ARIA environment including
817
- task count, action types, feature flags, training metadata, reward
818
- design philosophy, and links to the live space, trained model, and docs.
819
-
820
- Returns:
821
- JSON object with name, version, description, themes, task/action counts,
822
- feature descriptions, training info, reward design, and links.
823
- """
824
- return {
825
- "name": "ARIA — DevOps Incident Response",
826
- "version": "2.0.0",
827
- "description": (
828
- "OpenEnv-compliant RL environment for production incident response. "
829
- "AI agents diagnose and remediate software incidents across 7 task types "
830
- "using 14 actions with dense reward shaping."
831
- ),
832
- "themes": [
833
- "World Modeling: Professional Tasks",
834
- "Self-Improvement",
835
- "Multi-Agent Interactions",
836
- ],
837
- "tasks": 8,
838
- "action_types": 14,
839
- "features": {
840
- "curriculum_engine": "Adaptive difficulty based on agent performance",
841
- "incident_generator": "Procedural incidents from seeds (0-99999)",
842
- "dual_agent_mode": "Split observability — Observer + Responder",
843
- },
844
- "training": {
845
- "model": "Llama-3.2-3B-Instruct",
846
- "algorithm": "GRPO",
847
- "framework": "HuggingFace TRL + Unsloth",
848
- "episodes": 140,
849
- "adapter_url": "https://huggingface.co/Arijit-07/aria-devops-llama3b",
850
- },
851
- "reward_design": {
852
- "type": "dense",
853
- "range": [0.001, 0.999],
854
- "anti_gaming": [
855
- "collateral_damage_penalty",
856
- "blind_remediation_penalty",
857
- "semantic_diagnosis_matching",
858
- ],
859
- "efficiency_bonus": True,
860
- },
861
- "links": {
862
- "space": "https://arijit-07-devops-incident-response.hf.space",
863
- "model": "https://huggingface.co/Arijit-07/aria-devops-llama3b",
864
- "github": "https://github.com/Twilight-13/devops-incident-response",
865
- "docs": "https://arijit-07-devops-incident-response.hf.space/docs",
866
- },
867
- }
868
-
869
 
870
  @app.get("/generate/preview")
871
  def preview_incident(seed: int = 42):
 
808
  return {"status": "ok", "env": "devops-incident-response", "version": "2.0.0"}
809
 
810
 
811
 
812
  @app.get("/generate/preview")
813
  def preview_incident(seed: int = 42):
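Review note: the hunk above removes the `/about` route, and the footer markup in the deleted `ui_test.py` contains an `<a href="/about">` link. A quick scan for links to removed routes can confirm that no remaining page still points at them. This is a minimal sketch using only the standard library. The names `HrefCollector` and `links_to_removed_routes` are hypothetical, and the sample HTML is a shortened excerpt of the footer links shown in this commit.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def links_to_removed_routes(html: str, removed_routes: set) -> list:
    """Return hrefs whose path (query string ignored) is a removed route."""
    collector = HrefCollector()
    collector.feed(html)
    return [h for h in collector.hrefs if h.split("?")[0] in removed_routes]


# Shortened excerpt of the footer links from the deleted ui_test.py.
sample = (
    '<a href="/docs" class="f-link">Live API Docs</a>'
    '<a href="/about" class="f-link">About</a>'
)
stale = links_to_removed_routes(sample, {"/about"})
print(stale)
```

Running the same function over any HTML that is still served would list links that now return 404.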
training_screenshots/post_training.png ADDED
training_screenshots/training_bonus.png ADDED
training_screenshots/training_easy.png ADDED
training_screenshots/training_hard.png ADDED
training_screenshots/training_medium.png ADDED
ui_test.py DELETED
@@ -1,945 +0,0 @@
1
- html_content = """<!DOCTYPE html>
2
- <html lang="en">
3
- <head>
4
- <meta charset="UTF-8">
5
- <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
- <title>ARIA - DevOps Incident Response</title>
7
- <link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&family=JetBrains+Mono:wght@400;700&display=swap" rel="stylesheet">
8
- <style>
9
- :root {
10
- --bg-primary: #060914;
11
- --bg-secondary: #0d1117;
12
- --bg-card: #111827;
13
- --bg-card-hover: #1a2234;
14
- --border: #1f2937;
15
- --border-glow: #3b82f6;
16
- --accent-blue: #3b82f6;
17
- --accent-cyan: #06b6d4;
18
- --accent-green: #10b981;
19
- --accent-red: #ef4444;
20
- --accent-yellow: #f59e0b;
21
- --accent-purple: #8b5cf6;
22
- --accent-orange: #f97316;
23
- --text-primary: #f9fafb;
24
- --text-secondary: #9ca3af;
25
- --text-dim: #4b5563;
26
- }
27
-
28
- * { margin: 0; padding: 0; box-sizing: border-box; }
29
- html { scroll-behavior: smooth; }
30
- body {
31
- background: var(--bg-primary);
32
- color: var(--text-primary);
33
- font-family: 'Inter', sans-serif;
34
- min-height: 100vh;
35
- overflow-x: hidden;
36
- }
37
-
38
- a { text-decoration: none; color: inherit; }
39
-
40
- /* Animation */
41
- .fade-in {
42
- opacity: 0;
43
- transform: translateY(20px);
44
- transition: opacity 0.6s ease-out, transform 0.6s ease-out;
45
- }
46
- .fade-in.visible { opacity: 1; transform: translateY(0); }
47
-
48
- /* Canvas Background */
49
- #bg-canvas {
50
- position: fixed;
51
- top: 0;
52
- left: 0;
53
- width: 100vw;
54
- height: 100vh;
55
- z-index: 0;
56
- pointer-events: none;
57
- }
58
-
59
- /* Container */
60
- .container {
61
- max-width: 1280px;
62
- margin: 0 auto;
63
- padding: 0 24px;
64
- position: relative;
65
- z-index: 1;
66
- }
67
-
68
- /* Section Spacing */
69
- section { padding: 80px 0; }
70
-
71
- /* Navbar */
72
- nav {
73
- position: fixed;
74
- top: 0;
75
- width: 100%;
76
- height: 64px;
77
- background: rgba(6, 9, 20, 0.8);
78
- backdrop-filter: blur(20px);
79
- border-bottom: 1px solid var(--border);
80
- z-index: 100;
81
- display: flex;
82
- align-items: center;
83
- }
84
- .nav-inner {
85
- display: flex;
86
- justify-content: space-between;
87
- align-items: center;
88
- width: 100%;
89
- max-width: 1280px;
90
- margin: 0 auto;
91
- padding: 0 24px;
92
- }
93
- .nav-left { display: flex; align-items: center; gap: 8px; }
94
- .nav-logo { font-size: 20px; font-weight: 700; color: var(--accent-blue); display: flex; align-items: center; gap: 6px; }
95
- .nav-desc { font-size: 13px; color: var(--text-secondary); margin-left: 8px; display: none; }
96
- @media (min-width: 768px) { .nav-desc { display: block; } }
97
-
98
- .nav-center { display: flex; justify-content: center; flex: 1; }
99
- .status-pill {
100
- display: flex;
101
- align-items: center;
102
- gap: 6px;
103
- background: rgba(16, 185, 129, 0.2);
104
- border: 1px solid var(--accent-green);
105
- color: var(--accent-green);
106
- padding: 4px 12px;
107
- border-radius: 999px;
108
- font-size: 12px;
109
- font-weight: 600;
110
- }
111
- .status-dot {
112
- width: 6px;
113
- height: 6px;
114
- background: var(--accent-green);
115
- border-radius: 50%;
116
- animation: pulse 2s infinite;
117
- }
118
- @keyframes pulse { 0% { transform: scale(1); opacity: 1; } 50% { transform: scale(1.5); opacity: 0.5; } 100% { transform: scale(1); opacity: 1; } }
119
-
120
- .nav-right { display: flex; gap: 24px; }
121
- .nav-link { font-size: 13px; color: var(--text-secondary); transition: color 0.2s; }
122
- .nav-link:hover { color: var(--text-primary); }
123
-
124
- /* Hero Section */
125
- .hero { padding: 120px 0 80px; text-align: center; }
126
- .hero-badge {
127
- background: rgba(59, 130, 246, 0.1);
128
- border: 1px solid rgba(59, 130, 246, 0.3);
129
- border-radius: 999px;
130
- padding: 6px 16px;
131
- font-size: 12px;
132
- color: var(--accent-blue);
133
- display: inline-block;
134
- margin-bottom: 24px;
135
- }
136
- .hero-title {
137
- font-size: clamp(72px, 12vw, 140px);
138
- font-weight: 700;
139
- background: linear-gradient(135deg, #3b82f6 0%, #06b6d4 50%, #8b5cf6 100%);
140
- -webkit-background-clip: text;
141
- -webkit-text-fill-color: transparent;
142
- line-height: 1;
143
- letter-spacing: -4px;
144
- }
145
- .hero-subtitle { font-size: 20px; color: var(--text-secondary); margin-top: 16px; font-weight: 400; }
146
- .hero-desc { font-size: 15px; color: var(--text-dim); margin-top: 12px; line-height: 1.6; max-width: 600px; margin-inline: auto; }
147
-
148
- .hero-buttons { margin-top: 40px; display: flex; justify-content: center; gap: 16px; flex-wrap: wrap; }
149
- .btn-primary, .btn-secondary {
150
- padding: 14px 28px;
151
- border-radius: 8px;
152
- font-weight: 600;
153
- font-size: 15px;
154
- transition: all 0.2s;
155
- cursor: pointer;
156
- display: inline-block;
157
- }
158
- .btn-primary { background: var(--accent-blue); color: white; border: none; }
159
- .btn-primary:hover { background: #2563eb; transform: translateY(-2px); }
160
- .btn-secondary { background: transparent; border: 1px solid var(--border); color: var(--text-secondary); }
161
- .btn-secondary:hover { border-color: var(--accent-blue); color: white; transform: translateY(-2px); }
162
-
163
- .hero-stats {
164
- margin-top: 64px;
165
- display: grid;
166
- grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
167
- gap: 16px;
168
- }
169
- .stat-card {
170
- background: var(--bg-card);
171
- border: 1px solid var(--border);
172
- border-radius: 12px;
173
- padding: 20px 32px;
174
- text-align: center;
175
- }
176
- .stat-val { font-family: 'JetBrains Mono', monospace; font-size: 32px; font-weight: 700; color: var(--accent-blue); }
177
- .stat-label { font-size: 13px; color: var(--text-secondary); margin-top: 4px; }
178
-
179
- /* General Sections */
180
- .section-title { font-size: 24px; font-weight: 600; margin-bottom: 8px; }
181
- .section-subtitle { font-size: 15px; color: var(--text-secondary); margin-bottom: 32px; }
182
-
183
- /* Task Cards */
184
- .task-grid { display: grid; grid-template-columns: repeat(4, 1fr); gap: 16px; }
185
- @media (max-width: 1024px) { .task-grid { grid-template-columns: repeat(2, 1fr); } }
186
- @media (max-width: 640px) { .task-grid { grid-template-columns: 1fr; } }
187
-
188
- .task-card {
189
- background: var(--bg-card);
190
- border: 1px solid var(--border);
191
- border-radius: 16px;
192
- padding: 24px;
193
- transition: all 0.3s;
194
- cursor: pointer;
195
- position: relative;
196
- overflow: hidden;
197
- display: flex;
198
- flex-direction: column;
199
- }
200
- .task-card::before {
201
- content: ''; position: absolute; top: 0; left: 0; right: 0; height: 2px;
202
- background: transparent; transition: all 0.3s;
203
- }
204
- .task-card:hover {
205
- background: var(--bg-card-hover);
206
- transform: translateY(-4px);
207
- box-shadow: 0 20px 40px rgba(0,0,0,0.4);
208
- }
209
- .task-card:hover::before { background: var(--card-color, var(--border)); }
210
-
211
- .task-header { display: flex; justify-content: space-between; align-items: flex-start; }
212
- .task-icon { font-size: 32px; }
213
- .task-badge {
214
- font-size: 11px;
215
- font-weight: 700;
216
- padding: 4px 8px;
217
- border-radius: 6px;
218
- background: var(--card-bg);
219
- color: var(--card-color);
220
- letter-spacing: 0.5px;
221
- }
222
- .task-name { font-size: 16px; font-weight: 600; margin-top: 16px; }
223
- .task-desc { font-size: 13px; color: var(--text-secondary); margin-top: 8px; line-height: 1.5; flex-grow: 1; }
224
- .task-footer { display: flex; justify-content: space-between; align-items: center; margin-top: 20px; }
225
- .task-steps { font-family: 'JetBrains Mono', monospace; font-size: 12px; color: var(--text-dim); }
226
- .task-status { display: flex; align-items: center; gap: 6px; font-size: 12px; color: var(--card-color); font-weight: 500; }
227
- .task-status::before { content: ''; width: 6px; height: 6px; border-radius: 50%; background: var(--card-color); }
228
-
229
- /* Features */
230
- .features-grid { display: grid; grid-template-columns: repeat(3, 1fr); gap: 24px; }
231
- @media (max-width: 900px) { .features-grid { grid-template-columns: 1fr; } }
232
-
233
- .feature-card {
234
- background: var(--bg-card);
235
- border: 1px solid var(--border);
236
- border-radius: 16px;
237
- padding: 32px;
238
- display: flex;
239
- flex-direction: column;
240
- }
241
- .feature-icon { font-size: 48px; margin-bottom: 24px; }
242
- .feature-title { font-size: 20px; font-weight: 600; margin-bottom: 12px; color: var(--text-primary); }
243
- .feature-desc { font-size: 14px; color: var(--text-secondary); line-height: 1.6; margin-bottom: 24px; flex-grow: 1; }
244
-
245
- .curriculum-bars { margin-bottom: 24px; }
246
- .c-bar-row { display: flex; align-items: center; justify-content: space-between; margin-bottom: 8px; font-size: 12px; font-family: 'JetBrains Mono', monospace; }
247
- .c-bar-name { color: var(--text-secondary); width: 80px; overflow: hidden; text-overflow: ellipsis; }
248
- .c-bar-track { flex-grow: 1; margin: 0 12px; letter-spacing: -2px; color: var(--text-dim); }
249
- .c-bar-fill { letter-spacing: -2px; }
250
- .c-bar-score { width: 30px; text-align: right; }
251
-
252
- .generator-input {
253
- display: flex; gap: 8px; margin-bottom: 16px;
254
- }
255
- .gen-seed {
256
- background: var(--bg-secondary); border: 1px solid var(--border); color: white;
257
- padding: 8px 12px; border-radius: 6px; width: 80px; font-family: 'JetBrains Mono', monospace;
258
- }
259
- .btn-gen { background: var(--accent-purple); color: white; border: none; padding: 8px 16px; border-radius: 6px; cursor: pointer; font-weight: 600; }
260
- .btn-gen:hover { background: #7c3aed; }
261
- .gen-result { background: var(--bg-secondary); border: 1px solid var(--border); border-radius: 8px; padding: 16px; display: none; }
262
- .gen-badges { display: flex; gap: 8px; margin-bottom: 12px; }
263
- .gen-badge { font-size: 10px; padding: 2px 6px; border-radius: 4px; font-weight: 600; text-transform: uppercase; }
264
- .gen-diff-bar { height: 4px; background: var(--border); border-radius: 2px; margin: 12px 0; overflow: hidden; }
265
- .gen-diff-fill { height: 100%; transition: width 0.3s; }
266
-
267
- .dual-diagram {
268
- background: var(--bg-secondary); border: 1px solid var(--border); border-radius: 8px;
269
- padding: 16px; font-family: 'JetBrains Mono', monospace; font-size: 11px; margin-bottom: 24px;
270
- color: var(--text-secondary);
271
- display: flex; justify-content: space-between; align-items: center;
272
- }
273
- .agent-box { border: 1px solid var(--border); padding: 8px; border-radius: 4px; background: rgba(0,0,0,0.2); width: 42%; }
274
- .agent-arrow { flex-grow: 1; text-align: center; color: var(--accent-green); position: relative; }
275
- .agent-arrow::after {
276
- content: '→'; position: absolute; top: -10px; left: 50%; transform: translateX(-50%);
277
- animation: flowRight 1.5s infinite linear;
278
- }
279
- @keyframes flowRight { 0% { left: 20%; opacity: 0; } 50% { opacity: 1; } 100% { left: 80%; opacity: 0; } }
280
- .btn-green { background: var(--accent-green); color: white; border: none; padding: 8px 16px; border-radius: 6px; cursor: pointer; font-weight: 600; }
281
- .btn-green:hover { background: #059669; }
282
- .session-info { margin-top: 16px; font-family: 'JetBrains Mono', monospace; font-size: 11px; color: var(--accent-green); display: none; word-break: break-all; }
283
-
284
- .feature-link { color: var(--accent-blue); font-size: 14px; font-weight: 500; text-decoration: none; margin-top: auto; display: inline-block; }
285
- .feature-link:hover { text-decoration: underline; }
286
-
287
- /* Live Metrics Bar */
288
- .metrics-bar-container { background: var(--bg-secondary); border-top: 1px solid var(--border); border-bottom: 1px solid var(--border); padding: 24px 0; }
289
- .metrics-grid { display: flex; justify-content: space-between; }
290
- .metric-item { text-align: center; flex: 1; border-right: 1px solid var(--border); }
291
- .metric-item:last-child { border-right: none; }
292
- .metric-val { font-family: 'JetBrains Mono', monospace; font-size: 28px; font-weight: 700; color: var(--accent-blue); }
293
- .metric-label { font-size: 12px; color: var(--text-secondary); margin-top: 4px; }
294
- @media (max-width: 640px) { .metrics-grid { flex-wrap: wrap; gap: 24px; } .metric-item { min-width: 40%; border: none; } }
295
-
296
- /* Leaderboard */
297
- .leaderboard-card { background: var(--bg-card); border: 1px solid var(--border); border-radius: 16px; overflow: hidden; overflow-x: auto; }
298
- table { width: 100%; border-collapse: collapse; text-align: left; }
299
- th { background: rgba(255,255,255,0.03); font-size: 11px; text-transform: uppercase; letter-spacing: 1px; color: var(--text-dim); padding: 12px 24px; border-bottom: 1px solid var(--border); }
300
- td { padding: 16px 24px; border-bottom: 1px solid var(--border); font-size: 14px; color: var(--text-primary); }
301
- tr:last-child td { border-bottom: none; }
302
- .rank-1 { color: #fbbf24; font-weight: bold; }
303
- .rank-2 { color: #9ca3af; font-weight: bold; }
304
- .rank-3 { color: #cd7f32; font-weight: bold; }
305
- .lb-score { font-family: 'JetBrains Mono', monospace; font-weight: 600; }
306
- .lb-status { font-size: 13px; }
307
-
308
- /* Quick Start */
309
- .tabs { display: flex; gap: 8px; margin-bottom: 16px; }
310
- .tab { background: transparent; border: none; color: var(--text-secondary); padding: 8px 16px; border-radius: 6px; cursor: pointer; font-size: 14px; font-weight: 500; font-family: 'Inter', sans-serif;}
311
- .tab.active { background: var(--accent-blue); color: white; }
312
- .code-block { background: #020408; border: 1px solid var(--border); border-radius: 12px; padding: 24px; position: relative; display: none; overflow-x: auto; }
313
- .code-block.active { display: block; }
314
- .code-text { font-family: 'JetBrains Mono', monospace; font-size: 13px; line-height: 1.8; color: var(--text-primary); white-space: pre; }
315
- .btn-copy { position: absolute; top: 12px; right: 12px; background: rgba(255,255,255,0.1); border: 1px solid var(--border); color: var(--text-secondary); padding: 4px 10px; border-radius: 4px; font-size: 12px; cursor: pointer; }
316
- .btn-copy:hover { color: white; background: rgba(255,255,255,0.2); }
317
-
318
- .code-comment { color: var(--text-dim); }
319
- .code-str { color: var(--accent-green); }
320
- .code-cmd { color: var(--accent-blue); }
321
- .code-url { color: var(--accent-cyan); }
322
- .code-key { color: var(--accent-yellow); }
323
-
324
- /* Training Evidence */
325
- .training-grid { display: grid; grid-template-columns: 1fr 1fr; gap: 24px; }
326
- @media (max-width: 900px) { .training-grid { grid-template-columns: 1fr; } }
327
- .train-card { background: var(--bg-card); border: 1px solid var(--border); border-radius: 16px; padding: 32px; display: flex; flex-direction: column; }
328
- .train-title { font-size: 18px; font-weight: 600; margin-bottom: 24px; }
329
- .train-row { margin-bottom: 24px; }
330
- .train-label { font-size: 12px; color: var(--text-secondary); text-transform: uppercase; letter-spacing: 1px; margin-bottom: 8px; display: flex; align-items: center; justify-content: space-between; }
331
- .train-badge { padding: 4px 8px; border-radius: 4px; font-family: 'JetBrains Mono', monospace; font-weight: 600; font-size: 12px; }
332
- .train-desc { font-size: 14px; color: var(--text-secondary); line-height: 1.5; margin-left: 28px; }
333
- .train-vis { float: left; font-size: 18px; margin-top: 2px; }
334
- .train-table-row { display: flex; justify-content: space-between; padding: 12px 0; border-bottom: 1px solid var(--border); }
335
- .train-table-row:last-child { border-bottom: none; }
336
- .tt-key { font-size: 13px; color: var(--text-secondary); }
337
- .tt-val { font-size: 13px; font-family: 'JetBrains Mono', monospace; color: var(--text-primary); }
338
-
339
- /* Footer */
340
- footer { background: var(--bg-secondary); border-top: 1px solid var(--border); padding: 48px 0 32px; margin-top: 80px; }
341
- .footer-grid { display: grid; grid-template-columns: 2fr 1fr 1fr; gap: 32px; }
342
- @media (max-width: 768px) { .footer-grid { grid-template-columns: 1fr; } }
343
- .f-title { font-size: 14px; font-weight: 600; margin-bottom: 16px; color: var(--text-primary); }
344
- .f-text { font-size: 13px; color: var(--text-dim); line-height: 1.6; }
345
- .f-links { display: flex; flex-direction: column; gap: 12px; }
346
- .f-link { font-size: 13px; color: var(--text-secondary); transition: color 0.2s; }
347
- .f-link:hover { color: var(--text-primary); }
348
- .f-social { display: flex; gap: 16px; margin-top: 16px; }
349
- .f-bottom { border-top: 1px solid var(--border); margin-top: 32px; padding-top: 24px; display: flex; justify-content: space-between; font-size: 12px; color: var(--text-dim); }
350
- @media (max-width: 640px) { .f-bottom { flex-direction: column; gap: 8px; text-align: center; } }
351
- </style>
352
- </head>
353
- <body>
354
-
355
- <canvas id="bg-canvas"></canvas>
356
-
357
- <nav>
358
- <div class="nav-inner">
359
- <div class="nav-left">
360
- <div class="nav-logo">🚨 ARIA</div>
361
- <div class="nav-desc">DevOps Incident Response</div>
362
- </div>
363
- <div class="nav-center">
364
- <div class="status-pill" id="nav-status">
365
- <div class="status-dot"></div>
366
- <span id="nav-status-text">CONNECTING</span>
367
- </div>
368
- </div>
369
- <div class="nav-right">
370
- <a href="/docs" class="nav-link">API Docs</a>
371
- <a href="/validate" class="nav-link">Validate</a>
372
- <a href="/metrics" class="nav-link">Metrics</a>
373
- <a href="/leaderboard" class="nav-link">Leaderboard</a>
374
- </div>
375
- </div>
376
- </nav>
377
-
378
- <main class="container">
379
-
380
- <section class="hero fade-in">
381
- <div class="hero-badge">⚡ OpenEnv Compliant · Meta × PyTorch × HuggingFace</div>
382
- <h1 class="hero-title">ARIA</h1>
383
- <div class="hero-subtitle">Adaptive Reward & Incident Architecture</div>
384
- <p class="hero-desc">The first OpenEnv RL environment for production incident response.<br>7 tasks · 14 actions · Curriculum learning · Dual-agent mode · Trained Llama-3B</p>
385
-
386
- <div class="hero-buttons">
387
- <a href="/docs" class="btn-primary">Try Live API &rarr;</a>
388
- <a href="https://github.com/Twilight-13/devops-incident-response" target="_blank" class="btn-secondary">View on GitHub &rarr;</a>
389
- </div>
390
-
391
- <div class="hero-stats">
392
- <div class="stat-card">
393
- <div class="stat-val">7</div>
394
- <div class="stat-label">Tasks</div>
395
- </div>
396
- <div class="stat-card">
397
- <div class="stat-val">14</div>
398
- <div class="stat-label">Actions</div>
399
- </div>
400
- <div class="stat-card">
401
- <div class="stat-val">&infin;</div>
402
- <div class="stat-label">Scenarios</div>
403
- </div>
404
- <div class="stat-card">
405
- <div class="stat-val">0.99</div>
406
- <div class="stat-label">Max Score</div>
407
- </div>
408
- </div>
409
- </section>
410
-
411
- <section class="fade-in">
412
- <h2 class="section-title">Environment Tasks</h2>
413
- <p class="section-subtitle">Eight scenarios of escalating operational complexity</p>
414
-
415
- <div class="task-grid" id="task-grid">
416
- <!-- Populated by JS -->
417
- <div style="grid-column: 1/-1; text-align: center; color: var(--text-dim);">Loading tasks...</div>
418
- </div>
419
- </section>
420
-
421
- <section class="fade-in">
422
- <h2 class="section-title">ARIA Features</h2>
423
- <p class="section-subtitle">What makes this environment unique</p>
424
-
425
- <div class="features-grid">
426
- <!-- Curriculum -->
427
- <div class="feature-card">
428
- <div class="feature-icon">🎓</div>
429
- <h3 class="feature-title">Curriculum Engine</h3>
430
- <p class="feature-desc">Tracks agent performance per task with rolling averages. Promotes when mastered (avg > 0.75). Scaffolds with hints when struggling (avg < 0.30). Agents always train at the edge of their capability.</p>
431
-
432
- <div class="curriculum-bars" id="curriculum-container">
433
- <!-- Populated by JS -->
434
- <div style="text-align: center; color: var(--text-dim); font-size: 13px;">Loading curriculum data...</div>
435
- </div>
436
-
437
- <a href="/curriculum/status" class="feature-link" style="color: var(--accent-blue);">View Status &rarr;</a>
438
- </div>
439
-
440
- <!-- Generator -->
441
- <div class="feature-card">
442
- <div class="feature-icon">⚡</div>
443
- <h3 class="feature-title">Incident Generator</h3>
444
- <p class="feature-desc">Procedural incidents from seeds 0–99,999. Six failure modes × eight services × variable noise = infinite unique training scenarios. Same seed always produces the same incident.</p>
445
-
446
- <div class="generator-input">
447
- <input type="number" id="gen-seed" class="gen-seed" value="42" min="0" max="99999">
448
- <button class="btn-gen" onclick="generateIncident()">Generate</button>
449
- </div>
450
-
451
- <div class="gen-result" id="gen-result">
452
- <div class="gen-badges" id="gen-badges"></div>
453
- <div style="font-size: 13px; font-weight: 600; margin-bottom: 8px;" id="gen-affected"></div>
454
- <div style="font-size: 12px; color: var(--text-secondary); line-height: 1.5;" id="gen-desc"></div>
455
- <div class="gen-diff-bar"><div class="gen-diff-fill" id="gen-diff-fill"></div></div>
456
- </div>
457
-
458
- <a href="/generate/preview?seed=42" class="feature-link" style="color: var(--accent-purple);">Try Generator &rarr;</a>
459
- </div>
460
-
461
- <!-- Dual Agent -->
462
- <div class="feature-card">
463
- <div class="feature-icon">🤝</div>
464
- <h3 class="feature-title">Dual-Agent Mode</h3>
465
- <p class="feature-desc">Split observability between two agents. Observer sees logs and alerts. Responder sees metrics and dependencies. Neither can solve the incident alone — they must coordinate via share_finding.</p>
466
-
467
- <div class="dual-diagram">
468
- <div class="agent-box">
469
- <div style="font-weight:700; margin-bottom:4px;">AGENT A</div>
470
- <div style="color:var(--text-dim); margin-bottom:4px;">Observer</div>
471
- <div>• alerts<br>• logs</div>
472
- </div>
473
- <div class="agent-arrow">share<br>finding</div>
474
- <div class="agent-box">
475
- <div style="font-weight:700; margin-bottom:4px;">AGENT B</div>
476
- <div style="color:var(--text-dim); margin-bottom:4px;">Responder</div>
477
- <div>• metrics<br>• deps</div>
478
- </div>
479
- </div>
480
-
481
- <button class="btn-green" onclick="startDualSession()">Start Session</button>
482
- <div class="session-info" id="dual-session-info"></div>
483
-
484
- <a href="/multi-agent/sessions" class="feature-link" style="color: var(--accent-green);">View Sessions &rarr;</a>
485
- </div>
486
- </div>
487
- </section>
488
-
489
- </main>
490
-
491
- <div class="metrics-bar-container fade-in">
492
- <div class="container metrics-grid" id="metrics-grid">
493
- <div class="metric-item">
494
- <div class="metric-val" id="m-episodes">--</div>
495
- <div class="metric-label">Total Episodes</div>
496
- </div>
497
- <div class="metric-item">
498
- <div class="metric-val" id="m-avg">--</div>
499
- <div class="metric-label">Avg Score</div>
500
- </div>
501
- <div class="metric-item">
502
- <div class="metric-val" id="m-res">--</div>
503
- <div class="metric-label">Resolution Rate</div>
504
- </div>
505
- <div class="metric-item">
506
- <div class="metric-val" id="m-best">--</div>
507
- <div class="metric-label">Best Score</div>
508
- </div>
509
- </div>
510
- </div>
511
-
512
- <main class="container">
513
- <section class="fade-in">
514
- <h2 class="section-title">🏆 Leaderboard</h2>
515
- <p class="section-subtitle">Top episodes by score</p>
516
-
517
- <div class="leaderboard-card">
518
- <table>
519
- <thead>
520
- <tr>
521
- <th>Rank</th>
522
- <th>Task</th>
523
- <th>Score</th>
524
- <th>Steps</th>
525
- <th>Status</th>
526
- </tr>
527
- </thead>
528
- <tbody id="lb-body">
529
- <tr><td colspan="5" style="text-align: center; color: var(--text-dim);">Loading leaderboard...</td></tr>
530
- </tbody>
531
- </table>
532
- </div>
533
- </section>
534
-
535
- <section class="fade-in">
536
- <h2 class="section-title">Quick Start</h2>
537
- <p class="section-subtitle">Run your first episode in seconds</p>
538
-
539
- <div class="tabs">
540
- <button class="tab active" onclick="switchTab('curl')">curl</button>
541
- <button class="tab" onclick="switchTab('python')">Python</button>
542
- </div>
543
-
544
- <div id="code-curl" class="code-block active">
545
- <button class="btn-copy" onclick="copyCode('code-curl-text', this)">Copy</button>
546
- <div class="code-text" id="code-curl-text"><span class="code-comment"># 1. Start an incident</span>
547
- <span class="code-cmd">curl</span> -X POST https://arijit-07-devops-incident-response.hf.space/reset \
548
- -H <span class="code-str">"Content-Type: application/json"</span> \
549
- -d <span class="code-str">'{<span class="code-key">"task_id"</span>: <span class="code-str">"easy"</span>, <span class="code-key">"seed"</span>: 42}'</span>
550
-
551
- <span class="code-comment"># 2. Read logs (reward: +0.15)</span>
552
- <span class="code-cmd">curl</span> -X POST https://arijit-07-devops-incident-response.hf.space/step \
553
- -H <span class="code-str">"Content-Type: application/json"</span> \
554
- -d <span class="code-str">'{<span class="code-key">"action_type"</span>: <span class="code-str">"read_logs"</span>, <span class="code-key">"service"</span>: <span class="code-str">"payment-service"</span>}'</span>
555
-
556
- <span class="code-comment"># 3. Diagnose (reward: +0.30)</span>
557
- <span class="code-cmd">curl</span> -X POST https://arijit-07-devops-incident-response.hf.space/step \
558
- -H <span class="code-str">"Content-Type: application/json"</span> \
559
- -d <span class="code-str">'{<span class="code-key">"action_type"</span>: <span class="code-str">"diagnose"</span>, <span class="code-key">"root_cause"</span>: <span class="code-str">"memory leak in payment-service"</span>}'</span>
560
-
561
- <span class="code-comment"># 4. Fix it (reward: +0.40)</span>
562
- <span class="code-cmd">curl</span> -X POST https://arijit-07-devops-incident-response.hf.space/step \
563
- -H <span class="code-str">"Content-Type: application/json"</span> \
564
- -d <span class="code-str">'{<span class="code-key">"action_type"</span>: <span class="code-str">"restart_service"</span>, <span class="code-key">"service"</span>: <span class="code-str">"payment-service"</span>}'</span>
565
-
566
- <span class="code-comment"># Score: ~0.94 ✅</span></div>
567
- </div>
568
-
569
- <div id="code-python" class="code-block">
570
- <button class="btn-copy" onclick="copyCode('code-py-text', this)">Copy</button>
571
- <div class="code-text" id="code-py-text"><span class="code-cmd">import</span> requests
572
-
573
- BASE = <span class="code-str">"https://arijit-07-devops-incident-response.hf.space"</span>
574
-
575
- <span class="code-comment"># Start episode</span>
576
- obs = requests.post(<span class="code-url">f"{BASE}/reset"</span>,
577
- json={<span class="code-key">"task_id"</span>: <span class="code-str">"easy"</span>, <span class="code-key">"seed"</span>: 42}).json()
578
-
579
- <span class="code-comment"># Take action</span>
580
- result = requests.post(<span class="code-url">f"{BASE}/step"</span>,
581
- json={<span class="code-key">"action_type"</span>: <span class="code-str">"read_logs"</span>,
582
- <span class="code-key">"service"</span>: <span class="code-str">"payment-service"</span>}).json()
583
-
584
- print(<span class="code-url">f"Reward: {result['reward']}"</span>) <span class="code-comment"># 0.15</span></div>
585
- </div>
586
- </section>
587
-
588
- <section class="fade-in">
589
- <h2 class="section-title">🧠 Training Evidence</h2>
590
- <p class="section-subtitle">Llama-3.2-3B fine-tuned with GRPO on this environment</p>
591
-
592
- <div class="training-grid">
593
- <div class="train-card">
594
- <h3 class="train-title">Behavioral Change</h3>
595
-
596
- <div class="train-row">
597
- <div class="train-label">
598
- <span>Base Llama-3B</span>
599
- <span class="train-badge" style="background: rgba(239, 68, 68, 0.2); color: #ef4444;">0.000</span>
600
- </div>
601
- <div class="train-vis" style="color: #ef4444;">❌</div>
602
- <div class="train-desc">Jumps straight to diagnose without reading logs → triggers blind remediation penalty (-0.10)</div>
603
- </div>
604
-
605
- <div class="train-row">
606
- <div class="train-label">
607
- <span>ARIA Fine-tuned (140 episodes)</span>
608
- <span class="train-badge" style="background: rgba(16, 185, 129, 0.2); color: #10b981;">0.150</span>
609
- </div>
610
- <div class="train-vis" style="color: #10b981;">✅</div>
611
- <div class="train-desc">Consistently reads logs on correct failing service first → information gathering before acting</div>
612
- </div>
613
-
614
- <a href="https://huggingface.co/Arijit-07/aria-devops-llama3b" target="_blank" class="feature-link" style="color: var(--accent-blue);">Model weights &rarr;</a>
615
- </div>
616
-
617
- <div class="train-card">
618
- <h3 class="train-title">Training Setup</h3>
619
-
620
- <div class="train-table-row">
621
- <div class="tt-key">Algorithm</div><div class="tt-val">GRPO</div>
622
- </div>
623
- <div class="train-table-row">
624
- <div class="tt-key">Framework</div><div class="tt-val">Unsloth + HuggingFace TRL</div>
625
- </div>
626
- <div class="train-table-row">
627
- <div class="tt-key">Base Model</div><div class="tt-val">Llama-3.2-3B-Instruct</div>
628
- </div>
629
- <div class="train-table-row">
630
- <div class="tt-key">LoRA Rank</div><div class="tt-val">16 (alpha: 32)</div>
631
- </div>
632
- <div class="train-table-row">
633
- <div class="tt-key">Episodes</div><div class="tt-val">140 (easy + medium)</div>
634
- </div>
635
- <div class="train-table-row">
636
- <div class="tt-key">GPU</div><div class="tt-val">Kaggle T4 x2</div>
637
- </div>
638
- <div class="train-table-row">
639
- <div class="tt-key">Group Size</div><div class="tt-val">6 completions/step</div>
640
- </div>
641
- <div class="train-table-row" style="border-bottom: none;">
642
- <div class="tt-key">KL Penalty</div><div class="tt-val">0.05</div>
643
- </div>
644
- </div>
645
- </div>
646
- </section>
647
-
648
- </main>
649
-
650
- <footer>
651
- <div class="container">
652
- <div class="footer-grid">
653
- <div>
654
- <div style="font-size: 20px; font-weight: 700; color: var(--accent-blue); margin-bottom: 8px;">🚨 ARIA</div>
655
- <div class="f-text">DevOps Incident Response<br>OpenEnv-compliant RL environment</div>
656
- <div class="f-social">
657
- <a href="https://github.com/Twilight-13/devops-incident-response" target="_blank" class="f-link">GitHub</a>
658
- <a href="https://huggingface.co/Arijit-07/aria-devops-llama3b" target="_blank" class="f-link">HuggingFace Model</a>
659
- </div>
660
- </div>
661
- <div>
662
- <div class="f-title">Resources</div>
663
- <div class="f-links">
664
- <a href="/docs" class="f-link">Live API Docs</a>
665
- <a href="/validate" class="f-link">Validate</a>
666
- <a href="/metrics" class="f-link">Metrics</a>
667
- <a href="/leaderboard" class="f-link">Leaderboard</a>
668
- <a href="/curriculum/status" class="f-link">Curriculum</a>
669
- <a href="/about" class="f-link">About</a>
670
- </div>
671
- </div>
672
- <div>
673
- <div class="f-title">Built for</div>
674
- <div class="f-text">Meta × PyTorch × HuggingFace<br>OpenEnv Hackathon Finals<br>Bangalore, April 2026</div>
675
- <div class="f-text" style="font-size: 12px; margin-top: 16px;">Solo project by Arijit</div>
676
- </div>
677
- </div>
678
- <div class="f-bottom">
679
- <div>&copy; 2026 ARIA — Apache 2.0 License</div>
680
- <div>Can your agent handle a SEV-1 at 3am?</div>
681
- </div>
682
- </div>
683
- </footer>
684
-
685
- <script>
686
- // 1. Canvas Particles
687
- const canvas = document.getElementById('bg-canvas');
688
- const ctx = canvas.getContext('2d');
689
- let width, height;
690
- let particles = [];
691
-
692
- function resize() {
693
- width = window.innerWidth;
694
- height = window.innerHeight;
695
- canvas.width = width;
696
- canvas.height = height;
697
- }
698
- window.addEventListener('resize', resize);
699
- resize();
700
-
701
- for(let i=0; i<60; i++) {
702
- particles.push({
703
- x: Math.random() * width,
704
- y: Math.random() * height,
705
- vx: (Math.random() - 0.5) * 0.5,
706
- vy: (Math.random() - 0.5) * 0.5
707
- });
708
- }
709
-
710
- function drawParticles() {
711
- ctx.clearRect(0, 0, width, height);
712
- ctx.fillStyle = 'rgba(59, 130, 246, 0.1)';
713
- ctx.strokeStyle = 'rgba(59, 130, 246, 0.05)';
714
-
715
- for(let i=0; i<particles.length; i++) {
716
- let p = particles[i];
717
- p.x += p.vx; p.y += p.vy;
718
-
719
- if(p.x < 0 || p.x > width) p.vx *= -1;
720
- if(p.y < 0 || p.y > height) p.vy *= -1;
721
-
722
- ctx.beginPath();
723
- ctx.arc(p.x, p.y, 2, 0, Math.PI * 2);
724
- ctx.fill();
725
-
726
- for(let j=i+1; j<particles.length; j++) {
727
- let p2 = particles[j];
728
- let dx = p.x - p2.x, dy = p.y - p2.y;
729
- let dist = Math.sqrt(dx*dx + dy*dy);
730
- if(dist < 150) {
731
- ctx.beginPath();
732
- ctx.moveTo(p.x, p.y);
733
- ctx.lineTo(p2.x, p2.y);
734
- ctx.stroke();
735
- }
736
- }
737
- }
738
- requestAnimationFrame(drawParticles);
739
- }
740
- drawParticles();
741
-
742
- // 2. Intersection Observer for Fade-in
743
- const observer = new IntersectionObserver((entries) => {
744
- entries.forEach(entry => {
745
- if(entry.isIntersecting) {
746
- entry.target.classList.add('visible');
747
- }
748
- });
749
- }, { threshold: 0.1 });
750
- document.querySelectorAll('.fade-in').forEach(el => observer.observe(el));
751
-
752
- // 3. Status Check
753
- fetch('/health')
754
- .then(r => r.json())
755
- .then(data => {
756
- if(data.status === 'ok') {
757
- document.getElementById('nav-status-text').innerText = 'LIVE';
758
- }
759
- }).catch(e => console.error(e));
760
-
761
- // 4. Load Tasks
762
- const taskConfig = {
763
- 'easy': {icon: '💻', color: '#10b981', badge: 'EASY'},
764
- 'medium': {icon: '⚡', color: '#f59e0b', badge: 'MEDIUM'},
765
- 'hard': {icon: '🔥', color: '#ef4444', badge: 'HARD'},
766
- 'bonus': {icon: '💥', color: '#8b5cf6', badge: 'EXPERT'},
767
- 'security': {icon: '🛡️', color: '#06b6d4', badge: 'SECURITY'},
768
- 'database': {icon: '🗄️', color: '#f97316', badge: 'DATABASE'},
769
- 'failover': {icon: '🌐', color: '#6366f1', badge: 'FAILOVER'},
770
- 'generated': {icon: '✨', color: '#ec4899', badge: 'DYNAMIC'}
771
- };
772
-
773
- fetch('/tasks')
774
- .then(r => r.json())
775
- .then(data => {
776
- const grid = document.getElementById('task-grid');
777
- grid.innerHTML = '';
778
- data.tasks.forEach(t => {
779
- const cfg = taskConfig[t.id] || taskConfig['easy'];
780
- grid.innerHTML += `
781
- <div class="task-card" style="--card-color: ${cfg.color}; --card-bg: ${cfg.color}20;">
782
- <div class="task-header">
783
- <div class="task-icon">${cfg.icon}</div>
784
- <div class="task-badge">${cfg.badge}</div>
785
- </div>
786
- <div class="task-name">${t.name}</div>
787
- <div class="task-desc">${t.description}</div>
788
- <div class="task-footer">
789
- <div class="task-steps">Max steps: ${t.max_steps}</div>
790
- <div class="task-status">Ready</div>
791
- </div>
792
- </div>
793
- `;
794
- });
795
- }).catch(e => console.error(e));
796
-
797
- // 5. Curriculum
798
- fetch('/curriculum/status')
799
- .then(r => r.json())
800
- .then(data => {
801
- const container = document.getElementById('curriculum-container');
802
- if(data.total_episodes_recorded === 0) {
803
- container.innerHTML = '<div style="text-align: center; color: var(--text-dim); font-size: 13px;">No episodes recorded yet</div>';
804
- return;
805
- }
806
- container.innerHTML = '';
807
- const tasks = data.tasks || {};
808
- Object.keys(tasks).slice(0, 4).forEach(k => {
809
- let avg = tasks[k].rolling_avg;
810
- let w = Math.max(5, avg * 100);
811
- let color = avg < 0.3 ? 'var(--accent-red)' : (avg < 0.6 ? 'var(--accent-yellow)' : 'var(--accent-green)');
812
- let blocks = Math.round(w / 10);
813
- let fillStr = '█'.repeat(blocks);
814
- let trackStr = '░'.repeat(10 - blocks);
815
- container.innerHTML += `
816
- <div class="c-bar-row">
817
- <div class="c-bar-name">${k}</div>
818
- <div class="c-bar-track" style="color: ${color}"><span class="c-bar-fill">${fillStr}</span><span style="opacity:0.3">${trackStr}</span></div>
819
- <div class="c-bar-score">${avg.toFixed(2)}</div>
820
- </div>
821
- `;
822
- });
823
- }).catch(e => {
824
- document.getElementById('curriculum-container').innerHTML = '<div style="text-align: center; color: var(--text-dim); font-size: 13px;">--</div>';
825
- console.error(e);
826
- });
827
-
828
- // 6. Generator
829
- window.generateIncident = function() {
830
- const seed = document.getElementById('gen-seed').value || 42;
831
- fetch(`/generate/preview?seed=${seed}`)
832
- .then(r => r.json())
833
- .then(data => {
834
- const cMap = {oom: '#ef4444', cascade: '#f59e0b', corruption: '#8b5cf6', security: '#06b6d4', database: '#f97316', network_partition: '#6366f1'};
835
- const sMap = {sev1: '#ef4444', sev2: '#f59e0b', sev3: '#10b981'};
836
- const fColor = cMap[data.failure_mode] || '#3b82f6';
837
-
838
- document.getElementById('gen-badges').innerHTML = `
839
- <span class="gen-badge" style="background:${fColor}20; color:${fColor}">${data.failure_mode}</span>
840
- <span class="gen-badge" style="background:${sMap[data.severity] || '#3b82f6'}20; color:${sMap[data.severity] || '#3b82f6'}">${data.severity}</span>
841
- <span class="gen-badge" style="background:rgba(255,255,255,0.1); color:var(--text-secondary)">${data.incident_id}</span>
842
- `;
843
- document.getElementById('gen-affected').innerText = `Affected: ${data.affected_service}`;
844
- document.getElementById('gen-desc').innerText = data.description;
845
-
846
- let dColor = data.difficulty_score < 0.4 ? '#10b981' : (data.difficulty_score < 0.7 ? '#f59e0b' : '#ef4444');
847
- document.getElementById('gen-diff-fill').style.width = `${data.difficulty_score * 100}%`;
848
- document.getElementById('gen-diff-fill').style.background = dColor;
849
-
850
- document.getElementById('gen-result').style.display = 'block';
851
- }).catch(e => console.error(e));
852
- };
853
-
854
- // 7. Dual Agent
855
- window.startDualSession = function() {
856
- fetch('/multi-agent/reset', {
857
- method: 'POST',
858
- headers: {'Content-Type': 'application/json'},
859
- body: JSON.stringify({task_id: "easy", seed: 42})
860
- }).then(r => r.json())
861
- .then(data => {
862
- const info = document.getElementById('dual-session-info');
863
- info.innerHTML = `Session: ${data.session_id}<br><br>Agent A (POST): /multi-agent/step/a/${data.session_id}<br>Agent B (POST): /multi-agent/step/b/${data.session_id}`;
864
- info.style.display = 'block';
865
- }).catch(e => console.error(e));
866
- };
867
-
868
- // 8. Live Metrics
869
- function loadMetrics() {
870
- fetch('/metrics')
871
- .then(r => r.json())
872
- .then(data => {
873
- document.getElementById('m-episodes').innerText = data.total_episodes || 0;
874
- document.getElementById('m-avg').innerText = (data.overall_avg_score || 0).toFixed(3);
875
-
876
- // calculate overall resolution rate
877
- if (data.by_task) {
878
- let totalRes = 0, totalCount = 0;
879
- let bestScore = 0;
880
- Object.values(data.by_task).forEach(t => {
881
- totalRes += t.resolution_rate * t.count;
882
- totalCount += t.count;
883
- if (t.max_score > bestScore) bestScore = t.max_score;
884
- });
885
- let resRate = totalCount > 0 ? (totalRes / totalCount) * 100 : 0;
886
- document.getElementById('m-res').innerText = resRate.toFixed(1) + '%';
887
- document.getElementById('m-best').innerText = bestScore.toFixed(3);
888
- }
889
- }).catch(e => console.error(e));
890
- }
891
- loadMetrics();
892
- setInterval(loadMetrics, 30000);
893
-
894
- // 9. Leaderboard
895
- fetch('/leaderboard')
896
- .then(r => r.json())
897
- .then(data => {
898
- const tbody = document.getElementById('lb-body');
899
- if(!data.leaderboard || data.leaderboard.length === 0) {
900
- tbody.innerHTML = '<tr><td colspan="5" style="text-align: center; color: var(--text-dim);">No episodes yet. Try POST /reset to start your first episode.</td></tr>';
901
- return;
902
- }
903
- tbody.innerHTML = '';
904
- data.leaderboard.forEach(row => {
905
- let rClass = row.rank <= 3 ? `rank-${row.rank}` : '';
906
- let sColor = row.score >= 0.8 ? '#10b981' : (row.score >= 0.5 ? '#f59e0b' : '#ef4444');
907
- let statusHtml = row.score > 0.5 ? '<span style="color:#10b981">✅ Resolved</span>' : '<span style="color:#ef4444">❌ Failed</span>'; // Simple heuristic if resolution missing
908
-
909
- tbody.innerHTML += `
910
- <tr>
911
- <td class="${rClass}">#${row.rank}</td>
912
- <td>${row.task_id}</td>
913
- <td class="lb-score" style="color: ${sColor}">${row.score.toFixed(4)}</td>
914
- <td>${row.steps}</td>
915
- <td class="lb-status">${statusHtml}</td>
916
- </tr>
917
- `;
918
- });
919
- }).catch(e => console.error(e));
920
-
921
- // 10. Tabs
922
- window.switchTab = function(type) {
923
- document.querySelectorAll('.tab').forEach(t => t.classList.remove('active'));
924
- document.querySelectorAll('.code-block').forEach(c => c.classList.remove('active'));
925
- if(type === 'curl') {
926
- document.querySelectorAll('.tab')[0].classList.add('active');
927
- document.getElementById('code-curl').classList.add('active');
928
- } else {
929
- document.querySelectorAll('.tab')[1].classList.add('active');
930
- document.getElementById('code-python').classList.add('active');
931
- }
932
- };
933
-
934
- window.copyCode = function(id, btn) {
935
- const text = document.getElementById(id).innerText;
936
- navigator.clipboard.writeText(text).then(() => {
937
- let old = btn.innerText;
938
- btn.innerText = 'Copied ✓';
939
- setTimeout(() => btn.innerText = old, 2000);
940
- });
941
- };
942
- </script>
943
- </body>
944
- </html>
945
- """
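Review note on the curriculum bars in the removed script: the fill width is computed as `Math.round(Math.max(5, avg * 100) / 10)`, and the track as `'░'.repeat(10 - blocks)`. If `rolling_avg` ever came back above roughly 1.05, `10 - blocks` would be negative and `String.prototype.repeat` would throw a `RangeError`. Below is a minimal sketch of a clamped variant, assuming scores are meant to lie in [0, 1]. The helper name `renderBar` and its `width` parameter are hypothetical and do not appear in the original file.

```javascript
// Hypothetical helper (not in the original dashboard): builds the text bar
// for one curriculum row, e.g. "█████░░░░░" for a rolling average of 0.5.
function renderBar(avg, width = 10) {
  // Treat missing or non-numeric averages as 0, then clamp into [0, 1].
  const safe = Number.isFinite(avg) ? Math.min(1, Math.max(0, avg)) : 0;
  // Keep the same 5% minimum fill the dashboard code uses.
  const pct = Math.max(5, safe * 100);
  // Never exceed the track width, so the second repeat() count stays >= 0.
  const blocks = Math.min(width, Math.round((pct * width) / 100));
  return '█'.repeat(blocks) + '░'.repeat(width - blocks);
}

console.log(renderBar(0.5));
console.log(renderBar(1.5));
```

Clamping before rounding keeps the rendering identical to the original for in-range scores and only changes behavior for out-of-range or missing values.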
 
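Review note on the live metrics block in the removed script: the overall resolution rate is a count-weighted mean of the per-task `resolution_rate` values, and the best score is the maximum `max_score` across tasks. The sketch below isolates that arithmetic as a pure function so it can be checked without the DOM. It assumes the `/metrics` response carries a `by_task` object shaped as the original code reads it; the function name `summarizeByTask` and the returned field names are hypothetical.

```javascript
// Hypothetical helper (not in the original dashboard): reduces the by_task
// map from /metrics into the two numbers the metrics bar displays.
function summarizeByTask(byTask) {
  let resolved = 0; // sum of resolution_rate * count over all tasks
  let count = 0;    // total number of episodes
  let best = 0;     // highest max_score seen
  for (const t of Object.values(byTask || {})) {
    const n = Number(t.count) || 0;
    resolved += (Number(t.resolution_rate) || 0) * n;
    count += n;
    best = Math.max(best, Number(t.max_score) || 0);
  }
  return {
    // Guard against division by zero when no episodes are recorded.
    resolutionPct: count > 0 ? (resolved / count) * 100 : 0,
    bestScore: best,
  };
}

const summary = summarizeByTask({
  easy: { resolution_rate: 1, count: 1, max_score: 0.5 },
  hard: { resolution_rate: 0, count: 3, max_score: 0.9 },
});
console.log(summary.resolutionPct.toFixed(1) + '%', summary.bestScore.toFixed(3));
```

Weighting by `count` matters here: a plain average of the two rates would report 50%, while one resolved episode out of four is 25%.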
update_dashboard.py DELETED
@@ -1,658 +0,0 @@
1
- import re
2
-
3
- html_content = r"""<!DOCTYPE html>
4
- <html lang="en">
5
- <head>
6
- <meta charset="UTF-8">
7
- <meta name="viewport" content="width=device-width, initial-scale=1.0">
8
- <title>ARIA - DevOps Incident Response</title>
9
- <link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&family=JetBrains+Mono:wght@400;700&display=swap" rel="stylesheet">
10
- <style>
11
- :root {
12
- --bg: #060914;
13
- --bg-card: #111827;
14
- --border: #1f2937;
15
- --blue: #3b82f6;
16
- --cyan: #06b6d4;
17
- --green: #10b981;
18
- --red: #ef4444;
19
- --yellow: #f59e0b;
20
- --purple: #8b5cf6;
21
- --text: #f9fafb;
22
- --muted: #9ca3af;
23
- }
24
- * { margin: 0; padding: 0; box-sizing: border-box; }
25
- body {
26
- background: var(--bg);
27
- color: var(--text);
28
- font-family: 'Inter', sans-serif;
29
- min-height: 100vh;
30
- overflow-x: hidden;
31
- }
32
- html { scroll-behavior: smooth; }
33
- a { text-decoration: none; color: inherit; }
34
-
35
- /* Animation */
36
- @keyframes fadeInUp {
37
- from { opacity: 0; transform: translateY(20px); }
38
- to { opacity: 1; transform: translateY(0); }
39
- }
40
- .fade-in {
41
- opacity: 0;
42
- transform: translateY(20px);
43
- transition: opacity 0.6s ease-out, transform 0.6s ease-out;
44
- }
45
- .fade-in.visible { opacity: 1; transform: translateY(0); }
46
-
47
- /* Canvas Background */
48
- #bg-canvas {
49
- position: fixed;
50
- top: 0;
51
- left: 0;
52
- width: 100vw;
53
- height: 100vh;
54
- z-index: 0;
55
- pointer-events: none;
56
- }
57
-
58
- .container {
59
- max-width: 1280px;
60
- margin: 0 auto;
61
- padding: 0 24px;
62
- position: relative;
63
- z-index: 1;
64
- }
65
- section { padding: 80px 0; }
66
-
67
- /* Navbar */
68
- nav {
69
- position: fixed;
70
- top: 0;
71
- width: 100%;
72
- height: 64px;
73
- background: rgba(6, 9, 20, 0.8);
74
- backdrop-filter: blur(20px);
75
- border-bottom: 1px solid var(--border);
76
- z-index: 100;
77
- display: flex;
78
- align-items: center;
79
- }
80
- .nav-inner {
81
- display: flex;
82
- justify-content: space-between;
83
- align-items: center;
84
- width: 100%;
85
- max-width: 1280px;
86
- margin: 0 auto;
87
- padding: 0 24px;
88
- }
89
- .nav-left { display: flex; align-items: center; gap: 8px; }
90
- .nav-logo { font-size: 20px; font-weight: 700; color: var(--blue); }
91
- .nav-desc { font-size: 13px; color: var(--muted); display: none; }
92
- @media (min-width: 768px) { .nav-desc { display: block; } }
93
-
94
- .nav-center { display: flex; justify-content: center; flex: 1; }
95
- .status-pill {
96
- display: flex;
97
- align-items: center;
98
- gap: 6px;
99
- background: rgba(16, 185, 129, 0.2);
100
- border: 1px solid var(--green);
101
- color: var(--green);
102
- padding: 4px 12px;
103
- border-radius: 999px;
104
- font-size: 12px;
105
- font-weight: 600;
106
- }
107
- .status-dot {
108
- width: 6px;
109
- height: 6px;
110
- background: var(--green);
111
- border-radius: 50%;
112
- animation: pulse 2s infinite;
113
- }
114
- @keyframes pulse { 0% { transform: scale(1); opacity: 1; } 50% { transform: scale(1.5); opacity: 0.5; } 100% { transform: scale(1); opacity: 1; } }
115
-
116
- .nav-right { display: flex; gap: 24px; }
117
- .nav-link { font-size: 13px; color: var(--muted); transition: color 0.2s; }
118
- .nav-link:hover { color: var(--text); }
119
-
120
- /* Hero */
121
- .hero { padding: 120px 0 80px; text-align: center; }
122
- .hero-badge {
123
- background: rgba(59, 130, 246, 0.1);
124
- border: 1px solid rgba(59, 130, 246, 0.3);
125
- border-radius: 999px;
126
- padding: 6px 16px;
127
- font-size: 12px;
128
- color: var(--blue);
129
- display: inline-block;
130
- margin-bottom: 24px;
131
- }
132
- .hero-title {
133
- font-size: clamp(72px, 12vw, 140px);
134
- font-weight: 700;
135
- background: linear-gradient(135deg, var(--blue) 0%, var(--cyan) 50%, var(--purple) 100%);
136
- -webkit-background-clip: text;
137
- -webkit-text-fill-color: transparent;
138
- line-height: 1;
139
- letter-spacing: -4px;
140
- }
141
- .hero-subtitle { font-size: 20px; color: var(--muted); margin-top: 16px; font-weight: 400; }
142
- .hero-desc { font-size: 15px; color: #4b5563; margin-top: 12px; line-height: 1.6; max-width: 600px; margin-inline: auto; }
143
-
144
- .hero-buttons { margin-top: 40px; display: flex; justify-content: center; gap: 16px; flex-wrap: wrap; }
145
- .btn-primary, .btn-secondary {
146
- padding: 14px 28px; border-radius: 8px; font-weight: 600; font-size: 15px; transition: all 0.2s; cursor: pointer; display: inline-block;
147
- }
148
- .btn-primary { background: var(--blue); color: white; border: none; }
149
- .btn-primary:hover { background: #2563eb; transform: translateY(-2px); }
150
- .btn-secondary { background: transparent; border: 1px solid var(--border); color: var(--muted); }
151
- .btn-secondary:hover { border-color: var(--blue); color: white; transform: translateY(-2px); }
152
-
153
- .hero-stats { margin-top: 64px; display: grid; grid-template-columns: repeat(auto-fit, minmax(200px, 1fr)); gap: 16px; }
154
- .stat-card { background: var(--bg-card); border: 1px solid var(--border); border-radius: 12px; padding: 20px 32px; text-align: center; }
155
- .stat-val { font-family: 'JetBrains Mono', monospace; font-size: 32px; font-weight: 700; color: var(--blue); }
156
- .stat-label { font-size: 13px; color: var(--muted); margin-top: 4px; }
157
-
158
- .section-title { font-size: 24px; font-weight: 600; margin-bottom: 8px; }
159
- .section-subtitle { font-size: 15px; color: var(--muted); margin-bottom: 32px; }
160
-
161
- /* Tasks Grid */
162
- .task-grid { display: grid; grid-template-columns: repeat(4, 1fr); gap: 16px; }
163
- @media (max-width: 1024px) { .task-grid { grid-template-columns: repeat(2, 1fr); } }
164
- @media (max-width: 640px) { .task-grid { grid-template-columns: 1fr; } }
165
-
166
- .task-card {
167
- background: var(--bg-card); border: 1px solid var(--border); border-radius: 16px; padding: 24px;
168
- transition: all 0.3s; cursor: pointer; position: relative; overflow: hidden; display: flex; flex-direction: column;
169
- }
170
- .task-card::before { content: ''; position: absolute; top: 0; left: 0; right: 0; height: 2px; background: transparent; transition: all 0.3s; }
171
- .task-card:hover { transform: translateY(-4px); box-shadow: 0 20px 40px rgba(0,0,0,0.4); }
172
- .task-card:hover::before { background: var(--card-color, var(--border)); }
173
-
174
- .task-header { display: flex; justify-content: space-between; align-items: flex-start; }
175
- .task-icon { font-size: 32px; }
176
- .task-badge { font-size: 11px; font-weight: 700; padding: 4px 8px; border-radius: 6px; background: var(--card-bg); color: var(--card-color); letter-spacing: 0.5px; }
177
- .task-name { font-size: 16px; font-weight: 600; margin-top: 16px; }
178
- .task-desc { font-size: 13px; color: var(--muted); margin-top: 8px; line-height: 1.5; flex-grow: 1; }
179
- .task-footer { display: flex; justify-content: space-between; align-items: center; margin-top: 20px; }
180
- .task-steps { font-family: 'JetBrains Mono', monospace; font-size: 12px; color: #4b5563; }
181
- .task-status { display: flex; align-items: center; gap: 6px; font-size: 12px; color: var(--card-color); font-weight: 500; }
182
- .task-status::before { content: ''; width: 6px; height: 6px; border-radius: 50%; background: var(--card-color); }
183
-
184
- /* Features */
185
- .features-grid { display: grid; grid-template-columns: repeat(3, 1fr); gap: 24px; }
186
- @media (max-width: 900px) { .features-grid { grid-template-columns: 1fr; } }
187
- .feature-card { background: var(--bg-card); border: 1px solid var(--border); border-radius: 16px; padding: 32px; display: flex; flex-direction: column; }
188
- .feature-icon { font-size: 48px; margin-bottom: 24px; }
189
- .feature-title { font-size: 20px; font-weight: 600; margin-bottom: 12px; color: var(--text); }
190
- .feature-desc { font-size: 14px; color: var(--muted); line-height: 1.6; margin-bottom: 24px; flex-grow: 1; }
191
-
192
- .c-bar-row { display: flex; align-items: center; justify-content: space-between; margin-bottom: 8px; font-size: 12px; font-family: 'JetBrains Mono', monospace; }
193
- .c-bar-name { color: var(--muted); width: 80px; overflow: hidden; text-overflow: ellipsis; }
194
- .c-bar-track { flex-grow: 1; margin: 0 12px; letter-spacing: -2px; color: #4b5563; }
195
- .c-bar-score { width: 30px; text-align: right; }
196
-
197
- .generator-input { display: flex; gap: 8px; margin-bottom: 16px; }
198
- .gen-seed { background: #0d1117; border: 1px solid var(--border); color: white; padding: 8px 12px; border-radius: 6px; width: 80px; font-family: 'JetBrains Mono', monospace; }
199
- .btn-gen { background: var(--purple); color: white; border: none; padding: 8px 16px; border-radius: 6px; cursor: pointer; font-weight: 600; }
200
- .gen-result { background: #0d1117; border: 1px solid var(--border); border-radius: 8px; padding: 16px; display: none; }
201
- .gen-badges { display: flex; gap: 8px; margin-bottom: 12px; }
202
- .gen-badge { font-size: 10px; padding: 2px 6px; border-radius: 4px; font-weight: 600; text-transform: uppercase; }
203
- .gen-diff-bar { height: 4px; background: var(--border); border-radius: 2px; margin: 12px 0; overflow: hidden; }
204
-
205
- .dual-diagram { background: #0d1117; border: 1px solid var(--border); border-radius: 8px; padding: 16px; font-family: 'JetBrains Mono', monospace; font-size: 11px; margin-bottom: 24px; color: var(--muted); display: flex; justify-content: space-between; align-items: center; }
206
- .agent-box { border: 1px solid var(--border); padding: 8px; border-radius: 4px; background: rgba(0,0,0,0.2); width: 42%; }
207
- .agent-arrow { flex-grow: 1; text-align: center; color: var(--green); position: relative; }
208
- .agent-arrow::after { content: '→'; position: absolute; top: -10px; left: 50%; transform: translateX(-50%); animation: flowRight 1.5s infinite linear; }
209
- @keyframes flowRight { 0% { left: 20%; opacity: 0; } 50% { opacity: 1; } 100% { left: 80%; opacity: 0; } }
210
- .btn-green { background: var(--green); color: white; border: none; padding: 8px 16px; border-radius: 6px; cursor: pointer; font-weight: 600; }
211
-
212
- .feature-link { color: var(--blue); font-size: 14px; font-weight: 500; margin-top: 16px; display: inline-block; }
213
-
214
- /* Live Metrics */
215
- .metrics-bar { background: #0d1117; border-top: 1px solid var(--border); border-bottom: 1px solid var(--border); padding: 24px 0; }
216
- .metrics-grid { display: flex; justify-content: space-between; }
217
- .metric-item { text-align: center; flex: 1; border-right: 1px solid var(--border); }
218
- .metric-item:last-child { border-right: none; }
219
- .metric-val { font-family: 'JetBrains Mono', monospace; font-size: 28px; font-weight: 700; color: var(--blue); }
220
- .metric-label { font-size: 12px; color: var(--muted); margin-top: 4px; }
221
- @media (max-width: 640px) { .metrics-grid { flex-wrap: wrap; gap: 24px; } .metric-item { min-width: 40%; border: none; } }
222
-
223
- /* Leaderboard */
224
- .leaderboard-card { background: var(--bg-card); border: 1px solid var(--border); border-radius: 16px; overflow-x: auto; }
225
- table { width: 100%; border-collapse: collapse; text-align: left; }
226
- th { background: rgba(255,255,255,0.03); font-size: 11px; text-transform: uppercase; letter-spacing: 1px; color: #4b5563; padding: 12px 24px; border-bottom: 1px solid var(--border); }
227
- td { padding: 16px 24px; border-bottom: 1px solid var(--border); font-size: 14px; }
228
- tr:last-child td { border-bottom: none; }
229
- .lb-score { font-family: 'JetBrains Mono', monospace; font-weight: 600; }
230
-
231
- /* Quick Start */
232
- .tabs { display: flex; gap: 8px; margin-bottom: 16px; }
233
- .tab { background: transparent; border: none; color: var(--muted); padding: 8px 16px; border-radius: 6px; cursor: pointer; font-size: 14px; font-weight: 500; font-family: 'Inter', sans-serif;}
234
- .tab.active { background: var(--blue); color: white; }
235
- .code-block { background: #020408; border: 1px solid var(--border); border-radius: 12px; padding: 24px; position: relative; display: none; overflow-x: auto; }
236
- .code-block.active { display: block; }
237
- .code-text { font-family: 'JetBrains Mono', monospace; font-size: 13px; line-height: 1.8; color: var(--text); white-space: pre; }
238
- .btn-copy { position: absolute; top: 12px; right: 12px; background: rgba(255,255,255,0.1); border: 1px solid var(--border); color: var(--muted); padding: 4px 10px; border-radius: 4px; font-size: 12px; cursor: pointer; }
239
-
240
- .c-com { color: #4b5563; } .c-str { color: var(--green); } .c-cmd { color: var(--blue); } .c-url { color: var(--cyan); } .c-key { color: var(--yellow); }
241
-
242
- /* Training Evidence */
243
- .training-grid { display: grid; grid-template-columns: 1fr 1fr; gap: 24px; }
244
- @media (max-width: 900px) { .training-grid { grid-template-columns: 1fr; } }
245
- .train-card { background: var(--bg-card); border: 1px solid var(--border); border-radius: 16px; padding: 32px; display: flex; flex-direction: column; }
246
- .train-title { font-size: 18px; font-weight: 600; margin-bottom: 24px; }
247
- .train-row { margin-bottom: 24px; }
248
- .train-label { font-size: 12px; color: var(--muted); margin-bottom: 8px; display: flex; justify-content: space-between; align-items: center; }
249
- .train-badge { padding: 4px 8px; border-radius: 4px; font-family: 'JetBrains Mono', monospace; font-weight: 600; }
250
- .train-desc { font-size: 14px; color: var(--muted); line-height: 1.5; margin-left: 28px; }
251
- .train-vis { float: left; font-size: 18px; margin-top: 2px; }
252
- .tt-row { display: flex; justify-content: space-between; padding: 12px 0; border-bottom: 1px solid var(--border); }
253
- .tt-row:last-child { border-bottom: none; }
254
- .tt-key { font-size: 13px; color: var(--muted); }
255
- .tt-val { font-size: 13px; font-family: 'JetBrains Mono', monospace; color: var(--text); }
256
-
257
- /* Footer */
258
- footer { background: #0d1117; border-top: 1px solid var(--border); padding: 48px 0 32px; margin-top: 80px; }
259
- .footer-grid { display: grid; grid-template-columns: 2fr 1fr 1fr; gap: 32px; }
260
- @media (max-width: 768px) { .footer-grid { grid-template-columns: 1fr; } }
261
- .f-title { font-size: 14px; font-weight: 600; margin-bottom: 16px; }
262
- .f-text { font-size: 13px; color: #4b5563; line-height: 1.6; }
263
- .f-links { display: flex; flex-direction: column; gap: 12px; }
264
- .f-link { font-size: 13px; color: var(--muted); transition: color 0.2s; }
265
- .f-link:hover { color: var(--text); }
266
- .f-bottom { border-top: 1px solid var(--border); margin-top: 32px; padding-top: 24px; display: flex; justify-content: space-between; font-size: 12px; color: #4b5563; }
267
- </style>
268
- </head>
269
- <body>
270
- <canvas id="bg-canvas"></canvas>
271
-
272
- <nav>
273
- <div class="nav-inner">
274
- <div class="nav-left">
275
- <div class="nav-logo">🚨 ARIA</div>
276
- <div class="nav-desc">DevOps Incident Response</div>
277
- </div>
278
- <div class="nav-center">
279
- <div class="status-pill">
280
- <div class="status-dot"></div>
281
- <span id="nav-status-text">CONNECTING</span>
282
- </div>
283
- </div>
284
- <div class="nav-right">
285
- <a href="/docs" class="nav-link">API Docs</a>
286
- <a href="/validate" class="nav-link">Validate</a>
287
- <a href="/metrics" class="nav-link">Metrics</a>
288
- <a href="/leaderboard" class="nav-link">Leaderboard</a>
289
- </div>
290
- </div>
291
- </nav>
292
-
293
- <main class="container">
294
- <section class="hero fade-in">
295
- <div class="hero-badge">⚡ OpenEnv Compliant · Meta × PyTorch × HuggingFace</div>
296
- <h1 class="hero-title">ARIA</h1>
297
- <div class="hero-subtitle">Adaptive Reward & Incident Architecture</div>
298
- <p class="hero-desc">The first OpenEnv RL environment for production incident response.<br>7 tasks · 14 actions · Curriculum · Dual-agent · Trained Llama-3.1-8B</p>
299
-
300
- <div class="hero-buttons">
301
- <a href="/docs" class="btn-primary">Try Live API &rarr;</a>
302
- <a href="https://github.com/Twilight-13/devops-incident-response" target="_blank" class="btn-secondary">View GitHub &rarr;</a>
303
- </div>
304
-
305
- <div class="hero-stats">
306
- <div class="stat-card"><div class="stat-val">7</div><div class="stat-label">Tasks</div></div>
307
- <div class="stat-card"><div class="stat-val">14</div><div class="stat-label">Actions</div></div>
308
- <div class="stat-card"><div class="stat-val">&infin;</div><div class="stat-label">Scenarios</div></div>
309
- <div class="stat-card"><div class="stat-val">0.99</div><div class="stat-label">Max Score</div></div>
310
- </div>
311
- </section>
312
-
313
- <section class="fade-in">
314
- <h2 class="section-title">Environment Tasks</h2>
315
- <p class="section-subtitle">Eight scenarios of escalating operational complexity</p>
316
- <div class="task-grid" id="task-grid"><div style="grid-column: 1/-1; text-align: center; color: var(--muted);">Loading tasks...</div></div>
317
- </section>
318
-
319
- <section class="fade-in">
320
- <h2 class="section-title">ARIA Features</h2>
321
- <p class="section-subtitle">What makes this environment unique</p>
322
- <div class="features-grid">
323
- <div class="feature-card">
324
- <div class="feature-icon">🎓</div>
325
- <h3 class="feature-title">Curriculum Engine</h3>
326
- <p class="feature-desc">Tracks agent performance per task with rolling averages. Promotes when mastered (avg > 0.75). Scaffolds with hints when struggling (avg < 0.30). Agents always train at the edge of their capability.</p>
327
- <div id="curriculum-container" style="margin-bottom: 24px;"></div>
328
- <a href="/curriculum/status" class="feature-link" style="color: var(--blue);">View Status &rarr;</a>
329
- </div>
330
-
- <div class="feature-card">
- <div class="feature-icon">⚡</div>
- <h3 class="feature-title">Incident Generator</h3>
- <p class="feature-desc">Procedural incidents from seeds 0–99,999. Six failure modes × eight services × variable noise = infinite unique training scenarios. Same seed always produces the same incident.</p>
- <div class="generator-input">
- <input type="number" id="gen-seed" class="gen-seed" value="42" min="0" max="99999">
- <button class="btn-gen" onclick="generateIncident()">Generate</button>
- </div>
- <div class="gen-result" id="gen-result">
- <div class="gen-badges" id="gen-badges"></div>
- <div style="font-size: 13px; font-weight: 600; margin-bottom: 8px;" id="gen-affected"></div>
- <div style="font-size: 12px; color: var(--muted); line-height: 1.5;" id="gen-desc"></div>
- <div class="gen-diff-bar"><div id="gen-diff-fill" style="height: 100%; transition: width 0.3s;"></div></div>
- </div>
- <a href="/generate/preview?seed=42" class="feature-link" style="color: var(--purple);">Try Generator &rarr;</a>
- </div>
-
- <div class="feature-card">
- <div class="feature-icon">🤝</div>
- <h3 class="feature-title">Dual-Agent Mode</h3>
- <p class="feature-desc">Split observability between two agents. Observer sees logs and alerts. Responder sees metrics and dependencies. Neither can solve the incident alone — they must coordinate via share_finding.</p>
- <div class="dual-diagram">
- <div class="agent-box"><div style="font-weight:700; margin-bottom:4px;">AGENT A: Observer</div><div>• alerts, logs</div></div>
- <div class="agent-arrow">share_finding</div>
- <div class="agent-box"><div style="font-weight:700; margin-bottom:4px;">AGENT B: Responder</div><div>• metrics, deps</div></div>
- </div>
- <button class="btn-green" onclick="startDualSession()">Start Session</button>
- <div id="dual-session-info" style="margin-top: 16px; font-family: 'JetBrains Mono', monospace; font-size: 11px; color: var(--green); display: none; word-break: break-all;"></div>
- <a href="/multi-agent/sessions" class="feature-link" style="color: var(--green); margin-top: auto;">View Sessions &rarr;</a>
- </div>
- </div>
- </section>
- </main>
-
- <div class="metrics-bar fade-in">
- <div class="container metrics-grid">
- <div class="metric-item"><div class="metric-val" id="m-episodes">--</div><div class="metric-label">Total Episodes</div></div>
- <div class="metric-item"><div class="metric-val" id="m-avg">--</div><div class="metric-label">Avg Score</div></div>
- <div class="metric-item"><div class="metric-val" id="m-res">--</div><div class="metric-label">Resolution Rate</div></div>
- <div class="metric-item"><div class="metric-val" id="m-best">--</div><div class="metric-label">Best Score</div></div>
- </div>
- </div>
-
- <main class="container">
- <section class="fade-in">
- <h2 class="section-title">🏆 Leaderboard</h2>
- <div class="leaderboard-card">
- <table>
- <thead><tr><th>Rank</th><th>Task</th><th>Score</th><th>Steps</th><th>Status</th></tr></thead>
- <tbody id="lb-body"><tr><td colspan="5" style="text-align: center; color: var(--muted);">Loading leaderboard...</td></tr></tbody>
- </table>
- </div>
- </section>
-
- <section class="fade-in">
- <h2 class="section-title">Quick Start</h2>
- <div class="tabs">
- <button class="tab active" onclick="switchTab('curl')">curl</button>
- <button class="tab" onclick="switchTab('python')">Python</button>
- </div>
-
- <div id="code-curl" class="code-block active">
- <button class="btn-copy" onclick="copyCode('code-curl-text', this)">Copy</button>
- <div class="code-text" id="code-curl-text"><span class="c-com"># 1. Start an incident</span>
- <span class="c-cmd">curl</span> -X POST https://arijit-07-devops-incident-response.hf.space/reset \
- -H <span class="c-str">"Content-Type: application/json"</span> \
- -d <span class="c-str">'{<span class="c-key">"task_id"</span>: <span class="c-str">"easy"</span>, <span class="c-key">"seed"</span>: 42}'</span>
-
- <span class="c-com"># 2. Read logs (reward: +0.15)</span>
- <span class="c-cmd">curl</span> -X POST https://arijit-07-devops-incident-response.hf.space/step \
- -H <span class="c-str">"Content-Type: application/json"</span> \
- -d <span class="c-str">'{<span class="c-key">"action_type"</span>: <span class="c-str">"read_logs"</span>, <span class="c-key">"service"</span>: <span class="c-str">"payment-service"</span>}'</span>
-
- <span class="c-com"># 3. Diagnose (reward: +0.30)</span>
- <span class="c-cmd">curl</span> -X POST https://arijit-07-devops-incident-response.hf.space/step \
- -H <span class="c-str">"Content-Type: application/json"</span> \
- -d <span class="c-str">'{<span class="c-key">"action_type"</span>: <span class="c-str">"diagnose"</span>, <span class="c-key">"root_cause"</span>: <span class="c-str">"memory leak in payment-service"</span>}'</span>
-
- <span class="c-com"># 4. Fix it (reward: +0.40)</span>
- <span class="c-cmd">curl</span> -X POST https://arijit-07-devops-incident-response.hf.space/step \
- -H <span class="c-str">"Content-Type: application/json"</span> \
- -d <span class="c-str">'{<span class="c-key">"action_type"</span>: <span class="c-str">"restart_service"</span>, <span class="c-key">"service"</span>: <span class="c-str">"payment-service"</span>}'</span>
-
- <span class="c-com"># Score: ~0.94 ✅</span></div>
- </div>
-
- <div id="code-python" class="code-block">
- <button class="btn-copy" onclick="copyCode('code-py-text', this)">Copy</button>
- <div class="code-text" id="code-py-text"><span class="c-cmd">import</span> requests
- BASE = <span class="c-str">"https://arijit-07-devops-incident-response.hf.space"</span>
-
- <span class="c-com"># Start episode</span>
- obs = requests.post(<span class="c-url">f"{BASE}/reset"</span>, json={<span class="c-key">"task_id"</span>: <span class="c-str">"easy"</span>, <span class="c-key">"seed"</span>: 42}).json()
-
- <span class="c-com"># Take action</span>
- result = requests.post(<span class="c-url">f"{BASE}/step"</span>,
- json={<span class="c-key">"action_type"</span>: <span class="c-str">"read_logs"</span>, <span class="c-key">"service"</span>: <span class="c-str">"payment-service"</span>}).json()
-
- print(<span class="c-url">f"Reward: {result['reward']}"</span>) <span class="c-com"># 0.15</span></div>
- </div>
- </section>
-
- <section class="fade-in">
- <h2 class="section-title">🧠 Training Evidence</h2>
- <div class="training-grid">
- <div class="train-card">
- <h3 class="train-title">Before vs After</h3>
- <div class="train-row">
- <div class="train-label"><span>Base Llama-3.1-8B</span><span class="train-badge" style="background: rgba(239, 68, 68, 0.2); color: var(--red);">0.000</span></div>
- <div class="train-vis" style="color: var(--red);">❌</div>
- <div class="train-desc">jumps to diagnose, gets penalized</div>
- </div>
- <div class="train-row">
- <div class="train-label"><span>ARIA Fine-tuned</span><span class="train-badge" style="background: rgba(16, 185, 129, 0.2); color: var(--green);">0.150</span></div>
- <div class="train-vis" style="color: var(--green);">✅</div>
- <div class="train-desc">reads logs first, every time</div>
- </div>
- <a href="https://huggingface.co/Arijit-07/aria-devops-llama8b" target="_blank" class="feature-link">Model weights &rarr;</a>
- </div>
- <div class="train-card">
- <h3 class="train-title">Training Details</h3>
- <div class="tt-row"><div class="tt-key">Algorithm</div><div class="tt-val">GRPO</div></div>
- <div class="tt-row"><div class="tt-key">Base Model</div><div class="tt-val">Llama-3.1-8B-Instruct</div></div>
- <div class="tt-row"><div class="tt-key">Framework</div><div class="tt-val">Unsloth + HuggingFace TRL</div></div>
- <div class="tt-row"><div class="tt-key">LoRA Rank</div><div class="tt-val">32 (alpha 64)</div></div>
- <div class="tt-row"><div class="tt-key">Episodes</div><div class="tt-val">160</div></div>
- <div class="tt-row"><div class="tt-key">GPU</div><div class="tt-val">NVIDIA L4</div></div>
- </div>
- </div>
- </section>
- </main>
-
- <footer>
- <div class="container">
- <div class="footer-grid">
- <div>
- <div style="font-size: 20px; font-weight: 700; color: var(--blue); margin-bottom: 8px;">🚨 ARIA</div>
- <div class="f-text">DevOps Incident Response<br>OpenEnv-compliant RL environment</div>
- <div style="display: flex; gap: 16px; margin-top: 16px;">
- <a href="https://github.com/Twilight-13/devops-incident-response" target="_blank" class="f-link">GitHub</a>
- <a href="https://huggingface.co/Arijit-07/aria-devops-llama8b" target="_blank" class="f-link">Model</a>
- </div>
- </div>
- <div>
- <div class="f-title">Resources</div>
- <div class="f-links">
- <a href="/docs" class="f-link">Live API Docs</a>
- <a href="/validate" class="f-link">Validate</a>
- <a href="/metrics" class="f-link">Metrics</a>
- <a href="/leaderboard" class="f-link">Leaderboard</a>
- </div>
- </div>
- <div>
- <div class="f-title">Built for</div>
- <div class="f-text">Meta × PyTorch × HuggingFace<br>OpenEnv Hackathon Finals<br>Bangalore, April 2026</div>
- </div>
- </div>
- <div class="f-bottom">
- <div>&copy; 2026 ARIA — Apache 2.0 License</div>
- <div>Can your agent handle a SEV-1 at 3am?</div>
- </div>
- </div>
- </footer>
-
- <script>
- const canvas = document.getElementById('bg-canvas');
- const ctx = canvas.getContext('2d');
- let width, height, particles = [];
-
- function resize() { width = canvas.width = window.innerWidth; height = canvas.height = window.innerHeight; }
- window.addEventListener('resize', resize); resize();
-
- for(let i=0; i<50; i++) {
- particles.push({ x: Math.random() * width, y: Math.random() * height, vx: (Math.random()-0.5)*0.5, vy: (Math.random()-0.5)*0.5 });
- }
-
- function draw() {
- ctx.clearRect(0, 0, width, height);
- ctx.fillStyle = 'rgba(59, 130, 246, 0.2)';
- ctx.strokeStyle = 'rgba(59, 130, 246, 0.1)';
- for(let i=0; i<particles.length; i++) {
- let p = particles[i];
- p.x += p.vx; p.y += p.vy;
- if(p.x < 0 || p.x > width) p.vx *= -1;
- if(p.y < 0 || p.y > height) p.vy *= -1;
- ctx.beginPath(); ctx.arc(p.x, p.y, 2, 0, Math.PI*2); ctx.fill();
- for(let j=i+1; j<particles.length; j++) {
- let p2 = particles[j], dist = Math.hypot(p.x-p2.x, p.y-p2.y);
- if(dist < 150) { ctx.beginPath(); ctx.moveTo(p.x, p.y); ctx.lineTo(p2.x, p2.y); ctx.stroke(); }
- }
- }
- requestAnimationFrame(draw);
- }
- draw();
-
- const observer = new IntersectionObserver(e => e.forEach(en => { if(en.isIntersecting) en.target.classList.add('visible'); }), {threshold: 0.1});
- document.querySelectorAll('.fade-in').forEach(el => observer.observe(el));
-
- fetch('/health').then(r => r.json()).then(d => {
- if(d.status === 'ok') document.getElementById('nav-status-text').innerText = 'LIVE';
- }).catch(e => console.error(e));
-
- const tMap = {
- 'easy': {icon: '💻', color: '#10b981', badge: 'EASY'}, 'medium': {icon: '⚡', color: '#f59e0b', badge: 'MEDIUM'},
- 'hard': {icon: '🔥', color: '#ef4444', badge: 'HARD'}, 'bonus': {icon: '💥', color: '#8b5cf6', badge: 'EXPERT'},
- 'security': {icon: '🛡️', color: '#06b6d4', badge: 'SECURITY'}, 'database': {icon: '🗄️', color: '#f97316', badge: 'DATABASE'},
- 'failover': {icon: '🌐', color: '#6366f1', badge: 'FAILOVER'}, 'generated': {icon: '✨', color: '#ec4899', badge: 'DYNAMIC'}
- };
- fetch('/tasks').then(r => r.json()).then(d => {
- document.getElementById('task-grid').innerHTML = d.tasks.map(t => {
- let c = tMap[t.id] || tMap['easy'];
- return `<div class="task-card" style="--card-color:${c.color};--card-bg:${c.color}20;">
- <div class="task-header"><div class="task-icon">${c.icon}</div><div class="task-badge">${c.badge}</div></div>
- <div class="task-name">${t.name}</div><div class="task-desc">${t.description}</div>
- <div class="task-footer"><div class="task-steps">Max steps: ${t.max_steps}</div><div class="task-status">Ready</div></div>
- </div>`;
- }).join('');
- }).catch(e => console.error(e));
-
- fetch('/curriculum/status').then(r => r.json()).then(d => {
- const el = document.getElementById('curriculum-container');
- if(!d.total_episodes_recorded) el.innerHTML = '<div style="color:var(--muted); font-size:13px; text-align:center;">No episodes yet — run POST /reset to begin</div>';
- else {
- el.innerHTML = Object.keys(d.tasks).slice(0, 4).map(k => {
- let avg = d.tasks[k].rolling_avg, col = avg < 0.3 ? 'var(--red)' : (avg < 0.6 ? 'var(--yellow)' : 'var(--green)');
- // Clamp to 0..10 so String.prototype.repeat never receives a negative count.
- let bl = Math.max(0, Math.min(10, Math.round(avg * 10)));
- return `<div class="c-bar-row"><div class="c-bar-name">${k}</div>
- <div class="c-bar-track" style="color:${col}"><span>${'█'.repeat(bl)}</span><span style="opacity:0.3">${'░'.repeat(10-bl)}</span></div>
- <div class="c-bar-score">${avg.toFixed(2)}</div></div>`;
- }).join('');
- }
- }).catch(e => console.error(e));
-
- window.generateIncident = () => {
- const seed = document.getElementById('gen-seed').value || 42;
- fetch(`/generate/preview?seed=${seed}`).then(r => r.json()).then(d => {
- const colors = {oom: '#ef4444', cascade: '#f59e0b', corruption: '#8b5cf6', security: '#06b6d4', database: '#f97316', network_partition: '#6366f1'};
- const sc = {sev1: '#ef4444', sev2: '#f59e0b', sev3: '#10b981'};
- let fcol = colors[d.failure_mode] || 'var(--blue)';
- document.getElementById('gen-badges').innerHTML = `<span class="gen-badge" style="background:${fcol}20;color:${fcol}">${d.failure_mode}</span><span class="gen-badge" style="background:${sc[d.severity]||fcol}20;color:${sc[d.severity]||fcol}">${d.severity}</span><span class="gen-badge" style="background:rgba(255,255,255,0.1);color:var(--muted)">${d.incident_id}</span>`;
- document.getElementById('gen-affected').innerText = `Affected: ${d.affected_service}`;
- document.getElementById('gen-desc').innerText = d.description;
- let dc = d.difficulty_score < 0.4 ? 'var(--green)' : (d.difficulty_score < 0.7 ? 'var(--yellow)' : 'var(--red)');
- let fill = document.getElementById('gen-diff-fill');
- fill.style.width = `${d.difficulty_score*100}%`; fill.style.background = dc;
- document.getElementById('gen-result').style.display = 'block';
- }).catch(e => console.error(e));
- };
-
- window.startDualSession = () => {
- fetch('/multi-agent/reset', { method: 'POST', headers: {'Content-Type': 'application/json'}, body: JSON.stringify({task_id: "easy", seed: 42}) })
- .then(r => r.json()).then(d => {
- let info = document.getElementById('dual-session-info');
- info.innerHTML = `Session: ${d.session_id}<br><br>Agent A (POST): /multi-agent/step/a/${d.session_id}<br>Agent B (POST): /multi-agent/step/b/${d.session_id}`;
- info.style.display = 'block';
- }).catch(e => console.error(e));
- };
-
- const loadMetrics = () => {
- fetch('/metrics').then(r => r.json()).then(d => {
- document.getElementById('m-episodes').innerText = d.total_episodes || 0;
- document.getElementById('m-avg').innerText = (d.overall_avg_score || 0).toFixed(3);
- if(d.by_task) {
- let tRes = 0, tCnt = 0, best = 0;
- Object.values(d.by_task).forEach(t => { tRes += t.resolution_rate*t.count; tCnt += t.count; if(t.max_score > best) best = t.max_score; });
- document.getElementById('m-res').innerText = (tCnt ? (tRes/tCnt)*100 : 0).toFixed(1) + '%';
- document.getElementById('m-best').innerText = best.toFixed(3);
- }
- }).catch(e => console.error(e));
- };
- loadMetrics(); setInterval(loadMetrics, 30000);
-
- fetch('/leaderboard').then(r => r.json()).then(d => {
- const body = document.getElementById('lb-body');
- if(!d.leaderboard || !d.leaderboard.length) { body.innerHTML = '<tr><td colspan="5" style="text-align: center; color: var(--muted);">No episodes yet. Try POST /reset to start.</td></tr>'; return; }
- body.innerHTML = d.leaderboard.map(r => {
- let rank = r.rank === 1 ? 'color:#fbbf24;font-weight:bold' : (r.rank === 2 ? 'color:#9ca3af;font-weight:bold' : (r.rank === 3 ? 'color:#cd7f32;font-weight:bold' : ''));
- let sCol = r.score >= 0.8 ? 'var(--green)' : (r.score >= 0.5 ? 'var(--yellow)' : 'var(--red)');
- let status = r.score > 0.5 ? '<span style="color:var(--green)">✅ Resolved</span>' : '<span style="color:var(--red)">❌ Failed</span>';
- return `<tr><td style="${rank}">#${r.rank}</td><td>${r.task_id}</td><td class="lb-score" style="color:${sCol}">${r.score.toFixed(4)}</td><td>${r.steps}</td><td>${status}</td></tr>`;
- }).join('');
- }).catch(e => console.error(e));
-
- window.switchTab = t => {
- document.querySelectorAll('.tab').forEach(el => el.classList.remove('active'));
- document.querySelectorAll('.code-block').forEach(el => el.classList.remove('active'));
- document.querySelectorAll('.tab')[t === 'curl' ? 0 : 1].classList.add('active');
- document.getElementById('code-'+t).classList.add('active');
- };
-
- window.copyCode = (id, btn) => {
- navigator.clipboard.writeText(document.getElementById(id).innerText).then(() => {
- let old = btn.innerText; btn.innerText = 'Copied ✓'; setTimeout(() => btn.innerText = old, 2000);
- });
- };
- </script>
- </body>
- </html>"""
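The Curriculum Engine card in the dashboard HTML above states a simple policy: keep a rolling average per task, promote when it exceeds 0.75, and add hints when it falls below 0.30. A minimal sketch of that rule, for illustration only; the class name `CurriculumTracker`, the window size of 10, and the `promote`/`scaffold`/`hold` labels are assumptions, not ARIA's actual implementation.

```python
from collections import defaultdict, deque

PROMOTE_ABOVE = 0.75   # threshold quoted on the dashboard card
SCAFFOLD_BELOW = 0.30  # threshold quoted on the dashboard card


class CurriculumTracker:
    """Hypothetical tracker that keeps a rolling window of scores per task."""

    def __init__(self, window: int = 10):  # window size is an assumption
        self._scores = defaultdict(lambda: deque(maxlen=window))

    def record(self, task_id: str, score: float) -> None:
        """Store one episode score; the deque drops the oldest when full."""
        self._scores[task_id].append(score)

    def rolling_avg(self, task_id: str) -> float:
        scores = self._scores[task_id]
        return sum(scores) / len(scores) if scores else 0.0

    def decision(self, task_id: str) -> str:
        """Return 'promote', 'scaffold', or 'hold' for the given task."""
        if not self._scores[task_id]:
            return "hold"  # no episodes recorded yet
        avg = self.rolling_avg(task_id)
        if avg > PROMOTE_ABOVE:
            return "promote"
        if avg < SCAFFOLD_BELOW:
            return "scaffold"
        return "hold"
```

Because the window is bounded, a run of recent failures can pull a previously mastered task back below the scaffolding threshold, which matches the card's claim that agents train at the edge of their capability.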
-
- import re
- import sys
-
- # Escape { and } as {{ and }} so the HTML survives being embedded in an f-string.
- html_escaped = html_content.replace("{", "{{").replace("}", "}}")
-
- with open("server/app.py", "r", encoding="utf-8") as f:
-     app_text = f.read()
-
- # Locate the current dashboard endpoint. Check both lookups before adding the
- # marker length: find() returns -1 on failure, and -1 + len(marker) is not -1.
- end_marker = ' return html'
- start_idx = app_text.find("def dashboard():")
- marker_idx = app_text.find(end_marker, start_idx) if start_idx != -1 else -1
-
- if start_idx == -1 or marker_idx == -1:
-     print("Could not find dashboard.")
-     sys.exit(1)
-
- end_idx = marker_idx + len(end_marker)
-
- new_dashboard = f'''def dashboard():
-     html = f"""{html_escaped}"""
-     return html'''
-
-
- new_text = app_text[:start_idx] + new_dashboard + app_text[end_idx:]
-
- with open("server/app.py", "w", encoding="utf-8") as f:
-     f.write(new_text)
-
- print("Dashboard replaced successfully.")
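The script above splices new HTML into `server/app.py` by string search. Since `str.find` returns -1 on failure, both lookups need checking before any length is added to them. The same splice can be sketched as a pure function that is easy to test without touching the file system; `escape_braces` and `replace_dashboard` are hypothetical names, and the default markers are taken from the script.

```python
def escape_braces(html: str) -> str:
    """Double { and } so the text survives being embedded in an f-string."""
    return html.replace("{", "{{").replace("}", "}}")


def replace_dashboard(app_text: str, html: str,
                      start_marker: str = "def dashboard():",
                      end_marker: str = "return html") -> str:
    """Return app_text with the dashboard function replaced by one serving `html`.

    Raises ValueError when either marker is missing, instead of silently
    slicing at a wrong offset.
    """
    start_idx = app_text.find(start_marker)
    if start_idx == -1:
        raise ValueError("start marker not found")
    marker_idx = app_text.find(end_marker, start_idx)
    if marker_idx == -1:
        raise ValueError("end marker not found")
    end_idx = marker_idx + len(end_marker)

    # Four-space indentation for the generated body is an assumption.
    new_dashboard = (
        "def dashboard():\n"
        f'    html = f"""{escape_braces(html)}"""\n'
        "    return html"
    )
    return app_text[:start_idx] + new_dashboard + app_text[end_idx:]
```

This sketch shares the original's limits: it stops at the first `return html` after the function header, and HTML containing `"""` or backslashes would still need extra handling before being embedded in generated source.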