aria-devops-llama3b / README.md
Arijit-07's picture
Update README.md
e1b5cb0 verified
---
title: ARIA DevOps Incident Response
emoji: 🚨
colorFrom: blue
colorTo: red
sdk: docker
pinned: true
license: apache-2.0
tags:
- openenv
- reinforcement-learning
- devops
- incident-response
- rl-environment
- multi-agent
- llm-agent
- grpo
- curriculum-learning
- huggingface
- pytorch
- meta
short_description: RL environment for DevOps incident response agents
---
# ARIA β€” DevOps Incident Response
### *The first OpenEnv RL environment for production incident response*
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Twilight-13/devops-incident-response/blob/main/train_grpo.ipynb)
[![HF Space](https://img.shields.io/badge/πŸ€—-Live%20Environment-orange)](https://huggingface.co/spaces/Arijit-07/devops-incident-response)
[![Trained Model](https://img.shields.io/badge/πŸ€—-Trained%20Model-blue)](https://huggingface.co/Arijit-07/aria-devops-llama3b)
[![License](https://img.shields.io/badge/License-Apache_2.0-green.svg)](LICENSE)
> **ARIA** β€” Adaptive Reward & Incident Architecture
> Built for the Meta Γ— PyTorch Γ— HuggingFace OpenEnv Hackathon Finals | Bangalore, April 2026
---
## πŸ”— Quick Links for Judges
| Resource | Link |
|---|---|
| **Live Environment** | https://arijit-07-devops-incident-response.hf.space |
| **Interactive API (Swagger)** | https://arijit-07-devops-incident-response.hf.space/docs |
| **Trained Model (Llama-3B LoRA)** | https://huggingface.co/Arijit-07/aria-devops-llama3b |
| **Training Curve** | https://huggingface.co/Arijit-07/aria-devops-llama3b/resolve/main/training_curve.png |
| **HuggingFace Blog** | https://huggingface.co/blog/Arijit-07/aria-devops-incident-response |
| **GitHub** | https://github.com/Twilight-13/devops-incident-response |
| **Validate (self-test)** | https://arijit-07-devops-incident-response.hf.space/validate |
---
## ⚑ Run a Complete Episode Right Now
```bash
# 1. Start an easy incident
curl -X POST https://arijit-07-devops-incident-response.hf.space/reset \
-H "Content-Type: application/json" \
-d '{"task_id": "easy", "seed": 42}'
# 2. Read logs on the failing service (reward: +0.15)
curl -X POST https://arijit-07-devops-incident-response.hf.space/step \
-H "Content-Type: application/json" \
-d '{"action_type": "read_logs", "service": "payment-service"}'
# 3. Diagnose the root cause (reward: +0.30)
curl -X POST https://arijit-07-devops-incident-response.hf.space/step \
-H "Content-Type: application/json" \
-d '{"action_type": "diagnose", "root_cause": "memory leak in payment-service"}'
# 4. Fix it (reward: +0.40)
curl -X POST https://arijit-07-devops-incident-response.hf.space/step \
-H "Content-Type: application/json" \
-d '{"action_type": "restart_service", "service": "payment-service"}'
# 5. See the final score
curl https://arijit-07-devops-incident-response.hf.space/state
# 6. Validate all 7 tasks pass
curl https://arijit-07-devops-incident-response.hf.space/validate
```
```python
# Or install and use the Python client
pip install git+https://github.com/Twilight-13/devops-incident-response.git
from devops_incident_response import DevOpsIncidentEnv, Action, ActionType
env = DevOpsIncidentEnv(task_id="easy", seed=42)
obs = env.reset()
result = env.step(Action(action_type=ActionType.READ_LOGS, service="payment-service"))
print(f"Reward: {result.reward}") # 0.15
```
---
## 🎯 The Problem This Solves
Every software company running microservices faces the same brutal reality: **production incidents are expensive, unpredictable, and happen at 3am.**
A single SEV-1 incident β€” a payment service crashing, a data corruption silently corrupting prices, a DDoS botnet overwhelming your login endpoint β€” can cost millions and require hours of expert engineer time to diagnose and fix. On-call rotations are stressful. Tier-2 incidents that follow recognizable patterns are handled by engineers when they could, in principle, be handled by an AI agent.
**Yet no RL benchmark exists for this domain.**
SWE-bench tests code generation. WebArena tests web navigation. AgentBench tests general tool use. None of them model **operational intelligence** β€” the ability to reason under uncertainty about live production systems, gather information strategically, and take precise actions where wrong choices cause additional damage.
ARIA fills that gap.
---
## πŸ—οΈ Environment Architecture
ARIA simulates a production microservices e-commerce platform. Agents interact with the environment through a standard OpenEnv API: `reset()`, `step()`, `state()`.
### What the Agent Observes
Each step returns a structured `Observation` object:
```
Observation
β”œβ”€β”€ step, max_steps, task_id, task_description
β”œβ”€β”€ services: List[ServiceStatus]
β”‚ β”œβ”€β”€ name, status (healthy/degraded/down/unknown)
β”‚ β”œβ”€β”€ cpu_percent, memory_percent
β”‚ β”œβ”€β”€ error_rate, latency_p99_ms
β”‚ β”œβ”€β”€ replicas_running, replicas_desired
β”‚ β”œβ”€β”€ current_version, last_deployed
β”‚ └── sla_breach, minutes_degraded ← SLA tracking per step
β”œβ”€β”€ active_alerts: List[Alert] ← may include red herrings
β”œβ”€β”€ recent_logs: Dict[str, List[str]] ← PARTIAL: only 2 lines shown
β”œβ”€β”€ service_dependencies: List[ServiceDependency] ← call topology
β”œβ”€β”€ evidence_log: List[EvidenceEntry] ← accumulates across steps
β”œβ”€β”€ sla_status: Dict[str, str] ← ok/warning/breached
└── available_runbooks: List[str]
```
**Key design: Partial Log Observability**
The agent only sees 2 log lines per service upfront. Full history requires calling `read_logs` explicitly. This models real observability tools (Datadog, Kibana) where engineers run queries β€” agents must develop a search strategy, not just read everything.
### The Services
| Service | Stack | Role |
|---|---|---|
| `api-gateway` | Go | Routes external requests |
| `payment-service` | Java (Spring) | Processes payments |
| `order-service` | Python | Creates and tracks orders |
| `inventory-service` | Java | Manages product stock |
| `user-service` | Node.js | Auth and profiles |
| `notification-service` | Python | Email and push alerts |
| `data-pipeline-service` | Python | Writes catalog data |
| `product-catalog-service` | Go | Stores and serves product data |
| `price-validation-service` | Python | Validates prices |
| `analytics-service` | Python | Aggregates business metrics |
| `ml-inference-service` | Python | Serves recommendation models |
| `log-aggregator` | Go | Collects and stores logs |
### Service Dependency Map
Every observation includes the call topology β€” agents can trace cascades:
```
api-gateway β†’ order-service β†’ inventory-service
api-gateway β†’ payment-service
order-service β†’ notification-service
data-pipeline-service β†’ product-catalog-service β†’ price-validation-service
```
---
## 🎬 The 7 Tasks
### Task 1 β€” Single Service OOM (`easy`)
**Max steps: 15 | Expected strong LLM: 0.85–1.00 | Random agent: 0.05**
One service crash-loops with an OutOfMemoryError. The affected service rotates by seed across payment-service, order-service, and user-service β€” with different log formats (Java heap errors, Python memory errors, Node.js heap dumps). A secondary circuit-breaker alert fires on api-gateway as a visible symptom.
**What makes it interesting:** The agent must identify the ROOT cause service (the one running out of memory) not the SYMPTOM services (everything downstream that's erroring because the root is down).
**Optimal sequence:** `read_logs` β†’ `read_metrics` β†’ `diagnose` β†’ `restart_service`
**Reward breakdown:** +0.10 read_logs, +0.10 read_metrics, +0.30 diagnose, +0.40 restart = **0.99 with efficiency bonus**
---
### Task 2 β€” Cascading Failure (`medium`)
**Max steps: 20 | Expected strong LLM: 0.55–0.75 | Random agent: 0.03**
A bad deployment of `inventory-service` causes connection pool exhaustion, cascading timeouts to `order-service` and elevated error rates on `api-gateway`. **Red herring:** a `notification-service` HIGH CPU alert fires (scheduled batch job β€” completely unrelated).
**What makes it interesting:** The agent must follow the dependency chain backwards. Three services are visibly failing, but only one is the root cause. Touching the wrong service gives -0.15 collateral damage penalty.
**Optimal sequence:** Investigate `api-gateway` β†’ trace to `order-service` β†’ trace to `inventory-service` β†’ `rollback`
**Reward breakdown:** +0.20 trace cascade, +0.05 runbook, +0.25 diagnose, +0.35 rollback = **0.92**
---
### Task 3 β€” Silent Data Corruption (`hard`)
**Max steps: 25 | Expected strong LLM: 0.30–0.50 | Random agent: 0.01**
**All services show green.** Zero error rates. Normal latency. No standard alerts. The signal is buried in:
- `price-validation-service` WARN logs: 15% price mismatch rate (baseline: 0.2%)
- `analytics-service` anomaly: avg order value $847 vs $89 historical baseline
Three noise alerts distract: TLS renewal, analytics backlog, replica lag.
**What makes it interesting:** This requires qualitatively different reasoning β€” ignoring green health checks, correlating subtle business metric anomalies, and understanding that a data pipeline deployment 2 minutes ago is the causal explanation.
**Full credit requires BOTH:** `rollback(data-pipeline-service)` AND `alert_oncall` (data audit needed)
**Reward breakdown:** +0.15 subtle signals, +0.10 pipeline metrics, +0.05 runbook, +0.20 diagnose, +0.25 rollback, +0.15 alert_oncall = **0.87**
---
### Task 4 β€” Dual Simultaneous Failure (`bonus`)
**Max steps: 25 | Expected strong LLM: 0.35–0.55 | Random agent: 0.01**
Two completely independent failures at once:
1. `log-aggregator` disk 100% full β€” dropping 48k log messages/min
2. `ml-inference-service` stuck in model checksum reload loop β€” CPU 99%+
**What makes it interesting:** Neither failure is related to the other. Solving one doesn't help the other. The agent must decompose and fix independently. This tests whether agents can maintain multiple hypotheses simultaneously.
**Full credit requires BOTH:** `alert_oncall` (disk cleanup) AND `rollback/restart(ml-inference-service)`
**Optimal score: ~0.77**
---
### Task 5 β€” Security Incident: DDoS (`security`)
**Max steps: 20 | Expected strong LLM: 0.40–0.60 | Random agent: 0.01**
A botnet is targeting the login endpoint with 12,000 req/s from the `185.220.x.x` IP range. Standard rate limiting is ineffective (distributed attack). The access logs show 1,847+ failed login attempts per 60 seconds from that range.
**New action: `block_ip_range`** β€” models real network-level DDoS mitigation.
**Wrong actions:** Restarting api-gateway won't help. Scaling up won't help. Must block at network level + escalate to security team.
**Full credit:** `block_ip_range("185.220.0.0/16")` AND `alert_oncall`
**Optimal score: ~0.80**
---
### Task 6 β€” Database Degradation (`database`)
**Max steps: 20 | Expected strong LLM: 0.45–0.65 | Random agent: 0.01**
A schema migration added a `user_segment` column to the `orders` table 15 minutes ago β€” without an index. Every query is now doing a full sequential table scan. DB CPU is spiking. The slow query log shows `seq_scan on orders (847ms)`.
**New action: `create_index`** β€” models real DBA response to missing indexes.
**Alternative fix:** Rolling back the migration is also accepted for full credit.
**Optimal score: ~0.80**
---
### Task 7 β€” Multi-Region Failover (`failover`)
**Max steps: 25 | Expected strong LLM: 0.35–0.55 | Random agent: 0.01**
A network partition affects `us-east-1`. Four services support automatic failover to `us-west-2` and should be switched. Two services MUST NOT be failed over:
- `payment-service` β€” PCI-DSS compliance requires human approval
- `postgres-primary` β€” replication lag risk causes data loss
**New action: `failover`** β€” with `target_region` parameter.
**Heavy penalty: -0.25 per wrong service.** Failing over payment or postgres is catastrophic.
**The runbook explicitly lists which services are safe** β€” reading it first is rewarded.
**Optimal score: ~0.70**
---
### Task 8 β€” Generated Incident (`generated`)
**Max steps: 20 | Variable difficulty | Seed-deterministic**
The Incident Generator creates procedural incidents from any integer seed (0–99,999). Same seed always produces the same incident. Different seeds produce unique combinations of:
- 6 failure modes Γ— 8 services Γ— 3 severity levels Γ— 0–3 noise alerts
```bash
# Preview any incident before running it
curl "https://arijit-07-devops-incident-response.hf.space/generate/preview?seed=12345"
# Run it as a full episode
curl -X POST .../reset -d '{"task_id":"generated","seed":12345}'
```
---
## πŸ† Reward Function Design
### The Formula
```
Final Score = Ξ£(step_rewards)
+ efficiency_bonus # (1 - steps/max_steps) Γ— 0.05 if resolved
+ diagnosis_precision_bonus # +0.03 if β‰₯50% keyword overlap, +0.01 if β‰₯30%
- noop_penalty # (noop_count - 3) Γ— 0.02
- repeat_restart_penalty # (restarts - 1) Γ— 0.05 per service
```
All scores clamped to **(0.001, 0.999)** β€” never exactly 0 or 1.
> **Why (0.001, 0.999) not (0, 1)?** GRPO advantage normalization requires non-constant rewards within a group. Hard 0 or 1 creates zero-variance groups where the model doesn't update. The tiny clamp ensures a gradient signal always exists.
### Step-Level Rewards
| Action | Reward | Condition |
|---|---|---|
| `read_logs` (failing service) | +0.10–0.15 | First time only |
| `read_metrics` (failing service) | +0.10 | First time only |
| `read_runbook` (relevant) | +0.05 | Correct runbook for scenario |
| `search_logs` (relevant query) | +0.05 | Query returns useful results |
| `diagnose` (full match) | +0.30–0.35 | β‰₯50% keyword overlap |
| `diagnose` (partial match) | +0.10–0.15 | β‰₯30% keyword overlap |
| `restart_service` (correct) | +0.35–0.45 | Root cause service |
| `rollback` (correct) | +0.30–0.40 | Root cause service |
| `block_ip_range` (correct) | +0.40 | Security task, correct CIDR |
| `create_index` (correct) | +0.40 | Database task, correct table/column |
| `failover` (eligible service) | +0.30 | Per correctly failed-over service |
| `alert_oncall` (required) | +0.15 | Hard/security/database/failover tasks |
### Penalties (Anti-Gaming)
| Action | Penalty | Why |
|---|---|---|
| Restart healthy service | -0.15 | Collateral damage β€” realistic cost |
| Fix without diagnosing | -0.10 | Blind remediation β€” models real risk |
| Failover payment-service | -0.25 | PCI-DSS compliance violation |
| Failover postgres-primary | -0.25 | Data loss risk |
| Excessive noops (>3) | -0.04/each | Forces active investigation |
| Repeat restart same service | -0.05/extra | Discourages guess-and-check |
### Semantic Diagnosis Matching
The `diagnose` action uses **keyword overlap** not exact string matching. An agent saying "memory exhaustion in payment-service" correctly matches the ground truth "memory_leak_payment_service". This is critical for LLM agents that paraphrase β€” exact string matching would unfairly penalize valid diagnoses.
### SLA Degradation
Every step where an incident is unresolved, the environment worsens:
- `down` services: error_rate increases
- `degraded` services: latency_p99 increases
- SLA status: `ok` β†’ `warning` (~3 steps) β†’ `breached` (~7 steps)
This creates real time pressure and rewards faster resolution.
---
## 🌟 ARIA Features
### Curriculum Engine
The Curriculum Engine tracks agent performance per task using a rolling average of the last 5 episodes.
- **Promotion:** rolling_avg > 0.75 β†’ advance mastery level (Novice β†’ Intermediate β†’ Advanced β†’ Mastered)
- **Demotion:** rolling_avg < 0.30 β†’ step back mastery level
- **Scaffolding:** if avg < 0.30 over 3+ episodes β†’ provide task-specific hint
```bash
GET /curriculum/status # See mastery per task
GET /curriculum/next # Get recommended next task
GET /curriculum/hint/easy # Get scaffolding hint for a task
POST /curriculum/record # Feed your training results in
```
**Why this matters for training:** RL fails when agents never see successful trajectories. The curriculum ensures agents always train at the edge of their capability β€” easy tasks first, harder tasks as they master the fundamentals.
### Incident Generator
Procedural incident generation from seeds. 6 failure modes Γ— 8 services Γ— 3 severities Γ— 0–3 noise alerts = thousands of unique training scenarios.
**Difficulty formula:** `base_difficulty[failure_mode] + (noise_count Γ— 0.05)`, clamped to 1.0
| Failure Mode | Base Difficulty |
|---|---|
| oom | 0.20 |
| cascade | 0.50 |
| database | 0.60 |
| security | 0.60 |
| network_partition | 0.70 |
| corruption | 0.80 |
```bash
GET /generate/preview?seed=42 # Preview without starting
POST /reset # body: {"task_id":"generated","seed":42}
```
### Dual-Agent Mode
One incident. Two agents. Split observability.
- **Agent A (Observer):** Sees logs, alerts, evidence. Can ONLY call `share_finding` β€” passes natural language observations to Agent B. Reward: +0.05 per finding.
- **Agent B (Responder):** Sees metrics, service dependencies, SLA status. Cannot see logs directly. Must rely on Agent A's findings. Executes all real actions.
Neither agent can solve the incident alone.
```bash
# Start a dual-agent session
POST /multi-agent/reset {"task_id":"easy","seed":42}
# β†’ returns session_id + split observations
# Agent A shares a finding
POST /multi-agent/step/a/{session_id} {"finding":"payment-service OOM, memory at 98%"}
# Agent B takes action (has access to Agent A's findings)
POST /multi-agent/step/b/{session_id} {"action_type":"restart_service","service":"payment-service"}
# See full session state
GET /multi-agent/state/{session_id}
```
---
## 🧠 Training
### Model
**Llama-3.2-3B-Instruct** fine-tuned with **GRPO** (Group Relative Policy Optimization) using HuggingFace TRL and Unsloth.
- **LoRA:** rank=16, alpha=32, targeting all 7 projection layers
- **Adapter size:** ~97MB
- **Training:** 140 episodes (easy + medium tasks) on Kaggle T4 x2 GPUs
- **Model repo:** https://huggingface.co/Arijit-07/aria-devops-llama3b
### Why GRPO?
GRPO eliminates the value network that PPO requires. For environment-based RL where rewards come from an external API, a value model adds complexity without benefit. GRPO estimates the baseline from a group of 6 completions per step β€” simpler, more memory-efficient, and well-suited to fast environment APIs.
### Training Loop
```python
# Each training step:
# 1. Generate 6 completions for the current observation
# 2. Score each on a FRESH env snapshot (prevents reward gate exhaustion)
# 3. Normalize rewards to advantages (GRPO)
# 4. Policy gradient update on best completion + KL penalty
# 5. Advance episode with best action
# Key hyperparameters:
learning_rate = 5e-6
group_size = 6
kl_coefficient = 0.05 # prevents catastrophic forgetting
update_strategy = "episode-level" # one update per full episode
```
### Results
| | Base Model | Fine-tuned (ep140) |
|---|---|---|
| **Easy task** | 0.000 | 0.150 |
| **Behavior** | Jumps to diagnose immediately | Reads logs on correct service first |
| **Why the difference** | Base model triggers blind remediation penalty | Fine-tuned model learned to gather information before acting |
**The trained model consistently reads logs on the failing service before acting** β€” this is the foundational operational behavior: information gathering before remediation. The base model never does this.
**Training challenge identified:** The original training loop called `env_step` during group generation, burning reward gates before the best action could advance the episode. After fixing to score completions on fresh environment snapshots, the model successfully learned step 1 of the optimal policy. With more episodes using the corrected loop, the full sequence would emerge.
### Training Notebook
See `train_grpo.ipynb` β€” Colab-compatible, runs against the live HF Space API (no local setup needed).
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Twilight-13/devops-incident-response/blob/main/train_grpo.ipynb)
Compatible with: TRL, SkyRL, ART, Oumi, Axolotl.
---
## πŸš€ Setup
### Docker (Recommended)
```bash
docker build -t aria-devops-incident .
docker run -p 7860:7860 aria-devops-incident
curl http://localhost:7860/health
```
### Local Python
```bash
pip install -r requirements.txt
uvicorn api:app --host 0.0.0.0 --port 7860
```
### Validate
```bash
python validate.py # 22 automated checks, exit 0 = all pass
curl http://localhost:7860/validate
```
---
## πŸ“‘ API Reference
| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | `{"status":"ok"}` liveness check |
| GET | `/about` | Full environment description (machine-readable) |
| GET | `/tasks` | All 8 tasks with descriptions |
| POST | `/reset` | Start episode: `{"task_id":"easy","seed":42}` |
| POST | `/step` | Take action: Action JSON |
| GET | `/state` | Full state + ground truth + analytics |
| GET | `/validate` | Self-test: random agent on all 7 tasks |
| GET | `/metrics` | Aggregate episode statistics |
| GET | `/leaderboard` | Top 10 episodes |
| WS | `/ws` | WebSocket: real-time agent-environment |
| GET | `/curriculum/status` | Per-task mastery and recommendations |
| GET | `/curriculum/next` | Recommended next task for training |
| GET | `/curriculum/hint/{task_id}` | Scaffolding hint for struggling agents |
| POST | `/curriculum/record` | Feed episode result to curriculum engine |
| GET | `/generate/preview` | Preview procedural incident: `?seed=N` |
| POST | `/multi-agent/reset` | Start dual-agent session |
| POST | `/multi-agent/step/a/{id}` | Agent A shares a finding |
| POST | `/multi-agent/step/b/{id}` | Agent B takes an action |
| GET | `/multi-agent/state/{id}` | Full dual-agent session state |
| GET | `/multi-agent/sessions` | List active sessions |
| GET | `/docs` | Swagger UI β€” interactive documentation |
---
## πŸ“Š Benchmark Comparison
| Benchmark | Domain | Partial Obs | Dense Reward | Multi-Step | Curriculum | Multi-Agent |
|---|---|---|---|---|---|---|
| SWE-bench | Code repair | βœ— | βœ— | βœ“ | βœ— | βœ— |
| WebArena | Web navigation | βœ“ | βœ— | βœ“ | βœ— | βœ— |
| AgentBench | General tools | βœ— | βœ— | βœ“ | βœ— | βœ— |
| **ARIA (ours)** | **Incident response** | **βœ“** | **βœ“** | **βœ“** | **βœ“** | **βœ“** |
---
## πŸ—οΈ OpenEnv Compliance
```bash
openenv validate .
```
- Inherits from `openenv.core.env_client.EnvClient`
- Standard `reset()`, `step()`, `state()` interface
- Valid `openenv.yaml` manifest with all 8 tasks
- FastAPI server with health endpoint
- WebSocket support at `/ws`
- Hosted on HuggingFace Spaces
---
## πŸ“ Repository Structure
```
aria-devops-incident-response/
β”œβ”€β”€ api.py # FastAPI app β€” all endpoints
β”œβ”€β”€ env.py # DevOpsIncidentEnv β€” thin dispatcher
β”œβ”€β”€ models.py # Pydantic models β€” Action, Observation, State
β”œβ”€β”€ tasks/
β”‚ β”œβ”€β”€ base.py # BaseTask ABC, InternalState, reward logic
β”‚ β”œβ”€β”€ task_easy.py # OOM crash-loop
β”‚ β”œβ”€β”€ task_medium.py # Cascading failure
β”‚ β”œβ”€β”€ task_hard.py # Silent data corruption
β”‚ β”œβ”€β”€ task_bonus.py # Dual simultaneous failure
β”‚ β”œβ”€β”€ task_security.py # DDoS attack
β”‚ β”œβ”€β”€ task_database.py # Missing index
β”‚ β”œβ”€β”€ task_failover.py # Multi-region failover
β”‚ └── task_generated.py # Procedural incidents
β”œβ”€β”€ curriculum/
β”‚ └── engine.py # CurriculumEngine β€” adaptive difficulty
β”œβ”€β”€ generator/
β”‚ └── incident_factory.py # IncidentFactory β€” procedural generation
β”œβ”€β”€ multi_agent/
β”‚ └── session.py # DualAgentSession β€” split observability
β”œβ”€β”€ graders/
β”‚ └── grader.py # Deterministic episode grader
β”œβ”€β”€ data/runbooks/ # 6 operational runbooks (Markdown)
β”œβ”€β”€ client.py # openenv-core EnvClient implementation
β”œβ”€β”€ inference.py # LLM baseline (CoT + fast modes)
β”œβ”€β”€ train_grpo.ipynb # GRPO training notebook (Colab-compatible)
β”œβ”€β”€ validate.py # 22 automated validation checks
└── openenv.yaml # OpenEnv spec manifest
```
---
## πŸ“ License
Apache 2.0
---
*Built solo for the Meta Γ— PyTorch Γ— HuggingFace OpenEnv Hackathon Finals β€” Bangalore, April 2026*