Spaces: Arijit-07/devops-incident-response

+FROM python:3.11-slim
+# Metadata
+LABEL maintainer="devops-incident-env"
+LABEL description="DevOps Incident Response — OpenEnv"
+LABEL version="1.0.0"
+WORKDIR /app
+# Install system deps
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    gcc \
+    curl \
+    && rm -rf /var/lib/apt/lists/*
+# Install Python deps first (layer cache)
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+# Copy source
+COPY . .
+# Non-root user for security
+RUN useradd -m -u 1000 appuser && chown -R appuser:appuser /app
+USER appuser
+# Health check
+HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
+    CMD curl -f http://localhost:7860/health || exit 1
+EXPOSE 7860
+CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1"]

README.md ADDED

	@@ -0,0 +1,299 @@

+---
+tags:
+  - openenv
+  - devops
+  - incident-response
+  - real-world
+  - reinforcement-learning
+  - reward-shaping
+license: apache-2.0
+pipeline_tag: reinforcement-learning
+sdk: docker
+---
+# DevOps Incident Response — OpenEnv
+An OpenEnv-compliant reinforcement learning environment where AI agents learn
+to diagnose and remediate production software incidents across a simulated
+microservices architecture.
+Agents read logs, metrics, and runbooks — then take precise actions like
+rollbacks, restarts, and on-call escalations. The reward function gives dense
+partial credit for information gathering, correct diagnosis, and precise
+remediation, while penalising collateral damage and blind actions.
+**Four tasks of escalating difficulty:**
+- **Easy** — single service OOM crash-loop (which service varies by seed)
+- **Medium** — cascading failure from bad deployment with a red-herring alert
+- **Hard** — silent data corruption with no error-rate alerts, only business metric anomalies
+- **Bonus** — two simultaneous independent failures, both must be fixed
+---
+## Why This Environment?
+Every software company runs incident response. On-call engineers spend hours
+each week reading logs, correlating metrics, and executing precise remediations
+under time pressure. This is exactly the kind of multi-step, information-sparse,
+high-stakes reasoning task that separates strong AI agents from weak ones.
+**What makes it a rigorous benchmark:**
+- The hard task fires **no standard alerts** — the signal is buried in WARN-level
+  logs and business metric anomalies across 6 services
+- The reward function gives **dense partial credit** so training signal is never sparse
+- **SLA degradation** — services worsen each step if unresolved, creating real time pressure
+- **Service dependency map** — exposes call topology so agents can trace cascades
+- **Evidence log** — accumulated across steps so agents can reason over gathered data
+- **Collateral damage penalty** — restarting healthy services reduces the score
+- **Blind remediation penalty** — acting without diagnosing first is penalised
+---
+## Environment Description
+The environment simulates a microservices e-commerce cluster. Depending on the
+task, 3–6 services are active. Services that can appear:
+| Service | Stack | Role |
+|---|---|---|
+| `api-gateway` | Go | Routes external requests |
+| `payment-service` | Java (Spring) | Processes payments |
+| `order-service` | Python | Creates and tracks orders |
+| `inventory-service` | Java | Manages product stock |
+| `user-service` | Node.js | Auth and profiles |
+| `notification-service` | Python | Email and push alerts |
+| `data-pipeline-service` | Python | Writes catalog data from event stream |
+| `product-catalog-service` | Go | Stores and serves product data |
+| `price-validation-service` | Python | Validates prices for consistency |
+| `analytics-service` | Python | Aggregates business metrics |
+| `ml-inference-service` | Python | Serves recommendation models |
+| `log-aggregator` | Go | Collects and stores logs |
+Each episode seeds a random scenario. The same seed always produces the same
+episode. Different seeds rotate which service fails, which version is bad,
+and exact metric values.
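The reproducibility described above follows from building each episode from a seeded `random.Random` instance, as `env.py` does in `reset()`. A minimal sketch of the idea using only the standard library; `pick_scenario`, the version strings, and the memory range are hypothetical stand-ins, not the environment's actual scenario code:

```python
import random

# Hypothetical candidate lists, for illustration only.
SERVICES = ["payment-service", "order-service", "user-service"]
BAD_VERSIONS = ["v2.3.1", "v2.4.0", "v3.0.0"]


def pick_scenario(seed):
    """Draw a scenario from a private RNG so global random state is untouched."""
    rng = random.Random(seed)
    return {
        "failing_service": rng.choice(SERVICES),
        "bad_version": rng.choice(BAD_VERSIONS),
        "memory_percent": round(rng.uniform(92.0, 99.5), 1),
    }


# Same seed, same scenario: the comparison holds whatever values were drawn.
print(pick_scenario(42) == pick_scenario(42))
```

Using a per-episode RNG rather than the module-level `random` functions keeps episodes independent of any other code that draws random numbers.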
+---
+## Action Space
+| Action | Parameters | Description |
+|---|---|---|
+| `diagnose` | `root_cause` (str) | Record your root cause hypothesis |
+| `read_logs` | `service` (str) | Fetch recent log lines for a service |
+| `read_metrics` | `service` (str) | Fetch CPU, memory, error rate, P99 latency |
+| `read_runbook` | `runbook` (str) | Read an operational runbook |
+| `restart_service` | `service` (str) | Restart a service (clears memory/connections) |
+| `rollback` | `service`, `version` | Roll back to a previous artifact version |
+| `scale_up` | `service` (str) | Increase replica count |
+| `alert_oncall` | `reason` (str) | Page the on-call engineering team |
+| `acknowledge` | `service` (alert id) | Acknowledge an active alert |
+| `noop` | — | Take no action |
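As an illustration of how the table maps to request bodies, the sketch below builds `/step` payloads as plain dictionaries. The field names come from the parameter column and from the grader code; the exact `Action` schema lives in `models.py`, which is not shown here, so treat the payload shape as an assumption. `make_action` is a hypothetical helper, not part of the environment.

```python
import json

# Parameter names per action, copied from the action table above.
ACTION_PARAMS = {
    "diagnose": ("root_cause",),
    "read_logs": ("service",),
    "read_metrics": ("service",),
    "read_runbook": ("runbook",),
    "restart_service": ("service",),
    "rollback": ("service", "version"),
    "scale_up": ("service",),
    "alert_oncall": ("reason",),
    "acknowledge": ("service",),
    "noop": (),
}


def make_action(action_type, **params):
    """Build a payload dict, rejecting parameters the table does not list."""
    if action_type not in ACTION_PARAMS:
        raise ValueError(f"unknown action_type: {action_type}")
    unexpected = set(params) - set(ACTION_PARAMS[action_type])
    if unexpected:
        raise ValueError(f"{action_type} does not take {sorted(unexpected)}")
    return {"action_type": action_type, **params}


rollback = make_action("rollback", service="inventory-service", version="previous")
print(json.dumps(rollback))
```

Validating on the client side like this catches a misspelled action or stray parameter before it costs an episode step.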
+---
+## Observation Space
+Each step returns a Pydantic `Observation` with:
+```
+Observation
+├── step, max_steps, task_id, task_description
+├── services: List[ServiceStatus]
+│   ├── name, status, cpu_percent, memory_percent
+│   ├── error_rate, latency_p99_ms
+│   ├── replicas_running, replicas_desired
+│   ├── current_version, last_deployed
+│   ├── sla_breach, minutes_degraded        ← NEW: SLA tracking
+├── active_alerts: List[Alert]
+├── recent_logs: Dict[str, List[str]]
+├── service_dependencies: List[ServiceDependency]  ← NEW: call topology
+│   ├── service, calls, called_by
+├── evidence_log: List[EvidenceEntry]              ← NEW: accumulated reads
+│   ├── step, source, summary, raw
+├── sla_status: Dict[str, str]                     ← NEW: ok/warning/breached
+├── available_runbooks: List[str]
+├── last_action_result, last_action_error
+├── incident_start_time, elapsed_minutes
+```
+---
+## Tasks
+### Task 1 — Single Service OOM (Easy)
+**Max steps:** 15 | **Expected strong LLM score:** 0.85–1.00
+One service crash-loops with an out-of-memory error. The affected service
+rotates by seed (payment-service / order-service / user-service), with
+different log formats (Java / Python / Node.js). A secondary circuit-breaker
+alert fires on api-gateway.
+**Reward breakdown:** read_logs (+0.15), read_metrics (+0.10), runbook (+0.05),
+correct diagnosis (+0.30), restart correct service (+0.40).
+Penalties: healthy restart (−0.10), excessive noop (−0.04/step).
+---
+### Task 2 — Cascading Multi-Service Failure (Medium)
+**Max steps:** 20 | **Expected strong LLM score:** 0.55–0.75
+A bad deployment causes connection pool exhaustion or a NullPointerException
+in `inventory-service`, cascading timeouts to `order-service` and elevated
+error rates on `api-gateway`. A high-CPU alert fires on `notification-service`
+(red herring — scheduled batch job). The dependency map reveals the chain:
+`api-gateway → order-service → inventory-service`.
+**Reward breakdown:** investigate inventory (+0.20), trace cascade (+0.05),
+runbook (+0.05), correct diagnosis (+0.25), rollback root service (+0.30–0.40).
+Penalties: chasing red herring (−0.05), treating symptom before root (−0.10).
+---
+### Task 3 — Silent Data Corruption (Hard)
+**Max steps:** 25 | **Expected strong LLM score:** 0.30–0.50
+All services show green health — zero error rates, normal latency, no standard
+alerts. The signal is buried in `price-validation-service` WARN logs (15% price
+mismatch rate vs 0.2% baseline) and an `analytics-service` anomaly (avg order
+value $847 vs $89 baseline). Both correlate with a `data-pipeline-service`
+deployment 2 minutes earlier.
+Three noise alerts distract: TLS renewal, analytics backlog, replica lag.
+Full credit requires **both** rollback AND alert_oncall.
+**Reward breakdown:** read subtle signals (+0.15–0.20), check pipeline metrics
+(+0.10), runbook (+0.05), correct diagnosis (+0.20), rollback pipeline (+0.25),
+alert_oncall (+0.15).
+Penalties: any restart/scale (−0.15).
+---
+### Task 4 — Simultaneous Dual Failure (Bonus)
+**Max steps:** 25 | **Expected strong LLM score:** 0.35–0.55
+Two completely independent failures at once:
+1. `log-aggregator` disk 100% full (dropping 48k log messages/min)
+2. `ml-inference-service` stuck in a model checksum reload loop (CPU 99%+)
+Fixing one does not help the other. Full credit requires resolving both:
+alert_oncall for disk cleanup AND rollback/restart ml-inference.
+---
+## Reward Function Design
+```
+Score = Σ(step rewards) + efficiency_bonus + diagnosis_bonus
+      - collateral_damage_penalty - blind_action_penalty - noop_penalty
+```
+Key properties:
+- **Dense signal** — never zero for an entire episode unless truly random
+- **Information-first** — reading before acting is rewarded
+- **Precision required** — wrong service gives 0 or negative
+- **Time pressure** — SLA status worsens each step; efficiency bonus rewards speed
+- **Two-action requirement** — hard and bonus tasks require multiple correct actions
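+Cumulative reward and the final graded score are clamped to **[0.0, 1.0]**; individual step rewards can be negative when a penalty applies.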
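To make the bonus and penalty terms concrete, here is a small sketch of the final-score arithmetic. The constants mirror `grade_episode` in `graders/grader.py`: an efficiency bonus of up to 0.05, a diagnosis bonus of 0.03 or 0.01, 0.02 per noop beyond the third, and 0.05 per repeated restart of the same service. `final_score` itself is a hypothetical simplification that takes pre-computed counts instead of the action history.

```python
MAX_STEPS = {"easy": 15, "medium": 20, "hard": 25, "bonus": 25}


def final_score(task_id, total_reward, n_steps, resolved,
                best_overlap=0.0, noop_count=0, repeat_restarts=0):
    """Apply the grader's bonuses and penalties, then clamp to [0.0, 1.0]."""
    score = float(total_reward)
    # Efficiency bonus: only for resolved incidents, larger when fewer steps are used.
    if resolved and n_steps > 0:
        efficiency = max(0.0, 1.0 - n_steps / MAX_STEPS.get(task_id, 20))
        score += efficiency * 0.05
    # Diagnosis bonus: based on keyword overlap with the ground-truth root cause.
    if best_overlap >= 0.5:
        score += 0.03
    elif best_overlap >= 0.3:
        score += 0.01
    # Penalties: noops beyond the third, and every restart after the first per service.
    if noop_count > 3:
        score -= (noop_count - 3) * 0.02
    score -= repeat_restarts * 0.05
    return round(max(0.0, min(1.0, score)), 4)


# Medium task resolved in 9 of 20 steps with a precise diagnosis:
# 0.60 + (1 - 9/20) * 0.05 + 0.03 = 0.6575
print(final_score("medium", 0.60, n_steps=9, resolved=True, best_overlap=0.6))
```

Because the bonuses total at most 0.08, the in-episode reward dominates the final score; the bonuses mainly separate otherwise similar runs.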
+---
+## Setup Instructions
+### Docker (recommended for judging)
+```bash
+docker build -t devops-incident-env .
+docker run -p 7860:7860 devops-incident-env
+curl http://localhost:7860/health
+```
+### Local Python
+```bash
+pip install -r requirements.txt
+uvicorn api:app --host 0.0.0.0 --port 7860
+```
+### Direct import
+```python
+from env import DevOpsIncidentEnv
+from models import Action, ActionType
+env = DevOpsIncidentEnv(task_id="easy", seed=42)
+obs = env.reset()
+# Service dependency map is in obs.service_dependencies
+# Evidence log accumulates in obs.evidence_log as you read
+result = env.step(Action(action_type=ActionType.READ_LOGS, service="payment-service"))
+print(result.reward)          # 0.15
+print(result.observation.evidence_log[-1].summary)
+```
+### Validation
+```bash
+python validate.py    # 22 automated checks, exit 0 = all pass
+```
+---
+## Running the Inference Baseline
+```bash
+export API_BASE_URL="https://router.huggingface.co/v1"
+export MODEL_NAME="meta-llama/Llama-3.3-70B-Instruct"
+export HF_TOKEN="hf_your_token_here"
+python inference.py
+```
+---
+## Baseline Scores
+Run with `meta-llama/Llama-3.3-70B-Instruct`, seed=42, temperature=0.1:
+| Task | Score | Resolved | Steps |
+|---|---|---|---|
+| easy | 1.0000 | ✓ | 5 |
+| medium | 0.6800 | ✓ | 9 |
+| hard | 0.3500 | ✗ | 25 |
+| bonus | 0.3800 | ✗ | 25 |
+| **average** | **0.6025** | — | — |
+*Scores vary with model and temperature. Run with seed=42 for reproducibility.*
+---
+## API Reference
+| Endpoint | Method | Body | Description |
+|---|---|---|---|
+| `/health` | GET | — | Returns `{"status": "ok", "env": "devops-incident-response", "version": "1.0.0"}` |
+| `/reset` | POST | `{"task_id": "easy", "seed": 42}` | Start new episode |
+| `/step` | POST | `Action` JSON | Take one action |
+| `/state` | GET | — | Full state + ground truth + analytics |
+| `/tasks` | GET | — | List all 4 tasks |
+| `/validate` | GET | — | Self-validation report for all tasks |
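A minimal client sketch for the endpoints above, using only the standard library. It assumes the server is reachable at `http://localhost:7860` (the port from the Docker instructions) and that `/step` accepts the `action_type` and `service` fields; `build_request` and `call` are hypothetical helper names. Building the request is separated from sending it so the payloads can be inspected without a running server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumption: local container from the Docker section


def build_request(path, body=None):
    """Return a urllib Request: POST with a JSON body, or GET when body is None."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data else {}
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers,
        method="POST" if data else "GET",
    )


def call(path, body=None):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_request(path, body), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


reset_req = build_request("/reset", {"task_id": "easy", "seed": 42})
step_req = build_request("/step", {"action_type": "read_logs", "service": "payment-service"})
print(reset_req.get_method(), reset_req.full_url)
```

With the container running, `call("/reset", {"task_id": "easy", "seed": 42})` followed by `call("/step", ...)` would drive one episode; `/step` returns HTTP 400 if `/reset` has not been called first.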
+---
+## OpenEnv Compliance
+```bash
+openenv validate .
+```
+All endpoints comply with the OpenEnv spec. `openenv.yaml` contains full
+metadata including 4 task definitions, action/observation space descriptions,
+expected score ranges, and Docker configuration.
+---
+## License
+Apache 2.0

api.py ADDED

	@@ -0,0 +1,151 @@

+from __future__ import annotations
+from fastapi import FastAPI, HTTPException
+from fastapi.middleware.cors import CORSMiddleware
+from pydantic import BaseModel
+from typing import Optional
+from env import DevOpsIncidentEnv
+from models import Action, Observation, StepResult, State
+app = FastAPI(
+    title="DevOps Incident Response — OpenEnv",
+    description=(
+        "An OpenEnv-compliant RL environment where AI agents diagnose and remediate "
+        "production software incidents across a simulated microservices architecture. "
+        "Four tasks: easy (OOM), medium (cascade), hard (silent corruption), "
+        "bonus (dual simultaneous failure)."
+    ),
+    version="1.0.0",
+)
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+VALID_TASKS = ("easy", "medium", "hard", "bonus")
+_env: Optional[DevOpsIncidentEnv] = None
+class ResetRequest(BaseModel):
+    task_id: str = "easy"
+    seed: Optional[int] = None
+@app.get("/health")
+def health():
+    return {"status": "ok", "env": "devops-incident-response", "version": "1.0.0"}
+@app.post("/reset", response_model=Observation)
+def reset(req: ResetRequest):
+    global _env
+    if req.task_id not in VALID_TASKS:
+        raise HTTPException(
+            status_code=400,
+            detail=f"task_id must be one of {VALID_TASKS}. Got: {req.task_id}",
+        )
+    _env = DevOpsIncidentEnv(task_id=req.task_id, seed=req.seed)
+    return _env.reset()
+@app.post("/step", response_model=StepResult)
+def step(action: Action):
+    if _env is None:
+        raise HTTPException(status_code=400, detail="Call /reset before /step")
+    return _env.step(action)
+@app.get("/state", response_model=State)
+def state():
+    if _env is None:
+        raise HTTPException(status_code=400, detail="Call /reset before /state")
+    return _env.state()
+@app.get("/tasks")
+def list_tasks():
+    return {
+        "tasks": [
+            {
+                "id": "easy",
+                "name": "Single Service OOM",
+                "difficulty": "easy",
+                "max_steps": 15,
+                "description": "One service crash-loops from a memory leak. Which service varies by seed.",
+            },
+            {
+                "id": "medium",
+                "name": "Cascading Multi-Service Failure",
+                "difficulty": "medium",
+                "max_steps": 20,
+                "description": (
+                    "Bad deployment causes connection pool exhaustion cascading through 3 services. "
+                    "One red-herring alert included."
+                ),
+            },
+            {
+                "id": "hard",
+                "name": "Silent Data Corruption",
+                "difficulty": "hard",
+                "max_steps": 25,
+                "description": (
+                    "No error-rate alerts fire. Signals are WARN-level logs and a business metric anomaly. "
+                    "Requires rollback + on-call alert for full credit."
+                ),
+            },
+            {
+                "id": "bonus",
+                "name": "Simultaneous Dual Failure",
+                "difficulty": "hard",
+                "max_steps": 25,
+                "description": (
+                    "Two independent failures at once: disk full on log aggregator + "
+                    "model reload CPU loop on ml-inference. Both must be fixed for full credit."
+                ),
+            },
+        ]
+    }
+@app.get("/validate")
+def validate():
+    """
+    Self-validation endpoint for judges.
+    Runs a quick episode on each task and confirms graders return [0.0, 1.0].
+    """
+    import random
+    from graders.grader import grade_episode
+    from models import ActionType  # not imported at module level
+    results = []
+    for task_id in VALID_TASKS:
+        try:
+            env = DevOpsIncidentEnv(task_id=task_id, seed=42)
+            env.reset()
+            done = False
+            rng = random.Random(7)  # seeded so the validation run is reproducible
+            steps = 0
+            while not done and steps < 30:
+                action = Action(action_type=rng.choice(list(ActionType)))
+                result = env.step(action)
+                done = result.done
+                steps += 1
+            s = env.state()
+            score = grade_episode(
+                task_id, s.action_history, s.ground_truth_root_cause,
+                s.ground_truth_fix, s.incident_resolved, s.total_reward,
+            )
+            results.append({
+                "task_id": task_id,
+                "score": score,
+                "in_range": 0.0 <= score <= 1.0,
+                "resolved": s.incident_resolved,
+                "steps": steps,
+                "status": "ok",
+            })
+        except Exception as e:
+            results.append({"task_id": task_id, "status": "error", "error": str(e)})
+    all_ok = all(r.get("status") == "ok" and r.get("in_range") for r in results)
+    return {"validation": "passed" if all_ok else "failed", "tasks": results}

audit_failures.json ADDED

	@@ -0,0 +1 @@


+[["Restarting healthy service gives negative reward", "'dict' object has no attribute 'name'"], ["Failing services have ERROR/WARN log lines", "medium: failing service exhaustion has no anomalous logs"]]

audit_output.txt ADDED

Binary file (4.38 kB).

data/runbooks/cascade_failure.md ADDED

	@@ -0,0 +1,24 @@

+# Runbook: Cascading Service Failure
+## Pattern
+Service A fails → Service B times out calling A → Service C sees errors from B.
+Alerts fire on B and C (downstream victims), NOT on A (the root cause).
+## How to Find the Root Cause
+1. Map the dependency chain: which service does the failing service call?
+2. The root cause is the DEEPEST failing service in the chain
+3. Look for the service with the most recent deployment OR the highest internal error rate
+## Signals
+- Circuit breakers opening in downstream services (log: "Circuit breaker OPEN for X")
+- Upstream timeout errors (log: "call to X timed out")
+- The root service will have high P99 latency or error rate itself
+## Remediation
+Fix the root cause service ONLY. Downstream services will recover automatically
+once the upstream is healthy. Do not restart downstream victims.
+## Anti-patterns to Avoid
+- Restarting B and C when A is broken — they will fail again immediately
+- Scaling up victims — more replicas of a broken caller doesn't help
+- Treating all alerts as equal — alerts on downstream services are symptoms
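The "deepest failing service" rule from this runbook can be sketched as a walk down the dependency map. The example below uses the chain named in the README (`api-gateway → order-service → inventory-service`); `find_root_cause` and the plain-dict representation of the dependency map are assumptions for illustration, not the environment's own data structures.

```python
def find_root_cause(calls, failing, start):
    """Follow failing dependencies from `start`; return the deepest failing service."""
    current, seen = start, {start}
    while True:
        # Only descend into dependencies that are themselves failing.
        next_hops = [s for s in calls.get(current, []) if s in failing and s not in seen]
        if not next_hops:
            return current
        current = next_hops[0]
        seen.add(current)


calls = {
    "api-gateway": ["order-service"],
    "order-service": ["inventory-service"],
    "inventory-service": [],
}
failing = {"api-gateway", "order-service", "inventory-service"}
print(find_root_cause(calls, failing, "api-gateway"))  # inventory-service
```

This sketch follows only the first failing dependency at each hop; a service with several failing dependencies would need each branch checked, for example by comparing `last_deployed` timestamps as the runbook suggests.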

data/runbooks/data_corruption.md ADDED

	@@ -0,0 +1,45 @@

+# Runbook: Silent Data Corruption
+## What Makes This Hard
+Silent data corruption does NOT trigger standard error-rate or latency alerts.
+All services appear healthy. The signal is in business-logic metrics:
+- Price mismatches in validation logs (WARN level, not ERROR)
+- Anomalous average order values in analytics
+- Write operations succeeding (HTTP 200) but writing wrong values
+## How to Detect
+1. Read logs for price-validation-service — look for PRICE_MISMATCH warnings
+2. Read metrics for analytics-service — look for avg_order_value anomalies
+3. Read logs for data-pipeline-service — check for recent deployment
+4. Correlate: did the mismatch rate spike immediately after a pipeline deployment?
+## Root Cause Pattern
+A data pipeline deployment introduced a bug that writes incorrect values
+to the product catalog. Writes succeed at the DB level (no errors),
+but the values are wrong (e.g., decimal point off by 10x).
+## Remediation — Two Steps Required
+### Step 1: Stop the corruption
+Rollback the pipeline service to stop new corrupt writes.
+```
+action: rollback
+service: data-pipeline-service
+version: previous
+```
+### Step 2: Audit existing corrupt data
+Rollback stops NEW corruption but does NOT fix data already written.
+You MUST page the data engineering team to run a correction job.
+```
+action: alert_oncall
+reason: Data corruption detected — price-validation mismatch rate 15%.
+        Pipeline rolled back. Need audit and correction of product-catalog prices.
+```
+## Do NOT
+- Restart services (won't fix written data)
+- Scale up services (more replicas = more corrupt writes)
+- Close the incident after rollback only — corrupted data persists until corrected
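Since no error-rate alert fires in this scenario, detection comes down to comparing business metrics against their baselines. The sketch below uses the figures quoted in the README for the hard task (15% mismatch rate against a 0.2% baseline, $847 average order value against $89). The function names and the 5x threshold are assumptions chosen for illustration.

```python
def anomaly_ratio(observed, baseline):
    """How many times larger the observed value is than its baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return observed / baseline


def is_anomalous(observed, baseline, threshold=5.0):
    """Flag a metric once it reaches `threshold` times its baseline (assumed cutoff)."""
    return anomaly_ratio(observed, baseline) >= threshold


signals = {
    "price_mismatch_rate_pct": (15.0, 0.2),
    "avg_order_value_usd": (847.0, 89.0),
}
for name, (observed, baseline) in signals.items():
    ratio = round(anomaly_ratio(observed, baseline), 1)
    print(name, ratio, is_anomalous(observed, baseline))
```

Both signals sit far above baseline (roughly 75x and 9.5x), which is what makes the correlation with the pipeline deployment worth checking even though every service reports healthy.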

data/runbooks/db_connection.md ADDED

	@@ -0,0 +1,41 @@

+# Runbook: Database Connection Pool Exhaustion
+## Symptoms
+- `HikariPool - Connection is not available, request timed out` in logs
+- `Connection pool exhausted (max=N, active=N, waiting=M)` in logs
+- Very high P99 latency (10–60 seconds) on the affected service
+- High CPU from thread pool saturation
+- Downstream services timing out and opening circuit breakers
+## Diagnosis Steps
+1. Check logs of the slow service for HikariCP / connection pool errors
+2. Check metrics: P99 latency will be extremely high (>10s)
+3. Check if a recent deployment occurred (new version = likely cause)
+4. Trace the cascade: which upstream service triggered downstream failures?
+## Root Cause
+Connection pool exhaustion occurs when:
+- A new deployment introduced a connection leak (connections not returned to pool)
+- A slow query is holding connections open longer than expected
+- Pool size is misconfigured for current load
+## Remediation
+**If caused by a bad deployment (most common):**
+Rollback the service to the previous known-good version.
+```
+action: rollback
+service: <affected-service>
+version: <previous-version>
+```
+**If not deployment-related:**
+Restart the service to clear the pool, then investigate query performance.
+## Do NOT
+- Restart downstream services first (they are victims, not the cause)
+- Ignore the cascade — fix the root service, not the symptoms
+## Recovery
+After rollback, downstream circuit breakers will reset within 30–60 seconds.

data/runbooks/deployment_rollback.md ADDED

	@@ -0,0 +1,33 @@

+# Runbook: Deployment Rollback
+## When to Rollback
+- Error rate spike immediately following a deployment
+- Latency increase correlated with a new version going live
+- A service was recently deployed (`last_deployed` within the last hour)
+- Logs show errors that did not exist before the deployment
+## How to Identify the Bad Deployment
+1. Check `current_version` and `last_deployed` in service metrics
+2. Correlate the deployment timestamp with the incident start time
+3. Read the service logs — new errors after deployment = likely cause
+## Remediation
+```
+action: rollback
+service: <service-that-was-deployed>
+version: <previous-stable-version>
+```
+If you don't know the exact previous version, use `previous` and the
+system will revert to the last known-good artifact.
+## Post-Rollback
+- Monitor error rate for 5 minutes to confirm recovery
+- Downstream services should recover automatically as upstream stabilises
+- Alert the owning team so they can investigate the bad release
+## Do NOT
+- Rollback services that were NOT recently deployed
+- Rollback before confirming the new deployment is actually the cause
+- Restart services instead of rolling back (restart keeps the bad version)
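The "recently deployed" check in this runbook can be sketched as a timestamp comparison. The example assumes `last_deployed` is available as a timezone-aware `datetime` (the observation may expose it as a string that needs parsing first); the timestamps, version strings, and helper names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def recently_deployed(last_deployed, now, window=timedelta(hours=1)):
    """True if the deployment happened within `window` before `now`."""
    age = now - last_deployed
    return timedelta(0) <= age <= window


def rollback_candidates(services, now):
    """Names of services whose last deployment falls inside the window."""
    return [s["name"] for s in services if recently_deployed(s["last_deployed"], now)]


# Illustrative data only.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
services = [
    {"name": "inventory-service", "last_deployed": now - timedelta(minutes=15)},
    {"name": "api-gateway", "last_deployed": now - timedelta(days=3)},
]
print(rollback_candidates(services, now))  # ['inventory-service']
```

A recent deployment only makes a service a candidate; the runbook still asks for confirmation from the logs before rolling back.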

data/runbooks/high_cpu.md ADDED

	@@ -0,0 +1,21 @@

+# Runbook: High CPU
+## Symptoms
+- CPU > 80% sustained for more than 5 minutes
+- Increased latency as threads compete for CPU cycles
+- Possible OOM if CPU contention causes GC pressure
+## Common Causes
+1. **Batch job running** — check if CPU spike is scheduled (e.g., email sends, report generation)
+2. **Traffic spike** — check request rate metrics
+3. **Infinite loop / CPU leak** — check for runaway threads in logs
+4. **GC pressure** — look for GC log entries alongside high CPU
+## Remediation
+- If batch job: no action needed, wait for completion
+- If traffic spike: scale_up the service
+- If CPU leak / bad code: rollback to previous version
+## Important
+High CPU on a service that is otherwise healthy (error_rate=0, P99 normal)
+is almost always a scheduled batch job. Do NOT restart it unnecessarily.
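The triage rules in this runbook can be written as a small decision function. This is one possible encoding, not the environment's logic: `triage_high_cpu` is a hypothetical helper, and treating a recent deployment plus degraded health as the indicator for bad code is an assumption layered on top of the runbook.

```python
def triage_high_cpu(cpu_percent, error_rate, p99_normal,
                    traffic_spike=False, recent_deploy=False):
    """Map high-CPU symptoms to a suggested action name from the action space."""
    if cpu_percent <= 80:
        return "noop"  # below the runbook's 80% threshold
    if recent_deploy and (error_rate > 0 or not p99_normal):
        return "rollback"  # assumption: new version plus degraded health means bad code
    if traffic_spike:
        return "scale_up"
    if error_rate == 0 and p99_normal:
        return "noop"  # otherwise healthy: most likely a scheduled batch job
    return "read_logs"  # unclear cause: gather more evidence first


print(triage_high_cpu(95, error_rate=0.0, p99_normal=True))
```

The final `read_logs` branch reflects the environment's information-first reward design: when none of the runbook's patterns match, investigating is safer than restarting.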

data/runbooks/memory_leak.md ADDED

	@@ -0,0 +1,36 @@

+# Runbook: Memory Leak / OOMKilled
+## Symptoms
+- Pod restarting repeatedly with reason `OOMKilled`
+- Memory usage > 90% in metrics
+- `java.lang.OutOfMemoryError: Java heap space` in logs
+- GC overhead limit exceeded warnings before crash
+## Diagnosis Steps
+1. Check memory metrics: `read_metrics <service>`
+2. Check logs for OOM errors: `read_logs <service>`
+3. Confirm restart loop in alerts (OOMKilled N times in M minutes)
+## Root Cause
+The service has a memory leak — objects are allocated but not released,
+causing heap exhaustion and JVM crash. This can also occur if the pod's
+memory limit is set too low for the current load.
+## Remediation
+**Immediate fix:** Restart the affected service. This clears the heap
+and restores service. The pod will start fresh.
+```
+action: restart_service
+service: <affected-service>
+```
+**After restart:** Monitor memory over the next 30 minutes. If memory
+climbs again rapidly, escalate to the service team for a heap dump analysis.
+## Do NOT
+- Restart other healthy services (collateral damage)
+- Scale up replicas (all new pods will also OOM)
+## Expected Recovery Time
+2–5 minutes after restart.

env.py ADDED

	@@ -0,0 +1,76 @@

+from __future__ import annotations
+import random
+from typing import Optional
+from models import Action, Observation, StepResult, State
+from tasks import EasyTask, MediumTask, HardTask, BonusTask
+from tasks.base import InternalState
+TASK_MAP = {
+    "easy": EasyTask,
+    "medium": MediumTask,
+    "hard": HardTask,
+    "bonus": BonusTask,
+}
+class DevOpsIncidentEnv:
+    """
+    OpenEnv-compliant environment for DevOps incident response.
+    Four tasks of escalating difficulty:
+      easy   - Single service OOM (rotating service by seed)
+      medium - Cascading failure from bad deployment (red-herring alert)
+      hard   - Silent data corruption, no error-rate alerts
+      bonus  - Two simultaneous independent failures, both must be fixed
+    """
+    def __init__(self, task_id: str = "easy", seed: Optional[int] = None):
+        if task_id not in TASK_MAP:
+            raise ValueError(
+                f"task_id must be one of {list(TASK_MAP.keys())}, got '{task_id}'"
+            )
+        self.task_id = task_id
+        self.seed = seed
+        self._task = None
+        self._internal_state: Optional[InternalState] = None
+    def reset(self, seed: Optional[int] = None) -> Observation:
+        if seed is not None:
+            self.seed = seed
+        rng = random.Random(self.seed)
+        self._task = TASK_MAP[self.task_id](rng=rng)
+        self._internal_state = self._task.initialize()
+        return self._internal_state._build_observation()
+    def step(self, action: Action) -> StepResult:
+        if self._internal_state is None:
+            raise RuntimeError("Call reset() before step()")
+        output = self._task.step(self._internal_state, action)
+        self._internal_state = output.next_state
+        return StepResult(
+            observation=self._internal_state._build_observation(),
+            reward=output.reward,
+            done=output.done,
+            info=output.info,
+        )
+    def state(self) -> State:
+        if self._internal_state is None:
+            raise RuntimeError("Call reset() before state()")
+        s = self._internal_state
+        from graders.grader import grade_episode, get_episode_analytics
+        snap = s.to_state_snapshot()
+        analytics = get_episode_analytics(
+            s.task_id, s.action_history,
+            s.ground_truth_root_cause, s.incident_resolved,
+        )
+        current_score = grade_episode(
+            s.task_id, s.action_history, s.ground_truth_root_cause,
+            s.ground_truth_fix, s.incident_resolved, s.total_reward,
+        )
+        snap.info = {
+            "rewards_unlocked": sorted(s.rewards_given),
+            "current_score": current_score,
+            "analytics": analytics,
+        }
+        return snap

graders/__init__.py ADDED

	@@ -0,0 +1,3 @@


+from graders.grader import grade_episode
+
+__all__ = ["grade_episode"]

graders/grader.py ADDED

	@@ -0,0 +1,195 @@

+from __future__ import annotations
+from typing import List, Dict, Any, Optional
+def grade_episode(
+    task_id: str,
+    action_history: List[Dict[str, Any]],
+    ground_truth_root_cause: str,
+    ground_truth_fix: str,
+    incident_resolved: bool,
+    total_reward: float,
+) -> float:
+    """
+    Deterministic grader. Returns a float in [0.0, 1.0].
+    Scoring:
+      - Base: total_reward accumulated during episode (already [0,1])
+      - Efficiency bonus: up to +0.05 for fast resolution
+      - Diagnosis quality bonus: up to +0.03 for precise root cause
+      - Penalty: excess noops, repeated unnecessary restarts
+    Args:
+        task_id:                 "easy" | "medium" | "hard" | "bonus"
+        action_history:          List of {step, action, reward} dicts
+        ground_truth_root_cause: The actual root cause string
+        ground_truth_fix:        The correct remediation string
+        incident_resolved:       Whether the environment flagged resolution
+        total_reward:            Cumulative in-episode reward [0.0, 1.0]
+    Returns:
+        Final score in [0.0, 1.0]
+    """
+    score = float(total_reward)
+    actions = [entry["action"] for entry in action_history]
+    action_types = [a["action_type"] for a in actions]
+    n_steps = len(action_history)
+    # --- Efficiency bonus (faster = better) ---
+    if incident_resolved and n_steps > 0:
+        max_steps = {"easy": 15, "medium": 20, "hard": 25, "bonus": 25}.get(task_id, 20)
+        efficiency = max(0.0, 1.0 - (n_steps / max_steps))
+        score += efficiency * 0.05
+    # --- Diagnosis precision bonus ---
+    diagnoses = [
+        a.get("root_cause", "") or ""
+        for a in actions
+        if a["action_type"] == "diagnose"
+    ]
+    if diagnoses:
+        best_overlap = max(
+            _keyword_overlap(d, ground_truth_root_cause) for d in diagnoses
+        )
+        if best_overlap >= 0.5:
+            score += 0.03
+        elif best_overlap >= 0.3:
+            score += 0.01
+    # --- Penalty: excessive noops ---
+    noop_count = action_types.count("noop")
+    if noop_count > 3:
+        score -= (noop_count - 3) * 0.02
+    # --- Penalty: repeated restarts of same service ---
+    restart_counts: Dict[str, int] = {}
+    for a in actions:
+        if a["action_type"] == "restart_service":
+            svc = a.get("service") or ""
+            restart_counts[svc] = restart_counts.get(svc, 0) + 1
+    for svc, count in restart_counts.items():
+        if count > 1:
+            score -= (count - 1) * 0.05
+    return round(max(0.0, min(1.0, score)), 4)
+def get_episode_analytics(
+    task_id: str,
+    action_history: List[Dict[str, Any]],
+    ground_truth_root_cause: str,
+    incident_resolved: bool,
+) -> Dict[str, Any]:
+    """
+    Returns detailed analytics for a completed episode.
+    Used by /state endpoint and for debugging agent performance.
+    """
+    actions = [entry["action"] for entry in action_history]
+    action_types = [a["action_type"] for a in actions]
+    # Steps to first diagnosis
+    steps_to_diagnosis: Optional[int] = None
+    for i, a in enumerate(actions):
+        if a["action_type"] == "diagnose":
+            steps_to_diagnosis = i + 1
+            break
+    # Steps to resolution
+    steps_to_resolution: Optional[int] = len(action_history) if incident_resolved else None
+    # Best diagnosis overlap
+    diagnoses = [a.get("root_cause", "") or "" for a in actions if a["action_type"] == "diagnose"]
+    best_diagnosis_overlap = max(
+        (_keyword_overlap(d, ground_truth_root_cause) for d in diagnoses), default=0.0
+    )
+    # Information gathering ratio
+    read_actions = sum(1 for at in action_types if at in ("read_logs", "read_metrics", "read_runbook"))
+    info_ratio = read_actions / max(len(action_types), 1)
+    # Services investigated
+    services_read = list({
+        a.get("service") or ""
+        for a in actions
+        if a["action_type"] in ("read_logs", "read_metrics") and a.get("service")
+    })
+    # Collateral damage count
+    rewards = [entry["reward"] for entry in action_history]
+    negative_rewards = [r for r in rewards if r < -0.01]
+    return {
+        "task_id": task_id,
+        "total_steps": len(action_history),
+        "steps_to_first_diagnosis": steps_to_diagnosis,
+        "steps_to_resolution": steps_to_resolution,
+        "incident_resolved": incident_resolved,
+        "best_diagnosis_overlap": round(best_diagnosis_overlap, 3),
+        "information_gathering_ratio": round(info_ratio, 3),
+        "services_investigated": services_read,
+        "collateral_damage_events": len(negative_rewards),
+        "action_type_counts": {
+            at: action_types.count(at)
+            for at in set(action_types)
+        },
+    }
+def _keyword_overlap(candidate: str, ground_truth: str) -> float:
+    """
+    Returns fraction of ground-truth content words present in candidate.
+    Handles hyphens, underscores, case. Filters stop words.
+    """
+    if not candidate or not ground_truth:
+        return 0.0
+    stops = {"the", "a", "an", "of", "to", "in", "for", "and", "or",
+             "is", "was", "are", "v", "v2", "v3", "v4"}
+    def tokenize(s: str) -> set:
+        tokens = s.lower().replace("-", " ").replace("_", " ").replace(".", " ").split()
+        return {t for t in tokens if t not in stops and len(t) > 1}
+    gt_words = tokenize(ground_truth)
+    cand_words = tokenize(candidate)
+    if not gt_words:
+        return 0.0
+    return len(gt_words & cand_words) / len(gt_words)
+def run_smoke_test() -> None:
+    """Quick smoke test for CI/CD: runs a random agent on each task and checks the score stays in [0, 1]."""
+    import sys
+    import os
+    import random
+    sys.path.insert(0, os.path.dirname(os.path.dirname(__file__)))
+    from env import DevOpsIncidentEnv
+    from models import Action, ActionType
+    print("Running grader smoke test...")
+    for task_id in ["easy", "medium", "hard", "bonus"]:
+        rng = random.Random(99)
+        env = DevOpsIncidentEnv(task_id=task_id, seed=42)
+        env.reset()
+        done = False
+        while not done:
+            action = Action(
+                action_type=rng.choice(list(ActionType)),
+                service=rng.choice(["api-gateway", "payment-service", None]),
+            )
+            result = env.step(action)
+            done = result.done
+        s = env.state()
+        score = grade_episode(
+            task_id, s.action_history, s.ground_truth_root_cause,
+            s.ground_truth_fix, s.incident_resolved, s.total_reward,
+        )
+        analytics = get_episode_analytics(
+            task_id, s.action_history, s.ground_truth_root_cause, s.incident_resolved
+        )
+        assert 0.0 <= score <= 1.0, f"Score {score} out of range"
+        print(f"  {task_id}: score={score:.4f}  analytics={analytics['action_type_counts']}")
+    print("Smoke test passed.")
+if __name__ == "__main__":
+    run_smoke_test()
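
To make the scoring in `_keyword_overlap` concrete, here is a minimal standalone sketch that copies its tokenisation and overlap logic so it runs without the rest of the repository. The ground-truth and diagnosis strings are hypothetical and are not taken from the task definitions.

```python
# Standalone copy of the tokenisation and overlap logic from _keyword_overlap.
STOPS = {"the", "a", "an", "of", "to", "in", "for", "and", "or",
         "is", "was", "are", "v", "v2", "v3", "v4"}


def tokenize(s: str) -> set:
    # Lowercase, treat hyphens/underscores/dots as spaces, drop stop words
    # and single-character tokens.
    tokens = s.lower().replace("-", " ").replace("_", " ").replace(".", " ").split()
    return {t for t in tokens if t not in STOPS and len(t) > 1}


def keyword_overlap(candidate: str, ground_truth: str) -> float:
    # Fraction of ground-truth content words that also appear in the candidate.
    if not candidate or not ground_truth:
        return 0.0
    gt_words = tokenize(ground_truth)
    if not gt_words:
        return 0.0
    return len(gt_words & tokenize(candidate)) / len(gt_words)


# Hypothetical strings for illustration only.
ground_truth = "memory leak in payment-service"
full = keyword_overlap("payment_service has a memory leak", ground_truth)
partial = keyword_overlap("payment service is slow", ground_truth)
print(full, partial)  # 1.0 0.5
```

The score is recall-oriented: extra words in the diagnosis do not lower it, only missing ground-truth words do.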

inference.py ADDED

	@@ -0,0 +1,274 @@

+"""
+Inference Script — DevOps Incident Response OpenEnv
+=====================================================
+MANDATORY env vars:
+    API_BASE_URL   The API endpoint for the LLM
+    MODEL_NAME     The model identifier
+    HF_TOKEN       Your Hugging Face / API key
+Run:
+    API_BASE_URL=... MODEL_NAME=... HF_TOKEN=... python inference.py
+"""
+import os
+import json
+import re
+import textwrap
+from typing import Optional
+from openai import OpenAI
+from env import DevOpsIncidentEnv
+from models import Action, ActionType, Observation
+from graders.grader import grade_episode
+API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY", "")
+MODEL_NAME = os.getenv("MODEL_NAME", "meta-llama/Llama-3.3-70B-Instruct")
+TEMPERATURE = 0.1
+MAX_TOKENS = 512
+FALLBACK_ACTION = Action(action_type=ActionType.NOOP, reason="parse_failure")
+SYSTEM_PROMPT = textwrap.dedent("""
+You are a senior on-call DevOps engineer responding to a production incident.
+You will receive: active alerts, service statuses, recent logs, a service
+dependency map, and a log of all evidence you have gathered so far.
+Your strategy:
+1. Read logs and metrics for the most suspicious services BEFORE acting
+2. Use the dependency map to trace cascades to their ROOT cause
+3. Issue a DIAGNOSE action once you have enough evidence
+4. Apply the precise fix — wrong service or wrong action loses points
+5. On hard incidents: both rollback AND alert_oncall may be required
+Respond with ONLY a valid JSON object — no markdown, no commentary:
+{
+  "action_type": "<diagnose|read_logs|read_metrics|read_runbook|restart_service|rollback|scale_up|alert_oncall|acknowledge|noop>",
+  "service": "<service name or null>",
+  "root_cause": "<diagnosis string if action_type is diagnose, else null>",
+  "runbook": "<runbook filename if action_type is read_runbook, else null>",
+  "version": "<version string if action_type is rollback, else null>",
+  "reason": "<one sentence: what you know and why you are taking this action>"
+}
+Available runbooks: high_cpu.md, memory_leak.md, db_connection.md,
+deployment_rollback.md, cascade_failure.md, data_corruption.md
+""").strip()
+def observation_to_text(obs: Observation) -> str:
+    lines = [
+        f"╔═ INCIDENT RESPONSE  Step {obs.step}/{obs.max_steps}  "
+        f"Elapsed: {obs.elapsed_minutes}min ═╗",
+        f"Task: {obs.task_description[:120]}",
+        "",
+    ]
+    # SLA status
+    breached = [s for s, v in obs.sla_status.items() if v == "breached"]
+    warning_sla = [s for s, v in obs.sla_status.items() if v == "warning"]
+    if breached:
+        lines.append(f"⚠ SLA BREACHED: {', '.join(breached)}")
+    if warning_sla:
+        lines.append(f"⚠ SLA WARNING:  {', '.join(warning_sla)}")
+    if breached or warning_sla:
+        lines.append("")
+    # Active alerts
+    lines.append("── ALERTS ──────────────────────────────────────────")
+    if obs.active_alerts:
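+        # Rank by urgency; sorting the raw strings would put "info" before "warning"
+        severity_rank = {"critical": 0, "warning": 1, "info": 2}
+        for a in sorted(obs.active_alerts, key=lambda x: severity_rank.get(x.severity, 3)):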
+            ack = " [ACK]" if a.acknowledged else ""
+            lines.append(f"  [{a.severity.upper():<8}]{ack} {a.service}: {a.message}")
+    else:
+        lines.append("  (no active alerts)")
+    # Service status table
+    lines.append("")
+    lines.append("── SERVICES ─────────────────────────────────────────")
+    lines.append(f"  {'SERVICE':<30} {'STATUS':<10} {'CPU':>5} {'MEM':>5} "
+                 f"{'ERR/s':>6} {'P99ms':>7} {'VERSION':<12} {'DEPLOYED'}")
+    for svc in sorted(obs.services, key=lambda s: s.error_rate, reverse=True):
+        sla = "🔴" if obs.sla_status.get(svc.name) == "breached" else (
+              "🟡" if obs.sla_status.get(svc.name) == "warning" else " ")
+        lines.append(
+            f"  {sla}{svc.name:<29} {svc.status.upper():<10} "
+            f"{svc.cpu_percent:>4.0f}% {svc.memory_percent:>4.0f}% "
+            f"{svc.error_rate:>6.2f} {svc.latency_p99_ms:>7.0f} "
+            f"{svc.current_version:<12} {svc.last_deployed[:10]}"
+        )
+    # Dependency topology
+    if obs.service_dependencies:
+        lines.append("")
+        lines.append("── SERVICE DEPENDENCY MAP ───────────────────────────")
+        for dep in obs.service_dependencies:
+            if dep.calls:
+                lines.append(f"  {dep.service}  →  {', '.join(dep.calls)}")
+    # Recent logs (only services with anomalies or not yet read)
+    already_read = {e.source.replace("logs:", "") for e in obs.evidence_log
+                    if e.source.startswith("logs:")}
+    lines.append("")
+    lines.append("── RECENT LOGS ──────────────────────────────────────")
+    for svc_name, log_lines in obs.recent_logs.items():
+        if not log_lines:
+            continue
+        # Show all logs on first 3 steps, then only unread + anomalies
+        has_anomaly = any(
+            kw in "\n".join(log_lines).upper()
+            for kw in ["ERROR", "FATAL", "CRIT", "WARN", "MISMATCH", "ENOSPC", "OOM"]
+        )
+        if obs.step <= 3 or svc_name not in already_read or has_anomaly:
+            lines.append(f"  [{svc_name}]")
+            for line in log_lines[-5:]:
+                lines.append(f"    {line}")
+    # Accumulated evidence
+    if obs.evidence_log:
+        lines.append("")
+        lines.append("── EVIDENCE GATHERED (all steps) ────────────────────")
+        for e in obs.evidence_log:
+            lines.append(f"  Step {e.step:02d} | {e.source}")
+            lines.append(f"         {e.summary}")
+    if obs.last_action_result:
+        lines.append("")
+        lines.append(f"Last action: {obs.last_action_result}")
+    if obs.last_action_error:
+        lines.append(f"ERROR: {obs.last_action_error}")
+    return "\n".join(lines)
+def parse_action(response_text: str) -> Action:
+    if not response_text:
+        return FALLBACK_ACTION
+    text = re.sub(r"```(?:json)?", "", response_text).strip()
+    match = re.search(r"\{.*\}", text, re.DOTALL)
+    if not match:
+        return FALLBACK_ACTION
+    try:
+        data = json.loads(match.group(0))
+        at_str = data.get("action_type", "noop")
+        valid = {e.value for e in ActionType}
+        if at_str not in valid:
+            at_str = "noop"
+        return Action(
+            action_type=ActionType(at_str),
+            service=data.get("service"),
+            root_cause=data.get("root_cause"),
+            runbook=data.get("runbook"),
+            version=data.get("version"),
+            reason=data.get("reason"),
+        )
+    except Exception:
+        return FALLBACK_ACTION
+def run_task(client: OpenAI, task_id: str, seed: int = 42) -> dict:
+    env = DevOpsIncidentEnv(task_id=task_id, seed=seed)
+    obs = env.reset()
+    print(f"\n{'━'*64}")
+    print(f"  Task: {task_id.upper()}  |  Seed: {seed}  |  Model: {MODEL_NAME}")
+    print(f"{'━'*64}")
+    done = False
+    step = 0
+    while not done and step < obs.max_steps:
+        step += 1
+        prompt = observation_to_text(obs)
+        try:
+            completion = client.chat.completions.create(
+                model=MODEL_NAME,
+                messages=[
+                    {"role": "system", "content": SYSTEM_PROMPT},
+                    {"role": "user", "content": prompt},
+                ],
+                temperature=TEMPERATURE,
+                max_tokens=MAX_TOKENS,
+            )
+            response_text = completion.choices[0].message.content or ""
+        except Exception as exc:
+            print(f"  Step {step:02d}: API error — {exc}")
+            response_text = ""
+        action = parse_action(response_text)
+        action_label = action.action_type.value
+        if action.service:
+            action_label += f"({action.service})"
+        if action.root_cause:
+            action_label += f'  rc="{action.root_cause[:40]}"'
+        if action.version:
+            action_label += f"  ver={action.version}"
+        if action.runbook:
+            action_label += f"  rb={action.runbook}"
+        result = env.step(action)
+        obs = result.observation
+        reward_str = f"  reward={result.reward:+.3f}" if result.reward != 0 else ""
+        resolution_str = f"  *** {result.info.get('resolution', '')} ***" if result.done and result.info.get("resolution") else ""
+        print(f"  Step {step:02d}: {action_label}{reward_str}{resolution_str}")
+        if obs.last_action_error:
+            print(f"           ⚠ {obs.last_action_error[:80]}")
+        done = result.done
+    state = env.state()
+    final_score = grade_episode(
+        task_id=task_id,
+        action_history=state.action_history,
+        ground_truth_root_cause=state.ground_truth_root_cause,
+        ground_truth_fix=state.ground_truth_fix,
+        incident_resolved=state.incident_resolved,
+        total_reward=state.total_reward,
+    )
+    print(f"\n  Ground truth : {state.ground_truth_root_cause}")
+    print(f"  Resolved     : {state.incident_resolved}")
+    print(f"  Steps taken  : {step}")
+    print(f"  Rewards      : {[e['reward'] for e in state.action_history if e['reward'] != 0]}")
+    print(f"  Final score  : {final_score:.4f}")
+    return {
+        "task_id": task_id,
+        "score": final_score,
+        "resolved": state.incident_resolved,
+        "steps": step,
+        "rewards_unlocked": state.info.get("rewards_unlocked", []),
+    }
+def main():
+    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+    results = []
+    for task_id in ["easy", "medium", "hard", "bonus"]:
+        r = run_task(client, task_id, seed=42)
+        results.append(r)
+    print(f"\n{'━'*64}")
+    print("  BASELINE SCORES")
+    print(f"{'━'*64}")
+    total = 0.0
+    for r in results:
+        resolved_mark = "✓" if r["resolved"] else "✗"
+        print(
+            f"  {r['task_id']:<8}  {r['score']:.4f}  "
+            f"{resolved_mark}  steps={r['steps']}  "
+            f"unlocked={len(r['rewards_unlocked'])}"
+        )
+        total += r["score"]
+    avg = total / len(results)
+    print(f"  {'average':<8}  {avg:.4f}")
+    print(f"{'━'*64}\n")
+if __name__ == "__main__":
+    main()
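
The parsing strategy in `parse_action` (strip markdown fences, grab the outermost braces, fall back to a no-op on any failure) can be sketched without the repository's `Action` model. This is a minimal illustration, not the project's implementation: `extract_action_dict` and `VALID_ACTIONS` are hypothetical names, and the function returns a plain dict instead of an `Action`.

```python
import json
import re

# Action names copied from the ActionType enum; the set name is illustrative.
VALID_ACTIONS = {
    "diagnose", "read_logs", "read_metrics", "read_runbook", "restart_service",
    "rollback", "scale_up", "alert_oncall", "acknowledge", "noop",
}


def extract_action_dict(response_text: str) -> dict:
    """Return the first JSON object found in an LLM reply, or a noop fallback."""
    fallback = {"action_type": "noop", "reason": "parse_failure"}
    if not response_text:
        return fallback
    # Remove markdown code-fence markers, with or without a "json" tag.
    text = re.sub(r"`{3}(?:json)?", "", response_text).strip()
    # Greedy match from the first "{" to the last "}" across newlines.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return fallback
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return fallback
    # Unknown action types are downgraded to noop, other fields are kept.
    if data.get("action_type") not in VALID_ACTIONS:
        data["action_type"] = "noop"
    return data


fence = "`" * 3  # built at runtime so this example contains no literal fence
reply = (
    f"Sure, here it is:\n{fence}json\n"
    '{"action_type": "read_logs", "service": "payment-service"}\n'
    f"{fence}"
)
parsed = extract_action_dict(reply)
print(parsed["action_type"], parsed["service"])
```

Falling back to a no-op rather than raising keeps the episode loop running when the model returns malformed output, at the cost of a wasted step.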

models.py ADDED

	@@ -0,0 +1,107 @@

+from __future__ import annotations
+from pydantic import BaseModel, Field
+from typing import List, Optional, Dict, Any, Literal
+from enum import Enum
+class ActionType(str, Enum):
+    DIAGNOSE = "diagnose"
+    READ_LOGS = "read_logs"
+    READ_METRICS = "read_metrics"
+    READ_RUNBOOK = "read_runbook"
+    RESTART_SERVICE = "restart_service"
+    ROLLBACK = "rollback"
+    SCALE_UP = "scale_up"
+    ALERT_ONCALL = "alert_oncall"
+    ACKNOWLEDGE = "acknowledge"
+    NOOP = "noop"
+class Action(BaseModel):
+    action_type: ActionType
+    service: Optional[str] = None
+    root_cause: Optional[str] = None
+    runbook: Optional[str] = None
+    version: Optional[str] = None
+    reason: Optional[str] = None
+class Alert(BaseModel):
+    id: str
+    severity: Literal["critical", "warning", "info"]
+    service: str
+    message: str
+    timestamp: str
+    acknowledged: bool = False
+class ServiceStatus(BaseModel):
+    name: str
+    status: Literal["healthy", "degraded", "down", "unknown"]
+    cpu_percent: float
+    memory_percent: float
+    error_rate: float
+    latency_p99_ms: float
+    replicas_running: int
+    replicas_desired: int
+    current_version: str
+    last_deployed: str
+    # SLA tracking — updated each step if unresolved
+    sla_breach: bool = False
+    minutes_degraded: int = 0
+class ServiceDependency(BaseModel):
+    """Describes which services call which — critical for cascade diagnosis."""
+    service: str
+    calls: List[str]  # services this one depends on
+    called_by: List[str]  # services that depend on this one
+class EvidenceEntry(BaseModel):
+    """One piece of gathered evidence — accumulated across steps."""
+    step: int
+    source: str       # e.g. "logs:payment-service" or "metrics:inventory-service"
+    summary: str      # short digest of what was found
+    raw: str          # full content returned by read action
+class Observation(BaseModel):
+    step: int
+    max_steps: int
+    task_id: str
+    task_description: str
+    services: List[ServiceStatus]
+    active_alerts: List[Alert]
+    recent_logs: Dict[str, List[str]]
+    available_runbooks: List[str]
+    # NEW: dependency topology so agent can reason about cascades
+    service_dependencies: List[ServiceDependency] = []
+    # NEW: accumulated evidence from all previous read actions
+    evidence_log: List[EvidenceEntry] = []
+    # NEW: SLA status — shows urgency
+    sla_status: Dict[str, str] = {}   # service -> "ok" | "warning" | "breached"
+    last_action_result: Optional[str] = None
+    last_action_error: Optional[str] = None
+    incident_start_time: str
+    elapsed_minutes: int
+class StepResult(BaseModel):
+    observation: Observation
+    reward: float
+    done: bool
+    info: Dict[str, Any] = {}
+class State(BaseModel):
+    episode_id: str
+    task_id: str
+    step: int
+    current_observation: Observation
+    action_history: List[Dict[str, Any]]
+    total_reward: float
+    incident_resolved: bool
+    ground_truth_root_cause: str
+    ground_truth_fix: str
+    info: Dict[str, Any] = {}

openenv.yaml ADDED

	@@ -0,0 +1,131 @@

+name: devops-incident-response
+version: "1.0.0"
+description: >
+  A reinforcement learning environment where AI agents learn to diagnose and
+  remediate production software incidents. Agents read logs, metrics, and
+  alerts across a simulated microservices architecture, then take remediation
+  actions such as rollbacks, restarts, and on-call escalations. Four tasks
+  of escalating difficulty — from a clear memory leak to silent data
+  corruption with no error-rate alerts — provide a meaningful difficulty
+  progression for benchmarking agent reasoning quality.
+author: "devops-incident-env"
+tags:
+  - openenv
+  - devops
+  - incident-response
+  - real-world
+  - multi-step
+  - microservices
+  - reward-shaping
+tasks:
+  - id: easy
+    name: Single Service Anomaly
+    description: >
+      A payment service is crash-looping due to a JVM heap memory leak.
+      Logs clearly show OutOfMemoryError and OOMKilled pod restarts.
+      The agent must read logs/metrics, diagnose the memory leak, and
+      restart the affected service without touching healthy services.
+    difficulty: easy
+    max_steps: 15
+    reward_range: [0.0, 1.0]
+    expected_score_random_agent: 0.05
+    expected_score_strong_llm: 0.75
+  - id: medium
+    name: Cascading Multi-Service Failure
+    description: >
+      A bad deployment of inventory-service introduced connection pool
+      exhaustion, cascading to order-service timeouts and api-gateway
+      errors. A red-herring alert fires on notification-service (high CPU
+      from a scheduled batch job). The agent must trace the cascade to the
+      root service and rollback — not restart downstream victims.
+    difficulty: medium
+    max_steps: 20
+    reward_range: [0.0, 1.0]
+    expected_score_random_agent: 0.03
+    expected_score_strong_llm: 0.55
+  - id: hard
+    name: Silent Data Corruption
+    description: >
+      A data pipeline deployment silently writes incorrect price values to
+      the product catalog. No standard error-rate or latency alerts fire —
+      all services show green health. The signal is buried in
+      price-validation WARN logs (15% mismatch rate) and an analytics
+      anomaly (avg order value 9x baseline). Full credit requires both
+      rollback of the pipeline AND alerting on-call for a data audit.
+    difficulty: hard
+    max_steps: 25
+    reward_range: [0.0, 1.0]
+    expected_score_random_agent: 0.01
+    expected_score_strong_llm: 0.35
+  - id: bonus
+    name: Simultaneous Dual Failure
+    description: >
+      Two independent failures strike at once: log-aggregator disk is 100% full
+      (causing log loss across all services) and ml-inference-service is stuck
+      in a model reload CPU loop. Neither failure is related to the other.
+      Full credit requires fixing both root causes independently.
+    difficulty: hard
+    max_steps: 25
+    reward_range: [0.0, 1.0]
+    expected_score_random_agent: 0.01
+    expected_score_strong_llm: 0.40
+action_space:
+  type: structured
+  description: >
+    Discrete action types with optional service/parameter arguments.
+    Actions are expressed as Pydantic Action objects with fields:
+    action_type, service, root_cause, runbook, version, reason.
+  actions:
+    - name: diagnose
+      description: Record the agent's root cause hypothesis
+    - name: read_logs
+      description: Read recent log lines for a named service
+    - name: read_metrics
+      description: Read CPU, memory, error rate, latency for a named service
+    - name: read_runbook
+      description: Read an operational runbook by filename
+    - name: restart_service
+      description: Restart a named service (clears memory, resets connections)
+    - name: rollback
+      description: Roll back a service to a previous version
+    - name: scale_up
+      description: Increase replica count for a named service
+    - name: alert_oncall
+      description: Page the on-call engineering team
+    - name: acknowledge
+      description: Acknowledge an active alert by ID
+    - name: noop
+      description: Take no action this step
+observation_space:
+  type: structured
+  description: >
+    Pydantic Observation object containing: current step and max steps, task
+    description, list of ServiceStatus objects (name, status, cpu, memory,
+    error_rate, latency_p99, replicas, version, last_deployed), list of Alert
+    objects (severity, service, message, acknowledged), recent log lines per
+    service (dict of service_name -> last 10 lines), available runbook
+    names, the service dependency map, the accumulated evidence log,
+    per-service SLA status (ok, warning, breached), last action
+    result/error, and incident timing info.
+reward:
+  type: dense
+  range: [0.0, 1.0]
+  description: >
+    Partial credit for information gathering, correct diagnosis, and
+    precise remediation. Penalties for collateral damage (restarting
+    healthy services), excessive noops, and treating symptoms instead
+    of root causes. Efficiency bonus for fast resolution.
+docker:
+  base_image: python:3.11-slim
+  port: 7860
+  health_endpoint: /health
+  reset_endpoint: /reset
+  step_endpoint: /step
+  state_endpoint: /state

requirements.txt ADDED

	@@ -0,0 +1,6 @@

+pydantic>=2.0,<3.0
+fastapi>=0.110.0
+uvicorn>=0.29.0
+openai>=1.0.0
+python-dotenv>=1.0.0
+pyyaml>=6.0

tasks/__init__.py ADDED

	@@ -0,0 +1,6 @@

+from tasks.task_easy import EasyTask
+from tasks.task_medium import MediumTask
+from tasks.task_hard import HardTask
+from tasks.task_bonus import BonusTask
+__all__ = ["EasyTask", "MediumTask", "HardTask", "BonusTask"]

tasks/base.py ADDED

	@@ -0,0 +1,306 @@

+from __future__ import annotations
+import random
+import uuid
+from abc import ABC, abstractmethod
+from dataclasses import dataclass, field
+from typing import List, Dict, Any, Optional, Set
+from models import (
+    Action, ActionType, Observation, State, StepResult,
+    ServiceStatus, Alert, ServiceDependency, EvidenceEntry,
+)
+AVAILABLE_RUNBOOKS = [
+    "high_cpu.md",
+    "memory_leak.md",
+    "db_connection.md",
+    "deployment_rollback.md",
+    "cascade_failure.md",
+    "data_corruption.md",
+]
+TASK_DESCRIPTIONS = {
+    "easy": (
+        "PRODUCTION INCIDENT — One service is crash-looping. "
+        "Read its logs and metrics to find the root cause, diagnose precisely, "
+        "then apply the correct single-service fix. "
+        "Avoid restarting healthy services — collateral damage is penalised."
+    ),
+    "medium": (
+        "PRODUCTION INCIDENT — Multiple services are degraded. "
+        "Use the service dependency map to trace the failure to its origin. "
+        "A recent deployment is likely involved. One alert is a red herring. "
+        "Fix the root service only — downstream victims will self-heal."
+    ),
+    "hard": (
+        "PRODUCTION INCIDENT — All services show green health. No error-rate alerts. "
+        "Look for anomalies in business-logic metrics and WARN-level logs. "
+        "Correlate signals across services to find silent data corruption. "
+        "Two actions are required for full credit: rollback AND alert_oncall."
+    ),
+    "bonus": (
+        "PRODUCTION INCIDENT — Two independent failures are active simultaneously. "
+        "They are unrelated — fixing one will NOT fix the other. "
+        "Identify both root causes and remediate each independently. "
+        "Full credit requires resolving both."
+    ),
+}
+@dataclass
+class InternalState:
+    episode_id: str
+    task_id: str
+    step: int
+    max_steps: int
+    services: Dict[str, dict]
+    alerts: list
+    logs: Dict[str, List[str]]
+    action_history: List[Dict[str, Any]]
+    total_reward: float
+    incident_resolved: bool
+    ground_truth_root_cause: str
+    ground_truth_fix: str
+    incident_start_time: str
+    rewards_given: Set[str] = field(default_factory=set)
+    healthy_services: List[str] = field(default_factory=list)
+    evidence_log: List[dict] = field(default_factory=list)
+    service_dependencies: List[dict] = field(default_factory=list)
+    _scenario: Any = field(default=None, repr=False)
+    _ml_version: Any = field(default=None, repr=False)
+    def to_state_snapshot(self) -> State:
+        obs = self._build_observation()
+        return State(
+            episode_id=self.episode_id,
+            task_id=self.task_id,
+            step=self.step,
+            current_observation=obs,
+            action_history=self.action_history,
+            total_reward=round(self.total_reward, 4),
+            incident_resolved=self.incident_resolved,
+            ground_truth_root_cause=self.ground_truth_root_cause,
+            ground_truth_fix=self.ground_truth_fix,
+            info={
+                "rewards_unlocked": sorted(self.rewards_given),
+                "evidence_gathered": len(self.evidence_log),
+            },
+        )
+    def _build_sla_status(self) -> Dict[str, str]:
+        status = {}
+        for name, svc in self.services.items():
+            if svc["status"] == "down":
+                mins = self.step * 2
+                if mins >= 10:
+                    status[name] = "breached"
+                elif mins >= 5:
+                    status[name] = "warning"
+                else:
+                    status[name] = "ok"
+            elif svc["status"] == "degraded":
+                mins = self.step * 2
+                if mins >= 20:
+                    status[name] = "breached"
+                elif mins >= 10:
+                    status[name] = "warning"
+                else:
+                    status[name] = "ok"
+            else:
+                status[name] = "ok"
+        return status
+    def _apply_sla_degradation(self) -> None:
+        """Services get progressively worse if not fixed — adds urgency."""
+        if self.incident_resolved:
+            return
+        for name, svc in self.services.items():
+            if svc["status"] == "down":
+                svc["minutes_degraded"] = svc.get("minutes_degraded", 0) + 2
+                # Error rate creeps up
+                svc["error_rate"] = min(svc["error_rate"] * 1.05, 50.0)
+            elif svc["status"] == "degraded":
+                svc["minutes_degraded"] = svc.get("minutes_degraded", 0) + 2
+                # Latency grows
+                svc["latency_p99_ms"] = min(svc["latency_p99_ms"] * 1.03, 60000.0)
+                if svc["latency_p99_ms"] > 30000 and svc["error_rate"] < 1.0:
+                    svc["error_rate"] = round(svc["error_rate"] + 0.5, 2)
+    def _build_observation(
+        self,
+        last_action_result: Optional[str] = None,
+        last_action_error: Optional[str] = None,
+    ) -> Observation:
+        services = []
+        for name, s in self.services.items():
+            services.append(ServiceStatus(
+                name=s["name"],
+                status=s["status"],
+                cpu_percent=s["cpu_percent"],
+                memory_percent=s["memory_percent"],
+                error_rate=round(s["error_rate"], 3),
+                latency_p99_ms=round(s["latency_p99_ms"], 0),
+                replicas_running=s["replicas_running"],
+                replicas_desired=s["replicas_desired"],
+                current_version=s["current_version"],
+                last_deployed=s["last_deployed"],
+                sla_breach=s.get("sla_breach", False),
+                minutes_degraded=s.get("minutes_degraded", 0),
+            ))
+        alerts = [Alert(**a) for a in self.alerts]
+        deps = [ServiceDependency(**d) for d in self.service_dependencies]
+        evidence = [EvidenceEntry(**e) for e in self.evidence_log]
+        sla = self._build_sla_status()
+        return Observation(
+            step=self.step,
+            max_steps=self.max_steps,
+            task_id=self.task_id,
+            task_description=TASK_DESCRIPTIONS.get(self.task_id, ""),
+            services=services,
+            active_alerts=alerts,
+            recent_logs=self.logs,
+            available_runbooks=AVAILABLE_RUNBOOKS,
+            service_dependencies=deps,
+            evidence_log=evidence,
+            sla_status=sla,
+            last_action_result=last_action_result,
+            last_action_error=last_action_error,
+            incident_start_time=self.incident_start_time,
+            elapsed_minutes=self.step * 2,
+        )
+@dataclass
+class StepOutput:
+    next_state: InternalState
+    reward: float
+    done: bool
+    info: Dict[str, Any]
+def semantic_match(candidate: str, keywords: List[str], threshold: int = 1) -> bool:
+    """
+    Returns True if candidate contains at least `threshold` keywords.
+    Case-insensitive, handles hyphens/underscores.
+    """
+    if not candidate:
+        return False
+    c = candidate.lower().replace("-", " ").replace("_", " ")
+    hits = sum(1 for kw in keywords if kw.lower().replace("-", " ").replace("_", " ") in c)
+    return hits >= threshold
+class BaseTask(ABC):
+    def __init__(self, rng: random.Random):
+        self.rng = rng
+    @abstractmethod
+    def initialize(self) -> InternalState:
+        pass
+    @abstractmethod
+    def step(self, state: InternalState, action: Action) -> StepOutput:
+        pass
+    def _apply_action_to_logs(
+        self, state: InternalState, action: Action
+    ) -> tuple[Optional[str], Optional[str]]:
+        at = action.action_type.value
+        if at == "read_logs":
+            svc = action.service
+            if svc and svc in state.logs:
+                lines = state.logs[svc]
+                result = "\n".join(lines)
+                # Add to evidence log
+                state.evidence_log.append({
+                    "step": state.step,
+                    "source": f"logs:{svc}",
+                    "summary": f"Read {len(lines)} log lines from {svc}",
+                    "raw": result,
+                })
+                return result, None
+            return None, f"No logs found for service '{svc}'"
+        if at == "read_metrics":
+            svc = action.service
+            if svc and svc in state.services:
+                s = state.services[svc]
+                result = (
+                    f"=== Metrics: {svc} ===\n"
+                    f"Status:       {s['status'].upper()}\n"
+                    f"CPU:          {s['cpu_percent']:.1f}%\n"
+                    f"Memory:       {s['memory_percent']:.1f}%\n"
+                    f"Error rate:   {s['error_rate']:.3f}/s\n"
+                    f"P99 latency:  {s['latency_p99_ms']:.0f}ms\n"
+                    f"Replicas:     {s['replicas_running']}/{s['replicas_desired']}\n"
+                    f"Version:      {s['current_version']}\n"
+                    f"Last deploy:  {s['last_deployed']}\n"
+                    f"Degraded for: {s.get('minutes_degraded', 0)} minutes"
+                )
+                state.evidence_log.append({
+                    "step": state.step,
+                    "source": f"metrics:{svc}",
+                    "summary": (
+                        f"{svc}: {s['status']}, cpu={s['cpu_percent']:.0f}%, "
+                        f"mem={s['memory_percent']:.0f}%, err={s['error_rate']:.2f}/s, "
+                        f"ver={s['current_version']}"
+                    ),
+                    "raw": result,
+                })
+                return result, None
+            return None, f"Unknown service '{svc}'"
+        if at == "read_runbook":
+            rb = action.runbook
+            if rb in AVAILABLE_RUNBOOKS:
+                content = self._load_runbook(rb)
+                state.evidence_log.append({
+                    "step": state.step,
+                    "source": f"runbook:{rb}",
+                    "summary": f"Read runbook: {rb}",
+                    "raw": content[:200],
+                })
+                return content, None
+            return None, f"Runbook '{rb}' not found. Available: {AVAILABLE_RUNBOOKS}"
+        if at == "acknowledge":
+            alert_id = action.service
+            for a in state.alerts:
+                if a["id"] == alert_id:
+                    a["acknowledged"] = True
+                    return f"Alert {alert_id} acknowledged.", None
+            return None, f"Alert '{alert_id}' not found."
+        if at == "noop":
+            return "No action taken.", None
+        return None, None
+    def _load_runbook(self, name: str) -> str:
+        import os
+        path = os.path.join(os.path.dirname(os.path.dirname(__file__)), "data", "runbooks", name)
+        try:
+            with open(path, encoding="utf-8") as f:
+                return f.read()
+        except FileNotFoundError:
+            return f"[Runbook '{name}' not found]"
+    def _clamp(self, value: float) -> float:
+        return max(0.0, min(1.0, value))
+    def _penalty_blind_remediation(
+        self, state: InternalState, action: Action, fix_key: str
+    ) -> float:
+        """
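+        Return a small penalty (-0.05) when the agent remediates before any
+        diagnosis (correct or partial) has been credited; return 0.0 otherwise,
+        including once `fix_key` has already been rewarded.
+        Encourages evidence-gathering before action.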
+        """
+        if fix_key in state.rewards_given:
+            return 0.0
+        if ("diagnose_correct" not in state.rewards_given
+                and "diagnose_partial" not in state.rewards_given):
+            return -0.05
+        return 0.0
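
Review note: the two helpers above (`_clamp` and `_penalty_blind_remediation`) can be checked in isolation. The following is a minimal sketch, assuming `rewards_given` behaves like a plain set of string tags (the `InternalState` class is not shown in this part of the diff). The standalone names `clamp`, `penalty_blind_remediation`, and `DIAGNOSIS_TAGS` are hypothetical and exist only for illustration.

```python
from typing import Set

# Tags that count as "a diagnosis has been credited" (taken from the helper above).
DIAGNOSIS_TAGS = {"diagnose_correct", "diagnose_partial"}


def clamp(value: float) -> float:
    """Keep a cumulative reward inside [0.0, 1.0]."""
    return max(0.0, min(1.0, value))


def penalty_blind_remediation(rewards_given: Set[str], fix_key: str) -> float:
    """Return -0.05 for a fix attempted before any credited diagnosis."""
    if fix_key in rewards_given:
        # The fix was already rewarded, so no further penalty applies.
        return 0.0
    if not (DIAGNOSIS_TAGS & rewards_given):
        # No diagnosis tag present: the remediation is "blind".
        return -0.05
    return 0.0


blind = penalty_blind_remediation(set(), "restarted")
informed = penalty_blind_remediation({"diagnose_partial"}, "restarted")
```

In this sketch `blind` carries the penalty and `informed` does not, which mirrors the intent stated in the docstring: gather evidence and diagnose first, then act.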

tasks/task_bonus.py ADDED

	@@ -0,0 +1,208 @@

+from __future__ import annotations
+import uuid
+from typing import Dict, Any, List
+from models import Action, ActionType
+from tasks.base import BaseTask, InternalState, StepOutput, semantic_match
+INCIDENT_TIME = "2026-03-30T14:22:00Z"
+DEPENDENCIES = [
+    {"service": "api-gateway",         "calls": ["ml-inference-service", "product-service"], "called_by": []},
+    {"service": "ml-inference-service","calls": [],                                           "called_by": ["api-gateway"]},
+    {"service": "log-aggregator",      "calls": [],                                           "called_by": []},
+    {"service": "product-service",     "calls": [],                                           "called_by": ["api-gateway"]},
+]
+AGGREGATOR_LOGS = [
+    "[14:20:01] INFO  Log ingestion running: 48MB/s",
+    "[14:21:05] WARN  Disk usage at 91% (/var/log/aggregated)",
+    "[14:21:45] WARN  Disk usage at 95% - log rotation overdue",
+    "[14:22:01] ERROR Disk usage at 99% - write failure imminent",
+    "[14:22:02] ERROR Failed to write log chunk: No space left on device (ENOSPC)",
+    "[14:22:04] WARN  Dropping incoming logs: buffer overflow (48000 messages dropped)",
+    "[14:22:05] ERROR Log rotation job FAILED: No space left on device",
+    "[14:22:10] CRIT  Disk 100% full - all log writes failing",
+]
+ML_LOGS = [
+    "[14:21:00] INFO  ml-inference-service starting",
+    "[14:21:01] INFO  Loading model: recommendation-v2.1 (2.3GB)",
+    "[14:21:12] INFO  Model loaded in 11.2s",
+    "[14:21:12] WARN  Model checksum mismatch - reloading",
+    "[14:21:23] INFO  Model loaded in 11.1s",
+    "[14:21:23] WARN  Model checksum mismatch - reloading",
+    "[14:21:34] WARN  Model reload loop detected: 6 reloads in 60s",
+    "[14:22:01] ERROR CPU throttled: 100% sustained for 120s",
+    "[14:22:02] WARN  Deployment {version} introduced new model checksum validation - may have bug",
+]
+API_LOGS = [
+    "[14:22:00] INFO  GET /api/v1/recommendations 200 145ms",
+    "[14:22:05] WARN  GET /api/v1/recommendations 200 4823ms (ml-inference slow)",
+    "[14:22:15] ERROR GET /api/v1/recommendations 504 Gateway Timeout",
+]
+class BonusTask(BaseTask):
+    def initialize(self) -> InternalState:
+        ml_ver = f"v2.{self.rng.randint(0, 3)}.{self.rng.randint(0, 5)}"
+        logs = {
+            "log-aggregator": AGGREGATOR_LOGS[:],
+            "ml-inference-service": [l.replace("{version}", ml_ver) for l in ML_LOGS],
+            "api-gateway": API_LOGS[:],
+            "product-service": ["[14:22:00] INFO  Service healthy - 0 errors"],
+        }
+        services = {
+            "api-gateway": {
+                "name": "api-gateway", "status": "degraded",
+                "cpu_percent": round(self.rng.uniform(40, 58), 1),
+                "memory_percent": round(self.rng.uniform(44, 56), 1),
+                "error_rate": round(self.rng.uniform(3.0, 6.0), 2),
+                "latency_p99_ms": round(self.rng.uniform(8000, 12000), 0),
+                "replicas_running": 2, "replicas_desired": 2,
+                "current_version": "v3.1.0", "last_deployed": "2026-03-20T08:00:00Z",
+                "minutes_degraded": 0, "sla_breach": False,
+            },
+            "ml-inference-service": {
+                "name": "ml-inference-service", "status": "degraded",
+                "cpu_percent": round(self.rng.uniform(94, 100), 1),
+                "memory_percent": round(self.rng.uniform(55, 72), 1),
+                "error_rate": round(self.rng.uniform(1.5, 4.0), 2),
+                "latency_p99_ms": round(self.rng.uniform(9000, 14000), 0),
+                "replicas_running": 2, "replicas_desired": 2,
+                "current_version": ml_ver, "last_deployed": "2026-03-30T14:20:55Z",
+                "minutes_degraded": 0, "sla_breach": False,
+            },
+            "log-aggregator": {
+                "name": "log-aggregator", "status": "degraded",
+                "cpu_percent": round(self.rng.uniform(18, 30), 1),
+                "memory_percent": round(self.rng.uniform(40, 52), 1),
+                "error_rate": round(self.rng.uniform(5.0, 9.0), 2),
+                "latency_p99_ms": round(self.rng.uniform(200, 500), 0),
+                "replicas_running": 1, "replicas_desired": 1,
+                "current_version": "v1.3.0", "last_deployed": "2026-03-01T10:00:00Z",
+                "minutes_degraded": 0, "sla_breach": False,
+            },
+            "product-service": {
+                "name": "product-service", "status": "healthy",
+                "cpu_percent": round(self.rng.uniform(25, 38), 1),
+                "memory_percent": round(self.rng.uniform(35, 48), 1),
+                "error_rate": 0.0,
+                "latency_p99_ms": round(self.rng.uniform(15, 35), 0),
+                "replicas_running": 3, "replicas_desired": 3,
+                "current_version": "v2.0.1", "last_deployed": "2026-03-15T12:00:00Z",
+                "minutes_degraded": 0, "sla_breach": False,
+            },
+        }
+        alerts = [
+            {
+                "id": "B001", "severity": "critical", "service": "log-aggregator",
+                "message": "Disk 100% full on log-aggregator - dropping 48000 log messages/min",
+                "timestamp": "2026-03-30T14:22:10Z", "acknowledged": False,
+            },
+            {
+                "id": "B002", "severity": "critical", "service": "ml-inference-service",
+                "message": f"CPU sustained 99%+ for 120s - model reload loop detected ({ml_ver})",
+                "timestamp": "2026-03-30T14:22:01Z", "acknowledged": False,
+            },
+            {
+                "id": "B003", "severity": "warning", "service": "api-gateway",
+                "message": "P99 latency 10200ms on /recommendations - upstream ml-inference slow",
+                "timestamp": "2026-03-30T14:22:15Z", "acknowledged": False,
+            },
+        ]
+        state = InternalState(
+            episode_id=str(uuid.uuid4()), task_id="bonus", step=0, max_steps=25,
+            services=services, alerts=alerts, logs=logs,
+            action_history=[], total_reward=0.0, incident_resolved=False,
+            ground_truth_root_cause="disk_full_log_aggregator AND model_reload_loop_ml_inference",
+            ground_truth_fix="alert_oncall for disk cleanup AND rollback ml-inference-service",
+            incident_start_time=INCIDENT_TIME,
+            healthy_services=["product-service"],
+            service_dependencies=DEPENDENCIES,
+        )
+        state._ml_version = ml_ver
+        return state
+    def step(self, state: InternalState, action: Action) -> StepOutput:
+        state.step += 1
+        state._apply_sla_degradation()
+        at = action.action_type
+        svc = action.service or ""
+        reward = 0.0
+        done = False
+        info: Dict[str, Any] = {}
+        result_text, error_text = self._apply_action_to_logs(state, action)
+        gather_map = {
+            ("read_logs", "log-aggregator"):       ("rl_agg", 0.05),
+            ("read_logs", "ml-inference-service"): ("rl_ml", 0.05),
+            ("read_metrics", "log-aggregator"):    ("rm_agg", 0.05),
+            ("read_metrics", "ml-inference-service"): ("rm_ml", 0.05),
+        }
+        k = (at.value, svc)
+        if k in gather_map:
+            tag, r = gather_map[k]
+            if tag not in state.rewards_given:
+                reward += r; state.rewards_given.add(tag)
+        if at == ActionType.READ_RUNBOOK:
+            if "runbook" not in state.rewards_given:
+                reward += 0.04; state.rewards_given.add("runbook")
+        if at == ActionType.DIAGNOSE:
+            rc = action.root_cause or ""
+            has_disk = semantic_match(rc, ["disk", "storage", "full", "space", "log", "aggregat"])
+            has_ml = semantic_match(rc, ["ml", "inference", "model", "reload", "cpu", "loop"])
+            result_text = f"Diagnosis recorded: {rc}"
+            if has_disk and has_ml:
+                if "diagnose_both" not in state.rewards_given:
+                    reward += 0.20; state.rewards_given.add("diagnose_both")
+            elif has_disk or has_ml:
+                if "diagnose_one" not in state.rewards_given:
+                    reward += 0.08; state.rewards_given.add("diagnose_one")
+        # Fix 1: disk issue via oncall
+        if at == ActionType.ALERT_ONCALL:
+            reason = (action.reason or "").lower()
+            if semantic_match(reason, ["disk", "log", "storage", "space", "aggregat"]):
+                if "fix_disk" not in state.rewards_given:
+                    reward += 0.20; state.rewards_given.add("fix_disk")
+                    result_text = "SRE paged for disk cleanup. Volume extension underway (~5 min)."
+                    if "fix_ml" in state.rewards_given:
+                        state.incident_resolved = True; done = True; info["resolution"] = "incident_resolved"
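+            else:
+                if "fix_disk" not in state.rewards_given:
+                    # Credit a vague page only once so repeated pages cannot farm reward
+                    if "oncall_vague" not in state.rewards_given:
+                        reward += 0.08; state.rewards_given.add("oncall_vague")
+                    result_text = "On-call paged. Clarify disk/log issue for faster resolution."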
+        # Fix 2: ML reload loop via rollback or restart
+        if at in (ActionType.ROLLBACK, ActionType.RESTART_SERVICE) and svc == "ml-inference-service":
+            if "fix_ml" not in state.rewards_given:
+                r_base = 0.20 if at == ActionType.ROLLBACK else 0.12
+                reward += r_base; state.rewards_given.add("fix_ml")
+                state.services["ml-inference-service"]["cpu_percent"] = round(self.rng.uniform(22, 38), 1)
+                state.services["ml-inference-service"]["latency_p99_ms"] = round(self.rng.uniform(80, 140), 0)
+                state.services["ml-inference-service"]["error_rate"] = 0.0
+                action_word = "rolled back" if at == ActionType.ROLLBACK else "restarted"
+                result_text = f"ml-inference-service {action_word}. Reload loop stopped. CPU recovering."
+                if "fix_disk" in state.rewards_given:
+                    state.incident_resolved = True; done = True; info["resolution"] = "incident_resolved"
+        if at in (ActionType.RESTART_SERVICE, ActionType.ROLLBACK) and svc in state.healthy_services:
+            reward -= 0.08
+        if at == ActionType.NOOP and state.step > 5:
+            reward -= 0.03
+        state.total_reward = self._clamp(state.total_reward + reward)
+        if state.step >= state.max_steps and not done:
+            done = True; info["reason"] = "max_steps_reached"
+        obs = state._build_observation(last_action_result=result_text, last_action_error=error_text)
+        state.action_history.append({"step": state.step, "action": action.model_dump(), "reward": round(reward, 4)})
+        return StepOutput(next_state=state, reward=round(reward, 4), done=done, info=info)
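
Review note: `BonusTask.step` relies on two ideas: each reward tag is credited at most once, and the incident resolves only when both `fix_disk` and `fix_ml` have been credited. A minimal sketch of that pattern follows, again assuming `rewards_given` is a plain set of strings. The helper `credit_fix` is hypothetical and not part of the diff; unlike the original, it re-checks resolution on every call rather than only when a new fix is credited.

```python
from typing import Set, Tuple

# Both of these tags must be present before the bonus incident counts as resolved.
REQUIRED_FIXES = {"fix_disk", "fix_ml"}


def credit_fix(rewards_given: Set[str], fix_tag: str, amount: float) -> Tuple[float, bool]:
    """Credit a fix once and report whether both fixes are now in place."""
    reward = 0.0
    if fix_tag not in rewards_given:
        reward = amount
        rewards_given.add(fix_tag)
    # Subset test: resolved only when every required fix tag has been recorded.
    resolved = REQUIRED_FIXES <= rewards_given
    return reward, resolved
```

Calling `credit_fix` twice with the same tag yields a reward only the first time, and the order in which the two fixes arrive does not matter for resolution.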

tasks/task_easy.py ADDED

	@@ -0,0 +1,240 @@

+from __future__ import annotations
+import uuid
+from typing import Dict, Any, List
+from models import Action, ActionType
+from tasks.base import BaseTask, InternalState, StepOutput, semantic_match
+INCIDENT_TIME = "2026-03-30T10:14:47Z"
+SCENARIOS = [
+    {
+        "failing_service": "payment-service",
+        "root_cause": "memory_leak_payment_service",
+        "fix": "restart payment-service",
+        "alert_msg": "payment-service pod restarting (OOMKilled)",
+        "language": "java",
+        "diagnosis_keywords": ["memory", "oom", "heap", "leak", "outofmemory", "kill"],
+    },
+    {
+        "failing_service": "order-service",
+        "root_cause": "memory_leak_order_service",
+        "fix": "restart order-service",
+        "alert_msg": "order-service pod restarting (OOMKilled)",
+        "language": "python",
+        "diagnosis_keywords": ["memory", "oom", "heap", "leak", "segfault", "kill", "allocat"],
+    },
+    {
+        "failing_service": "user-service",
+        "root_cause": "memory_leak_user_service",
+        "fix": "restart user-service",
+        "alert_msg": "user-service pod restarting (OOMKilled)",
+        "language": "node",
+        "diagnosis_keywords": ["memory", "heap", "oom", "leak", "javascript", "kill"],
+    },
+]
+ALL_SERVICES = ["payment-service", "order-service", "user-service", "api-gateway"]
+VERSIONS = {
+    "payment-service": "v4.2.1", "order-service": "v1.8.2",
+    "user-service": "v3.0.5", "api-gateway": "v2.1.0",
+}
+DEPENDENCIES = [
+    {"service": "api-gateway", "calls": ["payment-service", "order-service", "user-service"], "called_by": []},
+    {"service": "payment-service", "calls": [], "called_by": ["api-gateway"]},
+    {"service": "order-service", "calls": [], "called_by": ["api-gateway"]},
+    {"service": "user-service", "calls": [], "called_by": ["api-gateway"]},
+]
+def _make_logs(scenario, heap1, heap2, restart_count):
+    svc = scenario["failing_service"]
+    lang = scenario["language"]
+    if lang == "java":
+        failing = [
+            "[10:13:55] INFO  Request processed 200 38ms",
+            f"[10:14:35] WARN  Heap usage at {heap1}% - approaching threshold",
+            f"[10:14:41] WARN  Heap usage at {heap2}%",
+            "[10:14:45] WARN  GC overhead limit exceeded - major GC running",
+            "[10:14:47] ERROR java.lang.OutOfMemoryError: Java heap space",
+            "[10:14:47] ERROR   at com.payments.ChargeProcessor.process(ChargeProcessor.java:142)",
+            f"[10:14:48] FATAL Service entering crash loop - pod restart #{restart_count}",
+        ]
+    elif lang == "python":
+        failing = [
+            "[10:13:55] INFO  POST /orders 200 55ms",
+            f"[10:14:35] WARN  RSS memory {heap1}% of pod limit",
+            f"[10:14:41] WARN  RSS memory {heap2}% of pod limit - approaching OOM",
+            "[10:14:46] ERROR Memory allocator: no more pages available",
+            "[10:14:47] ERROR Fatal Python error: Segmentation fault (memory allocator exhausted)",
+            f"[10:14:48] FATAL Pod killed by OOM killer - restart #{restart_count}",
+        ]
+    else:
+        failing = [
+            "[10:13:55] INFO  GET /users/profile 200 9ms",
+            f"[10:14:35] WARN  Heap used: {heap1}% ({heap1 * 2}MB / 200MB)",
+            f"[10:14:41] WARN  Heap used: {heap2}% - GC pressure increasing",
+            "[10:14:47] ERROR FATAL ERROR: Reached heap limit - JavaScript heap out of memory",
+            f"[10:14:48] FATAL Container OOMKilled - restart #{restart_count}",
+        ]
+    logs = {svc: failing}
+    for name in ALL_SERVICES:
+        if name == svc: continue
+        if name == "api-gateway":
+            logs[name] = [
+                "[10:14:30] INFO  GET /api/v1/health 200 3ms",
+                f"[10:14:48] WARN  Upstream {svc} returned 503",
+                f"[10:14:49] WARN  Circuit breaker OPEN for {svc}",
+            ]
+        else:
+            logs[name] = ["[10:14:30] INFO  Service healthy - 0 errors"]
+    return logs
+class EasyTask(BaseTask):
+    def initialize(self) -> InternalState:
+        scenario = SCENARIOS[self.rng.randint(0, len(SCENARIOS) - 1)]
+        failing = scenario["failing_service"]
+        heap1 = self.rng.randint(74, 83)
+        heap2 = heap1 + self.rng.randint(5, 10)
+        restart_count = self.rng.randint(2, 6)
+        services: Dict[str, dict] = {}
+        for name in ALL_SERVICES:
+            if name == failing:
+                services[name] = {
+                    "name": name, "status": "down",
+                    "cpu_percent": round(self.rng.uniform(5, 20), 1),
+                    "memory_percent": round(self.rng.uniform(93, 99), 1),
+                    "error_rate": round(self.rng.uniform(8.0, 15.0), 2),
+                    "latency_p99_ms": round(self.rng.uniform(5000, 9000), 0),
+                    "replicas_running": 0, "replicas_desired": 3,
+                    "current_version": VERSIONS[name],
+                    "last_deployed": "2026-03-28T14:00:00Z",
+                    "minutes_degraded": 0, "sla_breach": False,
+                }
+            elif name == "api-gateway":
+                services[name] = {
+                    "name": name, "status": "degraded",
+                    "cpu_percent": round(self.rng.uniform(35, 55), 1),
+                    "memory_percent": round(self.rng.uniform(40, 55), 1),
+                    "error_rate": round(self.rng.uniform(2.0, 5.0), 2),
+                    "latency_p99_ms": round(self.rng.uniform(800, 1500), 0),
+                    "replicas_running": 2, "replicas_desired": 2,
+                    "current_version": VERSIONS[name],
+                    "last_deployed": "2026-03-25T09:00:00Z",
+                    "minutes_degraded": 0, "sla_breach": False,
+                }
+            else:
+                services[name] = {
+                    "name": name, "status": "healthy",
+                    "cpu_percent": round(self.rng.uniform(20, 40), 1),
+                    "memory_percent": round(self.rng.uniform(30, 48), 1),
+                    "error_rate": 0.0,
+                    "latency_p99_ms": round(self.rng.uniform(8, 30), 0),
+                    "replicas_running": 2, "replicas_desired": 2,
+                    "current_version": VERSIONS[name],
+                    "last_deployed": "2026-03-20T11:00:00Z",
+                    "minutes_degraded": 0, "sla_breach": False,
+                }
+        alerts = [
+            {
+                "id": "A001", "severity": "critical", "service": failing,
+                "message": f"{scenario['alert_msg']} - {restart_count} times in 5 minutes",
+                "timestamp": "2026-03-30T10:14:48Z", "acknowledged": False,
+            },
+            {
+                "id": "A002", "severity": "warning", "service": "api-gateway",
+                "message": f"Upstream {failing} returning 503 - circuit breaker open",
+                "timestamp": "2026-03-30T10:14:52Z", "acknowledged": False,
+            },
+        ]
+        state = InternalState(
+            episode_id=str(uuid.uuid4()), task_id="easy", step=0, max_steps=15,
+            services=services, alerts=alerts,
+            logs=_make_logs(scenario, heap1, heap2, restart_count),
+            action_history=[], total_reward=0.0, incident_resolved=False,
+            ground_truth_root_cause=scenario["root_cause"],
+            ground_truth_fix=scenario["fix"],
+            incident_start_time=INCIDENT_TIME,
+            healthy_services=[s for s in ALL_SERVICES if s != failing],
+            service_dependencies=DEPENDENCIES,
+        )
+        state._scenario = scenario
+        return state
+    def step(self, state: InternalState, action: Action) -> StepOutput:
+        state.step += 1
+        state._apply_sla_degradation()
+        at = action.action_type
+        svc = action.service or ""
+        scenario = state._scenario
+        failing = scenario["failing_service"]
+        keywords = scenario["diagnosis_keywords"]
+        reward = 0.0
+        done = False
+        info: Dict[str, Any] = {}
+        result_text, error_text = self._apply_action_to_logs(state, action)
+        if at == ActionType.READ_LOGS and svc == failing:
+            if "read_logs" not in state.rewards_given:
+                reward += 0.15
+                state.rewards_given.add("read_logs")
+        if at == ActionType.READ_METRICS and svc == failing:
+            if "read_metrics" not in state.rewards_given:
+                reward += 0.10
+                state.rewards_given.add("read_metrics")
+        if at == ActionType.READ_RUNBOOK:
+            if "runbook" not in state.rewards_given:
+                reward += 0.05
+                state.rewards_given.add("runbook")
+        if at == ActionType.DIAGNOSE:
+            rc = action.root_cause or ""
+            correct_type = semantic_match(rc, keywords, threshold=1)
+            correct_svc = semantic_match(rc, [failing, failing.split("-")[0]])
+            result_text = f"Diagnosis recorded: {rc}"
+            if correct_type and correct_svc:
+                if "diagnose_correct" not in state.rewards_given:
+                    # Full 0.30, or a 0.15 top-up if partial credit was already given
+                    bonus = 0.30 if "diagnose_partial" not in state.rewards_given else 0.15
+                    reward += bonus
+                    state.rewards_given.add("diagnose_correct")
+            elif correct_type:
+                if "diagnose_partial" not in state.rewards_given and "diagnose_correct" not in state.rewards_given:
+                    reward += 0.15
+                    state.rewards_given.add("diagnose_partial")
+        if at == ActionType.RESTART_SERVICE:
+            blind_penalty = self._penalty_blind_remediation(state, action, "restarted")
+            reward += blind_penalty
+            if svc == failing:
+                reward += 0.40
+                state.services[svc]["status"] = "healthy"
+                state.services[svc]["memory_percent"] = round(self.rng.uniform(38, 48), 1)
+                state.services[svc]["error_rate"] = 0.0
+                state.services[svc]["latency_p99_ms"] = round(self.rng.uniform(20, 60), 0)
+                state.services[svc]["replicas_running"] = state.services[svc]["replicas_desired"]
+                state.alerts = [a for a in state.alerts if a["id"] != "A001"]
+                state.incident_resolved = True
+                result_text = f"{svc} restarted. Memory cleared. All pods healthy."
+                done = True
+                info["resolution"] = "incident_resolved"
+            elif svc in state.healthy_services:
+                reward -= 0.10
+                error_text = f"Collateral damage: {svc} was healthy. Unnecessary restart."
+        if at == ActionType.NOOP and state.step > 3:
+            reward -= 0.04
+        state.total_reward = self._clamp(state.total_reward + reward)
+        if state.step >= state.max_steps and not done:
+            done = True
+            info["reason"] = "max_steps_reached"
+        obs = state._build_observation(last_action_result=result_text, last_action_error=error_text)
+        state.action_history.append({"step": state.step, "action": action.model_dump(), "reward": round(reward, 4)})
+        return StepOutput(next_state=state, reward=round(reward, 4), done=done, info=info)
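
Review note: the reward values in `EasyTask.step` appear to add up to the clamp ceiling of 1.0 on the ideal path (read logs, read metrics, read a runbook, diagnose correctly, restart the failing service). The sketch below simply sums the constants visible in the diff. The dictionary `EASY_REWARDS` and the function `best_case_total` are hypothetical names used only for this check.

```python
import math

# Positive reward components copied from EasyTask.step in the diff above.
EASY_REWARDS = {
    "read_logs": 0.15,
    "read_metrics": 0.10,
    "runbook": 0.05,
    "diagnose_correct": 0.30,  # or 0.15 partial plus a 0.15 top-up
    "restart_failing_service": 0.40,
}


def best_case_total(rewards: dict) -> float:
    """Sum every positive component, ignoring penalties."""
    return sum(rewards.values())


total = best_case_total(EASY_REWARDS)
# Compare with a tolerance because the values are binary floats.
reaches_ceiling = math.isclose(total, 1.0)
```

Penalties (blind restart, collateral restart, late `noop`) only subtract from this budget, so under these assumptions an episode total of 1.0 requires the full evidence-then-diagnose-then-fix path.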

tasks/task_hard.py ADDED

	@@ -0,0 +1,224 @@

+from __future__ import annotations
+import uuid
+from typing import Dict, Any, List
+from models import Action, ActionType
+from tasks.base import BaseTask, InternalState, StepOutput, semantic_match
+INCIDENT_TIME = "2026-03-30T11:02:00Z"
+DEPENDENCIES = [
+    {"service": "api-gateway",             "calls": ["order-service", "product-catalog-service"], "called_by": []},
+    {"service": "order-service",           "calls": ["product-catalog-service"],                  "called_by": ["api-gateway"]},
+    {"service": "data-pipeline-service",   "calls": ["product-catalog-service"],                  "called_by": []},
+    {"service": "product-catalog-service", "calls": [],                                            "called_by": ["api-gateway", "order-service", "data-pipeline-service"]},
+    {"service": "price-validation-service","calls": ["product-catalog-service"],                   "called_by": []},
+    {"service": "analytics-service",       "calls": ["order-service"],                             "called_by": []},
+]
+PIPELINE_LOGS = [
+    "[11:01:55] INFO  Deployment data-pipeline-service:{version} complete",
+    "[11:01:58] INFO  Health check passed. Starting pipeline workers.",
+    "[11:02:00] INFO  Pipeline worker started. Consuming from topic: product-updates",
+    "[11:02:01] INFO  Processed batch: 142 records written to product-catalog",
+    "[11:02:03] INFO  Processed batch: 138 records written to product-catalog",
+    "[11:02:07] INFO  Processed batch: 147 records written to product-catalog",
+    "[11:02:09] INFO  All writes succeeded (HTTP 200) - no errors detected",
+]
+PRICE_VALIDATION_LOGS = [
+    "[11:02:08] INFO  Validation batch started: 312 products",
+    "[11:02:10] WARN  PRICE_MISMATCH: product_id=1042 catalog=149.99 expected=14.99 (10x multiplier?)",
+    "[11:02:11] WARN  PRICE_MISMATCH: product_id=2891 catalog=899.00 expected=89.00",
+    "[11:02:13] WARN  PRICE_MISMATCH: product_id=0391 catalog=24.90 expected=2.49",
+    "[11:02:14] WARN  PRICE_MISMATCH: product_id=5521 catalog=1299.90 expected=129.99",
+    "[11:02:17] WARN  PRICE_MISMATCH: product_id=7823 catalog=49.90 expected=4.99",
+    "[11:02:21] WARN  PRICE_MISMATCH: product_id=3314 catalog=799.00 expected=79.90",
+    "[11:02:24] INFO  Validation batch complete: 265 ok, 47 mismatches (15.1% rate, baseline: 0.2%)",
+    "[11:02:24] WARN  Mismatch rate 15.1% exceeds SLA threshold 1.0% - notifying data team",
+]
+ANALYTICS_LOGS = [
+    "[11:01:50] INFO  Hourly report: avg_order_value=$89.42 orders=138 (normal)",
+    "[11:02:00] INFO  Hourly report: avg_order_value=$91.18 orders=141",
+    "[11:02:10] INFO  ANOMALY: avg_order_value=$312.44 (3.5x baseline) in last 2 min",
+    "[11:02:20] WARN  avg_order_value=$847.23 - possible pricing issue",
+    "[11:02:21] INFO  orders_per_minute=142 (normal: 120-160) - volume is normal",
+    "[11:02:21] INFO  Spike NOT correlated with marketing campaign or known event",
+]
+CATALOG_LOGS = [
+    "[11:02:01] INFO  PUT /catalog/product/1042 200 8ms price=149.99",
+    "[11:02:02] INFO  PUT /catalog/product/2891 200 7ms price=899.00",
+    "[11:02:03] INFO  PUT /catalog/product/0391 200 6ms price=24.90",
+    "[11:02:04] INFO  PUT /catalog/product/5521 200 8ms price=1299.90",
+    "[11:02:05] INFO  All writes returning 200 OK - no DB errors",
+]
+GATEWAY_LOGS = [
+    "[11:02:00] INFO  GET /api/v1/products 200 12ms",
+    "[11:02:05] INFO  POST /api/v1/orders 200 88ms",
+    "[11:02:15] INFO  POST /api/v1/orders 200 91ms",
+    "[11:02:20] INFO  POST /api/v1/orders 200 87ms",
+]
+ORDER_LOGS = [
+    "[11:02:05] INFO  Order ORD-9901: total=$149.99 (product_id=1042)",
+    "[11:02:08] INFO  Order ORD-9902: total=$899.00 (product_id=2891)",
+    "[11:02:12] INFO  Order ORD-9903: total=$1299.90 (product_id=5521)",
+]
+# Extra noise alerts that don't point to the real issue
+NOISE_ALERTS = [
+    {
+        "id": "A030", "severity": "info", "service": "api-gateway",
+        "message": "TLS certificate renewing in 14 days - scheduled maintenance upcoming",
+        "timestamp": "2026-03-30T11:00:00Z", "acknowledged": False,
+    },
+    {
+        "id": "A031", "severity": "info", "service": "analytics-service",
+        "message": "Nightly aggregation job starting 5 minutes early due to backlog",
+        "timestamp": "2026-03-30T11:01:45Z", "acknowledged": False,
+    },
+    {
+        "id": "A032", "severity": "info", "service": "product-catalog-service",
+        "message": "Read replica lag 280ms (threshold: 500ms) - within normal range",
+        "timestamp": "2026-03-30T11:02:00Z", "acknowledged": False,
+    },
+]
+class HardTask(BaseTask):
+    def initialize(self) -> InternalState:
+        bad_ver = f"v3.1.{self.rng.randint(0, 4)}"
+        logs = {
+            "data-pipeline-service": [l.replace("{version}", bad_ver) for l in PIPELINE_LOGS],
+            "price-validation-service": PRICE_VALIDATION_LOGS[:],
+            "analytics-service": ANALYTICS_LOGS[:],
+            "product-catalog-service": CATALOG_LOGS[:],
+            "api-gateway": GATEWAY_LOGS[:],
+            "order-service": ORDER_LOGS[:],
+        }
+        def healthy_svc(name, ver, deployed):
+            return {
+                "name": name, "status": "healthy",
+                "cpu_percent": round(self.rng.uniform(22, 48), 1),
+                "memory_percent": round(self.rng.uniform(35, 55), 1),
+                "error_rate": 0.0,
+                "latency_p99_ms": round(self.rng.uniform(8, 130), 0),
+                "replicas_running": self.rng.choice([2, 3]),
+                "replicas_desired": self.rng.choice([2, 3]),
+                "current_version": ver, "last_deployed": deployed,
+                "minutes_degraded": 0, "sla_breach": False,
+            }
+        services = {
+            "api-gateway":             {**healthy_svc("api-gateway",             "v3.1.0", "2026-03-20T08:00:00Z"), "replicas_running": 2, "replicas_desired": 2},
+            "data-pipeline-service":   {**healthy_svc("data-pipeline-service",   bad_ver,  "2026-03-30T11:01:55Z"), "replicas_running": 3, "replicas_desired": 3},
+            "product-catalog-service": {**healthy_svc("product-catalog-service", "v2.0.1", "2026-03-10T12:00:00Z"), "replicas_running": 2, "replicas_desired": 2},
+            "price-validation-service":{**healthy_svc("price-validation-service","v1.4.0", "2026-03-12T14:00:00Z"), "replicas_running": 2, "replicas_desired": 2},
+            "analytics-service":       {**healthy_svc("analytics-service",       "v2.3.1", "2026-03-14T10:00:00Z"), "replicas_running": 2, "replicas_desired": 2},
+            "order-service":           {**healthy_svc("order-service",           "v1.8.2", "2026-03-22T10:00:00Z"), "replicas_running": 3, "replicas_desired": 3},
+        }
+        # Real signal alerts + noise
+        alerts = NOISE_ALERTS[:] + [
+            {
+                "id": "A020", "severity": "info", "service": "price-validation-service",
+                "message": "Price mismatch rate 15.1% — above SLA threshold of 1.0%. Data team notified.",
+                "timestamp": "2026-03-30T11:02:24Z", "acknowledged": False,
+            },
+            {
+                "id": "A021", "severity": "warning", "service": "analytics-service",
+                "message": "avg_order_value anomaly: $847.23 vs baseline $89.42 — not correlated with campaigns",
+                "timestamp": "2026-03-30T11:02:21Z", "acknowledged": False,
+            },
+        ]
+        state = InternalState(
+            episode_id=str(uuid.uuid4()), task_id="hard", step=0, max_steps=25,
+            services=services, alerts=alerts, logs=logs,
+            action_history=[], total_reward=0.0, incident_resolved=False,
+            ground_truth_root_cause=f"data_corruption_data_pipeline_{bad_ver}_incorrect_price_writes",
+            ground_truth_fix="rollback data-pipeline-service then alert_oncall for data audit",
+            incident_start_time=INCIDENT_TIME,
+            healthy_services=list(services.keys()),
+            service_dependencies=DEPENDENCIES,
+        )
+        state._bad_ver = bad_ver
+        return state
+    def step(self, state: InternalState, action: Action) -> StepOutput:
+        state.step += 1
+        # No SLA degradation on hard task — all services stay green
+        at = action.action_type
+        svc = action.service or ""
+        reward = 0.0
+        done = False
+        info: Dict[str, Any] = {}
+        result_text, error_text = self._apply_action_to_logs(state, action)
+        gather_map = {
+            ("read_logs", "price-validation-service"): ("rl_price", 0.05),
+            ("read_logs", "analytics-service"):         ("rl_analytics", 0.05),
+            ("read_logs", "data-pipeline-service"):     ("rl_pipeline", 0.05),
+            ("read_metrics", "analytics-service"):      ("rm_analytics", 0.10),
+            ("read_metrics", "data-pipeline-service"):  ("rm_pipeline", 0.10),
+        }
+        k = (at.value, svc)
+        if k in gather_map:
+            tag, r = gather_map[k]
+            if tag not in state.rewards_given:
+                reward += r; state.rewards_given.add(tag)
+        if at == ActionType.READ_RUNBOOK:
+            if "runbook" not in state.rewards_given:
+                reward += 0.05; state.rewards_given.add("runbook")
+        # Restarts/scale-ups are always wrong here
+        if at in (ActionType.RESTART_SERVICE, ActionType.SCALE_UP):
+            reward -= 0.15
+            error_text = (
+                f"Restarting/scaling {svc} will not fix corrupt data already written. "
+                "You need to rollback the pipeline and audit the data."
+            )
+        if at == ActionType.DIAGNOSE:
+            rc = action.root_cause or ""
+            has_pipeline = semantic_match(rc, ["pipeline", "data-pipeline"])
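+            # "data" is left out of the corruption keywords: it also occurs in "data-pipeline",
+            # so naming the service alone would otherwise count as a full diagnosis.
+            has_corruption = semantic_match(rc, ["corrupt", "price", "wrong", "incorrect", "mismatch"])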
+            result_text = f"Diagnosis recorded: {rc}"
+            if has_pipeline and has_corruption:
+                if "diagnose_correct" not in state.rewards_given:
+                    reward += 0.20; state.rewards_given.add("diagnose_correct")
+            elif has_pipeline or has_corruption:
+                if "diagnose_partial" not in state.rewards_given and "diagnose_correct" not in state.rewards_given:
+                    reward += 0.08; state.rewards_given.add("diagnose_partial")
+        if at == ActionType.ROLLBACK and svc == "data-pipeline-service":
+            reward += self._penalty_blind_remediation(state, action, "rollback_done")
+            if "rollback_done" not in state.rewards_given:
+                reward += 0.25; state.rewards_given.add("rollback_done")
+                state.services["data-pipeline-service"]["current_version"] = "v3.0.9"
+                result_text = (
+                    "data-pipeline-service rolled back to v3.0.9. Future writes corrected. "
+                    "WARNING: corrupted prices already written must be audited."
+                )
+                if "alert_oncall_done" in state.rewards_given:
+                    state.incident_resolved = True; done = True; info["resolution"] = "incident_resolved"
+        if at == ActionType.ALERT_ONCALL:
+            if "alert_oncall_done" not in state.rewards_given:
+                reward += 0.15; state.rewards_given.add("alert_oncall_done")
+                result_text = "On-call data team paged for price audit and correction job."
+                if "rollback_done" in state.rewards_given:
+                    state.incident_resolved = True; done = True; info["resolution"] = "incident_resolved"
+        state.total_reward = self._clamp(state.total_reward + reward)
+        if state.step >= state.max_steps and not done:
+            done = True; info["reason"] = "max_steps_reached"
+        obs = state._build_observation(last_action_result=result_text, last_action_error=error_text)
+        state.action_history.append({"step": state.step, "action": action.model_dump(), "reward": round(reward, 4)})
+        return StepOutput(next_state=state, reward=round(reward, 4), done=done, info=info)
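
The hard task's step() above pays each reward at most once by recording a tag in state.rewards_given, and it resolves the incident only when both the rollback and the on-call alert have happened, in either order. A minimal sketch of that gating follows. The function name step_reward, the string action names, and the trimmed GATHER_REWARDS table are hypothetical stand-ins for illustration. The numeric values are copied from the diff, and the blind-remediation penalty is omitted because _penalty_blind_remediation is not shown here.

```python
from typing import Set, Tuple

# Hypothetical, trimmed stand-in for gather_map in the hard task's step().
GATHER_REWARDS = {
    ("read_logs", "data-pipeline-service"): ("rl_pipeline", 0.05),
    ("read_metrics", "data-pipeline-service"): ("rm_pipeline", 0.10),
}


def step_reward(action: str, service: str, given: Set[str]) -> Tuple[float, bool]:
    """Return (reward, done) for one action; `given` records tags already paid."""
    reward = 0.0
    done = False

    key = (action, service)
    if key in GATHER_REWARDS:
        tag, value = GATHER_REWARDS[key]
        if tag not in given:  # each gathering reward is paid at most once
            reward += value
            given.add(tag)

    if action == "rollback" and service == "data-pipeline-service":
        if "rollback_done" not in given:
            reward += 0.25
            given.add("rollback_done")
            # Resolved only if the on-call alert was already sent.
            done = "alert_oncall_done" in given

    if action == "alert_oncall":
        if "alert_oncall_done" not in given:
            reward += 0.15
            given.add("alert_oncall_done")
            # Resolved only if the rollback already happened.
            done = "rollback_done" in given

    return reward, done


if __name__ == "__main__":
    given: Set[str] = set()
    print(step_reward("read_logs", "data-pipeline-service", given))  # (0.05, False)
    print(step_reward("read_logs", "data-pipeline-service", given))  # (0.0, False)
    print(step_reward("alert_oncall", "", given))                    # (0.15, False)
    print(step_reward("rollback", "data-pipeline-service", given))   # (0.25, True)
```

Keeping the paid tags in a set makes repeated actions harmless to the total: an agent cannot farm reward by re-reading the same logs, and the two-part fix can complete in either order.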

tasks/task_medium.py ADDED

	@@ -0,0 +1,276 @@

+from __future__ import annotations
+import uuid
+from typing import Dict, Any
+from models import Action, ActionType
+from tasks.base import BaseTask, InternalState, StepOutput, semantic_match
+INCIDENT_TIME = "2026-03-30T10:32:01Z"
+DEPENDENCIES = [
+    {"service": "api-gateway",          "calls": ["order-service", "user-service"],          "called_by": []},
+    {"service": "order-service",        "calls": ["inventory-service"],                       "called_by": ["api-gateway"]},
+    {"service": "inventory-service",    "calls": ["db-primary"],                              "called_by": ["order-service"]},
+    {"service": "notification-service", "calls": [],                                          "called_by": []},
+    {"service": "user-service",         "calls": [],                                          "called_by": ["api-gateway"]},
+]
+# Cascading scenarios: 2 failure modes, both rooted in inventory-service
+SCENARIOS = [
+    {
+        "root_service": "inventory-service",
+        "root_cause_template": "connection_pool_exhaustion_{service}_{version}",
+        "fix_template": "rollback {service}",
+        "error_type": "connection_pool",
+        "diagnosis_keywords": ["connection", "pool", "hikari", "db", "database", "exhaustion", "inventory"],
+        "fix_action": ActionType.ROLLBACK,
+    },
+    {
+        "root_service": "inventory-service",
+        "root_cause_template": "null_pointer_exception_{service}_{version}",
+        "fix_template": "rollback {service}",
+        "error_type": "null_pointer",
+        "diagnosis_keywords": ["null", "nullpointer", "npe", "exception", "inventory", "bug", "crash"],
+        "fix_action": ActionType.ROLLBACK,
+    },
+]
+INV_LOGS_CONNECTION = [
+    "[10:31:58] INFO  Deployment inventory-service:{version} complete - 3 pods running",
+    "[10:32:01] INFO  Health check passed for inventory-service:{version}",
+    "[10:32:38] ERROR Failed to acquire connection from pool: timeout after 30000ms",
+    "[10:32:39] ERROR HikariPool-1 - Connection is not available, request timed out",
+    "[10:32:40] ERROR Connection pool exhausted (max=10, active=10, waiting=47)",
+    "[10:32:42] WARN  Retry attempt 1/3 failed for getInventory(productId=1982)",
+    "[10:32:46] WARN  Retry attempt 3/3 failed - returning error upstream",
+    "[10:32:48] ERROR Thread pool saturation: 98/100 threads active, queue depth 412",
+]
+INV_LOGS_NPE = [
+    "[10:31:58] INFO  Deployment inventory-service:{version} complete",
+    "[10:32:01] INFO  Health check passed for inventory-service:{version}",
+    "[10:32:35] ERROR NullPointerException: Cannot invoke method getStock() on null object",
+    "[10:32:35] ERROR   at InventoryService.checkAvailability(InventoryService.java:218)",
+    "[10:32:36] ERROR   at InventoryController.getInventory(InventoryController.java:87)",
+    "[10:32:37] WARN  Exception rate 38/min - circuit breaker threshold approaching",
+    "[10:32:42] ERROR Circuit breaker OPEN - too many NullPointerExceptions",
+    "[10:32:45] ERROR getInventory returning 500 for all requests",
+]
+ORDER_LOGS = [
+    "[10:32:30] INFO  Order created: order_id=ORD-8821 status=confirmed",
+    "[10:32:45] WARN  inventory-service call timed out after 5000ms",
+    "[10:32:49] ERROR Order creation failed: upstream dependency unavailable",
+    "[10:32:50] ERROR Circuit breaker OPEN for inventory-service endpoint",
+    "[10:32:51] WARN  Falling back to cached inventory data (may be stale)",
+]
+GATEWAY_LOGS = [
+    "[10:32:20] INFO  POST /api/v1/orders 200 142ms",
+    "[10:32:50] WARN  POST /api/v1/orders upstream latency 5800ms",
+    "[10:32:55] ERROR POST /api/v1/orders 503 Service Unavailable",
+    "[10:32:56] WARN  Error rate for /api/v1/orders: 18% (threshold: 5%)",
+]
+NOTIF_LOGS = [
+    "[10:30:00] INFO  Batch email job started: 48000 recipients",
+    "[10:31:30] INFO  Sent 24000/48000 emails",
+    "[10:33:00] INFO  Batch email job complete: 48000 sent, 0 failed",
+]
+USER_LOGS = ["[10:32:00] INFO  GET /users/profile 200 9ms",
+             "[10:33:00] INFO  GET /users/profile 200 10ms"]
+class MediumTask(BaseTask):
+    def initialize(self) -> InternalState:
+        scenario = SCENARIOS[self.rng.randint(0, len(SCENARIOS) - 1)]
+        bad_ver = f"v2.3.{self.rng.randint(1, 5)}"
+        root_svc = scenario["root_service"]
+        if scenario["error_type"] == "connection_pool":
+            inv_logs = [l.replace("{version}", bad_ver) for l in INV_LOGS_CONNECTION]
+        else:
+            inv_logs = [l.replace("{version}", bad_ver) for l in INV_LOGS_NPE]
+        logs = {
+            "inventory-service": inv_logs,
+            "order-service": ORDER_LOGS[:],
+            "api-gateway": GATEWAY_LOGS[:],
+            "notification-service": NOTIF_LOGS[:],
+            "user-service": USER_LOGS[:],
+        }
+        services = {
+            "api-gateway": {
+                "name": "api-gateway", "status": "degraded",
+                "cpu_percent": round(self.rng.uniform(55, 70), 1),
+                "memory_percent": round(self.rng.uniform(48, 60), 1),
+                "error_rate": round(self.rng.uniform(3.5, 6.0), 2),
+                "latency_p99_ms": round(self.rng.uniform(4500, 6500), 0),
+                "replicas_running": 2, "replicas_desired": 2,
+                "current_version": "v3.1.0", "last_deployed": "2026-03-20T08:00:00Z",
+                "minutes_degraded": 0, "sla_breach": False,
+            },
+            "order-service": {
+                "name": "order-service", "status": "degraded",
+                "cpu_percent": round(self.rng.uniform(60, 75), 1),
+                "memory_percent": round(self.rng.uniform(55, 68), 1),
+                "error_rate": round(self.rng.uniform(4.0, 8.0), 2),
+                "latency_p99_ms": round(self.rng.uniform(5000, 7000), 0),
+                "replicas_running": 3, "replicas_desired": 3,
+                "current_version": "v1.8.2", "last_deployed": "2026-03-22T10:00:00Z",
+                "minutes_degraded": 0, "sla_breach": False,
+            },
+            "inventory-service": {
+                "name": "inventory-service", "status": "degraded",
+                "cpu_percent": round(self.rng.uniform(80, 95), 1),
+                "memory_percent": round(self.rng.uniform(70, 85), 1),
+                "error_rate": round(self.rng.uniform(12.0, 20.0), 2),
+                "latency_p99_ms": round(self.rng.uniform(28000, 35000), 0),
+                "replicas_running": 3, "replicas_desired": 3,
+                "current_version": bad_ver, "last_deployed": "2026-03-30T10:31:58Z",
+                "minutes_degraded": 0, "sla_breach": False,
+            },
+            "notification-service": {
+                "name": "notification-service", "status": "healthy",
+                "cpu_percent": round(self.rng.uniform(82, 92), 1),
+                "memory_percent": round(self.rng.uniform(55, 65), 1),
+                "error_rate": 0.0,
+                "latency_p99_ms": round(self.rng.uniform(20, 45), 0),
+                "replicas_running": 2, "replicas_desired": 2,
+                "current_version": "v1.2.0", "last_deployed": "2026-03-15T16:00:00Z",
+                "minutes_degraded": 0, "sla_breach": False,
+            },
+            "user-service": {
+                "name": "user-service", "status": "healthy",
+                "cpu_percent": round(self.rng.uniform(20, 35), 1),
+                "memory_percent": round(self.rng.uniform(30, 42), 1),
+                "error_rate": 0.0,
+                "latency_p99_ms": round(self.rng.uniform(8, 20), 0),
+                "replicas_running": 2, "replicas_desired": 2,
+                "current_version": "v3.0.5", "last_deployed": "2026-03-18T09:00:00Z",
+                "minutes_degraded": 0, "sla_breach": False,
+            },
+        }
+        alerts = [
+            {
+                "id": "A010", "severity": "critical", "service": "api-gateway",
+                "message": "Error rate on /api/v1/orders exceeded 15% threshold",
+                "timestamp": "2026-03-30T10:32:56Z", "acknowledged": False,
+            },
+            {
+                "id": "A011", "severity": "critical", "service": "order-service",
+                "message": "Order creation failure rate 31% - circuit breaker triggered for inventory-service",
+                "timestamp": "2026-03-30T10:32:51Z", "acknowledged": False,
+            },
+            {
+                "id": "A012", "severity": "warning", "service": "inventory-service",
+                "message": f"P99 latency 32100ms (threshold: 5000ms) - deployed {bad_ver} at 10:31",
+                "timestamp": "2026-03-30T10:32:48Z", "acknowledged": False,
+            },
+            # Red herring
+            {
+                "id": "A013", "severity": "warning", "service": "notification-service",
+                "message": "CPU usage 88% - batch email job running (scheduled, not an incident)",
+                "timestamp": "2026-03-30T10:30:00Z", "acknowledged": False,
+            },
+        ]
+        rc = scenario["root_cause_template"].format(service=root_svc, version=bad_ver)
+        fix = scenario["fix_template"].format(service=root_svc)
+        state = InternalState(
+            episode_id=str(uuid.uuid4()), task_id="medium", step=0, max_steps=20,
+            services=services, alerts=alerts, logs=logs,
+            action_history=[], total_reward=0.0, incident_resolved=False,
+            ground_truth_root_cause=rc, ground_truth_fix=fix,
+            incident_start_time=INCIDENT_TIME,
+            healthy_services=["notification-service", "user-service"],
+            service_dependencies=DEPENDENCIES,
+        )
+        state._scenario = scenario
+        state._bad_ver = bad_ver
+        return state
+    def step(self, state: InternalState, action: Action) -> StepOutput:
+        state.step += 1
+        state._apply_sla_degradation()
+        at = action.action_type
+        svc = action.service or ""
+        scenario = state._scenario
+        keywords = scenario["diagnosis_keywords"]
+        bad_ver = state._bad_ver
+        reward = 0.0
+        done = False
+        info: Dict[str, Any] = {}
+        result_text, error_text = self._apply_action_to_logs(state, action)
+        if at == ActionType.READ_LOGS and svc == "inventory-service":
+            if "read_logs_inv" not in state.rewards_given:
+                reward += 0.10; state.rewards_given.add("read_logs_inv")
+        if at == ActionType.READ_METRICS and svc == "inventory-service":
+            if "read_metrics_inv" not in state.rewards_given:
+                reward += 0.10; state.rewards_given.add("read_metrics_inv")
+        if at == ActionType.READ_METRICS and svc == "order-service":
+            if "read_metrics_ord" not in state.rewards_given:
+                reward += 0.05; state.rewards_given.add("read_metrics_ord")
+        if at == ActionType.READ_RUNBOOK:
+            if "runbook" not in state.rewards_given:
+                reward += 0.05; state.rewards_given.add("runbook")
+        # Red herring penalty
+        if at == ActionType.RESTART_SERVICE and svc == "notification-service":
+            reward -= 0.05
+            error_text = "notification-service was healthy — high CPU is a scheduled batch job, not an incident."
+        # Treating symptom before root cause
+        if at == ActionType.RESTART_SERVICE and svc == "order-service":
+            if "diagnose_correct" not in state.rewards_given:
+                reward -= 0.10
+                error_text = "order-service is a downstream victim. Fix inventory-service first."
+        if at == ActionType.DIAGNOSE:
+            rc = action.root_cause or ""
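+            has_service = semantic_match(rc, ["inventory"])
+            # Exclude the service name from the cause keywords; otherwise naming
+            # "inventory" alone satisfies both checks and earns full diagnosis credit.
+            cause_keywords = [kw for kw in keywords if kw != "inventory"]
+            has_cause = semantic_match(rc, cause_keywords, threshold=1)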
+            result_text = f"Diagnosis recorded: {rc}"
+            if has_service and has_cause:
+                if "diagnose_correct" not in state.rewards_given:
+                    reward += 0.25; state.rewards_given.add("diagnose_correct")
+            elif has_service or has_cause:
+                if "diagnose_partial" not in state.rewards_given and "diagnose_correct" not in state.rewards_given:
+                    reward += 0.10; state.rewards_given.add("diagnose_partial")
+        if at == ActionType.ROLLBACK and svc == "inventory-service":
+            reward += self._penalty_blind_remediation(state, action, "rollback_done")
+            if "rollback_done" not in state.rewards_given:
+                reward += 0.30; state.rewards_given.add("rollback_done")
+                ver = action.version or ""
+                if "v2.3.0" in ver or ver in ("previous", "last"):
+                    reward += 0.10
+                state.services["inventory-service"]["status"] = "healthy"
+                state.services["inventory-service"]["error_rate"] = 0.0
+                state.services["inventory-service"]["latency_p99_ms"] = 85.0
+                state.services["inventory-service"]["current_version"] = "v2.3.0"
+                state.services["order-service"]["status"] = "healthy"
+                state.services["order-service"]["error_rate"] = 0.0
+                state.services["api-gateway"]["status"] = "healthy"
+                state.services["api-gateway"]["error_rate"] = 0.1
+                state.alerts = [a for a in state.alerts if a["id"] not in ("A010", "A011", "A012")]
+                state.incident_resolved = True
+                result_text = "inventory-service rolled back. Downstream services recovering."
+                done = True; info["resolution"] = "incident_resolved"
+        if at in (ActionType.RESTART_SERVICE, ActionType.ROLLBACK) and svc in state.healthy_services:
+            reward -= 0.10
+        if at == ActionType.NOOP and state.step > 4:
+            reward -= 0.03
+        state.total_reward = self._clamp(state.total_reward + reward)
+        if state.step >= state.max_steps and not done:
+            done = True; info["reason"] = "max_steps_reached"
+        obs = state._build_observation(last_action_result=result_text, last_action_error=error_text)
+        state.action_history.append({"step": state.step, "action": action.model_dump(), "reward": round(reward, 4)})
+        return StepOutput(next_state=state, reward=round(reward, 4), done=done, info=info)
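
The medium task grades a diagnosis in two tiers: full credit (0.25) when the text names both the service and the cause, partial credit (0.10) when it names only one. semantic_match is imported from tasks.base and is not shown in this diff, so the sketch below assumes a plain case-insensitive substring matcher instead. keyword_hits and diagnosis_reward are hypothetical names. The sketch also drops the service name from the cause keywords, since the scenario keyword lists contain "inventory" and would otherwise let the service name alone satisfy both checks.

```python
from typing import List

# Keywords copied from the connection-pool scenario in the diff.
CONNECTION_KEYWORDS = ["connection", "pool", "hikari", "db", "database", "exhaustion", "inventory"]


def keyword_hits(text: str, keywords: List[str]) -> int:
    """Count keywords that occur as case-insensitive substrings of text (assumed matcher)."""
    lowered = text.lower()
    return sum(1 for kw in keywords if kw in lowered)


def diagnosis_reward(root_cause: str, scenario_keywords: List[str]) -> float:
    """Two-tier credit: 0.25 for service and cause, 0.10 for either one alone."""
    has_service = keyword_hits(root_cause, ["inventory"]) >= 1
    # Remove the service name so it cannot double as a cause keyword.
    causes = [kw for kw in scenario_keywords if kw != "inventory"]
    has_cause = keyword_hits(root_cause, causes) >= 1
    if has_service and has_cause:
        return 0.25
    if has_service or has_cause:
        return 0.10
    return 0.0


if __name__ == "__main__":
    print(diagnosis_reward("inventory-service connection pool exhausted", CONNECTION_KEYWORDS))  # 0.25
    print(diagnosis_reward("inventory is broken", CONNECTION_KEYWORDS))                          # 0.1
    print(diagnosis_reward("something looks odd", CONNECTION_KEYWORDS))                          # 0.0
```

Substring matching is deliberately loose here; short keywords such as "db" can match inside unrelated words, so a real matcher may want token boundaries.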

validate.py ADDED

	@@ -0,0 +1,303 @@

+#!/usr/bin/env python3
+"""
+validate.py — Pre-submission validation script.
+Run this before submitting to confirm all checklist items pass:
+    python validate.py
+Exit code 0 = all checks passed.
+Exit code 1 = one or more checks failed.
+"""
+import sys
+import os
+import random
+import traceback
+sys.path.insert(0, os.path.dirname(__file__))
+PASS = "\033[92m✓\033[0m"
+FAIL = "\033[91m✗\033[0m"
+WARN = "\033[93m!\033[0m"
+failures = []
+def check(name: str, fn):
+    try:
+        result = fn()
+        if result is True or result is None:
+            print(f"  {PASS}  {name}")
+            return True
+        else:
+            print(f"  {FAIL}  {name}: {result}")
+            failures.append(name)
+            return False
+    except Exception as e:
+        print(f"  {FAIL}  {name}: {e}")
+        traceback.print_exc()
+        failures.append(name)
+        return False
+def main():
+    print("\n=== DevOps Incident Response — OpenEnv Validation ===\n")
+    # --- Imports ---
+    print("[ Imports ]")
+    def check_imports():
+        from env import DevOpsIncidentEnv
+        from models import Action, ActionType, Observation, StepResult, State
+        from graders.grader import grade_episode
+        return True
+    check("All modules import cleanly", check_imports)
+    # --- Reset returns valid Observation ---
+    print("\n[ reset() ]")
+    def check_reset_easy():
+        from env import DevOpsIncidentEnv
+        env = DevOpsIncidentEnv(task_id="easy", seed=42)
+        obs = env.reset()
+        assert obs.step == 0
+        assert len(obs.services) > 0
+        assert len(obs.active_alerts) > 0
+        assert obs.task_id == "easy"
+        return True
+    def check_reset_all_tasks():
+        from env import DevOpsIncidentEnv
+        for task_id in ["easy", "medium", "hard", "bonus"]:
+            env = DevOpsIncidentEnv(task_id=task_id, seed=42)
+            obs = env.reset()
+            assert obs.task_id == task_id, f"task_id mismatch for {task_id}"
+            assert obs.max_steps > 0
+        return True
+    def check_reset_reproducible():
+        from env import DevOpsIncidentEnv
+        results = []
+        for _ in range(3):
+            env = DevOpsIncidentEnv(task_id="easy", seed=42)
+            obs = env.reset()
+            results.append(obs.services[0].memory_percent)
+        assert len(set(results)) == 1, f"Different results for same seed: {results}"
+        return True
+    def check_seed_variety():
+        from env import DevOpsIncidentEnv
+        roots = set()
+        for seed in range(10):
+            env = DevOpsIncidentEnv(task_id="easy", seed=seed)
+            env.reset()
+            s = env.state()
+            roots.add(s.ground_truth_root_cause)
+        assert len(roots) > 1, f"All seeds produce same scenario: {roots}"
+        return True
+    check("reset() returns valid Observation for easy task", check_reset_easy)
+    check("reset() works for all 4 tasks", check_reset_all_tasks)
+    check("Same seed always produces same episode", check_reset_reproducible)
+    check("Different seeds produce different scenarios", check_seed_variety)
+    # --- step() ---
+    print("\n[ step() ]")
+    def check_step_returns_result():
+        from env import DevOpsIncidentEnv
+        from models import Action, ActionType, StepResult
+        env = DevOpsIncidentEnv(task_id="easy", seed=42)
+        env.reset()
+        result = env.step(Action(action_type=ActionType.NOOP))
+        assert isinstance(result, StepResult)
+        assert isinstance(result.reward, float)
+        assert isinstance(result.done, bool)
+        assert result.observation.step == 1
+        return True
+    def check_step_reward_in_range():
+        from env import DevOpsIncidentEnv
+        from models import Action, ActionType
+        rng = random.Random(0)
+        for task_id in ["easy", "medium", "hard", "bonus"]:
+            env = DevOpsIncidentEnv(task_id=task_id, seed=42)
+            env.reset()
+            done = False
+            steps = 0
+            while not done and steps < 30:
+                action = Action(action_type=rng.choice(list(ActionType)))
+                result = env.step(action)
+                assert -1.0 <= result.reward <= 1.0, f"reward={result.reward} out of range"
+                done = result.done
+                steps += 1
+        return True
+    def check_max_steps_terminates():
+        from env import DevOpsIncidentEnv
+        from models import Action, ActionType
+        env = DevOpsIncidentEnv(task_id="easy", seed=42)
+        obs = env.reset()
+        done = False
+        steps = 0
+        while not done:
+            result = env.step(Action(action_type=ActionType.NOOP))
+            done = result.done
+            steps += 1
+            # Bound the loop by the task's own step budget rather than a hard-coded 20
+            assert steps <= obs.max_steps, "Episode never terminated"
+        return True
+    check("step() returns valid StepResult", check_step_returns_result)
+    check("step() rewards always in [-1.0, 1.0]", check_step_reward_in_range)
+    check("Episode terminates at max_steps", check_max_steps_terminates)
+    # --- state() ---
+    print("\n[ state() ]")
+    def check_state_has_ground_truth():
+        from env import DevOpsIncidentEnv
+        from models import Action, ActionType
+        env = DevOpsIncidentEnv(task_id="medium", seed=42)
+        env.reset()
+        env.step(Action(action_type=ActionType.NOOP))
+        s = env.state()
+        assert s.ground_truth_root_cause != ""
+        assert s.ground_truth_fix != ""
+        assert len(s.action_history) == 1
+        return True
+    check("state() returns ground truth and action history", check_state_has_ground_truth)
+    # --- Graders ---
+    print("\n[ Graders ]")
+    def check_graders_in_range():
+        from env import DevOpsIncidentEnv
+        from models import Action, ActionType
+        from graders.grader import grade_episode
+        rng = random.Random(99)
+        for task_id in ["easy", "medium", "hard", "bonus"]:
+            env = DevOpsIncidentEnv(task_id=task_id, seed=42)
+            env.reset()
+            done = False
+            steps = 0
+            while not done and steps < 30:
+                action = Action(action_type=rng.choice(list(ActionType)))
+                result = env.step(action)
+                done = result.done
+                steps += 1
+            s = env.state()
+            score = grade_episode(
+                task_id, s.action_history, s.ground_truth_root_cause,
+                s.ground_truth_fix, s.incident_resolved, s.total_reward,
+            )
+            assert 0.0 <= score <= 1.0, f"{task_id} score={score} out of [0,1]"
+        return True
+    def check_graders_not_constant():
+        from env import DevOpsIncidentEnv
+        from models import Action, ActionType
+        from graders.grader import grade_episode
+        scores = []
+        for seed in [1, 2, 3, 42, 99]:
+            rng = random.Random(seed * 7)
+            env = DevOpsIncidentEnv(task_id="easy", seed=seed)
+            env.reset()
+            done = False
+            steps = 0
+            while not done and steps < 15:
+                action = Action(action_type=rng.choice(list(ActionType)))
+                result = env.step(action)
+                done = result.done
+                steps += 1
+            s = env.state()
+            score = grade_episode(
+                "easy", s.action_history, s.ground_truth_root_cause,
+                s.ground_truth_fix, s.incident_resolved, s.total_reward,
+            )
+            scores.append(score)
+        assert len(set(scores)) > 1, f"Grader returns constant score: {scores}"
+        return True
+    def check_optimal_agent_scores_high():
+        from env import DevOpsIncidentEnv
+        from models import Action, ActionType
+        from graders.grader import grade_episode
+        # Easy task optimal sequence
+        env = DevOpsIncidentEnv(task_id="easy", seed=42)
+        env.reset()
+        s0 = env.state()
+        failing = s0.ground_truth_root_cause.replace("memory_leak_", "").replace("_", "-")
+        for act in [
+            Action(action_type=ActionType.READ_LOGS, service=failing),
+            Action(action_type=ActionType.READ_METRICS, service=failing),
+            Action(action_type=ActionType.DIAGNOSE, root_cause=f"memory leak {failing}"),
+            Action(action_type=ActionType.RESTART_SERVICE, service=failing),
+        ]:
+            result = env.step(act)
+            if result.done:
+                break
+        s = env.state()
+        score = grade_episode(
+            "easy", s.action_history, s.ground_truth_root_cause,
+            s.ground_truth_fix, s.incident_resolved, s.total_reward,
+        )
+        assert score >= 0.85, f"Optimal agent scored only {score:.3f} on easy"
+        return True
+    check("All graders return scores in [0.0, 1.0]", check_graders_in_range)
+    check("Grader does not return constant scores across episodes", check_graders_not_constant)
+    check("Optimal agent scores >= 0.85 on easy task", check_optimal_agent_scores_high)
+    # --- Collateral damage penalty ---
+    print("\n[ Reward shaping ]")
+    def check_collateral_damage_penalty():
+        from env import DevOpsIncidentEnv
+        from models import Action, ActionType
+        env = DevOpsIncidentEnv(task_id="easy", seed=42)
+        env.reset()
+        s0 = env.state()
+        healthy = [svc for svc in s0.current_observation.services
+                   if svc.status == "healthy"]
+        assert len(healthy) > 0, "No healthy services to test with"
+        result = env.step(Action(action_type=ActionType.RESTART_SERVICE,
+                                 service=healthy[0].name))
+        assert result.reward < 0, f"Expected negative reward for healthy restart, got {result.reward}"
+        return True
+    def check_info_gathering_rewarded():
+        from env import DevOpsIncidentEnv
+        from models import Action, ActionType
+        env = DevOpsIncidentEnv(task_id="easy", seed=42)
+        env.reset()
+        s0 = env.state()
+        failing = s0.ground_truth_root_cause.replace("memory_leak_", "").replace("_", "-")
+        result = env.step(Action(action_type=ActionType.READ_LOGS, service=failing))
+        assert result.reward > 0, f"Expected positive reward for reading failing service logs, got {result.reward}"
+        return True
+    check("Restarting healthy service gives negative reward", check_collateral_damage_penalty)
+    check("Reading failing service logs gives positive reward", check_info_gathering_rewarded)
+    # --- Files present ---
+    print("\n[ Required files ]")
+    for fname in ["openenv.yaml", "Dockerfile", "requirements.txt",
+                  "inference.py", "README.md", "env.py", "api.py"]:
+        path = os.path.join(os.path.dirname(__file__), fname)
+        check(f"{fname} exists", lambda p=path: os.path.exists(p) or f"Missing: {p}")
+    # --- Summary ---
+    print()
+    if not failures:
+        print(f"{PASS} All checks passed! Ready to submit.\n")
+        sys.exit(0)
+    else:
+        print(f"{FAIL} {len(failures)} check(s) failed: {failures}\n")
+        sys.exit(1)
+if __name__ == "__main__":
+    main()
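
The check() helper in validate.py treats a return value of True or None as a pass, any other value as a failure message, and an exception as a failure. The file-existence loop also binds the loop variable through a lambda default argument (lambda p=path: ...) so each check sees its own path. A minimal self-contained sketch of both ideas follows. run_check, the file names, and the present set are hypothetical, and the failure list is passed explicitly instead of being a module global.

```python
from typing import Callable, List, Union

CheckResult = Union[bool, str, None]


def run_check(name: str, fn: Callable[[], CheckResult], failures: List[str]) -> bool:
    """Run one check; True/None passes, anything else (or an exception) fails."""
    try:
        result = fn()
    except Exception as exc:
        failures.append(f"{name}: {exc}")
        return False
    if result is True or result is None:
        return True
    failures.append(f"{name}: {result}")
    return False


if __name__ == "__main__":
    failures: List[str] = []
    present = {"a.txt"}  # hypothetical stand-in for files on disk
    for fname in ["a.txt", "b.txt"]:
        # fname=fname captures the current value; a bare closure would read
        # the variable only when called.
        run_check(
            f"{fname} exists",
            lambda fname=fname: fname in present or f"Missing: {fname}",
            failures,
        )
    print(failures)  # ['b.txt exists: Missing: b.txt']
```

The `x or "message"` idiom lets a one-line lambda return either True or a readable failure string, which is why the helper accepts non-boolean results.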