Spaces: NDGCodes/Sre-Validation


abdur0001 committed on Apr 6

Commit 5fe9036

1 Parent(s): 8e5f213

feat: initial push with env and 3 tasks

Files changed (20)

.dockerignore +12 -0
.gitignore +22 -0
Dockerfile +16 -0
README.md +287 -1
env/__init__.py +0 -0
env/environment.py +363 -0
env/scenario.py +67 -0
env/services.py +155 -0
graders/__init__.py +0 -0
graders/grader.py +170 -0
inference.py +268 -0
models.py +134 -0
openenv.yaml +27 -0
requirements.txt +5 -0
server/__init__.py +0 -0
server/app.py +146 -0
tasks/__init__.py +40 -0
tasks/easy_oom.py +299 -0
tasks/hard_concurrent.py +353 -0
tasks/medium_deadlock.py +298 -0

.dockerignore ADDED

	@@ -0,0 +1,12 @@

+__pycache__/
+*.py[cod]
+*.egg-info/
+.git/
+.gitignore
+.env
+.env.*
+.venv/
+venv/
+*.md
+!README.md
+.dockerignore

.gitignore ADDED

	@@ -0,0 +1,22 @@

+__pycache__/
+*.py[cod]
+*.egg-info/
+dist/
+build/
+.eggs/
+*.egg
+.env
+.env.*
+!.env.example
+.venv/
+venv/
+.idea/
+.vscode/
+*.swp
+*.swo
+*~
+*.log

Dockerfile ADDED

	@@ -0,0 +1,16 @@

+FROM python:3.11-slim
+WORKDIR /app
+ENV PYTHONDONTWRITEBYTECODE=1 \
+    PYTHONUNBUFFERED=1 \
+    PORT=8000
+COPY requirements.txt /app/requirements.txt
+RUN pip install --no-cache-dir -r /app/requirements.txt
+COPY . /app
+EXPOSE 8000
+CMD ["sh", "-c", "exec python -m uvicorn server.app:app --host 0.0.0.0 --port ${PORT}"]

README.md CHANGED

	@@ -1 +1,287 @@
-# IncidentResponse_RL

+# SRE Incident Response Environment
+An OpenEnv-compatible reinforcement learning environment that simulates production incident response. AI agents must investigate microservice architectures, diagnose root causes, and apply fixes, just like a real on-call SRE.
+## Motivation
+Every tech company has on-call rotations, yet there's no standardized benchmark for evaluating AI agents on incident response. This environment fills that gap by simulating realistic production incidents with:
+- **Multi-service architectures** with dependency chains and cascading failures
+- **Progressive information revelation** — agents must actively investigate (read logs, check metrics, trace requests)
+- **Red herrings and misleading symptoms** — alerts point to symptoms, not root causes
+- **Concurrent faults** in the hardest tier — testing whether agents can find multiple independent root causes
+- **Realistic operational data** — 50+ log lines per service with noise, time-series metrics, distributed traces, deploy history, runbooks, and config diffs
+## Service Architecture
+All tasks share the same 7-service microservice architecture:
+```
+                    +--------------+
+          +-------->| auth-service |<------+
+          |         +------+-------+       |
+          |                | depends       | depends
++---------+------+  +------v------+  +-----+--------+
+|  api-gateway   |  | cache-redis |  | notification |
+|  (entry point) |  +-------------+  |   -service   |
++-+----------+---+                   +--------------+
+  |          |
+  | depends  | depends
+  v          v
++------------+  +-----------------+
+|user-service|  |payment-service  |
++-----+------+  +--------+--------+
+      | depends          | depends
+      v                  v
++----------------------------+
+|        db-postgres         |
++----------------------------+
+```
+Each service has: name, status (`HEALTHY`/`DEGRADED`/`DOWN`), version, replica count, dependencies, logs, metrics, traces, deploy history, config, and runbook data.
+## Tasks
+Tasks are auto-discovered from the `tasks/` directory. Each task is a self-contained Python file defining a `SCENARIO` object.
+| Task ID | Name | Difficulty | Max Steps | Root Cause | Fix Required |
+|---------|------|-----------|-----------|------------|--------------|
+| `easy` | Single Service OOM Crash | Easy | 15 | `auth-service` (OOM) | `restart_service(auth-service)` |
+| `medium` | Cascading Database Deadlock | Medium | 25 | `db-postgres` (deadlock) | `restart_service(db-postgres)` |
+| `hard` | Concurrent Faults + Misleading Evidence | Hard | 35 | `payment-service` (bad deploy) AND `cache-redis` (memory leak) | `rollback_deploy(payment-service, v3.8.1)` AND `restart_service(cache-redis)` |
+### Task Details
+**Easy** — Alert directly names `auth-service` as down. Logs clearly show OOM crash cycle (heap growth, OOM kills, restart exhaustion). Single root cause, single fix.
+**Medium** — Alerts blame `payment-service` and `user-service` (both are victims). The real cause is a long-running analytics query deadlocking `db-postgres`. Agent must notice "writes fail but reads work", follow dependency chain to the database, and read `db-postgres` logs to find the deadlock. Red herring: `cache-redis` miss ratio alert (benign TTL expiry).
+**Hard** — Two independent faults at the same time: (1) `payment-service` has a bad deploy (v3.8.2, NullPointerException in new validator module), (2) `cache-redis` has a memory leak causing eviction storms that degrade `auth-service`. Red herrings: `user-service` config warnings (benign), `notification-service` queue backup (victim of auth-service). Agent must find and fix BOTH faults. After fixing only one, post-remediation check shows remaining services are still unhealthy.
+### Adding New Tasks
+To add a new task:
+1. Create a new file in `tasks/` (e.g., `tasks/my_new_task.py`)
+2. Define a `SCENARIO = IncidentScenario(task_id="my_new_task", ...)` — see existing task files for the template
+3. Done. The task loader in `tasks/__init__.py` auto-discovers any `.py` file that exports a `SCENARIO` object.
+No changes needed to the environment engine, grader, server, or inference script. The grader is generic — it reads ground truth (root cause services, required fixes, keywords, weights) from the scenario definition.
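
The auto-discovery described in the steps above could be implemented along the following lines. This is a minimal sketch under assumptions, not the contents of `tasks/__init__.py`: the function name `discover_scenarios` is hypothetical, and the only things taken from the README are the module-level `SCENARIO` name and its `task_id` attribute.

```python
import importlib
import pkgutil


def discover_scenarios(package_name: str) -> dict:
    """Map task_id -> SCENARIO for every module in the package that defines one.

    Hypothetical helper: modules without a SCENARIO attribute are skipped.
    """
    package = importlib.import_module(package_name)
    scenarios = {}
    # iter_modules lists the direct submodules found on the package's path.
    for module_info in pkgutil.iter_modules(package.__path__):
        module = importlib.import_module(f"{package_name}.{module_info.name}")
        scenario = getattr(module, "SCENARIO", None)
        if scenario is not None:
            scenarios[scenario.task_id] = scenario
    return scenarios
```

With a loader of this shape, keying the result by `task_id` rather than by filename is what would let a file such as `easy_oom.py` be addressed as `easy`.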
+## Project Structure
+```
+IncidentResponse_RL/
+├── models.py                  # Pydantic models: Action, Observation, State, enums
+├── openenv.yaml               # OpenEnv manifest (tasks, models, runtime config)
+├── requirements.txt           # Python dependencies
+├── Dockerfile                 # Container for HF Spaces deployment
+├── inference.py               # Baseline agent using OpenAI client
+├── README.md
+│
+├── env/                       # Core environment engine
+│   ├── __init__.py
+│   ├── scenario.py            # IncidentScenario, ServiceConfig, RequiredFix dataclasses
+│   ├── environment.py         # step() / reset() / state() implementation
+│   └── services.py            # Alert generation, dependency cascade, data formatting
+│
+├── tasks/                     # Task definitions (auto-discovered)
+│   ├── __init__.py            # Auto-discovery loader → SCENARIOS dict
+│   ├── easy_oom.py            # Easy: Single Service OOM Crash
+│   ├── medium_deadlock.py     # Medium: Cascading Database Deadlock
+│   └── hard_concurrent.py     # Hard: Concurrent Faults + Misleading Evidence
+│
+├── graders/                   # Scoring engine
+│   ├── __init__.py
+│   └── grader.py              # Generic rubric-based grader (0.0-1.0)
+│
+└── server/                    # FastAPI web server
+    ├── __init__.py
+    └── app.py                 # /reset, /step, /state, /tasks endpoints
+```
+## Action Space
+All actions are sent as a single JSON object with an `action_type` field. Optional fields depend on the action type.
+### Investigation Actions (read-only, gather information)
+| Action | Required Fields | Returns |
+|--------|----------------|---------|
+| `read_logs` | `service` | 50+ timestamped log lines with noise and signal |
+| `check_metrics` | `service` | Time-series table (CPU, memory, latency, error rate, etc.) |
+| `ping_service` | `service` | Reachability check with latency |
+| `check_dependencies` | `service` | Upstream dependency list with current health status |
+| `inspect_deploy` | `service` | Deploy history (version, timestamp, status) |
+| `query_traces` | `service` | Distributed trace spans showing latency breakdown |
+| `check_runbook` | `service` | Operational runbook with troubleshooting steps |
+| `diff_config` | `service` | Current vs previous config comparison |
+### Remediation Actions (modify environment state)
+| Action | Required Fields | Effect |
+|--------|----------------|--------|
+| `restart_service` | `service` | Restarts pods. Fixes OOM/leak issues. No effect if root cause is elsewhere. |
+| `rollback_deploy` | `service`, `target_version` | Rolls back to specified version. Must match exact version string. |
+| `scale_up` | `service`, `replicas` | Increases replica count. Can alleviate memory pressure. |
+| `drain_traffic` | `service` | Stops routing traffic to the service. |
+### Terminal Action
+| Action | Required Fields | Effect |
+|--------|----------------|--------|
+| `submit_diagnosis` | `root_cause_service`, `root_cause_category`, `fix_description` | Ends episode, triggers grading. |
+### Root Cause Categories
+`oom_crash`, `db_deadlock`, `bad_deploy`, `memory_leak`, `network_partition`, `disk_full`, `config_error`, `cert_expiry`, `dns_failure`, `rate_limit`
+### Example Actions
+```json
+{"action_type": "read_logs", "service": "auth-service"}
+{"action_type": "check_metrics", "service": "db-postgres"}
+{"action_type": "rollback_deploy", "service": "payment-service", "target_version": "v3.8.1"}
+{"action_type": "submit_diagnosis", "root_cause_service": "db-postgres", "root_cause_category": "db_deadlock", "fix_description": "Restarted db-postgres to clear deadlock caused by analytics-cron query"}
+```
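
A client may want to check an action against the required-field tables above before sending it. The sketch below copies the field requirements from those tables; the helper `missing_fields` and the `REQUIRED_FIELDS` mapping are hypothetical names and are not part of this commit.

```python
# Required fields per action type, copied from the Action Space tables.
# Hypothetical client-side helper; the server performs its own validation.
SERVICE_ONLY = [
    "read_logs", "check_metrics", "ping_service", "check_dependencies",
    "inspect_deploy", "query_traces", "check_runbook", "diff_config",
    "restart_service", "drain_traffic",
]
REQUIRED_FIELDS = {name: ("service",) for name in SERVICE_ONLY}
REQUIRED_FIELDS.update({
    "rollback_deploy": ("service", "target_version"),
    "scale_up": ("service", "replicas"),
    "submit_diagnosis": (
        "root_cause_service", "root_cause_category", "fix_description",
    ),
})


def missing_fields(action: dict) -> list[str]:
    """Return the required fields that are absent or empty in the action."""
    action_type = action.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return [
        field for field in REQUIRED_FIELDS[action_type]
        if action.get(field) in (None, "")
    ]


print(missing_fields({"action_type": "rollback_deploy", "service": "payment-service"}))
```

Checking locally saves a step only if the server counts malformed actions against the step budget, which `env/environment.py` appears to do.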
+## Observation Space
+On `reset()`, the agent receives:
+- **Service health dashboard** — all 7 services with status (`HEALTHY`/`DEGRADED`/`DOWN`), version, replica count
+- **Active alerts** — severity-tagged alerts (SEV-1/SEV-2/SEV-3)
+- **Incident summary** — text description of the situation
+On each `step()`, the agent receives:
+- **Updated service statuses** — health may change after remediation
+- **Updated alerts** — alerts clear when services recover
+- **Action result** — the data returned by the action (logs, metrics, traces, etc.)
+- **Reward** — per-step reward signal
+- **Done flag** — whether the episode has ended
+- **Score** — final score (only on terminal step)
+### Progressive Revelation
+The agent does NOT see all data upfront. It must actively choose which services to investigate and which data to request. Each investigation action consumes a step, which creates planning pressure: the agent must balance information gathering against remediation within the step budget.
+### Post-Remediation Feedback
+After any remediation action, the observation includes a `[POST-REMEDIATION CHECK]` that lists which services are still unhealthy. This is critical for the hard task — after fixing only one of two faults, the check reveals remaining issues.
+## Reward Function
+### Per-Step Shaping
+| Action | Reward |
+|--------|--------|
+| Investigating a root-cause service | +0.01 |
+| Investigating a non-root-cause service | 0.00 |
+| Correct remediation (matches required fix) | +0.05 |
+| Wrong remediation (wrong service or wrong fix type) | -0.05 |
+### Terminal Grading (0.0 - 1.0)
+The grader is generic and rubric-based. Each task defines its own weights:
+| Component | Easy | Medium | Hard |
+|-----------|------|--------|------|
+| Correct root cause service identified | 0.30 | 0.25 | 0.15 |
+| Correct root cause category | 0.20 | 0.20 | 0.10 |
+| Primary fix applied | 0.30 | 0.25 | 0.15 |
+| Secondary fix(es) applied | -- | -- | 0.20 |
+| Diagnosis text quality (keyword match) | 0.10 | 0.10 | 0.15 |
+| Investigation thoroughness | 0.10 | 0.10 | 0.10 |
+| Wrong remediation penalty | -0.03/ea | -0.05/ea | -0.05/ea |
+**Diagnosis text scoring** uses deterministic keyword matching — the grader checks if the fix description mentions key terms (service names, fault types, fix actions). No LLM-based judging.
+**Investigation thoroughness** checks whether the agent examined at least one root-cause service before submitting.
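
As a quick arithmetic check on the rubric, the positive weights in the Terminal Grading table can be summed per task. The sketch assumes the grader adds the components linearly and that no wrong-remediation penalty applies; the actual logic lives in `graders/grader.py`, which is not reproduced here.

```python
# Positive rubric weights per task, copied from the Terminal Grading table.
# Assumption: the final score is the plain sum of earned components.
WEIGHTS = {
    "easy": [0.30, 0.20, 0.30, 0.10, 0.10],
    "medium": [0.25, 0.20, 0.25, 0.10, 0.10],
    "hard": [0.15, 0.10, 0.15, 0.20, 0.15, 0.10],
}

# Round to two decimals to hide floating-point noise in the sums.
max_scores = {task: round(sum(w), 2) for task, w in WEIGHTS.items()}
print(max_scores)  # {'easy': 1.0, 'medium': 0.9, 'hard': 0.85}
```

Under that assumption the maximum attainable score is 1.00 for easy, 0.90 for medium, and 0.85 for hard, so only the easy task can reach a full 1.0.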
+## Setup
+### Local Development
+```bash
+pip install -r requirements.txt
+python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
+```
+### Docker
+```bash
+docker build -t sre-incident-response .
+docker run -p 8000:8000 sre-incident-response
+```
+### API Usage
+```bash
+# List available tasks
+curl http://localhost:8000/tasks
+# Reset (start a new episode)
+curl -X POST http://localhost:8000/reset \
+  -H "Content-Type: application/json" \
+  -d '{"task_id": "easy"}'
+# Step (take an action)
+curl -X POST http://localhost:8000/step \
+  -H "Content-Type: application/json" \
+  -d '{"session_id": "<SESSION_ID>", "action": {"action_type": "read_logs", "service": "auth-service"}}'
+# Get current episode state
+curl http://localhost:8000/state/<SESSION_ID>
+```
+OpenEnv-prefixed endpoints are also available: `/openenv/reset`, `/openenv/step`, `/openenv/state/{session_id}`, `/openenv/tasks`.
+### Running Inference
+```bash
+export HF_TOKEN=your_token
+export API_BASE_URL=https://router.huggingface.co/v1
+export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
+python inference.py
+```
+The inference script runs a baseline LLM agent against all tasks, emitting structured stdout logs:
+```
+[START] task=easy env=sre_incident_response model=Qwen/Qwen2.5-72B-Instruct
+[STEP] step=1 action=read_logs(auth-service) reward=0.01 done=false error=null
+[STEP] step=2 action=check_metrics(auth-service) reward=0.01 done=false error=null
+[STEP] step=3 action=restart_service(auth-service) reward=0.05 done=false error=null
+[STEP] step=4 action=submit_diagnosis reward=1.00 done=true error=null
+[END] success=true steps=4 score=1.00 rewards=0.01,0.01,0.05,1.00
+```
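
The `[START]`/`[STEP]`/`[END]` lines are simple `key=value` records, so they can be parsed with a short regular expression. The helper below is a hypothetical sketch, not part of `inference.py`, and it assumes values never contain whitespace, which holds for the sample output above but may not hold for a non-null `error` message.

```python
import re


def parse_log_line(line: str) -> tuple[str, dict[str, str]]:
    """Split a structured log line into its tag and key=value fields.

    Hypothetical helper; values are returned as raw strings.
    """
    match = re.match(r"\[(START|STEP|END)\]\s+(.*)", line.strip())
    if match is None:
        raise ValueError(f"not a structured log line: {line!r}")
    tag, rest = match.groups()
    # Each field is a word, '=', then a run of non-whitespace characters.
    fields = dict(re.findall(r"(\w+)=(\S+)", rest))
    return tag, fields


tag, fields = parse_log_line(
    "[STEP] step=3 action=restart_service(auth-service) reward=0.05 done=false error=null"
)
print(tag, fields["action"], float(fields["reward"]))
```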
+## Baseline Scores
+| Task | Expected Score Range | What a Perfect Agent Scores |
+|------|---------------------|---------------------------|
+| easy | 0.70 - 0.95 | 1.00 |
+| medium | 0.40 - 0.75 | 0.90 |
+| hard | 0.20 - 0.55 | 0.85 |
+## Environment Variables
+| Variable | Description | Default |
+|----------|-------------|---------|
+| `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
+| `MODEL_NAME` | Model identifier | `Qwen/Qwen2.5-72B-Instruct` |
+| `HF_TOKEN` | Hugging Face access token | Required |
+| `PORT` | Server port | `8000` |
+| `SRE_TASKS` | Comma-separated task IDs to run in inference | `easy,medium,hard` |
+## OpenEnv Spec Compliance
+- `openenv.yaml` with metadata, task definitions, typed models, and runtime config
+- `step(action)` returns observation, reward, done, info
+- `reset()` returns initial observation
+- `state()` returns current episode metadata
+- Typed Pydantic models for Action, Observation, and State
+- 3 tasks with programmatic graders (easy, medium, hard)
+- Scores in 0.0-1.0 range with partial progress signals
+- Working Dockerfile for containerized execution
+- Baseline inference script (`inference.py`) with reproducible scores

env/__init__.py ADDED

File without changes

env/environment.py ADDED

	@@ -0,0 +1,363 @@

+"""
+Core environment engine — implements reset/step/state for the SRE Incident Response env.
+"""
+import uuid
+from typing import Any, Dict, Optional, Set, Tuple
+from models import (
+    Action,
+    ActionType,
+    GraderResult,
+    INVESTIGATION_ACTIONS,
+    Observation,
+    REMEDIATION_ACTIONS,
+    RootCauseCategory,
+    ServiceState,
+    ServiceStatus,
+    State,
+)
+from env.scenario import IncidentScenario, RequiredFix
+from tasks import SCENARIOS
+from env.services import (
+    format_config_diff,
+    format_deploy_history,
+    format_dependencies,
+    format_logs,
+    format_metrics,
+    format_runbook,
+    format_traces,
+    generate_alerts,
+    ping_service,
+    recompute_health,
+)
+class Session:
+    """Tracks the state of a single episode."""
+    def __init__(self, scenario: IncidentScenario, session_id: str):
+        self.session_id = session_id
+        self.scenario = scenario
+        self.step_count = 0
+        self.done = False
+        self.cumulative_reward = 0.0
+        # Mutable service state: {name: {status, version, replicas}}
+        self.services: Dict[str, Dict[str, Any]] = {}
+        for name, cfg in scenario.services.items():
+            self.services[name] = {
+                "status": cfg.status,
+                "version": cfg.version,
+                "replicas": cfg.replicas,
+            }
+        # Track which root-cause services have been fixed
+        self.fixed_services: Set[str] = set()
+        # Build root-cause map: service_name -> fault_type
+        self.root_cause_map: Dict[str, str] = {}
+        for name, cfg in scenario.services.items():
+            if cfg.is_root_cause and cfg.fault_type:
+                self.root_cause_map[name] = cfg.fault_type
+        # Action history for grading
+        self.actions: list[Action] = []
+        self.services_investigated: Set[str] = set()
+        self.remediations_applied: list[Dict[str, Any]] = []
+        self.diagnosis: Optional[Action] = None
+class IncidentResponseEnv:
+    """The SRE Incident Response OpenEnv environment."""
+    def __init__(self):
+        self.sessions: Dict[str, Session] = {}
+    def get_task_ids(self) -> list[str]:
+        return list(SCENARIOS.keys())
+    def reset(self, task_id: str, seed: int = 0) -> Tuple[Observation, str]:
+        """Start a new episode for the given task."""
+        if task_id not in SCENARIOS:
+            raise ValueError(f"Unknown task_id: {task_id}. Available: {list(SCENARIOS.keys())}")
+        scenario = SCENARIOS[task_id]
+        session_id = str(uuid.uuid4())[:8]
+        session = Session(scenario, session_id)
+        self.sessions[session_id] = session
+        # Build initial observation
+        obs = self._build_observation(session, action_result=None)
+        return obs, session_id
+    def step(self, session_id: str, action: Action) -> Tuple[Observation, float, bool, Dict]:
+        """Execute an action and return (observation, reward, done, info)."""
+        session = self._get_session(session_id)
+        if session.done:
+            obs = self._build_observation(session, action_result="Episode already finished.")
+            return obs, 0.0, True, {"error": "Episode already finished."}
+        session.step_count += 1
+        session.actions.append(action)
+        reward = 0.0
+        action_result = ""
+        info: Dict[str, Any] = {}
+        service_name = action.service
+        scenario = session.scenario
+        # Validate service name for actions that require it
+        if action.action_type != ActionType.SUBMIT_DIAGNOSIS:
+            if service_name and service_name not in scenario.services:
+                action_result = f"Unknown service: '{service_name}'. Available: {list(scenario.services.keys())}"
+                obs = self._build_observation(session, action_result=action_result)
+                return obs, 0.0, False, {"error": action_result}
+            if not service_name:
+                action_result = "Action requires a 'service' parameter."
+                obs = self._build_observation(session, action_result=action_result)
+                return obs, 0.0, False, {"error": action_result}
+        # ── Investigation actions ──
+        if action.action_type in INVESTIGATION_ACTIONS:
+            session.services_investigated.add(service_name)
+            action_result = self._handle_investigation(session, action)
+            # Small reward for investigating root cause services
+            if service_name in scenario.root_cause_services:
+                reward = 0.01
+            else:
+                reward = 0.0
+        # ── Remediation actions ──
+        elif action.action_type in REMEDIATION_ACTIONS:
+            action_result, reward = self._handle_remediation(session, action)
+            session.remediations_applied.append({
+                "action": action.action_type.value,
+                "service": service_name,
+                "target_version": action.target_version,
+                "replicas": action.replicas,
+            })
+        # ── Submit diagnosis ──
+        elif action.action_type == ActionType.SUBMIT_DIAGNOSIS:
+            session.diagnosis = action
+            session.done = True
+            grader_result = self._grade(session)
+            reward = grader_result.score
+            action_result = f"Diagnosis submitted. Score: {grader_result.score:.2f}"
+            info["grader_result"] = grader_result.model_dump()
+        session.cumulative_reward += reward
+        # Check max steps
+        if session.step_count >= scenario.max_steps and not session.done:
+            session.done = True
+            if session.diagnosis is None:
+                # Auto-grade with whatever we have
+                grader_result = self._grade(session)
+                reward = grader_result.score
+                info["grader_result"] = grader_result.model_dump()
+                action_result += f"\n[MAX STEPS REACHED] Episode ended. Score: {grader_result.score:.2f}"
+        obs = self._build_observation(session, action_result=action_result, reward=reward)
+        obs.done = session.done
+        if "grader_result" in info:
+            obs.score = info["grader_result"]["score"]
+        return obs, reward, session.done, info
+    def state(self, session_id: str) -> State:
+        """Return current episode state."""
+        session = self._get_session(session_id)
+        return State(
+            session_id=session.session_id,
+            task_id=session.scenario.task_id,
+            step_count=session.step_count,
+            max_steps=session.scenario.max_steps,
+            done=session.done,
+            actions_taken=[a.action_type.value for a in session.actions],
+            services_investigated=list(session.services_investigated),
+            remediations_applied=[f"{r['action']}({r['service']})" for r in session.remediations_applied],
+            cumulative_reward=round(session.cumulative_reward, 4),
+        )
+    # ── Internal helpers ───────────────────────────────────────────────
+    def _get_session(self, session_id: str) -> Session:
+        if session_id not in self.sessions:
+            raise ValueError(f"Unknown session: {session_id}")
+        return self.sessions[session_id]
+    def _build_observation(
+        self, session: Session, action_result: Optional[str], reward: float = 0.0,
+    ) -> Observation:
+        scenario = session.scenario
+        svc_states = {}
+        for name, data in session.services.items():
+            svc_states[name] = ServiceState(
+                status=data["status"],
+                version=data["version"],
+                replicas=data["replicas"],
+            )
+        alerts = generate_alerts(
+            session.services, scenario.initial_alerts, session.fixed_services,
+        )
+        return Observation(
+            step_number=session.step_count,
+            timestamp=f"2026-04-06T04:{session.step_count:02d}:00Z",
+            services=svc_states,
+            active_alerts=alerts,
+            incident_summary=scenario.incident_summary if session.step_count == 0 else "",
+            action_result=action_result,
+            reward=round(reward, 4),
+            done=session.done,
+        )
+    def _handle_investigation(self, session: Session, action: Action) -> str:
+        scenario = session.scenario
+        svc = action.service
+        if action.action_type == ActionType.READ_LOGS:
+            logs = scenario.logs.get(svc, [])
+            return format_logs(logs)
+        elif action.action_type == ActionType.CHECK_METRICS:
+            metrics = scenario.metrics.get(svc, [])
+            return format_metrics(metrics)
+        elif action.action_type == ActionType.PING_SERVICE:
+            status = session.services[svc]["status"]
+            return ping_service(status, svc)
+        elif action.action_type == ActionType.CHECK_DEPENDENCIES:
+            deps = scenario.dependencies.get(svc, [])
+            dep_info = format_dependencies(deps)
+            # Also show current health of dependencies
+            dep_health = []
+            for d in deps:
+                if d in session.services:
+                    dep_health.append(f"  {d}: {session.services[d]['status'].value}")
+            if dep_health:
+                dep_info += "\n\nDependency health:\n" + "\n".join(dep_health)
+            return dep_info
+        elif action.action_type == ActionType.INSPECT_DEPLOY:
+            deploys = scenario.deploy_history.get(svc, [])
+            return format_deploy_history(deploys)
+        elif action.action_type == ActionType.QUERY_TRACES:
+            traces = scenario.traces.get(svc, [])
+            return format_traces(traces)
+        elif action.action_type == ActionType.CHECK_RUNBOOK:
+            runbook = scenario.runbooks.get(svc, "")
+            return format_runbook(runbook)
+        elif action.action_type == ActionType.DIFF_CONFIG:
+            configs = scenario.configs.get(svc, {})
+            return format_config_diff(configs)
+        return f"No data available for {action.action_type.value} on {svc}."
+    def _handle_remediation(self, session: Session, action: Action) -> Tuple[str, float]:
+        scenario = session.scenario
+        svc = action.service
+        reward = 0.0
+        result = ""
+        # Check if this remediation matches any required fix
+        fix_matched = False
+        for req_fix in scenario.required_fixes:
+            if self._fix_matches(action, req_fix):
+                fix_matched = True
+                session.fixed_services.add(svc)
+                reward = 0.05
+                break
+        if action.action_type == ActionType.RESTART_SERVICE:
+            if fix_matched:
+                session.services[svc]["status"] = ServiceStatus.HEALTHY
+                result = f"Service '{svc}' restarted successfully. Status: HEALTHY"
+            else:
+                # Restarting a non-root-cause service: no effect on the underlying issue
+                current = session.services[svc]["status"]
+                if current == ServiceStatus.DOWN and svc in session.root_cause_map:
+                    result = f"Service '{svc}' restarted but crashed again — underlying issue persists."
+                elif current == ServiceStatus.HEALTHY:
+                    result = f"Service '{svc}' restarted. It was already healthy — no change."
+                else:
+                    result = f"Service '{svc}' restarted. Status unchanged — issue is caused by an upstream dependency."
+                reward = -0.05
+        elif action.action_type == ActionType.ROLLBACK_DEPLOY:
+            if fix_matched:
+                session.services[svc]["version"] = action.target_version or ""
+                session.services[svc]["status"] = ServiceStatus.HEALTHY
+                result = (
+                    f"Service '{svc}' rolled back to {action.target_version}. "
+                    f"Pods restarting with previous version... Status: HEALTHY"
+                )
+            else:
+                current_version = session.services[svc]["version"]
+                result = (
+                    f"Rolled back '{svc}' to {action.target_version}, but this didn't resolve the issue. "
+                    f"Previous version was {current_version}."
+                )
+                reward = -0.05
+        elif action.action_type == ActionType.SCALE_UP:
+            replicas = action.replicas or 3
+            if fix_matched or (svc in scenario.root_cause_services):
+                session.services[svc]["replicas"] = replicas
+                session.fixed_services.add(svc)
+                session.services[svc]["status"] = ServiceStatus.HEALTHY
+                result = f"Service '{svc}' scaled to {replicas} replicas. Memory pressure alleviated. Status: HEALTHY"
+                reward = 0.05
+            else:
+                session.services[svc]["replicas"] = replicas
+                result = f"Service '{svc}' scaled to {replicas} replicas. No effect on the underlying issue."
+                reward = -0.05
+        elif action.action_type == ActionType.DRAIN_TRAFFIC:
+            result = f"Traffic drained from '{svc}'. Service is no longer receiving requests."
+            if svc not in scenario.root_cause_services:
+                reward = -0.05
+        # Recompute health after remediation
+        session.services = recompute_health(
+            session.services,
+            scenario.dependencies,
+            session.fixed_services,
+            session.root_cause_map,
+        )
+        # Add post-remediation status summary
+        still_broken = [
+            name for name, data in session.services.items()
+            if data["status"] != ServiceStatus.HEALTHY
+        ]
+        if still_broken:
+            result += f"\n\n[POST-REMEDIATION CHECK] Services still unhealthy: {', '.join(still_broken)}"
+        else:
+            result += "\n\n[POST-REMEDIATION CHECK] All services are now HEALTHY."
+        return result, reward
+    def _fix_matches(self, action: Action, req_fix: RequiredFix) -> bool:
+        """Check if an action matches a required fix."""
+        if action.action_type.value != req_fix.action:
+            return False
+        if action.service != req_fix.service:
+            return False
+        if req_fix.target_version and action.target_version != req_fix.target_version:
+            return False
+        return True
+    def _grade(self, session: Session) -> GraderResult:
+        """Deterministic grading of the episode."""
+        from graders.grader import grade_episode
+        return grade_episode(session)
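
The matching rule in `_fix_matches` can be exercised in isolation. The standalone sketch below mirrors that logic with plain strings in place of the `Action` and `ActionType` models; the free function `fix_matches` and the trimmed `RequiredFix` stand-in are illustrative, and the example fixes are the hard-task ones listed in the README.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequiredFix:
    """Trimmed stand-in for env/scenario.py's RequiredFix."""
    action: str
    service: str
    target_version: Optional[str] = None


def fix_matches(action_type: str, service: str,
                target_version: Optional[str], req_fix: RequiredFix) -> bool:
    """Same checks as _fix_matches: action, service, then version if required."""
    if action_type != req_fix.action:
        return False
    if service != req_fix.service:
        return False
    # The version is compared only when the required fix specifies one.
    if req_fix.target_version and target_version != req_fix.target_version:
        return False
    return True


rollback = RequiredFix("rollback_deploy", "payment-service", "v3.8.1")
restart = RequiredFix("restart_service", "cache-redis")

print(fix_matches("rollback_deploy", "payment-service", "v3.8.1", rollback))  # True
print(fix_matches("rollback_deploy", "payment-service", "v3.8.0", rollback))  # False
print(fix_matches("restart_service", "cache-redis", None, restart))           # True
```

Note that the `replicas` field on `RequiredFix` plays no part in matching, so a `scale_up` fix would match on action and service alone.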

env/scenario.py ADDED

	@@ -0,0 +1,67 @@

+"""
+Scenario schema — shared dataclasses used by all task definitions.
+"""
+from dataclasses import dataclass, field
+from typing import Any, Dict, List, Optional
+from models import RootCauseCategory, ServiceStatus
+@dataclass
+class ServiceConfig:
+    """Configuration for a single service in the simulated architecture."""
+    status: ServiceStatus
+    deps: List[str] = field(default_factory=list)
+    version: str = ""
+    replicas: int = 1
+    is_root_cause: bool = False
+    fault_type: Optional[str] = None
+@dataclass
+class RequiredFix:
+    """A fix that the agent must apply to resolve the incident."""
+    action: str  # "restart_service", "rollback_deploy", "scale_up"
+    service: str
+    target_version: Optional[str] = None
+    replicas: Optional[int] = None
+@dataclass
+class IncidentScenario:
+    """
+    A self-contained incident scenario definition.
+    To create a new task, create a new Python file in tasks/ that instantiates
+    this dataclass and assigns it to a module-level variable named SCENARIO.
+    """
+    task_id: str
+    name: str
+    difficulty: str  # "easy", "medium", "hard"
+    max_steps: int
+    incident_summary: str
+    # Service architecture
+    services: Dict[str, ServiceConfig] = field(default_factory=dict)
+    # Pre-written data per service
+    logs: Dict[str, List[str]] = field(default_factory=dict)
+    metrics: Dict[str, List[Dict[str, Any]]] = field(default_factory=dict)
+    traces: Dict[str, List[str]] = field(default_factory=dict)
+    deploy_history: Dict[str, List[str]] = field(default_factory=dict)
+    runbooks: Dict[str, str] = field(default_factory=dict)
+    configs: Dict[str, Dict[str, str]] = field(default_factory=dict)
+    dependencies: Dict[str, List[str]] = field(default_factory=dict)
+    # Initial alerts
+    initial_alerts: List[str] = field(default_factory=list)
+    # Ground truth for grading
+    root_cause_services: List[str] = field(default_factory=list)
+    root_cause_categories: List[RootCauseCategory] = field(default_factory=list)
+    required_fixes: List[RequiredFix] = field(default_factory=list)
+    diagnosis_keywords: List[str] = field(default_factory=list)
+    # Grading weights
+    weights: Dict[str, float] = field(default_factory=dict)
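The `weights` mapping is read by the grader through string keys (`correct_service`, `correct_category`, `correct_fix`, `secondary_fix`, `diagnosis_text`, `investigation`, `wrong_penalty`). A minimal sketch of a sanity check a scenario author might run is shown here; the helper name `max_achievable` and the numeric values are hypothetical and are not taken from any task file in this commit.

```python
import math
from typing import Dict

# Keys that add to the score; "wrong_penalty" is subtracted per wrong remediation.
REWARD_KEYS = (
    "correct_service",
    "correct_category",
    "correct_fix",
    "secondary_fix",
    "diagnosis_text",
    "investigation",
)


def max_achievable(weights: Dict[str, float]) -> float:
    """Sum the positive rubric components; missing keys count as zero."""
    return sum(weights.get(key, 0.0) for key in REWARD_KEYS)


# Hypothetical weights for a single-fix scenario (values invented for this example).
EXAMPLE_WEIGHTS = {
    "correct_service": 0.25,
    "correct_category": 0.15,
    "correct_fix": 0.30,
    "diagnosis_text": 0.10,
    "investigation": 0.20,
    "wrong_penalty": 0.05,
}

weights_are_normalised = math.isclose(max_achievable(EXAMPLE_WEIGHTS), 1.0)
```

If the positive components sum to less than 1.0, a perfect episode cannot reach a score of 1.0, which may or may not be intended for a given task.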

env/services.py ADDED

	@@ -0,0 +1,155 @@

+"""
+Service simulation helpers — generates alerts, formats data, cascades dependency health.
+"""
+from typing import Any, Dict, List, Set, Tuple
+from models import ServiceStatus
+def generate_alerts(
+    services: Dict[str, Any],
+    scenario_alerts: List[str],
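+    fixed_services: Set[str],
+) -> List[str]:
+    """Regenerate alerts from the current service state.
+    Emits one alert per DOWN or DEGRADED service that is not in fixed_services.
+    If none remain, returns a single all-clear message.
+    Note: scenario_alerts is accepted but not used here."""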
+    alerts: List[str] = []
+    for svc_name, svc in services.items():
+        status = svc["status"]
+        if status == ServiceStatus.DOWN and svc_name not in fixed_services:
+            alerts.append(f"[ALERT SEV-1] {svc_name}: service is DOWN, 0 healthy pods")
+        elif status == ServiceStatus.DEGRADED and svc_name not in fixed_services:
+            alerts.append(f"[ALERT SEV-2] {svc_name}: service is DEGRADED")
+    if not alerts:
+        return ["[INFO] All services HEALTHY — no active alerts."]
+    return alerts
+def recompute_health(
+    services: Dict[str, Any],
+    dependencies: Dict[str, List[str]],
+    fixed_services: Set[str],
+    root_cause_map: Dict[str, str],
+) -> Dict[str, Any]:
+    """Walk the dependency graph and update service health.
+    Rules:
+    - A service in fixed_services becomes HEALTHY.
+    - An unfixed root-cause service keeps its current status.
+    - A non-root-cause service becomes HEALTHY if all its deps are HEALTHY.
+    - A non-root-cause service becomes DEGRADED if any dep is DEGRADED or DOWN
+      (a DOWN dependency degrades its dependents; it does not take them DOWN).
+    """
+    updated = {k: dict(v) for k, v in services.items()}
+    # First, fix root-cause services that have been remediated
+    for svc_name in fixed_services:
+        if svc_name in updated:
+            updated[svc_name]["status"] = ServiceStatus.HEALTHY
+    # Iteratively propagate health (max 5 rounds to handle chains)
+    for _ in range(5):
+        changed = False
+        for svc_name, deps in dependencies.items():
+            if svc_name in fixed_services:
+                continue  # already marked HEALTHY above
+            if svc_name in root_cause_map:
+                continue  # unfixed root cause: still broken
+            if not deps:
+                continue
+            dep_statuses = [updated[d]["status"] for d in deps if d in updated]
+            if not dep_statuses:
+                continue
+            if any(s == ServiceStatus.DOWN for s in dep_statuses):
+                new_status = ServiceStatus.DEGRADED  # downstream of DOWN = DEGRADED
+            elif any(s == ServiceStatus.DEGRADED for s in dep_statuses):
+                new_status = ServiceStatus.DEGRADED
+            else:
+                new_status = ServiceStatus.HEALTHY
+            if updated[svc_name]["status"] != new_status:
+                updated[svc_name]["status"] = new_status
+                changed = True
+        if not changed:
+            break
+    return updated
+def format_metrics(metrics_list: List[Dict[str, Any]]) -> str:
+    """Format time-series metrics into a readable table."""
+    if not metrics_list:
+        return "No metrics available for this service."
+    # Get all keys from the first entry
+    keys = list(metrics_list[0].keys())
+    header = "  ".join(f"{k:<18}" for k in keys)
+    lines = [header, "-" * len(header)]
+    for row in metrics_list:
+        vals = []
+        for k in keys:
+            v = row.get(k, "")
+            vals.append(f"{str(v):<18}")
+        lines.append("  ".join(vals))
+    return "\n".join(lines)
+def format_logs(log_lines: List[str]) -> str:
+    """Join log lines with newlines."""
+    if not log_lines:
+        return "No logs available for this service."
+    return "\n".join(log_lines)
+def format_traces(trace_lines: List[str]) -> str:
+    """Format trace data."""
+    if not trace_lines:
+        return "No traces available for this service."
+    return "\n".join(trace_lines)
+def format_deploy_history(deploy_lines: List[str]) -> str:
+    """Format deploy history."""
+    if not deploy_lines:
+        return "No deploy history available for this service."
+    return "\n".join(deploy_lines)
+def format_dependencies(deps: List[str]) -> str:
+    """Format dependency list."""
+    if not deps:
+        return "This service has no upstream dependencies."
+    return "Dependencies: " + ", ".join(deps)
+def format_runbook(runbook: str) -> str:
+    """Return runbook text."""
+    if not runbook:
+        return "No runbook available for this service."
+    return runbook
+def format_config_diff(config_data: Dict[str, str]) -> str:
+    """Format config diff."""
+    if not config_data:
+        return "No config data available for this service."
+    result = []
+    if "diff" in config_data:
+        result.append(f"Config diff: {config_data['diff']}")
+    if "current" in config_data:
+        result.append(f"\nCurrent config:\n{config_data['current']}")
+    return "\n".join(result)
+def ping_service(status: ServiceStatus, service_name: str) -> str:
+    """Simulate a ping to a service."""
+    if status == ServiceStatus.HEALTHY:
+        return f"PING {service_name}: responding on :8080/healthz — 200 OK (latency: 5ms)"
+    elif status == ServiceStatus.DEGRADED:
+        return f"PING {service_name}: responding on :8080/healthz — 200 OK (latency: 1200ms, SLOW)"
+    else:
+        return f"PING {service_name}: connection refused on :8080/healthz — service unreachable"
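To make the cascade in `recompute_health` concrete, the following self-contained sketch re-implements the same rules on a three-service chain. The `Status` enum, the `propagate` function, and the service names are stand-ins invented for this illustration; the repository's own function operates on dicts with a `"status"` key and a `root_cause_map`.

```python
from enum import Enum
from typing import Dict, List, Set


class Status(str, Enum):  # stand-in for models.ServiceStatus
    HEALTHY = "HEALTHY"
    DEGRADED = "DEGRADED"
    DOWN = "DOWN"


def propagate(
    status: Dict[str, Status],
    deps: Dict[str, List[str]],
    fixed: Set[str],
    root_causes: Set[str],
    max_rounds: int = 5,
) -> Dict[str, Status]:
    """Simplified version of the health cascade, for illustration only."""
    out = dict(status)
    for name in fixed:
        if name in out:
            out[name] = Status.HEALTHY
    for _ in range(max_rounds):
        changed = False
        for name, upstream in deps.items():
            # Fixed services are done; unfixed root causes stay broken.
            if name in fixed or name in root_causes or not upstream:
                continue
            seen = [out[d] for d in upstream if d in out]
            if not seen:
                continue
            # Any DOWN or DEGRADED dependency degrades the dependent.
            unhealthy = any(s != Status.HEALTHY for s in seen)
            new = Status.DEGRADED if unhealthy else Status.HEALTHY
            if out[name] != new:
                out[name] = new
                changed = True
        if not changed:
            break
    return out


# Hypothetical chain: gateway depends on auth, auth depends on cache.
DEPS = {"gateway": ["auth"], "auth": ["cache"], "cache": []}
BEFORE = {"gateway": Status.DEGRADED, "auth": Status.DOWN, "cache": Status.HEALTHY}

unfixed = propagate(BEFORE, DEPS, fixed=set(), root_causes={"auth"})
after_fix = propagate(BEFORE, DEPS, fixed={"auth"}, root_causes={"auth"})
```

With `auth` unfixed, `gateway` stays DEGRADED; once `auth` is in the fixed set, every service in this small graph ends up HEALTHY.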

graders/__init__.py ADDED

File without changes

graders/grader.py ADDED

	@@ -0,0 +1,170 @@

+"""
+Deterministic grading engine for the SRE Incident Response environment.
+Scores episodes on a 0.0-1.0 scale based on weighted rubric components.
+"""
+from __future__ import annotations
+from typing import TYPE_CHECKING, Dict, List
+from models import GraderResult  # the only name from models used in this module
+if TYPE_CHECKING:
+    from env.environment import Session
+def grade_episode(session: Session) -> GraderResult:
+    """Grade a completed episode and return a GraderResult."""
+    scenario = session.scenario
+    weights = scenario.weights
+    diagnosis = session.diagnosis
+    notes: List[str] = []
+    breakdown: Dict[str, float] = {}
+    score = 0.0
+    # ── 1. Root cause service identification ──
+    service_score = 0.0
+    if diagnosis and diagnosis.root_cause_service:
+        if diagnosis.root_cause_service in scenario.root_cause_services:
+            service_score = weights.get("correct_service", 0)
+            notes.append(f"Correct root cause service: {diagnosis.root_cause_service}")
+        else:
+            notes.append(
+                f"Wrong root cause service: {diagnosis.root_cause_service} "
+                f"(expected one of: {scenario.root_cause_services})"
+            )
+    else:
+        notes.append("No root cause service submitted.")
+    breakdown["correct_service"] = service_score
+    score += service_score
+    # ── 2. Root cause category ──
+    category_score = 0.0
+    if diagnosis and diagnosis.root_cause_category:
+        if diagnosis.root_cause_category in scenario.root_cause_categories:
+            category_score = weights.get("correct_category", 0)
+            notes.append(f"Correct root cause category: {diagnosis.root_cause_category.value}")
+        else:
+            notes.append(
+                f"Wrong root cause category: {diagnosis.root_cause_category.value} "
+                f"(expected one of: {[c.value for c in scenario.root_cause_categories]})"
+            )
+    else:
+        notes.append("No root cause category submitted.")
+    breakdown["correct_category"] = category_score
+    score += category_score
+    # ── 3. Primary fix applied ──
+    fix_score = 0.0
+    primary_fix = scenario.required_fixes[0] if scenario.required_fixes else None
+    if primary_fix and primary_fix.service in session.fixed_services:
+        fix_score = weights.get("correct_fix", 0)
+        notes.append(f"Primary fix applied: {primary_fix.action}({primary_fix.service})")
+    elif primary_fix:
+        notes.append(
+            f"Primary fix NOT applied. Expected: {primary_fix.action}({primary_fix.service})"
+        )
+    breakdown["correct_fix"] = fix_score
+    score += fix_score
+    # ── 4. Secondary fixes (hard tier) ──
+    secondary_score = 0.0
+    secondary_weight = weights.get("secondary_fix", 0)
+    if secondary_weight > 0 and len(scenario.required_fixes) > 1:
+        secondary_fixes = scenario.required_fixes[1:]
+        fixed_count = sum(
+            1 for f in secondary_fixes if f.service in session.fixed_services
+        )
+        fraction = fixed_count / len(secondary_fixes)
+        secondary_score = secondary_weight * fraction
+        if fixed_count == len(secondary_fixes):
+            notes.append(f"All {len(secondary_fixes)} secondary fix(es) applied.")
+        elif fixed_count > 0:
+            notes.append(
+                f"Partial secondary fixes: {fixed_count}/{len(secondary_fixes)} applied."
+            )
+        else:
+            notes.append(f"No secondary fixes applied (needed {len(secondary_fixes)}).")
+    breakdown["secondary_fix"] = secondary_score
+    score += secondary_score
+    # ── 5. Diagnosis text quality (keyword matching) ──
+    text_score = 0.0
+    text_weight = weights.get("diagnosis_text", 0)
+    if diagnosis and diagnosis.fix_description:
+        desc_lower = diagnosis.fix_description.lower()
+        keywords = scenario.diagnosis_keywords
+        matched = sum(1 for kw in keywords if kw.lower() in desc_lower)
+        # Full marks require matching half the keywords (rounded down, minimum one).
+        required = max(len(keywords) // 2, 1)
+        fraction = min(matched / required, 1.0)
+        text_score = text_weight * fraction
+        notes.append(
+            f"Diagnosis text: {matched}/{len(keywords)} keywords matched "
+            f"({fraction:.0%} of required)"
+        )
+    else:
+        notes.append("No diagnosis description submitted.")
+    breakdown["diagnosis_text"] = round(text_score, 4)
+    score += text_score
+    # ── 6. Investigation thoroughness ──
+    invest_score = 0.0
+    invest_weight = weights.get("investigation", 0)
+    # Check if agent investigated at least one root cause service
+    investigated_root = any(
+        svc in session.services_investigated
+        for svc in scenario.root_cause_services
+    )
+    if investigated_root:
+        invest_score = invest_weight
+        notes.append(
+            f"Investigation: examined root cause service(s) "
+            f"({session.services_investigated & set(scenario.root_cause_services)})"
+        )
+    else:
+        notes.append(
+            f"Investigation: did NOT examine any root cause service. "
+            f"Investigated: {session.services_investigated or 'none'}"
+        )
+    breakdown["investigation"] = invest_score
+    score += invest_score
+    # ── 7. Wrong remediation penalties ──
+    penalty = 0.0
+    penalty_per = weights.get("wrong_penalty", 0.05)
+    wrong_count = 0
+    for rem in session.remediations_applied:
+        is_correct = False
+        for req_fix in scenario.required_fixes:
+            if rem["action"] == req_fix.action and rem["service"] == req_fix.service:
+                if req_fix.target_version and rem.get("target_version") != req_fix.target_version:
+                    continue
+                is_correct = True
+                break
+        # Any restart, scale-up, or rollback on a root-cause service is also accepted,
+        # even if it does not match a required fix exactly.
+        if rem["service"] in scenario.root_cause_services and rem["action"] in (
+            "restart_service", "scale_up", "rollback_deploy"
+        ):
+            is_correct = True
+        if not is_correct:
+            wrong_count += 1
+            penalty += penalty_per
+    if wrong_count > 0:
+        notes.append(f"Penalty: {wrong_count} wrong remediation(s) (-{penalty:.2f})")
+    breakdown["wrong_penalty"] = -round(penalty, 4)
+    score -= penalty
+    # ── Final clamp ──
+    score = round(max(0.0, min(1.0, score)), 4)
+    solved = score >= 0.7
+    return GraderResult(
+        score=score,
+        solved=solved,
+        breakdown=breakdown,
+        notes=notes,
+    )
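The diagnosis-text component and the final clamp in `grade_episode` can be worked through in isolation. This is a sketch under stated assumptions: `keyword_score` and `clamp_score` are hypothetical helper names, and the keyword list and the 0.2 weight are invented for the example.

```python
from typing import List


def keyword_score(description: str, keywords: List[str], weight: float) -> float:
    """Score a description by case-insensitive substring matches."""
    text = description.lower()
    matched = sum(1 for kw in keywords if kw.lower() in text)
    required = max(len(keywords) // 2, 1)  # half the keywords, at least one
    return weight * min(matched / required, 1.0)


def clamp_score(raw: float) -> float:
    """Clamp to [0, 1] and round to four decimals."""
    return round(max(0.0, min(1.0, raw)), 4)


# Hypothetical keyword list and weight, chosen only for this example.
KEYWORDS = ["oom", "memory", "restart", "heap"]

# One of four keywords matches; two are required, so half the weight is awarded.
partial = keyword_score("Pod hit OOM limits", KEYWORDS, weight=0.2)

# Three keywords match ("restart" matches inside "restarted"); capped at full weight.
full = keyword_score("OOM from a memory spike; restarted", KEYWORDS, weight=0.2)
```

Because matching is by substring, a keyword such as `restart` also matches `restarted`, and any description that hits half the list earns the full component.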

inference.py ADDED

	@@ -0,0 +1,268 @@

+"""
+Inference Script — SRE Incident Response Environment
+=====================================================
+MANDATORY:
+- Before submitting, ensure the following variables are defined in your environment:
+    API_BASE_URL   The API endpoint for the LLM.
+    MODEL_NAME     The model identifier to use for inference.
+    HF_TOKEN       Your Hugging Face / API key.
+    LOCAL_IMAGE_NAME  The name of the local Docker image (if using from_docker_image)
+- The inference script must be named `inference.py` and placed in the root directory.
+- Participants must use the OpenAI client for all LLM calls, configured with the variables above.
+STDOUT FORMAT:
+    [START] task=<task_name> env=<benchmark> model=<model_name>
+    [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
+    [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
+"""
+import asyncio
+import json
+import os
+import sys
+import textwrap
+from typing import List
+from openai import OpenAI
+# Add project root to path
+sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+from models import Action, ActionType, RootCauseCategory
+from env.environment import IncidentResponseEnv
+IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
+API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
+API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
+BENCHMARK = "sre_incident_response"
+MAX_STEPS = 20
+TEMPERATURE = 0.7
+SUCCESS_SCORE_THRESHOLD = 0.7
+# ── Logging helpers (strict format) ───────────────────────────────────
+def log_start(task: str, env: str, model: str) -> None:
+    print(f"[START] task={task} env={env} model={model}", flush=True)
+def log_step(step: int, action: str, reward: float, done: bool, error) -> None:
+    error_str = str(error) if error is not None else "null"
+    done_str = "true" if done else "false"
+    print(
+        f"[STEP] step={step} action={action} reward={reward:.2f} done={done_str} error={error_str}",
+        flush=True,
+    )
+def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+    success_str = "true" if success else "false"
+    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+    print(
+        f"[END] success={success_str} steps={steps} score={score:.2f} rewards={rewards_str}",
+        flush=True,
+    )
+# ── System prompt ─────────────────────────────────────────────────────
+SYSTEM_PROMPT = textwrap.dedent("""\
+You are an expert SRE (Site Reliability Engineer) responding to a production incident.
+You are given the current state of a microservice architecture and must:
+1. Investigate by reading logs, checking metrics, tracing requests, and examining dependencies
+2. Identify the root cause(s)
+3. Apply the correct fix(es)
+4. Submit a diagnosis
+Available actions (respond with a single JSON object):
+Investigation actions (require "service" field):
+- read_logs: Read recent logs from a service
+- check_metrics: Get time-series metrics (CPU, memory, latency, error rate)
+- ping_service: Check if service is reachable
+- check_dependencies: See upstream/downstream dependencies and their health
+- inspect_deploy: See deploy history (versions, timestamps)
+- query_traces: See distributed trace spans
+- check_runbook: Get operational runbook for the service
+- diff_config: Compare current vs previous config
+Remediation actions (require "service" field):
+- restart_service: Restart all pods for a service
+- rollback_deploy: Rollback to a specific version (requires "target_version")
+- scale_up: Increase replica count (requires "replicas")
+- drain_traffic: Stop routing traffic to a service
+Terminal action:
+- submit_diagnosis: Submit your diagnosis (requires "root_cause_service", "root_cause_category", "fix_description")
+Root cause categories: oom_crash, db_deadlock, bad_deploy, memory_leak, network_partition, disk_full, config_error, cert_expiry, dns_failure, rate_limit
+IMPORTANT: Respond with ONLY a JSON object like:
+{"action_type": "read_logs", "service": "auth-service"}
+{"action_type": "rollback_deploy", "service": "payment-service", "target_version": "v3.8.1"}
+{"action_type": "submit_diagnosis", "root_cause_service": "db-postgres", "root_cause_category": "db_deadlock", "fix_description": "Restarted db-postgres to clear deadlock"}
+""")
+# ── Helpers ────────────────────────────────────────────────────────────
+def format_observation(obs_dict: dict) -> str:
+    """Format observation into a readable prompt for the LLM."""
+    parts = []
+    if obs_dict.get("incident_summary"):
+        parts.append(f"INCIDENT SUMMARY: {obs_dict['incident_summary']}")
+    parts.append(f"\nSTEP: {obs_dict.get('step_number', 0)}")
+    services = obs_dict.get("services", {})
+    if services:
+        parts.append("\nSERVICE STATUS DASHBOARD:")
+        for name, state in services.items():
+            status = state.get("status", "UNKNOWN")
+            version = state.get("version", "")
+            parts.append(f"  {name}: {status} (version: {version})")
+    alerts = obs_dict.get("active_alerts", [])
+    if alerts:
+        parts.append("\nACTIVE ALERTS:")
+        for alert in alerts:
+            parts.append(f"  {alert}")
+    action_result = obs_dict.get("action_result")
+    if action_result:
+        parts.append(f"\nRESULT OF LAST ACTION:\n{action_result}")
+    return "\n".join(parts)
+def get_model_message(client: OpenAI, obs_text: str, history: List[str]) -> str:
+    """Call the LLM and return the raw response text."""
+    messages = [
+        {"role": "system", "content": SYSTEM_PROMPT},
+    ]
+    # Include recent history for context
+    for h in history[-6:]:
+        messages.append({"role": "user", "content": h})
+    messages.append({"role": "user", "content": obs_text})
+    try:
+        response = client.chat.completions.create(
+            model=MODEL_NAME,
+            messages=messages,
+            temperature=TEMPERATURE,
+            max_tokens=512,
+        )
+        # content can be None for some responses; return "" so parsing fails cleanly
+        return response.choices[0].message.content or ""
+    except Exception as exc:
+        print(f"[DEBUG] Model request failed: {exc}", flush=True)
+        return '{"action_type": "read_logs", "service": "auth-service"}'
+def parse_action(response_text: str) -> Action:
+    """Parse LLM response into an Action object."""
+    text = response_text.strip()
+    if "```json" in text:
+        text = text.split("```json")[1].split("```")[0].strip()
+    elif "```" in text:
+        text = text.split("```")[1].split("```")[0].strip()
+    start = text.find("{")
+    end = text.rfind("}")
+    if start != -1 and end != -1:
+        text = text[start : end + 1]
+    data = json.loads(text)
+    return Action(**data)
+# ── Main ──────────────────────────────────────────────────────────────
+async def run_task(task_id: str) -> float:
+    """Run inference on a single task. Returns score in [0, 1]."""
+    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+    env = IncidentResponseEnv()
+    history: List[str] = []
+    rewards: List[float] = []
+    steps_taken = 0
+    score = 0.0
+    success = False
+    log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
+    try:
+        obs, session_id = env.reset(task_id=task_id)
+        obs_dict = obs.model_dump()
+        for step in range(1, MAX_STEPS + 1):
+            if obs_dict.get("done", False):
+                break
+            obs_text = format_observation(obs_dict)
+            message = get_model_message(client, obs_text, history)
+            try:
+                action = parse_action(message)
+                error = None
+            except Exception as e:
+                error = str(e)
+                log_step(step=step, action="parse_error", reward=0.0, done=False, error=error)
+                rewards.append(0.0)
+                steps_taken = step
+                history.append(f"Step {step}: parse_error -> reward 0.00")
+                continue
+            obs, reward, done, info = env.step(session_id, action)
+            obs_dict = obs.model_dump()
+            reward = reward or 0.0
+            rewards.append(reward)
+            steps_taken = step
+            action_str = action.action_type.value
+            if action.service:
+                action_str += f"({action.service})"
+            log_step(step=step, action=action_str, reward=reward, done=done, error=error)
+            history.append(f"Step {step}: {action_str} -> reward {reward:+.2f}")
+            if done:
+                if "grader_result" in info:
+                    score = info["grader_result"]["score"]
+                break
+        # Clamp score to [0, 1]
+        score = min(max(score, 0.0), 1.0)
+        success = score >= SUCCESS_SCORE_THRESHOLD
+    finally:
+        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+    return score
+async def main() -> None:
+    task_ids = os.getenv("SRE_TASKS", "easy,medium,hard").split(",")
+    scores = {}
+    for task_id in task_ids:
+        task_id = task_id.strip()
+        score = await run_task(task_id)
+        scores[task_id] = score
+    print(f"\n{'='*60}", flush=True)
+    print("FINAL SCORES:", flush=True)
+    for task_id, score in scores.items():
+        print(f"  {task_id}: {score:.2f}", flush=True)
+    avg = sum(scores.values()) / len(scores) if scores else 0
+    print(f"  AVERAGE: {avg:.2f}", flush=True)
+    print(f"{'='*60}", flush=True)
+if __name__ == "__main__":
+    asyncio.run(main())
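The extraction logic in `parse_action` can be exercised without the environment or an LLM. The sketch below mirrors its steps but returns a plain dict instead of an `Action`; `extract_json_object` is a hypothetical name, and the fence marker is built at runtime only so this example does not contain a literal code fence.

```python
import json
from typing import Any, Dict

FENCE = "`" * 3  # three backticks, built at runtime


def extract_json_object(response_text: str) -> Dict[str, Any]:
    """Pull the first JSON object out of a possibly chatty model reply."""
    text = response_text.strip()
    if FENCE + "json" in text:
        text = text.split(FENCE + "json")[1].split(FENCE)[0].strip()
    elif FENCE in text:
        text = text.split(FENCE)[1].split(FENCE)[0].strip()
    # Trim any prose around the outermost braces.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end != -1:
        text = text[start : end + 1]
    return json.loads(text)


chatty = (
    "Checking the logs first.\n"
    + FENCE + "json\n"
    + '{"action_type": "read_logs", "service": "auth-service"}\n'
    + FENCE
)
parsed = extract_json_object(chatty)
```

A reply with no JSON object at all raises `json.JSONDecodeError`, which the main loop records as a `parse_error` step.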

models.py ADDED

	@@ -0,0 +1,134 @@

+"""
+Pydantic models for the SRE Incident Response OpenEnv environment.
+"""
+from enum import Enum
+from typing import Dict, List, Optional
+from pydantic import BaseModel, Field
+# ── Enums ──────────────────────────────────────────────────────────────
+class ServiceStatus(str, Enum):
+    HEALTHY = "HEALTHY"
+    DEGRADED = "DEGRADED"
+    DOWN = "DOWN"
+class ActionType(str, Enum):
+    # Investigation (read-only)
+    READ_LOGS = "read_logs"
+    CHECK_METRICS = "check_metrics"
+    PING_SERVICE = "ping_service"
+    CHECK_DEPENDENCIES = "check_dependencies"
+    INSPECT_DEPLOY = "inspect_deploy"
+    QUERY_TRACES = "query_traces"
+    CHECK_RUNBOOK = "check_runbook"
+    DIFF_CONFIG = "diff_config"
+    # Remediation (modifies state)
+    RESTART_SERVICE = "restart_service"
+    ROLLBACK_DEPLOY = "rollback_deploy"
+    SCALE_UP = "scale_up"
+    DRAIN_TRAFFIC = "drain_traffic"
+    # Terminal
+    SUBMIT_DIAGNOSIS = "submit_diagnosis"
+class RootCauseCategory(str, Enum):
+    OOM_CRASH = "oom_crash"
+    DB_DEADLOCK = "db_deadlock"
+    BAD_DEPLOY = "bad_deploy"
+    MEMORY_LEAK = "memory_leak"
+    NETWORK_PARTITION = "network_partition"
+    DISK_FULL = "disk_full"
+    CONFIG_ERROR = "config_error"
+    CERT_EXPIRY = "cert_expiry"
+    DNS_FAILURE = "dns_failure"
+    RATE_LIMIT = "rate_limit"
+INVESTIGATION_ACTIONS = {
+    ActionType.READ_LOGS,
+    ActionType.CHECK_METRICS,
+    ActionType.PING_SERVICE,
+    ActionType.CHECK_DEPENDENCIES,
+    ActionType.INSPECT_DEPLOY,
+    ActionType.QUERY_TRACES,
+    ActionType.CHECK_RUNBOOK,
+    ActionType.DIFF_CONFIG,
+}
+REMEDIATION_ACTIONS = {
+    ActionType.RESTART_SERVICE,
+    ActionType.ROLLBACK_DEPLOY,
+    ActionType.SCALE_UP,
+    ActionType.DRAIN_TRAFFIC,
+}
+# ── Action ─────────────────────────────────────────────────────────────
+class Action(BaseModel):
+    action_type: ActionType
+    service: Optional[str] = None
+    target_version: Optional[str] = None
+    replicas: Optional[int] = Field(None, ge=1, le=10)
+    root_cause_service: Optional[str] = None
+    root_cause_category: Optional[RootCauseCategory] = None
+    fix_description: Optional[str] = None
+# ── Observation ────────────────────────────────────────────────────────
+class ServiceState(BaseModel):
+    status: ServiceStatus
+    version: str = ""
+    replicas: int = 1
+class Observation(BaseModel):
+    step_number: int = 0
+    timestamp: str = ""
+    services: Dict[str, ServiceState] = Field(default_factory=dict)
+    active_alerts: List[str] = Field(default_factory=list)
+    incident_summary: str = ""
+    action_result: Optional[str] = None
+    reward: float = 0.0
+    done: bool = False
+    score: Optional[float] = None
+    info: Dict = Field(default_factory=dict)
+# ── State ──────────────────────────────────────────────────────────────
+class State(BaseModel):
+    session_id: str = ""
+    task_id: str = ""
+    step_count: int = 0
+    max_steps: int = 0
+    done: bool = False
+    actions_taken: List[str] = Field(default_factory=list)
+    services_investigated: List[str] = Field(default_factory=list)
+    remediations_applied: List[str] = Field(default_factory=list)
+    cumulative_reward: float = 0.0
+# ── Reward ─────────────────────────────────────────────────────────────
+class Reward(BaseModel):
+    value: float = Field(0.0, ge=0.0, le=1.0)
+    step_reward: float = 0.0
+    breakdown: Dict[str, float] = Field(default_factory=dict)
+    is_terminal: bool = False
+# ── Grader Result ──────────────────────────────────────────────────────
+class GraderResult(BaseModel):
+    score: float = Field(..., ge=0.0, le=1.0)
+    solved: bool = False
+    breakdown: Dict[str, float] = Field(default_factory=dict)
+    notes: List[str] = Field(default_factory=list)

openenv.yaml ADDED

	@@ -0,0 +1,27 @@

+name: sre-incident-response
+version: "1.0.0"
+description: "SRE Incident Response environment — train AI agents to diagnose and fix production incidents"
+tasks:
+  - id: easy
+    name: Single Service OOM Crash
+    difficulty: easy
+    max_steps: 15
+  - id: medium
+    name: Cascading Database Deadlock
+    difficulty: medium
+    max_steps: 25
+  - id: hard
+    name: Concurrent Faults with Misleading Evidence
+    difficulty: hard
+    max_steps: 35
+models:
+  action: models.Action
+  observation: models.Observation
+  reward: models.Reward
+  state: models.State
+runtime:
+  port: 8000
+  entrypoint: server.app:app

requirements.txt ADDED

	@@ -0,0 +1,5 @@

+fastapi>=0.104.0
+uvicorn>=0.24.0
+pydantic>=2.0.0
+openai>=1.0.0
+pyyaml>=6.0

server/__init__.py ADDED

File without changes

server/app.py ADDED

	@@ -0,0 +1,146 @@

+"""
+FastAPI server for the SRE Incident Response OpenEnv environment.
+"""
+import sys
+import os
+# Add project root to path
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+from fastapi import FastAPI, HTTPException
+from pydantic import BaseModel
+from typing import Dict, List, Optional
+from models import Action, Observation, State
+from env.environment import IncidentResponseEnv
+from tasks import SCENARIOS
+app = FastAPI(
+    title="SRE Incident Response Environment",
+    description="An OpenEnv environment for training AI agents on production incident response.",
+    version="1.0.0",
+)
+env = IncidentResponseEnv()
+# ── Request/Response models ────────────────────────────────────────────
+class ResetRequest(BaseModel):
+    task_id: str = "easy"
+    seed: int = 0
+class ResetResponse(BaseModel):
+    observation: Observation
+    session_id: str
+class StepRequest(BaseModel):
+    session_id: str
+    action: Action
+class StepResponse(BaseModel):
+    observation: Observation
+    reward: float
+    done: bool
+    info: Dict
+class TaskInfo(BaseModel):
+    task_id: str
+    name: str
+    difficulty: str
+    max_steps: int
+    description: str
+# ── Endpoints ──────────────────────────────────────────────────────────
+@app.get("/")
+def root():
+    return {
+        "name": "SRE Incident Response Environment",
+        "version": "1.0.0",
+        "endpoints": ["/reset", "/step", "/state/{session_id}", "/tasks"],
+    }
+@app.post("/reset", response_model=ResetResponse)
+def reset(request: ResetRequest):
+    try:
+        obs, session_id = env.reset(task_id=request.task_id, seed=request.seed)
+        return ResetResponse(observation=obs, session_id=session_id)
+    except ValueError as e:
+        raise HTTPException(status_code=400, detail=str(e))
+@app.post("/step", response_model=StepResponse)
+def step(request: StepRequest):
+    try:
+        obs, reward, done, info = env.step(request.session_id, request.action)
+        # Shallow-copy info so the response does not alias the environment's dict
+        clean_info = dict(info)
+        return StepResponse(observation=obs, reward=reward, done=done, info=clean_info)
+    except ValueError as e:
+        raise HTTPException(status_code=400, detail=str(e))
+@app.get("/state/{session_id}", response_model=State)
+def state(session_id: str):
+    try:
+        return env.state(session_id)
+    except ValueError as e:
+        raise HTTPException(status_code=404, detail=str(e))
+@app.get("/tasks", response_model=List[TaskInfo])
+def tasks():
+    result = []
+    for tid, scenario in SCENARIOS.items():
+        result.append(TaskInfo(
+            task_id=tid,
+            name=scenario.name,
+            difficulty=scenario.difficulty,
+            max_steps=scenario.max_steps,
+            description=scenario.incident_summary,
+        ))
+    return result
+# ── OpenEnv-prefixed aliases ───────────────────────────────────────────
+@app.post("/openenv/reset", response_model=ResetResponse)
+def openenv_reset(request: ResetRequest):
+    return reset(request)
+@app.post("/openenv/step", response_model=StepResponse)
+def openenv_step(request: StepRequest):
+    return step(request)
+@app.get("/openenv/state/{session_id}", response_model=State)
+def openenv_state(session_id: str):
+    return state(session_id)
+@app.get("/openenv/tasks", response_model=List[TaskInfo])
+def openenv_tasks():
+    return tasks()
+# ── Main ───────────────────────────────────────────────────────────────
+def main():
+    import uvicorn
+    port = int(os.environ.get("PORT", "8000"))
+    uvicorn.run(app, host="0.0.0.0", port=port)
+if __name__ == "__main__":
+    main()
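A client for the `/reset` and `/step` endpoints can be sketched with the standard library alone. The base URL assumes the server is running locally on the default port 8000; `build_post`, `send`, and `run_first_step` are hypothetical helper names, and the request bodies follow the `ResetRequest` and `StepRequest` models above.

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "http://localhost:8000"  # assumption: server started locally


def build_post(path: str, payload: Dict[str, Any]) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> Dict[str, Any]:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def run_first_step() -> Dict[str, Any]:
    """Reset the easy task, then read logs from auth-service."""
    reset = send(build_post("/reset", {"task_id": "easy", "seed": 0}))
    step_body = {
        "session_id": reset["session_id"],
        "action": {"action_type": "read_logs", "service": "auth-service"},
    }
    return send(build_post("/step", step_body))
```

Calling `run_first_step()` requires a live server; `build_post` on its own performs no network access and can be inspected offline.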

tasks/__init__.py ADDED

	@@ -0,0 +1,40 @@

+"""
+Task auto-discovery and registry.
+Any .py file in this directory that defines a module-level SCENARIO variable
+(an IncidentScenario instance) will be automatically loaded and registered.
+To add a new task:
+  1. Create a new .py file in this directory (e.g., tasks/my_new_task.py)
+  2. Define SCENARIO = IncidentScenario(task_id="my_new_task", ...)
+  3. That's it — the task will be available via the API automatically.
+"""
+import importlib
+import pkgutil
+from pathlib import Path
+from typing import Dict
+from env.scenario import IncidentScenario
+def _discover_scenarios() -> Dict[str, IncidentScenario]:
+    """Scan all .py files in tasks/ and collect SCENARIO instances."""
+    scenarios: Dict[str, IncidentScenario] = {}
+    package_dir = Path(__file__).parent
+    for finder, module_name, is_pkg in pkgutil.iter_modules([str(package_dir)]):
+        if module_name.startswith("_"):
+            continue
+        try:
+            module = importlib.import_module(f"tasks.{module_name}")
+            scenario = getattr(module, "SCENARIO", None)
+            if isinstance(scenario, IncidentScenario):
+                scenarios[scenario.task_id] = scenario
+        except Exception as e:
+            print(f"Warning: failed to load task module 'tasks.{module_name}': {e}")
+    return scenarios
+SCENARIOS: Dict[str, IncidentScenario] = _discover_scenarios()
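The discovery pattern used by `_discover_scenarios` can be tried against a throwaway package. In this sketch a plain dict with a `task_id` key stands in for `IncidentScenario`, and the package name `demo_tasks` and its module names are invented for the demonstration.

```python
import importlib
import pkgutil
import sys
import tempfile
from pathlib import Path
from typing import Any, Dict


def discover(package_name: str, package_dir: Path) -> Dict[str, Any]:
    """Collect SCENARIO objects from the public modules of a package."""
    found: Dict[str, Any] = {}
    for _finder, module_name, _is_pkg in pkgutil.iter_modules([str(package_dir)]):
        if module_name.startswith("_"):
            continue  # private modules are skipped
        module = importlib.import_module(f"{package_name}.{module_name}")
        scenario = getattr(module, "SCENARIO", None)
        # The real loader checks isinstance(..., IncidentScenario); a dict stands in.
        if isinstance(scenario, dict) and "task_id" in scenario:
            found[scenario["task_id"]] = scenario
    return found


def demo() -> Dict[str, Any]:
    """Build a temporary package on disk and run discovery against it."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_tasks"
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "disk_full.py").write_text('SCENARIO = {"task_id": "disk_full"}\n')
        (pkg / "helpers.py").write_text("X = 1\n")  # no SCENARIO: ignored
        (pkg / "_draft.py").write_text('SCENARIO = {"task_id": "draft"}\n')
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            return discover("demo_tasks", pkg)
        finally:
            # Undo the path and module-cache changes made for the demo.
            sys.path.remove(tmp)
            for name in [m for m in sys.modules if m.split(".")[0] == "demo_tasks"]:
                del sys.modules[name]
```

Only `disk_full` is registered: `helpers` defines no `SCENARIO`, and `_draft` is skipped because its name starts with an underscore.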

tasks/easy_oom.py ADDED

	@@ -0,0 +1,299 @@

+"""
+Task: Single Service OOM Crash
+To add a new task, copy this file, give the SCENARIO definition a unique task_id,
+adjust the remaining fields, and place the file in tasks/.
+The task loader will auto-discover it.
+"""
+from env.scenario import IncidentScenario, RequiredFix, ServiceConfig
+from models import RootCauseCategory, ServiceStatus
+SCENARIO = IncidentScenario(
+    task_id="easy",
+    name="Single Service OOM Crash",
+    difficulty="easy",
+    max_steps=15,
+    incident_summary=(
+        "PagerDuty alert fired at 02:15 UTC. auth-service is down with elevated error rates "
+        "and pod restarts. api-gateway reporting 503s on login endpoints. Other services appear "
+        "unaffected. On-call engineer needed to investigate and restore service."
+    ),
+    services={
+        "api-gateway": ServiceConfig(
+            status=ServiceStatus.DEGRADED, deps=["auth-service", "user-service", "payment-service"],
+            version="v1.12.0", replicas=3,
+        ),
+        "auth-service": ServiceConfig(
+            status=ServiceStatus.DOWN, deps=["cache-redis"],
+            version="v2.14.0", replicas=2, is_root_cause=True, fault_type="oom_crash",
+        ),
+        "user-service": ServiceConfig(
+            status=ServiceStatus.HEALTHY, deps=["db-postgres"],
+            version="v4.2.1", replicas=2,
+        ),
+        "payment-service": ServiceConfig(
+            status=ServiceStatus.HEALTHY, deps=["db-postgres"],
+            version="v3.8.1", replicas=2,
+        ),
+        "db-postgres": ServiceConfig(
+            status=ServiceStatus.HEALTHY, deps=[],
+            version="v15.4", replicas=1,
+        ),
+        "cache-redis": ServiceConfig(
+            status=ServiceStatus.HEALTHY, deps=[],
+            version="v7.2.4", replicas=1,
+        ),
+        "notification-service": ServiceConfig(
+            status=ServiceStatus.HEALTHY, deps=["auth-service"],
+            version="v1.5.0", replicas=1,
+        ),
+    },
+    initial_alerts=[
+        "[ALERT SEV-2] auth-service: error rate >50%, pod restarts detected (3 restarts in 5m)",
+        "[ALERT SEV-3] api-gateway: elevated 503 responses on /api/v2/login and /api/v2/verify",
+    ],
+    logs={
+        "auth-service": [
+            "2026-04-06T02:10:01Z INFO  [auth-service] Request processed: POST /auth/token uid=user_8832 latency=45ms",
+            "2026-04-06T02:10:02Z INFO  [auth-service] Request processed: POST /auth/token uid=user_1204 latency=52ms",
+            "2026-04-06T02:10:03Z INFO  [auth-service] Cache hit for session sid=a8f32c, returning cached token",
+            "2026-04-06T02:10:04Z INFO  [auth-service] Request processed: POST /auth/verify uid=user_6650 latency=41ms",
+            "2026-04-06T02:10:05Z INFO  [auth-service] Request processed: POST /auth/verify uid=user_3310 latency=38ms",
+            "2026-04-06T02:10:06Z INFO  [auth-service] Cache hit for session sid=b2e19f, returning cached token",
+            "2026-04-06T02:10:07Z INFO  [auth-service] Request processed: POST /auth/token uid=user_7712 latency=48ms",
+            "2026-04-06T02:10:08Z DEBUG [auth-service] GC pause: 120ms (heap=1.8GB/2.0GB)",
+            "2026-04-06T02:10:09Z INFO  [auth-service] Request processed: POST /auth/token uid=user_2290 latency=155ms",
+            "2026-04-06T02:10:10Z INFO  [auth-service] Request processed: POST /auth/token uid=user_5571 latency=310ms",
+            "2026-04-06T02:10:11Z WARN  [auth-service] Heap usage at 91% (1.82GB/2.0GB), approaching limit",
+            "2026-04-06T02:10:12Z INFO  [auth-service] Request processed: POST /auth/token uid=user_9912 latency=580ms",
+            "2026-04-06T02:10:13Z INFO  [auth-service] Request processed: POST /auth/verify uid=user_4105 latency=620ms",
+            "2026-04-06T02:10:14Z WARN  [auth-service] GC overhead limit exceeded, full GC triggered (heap=1.95GB/2.0GB)",
+            "2026-04-06T02:10:15Z INFO  [auth-service] Full GC completed in 2100ms, freed 50MB",
+            "2026-04-06T02:10:16Z INFO  [auth-service] Request processed: POST /auth/token uid=user_4423 latency=2400ms",
+            "2026-04-06T02:10:17Z INFO  [auth-service] Request processed: POST /auth/verify uid=user_8001 latency=1900ms",
+            "2026-04-06T02:10:18Z ERROR [auth-service] OutOfMemoryError: unable to allocate 64MB for token cache expansion",
+            "2026-04-06T02:10:18Z ERROR [auth-service] Worker pid=1842 killed by OOM killer (resident=2.01GB, limit=2.0GB)",
+            "2026-04-06T02:10:19Z WARN  [auth-service] Process supervisor restarting worker (attempt 1/3)",
+            "2026-04-06T02:10:22Z INFO  [auth-service] Worker pid=1901 started, initializing token cache...",
+            "2026-04-06T02:10:25Z INFO  [auth-service] Request processed: POST /auth/token uid=user_7781 latency=65ms",
+            "2026-04-06T02:10:28Z INFO  [auth-service] Request processed: POST /auth/token uid=user_2209 latency=72ms",
+            "2026-04-06T02:10:30Z INFO  [auth-service] Cache hit for session sid=f8a21c, returning cached token",
+            "2026-04-06T02:10:33Z INFO  [auth-service] Request processed: POST /auth/verify uid=user_3390 latency=55ms",
+            "2026-04-06T02:10:35Z INFO  [auth-service] Request processed: POST /auth/token uid=user_1150 latency=68ms",
+            "2026-04-06T02:10:40Z INFO  [auth-service] Request processed: POST /auth/token uid=user_4482 latency=75ms",
+            "2026-04-06T02:10:45Z DEBUG [auth-service] GC pause: 85ms (heap=1.5GB/2.0GB)",
+            "2026-04-06T02:11:00Z INFO  [auth-service] Request processed: POST /auth/token uid=user_6633 latency=90ms",
+            "2026-04-06T02:11:30Z INFO  [auth-service] Request processed: POST /auth/verify uid=user_9901 latency=110ms",
+            "2026-04-06T02:12:00Z WARN  [auth-service] Heap usage at 82% (1.64GB/2.0GB) — growing again after restart",
+            "2026-04-06T02:12:30Z INFO  [auth-service] Request processed: POST /auth/token uid=user_5510 latency=180ms",
+            "2026-04-06T02:12:45Z WARN  [auth-service] Heap usage at 88% (1.76GB/2.0GB) — growing linearly",
+            "2026-04-06T02:13:00Z DEBUG [auth-service] GC pause: 350ms (heap=1.85GB/2.0GB)",
+            "2026-04-06T02:13:05Z INFO  [auth-service] Request processed: POST /auth/token uid=user_8820 latency=890ms",
+            "2026-04-06T02:13:10Z ERROR [auth-service] OutOfMemoryError: unable to allocate 32MB for request buffer",
+            "2026-04-06T02:13:10Z ERROR [auth-service] Worker pid=1901 killed by OOM killer (resident=1.98GB, limit=2.0GB)",
+            "2026-04-06T02:13:11Z WARN  [auth-service] Process supervisor restarting worker (attempt 2/3)",
+            "2026-04-06T02:13:14Z INFO  [auth-service] Worker pid=1955 started, initializing token cache...",
+            "2026-04-06T02:13:20Z INFO  [auth-service] Request processed: POST /auth/token uid=user_1122 latency=58ms",
+            "2026-04-06T02:13:45Z WARN  [auth-service] Heap usage at 80% (1.60GB/2.0GB)",
+            "2026-04-06T02:14:15Z WARN  [auth-service] Heap usage at 87% (1.74GB/2.0GB)",
+            "2026-04-06T02:14:45Z WARN  [auth-service] GC overhead limit exceeded, full GC triggered",
+            "2026-04-06T02:15:00Z INFO  [auth-service] Full GC completed in 2800ms, freed 30MB — diminishing returns",
+            "2026-04-06T02:15:20Z ERROR [auth-service] OutOfMemoryError: Java heap space",
+            "2026-04-06T02:15:33Z ERROR [auth-service] Worker pid=1955 killed by OOM killer (resident=2.03GB, limit=2.0GB)",
+            "2026-04-06T02:15:34Z ERROR [auth-service] Process supervisor: all 3 restart attempts exhausted",
+            "2026-04-06T02:15:34Z FATAL [auth-service] Service entering crash loop backoff — no healthy workers remaining",
+            "2026-04-06T02:15:35Z ERROR [auth-service] Health check failed: connection refused on :8080/healthz",
+        ],
+        "api-gateway": [
+            "2026-04-06T02:10:01Z INFO  [api-gateway] Route: POST /api/v2/login -> auth-service (200, 48ms)",
+            "2026-04-06T02:10:02Z INFO  [api-gateway] Route: GET /api/v2/user/profile -> user-service (200, 32ms)",
+            "2026-04-06T02:10:03Z INFO  [api-gateway] Route: POST /api/v2/pay -> payment-service (200, 95ms)",
+            "2026-04-06T02:10:05Z INFO  [api-gateway] Route: GET /api/v2/user/settings -> user-service (200, 28ms)",
+            "2026-04-06T02:10:08Z INFO  [api-gateway] Route: POST /api/v2/login -> auth-service (200, 155ms)",
+            "2026-04-06T02:10:10Z INFO  [api-gateway] Route: POST /api/v2/login -> auth-service (200, 320ms)",
+            "2026-04-06T02:10:15Z WARN  [api-gateway] Route: POST /api/v2/login -> auth-service (200, 2500ms) — slow",
+            "2026-04-06T02:10:18Z ERROR [api-gateway] Route: POST /api/v2/login -> auth-service (503, timeout after 5000ms)",
+            "2026-04-06T02:10:20Z INFO  [api-gateway] Route: GET /api/v2/user/profile -> user-service (200, 30ms)",
+            "2026-04-06T02:10:22Z INFO  [api-gateway] Route: POST /api/v2/login -> auth-service (200, 68ms)",
+            "2026-04-06T02:10:25Z INFO  [api-gateway] Route: POST /api/v2/pay -> payment-service (200, 88ms)",
+            "2026-04-06T02:13:10Z ERROR [api-gateway] Route: POST /api/v2/login -> auth-service (503, timeout after 5000ms)",
+            "2026-04-06T02:13:12Z WARN  [api-gateway] Retrying auth-service request (attempt 2/3)",
+            "2026-04-06T02:13:17Z ERROR [api-gateway] Route: POST /api/v2/login -> auth-service (503, timeout after 5000ms)",
+            "2026-04-06T02:15:35Z ERROR [api-gateway] Route: POST /api/v2/login -> auth-service (503, connection refused)",
+            "2026-04-06T02:15:36Z WARN  [api-gateway] Circuit breaker OPEN for auth-service (failures=10, threshold=5)",
+            "2026-04-06T02:15:37Z ERROR [api-gateway] Route: POST /api/v2/login -> auth-service (503, circuit breaker open)",
+            "2026-04-06T02:15:37Z ERROR [api-gateway] Route: POST /api/v2/verify -> auth-service (503, circuit breaker open)",
+            "2026-04-06T02:15:38Z INFO  [api-gateway] Route: GET /api/v2/user/profile -> user-service (200, 28ms)",
+            "2026-04-06T02:15:40Z INFO  [api-gateway] Route: POST /api/v2/pay -> payment-service (200, 95ms)",
+            "2026-04-06T02:15:42Z INFO  [api-gateway] Route: GET /api/v2/user/settings -> user-service (200, 25ms)",
+        ],
+        "user-service": [
+            "2026-04-06T02:10:01Z INFO  [user-service] GET /users/profile uid=user_4421 -> 200 (32ms)",
+            "2026-04-06T02:10:05Z INFO  [user-service] GET /users/settings uid=user_8832 -> 200 (28ms)",
+            "2026-04-06T02:10:10Z INFO  [user-service] PUT /users/profile uid=user_3310 -> 200 (85ms)",
+            "2026-04-06T02:10:15Z INFO  [user-service] GET /users/profile uid=user_1101 -> 200 (30ms)",
+            "2026-04-06T02:10:20Z INFO  [user-service] GET /users/profile uid=user_5571 -> 200 (27ms)",
+            "2026-04-06T02:15:00Z INFO  [user-service] GET /users/profile uid=user_7712 -> 200 (31ms)",
+            "2026-04-06T02:15:30Z INFO  [user-service] PUT /users/settings uid=user_2209 -> 200 (78ms)",
+            "2026-04-06T02:15:35Z INFO  [user-service] GET /users/profile uid=user_9901 -> 200 (26ms)",
+        ],
+        "payment-service": [
+            "2026-04-06T02:10:01Z INFO  [payment-service] Processing payment txn=pay_8832 amount=$45.00 -> db-postgres",
+            "2026-04-06T02:10:02Z INFO  [payment-service] Payment completed txn=pay_8832 latency=85ms",
+            "2026-04-06T02:10:10Z INFO  [payment-service] Processing payment txn=pay_1120 amount=$12.99 -> db-postgres",
+            "2026-04-06T02:10:10Z INFO  [payment-service] Payment completed txn=pay_1120 latency=92ms",
+            "2026-04-06T02:15:00Z INFO  [payment-service] Processing payment txn=pay_4455 amount=$78.50 -> db-postgres",
+            "2026-04-06T02:15:01Z INFO  [payment-service] Payment completed txn=pay_4455 latency=88ms",
+            "2026-04-06T02:15:30Z INFO  [payment-service] Health check /healthz -> 200 OK",
+        ],
+        "db-postgres": [
+            "2026-04-06T02:00:00Z INFO  [db-postgres] Checkpoint starting: time-based",
+            "2026-04-06T02:00:02Z INFO  [db-postgres] Checkpoint complete: wrote 842 buffers (5.7%)",
+            "2026-04-06T02:10:00Z INFO  [db-postgres] Active connections: 35/100",
+            "2026-04-06T02:10:01Z INFO  [db-postgres] Autovacuum: processing table users (dead tuples: 120)",
+            "2026-04-06T02:15:00Z INFO  [db-postgres] Active connections: 34/100",
+            "2026-04-06T02:15:01Z INFO  [db-postgres] Checkpoint starting: time-based",
+            "2026-04-06T02:15:03Z INFO  [db-postgres] Checkpoint complete: wrote 910 buffers (6.2%)",
+        ],
+        "cache-redis": [
+            "2026-04-06T02:10:00Z INFO  [cache-redis] Memory usage: 1.2GB/4.0GB (30%)",
+            "2026-04-06T02:10:01Z INFO  [cache-redis] Cache hit ratio: 92% (within normal range 85-95%)",
+            "2026-04-06T02:10:05Z INFO  [cache-redis] Connected clients: 45",
+            "2026-04-06T02:15:00Z INFO  [cache-redis] Memory usage: 1.2GB/4.0GB (30%)",
+            "2026-04-06T02:15:01Z INFO  [cache-redis] Cache hit ratio: 91%",
+            "2026-04-06T02:15:05Z INFO  [cache-redis] Key evictions: 0 in last 5m",
+        ],
+        "notification-service": [
+            "2026-04-06T02:10:00Z INFO  [notification-service] Email batch #4420 sent successfully (12 emails)",
+            "2026-04-06T02:10:05Z INFO  [notification-service] Auth token validated for batch #4421 (45ms)",
+            "2026-04-06T02:15:00Z INFO  [notification-service] Email batch #4425 sent successfully (8 emails)",
+            "2026-04-06T02:15:30Z INFO  [notification-service] Health check /healthz -> 200 OK",
+        ],
+    },
+    metrics={
+        "auth-service": [
+            {"timestamp": "2026-04-06T02:00:00Z", "cpu_pct": 25, "mem_pct": 60, "heap_gb": 1.2, "latency_p50": 45, "latency_p99": 120, "error_rate": 0.001, "restarts": 0, "connections": 150},
+            {"timestamp": "2026-04-06T02:05:00Z", "cpu_pct": 30, "mem_pct": 72, "heap_gb": 1.44, "latency_p50": 52, "latency_p99": 180, "error_rate": 0.002, "restarts": 0, "connections": 155},
+            {"timestamp": "2026-04-06T02:10:00Z", "cpu_pct": 45, "mem_pct": 91, "heap_gb": 1.82, "latency_p50": 310, "latency_p99": 2400, "error_rate": 0.15, "restarts": 1, "connections": 148},
+            {"timestamp": "2026-04-06T02:11:00Z", "cpu_pct": 35, "mem_pct": 65, "heap_gb": 1.30, "latency_p50": 65, "latency_p99": 200, "error_rate": 0.02, "restarts": 1, "connections": 140},
+            {"timestamp": "2026-04-06T02:13:00Z", "cpu_pct": 48, "mem_pct": 94, "heap_gb": 1.88, "latency_p50": 450, "latency_p99": 3100, "error_rate": 0.20, "restarts": 2, "connections": 130},
+            {"timestamp": "2026-04-06T02:15:00Z", "cpu_pct": 0, "mem_pct": 0, "heap_gb": 0, "latency_p50": 0, "latency_p99": 0, "error_rate": 1.0, "restarts": 3, "connections": 0},
+        ],
+        "api-gateway": [
+            {"timestamp": "2026-04-06T02:00:00Z", "cpu_pct": 20, "mem_pct": 45, "latency_p50": 32, "latency_p99": 85, "error_rate": 0.001, "5xx_rate": 0.001},
+            {"timestamp": "2026-04-06T02:10:00Z", "cpu_pct": 22, "mem_pct": 46, "latency_p50": 35, "latency_p99": 95, "error_rate": 0.005, "5xx_rate": 0.003},
+            {"timestamp": "2026-04-06T02:15:00Z", "cpu_pct": 25, "mem_pct": 48, "latency_p50": 40, "latency_p99": 5200, "error_rate": 0.42, "5xx_rate": 0.40},
+        ],
+        "user-service": [
+            {"timestamp": "2026-04-06T02:00:00Z", "cpu_pct": 15, "mem_pct": 35, "latency_p50": 28, "latency_p99": 75, "error_rate": 0.001},
+            {"timestamp": "2026-04-06T02:15:00Z", "cpu_pct": 15, "mem_pct": 35, "latency_p50": 29, "latency_p99": 78, "error_rate": 0.001},
+        ],
+        "payment-service": [
+            {"timestamp": "2026-04-06T02:00:00Z", "cpu_pct": 18, "mem_pct": 40, "latency_p50": 85, "latency_p99": 150, "error_rate": 0.001},
+            {"timestamp": "2026-04-06T02:15:00Z", "cpu_pct": 18, "mem_pct": 40, "latency_p50": 88, "latency_p99": 155, "error_rate": 0.001},
+        ],
+        "db-postgres": [
+            {"timestamp": "2026-04-06T02:00:00Z", "cpu_pct": 30, "mem_pct": 55, "connections": 35, "active_locks": 2, "deadlocks": 0, "write_iops": 1200, "read_iops": 3500},
+            {"timestamp": "2026-04-06T02:15:00Z", "cpu_pct": 32, "mem_pct": 55, "connections": 34, "active_locks": 2, "deadlocks": 0, "write_iops": 1150, "read_iops": 3400},
+        ],
+        "cache-redis": [
+            {"timestamp": "2026-04-06T02:00:00Z", "mem_gb": 1.2, "mem_pct": 30, "hit_ratio": 0.92, "evictions_per_s": 0, "connections": 45},
+            {"timestamp": "2026-04-06T02:15:00Z", "mem_gb": 1.2, "mem_pct": 30, "hit_ratio": 0.91, "evictions_per_s": 0, "connections": 45},
+        ],
+    },
+    traces={
+        "auth-service": [
+            "No recent traces available — service is down. Last successful trace:",
+            "Trace: POST /auth/token (uid=user_4423, total=2400ms) — BEFORE CRASH",
+            "  ├─ auth-service.checkSessionCache()   5ms   (cache-redis HIT)",
+            "  ├─ auth-service.generateToken()        45ms",
+            "  ├─ auth-service.GC_FULL_PAUSE          2100ms  ← GC dominated total time",
+            "  └─ auth-service.writeResponse()         250ms",
+        ],
+        "api-gateway": [
+            "Trace: POST /api/v2/login (total=5005ms) — TIMEOUT",
+            "  ├─ api-gateway.parseRequest()          2ms",
+            "  ├─ api-gateway.routeToAuthService()    5000ms (TIMEOUT — auth-service unreachable)",
+            "  └─ api-gateway.returnError()            3ms   (503 Service Unavailable)",
+        ],
+    },
+    deploy_history={
+        "auth-service": [
+            "v2.14.0  deployed 2026-04-01T10:00:00Z  status=stable  (running 5 days, no issues)",
+            "v2.13.2  deployed 2026-03-25T14:00:00Z  status=superseded",
+        ],
+        "api-gateway": [
+            "v1.12.0  deployed 2026-03-28T09:00:00Z  status=stable  (running 9 days)",
+        ],
+        "user-service": [
+            "v4.2.1  deployed 2026-04-05T16:00:00Z  status=stable  (running 10 hours)",
+            "v4.2.0  deployed 2026-04-01T11:00:00Z  status=superseded",
+        ],
+        "payment-service": [
+            "v3.8.1  deployed 2026-04-03T14:00:00Z  status=stable  (running 3 days)",
+        ],
+    },
+    runbooks={
+        "auth-service": (
+            "## auth-service Runbook\n"
+            "- OOM crashes: Check heap usage trends in metrics. If memory grows linearly after\n"
+            "  restart, likely a memory leak in the token cache. Short-term fix: restart to clear\n"
+            "  cached state. Long-term: file ticket for cache eviction policy fix.\n"
+            "- High latency: Check cache-redis connectivity. Auth-service falls back to DB lookups\n"
+            "  if cache is down, which increases latency 10x.\n"
+            "- Connection refused: Service may be in crash loop. Check restart count and supervisor logs.\n"
+            "- Token validation failures: Check if JWT signing key was recently rotated."
+        ),
+        "api-gateway": (
+            "## api-gateway Runbook\n"
+            "- 503 errors: Check downstream service health. Gateway proxies to auth-service,\n"
+            "  user-service, and payment-service. Identify which downstream is failing.\n"
+            "- Circuit breaker open: Downstream service has exceeded failure threshold.\n"
+            "  Fix the downstream service; circuit breaker will auto-close after 30s of healthy responses.\n"
+            "- High latency: Usually caused by slow downstream. Check traces to identify bottleneck."
+        ),
+    },
+    configs={
+        "auth-service": {
+            "current": "JVM_HEAP_MAX=2g\nTOKEN_CACHE_SIZE=500000\nSESSION_TTL=3600\nREDIS_POOL_SIZE=20",
+            "previous": "JVM_HEAP_MAX=2g\nTOKEN_CACHE_SIZE=500000\nSESSION_TTL=3600\nREDIS_POOL_SIZE=20",
+            "diff": "No changes — config has not been modified recently.",
+        },
+    },
+    dependencies={
+        "api-gateway": ["auth-service", "user-service", "payment-service"],
+        "auth-service": ["cache-redis"],
+        "user-service": ["db-postgres"],
+        "payment-service": ["db-postgres"],
+        "db-postgres": [],
+        "cache-redis": [],
+        "notification-service": ["auth-service"],
+    },
+    root_cause_services=["auth-service"],
+    root_cause_categories=[RootCauseCategory.OOM_CRASH],
+    required_fixes=[
+        RequiredFix(action="restart_service", service="auth-service"),
+    ],
+    diagnosis_keywords=["auth-service", "oom", "out of memory", "memory", "crash", "restart"],
+    weights={
+        "correct_service": 0.30,
+        "correct_category": 0.20,
+        "correct_fix": 0.30,
+        "secondary_fix": 0.00,
+        "diagnosis_text": 0.10,
+        "investigation": 0.10,
+        "wrong_penalty": 0.03,
+    },
+)
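
Reviewer note: a quick sanity check on the `weights` block above. The sketch assumes that `wrong_penalty` is a deduction applied per wrong action and not a positive score component. That reading is a guess, since `graders/grader.py` is not shown in this part of the diff. Under that assumption the remaining weights should sum to 1.0, which they do for this scenario. The helper name `positive_weight_total` is hypothetical.

```python
import math

# Values copied from the easy scenario's weights dict.
weights = {
    "correct_service": 0.30,
    "correct_category": 0.20,
    "correct_fix": 0.30,
    "secondary_fix": 0.00,
    "diagnosis_text": 0.10,
    "investigation": 0.10,
    "wrong_penalty": 0.03,
}


def positive_weight_total(w: dict, penalty_key: str = "wrong_penalty") -> float:
    """Sum every weight except the penalty (assumed to be a deduction)."""
    return sum(value for key, value in w.items() if key != penalty_key)


total = positive_weight_total(weights)
# Compare with a tolerance: summing binary floats rarely gives exactly 1.0.
print(f"positive weights sum to {total:.2f}, ok={math.isclose(total, 1.0)}")
```

A check like this could run over every entry in `SCENARIOS` at import time, so that a copied task file with mistyped weights is caught early.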

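Reviewer note: the auth-service runbook in this scenario tells the agent to check whether heap usage keeps growing after a restart. The sketch below applies that idea to the heap readings logged between the first restart and the second OOM kill. The growth heuristic and the function names are illustrative assumptions. Nothing here is taken from the environment or grader code.

```python
from datetime import datetime

# (timestamp, heap in GB) copied from the auth-service log lines after the first restart.
SAMPLES = [
    ("2026-04-06T02:10:45Z", 1.50),
    ("2026-04-06T02:12:00Z", 1.64),
    ("2026-04-06T02:12:45Z", 1.76),
    ("2026-04-06T02:13:00Z", 1.85),
]
HEAP_LIMIT_GB = 2.0  # JVM_HEAP_MAX=2g in the scenario config


def parse_ts(ts: str) -> datetime:
    """Parse the log timestamp format used in the scenario."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def growth_rate_gb_per_min(samples) -> float:
    """Average heap growth between the first and last sample."""
    (t0, h0), (t1, h1) = samples[0], samples[-1]
    minutes = (parse_ts(t1) - parse_ts(t0)).total_seconds() / 60
    return (h1 - h0) / minutes


def grows_steadily(samples) -> bool:
    """True when every reading is higher than the one before it."""
    heaps = [heap for _, heap in samples]
    return all(a < b for a, b in zip(heaps, heaps[1:]))


rate = growth_rate_gb_per_min(SAMPLES)
headroom_min = (HEAP_LIMIT_GB - SAMPLES[-1][1]) / rate
print(f"steady growth: {grows_steadily(SAMPLES)}")
print(f"average growth: {rate:.3f} GB/min, about {headroom_min:.1f} min to the limit")
```

Heap that climbs again right after a restart, with no config or deploy change, matches the leak pattern the runbook describes and explains why a restart is only a short-term fix.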
tasks/hard_concurrent.py ADDED

	@@ -0,0 +1,353 @@

+"""
+Task: Concurrent Faults with Misleading Evidence
+To add a new task, copy this file, give the SCENARIO definition a unique task_id,
+adjust the remaining fields, and place the file in tasks/.
+The task loader will auto-discover it.
+"""
+from env.scenario import IncidentScenario, RequiredFix, ServiceConfig
+from models import RootCauseCategory, ServiceStatus
+SCENARIO = IncidentScenario(
+    task_id="hard",
+    name="Concurrent Faults with Misleading Evidence",
+    difficulty="hard",
+    max_steps=35,
+    incident_summary=(
+        "SEV-1 incident declared at 04:00 UTC. Multiple services affected simultaneously. "
+        "payment-service is completely down after a recent deploy. auth-service showing intermittent "
+        "timeouts and session validation failures. notification-service queue backing up. "
+        "user-service has config warnings. api-gateway showing >30% error rate across multiple "
+        "endpoints. Need to identify ALL root causes and restore full system health."
+    ),
+    services={
+        "api-gateway": ServiceConfig(
+            status=ServiceStatus.DEGRADED, deps=["auth-service", "user-service", "payment-service"],
+            version="v1.12.0", replicas=3,
+        ),
+        "auth-service": ServiceConfig(
+            status=ServiceStatus.DEGRADED, deps=["cache-redis"],
+            version="v2.14.0", replicas=2,
+        ),
+        "user-service": ServiceConfig(
+            status=ServiceStatus.HEALTHY, deps=["db-postgres"],
+            version="v4.2.1", replicas=2,
+        ),
+        "payment-service": ServiceConfig(
+            status=ServiceStatus.DOWN, deps=["db-postgres"],
+            version="v3.8.2", replicas=2, is_root_cause=True, fault_type="bad_deploy",
+        ),
+        "db-postgres": ServiceConfig(
+            status=ServiceStatus.HEALTHY, deps=[],
+            version="v15.4", replicas=1,
+        ),
+        "cache-redis": ServiceConfig(
+            status=ServiceStatus.DEGRADED, deps=[],
+            version="v7.2.4", replicas=1, is_root_cause=True, fault_type="memory_leak",
+        ),
+        "notification-service": ServiceConfig(
+            status=ServiceStatus.DEGRADED, deps=["auth-service"],
+            version="v1.5.0", replicas=1,
+        ),
+    },
+    initial_alerts=[
+        "[ALERT SEV-1] api-gateway: error rate >30%, multiple downstream failures detected",
+        "[ALERT SEV-1] payment-service: health check failing, 0/2 pods ready, CrashLoopBackOff",
+        "[ALERT SEV-2] auth-service: intermittent 500 errors, session validation latency >3s",
+        "[ALERT SEV-2] notification-service: email delivery queue depth >2000, processing stalled",
+        "[ALERT SEV-3] user-service: config validation warning (non-critical)",
+    ],
+    logs={
+        "payment-service": [
+            "2026-04-06T04:00:00Z INFO  [payment-service] Deploying v3.8.2 (previous: v3.8.1)",
+            "2026-04-06T04:00:01Z INFO  [payment-service] Container image pulled: registry.internal/payment-service:v3.8.2",
+            "2026-04-06T04:00:02Z INFO  [payment-service] Pod payment-service-7d4f8b-xk9m2 starting...",
+            "2026-04-06T04:00:03Z INFO  [payment-service] Starting health check sequence...",
+            "2026-04-06T04:00:04Z INFO  [payment-service] Loading configuration from ConfigMap...",
+            "2026-04-06T04:00:05Z INFO  [payment-service] Initializing payment validation module v2 (new in v3.8.2)",
+            "2026-04-06T04:00:05Z ERROR [payment-service] NullPointerException in PaymentValidatorV2.initialize(): config.getValidationRules() returned null",
+            "2026-04-06T04:00:05Z ERROR [payment-service] Stack trace:",
+            "    at com.acme.payment.validator.PaymentValidatorV2.initialize(PaymentValidatorV2.java:42)",
+            "    at com.acme.payment.bootstrap.ServiceBootstrap.initModules(ServiceBootstrap.java:118)",
+            "    at com.acme.payment.bootstrap.ServiceBootstrap.start(ServiceBootstrap.java:55)",
+            "    at com.acme.payment.Main.main(Main.java:12)",
+            "2026-04-06T04:00:06Z FATAL [payment-service] Bootstrap failed: required module 'payment-validator-v2' could not initialize",
+            "2026-04-06T04:00:06Z INFO  [payment-service] Shutdown hook triggered, cleaning up...",
+            "2026-04-06T04:00:07Z INFO  [payment-service] Health check endpoint /healthz returning 503",
+            "2026-04-06T04:00:10Z WARN  [payment-service] Kubernetes: pod payment-service-7d4f8b-xk9m2 failed readiness probe (1/3)",
+            "2026-04-06T04:00:20Z WARN  [payment-service] Kubernetes: pod payment-service-7d4f8b-xk9m2 failed readiness probe (2/3)",
+            "2026-04-06T04:00:30Z ERROR [payment-service] Kubernetes: pod payment-service-7d4f8b-xk9m2 marked NotReady, removed from service",
+            "2026-04-06T04:00:31Z INFO  [payment-service] Kubernetes: restarting pod (CrashLoopBackOff)",
+            "2026-04-06T04:00:35Z INFO  [payment-service] Starting health check sequence...",
+            "2026-04-06T04:00:37Z ERROR [payment-service] NullPointerException in PaymentValidatorV2.initialize(): config.getValidationRules() returned null",
+            "2026-04-06T04:00:37Z FATAL [payment-service] Bootstrap failed: required module 'payment-validator-v2' could not initialize",
+            "2026-04-06T04:00:38Z INFO  [payment-service] Kubernetes: restarting pod (CrashLoopBackOff)",
+            "2026-04-06T04:00:45Z INFO  [payment-service] Starting health check sequence...",
+            "2026-04-06T04:00:47Z ERROR [payment-service] NullPointerException in PaymentValidatorV2.initialize(): config.getValidationRules() returned null",
+            "2026-04-06T04:00:47Z FATAL [payment-service] Bootstrap failed: required module 'payment-validator-v2' could not initialize",
+            "2026-04-06T04:01:00Z ERROR [payment-service] CrashLoopBackOff: backing off 60s before next restart",
+            "2026-04-06T04:02:05Z INFO  [payment-service] Starting health check sequence...",
+            "2026-04-06T04:02:07Z ERROR [payment-service] NullPointerException in PaymentValidatorV2.initialize(): config.getValidationRules() returned null",
+            "2026-04-06T04:02:07Z FATAL [payment-service] Bootstrap failed: required module 'payment-validator-v2' could not initialize",
+            "2026-04-06T04:02:10Z ERROR [payment-service] CrashLoopBackOff: backing off 120s before next restart",
+        ],
+        "cache-redis": [
+            "2026-04-06T03:00:00Z INFO  [cache-redis] Memory usage: 2.8GB/4.0GB (70%) — within operational range",
+            "2026-04-06T03:05:00Z INFO  [cache-redis] Memory usage: 2.9GB/4.0GB (72%)",
+            "2026-04-06T03:10:00Z INFO  [cache-redis] Memory usage: 3.0GB/4.0GB (75%)",
+            "2026-04-06T03:15:00Z INFO  [cache-redis] Memory usage: 3.1GB/4.0GB (77%)",
+            "2026-04-06T03:20:00Z INFO  [cache-redis] Memory usage: 3.2GB/4.0GB (80%)",
+            "2026-04-06T03:25:00Z INFO  [cache-redis] Memory usage: 3.3GB/4.0GB (82%)",
+            "2026-04-06T03:30:00Z WARN  [cache-redis] Memory usage: 3.4GB/4.0GB (85%) — approaching maxmemory threshold",
+            "2026-04-06T03:30:01Z INFO  [cache-redis] Eviction policy: allkeys-lru activated",
+            "2026-04-06T03:30:05Z WARN  [cache-redis] Evicting 1200 keys/sec to maintain memory budget",
+            "2026-04-06T03:35:00Z WARN  [cache-redis] Memory usage: 3.5GB/4.0GB (87%) despite active eviction",
+            "2026-04-06T03:40:00Z WARN  [cache-redis] Memory usage: 3.6GB/4.0GB (90%)",
+            "2026-04-06T03:45:00Z WARN  [cache-redis] Memory usage: 3.7GB/4.0GB (92%) despite active eviction",
+            "2026-04-06T03:45:01Z WARN  [cache-redis] Eviction rate insufficient: incoming writes (2.1GB/hr) exceed eviction rate (1.5GB/hr)",
+            "2026-04-06T03:45:02Z WARN  [cache-redis] Key namespace auth:session:* most affected — 60% of evictions from this prefix",
+            "2026-04-06T03:50:00Z WARN  [cache-redis] Memory usage: 3.8GB/4.0GB (95%)",
+            "2026-04-06T03:55:00Z ERROR [cache-redis] Memory usage: 3.82GB/4.0GB (95.5%)",
+            "2026-04-06T04:00:00Z ERROR [cache-redis] Memory usage: 3.85GB/4.0GB (96%) — critical threshold",
+            "2026-04-06T04:00:01Z ERROR [cache-redis] Rejecting 12% of SET commands due to memory pressure",
+            "2026-04-06T04:00:02Z WARN  [cache-redis] Client auth-service reporting increased cache misses (hit ratio: 35%, normal: 90%)",
+            "2026-04-06T04:00:05Z ERROR [cache-redis] Memory fragmentation ratio: 1.8 (healthy: <1.5) — possible memory leak in module",
+            "2026-04-06T04:00:10Z WARN  [cache-redis] Resident memory growing despite aggressive eviction — suspect leaked allocations in Lua script engine",
+            "2026-04-06T04:00:15Z ERROR [cache-redis] Rejecting 18% of SET commands due to memory pressure",
+        ],
+        "auth-service": [
+            "2026-04-06T03:00:00Z INFO  [auth-service] Request: POST /auth/token uid=user_4421 -> cache HIT (12ms)",
+            "2026-04-06T03:00:05Z INFO  [auth-service] Request: POST /auth/verify uid=user_8832 -> cache HIT (10ms)",
+            "2026-04-06T03:15:00Z INFO  [auth-service] Request: POST /auth/token uid=user_3310 -> cache HIT (11ms)",
+            "2026-04-06T03:30:00Z INFO  [auth-service] Request: POST /auth/token uid=user_5571 -> cache HIT (13ms)",
+            "2026-04-06T03:45:00Z WARN  [auth-service] Cache miss for session sid=c9f21a — falling back to db-postgres lookup (280ms)",
+            "2026-04-06T03:45:02Z INFO  [auth-service] Request: POST /auth/token uid=user_7712 -> cache HIT (14ms)",
+            "2026-04-06T03:45:05Z WARN  [auth-service] Cache miss rate elevated: 45% (normal: <10%)",
+            "2026-04-06T03:45:10Z WARN  [auth-service] Cache miss for session sid=d4e82b — falling back to db-postgres lookup (320ms)",
+            "2026-04-06T03:50:00Z WARN  [auth-service] DB connection pool: 28/30 active (falling back to DB for most session lookups)",
+            "2026-04-06T03:55:00Z WARN  [auth-service] Cache miss rate: 55% — DB fallback path overloaded",
+            "2026-04-06T04:00:00Z ERROR [auth-service] Cache write rejected by redis: OOM command not allowed when used memory > maxmemory",
+            "2026-04-06T04:00:01Z WARN  [auth-service] 65% of requests hitting DB fallback path — latency p99 = 3200ms",
+            "2026-04-06T04:00:03Z ERROR [auth-service] Request timeout: POST /auth/verify uid=user_8832 (DB fallback overloaded)",
+            "2026-04-06T04:00:05Z ERROR [auth-service] Request timeout: POST /auth/token uid=user_2209 (DB fallback overloaded)",
+            "2026-04-06T04:00:08Z WARN  [auth-service] DB connection pool: 30/30 active (SATURATED)",
+            "2026-04-06T04:00:10Z WARN  [auth-service] Degraded mode: session validation averaging 1800ms (SLA: 200ms)",
+            "2026-04-06T04:00:15Z ERROR [auth-service] 5 request timeouts in last 60 seconds",
+        ],
+        "user-service": [
+            "2026-04-06T03:30:00Z INFO  [user-service] Config reload triggered by configmap update",
+            "2026-04-06T03:30:01Z WARN  [user-service] Config validation: feature flag 'enable_profile_v2' references unknown experiment 'profile_redesign_q2'",
+            "2026-04-06T03:30:01Z WARN  [user-service] Config validation: deprecated field 'legacy_avatar_url' present — will be removed in v4.0",
+            "2026-04-06T03:30:02Z INFO  [user-service] Config applied successfully (2 warnings, 0 errors)",
+            "2026-04-06T03:30:03Z INFO  [user-service] All endpoints healthy, no service disruption during config reload",
+            "2026-04-06T03:30:10Z INFO  [user-service] GET /users/profile uid=user_4421 -> 200 (28ms)",
+            "2026-04-06T03:45:00Z INFO  [user-service] GET /users/profile uid=user_1101 -> 200 (30ms)",
+            "2026-04-06T03:45:05Z INFO  [user-service] PUT /users/profile uid=user_3310 -> 200 (82ms)",
+            "2026-04-06T04:00:00Z INFO  [user-service] GET /users/profile uid=user_1101 -> 200 (28ms)",
+            "2026-04-06T04:00:01Z INFO  [user-service] PUT /users/profile uid=user_3310 -> 200 (95ms)",
+            "2026-04-06T04:00:05Z INFO  [user-service] GET /users/settings uid=user_5571 -> 200 (26ms)",
+            "2026-04-06T04:00:10Z INFO  [user-service] Health check /healthz -> 200 OK",
+        ],
+        "notification-service": [
+            "2026-04-06T03:45:00Z INFO  [notification-service] Auth token validated for batch #4445 (48ms)",
+            "2026-04-06T03:45:01Z INFO  [notification-service] Email batch #4445 sent successfully (15 emails)",
+            "2026-04-06T04:00:00Z WARN  [notification-service] Auth token validation taking 2800ms (SLA: 500ms)",
+            "2026-04-06T04:00:02Z WARN  [notification-service] Email delivery queue depth: 2400 (normal: <100)",
+            "2026-04-06T04:00:05Z ERROR [notification-service] Failed to validate sender auth for notification batch #8832 — auth-service timeout",
+            "2026-04-06T04:00:06Z WARN  [notification-service] Pausing email delivery until auth validation recovers",
+            "2026-04-06T04:00:10Z WARN  [notification-service] Queue depth growing: 2800 pending emails",
+            "2026-04-06T04:00:15Z ERROR [notification-service] Auth validation timeout for batch #8833",
+            "2026-04-06T04:00:20Z WARN  [notification-service] Queue depth: 3200 — SLA breach imminent for time-sensitive notifications",
+        ],
+        "api-gateway": [
+            "2026-04-06T03:59:55Z INFO  [api-gateway] Route: POST /api/v2/login -> auth-service (200, 45ms)",
+            "2026-04-06T03:59:58Z INFO  [api-gateway] Route: POST /api/v2/pay -> payment-service (200, 92ms)",
+            "2026-04-06T04:00:01Z ERROR [api-gateway] Route: POST /api/v2/pay -> payment-service (503, connection refused)",
+            "2026-04-06T04:00:02Z WARN  [api-gateway] Route: POST /api/v2/login -> auth-service (200, 1800ms) — slow",
+            "2026-04-06T04:00:03Z INFO  [api-gateway] Route: GET /api/v2/user/profile -> user-service (200, 28ms)",
+            "2026-04-06T04:00:05Z ERROR [api-gateway] Route: POST /api/v2/pay -> payment-service (503, connection refused)",
+            "2026-04-06T04:00:06Z WARN  [api-gateway] Circuit breaker OPEN for payment-service (failures=5, threshold=5)",
+            "2026-04-06T04:00:08Z ERROR [api-gateway] Route: POST /api/v2/verify -> auth-service (504, timeout after 5000ms)",
+            "2026-04-06T04:00:10Z INFO  [api-gateway] Route: GET /api/v2/user/settings -> user-service (200, 25ms)",
+            "2026-04-06T04:00:12Z ERROR [api-gateway] Route: POST /api/v2/pay -> payment-service (503, circuit breaker open)",
+            "2026-04-06T04:00:15Z WARN  [api-gateway] Route: POST /api/v2/login -> auth-service (200, 3200ms) — very slow",
+            "2026-04-06T04:00:18Z ERROR [api-gateway] Route: POST /api/v2/verify -> auth-service (504, timeout after 5000ms)",
+            "2026-04-06T04:00:20Z INFO  [api-gateway] Route: GET /api/v2/user/profile -> user-service (200, 30ms)",
+        ],
+        "db-postgres": [
+            "2026-04-06T03:55:00Z INFO  [db-postgres] Active connections: 42/100",
+            "2026-04-06T04:00:00Z INFO  [db-postgres] Active connections: 58/100",
+            "2026-04-06T04:00:01Z INFO  [db-postgres] Checkpoint starting: time-based",
+            "2026-04-06T04:00:03Z INFO  [db-postgres] Checkpoint complete: wrote 1450 buffers (9.8%)",
+            "2026-04-06T04:00:05Z INFO  [db-postgres] Higher than normal read load — auth-service fallback queries detected",
+            "2026-04-06T04:00:10Z INFO  [db-postgres] Active connections: 62/100 — elevated but within limits",
+            "2026-04-06T04:00:15Z INFO  [db-postgres] No deadlocks detected. Lock wait queue empty.",
+            "2026-04-06T04:00:20Z INFO  [db-postgres] Autovacuum: processing table sessions (dead tuples: 850)",
+        ],
+    },
+    metrics={
+        "payment-service": [
+            {"timestamp": "2026-04-06T03:55:00Z", "cpu_pct": 18, "mem_pct": 40, "latency_p50": 88, "latency_p99": 155, "error_rate": 0.001, "pods_ready": 2, "pods_total": 2},
+            {"timestamp": "2026-04-06T04:00:00Z", "cpu_pct": 0, "mem_pct": 0, "latency_p50": 0, "latency_p99": 0, "error_rate": 1.0, "pods_ready": 0, "pods_total": 2},
+        ],
+        "cache-redis": [
+            {"timestamp": "2026-04-06T02:00:00Z", "mem_gb": 2.4, "mem_pct": 60, "hit_ratio": 0.92, "evictions_per_s": 0, "connections": 45, "fragmentation_ratio": 1.1},
+            {"timestamp": "2026-04-06T02:30:00Z", "mem_gb": 2.6, "mem_pct": 65, "hit_ratio": 0.91, "evictions_per_s": 0, "connections": 46, "fragmentation_ratio": 1.2},
+            {"timestamp": "2026-04-06T03:00:00Z", "mem_gb": 2.8, "mem_pct": 70, "hit_ratio": 0.90, "evictions_per_s": 5, "connections": 47, "fragmentation_ratio": 1.3},
+            {"timestamp": "2026-04-06T03:30:00Z", "mem_gb": 3.4, "mem_pct": 85, "hit_ratio": 0.72, "evictions_per_s": 1200, "connections": 48, "fragmentation_ratio": 1.5},
+            {"timestamp": "2026-04-06T03:45:00Z", "mem_gb": 3.7, "mem_pct": 92, "hit_ratio": 0.55, "evictions_per_s": 1800, "connections": 48, "fragmentation_ratio": 1.7},
+            {"timestamp": "2026-04-06T04:00:00Z", "mem_gb": 3.85, "mem_pct": 96, "hit_ratio": 0.35, "evictions_per_s": 2200, "connections": 47, "fragmentation_ratio": 1.8},
+        ],
+        "auth-service": [
+            {"timestamp": "2026-04-06T03:00:00Z", "cpu_pct": 22, "mem_pct": 58, "latency_p50": 12, "latency_p99": 45, "error_rate": 0.001, "cache_hit_ratio": 0.90, "db_fallback_pct": 0.10},
+            {"timestamp": "2026-04-06T03:30:00Z", "cpu_pct": 28, "mem_pct": 60, "latency_p50": 25, "latency_p99": 180, "error_rate": 0.005, "cache_hit_ratio": 0.72, "db_fallback_pct": 0.28},
+            {"timestamp": "2026-04-06T03:45:00Z", "cpu_pct": 35, "mem_pct": 62, "latency_p50": 120, "latency_p99": 1200, "error_rate": 0.05, "cache_hit_ratio": 0.55, "db_fallback_pct": 0.45},
+            {"timestamp": "2026-04-06T04:00:00Z", "cpu_pct": 42, "mem_pct": 65, "latency_p50": 800, "latency_p99": 3200, "error_rate": 0.15, "cache_hit_ratio": 0.35, "db_fallback_pct": 0.65},
+        ],
+        "user-service": [
+            {"timestamp": "2026-04-06T03:00:00Z", "cpu_pct": 15, "mem_pct": 35, "latency_p50": 28, "latency_p99": 75, "error_rate": 0.001},
+            {"timestamp": "2026-04-06T04:00:00Z", "cpu_pct": 15, "mem_pct": 35, "latency_p50": 30, "latency_p99": 82, "error_rate": 0.001},
+        ],
+        "notification-service": [
+            {"timestamp": "2026-04-06T03:45:00Z", "cpu_pct": 12, "mem_pct": 30, "queue_depth": 15, "auth_validation_ms": 48, "emails_sent_per_min": 120},
+            {"timestamp": "2026-04-06T04:00:00Z", "cpu_pct": 14, "mem_pct": 32, "queue_depth": 2400, "auth_validation_ms": 2800, "emails_sent_per_min": 5},
+        ],
+        "api-gateway": [
+            {"timestamp": "2026-04-06T03:55:00Z", "cpu_pct": 20, "mem_pct": 45, "latency_p50": 35, "latency_p99": 95, "error_rate": 0.002, "5xx_rate": 0.001},
+            {"timestamp": "2026-04-06T04:00:00Z", "cpu_pct": 28, "mem_pct": 48, "latency_p50": 120, "latency_p99": 5200, "error_rate": 0.35, "5xx_rate": 0.32},
+        ],
+        "db-postgres": [
+            {"timestamp": "2026-04-06T03:55:00Z", "cpu_pct": 35, "mem_pct": 55, "connections": 42, "active_locks": 2, "deadlocks": 0, "write_iops": 1200, "read_iops": 3500},
+            {"timestamp": "2026-04-06T04:00:00Z", "cpu_pct": 45, "mem_pct": 58, "connections": 62, "active_locks": 3, "deadlocks": 0, "write_iops": 1100, "read_iops": 4800},
+        ],
+    },
+    traces={
+        "payment-service": [
+            "No recent traces — service is down (CrashLoopBackOff). Last successful trace (before deploy):",
+            "Trace: POST /api/v2/pay (txn=pay_9901, total=92ms) — v3.8.1",
+            "  ├─ payment-service.validateRequest()      8ms",
+            "  ├─ payment-service.checkBalance()         25ms  (SELECT -> db-postgres)",
+            "  ├─ payment-service.insertTransaction()    40ms  (INSERT -> db-postgres)",
+            "  └─ payment-service.sendConfirmation()     19ms",
+        ],
+        "auth-service": [
+            "Trace: POST /auth/verify (uid=user_8832, total=3200ms)",
+            "  ├─ auth-service.checkSessionCache()       8ms    (cache-redis MISS)",
+            "  ├─ auth-service.fallbackDBLookup()        2900ms (db-postgres — under load from fallback traffic)",
+            "  ├─ auth-service.validateToken()            45ms",
+            "  └─ auth-service.writeBackToCache()         FAILED (redis OOM rejected write)",
+        ],
+        "notification-service": [
+            "Trace: POST /notifications/send (batch=#8832, total=5200ms) — TIMEOUT",
+            "  ├─ notification-service.prepareBatch()     12ms",
+            "  ├─ notification-service.validateAuth()     5000ms (-> auth-service TIMEOUT)",
+            "  └─ notification-service.sendEmails()       never reached",
+        ],
+    },
+    deploy_history={
+        "payment-service": [
+            "v3.8.2  deployed 2026-04-06T04:00:00Z  status=CrashLoopBackOff  (deployed 15 min ago)",
+            "v3.8.1  deployed 2026-04-03T14:00:00Z  status=superseded  (was stable for 3 days)",
+            "v3.8.0  deployed 2026-03-28T10:00:00Z  status=superseded",
+        ],
+        "auth-service": [
+            "v2.14.0  deployed 2026-04-01T10:00:00Z  status=stable  (running 5 days, no issues)",
+        ],
+        "cache-redis": [
+            "v7.2.4  deployed 2026-03-20T09:00:00Z  status=stable  (running 17 days)",
+        ],
+        "user-service": [
+            "v4.2.1  deployed 2026-04-05T16:00:00Z  status=stable  (running 12 hours)",
+        ],
+    },
+    runbooks={
+        "payment-service": (
+            "## payment-service Runbook\n"
+            "- Crash on startup / CrashLoopBackOff: Check recent deploys. If the latest deploy\n"
+            "  introduced the crash, rollback to previous known-good version:\n"
+            "  rollback_deploy(service='payment-service', target_version='<previous_version>')\n"
+            "  Check deploy history for the last stable version.\n"
+            "- Transaction timeouts: Check db-postgres connection pool and lock status.\n"
+            "- High latency: Check downstream service health (db-postgres)."
+        ),
+        "cache-redis": (
+            "## cache-redis Runbook\n"
+            "- Memory pressure / approaching maxmemory: Check memory trend in metrics.\n"
+            "  If memory grows despite eviction, likely a memory leak.\n"
+            "  Short-term fix: restart_service to clear leaked memory.\n"
+            "  Alternative: scale_up to add more replicas and distribute load.\n"
+            "- Elevated miss ratio: If caused by memory pressure/eviction storm, fix memory issue first.\n"
+            "  If caused by TTL expiry batch, wait for cache to warm back up."
+        ),
+        "auth-service": (
+            "## auth-service Runbook\n"
+            "- High latency / DB fallback: Check cache-redis health. If redis is degraded,\n"
+            "  auth-service falls back to DB lookups which are 10-50x slower.\n"
+            "  Fix redis first — auth-service will recover automatically.\n"
+            "- Cache write failures: Redis may be rejecting writes due to OOM. Check redis memory."
+        ),
+        "notification-service": (
+            "## notification-service Runbook\n"
+            "- Queue backing up: Usually caused by auth-service degradation. Notification-service\n"
+            "  validates sender auth before sending. If auth is slow, queue grows.\n"
+            "  Fix auth-service first — queue will drain automatically."
+        ),
+    },
+    configs={
+        "payment-service": {
+            "current": "DB_POOL_SIZE=50\nDB_TIMEOUT=5000\nRETRY_COUNT=3\nVALIDATOR_VERSION=v2\nFEATURE_NEW_VALIDATION=true",
+            "previous": "DB_POOL_SIZE=50\nDB_TIMEOUT=5000\nRETRY_COUNT=3\nVALIDATOR_VERSION=v1\nFEATURE_NEW_VALIDATION=false",
+            "diff": "Changed VALIDATOR_VERSION from v1 to v2, enabled FEATURE_NEW_VALIDATION (part of v3.8.2 deploy)",
+        },
+        "user-service": {
+            "current": "FEATURE_PROFILE_V2=true\nLEGACY_AVATAR_URL=https://cdn.example.com/avatars\nDB_POOL_SIZE=30",
+            "previous": "FEATURE_PROFILE_V2=false\nDB_POOL_SIZE=30",
+            "diff": "Changed FEATURE_PROFILE_V2 from false to true and added LEGACY_AVATAR_URL (config change 30 min ago). 2 validation warnings but applied successfully.",
+        },
+        "cache-redis": {
+            "current": "maxmemory=4gb\nmaxmemory-policy=allkeys-lru\ntimeout=300\ntcp-keepalive=60",
+            "previous": "maxmemory=4gb\nmaxmemory-policy=allkeys-lru\ntimeout=300\ntcp-keepalive=60",
+            "diff": "No changes — config has not been modified recently.",
+        },
+    },
+    dependencies={
+        "api-gateway": ["auth-service", "user-service", "payment-service"],
+        "auth-service": ["cache-redis"],
+        "user-service": ["db-postgres"],
+        "payment-service": ["db-postgres"],
+        "db-postgres": [],
+        "cache-redis": [],
+        "notification-service": ["auth-service"],
+    },
+    root_cause_services=["payment-service", "cache-redis"],
+    root_cause_categories=[RootCauseCategory.BAD_DEPLOY, RootCauseCategory.MEMORY_LEAK],
+    required_fixes=[
+        RequiredFix(action="rollback_deploy", service="payment-service", target_version="v3.8.1"),
+        RequiredFix(action="restart_service", service="cache-redis"),
+    ],
+    diagnosis_keywords=[
+        "payment-service", "deploy", "rollback", "v3.8.2", "v3.8.1", "NullPointerException", "crash",
+        "cache-redis", "memory", "leak", "eviction", "auth-service", "fallback",
+    ],
+    weights={
+        "correct_service": 0.15,
+        "correct_category": 0.10,
+        "correct_fix": 0.15,
+        "secondary_fix": 0.20,
+        "diagnosis_text": 0.15,
+        "investigation": 0.10,
+        "wrong_penalty": 0.05,
+    },
+)
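
Note on the `dependencies` map in the scenario above: one way to check that the degraded services are consistent with the two listed root causes is to walk the map in reverse, from each root cause to its callers. The following is a minimal sketch and not code from this commit. `blast_radius` is a hypothetical helper; the dictionary and the root-cause list are copied from the `dependencies` and `root_cause_services` fields.

```python
from collections import deque

# Copied from the scenario's `dependencies` field:
# each service maps to the services it calls.
DEPENDENCIES = {
    "api-gateway": ["auth-service", "user-service", "payment-service"],
    "auth-service": ["cache-redis"],
    "user-service": ["db-postgres"],
    "payment-service": ["db-postgres"],
    "db-postgres": [],
    "cache-redis": [],
    "notification-service": ["auth-service"],
}
# Copied from the scenario's `root_cause_services` field.
ROOT_CAUSES = ["payment-service", "cache-redis"]


def blast_radius(dependencies, root_causes):
    """Return the services that transitively depend on any root-cause service.

    Hypothetical helper for illustration; it is not part of the environment.
    """
    # Invert the graph: callee -> list of callers.
    callers = {name: [] for name in dependencies}
    for service, deps in dependencies.items():
        for dep in deps:
            callers.setdefault(dep, []).append(service)

    # Breadth-first walk upward from the root causes.
    affected = set()
    queue = deque(root_causes)
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in affected and caller not in root_causes:
                affected.add(caller)
                queue.append(caller)
    return affected


print(sorted(blast_radius(DEPENDENCIES, ROOT_CAUSES)))
# prints ['api-gateway', 'auth-service', 'notification-service']
```

The result excludes user-service and db-postgres, which matches the logs: user-service stays healthy, and db-postgres only sees extra read load through the auth-service fallback path, an edge that the `dependencies` map does not declare.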

tasks/medium_deadlock.py ADDED

	@@ -0,0 +1,298 @@

+"""
+Task: Cascading Database Deadlock
+To add a new task, copy this file, modify the SCENARIO definition, and place it in tasks/.
+The task loader will auto-discover it.
+"""
+from env.scenario import IncidentScenario, RequiredFix, ServiceConfig
+from models import RootCauseCategory, ServiceStatus
+SCENARIO = IncidentScenario(
+    task_id="medium",
+    name="Cascading Database Deadlock",
+    difficulty="medium",
+    max_steps=25,
+    incident_summary=(
+        "Multiple alerts fired at 03:05 UTC. payment-service and user-service both showing elevated "
+        "error rates and latency. Transaction timeouts increasing. cache-redis also flagged with "
+        "elevated miss ratio. Need to identify root cause and restore write path."
+    ),
+    services={
+        "api-gateway": ServiceConfig(
+            status=ServiceStatus.DEGRADED, deps=["auth-service", "user-service", "payment-service"],
+            version="v1.12.0", replicas=3,
+        ),
+        "auth-service": ServiceConfig(
+            status=ServiceStatus.HEALTHY, deps=["cache-redis"],
+            version="v2.14.0", replicas=2,
+        ),
+        "user-service": ServiceConfig(
+            status=ServiceStatus.DEGRADED, deps=["db-postgres"],
+            version="v4.2.1", replicas=2,
+        ),
+        "payment-service": ServiceConfig(
+            status=ServiceStatus.DEGRADED, deps=["db-postgres"],
+            version="v3.8.1", replicas=2,
+        ),
+        "db-postgres": ServiceConfig(
+            status=ServiceStatus.DEGRADED, deps=[],
+            version="v15.4", replicas=1, is_root_cause=True, fault_type="db_deadlock",
+        ),
+        "cache-redis": ServiceConfig(
+            status=ServiceStatus.HEALTHY, deps=[],
+            version="v7.2.4", replicas=1,
+        ),
+        "notification-service": ServiceConfig(
+            status=ServiceStatus.HEALTHY, deps=["auth-service"],
+            version="v1.5.0", replicas=1,
+        ),
+    },
+    initial_alerts=[
+        "[ALERT SEV-2] payment-service: transaction timeouts >15%, p99 latency >2s",
+        "[ALERT SEV-2] user-service: elevated error rate on profile updates",
+        "[ALERT SEV-3] cache-redis: cache miss ratio elevated (informational)",
+    ],
+    logs={
+        "payment-service": [
+            "2026-04-06T03:00:01Z INFO  [payment-service] Processing payment txn=pay_8832 amount=$45.00 -> db-postgres",
+            "2026-04-06T03:00:02Z INFO  [payment-service] Payment completed txn=pay_8832 latency=85ms",
+            "2026-04-06T03:00:10Z INFO  [payment-service] Processing payment txn=pay_1120 amount=$12.99 -> db-postgres",
+            "2026-04-06T03:00:11Z INFO  [payment-service] Payment completed txn=pay_1120 latency=92ms",
+            "2026-04-06T03:00:20Z INFO  [payment-service] Processing payment txn=pay_3341 amount=$199.00 -> db-postgres",
+            "2026-04-06T03:00:21Z INFO  [payment-service] Payment completed txn=pay_3341 latency=78ms",
+            "2026-04-06T03:01:00Z INFO  [payment-service] Health check /healthz -> 200 OK",
+            "2026-04-06T03:02:00Z INFO  [payment-service] Processing payment txn=pay_5590 amount=$25.00 -> db-postgres",
+            "2026-04-06T03:02:01Z INFO  [payment-service] Payment completed txn=pay_5590 latency=95ms",
+            "2026-04-06T03:03:00Z INFO  [payment-service] Processing payment txn=pay_6612 amount=$150.00 -> db-postgres",
+            "2026-04-06T03:03:01Z INFO  [payment-service] Payment completed txn=pay_6612 latency=88ms",
+            "2026-04-06T03:04:00Z INFO  [payment-service] Health check /healthz -> 200 OK",
+            "2026-04-06T03:05:00Z INFO  [payment-service] Processing payment txn=pay_7789 amount=$55.00 -> db-postgres",
+            "2026-04-06T03:05:12Z WARN  [payment-service] Slow query: INSERT INTO transactions (...) took 3200ms (threshold: 500ms)",
+            "2026-04-06T03:05:15Z INFO  [payment-service] Payment completed txn=pay_7789 latency=3250ms",
+            "2026-04-06T03:05:16Z INFO  [payment-service] Processing payment txn=pay_4455 amount=$67.00 -> db-postgres",
+            "2026-04-06T03:05:18Z WARN  [payment-service] DB connection pool: 48/50 active (96% utilized)",
+            "2026-04-06T03:05:20Z ERROR [payment-service] Transaction timeout: txn=pay_4455 exceeded 5000ms deadline",
+            "2026-04-06T03:05:20Z ERROR [payment-service] Retrying txn=pay_4455 (attempt 2/3)",
+            "2026-04-06T03:05:25Z ERROR [payment-service] Transaction timeout: txn=pay_4455 exceeded 5000ms deadline (retry 2)",
+            "2026-04-06T03:05:25Z ERROR [payment-service] Transaction failed permanently: txn=pay_4455 after 3 retries",
+            "2026-04-06T03:05:26Z WARN  [payment-service] DB connection pool: 50/50 active (SATURATED) — new requests queuing",
+            "2026-04-06T03:05:28Z ERROR [payment-service] Connection acquisition timeout: waited 10s for available connection",
+            "2026-04-06T03:05:30Z INFO  [payment-service] Read query SELECT balance WHERE user_id=... completed in 45ms",
+            "2026-04-06T03:05:32Z ERROR [payment-service] Transaction timeout: txn=pay_6691 exceeded 5000ms deadline",
+            "2026-04-06T03:05:33Z WARN  [payment-service] Circuit breaker WARNING for db-postgres writes (failures=8/10 threshold)",
+            "2026-04-06T03:05:35Z ERROR [payment-service] Transaction timeout: txn=pay_7801 exceeded 5000ms deadline",
+            "2026-04-06T03:05:40Z ERROR [payment-service] Transaction timeout: txn=pay_8912 exceeded 5000ms deadline",
+            "2026-04-06T03:06:00Z ERROR [payment-service] Connection acquisition timeout: waited 15s for available connection",
+            "2026-04-06T03:07:00Z ERROR [payment-service] 12 transactions failed in last 5 minutes. Write path severely degraded.",
+            "2026-04-06T03:08:00Z ERROR [payment-service] 15 transactions failed in last 5 minutes. Write path severely degraded.",
+        ],
+        "user-service": [
+            "2026-04-06T03:00:01Z INFO  [user-service] GET /users/profile uid=user_4421 -> 200 (32ms)",
+            "2026-04-06T03:00:05Z INFO  [user-service] GET /users/settings uid=user_8832 -> 200 (28ms)",
+            "2026-04-06T03:00:10Z INFO  [user-service] PUT /users/profile uid=user_3310 -> 200 (85ms)",
+            "2026-04-06T03:01:00Z INFO  [user-service] GET /users/profile uid=user_1101 -> 200 (30ms)",
+            "2026-04-06T03:02:00Z INFO  [user-service] PUT /users/settings uid=user_5571 -> 200 (78ms)",
+            "2026-04-06T03:03:00Z INFO  [user-service] GET /users/profile uid=user_7712 -> 200 (27ms)",
+            "2026-04-06T03:04:00Z INFO  [user-service] GET /users/profile uid=user_2209 -> 200 (31ms)",
+            "2026-04-06T03:05:10Z INFO  [user-service] GET /users/profile uid=user_9901 -> 200 (29ms)",
+            "2026-04-06T03:05:15Z INFO  [user-service] GET /users/profile uid=user_6633 -> 200 (26ms)",
+            "2026-04-06T03:05:18Z WARN  [user-service] Slow mutation: UPDATE users SET email=... took 4100ms",
+            "2026-04-06T03:05:20Z ERROR [user-service] Profile update failed: uid=user_8832 — database lock acquisition timeout",
+            "2026-04-06T03:05:22Z INFO  [user-service] GET /users/profile uid=user_1101 -> 200 (28ms)",
+            "2026-04-06T03:05:25Z ERROR [user-service] Profile update failed: uid=user_3310 — database lock acquisition timeout",
+            "2026-04-06T03:05:26Z WARN  [user-service] Write operations failing at 60% rate, reads unaffected",
+            "2026-04-06T03:05:30Z INFO  [user-service] GET /users/profile uid=user_4482 -> 200 (30ms)",
+            "2026-04-06T03:06:00Z ERROR [user-service] Profile update failed: uid=user_5510 — database lock acquisition timeout",
+            "2026-04-06T03:06:05Z INFO  [user-service] GET /users/settings uid=user_7781 -> 200 (25ms)",
+            "2026-04-06T03:07:00Z WARN  [user-service] Write operations failing at 75% rate, reads unaffected",
+        ],
+        "db-postgres": [
+            "2026-04-06T02:55:00Z INFO  [db-postgres] Connection from analytics-cron@10.0.3.42: BEGIN; SELECT ... FROM transactions JOIN users ... (full table scan)",
+            "2026-04-06T02:55:01Z INFO  [db-postgres] Query plan: Seq Scan on transactions (rows=2.4M, cost=45000..89000)",
+            "2026-04-06T02:55:01Z WARN  [db-postgres] Long-running transaction txid=8830012 holding RowExclusiveLock on transactions table",
+            "2026-04-06T02:56:00Z INFO  [db-postgres] Active connections: 55/100",
+            "2026-04-06T02:58:00Z INFO  [db-postgres] Active connections: 68/100",
+            "2026-04-06T03:00:00Z INFO  [db-postgres] Checkpoint starting: time-based",
+            "2026-04-06T03:00:02Z INFO  [db-postgres] Checkpoint complete: wrote 1204 buffers (8.2%)",
+            "2026-04-06T03:00:05Z INFO  [db-postgres] Active connections: 70/100",
+            "2026-04-06T03:02:00Z INFO  [db-postgres] Active connections: 78/100",
+            "2026-04-06T03:04:00Z INFO  [db-postgres] Active connections: 88/100",
+            "2026-04-06T03:05:10Z WARN  [db-postgres] Deadlock detected: process 4821 (payment-service) waiting for RowExclusiveLock on transactions, blocked by process 4455 (analytics-cron)",
+            "2026-04-06T03:05:10Z WARN  [db-postgres] Deadlock detected: process 4830 (user-service) waiting for RowExclusiveLock on users, blocked by process 4455 (analytics-cron)",
+            "2026-04-06T03:05:11Z INFO  [db-postgres] Active connections: 95/100 (analytics-cron holding 1, payment-service pool 50, user-service pool 30, other 14)",
+            "2026-04-06T03:05:15Z WARN  [db-postgres] Long-running transaction txid=8830012 has been active for 10m15s — consider terminating",
+            "2026-04-06T03:05:20Z WARN  [db-postgres] Lock wait queue depth: 12 processes waiting",
+            "2026-04-06T03:06:00Z INFO  [db-postgres] SELECT queries completing normally (read path unaffected)",
+            "2026-04-06T03:06:30Z WARN  [db-postgres] Connection pool nearing limit: 98/100 active",
+            "2026-04-06T03:07:00Z WARN  [db-postgres] Lock wait queue depth: 18 processes waiting — growing",
+            "2026-04-06T03:08:00Z ERROR [db-postgres] Connection limit reached: 100/100 — rejecting new connections",
+        ],
+        "auth-service": [
+            "2026-04-06T03:00:00Z INFO  [auth-service] Request processed: POST /auth/token uid=user_8832 latency=42ms",
+            "2026-04-06T03:00:05Z INFO  [auth-service] Cache hit for session sid=a8f32c, returning cached token",
+            "2026-04-06T03:05:00Z INFO  [auth-service] Request processed: POST /auth/verify uid=user_3310 latency=38ms",
+            "2026-04-06T03:05:10Z INFO  [auth-service] Request processed: POST /auth/token uid=user_5571 latency=45ms",
+            "2026-04-06T03:05:30Z INFO  [auth-service] Health check /healthz -> 200 OK",
+            "2026-04-06T03:08:00Z INFO  [auth-service] Request processed: POST /auth/verify uid=user_1101 latency=40ms",
+        ],
+        "cache-redis": [
+            "2026-04-06T03:00:00Z INFO  [cache-redis] Memory usage: 1.2GB/4.0GB (30%)",
+            "2026-04-06T03:00:01Z INFO  [cache-redis] Cache hit ratio: 82% (normal: 85-95%)",
+            "2026-04-06T03:02:00Z INFO  [cache-redis] Cache hit ratio: 80%",
+            "2026-04-06T03:05:00Z INFO  [cache-redis] Cache hit ratio: 78% — slight decrease",
+            "2026-04-06T03:05:01Z INFO  [cache-redis] Key evictions: 45 in last 5m (within normal range)",
+            "2026-04-06T03:05:02Z WARN  [cache-redis] Cache miss ratio elevated for prefix auth:session:* — possible cache warming after TTL expiry batch",
+            "2026-04-06T03:05:10Z INFO  [cache-redis] Memory usage: 1.3GB/4.0GB (32%) — stable",
+            "2026-04-06T03:06:00Z INFO  [cache-redis] Cache hit ratio recovering: 84%",
+            "2026-04-06T03:08:00Z INFO  [cache-redis] Cache hit ratio: 88% — back to normal",
+        ],
+        "api-gateway": [
+            "2026-04-06T03:00:01Z INFO  [api-gateway] Route: POST /api/v2/login -> auth-service (200, 45ms)",
+            "2026-04-06T03:00:02Z INFO  [api-gateway] Route: POST /api/v2/pay -> payment-service (200, 88ms)",
+            "2026-04-06T03:05:20Z WARN  [api-gateway] Route: POST /api/v2/pay -> payment-service (504, 5200ms)",
+            "2026-04-06T03:05:22Z WARN  [api-gateway] Route: PUT /api/v2/user/profile -> user-service (504, 4800ms)",
+            "2026-04-06T03:05:25Z INFO  [api-gateway] Route: GET /api/v2/user/profile -> user-service (200, 30ms)",
+            "2026-04-06T03:05:30Z INFO  [api-gateway] Route: POST /api/v2/login -> auth-service (200, 42ms)",
+            "2026-04-06T03:06:00Z ERROR [api-gateway] Route: POST /api/v2/pay -> payment-service (504, timeout)",
+        ],
+        "notification-service": [
+            "2026-04-06T03:00:00Z INFO  [notification-service] Email batch #4430 sent successfully (10 emails)",
+            "2026-04-06T03:05:00Z INFO  [notification-service] Email batch #4435 sent successfully (7 emails)",
+            "2026-04-06T03:08:00Z INFO  [notification-service] Health check /healthz -> 200 OK",
+        ],
+    },
+    metrics={
+        "payment-service": [
+            {"timestamp": "2026-04-06T02:50:00Z", "cpu_pct": 20, "mem_pct": 40, "latency_p50": 80, "latency_p99": 150, "error_rate": 0.001, "db_pool_active": 15, "db_pool_max": 50},
+            {"timestamp": "2026-04-06T03:00:00Z", "cpu_pct": 20, "mem_pct": 40, "latency_p50": 85, "latency_p99": 160, "error_rate": 0.002, "db_pool_active": 18, "db_pool_max": 50},
+            {"timestamp": "2026-04-06T03:05:00Z", "cpu_pct": 22, "mem_pct": 41, "latency_p50": 3200, "latency_p99": 8500, "error_rate": 0.35, "db_pool_active": 50, "db_pool_max": 50},
+            {"timestamp": "2026-04-06T03:08:00Z", "cpu_pct": 18, "mem_pct": 40, "latency_p50": 4500, "latency_p99": "timeout", "error_rate": 0.52, "db_pool_active": 50, "db_pool_max": 50},
+        ],
+        "user-service": [
+            {"timestamp": "2026-04-06T02:50:00Z", "cpu_pct": 15, "mem_pct": 35, "latency_p50": 28, "latency_p99": 75, "error_rate": 0.001, "write_error_rate": 0.001},
+            {"timestamp": "2026-04-06T03:05:00Z", "cpu_pct": 16, "mem_pct": 35, "latency_p50": 30, "latency_p99": 4100, "error_rate": 0.18, "write_error_rate": 0.60},
+            {"timestamp": "2026-04-06T03:08:00Z", "cpu_pct": 15, "mem_pct": 35, "latency_p50": 28, "latency_p99": "timeout", "error_rate": 0.25, "write_error_rate": 0.75},
+        ],
+        "db-postgres": [
+            {"timestamp": "2026-04-06T02:50:00Z", "cpu_pct": 35, "mem_pct": 60, "connections": 45, "active_locks": 3, "lock_wait_ms_p99": 5, "write_iops": 1200, "read_iops": 3500, "deadlocks": 0},
+            {"timestamp": "2026-04-06T02:55:00Z", "cpu_pct": 55, "mem_pct": 62, "connections": 55, "active_locks": 8, "lock_wait_ms_p99": 15, "write_iops": 1200, "read_iops": 4200, "deadlocks": 0},
+            {"timestamp": "2026-04-06T03:00:00Z", "cpu_pct": 65, "mem_pct": 64, "connections": 70, "active_locks": 15, "lock_wait_ms_p99": 250, "write_iops": 800, "read_iops": 4000, "deadlocks": 0},
+            {"timestamp": "2026-04-06T03:05:00Z", "cpu_pct": 78, "mem_pct": 65, "connections": 95, "active_locks": 28, "lock_wait_ms_p99": 8500, "write_iops": 200, "read_iops": 3800, "deadlocks": 4},
+            {"timestamp": "2026-04-06T03:08:00Z", "cpu_pct": 80, "mem_pct": 66, "connections": 100, "active_locks": 32, "lock_wait_ms_p99": 12000, "write_iops": 50, "read_iops": 3600, "deadlocks": 12},
+        ],
+        "auth-service": [
+            {"timestamp": "2026-04-06T03:00:00Z", "cpu_pct": 22, "mem_pct": 58, "latency_p50": 42, "latency_p99": 110, "error_rate": 0.001},
+            {"timestamp": "2026-04-06T03:08:00Z", "cpu_pct": 23, "mem_pct": 58, "latency_p50": 44, "latency_p99": 115, "error_rate": 0.001},
+        ],
+        "cache-redis": [
+            {"timestamp": "2026-04-06T03:00:00Z", "mem_gb": 1.2, "mem_pct": 30, "hit_ratio": 0.82, "evictions_per_s": 8, "connections": 46},
+            {"timestamp": "2026-04-06T03:05:00Z", "mem_gb": 1.3, "mem_pct": 32, "hit_ratio": 0.78, "evictions_per_s": 12, "connections": 46},
+            {"timestamp": "2026-04-06T03:08:00Z", "mem_gb": 1.2, "mem_pct": 30, "hit_ratio": 0.88, "evictions_per_s": 2, "connections": 45},
+        ],
+        "api-gateway": [
+            {"timestamp": "2026-04-06T03:00:00Z", "cpu_pct": 20, "mem_pct": 45, "latency_p50": 35, "latency_p99": 90, "error_rate": 0.002, "5xx_rate": 0.001},
+            {"timestamp": "2026-04-06T03:05:00Z", "cpu_pct": 22, "mem_pct": 46, "latency_p50": 45, "latency_p99": 5500, "error_rate": 0.18, "5xx_rate": 0.15},
+            {"timestamp": "2026-04-06T03:08:00Z", "cpu_pct": 23, "mem_pct": 46, "latency_p50": 50, "latency_p99": "timeout", "error_rate": 0.25, "5xx_rate": 0.22},
+        ],
+    },
+    traces={
+        "payment-service": [
+            "Trace: POST /api/v2/pay (txn=pay_6691, total=8500ms) — TIMEOUT",
+            "  ├─ payment-service.validateRequest()      12ms",
+            "  ├─ payment-service.checkBalance()         45ms   (SELECT -> db-postgres, fast)",
+            "  ├─ payment-service.insertTransaction()    8400ms (INSERT -> db-postgres, BLOCKED ON LOCK)",
+            "  └─ payment-service.sendConfirmation()     never reached (timeout)",
+        ],
+        "user-service": [
+            "Trace: PUT /api/v2/user/profile (uid=user_8832, total=4800ms) — TIMEOUT",
+            "  ├─ user-service.validateInput()           5ms",
+            "  ├─ user-service.updateProfile()           4780ms (UPDATE -> db-postgres, BLOCKED ON LOCK)",
+            "  └─ user-service.invalidateCache()         never reached (timeout)",
+        ],
+    },
+    deploy_history={
+        "payment-service": [
+            "v3.8.1  deployed 2026-04-03T14:00:00Z  status=stable  (running 3 days)",
+        ],
+        "user-service": [
+            "v4.2.1  deployed 2026-04-05T16:00:00Z  status=stable  (running 11 hours)",
+        ],
+        "db-postgres": [
+            "v15.4  deployed 2026-03-15T08:00:00Z  status=stable  (running 22 days)",
+        ],
+    },
+    runbooks={
+        "payment-service": (
+            "## payment-service Runbook\n"
+            "- Transaction timeouts: Check db-postgres connection pool and lock status.\n"
+            "  If the DB connection pool is saturated but CPU/memory are normal, the issue is likely DB-side.\n"
+            "- High latency: Check downstream service health (db-postgres).\n"
+            "- Crash on startup: Check recent deploys and rollback if needed."
+        ),
+        "db-postgres": (
+            "## db-postgres Runbook\n"
+            "- Deadlocks: Identify the blocking transaction using pg_stat_activity.\n"
+            "  Kill long-running queries or restart postgres to clear all locks.\n"
+            "- Connection exhaustion: Check for connection leaks. Consider increasing max_connections\n"
+            "  or terminating idle connections.\n"
+            "- High CPU: Check for expensive queries in pg_stat_statements. Consider adding indexes.\n"
+            "- Replication lag: Check network connectivity to replicas and WAL sender status."
+        ),
+        "cache-redis": (
+            "## cache-redis Runbook\n"
+            "- Elevated miss ratio: Often caused by TTL expiry batches. Wait 5-10 minutes for cache\n"
+            "  to warm back up. If miss ratio doesn't recover, check maxmemory and eviction policy.\n"
+            "- Memory pressure: Check for memory leaks. Scale up replicas or increase maxmemory.\n"
+            "- Connection issues: Check network connectivity and client pool configuration."
+        ),
+    },
+    configs={
+        "db-postgres": {
+            "current": "max_connections=100\nshared_buffers=4GB\nwork_mem=256MB\nlock_timeout=30s\ndeadlock_timeout=1s",
+            "previous": "max_connections=100\nshared_buffers=4GB\nwork_mem=256MB\nlock_timeout=30s\ndeadlock_timeout=1s",
+            "diff": "No changes — config has not been modified recently.",
+        },
+        "payment-service": {
+            "current": "DB_POOL_SIZE=50\nDB_TIMEOUT=5000\nRETRY_COUNT=3\nCIRCUIT_BREAKER_THRESHOLD=10",
+            "previous": "DB_POOL_SIZE=50\nDB_TIMEOUT=5000\nRETRY_COUNT=3\nCIRCUIT_BREAKER_THRESHOLD=10",
+            "diff": "No changes — config has not been modified recently.",
+        },
+    },
+    dependencies={
+        "api-gateway": ["auth-service", "user-service", "payment-service"],
+        "auth-service": ["cache-redis"],
+        "user-service": ["db-postgres"],
+        "payment-service": ["db-postgres"],
+        "db-postgres": [],
+        "cache-redis": [],
+        "notification-service": ["auth-service"],
+    },
+    root_cause_services=["db-postgres"],
+    root_cause_categories=[RootCauseCategory.DB_DEADLOCK],
+    required_fixes=[
+        RequiredFix(action="restart_service", service="db-postgres"),
+    ],
+    diagnosis_keywords=["db-postgres", "deadlock", "lock", "analytics-cron", "long-running", "transaction", "blocking"],
+    weights={
+        "correct_service": 0.25,
+        "correct_category": 0.20,
+        "correct_fix": 0.25,
+        "secondary_fix": 0.00,
+        "diagnosis_text": 0.10,
+        "investigation": 0.10,
+        "wrong_penalty": 0.05,
+    },
+)
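The `dependencies` map in the scenario above can be read as a directed graph. The traces show both `payment-service` and `user-service` blocked on `db-postgres`. The following is a minimal sketch, not part of the commit, of how an agent or a reviewer might narrow the root-cause candidates by intersecting what the failing services depend on. The helper names `downstream` and `shared_dependencies` are hypothetical; only the dependency data is copied from the scenario.

```python
from collections import deque

# Dependency map copied verbatim from the scenario definition above.
DEPENDENCIES = {
    "api-gateway": ["auth-service", "user-service", "payment-service"],
    "auth-service": ["cache-redis"],
    "user-service": ["db-postgres"],
    "payment-service": ["db-postgres"],
    "db-postgres": [],
    "cache-redis": [],
    "notification-service": ["auth-service"],
}


def downstream(service, deps):
    """Return every service reachable from `service` via the dependency map.

    Hypothetical helper: a breadth-first walk that tolerates unknown
    services and cycles.
    """
    seen = set()
    queue = deque(deps.get(service, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(deps.get(current, []))
    return seen


def shared_dependencies(failing, deps):
    """Return services that every failing service depends on,
    directly or transitively. Hypothetical helper."""
    reachable = [downstream(service, deps) for service in failing]
    return set.intersection(*reachable) if reachable else set()


if __name__ == "__main__":
    # The two services whose traces show "BLOCKED ON LOCK".
    candidates = shared_dependencies(["payment-service", "user-service"], DEPENDENCIES)
    print(sorted(candidates))
```

With this scenario's data the intersection contains only `db-postgres`, which matches `root_cause_services`. A shared dependency is only a candidate, so the metrics and logs still have to confirm it.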