---
title: PostMortem Incident Triage OpenEnv
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
pinned: false
license: bsd-3-clause
tags:
  - openenv
  - rl
  - sre
  - incident-response
base_path: /web
---

# PostMortem — Live Incident Triage Environment

An OpenEnv environment where an LLM agent plays an on-call SRE responding to a live production incident. Real-world task, typed OpenEnv spec, deterministic grader, three difficulty tiers, dense process-reward signal.

## The task

On each episode the agent receives an alert. It must:

1. **ack** the incident (accept ownership)
2. **query_logs / query_metrics / query_traces** on services to gather evidence
3. **scope** the blast radius
4. **hypothesize** the root cause
5. **mitigate** (propose a concrete remediation)
6. **write_status** (post a customer-facing update)

All six verbs are exposed through a single typed action:

```python
PostmortemAction(tool="query_logs", args={"service": "api"})
```

## Action space

| tool | args | effect |
|----------------|---------------------------|--------------------------------------------|
| `ack` | `{}` | accept the incident (sub-goal 1) |
| `query_logs` | `{"service": str}` | return recent log lines |
| `query_metrics`| `{"service": str}` | return latest metrics |
| `query_traces` | `{"trace_id": str}` | return distributed trace spans |
| `scope` | `{"services": list[str]}` | declare blast radius (sub-goal 2) |
| `hypothesize` | `{"root_cause": str}` | declare root cause (sub-goal 3) |
| `mitigate` | `{"action": str}` | apply mitigation (sub-goal 4) |
| `write_status` | `{"text": str}` | publish update, ends episode (sub-goal 5) |

## Observation space

Key fields of `PostmortemObservation`:

- `task_id`, `task_description`, `available_services`, `available_trace_ids`
- `tool_result` — free text result of the last tool call
- `subgoals` — bool dict `{acked, scoped, hypothesized, mitigated, written}`
- `reward_so_far` — cumulative reward in [0, 1]
- `steps_remaining`, `last_error`
- `done`, `reward` (current step)
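As a concrete illustration of the six-step flow and the typed action above, here is a minimal scripted episode. The `PostmortemAction` stand-in dataclass, the `scripted_episode` helper, and the example arguments are illustrative only — the real typed action ships with this repo's OpenEnv spec, and a real agent would pick its queries and free text from the observation rather than a fixed script.

```python
from dataclasses import dataclass, field

@dataclass
class PostmortemAction:
    """Stand-in for the env's typed action (tool name + JSON-style args)."""
    tool: str
    args: dict = field(default_factory=dict)

def scripted_episode(services, trace_id):
    """Yield one canonical triage sequence:
    ack -> gather evidence -> scope -> hypothesize -> mitigate -> write_status."""
    yield PostmortemAction("ack")                                    # sub-goal 1
    yield PostmortemAction("query_logs", {"service": services[0]})   # gather evidence
    yield PostmortemAction("query_traces", {"trace_id": trace_id})
    yield PostmortemAction("scope", {"services": services})          # sub-goal 2
    yield PostmortemAction("hypothesize",
                           {"root_cause": "api pods OOM-killed under load"})   # sub-goal 3
    yield PostmortemAction("mitigate",
                           {"action": "raise api memory limit and restart"})   # sub-goal 4
    yield PostmortemAction("write_status",
                           {"text": "Root cause found; mitigation applied."})  # sub-goal 5, ends episode

actions = list(scripted_episode(["api"], "trace-abc123"))
```

An agent under training replaces the fixed strings with model output; the env grades whatever text arrives at each sub-goal.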
## Tasks (3 difficulty tiers)

On each `reset()` the env rotates to the next scenario. Running three resets in a row covers all three tiers in order.

| task_id | difficulty | incident |
|-------------------|------------|--------------------------------------------------------------|
| `easy_oom` | easy | `api` OOM-killed; cause directly visible in logs |
| `medium_cascade` | medium | checkout latency cascade; must correlate trace across 3 svcs |
| `hard_dns` | hard | 503s blamed on fresh `api` deploy, real cause is upstream DNS |

## Reward design

The reward is a **5-stage process-reward ladder** in `[0, 1]`:

```
ack          +0.10 (granted on first successful ack)
scope        +0.20 × Jaccard(agent_services, gold_services)
hypothesize  +0.20 × keyword_fraction(agent_text, gold_hypothesis_keywords)
mitigate     +0.20 × keyword_fraction(agent_text, gold_mitigation_keywords)
write_status +0.30 × keyword_fraction(agent_text, gold_writeup_keywords)
```

Each sub-goal is awarded once. The grader is fully **deterministic** — no LLM judge, no randomness. Partial credit gives a smooth gradient. The episode terminates when `write_status` fires or after `MAX_STEPS = 12`.

## Setup

```bash
pip install openenv-core
openenv build .      # build Docker image
python inference.py  # run baseline (3 scenarios)
```

### Required environment variables

| var | default | notes |
|----------------|------------------------------------|-------|
| `HF_TOKEN` | (required) | HuggingFace token, also used as the OpenAI client API key |
| `API_BASE_URL` | `https://router.huggingface.co/v1` | any OpenAI-compatible endpoint |
| `MODEL_NAME` | `Qwen/Qwen2.5-72B-Instruct` | any chat model |
| `IMAGE_NAME` | `postmortem_env-env:latest` | docker tag of the env image |

## Baseline reproduction

```bash
export HF_TOKEN=hf_...
export IMAGE_NAME=postmortem_env-env:latest
python inference.py
```

Emits strict `[START] / [STEP] / [END]` lines, one `[END]` per task.
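Because the grader is deterministic, the 5-stage ladder from the Reward design section can be reproduced offline, for example to sanity-check logged rollouts. The sketch below implements the same arithmetic; `episode_reward`, its dict shapes, and the gold keywords are illustrative assumptions, not the env's actual grading code.

```python
def jaccard(a, b):
    """Intersection-over-union of two service lists (scope credit)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def keyword_fraction(text, keywords):
    """Fraction of gold keywords found (case-insensitive) in the agent's text."""
    text = text.lower()
    return sum(k.lower() in text for k in keywords) / len(keywords)

def episode_reward(agent, gold):
    """Sum the 5-stage process-reward ladder; each sub-goal scores once."""
    r = 0.0
    if agent.get("acked"):
        r += 0.10
    r += 0.20 * jaccard(agent.get("services", []), gold["services"])
    r += 0.20 * keyword_fraction(agent.get("hypothesis", ""), gold["hypothesis_keywords"])
    r += 0.20 * keyword_fraction(agent.get("mitigation", ""), gold["mitigation_keywords"])
    r += 0.30 * keyword_fraction(agent.get("writeup", ""), gold["writeup_keywords"])
    return r
```

A fully correct episode sums to 1.0 (0.10 + 3 × 0.20 + 0.30); a partially correct scope or write-up degrades the score smoothly instead of zeroing it.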
## Resource budget

Runs well within the hackathon limits of **2 vCPU / 8 GB RAM**, and completes the 3-task sweep in **well under 20 minutes** (dominated by LLM latency, ≤ 36 LLM calls total).

## License

BSD-3-Clause (matches OpenEnv core).