---
title: PostMortem Incident Triage OpenEnv
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
pinned: false
license: bsd-3-clause
tags:
  - openenv
  - rl
  - sre
  - incident-response
base_path: /web
---
# PostMortem – Live Incident Triage Environment

An OpenEnv environment where an LLM agent plays an on-call SRE responding to a
live production incident. Real-world task, typed OpenEnv spec, deterministic
grader, three difficulty tiers, dense process-reward signal.
| ## The task | |
| On each episode the agent receives an alert. It must: | |
| 1. **ack** the incident (accept ownership) | |
| 2. **query_logs / query_metrics / query_traces** on services to gather evidence | |
| 3. **scope** the blast radius | |
| 4. **hypothesize** the root cause | |
| 5. **mitigate** (propose a concrete remediation) | |
| 6. **write_status** (post a customer-facing update) | |
| All six verbs are exposed as a single typed action: | |
| ```python | |
| PostmortemAction(tool="query_logs", args={"service": "api"}) | |
| ``` | |
| ## Action space | |
| tool            | args                      | effect                                    |
|-----------------|---------------------------|-------------------------------------------|
| `ack`           | `{}`                      | accept the incident (sub-goal 1)          |
| `query_logs`    | `{"service": str}`        | return recent log lines                   |
| `query_metrics` | `{"service": str}`        | return latest metrics                     |
| `query_traces`  | `{"trace_id": str}`       | return distributed trace spans            |
| `scope`         | `{"services": list[str]}` | declare blast radius (sub-goal 2)         |
| `hypothesize`   | `{"root_cause": str}`     | declare root cause (sub-goal 3)           |
| `mitigate`      | `{"action": str}`         | apply mitigation (sub-goal 4)             |
| `write_status`  | `{"text": str}`           | publish update, ends episode (sub-goal 5) |
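For illustration, the typed action can be sketched as a small dataclass. This is a hypothetical stand-in for the real `PostmortemAction` that ships with the env; it only validates the tool names from the table above:

```python
from dataclasses import dataclass, field

# Hypothetical sketch; the real PostmortemAction ships with the env package.
VALID_TOOLS = {
    "ack", "query_logs", "query_metrics", "query_traces",
    "scope", "hypothesize", "mitigate", "write_status",
}

@dataclass
class PostmortemAction:
    tool: str
    args: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject verbs outside the documented action space.
        if self.tool not in VALID_TOOLS:
            raise ValueError(f"unknown tool: {self.tool}")

# One action per sub-goal verb, mirroring the table above.
actions = [
    PostmortemAction(tool="ack"),
    PostmortemAction(tool="query_logs", args={"service": "api"}),
    PostmortemAction(tool="scope", args={"services": ["api", "db"]}),
    PostmortemAction(tool="hypothesize", args={"root_cause": "OOM kill"}),
    PostmortemAction(tool="mitigate", args={"action": "raise memory limit"}),
    PostmortemAction(tool="write_status", args={"text": "Mitigated."}),
]
```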
| ## Observation space | |
| Key fields of `PostmortemObservation`: | |
| - `task_id`, `task_description`, `available_services`, `available_trace_ids` | |
- `tool_result` – free-text result of the last tool call
- `subgoals` – bool dict `{acked, scoped, hypothesized, mitigated, written}`
- `reward_so_far` – cumulative reward in [0, 1]
| - `steps_remaining`, `last_error` | |
| - `done`, `reward` (current step) | |
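The `subgoals` dict alone is enough signal to drive a simple scripted policy. A hedged sketch (the `next_tool` helper is hypothetical; only the flag names come from the observation spec above):

```python
# Sub-goal ladder in order, pairing each observation flag with the
# verb that achieves it. Flag names follow the `subgoals` dict above.
LADDER = [
    ("acked", "ack"),
    ("scoped", "scope"),
    ("hypothesized", "hypothesize"),
    ("mitigated", "mitigate"),
    ("written", "write_status"),
]

def next_tool(subgoals: dict) -> str:
    """Return the verb for the first sub-goal not yet achieved."""
    for flag, tool in LADDER:
        if not subgoals.get(flag, False):
            return tool
    return "write_status"  # everything done; end the episode

# Example: acknowledged and scoped, nothing else yet.
print(next_tool({"acked": True, "scoped": True}))  # hypothesize
```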
| ## Tasks (3 difficulty tiers) | |
| On each `reset()` the env rotates to the next scenario. Running three resets in | |
| a row covers all three tiers in order. | |
| task_id          | difficulty | incident                                                            |
|------------------|------------|---------------------------------------------------------------------|
| `easy_oom`       | easy       | `api` OOM-killed; cause directly visible in logs                    |
| `medium_cascade` | medium     | checkout latency cascade; must correlate a trace across 3 services  |
| `hard_dns`       | hard       | 503s blamed on a fresh `api` deploy; the real cause is upstream DNS |
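The rotation can be pictured as a cycle over the three scenarios. A sketch only (the real env advances the scenario inside `reset()`):

```python
from itertools import cycle

# The three task_ids from the table above, in rotation order.
TASKS = ["easy_oom", "medium_cascade", "hard_dns"]
_rotation = cycle(TASKS)

def reset_task_id() -> str:
    """Mimic how each reset() moves to the next scenario."""
    return next(_rotation)

# Three resets in a row cover all three tiers in order.
print([reset_task_id() for _ in range(3)])
# ['easy_oom', 'medium_cascade', 'hard_dns']
```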
| ## Reward design | |
| The reward is a **5-stage process-reward ladder** in `[0, 1]`: | |
| ``` | |
| ack +0.10 (granted on first successful ack) | |
scope        +0.20 × Jaccard(agent_services, gold_services)
hypothesize  +0.20 × keyword_fraction(agent_text, gold_hypothesis_keywords)
mitigate     +0.20 × keyword_fraction(agent_text, gold_mitigation_keywords)
write_status +0.30 × keyword_fraction(agent_text, gold_writeup_keywords)
| ``` | |
Each sub-goal is awarded once. The grader is fully **deterministic**: no LLM
judge, no randomness. Partial credit gives a smooth gradient. The episode
terminates when `write_status` fires or after `MAX_STEPS = 12`.
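The ladder above reduces to two pure functions. A deterministic sketch consistent with the formulas (the gold sets and agent texts here are invented for the example, not taken from the env):

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def keyword_fraction(text: str, keywords: set) -> float:
    """Fraction of gold keywords appearing in the agent's text."""
    lowered = text.lower()
    hits = sum(1 for k in keywords if k.lower() in lowered)
    return hits / len(keywords) if keywords else 0.0

# Walk the ladder for a perfect episode (illustrative gold sets):
reward = 0.0
reward += 0.10                                                       # ack
reward += 0.20 * jaccard({"api", "db"}, {"api", "db"})               # scope
reward += 0.20 * keyword_fraction("api was oom killed", {"oom"})     # hypothesize
reward += 0.20 * keyword_fraction("raise memory limit",
                                  {"memory", "limit"})               # mitigate
reward += 0.30 * keyword_fraction("we fixed the oom",
                                  {"oom", "fixed"})                  # write_status
print(round(reward, 2))  # 1.0
```

Partial matches scale each rung linearly, which is what yields the smooth gradient mentioned above.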
| ## Setup | |
| ```bash | |
| pip install openenv-core | |
| openenv build . # build Docker image | |
| python inference.py # run baseline (3 scenarios) | |
| ``` | |
### Environment variables
| var            | default                            | notes                                                      |
|----------------|------------------------------------|------------------------------------------------------------|
| `HF_TOKEN`     | (required)                         | Hugging Face token; also used as the OpenAI client API key |
| `API_BASE_URL` | `https://router.huggingface.co/v1` | any OpenAI-compatible endpoint                             |
| `MODEL_NAME`   | `Qwen/Qwen2.5-72B-Instruct`        | any chat model                                             |
| `IMAGE_NAME`   | `postmortem_env-env:latest`        | Docker tag of the env image                                |
| ## Baseline reproduction | |
| ```bash | |
| export HF_TOKEN=hf_... | |
| export IMAGE_NAME=postmortem_env-env:latest | |
| python inference.py | |
| ``` | |
| Emits strict `[START] / [STEP] / [END]` lines, one `[END]` per task. | |
## Resource budget
Well within the hackathon limits of **2 vCPU / 8 GB RAM**, and the 3-task
sweep completes in **well under 20 minutes**, dominated by LLM latency
(≤ 36 LLM calls total: 3 tasks × 12 steps max).
| ## License | |
| BSD-3-Clause (matches OpenEnv core). | |