---
title: PostMortem Incident Triage OpenEnv
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
pinned: false
license: bsd-3-clause
tags:
  - openenv
  - rl
  - sre
  - incident-response
base_path: /web
---

# PostMortem — Live Incident Triage Environment

An OpenEnv environment where an LLM agent plays an on-call SRE responding to a
live production incident. Real-world task, typed OpenEnv spec, deterministic
grader, three difficulty tiers, dense process-reward signal.

## The task

On each episode the agent receives an alert. It must:

1. **ack** the incident (accept ownership)
2. **query_logs / query_metrics / query_traces** on services to gather evidence
3. **scope** the blast radius
4. **hypothesize** the root cause
5. **mitigate** (propose a concrete remediation)
6. **write_status** (post a customer-facing update)

All six verbs are exposed as a single typed action:

```python
PostmortemAction(tool="query_logs", args={"service": "api"})
```

## Action space

| tool           | args                              | effect                              |
|----------------|-----------------------------------|-------------------------------------|
| `ack`          | `{}`                              | accept the incident (sub-goal 1)    |
| `query_logs`   | `{"service": str}`                | return recent log lines             |
| `query_metrics`| `{"service": str}`                | return latest metrics               |
| `query_traces` | `{"trace_id": str}`               | return distributed trace spans      |
| `scope`        | `{"services": list[str]}`         | declare blast radius (sub-goal 2)   |
| `hypothesize`  | `{"root_cause": str}`             | declare root cause (sub-goal 3)     |
| `mitigate`     | `{"action": str}`                 | apply mitigation (sub-goal 4)       |
| `write_status` | `{"text": str}`                   | publish update, ends the episode (sub-goal 5) |
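
The action shape above can be sketched as a small dataclass. The real `PostmortemAction` lives in the env package; this is an illustrative stand-in that validates the tool name against the table:

```python
from dataclasses import dataclass, field

# The eight verbs from the action-space table.
VALID_TOOLS = {
    "ack", "query_logs", "query_metrics", "query_traces",
    "scope", "hypothesize", "mitigate", "write_status",
}

@dataclass
class PostmortemAction:
    tool: str
    args: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject verbs that are not part of the action space.
        if self.tool not in VALID_TOOLS:
            raise ValueError(f"unknown tool: {self.tool!r}")
```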

## Observation space

Key fields of `PostmortemObservation`:

- `task_id`, `task_description`, `available_services`, `available_trace_ids`
- `tool_result` — free text result of the last tool call
- `subgoals` — bool dict `{acked, scoped, hypothesized, mitigated, written}`
- `reward_so_far` — cumulative reward in [0, 1]
- `steps_remaining`, `last_error`
- `done`, `reward` (current step)

## Tasks (3 difficulty tiers)

On each `reset()` the env rotates to the next scenario. Running three resets in
a row covers all three tiers in order.

| task_id           | difficulty | incident                                                     |
|-------------------|------------|--------------------------------------------------------------|
| `easy_oom`        | easy       | `api` OOM-killed; cause directly visible in logs             |
| `medium_cascade`  | medium     | checkout latency cascade; must correlate trace across 3 svcs |
| `hard_dns`        | hard       | 503s blamed on fresh `api` deploy, real cause is upstream DNS|

## Reward design

The reward is a **5-stage process-reward ladder** in `[0, 1]`:

```
ack           +0.10   (granted on first successful ack)
scope         +0.20 × Jaccard(agent_services, gold_services)
hypothesize   +0.20 × keyword_fraction(agent_text, gold_hypothesis_keywords)
mitigate      +0.20 × keyword_fraction(agent_text, gold_mitigation_keywords)
write_status  +0.30 × keyword_fraction(agent_text, gold_writeup_keywords)
```

Each sub-goal is awarded once. The grader is fully **deterministic** — no LLM
judge, no randomness. Partial credit gives a smooth gradient. The episode
terminates when `write_status` fires or after `MAX_STEPS = 12`.
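
The ladder above can be reproduced in a few lines. The exact grader lives in the env code; this sketch mirrors the formulas, assuming `keyword_fraction` is a case-insensitive substring match (an assumption on our part):

```python
def jaccard(pred, gold):
    """Set-overlap credit for the `scope` sub-goal."""
    p, g = set(pred), set(gold)
    return len(p & g) / len(p | g) if p | g else 0.0

def keyword_fraction(text, keywords):
    """Fraction of gold keywords appearing (case-insensitively) in the text."""
    t = text.lower()
    return sum(k.lower() in t for k in keywords) / len(keywords) if keywords else 0.0

def episode_reward(acked, scope_j, hyp_f, mit_f, write_f):
    """Combine the five once-only sub-goal credits into the [0, 1] ladder."""
    return (0.10 * bool(acked)
            + 0.20 * scope_j
            + 0.20 * hyp_f
            + 0.20 * mit_f
            + 0.30 * write_f)
```

Because every term is a deterministic function of the agent's text and the gold labels, two runs with identical actions always score identically.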

## Setup

```bash
pip install openenv-core
openenv build .                    # build Docker image
python inference.py                # run baseline (3 scenarios)
```

### Required environment variables

| var            | default                                        | notes |
|----------------|------------------------------------------------|-------|
| `HF_TOKEN`     | (required)                                     | Hugging Face token; also used as the OpenAI-client API key |
| `API_BASE_URL` | `https://router.huggingface.co/v1`             | any OpenAI-compatible endpoint |
| `MODEL_NAME`   | `Qwen/Qwen2.5-72B-Instruct`                    | any chat model |
| `IMAGE_NAME`   | `postmortem_env-env:latest`                    | docker tag of the env image |
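
Resolving these variables with their defaults can be sketched as below. The helper name `resolve_config` is ours, not taken from `inference.py`:

```python
import os

def resolve_config() -> dict:
    """Read the env vars from the table, applying documented defaults."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is required")
    return {
        "api_key": token,  # HF_TOKEN doubles as the OpenAI client API key
        "base_url": os.environ.get("API_BASE_URL", "https://router.huggingface.co/v1"),
        "model": os.environ.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
        "image": os.environ.get("IMAGE_NAME", "postmortem_env-env:latest"),
    }
```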

## Baseline reproduction

```bash
export HF_TOKEN=hf_...
export IMAGE_NAME=postmortem_env-env:latest
python inference.py
```

The script emits strict `[START] / [STEP] / [END]` lines, with one `[END]` line per task.

## Resource budget

The environment runs well within the hackathon limits of **2 vCPU / 8 GB RAM**
and completes the 3-task sweep in **well under 20 minutes**; runtime is
dominated by LLM latency (≤ 36 LLM calls in total).

## License

BSD-3-Clause (matches OpenEnv core).