---
title: PostMortem Incident Triage OpenEnv
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
pinned: false
license: bsd-3-clause
tags:
  - openenv
  - rl
  - sre
  - incident-response
base_path: /web
---
# PostMortem — Live Incident Triage Environment

An OpenEnv environment where an LLM agent plays an on-call SRE responding to a live production incident. Real-world task, typed OpenEnv spec, deterministic grader, three difficulty tiers, dense process-reward signal.
## The task
On each episode the agent receives an alert. It must:
- `ack` the incident (accept ownership)
- `query_logs` / `query_metrics` / `query_traces` on services to gather evidence
- `scope` the blast radius
- `hypothesize` the root cause
- `mitigate` (propose a concrete remediation)
- `write_status` (post a customer-facing update)
All of these verbs are exposed through a single typed action:

```python
PostmortemAction(tool="query_logs", args={"service": "api"})
```
## Action space
| tool | args | effect |
|---|---|---|
| `ack` | `{}` | accept the incident (sub-goal 1) |
| `query_logs` | `{"service": str}` | return recent log lines |
| `query_metrics` | `{"service": str}` | return latest metrics |
| `query_traces` | `{"trace_id": str}` | return distributed trace spans |
| `scope` | `{"services": list[str]}` | declare blast radius (sub-goal 2) |
| `hypothesize` | `{"root_cause": str}` | declare root cause (sub-goal 3) |
| `mitigate` | `{"action": str}` | apply mitigation (sub-goal 4) |
| `write_status` | `{"text": str}` | publish the update and end the episode (sub-goal 5) |
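The action shape above can be sketched as a plain dataclass; this is an illustrative stand-in, not the environment's actual class, with validation that mirrors the table:

```python
from dataclasses import dataclass, field

# Tool names taken from the action-space table above.
VALID_TOOLS = {"ack", "query_logs", "query_metrics", "query_traces",
               "scope", "hypothesize", "mitigate", "write_status"}

@dataclass
class PostmortemAction:
    """Illustrative stand-in for the environment's typed action."""
    tool: str
    args: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject verbs outside the typed action space.
        if self.tool not in VALID_TOOLS:
            raise ValueError(f"unknown tool: {self.tool!r}")

action = PostmortemAction(tool="query_logs", args={"service": "api"})
```

A typed, validated action surface like this is what lets the grader stay deterministic: every step the agent can take is enumerable.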
## Observation space
Key fields of `PostmortemObservation`:

- `task_id`, `task_description`, `available_services`, `available_trace_ids`
- `tool_result`: free-text result of the last tool call
- `subgoals`: bool dict `{acked, scoped, hypothesized, mitigated, written}`
- `reward_so_far`: cumulative reward in [0, 1]
- `steps_remaining`, `last_error`
- `done`, `reward` (for the current step)
## Tasks (3 difficulty tiers)
On each `reset()` the env rotates to the next scenario; running three resets in a row covers all three tiers in order.
| task_id | difficulty | incident |
|---|---|---|
| `easy_oom` | easy | `api` OOM-killed; cause directly visible in logs |
| `medium_cascade` | medium | checkout latency cascade; must correlate a trace across 3 services |
| `hard_dns` | hard | 503s blamed on a fresh `api` deploy; the real cause is upstream DNS |
## Reward design
The reward is a 5-stage process-reward ladder in [0, 1]:

```
ack          +0.10  (granted on the first successful ack)
scope        +0.20 × Jaccard(agent_services, gold_services)
hypothesize  +0.20 × keyword_fraction(agent_text, gold_hypothesis_keywords)
mitigate     +0.20 × keyword_fraction(agent_text, gold_mitigation_keywords)
write_status +0.30 × keyword_fraction(agent_text, gold_writeup_keywords)
```
Each sub-goal is awarded once. The grader is fully deterministic — no LLM judge, no randomness. Partial credit gives a smooth gradient. The episode terminates when `write_status` fires or after `MAX_STEPS = 12`.
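The ladder's two scoring primitives are easy to sketch. The function names `Jaccard` and `keyword_fraction` come from the ladder above; the exact matching rule (case-insensitive substring containment) is an assumption about the grader, not a guarantee:

```python
def jaccard(agent: set, gold: set) -> float:
    """Intersection over union of the declared vs. gold service sets."""
    if not agent and not gold:
        return 0.0
    return len(agent & gold) / len(agent | gold)

def keyword_fraction(text: str, keywords: list[str]) -> float:
    """Fraction of gold keywords present in the agent's free text
    (assumed: case-insensitive substring match)."""
    if not keywords:
        return 0.0
    lowered = text.lower()
    return sum(1 for k in keywords if k.lower() in lowered) / len(keywords)

# Example: partial credit for a partially correct blast radius.
gold_services = {"api", "db", "cache"}
agent_services = {"api", "db", "queue"}
scope_reward = 0.20 * jaccard(agent_services, gold_services)  # 0.20 * 2/4 = 0.10
```

Because both primitives are pure set/string arithmetic, identical transcripts always earn identical rewards, which is what makes the grader safe for RL training.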
## Setup
```bash
pip install openenv-core
openenv build .      # build the Docker image
python inference.py  # run the baseline (3 scenarios)
```
### Required environment variables
| var | default | notes |
|---|---|---|
| `HF_TOKEN` | (required) | Hugging Face token; also used as the OpenAI client API key |
| `API_BASE_URL` | `https://router.huggingface.co/v1` | any OpenAI-compatible endpoint |
| `MODEL_NAME` | `Qwen/Qwen2.5-72B-Instruct` | any chat model |
| `IMAGE_NAME` | `postmortem_env-env:latest` | Docker tag of the env image |
## Baseline reproduction
```bash
export HF_TOKEN=hf_...
export IMAGE_NAME=postmortem_env-env:latest
python inference.py
```
Emits strict `[START]` / `[STEP]` / `[END]` lines, one `[END]` per task.
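The exact payload of each line is defined by `inference.py`; without relying on it, a captured run can still be sanity-checked by counting the markers, e.g. exactly one `[END]` per task:

```python
# Count protocol markers in a captured baseline log without parsing
# their payloads (payload format is owned by inference.py).
def count_markers(log_text: str, marker: str = "[END]") -> int:
    return sum(1 for line in log_text.splitlines()
               if line.lstrip().startswith(marker))

log = "[START] easy_oom\n[STEP] 1 ack\n[END] easy_oom\n"
print(count_markers(log))  # 1
```

A full 3-scenario sweep should yield a count of 3.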
## Resource budget
Runs well within the hackathon limits of 2 vCPU / 8 GB RAM, and completes the 3-task sweep in well under 20 minutes; runtime is dominated by LLM latency (≤ 36 LLM calls total).
## License
BSD-3-Clause (matches OpenEnv core).