---
title: PostMortem Incident Triage OpenEnv
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
pinned: false
license: bsd-3-clause
tags:
  - openenv
  - rl
  - sre
  - incident-response
base_path: /web
---
# PostMortem – Live Incident Triage Environment

An OpenEnv environment where an LLM agent plays an on-call SRE responding to a
live production incident. Real-world task, typed OpenEnv spec, deterministic
grader, three difficulty tiers, dense process-reward signal.
| ## The task | |
| On each episode the agent receives an alert. It must: | |
| 1. **ack** the incident (accept ownership) | |
| 2. **query_logs / query_metrics / query_traces** on services to gather evidence | |
| 3. **scope** the blast radius | |
| 4. **hypothesize** the root cause | |
| 5. **mitigate** (propose a concrete remediation) | |
| 6. **write_status** (post a customer-facing update) | |
| All six verbs are exposed as a single typed action: | |
| ```python | |
| PostmortemAction(tool="query_logs", args={"service": "api"}) | |
| ``` | |
| ## Action space | |
| tool            | args                      | effect                                    |
|-----------------|---------------------------|-------------------------------------------|
| `ack`           | `{}`                      | accept the incident (sub-goal 1)          |
| `query_logs`    | `{"service": str}`        | return recent log lines                   |
| `query_metrics` | `{"service": str}`        | return latest metrics                     |
| `query_traces`  | `{"trace_id": str}`       | return distributed trace spans            |
| `scope`         | `{"services": list[str]}` | declare blast radius (sub-goal 2)         |
| `hypothesize`   | `{"root_cause": str}`     | declare root cause (sub-goal 3)           |
| `mitigate`      | `{"action": str}`         | apply mitigation (sub-goal 4)             |
| `write_status`  | `{"text": str}`           | publish update, ends episode (sub-goal 5) |
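For illustration, the typed action can be sketched as a small dataclass. This is a hypothetical stand-in for the real `PostmortemAction` that ships with the env; it only validates the tool names from the table above:

```python
from dataclasses import dataclass, field

# Hypothetical sketch; the real PostmortemAction ships with the env package.
VALID_TOOLS = {
    "ack", "query_logs", "query_metrics", "query_traces",
    "scope", "hypothesize", "mitigate", "write_status",
}

@dataclass
class PostmortemAction:
    tool: str
    args: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject verbs outside the documented action space.
        if self.tool not in VALID_TOOLS:
            raise ValueError(f"unknown tool: {self.tool}")

# One action per sub-goal verb, mirroring the table above.
actions = [
    PostmortemAction(tool="ack"),
    PostmortemAction(tool="query_logs", args={"service": "api"}),
    PostmortemAction(tool="scope", args={"services": ["api", "db"]}),
    PostmortemAction(tool="hypothesize", args={"root_cause": "OOM kill"}),
    PostmortemAction(tool="mitigate", args={"action": "raise memory limit"}),
    PostmortemAction(tool="write_status", args={"text": "Mitigated."}),
]
```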
| ## Observation space | |
| Key fields of `PostmortemObservation`: | |
| - `task_id`, `task_description`, `available_services`, `available_trace_ids` | |
- `tool_result` – free-text result of the last tool call
- `subgoals` – bool dict `{acked, scoped, hypothesized, mitigated, written}`
- `reward_so_far` – cumulative reward in [0, 1]
| - `steps_remaining`, `last_error` | |
| - `done`, `reward` (current step) | |
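The `subgoals` dict alone is enough signal to drive a simple scripted policy. A hedged sketch (the `next_tool` helper is hypothetical; only the flag names come from the observation spec above):

```python
# Sub-goal ladder in order, pairing each observation flag with the
# verb that achieves it. Flag names follow the `subgoals` dict above.
LADDER = [
    ("acked", "ack"),
    ("scoped", "scope"),
    ("hypothesized", "hypothesize"),
    ("mitigated", "mitigate"),
    ("written", "write_status"),
]

def next_tool(subgoals: dict) -> str:
    """Return the verb for the first sub-goal not yet achieved."""
    for flag, tool in LADDER:
        if not subgoals.get(flag, False):
            return tool
    return "write_status"  # everything done; end the episode

# Example: acknowledged and scoped, nothing else yet.
print(next_tool({"acked": True, "scoped": True}))  # hypothesize
```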
| ## Tasks (3 difficulty tiers) | |
| On each `reset()` the env rotates to the next scenario. Running three resets in | |
| a row covers all three tiers in order. | |
| task_id          | difficulty | incident                                                            |
|------------------|------------|---------------------------------------------------------------------|
| `easy_oom`       | easy       | `api` OOM-killed; cause directly visible in logs                    |
| `medium_cascade` | medium     | checkout latency cascade; must correlate a trace across 3 services  |
| `hard_dns`       | hard       | 503s blamed on a fresh `api` deploy; the real cause is upstream DNS |
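The rotation can be pictured as a cycle over the three scenarios. A sketch only (the real env advances the scenario inside `reset()`):

```python
from itertools import cycle

# The three task_ids from the table above, in rotation order.
TASKS = ["easy_oom", "medium_cascade", "hard_dns"]
_rotation = cycle(TASKS)

def reset_task_id() -> str:
    """Mimic how each reset() moves to the next scenario."""
    return next(_rotation)

# Three resets in a row cover all three tiers in order.
print([reset_task_id() for _ in range(3)])
# ['easy_oom', 'medium_cascade', 'hard_dns']
```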
| ## Reward design | |
| The reward is a **5-stage process-reward ladder** in `[0, 1]`: | |
| ``` | |
| ack +0.10 (granted on first successful ack) | |
scope        +0.20 × Jaccard(agent_services, gold_services)
hypothesize  +0.20 × keyword_fraction(agent_text, gold_hypothesis_keywords)
mitigate     +0.20 × keyword_fraction(agent_text, gold_mitigation_keywords)
write_status +0.30 × keyword_fraction(agent_text, gold_writeup_keywords)
| ``` | |
Each sub-goal is awarded once. The grader is fully **deterministic**: no LLM
judge, no randomness. Partial credit gives a smooth gradient. The episode
terminates when `write_status` fires or after `MAX_STEPS = 12`.
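The ladder above reduces to two pure functions. A deterministic sketch consistent with the formulas (the gold sets and agent texts here are invented for the example, not taken from the env):

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def keyword_fraction(text: str, keywords: set) -> float:
    """Fraction of gold keywords appearing in the agent's text."""
    lowered = text.lower()
    hits = sum(1 for k in keywords if k.lower() in lowered)
    return hits / len(keywords) if keywords else 0.0

# Walk the ladder for a perfect episode (illustrative gold sets):
reward = 0.0
reward += 0.10                                                       # ack
reward += 0.20 * jaccard({"api", "db"}, {"api", "db"})               # scope
reward += 0.20 * keyword_fraction("api was oom killed", {"oom"})     # hypothesize
reward += 0.20 * keyword_fraction("raise memory limit",
                                  {"memory", "limit"})               # mitigate
reward += 0.30 * keyword_fraction("we fixed the oom",
                                  {"oom", "fixed"})                  # write_status
print(round(reward, 2))  # 1.0
```

Partial matches scale each rung linearly, which is what yields the smooth gradient mentioned above.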
| ## Setup | |
| ```bash | |
| pip install openenv-core | |
| openenv build . # build Docker image | |
| python inference.py # run baseline (3 scenarios) | |
| ``` | |
### Environment variables
| var            | default                            | notes                                                      |
|----------------|------------------------------------|------------------------------------------------------------|
| `HF_TOKEN`     | (required)                         | Hugging Face token; also used as the OpenAI client API key |
| `API_BASE_URL` | `https://router.huggingface.co/v1` | any OpenAI-compatible endpoint                             |
| `MODEL_NAME`   | `Qwen/Qwen2.5-72B-Instruct`        | any chat model                                             |
| `IMAGE_NAME`   | `postmortem_env-env:latest`        | Docker tag of the env image                                |
| ## Baseline reproduction | |
| ```bash | |
| export HF_TOKEN=hf_... | |
| export IMAGE_NAME=postmortem_env-env:latest | |
| python inference.py | |
| ``` | |
| Emits strict `[START] / [STEP] / [END]` lines, one `[END]` per task. | |
## Resource budget
Well within the hackathon limits of **2 vCPU / 8 GB RAM**, and the 3-task
sweep completes in **well under 20 minutes**, dominated by LLM latency
(≤ 36 LLM calls total: 3 tasks × 12 steps max).
| ## License | |
| BSD-3-Clause (matches OpenEnv core). | |