---
title: PostMortem Incident Triage OpenEnv
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
pinned: false
license: bsd-3-clause
tags:
  - openenv
  - rl
  - sre
  - incident-response
base_path: /web
---
# PostMortem — Live Incident Triage Environment

An OpenEnv environment where an LLM agent plays an on-call SRE responding to a live production incident. Real-world task, typed OpenEnv spec, deterministic grader, three difficulty tiers, dense process-reward signal.
## The task
On each episode the agent receives an alert. It must:
- `ack` the incident (accept ownership)
- `query_logs` / `query_metrics` / `query_traces` on services to gather evidence
- `scope` the blast radius
- `hypothesize` the root cause
- `mitigate` (propose a concrete remediation)
- `write_status` (post a customer-facing update)
All of these verbs are exposed through a single typed action:

```python
PostmortemAction(tool="query_logs", args={"service": "api"})
```
## Action space
| tool | args | effect |
|---|---|---|
| `ack` | `{}` | accept the incident (sub-goal 1) |
| `query_logs` | `{"service": str}` | return recent log lines |
| `query_metrics` | `{"service": str}` | return latest metrics |
| `query_traces` | `{"trace_id": str}` | return distributed trace spans |
| `scope` | `{"services": list[str]}` | declare blast radius (sub-goal 2) |
| `hypothesize` | `{"root_cause": str}` | declare root cause (sub-goal 3) |
| `mitigate` | `{"action": str}` | apply mitigation (sub-goal 4) |
| `write_status` | `{"text": str}` | publish the update and end the episode (sub-goal 5) |
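The action shape above can be sketched as a plain dataclass; this is an illustrative stand-in, not the environment's actual class, with validation that mirrors the table:

```python
from dataclasses import dataclass, field

# Tool names taken from the action-space table above.
VALID_TOOLS = {"ack", "query_logs", "query_metrics", "query_traces",
               "scope", "hypothesize", "mitigate", "write_status"}

@dataclass
class PostmortemAction:
    """Illustrative stand-in for the environment's typed action."""
    tool: str
    args: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject verbs outside the typed action space.
        if self.tool not in VALID_TOOLS:
            raise ValueError(f"unknown tool: {self.tool!r}")

action = PostmortemAction(tool="query_logs", args={"service": "api"})
```

A typed, validated action surface like this is what lets the grader stay deterministic: every step the agent can take is enumerable.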
## Observation space
Key fields of `PostmortemObservation`:

- `task_id`, `task_description`, `available_services`, `available_trace_ids`
- `tool_result`: free-text result of the last tool call
- `subgoals`: bool dict `{acked, scoped, hypothesized, mitigated, written}`
- `reward_so_far`: cumulative reward in [0, 1]
- `steps_remaining`, `last_error`
- `done`, `reward` (for the current step)
## Tasks (3 difficulty tiers)
On each `reset()` the env rotates to the next scenario; running three resets in a row covers all three tiers in order.
| task_id | difficulty | incident |
|---|---|---|
| `easy_oom` | easy | `api` OOM-killed; cause directly visible in logs |
| `medium_cascade` | medium | checkout latency cascade; must correlate a trace across 3 services |
| `hard_dns` | hard | 503s blamed on a fresh `api` deploy; the real cause is upstream DNS |
## Reward design
The reward is a 5-stage process-reward ladder in [0, 1]:

```
ack          +0.10  (granted on the first successful ack)
scope        +0.20 × Jaccard(agent_services, gold_services)
hypothesize  +0.20 × keyword_fraction(agent_text, gold_hypothesis_keywords)
mitigate     +0.20 × keyword_fraction(agent_text, gold_mitigation_keywords)
write_status +0.30 × keyword_fraction(agent_text, gold_writeup_keywords)
```
Each sub-goal is awarded once. The grader is fully deterministic — no LLM judge, no randomness. Partial credit gives a smooth gradient. The episode terminates when `write_status` fires or after `MAX_STEPS = 12`.
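The ladder's two scoring primitives are easy to sketch. The function names `Jaccard` and `keyword_fraction` come from the ladder above; the exact matching rule (case-insensitive substring containment) is an assumption about the grader, not a guarantee:

```python
def jaccard(agent: set, gold: set) -> float:
    """Intersection over union of the declared vs. gold service sets."""
    if not agent and not gold:
        return 0.0
    return len(agent & gold) / len(agent | gold)

def keyword_fraction(text: str, keywords: list[str]) -> float:
    """Fraction of gold keywords present in the agent's free text
    (assumed: case-insensitive substring match)."""
    if not keywords:
        return 0.0
    lowered = text.lower()
    return sum(1 for k in keywords if k.lower() in lowered) / len(keywords)

# Example: partial credit for a partially correct blast radius.
gold_services = {"api", "db", "cache"}
agent_services = {"api", "db", "queue"}
scope_reward = 0.20 * jaccard(agent_services, gold_services)  # 0.20 * 2/4 = 0.10
```

Because both primitives are pure set/string arithmetic, identical transcripts always earn identical rewards, which is what makes the grader safe for RL training.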
## Setup
```bash
pip install openenv-core
openenv build .      # build the Docker image
python inference.py  # run the baseline (3 scenarios)
```
### Required environment variables
| var | default | notes |
|---|---|---|
| `HF_TOKEN` | (required) | Hugging Face token; also used as the OpenAI client API key |
| `API_BASE_URL` | `https://router.huggingface.co/v1` | any OpenAI-compatible endpoint |
| `MODEL_NAME` | `Qwen/Qwen2.5-72B-Instruct` | any chat model |
| `IMAGE_NAME` | `postmortem_env-env:latest` | Docker tag of the env image |
## Baseline reproduction
```bash
export HF_TOKEN=hf_...
export IMAGE_NAME=postmortem_env-env:latest
python inference.py
```
Emits strict `[START]` / `[STEP]` / `[END]` lines, one `[END]` per task.
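The exact payload of each line is defined by `inference.py`; without relying on it, a captured run can still be sanity-checked by counting the markers, e.g. exactly one `[END]` per task:

```python
# Count protocol markers in a captured baseline log without parsing
# their payloads (payload format is owned by inference.py).
def count_markers(log_text: str, marker: str = "[END]") -> int:
    return sum(1 for line in log_text.splitlines()
               if line.lstrip().startswith(marker))

log = "[START] easy_oom\n[STEP] 1 ack\n[END] easy_oom\n"
print(count_markers(log))  # 1
```

A full 3-scenario sweep should yield a count of 3.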
## Resource budget
Runs well within the hackathon limits of 2 vCPU / 8 GB RAM, and completes the 3-task sweep in well under 20 minutes; runtime is dominated by LLM latency (≤ 36 LLM calls total).
## License
BSD-3-Clause (matches OpenEnv core).