Daksh Verma

title: Unified Incident Env
emoji: 🚨
colorFrom: blue
colorTo: red
sdk: docker
app_port: 8000
pinned: false

Unified Incident Env

A deterministic OpenEnv benchmark where agents resolve production incidents whose true root cause includes a security vulnerability.

unified_incident_env is one judge-facing environment. It is not a collection of mini-projects and it is not a toy task. Each episode starts as an operational outage, but the correct solution requires the agent to bridge SRE investigation with security remediation, then recover the system in the correct order and submit a postmortem.

Why This Benchmark Matters

Most agent benchmarks test operations or security in isolation. This benchmark forces both:

operational symptoms appear first
the real root cause is security-rooted
patching alone is not enough
recovery alone is not enough
the final postmortem must reflect the real causal chain

This is what makes it useful for evaluating incident-response agents rather than generic tool-using assistants.

Why It Is Non-Trivial

The benchmark is intentionally built around causal traps:

restarting the wrong service treats a symptom but does not fix the incident
patching the wrong vulnerability or wrong patch family wastes steps and score
recovering infrastructure before closing the exploit path can fail or regress
weak agents often loop in investigation, security verification, or post-security recovery

Bad agent pattern:

database is down
-> restart database
-> database crashes again because exploit path is still open

Good agent pattern:

find root cause
-> unlock security subquest
-> patch exploit path
-> verify fix
-> recover infrastructure
-> submit postmortem
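
The two patterns differ mainly in ordering, which can be checked mechanically. The following is a minimal sketch, assuming a trajectory is recorded as a list of `action_type` strings and that `submit_security_fix` marks the point where the exploit path is closed. The helper name `recovers_before_security_fix` is hypothetical and is not part of the repo.

```python
# Hypothetical helper, not part of the repo: flags the "bad agent pattern".
# Assumption: restart_service and rollback_deploy are the recovery actions,
# and submit_security_fix marks the exploit path as closed.
RECOVERY_ACTIONS = {"restart_service", "rollback_deploy"}


def recovers_before_security_fix(action_types):
    """Return True if a recovery action appears before submit_security_fix."""
    for action_type in action_types:
        if action_type == "submit_security_fix":
            return False  # security closed first: ordering is fine
        if action_type in RECOVERY_ACTIONS:
            return True  # recovery attempted while the exploit is still open
    return False  # neither happened, so there is nothing to flag


bad = ["query_logs", "restart_service"]
good = [
    "query_logs",
    "inspect_code",
    "classify_vulnerability",
    "apply_patch",
    "verify_security_fix",
    "submit_security_fix",
    "restart_service",
    "submit_postmortem",
]
print(recovers_before_security_fix(bad), recovers_before_security_fix(good))
```

A check like this is only useful for offline trajectory analysis; the environment itself decides how a premature recovery is penalized.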

Evaluation Gap

Property	Simple ops benchmark	Unified Incident Env
Failure model	broken service only	broken service plus security-rooted cause
Agent role	troubleshooter	incident responder plus security repair assistant
Action pattern	pure recovery	investigate -> unlock security -> patch -> recover
Failure traps	wrong restart	wrong restart plus wrong patch plus wrong order
Success condition	service healthy	service healthy plus exploit path closed plus postmortem

Benchmark Mechanics

Named mechanics that shape behavior:

Causal traps
Stage transitions
Security unlock
Recovery ordering
Negative-reward correction pressure
Deterministic postmortem scoring

These mechanics are explicit in the environment state and reward function. They are not hidden in a black-box grader.

At A Glance

Item	Value
Environment name	`unified_incident_env`
Environment count	1
Scenario count	3
Difficulty levels	Easy, Medium, Hard
Public actions	11
Score range	`0.0` to `1.0`
Score type	deterministic, dense, bounded
Root runner	`inference.py`
OpenEnv validation	passes
Test suite	`51 passed`
Docker build	passes
LLM judge	none

Scenario Pack

Scenario	Difficulty	Operational failure	Security root cause	Lesson
`database_sqli_outage`	Easy	database crash causes gateway `502`s	SQL injection in login path	close exploit before restarting database
`cache_abuse_broken_access_control`	Medium	cache crash and database degradation cascade	broken access control on internal admin endpoint	follow dependency chain and authorization evidence
`worker_bad_deploy_command_injection`	Hard	worker poisons downstream database and gateway	command injection plus bad deploy	stop investigating once enough evidence exists, then patch and roll back the worker path

Difficulty progression:

Easy   -> direct evidence, short recovery chain
Medium -> dependency reasoning, authorization bug
Hard   -> cross-service causality, exploit plus deploy rollback

flowchart LR
    E["Easy\nDirect evidence"] --> M["Medium\nDependency reasoning"]
    M --> H["Hard\nCross-service causality"]

Public Action Schema

Only these action_type values are valid:

[
  "query_logs",
  "query_metrics",
  "query_dependencies",
  "restart_service",
  "rollback_deploy",
  "inspect_code",
  "classify_vulnerability",
  "apply_patch",
  "verify_security_fix",
  "submit_security_fix",
  "submit_postmortem"
]

Required fields:

Action	Required fields
`query_logs`	`service`
`query_metrics`	`service`, `metric`
`query_dependencies`	`service`
`restart_service`	`service`
`rollback_deploy`	`service`
`inspect_code`	none
`classify_vulnerability`	`vulnerability_type`
`apply_patch`	`patch_id`
`verify_security_fix`	none
`submit_security_fix`	none
`submit_postmortem`	`postmortem`
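
The table above can be turned into a client-side pre-flight check. This is a minimal sketch, assuming an action is a flat dict with an `action_type` key plus its fields; the real typed models live in `unified_incident_env/models.py` and may be shaped differently. The function name `validate_action` and the example values are hypothetical.

```python
# Required fields per action_type, copied from the table above.
REQUIRED_FIELDS = {
    "query_logs": ("service",),
    "query_metrics": ("service", "metric"),
    "query_dependencies": ("service",),
    "restart_service": ("service",),
    "rollback_deploy": ("service",),
    "inspect_code": (),
    "classify_vulnerability": ("vulnerability_type",),
    "apply_patch": ("patch_id",),
    "verify_security_fix": (),
    "submit_security_fix": (),
    "submit_postmortem": ("postmortem",),
}


def validate_action(action):
    """Return a list of problems; an empty list means the action looks well-formed.

    Assumption: `action` is a flat dict such as
    {"action_type": "query_logs", "service": "database"}.
    """
    action_type = action.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        return [f"unknown action_type: {action_type!r}"]
    return [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS[action_type]
        if name not in action
    ]


# The service name here is an illustrative value, not taken from the scenarios.
print(validate_action({"action_type": "query_metrics", "service": "database"}))
```

Rejecting malformed actions before calling the environment avoids spending steps on requests that cannot succeed.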

Observation Design

Each step returns a typed observation with:

tick_count
workflow_stage
active_alerts
service_health
last_action_result
tool_output
failure_type
why_failed
allowed_actions
required_fields_by_action
valid_action_example
common_trap
loop_warning
blocked_until_security_complete
security_unlock_reason
best_recovery_action_family
progress_flags
security_subquest_status
security_context
final_score
score_breakdown
incident_resolved
reward
done

Surfacing failure states as structured fields such as failure_type, why_failed, and loop_warning makes them explicit and machine-usable without introducing any nondeterminism.
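
A client can sanity-check that a decoded observation carries every field listed above. This is a minimal sketch, assuming the observation has been decoded into a plain dict (for example from JSON); the field types are not specified here, so only presence is checked. The helper name `missing_observation_fields` is hypothetical.

```python
# Field names copied from the observation list above.
OBSERVATION_FIELDS = (
    "tick_count", "workflow_stage", "active_alerts", "service_health",
    "last_action_result", "tool_output", "failure_type", "why_failed",
    "allowed_actions", "required_fields_by_action", "valid_action_example",
    "common_trap", "loop_warning", "blocked_until_security_complete",
    "security_unlock_reason", "best_recovery_action_family", "progress_flags",
    "security_subquest_status", "security_context", "final_score",
    "score_breakdown", "incident_resolved", "reward", "done",
)


def missing_observation_fields(observation):
    """Return the documented field names that are absent from the dict."""
    return [name for name in OBSERVATION_FIELDS if name not in observation]


# A deliberately partial observation, used only for illustration.
partial = {"tick_count": 1, "done": False}
print(len(missing_observation_fields(partial)))
```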

Scoring

The score is deterministic and bounded between 0.0 and 1.0.

final_score =
  infrastructure_score (0.00 to 0.45) +
  security_score       (0.00 to 0.35) +
  efficiency_score     (0.00 to 0.10) +
  postmortem_score     (0.00 to 0.10)

Score weight view:

Component	Weight
Infrastructure	0.45
Security	0.35
Efficiency	0.10
Postmortem	0.10

Deterministic guarantees:

preset authored scenarios
deterministic patch outcomes
deterministic postmortem scoring
no hidden fallback in strict benchmark behavior
no LLM grader
incomplete security subquest caps the final score at 0.5
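
The arithmetic above can be sketched in a few lines. This is an illustration only, not the code in `unified_incident_env/server/grader.py`: it assumes each component is clamped to its documented range, that the 0.5 cap is applied to the summed total, and that the result is rounded to four decimals. The names `final_score` and `security_complete` are hypothetical.

```python
# Upper bounds per component, copied from the weight table above.
WEIGHTS = {
    "infrastructure": 0.45,
    "security": 0.35,
    "efficiency": 0.10,
    "postmortem": 0.10,
}
SECURITY_INCOMPLETE_CAP = 0.5


def final_score(components, security_complete):
    """Sum the clamped components and apply the security cap (assumed form)."""
    total = 0.0
    for name, upper in WEIGHTS.items():
        value = components.get(name, 0.0)
        total += min(max(value, 0.0), upper)  # clamp to [0, upper]
    if not security_complete:
        total = min(total, SECURITY_INCOMPLETE_CAP)
    return round(total, 4)  # rounding is an assumption, to hide float noise


perfect = {"infrastructure": 0.45, "security": 0.35, "efficiency": 0.10, "postmortem": 0.10}
no_security = {"infrastructure": 0.45, "security": 0.0, "efficiency": 0.10, "postmortem": 0.10}
print(final_score(perfect, True), final_score(no_security, False))
```

Without the cap, an agent that skips the security subquest could still reach 0.65 from the other three components, so the cap is what makes security completion mandatory for a passing score.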

Runtime Flow

model
  -> inference.py
  -> env.step(action)
  -> observation + reward + score
  -> next model decision

flowchart LR
    A["Model"] --> B["inference.py"]
    B --> C["env.step(action)"]
    C --> D["observation + reward + score"]
    D --> A

Successful episode flow:

diagnosis
-> root cause analysis
-> security subquest
-> remediation
-> verification
-> postmortem
-> done
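
The loop in this section can be sketched as follows. This is a minimal illustration under stated assumptions: it assumes the environment exposes `reset()` alongside `step(action)` and that each observation is a dict with a `done` key. The real client may return typed objects instead. `ScriptedEnv`, `run_episode`, and `max_steps` are hypothetical names used only to make the example runnable.

```python
class ScriptedEnv:
    """Hypothetical stand-in for the real environment: ends after N steps."""

    def __init__(self, steps_to_finish):
        self.steps_to_finish = steps_to_finish
        self.tick_count = 0

    def reset(self):
        self.tick_count = 0
        return {"tick_count": 0, "done": False}

    def step(self, action):
        self.tick_count += 1
        done = self.tick_count >= self.steps_to_finish
        return {"tick_count": self.tick_count, "done": done}


def run_episode(env, choose_action, max_steps=50):
    """Alternate between the policy and the environment until done."""
    observation = env.reset()
    for _ in range(max_steps):
        action = choose_action(observation)  # model decision
        observation = env.step(action)       # observation + reward + score
        if observation["done"]:
            break
    return observation


# A trivial policy that always inspects code, used only to drive the stub.
final = run_episode(ScriptedEnv(3), lambda obs: {"action_type": "inspect_code"})
print(final)
```

The `max_steps` bound is a design choice for the sketch: it keeps a looping agent from running forever when the environment never reports `done`.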

Inference Path

The root inference.py is the submission runner.

It:

uses the OpenAI client
reads API_BASE_URL, MODEL_NAME, and HF_TOKEN
emits validator-compatible [START], [STEP], and [END] logs
runs all 3 scenarios through the real environment API

Inference modes:

INFERENCE_MODE=judge
- default
- compact, strong-model-friendly prompt
- structured outputs first
- no transcript stuffing
INFERENCE_MODE=small
- optional local rescue mode for weaker models
- compact corrective prompt behavior

Model Notes

Models used during development and validation:

qwen2.5:1.5b
qwen2.5:3b
qwen2.5:7b-instruct-q4_K_M
gemma2:2b
llama-3.3-70b-versatile

The default path is optimized for strong external judge models. The optional small mode exists only to support weaker local models without changing the benchmark contract.

Repository Layout

.
├── README.md
├── inference.py
├── openenv.yaml
├── Dockerfile
├── pyproject.toml
├── uv.lock
├── Makefile
├── server/
└── unified_incident_env/

Important internals:

Path	Purpose
`server/app.py`	top-level app entrypoint
`unified_incident_env/models.py`	typed action, observation, and state models
`unified_incident_env/server/challenge.py`	scenario catalog
`unified_incident_env/server/environment.py`	transition logic
`unified_incident_env/server/grader.py`	deterministic scoring
`unified_incident_env/scripts/baseline_agent.py`	deterministic internal reference baseline
`unified_incident_env/tests/`	regression tests
`unified_incident_env/trainer/`	optional secondary tooling

Running The Repo

Install:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install -e ".[dev]"

Run tests:

pytest unified_incident_env/tests -q

Validate OpenEnv compliance:

openenv validate .

Run the environment locally:

uvicorn server.app:app --host 0.0.0.0 --port 8000

Run the root inference script:

python inference.py

Build and run with Docker:

docker build -t unified-incident-env .
docker run --rm -p 8000:8000 unified-incident-env

Environment Variables

inference.py supports:

API_BASE_URL
MODEL_NAME
HF_TOKEN
ENV_BASE_URL
INFERENCE_MODE
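
One way to read these variables is sketched below. It is not the code in `inference.py`. The `judge` default for `INFERENCE_MODE` comes from the Inference Path section; the empty-string defaults and the `http://localhost:8000` default for `ENV_BASE_URL` are assumptions, as are the names `InferenceConfig` and `load_config`.

```python
import os
from dataclasses import dataclass

VALID_MODES = ("judge", "small")


@dataclass(frozen=True)
class InferenceConfig:
    api_base_url: str
    model_name: str
    hf_token: str
    env_base_url: str
    inference_mode: str


def load_config(environ=None):
    """Build a config from a mapping; defaults to the process environment."""
    env = os.environ if environ is None else environ
    mode = env.get("INFERENCE_MODE", "judge")  # judge is the documented default
    if mode not in VALID_MODES:
        raise ValueError(f"INFERENCE_MODE must be one of {VALID_MODES}, got {mode!r}")
    return InferenceConfig(
        api_base_url=env.get("API_BASE_URL", ""),  # assumed default
        model_name=env.get("MODEL_NAME", ""),      # assumed default
        hf_token=env.get("HF_TOKEN", ""),          # assumed default
        # Assumed default, matching the local uvicorn port shown earlier.
        env_base_url=env.get("ENV_BASE_URL", "http://localhost:8000"),
        inference_mode=mode,
    )


print(load_config({"INFERENCE_MODE": "small"}).inference_mode)
```

Passing a mapping instead of reading `os.environ` directly keeps the loader easy to test.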

Validation Status

Current repo-level checks:

pytest unified_incident_env/tests -q -> 51 passed
openenv validate . -> passes
docker build -t unified-incident-env . -> passes

Hugging Face Space

Configured Space URL:

https://huggingface.co/spaces/dakshdoesdev/unified-incident-env

The repo is structured for a Docker-based Hugging Face Space via openenv.yaml.

Optional Trainer Scaffold

unified_incident_env/trainer/ is secondary tooling for:

trajectory collection
failure analysis
correction dataset generation
updater hooks

It is not a second environment. The judge-facing benchmark remains unified_incident_env.

Reading Order

For a new engineer or agent:

README.md
inference.py
openenv.yaml
unified_incident_env/models.py
unified_incident_env/server/challenge.py
unified_incident_env/server/environment.py
unified_incident_env/server/grader.py
unified_incident_env/tests/