title: Unified Incident Env
emoji: 🚨
colorFrom: blue
colorTo: red
sdk: docker
app_port: 8000
pinned: false
Unified Incident Env
A deterministic OpenEnv benchmark where agents resolve production incidents whose true root cause includes a security vulnerability.
unified_incident_env is one judge-facing environment. It is not a collection of mini-projects and it is not a toy task. Each episode starts as an operational outage, but the correct solution requires the agent to bridge SRE investigation with security remediation, then recover the system in the correct order and submit a postmortem.
Why This Benchmark Matters
Most agent benchmarks test operations or security in isolation. This benchmark forces both within a single episode:
- operational symptoms appear first
- the real root cause can be a security vulnerability
- patching alone is not enough
- recovery alone is not enough
- the final postmortem must reflect the real causal chain
This is what makes it useful for evaluating incident-response agents rather than generic tool-using assistants.
Why It Is Non-Trivial
The benchmark is intentionally built around causal traps:
- restarting the wrong service treats a symptom but does not fix the incident
- patching the wrong vulnerability or wrong patch family wastes steps and score
- recovering infrastructure before closing the exploit path can fail or regress
- weak agents often loop in investigation, security verification, or post-security recovery
Bad agent pattern:
database is down
-> restart database
-> database crashes again because exploit path is still open
Good agent pattern:
find root cause
-> unlock security subquest
-> patch exploit path
-> verify fix
-> recover infrastructure
-> submit postmortem
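As a rough illustration, the good pattern can be written as an ordered list of actions and checked for the one property that separates it from the bad pattern: the security fix is submitted before any recovery action. The action types and field names come from the public action schema; every field value (`database`, `sql_injection`, `parameterized_query`, the postmortem text) and the helper `closes_exploit_before_recovery` are hypothetical placeholders, and the mapping from "find root cause" to specific actions is an assumption.

```python
# Hypothetical trajectory for the Easy scenario. Action types and field names
# follow the public action schema; all field VALUES are illustrative
# placeholders, not values confirmed by the scenario catalog.
GOOD_PATTERN = [
    {"action_type": "query_logs", "service": "database"},
    {"action_type": "inspect_code"},
    {"action_type": "classify_vulnerability", "vulnerability_type": "sql_injection"},
    {"action_type": "apply_patch", "patch_id": "parameterized_query"},
    {"action_type": "verify_security_fix"},
    {"action_type": "submit_security_fix"},
    {"action_type": "restart_service", "service": "database"},
    {"action_type": "submit_postmortem", "postmortem": "SQL injection in login path crashed the database"},
]

# The bad pattern: recover first, never close the exploit path.
BAD_PATTERN = [{"action_type": "restart_service", "service": "database"}]

RECOVERY_ACTIONS = {"restart_service", "rollback_deploy"}


def closes_exploit_before_recovery(actions):
    """Return True if submit_security_fix precedes every recovery action."""
    types = [action["action_type"] for action in actions]
    if "submit_security_fix" not in types:
        return False
    fix_index = types.index("submit_security_fix")
    return all(i > fix_index for i, t in enumerate(types) if t in RECOVERY_ACTIONS)


print(closes_exploit_before_recovery(GOOD_PATTERN))  # True
print(closes_exploit_before_recovery(BAD_PATTERN))   # False
```

This check only inspects ordering. It says nothing about whether the chosen patch or service is the right one, which the environment decides.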
Evaluation Gap
| Property | Simple ops benchmark | Unified Incident Env |
|---|---|---|
| Failure model | broken service only | broken service plus security-rooted cause |
| Agent role | troubleshooter | incident responder plus security repair assistant |
| Action pattern | pure recovery | investigate -> unlock security -> patch -> recover |
| Failure traps | wrong restart | wrong restart plus wrong patch plus wrong order |
| Success condition | service healthy | service healthy plus exploit path closed plus postmortem |
Benchmark Mechanics
Named mechanics that shape behavior:
- Causal traps
- Stage transitions
- Security unlock
- Recovery ordering
- Negative-reward correction pressure
- Deterministic postmortem scoring
These mechanics are explicit in the environment state and reward function. They are not hidden in a black-box grader.
At A Glance
| Item | Value |
|---|---|
| Environment name | unified_incident_env |
| Environment count | 1 |
| Scenario count | 3 |
| Difficulty levels | Easy, Medium, Hard |
| Public actions | 11 |
| Score range | 0.0 to 1.0 |
| Score type | deterministic, dense, bounded |
| Root runner | inference.py |
| OpenEnv validation | passes |
| Test suite | 51 passed |
| Docker build | passes |
| LLM judge | none |
Scenario Pack
| Scenario | Difficulty | Operational failure | Security root cause | Lesson |
|---|---|---|---|---|
| database_sqli_outage | Easy | database crash causes gateway 502s | SQL injection in login path | close exploit before restarting database |
| cache_abuse_broken_access_control | Medium | cache crash and database degradation cascade | broken access control on internal admin endpoint | follow dependency chain and authorization evidence |
| worker_bad_deploy_command_injection | Hard | worker poisons downstream database and gateway | command injection plus bad deploy | stop investigating once enough evidence exists, then patch and roll back the worker path |
Difficulty progression:
Easy -> direct evidence, short recovery chain
Medium -> dependency reasoning, authorization bug
Hard -> cross-service causality, exploit plus deploy rollback
flowchart LR
E["Easy\nDirect evidence"] --> M["Medium\nDependency reasoning"]
M --> H["Hard\nCross-service causality"]
Public Action Schema
Only these action_type values are valid:
[
"query_logs",
"query_metrics",
"query_dependencies",
"restart_service",
"rollback_deploy",
"inspect_code",
"classify_vulnerability",
"apply_patch",
"verify_security_fix",
"submit_security_fix",
"submit_postmortem"
]
Required fields:
| Action | Required fields |
|---|---|
| query_logs | service |
| query_metrics | service, metric |
| query_dependencies | service |
| restart_service | service |
| rollback_deploy | service |
| inspect_code | none |
| classify_vulnerability | vulnerability_type |
| apply_patch | patch_id |
| verify_security_fix | none |
| submit_security_fix | none |
| submit_postmortem | postmortem |
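The required-fields table can be turned into a small client-side check that runs before an action is sent. The sketch below is not the environment's own validation: the helper name `missing_fields` and the rule that `None` or an empty string counts as missing are assumptions.

```python
# Required fields per action_type, copied from the table above.
REQUIRED_FIELDS = {
    "query_logs": ("service",),
    "query_metrics": ("service", "metric"),
    "query_dependencies": ("service",),
    "restart_service": ("service",),
    "rollback_deploy": ("service",),
    "inspect_code": (),
    "classify_vulnerability": ("vulnerability_type",),
    "apply_patch": ("patch_id",),
    "verify_security_fix": (),
    "submit_security_fix": (),
    "submit_postmortem": ("postmortem",),
}


def missing_fields(action):
    """Return the required fields that are absent or empty in `action`.

    Raises ValueError when action_type is not one of the 11 public actions.
    Treating None and "" as missing is an assumption of this sketch.
    """
    action_type = action.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return [
        field
        for field in REQUIRED_FIELDS[action_type]
        if action.get(field) in (None, "")
    ]


# "database" is a placeholder service name used only for illustration.
print(missing_fields({"action_type": "query_metrics", "service": "database"}))  # ['metric']
print(missing_fields({"action_type": "inspect_code"}))  # []
```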
Observation Design
Each step returns a typed observation with:
- tick_count
- workflow_stage
- active_alerts
- service_health
- last_action_result
- tool_output
- failure_type
- why_failed
- allowed_actions
- required_fields_by_action
- valid_action_example
- common_trap
- loop_warning
- blocked_until_security_complete
- security_unlock_reason
- best_recovery_action_family
- progress_flags
- security_subquest_status
- security_context
- final_score
- score_breakdown
- incident_resolved
- reward
- done
This keeps the benchmark deterministic while still making failure states explicit and machine-usable.
Scoring
The score is deterministic and bounded between 0.0 and 1.0.
final_score =
infrastructure_score (0.00 to 0.45) +
security_score (0.00 to 0.35) +
efficiency_score (0.00 to 0.10) +
postmortem_score (0.00 to 0.10)
Score weight view:
| Component | Weight |
|---|---|
| Infrastructure | 0.45 |
| Security | 0.35 |
| Efficiency | 0.10 |
| Postmortem | 0.10 |
Deterministic guarantees:
- preset authored scenarios
- deterministic patch outcomes
- deterministic postmortem scoring
- no hidden fallback in strict benchmark behavior
- no LLM grader
- incomplete security subquest caps the final score at 0.5
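The score arithmetic can be sketched as follows. The component ranges and the 0.5 cap come from the text above. How the real grader applies the cap is not stated here, so applying it as a ceiling after summation is an assumption, as are the function name `final_score`, the dictionary keys, and the rounding.

```python
# Maximum contribution of each component, from the score weight table.
COMPONENT_MAX = {
    "infrastructure": 0.45,
    "security": 0.35,
    "efficiency": 0.10,
    "postmortem": 0.10,
}

# Cap applied when the security subquest is incomplete.
SECURITY_INCOMPLETE_CAP = 0.5


def final_score(components, security_complete):
    """Sum the clamped components; cap the total if security is incomplete.

    Assumption: the cap is a ceiling applied after summation. The real grader
    may implement it differently.
    """
    total = 0.0
    for name, maximum in COMPONENT_MAX.items():
        value = components.get(name, 0.0)
        total += min(max(value, 0.0), maximum)  # clamp to [0, maximum]
    if not security_complete:
        total = min(total, SECURITY_INCOMPLETE_CAP)
    return round(total, 4)


# Full infrastructure recovery but only a partial security subquest:
partial = {"infrastructure": 0.45, "security": 0.10, "efficiency": 0.10, "postmortem": 0.10}
print(final_score(partial, security_complete=False))  # 0.5
```

Under this reading, an agent that recovers every service but leaves the exploit path open cannot exceed half the available score.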
Runtime Flow
model
-> inference.py
-> env.step(action)
-> observation + reward + score
-> next model decision
flowchart LR
A["Model"] --> B["inference.py"]
B --> C["env.step(action)"]
C --> D["observation + reward + score"]
D --> A
Successful episode flow:
diagnosis
-> root cause analysis
-> security subquest
-> remediation
-> verification
-> postmortem
-> done
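The loop shape described above can be sketched with a stand-in environment. `StubEnv`, `run_episode`, and the dictionary-shaped observation are hypothetical and exist only to make the loop runnable. The real environment's `step` signature and return type are not shown in this README, and the stage names are taken from the flow text rather than from confirmed `workflow_stage` values.

```python
class StubEnv:
    """Hypothetical stand-in for the environment.

    It advances one stage per step regardless of the action, which the real
    benchmark does not do. It only mimics the step -> observation loop shape.
    """

    def __init__(self, stages):
        self._stages = list(stages)
        self._index = 0

    def step(self, action):
        self._index += 1
        done = self._index >= len(self._stages)
        stage = "done" if done else self._stages[self._index]
        return {"workflow_stage": stage, "reward": 0.0, "done": done}


def run_episode(env, policy, first_stage, max_steps=50):
    """Drive env with policy until done or max_steps; return visited stages."""
    observation = {"workflow_stage": first_stage, "reward": 0.0, "done": False}
    trace = [observation["workflow_stage"]]
    for _ in range(max_steps):
        if observation["done"]:
            break
        action = policy(observation)      # model decision
        observation = env.step(action)    # observation + reward
        trace.append(observation["workflow_stage"])
    return trace


STAGES = [
    "diagnosis",
    "root cause analysis",
    "security subquest",
    "remediation",
    "verification",
    "postmortem",
]

# Placeholder policy: always query logs for a placeholder service name.
trace = run_episode(
    StubEnv(STAGES),
    policy=lambda obs: {"action_type": "query_logs", "service": "database"},
    first_stage=STAGES[0],
)
print(" -> ".join(trace))
```

The `max_steps` bound is included because weak agents can loop. The actual step budget used by the benchmark is not stated here.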
Inference Path
The root inference.py is the submission runner.
It:
- uses the OpenAI client
- reads API_BASE_URL, MODEL_NAME, and HF_TOKEN
- emits validator-compatible [START], [STEP], and [END] logs
- runs all 3 scenarios through the real environment API
Inference modes:
- INFERENCE_MODE=judge
  - default
  - compact, strong-model-friendly prompt
  - structured outputs first
  - no transcript stuffing
- INFERENCE_MODE=small
  - optional local rescue mode for weaker models
  - compact corrective prompt behavior
Model Notes
Models used during development and validation:
- qwen2.5:1.5b
- qwen2.5:3b
- qwen2.5:7b-instruct-q4_K_M
- gemma2:2b
- llama-3.3-70b-versatile
The default path is optimized for strong external judge models. The optional small mode exists only to support weaker local models without changing the benchmark contract.
Repository Layout
.
├── README.md
├── inference.py
├── openenv.yaml
├── Dockerfile
├── pyproject.toml
├── uv.lock
├── Makefile
├── server/
└── unified_incident_env/
Important internals:
| Path | Purpose |
|---|---|
| server/app.py | top-level app entrypoint |
| unified_incident_env/models.py | typed action, observation, and state models |
| unified_incident_env/server/challenge.py | scenario catalog |
| unified_incident_env/server/environment.py | transition logic |
| unified_incident_env/server/grader.py | deterministic scoring |
| unified_incident_env/scripts/baseline_agent.py | deterministic internal reference baseline |
| unified_incident_env/tests/ | regression tests |
| unified_incident_env/trainer/ | optional secondary tooling |
Running The Repo
Install:
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -e ".[dev]"
Run tests:
pytest unified_incident_env/tests -q
Validate OpenEnv compliance:
openenv validate .
Run the environment locally:
uvicorn server.app:app --host 0.0.0.0 --port 8000
Run the root inference script:
python inference.py
Build and run with Docker:
docker build -t unified-incident-env .
docker run --rm -p 8000:8000 unified-incident-env
Environment Variables
inference.py supports:
- API_BASE_URL
- MODEL_NAME
- HF_TOKEN
- ENV_BASE_URL
- INFERENCE_MODE
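One way a runner might collect these variables is shown below. This is a sketch, not the code in inference.py: `InferenceConfig` and `load_config` are hypothetical names, the empty-string defaults are assumptions, and the `ENV_BASE_URL` default of `http://localhost:8000` is a guess based on the local uvicorn command. Only the `judge` default for `INFERENCE_MODE` is stated in this README.

```python
import os
from dataclasses import dataclass

VALID_MODES = {"judge", "small"}


@dataclass(frozen=True)
class InferenceConfig:
    api_base_url: str
    model_name: str
    hf_token: str
    env_base_url: str
    inference_mode: str


def load_config(environ=None):
    """Build a config from environment variables (or a dict, for testing).

    Defaults other than INFERENCE_MODE=judge are assumptions of this sketch.
    """
    env = os.environ if environ is None else environ
    mode = env.get("INFERENCE_MODE", "judge")
    if mode not in VALID_MODES:
        raise ValueError(f"INFERENCE_MODE must be one of {sorted(VALID_MODES)}, got {mode!r}")
    return InferenceConfig(
        api_base_url=env.get("API_BASE_URL", ""),
        model_name=env.get("MODEL_NAME", ""),
        hf_token=env.get("HF_TOKEN", ""),
        env_base_url=env.get("ENV_BASE_URL", "http://localhost:8000"),
        inference_mode=mode,
    )


# Passing a dict avoids touching the real process environment.
config = load_config({"MODEL_NAME": "qwen2.5:3b", "INFERENCE_MODE": "small"})
print(config.inference_mode)  # small
```

Rejecting unknown modes early keeps a typo in `INFERENCE_MODE` from silently falling back to a different prompt path.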
Validation Status
Current repo-level checks:
- pytest unified_incident_env/tests -q -> 51 passed
- openenv validate . -> passes
- docker build -t unified-incident-env . -> passes
Hugging Face Space
Configured Space URL:
https://huggingface.co/spaces/dakshdoesdev/unified-incident-env
The repo is structured for a Docker-based Hugging Face Space via openenv.yaml.
Optional Trainer Scaffold
unified_incident_env/trainer/ is secondary tooling for:
- trajectory collection
- failure analysis
- correction dataset generation
- updater hooks
It is not a second environment. The judge-facing benchmark remains unified_incident_env.
Reading Order
For a new engineer or agent:
1. README.md
2. inference.py
3. openenv.yaml
4. unified_incident_env/models.py
5. unified_incident_env/server/challenge.py
6. unified_incident_env/server/environment.py
7. unified_incident_env/server/grader.py
8. unified_incident_env/tests/