---
title: Unified Incident Env
emoji: 🚨
colorFrom: blue
colorTo: red
sdk: docker
app_port: 8000
pinned: false
---
# Unified Incident Env
A deterministic OpenEnv benchmark where agents resolve production incidents whose true root cause includes a security vulnerability.
`unified_incident_env` is one judge-facing environment. It is not a collection of mini-projects and it is not a toy task. Each episode starts as an operational outage, but the correct solution requires the agent to bridge SRE investigation with security remediation, then recover the system in the correct order and submit a postmortem.
## Why This Benchmark Matters
Most agent benchmarks test operations or security in isolation. This benchmark forces both:
- operational symptoms appear first
- the real root cause includes a security vulnerability
- patching alone is not enough
- recovery alone is not enough
- the final postmortem must reflect the real causal chain
This is what makes it useful for evaluating incident-response agents rather than generic tool-using assistants.
## Why It Is Non-Trivial
The benchmark is intentionally built around causal traps:
- restarting the wrong service treats a symptom but does not fix the incident
- patching the wrong vulnerability or wrong patch family wastes steps and score
- recovering infrastructure before closing the exploit path can fail or regress
- weak agents often loop in investigation, security verification, or post-security recovery
Bad agent pattern:
```text
database is down
-> restart database
-> database crashes again because exploit path is still open
```
Good agent pattern:
```text
find root cause
-> unlock security subquest
-> patch exploit path
-> verify fix
-> recover infrastructure
-> submit postmortem
```
## Evaluation Gap
| Property | Simple ops benchmark | Unified Incident Env |
| --- | --- | --- |
| Failure model | broken service only | broken service plus security-rooted cause |
| Agent role | troubleshooter | incident responder plus security repair assistant |
| Action pattern | pure recovery | investigate -> unlock security -> patch -> recover |
| Failure traps | wrong restart | wrong restart plus wrong patch plus wrong order |
| Success condition | service healthy | service healthy plus exploit path closed plus postmortem |
## Benchmark Mechanics
Named mechanics that shape behavior:
- Causal traps
- Stage transitions
- Security unlock
- Recovery ordering
- Negative-reward correction pressure
- Deterministic postmortem scoring
These mechanics are explicit in the environment state and reward function. They are not hidden in a black-box grader.
## At A Glance
| Item | Value |
| --- | --- |
| Environment name | `unified_incident_env` |
| Environment count | 1 |
| Scenario count | 3 |
| Difficulty levels | Easy, Medium, Hard |
| Public actions | 11 |
| Score range | `0.0` to `1.0` |
| Score type | deterministic, dense, bounded |
| Root runner | `inference.py` |
| OpenEnv validation | passes |
| Test suite | `51 passed` |
| Docker build | passes |
| LLM judge | none |
## Scenario Pack
| Scenario | Difficulty | Operational failure | Security root cause | Lesson |
| --- | --- | --- | --- | --- |
| `database_sqli_outage` | Easy | database crash causes gateway `502`s | SQL injection in login path | close exploit before restarting database |
| `cache_abuse_broken_access_control` | Medium | cache crash and database degradation cascade | broken access control on internal admin endpoint | follow dependency chain and authorization evidence |
| `worker_bad_deploy_command_injection` | Hard | worker poisons downstream database and gateway | command injection plus bad deploy | stop investigating once enough evidence exists, then patch and roll back the worker path |
Difficulty progression:
```text
Easy -> direct evidence, short recovery chain
Medium -> dependency reasoning, authorization bug
Hard -> cross-service causality, exploit plus deploy rollback
```
```mermaid
flowchart LR
E["Easy\nDirect evidence"] --> M["Medium\nDependency reasoning"]
M --> H["Hard\nCross-service causality"]
```
## Public Action Schema
Only these `action_type` values are valid:
```json
[
"query_logs",
"query_metrics",
"query_dependencies",
"restart_service",
"rollback_deploy",
"inspect_code",
"classify_vulnerability",
"apply_patch",
"verify_security_fix",
"submit_security_fix",
"submit_postmortem"
]
```
Required fields:
| Action | Required fields |
| --- | --- |
| `query_logs` | `service` |
| `query_metrics` | `service`, `metric` |
| `query_dependencies` | `service` |
| `restart_service` | `service` |
| `rollback_deploy` | `service` |
| `inspect_code` | none |
| `classify_vulnerability` | `vulnerability_type` |
| `apply_patch` | `patch_id` |
| `verify_security_fix` | none |
| `submit_security_fix` | none |
| `submit_postmortem` | `postmortem` |
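The required-fields table can be checked on the client before an action is sent. The following is a minimal sketch, not the environment's own validator: the mapping is copied from the table above, while the helper name `missing_fields` and the plain-dict action shape are assumptions made for illustration.

```python
# Required fields per action_type, copied from the table above.
REQUIRED_FIELDS: dict[str, tuple[str, ...]] = {
    "query_logs": ("service",),
    "query_metrics": ("service", "metric"),
    "query_dependencies": ("service",),
    "restart_service": ("service",),
    "rollback_deploy": ("service",),
    "inspect_code": (),
    "classify_vulnerability": ("vulnerability_type",),
    "apply_patch": ("patch_id",),
    "verify_security_fix": (),
    "submit_security_fix": (),
    "submit_postmortem": ("postmortem",),
}


def missing_fields(action: dict) -> list[str]:
    """Return the required fields that are absent or empty in `action`.

    Hypothetical helper: the real environment may validate differently.
    """
    action_type = action.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return [
        name
        for name in REQUIRED_FIELDS[action_type]
        if action.get(name) in (None, "")
    ]
```

For example, an action of type `query_metrics` that carries a `service` but no `metric` would be reported as missing `metric`, and an unknown `action_type` raises `ValueError` before anything is sent.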
## Observation Design
Each step returns a typed observation with:
- `tick_count`
- `workflow_stage`
- `active_alerts`
- `service_health`
- `last_action_result`
- `tool_output`
- `failure_type`
- `why_failed`
- `allowed_actions`
- `required_fields_by_action`
- `valid_action_example`
- `common_trap`
- `loop_warning`
- `blocked_until_security_complete`
- `security_unlock_reason`
- `best_recovery_action_family`
- `progress_flags`
- `security_subquest_status`
- `security_context`
- `final_score`
- `score_breakdown`
- `incident_resolved`
- `reward`
- `done`
This keeps the benchmark deterministic while still making failure states explicit and machine-usable.
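An agent can use these fields to avoid sending actions the current stage would reject. The sketch below assumes that `allowed_actions` is a list of `action_type` strings and that observations are handled as plain dicts; neither is confirmed here, and `first_allowed` is a hypothetical helper name.

```python
from typing import Optional


def first_allowed(observation: dict, candidates: list[dict]) -> Optional[dict]:
    """Return the first candidate whose action_type the observation allows.

    Assumes `allowed_actions` is a list of action_type strings.
    Returns None when no candidate is currently allowed.
    """
    allowed = set(observation.get("allowed_actions") or [])
    for action in candidates:
        if action.get("action_type") in allowed:
            return action
    return None
```

Ordering `candidates` by preference lets a policy fall back from a recovery action to an investigation or security action while the stage still blocks recovery.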
## Scoring
The score is deterministic and bounded between `0.0` and `1.0`.
```text
final_score =
infrastructure_score (0.00 to 0.45) +
security_score (0.00 to 0.35) +
efficiency_score (0.00 to 0.10) +
postmortem_score (0.00 to 0.10)
```
Score weight view:
| Component | Weight |
| --- | ---: |
| Infrastructure | 0.45 |
| Security | 0.35 |
| Efficiency | 0.10 |
| Postmortem | 0.10 |
Deterministic guarantees:
- preset authored scenarios
- deterministic patch outcomes
- deterministic postmortem scoring
- no hidden fallback in strict benchmark behavior
- no LLM grader
- incomplete security subquest caps the final score at `0.5`
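The weights and the cap combine as in the sketch below. It is an illustration of the arithmetic described above, not the code in `grader.py`: the function name, the clamping of each component to its range, and the rounding are assumptions.

```python
# Upper bound of each component, from the weight table above.
MAX_WEIGHTS = {
    "infrastructure": 0.45,
    "security": 0.35,
    "efficiency": 0.10,
    "postmortem": 0.10,
}
# Cap applied when the security subquest is incomplete.
SECURITY_INCOMPLETE_CAP = 0.5


def combine_score(components: dict[str, float], security_complete: bool) -> float:
    """Sum the clamped components and apply the security cap.

    Hypothetical helper: clamping and rounding are assumptions.
    """
    total = 0.0
    for name, upper in MAX_WEIGHTS.items():
        value = components.get(name, 0.0)
        total += min(max(value, 0.0), upper)
    if not security_complete:
        total = min(total, SECURITY_INCOMPLETE_CAP)
    return round(total, 4)
```

Under these assumptions, an episode with full infrastructure, efficiency, and postmortem credit but no security work would sum to `0.65` and then be capped at `0.5`.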
## Runtime Flow
```text
model
-> inference.py
-> env.step(action)
-> observation + reward + score
-> next model decision
```
```mermaid
flowchart LR
A["Model"] --> B["inference.py"]
B --> C["env.step(action)"]
C --> D["observation + reward + score"]
D --> A
```
Successful episode flow:
```text
diagnosis
-> root cause analysis
-> security subquest
-> remediation
-> verification
-> postmortem
-> done
```
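The loop above can be sketched in a few lines. This is not the code in `inference.py`: `ScriptedEnv` is a toy stand-in with an assumed `reset`/`step` interface returning plain dicts, and `run_episode`, `policy`, and `max_steps` are hypothetical names used only to show the control flow.

```python
from typing import Callable


class ScriptedEnv:
    """Toy stand-in for the real environment (hypothetical, for illustration).

    It reports done once it has received `steps_to_finish` actions.
    """

    def __init__(self, steps_to_finish: int = 3) -> None:
        self.steps_to_finish = steps_to_finish
        self.tick_count = 0

    def reset(self) -> dict:
        self.tick_count = 0
        return {"tick_count": 0, "done": False, "reward": 0.0}

    def step(self, action: dict) -> dict:
        self.tick_count += 1
        done = self.tick_count >= self.steps_to_finish
        return {"tick_count": self.tick_count, "done": done, "reward": 0.0}


def run_episode(env, policy: Callable[[dict], dict], max_steps: int = 30) -> dict:
    """Alternate policy decisions and env.step calls.

    Stops when the observation reports done or the step budget runs out,
    and returns the last observation.
    """
    observation = env.reset()
    for _ in range(max_steps):
        if observation.get("done"):
            break
        action = policy(observation)
        observation = env.step(action)
    return observation
```

The step budget matters because weak agents can loop; bounding the loop keeps a stuck episode from running indefinitely.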
## Inference Path
The root `inference.py` is the submission runner.
It:
- uses the OpenAI client
- reads `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`
- emits validator-compatible `[START]`, `[STEP]`, and `[END]` logs
- runs all 3 scenarios through the real environment API
Inference modes:
- `INFERENCE_MODE=judge`
  - default
  - compact, strong-model-friendly prompt
  - structured outputs first
  - no transcript stuffing
- `INFERENCE_MODE=small`
  - optional local rescue mode for weaker models
  - compact corrective prompt behavior
## Model Notes
Models used during development and validation:
- `qwen2.5:1.5b`
- `qwen2.5:3b`
- `qwen2.5:7b-instruct-q4_K_M`
- `gemma2:2b`
- `llama-3.3-70b-versatile`
The default path is optimized for strong external judge models. The optional `small` mode exists only to support weaker local models without changing the benchmark contract.
## Repository Layout
```text
.
├── README.md
├── inference.py
├── openenv.yaml
├── Dockerfile
├── pyproject.toml
├── uv.lock
├── Makefile
├── server/
└── unified_incident_env/
```
Important internals:
| Path | Purpose |
| --- | --- |
| `server/app.py` | top-level app entrypoint |
| `unified_incident_env/models.py` | typed action, observation, and state models |
| `unified_incident_env/server/challenge.py` | scenario catalog |
| `unified_incident_env/server/environment.py` | transition logic |
| `unified_incident_env/server/grader.py` | deterministic scoring |
| `unified_incident_env/scripts/baseline_agent.py` | deterministic internal reference baseline |
| `unified_incident_env/tests/` | regression tests |
| `unified_incident_env/trainer/` | optional secondary tooling |
## Running The Repo
Install:
```bash
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -e ".[dev]"
```
Run tests:
```bash
pytest unified_incident_env/tests -q
```
Validate OpenEnv compliance:
```bash
openenv validate .
```
Run the environment locally:
```bash
uvicorn server.app:app --host 0.0.0.0 --port 8000
```
Run the root inference script:
```bash
python inference.py
```
Build and run with Docker:
```bash
docker build -t unified-incident-env .
docker run --rm -p 8000:8000 unified-incident-env
```
## Environment Variables
`inference.py` supports:
- `API_BASE_URL`
- `MODEL_NAME`
- `HF_TOKEN`
- `ENV_BASE_URL`
- `INFERENCE_MODE`
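A runner might read these variables along the following lines. This is a sketch, not the contents of `inference.py`: only the `judge` default for `INFERENCE_MODE` comes from this README, while the function name, the returned dict, the `None` values for unset variables, and the rejection of unknown modes are assumptions.

```python
import os
from typing import Mapping, Optional

# The two modes documented in this README.
VALID_MODES = ("judge", "small")


def load_settings(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the documented variables into one dict.

    Hypothetical helper: unset variables are returned as None rather
    than given invented defaults.
    """
    env = os.environ if environ is None else environ
    mode = env.get("INFERENCE_MODE", "judge")  # `judge` is the documented default
    if mode not in VALID_MODES:
        raise ValueError(f"INFERENCE_MODE must be one of {VALID_MODES}, got {mode!r}")
    return {
        "api_base_url": env.get("API_BASE_URL"),
        "model_name": env.get("MODEL_NAME"),
        "hf_token": env.get("HF_TOKEN"),
        "env_base_url": env.get("ENV_BASE_URL"),
        "inference_mode": mode,
    }
```

Passing a mapping instead of relying on `os.environ` makes the helper easy to exercise in tests without touching the process environment.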
## Validation Status
Current repo-level checks:
- `pytest unified_incident_env/tests -q` -> `51 passed`
- `openenv validate .` -> passes
- `docker build -t unified-incident-env .` -> passes
## Hugging Face Space
Configured Space URL:
- `https://huggingface.co/spaces/dakshdoesdev/unified-incident-env`
The repo is structured for a Docker-based Hugging Face Space via `openenv.yaml`.
## Optional Trainer Scaffold
`unified_incident_env/trainer/` is secondary tooling for:
- trajectory collection
- failure analysis
- correction dataset generation
- updater hooks
It is not a second environment. The judge-facing benchmark remains `unified_incident_env`.
## Reading Order
For a new engineer or agent:
1. `README.md`
2. `inference.py`
3. `openenv.yaml`
4. `unified_incident_env/models.py`
5. `unified_incident_env/server/challenge.py`
6. `unified_incident_env/server/environment.py`
7. `unified_incident_env/server/grader.py`
8. `unified_incident_env/tests/`