Daksh Verma

Add Spaces metadata to README

404cd62 verified about 2 months ago

	---
	title: Unified Incident Env
	emoji: 🚨
	colorFrom: blue
	colorTo: red
	sdk: docker
	app_port: 8000
	pinned: false
	---

	# Unified Incident Env

	A deterministic OpenEnv benchmark where agents resolve production incidents whose true root cause includes a security vulnerability.

	`unified_incident_env` is one judge-facing environment. It is not a collection of mini-projects and it is not a toy task. Each episode starts as an operational outage, but the correct solution requires the agent to bridge SRE investigation with security remediation, then recover the system in the correct order and submit a postmortem.

	## Why This Benchmark Matters

	Most agent benchmarks test operations or security in isolation. This benchmark forces both:

	- operational symptoms appear first
	- the real root cause can be a security vulnerability
	- patching alone is not enough
	- recovery alone is not enough
	- the final postmortem must reflect the real causal chain

	This is what makes it useful for evaluating incident-response agents rather than generic tool-using assistants.

	## Why It Is Non-Trivial

	The benchmark is intentionally built around causal traps:

	- restarting the wrong service treats a symptom but does not fix the incident
	- applying a patch for the wrong vulnerability, or from the wrong patch family, wastes steps and costs score
	- recovering infrastructure before closing the exploit path can fail or regress
	- weak agents often loop in investigation, security verification, or post-security recovery

	Bad agent pattern:

	```text
	database is down
	-> restart database
	-> database crashes again because exploit path is still open
	```

	Good agent pattern:

	```text
	find root cause
	-> unlock security subquest
	-> patch exploit path
	-> verify fix
	-> recover infrastructure
	-> submit postmortem
	```
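The difference between the two patterns above is an ordering constraint, which can be checked mechanically on an action trace. The following is a minimal sketch, not the environment's own logic: the action names come from the public action schema, but treating `apply_patch` as the point where the exploit path counts as closed is an assumption made for illustration (the real environment may also require verification or submission).

```python
from typing import Iterable, List

# Action names come from the public action schema. Which actions count as
# "recovery" and which one "closes" the exploit path are assumptions here.
RECOVERY_ACTIONS = {"restart_service", "rollback_deploy"}
CLOSING_ACTION = "apply_patch"


def premature_recoveries(action_types: Iterable[str]) -> List[int]:
    """Return indices of recovery actions issued before any patch was applied."""
    patched = False
    flagged: List[int] = []
    for index, action_type in enumerate(action_types):
        if action_type == CLOSING_ACTION:
            patched = True
        elif action_type in RECOVERY_ACTIONS and not patched:
            flagged.append(index)
    return flagged


bad_trace = ["query_logs", "restart_service"]
good_trace = [
    "query_logs", "inspect_code", "classify_vulnerability", "apply_patch",
    "verify_security_fix", "submit_security_fix", "restart_service",
    "submit_postmortem",
]

print(premature_recoveries(bad_trace))   # [1]
print(premature_recoveries(good_trace))  # []
```

A check like this is useful for offline trajectory analysis; it does not replace the stage gating the environment applies at runtime.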

	## Evaluation Gap

	| Property | Simple ops benchmark | Unified Incident Env |
	| --- | --- | --- |
	| Failure model | broken service only | broken service plus security-rooted cause |
	| Agent role | troubleshooter | incident responder plus security repair assistant |
	| Action pattern | pure recovery | investigate -> unlock security -> patch -> recover |
	| Failure traps | wrong restart | wrong restart plus wrong patch plus wrong order |
	| Success condition | service healthy | service healthy plus exploit path closed plus postmortem |

	## Benchmark Mechanics

	Named mechanics that shape behavior:

	- Causal traps
	- Stage transitions
	- Security unlock
	- Recovery ordering
	- Negative-reward correction pressure
	- Deterministic postmortem scoring

	These mechanics are explicit in the environment state and reward function. They are not hidden in a black-box grader.

	## At A Glance

	| Item | Value |
	| --- | --- |
	| Environment name | `unified_incident_env` |
	| Environment count | 1 |
	| Scenario count | 3 |
	| Difficulty levels | Easy, Medium, Hard |
	| Public actions | 11 |
	| Score range | `0.0` to `1.0` |
	| Score type | deterministic, dense, bounded |
	| Root runner | `inference.py` |
	| OpenEnv validation | passes |
	| Test suite | `51 passed` |
	| Docker build | passes |
	| LLM judge | none |

	## Scenario Pack

	| Scenario | Difficulty | Operational failure | Security root cause | Lesson |
	| --- | --- | --- | --- | --- |
	| `database_sqli_outage` | Easy | database crash causes gateway `502`s | SQL injection in login path | close exploit before restarting database |
	| `cache_abuse_broken_access_control` | Medium | cache crash and database degradation cascade | broken access control on internal admin endpoint | follow dependency chain and authorization evidence |
	| `worker_bad_deploy_command_injection` | Hard | worker poisons downstream database and gateway | command injection plus bad deploy | stop investigating once enough evidence exists, then patch and roll back the worker path |

	Difficulty progression:

	```text
	Easy -> direct evidence, short recovery chain
	Medium -> dependency reasoning, authorization bug
	Hard -> cross-service causality, exploit plus deploy rollback
	```

	```mermaid
	flowchart LR
	E["Easy\nDirect evidence"] --> M["Medium\nDependency reasoning"]
	M --> H["Hard\nCross-service causality"]
	```
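For scripting against the scenario pack, the table above can be mirrored as a small lookup structure. This is an illustrative sketch only: the `ScenarioSummary` class and `scenario_for` helper are hypothetical names, and the real catalog in `unified_incident_env/server/challenge.py` may be shaped differently.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ScenarioSummary:
    """Hypothetical summary record; fields are copied from the scenario table."""
    scenario_id: str
    difficulty: str
    security_root_cause: str


SCENARIOS: Tuple[ScenarioSummary, ...] = (
    ScenarioSummary("database_sqli_outage", "Easy",
                    "SQL injection in login path"),
    ScenarioSummary("cache_abuse_broken_access_control", "Medium",
                    "broken access control on internal admin endpoint"),
    ScenarioSummary("worker_bad_deploy_command_injection", "Hard",
                    "command injection plus bad deploy"),
)


def scenario_for(difficulty: str) -> ScenarioSummary:
    """Look up the single scenario registered for a difficulty level."""
    wanted = difficulty.strip().lower()
    for scenario in SCENARIOS:
        if scenario.difficulty.lower() == wanted:
            return scenario
    raise KeyError(f"no scenario with difficulty {difficulty!r}")


print(scenario_for("hard").scenario_id)  # worker_bad_deploy_command_injection
```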

	## Public Action Schema

	Only these `action_type` values are valid:

	```json
	[
	"query_logs",
	"query_metrics",
	"query_dependencies",
	"restart_service",
	"rollback_deploy",
	"inspect_code",
	"classify_vulnerability",
	"apply_patch",
	"verify_security_fix",
	"submit_security_fix",
	"submit_postmortem"
	]
	```

	Required fields:

	| Action | Required fields |
	| --- | --- |
	| `query_logs` | `service` |
	| `query_metrics` | `service`, `metric` |
	| `query_dependencies` | `service` |
	| `restart_service` | `service` |
	| `rollback_deploy` | `service` |
	| `inspect_code` | none |
	| `classify_vulnerability` | `vulnerability_type` |
	| `apply_patch` | `patch_id` |
	| `verify_security_fix` | none |
	| `submit_security_fix` | none |
	| `submit_postmortem` | `postmortem` |
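A client can pre-validate an action against the table above before sending it. The sketch below is a standalone helper, not the typed models in `unified_incident_env/models.py`; the `missing_fields` name and the example values `"database"` and `"cpu"` are hypothetical.

```python
from typing import Any, Dict, List

# Required fields per action_type, copied from the table above.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "query_logs": ["service"],
    "query_metrics": ["service", "metric"],
    "query_dependencies": ["service"],
    "restart_service": ["service"],
    "rollback_deploy": ["service"],
    "inspect_code": [],
    "classify_vulnerability": ["vulnerability_type"],
    "apply_patch": ["patch_id"],
    "verify_security_fix": [],
    "submit_security_fix": [],
    "submit_postmortem": ["postmortem"],
}


def missing_fields(action: Dict[str, Any]) -> List[str]:
    """Return the required fields that are absent or empty in `action`."""
    action_type = action.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return [
        field for field in REQUIRED_FIELDS[action_type]
        if action.get(field) in (None, "")
    ]


# "database" and "cpu" are placeholder values, not names from the environment.
print(missing_fields({"action_type": "query_metrics", "service": "database"}))
# ['metric']
print(missing_fields({"action_type": "query_metrics",
                      "service": "database", "metric": "cpu"}))
# []
```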

	## Observation Design

	Each step returns a typed observation with:

	- `tick_count`
	- `workflow_stage`
	- `active_alerts`
	- `service_health`
	- `last_action_result`
	- `tool_output`
	- `failure_type`
	- `why_failed`
	- `allowed_actions`
	- `required_fields_by_action`
	- `valid_action_example`
	- `common_trap`
	- `loop_warning`
	- `blocked_until_security_complete`
	- `security_unlock_reason`
	- `best_recovery_action_family`
	- `progress_flags`
	- `security_subquest_status`
	- `security_context`
	- `final_score`
	- `score_breakdown`
	- `incident_resolved`
	- `reward`
	- `done`

	This keeps the benchmark deterministic while still making failure states explicit and machine-usable.
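One way an agent loop might use these fields is to reject a proposed action that the observation does not allow and fall back to the advertised example. This is a sketch under assumptions: it treats `allowed_actions` as a list of action-type strings and `valid_action_example` as a dict, neither of which is confirmed by this README, and `choose_action` is a hypothetical helper.

```python
from typing import Any, Dict

Action = Dict[str, Any]


def choose_action(observation: Dict[str, Any], proposed: Action) -> Action:
    """Return `proposed` if the observation allows it, else the advertised example."""
    # Assumption: allowed_actions is a list of action_type strings.
    allowed = observation.get("allowed_actions") or []
    if proposed.get("action_type") in allowed:
        return proposed
    # Assumption: valid_action_example is a ready-to-send action dict.
    example = observation.get("valid_action_example")
    if isinstance(example, dict) and example.get("action_type") in allowed:
        return example
    raise ValueError(f"no usable action; allowed_actions={allowed}")


# Hypothetical observation fragment; real payloads carry every field listed above.
observation = {
    "allowed_actions": ["inspect_code", "classify_vulnerability"],
    "valid_action_example": {"action_type": "inspect_code"},
    "blocked_until_security_complete": True,
}
blocked = {"action_type": "restart_service", "service": "database"}
print(choose_action(observation, blocked))  # {'action_type': 'inspect_code'}
```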

	## Scoring

	The score is deterministic and bounded between `0.0` and `1.0`.

	```text
	final_score =
	infrastructure_score (0.00 to 0.45) +
	security_score (0.00 to 0.35) +
	efficiency_score (0.00 to 0.10) +
	postmortem_score (0.00 to 0.10)
	```

	Score weight view:

	| Component | Weight |
	| --- | ---: |
	| Infrastructure | 0.45 |
	| Security | 0.35 |
	| Efficiency | 0.10 |
	| Postmortem | 0.10 |

	Deterministic guarantees:

	- preset authored scenarios
	- deterministic patch outcomes
	- deterministic postmortem scoring
	- no hidden fallback in strict benchmark behavior
	- no LLM grader
	- incomplete security subquest caps the final score at `0.5`
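The arithmetic above can be sketched as follows. The component ceilings and the `0.5` cap come from this section; clamping each component to its ceiling, applying the cap after summation, and the `final_score` function itself are assumptions for illustration, not the logic in `unified_incident_env/server/grader.py`.

```python
from typing import Dict

# Component ceilings from the scoring formula above.
WEIGHTS: Dict[str, float] = {
    "infrastructure": 0.45,
    "security": 0.35,
    "efficiency": 0.10,
    "postmortem": 0.10,
}
SECURITY_INCOMPLETE_CAP = 0.5


def final_score(components: Dict[str, float], security_complete: bool) -> float:
    """Sum the clamped components, then cap if the security subquest is incomplete."""
    total = 0.0
    for name, ceiling in WEIGHTS.items():
        value = components.get(name, 0.0)
        total += min(max(value, 0.0), ceiling)
    if not security_complete:
        # Assumption: the cap is applied to the summed score.
        total = min(total, SECURITY_INCOMPLETE_CAP)
    return round(total, 4)


partial = {"infrastructure": 0.45, "security": 0.10,
           "efficiency": 0.05, "postmortem": 0.10}
print(final_score(partial, security_complete=False))       # 0.5
print(final_score(dict(WEIGHTS), security_complete=True))  # 1.0
```

In the first call the components sum to 0.70, but the incomplete security subquest caps the result at 0.5.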

	## Runtime Flow

	```text
	model
	-> inference.py
	-> env.step(action)
	-> observation + reward + score
	-> next model decision
	```

	```mermaid
	flowchart LR
	A["Model"] --> B["inference.py"]
	B --> C["env.step(action)"]
	C --> D["observation + reward + score"]
	D --> A
	```

	Successful episode flow:

	```text
	diagnosis
	-> root cause analysis
	-> security subquest
	-> remediation
	-> verification
	-> postmortem
	-> done
	```
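The loop and stage sequence above can be sketched with a toy stand-in for the environment. Everything here is hypothetical: `ToyEnv` is not the real environment class, its `step` return shape is simplified to a dict, and the stage strings are taken from the flow diagram rather than from actual `workflow_stage` values.

```python
from typing import Any, Callable, Dict, List

Observation = Dict[str, Any]
Action = Dict[str, Any]


class ToyEnv:
    """Stand-in for the real environment: advances one stage per step."""

    STAGES = ["diagnosis", "root cause analysis", "security subquest",
              "remediation", "verification", "postmortem", "done"]

    def __init__(self) -> None:
        self.position = 0

    def step(self, action: Action) -> Observation:
        # The toy ignores the action; the real environment would not.
        self.position = min(self.position + 1, len(self.STAGES) - 1)
        stage = self.STAGES[self.position]
        return {"workflow_stage": stage, "reward": 0.0, "done": stage == "done"}


def run_episode(env: ToyEnv, policy: Callable[[Observation], Action],
                max_steps: int = 20) -> List[str]:
    """Drive model -> env.step -> observation until done or the step budget ends."""
    observation: Observation = {"workflow_stage": ToyEnv.STAGES[0], "done": False}
    visited = [observation["workflow_stage"]]
    for _ in range(max_steps):
        if observation["done"]:
            break
        observation = env.step(policy(observation))
        visited.append(observation["workflow_stage"])
    return visited


def fixed_policy(observation: Observation) -> Action:
    # Placeholder policy; a real runner would query the model here.
    return {"action_type": "query_logs", "service": "database"}


print(run_episode(ToyEnv(), fixed_policy)[-1])  # done
```

The `max_steps` budget matters in practice because weak agents can loop within a stage; a bounded loop keeps such episodes from running indefinitely.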

	## Inference Path

	The root `inference.py` is the submission runner.

	It:

	- uses the OpenAI client
	- reads `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`
	- emits validator-compatible `[START]`, `[STEP]`, and `[END]` logs
	- runs all 3 scenarios through the real environment API

	Inference modes:

	- `INFERENCE_MODE=judge`
	  - default
	  - compact, strong-model-friendly prompt
	  - structured outputs first
	  - no transcript stuffing
	- `INFERENCE_MODE=small`
	  - optional local rescue mode for weaker models
	  - compact corrective prompt behavior

	## Model Notes

	Models used during development and validation:

	- `qwen2.5:1.5b`
	- `qwen2.5:3b`
	- `qwen2.5:7b-instruct-q4_K_M`
	- `gemma2:2b`
	- `llama-3.3-70b-versatile`

	The default path is optimized for strong external judge models. The optional `small` mode exists only to support weaker local models without changing the benchmark contract.

	## Repository Layout

	```text
	.
	├── README.md
	├── inference.py
	├── openenv.yaml
	├── Dockerfile
	├── pyproject.toml
	├── uv.lock
	├── Makefile
	├── server/
	└── unified_incident_env/
	```

	Important internals:

	| Path | Purpose |
	| --- | --- |
	| `server/app.py` | top-level app entrypoint |
	| `unified_incident_env/models.py` | typed action, observation, and state models |
	| `unified_incident_env/server/challenge.py` | scenario catalog |
	| `unified_incident_env/server/environment.py` | transition logic |
	| `unified_incident_env/server/grader.py` | deterministic scoring |
	| `unified_incident_env/scripts/baseline_agent.py` | deterministic internal reference baseline |
	| `unified_incident_env/tests/` | regression tests |
	| `unified_incident_env/trainer/` | optional secondary tooling |

	## Running The Repo

	Install:

	```bash
	python3 -m venv .venv
	source .venv/bin/activate
	python -m pip install -e ".[dev]"
	```

	Run tests:

	```bash
	pytest unified_incident_env/tests -q
	```

	Validate OpenEnv compliance:

	```bash
	openenv validate .
	```

	Run the environment locally:

	```bash
	uvicorn server.app:app --host 0.0.0.0 --port 8000
	```

	Run the root inference script:

	```bash
	python inference.py
	```

	Build and run with Docker:

	```bash
	docker build -t unified-incident-env .
	docker run --rm -p 8000:8000 unified-incident-env
	```

	## Environment Variables

	`inference.py` supports:

	- `API_BASE_URL`
	- `MODEL_NAME`
	- `HF_TOKEN`
	- `ENV_BASE_URL`
	- `INFERENCE_MODE`
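A runner might gather these variables into one validated config object. The sketch below is not the code in `inference.py`: the `InferenceConfig` class and `load_config` function are hypothetical, `judge` as the default mode follows the Inference Path section, and the `http://localhost:8000` default for `ENV_BASE_URL` is an assumption based on the local `uvicorn` command.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

VALID_MODES = {"judge", "small"}


@dataclass(frozen=True)
class InferenceConfig:
    """Hypothetical container for the variables listed above."""
    api_base_url: Optional[str]
    model_name: Optional[str]
    hf_token: Optional[str]
    env_base_url: str
    inference_mode: str


def load_config(environ: Mapping[str, str] = os.environ) -> InferenceConfig:
    """Read the supported variables and reject unknown inference modes."""
    mode = environ.get("INFERENCE_MODE", "judge")  # judge is the documented default
    if mode not in VALID_MODES:
        raise ValueError(f"INFERENCE_MODE must be one of {sorted(VALID_MODES)}")
    return InferenceConfig(
        api_base_url=environ.get("API_BASE_URL"),
        model_name=environ.get("MODEL_NAME"),
        hf_token=environ.get("HF_TOKEN"),
        # Assumed default; matches the port used when running locally.
        env_base_url=environ.get("ENV_BASE_URL", "http://localhost:8000"),
        inference_mode=mode,
    )


config = load_config({"MODEL_NAME": "qwen2.5:3b", "INFERENCE_MODE": "small"})
print(config.inference_mode)  # small
```

Passing a mapping instead of reading `os.environ` directly keeps the helper easy to test without mutating the process environment.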

	## Validation Status

	Current repo-level checks:

	- `pytest unified_incident_env/tests -q` -> `51 passed`
	- `openenv validate .` -> passes
	- `docker build -t unified-incident-env .` -> passes

	## Hugging Face Space

	Configured Space URL:

	- `https://huggingface.co/spaces/dakshdoesdev/unified-incident-env`

	The repo is structured for a Docker-based Hugging Face Space via `openenv.yaml`.

	## Optional Trainer Scaffold

	`unified_incident_env/trainer/` is secondary tooling for:

	- trajectory collection
	- failure analysis
	- correction dataset generation
	- updater hooks

	It is not a second environment. The judge-facing benchmark remains `unified_incident_env`.

	## Reading Order

	For a new engineer or agent:

	1. `README.md`
	2. `inference.py`
	3. `openenv.yaml`
	4. `unified_incident_env/models.py`
	5. `unified_incident_env/server/challenge.py`
	6. `unified_incident_env/server/environment.py`
	7. `unified_incident_env/server/grader.py`
	8. `unified_incident_env/tests/`