---
title: Unified Incident Env
emoji: 🚨
colorFrom: blue
colorTo: red
sdk: docker
app_port: 8000
pinned: false
---
# Unified Incident Env

A deterministic OpenEnv benchmark where agents resolve production incidents whose true root cause includes a security vulnerability.

`unified_incident_env` is one judge-facing environment. It is not a collection of mini-projects and it is not a toy task. Each episode starts as an operational outage, but the correct solution requires the agent to bridge SRE investigation with security remediation, then recover the system in the correct order and submit a postmortem.
## Why This Benchmark Matters

Most agent benchmarks test operations or security in isolation. This benchmark requires both:

- operational symptoms appear first
- the real root cause is a security vulnerability
- patching alone is not enough
- recovery alone is not enough
- the final postmortem must reflect the real causal chain

This is what makes it useful for evaluating incident-response agents rather than generic tool-using assistants.
## Why It Is Non-Trivial

The benchmark is intentionally built around causal traps:

- restarting the wrong service treats a symptom but does not fix the incident
- patching the wrong vulnerability or wrong patch family wastes steps and score
- recovering infrastructure before closing the exploit path can fail or regress
- weak agents often loop in investigation, security verification, or post-security recovery

Bad agent pattern:

```text
database is down
  -> restart database
  -> database crashes again because exploit path is still open
```

Good agent pattern:

```text
find root cause
  -> unlock security subquest
  -> patch exploit path
  -> verify fix
  -> recover infrastructure
  -> submit postmortem
```
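
The ordering trap can be illustrated with a small check over a sequence of action types. This is a minimal sketch, not code from the repository: the helper `first_premature_recovery` is hypothetical, and it assumes that `submit_security_fix` marks the point at which the exploit path counts as closed and that `restart_service` and `rollback_deploy` are the recovery actions.

```python
# Action names come from the public action schema; the helper is hypothetical.
# Assumption: recovery means restart_service or rollback_deploy.
RECOVERY_ACTIONS = {"restart_service", "rollback_deploy"}
# Assumption: this action marks the security subquest as complete.
SECURITY_DONE_ACTION = "submit_security_fix"


def first_premature_recovery(action_types):
    """Return the index of the first recovery action issued before the
    security fix was submitted, or None if the ordering is respected."""
    for index, action_type in enumerate(action_types):
        if action_type == SECURITY_DONE_ACTION:
            return None  # recovery after this point is correctly ordered
        if action_type in RECOVERY_ACTIONS:
            return index
    return None


bad = ["query_logs", "restart_service"]
good = [
    "query_logs",
    "inspect_code",
    "classify_vulnerability",
    "apply_patch",
    "verify_security_fix",
    "submit_security_fix",
    "restart_service",
    "submit_postmortem",
]

print(first_premature_recovery(bad))   # 1
print(first_premature_recovery(good))  # None
```

A check like this only flags the symptom-first pattern; the environment itself decides how such a step is rewarded.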
## Evaluation Gap

| Property | Simple ops benchmark | Unified Incident Env |
| --- | --- | --- |
| Failure model | broken service only | broken service plus security-rooted cause |
| Agent role | troubleshooter | incident responder plus security repair assistant |
| Action pattern | pure recovery | investigate -> unlock security -> patch -> recover |
| Failure traps | wrong restart | wrong restart plus wrong patch plus wrong order |
| Success condition | service healthy | service healthy plus exploit path closed plus postmortem |
## Benchmark Mechanics

Named mechanics that shape behavior:

- Causal traps
- Stage transitions
- Security unlock
- Recovery ordering
- Negative-reward correction pressure
- Deterministic postmortem scoring

These mechanics are explicit in the environment state and reward function. They are not hidden in a black-box grader.
## At A Glance

| Item | Value |
| --- | --- |
| Environment name | `unified_incident_env` |
| Environment count | 1 |
| Scenario count | 3 |
| Difficulty levels | Easy, Medium, Hard |
| Public actions | 11 |
| Score range | `0.0` to `1.0` |
| Score type | deterministic, dense, bounded |
| Root runner | `inference.py` |
| OpenEnv validation | passes |
| Test suite | `51 passed` |
| Docker build | passes |
| LLM judge | none |
## Scenario Pack

| Scenario | Difficulty | Operational failure | Security root cause | Lesson |
| --- | --- | --- | --- | --- |
| `database_sqli_outage` | Easy | database crash causes gateway `502`s | SQL injection in login path | close exploit before restarting database |
| `cache_abuse_broken_access_control` | Medium | cache crash and database degradation cascade | broken access control on internal admin endpoint | follow dependency chain and authorization evidence |
| `worker_bad_deploy_command_injection` | Hard | worker poisons downstream database and gateway | command injection plus bad deploy | stop investigating once enough evidence exists, then patch and rollback the worker path |

Difficulty progression:

```text
Easy   -> direct evidence, short recovery chain
Medium -> dependency reasoning, authorization bug
Hard   -> cross-service causality, exploit plus deploy rollback
```

```mermaid
flowchart LR
    E["Easy\nDirect evidence"] --> M["Medium\nDependency reasoning"]
    M --> H["Hard\nCross-service causality"]
```
## Public Action Schema

Only these `action_type` values are valid:

```json
[
  "query_logs",
  "query_metrics",
  "query_dependencies",
  "restart_service",
  "rollback_deploy",
  "inspect_code",
  "classify_vulnerability",
  "apply_patch",
  "verify_security_fix",
  "submit_security_fix",
  "submit_postmortem"
]
```

Required fields:

| Action | Required fields |
| --- | --- |
| `query_logs` | `service` |
| `query_metrics` | `service`, `metric` |
| `query_dependencies` | `service` |
| `restart_service` | `service` |
| `rollback_deploy` | `service` |
| `inspect_code` | none |
| `classify_vulnerability` | `vulnerability_type` |
| `apply_patch` | `patch_id` |
| `verify_security_fix` | none |
| `submit_security_fix` | none |
| `submit_postmortem` | `postmortem` |
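
The required-fields table can be turned into a client-side pre-check before an action is sent. The sketch below is illustrative only: `missing_fields` is a hypothetical helper, and it assumes an action is a flat dictionary with `action_type` and its fields as top-level keys. The actual typed models live in `unified_incident_env/models.py` and may be shaped differently.

```python
# Mapping copied from the required-fields table above.
REQUIRED_FIELDS = {
    "query_logs": ("service",),
    "query_metrics": ("service", "metric"),
    "query_dependencies": ("service",),
    "restart_service": ("service",),
    "rollback_deploy": ("service",),
    "inspect_code": (),
    "classify_vulnerability": ("vulnerability_type",),
    "apply_patch": ("patch_id",),
    "verify_security_fix": (),
    "submit_security_fix": (),
    "submit_postmortem": ("postmortem",),
}


def missing_fields(action):
    """Return the required fields that are absent or empty in `action`.

    Assumption: `action` is a flat dict keyed by field name.
    Raises ValueError for an action_type outside the public schema.
    """
    action_type = action.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return [
        field
        for field in REQUIRED_FIELDS[action_type]
        if action.get(field) in (None, "")
    ]


# "database" is a placeholder service name used only for illustration.
print(missing_fields({"action_type": "query_metrics", "service": "database"}))
# ['metric']
print(missing_fields({"action_type": "verify_security_fix"}))
# []
```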
## Observation Design

Each step returns a typed observation with:

- `tick_count`
- `workflow_stage`
- `active_alerts`
- `service_health`
- `last_action_result`
- `tool_output`
- `failure_type`
- `why_failed`
- `allowed_actions`
- `required_fields_by_action`
- `valid_action_example`
- `common_trap`
- `loop_warning`
- `blocked_until_security_complete`
- `security_unlock_reason`
- `best_recovery_action_family`
- `progress_flags`
- `security_subquest_status`
- `security_context`
- `final_score`
- `score_breakdown`
- `incident_resolved`
- `reward`
- `done`

This keeps the benchmark deterministic while still making failure states explicit and machine-usable.
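
As one way an agent might consume these fields, the sketch below filters candidate actions against the current observation. The field types are assumptions, not taken from the source: it treats `allowed_actions` as a list of `action_type` strings and `blocked_until_security_complete` as a boolean flag, and `usable_actions` is a hypothetical helper. Check `unified_incident_env/models.py` for the real types.

```python
# Assumption: these two action types are the recovery actions.
RECOVERY_ACTIONS = {"restart_service", "rollback_deploy"}


def usable_actions(observation, candidates):
    """Keep only the candidate action types the observation currently permits.

    Assumptions: observation["allowed_actions"] is a list of strings and
    observation["blocked_until_security_complete"] is a boolean.
    """
    allowed = set(observation.get("allowed_actions", []))
    blocked = bool(observation.get("blocked_until_security_complete", False))
    usable = []
    for action_type in candidates:
        if action_type not in allowed:
            continue
        if blocked and action_type in RECOVERY_ACTIONS:
            continue  # recovery is gated on the security subquest
        usable.append(action_type)
    return usable


observation = {
    "allowed_actions": ["query_logs", "inspect_code", "restart_service"],
    "blocked_until_security_complete": True,
}
print(usable_actions(observation, ["restart_service", "inspect_code", "apply_patch"]))
# ['inspect_code']
```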
## Scoring

The score is deterministic and bounded between `0.0` and `1.0`.

```text
final_score =
    infrastructure_score (0.00 to 0.45) +
    security_score       (0.00 to 0.35) +
    efficiency_score     (0.00 to 0.10) +
    postmortem_score     (0.00 to 0.10)
```

Score weight view:

| Component | Weight |
| --- | ---: |
| Infrastructure | 0.45 |
| Security | 0.35 |
| Efficiency | 0.10 |
| Postmortem | 0.10 |

Deterministic guarantees:

- preset authored scenarios
- deterministic patch outcomes
- deterministic postmortem scoring
- no hidden fallback in strict benchmark behavior
- no LLM grader
- incomplete security subquest caps the final score at `0.5`
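
The arithmetic above can be sketched in a few lines. This is not the code in `unified_incident_env/server/grader.py`; it only combines the published component ceilings with the `0.5` cap, and it assumes the cap is applied as a simple minimum after the components are summed. The names `WEIGHTS` and `final_score` are illustrative.

```python
# Component ceilings from the score weight table.
WEIGHTS = {
    "infrastructure": 0.45,
    "security": 0.35,
    "efficiency": 0.10,
    "postmortem": 0.10,
}
SECURITY_INCOMPLETE_CAP = 0.5


def final_score(components, security_complete):
    """Sum the clamped component scores and apply the security cap.

    Assumption: the cap is min(total, 0.5) when the security subquest
    is incomplete; the real grader may apply it differently.
    """
    total = 0.0
    for name, ceiling in WEIGHTS.items():
        value = components.get(name, 0.0)
        total += min(max(value, 0.0), ceiling)  # clamp to [0, ceiling]
    if not security_complete:
        total = min(total, SECURITY_INCOMPLETE_CAP)
    return round(total, 4)


partial = {"infrastructure": 0.45, "security": 0.10, "efficiency": 0.10, "postmortem": 0.10}
print(final_score(partial, security_complete=True))   # 0.75
print(final_score(partial, security_complete=False))  # 0.5
```

Under this reading, an agent that fully recovers infrastructure but skips the security subquest cannot score above `0.5`, no matter how efficient the run or how good the postmortem is.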
## Runtime Flow

```text
model
  -> inference.py
  -> env.step(action)
  -> observation + reward + score
  -> next model decision
```

```mermaid
flowchart LR
    A["Model"] --> B["inference.py"]
    B --> C["env.step(action)"]
    C --> D["observation + reward + score"]
    D --> A
```

Successful episode flow:

```text
diagnosis
  -> root cause analysis
  -> security subquest
  -> remediation
  -> verification
  -> postmortem
  -> done
```
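
The loop above can be sketched against a stand-in environment. Everything here except the `env.step(action)` call shape and the `tick_count`, `reward`, and `done` field names is hypothetical: `StubEnv`, `scripted_policy`, and `run_episode` are illustrative names, the stub returns a plain dict rather than the typed observation the real environment produces, and how a real episode is started is not shown in this document.

```python
class StubEnv:
    """Hypothetical stand-in for the real environment (illustration only)."""

    def __init__(self):
        self.ticks = 0

    def step(self, action):
        self.ticks += 1
        # Assumption for the stub: submitting the postmortem ends the episode.
        done = action["action_type"] == "submit_postmortem"
        return {"tick_count": self.ticks, "reward": 0.0, "done": done}


def scripted_policy(plan):
    """Return a policy that replays a fixed list of actions, ignoring observations."""
    steps = iter(plan)

    def choose(observation):
        return next(steps)

    return choose


def run_episode(env, policy, max_steps=20):
    """Alternate policy decisions and env.step calls until done or max_steps."""
    observation = {"tick_count": 0, "reward": 0.0, "done": False}  # placeholder
    history = []
    for _ in range(max_steps):
        action = policy(observation)
        observation = env.step(action)
        history.append(action["action_type"])
        if observation["done"]:
            break
    return history


plan = [
    {"action_type": "inspect_code"},
    {"action_type": "verify_security_fix"},
    {"action_type": "submit_postmortem", "postmortem": "placeholder text"},
]
print(run_episode(StubEnv(), scripted_policy(plan)))
# ['inspect_code', 'verify_security_fix', 'submit_postmortem']
```

In the real runner the policy would be a model call that reads the observation; the scripted policy only keeps the sketch deterministic.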
## Inference Path

The root `inference.py` is the submission runner.

It:

- uses the OpenAI client
- reads `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`
- emits validator-compatible `[START]`, `[STEP]`, and `[END]` logs
- runs all 3 scenarios through the real environment API

Inference modes:

- `INFERENCE_MODE=judge`
  - default
  - compact, strong-model-friendly prompt
  - structured outputs first
  - no transcript stuffing
- `INFERENCE_MODE=small`
  - optional local rescue mode for weaker models
  - compact corrective prompt behavior
## Model Notes

Models used during development and validation:

- `qwen2.5:1.5b`
- `qwen2.5:3b`
- `qwen2.5:7b-instruct-q4_K_M`
- `gemma2:2b`
- `llama-3.3-70b-versatile`

The default path is optimized for strong external judge models. The optional `small` mode exists only to support weaker local models without changing the benchmark contract.
## Repository Layout

```text
.
├── README.md
├── inference.py
├── openenv.yaml
├── Dockerfile
├── pyproject.toml
├── uv.lock
├── Makefile
├── server/
└── unified_incident_env/
```

Important internals:

| Path | Purpose |
| --- | --- |
| `server/app.py` | top-level app entrypoint |
| `unified_incident_env/models.py` | typed action, observation, and state models |
| `unified_incident_env/server/challenge.py` | scenario catalog |
| `unified_incident_env/server/environment.py` | transition logic |
| `unified_incident_env/server/grader.py` | deterministic scoring |
| `unified_incident_env/scripts/baseline_agent.py` | deterministic internal reference baseline |
| `unified_incident_env/tests/` | regression tests |
| `unified_incident_env/trainer/` | optional secondary tooling |
## Running The Repo

Install:

```bash
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -e ".[dev]"
```

Run tests:

```bash
pytest unified_incident_env/tests -q
```

Validate OpenEnv compliance:

```bash
openenv validate .
```

Run the environment locally:

```bash
uvicorn server.app:app --host 0.0.0.0 --port 8000
```

Run the root inference script:

```bash
python inference.py
```

Build and run with Docker:

```bash
docker build -t unified-incident-env .
docker run --rm -p 8000:8000 unified-incident-env
```
## Environment Variables

`inference.py` supports:

- `API_BASE_URL`
- `MODEL_NAME`
- `HF_TOKEN`
- `ENV_BASE_URL`
- `INFERENCE_MODE`
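
A runner could gather these variables into one configuration object. The sketch below is not the code in `inference.py`: `RunnerConfig` and `load_config` are hypothetical names, and the only default taken from this document is `judge` for `INFERENCE_MODE`. The other variables are left as `None` when unset because their defaults are not documented here.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# The two modes described under "Inference modes".
VALID_MODES = ("judge", "small")


@dataclass(frozen=True)
class RunnerConfig:
    api_base_url: Optional[str]
    model_name: Optional[str]
    hf_token: Optional[str]
    env_base_url: Optional[str]
    inference_mode: str


def load_config(environ: Mapping[str, str] = os.environ) -> RunnerConfig:
    """Read the supported variables; unset values stay None (assumption)."""
    mode = environ.get("INFERENCE_MODE", "judge")  # documented default
    if mode not in VALID_MODES:
        raise ValueError(f"INFERENCE_MODE must be one of {VALID_MODES}, got {mode!r}")
    return RunnerConfig(
        api_base_url=environ.get("API_BASE_URL"),
        model_name=environ.get("MODEL_NAME"),
        hf_token=environ.get("HF_TOKEN"),
        env_base_url=environ.get("ENV_BASE_URL"),
        inference_mode=mode,
    )


# Passing a plain dict keeps the example independent of the real shell environment.
config = load_config({"MODEL_NAME": "qwen2.5:3b", "INFERENCE_MODE": "small"})
print(config.inference_mode)  # small
print(config.api_base_url)    # None
```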
## Validation Status

Current repo-level checks:

- `pytest unified_incident_env/tests -q` -> `51 passed`
- `openenv validate .` -> passes
- `docker build -t unified-incident-env .` -> passes
## Hugging Face Space

Configured Space URL:

- `https://huggingface.co/spaces/dakshdoesdev/unified-incident-env`

The repo is structured for a Docker-based Hugging Face Space via `openenv.yaml`.
## Optional Trainer Scaffold

`unified_incident_env/trainer/` is secondary tooling for:

- trajectory collection
- failure analysis
- correction dataset generation
- updater hooks

It is not a second environment. The judge-facing benchmark remains `unified_incident_env`.
## Reading Order

For a new engineer or agent:

1. `README.md`
2. `inference.py`
3. `openenv.yaml`
4. `unified_incident_env/models.py`
5. `unified_incident_env/server/challenge.py`
6. `unified_incident_env/server/environment.py`
7. `unified_incident_env/server/grader.py`
8. `unified_incident_env/tests/`