---
title: NovaTech Incident Command
emoji: 🚨
colorFrom: red
colorTo: blue
sdk: docker
app_file: app.py
pinned: false
---
# NovaTech Incident Command
NovaTech Incident Command is a hardened OpenEnv environment for realistic incident response under partial observability. Agents do not receive the full system state. They must query logs, inspect service dependencies, update a structured causal hypothesis, choose safe containment, and submit a final incident report.
This version is explicitly designed to avoid common benchmark failures:
- no hidden answer leakage in public state
- no scripted reveal queue
- no keyword-based grader
- no hardcoded baseline answers
- session-safe API with per-episode isolation
## What The Agent Must Do
Each episode simulates a production incident with a fixed action budget.
The agent must:
- retrieve relevant logs using structured filters
- follow dependencies rather than brute-force the whole system
- narrow toward a causal tuple
- avoid destructive containment
- submit a causally consistent final report
## Core Mechanics
### Partial observability
The agent only sees:
- the incident briefing
- the dependency graph
- the logs it has explicitly revealed
It never sees:
- hidden logs
- gold evidence IDs
- grader internals
### Session-safe design
`POST /reset` returns a `session_id`.
All actions in `POST /step` should include that `session_id`, which isolates concurrent episodes and avoids the old shared-global-state exploit.
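A minimal sketch of how a client might thread the `session_id` through an episode. The `make_step_payload` helper is hypothetical and not part of this repository; it assumes the `/reset` response exposes `session_id` as a top-level field, as the observation fields documented in this README suggest.

```python
from typing import Any, Dict


def make_step_payload(reset_response: Dict[str, Any], action: Dict[str, Any]) -> Dict[str, Any]:
    """Attach the episode's session_id to an action body for POST /step.

    Hypothetical helper: assumes the /reset response carries `session_id`
    as a top-level field.
    """
    session_id = reset_response.get("session_id")
    if not session_id:
        raise ValueError("reset response did not contain a session_id")
    # Copy so the caller's action dict is not mutated.
    payload = dict(action)
    payload["session_id"] = session_id
    return payload


# Two concurrent episodes keep separate identifiers, so their steps never mix.
episode_a = {"session_id": "session-a"}
episode_b = {"session_id": "session-b"}
action = {"action_type": "inspect_dependencies", "target_service": "payment-api"}

step_a = make_step_payload(episode_a, action)
step_b = make_step_payload(episode_b, action)
```

Keeping the identifier out of the shared `action` template means the same template can be reused across episodes without leaking one session into another.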
### Seeded stochasticity
Every reset can accept a seed:
```json
{
  "task_id": "medium",
  "seed": 42
}
```
Given the same seed:
- the task-specific log pool is reproducible
- distractor/noise sampling is reproducible
- retrieval order is reproducible
Different seeds slightly vary the non-essential observable context while preserving deterministic grading.
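The reproducibility guarantee can be illustrated with a seeded generator. This is only a sketch of the general technique: `sample_distractors` and the `noise-log-*` pool are invented for illustration, and the environment's actual sampling code is not shown in this README.

```python
import random
from typing import List


def sample_distractors(pool: List[str], seed: int, k: int) -> List[str]:
    """Draw k distractor lines using a generator private to this call.

    Illustrative only: the real sampling logic may differ.
    """
    rng = random.Random(seed)  # isolated RNG, independent of global random state
    return rng.sample(pool, k)


pool = [f"noise-log-{i}" for i in range(20)]  # made-up distractor pool
first = sample_distractors(pool, seed=42, k=5)
second = sample_distractors(pool, seed=42, k=5)
other = sample_distractors(pool, seed=7, k=5)  # a different seed may pick other lines
```

Using a per-episode `random.Random` instance rather than the module-level functions keeps concurrent sessions from disturbing each other's sequences.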
## Observations
Each `reset()` and `step()` returns a structured observation, not a loose blob.
Observation fields:
- `session_id`: the active episode identifier
- `task_id`: task difficulty key
- `task_title`: human-readable incident label
- `briefing`: incident objective, incident window, suspected services, customer statement, and constraints
- `dependency_graph`: the service graph the agent can reason over
- `visible_logs`: only the logs the agent has explicitly revealed
- `revealed_log_count`: number of currently visible logs
- `visited_services`: services already explored through dependency inspection or queries
- `submitted_containment`: containment actions already chosen
- `last_hypothesis`: latest structured causal hypothesis
- `step_number`: current step
- `max_steps`: step budget
- `feedback`: environment guidance after the last action
- `done`: terminal flag
Why this observation design matters:
- it gives enough structure for deliberate planning
- it preserves partial observability
- it prevents answer leakage
- it supports both frontier agents and smaller baselines
Example observation shape:
```json
{
  "session_id": "8e7f...",
  "task_id": "medium",
  "task_title": "Checkout Competing Hypotheses",
  "briefing": {
    "incident_id": "INC-2144",
    "title": "Checkout Competing Hypotheses",
    "objective": "Distinguish a genuine payment dependency outage from plausible but unrelated upstream noise.",
    "incident_window_start": "2025-06-15 06:20:00",
    "incident_window_end": "2025-06-15 06:45:59",
    "suspected_services": ["payment-api", "auth-service", "user-service"],
    "customer_statement": "Customers complete checkout, but confirmations remain pending for tens of seconds.",
    "operational_constraints": [
      "Keep checkout partially available if possible.",
      "Avoid blind restarts."
    ]
  },
  "dependency_graph": {
    "payment-api": ["auth-service", "payment-gateway", "mysql"]
  },
  "visible_logs": [],
  "revealed_log_count": 0,
  "visited_services": [],
  "submitted_containment": [],
  "last_hypothesis": null,
  "step_number": 0,
  "max_steps": 7,
  "feedback": "Episode created. Query the incident window and inspect dependencies to build your case.",
  "done": false
}
```
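An agent may want to sanity-check each observation before planning. The sketch below uses only the field names listed in this README; `check_observation` is a hypothetical client-side helper, and the rule that `revealed_log_count` equals `len(visible_logs)` is an assumption drawn from the field descriptions.

```python
from typing import Any, Dict, Set

# Field names copied from the observation list in this README.
EXPECTED_FIELDS: Set[str] = {
    "session_id", "task_id", "task_title", "briefing", "dependency_graph",
    "visible_logs", "revealed_log_count", "visited_services",
    "submitted_containment", "last_hypothesis", "step_number",
    "max_steps", "feedback", "done",
}


def check_observation(obs: Dict[str, Any]) -> int:
    """Validate an observation dict and return the remaining step budget."""
    missing = EXPECTED_FIELDS - set(obs)
    if missing:
        raise ValueError(f"observation is missing fields: {sorted(missing)}")
    # Assumption: the count field mirrors the length of the visible log list.
    if obs["revealed_log_count"] != len(obs["visible_logs"]):
        raise ValueError("revealed_log_count disagrees with visible_logs")
    return obs["max_steps"] - obs["step_number"]


observation = {
    "session_id": "example-session",
    "task_id": "medium",
    "task_title": "Checkout Competing Hypotheses",
    "briefing": {},
    "dependency_graph": {"payment-api": ["auth-service", "payment-gateway", "mysql"]},
    "visible_logs": [],
    "revealed_log_count": 0,
    "visited_services": [],
    "submitted_containment": [],
    "last_hypothesis": None,
    "step_number": 0,
    "max_steps": 7,
    "feedback": "",
    "done": False,
}
remaining_steps = check_observation(observation)
```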
## Tasks
### Easy: Auth Heap Exhaustion
Reasoning pattern:
- anomaly detection with clear signal
Goal:
- identify auth-service heap exhaustion as the true cause of a login incident
- avoid destructive overreaction
### Medium: Checkout Competing Hypotheses
Reasoning pattern:
- disambiguate competing explanations
Goal:
- determine that the payment confirmation outage is a payment-gateway dependency failure, not just upstream auth noise
### Hard: Cascading Multi-Service Incident
Reasoning pattern:
- partial observability
- timeline reconstruction
- tradeoff-aware containment
Goal:
- identify the initiating service in a multi-service cascade and propose layered containment
## Structured Actions
### Query logs
```json
{
  "session_id": "<session_id>",
  "action_type": "query_logs",
  "query": {
    "service_name": "payment-api",
    "levels": ["CRITICAL", "ERROR"],
    "start_time": "2025-06-15 06:20:00",
    "end_time": "2025-06-15 06:45:59",
    "limit": 6
  }
}
```
### Inspect dependencies
```json
{
  "session_id": "<session_id>",
  "action_type": "inspect_dependencies",
  "target_service": "payment-api"
}
```
### Update hypothesis
```json
{
  "session_id": "<session_id>",
  "action_type": "update_hypothesis",
  "hypothesis": {
    "primary_service": "payment-api",
    "failure_mode": "dependency_outage",
    "dependency": "payment-gateway",
    "customer_impact": "checkout_delays",
    "confidence": 0.87
  }
}
```
### Submit report
```json
{
  "session_id": "<session_id>",
  "action_type": "submit_report",
  "report": {
    "evidence_log_ids": [193, 194, 195],
    "impacted_services": ["payment-api"],
    "root_cause": {
      "primary_service": "payment-api",
      "failure_mode": "dependency_outage",
      "dependency": "payment-gateway",
      "customer_impact": "checkout_delays",
      "confidence": 0.87
    },
    "containment_plan": [
      "restore_payment_gateway_connectivity",
      "reduce_checkout_retry_pressure"
    ],
    "summary": "Checkout confirmations are delayed because payment-api lost connectivity to the payment gateway."
  }
}
```
## Grading
The grader is fully deterministic and structured.
It scores:
- evidence quality via revealed-evidence F1
- root-cause tuple correctness
- impacted-service correctness
- containment alignment
- causal consistency across evidence, service, impact, and timeline
It penalizes:
- unseen evidence references
- contradictions
- forbidden containment
- repeated actions
There is no keyword-bag grader in this version.
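To make "revealed-evidence F1" concrete, here is one plausible reading, offered as a sketch and not as the grader's actual code. It assumes that cited IDs the agent never revealed cannot count as hits but still dilute precision. The gold IDs below are invented for the example.

```python
from typing import Iterable, Set


def evidence_f1(cited: Iterable[int], gold: Set[int], revealed: Set[int]) -> float:
    """F1 between cited and gold evidence, crediting only revealed citations."""
    cited_set = set(cited)
    usable = cited_set & revealed  # citations the agent actually saw
    hits = usable & gold
    if not cited_set or not gold or not hits:
        return 0.0
    precision = len(hits) / len(cited_set)  # unseen citations still dilute precision
    recall = len(hits) / len(gold)
    return 2 * precision * recall / (precision + recall)


# Two of three cited logs are gold, and two of three gold logs were cited.
partial = evidence_f1([193, 194, 195], gold={193, 194, 196}, revealed={193, 194, 195})

# Log 999 was never revealed, so citing it earns nothing even though it is gold.
hallucinated = evidence_f1([193, 999], gold={193, 999}, revealed={193})
```

Under these assumptions `partial` works out to 2/3 and `hallucinated` to 1/2, which shows how guessing hidden IDs lowers the score instead of raising it.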
## Reward Function
Intermediate rewards are dense and shaped:
- `signal_reward`: new relevant evidence
- `hypothesis_reward`: improvement toward the gold causal tuple
- `efficiency_reward`: solving earlier is better
- `penalty`: invalid queries, loops, contradictions, forbidden actions
This makes the environment useful for RL or planning-based evaluation, not just one-shot scoring.
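As a rough illustration of how the four components could combine into a single bounded step reward: the additive form and the clamping below are assumptions, since this README names the components but does not give their weights or formula.

```python
def step_reward(signal_reward: float, hypothesis_reward: float,
                efficiency_reward: float, penalty: float) -> float:
    """Combine the shaped components and clamp the result to [0.0, 1.0].

    Assumed form: a plain sum of rewards minus the penalty.
    """
    total = signal_reward + hypothesis_reward + efficiency_reward - penalty
    return max(0.0, min(1.0, total))


useful_step = step_reward(0.3, 0.2, 0.05, 0.1)   # new evidence, minor penalty
wasted_step = step_reward(0.0, 0.0, 0.0, 0.5)    # penalty only, clamped at 0.0
saturated = step_reward(0.9, 0.5, 0.0, 0.0)      # clamped at 1.0
```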
## Clever Reward Techniques
This environment uses several reward-shaping ideas that are stronger than a typical binary grader.
### 1. Progress reward based on information gain
The agent is rewarded for revealing genuinely relevant signals, not for touching arbitrary logs. A broad but low-value query does not pay nearly as well as a focused query that exposes core evidence.
### 2. Hypothesis-improvement shaping
The environment tracks the best structured hypothesis score seen so far. The agent gets rewarded for improving its causal model over time, not for repeating the same guess. This is especially useful for RL or tree-search agents because it gives signal during reasoning, before final submission.
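The best-so-far idea can be sketched as follows. `tuple_score` and `HypothesisShaper` are hypothetical names, the equal per-field weighting is an assumption, and the reference tuple simply reuses the example hypothesis from this README; it is not the real answer key.

```python
from typing import Dict, Optional

TUPLE_FIELDS = ("primary_service", "failure_mode", "dependency", "customer_impact")


def tuple_score(hypothesis: Dict[str, Optional[str]], gold: Dict[str, str]) -> float:
    """Fraction of causal-tuple fields that match the reference tuple."""
    matches = sum(1 for field in TUPLE_FIELDS if hypothesis.get(field) == gold[field])
    return matches / len(TUPLE_FIELDS)


class HypothesisShaper:
    """Pays only for improvement over the best hypothesis score seen so far."""

    def __init__(self) -> None:
        self.best_score = 0.0

    def reward(self, score: float) -> float:
        gain = max(0.0, score - self.best_score)
        self.best_score = max(self.best_score, score)
        return gain


# Illustrative reference tuple, not the environment's real answer key.
reference = {
    "primary_service": "payment-api",
    "failure_mode": "dependency_outage",
    "dependency": "payment-gateway",
    "customer_impact": "checkout_delays",
}
first_guess = dict(reference, failure_mode="heap_exhaustion", dependency="mysql")

shaper = HypothesisShaper()
rewards = [
    shaper.reward(tuple_score(first_guess, reference)),  # partial match
    shaper.reward(tuple_score(first_guess, reference)),  # same guess repeated
    shaper.reward(tuple_score(reference, reference)),    # full match
]
```

Because only the gain over the running best is paid out, repeating a guess earns nothing, and the total shaping reward across an episode cannot exceed the best score reached.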
### 3. Observation-consistent terminal scoring
The final report is only valid if it cites revealed evidence. This blocks a very common exploit in benchmark environments where agents can hallucinate or hardcode hidden gold evidence.
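An agent can run the same check on its own side before submitting. This sketch assumes each entry in `visible_logs` carries a `log_id` key, which this README implies but does not spell out; `unseen_evidence_ids` is a hypothetical helper.

```python
from typing import Any, Dict, List


def unseen_evidence_ids(report: Dict[str, Any], observation: Dict[str, Any]) -> List[int]:
    """Return cited evidence IDs that are absent from the visible logs."""
    # Assumption: every visible log entry exposes its identifier as `log_id`.
    visible_ids = {log["log_id"] for log in observation["visible_logs"]}
    return sorted(set(report["evidence_log_ids"]) - visible_ids)


observation = {"visible_logs": [{"log_id": 193}, {"log_id": 194}]}
report = {"evidence_log_ids": [193, 194, 195]}

missing = unseen_evidence_ids(report, observation)
if missing:
    # Drop citations the agent never actually saw before submitting.
    report["evidence_log_ids"] = [
        log_id for log_id in report["evidence_log_ids"] if log_id not in missing
    ]
```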
### 4. Contradiction penalties
The grader penalizes internal inconsistency across:
- selected evidence
- claimed root-cause service
- claimed customer impact
- timeline in the hard task
- containment choice
This means an agent cannot simply match one part of the answer key and ignore the rest.
### 5. Safe-containment bias
The containment scorer separately tracks recommended and forbidden actions. This lets the environment reward operational maturity, not just diagnosis. Agents that "solve" incidents by wiping logs or restarting everything are penalized.
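One way to picture a scorer with separate recommended and forbidden sets is sketched below. The zero-on-forbidden rule and the coverage fraction are assumptions, and the forbidden action names are hypothetical stand-ins for "wiping logs" and "restarting everything".

```python
from typing import Iterable, Set


def containment_score(plan: Iterable[str], recommended: Set[str], forbidden: Set[str]) -> float:
    """Score a containment plan: forbidden actions void it, recommended ones earn credit."""
    chosen = set(plan)
    if chosen & forbidden:
        return 0.0  # assumed rule: any destructive action zeroes the score
    if not recommended:
        return 0.0
    return len(chosen & recommended) / len(recommended)


recommended = {"restore_payment_gateway_connectivity", "reduce_checkout_retry_pressure"}
forbidden = {"wipe_logs", "restart_all_services"}  # hypothetical action names

safe = containment_score(["restore_payment_gateway_connectivity"], recommended, forbidden)
unsafe = containment_score(
    ["restore_payment_gateway_connectivity", "restart_all_services"],
    recommended,
    forbidden,
)
```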
### 6. Loop-aware shaping
Repeated identical actions incur additional penalty. That makes the environment better for learning efficient incident workflows instead of degenerate action loops.
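A loop penalty of this kind can be sketched by fingerprinting each action. The `LoopPenalty` class and the `0.05` per-repeat value are illustrative assumptions, not values taken from the environment.

```python
import json
from collections import Counter
from typing import Any, Dict


class LoopPenalty:
    """Charges a growing penalty each time an identical action is repeated."""

    def __init__(self, per_repeat: float = 0.05) -> None:
        self.per_repeat = per_repeat  # assumed value, for illustration only
        self.seen: Counter = Counter()

    def penalty(self, action: Dict[str, Any]) -> float:
        # Canonical JSON makes key order irrelevant when comparing actions.
        key = json.dumps(action, sort_keys=True)
        self.seen[key] += 1
        return self.per_repeat * (self.seen[key] - 1)


tracker = LoopPenalty()
query = {"action_type": "inspect_dependencies", "target_service": "payment-api"}
same_query_reordered = {"target_service": "payment-api", "action_type": "inspect_dependencies"}

penalties = [
    tracker.penalty(query),                 # first time: free
    tracker.penalty(same_query_reordered),  # identical content: first repeat
    tracker.penalty(query),                 # second repeat costs more
]
```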
### 7. Seeded stochastic distractors with deterministic grading
The environment introduces seeded noise into the observable log pool, which makes superficial memorization harder, while the grader remains deterministic for a given seed and task.
In short: the reward is not just dense. It is dense in a way that pushes agents toward better investigation behavior, better causal reasoning, and safer remediation decisions.
## API
- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /health`
- `GET /debug_state`
`/debug_state` is disabled by default and only works when `OPENENV_DEBUG_STATE=true`.
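The gate can be pictured as a small environment-variable check. This is a sketch only: `debug_state_enabled` is a hypothetical helper, and the exact-match comparison against `true` is an assumption based on the sentence above.

```python
import os
from typing import Mapping, Optional


def debug_state_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when OPENENV_DEBUG_STATE is set to the string 'true'."""
    env = os.environ if environ is None else environ
    # Unset or any other value keeps the endpoint disabled.
    return env.get("OPENENV_DEBUG_STATE", "") == "true"


default_off = debug_state_enabled({})
explicit_on = debug_state_enabled({"OPENENV_DEBUG_STATE": "true"})
explicit_off = debug_state_enabled({"OPENENV_DEBUG_STATE": "false"})
```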
## Baseline
`inference.py` is deterministic and observation-driven.
It:
- queries the incident window
- inspects the most suspicious service
- builds a structured hypothesis from revealed logs
- chooses containment from the inferred cause
- submits a final report
It does not use hardcoded gold `log_id` answers.
Required environment variables:
- `HF_TOKEN`
- `API_BASE_URL`
- `MODEL_NAME`
Optional:
- `LOGENV_URL`
- `DB_PATH`
- `OPENAI_API_KEY` (accepted as a compatibility fallback)
- `LOCAL_IMAGE_NAME`
Logging format is strict:
- `[START]`
- `[STEP]`
- `[END]`
Observed baseline scores from the deployed Hugging Face Space:
- `easy`: `0.721`
- `medium`: `0.504`
- `hard`: `0.859`
## Validation Status
Final pre-submission status:
- Hugging Face Space responds to `POST /reset`
- local Docker image builds successfully
- local Docker container responds to `GET /health` and `POST /reset`
- `openenv validate` passes
- `inference.py` runs end to end against the deployed Space
- 3 tasks are present: `easy`, `medium`, `hard`
- rewards and terminal scores are bounded to `[0.0, 1.0]`
- observations are fully documented and exposed as typed structured objects
Requirement coverage summary:
- OpenEnv interfaces: `reset()`, `step()`, `state()`
- typed Pydantic models: `Observation`, `Action`, `Reward`
- real-world domain: incident response / log debugging
- seeded partial observability
- deterministic structured grader
- dense reward shaping with safety and loop penalties
- OpenAI-client baseline using `HF_TOKEN`, `API_BASE_URL`, and `MODEL_NAME`
- Docker and Hugging Face deployment support
- strict `[START]`, `[STEP]`, `[END]` logging
## Round 1 Compliance Checklist
This section maps the submission directly to the Round 1 problem statement and validator expectations.
### Core task requirements
- Real-world environment: yes
  - domain: production incident response / log debugging
  - modeled workflow: log retrieval, dependency inspection, diagnosis, containment, final reporting
- OpenEnv interface: yes
  - `reset() -> initial observation`
  - `step(action) -> observation, reward, done, info`
  - `state() -> current public state`
- Typed models: yes
  - Pydantic `Observation`, `Action`, and `Reward`
- `openenv.yaml`: yes
- `openenv validate`: pass
### Task and grader requirements
- Exactly 3 tasks: yes
  - `easy`
  - `medium`
  - `hard`
- Difficulty progression: yes
  - easy = clear-signal anomaly detection
  - medium = competing hypotheses
  - hard = partial observability plus cascading-failure reasoning
- Deterministic graders: yes
- Scores bounded to `[0.0, 1.0]`: yes
- Penalizes invalid actions, loops, and inefficiency: yes
### Reward requirements
- Dense reward: yes
- Partial-progress shaping: yes
- Loop / contradiction / invalid-action penalties: yes
- Final normalized score: yes
### Deployment requirements
- Hugging Face Space deployment: yes
- `POST /reset` responds `200 OK`: yes
- Dockerfile included: yes
- `docker build` succeeds: yes
- `docker run` succeeds: yes
### Baseline requirements
- root-level `inference.py`: yes
- uses OpenAI client: yes
- supports `API_BASE_URL`: yes
- supports `MODEL_NAME`: yes
- supports `HF_TOKEN`: yes
- also accepts `OPENAI_API_KEY` as compatibility fallback: yes
- optional `LOCAL_IMAGE_NAME`: yes
- strict `[START]`, `[STEP]`, `[END]` logs: yes
- reproducible seeded benchmark flow: yes
- suitable for CPU-only execution under hackathon limits: yes
### Documentation requirements
- environment motivation: yes
- action space definition: yes
- observation space definition: yes
- task descriptions: yes
- reward design explanation: yes
- setup instructions: yes
- Docker instructions: yes
- Hugging Face deployment notes: yes
- baseline scores: yes
### Human-review strengths
- real-world utility: incident-response training and evaluation under partial observability
- environment design: non-leaking public state, session isolation, seeded reproducibility
- grader quality: structured evidence and causal-consistency scoring, not keyword matching
- creativity: multi-stage incident forensics with safety-aware containment and information-gain shaping
## Local Run
```bash
pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860
```
Reset:
```bash
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id":"easy","seed":42}'
```
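The same reset call can be made from Python with only the standard library. The sketch below builds the request without sending it, so it runs even when no server is up; `build_reset_request` is a hypothetical helper, and the URL assumes the local `uvicorn` command shown earlier.

```python
import json
import urllib.request


def build_reset_request(base_url: str, task_id: str, seed: int) -> urllib.request.Request:
    """Build (but do not send) the POST /reset request for a seeded episode."""
    body = json.dumps({"task_id": task_id, "seed": seed}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/reset",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_reset_request("http://localhost:7860", "easy", 42)

# To send it against a running server, uncomment:
# with urllib.request.urlopen(request) as response:
#     observation = json.load(response)
#     session_id = observation["session_id"]
```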
## Docker
```bash
docker build -t novatech-incident-command .
docker run --rm -p 7860:7860 novatech-incident-command
```
## Hugging Face Spaces
This repository is intended for Docker Spaces.
Expected validator path:
- `POST /reset` returns `200 OK`
- `POST /step` accepts typed actions
- `GET /health` returns liveness
## Repo Layout
```text
logenv2/
├── app.py
├── openenv.yaml
├── inference.py
├── Dockerfile
├── requirements.txt
├── preflight.sh
├── novatech_logs.db
├── env/
│   ├── environment.py
│   └── models.py
├── data/
│   └── db_loader.py
└── tasks/
    ├── catalog.py
    └── graders.py
```