title: NovaTech Incident Command
emoji: 🚨
colorFrom: red
colorTo: blue
sdk: docker
app_file: app.py
pinned: false
NovaTech Incident Command
NovaTech Incident Command is a hardened OpenEnv environment for realistic incident response under partial observability. Agents do not receive the full system state. They must query logs, inspect service dependencies, update a structured causal hypothesis, choose safe containment, and submit a final incident report.
This version is explicitly designed to avoid common benchmark failures:
- no hidden answer leakage in public state
- no scripted reveal queue
- no keyword-based grader
- no hardcoded baseline answers
- session-safe API with per-episode isolation
What The Agent Must Do
Each episode simulates a production incident with a fixed action budget.
The agent must:
- retrieve relevant logs using structured filters
- follow dependencies rather than brute-force the whole system
- narrow toward a causal tuple
- avoid destructive containment
- submit a causally consistent final report
Core Mechanics
Partial observability
The agent only sees:
- the incident briefing
- the dependency graph
- the logs it has explicitly revealed
It never sees:
- hidden logs
- gold evidence IDs
- grader internals
Session-safe design
POST /reset returns a session_id.
All actions in POST /step should include that session_id, which isolates concurrent episodes and avoids the old shared-global-state exploit.
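A minimal sketch of per-episode isolation, assuming an in-memory dictionary keyed by session_id. The SessionStore class and its fields are hypothetical and only illustrate the idea; they are not the environment's actual implementation.

```python
import uuid


class SessionStore:
    """Hypothetical per-episode store: every reset gets its own state dict."""

    def __init__(self):
        self._episodes = {}

    def reset(self, task_id, seed=None):
        # A fresh identifier per episode, so no state is shared between callers.
        session_id = uuid.uuid4().hex
        self._episodes[session_id] = {"task_id": task_id, "seed": seed, "step_number": 0}
        return session_id

    def step(self, session_id):
        # Only the episode named by session_id is touched.
        episode = self._episodes.get(session_id)
        if episode is None:
            raise KeyError(f"unknown session_id: {session_id}")
        episode["step_number"] += 1
        return episode["step_number"]
```

Two concurrent episodes created this way advance their step counters independently, which is the property that removes a shared-global-state exploit.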
Seeded stochasticity
Every reset can accept a seed:
{
  "task_id": "medium",
  "seed": 42
}
Given the same seed:
- the task-specific log pool is reproducible
- distractor/noise sampling is reproducible
- retrieval order is reproducible
Different seeds slightly vary the non-essential observable context while preserving deterministic grading.
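One common way to get this kind of reproducibility is a private random generator per episode. The sketch below assumes that approach; sample_distractors and the pool contents are hypothetical names used for illustration only.

```python
import random


def sample_distractors(pool, seed, k):
    """Pick k distractor logs with a private RNG, leaving global random state untouched."""
    rng = random.Random(seed)
    return rng.sample(pool, k)


pool = [f"noise-log-{i}" for i in range(50)]  # hypothetical distractor pool
first = sample_distractors(pool, seed=42, k=5)
second = sample_distractors(pool, seed=42, k=5)
print(first == second)  # prints True: same seed, same selection and order
```

Because the generator is local to the call, two sessions with different seeds cannot disturb each other's sampling.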
Observations
Each reset() and step() returns a structured observation, not a loose blob.
Observation fields:
- session_id: the active episode identifier
- task_id: task difficulty key
- task_title: human-readable incident label
- briefing: incident objective, incident window, suspected services, customer statement, and constraints
- dependency_graph: the service graph the agent can reason over
- visible_logs: only the logs the agent has explicitly revealed
- revealed_log_count: number of currently visible logs
- visited_services: services already explored through dependency inspection or queries
- submitted_containment: containment actions already chosen
- last_hypothesis: latest structured causal hypothesis
- step_number: current step
- max_steps: step budget
- feedback: environment guidance after the last action
- done: terminal flag
Why this observation design matters:
- it gives enough structure for deliberate planning
- it preserves partial observability
- it prevents answer leakage
- it supports both frontier agents and smaller baselines
Example observation shape:
{
  "session_id": "8e7f...",
  "task_id": "medium",
  "task_title": "Checkout Competing Hypotheses",
  "briefing": {
    "incident_id": "INC-2144",
    "title": "Checkout Competing Hypotheses",
    "objective": "Distinguish a genuine payment dependency outage from plausible but unrelated upstream noise.",
    "incident_window_start": "2025-06-15 06:20:00",
    "incident_window_end": "2025-06-15 06:45:59",
    "suspected_services": ["payment-api", "auth-service", "user-service"],
    "customer_statement": "Customers complete checkout, but confirmations remain pending for tens of seconds.",
    "operational_constraints": [
      "Keep checkout partially available if possible.",
      "Avoid blind restarts."
    ]
  },
  "dependency_graph": {
    "payment-api": ["auth-service", "payment-gateway", "mysql"]
  },
  "visible_logs": [],
  "revealed_log_count": 0,
  "visited_services": [],
  "submitted_containment": [],
  "last_hypothesis": null,
  "step_number": 0,
  "max_steps": 7,
  "feedback": "Episode created. Query the incident window and inspect dependencies to build your case.",
  "done": false
}
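A client can sanity-check an observation against the documented field list before acting on it. The sketch below is a plain-Python illustration; check_observation is a hypothetical helper, and the two consistency checks (count equals the number of visible logs, step within budget) are assumptions inferred from the field descriptions.

```python
OBSERVATION_FIELDS = {
    "session_id", "task_id", "task_title", "briefing", "dependency_graph",
    "visible_logs", "revealed_log_count", "visited_services",
    "submitted_containment", "last_hypothesis", "step_number", "max_steps",
    "feedback", "done",
}


def check_observation(obs):
    """Return a list of problems; an empty list means the payload looks consistent."""
    missing = sorted(OBSERVATION_FIELDS - set(obs))
    problems = [f"missing field: {name}" for name in missing]
    if not problems:
        # Assumed invariants, derived from the field descriptions.
        if obs["revealed_log_count"] != len(obs["visible_logs"]):
            problems.append("revealed_log_count does not match visible_logs")
        if obs["step_number"] > obs["max_steps"]:
            problems.append("step_number exceeds max_steps")
    return problems
```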
Tasks
Easy: Auth Heap Exhaustion
Reasoning pattern:
- anomaly detection with clear signal
Goal:
- identify auth-service heap exhaustion as the true cause of a login incident
- avoid destructive overreaction
Medium: Checkout Competing Hypotheses
Reasoning pattern:
- disambiguate competing explanations
Goal:
- determine that the payment confirmation outage is a payment-gateway dependency failure, not just upstream auth noise
Hard: Cascading Multi-Service Incident
Reasoning pattern:
- partial observability
- timeline reconstruction
- tradeoff-aware containment
Goal:
- identify the initiating service in a multi-service cascade and propose layered containment
Structured Actions
Query logs
{
  "session_id": "<session_id>",
  "action_type": "query_logs",
  "query": {
    "service_name": "payment-api",
    "levels": ["CRITICAL", "ERROR"],
    "start_time": "2025-06-15 06:20:00",
    "end_time": "2025-06-15 06:45:59",
    "limit": 6
  }
}
Inspect dependencies
{
  "session_id": "<session_id>",
  "action_type": "inspect_dependencies",
  "target_service": "payment-api"
}
Update hypothesis
{
  "session_id": "<session_id>",
  "action_type": "update_hypothesis",
  "hypothesis": {
    "primary_service": "payment-api",
    "failure_mode": "dependency_outage",
    "dependency": "payment-gateway",
    "customer_impact": "checkout_delays",
    "confidence": 0.87
  }
}
Submit report
{
  "session_id": "<session_id>",
  "action_type": "submit_report",
  "report": {
    "evidence_log_ids": [193, 194, 195],
    "impacted_services": ["payment-api"],
    "root_cause": {
      "primary_service": "payment-api",
      "failure_mode": "dependency_outage",
      "dependency": "payment-gateway",
      "customer_impact": "checkout_delays",
      "confidence": 0.87
    },
    "containment_plan": [
      "restore_payment_gateway_connectivity",
      "reduce_checkout_retry_pressure"
    ],
    "summary": "Checkout confirmations are delayed because payment-api lost connectivity to the payment gateway."
  }
}
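The four action payloads share the same envelope: a session_id, an action_type, and one body key that depends on the action. A small client-side builder, sketched here with the hypothetical helper build_action, keeps that mapping in one place.

```python
# Body key used by each action type, taken from the JSON examples in this section.
ACTION_BODY_KEY = {
    "query_logs": "query",
    "inspect_dependencies": "target_service",
    "update_hypothesis": "hypothesis",
    "submit_report": "report",
}


def build_action(session_id, action_type, body):
    """Assemble a POST /step payload for one of the four documented actions."""
    if action_type not in ACTION_BODY_KEY:
        raise ValueError(f"unsupported action_type: {action_type}")
    return {
        "session_id": session_id,
        "action_type": action_type,
        ACTION_BODY_KEY[action_type]: body,
    }


action = build_action("<session_id>", "inspect_dependencies", "payment-api")
```

Rejecting unknown action types on the client avoids spending a step from the fixed action budget on a request the environment would treat as invalid.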
Grading
The grader is fully deterministic and structured.
It scores:
- evidence quality via revealed-evidence F1
- root-cause tuple correctness
- impacted-service correctness
- containment alignment
- causal consistency across evidence, service, impact, and timeline
It penalizes:
- unseen evidence references
- contradictions
- forbidden containment
- repeated actions
There is no keyword-bag grader in this version.
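To make the evidence component concrete, here is one way a revealed-evidence F1 could be computed. This is a sketch under assumptions, not the grader's actual formula: evidence_f1 is a hypothetical function, and counting unseen citations against precision is one plausible reading of the penalty described above.

```python
def evidence_f1(cited_ids, gold_ids, revealed_ids):
    """Return (f1, unseen_count) for a report's cited evidence.

    Only citations the agent actually revealed can count as hits; citations
    of unrevealed logs are reported separately so they can be penalized.
    """
    cited, gold, revealed = set(cited_ids), set(gold_ids), set(revealed_ids)
    unseen = cited - revealed
    hits = len((cited & revealed) & gold)
    precision = hits / len(cited) if cited else 0.0
    recall = hits / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, len(unseen)
    return 2 * precision * recall / (precision + recall), len(unseen)


# Illustrative IDs only: two valid citations plus one log that was never revealed.
score, unseen_count = evidence_f1([193, 194, 999], [193, 194, 195], [193, 194, 195])
```

In this example precision and recall are both 2/3, so the F1 is 2/3 and one unseen reference is flagged.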
Reward Function
Intermediate rewards are dense and shaped:
- signal_reward: new relevant evidence
- hypothesis_reward: improvement toward the gold causal tuple
- efficiency_reward: solving earlier is better
- penalty: invalid queries, loops, contradictions, forbidden actions
This makes the environment useful for RL or planning-based evaluation, not just one-shot scoring.
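The components can be combined into a single bounded step reward. The sketch below assumes an unweighted sum minus the penalty, clamped to [0.0, 1.0]; the real weighting is not documented here, so treat step_reward as a hypothetical illustration.

```python
def step_reward(signal_reward, hypothesis_reward, efficiency_reward, penalty):
    """Combine the shaped components and clamp the result to [0.0, 1.0]."""
    raw = signal_reward + hypothesis_reward + efficiency_reward - penalty
    return max(0.0, min(1.0, raw))
```

Clamping keeps every intermediate reward inside the same range as the terminal score, so a heavy penalty floors the step at 0.0 rather than going negative.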
Clever Reward Techniques
This environment uses several reward-shaping ideas that are stronger than a typical binary grader.
1. Progress reward based on information gain
The agent is rewarded for revealing genuinely relevant signals, not for touching arbitrary logs. A broad but low-value query does not pay nearly as well as a focused query that exposes core evidence.
2. Hypothesis-improvement shaping
The environment tracks the best structured hypothesis score seen so far. The agent gets rewarded for improving its causal model over time, not for repeating the same guess. This is especially useful for RL or tree-search agents because it gives signal during reasoning, before final submission.
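A best-so-far tracker is enough to express this shaping. The sketch assumes a hypothesis is scored as the fraction of causal-tuple fields that match the gold tuple; HypothesisTracker, hypothesis_score, and that scoring rule are hypothetical, and the gold values in the usage note are illustrative only.

```python
TUPLE_FIELDS = ("primary_service", "failure_mode", "dependency", "customer_impact")


def hypothesis_score(hypothesis, gold):
    """Fraction of causal-tuple fields that match the gold tuple (assumed scoring rule)."""
    matches = sum(hypothesis.get(field) == gold[field] for field in TUPLE_FIELDS)
    return matches / len(TUPLE_FIELDS)


class HypothesisTracker:
    """Pays out only when a new hypothesis beats the best score seen so far."""

    def __init__(self, gold):
        self._gold = gold
        self.best = 0.0

    def update(self, hypothesis):
        score = hypothesis_score(hypothesis, self._gold)
        gain = max(0.0, score - self.best)
        self.best = max(self.best, score)
        return gain
```

Submitting the same half-correct hypothesis twice yields a gain on the first call and 0.0 on the second, so repetition earns nothing while a genuine refinement does.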
3. Observation-consistent terminal scoring
The final report is only valid if it cites revealed evidence. This blocks a very common exploit in benchmark environments where agents can hallucinate or hardcode hidden gold evidence.
4. Contradiction penalties
The grader penalizes internal inconsistency across:
- selected evidence
- claimed root-cause service
- claimed customer impact
- timeline in the hard task
- containment choice
This means an agent cannot simply match one part of the answer key and ignore the rest.
5. Safe-containment bias
The containment scorer separately tracks recommended and forbidden actions. This lets the environment reward operational maturity, not just diagnosis. Agents that "solve" incidents by wiping logs or restarting everything are penalized.
6. Loop-aware shaping
Repeated identical actions incur additional penalty. That makes the environment better for learning efficient incident workflows instead of degenerate action loops.
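Detecting identical actions requires a canonical form, since two JSON payloads can differ in key order and still mean the same thing. The sketch below assumes a linear penalty per repeat; LoopPenalty and the per_repeat value are hypothetical.

```python
import json
from collections import Counter


class LoopPenalty:
    """Charges a growing penalty each time the same action is submitted again."""

    def __init__(self, per_repeat=0.05):
        self._seen = Counter()
        self._per_repeat = per_repeat

    def penalty_for(self, action):
        # sort_keys gives a canonical string, so key order does not hide a repeat.
        key = json.dumps(action, sort_keys=True)
        repeats = self._seen[key]
        self._seen[key] += 1
        return repeats * self._per_repeat
```

The first occurrence of an action costs nothing; each later repeat costs more than the one before it.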
7. Seeded stochastic distractors with deterministic grading
The environment introduces seeded noise into the observable log pool, which makes superficial memorization harder, while the grader remains deterministic for a given seed and task.
In short: the reward is not just dense. It is dense in a way that pushes agents toward better investigation behavior, better causal reasoning, and safer remediation decisions.
API
- POST /reset
- POST /step
- GET /state
- GET /health
- GET /debug_state
/debug_state is disabled by default and only works when OPENENV_DEBUG_STATE=true.
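A flag like this is typically read once from the process environment. The sketch below shows one way to do that; debug_state_enabled is a hypothetical helper, and treating the value case-insensitively is an assumption.

```python
import os


def debug_state_enabled(environ=None):
    """Return True only when OPENENV_DEBUG_STATE is set to "true"."""
    env = os.environ if environ is None else environ
    # Assumption: surrounding whitespace and letter case are ignored.
    return env.get("OPENENV_DEBUG_STATE", "").strip().lower() == "true"
```

Defaulting to disabled when the variable is missing keeps the debug endpoint closed on deployments that never set it.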
Baseline
inference.py is deterministic and observation-driven.
It:
- queries the incident window
- inspects the most suspicious service
- builds a structured hypothesis from revealed logs
- chooses containment from the inferred cause
- submits a final report
It does not use hardcoded gold log_id answers.
Required environment variables:
- HF_TOKEN
- API_BASE_URL
- MODEL_NAME
Optional:
- LOGENV_URL
- DB_PATH
Logging format is strict:
[START], [STEP], [END]
Observed baseline scores from the deployed Hugging Face Space:
- easy: 0.721
- medium: 0.504
- hard: 0.859
Validation Status
Final pre-submission status:
- Hugging Face Space responds to POST /reset
- local Docker image builds successfully
- local Docker container responds to GET /health and POST /reset
- openenv validate passes
- inference.py runs end to end against the deployed Space
- 3 tasks are present: easy, medium, hard
- rewards and terminal scores are bounded to [0.0, 1.0]
- observations are fully documented and exposed as typed structured objects
Requirement coverage summary:
- OpenEnv interfaces: reset(), step(), state()
- typed Pydantic models: Observation, Action, Reward
- real-world domain: incident response / log debugging
- seeded partial observability
- deterministic structured grader
- dense reward shaping with safety and loop penalties
- OpenAI-client baseline using HF_TOKEN, API_BASE_URL, and MODEL_NAME
- Docker and Hugging Face deployment support
- strict [START], [STEP], [END] logging
Round 1 Compliance Checklist
This section maps the submission directly to the Round 1 problem statement and validator expectations.
Core task requirements
- Real-world environment: yes
  - domain: production incident response / log debugging
  - modeled workflow: log retrieval, dependency inspection, diagnosis, containment, final reporting
- OpenEnv interface: yes
  - reset() -> initial observation
  - step(action) -> observation, reward, done, info
  - state() -> current public state
- Typed models: yes
  - Pydantic Observation, Action, and Reward
- openenv.yaml: yes
- openenv validate: pass
Task and grader requirements
- Exactly 3 tasks: yes
  - easy
  - medium
  - hard
- Difficulty progression: yes
  - easy = clear-signal anomaly detection
  - medium = competing hypotheses
  - hard = partial observability plus cascading-failure reasoning
- Deterministic graders: yes
- Scores bounded to [0.0, 1.0]: yes
- Penalizes invalid actions, loops, and inefficiency: yes
Reward requirements
- Dense reward: yes
- Partial-progress shaping: yes
- Loop / contradiction / invalid-action penalties: yes
- Final normalized score: yes
Deployment requirements
- Hugging Face Space deployment: yes
- POST /reset responds 200 OK: yes
- Dockerfile included: yes
- docker build succeeds: yes
- docker run succeeds: yes
Baseline requirements
- root-level inference.py: yes
- uses OpenAI client: yes
- supports API_BASE_URL: yes
- supports MODEL_NAME: yes
- supports HF_TOKEN: yes
- also accepts OPENAI_API_KEY as compatibility fallback: yes
- optional LOCAL_IMAGE_NAME: yes
- strict [START], [STEP], [END] logs: yes
- reproducible seeded benchmark flow: yes
- suitable for CPU-only execution under hackathon limits: yes
Documentation requirements
- environment motivation: yes
- action space definition: yes
- observation space definition: yes
- task descriptions: yes
- reward design explanation: yes
- setup instructions: yes
- Docker instructions: yes
- Hugging Face deployment notes: yes
- baseline scores: yes
Human-review strengths
- real-world utility: incident-response training and evaluation under partial observability
- environment design: non-leaking public state, session isolation, seeded reproducibility
- grader quality: structured evidence and causal-consistency scoring, not keyword matching
- creativity: multi-stage incident forensics with safety-aware containment and information-gain shaping
Local Run
pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860
Reset:
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id":"easy","seed":42}'
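The same reset call can be made from Python with only the standard library. This is a sketch that assumes the server is running locally on port 7860 as in the command above; build_reset_request and reset_episode are hypothetical helper names, and the exact shape of the response body is not assumed beyond it being JSON.

```python
import json
import urllib.request


def build_reset_request(base_url, task_id, seed):
    """Build the POST /reset request without sending it."""
    payload = json.dumps({"task_id": task_id, "seed": seed}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/reset",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reset_episode(base_url, task_id, seed, timeout=10.0):
    """Send the request and return the decoded JSON observation."""
    request = build_reset_request(base_url, task_id, seed)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling reset_episode("http://localhost:7860", "easy", 42) against a running server should return the initial observation, including the session_id needed for later POST /step calls.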
Docker
docker build -t novatech-incident-command .
docker run --rm -p 7860:7860 novatech-incident-command
Hugging Face Spaces
This repository is intended for Docker Spaces.
Expected validator path:
- POST /reset returns 200 OK
- POST /step accepts typed actions
- GET /health returns liveness
Repo Layout
logenv2/
├── app.py
├── openenv.yaml
├── inference.py
├── Dockerfile
├── requirements.txt
├── preflight.sh
├── novatech_logs.db
├── env/
│   ├── environment.py
│   └── models.py
├── data/
│   └── db_loader.py
└── tasks/
    ├── catalog.py
    └── graders.py