---
title: NovaTech Incident Command
emoji: 🚨
colorFrom: red
colorTo: blue
sdk: docker
app_file: app.py
pinned: false
---

# NovaTech Incident Command

NovaTech Incident Command is a hardened OpenEnv environment for realistic incident response under partial observability. Agents do not receive the full system state. They must query logs, inspect service dependencies, update a structured causal hypothesis, choose safe containment, and submit a final incident report.

This version is explicitly designed to avoid common benchmark failures:

- no hidden answer leakage in public state
- no scripted reveal queue
- no keyword-based grader
- no hardcoded baseline answers
- session-safe API with per-episode isolation

## What The Agent Must Do

Each episode simulates a production incident with a fixed action budget. The agent must:

- retrieve relevant logs using structured filters
- follow dependencies rather than brute-force the whole system
- narrow toward a causal tuple
- avoid destructive containment
- submit a causally consistent final report

## Core Mechanics

### Partial observability

The agent only sees:

- the incident briefing
- the dependency graph
- the logs it has explicitly revealed

It never sees:

- hidden logs
- gold evidence IDs
- grader internals

### Session-safe design

`POST /reset` returns a `session_id`. All actions in `POST /step` should include that `session_id`, which isolates concurrent episodes and avoids the old shared-global-state exploit.

### Seeded stochasticity

Every reset can accept a seed:

```json
{
  "task_id": "medium",
  "seed": 42
}
```

Given the same seed:

- the task-specific log pool is reproducible
- distractor/noise sampling is reproducible
- retrieval order is reproducible

Different seeds slightly vary the non-essential observable context while preserving deterministic grading.

## Observations

Each `reset()` and `step()` returns a structured observation, not a loose blob.

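
As a minimal sketch of how a client might carry the `session_id` from a reset observation into later actions: the helper names `reset_payload` and `with_session` are hypothetical and not part of this repository. Only the `task_id`, `seed`, `session_id`, `action_type`, and `target_service` fields are taken from this README, and the observation used here is a placeholder rather than real server output.

```python
from typing import Any, Dict, Optional


def reset_payload(task_id: str, seed: Optional[int] = None) -> Dict[str, Any]:
    """Build a body for POST /reset; the seed is optional."""
    payload: Dict[str, Any] = {"task_id": task_id}
    if seed is not None:
        payload["seed"] = seed
    return payload


def with_session(observation: Dict[str, Any], action: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `action` with the session_id of the latest observation."""
    session_id = observation.get("session_id")
    if not session_id:
        raise ValueError("observation carries no session_id; call /reset first")
    return {**action, "session_id": session_id}


# Placeholder observation standing in for a real /reset response.
observation = {"session_id": "example-session", "task_id": "medium", "done": False}

action = with_session(
    observation,
    {"action_type": "inspect_dependencies", "target_service": "payment-api"},
)
```

Attaching the identifier in one place keeps every `/step` body tied to the episode that produced the observation, which is the isolation property the session-safe design relies on.
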
Observation fields:

- `session_id`: the active episode identifier
- `task_id`: task difficulty key
- `task_title`: human-readable incident label
- `briefing`: incident objective, incident window, suspected services, customer statement, and constraints
- `dependency_graph`: the service graph the agent can reason over
- `visible_logs`: only the logs the agent has explicitly revealed
- `revealed_log_count`: number of currently visible logs
- `visited_services`: services already explored through dependency inspection or queries
- `submitted_containment`: containment actions already chosen
- `last_hypothesis`: latest structured causal hypothesis
- `step_number`: current step
- `max_steps`: step budget
- `feedback`: environment guidance after the last action
- `done`: terminal flag

Why this observation design matters:

- it gives enough structure for deliberate planning
- it preserves partial observability
- it prevents answer leakage
- it supports both frontier agents and smaller baselines

Example observation shape:

```json
{
  "session_id": "8e7f...",
  "task_id": "medium",
  "task_title": "Checkout Competing Hypotheses",
  "briefing": {
    "incident_id": "INC-2144",
    "title": "Checkout Competing Hypotheses",
    "objective": "Distinguish a genuine payment dependency outage from plausible but unrelated upstream noise.",
    "incident_window_start": "2025-06-15 06:20:00",
    "incident_window_end": "2025-06-15 06:45:59",
    "suspected_services": ["payment-api", "auth-service", "user-service"],
    "customer_statement": "Customers complete checkout, but confirmations remain pending for tens of seconds.",
    "operational_constraints": [
      "Keep checkout partially available if possible.",
      "Avoid blind restarts."
    ]
  },
  "dependency_graph": {
    "payment-api": ["auth-service", "payment-gateway", "mysql"]
  },
  "visible_logs": [],
  "revealed_log_count": 0,
  "visited_services": [],
  "submitted_containment": [],
  "last_hypothesis": null,
  "step_number": 0,
  "max_steps": 7,
  "feedback": "Episode created. Query the incident window and inspect dependencies to build your case.",
  "done": false
}
```

## Tasks

### Easy: Auth Heap Exhaustion

Reasoning pattern:

- anomaly detection with clear signal

Goal:

- identify auth-service heap exhaustion as the true cause of a login incident
- avoid destructive overreaction

### Medium: Checkout Competing Hypotheses

Reasoning pattern:

- disambiguate competing explanations

Goal:

- determine that the payment confirmation outage is a payment-gateway dependency failure, not just upstream auth noise

### Hard: Cascading Multi-Service Incident

Reasoning pattern:

- partial observability
- timeline reconstruction
- tradeoff-aware containment

Goal:

- identify the initiating service in a multi-service cascade and propose layered containment

## Structured Actions

### Query logs

```json
{
  "session_id": "",
  "action_type": "query_logs",
  "query": {
    "service_name": "payment-api",
    "levels": ["CRITICAL", "ERROR"],
    "start_time": "2025-06-15 06:20:00",
    "end_time": "2025-06-15 06:45:59",
    "limit": 6
  }
}
```

### Inspect dependencies

```json
{
  "session_id": "",
  "action_type": "inspect_dependencies",
  "target_service": "payment-api"
}
```

### Update hypothesis

```json
{
  "session_id": "",
  "action_type": "update_hypothesis",
  "hypothesis": {
    "primary_service": "payment-api",
    "failure_mode": "dependency_outage",
    "dependency": "payment-gateway",
    "customer_impact": "checkout_delays",
    "confidence": 0.87
  }
}
```

### Submit report

```json
{
  "session_id": "",
  "action_type": "submit_report",
  "report": {
    "evidence_log_ids": [193, 194, 195],
    "impacted_services": ["payment-api"],
    "root_cause": {
      "primary_service": "payment-api",
      "failure_mode": "dependency_outage",
      "dependency": "payment-gateway",
      "customer_impact": "checkout_delays",
      "confidence": 0.87
    },
    "containment_plan": [
      "restore_payment_gateway_connectivity",
      "reduce_checkout_retry_pressure"
    ],
    "summary": "Checkout confirmations are delayed because payment-api lost connectivity to the payment gateway."
  }
}
```

## Grading

The grader is fully deterministic and structured. It scores:

- evidence quality via revealed-evidence F1
- root-cause tuple correctness
- impacted-service correctness
- containment alignment
- causal consistency across evidence, service, impact, and timeline

It penalizes:

- unseen evidence references
- contradictions
- forbidden containment
- repeated actions

There is no keyword-bag grader in this version.

## Reward Function

Intermediate rewards are dense and shaped:

- `signal_reward`: new relevant evidence
- `hypothesis_reward`: improvement toward the gold causal tuple
- `efficiency_reward`: solving earlier is better
- `penalty`: invalid queries, loops, contradictions, forbidden actions

This makes the environment useful for RL or planning-based evaluation, not just one-shot scoring.

## Clever Reward Techniques

This environment uses several reward-shaping ideas that are stronger than a typical binary grader.

### 1. Progress reward based on information gain

The agent is rewarded for revealing genuinely relevant signals, not for touching arbitrary logs. A broad but low-value query does not pay nearly as well as a focused query that exposes core evidence.

### 2. Hypothesis-improvement shaping

The environment tracks the best structured hypothesis score seen so far. The agent gets rewarded for improving its causal model over time, not for repeating the same guess.

This is especially useful for RL or tree-search agents because it gives signal during reasoning, before final submission.

### 3. Observation-consistent terminal scoring

The final report is only valid if it cites revealed evidence. This blocks a very common exploit in benchmark environments where agents can hallucinate or hardcode hidden gold evidence.

### 4. Contradiction penalties

The grader penalizes internal inconsistency across:

- selected evidence
- claimed root-cause service
- claimed customer impact
- timeline in the hard task
- containment choice

This means an agent cannot simply match one part of the answer key and ignore the rest.

### 5. Safe-containment bias

The containment scorer separately tracks recommended and forbidden actions. This lets the environment reward operational maturity, not just diagnosis. Agents that “solve” incidents by wiping logs or restarting everything are penalized.

### 6. Loop-aware shaping

Repeated identical actions incur additional penalty. That makes the environment better for learning efficient incident workflows instead of degenerate action loops.

### 7. Seeded stochastic distractors with deterministic grading

The environment introduces seeded noise into the observable log pool, which makes superficial memorization harder, while the grader remains deterministic for a given seed and task.

In short: the reward is not just dense. It is dense in a way that pushes agents toward better investigation behavior, better causal reasoning, and safer remediation decisions.

## API

- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /health`
- `GET /debug_state`

`/debug_state` is disabled by default and only works when `OPENENV_DEBUG_STATE=true`.

## Baseline

`inference.py` is deterministic and observation-driven. It:

- queries the incident window
- inspects the most suspicious service
- builds a structured hypothesis from revealed logs
- chooses containment from the inferred cause
- submits a final report

It does not use hardcoded gold `log_id` answers.

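
To illustrate what "observation-driven" can mean for the final step, here is a hedged sketch of assembling a `submit_report` action whose evidence comes only from `visible_logs`. It is not the code in `inference.py`. The helper names are hypothetical, and the log-entry keys `log_id`, `service_name`, and `level` are assumptions, since this README does not show the log schema. Listing only the primary service under `impacted_services` is also a simplification.

```python
from typing import Any, Dict, Iterable, List, Sequence


def select_evidence_ids(
    visible_logs: Iterable[Dict[str, Any]],
    service: str,
    levels: Sequence[str] = ("CRITICAL", "ERROR"),
) -> List[Any]:
    """Return IDs of revealed logs for `service` at the given levels.

    Assumes each log entry has `log_id`, `service_name`, and `level` keys.
    """
    return [
        log["log_id"]
        for log in visible_logs
        if log.get("service_name") == service and log.get("level") in levels
    ]


def build_report(
    observation: Dict[str, Any],
    hypothesis: Dict[str, Any],
    containment_plan: Sequence[str],
    summary: str,
) -> Dict[str, Any]:
    """Assemble a submit_report action that cites only revealed evidence."""
    service = hypothesis["primary_service"]
    return {
        "session_id": observation["session_id"],
        "action_type": "submit_report",
        "report": {
            "evidence_log_ids": select_evidence_ids(observation["visible_logs"], service),
            "impacted_services": [service],  # simplification for the sketch
            "root_cause": dict(hypothesis),
            "containment_plan": list(containment_plan),
            "summary": summary,
        },
    }
```

Because the evidence list is derived from the observation and not from constants, a report built this way cannot reference a log the agent never revealed.
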
Required environment variables:

- `HF_TOKEN`
- `API_BASE_URL`
- `MODEL_NAME`

Optional:

- `LOGENV_URL`
- `DB_PATH`

Logging format is strict:

- `[START]`
- `[STEP]`
- `[END]`

Observed baseline scores from the deployed Hugging Face Space:

- `easy`: `0.721`
- `medium`: `0.504`
- `hard`: `0.859`

## Validation Status

Final pre-submission status:

- Hugging Face Space responds to `POST /reset`
- local Docker image builds successfully
- local Docker container responds to `GET /health` and `POST /reset`
- `openenv validate` passes
- `inference.py` runs end to end against the deployed Space
- 3 tasks are present: `easy`, `medium`, `hard`
- rewards and terminal scores are bounded to `[0.0, 1.0]`
- observations are fully documented and exposed as typed structured objects

Requirement coverage summary:

- OpenEnv interfaces: `reset()`, `step()`, `state()`
- typed Pydantic models: `Observation`, `Action`, `Reward`
- real-world domain: incident response / log debugging
- seeded partial observability
- deterministic structured grader
- dense reward shaping with safety and loop penalties
- OpenAI-client baseline using `HF_TOKEN`, `API_BASE_URL`, and `MODEL_NAME`
- Docker and Hugging Face deployment support
- strict `[START]`, `[STEP]`, `[END]` logging

## Round 1 Compliance Checklist

This section maps the submission directly to the Round 1 problem statement and validator expectations.

### Core task requirements

- Real-world environment: yes
  - domain: production incident response / log debugging
  - modeled workflow: log retrieval, dependency inspection, diagnosis, containment, final reporting
- OpenEnv interface: yes
  - `reset() -> initial observation`
  - `step(action) -> observation, reward, done, info`
  - `state() -> current public state`
- Typed models: yes
  - Pydantic `Observation`, `Action`, and `Reward`
- `openenv.yaml`: yes
- `openenv validate`: pass

### Task and grader requirements

- Exactly 3 tasks: yes
  - `easy`
  - `medium`
  - `hard`
- Difficulty progression: yes
  - easy = clear-signal anomaly detection
  - medium = competing hypotheses
  - hard = partial observability plus cascading-failure reasoning
- Deterministic graders: yes
- Scores bounded to `[0.0, 1.0]`: yes
- Penalizes invalid actions, loops, and inefficiency: yes

### Reward requirements

- Dense reward: yes
- Partial-progress shaping: yes
- Loop / contradiction / invalid-action penalties: yes
- Final normalized score: yes

### Deployment requirements

- Hugging Face Space deployment: yes
- `POST /reset` responds `200 OK`: yes
- Dockerfile included: yes
- `docker build` succeeds: yes
- `docker run` succeeds: yes

### Baseline requirements

- root-level `inference.py`: yes
- uses OpenAI client: yes
- supports `API_BASE_URL`: yes
- supports `MODEL_NAME`: yes
- supports `HF_TOKEN`: yes
- also accepts `OPENAI_API_KEY` as compatibility fallback: yes
- optional `LOCAL_IMAGE_NAME`: yes
- strict `[START]`, `[STEP]`, `[END]` logs: yes
- reproducible seeded benchmark flow: yes
- suitable for CPU-only execution under hackathon limits: yes

### Documentation requirements

- environment motivation: yes
- action space definition: yes
- observation space definition: yes
- task descriptions: yes
- reward design explanation: yes
- setup instructions: yes
- Docker instructions: yes
- Hugging Face deployment notes: yes
- baseline scores: yes

### Human-review strengths

- real-world utility: incident-response training and evaluation under partial observability
- environment design: non-leaking public state, session isolation, seeded reproducibility
- grader quality: structured evidence and causal-consistency scoring, not keyword matching
- creativity: multi-stage incident forensics with safety-aware containment and information-gain shaping

## Local Run

```bash
pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860
```

Reset:

```bash
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id":"easy","seed":42}'
```

## Docker

```bash
docker build -t novatech-incident-command .
docker run --rm -p 7860:7860 novatech-incident-command
```

## Hugging Face Spaces

This repository is intended for Docker Spaces.

Expected validator path:

- `POST /reset` returns `200 OK`
- `POST /step` accepts typed actions
- `GET /health` returns liveness

## Repo Layout

```text
logenv2/
├── app.py
├── openenv.yaml
├── inference.py
├── Dockerfile
├── requirements.txt
├── preflight.sh
├── novatech_logs.db
├── env/
│   ├── environment.py
│   └── models.py
├── data/
│   └── db_loader.py
└── tasks/
    ├── catalog.py
    └── graders.py
```
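
For scripted use, the curl reset shown under Local Run has a straightforward Python counterpart. The following is a minimal standard-library sketch, not code from this repository: `build_post` and `send` are hypothetical helper names, and the sketch assumes the server replies with a JSON body, which this README does not spell out.

```python
import json
import urllib.request
from typing import Any, Dict


def build_post(base_url: str, path: str, payload: Dict[str, Any]) -> urllib.request.Request:
    """Prepare a JSON POST request without sending it."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> Any:
    """Send a prepared request and decode the response as JSON (assumed format)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Same body as the curl example; nothing is sent until send() is called.
reset_request = build_post(
    "http://localhost:7860", "/reset", {"task_id": "easy", "seed": 42}
)
# observation = send(reset_request)  # requires the local server to be running
```

Keeping request construction separate from sending makes the payloads easy to inspect or log before any network call is made.
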