name: NovaTechIncidentCommand description: > Seeded OpenEnv incident-response benchmark built from a realistic NovaTech log corpus. Agents operate under partial observability: they must query logs, inspect dependencies, update a structured causal hypothesis, choose safe containment, and submit a final report. tasks: - id: easy description: Detect a clear login outage caused by auth-service heap exhaustion. - id: medium description: Resolve competing hypotheses during a payment confirmation outage. - id: hard description: Reconstruct a cascading multi-service incident under partial observability. action_space: type: structured fields: session_id: string action_type: "query_logs | inspect_dependencies | update_hypothesis | execute_containment | submit_report | request_more | no_anomalies" query: "optional structured filter with service_name, server_id, levels, start_time, end_time, text_contains, limit" target_service: "optional service name" hypothesis: "optional structured tuple: primary_service, failure_mode, dependency, customer_impact, confidence" containment_plan: "optional list of containment action names" report: "optional structured report with evidence_log_ids, impacted_services, root_cause, containment_plan, summary" observation_space: type: structured fields: session_id: string task_id: string task_title: string briefing: "structured incident briefing with incident window, objective, suspected_services, customer_statement, operational_constraints" dependency_graph: "service dependency map" visible_logs: "list of currently revealed log entries only" revealed_log_count: integer visited_services: "list of services explored so far" submitted_containment: "list of chosen containment actions" last_hypothesis: "optional structured root-cause hypothesis" step_number: integer max_steps: integer feedback: string done: boolean notes: - "Observations expose only agent-revealed logs." - "The dependency graph is visible, but hidden logs and gold evidence remain private." - "The latest structured hypothesis is included so agents can reason iteratively." reward_definition: type: scalar range: [0.0, 1.0] components: signal_reward: "Rewards newly discovered relevant signals and evidence quality." hypothesis_reward: "Rewards improvement toward the gold causal tuple and safe containment alignment." efficiency_reward: "Rewards solving within the action budget." penalty: "Penalizes unseen evidence, contradictions, forbidden containment, loops, and empty queries." techniques: - "Information-gain shaping: focused discovery beats broad noisy retrieval." - "Best-hypothesis tracking: reward is tied to causal improvement across the episode." - "Observation-consistent grading: unseen evidence references are rejected." - "Contradiction penalties: evidence, cause, impact, and timeline must agree." - "Safety shaping: destructive containment is penalized even if diagnosis is partially correct." interfaces: reset: "reset() -> initial observation" step: "step(action) -> observation, reward, done, info" state: "state() -> non-leaking public session state"