metadata
title: NovaTech Incident Command
emoji: 🚨
colorFrom: red
colorTo: blue
sdk: docker
app_file: app.py
pinned: false

NovaTech Incident Command

NovaTech Incident Command is a hardened OpenEnv environment for realistic incident response under partial observability. Agents do not receive the full system state. They must query logs, inspect service dependencies, update a structured causal hypothesis, choose safe containment, and submit a final incident report.

This version is explicitly designed to avoid common benchmark failures:

  • no hidden answer leakage in public state
  • no scripted reveal queue
  • no keyword-based grader
  • no hardcoded baseline answers
  • session-safe API with per-episode isolation

What The Agent Must Do

Each episode simulates a production incident with a fixed action budget.

The agent must:

  • retrieve relevant logs using structured filters
  • follow dependencies rather than brute-force the whole system
  • narrow toward a causal tuple
  • avoid destructive containment
  • submit a causally consistent final report

Core Mechanics

Partial observability

The agent only sees:

  • the incident briefing
  • the dependency graph
  • the logs it has explicitly revealed

It never sees:

  • hidden logs
  • gold evidence IDs
  • grader internals

Session-safe design

POST /reset returns a session_id.

Every POST /step action should include that session_id. This isolates concurrent episodes and closes the shared-global-state exploit of earlier versions.
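A minimal client-side sketch of this flow, using only the Python standard library. The helper names (`build_request`, `reset_request`, `step_request`, `send`) are illustrative and not part of the repository, and the base URL assumes the local server described under Local Run.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumption: local server from the Local Run section


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reset_request(task_id: str, seed: int) -> urllib.request.Request:
    """Request a new episode; the response is expected to carry a session_id."""
    return build_request("/reset", {"task_id": task_id, "seed": seed})


def step_request(session_id: str, action: dict) -> urllib.request.Request:
    """Attach the session_id to an action so it is routed to the right episode."""
    return build_request("/step", {"session_id": session_id, **action})


def send(request: urllib.request.Request) -> dict:
    """Send a prepared request and decode the JSON body (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes it easy to check payloads offline before pointing the client at a live Space.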

Seeded stochasticity

Every reset can accept a seed:

{
  "task_id": "medium",
  "seed": 42
}

Given the same seed:

  • the task-specific log pool is reproducible
  • distractor/noise sampling is reproducible
  • retrieval order is reproducible

Different seeds vary only the non-essential observable context, such as distractor logs and their ordering. Grading stays deterministic for a given seed and task.
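The reproducibility property can be illustrated with a per-episode random generator. This is a sketch of the general technique, not the environment's actual sampling code; `sample_distractors` and the log labels are made up for the example.

```python
import random


def sample_distractors(pool: list[str], k: int, seed: int) -> list[str]:
    """Pick k distractor entries using a generator private to this episode."""
    rng = random.Random(seed)  # local RNG: no shared global random state
    return rng.sample(pool, k)


pool = [f"noise-log-{i}" for i in range(20)]  # made-up log labels
first = sample_distractors(pool, 5, seed=42)
second = sample_distractors(pool, 5, seed=42)
print(first == second)  # True: same seed, same pool, same draw
```

Using a local `random.Random` instance rather than the module-level functions also keeps concurrent sessions from disturbing each other's draws.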

Observations

Each reset() and step() returns a structured observation, not a loose blob.

Observation fields:

  • session_id: the active episode identifier
  • task_id: task difficulty key
  • task_title: human-readable incident label
  • briefing: incident objective, incident window, suspected services, customer statement, and constraints
  • dependency_graph: the service graph the agent can reason over
  • visible_logs: only the logs the agent has explicitly revealed
  • revealed_log_count: number of currently visible logs
  • visited_services: services already explored through dependency inspection or queries
  • submitted_containment: containment actions already chosen
  • last_hypothesis: latest structured causal hypothesis
  • step_number: current step
  • max_steps: step budget
  • feedback: environment guidance after the last action
  • done: terminal flag

Why this observation design matters:

  • it gives enough structure for deliberate planning
  • it preserves partial observability
  • it prevents answer leakage
  • it supports both frontier agents and smaller baselines

Example observation shape:

{
  "session_id": "8e7f...",
  "task_id": "medium",
  "task_title": "Checkout Competing Hypotheses",
  "briefing": {
    "incident_id": "INC-2144",
    "title": "Checkout Competing Hypotheses",
    "objective": "Distinguish a genuine payment dependency outage from plausible but unrelated upstream noise.",
    "incident_window_start": "2025-06-15 06:20:00",
    "incident_window_end": "2025-06-15 06:45:59",
    "suspected_services": ["payment-api", "auth-service", "user-service"],
    "customer_statement": "Customers complete checkout, but confirmations remain pending for tens of seconds.",
    "operational_constraints": [
      "Keep checkout partially available if possible.",
      "Avoid blind restarts."
    ]
  },
  "dependency_graph": {
    "payment-api": ["auth-service", "payment-gateway", "mysql"]
  },
  "visible_logs": [],
  "revealed_log_count": 0,
  "visited_services": [],
  "submitted_containment": [],
  "last_hypothesis": null,
  "step_number": 0,
  "max_steps": 7,
  "feedback": "Episode created. Query the incident window and inspect dependencies to build your case.",
  "done": false
}
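An agent can derive simple planning signals directly from these fields. The helpers below are hypothetical (they are not part of the repository) and assume the observation has been decoded into a plain `dict` with the field names listed above.

```python
def remaining_steps(obs: dict) -> int:
    """Actions left before the step budget runs out."""
    return obs["max_steps"] - obs["step_number"]


def unvisited_suspects(obs: dict) -> list[str]:
    """Suspected services from the briefing that have not been explored yet."""
    visited = set(obs["visited_services"])
    return [s for s in obs["briefing"]["suspected_services"] if s not in visited]


# Trimmed observation using the documented field names.
obs = {
    "briefing": {"suspected_services": ["payment-api", "auth-service", "user-service"]},
    "visited_services": ["payment-api"],
    "step_number": 2,
    "max_steps": 7,
}
print(remaining_steps(obs))     # 5
print(unvisited_suspects(obs))  # ['auth-service', 'user-service']
```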

Tasks

Easy: Auth Heap Exhaustion

Reasoning pattern:

  • anomaly detection with clear signal

Goal:

  • identify auth-service heap exhaustion as the true cause of a login incident
  • avoid destructive overreaction

Medium: Checkout Competing Hypotheses

Reasoning pattern:

  • disambiguate competing explanations

Goal:

  • determine that the payment confirmation outage is a payment-gateway dependency failure, not just upstream auth noise

Hard: Cascading Multi-Service Incident

Reasoning pattern:

  • partial observability
  • timeline reconstruction
  • tradeoff-aware containment

Goal:

  • identify the initiating service in a multi-service cascade and propose layered containment

Structured Actions

Query logs

{
  "session_id": "<session_id>",
  "action_type": "query_logs",
  "query": {
    "service_name": "payment-api",
    "levels": ["CRITICAL", "ERROR"],
    "start_time": "2025-06-15 06:20:00",
    "end_time": "2025-06-15 06:45:59",
    "limit": 6
  }
}

Inspect dependencies

{
  "session_id": "<session_id>",
  "action_type": "inspect_dependencies",
  "target_service": "payment-api"
}

Update hypothesis

{
  "session_id": "<session_id>",
  "action_type": "update_hypothesis",
  "hypothesis": {
    "primary_service": "payment-api",
    "failure_mode": "dependency_outage",
    "dependency": "payment-gateway",
    "customer_impact": "checkout_delays",
    "confidence": 0.87
  }
}

Submit report

{
  "session_id": "<session_id>",
  "action_type": "submit_report",
  "report": {
    "evidence_log_ids": [193, 194, 195],
    "impacted_services": ["payment-api"],
    "root_cause": {
      "primary_service": "payment-api",
      "failure_mode": "dependency_outage",
      "dependency": "payment-gateway",
      "customer_impact": "checkout_delays",
      "confidence": 0.87
    },
    "containment_plan": [
      "restore_payment_gateway_connectivity",
      "reduce_checkout_retry_pressure"
    ],
    "summary": "Checkout confirmations are delayed because payment-api lost connectivity to the payment gateway."
  }
}
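Judging from the four examples above, each action type carries one type-specific field. The following client-side check is a sketch built on that observation; `REQUIRED_FIELD` and `check_action` are illustrative names, and the server's own validation may be stricter.

```python
# Type-specific field per action, inferred from the example payloads above.
REQUIRED_FIELD = {
    "query_logs": "query",
    "inspect_dependencies": "target_service",
    "update_hypothesis": "hypothesis",
    "submit_report": "report",
}


def check_action(action: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks valid."""
    problems = []
    if not action.get("session_id"):
        problems.append("missing session_id")
    action_type = action.get("action_type")
    if action_type not in REQUIRED_FIELD:
        problems.append(f"unknown action_type: {action_type!r}")
    elif REQUIRED_FIELD[action_type] not in action:
        problems.append(f"{action_type} requires '{REQUIRED_FIELD[action_type]}'")
    return problems
```

Running a check like this before each POST /step helps avoid spending budget on actions that would be rejected as invalid.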

Grading

The grader is fully deterministic and structured.

It scores:

  • evidence quality via revealed-evidence F1
  • root-cause tuple correctness
  • impacted-service correctness
  • containment alignment
  • causal consistency across evidence, service, impact, and timeline

It penalizes:

  • unseen evidence references
  • contradictions
  • forbidden containment
  • repeated actions

There is no keyword-bag grader in this version.
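As an illustration of the evidence component, the sketch below computes a standard F1 between cited and gold log IDs and flags citations the agent never revealed. The exact formula used in `tasks/graders.py` is not documented here, so treat this as an assumption; the gold IDs in the example are invented.

```python
def evidence_f1(cited: set[int], gold: set[int]) -> float:
    """Standard F1 between cited evidence IDs and gold evidence IDs."""
    hits = len(cited & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(cited)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def unseen_citations(cited: set[int], revealed: set[int]) -> set[int]:
    """Cited IDs that were never revealed to the agent (candidates for a penalty)."""
    return cited - revealed


# Invented gold set: two of the three cited logs are correct.
score = evidence_f1({193, 194, 195}, {193, 194, 196})
print(round(score, 3))  # 0.667
```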

Reward Function

Intermediate rewards are dense and shaped:

  • signal_reward: new relevant evidence
  • hypothesis_reward: improvement toward the gold causal tuple
  • efficiency_reward: solving earlier is better
  • penalty: invalid queries, loops, contradictions, forbidden actions

This makes the environment useful for RL or planning-based evaluation, not just one-shot scoring.
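One plausible way to combine these components into a bounded per-step reward is sketched below. The additive form and the clamp to [0.0, 1.0] are assumptions for illustration; `StepReward` here is a plain dataclass, not the repository's Pydantic `Reward` model.

```python
from dataclasses import dataclass


@dataclass
class StepReward:
    signal_reward: float = 0.0
    hypothesis_reward: float = 0.0
    efficiency_reward: float = 0.0
    penalty: float = 0.0

    def total(self) -> float:
        """Sum the positive components, subtract the penalty, clamp to [0, 1]."""
        raw = (
            self.signal_reward
            + self.hypothesis_reward
            + self.efficiency_reward
            - self.penalty
        )
        return min(1.0, max(0.0, raw))
```

Keeping the components separate until the final sum makes it straightforward to log which part of the shaping drove a given step's reward.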

Clever Reward Techniques

This environment uses several reward-shaping ideas that give agents more signal than a binary pass/fail grader.

1. Progress reward based on information gain

The agent is rewarded for revealing genuinely relevant signals, not for touching arbitrary logs. A broad but low-value query does not pay nearly as well as a focused query that exposes core evidence.

2. Hypothesis-improvement shaping

The environment tracks the best structured hypothesis score seen so far. The agent gets rewarded for improving its causal model over time, not for repeating the same guess. This is especially useful for RL or tree-search agents because it gives signal during reasoning, before final submission.
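The best-so-far idea can be sketched as follows. `HypothesisTracker` is a hypothetical name, and the hypothesis score is assumed to be a float in [0, 1] produced elsewhere by comparing the hypothesis to the gold causal tuple.

```python
class HypothesisTracker:
    """Pays out only when a hypothesis beats the best score seen so far."""

    def __init__(self) -> None:
        self.best_score = 0.0

    def reward_for(self, score: float) -> float:
        improvement = max(0.0, score - self.best_score)
        self.best_score = max(self.best_score, score)
        return improvement


tracker = HypothesisTracker()
print(tracker.reward_for(0.4))  # 0.4: first improvement over the initial 0.0
print(tracker.reward_for(0.4))  # 0.0: repeating the same guess earns nothing
```

Because only the increment is paid out, the sum of these rewards over an episode equals the best score reached, so oscillating between guesses cannot inflate the total.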

3. Observation-consistent terminal scoring

The final report is only valid if it cites revealed evidence. This blocks a very common exploit in benchmark environments where agents can hallucinate or hardcode hidden gold evidence.

4. Contradiction penalties

The grader penalizes internal inconsistency across:

  • selected evidence
  • claimed root-cause service
  • claimed customer impact
  • timeline in the hard task
  • containment choice

This means an agent cannot simply match one part of the answer key and ignore the rest.

5. Safe-containment bias

The containment scorer separately tracks recommended and forbidden actions. This lets the environment reward operational maturity, not just diagnosis. Agents that “solve” incidents by wiping logs or restarting everything are penalized.
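A sketch of how recommended and forbidden actions might be scored separately. The formula, the penalty weight, and the forbidden action names (`wipe_logs`, `restart_all_services`) are assumptions for illustration; the recommended names are borrowed from the Submit report example.

```python
def containment_score(
    plan: list[str],
    recommended: set[str],
    forbidden: set[str],
    forbidden_penalty: float = 0.5,  # hypothetical weight
) -> float:
    """Coverage of recommended actions, minus a penalty per forbidden action."""
    chosen = set(plan)
    coverage = len(chosen & recommended) / len(recommended) if recommended else 0.0
    violations = len(chosen & forbidden)
    return max(0.0, coverage - forbidden_penalty * violations)


recommended = {"restore_payment_gateway_connectivity", "reduce_checkout_retry_pressure"}
forbidden = {"wipe_logs", "restart_all_services"}  # hypothetical names

safe_plan = ["restore_payment_gateway_connectivity", "reduce_checkout_retry_pressure"]
risky_plan = ["reduce_checkout_retry_pressure", "wipe_logs"]
print(containment_score(safe_plan, recommended, forbidden))   # 1.0
print(containment_score(risky_plan, recommended, forbidden))  # 0.0
```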

6. Loop-aware shaping

Repeated identical actions incur additional penalty. That makes the environment better for learning efficient incident workflows instead of degenerate action loops.
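Detecting a repeated action requires a canonical key for each action. The sketch below uses sorted-key JSON for that and scales the penalty with the repeat count; `LoopDetector`, the penalty value, and the linear escalation are assumptions, not the environment's actual settings.

```python
import json


class LoopDetector:
    """Charges a growing penalty each time an identical action is repeated."""

    def __init__(self, penalty: float = 0.05) -> None:  # hypothetical value
        self.penalty = penalty
        self.seen: dict[str, int] = {}

    def penalty_for(self, action: dict) -> float:
        # sort_keys makes the key independent of field order in the payload.
        key = json.dumps(action, sort_keys=True)
        repeats = self.seen.get(key, 0)
        self.seen[key] = repeats + 1
        return self.penalty * repeats
```

The first occurrence of an action costs nothing; only exact repeats are charged, so an agent that refines its query filters between steps is not penalized.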

7. Seeded stochastic distractors with deterministic grading

The environment introduces seeded noise into the observable log pool, which makes superficial memorization harder, while the grader remains deterministic for a given seed and task.

In short: the reward is not just dense. It is dense in a way that pushes agents toward better investigation behavior, better causal reasoning, and safer remediation decisions.

API

  • POST /reset
  • POST /step
  • GET /state
  • GET /health
  • GET /debug_state

/debug_state is disabled by default and only works when OPENENV_DEBUG_STATE=true.
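A guard of this kind is typically a small environment-variable check. The function below is a sketch of that pattern, not the code in `app.py`; treating the value case-insensitively is an assumption.

```python
import os


def debug_state_enabled(environ=None) -> bool:
    """True only when OPENENV_DEBUG_STATE is explicitly set to 'true'."""
    env = os.environ if environ is None else environ
    return env.get("OPENENV_DEBUG_STATE", "").strip().lower() == "true"
```

Defaulting to disabled means a deployment that forgets to set the variable does not expose hidden state.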

Baseline

inference.py is deterministic and observation-driven.

It:

  • queries the incident window
  • inspects the most suspicious service
  • builds a structured hypothesis from revealed logs
  • chooses containment from the inferred cause
  • submits a final report

It does not use hardcoded gold log_id answers.

Required environment variables:

  • HF_TOKEN
  • API_BASE_URL
  • MODEL_NAME

Optional:

  • LOGENV_URL
  • DB_PATH
  • OPENAI_API_KEY (accepted as a compatibility fallback)
  • LOCAL_IMAGE_NAME

Logging format is strict:

  • [START]
  • [STEP]
  • [END]

Observed baseline scores from the deployed Hugging Face Space:

  • easy: 0.721
  • medium: 0.504
  • hard: 0.859

Validation Status

Final pre-submission status:

  • Hugging Face Space responds to POST /reset
  • local Docker image builds successfully
  • local Docker container responds to GET /health and POST /reset
  • openenv validate passes
  • inference.py runs end to end against the deployed Space
  • 3 tasks are present: easy, medium, hard
  • rewards and terminal scores are bounded to [0.0, 1.0]
  • observations are fully documented and exposed as typed structured objects

Requirement coverage summary:

  • OpenEnv interfaces: reset(), step(), state()
  • typed Pydantic models: Observation, Action, Reward
  • real-world domain: incident response / log debugging
  • seeded partial observability
  • deterministic structured grader
  • dense reward shaping with safety and loop penalties
  • OpenAI-client baseline using HF_TOKEN, API_BASE_URL, and MODEL_NAME
  • Docker and Hugging Face deployment support
  • strict [START], [STEP], [END] logging

Round 1 Compliance Checklist

This section maps the submission directly to the Round 1 problem statement and validator expectations.

Core task requirements

  • Real-world environment: yes
    • domain: production incident response / log debugging
    • modeled workflow: log retrieval, dependency inspection, diagnosis, containment, final reporting
  • OpenEnv interface: yes
    • reset() -> initial observation
    • step(action) -> observation, reward, done, info
    • state() -> current public state
  • Typed models: yes
    • Pydantic Observation, Action, and Reward
  • openenv.yaml: yes
  • openenv validate: pass

Task and grader requirements

  • Exactly 3 tasks: yes
    • easy
    • medium
    • hard
  • Difficulty progression: yes
    • easy = clear-signal anomaly detection
    • medium = competing hypotheses
    • hard = partial observability plus cascading-failure reasoning
  • Deterministic graders: yes
  • Scores bounded to [0.0, 1.0]: yes
  • Penalizes invalid actions, loops, and inefficiency: yes

Reward requirements

  • Dense reward: yes
  • Partial-progress shaping: yes
  • Loop / contradiction / invalid-action penalties: yes
  • Final normalized score: yes

Deployment requirements

  • Hugging Face Space deployment: yes
  • POST /reset responds 200 OK: yes
  • Dockerfile included: yes
  • docker build succeeds: yes
  • docker run succeeds: yes

Baseline requirements

  • root-level inference.py: yes
  • uses OpenAI client: yes
  • supports API_BASE_URL: yes
  • supports MODEL_NAME: yes
  • supports HF_TOKEN: yes
  • also accepts OPENAI_API_KEY as compatibility fallback: yes
  • optional LOCAL_IMAGE_NAME: yes
  • strict [START], [STEP], [END] logs: yes
  • reproducible seeded benchmark flow: yes
  • suitable for CPU-only execution under hackathon limits: yes

Documentation requirements

  • environment motivation: yes
  • action space definition: yes
  • observation space definition: yes
  • task descriptions: yes
  • reward design explanation: yes
  • setup instructions: yes
  • Docker instructions: yes
  • Hugging Face deployment notes: yes
  • baseline scores: yes

Human-review strengths

  • real-world utility: incident-response training and evaluation under partial observability
  • environment design: non-leaking public state, session isolation, seeded reproducibility
  • grader quality: structured evidence and causal-consistency scoring, not keyword matching
  • creativity: multi-stage incident forensics with safety-aware containment and information-gain shaping

Local Run

pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860

Reset:

curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id":"easy","seed":42}'

Docker

docker build -t novatech-incident-command .
docker run --rm -p 7860:7860 novatech-incident-command

Hugging Face Spaces

This repository is intended for Docker Spaces.

Expected validator path:

  • POST /reset returns 200 OK
  • POST /step accepts typed actions
  • GET /health returns liveness

Repo Layout

logenv2/
├── app.py
├── openenv.yaml
├── inference.py
├── Dockerfile
├── requirements.txt
├── preflight.sh
├── novatech_logs.db
├── env/
│   ├── environment.py
│   └── models.py
├── data/
│   └── db_loader.py
└── tasks/
    ├── catalog.py
    └── graders.py