title: NovaTech Incident Command
emoji: 🚨
colorFrom: red
colorTo: blue
sdk: docker
app_file: app.py
pinned: false

NovaTech Incident Command

NovaTech Incident Command is a hardened OpenEnv environment for realistic incident response under partial observability. Agents do not receive the full system state. They must query logs, inspect service dependencies, update a structured causal hypothesis, choose safe containment, and submit a final incident report.

This version is explicitly designed to avoid common benchmark failures:

no hidden answer leakage in public state
no scripted reveal queue
no keyword-based grader
no hardcoded baseline answers
session-safe API with per-episode isolation

What The Agent Must Do

Each episode simulates a production incident with a fixed action budget.

The agent must:

retrieve relevant logs using structured filters
follow dependencies rather than brute-force the whole system
narrow toward a causal tuple
avoid destructive containment
submit a causally consistent final report

Core Mechanics

Partial observability

The agent only sees:

the incident briefing
the dependency graph
the logs it has explicitly revealed

It never sees:

hidden logs
gold evidence IDs
grader internals

Session-safe design

POST /reset returns a session_id.

Every action sent to POST /step should include that session_id. This isolates concurrent episodes and closes the shared-global-state exploit that existed before sessions were introduced.
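The session flow can be sketched with the standard library alone. This is a minimal client-side sketch, not the code in inference.py: the helper names (build_request, post_json, step_payload, start_episode) are hypothetical, the base URL assumes the local server from the Local Run section, and it assumes the /reset response carries session_id at the top level of the returned JSON.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumption: local uvicorn server


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(path, payload)) as resp:
        return json.loads(resp.read().decode("utf-8"))


def step_payload(session_id: str, action_type: str, **fields) -> dict:
    """Attach the session_id from /reset to a /step action."""
    return {"session_id": session_id, "action_type": action_type, **fields}


def start_episode(task_id: str = "easy", seed: int = 42) -> dict:
    """Reset, then take one step in the same session (needs a running server)."""
    obs = post_json("/reset", {"task_id": task_id, "seed": seed})
    action = step_payload(
        obs["session_id"], "inspect_dependencies", target_service="auth-service"
    )
    return post_json("/step", action)
```

Keeping the session_id handling in one helper means a client cannot accidentally send a step without it.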

Seeded stochasticity

Every reset can accept a seed:

{
  "task_id": "medium",
  "seed": 42
}

Given the same seed:

the task-specific log pool is reproducible
distractor/noise sampling is reproducible
retrieval order is reproducible

Different seeds vary only the non-essential observable context, such as which distractor logs appear. Grading stays deterministic for a given seed and task.
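One way to get this kind of reproducibility is a per-episode random generator seeded at reset. The sketch below is an illustration only: build_log_pool and its parameters are hypothetical, and the real sampling logic in the environment may differ.

```python
import random


def build_log_pool(core_logs, distractors, seed, n_noise=3):
    """Return the core logs plus seeded distractors in a seeded order."""
    rng = random.Random(seed)  # local RNG, so episodes do not share global state
    noise = rng.sample(distractors, k=min(n_noise, len(distractors)))
    pool = list(core_logs) + noise
    rng.shuffle(pool)  # retrieval order is also a function of the seed
    return pool
```

Because the generator is local to the call, two concurrent sessions with different seeds cannot perturb each other's log pools.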

Observations

Each reset() and step() returns a structured observation, not a loose blob.

Observation fields:

session_id: the active episode identifier
task_id: task difficulty key
task_title: human-readable incident label
briefing: incident objective, incident window, suspected services, customer statement, and constraints
dependency_graph: the service graph the agent can reason over
visible_logs: only the logs the agent has explicitly revealed
revealed_log_count: number of currently visible logs
visited_services: services already explored through dependency inspection or queries
submitted_containment: containment actions already chosen
last_hypothesis: latest structured causal hypothesis
step_number: current step
max_steps: step budget
feedback: environment guidance after the last action
done: terminal flag

Why this observation design matters:

it gives enough structure for deliberate planning
it preserves partial observability
it prevents answer leakage
it supports both frontier agents and smaller baselines

Example observation shape:

{
  "session_id": "8e7f...",
  "task_id": "medium",
  "task_title": "Checkout Competing Hypotheses",
  "briefing": {
    "incident_id": "INC-2144",
    "title": "Checkout Competing Hypotheses",
    "objective": "Distinguish a genuine payment dependency outage from plausible but unrelated upstream noise.",
    "incident_window_start": "2025-06-15 06:20:00",
    "incident_window_end": "2025-06-15 06:45:59",
    "suspected_services": ["payment-api", "auth-service", "user-service"],
    "customer_statement": "Customers complete checkout, but confirmations remain pending for tens of seconds.",
    "operational_constraints": [
      "Keep checkout partially available if possible.",
      "Avoid blind restarts."
    ]
  },
  "dependency_graph": {
    "payment-api": ["auth-service", "payment-gateway", "mysql"]
  },
  "visible_logs": [],
  "revealed_log_count": 0,
  "visited_services": [],
  "submitted_containment": [],
  "last_hypothesis": null,
  "step_number": 0,
  "max_steps": 7,
  "feedback": "Episode created. Query the incident window and inspect dependencies to build your case.",
  "done": false
}
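A client can load that shape into a typed object before planning. The repository uses Pydantic models; the sketch below uses a standard-library dataclass instead so it runs without extra dependencies. The class layout mirrors the field list above, while parse_observation and steps_remaining are hypothetical helper names.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Observation:
    session_id: str
    task_id: str
    task_title: str
    briefing: dict
    dependency_graph: dict
    visible_logs: list
    revealed_log_count: int
    visited_services: list
    submitted_containment: list
    last_hypothesis: Optional[dict]
    step_number: int
    max_steps: int
    feedback: str
    done: bool


def parse_observation(raw: dict) -> Observation:
    """Fail loudly if a documented field is missing; ignore unknown keys."""
    names = {f.name for f in fields(Observation)}
    missing = names - set(raw)
    if missing:
        raise ValueError(f"missing observation fields: {sorted(missing)}")
    return Observation(**{name: raw[name] for name in names})


def steps_remaining(obs: Observation) -> int:
    """Remaining action budget for the episode."""
    return obs.max_steps - obs.step_number
```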

Tasks

Easy: Auth Heap Exhaustion

Reasoning pattern:

anomaly detection with clear signal

Goal:

identify auth-service heap exhaustion as the true cause of a login incident
avoid destructive overreaction

Medium: Checkout Competing Hypotheses

Reasoning pattern:

disambiguate competing explanations

Goal:

determine that the payment confirmation outage is a payment-gateway dependency failure, not just upstream auth noise

Hard: Cascading Multi-Service Incident

Reasoning pattern:

partial observability
timeline reconstruction
tradeoff-aware containment

Goal:

identify the initiating service in a multi-service cascade and propose layered containment

Structured Actions

Query logs

{
  "session_id": "<session_id>",
  "action_type": "query_logs",
  "query": {
    "service_name": "payment-api",
    "levels": ["CRITICAL", "ERROR"],
    "start_time": "2025-06-15 06:20:00",
    "end_time": "2025-06-15 06:45:59",
    "limit": 6
  }
}
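To make the filter semantics concrete, here is a sketch of how such a query could be applied to an in-memory list of log records. It is not the environment's implementation: the record keys (service_name, level, timestamp), the inclusive time window, and the apply_query name are all assumptions.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # matches the timestamps in the query above


def apply_query(logs, query):
    """Filter logs by service, level, and time window, then apply the limit."""
    start = datetime.strptime(query["start_time"], TIME_FORMAT)
    end = datetime.strptime(query["end_time"], TIME_FORMAT)
    levels = set(query.get("levels", []))
    matched = [
        log
        for log in logs
        if log["service_name"] == query["service_name"]
        and (not levels or log["level"] in levels)  # empty levels means any level
        and start <= datetime.strptime(log["timestamp"], TIME_FORMAT) <= end
    ]
    return matched[: query.get("limit", len(matched))]
```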

Inspect dependencies

{
  "session_id": "<session_id>",
  "action_type": "inspect_dependencies",
  "target_service": "payment-api"
}
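An agent that follows dependencies rather than brute-forcing every service needs an exploration order. A breadth-first walk over the dependency_graph from the observation is one simple choice; the sketch below assumes the graph maps each service to a list of its direct dependencies, and dependency_order is a hypothetical helper name.

```python
from collections import deque


def dependency_order(graph, start):
    """Return services reachable from start, nearest dependencies first."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        service = queue.popleft()
        order.append(service)
        for dep in graph.get(service, []):
            if dep not in seen:  # guards against cycles in the graph
                seen.add(dep)
                queue.append(dep)
    return order
```

With the example graph, dependency_order(graph, "payment-api") visits payment-api first and then its three direct dependencies in listed order.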

Update hypothesis

{
  "session_id": "<session_id>",
  "action_type": "update_hypothesis",
  "hypothesis": {
    "primary_service": "payment-api",
    "failure_mode": "dependency_outage",
    "dependency": "payment-gateway",
    "customer_impact": "checkout_delays",
    "confidence": 0.87
  }
}

Submit report

{
  "session_id": "<session_id>",
  "action_type": "submit_report",
  "report": {
    "evidence_log_ids": [193, 194, 195],
    "impacted_services": ["payment-api"],
    "root_cause": {
      "primary_service": "payment-api",
      "failure_mode": "dependency_outage",
      "dependency": "payment-gateway",
      "customer_impact": "checkout_delays",
      "confidence": 0.87
    },
    "containment_plan": [
      "restore_payment_gateway_connectivity",
      "reduce_checkout_retry_pressure"
    ],
    "summary": "Checkout confirmations are delayed because payment-api lost connectivity to the payment gateway."
  }
}

Grading

The grader is fully deterministic and structured.

It scores:

evidence quality via revealed-evidence F1
root-cause tuple correctness
impacted-service correctness
containment alignment
causal consistency across evidence, service, impact, and timeline

It penalizes:

unseen evidence references
contradictions
forbidden containment
repeated actions

There is no keyword-bag grader in this version.
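As an illustration of the evidence component, an F1 score restricted to revealed logs might look like the sketch below. The exact formula in tasks/graders.py is not shown in this README, so evidence_f1 and its treatment of unseen citations (they count against precision but never toward recall) are assumptions.

```python
def evidence_f1(cited_ids, gold_ids, revealed_ids):
    """F1 between cited and gold evidence, crediting only revealed citations."""
    cited, gold, revealed = set(cited_ids), set(gold_ids), set(revealed_ids)
    true_positives = len(cited & revealed & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(cited)  # unseen citations lower precision
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```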

Reward Function

Intermediate rewards are dense and shaped:

signal_reward: new relevant evidence
hypothesis_reward: improvement toward the gold causal tuple
efficiency_reward: solving earlier is better
penalty: invalid queries, loops, contradictions, forbidden actions

This makes the environment useful for RL or planning-based evaluation, not just one-shot scoring.
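The four components could be combined per step roughly as follows. This is a sketch under stated assumptions: an unweighted sum and clipping to [0.0, 1.0] (the bound reported under Validation Status). The actual weights used by the environment are not documented here, and step_reward is a hypothetical name.

```python
def step_reward(signal_reward, hypothesis_reward, efficiency_reward, penalty):
    """Sum the shaped components, subtract the penalty, and clip to [0.0, 1.0]."""
    total = signal_reward + hypothesis_reward + efficiency_reward - penalty
    return max(0.0, min(1.0, total))
```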

Clever Reward Techniques

This environment uses several reward-shaping ideas that give an agent more learning signal than a binary pass/fail grader.

1. Progress reward based on information gain

The agent is rewarded for revealing genuinely relevant signals, not for touching arbitrary logs. A broad but low-value query does not pay nearly as well as a focused query that exposes core evidence.

2. Hypothesis-improvement shaping

The environment tracks the best structured hypothesis score seen so far. The agent gets rewarded for improving its causal model over time, not for repeating the same guess. This is especially useful for RL or tree-search agents because it gives signal during reasoning, before final submission.
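A best-so-far tracker of this kind can be sketched as follows. The scoring rule (fraction of matching tuple fields) and the names hypothesis_score and HypothesisTracker are assumptions for illustration; the environment's own scorer may weight fields differently.

```python
TUPLE_FIELDS = ("primary_service", "failure_mode", "dependency", "customer_impact")


def hypothesis_score(hypothesis, gold):
    """Fraction of causal-tuple fields that match the gold tuple."""
    hits = sum(hypothesis.get(name) == gold.get(name) for name in TUPLE_FIELDS)
    return hits / len(TUPLE_FIELDS)


class HypothesisTracker:
    """Pays only for improvement over the best hypothesis seen so far."""

    def __init__(self, gold):
        self.gold = gold
        self.best = 0.0

    def reward(self, hypothesis):
        score = hypothesis_score(hypothesis, self.gold)
        gain = max(0.0, score - self.best)  # repeats and regressions earn nothing
        self.best = max(self.best, score)
        return gain
```

Submitting the same hypothesis twice yields a gain of zero the second time, which is the behavior the paragraph above describes.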

3. Observation-consistent terminal scoring

The final report is only valid if it cites revealed evidence. This blocks a common benchmark exploit in which agents hallucinate or hardcode hidden gold evidence.
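The validity check itself reduces to a set difference between cited and revealed log IDs. In the sketch below, unseen_citations is a hypothetical helper and it assumes each entry in visible_logs carries a log_id key.

```python
def unseen_citations(evidence_log_ids, visible_logs):
    """Return cited log IDs that the agent never actually revealed."""
    revealed = {log["log_id"] for log in visible_logs}
    return sorted(set(evidence_log_ids) - revealed)
```

An agent can run the same check on its own report before submitting, since both inputs are available in the public observation.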

4. Contradiction penalties

The grader penalizes internal inconsistency across:

selected evidence
claimed root-cause service
claimed customer impact
timeline in the hard task
containment choice

This means an agent cannot simply match one part of the answer key and ignore the rest.

5. Safe-containment bias

The containment scorer separately tracks recommended and forbidden actions. This lets the environment reward operational maturity, not just diagnosis. Agents that “solve” incidents by wiping logs or restarting everything are penalized.
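A scorer that tracks recommended and forbidden actions separately might be sketched like this. The formula, the default penalty of 0.5 per forbidden action, and the name containment_score are assumptions; only the recommended/forbidden split comes from the description above.

```python
def containment_score(plan, recommended, forbidden, forbidden_penalty=0.5):
    """Coverage of recommended actions minus a penalty per forbidden action."""
    plan_set = set(plan)
    recommended_set = set(recommended)
    if not recommended_set:
        coverage = 0.0
    else:
        coverage = len(plan_set & recommended_set) / len(recommended_set)
    violations = len(plan_set & set(forbidden))
    return max(0.0, coverage - forbidden_penalty * violations)
```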

6. Loop-aware shaping

Repeated identical actions incur additional penalty. That makes the environment better for learning efficient incident workflows instead of degenerate action loops.
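One way to detect identical actions is to canonicalize each action as sorted-key JSON and count occurrences. The sketch below assumes a penalty that grows linearly with each repeat; LoopTracker and the 0.05 default are hypothetical, not values taken from the environment.

```python
import json
from collections import Counter


class LoopTracker:
    """Charges a growing penalty each time an identical action recurs."""

    def __init__(self, penalty_per_repeat=0.05):
        self.counts = Counter()
        self.penalty_per_repeat = penalty_per_repeat

    def penalty(self, action):
        # Sorted keys make the fingerprint independent of dict ordering.
        key = json.dumps(action, sort_keys=True)
        self.counts[key] += 1
        return self.penalty_per_repeat * (self.counts[key] - 1)
```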

7. Seeded stochastic distractors with deterministic grading

The environment introduces seeded noise into the observable log pool, which makes superficial memorization harder, while the grader remains deterministic for a given seed and task.

In short, the reward is dense in a way that pushes agents toward better investigation behavior, sounder causal reasoning, and safer remediation decisions.

API

POST /reset
POST /step
GET /state
GET /health
GET /debug_state

/debug_state is disabled by default and only works when OPENENV_DEBUG_STATE=true.
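Gating an endpoint on an environment variable usually comes down to a small predicate like the one below. This is a sketch, not the code in app.py: debug_state_enabled is a hypothetical name, and accepting values case-insensitively is an assumption beyond the documented OPENENV_DEBUG_STATE=true.

```python
import os


def debug_state_enabled(environ=None):
    """True only when OPENENV_DEBUG_STATE is set to 'true' (off by default)."""
    env = os.environ if environ is None else environ
    return env.get("OPENENV_DEBUG_STATE", "").strip().lower() == "true"
```

Passing the mapping in as a parameter keeps the check easy to test without mutating the process environment.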

Baseline

inference.py is deterministic and observation-driven.

It:

queries the incident window
inspects the most suspicious service
builds a structured hypothesis from revealed logs
chooses containment from the inferred cause
submits a final report

It does not use hardcoded gold log_id answers.

Required environment variables:

HF_TOKEN
API_BASE_URL
MODEL_NAME

Optional:

LOGENV_URL
DB_PATH

Logging format is strict:

[START]
[STEP]
[END]

Observed baseline scores from the deployed Hugging Face Space:

easy: 0.721
medium: 0.504
hard: 0.859

Validation Status

Final pre-submission status:

Hugging Face Space responds to POST /reset
local Docker image builds successfully
local Docker container responds to GET /health and POST /reset
openenv validate passes
inference.py runs end to end against the deployed Space
3 tasks are present: easy, medium, hard
rewards and terminal scores are bounded to [0.0, 1.0]
observations are fully documented and exposed as typed structured objects

Requirement coverage summary:

OpenEnv interfaces: reset(), step(), state()
typed Pydantic models: Observation, Action, Reward
real-world domain: incident response / log debugging
seeded partial observability
deterministic structured grader
dense reward shaping with safety and loop penalties
OpenAI-client baseline using HF_TOKEN, API_BASE_URL, and MODEL_NAME
Docker and Hugging Face deployment support
strict [START], [STEP], [END] logging

Round 1 Compliance Checklist

This section maps the submission directly to the Round 1 problem statement and validator expectations.

Core task requirements

Real-world environment: yes
- domain: production incident response / log debugging
- modeled workflow: log retrieval, dependency inspection, diagnosis, containment, final reporting
OpenEnv interface: yes
- reset() -> initial observation
- step(action) -> observation, reward, done, info
- state() -> current public state
Typed models: yes
- Pydantic Observation, Action, and Reward
openenv.yaml: yes
openenv validate: pass

Task and grader requirements

Exactly 3 tasks: yes
- easy
- medium
- hard
Difficulty progression: yes
- easy = clear-signal anomaly detection
- medium = competing hypotheses
- hard = partial observability plus cascading-failure reasoning
Deterministic graders: yes
Scores bounded to [0.0, 1.0]: yes
Penalizes invalid actions, loops, and inefficiency: yes

Reward requirements

Dense reward: yes
Partial-progress shaping: yes
Loop / contradiction / invalid-action penalties: yes
Final normalized score: yes

Deployment requirements

Hugging Face Space deployment: yes
POST /reset responds 200 OK: yes
Dockerfile included: yes
docker build succeeds: yes
docker run succeeds: yes

Baseline requirements

root-level inference.py: yes
uses OpenAI client: yes
supports API_BASE_URL: yes
supports MODEL_NAME: yes
supports HF_TOKEN: yes
also accepts OPENAI_API_KEY as compatibility fallback: yes
optional LOCAL_IMAGE_NAME: yes
strict [START], [STEP], [END] logs: yes
reproducible seeded benchmark flow: yes
suitable for CPU-only execution under hackathon limits: yes

Documentation requirements

environment motivation: yes
action space definition: yes
observation space definition: yes
task descriptions: yes
reward design explanation: yes
setup instructions: yes
Docker instructions: yes
Hugging Face deployment notes: yes
baseline scores: yes

Human-review strengths

real-world utility: incident-response training and evaluation under partial observability
environment design: non-leaking public state, session isolation, seeded reproducibility
grader quality: structured evidence and causal-consistency scoring, not keyword matching
creativity: multi-stage incident forensics with safety-aware containment and information-gain shaping

Local Run

pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860

Reset:

curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id":"easy","seed":42}'

Docker

docker build -t novatech-incident-command .
docker run --rm -p 7860:7860 novatech-incident-command

Hugging Face Spaces

This repository is intended for Docker Spaces.

Expected validator path:

POST /reset returns 200 OK
POST /step accepts typed actions
GET /health returns liveness

Repo Layout

logenv2/
├── app.py
├── openenv.yaml
├── inference.py
├── Dockerfile
├── requirements.txt
├── preflight.sh
├── novatech_logs.db
├── env/
│   ├── environment.py
│   └── models.py
├── data/
│   └── db_loader.py
└── tasks/
    ├── catalog.py
    └── graders.py