Spaces: omm7/CausalOps-Env

omm7 committed on Apr 8

Commit bc2ead7 (verified) · 1 Parent(s): 4cb7e01

Upload folder using huggingface_hub

Files changed (25)

.dockerignore +4 -0
Dockerfile +10 -14
README.md +401 -12
__pycache__/app.cpython-314.pyc +0 -0
__pycache__/inference.cpython-314.pyc +0 -0
agent/__pycache__/langgraph_agent.cpython-314.pyc +0 -0
app.py +853 -0
data/__init__.py +0 -0
data/__pycache__/db_loader.cpython-314.pyc +0 -0
data/db_loader.py +127 -0
env/__init__.py +0 -0
env/__pycache__/environment.cpython-314.pyc +0 -0
env/__pycache__/models.cpython-314.pyc +0 -0
env/environment.py +344 -0
env/models.py +142 -0
inference.py +257 -0
novatech_logs.db +0 -0
openenv.yaml +66 -0
preflight.sh +49 -0
requirements.txt +6 -3
tasks/__init__.py +0 -0
tasks/__pycache__/catalog.cpython-314.pyc +0 -0
tasks/__pycache__/graders.cpython-314.pyc +0 -0
tasks/catalog.py +133 -0
tasks/graders.py +177 -0

.dockerignore ADDED

@@ -0,0 +1,4 @@

+__pycache__/
+*.pyc
+.DS_Store
+tmp/

Dockerfile CHANGED

@@ -1,20 +1,16 @@
-FROM python:3.13.5-slim
-WORKDIR /app
-RUN apt-get update && apt-get install -y \
-    build-essential \
-    curl \
-    git \
-    && rm -rf /var/lib/apt/lists/*
-COPY requirements.txt ./
-COPY src/ ./src/
-RUN pip3 install -r requirements.txt
-EXPOSE 8501
-HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health
-ENTRYPOINT ["streamlit", "run", "src/streamlit_app.py", "--server.port=8501", "--server.address=0.0.0.0"]

+FROM python:3.11-slim
+ENV PYTHONDONTWRITEBYTECODE=1 \
+    PYTHONUNBUFFERED=1 \
+    PIP_NO_CACHE_DIR=1
+WORKDIR /app
+COPY requirements.txt /app/requirements.txt
+RUN pip install --upgrade pip && pip install -r /app/requirements.txt
+COPY . /app
+EXPOSE 7860
+CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]

README.md CHANGED

@@ -1,20 +1,409 @@
 ---
-title: CausalOps Env
-emoji: 🚀
 colorFrom: red
-colorTo: red
 sdk: docker
-app_port: 8501
-tags:
-- streamlit
 pinned: false
-short_description: A real-world OpenEnv benchmark for causal reasoning in distr
-license: mit
 ---
-# Welcome to Streamlit!
-Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:
-If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
-forums](https://discuss.streamlit.io).

 ---
+title: NovaTech Incident Command
+emoji: 🚨
 colorFrom: red
+colorTo: blue
 sdk: docker
+app_file: app.py
 pinned: false
 ---
+# NovaTech Incident Command
+NovaTech Incident Command is a hardened OpenEnv environment for realistic incident response under partial observability. Agents do not receive the full system state. They must query logs, inspect service dependencies, update a structured causal hypothesis, choose safe containment, and submit a final incident report.
+This version is explicitly designed to avoid common benchmark failures:
+- no hidden answer leakage in public state
+- no scripted reveal queue
+- no keyword-based grader
+- no hardcoded baseline answers
+- session-safe API with per-episode isolation
+## What The Agent Must Do
+Each episode simulates a production incident with a fixed action budget.
+The agent must:
+- retrieve relevant logs using structured filters
+- follow dependencies rather than brute-force the whole system
+- narrow toward a causal tuple
+- avoid destructive containment
+- submit a causally consistent final report
+## Core Mechanics
+### Partial observability
+The agent only sees:
+- the incident briefing
+- the dependency graph
+- the logs it has explicitly revealed
+It never sees:
+- hidden logs
+- gold evidence IDs
+- grader internals
+### Session-safe design
+`POST /reset` returns a `session_id`.
+All actions in `POST /step` should include that `session_id`, which isolates concurrent episodes and avoids the old shared-global-state exploit.
+### Seeded stochasticity
+Every reset can accept a seed:
+```json
+{
+  "task_id": "medium",
+  "seed": 42
+}
+```
+Given the same seed:
+- the task-specific log pool is reproducible
+- distractor/noise sampling is reproducible
+- retrieval order is reproducible
+Different seeds slightly vary the non-essential observable context while preserving deterministic grading.
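Review note on the seeded-stochasticity claim above: the diff shown here does not include the sampling code, so the following is only a minimal sketch of how a per-episode seed could make distractor sampling and retrieval order reproducible. The function `sample_log_pool`, its parameters, and the sample log records are hypothetical and are not taken from `env/environment.py`.

```python
import random


def sample_log_pool(core_logs, distractor_logs, seed, n_distractors=3):
    """Mix core logs with seeded distractors and return them in a reproducible order.

    Hypothetical helper: the real environment may sample differently.
    """
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    k = min(n_distractors, len(distractor_logs))
    noise = rng.sample(distractor_logs, k=k)  # seeded distractor selection
    pool = list(core_logs) + noise
    rng.shuffle(pool)  # seeded retrieval order
    return pool


# Illustrative data only; these are not rows from novatech_logs.db.
core = [{"log_id": 1, "msg": "heap exhausted"}, {"log_id": 2, "msg": "gc overhead"}]
noise = [{"log_id": i, "msg": f"routine event {i}"} for i in range(10, 20)]

first = sample_log_pool(core, noise, seed=42)
second = sample_log_pool(core, noise, seed=42)
```

Using a dedicated `random.Random(seed)` instance rather than `random.seed(...)` keeps concurrent sessions from disturbing each other's sampling, which fits the per-episode isolation the README describes.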
+## Observations
+Each `reset()` and `step()` returns a structured observation, not a loose blob.
+Observation fields:
+- `session_id`: the active episode identifier
+- `task_id`: task difficulty key
+- `task_title`: human-readable incident label
+- `briefing`: incident objective, incident window, suspected services, customer statement, and constraints
+- `dependency_graph`: the service graph the agent can reason over
+- `visible_logs`: only the logs the agent has explicitly revealed
+- `revealed_log_count`: number of currently visible logs
+- `visited_services`: services already explored through dependency inspection or queries
+- `submitted_containment`: containment actions already chosen
+- `last_hypothesis`: latest structured causal hypothesis
+- `step_number`: current step
+- `max_steps`: step budget
+- `feedback`: environment guidance after the last action
+- `done`: terminal flag
+Why this observation design matters:
+- it gives enough structure for deliberate planning
+- it preserves partial observability
+- it prevents answer leakage
+- it supports both frontier agents and smaller baselines
+Example observation shape:
+```json
+{
+  "session_id": "8e7f...",
+  "task_id": "medium",
+  "task_title": "Checkout Competing Hypotheses",
+  "briefing": {
+    "incident_id": "INC-2144",
+    "title": "Checkout Competing Hypotheses",
+    "objective": "Distinguish a genuine payment dependency outage from plausible but unrelated upstream noise.",
+    "incident_window_start": "2025-06-15 06:20:00",
+    "incident_window_end": "2025-06-15 06:45:59",
+    "suspected_services": ["payment-api", "auth-service", "user-service"],
+    "customer_statement": "Customers complete checkout, but confirmations remain pending for tens of seconds.",
+    "operational_constraints": [
+      "Keep checkout partially available if possible.",
+      "Avoid blind restarts."
+    ]
+  },
+  "dependency_graph": {
+    "payment-api": ["auth-service", "payment-gateway", "mysql"]
+  },
+  "visible_logs": [],
+  "revealed_log_count": 0,
+  "visited_services": [],
+  "submitted_containment": [],
+  "last_hypothesis": null,
+  "step_number": 0,
+  "max_steps": 7,
+  "feedback": "Episode created. Query the incident window and inspect dependencies to build your case.",
+  "done": false
+}
+```
+## Tasks
+### Easy: Auth Heap Exhaustion
+Reasoning pattern:
+- anomaly detection with clear signal
+Goal:
+- identify auth-service heap exhaustion as the true cause of a login incident
+- avoid destructive overreaction
+### Medium: Checkout Competing Hypotheses
+Reasoning pattern:
+- disambiguate competing explanations
+Goal:
+- determine that the payment confirmation outage is a payment-gateway dependency failure, not just upstream auth noise
+### Hard: Cascading Multi-Service Incident
+Reasoning pattern:
+- partial observability
+- timeline reconstruction
+- tradeoff-aware containment
+Goal:
+- identify the initiating service in a multi-service cascade and propose layered containment
+## Structured Actions
+### Query logs
+```json
+{
+  "session_id": "<session_id>",
+  "action_type": "query_logs",
+  "query": {
+    "service_name": "payment-api",
+    "levels": ["CRITICAL", "ERROR"],
+    "start_time": "2025-06-15 06:20:00",
+    "end_time": "2025-06-15 06:45:59",
+    "limit": 6
+  }
+}
+```
+### Inspect dependencies
+```json
+{
+  "session_id": "<session_id>",
+  "action_type": "inspect_dependencies",
+  "target_service": "payment-api"
+}
+```
+### Update hypothesis
+```json
+{
+  "session_id": "<session_id>",
+  "action_type": "update_hypothesis",
+  "hypothesis": {
+    "primary_service": "payment-api",
+    "failure_mode": "dependency_outage",
+    "dependency": "payment-gateway",
+    "customer_impact": "checkout_delays",
+    "confidence": 0.87
+  }
+}
+```
+### Submit report
+```json
+{
+  "session_id": "<session_id>",
+  "action_type": "submit_report",
+  "report": {
+    "evidence_log_ids": [193, 194, 195],
+    "impacted_services": ["payment-api"],
+    "root_cause": {
+      "primary_service": "payment-api",
+      "failure_mode": "dependency_outage",
+      "dependency": "payment-gateway",
+      "customer_impact": "checkout_delays",
+      "confidence": 0.87
+    },
+    "containment_plan": [
+      "restore_payment_gateway_connectivity",
+      "reduce_checkout_retry_pressure"
+    ],
+    "summary": "Checkout confirmations are delayed because payment-api lost connectivity to the payment gateway."
+  }
+}
+```
+## Grading
+The grader is fully deterministic and structured.
+It scores:
+- evidence quality via revealed-evidence F1
+- root-cause tuple correctness
+- impacted-service correctness
+- containment alignment
+- causal consistency across evidence, service, impact, and timeline
+It penalizes:
+- unseen evidence references
+- contradictions
+- forbidden containment
+- repeated actions
+There is no keyword-bag grader in this version.
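Review note on the grading section above: `tasks/graders.py` is not visible in this part of the diff, so this is a sketch, under stated assumptions, of what "revealed-evidence F1" with a penalty for unseen references might look like. The function name `evidence_f1`, the flat per-ID penalty, and the choice to exclude unseen IDs from the precision denominator are all assumptions, not the committed grader.

```python
def evidence_f1(cited_ids, gold_ids, revealed_ids, unseen_penalty=0.1):
    """Score cited evidence against gold evidence, counting only revealed logs.

    Returns (score, unseen_ids). Hypothetical scoring rule for illustration.
    """
    cited, gold, revealed = set(cited_ids), set(gold_ids), set(revealed_ids)

    unseen = cited - revealed   # cited IDs the agent never actually observed
    valid = cited & revealed    # only revealed citations can earn credit

    true_pos = len(valid & gold)
    precision = true_pos / len(valid) if valid else 0.0
    recall = true_pos / len(gold) if gold else 0.0

    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)

    # Assumed flat penalty per unseen reference, floored at zero.
    score = max(0.0, f1 - unseen_penalty * len(unseen))
    return score, sorted(unseen)


# Illustrative IDs only: 999 was never revealed, 195 is gold but not cited.
score, unseen = evidence_f1(
    cited_ids=[193, 194, 999],
    gold_ids=[193, 194, 195],
    revealed_ids=[193, 194, 200],
)
```

In this example the two valid citations give precision 1.0 and recall 2/3, so F1 is 0.8 before the single unseen reference is penalized.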
+## Reward Function
+Intermediate rewards are dense and shaped:
+- `signal_reward`: new relevant evidence
+- `hypothesis_reward`: improvement toward the gold causal tuple
+- `efficiency_reward`: solving earlier is better
+- `penalty`: invalid queries, loops, contradictions, forbidden actions
+This makes the environment useful for RL or planning-based evaluation, not just one-shot scoring.
+## Clever Reward Techniques
+This environment uses several reward-shaping ideas that are stronger than a typical binary grader.
+### 1. Progress reward based on information gain
+The agent is rewarded for revealing genuinely relevant signals, not for touching arbitrary logs. A broad but low-value query does not pay nearly as well as a focused query that exposes core evidence.
+### 2. Hypothesis-improvement shaping
+The environment tracks the best structured hypothesis score seen so far. The agent gets rewarded for improving its causal model over time, not for repeating the same guess. This is especially useful for RL or tree-search agents because it gives signal during reasoning, before final submission.
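Review note on hypothesis-improvement shaping: one way to read "tracks the best structured hypothesis score seen so far" is a tracker that pays out only the gain over the previous best. The sketch below assumes a simple field-match score over the four categorical fields of the hypothesis; `hypothesis_score`, `HypothesisTracker`, and the tuple in `EXAMPLE_GOLD` are hypothetical and do not come from the committed code or its answer key.

```python
from dataclasses import dataclass

FIELDS = ("primary_service", "failure_mode", "dependency", "customer_impact")


def hypothesis_score(hypothesis, gold):
    """Fraction of categorical fields that match the gold tuple (assumed metric)."""
    matches = sum(1 for name in FIELDS if hypothesis.get(name) == gold.get(name))
    return matches / len(FIELDS)


@dataclass
class HypothesisTracker:
    """Rewards only improvement over the best score seen so far in an episode."""
    gold: dict
    best_score: float = 0.0

    def reward(self, hypothesis):
        score = hypothesis_score(hypothesis, self.gold)
        gain = max(0.0, score - self.best_score)  # repeats and regressions earn nothing
        self.best_score = max(self.best_score, score)
        return gain


# Illustrative tuple only, borrowed from the README's example action.
EXAMPLE_GOLD = {
    "primary_service": "payment-api",
    "failure_mode": "dependency_outage",
    "dependency": "payment-gateway",
    "customer_impact": "checkout_delays",
}

tracker = HypothesisTracker(gold=EXAMPLE_GOLD)
partial = {**EXAMPLE_GOLD, "failure_mode": "resource_exhaustion", "dependency": "none"}

r1 = tracker.reward(partial)        # two of four fields match
r2 = tracker.reward(partial)        # same guess again, no improvement
r3 = tracker.reward(EXAMPLE_GOLD)   # remaining two fields corrected
```

Paying only the increment means the shaped rewards for an episode sum to at most the best score reached, so repeating a guess cannot farm reward.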
+### 3. Observation-consistent terminal scoring
+The final report is only valid if it cites revealed evidence. This blocks a very common exploit in benchmark environments where agents can hallucinate or hardcode hidden gold evidence.
+### 4. Contradiction penalties
+The grader penalizes internal inconsistency across:
+- selected evidence
+- claimed root-cause service
+- claimed customer impact
+- timeline in the hard task
+- containment choice
+This means an agent cannot simply match one part of the answer key and ignore the rest.
+### 5. Safe-containment bias
+The containment scorer separately tracks recommended and forbidden actions. This lets the environment reward operational maturity, not just diagnosis. Agents that “solve” incidents by wiping logs or restarting everything are penalized.
+### 6. Loop-aware shaping
+Repeated identical actions incur additional penalty. That makes the environment better for learning efficient incident workflows instead of degenerate action loops.
+### 7. Seeded stochastic distractors with deterministic grading
+The environment introduces seeded noise into the observable log pool, which makes superficial memorization harder, while the grader remains deterministic for a given seed and task.
+In short: the reward is not just dense. It is dense in a way that pushes agents toward better investigation behavior, better causal reasoning, and safer remediation decisions.
+## API
+- `POST /reset`
+- `POST /step`
+- `GET /state`
+- `GET /health`
+- `GET /debug_state`
+`/debug_state` is disabled by default and only works when `OPENENV_DEBUG_STATE=true`.
+## Baseline
+`inference.py` is deterministic and observation-driven.
+It:
+- queries the incident window
+- inspects the most suspicious service
+- builds a structured hypothesis from revealed logs
+- chooses containment from the inferred cause
+- submits a final report
+It does not use hardcoded gold `log_id` answers.
+Required environment variables:
+- `HF_TOKEN`
+- `API_BASE_URL`
+- `MODEL_NAME`
+Optional:
+- `LOGENV_URL`
+- `DB_PATH`
+Logging format is strict:
+- `[START]`
+- `[STEP]`
+- `[END]`
+## Local Run
+```bash
+pip install -r requirements.txt
+uvicorn app:app --host 0.0.0.0 --port 7860
+```
+Reset:
+```bash
+curl -X POST http://localhost:7860/reset \
+  -H "Content-Type: application/json" \
+  -d '{"task_id":"easy","seed":42}'
+```
+## Docker
+```bash
+docker build -t novatech-incident-command .
+docker run --rm -p 7860:7860 novatech-incident-command
+```
+## Hugging Face Spaces
+This repository is intended for Docker Spaces.
+Expected validator path:
+- `POST /reset` returns `200 OK`
+- `POST /step` accepts typed actions
+- `GET /health` returns liveness
+## Repo Layout
+```text
+logenv2/
+├── app.py
+├── openenv.yaml
+├── inference.py
+├── Dockerfile
+├── requirements.txt
+├── preflight.sh
+├── novatech_logs.db
+├── env/
+│   ├── environment.py
+│   └── models.py
+├── data/
+│   └── db_loader.py
+└── tasks/
+    ├── catalog.py
+    └── graders.py
+```

__pycache__/app.cpython-314.pyc ADDED

Binary file (33.6 kB).

__pycache__/inference.cpython-314.pyc ADDED

Binary file (14.5 kB).

agent/__pycache__/langgraph_agent.cpython-314.pyc ADDED

Binary file (20.3 kB).

app.py ADDED

@@ -0,0 +1,853 @@

+"""
+FastAPI application for the hardened NovaTech OpenEnv environment.
+"""
+from __future__ import annotations
+import os
+from typing import Any, Dict, Optional
+from fastapi import FastAPI, HTTPException, Query, Request
+from fastapi.middleware.cors import CORSMiddleware
+from fastapi.responses import HTMLResponse, RedirectResponse
+from pydantic import BaseModel
+from env.environment import DEBUG_STATE_ENABLED, store
+from env.models import Action, Observation, Reward
+app = FastAPI(
+    title="NovaTech Incident Command",
+    description="Seeded, session-safe OpenEnv environment for incident response under partial observability.",
+    version="3.0.0",
+)
+app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"])
+class ResetRequest(BaseModel):
+    task_id: str = "easy"
+    seed: Optional[int] = None
+class StepResponse(BaseModel):
+    observation: Dict[str, Any]
+    reward: Dict[str, Any]
+    done: bool
+    info: Dict[str, Any]
+def _root_payload() -> Dict[str, Any]:
+    return {
+        "name": "NovaTech Incident Command",
+        "version": "3.0.0",
+        "debug_state_enabled": DEBUG_STATE_ENABLED,
+        "endpoints": {
+            "POST /reset": "Create an episode and return the initial observation.",
+            "POST /step": "Apply an action using a session_id.",
+            "GET /state": "Return public, non-leaking session state.",
+            "GET /health": "Liveness probe.",
+        },
+        "action_schema": Action.model_json_schema(),
+        "observation_schema": Observation.model_json_schema(),
+        "reward_schema": Reward.model_json_schema(),
+    }
+@app.get("/")
+def root(request: Request):
+    if "text/html" in (request.headers.get("accept") or "").lower():
+        return RedirectResponse(url="/playground", status_code=307)
+    return _root_payload()
+@app.get("/playground", response_class=HTMLResponse)
+def playground() -> str:
+    return """
+<!doctype html>
+<html>
+<head>
+  <meta charset="utf-8" />
+  <meta name="viewport" content="width=device-width, initial-scale=1" />
+  <title>NovaTech Incident Command</title>
+  <style>
+    :root {
+      --ink: #f4f1e8;
+      --muted: #b5c1d1;
+      --line: rgba(198, 218, 245, 0.14);
+      --panel: rgba(10, 18, 30, 0.78);
+      --panel-strong: rgba(8, 14, 25, 0.9);
+      --card-glow: 0 24px 80px rgba(5, 10, 18, 0.38);
+      --accent: #d54f36;
+      --accent-soft: #ff9e7b;
+      --teal: #3ca7a1;
+      --gold: #f1c56e;
+      --ok: #65d197;
+      --bad: #ff7d7d;
+      --mono: "IBM Plex Mono", "SFMono-Regular", monospace;
+      --sans: "Space Grotesk", "Avenir Next", sans-serif;
+    }
+    * { box-sizing: border-box; }
+    html { scroll-behavior: smooth; }
+    body {
+      margin: 0;
+      color: var(--ink);
+      font-family: var(--sans);
+      background:
+        radial-gradient(circle at 15% 20%, rgba(213, 79, 54, 0.24), transparent 24%),
+        radial-gradient(circle at 82% 10%, rgba(60, 167, 161, 0.24), transparent 22%),
+        radial-gradient(circle at 50% 100%, rgba(241, 197, 110, 0.12), transparent 34%),
+        linear-gradient(145deg, #09111d 0%, #0d1626 42%, #101a2d 100%);
+      min-height: 100vh;
+    }
+    .chrome {
+      position: fixed;
+      inset: 0;
+      pointer-events: none;
+      background-image:
+        linear-gradient(rgba(255,255,255,0.04) 1px, transparent 1px),
+        linear-gradient(90deg, rgba(255,255,255,0.04) 1px, transparent 1px);
+      background-size: 44px 44px;
+      mask-image: linear-gradient(to bottom, rgba(0,0,0,0.42), rgba(0,0,0,0.1));
+      opacity: 0.3;
+    }
+    .wrap { max-width: 1400px; margin: 0 auto; padding: 24px 18px 40px; position: relative; z-index: 1; }
+    .hero {
+      position: relative;
+      overflow: hidden;
+      background:
+        linear-gradient(135deg, rgba(18, 31, 51, 0.92), rgba(10, 18, 30, 0.84)),
+        linear-gradient(90deg, rgba(213, 79, 54, 0.16), rgba(60, 167, 161, 0.16));
+      border: 1px solid var(--line);
+      border-radius: 28px;
+      padding: 28px;
+      box-shadow: var(--card-glow);
+      margin-bottom: 18px;
+    }
+    .hero::after {
+      content: "";
+      position: absolute;
+      right: -80px;
+      top: -80px;
+      width: 260px;
+      height: 260px;
+      border-radius: 50%;
+      background: radial-gradient(circle, rgba(241, 197, 110, 0.28), transparent 70%);
+      filter: blur(8px);
+    }
+    .hero-top {
+      display: flex;
+      align-items: flex-start;
+      justify-content: space-between;
+      gap: 18px;
+      flex-wrap: wrap;
+    }
+    .eyebrow {
+      display: inline-flex;
+      align-items: center;
+      gap: 10px;
+      border: 1px solid rgba(241, 197, 110, 0.25);
+      border-radius: 999px;
+      padding: 7px 12px;
+      color: var(--gold);
+      font-size: 0.77rem;
+      letter-spacing: 0.12em;
+      text-transform: uppercase;
+      background: rgba(241, 197, 110, 0.08);
+      margin-bottom: 14px;
+    }
+    .hero h1 {
+      margin: 0;
+      font-size: clamp(2rem, 4vw, 3.35rem);
+      line-height: 0.96;
+      letter-spacing: -0.04em;
+      max-width: 720px;
+    }
+    .hero p {
+      margin: 16px 0 0;
+      max-width: 760px;
+      color: var(--muted);
+      font-size: 1.04rem;
+      line-height: 1.55;
+    }
+    .hero-statbar {
+      display: grid;
+      grid-template-columns: repeat(3, minmax(110px, 1fr));
+      gap: 10px;
+      min-width: 320px;
+    }
+    .hero-stat {
+      border: 1px solid var(--line);
+      border-radius: 18px;
+      padding: 14px;
+      background: rgba(255,255,255,0.04);
+      backdrop-filter: blur(12px);
+    }
+    .hero-stat .label {
+      color: var(--muted);
+      text-transform: uppercase;
+      font-size: 0.72rem;
+      letter-spacing: 0.09em;
+    }
+    .hero-stat .value {
+      margin-top: 8px;
+      font-size: 1.18rem;
+      font-weight: 700;
+    }
+    .dashboard {
+      display: grid;
+      grid-template-columns: 380px 1.05fr 0.8fr;
+      gap: 16px;
+      align-items: start;
+    }
+    .panel {
+      background: linear-gradient(180deg, rgba(14, 23, 37, 0.92), rgba(8, 14, 25, 0.92));
+      border: 1px solid var(--line);
+      border-radius: 24px;
+      box-shadow: var(--card-glow);
+      overflow: hidden;
+    }
+    .panel-head {
+      display: flex;
+      align-items: center;
+      justify-content: space-between;
+      gap: 12px;
+      padding: 18px 18px 0;
+    }
+    .panel-title {
+      margin: 0;
+      font-size: 1rem;
+      letter-spacing: 0.02em;
+    }
+    .panel-subtitle {
+      margin: 6px 18px 0;
+      color: var(--muted);
+      font-size: 0.9rem;
+      line-height: 1.45;
+    }
+    .panel-body { padding: 18px; }
+    .stack { display: grid; gap: 14px; }
+    .field label, .group-label {
+      display: block;
+      margin-bottom: 8px;
+      color: var(--muted);
+      text-transform: uppercase;
+      font-size: 0.72rem;
+      letter-spacing: 0.08em;
+    }
+    input, select, textarea, button {
+      width: 100%;
+      border-radius: 16px;
+      border: 1px solid rgba(196, 217, 245, 0.12);
+      background: rgba(255,255,255,0.05);
+      color: var(--ink);
+      padding: 12px 14px;
+      font: inherit;
+      transition: border-color 0.18s ease, background 0.18s ease, transform 0.18s ease;
+    }
+    input::placeholder, textarea::placeholder { color: rgba(181, 193, 209, 0.65); }
+    input:focus, select:focus, textarea:focus {
+      outline: none;
+      border-color: rgba(241, 197, 110, 0.55);
+      background: rgba(255,255,255,0.08);
+    }
+    textarea {
+      min-height: 250px;
+      resize: vertical;
+      font-family: var(--mono);
+      font-size: 0.92rem;
+      line-height: 1.5;
+    }
+    button {
+      cursor: pointer;
+      border: 0;
+      font-weight: 700;
+      letter-spacing: 0.01em;
+      background: linear-gradient(135deg, var(--accent), #ef7a59);
+      box-shadow: 0 12px 24px rgba(213, 79, 54, 0.24);
+    }
+    button:hover {
+      transform: translateY(-1px);
+      filter: brightness(1.03);
+    }
+    .button-secondary {
+      background: linear-gradient(135deg, #1a6374, #2d8896);
+      box-shadow: 0 12px 24px rgba(45, 136, 150, 0.2);
+    }
+    .button-ghost {
+      background: rgba(255,255,255,0.06);
+      border: 1px solid rgba(196, 217, 245, 0.12);
+      box-shadow: none;
+    }
+    .button-grid {
+      display: grid;
+      grid-template-columns: 1fr 1fr;
+      gap: 10px;
+    }
+    .status {
+      min-height: 48px;
+      border-radius: 18px;
+      border: 1px solid rgba(196, 217, 245, 0.12);
+      background: rgba(255,255,255,0.04);
+      padding: 12px 14px;
+      color: var(--muted);
+      line-height: 1.5;
+    }
+    .status.ok { color: var(--ok); border-color: rgba(101, 209, 151, 0.2); }
+    .status.bad { color: var(--bad); border-color: rgba(255, 125, 125, 0.22); }
+    .chips {
+      display: flex;
+      gap: 8px;
+      flex-wrap: wrap;
+    }
+    .chip {
+      display: inline-flex;
+      align-items: center;
+      gap: 8px;
+      padding: 8px 11px;
+      border-radius: 999px;
+      border: 1px solid rgba(196, 217, 245, 0.12);
+      background: rgba(255,255,255,0.04);
+      color: var(--muted);
+      font-size: 0.82rem;
+    }
+    .kpis {
+      display: grid;
+      grid-template-columns: repeat(2, minmax(0, 1fr));
+      gap: 10px;
+    }
+    .kpi {
+      border-radius: 18px;
+      border: 1px solid rgba(196, 217, 245, 0.12);
+      background: rgba(255,255,255,0.035);
+      padding: 14px;
+    }
+    .kpi .label {
+      color: var(--muted);
+      font-size: 0.72rem;
+      text-transform: uppercase;
+      letter-spacing: 0.08em;
+    }
+    .kpi .value {
+      margin-top: 8px;
+      font-size: 1.05rem;
+      font-weight: 700;
+      word-break: break-word;
+    }
+    .template-grid {
+      display: grid;
+      gap: 8px;
+    }
+    .template {
+      text-align: left;
+      padding: 12px 13px;
+      border-radius: 16px;
+      background: rgba(255,255,255,0.045);
+      border: 1px solid rgba(196, 217, 245, 0.1);
+      color: var(--ink);
+      font-size: 0.92rem;
+      box-shadow: none;
+    }
+    .template strong {
+      display: block;
+      margin-bottom: 4px;
+      font-size: 0.9rem;
+    }
+    .template span {
+      color: var(--muted);
+      font-size: 0.82rem;
+      line-height: 1.45;
+    }
+    .viewer-tabs {
+      display: flex;
+      gap: 8px;
+      margin-bottom: 12px;
+    }
+    .tab {
+      width: auto;
+      padding: 10px 14px;
+      border-radius: 999px;
+      background: rgba(255,255,255,0.05);
+      box-shadow: none;
+      font-size: 0.86rem;
+    }
+    .tab.active {
+      background: linear-gradient(135deg, rgba(241, 197, 110, 0.18), rgba(213, 79, 54, 0.2));
+      border: 1px solid rgba(241, 197, 110, 0.28);
+    }
+    .viewer {
+      min-height: 620px;
+      border-radius: 20px;
+      background: linear-gradient(180deg, #0d1626, #0b1220);
+      border: 1px solid rgba(196, 217, 245, 0.1);
+      overflow: hidden;
+    }
+    pre {
+      margin: 0;
+      min-height: 620px;
+      padding: 18px;
+      overflow: auto;
+      white-space: pre-wrap;
+      word-break: break-word;
+      color: #e7efff;
+      font-family: var(--mono);
+      font-size: 0.9rem;
+      line-height: 1.58;
+    }
+    .hidden { display: none; }
+    .brief {
+      display: grid;
+      gap: 12px;
+    }
+    .brief-card {
+      border-radius: 18px;
+      border: 1px solid rgba(196, 217, 245, 0.1);
+      background: rgba(255,255,255,0.04);
+      padding: 14px;
+    }
+    .brief-card h3 {
+      margin: 0 0 8px;
+      font-size: 0.86rem;
+      text-transform: uppercase;
+      letter-spacing: 0.08em;
+      color: var(--gold);
+    }
+    .brief-card p, .brief-card ul {
+      margin: 0;
+      color: var(--muted);
+      line-height: 1.55;
+      font-size: 0.92rem;
+    }
+    .brief-card ul {
+      padding-left: 18px;
+    }
+    .brief-card li + li { margin-top: 6px; }
+    .footer-note {
+      margin-top: 10px;
+      color: rgba(181, 193, 209, 0.66);
+      font-size: 0.78rem;
+      line-height: 1.5;
+    }
+    @media (max-width: 1240px) {
+      .dashboard { grid-template-columns: 360px 1fr; }
+      .sidebar-right { grid-column: 1 / -1; }
+    }
+    @media (max-width: 900px) {
+      .wrap { padding: 18px 14px 28px; }
+      .dashboard { grid-template-columns: 1fr; }
+      .hero-top { flex-direction: column; }
+      .hero-statbar { width: 100%; min-width: 0; }
+      .button-grid { grid-template-columns: 1fr; }
+      .viewer, pre { min-height: 420px; }
+    }
+  </style>
+</head>
+<body>
+  <div class="chrome"></div>
+  <div class="wrap">
+    <section class="hero">
+      <div class="hero-top">
+        <div>
+          <div class="eyebrow">Live OpenEnv Ops Console</div>
+          <h1>NovaTech Incident Command</h1>
+          <p>Run a full incident workflow from one place: shape your search space, surface the most credible evidence, lock in a structured causal hypothesis, and pressure-test the final report before submission.</p>
+        </div>
+        <div class="hero-statbar">
+          <div class="hero-stat">
+            <div class="label">Mode</div>
+            <div class="value">Seeded, Partial</div>
+          </div>
+          <div class="hero-stat">
+            <div class="label">Sessions</div>
+            <div class="value" id="hero-session">None</div>
+          </div>
+          <div class="hero-stat">
+            <div class="label">Last Reward</div>
+            <div class="value" id="hero-reward">-</div>
+          </div>
+        </div>
+      </div>
+    </section>
+    <section class="dashboard">
+      <div class="panel">
+        <div class="panel-head">
+          <h2 class="panel-title">Mission Control</h2>
+        </div>
+        <p class="panel-subtitle">Start a seeded episode, track session health, and jump into common action patterns without writing boilerplate from scratch.</p>
+        <div class="panel-body stack">
+          <div class="field">
+            <label>Task</label>
+            <select id="task">
+              <option value="easy">easy · auth heap exhaustion</option>
+              <option value="medium">medium · competing checkout hypotheses</option>
+              <option value="hard">hard · cascading multi-service incident</option>
+            </select>
+          </div>
+          <div class="field">
+            <label>Seed</label>
+            <input id="seed" placeholder="Optional integer seed for reproducibility" />
+          </div>
+          <div class="button-grid">
+            <button onclick="resetEpisode()">Reset Episode</button>
+            <button class="button-secondary" onclick="loadState()">Load Public State</button>
+          </div>
+          <div id="status" class="status">No active session yet. Reset an episode to begin.</div>
+          <div class="kpis">
+            <div class="kpi">
+              <div class="label">Session ID</div>
+              <div class="value" id="session-pill">-</div>
+            </div>
+            <div class="kpi">
+              <div class="label">Task</div>
+              <div class="value" id="task-pill">-</div>
+            </div>
+            <div class="kpi">
+              <div class="label">Step</div>
+              <div class="value" id="step-pill">-</div>
+            </div>
+            <div class="kpi">
+              <div class="label">Done</div>
+              <div class="value" id="done-pill">-</div>
+            </div>
+          </div>
+          <div>
+            <div class="group-label">Quick Templates</div>
+            <div class="template-grid">
+              <button class="template" onclick="useTemplate('critical_window')">
+                <strong>Critical Window Query</strong>
+                <span>Pull the highest-risk logs in the incident window first.</span>
+              </button>
+              <button class="template" onclick="useTemplate('dependency_sweep')">
+                <strong>Dependency Sweep</strong>
+                <span>Inspect the most suspicious service and its adjacent services.</span>
+              </button>
+              <button class="template" onclick="useTemplate('hypothesis_auth')">
+                <strong>Auth Hypothesis</strong>
+                <span>Start from resource exhaustion in auth-service.</span>
+              </button>
+              <button class="template" onclick="useTemplate('submit_shell')">
+                <strong>Final Report Shell</strong>
+                <span>Fill a structured report with observed evidence only.</span>
+              </button>
+            </div>
+          </div>
+        </div>
+      </div>
+      <div class="panel">
+        <div class="panel-head">
+          <h2 class="panel-title">Action Composer</h2>
+          <button class="button-ghost" style="width:auto;" onclick="formatAction()">Format JSON</button>
+        </div>
+        <p class="panel-subtitle">Work directly against the typed API. The current session id is auto-injected when missing, so you can focus on the action payload itself.</p>
+        <div class="panel-body">
+          <div class="field">
+            <label>Action JSON</label>
+            <textarea id="action">{
+  "action_type": "query_logs",
+  "query": {
+    "levels": ["CRITICAL", "ERROR"],
+    "limit": 5
+  }
+}</textarea>
+          </div>
+          <div class="button-grid" style="margin-top: 12px;">
+            <button onclick="submitStep()">Submit Step</button>
+            <button class="button-secondary" onclick="copySessionAction()">Inject Session + Copy</button>
+          </div>
+          <div class="footer-note">Tip: keep evidence grounded. The grader now rejects unseen log ids and penalizes contradictions across service, impact, and containment.</div>
+        </div>
+      </div>
+      <div class="panel sidebar-right">
+        <div class="panel-head">
+          <h2 class="panel-title">Situation Room</h2>
+        </div>
+        <p class="panel-subtitle">Read the live incident summary, then switch between raw JSON and a cleaner operator view to understand what changed after each step.</p>
+        <div class="panel-body">
+          <div class="brief">
+            <div class="brief-card">
+              <h3>Incident Snapshot</h3>
+              <p id="brief-title">No active incident briefing yet.</p>
+            </div>
+            <div class="brief-card">
+              <h3>Operational Constraints</h3>
+              <ul id="constraints-list">
+                <li>Reset an episode to load task-specific constraints.</li>
+              </ul>
+            </div>
+            <div class="brief-card">
+              <h3>Suspected Services</h3>
+              <div class="chips" id="suspected-services">
+                <span class="chip">None</span>
+              </div>
+            </div>
+          </div>
+          <div class="viewer-tabs" style="margin-top: 16px;">
+            <button class="tab active" id="tab-raw" onclick="switchTab('raw')">Raw JSON</button>
+            <button class="tab" id="tab-ops" onclick="switchTab('ops')">Ops Summary</button>
+          </div>
+          <div class="viewer">
+            <pre id="output-raw">No data yet.</pre>
+            <pre id="output-ops" class="hidden">No data yet.</pre>
+          </div>
+        </div>
+      </div>
+    </section>
+  </div>
+  <script>
+    let currentSessionId = null;
+    const templates = {
+      critical_window: {
+        action_type: "query_logs",
+        query: { levels: ["CRITICAL", "ERROR"], limit: 6 }
+      },
+      dependency_sweep: {
+        action_type: "inspect_dependencies",
+        target_service: "payment-api"
+      },
+      hypothesis_auth: {
+        action_type: "update_hypothesis",
+        hypothesis: {
+          primary_service: "auth-service",
+          failure_mode: "resource_exhaustion",
+          dependency: "none",
+          customer_impact: "login_failures",
+          confidence: 0.82
+        }
+      },
+      submit_shell: {
+        action_type: "submit_report",
+        report: {
+          evidence_log_ids: [],
+          impacted_services: ["auth-service"],
+          root_cause: {
+            primary_service: "auth-service",
+            failure_mode: "resource_exhaustion",
+            dependency: "none",
+            customer_impact: "login_failures",
+            confidence: 0.82
+          },
+          containment_plan: ["increase_auth_heap", "enable_login_rate_limiting"],
+          summary: "Replace this with a concise, evidence-backed incident summary."
+        }
+      }
+    };
+    function buildOpsView(data) {
+      const source = data.observation || data;
+      const reward = data.reward || data.last_reward || {};
+      const logs = source.visible_logs || [];
+      const lines = [];
+      lines.push("Session Overview");
+      lines.push(`- Session: ${source.session_id || currentSessionId || "-"}`);
+      lines.push(`- Task: ${source.task_id || "-"}`);
+      lines.push(`- Step: ${source.step_number ?? data.step_number ?? "-"} / ${source.max_steps ?? data.max_steps ?? "-"}`);
+      lines.push(`- Revealed logs: ${source.revealed_log_count ?? data.revealed_log_count ?? logs.length ?? 0}`);
+      lines.push(`- Done: ${String(source.done ?? data.done ?? "-")}`);
+      if (source.feedback) {
+        lines.push("");
+        lines.push("Feedback");
+        lines.push(source.feedback);
+      }
+      if (source.briefing) {
+        lines.push("");
+        lines.push("Briefing");
+        lines.push(`- Title: ${source.briefing.title}`);
+        lines.push(`- Objective: ${source.briefing.objective}`);
+        lines.push(`- Customer: ${source.briefing.customer_statement}`);
+      }
+      if (source.last_hypothesis) {
+        lines.push("");
+        lines.push("Latest Hypothesis");
+        lines.push(`- Service: ${source.last_hypothesis.primary_service}`);
+        lines.push(`- Failure mode: ${source.last_hypothesis.failure_mode}`);
+        lines.push(`- Dependency: ${source.last_hypothesis.dependency}`);
+        lines.push(`- Impact: ${source.last_hypothesis.customer_impact}`);
+        lines.push(`- Confidence: ${source.last_hypothesis.confidence}`);
+      }
+      if (source.submitted_containment && source.submitted_containment.length) {
+        lines.push("");
+        lines.push("Containment");
+        source.submitted_containment.forEach((item) => lines.push(`- ${item}`));
+      }
+      if (reward.value !== undefined) {
+        lines.push("");
+        lines.push("Reward");
+        lines.push(`- Total: ${Number(reward.value).toFixed(4)}`);
+        if (reward.signal_reward !== undefined) lines.push(`- Signal: ${Number(reward.signal_reward).toFixed(4)}`);
+        if (reward.hypothesis_reward !== undefined) lines.push(`- Hypothesis: ${Number(reward.hypothesis_reward).toFixed(4)}`);
+        if (reward.efficiency_reward !== undefined) lines.push(`- Efficiency: ${Number(reward.efficiency_reward).toFixed(4)}`);
+        if (reward.penalty !== undefined) lines.push(`- Penalty: ${Number(reward.penalty).toFixed(4)}`);
+      }
+      if (logs.length) {
+        lines.push("");
+        lines.push(`Visible Logs (${logs.length})`);
+        logs.slice(0, 10).forEach((log) => {
+          lines.push(`- [${log.log_level}] ${log.log_id} · ${log.service_name} · ${log.server_id}`);
+          lines.push(`  ${log.message}`);
+        });
+      }
+      return lines.join("\\n");
+    }
+    function refreshBriefing(observation) {
+      document.getElementById("session-pill").textContent = observation.session_id || currentSessionId || "-";
+      document.getElementById("task-pill").textContent = observation.task_id || "-";
+      document.getElementById("step-pill").textContent = `${observation.step_number ?? "-"} / ${observation.max_steps ?? "-"}`;
+      document.getElementById("done-pill").textContent = String(observation.done ?? "-");
+      document.getElementById("hero-session").textContent = observation.session_id ? observation.session_id.slice(0, 8) : "None";
+      if (observation.briefing) {
+        document.getElementById("brief-title").textContent = `${observation.briefing.title}: ${observation.briefing.customer_statement}`;
+        const list = document.getElementById("constraints-list");
+        list.innerHTML = "";
+        observation.briefing.operational_constraints.forEach((item) => {
+          const li = document.createElement("li");
+          li.textContent = item;
+          list.appendChild(li);
+        });
+        const chips = document.getElementById("suspected-services");
+        chips.innerHTML = "";
+        observation.briefing.suspected_services.forEach((service) => {
+          const chip = document.createElement("span");
+          chip.className = "chip";
+          chip.textContent = service;
+          chips.appendChild(chip);
+        });
+      }
+    }
+    function show(data) {
+      document.getElementById("output-raw").textContent = JSON.stringify(data, null, 2);
+      document.getElementById("output-ops").textContent = buildOpsView(data);
+      const observation = data.observation || data;
+      if (observation.session_id) currentSessionId = observation.session_id;
+      refreshBriefing(observation);
+      const reward = data.reward || data.last_reward;
+      document.getElementById("hero-reward").textContent = reward && reward.value !== undefined ? Number(reward.value).toFixed(3) : "-";
+    }
+    function status(text, ok=true) {
+      const node = document.getElementById("status");
+      node.textContent = text;
+      node.className = ok ? "status ok" : "status bad";
+    }
+    function switchTab(which) {
+      const raw = document.getElementById("output-raw");
+      const ops = document.getElementById("output-ops");
+      const rawTab = document.getElementById("tab-raw");
+      const opsTab = document.getElementById("tab-ops");
+      if (which === "ops") {
+        raw.classList.add("hidden");
+        ops.classList.remove("hidden");
+        rawTab.classList.remove("active");
+        opsTab.classList.add("active");
+      } else {
+        ops.classList.add("hidden");
+        raw.classList.remove("hidden");
+        opsTab.classList.remove("active");
+        rawTab.classList.add("active");
+      }
+    }
+    function useTemplate(name) {
+      const template = JSON.parse(JSON.stringify(templates[name]));
+      if (currentSessionId) template.session_id = currentSessionId;
+      document.getElementById("action").value = JSON.stringify(template, null, 2);
+    }
+    function formatAction() {
+      try {
+        const payload = JSON.parse(document.getElementById("action").value);
+        document.getElementById("action").value = JSON.stringify(payload, null, 2);
+        status("Action JSON formatted.");
+      } catch (error) {
+        status(error.message, false);
+      }
+    }
+    async function copySessionAction() {
+      try {
+        const payload = JSON.parse(document.getElementById("action").value);
+        if (currentSessionId) payload.session_id = currentSessionId;
+        const text = JSON.stringify(payload, null, 2);
+        document.getElementById("action").value = text;
+        if (navigator.clipboard) {
+          await navigator.clipboard.writeText(text);
+          status("Session id injected and action copied.");
+        } else {
+          status("Clipboard unavailable; action updated in the editor only.", false);
+        }
+      } catch (error) {
+        status(error.message, false);
+      }
+    }
+    async function resetEpisode() {
+      try {
+        const task_id = document.getElementById("task").value;
+        const rawSeed = document.getElementById("seed").value.trim();
+        const payload = { task_id };
+        if (rawSeed) payload.seed = Number(rawSeed);
+        const res = await fetch("/reset", { method: "POST", headers: { "Content-Type": "application/json" }, body: JSON.stringify(payload) });
+        const data = await res.json();
+        if (!res.ok) return status(JSON.stringify(data), false);
+        show(data);
+        status("Episode reset.");
+      } catch (error) {
+        // Surface network and JSON parse failures, as submitStep does.
+        status(error.message, false);
+      }
+    }
+    async function submitStep() {
+      try {
+        const payload = JSON.parse(document.getElementById("action").value);
+        if (currentSessionId && !payload.session_id) payload.session_id = currentSessionId;
+        const res = await fetch("/step", { method: "POST", headers: { "Content-Type": "application/json" }, body: JSON.stringify(payload) });
+        const data = await res.json();
+        if (!res.ok) return status(JSON.stringify(data), false);
+        show(data);
+        status("Step completed.");
+      } catch (error) {
+        status(error.message, false);
+      }
+    }
+    async function loadState() {
+      try {
+        const url = currentSessionId ? `/state?session_id=${encodeURIComponent(currentSessionId)}` : "/state";
+        const res = await fetch(url);
+        const data = await res.json();
+        if (!res.ok) return status(JSON.stringify(data), false);
+        show(data);
+        status("Public state loaded.");
+      } catch (error) {
+        // Surface network and JSON parse failures, as submitStep does.
+        status(error.message, false);
+      }
+    }
+    switchTab('raw');
+  </script>
+</body>
+</html>
+"""
+@app.get("/health")
+def health() -> Dict[str, str]:
+    return {"status": "ok"}
+@app.post("/reset", response_model=Dict[str, Any])
+def reset(request: ResetRequest) -> Dict[str, Any]:
+    try:
+        observation = store.reset(task_id=request.task_id, seed=request.seed)
+    except ValueError as exc:
+        raise HTTPException(status_code=422, detail=str(exc)) from exc
+    return observation.model_dump()
+@app.post("/step", response_model=StepResponse)
+def step(action: Action) -> StepResponse:
+    try:
+        observation, reward, done, info = store.step(action)
+    except (RuntimeError, ValueError) as exc:
+        raise HTTPException(status_code=400, detail=str(exc)) from exc
+    return StepResponse(
+        observation=observation.model_dump(),
+        reward=reward.model_dump(),
+        done=done,
+        info=info,
+    )
+@app.get("/state", response_model=Dict[str, Any])
+def state(session_id: Optional[str] = Query(default=None)) -> Dict[str, Any]:
+    try:
+        return store.public_state(session_id=session_id)
+    except RuntimeError as exc:
+        raise HTTPException(status_code=400, detail=str(exc)) from exc
+@app.get("/debug_state", response_model=Dict[str, Any])
+def debug_state(session_id: Optional[str] = Query(default=None)) -> Dict[str, Any]:
+    try:
+        return store.debug_state(session_id=session_id)
+    except PermissionError as exc:
+        raise HTTPException(status_code=403, detail=str(exc)) from exc
+    except RuntimeError as exc:
+        raise HTTPException(status_code=400, detail=str(exc)) from exc
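The UI's `submitStep` adds the current session id to an action payload only when the payload does not already carry one, then posts it to `/step`. A minimal Python sketch of that payload preparation follows. The helper name `with_session` and the session id `"abc123"` are hypothetical and do not appear in the commit; sending the request is left out so the sketch stays self-contained.

```python
import json
from typing import Any, Dict, Optional


def with_session(payload: Dict[str, Any], session_id: Optional[str]) -> Dict[str, Any]:
    """Return a copy of payload with session_id filled in only when it is missing.

    Hypothetical helper mirroring the UI rule:
    `if (currentSessionId && !payload.session_id) payload.session_id = currentSessionId`.
    """
    result = dict(payload)  # shallow copy so the caller's dict is not mutated
    if session_id and not result.get("session_id"):
        result["session_id"] = session_id
    return result


# Same shape as the default Action Composer payload.
action = {
    "action_type": "query_logs",
    "query": {"levels": ["CRITICAL", "ERROR"], "limit": 5},
}

# JSON body that a client could POST to /step.
body = json.dumps(with_session(action, "abc123"))
```

An explicit `session_id` in the payload always wins, which matches how the browser client behaves when several sessions exist.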

data/__init__.py ADDED Viewed

File without changes

data/__pycache__/db_loader.cpython-314.pyc ADDED Viewed

Binary file (8.01 kB). View file

data/db_loader.py ADDED Viewed

	@@ -0,0 +1,127 @@

+"""
+Database and scenario loading helpers.
+"""
+from __future__ import annotations
+import os
+import random
+import sqlite3
+from pathlib import Path
+from typing import Any, Dict, Iterable, List
+from tasks.catalog import TASK_SPECS
+DEFAULT_DB_PATH = Path(__file__).resolve().parents[1] / "novatech_logs.db"
+DB_PATH = Path(os.getenv("DB_PATH", str(DEFAULT_DB_PATH))).expanduser().resolve()
+def _connect() -> sqlite3.Connection:
+    if not DB_PATH.exists():
+        raise FileNotFoundError(f"Database not found at '{DB_PATH}'")
+    return sqlite3.connect(str(DB_PATH))
+def load_thresholds() -> Dict[str, Dict[str, float]]:
+    conn = _connect()
+    try:
+        rows = conn.execute(
+            "SELECT metric_name, warning_threshold, critical_threshold, consecutive_count FROM anomaly_thresholds"
+        ).fetchall()
+    finally:
+        # Close the connection even if the query raises.
+        conn.close()
+    return {
+        row[0]: {
+            "warning": float(row[1]),
+            "critical": float(row[2]),
+            "consecutive": float(row[3]),
+        }
+        for row in rows
+    }
+def load_patterns() -> Dict[str, Dict[str, str]]:
+    conn = _connect()
+    try:
+        rows = conn.execute(
+            "SELECT pattern_keyword, severity, description FROM known_error_patterns ORDER BY pattern_id"
+        ).fetchall()
+    finally:
+        # Close the connection even if the query raises.
+        conn.close()
+    return {row[0]: {"severity": row[1], "description": row[2]} for row in rows}
+def load_all_logs() -> List[Dict[str, Any]]:
+    conn = _connect()
+    try:
+        rows = conn.execute(
+            """
+            SELECT log_id, timestamp, server_id, log_level, service_name,
+                   message, response_time_ms, cpu_usage_percent, memory_usage_percent
+            FROM server_logs
+            ORDER BY timestamp ASC, log_id ASC
+            """
+        ).fetchall()
+    finally:
+        # Close the connection even if the query raises.
+        conn.close()
+    return [
+        {
+            "log_id": int(row[0]),
+            "timestamp": str(row[1]),
+            "server_id": str(row[2]),
+            "log_level": str(row[3]),
+            "service_name": str(row[4]),
+            "message": str(row[5]),
+            "response_time_ms": int(row[6] or 0),
+            "cpu_usage_percent": float(row[7] or 0.0),
+            "memory_usage_percent": float(row[8] or 0.0),
+        }
+        for row in rows
+    ]
+def _within_window(log: Dict[str, Any], start: str, end: str) -> bool:
+    return start <= str(log["timestamp"]) <= end
+def _base_scope(task_id: str, logs: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
+    spec = TASK_SPECS[task_id]
+    scope_servers = set(spec["scope_servers"])
+    scope_services = set(spec["scope_services"])
+    start = str(spec["incident_window_start"])
+    end = str(spec["incident_window_end"])
+    # Build the must-include set once instead of once per log.
+    must_include_ids = set(spec["must_include_ids"])
+    return [
+        log
+        for log in logs
+        if log["server_id"] in scope_servers
+        and log["service_name"] in scope_services
+        and (
+            _within_window(log, start, end)
+            or log["log_id"] in must_include_ids
+        )
+    ]
+def build_task_log_pool(task_id: str, seed: int) -> List[Dict[str, Any]]:
+    spec = TASK_SPECS[task_id]
+    rng = random.Random(seed)
+    all_logs = load_all_logs()
+    must_include_ids = set(spec["must_include_ids"])
+    base_scope = _base_scope(task_id, all_logs)
+    scope_ids = {log["log_id"] for log in base_scope}
+    for log in all_logs:
+        if log["log_id"] in must_include_ids:
+            scope_ids.add(log["log_id"])
+    scope_logs = [log for log in all_logs if log["log_id"] in scope_ids]
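+    # Build the scope sets once instead of once per log.
+    scope_servers = set(spec["scope_servers"])
+    scope_services = set(spec["scope_services"])
+    noise_candidates = [
+        log
+        for log in all_logs
+        if log["log_id"] not in scope_ids
+        and log["server_id"] in scope_servers
+        and log["service_name"] in scope_services
+    ]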
+    sample_size = min(int(spec["noise_sample_size"]), len(noise_candidates))
+    if sample_size:
+        for log in rng.sample(noise_candidates, sample_size):
+            scope_logs.append(log)
+    enriched = []
+    for index, log in enumerate(scope_logs):
+        log_copy = dict(log)
+        log_copy["_seed_rank"] = rng.random() + (index * 0.00001)
+        enriched.append(log_copy)
+    return enriched
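`build_task_log_pool` draws its noise logs from a `random.Random(seed)` instance, so the same task and seed should yield the same pool on every reset. The sketch below isolates just that sampling step to show the determinism. The function name `sample_noise` and the synthetic `log_id` values are illustrative assumptions, not part of the commit.

```python
import random
from typing import Dict, List


def sample_noise(candidates: List[Dict[str, int]], sample_size: int, seed: int) -> List[int]:
    """Pick up to sample_size noise logs reproducibly for a given seed.

    Hypothetical stand-in for the noise-sampling step in build_task_log_pool.
    """
    rng = random.Random(seed)  # private RNG, independent of the global random state
    size = min(sample_size, len(candidates))  # never ask for more than is available
    picked = rng.sample(candidates, size) if size else []
    return [log["log_id"] for log in picked]


# Synthetic candidates; real ones would come from the server_logs table.
candidates = [{"log_id": i} for i in range(100, 120)]

first = sample_noise(candidates, 5, seed=2025)
second = sample_noise(candidates, 5, seed=2025)
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps one session's sampling from disturbing another's.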

env/__init__.py ADDED Viewed

File without changes

env/__pycache__/environment.cpython-314.pyc ADDED Viewed

Binary file (23.6 kB). View file

env/__pycache__/models.cpython-314.pyc ADDED Viewed

Binary file (6.41 kB). View file

env/environment.py ADDED Viewed

	@@ -0,0 +1,344 @@

+"""
+Session-safe OpenEnv environment with seeded partial observability.
+"""
+from __future__ import annotations
+import os
+import random
+import threading
+import uuid
+from dataclasses import dataclass, field
+from typing import Any, Dict, List, Optional, Set, Tuple
+from data.db_loader import build_task_log_pool, load_patterns, load_thresholds
+from env.models import Action, IncidentBriefing, Observation, Reward, RootCauseHypothesis
+from tasks.catalog import CONTAINMENT_DESCRIPTIONS, DEPENDENCY_GRAPH, TASK_SPECS
+from tasks.graders import build_dense_reward, containment_alignment, grade_report, hypothesis_match_score
+DEBUG_STATE_ENABLED = os.getenv("OPENENV_DEBUG_STATE", "false").lower() == "true"
+@dataclass
+class IncidentSession:
+    session_id: str
+    task_id: str
+    seed: int
+    max_steps: int
+    logs: List[Dict[str, Any]]
+    thresholds: Dict[str, Dict[str, float]]
+    patterns: Dict[str, Dict[str, str]]
+    step_number: int = 0
+    done: bool = False
+    visible_log_ids: Set[int] = field(default_factory=set)
+    visited_services: Set[str] = field(default_factory=set)
+    containment_plan: List[str] = field(default_factory=list)
+    last_hypothesis: Optional[RootCauseHypothesis] = None
+    best_hypothesis_score: float = 0.0
+    query_fingerprints: Dict[str, int] = field(default_factory=dict)
+    last_reward: Optional[Reward] = None
+    episode_history: List[Dict[str, Any]] = field(default_factory=list)
+    def visible_logs(self) -> List[Dict[str, Any]]:
+        visible = [log for log in self.logs if log["log_id"] in self.visible_log_ids]
+        return sorted(visible, key=lambda log: (log["timestamp"], log["log_id"]))
+    def log_map(self) -> Dict[int, Dict[str, Any]]:
+        return {log["log_id"]: log for log in self.logs}
+class SessionStore:
+    def __init__(self) -> None:
+        self._lock = threading.Lock()
+        self._sessions: Dict[str, IncidentSession] = {}
+    def reset(self, task_id: str = "easy", seed: Optional[int] = None) -> Observation:
+        if task_id not in TASK_SPECS:
+            raise ValueError(f"Unknown task_id '{task_id}'.")
+        actual_seed = int(seed if seed is not None else 2025 + (list(TASK_SPECS).index(task_id) * 17))
+        session_id = uuid.uuid4().hex
+        spec = TASK_SPECS[task_id]
+        session = IncidentSession(
+            session_id=session_id,
+            task_id=task_id,
+            seed=actual_seed,
+            max_steps=int(spec["max_steps"]),
+            logs=build_task_log_pool(task_id, actual_seed),
+            thresholds=load_thresholds(),
+            patterns=load_patterns(),
+        )
+        with self._lock:
+            self._sessions[session_id] = session
+        return self._build_observation(
+            session,
+            feedback="Episode created. Query the incident window and inspect dependencies to build your case.",
+        )
+    def step(self, action: Action) -> Tuple[Observation, Reward, bool, Dict[str, Any]]:
+        session = self._resolve_session(action.session_id)
+        if session.done:
+            raise RuntimeError("Episode already finished. Call /reset to start a new session.")
+        session.step_number += 1
+        repeated_action_count = self._register_action(session, action)
+        if action.action_type == "submit_report":
+            if action.report is None:
+                raise ValueError("submit_report requires report")
+            reward = grade_report(
+                task_id=session.task_id,
+                report=action.report,
+                revealed_log_ids=set(session.visible_log_ids),
+                revealed_log_map=session.log_map(),
+                step_number=session.step_number,
+                max_steps=session.max_steps,
+                repeated_action_count=repeated_action_count,
+            )
+            session.done = True
+            feedback = "Final report graded."
+        elif action.action_type == "no_anomalies":
+            reward = build_dense_reward(
+                signal_reward=0.0,
+                hypothesis_reward=0.0,
+                efficiency_reward=0.0,
+                penalty=1.0,
+                info={"message": "No-incident declaration is invalid for this benchmark."},
+            )
+            session.done = True
+            feedback = "No-incident declaration rejected."
+        else:
+            reward, feedback = self._handle_non_terminal(session, action, repeated_action_count)
+            if session.step_number >= session.max_steps:
+                session.done = True
+                feedback = f"{feedback} Step budget exhausted."
+        session.last_reward = reward
+        session.episode_history.append(
+            {
+                "step": session.step_number,
+                "action_type": action.action_type,
+                "reward": reward.value,
+                "done": session.done,
+            }
+        )
+        observation = self._build_observation(session, feedback=feedback)
+        return observation, reward, session.done, dict(reward.info)
+    def public_state(self, session_id: Optional[str] = None) -> Dict[str, Any]:
+        session = self._resolve_session(session_id)
+        return {
+            "session_id": session.session_id,
+            "task_id": session.task_id,
+            "step_number": session.step_number,
+            "max_steps": session.max_steps,
+            "done": session.done,
+            "revealed_log_count": len(session.visible_log_ids),
+            "visited_services": sorted(session.visited_services),
+            "submitted_containment": list(session.containment_plan),
+            "last_reward": session.last_reward.model_dump() if session.last_reward else None,
+        }
+    def debug_state(self, session_id: Optional[str] = None) -> Dict[str, Any]:
+        if not DEBUG_STATE_ENABLED:
+            raise PermissionError("Debug state is disabled.")
+        session = self._resolve_session(session_id)
+        return {
+            "session_id": session.session_id,
+            "task_id": session.task_id,
+            "seed": session.seed,
+            "visible_log_ids": sorted(session.visible_log_ids),
+            "all_logs": session.logs,
+            "history": session.episode_history,
+            "best_hypothesis_score": session.best_hypothesis_score,
+        }
+    def _resolve_session(self, session_id: Optional[str]) -> IncidentSession:
+        with self._lock:
+            if session_id:
+                session = self._sessions.get(session_id)
+                if session is None:
+                    raise RuntimeError(f"Unknown session_id '{session_id}'.")
+                return session
+            if len(self._sessions) == 1:
+                return next(iter(self._sessions.values()))
+        raise RuntimeError("A valid session_id is required.")
+    def _handle_non_terminal(
+        self,
+        session: IncidentSession,
+        action: Action,
+        repeated_action_count: int,
+    ) -> Tuple[Reward, str]:
+        signal_reward = 0.0
+        hypothesis_reward = 0.0
+        penalty = 0.0
+        info: Dict[str, Any] = {}
+        if action.action_type == "query_logs":
+            if action.query is None:
+                raise ValueError("query_logs requires query")
+            newly_revealed = self._query_logs(session, action.query.model_dump(exclude_none=True))
+            relevant = set(TASK_SPECS[session.task_id]["gold_evidence_ids"])
+            relevant_new = len(relevant & set(newly_revealed))
+            signal_reward = min(1.0, round((0.22 * len(newly_revealed)) + (0.28 * relevant_new), 4))
+            penalty = 0.15 if not newly_revealed else 0.0
+            feedback = f"Query revealed {len(newly_revealed)} new log(s)."
+            info["revealed_log_ids"] = newly_revealed
+        elif action.action_type == "inspect_dependencies":
+            if action.target_service is None:
+                raise ValueError("inspect_dependencies requires target_service")
+            session.visited_services.add(action.target_service)
+            neighbors = DEPENDENCY_GRAPH.get(action.target_service, [])
+            revealed = self._inspect_dependencies(session, action.target_service, neighbors)
+            relevant = set(TASK_SPECS[session.task_id]["gold_evidence_ids"])
+            signal_reward = min(1.0, round((0.15 * len(revealed)) + (0.35 * len(relevant & set(revealed))), 4))
+            penalty = 0.1 if not revealed else 0.0
+            feedback = f"Dependency inspection around {action.target_service} revealed {len(revealed)} new log(s)."
+            info["neighbors"] = neighbors
+            info["revealed_log_ids"] = revealed
+        elif action.action_type == "update_hypothesis":
+            if action.hypothesis is None:
+                raise ValueError("update_hypothesis requires hypothesis")
+            current_score = hypothesis_match_score(action.hypothesis, session.task_id)
+            improvement = max(0.0, current_score - session.best_hypothesis_score)
+            session.best_hypothesis_score = max(session.best_hypothesis_score, current_score)
+            session.last_hypothesis = action.hypothesis
+            hypothesis_reward = improvement
+            penalty = 0.15 if improvement == 0.0 and current_score < session.best_hypothesis_score else 0.0
+            feedback = "Hypothesis recorded."
+            info["hypothesis_score"] = current_score
+        elif action.action_type == "execute_containment":
+            plan = list(action.containment_plan or [])
+            positive, negative = containment_alignment(plan, session.task_id)
+            for item in plan:
+                if item not in session.containment_plan:
+                    session.containment_plan.append(item)
+            hypothesis_reward = positive
+            penalty = min(1.0, negative + (0.05 if not plan else 0.0))
+            feedback = "Containment actions recorded."
+            info["containment_positive"] = positive
+            info["containment_negative"] = negative
+            info["containment_descriptions"] = [CONTAINMENT_DESCRIPTIONS[item] for item in plan]
+        elif action.action_type == "request_more":
+            penalty = 0.1
+            feedback = "No additional passive data is provided. Use a concrete query."
+        else:
+            penalty = 0.2
+            feedback = "Unsupported action."
+        if repeated_action_count > 0:
+            penalty = min(1.0, penalty + min(0.2, repeated_action_count * 0.1))
+        efficiency_reward = max(
+            0.0,
+            round(1.0 - ((session.step_number - 1) / max(1, session.max_steps - 1)), 4),
+        )
+        reward = build_dense_reward(
+            signal_reward=signal_reward,
+            hypothesis_reward=hypothesis_reward,
+            efficiency_reward=efficiency_reward,
+            penalty=penalty,
+            info=info,
+        )
+        return reward, feedback
+    def _query_logs(self, session: IncidentSession, query: Dict[str, Any]) -> List[int]:
+        matched = [log for log in session.logs if self._match_query(log, query)]
+        ranked = sorted(matched, key=lambda log: (self._severity_rank(log["log_level"]), -float(log["_seed_rank"])))
+        # Resolve the limit once and check it before revealing, so a limit of 0 reveals nothing.
+        limit = int(query.get("limit", 6))
+        revealed: List[int] = []
+        for log in ranked:
+            if len(revealed) >= limit:
+                break
+            if log["log_id"] in session.visible_log_ids:
+                continue
+            session.visible_log_ids.add(log["log_id"])
+            revealed.append(log["log_id"])
+            session.visited_services.add(log["service_name"])
+        return revealed
+    def _inspect_dependencies(self, session: IncidentSession, target_service: str, neighbors: List[str]) -> List[int]:
+        candidate_services = {target_service, *[neighbor for neighbor in neighbors if neighbor.endswith("-service")]}
+        matched = [
+            log
+            for log in session.logs
+            if log["service_name"] in candidate_services and log["log_level"] in {"CRITICAL", "ERROR", "WARN"}
+        ]
+        ranked = sorted(matched, key=lambda log: (self._severity_rank(log["log_level"]), log["timestamp"], -float(log["_seed_rank"])))
+        revealed: List[int] = []
+        for log in ranked:
+            if log["log_id"] in session.visible_log_ids:
+                continue
+            session.visible_log_ids.add(log["log_id"])
+            revealed.append(log["log_id"])
+            if len(revealed) >= 4:
+                break
+        return revealed
+    @staticmethod
+    def _match_query(log: Dict[str, Any], query: Dict[str, Any]) -> bool:
+        if query.get("service_name") and log["service_name"] != query["service_name"]:
+            return False
+        if query.get("server_id") and log["server_id"] != query["server_id"]:
+            return False
+        if query.get("levels") and log["log_level"] not in set(query["levels"]):
+            return False
+        if query.get("start_time") and str(log["timestamp"]) < str(query["start_time"]):
+            return False
+        if query.get("end_time") and str(log["timestamp"]) > str(query["end_time"]):
+            return False
+        if query.get("text_contains") and query["text_contains"].lower() not in str(log["message"]).lower():
+            return False
+        return True
+    @staticmethod
+    def _severity_rank(level: str) -> int:
+        order = {"CRITICAL": 0, "ERROR": 1, "WARN": 2, "INFO": 3}
+        return order.get(level, 4)
+    @staticmethod
+    def _register_action(session: IncidentSession, action: Action) -> int:
+        fingerprint_source = [action.action_type]
+        if action.query:
+            fingerprint_source.append(str(action.query.model_dump(exclude_none=True)))
+        if action.target_service:
+            fingerprint_source.append(action.target_service)
+        if action.hypothesis:
+            fingerprint_source.append(str(action.hypothesis.model_dump()))
+        if action.containment_plan:
+            fingerprint_source.append(",".join(action.containment_plan))
+        if action.report:
+            fingerprint_source.append(str(action.report.root_cause.model_dump()))
+        fingerprint = "::".join(fingerprint_source)
+        count = session.query_fingerprints.get(fingerprint, 0)
+        session.query_fingerprints[fingerprint] = count + 1
+        return count
+    def _build_observation(self, session: IncidentSession, feedback: Optional[str]) -> Observation:
+        spec = TASK_SPECS[session.task_id]
+        return Observation(
+            session_id=session.session_id,
+            task_id=session.task_id,
+            task_title=str(spec["title"]),
+            briefing=IncidentBriefing(
+                incident_id=str(spec["incident_id"]),
+                title=str(spec["title"]),
+                objective=str(spec["objective"]),
+                incident_window_start=str(spec["incident_window_start"]),
+                incident_window_end=str(spec["incident_window_end"]),
+                suspected_services=list(spec["suspected_services"]),
+                customer_statement=str(spec["customer_statement"]),
+                operational_constraints=list(spec["operational_constraints"]),
+            ),
+            dependency_graph=DEPENDENCY_GRAPH,
+            visible_logs=session.visible_logs(),
+            revealed_log_count=len(session.visible_log_ids),
+            visited_services=sorted(session.visited_services),
+            submitted_containment=list(session.containment_plan),
+            last_hypothesis=session.last_hypothesis,
+            step_number=session.step_number,
+            max_steps=session.max_steps,
+            feedback=feedback,
+            done=session.done,
+        )
+store = SessionStore()
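
Reviewer note on `_register_action`: the method builds a string fingerprint from the populated parts of an action and returns how many times that fingerprint was seen earlier in the session, which is what lets the environment penalize loops. The sketch below reproduces the idea with plain dictionaries instead of the Pydantic models; `register_action` and its parameters are illustrative stand-ins, not the environment's API, and only two of the optional fields are modeled.

```python
from typing import Any, Dict, Optional


def register_action(
    seen: Dict[str, int],
    action_type: str,
    query: Optional[Dict[str, Any]] = None,
    target_service: Optional[str] = None,
) -> int:
    """Return how often this exact action was seen before, then record it."""
    parts = [action_type]
    if query:
        # Drop None-valued filters so equivalent queries share a fingerprint
        # (the role model_dump(exclude_none=True) plays in the original).
        parts.append(str({k: v for k, v in query.items() if v is not None}))
    if target_service:
        parts.append(target_service)
    fingerprint = "::".join(parts)
    count = seen.get(fingerprint, 0)
    seen[fingerprint] = count + 1
    return count


seen: Dict[str, int] = {}
first = register_action(seen, "query_logs", {"levels": ["ERROR"], "limit": 6})
second = register_action(seen, "query_logs", {"levels": ["ERROR"], "limit": 6, "server_id": None})
other = register_action(seen, "inspect_dependencies", target_service="auth-service")
print(first, second, other)  # 0 1 0
```

Because the fingerprint is a stringified dict, two queries with the same filters in a different key order would count as distinct in this sketch; the original avoids that because the model always dumps its fields in declaration order.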

env/models.py ADDED

	@@ -0,0 +1,142 @@

+"""
+Typed models for the hardened NovaTech OpenEnv environment.
+"""
+from __future__ import annotations
+from typing import Any, Dict, List, Literal, Optional
+from pydantic import BaseModel, Field
+ServiceName = Literal[
+    "auth-service",
+    "payment-api",
+    "order-service",
+    "notification-service",
+    "reporting-service",
+    "user-service",
+]
+ServerName = Literal["server_01", "server_02", "server_03", "server_04"]
+LogLevel = Literal["INFO", "WARN", "ERROR", "CRITICAL"]
+FailureMode = Literal[
+    "resource_exhaustion",
+    "dependency_outage",
+    "storage_saturation",
+    "certificate_expiry",
+    "application_bug",
+    "traffic_abuse",
+]
+DependencyName = Literal["none", "payment-gateway", "mysql", "email-relay", "ldap-directory"]
+CustomerImpact = Literal[
+    "login_failures",
+    "checkout_delays",
+    "order_write_failures",
+    "notification_delivery_failure",
+    "cross_service_major_incident",
+]
+ContainmentActionName = Literal[
+    "increase_auth_heap",
+    "enable_login_rate_limiting",
+    "restore_payment_gateway_connectivity",
+    "reduce_checkout_retry_pressure",
+    "free_order_log_disk",
+    "reset_mysql_connection_pool",
+    "renew_smtp_certificate",
+    "reroute_notification_traffic",
+    "page_major_incident_team",
+    "block_all_login_traffic",
+    "wipe_application_logs",
+    "restart_everything",
+]
+class LogEntry(BaseModel):
+    log_id: int
+    timestamp: str
+    server_id: ServerName
+    log_level: LogLevel
+    service_name: ServiceName
+    message: str
+    response_time_ms: int
+    cpu_usage_percent: float
+    memory_usage_percent: float
+class IncidentBriefing(BaseModel):
+    incident_id: str
+    title: str
+    objective: str
+    incident_window_start: str
+    incident_window_end: str
+    suspected_services: List[ServiceName]
+    customer_statement: str
+    operational_constraints: List[str]
+class RootCauseHypothesis(BaseModel):
+    primary_service: ServiceName
+    failure_mode: FailureMode
+    dependency: DependencyName = "none"
+    customer_impact: CustomerImpact
+    confidence: float = Field(..., ge=0.0, le=1.0)
+class LogQuery(BaseModel):
+    service_name: Optional[ServiceName] = None
+    server_id: Optional[ServerName] = None
+    levels: Optional[List[LogLevel]] = None
+    start_time: Optional[str] = None
+    end_time: Optional[str] = None
+    text_contains: Optional[str] = Field(default=None, max_length=80)
+    limit: int = Field(default=6, ge=1, le=6)
+class IncidentReport(BaseModel):
+    evidence_log_ids: List[int] = Field(..., min_length=1)
+    impacted_services: List[ServiceName] = Field(..., min_length=1)
+    root_cause: RootCauseHypothesis
+    containment_plan: List[ContainmentActionName] = Field(default_factory=list)
+    summary: str = Field(..., min_length=20, max_length=600)
+class Action(BaseModel):
+    session_id: Optional[str] = None
+    action_type: Literal[
+        "query_logs",
+        "inspect_dependencies",
+        "update_hypothesis",
+        "execute_containment",
+        "submit_report",
+        "request_more",
+        "no_anomalies",
+    ]
+    query: Optional[LogQuery] = None
+    target_service: Optional[ServiceName] = None
+    hypothesis: Optional[RootCauseHypothesis] = None
+    containment_plan: Optional[List[ContainmentActionName]] = None
+    report: Optional[IncidentReport] = None
+class Observation(BaseModel):
+    session_id: str
+    task_id: str
+    task_title: str
+    briefing: IncidentBriefing
+    dependency_graph: Dict[ServiceName, List[str]]
+    visible_logs: List[LogEntry]
+    revealed_log_count: int
+    visited_services: List[ServiceName]
+    submitted_containment: List[ContainmentActionName]
+    last_hypothesis: Optional[RootCauseHypothesis] = None
+    step_number: int = 0
+    max_steps: int = 8
+    feedback: Optional[str] = None
+    done: bool = False
+class Reward(BaseModel):
+    value: float = Field(..., ge=0.0, le=1.0)
+    signal_reward: float = Field(default=0.0, ge=0.0, le=1.0)
+    hypothesis_reward: float = Field(default=0.0, ge=0.0, le=1.0)
+    efficiency_reward: float = Field(default=0.0, ge=0.0, le=1.0)
+    penalty: float = Field(default=0.0, ge=0.0, le=1.0)
+    info: Dict[str, Any] = Field(default_factory=dict)
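
Reviewer note on the `LogQuery` bounds: the model caps `limit` at 6, caps `text_contains` at 80 characters, and restricts `levels` to the four `LogLevel` literals, so a client can catch most rejected payloads before calling `/step`. The following is a minimal standard-library sketch of such a pre-check. `check_log_query` is a hypothetical helper, not part of this commit, and the `session_id` value is a placeholder.

```python
import json
from typing import Any, Dict, List

ALLOWED_LEVELS = {"INFO", "WARN", "ERROR", "CRITICAL"}  # mirrors LogLevel


def check_log_query(query: Dict[str, Any]) -> List[str]:
    """Hypothetical client-side pre-check; returns a list of problems found."""
    problems: List[str] = []
    limit = query.get("limit", 6)  # LogQuery.limit defaults to 6
    if not isinstance(limit, int) or not 1 <= limit <= 6:
        problems.append("limit must be an integer in [1, 6]")
    text = query.get("text_contains")
    if text is not None and len(text) > 80:
        problems.append("text_contains must be at most 80 characters")
    bad = [lvl for lvl in query.get("levels") or [] if lvl not in ALLOWED_LEVELS]
    if bad:
        problems.append(f"unknown log levels: {bad}")
    return problems


payload = {
    "session_id": "example-session",  # placeholder; real ids come from /reset
    "action_type": "query_logs",
    "query": {"service_name": "auth-service", "levels": ["CRITICAL", "ERROR"], "limit": 6},
}
assert check_log_query(payload["query"]) == []
print(json.dumps(payload))

# A query that violates two of the declared constraints.
print(check_log_query({"levels": ["FATAL"], "limit": 10}))
```

The server-side Pydantic validation remains authoritative; a check like this only saves a round trip and an action from the step budget.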

inference.py ADDED

	@@ -0,0 +1,257 @@

+from __future__ import annotations
+import os
+from typing import Any, Dict, List, Optional
+import requests
+from openai import OpenAI
+API_KEY = os.getenv("HF_TOKEN", "")
+API_BASE_URL = os.getenv("API_BASE_URL", "https://api.openai.com/v1")
+MODEL_NAME = os.getenv("MODEL_NAME", "gpt-4o-mini")
+LOGENV_URL = os.getenv("LOGENV_URL", "http://localhost:7860")
+BENCHMARK = "NovaTechIncidentCommand"
+SUCCESS_THRESHOLD = 0.70
+client = OpenAI(api_key=API_KEY or "placeholder", base_url=API_BASE_URL)
+def log_start(task: str, env: str, model: str) -> None:
+    print(f"[START] task={task} env={env} model={model}", flush=True)
+def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
+    print(
+        f"[STEP] step={step} action={action} reward={reward:.2f} done={str(done).lower()} error={error if error else 'null'}",
+        flush=True,
+    )
+def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+    print(
+        f"[END] success={str(success).lower()} steps={steps} score={max(0.0, min(1.0, score)):.3f} rewards={','.join(f'{r:.2f}' for r in rewards)}",
+        flush=True,
+    )
+def api_reset(task_id: str) -> Dict[str, Any]:
+    response = requests.post(f"{LOGENV_URL}/reset", json={"task_id": task_id}, timeout=30)
+    response.raise_for_status()
+    return response.json()
+def api_step(payload: Dict[str, Any]) -> Dict[str, Any]:
+    response = requests.post(f"{LOGENV_URL}/step", json=payload, timeout=60)
+    response.raise_for_status()
+    return response.json()
+def maybe_ping_model(task_id: str) -> None:
+    if not API_KEY:
+        return
+    try:
+        client.responses.create(
+            model=MODEL_NAME,
+            input=f"Reply with ACK for {task_id}.",
+            temperature=0,
+            max_output_tokens=4,
+        )
+    except Exception:
+        # Best-effort connectivity check: a model failure must not abort the run.
+        pass
+def _severity_score(log: Dict[str, Any]) -> float:
+    level_weight = {"CRITICAL": 4.0, "ERROR": 3.0, "WARN": 1.0, "INFO": 0.2}
+    score = level_weight.get(str(log["log_level"]).upper(), 0.0)
+    if float(log.get("cpu_usage_percent", 0.0)) >= 90.0:
+        score += 1.0
+    if float(log.get("memory_usage_percent", 0.0)) >= 95.0:
+        score += 1.0
+    if int(log.get("response_time_ms", 0)) >= 3000:
+        score += 1.0
+    message = str(log["message"]).lower()
+    for needle, bonus in {
+        "outofmemoryerror": 2.0,
+        "connection refused": 2.0,
+        "disk full": 2.0,
+        "ssl certificate expired": 1.8,
+        "segmentation fault": 1.8,
+        "timeout exceeded": 1.0,
+    }.items():
+        if needle in message:
+            score += bonus
+    return score
+def _infer_hypothesis(observation: Dict[str, Any]) -> Dict[str, Any]:
+    logs = sorted(observation.get("visible_logs", []), key=_severity_score, reverse=True)
+    services = {log["service_name"] for log in logs}
+    messages = " ".join(str(log["message"]).lower() for log in logs)
+    if "outofmemoryerror" in messages and {"payment-api", "order-service", "notification-service"} & services:
+        return {
+            "primary_service": "auth-service",
+            "failure_mode": "resource_exhaustion",
+            "dependency": "payment-api",
+            "customer_impact": "cross_service_major_incident",
+            "confidence": 0.92,
+        }
+    if "connection refused" in messages or "payment confirmation" in messages:
+        return {
+            "primary_service": "payment-api",
+            "failure_mode": "dependency_outage",
+            "dependency": "payment-gateway",
+            "customer_impact": "checkout_delays",
+            "confidence": 0.87,
+        }
+    if "disk full" in messages:
+        return {
+            "primary_service": "order-service",
+            "failure_mode": "storage_saturation",
+            "dependency": "mysql",
+            "customer_impact": "order_write_failures",
+            "confidence": 0.82,
+        }
+    if "ssl certificate expired" in messages or "email-relay" in messages:
+        return {
+            "primary_service": "notification-service",
+            "failure_mode": "certificate_expiry",
+            "dependency": "email-relay",
+            "customer_impact": "notification_delivery_failure",
+            "confidence": 0.81,
+        }
+    return {
+        "primary_service": observation["briefing"]["suspected_services"][0],
+        "failure_mode": "traffic_abuse",
+        "dependency": "none",
+        "customer_impact": "login_failures",
+        "confidence": 0.55,
+    }
+def _containment_for_hypothesis(hypothesis: Dict[str, Any]) -> List[str]:
+    if hypothesis["primary_service"] == "auth-service" and hypothesis["customer_impact"] == "cross_service_major_incident":
+        return [
+            "increase_auth_heap",
+            "enable_login_rate_limiting",
+            "restore_payment_gateway_connectivity",
+            "free_order_log_disk",
+            "renew_smtp_certificate",
+            "page_major_incident_team",
+        ]
+    if hypothesis["primary_service"] == "payment-api":
+        return ["restore_payment_gateway_connectivity", "reduce_checkout_retry_pressure"]
+    if hypothesis["primary_service"] == "order-service":
+        return ["free_order_log_disk", "reset_mysql_connection_pool"]
+    if hypothesis["primary_service"] == "notification-service":
+        return ["renew_smtp_certificate", "reroute_notification_traffic"]
+    return ["increase_auth_heap", "enable_login_rate_limiting"]
+def _build_report(observation: Dict[str, Any], hypothesis: Dict[str, Any]) -> Dict[str, Any]:
+    logs = sorted(observation.get("visible_logs", []), key=_severity_score, reverse=True)
+    evidence_ids = [int(log["log_id"]) for log in logs[:10]]
+    impacted_services = sorted({log["service_name"] for log in logs if _severity_score(log) >= 3.0})
+    if not impacted_services:
+        impacted_services = [hypothesis["primary_service"]]
+    return {
+        "evidence_log_ids": evidence_ids,
+        "impacted_services": impacted_services,
+        "root_cause": hypothesis,
+        "containment_plan": _containment_for_hypothesis(hypothesis),
+        "summary": (
+            f"The most likely incident source is {hypothesis['primary_service']} with failure mode "
+            f"{hypothesis['failure_mode']}, creating customer impact {hypothesis['customer_impact']}."
+        ),
+    }
+def run_task(task_id: str) -> float:
+    rewards: List[float] = []
+    steps_taken = 0
+    final_score = 0.0
+    success = False
+    observation: Dict[str, Any] | None = None
+    log_start(task_id, BENCHMARK, MODEL_NAME)
+    try:
+        observation = api_reset(task_id)
+        session_id = observation["session_id"]
+        maybe_ping_model(task_id)
+        query_payload = {
+            "session_id": session_id,
+            "action_type": "query_logs",
+            "query": {
+                "levels": ["CRITICAL", "ERROR"],
+                "start_time": observation["briefing"]["incident_window_start"],
+                "end_time": observation["briefing"]["incident_window_end"],
+                "limit": 6,
+            },
+        }
+        result = api_step(query_payload)
+        observation = result["observation"]
+        rewards.append(float(result["reward"]["value"]))
+        steps_taken = 1
+        log_step(1, "query_logs", rewards[-1], bool(result["done"]), None)
+        target_service = max(
+            observation["briefing"]["suspected_services"],
+            key=lambda service: sum(1 for log in observation["visible_logs"] if log["service_name"] == service),
+        )
+        dep_payload = {
+            "session_id": session_id,
+            "action_type": "inspect_dependencies",
+            "target_service": target_service,
+        }
+        result = api_step(dep_payload)
+        observation = result["observation"]
+        rewards.append(float(result["reward"]["value"]))
+        steps_taken = 2
+        log_step(2, f"inspect_dependencies({target_service})", rewards[-1], bool(result["done"]), None)
+        hypothesis = _infer_hypothesis(observation)
+        hyp_payload = {
+            "session_id": session_id,
+            "action_type": "update_hypothesis",
+            "hypothesis": hypothesis,
+        }
+        result = api_step(hyp_payload)
+        observation = result["observation"]
+        rewards.append(float(result["reward"]["value"]))
+        steps_taken = 3
+        log_step(3, "update_hypothesis", rewards[-1], bool(result["done"]), None)
+        containment_payload = {
+            "session_id": session_id,
+            "action_type": "execute_containment",
+            "containment_plan": _containment_for_hypothesis(hypothesis),
+        }
+        result = api_step(containment_payload)
+        observation = result["observation"]
+        rewards.append(float(result["reward"]["value"]))
+        steps_taken = 4
+        log_step(4, "execute_containment", rewards[-1], bool(result["done"]), None)
+        report_payload = {
+            "session_id": session_id,
+            "action_type": "submit_report",
+            "report": _build_report(observation, hypothesis),
+        }
+        result = api_step(report_payload)
+        final_score = float(result["reward"]["value"])
+        rewards.append(final_score)
+        steps_taken = 5
+        log_step(5, "submit_report", final_score, bool(result["done"]), None)
+        success = final_score >= SUCCESS_THRESHOLD
+    except Exception as exc:
+        log_step(steps_taken + 1 if steps_taken else 1, "error", 0.0, True, str(exc).replace("\n", " "))
+        final_score = 0.0
+        success = False
+    finally:
+        log_end(success, steps_taken if steps_taken else 1, final_score, rewards or [0.0])
+    return final_score
+if __name__ == "__main__":
+    for task_name in ("easy", "medium", "hard"):
+        run_task(task_name)
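
Reviewer note on the log format: `log_step` writes one `[STEP]` line per action with fixed `key=value` fields, and `preflight.sh` only checks that such lines exist. A consumer that wants the values back could parse them roughly as below. This is a sketch assuming the exact field order produced by `log_step`; `parse_step_line` is a hypothetical helper and the sample line uses a made-up reward.

```python
import re
from typing import Dict, Optional

# Field order follows the f-string in log_step; error is last because it may contain spaces.
STEP_RE = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>\S+) "
    r"reward=(?P<reward>-?\d+\.\d+) done=(?P<done>true|false) error=(?P<error>.*)$"
)


def parse_step_line(line: str) -> Optional[Dict[str, object]]:
    """Parse one [STEP] line into typed fields, or return None if it does not match."""
    match = STEP_RE.match(line.strip())
    if match is None:
        return None
    error = match.group("error")
    return {
        "step": int(match.group("step")),
        "action": match.group("action"),
        "reward": float(match.group("reward")),
        "done": match.group("done") == "true",
        "error": None if error == "null" else error,
    }


sample = "[STEP] step=2 action=inspect_dependencies(auth-service) reward=0.35 done=false error=null"
parsed = parse_step_line(sample)
print(parsed)
```

The pattern relies on action names containing no whitespace, which holds for every action string that `run_task` passes to `log_step`.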

novatech_logs.db ADDED

Binary file (94.2 kB).

openenv.yaml ADDED

	@@ -0,0 +1,66 @@

+name: NovaTechIncidentCommand
+description: >
+  Seeded OpenEnv incident-response benchmark built from a realistic NovaTech log corpus.
+  Agents operate under partial observability: they must query logs, inspect dependencies,
+  update a structured causal hypothesis, choose safe containment, and submit a final report.
+tasks:
+  - id: easy
+    description: Detect a clear login outage caused by auth-service heap exhaustion.
+  - id: medium
+    description: Resolve competing hypotheses during a payment confirmation outage.
+  - id: hard
+    description: Reconstruct a cascading multi-service incident under partial observability.
+action_space:
+  type: structured
+  fields:
+    session_id: string
+    action_type: "query_logs | inspect_dependencies | update_hypothesis | execute_containment | submit_report | request_more | no_anomalies"
+    query: "optional structured filter with service_name, server_id, levels, start_time, end_time, text_contains, limit"
+    target_service: "optional service name"
+    hypothesis: "optional structured tuple: primary_service, failure_mode, dependency, customer_impact, confidence"
+    containment_plan: "optional list of containment action names"
+    report: "optional structured report with evidence_log_ids, impacted_services, root_cause, containment_plan, summary"
+observation_space:
+  type: structured
+  fields:
+    session_id: string
+    task_id: string
+    task_title: string
+    briefing: "structured incident briefing with incident window, objective, suspected_services, customer_statement, operational_constraints"
+    dependency_graph: "service dependency map"
+    visible_logs: "list of currently revealed log entries only"
+    revealed_log_count: integer
+    visited_services: "list of services explored so far"
+    submitted_containment: "list of chosen containment actions"
+    last_hypothesis: "optional structured root-cause hypothesis"
+    step_number: integer
+    max_steps: integer
+    feedback: "optional string"
+    done: boolean
+  notes:
+    - "Observations expose only agent-revealed logs."
+    - "The dependency graph is visible, but hidden logs and gold evidence remain private."
+    - "The latest structured hypothesis is included so agents can reason iteratively."
+reward_definition:
+  type: scalar
+  range: [0.0, 1.0]
+  components:
+    signal_reward: "Rewards newly discovered relevant signals and evidence quality."
+    hypothesis_reward: "Rewards improvement toward the gold causal tuple and safe containment alignment."
+    efficiency_reward: "Rewards solving within the action budget."
+    penalty: "Penalizes unseen evidence, contradictions, forbidden containment, loops, and empty queries."
+  techniques:
+    - "Information-gain shaping: focused discovery beats broad noisy retrieval."
+    - "Best-hypothesis tracking: reward is tied to causal improvement across the episode."
+    - "Observation-consistent grading: unseen evidence references are rejected."
+    - "Contradiction penalties: evidence, cause, impact, and timeline must agree."
+    - "Safety shaping: destructive containment is penalized even if diagnosis is partially correct."
+interfaces:
+  reset: "reset() -> initial observation"
+  step: "step(action) -> observation, reward, done, info"
+  state: "state() -> non-leaking public session state"
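
Reviewer note on `reward_definition`: the YAML names the four components but not how they combine. The weights used here are taken from `build_dense_reward` in `tasks/graders.py` in this same commit (0.55, 0.25, 0.20, and a 0.30 multiplier on the penalty), with the result rounded to four places and clamped to [0, 1]. The function name `dense_reward_value` and the third set of inputs are illustrative only.

```python
def dense_reward_value(signal: float, hypothesis: float, efficiency: float, penalty: float) -> float:
    """Weighted sum of the reward components, rounded to 4 places and clamped to [0, 1]."""
    raw = 0.55 * signal + 0.25 * hypothesis + 0.20 * efficiency - 0.30 * penalty
    return max(0.0, min(1.0, round(raw, 4)))


perfect = dense_reward_value(1.0, 1.0, 1.0, 0.0)   # all components maxed, no penalty
rejected = dense_reward_value(0.0, 0.0, 0.0, 1.0)  # the unseen-evidence rejection path
partial = dense_reward_value(0.8, 0.6, 0.5, 0.2)   # 0.44 + 0.15 + 0.10 - 0.06
print(perfect, rejected, partial)
```

Since the positive weights sum to 1.0, any nonzero penalty keeps the value strictly below 1.0, and a full penalty of 1.0 removes at most 0.30 before clamping.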

preflight.sh ADDED

	@@ -0,0 +1,49 @@

+#!/usr/bin/env bash
+set -euo pipefail
+ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+API_URL="${API_URL:-http://127.0.0.1:7860}"
+cleanup() {
+  if [[ -n "${UVICORN_PID:-}" ]]; then
+    kill "${UVICORN_PID}" >/dev/null 2>&1 || true
+  fi
+}
+trap cleanup EXIT
+cd "${ROOT_DIR}"
+python3 -m py_compile app.py inference.py env/models.py env/environment.py tasks/catalog.py tasks/graders.py data/db_loader.py
+if command -v openenv >/dev/null 2>&1; then
+  openenv validate
+fi
+python3 -m uvicorn app:app --host 127.0.0.1 --port 7860 >/tmp/logenv2_uvicorn.log 2>&1 &
+UVICORN_PID=$!
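+# Poll for up to ~10 s until the server answers, instead of relying on a fixed sleep.
+for _ in $(seq 1 20); do
+  if curl -sf "${API_URL}/health" >/dev/null 2>&1; then break; fi
+  sleep 0.5
+done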
+curl -sf "${API_URL}/health" >/tmp/logenv2_health.json
+curl -sf -X POST -H "Content-Type: application/json" -d '{"task_id":"easy","seed":42}' "${API_URL}/reset" >/tmp/logenv2_reset.json
+SESSION_ID="$(python3 - <<'PY'
+import json
+from pathlib import Path
+print(json.loads(Path('/tmp/logenv2_reset.json').read_text())['session_id'])
+PY
+)"
+curl -sf -X POST -H "Content-Type: application/json" \
+  -d "{\"session_id\":\"${SESSION_ID}\",\"action_type\":\"query_logs\",\"query\":{\"levels\":[\"CRITICAL\",\"ERROR\"],\"limit\":4}}" \
+  "${API_URL}/step" >/tmp/logenv2_step.json
+LOGENV_URL="${API_URL}" python3 inference.py >/tmp/logenv2_inference.log
+python3 - <<'PY'
+from pathlib import Path
+lines = [line.strip() for line in Path("/tmp/logenv2_inference.log").read_text().splitlines() if line.strip()]
+assert any(line.startswith("[START] ") for line in lines)
+assert any(line.startswith("[STEP] ") for line in lines)
+assert any(line.startswith("[END] ") for line in lines)
+print("preflight ok")
+PY

requirements.txt CHANGED

@@ -1,3 +1,6 @@
-altair
-pandas
-streamlit

+fastapi==0.115.0
+uvicorn[standard]==0.30.6
+pydantic==2.7.4
+requests==2.32.3
+pyyaml==6.0.2
+openai==2.9.0

tasks/__init__.py ADDED

File without changes

tasks/__pycache__/catalog.cpython-314.pyc ADDED

Binary file (5.58 kB).

tasks/__pycache__/graders.cpython-314.pyc ADDED

Binary file (12.6 kB).

tasks/catalog.py ADDED

	@@ -0,0 +1,133 @@

+"""
+Task catalog for the hardened NovaTech incident environment.
+"""
+from __future__ import annotations
+from typing import Dict, List
+DEPENDENCY_GRAPH: Dict[str, List[str]] = {
+    "auth-service": ["user-service", "payment-api", "ldap-directory"],
+    "payment-api": ["auth-service", "payment-gateway", "mysql"],
+    "order-service": ["payment-api", "mysql", "notification-service"],
+    "notification-service": ["order-service", "email-relay"],
+    "reporting-service": ["mysql"],
+    "user-service": ["auth-service", "ldap-directory"],
+}
+CONTAINMENT_DESCRIPTIONS: Dict[str, str] = {
+    "increase_auth_heap": "Increase heap headroom for auth-service.",
+    "enable_login_rate_limiting": "Slow abusive login traffic without fully blocking healthy users.",
+    "restore_payment_gateway_connectivity": "Repair egress, routes, or credentials to the payment gateway.",
+    "reduce_checkout_retry_pressure": "Reduce retry storms and non-critical checkout retries.",
+    "free_order_log_disk": "Recover /var/log capacity on order-service hosts.",
+    "reset_mysql_connection_pool": "Safely recycle exhausted database connection pools.",
+    "renew_smtp_certificate": "Renew SMTP or relay TLS certificates before reconnecting.",
+    "reroute_notification_traffic": "Use a safe backup path for notifications.",
+    "page_major_incident_team": "Escalate to major-incident coordination.",
+    "block_all_login_traffic": "Broadly block all login traffic.",
+    "wipe_application_logs": "Delete logs to free resources.",
+    "restart_everything": "Restart all major services immediately.",
+}
+TASK_SPECS: Dict[str, Dict[str, object]] = {
+    "easy": {
+        "incident_id": "INC-2101",
+        "title": "Auth Heap Exhaustion",
+        "max_steps": 6,
+        "objective": "Detect the direct cause of a login outage and choose a safe first containment move.",
+        "incident_window_start": "2025-06-15 02:00:00",
+        "incident_window_end": "2025-06-15 02:25:59",
+        "suspected_services": ["auth-service", "user-service", "payment-api"],
+        "customer_statement": "Support agents report that enterprise admins cannot log in reliably.",
+        "operational_constraints": [
+            "Do not erase evidence.",
+            "Do not fully block all login traffic.",
+            "You have six actions before leadership expects a recommendation.",
+        ],
+        "scope_servers": ["server_01"],
+        "scope_services": ["auth-service", "user-service", "payment-api", "notification-service"],
+        "must_include_ids": [72, 74, 75, 76, 77],
+        "noise_sample_size": 8,
+        "gold_evidence_ids": [74, 76, 77],
+        "plausible_but_wrong_ids": [72, 75],
+        "root_cause": {
+            "primary_service": "auth-service",
+            "failure_mode": "resource_exhaustion",
+            "dependency": "none",
+            "customer_impact": "login_failures",
+        },
+        "impacted_services": ["auth-service"],
+        "recommended_containment": ["increase_auth_heap", "enable_login_rate_limiting"],
+        "forbidden_containment": ["block_all_login_traffic", "wipe_application_logs", "restart_everything"],
+    },
+    "medium": {
+        "incident_id": "INC-2144",
+        "title": "Checkout Competing Hypotheses",
+        "max_steps": 7,
+        "objective": "Distinguish a genuine payment dependency outage from plausible but unrelated upstream noise.",
+        "incident_window_start": "2025-06-15 06:20:00",
+        "incident_window_end": "2025-06-15 06:45:59",
+        "suspected_services": ["payment-api", "auth-service", "user-service"],
+        "customer_statement": "Customers complete checkout, but confirmations remain pending for tens of seconds.",
+        "operational_constraints": [
+            "Keep checkout partially available if possible.",
+            "Avoid blind restarts.",
+            "You must justify why the leading hypothesis beats the alternative.",
+        ],
+        "scope_servers": ["server_01", "server_02"],
+        "scope_services": ["payment-api", "auth-service", "user-service", "order-service"],
+        "must_include_ids": [74, 76, 77, 193, 194, 195, 607],
+        "noise_sample_size": 12,
+        "gold_evidence_ids": [193, 194, 195],
+        "plausible_but_wrong_ids": [74, 76, 77, 607],
+        "root_cause": {
+            "primary_service": "payment-api",
+            "failure_mode": "dependency_outage",
+            "dependency": "payment-gateway",
+            "customer_impact": "checkout_delays",
+        },
+        "impacted_services": ["payment-api"],
+        "recommended_containment": ["restore_payment_gateway_connectivity", "reduce_checkout_retry_pressure"],
+        "forbidden_containment": ["restart_everything", "wipe_application_logs"],
+    },
+    "hard": {
+        "incident_id": "INC-2199",
+        "title": "Cascading Multi-Service Incident",
+        "max_steps": 9,
+        "objective": "Reconstruct a cascading outage under partial observability, identify the initiating service, and choose layered containment.",
+        "incident_window_start": "2025-06-15 02:00:00",
+        "incident_window_end": "2025-06-15 18:45:00",
+        "suspected_services": ["auth-service", "payment-api", "order-service", "notification-service", "reporting-service"],
+        "customer_statement": "Users report failed logins, stuck payments, delayed orders, and missing outbound notifications.",
+        "operational_constraints": [
+            "Preserve evidence across services.",
+            "Prioritize customer safety and data integrity over broad resets.",
+            "A major-incident bridge is already open.",
+        ],
+        "scope_servers": ["server_01", "server_02", "server_03", "server_04"],
+        "scope_services": ["auth-service", "payment-api", "order-service", "notification-service", "reporting-service", "user-service"],
+        "must_include_ids": [72, 74, 76, 77, 193, 194, 195, 266, 267, 334, 426, 429, 481, 564, 607],
+        "noise_sample_size": 24,
+        "gold_evidence_ids": [74, 76, 77, 193, 194, 266, 267, 426, 429, 564],
+        "plausible_but_wrong_ids": [195, 334, 481, 607],
+        "root_cause": {
+            "primary_service": "auth-service",
+            "failure_mode": "resource_exhaustion",
+            "dependency": "payment-api",
+            "customer_impact": "cross_service_major_incident",
+        },
+        "impacted_services": ["auth-service", "payment-api", "order-service", "notification-service"],
+        "recommended_containment": [
+            "increase_auth_heap",
+            "enable_login_rate_limiting",
+            "restore_payment_gateway_connectivity",
+            "free_order_log_disk",
+            "renew_smtp_certificate",
+            "page_major_incident_team",
+        ],
+        "forbidden_containment": ["wipe_application_logs", "block_all_login_traffic", "restart_everything"],
+    },
+}
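
Reviewer note on `gold_evidence_ids` and `plausible_but_wrong_ids`: the graders in this commit compare a report's cited log ids against the gold set with a set-based F1, so citing a decoy lowers precision and omitting a gold id lowers recall. The worked example below uses the `easy` task's ids from the catalog above; `set_f1` is a compact restatement for illustration, not the grader itself.

```python
from typing import Iterable, Tuple

GOLD_EASY = [74, 76, 77]  # gold_evidence_ids for the "easy" task; 72 and 75 are decoys


def set_f1(predicted: Iterable[int], gold: Iterable[int]) -> Tuple[float, float, float]:
    """Return (f1, precision, recall) between two id sets, each rounded to 4 places."""
    pred, truth = set(predicted), set(gold)
    tp = len(pred & truth)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return round(f1, 4), round(precision, 4), round(recall, 4)


print(set_f1([74, 76, 77], GOLD_EASY))  # (1.0, 1.0, 1.0)
print(set_f1([74, 76, 72], GOLD_EASY))  # (0.6667, 0.6667, 0.6667)
print(set_f1([72, 75], GOLD_EASY))      # (0.0, 0.0, 0.0)
```

The middle case swaps one gold id for a decoy and still clears the 0.5 precision and recall thresholds that the grader uses to flag weak evidence, while citing only decoys scores zero.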

tasks/graders.py ADDED

	@@ -0,0 +1,177 @@

+"""
+Structured deterministic graders for NovaTech incidents.
+"""
+from __future__ import annotations
+from typing import Dict, Iterable, List, Sequence, Set, Tuple
+from env.models import IncidentReport, Reward, RootCauseHypothesis
+from tasks.catalog import TASK_SPECS
+def _set_f1(predicted: Iterable[int], gold: Iterable[int]) -> Tuple[float, float, float]:
+    pred = set(int(x) for x in predicted)
+    truth = set(int(x) for x in gold)
+    tp = len(pred & truth)
+    fp = len(pred - truth)
+    fn = len(truth - pred)
+    precision = tp / (tp + fp) if (tp + fp) else 0.0
+    recall = tp / (tp + fn) if (tp + fn) else 0.0
+    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
+    return round(f1, 4), round(precision, 4), round(recall, 4)
+def hypothesis_match_score(hypothesis: RootCauseHypothesis | None, task_id: str) -> float:
+    if hypothesis is None:
+        return 0.0
+    gold = TASK_SPECS[task_id]["root_cause"]
+    return round(
+        0.40 * float(hypothesis.primary_service == gold["primary_service"])
+        + 0.30 * float(hypothesis.failure_mode == gold["failure_mode"])
+        + 0.15 * float(hypothesis.dependency == gold["dependency"])
+        + 0.15 * float(hypothesis.customer_impact == gold["customer_impact"]),
+        4,
+    )
+def containment_alignment(actions: Sequence[str], task_id: str) -> Tuple[float, float]:
+    spec = TASK_SPECS[task_id]
+    recommended = set(spec["recommended_containment"])
+    forbidden = set(spec["forbidden_containment"])
+    chosen = set(actions)
+    positive = len(chosen & recommended) / len(recommended) if recommended else 0.0
+    negative = len(chosen & forbidden) / len(forbidden) if forbidden else 0.0
+    return round(positive, 4), round(negative, 4)
+def impacted_service_score(predicted: Sequence[str], task_id: str) -> float:
+    gold = set(TASK_SPECS[task_id]["impacted_services"])
+    pred = set(predicted)
+    if not gold:
+        return 0.0
+    return round(len(pred & gold) / len(gold), 4)
+def _evidence_consistency(report: IncidentReport, revealed_log_ids: Set[int], task_id: str) -> Tuple[float, float, float, List[str]]:
+    issues: List[str] = []
+    evidence = list(report.evidence_log_ids)
+    if any(log_id not in revealed_log_ids for log_id in evidence):
+        unseen = sorted(log_id for log_id in evidence if log_id not in revealed_log_ids)
+        issues.append(f"Unseen evidence referenced: {unseen}")
+        return 0.0, 0.0, 0.0, issues
+    spec = TASK_SPECS[task_id]
+    gold_f1, precision, recall = _set_f1(evidence, spec["gold_evidence_ids"])
+    if recall < 0.5:
+        issues.append("Evidence misses too many key signals.")
+    if precision < 0.5:
+        issues.append("Evidence includes too many irrelevant signals.")
+    return gold_f1, precision, recall, issues
+def _causal_consistency(report: IncidentReport, task_id: str, revealed_log_map: Dict[int, Dict[str, object]]) -> Tuple[float, List[str]]:
+    issues: List[str] = []
+    cause_score = hypothesis_match_score(report.root_cause, task_id)
+    evidence_logs = [revealed_log_map[log_id] for log_id in report.evidence_log_ids if log_id in revealed_log_map]
+    if not evidence_logs:
+        return 0.0, ["No visible evidence supplied."]
+    service_present = any(log["service_name"] == report.root_cause.primary_service for log in evidence_logs)
+    if not service_present:
+        issues.append("Root cause service is not supported by selected evidence.")
+        cause_score *= 0.4
+    earliest = min(evidence_logs, key=lambda item: item["timestamp"])
+    if task_id == "hard" and earliest["service_name"] != report.root_cause.primary_service:
+        issues.append("Selected timeline does not start with the claimed initiating service.")
+        cause_score *= 0.7
+    if report.root_cause.customer_impact == "checkout_delays":
+        payment_evidence = any(log["service_name"] == "payment-api" for log in evidence_logs)
+        if not payment_evidence:
+            issues.append("Checkout impact claimed without payment-api evidence.")
+            cause_score *= 0.5
+    if report.root_cause.customer_impact == "cross_service_major_incident":
+        covered = {log["service_name"] for log in evidence_logs}
+        expected = {"auth-service", "payment-api", "order-service", "notification-service"}
+        if len(covered & expected) < 3:
+            issues.append("Cross-service incident claimed without cross-service evidence.")
+            cause_score *= 0.5
+    return round(max(0.0, cause_score), 4), issues
+def build_dense_reward(
+    *,
+    signal_reward: float,
+    hypothesis_reward: float,
+    efficiency_reward: float,
+    penalty: float,
+    info: Dict[str, object],
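+) -> Reward:
+    """Combine the component scores into one Reward whose value is clamped to [0, 1]."""
+    raw = (0.55 * signal_reward) + (0.25 * hypothesis_reward) + (0.20 * efficiency_reward) - (0.30 * penalty)
+    # Round to four decimals first, then clamp the weighted sum to [0, 1].
+    value = max(0.0, min(1.0, round(raw, 4)))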
+    return Reward(
+        value=value,
+        signal_reward=round(signal_reward, 4),
+        hypothesis_reward=round(hypothesis_reward, 4),
+        efficiency_reward=round(efficiency_reward, 4),
+        penalty=round(penalty, 4),
+        info=info,
+    )
+def grade_report(
+    *,
+    task_id: str,
+    report: IncidentReport,
+    revealed_log_ids: Set[int],
+    revealed_log_map: Dict[int, Dict[str, object]],
+    step_number: int,
+    max_steps: int,
+    repeated_action_count: int,
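+) -> Reward:
+    """Grade a submitted incident report; reports citing unseen evidence are rejected with a zero reward."""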
+    evidence_score, precision, recall, evidence_issues = _evidence_consistency(report, revealed_log_ids, task_id)
+    if evidence_score == 0.0 and any(issue.startswith("Unseen evidence") for issue in evidence_issues):
+        return build_dense_reward(
+            signal_reward=0.0,
+            hypothesis_reward=0.0,
+            efficiency_reward=0.0,
+            penalty=1.0,
+            info={"issues": evidence_issues, "message": "Report rejected due to unseen evidence."},
+        )
+    cause_score, cause_issues = _causal_consistency(report, task_id, revealed_log_map)
+    impact_score = impacted_service_score(report.impacted_services, task_id)
+    positive_containment, forbidden_containment = containment_alignment(report.containment_plan, task_id)
+    # Accumulate penalties for contradictory or weak reports; the total is capped at 1.0 below.
+    contradiction_penalty = 0.0
+    if not set(report.impacted_services) & set(TASK_SPECS[task_id]["impacted_services"]):
+        contradiction_penalty += 0.4
+    if recall < 0.5:
+        contradiction_penalty += 0.45
+    if precision < 0.5:
+        contradiction_penalty += 0.30
+    if forbidden_containment > 0:
+        contradiction_penalty += min(0.7, forbidden_containment)
+    if repeated_action_count > 0:
+        contradiction_penalty += min(0.2, repeated_action_count / max(1.0, float(step_number)))
+    signal_reward = round((0.75 * evidence_score) + (0.25 * impact_score), 4)
+    hypothesis_reward = round((0.80 * cause_score) + (0.20 * positive_containment), 4)
+    efficiency_reward = max(0.0, round(1.0 - ((step_number - 1) / max(1, max_steps - 1)), 4))
+    penalty = round(min(1.0, contradiction_penalty), 4)
+    return build_dense_reward(
+        signal_reward=signal_reward,
+        hypothesis_reward=hypothesis_reward,
+        efficiency_reward=efficiency_reward,
+        penalty=penalty,
+        info={
+            "evidence_score": evidence_score,
+            "precision": precision,
+            "recall": recall,
+            "cause_score": cause_score,
+            "impact_score": impact_score,
+            "positive_containment": positive_containment,
+            "forbidden_containment": forbidden_containment,
+            "issues": evidence_issues + cause_issues,
+        },
+    )
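
The reward arithmetic in `build_dense_reward` and the efficiency term in `grade_report` can be checked in isolation. The block below is a minimal sketch, not the module's own code: it reuses the weights and the clamp from the diff above but returns plain floats instead of the `Reward` model, and the helper names `dense_reward_value` and `efficiency_reward` are hypothetical.

```python
# Sketch of the reward arithmetic from tasks/graders.py using plain floats.
# Both function names are illustrative assumptions; the real code wraps the
# result in the Reward model from env/models.py.

def dense_reward_value(signal_reward: float, hypothesis_reward: float,
                       efficiency_reward: float, penalty: float) -> float:
    """Weighted sum of the components, rounded to 4 decimals and clamped to [0, 1]."""
    raw = (
        (0.55 * signal_reward)
        + (0.25 * hypothesis_reward)
        + (0.20 * efficiency_reward)
        - (0.30 * penalty)
    )
    return max(0.0, min(1.0, round(raw, 4)))


def efficiency_reward(step_number: int, max_steps: int) -> float:
    """Linear decay from 1.0 at the first step to 0.0 at the last allowed step."""
    return max(0.0, round(1.0 - ((step_number - 1) / max(1, max_steps - 1)), 4))


if __name__ == "__main__":
    # A report submitted halfway through a 9-step budget.
    eff = efficiency_reward(step_number=5, max_steps=9)
    print(eff)
    print(dense_reward_value(0.8, 0.6, eff, 0.3))
    # The unseen-evidence rejection path: all components zero, penalty 1.0.
    print(dense_reward_value(0.0, 0.0, 0.0, 1.0))
```

Because the three positive weights sum to 1.0 and the penalty weight is 0.30, a full penalty removes at most 0.30 from the reward. On the rejection path the weighted sum is negative, so the clamp is what brings the final value to 0.0.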