omm7 committed on
Commit
bc2ead7
·
verified ·
1 Parent(s): 4cb7e01

Upload folder using huggingface_hub

.dockerignore ADDED
@@ -0,0 +1,4 @@
1
+ __pycache__/
2
+ *.pyc
3
+ .DS_Store
4
+ tmp/
Dockerfile CHANGED
@@ -1,20 +1,16 @@
1
- FROM python:3.13.5-slim
2
 
3
- WORKDIR /app
4
-
5
- RUN apt-get update && apt-get install -y \
6
- build-essential \
7
- curl \
8
- git \
9
- && rm -rf /var/lib/apt/lists/*
10
 
11
- COPY requirements.txt ./
12
- COPY src/ ./src/
13
 
14
- RUN pip3 install -r requirements.txt
 
15
 
16
- EXPOSE 8501
17
 
18
- HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health
19
 
20
- ENTRYPOINT ["streamlit", "run", "src/streamlit_app.py", "--server.port=8501", "--server.address=0.0.0.0"]
 
1
+ FROM python:3.11-slim
2
 
3
+ ENV PYTHONDONTWRITEBYTECODE=1 \
4
+ PYTHONUNBUFFERED=1 \
5
+ PIP_NO_CACHE_DIR=1
6
 
7
+ WORKDIR /app
 
8
 
9
+ COPY requirements.txt /app/requirements.txt
10
+ RUN pip install --upgrade pip && pip install -r /app/requirements.txt
11
 
12
+ COPY . /app
13
 
14
+ EXPOSE 7860
15
 
16
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
README.md CHANGED
@@ -1,20 +1,409 @@
1
  ---
2
- title: CausalOps Env
3
- emoji: 🚀
4
  colorFrom: red
5
- colorTo: red
6
  sdk: docker
7
- app_port: 8501
8
- tags:
9
- - streamlit
10
  pinned: false
11
- short_description: A real-world OpenEnv benchmark for causal reasoning in distr
12
- license: mit
13
  ---
14
 
15
- # Welcome to Streamlit!
16
 
17
- Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:
18
 
19
- If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
20
- forums](https://discuss.streamlit.io).
1
  ---
2
+ title: NovaTech Incident Command
3
+ emoji: 🚨
4
  colorFrom: red
5
+ colorTo: blue
6
  sdk: docker
7
+ app_file: app.py
 
 
8
  pinned: false
 
 
9
  ---
10
 
11
+ # NovaTech Incident Command
12
 
13
+ NovaTech Incident Command is a hardened OpenEnv environment for realistic incident response under partial observability. Agents do not receive the full system state. They must query logs, inspect service dependencies, update a structured causal hypothesis, choose safe containment, and submit a final incident report.
14
 
15
+ This version is explicitly designed to avoid common benchmark failures:
16
+
17
+ - no hidden answer leakage in public state
18
+ - no scripted reveal queue
19
+ - no keyword-based grader
20
+ - no hardcoded baseline answers
21
+ - session-safe API with per-episode isolation
22
+
23
+ ## What The Agent Must Do
24
+
25
+ Each episode simulates a production incident with a fixed action budget.
26
+
27
+ The agent must:
28
+
29
+ - retrieve relevant logs using structured filters
30
+ - follow dependencies rather than brute-force the whole system
31
+ - narrow toward a causal tuple
32
+ - avoid destructive containment
33
+ - submit a causally consistent final report
34
+
35
+ ## Core Mechanics
36
+
37
+ ### Partial observability
38
+
39
+ The agent only sees:
40
+
41
+ - the incident briefing
42
+ - the dependency graph
43
+ - the logs it has explicitly revealed
44
+
45
+ It never sees:
46
+
47
+ - hidden logs
48
+ - gold evidence IDs
49
+ - grader internals
50
+
51
+ ### Session-safe design
52
+
53
+ `POST /reset` returns a `session_id`.
54
+
55
+ All actions in `POST /step` should include that `session_id`, which isolates concurrent episodes and avoids the old shared-global-state exploit.
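The per-episode isolation described above can be pictured with a minimal sketch. It assumes an in-memory mapping keyed by a random session ID; `SessionStore` and its fields are hypothetical illustration names, not the `store` object that `app.py` imports.

```python
import uuid
from typing import Any, Dict


class SessionStore:
    """Hypothetical in-memory store: one independent state dict per episode."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Dict[str, Any]] = {}

    def create(self, task_id: str) -> str:
        # A fresh random ID per reset keeps concurrent episodes apart.
        session_id = uuid.uuid4().hex
        self._sessions[session_id] = {"task_id": task_id, "step_number": 0}
        return session_id

    def get(self, session_id: str) -> Dict[str, Any]:
        if session_id not in self._sessions:
            raise KeyError(f"unknown session_id: {session_id}")
        return self._sessions[session_id]

    def step(self, session_id: str) -> Dict[str, Any]:
        # Only the addressed episode is mutated; no shared global state.
        state = self.get(session_id)
        state["step_number"] += 1
        return state
```

Because every mutation goes through a `session_id` lookup, stepping one episode cannot change another, which is the property the README relies on.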
56
+
57
+ ### Seeded stochasticity
58
+
59
+ Every reset can accept a seed:
60
+
61
+ ```json
62
+ {
63
+ "task_id": "medium",
64
+ "seed": 42
65
+ }
66
+ ```
67
+
68
+ Given the same seed:
69
+
70
+ - the task-specific log pool is reproducible
71
+ - distractor/noise sampling is reproducible
72
+ - retrieval order is reproducible
73
+
74
+ Different seeds slightly vary the non-essential observable context while preserving deterministic grading.
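A minimal sketch of how seeded sampling can yield this kind of reproducibility, assuming the environment draws distractors from a per-episode `random.Random(seed)` instance. The helper `sample_distractors` and the log pool are hypothetical and are not taken from the repository.

```python
import random
from typing import List, Optional


def sample_distractors(pool: List[str], k: int, seed: Optional[int]) -> List[str]:
    """Pick k distractor logs with an RNG private to this episode (hypothetical helper)."""
    rng = random.Random(seed)  # isolated RNG; the global random state is untouched
    return rng.sample(pool, k)


pool = [f"noise-log-{i}" for i in range(20)]
first = sample_distractors(pool, 5, seed=42)
second = sample_distractors(pool, 5, seed=42)
assert first == second  # same seed, same selection and same order
```

Using a private `random.Random` rather than `random.seed()` also means concurrent sessions with different seeds cannot disturb each other's draws.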
75
+
76
+ ## Observations
77
+
78
+ Each `reset()` and `step()` returns a structured observation, not a loose blob.
79
+
80
+ Observation fields:
81
+
82
+ - `session_id`: the active episode identifier
83
+ - `task_id`: task difficulty key
84
+ - `task_title`: human-readable incident label
85
+ - `briefing`: incident objective, incident window, suspected services, customer statement, and constraints
86
+ - `dependency_graph`: the service graph the agent can reason over
87
+ - `visible_logs`: only the logs the agent has explicitly revealed
88
+ - `revealed_log_count`: number of currently visible logs
89
+ - `visited_services`: services already explored through dependency inspection or queries
90
+ - `submitted_containment`: containment actions already chosen
91
+ - `last_hypothesis`: latest structured causal hypothesis
92
+ - `step_number`: current step
93
+ - `max_steps`: step budget
94
+ - `feedback`: environment guidance after the last action
95
+ - `done`: terminal flag
96
+
97
+ Why this observation design matters:
98
+
99
+ - it gives enough structure for deliberate planning
100
+ - it preserves partial observability
101
+ - it prevents answer leakage
102
+ - it supports both frontier agents and smaller baselines
103
+
104
+ Example observation shape:
105
+
106
+ ```json
107
+ {
108
+ "session_id": "8e7f...",
109
+ "task_id": "medium",
110
+ "task_title": "Checkout Competing Hypotheses",
111
+ "briefing": {
112
+ "incident_id": "INC-2144",
113
+ "title": "Checkout Competing Hypotheses",
114
+ "objective": "Distinguish a genuine payment dependency outage from plausible but unrelated upstream noise.",
115
+ "incident_window_start": "2025-06-15 06:20:00",
116
+ "incident_window_end": "2025-06-15 06:45:59",
117
+ "suspected_services": ["payment-api", "auth-service", "user-service"],
118
+ "customer_statement": "Customers complete checkout, but confirmations remain pending for tens of seconds.",
119
+ "operational_constraints": [
120
+ "Keep checkout partially available if possible.",
121
+ "Avoid blind restarts."
122
+ ]
123
+ },
124
+ "dependency_graph": {
125
+ "payment-api": ["auth-service", "payment-gateway", "mysql"]
126
+ },
127
+ "visible_logs": [],
128
+ "revealed_log_count": 0,
129
+ "visited_services": [],
130
+ "submitted_containment": [],
131
+ "last_hypothesis": null,
132
+ "step_number": 0,
133
+ "max_steps": 7,
134
+ "feedback": "Episode created. Query the incident window and inspect dependencies to build your case.",
135
+ "done": false
136
+ }
137
+ ```
138
+
139
+ ## Tasks
140
+
141
+ ### Easy: Auth Heap Exhaustion
142
+
143
+ Reasoning pattern:
144
+
145
+ - anomaly detection with clear signal
146
+
147
+ Goal:
148
+
149
+ - identify auth-service heap exhaustion as the true cause of a login incident
150
+ - avoid destructive overreaction
151
+
152
+ ### Medium: Checkout Competing Hypotheses
153
+
154
+ Reasoning pattern:
155
+
156
+ - disambiguate competing explanations
157
+
158
+ Goal:
159
+
160
+ - determine that the payment confirmation outage is a payment-gateway dependency failure, not just upstream auth noise
161
+
162
+ ### Hard: Cascading Multi-Service Incident
163
+
164
+ Reasoning pattern:
165
+
166
+ - partial observability
167
+ - timeline reconstruction
168
+ - tradeoff-aware containment
169
+
170
+ Goal:
171
+
172
+ - identify the initiating service in a multi-service cascade and propose layered containment
173
+
174
+ ## Structured Actions
175
+
176
+ ### Query logs
177
+
178
+ ```json
179
+ {
180
+ "session_id": "<session_id>",
181
+ "action_type": "query_logs",
182
+ "query": {
183
+ "service_name": "payment-api",
184
+ "levels": ["CRITICAL", "ERROR"],
185
+ "start_time": "2025-06-15 06:20:00",
186
+ "end_time": "2025-06-15 06:45:59",
187
+ "limit": 6
188
+ }
189
+ }
190
+ ```
191
+
192
+ ### Inspect dependencies
193
+
194
+ ```json
195
+ {
196
+ "session_id": "<session_id>",
197
+ "action_type": "inspect_dependencies",
198
+ "target_service": "payment-api"
199
+ }
200
+ ```
201
+
202
+ ### Update hypothesis
203
+
204
+ ```json
205
+ {
206
+ "session_id": "<session_id>",
207
+ "action_type": "update_hypothesis",
208
+ "hypothesis": {
209
+ "primary_service": "payment-api",
210
+ "failure_mode": "dependency_outage",
211
+ "dependency": "payment-gateway",
212
+ "customer_impact": "checkout_delays",
213
+ "confidence": 0.87
214
+ }
215
+ }
216
+ ```
217
+
218
+ ### Submit report
219
+
220
+ ```json
221
+ {
222
+ "session_id": "<session_id>",
223
+ "action_type": "submit_report",
224
+ "report": {
225
+ "evidence_log_ids": [193, 194, 195],
226
+ "impacted_services": ["payment-api"],
227
+ "root_cause": {
228
+ "primary_service": "payment-api",
229
+ "failure_mode": "dependency_outage",
230
+ "dependency": "payment-gateway",
231
+ "customer_impact": "checkout_delays",
232
+ "confidence": 0.87
233
+ },
234
+ "containment_plan": [
235
+ "restore_payment_gateway_connectivity",
236
+ "reduce_checkout_retry_pressure"
237
+ ],
238
+ "summary": "Checkout confirmations are delayed because payment-api lost connectivity to the payment gateway."
239
+ }
240
+ }
241
+ ```
242
+
243
+ ## Grading
244
+
245
+ The grader is fully deterministic and structured.
246
+
247
+ It scores:
248
+
249
+ - evidence quality via revealed-evidence F1
250
+ - root-cause tuple correctness
251
+ - impacted-service correctness
252
+ - containment alignment
253
+ - causal consistency across evidence, service, impact, and timeline
254
+
255
+ It penalizes:
256
+
257
+ - unseen evidence references
258
+ - contradictions
259
+ - forbidden containment
260
+ - repeated actions
261
+
262
+ There is no keyword-bag grader in this version.
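The "revealed-evidence F1" component could be computed along the following lines. This is a sketch under stated assumptions: citations of logs the agent never revealed earn no credit but still count against precision. The function name `evidence_f1` and the exact penalty behaviour are illustrative and do not come from `tasks/graders.py`.

```python
from typing import Iterable


def evidence_f1(cited: Iterable[int], gold: Iterable[int], revealed: Iterable[int]) -> float:
    """F1 between cited and gold evidence IDs, crediting only revealed logs (assumed rule)."""
    cited_set, gold_set, revealed_set = set(cited), set(gold), set(revealed)
    # A cited ID counts as a hit only if the agent actually revealed it.
    true_pos = len(cited_set & revealed_set & gold_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(cited_set)  # unseen or irrelevant citations dilute precision
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```

For example, citing `[193, 999]` when only log 193 was revealed and the gold set is `{193, 194}` gives precision 0.5 and recall 0.5 under this rule.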
263
+
264
+ ## Reward Function
265
+
266
+ Intermediate rewards are dense and shaped:
267
+
268
+ - `signal_reward`: new relevant evidence
269
+ - `hypothesis_reward`: improvement toward the gold causal tuple
270
+ - `efficiency_reward`: solving earlier is better
271
+ - `penalty`: invalid queries, loops, contradictions, forbidden actions
272
+
273
+ This makes the environment useful for RL or planning-based evaluation, not just one-shot scoring.
274
+
275
+ ## Clever Reward Techniques
276
+
277
+ This environment uses several reward-shaping ideas that are stronger than a typical binary grader.
278
+
279
+ ### 1. Progress reward based on information gain
280
+
281
+ The agent is rewarded for revealing genuinely relevant signals, not for touching arbitrary logs. A broad but low-value query does not pay nearly as well as a focused query that exposes core evidence.
282
+
283
+ ### 2. Hypothesis-improvement shaping
284
+
285
+ The environment tracks the best structured hypothesis score seen so far. The agent gets rewarded for improving its causal model over time, not for repeating the same guess. This is especially useful for RL or tree-search agents because it gives signal during reasoning, before final submission.
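One way to express "reward only improvement over the best score so far" is sketched below. It assumes hypothesis scores lie in [0, 1]; the class name `HypothesisShaper` is hypothetical rather than the environment's own.

```python
class HypothesisShaper:
    """Pays out only the gain over the best hypothesis score seen so far (illustrative)."""

    def __init__(self) -> None:
        self.best = 0.0

    def reward(self, score: float) -> float:
        # Repeating or worsening a guess earns nothing; only new progress is paid.
        gain = max(0.0, score - self.best)
        self.best = max(self.best, score)
        return gain
```

With this scheme the rewards over an episode sum to the best score reached, so an agent cannot farm reward by oscillating between a good and a bad hypothesis.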
286
+
287
+ ### 3. Observation-consistent terminal scoring
288
+
289
+ The final report is only valid if it cites revealed evidence. This blocks a very common exploit in benchmark environments where agents can hallucinate or hardcode hidden gold evidence.
290
+
291
+ ### 4. Contradiction penalties
292
+
293
+ The grader penalizes internal inconsistency across:
294
+
295
+ - selected evidence
296
+ - claimed root-cause service
297
+ - claimed customer impact
298
+ - timeline in the hard task
299
+ - containment choice
300
+
301
+ This means an agent cannot simply match one part of the answer key and ignore the rest.
302
+
303
+ ### 5. Safe-containment bias
304
+
305
+ The containment scorer separately tracks recommended and forbidden actions. This lets the environment reward operational maturity, not just diagnosis. Agents that “solve” incidents by wiping logs or restarting everything are penalized.
306
+
307
+ ### 6. Loop-aware shaping
308
+
309
+ Repeated identical actions incur additional penalty. That makes the environment better for learning efficient incident workflows instead of degenerate action loops.
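A loop penalty of this kind can be sketched by counting how often an identical action payload has already been seen. The linear schedule, the `unit` value and the name `LoopPenalty` are assumptions made for illustration only.

```python
import json
from collections import Counter
from typing import Any, Dict


class LoopPenalty:
    """Charges a growing penalty for each repeat of an identical action (illustrative)."""

    def __init__(self, unit: float = 0.05) -> None:
        self.unit = unit
        self.seen: Counter = Counter()

    def penalty(self, action: Dict[str, Any]) -> float:
        # Canonical JSON makes key order irrelevant when comparing actions.
        key = json.dumps(action, sort_keys=True)
        repeats = self.seen[key]
        self.seen[key] += 1
        return self.unit * repeats  # first occurrence is free
```

The first occurrence costs nothing, so legitimate exploration is unaffected; only exact repeats are charged.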
310
+
311
+ ### 7. Seeded stochastic distractors with deterministic grading
312
+
313
+ The environment introduces seeded noise into the observable log pool, which makes superficial memorization harder, while the grader remains deterministic for a given seed and task.
314
+
315
+ In short: the reward is not just dense. It is dense in a way that pushes agents toward better investigation behavior, better causal reasoning, and safer remediation decisions.
316
+
317
+ ## API
318
+
319
+ - `POST /reset`
320
+ - `POST /step`
321
+ - `GET /state`
322
+ - `GET /health`
323
+ - `GET /debug_state`
324
+
325
+ `/debug_state` is disabled by default and only works when `OPENENV_DEBUG_STATE=true`.
326
+
327
+ ## Baseline
328
+
329
+ `inference.py` is deterministic and observation-driven.
330
+
331
+ It:
332
+
333
+ - queries the incident window
334
+ - inspects the most suspicious service
335
+ - builds a structured hypothesis from revealed logs
336
+ - chooses containment from the inferred cause
337
+ - submits a final report
338
+
339
+ It does not use hardcoded gold `log_id` answers.
340
+
341
+ Required environment variables:
342
+
343
+ - `HF_TOKEN`
344
+ - `API_BASE_URL`
345
+ - `MODEL_NAME`
346
+
347
+ Optional:
348
+
349
+ - `LOGENV_URL`
350
+ - `DB_PATH`
351
+
352
+ Logging format is strict:
353
+
354
+ - `[START]`
355
+ - `[STEP]`
356
+ - `[END]`
357
+
358
+ ## Local Run
359
+
360
+ ```bash
361
+ pip install -r requirements.txt
362
+ uvicorn app:app --host 0.0.0.0 --port 7860
363
+ ```
364
+
365
+ Reset:
366
+
367
+ ```bash
368
+ curl -X POST http://localhost:7860/reset \
369
+ -H "Content-Type: application/json" \
370
+ -d '{"task_id":"easy","seed":42}'
371
+ ```
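The same reset call can be made from Python using only the standard library. This is a sketch that assumes the server is running locally on port 7860 as in the commands above; the helper names `build_reset_request` and `reset_episode` are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_reset_request(base_url: str, task_id: str = "easy",
                        seed: Optional[int] = None) -> urllib.request.Request:
    """Build the POST /reset request without sending it."""
    payload: Dict[str, Any] = {"task_id": task_id}
    if seed is not None:
        payload["seed"] = seed
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/reset",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reset_episode(base_url: str, task_id: str = "easy",
                  seed: Optional[int] = None) -> Dict[str, Any]:
    """Send the request and return the decoded JSON body (requires a running server)."""
    with urllib.request.urlopen(build_reset_request(base_url, task_id, seed)) as resp:
        return json.load(resp)


# Usage, with the server started as shown above:
# body = reset_episode("http://localhost:7860", task_id="easy", seed=42)
```

Splitting request construction from sending keeps the payload easy to inspect before anything goes over the network.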
372
+
373
+ ## Docker
374
+
375
+ ```bash
376
+ docker build -t novatech-incident-command .
377
+ docker run --rm -p 7860:7860 novatech-incident-command
378
+ ```
379
+
380
+ ## Hugging Face Spaces
381
+
382
+ This repository is intended for Docker Spaces.
383
+
384
+ Expected validator path:
385
+
386
+ - `POST /reset` returns `200 OK`
387
+ - `POST /step` accepts typed actions
388
+ - `GET /health` returns liveness
389
+
390
+ ## Repo Layout
391
+
392
+ ```text
393
+ logenv2/
394
+ ├── app.py
395
+ ├── openenv.yaml
396
+ ├── inference.py
397
+ ├── Dockerfile
398
+ ├── requirements.txt
399
+ ├── preflight.sh
400
+ ├── novatech_logs.db
401
+ ├── env/
402
+ │ ├── environment.py
403
+ │ └── models.py
404
+ ├── data/
405
+ │ └── db_loader.py
406
+ └── tasks/
407
+ ├── catalog.py
408
+ └── graders.py
409
+ ```
__pycache__/app.cpython-314.pyc ADDED
Binary file (33.6 kB).
 
__pycache__/inference.cpython-314.pyc ADDED
Binary file (14.5 kB).
 
agent/__pycache__/langgraph_agent.cpython-314.pyc ADDED
Binary file (20.3 kB).
 
app.py ADDED
@@ -0,0 +1,853 @@
1
+ """
2
+ FastAPI application for the hardened NovaTech OpenEnv environment.
3
+ """
4
+ from __future__ import annotations
5
+
6
+ import os
7
+ from typing import Any, Dict, Optional
8
+
9
+ from fastapi import FastAPI, HTTPException, Query, Request
10
+ from fastapi.middleware.cors import CORSMiddleware
11
+ from fastapi.responses import HTMLResponse, RedirectResponse
12
+ from pydantic import BaseModel
13
+
14
+ from env.environment import DEBUG_STATE_ENABLED, store
15
+ from env.models import Action, Observation, Reward
16
+
17
+ app = FastAPI(
18
+ title="NovaTech Incident Command",
19
+ description="Seeded, session-safe OpenEnv environment for incident response under partial observability.",
20
+ version="3.0.0",
21
+ )
22
+ app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"])
23
+
24
+
25
+ class ResetRequest(BaseModel):
26
+ task_id: str = "easy"
27
+ seed: Optional[int] = None
28
+
29
+
30
+ class StepResponse(BaseModel):
31
+ observation: Dict[str, Any]
32
+ reward: Dict[str, Any]
33
+ done: bool
34
+ info: Dict[str, Any]
35
+
36
+
37
+ def _root_payload() -> Dict[str, Any]:
38
+ return {
39
+ "name": "NovaTech Incident Command",
40
+ "version": "3.0.0",
41
+ "debug_state_enabled": DEBUG_STATE_ENABLED,
42
+ "endpoints": {
43
+ "POST /reset": "Create an episode and return the initial observation.",
44
+ "POST /step": "Apply an action using a session_id.",
45
+ "GET /state": "Return public, non-leaking session state.",
46
+ "GET /health": "Liveness probe.",
47
+ },
48
+ "action_schema": Action.model_json_schema(),
49
+ "observation_schema": Observation.model_json_schema(),
50
+ "reward_schema": Reward.model_json_schema(),
51
+ }
52
+
53
+
54
+ @app.get("/")
55
+ def root(request: Request):
56
+ if "text/html" in (request.headers.get("accept") or "").lower():
57
+ return RedirectResponse(url="/playground", status_code=307)
58
+ return _root_payload()
59
+
60
+
61
+ @app.get("/playground", response_class=HTMLResponse)
62
+ def playground() -> str:
63
+ return """
64
+ <!doctype html>
65
+ <html>
66
+ <head>
67
+ <meta charset="utf-8" />
68
+ <meta name="viewport" content="width=device-width, initial-scale=1" />
69
+ <title>NovaTech Incident Command</title>
70
+ <style>
71
+ :root {
72
+ --ink: #f4f1e8;
73
+ --muted: #b5c1d1;
74
+ --line: rgba(198, 218, 245, 0.14);
75
+ --panel: rgba(10, 18, 30, 0.78);
76
+ --panel-strong: rgba(8, 14, 25, 0.9);
77
+ --card-glow: 0 24px 80px rgba(5, 10, 18, 0.38);
78
+ --accent: #d54f36;
79
+ --accent-soft: #ff9e7b;
80
+ --teal: #3ca7a1;
81
+ --gold: #f1c56e;
82
+ --ok: #65d197;
83
+ --bad: #ff7d7d;
84
+ --mono: "IBM Plex Mono", "SFMono-Regular", monospace;
85
+ --sans: "Space Grotesk", "Avenir Next", sans-serif;
86
+ }
87
+ * { box-sizing: border-box; }
88
+ html { scroll-behavior: smooth; }
89
+ body {
90
+ margin: 0;
91
+ color: var(--ink);
92
+ font-family: var(--sans);
93
+ background:
94
+ radial-gradient(circle at 15% 20%, rgba(213, 79, 54, 0.24), transparent 24%),
95
+ radial-gradient(circle at 82% 10%, rgba(60, 167, 161, 0.24), transparent 22%),
96
+ radial-gradient(circle at 50% 100%, rgba(241, 197, 110, 0.12), transparent 34%),
97
+ linear-gradient(145deg, #09111d 0%, #0d1626 42%, #101a2d 100%);
98
+ min-height: 100vh;
99
+ }
100
+ .chrome {
101
+ position: fixed;
102
+ inset: 0;
103
+ pointer-events: none;
104
+ background-image:
105
+ linear-gradient(rgba(255,255,255,0.04) 1px, transparent 1px),
106
+ linear-gradient(90deg, rgba(255,255,255,0.04) 1px, transparent 1px);
107
+ background-size: 44px 44px;
108
+ mask-image: linear-gradient(to bottom, rgba(0,0,0,0.42), rgba(0,0,0,0.1));
109
+ opacity: 0.3;
110
+ }
111
+ .wrap { max-width: 1400px; margin: 0 auto; padding: 24px 18px 40px; position: relative; z-index: 1; }
112
+ .hero {
113
+ position: relative;
114
+ overflow: hidden;
115
+ background:
116
+ linear-gradient(135deg, rgba(18, 31, 51, 0.92), rgba(10, 18, 30, 0.84)),
117
+ linear-gradient(90deg, rgba(213, 79, 54, 0.16), rgba(60, 167, 161, 0.16));
118
+ border: 1px solid var(--line);
119
+ border-radius: 28px;
120
+ padding: 28px;
121
+ box-shadow: var(--card-glow);
122
+ margin-bottom: 18px;
123
+ }
124
+ .hero::after {
125
+ content: "";
126
+ position: absolute;
127
+ right: -80px;
128
+ top: -80px;
129
+ width: 260px;
130
+ height: 260px;
131
+ border-radius: 50%;
132
+ background: radial-gradient(circle, rgba(241, 197, 110, 0.28), transparent 70%);
133
+ filter: blur(8px);
134
+ }
135
+ .hero-top {
136
+ display: flex;
137
+ align-items: flex-start;
138
+ justify-content: space-between;
139
+ gap: 18px;
140
+ flex-wrap: wrap;
141
+ }
142
+ .eyebrow {
143
+ display: inline-flex;
144
+ align-items: center;
145
+ gap: 10px;
146
+ border: 1px solid rgba(241, 197, 110, 0.25);
147
+ border-radius: 999px;
148
+ padding: 7px 12px;
149
+ color: var(--gold);
150
+ font-size: 0.77rem;
151
+ letter-spacing: 0.12em;
152
+ text-transform: uppercase;
153
+ background: rgba(241, 197, 110, 0.08);
154
+ margin-bottom: 14px;
155
+ }
156
+ .hero h1 {
157
+ margin: 0;
158
+ font-size: clamp(2rem, 4vw, 3.35rem);
159
+ line-height: 0.96;
160
+ letter-spacing: -0.04em;
161
+ max-width: 720px;
162
+ }
163
+ .hero p {
164
+ margin: 16px 0 0;
165
+ max-width: 760px;
166
+ color: var(--muted);
167
+ font-size: 1.04rem;
168
+ line-height: 1.55;
169
+ }
170
+ .hero-statbar {
171
+ display: grid;
172
+ grid-template-columns: repeat(3, minmax(110px, 1fr));
173
+ gap: 10px;
174
+ min-width: 320px;
175
+ }
176
+ .hero-stat {
177
+ border: 1px solid var(--line);
178
+ border-radius: 18px;
179
+ padding: 14px;
180
+ background: rgba(255,255,255,0.04);
181
+ backdrop-filter: blur(12px);
182
+ }
183
+ .hero-stat .label {
184
+ color: var(--muted);
185
+ text-transform: uppercase;
186
+ font-size: 0.72rem;
187
+ letter-spacing: 0.09em;
188
+ }
189
+ .hero-stat .value {
190
+ margin-top: 8px;
191
+ font-size: 1.18rem;
192
+ font-weight: 700;
193
+ }
194
+ .dashboard {
195
+ display: grid;
196
+ grid-template-columns: 380px 1.05fr 0.8fr;
197
+ gap: 16px;
198
+ align-items: start;
199
+ }
200
+ .panel {
201
+ background: linear-gradient(180deg, rgba(14, 23, 37, 0.92), rgba(8, 14, 25, 0.92));
202
+ border: 1px solid var(--line);
203
+ border-radius: 24px;
204
+ box-shadow: var(--card-glow);
205
+ overflow: hidden;
206
+ }
207
+ .panel-head {
208
+ display: flex;
209
+ align-items: center;
210
+ justify-content: space-between;
211
+ gap: 12px;
212
+ padding: 18px 18px 0;
213
+ }
214
+ .panel-title {
215
+ margin: 0;
216
+ font-size: 1rem;
217
+ letter-spacing: 0.02em;
218
+ }
219
+ .panel-subtitle {
220
+ margin: 6px 18px 0;
221
+ color: var(--muted);
222
+ font-size: 0.9rem;
223
+ line-height: 1.45;
224
+ }
225
+ .panel-body { padding: 18px; }
226
+ .stack { display: grid; gap: 14px; }
227
+ .field label, .group-label {
228
+ display: block;
229
+ margin-bottom: 8px;
230
+ color: var(--muted);
231
+ text-transform: uppercase;
232
+ font-size: 0.72rem;
233
+ letter-spacing: 0.08em;
234
+ }
235
+ input, select, textarea, button {
236
+ width: 100%;
237
+ border-radius: 16px;
238
+ border: 1px solid rgba(196, 217, 245, 0.12);
239
+ background: rgba(255,255,255,0.05);
240
+ color: var(--ink);
241
+ padding: 12px 14px;
242
+ font: inherit;
243
+ transition: border-color 0.18s ease, background 0.18s ease, transform 0.18s ease;
244
+ }
245
+ input::placeholder, textarea::placeholder { color: rgba(181, 193, 209, 0.65); }
246
+ input:focus, select:focus, textarea:focus {
247
+ outline: none;
248
+ border-color: rgba(241, 197, 110, 0.55);
249
+ background: rgba(255,255,255,0.08);
250
+ }
251
+ textarea {
252
+ min-height: 250px;
253
+ resize: vertical;
254
+ font-family: var(--mono);
255
+ font-size: 0.92rem;
256
+ line-height: 1.5;
257
+ }
258
+ button {
259
+ cursor: pointer;
260
+ border: 0;
261
+ font-weight: 700;
262
+ letter-spacing: 0.01em;
263
+ background: linear-gradient(135deg, var(--accent), #ef7a59);
264
+ box-shadow: 0 12px 24px rgba(213, 79, 54, 0.24);
265
+ }
266
+ button:hover {
267
+ transform: translateY(-1px);
268
+ filter: brightness(1.03);
269
+ }
270
+ .button-secondary {
271
+ background: linear-gradient(135deg, #1a6374, #2d8896);
272
+ box-shadow: 0 12px 24px rgba(45, 136, 150, 0.2);
273
+ }
274
+ .button-ghost {
275
+ background: rgba(255,255,255,0.06);
276
+ border: 1px solid rgba(196, 217, 245, 0.12);
277
+ box-shadow: none;
278
+ }
279
+ .button-grid {
280
+ display: grid;
281
+ grid-template-columns: 1fr 1fr;
282
+ gap: 10px;
283
+ }
284
+ .status {
285
+ min-height: 48px;
286
+ border-radius: 18px;
287
+ border: 1px solid rgba(196, 217, 245, 0.12);
288
+ background: rgba(255,255,255,0.04);
289
+ padding: 12px 14px;
290
+ color: var(--muted);
291
+ line-height: 1.5;
292
+ }
293
+ .status.ok { color: var(--ok); border-color: rgba(101, 209, 151, 0.2); }
294
+ .status.bad { color: var(--bad); border-color: rgba(255, 125, 125, 0.22); }
295
+ .chips {
296
+ display: flex;
297
+ gap: 8px;
298
+ flex-wrap: wrap;
299
+ }
300
+ .chip {
301
+ display: inline-flex;
302
+ align-items: center;
303
+ gap: 8px;
304
+ padding: 8px 11px;
305
+ border-radius: 999px;
306
+ border: 1px solid rgba(196, 217, 245, 0.12);
307
+ background: rgba(255,255,255,0.04);
308
+ color: var(--muted);
309
+ font-size: 0.82rem;
310
+ }
311
+ .kpis {
312
+ display: grid;
313
+ grid-template-columns: repeat(2, minmax(0, 1fr));
314
+ gap: 10px;
315
+ }
316
+ .kpi {
317
+ border-radius: 18px;
318
+ border: 1px solid rgba(196, 217, 245, 0.12);
319
+ background: rgba(255,255,255,0.035);
320
+ padding: 14px;
321
+ }
322
+ .kpi .label {
323
+ color: var(--muted);
324
+ font-size: 0.72rem;
325
+ text-transform: uppercase;
326
+ letter-spacing: 0.08em;
327
+ }
328
+ .kpi .value {
329
+ margin-top: 8px;
330
+ font-size: 1.05rem;
331
+ font-weight: 700;
332
+ word-break: break-word;
333
+ }
334
+ .template-grid {
335
+ display: grid;
336
+ gap: 8px;
337
+ }
338
+ .template {
339
+ text-align: left;
340
+ padding: 12px 13px;
341
+ border-radius: 16px;
342
+ background: rgba(255,255,255,0.045);
343
+ border: 1px solid rgba(196, 217, 245, 0.1);
344
+ color: var(--ink);
345
+ font-size: 0.92rem;
346
+ box-shadow: none;
347
+ }
348
+ .template strong {
349
+ display: block;
350
+ margin-bottom: 4px;
351
+ font-size: 0.9rem;
352
+ }
353
+ .template span {
354
+ color: var(--muted);
355
+ font-size: 0.82rem;
356
+ line-height: 1.45;
357
+ }
358
+ .viewer-tabs {
359
+ display: flex;
360
+ gap: 8px;
361
+ margin-bottom: 12px;
362
+ }
363
+ .tab {
364
+ width: auto;
365
+ padding: 10px 14px;
366
+ border-radius: 999px;
367
+ background: rgba(255,255,255,0.05);
368
+ box-shadow: none;
369
+ font-size: 0.86rem;
370
+ }
371
+ .tab.active {
372
+ background: linear-gradient(135deg, rgba(241, 197, 110, 0.18), rgba(213, 79, 54, 0.2));
373
+ border: 1px solid rgba(241, 197, 110, 0.28);
374
+ }
375
+ .viewer {
376
+ min-height: 620px;
377
+ border-radius: 20px;
378
+ background: linear-gradient(180deg, #0d1626, #0b1220);
379
+ border: 1px solid rgba(196, 217, 245, 0.1);
380
+ overflow: hidden;
381
+ }
382
+ pre {
383
+ margin: 0;
384
+ min-height: 620px;
385
+ padding: 18px;
386
+ overflow: auto;
387
+ white-space: pre-wrap;
388
+ word-break: break-word;
389
+ color: #e7efff;
390
+ font-family: var(--mono);
391
+ font-size: 0.9rem;
392
+ line-height: 1.58;
393
+ }
394
+ .hidden { display: none; }
395
+ .brief {
396
+ display: grid;
397
+ gap: 12px;
398
+ }
399
+ .brief-card {
400
+ border-radius: 18px;
401
+ border: 1px solid rgba(196, 217, 245, 0.1);
402
+ background: rgba(255,255,255,0.04);
403
+ padding: 14px;
404
+ }
405
+ .brief-card h3 {
406
+ margin: 0 0 8px;
407
+ font-size: 0.86rem;
408
+ text-transform: uppercase;
409
+ letter-spacing: 0.08em;
410
+ color: var(--gold);
411
+ }
412
+ .brief-card p, .brief-card ul {
413
+ margin: 0;
414
+ color: var(--muted);
415
+ line-height: 1.55;
416
+ font-size: 0.92rem;
417
+ }
418
+ .brief-card ul {
419
+ padding-left: 18px;
420
+ }
421
+ .brief-card li + li { margin-top: 6px; }
422
+ .footer-note {
423
+ margin-top: 10px;
424
+ color: rgba(181, 193, 209, 0.66);
425
+ font-size: 0.78rem;
426
+ line-height: 1.5;
427
+ }
428
+ @media (max-width: 1240px) {
429
+ .dashboard { grid-template-columns: 360px 1fr; }
430
+ .sidebar-right { grid-column: 1 / -1; }
431
+ }
432
+ @media (max-width: 900px) {
433
+ .wrap { padding: 18px 14px 28px; }
434
+ .dashboard { grid-template-columns: 1fr; }
435
+ .hero-top { flex-direction: column; }
436
+ .hero-statbar { width: 100%; min-width: 0; }
437
+ .button-grid { grid-template-columns: 1fr; }
438
+ .viewer, pre { min-height: 420px; }
439
+ }
440
+ </style>
441
+ </head>
442
+ <body>
443
+ <div class="chrome"></div>
444
+ <div class="wrap">
445
+ <section class="hero">
446
+ <div class="hero-top">
447
+ <div>
448
+ <div class="eyebrow">Live OpenEnv Ops Console</div>
449
+ <h1>NovaTech Incident Command</h1>
450
+ <p>Run a full incident workflow from one place: shape your search space, surface the most credible evidence, lock in a structured causal hypothesis, and pressure-test the final report before submission.</p>
451
+ </div>
452
+ <div class="hero-statbar">
453
+ <div class="hero-stat">
454
+ <div class="label">Mode</div>
455
+ <div class="value">Seeded, Partial</div>
456
+ </div>
457
+ <div class="hero-stat">
458
+ <div class="label">Sessions</div>
459
+ <div class="value" id="hero-session">None</div>
460
+ </div>
461
+ <div class="hero-stat">
462
+ <div class="label">Last Reward</div>
463
+ <div class="value" id="hero-reward">-</div>
464
+ </div>
465
+ </div>
466
+ </div>
467
+ </section>
468
+ <section class="dashboard">
469
+ <div class="panel">
470
+ <div class="panel-head">
471
+ <h2 class="panel-title">Mission Control</h2>
472
+ </div>
473
+ <p class="panel-subtitle">Start a seeded episode, track session health, and jump into common action patterns without writing boilerplate from scratch.</p>
474
+ <div class="panel-body stack">
475
+ <div class="field">
476
+ <label>Task</label>
477
+ <select id="task">
478
+ <option value="easy">easy · auth heap exhaustion</option>
479
+ <option value="medium">medium · competing checkout hypotheses</option>
480
+ <option value="hard">hard · cascading multi-service incident</option>
481
+ </select>
482
+ </div>
483
+ <div class="field">
484
+ <label>Seed</label>
485
+ <input id="seed" placeholder="Optional integer seed for reproducibility" />
486
+ </div>
487
+ <div class="button-grid">
488
+ <button onclick="resetEpisode()">Reset Episode</button>
489
+ <button class="button-secondary" onclick="loadState()">Load Public State</button>
490
+ </div>
491
+ <div id="status" class="status">No active session yet. Reset an episode to begin.</div>
492
+ <div class="kpis">
493
+ <div class="kpi">
494
+ <div class="label">Session ID</div>
495
+ <div class="value" id="session-pill">-</div>
496
+ </div>
497
+ <div class="kpi">
498
+ <div class="label">Task</div>
499
+ <div class="value" id="task-pill">-</div>
500
+ </div>
501
+ <div class="kpi">
502
+ <div class="label">Step</div>
503
+ <div class="value" id="step-pill">-</div>
504
+ </div>
505
+ <div class="kpi">
506
+ <div class="label">Done</div>
507
+ <div class="value" id="done-pill">-</div>
508
+ </div>
509
+ </div>
510
+ <div>
511
+ <div class="group-label">Quick Templates</div>
512
+ <div class="template-grid">
513
+ <button class="template" onclick="useTemplate('critical_window')">
514
+ <strong>Critical Window Query</strong>
515
+ <span>Pull the highest-risk logs in the incident window first.</span>
516
+ </button>
517
+ <button class="template" onclick="useTemplate('dependency_sweep')">
518
+ <strong>Dependency Sweep</strong>
519
+ <span>Inspect the most suspicious service and its adjacent services.</span>
520
+ </button>
521
+ <button class="template" onclick="useTemplate('hypothesis_auth')">
522
+ <strong>Auth Hypothesis</strong>
523
+ <span>Start from resource exhaustion in auth-service.</span>
524
+ </button>
525
+ <button class="template" onclick="useTemplate('submit_shell')">
526
+ <strong>Final Report Shell</strong>
527
+ <span>Fill a structured report with observed evidence only.</span>
528
+ </button>
529
+ </div>
530
+ </div>
531
+ </div>
532
+ </div>
533
+ <div class="panel">
534
+ <div class="panel-head">
535
+ <h2 class="panel-title">Action Composer</h2>
536
+ <button class="button-ghost" style="width:auto;" onclick="formatAction()">Format JSON</button>
537
+ </div>
538
+ <p class="panel-subtitle">Work directly against the typed API. The current session id is auto-injected when missing, so you can focus on the action payload itself.</p>
539
+ <div class="panel-body">
540
+ <div class="field">
541
+ <label for="action">Action JSON</label>
542
+ <textarea id="action">{
543
+ "action_type": "query_logs",
544
+ "query": {
545
+ "levels": ["CRITICAL", "ERROR"],
546
+ "limit": 5
547
+ }
548
+ }</textarea>
549
+ </div>
550
+ <div class="button-grid" style="margin-top: 12px;">
551
+ <button onclick="submitStep()">Submit Step</button>
552
+ <button class="button-secondary" onclick="copySessionAction()">Inject Session + Copy</button>
553
+ </div>
554
+ <div class="footer-note">Tip: keep evidence grounded. The grader now rejects unseen log ids and penalizes contradictions across service, impact, and containment.</div>
555
+ </div>
556
+ </div>
557
+ <div class="panel sidebar-right">
558
+ <div class="panel-head">
559
+ <h2 class="panel-title">Situation Room</h2>
560
+ </div>
561
+ <p class="panel-subtitle">Read the live incident summary, then switch between raw JSON and a cleaner operator view to understand what changed after each step.</p>
562
+ <div class="panel-body">
563
+ <div class="brief">
564
+ <div class="brief-card">
565
+ <h3>Incident Snapshot</h3>
566
+ <p id="brief-title">No active incident briefing yet.</p>
567
+ </div>
568
+ <div class="brief-card">
569
+ <h3>Operational Constraints</h3>
570
+ <ul id="constraints-list">
571
+ <li>Reset an episode to load task-specific constraints.</li>
572
+ </ul>
573
+ </div>
574
+ <div class="brief-card">
575
+ <h3>Suspected Services</h3>
576
+ <div class="chips" id="suspected-services">
577
+ <span class="chip">None</span>
578
+ </div>
579
+ </div>
580
+ </div>
581
+ <div class="viewer-tabs" style="margin-top: 16px;">
582
+ <button class="tab active" id="tab-raw" onclick="switchTab('raw')">Raw JSON</button>
583
+ <button class="tab" id="tab-ops" onclick="switchTab('ops')">Ops Summary</button>
584
+ </div>
585
+ <div class="viewer">
586
+ <pre id="output-raw">No data yet.</pre>
587
+ <pre id="output-ops" class="hidden">No data yet.</pre>
588
+ </div>
589
+ </div>
590
+ </div>
591
+ </section>
592
+ </div>
593
+ <script>
594
+ let currentSessionId = null;
595
+ const templates = {
596
+ critical_window: {
597
+ action_type: "query_logs",
598
+ query: { levels: ["CRITICAL", "ERROR"], limit: 6 }
599
+ },
600
+ dependency_sweep: {
601
+ action_type: "inspect_dependencies",
602
+ target_service: "payment-api"
603
+ },
604
+ hypothesis_auth: {
605
+ action_type: "update_hypothesis",
606
+ hypothesis: {
607
+ primary_service: "auth-service",
608
+ failure_mode: "resource_exhaustion",
609
+ dependency: "none",
610
+ customer_impact: "login_failures",
611
+ confidence: 0.82
612
+ }
613
+ },
614
+ submit_shell: {
615
+ action_type: "submit_report",
616
+ report: {
617
+ evidence_log_ids: [],
618
+ impacted_services: ["auth-service"],
619
+ root_cause: {
620
+ primary_service: "auth-service",
621
+ failure_mode: "resource_exhaustion",
622
+ dependency: "none",
623
+ customer_impact: "login_failures",
624
+ confidence: 0.82
625
+ },
626
+ containment_plan: ["increase_auth_heap", "enable_login_rate_limiting"],
627
+ summary: "Replace this with a concise, evidence-backed incident summary."
628
+ }
629
+ }
630
+ };
631
+ function buildOpsView(data) {
632
+ const source = data.observation || data;
633
+ const reward = data.reward || data.last_reward || {};
634
+ const logs = source.visible_logs || [];
635
+ const lines = [];
636
+ lines.push("Session Overview");
637
+ lines.push(`- Session: ${source.session_id || currentSessionId || "-"}`);
638
+ lines.push(`- Task: ${source.task_id || "-"}`);
639
+ lines.push(`- Step: ${source.step_number ?? data.step_number ?? "-"} / ${source.max_steps ?? data.max_steps ?? "-"}`);
640
+ lines.push(`- Revealed logs: ${source.revealed_log_count ?? data.revealed_log_count ?? logs.length ?? 0}`);
641
+ lines.push(`- Done: ${String(source.done ?? data.done ?? "-")}`);
642
+ if (source.feedback) {
643
+ lines.push("");
644
+ lines.push("Feedback");
645
+ lines.push(source.feedback);
646
+ }
647
+ if (source.briefing) {
648
+ lines.push("");
649
+ lines.push("Briefing");
650
+ lines.push(`- Title: ${source.briefing.title}`);
651
+ lines.push(`- Objective: ${source.briefing.objective}`);
652
+ lines.push(`- Customer: ${source.briefing.customer_statement}`);
653
+ }
654
+ if (source.last_hypothesis) {
655
+ lines.push("");
656
+ lines.push("Latest Hypothesis");
657
+ lines.push(`- Service: ${source.last_hypothesis.primary_service}`);
658
+ lines.push(`- Failure mode: ${source.last_hypothesis.failure_mode}`);
659
+ lines.push(`- Dependency: ${source.last_hypothesis.dependency}`);
660
+ lines.push(`- Impact: ${source.last_hypothesis.customer_impact}`);
661
+ lines.push(`- Confidence: ${source.last_hypothesis.confidence}`);
662
+ }
663
+ if (source.submitted_containment && source.submitted_containment.length) {
664
+ lines.push("");
665
+ lines.push("Containment");
666
+ source.submitted_containment.forEach((item) => lines.push(`- ${item}`));
667
+ }
668
+ if (reward.value !== undefined) {
669
+ lines.push("");
670
+ lines.push("Reward");
671
+ lines.push(`- Total: ${Number(reward.value).toFixed(4)}`);
672
+ if (reward.signal_reward !== undefined) lines.push(`- Signal: ${Number(reward.signal_reward).toFixed(4)}`);
673
+ if (reward.hypothesis_reward !== undefined) lines.push(`- Hypothesis: ${Number(reward.hypothesis_reward).toFixed(4)}`);
674
+ if (reward.efficiency_reward !== undefined) lines.push(`- Efficiency: ${Number(reward.efficiency_reward).toFixed(4)}`);
675
+ if (reward.penalty !== undefined) lines.push(`- Penalty: ${Number(reward.penalty).toFixed(4)}`);
676
+ }
677
+ if (logs.length) {
678
+ lines.push("");
679
+ lines.push(`Visible Logs (${logs.length})`);
680
+ logs.slice(0, 10).forEach((log) => {
681
+ lines.push(`- [${log.log_level}] ${log.log_id} · ${log.service_name} · ${log.server_id}`);
682
+ lines.push(` ${log.message}`);
683
+ });
684
+ }
685
+ return lines.join("\\n");
686
+ }
687
+ function refreshBriefing(observation) {
688
+ document.getElementById("session-pill").textContent = observation.session_id || currentSessionId || "-";
689
+ document.getElementById("task-pill").textContent = observation.task_id || "-";
690
+ document.getElementById("step-pill").textContent = `${observation.step_number ?? "-"} / ${observation.max_steps ?? "-"}`;
691
+ document.getElementById("done-pill").textContent = String(observation.done ?? "-");
692
+ document.getElementById("hero-session").textContent = observation.session_id ? observation.session_id.slice(0, 8) : "None";
693
+ if (observation.briefing) {
694
+ document.getElementById("brief-title").textContent = `${observation.briefing.title}: ${observation.briefing.customer_statement}`;
695
+ const list = document.getElementById("constraints-list");
696
+ list.innerHTML = "";
697
+ observation.briefing.operational_constraints.forEach((item) => {
698
+ const li = document.createElement("li");
699
+ li.textContent = item;
700
+ list.appendChild(li);
701
+ });
702
+ const chips = document.getElementById("suspected-services");
703
+ chips.innerHTML = "";
704
+ observation.briefing.suspected_services.forEach((service) => {
705
+ const chip = document.createElement("span");
706
+ chip.className = "chip";
707
+ chip.textContent = service;
708
+ chips.appendChild(chip);
709
+ });
710
+ }
711
+ }
712
+ function show(data) {
713
+ document.getElementById("output-raw").textContent = JSON.stringify(data, null, 2);
714
+ document.getElementById("output-ops").textContent = buildOpsView(data);
715
+ const observation = data.observation || data;
716
+ if (observation.session_id) currentSessionId = observation.session_id;
717
+ refreshBriefing(observation);
718
+ const reward = data.reward || data.last_reward;
719
+ document.getElementById("hero-reward").textContent = reward && reward.value !== undefined ? Number(reward.value).toFixed(3) : "-";
720
+ }
721
+ function status(text, ok=true) {
722
+ const node = document.getElementById("status");
723
+ node.textContent = text;
724
+ node.className = ok ? "status ok" : "status bad";
725
+ }
726
+ function switchTab(which) {
727
+ const raw = document.getElementById("output-raw");
728
+ const ops = document.getElementById("output-ops");
729
+ const rawTab = document.getElementById("tab-raw");
730
+ const opsTab = document.getElementById("tab-ops");
731
+ if (which === "ops") {
732
+ raw.classList.add("hidden");
733
+ ops.classList.remove("hidden");
734
+ rawTab.classList.remove("active");
735
+ opsTab.classList.add("active");
736
+ } else {
737
+ ops.classList.add("hidden");
738
+ raw.classList.remove("hidden");
739
+ opsTab.classList.remove("active");
740
+ rawTab.classList.add("active");
741
+ }
742
+ }
743
+ function useTemplate(name) {
744
+ const template = JSON.parse(JSON.stringify(templates[name]));
745
+ if (currentSessionId) template.session_id = currentSessionId;
746
+ document.getElementById("action").value = JSON.stringify(template, null, 2);
747
+ }
748
+ function formatAction() {
749
+ try {
750
+ const payload = JSON.parse(document.getElementById("action").value);
751
+ document.getElementById("action").value = JSON.stringify(payload, null, 2);
752
+ status("Action JSON formatted.");
753
+ } catch (error) {
754
+ status(error.message, false);
755
+ }
756
+ }
757
+ async function copySessionAction() {
758
+ try {
759
+ const payload = JSON.parse(document.getElementById("action").value);
760
+ if (currentSessionId) payload.session_id = currentSessionId;
761
+ const text = JSON.stringify(payload, null, 2);
762
+ document.getElementById("action").value = text;
763
+ if (navigator.clipboard) {
764
+ await navigator.clipboard.writeText(text);
765
+ }
766
+ status("Session id injected and action copied.");
767
+ } catch (error) {
768
+ status(error.message, false);
769
+ }
770
+ }
+ async function resetEpisode() {
+   try {
+     const task_id = document.getElementById("task").value;
+     const rawSeed = document.getElementById("seed").value.trim();
+     const payload = { task_id };
+     if (rawSeed) {
+       // Reject non-integer input instead of sending NaN (serialized as null).
+       const seed = Number(rawSeed);
+       if (!Number.isInteger(seed)) return status("Seed must be an integer.", false);
+       payload.seed = seed;
+     }
+     const res = await fetch("/reset", { method: "POST", headers: { "Content-Type": "application/json" }, body: JSON.stringify(payload) });
+     const data = await res.json();
+     if (!res.ok) return status(JSON.stringify(data), false);
+     show(data);
+     status("Episode reset.");
+   } catch (error) {
+     // Surface network and JSON parsing failures in the status bar.
+     status(error.message, false);
+   }
+ }
782
+ async function submitStep() {
783
+ try {
784
+ const payload = JSON.parse(document.getElementById("action").value);
785
+ if (currentSessionId && !payload.session_id) payload.session_id = currentSessionId;
786
+ const res = await fetch("/step", { method: "POST", headers: { "Content-Type": "application/json" }, body: JSON.stringify(payload) });
787
+ const data = await res.json();
788
+ if (!res.ok) return status(JSON.stringify(data), false);
789
+ show(data);
790
+ status("Step completed.");
791
+ } catch (error) {
792
+ status(error.message, false);
793
+ }
794
+ }
+ async function loadState() {
+   try {
+     const url = currentSessionId ? `/state?session_id=${encodeURIComponent(currentSessionId)}` : "/state";
+     const res = await fetch(url);
+     const data = await res.json();
+     if (!res.ok) return status(JSON.stringify(data), false);
+     show(data);
+     status("Public state loaded.");
+   } catch (error) {
+     // Surface network and JSON parsing failures in the status bar.
+     status(error.message, false);
+   }
+ }
803
+ switchTab('raw');
804
+ </script>
805
+ </body>
806
+ </html>
807
+ """
+
+
+ @app.get("/health")
+ def health() -> Dict[str, str]:
+     return {"status": "ok"}
+
+
+ @app.post("/reset", response_model=Dict[str, Any])
+ def reset(request: ResetRequest) -> Dict[str, Any]:
+     try:
+         observation = store.reset(task_id=request.task_id, seed=request.seed)
+     except ValueError as exc:
+         raise HTTPException(status_code=422, detail=str(exc)) from exc
+     return observation.model_dump()
+
+
+ @app.post("/step", response_model=StepResponse)
+ def step(action: Action) -> StepResponse:
+     try:
+         observation, reward, done, info = store.step(action)
+     except (RuntimeError, ValueError) as exc:
+         raise HTTPException(status_code=400, detail=str(exc)) from exc
+     return StepResponse(
+         observation=observation.model_dump(),
+         reward=reward.model_dump(),
+         done=done,
+         info=info,
+     )
+
+
+ @app.get("/state", response_model=Dict[str, Any])
+ def state(session_id: Optional[str] = Query(default=None)) -> Dict[str, Any]:
+     try:
+         return store.public_state(session_id=session_id)
+     except RuntimeError as exc:
+         raise HTTPException(status_code=400, detail=str(exc)) from exc
+
+
+ @app.get("/debug_state", response_model=Dict[str, Any])
+ def debug_state(session_id: Optional[str] = Query(default=None)) -> Dict[str, Any]:
+     try:
+         return store.debug_state(session_id=session_id)
+     except PermissionError as exc:
+         raise HTTPException(status_code=403, detail=str(exc)) from exc
+     except RuntimeError as exc:
+         raise HTTPException(status_code=400, detail=str(exc)) from exc
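The endpoints above can also be driven without the browser console. The following is a minimal sketch, not part of this commit: `EpisodeClient`, `Transport`, and `urllib_transport` are hypothetical names introduced for illustration. It assumes `/reset` returns an observation containing `session_id`, and it mirrors the console's behavior of injecting the current session id into a `/step` payload only when one is missing.

```python
import json
import urllib.request
from typing import Any, Callable, Dict, Optional

# Hypothetical transport signature: (method, path, json_body) -> decoded JSON.
Transport = Callable[[str, str, Optional[Dict[str, Any]]], Dict[str, Any]]


def urllib_transport(base_url: str) -> Transport:
    """Build a transport that talks to a running server over HTTP."""

    def send(method: str, path: str, body: Optional[Dict[str, Any]]) -> Dict[str, Any]:
        data = json.dumps(body).encode("utf-8") if body is not None else None
        request = urllib.request.Request(
            base_url.rstrip("/") + path,
            data=data,
            method=method,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read().decode("utf-8"))

    return send


class EpisodeClient:
    """Tracks the session id returned by /reset and reuses it for /step."""

    def __init__(self, transport: Transport) -> None:
        self._send = transport
        self.session_id: Optional[str] = None

    def reset(self, task_id: str = "easy", seed: Optional[int] = None) -> Dict[str, Any]:
        payload: Dict[str, Any] = {"task_id": task_id}
        if seed is not None:
            payload["seed"] = int(seed)
        observation = self._send("POST", "/reset", payload)
        # Assumption: the observation carries the new session id.
        self.session_id = observation.get("session_id")
        return observation

    def step(self, action: Dict[str, Any]) -> Dict[str, Any]:
        payload = dict(action)  # do not mutate the caller's dict
        # Inject the tracked session id only when the action lacks one.
        if self.session_id and not payload.get("session_id"):
            payload["session_id"] = self.session_id
        return self._send("POST", "/step", payload)
```

Against a container started from the Dockerfile in this commit, which serves on port 7860, the client could be built as `EpisodeClient(urllib_transport("http://localhost:7860"))`. Passing the transport in as a callable keeps the session handling testable without a live server.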
data/__init__.py ADDED
File without changes
data/__pycache__/db_loader.cpython-314.pyc ADDED
Binary file (8.01 kB).
 
data/db_loader.py ADDED
@@ -0,0 +1,127 @@
+ """
+ Database and scenario loading helpers.
+ """
+ from __future__ import annotations
+
+ import os
+ import random
+ import sqlite3
+ from pathlib import Path
+ from typing import Any, Dict, Iterable, List
+
+ from tasks.catalog import TASK_SPECS
+
+ DEFAULT_DB_PATH = Path(__file__).resolve().parents[1] / "novatech_logs.db"
+ DB_PATH = Path(os.getenv("DB_PATH", str(DEFAULT_DB_PATH))).expanduser().resolve()
+
+
+ def _connect() -> sqlite3.Connection:
+     if not DB_PATH.exists():
+         raise FileNotFoundError(f"Database not found at '{DB_PATH}'")
+     return sqlite3.connect(str(DB_PATH))
+
+
+ def load_thresholds() -> Dict[str, Dict[str, float]]:
+     conn = _connect()
+     rows = conn.execute(
+         "SELECT metric_name, warning_threshold, critical_threshold, consecutive_count FROM anomaly_thresholds"
+     ).fetchall()
+     conn.close()
+     return {
+         row[0]: {
+             "warning": float(row[1]),
+             "critical": float(row[2]),
+             "consecutive": float(row[3]),
+         }
+         for row in rows
+     }
+
+
+ def load_patterns() -> Dict[str, Dict[str, str]]:
+     conn = _connect()
+     rows = conn.execute(
+         "SELECT pattern_keyword, severity, description FROM known_error_patterns ORDER BY pattern_id"
+     ).fetchall()
+     conn.close()
+     return {row[0]: {"severity": row[1], "description": row[2]} for row in rows}
+
+
+ def load_all_logs() -> List[Dict[str, Any]]:
+     conn = _connect()
+     rows = conn.execute(
+         """
+         SELECT log_id, timestamp, server_id, log_level, service_name,
+                message, response_time_ms, cpu_usage_percent, memory_usage_percent
+         FROM server_logs
+         ORDER BY timestamp ASC, log_id ASC
+         """
+     ).fetchall()
+     conn.close()
+     return [
+         {
+             "log_id": int(row[0]),
+             "timestamp": str(row[1]),
+             "server_id": str(row[2]),
+             "log_level": str(row[3]),
+             "service_name": str(row[4]),
+             "message": str(row[5]),
+             "response_time_ms": int(row[6] or 0),
+             "cpu_usage_percent": float(row[7] or 0.0),
+             "memory_usage_percent": float(row[8] or 0.0),
+         }
+         for row in rows
+     ]
+
+
+ def _within_window(log: Dict[str, Any], start: str, end: str) -> bool:
+     return start <= str(log["timestamp"]) <= end
+
+
+ def _base_scope(task_id: str, logs: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
+     spec = TASK_SPECS[task_id]
+     scope_servers = set(spec["scope_servers"])
+     scope_services = set(spec["scope_services"])
+     start = str(spec["incident_window_start"])
+     end = str(spec["incident_window_end"])
+     return [
+         log
+         for log in logs
+         if log["server_id"] in scope_servers
+         and log["service_name"] in scope_services
+         and (
+             _within_window(log, start, end)
+             or log["log_id"] in set(spec["must_include_ids"])
+         )
+     ]
+
+
+ def build_task_log_pool(task_id: str, seed: int) -> List[Dict[str, Any]]:
+     spec = TASK_SPECS[task_id]
+     rng = random.Random(seed)
+     all_logs = load_all_logs()
+     must_include_ids = set(spec["must_include_ids"])
+     base_scope = _base_scope(task_id, all_logs)
+     scope_ids = {log["log_id"] for log in base_scope}
+     for log in all_logs:
+         if log["log_id"] in must_include_ids:
+             scope_ids.add(log["log_id"])
+
+     scope_logs = [log for log in all_logs if log["log_id"] in scope_ids]
+     noise_candidates = [
+         log
+         for log in all_logs
+         if log["log_id"] not in scope_ids
+         and log["server_id"] in set(spec["scope_servers"])
+         and log["service_name"] in set(spec["scope_services"])
+     ]
+     sample_size = min(int(spec["noise_sample_size"]), len(noise_candidates))
+     if sample_size:
+         for log in rng.sample(noise_candidates, sample_size):
+             scope_logs.append(log)
+
+     enriched = []
+     for index, log in enumerate(scope_logs):
+         log_copy = dict(log)
+         log_copy["_seed_rank"] = rng.random() + (index * 0.00001)
+         enriched.append(log_copy)
+     return enriched
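`build_task_log_pool` relies on a private `random.Random(seed)` instance so that the same seed always yields the same noise sample. The sketch below isolates that one idea under simplified assumptions: `sample_noise` is a hypothetical helper, and the candidate dicts carry only a `log_id` rather than full log rows.

```python
import random
from typing import Dict, List


def sample_noise(candidates: List[Dict[str, int]], sample_size: int, seed: int) -> List[int]:
    """Pick up to sample_size candidate log ids, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    size = min(sample_size, len(candidates))  # never ask for more than exists
    picked = rng.sample(candidates, size) if size else []
    return [log["log_id"] for log in picked]


# Hypothetical stand-in for out-of-window logs on in-scope servers and services.
candidates = [{"log_id": log_id} for log_id in range(100, 120)]

first = sample_noise(candidates, 5, seed=2025)
second = sample_noise(candidates, 5, seed=2025)
print(first == second)
```

Because the generator is constructed per call rather than shared, concurrent sessions with different seeds cannot disturb each other's sampling order.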
env/__init__.py ADDED
File without changes
env/__pycache__/environment.cpython-314.pyc ADDED
Binary file (23.6 kB).
 
env/__pycache__/models.cpython-314.pyc ADDED
Binary file (6.41 kB).
 
env/environment.py ADDED
@@ -0,0 +1,344 @@
+ """
+ Session-safe OpenEnv environment with seeded partial observability.
+ """
+ from __future__ import annotations
+
+ import os
+ import random
+ import threading
+ import uuid
+ from dataclasses import dataclass, field
+ from typing import Any, Dict, List, Optional, Set, Tuple
+
+ from data.db_loader import build_task_log_pool, load_patterns, load_thresholds
+ from env.models import Action, IncidentBriefing, Observation, Reward, RootCauseHypothesis
+ from tasks.catalog import CONTAINMENT_DESCRIPTIONS, DEPENDENCY_GRAPH, TASK_SPECS
+ from tasks.graders import build_dense_reward, containment_alignment, grade_report, hypothesis_match_score
+
+ DEBUG_STATE_ENABLED = os.getenv("OPENENV_DEBUG_STATE", "false").lower() == "true"
+
+
+ @dataclass
+ class IncidentSession:
+     session_id: str
+     task_id: str
+     seed: int
+     max_steps: int
+     logs: List[Dict[str, Any]]
+     thresholds: Dict[str, Dict[str, float]]
+     patterns: Dict[str, Dict[str, str]]
+     step_number: int = 0
+     done: bool = False
+     visible_log_ids: Set[int] = field(default_factory=set)
+     visited_services: Set[str] = field(default_factory=set)
+     containment_plan: List[str] = field(default_factory=list)
+     last_hypothesis: Optional[RootCauseHypothesis] = None
+     best_hypothesis_score: float = 0.0
+     query_fingerprints: Dict[str, int] = field(default_factory=dict)
+     last_reward: Optional[Reward] = None
+     episode_history: List[Dict[str, Any]] = field(default_factory=list)
+
+     def visible_logs(self) -> List[Dict[str, Any]]:
+         visible = [log for log in self.logs if log["log_id"] in self.visible_log_ids]
+         return sorted(visible, key=lambda log: (log["timestamp"], log["log_id"]))
+
+     def log_map(self) -> Dict[int, Dict[str, Any]]:
+         return {log["log_id"]: log for log in self.logs}
+
+
+ class SessionStore:
+     def __init__(self) -> None:
+         self._lock = threading.Lock()
+         self._sessions: Dict[str, IncidentSession] = {}
+
+     def reset(self, task_id: str = "easy", seed: Optional[int] = None) -> Observation:
+         if task_id not in TASK_SPECS:
+             raise ValueError(f"Unknown task_id '{task_id}'.")
+         actual_seed = int(seed if seed is not None else 2025 + (list(TASK_SPECS).index(task_id) * 17))
+         session_id = uuid.uuid4().hex
+         spec = TASK_SPECS[task_id]
+         session = IncidentSession(
+             session_id=session_id,
+             task_id=task_id,
+             seed=actual_seed,
+             max_steps=int(spec["max_steps"]),
+             logs=build_task_log_pool(task_id, actual_seed),
+             thresholds=load_thresholds(),
+             patterns=load_patterns(),
+         )
+         with self._lock:
+             self._sessions[session_id] = session
+         return self._build_observation(
+             session,
+             feedback="Episode created. Query the incident window and inspect dependencies to build your case.",
+         )
+
+     def step(self, action: Action) -> Tuple[Observation, Reward, bool, Dict[str, Any]]:
+         session = self._resolve_session(action.session_id)
+         if session.done:
+             raise RuntimeError("Episode already finished. Call /reset to start a new session.")
+
+         session.step_number += 1
+         repeated_action_count = self._register_action(session, action)
+
+         if action.action_type == "submit_report":
+             if action.report is None:
+                 raise ValueError("submit_report requires report")
+             reward = grade_report(
+                 task_id=session.task_id,
+                 report=action.report,
+                 revealed_log_ids=set(session.visible_log_ids),
+                 revealed_log_map=session.log_map(),
+                 step_number=session.step_number,
+                 max_steps=session.max_steps,
+                 repeated_action_count=repeated_action_count,
+             )
+             session.done = True
+             feedback = "Final report graded."
+         elif action.action_type == "no_anomalies":
+             reward = build_dense_reward(
+                 signal_reward=0.0,
+                 hypothesis_reward=0.0,
+                 efficiency_reward=0.0,
+                 penalty=1.0,
+                 info={"message": "No-incident declaration is invalid for this benchmark."},
+             )
+             session.done = True
+             feedback = "No-incident declaration rejected."
+         else:
+             reward, feedback = self._handle_non_terminal(session, action, repeated_action_count)
+             if session.step_number >= session.max_steps:
+                 session.done = True
+                 feedback = f"{feedback} Step budget exhausted."
+
+         session.last_reward = reward
+         session.episode_history.append(
+             {
+                 "step": session.step_number,
+                 "action_type": action.action_type,
+                 "reward": reward.value,
+                 "done": session.done,
+             }
+         )
+         observation = self._build_observation(session, feedback=feedback)
+         return observation, reward, session.done, dict(reward.info)
+
+     def public_state(self, session_id: Optional[str] = None) -> Dict[str, Any]:
+         session = self._resolve_session(session_id)
+         return {
+             "session_id": session.session_id,
+             "task_id": session.task_id,
+             "step_number": session.step_number,
+             "max_steps": session.max_steps,
+             "done": session.done,
+             "revealed_log_count": len(session.visible_log_ids),
+             "visited_services": sorted(session.visited_services),
+             "submitted_containment": list(session.containment_plan),
+             "last_reward": session.last_reward.model_dump() if session.last_reward else None,
+         }
+
+     def debug_state(self, session_id: Optional[str] = None) -> Dict[str, Any]:
+         if not DEBUG_STATE_ENABLED:
+             raise PermissionError("Debug state is disabled.")
+         session = self._resolve_session(session_id)
+         return {
+             "session_id": session.session_id,
+             "task_id": session.task_id,
+             "seed": session.seed,
+             "visible_log_ids": sorted(session.visible_log_ids),
+             "all_logs": session.logs,
+             "history": session.episode_history,
+             "best_hypothesis_score": session.best_hypothesis_score,
+         }
+
+     def _resolve_session(self, session_id: Optional[str]) -> IncidentSession:
+         with self._lock:
+             if session_id:
+                 session = self._sessions.get(session_id)
+                 if session is None:
+                     raise RuntimeError(f"Unknown session_id '{session_id}'.")
+                 return session
+             if len(self._sessions) == 1:
+                 return next(iter(self._sessions.values()))
+             raise RuntimeError("A valid session_id is required.")
+
+     def _handle_non_terminal(
+         self,
+         session: IncidentSession,
+         action: Action,
+         repeated_action_count: int,
+     ) -> Tuple[Reward, str]:
+         signal_reward = 0.0
+         hypothesis_reward = 0.0
+         penalty = 0.0
+         info: Dict[str, Any] = {}
+
+         if action.action_type == "query_logs":
+             if action.query is None:
+                 raise ValueError("query_logs requires query")
+             newly_revealed = self._query_logs(session, action.query.model_dump(exclude_none=True))
+             relevant = set(TASK_SPECS[session.task_id]["gold_evidence_ids"])
+             relevant_new = len(relevant & set(newly_revealed))
+             signal_reward = min(1.0, round((0.22 * len(newly_revealed)) + (0.28 * relevant_new), 4))
+             penalty = 0.15 if not newly_revealed else 0.0
+             feedback = f"Query revealed {len(newly_revealed)} new log(s)."
+             info["revealed_log_ids"] = newly_revealed
+         elif action.action_type == "inspect_dependencies":
+             if action.target_service is None:
+                 raise ValueError("inspect_dependencies requires target_service")
+             session.visited_services.add(action.target_service)
+             neighbors = DEPENDENCY_GRAPH.get(action.target_service, [])
+             revealed = self._inspect_dependencies(session, action.target_service, neighbors)
+             relevant = set(TASK_SPECS[session.task_id]["gold_evidence_ids"])
+             signal_reward = min(1.0, round((0.15 * len(revealed)) + (0.35 * len(relevant & set(revealed))), 4))
+             penalty = 0.1 if not revealed else 0.0
+             feedback = f"Dependency inspection around {action.target_service} revealed {len(revealed)} new log(s)."
+             info["neighbors"] = neighbors
+             info["revealed_log_ids"] = revealed
+         elif action.action_type == "update_hypothesis":
+             if action.hypothesis is None:
+                 raise ValueError("update_hypothesis requires hypothesis")
+             current_score = hypothesis_match_score(action.hypothesis, session.task_id)
+             improvement = max(0.0, current_score - session.best_hypothesis_score)
+             session.best_hypothesis_score = max(session.best_hypothesis_score, current_score)
+             session.last_hypothesis = action.hypothesis
+             hypothesis_reward = improvement
+             penalty = 0.15 if improvement == 0.0 and current_score < session.best_hypothesis_score else 0.0
+             feedback = "Hypothesis recorded."
+             info["hypothesis_score"] = current_score
+         elif action.action_type == "execute_containment":
+             plan = list(action.containment_plan or [])
+             positive, negative = containment_alignment(plan, session.task_id)
+             for item in plan:
+                 if item not in session.containment_plan:
+                     session.containment_plan.append(item)
+             hypothesis_reward = positive
+             penalty = min(1.0, negative + (0.05 if not plan else 0.0))
+             feedback = "Containment actions recorded."
+             info["containment_positive"] = positive
+             info["containment_negative"] = negative
+             info["containment_descriptions"] = [CONTAINMENT_DESCRIPTIONS[item] for item in plan]
+         elif action.action_type == "request_more":
+             penalty = 0.1
+             feedback = "No additional passive data is provided. Use a concrete query."
+         else:
+             penalty = 0.2
+             feedback = "Unsupported action."
+
+         if repeated_action_count > 0:
+             penalty = min(1.0, penalty + min(0.2, repeated_action_count * 0.1))
+
+         efficiency_reward = max(
+             0.0,
+             round(1.0 - ((session.step_number - 1) / max(1, session.max_steps - 1)), 4),
+         )
+         reward = build_dense_reward(
+             signal_reward=signal_reward,
+             hypothesis_reward=hypothesis_reward,
+             efficiency_reward=efficiency_reward,
+             penalty=penalty,
+             info=info,
+         )
+         return reward, feedback
+
+     def _query_logs(self, session: IncidentSession, query: Dict[str, Any]) -> List[int]:
+         matched = [log for log in session.logs if self._match_query(log, query)]
+         ranked = sorted(matched, key=lambda log: (self._severity_rank(log["log_level"]), -float(log["_seed_rank"])))
+         revealed: List[int] = []
+         for log in ranked:
+             if log["log_id"] in session.visible_log_ids:
+                 continue
+             session.visible_log_ids.add(log["log_id"])
+             revealed.append(log["log_id"])
+             session.visited_services.add(log["service_name"])
+             if len(revealed) >= int(query.get("limit", 6)):
+                 break
+         return revealed
+
+     def _inspect_dependencies(self, session: IncidentSession, target_service: str, neighbors: List[str]) -> List[int]:
+         candidate_services = {target_service, *[neighbor for neighbor in neighbors if neighbor.endswith("-service")]}
+         matched = [
+             log
+             for log in session.logs
+             if log["service_name"] in candidate_services and log["log_level"] in {"CRITICAL", "ERROR", "WARN"}
+         ]
+         ranked = sorted(matched, key=lambda log: (self._severity_rank(log["log_level"]), log["timestamp"], -float(log["_seed_rank"])))
+         revealed: List[int] = []
+         for log in ranked:
+             if log["log_id"] in session.visible_log_ids:
+                 continue
+             session.visible_log_ids.add(log["log_id"])
+             revealed.append(log["log_id"])
+             if len(revealed) >= 4:
+                 break
+         return revealed
+
+     @staticmethod
+     def _match_query(log: Dict[str, Any], query: Dict[str, Any]) -> bool:
+         if query.get("service_name") and log["service_name"] != query["service_name"]:
+             return False
+         if query.get("server_id") and log["server_id"] != query["server_id"]:
+             return False
+         if query.get("levels") and log["log_level"] not in set(query["levels"]):
+             return False
+         if query.get("start_time") and str(log["timestamp"]) < str(query["start_time"]):
+             return False
+         if query.get("end_time") and str(log["timestamp"]) > str(query["end_time"]):
+             return False
+         if query.get("text_contains") and query["text_contains"].lower() not in str(log["message"]).lower():
+             return False
+         return True
+
+     @staticmethod
+     def _severity_rank(level: str) -> int:
+         order = {"CRITICAL": 0, "ERROR": 1, "WARN": 2, "INFO": 3}
+         return order.get(level, 4)
+
+     @staticmethod
+     def _register_action(session: IncidentSession, action: Action) -> int:
+         fingerprint_source = [action.action_type]
+         if action.query:
+             fingerprint_source.append(str(action.query.model_dump(exclude_none=True)))
+         if action.target_service:
+             fingerprint_source.append(action.target_service)
+         if action.hypothesis:
+             fingerprint_source.append(str(action.hypothesis.model_dump()))
+         if action.containment_plan:
+             fingerprint_source.append(",".join(action.containment_plan))
+         if action.report:
+             fingerprint_source.append(str(action.report.root_cause.model_dump()))
+         fingerprint = "::".join(fingerprint_source)
+         count = session.query_fingerprints.get(fingerprint, 0)
+         session.query_fingerprints[fingerprint] = count + 1
+         return count
+
315
+ def _build_observation(self, session: IncidentSession, feedback: Optional[str]) -> Observation:
316
+ spec = TASK_SPECS[session.task_id]
317
+ return Observation(
318
+ session_id=session.session_id,
319
+ task_id=session.task_id,
320
+ task_title=str(spec["title"]),
321
+ briefing=IncidentBriefing(
322
+ incident_id=str(spec["incident_id"]),
323
+ title=str(spec["title"]),
324
+ objective=str(spec["objective"]),
325
+ incident_window_start=str(spec["incident_window_start"]),
326
+ incident_window_end=str(spec["incident_window_end"]),
327
+ suspected_services=list(spec["suspected_services"]),
328
+ customer_statement=str(spec["customer_statement"]),
329
+ operational_constraints=list(spec["operational_constraints"]),
330
+ ),
331
+ dependency_graph=DEPENDENCY_GRAPH,
332
+ visible_logs=session.visible_logs(),
333
+ revealed_log_count=len(session.visible_log_ids),
334
+ visited_services=sorted(session.visited_services),
335
+ submitted_containment=list(session.containment_plan),
336
+ last_hypothesis=session.last_hypothesis,
337
+ step_number=session.step_number,
338
+ max_steps=session.max_steps,
339
+ feedback=feedback,
340
+ done=session.done,
341
+ )
342
+
343
+
344
+ store = SessionStore()
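Reviewer note on `_match_query`: the `start_time` and `end_time` filters compare timestamps as plain strings. That ordering is chronological only while every timestamp uses one fixed-width, zero-padded layout such as `2025-06-15 02:00:00` (the layout used in `tasks/catalog.py`). The sketch below is one possible stricter variant that parses the values first. The helper name `in_window` and the `TS_FORMAT` constant are hypothetical and not part of this commit.

```python
from datetime import datetime
from typing import Optional

# Assumed layout, matching the incident window strings in tasks/catalog.py.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def in_window(timestamp: str, start: Optional[str] = None, end: Optional[str] = None) -> bool:
    """Return True when `timestamp` lies inside the inclusive [start, end] window.

    Missing bounds are treated as open, mirroring how _match_query skips
    filters that are absent from the query.
    """
    ts = datetime.strptime(timestamp, TS_FORMAT)
    if start and ts < datetime.strptime(start, TS_FORMAT):
        return False
    if end and ts > datetime.strptime(end, TS_FORMAT):
        return False
    return True
```

Parsing also surfaces malformed timestamps as a `ValueError` instead of silently mis-ordering them, which may or may not be the desired behavior for agent-supplied queries.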
env/models.py ADDED
@@ -0,0 +1,142 @@
+ """
+ Typed models for the hardened NovaTech OpenEnv environment.
+ """
+ from __future__ import annotations
+
+ from typing import Any, Dict, List, Literal, Optional
+
+ from pydantic import BaseModel, Field
+
+ ServiceName = Literal[
+     "auth-service",
+     "payment-api",
+     "order-service",
+     "notification-service",
+     "reporting-service",
+     "user-service",
+ ]
+ ServerName = Literal["server_01", "server_02", "server_03", "server_04"]
+ LogLevel = Literal["INFO", "WARN", "ERROR", "CRITICAL"]
+ FailureMode = Literal[
+     "resource_exhaustion",
+     "dependency_outage",
+     "storage_saturation",
+     "certificate_expiry",
+     "application_bug",
+     "traffic_abuse",
+ ]
+ DependencyName = Literal["none", "payment-gateway", "mysql", "email-relay", "ldap-directory"]
+ CustomerImpact = Literal[
+     "login_failures",
+     "checkout_delays",
+     "order_write_failures",
+     "notification_delivery_failure",
+     "cross_service_major_incident",
+ ]
+ ContainmentActionName = Literal[
+     "increase_auth_heap",
+     "enable_login_rate_limiting",
+     "restore_payment_gateway_connectivity",
+     "reduce_checkout_retry_pressure",
+     "free_order_log_disk",
+     "reset_mysql_connection_pool",
+     "renew_smtp_certificate",
+     "reroute_notification_traffic",
+     "page_major_incident_team",
+     "block_all_login_traffic",
+     "wipe_application_logs",
+     "restart_everything",
+ ]
+
+
+ class LogEntry(BaseModel):
+     log_id: int
+     timestamp: str
+     server_id: ServerName
+     log_level: LogLevel
+     service_name: ServiceName
+     message: str
+     response_time_ms: int
+     cpu_usage_percent: float
+     memory_usage_percent: float
+
+
+ class IncidentBriefing(BaseModel):
+     incident_id: str
+     title: str
+     objective: str
+     incident_window_start: str
+     incident_window_end: str
+     suspected_services: List[ServiceName]
+     customer_statement: str
+     operational_constraints: List[str]
+
+
+ class RootCauseHypothesis(BaseModel):
+     primary_service: ServiceName
+     failure_mode: FailureMode
+     dependency: DependencyName = "none"
+     customer_impact: CustomerImpact
+     confidence: float = Field(..., ge=0.0, le=1.0)
+
+
+ class LogQuery(BaseModel):
+     service_name: Optional[ServiceName] = None
+     server_id: Optional[ServerName] = None
+     levels: Optional[List[LogLevel]] = None
+     start_time: Optional[str] = None
+     end_time: Optional[str] = None
+     text_contains: Optional[str] = Field(default=None, max_length=80)
+     limit: int = Field(default=6, ge=1, le=6)
+
+
+ class IncidentReport(BaseModel):
+     evidence_log_ids: List[int] = Field(default_factory=list, min_length=1)
+     impacted_services: List[ServiceName] = Field(default_factory=list, min_length=1)
+     root_cause: RootCauseHypothesis
+     containment_plan: List[ContainmentActionName] = Field(default_factory=list)
+     summary: str = Field(..., min_length=20, max_length=600)
+
+
+ class Action(BaseModel):
+     session_id: Optional[str] = None
+     action_type: Literal[
+         "query_logs",
+         "inspect_dependencies",
+         "update_hypothesis",
+         "execute_containment",
+         "submit_report",
+         "request_more",
+         "no_anomalies",
+     ]
+     query: Optional[LogQuery] = None
+     target_service: Optional[ServiceName] = None
+     hypothesis: Optional[RootCauseHypothesis] = None
+     containment_plan: Optional[List[ContainmentActionName]] = None
+     report: Optional[IncidentReport] = None
+
+
+ class Observation(BaseModel):
+     session_id: str
+     task_id: str
+     task_title: str
+     briefing: IncidentBriefing
+     dependency_graph: Dict[ServiceName, List[str]]
+     visible_logs: List[LogEntry]
+     revealed_log_count: int
+     visited_services: List[ServiceName]
+     submitted_containment: List[ContainmentActionName]
+     last_hypothesis: Optional[RootCauseHypothesis] = None
+     step_number: int = 0
+     max_steps: int = 8
+     feedback: Optional[str] = None
+     done: bool = False
+
+
+ class Reward(BaseModel):
+     value: float = Field(..., ge=0.0, le=1.0)
+     signal_reward: float = Field(default=0.0, ge=0.0, le=1.0)
+     hypothesis_reward: float = Field(default=0.0, ge=0.0, le=1.0)
+     efficiency_reward: float = Field(default=0.0, ge=0.0, le=1.0)
+     penalty: float = Field(default=0.0, ge=0.0, le=1.0)
+     info: Dict[str, Any] = Field(default_factory=dict)
inference.py ADDED
@@ -0,0 +1,257 @@
+ from __future__ import annotations
+
+ import os
+ from typing import Any, Dict, List, Optional
+
+ import requests
+ from openai import OpenAI
+
+ API_KEY = os.getenv("HF_TOKEN", "")
+ API_BASE_URL = os.getenv("API_BASE_URL", "https://api.openai.com/v1")
+ MODEL_NAME = os.getenv("MODEL_NAME", "gpt-4o-mini")
+ LOGENV_URL = os.getenv("LOGENV_URL", "http://localhost:7860")
+ BENCHMARK = "NovaTechIncidentCommand"
+ SUCCESS_THRESHOLD = 0.70
+
+ client = OpenAI(api_key=API_KEY or "placeholder", base_url=API_BASE_URL)
+
+
+ def log_start(task: str, env: str, model: str) -> None:
+     print(f"[START] task={task} env={env} model={model}", flush=True)
+
+
+ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
+     print(
+         f"[STEP] step={step} action={action} reward={reward:.2f} done={str(done).lower()} error={error if error else 'null'}",
+         flush=True,
+     )
+
+
+ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+     print(
+         f"[END] success={str(success).lower()} steps={steps} score={max(0.0, min(1.0, score)):.3f} rewards={','.join(f'{r:.2f}' for r in rewards)}",
+         flush=True,
+     )
+
+
+ def api_reset(task_id: str) -> Dict[str, Any]:
+     response = requests.post(f"{LOGENV_URL}/reset", json={"task_id": task_id}, timeout=30)
+     response.raise_for_status()
+     return response.json()
+
+
+ def api_step(payload: Dict[str, Any]) -> Dict[str, Any]:
+     response = requests.post(f"{LOGENV_URL}/step", json=payload, timeout=60)
+     response.raise_for_status()
+     return response.json()
+
+
+ def maybe_ping_model(task_id: str) -> None:
+     if not API_KEY:
+         return
+     try:
+         client.responses.create(
+             model=MODEL_NAME,
+             input=f"Reply with ACK for {task_id}.",
+             temperature=0,
+             max_output_tokens=4,
+         )
+     except Exception:
+         pass
+
+
+ def _severity_score(log: Dict[str, Any]) -> float:
+     level_weight = {"CRITICAL": 4.0, "ERROR": 3.0, "WARN": 1.0, "INFO": 0.2}
+     score = level_weight.get(str(log["log_level"]).upper(), 0.0)
+     if float(log.get("cpu_usage_percent", 0.0)) >= 90.0:
+         score += 1.0
+     if float(log.get("memory_usage_percent", 0.0)) >= 95.0:
+         score += 1.0
+     if int(log.get("response_time_ms", 0)) >= 3000:
+         score += 1.0
+     message = str(log["message"]).lower()
+     for needle, bonus in {
+         "outofmemoryerror": 2.0,
+         "connection refused": 2.0,
+         "disk full": 2.0,
+         "ssl certificate expired": 1.8,
+         "segmentation fault": 1.8,
+         "timeout exceeded": 1.0,
+     }.items():
+         if needle in message:
+             score += bonus
+     return score
+
+
+ def _infer_hypothesis(observation: Dict[str, Any]) -> Dict[str, Any]:
+     logs = sorted(observation.get("visible_logs", []), key=_severity_score, reverse=True)
+     services = {log["service_name"] for log in logs}
+     messages = " ".join(str(log["message"]).lower() for log in logs)
+     if "outofmemoryerror" in messages and {"payment-api", "order-service", "notification-service"} & services:
+         return {
+             "primary_service": "auth-service",
+             "failure_mode": "resource_exhaustion",
+             "dependency": "payment-api",
+             "customer_impact": "cross_service_major_incident",
+             "confidence": 0.92,
+         }
+     if "connection refused" in messages or "payment confirmation" in messages:
+         return {
+             "primary_service": "payment-api",
+             "failure_mode": "dependency_outage",
+             "dependency": "payment-gateway",
+             "customer_impact": "checkout_delays",
+             "confidence": 0.87,
+         }
+     if "disk full" in messages:
+         return {
+             "primary_service": "order-service",
+             "failure_mode": "storage_saturation",
+             "dependency": "mysql",
+             "customer_impact": "order_write_failures",
+             "confidence": 0.82,
+         }
+     if "ssl certificate expired" in messages or "email-relay" in messages:
+         return {
+             "primary_service": "notification-service",
+             "failure_mode": "certificate_expiry",
+             "dependency": "email-relay",
+             "customer_impact": "notification_delivery_failure",
+             "confidence": 0.81,
+         }
+     return {
+         "primary_service": observation["briefing"]["suspected_services"][0],
+         "failure_mode": "traffic_abuse",
+         "dependency": "none",
+         "customer_impact": "login_failures",
+         "confidence": 0.55,
+     }
+
+
+ def _containment_for_hypothesis(hypothesis: Dict[str, Any]) -> List[str]:
+     if hypothesis["primary_service"] == "auth-service" and hypothesis["customer_impact"] == "cross_service_major_incident":
+         return [
+             "increase_auth_heap",
+             "enable_login_rate_limiting",
+             "restore_payment_gateway_connectivity",
+             "free_order_log_disk",
+             "renew_smtp_certificate",
+             "page_major_incident_team",
+         ]
+     if hypothesis["primary_service"] == "payment-api":
+         return ["restore_payment_gateway_connectivity", "reduce_checkout_retry_pressure"]
+     if hypothesis["primary_service"] == "order-service":
+         return ["free_order_log_disk", "reset_mysql_connection_pool"]
+     if hypothesis["primary_service"] == "notification-service":
+         return ["renew_smtp_certificate", "reroute_notification_traffic"]
+     return ["increase_auth_heap", "enable_login_rate_limiting"]
+
+
+ def _build_report(observation: Dict[str, Any], hypothesis: Dict[str, Any]) -> Dict[str, Any]:
+     logs = sorted(observation.get("visible_logs", []), key=lambda log: _severity_score(log), reverse=True)
+     evidence_ids = [int(log["log_id"]) for log in logs[: min(10, len(logs))]]
+     impacted_services = sorted({log["service_name"] for log in logs if _severity_score(log) >= 3.0})
+     if not impacted_services:
+         impacted_services = [hypothesis["primary_service"]]
+     return {
+         "evidence_log_ids": evidence_ids,
+         "impacted_services": impacted_services,
+         "root_cause": hypothesis,
+         "containment_plan": _containment_for_hypothesis(hypothesis),
+         "summary": (
+             f"The most likely incident source is {hypothesis['primary_service']} with failure mode "
+             f"{hypothesis['failure_mode']}, creating customer impact {hypothesis['customer_impact']}."
+         ),
+     }
+
+
+ def run_task(task_id: str) -> float:
+     rewards: List[float] = []
+     steps_taken = 0
+     final_score = 0.0
+     success = False
+     observation: Dict[str, Any] | None = None
+
+     log_start(task_id, BENCHMARK, MODEL_NAME)
+     try:
+         observation = api_reset(task_id)
+         session_id = observation["session_id"]
+         maybe_ping_model(task_id)
+
+         query_payload = {
+             "session_id": session_id,
+             "action_type": "query_logs",
+             "query": {
+                 "levels": ["CRITICAL", "ERROR"],
+                 "start_time": observation["briefing"]["incident_window_start"],
+                 "end_time": observation["briefing"]["incident_window_end"],
+                 "limit": 6,
+             },
+         }
+         result = api_step(query_payload)
+         observation = result["observation"]
+         rewards.append(float(result["reward"]["value"]))
+         steps_taken = 1
+         log_step(1, "query_logs", rewards[-1], bool(result["done"]), None)
+
+         target_service = max(
+             observation["briefing"]["suspected_services"],
+             key=lambda service: sum(1 for log in observation["visible_logs"] if log["service_name"] == service),
+         )
+         dep_payload = {
+             "session_id": session_id,
+             "action_type": "inspect_dependencies",
+             "target_service": target_service,
+         }
+         result = api_step(dep_payload)
+         observation = result["observation"]
+         rewards.append(float(result["reward"]["value"]))
+         steps_taken = 2
+         log_step(2, f"inspect_dependencies({target_service})", rewards[-1], bool(result["done"]), None)
+
+         hypothesis = _infer_hypothesis(observation)
+         hyp_payload = {
+             "session_id": session_id,
+             "action_type": "update_hypothesis",
+             "hypothesis": hypothesis,
+         }
+         result = api_step(hyp_payload)
+         observation = result["observation"]
+         rewards.append(float(result["reward"]["value"]))
+         steps_taken = 3
+         log_step(3, "update_hypothesis", rewards[-1], bool(result["done"]), None)
+
+         containment_payload = {
+             "session_id": session_id,
+             "action_type": "execute_containment",
+             "containment_plan": _containment_for_hypothesis(hypothesis),
+         }
+         result = api_step(containment_payload)
+         observation = result["observation"]
+         rewards.append(float(result["reward"]["value"]))
+         steps_taken = 4
+         log_step(4, "execute_containment", rewards[-1], bool(result["done"]), None)
+
+         report_payload = {
+             "session_id": session_id,
+             "action_type": "submit_report",
+             "report": _build_report(observation, hypothesis),
+         }
+         result = api_step(report_payload)
+         final_score = float(result["reward"]["value"])
+         rewards.append(final_score)
+         steps_taken = 5
+         log_step(5, "submit_report", final_score, bool(result["done"]), None)
+         success = final_score >= SUCCESS_THRESHOLD
+     except Exception as exc:
+         log_step(steps_taken + 1 if steps_taken else 1, "error", 0.0, True, str(exc).replace("\n", " "))
+         final_score = 0.0
+         success = False
+     finally:
+         log_end(success, steps_taken if steps_taken else 1, final_score, rewards or [0.0])
+     return final_score
+
+
+ if __name__ == "__main__":
+     for task_name in ("easy", "medium", "hard"):
+         run_task(task_name)
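Reviewer note on the log format: `log_step` prints one `[STEP]` line per action, and `preflight.sh` only checks that such lines exist. A consumer that needs the values could parse them as sketched below. The pattern is derived from the f-string in `log_step`; the names `STEP_PATTERN` and `parse_step_line` are hypothetical helpers, not part of this commit.

```python
import re
from typing import Dict

# Mirrors the f-string in log_step: step, action, reward (two decimals),
# done (lowercase bool), and a free-text error that is "null" when absent.
STEP_PATTERN = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>.+?) "
    r"reward=(?P<reward>-?\d+\.\d+) done=(?P<done>true|false) error=(?P<error>.*)$"
)


def parse_step_line(line: str) -> Dict[str, object]:
    """Parse one [STEP] line into typed fields; raise ValueError on other input."""
    match = STEP_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not a [STEP] line: {line!r}")
    error = match.group("error")
    return {
        "step": int(match.group("step")),
        "action": match.group("action"),
        "reward": float(match.group("reward")),
        "done": match.group("done") == "true",
        "error": None if error == "null" else error,
    }
```

The `error` field is matched last and greedily because exception messages may contain spaces and `=` characters, while the other fields do not.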
novatech_logs.db ADDED
Binary file (94.2 kB).
openenv.yaml ADDED
@@ -0,0 +1,66 @@
+ name: NovaTechIncidentCommand
+ description: >
+   Seeded OpenEnv incident-response benchmark built from a realistic NovaTech log corpus.
+   Agents operate under partial observability: they must query logs, inspect dependencies,
+   update a structured causal hypothesis, choose safe containment, and submit a final report.
+
+ tasks:
+   - id: easy
+     description: Detect a clear login outage caused by auth-service heap exhaustion.
+   - id: medium
+     description: Resolve competing hypotheses during a payment confirmation outage.
+   - id: hard
+     description: Reconstruct a cascading multi-service incident under partial observability.
+
+ action_space:
+   type: structured
+   fields:
+     session_id: string
+     action_type: "query_logs | inspect_dependencies | update_hypothesis | execute_containment | submit_report | request_more | no_anomalies"
+     query: "optional structured filter with service_name, server_id, levels, start_time, end_time, text_contains, limit"
+     target_service: "optional service name"
+     hypothesis: "optional structured tuple: primary_service, failure_mode, dependency, customer_impact, confidence"
+     containment_plan: "optional list of containment action names"
+     report: "optional structured report with evidence_log_ids, impacted_services, root_cause, containment_plan, summary"
+
+ observation_space:
+   type: structured
+   fields:
+     session_id: string
+     task_id: string
+     task_title: string
+     briefing: "structured incident briefing with incident window, objective, suspected_services, customer_statement, operational_constraints"
+     dependency_graph: "service dependency map"
+     visible_logs: "list of currently revealed log entries only"
+     revealed_log_count: integer
+     visited_services: "list of services explored so far"
+     submitted_containment: "list of chosen containment actions"
+     last_hypothesis: "optional structured root-cause hypothesis"
+     step_number: integer
+     max_steps: integer
+     feedback: string
+     done: boolean
+   notes:
+     - "Observations expose only agent-revealed logs."
+     - "The dependency graph is visible, but hidden logs and gold evidence remain private."
+     - "The latest structured hypothesis is included so agents can reason iteratively."
+
+ reward_definition:
+   type: scalar
+   range: [0.0, 1.0]
+   components:
+     signal_reward: "Rewards newly discovered relevant signals and evidence quality."
+     hypothesis_reward: "Rewards improvement toward the gold causal tuple and safe containment alignment."
+     efficiency_reward: "Rewards solving within the action budget."
+     penalty: "Penalizes unseen evidence, contradictions, forbidden containment, loops, and empty queries."
+   techniques:
+     - "Information-gain shaping: focused discovery beats broad noisy retrieval."
+     - "Best-hypothesis tracking: reward is tied to causal improvement across the episode."
+     - "Observation-consistent grading: unseen evidence references are rejected."
+     - "Contradiction penalties: evidence, cause, impact, and timeline must agree."
+     - "Safety shaping: destructive containment is penalized even if diagnosis is partially correct."
+
+ interfaces:
+   reset: "reset() -> initial observation"
+   step: "step(action) -> observation, reward, done, info"
+   state: "state() -> non-leaking public session state"
preflight.sh ADDED
@@ -0,0 +1,49 @@
+ #!/usr/bin/env bash
+ set -euo pipefail
+
+ ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+ API_URL="${API_URL:-http://127.0.0.1:7860}"
+
+ cleanup() {
+   if [[ -n "${UVICORN_PID:-}" ]]; then
+     kill "${UVICORN_PID}" >/dev/null 2>&1 || true
+   fi
+ }
+ trap cleanup EXIT
+
+ cd "${ROOT_DIR}"
+
+ python3 -m py_compile app.py inference.py env/models.py env/environment.py tasks/catalog.py tasks/graders.py data/db_loader.py
+
+ if command -v openenv >/dev/null 2>&1; then
+   openenv validate
+ fi
+
+ python3 -m uvicorn app:app --host 127.0.0.1 --port 7860 >/tmp/logenv2_uvicorn.log 2>&1 &
+ UVICORN_PID=$!
+ sleep 2
+
+ curl -sf "${API_URL}/health" >/tmp/logenv2_health.json
+ curl -sf -X POST -H "Content-Type: application/json" -d '{"task_id":"easy","seed":42}' "${API_URL}/reset" >/tmp/logenv2_reset.json
+
+ SESSION_ID="$(python3 - <<'PY'
+ import json
+ from pathlib import Path
+ print(json.loads(Path('/tmp/logenv2_reset.json').read_text())['session_id'])
+ PY
+ )"
+
+ curl -sf -X POST -H "Content-Type: application/json" \
+   -d "{\"session_id\":\"${SESSION_ID}\",\"action_type\":\"query_logs\",\"query\":{\"levels\":[\"CRITICAL\",\"ERROR\"],\"limit\":4}}" \
+   "${API_URL}/step" >/tmp/logenv2_step.json
+
+ LOGENV_URL="${API_URL}" python3 inference.py >/tmp/logenv2_inference.log
+
+ python3 - <<'PY'
+ from pathlib import Path
+ lines = [line.strip() for line in Path("/tmp/logenv2_inference.log").read_text().splitlines() if line.strip()]
+ assert any(line.startswith("[START] ") for line in lines)
+ assert any(line.startswith("[STEP] ") for line in lines)
+ assert any(line.startswith("[END] ") for line in lines)
+ print("preflight ok")
+ PY
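Reviewer note on startup timing: `preflight.sh` waits a fixed `sleep 2` before probing `/health`, which can fail on a slow host and wastes time on a fast one. One alternative is to poll until the endpoint answers. The sketch below is Python to match the rest of the repository; `http_ok` and `wait_until_ready` are hypothetical helpers, and the timeout values are arbitrary assumptions.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Probe a URL once; connection errors and non-2xx answers count as not ready."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe: Callable[[], bool], timeout: float = 15.0, interval: float = 0.25) -> bool:
    """Call `probe` repeatedly until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False
```

A script could then call `wait_until_ready(lambda: http_ok("http://127.0.0.1:7860/health"))` and abort when it returns `False`. Passing the probe as a callable keeps the loop testable without a running server.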
requirements.txt CHANGED
@@ -1,3 +1,6 @@
- altair
- pandas
- streamlit
+ fastapi==0.115.0
+ uvicorn[standard]==0.30.6
+ pydantic==2.7.4
+ requests==2.32.3
+ pyyaml==6.0.2
+ openai==2.9.0
tasks/__init__.py ADDED
File without changes
tasks/__pycache__/catalog.cpython-314.pyc ADDED
Binary file (5.58 kB).
tasks/__pycache__/graders.cpython-314.pyc ADDED
Binary file (12.6 kB).
tasks/catalog.py ADDED
@@ -0,0 +1,133 @@
+ """
+ Task catalog for the hardened NovaTech incident environment.
+ """
+ from __future__ import annotations
+
+ from typing import Dict, List
+
+
+ DEPENDENCY_GRAPH: Dict[str, List[str]] = {
+     "auth-service": ["user-service", "payment-api", "ldap-directory"],
+     "payment-api": ["auth-service", "payment-gateway", "mysql"],
+     "order-service": ["payment-api", "mysql", "notification-service"],
+     "notification-service": ["order-service", "email-relay"],
+     "reporting-service": ["mysql"],
+     "user-service": ["auth-service", "ldap-directory"],
+ }
+
+
+ CONTAINMENT_DESCRIPTIONS: Dict[str, str] = {
+     "increase_auth_heap": "Increase heap headroom for auth-service.",
+     "enable_login_rate_limiting": "Slow abusive login traffic without fully blocking healthy users.",
+     "restore_payment_gateway_connectivity": "Repair egress, routes, or credentials to the payment gateway.",
+     "reduce_checkout_retry_pressure": "Reduce retry storms and non-critical checkout retries.",
+     "free_order_log_disk": "Recover /var/log capacity on order-service hosts.",
+     "reset_mysql_connection_pool": "Safely recycle exhausted database connection pools.",
+     "renew_smtp_certificate": "Renew SMTP or relay TLS certificates before reconnecting.",
+     "reroute_notification_traffic": "Use a safe backup path for notifications.",
+     "page_major_incident_team": "Escalate to major-incident coordination.",
+     "block_all_login_traffic": "Broadly block all login traffic.",
+     "wipe_application_logs": "Delete logs to free resources.",
+     "restart_everything": "Restart all major services immediately.",
+ }
+
+
+ TASK_SPECS: Dict[str, Dict[str, object]] = {
+     "easy": {
+         "incident_id": "INC-2101",
+         "title": "Auth Heap Exhaustion",
+         "max_steps": 6,
+         "objective": "Detect the direct cause of a login outage and choose a safe first containment move.",
+         "incident_window_start": "2025-06-15 02:00:00",
+         "incident_window_end": "2025-06-15 02:25:59",
+         "suspected_services": ["auth-service", "user-service", "payment-api"],
+         "customer_statement": "Support agents report that enterprise admins cannot log in reliably.",
+         "operational_constraints": [
+             "Do not erase evidence.",
+             "Do not fully block all login traffic.",
+             "You have six actions before leadership expects a recommendation.",
+         ],
+         "scope_servers": ["server_01"],
+         "scope_services": ["auth-service", "user-service", "payment-api", "notification-service"],
+         "must_include_ids": [72, 74, 75, 76, 77],
+         "noise_sample_size": 8,
+         "gold_evidence_ids": [74, 76, 77],
+         "plausible_but_wrong_ids": [72, 75],
+         "root_cause": {
+             "primary_service": "auth-service",
+             "failure_mode": "resource_exhaustion",
+             "dependency": "none",
+             "customer_impact": "login_failures",
+         },
+         "impacted_services": ["auth-service"],
+         "recommended_containment": ["increase_auth_heap", "enable_login_rate_limiting"],
+         "forbidden_containment": ["block_all_login_traffic", "wipe_application_logs", "restart_everything"],
+     },
+     "medium": {
+         "incident_id": "INC-2144",
+         "title": "Checkout Competing Hypotheses",
+         "max_steps": 7,
+         "objective": "Distinguish a genuine payment dependency outage from plausible but unrelated upstream noise.",
+         "incident_window_start": "2025-06-15 06:20:00",
+         "incident_window_end": "2025-06-15 06:45:59",
+         "suspected_services": ["payment-api", "auth-service", "user-service"],
+         "customer_statement": "Customers complete checkout, but confirmations remain pending for tens of seconds.",
+         "operational_constraints": [
+             "Keep checkout partially available if possible.",
+             "Avoid blind restarts.",
+             "You must justify why the leading hypothesis beats the alternative.",
+         ],
+         "scope_servers": ["server_01", "server_02"],
+         "scope_services": ["payment-api", "auth-service", "user-service", "order-service"],
+         "must_include_ids": [74, 76, 77, 193, 194, 195, 607],
+         "noise_sample_size": 12,
+         "gold_evidence_ids": [193, 194, 195],
+         "plausible_but_wrong_ids": [74, 76, 77, 607],
+         "root_cause": {
+             "primary_service": "payment-api",
+             "failure_mode": "dependency_outage",
+             "dependency": "payment-gateway",
+             "customer_impact": "checkout_delays",
+         },
+         "impacted_services": ["payment-api"],
+         "recommended_containment": ["restore_payment_gateway_connectivity", "reduce_checkout_retry_pressure"],
+         "forbidden_containment": ["restart_everything", "wipe_application_logs"],
+     },
+     "hard": {
+         "incident_id": "INC-2199",
+         "title": "Cascading Multi-Service Incident",
+         "max_steps": 9,
+         "objective": "Reconstruct a cascading outage under partial observability, identify the initiating service, and choose layered containment.",
+         "incident_window_start": "2025-06-15 02:00:00",
+         "incident_window_end": "2025-06-15 18:45:00",
+         "suspected_services": ["auth-service", "payment-api", "order-service", "notification-service", "reporting-service"],
+         "customer_statement": "Users report failed logins, stuck payments, delayed orders, and missing outbound notifications.",
+         "operational_constraints": [
+             "Preserve evidence across services.",
+             "Prioritize customer safety and data integrity over broad resets.",
+             "A major-incident bridge is already open.",
+         ],
+         "scope_servers": ["server_01", "server_02", "server_03", "server_04"],
+         "scope_services": ["auth-service", "payment-api", "order-service", "notification-service", "reporting-service", "user-service"],
+         "must_include_ids": [72, 74, 76, 77, 193, 194, 195, 266, 267, 334, 426, 429, 481, 564, 607],
+         "noise_sample_size": 24,
+         "gold_evidence_ids": [74, 76, 77, 193, 194, 266, 267, 426, 429, 564],
+         "plausible_but_wrong_ids": [195, 334, 481, 607],
+         "root_cause": {
+             "primary_service": "auth-service",
+             "failure_mode": "resource_exhaustion",
+             "dependency": "payment-api",
+             "customer_impact": "cross_service_major_incident",
+         },
+         "impacted_services": ["auth-service", "payment-api", "order-service", "notification-service"],
+         "recommended_containment": [
+             "increase_auth_heap",
+             "enable_login_rate_limiting",
+             "restore_payment_gateway_connectivity",
+             "free_order_log_disk",
+             "renew_smtp_certificate",
+             "page_major_incident_team",
+         ],
+         "forbidden_containment": ["wipe_application_logs", "block_all_login_traffic", "restart_everything"],
+     },
+ }
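Reviewer note on catalog consistency: every task in `TASK_SPECS` appears to satisfy a few unstated invariants. Gold and decoy evidence IDs are both drawn from `must_include_ids`, the two sets do not overlap, and no containment action is both recommended and forbidden. The sketch below is one way to check these when adding tasks. The invariants are inferred from the three entries in this commit rather than documented by the author, and `catalog_issues` is a hypothetical helper.

```python
from typing import Dict, List


def catalog_issues(specs: Dict[str, Dict[str, object]]) -> List[str]:
    """Return human-readable problems found in a TASK_SPECS-shaped mapping."""
    issues: List[str] = []
    for task_id, spec in specs.items():
        seeded = set(spec["must_include_ids"])
        gold = set(spec["gold_evidence_ids"])
        decoys = set(spec["plausible_but_wrong_ids"])
        recommended = set(spec["recommended_containment"])
        forbidden = set(spec["forbidden_containment"])
        # Assumed invariant: graded evidence must be reachable in the seeded session.
        if not gold <= seeded:
            issues.append(f"{task_id}: gold evidence not seeded: {sorted(gold - seeded)}")
        if not decoys <= seeded:
            issues.append(f"{task_id}: decoy evidence not seeded: {sorted(decoys - seeded)}")
        # Assumed invariant: a log cannot be both gold and a decoy.
        if gold & decoys:
            issues.append(f"{task_id}: IDs are both gold and decoy: {sorted(gold & decoys)}")
        # Assumed invariant: an action cannot be both rewarded and penalized.
        if recommended & forbidden:
            issues.append(f"{task_id}: actions both recommended and forbidden: {sorted(recommended & forbidden)}")
    return issues
```

Calling `catalog_issues(TASK_SPECS)` from a test or from `preflight.sh` would be one way to catch a mistyped log ID before it silently lowers achievable scores.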
tasks/graders.py ADDED
@@ -0,0 +1,177 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Structured deterministic graders for NovaTech incidents.
3
+ """
4
+ from __future__ import annotations
5
+
6
+ from typing import Dict, Iterable, List, Sequence, Set, Tuple
7
+
8
+ from env.models import IncidentReport, Reward, RootCauseHypothesis
9
+ from tasks.catalog import TASK_SPECS
10
+
11
+
12
+ def _set_f1(predicted: Iterable[int], gold: Iterable[int]) -> Tuple[float, float, float]:
13
+ pred = set(int(x) for x in predicted)
14
+ truth = set(int(x) for x in gold)
15
+ tp = len(pred & truth)
16
+ fp = len(pred - truth)
17
+ fn = len(truth - pred)
18
+ precision = tp / (tp + fp) if (tp + fp) else 0.0
19
+ recall = tp / (tp + fn) if (tp + fn) else 0.0
20
+ f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
21
+ return round(f1, 4), round(precision, 4), round(recall, 4)
22
+
23
+
+ def hypothesis_match_score(hypothesis: RootCauseHypothesis | None, task_id: str) -> float:
+     if hypothesis is None:
+         return 0.0
+     gold = TASK_SPECS[task_id]["root_cause"]
+     return round(
+         0.40 * float(hypothesis.primary_service == gold["primary_service"])
+         + 0.30 * float(hypothesis.failure_mode == gold["failure_mode"])
+         + 0.15 * float(hypothesis.dependency == gold["dependency"])
+         + 0.15 * float(hypothesis.customer_impact == gold["customer_impact"]),
+         4,
+     )
+
+
+ def containment_alignment(actions: Sequence[str], task_id: str) -> Tuple[float, float]:
+     spec = TASK_SPECS[task_id]
+     recommended = set(spec["recommended_containment"])
+     forbidden = set(spec["forbidden_containment"])
+     chosen = set(actions)
+     positive = len(chosen & recommended) / len(recommended) if recommended else 0.0
+     negative = len(chosen & forbidden) / len(forbidden) if forbidden else 0.0
+     return round(positive, 4), round(negative, 4)
+
+
+ def impacted_service_score(predicted: Sequence[str], task_id: str) -> float:
+     gold = set(TASK_SPECS[task_id]["impacted_services"])
+     pred = set(predicted)
+     if not gold:
+         return 0.0
+     return round(len(pred & gold) / len(gold), 4)
+
+
+ def _evidence_consistency(report: IncidentReport, revealed_log_ids: Set[int], task_id: str) -> Tuple[float, float, float, List[str]]:
+     issues: List[str] = []
+     evidence = list(report.evidence_log_ids)
+     if any(log_id not in revealed_log_ids for log_id in evidence):
+         unseen = sorted(log_id for log_id in evidence if log_id not in revealed_log_ids)
+         issues.append(f"Unseen evidence referenced: {unseen}")
+         return 0.0, 0.0, 0.0, issues
+
+     spec = TASK_SPECS[task_id]
+     gold_f1, precision, recall = _set_f1(evidence, spec["gold_evidence_ids"])
+     if recall < 0.5:
+         issues.append("Evidence misses too many key signals.")
+     if precision < 0.5:
+         issues.append("Evidence includes too many irrelevant signals.")
+     return gold_f1, precision, recall, issues
+
+
+ def _causal_consistency(report: IncidentReport, task_id: str, revealed_log_map: Dict[int, Dict[str, object]]) -> Tuple[float, List[str]]:
+     issues: List[str] = []
+     cause_score = hypothesis_match_score(report.root_cause, task_id)
+     evidence_logs = [revealed_log_map[log_id] for log_id in report.evidence_log_ids if log_id in revealed_log_map]
+     if not evidence_logs:
+         return 0.0, ["No visible evidence supplied."]
+
+     service_present = any(log["service_name"] == report.root_cause.primary_service for log in evidence_logs)
+     if not service_present:
+         issues.append("Root cause service is not supported by selected evidence.")
+         cause_score *= 0.4
+
+     earliest = min(evidence_logs, key=lambda item: item["timestamp"])
+     if task_id == "hard" and earliest["service_name"] != report.root_cause.primary_service:
+         issues.append("Selected timeline does not start with the claimed initiating service.")
+         cause_score *= 0.7
+
+     if report.root_cause.customer_impact == "checkout_delays":
+         payment_evidence = any(log["service_name"] == "payment-api" for log in evidence_logs)
+         if not payment_evidence:
+             issues.append("Checkout impact claimed without payment-api evidence.")
+             cause_score *= 0.5
+     if report.root_cause.customer_impact == "cross_service_major_incident":
+         covered = {log["service_name"] for log in evidence_logs}
+         expected = {"auth-service", "payment-api", "order-service", "notification-service"}
+         if len(covered & expected) < 3:
+             issues.append("Cross-service incident claimed without cross-service evidence.")
+             cause_score *= 0.5
+
+     return round(max(0.0, cause_score), 4), issues
+
+
+ def build_dense_reward(
+     *,
+     signal_reward: float,
+     hypothesis_reward: float,
+     efficiency_reward: float,
+     penalty: float,
+     info: Dict[str, object],
+ ) -> Reward:
+     value = max(0.0, min(1.0, round((0.55 * signal_reward) + (0.25 * hypothesis_reward) + (0.20 * efficiency_reward) - (0.30 * penalty), 4)))
+     return Reward(
+         value=value,
+         signal_reward=round(signal_reward, 4),
+         hypothesis_reward=round(hypothesis_reward, 4),
+         efficiency_reward=round(efficiency_reward, 4),
+         penalty=round(penalty, 4),
+         info=info,
+     )
+
+
+ def grade_report(
+     *,
+     task_id: str,
+     report: IncidentReport,
+     revealed_log_ids: Set[int],
+     revealed_log_map: Dict[int, Dict[str, object]],
+     step_number: int,
+     max_steps: int,
+     repeated_action_count: int,
+ ) -> Reward:
+     evidence_score, precision, recall, evidence_issues = _evidence_consistency(report, revealed_log_ids, task_id)
+     if evidence_score == 0.0 and evidence_issues and "Unseen evidence" in evidence_issues[0]:
+         return build_dense_reward(
+             signal_reward=0.0,
+             hypothesis_reward=0.0,
+             efficiency_reward=0.0,
+             penalty=1.0,
+             info={"issues": evidence_issues, "message": "Report rejected due to unseen evidence."},
+         )
+
+     cause_score, cause_issues = _causal_consistency(report, task_id, revealed_log_map)
+     impact_score = impacted_service_score(report.impacted_services, task_id)
+     positive_containment, forbidden_containment = containment_alignment(report.containment_plan, task_id)
+     contradiction_penalty = 0.0
+     if not set(report.impacted_services) & set(TASK_SPECS[task_id]["impacted_services"]):
+         contradiction_penalty += 0.4
+     if recall < 0.5:
+         contradiction_penalty += 0.45
+     if precision < 0.5:
+         contradiction_penalty += 0.30
+     if forbidden_containment > 0:
+         contradiction_penalty += min(0.7, forbidden_containment)
+     if repeated_action_count > 0:
+         contradiction_penalty += min(0.2, repeated_action_count / max(1.0, float(step_number)))
+
+     signal_reward = round((0.75 * evidence_score) + (0.25 * impact_score), 4)
+     hypothesis_reward = round((0.80 * cause_score) + (0.20 * positive_containment), 4)
+     efficiency_reward = max(0.0, round(1.0 - ((step_number - 1) / max(1, max_steps - 1)), 4))
+     penalty = round(min(1.0, contradiction_penalty), 4)
+     return build_dense_reward(
+         signal_reward=signal_reward,
+         hypothesis_reward=hypothesis_reward,
+         efficiency_reward=efficiency_reward,
+         penalty=penalty,
+         info={
+             "evidence_score": evidence_score,
+             "precision": precision,
+             "recall": recall,
+             "cause_score": cause_score,
+             "impact_score": impact_score,
+             "positive_containment": positive_containment,
+             "forbidden_containment": forbidden_containment,
+             "issues": evidence_issues + cause_issues,
+         },
+     )
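
Reviewer note on `tasks/graders.py`: the scoring arithmetic is easier to check with concrete numbers. The sketch below is not part of the commit. It re-implements only the set-F1 calculation and the weighted clamp from `build_dense_reward` as two standalone helpers, so it runs without `env.models` or `tasks.catalog`. The helper names `set_f1` and `dense_value` and the sample log ids are hypothetical; the weights (0.55, 0.25, 0.20, 0.30) are copied from the diff above.

```python
from typing import Iterable, Tuple


def set_f1(predicted: Iterable[int], gold: Iterable[int]) -> Tuple[float, float, float]:
    """Return (f1, precision, recall) over two sets of log ids, as _set_f1 does."""
    pred, truth = set(predicted), set(gold)
    tp = len(pred & truth)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return round(f1, 4), round(precision, 4), round(recall, 4)


def dense_value(signal: float, hypothesis: float, efficiency: float, penalty: float) -> float:
    """Weighted sum clamped to [0, 1], using the weights from build_dense_reward."""
    raw = 0.55 * signal + 0.25 * hypothesis + 0.20 * efficiency - 0.30 * penalty
    return max(0.0, min(1.0, round(raw, 4)))


# Hypothetical report: four cited logs, two of which are in the gold set.
f1, precision, recall = set_f1([3, 7, 9, 10], [3, 7, 11, 12])

# Signal reward as in grade_report, assuming every impacted service was named.
signal = 0.75 * f1 + 0.25 * 1.0

# The rejection path (all components zero, penalty 1.0) clamps to zero.
rejected = dense_value(0.0, 0.0, 0.0, 1.0)

# A full penalty subtracts at most 0.30 from an otherwise perfect report.
penalised_perfect = dense_value(1.0, 1.0, 1.0, 1.0)
```

Because the penalty term is weighted at 0.30, a report that scores perfectly on every component but also incurs the maximum contradiction penalty still receives about 0.7. Whether that ceiling on the penalty is intended is worth confirming with the author.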