Spaces: XcodeAddy/incident-triage-env


XcodeAddy committed on Apr 11

Commit

35ea9cd

1 Parent(s): 9347ce5

Initial SetUp Fixed


Files changed (27)

.dockerignore +9 -0
.gitignore +12 -10
CHANGELOG_AND_RUNBOOK.md +504 -0
Dockerfile +6 -2
README.md +279 -0
Readme.md +0 -375
__init__.py +0 -16
app.py +176 -28
client.py +73 -93
environment.py +179 -31
graders.py +53 -15
incidents.py +8 -8
inference.py +373 -159
models.py +81 -33
openenv.yaml +45 -16
pyproject.toml +23 -29
requirements.txt +4 -1
server/__init__.py +1 -0
server/app.py +13 -0
tests/test_env.py +126 -0
tests/test_graders.py +85 -0
ui/assets/app.js +290 -0
ui/assets/styles.css +731 -0
ui/index.html +117 -0
ui/playground.html +153 -0
ui/status.html +118 -0
uv.lock +0 -0

.dockerignore ADDED

	@@ -0,0 +1,9 @@

+.git
+.gitignore
+.venv
+__pycache__
+.pycache
+.DS_Store
+.env
+logs.jsonl
+outputs

.gitignore CHANGED

@@ -1,13 +1,13 @@
 .DS_Store
 # =========================
-# ENV & SECRETS 🔐
 # =========================
 .env
 .env.*
 *.env
 # =========================
-# PYTHON 🐍
 # =========================
 __pycache__/
 *.pyc
@@ -24,19 +24,20 @@ env/
 .venv/
 # =========================
-# LOG FILES 📄
 # =========================
 *.log
 logs.jsonl
 # =========================
-# OS FILES 💻
 # =========================
 .DS_Store
 Thumbs.db
 # =========================
-# IDE / EDITOR ⚙️
 # =========================
 .vscode/
 .idea/
@@ -44,7 +45,7 @@ Thumbs.db
 *.swo
 # =========================
-# MODEL / DATA FILES 🤖
 # =========================
 *.onnx
 *.pt
@@ -57,27 +58,28 @@ data/
 datasets/
 # =========================
-# BUILD / OUTPUT 🚀
 # =========================
 dist/
 build/
 out/
 # =========================
-# TEMP FILES 🗑️
 # =========================
 *.tmp
 *.temp
 .cache/
 # =========================
-# TEST / COVERAGE 🧪
 # =========================
 coverage/
 .nyc_output/
 # =========================
-# DOCKER 🐳 (optional)
 # =========================
 *.pid
 *.seed

 .DS_Store
 # =========================
+# ENV AND SECRETS
 # =========================
 .env
 .env.*
 *.env
 # =========================
+# PYTHON
 # =========================
 __pycache__/
 *.pyc
 .venv/
 # =========================
+# LOG FILES
 # =========================
 *.log
 logs.jsonl
+outputs/
 # =========================
+# OS FILES
 # =========================
 .DS_Store
 Thumbs.db
 # =========================
+# IDE / EDITOR
 # =========================
 .vscode/
 .idea/
 *.swo
 # =========================
+# MODEL / DATA FILES
 # =========================
 *.onnx
 *.pt
 datasets/
 # =========================
+# BUILD / OUTPUT
 # =========================
 dist/
 build/
 out/
 # =========================
+# TEMP FILES
 # =========================
 *.tmp
 *.temp
 .cache/
+.pycache/
 # =========================
+# TEST / COVERAGE
 # =========================
 coverage/
 .nyc_output/
 # =========================
+# DOCKER (optional)
 # =========================
 *.pid
 *.seed

CHANGELOG_AND_RUNBOOK.md ADDED

	@@ -0,0 +1,504 @@

+# Change Log and Runbook
+This file explains what changed in the project and how to run or test each part.
+Project path:
+```bash
+cd /Users/adityagaba/Downloads/incident-triage-env
+```
+## 1. What changed
+### Backend and OpenEnv API
+The backend is still a FastAPI app, but it now behaves like a stronger OpenEnv-style environment.
+Main files:
+- `app.py`
+- `environment.py`
+- `models.py`
+- `graders.py`
+- `incidents.py`
+- `inference.py`
+- `openenv.yaml`
+Important backend changes:
+- Added typed request and response models for observation, action, reward, state, and reset.
+- Added proper `reset`, `step`, and `state` behavior.
+- Added strict action validation.
+- Added deterministic graders with partial credit.
+- Added runtime-validator helper endpoints:
+  - `GET /health`
+  - `GET /metadata`
+  - `GET /schema`
+  - `POST /mcp`
+- Updated `inference.py` to print strict `[START]`, `[STEP]`, and `[END]` logs.
+### Frontend UI
+A browser UI was added on top of the same FastAPI app.
+Main files:
+- `ui/index.html`
+- `ui/status.html`
+- `ui/playground.html`
+- `ui/assets/styles.css`
+- `ui/assets/app.js`
+New UI routes:
+- `/` shows the landing page.
+- `/status` shows live health, schema, tasks, and grader status.
+- `/playground` lets you reset an incident and submit an action from the browser.
+- `/docs` still shows FastAPI API docs.
+Latest UI improvements:
+- The playground has quick presets for task1, task2, and task3.
+- The playground loads the real ticket inventory from `/tickets`.
+- Invalid ticket IDs such as `INC-105` are blocked in the UI before calling `/reset`.
+- The playground now shows visible success and error messages.
+- The summary strip shows incident id, expected field, reward, and episode status.
+- Cards, form controls, and output panels have more spacing and padding.
+- Reset and step buttons show loading states while requests are running.
+The UI is served from `app.py` with:
+- `app.mount("/assets", ...)`
+- `GET /`
+- `GET /status`
+- `GET /playground`
+### Docker and Space readiness
+Main files:
+- `Dockerfile`
+- `.dockerignore`
+- `README.md`
+- `openenv.yaml`
+- `server/app.py`
+Important changes:
+- Docker runs `uvicorn app:app` on port `7860`.
+- `README.md` includes Hugging Face Docker Space metadata.
+- `server/app.py` is present as a compatibility entrypoint.
+- `openenv validate` passes locally.
+- Runtime validation was made compatible by adding `/schema`, `/mcp`, and `{"status":"healthy"}` from `/health`.
+### Tests
+Main files:
+- `tests/test_env.py`
+- `tests/test_graders.py`
+Test coverage includes:
+- health, schema, and MCP helper endpoints
+- UI routes and static assets
+- reset, step, and state behavior
+- wrong task-type rejection
+- grader score range checks
+- partial-credit checks
+- non-constant grader behavior
+### Terminal logs
+The backend now prints useful logs when the UI or API is used:
+```text
+[RESET] session_id=... incident_id=INC-014 task_type=task3 expected_field=action
+[STEP] session_id=... incident_id=INC-014 task_type=task3 answer=FAILOVER reward=1.0 done=true
+[STATE] session_id=... incident_id=INC-014 done=true
+[STEP_ERROR] session_id=... incident_id=INC-014 error=...
+```
+These logs appear in the same terminal where `uvicorn` is running.
+## 2. Start the backend and UI locally
+Use port `8000` locally if port `7860` is busy.
+```bash
+cd /Users/adityagaba/Downloads/incident-triage-env
+source .venv/bin/activate
+.venv/bin/python -m uvicorn app:app --host 127.0.0.1 --port 8000
+```
+Keep that terminal open.
+Open these browser URLs:
+```text
+http://127.0.0.1:8000/
+http://127.0.0.1:8000/status
+http://127.0.0.1:8000/playground
+http://127.0.0.1:8000/docs
+```
+If you already had the server running, stop it with `Ctrl+C` and start it again. Use hard refresh in the browser if the old UI is still visible.
+Expected results:
+- `/` shows the Incident Triage landing page.
+- `/status` shows health and task cards.
+- `/playground` lets you reset and step through an incident.
+- `/docs` shows generated API documentation.
+## 3. Test the UI manually
+### Landing page
+Open:
+```text
+http://127.0.0.1:8000/
+```
+Check:
+- The page title says `Welcome to Incident Triage Environment`.
+- Live snapshot cards load data.
+- Task cards appear.
+- Links to `/status`, `/playground`, and `/docs` work.
+### Status page
+Open:
+```text
+http://127.0.0.1:8000/status
+```
+Check:
+- Health shows `healthy`.
+- Total incidents shows `36`.
+- Task cards show task1, task2, and task3.
+- Schema coverage shows available runtime contracts.
+- Grader summary loads.
+### Playground page
+Open:
+```text
+http://127.0.0.1:8000/playground
+```
+Run a correct hard-task case:
+1. Click the `Action case` preset, or manually select `task3`.
+2. Confirm ticket id is `INC-014`.
+3. Click `Reset Environment`.
+4. Confirm expected field is `action`.
+5. Select `FAILOVER`.
+6. Click `Submit Step`.
+Expected result:
+- `reward.value` is `1.0`.
+- `done` is `true`.
+- `info.correct` is `true`.
+- `info.ground_truth` is `FAILOVER`.
+Important ticket rule:
+- Valid tickets are `INC-001` through `INC-036`.
+- `INC-105` is not in this dataset, so reset should fail for that ticket.
+- The updated UI loads valid tickets from `/tickets` and warns before sending an invalid ticket to the backend.
+Expected terminal logs:
+```text
+[RESET] session_id=... incident_id=INC-014 task_type=task3 expected_field=action
+[STEP] session_id=... incident_id=INC-014 task_type=task3 answer=FAILOVER reward=1.0 done=true
+```
+Run a task1 case:
+1. Click the `Severity case` preset, or manually select `task1`.
+2. Confirm ticket id is `INC-001`.
+3. Click `Reset Environment`.
+4. Confirm expected field is `severity`.
+5. Select `SEV1`.
+6. Click `Submit Step`.
+Expected result:
+- reward should be `1.0`.
+Run a task2 case:
+1. Click the `Root cause case` preset, or manually select `task2`.
+2. Confirm ticket id is `INC-006`.
+3. Click `Reset Environment`.
+4. Confirm expected field is `root_cause`.
+5. Select `DATABASE`.
+6. Click `Submit Step`.
+Expected result:
+- reward should be `1.0`.
+## 4. Test backend API with curl
+Use a second terminal while the app is running on port `8000`.
+Health:
+```bash
+curl -s http://127.0.0.1:8000/health | python3 -m json.tool
+```
+Expected:
+```json
+{
+    "status": "healthy"
+}
+```
+Metadata:
+```bash
+curl -s http://127.0.0.1:8000/metadata | python3 -m json.tool
+```
+Schema:
+```bash
+curl -s http://127.0.0.1:8000/schema | python3 -m json.tool
+```
+Reset a fixed incident:
+```bash
+curl -s -X POST http://127.0.0.1:8000/reset \
+  -H "Content-Type: application/json" \
+  -d '{"task_type":"task3","ticket_id":"INC-014"}' > /tmp/reset.json
+python3 -m json.tool /tmp/reset.json
+```
+Extract session id:
+```bash
+SESSION_ID=$(python3 -c 'import json; print(json.load(open("/tmp/reset.json"))["info"]["session_id"])')
+echo $SESSION_ID
+```
+Submit a correct step:
+```bash
+curl -s -X POST "http://127.0.0.1:8000/step?session_id=$SESSION_ID" \
+  -H "Content-Type: application/json" \
+  -d '{"incident_id":"INC-014","task_type":"task3","action":"FAILOVER"}' | python3 -m json.tool
+```
+Check state:
+```bash
+curl -s "http://127.0.0.1:8000/state?session_id=$SESSION_ID" | python3 -m json.tool
+```
+Expected state:
+- `done` is `true`
+- `status` is `completed`
+- `last_reward` is `1.0`
+## 5. Test backend edge cases
+Bad session:
+```bash
+curl -s -X POST "http://127.0.0.1:8000/step?session_id=bad-session" \
+  -H "Content-Type: application/json" \
+  -d '{"incident_id":"INC-014","task_type":"task3","action":"FAILOVER"}' | python3 -m json.tool
+```
+Expected:
+```json
+{
+    "detail": "Session not found. Call /reset first."
+}
+```
+Bad ticket:
+```bash
+curl -s -X POST http://127.0.0.1:8000/reset \
+  -H "Content-Type: application/json" \
+  -d '{"task_type":"task1","ticket_id":"INC-999"}' | python3 -m json.tool
+```
+Expected:
+```json
+{
+    "detail": "No ticket found for ticket_id: INC-999"
+}
+```
+Wrong field for task3:
+```bash
+curl -s -X POST http://127.0.0.1:8000/reset \
+  -H "Content-Type: application/json" \
+  -d '{"task_type":"task3","ticket_id":"INC-014"}' > /tmp/reset_wrong_field.json
+SESSION_WRONG_FIELD=$(python3 -c 'import json; print(json.load(open("/tmp/reset_wrong_field.json"))["info"]["session_id"])')
+curl -s -X POST "http://127.0.0.1:8000/step?session_id=$SESSION_WRONG_FIELD" \
+  -H "Content-Type: application/json" \
+  -d '{"incident_id":"INC-014","task_type":"task3","root_cause":"NETWORK"}' | python3 -m json.tool
+```
+Expected:
+```json
+{
+    "detail": "Task 'task3' expects field 'action', but got 'root_cause'."
+}
+```
+## 6. Run automated tests
+```bash
+cd /Users/adityagaba/Downloads/incident-triage-env
+.venv/bin/python -m unittest discover -s tests -v
+```
+Expected:
+```text
+OK
+```
+## 7. Run OpenEnv local validation
+```bash
+cd /Users/adityagaba/Downloads/incident-triage-env
+.venv/bin/openenv validate . --json
+```
+Expected:
+```json
+"passed": true
+```
+## 8. Run the baseline inference script
+If the local app is running on port `8000`:
+```bash
+cd /Users/adityagaba/Downloads/incident-triage-env
+ENV_URL=http://127.0.0.1:8000 .venv/bin/python inference.py
+```
+Expected log format:
+```text
+[START] task=INC-001 env=incident-triage-env model=...
+[STEP] step=1 action=SEV1 reward=1.00 done=true error=null
+[END] success=true steps=1 score=1.00 rewards=1.00
+```
+If no server is reachable, `inference.py` falls back to an in-process FastAPI client.
+## 9. Docker commands
+If `docker` is available on PATH:
+```bash
+docker build -t incident-triage-env .
+docker run --rm -p 8001:7860 incident-triage-env
+```
+If using Docker Desktop on macOS and `docker` is not on PATH:
+```bash
+export PATH=/Applications/Docker.app/Contents/Resources/bin:$PATH
+/Applications/Docker.app/Contents/Resources/bin/docker build -t incident-triage-env .
+/Applications/Docker.app/Contents/Resources/bin/docker run --rm -p 8001:7860 incident-triage-env
+```
+Then test:
+```bash
+curl -s http://127.0.0.1:8001/health | python3 -m json.tool
+curl -s -X POST http://127.0.0.1:8001/reset -H "Content-Type: application/json" -d '{}' | python3 -m json.tool
+```
+Open Docker UI routes:
+```text
+http://127.0.0.1:8001/
+http://127.0.0.1:8001/status
+http://127.0.0.1:8001/playground
+http://127.0.0.1:8001/docs
+```
+Expected:
+- `/health` returns `{"status": "healthy"}`
+- `/reset` returns `observation`, `reward`, `done`, and `info`
+- `/` shows the landing page
+- `/status` shows the live dashboard
+- `/playground` lets you test incidents from the browser
+## 10. Live Hugging Face Space validation
+Replace `<space-url>` with the actual public URL:
+```bash
+curl -s <space-url>/health | python3 -m json.tool
+curl -s -X POST <space-url>/reset -H "Content-Type: application/json" -d '{}' | python3 -m json.tool
+.venv/bin/openenv validate --url <space-url> --timeout 10
+```
+Expected:
+- `/health` returns `{"status": "healthy"}`
+- `/reset` returns `200` with a typed environment response
+- `openenv validate --url` returns `"passed": true`
+## 11. Common issues
+### Port 7860 is busy
+Use port `8000` locally:
+```bash
+.venv/bin/python -m uvicorn app:app --host 127.0.0.1 --port 8000
+```
+### Root URL returns Not Found
+This should no longer happen after the UI change. The root route `/` now serves the landing page.
+### Playground says session not found
+Click `Reset Environment` first, then submit a step.
+### Wrong task errors happen after completion
+Each episode is single-step. To test validation errors, reset a fresh session first.
+### Docker credential helper error
+Run:
+```bash
+export PATH=/Applications/Docker.app/Contents/Resources/bin:$PATH
+```
+Then retry the Docker command.

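The "Wrong field for task3" case in the runbook above can be illustrated with a small validation sketch. This is not the code in `environment.py` or `models.py`, which this page does not show in full. It is a minimal sketch that assumes the task-to-field mapping stated in the runbook (`task1` → `severity`, `task2` → `root_cause`, `task3` → `action`). The names `EXPECTED_FIELD` and `validate_action` are hypothetical, and only the wrong-field message wording is taken from the runbook. The other two messages are placeholders.

```python
from typing import Optional

# Expected answer field per task, as described in the runbook and README.
EXPECTED_FIELD = {"task1": "severity", "task2": "root_cause", "task3": "action"}
ANSWER_FIELDS = ("severity", "root_cause", "action")


def validate_action(task_type: str, payload: dict) -> Optional[str]:
    """Return an error message for an invalid action payload, or None if valid.

    Hypothetical helper: the real project may validate differently.
    """
    expected = EXPECTED_FIELD.get(task_type)
    if expected is None:
        # Placeholder wording, not taken from the project.
        return f"Unknown task_type: {task_type!r}"

    # Collect the answer fields that are actually populated (non-null).
    populated = [name for name in ANSWER_FIELDS if payload.get(name) is not None]
    if len(populated) != 1:
        # Placeholder wording, not taken from the project.
        return "Exactly one of 'severity', 'root_cause', or 'action' must be populated."

    if populated[0] != expected:
        # Wording mirrors the expected error shown in the runbook.
        return f"Task '{task_type}' expects field '{expected}', but got '{populated[0]}'."
    return None
```

Returning a message instead of raising keeps the sketch independent of FastAPI. A handler could map any non-`None` result to an HTTP 400 response, as the `/step` route does for `ValueError`.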
Dockerfile CHANGED

@@ -1,12 +1,16 @@
 FROM python:3.10-slim
 WORKDIR /app
 COPY requirements.txt .
-RUN pip install --no-cache-dir -r requirements.txt
 COPY . .
 EXPOSE 7860
-CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]

 FROM python:3.10-slim
+ENV PYTHONDONTWRITEBYTECODE=1 \
+    PYTHONUNBUFFERED=1 \
+    PIP_NO_CACHE_DIR=1
 WORKDIR /app
 COPY requirements.txt .
+RUN python -m pip install -r requirements.txt
 COPY . .
 EXPOSE 7860
+CMD ["python", "-m", "uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]

README.md ADDED

	@@ -0,0 +1,279 @@

+---
+title: Incident Triage OpenEnv
+colorFrom: red
+colorTo: blue
+sdk: docker
+app_port: 7860
+pinned: false
+tags:
+  - openenv
+  - fastapi
+  - reinforcement-learning
+---
+# Production Incident Triage Environment
+This project is an OpenEnv-compatible evaluation environment for production incident response. An agent receives a typed incident observation and must perform one of three real-world triage tasks: classify severity, identify the most likely root cause, or recommend the best immediate action.
+The environment is built for the OpenEnv hackathon requirements:
+- real-world utility
+- three graded tasks with easy, medium, and hard difficulty
+- typed observation, action, reward, and state models
+- deterministic reward logic with partial credit
+- root-level `inference.py`
+- Docker-based deployment for Hugging Face Spaces
+## Overview
+The dataset contains 36 incidents across three task families:
+| Task | Difficulty | Count | Objective |
+|---|---|---:|---|
+| `task1` | easy | 11 | Predict incident severity as `SEV1`, `SEV2`, or `SEV3` |
+| `task2` | medium | 12 | Predict the most likely root cause domain |
+| `task3` | hard | 13 | Predict the best immediate operational action |
+The incidents cover realistic production scenarios such as payment failures, queue backlogs, regional network loss, failed deploys, infrastructure saturation, third-party degradation, and failover decisions.
+## API
+The FastAPI app exposes the following endpoints on port `7860`:
+- `GET /health`
+- `GET /metadata`
+- `GET /tasks`
+- `GET /grader`
+- `GET /schema`
+- `POST /reset`
+- `POST /step`
+- `GET /state`
+- `POST /mcp`
+### Reset
+`POST /reset` starts a new single-step episode.
+Optional request body:
+```json
+{
+  "task_type": "task1",
+  "ticket_id": "INC-001",
+  "seed": 42
+}
+```
+Response fields:
+- `observation`
+- `reward`
+- `done`
+- `info`
+### Step
+`POST /step?session_id=<id>` accepts an `IncidentAction` and returns a typed `StepResult`.
+Example request:
+```json
+{
+  "incident_id": "INC-001",
+  "task_type": "task1",
+  "severity": "SEV1"
+}
+```
+### State
+`GET /state?session_id=<id>` returns the current typed `IncidentState`.
+## Web UI
+The project also serves a browser-facing UI from the same FastAPI app:
+- `/` shows the landing page with project overview and task summary
+- `/status` shows live health, schema, and task readiness information
+- `/playground` lets you manually reset a session and submit a step from the browser
+- `/docs` provides the generated FastAPI API reference
+## Models
+The core models are defined in [models.py](models.py):
+- `IncidentObservation`
+- `IncidentAction`
+- `IncidentReward`
+- `StepResult`
+- `IncidentState`
+- `ResetRequest`
+Validation rules:
+- `incident_id` must match the active ticket
+- `task_type` must match the active ticket
+- exactly one of `severity`, `root_cause`, or `action` must be populated
+- the populated field must match the expected field for the task
+## Reward Logic
+Rewarding is deterministic and implemented in [graders.py](graders.py).
+- `task1`: `1.0` exact, `0.5` adjacent severity, `0.0` far miss
+- `task2`: `1.0` exact, `0.5` related domain, `0.25` `UNKNOWN`, `0.0` wrong
+- `task3`: `1.0` exact, `0.4` safe `INVESTIGATE` fallback, `0.25` related action, `0.0` wrong
+This keeps grading reproducible while still giving partial-credit trajectory signal.
+## Repository Layout
+```text
+incident-triage-env/
+- app.py
+- client.py
+- environment.py
+- graders.py
+- incidents.py
+- inference.py
+- models.py
+- openenv.yaml
+- pyproject.toml
+- requirements.txt
+- Dockerfile
+- README.md
+- server/
+- tests/
+```
+Runtime flow:
+1. `incidents.py` stores the ticket dataset.
+2. `environment.py` selects the episode and applies grading.
+3. `app.py` exposes the API surface.
+4. `inference.py` runs the baseline over the environment.
+5. `graders.py` calculates deterministic reward and explanations.
+## Local Setup
+Install dependencies:
+```bash
+pip install -r requirements.txt
+```
+Optional OpenEnv CLI:
+```bash
+pip install openenv-core
+```
+Optional environment variables for `inference.py`:
+```bash
+export API_BASE_URL="https://your-openai-compatible-endpoint/v1"
+export MODEL_NAME="your-model-name"
+export HF_TOKEN="your-api-key"
+export ENV_URL="http://localhost:7860"
+```
+If no external environment server is reachable, `inference.py` falls back to an in-process FastAPI client.
+## Run Locally
+Start the server:
+```bash
+uvicorn app:app --host 0.0.0.0 --port 7860
+```
+Run the baseline:
+```bash
+python inference.py
+```
+Run the smoke tests:
+```bash
+python -m unittest discover -s tests -v
+```
+## Docker
+Build the image:
+```bash
+docker build -t incident-triage-env .
+```
+Run the container:
+```bash
+docker run --rm -p 7860:7860 incident-triage-env
+```
+Check health:
+```bash
+curl http://localhost:7860/health
+```
+## Baseline Logging
+`inference.py` prints the required structured output:
+```text
+[START] task=INC-001 env=incident-triage-env model=deterministic-baseline
+[STEP] step=1 action=SEV1 reward=1.00 done=true error=null
+[END] success=true steps=1 score=1.00 rewards=1.00
+```
+## Baseline Scores
+Latest local deterministic baseline:
+| Metric | Value |
+|---|---:|
+| Episodes | 36 |
+| Average score | 0.9861 |
+| `task1` average | 1.0000 |
+| `task2` average | 0.9583 |
+| `task3` average | 1.0000 |
+These results are written to `outputs/baseline_scores.json`.
+## Quick API Example
+Reset:
+```bash
+curl -X POST http://localhost:7860/reset \
+  -H "Content-Type: application/json" \
+  -d '{"task_type":"task1","ticket_id":"INC-001"}'
+```
+Step:
+```bash
+curl -X POST "http://localhost:7860/step?session_id=<session-id>" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "incident_id": "INC-001",
+    "task_type": "task1",
+    "severity": "SEV1"
+  }'
+```
+## Pre-Submission Checklist
+- `openenv validate . --json` passes
+- `openenv validate --url <space-url>` passes
+- `POST /reset` returns `200`
+- `POST /step` returns typed `reward`, `done`, and `info`
+- `GET /state` works for active sessions
+- `inference.py` runs from the repo root
+- `Dockerfile` serves the app on port `7860`
+- `openenv.yaml` matches the current API and dataset counts
+## Notes
+- `models.py` is the source of truth for valid enum labels.
+- `graders.py` is the source of truth for scoring logic.
+- The environment is intentionally single-step per episode and still exposes typed state for validation and debugging.

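The `task1` rule in the README's Reward Logic section (`1.0` exact, `0.5` adjacent severity, `0.0` far miss) can be sketched as follows. This is not the implementation in `graders.py`. It assumes that "adjacent" means neighbouring positions in the order `SEV1`, `SEV2`, `SEV3`. The names `SEVERITY_ORDER` and `grade_severity` are hypothetical.

```python
# Severity labels in order of decreasing urgency (assumed ordering).
SEVERITY_ORDER = ["SEV1", "SEV2", "SEV3"]


def grade_severity(predicted: str, ground_truth: str) -> float:
    """Score a severity prediction with partial credit for near misses.

    Raises ValueError if either label is not a known severity.
    """
    distance = abs(
        SEVERITY_ORDER.index(predicted) - SEVERITY_ORDER.index(ground_truth)
    )
    if distance == 0:
        return 1.0  # exact match
    if distance == 1:
        return 0.5  # adjacent severity
    return 0.0  # far miss (SEV1 vs SEV3)
```

Because the score depends only on the two labels, repeated calls with the same inputs always return the same reward, which is the reproducibility property the README describes.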
Readme.md DELETED

@@ -1,375 +0,0 @@
-# 🚨 Production Incident Triage Environment
-An OpenEnv-compatible backend evaluation system where an AI agent triages production incidents like a real SRE (Site Reliability Engineer). Built for deterministic, RL-style evaluation — no UI, no chatbot, pure backend.
----
-## 📌 What This Is
-This is **not** a chatbot. It is a structured evaluation environment where:
-1. Environment returns a production incident (alert + context)
-2. AI agent reads the incident
-3. Agent returns a structured JSON action
-4. Environment sends action to a deterministic grader
-5. Grader compares against ground truth
-6. Returns a score between `0.0` and `1.0`
----
-## 🗂️ Project Structure
-```
-Incident_Triage/
-│
-├── models.py               # Pydantic schemas — source of truth for all types
-├── incidents.py            # Dataset of 15 production incidents
-├── inference.py            # LLM agent (Mistral via NVIDIA API)
-├── openenv.yaml            # OpenEnv submission config
-├── pyproject.toml          # Project metadata
-├── requirements.txt        # Dependencies
-├── README.md
-│
-└── server/
-    ├── __init__.py         # Empty — do not add imports here
-    ├── app.py              # FastAPI server
-    ├── environment.py      # Core RL-style logic (reset / step)
-    ├── graders.py          # Deterministic scoring functions
-    ├── Dockerfile
-    └── requirements.txt
-```
----
-## ⚙️ Setup
-### 1. Clone and install dependencies
-```bash
-git clone <your-repo-url>
-cd Incident_Triage
-pip install -r requirements.txt
-```
-### 2. Set your NVIDIA / Mistral API key
-```bash
-# Windows
-set NVIDIA_API_KEY=your_nvidia_api_key_here
-# Mac / Linux
-export NVIDIA_API_KEY=your_nvidia_api_key_here
-```
-### 3. Start the server
-```bash
-uvicorn server.app:app --reload
-```
-Server runs at: `http://localhost:8000`
-### 4. Run the agent
-```bash
-python inference.py
-```
----
-## 🔗 API Endpoints
-### `GET /tasks`
-Returns available task types and their descriptions.
-**Response:**
-```json
-{
-  "tasks": {
-    "task1": "Severity Classification  → SeverityLevel enum",
-    "task2": "Root Cause Category     → RootCauseCategory enum",
-    "task3": "Recommended Action      → RecommendedAction enum"
-  }
-}
-```
----
-### `POST /reset`
-Resets the environment and returns a new incident for the agent to triage.
-**Query Params:**
-| Param | Type | Required | Description |
-|---|---|---|---|
-| `task_type` | string | No | Filter by `task1`, `task2`, or `task3`. If omitted, picks any incident randomly. |
-**Example:**
-```bash
-curl -X POST "http://localhost:8000/reset?task_type=task1"
-```
-**Response:**
-```json
-{
-  "incident_id": "INC-001",
-  "task_type": "task1",
-  "alert_text": "[CRITICAL] Payment service returning HTTP 503. Error rate: 94%.",
-  "context": {
-    "service": "payment-service",
-    "error_rate_pct": 94,
-    "affected_users": 120000,
-    "region": "us-east-1"
-  }
-}
-```
----
-### `POST /step`
-Submits the agent's action and returns a graded result.
-**Request Body:**
-```json
-{
-  "incident_id": "INC-001",
-  "task_type": "task1",
-  "severity": "SEV1",
-  "root_cause": null,
-  "action": null
-}
-```
-> Only populate the field relevant to the `task_type`. Set others to `null`.
-**Response:**
-```json
-{
-  "incident_id": "INC-001",
-  "task_type": "task1",
-  "reward": 1.0,
-  "correct": true,
-  "ground_truth": "SEV1",
-  "agent_answer": "SEV1"
-}
-```
-| Field | Type | Description |
-|---|---|---|
-| `reward` | float | `1.0` = correct, `0.0` = wrong |
-| `correct` | bool | True if reward == 1.0 |
-| `ground_truth` | string | Expected answer |
-| `agent_answer` | string | What agent returned |
----
-### `GET /grader`
-Returns grader configuration for transparency.
-**Response:**
-```json
-{
-  "grading": "deterministic",
-  "scoring": "binary (0.0 or 1.0)",
-  "tasks": {
-    "task1": "action.severity   == ground_truth.severity",
-    "task2": "action.root_cause == ground_truth.root_cause",
-    "task3": "action.action     == ground_truth.action"
-  }
-}
-```
----
-## 📋 Enum Reference
-All agent outputs must use **exactly** these enum values (case-sensitive):
-### Task 1 — Severity Classification (`severity` field)
-| Value | Meaning |
-|---|---|
-| `SEV1` | Total outage / confirmed revenue impact |
-| `SEV2` | Partial outage / degraded performance |
-| `SEV3` | Minor / cosmetic / internal only |
-### Task 2 — Root Cause Category (`root_cause` field)
-| Value | Meaning |
-|---|---|
-| `DATABASE` | DB lag, connection pool, replica issues |
-| `NETWORK` | Packet loss, BGP flap, cross-region failures |
-| `APPLICATION` | Code bug, exception, bad deploy |
-| `INFRASTRUCTURE` | Kubernetes, EC2, spot interruption |
-| `THIRD_PARTY` | Stripe, SendGrid, external vendor |
-| `UNKNOWN` | Cannot determine root cause |
-### Task 3 — Recommended Action (`action` field)
-| Value | Meaning |
-|---|---|
-| `ROLLBACK` | Revert to last stable deploy |
-| `SCALE_UP` | Increase replicas / resources |
-| `RESTART_SERVICE` | Restart stuck / deadlocked process |
-| `FAILOVER` | Switch to replica / standby |
-| `NOTIFY_VENDOR` | Escalate to third-party vendor |
-| `INVESTIGATE` | Need more info before acting |
-| `NO_ACTION` | Monitor only, no action needed |
----
-## 🤖 Agent JSON Format
-The agent must return **strict JSON only** — no markdown, no explanation, no extra text.
-```json
-{
-  "incident_id": "INC-006",
-  "task_type": "task2",
-  "severity": null,
-  "root_cause": "DATABASE",
-  "action": null
-}
-```
-Rules:
-- `incident_id` must match the one returned by `/reset`
-- `task_type` must match the one returned by `/reset`
-- Only one field (`severity`, `root_cause`, or `action`) should be non-null
-- The non-null field must use a valid enum value
----
-## 🧠 How Grading Works
-Grading is **fully deterministic** — no LLM is used inside the grader.
-```
-agent_answer == ground_truth  →  reward: 1.0  (correct)
-agent_answer != ground_truth  →  reward: 0.0  (wrong)
-missing field (null)          →  reward: 0.0  (wrong)
-```
-Scoring is binary because incident triage is a classification task. A wrong severity leads to a wrong on-call response — partial credit would mask bad agent behavior.
----
-## 🧪 Quick Test (curl)
-```bash
-# 1. Check available tasks
-curl http://localhost:8000/tasks
-# 2. Get a task1 incident
-curl -X POST "http://localhost:8000/reset?task_type=task1"
-# 3. Submit agent action (replace incident_id with one from step 2)
-curl -X POST http://localhost:8000/step \
-  -H "Content-Type: application/json" \
-  -d '{"incident_id": "INC-001", "task_type": "task1", "severity": "SEV1", "root_cause": null, "action": null}'
-# 4. Check grader config
-curl http://localhost:8000/grader
-```
----
-## 📊 Dataset Overview
-15 production incidents across 3 task types (5 per task):
-| Task | Incidents | What agent classifies |
-|---|---|---|
-| `task1` | INC-001 to INC-005 | Severity level |
-| `task2` | INC-006 to INC-010 | Root cause category |
-| `task3` | INC-011 to INC-015 | Recommended action |
-Incident types include: payment outages, DB replica lag, Kubernetes node failures, BGP flapping, bad deploys, vendor degradations, memory deadlocks, and more.
----
-## 🔧 Inference Script (Mistral via NVIDIA API)
-`inference.py` uses the Mistral model via NVIDIA's OpenAI-compatible API endpoint.
-Update the client in `inference.py`:
-```python
-from openai import OpenAI
-client = OpenAI(
-    base_url="https://integrate.api.nvidia.com/v1",
-    api_key=os.environ["NVIDIA_API_KEY"]
-)
-response = client.chat.completions.create(
-    model="mistralai/mistral-7b-instruct-v0.3",
-    messages=[
-        {"role": "system", "content": SYSTEM_PROMPT},
-        {"role": "user", "content": build_user_prompt(observation)}
-    ],
-    max_tokens=256,
-    temperature=0.0
-)
-raw = response.choices[0].message.content.strip()
-```
-> `temperature=0.0` is critical — keeps outputs deterministic across runs.
----
-## 📦 Requirements
-```
-fastapi
-uvicorn
-pydantic
-openai
-requests
-```
-Install:
-```bash
-pip install fastapi uvicorn pydantic openai requests
-```
----
-## 🚀 Run Full Evaluation
-```bash
-# Terminal 1
-uvicorn server.app:app --reload
-# Terminal 2
-python inference.py
-```
-Expected output:
-```
-==================================================
-Incident : INC-003
-Task     : task1
-Alert    : [INFO] Admin dashboard CSS assets returning 404...
-LLM Raw  : {"incident_id": "INC-003", "task_type": "task1", "severity": "SEV3", "root_cause": null, "action": null}
-Answer   : SEV3
-Expected : SEV3
-Correct  : True  |  Reward: 1.0
-==================================================
-Total Episodes : 15
-Total Correct  : 13
-Accuracy       : 86.7%
-```
----
-## 📝 Important Rules
-- Never modify enum values in `models.py` — graders depend on exact string matching
-- Never add LLM calls inside `graders.py` — grading must be deterministic
-- Always call `/reset` before `/step` — environment maintains current incident state
-- `server/__init__.py` must stay empty — do not add imports there
-- Always run uvicorn from the project root: `uvicorn server.app:app --reload`

__init__.py DELETED

@@ -1,16 +0,0 @@
-# Copyright (c) Meta Platforms, Inc. and affiliates.
-# All rights reserved.
-#
-# This source code is licensed under the BSD-style license found in the
-# LICENSE file in the root directory of this source tree.
-"""Incident Triage Environment."""
-from .client import IncidentTriageEnv
-from .models import IncidentTriageAction, IncidentTriageObservation
-__all__ = [
-    "IncidentTriageAction",
-    "IncidentTriageObservation",
-    "IncidentTriageEnv",
-]

app.py CHANGED

@@ -1,77 +1,225 @@
-#----- Edited file--------------
-# app.py
 import uuid
 from fastapi import FastAPI, HTTPException
-from models import IncidentAction, StepResult
-from environment import IncidentEnv
-from graders import GRADERS
 app = FastAPI(title="Incident Triage Environment")
 # Session store: session_id -> IncidentEnv instance
 sessions: dict[str, IncidentEnv] = {}
 @app.get("/tasks")
 def get_tasks():
     return {
         "tasks": {
-            "task1": "Severity Classification  → SEV1, SEV2, SEV3",
-            "task2": "Root Cause Category     → DATABASE, NETWORK, APPLICATION, INFRASTRUCTURE, THIRD_PARTY, UNKNOWN",
-            "task3": "Recommended Action      → ROLLBACK, SCALE_UP, RESTART_SERVICE, FAILOVER, NOTIFY_VENDOR, INVESTIGATE, NO_ACTION",
         }
     }
-@app.post("/reset")
-def reset(task_type: str = None):
     session_id = str(uuid.uuid4())
     env = IncidentEnv()
     try:
-        observation = env.reset(task_type=task_type)
     except ValueError as e:
         raise HTTPException(status_code=400, detail=str(e))
     sessions[session_id] = env
-    return {"session_id": session_id, **observation.model_dump()}
 @app.post("/step", response_model=StepResult)
 def step(action: IncidentAction, session_id: str):
     env = sessions.get(session_id)
     if not env:
         raise HTTPException(status_code=404, detail="Session not found. Call /reset first.")
     try:
         result = env.step(action)
     except (RuntimeError, ValueError) as e:
         raise HTTPException(status_code=400, detail=str(e))
-    # Clean up session after step — one action per episode
-    sessions.pop(session_id, None)
     return result
-@app.get("/state")
 def state(session_id: str):
     env = sessions.get(session_id)
-    if not env or env.current_ticket is None:
         raise HTTPException(status_code=404, detail="No active session.")
-    t = env.current_ticket
-    return {
-        "session_id": session_id,
-        "incident_id": t["incident_id"],
-        "task_type":   t["task_type"],
-        "status":      "awaiting_action",
-    }
 @app.get("/grader")
 def get_grader_info():
     return {
         "grading": "deterministic",
-        "scoring": "task1: partial (1.0/0.5/0.0), task2/task3: binary (1.0/0.0)",
         "tasks": {
             "task1": "exact=1.0, adjacent=0.5, far=0.0",
-            "task2": "action.root_cause == ground_truth.root_cause",
-            "task3": "action.action     == ground_truth.action",
         }
-    }

 import uuid
+from collections import Counter
+from pathlib import Path
+import sys
+from typing import Any
 from fastapi import FastAPI, HTTPException
+from fastapi.responses import FileResponse
+from fastapi.staticfiles import StaticFiles
+from environment import IncidentEnv, TASK_SPECS
+from incidents import TICKETS
+from models import (
+    IncidentAction,
+    IncidentObservation,
+    IncidentReward,
+    IncidentState,
+    ResetRequest,
+    StepResult,
+    TaskType,
+)
 app = FastAPI(title="Incident Triage Environment")
+UI_DIR = Path(__file__).parent / "ui"
+ASSETS_DIR = UI_DIR / "assets"
 # Session store: session_id -> IncidentEnv instance
 sessions: dict[str, IncidentEnv] = {}
+task_counts = Counter(ticket["task_type"] for ticket in TICKETS)
+app.mount("/assets", StaticFiles(directory=ASSETS_DIR), name="assets")
+def log_event(event: str, **fields: Any) -> None:
+    details = " ".join(f"{key}={value}" for key, value in fields.items())
+    print(f"[{event}] {details}", file=sys.stderr, flush=True)
+@app.get("/", include_in_schema=False)
+def home_page():
+    return FileResponse(UI_DIR / "index.html")
+@app.get("/status", include_in_schema=False)
+def status_page():
+    return FileResponse(UI_DIR / "status.html")
+@app.get("/playground", include_in_schema=False)
+def playground_page():
+    return FileResponse(UI_DIR / "playground.html")
+@app.get("/health")
+def health():
+    return {"status": "healthy"}
+@app.get("/metadata")
+def metadata():
+    return {
+        "name": "incident-triage-env",
+        "description": "Production incident triage environment for severity, root-cause, and remediation decisions.",
+        "tasks": {
+            task_type.value: {
+                "name": spec["name"],
+                "difficulty": spec["difficulty"],
+                "expected_field": spec["expected_field"],
+                "allowed_values": spec["allowed_values"],
+                "ticket_count": task_counts[task_type.value],
+            }
+            for task_type, spec in TASK_SPECS.items()
+        },
+        "total_tickets": len(TICKETS),
+    }
+@app.get("/schema")
+def schema():
+    return {
+        "action": IncidentAction.model_json_schema(),
+        "observation": IncidentObservation.model_json_schema(),
+        "reward": IncidentReward.model_json_schema(),
+        "state": IncidentState.model_json_schema(),
+        "step_result": StepResult.model_json_schema(),
+    }
 @app.get("/tasks")
 def get_tasks():
     return {
         "tasks": {
+            task_type.value: {
+                "name": spec["name"],
+                "difficulty": spec["difficulty"],
+                "expected_field": spec["expected_field"],
+                "allowed_values": spec["allowed_values"],
+                "ticket_count": task_counts[task_type.value],
+            }
+            for task_type, spec in TASK_SPECS.items()
         }
     }
+@app.get("/tickets")
+def get_tickets():
+    tickets = []
+    for ticket in TICKETS:
+        task_type = TaskType(ticket["task_type"])
+        spec = TASK_SPECS[task_type]
+        tickets.append(
+            {
+                "incident_id": ticket["incident_id"],
+                "task_type": ticket["task_type"],
+                "difficulty": spec["difficulty"],
+                "task_name": spec["name"],
+                "expected_field": spec["expected_field"],
+                "alert_preview": ticket["alert_text"][:120],
+            }
+        )
+    return {"tickets": tickets, "count": len(tickets)}
+@app.post("/reset", response_model=StepResult)
+def reset(reset_request: ResetRequest | None = None):
+    request = reset_request or ResetRequest()
     session_id = str(uuid.uuid4())
     env = IncidentEnv()
     try:
+        result = env.reset(
+            task_type=request.task_type,
+            ticket_id=request.ticket_id,
+            seed=request.seed,
+        )
     except ValueError as e:
+        log_event(
+            "RESET_ERROR",
+            task_type=request.task_type.value if request.task_type else "any",
+            ticket_id=request.ticket_id or "random",
+            error=str(e),
+        )
         raise HTTPException(status_code=400, detail=str(e))
     sessions[session_id] = env
+    result.info["session_id"] = session_id
+    result.info["state"] = env.state(session_id=session_id).model_dump()
+    log_event(
+        "RESET",
+        session_id=session_id,
+        incident_id=result.observation.incident_id,
+        task_type=result.observation.task_type.value,
+        expected_field=result.observation.expected_field,
+    )
+    return result
 @app.post("/step", response_model=StepResult)
 def step(action: IncidentAction, session_id: str):
     env = sessions.get(session_id)
     if not env:
+        log_event("STEP_ERROR", session_id=session_id, error="session_not_found")
         raise HTTPException(status_code=404, detail="Session not found. Call /reset first.")
     try:
         result = env.step(action)
     except (RuntimeError, ValueError) as e:
+        log_event("STEP_ERROR", session_id=session_id, incident_id=action.incident_id, error=str(e))
         raise HTTPException(status_code=400, detail=str(e))
+    result.info["session_id"] = session_id
+    result.info["state"] = env.state(session_id=session_id).model_dump()
+    log_event(
+        "STEP",
+        session_id=session_id,
+        incident_id=action.incident_id,
+        task_type=action.task_type.value,
+        answer=action.selected_value() or "NONE",
+        reward=result.reward.value,
+        done=str(result.done).lower(),
+    )
     return result
+@app.get("/state", response_model=IncidentState)
 def state(session_id: str):
     env = sessions.get(session_id)
+    if not env:
+        log_event("STATE_ERROR", session_id=session_id, error="no_active_session")
         raise HTTPException(status_code=404, detail="No active session.")
+    try:
+        current_state = env.state(session_id=session_id)
+        log_event("STATE", session_id=session_id, incident_id=current_state.incident_id, done=str(current_state.done).lower())
+        return current_state
+    except RuntimeError as e:
+        log_event("STATE_ERROR", session_id=session_id, error=str(e))
+        raise HTTPException(status_code=404, detail=str(e))
 @app.get("/grader")
 def get_grader_info():
     return {
         "grading": "deterministic",
+        "scoring": "task1: adjacent-severity partial credit; task2/task3: exact match plus conservative near-miss partial credit",
         "tasks": {
             "task1": "exact=1.0, adjacent=0.5, far=0.0",
+            "task2": "exact=1.0, related-domain=0.5, unknown=0.25, wrong=0.0",
+            "task3": "exact=1.0, investigate fallback=0.4, related response=0.25, wrong=0.0",
         }
+    }
+@app.post("/mcp")
+def mcp(payload: dict[str, Any] | None = None):
+    request = payload or {}
+    method = request.get("method")
+    rpc_id = request.get("id")
+    if method == "ping":
+        result: dict[str, Any] = {"status": "ok"}
+    elif method == "tools/list":
+        result = {"tools": []}
+    else:
+        result = {
+            "status": "ok",
+            "message": "Incident triage environment does not expose MCP tools.",
+        }
+    return {"jsonrpc": "2.0", "id": rpc_id, "result": result}
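
The `log_event` helper added to `app.py` above writes one `[EVENT] key=value ...` line per request to stderr. The following is a minimal sketch of that line format and of reading it back. `format_event` and `parse_event` are hypothetical names that are not part of this commit, and the parser assumes no value contains a space (free-text fields such as `error=str(e)` can, so treat it as illustrative only).

```python
from typing import Any


def format_event(event: str, **fields: Any) -> str:
    """Build the same "[EVENT] key=value ..." line that log_event prints."""
    details = " ".join(f"{key}={value}" for key, value in fields.items())
    return f"[{event}] {details}"


def parse_event(line: str) -> tuple[str, dict[str, str]]:
    """Split a log line back into its event name and its fields.

    Assumption: no value contains a space; free-text values would break this.
    """
    head, _, rest = line.partition("] ")
    event = head.lstrip("[")
    fields = dict(pair.split("=", 1) for pair in rest.split())
    return event, fields


line = format_event("STEP", session_id="abc123", reward=0.5, done="true")
print(line)
print(parse_event(line))
```

Every parsed value comes back as a string, so a consumer that needs the numeric reward would have to convert it explicitly.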

client.py CHANGED

@@ -1,99 +1,79 @@
-# Copyright (c) Meta Platforms, Inc. and affiliates.
-# All rights reserved.
-#
-# This source code is licensed under the BSD-style license found in the
-# LICENSE file in the root directory of this source tree.
-"""Incident Triage Environment Client."""
-from typing import Dict
-from openenv.core import EnvClient
-from openenv.core.client_types import StepResult
-from openenv.core.env_server.types import State
-from .models import IncidentTriageAction, IncidentTriageObservation
-class IncidentTriageEnv(
-    EnvClient[IncidentTriageAction, IncidentTriageObservation, State]
-):
-    """
-    Client for the Incident Triage Environment.
-    This client maintains a persistent WebSocket connection to the environment server,
-    enabling efficient multi-step interactions with lower latency.
-    Each client instance has its own dedicated environment session on the server.
-    Example:
-        >>> # Connect to a running server
-        >>> with IncidentTriageEnv(base_url="http://localhost:8000") as client:
-        ...     result = client.reset()
-        ...     print(result.observation.echoed_message)
-        ...
-        ...     result = client.step(IncidentTriageAction(message="Hello!"))
-        ...     print(result.observation.echoed_message)
-    Example with Docker:
-        >>> # Automatically start container and connect
-        >>> client = IncidentTriageEnv.from_docker_image("Incident_Triage-env:latest")
-        >>> try:
-        ...     result = client.reset()
-        ...     result = client.step(IncidentTriageAction(message="Test"))
-        ... finally:
-        ...     client.close()
-    """
-    def _step_payload(self, action: IncidentTriageAction) -> Dict:
-        """
-        Convert IncidentTriageAction to JSON payload for step message.
-        Args:
-            action: IncidentTriageAction instance
-        Returns:
-            Dictionary representation suitable for JSON encoding
-        """
-        return {
-            "message": action.message,
-        }
-    def _parse_result(self, payload: Dict) -> StepResult[IncidentTriageObservation]:
-        """
-        Parse server response into StepResult[IncidentTriageObservation].
-        Args:
-            payload: JSON response data from server
-        Returns:
-            StepResult with IncidentTriageObservation
-        """
-        obs_data = payload.get("observation", {})
-        observation = IncidentTriageObservation(
-            echoed_message=obs_data.get("echoed_message", ""),
-            message_length=obs_data.get("message_length", 0),
-            done=payload.get("done", False),
-            reward=payload.get("reward"),
-            metadata=obs_data.get("metadata", {}),
-        )
         return StepResult(
-            observation=observation,
-            reward=payload.get("reward"),
-            done=payload.get("done", False),
         )
-    def _parse_state(self, payload: Dict) -> State:
-        """
-        Parse server response into State object.
-        Args:
-            payload: JSON response from state request
-        Returns:
-            State object with episode_id and step_count
-        """
-        return State(
-            episode_id=payload.get("episode_id"),
-            step_count=payload.get("step_count", 0),
         )

+"""Lightweight HTTP client for the current FastAPI incident triage server."""
+from __future__ import annotations
+from typing import Any, Dict, Optional
+import requests
+try:
+    from .models import IncidentAction, IncidentState, StepResult
+except ImportError:
+    from models import IncidentAction, IncidentState, StepResult
+class IncidentTriageClient:
+    """Small helper for calling the local FastAPI endpoints from scripts or notebooks."""
+    def __init__(self, base_url: str = "http://localhost:7860", timeout: float = 30.0):
+        self.base_url = base_url.rstrip("/")
+        self.timeout = timeout
+        self.session = requests.Session()
+    def __enter__(self) -> "IncidentTriageClient":
+        return self
+    def __exit__(self, exc_type, exc, tb) -> None:
+        self.close()
+    def close(self) -> None:
+        self.session.close()
+    def tasks(self) -> Dict[str, Any]:
+        return self._request("GET", "/tasks")
+    def grader_info(self) -> Dict[str, Any]:
+        return self._request("GET", "/grader")
+    def reset(
+        self,
+        task_type: Optional[str] = None,
+        ticket_id: Optional[str] = None,
+        seed: Optional[int] = None,
+    ) -> StepResult:
         return StepResult(
+            **self._request(
+                "POST",
+                "/reset",
+                json={
+                    "task_type": task_type,
+                    "ticket_id": ticket_id,
+                    "seed": seed,
+                },
+            )
         )
+    def state(self, session_id: str) -> IncidentState:
+        return IncidentState(
+            **self._request("GET", "/state", params={"session_id": session_id})
+        )
+    def step(self, session_id: str, action: IncidentAction | Dict[str, Any]) -> StepResult:
+        payload = action.model_dump() if isinstance(action, IncidentAction) else action
+        result = self._request(
+            "POST",
+            "/step",
+            params={"session_id": session_id},
+            json=payload,
+        )
+        return StepResult(**result)
+    def _request(self, method: str, path: str, **kwargs: Any) -> Dict[str, Any]:
+        response = self.session.request(
+            method=method,
+            url=f"{self.base_url}{path}",
+            timeout=self.timeout,
+            **kwargs,
         )
+        response.raise_for_status()
+        return response.json()
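
In the new client, `/reset` returns the `session_id` inside `info`, while `/step` expects it as a query parameter next to the JSON action body. The sketch below shows that hand-off without contacting a server. `build_step_request` and the sample response values are hypothetical. The action body shape (`incident_id`, `task_type`, and one populated answer field) is an assumption based on the fields visible elsewhere in this diff, not on `models.py` itself.

```python
import json
from urllib.parse import urlencode


def build_step_request(base_url: str, reset_response: dict, answer: str) -> tuple[str, str]:
    """Return the (url, json_body) pair for POST /step, given a /reset response."""
    observation = reset_response["observation"]
    session_id = reset_response["info"]["session_id"]

    # Exactly one answer field is populated; the others stay null.
    body = {
        "incident_id": observation["incident_id"],
        "task_type": observation["task_type"],
        "severity": None,
        "root_cause": None,
        "action": None,
    }
    body[observation["expected_field"]] = answer

    query = urlencode({"session_id": session_id})
    return f"{base_url.rstrip('/')}/step?{query}", json.dumps(body)


# Hypothetical /reset response, trimmed to the keys this sketch reads.
sample_reset = {
    "observation": {
        "incident_id": "INC-003",
        "task_type": "task1",
        "expected_field": "severity",
    },
    "info": {"session_id": "1234-abcd"},
}

url, body = build_step_request("http://localhost:7860/", sample_reset, "SEV3")
print(url)
print(body)
```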

environment.py CHANGED

@@ -1,62 +1,210 @@
-#----- Edited file--------------
-# environment.py
 import random
-from models import IncidentAction, IncidentObservation, StepResult
 from incidents import TICKETS
 from graders import GRADERS
 class IncidentEnv:
     def __init__(self):
         self.current_ticket = None
-    def reset(self, task_type: str = None) -> IncidentObservation:
-        pool = TICKETS
-        if task_type:
-            pool = [t for t in TICKETS if t["task_type"] == task_type]
-        if not pool:
-            raise ValueError(f"No tickets found for task_type: {task_type}")
-        self.current_ticket = random.choice(pool)
-        return IncidentObservation(
-            incident_id=self.current_ticket["incident_id"],
-            task_type=self.current_ticket["task_type"],
-            alert_text=self.current_ticket["alert_text"],
-            context=self.current_ticket["context"],
         )
     def step(self, action: IncidentAction) -> StepResult:
         if self.current_ticket is None:
             raise RuntimeError("Call reset() before step()")
         if action.incident_id != self.current_ticket["incident_id"]:
             raise ValueError(
                 f"Action incident_id '{action.incident_id}' does not match "
                 f"current ticket '{self.current_ticket['incident_id']}'"
             )
         task_type = self.current_ticket["task_type"]
         ground_truth = self.current_ticket["ground_truth"]
         grader_fn = GRADERS[task_type]
-        reward = grader_fn(action, ground_truth)
-        agent_answer = (
-            action.severity.value    if task_type == "task1" and action.severity   else
-            action.root_cause.value  if task_type == "task2" and action.root_cause else
-            action.action.value      if task_type == "task3" and action.action      else
-            "NONE"
-        )
-        gt_field = list(ground_truth.values())[0]
         return StepResult(
             incident_id=self.current_ticket["incident_id"],
-            task_type=task_type,
-            reward=reward,
-            correct=reward == 1.0,
-            ground_truth=gt_field,
-            agent_answer=agent_answer,
-        )

 import random
+import uuid
 from incidents import TICKETS
 from graders import GRADERS
+from models import (
+    IncidentAction,
+    IncidentObservation,
+    IncidentReward,
+    IncidentState,
+    RecommendedAction,
+    RootCauseCategory,
+    SeverityLevel,
+    StepResult,
+    TaskType,
+)
+TASK_SPECS = {
+    TaskType.TASK1: {
+        "name": "Severity Classification",
+        "difficulty": "easy",
+        "expected_field": "severity",
+        "allowed_values": [item.value for item in SeverityLevel],
+        "description": "Classify the severity of the incident using blast radius, user impact, and business risk.",
+    },
+    TaskType.TASK2: {
+        "name": "Root Cause Classification",
+        "difficulty": "medium",
+        "expected_field": "root_cause",
+        "allowed_values": [item.value for item in RootCauseCategory],
+        "description": "Identify the most likely failure domain from the incident evidence.",
+    },
+    TaskType.TASK3: {
+        "name": "Recommended Action",
+        "difficulty": "hard",
+        "expected_field": "action",
+        "allowed_values": [item.value for item in RecommendedAction],
+        "description": "Choose the best immediate operational response for stabilizing the incident.",
+    },
+}
+TICKETS_BY_ID = {ticket["incident_id"]: ticket for ticket in TICKETS}
 class IncidentEnv:
     def __init__(self):
         self.current_ticket = None
+        self.episode_id = ""
+        self.step_count = 0
+        self.max_steps = 1
+        self.total_reward = 0.0
+        self.done = False
+        self.last_reward = 0.0
+        self.last_action_summary = None
+    def reset(
+        self,
+        task_type: TaskType | str | None = None,
+        ticket_id: str | None = None,
+        seed: int | None = None,
+    ) -> StepResult:
+        normalized_task = TaskType(task_type) if task_type else None
+        self.current_ticket = self._select_ticket(normalized_task, ticket_id, seed)
+        self.episode_id = str(uuid.uuid4())
+        self.step_count = 0
+        self.total_reward = 0.0
+        self.done = False
+        self.last_reward = 0.0
+        self.last_action_summary = None
+        return StepResult(
+            observation=self._build_observation(),
+            reward=IncidentReward(value=0.0, reason="Episode initialized."),
+            done=False,
+            info={
+                "episode_id": self.episode_id,
+                "task_name": self._task_spec()["name"],
+                "difficulty": self._task_spec()["difficulty"],
+                "max_steps": self.max_steps,
+            },
         )
     def step(self, action: IncidentAction) -> StepResult:
         if self.current_ticket is None:
             raise RuntimeError("Call reset() before step()")
+        if self.done:
+            raise RuntimeError("Episode already completed. Call reset() to start a new one.")
         if action.incident_id != self.current_ticket["incident_id"]:
             raise ValueError(
                 f"Action incident_id '{action.incident_id}' does not match "
                 f"current ticket '{self.current_ticket['incident_id']}'"
             )
+        if action.task_type != TaskType(self.current_ticket["task_type"]):
+            raise ValueError(
+                f"Action task_type '{action.task_type.value}' does not match "
+                f"current ticket task_type '{self.current_ticket['task_type']}'"
+            )
+        self._validate_action(action)
         task_type = self.current_ticket["task_type"]
         ground_truth = self.current_ticket["ground_truth"]
         grader_fn = GRADERS[task_type]
+        reward_value, reward_reason = grader_fn(action, ground_truth)
+        agent_answer = action.selected_value() or "NONE"
+        selected_field = action.selected_field() or "NONE"
+        ground_truth_value = list(ground_truth.values())[0]
+        self.step_count += 1
+        self.last_reward = reward_value
+        self.total_reward += reward_value
+        self.done = self.step_count >= self.max_steps
+        self.last_action_summary = f"Submitted {selected_field}={agent_answer}"
         return StepResult(
+            observation=self._build_observation(),
+            reward=IncidentReward(value=reward_value, reason=reward_reason),
+            done=self.done,
+            info={
+                "episode_id": self.episode_id,
+                "task_name": self._task_spec()["name"],
+                "difficulty": self._task_spec()["difficulty"],
+                "correct": reward_value == 1.0,
+                "ground_truth": ground_truth_value,
+                "agent_answer": agent_answer,
+                "selected_field": selected_field,
+                "max_steps": self.max_steps,
+            },
+        )
+    def state(self, session_id: str | None = None) -> IncidentState:
+        if self.current_ticket is None:
+            raise RuntimeError("No active episode. Call reset() first.")
+        return IncidentState(
+            episode_id=self.episode_id,
+            session_id=session_id,
+            step_count=self.step_count,
+            max_steps=self.max_steps,
+            total_reward=self.total_reward,
+            done=self.done,
             incident_id=self.current_ticket["incident_id"],
+            task_type=TaskType(self.current_ticket["task_type"]),
+            difficulty=self._task_spec()["difficulty"],
+            status="completed" if self.done else "awaiting_action",
+            last_reward=self.last_reward,
+        )
+    def _select_ticket(
+        self,
+        task_type: TaskType | None = None,
+        ticket_id: str | None = None,
+        seed: int | None = None,
+    ) -> dict:
+        if ticket_id:
+            ticket = TICKETS_BY_ID.get(ticket_id)
+            if ticket is None:
+                raise ValueError(f"No ticket found for ticket_id: {ticket_id}")
+            if task_type and ticket["task_type"] != task_type.value:
+                raise ValueError(
+                    f"Ticket '{ticket_id}' belongs to task_type '{ticket['task_type']}', "
+                    f"not '{task_type.value}'"
+                )
+            return ticket
+        pool = TICKETS
+        if task_type:
+            pool = [ticket for ticket in TICKETS if ticket["task_type"] == task_type.value]
+        if not pool:
+            raise ValueError(f"No tickets found for task_type: {task_type}")
+        chooser = random.Random(seed) if seed is not None else random
+        return chooser.choice(pool)
+    def _task_spec(self) -> dict:
+        if self.current_ticket is None:
+            raise RuntimeError("No active episode. Call reset() first.")
+        return TASK_SPECS[TaskType(self.current_ticket["task_type"])]
+    def _build_observation(self) -> IncidentObservation:
+        spec = self._task_spec()
+        return IncidentObservation(
+            incident_id=self.current_ticket["incident_id"],
+            task_type=TaskType(self.current_ticket["task_type"]),
+            difficulty=spec["difficulty"],
+            task_description=spec["description"],
+            alert_text=self.current_ticket["alert_text"],
+            context=self.current_ticket["context"],
+            expected_field=spec["expected_field"],
+            allowed_values=spec["allowed_values"],
+            step_count=self.step_count,
+            max_steps=self.max_steps,
+            last_action_summary=self.last_action_summary,
+            last_reward=self.last_reward,
+            episode_status="completed" if self.done else "awaiting_action",
+        )
+    def _validate_action(self, action: IncidentAction) -> None:
+        populated = action.populated_fields()
+        if len(populated) != 1:
+            raise ValueError("Action must populate exactly one of severity, root_cause, or action.")
+        expected_field = self._task_spec()["expected_field"]
+        if expected_field not in populated:
+            raise ValueError(
+                f"Task '{self.current_ticket['task_type']}' expects field '{expected_field}', "
+                f"but got '{next(iter(populated))}'."
+            )
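
The `seed` argument added to `reset()` makes ticket selection reproducible by drawing from a private `random.Random(seed)` instead of the module-level generator. The sketch below isolates that behaviour, assuming a small stand-in ticket list. The `TICKETS` entries and the free function `select_ticket` are hypothetical simplifications of `_select_ticket`.

```python
import random

# Hypothetical stand-in for incidents.TICKETS; only the keys used here are included.
TICKETS = [
    {"incident_id": "INC-001", "task_type": "task1"},
    {"incident_id": "INC-006", "task_type": "task2"},
    {"incident_id": "INC-011", "task_type": "task3"},
    {"incident_id": "INC-017", "task_type": "task1"},
]


def select_ticket(task_type=None, seed=None):
    """Pick a ticket, optionally filtered by task type and pinned by seed."""
    pool = [t for t in TICKETS if task_type is None or t["task_type"] == task_type]
    if not pool:
        raise ValueError(f"No tickets found for task_type: {task_type}")
    # A dedicated Random instance keeps the seeded draw independent of global state.
    chooser = random.Random(seed) if seed is not None else random
    return chooser.choice(pool)


first = select_ticket(seed=42)
second = select_ticket(seed=42)
print(first["incident_id"] == second["incident_id"])  # same seed, same ticket
```

Because each call builds a fresh generator from the seed, repeated resets with the same seed and filter return the same ticket, regardless of what other code has drawn from the global `random` module in between.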

graders.py CHANGED

@@ -1,33 +1,71 @@
-#----- Edited file--------------
-# graders.py
 from models import IncidentAction
 _SEV_ORDER = {"SEV1": 0, "SEV2": 1, "SEV3": 2}
-def grade_task1(action: IncidentAction, ground_truth: dict) -> float:
     if action.severity is None:
-        return 0.0
     predicted = _SEV_ORDER.get(action.severity.value, -1)
-    expected  = _SEV_ORDER.get(ground_truth["severity"], -1)
-    distance  = abs(predicted - expected)
-    return {0: 1.0, 1: 0.5, 2: 0.0}[distance]
-def grade_task2(action: IncidentAction, ground_truth: dict) -> float:
     if action.root_cause is None:
-        return 0.0
-    return 1.0 if action.root_cause.value == ground_truth["root_cause"] else 0.0
-def grade_task3(action: IncidentAction, ground_truth: dict) -> float:
     if action.action is None:
-        return 0.0
-    return 1.0 if action.action.value == ground_truth["action"] else 0.0
 GRADERS = {
     "task1": grade_task1,
     "task2": grade_task2,
     "task3": grade_task3,
-}

 from models import IncidentAction
 _SEV_ORDER = {"SEV1": 0, "SEV2": 1, "SEV3": 2}
+_TASK2_RELATED_GROUPS = [
+    {"DATABASE", "APPLICATION"},
+    {"NETWORK", "INFRASTRUCTURE"},
+    {"NETWORK", "THIRD_PARTY"},
+    {"INFRASTRUCTURE", "THIRD_PARTY"},
+]
+_TASK3_PARTIAL = {
+    ("RESTART_SERVICE", "FAILOVER"): 0.25,
+    ("FAILOVER", "RESTART_SERVICE"): 0.25,
+    ("NOTIFY_VENDOR", "INVESTIGATE"): 0.25,
+    ("SCALE_UP", "INVESTIGATE"): 0.25,
+    ("RESTART_SERVICE", "INVESTIGATE"): 0.25,
+}
+def grade_task1(action: IncidentAction, ground_truth: dict) -> tuple[float, str]:
     if action.severity is None:
+        return 0.0, "Missing severity classification."
     predicted = _SEV_ORDER.get(action.severity.value, -1)
+    expected = _SEV_ORDER.get(ground_truth["severity"], -1)
+    distance = abs(predicted - expected)
+    score = {0: 1.0, 1: 0.5, 2: 0.0}[distance]
+    if score == 1.0:
+        return score, "Exact severity match."
+    if score == 0.5:
+        return score, "Adjacent severity band: partial credit for a close escalation call."
+    return score, "Severity choice is too far from the ground truth."
+def grade_task2(action: IncidentAction, ground_truth: dict) -> tuple[float, str]:
     if action.root_cause is None:
+        return 0.0, "Missing root-cause classification."
+    predicted = action.root_cause.value
+    expected = ground_truth["root_cause"]
+    if predicted == expected:
+        return 1.0, "Exact root-cause match."
+    if predicted == "UNKNOWN":
+        return 0.25, "Conservative fallback: uncertainty recognized, but the failure domain was not isolated."
+    if any({predicted, expected} == group for group in _TASK2_RELATED_GROUPS):
+        return 0.5, "Related failure domain selected: partial credit for a near-miss diagnosis."
+    return 0.0, "Root-cause classification does not match the expected failure domain."
+def grade_task3(action: IncidentAction, ground_truth: dict) -> tuple[float, str]:
     if action.action is None:
+        return 0.0, "Missing remediation recommendation."
+    predicted = action.action.value
+    expected = ground_truth["action"]
+    if predicted == expected:
+        return 1.0, "Exact remediation match."
+    if predicted == "INVESTIGATE" and expected != "NO_ACTION":
+        return 0.4, "Safe investigative fallback: the incident was recognized, but the optimal action was not taken."
+    if predicted == "NO_ACTION" and expected == "INVESTIGATE":
+        return 0.25, "Conservative response, but deeper investigation was expected."
+    if (predicted, expected) in _TASK3_PARTIAL:
+        return _TASK3_PARTIAL[(predicted, expected)], "Related remediation selected: partial credit for a close operational response."
+    return 0.0, "Recommended action does not match the expected operator response."
 GRADERS = {
     "task1": grade_task1,
     "task2": grade_task2,
     "task3": grade_task3,
+}
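
The partial-credit rules introduced in `graders.py` are easier to check with plain strings than through the `IncidentAction` model. The sketch below restates the three scoring tables under that simplification. The `score_*` function names are hypothetical and the reason strings are omitted.

```python
SEV_ORDER = {"SEV1": 0, "SEV2": 1, "SEV3": 2}

RELATED_ROOT_CAUSES = [
    {"DATABASE", "APPLICATION"},
    {"NETWORK", "INFRASTRUCTURE"},
    {"NETWORK", "THIRD_PARTY"},
    {"INFRASTRUCTURE", "THIRD_PARTY"},
]

# Keyed as (predicted, expected), so the table is not symmetric.
ACTION_PARTIAL = {
    ("RESTART_SERVICE", "FAILOVER"): 0.25,
    ("FAILOVER", "RESTART_SERVICE"): 0.25,
    ("NOTIFY_VENDOR", "INVESTIGATE"): 0.25,
    ("SCALE_UP", "INVESTIGATE"): 0.25,
    ("RESTART_SERVICE", "INVESTIGATE"): 0.25,
}


def score_severity(predicted: str, expected: str) -> float:
    """Task 1: full credit for exact, half for an adjacent band, none otherwise."""
    distance = abs(SEV_ORDER[predicted] - SEV_ORDER[expected])
    return {0: 1.0, 1: 0.5, 2: 0.0}[distance]


def score_root_cause(predicted: str, expected: str) -> float:
    """Task 2: UNKNOWN is checked before the related-domain pairs."""
    if predicted == expected:
        return 1.0
    if predicted == "UNKNOWN":
        return 0.25
    if {predicted, expected} in RELATED_ROOT_CAUSES:
        return 0.5
    return 0.0


def score_action(predicted: str, expected: str) -> float:
    """Task 3: INVESTIGATE is a safe fallback unless NO_ACTION was expected."""
    if predicted == expected:
        return 1.0
    if predicted == "INVESTIGATE" and expected != "NO_ACTION":
        return 0.4
    if predicted == "NO_ACTION" and expected == "INVESTIGATE":
        return 0.25
    return ACTION_PARTIAL.get((predicted, expected), 0.0)


print(score_severity("SEV1", "SEV2"))
print(score_root_cause("NETWORK", "THIRD_PARTY"))
print(score_action("INVESTIGATE", "FAILOVER"))
```

Note the asymmetry in task 3: `INVESTIGATE` earns 0.4 against any expected action other than `NO_ACTION`, whereas answering `FAILOVER` when `INVESTIGATE` was expected is absent from the pair table and scores 0.0.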

incidents.py CHANGED

@@ -3,7 +3,7 @@
 TICKETS = [
-    # ── TASK 1: Severity Classification ───────────────────────────────────────
     {
         "incident_id": "INC-001",
@@ -72,7 +72,7 @@ TICKETS = [
         "ground_truth": {"severity": "SEV2"}
     },
-    # ── TASK 2: Root Cause Classification ─────────────────────────────────────
     {
         "incident_id": "INC-006",
@@ -142,7 +142,7 @@ TICKETS = [
         "ground_truth": {"root_cause": "INFRASTRUCTURE"}
     },
-    # ── TASK 3: Recommended Action ────────────────────────────────────────────
     {
         "incident_id": "INC-011",
@@ -226,7 +226,7 @@ TICKETS = [
         "ground_truth": {"severity": "SEV1"}
     },
-    # ── TASK 1: Severity (Ambiguous + Edge) ─────────────────────────────
     {
         "incident_id": "INC-017",
@@ -263,7 +263,7 @@ TICKETS = [
         "ground_truth": {"severity": "SEV3"}
     },
-    # ── TASK 2: Root Cause (Confusing Signals) ───────────────────────────
     {
         "incident_id": "INC-020",
@@ -310,7 +310,7 @@ TICKETS = [
         "ground_truth": {"root_cause": "INFRASTRUCTURE"}
     },
-    # ── TASK 3: Action (Ambiguous Decisions) ─────────────────────────────
     {
         "incident_id": "INC-024",
@@ -368,7 +368,7 @@ TICKETS = [
         "ground_truth": {"action": "FAILOVER"}
     },
-    # ── HARD CASES (REAL THINKING) ──────────────────────────────────────
     {
         "incident_id": "INC-029",
@@ -458,4 +458,4 @@ TICKETS = [
         "ground_truth": {"action": "INVESTIGATE"}
     }
-]

 TICKETS = [
+    # TASK 1: Severity Classification
     {
         "incident_id": "INC-001",
         "ground_truth": {"severity": "SEV2"}
     },
+    # TASK 2: Root Cause Classification
     {
         "incident_id": "INC-006",
         "ground_truth": {"root_cause": "INFRASTRUCTURE"}
     },
+    # TASK 3: Recommended Action
     {
         "incident_id": "INC-011",
         "ground_truth": {"severity": "SEV1"}
     },
+    # TASK 1: Severity (Ambiguous + Edge)
     {
         "incident_id": "INC-017",
         "ground_truth": {"severity": "SEV3"}
     },
+    # TASK 2: Root Cause (Confusing Signals)
     {
         "incident_id": "INC-020",
         "ground_truth": {"root_cause": "INFRASTRUCTURE"}
     },
+    # TASK 3: Action (Ambiguous Decisions)
     {
         "incident_id": "INC-024",
         "ground_truth": {"action": "FAILOVER"}
     },
+    # HARD CASES (REAL THINKING)
     {
         "incident_id": "INC-029",
         "ground_truth": {"action": "INVESTIGATE"}
     }
+]

inference.py CHANGED

@@ -1,194 +1,408 @@
-# inference.py
-import os
 import json
 import re
 import requests
 from openai import OpenAI
 from incidents import TICKETS
-from dotenv import load_dotenv
 load_dotenv()
-BASE_URL = "http://localhost:8000"
-client = OpenAI(
-    base_url=os.getenv("API_BASE_URL"),
-    api_key=os.getenv("HF_TOKEN")
 )
-SYSTEM_PROMPT = """You are an expert SRE (Site Reliability Engineer) triaging production incidents.
-You will receive an incident alert and context.
-You must respond with ONLY a valid JSON object. No explanation. No markdown. No extra text. No code blocks.
 Rules:
-- For task1: classify severity. Choose ONLY from: SEV1, SEV2, SEV3
-- For task2: classify root cause. Choose ONLY from: DATABASE, NETWORK, APPLICATION, INFRASTRUCTURE, THIRD_PARTY, UNKNOWN
-- For task3: recommend action. Choose ONLY from: ROLLBACK, SCALE_UP, RESTART_SERVICE, FAILOVER, NOTIFY_VENDOR, INVESTIGATE, NO_ACTION
-Response format (return this exact structure):
-{"incident_id": "<incident_id>", "task_type": "<task_type>", "severity": "<value or null>", "root_cause": "<value or null>", "action": "<value or null>"}
-Only populate the field relevant to the task_type. Set others to null.
-"""
-def build_user_prompt(observation: dict) -> str:
-    return f"""Incident ID: {observation['incident_id']}
-Task Type: {observation['task_type']}
-Alert:
-{observation['alert_text']}
-Context:
-{json.dumps(observation['context'], indent=2)}
-Respond with JSON only. No markdown. No explanation."""
-# 🔥 Robust JSON extractor
-def extract_json(raw: str) -> dict:
     match = re.search(r"\{.*\}", raw, re.DOTALL)
     if not match:
-        raise ValueError("No JSON found in response")
     return json.loads(match.group(0))
-def normalize_action(action: dict, task_type: str) -> dict:
     return {
-        "incident_id": action.get("incident_id"),
         "task_type": task_type,
-        "severity": action.get("severity") if task_type == "task1" else None,
-        "root_cause": action.get("root_cause") if task_type == "task2" else None,
-        "action": action.get("action") if task_type == "task3" else None,
     }
-def call_llm(observation: dict) -> str:
-    full_response = ""
     try:
-        completion = client.chat.completions.create(
-            model=os.getenv("MODEL_NAME"),
-            messages=[
-                {"role": "system", "content": SYSTEM_PROMPT},
-                {"role": "user", "content": build_user_prompt(observation)}
-            ],
-            temperature=0.1,
-            top_p=0.9,
-            max_tokens=200,
-            seed=42,
-            stream=True
         )
-        for chunk in completion:
-            if chunk.choices and chunk.choices[0].delta.content is not None:
-                full_response += chunk.choices[0].delta.content
-    except Exception as e:
-        print(f"Error calling LLM: {e}")
-        return ""
-    return full_response.strip()
-def run_episode(task_type: str = None) -> dict:
-    # Step 1 — Reset environment
-    params = {"task_type": task_type} if task_type else {}
-    reset_response = requests.post(f"{BASE_URL}/reset", params=params)
-    reset_response.raise_for_status()
-    reset_data = reset_response.json()
-    session_id = reset_data["session_id"]
-    observation = reset_data
-    print(f"\n{'='*60}")
-    print(f"Incident : {observation['incident_id']}")
-    print(f"Task     : {observation['task_type']}")
-    print(f"Alert    : {observation['alert_text'][:80]}...")
-    # Step 2 — LLM with retry
-    action = None
-    raw = ""
-    for attempt in range(3):
-        raw = call_llm(observation)
-        print(f"LLM Raw (attempt {attempt+1}): {raw}")
-        try:
-            parsed = extract_json(raw)
-            action = normalize_action(parsed, observation["task_type"])
-            break
-        except Exception as e:
-            print(f"Parse failed: {e}")
-    if not action:
-        return {"error": "invalid_json", "raw": raw}
-        # Step 3 — Validate schema
-    required_keys = {"incident_id", "task_type", "severity", "root_cause", "action"}
-    if not required_keys.issubset(action.keys()):
-        print("Invalid schema from LLM")
-        return {"error": "invalid_schema", "raw": raw}
-    # Step 4 — Submit to /step
-    step_response = requests.post(f"{BASE_URL}/step", json=action, params={"session_id": session_id})
-    step_response.raise_for_status()
-    result = step_response.json()
-    # This need to be kept for submission grading, so we print it in a structured way
-    print(f"[STEP] task_id={result['task_type']} action={result['agent_answer']} reward={result['reward']}")
-    print(f"Answer   : {result['agent_answer']}")
-    print(f"Expected : {result['ground_truth']}")
-    print(f"Correct  : {result['correct']}  |  Reward: {result['reward']}")
-    # 🔥 Logging
-    with open("logs.jsonl", "a") as f:
-        f.write(json.dumps({
-            "observation": observation,
-            "response": action,
-            "result": result
-        }) + "\n")
-    return result
-def run_full_eval():
-    print("[START]")
-    task_types = ["task1", "task2", "task3"]
-    rounds = len(TICKETS)  # 🔥 FIXED
-    scores = []
-    errors = 0
-    task_scores = {
-        "task1": [],
-        "task2": [],
-        "task3": []
-    }
-    for i in range(rounds):
-        task = task_types[i % 3]
-        result = run_episode(task_type=task)
-        if "reward" in result:
-            scores.append(result["reward"])
-            task_scores[task].append(result["reward"])
-        else:
-            errors += 1
-    print(f"\n{'='*60}")
-    print(f"Total Episodes : {rounds}")
-    print(f"Graded         : {len(scores)}")
-    print(f"JSON Errors    : {errors}")
-    if scores:
-        print(f"Total Reward : {sum(scores)}")
-        print(f"Average Reward : {sum(scores)/len(scores):.2f}")
-        print(f"Overall Accuracy : {sum(scores)/len(scores)*100:.2f}%")
-        for task in task_scores:
-            if task_scores[task]:
-                acc = sum(task_scores[task]) / len(task_scores[task]) * 100
-                print(f"{task} Accuracy : {acc:.2f}%")
-    print("[END]")
 if __name__ == "__main__":
-    run_full_eval()

 import json
+import os
 import re
+from pathlib import Path
+from typing import Any, Dict, List, Optional
 import requests
+from dotenv import load_dotenv
 from openai import OpenAI
 from incidents import TICKETS
 load_dotenv()
+API_BASE_URL = os.environ.get("API_BASE_URL") or "https://router.huggingface.co/v1"
+MODEL_NAME = os.environ.get("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
+API_KEY = (
+    os.environ.get("HF_TOKEN")
+    or os.environ.get("API_KEY")
+    or os.environ.get("OPENAI_API_KEY")
+    or ""
 )
+ENV_URL = os.environ.get("ENV_URL") or "http://localhost:7860"
+BENCHMARK = "incident-triage-env"
+MAX_TOKENS = 300
+TEMPERATURE = 0.0
+OUTPUT_PATH = Path("outputs/baseline_scores.json")
+SYSTEM_PROMPT = """You are an expert SRE triaging production incidents.
+You will receive an incident alert, structured context, and the expected output field.
+Return ONLY a valid JSON object with this exact shape:
+{"incident_id":"<id>","task_type":"<task_type>","severity":null,"root_cause":null,"action":null}
 Rules:
+- Populate exactly one of severity, root_cause, or action based on task_type.
+- Allowed severity values: SEV1, SEV2, SEV3
+- Allowed root_cause values: DATABASE, NETWORK, APPLICATION, INFRASTRUCTURE, THIRD_PARTY, UNKNOWN
+- Allowed action values: ROLLBACK, SCALE_UP, RESTART_SERVICE, FAILOVER, NOTIFY_VENDOR, INVESTIGATE, NO_ACTION
+- Keep incident_id and task_type identical to the observation.
+- Do not return markdown, prose, or any extra keys.
+"""
+def log_start(task: str, env: str, model: str) -> None:
+    print(f"[START] task={task} env={env} model={model}", flush=True)
+def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
+    error_val = error if error else "null"
+    done_val = str(done).lower()
+    action_clean = action.replace("\n", " ").replace("\r", "")[:100]
+    print(
+        f"[STEP] step={step} action={action_clean} reward={reward:.2f} done={done_val} error={error_val}",
+        flush=True,
+    )
+def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+    rewards_str = ",".join(f"{reward:.2f}" for reward in rewards)
+    print(
+        f"[END] success={str(success).lower()} steps={steps} score={score:.2f} rewards={rewards_str}",
+        flush=True,
+    )
+class EnvironmentTransport:
+    def reset(self, task_type: str, ticket_id: str) -> Dict[str, Any]:
+        raise NotImplementedError
+    def step(self, session_id: str, action: Dict[str, Any]) -> Dict[str, Any]:
+        raise NotImplementedError
+    def close(self) -> None:
+        return None
+class HttpEnvironmentTransport(EnvironmentTransport):
+    def __init__(self, base_url: str):
+        self.base_url = base_url.rstrip("/")
+        self.session = requests.Session()
+    def probe(self) -> bool:
+        try:
+            response = self.session.get(f"{self.base_url}/health", timeout=5)
+            return response.ok
+        except requests.RequestException:
+            return False
+    def reset(self, task_type: str, ticket_id: str) -> Dict[str, Any]:
+        response = self.session.post(
+            f"{self.base_url}/reset",
+            json={"task_type": task_type, "ticket_id": ticket_id},
+            timeout=30,
+        )
+        response.raise_for_status()
+        return response.json()
+    def step(self, session_id: str, action: Dict[str, Any]) -> Dict[str, Any]:
+        response = self.session.post(
+            f"{self.base_url}/step",
+            params={"session_id": session_id},
+            json=action,
+            timeout=30,
+        )
+        response.raise_for_status()
+        return response.json()
+    def close(self) -> None:
+        self.session.close()
+class LocalEnvironmentTransport(EnvironmentTransport):
+    def __init__(self):
+        from fastapi.testclient import TestClient
+        import app as app_module
+        self.session = TestClient(app_module.app)
+    def reset(self, task_type: str, ticket_id: str) -> Dict[str, Any]:
+        response = self.session.post(
+            "/reset",
+            json={"task_type": task_type, "ticket_id": ticket_id},
+        )
+        response.raise_for_status()
+        return response.json()
+    def step(self, session_id: str, action: Dict[str, Any]) -> Dict[str, Any]:
+        response = self.session.post(
+            "/step",
+            params={"session_id": session_id},
+            json=action,
+        )
+        response.raise_for_status()
+        return response.json()
+    def close(self) -> None:
+        self.session.close()
+def build_transport() -> EnvironmentTransport:
+    http_transport = HttpEnvironmentTransport(ENV_URL)
+    if http_transport.probe():
+        return http_transport
+    http_transport.close()
+    return LocalEnvironmentTransport()
+def create_model_client() -> Optional[OpenAI]:
+    if not (API_BASE_URL and API_KEY and MODEL_NAME):
+        return None
+    return OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+def build_user_prompt(observation: Dict[str, Any]) -> str:
+    return (
+        f"Incident ID: {observation['incident_id']}\n"
+        f"Task Type: {observation['task_type']}\n"
+        f"Difficulty: {observation['difficulty']}\n"
+        f"Task Description: {observation['task_description']}\n"
+        f"Expected Field: {observation['expected_field']}\n"
+        f"Allowed Values: {', '.join(observation['allowed_values'])}\n\n"
+        f"Alert:\n{observation['alert_text']}\n\n"
+        f"Context:\n{json.dumps(observation['context'], indent=2, sort_keys=True)}\n"
+    )
+def extract_json(raw: str) -> Dict[str, Any]:
+    fenced = re.search(r"```json\s*(.*?)\s*```", raw, re.DOTALL)
+    if fenced:
+        return json.loads(fenced.group(1))
+    try:
+        return json.loads(raw)
+    except json.JSONDecodeError:
+        pass
     match = re.search(r"\{.*\}", raw, re.DOTALL)
     if not match:
+        raise ValueError("No JSON object found in model response.")
     return json.loads(match.group(0))
+def normalize_action(raw_action: Dict[str, Any], observation: Dict[str, Any]) -> Dict[str, Any]:
+    task_type = observation["task_type"]
     return {
+        "incident_id": observation["incident_id"],
         "task_type": task_type,
+        "severity": raw_action.get("severity") if task_type == "task1" else None,
+        "root_cause": raw_action.get("root_cause") if task_type == "task2" else None,
+        "action": raw_action.get("action") if task_type == "task3" else None,
     }
+def _number(value: Any) -> Optional[float]:
+    if isinstance(value, (int, float)):
+        return float(value)
+    if isinstance(value, str):
+        match = re.search(r"(\d+(?:\.\d+)?)", value)
+        if match:
+            return float(match.group(1))
+    return None
+def predict_severity(alert_text: str, context: Dict[str, Any]) -> str:
+    error_rate = (
+        _number(context.get("error_rate_pct"))
+        or _number(context.get("failure_rate"))
+        or _number(context.get("affected_users_pct"))
+    )
+    revenue_impact = context.get("revenue_impact") is True or context.get("revenue_dependency") == "high"
+    if (
+        "CRITICAL" in alert_text
+        or "100%" in alert_text
+        or "REVENUE IMPACT" in alert_text
+        or context.get("region") == "global"
+        or revenue_impact
+        or (error_rate is not None and error_rate >= 40)
+    ):
+        return "SEV1"
+    if (
+        "INTERNAL ONLY" in alert_text
+        or "COSMETIC" in alert_text
+        or "NO USER-FACING IMPACT" in alert_text
+        or context.get("user_impact") in {"cosmetic", False}
+        or context.get("impact") == "cosmetic"
+    ):
+        return "SEV3"
+    return "SEV2"
+def predict_root_cause(alert_text: str, context_text: str) -> str:
+    if any(keyword in alert_text or keyword in context_text for keyword in ["STRIPE", "SENDGRID", "TWILIO", "VENDOR", "WEBHOOK", "EXTERNAL API"]):
+        return "THIRD_PARTY"
+    if any(keyword in alert_text or keyword in context_text for keyword in ["PACKET LOSS", "BGP", "TRACEROUTE", "ROUTE", "CROSS-REGION", "TRANSIT HOP"]):
+        return "NETWORK"
+    if any(keyword in alert_text or keyword in context_text for keyword in ["POSTGRES", "DB ", "DATABASE", "SLOW QUERY", "CONNECTION POOL", "REPLICA", "WRITE QUERIES", "DB_CPU"]):
+        return "DATABASE"
+    if any(keyword in alert_text or keyword in context_text for keyword in ["KUBERNETES", "NODE", "POD", "CLUSTER", "NOTREADY", "MEMORY PRESSURE", "EC2", "SPOT INTERRUPTION"]):
+        return "INFRASTRUCTURE"
+    if any(keyword in alert_text or keyword in context_text for keyword in ["EXCEPTION", "STACK TRACE", "DEPLOY", "CRASH", "NULLPOINTER", "TIMEOUTEXCEPTION", "CODE"]):
+        return "APPLICATION"
+    return "UNKNOWN"
+def predict_action(alert_text: str, context_text: str) -> str:
+    if any(keyword in alert_text or keyword in context_text for keyword in ["ROLLBACK", "IMMEDIATELY AFTER DEPLOY", "PREVIOUS_STABLE", "RECENT DEPLOY CAUSED"]):
+        return "ROLLBACK"
+    if any(keyword in alert_text or keyword in context_text for keyword in ["CPU", "QUEUE", "AUTOSCALER", "MAX_REPLICAS", "TRAFFIC SPIKE", "FLASH SALE"]):
+        return "SCALE_UP"
+    if any(keyword in alert_text or keyword in context_text for keyword in ["DEADLOCK", "HEALTH CHECK", "STUCK", "NO RESPONSE", "PROCESS NOT RESPONDING"]):
+        return "RESTART_SERVICE"
+    if any(keyword in alert_text or keyword in context_text for keyword in ["FAILOVER", "READ REPLICA", "PRIMARY DOWN", "PRIMARY RDS", "WRITES FAILING"]):
+        return "FAILOVER"
+    if any(keyword in alert_text or keyword in context_text for keyword in ["SENDGRID", "STRIPE", "TWILIO", "VENDOR"]):
+        return "NOTIFY_VENDOR"
+    if any(keyword in alert_text or keyword in context_text for keyword in ["COSMETIC", "MINOR UI GLITCH"]):
+        return "NO_ACTION"
+    return "INVESTIGATE"
+def heuristic_action(observation: Dict[str, Any]) -> Dict[str, Any]:
+    task_type = observation["task_type"]
+    alert_text = observation["alert_text"].upper()
+    context_text = json.dumps(observation["context"]).upper()
+    if task_type == "task1":
+        return normalize_action({"severity": predict_severity(alert_text, observation["context"])}, observation)
+    if task_type == "task2":
+        return normalize_action({"root_cause": predict_root_cause(alert_text, context_text)}, observation)
+    return normalize_action({"action": predict_action(alert_text, context_text)}, observation)
+def get_action(model_client: Optional[OpenAI], observation: Dict[str, Any]) -> Dict[str, Any]:
+    if model_client is None:
+        return heuristic_action(observation)
+    for _ in range(2):
+        try:
+            completion = model_client.chat.completions.create(
+                model=MODEL_NAME,
+                messages=[
+                    {"role": "system", "content": SYSTEM_PROMPT},
+                    {"role": "user", "content": build_user_prompt(observation)},
+                ],
+                temperature=TEMPERATURE,
+                max_tokens=MAX_TOKENS,
+            )
+            content = (completion.choices[0].message.content or "").strip()
+            return normalize_action(extract_json(content), observation)
+        except Exception:
+            continue
+    return heuristic_action(observation)
+def reward_value(step_data: Dict[str, Any]) -> float:
+    reward = step_data.get("reward", {})
+    if isinstance(reward, dict):
+        return float(reward.get("value", 0.0))
+    return float(reward or 0.0)
+def active_model_name(model_client: Optional[OpenAI]) -> str:
+    return MODEL_NAME if model_client is not None else "deterministic-baseline"
+def summarize_action(action: Dict[str, Any]) -> str:
+    for field in ("severity", "root_cause", "action"):
+        value = action.get(field)
+        if value is not None:
+            return str(value)
+    return "no_action"
+def run_episode(
+    transport: EnvironmentTransport,
+    model_client: Optional[OpenAI],
+    ticket: Dict[str, Any],
+) -> Dict[str, Any]:
+    rewards: List[float] = []
+    steps_taken = 0
+    score = 0.0
+    success = False
+    log_start(task=ticket["incident_id"], env=BENCHMARK, model=active_model_name(model_client))
     try:
+        reset_data = transport.reset(ticket["task_type"], ticket["incident_id"])
+        observation = reset_data["observation"]
+        session_id = reset_data.get("info", {}).get("session_id")
+        if not session_id:
+            raise RuntimeError("Environment reset did not return a session_id.")
+        steps_taken = 1
+        action = get_action(model_client, observation)
+        step_data = transport.step(session_id=session_id, action=action)
+        score = reward_value(step_data)
+        rewards.append(score)
+        success = bool(step_data.get("info", {}).get("correct", score >= 0.99))
+        log_step(
+            step=1,
+            action=summarize_action(action),
+            reward=score,
+            done=bool(step_data.get("done", True)),
+            error=None,
         )
+        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+        return {
+            "incident_id": ticket["incident_id"],
+            "task_type": ticket["task_type"],
+            "difficulty": observation.get("difficulty"),
+            "score": score,
+            "success": success,
+            "ground_truth": step_data.get("info", {}).get("ground_truth"),
+            "agent_answer": step_data.get("info", {}).get("agent_answer"),
+        }
+    except Exception as exc:
+        log_step(step=max(steps_taken, 1), action="error", reward=0.0, done=True, error=str(exc))
+        log_end(success=False, steps=steps_taken, score=0.0, rewards=rewards)
+        return {
+            "incident_id": ticket["incident_id"],
+            "task_type": ticket["task_type"],
+            "score": 0.0,
+            "success": False,
+            "error": str(exc),
+        }
+def write_results(results: List[Dict[str, Any]]) -> None:
+    grouped: Dict[str, List[float]] = {}
+    for result in results:
+        grouped.setdefault(result["task_type"], []).append(result.get("score", 0.0))
+    summary = {
+        "benchmark": BENCHMARK,
+        "model": MODEL_NAME,
+        "episodes": len(results),
+        "average_score": (sum(result.get("score", 0.0) for result in results) / len(results)) if results else 0.0,
+        "by_task": {
+            task_type: {
+                "episodes": len(scores),
+                "average_score": (sum(scores) / len(scores)) if scores else 0.0,
+            }
+            for task_type, scores in grouped.items()
+        },
+        "results": results,
+    }
+    OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)
+    OUTPUT_PATH.write_text(json.dumps(summary, indent=2))
+def main() -> None:
+    transport = build_transport()
+    model_client = create_model_client()
+    results = [run_episode(transport, model_client, ticket) for ticket in TICKETS]
+    write_results(results)
+    transport.close()
 if __name__ == "__main__":
+    main()
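
The added `extract_json` tries three strategies in order: a fenced json block, the whole response as JSON, then the span from the first `{` to the last `}`. A minimal standalone sketch of that fallback order follows. The logic mirrors the added lines; the `FENCE` constant and the sample inputs are assumptions introduced only so the example is self-contained.

```python
import json
import re
from typing import Any, Dict

# Assumption for this sketch: build the three-backtick marker at runtime
# instead of writing it literally, so the example embeds cleanly anywhere.
FENCE = "`" * 3


def extract_json(raw: str) -> Dict[str, Any]:
    # 1) Prefer a fenced json block if the model wrapped its answer in one.
    fenced = re.search(FENCE + r"json\s*(.*?)\s*" + FENCE, raw, re.DOTALL)
    if fenced:
        return json.loads(fenced.group(1))
    # 2) Otherwise try the whole response as JSON.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # 3) Last resort: greedy match from the first "{" to the last "}".
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        raise ValueError("No JSON object found in model response.")
    return json.loads(match.group(0))


# Hypothetical model responses, one per strategy.
samples = [
    FENCE + 'json\n{"severity": "SEV1"}\n' + FENCE,
    '{"action": "FAILOVER"}',
    'Answer: {"root_cause": "NETWORK"} thanks',
]
for sample in samples:
    print(extract_json(sample))
```

Because the last strategy is greedy, a response containing two separate JSON objects would be captured as one span and fail to parse; in `get_action` that failure is caught and the call is retried before the heuristic baseline takes over.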

models.py CHANGED

@@ -1,11 +1,14 @@
-#----- Edited file--------------
 from pydantic import BaseModel, Field
-from enum import Enum
-from typing import Optional, Dict
-# ── Enums ─────────────────────────────────────────────
 class SeverityLevel(str, Enum):
     SEV1 = "SEV1"
@@ -14,52 +17,97 @@ class SeverityLevel(str, Enum):
 class RootCauseCategory(str, Enum):
-    DATABASE       = "DATABASE"
-    NETWORK        = "NETWORK"
-    APPLICATION    = "APPLICATION"
     INFRASTRUCTURE = "INFRASTRUCTURE"
-    THIRD_PARTY    = "THIRD_PARTY"
-    UNKNOWN        = "UNKNOWN"
 class RecommendedAction(str, Enum):
-    ROLLBACK        = "ROLLBACK"
-    SCALE_UP        = "SCALE_UP"
     RESTART_SERVICE = "RESTART_SERVICE"
-    FAILOVER        = "FAILOVER"
-    NOTIFY_VENDOR   = "NOTIFY_VENDOR"
-    INVESTIGATE     = "INVESTIGATE"
-    NO_ACTION       = "NO_ACTION"
-# ── Observation (Input to Agent) ──────────────────────
 class IncidentObservation(BaseModel):
     incident_id: str
-    task_type: str   # "task1" | "task2" | "task3"
     alert_text: str
-    context: Dict
-# ── Action (Output from Agent) ────────────────────────
 class IncidentAction(BaseModel):
     incident_id: str
-    task_type: str
-    severity:   Optional[SeverityLevel]     = Field(None)
     root_cause: Optional[RootCauseCategory] = Field(None)
-    action:     Optional[RecommendedAction] = Field(None)
-# ── Step Result ───────────────────────────────────────
 class StepResult(BaseModel):
     incident_id: str
-    task_type: str
-    reward: float
-    correct: bool
-    ground_truth: str
-    agent_answer: str

+from enum import Enum
+from typing import Any, Dict, List, Optional
 from pydantic import BaseModel, Field
+class TaskType(str, Enum):
+    TASK1 = "task1"
+    TASK2 = "task2"
+    TASK3 = "task3"
 class SeverityLevel(str, Enum):
     SEV1 = "SEV1"
 class RootCauseCategory(str, Enum):
+    DATABASE = "DATABASE"
+    NETWORK = "NETWORK"
+    APPLICATION = "APPLICATION"
     INFRASTRUCTURE = "INFRASTRUCTURE"
+    THIRD_PARTY = "THIRD_PARTY"
+    UNKNOWN = "UNKNOWN"
 class RecommendedAction(str, Enum):
+    ROLLBACK = "ROLLBACK"
+    SCALE_UP = "SCALE_UP"
     RESTART_SERVICE = "RESTART_SERVICE"
+    FAILOVER = "FAILOVER"
+    NOTIFY_VENDOR = "NOTIFY_VENDOR"
+    INVESTIGATE = "INVESTIGATE"
+    NO_ACTION = "NO_ACTION"
 class IncidentObservation(BaseModel):
     incident_id: str
+    task_type: TaskType
+    difficulty: str
+    task_description: str
     alert_text: str
+    context: Dict[str, Any]
+    expected_field: str
+    allowed_values: List[str] = Field(default_factory=list)
+    step_count: int = 0
+    max_steps: int = 1
+    last_action_summary: Optional[str] = None
+    last_reward: float = 0.0
+    episode_status: str = "awaiting_action"
 class IncidentAction(BaseModel):
     incident_id: str
+    task_type: TaskType
+    severity: Optional[SeverityLevel] = Field(None)
     root_cause: Optional[RootCauseCategory] = Field(None)
+    action: Optional[RecommendedAction] = Field(None)
+    def populated_fields(self) -> Dict[str, str]:
+        fields: Dict[str, str] = {}
+        if self.severity is not None:
+            fields["severity"] = self.severity.value
+        if self.root_cause is not None:
+            fields["root_cause"] = self.root_cause.value
+        if self.action is not None:
+            fields["action"] = self.action.value
+        return fields
+    def selected_field(self) -> Optional[str]:
+        populated = self.populated_fields()
+        if len(populated) != 1:
+            return None
+        return next(iter(populated))
+    def selected_value(self) -> Optional[str]:
+        selected = self.selected_field()
+        if selected is None:
+            return None
+        return self.populated_fields()[selected]
+class IncidentReward(BaseModel):
+    value: float = Field(..., ge=0.0, le=1.0)
+    reason: str
 class StepResult(BaseModel):
+    observation: IncidentObservation
+    reward: IncidentReward
+    done: bool
+    info: Dict[str, Any] = Field(default_factory=dict)
+class IncidentState(BaseModel):
+    episode_id: str
+    session_id: Optional[str] = None
+    step_count: int
+    max_steps: int
+    total_reward: float = 0.0
+    done: bool
     incident_id: str
+    task_type: TaskType
+    difficulty: str
+    status: str
+    last_reward: float = 0.0
+class ResetRequest(BaseModel):
+    task_type: Optional[TaskType] = None
+    ticket_id: Optional[str] = None
+    seed: Optional[int] = None
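
The helper methods added to `IncidentAction` encode an "exactly one populated field" rule: `selected_field()` returns a field name only when a single one of `severity`, `root_cause`, or `action` is set. The sketch below illustrates that rule with a plain dataclass. `ActionSketch` is a hypothetical stand-in that uses strings instead of the pydantic model and enums in the diff, so it runs without extra dependencies.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ActionSketch:
    """Hypothetical stand-in for IncidentAction (no pydantic, no enums)."""

    incident_id: str
    task_type: str
    severity: Optional[str] = None
    root_cause: Optional[str] = None
    action: Optional[str] = None

    def populated_fields(self) -> Dict[str, str]:
        # Collect only the answer fields that were actually set.
        candidates = {
            "severity": self.severity,
            "root_cause": self.root_cause,
            "action": self.action,
        }
        return {name: value for name, value in candidates.items() if value is not None}

    def selected_field(self) -> Optional[str]:
        # Zero or several populated fields are both treated as "no selection".
        populated = self.populated_fields()
        return next(iter(populated)) if len(populated) == 1 else None


valid = ActionSketch("INC-001", "task1", severity="SEV2")
ambiguous = ActionSketch("INC-001", "task1", severity="SEV2", action="ROLLBACK")
print(valid.selected_field())
print(ambiguous.selected_field())
```

Returning `None` for both the empty and the ambiguous case lets a caller reject malformed answers with one check instead of two.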

openenv.yaml CHANGED

@@ -1,27 +1,45 @@
 spec_version: 1
-name: Incident_Triage
 type: space
 runtime: fastapi
 app: app:app
 port: 7860
 version: "1.0.0"
 description: >
-  RL-style environment for SRE incident triage.
-  An LLM agent receives production alerts and must classify severity,
-  identify root cause, or recommend remediation actions.
 api:
   base_url: http://0.0.0.0:7860
   endpoints:
     reset:
       method: POST
       path: /reset
-      params:
         task_type:
           type: string
           required: false
           enum: [task1, task2, task3]
-      returns: IncidentObservation + session_id
     step:
       method: POST
@@ -31,7 +49,7 @@ api:
           type: string
           required: true
       body: IncidentAction
-      returns: StepResult
     state:
       method: GET
@@ -40,35 +58,46 @@ api:
         session_id:
           type: string
           required: true
-      returns: current episode state
 tasks:
   task1:
     name: Severity Classification
     output_field: severity
     labels: [SEV1, SEV2, SEV3]
-    reward: partial  # 1.0 exact | 0.5 adjacent | 0.0 far
   task2:
     name: Root Cause Classification
     output_field: root_cause
     labels: [DATABASE, NETWORK, APPLICATION, INFRASTRUCTURE, THIRD_PARTY, UNKNOWN]
-    reward: binary  # 1.0 correct | 0.0 incorrect
   task3:
     name: Recommended Action
     output_field: action
     labels: [ROLLBACK, SCALE_UP, RESTART_SERVICE, FAILOVER, NOTIFY_VENDOR, INVESTIGATE, NO_ACTION]
-    reward: binary  # 1.0 correct | 0.0 incorrect
 dataset:
   total_tickets: 36
   split:
-    task1: 13
     task2: 12
-    task3: 11
 reproducibility:
-  llm_seed: 42
-  llm_temperature: 0.15
-  selection: random per task_type pool

 spec_version: 1
+name: incident-triage-env
 type: space
 runtime: fastapi
 app: app:app
 port: 7860
 version: "1.0.0"
+tags: [openenv]
 description: >
+  Production incident triage environment for evaluating agents on realistic
+  SRE workflows. The agent receives a typed incident observation and must
+  classify severity, identify the most likely root cause, or recommend the
+  best immediate remediation action.
 api:
   base_url: http://0.0.0.0:7860
   endpoints:
+    health:
+      method: GET
+      path: /health
+      returns: health status
+    metadata:
+      method: GET
+      path: /metadata
+      returns: task metadata and dataset summary
     reset:
       method: POST
       path: /reset
+      body:
         task_type:
           type: string
           required: false
           enum: [task1, task2, task3]
+        ticket_id:
+          type: string
+          required: false
+        seed:
+          type: integer
+          required: false
+      returns: StepResult with initial observation and session_id in info
     step:
       method: POST
           type: string
           required: true
       body: IncidentAction
+      returns: StepResult with reward object, done flag, and episode info
     state:
       method: GET
         session_id:
           type: string
           required: true
+      returns: IncidentState
 tasks:
   task1:
     name: Severity Classification
+    difficulty: easy
     output_field: severity
     labels: [SEV1, SEV2, SEV3]
+    reward: "1.0 exact | 0.5 adjacent severity | 0.0 far miss"
   task2:
     name: Root Cause Classification
+    difficulty: medium
     output_field: root_cause
     labels: [DATABASE, NETWORK, APPLICATION, INFRASTRUCTURE, THIRD_PARTY, UNKNOWN]
+    reward: "1.0 exact | 0.5 related domain | 0.25 UNKNOWN fallback | 0.0 wrong"
   task3:
     name: Recommended Action
+    difficulty: hard
     output_field: action
     labels: [ROLLBACK, SCALE_UP, RESTART_SERVICE, FAILOVER, NOTIFY_VENDOR, INVESTIGATE, NO_ACTION]
+    reward: "1.0 exact | 0.4 safe investigate fallback | 0.25 related action | 0.0 wrong"
 dataset:
   total_tickets: 36
   split:
+    task1: 11
     task2: 12
+    task3: 13
+baseline:
+  script: inference.py
+  required_env_vars: [API_BASE_URL, MODEL_NAME, HF_TOKEN]
+  optional_env_vars: [ENV_URL]
+  latest_local_score: 0.9861
+  latest_local_episodes: 36
 reproducibility:
+  inference_temperature: 0.0
+  max_steps_per_episode: 1
+  dataset_order: fixed TICKETS list order in incidents.py
+  baseline_selection: deterministic ticket_id-driven evaluation across all tickets
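
The task1 reward string above reads "1.0 exact | 0.5 adjacent severity | 0.0 far miss". One way to read it is as a distance on the ordered labels. The sketch below assumes that reading; `severity_reward` and `SEVERITY_ORDER` are hypothetical names, and the actual rule lives in `graders.py`, which is not shown in this part of the diff.

```python
# Assumption: "adjacent" means one step apart in this ordering.
SEVERITY_ORDER = ["SEV1", "SEV2", "SEV3"]


def severity_reward(predicted: str, expected: str) -> float:
    """Hypothetical partial-credit rule for task1, not the repository's grader."""
    if predicted not in SEVERITY_ORDER or expected not in SEVERITY_ORDER:
        return 0.0  # unknown labels earn nothing
    distance = abs(SEVERITY_ORDER.index(predicted) - SEVERITY_ORDER.index(expected))
    if distance == 0:
        return 1.0  # exact match
    if distance == 1:
        return 0.5  # adjacent severity
    return 0.0  # far miss (SEV1 vs SEV3)


for predicted in SEVERITY_ORDER:
    print(predicted, severity_reward(predicted, "SEV1"))
```

With three labels, only the SEV1/SEV3 pair counts as a far miss under this reading, so a prediction of SEV2 always earns at least 0.5.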

pyproject.toml CHANGED

@@ -1,45 +1,39 @@
-# Copyright (c) Meta Platforms, Inc. and affiliates.
-# All rights reserved.
-#
-# This source code is licensed under the BSD-style license found in the
-# LICENSE file in the root directory of this source tree.
 [build-system]
-requires = ["setuptools>=45", "wheel"]
 build-backend = "setuptools.build_meta"
 [project]
-name = "openenv-Incident_Triage"
 version = "0.1.0"
-description = "Incident Triage environment for OpenEnv"
 requires-python = ">=3.10"
 dependencies = [
-    # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
-    # install from github
-    # "openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git",
-    "openenv-core[core]>=0.2.2",
-    # Environment-specific dependencies
-    # Add all dependencies needed for your environment here
-    # Examples:
-    # "numpy>=1.19.0",
-    # "torch>=2.0.0",
-    # "gymnasium>=0.29.0",
-    # "openspiel>=1.0.0",
-    # "smolagents>=1.22.0,<2",
 ]
 [project.optional-dependencies]
 dev = [
     "pytest>=8.0.0",
     "pytest-cov>=4.0.0",
 ]
-[project.scripts]
-# Server entry point - enables running via: uv run --project . server
-# or: python -m Incident_Triage.server.app
-server = "Incident_Triage.server.app:main"
 [tool.setuptools]
-include-package-data = true
-packages = ["Incident_Triage", "Incident_Triage.server"]
-package-dir = { "Incident_Triage" = ".", "Incident_Triage.server" = "server" }

 [build-system]
+requires = ["setuptools", "wheel"]
 build-backend = "setuptools.build_meta"
 [project]
+name = "incident-triage-env"
 version = "0.1.0"
+description = "FastAPI incident triage environment for production alert classification."
+readme = "README.md"
 requires-python = ">=3.10"
 dependencies = [
+    "fastapi",
+    "uvicorn",
+    "pydantic",
+    "openai",
+    "requests",
+    "python-dotenv",
+    "openenv-core>=0.2.0",
 ]
+[project.scripts]
+server = "server.app:main"
 [project.optional-dependencies]
 dev = [
     "pytest>=8.0.0",
     "pytest-cov>=4.0.0",
 ]
 [tool.setuptools]
+py-modules = [
+    "app",
+    "client",
+    "environment",
+    "graders",
+    "incidents",
+    "inference",
+    "models",
+]

requirements.txt CHANGED

@@ -2,4 +2,7 @@ fastapi
 uvicorn
 pydantic
 openai
-requests

 uvicorn
 pydantic
 openai
+requests
+python-dotenv
+setuptools
+wheel

server/__init__.py ADDED

	@@ -0,0 +1 @@


+"""OpenEnv compatibility package for the incident triage server."""

server/app.py ADDED

	@@ -0,0 +1,13 @@

+"""Compatibility entrypoint expected by OpenEnv validators and templates."""
+from app import app
+def main() -> None:
+    import uvicorn
+    uvicorn.run("server.app:app", host="0.0.0.0", port=7860)
+if __name__ == "__main__":
+    main()

tests/test_env.py ADDED

	@@ -0,0 +1,126 @@

+import unittest
+from fastapi.testclient import TestClient
+from app import app, sessions
+class IncidentEnvApiTests(unittest.TestCase):
+    def setUp(self) -> None:
+        sessions.clear()
+        self.client = TestClient(app)
+    def tearDown(self) -> None:
+        sessions.clear()
+    def test_health_schema_and_mcp_helper_endpoints(self) -> None:
+        health_response = self.client.get("/health")
+        self.assertEqual(health_response.status_code, 200)
+        self.assertEqual(health_response.json()["status"], "healthy")
+        schema_response = self.client.get("/schema")
+        self.assertEqual(schema_response.status_code, 200)
+        schema_body = schema_response.json()
+        self.assertIn("action", schema_body)
+        self.assertIn("observation", schema_body)
+        self.assertIn("state", schema_body)
+        mcp_response = self.client.post("/mcp", json={"jsonrpc": "2.0", "id": 1, "method": "ping"})
+        self.assertEqual(mcp_response.status_code, 200)
+        mcp_body = mcp_response.json()
+        self.assertEqual(mcp_body["jsonrpc"], "2.0")
+        self.assertEqual(mcp_body["id"], 1)
+    def test_tickets_endpoint_returns_safe_ticket_inventory(self) -> None:
+        response = self.client.get("/tickets")
+        self.assertEqual(response.status_code, 200)
+        body = response.json()
+        self.assertEqual(body["count"], 36)
+        self.assertEqual(body["tickets"][0]["incident_id"], "INC-001")
+        self.assertIn("expected_field", body["tickets"][0])
+        self.assertNotIn("ground_truth", body["tickets"][0])
+    def test_ui_routes_and_assets_are_served(self) -> None:
+        home_response = self.client.get("/")
+        self.assertEqual(home_response.status_code, 200)
+        self.assertIn("Incident Triage Environment", home_response.text)
+        status_response = self.client.get("/status")
+        self.assertEqual(status_response.status_code, 200)
+        self.assertIn("Environment readiness dashboard", status_response.text)
+        playground_response = self.client.get("/playground")
+        self.assertEqual(playground_response.status_code, 200)
+        self.assertIn("Interactive playground", playground_response.text)
+        asset_response = self.client.get("/assets/app.js")
+        self.assertEqual(asset_response.status_code, 200)
+        self.assertIn("bootstrap", asset_response.text)
+    def test_reset_returns_requested_ticket_and_session_state(self) -> None:
+        response = self.client.post(
+            "/reset",
+            json={"task_type": "task3", "ticket_id": "INC-014"},
+        )
+        self.assertEqual(response.status_code, 200)
+        body = response.json()
+        self.assertEqual(body["observation"]["incident_id"], "INC-014")
+        self.assertEqual(body["observation"]["task_type"], "task3")
+        self.assertEqual(body["reward"]["value"], 0.0)
+        self.assertFalse(body["done"])
+        self.assertIn("session_id", body["info"])
+        self.assertEqual(body["info"]["state"]["status"], "awaiting_action")
+    def test_step_completes_episode_and_state_endpoint_reflects_completion(self) -> None:
+        reset_response = self.client.post(
+            "/reset",
+            json={"task_type": "task3", "ticket_id": "INC-014"},
+        )
+        session_id = reset_response.json()["info"]["session_id"]
+        step_response = self.client.post(
+            f"/step?session_id={session_id}",
+            json={
+                "incident_id": "INC-014",
+                "task_type": "task3",
+                "action": "FAILOVER",
+            },
+        )
+        self.assertEqual(step_response.status_code, 200)
+        step_body = step_response.json()
+        self.assertTrue(step_body["done"])
+        self.assertEqual(step_body["reward"]["value"], 1.0)
+        self.assertTrue(step_body["info"]["correct"])
+        self.assertEqual(step_body["info"]["ground_truth"], "FAILOVER")
+        state_response = self.client.get(f"/state?session_id={session_id}")
+        self.assertEqual(state_response.status_code, 200)
+        state_body = state_response.json()
+        self.assertTrue(state_body["done"])
+        self.assertEqual(state_body["status"], "completed")
+        self.assertEqual(state_body["last_reward"], 1.0)
+    def test_step_rejects_action_for_wrong_task_type(self) -> None:
+        reset_response = self.client.post(
+            "/reset",
+            json={"task_type": "task3", "ticket_id": "INC-014"},
+        )
+        session_id = reset_response.json()["info"]["session_id"]
+        step_response = self.client.post(
+            f"/step?session_id={session_id}",
+            json={
+                "incident_id": "INC-014",
+                "task_type": "task2",
+                "root_cause": "NETWORK",
+            },
+        )
+        self.assertEqual(step_response.status_code, 400)
+        self.assertIn("does not match", step_response.json()["detail"])
+if __name__ == "__main__":
+    unittest.main()
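
Review note: the last test above expects `/step` to answer with HTTP 400 and a detail containing "does not match" when the action's `task_type` differs from the session's. The server-side check is not part of this hunk, so the following is only a sketch of what such a guard might look like. The names `ActionMismatchError` and `validate_action`, and the exact message wording, are hypothetical.

```python
class ActionMismatchError(ValueError):
    """Hypothetical error type; a real app would presumably map it to HTTP 400."""


def validate_action(session_task_type: str, session_incident_id: str, action: dict) -> None:
    """Raise if the submitted action does not belong to the active session.

    Assumption: the action carries `incident_id` and `task_type` keys, as in
    the JSON bodies posted by the tests above.
    """
    if action.get("task_type") != session_task_type:
        raise ActionMismatchError(
            f"Action task_type {action.get('task_type')!r} does not match "
            f"session task_type {session_task_type!r}"
        )
    if action.get("incident_id") != session_incident_id:
        raise ActionMismatchError(
            f"Action incident_id {action.get('incident_id')!r} does not match "
            f"session incident_id {session_incident_id!r}"
        )


# Mirrors the rejected request in test_step_rejects_action_for_wrong_task_type.
try:
    validate_action(
        "task3",
        "INC-014",
        {"incident_id": "INC-014", "task_type": "task2", "root_cause": "NETWORK"},
    )
    rejected = False
except ActionMismatchError as exc:
    rejected = "does not match" in str(exc)
```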

tests/test_graders.py ADDED

	@@ -0,0 +1,85 @@

+import unittest
+from graders import GRADERS, grade_task1, grade_task2, grade_task3
+from incidents import TICKETS
+from models import IncidentAction
+class GraderTests(unittest.TestCase):
+    def test_all_ticket_ground_truth_scores_are_bounded(self) -> None:
+        for ticket in TICKETS:
+            action = IncidentAction(
+                incident_id=ticket["incident_id"],
+                task_type=ticket["task_type"],
+                **ticket["ground_truth"],
+            )
+            score, reason = GRADERS[ticket["task_type"]](action, ticket["ground_truth"])
+            self.assertGreaterEqual(score, 0.0, ticket["incident_id"])
+            self.assertLessEqual(score, 1.0, ticket["incident_id"])
+            self.assertIsInstance(reason, str)
+    def test_task1_grader_supports_partial_credit(self) -> None:
+        exact = IncidentAction(
+            incident_id="INC-TEST-1",
+            task_type="task1",
+            severity="SEV1",
+        )
+        adjacent = IncidentAction(
+            incident_id="INC-TEST-1",
+            task_type="task1",
+            severity="SEV2",
+        )
+        exact_score, _ = grade_task1(exact, {"severity": "SEV1"})
+        adjacent_score, _ = grade_task1(adjacent, {"severity": "SEV1"})
+        self.assertEqual(exact_score, 1.0)
+        self.assertEqual(adjacent_score, 0.5)
+    def test_task2_grader_is_not_constant(self) -> None:
+        exact = IncidentAction(
+            incident_id="INC-TEST-2",
+            task_type="task2",
+            root_cause="DATABASE",
+        )
+        fallback = IncidentAction(
+            incident_id="INC-TEST-2",
+            task_type="task2",
+            root_cause="UNKNOWN",
+        )
+        wrong = IncidentAction(
+            incident_id="INC-TEST-2",
+            task_type="task2",
+            root_cause="NETWORK",
+        )
+        exact_score, _ = grade_task2(exact, {"root_cause": "DATABASE"})
+        fallback_score, _ = grade_task2(fallback, {"root_cause": "DATABASE"})
+        wrong_score, _ = grade_task2(wrong, {"root_cause": "DATABASE"})
+        self.assertEqual(exact_score, 1.0)
+        self.assertEqual(fallback_score, 0.25)
+        self.assertEqual(wrong_score, 0.0)
+    def test_task3_grader_rewards_safe_fallbacks(self) -> None:
+        exact = IncidentAction(
+            incident_id="INC-TEST-3",
+            task_type="task3",
+            action="FAILOVER",
+        )
+        fallback = IncidentAction(
+            incident_id="INC-TEST-3",
+            task_type="task3",
+            action="INVESTIGATE",
+        )
+        wrong = IncidentAction(
+            incident_id="INC-TEST-3",
+            task_type="task3",
+            action="NO_ACTION",
+        )
+        exact_score, _ = grade_task3(exact, {"action": "FAILOVER"})
+        fallback_score, _ = grade_task3(fallback, {"action": "FAILOVER"})
+        wrong_score, _ = grade_task3(wrong, {"action": "FAILOVER"})
+        self.assertEqual(exact_score, 1.0)
+        self.assertEqual(fallback_score, 0.4)
+        self.assertEqual(wrong_score, 0.0)
+if __name__ == "__main__":
+    unittest.main()
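
Review note: the assertions above pin down a few points of the scoring scheme (exact match 1.0, adjacent severity 0.5, INVESTIGATE fallback 0.4, wrong 0.0), and the openenv.yaml reward string mentions 0.25 for a related action. The sketch below is one way a grader could satisfy those assertions. It is not the code in graders.py, which this hunk does not show. The function names, the `SEV<n>` label parsing, the zero score for non-adjacent severities, and the `related` parameter are all assumptions made for illustration.

```python
from typing import FrozenSet, Tuple


def sketch_grade_severity(predicted: str, truth: str) -> Tuple[float, str]:
    """Score a severity label; assumes labels look like 'SEV1', 'SEV2', ..."""
    if predicted == truth:
        return 1.0, "exact match"
    # Assumption: the numeric suffix orders severities, so a gap of 1 is "adjacent".
    distance = abs(int(predicted[3:]) - int(truth[3:]))
    if distance == 1:
        return 0.5, "adjacent severity"
    return 0.0, "wrong severity"


def sketch_grade_action(
    predicted: str,
    truth: str,
    related: FrozenSet[Tuple[str, str]] = frozenset(),
) -> Tuple[float, str]:
    """Score a remediation action.

    `related` is a hypothetical set of (predicted, truth) pairs that earn 0.25;
    the real mapping, if any, is not visible in this diff.
    """
    if predicted == truth:
        return 1.0, "exact match"
    if predicted == "INVESTIGATE":
        return 0.4, "safe investigate fallback"
    if (predicted, truth) in related:
        return 0.25, "related action"
    return 0.0, "wrong action"
```

With these definitions, `sketch_grade_severity("SEV2", "SEV1")` yields a score of 0.5 and `sketch_grade_action("INVESTIGATE", "FAILOVER")` yields 0.4, matching the values asserted in the tests.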

ui/assets/app.js ADDED

	@@ -0,0 +1,290 @@

+async function fetchJson(url, options = {}) {
+  const response = await fetch(url, options);
+  const contentType = response.headers.get("content-type") || "";
+  const body = contentType.includes("application/json") ? await response.json() : await response.text();
+  if (!response.ok) {
+    const detail = typeof body === "object" ? body.detail || JSON.stringify(body) : body;
+    throw new Error(`${response.status} ${response.statusText}: ${detail}`);
+  }
+  return body;
+}
+function safeText(value) {
+  return value == null ? "--" : String(value);
+}
+function setHealthPill(status) {
+  const pills = document.querySelectorAll("[data-health-pill]");
+  pills.forEach((pill) => {
+    pill.textContent = status === "healthy" ? "Healthy" : "Unavailable";
+    pill.classList.toggle("is-pending", status !== "healthy");
+  });
+}
+function renderTaskCards(target, tasks) {
+  if (!target) return;
+  target.innerHTML = "";
+  Object.entries(tasks).forEach(([taskId, task]) => {
+    const article = document.createElement("article");
+    article.className = "task-card";
+    article.innerHTML = `
+      <span class="badge difficulty-${task.difficulty}">${task.difficulty}</span>
+      <h3>${task.name}</h3>
+      <p>Expected field: <strong>${task.expected_field || task.output_field}</strong></p>
+      <div class="task-meta">
+        <span class="badge">${taskId}</span>
+        <span class="badge">${task.ticket_count || 0} incidents</span>
+      </div>
+      <div class="task-values">
+        ${(task.allowed_values || task.labels || []).map((value) => `<span class="badge">${value}</span>`).join("")}
+      </div>
+    `;
+    target.appendChild(article);
+  });
+}
+async function initHome() {
+  const [health, metadata] = await Promise.all([
+    fetchJson("/health"),
+    fetchJson("/metadata"),
+  ]);
+  setHealthPill(health.status);
+  document.querySelector("[data-total-incidents]").textContent = safeText(metadata.total_tickets);
+  document.querySelector("[data-task-count]").textContent = safeText(Object.keys(metadata.tasks).length);
+  renderTaskCards(document.querySelector("[data-task-grid]"), metadata.tasks);
+}
+async function initStatus() {
+  const [health, metadata, grader, schema] = await Promise.all([
+    fetchJson("/health"),
+    fetchJson("/metadata"),
+    fetchJson("/grader"),
+    fetchJson("/schema"),
+  ]);
+  document.querySelector("[data-health-text]").textContent = health.status;
+  document.querySelector("[data-total-incidents]").textContent = safeText(metadata.total_tickets);
+  document.querySelector("[data-schema-count]").textContent = safeText(Object.keys(schema).length);
+  renderTaskCards(document.querySelector("[data-task-grid]"), metadata.tasks);
+  const schemaGrid = document.querySelector("[data-schema-grid]");
+  schemaGrid.innerHTML = Object.keys(schema)
+    .map((name) => `<span class="badge">${name}</span>`)
+    .join("");
+  document.querySelector("[data-grader-summary]").textContent = grader.scoring;
+  const graderList = document.querySelector("[data-grader-list]");
+  graderList.innerHTML = Object.entries(grader.tasks)
+    .map(([task, rule]) => `<li><strong>${task}</strong>: ${rule}</li>`)
+    .join("");
+}
+function buildActionPayload(observation, selectedValue) {
+  const payload = {
+    incident_id: observation.incident_id,
+    task_type: observation.task_type,
+  };
+  payload[observation.expected_field] = selectedValue;
+  return payload;
+}
+async function initPlayground() {
+  const resetForm = document.getElementById("reset-form");
+  const stepForm = document.getElementById("step-form");
+  const taskTypeInput = document.getElementById("task-type");
+  const ticketIdInput = document.getElementById("ticket-id");
+  const ticketOptions = document.getElementById("ticket-options");
+  const ticketHelper = document.getElementById("ticket-helper");
+  const expectedFieldInput = document.getElementById("expected-field");
+  const actionValueSelect = document.getElementById("action-value");
+  const stepButton = document.getElementById("step-button");
+  const resetButton = document.getElementById("reset-button");
+  const sessionIdTarget = document.getElementById("session-id");
+  const observationOutput = document.getElementById("observation-output");
+  const resultOutput = document.getElementById("result-output");
+  const messageTarget = document.getElementById("playground-message");
+  const summaryIncident = document.getElementById("summary-incident");
+  const summaryField = document.getElementById("summary-field");
+  const summaryReward = document.getElementById("summary-reward");
+  const summaryStatus = document.getElementById("summary-status");
+  let sessionId = null;
+  let observation = null;
+  let validTickets = [];
+  const setOutput = (target, data) => {
+    target.textContent = typeof data === "string" ? data : JSON.stringify(data, null, 2);
+  };
+  const setMessage = (message, mode = "neutral") => {
+    messageTarget.textContent = message;
+    messageTarget.dataset.mode = mode;
+  };
+  const setBusy = (button, isBusy, busyText, idleText) => {
+    button.disabled = isBusy;
+    button.textContent = isBusy ? busyText : idleText;
+  };
+  const updateSummaryFromObservation = (nextObservation) => {
+    summaryIncident.textContent = nextObservation.incident_id;
+    summaryField.textContent = nextObservation.expected_field;
+    summaryReward.textContent = "--";
+    summaryStatus.textContent = "Awaiting action";
+  };
+  const updateSummaryFromResult = (result) => {
+    summaryReward.textContent = result.reward?.value ?? "--";
+    summaryStatus.textContent = result.done ? "Completed" : "In progress";
+  };
+  const findTicket = (ticketId) => validTickets.find((ticket) => ticket.incident_id === ticketId);
+  const syncTaskTypeFromTicket = () => {
+    const ticket = findTicket(ticketIdInput.value.trim());
+    if (!ticket) return;
+    taskTypeInput.value = ticket.task_type;
+    ticketHelper.textContent = `${ticket.incident_id} is a ${ticket.task_type} ${ticket.difficulty} ticket.`;
+  };
+  const chooseFirstTicketForTask = () => {
+    if (!taskTypeInput.value) return;
+    const ticket = validTickets.find((item) => item.task_type === taskTypeInput.value);
+    if (ticket) {
+      ticketIdInput.value = ticket.incident_id;
+      ticketHelper.textContent = `${ticket.incident_id} selected for ${taskTypeInput.value}.`;
+    }
+  };
+  try {
+    const ticketData = await fetchJson("/tickets");
+    validTickets = ticketData.tickets || [];
+    ticketOptions.innerHTML = validTickets
+      .map((ticket) => `<option value="${ticket.incident_id}" label="${ticket.task_type} / ${ticket.task_name}"></option>`)
+      .join("");
+    ticketHelper.textContent = `Valid ticket range: ${validTickets[0]?.incident_id || "--"} to ${validTickets.at(-1)?.incident_id || "--"}.`;
+  } catch (error) {
+    ticketHelper.textContent = `Could not load ticket list: ${error.message}`;
+  }
+  document.querySelectorAll("[data-preset-task]").forEach((button) => {
+    button.addEventListener("click", () => {
+      taskTypeInput.value = button.dataset.presetTask;
+      ticketIdInput.value = button.dataset.presetTicket;
+      setMessage(`Preset loaded: ${button.dataset.presetTask} / ${button.dataset.presetTicket}. Click Start / Reset Environment.`, "success");
+    });
+  });
+  resetForm.addEventListener("submit", async (event) => {
+    event.preventDefault();
+    const formData = new FormData(resetForm);
+    const payload = {};
+    for (const [key, value] of formData.entries()) {
+      if (value !== "") {
+        payload[key] = key === "seed" ? Number(value) : value;
+      }
+    }
+    const requestedTicket = payload.ticket_id;
+    const knownTicket = requestedTicket ? findTicket(requestedTicket) : null;
+    if (requestedTicket && validTickets.length > 0 && !knownTicket) {
+      const message = `Ticket ${requestedTicket} does not exist. Use one of ${validTickets[0].incident_id} to ${validTickets.at(-1).incident_id}, or click a preset.`;
+      setOutput(observationOutput, { error: message });
+      setMessage(message, "error");
+      return;
+    }
+    if (knownTicket && payload.task_type && payload.task_type !== knownTicket.task_type) {
+      payload.task_type = knownTicket.task_type;
+      taskTypeInput.value = knownTicket.task_type;
+      ticketHelper.textContent = `Task type changed to ${knownTicket.task_type} because ${knownTicket.incident_id} belongs to that task.`;
+    }
+    try {
+      setBusy(resetButton, true, "Starting...", "Start / Reset Environment");
+      setMessage("Reset request sent. Watch the terminal for a [RESET] log.", "neutral");
+      const result = await fetchJson("/reset", {
+        method: "POST",
+        headers: { "Content-Type": "application/json" },
+        body: JSON.stringify(payload),
+      });
+      sessionId = result.info.session_id;
+      observation = result.observation;
+      sessionIdTarget.textContent = sessionId;
+      expectedFieldInput.value = observation.expected_field;
+      actionValueSelect.disabled = false;
+      stepButton.disabled = false;
+      actionValueSelect.innerHTML = observation.allowed_values
+        .map((value) => `<option value="${value}">${value}</option>`)
+        .join("");
+      setOutput(observationOutput, result);
+      setOutput(resultOutput, "No step submitted yet.");
+      updateSummaryFromObservation(observation);
+      setMessage(`Session ready for ${observation.incident_id}. Pick a value and submit the step.`, "success");
+    } catch (error) {
+      setOutput(observationOutput, { error: error.message });
+      setMessage(error.message, "error");
+    } finally {
+      setBusy(resetButton, false, "Starting...", "Start / Reset Environment");
+    }
+  });
+  stepForm.addEventListener("submit", async (event) => {
+    event.preventDefault();
+    if (!sessionId || !observation) {
+      setOutput(resultOutput, { error: "Start a session first." });
+      setMessage("Start a session before submitting a step.", "error");
+      return;
+    }
+    try {
+      setBusy(stepButton, true, "Submitting...", "Submit Step");
+      setMessage("Step request sent. Watch the terminal for a [STEP] log.", "neutral");
+      const result = await fetchJson(`/step?session_id=${encodeURIComponent(sessionId)}`, {
+        method: "POST",
+        headers: { "Content-Type": "application/json" },
+        body: JSON.stringify(buildActionPayload(observation, actionValueSelect.value)),
+      });
+      setOutput(resultOutput, result);
+      updateSummaryFromResult(result);
+      const reward = result.reward?.value ?? "--";
+      setMessage(`Step completed with reward ${reward}.`, reward === 1 ? "success" : "neutral");
+    } catch (error) {
+      setOutput(resultOutput, { error: error.message });
+      setMessage(error.message, "error");
+    } finally {
+      if (observation) {
+        setBusy(stepButton, false, "Submitting...", "Submit Step");
+      }
+    }
+  });
+  ticketIdInput.addEventListener("change", syncTaskTypeFromTicket);
+  ticketIdInput.addEventListener("blur", syncTaskTypeFromTicket);
+  taskTypeInput.addEventListener("change", chooseFirstTicketForTask);
+}
+async function bootstrap() {
+  const page = document.body.dataset.page;
+  try {
+    if (page === "home") {
+      await initHome();
+    } else if (page === "status") {
+      await initStatus();
+    } else if (page === "playground") {
+      await initPlayground();
+    }
+  } catch (error) {
+    const pageShell = document.querySelector(".page-shell");
+    const banner = document.createElement("div");
+    banner.className = "floating-panel";
+    banner.innerHTML = `<strong>UI data load failed.</strong><p class="status-helper">${error.message}</p>`;
+    pageShell?.prepend(banner);
+  }
+}
+window.addEventListener("DOMContentLoaded", bootstrap);
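
Review note: for anyone scripting against the API instead of using the playground, `buildActionPayload` above translates directly to Python. The sketch below copies its logic: take `incident_id` and `task_type` from the observation and set the chosen value under whatever key `expected_field` names. The sample observation is illustrative. Its `expected_field` value of `"action"` for task3 is inferred from the step body used in tests/test_env.py, not read from a live response.

```python
def build_action_payload(observation: dict, selected_value: str) -> dict:
    """Python analogue of buildActionPayload in ui/assets/app.js."""
    payload = {
        "incident_id": observation["incident_id"],
        "task_type": observation["task_type"],
    }
    # The observation tells the client which field the answer belongs in.
    payload[observation["expected_field"]] = selected_value
    return payload


# Illustrative observation, shaped like the fields the UI reads after /reset.
sample_observation = {
    "incident_id": "INC-014",
    "task_type": "task3",
    "expected_field": "action",
}
step_body = build_action_payload(sample_observation, "FAILOVER")
print(step_body)
```

The resulting dictionary has the same shape as the JSON body posted to `/step` in the tests, so it could be sent with any HTTP client alongside the `session_id` query parameter.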

ui/assets/styles.css ADDED

	@@ -0,0 +1,731 @@

+@import url("https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@400;500;700&family=IBM+Plex+Mono:wght@400;500&display=swap");
+:root {
+  --bg: #f4ede2;
+  --bg-accent: #e7f1f0;
+  --surface: rgba(255, 251, 246, 0.86);
+  --surface-strong: rgba(255, 255, 255, 0.94);
+  --border: rgba(24, 46, 56, 0.12);
+  --text: #13232c;
+  --muted: #5d6a70;
+  --accent: #c5532f;
+  --accent-deep: #7a2f1a;
+  --signal: #0d7c66;
+  --signal-soft: rgba(13, 124, 102, 0.14);
+  --shadow: 0 24px 50px rgba(33, 48, 55, 0.12);
+  --radius-xl: 28px;
+  --radius-lg: 20px;
+  --radius-md: 14px;
+}
+* {
+  box-sizing: border-box;
+}
+html {
+  scroll-behavior: smooth;
+}
+body {
+  margin: 0;
+  min-height: 100vh;
+  color: var(--text);
+  background:
+    radial-gradient(circle at top left, rgba(197, 83, 47, 0.16), transparent 28%),
+    radial-gradient(circle at top right, rgba(13, 124, 102, 0.18), transparent 26%),
+    linear-gradient(145deg, #f7efe4 0%, #edf6f4 48%, #f5ebdf 100%);
+  font-family: "Space Grotesk", "Avenir Next", sans-serif;
+}
+body::before,
+body::after {
+  content: "";
+  position: fixed;
+  inset: auto;
+  width: 320px;
+  height: 320px;
+  border-radius: 50%;
+  filter: blur(12px);
+  opacity: 0.4;
+  pointer-events: none;
+  z-index: 0;
+}
+body::before {
+  right: -80px;
+  top: 120px;
+  background: rgba(197, 83, 47, 0.18);
+}
+body::after {
+  left: -100px;
+  bottom: 40px;
+  background: rgba(13, 124, 102, 0.14);
+}
+a {
+  color: inherit;
+  text-decoration: none;
+}
+code,
+pre,
+.mono-label {
+  font-family: "IBM Plex Mono", "SFMono-Regular", monospace;
+}
+.page-shell {
+  position: relative;
+  z-index: 1;
+  width: min(1240px, calc(100vw - 48px));
+  margin: 0 auto;
+  padding: 24px 0 80px;
+}
+.topbar {
+  display: flex;
+  align-items: center;
+  justify-content: space-between;
+  gap: 20px;
+  padding: 14px 0 24px;
+}
+.brand {
+  display: inline-flex;
+  flex-direction: column;
+  gap: 2px;
+}
+.brand-kicker,
+.eyebrow {
+  color: var(--accent-deep);
+  font-size: 0.78rem;
+  font-weight: 700;
+  letter-spacing: 0.18em;
+  text-transform: uppercase;
+}
+.brand-title {
+  font-size: 1.35rem;
+  font-weight: 700;
+}
+.nav-links {
+  display: flex;
+  gap: 18px;
+  flex-wrap: wrap;
+}
+.nav-links a {
+  color: var(--muted);
+  font-weight: 500;
+  transition: color 180ms ease, transform 180ms ease;
+}
+.nav-links a:hover {
+  color: var(--text);
+  transform: translateY(-1px);
+}
+.hero,
+.dual-grid,
+.playground-grid,
+.playground-summary,
+.guide-grid,
+.status-overview,
+.feature-strip,
+.route-grid,
+.task-grid {
+  display: grid;
+  gap: 22px;
+}
+.hero {
+  grid-template-columns: minmax(0, 1.35fr) minmax(320px, 0.95fr);
+  align-items: stretch;
+  min-height: 520px;
+}
+.hero-copy,
+.floating-panel,
+.feature-card,
+.task-card,
+.route-card,
+.stat-card {
+  background: var(--surface);
+  backdrop-filter: blur(18px);
+  border: 1px solid var(--border);
+  box-shadow: var(--shadow);
+}
+.hero-copy,
+.floating-panel {
+  border-radius: var(--radius-xl);
+}
+.hero-copy {
+  padding: 44px;
+  display: flex;
+  flex-direction: column;
+  justify-content: center;
+  animation: rise-in 650ms ease both;
+}
+.hero-copy h1,
+.section-heading h1 {
+  margin: 12px 0;
+  font-size: clamp(2.8rem, 6vw, 5.6rem);
+  line-height: 0.95;
+  letter-spacing: -0.05em;
+}
+.hero-text,
+.section-copy {
+  max-width: 56ch;
+  font-size: 1.05rem;
+  line-height: 1.7;
+  color: var(--muted);
+}
+.hero-actions {
+  display: flex;
+  flex-wrap: wrap;
+  gap: 14px;
+  margin-top: 26px;
+}
+.button {
+  display: inline-flex;
+  align-items: center;
+  justify-content: center;
+  min-height: 46px;
+  padding: 0 20px;
+  border-radius: 999px;
+  border: 1px solid transparent;
+  font-weight: 700;
+  cursor: pointer;
+  transition: transform 180ms ease, box-shadow 180ms ease, background 180ms ease;
+}
+.button:hover {
+  transform: translateY(-1px);
+}
+.button:disabled {
+  cursor: not-allowed;
+  opacity: 0.58;
+  transform: none;
+}
+.button-primary {
+  background: linear-gradient(135deg, var(--accent), #dd7e36);
+  color: #fff;
+}
+.button-secondary {
+  background: rgba(19, 35, 44, 0.06);
+  border-color: rgba(19, 35, 44, 0.08);
+}
+.hero-panel,
+.big-status-card {
+  padding: 24px;
+  animation: rise-in 800ms ease both;
+}
+.panel-header {
+  display: flex;
+  align-items: center;
+  justify-content: space-between;
+  gap: 16px;
+  margin-bottom: 18px;
+  font-weight: 700;
+}
+.live-pill,
+.badge,
+.status-chip {
+  display: inline-flex;
+  align-items: center;
+  gap: 8px;
+  min-height: 32px;
+  padding: 0 12px;
+  border-radius: 999px;
+  background: var(--signal-soft);
+  color: var(--signal);
+  font-size: 0.86rem;
+  font-weight: 700;
+}
+.live-pill.is-pending,
+.status-chip.is-pending {
+  background: rgba(197, 83, 47, 0.12);
+  color: var(--accent-deep);
+}
+.stats-grid,
+.feature-strip,
+.status-overview {
+  grid-template-columns: repeat(2, minmax(0, 1fr));
+}
+.feature-strip {
+  margin-top: 24px;
+}
+.guide-grid {
+  grid-template-columns: repeat(3, minmax(0, 1fr));
+}
+.guide-card {
+  min-height: 168px;
+  padding: 24px;
+  border: 1px solid rgba(19, 35, 44, 0.1);
+  border-radius: var(--radius-lg);
+  background: rgba(255, 255, 255, 0.68);
+  box-shadow: 0 16px 34px rgba(33, 48, 55, 0.08);
+}
+.guide-card span {
+  display: inline-flex;
+  align-items: center;
+  justify-content: center;
+  min-width: 34px;
+  min-height: 34px;
+  padding: 0 10px;
+  border-radius: 999px;
+  background: rgba(197, 83, 47, 0.12);
+  color: var(--accent-deep);
+  font-weight: 700;
+}
+.guide-card strong {
+  display: block;
+  margin-top: 16px;
+  font-size: 1.2rem;
+}
+.guide-card p {
+  margin: 10px 0 0;
+  color: var(--muted);
+  line-height: 1.65;
+}
+.stat-card,
+.feature-card,
+.task-card,
+.route-card {
+  border-radius: var(--radius-lg);
+  padding: 24px;
+}
+.stat-label,
+.status-caption {
+  display: block;
+  color: var(--muted);
+  font-size: 0.84rem;
+  text-transform: uppercase;
+  letter-spacing: 0.08em;
+}
+.stat-value,
+.status-display {
+  display: block;
+  margin-top: 12px;
+  font-size: clamp(1.8rem, 4vw, 3.2rem);
+  line-height: 1;
+}
+.feature-card {
+  min-height: 220px;
+}
+.feature-index {
+  color: var(--accent);
+  font-size: 0.9rem;
+  font-weight: 700;
+  letter-spacing: 0.16em;
+}
+.feature-card h2,
+.task-card h3,
+.route-card strong {
+  margin: 16px 0 12px;
+  font-size: 1.4rem;
+}
+.feature-card p,
+.task-card p,
+.route-card span,
+.status-helper,
+.copy-block p,
+.bullet-list {
+  color: var(--muted);
+  line-height: 1.7;
+}
+.tasks-section,
+.stack-layout {
+  display: grid;
+  gap: 20px;
+  margin-top: 24px;
+}
+.section-heading {
+  padding-top: 8px;
+}
+.section-heading h2 {
+  margin: 8px 0 0;
+  font-size: clamp(1.9rem, 4vw, 3.4rem);
+  line-height: 1.02;
+}
+.section-heading.compact h2 {
+  font-size: clamp(1.35rem, 3vw, 2rem);
+}
+.task-grid,
+.route-grid {
+  grid-template-columns: repeat(3, minmax(0, 1fr));
+}
+.task-card {
+  position: relative;
+  overflow: hidden;
+  display: flex;
+  min-height: 245px;
+  flex-direction: column;
+  justify-content: space-between;
+  gap: 14px;
+}
+.task-card::after {
+  content: "";
+  position: absolute;
+  inset: auto -20% -60% auto;
+  width: 180px;
+  height: 180px;
+  border-radius: 50%;
+  background: radial-gradient(circle, rgba(13, 124, 102, 0.12), transparent 70%);
+}
+.task-meta,
+.task-values {
+  display: flex;
+  flex-wrap: wrap;
+  gap: 8px;
+  margin-top: 14px;
+}
+.badge {
+  background: rgba(19, 35, 44, 0.06);
+  color: var(--text);
+}
+.difficulty-easy {
+  background: rgba(13, 124, 102, 0.14);
+  color: var(--signal);
+}
+.difficulty-medium {
+  background: rgba(227, 154, 52, 0.16);
+  color: #8b5a12;
+}
+.difficulty-hard {
+  background: rgba(197, 83, 47, 0.14);
+  color: var(--accent-deep);
+}
+.routes-section {
+  margin-top: 24px;
+  padding: 26px;
+}
+.route-card {
+  min-height: 150px;
+  transition: transform 180ms ease, border-color 180ms ease;
+}
+.route-card:hover {
+  transform: translateY(-4px);
+  border-color: rgba(197, 83, 47, 0.2);
+}
+.status-overview {
+  grid-template-columns: 1.2fr 1fr 1fr;
+}
+.big-status-card {
+  background: linear-gradient(145deg, rgba(255, 247, 241, 0.92), rgba(243, 255, 251, 0.82));
+}
+.dual-grid {
+  grid-template-columns: repeat(2, minmax(0, 1fr));
+}
+.floating-panel {
+  padding: 30px;
+}
+.playground-grid {
+  grid-template-columns: minmax(320px, 0.9fr) minmax(360px, 1fr);
+  align-items: start;
+}
+.control-panel {
+  min-height: 100%;
+}
+.preset-row {
+  display: grid;
+  grid-template-columns: repeat(3, minmax(0, 1fr));
+  gap: 10px;
+  margin: 20px 0;
+}
+.preset-button {
+  min-height: 44px;
+  border: 1px solid rgba(19, 35, 44, 0.1);
+  border-radius: 999px;
+  background: rgba(255, 255, 255, 0.72);
+  color: var(--text);
+  font: inherit;
+  font-size: 0.92rem;
+  font-weight: 700;
+  cursor: pointer;
+  transition: transform 180ms ease, border-color 180ms ease, background 180ms ease;
+}
+.preset-button:hover {
+  transform: translateY(-1px);
+  border-color: rgba(197, 83, 47, 0.28);
+  background: rgba(255, 255, 255, 0.94);
+}
+.ui-message {
+  margin-top: 18px;
+  padding: 14px 16px;
+  border-radius: var(--radius-md);
+  background: rgba(19, 35, 44, 0.05);
+  color: var(--muted);
+  line-height: 1.5;
+}
+.ui-message[data-mode="success"] {
+  background: var(--signal-soft);
+  color: var(--signal);
+}
+.ui-message[data-mode="error"] {
+  background: rgba(197, 83, 47, 0.14);
+  color: var(--accent-deep);
+}
+.playground-summary {
+  grid-template-columns: repeat(4, minmax(0, 1fr));
+}
+.summary-card {
+  padding: 18px 20px;
+  border: 1px solid var(--border);
+  border-radius: var(--radius-lg);
+  background: var(--surface-strong);
+  box-shadow: 0 14px 30px rgba(33, 48, 55, 0.08);
+}
+.summary-card span {
+  display: block;
+  color: var(--muted);
+  font-size: 0.8rem;
+  font-weight: 700;
+  letter-spacing: 0.08em;
+  text-transform: uppercase;
+}
+.summary-card strong {
+  display: block;
+  margin-top: 10px;
+  font-size: clamp(1.2rem, 3vw, 2rem);
+}
+.badge-grid {
+  display: flex;
+  flex-wrap: wrap;
+  gap: 10px;
+}
+.copy-block,
+.bullet-list {
+  margin: 0;
+  padding: 0;
+}
+.bullet-list {
+  list-style: none;
+  display: grid;
+  gap: 10px;
+  margin-top: 14px;
+}
+.bullet-list li {
+  padding: 12px 14px;
+  border-radius: var(--radius-md);
+  background: rgba(19, 35, 44, 0.04);
+}
+.form-grid {
+  display: grid;
+  gap: 18px;
+}
+.form-grid label {
+  display: grid;
+  gap: 8px;
+  color: var(--muted);
+  font-weight: 500;
+}
+.form-grid input,
+.form-grid select,
+.form-grid small,
+.form-grid button {
+  width: 100%;
+}
+.form-grid input,
+.form-grid select {
+  min-height: 50px;
+  padding: 0 16px;
+  border-radius: 16px;
+  border: 1px solid rgba(19, 35, 44, 0.12);
+  background: rgba(255, 255, 255, 0.9);
+  color: var(--text);
+  font: inherit;
+}
+.form-grid input:focus,
+.form-grid select:focus {
+  border-color: rgba(197, 83, 47, 0.45);
+  box-shadow: 0 0 0 4px rgba(197, 83, 47, 0.1);
+  outline: none;
+}
+.form-grid small {
+  display: block;
+  color: var(--muted);
+  line-height: 1.45;
+}
+.form-grid input:disabled {
+  color: var(--muted);
+}
+.inline-status {
+  display: flex;
+  align-items: center;
+  gap: 10px;
+  margin-top: 20px;
+  color: var(--muted);
+  overflow-wrap: anywhere;
+}
+.code-panel {
+  min-height: 340px;
+  margin: 0;
+  padding: 22px;
+  border-radius: var(--radius-lg);
+  background: #101a21;
+  color: #e7f3f0;
+  overflow: auto;
+  font-size: 0.92rem;
+  line-height: 1.6;
+}
+.output-grid {
+  align-items: stretch;
+}
+.skeleton-card {
+  min-height: 220px;
+  background:
+    linear-gradient(90deg, rgba(255, 255, 255, 0.55), rgba(255, 255, 255, 0.9), rgba(255, 255, 255, 0.55));
+  background-size: 200% 100%;
+  animation: shimmer 1.2s infinite linear;
+}
+@keyframes shimmer {
+  from { background-position: 200% 0; }
+  to { background-position: -200% 0; }
+}
+@keyframes rise-in {
+  from {
+    opacity: 0;
+    transform: translateY(18px);
+  }
+  to {
+    opacity: 1;
+    transform: translateY(0);
+  }
+}
+@media (max-width: 1080px) {
+  .hero,
+  .dual-grid,
+  .playground-grid,
+  .playground-summary,
+  .guide-grid,
+  .status-overview,
+  .task-grid,
+  .route-grid {
+    grid-template-columns: 1fr;
+  }
+  .stats-grid,
+  .feature-strip {
+    grid-template-columns: 1fr 1fr;
+  }
+}
+@media (max-width: 720px) {
+  .page-shell {
+    width: min(100vw - 24px, 1180px);
+    padding-top: 12px;
+    padding-bottom: 40px;
+  }
+  .topbar {
+    flex-direction: column;
+    align-items: flex-start;
+  }
+  .hero-copy,
+  .floating-panel,
+  .routes-section {
+    padding: 24px;
+  }
+  .preset-row {
+    grid-template-columns: 1fr;
+  }
+  .stats-grid,
+  .feature-strip {
+    grid-template-columns: 1fr;
+  }
+  .hero-copy h1,
+  .section-heading h1 {
+    font-size: clamp(2.4rem, 14vw, 3.6rem);
+  }
+  .section-heading h2 {
+    font-size: clamp(1.6rem, 8vw, 2.4rem);
+  }
+  .hero-actions {
+    flex-direction: column;
+  }
+}

ui/index.html ADDED

	@@ -0,0 +1,117 @@

+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="utf-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1">
+  <title>Incident Triage Environment</title>
+  <link rel="stylesheet" href="/assets/styles.css?v=3">
+</head>
+<body data-page="home">
+  <div class="page-shell">
+    <header class="topbar">
+      <a class="brand" href="/">
+        <span class="brand-kicker">OpenEnv Environment</span>
+        <span class="brand-title">Incident Triage</span>
+      </a>
+      <nav class="nav-links">
+        <a href="/status">Status</a>
+        <a href="/playground">Playground</a>
+        <a href="/docs">API Docs</a>
+      </nav>
+    </header>
+    <main>
+      <section class="hero">
+        <div class="hero-copy">
+          <p class="eyebrow">Production Incident Response</p>
+          <h1>Welcome to the Incident Triage Environment</h1>
+          <p class="hero-text">
+            A real-world OpenEnv environment for severity classification, root-cause analysis,
+            and next-action recommendation across production incidents.
+          </p>
+          <div class="hero-actions">
+            <a class="button button-primary" href="/playground">Launch Playground</a>
+            <a class="button button-secondary" href="/status">View Live Status</a>
+          </div>
+        </div>
+        <div class="hero-panel floating-panel">
+          <div class="panel-header">
+            <span>Live Snapshot</span>
+            <span class="live-pill" data-health-pill>Checking</span>
+          </div>
+          <div class="stats-grid">
+            <article class="stat-card">
+              <span class="stat-label">Total Incidents</span>
+              <strong class="stat-value" data-total-incidents>--</strong>
+            </article>
+            <article class="stat-card">
+              <span class="stat-label">Task Families</span>
+              <strong class="stat-value" data-task-count>--</strong>
+            </article>
+            <article class="stat-card">
+              <span class="stat-label">API Mode</span>
+              <strong class="stat-value">FastAPI</strong>
+            </article>
+            <article class="stat-card">
+              <span class="stat-label">Episode Shape</span>
+              <strong class="stat-value">Single Step</strong>
+            </article>
+          </div>
+        </div>
+      </section>
+      <section class="feature-strip">
+        <article class="feature-card">
+          <span class="feature-index">01</span>
+          <h2>Typed contracts</h2>
+          <p>Observation, action, reward, state, and reset inputs are all strongly modeled for repeatable evaluation.</p>
+        </article>
+        <article class="feature-card">
+          <span class="feature-index">02</span>
+          <h2>Deterministic graders</h2>
+          <p>Each task returns scores in the 0.0 to 1.0 range with exact-match and partial-credit rules.</p>
+        </article>
+        <article class="feature-card">
+          <span class="feature-index">03</span>
+          <h2>Deployed workflow</h2>
+          <p>Docker packaging, Space-ready metadata, runtime validation, and a root-level baseline script are all included.</p>
+        </article>
+      </section>
+      <section class="tasks-section">
+        <div class="section-heading">
+          <p class="eyebrow">Task Ladder</p>
+          <h2>Three escalating operator decisions</h2>
+        </div>
+        <div class="task-grid" data-task-grid>
+          <article class="task-card skeleton-card"></article>
+          <article class="task-card skeleton-card"></article>
+          <article class="task-card skeleton-card"></article>
+        </div>
+      </section>
+      <section class="routes-section floating-panel">
+        <div class="section-heading compact">
+          <p class="eyebrow">Quick Links</p>
+          <h2>Use the environment your way</h2>
+        </div>
+        <div class="route-grid">
+          <a class="route-card" href="/status">
+            <strong>/status</strong>
+            <span>Live environment health, task inventory, and schema coverage.</span>
+          </a>
+          <a class="route-card" href="/playground">
+            <strong>/playground</strong>
+            <span>Manually reset a session, submit an action, and inspect the typed response.</span>
+          </a>
+          <a class="route-card" href="/docs">
+            <strong>/docs</strong>
+            <span>FastAPI-generated API reference for every endpoint and schema.</span>
+          </a>
+        </div>
+      </section>
+    </main>
+  </div>
+  <script src="/assets/app.js?v=3" defer></script>
+</body>
+</html>
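
Note on the "Deterministic graders" card above: it promises scores in the 0.0 to 1.0 range with exact-match and partial-credit rules, but the rules themselves live in `graders.py` and are not visible in this file. The sketch below is only an illustration of what such a rule could look like. The severity ladder, the function name `grade_severity`, and the 0.5 credit for an adjacent level are all assumptions, not the project's actual grader.

```python
# Hypothetical severity ladder; the real label set is defined elsewhere in the repo.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def grade_severity(predicted: str, expected: str) -> float:
    """Return 1.0 for an exact match, 0.5 for an adjacent level, else 0.0.

    The 0.5 partial credit is an assumed rule chosen for illustration.
    """
    predicted = predicted.strip().lower()
    expected = expected.strip().lower()
    if predicted == expected:
        return 1.0
    if predicted in SEVERITY_ORDER and expected in SEVERITY_ORDER:
        distance = abs(SEVERITY_ORDER.index(predicted) - SEVERITY_ORDER.index(expected))
        if distance == 1:
            return 0.5
    # Unknown labels and answers more than one level away earn no credit.
    return 0.0
```

Because the function has no randomness or external state, the same answer always yields the same score, which is the property the card describes as deterministic.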

ui/playground.html ADDED

	@@ -0,0 +1,153 @@

+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="utf-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1">
+  <title>Incident Triage Playground</title>
+  <link rel="stylesheet" href="/assets/styles.css?v=3">
+</head>
+<body data-page="playground">
+  <div class="page-shell">
+    <header class="topbar">
+      <a class="brand" href="/">
+        <span class="brand-kicker">OpenEnv Environment</span>
+        <span class="brand-title">Incident Triage</span>
+      </a>
+      <nav class="nav-links">
+        <a href="/">Home</a>
+        <a href="/status">Status</a>
+        <a href="/docs">API Docs</a>
+      </nav>
+    </header>
+    <main class="stack-layout">
+      <section class="section-heading">
+        <p class="eyebrow">Manual Evaluation</p>
+        <h1>Interactive playground</h1>
+        <p class="section-copy">
+          This page is a browser version of the OpenEnv flow. Reset starts one evaluation episode,
+          Step submits one agent answer, and the result shows the reward returned by the grader.
+        </p>
+      </section>
+      <section class="guide-grid">
+        <article class="guide-card">
+          <span>1</span>
+          <strong>Start / Reset Environment</strong>
+          <p>Starts a new incident episode and returns the observation. No grading happens yet.</p>
+        </article>
+        <article class="guide-card">
+          <span>2</span>
+          <strong>Read Observation</strong>
+          <p>Check the incident, expected field, allowed values, and context.</p>
+        </article>
+        <article class="guide-card">
+          <span>3</span>
+          <strong>Submit Step</strong>
+          <p>Send one answer. The backend grades it and prints a terminal log.</p>
+        </article>
+      </section>
+      <section class="playground-grid">
+        <article class="floating-panel control-panel">
+          <div class="section-heading compact">
+            <p class="eyebrow">Step 1</p>
+            <h2>Start a session</h2>
+          </div>
+          <div class="preset-row" role="group" aria-label="Quick presets">
+            <button class="preset-button" type="button" data-preset-task="task1" data-preset-ticket="INC-001">Severity case</button>
+            <button class="preset-button" type="button" data-preset-task="task2" data-preset-ticket="INC-006">Root cause case</button>
+            <button class="preset-button" type="button" data-preset-task="task3" data-preset-ticket="INC-014">Action case</button>
+          </div>
+          <form id="reset-form" class="form-grid">
+            <label>
+              <span>Task type</span>
+              <select name="task_type" id="task-type">
+                <option value="">Any task</option>
+                <option value="task1">task1</option>
+                <option value="task2">task2</option>
+                <option value="task3">task3</option>
+              </select>
+            </label>
+            <label>
+              <span>Ticket ID</span>
+              <input type="text" name="ticket_id" id="ticket-id" list="ticket-options" placeholder="INC-014">
+              <datalist id="ticket-options"></datalist>
+              <small id="ticket-helper">Loading valid tickets from the backend.</small>
+            </label>
+            <label>
+              <span>Seed</span>
+              <input type="number" name="seed" placeholder="42">
+            </label>
+            <button class="button button-primary" type="submit" id="reset-button">Start / Reset Environment</button>
+          </form>
+          <div class="inline-status">
+            <span class="mono-label">Session</span>
+            <code id="session-id">Not started</code>
+          </div>
+          <div class="ui-message" id="playground-message">Pick a preset or enter a task and ticket manually.</div>
+        </article>
+        <article class="floating-panel control-panel">
+          <div class="section-heading compact">
+            <p class="eyebrow">Step 2</p>
+            <h2>Submit an action</h2>
+          </div>
+          <form id="step-form" class="form-grid">
+            <label>
+              <span>Expected field</span>
+              <input id="expected-field" type="text" disabled value="Start a session first">
+            </label>
+            <label>
+              <span>Allowed values</span>
+              <select id="action-value" disabled>
+                <option>Start a session first</option>
+              </select>
+            </label>
+            <button class="button button-secondary" type="submit" disabled id="step-button">Submit Step</button>
+          </form>
+          <p class="status-helper">The playground automatically maps your choice to <code>severity</code>, <code>root_cause</code>, or <code>action</code>. If you choose a known ticket, it also sets the matching task type for you.</p>
+        </article>
+      </section>
+      <section class="playground-summary" id="summary-strip">
+        <article class="summary-card">
+          <span>Incident</span>
+          <strong id="summary-incident">--</strong>
+        </article>
+        <article class="summary-card">
+          <span>Expected field</span>
+          <strong id="summary-field">--</strong>
+        </article>
+        <article class="summary-card">
+          <span>Reward</span>
+          <strong id="summary-reward">--</strong>
+        </article>
+        <article class="summary-card">
+          <span>Status</span>
+          <strong id="summary-status">Waiting</strong>
+        </article>
+      </section>
+      <section class="dual-grid output-grid">
+        <article class="floating-panel">
+          <div class="section-heading compact">
+            <p class="eyebrow">Observation</p>
+            <h2>Latest reset payload</h2>
+          </div>
+          <pre class="code-panel" id="observation-output">No observation yet.</pre>
+        </article>
+        <article class="floating-panel">
+          <div class="section-heading compact">
+            <p class="eyebrow">Result</p>
+            <h2>Latest step payload</h2>
+          </div>
+          <pre class="code-panel" id="result-output">No step submitted yet.</pre>
+        </article>
+      </section>
+    </main>
+  </div>
+  <script src="/assets/app.js?v=3" defer></script>
+</body>
+</html>
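
Note on the playground flow above: the page collects `task_type`, `ticket_id`, and `seed` for a reset, then maps the chosen value to `severity`, `root_cause`, or `action` for the step. A minimal sketch of that mapping follows. The field names come from the form, and the task-to-field mapping is inferred from the preset button labels. The helper names and the exact payload shapes are assumptions, since the actual request code is in `ui/assets/app.js`.

```python
from typing import Optional

# Inferred from the preset buttons: task1 = severity, task2 = root cause, task3 = action.
FIELD_BY_TASK = {"task1": "severity", "task2": "root_cause", "task3": "action"}


def build_reset_payload(task_type: str = "", ticket_id: str = "",
                        seed: Optional[int] = None) -> dict:
    """Keep only the reset form fields the user actually filled in."""
    payload = {}
    if task_type:
        payload["task_type"] = task_type
    if ticket_id.strip():
        payload["ticket_id"] = ticket_id.strip()
    if seed is not None:
        payload["seed"] = seed
    return payload


def build_step_payload(task_type: str, value: str) -> dict:
    """Place the chosen value under the field that matches the task type."""
    field = FIELD_BY_TASK[task_type]  # raises KeyError for an unknown task type
    return {field: value}
```

For the "Action case" preset, `build_reset_payload("task3", "INC-014", 42)` would produce a body with all three keys, and the later step body would carry a single `action` key.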

ui/status.html ADDED

	@@ -0,0 +1,118 @@

+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="utf-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1">
+  <title>Incident Triage Status</title>
+  <link rel="stylesheet" href="/assets/styles.css?v=3">
+</head>
+<body data-page="status">
+  <div class="page-shell">
+    <header class="topbar">
+      <a class="brand" href="/">
+        <span class="brand-kicker">OpenEnv Environment</span>
+        <span class="brand-title">Incident Triage</span>
+      </a>
+      <nav class="nav-links">
+        <a href="/">Home</a>
+        <a href="/playground">Playground</a>
+        <a href="/docs">API Docs</a>
+      </nav>
+    </header>
+    <main class="stack-layout">
+      <section class="section-heading">
+        <p class="eyebrow">Live Status</p>
+        <h1>Environment readiness dashboard</h1>
+        <p class="section-copy">
+          This page does not start an episode. It reads the running API and confirms the environment
+          is ready: health, dataset size, schemas, tasks, and grader rules.
+        </p>
+      </section>
+      <section class="guide-grid">
+        <article class="guide-card">
+          <span>Health</span>
+          <strong>Server is reachable</strong>
+          <p>Reads <code>/health</code>. The validator expects the value <code>healthy</code>.</p>
+        </article>
+        <article class="guide-card">
+          <span>Schema</span>
+          <strong>Contracts are exposed</strong>
+          <p>Reads <code>/schema</code> to verify action, observation, reward, and state shapes.</p>
+        </article>
+        <article class="guide-card">
+          <span>Tasks</span>
+          <strong>Dataset is loaded</strong>
+          <p>Reads <code>/metadata</code> and <code>/grader</code> to show task counts and scoring rules.</p>
+        </article>
+      </section>
+      <section class="status-overview">
+        <article class="floating-panel big-status-card">
+          <span class="status-caption">Health</span>
+          <strong class="status-display" data-health-text>Checking</strong>
+          <p class="status-helper">Expected validator value: <code>healthy</code></p>
+        </article>
+        <article class="floating-panel">
+          <span class="status-caption">Dataset</span>
+          <strong class="status-display" data-total-incidents>--</strong>
+          <p class="status-helper">Total incidents currently exposed by the environment.</p>
+        </article>
+        <article class="floating-panel">
+          <span class="status-caption">Schemas</span>
+          <strong class="status-display" data-schema-count>--</strong>
+          <p class="status-helper">Typed schemas available through <code>/schema</code>.</p>
+        </article>
+      </section>
+      <section class="floating-panel">
+        <div class="section-heading compact">
+          <p class="eyebrow">Task Inventory</p>
+          <h2>Difficulty progression and label space</h2>
+        </div>
+        <div class="task-grid" data-task-grid>
+          <article class="task-card skeleton-card"></article>
+          <article class="task-card skeleton-card"></article>
+          <article class="task-card skeleton-card"></article>
+        </div>
+      </section>
+      <section class="dual-grid">
+        <article class="floating-panel">
+          <div class="section-heading compact">
+            <p class="eyebrow">Schema Coverage</p>
+            <h2>Runtime contracts</h2>
+          </div>
+          <div class="badge-grid" data-schema-grid></div>
+        </article>
+        <article class="floating-panel">
+          <div class="section-heading compact">
+            <p class="eyebrow">Grader Summary</p>
+            <h2>Deterministic scoring rules</h2>
+          </div>
+          <div class="copy-block">
+            <p data-grader-summary>Loading grader details.</p>
+            <ul class="bullet-list" data-grader-list></ul>
+          </div>
+        </article>
+      </section>
+      <section class="floating-panel">
+        <div class="section-heading compact">
+          <p class="eyebrow">Endpoint Surface</p>
+          <h2>Available routes</h2>
+        </div>
+        <div class="route-grid">
+          <a class="route-card" href="/health"><strong>/health</strong><span>Validator-facing health endpoint.</span></a>
+          <a class="route-card" href="/metadata"><strong>/metadata</strong><span>Environment name, description, and task counts.</span></a>
+          <a class="route-card" href="/schema"><strong>/schema</strong><span>Action, observation, reward, and state schemas.</span></a>
+          <a class="route-card" href="/openapi.json"><strong>/openapi.json</strong><span>FastAPI OpenAPI contract for external tooling.</span></a>
+        </div>
+      </section>
+    </main>
+  </div>
+  <script src="/assets/app.js?v=3" defer></script>
+</body>
+</html>
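
Note on the health card above: the page states that the validator expects the value `healthy` from `/health`, but the response shape is not shown in this diff. The sketch below assumes a JSON body of the form `{"status": "healthy"}` and also tolerates a bare JSON string. Both the helper name `is_healthy` and the `status` key are assumptions.

```python
import json


def is_healthy(body: str) -> bool:
    """Return True when a /health response body reports the value `healthy`."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # A body that is not valid JSON is treated as unhealthy.
        return False
    # Assumed shape: {"status": "healthy"}; a bare "healthy" string is also accepted.
    if isinstance(payload, dict):
        return payload.get("status") == "healthy"
    return payload == "healthy"
```

Keeping the check as a pure function over the response text makes it easy to reuse in a test or a deployment probe without a running server.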

uv.lock ADDED

The diff for this file is too large to render.