---
title: Incident Triage Env
colorFrom: gray
colorTo: blue
sdk: docker
app_port: 7860
license: mit
short_description: OpenEnv-compatible incident triage evaluation environment.
---
# Production Incident Triage Environment

This project is an OpenEnv-compatible evaluation environment for production incident response. An agent receives a typed incident observation and must perform one of three real-world triage tasks: classify severity, identify the most likely root cause, or recommend the best immediate action.

The environment is built to meet the OpenEnv hackathon requirements:

- real-world utility
- three graded tasks with easy, medium, and hard difficulty
- typed observation, action, reward, and state models
- deterministic reward logic with partial credit
- root-level `inference.py`
- Docker-based deployment for Hugging Face Spaces
## Overview

The dataset contains 108 incidents across three task families:

| Task | Difficulty | Count | Objective |
|---|---|---:|---|
| `task1` | easy | 36 | Predict incident severity as `SEV1`, `SEV2`, or `SEV3` |
| `task2` | medium | 36 | Predict the most likely root cause domain |
| `task3` | hard | 36 | Predict the best immediate operational action |

The incidents cover realistic production scenarios such as payment failures, queue backlogs, regional network loss, failed deploys, infrastructure saturation, third-party degradation, and failover decisions.
## API

The FastAPI app exposes the following endpoints on port `7860`:

- `GET /health`
- `GET /metadata`
- `GET /tasks`
- `GET /grader`
- `GET /schema`
- `POST /reset`
- `POST /step`
- `GET /state`
- `POST /mcp`
### Reset

`POST /reset` starts a new single-step episode.

Optional request body:

```json
{
  "task_type": "task1",
  "ticket_id": "INC-001",
  "seed": 42
}
```

Response fields:

- `observation`
- `reward`
- `done`
- `info`
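As a quick sketch, a client can read those fields from the reset response like this (the field values below are illustrative, not output from a live server):

```python
import json

# Illustrative reset response; real values come from POST /reset.
raw = """{
  "observation": {"incident_id": "INC-001", "task_type": "task1"},
  "reward": 0.0,
  "done": false,
  "info": {"session_id": "example-session"}
}"""

result = json.loads(raw)
observation = result["observation"]
session_id = result["info"]["session_id"]  # needed later for /step and /state
```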
### Step

`POST /step?session_id=<id>` accepts an `IncidentAction` and returns a typed `StepResult`.

Example request:

```json
{
  "incident_id": "INC-001",
  "task_type": "task1",
  "severity": "SEV1"
}
```
### State

`GET /state?session_id=<id>` returns the current typed `IncidentState`.
## Web UI

The project also serves a browser-facing UI from the same FastAPI app:

- `/` shows the landing page with project overview and task summary
- `/status` shows live health, schema, and task readiness information
- `/playground` lets you manually reset a session and submit a step from the browser
- `/docs` provides the generated FastAPI API reference
## Models

The core models are defined in [models.py](./models.py):

- `IncidentObservation`
- `IncidentAction`
- `IncidentReward`
- `StepResult`
- `IncidentState`
- `ResetRequest`

Validation rules:

- `incident_id` must match the active ticket
- `task_type` must match the active ticket
- exactly one of `severity`, `root_cause`, or `action` must be populated
- the populated field must match the expected field for the task
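The last two rules can be sketched in plain Python. This is a hypothetical helper for illustration only; the actual checks live on the typed models in `models.py`:

```python
# Hypothetical sketch of the exactly-one-field rule; models.py is
# the source of truth and may differ in detail.
EXPECTED_FIELD = {"task1": "severity", "task2": "root_cause", "task3": "action"}

def validate_action(task_type, severity=None, root_cause=None, action=None):
    populated = {
        name: value
        for name, value in
        {"severity": severity, "root_cause": root_cause, "action": action}.items()
        if value is not None
    }
    if len(populated) != 1:
        raise ValueError("exactly one of severity, root_cause, action must be set")
    (field,) = populated
    if field != EXPECTED_FIELD[task_type]:
        raise ValueError(f"{field} is not the expected field for {task_type}")
    return field
```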
## Reward Logic

Reward computation is deterministic and implemented in [graders.py](./graders.py):

- `task1`: `0.99` for an exact match, `0.5` for an adjacent severity, `0.01` for a far miss
- `task2`: `0.99` for an exact match, `0.5` for a related domain, `0.25` for `UNKNOWN`, `0.01` otherwise
- `task3`: `0.99` for an exact match, `0.4` for a safe `INVESTIGATE` fallback, `0.25` for a related action, `0.01` otherwise

This keeps grading reproducible while still providing a partial-credit signal across trajectories.
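The `task1` rule can be sketched as a small function (a simplification for illustration; `graders.py` remains the source of truth for scoring):

```python
# Simplified sketch of the task1 severity grader: exact match,
# adjacent severity, or far miss, per the reward table above.
SEVERITY_ORDER = {"SEV1": 1, "SEV2": 2, "SEV3": 3}

def grade_severity(predicted: str, expected: str) -> float:
    distance = abs(SEVERITY_ORDER[predicted] - SEVERITY_ORDER[expected])
    if distance == 0:
        return 0.99  # exact match
    if distance == 1:
        return 0.5   # adjacent severity
    return 0.01      # far miss
```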
## Repository Layout

```text
incident-triage-env/
- app.py
- client.py
- environment.py
- graders.py
- incidents.py
- inference.py
- models.py
- openenv.yaml
- pyproject.toml
- requirements.txt
- Dockerfile
- README.md
- server/
- tests/
```
Runtime flow:

1. `incidents.py` stores the ticket dataset.
2. `environment.py` selects the episode and applies grading.
3. `app.py` exposes the API surface.
4. `inference.py` runs the baseline over the environment.
5. `graders.py` calculates deterministic rewards and explanations.
## Local Setup

Install dependencies:

```bash
pip install -r requirements.txt
```

Optional OpenEnv CLI:

```bash
pip install openenv-core
```

Optional environment variables for `inference.py`:

```bash
export API_BASE_URL="https://your-openai-compatible-endpoint/v1"
export MODEL_NAME="your-model-name"
export HF_TOKEN="your-api-key"
export ENV_URL="http://localhost:7860"
```

If no external environment server is reachable, `inference.py` falls back to an in-process FastAPI client.
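A minimal sketch of how these variables might be consumed. The variable names come from this README; the defaults shown are assumptions, not necessarily the values hard-coded in `inference.py`:

```python
import os

def load_config(env):
    # Mirrors the variables documented above. The ENV_URL default
    # matches the local server port; the model default is assumed.
    return {
        "api_base_url": env.get("API_BASE_URL"),  # None -> in-process fallback
        "model_name": env.get("MODEL_NAME", "deterministic-baseline"),
        "hf_token": env.get("HF_TOKEN"),
        "env_url": env.get("ENV_URL", "http://localhost:7860"),
    }

config = load_config(dict(os.environ))
```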
## Run Locally

Start the server:

```bash
uvicorn app:app --host 0.0.0.0 --port 7860
```

Run the baseline:

```bash
python inference.py
```

Run the smoke tests:

```bash
python -m unittest discover -s tests -v
```
## Docker

Build the image:

```bash
docker build -t incident-triage-env .
```

Run the container:

```bash
docker run --rm -p 7860:7860 incident-triage-env
```

Check health:

```bash
curl http://localhost:7860/health
```
## Baseline Logging

`inference.py` prints the required structured output:

```text
[START] task=INC-001 env=incident-triage-env model=deterministic-baseline
[STEP] step=1 action=SEV1 reward=0.99 done=true error=null
[END] success=true steps=1 score=0.99 rewards=0.99
```
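The `[STEP]` line can be reproduced with a plain format string. This is a sketch of the log shape shown above; the actual formatting code in `inference.py` may differ:

```python
def step_line(step, action, reward, done, error=None):
    # Matches the structured [STEP] line shown above: lowercase
    # booleans and a literal "null" when there is no error.
    return (
        f"[STEP] step={step} action={action} reward={reward} "
        f"done={str(done).lower()} error={error if error is not None else 'null'}"
    )
```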
## Baseline Scores

Latest local deterministic baseline:

| Metric | Value |
|---|---:|
| Episodes | 108 |
| Average score | 0.9855 |
| `task1` average | 0.9900 |
| `task2` average | 0.9764 |
| `task3` average | 0.9900 |

This deterministic local run completed in about `1.34s` on the current machine. Results are written by default to `/tmp/outputs/baseline_scores.json`.
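The overall average follows directly from the per-task averages, since each task family contributes an equal 36 episodes:

```python
# Overall average as the mean of the three per-task averages,
# valid because every task has the same episode count (36).
task_averages = {"task1": 0.9900, "task2": 0.9764, "task3": 0.9900}
overall = sum(task_averages.values()) / len(task_averages)
print(round(overall, 4))  # 0.9855
```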
## Quick API Example

Reset:

```bash
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_type":"task1","ticket_id":"INC-001"}'
```

Step:

```bash
curl -X POST "http://localhost:7860/step?session_id=<session-id>" \
  -H "Content-Type: application/json" \
  -d '{
    "incident_id": "INC-001",
    "task_type": "task1",
    "severity": "SEV1"
  }'
```
## Pre-Submission Checklist

- `openenv validate . --json` passes
- `openenv validate --url <space-url>` passes
- `POST /reset` returns `200`
- `POST /step` returns typed `reward`, `done`, and `info`
- `GET /state` works for active sessions
- `inference.py` runs from the repo root
- `Dockerfile` serves the app on port `7860`
- `openenv.yaml` matches the current API and dataset counts
## Notes

- `models.py` is the source of truth for valid enum labels.
- `graders.py` is the source of truth for scoring logic.
- Reward values are kept strictly within `(0, 1)` to satisfy Phase 2 validator constraints.
- The environment is intentionally single-step per episode, but it still exposes typed state for validation and debugging.