XcodeAddy

Keep grader rewards strictly within unit interval

18aa055 about 1 month ago


metadata

title: Incident Triage Env
colorFrom: gray
colorTo: blue
sdk: docker
app_port: 7860
license: mit
short_description: OpenEnv-compatible incident triage evaluation environment.

Production Incident Triage Environment

This project is an OpenEnv-compatible evaluation environment for production incident response. An agent receives a typed incident observation and must perform one of three real-world triage tasks: classify severity, identify the most likely root cause, or recommend the best immediate action.

The environment is built to meet the OpenEnv hackathon requirements:

real-world utility
three graded tasks with easy, medium, and hard difficulty
typed observation, action, reward, and state models
deterministic reward logic with partial credit
root-level inference.py
Docker-based deployment for Hugging Face Spaces

Overview

The dataset contains 108 incidents across three task families:

| Task | Difficulty | Count | Objective |
| --- | --- | --- | --- |
| `task1` | easy | 36 | Predict incident severity as `SEV1`, `SEV2`, or `SEV3` |
| `task2` | medium | 36 | Predict the most likely root cause domain |
| `task3` | hard | 36 | Predict the best immediate operational action |

The incidents cover realistic production scenarios such as payment failures, queue backlogs, regional network loss, failed deploys, infrastructure saturation, third-party degradation, and failover decisions.

API

The FastAPI app exposes the following endpoints on port 7860:

GET /health
GET /metadata
GET /tasks
GET /grader
GET /schema
POST /reset
POST /step
GET /state
POST /mcp

Reset

POST /reset starts a new single-step episode.

Optional request body:

{
  "task_type": "task1",
  "ticket_id": "INC-001",
  "seed": 42
}

Response fields:

observation
reward
done
info
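Because the request body is optional, a client can send only the fields it wants to pin. The following is a minimal Python sketch, assuming that unset fields can simply be left out of the JSON body. The helper name `build_reset_body` is hypothetical and does not come from the repository.

```python
import json
from typing import Optional


def build_reset_body(
    task_type: Optional[str] = None,
    ticket_id: Optional[str] = None,
    seed: Optional[int] = None,
) -> dict:
    """Return a /reset body containing only the fields the caller set."""
    candidate = {"task_type": task_type, "ticket_id": ticket_id, "seed": seed}
    # Assumption: omitting a field is how a client leaves it unspecified.
    return {key: value for key, value in candidate.items() if value is not None}


# Pin the task family and the seed, and leave ticket_id unset.
body = build_reset_body(task_type="task1", seed=42)
print(json.dumps(body))
```

How the server fills in an omitted field is not described here, so treat the behavior for partial bodies as something to confirm against /schema or /docs.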

Step

POST /step?session_id=<id> accepts an IncidentAction and returns a typed StepResult.

Example request:

{
  "incident_id": "INC-001",
  "task_type": "task1",
  "severity": "SEV1"
}

State

GET /state?session_id=<id> returns the current typed IncidentState.

Web UI

The project also serves a browser-facing UI from the same FastAPI app:

/ shows the landing page with project overview and task summary
/status shows live health, schema, and task readiness information
/playground lets you manually reset a session and submit a step from the browser
/docs provides the generated FastAPI API reference

Models

The core models are defined in models.py:

IncidentObservation
IncidentAction
IncidentReward
StepResult
IncidentState
ResetRequest

Validation rules:

incident_id must match the active ticket
task_type must match the active ticket
exactly one of severity, root_cause, or action must be populated
the populated field must match the expected field for the task
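The four rules above can be sketched as a plain Python function. This is an illustration only, not the code in models.py: the function name `validate_action` is hypothetical, and the mapping from task type to expected field is an assumption inferred from the task objectives (only `task1` with `severity` is confirmed by the Step example).

```python
from typing import Optional

# Assumed mapping from task type to the single field that task expects.
EXPECTED_FIELD = {"task1": "severity", "task2": "root_cause", "task3": "action"}


def validate_action(
    action: dict, active_incident_id: str, active_task_type: str
) -> Optional[str]:
    """Return an error message, or None when the action passes all four rules."""
    if action.get("incident_id") != active_incident_id:
        return "incident_id does not match the active ticket"
    if action.get("task_type") != active_task_type:
        return "task_type does not match the active ticket"
    # Collect the answer fields that the caller actually filled in.
    populated = [
        name for name in EXPECTED_FIELD.values() if action.get(name) is not None
    ]
    if len(populated) != 1:
        return "exactly one of severity, root_cause, or action must be populated"
    if populated[0] != EXPECTED_FIELD[active_task_type]:
        return f"{populated[0]} is not the expected field for {active_task_type}"
    return None
```

For example, `validate_action({"incident_id": "INC-001", "task_type": "task1", "severity": "SEV1"}, "INC-001", "task1")` returns `None`, while the same action with `root_cause` populated instead of `severity` is rejected by the last rule.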

Reward Logic

Reward computation is deterministic and implemented in graders.py.

task1: 0.99 exact, 0.5 adjacent severity, 0.01 far miss
task2: 0.99 exact, 0.5 related domain, 0.25 UNKNOWN, 0.01 wrong
task3: 0.99 exact, 0.4 safe INVESTIGATE fallback, 0.25 related action, 0.01 wrong

This keeps grading reproducible while still providing a partial-credit signal.
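To make the task1 tiers concrete, here is a minimal sketch of a severity grader. It is not the code in graders.py. It assumes that "adjacent" means one level apart in the order `SEV1`, `SEV2`, `SEV3`, and the function name `grade_severity` is hypothetical.

```python
# Assumed ordering of severity labels, from most to least severe.
SEVERITY_ORDER = ["SEV1", "SEV2", "SEV3"]


def grade_severity(predicted: str, expected: str) -> float:
    """Score a task1 prediction using the three tiers listed above."""
    # list.index raises ValueError for a label outside SEVERITY_ORDER.
    distance = abs(SEVERITY_ORDER.index(predicted) - SEVERITY_ORDER.index(expected))
    if distance == 0:
        return 0.99  # exact match
    if distance == 1:
        return 0.5  # adjacent severity
    return 0.01  # far miss


print(grade_severity("SEV2", "SEV1"))
```

Every return value lies strictly between 0 and 1, and the same inputs always produce the same score.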

Repository Layout

incident-triage-env/
- app.py
- client.py
- environment.py
- graders.py
- incidents.py
- inference.py
- models.py
- openenv.yaml
- pyproject.toml
- requirements.txt
- Dockerfile
- README.md
- server/
- tests/

Runtime flow:

incidents.py stores the ticket dataset.
environment.py selects the episode and applies grading.
app.py exposes the API surface.
inference.py runs the baseline over the environment.
graders.py calculates deterministic reward and explanations.

Local Setup

Install dependencies:

pip install -r requirements.txt

Optional OpenEnv CLI:

pip install openenv-core

Optional environment variables for inference.py:

export API_BASE_URL="https://your-openai-compatible-endpoint/v1"
export MODEL_NAME="your-model-name"
export HF_TOKEN="your-api-key"
export ENV_URL="http://localhost:7860"

If no external environment server is reachable, inference.py falls back to an in-process FastAPI client.
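One way to read these variables is shown below as a sketch, not as the actual logic in inference.py. The `InferenceConfig` class and `load_config` function are hypothetical, and the `ENV_URL` default is an assumption taken from the example value above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class InferenceConfig:
    api_base_url: Optional[str]
    model_name: Optional[str]
    hf_token: Optional[str]
    env_url: str


def load_config(environ: Mapping[str, str] = os.environ) -> InferenceConfig:
    """Collect the optional settings; unset model variables stay None."""
    return InferenceConfig(
        api_base_url=environ.get("API_BASE_URL"),
        model_name=environ.get("MODEL_NAME"),
        hf_token=environ.get("HF_TOKEN"),
        # Assumed default: the local server address used elsewhere in this README.
        env_url=environ.get("ENV_URL", "http://localhost:7860"),
    )


config = load_config({"MODEL_NAME": "your-model-name"})
print(config.env_url)
```

Passing the mapping as a parameter keeps the function easy to exercise without touching the real process environment.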

Run Locally

Start the server:

uvicorn app:app --host 0.0.0.0 --port 7860

Run the baseline:

python inference.py

Run the smoke tests:

python -m unittest discover -s tests -v

Docker

Build the image:

docker build -t incident-triage-env .

Run the container:

docker run --rm -p 7860:7860 incident-triage-env

Check health:

curl http://localhost:7860/health

Baseline Logging

inference.py prints the required structured output:

[START] task=INC-001 env=incident-triage-env model=deterministic-baseline
[STEP] step=1 action=SEV1 reward=0.99 done=true error=null
[END] success=true steps=1 score=0.99 rewards=0.99
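The three line shapes above can be reproduced with small formatting helpers. This is a sketch under assumptions, not the code in inference.py: the function names are hypothetical, two-decimal number formatting is inferred from the sample lines, and joining multiple rewards with commas is a guess, since every episode here has a single step.

```python
from typing import List, Optional


def format_start(task: str, env: str, model: str) -> str:
    return f"[START] task={task} env={env} model={model}"


def format_step(
    step: int, action: str, reward: float, done: bool, error: Optional[str] = None
) -> str:
    # Booleans and a missing error are written as lowercase true/false/null.
    done_text = "true" if done else "false"
    error_text = "null" if error is None else error
    return (
        f"[STEP] step={step} action={action} reward={reward:.2f} "
        f"done={done_text} error={error_text}"
    )


def format_end(success: bool, steps: int, score: float, rewards: List[float]) -> str:
    success_text = "true" if success else "false"
    # Assumption: several rewards would be comma-separated.
    rewards_text = ",".join(f"{value:.2f}" for value in rewards)
    return (
        f"[END] success={success_text} steps={steps} "
        f"score={score:.2f} rewards={rewards_text}"
    )


print(format_start("INC-001", "incident-triage-env", "deterministic-baseline"))
print(format_step(1, "SEV1", 0.99, True))
print(format_end(True, 1, 0.99, [0.99]))
```

With these inputs the three printed lines match the sample output shown above.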

Baseline Scores

Latest local deterministic baseline:

| Metric | Value |
| --- | --- |
| Episodes | 108 |
| Average score | 0.9855 |
| `task1` average | 0.9900 |
| `task2` average | 0.9764 |
| `task3` average | 0.9900 |

This deterministic local run completed in about 1.34 s on the machine used for the run. Results are written by default to /tmp/outputs/baseline_scores.json.
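The table is internally consistent, which a short calculation can confirm. Because each task has 36 episodes, the overall average is the plain mean of the three task averages. The second calculation is an inference, not something the README states: the `task2` figure matches what one episode scored at 0.5 and the other 35 scored at 0.99 would produce.

```python
task_averages = {"task1": 0.9900, "task2": 0.9764, "task3": 0.9900}

# Equal episode counts (36 each), so an unweighted mean is correct.
overall = sum(task_averages.values()) / len(task_averages)
print(round(overall, 4))  # 0.9855

# Hypothetical breakdown for task2: 35 exact matches and one related-domain answer.
one_partial = (35 * 0.99 + 0.5) / 36
print(round(one_partial, 4))  # 0.9764
```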

Quick API Example

Reset:

curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_type":"task1","ticket_id":"INC-001"}'

Step:

curl -X POST "http://localhost:7860/step?session_id=<session-id>" \
  -H "Content-Type: application/json" \
  -d '{
    "incident_id": "INC-001",
    "task_type": "task1",
    "severity": "SEV1"
  }'
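The same two calls can be made from Python with only the standard library. This is a sketch, not the project's client.py: the helper names `step_url`, `build_post`, and `send` are hypothetical, and the base URL assumes the local server from the curl examples.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:7860"  # assumed local server, as in the curl examples


def step_url(base_url: str, session_id: str) -> str:
    """Build the /step URL with session_id safely encoded as a query parameter."""
    query = urllib.parse.urlencode({"session_id": session_id})
    return f"{base_url}/step?{query}"


def build_post(url: str, payload: dict) -> urllib.request.Request:
    """Prepare a JSON POST request without sending it."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


reset_request = build_post(
    f"{BASE_URL}/reset", {"task_type": "task1", "ticket_id": "INC-001"}
)
print(reset_request.get_method(), reset_request.full_url)
```

With the server running, `send(reset_request)` performs the reset. The session id then has to be read from that response before calling `step_url`; this README does not say which field carries it, so check /schema or /docs.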

Pre-Submission Checklist

openenv validate . --json passes
openenv validate --url <space-url> passes
POST /reset returns 200
POST /step returns typed reward, done, and info
GET /state works for active sessions
inference.py runs from the repo root
Dockerfile serves the app on port 7860
openenv.yaml matches the current API and dataset counts

Notes

models.py is the source of truth for valid enum labels.
graders.py is the source of truth for scoring logic.
Reward values are kept strictly within (0, 1) to satisfy Phase 2 validator constraints.
The environment is intentionally single-step per episode, but it still exposes typed state for validation and debugging.