incident-triage-env / README.md
XcodeAddy's picture
Keep grader rewards strictly within unit interval
18aa055
metadata
title: Incident Triage Env
colorFrom: gray
colorTo: blue
sdk: docker
app_port: 7860
license: mit
short_description: OpenEnv-compatible incident triage evaluation environment.

Production Incident Triage Environment

This project is an OpenEnv-compatible evaluation environment for production incident response. An agent receives a typed incident observation and must perform one of three real-world triage tasks: classify severity, identify the most likely root cause, or recommend the best immediate action.

The environment is built for the OpenEnv hackathon requirements:

  • real-world utility
  • three graded tasks with easy, medium, and hard difficulty
  • typed observation, action, reward, and state models
  • deterministic reward logic with partial credit
  • root-level inference.py
  • Docker-based deployment for Hugging Face Spaces

Overview

The dataset contains 108 incidents across three task families:

Task Difficulty Count Objective
task1 easy 36 Predict incident severity as SEV1, SEV2, or SEV3
task2 medium 36 Predict the most likely root cause domain
task3 hard 36 Predict the best immediate operational action

The incidents cover realistic production scenarios such as payment failures, queue backlogs, regional network loss, failed deploys, infrastructure saturation, third-party degradation, and failover decisions.

API

The FastAPI app exposes the following endpoints on port 7860:

  • GET /health
  • GET /metadata
  • GET /tasks
  • GET /grader
  • GET /schema
  • POST /reset
  • POST /step
  • GET /state
  • POST /mcp

Reset

POST /reset starts a new single-step episode.

Optional request body:

{
  "task_type": "task1",
  "ticket_id": "INC-001",
  "seed": 42
}

Response fields:

  • observation
  • reward
  • done
  • info

Step

POST /step?session_id=<id> accepts an IncidentAction and returns a typed StepResult.

Example request:

{
  "incident_id": "INC-001",
  "task_type": "task1",
  "severity": "SEV1"
}

State

GET /state?session_id=<id> returns the current typed IncidentState.

Web UI

The project also serves a browser-facing UI from the same FastAPI app:

  • / shows the landing page with project overview and task summary
  • /status shows live health, schema, and task readiness information
  • /playground lets you manually reset a session and submit a step from the browser
  • /docs provides the generated FastAPI API reference

Models

The core models are defined in models.py:

  • IncidentObservation
  • IncidentAction
  • IncidentReward
  • StepResult
  • IncidentState
  • ResetRequest

Validation rules:

  • incident_id must match the active ticket
  • task_type must match the active ticket
  • exactly one of severity, root_cause, or action must be populated
  • the populated field must match the expected field for the task

Reward Logic

Rewarding is deterministic and implemented in graders.py.

  • task1: 0.99 exact, 0.5 adjacent severity, 0.01 far miss
  • task2: 0.99 exact, 0.5 related domain, 0.25 UNKNOWN, 0.01 wrong
  • task3: 0.99 exact, 0.4 safe INVESTIGATE fallback, 0.25 related action, 0.01 wrong

This keeps grading reproducible while still giving partial-credit trajectory signal.

Repository Layout

incident-triage-env/
- app.py
- client.py
- environment.py
- graders.py
- incidents.py
- inference.py
- models.py
- openenv.yaml
- pyproject.toml
- requirements.txt
- Dockerfile
- README.md
- server/
- tests/

Runtime flow:

  1. incidents.py stores the ticket dataset.
  2. environment.py selects the episode and applies grading.
  3. app.py exposes the API surface.
  4. inference.py runs the baseline over the environment.
  5. graders.py calculates deterministic reward and explanations.

Local Setup

Install dependencies:

pip install -r requirements.txt

Optional OpenEnv CLI:

pip install openenv-core

Optional environment variables for inference.py:

export API_BASE_URL="https://your-openai-compatible-endpoint/v1"
export MODEL_NAME="your-model-name"
export HF_TOKEN="your-api-key"
export ENV_URL="http://localhost:7860"

If no external environment server is reachable, inference.py falls back to an in-process FastAPI client.

Run Locally

Start the server:

uvicorn app:app --host 0.0.0.0 --port 7860

Run the baseline:

python inference.py

Run the smoke tests:

python -m unittest discover -s tests -v

Docker

Build the image:

docker build -t incident-triage-env .

Run the container:

docker run --rm -p 7860:7860 incident-triage-env

Check health:

curl http://localhost:7860/health

Baseline Logging

inference.py prints the required structured output:

[START] task=INC-001 env=incident-triage-env model=deterministic-baseline
[STEP] step=1 action=SEV1 reward=0.99 done=true error=null
[END] success=true steps=1 score=0.99 rewards=0.99

Baseline Scores

Latest local deterministic baseline:

Metric Value
Episodes 108
Average score 0.9855
task1 average 0.9900
task2 average 0.9764
task3 average 0.9900

This deterministic local run completed in about 1.34s on the current machine. Results are written by default to /tmp/outputs/baseline_scores.json.

Quick API Example

Reset:

curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_type":"task1","ticket_id":"INC-001"}'

Step:

curl -X POST "http://localhost:7860/step?session_id=<session-id>" \
  -H "Content-Type: application/json" \
  -d '{
    "incident_id": "INC-001",
    "task_type": "task1",
    "severity": "SEV1"
  }'

Pre-Submission Checklist

  • openenv validate . --json passes
  • openenv validate --url <space-url> passes
  • POST /reset returns 200
  • POST /step returns typed reward, done, and info
  • GET /state works for active sessions
  • inference.py runs from the repo root
  • Dockerfile serves the app on port 7860
  • openenv.yaml matches the current API and dataset counts

Notes

  • models.py is the source of truth for valid enum labels.
  • graders.py is the source of truth for scoring logic.
  • Reward values are kept strictly within (0, 1) to satisfy Phase 2 validator constraints.
  • The environment is intentionally single-step per episode and still exposes typed state for validation and debugging.