---
title: Incident Triage Env
colorFrom: gray
colorTo: blue
sdk: docker
app_port: 7860
license: mit
short_description: OpenEnv-compatible incident triage evaluation environment.
---

# Production Incident Triage Environment

This project is an OpenEnv-compatible evaluation environment for production incident response. An agent receives a typed incident observation and must perform one of three real-world triage tasks: classify severity, identify the most likely root cause, or recommend the best immediate action.

The environment is built to meet the OpenEnv hackathon requirements:
- real-world utility
- three graded tasks with easy, medium, and hard difficulty
- typed observation, action, reward, and state models
- deterministic reward logic with partial credit
- root-level `inference.py`
- Docker-based deployment for Hugging Face Spaces

## Overview

The dataset contains 108 incidents across three task families:

| Task | Difficulty | Count | Objective |
|---|---|---:|---|
| `task1` | easy | 36 | Predict incident severity as `SEV1`, `SEV2`, or `SEV3` |
| `task2` | medium | 36 | Predict the most likely root cause domain |
| `task3` | hard | 36 | Predict the best immediate operational action |

The incidents cover realistic production scenarios such as payment failures, queue backlogs, regional network loss, failed deploys, infrastructure saturation, third-party degradation, and failover decisions.

## API

The FastAPI app exposes the following endpoints on port `7860`:

- `GET /health`
- `GET /metadata`
- `GET /tasks`
- `GET /grader`
- `GET /schema`
- `POST /reset`
- `POST /step`
- `GET /state`
- `POST /mcp`

### Reset

`POST /reset` starts a new single-step episode.

Optional request body:

```json
{
  "task_type": "task1",
  "ticket_id": "INC-001",
  "seed": 42
}
```

Response fields:
- `observation`
- `reward`
- `done`
- `info`

### Step

`POST /step?session_id=<id>` accepts an `IncidentAction` and returns a typed `StepResult`.

Example request:

```json
{
  "incident_id": "INC-001",
  "task_type": "task1",
  "severity": "SEV1"
}
```

### State

`GET /state?session_id=<id>` returns the current typed `IncidentState`.

## Web UI

The project also serves a browser-facing UI from the same FastAPI app:

- `/` shows the landing page with project overview and task summary
- `/status` shows live health, schema, and task readiness information
- `/playground` lets you manually reset a session and submit a step from the browser
- `/docs` provides the generated FastAPI API reference

## Models

The core models are defined in [models.py](./models.py):

- `IncidentObservation`
- `IncidentAction`
- `IncidentReward`
- `StepResult`
- `IncidentState`
- `ResetRequest`

Validation rules:
- `incident_id` must match the active ticket
- `task_type` must match the active ticket
- exactly one of `severity`, `root_cause`, or `action` must be populated
- the populated field must match the expected field for the task
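
The four rules above can be sketched in plain Python. This is an illustration only, not the code in `models.py`: `ActionSketch`, `EXPECTED_FIELD`, and `validate_action` are hypothetical names, and the task-to-field mapping is inferred from the task table rather than copied from the real models.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed mapping from task type to the one field that task expects.
EXPECTED_FIELD = {"task1": "severity", "task2": "root_cause", "task3": "action"}


@dataclass
class ActionSketch:
    """Illustrative stand-in for IncidentAction; not the real model."""
    incident_id: str
    task_type: str
    severity: Optional[str] = None
    root_cause: Optional[str] = None
    action: Optional[str] = None


def validate_action(act: ActionSketch, active_id: str, active_task: str) -> list:
    """Return a list of rule violations; an empty list means the action is valid."""
    errors = []
    if act.incident_id != active_id:
        errors.append("incident_id does not match the active ticket")
    if act.task_type != active_task:
        errors.append("task_type does not match the active ticket")
    # Collect the answer fields that were actually filled in.
    populated = [f for f in EXPECTED_FIELD.values() if getattr(act, f) is not None]
    if len(populated) != 1:
        errors.append("exactly one of severity, root_cause, action must be set")
    elif populated[0] != EXPECTED_FIELD.get(active_task):
        errors.append(f"{active_task} expects {EXPECTED_FIELD.get(active_task)}")
    return errors
```

Returning a list of violations rather than raising on the first one makes it easy to report every problem with a submitted action at once.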

## Reward Logic

Reward computation is deterministic and implemented in [graders.py](./graders.py).

- `task1`: `0.99` exact, `0.5` adjacent severity, `0.01` far miss
- `task2`: `0.99` exact, `0.5` related domain, `0.25` `UNKNOWN`, `0.01` wrong
- `task3`: `0.99` exact, `0.4` safe `INVESTIGATE` fallback, `0.25` related action, `0.01` wrong

This keeps grading reproducible while still awarding partial credit for near-miss answers.
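
As a minimal sketch of the `task1` tiers, the function below scores a severity prediction. It assumes that "adjacent" means one step apart in the order `SEV1`, `SEV2`, `SEV3`; the actual rule lives in `graders.py` and may differ. `SEVERITY_ORDER` and `grade_severity` are hypothetical names.

```python
# Severity labels in the order they appear in the task table.
SEVERITY_ORDER = ["SEV1", "SEV2", "SEV3"]


def grade_severity(predicted: str, expected: str) -> float:
    """Score a task1 prediction with the 0.99 / 0.5 / 0.01 tiers.

    Assumption: "adjacent" means the labels are one position apart
    in SEVERITY_ORDER. This is a sketch, not the graders.py logic.
    """
    if predicted == expected:
        return 0.99
    if predicted not in SEVERITY_ORDER or expected not in SEVERITY_ORDER:
        return 0.01  # unknown labels are treated as a far miss in this sketch
    distance = abs(SEVERITY_ORDER.index(predicted) - SEVERITY_ORDER.index(expected))
    return 0.5 if distance == 1 else 0.01
```

Because the function has no randomness or external state, the same prediction always receives the same score.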

## Repository Layout

```text
incident-triage-env/
- app.py
- client.py
- environment.py
- graders.py
- incidents.py
- inference.py
- models.py
- openenv.yaml
- pyproject.toml
- requirements.txt
- Dockerfile
- README.md
- server/
- tests/
```

Runtime flow:
1. `incidents.py` stores the ticket dataset.
2. `environment.py` selects the episode and applies grading.
3. `app.py` exposes the API surface.
4. `inference.py` runs the baseline over the environment.
5. `graders.py` calculates deterministic reward and explanations.

## Local Setup

Install dependencies:

```bash
pip install -r requirements.txt
```

Optional OpenEnv CLI:

```bash
pip install openenv-core
```

Optional environment variables for `inference.py`:

```bash
export API_BASE_URL="https://your-openai-compatible-endpoint/v1"
export MODEL_NAME="your-model-name"
export HF_TOKEN="your-api-key"
export ENV_URL="http://localhost:7860"
```

If no external environment server is reachable, `inference.py` falls back to an in-process FastAPI client.

## Run Locally

Start the server:

```bash
uvicorn app:app --host 0.0.0.0 --port 7860
```

Run the baseline:

```bash
python inference.py
```

Run the smoke tests:

```bash
python -m unittest discover -s tests -v
```

## Docker

Build the image:

```bash
docker build -t incident-triage-env .
```

Run the container:

```bash
docker run --rm -p 7860:7860 incident-triage-env
```

Check health:

```bash
curl http://localhost:7860/health
```

## Baseline Logging

`inference.py` prints the required structured output:

```text
[START] task=INC-001 env=incident-triage-env model=deterministic-baseline
[STEP] step=1 action=SEV1 reward=0.99 done=true error=null
[END] success=true steps=1 score=0.99 rewards=0.99
```
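
A formatter along the following lines would reproduce the sample above for a single-step episode. It is a sketch, not the code in `inference.py`: `format_episode` is a hypothetical helper, and the two-decimal reward formatting is an assumption read off the sample output.

```python
def format_episode(task, env, model, action, reward, success=True, error=None):
    """Build the [START], [STEP], and [END] lines for one single-step episode."""
    err = "null" if error is None else str(error)
    ok = "true" if success else "false"
    score = f"{reward:.2f}"  # assumed precision, taken from the sample log
    return [
        f"[START] task={task} env={env} model={model}",
        # Episodes are single-step, so step is always 1 and done is always true.
        f"[STEP] step=1 action={action} reward={score} done=true error={err}",
        f"[END] success={ok} steps=1 score={score} rewards={score}",
    ]


for line in format_episode(
    "INC-001", "incident-triage-env", "deterministic-baseline", "SEV1", 0.99
):
    print(line)
```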

## Baseline Scores

Latest local deterministic baseline:

| Metric | Value |
|---|---:|
| Episodes | 108 |
| Average score | 0.9855 |
| `task1` average | 0.9900 |
| `task2` average | 0.9764 |
| `task3` average | 0.9900 |

This deterministic local run completed in about `1.34s`; timing will vary by machine.
Results are written by default to `/tmp/outputs/baseline_scores.json`.
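
The overall average can be cross-checked against the per-task rows, since each task contributes 36 of the 108 episodes. The snippet below is only an arithmetic check on the table values, not part of the project code.

```python
# Per-task averages from the table above; each task has 36 episodes.
task_averages = {"task1": 0.9900, "task2": 0.9764, "task3": 0.9900}
episodes_per_task = 36

# Weight each task average by its episode count, then divide by all episodes.
total = sum(avg * episodes_per_task for avg in task_averages.values())
overall = total / (episodes_per_task * len(task_averages))

print(round(overall, 4))  # 0.9855
```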

## Quick API Example

Reset:

```bash
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_type":"task1","ticket_id":"INC-001"}'
```

Step:

```bash
curl -X POST "http://localhost:7860/step?session_id=<session-id>" \
  -H "Content-Type: application/json" \
  -d '{
    "incident_id": "INC-001",
    "task_type": "task1",
    "severity": "SEV1"
  }'
```
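
The same two calls can be made from Python with only the standard library. This is a hedged sketch: `build_reset`, `build_step`, and `send` are hypothetical helpers, and the session ID is passed in by the caller because this README does not specify where the reset response carries it.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:7860"  # same host and port as the curl examples


def build_reset(task_type: str, ticket_id: str) -> urllib.request.Request:
    """Prepare a POST /reset request without sending it."""
    body = json.dumps({"task_type": task_type, "ticket_id": ticket_id}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/reset",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_step(session_id: str, incident_id: str, task_type: str,
               severity: str) -> urllib.request.Request:
    """Prepare a POST /step request for a task1 (severity) answer."""
    query = urllib.parse.urlencode({"session_id": session_id})
    body = json.dumps(
        {"incident_id": incident_id, "task_type": task_type, "severity": severity}
    ).encode()
    return urllib.request.Request(
        f"{BASE_URL}/step?{query}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request) -> dict:
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

With the server running, `send(build_reset("task1", "INC-001"))` returns the reset payload as a dictionary. Separating request construction from sending keeps the payloads easy to inspect without a live server.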

## Pre-Submission Checklist

- `openenv validate . --json` passes
- `openenv validate --url <space-url>` passes
- `POST /reset` returns `200`
- `POST /step` returns typed `reward`, `done`, and `info`
- `GET /state` works for active sessions
- `inference.py` runs from the repo root
- `Dockerfile` serves the app on port `7860`
- `openenv.yaml` matches the current API and dataset counts

## Notes

- `models.py` is the source of truth for valid enum labels.
- `graders.py` is the source of truth for scoring logic.
- Reward values are kept strictly within `(0, 1)` to satisfy Phase 2 validator constraints.
- The environment is intentionally single-step per episode but still exposes typed state for validation and debugging.