Coding Ninja committed on
Commit ·
c64d203
1
Parent(s): 6c5051f
Finalize gap fixes and lightweight competitive upgrades
- KNOWLEDGE.md +42 -17
- README.md +59 -14
- ROADMAP.md +76 -10
- inference.py +124 -8
- models.py +24 -1
- openenv.yaml +1 -0
- server/environment.py +253 -36
- server/tasks.py +7 -3
- tests/test_competitive_upgrade.py +285 -7
- tests/test_environment_smoke.py +7 -0
- tests/test_extra_fields_penalty.py +11 -12
- tests/test_inference_unit.py +16 -0
KNOWLEDGE.md
CHANGED
|
@@ -24,7 +24,7 @@ IT helpdesk routing is a strong hackathon fit because it is:
|
|
| 24 |
- deterministic to grade
|
| 25 |
- naturally multi-step
|
| 26 |
|
| 27 |
-
A helpdesk agent has to decide what the ticket is about, how urgent it is, who should own it, and what should happen next.
|
| 28 |
|
| 29 |
## The Repo In One Sentence
|
| 30 |
|
|
@@ -134,7 +134,7 @@ Important fields:
|
|
| 134 |
|
| 135 |
### `HelpdeskTicketAction`
|
| 136 |
|
| 137 |
-
Represents the agent
|
| 138 |
|
| 139 |
### `HelpdeskTicketObservation`
|
| 140 |
|
|
@@ -142,6 +142,7 @@ Represents what the agent sees for each step:
|
|
| 142 |
|
| 143 |
- task metadata
|
| 144 |
- visible ticket fields
|
|
|
|
| 145 |
- queue progress
|
| 146 |
- score history
|
| 147 |
|
|
@@ -179,10 +180,19 @@ The observation exposes:
|
|
| 179 |
|
| 180 |
- task metadata
|
| 181 |
- the current ticket
|
|
|
|
|
|
|
|
|
|
| 182 |
- queue progress counters
|
| 183 |
- history
|
| 184 |
- reward and done status
|
| 185 |
|
| 186 |
The state tracks:
|
| 187 |
|
| 188 |
- current task
|
|
@@ -191,12 +201,13 @@ The state tracks:
|
|
| 191 |
- current ticket index
|
| 192 |
- per-ticket scores
|
| 193 |
- total reward
|
|
|
|
| 194 |
|
| 195 |
## Task Design
|
| 196 |
|
| 197 |
### Task 1: Issue Type Classification
|
| 198 |
|
| 199 |
-
The agent predicts:
|
| 200 |
|
| 201 |
- `issue_type`
|
| 202 |
|
|
@@ -206,7 +217,7 @@ Purpose:
|
|
| 206 |
|
| 207 |
### Task 2: Issue Type And Priority
|
| 208 |
|
| 209 |
-
The agent predicts:
|
| 210 |
|
| 211 |
- `issue_type`
|
| 212 |
- `priority`
|
|
@@ -217,7 +228,7 @@ Purpose:
|
|
| 217 |
|
| 218 |
### Task 3: Full Ticket Routing
|
| 219 |
|
| 220 |
-
The agent predicts:
|
| 221 |
|
| 222 |
- `issue_type`
|
| 223 |
- `priority`
|
|
@@ -256,14 +267,14 @@ This is now proven in checked-in unit tests rather than left as a docs claim.
|
|
| 256 |
|
| 257 |
Step reward:
|
| 258 |
|
| 259 |
-
- current ticket score
|
| 260 |
|
| 261 |
Final reward:
|
| 262 |
|
| 263 |
- average of ticket scores
|
| 264 |
-
- minus a
|
| 265 |
|
| 266 |
-
This
|
| 267 |
|
| 268 |
## Dataset Mental Model
|
| 269 |
|
|
@@ -277,6 +288,8 @@ Current structure:
|
|
| 277 |
- harder ambiguous cases
|
| 278 |
- follow-up tickets connected through `related_ticket_id`
|
| 279 |
|
|
|
|
|
|
|
| 280 |
The dataset is meant to test routing judgment, not just keyword spotting.
|
| 281 |
|
| 282 |
## Grounding Note
|
|
@@ -299,16 +312,18 @@ It:
|
|
| 299 |
|
| 300 |
1. connects to the environment
|
| 301 |
2. loads the available tasks
|
| 302 |
-
3. runs one episode
|
| 303 |
4. picks an action for each ticket
|
| 304 |
5. sends the action back through the client
|
| 305 |
6. records rewards
|
| 306 |
-
7. prints
|
| 307 |
|
| 308 |
It supports:
|
| 309 |
|
| 310 |
- heuristic mode with no external model
|
| 311 |
- LLM mode through an OpenAI-compatible API
|
|
|
|
|
|
|
| 312 |
|
| 313 |
## Files That Matter Most
|
| 314 |
|
|
@@ -374,16 +389,26 @@ That follow-up pass added the remaining Roopal-owned public-clarity items:
|
|
| 374 |
- an internal grounding note tying the label space to public IT-support datasets
|
| 375 |
- a refreshed compliance snapshot in `required.md`
|
| 376 |
|
| 377 |
-
The optional TRL / GRPO README example
|
| 378 |
|
| 379 |
-
|
| 380 |
|
| 381 |
-
|
| 382 |
|
| 383 |
-
|
| 384 |
|
| 385 |
-
1.
|
| 386 |
-
2.
|
| 387 |
|
| 388 |
## One-Minute Summary
|
| 389 |
|
|
@@ -396,4 +421,4 @@ If you come back to this repo later, remember:
|
|
| 396 |
- the agent predicts structured routing fields
|
| 397 |
- the grader gives deterministic partial credit
|
| 398 |
- `inference.py` is the baseline agent runner
|
| 399 |
-
- merged-state
|
|
|
|
| 24 |
- deterministic to grade
|
| 25 |
- naturally multi-step
|
| 26 |
|
| 27 |
+
A helpdesk agent has to decide what the ticket is about, how urgent it is, who should own it, and what should happen next. The current runtime now supports a small two-mode action object: investigate first when needed, then submit the final routing answer.
|
| 28 |
|
| 29 |
## The Repo In One Sentence
|
| 30 |
|
|
|
|
| 134 |
|
| 135 |
### `HelpdeskTicketAction`
|
| 136 |
|
| 137 |
+
Represents the agent step. `action_type="submit"` carries routing fields, while `action_type="investigate"` uses a small built-in tool surface before the final submission.
|
| 138 |
|
| 139 |
### `HelpdeskTicketObservation`
|
| 140 |
|
|
|
|
| 142 |
|
| 143 |
- task metadata
|
| 144 |
- visible ticket fields
|
| 145 |
+
- optional ambiguity or follow-up context
|
| 146 |
- queue progress
|
| 147 |
- score history
|
| 148 |
|
|
|
|
| 180 |
|
| 181 |
- task metadata
|
| 182 |
- the current ticket
|
| 183 |
+
- available investigation tools
|
| 184 |
+
- remaining free investigation budget
|
| 185 |
+
- the latest tool result, when one was requested
|
| 186 |
- queue progress counters
|
| 187 |
- history
|
| 188 |
- reward and done status
|
| 189 |
|
| 190 |
+
Useful queue counters now include:
|
| 191 |
+
|
| 192 |
+
- `tickets_remaining`: not-yet-processed tickets, including the current ticket when one is active
|
| 193 |
+
- `tickets_after_current`: how many tickets remain after the current one
|
| 194 |
+
- `queue_position`: 1-based position of the current ticket in the queue
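The three counters above can be derived from the queue size and a 0-based ticket index. The following is a minimal sketch under stated assumptions, not the code in `server/environment.py`: the `QueueCounters` class, the `queue_counters` helper, and the choice to report `queue_position` as `0` once the queue is finished are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueCounters:
    tickets_remaining: int       # not-yet-processed tickets, including the active one
    tickets_after_current: int   # tickets left once the active one is submitted
    queue_position: int          # 1-based position of the active ticket


def queue_counters(queue_size: int, current_ticket_index: int) -> QueueCounters:
    """Derive the counters from a 0-based index (hypothetical helper).

    An index equal to queue_size means every ticket has been processed.
    """
    if not 0 <= current_ticket_index <= queue_size:
        raise ValueError("current_ticket_index is out of range")

    active = current_ticket_index < queue_size
    remaining = queue_size - current_ticket_index
    after_current = remaining - 1 if active else 0
    # Assumption: position is reported as 0 when no ticket is active.
    position = current_ticket_index + 1 if active else 0
    return QueueCounters(remaining, after_current, position)
```

For a queue of four tickets, the first ticket gives `tickets_remaining=4`, `tickets_after_current=3`, and `queue_position=1`.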
|
| 195 |
+
|
| 196 |
The state tracks:
|
| 197 |
|
| 198 |
- current task
|
|
|
|
| 201 |
- current ticket index
|
| 202 |
- per-ticket scores
|
| 203 |
- total reward
|
| 204 |
+
- investigation step count
|
| 205 |
|
| 206 |
## Task Design
|
| 207 |
|
| 208 |
### Task 1: Issue Type Classification
|
| 209 |
|
| 210 |
+
The agent ultimately predicts:
|
| 211 |
|
| 212 |
- `issue_type`
|
| 213 |
|
|
|
|
| 217 |
|
| 218 |
### Task 2: Issue Type And Priority
|
| 219 |
|
| 220 |
+
The agent ultimately predicts:
|
| 221 |
|
| 222 |
- `issue_type`
|
| 223 |
- `priority`
|
|
|
|
| 228 |
|
| 229 |
### Task 3: Full Ticket Routing
|
| 230 |
|
| 231 |
+
The agent ultimately predicts:
|
| 232 |
|
| 233 |
- `issue_type`
|
| 234 |
- `priority`
|
|
|
|
| 267 |
|
| 268 |
Step reward:
|
| 269 |
|
| 270 |
+
- current ticket score with a small milestone bonus for strong steps and a small penalty for very weak steps
|
| 271 |
|
| 272 |
Final reward:
|
| 273 |
|
| 274 |
- average of ticket scores
|
| 275 |
+
- minus a tiny penalty only if the agent exceeds the free investigation budget for the queue
|
| 276 |
|
| 277 |
+
This keeps the reward dense and deterministic, removes the dead overshoot logic, and adds a small queue-level economics signal without disturbing the no-tool baseline path.
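One way to picture the milestone-shaped step reward is the sketch below. It is an illustration only: the thresholds (`high`, `low`) and the bonus and penalty sizes are hypothetical placeholders, since the actual values live in `server/environment.py` and are not shown in this diff.

```python
def clamp01(value: float) -> float:
    """Clamp a value into the closed interval [0.0, 1.0]."""
    return max(0.0, min(1.0, value))


def shaped_step_reward(
    ticket_score: float,
    *,
    high: float = 0.9,      # hypothetical "strong step" threshold
    low: float = 0.2,       # hypothetical "very weak step" threshold
    bonus: float = 0.02,    # hypothetical milestone bonus
    penalty: float = 0.02,  # hypothetical weak-step penalty
) -> float:
    """Return the ticket score with a small bonus or penalty, then clamp."""
    reward = ticket_score
    if ticket_score >= high:
        reward += bonus
    elif ticket_score <= low:
        reward -= penalty
    return clamp01(reward)
```

Because the result is clamped last, a perfect score stays at `1.0` and a zero score stays at `0.0`, so the shaping only affects scores near the thresholds.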
|
| 278 |
|
| 279 |
## Dataset Mental Model
|
| 280 |
|
|
|
|
| 288 |
- harder ambiguous cases
|
| 289 |
- follow-up tickets connected through `related_ticket_id`
|
| 290 |
|
| 291 |
+
When a follow-up link exists, the observation can now surface a lightweight `related_ticket_preview`, and the tool layer can fetch richer related-ticket or requester-history context so the agent does not have to route every ticket from isolated text alone.
|
| 292 |
+
|
| 293 |
The dataset is meant to test routing judgment, not just keyword spotting.
|
| 294 |
|
| 295 |
## Grounding Note
|
|
|
|
| 312 |
|
| 313 |
1. connects to the environment
|
| 314 |
2. loads the available tasks
|
| 315 |
+
3. runs one episode for the requested task
|
| 316 |
4. picks an action for each ticket
|
| 317 |
5. sends the action back through the client
|
| 318 |
6. records rewards
|
| 319 |
+
7. prints structured logs for that run
|
| 320 |
|
| 321 |
It supports:
|
| 322 |
|
| 323 |
- heuristic mode with no external model
|
| 324 |
- LLM mode through an OpenAI-compatible API
|
| 325 |
+
- lightweight investigation-tool calls before the final submit action
|
| 326 |
+
- an explicit local `RUN_ALL_TASKS=1` override when you want the old multi-task sweep
|
| 327 |
|
| 328 |
## Files That Matter Most
|
| 329 |
|
|
|
|
| 389 |
- an internal grounding note tying the label space to public IT-support datasets
|
| 390 |
- a refreshed compliance snapshot in `required.md`
|
| 391 |
|
| 392 |
+
The optional TRL / GRPO README example remains intentionally deferred because it is optional and lower priority than freeze-phase stability.
|
| 393 |
+
|
| 394 |
+
## April 3-7 Status
|
| 395 |
+
|
| 396 |
+
The roadmap through April 7 is now closed in the current repo state.
|
| 397 |
+
|
| 398 |
+
That means the repo now has:
|
| 399 |
|
| 400 |
+
1. checked-in unit, smoke, and integration tests
|
| 401 |
+
2. Docker smoke coverage through the GitHub Actions workflow
|
| 402 |
+
3. a clean-copy install-and-run pass
|
| 403 |
+
4. structured `inference.py` logging verification
|
| 404 |
+
5. a passing local `openenv validate` result after checking in `uv.lock`
|
| 405 |
|
| 406 |
+
## Submission-Day Reminders
|
| 407 |
|
| 408 |
+
The remaining work belongs to the April 8 submission window rather than the April 3 to April 7 implementation window:
|
| 409 |
|
| 410 |
+
1. rerun the final sanity slice on the submission branch
|
| 411 |
+
2. verify the live Hugging Face Space ping and reset path after the final push if a fresh deployment is created
|
| 412 |
|
| 413 |
## One-Minute Summary
|
| 414 |
|
|
|
|
| 421 |
- the agent predicts structured routing fields
|
| 422 |
- the grader gives deterministic partial credit
|
| 423 |
- `inference.py` is the baseline agent runner
|
| 424 |
+
- merged-state validation, Docker smoke coverage, clean-copy rerun, and local validator readiness are all now in place
|
README.md
CHANGED
|
@@ -34,7 +34,7 @@ The environment models a realistic helpdesk workflow:
|
|
| 34 |
|
| 35 |
1. a new ticket enters the queue
|
| 36 |
2. the agent reads the ticket title and description
|
| 37 |
-
3. the agent
|
| 38 |
4. the grader assigns deterministic credit
|
| 39 |
5. the environment advances to the next ticket until the queue is complete
|
| 40 |
|
|
@@ -43,7 +43,7 @@ This domain is useful for OpenEnv because it is operationally realistic, easy to
|
|
| 43 |
## Why This Is A Good Hackathon Domain
|
| 44 |
|
| 45 |
- it reflects real enterprise support operations
|
| 46 |
-
- the action space is structured and judge-friendly
|
| 47 |
- correctness can be scored deterministically
|
| 48 |
- the hard task is meaningfully harder than the easy and medium tasks
|
| 49 |
- the environment is small enough to rerun quickly
|
|
@@ -55,7 +55,7 @@ The project uses a queue-based episode model.
|
|
| 55 |
- `reset()` samples a task and a queue of 3 to 5 tickets
|
| 56 |
- `step()` grades one ticket submission at a time
|
| 57 |
- `state()` exposes the internal episode snapshot
|
| 58 |
-
- final reward is based on average ticket quality
|
| 59 |
|
| 60 |
The environment classes and vocabulary are intentionally frozen to keep collaboration and judging simple.
|
| 61 |
|
|
@@ -115,6 +115,9 @@ Visible ticket fields:
|
|
| 115 |
- `title`
|
| 116 |
- `requester`
|
| 117 |
- `description`
|
|
|
|
|
|
|
|
|
|
| 118 |
|
| 119 |
Each observation also includes:
|
| 120 |
|
|
@@ -122,9 +125,14 @@ Each observation also includes:
|
|
| 122 |
- `task_name`
|
| 123 |
- `instructions`
|
| 124 |
- `allowed_fields`
|
|
|
|
|
|
|
|
|
|
| 125 |
- `queue_size`
|
| 126 |
- `tickets_remaining`
|
|
|
|
| 127 |
- `tickets_processed`
|
|
|
|
| 128 |
- `history`
|
| 129 |
- standard OpenEnv fields such as `done` and `reward`
|
| 130 |
|
|
@@ -138,11 +146,23 @@ The internal `HelpdeskTicketState` tracks:
|
|
| 138 |
- `current_ticket_index`
|
| 139 |
- `per_ticket_scores`
|
| 140 |
- `total_reward`
|
|
|
|
|
|
|
| 141 |
|
| 142 |
## Grading And Reward
|
| 143 |
|
| 144 |
Scoring is deterministic and normalized to `[0.0, 1.0]`.
|
| 145 |
|
|
| 146 |
Per-field behavior:
|
| 147 |
|
| 148 |
- `issue_type`: exact match, with a few near-miss partial-credit pairs
|
|
@@ -161,11 +181,15 @@ Task weights:
|
|
| 161 |
Final episode reward:
|
| 162 |
|
| 163 |
```text
|
| 164 |
-
average(per_ticket_scores)
|
| 165 |
```
|
| 166 |
|
| 167 |
The result is clamped to `[0.0, 1.0]`.
|
| 168 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 169 |
## Grounded Scoring
|
| 170 |
|
| 171 |
The grader is intentionally not fuzzy by default.
|
|
@@ -285,7 +309,7 @@ curl http://localhost:7860/tasks
|
|
| 285 |
|
| 286 |
## Running The Baseline Inference Script
|
| 287 |
|
| 288 |
-
The baseline script supports
|
| 289 |
|
| 290 |
### Heuristic mode
|
| 291 |
|
|
@@ -295,6 +319,12 @@ If no LLM credentials are set, it uses a keyword-based ticket router:
|
|
| 295 |
python inference.py
|
| 296 |
```
|
| 297 |
|
| 298 |
### LLM mode
|
| 299 |
|
| 300 |
Set these environment variables first:
|
|
@@ -313,6 +343,14 @@ Optional target:
|
|
| 313 |
|
| 314 |
- `ENV_URL`
|
| 315 |
- default value: `http://localhost:7860`
|
| 316 |
|
| 317 |
## Runtime Validation Snapshot
|
| 318 |
|
|
@@ -324,7 +362,7 @@ Validated locally:
|
|
| 324 |
- `/health`
|
| 325 |
- `/tasks`
|
| 326 |
- `/reset`
|
| 327 |
-
- heuristic `inference.py` run across all 3 tasks
|
| 328 |
|
| 329 |
Current local heuristic results:
|
| 330 |
|
|
@@ -335,7 +373,7 @@ Current local heuristic results:
|
|
| 335 |
| Full Ticket Routing | `0.9400` |
|
| 336 |
| Overall | `0.9400` |
|
| 337 |
|
| 338 |
-
The merged-state rerun matched these same numbers exactly, so they are the current benchmark reference for the repo.
|
| 339 |
|
| 340 |
### Windows note
|
| 341 |
|
|
@@ -358,7 +396,7 @@ docker run -p 7860:7860 helpdesk-ticket-routing
|
|
| 358 |
Then run inference against it (default `ENV_URL` points to `http://localhost:7860`):
|
| 359 |
|
| 360 |
```bash
|
| 361 |
-
python inference.py
|
| 362 |
```
|
| 363 |
|
| 364 |
If you publish the container on a different host port, set `ENV_URL` accordingly before running `inference.py`.
|
|
@@ -376,6 +414,7 @@ OpenEnv provides the core environment endpoints, and the repo adds a custom task
|
|
| 376 |
| POST | `/step` | submit an action |
|
| 377 |
| GET | `/state` | inspect internal state |
|
| 378 |
| GET | `/tasks` | list task metadata |
|
|
|
|
| 379 |
| GET | `/docs` | interactive API docs |
|
| 380 |
|
| 381 |
## Submission Readiness
|
|
@@ -397,11 +436,17 @@ An April 6 repo audit also confirmed that all required submission files are pres
|
|
| 397 |
- data and metadata: `data/dataset.json`, `openenv.yaml`, `pyproject.toml`, `requirements.txt`, `server/Dockerfile`
|
| 398 |
- docs and planning: `README.md`, `KNOWLEDGE.md`, `required.md`, `PROJECT_STATUS.md`, `ROADMAP.md`
|
| 399 |
|
| 400 |
-
|
| 401 |
|
| 402 |
-
-
|
| 403 |
-
-
|
| 404 |
-
- structured `inference.py` log-format verification on the current merged repo state
|
| 405 |
-
- a final clean-machine dry run if possible before submission freeze
|
| 406 |
|
| 407 |
-
The short TRL / GRPO README example from the roadmap
|
|
|
|
| 34 |
|
| 35 |
1. a new ticket enters the queue
|
| 36 |
2. the agent reads the ticket title and description
|
| 37 |
+
3. the agent may investigate with lightweight tools, then submit structured routing fields
|
| 38 |
4. the grader assigns deterministic credit
|
| 39 |
5. the environment advances to the next ticket until the queue is complete
|
| 40 |
|
|
|
|
| 43 |
## Why This Is A Good Hackathon Domain
|
| 44 |
|
| 45 |
- it reflects real enterprise support operations
|
| 46 |
+
- the action space is structured and judge-friendly, with a small investigate-versus-submit split
|
| 47 |
- correctness can be scored deterministically
|
| 48 |
- the hard task is meaningfully harder than the easy and medium tasks
|
| 49 |
- the environment is small enough to rerun quickly
|
|
|
|
| 55 |
- `reset()` samples a task and a queue of 3 to 5 tickets
|
| 56 |
- `step()` grades one ticket submission at a time
|
| 57 |
- `state()` exposes the internal episode snapshot
|
| 58 |
+
- final reward is based on average ticket quality across the queue
|
| 59 |
|
| 60 |
The environment classes and vocabulary are intentionally frozen to keep collaboration and judging simple.
|
| 61 |
|
|
|
|
| 115 |
- `title`
|
| 116 |
- `requester`
|
| 117 |
- `description`
|
| 118 |
+
- optional `ambiguity_note`
|
| 119 |
+
- optional `related_ticket_id`
|
| 120 |
+
- optional `related_ticket_preview`
|
| 121 |
|
| 122 |
Each observation also includes:
|
| 123 |
|
|
|
|
| 125 |
- `task_name`
|
| 126 |
- `instructions`
|
| 127 |
- `allowed_fields`
|
| 128 |
+
- `available_tools`
|
| 129 |
+
- `investigation_budget_remaining`
|
| 130 |
+
- `last_tool_result`
|
| 131 |
- `queue_size`
|
| 132 |
- `tickets_remaining`
|
| 133 |
+
- `tickets_after_current`
|
| 134 |
- `tickets_processed`
|
| 135 |
+
- `queue_position`
|
| 136 |
- `history`
|
| 137 |
- standard OpenEnv fields such as `done` and `reward`
|
| 138 |
|
|
|
|
| 146 |
- `current_ticket_index`
|
| 147 |
- `per_ticket_scores`
|
| 148 |
- `total_reward`
|
| 149 |
+
- `reward`
|
| 150 |
+
- `done`
|
| 151 |
|
| 152 |
## Grading And Reward
|
| 153 |
|
| 154 |
Scoring is deterministic and normalized to `[0.0, 1.0]`.
|
| 155 |
|
| 156 |
+
The action model now supports two paths:
|
| 157 |
+
|
| 158 |
+
- `action_type="submit"` for the final routing answer
|
| 159 |
+
- `action_type="investigate"` with a small built-in tool surface before submission
|
| 160 |
+
|
| 161 |
+
Available tools:
|
| 162 |
+
|
| 163 |
+
- `lookup_related_ticket`
|
| 164 |
+
- `lookup_requester_history`
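The two action paths can be illustrated with plain dictionaries. This is a hedged sketch rather than the real `HelpdeskTicketAction` model in `models.py`: the `tool_name` key and both helper functions are assumptions made for illustration, while `action_type`, the two tool names, and the routing field names come from the text above.

```python
from typing import Any

# Tool names as listed in the README diff.
AVAILABLE_TOOLS = ("lookup_related_ticket", "lookup_requester_history")


def investigate_action(tool_name: str) -> dict[str, Any]:
    """Build an investigate-step payload (the `tool_name` key is hypothetical)."""
    if tool_name not in AVAILABLE_TOOLS:
        raise ValueError(f"unknown tool: {tool_name!r}")
    return {"action_type": "investigate", "tool_name": tool_name}


def submit_action(**routing_fields: str) -> dict[str, Any]:
    """Build the final submit payload from routing fields such as issue_type."""
    return {"action_type": "submit", **routing_fields}


# Investigate first when needed, then submit the final routing answer.
steps = [
    investigate_action("lookup_related_ticket"),
    submit_action(issue_type="general_inquiry", priority="low"),
]
```

Keeping both paths in one payload shape means the no-tool baseline can keep sending only `submit` actions.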
|
| 165 |
+
|
| 166 |
Per-field behavior:
|
| 167 |
|
| 168 |
- `issue_type`: exact match, with a few near-miss partial-credit pairs
|
|
|
|
| 181 |
Final episode reward:
|
| 182 |
|
| 183 |
```text
|
| 184 |
+
average(per_ticket_scores)
|
| 185 |
```
|
| 186 |
|
| 187 |
The result is clamped to `[0.0, 1.0]`.
|
| 188 |
|
| 189 |
+
Step reward is lightly milestone-shaped: high per-ticket scores get a small bonus and very low scores get a small penalty before the final clamp.
|
| 190 |
+
|
| 191 |
+
Final reward also includes a tiny queue-economics penalty only when the agent exceeds the free investigation budget. One investigation per queued ticket is free; extra investigation steps reduce the final reward slightly.
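The final-reward rule can be sketched as follows. The free budget of one investigation per queued ticket and the final clamp come from the text; the `penalty_per_extra_step` value is a hypothetical placeholder, and the sketch assumes the whole queue was processed, so the number of scores equals the queue size.

```python
def final_episode_reward(
    per_ticket_scores: list[float],
    investigation_steps: int,
    penalty_per_extra_step: float = 0.01,  # hypothetical penalty size
) -> float:
    """Average the ticket scores, subtract the over-budget penalty, then clamp."""
    if not per_ticket_scores:
        return 0.0

    # One free investigation per queued ticket.
    free_budget = len(per_ticket_scores)
    extra_steps = max(0, investigation_steps - free_budget)

    average = sum(per_ticket_scores) / len(per_ticket_scores)
    reward = average - penalty_per_extra_step * extra_steps
    return max(0.0, min(1.0, reward))
```

An agent that never investigates, or stays within the free budget, receives the plain average, which is why the no-tool baseline is unaffected.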
|
| 192 |
+
|
| 193 |
## Grounded Scoring
|
| 194 |
|
| 195 |
The grader is intentionally not fuzzy by default.
|
|
|
|
| 309 |
|
| 310 |
## Running The Baseline Inference Script
|
| 311 |
|
| 312 |
+
The baseline script supports single-task evaluator mode by default, plus an explicit local batch override.
|
| 313 |
|
| 314 |
### Heuristic mode
|
| 315 |
|
|
|
|
| 319 |
python inference.py
|
| 320 |
```
|
| 321 |
|
| 322 |
+
By default that runs exactly one task and emits exactly one `[START] ... [END]` block. To target a specific task:
|
| 323 |
+
|
| 324 |
+
```bash
|
| 325 |
+
TASK_ID=3 python inference.py
|
| 326 |
+
```
|
| 327 |
+
|
| 328 |
### LLM mode
|
| 329 |
|
| 330 |
Set these environment variables first:
|
|
|
|
| 343 |
|
| 344 |
- `ENV_URL`
|
| 345 |
- default value: `http://localhost:7860`
|
| 346 |
+
- `TASK_ID`
|
| 347 |
+
- `RUN_ALL_TASKS`
|
| 348 |
+
|
| 349 |
+
To reproduce the multi-task local benchmark sweep:
|
| 350 |
+
|
| 351 |
+
```bash
|
| 352 |
+
RUN_ALL_TASKS=1 python inference.py
|
| 353 |
+
```
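The interaction between `TASK_ID` and `RUN_ALL_TASKS` might look like the sketch below. It is not the `get_tasks_to_run` function from `inference.py`, whose body is only partly visible in this diff. The `select_tasks` name, the precedence of `TASK_ID` over `RUN_ALL_TASKS`, and the default of running the lowest task ID are assumptions.

```python
from collections.abc import Iterable, Mapping


def select_tasks(available_task_ids: Iterable[int], env: Mapping[str, str]) -> list[int]:
    """Pick which task IDs to run from environment-style settings (hypothetical)."""
    ids = sorted(available_task_ids)
    if not ids:
        raise ValueError("no tasks are available")

    raw_task_id = env.get("TASK_ID")
    if raw_task_id:
        try:
            task_id = int(raw_task_id)
        except ValueError:
            raise ValueError(f"TASK_ID={raw_task_id!r} is not a valid integer") from None
        if task_id not in ids:
            raise ValueError(f"TASK_ID={task_id} is not an available task")
        return [task_id]

    # Explicit opt-in for the multi-task sweep.
    if env.get("RUN_ALL_TASKS") == "1":
        return ids

    # Assumption: single-task mode defaults to the lowest task ID.
    return [ids[0]]
```

Passing the settings as a mapping keeps the function easy to test; a caller could pass `os.environ` directly.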
|
| 354 |
|
| 355 |
## Runtime Validation Snapshot
|
| 356 |
|
|
|
|
| 362 |
- `/health`
|
| 363 |
- `/tasks`
|
| 364 |
- `/reset`
|
| 365 |
+
- heuristic `inference.py` run across all 3 tasks with `RUN_ALL_TASKS=1`
|
| 366 |
|
| 367 |
Current local heuristic results:
|
| 368 |
|
|
|
|
| 373 |
| Full Ticket Routing | `0.9400` |
|
| 374 |
| Overall | `0.9400` |
|
| 375 |
|
| 376 |
+
The merged-state rerun matched these same numbers exactly, so they are the current benchmark reference for the repo. The April 6 to April 7 validation pass then closed the remaining roadmap gates with Docker smoke coverage via GitHub Actions, a clean-copy install-and-run rerun, structured inference-log verification, and a passing local `openenv validate` check after checking in `uv.lock`.
|
| 377 |
|
| 378 |
### Windows note
|
| 379 |
|
|
|
|
| 396 |
Then run inference against it (default `ENV_URL` points to `http://localhost:7860`):
|
| 397 |
|
| 398 |
```bash
|
| 399 |
+
RUN_ALL_TASKS=1 python inference.py
|
| 400 |
```
|
| 401 |
|
| 402 |
If you publish the container on a different host port, set `ENV_URL` accordingly before running `inference.py`.
|
|
|
|
| 414 |
| POST | `/step` | submit an action |
|
| 415 |
| GET | `/state` | inspect internal state |
|
| 416 |
| GET | `/tasks` | list task metadata |
|
| 417 |
+
| GET | `/web` | lightweight HF Space UI |
|
| 418 |
| GET | `/docs` | interactive API docs |
|
| 419 |
|
| 420 |
## Submission Readiness
|
|
|
|
| 436 |
- data and metadata: `data/dataset.json`, `openenv.yaml`, `pyproject.toml`, `requirements.txt`, `server/Dockerfile`
|
| 437 |
- docs and planning: `README.md`, `KNOWLEDGE.md`, `required.md`, `PROJECT_STATUS.md`, `ROADMAP.md`
|
| 438 |
|
| 439 |
+
Roadmap status through April 7 is complete:
|
| 440 |
+
|
| 441 |
+
- unit, smoke, and integration tests are checked in and green
|
| 442 |
+
- Docker smoke coverage exists through `.github/workflows/docker-smoke-test.yml`
|
| 443 |
+
- `openenv validate` now passes on the current repo state
|
| 444 |
+
- structured `inference.py` logging is verified by tests and the merged-state rerun
|
| 445 |
+
- a clean-copy install-and-run pass has been completed
|
| 446 |
+
|
| 447 |
+
The remaining April 8 work is operational rather than implementation-heavy:
|
| 448 |
|
| 449 |
+
- run the final submission-branch sanity slice before pushing
|
| 450 |
+
- perform the live Hugging Face Space ping and reset check on the deployed submission artifact if a fresh deployment is created
|
|
|
|
|
|
|
| 451 |
|
| 452 |
+
The short TRL / GRPO README example from the roadmap remains intentionally deferred because it is optional and lower priority than freeze-phase stability.
|
ROADMAP.md
CHANGED
|
@@ -11,10 +11,39 @@
|
|
| 11 |
## How To Use This File
|
| 12 |
|
| 13 |
- `PROJECT_STATUS.md` is the canonical log of completed work.
|
| 14 |
-
- This roadmap is the
|
| 15 |
- `required.md` is now the combined official-requirements and project-compliance file.
|
| 16 |
- `KNOWLEDGE.md` defines the current repo truth and judge-facing explanation.
|
| 17 |
-
- `analysis/
|
| 18 |
|
| 19 |
## What We Are Optimizing For
|
| 20 |
|
|
@@ -47,14 +76,51 @@ The repo already has:
|
|
| 47 |
- deterministic grading with limited partial credit
|
| 48 |
- working heuristic baseline
|
| 49 |
- merged local validation on `/health`, `/tasks`, and `inference.py`
|
| 50 |
-
-
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
|
| 56 |
The remaining work should be treated as targeted strengthening, not broad feature invention.
|
| 57 |
|
| 58 |
## Submission Gates That Must Still Hold
|
| 59 |
|
| 60 |
These come directly from `required.md` and `KNOWLEDGE.md`:
|
|
@@ -114,7 +180,7 @@ Because we are using Codex to generate code, we should optimize for small, bound
|
|
| 114 |
|
| 115 |
**Window:** April 3 to April 4
|
| 116 |
|
| 117 |
-
**Goal:** eliminate the biggest competitive weakness identified in `analysis/
|
| 118 |
|
| 119 |
### Must produce
|
| 120 |
|
|
@@ -182,7 +248,7 @@ Because we are using Codex to generate code, we should optimize for small, bound
|
|
| 182 |
- assignment group and resolution action remain exact
|
| 183 |
- final episode reward stays bounded and deterministic
|
| 184 |
|
| 185 |
-
### Safe improvement candidates from `analysis/
|
| 186 |
|
| 187 |
- expand `ISSUE_TYPE_SIMILARITY` with only a few defensible pairs, if backed by grounding review
|
| 188 |
- enrich `history` with:
|
|
@@ -237,7 +303,7 @@ Because we are using Codex to generate code, we should optimize for small, bound
|
|
| 237 |
|
| 238 |
**Window:** April 6 to April 7
|
| 239 |
|
| 240 |
-
**Goal:** close the submission-readiness gaps surfaced in `analysis/
|
| 241 |
|
| 242 |
### Must produce
|
| 243 |
|
|
|
|
| 11 |
## How To Use This File
|
| 12 |
|
| 13 |
- `PROJECT_STATUS.md` is the canonical log of completed work.
|
| 14 |
+
- This roadmap is the active plan from the verified April 6, 2026 repo state to final submission.
|
| 15 |
- `required.md` is now the combined official-requirements and project-compliance file.
|
| 16 |
- `KNOWLEDGE.md` defines the current repo truth and judge-facing explanation.
|
| 17 |
+
- `analysis/competition_notes.md` is the merged internal competitive note. Use it to prioritize work, but do not mention competitor repos in public-facing docs.
|
| 18 |
+
- The dated April 3 to April 5 sections below are now historical context; the active execution block is the final 24-hour plan for April 6 to April 7, 2026.
|
| 19 |
+
|
| 20 |
+
## Status As Of April 6, 2026
|
| 21 |
+
|
| 22 |
+
The repo is now in the expected "stabilize and merge" phase rather than the earlier "build core fixes" phase.
|
| 23 |
+
|
| 24 |
+
Completed and locally verified:
|
| 25 |
+
|
| 26 |
+
- all concrete items from `gaps.md`
|
| 27 |
+
- the viable low-risk improvements from `analysis/deep_competitive_gap_report.md`
|
| 28 |
+
- single-task `inference.py` execution with `TASK_ID` support and optional `RUN_ALL_TASKS=1`
|
| 29 |
+
- `state()` exposure of `reward` and `done`
|
| 30 |
+
- richer history with predicted actions and follow-up context
|
| 31 |
+
- lightweight investigate-versus-submit action support with tool-backed context lookup
|
| 32 |
+
- small queue-economics signal without major benchmark redesign
|
| 33 |
+
- `/web` UI route
|
| 34 |
+
- local full test pass:
|
| 35 |
+
- `126 passed, 137 subtests passed`
|
| 36 |
+
- local validator pass:
|
| 37 |
+
- `[OK] meta-AIHack: Ready for multi-mode deployment`
|
| 38 |
+
|
| 39 |
+
Merge recommendation:
|
| 40 |
+
|
| 41 |
+
- mergeable as an incremental submission-ready improvement branch
|
| 42 |
+
- do not block merge on major redesign items that were explicitly out of scope:
|
| 43 |
+
- scenario-family task redesign
|
| 44 |
+
- breaking the issue-type-to-assignment shortcut
|
| 45 |
+
- large dataset expansion
|
| 46 |
+
- full queue simulator / economics redesign
|
| 47 |
|
| 48 |
## What We Are Optimizing For
|
| 49 |
|
|
|
|
| 76 |
- deterministic grading with limited partial credit
|
| 77 |
- working heuristic baseline
|
| 78 |
- merged local validation on `/health`, `/tasks`, and `inference.py`
|
| 79 |
+
- single-task evaluator-safe inference behavior
|
| 80 |
+
- reward and done fields on `state()`
|
| 81 |
+
- richer observation history and linked-ticket context
|
| 82 |
+
- lightweight investigate / submit split with small built-in tool support
|
| 83 |
+
- local full-suite verification:
|
| 84 |
+
- `126 passed, 137 subtests passed`
|
| 85 |
+
- local validator verification:
|
| 86 |
+
- `[OK] meta-AIHack: Ready for multi-mode deployment`
|
| 87 |
|
| 88 |
The remaining work should be treated as targeted strengthening, not broad feature invention.
|
| 89 |
|
| 90 |
+
## Final 24-Hour Plan
|
| 91 |
+
|
| 92 |
+
**Active window:** April 6 to April 7, 2026
|
| 93 |
+
**Internal target:** open PR, merge to the common `main`, and complete the final smoke checks by April 7, 2026
|
| 94 |
+
**Official deadline:** April 8, 2026, 11:59 PM IST
|
| 95 |
+
|
| 96 |
+
### Must finish before merge
|
| 97 |
+
|
| 98 |
+
- review the final diff and stage only the intended submission files
|
| 99 |
+
- open the merge PR from a dedicated branch
|
| 100 |
+
- merge into the shared `main` after one last reviewer pass
|
| 101 |
+
- rerun the post-merge smoke checks:
|
| 102 |
+
- `pytest`
|
| 103 |
+
- `openenv validate`
|
| 104 |
+
- `/health`
|
| 105 |
+
- `/tasks`
|
| 106 |
+
- one `reset()` / `step()` sanity path
|
| 107 |
+
|
| 108 |
+
### Do not add before merge
|
| 109 |
+
|
| 110 |
+
- no new benchmark redesign work
|
| 111 |
+
- no new dataset expansion
|
| 112 |
+
- no schema churn
|
| 113 |
+
- no reward refactors beyond blocker-level fixes
|
| 114 |
+
- no last-minute inference prompt rewrites
|
| 115 |
+
|
| 116 |
+
### Success condition for April 7, 2026
|
| 117 |
+
|
| 118 |
+
- PR is up
|
| 119 |
+
- PR is reviewed against `gaps.md` and `analysis/deep_competitive_gap_report.md`
|
| 120 |
+
- shared `main` contains the tested gap-fix branch
|
| 121 |
+
- deployment sanity checks are green
|
| 122 |
+
- repo is frozen except for typo-level fixes
|
| 123 |
+
|
| 124 |
## Submission Gates That Must Still Hold
|
| 125 |
|
| 126 |
These come directly from `required.md` and `KNOWLEDGE.md`:
|
|
|
|
| 180 |
|
| 181 |
**Window:** April 3 to April 4
|
| 182 |
|
| 183 |
+
**Goal:** eliminate the biggest competitive weakness identified in `analysis/competition_notes.md`: lack of checked-in tests.
|
| 184 |
|
| 185 |
### Must produce
|
| 186 |
|
|
|
|
| 248 |
- assignment group and resolution action remain exact
|
| 249 |
- final episode reward stays bounded and deterministic
|
| 250 |
|
| 251 |
+
### Safe improvement candidates from `analysis/competition_notes.md`
|
| 252 |
|
| 253 |
- expand `ISSUE_TYPE_SIMILARITY` with only a few defensible pairs, if backed by grounding review
|
| 254 |
- enrich `history` with:
|
|
|
|
| 303 |
|
| 304 |
**Window:** April 6 to April 7
|
| 305 |
|
| 306 |
+
**Goal:** close the submission-readiness gaps surfaced in `analysis/competition_notes.md`.
|
| 307 |
|
| 308 |
### Must produce
|
| 309 |
|
inference.py
CHANGED
|
@@ -20,6 +20,15 @@ HF_TOKEN
|
|
| 20 |
HuggingFace authentication token for the LLM provider.
|
| 21 |
No default is set.
|
| 22 |
|
| 23 |
LOCAL_IMAGE_NAME
|
| 24 |
Optional compatibility variable from the sample inference pattern.
|
| 25 |
This script does not use ``from_docker_image()``, so the value is unused here.
|
|
@@ -65,6 +74,11 @@ ENV_URL = os.getenv("ENV_URL", "http://localhost:7860")
|
|
| 65 |
|
| 66 |
SEED = 42
|
| 67 |
TASK_ID_ENV = os.getenv("TASK_ID")
|
| 68 |
|
| 69 |
# ---------------------------------------------------------------------------
|
| 70 |
# LLM helper
|
|
@@ -99,13 +113,36 @@ Return ONLY valid JSON with the requested fields. No markdown, no explanation.""
|
|
| 99 |
|
| 100 |
def call_llm(ticket: dict, allowed_fields: list[str], instructions: str) -> dict:
|
| 101 |
assert llm_client is not None, "LLM client not configured"
|
| 102 |
|
| 103 |
user_msg = (
|
| 104 |
f"Instructions: {instructions}\n\n"
|
| 105 |
f"Allowed fields: {', '.join(allowed_fields)}\n\n"
|
| 106 |
f"Title: {ticket['title']}\n"
|
| 107 |
f"Requester: {ticket['requester']}\n"
|
| 108 |
-
f"Description: {ticket['description']}
|
|
|
|
| 109 |
f"Respond with JSON containing ONLY these fields: {', '.join(allowed_fields)}"
|
| 110 |
)
|
| 111 |
|
|
@@ -135,17 +172,26 @@ def emit_log(tag: str, **payload: Any) -> None:
|
|
| 135 |
|
| 136 |
|
| 137 |
def get_tasks_to_run(available_tasks: dict) -> list[int]:
|
|
|
|
| 138 |
if TASK_ID_ENV:
|
| 139 |
try:
|
| 140 |
task_id = int(TASK_ID_ENV)
|
| 141 |
except ValueError:
|
| 142 |
print(f"[ERROR] TASK_ID={TASK_ID_ENV!r} is not a valid integer", flush=True)
|
| 143 |
raise SystemExit(1)
|
| 144 |
-
if task_id not in
|
| 145 |
-
print(
|
| 146 |
-
|
| 147 |
return [task_id]
|
| 148 |
-
|
| 149 |
|
| 150 |
|
| 151 |
# ---------------------------------------------------------------------------
|
|
@@ -278,7 +324,18 @@ def heuristic_resolution_action(text: str, issue_type: str) -> str:
|
|
| 278 |
|
| 279 |
|
| 280 |
def heuristic_action(ticket: dict, allowed_fields: list[str]) -> dict:
|
| 281 |
-
|
| 282 |
|
| 283 |
issue_type = "general_inquiry"
|
| 284 |
for kw, mapped_issue_type in KEYWORD_ISSUE_TYPES.items():
|
|
@@ -329,6 +386,31 @@ def build_action(
|
|
| 329 |
)
|
| 330 |
|
| 331 |
|
| 332 |
# ---------------------------------------------------------------------------
|
| 333 |
# Main loop using the HTTP-based sync EnvClient for multi-step episodes
|
| 334 |
# ---------------------------------------------------------------------------
|
|
@@ -347,7 +429,9 @@ def run() -> None:
|
|
| 347 |
all_results: dict[int, dict[str, float | int]] = {}
|
| 348 |
|
| 349 |
tasks_to_run = get_tasks_to_run(available_tasks)
|
| 350 |
-
|
| 351 |
|
| 352 |
for task_id in tasks_to_run:
|
| 353 |
if task_id not in available_tasks:
|
|
@@ -377,8 +461,40 @@ def run() -> None:
|
|
| 377 |
if ticket is None:
|
| 378 |
break
|
| 379 |
|
| 380 |
action, action_source, fallback_reason = build_action(
|
| 381 |
-
|
| 382 |
obs.allowed_fields,
|
| 383 |
obs.instructions,
|
| 384 |
)
|
|
|
|
| 20 |
HuggingFace authentication token for the LLM provider.
|
| 21 |
No default is set.
|
| 22 |
|
| 23 |
+
TASK_ID
|
| 24 |
+
Optional OpenEnv task ID to run. When unset, the script defaults to the
|
| 25 |
+
first available task so it still emits exactly one ``[START]`` ... ``[END]``
|
| 26 |
+
block for evaluator-style runs.
|
| 27 |
+
|
| 28 |
+
RUN_ALL_TASKS
|
| 29 |
+
Optional local-development override. Set to ``1`` to run every available
|
| 30 |
+
task in sequence and print the aggregate closing ``[END]`` summary.
|
| 31 |
+
|
| 32 |
LOCAL_IMAGE_NAME
|
| 33 |
Optional compatibility variable from the sample inference pattern.
|
| 34 |
This script does not use ``from_docker_image()``, so the value is unused here.
|
|
|
|
| 74 |
|
| 75 |
SEED = 42
|
| 76 |
TASK_ID_ENV = os.getenv("TASK_ID")
|
| 77 |
+
RUN_ALL_TASKS_ENV = os.getenv("RUN_ALL_TASKS", "").strip().lower() in {
|
| 78 |
+
"1",
|
| 79 |
+
"true",
|
| 80 |
+
"yes",
|
| 81 |
+
}
|
| 82 |
|
| 83 |
# ---------------------------------------------------------------------------
|
| 84 |
# LLM helper
|
|
|
|
| 113 |
|
| 114 |
def call_llm(ticket: dict, allowed_fields: list[str], instructions: str) -> dict:
|
| 115 |
assert llm_client is not None, "LLM client not configured"
|
| 116 |
+
ambiguity_note = ticket.get("ambiguity_note")
|
| 117 |
+
related_preview = ticket.get("related_ticket_preview") or {}
|
| 118 |
+
last_tool_result = ticket.get("last_tool_result")
|
| 119 |
+
extra_context_lines: list[str] = []
|
| 120 |
+
if ambiguity_note:
|
| 121 |
+
extra_context_lines.append(f"Ambiguity note: {ambiguity_note}")
|
| 122 |
+
if related_preview:
|
| 123 |
+
extra_context_lines.extend(
|
| 124 |
+
[
|
| 125 |
+
"Related ticket preview:",
|
| 126 |
+
f"- Title: {related_preview.get('title', '')}",
|
| 127 |
+
f"- Requester: {related_preview.get('requester', '')}",
|
| 128 |
+
f"- Description: {related_preview.get('description', '')}",
|
| 129 |
+
]
|
| 130 |
+
)
|
| 131 |
+
if last_tool_result is not None:
|
| 132 |
+
extra_context_lines.append(
|
| 133 |
+
"Investigation result: " + json.dumps(last_tool_result, sort_keys=True)
|
| 134 |
+
)
|
| 135 |
+
extra_context_block = ""
|
| 136 |
+
if extra_context_lines:
|
| 137 |
+
extra_context_block = "\n" + "\n".join(extra_context_lines)
|
| 138 |
|
| 139 |
user_msg = (
|
| 140 |
f"Instructions: {instructions}\n\n"
|
| 141 |
f"Allowed fields: {', '.join(allowed_fields)}\n\n"
|
| 142 |
f"Title: {ticket['title']}\n"
|
| 143 |
f"Requester: {ticket['requester']}\n"
|
| 144 |
+
f"Description: {ticket['description']}"
|
| 145 |
+
f"{extra_context_block}\n\n"
|
| 146 |
f"Respond with JSON containing ONLY these fields: {', '.join(allowed_fields)}"
|
| 147 |
)
|
| 148 |
|
|
|
|
| 172 |
|
| 173 |
|
| 174 |
def get_tasks_to_run(available_tasks: dict) -> list[int]:
|
| 175 |
+
available_task_ids = sorted(int(task_id) for task_id in available_tasks)
|
| 176 |
if TASK_ID_ENV:
|
| 177 |
try:
|
| 178 |
task_id = int(TASK_ID_ENV)
|
| 179 |
except ValueError:
|
| 180 |
print(f"[ERROR] TASK_ID={TASK_ID_ENV!r} is not a valid integer", flush=True)
|
| 181 |
raise SystemExit(1)
|
| 182 |
+
if task_id not in available_task_ids:
|
| 183 |
+
print(
|
| 184 |
+
f"[ERROR] TASK_ID={task_id} not in available tasks {available_task_ids}",
|
| 185 |
+
flush=True,
|
| 186 |
+
)
|
| 187 |
+
raise SystemExit(1)
|
| 188 |
return [task_id]
|
| 189 |
+
if RUN_ALL_TASKS_ENV:
|
| 190 |
+
return available_task_ids
|
| 191 |
+
if not available_task_ids:
|
| 192 |
+
return []
|
| 193 |
+
# Default to a single task so evaluation emits exactly one START/END block.
|
| 194 |
+
return [available_task_ids[0]]
|
| 195 |
|
| 196 |
|
| 197 |
# ---------------------------------------------------------------------------
|
|
|
|
| 324 |
|
| 325 |
|
| 326 |
def heuristic_action(ticket: dict, allowed_fields: list[str]) -> dict:
|
| 327 |
+
related_preview = ticket.get("related_ticket_preview") or {}
|
| 328 |
+
last_tool_result = ticket.get("last_tool_result") or {}
|
| 329 |
+
text = " ".join(
|
| 330 |
+
[
|
| 331 |
+
ticket.get("title", ""),
|
| 332 |
+
ticket.get("description", ""),
|
| 333 |
+
ticket.get("ambiguity_note", ""),
|
| 334 |
+
related_preview.get("title", ""),
|
| 335 |
+
related_preview.get("description", ""),
|
| 336 |
+
json.dumps(last_tool_result, sort_keys=True),
|
| 337 |
+
]
|
| 338 |
+
).lower()
|
| 339 |
|
| 340 |
issue_type = "general_inquiry"
|
| 341 |
for kw, mapped_issue_type in KEYWORD_ISSUE_TYPES.items():
|
|
|
|
| 386 |
)
|
| 387 |
|
| 388 |
|
| 389 |
+
def should_investigate(ticket: dict, history: list[dict[str, Any]]) -> tuple[bool, str | None]:
|
| 390 |
+
if not ticket:
|
| 391 |
+
return False, None
|
| 392 |
+
current_ticket_id = ticket.get("ticket_id")
|
| 393 |
+
already_investigated = any(
|
| 394 |
+
entry.get("ticket_id") == current_ticket_id
|
| 395 |
+
and entry.get("predicted", {}).get("action_type") == "investigate"
|
| 396 |
+
for entry in history
|
| 397 |
+
)
|
| 398 |
+
if already_investigated:
|
| 399 |
+
return False, None
|
| 400 |
+
if ticket.get("related_ticket_id"):
|
| 401 |
+
return True, "lookup_related_ticket"
|
| 402 |
+
if ticket.get("ambiguity_note"):
|
| 403 |
+
return True, "lookup_requester_history"
|
| 404 |
+
return False, None
|
| 405 |
+
|
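Review note: the `should_investigate` helper added above picks at most one lookup per ticket and prefers the related-ticket lookup over requester history. The sketch below repeats the helper so it runs on its own and exercises it on a few hypothetical tickets; the ticket IDs and text are invented for illustration and are not taken from the dataset.

```python
from typing import Any, Optional


def should_investigate(
    ticket: dict, history: list[dict[str, Any]]
) -> tuple[bool, Optional[str]]:
    """Copy of the helper from the diff, repeated so this example is standalone."""
    if not ticket:
        return False, None
    current_ticket_id = ticket.get("ticket_id")
    # A ticket is investigated at most once: look for a prior investigate step.
    already_investigated = any(
        entry.get("ticket_id") == current_ticket_id
        and entry.get("predicted", {}).get("action_type") == "investigate"
        for entry in history
    )
    if already_investigated:
        return False, None
    # A linked ticket takes precedence over an ambiguity note.
    if ticket.get("related_ticket_id"):
        return True, "lookup_related_ticket"
    if ticket.get("ambiguity_note"):
        return True, "lookup_requester_history"
    return False, None


# Hypothetical tickets (made-up IDs and notes).
linked = {"ticket_id": "T-1", "related_ticket_id": "T-9"}
ambiguous = {"ticket_id": "T-2", "ambiguity_note": "Could be VPN or Wi-Fi."}
plain = {"ticket_id": "T-3"}
history = [{"ticket_id": "T-1", "predicted": {"action_type": "investigate"}}]

decisions = {
    "linked_first_visit": should_investigate(linked, []),
    "linked_after_lookup": should_investigate(linked, history),
    "ambiguous": should_investigate(ambiguous, []),
    "plain": should_investigate(plain, []),
}
```

Because the check reads the observation history rather than local state, the "already investigated" decision survives a client restart as long as the server keeps the episode.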
| 406 |
+
|
| 407 |
+
def merge_ticket_context(ticket: dict, observation: Any) -> dict:
|
| 408 |
+
merged_ticket = dict(ticket)
|
| 409 |
+
if getattr(observation, "last_tool_result", None) is not None:
|
| 410 |
+
merged_ticket["last_tool_result"] = observation.last_tool_result
|
| 411 |
+
return merged_ticket
|
| 412 |
+
|
| 413 |
+
|
| 414 |
# ---------------------------------------------------------------------------
|
| 415 |
# Main loop using the HTTP-based sync EnvClient for multi-step episodes
|
| 416 |
# ---------------------------------------------------------------------------
|
|
|
|
| 429 |
all_results: dict[int, dict[str, float | int]] = {}
|
| 430 |
|
| 431 |
tasks_to_run = get_tasks_to_run(available_tasks)
|
| 432 |
+
if not tasks_to_run:
|
| 433 |
+
return
|
| 434 |
+
single_task_mode = len(tasks_to_run) == 1
|
| 435 |
|
| 436 |
for task_id in tasks_to_run:
|
| 437 |
if task_id not in available_tasks:
|
|
|
|
| 461 |
if ticket is None:
|
| 462 |
break
|
| 463 |
|
| 464 |
+
investigate, tool_name = should_investigate(ticket, obs.history)
|
| 465 |
+
if (
|
| 466 |
+
investigate
|
| 467 |
+
and tool_name is not None
|
| 468 |
+
and getattr(obs, "investigation_budget_remaining", 0) > 0
|
| 469 |
+
):
|
| 470 |
+
tool_action = HelpdeskTicketAction(
|
| 471 |
+
action_type="investigate",
|
| 472 |
+
tool_name=tool_name,
|
| 473 |
+
tool_target_ticket_id=ticket.get("related_ticket_id"),
|
| 474 |
+
)
|
| 475 |
+
result = sync_client.step(tool_action)
|
| 476 |
+
obs = result.observation
|
| 477 |
+
step_num += 1
|
| 478 |
+
emit_log(
|
| 479 |
+
"STEP",
|
| 480 |
+
action=tool_action.model_dump(exclude_none=True),
|
| 481 |
+
action_source="investigation_tool",
|
| 482 |
+
done=bool(result.done),
|
| 483 |
+
fallback_reason=None,
|
| 484 |
+
reward=float(result.reward or 0.0),
|
| 485 |
+
step=step_num,
|
| 486 |
+
task_id=task_id,
|
| 487 |
+
ticket_id=ticket["ticket_id"],
|
| 488 |
+
)
|
| 489 |
+
if result.done:
|
| 490 |
+
break
|
| 491 |
+
ticket = obs.current_ticket
|
| 492 |
+
if ticket is None:
|
| 493 |
+
break
|
| 494 |
+
|
| 495 |
+
ticket_with_context = merge_ticket_context(ticket, obs)
|
| 496 |
action, action_source, fallback_reason = build_action(
|
| 497 |
+
ticket_with_context,
|
| 498 |
obs.allowed_fields,
|
| 499 |
obs.instructions,
|
| 500 |
)
|
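Review note: among the `inference.py` changes above, task selection now depends on `TASK_ID` and `RUN_ALL_TASKS`. The following is a minimal sketch of the same precedence as a pure function, which may be easier to unit-test than the module-level version. The name `select_tasks` and the use of `ValueError` in place of `SystemExit` are assumptions made for this example and are not part of the commit.

```python
from typing import Optional


def select_tasks(
    available_tasks: dict, task_id_env: Optional[str], run_all: bool
) -> list[int]:
    """Hypothetical pure variant of get_tasks_to_run (raises instead of exiting)."""
    available_ids = sorted(int(task_id) for task_id in available_tasks)
    if task_id_env:
        # int() raises ValueError for non-integer input, mirroring the [ERROR] path.
        task_id = int(task_id_env)
        if task_id not in available_ids:
            raise ValueError(
                f"TASK_ID={task_id} not in available tasks {available_ids}"
            )
        return [task_id]
    if run_all:
        return available_ids
    # Default: the first task only, so a run emits a single START/END block.
    return available_ids[:1]


# Hypothetical task table; the real one comes from the environment server.
tasks = {"1": {}, "2": {}, "3": {}}
explicit = select_tasks(tasks, "2", run_all=False)
everything = select_tasks(tasks, None, run_all=True)
default = select_tasks(tasks, None, run_all=False)
empty = select_tasks({}, None, run_all=False)
```

Note the precedence: an explicit `TASK_ID` wins even when `RUN_ALL_TASKS` is set, which matches the order of the checks in the diff.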
models.py
CHANGED
|
@@ -16,6 +16,8 @@ ISSUE_TYPE_SET = set(ISSUE_TYPES)
|
|
| 16 |
PRIORITY_SET = set(PRIORITIES)
|
| 17 |
ASSIGNMENT_GROUP_SET = set(ASSIGNMENT_GROUPS)
|
| 18 |
RESOLUTION_ACTION_SET = set(RESOLUTION_ACTIONS)
|
| 19 |
|
| 20 |
|
| 21 |
def _validate_choice(value: str, allowed: set[str], field_name: str) -> str:
|
|
@@ -67,11 +69,24 @@ class HelpdeskTicketRecord(BaseModel):
|
|
| 67 |
|
| 68 |
|
| 69 |
class HelpdeskTicketAction(Action):
|
| 70 |
issue_type: Optional[str] = None
|
| 71 |
priority: Optional[str] = None
|
| 72 |
assignment_group: Optional[str] = None
|
| 73 |
resolution_action: Optional[str] = None
|
| 74 |
|
| 75 |
@field_validator("issue_type")
|
| 76 |
@classmethod
|
| 77 |
def validate_issue_type(cls, value: Optional[str]) -> Optional[str]:
|
|
@@ -98,10 +113,15 @@ class HelpdeskTicketObservation(Observation):
|
|
| 98 |
task_name: str = ""
|
| 99 |
instructions: str = ""
|
| 100 |
allowed_fields: list[str] = Field(default_factory=list)
|
| 101 |
-
|
| 102 |
queue_size: int = 0
|
| 103 |
tickets_remaining: int = 0
|
|
|
|
| 104 |
tickets_processed: int = 0
|
|
|
|
| 105 |
history: list[dict[str, Any]] = Field(default_factory=list)
|
| 106 |
|
| 107 |
|
|
@@ -116,4 +136,7 @@ class HelpdeskTicketState(State):
|
|
| 116 |
# `reward` is the field the evaluator checks on GET /state (mentor spec)
|
| 117 |
reward: Optional[float] = None
|
| 118 |
done: bool = False
|
| 119 |
history_entries: list[dict] = Field(default_factory=list)
|
|
|
|
| 16 |
PRIORITY_SET = set(PRIORITIES)
|
| 17 |
ASSIGNMENT_GROUP_SET = set(ASSIGNMENT_GROUPS)
|
| 18 |
RESOLUTION_ACTION_SET = set(RESOLUTION_ACTIONS)
|
| 19 |
+
ACTION_TYPE_SET = {"submit", "investigate"}
|
| 20 |
+
TOOL_NAME_SET = {"lookup_related_ticket", "lookup_requester_history"}
|
| 21 |
|
| 22 |
|
| 23 |
def _validate_choice(value: str, allowed: set[str], field_name: str) -> str:
|
|
|
|
| 69 |
|
| 70 |
|
| 71 |
class HelpdeskTicketAction(Action):
|
| 72 |
+
action_type: str = "submit"
|
| 73 |
+
tool_name: Optional[str] = None
|
| 74 |
+
tool_target_ticket_id: Optional[str] = None
|
| 75 |
issue_type: Optional[str] = None
|
| 76 |
priority: Optional[str] = None
|
| 77 |
assignment_group: Optional[str] = None
|
| 78 |
resolution_action: Optional[str] = None
|
| 79 |
|
| 80 |
+
@field_validator("action_type")
|
| 81 |
+
@classmethod
|
| 82 |
+
def validate_action_type(cls, value: str) -> str:
|
| 83 |
+
return _validate_choice(value, ACTION_TYPE_SET, "action_type")
|
| 84 |
+
|
| 85 |
+
@field_validator("tool_name")
|
| 86 |
+
@classmethod
|
| 87 |
+
def validate_tool_name(cls, value: Optional[str]) -> Optional[str]:
|
| 88 |
+
return _validate_optional_choice(value, TOOL_NAME_SET, "tool_name")
|
| 89 |
+
|
| 90 |
@field_validator("issue_type")
|
| 91 |
@classmethod
|
| 92 |
def validate_issue_type(cls, value: Optional[str]) -> Optional[str]:
|
|
|
|
| 113 |
task_name: str = ""
|
| 114 |
instructions: str = ""
|
| 115 |
allowed_fields: list[str] = Field(default_factory=list)
|
| 116 |
+
available_tools: list[str] = Field(default_factory=list)
|
| 117 |
+
investigation_budget_remaining: int = 0
|
| 118 |
+
last_tool_result: Optional[dict[str, Any]] = None
|
| 119 |
+
current_ticket: Optional[dict[str, Any]] = None
|
| 120 |
queue_size: int = 0
|
| 121 |
tickets_remaining: int = 0
|
| 122 |
+
tickets_after_current: int = 0
|
| 123 |
tickets_processed: int = 0
|
| 124 |
+
queue_position: int = 0
|
| 125 |
history: list[dict[str, Any]] = Field(default_factory=list)
|
| 126 |
|
| 127 |
|
|
|
|
| 136 |
# `reward` is the field the evaluator checks on GET /state (mentor spec)
|
| 137 |
reward: Optional[float] = None
|
| 138 |
done: bool = False
|
| 139 |
+
investigation_steps: int = 0
|
| 140 |
+
investigation_budget_remaining: int = 0
|
| 141 |
+
last_tool_result: Optional[dict[str, Any]] = None
|
| 142 |
history_entries: list[dict] = Field(default_factory=list)
|
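Review note: the `models.py` changes above restrict `action_type` and `tool_name` to fixed sets. The sketch below illustrates that constraint with a standard-library dataclass so it runs without Pydantic. `ActionSketch` and `is_valid` are hypothetical names, and the real `_validate_choice` helpers may normalize or report errors differently; only the two allowed sets are taken from the diff.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Allowed values copied from the diff.
ACTION_TYPE_SET = {"submit", "investigate"}
TOOL_NAME_SET = {"lookup_related_ticket", "lookup_requester_history"}


@dataclass
class ActionSketch:
    """Hypothetical stand-in for HelpdeskTicketAction (not the Pydantic model)."""

    action_type: str = "submit"
    tool_name: Optional[str] = None
    tool_target_ticket_id: Optional[str] = None

    def __post_init__(self) -> None:
        if self.action_type not in ACTION_TYPE_SET:
            raise ValueError(f"invalid action_type: {self.action_type!r}")
        # tool_name is optional, so only validate it when present.
        if self.tool_name is not None and self.tool_name not in TOOL_NAME_SET:
            raise ValueError(f"invalid tool_name: {self.tool_name!r}")


def is_valid(**kwargs: Any) -> bool:
    """Return True when the keyword arguments build a valid ActionSketch."""
    try:
        ActionSketch(**kwargs)
    except ValueError:
        return False
    return True


default_ok = is_valid()
investigate_ok = is_valid(action_type="investigate", tool_name="lookup_related_ticket")
bad_action = is_valid(action_type="escalate")
bad_tool = is_valid(action_type="investigate", tool_name="grep_logs")
```

Defaulting `action_type` to `"submit"` keeps older clients working, since an action that omits the new fields is still treated as a submission.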
openenv.yaml
CHANGED
|
@@ -53,6 +53,7 @@ inference:
|
|
| 53 |
- MODEL_NAME
|
| 54 |
- HF_TOKEN
|
| 55 |
- ENV_URL
|
|
|
|
| 56 |
|
| 57 |
requirements:
|
| 58 |
python: ">=3.11"
|
|
|
|
| 53 |
- MODEL_NAME
|
| 54 |
- HF_TOKEN
|
| 55 |
- ENV_URL
|
| 56 |
+
- TASK_ID
|
| 57 |
|
| 58 |
requirements:
|
| 59 |
python: ">=3.11"
|
server/environment.py
CHANGED
|
@@ -18,6 +18,10 @@ from server.tasks import get_task_definition, load_dataset
|
|
| 18 |
|
| 19 |
|
| 20 |
QUEUE_SIZE_RANGE = (3, 5)
|
| 21 |
|
| 22 |
|
| 23 |
def _coerce_optional_int(value: Any, field_name: str) -> Optional[int]:
|
|
@@ -41,6 +45,7 @@ class HelpdeskTicketRoutingEnvironment(
|
|
| 41 |
def __init__(self) -> None:
|
| 42 |
super().__init__()
|
| 43 |
self._dataset = load_dataset()
|
|
|
|
| 44 |
self._rng = random.Random()
|
| 45 |
self._queue: list[HelpdeskTicketRecord] = []
|
| 46 |
self._state = HelpdeskTicketState()
|
|
@@ -57,13 +62,19 @@ class HelpdeskTicketRoutingEnvironment(
|
|
| 57 |
) -> HelpdeskTicketObservation:
|
| 58 |
normalized_seed = _coerce_optional_int(seed, "seed")
|
| 59 |
task_id_value = _coerce_optional_int(kwargs.get("task_id", 1), "task_id")
|
|
|
|
| 60 |
task_id = 1 if task_id_value is None else task_id_value
|
| 61 |
task = get_task_definition(task_id)
|
|
|
|
|
|
|
| 62 |
|
| 63 |
if normalized_seed is not None:
|
| 64 |
self._rng.seed(normalized_seed)
|
| 65 |
|
| 66 |
-
|
| 67 |
self._queue = self._rng.sample(self._dataset, min(queue_size, len(self._dataset)))
|
| 68 |
|
| 69 |
self._state = HelpdeskTicketState(
|
|
@@ -75,6 +86,7 @@ class HelpdeskTicketRoutingEnvironment(
|
|
| 75 |
current_ticket_index=0,
|
| 76 |
per_ticket_scores=[],
|
| 77 |
total_reward=0.0,
|
|
|
|
| 78 |
)
|
| 79 |
|
| 80 |
return self._build_observation(task)
|
|
@@ -96,34 +108,46 @@ class HelpdeskTicketRoutingEnvironment(
|
|
| 96 |
task_id = self._state.current_task_id
|
| 97 |
task = get_task_definition(task_id)
|
| 98 |
| 99 |
submitted_fields = {
|
| 100 |
-
f
|
| 101 |
}
|
| 102 |
allowed = set(task["allowed_fields"])
|
| 103 |
extra_fields = submitted_fields - allowed
|
| 104 |
if extra_fields:
|
| 105 |
# Penalty: record score 0.0, advance index, return penalty observation
|
| 106 |
self._state.per_ticket_scores.append(0.0)
|
| 107 |
-
self._state.history_entries.append(
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
|
|
|
|
|
|
| 115 |
self._state.step_count += 1
|
| 116 |
self._state.current_ticket_index += 1
|
| 117 |
is_done = self._state.current_ticket_index >= len(self._queue)
|
| 118 |
-
self._state.last_step_reward = 0.0
|
| 119 |
-
self._state.reward = 0.0
|
| 120 |
self._state.done = is_done
|
| 121 |
if is_done:
|
| 122 |
traj_reward = compute_trajectory_reward(
|
| 123 |
self._state.per_ticket_scores, len(self._queue), self._state.step_count
|
| 124 |
)
|
| 125 |
-
|
| 126 |
-
|
| 127 |
|
| 128 |
score, breakdown = grade_action(action, current_ticket, task_id)
|
| 129 |
step_reward = compute_step_reward(score)
|
|
@@ -139,26 +163,27 @@ class HelpdeskTicketRoutingEnvironment(
|
|
| 139 |
len(self._queue),
|
| 140 |
self._state.step_count,
|
| 141 |
)
|
| 142 |
-
|
| 143 |
-
|
| 144 |
else:
|
| 145 |
self._state.per_ticket_scores.append(score)
|
| 146 |
self._state.step_count += 1
|
| 147 |
self._state.current_ticket_index += 1
|
| 148 |
final_reward = step_reward
|
| 149 |
|
| 150 |
-
history_entry =
|
| 151 |
-
|
| 152 |
-
|
| 153 |
-
|
| 154 |
-
|
| 155 |
-
|
| 156 |
-
|
| 157 |
self._state.history_entries.append(history_entry)
|
| 158 |
|
| 159 |
self._state.last_step_reward = final_reward
|
| 160 |
self._state.reward = final_reward
|
| 161 |
self._state.done = is_done
|
|
|
|
| 162 |
|
| 163 |
return self._build_observation(task, done=is_done, reward=final_reward)
|
| 164 |
|
|
@@ -170,6 +195,188 @@ class HelpdeskTicketRoutingEnvironment(
|
|
| 170 |
# Helpers
|
| 171 |
# ------------------------------------------------------------------
|
| 172 |
|
| 173 |
def _build_observation(
|
| 174 |
self,
|
| 175 |
task: dict,
|
|
@@ -181,33 +388,43 @@ class HelpdeskTicketRoutingEnvironment(
|
|
| 181 |
|
| 182 |
if idx < queue_size:
|
| 183 |
ticket = self._queue[idx]
|
| 184 |
-
ticket_view
|
| 185 |
-
|
| 186 |
-
"title": ticket.title,
|
| 187 |
-
"requester": ticket.requester,
|
| 188 |
-
"description": ticket.description,
|
| 189 |
-
}
|
| 190 |
-
if ticket.ambiguity_note is not None:
|
| 191 |
-
ticket_view["ambiguity_note"] = ticket.ambiguity_note
|
| 192 |
-
if ticket.related_ticket_id is not None:
|
| 193 |
-
ticket_view["related_ticket_id"] = ticket.related_ticket_id
|
| 194 |
else:
|
| 195 |
ticket_view = None
|
|
|
|
| 196 |
|
| 197 |
history = list(self._state.history_entries)
|
| 198 |
|
| 199 |
return HelpdeskTicketObservation(
|
| 200 |
done=done,
|
| 201 |
reward=reward,
|
| 202 |
-
metadata={
|
| 203 |
task_id=task["id"],
|
| 204 |
task_name=task["name"],
|
| 205 |
instructions=task["instructions"],
|
| 206 |
allowed_fields=list(task["allowed_fields"]),
|
| 207 |
current_ticket=ticket_view,
|
| 208 |
queue_size=queue_size,
|
| 209 |
-
|
| 210 |
-
|
| 211 |
tickets_processed=idx,
|
|
|
|
| 212 |
history=history,
|
| 213 |
)
|
|
|
|
| 18 |
|
| 19 |
|
| 20 |
QUEUE_SIZE_RANGE = (3, 5)
|
| 21 |
+
AVAILABLE_TOOLS = ("lookup_related_ticket", "lookup_requester_history")
|
| 22 |
+
FREE_INVESTIGATIONS_PER_TICKET = 1
|
| 23 |
+
EXTRA_INVESTIGATION_COST = 0.02
|
| 24 |
+
MAX_EXTRA_INVESTIGATION_PENALTY = 0.15
|
| 25 |
|
| 26 |
|
| 27 |
def _coerce_optional_int(value: Any, field_name: str) -> Optional[int]:
|
|
|
|
| 45 |
def __init__(self) -> None:
|
| 46 |
super().__init__()
|
| 47 |
self._dataset = load_dataset()
|
| 48 |
+
self._tickets_by_id = {ticket.ticket_id: ticket for ticket in self._dataset}
|
| 49 |
self._rng = random.Random()
|
| 50 |
self._queue: list[HelpdeskTicketRecord] = []
|
| 51 |
self._state = HelpdeskTicketState()
|
|
|
|
| 62 |
) -> HelpdeskTicketObservation:
|
| 63 |
normalized_seed = _coerce_optional_int(seed, "seed")
|
| 64 |
task_id_value = _coerce_optional_int(kwargs.get("task_id", 1), "task_id")
|
| 65 |
+
queue_size_value = _coerce_optional_int(kwargs.get("queue_size"), "queue_size")
|
| 66 |
task_id = 1 if task_id_value is None else task_id_value
|
| 67 |
task = get_task_definition(task_id)
|
| 68 |
+
if queue_size_value is not None and queue_size_value < 1:
|
| 69 |
+
raise ValueError("queue_size must be >= 1")
|
| 70 |
|
| 71 |
if normalized_seed is not None:
|
| 72 |
self._rng.seed(normalized_seed)
|
| 73 |
|
| 74 |
+
if queue_size_value is None:
|
| 75 |
+
queue_size = self._rng.randint(*QUEUE_SIZE_RANGE)
|
| 76 |
+
else:
|
| 77 |
+
queue_size = min(queue_size_value, len(self._dataset))
|
| 78 |
self._queue = self._rng.sample(self._dataset, min(queue_size, len(self._dataset)))
|
| 79 |
|
| 80 |
self._state = HelpdeskTicketState(
|
|
|
|
| 86 |
current_ticket_index=0,
|
| 87 |
per_ticket_scores=[],
|
| 88 |
total_reward=0.0,
|
| 89 |
+
investigation_budget_remaining=queue_size * FREE_INVESTIGATIONS_PER_TICKET,
|
| 90 |
)
|
| 91 |
|
| 92 |
return self._build_observation(task)
|
|
|
|
| 108 |
task_id = self._state.current_task_id
|
| 109 |
task = get_task_definition(task_id)
|
| 110 |
|
| 111 |
+
if action.action_type == "investigate":
|
| 112 |
+
return self._handle_investigation_action(task, current_ticket, action, idx)
|
| 113 |
+
|
| 114 |
submitted_fields = {
|
| 115 |
+
f
|
| 116 |
+
for f, v in action.model_dump(exclude_none=True).items()
|
| 117 |
+
if v is not None
|
| 118 |
+
and f not in {"action_type", "tool_name", "tool_target_ticket_id"}
|
| 119 |
}
|
| 120 |
allowed = set(task["allowed_fields"])
|
| 121 |
extra_fields = submitted_fields - allowed
|
| 122 |
if extra_fields:
|
| 123 |
# Penalty: record score 0.0, advance index, return penalty observation
|
| 124 |
self._state.per_ticket_scores.append(0.0)
|
| 125 |
+
self._state.history_entries.append(
|
| 126 |
+
self._build_history_entry(
|
| 127 |
+
current_ticket,
|
| 128 |
+
predicted=action.model_dump(exclude_none=True),
|
| 129 |
+
score=0.0,
|
| 130 |
+
breakdown={},
|
| 131 |
+
queue_position=idx + 1,
|
| 132 |
+
penalty_reason=f"extra_fields: {sorted(extra_fields)}",
|
| 133 |
+
)
|
| 134 |
+
)
|
| 135 |
self._state.step_count += 1
|
| 136 |
self._state.current_ticket_index += 1
|
| 137 |
is_done = self._state.current_ticket_index >= len(self._queue)
|
|
|
|
|
|
|
| 138 |
self._state.done = is_done
|
| 139 |
if is_done:
|
| 140 |
traj_reward = compute_trajectory_reward(
|
| 141 |
self._state.per_ticket_scores, len(self._queue), self._state.step_count
|
| 142 |
)
|
| 143 |
+
final_reward = self._apply_episode_economics(traj_reward)
|
| 144 |
+
self._state.total_reward = final_reward
|
| 145 |
+
else:
|
| 146 |
+
final_reward = 0.0
|
| 147 |
+
self._state.last_step_reward = final_reward
|
| 148 |
+
self._state.reward = final_reward
|
| 149 |
+
self._state.last_tool_result = None
|
| 150 |
+
return self._build_observation(task, done=is_done, reward=final_reward)
|
| 151 |
|
| 152 |
score, breakdown = grade_action(action, current_ticket, task_id)
|
| 153 |
step_reward = compute_step_reward(score)
|
|
|
|
| 163 |
len(self._queue),
|
| 164 |
self._state.step_count,
|
| 165 |
)
|
| 166 |
+
final_reward = self._apply_episode_economics(traj_reward)
|
| 167 |
+
self._state.total_reward = final_reward
|
| 168 |
else:
|
| 169 |
self._state.per_ticket_scores.append(score)
|
| 170 |
self._state.step_count += 1
|
| 171 |
self._state.current_ticket_index += 1
|
| 172 |
final_reward = step_reward
|
| 173 |
|
| 174 |
+
history_entry = self._build_history_entry(
|
| 175 |
+
current_ticket,
|
| 176 |
+
predicted=action.model_dump(exclude_none=True),
|
| 177 |
+
score=score,
|
| 178 |
+
breakdown=breakdown,
|
| 179 |
+
queue_position=idx + 1,
|
| 180 |
+
)
|
| 181 |
self._state.history_entries.append(history_entry)
|
| 182 |
|
| 183 |
self._state.last_step_reward = final_reward
|
| 184 |
self._state.reward = final_reward
|
| 185 |
self._state.done = is_done
|
| 186 |
+
self._state.last_tool_result = None
|
| 187 |
|
| 188 |
return self._build_observation(task, done=is_done, reward=final_reward)
|
| 189 |
|
|
|
|
| 195 |
# Helpers
|
| 196 |
# ------------------------------------------------------------------
|
| 197 |
|
| 198 |
+
def _apply_episode_economics(self, base_reward: float) -> float:
|
| 199 |
+
free_investigations = len(self._queue) * FREE_INVESTIGATIONS_PER_TICKET
|
| 200 |
+
extra_investigations = max(0, self._state.investigation_steps - free_investigations)
|
| 201 |
+
penalty = min(
|
| 202 |
+
MAX_EXTRA_INVESTIGATION_PENALTY,
|
| 203 |
+
extra_investigations * EXTRA_INVESTIGATION_COST,
|
| 204 |
+
)
|
| 205 |
+
return max(0.0, min(1.0, base_reward - penalty))
|
| 206 |
+
|
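Review note: `_apply_episode_economics` above charges only for investigations beyond one free lookup per queued ticket and caps the total deduction. The sketch below reproduces that arithmetic as a free function with the constants from this diff; the function name, its explicit `queue_size` and `investigation_steps` parameters, and the sample rewards are assumptions made for illustration.

```python
# Constants copied from the diff.
FREE_INVESTIGATIONS_PER_TICKET = 1
EXTRA_INVESTIGATION_COST = 0.02
MAX_EXTRA_INVESTIGATION_PENALTY = 0.15


def apply_episode_economics(
    base_reward: float, queue_size: int, investigation_steps: int
) -> float:
    """Standalone version of the method above (hypothetical signature)."""
    free_investigations = queue_size * FREE_INVESTIGATIONS_PER_TICKET
    extra_investigations = max(0, investigation_steps - free_investigations)
    penalty = min(
        MAX_EXTRA_INVESTIGATION_PENALTY,
        extra_investigations * EXTRA_INVESTIGATION_COST,
    )
    # Clamp so the final episode reward stays within [0, 1].
    return max(0.0, min(1.0, base_reward - penalty))


# Hypothetical episodes with a queue of 4 tickets and a base reward of 0.8.
within_budget = apply_episode_economics(0.8, queue_size=4, investigation_steps=4)
three_extra = apply_episode_economics(0.8, queue_size=4, investigation_steps=7)
capped = apply_episode_economics(0.8, queue_size=4, investigation_steps=20)
```

With these constants the cap starts to bind at the eighth extra investigation: seven extras cost 0.14, while eight would cost 0.16 and are limited to 0.15. The budget is pooled across the episode, so an agent can skip a lookup on one ticket and spend two on another without penalty.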
| 207 |
+
def _lookup_related_ticket(
|
| 208 |
+
self,
|
| 209 |
+
current_ticket: HelpdeskTicketRecord,
|
| 210 |
+
target_ticket_id: str | None,
|
| 211 |
+
) -> dict[str, Any]:
|
| 212 |
+
target_id = target_ticket_id or current_ticket.related_ticket_id
|
| 213 |
+
if target_id is None:
|
| 214 |
+
return {
|
| 215 |
+
"tool_name": "lookup_related_ticket",
|
| 216 |
+
"found": False,
|
| 217 |
+
"message": "Current ticket has no linked related_ticket_id.",
|
| 218 |
+
}
|
| 219 |
+
related_ticket = self._tickets_by_id.get(target_id)
|
| 220 |
+
if related_ticket is None:
|
| 221 |
+
return {
|
| 222 |
+
"tool_name": "lookup_related_ticket",
|
| 223 |
+
"found": False,
|
| 224 |
+
"message": f"Ticket {target_id!r} was not found in the dataset.",
|
| 225 |
+
}
|
| 226 |
+
return {
|
| 227 |
+
"tool_name": "lookup_related_ticket",
|
| 228 |
+
"found": True,
|
| 229 |
+
"ticket": {
|
| 230 |
+
"ticket_id": related_ticket.ticket_id,
|
| 231 |
+
"title": related_ticket.title,
|
| 232 |
+
"requester": related_ticket.requester,
|
| 233 |
+
"description": related_ticket.description,
|
| 234 |
+
"issue_type": related_ticket.issue_type,
|
| 235 |
+
"priority": related_ticket.priority,
|
| 236 |
+
"assignment_group": related_ticket.assignment_group,
|
| 237 |
+
"resolution_action": related_ticket.resolution_action,
|
| 238 |
+
},
|
| 239 |
+
}
|
| 240 |
+
|
| 241 |
+
def _lookup_requester_history(self, current_ticket: HelpdeskTicketRecord) -> dict[str, Any]:
|
| 242 |
+
matches = [
|
| 243 |
+
{
|
| 244 |
+
"ticket_id": ticket.ticket_id,
|
| 245 |
+
"title": ticket.title,
|
| 246 |
+
"issue_type": ticket.issue_type,
|
| 247 |
+
"priority": ticket.priority,
|
| 248 |
+
"assignment_group": ticket.assignment_group,
|
| 249 |
+
"resolution_action": ticket.resolution_action,
|
| 250 |
+
}
|
| 251 |
+
for ticket in self._dataset
|
| 252 |
+
if ticket.requester == current_ticket.requester
|
| 253 |
+
and ticket.ticket_id != current_ticket.ticket_id
|
| 254 |
+
]
|
| 255 |
+
return {
|
| 256 |
+
"tool_name": "lookup_requester_history",
|
| 257 |
+
"found": bool(matches),
|
| 258 |
+
"requester": current_ticket.requester,
|
| 259 |
+
"matches": matches,
|
| 260 |
+
}
|
| 261 |
+
|
| 262 |
+
def _run_investigation_tool(
|
| 263 |
+
self,
|
| 264 |
+
current_ticket: HelpdeskTicketRecord,
|
| 265 |
+
tool_name: str,
|
| 266 |
+
target_ticket_id: str | None,
|
| 267 |
+
) -> dict[str, Any]:
|
| 268 |
+
if tool_name == "lookup_related_ticket":
|
| 269 |
+
return self._lookup_related_ticket(current_ticket, target_ticket_id)
|
| 270 |
+
if tool_name == "lookup_requester_history":
|
| 271 |
+
return self._lookup_requester_history(current_ticket)
|
| 272 |
+
raise ValueError(f"Unsupported tool_name: {tool_name}")
|
| 273 |
+
|
| 274 |
+
def _handle_investigation_action(
|
| 275 |
+
self,
|
| 276 |
+
task: dict,
|
| 277 |
+
current_ticket: HelpdeskTicketRecord,
|
| 278 |
+
action: HelpdeskTicketAction,
|
| 279 |
+
idx: int,
|
| 280 |
+
) -> HelpdeskTicketObservation:
|
| 281 |
+
if action.tool_name is None:
|
| 282 |
+
raise ValueError("Investigate actions require tool_name")
|
| 283 |
+
submitted_fields = {
|
| 284 |
+
field
|
| 285 |
+
for field in ("issue_type", "priority", "assignment_group", "resolution_action")
|
| 286 |
+
if getattr(action, field) is not None
|
| 287 |
+
}
|
| 288 |
+
if submitted_fields:
|
| 289 |
+
raise ValueError(
|
| 290 |
+
"Investigate actions cannot include submit fields: "
|
| 291 |
+
f"{sorted(submitted_fields)}"
|
| 292 |
+
)
|
| 293 |
+
|
| 294 |
+
tool_result = self._run_investigation_tool(
|
| 295 |
+
current_ticket,
|
| 296 |
+
action.tool_name,
|
| 297 |
+
action.tool_target_ticket_id,
|
| 298 |
+
)
|
| 299 |
+
self._state.step_count += 1
|
| 300 |
+
self._state.investigation_steps += 1
|
| 301 |
+
self._state.investigation_budget_remaining = max(
|
| 302 |
+
0,
|
| 303 |
+
self._state.investigation_budget_remaining - 1,
|
| 304 |
+
)
|
| 305 |
+
self._state.last_tool_result = tool_result
|
| 306 |
+
self._state.last_step_reward = 0.0
|
| 307 |
+
self._state.reward = 0.0
|
| 308 |
+
self._state.done = False
|
| 309 |
+
self._state.history_entries.append(
|
| 310 |
+
self._build_history_entry(
|
| 311 |
+
current_ticket,
|
| 312 |
+
predicted=action.model_dump(exclude_none=True),
|
| 313 |
+
score=0.0,
|
| 314 |
+
breakdown={},
|
| 315 |
+
queue_position=idx + 1,
|
| 316 |
+
tool_result=tool_result,
|
| 317 |
+
)
|
| 318 |
+
)
|
| 319 |
+
return self._build_observation(task, done=False, reward=0.0)
|
| 320 |
+
|
| 321 |
+
def _build_ticket_view(self, ticket: HelpdeskTicketRecord) -> dict[str, Any]:
|
| 322 |
+
ticket_view: dict[str, Any] = {
|
| 323 |
+
"ticket_id": ticket.ticket_id,
|
| 324 |
+
"title": ticket.title,
|
| 325 |
+
"requester": ticket.requester,
|
| 326 |
+
"description": ticket.description,
|
| 327 |
+
}
|
| 328 |
+
if ticket.ambiguity_note is not None:
|
| 329 |
+
ticket_view["ambiguity_note"] = ticket.ambiguity_note
|
| 330 |
+
if ticket.related_ticket_id is not None:
|
| 331 |
+
ticket_view["related_ticket_id"] = ticket.related_ticket_id
|
| 332 |
+
related_ticket = self._tickets_by_id.get(ticket.related_ticket_id)
|
| 333 |
+
if related_ticket is not None:
|
| 334 |
+
ticket_view["related_ticket_preview"] = {
|
| 335 |
+
"ticket_id": related_ticket.ticket_id,
|
| 336 |
+
"title": related_ticket.title,
|
| 337 |
+
"requester": related_ticket.requester,
|
| 338 |
+
"description": related_ticket.description,
|
| 339 |
+
}
|
| 340 |
+
return ticket_view
|
| 341 |
+
|
| 342 |
+
def _build_history_entry(
|
| 343 |
+
self,
|
| 344 |
+
ticket: HelpdeskTicketRecord,
|
| 345 |
+
*,
|
| 346 |
+
predicted: dict[str, Any],
|
| 347 |
+
score: float,
|
| 348 |
+
breakdown: dict[str, float],
|
| 349 |
+
queue_position: int,
|
| 350 |
+
penalty_reason: str | None = None,
|
| 351 |
+
tool_result: dict[str, Any] | None = None,
|
| 352 |
+
) -> dict[str, Any]:
|
| 353 |
+
history_entry: dict[str, Any] = {
|
| 354 |
+
"ticket_id": ticket.ticket_id,
|
| 355 |
+
"title": ticket.title,
|
| 356 |
+
"requester": ticket.requester,
|
| 357 |
+
"predicted": predicted,
|
| 358 |
+
"score": score,
|
| 359 |
+
"breakdown": breakdown,
|
| 360 |
+
"queue_position": queue_position,
|
| 361 |
+
}
|
| 362 |
+
if ticket.ambiguity_note is not None:
|
| 363 |
+
history_entry["ambiguity_note"] = ticket.ambiguity_note
|
| 364 |
+
if ticket.related_ticket_id is not None:
|
| 365 |
+
history_entry["related_ticket_id"] = ticket.related_ticket_id
|
| 366 |
+
related_ticket = self._tickets_by_id.get(ticket.related_ticket_id)
|
| 367 |
+
if related_ticket is not None:
|
| 368 |
+
history_entry["related_ticket_preview"] = {
|
| 369 |
+
"ticket_id": related_ticket.ticket_id,
|
| 370 |
+
"title": related_ticket.title,
|
| 371 |
+
"requester": related_ticket.requester,
|
| 372 |
+
"description": related_ticket.description,
|
| 373 |
+
}
|
| 374 |
+
if penalty_reason is not None:
|
| 375 |
+
history_entry["penalty_reason"] = penalty_reason
|
| 376 |
+
if tool_result is not None:
|
| 377 |
+
history_entry["tool_result"] = tool_result
|
| 378 |
+
return history_entry
|
| 379 |
+
|
| 380 |
def _build_observation(
|
| 381 |
self,
|
| 382 |
task: dict,
|
|
|
|
| 388 |
|
| 389 |
if idx < queue_size:
|
| 390 |
ticket = self._queue[idx]
|
| 391 |
+
ticket_view = self._build_ticket_view(ticket)
|
| 392 |
+
queue_position = idx + 1
|
| 393 |
else:
|
| 394 |
ticket_view = None
|
| 395 |
+
queue_position = 0
|
| 396 |
|
| 397 |
history = list(self._state.history_entries)
|
| 398 |
+
tickets_remaining = max(0, queue_size - idx)
|
| 399 |
+
tickets_after_current = max(
|
| 400 |
+
0,
|
| 401 |
+
tickets_remaining - (1 if ticket_view is not None else 0),
|
| 402 |
+
)
|
| 403 |
|
| 404 |
return HelpdeskTicketObservation(
|
| 405 |
done=done,
|
| 406 |
reward=reward,
|
| 407 |
+
metadata={
|
| 408 |
+
"queue_position": queue_position,
|
| 409 |
+
"tickets_remaining_includes_current": ticket_view is not None,
|
| 410 |
+
"has_ambiguity_note": bool(ticket_view and ticket_view.get("ambiguity_note")),
|
| 411 |
+
"has_related_ticket_context": bool(
|
| 412 |
+
ticket_view and ticket_view.get("related_ticket_preview")
|
| 413 |
+
),
|
| 414 |
+
"action_mode": "investigate_or_submit",
|
| 415 |
+
},
|
| 416 |
task_id=task["id"],
|
| 417 |
task_name=task["name"],
|
| 418 |
instructions=task["instructions"],
|
| 419 |
allowed_fields=list(task["allowed_fields"]),
|
| 420 |
+
available_tools=list(AVAILABLE_TOOLS),
|
| 421 |
+
investigation_budget_remaining=self._state.investigation_budget_remaining,
|
| 422 |
+
last_tool_result=self._state.last_tool_result,
|
| 423 |
current_ticket=ticket_view,
|
| 424 |
queue_size=queue_size,
|
| 425 |
+
tickets_remaining=tickets_remaining,
|
| 426 |
+
tickets_after_current=tickets_after_current,
|
| 427 |
tickets_processed=idx,
|
| 428 |
+
queue_position=queue_position,
|
| 429 |
history=history,
|
| 430 |
)
|
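The queue counters that `_build_observation` now exposes can be read as plain arithmetic on the queue size and the current index. The following is a minimal standalone sketch of that arithmetic, not the environment's own code: `queue_counters` is a hypothetical helper name, and it assumes `idx` is the zero-based index of the current ticket, as the diff above suggests.

```python
def queue_counters(queue_size: int, idx: int) -> dict[str, int]:
    """Mirror the counter arithmetic shown in the _build_observation diff.

    Hypothetical helper for illustration only; the real environment
    computes these values inline while building the observation.
    """
    # A current ticket exists only while the index is inside the queue.
    has_current = idx < queue_size
    # tickets_remaining counts the current ticket when there is one.
    tickets_remaining = max(0, queue_size - idx)
    # tickets_after_current excludes the ticket currently being shown.
    tickets_after_current = max(0, tickets_remaining - (1 if has_current else 0))
    return {
        # queue_position is 1-based, and 0 once the queue is exhausted.
        "queue_position": idx + 1 if has_current else 0,
        "tickets_remaining": tickets_remaining,
        "tickets_after_current": tickets_after_current,
        "tickets_processed": idx,
    }


if __name__ == "__main__":
    for step in range(4):
        print(step, queue_counters(queue_size=3, idx=step))
```

With a queue of three tickets, the values at index 0 and index 1 line up with what `TestObservationQueueContext` asserts after `reset` and after one `step`.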
server/tasks.py
CHANGED
|
@@ -13,7 +13,8 @@ TASKS = {
|
|
| 13 |
"name": "Issue Type Classification",
|
| 14 |
"difficulty": "easy",
|
| 15 |
"instructions": (
|
| 16 |
-
"Read the ticket and select the single best IT issue type."
|
|
|
|
| 17 |
),
|
| 18 |
"allowed_fields": ["issue_type"],
|
| 19 |
},
|
|
@@ -23,7 +24,8 @@ TASKS = {
|
|
| 23 |
"difficulty": "medium",
|
| 24 |
"instructions": (
|
| 25 |
"Read the ticket, select the best IT issue type, and estimate the "
|
| 26 |
-
"correct operational priority."
|
|
|
|
| 27 |
),
|
| 28 |
"allowed_fields": ["issue_type", "priority"],
|
| 29 |
},
|
|
@@ -33,7 +35,9 @@ TASKS = {
|
|
| 33 |
"difficulty": "hard",
|
| 34 |
"instructions": (
|
| 35 |
"Perform full helpdesk routing by selecting the best issue type, "
|
| 36 |
-
"priority, assignment group, and resolution action for the ticket."
|
|
|
|
|
|
|
| 37 |
),
|
| 38 |
"allowed_fields": [
|
| 39 |
"issue_type",
|
|
|
|
| 13 |
"name": "Issue Type Classification",
|
| 14 |
"difficulty": "easy",
|
| 15 |
"instructions": (
|
| 16 |
+
"Read the ticket and select the single best IT issue type. "
|
| 17 |
+
"You may investigate first, then submit a final routing answer."
|
| 18 |
),
|
| 19 |
"allowed_fields": ["issue_type"],
|
| 20 |
},
|
|
|
|
| 24 |
"difficulty": "medium",
|
| 25 |
"instructions": (
|
| 26 |
"Read the ticket, select the best IT issue type, and estimate the "
|
| 27 |
+
"correct operational priority. If the observation includes ambiguity "
|
| 28 |
+
"or follow-up context, use it. You may investigate before you submit."
|
| 29 |
),
|
| 30 |
"allowed_fields": ["issue_type", "priority"],
|
| 31 |
},
|
|
|
|
| 35 |
"difficulty": "hard",
|
| 36 |
"instructions": (
|
| 37 |
"Perform full helpdesk routing by selecting the best issue type, "
|
| 38 |
+
"priority, assignment group, and resolution action for the ticket. "
|
| 39 |
+
"Use any ambiguity notes or related-ticket previews when present. "
|
| 40 |
+
"You may investigate with tools before you submit the final action."
|
| 41 |
),
|
| 42 |
"allowed_fields": [
|
| 43 |
"issue_type",
|
tests/test_competitive_upgrade.py
CHANGED
|
@@ -81,7 +81,11 @@ def _heuristic_action(obs: HelpdeskTicketObservation) -> HelpdeskTicketAction:
|
|
| 81 |
# 9.1 — Inference single-task mode
|
| 82 |
# ---------------------------------------------------------------------------
|
| 83 |
|
| 84 |
-
def _get_tasks_to_run_impl(
|
| 85 |
"""
|
| 86 |
Standalone re-implementation of inference.get_tasks_to_run() logic for testing.
|
| 87 |
|
|
@@ -94,9 +98,13 @@ def _get_tasks_to_run_impl(task_id_env: str | None, available_tasks: dict) -> li
|
|
| 94 |
except ValueError:
|
| 95 |
raise SystemExit(1)
|
| 96 |
if task_id not in available_tasks:
|
| 97 |
-
|
| 98 |
return [task_id]
|
| 99 |
-
|
| 100 |
|
| 101 |
|
| 102 |
class TestInferenceSingleTaskMode(unittest.TestCase):
|
|
@@ -107,14 +115,19 @@ class TestInferenceSingleTaskMode(unittest.TestCase):
|
|
| 107 |
result = _get_tasks_to_run_impl("1", available)
|
| 108 |
self.assertEqual(result, [1])
|
| 109 |
|
| 110 |
-
def
|
| 111 |
available = {1: {}, 2: {}, 3: {}}
|
| 112 |
-
|
| 113 |
-
|
| 114 |
|
| 115 |
-
def
|
| 116 |
available = {1: {}, 2: {}, 3: {}}
|
| 117 |
result = _get_tasks_to_run_impl(None, available)
|
| 118 |
self.assertEqual(sorted(result), sorted(list(TASK_IDS)))
|
| 119 |
|
| 120 |
def test_task_id_set_to_2_returns_only_task_2(self) -> None:
|
|
@@ -360,6 +373,271 @@ class TestAmbiguityNoteInObservation(unittest.TestCase):
|
|
| 360 |
self.assertIn("ambiguity_note", obs.current_ticket)
|
| 361 |
|
| 362 |
|
| 363 |
# ---------------------------------------------------------------------------
|
| 364 |
# 9.7 — Dataset has >= 3 non-default routing tickets
|
| 365 |
# ---------------------------------------------------------------------------
|
|
|
|
| 81 |
# 9.1 — Inference single-task mode
|
| 82 |
# ---------------------------------------------------------------------------
|
| 83 |
|
| 84 |
+
def _get_tasks_to_run_impl(
|
| 85 |
+
task_id_env: str | None,
|
| 86 |
+
available_tasks: dict,
|
| 87 |
+
run_all_tasks: bool = False,
|
| 88 |
+
) -> list[int]:
|
| 89 |
"""
|
| 90 |
Standalone re-implementation of inference.get_tasks_to_run() logic for testing.
|
| 91 |
|
|
|
|
| 98 |
except ValueError:
|
| 99 |
raise SystemExit(1)
|
| 100 |
if task_id not in available_tasks:
|
| 101 |
+
raise SystemExit(1)
|
| 102 |
return [task_id]
|
| 103 |
+
if run_all_tasks:
|
| 104 |
+
return sorted(available_tasks)
|
| 105 |
+
if not available_tasks:
|
| 106 |
+
return []
|
| 107 |
+
return [sorted(available_tasks)[0]]
|
| 108 |
|
| 109 |
|
| 110 |
class TestInferenceSingleTaskMode(unittest.TestCase):
|
|
|
|
| 115 |
result = _get_tasks_to_run_impl("1", available)
|
| 116 |
self.assertEqual(result, [1])
|
| 117 |
|
| 118 |
+
def test_task_id_set_to_unavailable_id_exits(self) -> None:
|
| 119 |
available = {1: {}, 2: {}, 3: {}}
|
| 120 |
+
with self.assertRaises(SystemExit):
|
| 121 |
+
_get_tasks_to_run_impl("999", available)
|
| 122 |
|
| 123 |
+
def test_task_id_unset_defaults_to_first_available_task(self) -> None:
|
| 124 |
available = {1: {}, 2: {}, 3: {}}
|
| 125 |
result = _get_tasks_to_run_impl(None, available)
|
| 126 |
+
self.assertEqual(result, [1])
|
| 127 |
+
|
| 128 |
+
def test_run_all_tasks_override_returns_all_task_ids(self) -> None:
|
| 129 |
+
available = {1: {}, 2: {}, 3: {}}
|
| 130 |
+
result = _get_tasks_to_run_impl(None, available, run_all_tasks=True)
|
| 131 |
self.assertEqual(sorted(result), sorted(list(TASK_IDS)))
|
| 132 |
|
| 133 |
def test_task_id_set_to_2_returns_only_task_2(self) -> None:
|
|
|
|
| 373 |
self.assertIn("ambiguity_note", obs.current_ticket)
|
| 374 |
|
| 375 |
|
| 376 |
+
class TestRelatedTicketPreviewInObservation(unittest.TestCase):
|
| 377 |
+
"""Follow-up tickets expose a lightweight preview of the linked ticket."""
|
| 378 |
+
|
| 379 |
+
def _reset_linked_ticket_env(self):
|
| 380 |
+
from unittest.mock import patch
|
| 381 |
+
|
| 382 |
+
dataset = load_dataset()
|
| 383 |
+
ticket = next((t for t in dataset if t.related_ticket_id is not None), None)
|
| 384 |
+
self.assertIsNotNone(ticket, "No follow-up ticket found in dataset")
|
| 385 |
+
related = next(
|
| 386 |
+
(t for t in dataset if t.ticket_id == ticket.related_ticket_id),
|
| 387 |
+
None,
|
| 388 |
+
)
|
| 389 |
+
self.assertIsNotNone(related, "Linked ticket missing from dataset")
|
| 390 |
+
|
| 391 |
+
env = _make_env()
|
| 392 |
+
with patch.object(env, "_dataset", [ticket]):
|
| 393 |
+
with patch.object(
|
| 394 |
+
env,
|
| 395 |
+
"_tickets_by_id",
|
| 396 |
+
{ticket.ticket_id: ticket, related.ticket_id: related},
|
| 397 |
+
):
|
| 398 |
+
obs = env.reset(seed=0, task_id=3, queue_size=1)
|
| 399 |
+
|
| 400 |
+
return env, obs, related
|
| 401 |
+
|
| 402 |
+
def test_related_ticket_preview_present_when_ticket_has_link(self) -> None:
|
| 403 |
+
env, obs, related = self._reset_linked_ticket_env()
|
| 404 |
+
|
| 405 |
+
self.assertIsNotNone(obs.current_ticket)
|
| 406 |
+
self.assertIn("related_ticket_preview", obs.current_ticket)
|
| 407 |
+
self.assertEqual(
|
| 408 |
+
obs.current_ticket["related_ticket_preview"]["ticket_id"],
|
| 409 |
+
related.ticket_id,
|
| 410 |
+
)
|
| 411 |
+
self.assertEqual(
|
| 412 |
+
obs.current_ticket["related_ticket_preview"]["title"],
|
| 413 |
+
related.title,
|
| 414 |
+
)
|
| 415 |
+
|
| 416 |
+
def test_history_keeps_related_ticket_preview_after_step(self) -> None:
|
| 417 |
+
env, obs, related = self._reset_linked_ticket_env()
|
| 418 |
+
next_obs = env.step(_heuristic_action(obs))
|
| 419 |
+
|
| 420 |
+
self.assertGreaterEqual(len(next_obs.history), 1)
|
| 421 |
+
self.assertIn("related_ticket_preview", next_obs.history[0])
|
| 422 |
+
self.assertEqual(
|
| 423 |
+
next_obs.history[0]["related_ticket_preview"]["ticket_id"],
|
| 424 |
+
related.ticket_id,
|
| 425 |
+
)
|
| 426 |
+
|
| 427 |
+
|
| 428 |
+
class TestObservationQueueContext(unittest.TestCase):
|
| 429 |
+
"""Observation includes clearer queue-position counters."""
|
| 430 |
+
|
| 431 |
+
def test_reset_sets_queue_position_and_after_current_counts(self) -> None:
|
| 432 |
+
env = _make_env()
|
| 433 |
+
obs = env.reset(seed=0, task_id=1, queue_size=3)
|
| 434 |
+
|
| 435 |
+
self.assertEqual(obs.queue_position, 1)
|
| 436 |
+
self.assertEqual(obs.tickets_remaining, 3)
|
| 437 |
+
self.assertEqual(obs.tickets_after_current, 2)
|
| 438 |
+
|
| 439 |
+
def test_step_updates_queue_position_and_after_current_counts(self) -> None:
|
| 440 |
+
env = _make_env()
|
| 441 |
+
obs = env.reset(seed=0, task_id=1, queue_size=3)
|
| 442 |
+
obs = env.step(_heuristic_action(obs))
|
| 443 |
+
|
| 444 |
+
if obs.done:
|
| 445 |
+
self.assertEqual(obs.queue_position, 0)
|
| 446 |
+
self.assertEqual(obs.tickets_after_current, 0)
|
| 447 |
+
else:
|
| 448 |
+
self.assertEqual(obs.queue_position, 2)
|
| 449 |
+
self.assertEqual(obs.tickets_remaining, 2)
|
| 450 |
+
self.assertEqual(obs.tickets_after_current, 1)
|
| 451 |
+
|
| 452 |
+
|
| 453 |
+
# ---------------------------------------------------------------------------
|
| 454 |
+
# 9.6b — investigation actions and queue economics
|
| 455 |
+
# ---------------------------------------------------------------------------
|
| 456 |
+
|
| 457 |
+
class TestInvestigationActions(unittest.TestCase):
|
| 458 |
+
"""Minimal tool-assisted investigate/submit flow works and stays backwards compatible."""
|
| 459 |
+
|
| 460 |
+
def _make_linked_env(self):
|
| 461 |
+
from unittest.mock import patch
|
| 462 |
+
|
| 463 |
+
dataset = load_dataset()
|
| 464 |
+
ticket = next((t for t in dataset if t.related_ticket_id is not None), None)
|
| 465 |
+
self.assertIsNotNone(ticket, "No follow-up ticket found in dataset")
|
| 466 |
+
related = next(
|
| 467 |
+
(t for t in dataset if t.ticket_id == ticket.related_ticket_id),
|
| 468 |
+
None,
|
| 469 |
+
)
|
| 470 |
+
self.assertIsNotNone(related, "Linked ticket missing from dataset")
|
| 471 |
+
env = _make_env()
|
| 472 |
+
patch_dataset = patch.object(env, "_dataset", [ticket])
|
| 473 |
+
patch_lookup = patch.object(
|
| 474 |
+
env,
|
| 475 |
+
"_tickets_by_id",
|
| 476 |
+
{ticket.ticket_id: ticket, related.ticket_id: related},
|
| 477 |
+
)
|
| 478 |
+
patch_dataset.start()
|
| 479 |
+
patch_lookup.start()
|
| 480 |
+
self.addCleanup(patch_dataset.stop)
|
| 481 |
+
self.addCleanup(patch_lookup.stop)
|
| 482 |
+
obs = env.reset(seed=0, task_id=3, queue_size=1)
|
| 483 |
+
return env, obs, ticket, related
|
| 484 |
+
|
| 485 |
+
def test_investigation_action_does_not_advance_queue(self) -> None:
|
| 486 |
+
env, obs, ticket, related = self._make_linked_env()
|
| 487 |
+
|
| 488 |
+
investigate = HelpdeskTicketAction(
|
| 489 |
+
action_type="investigate",
|
| 490 |
+
tool_name="lookup_related_ticket",
|
| 491 |
+
tool_target_ticket_id=ticket.related_ticket_id,
|
| 492 |
+
)
|
| 493 |
+
obs2 = env.step(investigate)
|
| 494 |
+
|
| 495 |
+
self.assertFalse(obs2.done)
|
| 496 |
+
self.assertEqual(obs2.tickets_processed, 0)
|
| 497 |
+
self.assertEqual(obs2.queue_position, 1)
|
| 498 |
+
self.assertIsNotNone(obs2.last_tool_result)
|
| 499 |
+
self.assertTrue(obs2.last_tool_result["found"])
|
| 500 |
+
self.assertEqual(
|
| 501 |
+
obs2.last_tool_result["ticket"]["ticket_id"],
|
| 502 |
+
related.ticket_id,
|
| 503 |
+
)
|
| 504 |
+
|
| 505 |
+
def test_submit_after_investigation_completes_episode(self) -> None:
|
| 506 |
+
env, obs, ticket, related = self._make_linked_env()
|
| 507 |
+
env.step(
|
| 508 |
+
HelpdeskTicketAction(
|
| 509 |
+
action_type="investigate",
|
| 510 |
+
tool_name="lookup_related_ticket",
|
| 511 |
+
tool_target_ticket_id=ticket.related_ticket_id,
|
| 512 |
+
)
|
| 513 |
+
)
|
| 514 |
+
final_obs = env.step(
|
| 515 |
+
HelpdeskTicketAction(
|
| 516 |
+
issue_type=ticket.issue_type,
|
| 517 |
+
priority=ticket.priority,
|
| 518 |
+
assignment_group=ticket.assignment_group,
|
| 519 |
+
resolution_action=ticket.resolution_action,
|
| 520 |
+
)
|
| 521 |
+
)
|
| 522 |
+
|
| 523 |
+
self.assertTrue(final_obs.done)
|
| 524 |
+
self.assertEqual(final_obs.tickets_processed, 1)
|
| 525 |
+
self.assertGreaterEqual(final_obs.reward, 0.0)
|
| 526 |
+
self.assertLessEqual(final_obs.reward, 1.0)
|
| 527 |
+
|
| 528 |
+
def test_requester_history_tool_returns_matches_for_same_requester(self) -> None:
|
| 529 |
+
from unittest.mock import patch
|
| 530 |
+
|
| 531 |
+
dataset = load_dataset()
|
| 532 |
+
requester_counts: dict[str, int] = {}
|
| 533 |
+
for ticket in dataset:
|
| 534 |
+
requester_counts[ticket.requester] = requester_counts.get(ticket.requester, 0) + 1
|
| 535 |
+
target_requester = next(
|
| 536 |
+
(requester for requester, count in requester_counts.items() if count >= 2),
|
| 537 |
+
None,
|
| 538 |
+
)
|
| 539 |
+
self.assertIsNotNone(target_requester, "Dataset has no repeated requester")
|
| 540 |
+
duplicate_requester_group = [
|
| 541 |
+
ticket for ticket in dataset if ticket.requester == target_requester
|
| 542 |
+
]
|
| 543 |
+
self.assertGreaterEqual(len(duplicate_requester_group), 2)
|
| 544 |
+
|
| 545 |
+
env = _make_env()
|
| 546 |
+
with patch.object(env, "_dataset", duplicate_requester_group):
|
| 547 |
+
with patch.object(
|
| 548 |
+
env,
|
| 549 |
+
"_tickets_by_id",
|
| 550 |
+
{ticket.ticket_id: ticket for ticket in duplicate_requester_group},
|
| 551 |
+
):
|
| 552 |
+
obs = env.reset(seed=0, task_id=2, queue_size=1)
|
| 553 |
+
|
| 554 |
+
obs2 = env.step(
|
| 555 |
+
HelpdeskTicketAction(
|
| 556 |
+
action_type="investigate",
|
| 557 |
+
tool_name="lookup_requester_history",
|
| 558 |
+
)
|
| 559 |
+
)
|
| 560 |
+
|
| 561 |
+
self.assertIsNotNone(obs2.last_tool_result)
|
| 562 |
+
self.assertEqual(obs2.last_tool_result["tool_name"], "lookup_requester_history")
|
| 563 |
+
self.assertTrue(obs2.last_tool_result["found"])
|
| 564 |
+
self.assertGreaterEqual(len(obs2.last_tool_result["matches"]), 1)
|
| 565 |
+
|
| 566 |
+
|
| 567 |
+
class TestQueueEconomics(unittest.TestCase):
|
| 568 |
+
"""Free investigations are allowed, but excessive investigation gets a queue-level penalty."""
|
| 569 |
+
|
| 570 |
+
def test_extra_investigations_reduce_final_reward(self) -> None:
|
| 571 |
+
from unittest.mock import patch
|
| 572 |
+
|
| 573 |
+
dataset = load_dataset()
|
| 574 |
+
ticket = dataset[0]
|
| 575 |
+
env = _make_env()
|
| 576 |
+
with patch.object(env, "_dataset", [ticket]):
|
| 577 |
+
with patch.object(env, "_tickets_by_id", {ticket.ticket_id: ticket}):
|
| 578 |
+
obs = env.reset(seed=0, task_id=1, queue_size=1)
|
| 579 |
+
|
| 580 |
+
obs = env.step(
|
| 581 |
+
HelpdeskTicketAction(
|
| 582 |
+
action_type="investigate",
|
| 583 |
+
tool_name="lookup_requester_history",
|
| 584 |
+
)
|
| 585 |
+
)
|
| 586 |
+
self.assertEqual(env.state.investigation_steps, 1)
|
| 587 |
+
self.assertEqual(env.state.investigation_budget_remaining, 0)
|
| 588 |
+
|
| 589 |
+
obs = env.step(
|
| 590 |
+
HelpdeskTicketAction(
|
| 591 |
+
action_type="investigate",
|
| 592 |
+
tool_name="lookup_requester_history",
|
| 593 |
+
)
|
| 594 |
+
)
|
| 595 |
+
self.assertEqual(env.state.investigation_steps, 2)
|
| 596 |
+
|
| 597 |
+
final_obs = env.step(HelpdeskTicketAction(issue_type=ticket.issue_type))
|
| 598 |
+
|
| 599 |
+
self.assertTrue(final_obs.done)
|
| 600 |
+
self.assertAlmostEqual(final_obs.reward, 0.98, places=9)
|
| 601 |
+
|
| 602 |
+
|
| 603 |
+
class TestTerminalInvalidActionFinalReward(unittest.TestCase):
|
| 604 |
+
"""Terminal invalid submit actions should still return the queue-level final reward."""
|
| 605 |
+
|
| 606 |
+
def test_last_invalid_submit_returns_trajectory_reward_not_zero(self) -> None:
|
| 607 |
+
from unittest.mock import patch
|
| 608 |
+
|
| 609 |
+
dataset = load_dataset()
|
| 610 |
+
first = dataset[0]
|
| 611 |
+
second = dataset[1]
|
| 612 |
+
|
| 613 |
+
env = _make_env()
|
| 614 |
+
with patch.object(env, "_dataset", [first, second]):
|
| 615 |
+
with patch.object(
|
| 616 |
+
env,
|
| 617 |
+
"_tickets_by_id",
|
| 618 |
+
{first.ticket_id: first, second.ticket_id: second},
|
| 619 |
+
):
|
| 620 |
+
obs = env.reset(seed=0, task_id=1, queue_size=2)
|
| 621 |
+
|
| 622 |
+
tickets_by_id = {first.ticket_id: first, second.ticket_id: second}
|
| 623 |
+
current = tickets_by_id[obs.current_ticket["ticket_id"]]
|
| 624 |
+
obs = env.step(HelpdeskTicketAction(issue_type=current.issue_type))
|
| 625 |
+
self.assertFalse(obs.done)
|
| 626 |
+
|
| 627 |
+
current = tickets_by_id[obs.current_ticket["ticket_id"]]
|
| 628 |
+
final_obs = env.step(
|
| 629 |
+
HelpdeskTicketAction(
|
| 630 |
+
issue_type=current.issue_type,
|
| 631 |
+
priority="medium",
|
| 632 |
+
)
|
| 633 |
+
)
|
| 634 |
+
|
| 635 |
+
self.assertTrue(final_obs.done)
|
| 636 |
+
self.assertAlmostEqual(final_obs.reward, 0.5, places=9)
|
| 637 |
+
self.assertAlmostEqual(env.state.total_reward, 0.5, places=9)
|
| 638 |
+
self.assertAlmostEqual(env.state.reward or 0.0, 0.5, places=9)
|
| 639 |
+
|
| 640 |
+
|
| 641 |
# ---------------------------------------------------------------------------
|
| 642 |
# 9.7 — Dataset has >= 3 non-default routing tickets
|
| 643 |
# ---------------------------------------------------------------------------
|
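The two reward assertions above (`0.98` after two investigations on a correctly answered single-ticket queue, and `0.5` when one of two tickets scores zero) are consistent with a queue-level reward equal to the mean ticket score minus a small penalty for each investigation beyond a free budget. The sketch below illustrates that reading only. The scoring code is not part of this section of the diff, so the function name `final_reward`, the free budget of 1, and the penalty of 0.02 per extra investigation are all assumptions inferred from the test values.

```python
def final_reward(
    ticket_scores: list[float],
    investigation_steps: int,
    free_budget: int = 1,             # assumed: one free investigation per queue
    penalty_per_extra: float = 0.02,  # assumed: inferred from the 0.98 assertion
) -> float:
    """Hypothetical queue-level reward: mean score minus an over-budget penalty."""
    if not ticket_scores:
        return 0.0
    mean_score = sum(ticket_scores) / len(ticket_scores)
    # Only investigations beyond the free budget are penalized.
    extra_steps = max(0, investigation_steps - free_budget)
    penalized = mean_score - penalty_per_extra * extra_steps
    # Keep the result inside the unit interval, as the tests expect.
    return min(1.0, max(0.0, penalized))


if __name__ == "__main__":
    # One correct ticket, two investigations: one is over budget.
    print(final_reward([1.0], investigation_steps=2))
    # Two tickets, the second scored zero, no investigations.
    print(final_reward([1.0, 0.0], investigation_steps=0))
```

If the real environment scales the penalty differently (for example with queue size), the constants here would need to change; the tests in this commit only pin down the two data points above.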
tests/test_environment_smoke.py
CHANGED
|
@@ -101,6 +101,8 @@ class TestResetReturnsValidObservation(unittest.TestCase):
|
|
| 101 |
self.assertIsNotNone(obs.current_ticket)
|
| 102 |
self.assertGreater(obs.queue_size, 0)
|
| 103 |
self.assertEqual(obs.tickets_processed, 0)
|
|
|
|
|
|
|
| 104 |
|
| 105 |
|
| 106 |
class TestResetAllTaskIds(unittest.TestCase):
|
|
@@ -116,6 +118,7 @@ class TestResetAllTaskIds(unittest.TestCase):
|
|
| 116 |
self.assertEqual(obs.tickets_processed, 0)
|
| 117 |
# allowed_fields must match the task definition
|
| 118 |
self.assertEqual(obs.allowed_fields, TASKS[task_id]["allowed_fields"])
|
|
|
|
| 119 |
|
| 120 |
def test_reset_task2(self) -> None:
|
| 121 |
env = _make_env()
|
|
@@ -142,6 +145,10 @@ class TestStepAdvancesTicketsProcessed(unittest.TestCase):
|
|
| 142 |
obs2 = env.step(action)
|
| 143 |
|
| 144 |
self.assertEqual(obs2.tickets_processed, 1)
|
| 145 |
|
| 146 |
def test_step_reward_in_unit_interval(self) -> None:
|
| 147 |
from models import HelpdeskTicketAction
|
|
|
|
| 101 |
self.assertIsNotNone(obs.current_ticket)
|
| 102 |
self.assertGreater(obs.queue_size, 0)
|
| 103 |
self.assertEqual(obs.tickets_processed, 0)
|
| 104 |
+
self.assertEqual(obs.queue_position, 1)
|
| 105 |
+
self.assertEqual(obs.tickets_after_current, max(0, obs.queue_size - 1))
|
| 106 |
|
| 107 |
|
| 108 |
class TestResetAllTaskIds(unittest.TestCase):
|
|
|
|
| 118 |
self.assertEqual(obs.tickets_processed, 0)
|
| 119 |
# allowed_fields must match the task definition
|
| 120 |
self.assertEqual(obs.allowed_fields, TASKS[task_id]["allowed_fields"])
|
| 121 |
+
self.assertEqual(obs.queue_position, 1)
|
| 122 |
|
| 123 |
def test_reset_task2(self) -> None:
|
| 124 |
env = _make_env()
|
|
|
|
| 145 |
obs2 = env.step(action)
|
| 146 |
|
| 147 |
self.assertEqual(obs2.tickets_processed, 1)
|
| 148 |
+
if obs2.done:
|
| 149 |
+
self.assertEqual(obs2.queue_position, 0)
|
| 150 |
+
else:
|
| 151 |
+
self.assertEqual(obs2.queue_position, 2)
|
| 152 |
|
| 153 |
def test_step_reward_in_unit_interval(self) -> None:
|
| 154 |
from models import HelpdeskTicketAction
|
tests/test_extra_fields_penalty.py
CHANGED
|
@@ -151,32 +151,31 @@ class TestExtraFieldsPenalty(unittest.TestCase):
|
|
| 151 |
self.assertIsInstance(obs, HelpdeskTicketObservation)
|
| 152 |
|
| 153 |
def test_extra_fields_done_flag_set_correctly_on_last_ticket(self) -> None:
|
| 154 |
-
"""When the penalty step is on the last ticket, done
|
| 155 |
env = _make_env()
|
| 156 |
-
# Use a queue of size 1 by controlling the seed — find a seed that gives queue_size=1
|
| 157 |
-
# Instead, exhaust all but the last ticket normally, then trigger penalty on last
|
| 158 |
obs = env.reset(seed=42, task_id=1)
|
| 159 |
queue_size = obs.queue_size
|
|
|
|
| 160 |
|
| 161 |
# Process all tickets except the last one normally
|
| 162 |
for _ in range(queue_size - 1):
|
| 163 |
-
|
| 164 |
-
|
| 165 |
-
|
| 166 |
-
action_kwargs["issue_type"] = ISSUE_TYPES[0]
|
| 167 |
-
if "priority" in allowed:
|
| 168 |
-
action_kwargs["priority"] = PRIORITIES[0]
|
| 169 |
-
obs = env.step(HelpdeskTicketAction(**action_kwargs))
|
| 170 |
|
| 171 |
# Now trigger penalty on the last ticket
|
|
|
|
|
|
|
| 172 |
action = HelpdeskTicketAction(
|
| 173 |
-
issue_type=
|
| 174 |
assignment_group=ASSIGNMENT_GROUPS[0], # extra field
|
| 175 |
)
|
| 176 |
final_obs = env.step(action)
|
| 177 |
|
| 178 |
self.assertTrue(final_obs.done)
|
| 179 |
-
|
|
|
|
|
|
|
| 180 |
|
| 181 |
|
| 182 |
if __name__ == "__main__":
|
|
|
|
| 151 |
self.assertIsInstance(obs, HelpdeskTicketObservation)
|
| 152 |
|
| 153 |
def test_extra_fields_done_flag_set_correctly_on_last_ticket(self) -> None:
|
| 154 |
+
"""When the penalty step is on the last ticket, done stays True and reward stays episode-level."""
|
| 155 |
env = _make_env()
|
|
|
|
|
|
|
| 156 |
obs = env.reset(seed=42, task_id=1)
|
| 157 |
queue_size = obs.queue_size
|
| 158 |
+
tickets_by_id = env._tickets_by_id # noqa: SLF001 - test-only inspection
|
| 159 |
|
| 160 |
# Process all tickets except the last one normally
|
| 161 |
for _ in range(queue_size - 1):
|
| 162 |
+
current_ticket_id = obs.current_ticket["ticket_id"]
|
| 163 |
+
current_ticket = tickets_by_id[current_ticket_id]
|
| 164 |
+
obs = env.step(HelpdeskTicketAction(issue_type=current_ticket.issue_type))
|
| 165 |
|
| 166 |
# Now trigger penalty on the last ticket
|
| 167 |
+
current_ticket_id = obs.current_ticket["ticket_id"]
|
| 168 |
+
current_ticket = tickets_by_id[current_ticket_id]
|
| 169 |
action = HelpdeskTicketAction(
|
| 170 |
+
issue_type=current_ticket.issue_type,
|
| 171 |
assignment_group=ASSIGNMENT_GROUPS[0], # extra field
|
| 172 |
)
|
| 173 |
final_obs = env.step(action)
|
| 174 |
|
| 175 |
self.assertTrue(final_obs.done)
|
| 176 |
+
expected_reward = (queue_size - 1) / queue_size
|
| 177 |
+
self.assertAlmostEqual(final_obs.reward, expected_reward, places=9)
|
| 178 |
+
self.assertAlmostEqual(env.state.total_reward, expected_reward, places=9)
|
| 179 |
|
| 180 |
|
| 181 |
if __name__ == "__main__":
|
tests/test_inference_unit.py
CHANGED
|
@@ -163,6 +163,22 @@ class InferenceUnitTests(unittest.TestCase):
|
|
| 163 |
)
|
| 164 |
)
|
| 165 |
|
| 166 |
|
| 167 |
if __name__ == "__main__":
|
| 168 |
unittest.main()
|
|
|
|
| 163 |
)
|
| 164 |
)
|
| 165 |
|
| 166 |
+
def test_default_task_selection_runs_single_first_task(self) -> None:
|
| 167 |
+
inference = _load_inference_module()
|
| 168 |
+
|
| 169 |
+
self.assertEqual(
|
| 170 |
+
inference.get_tasks_to_run({1: {}, 2: {}, 3: {}}),
|
| 171 |
+
[1],
|
| 172 |
+
)
|
| 173 |
+
|
| 174 |
+
def test_run_all_tasks_override_keeps_local_batch_mode_available(self) -> None:
|
| 175 |
+
inference = _load_inference_module({"RUN_ALL_TASKS": "1"})
|
| 176 |
+
|
| 177 |
+
self.assertEqual(
|
| 178 |
+
inference.get_tasks_to_run({1: {}, 2: {}, 3: {}}),
|
| 179 |
+
[1, 2, 3],
|
| 180 |
+
)
|
| 181 |
+
|
| 182 |
|
| 183 |
if __name__ == "__main__":
|
| 184 |
unittest.main()
|