Coding Ninja committed on
Commit c64d203 · 1 Parent(s): 6c5051f

Finalize gap fixes and lightweight competitive upgrades

KNOWLEDGE.md CHANGED
@@ -24,7 +24,7 @@ IT helpdesk routing is a strong hackathon fit because it is:
24
  - deterministic to grade
25
  - naturally multi-step
26
 
27
- A helpdesk agent has to decide what the ticket is about, how urgent it is, who should own it, and what should happen next. That maps cleanly to a typed action object.
28
 
29
  ## The Repo In One Sentence
30
 
@@ -134,7 +134,7 @@ Important fields:
134
 
135
  ### `HelpdeskTicketAction`
136
 
137
- Represents the agent submission. Fields are optional because different tasks score different subsets.
138
 
139
  ### `HelpdeskTicketObservation`
140
 
@@ -142,6 +142,7 @@ Represents what the agent sees for each step:
142
 
143
  - task metadata
144
  - visible ticket fields
 
145
  - queue progress
146
  - score history
147
 
@@ -179,10 +180,19 @@ The observation exposes:
179
 
180
  - task metadata
181
  - the current ticket
 
 
 
182
  - queue progress counters
183
  - history
184
  - reward and done status
185
 
 
 
 
 
 
 
186
  The state tracks:
187
 
188
  - current task
@@ -191,12 +201,13 @@ The state tracks:
191
  - current ticket index
192
  - per-ticket scores
193
  - total reward
 
194
 
195
  ## Task Design
196
 
197
  ### Task 1: Issue Type Classification
198
 
199
- The agent predicts:
200
 
201
  - `issue_type`
202
 
@@ -206,7 +217,7 @@ Purpose:
206
 
207
  ### Task 2: Issue Type And Priority
208
 
209
- The agent predicts:
210
 
211
  - `issue_type`
212
  - `priority`
@@ -217,7 +228,7 @@ Purpose:
217
 
218
  ### Task 3: Full Ticket Routing
219
 
220
- The agent predicts:
221
 
222
  - `issue_type`
223
  - `priority`
@@ -256,14 +267,14 @@ This is now proven in checked-in unit tests rather than left as a docs claim.
256
 
257
  Step reward:
258
 
259
- - current ticket score clamped to `[0.0, 1.0]`
260
 
261
  Final reward:
262
 
263
  - average of ticket scores
264
- - minus a small overshoot penalty for taking more steps than the queue length
265
 
266
- This gives dense feedback while still rewarding efficient episode completion.
267
 
268
  ## Dataset Mental Model
269
 
@@ -277,6 +288,8 @@ Current structure:
277
  - harder ambiguous cases
278
  - follow-up tickets connected through `related_ticket_id`
279
 
 
 
280
  The dataset is meant to test routing judgment, not just keyword spotting.
281
 
282
  ## Grounding Note
@@ -299,16 +312,18 @@ It:
299
 
300
  1. connects to the environment
301
  2. loads the available tasks
302
- 3. runs one episode per task
303
  4. picks an action for each ticket
304
  5. sends the action back through the client
305
  6. records rewards
306
- 7. prints a task-by-task summary
307
 
308
  It supports:
309
 
310
  - heuristic mode with no external model
311
  - LLM mode through an OpenAI-compatible API
 
 
312
 
313
  ## Files That Matter Most
314
 
@@ -374,16 +389,26 @@ That follow-up pass added the remaining Roopal-owned public-clarity items:
374
  - an internal grounding note tying the label space to public IT-support datasets
375
  - a refreshed compliance snapshot in `required.md`
376
 
377
- The optional TRL / GRPO README example was intentionally deferred because the shared runtime-validation gates are not all green yet.
 
 
 
 
 
 
378
 
379
- ## What Still Needs Hands-On Verification
 
 
 
 
380
 
381
- The biggest remaining checks are packaging and clean-machine checks, not merge-state local execution.
382
 
383
- Still pending:
384
 
385
- 1. confirm Docker starts cleanly
386
- 2. do a clean-machine dry run if possible
387
 
388
  ## One-Minute Summary
389
 
@@ -396,4 +421,4 @@ If you come back to this repo later, remember:
396
  - the agent predicts structured routing fields
397
  - the grader gives deterministic partial credit
398
  - `inference.py` is the baseline agent runner
399
- - merged-state local validation is complete, and Docker is the main remaining hands-on check
 
24
  - deterministic to grade
25
  - naturally multi-step
26
 
27
+ A helpdesk agent has to decide what the ticket is about, how urgent it is, who should own it, and what should happen next. The current runtime now supports a small two-mode action object: investigate first when needed, then submit the final routing answer.
28
 
29
  ## The Repo In One Sentence
30
 
 
134
 
135
  ### `HelpdeskTicketAction`
136
 
137
+ Represents the agent step. `action_type="submit"` carries routing fields, while `action_type="investigate"` uses a small built-in tool surface before the final submission.
138
 
139
  ### `HelpdeskTicketObservation`
140
 
 
142
 
143
  - task metadata
144
  - visible ticket fields
145
+ - optional ambiguity or follow-up context
146
  - queue progress
147
  - score history
148
 
 
180
 
181
  - task metadata
182
  - the current ticket
183
+ - available investigation tools
184
+ - remaining free investigation budget
185
+ - the latest tool result, when one was requested
186
  - queue progress counters
187
  - history
188
  - reward and done status
189
 
190
+ Useful queue counters now include:
191
+
192
+ - `tickets_remaining`: not-yet-processed tickets, including the current ticket when one is active
193
+ - `tickets_after_current`: how many tickets remain after the current one
194
+ - `queue_position`: 1-based position of the current ticket in the queue
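The relationship between the three counters above can be sketched as follows. This is a minimal illustration, assuming a zero-based `current_ticket_index` and an active ticket; the helper name `queue_counters` is hypothetical and not taken from the repo.

```python
def queue_counters(queue_size: int, current_ticket_index: int) -> dict:
    """Derive the queue counters for an active ticket (hypothetical helper).

    Assumes current_ticket_index is zero-based and points at the ticket
    the agent is currently looking at.
    """
    if not 0 <= current_ticket_index < queue_size:
        raise ValueError("current_ticket_index must point at an active ticket")
    tickets_after_current = queue_size - current_ticket_index - 1
    return {
        # 1-based position of the current ticket in the queue
        "queue_position": current_ticket_index + 1,
        # tickets still waiting behind the current one
        "tickets_after_current": tickets_after_current,
        # not-yet-processed tickets, counting the current ticket as well
        "tickets_remaining": tickets_after_current + 1,
    }
```

Under these assumptions, `tickets_remaining` is always exactly one more than `tickets_after_current` while a ticket is active.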
195
+
196
  The state tracks:
197
 
198
  - current task
 
201
  - current ticket index
202
  - per-ticket scores
203
  - total reward
204
+ - investigation step count
205
 
206
  ## Task Design
207
 
208
  ### Task 1: Issue Type Classification
209
 
210
+ The agent ultimately predicts:
211
 
212
  - `issue_type`
213
 
 
217
 
218
  ### Task 2: Issue Type And Priority
219
 
220
+ The agent ultimately predicts:
221
 
222
  - `issue_type`
223
  - `priority`
 
228
 
229
  ### Task 3: Full Ticket Routing
230
 
231
+ The agent ultimately predicts:
232
 
233
  - `issue_type`
234
  - `priority`
 
267
 
268
  Step reward:
269
 
270
+ - current ticket score with a small milestone bonus for strong steps and a small penalty for very weak steps
271
 
272
  Final reward:
273
 
274
  - average of ticket scores
275
+ - minus a tiny penalty only if the agent exceeds the free investigation budget for the queue
276
 
277
+ This keeps the reward dense and deterministic, removes the dead overshoot logic, and adds a small queue-level economics signal without disturbing the no-tool baseline path.
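As a rough sketch of the milestone shaping described for the step reward: the thresholds and the bonus and penalty sizes below are placeholder assumptions, since the diff does not state them, and `shaped_step_reward` is a hypothetical name.

```python
def shaped_step_reward(
    ticket_score: float,
    strong_threshold: float = 0.9,  # assumed cut-off for a "strong" step
    weak_threshold: float = 0.2,    # assumed cut-off for a "very weak" step
    bonus: float = 0.05,            # assumed bonus size
    penalty: float = 0.05,          # assumed penalty size
) -> float:
    """Apply a small bonus or penalty to a ticket score, then clamp to [0, 1]."""
    reward = ticket_score
    if ticket_score >= strong_threshold:
        reward += bonus
    elif ticket_score <= weak_threshold:
        reward -= penalty
    # The result always stays inside the normalized reward range.
    return min(1.0, max(0.0, reward))
```

Mid-range scores pass through unchanged, so the shaping only affects the extremes.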
278
 
279
  ## Dataset Mental Model
280
 
 
288
  - harder ambiguous cases
289
  - follow-up tickets connected through `related_ticket_id`
290
 
291
+ When a follow-up link exists, the observation can now surface a lightweight `related_ticket_preview`, and the tool layer can fetch richer related-ticket or requester-history context so the agent does not have to route every ticket from isolated text alone.
292
+
293
  The dataset is meant to test routing judgment, not just keyword spotting.
294
 
295
  ## Grounding Note
 
312
 
313
  1. connects to the environment
314
  2. loads the available tasks
315
+ 3. runs one episode for the requested task
316
  4. picks an action for each ticket
317
  5. sends the action back through the client
318
  6. records rewards
319
+ 7. prints structured logs for that run
320
 
321
  It supports:
322
 
323
  - heuristic mode with no external model
324
  - LLM mode through an OpenAI-compatible API
325
+ - lightweight investigation-tool calls before the final submit action
326
+ - an explicit local `RUN_ALL_TASKS=1` override when you want the old multi-task sweep
327
 
328
  ## Files That Matter Most
329
 
 
389
  - an internal grounding note tying the label space to public IT-support datasets
390
  - a refreshed compliance snapshot in `required.md`
391
 
392
+ The TRL / GRPO README example remains intentionally deferred: it is optional and lower priority than freeze-phase stability.
393
+
394
+ ## April 3-7 Status
395
+
396
+ The roadmap through April 7 is now closed in the current repo state.
397
+
398
+ That means the repo now has:
399
 
400
+ 1. checked-in unit, smoke, and integration tests
401
+ 2. Docker smoke coverage through the GitHub Actions workflow
402
+ 3. a clean-copy install-and-run pass
403
+ 4. structured `inference.py` logging verification
404
+ 5. a passing local `openenv validate` result after checking in `uv.lock`
405
 
406
+ ## Submission-Day Reminders
407
 
408
+ The remaining work belongs to the April 8 submission window rather than the April 3 to April 7 implementation window:
409
 
410
+ 1. rerun the final sanity slice on the submission branch
411
+ 2. verify the live Hugging Face Space ping and reset path after the final push if a fresh deployment is created
412
 
413
  ## One-Minute Summary
414
 
 
421
  - the agent predicts structured routing fields
422
  - the grader gives deterministic partial credit
423
  - `inference.py` is the baseline agent runner
424
+ - merged-state validation, Docker smoke coverage, clean-copy rerun, and local validator readiness are all now in place
README.md CHANGED
@@ -34,7 +34,7 @@ The environment models a realistic helpdesk workflow:
34
 
35
  1. a new ticket enters the queue
36
  2. the agent reads the ticket title and description
37
- 3. the agent predicts structured routing fields
38
  4. the grader assigns deterministic credit
39
  5. the environment advances to the next ticket until the queue is complete
40
 
@@ -43,7 +43,7 @@ This domain is useful for OpenEnv because it is operationally realistic, easy to
43
  ## Why This Is A Good Hackathon Domain
44
 
45
  - it reflects real enterprise support operations
46
- - the action space is structured and judge-friendly
47
  - correctness can be scored deterministically
48
  - the hard task is meaningfully harder than the easy and medium tasks
49
  - the environment is small enough to rerun quickly
@@ -55,7 +55,7 @@ The project uses a queue-based episode model.
55
  - `reset()` samples a task and a queue of 3 to 5 tickets
56
  - `step()` grades one ticket submission at a time
57
  - `state()` exposes the internal episode snapshot
58
- - final reward is based on average ticket quality with a small overshoot penalty
59
 
60
  The environment classes and vocabulary are intentionally frozen to keep collaboration and judging simple.
61
 
@@ -115,6 +115,9 @@ Visible ticket fields:
115
  - `title`
116
  - `requester`
117
  - `description`
 
 
 
118
 
119
  Each observation also includes:
120
 
@@ -122,9 +125,14 @@ Each observation also includes:
122
  - `task_name`
123
  - `instructions`
124
  - `allowed_fields`
 
 
 
125
  - `queue_size`
126
  - `tickets_remaining`
 
127
  - `tickets_processed`
 
128
  - `history`
129
  - standard OpenEnv fields such as `done` and `reward`
130
 
@@ -138,11 +146,23 @@ The internal `HelpdeskTicketState` tracks:
138
  - `current_ticket_index`
139
  - `per_ticket_scores`
140
  - `total_reward`
 
 
141
 
142
  ## Grading And Reward
143
 
144
  Scoring is deterministic and normalized to `[0.0, 1.0]`.
145
 
 
 
 
 
 
 
 
 
 
 
146
  Per-field behavior:
147
 
148
  - `issue_type`: exact match, with a few near-miss partial-credit pairs
@@ -161,11 +181,15 @@ Task weights:
161
  Final episode reward:
162
 
163
  ```text
164
- average(per_ticket_scores) - 0.03 * max(0, steps_taken - queue_size)
165
  ```
166
 
167
  The result is clamped to `[0.0, 1.0]`.
168
 
 
 
 
 
169
  ## Grounded Scoring
170
 
171
  The grader is intentionally not fuzzy by default.
@@ -285,7 +309,7 @@ curl http://localhost:7860/tasks
285
 
286
  ## Running The Baseline Inference Script
287
 
288
- The baseline script supports two modes.
289
 
290
  ### Heuristic mode
291
 
@@ -295,6 +319,12 @@ If no LLM credentials are set, it uses a keyword-based ticket router:
295
  python inference.py
296
  ```
297
 
 
 
 
 
 
 
298
  ### LLM mode
299
 
300
  Set these environment variables first:
@@ -313,6 +343,14 @@ Optional target:
313
 
314
  - `ENV_URL`
315
  - default value: `http://localhost:7860`
 
 
 
 
 
 
 
 
316
 
317
  ## Runtime Validation Snapshot
318
 
@@ -324,7 +362,7 @@ Validated locally:
324
  - `/health`
325
  - `/tasks`
326
  - `/reset`
327
- - heuristic `inference.py` run across all 3 tasks
328
 
329
  Current local heuristic results:
330
 
@@ -335,7 +373,7 @@ Current local heuristic results:
335
  | Full Ticket Routing | `0.9400` |
336
  | Overall | `0.9400` |
337
 
338
- The merged-state rerun matched these same numbers exactly, so they are the current benchmark reference for the repo. A Docker smoke test and clean-machine rerun are still recommended before final submission freeze.
339
 
340
  ### Windows note
341
 
@@ -358,7 +396,7 @@ docker run -p 7860:7860 helpdesk-ticket-routing
358
  Then run inference against it (default `ENV_URL` points to `http://localhost:7860`):
359
 
360
  ```bash
361
- python inference.py
362
  ```
363
 
364
  If you publish the container on a different host port, set `ENV_URL` accordingly before running `inference.py`.
@@ -376,6 +414,7 @@ OpenEnv provides the core environment endpoints, and the repo adds a custom task
376
  | POST | `/step` | submit an action |
377
  | GET | `/state` | inspect internal state |
378
  | GET | `/tasks` | list task metadata |
 
379
  | GET | `/docs` | interactive API docs |
380
 
381
  ## Submission Readiness
@@ -397,11 +436,17 @@ An April 6 repo audit also confirmed that all required submission files are pres
397
  - data and metadata: `data/dataset.json`, `openenv.yaml`, `pyproject.toml`, `requirements.txt`, `server/Dockerfile`
398
  - docs and planning: `README.md`, `KNOWLEDGE.md`, `required.md`, `PROJECT_STATUS.md`, `ROADMAP.md`
399
 
400
- Still pending before final submission:
 
 
 
 
 
 
 
 
401
 
402
- - a Docker smoke test from a machine with Docker installed
403
- - `openenv validate` evidence on the current merged repo state
404
- - structured `inference.py` log-format verification on the current merged repo state
405
- - a final clean-machine dry run if possible before submission freeze
406
 
407
- The short TRL / GRPO README example from the roadmap is intentionally deferred until the shared runtime and validation gates are green.
 
34
 
35
  1. a new ticket enters the queue
36
  2. the agent reads the ticket title and description
37
+ 3. the agent may investigate with lightweight tools, then submit structured routing fields
38
  4. the grader assigns deterministic credit
39
  5. the environment advances to the next ticket until the queue is complete
40
 
 
43
  ## Why This Is A Good Hackathon Domain
44
 
45
  - it reflects real enterprise support operations
46
+ - the action space is structured and judge-friendly, with a small investigate-versus-submit split
47
  - correctness can be scored deterministically
48
  - the hard task is meaningfully harder than the easy and medium tasks
49
  - the environment is small enough to rerun quickly
 
55
  - `reset()` samples a task and a queue of 3 to 5 tickets
56
  - `step()` grades one ticket submission at a time
57
  - `state()` exposes the internal episode snapshot
58
+ - final reward is based on average ticket quality across the queue
59
 
60
  The environment classes and vocabulary are intentionally frozen to keep collaboration and judging simple.
61
 
 
115
  - `title`
116
  - `requester`
117
  - `description`
118
+ - optional `ambiguity_note`
119
+ - optional `related_ticket_id`
120
+ - optional `related_ticket_preview`
121
 
122
  Each observation also includes:
123
 
 
125
  - `task_name`
126
  - `instructions`
127
  - `allowed_fields`
128
+ - `available_tools`
129
+ - `investigation_budget_remaining`
130
+ - `last_tool_result`
131
  - `queue_size`
132
  - `tickets_remaining`
133
+ - `tickets_after_current`
134
  - `tickets_processed`
135
+ - `queue_position`
136
  - `history`
137
  - standard OpenEnv fields such as `done` and `reward`
138
 
 
146
  - `current_ticket_index`
147
  - `per_ticket_scores`
148
  - `total_reward`
149
+ - `reward`
150
+ - `done`
151
 
152
  ## Grading And Reward
153
 
154
  Scoring is deterministic and normalized to `[0.0, 1.0]`.
155
 
156
+ The action model now supports two paths:
157
+
158
+ - `action_type="submit"` for the final routing answer
159
+ - `action_type="investigate"` with a small built-in tool surface before submission
160
+
161
+ Available tools:
162
+
163
+ - `lookup_related_ticket`
164
+ - `lookup_requester_history`
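One way to picture the two action paths is the small validation sketch below. It is not the repo's `HelpdeskTicketAction`: the class name `SketchTicketAction` and the `tool_name` and `routing_fields` attributes are assumptions made for illustration; only `action_type` and the two tool names come from the text above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Tool names as listed above; everything else in this sketch is illustrative.
AVAILABLE_TOOLS = frozenset({"lookup_related_ticket", "lookup_requester_history"})


@dataclass
class SketchTicketAction:
    """Illustrative stand-in for the two-mode action (not the repo's class)."""

    action_type: str = "submit"
    tool_name: Optional[str] = None  # hypothetical field name
    routing_fields: Dict[str, str] = field(default_factory=dict)  # hypothetical container

    def validate(self) -> None:
        """Reject combinations that mix the investigate and submit paths."""
        if self.action_type == "investigate":
            if self.tool_name not in AVAILABLE_TOOLS:
                raise ValueError(f"unknown tool: {self.tool_name!r}")
            if self.routing_fields:
                raise ValueError("investigate steps carry no routing fields")
        elif self.action_type == "submit":
            if self.tool_name is not None:
                raise ValueError("submit steps do not name a tool")
        else:
            raise ValueError(f"unknown action_type: {self.action_type!r}")
```

Keeping both paths in one object, discriminated by `action_type`, matches the description of a single action model with a small investigate-versus-submit split.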
165
+
166
  Per-field behavior:
167
 
168
  - `issue_type`: exact match, with a few near-miss partial-credit pairs
 
181
  Final episode reward:
182
 
183
  ```text
184
+ average(per_ticket_scores)
185
  ```
186
 
187
  The result is clamped to `[0.0, 1.0]`.
188
 
189
+ Step reward is lightly milestone-shaped: high per-ticket scores get a small bonus and very low scores get a small penalty before the final clamp.
190
+
191
+ Final reward also includes a tiny queue-economics penalty only when the agent exceeds the free investigation budget. One investigation per queued ticket is free; extra investigation steps reduce the final reward slightly.
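The final-reward rule can be sketched as below. The per-step penalty value is an assumption (the README only calls it "tiny"), the function name is hypothetical, and the sketch assumes the queue size equals the number of scored tickets at episode end.

```python
from typing import Sequence


def final_episode_reward(
    per_ticket_scores: Sequence[float],
    investigation_steps: int,
    penalty_per_extra_step: float = 0.01,  # assumed value, not from the repo
) -> float:
    """Average ticket score minus a penalty for over-budget investigations."""
    if not per_ticket_scores:
        return 0.0
    average = sum(per_ticket_scores) / len(per_ticket_scores)
    # One investigation per queued ticket is free.
    free_budget = len(per_ticket_scores)
    extra_steps = max(0, investigation_steps - free_budget)
    # Clamp to the normalized [0.0, 1.0] range.
    return min(1.0, max(0.0, average - penalty_per_extra_step * extra_steps))
```

An agent that never investigates, or stays within the free budget, receives exactly the average of its ticket scores, which is why the no-tool baseline path is unaffected.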
192
+
193
  ## Grounded Scoring
194
 
195
  The grader is intentionally not fuzzy by default.
 
309
 
310
  ## Running The Baseline Inference Script
311
 
312
+ The baseline script supports single-task evaluator mode by default, plus an explicit local batch override.
313
 
314
  ### Heuristic mode
315
 
 
319
  python inference.py
320
  ```
321
 
322
+ By default that runs exactly one task and emits exactly one `[START] ... [END]` block. To target a specific task:
323
+
324
+ ```bash
325
+ TASK_ID=3 python inference.py
326
+ ```
327
+
328
  ### LLM mode
329
 
330
  Set these environment variables first:
 
343
 
344
  - `ENV_URL`
345
  - default value: `http://localhost:7860`
346
+ - `TASK_ID`
347
+ - `RUN_ALL_TASKS`
348
+
349
+ To reproduce the multi-task local benchmark sweep:
350
+
351
+ ```bash
352
+ RUN_ALL_TASKS=1 python inference.py
353
+ ```
354
 
355
  ## Runtime Validation Snapshot
356
 
 
362
  - `/health`
363
  - `/tasks`
364
  - `/reset`
365
+ - heuristic `inference.py` run across all 3 tasks with `RUN_ALL_TASKS=1`
366
 
367
  Current local heuristic results:
368
 
 
373
  | Full Ticket Routing | `0.9400` |
374
  | Overall | `0.9400` |
375
 
376
+ The merged-state rerun matched these same numbers exactly, so they are the current benchmark reference for the repo. The April 6 to April 7 validation pass then closed the remaining roadmap gates with Docker smoke coverage via GitHub Actions, a clean-copy install-and-run rerun, structured inference-log verification, and a passing local `openenv validate` check after checking in `uv.lock`.
377
 
378
  ### Windows note
379
 
 
396
  Then run inference against it (default `ENV_URL` points to `http://localhost:7860`):
397
 
398
  ```bash
399
+ RUN_ALL_TASKS=1 python inference.py
400
  ```
401
 
402
  If you publish the container on a different host port, set `ENV_URL` accordingly before running `inference.py`.
 
414
  | POST | `/step` | submit an action |
415
  | GET | `/state` | inspect internal state |
416
  | GET | `/tasks` | list task metadata |
417
+ | GET | `/web` | lightweight HF Space UI |
418
  | GET | `/docs` | interactive API docs |
419
 
420
  ## Submission Readiness
 
436
  - data and metadata: `data/dataset.json`, `openenv.yaml`, `pyproject.toml`, `requirements.txt`, `server/Dockerfile`
437
  - docs and planning: `README.md`, `KNOWLEDGE.md`, `required.md`, `PROJECT_STATUS.md`, `ROADMAP.md`
438
 
439
+ Roadmap status through April 7 is complete:
440
+
441
+ - unit, smoke, and integration tests are checked in and green
442
+ - Docker smoke coverage exists through `.github/workflows/docker-smoke-test.yml`
443
+ - `openenv validate` now passes on the current repo state
444
+ - structured `inference.py` logging is verified by tests and the merged-state rerun
445
+ - a clean-copy install-and-run pass has been completed
446
+
447
+ The remaining April 8 work is operational rather than implementation-heavy:
448
 
449
+ - run the final submission-branch sanity slice before pushing
450
+ - perform the live Hugging Face Space ping and reset check on the deployed submission artifact if a fresh deployment is created
 
 
451
 
452
+ The short TRL / GRPO README example from the roadmap remains intentionally deferred: it is optional and lower priority than freeze-phase stability.
ROADMAP.md CHANGED
@@ -11,10 +11,39 @@
11
  ## How To Use This File
12
 
13
  - `PROJECT_STATUS.md` is the canonical log of completed work.
14
- - This roadmap is the remaining execution plan from the current repo state to final submission.
15
  - `required.md` is now the combined official-requirements and project-compliance file.
16
  - `KNOWLEDGE.md` defines the current repo truth and judge-facing explanation.
17
- - `analysis/comp.md` and `analysis/comp_know.md` are internal competitive notes only. Use them to prioritize work, but do not mention competitor repos in public-facing docs.
18
 
19
  ## What We Are Optimizing For
20
 
@@ -47,14 +76,51 @@ The repo already has:
47
  - deterministic grading with limited partial credit
48
  - working heuristic baseline
49
  - merged local validation on `/health`, `/tasks`, and `inference.py`
50
- - current local benchmark reference:
51
- - Task 1: `1.0000`
52
- - Task 2: `0.8800`
53
- - Task 3: `0.9400`
54
- - Overall: `0.9400`
 
 
 
55
 
56
  The remaining work should be treated as targeted strengthening, not broad feature invention.
57
 
58
  ## Submission Gates That Must Still Hold
59
 
60
  These come directly from `required.md` and `KNOWLEDGE.md`:
@@ -114,7 +180,7 @@ Because we are using Codex to generate code, we should optimize for small, bound
114
 
115
  **Window:** April 3 to April 4
116
 
117
- **Goal:** eliminate the biggest competitive weakness identified in `analysis/comp.md` and `analysis/comp_know.md`: lack of checked-in tests.
118
 
119
  ### Must produce
120
 
@@ -182,7 +248,7 @@ Because we are using Codex to generate code, we should optimize for small, bound
182
  - assignment group and resolution action remain exact
183
  - final episode reward stays bounded and deterministic
184
 
185
- ### Safe improvement candidates from `analysis/comp_know.md`
186
 
187
  - expand `ISSUE_TYPE_SIMILARITY` with only a few defensible pairs, if backed by grounding review
188
  - enrich `history` with:
@@ -237,7 +303,7 @@ Because we are using Codex to generate code, we should optimize for small, bound
237
 
238
  **Window:** April 6 to April 7
239
 
240
- **Goal:** close the submission-readiness gaps surfaced in `analysis/comp_know.md`.
241
 
242
  ### Must produce
243
 
 
11
  ## How To Use This File
12
 
13
  - `PROJECT_STATUS.md` is the canonical log of completed work.
14
+ - This roadmap is the active plan from the verified April 6, 2026 repo state to final submission.
15
  - `required.md` is now the combined official-requirements and project-compliance file.
16
  - `KNOWLEDGE.md` defines the current repo truth and judge-facing explanation.
17
+ - `analysis/competition_notes.md` is the merged internal competitive note. Use it to prioritize work, but do not mention competitor repos in public-facing docs.
18
+ - The dated April 3 to April 5 sections below are now historical context; the active execution block is the final 24-hour plan for April 6 to April 7, 2026.
19
+
20
+ ## Status As Of April 6, 2026
21
+
22
+ The repo is now in the expected "stabilize and merge" phase rather than the earlier "build core fixes" phase.
23
+
24
+ Completed and locally verified:
25
+
26
+ - all concrete items from `gaps.md`
27
+ - the viable low-risk improvements from `analysis/deep_competitive_gap_report.md`
28
+ - single-task `inference.py` execution with `TASK_ID` support and optional `RUN_ALL_TASKS=1`
29
+ - `state()` exposure of `reward` and `done`
30
+ - richer history with predicted actions and follow-up context
31
+ - lightweight investigate-versus-submit action support with tool-backed context lookup
32
+ - small queue-economics signal without major benchmark redesign
33
+ - `/web` UI route
34
+ - local full test pass:
35
+ - `126 passed, 137 subtests passed`
36
+ - local validator pass:
37
+ - `[OK] meta-AIHack: Ready for multi-mode deployment`
38
+
39
+ Merge recommendation:
40
+
41
+ - mergeable as an incremental submission-ready improvement branch
42
+ - do not block merge on major redesign items that were explicitly out of scope:
43
+ - scenario-family task redesign
44
+ - breaking the issue-type-to-assignment shortcut
45
+ - large dataset expansion
46
+ - full queue simulator / economics redesign
47
 
48
  ## What We Are Optimizing For
49
 
 
76
  - deterministic grading with limited partial credit
77
  - working heuristic baseline
78
  - merged local validation on `/health`, `/tasks`, and `inference.py`
79
+ - single-task evaluator-safe inference behavior
80
+ - reward and done fields on `state()`
81
+ - richer observation history and linked-ticket context
82
+ - lightweight investigate / submit split with small built-in tool support
83
+ - local full-suite verification:
84
+ - `126 passed, 137 subtests passed`
85
+ - local validator verification:
86
+ - `[OK] meta-AIHack: Ready for multi-mode deployment`
87
 
88
  The remaining work should be treated as targeted strengthening, not broad feature invention.
89
 
90
+ ## Final 24-Hour Plan
91
+
92
+ **Active window:** April 6 to April 7, 2026
93
+ **Internal target:** open PR, merge to the common `main`, and complete the final smoke checks by April 7, 2026
94
+ **Official deadline:** April 8, 2026, 11:59 PM IST
95
+
96
+ ### Must finish before merge
97
+
98
+ - review the final diff and stage only the intended submission files
99
+ - open the merge PR from a dedicated branch
100
+ - merge into the shared `main` after one last reviewer pass
101
+ - rerun the post-merge smoke checks:
102
+ - `pytest`
103
+ - `openenv validate`
104
+ - `/health`
105
+ - `/tasks`
106
+ - one `reset()` / `step()` sanity path
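The endpoint portion of the post-merge smoke checks could be scripted roughly as follows. This is a sketch under stated assumptions: it only probes `/health` and `/tasks`, treats HTTP 200 as success, and uses hypothetical helper names; it does not exercise `reset()` or `step()`.

```python
from typing import Callable, Dict
from urllib.request import urlopen


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request (standard library only)."""
    with urlopen(url, timeout=timeout) as response:
        return response.status


def smoke_check(
    base_url: str,
    fetch_status: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Probe /health and /tasks and report which ones answered with 200."""
    results: Dict[str, bool] = {}
    for path in ("/health", "/tasks"):
        try:
            results[path] = fetch_status(base_url.rstrip("/") + path) == 200
        except OSError:
            # Connection failures and HTTP error responses both count as failed.
            results[path] = False
    return results
```

With a server running locally, `smoke_check("http://localhost:7860")` would return a dictionary with one boolean per endpoint; the `fetch_status` parameter exists so the check can be exercised without a live server.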
107
+
108
+ ### Do not add before merge
109
+
110
+ - no new benchmark redesign work
111
+ - no new dataset expansion
112
+ - no schema churn
113
+ - no reward refactors beyond blocker-level fixes
114
+ - no last-minute inference prompt rewrites
115
+
116
+ ### Success condition for April 7, 2026
117
+
118
+ - PR is up
119
+ - PR is reviewed against `gaps.md` and `analysis/deep_competitive_gap_report.md`
120
+ - shared `main` contains the tested gap-fix branch
121
+ - deployment sanity checks are green
122
+ - repo is frozen except for typo-level fixes
123
+
124
  ## Submission Gates That Must Still Hold
125
 
126
  These come directly from `required.md` and `KNOWLEDGE.md`:
 
180
 
181
  **Window:** April 3 to April 4
182
 
183
+ **Goal:** eliminate the biggest competitive weakness identified in `analysis/competition_notes.md`: lack of checked-in tests.
184
 
185
  ### Must produce
186
 
 
248
  - assignment group and resolution action remain exact
249
  - final episode reward stays bounded and deterministic
250
 
251
+ ### Safe improvement candidates from `analysis/competition_notes.md`
252
 
253
  - expand `ISSUE_TYPE_SIMILARITY` with only a few defensible pairs, if backed by grounding review
254
  - enrich `history` with:
 
303
 
304
  **Window:** April 6 to April 7
305
 
306
+ **Goal:** close the submission-readiness gaps surfaced in `analysis/competition_notes.md`.
307
 
308
  ### Must produce
309
 
inference.py CHANGED
@@ -20,6 +20,15 @@ HF_TOKEN
20
  HuggingFace authentication token for the LLM provider.
21
  No default is set.
22
 
23
  LOCAL_IMAGE_NAME
24
  Optional compatibility variable from the sample inference pattern.
25
  This script does not use ``from_docker_image()``, so the value is unused here.
@@ -65,6 +74,11 @@ ENV_URL = os.getenv("ENV_URL", "http://localhost:7860")
65
 
66
  SEED = 42
67
  TASK_ID_ENV = os.getenv("TASK_ID")
 
 
 
 
 
68
 
69
  # ---------------------------------------------------------------------------
70
  # LLM helper
@@ -99,13 +113,36 @@ Return ONLY valid JSON with the requested fields. No markdown, no explanation.""
99
 
  def call_llm(ticket: dict, allowed_fields: list[str], instructions: str) -> dict:
      assert llm_client is not None, "LLM client not configured"

      user_msg = (
          f"Instructions: {instructions}\n\n"
          f"Allowed fields: {', '.join(allowed_fields)}\n\n"
          f"Title: {ticket['title']}\n"
          f"Requester: {ticket['requester']}\n"
-         f"Description: {ticket['description']}\n\n"
          f"Respond with JSON containing ONLY these fields: {', '.join(allowed_fields)}"
      )
111
 
@@ -135,17 +172,26 @@ def emit_log(tag: str, **payload: Any) -> None:
135
 
136
 
  def get_tasks_to_run(available_tasks: dict) -> list[int]:
      if TASK_ID_ENV:
          try:
              task_id = int(TASK_ID_ENV)
          except ValueError:
              print(f"[ERROR] TASK_ID={TASK_ID_ENV!r} is not a valid integer", flush=True)
              raise SystemExit(1)
-         if task_id not in available_tasks:
-             print(f"[WARN] TASK_ID={task_id} not in available tasks {list(available_tasks)}", flush=True)
-             return []
          return [task_id]
-     return list(TASK_IDS)  # fallback: all tasks (local dev)
149
 
150
 
151
  # ---------------------------------------------------------------------------
@@ -278,7 +324,18 @@ def heuristic_resolution_action(text: str, issue_type: str) -> str:
278
 
279
 
280
  def heuristic_action(ticket: dict, allowed_fields: list[str]) -> dict:
281
- text = (ticket.get("title", "") + " " + ticket.get("description", "")).lower()
282
 
283
  issue_type = "general_inquiry"
284
  for kw, mapped_issue_type in KEYWORD_ISSUE_TYPES.items():
@@ -329,6 +386,31 @@ def build_action(
329
  )
330
 
331
 
332
  # ---------------------------------------------------------------------------
333
  # Main loop using the HTTP-based sync EnvClient for multi-step episodes
334
  # ---------------------------------------------------------------------------
@@ -347,7 +429,9 @@ def run() -> None:
347
  all_results: dict[int, dict[str, float | int]] = {}
348
 
349
  tasks_to_run = get_tasks_to_run(available_tasks)
350
- single_task_mode = bool(TASK_ID_ENV)
 
 
351
 
352
  for task_id in tasks_to_run:
353
  if task_id not in available_tasks:
@@ -377,8 +461,40 @@ def run() -> None:
377
  if ticket is None:
378
  break
379
 
380
  action, action_source, fallback_reason = build_action(
381
- ticket,
382
  obs.allowed_fields,
383
  obs.instructions,
384
  )
 
20
  HuggingFace authentication token for the LLM provider.
21
  No default is set.
22
 
23
+ TASK_ID
24
+ Optional OpenEnv task ID to run. When unset, the script defaults to the
25
+ first available task so it still emits exactly one ``[START]`` ... ``[END]``
26
+ block for evaluator-style runs.
27
+
28
+ RUN_ALL_TASKS
29
+ Optional local-development override. Set to ``1`` to run every available
30
+ task in sequence and print the aggregate closing ``[END]`` summary.
31
+
32
  LOCAL_IMAGE_NAME
33
  Optional compatibility variable from the sample inference pattern.
34
  This script does not use ``from_docker_image()``, so the value is unused here.
 
74
 
  SEED = 42
  TASK_ID_ENV = os.getenv("TASK_ID")
+ RUN_ALL_TASKS_ENV = os.getenv("RUN_ALL_TASKS", "").strip().lower() in {
+     "1",
+     "true",
+     "yes",
+ }
82
 
83
  # ---------------------------------------------------------------------------
84
  # LLM helper
 
113
 
  def call_llm(ticket: dict, allowed_fields: list[str], instructions: str) -> dict:
      assert llm_client is not None, "LLM client not configured"
+     ambiguity_note = ticket.get("ambiguity_note")
+     related_preview = ticket.get("related_ticket_preview") or {}
+     last_tool_result = ticket.get("last_tool_result")
+     extra_context_lines: list[str] = []
+     if ambiguity_note:
+         extra_context_lines.append(f"Ambiguity note: {ambiguity_note}")
+     if related_preview:
+         extra_context_lines.extend(
+             [
+                 "Related ticket preview:",
+                 f"- Title: {related_preview.get('title', '')}",
+                 f"- Requester: {related_preview.get('requester', '')}",
+                 f"- Description: {related_preview.get('description', '')}",
+             ]
+         )
+     if last_tool_result is not None:
+         extra_context_lines.append(
+             "Investigation result: " + json.dumps(last_tool_result, sort_keys=True)
+         )
+     extra_context_block = ""
+     if extra_context_lines:
+         extra_context_block = "\n" + "\n".join(extra_context_lines)

      user_msg = (
          f"Instructions: {instructions}\n\n"
          f"Allowed fields: {', '.join(allowed_fields)}\n\n"
          f"Title: {ticket['title']}\n"
          f"Requester: {ticket['requester']}\n"
+         f"Description: {ticket['description']}"
+         f"{extra_context_block}\n\n"
          f"Respond with JSON containing ONLY these fields: {', '.join(allowed_fields)}"
      )
148
 
 
172
 
173
 
174
  def get_tasks_to_run(available_tasks: dict) -> list[int]:
175
+ available_task_ids = sorted(int(task_id) for task_id in available_tasks)
176
  if TASK_ID_ENV:
177
  try:
178
  task_id = int(TASK_ID_ENV)
179
  except ValueError:
180
  print(f"[ERROR] TASK_ID={TASK_ID_ENV!r} is not a valid integer", flush=True)
181
  raise SystemExit(1)
182
+ if task_id not in available_task_ids:
183
+ print(
184
+ f"[ERROR] TASK_ID={task_id} not in available tasks {available_task_ids}",
185
+ flush=True,
186
+ )
187
+ raise SystemExit(1)
188
  return [task_id]
189
+ if RUN_ALL_TASKS_ENV:
190
+ return available_task_ids
191
+ if not available_task_ids:
192
+ return []
193
+ # Default to a single task so evaluation emits exactly one START/END block.
194
+ return [available_task_ids[0]]
195
 
196
 
197
  # ---------------------------------------------------------------------------
 
324
 
325
 
326
  def heuristic_action(ticket: dict, allowed_fields: list[str]) -> dict:
327
+ related_preview = ticket.get("related_ticket_preview") or {}
328
+ last_tool_result = ticket.get("last_tool_result") or {}
329
+ text = " ".join(
330
+ [
331
+ ticket.get("title", ""),
332
+ ticket.get("description", ""),
333
+ ticket.get("ambiguity_note", ""),
334
+ related_preview.get("title", ""),
335
+ related_preview.get("description", ""),
336
+ json.dumps(last_tool_result, sort_keys=True),
337
+ ]
338
+ ).lower()
339
 
340
  issue_type = "general_inquiry"
341
  for kw, mapped_issue_type in KEYWORD_ISSUE_TYPES.items():
 
386
  )
387
 
388
 
389
+ def should_investigate(ticket: dict, history: list[dict[str, Any]]) -> tuple[bool, str | None]:
390
+ if not ticket:
391
+ return False, None
392
+ current_ticket_id = ticket.get("ticket_id")
393
+ already_investigated = any(
394
+ entry.get("ticket_id") == current_ticket_id
395
+ and entry.get("predicted", {}).get("action_type") == "investigate"
396
+ for entry in history
397
+ )
398
+ if already_investigated:
399
+ return False, None
400
+ if ticket.get("related_ticket_id"):
401
+ return True, "lookup_related_ticket"
402
+ if ticket.get("ambiguity_note"):
403
+ return True, "lookup_requester_history"
404
+ return False, None
405
+
406
+
407
+ def merge_ticket_context(ticket: dict, observation: Any) -> dict:
408
+ merged_ticket = dict(ticket)
409
+ if getattr(observation, "last_tool_result", None) is not None:
410
+ merged_ticket["last_tool_result"] = observation.last_tool_result
411
+ return merged_ticket
412
+
413
+
414
  # ---------------------------------------------------------------------------
415
  # Main loop using the HTTP-based sync EnvClient for multi-step episodes
416
  # ---------------------------------------------------------------------------
 
429
  all_results: dict[int, dict[str, float | int]] = {}
430
 
431
  tasks_to_run = get_tasks_to_run(available_tasks)
432
+ if not tasks_to_run:
433
+ return
434
+ single_task_mode = len(tasks_to_run) == 1
435
 
436
  for task_id in tasks_to_run:
437
  if task_id not in available_tasks:
 
461
  if ticket is None:
462
  break
463
 
464
+ investigate, tool_name = should_investigate(ticket, obs.history)
465
+ if (
466
+ investigate
467
+ and tool_name is not None
468
+ and getattr(obs, "investigation_budget_remaining", 0) > 0
469
+ ):
470
+ tool_action = HelpdeskTicketAction(
471
+ action_type="investigate",
472
+ tool_name=tool_name,
473
+ tool_target_ticket_id=ticket.get("related_ticket_id"),
474
+ )
475
+ result = sync_client.step(tool_action)
476
+ obs = result.observation
477
+ step_num += 1
478
+ emit_log(
479
+ "STEP",
480
+ action=tool_action.model_dump(exclude_none=True),
481
+ action_source="investigation_tool",
482
+ done=bool(result.done),
483
+ fallback_reason=None,
484
+ reward=float(result.reward or 0.0),
485
+ step=step_num,
486
+ task_id=task_id,
487
+ ticket_id=ticket["ticket_id"],
488
+ )
489
+ if result.done:
490
+ break
491
+ ticket = obs.current_ticket
492
+ if ticket is None:
493
+ break
494
+
495
+ ticket_with_context = merge_ticket_context(ticket, obs)
496
  action, action_source, fallback_reason = build_action(
497
+ ticket_with_context,
498
  obs.allowed_fields,
499
  obs.instructions,
500
  )
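
Note on the investigation gate in inference.py: `should_investigate()` allows at most one investigate step per ticket and picks the tool from what the ticket exposes. The sketch below repeats the helper from this diff so it runs on its own; the ticket IDs, the ambiguity text, and the history entries are made up for illustration and are not taken from the dataset.

```python
from typing import Any, Optional


def should_investigate(
    ticket: dict, history: list[dict[str, Any]]
) -> tuple[bool, Optional[str]]:
    # Same decision order as the helper added in this commit.
    if not ticket:
        return False, None
    current_ticket_id = ticket.get("ticket_id")
    # An earlier investigate entry for this ticket blocks a second lookup.
    already_investigated = any(
        entry.get("ticket_id") == current_ticket_id
        and entry.get("predicted", {}).get("action_type") == "investigate"
        for entry in history
    )
    if already_investigated:
        return False, None
    # A linked ticket takes precedence over an ambiguity note.
    if ticket.get("related_ticket_id"):
        return True, "lookup_related_ticket"
    if ticket.get("ambiguity_note"):
        return True, "lookup_requester_history"
    return False, None


# Hypothetical tickets, for illustration only.
follow_up = {"ticket_id": "T-2", "related_ticket_id": "T-1"}
ambiguous = {"ticket_id": "T-3", "ambiguity_note": "Could be VPN or account lockout"}
plain = {"ticket_id": "T-4"}

# History as the environment would report it after one lookup on T-2.
history = [{"ticket_id": "T-2", "predicted": {"action_type": "investigate"}}]

print(should_investigate(follow_up, []))
print(should_investigate(follow_up, history))
print(should_investigate(ambiguous, history))
print(should_investigate(plain, []))
```

In the main loop the gate is additionally guarded by `investigation_budget_remaining > 0`, so a positive answer here does not by itself guarantee that a tool call is sent.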
models.py CHANGED
@@ -16,6 +16,8 @@ ISSUE_TYPE_SET = set(ISSUE_TYPES)
16
  PRIORITY_SET = set(PRIORITIES)
17
  ASSIGNMENT_GROUP_SET = set(ASSIGNMENT_GROUPS)
18
  RESOLUTION_ACTION_SET = set(RESOLUTION_ACTIONS)
 
 
19
 
20
 
21
  def _validate_choice(value: str, allowed: set[str], field_name: str) -> str:
@@ -67,11 +69,24 @@ class HelpdeskTicketRecord(BaseModel):
67
 
68
 
69
  class HelpdeskTicketAction(Action):
 
 
 
70
  issue_type: Optional[str] = None
71
  priority: Optional[str] = None
72
  assignment_group: Optional[str] = None
73
  resolution_action: Optional[str] = None
74
 
75
  @field_validator("issue_type")
76
  @classmethod
77
  def validate_issue_type(cls, value: Optional[str]) -> Optional[str]:
@@ -98,10 +113,15 @@ class HelpdeskTicketObservation(Observation):
98
  task_name: str = ""
99
  instructions: str = ""
100
  allowed_fields: list[str] = Field(default_factory=list)
101
- current_ticket: Optional[dict[str, str]] = None
 
 
 
102
  queue_size: int = 0
103
  tickets_remaining: int = 0
 
104
  tickets_processed: int = 0
 
105
  history: list[dict[str, Any]] = Field(default_factory=list)
106
 
107
 
@@ -116,4 +136,7 @@ class HelpdeskTicketState(State):
116
  # `reward` is the field the evaluator checks on GET /state (mentor spec)
117
  reward: Optional[float] = None
118
  done: bool = False
 
 
 
119
  history_entries: list[dict] = Field(default_factory=list)
 
16
  PRIORITY_SET = set(PRIORITIES)
17
  ASSIGNMENT_GROUP_SET = set(ASSIGNMENT_GROUPS)
18
  RESOLUTION_ACTION_SET = set(RESOLUTION_ACTIONS)
19
+ ACTION_TYPE_SET = {"submit", "investigate"}
20
+ TOOL_NAME_SET = {"lookup_related_ticket", "lookup_requester_history"}
21
 
22
 
23
  def _validate_choice(value: str, allowed: set[str], field_name: str) -> str:
 
69
 
70
 
71
  class HelpdeskTicketAction(Action):
72
+ action_type: str = "submit"
73
+ tool_name: Optional[str] = None
74
+ tool_target_ticket_id: Optional[str] = None
75
  issue_type: Optional[str] = None
76
  priority: Optional[str] = None
77
  assignment_group: Optional[str] = None
78
  resolution_action: Optional[str] = None
79
 
80
+ @field_validator("action_type")
81
+ @classmethod
82
+ def validate_action_type(cls, value: str) -> str:
83
+ return _validate_choice(value, ACTION_TYPE_SET, "action_type")
84
+
85
+ @field_validator("tool_name")
86
+ @classmethod
87
+ def validate_tool_name(cls, value: Optional[str]) -> Optional[str]:
88
+ return _validate_optional_choice(value, TOOL_NAME_SET, "tool_name")
89
+
90
  @field_validator("issue_type")
91
  @classmethod
92
  def validate_issue_type(cls, value: Optional[str]) -> Optional[str]:
 
113
  task_name: str = ""
114
  instructions: str = ""
115
  allowed_fields: list[str] = Field(default_factory=list)
116
+ available_tools: list[str] = Field(default_factory=list)
117
+ investigation_budget_remaining: int = 0
118
+ last_tool_result: Optional[dict[str, Any]] = None
119
+ current_ticket: Optional[dict[str, Any]] = None
120
  queue_size: int = 0
121
  tickets_remaining: int = 0
122
+ tickets_after_current: int = 0
123
  tickets_processed: int = 0
124
+ queue_position: int = 0
125
  history: list[dict[str, Any]] = Field(default_factory=list)
126
 
127
 
 
136
  # `reward` is the field the evaluator checks on GET /state (mentor spec)
137
  reward: Optional[float] = None
138
  done: bool = False
139
+ investigation_steps: int = 0
140
+ investigation_budget_remaining: int = 0
141
+ last_tool_result: Optional[dict[str, Any]] = None
142
  history_entries: list[dict] = Field(default_factory=list)
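
Note on the action shape in models.py: the new validators restrict `action_type` and `tool_name` to fixed sets, and server/environment.py separately rejects investigate actions that lack `tool_name` or carry submit fields. The sketch below combines those rules over a plain dict, using only the standard library. `check_action` and `SUBMIT_FIELDS` are hypothetical names, this is not the pydantic model, and the real `_validate_choice` helpers may normalize values or word their errors differently.

```python
from typing import Any

# Allowed values copied from the diff.
ACTION_TYPE_SET = {"submit", "investigate"}
TOOL_NAME_SET = {"lookup_related_ticket", "lookup_requester_history"}
# Hypothetical constant: the four routing fields a submit action may carry.
SUBMIT_FIELDS = ("issue_type", "priority", "assignment_group", "resolution_action")


def check_action(action: dict[str, Any]) -> None:
    """Raise ValueError if the action breaks the shape rules shown in this commit."""
    # The model defaults action_type to "submit" when it is omitted.
    action_type = action.get("action_type", "submit")
    if action_type not in ACTION_TYPE_SET:
        raise ValueError(f"invalid action_type: {action_type!r}")

    tool_name = action.get("tool_name")
    if tool_name is not None and tool_name not in TOOL_NAME_SET:
        raise ValueError(f"invalid tool_name: {tool_name!r}")

    if action_type == "investigate":
        if tool_name is None:
            raise ValueError("Investigate actions require tool_name")
        present = sorted(f for f in SUBMIT_FIELDS if action.get(f) is not None)
        if present:
            raise ValueError(
                f"Investigate actions cannot include submit fields: {present}"
            )


# A submit action may omit action_type entirely (field value is a placeholder).
check_action({"issue_type": "placeholder_issue_type"})
# An investigate action needs a known tool and no routing fields.
check_action({"action_type": "investigate", "tool_name": "lookup_related_ticket"})
```

Per-task limits on which submit fields are allowed are a different check: the environment compares submitted fields against `allowed_fields` and scores the ticket 0.0 when extras are present, rather than raising.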
openenv.yaml CHANGED
@@ -53,6 +53,7 @@ inference:
53
  - MODEL_NAME
54
  - HF_TOKEN
55
  - ENV_URL
 
56
 
57
  requirements:
58
  python: ">=3.11"
 
53
  - MODEL_NAME
54
  - HF_TOKEN
55
  - ENV_URL
56
+ - TASK_ID
57
 
58
  requirements:
59
  python: ">=3.11"
server/environment.py CHANGED
@@ -18,6 +18,10 @@ from server.tasks import get_task_definition, load_dataset
18
 
19
 
20
  QUEUE_SIZE_RANGE = (3, 5)
 
 
 
 
21
 
22
 
23
  def _coerce_optional_int(value: Any, field_name: str) -> Optional[int]:
@@ -41,6 +45,7 @@ class HelpdeskTicketRoutingEnvironment(
41
  def __init__(self) -> None:
42
  super().__init__()
43
  self._dataset = load_dataset()
 
44
  self._rng = random.Random()
45
  self._queue: list[HelpdeskTicketRecord] = []
46
  self._state = HelpdeskTicketState()
@@ -57,13 +62,19 @@ class HelpdeskTicketRoutingEnvironment(
57
  ) -> HelpdeskTicketObservation:
58
  normalized_seed = _coerce_optional_int(seed, "seed")
59
  task_id_value = _coerce_optional_int(kwargs.get("task_id", 1), "task_id")
 
60
  task_id = 1 if task_id_value is None else task_id_value
61
  task = get_task_definition(task_id)
 
 
62
 
63
  if normalized_seed is not None:
64
  self._rng.seed(normalized_seed)
65
 
66
- queue_size = self._rng.randint(*QUEUE_SIZE_RANGE)
 
 
 
67
  self._queue = self._rng.sample(self._dataset, min(queue_size, len(self._dataset)))
68
 
69
  self._state = HelpdeskTicketState(
@@ -75,6 +86,7 @@ class HelpdeskTicketRoutingEnvironment(
75
  current_ticket_index=0,
76
  per_ticket_scores=[],
77
  total_reward=0.0,
 
78
  )
79
 
80
  return self._build_observation(task)
@@ -96,34 +108,46 @@ class HelpdeskTicketRoutingEnvironment(
96
  task_id = self._state.current_task_id
97
  task = get_task_definition(task_id)
98
 
 
 
 
99
  submitted_fields = {
100
- f for f, v in action.model_dump(exclude_none=True).items() if v is not None
 
 
 
101
  }
102
  allowed = set(task["allowed_fields"])
103
  extra_fields = submitted_fields - allowed
104
  if extra_fields:
105
  # Penalty: record score 0.0, advance index, return penalty observation
106
  self._state.per_ticket_scores.append(0.0)
107
- self._state.history_entries.append({
108
- "ticket_id": current_ticket.ticket_id,
109
- "title": current_ticket.title,
110
- "predicted": action.model_dump(exclude_none=True),
111
- "score": 0.0,
112
- "breakdown": {},
113
- "penalty_reason": f"extra_fields: {sorted(extra_fields)}",
114
- })
 
 
115
  self._state.step_count += 1
116
  self._state.current_ticket_index += 1
117
  is_done = self._state.current_ticket_index >= len(self._queue)
118
- self._state.last_step_reward = 0.0
119
- self._state.reward = 0.0
120
  self._state.done = is_done
121
  if is_done:
122
  traj_reward = compute_trajectory_reward(
123
  self._state.per_ticket_scores, len(self._queue), self._state.step_count
124
  )
125
- self._state.total_reward = traj_reward
126
- return self._build_observation(task, done=is_done, reward=0.0)
127
 
128
  score, breakdown = grade_action(action, current_ticket, task_id)
129
  step_reward = compute_step_reward(score)
@@ -139,26 +163,27 @@ class HelpdeskTicketRoutingEnvironment(
139
  len(self._queue),
140
  self._state.step_count,
141
  )
142
- self._state.total_reward = traj_reward
143
- final_reward = traj_reward
144
  else:
145
  self._state.per_ticket_scores.append(score)
146
  self._state.step_count += 1
147
  self._state.current_ticket_index += 1
148
  final_reward = step_reward
149
 
150
- history_entry = {
151
- "ticket_id": current_ticket.ticket_id,
152
- "title": current_ticket.title,
153
- "predicted": action.model_dump(exclude_none=True),
154
- "score": score,
155
- "breakdown": breakdown,
156
- }
157
  self._state.history_entries.append(history_entry)
158
 
159
  self._state.last_step_reward = final_reward
160
  self._state.reward = final_reward
161
  self._state.done = is_done
 
162
 
163
  return self._build_observation(task, done=is_done, reward=final_reward)
164
 
@@ -170,6 +195,188 @@ class HelpdeskTicketRoutingEnvironment(
170
  # Helpers
171
  # ------------------------------------------------------------------
172
 
173
  def _build_observation(
174
  self,
175
  task: dict,
@@ -181,33 +388,43 @@ class HelpdeskTicketRoutingEnvironment(
181
 
182
  if idx < queue_size:
183
  ticket = self._queue[idx]
184
- ticket_view: dict[str, Any] = {
185
- "ticket_id": ticket.ticket_id,
186
- "title": ticket.title,
187
- "requester": ticket.requester,
188
- "description": ticket.description,
189
- }
190
- if ticket.ambiguity_note is not None:
191
- ticket_view["ambiguity_note"] = ticket.ambiguity_note
192
- if ticket.related_ticket_id is not None:
193
- ticket_view["related_ticket_id"] = ticket.related_ticket_id
194
  else:
195
  ticket_view = None
 
196
 
197
  history = list(self._state.history_entries)
198
 
199
  return HelpdeskTicketObservation(
200
  done=done,
201
  reward=reward,
202
- metadata={},
203
  task_id=task["id"],
204
  task_name=task["name"],
205
  instructions=task["instructions"],
206
  allowed_fields=list(task["allowed_fields"]),
 
 
 
207
  current_ticket=ticket_view,
208
  queue_size=queue_size,
209
- # tickets_remaining: count of tickets not yet processed after this step
210
- tickets_remaining=max(0, queue_size - idx),
211
  tickets_processed=idx,
 
212
  history=history,
213
  )
 
18
 
19
 
20
  QUEUE_SIZE_RANGE = (3, 5)
21
+ AVAILABLE_TOOLS = ("lookup_related_ticket", "lookup_requester_history")
22
+ FREE_INVESTIGATIONS_PER_TICKET = 1
23
+ EXTRA_INVESTIGATION_COST = 0.02
24
+ MAX_EXTRA_INVESTIGATION_PENALTY = 0.15
25
 
26
 
27
  def _coerce_optional_int(value: Any, field_name: str) -> Optional[int]:
 
45
  def __init__(self) -> None:
46
  super().__init__()
47
  self._dataset = load_dataset()
48
+ self._tickets_by_id = {ticket.ticket_id: ticket for ticket in self._dataset}
49
  self._rng = random.Random()
50
  self._queue: list[HelpdeskTicketRecord] = []
51
  self._state = HelpdeskTicketState()
 
62
  ) -> HelpdeskTicketObservation:
63
  normalized_seed = _coerce_optional_int(seed, "seed")
64
  task_id_value = _coerce_optional_int(kwargs.get("task_id", 1), "task_id")
65
+ queue_size_value = _coerce_optional_int(kwargs.get("queue_size"), "queue_size")
66
  task_id = 1 if task_id_value is None else task_id_value
67
  task = get_task_definition(task_id)
68
+ if queue_size_value is not None and queue_size_value < 1:
69
+ raise ValueError("queue_size must be >= 1")
70
 
71
  if normalized_seed is not None:
72
  self._rng.seed(normalized_seed)
73
 
74
+ if queue_size_value is None:
75
+ queue_size = self._rng.randint(*QUEUE_SIZE_RANGE)
76
+ else:
77
+ queue_size = min(queue_size_value, len(self._dataset))
78
  self._queue = self._rng.sample(self._dataset, min(queue_size, len(self._dataset)))
79
 
80
  self._state = HelpdeskTicketState(
 
86
  current_ticket_index=0,
87
  per_ticket_scores=[],
88
  total_reward=0.0,
89
+ investigation_budget_remaining=queue_size * FREE_INVESTIGATIONS_PER_TICKET,
90
  )
91
 
92
  return self._build_observation(task)
 
108
  task_id = self._state.current_task_id
109
  task = get_task_definition(task_id)
110
 
111
+ if action.action_type == "investigate":
112
+ return self._handle_investigation_action(task, current_ticket, action, idx)
113
+
114
  submitted_fields = {
115
+ f
116
+ for f, v in action.model_dump(exclude_none=True).items()
117
+ if v is not None
118
+ and f not in {"action_type", "tool_name", "tool_target_ticket_id"}
119
  }
120
  allowed = set(task["allowed_fields"])
121
  extra_fields = submitted_fields - allowed
122
  if extra_fields:
123
  # Penalty: record score 0.0, advance index, return penalty observation
124
  self._state.per_ticket_scores.append(0.0)
125
+ self._state.history_entries.append(
126
+ self._build_history_entry(
127
+ current_ticket,
128
+ predicted=action.model_dump(exclude_none=True),
129
+ score=0.0,
130
+ breakdown={},
131
+ queue_position=idx + 1,
132
+ penalty_reason=f"extra_fields: {sorted(extra_fields)}",
133
+ )
134
+ )
135
  self._state.step_count += 1
136
  self._state.current_ticket_index += 1
137
  is_done = self._state.current_ticket_index >= len(self._queue)
 
 
138
  self._state.done = is_done
139
  if is_done:
140
  traj_reward = compute_trajectory_reward(
141
  self._state.per_ticket_scores, len(self._queue), self._state.step_count
142
  )
143
+ final_reward = self._apply_episode_economics(traj_reward)
144
+ self._state.total_reward = final_reward
145
+ else:
146
+ final_reward = 0.0
147
+ self._state.last_step_reward = final_reward
148
+ self._state.reward = final_reward
149
+ self._state.last_tool_result = None
150
+ return self._build_observation(task, done=is_done, reward=final_reward)
151
 
152
  score, breakdown = grade_action(action, current_ticket, task_id)
153
  step_reward = compute_step_reward(score)
 
163
  len(self._queue),
164
  self._state.step_count,
165
  )
166
+ final_reward = self._apply_episode_economics(traj_reward)
167
+ self._state.total_reward = final_reward
168
  else:
169
  self._state.per_ticket_scores.append(score)
170
  self._state.step_count += 1
171
  self._state.current_ticket_index += 1
172
  final_reward = step_reward
173
 
174
+ history_entry = self._build_history_entry(
175
+ current_ticket,
176
+ predicted=action.model_dump(exclude_none=True),
177
+ score=score,
178
+ breakdown=breakdown,
179
+ queue_position=idx + 1,
180
+ )
181
  self._state.history_entries.append(history_entry)
182
 
183
  self._state.last_step_reward = final_reward
184
  self._state.reward = final_reward
185
  self._state.done = is_done
186
+ self._state.last_tool_result = None
187
 
188
  return self._build_observation(task, done=is_done, reward=final_reward)
189
 
 
195
  # Helpers
196
  # ------------------------------------------------------------------
197
 
198
+ def _apply_episode_economics(self, base_reward: float) -> float:
199
+ free_investigations = len(self._queue) * FREE_INVESTIGATIONS_PER_TICKET
200
+ extra_investigations = max(0, self._state.investigation_steps - free_investigations)
201
+ penalty = min(
202
+ MAX_EXTRA_INVESTIGATION_PENALTY,
203
+ extra_investigations * EXTRA_INVESTIGATION_COST,
204
+ )
205
+ return max(0.0, min(1.0, base_reward - penalty))
206
+
207
+ def _lookup_related_ticket(
208
+ self,
209
+ current_ticket: HelpdeskTicketRecord,
210
+ target_ticket_id: str | None,
211
+ ) -> dict[str, Any]:
212
+ target_id = target_ticket_id or current_ticket.related_ticket_id
213
+ if target_id is None:
214
+ return {
215
+ "tool_name": "lookup_related_ticket",
216
+ "found": False,
217
+ "message": "Current ticket has no linked related_ticket_id.",
218
+ }
219
+ related_ticket = self._tickets_by_id.get(target_id)
220
+ if related_ticket is None:
221
+ return {
222
+ "tool_name": "lookup_related_ticket",
223
+ "found": False,
224
+ "message": f"Ticket {target_id!r} was not found in the dataset.",
225
+ }
226
+ return {
227
+ "tool_name": "lookup_related_ticket",
228
+ "found": True,
229
+ "ticket": {
230
+ "ticket_id": related_ticket.ticket_id,
231
+ "title": related_ticket.title,
232
+ "requester": related_ticket.requester,
233
+ "description": related_ticket.description,
234
+ "issue_type": related_ticket.issue_type,
235
+ "priority": related_ticket.priority,
236
+ "assignment_group": related_ticket.assignment_group,
237
+ "resolution_action": related_ticket.resolution_action,
238
+ },
239
+ }
240
+
241
+ def _lookup_requester_history(self, current_ticket: HelpdeskTicketRecord) -> dict[str, Any]:
242
+ matches = [
243
+ {
244
+ "ticket_id": ticket.ticket_id,
245
+ "title": ticket.title,
246
+ "issue_type": ticket.issue_type,
247
+ "priority": ticket.priority,
248
+ "assignment_group": ticket.assignment_group,
249
+ "resolution_action": ticket.resolution_action,
250
+ }
251
+ for ticket in self._dataset
252
+ if ticket.requester == current_ticket.requester
253
+ and ticket.ticket_id != current_ticket.ticket_id
254
+ ]
255
+ return {
256
+ "tool_name": "lookup_requester_history",
257
+ "found": bool(matches),
258
+ "requester": current_ticket.requester,
259
+ "matches": matches,
260
+ }
261
+
262
+ def _run_investigation_tool(
263
+ self,
264
+ current_ticket: HelpdeskTicketRecord,
265
+ tool_name: str,
266
+ target_ticket_id: str | None,
267
+ ) -> dict[str, Any]:
268
+ if tool_name == "lookup_related_ticket":
269
+ return self._lookup_related_ticket(current_ticket, target_ticket_id)
270
+ if tool_name == "lookup_requester_history":
271
+ return self._lookup_requester_history(current_ticket)
272
+ raise ValueError(f"Unsupported tool_name: {tool_name}")
273
+
274
+ def _handle_investigation_action(
275
+ self,
276
+ task: dict,
277
+ current_ticket: HelpdeskTicketRecord,
278
+ action: HelpdeskTicketAction,
279
+ idx: int,
280
+ ) -> HelpdeskTicketObservation:
281
+ if action.tool_name is None:
282
+ raise ValueError("Investigate actions require tool_name")
283
+ submitted_fields = {
284
+ field
285
+ for field in ("issue_type", "priority", "assignment_group", "resolution_action")
286
+ if getattr(action, field) is not None
287
+ }
288
+ if submitted_fields:
289
+ raise ValueError(
290
+ "Investigate actions cannot include submit fields: "
291
+ f"{sorted(submitted_fields)}"
292
+ )
293
+
294
+ tool_result = self._run_investigation_tool(
295
+ current_ticket,
296
+ action.tool_name,
297
+ action.tool_target_ticket_id,
298
+ )
299
+ self._state.step_count += 1
300
+ self._state.investigation_steps += 1
301
+ self._state.investigation_budget_remaining = max(
302
+ 0,
303
+ self._state.investigation_budget_remaining - 1,
304
+ )
305
+ self._state.last_tool_result = tool_result
306
+ self._state.last_step_reward = 0.0
307
+ self._state.reward = 0.0
308
+ self._state.done = False
309
+ self._state.history_entries.append(
310
+ self._build_history_entry(
311
+ current_ticket,
312
+ predicted=action.model_dump(exclude_none=True),
313
+ score=0.0,
314
+ breakdown={},
315
+ queue_position=idx + 1,
316
+ tool_result=tool_result,
317
+ )
318
+ )
319
+ return self._build_observation(task, done=False, reward=0.0)
320
+
321
+ def _build_ticket_view(self, ticket: HelpdeskTicketRecord) -> dict[str, Any]:
322
+ ticket_view: dict[str, Any] = {
323
+ "ticket_id": ticket.ticket_id,
324
+ "title": ticket.title,
325
+ "requester": ticket.requester,
326
+ "description": ticket.description,
327
+ }
328
+ if ticket.ambiguity_note is not None:
329
+ ticket_view["ambiguity_note"] = ticket.ambiguity_note
330
+ if ticket.related_ticket_id is not None:
331
+ ticket_view["related_ticket_id"] = ticket.related_ticket_id
332
+ related_ticket = self._tickets_by_id.get(ticket.related_ticket_id)
333
+ if related_ticket is not None:
334
+ ticket_view["related_ticket_preview"] = {
335
+ "ticket_id": related_ticket.ticket_id,
336
+ "title": related_ticket.title,
337
+ "requester": related_ticket.requester,
338
+ "description": related_ticket.description,
339
+ }
340
+ return ticket_view
341
+
342
+ def _build_history_entry(
343
+ self,
344
+ ticket: HelpdeskTicketRecord,
345
+ *,
346
+ predicted: dict[str, Any],
347
+ score: float,
348
+ breakdown: dict[str, float],
349
+ queue_position: int,
350
+ penalty_reason: str | None = None,
351
+ tool_result: dict[str, Any] | None = None,
352
+ ) -> dict[str, Any]:
353
+ history_entry: dict[str, Any] = {
354
+ "ticket_id": ticket.ticket_id,
355
+ "title": ticket.title,
356
+ "requester": ticket.requester,
357
+ "predicted": predicted,
358
+ "score": score,
359
+ "breakdown": breakdown,
360
+ "queue_position": queue_position,
361
+ }
362
+ if ticket.ambiguity_note is not None:
363
+ history_entry["ambiguity_note"] = ticket.ambiguity_note
364
+ if ticket.related_ticket_id is not None:
365
+ history_entry["related_ticket_id"] = ticket.related_ticket_id
366
+ related_ticket = self._tickets_by_id.get(ticket.related_ticket_id)
367
+ if related_ticket is not None:
368
+ history_entry["related_ticket_preview"] = {
369
+ "ticket_id": related_ticket.ticket_id,
370
+ "title": related_ticket.title,
371
+ "requester": related_ticket.requester,
372
+ "description": related_ticket.description,
373
+ }
374
+ if penalty_reason is not None:
375
+ history_entry["penalty_reason"] = penalty_reason
376
+ if tool_result is not None:
377
+ history_entry["tool_result"] = tool_result
378
+ return history_entry
379
+
380
  def _build_observation(
381
  self,
382
  task: dict,
 
388
 
389
  if idx < queue_size:
390
  ticket = self._queue[idx]
391
+ ticket_view = self._build_ticket_view(ticket)
392
+ queue_position = idx + 1
393
  else:
394
  ticket_view = None
395
+ queue_position = 0
396
 
397
  history = list(self._state.history_entries)
398
+ tickets_remaining = max(0, queue_size - idx)
399
+ tickets_after_current = max(
400
+ 0,
401
+ tickets_remaining - (1 if ticket_view is not None else 0),
402
+ )
403
 
404
  return HelpdeskTicketObservation(
405
  done=done,
406
  reward=reward,
407
+ metadata={
408
+ "queue_position": queue_position,
409
+ "tickets_remaining_includes_current": ticket_view is not None,
410
+ "has_ambiguity_note": bool(ticket_view and ticket_view.get("ambiguity_note")),
411
+ "has_related_ticket_context": bool(
412
+ ticket_view and ticket_view.get("related_ticket_preview")
413
+ ),
414
+ "action_mode": "investigate_or_submit",
415
+ },
416
  task_id=task["id"],
417
  task_name=task["name"],
418
  instructions=task["instructions"],
419
  allowed_fields=list(task["allowed_fields"]),
420
+ available_tools=list(AVAILABLE_TOOLS),
421
+ investigation_budget_remaining=self._state.investigation_budget_remaining,
422
+ last_tool_result=self._state.last_tool_result,
423
  current_ticket=ticket_view,
424
  queue_size=queue_size,
425
+ tickets_remaining=tickets_remaining,
426
+ tickets_after_current=tickets_after_current,
427
  tickets_processed=idx,
428
+ queue_position=queue_position,
429
  history=history,
430
  )
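
Note on the episode economics in server/environment.py: `_apply_episode_economics()` subtracts 0.02 from the final trajectory reward for each investigation beyond one free lookup per queued ticket, caps the deduction at 0.15, and clamps the result to `[0.0, 1.0]`. The sketch below restates that arithmetic as a standalone function. The function name and its explicit `queue_len` and `investigation_steps` parameters are assumptions made so it runs without the environment class; the constants are copied from the diff.

```python
# Constants copied from the diff.
FREE_INVESTIGATIONS_PER_TICKET = 1
EXTRA_INVESTIGATION_COST = 0.02
MAX_EXTRA_INVESTIGATION_PENALTY = 0.15


def apply_episode_economics(
    base_reward: float, queue_len: int, investigation_steps: int
) -> float:
    """Standalone restatement of the penalty applied to the final reward."""
    free_investigations = queue_len * FREE_INVESTIGATIONS_PER_TICKET
    extra_investigations = max(0, investigation_steps - free_investigations)
    # The deduction grows linearly and then saturates at the cap.
    penalty = min(
        MAX_EXTRA_INVESTIGATION_PENALTY,
        extra_investigations * EXTRA_INVESTIGATION_COST,
    )
    return max(0.0, min(1.0, base_reward - penalty))


# Hypothetical episodes with a 4-ticket queue and a base reward of 0.9.
for steps in (4, 7, 20):
    print(steps, round(apply_episode_economics(0.9, 4, steps), 4))
```

With four tickets, four lookups are free, seven lookups cost three extras (a 0.06 deduction), and twenty lookups hit the 0.15 cap. Because the penalty is applied only when the episode finishes, intermediate investigate steps still report a reward of 0.0.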
server/tasks.py CHANGED
@@ -13,7 +13,8 @@ TASKS = {
13
  "name": "Issue Type Classification",
14
  "difficulty": "easy",
15
  "instructions": (
16
- "Read the ticket and select the single best IT issue type."
 
17
  ),
18
  "allowed_fields": ["issue_type"],
19
  },
@@ -23,7 +24,8 @@ TASKS = {
23
  "difficulty": "medium",
24
  "instructions": (
25
  "Read the ticket, select the best IT issue type, and estimate the "
26
- "correct operational priority."
 
27
  ),
28
  "allowed_fields": ["issue_type", "priority"],
29
  },
@@ -33,7 +35,9 @@ TASKS = {
33
  "difficulty": "hard",
34
  "instructions": (
35
  "Perform full helpdesk routing by selecting the best issue type, "
36
- "priority, assignment group, and resolution action for the ticket."
 
 
37
  ),
38
  "allowed_fields": [
39
  "issue_type",
 
13
  "name": "Issue Type Classification",
14
  "difficulty": "easy",
15
  "instructions": (
16
+ "Read the ticket and select the single best IT issue type. "
17
+ "You may investigate first, then submit a final routing answer."
18
  ),
19
  "allowed_fields": ["issue_type"],
20
  },
 
24
  "difficulty": "medium",
25
  "instructions": (
26
  "Read the ticket, select the best IT issue type, and estimate the "
27
+ "correct operational priority. If the observation includes ambiguity "
28
+ "or follow-up context, use it. You may investigate before you submit."
29
  ),
30
  "allowed_fields": ["issue_type", "priority"],
31
  },
 
35
  "difficulty": "hard",
36
  "instructions": (
37
  "Perform full helpdesk routing by selecting the best issue type, "
38
+ "priority, assignment group, and resolution action for the ticket. "
39
+ "Use any ambiguity notes or related-ticket previews when present. "
40
+ "You may investigate with tools before you submit the final action."
41
  ),
42
  "allowed_fields": [
43
  "issue_type",
tests/test_competitive_upgrade.py CHANGED
@@ -81,7 +81,11 @@ def _heuristic_action(obs: HelpdeskTicketObservation) -> HelpdeskTicketAction:
81
  # 9.1 — Inference single-task mode
82
  # ---------------------------------------------------------------------------
83
 
84
- def _get_tasks_to_run_impl(task_id_env: str | None, available_tasks: dict) -> list[int]:
 
 
 
 
85
  """
86
  Standalone re-implementation of inference.get_tasks_to_run() logic for testing.
87
 
@@ -94,9 +98,13 @@ def _get_tasks_to_run_impl(task_id_env: str | None, available_tasks: dict) -> li
94
  except ValueError:
95
  raise SystemExit(1)
96
  if task_id not in available_tasks:
97
- return []
98
  return [task_id]
99
- return list(TASK_IDS)
 
 
 
 
100
 
101
 
102
  class TestInferenceSingleTaskMode(unittest.TestCase):
@@ -107,14 +115,19 @@ class TestInferenceSingleTaskMode(unittest.TestCase):
107
  result = _get_tasks_to_run_impl("1", available)
108
  self.assertEqual(result, [1])
109
 
110
- def test_task_id_set_to_unavailable_id_returns_empty_list(self) -> None:
111
  available = {1: {}, 2: {}, 3: {}}
112
- result = _get_tasks_to_run_impl("999", available)
113
- self.assertEqual(result, [])
114
 
115
- def test_task_id_unset_returns_all_task_ids(self) -> None:
116
  available = {1: {}, 2: {}, 3: {}}
117
  result = _get_tasks_to_run_impl(None, available)
118
  self.assertEqual(sorted(result), sorted(list(TASK_IDS)))
119
 
120
  def test_task_id_set_to_2_returns_only_task_2(self) -> None:
@@ -360,6 +373,271 @@ class TestAmbiguityNoteInObservation(unittest.TestCase):
360
  self.assertIn("ambiguity_note", obs.current_ticket)
361
 
362
 
363
  # ---------------------------------------------------------------------------
364
  # 9.7 — Dataset has >= 3 non-default routing tickets
365
  # ---------------------------------------------------------------------------
 
81
  # 9.1 — Inference single-task mode
82
  # ---------------------------------------------------------------------------
83
 
84
+ def _get_tasks_to_run_impl(
85
+ task_id_env: str | None,
86
+ available_tasks: dict,
87
+ run_all_tasks: bool = False,
88
+ ) -> list[int]:
89
  """
90
  Standalone re-implementation of inference.get_tasks_to_run() logic for testing.
91
 
 
98
  except ValueError:
99
  raise SystemExit(1)
100
  if task_id not in available_tasks:
101
+ raise SystemExit(1)
102
  return [task_id]
103
+ if run_all_tasks:
104
+ return sorted(available_tasks)
105
+ if not available_tasks:
106
+ return []
107
+ return [sorted(available_tasks)[0]]
108
 
109
 
110
  class TestInferenceSingleTaskMode(unittest.TestCase):
 
         result = _get_tasks_to_run_impl("1", available)
         self.assertEqual(result, [1])
 
+    def test_task_id_set_to_unavailable_id_exits(self) -> None:
         available = {1: {}, 2: {}, 3: {}}
+        with self.assertRaises(SystemExit):
+            _get_tasks_to_run_impl("999", available)
 
+    def test_task_id_unset_defaults_to_first_available_task(self) -> None:
         available = {1: {}, 2: {}, 3: {}}
         result = _get_tasks_to_run_impl(None, available)
+        self.assertEqual(result, [1])
+
+    def test_run_all_tasks_override_returns_all_task_ids(self) -> None:
+        available = {1: {}, 2: {}, 3: {}}
+        result = _get_tasks_to_run_impl(None, available, run_all_tasks=True)
         self.assertEqual(sorted(result), sorted(list(TASK_IDS)))
 
     def test_task_id_set_to_2_returns_only_task_2(self) -> None:
 
         self.assertIn("ambiguity_note", obs.current_ticket)
 
 
+class TestRelatedTicketPreviewInObservation(unittest.TestCase):
+    """Follow-up tickets expose a lightweight preview of the linked ticket."""
+
+    def _reset_linked_ticket_env(self):
+        from unittest.mock import patch
+
+        dataset = load_dataset()
+        ticket = next((t for t in dataset if t.related_ticket_id is not None), None)
+        self.assertIsNotNone(ticket, "No follow-up ticket found in dataset")
+        related = next(
+            (t for t in dataset if t.ticket_id == ticket.related_ticket_id),
+            None,
+        )
+        self.assertIsNotNone(related, "Linked ticket missing from dataset")
+
+        env = _make_env()
+        with patch.object(env, "_dataset", [ticket]):
+            with patch.object(
+                env,
+                "_tickets_by_id",
+                {ticket.ticket_id: ticket, related.ticket_id: related},
+            ):
+                obs = env.reset(seed=0, task_id=3, queue_size=1)
+
+        return env, obs, related
+
+    def test_related_ticket_preview_present_when_ticket_has_link(self) -> None:
+        env, obs, related = self._reset_linked_ticket_env()
+
+        self.assertIsNotNone(obs.current_ticket)
+        self.assertIn("related_ticket_preview", obs.current_ticket)
+        self.assertEqual(
+            obs.current_ticket["related_ticket_preview"]["ticket_id"],
+            related.ticket_id,
+        )
+        self.assertEqual(
+            obs.current_ticket["related_ticket_preview"]["title"],
+            related.title,
+        )
+
+    def test_history_keeps_related_ticket_preview_after_step(self) -> None:
+        env, obs, related = self._reset_linked_ticket_env()
+        next_obs = env.step(_heuristic_action(obs))
+
+        self.assertGreaterEqual(len(next_obs.history), 1)
+        self.assertIn("related_ticket_preview", next_obs.history[0])
+        self.assertEqual(
+            next_obs.history[0]["related_ticket_preview"]["ticket_id"],
+            related.ticket_id,
+        )
+
+
+class TestObservationQueueContext(unittest.TestCase):
+    """Observation includes clearer queue-position counters."""
+
+    def test_reset_sets_queue_position_and_after_current_counts(self) -> None:
+        env = _make_env()
+        obs = env.reset(seed=0, task_id=1, queue_size=3)
+
+        self.assertEqual(obs.queue_position, 1)
+        self.assertEqual(obs.tickets_remaining, 3)
+        self.assertEqual(obs.tickets_after_current, 2)
+
+    def test_step_updates_queue_position_and_after_current_counts(self) -> None:
+        env = _make_env()
+        obs = env.reset(seed=0, task_id=1, queue_size=3)
+        obs = env.step(_heuristic_action(obs))
+
+        if obs.done:
+            self.assertEqual(obs.queue_position, 0)
+            self.assertEqual(obs.tickets_after_current, 0)
+        else:
+            self.assertEqual(obs.queue_position, 2)
+            self.assertEqual(obs.tickets_remaining, 2)
+            self.assertEqual(obs.tickets_after_current, 1)
+
+
+# ---------------------------------------------------------------------------
+# 9.6b — investigation actions and queue economics
+# ---------------------------------------------------------------------------
+
+class TestInvestigationActions(unittest.TestCase):
+    """Minimal tool-assisted investigate/submit flow works and stays backwards compatible."""
+
+    def _make_linked_env(self):
+        from unittest.mock import patch
+
+        dataset = load_dataset()
+        ticket = next((t for t in dataset if t.related_ticket_id is not None), None)
+        self.assertIsNotNone(ticket, "No follow-up ticket found in dataset")
+        related = next(
+            (t for t in dataset if t.ticket_id == ticket.related_ticket_id),
+            None,
+        )
+        self.assertIsNotNone(related, "Linked ticket missing from dataset")
+        env = _make_env()
+        patch_dataset = patch.object(env, "_dataset", [ticket])
+        patch_lookup = patch.object(
+            env,
+            "_tickets_by_id",
+            {ticket.ticket_id: ticket, related.ticket_id: related},
+        )
+        patch_dataset.start()
+        patch_lookup.start()
+        self.addCleanup(patch_dataset.stop)
+        self.addCleanup(patch_lookup.stop)
+        obs = env.reset(seed=0, task_id=3, queue_size=1)
+        return env, obs, ticket, related
+
+    def test_investigation_action_does_not_advance_queue(self) -> None:
+        env, obs, ticket, related = self._make_linked_env()
+
+        investigate = HelpdeskTicketAction(
+            action_type="investigate",
+            tool_name="lookup_related_ticket",
+            tool_target_ticket_id=ticket.related_ticket_id,
+        )
+        obs2 = env.step(investigate)
+
+        self.assertFalse(obs2.done)
+        self.assertEqual(obs2.tickets_processed, 0)
+        self.assertEqual(obs2.queue_position, 1)
+        self.assertIsNotNone(obs2.last_tool_result)
+        self.assertTrue(obs2.last_tool_result["found"])
+        self.assertEqual(
+            obs2.last_tool_result["ticket"]["ticket_id"],
+            related.ticket_id,
+        )
+
+    def test_submit_after_investigation_completes_episode(self) -> None:
+        env, obs, ticket, related = self._make_linked_env()
+        env.step(
+            HelpdeskTicketAction(
+                action_type="investigate",
+                tool_name="lookup_related_ticket",
+                tool_target_ticket_id=ticket.related_ticket_id,
+            )
+        )
+        final_obs = env.step(
+            HelpdeskTicketAction(
+                issue_type=ticket.issue_type,
+                priority=ticket.priority,
+                assignment_group=ticket.assignment_group,
+                resolution_action=ticket.resolution_action,
+            )
+        )
+
+        self.assertTrue(final_obs.done)
+        self.assertEqual(final_obs.tickets_processed, 1)
+        self.assertGreaterEqual(final_obs.reward, 0.0)
+        self.assertLessEqual(final_obs.reward, 1.0)
+
+    def test_requester_history_tool_returns_matches_for_same_requester(self) -> None:
+        from unittest.mock import patch
+
+        dataset = load_dataset()
+        requester_counts: dict[str, int] = {}
+        for ticket in dataset:
+            requester_counts[ticket.requester] = requester_counts.get(ticket.requester, 0) + 1
+        target_requester = next(
+            (requester for requester, count in requester_counts.items() if count >= 2),
+            None,
+        )
+        self.assertIsNotNone(target_requester, "Dataset has no repeated requester")
+        duplicate_requester_group = [
+            ticket for ticket in dataset if ticket.requester == target_requester
+        ]
+        self.assertGreaterEqual(len(duplicate_requester_group), 2)
+
+        env = _make_env()
+        with patch.object(env, "_dataset", duplicate_requester_group):
+            with patch.object(
+                env,
+                "_tickets_by_id",
+                {ticket.ticket_id: ticket for ticket in duplicate_requester_group},
+            ):
+                obs = env.reset(seed=0, task_id=2, queue_size=1)
+
+        obs2 = env.step(
+            HelpdeskTicketAction(
+                action_type="investigate",
+                tool_name="lookup_requester_history",
+            )
+        )
+
+        self.assertIsNotNone(obs2.last_tool_result)
+        self.assertEqual(obs2.last_tool_result["tool_name"], "lookup_requester_history")
+        self.assertTrue(obs2.last_tool_result["found"])
+        self.assertGreaterEqual(len(obs2.last_tool_result["matches"]), 1)
+
+
+class TestQueueEconomics(unittest.TestCase):
+    """Free investigations are allowed, but excessive investigation gets a queue-level penalty."""
+
+    def test_extra_investigations_reduce_final_reward(self) -> None:
+        from unittest.mock import patch
+
+        dataset = load_dataset()
+        ticket = dataset[0]
+        env = _make_env()
+        with patch.object(env, "_dataset", [ticket]):
+            with patch.object(env, "_tickets_by_id", {ticket.ticket_id: ticket}):
+                obs = env.reset(seed=0, task_id=1, queue_size=1)
+
+        obs = env.step(
+            HelpdeskTicketAction(
+                action_type="investigate",
+                tool_name="lookup_requester_history",
+            )
+        )
+        self.assertEqual(env.state.investigation_steps, 1)
+        self.assertEqual(env.state.investigation_budget_remaining, 0)
+
+        obs = env.step(
+            HelpdeskTicketAction(
+                action_type="investigate",
+                tool_name="lookup_requester_history",
+            )
+        )
+        self.assertEqual(env.state.investigation_steps, 2)
+
+        final_obs = env.step(HelpdeskTicketAction(issue_type=ticket.issue_type))
+
+        self.assertTrue(final_obs.done)
+        self.assertAlmostEqual(final_obs.reward, 0.98, places=9)
+
+
+class TestTerminalInvalidActionFinalReward(unittest.TestCase):
+    """Terminal invalid submit actions should still return the queue-level final reward."""
+
+    def test_last_invalid_submit_returns_trajectory_reward_not_zero(self) -> None:
+        from unittest.mock import patch
+
+        dataset = load_dataset()
+        first = dataset[0]
+        second = dataset[1]
+
+        env = _make_env()
+        with patch.object(env, "_dataset", [first, second]):
+            with patch.object(
+                env,
+                "_tickets_by_id",
+                {first.ticket_id: first, second.ticket_id: second},
+            ):
+                obs = env.reset(seed=0, task_id=1, queue_size=2)
+
+        tickets_by_id = {first.ticket_id: first, second.ticket_id: second}
+        current = tickets_by_id[obs.current_ticket["ticket_id"]]
+        obs = env.step(HelpdeskTicketAction(issue_type=current.issue_type))
+        self.assertFalse(obs.done)
+
+        current = tickets_by_id[obs.current_ticket["ticket_id"]]
+        final_obs = env.step(
+            HelpdeskTicketAction(
+                issue_type=current.issue_type,
+                priority="medium",
+            )
+        )
+
+        self.assertTrue(final_obs.done)
+        self.assertAlmostEqual(final_obs.reward, 0.5, places=9)
+        self.assertAlmostEqual(env.state.total_reward, 0.5, places=9)
+        self.assertAlmostEqual(env.state.reward or 0.0, 0.5, places=9)
+
+
 # ---------------------------------------------------------------------------
 # 9.7 — Dataset has >= 3 non-default routing tickets
 # ---------------------------------------------------------------------------
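
Reviewer note on the two reward tests above: read together, `TestQueueEconomics` and `TestTerminalInvalidActionFinalReward` suggest that the episode-level reward is the mean of the per-ticket scores, reduced slightly once investigations exceed a free budget. The sketch below illustrates that arithmetic only. The function name `final_reward`, the free budget of 1, and the 0.02 penalty per extra investigation are assumptions chosen so the sketch reproduces the 0.98 and 0.5 values asserted in the tests; the environment's actual constants and formula are not shown in this diff.

```python
from typing import Sequence


def final_reward(
    ticket_scores: Sequence[float],
    investigation_steps: int,
    free_budget: int = 1,             # assumption: one free investigation
    penalty_per_extra: float = 0.02,  # assumption: picked to match the 0.98 assertion
) -> float:
    """Hypothetical queue-level reward: mean ticket score minus an over-budget penalty."""
    if not ticket_scores:
        return 0.0
    mean_score = sum(ticket_scores) / len(ticket_scores)
    # Only investigations beyond the free budget are penalised.
    extra_investigations = max(0, investigation_steps - free_budget)
    reward = mean_score - penalty_per_extra * extra_investigations
    # Keep the result inside the unit interval, like the step reward.
    return min(1.0, max(0.0, reward))
```

Under these assumptions, a single correct ticket preceded by two investigations gives `final_reward([1.0], 2)`, roughly 0.98, and a two-ticket queue whose last submission is invalid gives `final_reward([1.0, 0.0], 0)`, which is 0.5.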
tests/test_environment_smoke.py CHANGED
@@ -101,6 +101,8 @@ class TestResetReturnsValidObservation(unittest.TestCase):
         self.assertIsNotNone(obs.current_ticket)
         self.assertGreater(obs.queue_size, 0)
         self.assertEqual(obs.tickets_processed, 0)
 
 
 class TestResetAllTaskIds(unittest.TestCase):
@@ -116,6 +118,7 @@ class TestResetAllTaskIds(unittest.TestCase):
         self.assertEqual(obs.tickets_processed, 0)
         # allowed_fields must match the task definition
         self.assertEqual(obs.allowed_fields, TASKS[task_id]["allowed_fields"])
 
     def test_reset_task2(self) -> None:
         env = _make_env()
@@ -142,6 +145,10 @@ class TestStepAdvancesTicketsProcessed(unittest.TestCase):
         obs2 = env.step(action)
 
         self.assertEqual(obs2.tickets_processed, 1)
 
     def test_step_reward_in_unit_interval(self) -> None:
         from models import HelpdeskTicketAction
 
         self.assertIsNotNone(obs.current_ticket)
         self.assertGreater(obs.queue_size, 0)
         self.assertEqual(obs.tickets_processed, 0)
+        self.assertEqual(obs.queue_position, 1)
+        self.assertEqual(obs.tickets_after_current, max(0, obs.queue_size - 1))
 
 
 class TestResetAllTaskIds(unittest.TestCase):
 
         self.assertEqual(obs.tickets_processed, 0)
         # allowed_fields must match the task definition
         self.assertEqual(obs.allowed_fields, TASKS[task_id]["allowed_fields"])
+        self.assertEqual(obs.queue_position, 1)
 
     def test_reset_task2(self) -> None:
         env = _make_env()
 
         obs2 = env.step(action)
 
         self.assertEqual(obs2.tickets_processed, 1)
+        if obs2.done:
+            self.assertEqual(obs2.queue_position, 0)
+        else:
+            self.assertEqual(obs2.queue_position, 2)
 
     def test_step_reward_in_unit_interval(self) -> None:
         from models import HelpdeskTicketAction
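
Reviewer note on the queue counters asserted above: the smoke tests and `TestObservationQueueContext` imply a simple relationship between `queue_size`, `tickets_processed`, and the three counters. The sketch below derives them from those two inputs. `QueueCounters` and `queue_counters` are hypothetical names used for illustration; the environment may compute these fields differently, and `tickets_remaining` reaching 0 on a finished episode is an inference rather than something the tests assert.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueCounters:
    """Hypothetical container for the counters the observation exposes."""

    queue_position: int         # 1-based position of the current ticket, 0 when done
    tickets_remaining: int      # tickets not yet submitted, including the current one
    tickets_after_current: int  # tickets still waiting behind the current one


def queue_counters(queue_size: int, tickets_processed: int) -> QueueCounters:
    """Derive the counters from the queue size and the number of submitted tickets."""
    remaining = max(0, queue_size - tickets_processed)
    done = remaining == 0
    return QueueCounters(
        queue_position=0 if done else tickets_processed + 1,
        tickets_remaining=remaining,
        tickets_after_current=max(0, remaining - 1),
    )
```

For a queue of three, `queue_counters(3, 0)` yields position 1 with 3 remaining and 2 after the current ticket, matching the reset assertions; after one submission `queue_counters(3, 1)` yields position 2, 2 remaining, and 1 after the current ticket.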
tests/test_extra_fields_penalty.py CHANGED
@@ -151,32 +151,31 @@ class TestExtraFieldsPenalty(unittest.TestCase):
         self.assertIsInstance(obs, HelpdeskTicketObservation)
 
     def test_extra_fields_done_flag_set_correctly_on_last_ticket(self) -> None:
-        """When the penalty step is on the last ticket, done must be True."""
         env = _make_env()
-        # Use a queue of size 1 by controlling the seed — find a seed that gives queue_size=1
-        # Instead, exhaust all but the last ticket normally, then trigger penalty on last
         obs = env.reset(seed=42, task_id=1)
         queue_size = obs.queue_size
 
         # Process all tickets except the last one normally
         for _ in range(queue_size - 1):
-            allowed = obs.allowed_fields
-            action_kwargs = {}
-            if "issue_type" in allowed:
-                action_kwargs["issue_type"] = ISSUE_TYPES[0]
-            if "priority" in allowed:
-                action_kwargs["priority"] = PRIORITIES[0]
-            obs = env.step(HelpdeskTicketAction(**action_kwargs))
 
         # Now trigger penalty on the last ticket
         action = HelpdeskTicketAction(
-            issue_type=ISSUE_TYPES[0],
             assignment_group=ASSIGNMENT_GROUPS[0],  # extra field
         )
         final_obs = env.step(action)
 
         self.assertTrue(final_obs.done)
-        self.assertEqual(final_obs.reward, 0.0)
 
 
 if __name__ == "__main__":
 
         self.assertIsInstance(obs, HelpdeskTicketObservation)
 
     def test_extra_fields_done_flag_set_correctly_on_last_ticket(self) -> None:
+        """When the penalty step is on the last ticket, done stays True and reward stays episode-level."""
         env = _make_env()
         obs = env.reset(seed=42, task_id=1)
         queue_size = obs.queue_size
+        tickets_by_id = env._tickets_by_id  # noqa: SLF001 - test-only inspection
 
         # Process all tickets except the last one normally
         for _ in range(queue_size - 1):
+            current_ticket_id = obs.current_ticket["ticket_id"]
+            current_ticket = tickets_by_id[current_ticket_id]
+            obs = env.step(HelpdeskTicketAction(issue_type=current_ticket.issue_type))
 
         # Now trigger penalty on the last ticket
+        current_ticket_id = obs.current_ticket["ticket_id"]
+        current_ticket = tickets_by_id[current_ticket_id]
         action = HelpdeskTicketAction(
+            issue_type=current_ticket.issue_type,
             assignment_group=ASSIGNMENT_GROUPS[0],  # extra field
         )
         final_obs = env.step(action)
 
         self.assertTrue(final_obs.done)
+        expected_reward = (queue_size - 1) / queue_size
+        self.assertAlmostEqual(final_obs.reward, expected_reward, places=9)
+        self.assertAlmostEqual(env.state.total_reward, expected_reward, places=9)
 
 
 if __name__ == "__main__":
tests/test_inference_unit.py CHANGED
@@ -163,6 +163,22 @@ class InferenceUnitTests(unittest.TestCase):
             )
         )
 
 
 if __name__ == "__main__":
     unittest.main()
 
             )
         )
 
+    def test_default_task_selection_runs_single_first_task(self) -> None:
+        inference = _load_inference_module()
+
+        self.assertEqual(
+            inference.get_tasks_to_run({1: {}, 2: {}, 3: {}}),
+            [1],
+        )
+
+    def test_run_all_tasks_override_keeps_local_batch_mode_available(self) -> None:
+        inference = _load_inference_module({"RUN_ALL_TASKS": "1"})
+
+        self.assertEqual(
+            inference.get_tasks_to_run({1: {}, 2: {}, 3: {}}),
+            [1, 2, 3],
+        )
+
 
 if __name__ == "__main__":
     unittest.main()
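
Reviewer note on task selection: the tests in this commit describe three modes for choosing which tasks to run: an explicit task id, a run-everything override, and a default that picks only the first available task. The sketch below puts those rules in one self-contained function, modelled on the `_get_tasks_to_run_impl` test helper shown earlier in the diff. The name `select_tasks` is hypothetical, and the opening `if task_id_env is not None` / `try` lines are an assumption, since the diff elides that part of the helper; how `inference.get_tasks_to_run` reads its environment variables is not shown here either.

```python
from typing import Mapping, Optional


def select_tasks(
    task_id_env: Optional[str],
    available_tasks: Mapping[int, object],
    run_all_tasks: bool = False,
) -> list[int]:
    """Hypothetical stand-in for the task-selection rules exercised by the tests."""
    if task_id_env is not None:
        # An explicit id wins, but it must parse and must name a known task.
        try:
            task_id = int(task_id_env)
        except ValueError:
            raise SystemExit(1)
        if task_id not in available_tasks:
            raise SystemExit(1)
        return [task_id]
    if run_all_tasks:
        # Batch mode: every task, in ascending id order.
        return sorted(available_tasks)
    if not available_tasks:
        return []
    # Default: only the lowest-numbered task.
    return [sorted(available_tasks)[0]]
```

With `available = {1: {}, 2: {}, 3: {}}`, `select_tasks(None, available)` returns `[1]`, `select_tasks(None, available, run_all_tasks=True)` returns `[1, 2, 3]`, and `select_tasks("999", available)` raises `SystemExit`, mirroring the assertions in the diff.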