Spaces: Roopalgn/AIHack-ITHelpDesk

Roopalgn committed on Apr 6

Commit 186fd65 (unverified) · 2 Parent(s): 89ca22f 52ab5fa

Merge pull request #10 from suyashkumar102/final-submit-gap-fixes

Files changed (19)

KNOWLEDGE.md +25 -10
README.md +46 -7
ROADMAP.md +72 -6
analysis/deep_competitive_gap_report.md +1374 -0
data/dataset.json +36 -0
gaps.md +146 -0
inference.py +142 -8
models.py +29 -1
openenv.yaml +4 -0
server/app.py +20 -0
server/environment.py +285 -33
server/reward.py +12 -4
server/tasks.py +7 -3
tests/test_api_integration.py +1 -1
tests/test_competitive_upgrade.py +746 -0
tests/test_environment_smoke.py +7 -0
tests/test_extra_fields_penalty.py +182 -0
tests/test_inference_unit.py +16 -0
tests/test_tasks_unit.py +1 -1

KNOWLEDGE.md CHANGED

@@ -24,7 +24,7 @@ IT helpdesk routing is a strong hackathon fit because it is:
 - deterministic to grade
 - naturally multi-step
-A helpdesk agent has to decide what the ticket is about, how urgent it is, who should own it, and what should happen next. That maps cleanly to a typed action object.
 ## The Repo In One Sentence
@@ -134,7 +134,7 @@ Important fields:
 ### `HelpdeskTicketAction`
-Represents the agent submission. Fields are optional because different tasks score different subsets.
 ### `HelpdeskTicketObservation`
@@ -142,6 +142,7 @@ Represents what the agent sees for each step:
 - task metadata
 - visible ticket fields
 - queue progress
 - score history
@@ -179,10 +180,19 @@ The observation exposes:
 - task metadata
 - the current ticket
 - queue progress counters
 - history
 - reward and done status
 The state tracks:
 - current task
@@ -191,12 +201,13 @@ The state tracks:
 - current ticket index
 - per-ticket scores
 - total reward
 ## Task Design
 ### Task 1: Issue Type Classification
-The agent predicts:
 - `issue_type`
@@ -206,7 +217,7 @@ Purpose:
 ### Task 2: Issue Type And Priority
-The agent predicts:
 - `issue_type`
 - `priority`
@@ -217,7 +228,7 @@ Purpose:
 ### Task 3: Full Ticket Routing
-The agent predicts:
 - `issue_type`
 - `priority`
@@ -256,14 +267,14 @@ This is now proven in checked-in unit tests rather than left as a docs claim.
 Step reward:
-- current ticket score clamped to `[0.0, 1.0]`
 Final reward:
 - average of ticket scores
-- minus a small overshoot penalty for taking more steps than the queue length
-This gives dense feedback while still rewarding efficient episode completion.
 ## Dataset Mental Model
@@ -277,6 +288,8 @@ Current structure:
 - harder ambiguous cases
 - follow-up tickets connected through `related_ticket_id`
 The dataset is meant to test routing judgment, not just keyword spotting.
 ## Grounding Note
@@ -299,16 +312,18 @@ It:
 1. connects to the environment
 2. loads the available tasks
-3. runs one episode per task
 4. picks an action for each ticket
 5. sends the action back through the client
 6. records rewards
-7. prints a task-by-task summary
 It supports:
 - heuristic mode with no external model
 - LLM mode through an OpenAI-compatible API
 ## Files That Matter Most

 - deterministic to grade
 - naturally multi-step
+A helpdesk agent has to decide what the ticket is about, how urgent it is, who should own it, and what should happen next. The current runtime now supports a small two-mode action object: investigate first when needed, then submit the final routing answer.
 ## The Repo In One Sentence
 ### `HelpdeskTicketAction`
+Represents the agent step. `action_type="submit"` carries routing fields, while `action_type="investigate"` uses a small built-in tool surface before the final submission.
 ### `HelpdeskTicketObservation`
 - task metadata
 - visible ticket fields
+- optional ambiguity or follow-up context
 - queue progress
 - score history
 - task metadata
 - the current ticket
+- available investigation tools
+- remaining free investigation budget
+- the latest tool result, when one was requested
 - queue progress counters
 - history
 - reward and done status
+Useful queue counters now include:
+- `tickets_remaining`: not-yet-processed tickets, including the current ticket when one is active
+- `tickets_after_current`: how many tickets remain after the current one
+- `queue_position`: 1-based position of the current ticket in the queue
 The state tracks:
 - current task
 - current ticket index
 - per-ticket scores
 - total reward
+- investigation step count
 ## Task Design
 ### Task 1: Issue Type Classification
+The agent ultimately predicts:
 - `issue_type`
 ### Task 2: Issue Type And Priority
+The agent ultimately predicts:
 - `issue_type`
 - `priority`
 ### Task 3: Full Ticket Routing
+The agent ultimately predicts:
 - `issue_type`
 - `priority`
 Step reward:
+- current ticket score with a small milestone bonus for strong steps and a small penalty for very weak steps
 Final reward:
 - average of ticket scores
+- minus a tiny penalty only if the agent exceeds the free investigation budget for the queue
+This keeps the reward dense and deterministic, removes the dead overshoot logic, and adds a small queue-level economics signal without disturbing the no-tool baseline path.
 ## Dataset Mental Model
 - harder ambiguous cases
 - follow-up tickets connected through `related_ticket_id`
+When a follow-up link exists, the observation can now surface a lightweight `related_ticket_preview`, and the tool layer can fetch richer related-ticket or requester-history context so the agent does not have to route every ticket from isolated text alone.
 The dataset is meant to test routing judgment, not just keyword spotting.
 ## Grounding Note
 1. connects to the environment
 2. loads the available tasks
+3. runs one episode for the requested task
 4. picks an action for each ticket
 5. sends the action back through the client
 6. records rewards
+7. prints structured logs for that run
 It supports:
 - heuristic mode with no external model
 - LLM mode through an OpenAI-compatible API
+- lightweight investigation-tool calls before the final submit action
+- an explicit local `RUN_ALL_TASKS=1` override when you want the old multi-task sweep
 ## Files That Matter Most
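
The queue counters introduced in the KNOWLEDGE.md changes above (`tickets_remaining`, `tickets_after_current`, `queue_position`) can be sketched as a small pure function. This is a minimal illustration, not the code in `server/environment.py`: the `QueueCounters` class, the `queue_counters` helper, and the behavior once the queue is finished (position pinned to the last ticket) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueCounters:
    """Hypothetical container for the three counters named in the docs."""
    tickets_remaining: int       # not-yet-processed tickets, including the current one
    tickets_after_current: int   # tickets left once the current one is handled
    queue_position: int          # 1-based position of the current ticket


def queue_counters(queue_size: int, current_index: int) -> QueueCounters:
    """Derive the counters from a 0-based ticket index.

    current_index == queue_size means the queue is finished. What the real
    environment reports in that state is not shown in the diff, so the
    values returned for it here are an assumption.
    """
    if not 0 <= current_index <= queue_size:
        raise ValueError("current_index must be within [0, queue_size]")

    ticket_active = current_index < queue_size
    remaining = queue_size - current_index
    after_current = remaining - 1 if ticket_active else 0
    position = current_index + 1 if ticket_active else queue_size
    return QueueCounters(remaining, after_current, position)
```

For a four-ticket queue, the first ticket yields `tickets_remaining=4`, `tickets_after_current=3`, `queue_position=1`, which matches the definitions given in the bullet list.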

README.md CHANGED

@@ -34,7 +34,7 @@ The environment models a realistic helpdesk workflow:
 1. a new ticket enters the queue
 2. the agent reads the ticket title and description
-3. the agent predicts structured routing fields
 4. the grader assigns deterministic credit
 5. the environment advances to the next ticket until the queue is complete
@@ -43,7 +43,7 @@ This domain is useful for OpenEnv because it is operationally realistic, easy to
 ## Why This Is A Good Hackathon Domain
 - it reflects real enterprise support operations
-- the action space is structured and judge-friendly
 - correctness can be scored deterministically
 - the hard task is meaningfully harder than the easy and medium tasks
 - the environment is small enough to rerun quickly
@@ -55,7 +55,7 @@ The project uses a queue-based episode model.
 - `reset()` samples a task and a queue of 3 to 5 tickets
 - `step()` grades one ticket submission at a time
 - `state()` exposes the internal episode snapshot
-- final reward is based on average ticket quality with a small overshoot penalty
 The environment classes and vocabulary are intentionally frozen to keep collaboration and judging simple.
@@ -115,6 +115,9 @@ Visible ticket fields:
 - `title`
 - `requester`
 - `description`
 Each observation also includes:
@@ -122,9 +125,14 @@ Each observation also includes:
 - `task_name`
 - `instructions`
 - `allowed_fields`
 - `queue_size`
 - `tickets_remaining`
 - `tickets_processed`
 - `history`
 - standard OpenEnv fields such as `done` and `reward`
@@ -138,11 +146,23 @@ The internal `HelpdeskTicketState` tracks:
 - `current_ticket_index`
 - `per_ticket_scores`
 - `total_reward`
 ## Grading And Reward
 Scoring is deterministic and normalized to `[0.0, 1.0]`.
 Per-field behavior:
 - `issue_type`: exact match, with a few near-miss partial-credit pairs
@@ -161,11 +181,15 @@ Task weights:
 Final episode reward:
 ```text
-average(per_ticket_scores) - 0.03 * max(0, steps_taken - queue_size)
 ```
 The result is clamped to `[0.0, 1.0]`.
 ## Grounded Scoring
 The grader is intentionally not fuzzy by default.
@@ -285,7 +309,7 @@ curl http://localhost:7860/tasks
 ## Running The Baseline Inference Script
-The baseline script supports two modes.
 ### Heuristic mode
@@ -295,6 +319,12 @@ If no LLM credentials are set, it uses a keyword-based ticket router:
 python inference.py
 ```
 ### LLM mode
 Set these environment variables first:
@@ -313,6 +343,14 @@ Optional target:
 - `ENV_URL`
 - default value: `http://localhost:7860`
 ## Runtime Validation Snapshot
@@ -324,7 +362,7 @@ Validated locally:
 - `/health`
 - `/tasks`
 - `/reset`
-- heuristic `inference.py` run across all 3 tasks
 Current local heuristic results:
@@ -358,7 +396,7 @@ docker run -p 7860:7860 helpdesk-ticket-routing
 Then run inference against it (default `ENV_URL` points to `http://localhost:7860`):
 ```bash
-python inference.py
 ```
 If you publish the container on a different host port, set `ENV_URL` accordingly before running `inference.py`.
@@ -376,6 +414,7 @@ OpenEnv provides the core environment endpoints, and the repo adds a custom task
 | POST | `/step` | submit an action |
 | GET | `/state` | inspect internal state |
 | GET | `/tasks` | list task metadata |
 | GET | `/docs` | interactive API docs |
 ## Submission Readiness

 1. a new ticket enters the queue
 2. the agent reads the ticket title and description
+3. the agent may investigate with lightweight tools, then submit structured routing fields
 4. the grader assigns deterministic credit
 5. the environment advances to the next ticket until the queue is complete
 ## Why This Is A Good Hackathon Domain
 - it reflects real enterprise support operations
+- the action space is structured and judge-friendly, with a small investigate-versus-submit split
 - correctness can be scored deterministically
 - the hard task is meaningfully harder than the easy and medium tasks
 - the environment is small enough to rerun quickly
 - `reset()` samples a task and a queue of 3 to 5 tickets
 - `step()` grades one ticket submission at a time
 - `state()` exposes the internal episode snapshot
+- final reward is based on average ticket quality across the queue
 The environment classes and vocabulary are intentionally frozen to keep collaboration and judging simple.
 - `title`
 - `requester`
 - `description`
+- optional `ambiguity_note`
+- optional `related_ticket_id`
+- optional `related_ticket_preview`
 Each observation also includes:
 - `task_name`
 - `instructions`
 - `allowed_fields`
+- `available_tools`
+- `investigation_budget_remaining`
+- `last_tool_result`
 - `queue_size`
 - `tickets_remaining`
+- `tickets_after_current`
 - `tickets_processed`
+- `queue_position`
 - `history`
 - standard OpenEnv fields such as `done` and `reward`
 - `current_ticket_index`
 - `per_ticket_scores`
 - `total_reward`
+- `reward`
+- `done`
 ## Grading And Reward
 Scoring is deterministic and normalized to `[0.0, 1.0]`.
+The action model now supports two paths:
+- `action_type="submit"` for the final routing answer
+- `action_type="investigate"` with a small built-in tool surface before submission
+Available tools:
+- `lookup_related_ticket`
+- `lookup_requester_history`
 Per-field behavior:
 - `issue_type`: exact match, with a few near-miss partial-credit pairs
 Final episode reward:
 ```text
+average(per_ticket_scores)
 ```
 The result is clamped to `[0.0, 1.0]`.
+Step reward is lightly milestone-shaped: high per-ticket scores get a small bonus and very low scores get a small penalty before the final clamp.
+Final reward also includes a tiny queue-economics penalty only when the agent exceeds the free investigation budget. One investigation per queued ticket is free; extra investigation steps reduce the final reward slightly.
 ## Grounded Scoring
 The grader is intentionally not fuzzy by default.
 ## Running The Baseline Inference Script
+The baseline script supports single-task evaluator mode by default, plus an explicit local batch override.
 ### Heuristic mode
 python inference.py
 ```
+By default that runs exactly one task and emits exactly one `[START] ... [END]` block. To target a specific task:
+```bash
+TASK_ID=3 python inference.py
+```
 ### LLM mode
 Set these environment variables first:
 - `ENV_URL`
 - default value: `http://localhost:7860`
+- `TASK_ID`
+- `RUN_ALL_TASKS`
+To reproduce the multi-task local benchmark sweep:
+```bash
+RUN_ALL_TASKS=1 python inference.py
+```
 ## Runtime Validation Snapshot
 - `/health`
 - `/tasks`
 - `/reset`
+- heuristic `inference.py` run across all 3 tasks with `RUN_ALL_TASKS=1`
 Current local heuristic results:
 Then run inference against it (default `ENV_URL` points to `http://localhost:7860`):
 ```bash
+RUN_ALL_TASKS=1 python inference.py
 ```
 If you publish the container on a different host port, set `ENV_URL` accordingly before running `inference.py`.
 | POST | `/step` | submit an action |
 | GET | `/state` | inspect internal state |
 | GET | `/tasks` | list task metadata |
+| GET | `/web` | lightweight HF Space UI |
 | GET | `/docs` | interactive API docs |
 ## Submission Readiness
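
The final-reward rule described in the README changes above (average of per-ticket scores, one free investigation per queued ticket, a small penalty for extra investigation steps, then a clamp to `[0.0, 1.0]`) might look roughly like the sketch below. The diff does not state the penalty size, so `EXTRA_INVESTIGATION_PENALTY` is a made-up placeholder, and `final_reward` is a hypothetical name rather than the function in `server/reward.py`.

```python
from typing import Sequence

# Hypothetical value: the README only says the penalty is "tiny".
EXTRA_INVESTIGATION_PENALTY = 0.01


def final_reward(
    per_ticket_scores: Sequence[float],
    investigation_steps: int,
    penalty_per_extra_step: float = EXTRA_INVESTIGATION_PENALTY,
) -> float:
    """Average ticket quality minus a small charge for over-investigating."""
    if not per_ticket_scores:
        return 0.0

    queue_size = len(per_ticket_scores)
    free_budget = queue_size  # one free investigation per queued ticket
    extra_steps = max(0, investigation_steps - free_budget)

    raw = sum(per_ticket_scores) / queue_size - penalty_per_extra_step * extra_steps
    return max(0.0, min(1.0, raw))  # clamp to [0.0, 1.0]
```

With this shape, an agent that never investigates, or stays within the free budget, receives exactly the plain average, which is consistent with the stated goal of leaving the no-tool baseline path undisturbed.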

ROADMAP.md CHANGED

@@ -11,10 +11,39 @@
 ## How To Use This File
 - `PROJECT_STATUS.md` is the canonical log of completed work.
-- This roadmap is the remaining execution plan from the current repo state to final submission.
 - `required.md` is now the combined official-requirements and project-compliance file.
 - `KNOWLEDGE.md` defines the current repo truth and judge-facing explanation.
 - `analysis/competition_notes.md` is the merged internal competitive note. Use it to prioritize work, but do not mention competitor repos in public-facing docs.
 ## What We Are Optimizing For
@@ -47,14 +76,51 @@ The repo already has:
 - deterministic grading with limited partial credit
 - working heuristic baseline
 - merged local validation on `/health`, `/tasks`, and `inference.py`
-- current local benchmark reference:
-  - Task 1: `1.0000`
-  - Task 2: `0.8800`
-  - Task 3: `0.9400`
-  - Overall: `0.9400`
 The remaining work should be treated as targeted strengthening, not broad feature invention.
 ## Submission Gates That Must Still Hold
 These come directly from `required.md` and `KNOWLEDGE.md`:

 ## How To Use This File
 - `PROJECT_STATUS.md` is the canonical log of completed work.
+- This roadmap is the active plan from the verified April 6, 2026 repo state to final submission.
 - `required.md` is now the combined official-requirements and project-compliance file.
 - `KNOWLEDGE.md` defines the current repo truth and judge-facing explanation.
 - `analysis/competition_notes.md` is the merged internal competitive note. Use it to prioritize work, but do not mention competitor repos in public-facing docs.
+- The dated April 3 to April 5 sections below are now historical context; the active execution block is the final 24-hour plan for April 6 to April 7, 2026.
+## Status As Of April 6, 2026
+The repo is now in the expected "stabilize and merge" phase rather than the earlier "build core fixes" phase.
+Completed and locally verified:
+- all concrete items from `gaps.md`
+- the viable low-risk improvements from `analysis/deep_competitive_gap_report.md`
+- single-task `inference.py` execution with `TASK_ID` support and optional `RUN_ALL_TASKS=1`
+- `state()` exposure of `reward` and `done`
+- richer history with predicted actions and follow-up context
+- lightweight investigate-versus-submit action support with tool-backed context lookup
+- small queue-economics signal without major benchmark redesign
+- `/web` UI route
+- local full test pass:
+  - `126 passed, 137 subtests passed`
+- local validator pass:
+  - `[OK] meta-AIHack: Ready for multi-mode deployment`
+Merge recommendation:
+- mergeable as an incremental submission-ready improvement branch
+- do not block merge on major redesign items that were explicitly out of scope:
+  - scenario-family task redesign
+  - breaking the issue-type-to-assignment shortcut
+  - large dataset expansion
+  - full queue simulator / economics redesign
 ## What We Are Optimizing For
 - deterministic grading with limited partial credit
 - working heuristic baseline
 - merged local validation on `/health`, `/tasks`, and `inference.py`
+- single-task evaluator-safe inference behavior
+- reward and done fields on `state()`
+- richer observation history and linked-ticket context
+- lightweight investigate / submit split with small built-in tool support
+- local full-suite verification:
+  - `126 passed, 137 subtests passed`
+- local validator verification:
+  - `[OK] meta-AIHack: Ready for multi-mode deployment`
 The remaining work should be treated as targeted strengthening, not broad feature invention.
+## Final 24-Hour Plan
+**Active window:** April 6 to April 7, 2026
+**Internal target:** open PR, merge to the common `main`, and complete the final smoke checks by April 7, 2026
+**Official deadline:** April 8, 2026, 11:59 PM IST
+### Must finish before merge
+- review the final diff and stage only the intended submission files
+- open the merge PR from a dedicated branch
+- merge into the shared `main` after one last reviewer pass
+- rerun the post-merge smoke checks:
+  - `pytest`
+  - `openenv validate`
+  - `/health`
+  - `/tasks`
+  - one `reset()` / `step()` sanity path
+### Do not add before merge
+- no new benchmark redesign work
+- no new dataset expansion
+- no schema churn
+- no reward refactors beyond blocker-level fixes
+- no last-minute inference prompt rewrites
+### Success condition for April 7, 2026
+- PR is up
+- PR is reviewed against `gaps.md` and `analysis/deep_competitive_gap_report.md`
+- shared `main` contains the tested gap-fix branch
+- deployment sanity checks are green
+- repo is frozen except for typo-level fixes
 ## Submission Gates That Must Still Hold
 These come directly from `required.md` and `KNOWLEDGE.md`:

analysis/deep_competitive_gap_report.md ADDED

	@@ -0,0 +1,1374 @@

+# Deep Codebase Comparison: OpenEnv Reference Environments vs This Helpdesk Project
+## Scope and Method
+This report was written from a direct code read, not from README-driven interpretation. I treated the `OpenEnv/envs` directory as the reference baseline you pointed to, and I compared it against the implementation that lives in this repository root plus `server/`.
+I focused on code that actually defines runtime behavior:
+- `models.py`
+- `inference.py`
+- `client.py`
+- `vocabulary.py`
+- `server/environment.py`
+- `server/tasks.py`
+- `server/grader.py`
+- `server/reward.py`
+- `server/app.py`
+- `tests/*.py`
+- `data/dataset.json`
+That reading set is enough to answer the question that matters: what design moves make the strongest reference environments hard to beat, where your project is currently thinner than it looks, and what concrete changes would make your environment competitive instead of merely correct.
+## Executive Verdict
+Your project is a clean, readable, deterministic mini-benchmark. It is not yet a high-ceiling agent benchmark.
+That sounds harsh, but it is also the clearest way to unlock the right next move. Right now your environment behaves much more like a structured multi-label classification task wrapped in OpenEnv than like the richer reference environments that expose hidden state, tool use, long-horizon consequences, multi-step reasoning, or grounded interaction with external systems. The code is good enough as a starter environment. It is not yet strong enough to beat the best reference projects on depth, realism, or benchmark credibility.
+The good news is that the codebase is small, coherent, and fixable. The bad news is that the gap is not a one-line polish gap. It is a benchmark design gap.
+The strongest OpenEnv reference environments win for one or more of these reasons:
+- they expose a real action surface, not just label prediction
+- they make the agent inspect state rather than infer everything from one text blob
+- they reward process, not only end labels
+- they support long-horizon or multi-step behavior
+- they are harder to brute-force with dataset-specific heuristics
+- they are backed by real engines, shells, browsers, tools, or stateful simulators
+- they treat evaluation as a first-class system, not as a tiny helper function
+Your project currently loses on most of those axes.
+At the same time, your project has an underrated advantage: the domain is practical, legible, and product-shaped. IT helpdesk routing is a great benchmark domain if you push it harder. It naturally supports ambiguity, policy lookup, account context, queue optimization, escalation rules, duplicates, follow-up chains, customer sentiment, service health, SLA clocks, and partial observability. In other words, the domain is better than the current implementation. The environment has room to grow into something much stronger without abandoning the idea.
+So the answer is not “throw this away and copy BrowserGym.” The answer is “turn this from a label benchmark into a realistic triage operations environment.”
+## What the Reference Environments Actually Do Better
+### 1. They expose richer action spaces
+The single biggest difference between your code and the strongest reference projects is that the agent in your environment does very little. In your environment, the step is basically “predict some labels for this ticket.” In the stronger reference environments, the agent interacts.
+`BrowserGymEnvironment` accepts an `action_str` and pushes it into a live browser benchmark. That means the benchmark difficulty comes from action selection in stateful UI space, not just from text classification. `OpenAppEnvironment` similarly supports `click`, `fill`, `select_option`, `goto`, `scroll`, and `send_keys`, and even mixes BrowserGym-style element IDs with raw Playwright CSS selectors for pragmatic reliability. `GitTaskEnvironment` supports clone, list, and git command execution against a Gitea-backed workspace. `Tbench2Environment` supports `exec`, `write`, `view`, `wait`, `kill`, `write_file`, and `evaluate`, which is much closer to real agent work. `FinQAEnvironment` turns the task into tool use over tables, SQL, and answer submission. `REPLEnvironment` exposes code execution with optional recursive LLM calls. `TextArenaEnvironment` takes natural-language moves and advances a game engine.
+Your environment exposes none of that. The agent does not gather missing evidence. It does not inspect a related ticket. It does not search a KB. It does not look up account tier. It does not check service health. It does not add an internal note. It does not choose between acknowledging first and escalating later. It does not defer. It does not ask for more information. It does not resolve duplicates. It does not manage a queue. It only emits one shot structured output.
+That makes the benchmark much easier to game, much easier to overfit, and much less diagnostic of real agent competence.
+### 2. They separate visible observation from hidden truth
+The strongest reference environments keep some truth state behind the curtain. The agent sees an observation. The environment owns more. That separation is what makes an environment feel like an environment instead of a dataframe with reward labels.
+In `ChessEnvironment`, the agent observes legal moves, FEN, checks, and result state, but the environment owns board progression, opponent strategy, and trajectory reward accumulation. In `MazeEnvironment`, the environment tracks maze status and legal movement dynamics. In `TextArenaEnvironment`, the wrapped engine owns turn state, raw logs, rewards, role mapping, and step info. In `FinQAEnvironment`, the agent sees the question and tools, but the hidden ground truth answer, question identity, and full structured table data live behind the environment. In `Tbench2Environment`, the hidden truth is in the task files and tests. In `BrowserGymEnvironment`, the browser session and benchmark internals are hidden behind the observation.
+Your environment has much less hidden truth than it should. The ticket label is hidden, yes, but the benchmark structure is shallow. More importantly, the code already hints at richer hidden structure and then fails to expose or exploit it. `HelpdeskTicketRecord` includes `ambiguity_note` and `related_ticket_id`, but `_build_observation()` throws both away and only exposes `ticket_id`, `title`, `requester`, and `description`. So even though the dataset contains follow-up relationships and ambiguity annotations, the environment does not actually let the agent work with them as structured state. That is a missed opportunity and a design leak at the same time.
+The dataset is telling you the domain wants threads, ambiguity, and context. The environment currently flattens it back into plain text.
+### 3. They reward more than a final label match
+The reference environments do not all have brilliant reward design, but the best ones take reward seriously.
+`REPLEnvironment` combines an outcome rubric with optional process reward. It can reward successful execution, penalize failures, and separately judge the final answer. `ChessEnvironment` uses a trajectory rubric with exponential discounting to assign credit across a game. `FinQAEnvironment` does robust answer normalization, including boxed answers, percentages, fractions, and multi-value comparisons. `TextArenaEnvironment` overlays auxiliary reward signals such as Wordle greens, yellows, repetitions, and correctness. `Tbench2Environment` evaluates by actually running tests, which is a grounded form of outcome reward.
+Your reward design is better than “exact match only,” but it is still thin. `grade_action()` uses one handcrafted issue similarity table, one handcrafted priority proximity table, and exact match for assignment group and resolution action. `compute_step_reward()` is just clamping. `compute_trajectory_reward()` averages scores and subtracts an overshoot penalty.
+That sounds reasonable until you inspect the runtime path. In practice, the overshoot penalty is effectively dead logic. `step()` increments the ticket index once per ticket and sets done when the index reaches queue length. A later `step()` call raises an error. That means `steps_taken` cannot exceed `queue_size` during normal episode execution, so the overshoot branch in `compute_trajectory_reward()` has no meaningful role in the current environment. The code suggests the benchmark penalizes wasteful action loops, but the environment does not actually allow them.
+The deeper issue is that the reward judges only final fields, not triage quality as a process. There is no penalty for unnecessary escalation unless the final field is wrong. There is no reward for correctly identifying a duplicate and linking it. There is no cost model for routing everything to security “just in case.” There is no SLA-aware penalty for under-prioritizing a time-sensitive issue that still happens to hit some partial-credit similarity. There is no queue-level reward. There is no explanation consistency. There is no tool efficiency score because there are no tools. There is no notion of customer harm, resolver cost, escalation burden, or backlog impact.
+The strongest environments earn their credibility by making reward a modeling decision. Your reward is still a convenience function.
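
The dead-overshoot argument above can be checked with a toy model. The reward expression follows the formula in the earlier README (`average(per_ticket_scores) - 0.03 * max(0, steps_taken - queue_size)`); `ToyQueueEpisode` is a simplified stand-in written for this illustration, assuming only what the report states: one index increment per step, and an error on any step after the episode is done.

```python
from typing import Sequence


def old_trajectory_reward(
    per_ticket_scores: Sequence[float], steps_taken: int, queue_size: int
) -> float:
    """Pre-change final reward, per the formula in the earlier README."""
    average = sum(per_ticket_scores) / len(per_ticket_scores) if per_ticket_scores else 0.0
    overshoot_penalty = 0.03 * max(0, steps_taken - queue_size)
    return max(0.0, min(1.0, average - overshoot_penalty))


class ToyQueueEpisode:
    """Simplified stand-in: one step per ticket, error once the queue is done."""

    def __init__(self, queue_size: int) -> None:
        self.queue_size = queue_size
        self.index = 0
        self.steps_taken = 0

    @property
    def done(self) -> bool:
        return self.index >= self.queue_size

    def step(self) -> None:
        if self.done:
            raise RuntimeError("episode is already done")
        self.index += 1
        self.steps_taken += 1


def max_overshoot(queue_size: int) -> int:
    """Run a full episode, attempt one extra step, and report the overshoot."""
    episode = ToyQueueEpisode(queue_size)
    while not episode.done:
        episode.step()
    try:
        episode.step()  # rejected, so steps_taken does not grow
    except RuntimeError:
        pass
    return max(0, episode.steps_taken - episode.queue_size)
```

Under these assumptions `max_overshoot` is zero for every queue size, so the penalty term never changes the result.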
+### 4. They support multi-step or long-horizon behavior
+Even the simpler reference environments tend to have longer horizon than your three-task ladder suggests.
+`ChessEnvironment` is naturally long horizon. `BrowserGymEnvironment` and `OpenAppEnvironment` are stepwise interactions. `TextArenaEnvironment` proceeds over turns. `Tbench2Environment` supports iterative shell work and explicit evaluation. `REPLEnvironment` supports repeated code execution over an evolving namespace. `FinQAEnvironment` allows repeated tool calls up to `max_steps` before submission. Even `ReasoningGymEnvironment`, which is single-step, supports parameterized dataset generation and configurable tasks.
+Your environment has multiple steps inside an episode, but they are just a queue of independent tickets. Each step is still one-shot labeling. Tickets do not affect each other. The queue order does not matter. There is no resource constraint. There is no carry-over state except a score list and counters. No later ticket depends on an earlier action. No policy evolves over the episode. No investigation outcome from step one informs step two.
+So while the environment is technically episodic, it is not operationally long horizon. It is batching.
+That difference matters. The best agents and best benchmarks separate “can classify one item” from “can operate over a process.” Right now your environment mainly measures the first.
+### 5. They parameterize tasks rather than freezing one tiny benchmark
+`ReasoningGymEnvironment` rebuilds datasets from `dataset_name`, `dataset_config`, `dataset_specs`, `seed`, and `size`. `BrowserGymEnvironment` can choose a benchmark and task. `Tbench2Environment` can resolve tasks by task ID or path, even downloading a repo cache if needed. `GitTaskEnvironment` supports task-specific base repo states. `REPLEnvironment` can accept context, task prompt, expected answer, recursion depth, and model parameters at reset. `FinQAEnvironment` iterates over a question bank with real data-backed tools.
+Your environment has three tasks, but they are not truly different environments. They are the same tickets with a different subset of fields exposed through `allowed_fields`. That is a very weak notion of task diversity. Task difficulty is not created by different data generating processes, different hidden state, different workflows, or different action surfaces. It is created by output dimensionality alone.
+That means the easy, medium, and hard tasks are less like three tasks and more like one task with three scoring schemas.
+### 6. They take concurrency and runtime isolation seriously
+Several reference environments explicitly set `SUPPORTS_CONCURRENT_SESSIONS = True`, including `REPLEnvironment`, `Tbench2Environment`, `ReasoningGymEnvironment`, `MazeEnvironment`, and some others. The framework core in `http_server.py` is built around WebSocket sessions, session capacity, session info, session factories, and asynchronous handling. `MCPEnvironment` has explicit async and sync step paths because the framework authors ran into real event-loop and deadlock issues. `Tbench2DockerEnvironment` handles Docker-in-Docker by copying task directories into containers rather than assuming host bind mounts. `Calendar` builds database sessions per tenant. `GitTaskEnvironment` assumes isolated workspaces. `BrowserGymEnvironment` does cleanup of resources.
+Your environment inherits some capability from OpenEnv, but your own code does not actually engage with that depth. The server is mostly a minimal `create_app()` call plus a `/tasks` endpoint. There is no custom metadata. No custom concurrency choices. No session isolation logic beyond what the base server gives you. No runtime cleanup concerns because the environment owns almost no external resources. That simplicity is pleasant, but it also means the project is not stress-tested as a real environment service.
+### 7. They integrate grounded external systems or simulators
+This is where the biggest credibility gap appears.
+`FinQAEnvironment` grounds answers in company tables and SQL. `GitTaskEnvironment` grounds tasks in actual repositories. `Tbench2Environment` grounds them in actual shell execution and tests. `BrowserGymEnvironment` grounds tasks in web environments. `TextArenaEnvironment` grounds them in game engines. `ChessEnvironment` grounds them in a real board state. `Calendar` grounds them in a stateful API-backed application.
+Your environment is grounded in a JSON dataset. That is fine for a prototype, but it is dramatically easier to shortcut. If the environment does not provide tools, latent objects, or stateful consequences, the fastest route to a good score is to learn the labeling policy over the text. That is exactly what your current `inference.py` is doing.
+If you want to beat more ambitious projects, you need to force the agent to do more than map n-grams to labels.
+## Deep Audit of Your Current Project
+### Overall strengths before the critique
+Before I get more surgical, it is worth naming what is already good:
+- The codebase is small enough to understand quickly.
+- The naming is clear and the domain is coherent.
+- Pydantic validation is used correctly in the core models.
+- The taxonomy in `vocabulary.py` is readable and operational.
+- The environment is deterministic given a seed.
+- The three-task ladder is a decent pedagogical introduction.
+- The tests, while limited, are not absent.
+- The dataset has at least some intentional ambiguity and follow-up cases.
+So this is not a bad project. It is a project that has not yet converted a good domain into a hard benchmark.
+### Domain model and task structure
+`vocabulary.py` defines a clean label space:
+- 9 issue types
+- 4 priorities
+- 6 assignment groups
+- 5 resolution actions
+- 3 task IDs
+The mapping dictionaries immediately reveal one important structural weakness: assignment group is fully determined by issue type. Every issue type maps to exactly one assignment group. That means the “assignment_group” prediction in task 3 is not an independent reasoning problem. Once the model gets issue type right, assignment group is a lookup. That collapses the apparent complexity of the hardest task.
+The same problem exists, though less absolutely, for resolution action. `ISSUE_TYPE_TO_RESOLUTION_ACTION` already maps every issue type to a default resolution action. The dataset confirms that several issue types only ever use one resolution action:
+- `feature_request -> acknowledge`
+- `general_inquiry -> acknowledge`
+- `onboarding -> fulfill`
+- `service_request -> assign`
+- `spam_phishing -> ignore`
+Only a subset of issue types vary their resolution action in practice. So task 3 looks like a four-field prediction problem, but much of it is structurally reducible to issue type plus a few keyword exceptions. That is not how hard triage environments should work if the goal is to test agentic reasoning.
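+A quick way to make this structural claim checkable is to ask, for each output field, whether it is a function of issue type over the dataset. The sketch below is illustrative only: the issue types and resolution actions come from this report, but the assignment group names, the `escalate` action, and the sample records are assumptions, not values read from `vocabulary.py` or `data/dataset.json`.
```python
from collections import defaultdict

# Hypothetical sample records. Assignment group names are invented for
# illustration; the real names live in vocabulary.py.
SAMPLE_RECORDS = [
    {"issue_type": "spam_phishing", "assignment_group": "security_team", "resolution_action": "ignore"},
    {"issue_type": "spam_phishing", "assignment_group": "security_team", "resolution_action": "ignore"},
    {"issue_type": "application_support", "assignment_group": "application_team", "resolution_action": "assign"},
    {"issue_type": "application_support", "assignment_group": "application_team", "resolution_action": "escalate"},
    {"issue_type": "onboarding", "assignment_group": "onboarding_team", "resolution_action": "fulfill"},
]


def values_by_issue_type(records, field):
    """Map each issue type to the set of values observed for `field`."""
    observed = defaultdict(set)
    for record in records:
        observed[record["issue_type"]].add(record[field])
    return dict(observed)


def is_determined_by_issue_type(records, field):
    """True when every issue type maps to exactly one value of `field`."""
    return all(len(values) == 1 for values in values_by_issue_type(records, field).values())
```
+Running `is_determined_by_issue_type` over the real dataset for `assignment_group` and `resolution_action` would turn the argument above into a regression check: any field that returns `True` is a lookup, not a separate judgment.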
+`server/tasks.py` compounds this by defining difficulty purely as output field count:
+- Task 1: issue type only
+- Task 2: issue type plus priority
+- Task 3: full routing
+The ticket pool is the same across tasks. There is no task-specific curation, no task-family-specific observation, no different process constraints, and no different control surface. The only thing that changes is what the grader will read from the submitted action.
+That means your easy-medium-hard ladder is mostly a scoring ladder, not an environment ladder.
+### Observation and state design
+`HelpdeskTicketObservation` contains:
+- task metadata
+- `allowed_fields`
+- `current_ticket`
+- queue counts
+- history
+`current_ticket` exposes only:
+- `ticket_id`
+- `title`
+- `requester`
+- `description`
+This is too little for a benchmark that wants to simulate real helpdesk operations, and it is oddly little given what your data already stores. `HelpdeskTicketRecord` also includes:
+- `ambiguity_note`
+- `related_ticket_id`
+Those two fields are exactly the sort of structured hints that could turn this from flat classification into contextual triage. Yet `_build_observation()` discards them. That means the dataset contains richer structure than the observation contract.
+The state is also minimal:
+- `current_task_id`
+- `seed`
+- `queue_ticket_ids`
+- `current_ticket_index`
+- `per_ticket_scores`
+- `total_reward`
+This is enough for bookkeeping, but not enough for operational simulation. There is no notion of:
+- queue ordering rationale
+- account status
+- customer tier
+- outage context
+- prior communication attempts
+- internal notes
+- pending escalations
+- workload or resolver capacity
+- elapsed time or SLA timers
+- deduplication chains
+- partial investigation state
+The result is that the environment never becomes more informative or more demanding as the episode progresses. The state is a score ledger, not a world model.
+Compare that with the stronger references:
+- `BrowserGymState` tracks benchmark, task, URL, goal, max steps, cumulative reward.
+- `REPLState` tracks context, prompt, iteration, namespace keys, final answer, total execution time.
+- `Tbench2State` tracks task, session, command history, terminal readiness, last output.
+- `TextArenaState` tracks turn, raw state, last reward, last info, environment identity.
+- `FinQAState` tracks current question, company, ground truth, question ID.
+Those states are not just counters. They represent the environment’s evolving operational memory. Yours mostly does not.
+### Environment lifecycle
+`HelpdeskTicketRoutingEnvironment.reset()` is straightforward:
+- coerce `seed`
+- get task definition
+- seed RNG
+- sample a queue size from 3 to 5
+- sample that many tickets from the fixed dataset
+- initialize state
+- return the first observation
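+The queue-building part of those reset steps can be sketched in a few lines. This is a minimal reconstruction from the description above, not the code in `server/environment.py`; the `build_queue` name, the `ticket-NNN` ID format, and sampling without replacement are assumptions.
```python
import random


def build_queue(ticket_ids, seed):
    """Sample a queue of 3 to 5 distinct tickets, deterministically per seed."""
    rng = random.Random(seed)          # private RNG, so global state is untouched
    queue_size = rng.randint(3, 5)     # randint is inclusive on both ends
    return rng.sample(ticket_ids, queue_size)  # assumed: no repeated tickets


# Assumed ID format for a 45-ticket pool.
ticket_ids = [f"ticket-{i:03d}" for i in range(1, 46)]
queue_a = build_queue(ticket_ids, seed=7)
queue_b = build_queue(ticket_ids, seed=7)
```
+Because the RNG is constructed from the seed on every call, `queue_a` and `queue_b` are identical, which is the property a seed reproducibility test would assert.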
+`step()`:
+- validates reset happened
+- grades action against current ticket
+- computes reward
+- advances to next ticket
+- if done, computes trajectory reward
+- otherwise returns immediate step reward
+This is tidy. It is also shallow.
+There is no environment mutation other than index movement. No internal state changes based on the chosen action. No branching. No action-dependent future ticket behavior. No queue reprioritization. No retries. No note writing. No escalation backlog. No “wrong earlier action causes downstream penalty.” The only environment response is score feedback.
+A benchmark like this can still be useful, but it sits much closer to supervised evaluation than to agentic interaction. That becomes a competitive problem when the reference set includes environments where actions actually transform the world.
+One subtle but important weakness is that `step()` does not enforce the task contract tightly. `HelpdeskTicketAction` allows all four fields to be present on any task, and `grade_action()` simply reads the fields relevant to the chosen `task_id`. Extra fields are ignored. That means the environment tells the agent “allowed_fields are X,” but it does not enforce “only X may be submitted.” It is not catastrophic, but it reflects a looser benchmark contract than the environment surface suggests.
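+One way to tighten that contract is to check the submitted action against `allowed_fields` before grading. The sketch below works on plain dictionaries and uses hypothetical helper names (`find_extra_fields`, `enforce_contract`); it is not the project's implementation, and rejecting the action is only one option (a score penalty is another).
```python
def find_extra_fields(action, allowed_fields):
    """Return submitted (non-None) fields that the current task does not allow."""
    return sorted(
        name
        for name, value in action.items()
        if value is not None and name not in allowed_fields
    )


def enforce_contract(action, allowed_fields):
    """Raise if the action carries fields outside the task contract."""
    extras = find_extra_fields(action, allowed_fields)
    if extras:
        raise ValueError(f"fields not allowed for this task: {extras}")
    return action
```
+For a task 1 episode, `find_extra_fields({"issue_type": "onboarding", "priority": "high"}, ["issue_type"])` reports `["priority"]`, while the same action with `priority` left as `None` passes.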
+### Grader and reward design
+`server/grader.py` is the most benchmark-defining file in the project, and it currently underdelivers relative to its importance.
+What is good:
+- it has partial credit for issue-type confusions
+- it has proximity-based scoring for priority
+- task weights sum to 1
+- it is deterministic
+- it is easy to reason about
+What is weak:
+- the similarity tables are static, narrow, and handcrafted
+- assignment group and resolution action are exact-match only even though the environment does not expose enough context to make some distinctions fully grounded
+- there is no calibration check on over-escalation
+- there is no queue-level objective
+- there is no policy compliance signal
+- there is no explanation consistency
+- there is no distinction between “reasonable but conservative” and “reckless but lucky”
+The biggest conceptual weakness is that the reward is local and label-centric. A strong helpdesk environment should care about operational behavior, not just answer key overlap.
+For example, suppose two actions both get the final resolution action wrong:
+- one escalates a low-risk general inquiry to security
+- one acknowledges a critical account lockout without escalation
+Today those mistakes mostly show up as missed fields in a flat weighted sum. But in real operations they are qualitatively different failures. One wastes specialist capacity. The other is a dangerous underreaction. A competitive benchmark should encode that asymmetry.
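+A minimal sketch of that asymmetry, assuming the grader can tell whether a ticket truly needed escalation. The function name and both cost values are hypothetical placeholders, chosen only so that underreaction costs more than overreaction.
```python
def escalation_penalty(needs_escalation, escalated, under_cost=0.5, over_cost=0.1):
    """Penalize dangerous underreaction more heavily than wasteful overreaction."""
    if needs_escalation and not escalated:
        return under_cost   # e.g. acknowledging a critical lockout without escalating
    if escalated and not needs_escalation:
        return over_cost    # e.g. sending a low-risk inquiry to security
    return 0.0              # the escalation decision matched the need
```
+The exact costs would have to be calibrated against the rest of the reward, but even this two-constant form separates the two failure modes that a flat weighted sum treats alike.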
+There is also a concrete implementation weakness in `compute_trajectory_reward()`. It computes:
+- average per-ticket score
+- minus `0.03 * overshoot`
+But `overshoot = max(0, steps_taken - queue_size)`, and the environment ends the episode when the current ticket index reaches queue length. After that point, further stepping raises an error. So in the normal execution path, overshoot is effectively always zero. The code suggests the environment cares about extra wasted steps, but the environment does not actually permit them. That means part of the trajectory logic is decorative rather than active.
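+The dead branch is easy to demonstrate. The sketch below rebuilds the formula from the description above (it is not copied from `server/reward.py`) and enumerates the overshoot values that are reachable when stepping stops at the end of the queue; `reachable_overshoots` is a hypothetical helper.
```python
def compute_trajectory_reward(per_ticket_scores, steps_taken, queue_size):
    """Average per-ticket score minus 0.03 for each step beyond the queue size."""
    average = sum(per_ticket_scores) / len(per_ticket_scores)
    overshoot = max(0, steps_taken - queue_size)
    return average - 0.03 * overshoot


def reachable_overshoots(queue_size):
    """Overshoot values possible when the episode ends at the last ticket."""
    # The environment refuses further steps, so steps_taken never exceeds queue_size.
    return {max(0, steps - queue_size) for steps in range(1, queue_size + 1)}
```
+For any queue size, `reachable_overshoots` returns `{0}`, so the penalty term can only fire if the function is called with a `steps_taken` value the environment itself never produces.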
+In strong benchmarks, reward code usually reveals the benchmark’s philosophy. In your project, the reward code mostly reveals the current label schema.
+### Dataset design
+`data/dataset.json` currently holds 45 tickets. The class distribution is not terrible for a prototype, but it is still small:
+- `application_support`: 9
+- `billing_license`: 7
+- `service_request`: 6
+- `security_compliance`: 5
+- `spam_phishing`: 5
+- `identity_access`: 4
+- `onboarding`: 4
+- `general_inquiry`: 3
+- `feature_request`: 2
+That is a tiny dataset for any benchmark that hopes to resist memorization or heuristic overfitting. The especially small classes are a concern. A benchmark with 2 feature requests and 3 general inquiries is not meaningfully testing generalization in those categories.
+The priority distribution is also limited:
+- critical: 9
+- high: 15
+- medium: 12
+- low: 9
+That is balanced enough to be usable, but not rich enough to encode the true structure of priority assignment. There is no obvious representation of customer segment, contractual urgency, outage blast radius, legal exposure, dependency graphs, or business calendar sensitivity. Priority is largely being inferred from words in the title and description, which is exactly what a heuristic baseline will exploit.
+The dataset does have four ambiguous records and three records linked to earlier tickets as follow-ups. That is good. But because the environment does not structurally expose `ambiguity_note` or `related_ticket_id`, those richer cases do not actually become richer environment mechanics. They mostly remain hints for the benchmark designer, not tools for the agent.
+The follow-up handling is especially underused. Tickets like `ticket-038` and `ticket-045` clearly encode longitudinal customer frustration and repeated failure, which should change triage behavior. But the environment treats them like standalone text blobs. There is no action to inspect previous tickets. No thread retrieval. No stateful consequence from unresolved history. The environment has the seed of longitudinal realism and then does not build on it.
+There is also no train/eval split, no hidden split, no procedural generation, no adversarial generation, and no OOD slice. The same fixed dataset defines the universe. That is fine for unit tests. It is weak for a benchmark intended to compete.
+### Inference baseline and benchmark leakage
+`inference.py` is more important than it may look, because it tells you how easy the benchmark is to shortcut.
+The heuristic path:
+- scans ticket text for fixed issue-type keywords in fixed order
+- assigns priority from small keyword buckets
+- assigns resolution action from issue type plus a few escalation and fulfillment keywords
+- assigns assignment group from issue type mapping
+That baseline is not merely a harmless example. It is a diagnostic of benchmark leakage. The easier it is to hand-author a ruleset that tracks your label policy, the less benchmark headroom you have.
+And in this codebase, the baseline is not just simple. It is tightly coupled to the environment’s ontology:
+- it uses the exact taxonomy constants
+- it exploits the one-to-one issue-to-assignment mapping
+- it exploits mostly deterministic issue-to-resolution defaults
+- it assumes priority is keyword-addressable from the visible text alone
+That means the benchmark currently invites ontology-driven shortcutting.
+There is an even more concerning signal. The tests describe a heuristic baseline around `0.9400`, but a local code-faithful replay of the rule ordering in PowerShell over the full `data/dataset.json` gives a much weaker picture:
+- issue type exact accuracy: about `0.7333`
+- priority exact accuracy: about `0.3778`
+- assignment exact accuracy: about `0.7333`
+- resolution exact accuracy: about `0.6889`
+- full task-3 exact match: about `0.2444`
+- weighted average score across tasks 1, 2, and 3: about `0.7344`
+The exact number is less important than what it implies: the benchmark narrative about heuristic strength and the actual rule behavior appear out of sync. That can happen for several reasons:
+- the tests are stale relative to current data
+- the claimed baseline was measured on sampled queues rather than the whole dataset
+- the heuristic ordering now creates more collisions than expected
+- the benchmark evolved without a full-baseline recomputation
+Whatever the cause, it is a warning sign. When benchmark claims and benchmark code diverge, trust in the environment falls.
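+A full-dataset recomputation is cheap to script, which is one way to keep the claimed baseline and the code in sync. The sketch below is an assumption-laden outline: `exact_match_report` is a hypothetical helper, `predict` stands in for the heuristic in `inference.py`, and records are assumed to be dictionaries carrying the four label fields.
```python
FIELDS = ("issue_type", "priority", "assignment_group", "resolution_action")


def exact_match_report(records, predict):
    """Per-field exact accuracy plus full-match rate over every record."""
    total = len(records)
    hits = {name: 0 for name in FIELDS}
    full_matches = 0
    for record in records:
        prediction = predict(record)
        matches = [prediction.get(name) == record[name] for name in FIELDS]
        for name, matched in zip(FIELDS, matches):
            hits[name] += matched          # bool counts as 0 or 1
        full_matches += all(matches)
    report = {name: hits[name] / total for name in FIELDS}
    report["full_match"] = full_matches / total
    return report
```
+Running this over all records, not over sampled queues, and pinning the result in a golden regression test would give the benchmark one official number to point at.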
+### Test strategy
+Your project has six test files. That is good relative to many small hackathon projects. But the content of the tests matters more than the count.
+The most important limitation is that multiple tests stub the OpenEnv types, interfaces, or `create_app()` implementation rather than exercising the real installed framework. `tests/openenv_test_stubs.py` injects fake `openenv.core.env_server.types`. `tests/test_environment_smoke.py` and `tests/test_api_integration.py` patch in a fake `Environment` base class. `tests/test_api_integration.py` also installs a stub `create_app` that returns a small FastAPI app with simplified routes.
+That means much of the test suite verifies your code against a locally simulated OpenEnv contract, not against the actual `openenv-core` dependency declared in `pyproject.toml`.
+This is a big competitive weakness because the reference repository’s core is full of behavior that your test harness never touches:
+- WebSocket `/ws` interactions
+- session handling
+- concurrency settings
+- serialization edge cases
+- metadata and schema endpoints
+- MCP endpoints
+- async step paths
+- actual `EnvClient` protocol semantics
+Your tests mostly prove that the environment behaves under your own simplified assumptions. That is useful, but it is not the same as proving robust OpenEnv integration.
+The other limitation is that the tests are mostly shallow-contract tests:
+- reset returns something valid
+- step increments counts
+- reward is in `[0, 1]`
+- task IDs are present
+- heuristic episodes do not error
+Those are necessary. They are not sufficient for a competitive benchmark.
+What is missing includes:
+- real WebSocket end-to-end tests
+- invalid action contract tests with actual framework validation
+- tests for extra fields on restricted tasks
+- concurrency tests
+- seed reproducibility tests across actual server sessions
+- golden regression tests on full-dataset benchmark score
+- hidden/eval split integrity tests
+- tests for ambiguity and follow-up handling
+- tests that verify the environment is hard in the intended way, not just runnable
+In short, the current test suite validates operability, not benchmark integrity.
+## Critical Gaps That Matter Most
+This section is the most actionable part of the report. If the goal is to beat stronger reference projects, these are the gaps that matter.
+### Gap 1: The project is benchmarked as an environment, but designed as a classifier
+The core problem is conceptual. Your code uses the OpenEnv interface, but the actual task shape is still mostly multi-label classification over short ticket text.
+The better reference environments are hard because the agent has to interact:
+- `BrowserGymEnvironment` asks the agent to act in a browser.
+- `FinQAEnvironment` asks the agent to inspect tools and query structured data.
+- `REPLEnvironment` asks the agent to iteratively execute code and decide when to finalize.
+- `Tbench2Environment` asks the agent to manipulate a terminal workspace and then survive evaluation.
+- `TextArenaEnvironment` asks the agent to play through game turns.
+Your environment asks the agent to emit labels. Even when multiple tickets appear in a queue, the agent is still doing the same one-shot operation repeatedly. It is not exploring, not investigating, not mutating meaningful state, not managing resources, and not making action-sequence tradeoffs.
+That difference is bigger than it looks. Once the benchmark is classifier-shaped, the fastest route to good performance is classifier-shaped too. The environment does not force the agent to behave like an operator. It only asks it to sound like one.
+That is why the next leap must be architectural, not cosmetic.
+### Gap 2: The hardest task is structurally easier than it claims
+Task 3 appears to be a four-field routing task, but the ontology collapses much of the difficulty.
+`ISSUE_TYPE_TO_ASSIGNMENT_GROUP` is one-to-one. If the agent gets issue type right, assignment group is already implied. That means one quarter of the task-3 score is mostly a lookup rather than a separate judgment call.
+Resolution action is not fully deterministic, but it is still heavily compressed by issue type defaults. Several issue types have only one action in practice across the dataset. Others vary under small numbers of recognizable phrases such as legal threat, follow-up pressure, or explicit request wording.
+So the “hard” task is closer to:
+- infer issue type
+- infer urgency from a few cues
+- apply one deterministic mapping
+- apply one mostly deterministic mapping with a few exceptions
+That is not trivial, but it is much less rich than real service-desk routing. Real hard cases exist when the same visible ticket text can map to different actions depending on hidden context such as account tier, live incident status, prior history, or internal policy. Your environment does not currently model those cases.
+### Gap 3: The environment underuses the best parts of its own data
+Your dataset is more interesting than your observation contract.
+`HelpdeskTicketRecord` contains `ambiguity_note` and `related_ticket_id`. Those are exactly the kinds of fields that could turn this into a stronger environment:
+- ambiguity makes decisions less keyword-deterministic
+- related ticket IDs create thread continuity
+- follow-ups create escalation pressure and temporal realism
+But `_build_observation()` discards them and only exposes the basic ticket text fields.
+That has two consequences:
+First, the richer authored structure is lost to the agent. Second, the benchmark stops short of the very complexity the dataset author was already beginning to encode.
+This is one of the clearest signs that the current project is a first version. The seeds of a deeper environment are already present in the data model. The runtime contract just does not use them.
+### Gap 4: There is no investigation loop
+In real helpdesk operations, the visible complaint is rarely the whole decision problem.
+An operator often needs to know:
+- whether the requester is on an enterprise contract
+- whether the problem aligns with an active outage
+- whether the user is an admin
+- whether prior tickets already established a root cause
+- whether a security signal exists on the account
+- whether a compliance deadline is legally binding
+- whether the request is actually a duplicate
+Your environment has no tool loop for this. The agent sees a title, requester, and description, then is expected to decide everything directly.
+That makes the environment much easier to brute-force and much less realistic than the domains represented by the best reference projects. `FinQAEnvironment` does not ask the model to guess answers from wording alone; it gives tools. `GitTaskEnvironment` gives a repo. `Tbench2Environment` gives a terminal. `BrowserGymEnvironment` gives a browser. Your helpdesk environment gives a paragraph.
+The fastest path to a stronger benchmark is to add internal tools and make the hardest scenarios impossible to solve reliably without using them.
+### Gap 5: There is almost no internal economics
+A good environment usually has some notion of tradeoff or cost even if it is not expressed as money.
+In your environment:
+- there is no time budget
+- there is no backlog pressure
+- there is no penalty for over-escalating except field mismatch
+- there is no cost for routing everything to the safest specialist
+- there is no consequence for queue ordering
+- there is no tension between fast response and careful investigation
+The queue exists, but it is not an economy. It is just a list.
+That means the environment cannot really test operational judgment. It can only test whether the final labels match the benchmark designer’s answer key. Stronger environments force decisions under constraints. Your current implementation mostly scores unconstrained annotation.
+### Gap 6: The reward story is thinner than the benchmark story
+`grade_action()` is neat and deterministic, but it still mainly scores label overlap. It does not score operator quality.
+There is no difference between:
+- a cautious but slightly conservative routing choice
+- a reckless underreaction that happens to get some partial credit
+- an unnecessary escalation that wastes the security team
+- a smart intermediate step that gathers evidence before final routing
+Those distinctions do not exist because the action surface does not allow them and the reward design does not look for them.
+There is also a direct implementation issue: `compute_trajectory_reward()` includes an overshoot penalty, but because the environment ends when the queue is exhausted and refuses later steps, overshoot does not really happen in the normal path. So part of the trajectory logic looks more meaningful than it actually is.
+When reward code contains dead or decorative logic, trust in the benchmark drops.
+### Gap 7: The current benchmark is highly vulnerable to ontology memorization
+The more the task can be solved by memorizing your ontology and keyword policy, the lower the ceiling of the benchmark.
+Right now the environment is vulnerable because:
+- the dataset is small
+- the label space is public and fixed
+- some output fields are deterministic functions of others
+- the observation is a short text blob
+- the heuristic baseline directly encodes the ontology
+- there is no hidden split or generator-based variation
+The current inference script is a warning sign here. It is not just a demo baseline. It is evidence that a carefully chosen keyword system can cover a large fraction of the problem structure because the problem structure is currently that compressible.
+If you want to build something harder to game, the benchmark must stop being reducible to a keyword policy plus a few ontology tables.
+### Gap 8: The tests are too synthetic for the actual risk profile
+The test suite checks that the environment is runnable. It does not yet prove that the benchmark is trustworthy.
+The biggest limitation is the heavy use of stubs around the OpenEnv dependency boundary. Several tests replace the real OpenEnv types, interfaces, or `create_app()` implementation. That helps local testability, but it means the suite is not validating actual WebSocket session behavior, actual framework serialization, actual schema generation, or actual concurrency handling.
+That is a serious gap if the environment is meant to compete with stronger projects. Reference environments are embedded in a framework that supports:
+- WebSocket sessions
+- session capacity and session info
+- schema endpoints
+- metadata endpoints
+- MCP endpoints
+- sync and async execution paths
+Your current tests mostly validate business logic under a simplified local harness. That is still useful. It is just not enough to prove benchmark robustness.
+There is also no strong integrity suite around the benchmark itself. Missing pieces include:
+- full-dataset regression scoring
+- hidden split integrity
+- adversarial edge-case suites
+- benchmark versioning checks
+- ambiguity and follow-up behavior tests
+- contract tests that verify the hard task is genuinely hard in the intended way
+If you want the project to be taken seriously, the environment and the benchmark need separate test surfaces.
+### Gap 9: The benchmark narrative and executable reality are drifting apart
+A benchmark becomes fragile when people cannot tell which number to trust.
+Your tests imply a strong heuristic baseline. The environment code and local replay of the actual heuristic rules over the dataset suggest a weaker story. That discrepancy may be caused by stale thresholds, changed data, queue sampling effects, or unrefreshed benchmark assumptions. Whatever the reason, it is not a small issue.
+Strong benchmarks need executable answers to simple questions:
+- what is the official baseline?
+- how is it measured?
+- on which split?
+- with what seeds?
+- on which version of the data?
+- under which scenario families?
+Right now those answers are not fully stabilized in code. The result is that the benchmark is harder to trust than it should be.
+That may sound administrative, but it is actually competitive. A benchmark that feels ad hoc will lose to a benchmark that feels governed, even if both are interesting.
+### Gap 10: The project does not yet have a competitive moat
+The strongest environments in the reference set each have a clear identity:
+- BrowserGym: browser-native multimodal interaction
+- FinQA: tool-mediated reasoning over structured finance data
+- REPL: iterative code execution and rubric-based finalization
+- TBench2: terminal tasks grounded by executable evaluation
+- Calendar: stateful tool ecosystem over application APIs
+- Chess: adversarial long-horizon board play
+Your current identity is “helpdesk routing from short ticket text.” That is useful, but not yet distinctive enough to dominate.
+The domain itself can support a much stronger identity:
+- service desk triage under partial observability
+- enterprise support operations with tool use and policy constraints
+- multi-ticket queue management under SLA and escalation economics
+That is the moat you should build. The domain is good enough. The current benchmark shape is not yet deep enough to own it.
+## What Specific Reference Environments Teach You
+### BrowserGym: rich observations create real decision space
+`BrowserGymObservation` includes text, URL, optional screenshot, goal, accessibility tree text, pruned HTML, error strings, and action-error flags. `BrowserGymEnvironment` carefully converts raw benchmark objects into those modalities and preserves additional metadata while filtering large raw fields.
+The lesson is not “copy browser features.” The lesson is that an observation should support several reasoning strategies at once. Strong environments do not force everything through one narrow channel if the domain can naturally expose more.
+Your helpdesk environment should likely move from a plain ticket view to a mixed observation view that includes structured context, queue state, optional note previews, and pointers to retrievable evidence. A stronger observation contract makes the environment harder to solve with surface heuristics and easier to use for real agent development.
+### FinQA: tool use transforms a QA task into an environment
+`FinQAEnvironment` is one of the most relevant reference environments for your redesign. It takes a question-answering domain that could have been implemented as “read prompt, output answer” and instead builds a tool-mediated workflow:
+- list tools
+- inspect table descriptions
+- inspect table metadata
+- run SQL queries
+- submit final answer
+The ground truth is hidden. The agent has to do work. The reward system then normalizes answer formats so the benchmark is measuring reasoning rather than answer string quirks.
+Your helpdesk project should follow that pattern. The hard task should not be “read ticket and guess routing.” It should be “use service desk tools to investigate and then submit routing.” That would immediately raise the benchmark ceiling.
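+As a rough illustration of what that could look like, here is a toy tool surface where context stays hidden until the agent asks for it. Everything in it is hypothetical: the `ToolDesk` class, the tool names, and the hidden context values are invented for this sketch and do not come from FinQA or from the current project.
```python
class ToolDesk:
    """Toy tool surface: hidden context is only reachable through tool calls."""

    def __init__(self, hidden_context):
        self._hidden = hidden_context  # never placed in the observation
        self.calls = []                # audit log, usable later for process reward

    def list_tools(self):
        return ["lookup_account", "check_incidents"]

    def call(self, name, ticket_id):
        if name not in self.list_tools():
            raise KeyError(f"unknown tool: {name}")
        self.calls.append((name, ticket_id))
        return self._hidden[ticket_id][name]


# Invented hidden context for a single demo ticket.
desk = ToolDesk({
    "ticket-demo": {
        "lookup_account": {"tier": "enterprise"},
        "check_incidents": {"active_incident": True},
    }
})
account = desk.call("lookup_account", "ticket-demo")
```
+The `calls` log matters as much as the return values: it lets the grader see whether the agent investigated before submitting a final routing action.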
+### REPL: process reward and outcome reward should be separate
+`REPLEnvironment` is instructive because it distinguishes execution quality from final answer quality. The environment tracks iterations, namespace state, execution results, and finalization patterns. The rubric layer then separates outcome reward from process reward.
+That is directly applicable to helpdesk operations. A strong service desk environment should separately measure:
+- whether the final routing/action was correct
+- whether the agent investigated responsibly
+- whether the agent made avoidable operational mistakes
+- whether the agent wasted steps or overused escalation
+Without that split, you cannot tell the difference between good operations and lucky guessing.
+### TBench2: grounded evaluation is a moat
+`Tbench2Environment` is powerful because success is not a declared label. It is an executable check. The agent can manipulate a workspace and then call `evaluate`, which runs tests. That style of evaluation is very hard to fake and very easy to defend.
+Helpdesk will not use pytest in the same way, but the principle transfers cleanly. A stronger helpdesk benchmark should evaluate against hidden operational truth and downstream effects, not just a visible label table. If the environment can compute whether the chosen action violated SLA policy, ignored an active incident, or misrouted a duplicate chain, then benchmark credibility goes up immediately.
+### Calendar MCP: tool ecosystems can scale if the boundary is clean
+The Calendar stack shows how a domain can become more realistic without exploding the action schema. The environment exposes tools, request context, user context, and database-backed state. Tool handlers are generic where possible and dynamic routing does a lot of the heavy lifting.
+For your domain, that is a strong hint that helpdesk should probably become tool-centric. Instead of stuffing everything into one giant action object, expose a small set of operational tools. This will scale better, feel more realistic, and let you design harder scenarios without turning the action model into a kitchen sink.
+### GitTask: reproducible scenario resets matter
+`GitTaskEnvironment` is not the most feature-rich environment in the set, but it gets one important thing right: reproducible task state. Reset means something concrete. The environment can put you back into a known repo state efficiently.
+You need the same discipline in scenario design. Instead of sampling any 3 to 5 tickets from one public pool, define reproducible episode families:
+- urgent outage follow-up
+- mixed billing queue
+- false-positive security scare
+- onboarding plus access control bundle
+- executive escalation chain
+Once episodes become scenario-driven rather than ticket-sampled, the benchmark will feel much more intentional.
+### Chess and TextArena: delayed reward and auxiliary signals are valuable
+`ChessEnvironment` plus `ChessWinLossRubric` shows how delayed reward can be modeled cleanly across a trajectory. `TextArenaEnvironment` plus its reward providers shows how auxiliary signals can coexist with the main reward without replacing it. Those patterns matter because helpdesk operations are not fully one-shot even when the final routing choice is what gets judged.
+In a stronger version of your environment, you could preserve a main final reward while also emitting auxiliary channels such as:
+- evidence quality
+- duplicate-handling quality
+- escalation efficiency
+- SLA awareness
+- customer experience quality
+- policy compliance
+Even if you keep one main scalar reward for training or evaluation, those auxiliary signals would make the benchmark much more diagnosable.
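+One lightweight shape for this, sketched under the assumption that auxiliary channels are diagnostics only and never change the scalar used for scoring. The `RewardBreakdown` class and its method names are hypothetical, not part of OpenEnv or of the current grader.
```python
from dataclasses import dataclass, field


@dataclass
class RewardBreakdown:
    main: float                                     # the single scored reward
    auxiliary: dict = field(default_factory=dict)   # channel name -> value in [0, 1]

    def scalar(self):
        """Return only the main reward; auxiliary channels are diagnostics."""
        return self.main

    def weakest_channel(self):
        """Name of the lowest auxiliary channel, or None if none were recorded."""
        if not self.auxiliary:
            return None
        return min(self.auxiliary, key=self.auxiliary.get)


breakdown = RewardBreakdown(
    main=0.8,
    auxiliary={"sla_awareness": 1.0, "escalation_efficiency": 0.4},
)
```
+Here `breakdown.scalar()` stays at the main reward while `breakdown.weakest_channel()` points at `escalation_efficiency`, which is the kind of signal that makes a failed episode diagnosable.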
+### ReasoningGym and Maze: simplicity is fine if it is honest
+`ReasoningGymEnvironment` is a simple parameterized single-step environment. `MazeEnvironment` is a simple gridworld. Neither one pretends to be deeper than it is. That honesty is useful as a design lesson.
+If you want to keep a light version of your current project, that is perfectly reasonable. But then it should be presented as a starter triage benchmark, not as a fully realized agentic operations environment. If you want to claim higher competitive value, the environment itself needs to support that claim with deeper mechanics.
+## A Concrete Design for Beating the Stronger Projects
+The right goal is not to imitate the broadest reference project. The right goal is to go much deeper in one domain you already own.
+You do not need to out-BrowserGym BrowserGym. You do not need to out-TBench2 TBench2. You need to become clearly better at service desk operations simulation than the reference set is today.
+### North star: build a service operations simulator
+The strongest future version of this project looks more like an IT service desk simulator than a label prediction benchmark.
+Core properties of that simulator should be:
+- partially observed ticket and account state
+- internal tools for investigation
+- scenario families rather than one static pool
+- multi-step resolution workflows
+- queue-level tradeoffs
+- policy-aware reward
+- hidden evaluation truth
+If you hit those properties, you will not just be polishing the current environment. You will be changing the category of the benchmark.
+### Proposed visible entities
+The agent should see richer but still realistic objects, for example:
+- ticket thread summary
+- current requester details
+- account/org summary
+- queue overview
+- recent internal note previews
+- live incident banner or incident tool access
+- available tools
+- allowed actions
+- task budget and SLA hints
+That does not mean every observation must be huge. It means the visible world should make the agent reason like an operator instead of like a labeler.
+### Proposed hidden entities
+The environment should own hidden state that determines the correct policy:
+- canonical root-cause category
+- customer tier
+- resolver ownership
+- actual business impact
+- active incident linkage
+- prior unresolved duplicates
+- whether manual escalation is necessary or wasteful
+- whether policy requires a specific handling path
+- whether the ticket is self-servable by documented guidance
+These hidden variables are what create genuinely hard cases. Two tickets that look similar on the surface should sometimes route differently because the hidden state differs.
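To make that concrete, here is a toy sketch in which the same visible text maps to different correct answers depending on hidden state. `HiddenState` and `expected_resolution` are illustrative names, and the rule itself is invented for the example rather than taken from the grader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HiddenState:
    active_incident: bool
    customer_tier: str  # illustrative values: "free" or "enterprise"


def expected_resolution(visible_text: str, hidden: HiddenState) -> str:
    """Toy ground-truth rule: the visible text alone does not decide the answer."""
    if "locked out" in visible_text.lower() and hidden.active_incident:
        return "merge_duplicate"  # symptom of an already known incident
    if hidden.customer_tier == "enterprise":
        return "escalate"
    return "fulfill"
```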
+### Proposed action surface
+I would split the action space into investigation actions and commitment actions.
+Investigation actions:
+- `lookup_requester`
+- `get_account_plan`
+- `get_related_tickets`
+- `check_service_health`
+- `search_kb`
+- `inspect_internal_notes`
+- `get_security_signals`
+- `get_asset_or_license_state`
+Operational actions:
+- `add_internal_note`
+- `request_more_info`
+- `merge_duplicate`
+- `set_priority`
+- `assign_group`
+- `escalate`
+- `acknowledge`
+- `submit_final_decision`
+This preserves your current routing taxonomy while forcing the agent to earn the final answer through interaction.
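One possible shape for that split, sketched with plain dataclasses. `EpisodeLog` and `apply_action` are hypothetical names, and only a subset of the actions listed above is included to keep the example short.

```python
from dataclasses import dataclass, field

INVESTIGATION_ACTIONS = {"lookup_requester", "get_related_tickets", "check_service_health", "search_kb"}
COMMITMENT_ACTIONS = {"set_priority", "assign_group", "escalate", "submit_final_decision"}


@dataclass
class EpisodeLog:
    evidence: list[str] = field(default_factory=list)
    done: bool = False


def apply_action(log: EpisodeLog, name: str) -> EpisodeLog:
    """Record investigation steps; only `submit_final_decision` ends the episode."""
    if log.done:
        raise ValueError("episode already finished")
    if name in INVESTIGATION_ACTIONS:
        log.evidence.append(name)
    elif name in COMMITMENT_ACTIONS:
        log.done = name == "submit_final_decision"
    else:
        raise ValueError(f"unknown action: {name}")
    return log
```

Keeping the evidence trail on the episode makes it possible to score process quality later without changing the final-decision schema.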
+### Proposed task families
+Replace the current output-field ladder with scenario families.
+1. **Baseline classification**
+   Keep a simple version of the current task for calibration.
+2. **Priority under operational context**
+   Add visible account metadata and SLA hints.
+3. **Tool-assisted routing**
+   Hard cases require evidence retrieval.
+4. **Follow-up chain handling**
+   Correct routing depends on thread history and prior failures.
+5. **Duplicate resolution**
+   The agent must detect and merge with existing tickets or note the linkage.
+6. **Queue management**
+   Multiple tickets compete for limited steps or limited escalation budget.
+7. **Incident-aware triage**
+   Correct behavior depends on checking active incident state.
+8. **Policy-constrained operations**
+   Compliance, security, or executive-account policies change what the correct action is.
+Now difficulty comes from task structure, not just output dimensionality.
+### Proposed reward design
+A strong reward design for this domain should likely have four layers.
+Layer 1: **final outcome correctness**
+- correct issue family
+- correct priority
+- correct resolver team
+- correct action
+Layer 2: **operational policy correctness**
+- no violation of mandatory escalation rules
+- no unjustified critical priority
+- no missed compliance deadlines
+- no unsupported closure
+Layer 3: **process quality**
+- useful tool use
+- correct duplicate inspection
+- efficient evidence gathering
+- no unnecessary specialist escalation
+Layer 4: **episode economics**
+- queue-wide quality
+- backlog harm
+- escalation cost
+- SLA miss cost
+That may sound like a lot, but you do not need to expose all of it as one scalar at once. Some of it can be stored as metadata or auxiliary reward channels first.
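As a rough sketch of that idea, the function below returns one scalar and keeps the per-layer scores as metadata. The weights and the `compose_reward` name are assumptions made for illustration; the report does not prescribe any values.

```python
# Illustrative weights only; tune or replace them for the real benchmark.
LAYER_WEIGHTS = {"outcome": 0.5, "policy": 0.2, "process": 0.2, "economics": 0.1}


def compose_reward(layers: dict[str, float]) -> tuple[float, dict[str, float]]:
    """Return (scalar reward, per-layer metadata) from layer scores in [0, 1]."""
    missing = set(LAYER_WEIGHTS) - set(layers)
    if missing:
        raise KeyError(f"missing layers: {sorted(missing)}")
    scalar = sum(LAYER_WEIGHTS[name] * layers[name] for name in LAYER_WEIGHTS)
    # The metadata copy can be surfaced as auxiliary channels before it affects training.
    return round(scalar, 6), dict(layers)
```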
+### Proposed data strategy
+Do not try to hand-author ten thousand fully custom tickets from scratch. Instead, build a layered data strategy.
+Layer A: curated seed cases
+- your best handcrafted exemplars
+- ambiguous pairs
+- follow-up chains
+- adversarial near-neighbors
+Layer B: templated scenario generation
+- same underlying issue with different requester tiers
+- same wording with different hidden incident context
+- duplicate vs non-duplicate versions
+- billing dispute with and without outage linkage
+Layer C: hidden benchmark splits
+- development split
+- public validation split
+- private evaluation split
+Layer D: scenario tagging
+- issue family
+- ambiguity level
+- investigation depth required
+- tool requirement
+- risk class
+- queue pressure
+This approach gives you scale without giving up control.
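A small sketch of Layers B and C together, assuming hypothetical helpers `generate_variants` and `assign_split`. The template axes and the 60/20/20 bucket thresholds are invented for the example.

```python
import hashlib
from itertools import product

# Hypothetical template axes, loosely following Layer B above.
TIERS = ["free", "enterprise"]
INCIDENT_CONTEXT = [False, True]


def generate_variants(base_id: str, description: str) -> list[dict]:
    """Expand one seed case into variants that differ only in hidden context."""
    return [
        {
            "scenario_id": f"{base_id}-{tier}-{'incident' if incident else 'isolated'}",
            "description": description,
            "hidden": {"tier": tier, "active_incident": incident},
        }
        for tier, incident in product(TIERS, INCIDENT_CONTEXT)
    ]


def assign_split(base_id: str) -> str:
    """Stable split keyed on the seed-case id, so variants never straddle splits."""
    bucket = int(hashlib.sha256(base_id.encode()).hexdigest(), 16) % 10
    if bucket < 6:
        return "development"
    return "public_validation" if bucket < 8 else "private_evaluation"
```

Hashing the seed-case id rather than the variant id is the design choice that matters here: every variant of one seed case lands in the same split.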
+## File-by-File Improvement Plan for This Repository
+This section ties the redesign back to the actual code you already have. The point is to show how the current repo can evolve into the stronger benchmark rather than be abandoned.
+### `models.py`
+Right now the models encode the benchmark as a label submission problem. That is fine for version one and too restrictive for version two.
+I would keep the existing validation patterns, but I would expand the schema into typed action families and typed observation payloads.
+Recommended direction:
+- keep `HelpdeskTicketRecord`, but add typed visible vs hidden fields
+- replace the loose `current_ticket: Optional[dict[str, str]]` with a ticket-view model
+- split actions into investigation actions and final submission actions
+- add typed structures for tool results, notes, queue items, and thread previews
+- enrich state with scenario metadata, action audit trail, and resource counters
+Why this matters:
+As long as the schema itself says “the agent submits optional routing fields,” every other part of the environment will naturally stay classifier-shaped. Schema is architecture. If you want the environment to feel agentic, the models have to make agentic behavior first-class.
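A minimal sketch of the visible-versus-hidden split, using standard-library dataclasses as a stand-in for the project's Pydantic models. `TicketRecord`, `TicketView`, and `to_view` are hypothetical names; the field names follow `data/dataset.json`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TicketRecord:
    # Visible fields.
    ticket_id: str
    title: str
    requester: str
    description: str
    # Hidden ground truth; must never reach the agent.
    issue_type: str
    priority: str
    assignment_group: str
    resolution_action: str
    related_ticket_id: Optional[str] = None


@dataclass(frozen=True)
class TicketView:
    ticket_id: str
    title: str
    requester: str
    description: str


def to_view(record: TicketRecord) -> TicketView:
    """Project a full record onto the agent-visible subset."""
    return TicketView(record.ticket_id, record.title, record.requester, record.description)
```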
+### `server/environment.py`
+This file is currently the main reason the benchmark feels thin. It is clean, but it is clean because it has very little world logic.
+I would evolve it in stages.
+Stage 1:
+- expose structured thread/follow-up information
+- enforce task contracts more tightly
+- store full action history, not just scores
+- make scenario metadata visible
+Stage 2:
+- add tool dispatch for investigation actions
+- maintain scenario-local hidden state
+- let actions mutate environment state
+- support final decision submission separately from intermediate investigation
+Stage 3:
+- add queue-level episodes with budget constraints
+- let earlier choices affect later ticket handling
+- introduce scenario-specific logic for duplicates, incidents, and policy constraints
+Why this matters:
+This file should become the simulator, not just the grader entrypoint.
+### `server/tasks.py`
+This file needs the most conceptual change after the environment itself.
+The current task list is:
+- task 1: issue type only
+- task 2: issue type plus priority
+- task 3: full routing
+That is too narrow. I would turn `tasks.py` into a scenario-family registry instead.
+For example:
+- `single_ticket_classification`
+- `priority_under_sla`
+- `tool_assisted_routing`
+- `duplicate_chain_resolution`
+- `incident_aware_triage`
+- `queue_optimization`
+- `policy_constrained_security_case`
+Each task family should define:
+- visible observation contract
+- allowed actions
+- hidden truth generator
+- episode budget
+- reward composition
+- benchmark split membership
+Why this matters:
+Right now tasks differ by scoring columns. A strong benchmark needs tasks that differ by problem structure.
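One way such a registry could look, as a hedged sketch. `ScenarioFamily`, `REGISTRY`, and `register` are hypothetical, and only a few of the contract fields listed above are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioFamily:
    name: str
    allowed_actions: frozenset[str]
    episode_budget: int  # maximum steps per episode
    split: str           # illustrative values: "public" or "private"


REGISTRY: dict[str, ScenarioFamily] = {}


def register(family: ScenarioFamily) -> ScenarioFamily:
    """Add a family to the registry, rejecting duplicates and empty budgets."""
    if family.name in REGISTRY:
        raise ValueError(f"duplicate scenario family: {family.name}")
    if family.episode_budget <= 0:
        raise ValueError("episode_budget must be positive")
    REGISTRY[family.name] = family
    return family


register(ScenarioFamily("single_ticket_classification", frozenset({"submit_final_decision"}), 1, "public"))
register(ScenarioFamily(
    "tool_assisted_routing",
    frozenset({"search_kb", "get_related_tickets", "submit_final_decision"}),
    6,
    "public",
))
```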
+### `server/grader.py`
+This file should stop being only a lookup-based scorer and become the place where service-desk policy is encoded.
+I would keep the basic idea of partial credit, but move from a pure field-overlap worldview to a policy-and-outcome worldview.
+Examples of richer scoring logic:
+- small penalty for unnecessary escalation
+- strong penalty for under-prioritizing active access outages
+- reward for correctly linking duplicates
+- reward for choosing acknowledgment before final resolution when that is the right workflow
+- penalty for routing compliance work to general support
+- scenario-aware scoring where the same visible ticket can score differently depending on retrieved evidence
+Why this matters:
+The grader is the actual benchmark. It should reflect operational quality, not only taxonomy overlap.
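A sketch of how two of those rules could sit on top of field overlap. The penalty sizes and the `grade` signature are assumptions for illustration, not the existing grader's behavior.

```python
def grade(predicted: dict[str, str], truth: dict[str, str]) -> float:
    """Field-overlap score adjusted by two illustrative policy rules."""
    fields = ["issue_type", "priority", "assignment_group", "resolution_action"]
    score = sum(predicted.get(f) == truth[f] for f in fields) / len(fields)

    # Small penalty: escalating when the truth says escalation was unnecessary.
    if predicted.get("resolution_action") == "escalate" and truth["resolution_action"] != "escalate":
        score -= 0.1
    # Strong penalty: marking a truly high-priority ticket as low.
    if truth["priority"] == "high" and predicted.get("priority") == "low":
        score -= 0.3
    return max(0.0, min(1.0, score))
```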
+### `server/reward.py`
+This file is a good place to simplify and then rebuild.
+First, remove or redesign logic that is not meaningfully active, such as the current overshoot penalty that normal episode flow does not really trigger.
+Then add reward layers deliberately:
+- final decision score
+- process score
+- economics score
+- optional auxiliary diagnostics
+Why this matters:
+A benchmark becomes much easier to improve if the reward code honestly reflects what is being optimized.
+### `server/app.py`
+This file is currently fine for a minimal environment, but it should grow once the environment grows.
+Recommended additions:
+- environment metadata endpoint support if you want richer UI or benchmark introspection
+- possibly custom routes for benchmark info, scenario families, or baseline metadata
+- cleaner packaging around path setup once the project stabilizes
+Why this matters:
+This is not the highest-priority file, but stronger benchmark ergonomics do help credibility and usability.
+### `data/dataset.json`
+This file should evolve from “the benchmark” into “part of the benchmark.”
+Keep a curated hand-authored slice, but do not let one public JSON file define the whole environment forever.
+Recommended evolution:
+- expand the dataset substantially
+- add many more feature request and general inquiry cases
+- add multiple duplicate chains
+- add hidden context fields
+- add templated variants of existing scenarios
+- create a private evaluation bank
+Why this matters:
+A tiny fixed public dataset makes memorization too easy and benchmark claims too brittle.
+### `inference.py`
+This file is useful, but it currently plays several roles at once:
+- demo script
+- heuristic baseline
+- optional LLM runner
+- environment smoke path
+I would separate those responsibilities.
+Recommended structure:
+- one official deterministic baseline runner
+- one optional tool-using baseline runner once tools exist
+- one separate example script for simple local usage
+- one benchmark harness that records split, seed, scenario family, and version
+Why this matters:
+Benchmarks need reproducible baselines more than they need convenient demos.
+### `tests/`
+The most important change after environment design is testing philosophy.
+I would split tests into at least four groups:
+1. **unit tests**
+   Validation, scoring primitives, dataset loaders, tool helpers.
+2. **real integration tests**
+   Actual OpenEnv app, actual serialization, actual WebSocket interactions.
+3. **benchmark regression tests**
+   Fixed scenario suites, stable baseline scores, hidden split checks.
+4. **integrity tests**
+   No task leakage, no duplicate split contamination, no benchmark version drift.
+Why this matters:
+A serious benchmark is a data product, an environment product, and an evaluation product. The tests should reflect all three.
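As an example of the integrity group, a split-contamination check can be very small. `find_split_leaks` is a hypothetical helper, and the split names in the usage are placeholders.

```python
def find_split_leaks(splits: dict[str, list[str]]) -> set[str]:
    """Return scenario ids that appear in more than one split."""
    first_seen: dict[str, str] = {}
    leaks: set[str] = set()
    for split_name, scenario_ids in splits.items():
        for scenario_id in scenario_ids:
            # setdefault records the first split an id was seen in.
            if first_seen.setdefault(scenario_id, split_name) != split_name:
                leaks.add(scenario_id)
    return leaks
```

A regression test would then assert that the returned set is empty for the released splits.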
+## Practical Roadmap
+### Phase 1: Make the current environment honest and sturdier
+This is the fastest and cheapest improvement phase. Do this even if you are not ready for a full redesign.
+Goals:
+- expose thread/follow-up structure
+- tighten task contracts
+- recompute and stabilize baseline measurements
+- add a hidden evaluation split
+- remove decorative reward logic
+- improve test realism
+Deliverables:
+- stronger observation model
+- benchmark regression script
+- real integration tests
+- scenario-family-aware tasks, even if still text-only
+This phase will not yet make the environment winner-beating, but it will make it much more defensible.
+### Phase 2: Add tool-assisted investigation
+This is the highest-return phase because it changes the category of the benchmark.
+Minimum viable tool set:
+- requester/account lookup
+- related-ticket retrieval
+- service health lookup
+- KB search
+- final decision submission
+Once those exist, create scenario families where the visible ticket text is insufficient without tool use. That immediately raises the benchmark ceiling and makes shortcut solutions much harder.

+### Phase 3: Add operational economics and queue-level behavior
+After tool use works, add:
+- queue-wide episodes
+- time or action budgets
+- escalation cost
+- SLA miss cost
+- duplicate-handling benefit
+- specialist-capacity awareness
+This turns the environment from a case-by-case annotation task into an operational management task.
+### Phase 4: Add benchmark governance
+At this point you should formalize:
+- public vs private splits
+- scenario-family tags
+- official baselines
+- benchmark versioning
+- scorecards by scenario family
+- release notes for benchmark changes
+This is what makes the project not just interesting, but trustworthy.
+## Prioritized Recommendation List
+If I had to choose only ten improvements, in order, I would choose these:
+1. Stop defining difficulty only by `allowed_fields`.
+2. Add investigation tools and final submission as separate actions.
+3. Break the deterministic issue-type-to-assignment shortcut.
+4. Make resolution depend on hidden operational context more often.
+5. Surface follow-up and related-ticket structure.
+6. Expand data and add hidden eval splits.
+7. Add process-aware reward and remove dead trajectory logic.
+8. Add queue-level economics and limited budgets.
+9. Replace stub-heavy integration tests with real framework tests.
+10. Publish a stable benchmark harness and official baseline measurement.
+## Final Assessment
+After a deep code read, my conclusion is simple:
+Your project is promising, readable, and based on a very strong domain. But in its current form it is still a compact routing benchmark, not yet a high-ceiling service-operations environment.
+The better reference environments in `OpenEnv/envs` are better not because they are bigger for the sake of being bigger, but because they force the agent to operate inside state, tools, or consequences that cannot be collapsed into label mapping so easily.
+The encouraging part is that your domain can support exactly that kind of benchmark. IT helpdesk operations naturally contain ambiguity, hidden context, tool use, policy constraints, long threads, queue pressure, and downstream costs. Very few toy domains offer that combination so cleanly.
+So the right move is not to abandon the project. The right move is to evolve it.
+If you keep the current shape and only add more tickets, you will get a better classifier benchmark. That may be useful, but it probably will not beat the strongest reference projects.
+If you turn this into a tool-assisted, partially observed, multi-step service-operations simulator with stronger reward design and stronger benchmark governance, then you can absolutely build something more compelling than many of the reference environments, because your domain has the right raw material for a benchmark that is both realistic and highly evaluable.
+The domain is already winner material.
+The current implementation is starter material.
+The opportunity is to close that gap deliberately.
+## Appendix A: Comparative Scorecard
+The table below is not a scientific benchmark. It is a code-read scorecard based on the implementations reviewed in this report. The goal is to make the gap tangible.
+| Dimension | Your project now | Strong reference environments |
+| --- | --- | --- |
+| Action richness | Low | Medium to very high |
+| Hidden state depth | Low | Medium to high |
+| Tool use | None | Present in FinQA, Calendar, TBench2, Git, REPL |
+| Multistep interaction | Low-medium | Medium to high |
+| Queue/process economics | Very low | Medium in some envs, high in operational ones |
+| Reward sophistication | Low-medium | Medium to high |
+| Benchmark anti-overfitting | Low | Medium |
+| Runtime realism | Low | Medium to high |
+| Testing depth | Low-medium | Medium to high at repo scale |
+| Domain relevance | High | Varies by env |
+| Potential ceiling | High | Already demonstrated in several envs |
+The most important row here is the last one. Your current implementation is not yet at the same level as the strongest references, but the domain ceiling is absolutely high enough to catch up and possibly surpass them if you execute the redesign well.
+## Appendix B: What You Should Preserve
+When teams hear “major redesign,” they often accidentally throw away the parts that were already working. I do not recommend that here.
+The current project has several strengths that should be preserved as you expand it:
+### 1. Preserve the compactness of the taxonomy
+The label space in `vocabulary.py` is clear and product-shaped. It is not bloated. Even when the environment becomes tool-based and stateful, keep the routing ontology understandable. The problem with the current benchmark is not that the taxonomy is wrong. The problem is that the environment around the taxonomy is too thin.
+### 2. Preserve deterministic core scoring where possible
+Even after you add process reward and hidden context, keep as much deterministic scoring as possible. One reason your current project is easy to debug is that the grader is inspectable. Do not replace everything with opaque LLM judging if you can avoid it. Use explicit hidden truth and rule-based evaluation for most of the benchmark, and reserve softer judging only for areas that truly need it.
+### 3. Preserve readability
+The current codebase is easy to onboard into. That is an asset. Several bigger reference environments are strong, but also much harder to reason about quickly because they wrap external systems or broad framework machinery. As you deepen this project, keep modules well-separated:
+- models
+- scenario generation
+- environment runtime
+- tools
+- scoring
+- reward composition
+- benchmark harness
+That separation will make future iteration much faster.
+### 4. Preserve seeded reproducibility
+Your existing environment is deterministic under a seed, and that is worth keeping. Stronger benchmarks become much easier to trust when a given scenario family plus seed reproduces the same world state. As you add hidden context and generators, make seed behavior even more explicit instead of less.
+### 5. Preserve explicit validation
+The Pydantic validation in the current models is a quiet strength. Keep that discipline. As the action surface grows, validation becomes more important, not less. Tools and action types should reject malformed inputs cleanly so that environment failures are informative rather than muddy.
+## Appendix C: Example Scenario Families for Version 2
+To make the redesign more concrete, here are example scenario families that would feel much closer to a winner-level helpdesk benchmark.
+### Scenario Family 1: Access outage with incident ambiguity
+Visible state:
+- multiple users report being locked out
+- one requester sounds urgent
+- another sounds like a normal password reset
+Hidden state:
+- there is an active identity provider outage
+- some tickets are duplicate symptoms of the same incident
+Tools needed:
+- `check_service_health`
+- `get_related_tickets`
+- `lookup_requester_role`
+What this tests:
+- whether the agent distinguishes isolated access issues from systemic incidents
+- whether it avoids handling every case as an independent ticket
+- whether it correctly prioritizes executive or admin users without overreacting on every case
+### Scenario Family 2: Billing dispute tied to product defect
+Visible state:
+- customer says they were charged incorrectly
+- another case mentions checkout failures
+Hidden state:
+- the billing dispute is caused by a known application defect that duplicated transactions
+Tools needed:
+- `get_related_tickets`
+- `check_service_health`
+- `read_internal_incident_note`
+What this tests:
+- whether the agent routes based on real causal structure rather than superficial department ownership
+- whether it recognizes that pure billing handling is insufficient because engineering is involved
+### Scenario Family 3: Compliance deadline with account-context twist
+Visible state:
+- requester references GDPR or legal obligation
+Hidden state:
+- some requests are legitimate deletion requests
+- some are actually admin-level data export requests misphrased as deletion
+- some belong to customers on contracts with defined response obligations
+Tools needed:
+- `lookup_contract_tier`
+- `retrieve_policy_snippet`
+- `get_account_data_scope`
+What this tests:
+- whether the agent can combine legal wording with account and policy context
+- whether it overroutes all legal-sounding tickets to the same team
+### Scenario Family 4: Duplicate-heavy queue optimization
+Visible state:
+- ten tickets in a queue
+- several appear to be related
+Hidden state:
+- six are duplicates of two underlying issues
+- one low-volume ticket is actually the most SLA-critical
+Tools needed:
+- `get_related_tickets`
+- `merge_duplicate`
+- `set_priority`
+- `submit_queue_plan`
+What this tests:
+- whether the agent can manage a queue as a system
+- whether it reduces work through linkage
+- whether it balances urgency against volume
+### Scenario Family 5: Feature request versus broken workflow
+Visible state:
+- customer asks for export filters or better reporting
+Hidden state:
+- in some scenarios the feature genuinely does not exist
+- in others the feature exists but the customer lacks permissions or is using the wrong path
+Tools needed:
+- `search_kb`
+- `lookup_plan_features`
+- `inspect_recent_product_change`
+What this tests:
+- whether the agent treats every request for missing functionality as a feature request
+- whether it can separate education/support from roadmap input
+## Appendix D: Red Flags to Avoid During the Redesign
+There are a few ways a redesign like this can go wrong. Avoid these.
+### 1. Do not add tools that are merely decorative
+If a hard task can still be solved reliably without using the tools, then the tool surface is just benchmark theater. The hard scenario families should be designed so that retrieved evidence actually changes the correct answer.
+### 2. Do not make every scenario gigantic
+Richer does not mean bloated. Some scenarios should stay compact. The goal is meaningful hidden context, not maximum token count.
+### 3. Do not replace all scoring with LLM judging
+Use explicit hidden truth and deterministic scoring wherever possible. Opaque judging should be a last resort, not a default.
+### 4. Do not let the ontology become a maze
+Your current taxonomy is pleasantly clean. Keep it that way. More realism should come from state and evidence, not from exploding the label space into dozens of nearly indistinguishable categories.
+### 5. Do not forget benchmark governance
+If you add scenario generation but do not formalize splits, baselines, and versioning, you will create a cooler environment without creating a more trustworthy benchmark.

data/dataset.json CHANGED Viewed

@@ -538,6 +538,42 @@
         "resolution_action":  "escalate",
         "ambiguity_note":  null,
         "related_ticket_id":  "ticket-030"
+    },
+    {
+        "ticket_id":  "TKT-NONDEFAULT-001",
+        "title":  "Billing question from free-tier account",
+        "requester":  "user@freetier.io",
+        "description":  "I have a question about my invoice but I am on the free plan and there is no charge. The billing team cannot action this; please route to service desk for general assistance.",
+        "issue_type":  "billing_license",
+        "priority":  "low",
+        "assignment_group":  "service_desk",
+        "resolution_action":  "fulfill",
+        "ambiguity_note":  "Account tier is free; billing team cannot action, routed to service desk",
+        "related_ticket_id":  null
+    },
+    {
+        "ticket_id":  "TKT-NONDEFAULT-002",
+        "title":  "App vulnerability flagged in compliance scan",
+        "requester":  "security@clientcorp.com",
+        "description":  "Our compliance scan flagged a product-specific vulnerability in the application layer. This is not a general security policy issue but an app bug requiring the application team to remediate.",
+        "issue_type":  "security_compliance",
+        "priority":  "high",
+        "assignment_group":  "application_team",
+        "resolution_action":  "escalate",
+        "ambiguity_note":  "Compliance issue is product-specific (app vulnerability), routed to app team",
+        "related_ticket_id":  null
+    },
+    {
+        "ticket_id":  "TKT-NONDEFAULT-003",
+        "title":  "Contractor onboarding blocked by access issue",
+        "requester":  "pm@contractorco.com",
+        "description":  "A new contractor cannot complete onboarding because their account access is blocked by a permissions error. The onboarding team cannot resolve access issues; routing to service desk.",
+        "issue_type":  "onboarding",
+        "priority":  "medium",
+        "assignment_group":  "service_desk",
+        "resolution_action":  "fulfill",
+        "ambiguity_note":  "Contractor onboarding blocked by access issue, routed to service desk",
+        "related_ticket_id":  null
     }
 ]

gaps.md ADDED Viewed

	@@ -0,0 +1,146 @@

+# Gap Analysis — IT Helpdesk Ticket Routing OpenEnv
+Deep cross-reference of the codebase against every concrete mentor statement from the bootcamp transcript and Discord Q&A.
+---
+## GAP 1 — CRITICAL: `inference.py` runs all 3 tasks in one invocation
+**Mentor (4/1/26, 9:48 PM, confirmed twice):**
+> "inference.py should execute a single task per run and emit exactly one [START] … [END] block. The evaluation system handles running across multiple tasks, so batching all tasks in one invocation is not expected."
+**Your code in `inference.py`:**
+```python
+TASKS = list(TASK_IDS)  # [1, 2, 3]
+for task_id in TASKS:   # loops all 3
+    emit_log("START", ...)
+    ...
+    emit_log("END", ...)
+emit_log("END", overall_avg=...)  # second END
+```
+The evaluator calls `inference.py` once per task. Your script ignores that and runs all 3 itself, emitting 3 `[START]`/`[END]` pairs plus a trailing `[END]` carrying the overall average. The evaluator expects exactly one `[START]` … `[END]` block per run. There is no `TASK_ID` env var read anywhere.
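A hedged sketch of a single-task run. It assumes the evaluator passes the task through a `TASK_ID` environment variable and that defaulting to task 1 is acceptable; both are assumptions to confirm with the organizers. The log payload is a placeholder, not the required format.

```python
VALID_TASK_IDS = (1, 2, 3)


def resolve_task_id(env: dict[str, str]) -> int:
    """Read a single task id from an environment mapping; default to task 1."""
    raw = env.get("TASK_ID", "1")
    task_id = int(raw)
    if task_id not in VALID_TASK_IDS:
        raise ValueError(f"TASK_ID must be one of {VALID_TASK_IDS}, got {raw!r}")
    return task_id


def run_single_task(env: dict[str, str]) -> list[str]:
    """Return the log lines for one run: exactly one START and one END."""
    task_id = resolve_task_id(env)
    lines = [f"[START] task_id={task_id}"]
    # The episode loop for this one task would go here.
    lines.append(f"[END] task_id={task_id}")
    return lines
```

In the real script this would be called with `dict(os.environ)` and the lines printed, with no outer loop over tasks.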
+---
+## GAP 2 — CRITICAL: `state()` response is missing `reward` and `done` fields
+**Mentor (4/1/26, 9:33 PM):**
+> "state() must return minimum: `{ 'observation': ..., 'reward': last_step_reward, 'done': True/False }`"
+**Your `HelpdeskTicketState` model:**
+```python
+class HelpdeskTicketState(State):
+    current_task_id: Optional[int] = None
+    seed: Optional[int] = None
+    queue_ticket_ids: list[str]
+    current_ticket_index: int = 0
+    per_ticket_scores: list[float]
+    total_reward: float = 0.0
+    # NO reward field (last step reward)
+    # NO done field
+```
+`GET /state` returns this model directly. The evaluator checking `state()` for `reward` and `done` will find neither. `total_reward` is the accumulated reward, which the mentor explicitly said NOT to return; the expected value is the last step reward.
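A minimal sketch of a state object that carries both fields, written with dataclasses as a stand-in for the real Pydantic `State` subclass. `StateSnapshot`, `record_step`, and `to_payload` are hypothetical names.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class StateSnapshot:
    current_task_id: Optional[int] = None
    current_ticket_index: int = 0
    per_ticket_scores: list[float] = field(default_factory=list)
    total_reward: float = 0.0
    reward: Optional[float] = None  # last step reward, not the running total
    done: bool = False


def record_step(state: StateSnapshot, step_reward: float, queue_size: int) -> StateSnapshot:
    """Update the running totals and the last-step fields after one ticket."""
    state.per_ticket_scores.append(step_reward)
    state.total_reward += step_reward
    state.current_ticket_index += 1
    state.reward = step_reward
    state.done = state.current_ticket_index >= queue_size
    return state


def to_payload(state: StateSnapshot) -> dict:
    """Serialize the state the way a `GET /state` handler might return it."""
    return asdict(state)
```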
+---
+## GAP 3 — MEDIUM: `history` in observation is too sparse for RL usefulness
+**Ben (YouTube bootcamp, ~00:31:07):**
+> "process supervision... give these more detailed rewards... enrich history with ticket title, predicted fields"
+**Your `_build_observation` history:**
+```python
+history.append({"step": i + 1, "score": s})
+# final entry gets: {"step": N, "ticket_id": ..., "score": ..., "breakdown": ...}
+```
+Non-final history entries only have `step` and `score`. No ticket title, no predicted action fields. The agent cannot learn from history because it cannot see what it predicted or what the ticket was. This directly weakens RL signal quality.
+---
+## GAP 4 — MEDIUM: No milestone/delta reward shaping — flat score passthrough
+**Mentor (4/1/26, 9:34 PM):**
+> "A deterministic terminal grader with partial credit is valid, but it's better to include some intermediate (non-terminal) reward signals as well so the environment provides step-wise feedback. Milestone-based shaping is preferred over dense per-action rewards."
+**Your `step()` in `environment.py`:**
+```python
+if is_done:
+    final_reward = traj_reward   # trajectory reward only at end
+else:
+    final_reward = step_reward   # per-ticket score for non-final steps
+```
+You do return `step_reward` on non-final steps, which is correct. But `step_reward` is just `compute_step_reward(score)` which is `max(0.0, min(1.0, score))` — identical to the raw score. There is no shaping, no milestone signal, no delta-based signal. This is a quality gap, not a blocker.
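One possible milestone signal, sketched under stated assumptions: a one-off bonus when the agent reaches the queue midpoint with a running average of at least 0.8. The threshold, the bonus size, and the `shaped_step_reward` name are all invented for illustration.

```python
def shaped_step_reward(score: float, scores_so_far: list[float], queue_size: int) -> float:
    """Raw score plus a one-off bonus at the queue midpoint (illustrative numbers)."""
    base = max(0.0, min(1.0, score))
    history = scores_so_far + [base]
    midpoint = (queue_size + 1) // 2
    bonus = 0.0
    # Milestone: reaching the midpoint with a running average of at least 0.8.
    if len(history) == midpoint and sum(history) / len(history) >= 0.8:
        bonus = 0.1
    return min(1.0, base + bonus)
```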
+---
+## GAP 5 — MEDIUM: `observation.history` doesn't include the predicted action
+**Your `_build_observation`:**
+```python
+history_entry = {
+    "ticket_id": current_ticket.ticket_id,
+    "score": score,
+    "breakdown": breakdown,
+}
+```
+The agent's own predicted action is never stored in history. When the agent looks at history to decide its next action, it cannot see what it previously predicted. This is a real RL signal gap — the agent has no memory of its own decisions.
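A sketch of a richer history record that addresses both this gap and GAP 3. `build_history_entry` is a hypothetical helper; the key names extend the entry shown above.

```python
from typing import Any, Optional


def build_history_entry(
    step: int,
    ticket: dict[str, str],
    predicted: dict[str, Optional[str]],
    score: float,
    breakdown: dict[str, float],
) -> dict[str, Any]:
    """One history record that carries both the ticket and the agent's own prediction."""
    return {
        "step": step,
        "ticket_id": ticket["ticket_id"],
        "title": ticket["title"],
        # Drop unset optional fields so the entry only shows what was submitted.
        "predicted": {k: v for k, v in predicted.items() if v is not None},
        "score": score,
        "breakdown": breakdown,
    }
```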
+---
+## GAP 6 — LOW: `tickets_remaining` semantics slightly ambiguous
+**Your `_build_observation`:**
+```python
+tickets_remaining=max(0, queue_size - idx),
+```
+`idx` is `current_ticket_index` which has already been incremented by `step()` before `_build_observation` is called. During the episode, `tickets_remaining` counts the current ticket as "remaining" even though it is being processed. Minor but could confuse an LLM agent reading the observation.
+---
+## GAP 7 — LOW: `openenv.yaml` `entry_point` vs `pyproject.toml` `server` script mismatch
+**Mentor (3/31/26, 11:27 PM):**
+> "The validator is checking for a specific callable entrypoint. In some setups, it expects a main() function instead of an app object."
+**Your `pyproject.toml`:**
+```toml
+[project.scripts]
+server = "server.app:main"
+```
+**Your `openenv.yaml`:**
+```yaml
+entry_point: server.environment:HelpdeskTicketRoutingEnvironment
+```
+These point to different things. The validator may check `entry_point` in `openenv.yaml` and expect it to match `[project.scripts] server`. This inconsistency could cause validation confusion.
+---
+## GAP 8 — LOW: No `/web` UI endpoint — blank HF Space page
+**Ben (YouTube, ~00:45:08):**
+> "They're small apps and they're based as spaces. So they're deployed with a UI and an API."
+The echo env example had `/web` for the UI. Your app has no `/web` route. The mentor said UI is optional and not scored, but the HF Space will show a blank page with no UI, which looks unpolished to judges doing Phase 3 human review.
+---
+## Summary
+| # | Gap | Severity | File(s) |
+|---|-----|----------|---------|
+| 1 | `inference.py` runs all 3 tasks, evaluator expects 1 per run | CRITICAL | `inference.py` |
+| 2 | `GET /state` missing `reward` (last step) and `done` fields | CRITICAL | `models.py`, `environment.py` |
+| 3 | `history` missing predicted action — agent has no memory of decisions | MEDIUM | `environment.py` |
+| 4 | No milestone/delta reward shaping — flat score passthrough | MEDIUM | `reward.py` |
+| 5 | `history` non-final entries missing ticket title | MEDIUM | `environment.py` |
+| 6 | `tickets_remaining` semantics slightly ambiguous | LOW | `environment.py` |
+| 7 | `openenv.yaml` `entry_point` vs `pyproject.toml` `server` script mismatch | LOW | `openenv.yaml`, `pyproject.toml` |
+| 8 | No `/web` UI — blank HF Space page | LOW | `server/app.py` |
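
To make the GAP 6 ambiguity concrete, here is a minimal sketch of the two readings of "remaining". The helper name `queue_counters` and the meaning given to `tickets_after_current` are assumptions made for illustration; only the expression `max(0, queue_size - idx)` is taken from the snippet quoted under GAP 6.

```python
def queue_counters(queue_size: int, idx: int) -> dict[str, int]:
    """Hypothetical helper. `idx` is the zero-based index of the ticket on display."""
    # Reading 1 (the quoted code): the ticket being processed still counts as remaining.
    remaining_including_current = max(0, queue_size - idx)
    # Reading 2 (assumed): only tickets queued behind the current one are counted.
    remaining_after_current = max(0, queue_size - idx - 1)
    return {
        "tickets_processed": min(idx, queue_size),
        "tickets_remaining": remaining_including_current,
        "tickets_after_current": remaining_after_current,
    }
```

With a queue of four tickets and the second ticket on display (`idx == 1`), the first reading reports 3 and the second reports 2. Exposing both counters under distinct names is one way to remove the ambiguity for an LLM agent.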

inference.py CHANGED

@@ -20,6 +20,15 @@ HF_TOKEN
     HuggingFace authentication token for the LLM provider.
     No default is set.
 LOCAL_IMAGE_NAME
     Optional compatibility variable from the sample inference pattern.
     This script does not use ``from_docker_image()``, so the value is unused here.
@@ -64,7 +73,12 @@ LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
 ENV_URL = os.getenv("ENV_URL", "http://localhost:7860")
 SEED = 42
-TASKS = list(TASK_IDS)
 # ---------------------------------------------------------------------------
 # LLM helper
@@ -99,13 +113,36 @@ Return ONLY valid JSON with the requested fields. No markdown, no explanation.""
 def call_llm(ticket: dict, allowed_fields: list[str], instructions: str) -> dict:
     assert llm_client is not None, "LLM client not configured"
     user_msg = (
         f"Instructions: {instructions}\n\n"
         f"Allowed fields: {', '.join(allowed_fields)}\n\n"
         f"Title: {ticket['title']}\n"
         f"Requester: {ticket['requester']}\n"
-        f"Description: {ticket['description']}\n\n"
         f"Respond with JSON containing ONLY these fields: {', '.join(allowed_fields)}"
     )
@@ -134,6 +171,29 @@ def emit_log(tag: str, **payload: Any) -> None:
     print(f"[{tag}] {json.dumps(payload, sort_keys=True, ensure_ascii=True)}")
 # ---------------------------------------------------------------------------
 # Heuristic fallback (no LLM needed)
 # ---------------------------------------------------------------------------
@@ -264,7 +324,18 @@ def heuristic_resolution_action(text: str, issue_type: str) -> str:
 def heuristic_action(ticket: dict, allowed_fields: list[str]) -> dict:
-    text = (ticket.get("title", "") + " " + ticket.get("description", "")).lower()
     issue_type = "general_inquiry"
     for kw, mapped_issue_type in KEYWORD_ISSUE_TYPES.items():
@@ -315,6 +386,31 @@ def build_action(
         )
 # ---------------------------------------------------------------------------
 # Main loop using the HTTP-based sync EnvClient for multi-step episodes
 # ---------------------------------------------------------------------------
@@ -332,7 +428,12 @@ def run() -> None:
     all_results: dict[int, dict[str, float | int]] = {}
-    for task_id in TASKS:
         if task_id not in available_tasks:
             continue
@@ -360,8 +461,40 @@ def run() -> None:
                 if ticket is None:
                     break
                 action, action_source, fallback_reason = build_action(
-                    ticket,
                     obs.allowed_fields,
                     obs.instructions,
                 )
@@ -400,11 +533,12 @@ def run() -> None:
     overall = [
         float(all_results[task_id]["final_reward"])
-        for task_id in TASKS
         if task_id in all_results
     ]
-    overall_avg = round(sum(overall) / len(overall), 4) if overall else 0.0
-    emit_log("END", overall_avg=overall_avg, tasks_completed=len(overall))
 if __name__ == "__main__":

     HuggingFace authentication token for the LLM provider.
     No default is set.
+TASK_ID
+    Optional OpenEnv task ID to run. When unset, the script defaults to the
+    first available task so it still emits exactly one ``[START]`` ... ``[END]``
+    block for evaluator-style runs.
+RUN_ALL_TASKS
+    Optional local-development override. Set to ``1`` to run every available
+    task in sequence and print the aggregate closing ``[END]`` summary.
 LOCAL_IMAGE_NAME
     Optional compatibility variable from the sample inference pattern.
     This script does not use ``from_docker_image()``, so the value is unused here.
 ENV_URL = os.getenv("ENV_URL", "http://localhost:7860")
 SEED = 42
+TASK_ID_ENV = os.getenv("TASK_ID")
+RUN_ALL_TASKS_ENV = os.getenv("RUN_ALL_TASKS", "").strip().lower() in {
+    "1",
+    "true",
+    "yes",
+}
 # ---------------------------------------------------------------------------
 # LLM helper
 def call_llm(ticket: dict, allowed_fields: list[str], instructions: str) -> dict:
     assert llm_client is not None, "LLM client not configured"
+    ambiguity_note = ticket.get("ambiguity_note")
+    related_preview = ticket.get("related_ticket_preview") or {}
+    last_tool_result = ticket.get("last_tool_result")
+    extra_context_lines: list[str] = []
+    if ambiguity_note:
+        extra_context_lines.append(f"Ambiguity note: {ambiguity_note}")
+    if related_preview:
+        extra_context_lines.extend(
+            [
+                "Related ticket preview:",
+                f"- Title: {related_preview.get('title', '')}",
+                f"- Requester: {related_preview.get('requester', '')}",
+                f"- Description: {related_preview.get('description', '')}",
+            ]
+        )
+    if last_tool_result is not None:
+        extra_context_lines.append(
+            "Investigation result: " + json.dumps(last_tool_result, sort_keys=True)
+        )
+    extra_context_block = ""
+    if extra_context_lines:
+        extra_context_block = "\n" + "\n".join(extra_context_lines)
     user_msg = (
         f"Instructions: {instructions}\n\n"
         f"Allowed fields: {', '.join(allowed_fields)}\n\n"
         f"Title: {ticket['title']}\n"
         f"Requester: {ticket['requester']}\n"
+        f"Description: {ticket['description']}"
+        f"{extra_context_block}\n\n"
         f"Respond with JSON containing ONLY these fields: {', '.join(allowed_fields)}"
     )
     print(f"[{tag}] {json.dumps(payload, sort_keys=True, ensure_ascii=True)}")
+def get_tasks_to_run(available_tasks: dict) -> list[int]:
+    available_task_ids = sorted(int(task_id) for task_id in available_tasks)
+    if TASK_ID_ENV:
+        try:
+            task_id = int(TASK_ID_ENV)
+        except ValueError:
+            print(f"[ERROR] TASK_ID={TASK_ID_ENV!r} is not a valid integer", flush=True)
+            raise SystemExit(1)
+        if task_id not in available_task_ids:
+            print(
+                f"[ERROR] TASK_ID={task_id} not in available tasks {available_task_ids}",
+                flush=True,
+            )
+            raise SystemExit(1)
+        return [task_id]
+    if RUN_ALL_TASKS_ENV:
+        return available_task_ids
+    if not available_task_ids:
+        return []
+    # Default to a single task so evaluation emits exactly one START/END block.
+    return [available_task_ids[0]]
 # ---------------------------------------------------------------------------
 # Heuristic fallback (no LLM needed)
 # ---------------------------------------------------------------------------
 def heuristic_action(ticket: dict, allowed_fields: list[str]) -> dict:
+    related_preview = ticket.get("related_ticket_preview") or {}
+    last_tool_result = ticket.get("last_tool_result") or {}
+    text = " ".join(
+        [
+            ticket.get("title", ""),
+            ticket.get("description", ""),
+            ticket.get("ambiguity_note", ""),
+            related_preview.get("title", ""),
+            related_preview.get("description", ""),
+            json.dumps(last_tool_result, sort_keys=True),
+        ]
+    ).lower()
     issue_type = "general_inquiry"
     for kw, mapped_issue_type in KEYWORD_ISSUE_TYPES.items():
         )
+def should_investigate(ticket: dict, history: list[dict[str, Any]]) -> tuple[bool, str | None]:
+    if not ticket:
+        return False, None
+    current_ticket_id = ticket.get("ticket_id")
+    already_investigated = any(
+        entry.get("ticket_id") == current_ticket_id
+        and entry.get("predicted", {}).get("action_type") == "investigate"
+        for entry in history
+    )
+    if already_investigated:
+        return False, None
+    if ticket.get("related_ticket_id"):
+        return True, "lookup_related_ticket"
+    if ticket.get("ambiguity_note"):
+        return True, "lookup_requester_history"
+    return False, None
+def merge_ticket_context(ticket: dict, observation: Any) -> dict:
+    merged_ticket = dict(ticket)
+    if getattr(observation, "last_tool_result", None) is not None:
+        merged_ticket["last_tool_result"] = observation.last_tool_result
+    return merged_ticket
 # ---------------------------------------------------------------------------
 # Main loop using the HTTP-based sync EnvClient for multi-step episodes
 # ---------------------------------------------------------------------------
     all_results: dict[int, dict[str, float | int]] = {}
+    tasks_to_run = get_tasks_to_run(available_tasks)
+    if not tasks_to_run:
+        return
+    single_task_mode = len(tasks_to_run) == 1
+    for task_id in tasks_to_run:
         if task_id not in available_tasks:
             continue
                 if ticket is None:
                     break
+                investigate, tool_name = should_investigate(ticket, obs.history)
+                if (
+                    investigate
+                    and tool_name is not None
+                    and getattr(obs, "investigation_budget_remaining", 0) > 0
+                ):
+                    tool_action = HelpdeskTicketAction(
+                        action_type="investigate",
+                        tool_name=tool_name,
+                        tool_target_ticket_id=ticket.get("related_ticket_id"),
+                    )
+                    result = sync_client.step(tool_action)
+                    obs = result.observation
+                    step_num += 1
+                    emit_log(
+                        "STEP",
+                        action=tool_action.model_dump(exclude_none=True),
+                        action_source="investigation_tool",
+                        done=bool(result.done),
+                        fallback_reason=None,
+                        reward=float(result.reward or 0.0),
+                        step=step_num,
+                        task_id=task_id,
+                        ticket_id=ticket["ticket_id"],
+                    )
+                    if result.done:
+                        break
+                    ticket = obs.current_ticket
+                    if ticket is None:
+                        break
+                ticket_with_context = merge_ticket_context(ticket, obs)
                 action, action_source, fallback_reason = build_action(
+                    ticket_with_context,
                     obs.allowed_fields,
                     obs.instructions,
                 )
     overall = [
         float(all_results[task_id]["final_reward"])
+        for task_id in tasks_to_run
         if task_id in all_results
     ]
+    if not single_task_mode:
+        overall_avg = round(sum(overall) / len(overall), 4) if overall else 0.0
+        emit_log("END", overall_avg=overall_avg, tasks_completed=len(overall))
 if __name__ == "__main__":

models.py CHANGED

@@ -16,6 +16,8 @@ ISSUE_TYPE_SET = set(ISSUE_TYPES)
 PRIORITY_SET = set(PRIORITIES)
 ASSIGNMENT_GROUP_SET = set(ASSIGNMENT_GROUPS)
 RESOLUTION_ACTION_SET = set(RESOLUTION_ACTIONS)
 def _validate_choice(value: str, allowed: set[str], field_name: str) -> str:
@@ -67,11 +69,24 @@ class HelpdeskTicketRecord(BaseModel):
 class HelpdeskTicketAction(Action):
     issue_type: Optional[str] = None
     priority: Optional[str] = None
     assignment_group: Optional[str] = None
     resolution_action: Optional[str] = None
     @field_validator("issue_type")
     @classmethod
     def validate_issue_type(cls, value: Optional[str]) -> Optional[str]:
@@ -98,10 +113,15 @@ class HelpdeskTicketObservation(Observation):
     task_name: str = ""
     instructions: str = ""
     allowed_fields: list[str] = Field(default_factory=list)
-    current_ticket: Optional[dict[str, str]] = None
     queue_size: int = 0
     tickets_remaining: int = 0
     tickets_processed: int = 0
     history: list[dict[str, Any]] = Field(default_factory=list)
@@ -112,3 +132,11 @@ class HelpdeskTicketState(State):
     current_ticket_index: int = 0
     per_ticket_scores: list[float] = Field(default_factory=list)
     total_reward: float = 0.0

 PRIORITY_SET = set(PRIORITIES)
 ASSIGNMENT_GROUP_SET = set(ASSIGNMENT_GROUPS)
 RESOLUTION_ACTION_SET = set(RESOLUTION_ACTIONS)
+ACTION_TYPE_SET = {"submit", "investigate"}
+TOOL_NAME_SET = {"lookup_related_ticket", "lookup_requester_history"}
 def _validate_choice(value: str, allowed: set[str], field_name: str) -> str:
 class HelpdeskTicketAction(Action):
+    action_type: str = "submit"
+    tool_name: Optional[str] = None
+    tool_target_ticket_id: Optional[str] = None
     issue_type: Optional[str] = None
     priority: Optional[str] = None
     assignment_group: Optional[str] = None
     resolution_action: Optional[str] = None
+    @field_validator("action_type")
+    @classmethod
+    def validate_action_type(cls, value: str) -> str:
+        return _validate_choice(value, ACTION_TYPE_SET, "action_type")
+    @field_validator("tool_name")
+    @classmethod
+    def validate_tool_name(cls, value: Optional[str]) -> Optional[str]:
+        return _validate_optional_choice(value, TOOL_NAME_SET, "tool_name")
     @field_validator("issue_type")
     @classmethod
     def validate_issue_type(cls, value: Optional[str]) -> Optional[str]:
     task_name: str = ""
     instructions: str = ""
     allowed_fields: list[str] = Field(default_factory=list)
+    available_tools: list[str] = Field(default_factory=list)
+    investigation_budget_remaining: int = 0
+    last_tool_result: Optional[dict[str, Any]] = None
+    current_ticket: Optional[dict[str, Any]] = None
     queue_size: int = 0
     tickets_remaining: int = 0
+    tickets_after_current: int = 0
     tickets_processed: int = 0
+    queue_position: int = 0
     history: list[dict[str, Any]] = Field(default_factory=list)
     current_ticket_index: int = 0
     per_ticket_scores: list[float] = Field(default_factory=list)
     total_reward: float = 0.0
+    last_step_reward: Optional[float] = None
+    # `reward` is the field the evaluator checks on GET /state (mentor spec)
+    reward: Optional[float] = None
+    done: bool = False
+    investigation_steps: int = 0
+    investigation_budget_remaining: int = 0
+    last_tool_result: Optional[dict[str, Any]] = None
+    history_entries: list[dict] = Field(default_factory=list)
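
The two action shapes that the extended `HelpdeskTicketAction` allows can be illustrated with plain dictionaries. The sketch below is a stand-in, not the Pydantic model: `check_action` is a hypothetical function that combines the choice checks from `models.py` with the investigate-specific rules enforced in `server/environment.py`.

```python
ACTION_TYPES = {"submit", "investigate"}
TOOL_NAMES = {"lookup_related_ticket", "lookup_requester_history"}
SUBMIT_FIELDS = ("issue_type", "priority", "assignment_group", "resolution_action")


def check_action(payload: dict) -> str:
    """Hypothetical validator: returns the action type or raises ValueError."""
    action_type = payload.get("action_type", "submit")  # "submit" is the model default
    if action_type not in ACTION_TYPES:
        raise ValueError(f"invalid action_type: {action_type!r}")
    tool_name = payload.get("tool_name")
    if tool_name is not None and tool_name not in TOOL_NAMES:
        raise ValueError(f"invalid tool_name: {tool_name!r}")
    if action_type == "investigate":
        if tool_name is None:
            raise ValueError("Investigate actions require tool_name")
        extra = sorted(f for f in SUBMIT_FIELDS if payload.get(f) is not None)
        if extra:
            raise ValueError(f"Investigate actions cannot include submit fields: {extra}")
    return action_type
```

A payload such as `{"action_type": "investigate", "tool_name": "lookup_related_ticket"}` passes, while adding `issue_type` to it is rejected, which keeps tool calls and graded submissions cleanly separated.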

openenv.yaml CHANGED

@@ -7,6 +7,9 @@ author: Hackstreet Boys - Roopal Guha Neogi, Suyash Kumar
 environment:
   type: openenv
   entry_point: server.environment:HelpdeskTicketRoutingEnvironment
   action_model: models:HelpdeskTicketAction
   observation_model: models:HelpdeskTicketObservation
@@ -50,6 +53,7 @@ inference:
     - MODEL_NAME
     - HF_TOKEN
     - ENV_URL
 requirements:
   python: ">=3.11"

 environment:
   type: openenv
+  # entry_point identifies the Environment class for the OpenEnv validator.
+  # The HTTP server entrypoint for deployment is defined separately in
+  # pyproject.toml under [project.scripts] as: server = "server.app:main"
   entry_point: server.environment:HelpdeskTicketRoutingEnvironment
   action_model: models:HelpdeskTicketAction
   observation_model: models:HelpdeskTicketObservation
     - MODEL_NAME
     - HF_TOKEN
     - ENV_URL
+    - TASK_ID
 requirements:
   python: ">=3.11"

server/app.py CHANGED

@@ -6,6 +6,7 @@ _repo_root = str(Path(__file__).resolve().parent.parent)
 if _repo_root not in sys.path:
     sys.path.insert(0, _repo_root)
 from openenv.core.env_server import create_app
 from models import HelpdeskTicketAction, HelpdeskTicketObservation
@@ -37,6 +38,25 @@ def list_tasks():
     }
 def main() -> None:
     import uvicorn

 if _repo_root not in sys.path:
     sys.path.insert(0, _repo_root)
+from fastapi.responses import HTMLResponse
 from openenv.core.env_server import create_app
 from models import HelpdeskTicketAction, HelpdeskTicketObservation
     }
+@app.get("/web", response_class=HTMLResponse)
+def web_ui():
+    task_rows = "".join(
+        f"<tr><td>{t['id']}</td><td>{t['name']}</td><td>{t['difficulty']}</td></tr>"
+        for t in TASKS.values()
+    )
+    html = f"""<!DOCTYPE html>
+<html><head><title>{APP_ENV_NAME}</title></head>
+<body>
+<h1>{APP_ENV_NAME}</h1>
+<p>Version: 0.1.0 | <a href="/health">Health</a> | <a href="/docs">API Docs</a></p>
+<h2>Tasks</h2>
+<table border="1"><tr><th>ID</th><th>Name</th><th>Difficulty</th></tr>
+{task_rows}
+</table>
+</body></html>"""
+    return HTMLResponse(content=html)
 def main() -> None:
     import uvicorn
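
The `/web` handler interpolates task fields directly into HTML. That is harmless while task names are fixed strings in the repository; if they ever came from external data, escaping them would be a cheap precaution. The sketch below shows one way to do that with the standard library. `render_task_rows` is a hypothetical helper, not part of the commit.

```python
from html import escape


def render_task_rows(tasks: dict) -> str:
    """Hypothetical helper: build table rows with every cell HTML-escaped."""
    return "".join(
        "<tr>"
        f"<td>{escape(str(task['id']))}</td>"
        f"<td>{escape(str(task['name']))}</td>"
        f"<td>{escape(str(task['difficulty']))}</td>"
        "</tr>"
        for task in tasks.values()
    )
```

The result could be dropped into the same f-string template in place of `task_rows`.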

server/environment.py CHANGED

@@ -18,6 +18,10 @@ from server.tasks import get_task_definition, load_dataset
 QUEUE_SIZE_RANGE = (3, 5)
 def _coerce_optional_int(value: Any, field_name: str) -> Optional[int]:
@@ -36,9 +40,12 @@ def _coerce_optional_int(value: Any, field_name: str) -> Optional[int]:
 class HelpdeskTicketRoutingEnvironment(
     Environment[HelpdeskTicketAction, HelpdeskTicketObservation, HelpdeskTicketState]
 ):
     def __init__(self) -> None:
         super().__init__()
         self._dataset = load_dataset()
         self._rng = random.Random()
         self._queue: list[HelpdeskTicketRecord] = []
         self._state = HelpdeskTicketState()
@@ -55,13 +62,19 @@ class HelpdeskTicketRoutingEnvironment(
     ) -> HelpdeskTicketObservation:
         normalized_seed = _coerce_optional_int(seed, "seed")
         task_id_value = _coerce_optional_int(kwargs.get("task_id", 1), "task_id")
         task_id = 1 if task_id_value is None else task_id_value
         task = get_task_definition(task_id)
         if normalized_seed is not None:
             self._rng.seed(normalized_seed)
-        queue_size = self._rng.randint(*QUEUE_SIZE_RANGE)
         self._queue = self._rng.sample(self._dataset, min(queue_size, len(self._dataset)))
         self._state = HelpdeskTicketState(
@@ -73,6 +86,7 @@ class HelpdeskTicketRoutingEnvironment(
             current_ticket_index=0,
             per_ticket_scores=[],
             total_reward=0.0,
         )
         return self._build_observation(task)
@@ -94,38 +108,84 @@ class HelpdeskTicketRoutingEnvironment(
         task_id = self._state.current_task_id
         task = get_task_definition(task_id)
         score, breakdown = grade_action(action, current_ticket, task_id)
         step_reward = compute_step_reward(score)
-        self._state.per_ticket_scores.append(score)
-        self._state.step_count += 1
-        self._state.current_ticket_index += 1
-        is_done = self._state.current_ticket_index >= len(self._queue)
         if is_done:
             traj_reward = compute_trajectory_reward(
                 self._state.per_ticket_scores,
                 len(self._queue),
                 self._state.step_count,
             )
-            self._state.total_reward = traj_reward
-            final_reward = traj_reward
         else:
             final_reward = step_reward
-        history_entry = {
-            "ticket_id": current_ticket.ticket_id,
-            "score": score,
-            "breakdown": breakdown,
-        }
-        return self._build_observation(
-            task,
-            done=is_done,
-            reward=final_reward,
-            extra_history=history_entry,
         )
     @property
     def state(self) -> HelpdeskTicketState:
@@ -135,44 +195,236 @@ class HelpdeskTicketRoutingEnvironment(
     # Helpers
     # ------------------------------------------------------------------
     def _build_observation(
         self,
         task: dict,
         done: bool = False,
         reward: float | None = None,
-        extra_history: dict | None = None,
     ) -> HelpdeskTicketObservation:
         idx = self._state.current_ticket_index
         queue_size = len(self._queue)
         if idx < queue_size:
             ticket = self._queue[idx]
-            ticket_view = {
-                "ticket_id": ticket.ticket_id,
-                "title": ticket.title,
-                "requester": ticket.requester,
-                "description": ticket.description,
-            }
         else:
             ticket_view = None
-        history: list[dict] = []
-        for i, s in enumerate(self._state.per_ticket_scores):
-            history.append({"step": i + 1, "score": s})
-        if extra_history and history:
-            history[-1] = {"step": len(history), **extra_history}
         return HelpdeskTicketObservation(
             done=done,
             reward=reward,
-            metadata={},
             task_id=task["id"],
             task_name=task["name"],
             instructions=task["instructions"],
             allowed_fields=list(task["allowed_fields"]),
             current_ticket=ticket_view,
             queue_size=queue_size,
-            tickets_remaining=max(0, queue_size - idx),
             tickets_processed=idx,
             history=history,
         )
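
The updated version of this file charges for investigations beyond a free allowance and caps the total charge. As a worked example of that arithmetic, here is a standalone sketch: the constants are copied from the new code, while the free function `apply_episode_economics` and its explicit parameters are an adaptation of the `_apply_episode_economics` method for illustration.

```python
FREE_INVESTIGATIONS_PER_TICKET = 1
EXTRA_INVESTIGATION_COST = 0.02
MAX_EXTRA_INVESTIGATION_PENALTY = 0.15


def apply_episode_economics(
    base_reward: float, queue_size: int, investigation_steps: int
) -> float:
    """Standalone adaptation: subtract a capped penalty for extra investigations."""
    free_investigations = queue_size * FREE_INVESTIGATIONS_PER_TICKET
    extra_investigations = max(0, investigation_steps - free_investigations)
    penalty = min(
        MAX_EXTRA_INVESTIGATION_PENALTY,
        extra_investigations * EXTRA_INVESTIGATION_COST,
    )
    # Clamp so the final reward stays within [0.0, 1.0].
    return max(0.0, min(1.0, base_reward - penalty))
```

For a four-ticket queue with a trajectory reward of 0.8, four investigations cost nothing, six cost about 0.04, and twenty hit the 0.15 cap. Note that the loop in `inference.py` only investigates while `investigation_budget_remaining` is positive, so that script appears to stay within the free allowance.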

 QUEUE_SIZE_RANGE = (3, 5)
+AVAILABLE_TOOLS = ("lookup_related_ticket", "lookup_requester_history")
+FREE_INVESTIGATIONS_PER_TICKET = 1
+EXTRA_INVESTIGATION_COST = 0.02
+MAX_EXTRA_INVESTIGATION_PENALTY = 0.15
 def _coerce_optional_int(value: Any, field_name: str) -> Optional[int]:
 class HelpdeskTicketRoutingEnvironment(
     Environment[HelpdeskTicketAction, HelpdeskTicketObservation, HelpdeskTicketState]
 ):
+    SUPPORTS_CONCURRENT_SESSIONS = True
     def __init__(self) -> None:
         super().__init__()
         self._dataset = load_dataset()
+        self._tickets_by_id = {ticket.ticket_id: ticket for ticket in self._dataset}
         self._rng = random.Random()
         self._queue: list[HelpdeskTicketRecord] = []
         self._state = HelpdeskTicketState()
     ) -> HelpdeskTicketObservation:
         normalized_seed = _coerce_optional_int(seed, "seed")
         task_id_value = _coerce_optional_int(kwargs.get("task_id", 1), "task_id")
+        queue_size_value = _coerce_optional_int(kwargs.get("queue_size"), "queue_size")
         task_id = 1 if task_id_value is None else task_id_value
         task = get_task_definition(task_id)
+        if queue_size_value is not None and queue_size_value < 1:
+            raise ValueError("queue_size must be >= 1")
         if normalized_seed is not None:
             self._rng.seed(normalized_seed)
+        if queue_size_value is None:
+            queue_size = self._rng.randint(*QUEUE_SIZE_RANGE)
+        else:
+            queue_size = min(queue_size_value, len(self._dataset))
         self._queue = self._rng.sample(self._dataset, min(queue_size, len(self._dataset)))
         self._state = HelpdeskTicketState(
             current_ticket_index=0,
             per_ticket_scores=[],
             total_reward=0.0,
+            investigation_budget_remaining=queue_size * FREE_INVESTIGATIONS_PER_TICKET,
         )
         return self._build_observation(task)
         task_id = self._state.current_task_id
         task = get_task_definition(task_id)
+        if action.action_type == "investigate":
+            return self._handle_investigation_action(task, current_ticket, action, idx)
+        submitted_fields = {
+            f
+            for f, v in action.model_dump(exclude_none=True).items()
+            if v is not None
+            and f not in {"action_type", "tool_name", "tool_target_ticket_id"}
+        }
+        allowed = set(task["allowed_fields"])
+        extra_fields = submitted_fields - allowed
+        if extra_fields:
+            # Penalty: record score 0.0, advance index, return penalty observation
+            self._state.per_ticket_scores.append(0.0)
+            self._state.history_entries.append(
+                self._build_history_entry(
+                    current_ticket,
+                    predicted=action.model_dump(exclude_none=True),
+                    score=0.0,
+                    breakdown={},
+                    queue_position=idx + 1,
+                    penalty_reason=f"extra_fields: {sorted(extra_fields)}",
+                )
+            )
+            self._state.step_count += 1
+            self._state.current_ticket_index += 1
+            is_done = self._state.current_ticket_index >= len(self._queue)
+            self._state.done = is_done
+            if is_done:
+                traj_reward = compute_trajectory_reward(
+                    self._state.per_ticket_scores, len(self._queue), self._state.step_count
+                )
+                final_reward = self._apply_episode_economics(traj_reward)
+                self._state.total_reward = final_reward
+            else:
+                final_reward = 0.0
+            self._state.last_step_reward = final_reward
+            self._state.reward = final_reward
+            self._state.last_tool_result = None
+            return self._build_observation(task, done=is_done, reward=final_reward)
         score, breakdown = grade_action(action, current_ticket, task_id)
         step_reward = compute_step_reward(score)
+        is_done = (self._state.current_ticket_index + 1) >= len(self._queue)
         if is_done:
+            self._state.per_ticket_scores.append(score)
+            self._state.step_count += 1
+            self._state.current_ticket_index += 1
             traj_reward = compute_trajectory_reward(
                 self._state.per_ticket_scores,
                 len(self._queue),
                 self._state.step_count,
             )
+            final_reward = self._apply_episode_economics(traj_reward)
+            self._state.total_reward = final_reward
         else:
+            self._state.per_ticket_scores.append(score)
+            self._state.step_count += 1
+            self._state.current_ticket_index += 1
             final_reward = step_reward
+        history_entry = self._build_history_entry(
+            current_ticket,
+            predicted=action.model_dump(exclude_none=True),
+            score=score,
+            breakdown=breakdown,
+            queue_position=idx + 1,
         )
+        self._state.history_entries.append(history_entry)
+        self._state.last_step_reward = final_reward
+        self._state.reward = final_reward
+        self._state.done = is_done
+        self._state.last_tool_result = None
+        return self._build_observation(task, done=is_done, reward=final_reward)
     @property
     def state(self) -> HelpdeskTicketState:
     # Helpers
     # ------------------------------------------------------------------
+    def _apply_episode_economics(self, base_reward: float) -> float:
+        free_investigations = len(self._queue) * FREE_INVESTIGATIONS_PER_TICKET
+        extra_investigations = max(0, self._state.investigation_steps - free_investigations)
+        penalty = min(
+            MAX_EXTRA_INVESTIGATION_PENALTY,
+            extra_investigations * EXTRA_INVESTIGATION_COST,
+        )
+        return max(0.0, min(1.0, base_reward - penalty))
+    def _lookup_related_ticket(
+        self,
+        current_ticket: HelpdeskTicketRecord,
+        target_ticket_id: str | None,
+    ) -> dict[str, Any]:
+        target_id = target_ticket_id or current_ticket.related_ticket_id
+        if target_id is None:
+            return {
+                "tool_name": "lookup_related_ticket",
+                "found": False,
+                "message": "Current ticket has no linked related_ticket_id.",
+            }
+        related_ticket = self._tickets_by_id.get(target_id)
+        if related_ticket is None:
+            return {
+                "tool_name": "lookup_related_ticket",
+                "found": False,
+                "message": f"Ticket {target_id!r} was not found in the dataset.",
+            }
+        return {
+            "tool_name": "lookup_related_ticket",
+            "found": True,
+            "ticket": {
+                "ticket_id": related_ticket.ticket_id,
+                "title": related_ticket.title,
+                "requester": related_ticket.requester,
+                "description": related_ticket.description,
+                "issue_type": related_ticket.issue_type,
+                "priority": related_ticket.priority,
+                "assignment_group": related_ticket.assignment_group,
+                "resolution_action": related_ticket.resolution_action,
+            },
+        }
+    def _lookup_requester_history(self, current_ticket: HelpdeskTicketRecord) -> dict[str, Any]:
+        matches = [
+            {
+                "ticket_id": ticket.ticket_id,
+                "title": ticket.title,
+                "issue_type": ticket.issue_type,
+                "priority": ticket.priority,
+                "assignment_group": ticket.assignment_group,
+                "resolution_action": ticket.resolution_action,
+            }
+            for ticket in self._dataset
+            if ticket.requester == current_ticket.requester
+            and ticket.ticket_id != current_ticket.ticket_id
+        ]
+        return {
+            "tool_name": "lookup_requester_history",
+            "found": bool(matches),
+            "requester": current_ticket.requester,
+            "matches": matches,
+        }
+    def _run_investigation_tool(
+        self,
+        current_ticket: HelpdeskTicketRecord,
+        tool_name: str,
+        target_ticket_id: str | None,
+    ) -> dict[str, Any]:
+        if tool_name == "lookup_related_ticket":
+            return self._lookup_related_ticket(current_ticket, target_ticket_id)
+        if tool_name == "lookup_requester_history":
+            return self._lookup_requester_history(current_ticket)
+        raise ValueError(f"Unsupported tool_name: {tool_name}")
+    def _handle_investigation_action(
+        self,
+        task: dict,
+        current_ticket: HelpdeskTicketRecord,
+        action: HelpdeskTicketAction,
+        idx: int,
+    ) -> HelpdeskTicketObservation:
+        if action.tool_name is None:
+            raise ValueError("Investigate actions require tool_name")
+        submitted_fields = {
+            field
+            for field in ("issue_type", "priority", "assignment_group", "resolution_action")
+            if getattr(action, field) is not None
+        }
+        if submitted_fields:
+            raise ValueError(
+                "Investigate actions cannot include submit fields: "
+                f"{sorted(submitted_fields)}"
+            )
+        tool_result = self._run_investigation_tool(
+            current_ticket,
+            action.tool_name,
+            action.tool_target_ticket_id,
+        )
+        self._state.step_count += 1
+        self._state.investigation_steps += 1
+        self._state.investigation_budget_remaining = max(
+            0,
+            self._state.investigation_budget_remaining - 1,
+        )
+        self._state.last_tool_result = tool_result
+        self._state.last_step_reward = 0.0
+        self._state.reward = 0.0
+        self._state.done = False
+        self._state.history_entries.append(
+            self._build_history_entry(
+                current_ticket,
+                predicted=action.model_dump(exclude_none=True),
+                score=0.0,
+                breakdown={},
+                queue_position=idx + 1,
+                tool_result=tool_result,
+            )
+        )
+        return self._build_observation(task, done=False, reward=0.0)
+    def _build_ticket_view(self, ticket: HelpdeskTicketRecord) -> dict[str, Any]:
+        ticket_view: dict[str, Any] = {
+            "ticket_id": ticket.ticket_id,
+            "title": ticket.title,
+            "requester": ticket.requester,
+            "description": ticket.description,
+        }
+        if ticket.ambiguity_note is not None:
+            ticket_view["ambiguity_note"] = ticket.ambiguity_note
+        if ticket.related_ticket_id is not None:
+            ticket_view["related_ticket_id"] = ticket.related_ticket_id
+            related_ticket = self._tickets_by_id.get(ticket.related_ticket_id)
+            if related_ticket is not None:
+                ticket_view["related_ticket_preview"] = {
+                    "ticket_id": related_ticket.ticket_id,
+                    "title": related_ticket.title,
+                    "requester": related_ticket.requester,
+                    "description": related_ticket.description,
+                }
+        return ticket_view
+    def _build_history_entry(
+        self,
+        ticket: HelpdeskTicketRecord,
+        *,
+        predicted: dict[str, Any],
+        score: float,
+        breakdown: dict[str, float],
+        queue_position: int,
+        penalty_reason: str | None = None,
+        tool_result: dict[str, Any] | None = None,
+    ) -> dict[str, Any]:
+        history_entry: dict[str, Any] = {
+            "ticket_id": ticket.ticket_id,
+            "title": ticket.title,
+            "requester": ticket.requester,
+            "predicted": predicted,
+            "score": score,
+            "breakdown": breakdown,
+            "queue_position": queue_position,
+        }
+        if ticket.ambiguity_note is not None:
+            history_entry["ambiguity_note"] = ticket.ambiguity_note
+        if ticket.related_ticket_id is not None:
+            history_entry["related_ticket_id"] = ticket.related_ticket_id
+            related_ticket = self._tickets_by_id.get(ticket.related_ticket_id)
+            if related_ticket is not None:
+                history_entry["related_ticket_preview"] = {
+                    "ticket_id": related_ticket.ticket_id,
+                    "title": related_ticket.title,
+                    "requester": related_ticket.requester,
+                    "description": related_ticket.description,
+                }
+        if penalty_reason is not None:
+            history_entry["penalty_reason"] = penalty_reason
+        if tool_result is not None:
+            history_entry["tool_result"] = tool_result
+        return history_entry
     def _build_observation(
         self,
         task: dict,
         done: bool = False,
         reward: float | None = None,
     ) -> HelpdeskTicketObservation:
         idx = self._state.current_ticket_index
         queue_size = len(self._queue)
         if idx < queue_size:
             ticket = self._queue[idx]
+            ticket_view = self._build_ticket_view(ticket)
+            queue_position = idx + 1
         else:
             ticket_view = None
+            queue_position = 0
+        history = list(self._state.history_entries)
+        tickets_remaining = max(0, queue_size - idx)
+        tickets_after_current = max(
+            0,
+            tickets_remaining - (1 if ticket_view is not None else 0),
+        )
         return HelpdeskTicketObservation(
             done=done,
             reward=reward,
+            metadata={
+                "queue_position": queue_position,
+                "tickets_remaining_includes_current": ticket_view is not None,
+                "has_ambiguity_note": bool(ticket_view and ticket_view.get("ambiguity_note")),
+                "has_related_ticket_context": bool(
+                    ticket_view and ticket_view.get("related_ticket_preview")
+                ),
+                "action_mode": "investigate_or_submit",
+            },
             task_id=task["id"],
             task_name=task["name"],
             instructions=task["instructions"],
             allowed_fields=list(task["allowed_fields"]),
+            available_tools=list(AVAILABLE_TOOLS),
+            investigation_budget_remaining=self._state.investigation_budget_remaining,
+            last_tool_result=self._state.last_tool_result,
             current_ticket=ticket_view,
             queue_size=queue_size,
+            tickets_remaining=tickets_remaining,
+            tickets_after_current=tickets_after_current,
             tickets_processed=idx,
+            queue_position=queue_position,
             history=history,
         )
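
The queue counters built in `_build_observation` can be checked in isolation. The following is a minimal sketch, assuming only the arithmetic visible in the hunk above; `queue_counters` is a hypothetical helper name, not a function in the repo.

```python
from typing import Dict


def queue_counters(idx: int, queue_size: int) -> Dict[str, int]:
    """Derive the observation's queue counters from the 0-based ticket index.

    Hypothetical helper: it mirrors the arithmetic in _build_observation,
    but it is not part of the repository.
    """
    has_current = idx < queue_size  # a ticket is still on screen
    tickets_remaining = max(0, queue_size - idx)  # includes the current ticket
    return {
        "queue_position": idx + 1 if has_current else 0,
        "tickets_remaining": tickets_remaining,
        "tickets_after_current": max(
            0, tickets_remaining - (1 if has_current else 0)
        ),
        "tickets_processed": idx,
    }


if __name__ == "__main__":
    # Walk a three-ticket queue from the first ticket to the finished state.
    for idx in range(4):
        print(idx, queue_counters(idx, queue_size=3))
```

Once the queue is exhausted, `queue_position` drops to 0 and both remaining counters are 0, which is the state the `done` branch of `TestObservationQueueContext` checks.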

server/reward.py CHANGED

@@ -1,8 +1,18 @@
 from __future__ import annotations
 def compute_step_reward(score: float) -> float:
-    return max(0.0, min(1.0, score))
 def compute_trajectory_reward(
@@ -11,6 +21,4 @@ def compute_trajectory_reward(
     if not per_ticket_scores:
         return 0.0
     avg = sum(per_ticket_scores) / len(per_ticket_scores)
-    overshoot = max(0, steps_taken - queue_size)
-    penalty = overshoot * 0.03
-    return max(0.0, min(1.0, avg - penalty))

 from __future__ import annotations
+MILESTONE_HIGH_THRESHOLD = 0.8
+MILESTONE_LOW_THRESHOLD = 0.2
+MILESTONE_BONUS = 0.05
+MILESTONE_PENALTY = 0.05
 def compute_step_reward(score: float) -> float:
+    base = max(0.0, min(1.0, score))
+    if score >= MILESTONE_HIGH_THRESHOLD:
+        return min(1.0, base + MILESTONE_BONUS)
+    if score < MILESTONE_LOW_THRESHOLD:
+        return max(0.0, base - MILESTONE_PENALTY)
+    return base
 def compute_trajectory_reward(
     if not per_ticket_scores:
         return 0.0
     avg = sum(per_ticket_scores) / len(per_ticket_scores)
+    return max(0.0, min(1.0, avg))
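
A worked example makes the milestone shaping easier to read. This is a standalone sketch that copies the constants and branches from the new `compute_step_reward`; the names `shaped_step_reward` and `average_trajectory_reward` are hypothetical, and the trajectory helper takes only the scores because the rest of the real signature is not shown in this hunk.

```python
from typing import Sequence

MILESTONE_HIGH_THRESHOLD = 0.8
MILESTONE_LOW_THRESHOLD = 0.2
MILESTONE_BONUS = 0.05
MILESTONE_PENALTY = 0.05


def shaped_step_reward(score: float) -> float:
    """Clamp the score to [0, 1], then apply the milestone bonus or penalty."""
    base = max(0.0, min(1.0, score))
    if score >= MILESTONE_HIGH_THRESHOLD:
        return min(1.0, base + MILESTONE_BONUS)  # bonus never exceeds 1.0
    if score < MILESTONE_LOW_THRESHOLD:
        return max(0.0, base - MILESTONE_PENALTY)  # penalty never goes below 0.0
    return base


def average_trajectory_reward(per_ticket_scores: Sequence[float]) -> float:
    """Plain clamped average; the old step-overshoot penalty is gone."""
    if not per_ticket_scores:
        return 0.0
    avg = sum(per_ticket_scores) / len(per_ticket_scores)
    return max(0.0, min(1.0, avg))


if __name__ == "__main__":
    for score in (0.0, 0.1, 0.2, 0.5, 0.8, 0.9, 1.0):
        print(f"score={score:.1f} -> step reward={shaped_step_reward(score):.2f}")
    print("trajectory:", average_trajectory_reward([0.5, 0.6]))
```

The thresholds are asymmetric at the boundaries: a score of exactly 0.8 earns the bonus, while a score of exactly 0.2 is left unchanged because the low branch uses a strict `<`.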

server/tasks.py CHANGED

@@ -13,7 +13,8 @@ TASKS = {
         "name": "Issue Type Classification",
         "difficulty": "easy",
         "instructions": (
-            "Read the ticket and select the single best IT issue type."
         ),
         "allowed_fields": ["issue_type"],
     },
@@ -23,7 +24,8 @@ TASKS = {
         "difficulty": "medium",
         "instructions": (
             "Read the ticket, select the best IT issue type, and estimate the "
-            "correct operational priority."
         ),
         "allowed_fields": ["issue_type", "priority"],
     },
@@ -33,7 +35,9 @@ TASKS = {
         "difficulty": "hard",
         "instructions": (
             "Perform full helpdesk routing by selecting the best issue type, "
-            "priority, assignment group, and resolution action for the ticket."
         ),
         "allowed_fields": [
             "issue_type",

         "name": "Issue Type Classification",
         "difficulty": "easy",
         "instructions": (
+            "Read the ticket and select the single best IT issue type. "
+            "You may investigate first, then submit a final routing answer."
         ),
         "allowed_fields": ["issue_type"],
     },
         "difficulty": "medium",
         "instructions": (
             "Read the ticket, select the best IT issue type, and estimate the "
+            "correct operational priority. If the observation includes ambiguity "
+            "or follow-up context, use it. You may investigate before you submit."
         ),
         "allowed_fields": ["issue_type", "priority"],
     },
         "difficulty": "hard",
         "instructions": (
             "Perform full helpdesk routing by selecting the best issue type, "
+            "priority, assignment group, and resolution action for the ticket. "
+            "Use any ambiguity notes or related-ticket previews when present. "
+            "You may investigate with tools before you submit the final action."
         ),
         "allowed_fields": [
             "issue_type",
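
The investigate-then-submit flow that these instructions describe comes with two validation rules, enforced in `_handle_investigation_action`. The sketch below restates those rules over plain dictionaries instead of `HelpdeskTicketAction`, so it runs without the repo's models; `validate_investigate_action` and the ticket id `TKT-EXAMPLE-001` are hypothetical.

```python
from typing import Any, Dict

# Fields that belong only to a final submit action.
SUBMIT_FIELDS = ("issue_type", "priority", "assignment_group", "resolution_action")


def validate_investigate_action(action: Dict[str, Any]) -> None:
    """Raise ValueError when an investigate action breaks either rule.

    Hypothetical helper mirroring the checks in _handle_investigation_action.
    """
    if action.get("tool_name") is None:
        raise ValueError("Investigate actions require tool_name")
    submitted = sorted(
        field for field in SUBMIT_FIELDS if action.get(field) is not None
    )
    if submitted:
        raise ValueError(
            f"Investigate actions cannot include submit fields: {submitted}"
        )


if __name__ == "__main__":
    # A well-formed investigate action: tool name only, no routing fields.
    validate_investigate_action(
        {
            "action_type": "investigate",
            "tool_name": "lookup_related_ticket",
            "tool_target_ticket_id": "TKT-EXAMPLE-001",  # hypothetical id
        }
    )

    # Mixing a routing field into an investigate action is rejected.
    try:
        validate_investigate_action(
            {
                "action_type": "investigate",
                "tool_name": "lookup_requester_history",
                "priority": "medium",
            }
        )
    except ValueError as exc:
        print("rejected:", exc)
```

Keeping the two action shapes disjoint means an investigate step can never be scored as a partial submission.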

tests/test_api_integration.py CHANGED

@@ -400,7 +400,7 @@ class TestHeuristicInferenceRegression(unittest.TestCase):
         import inference as _inf
         cls._heuristic_action = staticmethod(_inf.heuristic_action)
         cls._SEED = _inf.SEED
-        cls._TASKS = list(_inf.TASKS)
     def _run_heuristic_episode(self, task_id: int) -> float:
         """Run one full heuristic episode for the given task_id via TestClient.

         import inference as _inf
         cls._heuristic_action = staticmethod(_inf.heuristic_action)
         cls._SEED = _inf.SEED
+        cls._TASKS = list(_inf.TASK_IDS)
     def _run_heuristic_episode(self, task_id: int) -> float:
         """Run one full heuristic episode for the given task_id via TestClient.

tests/test_competitive_upgrade.py ADDED

@@ -0,0 +1,746 @@
+"""
+Tests for the helpdesk-competitive-upgrade spec (Task 9).
+Covers:
+  9.1  test_inference_single_task_mode
+  9.2  test_state_has_reward_and_done
+  9.3  test_history_has_title_and_predicted
+  9.4  test_milestone_reward_shaping
+  9.5  test_trajectory_reward_no_overshoot
+  9.6  test_ambiguity_note_in_observation
+  9.7  test_dataset_nondefault_routing
+  9.9  test_concurrent_sessions_flag
+  9.10 test_web_ui_endpoint
+Run with:
+    pytest tests/test_competitive_upgrade.py
+"""
+from __future__ import annotations
+import os
+import sys
+import types as _types
+import unittest
+# Ensure repo root is on sys.path
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
+import openenv_test_stubs  # noqa: F401  — must come before any openenv imports
+# Patch in the interfaces module so environment.py can import Environment.
+if "openenv.core.env_server.interfaces" not in sys.modules:
+    _interfaces_mod = _types.ModuleType("openenv.core.env_server.interfaces")
+    class _Environment:
+        """Minimal stub matching the openenv-core Environment base class."""
+        def __init__(self) -> None:
+            pass
+        def __init_subclass__(cls, **kwargs: object) -> None:
+            super().__init_subclass__(**kwargs)
+        @classmethod
+        def __class_getitem__(cls, item: object) -> type:
+            return cls
+    _interfaces_mod.Environment = _Environment  # type: ignore[attr-defined]
+    sys.modules["openenv.core.env_server.interfaces"] = _interfaces_mod
+from models import HelpdeskTicketAction, HelpdeskTicketObservation, HelpdeskTicketState
+from server.environment import HelpdeskTicketRoutingEnvironment
+from server.reward import compute_step_reward, compute_trajectory_reward
+from server.tasks import load_dataset
+from vocabulary import ISSUE_TYPES, PRIORITIES, ASSIGNMENT_GROUPS, RESOLUTION_ACTIONS, TASK_IDS
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+def _make_env() -> HelpdeskTicketRoutingEnvironment:
+    return HelpdeskTicketRoutingEnvironment()
+def _heuristic_action(obs: HelpdeskTicketObservation) -> HelpdeskTicketAction:
+    allowed = obs.allowed_fields
+    kwargs: dict = {}
+    if "issue_type" in allowed:
+        kwargs["issue_type"] = ISSUE_TYPES[0]
+    if "priority" in allowed:
+        kwargs["priority"] = PRIORITIES[0]
+    if "assignment_group" in allowed:
+        kwargs["assignment_group"] = ASSIGNMENT_GROUPS[0]
+    if "resolution_action" in allowed:
+        kwargs["resolution_action"] = RESOLUTION_ACTIONS[0]
+    return HelpdeskTicketAction(**kwargs)
+# ---------------------------------------------------------------------------
+# 9.1 — Inference single-task mode
+# ---------------------------------------------------------------------------
+def _get_tasks_to_run_impl(
+    task_id_env: str | None,
+    available_tasks: dict,
+    run_all_tasks: bool = False,
+) -> list[int]:
+    """
+    Standalone re-implementation of inference.get_tasks_to_run() logic for testing.
+    This mirrors the logic in inference.py without importing the full module
+    (which has heavy dependencies like openai, httpx, and client.py).
+    """
+    if task_id_env:
+        try:
+            task_id = int(task_id_env)
+        except ValueError:
+            raise SystemExit(1)
+        if task_id not in available_tasks:
+            raise SystemExit(1)
+        return [task_id]
+    if run_all_tasks:
+        return sorted(available_tasks)
+    if not available_tasks:
+        return []
+    return [sorted(available_tasks)[0]]
+class TestInferenceSingleTaskMode(unittest.TestCase):
+    """9.1 — get_tasks_to_run() respects TASK_ID env var."""
+    def test_task_id_set_to_valid_id_returns_single_element_list(self) -> None:
+        available = {1: {}, 2: {}, 3: {}}
+        result = _get_tasks_to_run_impl("1", available)
+        self.assertEqual(result, [1])
+    def test_task_id_set_to_unavailable_id_exits(self) -> None:
+        available = {1: {}, 2: {}, 3: {}}
+        with self.assertRaises(SystemExit):
+            _get_tasks_to_run_impl("999", available)
+    def test_task_id_unset_defaults_to_first_available_task(self) -> None:
+        available = {1: {}, 2: {}, 3: {}}
+        result = _get_tasks_to_run_impl(None, available)
+        self.assertEqual(result, [1])
+    def test_run_all_tasks_override_returns_all_task_ids(self) -> None:
+        available = {1: {}, 2: {}, 3: {}}
+        result = _get_tasks_to_run_impl(None, available, run_all_tasks=True)
+        self.assertEqual(sorted(result), sorted(list(TASK_IDS)))
+    def test_task_id_set_to_2_returns_only_task_2(self) -> None:
+        available = {1: {}, 2: {}, 3: {}}
+        result = _get_tasks_to_run_impl("2", available)
+        self.assertEqual(result, [2])
+    def test_task_id_set_to_3_returns_only_task_3(self) -> None:
+        available = {1: {}, 2: {}, 3: {}}
+        result = _get_tasks_to_run_impl("3", available)
+        self.assertEqual(result, [3])
+# ---------------------------------------------------------------------------
+# 9.2 — State has last_step_reward and done after step()
+# ---------------------------------------------------------------------------
+class TestStateHasRewardAndDone(unittest.TestCase):
+    """9.2 — state.last_step_reward and state.done are set after step()."""
+    def test_last_step_reward_is_none_after_reset(self) -> None:
+        env = _make_env()
+        env.reset(seed=42, task_id=1)
+        self.assertIsNone(env.state.last_step_reward)
+    def test_done_is_false_after_reset(self) -> None:
+        env = _make_env()
+        env.reset(seed=42, task_id=1)
+        self.assertFalse(env.state.done)
+    def test_last_step_reward_set_after_step(self) -> None:
+        env = _make_env()
+        obs = env.reset(seed=42, task_id=1)
+        action = _heuristic_action(obs)
+        env.step(action)
+        state = env.state
+        self.assertIsNotNone(state.last_step_reward)
+        self.assertGreaterEqual(state.last_step_reward, 0.0)
+        self.assertLessEqual(state.last_step_reward, 1.0)
+    def test_done_is_true_after_last_ticket(self) -> None:
+        env = _make_env()
+        obs = env.reset(seed=42, task_id=1)
+        while not obs.done:
+            obs = env.step(_heuristic_action(obs))
+        self.assertTrue(env.state.done)
+    def test_done_is_false_before_last_ticket(self) -> None:
+        env = _make_env()
+        obs = env.reset(seed=42, task_id=1)
+        if obs.queue_size > 1:
+            obs = env.step(_heuristic_action(obs))
+            self.assertFalse(env.state.done)
+# ---------------------------------------------------------------------------
+# 9.3 — History entry contains title and predicted
+# ---------------------------------------------------------------------------
+class TestHistoryHasTitleAndPredicted(unittest.TestCase):
+    """9.3 — observation.history[0] contains 'title' and 'predicted' keys."""
+    def test_history_entry_has_title(self) -> None:
+        env = _make_env()
+        obs = env.reset(seed=42, task_id=1)
+        action = _heuristic_action(obs)
+        obs2 = env.step(action)
+        self.assertEqual(len(obs2.history), 1)
+        self.assertIn("title", obs2.history[0])
+        self.assertIsInstance(obs2.history[0]["title"], str)
+        self.assertTrue(obs2.history[0]["title"])  # non-empty
+    def test_history_entry_has_predicted(self) -> None:
+        env = _make_env()
+        obs = env.reset(seed=42, task_id=1)
+        action = _heuristic_action(obs)
+        obs2 = env.step(action)
+        self.assertIn("predicted", obs2.history[0])
+        self.assertIsInstance(obs2.history[0]["predicted"], dict)
+    def test_history_predicted_matches_action(self) -> None:
+        env = _make_env()
+        obs = env.reset(seed=42, task_id=1)
+        action = _heuristic_action(obs)
+        obs2 = env.step(action)
+        predicted = obs2.history[0]["predicted"]
+        action_dict = action.model_dump(exclude_none=True)
+        self.assertEqual(predicted, action_dict)
+    def test_history_entry_has_ticket_id_and_score(self) -> None:
+        env = _make_env()
+        obs = env.reset(seed=42, task_id=1)
+        obs2 = env.step(_heuristic_action(obs))
+        entry = obs2.history[0]
+        self.assertIn("ticket_id", entry)
+        self.assertIn("score", entry)
+# ---------------------------------------------------------------------------
+# 9.4 — Milestone reward shaping
+# ---------------------------------------------------------------------------
+class TestMilestoneRewardShaping(unittest.TestCase):
+    """9.4 — compute_step_reward applies bonus at high scores, penalty at low scores."""
+    def test_high_score_gets_bonus(self) -> None:
+        # score=0.9 >= 0.8 threshold → base=0.9, bonus=0.05 → 0.95
+        result = compute_step_reward(0.9)
+        self.assertAlmostEqual(result, 0.95, places=9)
+    def test_low_score_gets_penalty(self) -> None:
+        # score=0.1 < 0.2 threshold → base=0.1, penalty=0.05 → 0.05
+        result = compute_step_reward(0.1)
+        self.assertAlmostEqual(result, 0.05, places=9)
+    def test_mid_score_is_neutral(self) -> None:
+        # score=0.5 is in [0.2, 0.8) → no shaping → 0.5
+        result = compute_step_reward(0.5)
+        self.assertAlmostEqual(result, 0.5, places=9)
+    def test_boundary_high_threshold_gets_bonus(self) -> None:
+        # score=0.8 exactly → bonus applies → 0.85
+        result = compute_step_reward(0.8)
+        self.assertAlmostEqual(result, 0.85, places=9)
+    def test_boundary_low_threshold_is_neutral(self) -> None:
+        # score=0.2 exactly → not < 0.2, so neutral → 0.2
+        result = compute_step_reward(0.2)
+        self.assertAlmostEqual(result, 0.2, places=9)
+    def test_reward_clamped_to_unit_interval(self) -> None:
+        # score=1.0 → base=1.0, bonus would push to 1.05 → clamped to 1.0
+        result = compute_step_reward(1.0)
+        self.assertLessEqual(result, 1.0)
+        self.assertGreaterEqual(result, 0.0)
+    def test_zero_score_clamped_to_zero(self) -> None:
+        # score=0.0 < 0.2 → base=0.0, penalty → max(0.0, -0.05) = 0.0
+        result = compute_step_reward(0.0)
+        self.assertGreaterEqual(result, 0.0)
+# ---------------------------------------------------------------------------
+# 9.5 — Trajectory reward has no overshoot penalty
+# ---------------------------------------------------------------------------
+class TestTrajectoryRewardNoOvershoot(unittest.TestCase):
+    """9.5 — compute_trajectory_reward does not penalise when steps > queue_size."""
+    def test_no_penalty_when_steps_exceed_queue_size(self) -> None:
+        scores = [0.8, 0.9, 0.7]
+        queue_size = 3
+        steps_taken = 10  # more steps than queue_size
+        result = compute_trajectory_reward(scores, queue_size, steps_taken)
+        expected_avg = sum(scores) / len(scores)
+        self.assertAlmostEqual(result, expected_avg, places=9)
+    def test_result_equals_average_regardless_of_steps(self) -> None:
+        scores = [0.5, 0.6]
+        for steps in [1, 2, 5, 100]:
+            result = compute_trajectory_reward(scores, len(scores), steps)
+            self.assertAlmostEqual(result, 0.55, places=9,
+                                   msg=f"Failed for steps={steps}")
+    def test_empty_scores_returns_zero(self) -> None:
+        self.assertEqual(compute_trajectory_reward([], 3, 3), 0.0)
+    def test_result_in_unit_interval(self) -> None:
+        scores = [0.9, 1.0, 0.95]
+        result = compute_trajectory_reward(scores, 3, 3)
+        self.assertGreaterEqual(result, 0.0)
+        self.assertLessEqual(result, 1.0)
+# ---------------------------------------------------------------------------
+# 9.6 — ambiguity_note appears in current_ticket observation
+# ---------------------------------------------------------------------------
+class TestAmbiguityNoteInObservation(unittest.TestCase):
+    """9.6 — current_ticket includes ambiguity_note when the ticket has one."""
+    def _find_seed_with_ambiguity_note(self, task_id: int = 3) -> int | None:
+        """Try seeds 0..999 to find one where the first ticket has ambiguity_note."""
+        env = _make_env()
+        for seed in range(1000):
+            obs = env.reset(seed=seed, task_id=task_id)
+            if obs.current_ticket and obs.current_ticket.get("ambiguity_note"):
+                return seed
+        return None
+    def test_ambiguity_note_present_when_ticket_has_one(self) -> None:
+        """Force a ticket with ambiguity_note by patching the dataset."""
+        from unittest.mock import patch
+        from server.tasks import load_dataset
+        dataset = load_dataset()
+        # Find a ticket with ambiguity_note
+        ambiguous_tickets = [t for t in dataset if t.ambiguity_note is not None]
+        self.assertGreater(len(ambiguous_tickets), 0, "No tickets with ambiguity_note in dataset")
+        target = ambiguous_tickets[0]
+        env = _make_env()
+        # Patch the dataset to only contain the ambiguous ticket
+        with patch.object(env, "_dataset", [target]):
+            obs = env.reset(seed=0, task_id=3)
+        self.assertIsNotNone(obs.current_ticket)
+        self.assertIn("ambiguity_note", obs.current_ticket)
+        self.assertEqual(obs.current_ticket["ambiguity_note"], target.ambiguity_note)
+    def test_ambiguity_note_absent_when_ticket_has_none(self) -> None:
+        """Tickets without ambiguity_note should not expose the key."""
+        from unittest.mock import patch
+        from server.tasks import load_dataset
+        dataset = load_dataset()
+        non_ambiguous = [t for t in dataset if t.ambiguity_note is None]
+        self.assertGreater(len(non_ambiguous), 0)
+        target = non_ambiguous[0]
+        env = _make_env()
+        with patch.object(env, "_dataset", [target]):
+            obs = env.reset(seed=0, task_id=3)
+        self.assertIsNotNone(obs.current_ticket)
+        self.assertNotIn("ambiguity_note", obs.current_ticket)
+    def test_tkt_nondefault_001_has_ambiguity_note(self) -> None:
+        """TKT-NONDEFAULT-001 specifically has ambiguity_note set."""
+        from unittest.mock import patch
+        from server.tasks import load_dataset
+        dataset = load_dataset()
+        ticket = next((t for t in dataset if t.ticket_id == "TKT-NONDEFAULT-001"), None)
+        self.assertIsNotNone(ticket, "TKT-NONDEFAULT-001 not found in dataset")
+        self.assertIsNotNone(ticket.ambiguity_note)
+        env = _make_env()
+        with patch.object(env, "_dataset", [ticket]):
+            obs = env.reset(seed=0, task_id=3)
+        self.assertIn("ambiguity_note", obs.current_ticket)
+class TestRelatedTicketPreviewInObservation(unittest.TestCase):
+    """Follow-up tickets expose a lightweight preview of the linked ticket."""
+    def _reset_linked_ticket_env(self):
+        from unittest.mock import patch
+        dataset = load_dataset()
+        ticket = next((t for t in dataset if t.related_ticket_id is not None), None)
+        self.assertIsNotNone(ticket, "No follow-up ticket found in dataset")
+        related = next(
+            (t for t in dataset if t.ticket_id == ticket.related_ticket_id),
+            None,
+        )
+        self.assertIsNotNone(related, "Linked ticket missing from dataset")
+        env = _make_env()
+        with patch.object(env, "_dataset", [ticket]):
+            with patch.object(
+                env,
+                "_tickets_by_id",
+                {ticket.ticket_id: ticket, related.ticket_id: related},
+            ):
+                obs = env.reset(seed=0, task_id=3, queue_size=1)
+        return env, obs, related
+    def test_related_ticket_preview_present_when_ticket_has_link(self) -> None:
+        env, obs, related = self._reset_linked_ticket_env()
+        self.assertIsNotNone(obs.current_ticket)
+        self.assertIn("related_ticket_preview", obs.current_ticket)
+        self.assertEqual(
+            obs.current_ticket["related_ticket_preview"]["ticket_id"],
+            related.ticket_id,
+        )
+        self.assertEqual(
+            obs.current_ticket["related_ticket_preview"]["title"],
+            related.title,
+        )
+    def test_history_keeps_related_ticket_preview_after_step(self) -> None:
+        env, obs, related = self._reset_linked_ticket_env()
+        next_obs = env.step(_heuristic_action(obs))
+        self.assertGreaterEqual(len(next_obs.history), 1)
+        self.assertIn("related_ticket_preview", next_obs.history[0])
+        self.assertEqual(
+            next_obs.history[0]["related_ticket_preview"]["ticket_id"],
+            related.ticket_id,
+        )
+class TestObservationQueueContext(unittest.TestCase):
+    """Observation includes clearer queue-position counters."""
+    def test_reset_sets_queue_position_and_after_current_counts(self) -> None:
+        env = _make_env()
+        obs = env.reset(seed=0, task_id=1, queue_size=3)
+        self.assertEqual(obs.queue_position, 1)
+        self.assertEqual(obs.tickets_remaining, 3)
+        self.assertEqual(obs.tickets_after_current, 2)
+    def test_step_updates_queue_position_and_after_current_counts(self) -> None:
+        env = _make_env()
+        obs = env.reset(seed=0, task_id=1, queue_size=3)
+        obs = env.step(_heuristic_action(obs))
+        if obs.done:
+            self.assertEqual(obs.queue_position, 0)
+            self.assertEqual(obs.tickets_after_current, 0)
+        else:
+            self.assertEqual(obs.queue_position, 2)
+            self.assertEqual(obs.tickets_remaining, 2)
+            self.assertEqual(obs.tickets_after_current, 1)
+# ---------------------------------------------------------------------------
+# 9.6b — investigation actions and queue economics
+# ---------------------------------------------------------------------------
+class TestInvestigationActions(unittest.TestCase):
+    """Minimal tool-assisted investigate/submit flow works and stays backwards compatible."""
+    def _make_linked_env(self):
+        from unittest.mock import patch
+        dataset = load_dataset()
+        ticket = next((t for t in dataset if t.related_ticket_id is not None), None)
+        self.assertIsNotNone(ticket, "No follow-up ticket found in dataset")
+        related = next(
+            (t for t in dataset if t.ticket_id == ticket.related_ticket_id),
+            None,
+        )
+        self.assertIsNotNone(related, "Linked ticket missing from dataset")
+        env = _make_env()
+        patch_dataset = patch.object(env, "_dataset", [ticket])
+        patch_lookup = patch.object(
+            env,
+            "_tickets_by_id",
+            {ticket.ticket_id: ticket, related.ticket_id: related},
+        )
+        patch_dataset.start()
+        patch_lookup.start()
+        self.addCleanup(patch_dataset.stop)
+        self.addCleanup(patch_lookup.stop)
+        obs = env.reset(seed=0, task_id=3, queue_size=1)
+        return env, obs, ticket, related
+    def test_investigation_action_does_not_advance_queue(self) -> None:
+        env, obs, ticket, related = self._make_linked_env()
+        investigate = HelpdeskTicketAction(
+            action_type="investigate",
+            tool_name="lookup_related_ticket",
+            tool_target_ticket_id=ticket.related_ticket_id,
+        )
+        obs2 = env.step(investigate)
+        self.assertFalse(obs2.done)
+        self.assertEqual(obs2.tickets_processed, 0)
+        self.assertEqual(obs2.queue_position, 1)
+        self.assertIsNotNone(obs2.last_tool_result)
+        self.assertTrue(obs2.last_tool_result["found"])
+        self.assertEqual(
+            obs2.last_tool_result["ticket"]["ticket_id"],
+            related.ticket_id,
+        )
+    def test_submit_after_investigation_completes_episode(self) -> None:
+        env, obs, ticket, related = self._make_linked_env()
+        env.step(
+            HelpdeskTicketAction(
+                action_type="investigate",
+                tool_name="lookup_related_ticket",
+                tool_target_ticket_id=ticket.related_ticket_id,
+            )
+        )
+        final_obs = env.step(
+            HelpdeskTicketAction(
+                issue_type=ticket.issue_type,
+                priority=ticket.priority,
+                assignment_group=ticket.assignment_group,
+                resolution_action=ticket.resolution_action,
+            )
+        )
+        self.assertTrue(final_obs.done)
+        self.assertEqual(final_obs.tickets_processed, 1)
+        self.assertGreaterEqual(final_obs.reward, 0.0)
+        self.assertLessEqual(final_obs.reward, 1.0)
+    def test_requester_history_tool_returns_matches_for_same_requester(self) -> None:
+        from unittest.mock import patch
+        dataset = load_dataset()
+        requester_counts: dict[str, int] = {}
+        for ticket in dataset:
+            requester_counts[ticket.requester] = requester_counts.get(ticket.requester, 0) + 1
+        target_requester = next(
+            (requester for requester, count in requester_counts.items() if count >= 2),
+            None,
+        )
+        self.assertIsNotNone(target_requester, "Dataset has no repeated requester")
+        duplicate_requester_group = [
+            ticket for ticket in dataset if ticket.requester == target_requester
+        ]
+        self.assertGreaterEqual(len(duplicate_requester_group), 2)
+        env = _make_env()
+        with patch.object(env, "_dataset", duplicate_requester_group):
+            with patch.object(
+                env,
+                "_tickets_by_id",
+                {ticket.ticket_id: ticket for ticket in duplicate_requester_group},
+            ):
+                obs = env.reset(seed=0, task_id=2, queue_size=1)
+        obs2 = env.step(
+            HelpdeskTicketAction(
+                action_type="investigate",
+                tool_name="lookup_requester_history",
+            )
+        )
+        self.assertIsNotNone(obs2.last_tool_result)
+        self.assertEqual(obs2.last_tool_result["tool_name"], "lookup_requester_history")
+        self.assertTrue(obs2.last_tool_result["found"])
+        self.assertGreaterEqual(len(obs2.last_tool_result["matches"]), 1)
+class TestQueueEconomics(unittest.TestCase):
+    """Free investigations are allowed, but excessive investigation gets a queue-level penalty."""
+    def test_extra_investigations_reduce_final_reward(self) -> None:
+        from unittest.mock import patch
+        dataset = load_dataset()
+        ticket = dataset[0]
+        env = _make_env()
+        with patch.object(env, "_dataset", [ticket]):
+            with patch.object(env, "_tickets_by_id", {ticket.ticket_id: ticket}):
+                obs = env.reset(seed=0, task_id=1, queue_size=1)
+        obs = env.step(
+            HelpdeskTicketAction(
+                action_type="investigate",
+                tool_name="lookup_requester_history",
+            )
+        )
+        self.assertEqual(env.state.investigation_steps, 1)
+        self.assertEqual(env.state.investigation_budget_remaining, 0)
+        obs = env.step(
+            HelpdeskTicketAction(
+                action_type="investigate",
+                tool_name="lookup_requester_history",
+            )
+        )
+        self.assertEqual(env.state.investigation_steps, 2)
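+        # The second investigation exceeded the budget, so even a fully
+        # correct submission earns slightly less than 1.0.
+        final_obs = env.step(HelpdeskTicketAction(issue_type=ticket.issue_type))
+        self.assertTrue(final_obs.done)
+        self.assertAlmostEqual(final_obs.reward, 0.98, places=9)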
+class TestTerminalInvalidActionFinalReward(unittest.TestCase):
+    """Terminal invalid submit actions should still return the queue-level final reward."""
+    def test_last_invalid_submit_returns_trajectory_reward_not_zero(self) -> None:
+        from unittest.mock import patch
+        dataset = load_dataset()
+        first = dataset[0]
+        second = dataset[1]
+        env = _make_env()
+        with patch.object(env, "_dataset", [first, second]):
+            with patch.object(
+                env,
+                "_tickets_by_id",
+                {first.ticket_id: first, second.ticket_id: second},
+            ):
+                obs = env.reset(seed=0, task_id=1, queue_size=2)
+        tickets_by_id = {first.ticket_id: first, second.ticket_id: second}
+        current = tickets_by_id[obs.current_ticket["ticket_id"]]
+        obs = env.step(HelpdeskTicketAction(issue_type=current.issue_type))
+        self.assertFalse(obs.done)
+        current = tickets_by_id[obs.current_ticket["ticket_id"]]
+        final_obs = env.step(
+            HelpdeskTicketAction(
+                issue_type=current.issue_type,
+                priority="medium",
+            )
+        )
+        self.assertTrue(final_obs.done)
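+        # First ticket scored 1.0; the invalid second submit scored 0.0,
+        # so the queue-level reward over both tickets is 0.5.
+        self.assertAlmostEqual(final_obs.reward, 0.5, places=9)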
+        self.assertAlmostEqual(env.state.total_reward, 0.5, places=9)
+        self.assertAlmostEqual(env.state.reward or 0.0, 0.5, places=9)
+# ---------------------------------------------------------------------------
+# 9.7 — Dataset has >= 3 non-default routing tickets
+# ---------------------------------------------------------------------------
+class TestDatasetNonDefaultRouting(unittest.TestCase):
+    """9.7 — Dataset contains at least 3 tickets with non-default assignment_group."""
+    def test_at_least_three_nondefault_routing_tickets(self) -> None:
+        from vocabulary import ISSUE_TYPE_TO_ASSIGNMENT_GROUP
+        dataset = load_dataset()
+        non_default = [
+            t for t in dataset
+            if t.assignment_group != ISSUE_TYPE_TO_ASSIGNMENT_GROUP.get(t.issue_type)
+        ]
+        self.assertGreaterEqual(
+            len(non_default), 3,
+            f"Expected >= 3 non-default routing tickets, found {len(non_default)}: "
+            + str([(t.ticket_id, t.issue_type, t.assignment_group) for t in non_default])
+        )
+    def test_tkt_nondefault_tickets_exist(self) -> None:
+        dataset = load_dataset()
+        ids = {t.ticket_id for t in dataset}
+        for expected_id in ("TKT-NONDEFAULT-001", "TKT-NONDEFAULT-002", "TKT-NONDEFAULT-003"):
+            self.assertIn(expected_id, ids, f"{expected_id} not found in dataset")
+# ---------------------------------------------------------------------------
+# 9.9 — SUPPORTS_CONCURRENT_SESSIONS is True
+# ---------------------------------------------------------------------------
+class TestConcurrentSessionsFlag(unittest.TestCase):
+    """9.9 — HelpdeskTicketRoutingEnvironment.SUPPORTS_CONCURRENT_SESSIONS is True."""
+    def test_supports_concurrent_sessions_is_true(self) -> None:
+        self.assertTrue(HelpdeskTicketRoutingEnvironment.SUPPORTS_CONCURRENT_SESSIONS)
+    def test_flag_is_boolean_true(self) -> None:
+        flag = HelpdeskTicketRoutingEnvironment.SUPPORTS_CONCURRENT_SESSIONS
+        self.assertIs(flag, True)
+# ---------------------------------------------------------------------------
+# 9.10 — GET /web returns 200 with HTML content
+# ---------------------------------------------------------------------------
+def _build_web_test_app():
+    """Build a minimal FastAPI app with only the /web route for testing."""
+    from fastapi import FastAPI
+    from fastapi.responses import HTMLResponse
+    from server.tasks import TASKS
+    from vocabulary import APP_ENV_NAME
+    _app = FastAPI()
+    @_app.get("/web", response_class=HTMLResponse)
+    def web_ui():
+        task_rows = "".join(
+            f"<tr><td>{t['id']}</td><td>{t['name']}</td><td>{t['difficulty']}</td></tr>"
+            for t in TASKS.values()
+        )
+        html = f"""<!DOCTYPE html>
+<html><head><title>{APP_ENV_NAME}</title></head>
+<body>
+<h1>{APP_ENV_NAME}</h1>
+<p>Version: 0.1.0 | <a href="/health">Health</a> | <a href="/docs">API Docs</a></p>
+<h2>Tasks</h2>
+<table border="1"><tr><th>ID</th><th>Name</th><th>Difficulty</th></tr>
+{task_rows}
+</table>
+</body></html>"""
+        return HTMLResponse(content=html)
+    return _app
+class TestWebUIEndpoint(unittest.TestCase):
+    """9.10 — GET /web returns HTTP 200 with HTML content."""
+    @classmethod
+    def setUpClass(cls) -> None:
+        from starlette.testclient import TestClient
+        app = _build_web_test_app()
+        cls.client = TestClient(app)
+    def test_web_returns_200(self) -> None:
+        response = self.client.get("/web")
+        self.assertEqual(response.status_code, 200)
+    def test_web_returns_html_content_type(self) -> None:
+        response = self.client.get("/web")
+        self.assertIn("text/html", response.headers.get("content-type", ""))
+    def test_web_response_contains_html_tag(self) -> None:
+        response = self.client.get("/web")
+        self.assertIn("<!DOCTYPE html>", response.text)
+    def test_web_response_contains_env_name(self) -> None:
+        from vocabulary import APP_ENV_NAME
+        response = self.client.get("/web")
+        self.assertIn(APP_ENV_NAME, response.text)
+if __name__ == "__main__":
+    unittest.main()

tests/test_environment_smoke.py CHANGED

@@ -101,6 +101,8 @@ class TestResetReturnsValidObservation(unittest.TestCase):
         self.assertIsNotNone(obs.current_ticket)
         self.assertGreater(obs.queue_size, 0)
         self.assertEqual(obs.tickets_processed, 0)
+        self.assertEqual(obs.queue_position, 1)
+        self.assertEqual(obs.tickets_after_current, max(0, obs.queue_size - 1))
 class TestResetAllTaskIds(unittest.TestCase):
@@ -116,6 +118,7 @@ class TestResetAllTaskIds(unittest.TestCase):
         self.assertEqual(obs.tickets_processed, 0)
         # allowed_fields must match the task definition
         self.assertEqual(obs.allowed_fields, TASKS[task_id]["allowed_fields"])
+        self.assertEqual(obs.queue_position, 1)
     def test_reset_task2(self) -> None:
         env = _make_env()
@@ -142,6 +145,10 @@ class TestStepAdvancesTicketsProcessed(unittest.TestCase):
         obs2 = env.step(action)
         self.assertEqual(obs2.tickets_processed, 1)
+        if obs2.done:
+            self.assertEqual(obs2.queue_position, 0)
+        else:
+            self.assertEqual(obs2.queue_position, 2)
     def test_step_reward_in_unit_interval(self) -> None:
         from models import HelpdeskTicketAction

tests/test_extra_fields_penalty.py ADDED

@@ -0,0 +1,182 @@

+"""
+Tests for action field validation (Task 4) in HelpdeskTicketRoutingEnvironment.step().
+Validates Requirement 7: Step Validates Action Fields Against Task Contract.
+"""
+from __future__ import annotations
+import sys
+import os
+import unittest
+import types as _types
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
+import openenv_test_stubs  # noqa: F401
+if "openenv.core.env_server.interfaces" not in sys.modules:
+    _interfaces_mod = _types.ModuleType("openenv.core.env_server.interfaces")
+    class _Environment:
+        def __init__(self) -> None:
+            pass
+        def __init_subclass__(cls, **kwargs: object) -> None:
+            super().__init_subclass__(**kwargs)
+        @classmethod
+        def __class_getitem__(cls, item: object) -> type:
+            return cls
+    _interfaces_mod.Environment = _Environment  # type: ignore[attr-defined]
+    sys.modules["openenv.core.env_server.interfaces"] = _interfaces_mod
+from models import HelpdeskTicketAction, HelpdeskTicketObservation
+from server.environment import HelpdeskTicketRoutingEnvironment
+from server.tasks import TASKS
+from vocabulary import ISSUE_TYPES, PRIORITIES, ASSIGNMENT_GROUPS, RESOLUTION_ACTIONS
+def _make_env() -> HelpdeskTicketRoutingEnvironment:
+    return HelpdeskTicketRoutingEnvironment()
+class TestExtraFieldsPenalty(unittest.TestCase):
+    """Requirement 7: step() rejects actions with fields outside the task's allowed_fields."""
+    def test_extra_fields_returns_reward_zero(self) -> None:
+        """Task 1 only allows issue_type; submitting priority or assignment_group triggers the penalty."""
+        env = _make_env()
+        obs = env.reset(seed=42, task_id=1)
+        # Task 1 allowed_fields should NOT include assignment_group
+        self.assertNotIn("assignment_group", obs.allowed_fields)
+        # Submit an action with extra fields (priority, assignment_group) not in task 1's allowed_fields
+        action = HelpdeskTicketAction(
+            issue_type=ISSUE_TYPES[0],
+            priority=PRIORITIES[0],  # extra field
+            assignment_group=ASSIGNMENT_GROUPS[0],  # extra field
+        )
+        penalty_obs = env.step(action)
+        self.assertIsInstance(penalty_obs, HelpdeskTicketObservation)
+        self.assertEqual(penalty_obs.reward, 0.0)
+    def test_extra_fields_advances_ticket_index(self) -> None:
+        """Penalty step must advance tickets_processed by 1."""
+        env = _make_env()
+        obs = env.reset(seed=42, task_id=1)
+        self.assertEqual(obs.tickets_processed, 0)
+        action = HelpdeskTicketAction(
+            issue_type=ISSUE_TYPES[0],
+            assignment_group=ASSIGNMENT_GROUPS[0],  # extra field for task 1
+        )
+        penalty_obs = env.step(action)
+        self.assertEqual(penalty_obs.tickets_processed, 1)
+    def test_extra_fields_records_score_zero(self) -> None:
+        """per_ticket_scores must contain 0.0 after a penalty step."""
+        env = _make_env()
+        env.reset(seed=42, task_id=1)
+        action = HelpdeskTicketAction(
+            issue_type=ISSUE_TYPES[0],
+            assignment_group=ASSIGNMENT_GROUPS[0],  # extra field
+        )
+        env.step(action)
+        state = env.state
+        self.assertEqual(len(state.per_ticket_scores), 1)
+        self.assertEqual(state.per_ticket_scores[0], 0.0)
+    def test_extra_fields_history_entry_has_penalty_reason(self) -> None:
+        """History entry for a penalty step must include penalty_reason."""
+        env = _make_env()
+        env.reset(seed=42, task_id=1)
+        action = HelpdeskTicketAction(
+            issue_type=ISSUE_TYPES[0],
+            assignment_group=ASSIGNMENT_GROUPS[0],  # extra field
+        )
+        penalty_obs = env.step(action)
+        self.assertEqual(len(penalty_obs.history), 1)
+        entry = penalty_obs.history[0]
+        self.assertIn("penalty_reason", entry)
+        self.assertIn("assignment_group", entry["penalty_reason"])
+        self.assertEqual(entry["score"], 0.0)
+    def test_no_extra_fields_grades_normally(self) -> None:
+        """When action fields are within allowed_fields, grading proceeds normally (reward != forced 0.0)."""
+        env = _make_env()
+        obs = env.reset(seed=42, task_id=1)
+        # Build action using only allowed fields
+        allowed = obs.allowed_fields
+        action_kwargs = {}
+        if "issue_type" in allowed:
+            action_kwargs["issue_type"] = ISSUE_TYPES[0]
+        if "priority" in allowed:
+            action_kwargs["priority"] = PRIORITIES[0]
+        action = HelpdeskTicketAction(**action_kwargs)
+        result_obs = env.step(action)
+        # Should be a valid observation; reward may be any value in [0.0, 1.0]
+        self.assertIsInstance(result_obs, HelpdeskTicketObservation)
+        self.assertIsNotNone(result_obs.reward)
+        # No penalty_reason in history
+        self.assertEqual(len(result_obs.history), 1)
+        self.assertNotIn("penalty_reason", result_obs.history[0])
+    def test_extra_fields_no_exception_raised(self) -> None:
+        """Requirement 7.4: extra fields must not raise an unhandled exception."""
+        env = _make_env()
+        env.reset(seed=42, task_id=1)
+        action = HelpdeskTicketAction(
+            issue_type=ISSUE_TYPES[0],
+            priority=PRIORITIES[0],
+            assignment_group=ASSIGNMENT_GROUPS[0],
+            resolution_action=RESOLUTION_ACTIONS[0],  # multiple extra fields
+        )
+        try:
+            obs = env.step(action)
+        except Exception as exc:  # noqa: BLE001
+            self.fail(f"step() raised an unexpected exception: {exc}")
+        self.assertIsInstance(obs, HelpdeskTicketObservation)
+    def test_extra_fields_done_flag_set_correctly_on_last_ticket(self) -> None:
+        """When the penalty step lands on the last ticket, done is True and the reward is still the episode-level score."""
+        env = _make_env()
+        obs = env.reset(seed=42, task_id=1)
+        queue_size = obs.queue_size
+        tickets_by_id = env._tickets_by_id  # noqa: SLF001 - test-only inspection
+        # Process all tickets except the last one normally
+        for _ in range(queue_size - 1):
+            current_ticket_id = obs.current_ticket["ticket_id"]
+            current_ticket = tickets_by_id[current_ticket_id]
+            obs = env.step(HelpdeskTicketAction(issue_type=current_ticket.issue_type))
+        # Now trigger penalty on the last ticket
+        current_ticket_id = obs.current_ticket["ticket_id"]
+        current_ticket = tickets_by_id[current_ticket_id]
+        action = HelpdeskTicketAction(
+            issue_type=current_ticket.issue_type,
+            assignment_group=ASSIGNMENT_GROUPS[0],  # extra field
+        )
+        final_obs = env.step(action)
+        self.assertTrue(final_obs.done)
+        expected_reward = (queue_size - 1) / queue_size
+        self.assertAlmostEqual(final_obs.reward, expected_reward, places=9)
+        self.assertAlmostEqual(env.state.total_reward, expected_reward, places=9)
+if __name__ == "__main__":
+    unittest.main()

tests/test_inference_unit.py CHANGED

@@ -163,6 +163,22 @@ class InferenceUnitTests(unittest.TestCase):
             )
         )
+    def test_default_task_selection_runs_single_first_task(self) -> None:
+        inference = _load_inference_module()
+        self.assertEqual(
+            inference.get_tasks_to_run({1: {}, 2: {}, 3: {}}),
+            [1],
+        )
+    def test_run_all_tasks_override_keeps_local_batch_mode_available(self) -> None:
+        inference = _load_inference_module({"RUN_ALL_TASKS": "1"})
+        self.assertEqual(
+            inference.get_tasks_to_run({1: {}, 2: {}, 3: {}}),
+            [1, 2, 3],
+        )
 if __name__ == "__main__":
     unittest.main()

tests/test_tasks_unit.py CHANGED

@@ -50,7 +50,7 @@ class TasksAndDatasetUnitTests(unittest.TestCase):
     def test_load_dataset_returns_valid_records(self) -> None:
         dataset = load_dataset()
-        self.assertEqual(len(dataset), 45)
+        self.assertGreaterEqual(len(dataset), 45)
         self.assertTrue(all(isinstance(record, HelpdeskTicketRecord) for record in dataset))
     def test_dataset_ticket_ids_are_unique(self) -> None: