Spaces: Roopalgn / AIHack-ITHelpDesk

Coding Ninja committed on Apr 6

Commit

c64d203

1 Parent(s): 6c5051f

Finalize gap fixes and lightweight competitive upgrades


Files changed (12)

KNOWLEDGE.md +42 -17
README.md +59 -14
ROADMAP.md +76 -10
inference.py +124 -8
models.py +24 -1
openenv.yaml +1 -0
server/environment.py +253 -36
server/tasks.py +7 -3
tests/test_competitive_upgrade.py +285 -7
tests/test_environment_smoke.py +7 -0
tests/test_extra_fields_penalty.py +11 -12
tests/test_inference_unit.py +16 -0

KNOWLEDGE.md CHANGED

@@ -24,7 +24,7 @@ IT helpdesk routing is a strong hackathon fit because it is:
 - deterministic to grade
 - naturally multi-step
-A helpdesk agent has to decide what the ticket is about, how urgent it is, who should own it, and what should happen next. That maps cleanly to a typed action object.
 ## The Repo In One Sentence
@@ -134,7 +134,7 @@ Important fields:
 ### `HelpdeskTicketAction`
-Represents the agent submission. Fields are optional because different tasks score different subsets.
 ### `HelpdeskTicketObservation`
@@ -142,6 +142,7 @@ Represents what the agent sees for each step:
 - task metadata
 - visible ticket fields
 - queue progress
 - score history
@@ -179,10 +180,19 @@ The observation exposes:
 - task metadata
 - the current ticket
 - queue progress counters
 - history
 - reward and done status
 The state tracks:
 - current task
@@ -191,12 +201,13 @@ The state tracks:
 - current ticket index
 - per-ticket scores
 - total reward
 ## Task Design
 ### Task 1: Issue Type Classification
-The agent predicts:
 - `issue_type`
@@ -206,7 +217,7 @@ Purpose:
 ### Task 2: Issue Type And Priority
-The agent predicts:
 - `issue_type`
 - `priority`
@@ -217,7 +228,7 @@ Purpose:
 ### Task 3: Full Ticket Routing
-The agent predicts:
 - `issue_type`
 - `priority`
@@ -256,14 +267,14 @@ This is now proven in checked-in unit tests rather than left as a docs claim.
 Step reward:
-- current ticket score clamped to `[0.0, 1.0]`
 Final reward:
 - average of ticket scores
-- minus a small overshoot penalty for taking more steps than the queue length
-This gives dense feedback while still rewarding efficient episode completion.
 ## Dataset Mental Model
@@ -277,6 +288,8 @@ Current structure:
 - harder ambiguous cases
 - follow-up tickets connected through `related_ticket_id`
 The dataset is meant to test routing judgment, not just keyword spotting.
 ## Grounding Note
@@ -299,16 +312,18 @@ It:
 1. connects to the environment
 2. loads the available tasks
-3. runs one episode per task
 4. picks an action for each ticket
 5. sends the action back through the client
 6. records rewards
-7. prints a task-by-task summary
 It supports:
 - heuristic mode with no external model
 - LLM mode through an OpenAI-compatible API
 ## Files That Matter Most
@@ -374,16 +389,26 @@ That follow-up pass added the remaining Roopal-owned public-clarity items:
 - an internal grounding note tying the label space to public IT-support datasets
 - a refreshed compliance snapshot in `required.md`
-The optional TRL / GRPO README example was intentionally deferred because the shared runtime-validation gates are not all green yet.
-## What Still Needs Hands-On Verification
-The biggest remaining checks are packaging and clean-machine checks, not merge-state local execution.
-Still pending:
-1. confirm Docker starts cleanly
-2. do a clean-machine dry run if possible
 ## One-Minute Summary
@@ -396,4 +421,4 @@ If you come back to this repo later, remember:
 - the agent predicts structured routing fields
 - the grader gives deterministic partial credit
 - `inference.py` is the baseline agent runner
-- merged-state local validation is complete, and Docker is the main remaining hands-on check

 - deterministic to grade
 - naturally multi-step
+A helpdesk agent has to decide what the ticket is about, how urgent it is, who should own it, and what should happen next. The current runtime now supports a small two-mode action object: investigate first when needed, then submit the final routing answer.
 ## The Repo In One Sentence
 ### `HelpdeskTicketAction`
+Represents the agent step. `action_type="submit"` carries routing fields, while `action_type="investigate"` uses a small built-in tool surface before the final submission.
 ### `HelpdeskTicketObservation`
 - task metadata
 - visible ticket fields
+- optional ambiguity or follow-up context
 - queue progress
 - score history
 - task metadata
 - the current ticket
+- available investigation tools
+- remaining free investigation budget
+- the latest tool result, when one was requested
 - queue progress counters
 - history
 - reward and done status
+Useful queue counters now include:
+- `tickets_remaining`: not-yet-processed tickets, including the current ticket when one is active
+- `tickets_after_current`: how many tickets remain after the current one
+- `queue_position`: 1-based position of the current ticket in the queue
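The relationship between the three counters can be sketched as follows. This is only an illustration: the helper name `queue_counters`, the zero-based `current_index` argument, and the choice to report `queue_position` as `None` once the queue is exhausted are assumptions, not names or behavior taken from `server/environment.py`.

```python
from typing import Optional


def queue_counters(queue_size: int, current_index: int) -> dict[str, Optional[int]]:
    """Derive the queue counters from a zero-based ticket index (hypothetical helper)."""
    # A ticket is "active" while the index still points inside the queue.
    active = 0 <= current_index < queue_size

    # tickets_remaining counts not-yet-processed tickets, including the current one.
    tickets_remaining = queue_size - current_index if active else 0

    # tickets_after_current excludes the ticket being worked on right now.
    tickets_after_current = tickets_remaining - 1 if active else 0

    # queue_position is 1-based; None here is an assumed convention for "no active ticket".
    queue_position = current_index + 1 if active else None

    return {
        "tickets_remaining": tickets_remaining,
        "tickets_after_current": tickets_after_current,
        "queue_position": queue_position,
    }


# Second ticket of a four-ticket queue.
print(queue_counters(queue_size=4, current_index=1))
```

Under these assumptions, the second ticket of a four-ticket queue has three tickets remaining, two after the current one, and a queue position of 2.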
 The state tracks:
 - current task
 - current ticket index
 - per-ticket scores
 - total reward
+- investigation step count
 ## Task Design
 ### Task 1: Issue Type Classification
+The agent ultimately predicts:
 - `issue_type`
 ### Task 2: Issue Type And Priority
+The agent ultimately predicts:
 - `issue_type`
 - `priority`
 ### Task 3: Full Ticket Routing
+The agent ultimately predicts:
 - `issue_type`
 - `priority`
 Step reward:
+- current ticket score with a small milestone bonus for strong steps and a small penalty for very weak steps
 Final reward:
 - average of ticket scores
+- minus a tiny penalty only if the agent exceeds the free investigation budget for the queue
+This keeps the reward dense and deterministic, removes the dead overshoot logic, and adds a small queue-level economics signal without disturbing the no-tool baseline path.
 ## Dataset Mental Model
 - harder ambiguous cases
 - follow-up tickets connected through `related_ticket_id`
+When a follow-up link exists, the observation can now surface a lightweight `related_ticket_preview`, and the tool layer can fetch richer related-ticket or requester-history context so the agent does not have to route every ticket from isolated text alone.
 The dataset is meant to test routing judgment, not just keyword spotting.
 ## Grounding Note
 1. connects to the environment
 2. loads the available tasks
+3. runs one episode for the requested task
 4. picks an action for each ticket
 5. sends the action back through the client
 6. records rewards
+7. prints structured logs for that run
 It supports:
 - heuristic mode with no external model
 - LLM mode through an OpenAI-compatible API
+- lightweight investigation-tool calls before the final submit action
+- an explicit local `RUN_ALL_TASKS=1` override when you want the old multi-task sweep
 ## Files That Matter Most
 - an internal grounding note tying the label space to public IT-support datasets
 - a refreshed compliance snapshot in `required.md`
+The optional TRL / GRPO README example remains intentionally deferred because it is optional and lower priority than freeze-phase stability.
+## April 3-7 Status
+The roadmap through April 7 is now closed in the current repo state.
+That means the repo now has:
+1. checked-in unit, smoke, and integration tests
+2. Docker smoke coverage through the GitHub Actions workflow
+3. a clean-copy install-and-run pass
+4. structured `inference.py` logging verification
+5. a passing local `openenv validate` result after checking in `uv.lock`
+## Submission-Day Reminders
+The remaining work belongs to the April 8 submission window rather than the April 3 to April 7 implementation window:
+1. rerun the final sanity slice on the submission branch
+2. verify the live Hugging Face Space ping and reset path after the final push if a fresh deployment is created
 ## One-Minute Summary
 - the agent predicts structured routing fields
 - the grader gives deterministic partial credit
 - `inference.py` is the baseline agent runner
+- merged-state validation, Docker smoke coverage, clean-copy rerun, and local validator readiness are all now in place

README.md CHANGED

@@ -34,7 +34,7 @@ The environment models a realistic helpdesk workflow:
 1. a new ticket enters the queue
 2. the agent reads the ticket title and description
-3. the agent predicts structured routing fields
 4. the grader assigns deterministic credit
 5. the environment advances to the next ticket until the queue is complete
@@ -43,7 +43,7 @@ This domain is useful for OpenEnv because it is operationally realistic, easy to
 ## Why This Is A Good Hackathon Domain
 - it reflects real enterprise support operations
-- the action space is structured and judge-friendly
 - correctness can be scored deterministically
 - the hard task is meaningfully harder than the easy and medium tasks
 - the environment is small enough to rerun quickly
@@ -55,7 +55,7 @@ The project uses a queue-based episode model.
 - `reset()` samples a task and a queue of 3 to 5 tickets
 - `step()` grades one ticket submission at a time
 - `state()` exposes the internal episode snapshot
-- final reward is based on average ticket quality with a small overshoot penalty
 The environment classes and vocabulary are intentionally frozen to keep collaboration and judging simple.
@@ -115,6 +115,9 @@ Visible ticket fields:
 - `title`
 - `requester`
 - `description`
 Each observation also includes:
@@ -122,9 +125,14 @@ Each observation also includes:
 - `task_name`
 - `instructions`
 - `allowed_fields`
 - `queue_size`
 - `tickets_remaining`
 - `tickets_processed`
 - `history`
 - standard OpenEnv fields such as `done` and `reward`
@@ -138,11 +146,23 @@ The internal `HelpdeskTicketState` tracks:
 - `current_ticket_index`
 - `per_ticket_scores`
 - `total_reward`
 ## Grading And Reward
 Scoring is deterministic and normalized to `[0.0, 1.0]`.
 Per-field behavior:
 - `issue_type`: exact match, with a few near-miss partial-credit pairs
@@ -161,11 +181,15 @@ Task weights:
 Final episode reward:
 ```text
-average(per_ticket_scores) - 0.03 * max(0, steps_taken - queue_size)
 ```
 The result is clamped to `[0.0, 1.0]`.
 ## Grounded Scoring
 The grader is intentionally not fuzzy by default.
@@ -285,7 +309,7 @@ curl http://localhost:7860/tasks
 ## Running The Baseline Inference Script
-The baseline script supports two modes.
 ### Heuristic mode
@@ -295,6 +319,12 @@ If no LLM credentials are set, it uses a keyword-based ticket router:
 python inference.py
 ```
 ### LLM mode
 Set these environment variables first:
@@ -313,6 +343,14 @@ Optional target:
 - `ENV_URL`
 - default value: `http://localhost:7860`
 ## Runtime Validation Snapshot
@@ -324,7 +362,7 @@ Validated locally:
 - `/health`
 - `/tasks`
 - `/reset`
-- heuristic `inference.py` run across all 3 tasks
 Current local heuristic results:
@@ -335,7 +373,7 @@ Current local heuristic results:
 | Full Ticket Routing | `0.9400` |
 | Overall | `0.9400` |
-The merged-state rerun matched these same numbers exactly, so they are the current benchmark reference for the repo. A Docker smoke test and clean-machine rerun are still recommended before final submission freeze.
 ### Windows note
@@ -358,7 +396,7 @@ docker run -p 7860:7860 helpdesk-ticket-routing
 Then run inference against it (default `ENV_URL` points to `http://localhost:7860`):
 ```bash
-python inference.py
 ```
 If you publish the container on a different host port, set `ENV_URL` accordingly before running `inference.py`.
@@ -376,6 +414,7 @@ OpenEnv provides the core environment endpoints, and the repo adds a custom task
 | POST | `/step` | submit an action |
 | GET | `/state` | inspect internal state |
 | GET | `/tasks` | list task metadata |
 | GET | `/docs` | interactive API docs |
 ## Submission Readiness
@@ -397,11 +436,17 @@ An April 6 repo audit also confirmed that all required submission files are pres
 - data and metadata: `data/dataset.json`, `openenv.yaml`, `pyproject.toml`, `requirements.txt`, `server/Dockerfile`
 - docs and planning: `README.md`, `KNOWLEDGE.md`, `required.md`, `PROJECT_STATUS.md`, `ROADMAP.md`
-Still pending before final submission:
-- a Docker smoke test from a machine with Docker installed
-- `openenv validate` evidence on the current merged repo state
-- structured `inference.py` log-format verification on the current merged repo state
-- a final clean-machine dry run if possible before submission freeze
-The short TRL / GRPO README example from the roadmap is intentionally deferred until the shared runtime and validation gates are green.

 1. a new ticket enters the queue
 2. the agent reads the ticket title and description
+3. the agent may investigate with lightweight tools, then submit structured routing fields
 4. the grader assigns deterministic credit
 5. the environment advances to the next ticket until the queue is complete
 ## Why This Is A Good Hackathon Domain
 - it reflects real enterprise support operations
+- the action space is structured and judge-friendly, with a small investigate-versus-submit split
 - correctness can be scored deterministically
 - the hard task is meaningfully harder than the easy and medium tasks
 - the environment is small enough to rerun quickly
 - `reset()` samples a task and a queue of 3 to 5 tickets
 - `step()` grades one ticket submission at a time
 - `state()` exposes the internal episode snapshot
+- final reward is based on average ticket quality across the queue
 The environment classes and vocabulary are intentionally frozen to keep collaboration and judging simple.
 - `title`
 - `requester`
 - `description`
+- optional `ambiguity_note`
+- optional `related_ticket_id`
+- optional `related_ticket_preview`
 Each observation also includes:
 - `task_name`
 - `instructions`
 - `allowed_fields`
+- `available_tools`
+- `investigation_budget_remaining`
+- `last_tool_result`
 - `queue_size`
 - `tickets_remaining`
+- `tickets_after_current`
 - `tickets_processed`
+- `queue_position`
 - `history`
 - standard OpenEnv fields such as `done` and `reward`
 - `current_ticket_index`
 - `per_ticket_scores`
 - `total_reward`
+- `reward`
+- `done`
 ## Grading And Reward
 Scoring is deterministic and normalized to `[0.0, 1.0]`.
+The action model now supports two paths:
+- `action_type="submit"` for the final routing answer
+- `action_type="investigate"` with a small built-in tool surface before submission
+Available tools:
+- `lookup_related_ticket`
+- `lookup_requester_history`
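A rough sketch of what the two action shapes might look like as payloads. `ActionSketch` is a stand-in dataclass, not the real `HelpdeskTicketAction` from `models.py`; it includes only field names that appear in this diff. The default `action_type`, the ticket ID `T-001`, and the `priority` value are illustrative placeholders.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class ActionSketch:
    """Stand-in for HelpdeskTicketAction; every default here is an assumption."""

    action_type: str = "submit"
    tool_name: Optional[str] = None
    tool_target_ticket_id: Optional[str] = None
    issue_type: Optional[str] = None
    priority: Optional[str] = None

    def payload(self) -> dict[str, Any]:
        # Drop unset fields so each path sends only what it needs.
        return {key: value for key, value in asdict(self).items() if value is not None}


# Investigate path: request context through one of the built-in tools.
investigate = ActionSketch(
    action_type="investigate",
    tool_name="lookup_related_ticket",
    tool_target_ticket_id="T-001",  # hypothetical ticket ID
)

# Submit path: send the final routing fields for the ticket.
submit = ActionSketch(
    action_type="submit",
    issue_type="general_inquiry",
    priority="medium",  # placeholder value; the real vocabulary lives in the repo
)

print(investigate.payload())
print(submit.payload())
```

Keeping both paths in a single action type means the client can send every step through the same `/step` endpoint.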
 Per-field behavior:
 - `issue_type`: exact match, with a few near-miss partial-credit pairs
 Final episode reward:
 ```text
+average(per_ticket_scores)
 ```
 The result is clamped to `[0.0, 1.0]`.
+Step reward is lightly milestone-shaped: high per-ticket scores get a small bonus and very low scores get a small penalty before the final clamp.
+Final reward also includes a tiny queue-economics penalty only when the agent exceeds the free investigation budget. One investigation per queued ticket is free; extra investigation steps reduce the final reward slightly.
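The reward rules described above can be sketched as two small functions. All numeric constants below (thresholds, bonus, and penalty sizes) are placeholder assumptions chosen for illustration; the actual values live in `server/environment.py` and are not shown in this diff. The function names are hypothetical as well.

```python
# Placeholder constants: assumed values, not the ones used by the environment.
HIGH_SCORE_THRESHOLD = 0.9
LOW_SCORE_THRESHOLD = 0.2
MILESTONE_BONUS = 0.05
WEAK_STEP_PENALTY = 0.05
EXTRA_INVESTIGATION_PENALTY = 0.01


def clamp01(value: float) -> float:
    """Clamp a reward into the documented [0.0, 1.0] range."""
    return max(0.0, min(1.0, value))


def step_reward(ticket_score: float) -> float:
    """Milestone-shaped step reward: bonus for strong steps, penalty for very weak ones."""
    shaped = ticket_score
    if ticket_score >= HIGH_SCORE_THRESHOLD:
        shaped += MILESTONE_BONUS
    elif ticket_score <= LOW_SCORE_THRESHOLD:
        shaped -= WEAK_STEP_PENALTY
    return clamp01(shaped)


def final_reward(per_ticket_scores: list[float], queue_size: int, investigation_steps: int) -> float:
    """Average ticket score minus a small penalty for investigations beyond the free budget."""
    if not per_ticket_scores:
        return 0.0
    average = sum(per_ticket_scores) / len(per_ticket_scores)
    # One investigation per queued ticket is free; only the excess is penalized.
    extra_steps = max(0, investigation_steps - queue_size)
    return clamp01(average - EXTRA_INVESTIGATION_PENALTY * extra_steps)


# Within budget: no penalty applies, so the no-tool baseline is unaffected.
print(final_reward([1.0, 0.8], queue_size=2, investigation_steps=2))
# Two extra investigations: the final reward drops slightly.
print(final_reward([1.0, 0.8], queue_size=2, investigation_steps=4))
```

Because the penalty applies only to investigations beyond the free budget, an agent that never investigates receives the plain average.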
 ## Grounded Scoring
 The grader is intentionally not fuzzy by default.
 ## Running The Baseline Inference Script
+The baseline script supports single-task evaluator mode by default, plus an explicit local batch override.
 ### Heuristic mode
 python inference.py
 ```
+By default that runs exactly one task and emits exactly one `[START] ... [END]` block. To target a specific task:
+```bash
+TASK_ID=3 python inference.py
+```
 ### LLM mode
 Set these environment variables first:
 - `ENV_URL`
 - default value: `http://localhost:7860`
+- `TASK_ID`
+- `RUN_ALL_TASKS`
+To reproduce the multi-task local benchmark sweep:
+```bash
+RUN_ALL_TASKS=1 python inference.py
+```
 ## Runtime Validation Snapshot
 - `/health`
 - `/tasks`
 - `/reset`
+- heuristic `inference.py` run across all 3 tasks with `RUN_ALL_TASKS=1`
 Current local heuristic results:
 | Full Ticket Routing | `0.9400` |
 | Overall | `0.9400` |
+The merged-state rerun matched these same numbers exactly, so they are the current benchmark reference for the repo. The April 6 to April 7 validation pass then closed the remaining roadmap gates with Docker smoke coverage via GitHub Actions, a clean-copy install-and-run rerun, structured inference-log verification, and a passing local `openenv validate` check after checking in `uv.lock`.
 ### Windows note
 Then run inference against it (default `ENV_URL` points to `http://localhost:7860`):
 ```bash
+RUN_ALL_TASKS=1 python inference.py
 ```
 If you publish the container on a different host port, set `ENV_URL` accordingly before running `inference.py`.
 | POST | `/step` | submit an action |
 | GET | `/state` | inspect internal state |
 | GET | `/tasks` | list task metadata |
+| GET | `/web` | lightweight HF Space UI |
 | GET | `/docs` | interactive API docs |
 ## Submission Readiness
 - data and metadata: `data/dataset.json`, `openenv.yaml`, `pyproject.toml`, `requirements.txt`, `server/Dockerfile`
 - docs and planning: `README.md`, `KNOWLEDGE.md`, `required.md`, `PROJECT_STATUS.md`, `ROADMAP.md`
+Roadmap status through April 7 is complete:
+- unit, smoke, and integration tests are checked in and green
+- Docker smoke coverage exists through `.github/workflows/docker-smoke-test.yml`
+- `openenv validate` now passes on the current repo state
+- structured `inference.py` logging is verified by tests and the merged-state rerun
+- a clean-copy install-and-run pass has been completed
+The remaining April 8 work is operational rather than implementation-heavy:
+- run the final submission-branch sanity slice before pushing
+- perform the live Hugging Face Space ping and reset check on the deployed submission artifact if a fresh deployment is created
+The short TRL / GRPO README example from the roadmap remains intentionally deferred because it is optional and lower priority than freeze-phase stability.

ROADMAP.md CHANGED

@@ -11,10 +11,39 @@
 ## How To Use This File
 - `PROJECT_STATUS.md` is the canonical log of completed work.
-- This roadmap is the remaining execution plan from the current repo state to final submission.
 - `required.md` is now the combined official-requirements and project-compliance file.
 - `KNOWLEDGE.md` defines the current repo truth and judge-facing explanation.
-- `analysis/comp.md` and `analysis/comp_know.md` are internal competitive notes only. Use them to prioritize work, but do not mention competitor repos in public-facing docs.
 ## What We Are Optimizing For
@@ -47,14 +76,51 @@ The repo already has:
 - deterministic grading with limited partial credit
 - working heuristic baseline
 - merged local validation on `/health`, `/tasks`, and `inference.py`
-- current local benchmark reference:
-  - Task 1: `1.0000`
-  - Task 2: `0.8800`
-  - Task 3: `0.9400`
-  - Overall: `0.9400`
 The remaining work should be treated as targeted strengthening, not broad feature invention.
 ## Submission Gates That Must Still Hold
 These come directly from `required.md` and `KNOWLEDGE.md`:
@@ -114,7 +180,7 @@ Because we are using Codex to generate code, we should optimize for small, bound
 **Window:** April 3 to April 4
-**Goal:** eliminate the biggest competitive weakness identified in `analysis/comp.md` and `analysis/comp_know.md`: lack of checked-in tests.
 ### Must produce
@@ -182,7 +248,7 @@ Because we are using Codex to generate code, we should optimize for small, bound
 - assignment group and resolution action remain exact
 - final episode reward stays bounded and deterministic
-### Safe improvement candidates from `analysis/comp_know.md`
 - expand `ISSUE_TYPE_SIMILARITY` with only a few defensible pairs, if backed by grounding review
 - enrich `history` with:
@@ -237,7 +303,7 @@ Because we are using Codex to generate code, we should optimize for small, bound
 **Window:** April 6 to April 7
-**Goal:** close the submission-readiness gaps surfaced in `analysis/comp_know.md`.
 ### Must produce

 ## How To Use This File
 - `PROJECT_STATUS.md` is the canonical log of completed work.
+- This roadmap is the active plan from the verified April 6, 2026 repo state to final submission.
 - `required.md` is now the combined official-requirements and project-compliance file.
 - `KNOWLEDGE.md` defines the current repo truth and judge-facing explanation.
+- `analysis/competition_notes.md` is the merged internal competitive note. Use it to prioritize work, but do not mention competitor repos in public-facing docs.
+- The dated April 3 to April 5 sections below are now historical context; the active execution block is the final 24-hour plan for April 6 to April 7, 2026.
+## Status As Of April 6, 2026
+The repo is now in the expected "stabilize and merge" phase rather than the earlier "build core fixes" phase.
+Completed and locally verified:
+- all concrete items from `gaps.md`
+- the viable low-risk improvements from `analysis/deep_competitive_gap_report.md`
+- single-task `inference.py` execution with `TASK_ID` support and optional `RUN_ALL_TASKS=1`
+- `state()` exposure of `reward` and `done`
+- richer history with predicted actions and follow-up context
+- lightweight investigate-versus-submit action support with tool-backed context lookup
+- small queue-economics signal without major benchmark redesign
+- `/web` UI route
+- local full test pass:
+  - `126 passed, 137 subtests passed`
+- local validator pass:
+  - `[OK] meta-AIHack: Ready for multi-mode deployment`
+Merge recommendation:
+- mergeable as an incremental submission-ready improvement branch
+- do not block merge on major redesign items that were explicitly out of scope:
+  - scenario-family task redesign
+  - breaking the issue-type-to-assignment shortcut
+  - large dataset expansion
+  - full queue simulator / economics redesign
 ## What We Are Optimizing For
 - deterministic grading with limited partial credit
 - working heuristic baseline
 - merged local validation on `/health`, `/tasks`, and `inference.py`
+- single-task evaluator-safe inference behavior
+- reward and done fields on `state()`
+- richer observation history and linked-ticket context
+- lightweight investigate / submit split with small built-in tool support
+- local full-suite verification:
+  - `126 passed, 137 subtests passed`
+- local validator verification:
+  - `[OK] meta-AIHack: Ready for multi-mode deployment`
 The remaining work should be treated as targeted strengthening, not broad feature invention.
+## Final 24-Hour Plan
+**Active window:** April 6 to April 7, 2026
+**Internal target:** open PR, merge to the common `main`, and complete the final smoke checks by April 7, 2026
+**Official deadline:** April 8, 2026, 11:59 PM IST
+### Must finish before merge
+- review the final diff and stage only the intended submission files
+- open the merge PR from a dedicated branch
+- merge into the shared `main` after one last reviewer pass
+- rerun the post-merge smoke checks:
+  - `pytest`
+  - `openenv validate`
+  - `/health`
+  - `/tasks`
+  - one `reset()` / `step()` sanity path
+### Do not add before merge
+- no new benchmark redesign work
+- no new dataset expansion
+- no schema churn
+- no reward refactors beyond blocker-level fixes
+- no last-minute inference prompt rewrites
+### Success condition for April 7, 2026
+- PR is up
+- PR is reviewed against `gaps.md` and `analysis/deep_competitive_gap_report.md`
+- shared `main` contains the tested gap-fix branch
+- deployment sanity checks are green
+- repo is frozen except for typo-level fixes
 ## Submission Gates That Must Still Hold
 These come directly from `required.md` and `KNOWLEDGE.md`:
 **Window:** April 3 to April 4
+**Goal:** eliminate the biggest competitive weakness identified in `analysis/competition_notes.md`: lack of checked-in tests.
 ### Must produce
 - assignment group and resolution action remain exact
 - final episode reward stays bounded and deterministic
+### Safe improvement candidates from `analysis/competition_notes.md`
 - expand `ISSUE_TYPE_SIMILARITY` with only a few defensible pairs, if backed by grounding review
 - enrich `history` with:
 **Window:** April 6 to April 7
+**Goal:** close the submission-readiness gaps surfaced in `analysis/competition_notes.md`.
 ### Must produce

inference.py CHANGED

@@ -20,6 +20,15 @@ HF_TOKEN
     HuggingFace authentication token for the LLM provider.
     No default is set.
 LOCAL_IMAGE_NAME
     Optional compatibility variable from the sample inference pattern.
     This script does not use ``from_docker_image()``, so the value is unused here.
@@ -65,6 +74,11 @@ ENV_URL = os.getenv("ENV_URL", "http://localhost:7860")
 SEED = 42
 TASK_ID_ENV = os.getenv("TASK_ID")
 # ---------------------------------------------------------------------------
 # LLM helper
@@ -99,13 +113,36 @@ Return ONLY valid JSON with the requested fields. No markdown, no explanation.""
 def call_llm(ticket: dict, allowed_fields: list[str], instructions: str) -> dict:
     assert llm_client is not None, "LLM client not configured"
     user_msg = (
         f"Instructions: {instructions}\n\n"
         f"Allowed fields: {', '.join(allowed_fields)}\n\n"
         f"Title: {ticket['title']}\n"
         f"Requester: {ticket['requester']}\n"
-        f"Description: {ticket['description']}\n\n"
         f"Respond with JSON containing ONLY these fields: {', '.join(allowed_fields)}"
     )
@@ -135,17 +172,26 @@ def emit_log(tag: str, **payload: Any) -> None:
 def get_tasks_to_run(available_tasks: dict) -> list[int]:
     if TASK_ID_ENV:
         try:
             task_id = int(TASK_ID_ENV)
         except ValueError:
             print(f"[ERROR] TASK_ID={TASK_ID_ENV!r} is not a valid integer", flush=True)
             raise SystemExit(1)
-        if task_id not in available_tasks:
-            print(f"[WARN] TASK_ID={task_id} not in available tasks {list(available_tasks)}", flush=True)
-            return []
         return [task_id]
-    return list(TASK_IDS)  # fallback: all tasks (local dev)
 # ---------------------------------------------------------------------------
@@ -278,7 +324,18 @@ def heuristic_resolution_action(text: str, issue_type: str) -> str:
 def heuristic_action(ticket: dict, allowed_fields: list[str]) -> dict:
-    text = (ticket.get("title", "") + " " + ticket.get("description", "")).lower()
     issue_type = "general_inquiry"
     for kw, mapped_issue_type in KEYWORD_ISSUE_TYPES.items():
@@ -329,6 +386,31 @@ def build_action(
         )
 # ---------------------------------------------------------------------------
 # Main loop using the HTTP-based sync EnvClient for multi-step episodes
 # ---------------------------------------------------------------------------
@@ -347,7 +429,9 @@ def run() -> None:
     all_results: dict[int, dict[str, float | int]] = {}
     tasks_to_run = get_tasks_to_run(available_tasks)
-    single_task_mode = bool(TASK_ID_ENV)
     for task_id in tasks_to_run:
         if task_id not in available_tasks:
@@ -377,8 +461,40 @@ def run() -> None:
                 if ticket is None:
                     break
                 action, action_source, fallback_reason = build_action(
-                    ticket,
                     obs.allowed_fields,
                     obs.instructions,
                 )

     HuggingFace authentication token for the LLM provider.
     No default is set.
+TASK_ID
+    Optional OpenEnv task ID to run. When unset, the script defaults to the
+    first available task so it still emits exactly one ``[START]`` ... ``[END]``
+    block for evaluator-style runs.
+RUN_ALL_TASKS
+    Optional local-development override. Set to ``1`` to run every available
+    task in sequence and print the aggregate closing ``[END]`` summary.
 LOCAL_IMAGE_NAME
     Optional compatibility variable from the sample inference pattern.
     This script does not use ``from_docker_image()``, so the value is unused here.
 SEED = 42
 TASK_ID_ENV = os.getenv("TASK_ID")
+RUN_ALL_TASKS_ENV = os.getenv("RUN_ALL_TASKS", "").strip().lower() in {
+    "1",
+    "true",
+    "yes",
+}
 # ---------------------------------------------------------------------------
 # LLM helper
 def call_llm(ticket: dict, allowed_fields: list[str], instructions: str) -> dict:
     assert llm_client is not None, "LLM client not configured"
+    ambiguity_note = ticket.get("ambiguity_note")
+    related_preview = ticket.get("related_ticket_preview") or {}
+    last_tool_result = ticket.get("last_tool_result")
+    extra_context_lines: list[str] = []
+    if ambiguity_note:
+        extra_context_lines.append(f"Ambiguity note: {ambiguity_note}")
+    if related_preview:
+        extra_context_lines.extend(
+            [
+                "Related ticket preview:",
+                f"- Title: {related_preview.get('title', '')}",
+                f"- Requester: {related_preview.get('requester', '')}",
+                f"- Description: {related_preview.get('description', '')}",
+            ]
+        )
+    if last_tool_result is not None:
+        extra_context_lines.append(
+            "Investigation result: " + json.dumps(last_tool_result, sort_keys=True)
+        )
+    extra_context_block = ""
+    if extra_context_lines:
+        extra_context_block = "\n" + "\n".join(extra_context_lines)
     user_msg = (
         f"Instructions: {instructions}\n\n"
         f"Allowed fields: {', '.join(allowed_fields)}\n\n"
         f"Title: {ticket['title']}\n"
         f"Requester: {ticket['requester']}\n"
+        f"Description: {ticket['description']}"
+        f"{extra_context_block}\n\n"
         f"Respond with JSON containing ONLY these fields: {', '.join(allowed_fields)}"
     )
 def get_tasks_to_run(available_tasks: dict) -> list[int]:
+    available_task_ids = sorted(int(task_id) for task_id in available_tasks)
     if TASK_ID_ENV:
         try:
             task_id = int(TASK_ID_ENV)
         except ValueError:
             print(f"[ERROR] TASK_ID={TASK_ID_ENV!r} is not a valid integer", flush=True)
             raise SystemExit(1)
+        if task_id not in available_task_ids:
+            print(
+                f"[ERROR] TASK_ID={task_id} not in available tasks {available_task_ids}",
+                flush=True,
+            )
+            raise SystemExit(1)
         return [task_id]
+    if RUN_ALL_TASKS_ENV:
+        return available_task_ids
+    if not available_task_ids:
+        return []
+    # Default to a single task so evaluation emits exactly one START/END block.
+    return [available_task_ids[0]]
 # ---------------------------------------------------------------------------
 def heuristic_action(ticket: dict, allowed_fields: list[str]) -> dict:
+    related_preview = ticket.get("related_ticket_preview") or {}
+    last_tool_result = ticket.get("last_tool_result") or {}
+    text = " ".join(
+        [
+            ticket.get("title", ""),
+            ticket.get("description", ""),
+            ticket.get("ambiguity_note", ""),
+            related_preview.get("title", ""),
+            related_preview.get("description", ""),
+            json.dumps(last_tool_result, sort_keys=True),
+        ]
+    ).lower()
     issue_type = "general_inquiry"
     for kw, mapped_issue_type in KEYWORD_ISSUE_TYPES.items():
         )
+def should_investigate(ticket: dict, history: list[dict[str, Any]]) -> tuple[bool, str | None]:
+    if not ticket:
+        return False, None
+    current_ticket_id = ticket.get("ticket_id")
+    already_investigated = any(
+        entry.get("ticket_id") == current_ticket_id
+        and entry.get("predicted", {}).get("action_type") == "investigate"
+        for entry in history
+    )
+    if already_investigated:
+        return False, None
+    if ticket.get("related_ticket_id"):
+        return True, "lookup_related_ticket"
+    if ticket.get("ambiguity_note"):
+        return True, "lookup_requester_history"
+    return False, None
+def merge_ticket_context(ticket: dict, observation: Any) -> dict:
+    merged_ticket = dict(ticket)
+    if getattr(observation, "last_tool_result", None) is not None:
+        merged_ticket["last_tool_result"] = observation.last_tool_result
+    return merged_ticket
 # ---------------------------------------------------------------------------
 # Main loop using the HTTP-based sync EnvClient for multi-step episodes
 # ---------------------------------------------------------------------------
     all_results: dict[int, dict[str, float | int]] = {}
     tasks_to_run = get_tasks_to_run(available_tasks)
+    if not tasks_to_run:
+        return
+    single_task_mode = len(tasks_to_run) == 1
     for task_id in tasks_to_run:
         if task_id not in available_tasks:
                 if ticket is None:
                     break
+                investigate, tool_name = should_investigate(ticket, obs.history)
+                if (
+                    investigate
+                    and tool_name is not None
+                    and getattr(obs, "investigation_budget_remaining", 0) > 0
+                ):
+                    tool_action = HelpdeskTicketAction(
+                        action_type="investigate",
+                        tool_name=tool_name,
+                        tool_target_ticket_id=ticket.get("related_ticket_id"),
+                    )
+                    result = sync_client.step(tool_action)
+                    obs = result.observation
+                    step_num += 1
+                    emit_log(
+                        "STEP",
+                        action=tool_action.model_dump(exclude_none=True),
+                        action_source="investigation_tool",
+                        done=bool(result.done),
+                        fallback_reason=None,
+                        reward=float(result.reward or 0.0),
+                        step=step_num,
+                        task_id=task_id,
+                        ticket_id=ticket["ticket_id"],
+                    )
+                    if result.done:
+                        break
+                    ticket = obs.current_ticket
+                    if ticket is None:
+                        break
+                ticket_with_context = merge_ticket_context(ticket, obs)
                 action, action_source, fallback_reason = build_action(
+                    ticket_with_context,
                     obs.allowed_fields,
                     obs.instructions,
                 )
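The investigate-then-submit decision added to `inference.py` above can be exercised on its own. The following is a minimal sketch, assuming the same precedence as the diff (a linked ticket wins over an ambiguity note, and a ticket is investigated at most once). The sample tickets and history entries are hypothetical and not taken from the dataset.

```python
from typing import Any, Optional


def should_investigate(
    ticket: dict[str, Any], history: list[dict[str, Any]]
) -> tuple[bool, Optional[str]]:
    """Return (investigate?, tool_name) for the current ticket."""
    if not ticket:
        return False, None
    ticket_id = ticket.get("ticket_id")
    # Skip if the history already holds an investigate step for this ticket.
    already_investigated = any(
        entry.get("ticket_id") == ticket_id
        and entry.get("predicted", {}).get("action_type") == "investigate"
        for entry in history
    )
    if already_investigated:
        return False, None
    # A linked ticket takes precedence over an ambiguity note.
    if ticket.get("related_ticket_id"):
        return True, "lookup_related_ticket"
    if ticket.get("ambiguity_note"):
        return True, "lookup_requester_history"
    return False, None


# Hypothetical sample data for illustration only.
linked = {"ticket_id": "T-2", "related_ticket_id": "T-1"}
ambiguous = {"ticket_id": "T-3", "ambiguity_note": "could be VPN or Wi-Fi"}
plain = {"ticket_id": "T-4"}
seen = [{"ticket_id": "T-2", "predicted": {"action_type": "investigate"}}]

for ticket, hist in ((linked, []), (ambiguous, []), (plain, []), (linked, seen)):
    print(ticket["ticket_id"], should_investigate(ticket, hist))
```

Note that the client loop in the diff additionally requires `investigation_budget_remaining > 0` before it sends the tool action, so this function alone does not decide whether a lookup happens.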

models.py CHANGED

@@ -16,6 +16,8 @@ ISSUE_TYPE_SET = set(ISSUE_TYPES)
 PRIORITY_SET = set(PRIORITIES)
 ASSIGNMENT_GROUP_SET = set(ASSIGNMENT_GROUPS)
 RESOLUTION_ACTION_SET = set(RESOLUTION_ACTIONS)
 def _validate_choice(value: str, allowed: set[str], field_name: str) -> str:
@@ -67,11 +69,24 @@ class HelpdeskTicketRecord(BaseModel):
 class HelpdeskTicketAction(Action):
     issue_type: Optional[str] = None
     priority: Optional[str] = None
     assignment_group: Optional[str] = None
     resolution_action: Optional[str] = None
     @field_validator("issue_type")
     @classmethod
     def validate_issue_type(cls, value: Optional[str]) -> Optional[str]:
@@ -98,10 +113,15 @@ class HelpdeskTicketObservation(Observation):
     task_name: str = ""
     instructions: str = ""
     allowed_fields: list[str] = Field(default_factory=list)
-    current_ticket: Optional[dict[str, str]] = None
     queue_size: int = 0
     tickets_remaining: int = 0
     tickets_processed: int = 0
     history: list[dict[str, Any]] = Field(default_factory=list)
@@ -116,4 +136,7 @@ class HelpdeskTicketState(State):
     # `reward` is the field the evaluator checks on GET /state (mentor spec)
     reward: Optional[float] = None
     done: bool = False
     history_entries: list[dict] = Field(default_factory=list)

 PRIORITY_SET = set(PRIORITIES)
 ASSIGNMENT_GROUP_SET = set(ASSIGNMENT_GROUPS)
 RESOLUTION_ACTION_SET = set(RESOLUTION_ACTIONS)
+ACTION_TYPE_SET = {"submit", "investigate"}
+TOOL_NAME_SET = {"lookup_related_ticket", "lookup_requester_history"}
 def _validate_choice(value: str, allowed: set[str], field_name: str) -> str:
 class HelpdeskTicketAction(Action):
+    action_type: str = "submit"
+    tool_name: Optional[str] = None
+    tool_target_ticket_id: Optional[str] = None
     issue_type: Optional[str] = None
     priority: Optional[str] = None
     assignment_group: Optional[str] = None
     resolution_action: Optional[str] = None
+    @field_validator("action_type")
+    @classmethod
+    def validate_action_type(cls, value: str) -> str:
+        return _validate_choice(value, ACTION_TYPE_SET, "action_type")
+    @field_validator("tool_name")
+    @classmethod
+    def validate_tool_name(cls, value: Optional[str]) -> Optional[str]:
+        return _validate_optional_choice(value, TOOL_NAME_SET, "tool_name")
     @field_validator("issue_type")
     @classmethod
     def validate_issue_type(cls, value: Optional[str]) -> Optional[str]:
     task_name: str = ""
     instructions: str = ""
     allowed_fields: list[str] = Field(default_factory=list)
+    available_tools: list[str] = Field(default_factory=list)
+    investigation_budget_remaining: int = 0
+    last_tool_result: Optional[dict[str, Any]] = None
+    current_ticket: Optional[dict[str, Any]] = None
     queue_size: int = 0
     tickets_remaining: int = 0
+    tickets_after_current: int = 0
     tickets_processed: int = 0
+    queue_position: int = 0
     history: list[dict[str, Any]] = Field(default_factory=list)
     # `reward` is the field the evaluator checks on GET /state (mentor spec)
     reward: Optional[float] = None
     done: bool = False
+    investigation_steps: int = 0
+    investigation_budget_remaining: int = 0
+    last_tool_result: Optional[dict[str, Any]] = None
     history_entries: list[dict] = Field(default_factory=list)
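The two action shapes introduced in `models.py` (`submit` and `investigate`) can be illustrated without pydantic. This is a plain-Python stand-in, not the project's validators: the hypothetical `check_action` helper combines the choice checks from the model with the investigate-specific checks that the diff places in `server/environment.py`.

```python
from typing import Any

ACTION_TYPE_SET = {"submit", "investigate"}
TOOL_NAME_SET = {"lookup_related_ticket", "lookup_requester_history"}
SUBMIT_FIELDS = ("issue_type", "priority", "assignment_group", "resolution_action")


def check_action(payload: dict[str, Any]) -> str:
    """Validate a raw action payload and return its action_type.

    Hypothetical helper for illustration; the real project relies on
    pydantic field validators plus checks inside the environment.
    """
    action_type = payload.get("action_type", "submit")  # default matches the model
    if action_type not in ACTION_TYPE_SET:
        raise ValueError(f"invalid action_type: {action_type!r}")
    tool_name = payload.get("tool_name")
    if tool_name is not None and tool_name not in TOOL_NAME_SET:
        raise ValueError(f"invalid tool_name: {tool_name!r}")
    if action_type == "investigate":
        if tool_name is None:
            raise ValueError("Investigate actions require tool_name")
        extra = sorted(f for f in SUBMIT_FIELDS if payload.get(f) is not None)
        if extra:
            raise ValueError(f"Investigate actions cannot include submit fields: {extra}")
    return action_type


print(check_action({"issue_type": "general_inquiry"}))
print(check_action({"action_type": "investigate", "tool_name": "lookup_related_ticket"}))
```

An investigate payload that also carries `issue_type`, or one without `tool_name`, raises `ValueError` in this sketch, mirroring the error messages in `_handle_investigation_action`.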

openenv.yaml CHANGED

@@ -53,6 +53,7 @@ inference:
     - MODEL_NAME
     - HF_TOKEN
     - ENV_URL
 requirements:
   python: ">=3.11"

     - MODEL_NAME
     - HF_TOKEN
     - ENV_URL
+    - TASK_ID
 requirements:
   python: ">=3.11"
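The `TASK_ID` variable registered above drives task selection in `inference.py`. A minimal sketch of that selection logic follows, modeled on the standalone re-implementation in `tests/test_competitive_upgrade.py`. The function name `select_tasks` and the boolean `run_all` parameter are assumptions for illustration; the name of the environment variable behind `RUN_ALL_TASKS_ENV` is not shown in this diff, and how an empty `TASK_ID` string is treated is also an assumption here.

```python
from typing import Optional


def select_tasks(
    task_id_env: Optional[str], run_all: bool, available: dict[int, dict]
) -> list[int]:
    """Resolve which task ids to run (hypothetical stand-in for get_tasks_to_run)."""
    if task_id_env is not None:
        try:
            task_id = int(task_id_env)
        except ValueError:
            raise SystemExit(1)
        # An explicit but unknown task id is a hard error, not an empty run.
        if task_id not in available:
            raise SystemExit(1)
        return [task_id]
    if run_all:
        return sorted(available)
    if not available:
        return []
    # Default to a single task so exactly one START/END block is emitted.
    return [sorted(available)[0]]


tasks = {1: {}, 2: {}, 3: {}}
print(select_tasks("2", False, tasks))
print(select_tasks(None, False, tasks))
print(select_tasks(None, True, tasks))
```

The behavior change in this commit is visible in the last two branches: an unset `TASK_ID` now selects only the first available task, and running every task requires the explicit override.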

server/environment.py CHANGED

@@ -18,6 +18,10 @@ from server.tasks import get_task_definition, load_dataset
 QUEUE_SIZE_RANGE = (3, 5)
 def _coerce_optional_int(value: Any, field_name: str) -> Optional[int]:
@@ -41,6 +45,7 @@ class HelpdeskTicketRoutingEnvironment(
     def __init__(self) -> None:
         super().__init__()
         self._dataset = load_dataset()
         self._rng = random.Random()
         self._queue: list[HelpdeskTicketRecord] = []
         self._state = HelpdeskTicketState()
@@ -57,13 +62,19 @@ class HelpdeskTicketRoutingEnvironment(
     ) -> HelpdeskTicketObservation:
         normalized_seed = _coerce_optional_int(seed, "seed")
         task_id_value = _coerce_optional_int(kwargs.get("task_id", 1), "task_id")
         task_id = 1 if task_id_value is None else task_id_value
         task = get_task_definition(task_id)
         if normalized_seed is not None:
             self._rng.seed(normalized_seed)
-        queue_size = self._rng.randint(*QUEUE_SIZE_RANGE)
         self._queue = self._rng.sample(self._dataset, min(queue_size, len(self._dataset)))
         self._state = HelpdeskTicketState(
@@ -75,6 +86,7 @@ class HelpdeskTicketRoutingEnvironment(
             current_ticket_index=0,
             per_ticket_scores=[],
             total_reward=0.0,
         )
         return self._build_observation(task)
@@ -96,34 +108,46 @@ class HelpdeskTicketRoutingEnvironment(
         task_id = self._state.current_task_id
         task = get_task_definition(task_id)
         submitted_fields = {
-            f for f, v in action.model_dump(exclude_none=True).items() if v is not None
         }
         allowed = set(task["allowed_fields"])
         extra_fields = submitted_fields - allowed
         if extra_fields:
             # Penalty: record score 0.0, advance index, return penalty observation
             self._state.per_ticket_scores.append(0.0)
-            self._state.history_entries.append({
-                "ticket_id": current_ticket.ticket_id,
-                "title": current_ticket.title,
-                "predicted": action.model_dump(exclude_none=True),
-                "score": 0.0,
-                "breakdown": {},
-                "penalty_reason": f"extra_fields: {sorted(extra_fields)}",
-            })
             self._state.step_count += 1
             self._state.current_ticket_index += 1
             is_done = self._state.current_ticket_index >= len(self._queue)
-            self._state.last_step_reward = 0.0
-            self._state.reward = 0.0
             self._state.done = is_done
             if is_done:
                 traj_reward = compute_trajectory_reward(
                     self._state.per_ticket_scores, len(self._queue), self._state.step_count
                 )
-                self._state.total_reward = traj_reward
-            return self._build_observation(task, done=is_done, reward=0.0)
         score, breakdown = grade_action(action, current_ticket, task_id)
         step_reward = compute_step_reward(score)
@@ -139,26 +163,27 @@ class HelpdeskTicketRoutingEnvironment(
                 len(self._queue),
                 self._state.step_count,
             )
-            self._state.total_reward = traj_reward
-            final_reward = traj_reward
         else:
             self._state.per_ticket_scores.append(score)
             self._state.step_count += 1
             self._state.current_ticket_index += 1
             final_reward = step_reward
-        history_entry = {
-            "ticket_id": current_ticket.ticket_id,
-            "title": current_ticket.title,
-            "predicted": action.model_dump(exclude_none=True),
-            "score": score,
-            "breakdown": breakdown,
-        }
         self._state.history_entries.append(history_entry)
         self._state.last_step_reward = final_reward
         self._state.reward = final_reward
         self._state.done = is_done
         return self._build_observation(task, done=is_done, reward=final_reward)
@@ -170,6 +195,188 @@ class HelpdeskTicketRoutingEnvironment(
     # Helpers
     # ------------------------------------------------------------------
     def _build_observation(
         self,
         task: dict,
@@ -181,33 +388,43 @@ class HelpdeskTicketRoutingEnvironment(
         if idx < queue_size:
             ticket = self._queue[idx]
-            ticket_view: dict[str, Any] = {
-                "ticket_id": ticket.ticket_id,
-                "title": ticket.title,
-                "requester": ticket.requester,
-                "description": ticket.description,
-            }
-            if ticket.ambiguity_note is not None:
-                ticket_view["ambiguity_note"] = ticket.ambiguity_note
-            if ticket.related_ticket_id is not None:
-                ticket_view["related_ticket_id"] = ticket.related_ticket_id
         else:
             ticket_view = None
         history = list(self._state.history_entries)
         return HelpdeskTicketObservation(
             done=done,
             reward=reward,
-            metadata={},
             task_id=task["id"],
             task_name=task["name"],
             instructions=task["instructions"],
             allowed_fields=list(task["allowed_fields"]),
             current_ticket=ticket_view,
             queue_size=queue_size,
-            # tickets_remaining: count of tickets not yet processed after this step
-            tickets_remaining=max(0, queue_size - idx),
             tickets_processed=idx,
             history=history,
         )

 QUEUE_SIZE_RANGE = (3, 5)
+AVAILABLE_TOOLS = ("lookup_related_ticket", "lookup_requester_history")
+FREE_INVESTIGATIONS_PER_TICKET = 1
+EXTRA_INVESTIGATION_COST = 0.02
+MAX_EXTRA_INVESTIGATION_PENALTY = 0.15
 def _coerce_optional_int(value: Any, field_name: str) -> Optional[int]:
     def __init__(self) -> None:
         super().__init__()
         self._dataset = load_dataset()
+        self._tickets_by_id = {ticket.ticket_id: ticket for ticket in self._dataset}
         self._rng = random.Random()
         self._queue: list[HelpdeskTicketRecord] = []
         self._state = HelpdeskTicketState()
     ) -> HelpdeskTicketObservation:
         normalized_seed = _coerce_optional_int(seed, "seed")
         task_id_value = _coerce_optional_int(kwargs.get("task_id", 1), "task_id")
+        queue_size_value = _coerce_optional_int(kwargs.get("queue_size"), "queue_size")
         task_id = 1 if task_id_value is None else task_id_value
         task = get_task_definition(task_id)
+        if queue_size_value is not None and queue_size_value < 1:
+            raise ValueError("queue_size must be >= 1")
         if normalized_seed is not None:
             self._rng.seed(normalized_seed)
+        if queue_size_value is None:
+            queue_size = self._rng.randint(*QUEUE_SIZE_RANGE)
+        else:
+            queue_size = min(queue_size_value, len(self._dataset))
         self._queue = self._rng.sample(self._dataset, min(queue_size, len(self._dataset)))
         self._state = HelpdeskTicketState(
             current_ticket_index=0,
             per_ticket_scores=[],
             total_reward=0.0,
+            investigation_budget_remaining=len(self._queue) * FREE_INVESTIGATIONS_PER_TICKET,
         )
         return self._build_observation(task)
         task_id = self._state.current_task_id
         task = get_task_definition(task_id)
+        if action.action_type == "investigate":
+            return self._handle_investigation_action(task, current_ticket, action, idx)
         submitted_fields = {
+            f
+            for f, v in action.model_dump(exclude_none=True).items()
+            if v is not None
+            and f not in {"action_type", "tool_name", "tool_target_ticket_id"}
         }
         allowed = set(task["allowed_fields"])
         extra_fields = submitted_fields - allowed
         if extra_fields:
             # Penalty: record score 0.0, advance index, return penalty observation
             self._state.per_ticket_scores.append(0.0)
+            self._state.history_entries.append(
+                self._build_history_entry(
+                    current_ticket,
+                    predicted=action.model_dump(exclude_none=True),
+                    score=0.0,
+                    breakdown={},
+                    queue_position=idx + 1,
+                    penalty_reason=f"extra_fields: {sorted(extra_fields)}",
+                )
+            )
             self._state.step_count += 1
             self._state.current_ticket_index += 1
             is_done = self._state.current_ticket_index >= len(self._queue)
             self._state.done = is_done
             if is_done:
                 traj_reward = compute_trajectory_reward(
                     self._state.per_ticket_scores, len(self._queue), self._state.step_count
                 )
+                final_reward = self._apply_episode_economics(traj_reward)
+                self._state.total_reward = final_reward
+            else:
+                final_reward = 0.0
+            self._state.last_step_reward = final_reward
+            self._state.reward = final_reward
+            self._state.last_tool_result = None
+            return self._build_observation(task, done=is_done, reward=final_reward)
         score, breakdown = grade_action(action, current_ticket, task_id)
         step_reward = compute_step_reward(score)
                 len(self._queue),
                 self._state.step_count,
             )
+            final_reward = self._apply_episode_economics(traj_reward)
+            self._state.total_reward = final_reward
         else:
             self._state.per_ticket_scores.append(score)
             self._state.step_count += 1
             self._state.current_ticket_index += 1
             final_reward = step_reward
+        history_entry = self._build_history_entry(
+            current_ticket,
+            predicted=action.model_dump(exclude_none=True),
+            score=score,
+            breakdown=breakdown,
+            queue_position=idx + 1,
+        )
         self._state.history_entries.append(history_entry)
         self._state.last_step_reward = final_reward
         self._state.reward = final_reward
         self._state.done = is_done
+        self._state.last_tool_result = None
         return self._build_observation(task, done=is_done, reward=final_reward)
     # Helpers
     # ------------------------------------------------------------------
+    def _apply_episode_economics(self, base_reward: float) -> float:
+        free_investigations = len(self._queue) * FREE_INVESTIGATIONS_PER_TICKET
+        extra_investigations = max(0, self._state.investigation_steps - free_investigations)
+        penalty = min(
+            MAX_EXTRA_INVESTIGATION_PENALTY,
+            extra_investigations * EXTRA_INVESTIGATION_COST,
+        )
+        return max(0.0, min(1.0, base_reward - penalty))
+    def _lookup_related_ticket(
+        self,
+        current_ticket: HelpdeskTicketRecord,
+        target_ticket_id: str | None,
+    ) -> dict[str, Any]:
+        target_id = target_ticket_id or current_ticket.related_ticket_id
+        if target_id is None:
+            return {
+                "tool_name": "lookup_related_ticket",
+                "found": False,
+                "message": "Current ticket has no linked related_ticket_id.",
+            }
+        related_ticket = self._tickets_by_id.get(target_id)
+        if related_ticket is None:
+            return {
+                "tool_name": "lookup_related_ticket",
+                "found": False,
+                "message": f"Ticket {target_id!r} was not found in the dataset.",
+            }
+        return {
+            "tool_name": "lookup_related_ticket",
+            "found": True,
+            "ticket": {
+                "ticket_id": related_ticket.ticket_id,
+                "title": related_ticket.title,
+                "requester": related_ticket.requester,
+                "description": related_ticket.description,
+                "issue_type": related_ticket.issue_type,
+                "priority": related_ticket.priority,
+                "assignment_group": related_ticket.assignment_group,
+                "resolution_action": related_ticket.resolution_action,
+            },
+        }
+    def _lookup_requester_history(self, current_ticket: HelpdeskTicketRecord) -> dict[str, Any]:
+        matches = [
+            {
+                "ticket_id": ticket.ticket_id,
+                "title": ticket.title,
+                "issue_type": ticket.issue_type,
+                "priority": ticket.priority,
+                "assignment_group": ticket.assignment_group,
+                "resolution_action": ticket.resolution_action,
+            }
+            for ticket in self._dataset
+            if ticket.requester == current_ticket.requester
+            and ticket.ticket_id != current_ticket.ticket_id
+        ]
+        return {
+            "tool_name": "lookup_requester_history",
+            "found": bool(matches),
+            "requester": current_ticket.requester,
+            "matches": matches,
+        }
+    def _run_investigation_tool(
+        self,
+        current_ticket: HelpdeskTicketRecord,
+        tool_name: str,
+        target_ticket_id: str | None,
+    ) -> dict[str, Any]:
+        if tool_name == "lookup_related_ticket":
+            return self._lookup_related_ticket(current_ticket, target_ticket_id)
+        if tool_name == "lookup_requester_history":
+            return self._lookup_requester_history(current_ticket)
+        raise ValueError(f"Unsupported tool_name: {tool_name}")
+    def _handle_investigation_action(
+        self,
+        task: dict,
+        current_ticket: HelpdeskTicketRecord,
+        action: HelpdeskTicketAction,
+        idx: int,
+    ) -> HelpdeskTicketObservation:
+        if action.tool_name is None:
+            raise ValueError("Investigate actions require tool_name")
+        submitted_fields = {
+            field
+            for field in ("issue_type", "priority", "assignment_group", "resolution_action")
+            if getattr(action, field) is not None
+        }
+        if submitted_fields:
+            raise ValueError(
+                "Investigate actions cannot include submit fields: "
+                f"{sorted(submitted_fields)}"
+            )
+        tool_result = self._run_investigation_tool(
+            current_ticket,
+            action.tool_name,
+            action.tool_target_ticket_id,
+        )
+        self._state.step_count += 1
+        self._state.investigation_steps += 1
+        self._state.investigation_budget_remaining = max(
+            0,
+            self._state.investigation_budget_remaining - 1,
+        )
+        self._state.last_tool_result = tool_result
+        self._state.last_step_reward = 0.0
+        self._state.reward = 0.0
+        self._state.done = False
+        self._state.history_entries.append(
+            self._build_history_entry(
+                current_ticket,
+                predicted=action.model_dump(exclude_none=True),
+                score=0.0,
+                breakdown={},
+                queue_position=idx + 1,
+                tool_result=tool_result,
+            )
+        )
+        return self._build_observation(task, done=False, reward=0.0)
+    def _build_ticket_view(self, ticket: HelpdeskTicketRecord) -> dict[str, Any]:
+        ticket_view: dict[str, Any] = {
+            "ticket_id": ticket.ticket_id,
+            "title": ticket.title,
+            "requester": ticket.requester,
+            "description": ticket.description,
+        }
+        if ticket.ambiguity_note is not None:
+            ticket_view["ambiguity_note"] = ticket.ambiguity_note
+        if ticket.related_ticket_id is not None:
+            ticket_view["related_ticket_id"] = ticket.related_ticket_id
+            related_ticket = self._tickets_by_id.get(ticket.related_ticket_id)
+            if related_ticket is not None:
+                ticket_view["related_ticket_preview"] = {
+                    "ticket_id": related_ticket.ticket_id,
+                    "title": related_ticket.title,
+                    "requester": related_ticket.requester,
+                    "description": related_ticket.description,
+                }
+        return ticket_view
+    def _build_history_entry(
+        self,
+        ticket: HelpdeskTicketRecord,
+        *,
+        predicted: dict[str, Any],
+        score: float,
+        breakdown: dict[str, float],
+        queue_position: int,
+        penalty_reason: str | None = None,
+        tool_result: dict[str, Any] | None = None,
+    ) -> dict[str, Any]:
+        history_entry: dict[str, Any] = {
+            "ticket_id": ticket.ticket_id,
+            "title": ticket.title,
+            "requester": ticket.requester,
+            "predicted": predicted,
+            "score": score,
+            "breakdown": breakdown,
+            "queue_position": queue_position,
+        }
+        if ticket.ambiguity_note is not None:
+            history_entry["ambiguity_note"] = ticket.ambiguity_note
+        if ticket.related_ticket_id is not None:
+            history_entry["related_ticket_id"] = ticket.related_ticket_id
+            related_ticket = self._tickets_by_id.get(ticket.related_ticket_id)
+            if related_ticket is not None:
+                history_entry["related_ticket_preview"] = {
+                    "ticket_id": related_ticket.ticket_id,
+                    "title": related_ticket.title,
+                    "requester": related_ticket.requester,
+                    "description": related_ticket.description,
+                }
+        if penalty_reason is not None:
+            history_entry["penalty_reason"] = penalty_reason
+        if tool_result is not None:
+            history_entry["tool_result"] = tool_result
+        return history_entry
     def _build_observation(
         self,
         task: dict,
         if idx < queue_size:
             ticket = self._queue[idx]
+            ticket_view = self._build_ticket_view(ticket)
+            queue_position = idx + 1
         else:
             ticket_view = None
+            queue_position = 0
         history = list(self._state.history_entries)
+        tickets_remaining = max(0, queue_size - idx)
+        tickets_after_current = max(
+            0,
+            tickets_remaining - (1 if ticket_view is not None else 0),
+        )
         return HelpdeskTicketObservation(
             done=done,
             reward=reward,
+            metadata={
+                "queue_position": queue_position,
+                "tickets_remaining_includes_current": ticket_view is not None,
+                "has_ambiguity_note": bool(ticket_view and ticket_view.get("ambiguity_note")),
+                "has_related_ticket_context": bool(
+                    ticket_view and ticket_view.get("related_ticket_preview")
+                ),
+                "action_mode": "investigate_or_submit",
+            },
             task_id=task["id"],
             task_name=task["name"],
             instructions=task["instructions"],
             allowed_fields=list(task["allowed_fields"]),
+            available_tools=list(AVAILABLE_TOOLS),
+            investigation_budget_remaining=self._state.investigation_budget_remaining,
+            last_tool_result=self._state.last_tool_result,
             current_ticket=ticket_view,
             queue_size=queue_size,
+            tickets_remaining=tickets_remaining,
+            tickets_after_current=tickets_after_current,
             tickets_processed=idx,
+            queue_position=queue_position,
             history=history,
         )
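The investigation penalty applied by `_apply_episode_economics` is easiest to follow with numbers. The sketch below copies the constants from the diff and restates the method as a free function; the signature (passing `queue_len` and `investigation_steps` explicitly) and the base reward of 0.8 are assumptions for illustration.

```python
FREE_INVESTIGATIONS_PER_TICKET = 1
EXTRA_INVESTIGATION_COST = 0.02
MAX_EXTRA_INVESTIGATION_PENALTY = 0.15


def apply_episode_economics(
    base_reward: float, queue_len: int, investigation_steps: int
) -> float:
    """Subtract a capped penalty for investigations beyond the free allotment."""
    free_investigations = queue_len * FREE_INVESTIGATIONS_PER_TICKET
    extra_investigations = max(0, investigation_steps - free_investigations)
    penalty = min(
        MAX_EXTRA_INVESTIGATION_PENALTY,
        extra_investigations * EXTRA_INVESTIGATION_COST,
    )
    # Clamp so the final trajectory reward stays within [0.0, 1.0].
    return max(0.0, min(1.0, base_reward - penalty))


# Queue of 3 tickets, hypothetical base trajectory reward of 0.8.
for steps in (3, 5, 20):
    print(steps, round(apply_episode_economics(0.8, 3, steps), 4))
```

With three tickets, the first three investigations are free, two extra ones cost 2 × 0.02 = 0.04, and seventeen extra ones would cost 0.34 but are capped at 0.15. The penalty is only applied to the final trajectory reward; each investigate step itself returns a reward of 0.0.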

server/tasks.py CHANGED

@@ -13,7 +13,8 @@ TASKS = {
         "name": "Issue Type Classification",
         "difficulty": "easy",
         "instructions": (
-            "Read the ticket and select the single best IT issue type."
         ),
         "allowed_fields": ["issue_type"],
     },
@@ -23,7 +24,8 @@ TASKS = {
         "difficulty": "medium",
         "instructions": (
             "Read the ticket, select the best IT issue type, and estimate the "
-            "correct operational priority."
         ),
         "allowed_fields": ["issue_type", "priority"],
     },
@@ -33,7 +35,9 @@ TASKS = {
         "difficulty": "hard",
         "instructions": (
             "Perform full helpdesk routing by selecting the best issue type, "
-            "priority, assignment group, and resolution action for the ticket."
         ),
         "allowed_fields": [
             "issue_type",

         "name": "Issue Type Classification",
         "difficulty": "easy",
         "instructions": (
+            "Read the ticket and select the single best IT issue type. "
+            "You may investigate first, then submit a final routing answer."
         ),
         "allowed_fields": ["issue_type"],
     },
         "difficulty": "medium",
         "instructions": (
             "Read the ticket, select the best IT issue type, and estimate the "
+            "correct operational priority. If the observation includes ambiguity "
+            "or follow-up context, use it. You may investigate before you submit."
         ),
         "allowed_fields": ["issue_type", "priority"],
     },
         "difficulty": "hard",
         "instructions": (
             "Perform full helpdesk routing by selecting the best issue type, "
+            "priority, assignment group, and resolution action for the ticket. "
+            "Use any ambiguity notes or related-ticket previews when present. "
+            "You may investigate with tools before you submit the final action."
         ),
         "allowed_fields": [
             "issue_type",

tests/test_competitive_upgrade.py CHANGED

@@ -81,7 +81,11 @@ def _heuristic_action(obs: HelpdeskTicketObservation) -> HelpdeskTicketAction:
 # 9.1 — Inference single-task mode
 # ---------------------------------------------------------------------------
-def _get_tasks_to_run_impl(task_id_env: str | None, available_tasks: dict) -> list[int]:
     """
     Standalone re-implementation of inference.get_tasks_to_run() logic for testing.
@@ -94,9 +98,13 @@ def _get_tasks_to_run_impl(task_id_env: str | None, available_tasks: dict) -> li
         except ValueError:
             raise SystemExit(1)
         if task_id not in available_tasks:
-            return []
         return [task_id]
-    return list(TASK_IDS)
 class TestInferenceSingleTaskMode(unittest.TestCase):
@@ -107,14 +115,19 @@ class TestInferenceSingleTaskMode(unittest.TestCase):
         result = _get_tasks_to_run_impl("1", available)
         self.assertEqual(result, [1])
-    def test_task_id_set_to_unavailable_id_returns_empty_list(self) -> None:
         available = {1: {}, 2: {}, 3: {}}
-        result = _get_tasks_to_run_impl("999", available)
-        self.assertEqual(result, [])
-    def test_task_id_unset_returns_all_task_ids(self) -> None:
         available = {1: {}, 2: {}, 3: {}}
         result = _get_tasks_to_run_impl(None, available)
         self.assertEqual(sorted(result), sorted(list(TASK_IDS)))
     def test_task_id_set_to_2_returns_only_task_2(self) -> None:
@@ -360,6 +373,271 @@ class TestAmbiguityNoteInObservation(unittest.TestCase):
         self.assertIn("ambiguity_note", obs.current_ticket)
 # ---------------------------------------------------------------------------
 # 9.7 — Dataset has >= 3 non-default routing tickets
 # ---------------------------------------------------------------------------

 # 9.1 — Inference single-task mode
 # ---------------------------------------------------------------------------
+def _get_tasks_to_run_impl(
+    task_id_env: str | None,
+    available_tasks: dict,
+    run_all_tasks: bool = False,
+) -> list[int]:
     """
     Standalone re-implementation of inference.get_tasks_to_run() logic for testing.
         except ValueError:
             raise SystemExit(1)
         if task_id not in available_tasks:
+            raise SystemExit(1)
         return [task_id]
+    if run_all_tasks:
+        return sorted(available_tasks)
+    if not available_tasks:
+        return []
+    return [sorted(available_tasks)[0]]
 class TestInferenceSingleTaskMode(unittest.TestCase):
         result = _get_tasks_to_run_impl("1", available)
         self.assertEqual(result, [1])
+    def test_task_id_set_to_unavailable_id_exits(self) -> None:
         available = {1: {}, 2: {}, 3: {}}
+        with self.assertRaises(SystemExit):
+            _get_tasks_to_run_impl("999", available)
+    def test_task_id_unset_defaults_to_first_available_task(self) -> None:
         available = {1: {}, 2: {}, 3: {}}
         result = _get_tasks_to_run_impl(None, available)
+        self.assertEqual(result, [1])
+    def test_run_all_tasks_override_returns_all_task_ids(self) -> None:
+        available = {1: {}, 2: {}, 3: {}}
+        result = _get_tasks_to_run_impl(None, available, run_all_tasks=True)
         self.assertEqual(sorted(result), sorted(list(TASK_IDS)))
     def test_task_id_set_to_2_returns_only_task_2(self) -> None:
         self.assertIn("ambiguity_note", obs.current_ticket)
+class TestRelatedTicketPreviewInObservation(unittest.TestCase):
+    """Follow-up tickets expose a lightweight preview of the linked ticket."""
+    def _reset_linked_ticket_env(self):
+        from unittest.mock import patch
+        dataset = load_dataset()
+        ticket = next((t for t in dataset if t.related_ticket_id is not None), None)
+        self.assertIsNotNone(ticket, "No follow-up ticket found in dataset")
+        related = next(
+            (t for t in dataset if t.ticket_id == ticket.related_ticket_id),
+            None,
+        )
+        self.assertIsNotNone(related, "Linked ticket missing from dataset")
+        env = _make_env()
+        with patch.object(env, "_dataset", [ticket]):
+            with patch.object(
+                env,
+                "_tickets_by_id",
+                {ticket.ticket_id: ticket, related.ticket_id: related},
+            ):
+                obs = env.reset(seed=0, task_id=3, queue_size=1)
+        return env, obs, related
+    def test_related_ticket_preview_present_when_ticket_has_link(self) -> None:
+        env, obs, related = self._reset_linked_ticket_env()
+        self.assertIsNotNone(obs.current_ticket)
+        self.assertIn("related_ticket_preview", obs.current_ticket)
+        self.assertEqual(
+            obs.current_ticket["related_ticket_preview"]["ticket_id"],
+            related.ticket_id,
+        )
+        self.assertEqual(
+            obs.current_ticket["related_ticket_preview"]["title"],
+            related.title,
+        )
+    def test_history_keeps_related_ticket_preview_after_step(self) -> None:
+        env, obs, related = self._reset_linked_ticket_env()
+        next_obs = env.step(_heuristic_action(obs))
+        self.assertGreaterEqual(len(next_obs.history), 1)
+        self.assertIn("related_ticket_preview", next_obs.history[0])
+        self.assertEqual(
+            next_obs.history[0]["related_ticket_preview"]["ticket_id"],
+            related.ticket_id,
+        )
+class TestObservationQueueContext(unittest.TestCase):
+    """Observation includes clearer queue-position counters."""
+    def test_reset_sets_queue_position_and_after_current_counts(self) -> None:
+        env = _make_env()
+        obs = env.reset(seed=0, task_id=1, queue_size=3)
+        self.assertEqual(obs.queue_position, 1)
+        self.assertEqual(obs.tickets_remaining, 3)
+        self.assertEqual(obs.tickets_after_current, 2)
+    def test_step_updates_queue_position_and_after_current_counts(self) -> None:
+        env = _make_env()
+        obs = env.reset(seed=0, task_id=1, queue_size=3)
+        obs = env.step(_heuristic_action(obs))
+        if obs.done:
+            self.assertEqual(obs.queue_position, 0)
+            self.assertEqual(obs.tickets_after_current, 0)
+        else:
+            self.assertEqual(obs.queue_position, 2)
+            self.assertEqual(obs.tickets_remaining, 2)
+            self.assertEqual(obs.tickets_after_current, 1)
+# ---------------------------------------------------------------------------
+# 9.6b — investigation actions and queue economics
+# ---------------------------------------------------------------------------
+class TestInvestigationActions(unittest.TestCase):
+    """Minimal tool-assisted investigate/submit flow works and stays backwards compatible."""
+    def _make_linked_env(self):
+        from unittest.mock import patch
+        dataset = load_dataset()
+        ticket = next((t for t in dataset if t.related_ticket_id is not None), None)
+        self.assertIsNotNone(ticket, "No follow-up ticket found in dataset")
+        related = next(
+            (t for t in dataset if t.ticket_id == ticket.related_ticket_id),
+            None,
+        )
+        self.assertIsNotNone(related, "Linked ticket missing from dataset")
+        env = _make_env()
+        patch_dataset = patch.object(env, "_dataset", [ticket])
+        patch_lookup = patch.object(
+            env,
+            "_tickets_by_id",
+            {ticket.ticket_id: ticket, related.ticket_id: related},
+        )
+        patch_dataset.start()
+        patch_lookup.start()
+        self.addCleanup(patch_dataset.stop)
+        self.addCleanup(patch_lookup.stop)
+        obs = env.reset(seed=0, task_id=3, queue_size=1)
+        return env, obs, ticket, related
+    def test_investigation_action_does_not_advance_queue(self) -> None:
+        env, obs, ticket, related = self._make_linked_env()
+        investigate = HelpdeskTicketAction(
+            action_type="investigate",
+            tool_name="lookup_related_ticket",
+            tool_target_ticket_id=ticket.related_ticket_id,
+        )
+        obs2 = env.step(investigate)
+        self.assertFalse(obs2.done)
+        self.assertEqual(obs2.tickets_processed, 0)
+        self.assertEqual(obs2.queue_position, 1)
+        self.assertIsNotNone(obs2.last_tool_result)
+        self.assertTrue(obs2.last_tool_result["found"])
+        self.assertEqual(
+            obs2.last_tool_result["ticket"]["ticket_id"],
+            related.ticket_id,
+        )
+    def test_submit_after_investigation_completes_episode(self) -> None:
+        env, obs, ticket, related = self._make_linked_env()
+        env.step(
+            HelpdeskTicketAction(
+                action_type="investigate",
+                tool_name="lookup_related_ticket",
+                tool_target_ticket_id=ticket.related_ticket_id,
+            )
+        )
+        final_obs = env.step(
+            HelpdeskTicketAction(
+                issue_type=ticket.issue_type,
+                priority=ticket.priority,
+                assignment_group=ticket.assignment_group,
+                resolution_action=ticket.resolution_action,
+            )
+        )
+        self.assertTrue(final_obs.done)
+        self.assertEqual(final_obs.tickets_processed, 1)
+        self.assertGreaterEqual(final_obs.reward, 0.0)
+        self.assertLessEqual(final_obs.reward, 1.0)
+    def test_requester_history_tool_returns_matches_for_same_requester(self) -> None:
+        from unittest.mock import patch
+        dataset = load_dataset()
+        requester_counts: dict[str, int] = {}
+        for ticket in dataset:
+            requester_counts[ticket.requester] = requester_counts.get(ticket.requester, 0) + 1
+        target_requester = next(
+            (requester for requester, count in requester_counts.items() if count >= 2),
+            None,
+        )
+        self.assertIsNotNone(target_requester, "Dataset has no repeated requester")
+        duplicate_requester_group = [
+            ticket for ticket in dataset if ticket.requester == target_requester
+        ]
+        self.assertGreaterEqual(len(duplicate_requester_group), 2)
+        env = _make_env()
+        with patch.object(env, "_dataset", duplicate_requester_group):
+            with patch.object(
+                env,
+                "_tickets_by_id",
+                {ticket.ticket_id: ticket for ticket in duplicate_requester_group},
+            ):
+                obs = env.reset(seed=0, task_id=2, queue_size=1)
+        obs2 = env.step(
+            HelpdeskTicketAction(
+                action_type="investigate",
+                tool_name="lookup_requester_history",
+            )
+        )
+        self.assertIsNotNone(obs2.last_tool_result)
+        self.assertEqual(obs2.last_tool_result["tool_name"], "lookup_requester_history")
+        self.assertTrue(obs2.last_tool_result["found"])
+        self.assertGreaterEqual(len(obs2.last_tool_result["matches"]), 1)
+class TestQueueEconomics(unittest.TestCase):
+    """Free investigations are allowed, but excessive investigation gets a queue-level penalty."""
+    def test_extra_investigations_reduce_final_reward(self) -> None:
+        from unittest.mock import patch
+        dataset = load_dataset()
+        ticket = dataset[0]
+        env = _make_env()
+        with patch.object(env, "_dataset", [ticket]):
+            with patch.object(env, "_tickets_by_id", {ticket.ticket_id: ticket}):
+                obs = env.reset(seed=0, task_id=1, queue_size=1)
+        obs = env.step(
+            HelpdeskTicketAction(
+                action_type="investigate",
+                tool_name="lookup_requester_history",
+            )
+        )
+        self.assertEqual(env.state.investigation_steps, 1)
+        self.assertEqual(env.state.investigation_budget_remaining, 0)
+        obs = env.step(
+            HelpdeskTicketAction(
+                action_type="investigate",
+                tool_name="lookup_requester_history",
+            )
+        )
+        self.assertEqual(env.state.investigation_steps, 2)
+        final_obs = env.step(HelpdeskTicketAction(issue_type=ticket.issue_type))
+        self.assertTrue(final_obs.done)
+        self.assertAlmostEqual(final_obs.reward, 0.98, places=9)
+class TestTerminalInvalidActionFinalReward(unittest.TestCase):
+    """Terminal invalid submit actions should still return the queue-level final reward."""
+    def test_last_invalid_submit_returns_trajectory_reward_not_zero(self) -> None:
+        from unittest.mock import patch
+        dataset = load_dataset()
+        first = dataset[0]
+        second = dataset[1]
+        env = _make_env()
+        with patch.object(env, "_dataset", [first, second]):
+            with patch.object(
+                env,
+                "_tickets_by_id",
+                {first.ticket_id: first, second.ticket_id: second},
+            ):
+                obs = env.reset(seed=0, task_id=1, queue_size=2)
+        tickets_by_id = {first.ticket_id: first, second.ticket_id: second}
+        current = tickets_by_id[obs.current_ticket["ticket_id"]]
+        obs = env.step(HelpdeskTicketAction(issue_type=current.issue_type))
+        self.assertFalse(obs.done)
+        current = tickets_by_id[obs.current_ticket["ticket_id"]]
+        final_obs = env.step(
+            HelpdeskTicketAction(
+                issue_type=current.issue_type,
+                priority="medium",
+            )
+        )
+        self.assertTrue(final_obs.done)
+        self.assertAlmostEqual(final_obs.reward, 0.5, places=9)
+        self.assertAlmostEqual(env.state.total_reward, 0.5, places=9)
+        self.assertAlmostEqual(env.state.reward or 0.0, 0.5, places=9)
 # ---------------------------------------------------------------------------
 # 9.7 — Dataset has >= 3 non-default routing tickets
 # ---------------------------------------------------------------------------
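
Reviewer note: the `0.98` asserted in `test_extra_investigations_reduce_final_reward` is consistent with one free investigation per queued ticket and a flat deduction for each extra one. The sketch below works through that arithmetic. The function name `apply_investigation_penalty`, the free budget of one step per ticket, and the `0.02` penalty are all assumptions inferred from the test values, not taken from `server/environment.py`.

```python
def apply_investigation_penalty(
    base_reward: float,
    investigation_steps: int,
    queue_size: int,
    free_per_ticket: int = 1,         # assumption: budget reads 0 after 1 step on a 1-ticket queue
    penalty_per_extra: float = 0.02,  # assumption: inferred from the 0.98 assertion
) -> float:
    """Hypothetical queue-level penalty for investigations beyond the free budget."""
    free_budget = free_per_ticket * queue_size
    extra_steps = max(0, investigation_steps - free_budget)
    penalized = base_reward - penalty_per_extra * extra_steps
    # Keep the result inside the unit interval that the reward tests assert.
    return min(1.0, max(0.0, penalized))


# One ticket, correct classification, two investigations (one free, one extra).
reward = apply_investigation_penalty(1.0, investigation_steps=2, queue_size=1)
```

Under these assumptions a single investigation on a one-ticket queue leaves the reward untouched, which matches the test's description of free investigations being allowed.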

tests/test_environment_smoke.py CHANGED

@@ -101,6 +101,8 @@ class TestResetReturnsValidObservation(unittest.TestCase):
         self.assertIsNotNone(obs.current_ticket)
         self.assertGreater(obs.queue_size, 0)
         self.assertEqual(obs.tickets_processed, 0)
 class TestResetAllTaskIds(unittest.TestCase):
@@ -116,6 +118,7 @@ class TestResetAllTaskIds(unittest.TestCase):
         self.assertEqual(obs.tickets_processed, 0)
         # allowed_fields must match the task definition
         self.assertEqual(obs.allowed_fields, TASKS[task_id]["allowed_fields"])
     def test_reset_task2(self) -> None:
         env = _make_env()
@@ -142,6 +145,10 @@ class TestStepAdvancesTicketsProcessed(unittest.TestCase):
         obs2 = env.step(action)
         self.assertEqual(obs2.tickets_processed, 1)
     def test_step_reward_in_unit_interval(self) -> None:
         from models import HelpdeskTicketAction

         self.assertIsNotNone(obs.current_ticket)
         self.assertGreater(obs.queue_size, 0)
         self.assertEqual(obs.tickets_processed, 0)
+        self.assertEqual(obs.queue_position, 1)
+        self.assertEqual(obs.tickets_after_current, max(0, obs.queue_size - 1))
 class TestResetAllTaskIds(unittest.TestCase):
         self.assertEqual(obs.tickets_processed, 0)
         # allowed_fields must match the task definition
         self.assertEqual(obs.allowed_fields, TASKS[task_id]["allowed_fields"])
+        self.assertEqual(obs.queue_position, 1)
     def test_reset_task2(self) -> None:
         env = _make_env()
         obs2 = env.step(action)
         self.assertEqual(obs2.tickets_processed, 1)
+        if obs2.done:
+            self.assertEqual(obs2.queue_position, 0)
+        else:
+            self.assertEqual(obs2.queue_position, 2)
     def test_step_reward_in_unit_interval(self) -> None:
         from models import HelpdeskTicketAction
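
Reviewer note: the new smoke-test assertions pin down how the queue counters relate to `queue_size` and `tickets_processed`. The sketch below reproduces those relationships in one place. `QueueCounters` and `queue_counters` are hypothetical names, and the value of `tickets_remaining` once the episode is done is an assumption, since the tests only check `queue_position` and `tickets_after_current` in that case.

```python
from typing import NamedTuple


class QueueCounters(NamedTuple):
    queue_position: int
    tickets_remaining: int
    tickets_after_current: int


def queue_counters(queue_size: int, tickets_processed: int) -> QueueCounters:
    """Hypothetical helper reproducing the counter values the tests assert."""
    remaining = max(0, queue_size - tickets_processed)
    if remaining == 0:
        # Episode finished: there is no current ticket, so the position is 0.
        return QueueCounters(0, 0, 0)
    return QueueCounters(
        queue_position=tickets_processed + 1,  # 1-based position of the current ticket
        tickets_remaining=remaining,           # counts the current ticket too
        tickets_after_current=remaining - 1,
    )
```

For a three-ticket queue this gives `(1, 3, 2)` at reset and `(2, 2, 1)` after the first submit, the same values asserted in `TestObservationQueueContext`.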

tests/test_extra_fields_penalty.py CHANGED

@@ -151,32 +151,31 @@ class TestExtraFieldsPenalty(unittest.TestCase):
         self.assertIsInstance(obs, HelpdeskTicketObservation)
     def test_extra_fields_done_flag_set_correctly_on_last_ticket(self) -> None:
-        """When the penalty step is on the last ticket, done must be True."""
         env = _make_env()
-        # Use a queue of size 1 by controlling the seed — find a seed that gives queue_size=1
-        # Instead, exhaust all but the last ticket normally, then trigger penalty on last
         obs = env.reset(seed=42, task_id=1)
         queue_size = obs.queue_size
         # Process all tickets except the last one normally
         for _ in range(queue_size - 1):
-            allowed = obs.allowed_fields
-            action_kwargs = {}
-            if "issue_type" in allowed:
-                action_kwargs["issue_type"] = ISSUE_TYPES[0]
-            if "priority" in allowed:
-                action_kwargs["priority"] = PRIORITIES[0]
-            obs = env.step(HelpdeskTicketAction(**action_kwargs))
         # Now trigger penalty on the last ticket
         action = HelpdeskTicketAction(
-            issue_type=ISSUE_TYPES[0],
             assignment_group=ASSIGNMENT_GROUPS[0],  # extra field
         )
         final_obs = env.step(action)
         self.assertTrue(final_obs.done)
-        self.assertEqual(final_obs.reward, 0.0)
 if __name__ == "__main__":

         self.assertIsInstance(obs, HelpdeskTicketObservation)
     def test_extra_fields_done_flag_set_correctly_on_last_ticket(self) -> None:
+        """When the penalty step lands on the last ticket, done must be True and the reward must be the episode-level score."""
         env = _make_env()
         obs = env.reset(seed=42, task_id=1)
         queue_size = obs.queue_size
+        tickets_by_id = env._tickets_by_id  # noqa: SLF001 - test-only inspection
         # Process all tickets except the last one normally
         for _ in range(queue_size - 1):
+            current_ticket_id = obs.current_ticket["ticket_id"]
+            current_ticket = tickets_by_id[current_ticket_id]
+            obs = env.step(HelpdeskTicketAction(issue_type=current_ticket.issue_type))
         # Now trigger penalty on the last ticket
+        current_ticket_id = obs.current_ticket["ticket_id"]
+        current_ticket = tickets_by_id[current_ticket_id]
         action = HelpdeskTicketAction(
+            issue_type=current_ticket.issue_type,
             assignment_group=ASSIGNMENT_GROUPS[0],  # extra field
         )
         final_obs = env.step(action)
         self.assertTrue(final_obs.done)
+        expected_reward = (queue_size - 1) / queue_size
+        self.assertAlmostEqual(final_obs.reward, expected_reward, places=9)
+        self.assertAlmostEqual(env.state.total_reward, expected_reward, places=9)
 if __name__ == "__main__":
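
Reviewer note: the updated assertion `(queue_size - 1) / queue_size`, together with the `0.5` expected in `TestTerminalInvalidActionFinalReward`, suggests the terminal reward is the mean of the per-ticket scores, with an invalid submit scoring zero. The sketch below shows that aggregation. `episode_reward` is a hypothetical name, and plain averaging is an inference from these two tests rather than something confirmed against the environment code.

```python
def episode_reward(per_ticket_scores: list[float]) -> float:
    """Hypothetical aggregation: mean of per-ticket scores, 0.0 for an empty queue."""
    if not per_ticket_scores:
        return 0.0
    return sum(per_ticket_scores) / len(per_ticket_scores)


# Two-ticket queue: first ticket correct, last submit invalid (scored 0.0).
two_ticket = episode_reward([1.0, 0.0])

# Four-ticket queue: three correct tickets, then a penalized final submit.
four_ticket = episode_reward([1.0, 1.0, 1.0, 0.0])
```

With a queue of four this yields `(4 - 1) / 4`, the same form as `expected_reward` in the test above.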

tests/test_inference_unit.py CHANGED

@@ -163,6 +163,22 @@ class InferenceUnitTests(unittest.TestCase):
             )
         )
 if __name__ == "__main__":
     unittest.main()

             )
         )
+    def test_default_task_selection_runs_single_first_task(self) -> None:
+        inference = _load_inference_module()
+        self.assertEqual(
+            inference.get_tasks_to_run({1: {}, 2: {}, 3: {}}),
+            [1],
+        )
+    def test_run_all_tasks_override_keeps_local_batch_mode_available(self) -> None:
+        inference = _load_inference_module({"RUN_ALL_TASKS": "1"})
+        self.assertEqual(
+            inference.get_tasks_to_run({1: {}, 2: {}, 3: {}}),
+            [1, 2, 3],
+        )
 if __name__ == "__main__":
     unittest.main()
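
Reviewer note: the new tests describe a three-way selection rule: an explicit task ID wins, `RUN_ALL_TASKS` enables batch mode, and otherwise only the first available task runs. The sketch below restates that rule as one self-contained function. `select_tasks` is a hypothetical name, the `TASK_ID` variable name is an assumption, and treating only the string `"1"` as enabling `RUN_ALL_TASKS` is also an assumption; the real `inference.get_tasks_to_run` may parse these differently.

```python
from typing import Mapping, Optional


def select_tasks(env: Mapping[str, str], available_tasks: Mapping[int, dict]) -> list[int]:
    """Hypothetical restatement of the task-selection rule exercised by the tests."""
    raw_task_id: Optional[str] = env.get("TASK_ID")  # assumption: variable name
    if raw_task_id is not None:
        try:
            task_id = int(raw_task_id)
        except ValueError:
            # A non-numeric task ID is a configuration error.
            raise SystemExit(1) from None
        if task_id not in available_tasks:
            raise SystemExit(1)
        return [task_id]
    if env.get("RUN_ALL_TASKS") == "1":  # assumption: only "1" enables batch mode
        return sorted(available_tasks)
    if not available_tasks:
        return []
    # Default: run only the lowest-numbered task.
    return [min(available_tasks)]


tasks = {1: {}, 2: {}, 3: {}}
default_selection = select_tasks({}, tasks)
batch_selection = select_tasks({"RUN_ALL_TASKS": "1"}, tasks)
```

Passing the environment as a mapping keeps the function easy to test without patching `os.environ`; a caller could pass `os.environ` directly.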