Roopalgn committed on
Commit
d378e5d
·
1 Parent(s): d6d9493

Strengthen hard-task investigation and grading

README.md CHANGED
@@ -205,9 +205,10 @@ Available tools:
205
 
206
  Hard-task investigation behavior:
207
 
208
- - some ambiguous and non-default-routing tickets start with redacted descriptions
209
  - linked-ticket previews and internal routing notes stay hidden until the matching tool is used
210
- - useful investigation steps return a small positive shaping reward
 
211
  - premature hard-task submission can incur a shaping penalty even when the visible text looks plausible
212
  - terminal `rubric_reward` remains the objective evaluation signal, while per-step `reward` is the denser training signal
213
 
@@ -215,8 +216,8 @@ Per-field behavior:
215
 
216
  - `issue_type`: exact match, with a few near-miss partial-credit pairs
217
  - `priority`: exact match or proximity credit
218
- - `assignment_group`: exact match
219
- - `resolution_action`: exact match
220
 
221
  Task weights:
222
 
@@ -236,7 +237,7 @@ The result is clamped to `[0.0, 1.0]`.
236
 
237
  Step reward is lightly milestone-shaped: high per-ticket scores get a small bonus and very low scores get a small penalty before the final clamp.
238
 
239
- Final reward also includes a tiny queue-economics penalty only when the agent exceeds the free investigation budget. One investigation per queued ticket is free; extra investigation steps reduce the final reward slightly.
240
 
241
  To make the environment more RL-friendly, each observation now also surfaces structured reward telemetry:
242
 
@@ -246,10 +247,10 @@ To make the environment more RL-friendly, each observation now also surfaces str
246
 
247
  ## Grounded Scoring
248
 
249
- The grader is intentionally not fuzzy by default.
250
 
251
  - exact match is the dominant path for every field
252
- - `assignment_group` and `resolution_action` are exact-match only
253
  - `priority` only gets proximity credit from the declared table in `server/grader.py`
254
  - `issue_type` only gets partial credit for a small declared similarity map
255
  - wrong labels outside those explicit maps score `0.0`
@@ -363,7 +364,7 @@ curl http://localhost:7860/tasks
363
 
364
  ## Running The Baseline Inference Script
365
 
366
- The baseline script supports single-task evaluator mode by default, plus an explicit local batch override.
367
 
368
  ### Heuristic mode
369
 
@@ -373,7 +374,7 @@ If no LLM credentials are set, it uses a keyword-based ticket router:
373
  python inference.py
374
  ```
375
 
376
- By default that runs exactly one task and emits exactly one `[START] ... [END]` block. To target a specific task:
377
 
378
  ```bash
379
  TASK_ID=3 python inference.py
@@ -401,6 +402,7 @@ Optional target:
401
  - `SEED`
402
  - `TASK_ID`
403
  - `RUN_ALL_TASKS`
 
404
 
405
  To reproduce the multi-task local benchmark sweep:
406
 
@@ -420,16 +422,13 @@ Validated locally:
420
  - `/reset`
421
  - heuristic `inference.py` run across all 3 tasks with `RUN_ALL_TASKS=1`
422
 
423
- Current local heuristic results:
424
 
425
- | Task | Result |
426
- |------|--------|
427
- | Issue Type Classification | `1.0000` |
428
- | Issue Type And Priority | `0.8800` |
429
- | Full Ticket Routing | `0.9400` |
430
- | Overall | `0.9400` |
431
 
432
- The merged-state rerun matched these same numbers exactly, so they are the current benchmark reference for the repo. The April 6 to April 7 validation pass then closed the remaining roadmap gates with Docker smoke coverage via GitHub Actions, a clean-copy install-and-run rerun, structured inference-log verification, and a passing local `openenv validate` check after checking in `uv.lock`.
433
 
434
  ### Windows note
435
 
@@ -440,7 +439,7 @@ During the first runtime pass, the repo surfaced a Windows-specific JSON issue w
440
  Build:
441
 
442
  ```bash
443
- docker build -f server/Dockerfile -t helpdesk-ticket-routing .
444
  ```
445
 
446
  Run locally:
 
205
 
206
  Hard-task investigation behavior:
207
 
208
+ - some ambiguous and non-default-routing tickets start with both redacted titles and redacted descriptions
209
  - linked-ticket previews and internal routing notes stay hidden until the matching tool is used
210
+ - only useful investigation steps return a small positive shaping reward
211
+ - blind or repeated probing does not pay by default
212
  - premature hard-task submission can incur a shaping penalty even when the visible text looks plausible
213
  - terminal `rubric_reward` remains the objective evaluation signal, while per-step `reward` is the denser training signal
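
As a rough illustration of the premature-submission bullet above, the sketch below reproduces the penalty arithmetic visible in this commit's `server/environment.py` hunk. The constants and the `0.45` cap are copied from the diff; the standalone function name and its boolean `nondefault_routing` argument are hypothetical simplifications, since the real method reads both from environment state.

```python
# Sketch only: constants copied from this commit's server/environment.py.
PREMATURE_SUBMIT_PENALTY = 0.22
NONDEFAULT_HIDDEN_CONTEXT_PENALTY = 0.08
PENALTY_CAP = 0.45  # literal used in the return statement of the diff


def premature_submit_penalty(required_tools, remaining_tools, nondefault_routing):
    """Hypothetical standalone version of the shaping penalty.

    The penalty scales with the fraction of required investigation tools
    that were still unused when the agent submitted.
    """
    fraction = len(remaining_tools) / max(1, len(required_tools))
    penalty = PREMATURE_SUBMIT_PENALTY * fraction
    if nondefault_routing:
        # Extra charge when the ticket's true routing differs from the default.
        penalty += NONDEFAULT_HIDDEN_CONTEXT_PENALTY * fraction
    return round(min(PENALTY_CAP, penalty), 4)


required = ["lookup_related_ticket", "lookup_internal_routing_note"]
print(premature_submit_penalty(required, required[1:], nondefault_routing=True))
```

If remaining tools are always a subset of required tools, the fraction never exceeds 1, so with these constants the penalty tops out at 0.30 and the 0.45 cap acts only as a safety bound.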
214
 
 
216
 
217
  - `issue_type`: exact match, with a few near-miss partial-credit pairs
218
  - `priority`: exact match or proximity credit
219
+ - `assignment_group`: exact match, with a small declared partial-credit map for nearby ownership mistakes
220
+ - `resolution_action`: exact match, with a small declared partial-credit map for nearby next-step mistakes
221
 
222
  Task weights:
223
 
 
237
 
238
  Step reward is lightly milestone-shaped: high per-ticket scores get a small bonus and very low scores get a small penalty before the final clamp.
239
 
240
+ Final reward also includes a queue-economics penalty when the agent exceeds the free investigation budget. One investigation per queued ticket is free, but extra investigation steps reduce the final reward more noticeably than before.
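
The diff shows the queue-economics constants but not the formula that applies them. The sketch below assumes a linear cost per extra investigation step, capped at the declared maximum. The function name, its arguments, and the linear form are all assumptions for illustration, not the environment's confirmed implementation.

```python
# Constants copied from this commit's server/environment.py.
FREE_INVESTIGATIONS_PER_TICKET = 1
EXTRA_INVESTIGATION_COST = 0.04
MAX_EXTRA_INVESTIGATION_PENALTY = 0.25


def queue_economics_penalty(total_investigations, queue_size):
    """Hypothetical linear penalty for exceeding the free investigation budget."""
    free_budget = FREE_INVESTIGATIONS_PER_TICKET * queue_size
    extra_steps = max(0, total_investigations - free_budget)
    # Assumed form: fixed cost per extra step, capped at the declared maximum.
    penalty = min(MAX_EXTRA_INVESTIGATION_PENALTY, extra_steps * EXTRA_INVESTIGATION_COST)
    return round(penalty, 4)


print(queue_economics_penalty(total_investigations=2, queue_size=1))
```

Under this assumed form, one extra step costs 0.04 instead of the previous 0.02. That is at least consistent with the queue-economics test expectation in this commit moving from `0.97` to `0.95`.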
241
 
242
  To make the environment more RL-friendly, each observation now also surfaces structured reward telemetry:
243
 
 
247
 
248
  ## Grounded Scoring
249
 
250
+ The grader is intentionally narrow and declared, not fully fuzzy.
251
 
252
  - exact match is the dominant path for every field
253
+ - `assignment_group` and `resolution_action` now expose only a small declared partial-credit map for nearby mistakes
254
  - `priority` only gets proximity credit from the declared table in `server/grader.py`
255
  - `issue_type` only gets partial credit for a small declared similarity map
256
  - wrong labels outside those explicit maps score `0.0`
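
The declared-map behavior listed above can be sketched as a small lookup. The control flow follows `_score_exact_or_table` as added to `server/grader.py` in this commit, and the table is a subset of `ASSIGNMENT_GROUP_SIMILARITY`. The `normalize` helper is a hypothetical stand-in for the grader's `_normalized`, whose body is not shown in the diff.

```python
# Subset of ASSIGNMENT_GROUP_SIMILARITY from this commit, for illustration.
ASSIGNMENT_GROUP_SIMILARITY = {
    ("procurement", "license_ops"): 0.55,
    ("license_ops", "procurement"): 0.55,
    ("service_desk", "onboarding_ops"): 0.5,
    ("onboarding_ops", "service_desk"): 0.5,
}


def normalize(value):
    """Assumed normalization: None becomes '', otherwise trim and lowercase."""
    return (value or "").strip().lower()


def score_exact_or_table(predicted, expected, similarity_table):
    pred = normalize(predicted)
    exp = normalize(expected)
    if not pred:
        return 0.0  # a missing prediction earns nothing
    if pred == exp:
        return 1.0  # exact match always wins over the table
    # Only explicitly declared (predicted, expected) pairs earn partial credit.
    return similarity_table.get((pred, exp), 0.0)


print(score_exact_or_table("procurement", "license_ops", ASSIGNMENT_GROUP_SIMILARITY))
```

Keying the table on ordered `(predicted, expected)` pairs means each direction must be declared separately, which is why the tables in the commit list every pair twice.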
 
364
 
365
  ## Running The Baseline Inference Script
366
 
367
+ The baseline script defaults to all declared tasks when `TASK_ID` is not set, which keeps local runs aligned with validator-style sweeps.
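
A minimal sketch of the new default follows, modeled on the `_get_tasks_to_run_impl` helper visible in this commit's tests. The integer parsing of `TASK_ID` and the function name are assumptions, because the diff does not show that part of the helper.

```python
def get_tasks_to_run(task_id_env, available_tasks, run_all_tasks=False):
    """Hypothetical task selector mirroring the behavior the tests describe."""
    if task_id_env not in (None, ""):
        try:
            task_id = int(task_id_env)  # assumed parsing step, not shown in the diff
        except ValueError:
            raise SystemExit(1)
        if task_id not in available_tasks:
            raise SystemExit(1)
        return [task_id]
    if not available_tasks:
        return []
    # RUN_ALL_TASKS is now only a compatibility alias: with TASK_ID unset,
    # every declared task runs whether or not the flag is set.
    return sorted(available_tasks)


print(get_tasks_to_run(None, {1: {}, 2: {}, 3: {}}))
```

Setting `TASK_ID=3` still narrows the run to a single task, so evaluator setups that pin one task keep working.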
368
 
369
  ### Heuristic mode
370
 
 
374
  python inference.py
375
  ```
376
 
377
+ By default that runs all declared tasks and emits a structured `[START] ... [STEP] ... [END]` block for each task. To target a specific task:
378
 
379
  ```bash
380
  TASK_ID=3 python inference.py
 
402
  - `SEED`
403
  - `TASK_ID`
404
  - `RUN_ALL_TASKS`
405
+ compatibility alias for local tooling; all tasks already run by default when `TASK_ID` is unset
406
 
407
  To reproduce the multi-task local benchmark sweep:
408
 
 
422
  - `/reset`
423
  - heuristic `inference.py` run across all 3 tasks with `RUN_ALL_TASKS=1`
424
 
425
+ Current local smoke expectations:
426
 
427
+ - the baseline completes all 3 tasks successfully
428
+ - rewards remain in range for every task
429
+ - the hard task now depends much more heavily on investigation behavior, so exact seed-level baseline numbers are no longer treated as the benchmark reference for the repo
 
 
 
430
 
431
+ The April 6 to April 7 validation pass then closed the remaining roadmap gates with Docker smoke coverage via GitHub Actions, a clean-copy install-and-run rerun, structured inference-log verification, and a passing local `openenv validate` check after checking in `uv.lock`.
432
 
433
  ### Windows note
434
 
 
439
  Build:
440
 
441
  ```bash
442
+ docker build -t helpdesk-ticket-routing .
443
  ```
444
 
445
  Run locally:
inference.py CHANGED
@@ -26,12 +26,11 @@ HF_TOKEN
26
 
27
  TASK_ID
28
  Optional OpenEnv task ID to run. When unset, the script defaults to the
29
- first available task so it still emits exactly one ``[START]`` ... ``[END]``
30
- block for evaluator-style runs.
31
 
32
  RUN_ALL_TASKS
33
- Optional local-development override. Set to ``1`` to run every available
34
- task in sequence and print the aggregate closing ``[END]`` summary.
35
 
36
  LOCAL_IMAGE_NAME
37
  Optional compatibility variable from the sample inference pattern.
@@ -761,6 +760,10 @@ def should_investigate(ticket: dict, history: list[dict[str, Any]]) -> tuple[boo
761
  if not ticket:
762
  return False, None
763
  context_status = ticket.get("context_status") or {}
 
 
 
 
764
  current_ticket_id = ticket.get("ticket_id")
765
  prior_ticket_history = [
766
  entry
@@ -777,7 +780,6 @@ def should_investigate(ticket: dict, history: list[dict[str, Any]]) -> tuple[boo
777
  for entry in prior_ticket_history
778
  if entry.get("predicted", {}).get("action_type") == "investigate"
779
  )
780
- hidden_context_remaining = bool(context_status.get("hidden_context_remaining"))
781
  if investigations_used >= 3:
782
  return False, None
783
 
@@ -786,6 +788,14 @@ def should_investigate(ticket: dict, history: list[dict[str, Any]]) -> tuple[boo
786
  for entry in prior_ticket_history
787
  if entry.get("predicted", {}).get("action_type") == "investigate"
788
  }
 
 
 
 
 
 
 
 
789
  routing_text = build_routing_text(ticket)
790
  last_tool_result = ticket.get("last_tool_result") or {}
791
  last_tool_name = str(last_tool_result.get("tool_name", "") or "")
@@ -859,10 +869,6 @@ def should_investigate(ticket: dict, history: list[dict[str, Any]]) -> tuple[boo
859
 
860
  if already_investigated and not hidden_context_remaining:
861
  return False, None
862
- if ticket.get("ambiguity_note") and "lookup_internal_routing_note" not in used_tools:
863
- return True, "lookup_internal_routing_note"
864
- if ticket.get("related_ticket_id") and "lookup_related_ticket" not in used_tools:
865
- return True, "lookup_related_ticket"
866
  return False, None
867
 
868
 
 
26
 
27
  TASK_ID
28
  Optional OpenEnv task ID to run. When unset, the script defaults to the
29
+ full declared task set so evaluator-style runs exercise every grader.
 
30
 
31
  RUN_ALL_TASKS
32
+ Optional backwards-compatible local-development alias. The script already
33
+ runs every available task when TASK_ID is unset.
34
 
35
  LOCAL_IMAGE_NAME
36
  Optional compatibility variable from the sample inference pattern.
 
760
  if not ticket:
761
  return False, None
762
  context_status = ticket.get("context_status") or {}
763
+ hidden_context_remaining = bool(context_status.get("hidden_context_remaining"))
764
+ investigation_required = bool(context_status.get("investigation_required"))
765
+ if not investigation_required and not hidden_context_remaining:
766
+ return False, None
767
  current_ticket_id = ticket.get("ticket_id")
768
  prior_ticket_history = [
769
  entry
 
780
  for entry in prior_ticket_history
781
  if entry.get("predicted", {}).get("action_type") == "investigate"
782
  )
 
783
  if investigations_used >= 3:
784
  return False, None
785
 
 
788
  for entry in prior_ticket_history
789
  if entry.get("predicted", {}).get("action_type") == "investigate"
790
  }
791
+ recommended_tools = [
792
+ tool_name
793
+ for tool_name in context_status.get("recommended_tools", [])
794
+ if tool_name not in used_tools
795
+ ]
796
+ if hidden_context_remaining and recommended_tools:
797
+ return True, recommended_tools[0]
798
+
799
  routing_text = build_routing_text(ticket)
800
  last_tool_result = ticket.get("last_tool_result") or {}
801
  last_tool_name = str(last_tool_result.get("tool_name", "") or "")
 
869
 
870
  if already_investigated and not hidden_context_remaining:
871
  return False, None
 
 
 
 
872
  return False, None
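
The core of the revised `should_investigate` flow can be sketched in isolation as below. It keeps only the `context_status` checks added in this commit and omits the per-ticket history scan, the three-investigation cap, and the later keyword heuristics. The function name and the `used_tools` argument are hypothetical simplifications.

```python
def pick_recommended_tool(context_status, used_tools):
    """Return (should_investigate, tool_name) from environment-provided hints."""
    hidden_context_remaining = bool(context_status.get("hidden_context_remaining"))
    investigation_required = bool(context_status.get("investigation_required"))
    if not investigation_required and not hidden_context_remaining:
        return False, None  # nothing hidden: skip blind probing
    recommended_tools = [
        tool_name
        for tool_name in context_status.get("recommended_tools", [])
        if tool_name not in used_tools  # never repeat a tool on the same ticket
    ]
    if hidden_context_remaining and recommended_tools:
        return True, recommended_tools[0]
    # The real function continues with further heuristics here (omitted).
    return False, None


status = {
    "hidden_context_remaining": True,
    "recommended_tools": ["lookup_related_ticket", "lookup_internal_routing_note"],
}
print(pick_recommended_tool(status, used_tools={"lookup_related_ticket"}))
```

Taking the first unused recommended tool means the baseline follows whatever order the environment gives in `recommended_tools`, rather than guessing from ticket text.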
873
 
874
 
server/environment.py CHANGED
@@ -33,12 +33,13 @@ AVAILABLE_TOOLS = (
33
  "lookup_internal_routing_note",
34
  )
35
  FREE_INVESTIGATIONS_PER_TICKET = 1
36
- EXTRA_INVESTIGATION_COST = 0.02
37
- MAX_EXTRA_INVESTIGATION_PENALTY = 0.15
38
- USEFUL_INVESTIGATION_REWARD = 0.08
39
- PREMATURE_SUBMIT_PENALTY = 0.10
40
- CONTEXT_COMPLETION_BONUS = 0.04
41
- TRAJECTORY_CONTEXT_COMPLETION_BONUS = 0.03
 
42
  PRIORITY_UNDERSHOOT_PENALTY = 0.03
43
  SEVERE_PRIORITY_UNDERSHOOT_PENALTY = 0.07
44
  DANGEROUS_RESOLUTION_PENALTY = 0.05
@@ -95,6 +96,18 @@ HARD_TASK_DESCRIPTION_REDACTIONS: dict[str, str] = {
95
  ),
96
  }
97
 
 
 
 
 
 
 
 
 
 
 
 
 
98
 
99
  def _coerce_optional_int(value: Any, field_name: str) -> Optional[int]:
100
  if value is None or value == "":
@@ -412,6 +425,55 @@ class HelpdeskTicketRoutingEnvironment(
412
  return 0.0
413
  return sum(self._state.per_ticket_scores) / len(self._state.per_ticket_scores)
414
 
415
  def _ticket_has_nondefault_routing(self, ticket: HelpdeskTicketRecord) -> bool:
416
  return (
417
  ticket.assignment_group
@@ -441,6 +503,22 @@ class HelpdeskTicketRoutingEnvironment(
441
  def _ticket_repeated_requester_count(self, ticket: HelpdeskTicketRecord) -> int:
442
  return sum(1 for candidate in self._dataset if candidate.requester == ticket.requester)
443
 
444
  def _required_tools_for_ticket(
445
  self,
446
  ticket: HelpdeskTicketRecord,
@@ -453,8 +531,9 @@ class HelpdeskTicketRoutingEnvironment(
453
  if ticket.related_ticket_id is not None and "lookup_related_ticket" not in required_tools:
454
  required_tools.append("lookup_related_ticket")
455
  if (
456
- ticket.ambiguity_note is not None or self._ticket_has_nondefault_routing(ticket)
457
- ) and "lookup_internal_routing_note" not in required_tools:
 
458
  required_tools.append("lookup_internal_routing_note")
459
  if (
460
  self._ticket_repeated_requester_count(ticket) >= 2
@@ -467,7 +546,13 @@ class HelpdeskTicketRoutingEnvironment(
467
  and "lookup_requester_history" not in required_tools
468
  ):
469
  required_tools.append("lookup_requester_history")
470
- return required_tools
 
 
 
 
 
 
471
 
472
  def _used_tools_for_ticket(self, ticket_id: str) -> list[str]:
473
  return list(self._state.ticket_tool_usage.get(ticket_id, []))
@@ -503,23 +588,39 @@ class HelpdeskTicketRoutingEnvironment(
503
  def _default_redacted_description(self, ticket: HelpdeskTicketRecord) -> str:
504
  if ticket.related_ticket_id is not None:
505
  return (
506
- "This is a follow-up operational issue that references prior work. "
507
  "Additional routing context is available via investigation."
508
  )
509
- if ticket.ambiguity_note is not None:
510
  return (
511
- "This ticket mixes multiple plausible workflows. "
512
  "Additional routing context is available via investigation."
513
  )
514
  if self._ticket_has_nondefault_routing(ticket):
515
  return (
516
- "The visible request looks straightforward, but the decisive routing "
517
- "detail is hidden until investigation."
518
  )
519
  return (
520
  "Additional routing context is available via investigation before final submission."
521
  )
522
 
523
  def _visible_description(self, ticket: HelpdeskTicketRecord) -> str:
524
  if self._state.current_task_id == 3 and self._remaining_tools_for_ticket(ticket):
525
  return HARD_TASK_DESCRIPTION_REDACTIONS.get(
@@ -537,7 +638,11 @@ class HelpdeskTicketRoutingEnvironment(
537
  penalty = PREMATURE_SUBMIT_PENALTY * (
538
  len(remaining_tools) / max(1, len(required_tools))
539
  )
540
- return penalty, len(remaining_tools)
 
 
 
 
541
 
542
  def _context_completion_bonus(
543
  self,
@@ -691,12 +796,13 @@ class HelpdeskTicketRoutingEnvironment(
691
  }
692
 
693
  def _lookup_internal_routing_note(self, current_ticket: HelpdeskTicketRecord) -> dict[str, Any]:
694
- found = current_ticket.ambiguity_note is not None
 
695
  return {
696
  "tool_name": "lookup_internal_routing_note",
697
  "found": found,
698
  "ticket_id": current_ticket.ticket_id,
699
- "routing_note": current_ticket.ambiguity_note if found else "",
700
  }
701
 
702
  def _run_investigation_tool(
@@ -753,10 +859,8 @@ class HelpdeskTicketRoutingEnvironment(
753
  self._state.investigation_budget_remaining - 1,
754
  )
755
  self._state.last_tool_result = tool_result
756
- investigation_reward = clamp_open_unit_interval(
757
- USEFUL_INVESTIGATION_REWARD if useful_investigation else 0.0
758
- )
759
- investigation_score = clamp_open_unit_interval(0.0)
760
  self._state.last_step_reward = investigation_reward
761
  self._state.reward = investigation_reward
762
  self._state.done = False
@@ -799,7 +903,7 @@ class HelpdeskTicketRoutingEnvironment(
799
  remaining_tools = progress["remaining_tools"]
800
  ticket_view: dict[str, Any] = {
801
  "ticket_id": ticket.ticket_id,
802
- "title": ticket.title,
803
  "requester": ticket.requester,
804
  "description": self._visible_description(ticket),
805
  }
@@ -811,6 +915,7 @@ class HelpdeskTicketRoutingEnvironment(
811
  "revealed_context_count": progress["revealed_count"],
812
  "context_completeness": progress["completeness"],
813
  "investigations_used_for_ticket": progress["revealed_count"],
 
814
  }
815
  if ticket.ambiguity_note is not None and "lookup_internal_routing_note" not in remaining_tools:
816
  ticket_view["ambiguity_note"] = ticket.ambiguity_note
 
33
  "lookup_internal_routing_note",
34
  )
35
  FREE_INVESTIGATIONS_PER_TICKET = 1
36
+ EXTRA_INVESTIGATION_COST = 0.04
37
+ MAX_EXTRA_INVESTIGATION_PENALTY = 0.25
38
+ USEFUL_INVESTIGATION_REWARD = 0.03
39
+ PREMATURE_SUBMIT_PENALTY = 0.22
40
+ NONDEFAULT_HIDDEN_CONTEXT_PENALTY = 0.08
41
+ CONTEXT_COMPLETION_BONUS = 0.06
42
+ TRAJECTORY_CONTEXT_COMPLETION_BONUS = 0.04
43
  PRIORITY_UNDERSHOOT_PENALTY = 0.03
44
  SEVERE_PRIORITY_UNDERSHOOT_PENALTY = 0.07
45
  DANGEROUS_RESOLUTION_PENALTY = 0.05
 
96
  ),
97
  }
98
 
99
+ HARD_TASK_TITLE_REDACTIONS: dict[str, str] = {
100
+ "ticket-021": "Production workflow regression",
101
+ "ticket-022": "Time-sensitive account review",
102
+ "ticket-027": "Commercial workflow request",
103
+ "ticket-029": "Urgent expansion request",
104
+ "ticket-038": "Repeated invoice follow-up",
105
+ "ticket-045": "Company-wide account issue",
106
+ "TKT-NONDEFAULT-001": "Billing-style routing question",
107
+ "TKT-NONDEFAULT-002": "Compliance ownership question",
108
+ "TKT-NONDEFAULT-003": "Workflow blocker with hidden owner",
109
+ }
110
+
111
 
112
  def _coerce_optional_int(value: Any, field_name: str) -> Optional[int]:
113
  if value is None or value == "":
 
425
  return 0.0
426
  return sum(self._state.per_ticket_scores) / len(self._state.per_ticket_scores)
427
 
428
+ def _internal_routing_note_for_ticket(
429
+ self,
430
+ ticket: HelpdeskTicketRecord,
431
+ ) -> str | None:
432
+ if ticket.ambiguity_note is not None:
433
+ return ticket.ambiguity_note
434
+ if self._state.current_task_id != 3:
435
+ return None
436
+
437
+ default_group = ISSUE_TYPE_TO_ASSIGNMENT_GROUP.get(
438
+ ticket.issue_type,
439
+ ticket.assignment_group,
440
+ )
441
+ default_action = ISSUE_TYPE_TO_RESOLUTION_ACTION.get(
442
+ ticket.issue_type,
443
+ ticket.resolution_action,
444
+ )
445
+ note_parts: list[str] = []
446
+
447
+ if ticket.assignment_group != default_group:
448
+ note_parts.append(
449
+ "Routing override: send this to "
450
+ f"{ticket.assignment_group} rather than the default {default_group} queue."
451
+ )
452
+ if ticket.resolution_action != default_action:
453
+ note_parts.append(
454
+ "Action override: use "
455
+ f"{ticket.resolution_action} instead of the default {default_action} next step."
456
+ )
457
+ if ticket.issue_type == "onboarding" and ticket.assignment_group == "service_desk":
458
+ note_parts.append(
459
+ "The onboarding workflow is blocked by an access dependency, so the unblocker owns the next move."
460
+ )
461
+ if (
462
+ ticket.issue_type == "security_compliance"
463
+ and ticket.assignment_group == "application_team"
464
+ ):
465
+ note_parts.append(
466
+ "This compliance issue needs a product-team fix rather than a central security handoff."
467
+ )
468
+ if ticket.issue_type == "billing_license" and ticket.assignment_group == "procurement":
469
+ note_parts.append(
470
+ "Treat this as commercial procurement work instead of routine license fulfillment."
471
+ )
472
+
473
+ if not note_parts:
474
+ return None
475
+ return " ".join(note_parts)
476
+
477
  def _ticket_has_nondefault_routing(self, ticket: HelpdeskTicketRecord) -> bool:
478
  return (
479
  ticket.assignment_group
 
503
  def _ticket_repeated_requester_count(self, ticket: HelpdeskTicketRecord) -> int:
504
  return sum(1 for candidate in self._dataset if candidate.requester == ticket.requester)
505
 
506
+ def _tool_has_available_context(
507
+ self,
508
+ ticket: HelpdeskTicketRecord,
509
+ tool_name: str,
510
+ ) -> bool:
511
+ if tool_name == "lookup_related_ticket":
512
+ return (
513
+ ticket.related_ticket_id is not None
514
+ and ticket.related_ticket_id in self._tickets_by_id
515
+ )
516
+ if tool_name == "lookup_requester_history":
517
+ return self._ticket_repeated_requester_count(ticket) >= 2
518
+ if tool_name == "lookup_internal_routing_note":
519
+ return self._internal_routing_note_for_ticket(ticket) is not None
520
+ return False
521
+
522
  def _required_tools_for_ticket(
523
  self,
524
  ticket: HelpdeskTicketRecord,
 
531
  if ticket.related_ticket_id is not None and "lookup_related_ticket" not in required_tools:
532
  required_tools.append("lookup_related_ticket")
533
  if (
534
+ self._internal_routing_note_for_ticket(ticket) is not None
535
+ and "lookup_internal_routing_note" not in required_tools
536
+ ):
537
  required_tools.append("lookup_internal_routing_note")
538
  if (
539
  self._ticket_repeated_requester_count(ticket) >= 2
 
546
  and "lookup_requester_history" not in required_tools
547
  ):
548
  required_tools.append("lookup_requester_history")
549
+ filtered_required_tools: list[str] = []
550
+ for tool_name in required_tools:
551
+ if tool_name in filtered_required_tools:
552
+ continue
553
+ if self._tool_has_available_context(ticket, tool_name):
554
+ filtered_required_tools.append(tool_name)
555
+ return filtered_required_tools
556
 
557
  def _used_tools_for_ticket(self, ticket_id: str) -> list[str]:
558
  return list(self._state.ticket_tool_usage.get(ticket_id, []))
 
588
  def _default_redacted_description(self, ticket: HelpdeskTicketRecord) -> str:
589
  if ticket.related_ticket_id is not None:
590
  return (
591
+ "This is a follow-up operational issue. "
592
  "Additional routing context is available via investigation."
593
  )
594
+ if self._internal_routing_note_for_ticket(ticket) is not None:
595
  return (
596
+ "The visible request is not enough to choose the final owner and next step. "
597
  "Additional routing context is available via investigation."
598
  )
599
  if self._ticket_has_nondefault_routing(ticket):
600
  return (
601
+ "The visible request looks straightforward, but the decisive routing detail is hidden until investigation."
 
602
  )
603
  return (
604
  "Additional routing context is available via investigation before final submission."
605
  )
606
 
607
+ def _default_redacted_title(self, ticket: HelpdeskTicketRecord) -> str:
608
+ if ticket.related_ticket_id is not None:
609
+ return "Follow-up request with hidden routing context"
610
+ if self._internal_routing_note_for_ticket(ticket) is not None:
611
+ return "Routing clarification required"
612
+ if self._ticket_mentions_follow_up(ticket):
613
+ return "Priority support follow-up"
614
+ return "Helpdesk routing decision"
615
+
616
+ def _visible_title(self, ticket: HelpdeskTicketRecord) -> str:
617
+ if self._state.current_task_id == 3 and self._remaining_tools_for_ticket(ticket):
618
+ return HARD_TASK_TITLE_REDACTIONS.get(
619
+ ticket.ticket_id,
620
+ self._default_redacted_title(ticket),
621
+ )
622
+ return ticket.title
623
+
624
  def _visible_description(self, ticket: HelpdeskTicketRecord) -> str:
625
  if self._state.current_task_id == 3 and self._remaining_tools_for_ticket(ticket):
626
  return HARD_TASK_DESCRIPTION_REDACTIONS.get(
 
638
  penalty = PREMATURE_SUBMIT_PENALTY * (
639
  len(remaining_tools) / max(1, len(required_tools))
640
  )
641
+ if self._ticket_has_nondefault_routing(ticket):
642
+ penalty += NONDEFAULT_HIDDEN_CONTEXT_PENALTY * (
643
+ len(remaining_tools) / max(1, len(required_tools))
644
+ )
645
+ return round(min(0.45, penalty), 4), len(remaining_tools)
646
 
647
  def _context_completion_bonus(
648
  self,
 
796
  }
797
 
798
  def _lookup_internal_routing_note(self, current_ticket: HelpdeskTicketRecord) -> dict[str, Any]:
799
+ routing_note = self._internal_routing_note_for_ticket(current_ticket)
800
+ found = routing_note is not None
801
  return {
802
  "tool_name": "lookup_internal_routing_note",
803
  "found": found,
804
  "ticket_id": current_ticket.ticket_id,
805
+ "routing_note": routing_note if found else "",
806
  }
807
 
808
  def _run_investigation_tool(
 
859
  self._state.investigation_budget_remaining - 1,
860
  )
861
  self._state.last_tool_result = tool_result
862
+ investigation_reward = USEFUL_INVESTIGATION_REWARD if useful_investigation else 0.0
863
+ investigation_score = 0.0
 
 
864
  self._state.last_step_reward = investigation_reward
865
  self._state.reward = investigation_reward
866
  self._state.done = False
 
903
  remaining_tools = progress["remaining_tools"]
904
  ticket_view: dict[str, Any] = {
905
  "ticket_id": ticket.ticket_id,
906
+ "title": self._visible_title(ticket),
907
  "requester": ticket.requester,
908
  "description": self._visible_description(ticket),
909
  }
 
915
  "revealed_context_count": progress["revealed_count"],
916
  "context_completeness": progress["completeness"],
917
  "investigations_used_for_ticket": progress["revealed_count"],
918
+ "recommended_tools": list(remaining_tools),
919
  }
920
  if ticket.ambiguity_note is not None and "lookup_internal_routing_note" not in remaining_tools:
921
  ticket_view["ambiguity_note"] = ticket.ambiguity_note
server/grader.py CHANGED
@@ -24,6 +24,32 @@ ISSUE_TYPE_SIMILARITY = {
24
  ("billing_license", "security_compliance"): 0.2,
25
  }
26
 
27
  PRIORITY_SCORES = {
28
  ("critical", "high"): 0.6,
29
  ("high", "critical"): 0.6,
@@ -66,6 +92,20 @@ def _score_exact_or_similar(predicted: str | None, expected: str) -> float:
66
  return ISSUE_TYPE_SIMILARITY.get((pred, exp), 0.0)
67
 
68
 
69
  def _score_priority(predicted: str | None, expected: str) -> float:
70
  pred = _normalized(predicted)
71
  exp = _normalized(expected)
@@ -91,11 +131,15 @@ def grade_action(
91
  field_scores = {
92
  "issue_type": _score_exact_or_similar(action.issue_type, ticket.issue_type),
93
  "priority": _score_priority(action.priority, ticket.priority),
94
- "assignment_group": _score_exact(
95
- action.assignment_group, ticket.assignment_group
 
 
96
  ),
97
- "resolution_action": _score_exact(
98
- action.resolution_action, ticket.resolution_action
 
 
99
  ),
100
  }
101
 
 
24
  ("billing_license", "security_compliance"): 0.2,
25
  }
26
 
27
+ ASSIGNMENT_GROUP_SIMILARITY = {
28
+ ("procurement", "license_ops"): 0.55,
29
+ ("license_ops", "procurement"): 0.55,
30
+ ("service_desk", "onboarding_ops"): 0.5,
31
+ ("onboarding_ops", "service_desk"): 0.5,
32
+ ("application_team", "security_team"): 0.35,
33
+ ("security_team", "application_team"): 0.35,
34
+ ("service_desk", "application_team"): 0.25,
35
+ ("application_team", "service_desk"): 0.25,
36
+ ("service_desk", "security_team"): 0.2,
37
+ ("security_team", "service_desk"): 0.2,
38
+ }
39
+
40
+ RESOLUTION_ACTION_SIMILARITY = {
41
+ ("assign", "escalate"): 0.6,
42
+ ("escalate", "assign"): 0.6,
43
+ ("acknowledge", "fulfill"): 0.35,
44
+ ("fulfill", "acknowledge"): 0.35,
45
+ ("assign", "fulfill"): 0.25,
46
+ ("fulfill", "assign"): 0.25,
47
+ ("escalate", "fulfill"): 0.2,
48
+ ("fulfill", "escalate"): 0.2,
49
+ ("acknowledge", "assign"): 0.2,
50
+ ("assign", "acknowledge"): 0.2,
51
+ }
52
+
53
  PRIORITY_SCORES = {
54
  ("critical", "high"): 0.6,
55
  ("high", "critical"): 0.6,
 
92
  return ISSUE_TYPE_SIMILARITY.get((pred, exp), 0.0)
93
 
94
 
95
+ def _score_exact_or_table(
96
+ predicted: str | None,
97
+ expected: str,
98
+ similarity_table: dict[tuple[str, str], float],
99
+ ) -> float:
100
+ pred = _normalized(predicted)
101
+ exp = _normalized(expected)
102
+ if not pred:
103
+ return 0.0
104
+ if pred == exp:
105
+ return 1.0
106
+ return similarity_table.get((pred, exp), 0.0)
107
+
108
+
109
  def _score_priority(predicted: str | None, expected: str) -> float:
110
  pred = _normalized(predicted)
111
  exp = _normalized(expected)
 
131
  field_scores = {
132
  "issue_type": _score_exact_or_similar(action.issue_type, ticket.issue_type),
133
  "priority": _score_priority(action.priority, ticket.priority),
134
+ "assignment_group": _score_exact_or_table(
135
+ action.assignment_group,
136
+ ticket.assignment_group,
137
+ ASSIGNMENT_GROUP_SIMILARITY,
138
  ),
139
+ "resolution_action": _score_exact_or_table(
140
+ action.resolution_action,
141
+ ticket.resolution_action,
142
+ RESOLUTION_ACTION_SIMILARITY,
143
  ),
144
  }
145
 
tests/test_api_integration.py CHANGED
@@ -529,8 +529,8 @@ class TestHeuristicInferenceRegression(unittest.TestCase):
529
  overall_avg = sum(rewards) / len(rewards)
530
  self.assertGreaterEqual(
531
  overall_avg,
532
- 0.8,
533
- f"Overall average reward {overall_avg:.4f} is below 0.8 (baseline: 0.9400)",
534
  )
535
  self.assertLessEqual(
536
  overall_avg,
 
529
  overall_avg = sum(rewards) / len(rewards)
530
  self.assertGreaterEqual(
531
  overall_avg,
532
+ 0.75,
533
+ f"Overall average reward {overall_avg:.4f} is below the smoke-test floor of 0.75",
534
  )
535
  self.assertLessEqual(
536
  overall_avg,
tests/test_competitive_upgrade.py CHANGED
@@ -100,11 +100,11 @@ def _get_tasks_to_run_impl(
100
  if task_id not in available_tasks:
101
  raise SystemExit(1)
102
  return [task_id]
103
- if run_all_tasks:
104
- return sorted(available_tasks)
105
  if not available_tasks:
106
  return []
107
- return [sorted(available_tasks)[0]]
 
 
108
 
109
 
110
  class TestInferenceSingleTaskMode(unittest.TestCase):
@@ -120,10 +120,10 @@ class TestInferenceSingleTaskMode(unittest.TestCase):
120
  with self.assertRaises(SystemExit):
121
  _get_tasks_to_run_impl("999", available)
122
 
123
- def test_task_id_unset_defaults_to_first_available_task(self) -> None:
124
  available = {1: {}, 2: {}, 3: {}}
125
  result = _get_tasks_to_run_impl(None, available)
126
- self.assertEqual(result, [1])
127
 
128
  def test_run_all_tasks_override_returns_all_task_ids(self) -> None:
129
  available = {1: {}, 2: {}, 3: {}}
@@ -710,7 +710,7 @@ class TestQueueEconomics(unittest.TestCase):
710
  final_obs = env.step(HelpdeskTicketAction(issue_type=ticket.issue_type))
711
 
712
  self.assertTrue(final_obs.done)
713
- self.assertAlmostEqual(final_obs.reward, 0.97, places=9)
714
 
715
 
716
  class TestTerminalInvalidActionFinalReward(unittest.TestCase):
 
100
  if task_id not in available_tasks:
101
  raise SystemExit(1)
102
  return [task_id]
 
 
103
  if not available_tasks:
104
  return []
105
+ if run_all_tasks:
106
+ return sorted(available_tasks)
107
+ return sorted(available_tasks)
108
 
109
 
110
  class TestInferenceSingleTaskMode(unittest.TestCase):
 
120
  with self.assertRaises(SystemExit):
121
  _get_tasks_to_run_impl("999", available)
122
 
123
+ def test_task_id_unset_defaults_to_all_available_tasks(self) -> None:
124
  available = {1: {}, 2: {}, 3: {}}
125
  result = _get_tasks_to_run_impl(None, available)
126
+ self.assertEqual(result, [1, 2, 3])
127
 
128
  def test_run_all_tasks_override_returns_all_task_ids(self) -> None:
129
  available = {1: {}, 2: {}, 3: {}}
 
710
  final_obs = env.step(HelpdeskTicketAction(issue_type=ticket.issue_type))
711
 
712
  self.assertTrue(final_obs.done)
713
+ self.assertAlmostEqual(final_obs.reward, 0.95, places=9)
714
 
715
 
716
  class TestTerminalInvalidActionFinalReward(unittest.TestCase):
tests/test_grader_unit.py CHANGED
@@ -6,12 +6,14 @@ import openenv_test_stubs # noqa: F401
 
 from models import HelpdeskTicketAction, HelpdeskTicketRecord
 from server.grader import (
     ISSUE_TYPE_SIMILARITY,
     PRIORITY_SCORES,
     TASK_WEIGHTS,
     grade_action,
 )
-from vocabulary import ISSUE_TYPES, PRIORITIES
 
 
 def _ticket(
@@ -143,12 +145,26 @@ class GraderUnitTests(unittest.TestCase):
         self.assertEqual(breakdown, {"issue_type": 1.0, "priority": 0.5})
         self.assertAlmostEqual(score, 0.8)
 
-    def test_assignment_group_is_exact_match_only(self) -> None:
         ticket = _ticket()
         action = HelpdeskTicketAction(
             issue_type="billing_license",
             priority="high",
-            assignment_group="service_desk",
             resolution_action="fulfill",
         )
 
@@ -162,7 +178,7 @@ class GraderUnitTests(unittest.TestCase):
         action = HelpdeskTicketAction(
             issue_type="billing_license",
             priority="medium",
-            assignment_group="service_desk",
             resolution_action="fulfill",
         )
 
@@ -179,13 +195,27 @@ class GraderUnitTests(unittest.TestCase):
         )
         self.assertAlmostEqual(score, 0.65)
 
-    def test_resolution_action_is_exact_match_only(self) -> None:
         ticket = _ticket()
         action = HelpdeskTicketAction(
             issue_type="billing_license",
             priority="high",
             assignment_group="license_ops",
-            resolution_action="assign",
         )
 
         score, breakdown = grade_action(action, ticket, task_id=3)
@@ -193,6 +223,70 @@ class GraderUnitTests(unittest.TestCase):
         self.assertEqual(breakdown["resolution_action"], 0.0)
         self.assertAlmostEqual(score, 0.8)
 
     def test_partial_credit_tables_never_override_exact_match(self) -> None:
         for pair, value in ISSUE_TYPE_SIMILARITY.items():
             with self.subTest(table="issue_type", pair=pair):
@@ -204,6 +298,16 @@ class GraderUnitTests(unittest.TestCase):
                 self.assertGreater(value, 0.0)
                 self.assertLess(value, 1.0)
 
     def test_task_weights_sum_to_one_for_each_task(self) -> None:
         for task_id, weights in TASK_WEIGHTS.items():
             with self.subTest(task_id=task_id):
 
 
 from models import HelpdeskTicketAction, HelpdeskTicketRecord
 from server.grader import (
+    ASSIGNMENT_GROUP_SIMILARITY,
     ISSUE_TYPE_SIMILARITY,
     PRIORITY_SCORES,
+    RESOLUTION_ACTION_SIMILARITY,
     TASK_WEIGHTS,
     grade_action,
 )
+from vocabulary import ASSIGNMENT_GROUPS, ISSUE_TYPES, PRIORITIES, RESOLUTION_ACTIONS
 
 
 def _ticket(
 
         self.assertEqual(breakdown, {"issue_type": 1.0, "priority": 0.5})
         self.assertAlmostEqual(score, 0.8)
 
+    def test_assignment_group_partial_credit_uses_declared_similarity_table(self) -> None:
         ticket = _ticket()
         action = HelpdeskTicketAction(
             issue_type="billing_license",
             priority="high",
+            assignment_group="procurement",
+            resolution_action="fulfill",
+        )
+
+        score, breakdown = grade_action(action, ticket, task_id=3)
+
+        self.assertEqual(breakdown["assignment_group"], 0.55)
+        self.assertAlmostEqual(score, 0.8875)
+
+    def test_assignment_group_unrelated_miss_stays_zero(self) -> None:
+        ticket = _ticket()
+        action = HelpdeskTicketAction(
+            issue_type="billing_license",
+            priority="high",
+            assignment_group="security_team",
             resolution_action="fulfill",
         )
 
 
         action = HelpdeskTicketAction(
             issue_type="billing_license",
             priority="medium",
+            assignment_group="security_team",
             resolution_action="fulfill",
         )
 
 
         )
         self.assertAlmostEqual(score, 0.65)
 
+    def test_resolution_action_partial_credit_uses_declared_similarity_table(self) -> None:
         ticket = _ticket()
         action = HelpdeskTicketAction(
             issue_type="billing_license",
             priority="high",
             assignment_group="license_ops",
+            resolution_action="acknowledge",
+        )
+
+        score, breakdown = grade_action(action, ticket, task_id=3)
+
+        self.assertEqual(breakdown["resolution_action"], 0.35)
+        self.assertAlmostEqual(score, 0.87)
+
+    def test_resolution_action_unrelated_miss_stays_zero(self) -> None:
+        ticket = _ticket()
+        action = HelpdeskTicketAction(
+            issue_type="billing_license",
+            priority="high",
+            assignment_group="license_ops",
+            resolution_action="ignore",
         )
 
         score, breakdown = grade_action(action, ticket, task_id=3)
 
         self.assertEqual(breakdown["resolution_action"], 0.0)
         self.assertAlmostEqual(score, 0.8)
 
+    def test_assignment_group_scoring_matches_declared_similarity_table_exhaustively(self) -> None:
+        for expected in ASSIGNMENT_GROUPS:
+            for predicted in ASSIGNMENT_GROUPS:
+                with self.subTest(expected=expected, predicted=predicted):
+                    ticket = _ticket(assignment_group=expected)
+                    action = HelpdeskTicketAction(
+                        issue_type="billing_license",
+                        priority="high",
+                        assignment_group=predicted,
+                        resolution_action="fulfill",
+                    )
+
+                    score, breakdown = grade_action(action, ticket, task_id=3)
+
+                    assignment_group_score = (
+                        1.0
+                        if predicted == expected
+                        else ASSIGNMENT_GROUP_SIMILARITY.get((predicted, expected), 0.0)
+                    )
+                    self.assertEqual(
+                        breakdown,
+                        {
+                            "issue_type": 1.0,
+                            "priority": 1.0,
+                            "assignment_group": assignment_group_score,
+                            "resolution_action": 1.0,
+                        },
+                    )
+                    raw_score = 0.35 + 0.20 + 0.25 * assignment_group_score + 0.20
+                    expected_task_score = max(0.01, min(0.99, raw_score))
+                    self.assertAlmostEqual(score, expected_task_score)
+
+    def test_resolution_action_scoring_matches_declared_similarity_table_exhaustively(self) -> None:
+        for expected in RESOLUTION_ACTIONS:
+            for predicted in RESOLUTION_ACTIONS:
+                with self.subTest(expected=expected, predicted=predicted):
+                    ticket = _ticket(resolution_action=expected)
+                    action = HelpdeskTicketAction(
+                        issue_type="billing_license",
+                        priority="high",
+                        assignment_group="license_ops",
+                        resolution_action=predicted,
+                    )
+
+                    score, breakdown = grade_action(action, ticket, task_id=3)
+
+                    resolution_action_score = (
+                        1.0
+                        if predicted == expected
+                        else RESOLUTION_ACTION_SIMILARITY.get((predicted, expected), 0.0)
+                    )
+                    self.assertEqual(
+                        breakdown,
+                        {
+                            "issue_type": 1.0,
+                            "priority": 1.0,
+                            "assignment_group": 1.0,
+                            "resolution_action": resolution_action_score,
+                        },
+                    )
+                    raw_score = 0.35 + 0.20 + 0.25 + 0.20 * resolution_action_score
+                    expected_task_score = max(0.01, min(0.99, raw_score))
+                    self.assertAlmostEqual(score, expected_task_score)
+
     def test_partial_credit_tables_never_override_exact_match(self) -> None:
         for pair, value in ISSUE_TYPE_SIMILARITY.items():
             with self.subTest(table="issue_type", pair=pair):
 
                 self.assertGreater(value, 0.0)
                 self.assertLess(value, 1.0)
 
+        for pair, value in ASSIGNMENT_GROUP_SIMILARITY.items():
+            with self.subTest(table="assignment_group", pair=pair):
+                self.assertGreater(value, 0.0)
+                self.assertLess(value, 1.0)
+
+        for pair, value in RESOLUTION_ACTION_SIMILARITY.items():
+            with self.subTest(table="resolution_action", pair=pair):
+                self.assertGreater(value, 0.0)
+                self.assertLess(value, 1.0)
+
     def test_task_weights_sum_to_one_for_each_task(self) -> None:
         for task_id, weights in TASK_WEIGHTS.items():
             with self.subTest(task_id=task_id):
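Review note: the expected scores in the new tests (0.8875, 0.87, 0.8) follow from a weighted sum over per-field scores, clamped to `[0.01, 0.99]`. The sketch below reproduces that arithmetic under stated assumptions and is not the code in `server/grader.py`. The task-3 weights are read off the `raw_score` expressions in the exhaustive tests, and mapping 0.35 to `issue_type` and 0.20 to `priority` is an inference. The two table entries are illustrative values taken from the test assertions, assuming `_ticket()` defaults to `license_ops` and `fulfill`. The names `field_score`, `task_score`, and the `*_EXAMPLE` tables are hypothetical.

```python
from typing import Dict, Tuple

# Assumed task-3 weights, inferred from the raw_score expressions in the tests.
TASK3_WEIGHTS: Dict[str, float] = {
    "issue_type": 0.35,
    "priority": 0.20,
    "assignment_group": 0.25,
    "resolution_action": 0.20,
}

# Illustrative (predicted, expected) entries only; the real tables are not shown here.
ASSIGNMENT_GROUP_SIMILARITY_EXAMPLE: Dict[Tuple[str, str], float] = {
    ("procurement", "license_ops"): 0.55,
}
RESOLUTION_ACTION_SIMILARITY_EXAMPLE: Dict[Tuple[str, str], float] = {
    ("acknowledge", "fulfill"): 0.35,
}


def field_score(predicted: str, expected: str, table: Dict[Tuple[str, str], float]) -> float:
    """Exact match wins; otherwise use the declared pair, else zero."""
    if predicted == expected:
        return 1.0
    return table.get((predicted, expected), 0.0)


def task_score(breakdown: Dict[str, float]) -> float:
    """Weighted sum of per-field scores, clamped to [0.01, 0.99]."""
    raw = sum(weight * breakdown[field] for field, weight in TASK3_WEIGHTS.items())
    return max(0.01, min(0.99, raw))


if __name__ == "__main__":
    near_miss = field_score("procurement", "license_ops", ASSIGNMENT_GROUP_SIMILARITY_EXAMPLE)
    breakdown = {
        "issue_type": 1.0,
        "priority": 1.0,
        "assignment_group": near_miss,
        "resolution_action": 1.0,
    }
    # 0.35 + 0.20 + 0.25 * 0.55 + 0.20
    print(round(task_score(breakdown), 4))
```

Because of the upper clamp, a fully correct action scores 0.99 rather than 1.0, which is why the exhaustive tests compare against `max(0.01, min(0.99, raw_score))` instead of the raw sum.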