Spaces: Roopalgn/AIHack-ITHelpDesk

Roopalgn committed on Apr 7

Commit d378e5d (1 parent: d6d9493)

Strengthen hard-task investigation and grading

Files changed (7)

README.md +17 -18
inference.py +15 -9
server/environment.py +127 -22
server/grader.py +48 -4
tests/test_api_integration.py +2 -2
tests/test_competitive_upgrade.py +6 -6
tests/test_grader_unit.py +110 -6

README.md CHANGED

@@ -205,9 +205,10 @@ Available tools:
 Hard-task investigation behavior:
-- some ambiguous and non-default-routing tickets start with redacted descriptions
 - linked-ticket previews and internal routing notes stay hidden until the matching tool is used
-- useful investigation steps return a small positive shaping reward
 - premature hard-task submission can incur a shaping penalty even when the visible text looks plausible
 - terminal `rubric_reward` remains the objective evaluation signal, while per-step `reward` is the denser training signal
@@ -215,8 +216,8 @@ Per-field behavior:
 - `issue_type`: exact match, with a few near-miss partial-credit pairs
 - `priority`: exact match or proximity credit
-- `assignment_group`: exact match
-- `resolution_action`: exact match
 Task weights:
@@ -236,7 +237,7 @@ The result is clamped to `[0.0, 1.0]`.
 Step reward is lightly milestone-shaped: high per-ticket scores get a small bonus and very low scores get a small penalty before the final clamp.
-Final reward also includes a tiny queue-economics penalty only when the agent exceeds the free investigation budget. One investigation per queued ticket is free; extra investigation steps reduce the final reward slightly.
 To make the environment more RL-friendly, each observation now also surfaces structured reward telemetry:
@@ -246,10 +247,10 @@ To make the environment more RL-friendly, each observation now also surfaces str
 ## Grounded Scoring
-The grader is intentionally not fuzzy by default.
 - exact match is the dominant path for every field
-- `assignment_group` and `resolution_action` are exact-match only
 - `priority` only gets proximity credit from the declared table in `server/grader.py`
 - `issue_type` only gets partial credit for a small declared similarity map
 - wrong labels outside those explicit maps score `0.0`
@@ -363,7 +364,7 @@ curl http://localhost:7860/tasks
 ## Running The Baseline Inference Script
-The baseline script supports single-task evaluator mode by default, plus an explicit local batch override.
 ### Heuristic mode
@@ -373,7 +374,7 @@ If no LLM credentials are set, it uses a keyword-based ticket router:
 python inference.py
 ```
-By default that runs exactly one task and emits exactly one `[START] ... [END]` block. To target a specific task:
 ```bash
 TASK_ID=3 python inference.py
@@ -401,6 +402,7 @@ Optional target:
 - `SEED`
 - `TASK_ID`
 - `RUN_ALL_TASKS`
 To reproduce the multi-task local benchmark sweep:
@@ -420,16 +422,13 @@ Validated locally:
 - `/reset`
 - heuristic `inference.py` run across all 3 tasks with `RUN_ALL_TASKS=1`
-Current local heuristic results:
-| Task | Result |
-|------|--------|
-| Issue Type Classification | `1.0000` |
-| Issue Type And Priority | `0.8800` |
-| Full Ticket Routing | `0.9400` |
-| Overall | `0.9400` |
-The merged-state rerun matched these same numbers exactly, so they are the current benchmark reference for the repo. The April 6 to April 7 validation pass then closed the remaining roadmap gates with Docker smoke coverage via GitHub Actions, a clean-copy install-and-run rerun, structured inference-log verification, and a passing local `openenv validate` check after checking in `uv.lock`.
 ### Windows note
@@ -440,7 +439,7 @@ During the first runtime pass, the repo surfaced a Windows-specific JSON issue w
 Build:
 ```bash
-docker build -f server/Dockerfile -t helpdesk-ticket-routing .
 ```
 Run locally:

 Hard-task investigation behavior:
+- some ambiguous and non-default-routing tickets start with both redacted titles and redacted descriptions
 - linked-ticket previews and internal routing notes stay hidden until the matching tool is used
+- only useful investigation steps return a small positive shaping reward
+- blind or repeated probing does not pay by default
 - premature hard-task submission can incur a shaping penalty even when the visible text looks plausible
 - terminal `rubric_reward` remains the objective evaluation signal, while per-step `reward` is the denser training signal
 - `issue_type`: exact match, with a few near-miss partial-credit pairs
 - `priority`: exact match or proximity credit
+- `assignment_group`: exact match, with a small declared partial-credit map for nearby ownership mistakes
+- `resolution_action`: exact match, with a small declared partial-credit map for nearby next-step mistakes
 Task weights:
 Step reward is lightly milestone-shaped: high per-ticket scores get a small bonus and very low scores get a small penalty before the final clamp.
+Final reward also includes a queue-economics penalty when the agent exceeds the free investigation budget. One investigation per queued ticket is free, but extra investigation steps reduce the final reward more noticeably than before.
 To make the environment more RL-friendly, each observation now also surfaces structured reward telemetry:
 ## Grounded Scoring
+The grader is intentionally narrow and declared, not fully fuzzy.
 - exact match is the dominant path for every field
+- `assignment_group` and `resolution_action` now expose only a small declared partial-credit map for nearby mistakes
 - `priority` only gets proximity credit from the declared table in `server/grader.py`
 - `issue_type` only gets partial credit for a small declared similarity map
 - wrong labels outside those explicit maps score `0.0`
 ## Running The Baseline Inference Script
+The baseline script defaults to all declared tasks when `TASK_ID` is not set, which keeps local runs aligned with validator-style sweeps.
 ### Heuristic mode
 python inference.py
 ```
+By default that runs all declared tasks and emits a structured `[START] ... [STEP] ... [END]` block for each task. To target a specific task:
 ```bash
 TASK_ID=3 python inference.py
 - `SEED`
 - `TASK_ID`
 - `RUN_ALL_TASKS`
+  compatibility alias for local tooling; all tasks already run by default when `TASK_ID` is unset
 To reproduce the multi-task local benchmark sweep:
 - `/reset`
 - heuristic `inference.py` run across all 3 tasks with `RUN_ALL_TASKS=1`
+Current local smoke expectations:
+- the baseline completes all 3 tasks successfully
+- rewards remain in range for every task
+- the hard task now depends much more heavily on investigation behavior, so exact seed-level baseline numbers are no longer treated as the benchmark reference for the repo
+The April 6 to April 7 validation pass then closed the remaining roadmap gates with Docker smoke coverage via GitHub Actions, a clean-copy install-and-run rerun, structured inference-log verification, and a passing local `openenv validate` check after checking in `uv.lock`.
 ### Windows note
 Build:
 ```bash
+docker build -t helpdesk-ticket-routing .
 ```
 Run locally:
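
Reviewer sketch (not part of the commit): the queue-economics rule described in the README hunk above can be pictured roughly as follows. The three constants are the new values from `server/environment.py` in this commit. The linear-then-capped shape and the helper name `extra_investigation_penalty` are assumptions, because the diff does not show the code that actually applies the penalty.

```python
# Constants copied from the new side of server/environment.py in this commit.
FREE_INVESTIGATIONS_PER_TICKET = 1
EXTRA_INVESTIGATION_COST = 0.04
MAX_EXTRA_INVESTIGATION_PENALTY = 0.25


def extra_investigation_penalty(investigations_used: int, queued_tickets: int) -> float:
    """Hypothetical helper: assumes a linear cost per extra step, capped at the maximum."""
    free_budget = FREE_INVESTIGATIONS_PER_TICKET * queued_tickets
    extra_steps = max(0, investigations_used - free_budget)
    penalty = min(MAX_EXTRA_INVESTIGATION_PENALTY, EXTRA_INVESTIGATION_COST * extra_steps)
    return round(penalty, 4)


print(extra_investigation_penalty(3, 3))   # 0.0, still inside the free budget
print(extra_investigation_penalty(5, 3))   # 0.08, two extra steps
print(extra_investigation_penalty(20, 3))  # 0.25, capped
```

Under this assumed shape, doubling `EXTRA_INVESTIGATION_COST` from 0.02 to 0.04 doubles the cost of each step beyond the free budget, which is consistent with the README wording that extra steps now cost "more noticeably than before".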

inference.py CHANGED

@@ -26,12 +26,11 @@ HF_TOKEN
 TASK_ID
     Optional OpenEnv task ID to run. When unset, the script defaults to the
-    first available task so it still emits exactly one ``[START]`` ... ``[END]``
-    block for evaluator-style runs.
 RUN_ALL_TASKS
-    Optional local-development override. Set to ``1`` to run every available
-    task in sequence and print the aggregate closing ``[END]`` summary.
 LOCAL_IMAGE_NAME
     Optional compatibility variable from the sample inference pattern.
@@ -761,6 +760,10 @@ def should_investigate(ticket: dict, history: list[dict[str, Any]]) -> tuple[boo
     if not ticket:
         return False, None
     context_status = ticket.get("context_status") or {}
     current_ticket_id = ticket.get("ticket_id")
     prior_ticket_history = [
         entry
@@ -777,7 +780,6 @@ def should_investigate(ticket: dict, history: list[dict[str, Any]]) -> tuple[boo
         for entry in prior_ticket_history
         if entry.get("predicted", {}).get("action_type") == "investigate"
     )
-    hidden_context_remaining = bool(context_status.get("hidden_context_remaining"))
     if investigations_used >= 3:
         return False, None
@@ -786,6 +788,14 @@ def should_investigate(ticket: dict, history: list[dict[str, Any]]) -> tuple[boo
         for entry in prior_ticket_history
         if entry.get("predicted", {}).get("action_type") == "investigate"
     }
     routing_text = build_routing_text(ticket)
     last_tool_result = ticket.get("last_tool_result") or {}
     last_tool_name = str(last_tool_result.get("tool_name", "") or "")
@@ -859,10 +869,6 @@ def should_investigate(ticket: dict, history: list[dict[str, Any]]) -> tuple[boo
     if already_investigated and not hidden_context_remaining:
         return False, None
-    if ticket.get("ambiguity_note") and "lookup_internal_routing_note" not in used_tools:
-        return True, "lookup_internal_routing_note"
-    if ticket.get("related_ticket_id") and "lookup_related_ticket" not in used_tools:
-        return True, "lookup_related_ticket"
     return False, None

 TASK_ID
     Optional OpenEnv task ID to run. When unset, the script defaults to the
+    full declared task set so evaluator-style runs exercise every grader.
 RUN_ALL_TASKS
+    Optional backwards-compatible local-development alias. The script already
+    runs every available task when TASK_ID is unset.
 LOCAL_IMAGE_NAME
     Optional compatibility variable from the sample inference pattern.
     if not ticket:
         return False, None
     context_status = ticket.get("context_status") or {}
+    hidden_context_remaining = bool(context_status.get("hidden_context_remaining"))
+    investigation_required = bool(context_status.get("investigation_required"))
+    if not investigation_required and not hidden_context_remaining:
+        return False, None
     current_ticket_id = ticket.get("ticket_id")
     prior_ticket_history = [
         entry
         for entry in prior_ticket_history
         if entry.get("predicted", {}).get("action_type") == "investigate"
     )
     if investigations_used >= 3:
         return False, None
         for entry in prior_ticket_history
         if entry.get("predicted", {}).get("action_type") == "investigate"
     }
+    recommended_tools = [
+        tool_name
+        for tool_name in context_status.get("recommended_tools", [])
+        if tool_name not in used_tools
+    ]
+    if hidden_context_remaining and recommended_tools:
+        return True, recommended_tools[0]
     routing_text = build_routing_text(ticket)
     last_tool_result = ticket.get("last_tool_result") or {}
     last_tool_name = str(last_tool_result.get("tool_name", "") or "")
     if already_investigated and not hidden_context_remaining:
         return False, None
     return False, None
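
Reviewer sketch (not part of the commit): the two new gates added to `should_investigate` can be read in isolation as below. The function name `pick_recommended_tool` and its reduced signature are hypothetical. The keyword heuristics and the three-investigation cap that sit between these gates in the real function are deliberately left out, so the final fallback here simply returns `(False, None)`.

```python
from typing import Any, Optional


def pick_recommended_tool(
    context_status: dict[str, Any],
    used_tools: set[str],
) -> tuple[bool, Optional[str]]:
    """Hypothetical condensed view of the gates added in this commit."""
    hidden_context_remaining = bool(context_status.get("hidden_context_remaining"))
    investigation_required = bool(context_status.get("investigation_required"))

    # Gate 1: nothing is hidden and nothing is required, so do not investigate.
    if not investigation_required and not hidden_context_remaining:
        return False, None

    # Gate 2: prefer the first environment-recommended tool not used yet.
    recommended_tools = [
        tool_name
        for tool_name in context_status.get("recommended_tools", [])
        if tool_name not in used_tools
    ]
    if hidden_context_remaining and recommended_tools:
        return True, recommended_tools[0]

    # The real function continues with text-based heuristics here (omitted).
    return False, None


status = {
    "hidden_context_remaining": True,
    "recommended_tools": ["lookup_related_ticket", "lookup_internal_routing_note"],
}
print(pick_recommended_tool(status, {"lookup_related_ticket"}))
# (True, 'lookup_internal_routing_note')
```

The effect is that the baseline follows the environment's `recommended_tools` hint instead of inferring the tool from `ambiguity_note` or `related_ticket_id`, which are the two checks the old side of this hunk removes.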

server/environment.py CHANGED

@@ -33,12 +33,13 @@ AVAILABLE_TOOLS = (
     "lookup_internal_routing_note",
 )
 FREE_INVESTIGATIONS_PER_TICKET = 1
-EXTRA_INVESTIGATION_COST = 0.02
-MAX_EXTRA_INVESTIGATION_PENALTY = 0.15
-USEFUL_INVESTIGATION_REWARD = 0.08
-PREMATURE_SUBMIT_PENALTY = 0.10
-CONTEXT_COMPLETION_BONUS = 0.04
-TRAJECTORY_CONTEXT_COMPLETION_BONUS = 0.03
 PRIORITY_UNDERSHOOT_PENALTY = 0.03
 SEVERE_PRIORITY_UNDERSHOOT_PENALTY = 0.07
 DANGEROUS_RESOLUTION_PENALTY = 0.05
@@ -95,6 +96,18 @@ HARD_TASK_DESCRIPTION_REDACTIONS: dict[str, str] = {
     ),
 }
 def _coerce_optional_int(value: Any, field_name: str) -> Optional[int]:
     if value is None or value == "":
@@ -412,6 +425,55 @@ class HelpdeskTicketRoutingEnvironment(
             return 0.0
         return sum(self._state.per_ticket_scores) / len(self._state.per_ticket_scores)
     def _ticket_has_nondefault_routing(self, ticket: HelpdeskTicketRecord) -> bool:
         return (
             ticket.assignment_group
@@ -441,6 +503,22 @@ class HelpdeskTicketRoutingEnvironment(
     def _ticket_repeated_requester_count(self, ticket: HelpdeskTicketRecord) -> int:
         return sum(1 for candidate in self._dataset if candidate.requester == ticket.requester)
     def _required_tools_for_ticket(
         self,
         ticket: HelpdeskTicketRecord,
@@ -453,8 +531,9 @@ class HelpdeskTicketRoutingEnvironment(
         if ticket.related_ticket_id is not None and "lookup_related_ticket" not in required_tools:
             required_tools.append("lookup_related_ticket")
         if (
-            ticket.ambiguity_note is not None or self._ticket_has_nondefault_routing(ticket)
-        ) and "lookup_internal_routing_note" not in required_tools:
             required_tools.append("lookup_internal_routing_note")
         if (
             self._ticket_repeated_requester_count(ticket) >= 2
@@ -467,7 +546,13 @@ class HelpdeskTicketRoutingEnvironment(
             and "lookup_requester_history" not in required_tools
         ):
             required_tools.append("lookup_requester_history")
-        return required_tools
     def _used_tools_for_ticket(self, ticket_id: str) -> list[str]:
         return list(self._state.ticket_tool_usage.get(ticket_id, []))
@@ -503,23 +588,39 @@ class HelpdeskTicketRoutingEnvironment(
     def _default_redacted_description(self, ticket: HelpdeskTicketRecord) -> str:
         if ticket.related_ticket_id is not None:
             return (
-                "This is a follow-up operational issue that references prior work. "
                 "Additional routing context is available via investigation."
             )
-        if ticket.ambiguity_note is not None:
             return (
-                "This ticket mixes multiple plausible workflows. "
                 "Additional routing context is available via investigation."
             )
         if self._ticket_has_nondefault_routing(ticket):
             return (
-                "The visible request looks straightforward, but the decisive routing "
-                "detail is hidden until investigation."
             )
         return (
             "Additional routing context is available via investigation before final submission."
         )
     def _visible_description(self, ticket: HelpdeskTicketRecord) -> str:
         if self._state.current_task_id == 3 and self._remaining_tools_for_ticket(ticket):
             return HARD_TASK_DESCRIPTION_REDACTIONS.get(
@@ -537,7 +638,11 @@ class HelpdeskTicketRoutingEnvironment(
         penalty = PREMATURE_SUBMIT_PENALTY * (
             len(remaining_tools) / max(1, len(required_tools))
         )
-        return penalty, len(remaining_tools)
     def _context_completion_bonus(
         self,
@@ -691,12 +796,13 @@ class HelpdeskTicketRoutingEnvironment(
         }
     def _lookup_internal_routing_note(self, current_ticket: HelpdeskTicketRecord) -> dict[str, Any]:
-        found = current_ticket.ambiguity_note is not None
         return {
             "tool_name": "lookup_internal_routing_note",
             "found": found,
             "ticket_id": current_ticket.ticket_id,
-            "routing_note": current_ticket.ambiguity_note if found else "",
         }
     def _run_investigation_tool(
@@ -753,10 +859,8 @@ class HelpdeskTicketRoutingEnvironment(
             self._state.investigation_budget_remaining - 1,
         )
         self._state.last_tool_result = tool_result
-        investigation_reward = clamp_open_unit_interval(
-            USEFUL_INVESTIGATION_REWARD if useful_investigation else 0.0
-        )
-        investigation_score = clamp_open_unit_interval(0.0)
         self._state.last_step_reward = investigation_reward
         self._state.reward = investigation_reward
         self._state.done = False
@@ -799,7 +903,7 @@ class HelpdeskTicketRoutingEnvironment(
         remaining_tools = progress["remaining_tools"]
         ticket_view: dict[str, Any] = {
             "ticket_id": ticket.ticket_id,
-            "title": ticket.title,
             "requester": ticket.requester,
             "description": self._visible_description(ticket),
         }
@@ -811,6 +915,7 @@ class HelpdeskTicketRoutingEnvironment(
                 "revealed_context_count": progress["revealed_count"],
                 "context_completeness": progress["completeness"],
                 "investigations_used_for_ticket": progress["revealed_count"],
             }
         if ticket.ambiguity_note is not None and "lookup_internal_routing_note" not in remaining_tools:
             ticket_view["ambiguity_note"] = ticket.ambiguity_note

     "lookup_internal_routing_note",
 )
 FREE_INVESTIGATIONS_PER_TICKET = 1
+EXTRA_INVESTIGATION_COST = 0.04
+MAX_EXTRA_INVESTIGATION_PENALTY = 0.25
+USEFUL_INVESTIGATION_REWARD = 0.03
+PREMATURE_SUBMIT_PENALTY = 0.22
+NONDEFAULT_HIDDEN_CONTEXT_PENALTY = 0.08
+CONTEXT_COMPLETION_BONUS = 0.06
+TRAJECTORY_CONTEXT_COMPLETION_BONUS = 0.04
 PRIORITY_UNDERSHOOT_PENALTY = 0.03
 SEVERE_PRIORITY_UNDERSHOOT_PENALTY = 0.07
 DANGEROUS_RESOLUTION_PENALTY = 0.05
     ),
 }
+HARD_TASK_TITLE_REDACTIONS: dict[str, str] = {
+    "ticket-021": "Production workflow regression",
+    "ticket-022": "Time-sensitive account review",
+    "ticket-027": "Commercial workflow request",
+    "ticket-029": "Urgent expansion request",
+    "ticket-038": "Repeated invoice follow-up",
+    "ticket-045": "Company-wide account issue",
+    "TKT-NONDEFAULT-001": "Billing-style routing question",
+    "TKT-NONDEFAULT-002": "Compliance ownership question",
+    "TKT-NONDEFAULT-003": "Workflow blocker with hidden owner",
+}
 def _coerce_optional_int(value: Any, field_name: str) -> Optional[int]:
     if value is None or value == "":
             return 0.0
         return sum(self._state.per_ticket_scores) / len(self._state.per_ticket_scores)
+    def _internal_routing_note_for_ticket(
+        self,
+        ticket: HelpdeskTicketRecord,
+    ) -> str | None:
+        if ticket.ambiguity_note is not None:
+            return ticket.ambiguity_note
+        if self._state.current_task_id != 3:
+            return None
+        default_group = ISSUE_TYPE_TO_ASSIGNMENT_GROUP.get(
+            ticket.issue_type,
+            ticket.assignment_group,
+        )
+        default_action = ISSUE_TYPE_TO_RESOLUTION_ACTION.get(
+            ticket.issue_type,
+            ticket.resolution_action,
+        )
+        note_parts: list[str] = []
+        if ticket.assignment_group != default_group:
+            note_parts.append(
+                "Routing override: send this to "
+                f"{ticket.assignment_group} rather than the default {default_group} queue."
+            )
+        if ticket.resolution_action != default_action:
+            note_parts.append(
+                "Action override: use "
+                f"{ticket.resolution_action} instead of the default {default_action} next step."
+            )
+        if ticket.issue_type == "onboarding" and ticket.assignment_group == "service_desk":
+            note_parts.append(
+                "The onboarding workflow is blocked by an access dependency, so the unblocker owns the next move."
+            )
+        if (
+            ticket.issue_type == "security_compliance"
+            and ticket.assignment_group == "application_team"
+        ):
+            note_parts.append(
+                "This compliance issue needs a product-team fix rather than a central security handoff."
+            )
+        if ticket.issue_type == "billing_license" and ticket.assignment_group == "procurement":
+            note_parts.append(
+                "Treat this as commercial procurement work instead of routine license fulfillment."
+            )
+        if not note_parts:
+            return None
+        return " ".join(note_parts)
     def _ticket_has_nondefault_routing(self, ticket: HelpdeskTicketRecord) -> bool:
         return (
             ticket.assignment_group
     def _ticket_repeated_requester_count(self, ticket: HelpdeskTicketRecord) -> int:
         return sum(1 for candidate in self._dataset if candidate.requester == ticket.requester)
+    def _tool_has_available_context(
+        self,
+        ticket: HelpdeskTicketRecord,
+        tool_name: str,
+    ) -> bool:
+        if tool_name == "lookup_related_ticket":
+            return (
+                ticket.related_ticket_id is not None
+                and ticket.related_ticket_id in self._tickets_by_id
+            )
+        if tool_name == "lookup_requester_history":
+            return self._ticket_repeated_requester_count(ticket) >= 2
+        if tool_name == "lookup_internal_routing_note":
+            return self._internal_routing_note_for_ticket(ticket) is not None
+        return False
     def _required_tools_for_ticket(
         self,
         ticket: HelpdeskTicketRecord,
         if ticket.related_ticket_id is not None and "lookup_related_ticket" not in required_tools:
             required_tools.append("lookup_related_ticket")
         if (
+            self._internal_routing_note_for_ticket(ticket) is not None
+            and "lookup_internal_routing_note" not in required_tools
+        ):
             required_tools.append("lookup_internal_routing_note")
         if (
             self._ticket_repeated_requester_count(ticket) >= 2
             and "lookup_requester_history" not in required_tools
         ):
             required_tools.append("lookup_requester_history")
+        filtered_required_tools: list[str] = []
+        for tool_name in required_tools:
+            if tool_name in filtered_required_tools:
+                continue
+            if self._tool_has_available_context(ticket, tool_name):
+                filtered_required_tools.append(tool_name)
+        return filtered_required_tools
     def _used_tools_for_ticket(self, ticket_id: str) -> list[str]:
         return list(self._state.ticket_tool_usage.get(ticket_id, []))
     def _default_redacted_description(self, ticket: HelpdeskTicketRecord) -> str:
         if ticket.related_ticket_id is not None:
             return (
+                "This is a follow-up operational issue. "
                 "Additional routing context is available via investigation."
             )
+        if self._internal_routing_note_for_ticket(ticket) is not None:
             return (
+                "The visible request is not enough to choose the final owner and next step. "
                 "Additional routing context is available via investigation."
             )
         if self._ticket_has_nondefault_routing(ticket):
             return (
+                "The visible request looks straightforward, but the decisive routing detail is hidden until investigation."
             )
         return (
             "Additional routing context is available via investigation before final submission."
         )
+    def _default_redacted_title(self, ticket: HelpdeskTicketRecord) -> str:
+        if ticket.related_ticket_id is not None:
+            return "Follow-up request with hidden routing context"
+        if self._internal_routing_note_for_ticket(ticket) is not None:
+            return "Routing clarification required"
+        if self._ticket_mentions_follow_up(ticket):
+            return "Priority support follow-up"
+        return "Helpdesk routing decision"
+    def _visible_title(self, ticket: HelpdeskTicketRecord) -> str:
+        if self._state.current_task_id == 3 and self._remaining_tools_for_ticket(ticket):
+            return HARD_TASK_TITLE_REDACTIONS.get(
+                ticket.ticket_id,
+                self._default_redacted_title(ticket),
+            )
+        return ticket.title
     def _visible_description(self, ticket: HelpdeskTicketRecord) -> str:
         if self._state.current_task_id == 3 and self._remaining_tools_for_ticket(ticket):
             return HARD_TASK_DESCRIPTION_REDACTIONS.get(
         penalty = PREMATURE_SUBMIT_PENALTY * (
             len(remaining_tools) / max(1, len(required_tools))
         )
+        if self._ticket_has_nondefault_routing(ticket):
+            penalty += NONDEFAULT_HIDDEN_CONTEXT_PENALTY * (
+                len(remaining_tools) / max(1, len(required_tools))
+            )
+        return round(min(0.45, penalty), 4), len(remaining_tools)
     def _context_completion_bonus(
         self,
         }
     def _lookup_internal_routing_note(self, current_ticket: HelpdeskTicketRecord) -> dict[str, Any]:
+        routing_note = self._internal_routing_note_for_ticket(current_ticket)
+        found = routing_note is not None
         return {
             "tool_name": "lookup_internal_routing_note",
             "found": found,
             "ticket_id": current_ticket.ticket_id,
+            "routing_note": routing_note if found else "",
         }
     def _run_investigation_tool(
             self._state.investigation_budget_remaining - 1,
         )
         self._state.last_tool_result = tool_result
+        investigation_reward = USEFUL_INVESTIGATION_REWARD if useful_investigation else 0.0
+        investigation_score = 0.0
         self._state.last_step_reward = investigation_reward
         self._state.reward = investigation_reward
         self._state.done = False
         remaining_tools = progress["remaining_tools"]
         ticket_view: dict[str, Any] = {
             "ticket_id": ticket.ticket_id,
+            "title": self._visible_title(ticket),
             "requester": ticket.requester,
             "description": self._visible_description(ticket),
         }
                 "revealed_context_count": progress["revealed_count"],
                 "context_completeness": progress["completeness"],
                 "investigations_used_for_ticket": progress["revealed_count"],
+                "recommended_tools": list(remaining_tools),
             }
         if ticket.ambiguity_note is not None and "lookup_internal_routing_note" not in remaining_tools:
             ticket_view["ambiguity_note"] = ticket.ambiguity_note
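
Reviewer sketch (not part of the commit): the new premature-submission penalty can be reproduced as a standalone function. The constants and the arithmetic come from the new side of this file's diff. The free-function form, the parameter names, and the constant name `PENALTY_CAP` (the literal `0.45` in the diff) are hypothetical.

```python
PREMATURE_SUBMIT_PENALTY = 0.22
NONDEFAULT_HIDDEN_CONTEXT_PENALTY = 0.08
PENALTY_CAP = 0.45  # appears as a bare literal in the diff; the name is an assumption


def premature_submit_penalty(
    remaining_tools: list[str],
    required_tools: list[str],
    nondefault_routing: bool,
) -> tuple[float, int]:
    """Scale the penalty by the share of required context still hidden."""
    hidden_share = len(remaining_tools) / max(1, len(required_tools))
    penalty = PREMATURE_SUBMIT_PENALTY * hidden_share
    if nondefault_routing:
        # Extra charge when the ticket's true routing differs from the default.
        penalty += NONDEFAULT_HIDDEN_CONTEXT_PENALTY * hidden_share
    return round(min(PENALTY_CAP, penalty), 4), len(remaining_tools)


required = ["lookup_related_ticket", "lookup_internal_routing_note"]
print(premature_submit_penalty(["lookup_internal_routing_note"], required, True))
# (0.15, 1)
print(premature_submit_penalty(required, required, True))
# (0.3, 2)
```

Assuming the remaining tools are always a subset of the required tools, the hidden share never exceeds 1, so the largest value this formula produces is 0.22 + 0.08 = 0.30. With the current constants the 0.45 cap therefore acts as a safety bound rather than an active limit.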

server/grader.py CHANGED

@@ -24,6 +24,32 @@ ISSUE_TYPE_SIMILARITY = {
     ("billing_license", "security_compliance"): 0.2,
 }
 PRIORITY_SCORES = {
     ("critical", "high"): 0.6,
     ("high", "critical"): 0.6,
@@ -66,6 +92,20 @@ def _score_exact_or_similar(predicted: str | None, expected: str) -> float:
     return ISSUE_TYPE_SIMILARITY.get((pred, exp), 0.0)
 def _score_priority(predicted: str | None, expected: str) -> float:
     pred = _normalized(predicted)
     exp = _normalized(expected)
@@ -91,11 +131,15 @@ def grade_action(
     field_scores = {
         "issue_type": _score_exact_or_similar(action.issue_type, ticket.issue_type),
         "priority": _score_priority(action.priority, ticket.priority),
-        "assignment_group": _score_exact(
-            action.assignment_group, ticket.assignment_group
         ),
-        "resolution_action": _score_exact(
-            action.resolution_action, ticket.resolution_action
         ),
     }

     ("billing_license", "security_compliance"): 0.2,
 }
+ASSIGNMENT_GROUP_SIMILARITY = {
+    ("procurement", "license_ops"): 0.55,
+    ("license_ops", "procurement"): 0.55,
+    ("service_desk", "onboarding_ops"): 0.5,
+    ("onboarding_ops", "service_desk"): 0.5,
+    ("application_team", "security_team"): 0.35,
+    ("security_team", "application_team"): 0.35,
+    ("service_desk", "application_team"): 0.25,
+    ("application_team", "service_desk"): 0.25,
+    ("service_desk", "security_team"): 0.2,
+    ("security_team", "service_desk"): 0.2,
+}
+RESOLUTION_ACTION_SIMILARITY = {
+    ("assign", "escalate"): 0.6,
+    ("escalate", "assign"): 0.6,
+    ("acknowledge", "fulfill"): 0.35,
+    ("fulfill", "acknowledge"): 0.35,
+    ("assign", "fulfill"): 0.25,
+    ("fulfill", "assign"): 0.25,
+    ("escalate", "fulfill"): 0.2,
+    ("fulfill", "escalate"): 0.2,
+    ("acknowledge", "assign"): 0.2,
+    ("assign", "acknowledge"): 0.2,
+}
 PRIORITY_SCORES = {
     ("critical", "high"): 0.6,
     ("high", "critical"): 0.6,
     return ISSUE_TYPE_SIMILARITY.get((pred, exp), 0.0)
+def _score_exact_or_table(
+    predicted: str | None,
+    expected: str,
+    similarity_table: dict[tuple[str, str], float],
+) -> float:
+    pred = _normalized(predicted)
+    exp = _normalized(expected)
+    if not pred:
+        return 0.0
+    if pred == exp:
+        return 1.0
+    return similarity_table.get((pred, exp), 0.0)
 def _score_priority(predicted: str | None, expected: str) -> float:
     pred = _normalized(predicted)
     exp = _normalized(expected)
     field_scores = {
         "issue_type": _score_exact_or_similar(action.issue_type, ticket.issue_type),
         "priority": _score_priority(action.priority, ticket.priority),
+        "assignment_group": _score_exact_or_table(
+            action.assignment_group,
+            ticket.assignment_group,
+            ASSIGNMENT_GROUP_SIMILARITY,
         ),
+        "resolution_action": _score_exact_or_table(
+            action.resolution_action,
+            ticket.resolution_action,
+            RESOLUTION_ACTION_SIMILARITY,
         ),
     }
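
Reviewer sketch (not part of the commit): the declared partial-credit lookup can be exercised on its own as below. The table is a four-entry excerpt of `ASSIGNMENT_GROUP_SIMILARITY` from this diff. The body of `_normalized` is not shown in the diff, so the strip-and-lowercase version here is an assumption.

```python
from typing import Optional

# Excerpt of the table added in this commit (remaining pairs omitted).
ASSIGNMENT_GROUP_SIMILARITY: dict[tuple[str, str], float] = {
    ("procurement", "license_ops"): 0.55,
    ("license_ops", "procurement"): 0.55,
    ("service_desk", "onboarding_ops"): 0.5,
    ("onboarding_ops", "service_desk"): 0.5,
}


def _normalized(value: Optional[str]) -> str:
    """Assumed behavior: treat None as empty, then trim and lowercase."""
    return (value or "").strip().lower()


def score_exact_or_table(
    predicted: Optional[str],
    expected: str,
    similarity_table: dict[tuple[str, str], float],
) -> float:
    pred = _normalized(predicted)
    exp = _normalized(expected)
    if not pred:
        return 0.0  # missing prediction earns nothing
    if pred == exp:
        return 1.0  # exact match always wins over the table
    # Only explicitly declared (predicted, expected) pairs earn partial credit.
    return similarity_table.get((pred, exp), 0.0)


print(score_exact_or_table("procurement", "license_ops", ASSIGNMENT_GROUP_SIMILARITY))    # 0.55
print(score_exact_or_table("security_team", "license_ops", ASSIGNMENT_GROUP_SIMILARITY))  # 0.0
```

Because the lookup key is the ordered pair `(predicted, expected)`, symmetry is not automatic. The commit declares each pair in both directions, and any undeclared pair falls through to `0.0`, which matches the README statement that wrong labels outside the explicit maps score `0.0`.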

tests/test_api_integration.py CHANGED

@@ -529,8 +529,8 @@ class TestHeuristicInferenceRegression(unittest.TestCase):
         overall_avg = sum(rewards) / len(rewards)
         self.assertGreaterEqual(
             overall_avg,
-            0.8,
-            f"Overall average reward {overall_avg:.4f} is below 0.8 (baseline: 0.9400)",
         )
         self.assertLessEqual(
             overall_avg,

         overall_avg = sum(rewards) / len(rewards)
         self.assertGreaterEqual(
             overall_avg,
+            0.75,
+            f"Overall average reward {overall_avg:.4f} is below the smoke-test floor of 0.75",
         )
         self.assertLessEqual(
             overall_avg,

tests/test_competitive_upgrade.py CHANGED

@@ -100,11 +100,11 @@ def _get_tasks_to_run_impl(
         if task_id not in available_tasks:
             raise SystemExit(1)
         return [task_id]
-    if run_all_tasks:
-        return sorted(available_tasks)
     if not available_tasks:
         return []
-    return [sorted(available_tasks)[0]]
 class TestInferenceSingleTaskMode(unittest.TestCase):
@@ -120,10 +120,10 @@ class TestInferenceSingleTaskMode(unittest.TestCase):
         with self.assertRaises(SystemExit):
             _get_tasks_to_run_impl("999", available)
-    def test_task_id_unset_defaults_to_first_available_task(self) -> None:
         available = {1: {}, 2: {}, 3: {}}
         result = _get_tasks_to_run_impl(None, available)
-        self.assertEqual(result, [1])
     def test_run_all_tasks_override_returns_all_task_ids(self) -> None:
         available = {1: {}, 2: {}, 3: {}}
@@ -710,7 +710,7 @@ class TestQueueEconomics(unittest.TestCase):
         final_obs = env.step(HelpdeskTicketAction(issue_type=ticket.issue_type))
         self.assertTrue(final_obs.done)
-        self.assertAlmostEqual(final_obs.reward, 0.97, places=9)
 class TestTerminalInvalidActionFinalReward(unittest.TestCase):

         if task_id not in available_tasks:
             raise SystemExit(1)
         return [task_id]
     if not available_tasks:
         return []
+    if run_all_tasks:
+        return sorted(available_tasks)
+    return sorted(available_tasks)
 class TestInferenceSingleTaskMode(unittest.TestCase):
         with self.assertRaises(SystemExit):
             _get_tasks_to_run_impl("999", available)
+    def test_task_id_unset_defaults_to_all_available_tasks(self) -> None:
         available = {1: {}, 2: {}, 3: {}}
         result = _get_tasks_to_run_impl(None, available)
+        self.assertEqual(result, [1, 2, 3])
     def test_run_all_tasks_override_returns_all_task_ids(self) -> None:
         available = {1: {}, 2: {}, 3: {}}
         final_obs = env.step(HelpdeskTicketAction(issue_type=ticket.issue_type))
         self.assertTrue(final_obs.done)
+        self.assertAlmostEqual(final_obs.reward, 0.95, places=9)
 class TestTerminalInvalidActionFinalReward(unittest.TestCase):
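
Reviewer sketch (not part of the commit): the new default task-selection behavior can be summarized as below. The name `select_tasks` is hypothetical, and the integer parsing of the raw `TASK_ID` string is an assumption, since the top of `_get_tasks_to_run_impl` is outside the visible hunk.

```python
from typing import Any, Optional


def select_tasks(task_id_raw: Optional[str], available_tasks: dict[int, Any]) -> list[int]:
    """Return the single requested task, or every declared task when none is requested."""
    if task_id_raw is not None:
        try:
            task_id = int(task_id_raw)  # assumed parsing step, not shown in the diff
        except ValueError:
            raise SystemExit(1)
        if task_id not in available_tasks:
            raise SystemExit(1)
        return [task_id]
    # sorted() of an empty mapping is already [], so no separate empty check is needed.
    return sorted(available_tasks)


available = {1: {}, 2: {}, 3: {}}
print(select_tasks(None, available))  # [1, 2, 3]
print(select_tasks("3", available))   # [3]
```

In the helper on the new side of this hunk, the `if run_all_tasks:` branch and the final `return` both yield `sorted(available_tasks)`, so the flag no longer changes the result. That agrees with the README describing `RUN_ALL_TASKS` as a compatibility alias.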

tests/test_grader_unit.py CHANGED

@@ -6,12 +6,14 @@ import openenv_test_stubs  # noqa: F401
 from models import HelpdeskTicketAction, HelpdeskTicketRecord
 from server.grader import (
     ISSUE_TYPE_SIMILARITY,
     PRIORITY_SCORES,
     TASK_WEIGHTS,
     grade_action,
 )
-from vocabulary import ISSUE_TYPES, PRIORITIES
 def _ticket(
@@ -143,12 +145,26 @@ class GraderUnitTests(unittest.TestCase):
         self.assertEqual(breakdown, {"issue_type": 1.0, "priority": 0.5})
         self.assertAlmostEqual(score, 0.8)
-    def test_assignment_group_is_exact_match_only(self) -> None:
         ticket = _ticket()
         action = HelpdeskTicketAction(
             issue_type="billing_license",
             priority="high",
-            assignment_group="service_desk",
             resolution_action="fulfill",
         )
@@ -162,7 +178,7 @@ class GraderUnitTests(unittest.TestCase):
         action = HelpdeskTicketAction(
             issue_type="billing_license",
             priority="medium",
-            assignment_group="service_desk",
             resolution_action="fulfill",
         )
@@ -179,13 +195,27 @@ class GraderUnitTests(unittest.TestCase):
         )
         self.assertAlmostEqual(score, 0.65)
-    def test_resolution_action_is_exact_match_only(self) -> None:
         ticket = _ticket()
         action = HelpdeskTicketAction(
             issue_type="billing_license",
             priority="high",
             assignment_group="license_ops",
-            resolution_action="assign",
         )
         score, breakdown = grade_action(action, ticket, task_id=3)
@@ -193,6 +223,70 @@ class GraderUnitTests(unittest.TestCase):
         self.assertEqual(breakdown["resolution_action"], 0.0)
         self.assertAlmostEqual(score, 0.8)
     def test_partial_credit_tables_never_override_exact_match(self) -> None:
         for pair, value in ISSUE_TYPE_SIMILARITY.items():
             with self.subTest(table="issue_type", pair=pair):
@@ -204,6 +298,16 @@ class GraderUnitTests(unittest.TestCase):
                 self.assertGreater(value, 0.0)
                 self.assertLess(value, 1.0)
     def test_task_weights_sum_to_one_for_each_task(self) -> None:
         for task_id, weights in TASK_WEIGHTS.items():
             with self.subTest(task_id=task_id):

 from models import HelpdeskTicketAction, HelpdeskTicketRecord
 from server.grader import (
+    ASSIGNMENT_GROUP_SIMILARITY,
     ISSUE_TYPE_SIMILARITY,
     PRIORITY_SCORES,
+    RESOLUTION_ACTION_SIMILARITY,
     TASK_WEIGHTS,
     grade_action,
 )
+from vocabulary import ASSIGNMENT_GROUPS, ISSUE_TYPES, PRIORITIES, RESOLUTION_ACTIONS
 def _ticket(
         self.assertEqual(breakdown, {"issue_type": 1.0, "priority": 0.5})
         self.assertAlmostEqual(score, 0.8)
+    def test_assignment_group_partial_credit_uses_declared_similarity_table(self) -> None:
         ticket = _ticket()
         action = HelpdeskTicketAction(
             issue_type="billing_license",
             priority="high",
+            assignment_group="procurement",
+            resolution_action="fulfill",
+        )
+        score, breakdown = grade_action(action, ticket, task_id=3)
+        self.assertEqual(breakdown["assignment_group"], 0.55)
+        self.assertAlmostEqual(score, 0.8875)
+    def test_assignment_group_unrelated_miss_stays_zero(self) -> None:
+        ticket = _ticket()
+        action = HelpdeskTicketAction(
+            issue_type="billing_license",
+            priority="high",
+            assignment_group="security_team",
             resolution_action="fulfill",
         )
         action = HelpdeskTicketAction(
             issue_type="billing_license",
             priority="medium",
+            assignment_group="security_team",
             resolution_action="fulfill",
         )
         )
         self.assertAlmostEqual(score, 0.65)
+    def test_resolution_action_partial_credit_uses_declared_similarity_table(self) -> None:
         ticket = _ticket()
         action = HelpdeskTicketAction(
             issue_type="billing_license",
             priority="high",
             assignment_group="license_ops",
+            resolution_action="acknowledge",
+        )
+        score, breakdown = grade_action(action, ticket, task_id=3)
+        self.assertEqual(breakdown["resolution_action"], 0.35)
+        self.assertAlmostEqual(score, 0.87)
+    def test_resolution_action_unrelated_miss_stays_zero(self) -> None:
+        ticket = _ticket()
+        action = HelpdeskTicketAction(
+            issue_type="billing_license",
+            priority="high",
+            assignment_group="license_ops",
+            resolution_action="ignore",
         )
         score, breakdown = grade_action(action, ticket, task_id=3)
         self.assertEqual(breakdown["resolution_action"], 0.0)
         self.assertAlmostEqual(score, 0.8)
+    def test_assignment_group_scoring_matches_declared_similarity_table_exhaustively(self) -> None:
+        for expected in ASSIGNMENT_GROUPS:
+            for predicted in ASSIGNMENT_GROUPS:
+                with self.subTest(expected=expected, predicted=predicted):
+                    ticket = _ticket(assignment_group=expected)
+                    action = HelpdeskTicketAction(
+                        issue_type="billing_license",
+                        priority="high",
+                        assignment_group=predicted,
+                        resolution_action="fulfill",
+                    )
+                    score, breakdown = grade_action(action, ticket, task_id=3)
+                    assignment_group_score = (
+                        1.0
+                        if predicted == expected
+                        else ASSIGNMENT_GROUP_SIMILARITY.get((predicted, expected), 0.0)
+                    )
+                    self.assertEqual(
+                        breakdown,
+                        {
+                            "issue_type": 1.0,
+                            "priority": 1.0,
+                            "assignment_group": assignment_group_score,
+                            "resolution_action": 1.0,
+                        },
+                    )
+                    raw_score = 0.35 + 0.20 + 0.25 * assignment_group_score + 0.20
+                    expected_task_score = max(0.01, min(0.99, raw_score))
+                    self.assertAlmostEqual(score, expected_task_score)
+    def test_resolution_action_scoring_matches_declared_similarity_table_exhaustively(self) -> None:
+        for expected in RESOLUTION_ACTIONS:
+            for predicted in RESOLUTION_ACTIONS:
+                with self.subTest(expected=expected, predicted=predicted):
+                    ticket = _ticket(resolution_action=expected)
+                    action = HelpdeskTicketAction(
+                        issue_type="billing_license",
+                        priority="high",
+                        assignment_group="license_ops",
+                        resolution_action=predicted,
+                    )
+                    score, breakdown = grade_action(action, ticket, task_id=3)
+                    resolution_action_score = (
+                        1.0
+                        if predicted == expected
+                        else RESOLUTION_ACTION_SIMILARITY.get((predicted, expected), 0.0)
+                    )
+                    self.assertEqual(
+                        breakdown,
+                        {
+                            "issue_type": 1.0,
+                            "priority": 1.0,
+                            "assignment_group": 1.0,
+                            "resolution_action": resolution_action_score,
+                        },
+                    )
+                    raw_score = 0.35 + 0.20 + 0.25 + 0.20 * resolution_action_score
+                    expected_task_score = max(0.01, min(0.99, raw_score))
+                    self.assertAlmostEqual(score, expected_task_score)
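
The expected scores asserted in these tests follow from a weighted sum over the per-field breakdown, clamped to `[0.01, 0.99]`. The sketch below reproduces that arithmetic using the task-3 weights that appear in the exhaustive tests. `combine` and `TASK3_WEIGHTS` are illustrative names, not the grader's own; the real logic lives in `server/grader.py` and may differ in detail.

```python
# Task-3 field weights as used in the exhaustive tests (0.35 / 0.20 / 0.25 / 0.20).
TASK3_WEIGHTS = {
    "issue_type": 0.35,
    "priority": 0.20,
    "assignment_group": 0.25,
    "resolution_action": 0.20,
}


def combine(breakdown, weights=TASK3_WEIGHTS):
    """Weighted sum of per-field scores, clamped the same way the tests clamp it."""
    raw = sum(weights[field] * breakdown[field] for field in weights)
    return max(0.01, min(0.99, raw))


# Near-miss assignment group (0.55 partial credit), everything else exact.
near_miss_group = combine(
    {"issue_type": 1.0, "priority": 1.0, "assignment_group": 0.55, "resolution_action": 1.0}
)

# Near-miss resolution action (0.35 partial credit), everything else exact.
near_miss_action = combine(
    {"issue_type": 1.0, "priority": 1.0, "assignment_group": 1.0, "resolution_action": 0.35}
)

# All fields exact: the raw score of 1.0 is clamped to 0.99.
perfect = combine(
    {"issue_type": 1.0, "priority": 1.0, "assignment_group": 1.0, "resolution_action": 1.0}
)
```

With these inputs, `near_miss_group` comes to 0.8875 and `near_miss_action` to 0.87, matching the values asserted in the two partial-credit tests.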
     def test_partial_credit_tables_never_override_exact_match(self) -> None:
         for pair, value in ISSUE_TYPE_SIMILARITY.items():
             with self.subTest(table="issue_type", pair=pair):
                 self.assertGreater(value, 0.0)
                 self.assertLess(value, 1.0)
+        for pair, value in ASSIGNMENT_GROUP_SIMILARITY.items():
+            with self.subTest(table="assignment_group", pair=pair):
+                self.assertGreater(value, 0.0)
+                self.assertLess(value, 1.0)
+        for pair, value in RESOLUTION_ACTION_SIMILARITY.items():
+            with self.subTest(table="resolution_action", pair=pair):
+                self.assertGreater(value, 0.0)
+                self.assertLess(value, 1.0)
     def test_task_weights_sum_to_one_for_each_task(self) -> None:
         for task_id, weights in TASK_WEIGHTS.items():
             with self.subTest(task_id=task_id):