Strengthen hard-task investigation and grading
- README.md +17 -18
- inference.py +15 -9
- server/environment.py +127 -22
- server/grader.py +48 -4
- tests/test_api_integration.py +2 -2
- tests/test_competitive_upgrade.py +6 -6
- tests/test_grader_unit.py +110 -6
README.md
CHANGED
|
@@ -205,9 +205,10 @@ Available tools:
|
|
| 205 |
|
| 206 |
Hard-task investigation behavior:
|
| 207 |
|
| 208 |
-
- some ambiguous and non-default-routing tickets start with redacted descriptions
|
| 209 |
- linked-ticket previews and internal routing notes stay hidden until the matching tool is used
|
| 210 |
-
- useful investigation steps return a small positive shaping reward
|
|
|
|
| 211 |
- premature hard-task submission can incur a shaping penalty even when the visible text looks plausible
|
| 212 |
- terminal `rubric_reward` remains the objective evaluation signal, while per-step `reward` is the denser training signal
|
| 213 |
|
|
@@ -215,8 +216,8 @@ Per-field behavior:
|
|
| 215 |
|
| 216 |
- `issue_type`: exact match, with a few near-miss partial-credit pairs
|
| 217 |
- `priority`: exact match or proximity credit
|
| 218 |
-
- `assignment_group`: exact match
|
| 219 |
-
- `resolution_action`: exact match
|
| 220 |
|
| 221 |
Task weights:
|
| 222 |
|
|
@@ -236,7 +237,7 @@ The result is clamped to `[0.0, 1.0]`.
|
|
| 236 |
|
| 237 |
Step reward is lightly milestone-shaped: high per-ticket scores get a small bonus and very low scores get a small penalty before the final clamp.
|
| 238 |
|
| 239 |
-
Final reward also includes a
|
| 240 |
|
| 241 |
To make the environment more RL-friendly, each observation now also surfaces structured reward telemetry:
|
| 242 |
|
|
@@ -246,10 +247,10 @@ To make the environment more RL-friendly, each observation now also surfaces str
|
|
| 246 |
|
| 247 |
## Grounded Scoring
|
| 248 |
|
| 249 |
-
The grader is intentionally
|
| 250 |
|
| 251 |
- exact match is the dominant path for every field
|
| 252 |
-
- `assignment_group` and `resolution_action`
|
| 253 |
- `priority` only gets proximity credit from the declared table in `server/grader.py`
|
| 254 |
- `issue_type` only gets partial credit for a small declared similarity map
|
| 255 |
- wrong labels outside those explicit maps score `0.0`
|
|
@@ -363,7 +364,7 @@ curl http://localhost:7860/tasks
|
|
| 363 |
|
| 364 |
## Running The Baseline Inference Script
|
| 365 |
|
| 366 |
-
The baseline script
|
| 367 |
|
| 368 |
### Heuristic mode
|
| 369 |
|
|
@@ -373,7 +374,7 @@ If no LLM credentials are set, it uses a keyword-based ticket router:
|
|
| 373 |
python inference.py
|
| 374 |
```
|
| 375 |
|
| 376 |
-
By default that runs
|
| 377 |
|
| 378 |
```bash
|
| 379 |
TASK_ID=3 python inference.py
|
|
@@ -401,6 +402,7 @@ Optional target:
|
|
| 401 |
- `SEED`
|
| 402 |
- `TASK_ID`
|
| 403 |
- `RUN_ALL_TASKS`
|
|
|
|
| 404 |
|
| 405 |
To reproduce the multi-task local benchmark sweep:
|
| 406 |
|
|
@@ -420,16 +422,13 @@ Validated locally:
|
|
| 420 |
- `/reset`
|
| 421 |
- heuristic `inference.py` run across all 3 tasks with `RUN_ALL_TASKS=1`
|
| 422 |
|
| 423 |
-
Current local
|
| 424 |
|
| 425 |
-
|
| 426 |
-
|
| 427 |
-
|
| 428 |
-
| Issue Type And Priority | `0.8800` |
|
| 429 |
-
| Full Ticket Routing | `0.9400` |
|
| 430 |
-
| Overall | `0.9400` |
|
| 431 |
|
| 432 |
-
The
|
| 433 |
|
| 434 |
### Windows note
|
| 435 |
|
|
@@ -440,7 +439,7 @@ During the first runtime pass, the repo surfaced a Windows-specific JSON issue w
|
|
| 440 |
Build:
|
| 441 |
|
| 442 |
```bash
|
| 443 |
-
docker build -
|
| 444 |
```
|
| 445 |
|
| 446 |
Run locally:
|
|
|
|
| 205 |
|
| 206 |
Hard-task investigation behavior:
|
| 207 |
|
| 208 |
+
- some ambiguous and non-default-routing tickets start with both redacted titles and redacted descriptions
|
| 209 |
- linked-ticket previews and internal routing notes stay hidden until the matching tool is used
|
| 210 |
+
- only useful investigation steps return a small positive shaping reward
|
| 211 |
+
- blind or repeated probing does not pay by default
|
| 212 |
- premature hard-task submission can incur a shaping penalty even when the visible text looks plausible
|
| 213 |
- terminal `rubric_reward` remains the objective evaluation signal, while per-step `reward` is the denser training signal
|
| 214 |
|
|
|
|
| 216 |
|
| 217 |
- `issue_type`: exact match, with a few near-miss partial-credit pairs
|
| 218 |
- `priority`: exact match or proximity credit
|
| 219 |
+
- `assignment_group`: exact match, with a small declared partial-credit map for nearby ownership mistakes
|
| 220 |
+
- `resolution_action`: exact match, with a small declared partial-credit map for nearby next-step mistakes
|
| 221 |
|
| 222 |
Task weights:
|
| 223 |
|
|
|
|
| 237 |
|
| 238 |
Step reward is lightly milestone-shaped: high per-ticket scores get a small bonus and very low scores get a small penalty before the final clamp.
|
| 239 |
|
| 240 |
+
Final reward also includes a queue-economics penalty when the agent exceeds the free investigation budget. One investigation per queued ticket is free, but extra investigation steps reduce the final reward more noticeably than before.
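A minimal sketch of how a queue-economics penalty of this kind could be computed. The three constants use the values declared in `server/environment.py` in this commit, but the helper name `queue_economics_penalty` and the formula itself (a linear cost per extra step, capped at a maximum) are assumptions for illustration; the diff does not show the environment's actual computation.

```python
# Values as declared in server/environment.py in this commit.
FREE_INVESTIGATIONS_PER_TICKET = 1
EXTRA_INVESTIGATION_COST = 0.04
MAX_EXTRA_INVESTIGATION_PENALTY = 0.25


def queue_economics_penalty(investigations_used: int, queue_size: int) -> float:
    """Hypothetical helper: linear cost per investigation beyond the free budget, capped."""
    # One free investigation per queued ticket.
    free_budget = FREE_INVESTIGATIONS_PER_TICKET * queue_size
    extra_steps = max(0, investigations_used - free_budget)
    # Assumed shape: each extra step costs a fixed amount up to a ceiling.
    penalty = min(MAX_EXTRA_INVESTIGATION_PENALTY, extra_steps * EXTRA_INVESTIGATION_COST)
    return round(penalty, 4)
```

Under these assumptions, a 3-ticket queue with 3 investigations costs nothing, while 5 investigations would cost two extra steps' worth of penalty.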
|
| 241 |
|
| 242 |
To make the environment more RL-friendly, each observation now also surfaces structured reward telemetry:
|
| 243 |
|
|
|
|
| 247 |
|
| 248 |
## Grounded Scoring
|
| 249 |
|
| 250 |
+
The grader is intentionally narrow and declared, not fully fuzzy.
|
| 251 |
|
| 252 |
- exact match is the dominant path for every field
|
| 253 |
+
- `assignment_group` and `resolution_action` now expose only a small declared partial-credit map for nearby mistakes
|
| 254 |
- `priority` only gets proximity credit from the declared table in `server/grader.py`
|
| 255 |
- `issue_type` only gets partial credit for a small declared similarity map
|
| 256 |
- wrong labels outside those explicit maps score `0.0`
|
|
|
|
| 364 |
|
| 365 |
## Running The Baseline Inference Script
|
| 366 |
|
| 367 |
+
The baseline script defaults to all declared tasks when `TASK_ID` is not set, which keeps local runs aligned with validator-style sweeps.
|
| 368 |
|
| 369 |
### Heuristic mode
|
| 370 |
|
|
|
|
| 374 |
python inference.py
|
| 375 |
```
|
| 376 |
|
| 377 |
+
By default that runs all declared tasks and emits a structured `[START] ... [STEP] ... [END]` block for each task. To target a specific task:
|
| 378 |
|
| 379 |
```bash
|
| 380 |
TASK_ID=3 python inference.py
|
|
|
|
| 402 |
- `SEED`
|
| 403 |
- `TASK_ID`
|
| 404 |
- `RUN_ALL_TASKS`
|
| 405 |
+
compatibility alias for local tooling; all tasks already run by default when `TASK_ID` is unset
|
| 406 |
|
| 407 |
To reproduce the multi-task local benchmark sweep:
|
| 408 |
|
|
|
|
| 422 |
- `/reset`
|
| 423 |
- heuristic `inference.py` run across all 3 tasks with `RUN_ALL_TASKS=1`
|
| 424 |
|
| 425 |
+
Current local smoke expectations:
|
| 426 |
|
| 427 |
+
- the baseline completes all 3 tasks successfully
|
| 428 |
+
- rewards remain in range for every task
|
| 429 |
+
- the hard task now depends much more heavily on investigation behavior, so exact seed-level baseline numbers are no longer treated as the benchmark reference for the repo
|
| 430 |
|
| 431 |
+
The April 6 to April 7 validation pass then closed the remaining roadmap gates with Docker smoke coverage via GitHub Actions, a clean-copy install-and-run rerun, structured inference-log verification, and a passing local `openenv validate` check after checking in `uv.lock`.
|
| 432 |
|
| 433 |
### Windows note
|
| 434 |
|
|
|
|
| 439 |
Build:
|
| 440 |
|
| 441 |
```bash
|
| 442 |
+
docker build -t helpdesk-ticket-routing .
|
| 443 |
```
|
| 444 |
|
| 445 |
Run locally:
|
inference.py
CHANGED
|
@@ -26,12 +26,11 @@ HF_TOKEN
|
|
| 26 |
|
| 27 |
TASK_ID
|
| 28 |
Optional OpenEnv task ID to run. When unset, the script defaults to the
|
| 29 |
-
|
| 30 |
-
block for evaluator-style runs.
|
| 31 |
|
| 32 |
RUN_ALL_TASKS
|
| 33 |
-
Optional local-development
|
| 34 |
-
|
| 35 |
|
| 36 |
LOCAL_IMAGE_NAME
|
| 37 |
Optional compatibility variable from the sample inference pattern.
|
|
@@ -761,6 +760,10 @@ def should_investigate(ticket: dict, history: list[dict[str, Any]]) -> tuple[boo
|
|
| 761 |
if not ticket:
|
| 762 |
return False, None
|
| 763 |
context_status = ticket.get("context_status") or {}
|
| 764 |
current_ticket_id = ticket.get("ticket_id")
|
| 765 |
prior_ticket_history = [
|
| 766 |
entry
|
|
@@ -777,7 +780,6 @@ def should_investigate(ticket: dict, history: list[dict[str, Any]]) -> tuple[boo
|
|
| 777 |
for entry in prior_ticket_history
|
| 778 |
if entry.get("predicted", {}).get("action_type") == "investigate"
|
| 779 |
)
|
| 780 |
-
hidden_context_remaining = bool(context_status.get("hidden_context_remaining"))
|
| 781 |
if investigations_used >= 3:
|
| 782 |
return False, None
|
| 783 |
|
|
@@ -786,6 +788,14 @@ def should_investigate(ticket: dict, history: list[dict[str, Any]]) -> tuple[boo
|
|
| 786 |
for entry in prior_ticket_history
|
| 787 |
if entry.get("predicted", {}).get("action_type") == "investigate"
|
| 788 |
}
|
| 789 |
routing_text = build_routing_text(ticket)
|
| 790 |
last_tool_result = ticket.get("last_tool_result") or {}
|
| 791 |
last_tool_name = str(last_tool_result.get("tool_name", "") or "")
|
|
@@ -859,10 +869,6 @@ def should_investigate(ticket: dict, history: list[dict[str, Any]]) -> tuple[boo
|
|
| 859 |
|
| 860 |
if already_investigated and not hidden_context_remaining:
|
| 861 |
return False, None
|
| 862 |
-
if ticket.get("ambiguity_note") and "lookup_internal_routing_note" not in used_tools:
|
| 863 |
-
return True, "lookup_internal_routing_note"
|
| 864 |
-
if ticket.get("related_ticket_id") and "lookup_related_ticket" not in used_tools:
|
| 865 |
-
return True, "lookup_related_ticket"
|
| 866 |
return False, None
|
| 867 |
|
| 868 |
|
|
|
|
| 26 |
|
| 27 |
TASK_ID
|
| 28 |
Optional OpenEnv task ID to run. When unset, the script defaults to the
|
| 29 |
+
full declared task set so evaluator-style runs exercise every grader.
|
|
|
|
| 30 |
|
| 31 |
RUN_ALL_TASKS
|
| 32 |
+
Optional backwards-compatible local-development alias. The script already
|
| 33 |
+
runs every available task when TASK_ID is unset.
|
| 34 |
|
| 35 |
LOCAL_IMAGE_NAME
|
| 36 |
Optional compatibility variable from the sample inference pattern.
|
|
|
|
| 760 |
if not ticket:
|
| 761 |
return False, None
|
| 762 |
context_status = ticket.get("context_status") or {}
|
| 763 |
+
hidden_context_remaining = bool(context_status.get("hidden_context_remaining"))
|
| 764 |
+
investigation_required = bool(context_status.get("investigation_required"))
|
| 765 |
+
if not investigation_required and not hidden_context_remaining:
|
| 766 |
+
return False, None
|
| 767 |
current_ticket_id = ticket.get("ticket_id")
|
| 768 |
prior_ticket_history = [
|
| 769 |
entry
|
|
|
|
| 780 |
for entry in prior_ticket_history
|
| 781 |
if entry.get("predicted", {}).get("action_type") == "investigate"
|
| 782 |
)
|
|
|
|
| 783 |
if investigations_used >= 3:
|
| 784 |
return False, None
|
| 785 |
|
|
|
|
| 788 |
for entry in prior_ticket_history
|
| 789 |
if entry.get("predicted", {}).get("action_type") == "investigate"
|
| 790 |
}
|
| 791 |
+
recommended_tools = [
|
| 792 |
+
tool_name
|
| 793 |
+
for tool_name in context_status.get("recommended_tools", [])
|
| 794 |
+
if tool_name not in used_tools
|
| 795 |
+
]
|
| 796 |
+
if hidden_context_remaining and recommended_tools:
|
| 797 |
+
return True, recommended_tools[0]
|
| 798 |
+
|
| 799 |
routing_text = build_routing_text(ticket)
|
| 800 |
last_tool_result = ticket.get("last_tool_result") or {}
|
| 801 |
last_tool_name = str(last_tool_result.get("tool_name", "") or "")
|
|
|
|
| 869 |
|
| 870 |
if already_investigated and not hidden_context_remaining:
|
| 871 |
return False, None
|
| 872 |
return False, None
|
| 873 |
|
| 874 |
|
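The `inference.py` changes above move the investigation decision from ticket-field heuristics to the environment's `context_status` hints. A condensed sketch of just that part follows. The function name `pick_recommended_tool` is hypothetical, and the sketch leaves out the per-ticket history filtering, the `investigations_used >= 3` cutoff, and the routing-text heuristics that the real `should_investigate` still runs afterwards.

```python
from typing import Any, Optional


def pick_recommended_tool(
    context_status: dict[str, Any],
    used_tools: set[str],
) -> tuple[bool, Optional[str]]:
    """Hypothetical condensed form of the new early-exit and tool-selection steps."""
    hidden_context_remaining = bool(context_status.get("hidden_context_remaining"))
    investigation_required = bool(context_status.get("investigation_required"))
    # Early exit: nothing is required and nothing is hidden, so do not probe.
    if not investigation_required and not hidden_context_remaining:
        return False, None
    # Prefer the first environment-recommended tool not yet used on this ticket.
    recommended_tools = [
        tool_name
        for tool_name in context_status.get("recommended_tools", [])
        if tool_name not in used_tools
    ]
    if hidden_context_remaining and recommended_tools:
        return True, recommended_tools[0]
    # The real function falls through to further heuristics here (omitted).
    return False, None
```

The design choice this illustrates is that the baseline no longer probes blindly: it only investigates when the observation says context is still hidden.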
server/environment.py
CHANGED
|
@@ -33,12 +33,13 @@ AVAILABLE_TOOLS = (
|
|
| 33 |
"lookup_internal_routing_note",
|
| 34 |
)
|
| 35 |
FREE_INVESTIGATIONS_PER_TICKET = 1
|
| 36 |
-
EXTRA_INVESTIGATION_COST = 0.
|
| 37 |
-
MAX_EXTRA_INVESTIGATION_PENALTY = 0.
|
| 38 |
-
USEFUL_INVESTIGATION_REWARD = 0.
|
| 39 |
-
PREMATURE_SUBMIT_PENALTY = 0.
|
| 40 |
-
|
| 41 |
-
|
|
|
|
| 42 |
PRIORITY_UNDERSHOOT_PENALTY = 0.03
|
| 43 |
SEVERE_PRIORITY_UNDERSHOOT_PENALTY = 0.07
|
| 44 |
DANGEROUS_RESOLUTION_PENALTY = 0.05
|
|
@@ -95,6 +96,18 @@ HARD_TASK_DESCRIPTION_REDACTIONS: dict[str, str] = {
|
|
| 95 |
),
|
| 96 |
}
|
| 97 |
|
| 98 |
|
| 99 |
def _coerce_optional_int(value: Any, field_name: str) -> Optional[int]:
|
| 100 |
if value is None or value == "":
|
|
@@ -412,6 +425,55 @@ class HelpdeskTicketRoutingEnvironment(
|
|
| 412 |
return 0.0
|
| 413 |
return sum(self._state.per_ticket_scores) / len(self._state.per_ticket_scores)
|
| 414 |
|
| 415 |
def _ticket_has_nondefault_routing(self, ticket: HelpdeskTicketRecord) -> bool:
|
| 416 |
return (
|
| 417 |
ticket.assignment_group
|
|
@@ -441,6 +503,22 @@ class HelpdeskTicketRoutingEnvironment(
|
|
| 441 |
def _ticket_repeated_requester_count(self, ticket: HelpdeskTicketRecord) -> int:
|
| 442 |
return sum(1 for candidate in self._dataset if candidate.requester == ticket.requester)
|
| 443 |
|
| 444 |
def _required_tools_for_ticket(
|
| 445 |
self,
|
| 446 |
ticket: HelpdeskTicketRecord,
|
|
@@ -453,8 +531,9 @@ class HelpdeskTicketRoutingEnvironment(
|
|
| 453 |
if ticket.related_ticket_id is not None and "lookup_related_ticket" not in required_tools:
|
| 454 |
required_tools.append("lookup_related_ticket")
|
| 455 |
if (
|
| 456 |
-
|
| 457 |
-
|
|
|
|
| 458 |
required_tools.append("lookup_internal_routing_note")
|
| 459 |
if (
|
| 460 |
self._ticket_repeated_requester_count(ticket) >= 2
|
|
@@ -467,7 +546,13 @@ class HelpdeskTicketRoutingEnvironment(
|
|
| 467 |
and "lookup_requester_history" not in required_tools
|
| 468 |
):
|
| 469 |
required_tools.append("lookup_requester_history")
|
| 470 |
-
|
| 471 |
|
| 472 |
def _used_tools_for_ticket(self, ticket_id: str) -> list[str]:
|
| 473 |
return list(self._state.ticket_tool_usage.get(ticket_id, []))
|
|
@@ -503,23 +588,39 @@ class HelpdeskTicketRoutingEnvironment(
|
|
| 503 |
def _default_redacted_description(self, ticket: HelpdeskTicketRecord) -> str:
|
| 504 |
if ticket.related_ticket_id is not None:
|
| 505 |
return (
|
| 506 |
-
"This is a follow-up operational issue
|
| 507 |
"Additional routing context is available via investigation."
|
| 508 |
)
|
| 509 |
-
if
|
| 510 |
return (
|
| 511 |
-
"
|
| 512 |
"Additional routing context is available via investigation."
|
| 513 |
)
|
| 514 |
if self._ticket_has_nondefault_routing(ticket):
|
| 515 |
return (
|
| 516 |
-
"The visible request looks straightforward, but the decisive routing "
|
| 517 |
-
"detail is hidden until investigation."
|
| 518 |
)
|
| 519 |
return (
|
| 520 |
"Additional routing context is available via investigation before final submission."
|
| 521 |
)
|
| 522 |
|
| 523 |
def _visible_description(self, ticket: HelpdeskTicketRecord) -> str:
|
| 524 |
if self._state.current_task_id == 3 and self._remaining_tools_for_ticket(ticket):
|
| 525 |
return HARD_TASK_DESCRIPTION_REDACTIONS.get(
|
|
@@ -537,7 +638,11 @@ class HelpdeskTicketRoutingEnvironment(
|
|
| 537 |
penalty = PREMATURE_SUBMIT_PENALTY * (
|
| 538 |
len(remaining_tools) / max(1, len(required_tools))
|
| 539 |
)
|
| 540 |
-
|
| 541 |
|
| 542 |
def _context_completion_bonus(
|
| 543 |
self,
|
|
@@ -691,12 +796,13 @@ class HelpdeskTicketRoutingEnvironment(
|
|
| 691 |
}
|
| 692 |
|
| 693 |
def _lookup_internal_routing_note(self, current_ticket: HelpdeskTicketRecord) -> dict[str, Any]:
|
| 694 |
-
|
|
|
|
| 695 |
return {
|
| 696 |
"tool_name": "lookup_internal_routing_note",
|
| 697 |
"found": found,
|
| 698 |
"ticket_id": current_ticket.ticket_id,
|
| 699 |
-
"routing_note":
|
| 700 |
}
|
| 701 |
|
| 702 |
def _run_investigation_tool(
|
|
@@ -753,10 +859,8 @@ class HelpdeskTicketRoutingEnvironment(
|
|
| 753 |
self._state.investigation_budget_remaining - 1,
|
| 754 |
)
|
| 755 |
self._state.last_tool_result = tool_result
|
| 756 |
-
investigation_reward =
|
| 757 |
-
|
| 758 |
-
)
|
| 759 |
-
investigation_score = clamp_open_unit_interval(0.0)
|
| 760 |
self._state.last_step_reward = investigation_reward
|
| 761 |
self._state.reward = investigation_reward
|
| 762 |
self._state.done = False
|
|
@@ -799,7 +903,7 @@ class HelpdeskTicketRoutingEnvironment(
|
|
| 799 |
remaining_tools = progress["remaining_tools"]
|
| 800 |
ticket_view: dict[str, Any] = {
|
| 801 |
"ticket_id": ticket.ticket_id,
|
| 802 |
-
"title":
|
| 803 |
"requester": ticket.requester,
|
| 804 |
"description": self._visible_description(ticket),
|
| 805 |
}
|
|
@@ -811,6 +915,7 @@ class HelpdeskTicketRoutingEnvironment(
|
|
| 811 |
"revealed_context_count": progress["revealed_count"],
|
| 812 |
"context_completeness": progress["completeness"],
|
| 813 |
"investigations_used_for_ticket": progress["revealed_count"],
|
|
|
|
| 814 |
}
|
| 815 |
if ticket.ambiguity_note is not None and "lookup_internal_routing_note" not in remaining_tools:
|
| 816 |
ticket_view["ambiguity_note"] = ticket.ambiguity_note
|
|
|
|
| 33 |
"lookup_internal_routing_note",
|
| 34 |
)
|
| 35 |
FREE_INVESTIGATIONS_PER_TICKET = 1
|
| 36 |
+
EXTRA_INVESTIGATION_COST = 0.04
|
| 37 |
+
MAX_EXTRA_INVESTIGATION_PENALTY = 0.25
|
| 38 |
+
USEFUL_INVESTIGATION_REWARD = 0.03
|
| 39 |
+
PREMATURE_SUBMIT_PENALTY = 0.22
|
| 40 |
+
NONDEFAULT_HIDDEN_CONTEXT_PENALTY = 0.08
|
| 41 |
+
CONTEXT_COMPLETION_BONUS = 0.06
|
| 42 |
+
TRAJECTORY_CONTEXT_COMPLETION_BONUS = 0.04
|
| 43 |
PRIORITY_UNDERSHOOT_PENALTY = 0.03
|
| 44 |
SEVERE_PRIORITY_UNDERSHOOT_PENALTY = 0.07
|
| 45 |
DANGEROUS_RESOLUTION_PENALTY = 0.05
|
|
|
|
| 96 |
),
|
| 97 |
}
|
| 98 |
|
| 99 |
+
HARD_TASK_TITLE_REDACTIONS: dict[str, str] = {
|
| 100 |
+
"ticket-021": "Production workflow regression",
|
| 101 |
+
"ticket-022": "Time-sensitive account review",
|
| 102 |
+
"ticket-027": "Commercial workflow request",
|
| 103 |
+
"ticket-029": "Urgent expansion request",
|
| 104 |
+
"ticket-038": "Repeated invoice follow-up",
|
| 105 |
+
"ticket-045": "Company-wide account issue",
|
| 106 |
+
"TKT-NONDEFAULT-001": "Billing-style routing question",
|
| 107 |
+
"TKT-NONDEFAULT-002": "Compliance ownership question",
|
| 108 |
+
"TKT-NONDEFAULT-003": "Workflow blocker with hidden owner",
|
| 109 |
+
}
|
| 110 |
+
|
| 111 |
|
| 112 |
def _coerce_optional_int(value: Any, field_name: str) -> Optional[int]:
|
| 113 |
if value is None or value == "":
|
|
|
|
| 425 |
return 0.0
|
| 426 |
return sum(self._state.per_ticket_scores) / len(self._state.per_ticket_scores)
|
| 427 |
|
| 428 |
+
def _internal_routing_note_for_ticket(
|
| 429 |
+
self,
|
| 430 |
+
ticket: HelpdeskTicketRecord,
|
| 431 |
+
) -> str | None:
|
| 432 |
+
if ticket.ambiguity_note is not None:
|
| 433 |
+
return ticket.ambiguity_note
|
| 434 |
+
if self._state.current_task_id != 3:
|
| 435 |
+
return None
|
| 436 |
+
|
| 437 |
+
default_group = ISSUE_TYPE_TO_ASSIGNMENT_GROUP.get(
|
| 438 |
+
ticket.issue_type,
|
| 439 |
+
ticket.assignment_group,
|
| 440 |
+
)
|
| 441 |
+
default_action = ISSUE_TYPE_TO_RESOLUTION_ACTION.get(
|
| 442 |
+
ticket.issue_type,
|
| 443 |
+
ticket.resolution_action,
|
| 444 |
+
)
|
| 445 |
+
note_parts: list[str] = []
|
| 446 |
+
|
| 447 |
+
if ticket.assignment_group != default_group:
|
| 448 |
+
note_parts.append(
|
| 449 |
+
"Routing override: send this to "
|
| 450 |
+
f"{ticket.assignment_group} rather than the default {default_group} queue."
|
| 451 |
+
)
|
| 452 |
+
if ticket.resolution_action != default_action:
|
| 453 |
+
note_parts.append(
|
| 454 |
+
"Action override: use "
|
| 455 |
+
f"{ticket.resolution_action} instead of the default {default_action} next step."
|
| 456 |
+
)
|
| 457 |
+
if ticket.issue_type == "onboarding" and ticket.assignment_group == "service_desk":
|
| 458 |
+
note_parts.append(
|
| 459 |
+
"The onboarding workflow is blocked by an access dependency, so the unblocker owns the next move."
|
| 460 |
+
)
|
| 461 |
+
if (
|
| 462 |
+
ticket.issue_type == "security_compliance"
|
| 463 |
+
and ticket.assignment_group == "application_team"
|
| 464 |
+
):
|
| 465 |
+
note_parts.append(
|
| 466 |
+
"This compliance issue needs a product-team fix rather than a central security handoff."
|
| 467 |
+
)
|
| 468 |
+
if ticket.issue_type == "billing_license" and ticket.assignment_group == "procurement":
|
| 469 |
+
note_parts.append(
|
| 470 |
+
"Treat this as commercial procurement work instead of routine license fulfillment."
|
| 471 |
+
)
|
| 472 |
+
|
| 473 |
+
if not note_parts:
|
| 474 |
+
return None
|
| 475 |
+
return " ".join(note_parts)
|
| 476 |
+
|
| 477 |
def _ticket_has_nondefault_routing(self, ticket: HelpdeskTicketRecord) -> bool:
|
| 478 |
return (
|
| 479 |
ticket.assignment_group
|
|
|
|
| 503 |
def _ticket_repeated_requester_count(self, ticket: HelpdeskTicketRecord) -> int:
|
| 504 |
return sum(1 for candidate in self._dataset if candidate.requester == ticket.requester)
|
| 505 |
|
| 506 |
+
def _tool_has_available_context(
|
| 507 |
+
self,
|
| 508 |
+
ticket: HelpdeskTicketRecord,
|
| 509 |
+
tool_name: str,
|
| 510 |
+
) -> bool:
|
| 511 |
+
if tool_name == "lookup_related_ticket":
|
| 512 |
+
return (
|
| 513 |
+
ticket.related_ticket_id is not None
|
| 514 |
+
and ticket.related_ticket_id in self._tickets_by_id
|
| 515 |
+
)
|
| 516 |
+
if tool_name == "lookup_requester_history":
|
| 517 |
+
return self._ticket_repeated_requester_count(ticket) >= 2
|
| 518 |
+
if tool_name == "lookup_internal_routing_note":
|
| 519 |
+
return self._internal_routing_note_for_ticket(ticket) is not None
|
| 520 |
+
return False
|
| 521 |
+
|
| 522 |
def _required_tools_for_ticket(
|
| 523 |
self,
|
| 524 |
ticket: HelpdeskTicketRecord,
|
|
|
|
| 531 |
if ticket.related_ticket_id is not None and "lookup_related_ticket" not in required_tools:
|
| 532 |
required_tools.append("lookup_related_ticket")
|
| 533 |
if (
|
| 534 |
+
self._internal_routing_note_for_ticket(ticket) is not None
|
| 535 |
+
and "lookup_internal_routing_note" not in required_tools
|
| 536 |
+
):
|
| 537 |
required_tools.append("lookup_internal_routing_note")
|
| 538 |
if (
|
| 539 |
self._ticket_repeated_requester_count(ticket) >= 2
|
|
|
|
| 546 |
and "lookup_requester_history" not in required_tools
|
| 547 |
):
|
| 548 |
required_tools.append("lookup_requester_history")
|
| 549 |
+
filtered_required_tools: list[str] = []
|
| 550 |
+
for tool_name in required_tools:
|
| 551 |
+
if tool_name in filtered_required_tools:
|
| 552 |
+
continue
|
| 553 |
+
if self._tool_has_available_context(ticket, tool_name):
|
| 554 |
+
filtered_required_tools.append(tool_name)
|
| 555 |
+
return filtered_required_tools
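The filtering step added to `_required_tools_for_ticket` can be read as an order-preserving de-duplication combined with an availability check. A standalone sketch, assuming a hypothetical `has_context` callback in place of the `_tool_has_available_context` method:

```python
from typing import Callable


def filter_required_tools(
    required_tools: list[str],
    has_context: Callable[[str], bool],
) -> list[str]:
    """Keep each tool once, in first-seen order, and only if it can reveal context."""
    filtered: list[str] = []
    for tool_name in required_tools:
        if tool_name in filtered:
            continue  # drop duplicates without reordering
        if has_context(tool_name):
            filtered.append(tool_name)
    return filtered
```

The effect is that a ticket is never marked as requiring a tool that would return nothing, which keeps `remaining_tools` and the premature-submission penalty tied to context that actually exists.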
|
| 556 |
|
| 557 |
def _used_tools_for_ticket(self, ticket_id: str) -> list[str]:
|
| 558 |
return list(self._state.ticket_tool_usage.get(ticket_id, []))
|
|
|
|
| 588 |
def _default_redacted_description(self, ticket: HelpdeskTicketRecord) -> str:
|
| 589 |
if ticket.related_ticket_id is not None:
|
| 590 |
return (
|
| 591 |
+
"This is a follow-up operational issue. "
|
| 592 |
"Additional routing context is available via investigation."
|
| 593 |
)
|
| 594 |
+
if self._internal_routing_note_for_ticket(ticket) is not None:
|
| 595 |
return (
|
| 596 |
+
"The visible request is not enough to choose the final owner and next step. "
|
| 597 |
"Additional routing context is available via investigation."
|
| 598 |
)
|
| 599 |
if self._ticket_has_nondefault_routing(ticket):
|
| 600 |
return (
|
| 601 |
+
"The visible request looks straightforward, but the decisive routing detail is hidden until investigation."
|
|
|
|
| 602 |
)
|
| 603 |
return (
|
| 604 |
"Additional routing context is available via investigation before final submission."
|
| 605 |
)
|
| 606 |
|
| 607 |
+
def _default_redacted_title(self, ticket: HelpdeskTicketRecord) -> str:
|
| 608 |
+
if ticket.related_ticket_id is not None:
|
| 609 |
+
return "Follow-up request with hidden routing context"
|
| 610 |
+
if self._internal_routing_note_for_ticket(ticket) is not None:
|
| 611 |
+
return "Routing clarification required"
|
| 612 |
+
if self._ticket_mentions_follow_up(ticket):
|
| 613 |
+
return "Priority support follow-up"
|
| 614 |
+
return "Helpdesk routing decision"
|
| 615 |
+
|
| 616 |
+
def _visible_title(self, ticket: HelpdeskTicketRecord) -> str:
|
| 617 |
+
if self._state.current_task_id == 3 and self._remaining_tools_for_ticket(ticket):
|
| 618 |
+
return HARD_TASK_TITLE_REDACTIONS.get(
|
| 619 |
+
ticket.ticket_id,
|
| 620 |
+
self._default_redacted_title(ticket),
|
| 621 |
+
)
|
| 622 |
+
return ticket.title
|
| 623 |
+
|
| 624 |
def _visible_description(self, ticket: HelpdeskTicketRecord) -> str:
|
| 625 |
if self._state.current_task_id == 3 and self._remaining_tools_for_ticket(ticket):
|
| 626 |
return HARD_TASK_DESCRIPTION_REDACTIONS.get(
|
|
|
|
| 638 |
penalty = PREMATURE_SUBMIT_PENALTY * (
|
| 639 |
len(remaining_tools) / max(1, len(required_tools))
|
| 640 |
)
|
| 641 |
+
if self._ticket_has_nondefault_routing(ticket):
|
| 642 |
+
penalty += NONDEFAULT_HIDDEN_CONTEXT_PENALTY * (
|
| 643 |
+
len(remaining_tools) / max(1, len(required_tools))
|
| 644 |
+
)
|
| 645 |
+
return round(min(0.45, penalty), 4), len(remaining_tools)
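A worked sketch of the penalty arithmetic in the lines above, using the constants from this commit. The standalone function `premature_submit_penalty` is a hypothetical restatement that takes counts rather than a ticket, and it does not model whatever guard conditions precede this hunk in the real method.

```python
# Values as declared in server/environment.py in this commit.
PREMATURE_SUBMIT_PENALTY = 0.22
NONDEFAULT_HIDDEN_CONTEXT_PENALTY = 0.08


def premature_submit_penalty(remaining: int, required: int, nondefault_routing: bool) -> float:
    """Hypothetical restatement: scale both penalties by the share of unused required tools."""
    fraction = remaining / max(1, required)
    penalty = PREMATURE_SUBMIT_PENALTY * fraction
    if nondefault_routing:
        penalty += NONDEFAULT_HIDDEN_CONTEXT_PENALTY * fraction
    # Same cap and rounding as the diff above.
    return round(min(0.45, penalty), 4)
```

Assuming the remaining tools are always a subset of the required tools, the fraction never exceeds 1, so with these constants the penalty tops out at 0.22 + 0.08 and the 0.45 cap is not reached.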
|
| 646 |
|
| 647 |
def _context_completion_bonus(
|
| 648 |
self,
|
|
|
|
| 796 |
}
|
| 797 |
|
| 798 |
def _lookup_internal_routing_note(self, current_ticket: HelpdeskTicketRecord) -> dict[str, Any]:
|
| 799 |
+
routing_note = self._internal_routing_note_for_ticket(current_ticket)
|
| 800 |
+
found = routing_note is not None
|
| 801 |
return {
|
| 802 |
"tool_name": "lookup_internal_routing_note",
|
| 803 |
"found": found,
|
| 804 |
"ticket_id": current_ticket.ticket_id,
|
| 805 |
+
"routing_note": routing_note if found else "",
|
| 806 |
}
|
| 807 |
|
| 808 |
def _run_investigation_tool(
|
|
|
|
| 859 |
self._state.investigation_budget_remaining - 1,
|
| 860 |
)
|
| 861 |
self._state.last_tool_result = tool_result
|
| 862 |
+
investigation_reward = USEFUL_INVESTIGATION_REWARD if useful_investigation else 0.0
|
| 863 |
+
investigation_score = 0.0
|
|
|
|
|
|
|
| 864 |
self._state.last_step_reward = investigation_reward
|
| 865 |
self._state.reward = investigation_reward
|
| 866 |
self._state.done = False
|
|
|
|
| 903 |
remaining_tools = progress["remaining_tools"]
|
| 904 |
ticket_view: dict[str, Any] = {
|
| 905 |
"ticket_id": ticket.ticket_id,
|
| 906 |
+
"title": self._visible_title(ticket),
|
| 907 |
"requester": ticket.requester,
|
| 908 |
"description": self._visible_description(ticket),
|
| 909 |
}
|
|
|
|
| 915 |
"revealed_context_count": progress["revealed_count"],
|
| 916 |
"context_completeness": progress["completeness"],
|
| 917 |
"investigations_used_for_ticket": progress["revealed_count"],
|
| 918 |
+
"recommended_tools": list(remaining_tools),
|
| 919 |
}
|
| 920 |
if ticket.ambiguity_note is not None and "lookup_internal_routing_note" not in remaining_tools:
|
| 921 |
ticket_view["ambiguity_note"] = ticket.ambiguity_note
|
server/grader.py
CHANGED
|
@@ -24,6 +24,32 @@ ISSUE_TYPE_SIMILARITY = {
|
|
| 24 |
("billing_license", "security_compliance"): 0.2,
|
| 25 |
}
|
| 26 |
|
| 27 |
PRIORITY_SCORES = {
|
| 28 |
("critical", "high"): 0.6,
|
| 29 |
("high", "critical"): 0.6,
|
|
@@ -66,6 +92,20 @@ def _score_exact_or_similar(predicted: str | None, expected: str) -> float:
|
|
| 66 |
return ISSUE_TYPE_SIMILARITY.get((pred, exp), 0.0)
|
| 67 |
|
| 68 |
|
| 69 |
def _score_priority(predicted: str | None, expected: str) -> float:
|
| 70 |
pred = _normalized(predicted)
|
| 71 |
exp = _normalized(expected)
|
|
@@ -91,11 +131,15 @@ def grade_action(
|
|
| 91 |
field_scores = {
|
| 92 |
"issue_type": _score_exact_or_similar(action.issue_type, ticket.issue_type),
|
| 93 |
"priority": _score_priority(action.priority, ticket.priority),
|
| 94 |
-
"assignment_group":
|
| 95 |
-
action.assignment_group,
|
|
|
|
|
|
|
| 96 |
),
|
| 97 |
-
"resolution_action":
|
| 98 |
-
action.resolution_action,
|
|
|
|
|
|
|
| 99 |
),
|
| 100 |
}
|
| 101 |
|
|
|
|
| 24 |
("billing_license", "security_compliance"): 0.2,
|
| 25 |
}
|
| 26 |
|
| 27 |
+
ASSIGNMENT_GROUP_SIMILARITY = {
|
| 28 |
+
("procurement", "license_ops"): 0.55,
|
| 29 |
+
("license_ops", "procurement"): 0.55,
|
| 30 |
+
("service_desk", "onboarding_ops"): 0.5,
|
| 31 |
+
("onboarding_ops", "service_desk"): 0.5,
|
| 32 |
+
("application_team", "security_team"): 0.35,
|
| 33 |
+
("security_team", "application_team"): 0.35,
|
| 34 |
+
("service_desk", "application_team"): 0.25,
|
| 35 |
+
("application_team", "service_desk"): 0.25,
|
| 36 |
+
("service_desk", "security_team"): 0.2,
|
| 37 |
+
("security_team", "service_desk"): 0.2,
|
| 38 |
+
}
|
| 39 |
+
|
| 40 |
+
RESOLUTION_ACTION_SIMILARITY = {
|
| 41 |
+
("assign", "escalate"): 0.6,
|
| 42 |
+
("escalate", "assign"): 0.6,
|
| 43 |
+
("acknowledge", "fulfill"): 0.35,
|
| 44 |
+
("fulfill", "acknowledge"): 0.35,
|
| 45 |
+
("assign", "fulfill"): 0.25,
|
| 46 |
+
("fulfill", "assign"): 0.25,
|
| 47 |
+
("escalate", "fulfill"): 0.2,
|
| 48 |
+
("fulfill", "escalate"): 0.2,
|
| 49 |
+
("acknowledge", "assign"): 0.2,
|
| 50 |
+
("assign", "acknowledge"): 0.2,
|
| 51 |
+
}
|
| 52 |
+
|
| 53 |
PRIORITY_SCORES = {
|
| 54 |
("critical", "high"): 0.6,
|
| 55 |
("high", "critical"): 0.6,
|
|
|
|
| 92 |
return ISSUE_TYPE_SIMILARITY.get((pred, exp), 0.0)
|
| 93 |
|
| 94 |
|
| 95 |
+
def _score_exact_or_table(
|
| 96 |
+
predicted: str | None,
|
| 97 |
+
expected: str,
|
| 98 |
+
similarity_table: dict[tuple[str, str], float],
|
| 99 |
+
) -> float:
|
| 100 |
+
pred = _normalized(predicted)
|
| 101 |
+
exp = _normalized(expected)
|
| 102 |
+
if not pred:
|
| 103 |
+
return 0.0
|
| 104 |
+
if pred == exp:
|
| 105 |
+
return 1.0
|
| 106 |
+
return similarity_table.get((pred, exp), 0.0)
|
| 107 |
+
|
| 108 |
+
|
| 109 |
def _score_priority(predicted: str | None, expected: str) -> float:
|
| 110 |
pred = _normalized(predicted)
|
| 111 |
exp = _normalized(expected)
|
|
|
|
| 131 |
field_scores = {
|
| 132 |
"issue_type": _score_exact_or_similar(action.issue_type, ticket.issue_type),
|
| 133 |
"priority": _score_priority(action.priority, ticket.priority),
|
| 134 |
+
"assignment_group": _score_exact_or_table(
|
| 135 |
+
action.assignment_group,
|
| 136 |
+
ticket.assignment_group,
|
| 137 |
+
ASSIGNMENT_GROUP_SIMILARITY,
|
| 138 |
),
|
| 139 |
+
"resolution_action": _score_exact_or_table(
|
| 140 |
+
action.resolution_action,
|
| 141 |
+
ticket.resolution_action,
|
| 142 |
+
RESOLUTION_ACTION_SIMILARITY,
|
| 143 |
),
|
| 144 |
}
|
| 145 |
|
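A small usage sketch of the table-based scoring introduced in `server/grader.py`. The similarity entries are copied from the diff (only a subset is shown), but `_normalized` is not visible in this commit, so the strip-and-lowercase version below is an assumption, as is the public name `score_exact_or_table`.

```python
from typing import Optional

# Subset of ASSIGNMENT_GROUP_SIMILARITY from this commit.
ASSIGNMENT_GROUP_SIMILARITY: dict[tuple[str, str], float] = {
    ("procurement", "license_ops"): 0.55,
    ("license_ops", "procurement"): 0.55,
    ("service_desk", "onboarding_ops"): 0.5,
    ("onboarding_ops", "service_desk"): 0.5,
}


def _normalized(value: Optional[str]) -> str:
    """Assumed normalization: trim whitespace and lowercase; None becomes ''."""
    return (value or "").strip().lower()


def score_exact_or_table(
    predicted: Optional[str],
    expected: str,
    similarity_table: dict[tuple[str, str], float],
) -> float:
    pred = _normalized(predicted)
    exp = _normalized(expected)
    if not pred:
        return 0.0  # missing prediction earns nothing
    if pred == exp:
        return 1.0  # exact match is the dominant path
    # Only explicitly declared (predicted, expected) pairs earn partial credit.
    return similarity_table.get((pred, exp), 0.0)
```

Because the lookup key is an ordered `(predicted, expected)` pair, the tables in the diff declare both directions explicitly rather than relying on symmetry.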
tests/test_api_integration.py
CHANGED
|
@@ -529,8 +529,8 @@ class TestHeuristicInferenceRegression(unittest.TestCase):
|
|
| 529 |
overall_avg = sum(rewards) / len(rewards)
|
| 530 |
self.assertGreaterEqual(
|
| 531 |
overall_avg,
|
| 532 |
-
0.
|
| 533 |
-
f"Overall average reward {overall_avg:.4f} is below
|
| 534 |
)
|
| 535 |
self.assertLessEqual(
|
| 536 |
overall_avg,
|
|
|
|
| 529 |
overall_avg = sum(rewards) / len(rewards)
|
| 530 |
self.assertGreaterEqual(
|
| 531 |
overall_avg,
|
| 532 |
+
0.75,
|
| 533 |
+
f"Overall average reward {overall_avg:.4f} is below the smoke-test floor of 0.75",
|
| 534 |
)
|
| 535 |
self.assertLessEqual(
|
| 536 |
overall_avg,
|
tests/test_competitive_upgrade.py
CHANGED
|
@@ -100,11 +100,11 @@ def _get_tasks_to_run_impl(
|
|
| 100 |
if task_id not in available_tasks:
|
| 101 |
raise SystemExit(1)
|
| 102 |
return [task_id]
|
| 103 |
-
if run_all_tasks:
|
| 104 |
-
return sorted(available_tasks)
|
| 105 |
if not available_tasks:
|
| 106 |
return []
|
| 107 |
-
|
| 108 |
|
| 109 |
|
| 110 |
class TestInferenceSingleTaskMode(unittest.TestCase):
|
|
@@ -120,10 +120,10 @@ class TestInferenceSingleTaskMode(unittest.TestCase):
|
|
| 120 |
with self.assertRaises(SystemExit):
|
| 121 |
_get_tasks_to_run_impl("999", available)
|
| 122 |
|
| 123 |
-
def
|
| 124 |
available = {1: {}, 2: {}, 3: {}}
|
| 125 |
result = _get_tasks_to_run_impl(None, available)
|
| 126 |
-
self.assertEqual(result, [1])
|
| 127 |
|
| 128 |
def test_run_all_tasks_override_returns_all_task_ids(self) -> None:
|
| 129 |
available = {1: {}, 2: {}, 3: {}}
|
|
@@ -710,7 +710,7 @@ class TestQueueEconomics(unittest.TestCase):
|
|
| 710 |
final_obs = env.step(HelpdeskTicketAction(issue_type=ticket.issue_type))
|
| 711 |
|
| 712 |
self.assertTrue(final_obs.done)
|
| 713 |
-
self.assertAlmostEqual(final_obs.reward, 0.
|
| 714 |
|
| 715 |
|
| 716 |
class TestTerminalInvalidActionFinalReward(unittest.TestCase):
|
|
|
|
| 100 |
if task_id not in available_tasks:
|
| 101 |
raise SystemExit(1)
|
| 102 |
return [task_id]
|
| 103 |
if not available_tasks:
|
| 104 |
return []
|
| 105 |
+
if run_all_tasks:
|
| 106 |
+
return sorted(available_tasks)
|
| 107 |
+
return sorted(available_tasks)
|
| 108 |
|
| 109 |
|
| 110 |
class TestInferenceSingleTaskMode(unittest.TestCase):
|
|
|
|
| 120 |
with self.assertRaises(SystemExit):
|
| 121 |
_get_tasks_to_run_impl("999", available)
|
| 122 |
|
| 123 |
+
def test_task_id_unset_defaults_to_all_available_tasks(self) -> None:
|
| 124 |
available = {1: {}, 2: {}, 3: {}}
|
| 125 |
result = _get_tasks_to_run_impl(None, available)
|
| 126 |
+
self.assertEqual(result, [1, 2, 3])
|
| 127 |
|
| 128 |
def test_run_all_tasks_override_returns_all_task_ids(self) -> None:
|
| 129 |
available = {1: {}, 2: {}, 3: {}}
|
|
|
|
| 710 |
final_obs = env.step(HelpdeskTicketAction(issue_type=ticket.issue_type))
|
| 711 |
|
| 712 |
self.assertTrue(final_obs.done)
|
| 713 |
+
self.assertAlmostEqual(final_obs.reward, 0.95, places=9)
|
| 714 |
|
| 715 |
|
| 716 |
class TestTerminalInvalidActionFinalReward(unittest.TestCase):
|
tests/test_grader_unit.py
CHANGED
|
@@ -6,12 +6,14 @@ import openenv_test_stubs # noqa: F401
|
|
| 6 |
|
| 7 |
from models import HelpdeskTicketAction, HelpdeskTicketRecord
|
| 8 |
from server.grader import (
|
|
|
|
| 9 |
ISSUE_TYPE_SIMILARITY,
|
| 10 |
PRIORITY_SCORES,
|
|
|
|
| 11 |
TASK_WEIGHTS,
|
| 12 |
grade_action,
|
| 13 |
)
|
| 14 |
-
from vocabulary import ISSUE_TYPES, PRIORITIES
|
| 15 |
|
| 16 |
|
| 17 |
def _ticket(
|
|
@@ -143,12 +145,26 @@ class GraderUnitTests(unittest.TestCase):
|
|
| 143 |
self.assertEqual(breakdown, {"issue_type": 1.0, "priority": 0.5})
|
| 144 |
self.assertAlmostEqual(score, 0.8)
|
| 145 |
|
| 146 |
-
def
|
| 147 |
ticket = _ticket()
|
| 148 |
action = HelpdeskTicketAction(
|
| 149 |
issue_type="billing_license",
|
| 150 |
priority="high",
|
| 151 |
-
assignment_group="
|
| 152 |
resolution_action="fulfill",
|
| 153 |
)
|
| 154 |
|
|
@@ -162,7 +178,7 @@ class GraderUnitTests(unittest.TestCase):
|
|
| 162 |
action = HelpdeskTicketAction(
|
| 163 |
issue_type="billing_license",
|
| 164 |
priority="medium",
|
| 165 |
-
assignment_group="
|
| 166 |
resolution_action="fulfill",
|
| 167 |
)
|
| 168 |
|
|
@@ -179,13 +195,27 @@ class GraderUnitTests(unittest.TestCase):
|
|
| 179 |
)
|
| 180 |
self.assertAlmostEqual(score, 0.65)
|
| 181 |
|
| 182 |
-
def
|
| 183 |
ticket = _ticket()
|
| 184 |
action = HelpdeskTicketAction(
|
| 185 |
issue_type="billing_license",
|
| 186 |
priority="high",
|
| 187 |
assignment_group="license_ops",
|
| 188 |
-
resolution_action="
|
| 189 |
)
|
| 190 |
|
| 191 |
score, breakdown = grade_action(action, ticket, task_id=3)
|
|
@@ -193,6 +223,70 @@ class GraderUnitTests(unittest.TestCase):
|
|
| 193 |
self.assertEqual(breakdown["resolution_action"], 0.0)
|
| 194 |
self.assertAlmostEqual(score, 0.8)
|
| 195 |
|
| 196 |
def test_partial_credit_tables_never_override_exact_match(self) -> None:
|
| 197 |
for pair, value in ISSUE_TYPE_SIMILARITY.items():
|
| 198 |
with self.subTest(table="issue_type", pair=pair):
|
|
@@ -204,6 +298,16 @@ class GraderUnitTests(unittest.TestCase):
|
|
| 204 |
self.assertGreater(value, 0.0)
|
| 205 |
self.assertLess(value, 1.0)
|
| 206 |
|
| 207 |
def test_task_weights_sum_to_one_for_each_task(self) -> None:
|
| 208 |
for task_id, weights in TASK_WEIGHTS.items():
|
| 209 |
with self.subTest(task_id=task_id):
|
|
|
|
| 6 |
|
| 7 |
from models import HelpdeskTicketAction, HelpdeskTicketRecord
|
| 8 |
from server.grader import (
|
| 9 |
+
ASSIGNMENT_GROUP_SIMILARITY,
|
| 10 |
ISSUE_TYPE_SIMILARITY,
|
| 11 |
PRIORITY_SCORES,
|
| 12 |
+
RESOLUTION_ACTION_SIMILARITY,
|
| 13 |
TASK_WEIGHTS,
|
| 14 |
grade_action,
|
| 15 |
)
|
| 16 |
+
from vocabulary import ASSIGNMENT_GROUPS, ISSUE_TYPES, PRIORITIES, RESOLUTION_ACTIONS
|
| 17 |
|
| 18 |
|
| 19 |
def _ticket(
|
|
|
|
| 145 |
self.assertEqual(breakdown, {"issue_type": 1.0, "priority": 0.5})
|
| 146 |
self.assertAlmostEqual(score, 0.8)
|
| 147 |
|
| 148 |
+
def test_assignment_group_partial_credit_uses_declared_similarity_table(self) -> None:
|
| 149 |
ticket = _ticket()
|
| 150 |
action = HelpdeskTicketAction(
|
| 151 |
issue_type="billing_license",
|
| 152 |
priority="high",
|
| 153 |
+
assignment_group="procurement",
|
| 154 |
+
resolution_action="fulfill",
|
| 155 |
+
)
|
| 156 |
+
|
| 157 |
+
score, breakdown = grade_action(action, ticket, task_id=3)
|
| 158 |
+
|
| 159 |
+
self.assertEqual(breakdown["assignment_group"], 0.55)
|
| 160 |
+
self.assertAlmostEqual(score, 0.8875)
|
| 161 |
+
|
| 162 |
+
def test_assignment_group_unrelated_miss_stays_zero(self) -> None:
|
| 163 |
+
ticket = _ticket()
|
| 164 |
+
action = HelpdeskTicketAction(
|
| 165 |
+
issue_type="billing_license",
|
| 166 |
+
priority="high",
|
| 167 |
+
assignment_group="security_team",
|
| 168 |
resolution_action="fulfill",
|
| 169 |
)
|
| 170 |
|
|
|
|
| 178 |
action = HelpdeskTicketAction(
|
| 179 |
issue_type="billing_license",
|
| 180 |
priority="medium",
|
| 181 |
+
assignment_group="security_team",
|
| 182 |
resolution_action="fulfill",
|
| 183 |
)
|
| 184 |
|
|
|
|
| 195 |
)
|
| 196 |
self.assertAlmostEqual(score, 0.65)
|
| 197 |
|
| 198 |
+
def test_resolution_action_partial_credit_uses_declared_similarity_table(self) -> None:
|
| 199 |
ticket = _ticket()
|
| 200 |
action = HelpdeskTicketAction(
|
| 201 |
issue_type="billing_license",
|
| 202 |
priority="high",
|
| 203 |
assignment_group="license_ops",
|
| 204 |
+
resolution_action="acknowledge",
|
| 205 |
+
)
|
| 206 |
+
|
| 207 |
+
score, breakdown = grade_action(action, ticket, task_id=3)
|
| 208 |
+
|
| 209 |
+
self.assertEqual(breakdown["resolution_action"], 0.35)
|
| 210 |
+
self.assertAlmostEqual(score, 0.87)
|
| 211 |
+
|
| 212 |
+
def test_resolution_action_unrelated_miss_stays_zero(self) -> None:
|
| 213 |
+
ticket = _ticket()
|
| 214 |
+
action = HelpdeskTicketAction(
|
| 215 |
+
issue_type="billing_license",
|
| 216 |
+
priority="high",
|
| 217 |
+
assignment_group="license_ops",
|
| 218 |
+
resolution_action="ignore",
|
| 219 |
)
|
| 220 |
|
| 221 |
score, breakdown = grade_action(action, ticket, task_id=3)
|
|
|
|
| 223 |
self.assertEqual(breakdown["resolution_action"], 0.0)
|
| 224 |
self.assertAlmostEqual(score, 0.8)
|
| 225 |
|
| 226 |
+
def test_assignment_group_scoring_matches_declared_similarity_table_exhaustively(self) -> None:
|
| 227 |
+
for expected in ASSIGNMENT_GROUPS:
|
| 228 |
+
for predicted in ASSIGNMENT_GROUPS:
|
| 229 |
+
with self.subTest(expected=expected, predicted=predicted):
|
| 230 |
+
ticket = _ticket(assignment_group=expected)
|
| 231 |
+
action = HelpdeskTicketAction(
|
| 232 |
+
issue_type="billing_license",
|
| 233 |
+
priority="high",
|
| 234 |
+
assignment_group=predicted,
|
| 235 |
+
resolution_action="fulfill",
|
| 236 |
+
)
|
| 237 |
+
|
| 238 |
+
score, breakdown = grade_action(action, ticket, task_id=3)
|
| 239 |
+
|
| 240 |
+
assignment_group_score = (
|
| 241 |
+
1.0
|
| 242 |
+
if predicted == expected
|
| 243 |
+
else ASSIGNMENT_GROUP_SIMILARITY.get((predicted, expected), 0.0)
|
| 244 |
+
)
|
| 245 |
+
self.assertEqual(
|
| 246 |
+
breakdown,
|
| 247 |
+
{
|
| 248 |
+
"issue_type": 1.0,
|
| 249 |
+
"priority": 1.0,
|
| 250 |
+
"assignment_group": assignment_group_score,
|
| 251 |
+
"resolution_action": 1.0,
|
| 252 |
+
},
|
| 253 |
+
)
|
| 254 |
+
raw_score = 0.35 + 0.20 + 0.25 * assignment_group_score + 0.20
|
| 255 |
+
expected_task_score = max(0.01, min(0.99, raw_score))
|
| 256 |
+
self.assertAlmostEqual(score, expected_task_score)
|
| 257 |
+
|
| 258 |
+
def test_resolution_action_scoring_matches_declared_similarity_table_exhaustively(self) -> None:
|
| 259 |
+
for expected in RESOLUTION_ACTIONS:
|
| 260 |
+
for predicted in RESOLUTION_ACTIONS:
|
| 261 |
+
with self.subTest(expected=expected, predicted=predicted):
|
| 262 |
+
ticket = _ticket(resolution_action=expected)
|
| 263 |
+
action = HelpdeskTicketAction(
|
| 264 |
+
issue_type="billing_license",
|
| 265 |
+
priority="high",
|
| 266 |
+
assignment_group="license_ops",
|
| 267 |
+
resolution_action=predicted,
|
| 268 |
+
)
|
| 269 |
+
|
| 270 |
+
score, breakdown = grade_action(action, ticket, task_id=3)
|
| 271 |
+
|
| 272 |
+
resolution_action_score = (
|
| 273 |
+
1.0
|
| 274 |
+
if predicted == expected
|
| 275 |
+
else RESOLUTION_ACTION_SIMILARITY.get((predicted, expected), 0.0)
|
| 276 |
+
)
|
| 277 |
+
self.assertEqual(
|
| 278 |
+
breakdown,
|
| 279 |
+
{
|
| 280 |
+
"issue_type": 1.0,
|
| 281 |
+
"priority": 1.0,
|
| 282 |
+
"assignment_group": 1.0,
|
| 283 |
+
"resolution_action": resolution_action_score,
|
| 284 |
+
},
|
| 285 |
+
)
|
| 286 |
+
raw_score = 0.35 + 0.20 + 0.25 + 0.20 * resolution_action_score
|
| 287 |
+
expected_task_score = max(0.01, min(0.99, raw_score))
|
| 288 |
+
self.assertAlmostEqual(score, expected_task_score)
|
| 289 |
+
|
| 290 |
def test_partial_credit_tables_never_override_exact_match(self) -> None:
|
| 291 |
for pair, value in ISSUE_TYPE_SIMILARITY.items():
|
| 292 |
with self.subTest(table="issue_type", pair=pair):
|
|
|
|
| 298 |
self.assertGreater(value, 0.0)
|
| 299 |
self.assertLess(value, 1.0)
|
| 300 |
|
| 301 |
+
for pair, value in ASSIGNMENT_GROUP_SIMILARITY.items():
|
| 302 |
+
with self.subTest(table="assignment_group", pair=pair):
|
| 303 |
+
self.assertGreater(value, 0.0)
|
| 304 |
+
self.assertLess(value, 1.0)
|
| 305 |
+
|
| 306 |
+
for pair, value in RESOLUTION_ACTION_SIMILARITY.items():
|
| 307 |
+
with self.subTest(table="resolution_action", pair=pair):
|
| 308 |
+
self.assertGreater(value, 0.0)
|
| 309 |
+
self.assertLess(value, 1.0)
|
| 310 |
+
|
| 311 |
def test_task_weights_sum_to_one_for_each_task(self) -> None:
|
| 312 |
for task_id, weights in TASK_WEIGHTS.items():
|
| 313 |
with self.subTest(task_id=task_id):
|