Roopalgn committed on
Commit
6920aae
·
1 Parent(s): 72d2634

Complete Roopal roadmap work for April 4-7

KNOWLEDGE.md CHANGED
@@ -237,12 +237,21 @@ The grader is deterministic and intentionally simple to explain.
237
  - `assignment_group` gets exact credit
238
  - `resolution_action` gets exact credit
239
 
240
  Task weighting:
241
 
242
  - Task 1: only `issue_type`
243
  - Task 2: `issue_type` 60%, `priority` 40%
244
  - Task 3: `issue_type` 35%, `priority` 20%, `assignment_group` 25%, `resolution_action` 20%
245
 
246
  ## Reward Mental Model
247
 
248
  Step reward:
@@ -270,6 +279,18 @@ Current structure:
270
 
271
  The dataset is meant to test routing judgment, not just keyword spotting.
272
 
273
  ## Inference Script In Simple Terms
274
 
275
  `inference.py` is the baseline agent runner.
@@ -344,6 +365,17 @@ An April 6 audit confirmed:
344
  - the current local benchmark reference is still `1.0000`, `0.8800`, `0.9400`, overall `0.9400`
345
  - the remaining work is execution validation, not documentation cleanup
346
 
347
  ## What Still Needs Hands-On Verification
348
 
349
  The biggest remaining checks are packaging and clean-machine checks, not merge-state local execution.
 
237
  - `assignment_group` gets exact credit
238
  - `resolution_action` gets exact credit
239
 
240
+ Just as important, the grader is not fuzzy by default:
241
+
242
+ - exact matches stay dominant
243
+ - wrong issue types outside the declared similarity map score `0.0`
244
+ - wrong priorities outside the declared proximity table score `0.0`
245
+ - assignment group and resolution action never receive partial credit
246
+
247
  Task weighting:
248
 
249
  - Task 1: only `issue_type`
250
  - Task 2: `issue_type` 60%, `priority` 40%
251
  - Task 3: `issue_type` 35%, `priority` 20%, `assignment_group` 25%, `resolution_action` 20%
252
 
253
+ This is now proven in checked-in unit tests rather than left as a docs claim.
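
The weighting table above can be read as a plain weighted sum over per-field scores. The following is a minimal sketch under that assumption; `weighted_score` is a hypothetical helper written for illustration, and the dictionary only mirrors the documented percentages rather than quoting `server/grader.py`.

```python
# Illustrative sketch only: the weights are copied from the task table above,
# but this helper is hypothetical and is not the code in server/grader.py.
TASK_WEIGHTS = {
    1: {"issue_type": 1.0},
    2: {"issue_type": 0.60, "priority": 0.40},
    3: {
        "issue_type": 0.35,
        "priority": 0.20,
        "assignment_group": 0.25,
        "resolution_action": 0.20,
    },
}


def weighted_score(task_id: int, field_scores: dict[str, float]) -> float:
    """Combine per-field scores (each in [0.0, 1.0]) using the task's weights."""
    weights = TASK_WEIGHTS[task_id]
    # A field missing from field_scores counts as 0.0 (an assumption of this sketch).
    return sum(weight * field_scores.get(field, 0.0) for field, weight in weights.items())
```

With a Task 3 breakdown of `issue_type` 1.0, `priority` 0.5, `assignment_group` 0.0, and `resolution_action` 1.0, the sum is 0.35 + 0.10 + 0.00 + 0.20 = 0.65, which agrees with the value asserted by `test_task_3_weights_apply_as_documented` in this commit.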
254
+
255
  ## Reward Mental Model
256
 
257
  Step reward:
 
279
 
280
  The dataset is meant to test routing judgment, not just keyword spotting.
281
 
282
+ ## Grounding Note
283
+
284
+ The taxonomy and limited partial-credit policy were reviewed against public IT-support references recorded in `analysis/grounding_audit.md`.
285
+
286
+ The grounding inputs used for that review were:
287
+
288
+ - `Classification of IT Support Tickets`
289
+ - `Semantic Similarity of IT Support Tickets`
290
+ - `MSDialog`
291
+
292
+ The key conclusion was to keep the similarity map narrow. The current issue-type near misses are defensible, but broader additions would blur operationally distinct routing actions too much this late in the submission cycle.
293
+
294
  ## Inference Script In Simple Terms
295
 
296
  `inference.py` is the baseline agent runner.
 
365
  - the current local benchmark reference is still `1.0000`, `0.8800`, `0.9400`, overall `0.9400`
366
  - the remaining work is execution validation, not documentation cleanup
367
 
368
+ ### April 6 and April 7 Roopal-side doc pass
369
+
370
+ That follow-up pass added the remaining Roopal-owned public-clarity items:
371
+
372
+ - Hugging Face Spaces README frontmatter
373
+ - explicit judge-facing explanation that scoring is deterministic and only partially fuzzy in declared places
374
+ - an internal grounding note tying the label space to public IT-support datasets
375
+ - a refreshed compliance snapshot in `required.md`
376
+
377
+ The optional TRL / GRPO README example was intentionally deferred because the shared runtime-validation gates are not all green yet.
378
+
379
  ## What Still Needs Hands-On Verification
380
 
381
  The biggest remaining checks are packaging and clean-machine checks, not merge-state local execution.
PROJECT_STATUS.md CHANGED
@@ -191,3 +191,37 @@ Still pending after the current checkpoint:
191
 
192
  - perform a Docker smoke test from the current merged repo state
193
  - do a clean-machine dry run if possible before final submission freeze
 
191
 
192
  - perform a Docker smoke test from the current merged repo state
193
  - do a clean-machine dry run if possible before final submission freeze
194
+
195
+ ## April 3, 2026 (Pulled Forward April 4-5 Roopal Scope)
196
+
197
+ Status: complete for the Roopal-owned roadmap items originally scheduled for April 4 and April 5
198
+
199
+ Roopal-side work completed:
200
+
201
+ - expanded `tests/test_grader_unit.py` to lock scorer crispness with exhaustive issue-type and priority-table checks
202
+ - added explicit invariants for task-weight sums, exact-match dominance, and deterministic repeated grading
203
+ - expanded `tests/test_tasks_unit.py` to cover the frozen task difficulty ladder plus dataset coverage across all issue types, priorities, assignment groups, and resolution actions
204
+ - added `analysis/grounding_audit.md` as the internal grounding note requested by the roadmap
205
+ - reviewed candidate issue-type similarity expansions and decided to keep the current similarity map unchanged
206
+
207
+ Decision notes:
208
+
209
+ - scorer fuzziness is now proven by tests to exist only where the declared similarity map or priority table allows it
210
+ - no additional issue-type similarity pairs were adopted in this pass because the reviewed candidates were too operationally fuzzy
211
+
212
+ ## April 3, 2026 (Pulled Forward April 6-7 Roopal Scope)
213
+
214
+ Status: complete for the Roopal-owned roadmap items originally scheduled for April 6 and April 7
215
+
216
+ Roopal-side work completed:
217
+
218
+ - added Hugging Face Spaces README frontmatter
219
+ - updated `README.md` with an explicit judge-facing explanation of deterministic, grounded scoring
220
+ - updated `KNOWLEDGE.md` to state clearly that the grader is not fuzzy by default and to reference the grounding audit
221
+ - updated `required.md` with a current compliance snapshot separating already-satisfied requirements from shared pending validation gates
222
+ - completed the final Roopal-side consistency pass across `README.md`, `KNOWLEDGE.md`, and `required.md`
223
+
224
+ Decision notes:
225
+
226
+ - no scorer change was needed from the grounding review, so this pass stayed documentation-only
227
+ - the optional TRL / GRPO README example remains deferred until the shared runtime-validation gates are green
README.md CHANGED
@@ -1,3 +1,17 @@
1
  # IT Helpdesk Ticket Routing OpenEnv
2
 
3
  > Meta PyTorch OpenEnv Hackathon Round 1 submission
@@ -152,6 +166,26 @@ average(per_ticket_scores) - 0.03 * max(0, steps_taken - queue_size)
152
 
153
  The result is clamped to `[0.0, 1.0]`.
154
 
155
  ## Dataset Snapshot
156
 
157
  The labeled dataset in `data/dataset.json` currently contains 45 tickets spanning straightforward and ambiguous helpdesk scenarios.
@@ -349,6 +383,8 @@ The repo is already aligned on:
349
  - typed models
350
  - grader and reward design
351
  - packaging metadata and Docker entry point
352
 
353
  An April 6 repo audit also confirmed that all required submission files are present:
354
 
@@ -359,4 +395,8 @@ An April 6 repo audit also confirmed that all required submission files are pres
359
  Still pending before final submission:
360
 
361
  - a Docker smoke test from a machine with Docker installed
362
  - a final clean-machine dry run if possible before submission freeze
 
1
+ ---
2
+ title: IT Helpdesk Ticket Routing OpenEnv
3
+ colorFrom: blue
4
+ colorTo: indigo
5
+ sdk: docker
6
+ pinned: false
7
+ app_port: 7860
8
+ tags:
9
+ - openenv
10
+ - helpdesk
11
+ - ticket-routing
12
+ - customer-support
13
+ ---
14
+
15
  # IT Helpdesk Ticket Routing OpenEnv
16
 
17
  > Meta PyTorch OpenEnv Hackathon Round 1 submission
 
166
 
167
  The result is clamped to `[0.0, 1.0]`.
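
As a rough illustration of the episode formula quoted in the hunk context (`average(per_ticket_scores) - 0.03 * max(0, steps_taken - queue_size)`, then clamped), here is a minimal sketch. The function name `episode_score` and the empty-queue behavior are assumptions made for this example, not details confirmed by the repository.

```python
def episode_score(per_ticket_scores: list[float], steps_taken: int, queue_size: int) -> float:
    """Average the per-ticket scores, subtract a step-overrun penalty, clamp to [0.0, 1.0]."""
    if not per_ticket_scores:
        # Assumption for this sketch: an episode with no graded tickets scores 0.0.
        return 0.0
    average = sum(per_ticket_scores) / len(per_ticket_scores)
    # Each step beyond the queue size costs 0.03; finishing within the queue size costs nothing.
    overrun_penalty = 0.03 * max(0, steps_taken - queue_size)
    return min(1.0, max(0.0, average - overrun_penalty))
```

For example, two tickets scored `1.0` and `0.5` in exactly two steps average to `0.75` with no penalty; taking four steps for the same queue subtracts `0.06`.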
168
 
169
+ ## Grounded Scoring
170
+
171
+ The grader is intentionally not fuzzy by default.
172
+
173
+ - exact match is the dominant path for every field
174
+ - `assignment_group` and `resolution_action` are exact-match only
175
+ - `priority` only gets proximity credit from the declared table in `server/grader.py`
176
+ - `issue_type` only gets partial credit for a small declared similarity map
177
+ - wrong labels outside those explicit maps score `0.0`
178
+
179
+ That scoring policy is now backed by checked-in unit tests in `tests/test_grader_unit.py` and `tests/test_tasks_unit.py`.
180
+
181
+ The label set and partial-credit choices were also reviewed against public IT-support references captured in `analysis/grounding_audit.md`, including:
182
+
183
+ - `Classification of IT Support Tickets`
184
+ - `Semantic Similarity of IT Support Tickets`
185
+ - `MSDialog`
186
+
187
+ That grounding pass supported keeping the current similarity map small and explainable. No new issue-type similarity pairs were added from the review.
188
+
189
  ## Dataset Snapshot
190
 
191
  The labeled dataset in `data/dataset.json` currently contains 45 tickets spanning straightforward and ambiguous helpdesk scenarios.
 
383
  - typed models
384
  - grader and reward design
385
  - packaging metadata and Docker entry point
386
+ - Hugging Face Spaces README frontmatter
387
+ - judge-facing documentation of deterministic, grounded scoring
388
 
389
  An April 6 repo audit also confirmed that all required submission files are present:
390
 
 
395
  Still pending before final submission:
396
 
397
  - a Docker smoke test from a machine with Docker installed
398
+ - `openenv validate` evidence on the current merged repo state
399
+ - structured `inference.py` log-format verification on the current merged repo state
400
  - a final clean-machine dry run if possible before submission freeze
401
+
402
+ The short TRL / GRPO README example from the roadmap is intentionally deferred until the shared runtime and validation gates are green.
analysis/grounding_audit.md ADDED
@@ -0,0 +1,77 @@
1
+ # Grounding Audit For Taxonomy And Similarity Decisions
2
+
3
+ > Internal note for the roadmap work originally planned for April 5, 2026.
4
+ > Reviewed on April 3, 2026 and pulled forward ahead of schedule.
5
+
6
+ ## Goal
7
+
8
+ Ground the current ticket taxonomy and the limited partial-credit policy against real public IT-support data without turning external datasets into a runtime dependency.
9
+
10
+ ## Sources Reviewed
11
+
12
+ 1. [Classification of IT Support Tickets](https://zenodo.org/records/7648117)
13
+ - Zenodo dataset with 2,229 manually classified support tickets.
14
+ - Dataset description says the tickets were classified by three IT support professionals.
15
+ - The public preview exposes seven coarse categories: `Fileservice`, `Support general`, `Software`, `O365`, `Active Directory`, `Computer-Services`, and `EOL`.
16
+
17
+ 2. [Semantic Similarity of IT Support Tickets](https://zenodo.org/records/7426225)
18
+ - Zenodo dataset with 300 ticket pairs manually labeled for semantic similarity.
19
+ - The description says three IT support professionals performed the labeling.
20
+ - This is the best direct grounding for keeping similarity explicit and limited instead of treating the whole label space as fuzzy.
21
+
22
+ 3. [MSDialog dataset page](https://ciir.cs.umass.edu/downloads/msdialog/)
23
+ - Technical-support dialog corpus drawn from Microsoft Community.
24
+ - The site reports 35,000 dialogs in `MSDialog-Complete` and 2,199 labeled dialogs with 10,020 utterances in `MSDialog-Intent`.
25
+ - This grounds our use of follow-up cases, clarification-heavy threads, and helpdesk-style conversational language.
26
+
27
+ ## Mapping Principle
28
+
29
+ The external datasets validate that real IT support traffic mixes access problems, software incidents, generic support questions, procurement-like requests, and multi-turn follow-ups. Our label set is more operational than the public category sets, so the mappings below are judgment calls based on source descriptions and public previews rather than exact label equivalence.
30
+
31
+ ## Grounding Examples
32
+
33
+ 1. Active Directory lockout, MFA trouble, or password reset -> `identity_access` -> exact-match dominant, with `onboarding` as the only defensible adjacent label when the request is really about new-user provisioning.
34
+ 2. New hire account setup or contractor access provisioning -> `onboarding` -> partial-credit adjacent to `identity_access`, because both can surface as account enablement work before ownership is fully resolved.
35
+ 3. Office or application crash, timeout, webhook failure, or migration-script breakage -> `application_support` -> partial-credit adjacent to `feature_request` only when the report reads like a capability gap rather than a break/fix issue.
36
+ 4. Feature wishlist or export-format enhancement request -> `feature_request` -> partial-credit adjacent to `application_support` only when the user reports the missing capability as if it were a defect.
37
+ 5. Vendor-evaluation question, demo request, or quote request -> `service_request` -> partial-credit adjacent to `general_inquiry` when the request is still exploratory rather than a committed operational action.
38
+ 6. Seat expansion or provisioning-style commercial request -> `service_request` -> partial-credit adjacent to `billing_license` when procurement and account-admin signals are mixed in the same ticket.
39
+ 7. Refund, invoice discrepancy, subscription cancellation, or payment-admin issue -> `billing_license` -> partial-credit adjacent to `service_request` only in commercial admin cases that overlap with a procurement or seat-change request.
40
+ 8. Broad capability question or lightweight product clarification -> `general_inquiry` -> partial-credit adjacent to `service_request` or `feature_request` when the request is vague enough to look like either evaluation or roadmap feedback.
41
+ 9. Spam lure or credential-phishing message sent to the inbox -> `spam_phishing` -> partial-credit adjacent to `security_compliance` only for security-themed inbound items, not for normal access or software tickets.
42
+ 10. GDPR deletion request, DPA request, audit finding, or mandatory MFA policy notice -> `security_compliance` -> exact-match dominant, with very limited adjacency to `spam_phishing` for suspicious security reports and a low-confidence edge to `billing_license` only in contractual paperwork contexts.
43
+ 11. Reopened outage thread or repeated bug report escalation -> `application_support` -> exact-match dominant; the main change across turns is usually `priority`, not `issue_type`.
44
+ 12. Repeated lockout complaint or suspension follow-up -> `identity_access` -> exact-match dominant; follow-up behavior is grounded by MSDialog-style multi-turn support flow rather than by adding new label fuzziness.
45
+
46
+ ## Review Of Current Similarity Pairs
47
+
48
+ The current `ISSUE_TYPE_SIMILARITY` map stays intentionally small. The defensible themes are:
49
+
50
+ - `billing_license` <-> `service_request`: commercial admin and procurement requests can overlap before the owning team is clear.
51
+ - `application_support` <-> `identity_access`: SSO and login failures can initially look like either app failure or access failure.
52
+ - `application_support` <-> `feature_request`: some users describe missing functionality in bug-report language.
53
+ - `onboarding` <-> `identity_access`: provisioning and account enablement are adjacent in real helpdesk traffic.
54
+ - `general_inquiry` <-> `feature_request`: vague product questions can blur into roadmap requests.
55
+ - `general_inquiry` <-> `service_request`: vendor-evaluation and exploratory capability questions often overlap.
56
+ - `spam_phishing` <-> `security_compliance`: both are security-facing, but they should stay separate from normal access or app-routing labels.
57
+ - `security_compliance` <-> `billing_license`: kept only as a very low-score edge for contract and paperwork overlap; this is the weakest current pair and should not be expanded further without ticket-level evidence.
58
+
59
+ ## Candidate Expansions Reviewed And Rejected
60
+
61
+ These pairs were reviewed during the roadmap pass originally planned for April 5 (carried out on April 3) and are intentionally not being added:
62
+
63
+ - `onboarding` <-> `service_request`: both can involve setup, but the owning teams and next actions diverge too quickly.
64
+ - `feature_request` <-> `service_request`: roadmap asks and procurement actions are operationally different.
65
+ - `security_compliance` <-> `identity_access`: policy obligations may mention accounts, but the compliance workflow is distinct from user access support.
66
+ - `billing_license` <-> `identity_access`: nonpayment or suspension can mention lockout symptoms, but the root-cause owner is different.
67
+ - `application_support` <-> `billing_license`: mixed commercial and outage narratives exist, but broad partial credit here would blur incident handling too much.
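
One way to keep the rejected list from drifting back into the similarity map is a small guard check. The sketch below is hypothetical: it encodes only the pair names listed in this audit as unordered pairs (no scores) and does not read the real `ISSUE_TYPE_SIMILARITY` table.

```python
# Pair names are copied from the two lists in this audit; treating "<->" as an
# unordered pair is an assumption of this sketch.
ACCEPTED_PAIRS = {
    frozenset(pair)
    for pair in [
        ("billing_license", "service_request"),
        ("application_support", "identity_access"),
        ("application_support", "feature_request"),
        ("onboarding", "identity_access"),
        ("general_inquiry", "feature_request"),
        ("general_inquiry", "service_request"),
        ("spam_phishing", "security_compliance"),
        ("security_compliance", "billing_license"),
    ]
}

REJECTED_PAIRS = {
    frozenset(pair)
    for pair in [
        ("onboarding", "service_request"),
        ("feature_request", "service_request"),
        ("security_compliance", "identity_access"),
        ("billing_license", "identity_access"),
        ("application_support", "billing_license"),
    ]
}


def overlapping_pairs(accepted: set, rejected: set) -> set:
    """Return any pair that appears in both the accepted and rejected sets."""
    return accepted & rejected
```

An empty result from `overlapping_pairs(ACCEPTED_PAIRS, REJECTED_PAIRS)` means none of the rejected candidates has been adopted.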
68
+
69
+ ## Decision
70
+
71
+ No new issue-type similarity pairs should be added from this review.
72
+
73
+ The safest grounded position is:
74
+
75
+ - keep the current limited similarity map,
76
+ - score wrong labels outside that map as `0.0`, so exact match stays the dominant path,
77
+ - let `priority`, `assignment_group`, and `resolution_action` keep the hard-task routing signal crisp.
required.md CHANGED
@@ -350,3 +350,26 @@ The project is ready when:
350
  7. HF deployment checks succeed or are as close to verified as possible before submission
351
  8. the docs are clean, current, and submission-ready
352
  9. the repo clearly presents Hackstreet Boys as the team
 
350
  7. HF deployment checks succeed or are as close to verified as possible before submission
351
  8. the docs are clean, current, and submission-ready
352
  9. the repo clearly presents Hackstreet Boys as the team
353
+
354
+ ## Current Compliance Snapshot
355
+
356
+ As of April 3, 2026, the Roopal-side compliance review says these items are already in place:
357
+
358
+ - real-world task definition is clear and stable
359
+ - typed models, `reset()`, `step()`, `state()`, and `openenv.yaml` are present in the repo
360
+ - 3-task easy -> medium -> hard ladder is present
361
+ - graders are deterministic and bounded to `[0.0, 1.0]`
362
+ - unit tests now prove scorer crispness, task invariants, and dataset coverage
363
+ - baseline heuristic results are recorded in the docs
364
+ - the README now includes Hugging Face Spaces frontmatter and a judge-facing grounded-scoring explanation
365
+ - an internal grounding audit exists in `analysis/grounding_audit.md`
366
+
367
+ The items still pending or shared with runtime-side work are:
368
+
369
+ - `openenv validate` evidence on the merged repo state
370
+ - Docker smoke evidence on the merged repo state
371
+ - Hugging Face deployment ping and reset verification
372
+ - structured `inference.py` log-format verification
373
+ - clean-machine rerun evidence if practical
374
+
375
+ The roadmap's short TRL / GRPO README example remains optional and should stay deferred until the pending validation items above are green.
tests/test_grader_unit.py CHANGED
@@ -5,7 +5,13 @@ import unittest
5
  import openenv_test_stubs # noqa: F401
6
 
7
  from models import HelpdeskTicketAction, HelpdeskTicketRecord
8
- from server.grader import grade_action
9
 
10
 
11
  def _ticket(
@@ -66,6 +72,23 @@ class GraderUnitTests(unittest.TestCase):
66
  self.assertAlmostEqual(score, 0.4)
67
  self.assertEqual(breakdown, {"issue_type": 0.4})
68
 
69
  def test_unrelated_issue_type_gets_zero_not_fuzzy_credit(self) -> None:
70
  ticket = _ticket(issue_type="onboarding")
71
  action = HelpdeskTicketAction(issue_type="spam_phishing")
@@ -85,6 +108,29 @@ class GraderUnitTests(unittest.TestCase):
85
  self.assertAlmostEqual(breakdown["priority"], 0.6)
86
  self.assertAlmostEqual(score, 0.84)
87
 
88
  def test_task_2_weights_apply_as_documented(self) -> None:
89
  ticket = _ticket(priority="high")
90
  action = HelpdeskTicketAction(issue_type="billing_license", priority="medium")
@@ -108,6 +154,28 @@ class GraderUnitTests(unittest.TestCase):
108
  self.assertEqual(breakdown["assignment_group"], 0.0)
109
  self.assertAlmostEqual(score, 0.75)
110
 
111
  def test_resolution_action_is_exact_match_only(self) -> None:
112
  ticket = _ticket()
113
  action = HelpdeskTicketAction(
@@ -122,6 +190,37 @@ class GraderUnitTests(unittest.TestCase):
122
  self.assertEqual(breakdown["resolution_action"], 0.0)
123
  self.assertAlmostEqual(score, 0.8)
124
 
125
 
126
  if __name__ == "__main__":
127
  unittest.main()
 
5
  import openenv_test_stubs # noqa: F401
6
 
7
  from models import HelpdeskTicketAction, HelpdeskTicketRecord
+from server.grader import (
+    ISSUE_TYPE_SIMILARITY,
+    PRIORITY_SCORES,
+    TASK_WEIGHTS,
+    grade_action,
+)
+from vocabulary import ISSUE_TYPES, PRIORITIES
15
 
16
 
17
  def _ticket(
 
72
  self.assertAlmostEqual(score, 0.4)
73
  self.assertEqual(breakdown, {"issue_type": 0.4})
74
 
+    def test_issue_type_scoring_matches_declared_similarity_table_exhaustively(self) -> None:
+        for expected in ISSUE_TYPES:
+            for predicted in ISSUE_TYPES:
+                with self.subTest(expected=expected, predicted=predicted):
+                    ticket = _ticket(issue_type=expected)
+                    action = HelpdeskTicketAction(issue_type=predicted)
+
+                    score, breakdown = grade_action(action, ticket, task_id=1)
+
+                    expected_score = (
+                        1.0
+                        if predicted == expected
+                        else ISSUE_TYPE_SIMILARITY.get((predicted, expected), 0.0)
+                    )
+                    self.assertAlmostEqual(score, expected_score)
+                    self.assertEqual(breakdown, {"issue_type": expected_score})
+
92
  def test_unrelated_issue_type_gets_zero_not_fuzzy_credit(self) -> None:
93
  ticket = _ticket(issue_type="onboarding")
94
  action = HelpdeskTicketAction(issue_type="spam_phishing")
 
108
  self.assertAlmostEqual(breakdown["priority"], 0.6)
109
  self.assertAlmostEqual(score, 0.84)
110
 
+    def test_priority_scoring_matches_declared_table_exhaustively(self) -> None:
+        for expected in PRIORITIES:
+            for predicted in PRIORITIES:
+                with self.subTest(expected=expected, predicted=predicted):
+                    ticket = _ticket(priority=expected)
+                    action = HelpdeskTicketAction(
+                        issue_type="billing_license",
+                        priority=predicted,
+                    )
+
+                    score, breakdown = grade_action(action, ticket, task_id=2)
+
+                    priority_score = (
+                        1.0
+                        if predicted == expected
+                        else PRIORITY_SCORES.get((predicted, expected), 0.0)
+                    )
+                    self.assertEqual(
+                        breakdown,
+                        {"issue_type": 1.0, "priority": priority_score},
+                    )
+                    self.assertAlmostEqual(score, 0.6 + 0.4 * priority_score)
+
134
  def test_task_2_weights_apply_as_documented(self) -> None:
135
  ticket = _ticket(priority="high")
136
  action = HelpdeskTicketAction(issue_type="billing_license", priority="medium")
 
154
  self.assertEqual(breakdown["assignment_group"], 0.0)
155
  self.assertAlmostEqual(score, 0.75)
156
 
+    def test_task_3_weights_apply_as_documented(self) -> None:
+        ticket = _ticket(priority="high")
+        action = HelpdeskTicketAction(
+            issue_type="billing_license",
+            priority="medium",
+            assignment_group="service_desk",
+            resolution_action="fulfill",
+        )
+
+        score, breakdown = grade_action(action, ticket, task_id=3)
+
+        self.assertEqual(
+            breakdown,
+            {
+                "issue_type": 1.0,
+                "priority": 0.5,
+                "assignment_group": 0.0,
+                "resolution_action": 1.0,
+            },
+        )
+        self.assertAlmostEqual(score, 0.65)
+
179
  def test_resolution_action_is_exact_match_only(self) -> None:
180
  ticket = _ticket()
181
  action = HelpdeskTicketAction(
 
190
  self.assertEqual(breakdown["resolution_action"], 0.0)
191
  self.assertAlmostEqual(score, 0.8)
192
 
+    def test_partial_credit_tables_never_override_exact_match(self) -> None:
+        for pair, value in ISSUE_TYPE_SIMILARITY.items():
+            with self.subTest(table="issue_type", pair=pair):
+                self.assertGreater(value, 0.0)
+                self.assertLess(value, 1.0)
+
+        for pair, value in PRIORITY_SCORES.items():
+            with self.subTest(table="priority", pair=pair):
+                self.assertGreater(value, 0.0)
+                self.assertLess(value, 1.0)
+
+    def test_task_weights_sum_to_one_for_each_task(self) -> None:
+        for task_id, weights in TASK_WEIGHTS.items():
+            with self.subTest(task_id=task_id):
+                self.assertAlmostEqual(sum(weights.values()), 1.0)
+
+    def test_grade_action_is_deterministic_for_same_inputs(self) -> None:
+        ticket = _ticket(issue_type="service_request", priority="medium")
+        action = HelpdeskTicketAction(
+            issue_type="general_inquiry",
+            priority="low",
+            assignment_group="license_ops",
+            resolution_action="assign",
+        )
+
+        first_score, first_breakdown = grade_action(action, ticket, task_id=3)
+        second_score, second_breakdown = grade_action(action, ticket, task_id=3)
+
+        self.assertEqual(first_score, second_score)
+        self.assertEqual(first_breakdown, second_breakdown)
+
224
 
225
  if __name__ == "__main__":
226
  unittest.main()
tests/test_tasks_unit.py CHANGED
@@ -9,7 +9,13 @@ import openenv_test_stubs # noqa: F401
9
  from models import HelpdeskTicketRecord
10
  from server import tasks as task_module
11
  from server.tasks import TASKS, get_task_definition, load_dataset
12
- from vocabulary import TASK_IDS
13
 
14
 
15
  class TasksAndDatasetUnitTests(unittest.TestCase):
@@ -31,6 +37,12 @@ class TasksAndDatasetUnitTests(unittest.TestCase):
31
  ],
32
  )
33
 
34
  def test_invalid_task_id_raises(self) -> None:
35
  with self.assertRaisesRegex(ValueError, "Unsupported task_id"):
36
  get_task_definition(0)
@@ -64,20 +76,33 @@ class TasksAndDatasetUnitTests(unittest.TestCase):
64
  dataset = load_dataset()
65
  issue_types = {record.issue_type for record in dataset}
66
 
-        self.assertEqual(
-            issue_types,
-            {
-                "application_support",
-                "billing_license",
-                "feature_request",
-                "general_inquiry",
-                "identity_access",
-                "onboarding",
-                "security_compliance",
-                "service_request",
-                "spam_phishing",
-            },
-        )
81
 
82
  def test_load_dataset_accepts_utf8_bom(self) -> None:
83
  sample = (
 
9
  from models import HelpdeskTicketRecord
10
  from server import tasks as task_module
11
  from server.tasks import TASKS, get_task_definition, load_dataset
+from vocabulary import (
+    ASSIGNMENT_GROUPS,
+    ISSUE_TYPES,
+    PRIORITIES,
+    RESOLUTION_ACTIONS,
+    TASK_IDS,
+)
19
 
20
 
21
  class TasksAndDatasetUnitTests(unittest.TestCase):
 
37
  ],
38
  )
39
 
+    def test_task_difficulty_ladder_is_frozen(self) -> None:
+        self.assertEqual(
+            [TASKS[task_id]["difficulty"] for task_id in TASK_IDS],
+            ["easy", "medium", "hard"],
+        )
+
46
  def test_invalid_task_id_raises(self) -> None:
47
  with self.assertRaisesRegex(ValueError, "Unsupported task_id"):
48
  get_task_definition(0)
 
76
  dataset = load_dataset()
77
  issue_types = {record.issue_type for record in dataset}
78
 
+        self.assertEqual(issue_types, set(ISSUE_TYPES))
+
+    def test_dataset_covers_all_defined_priorities(self) -> None:
+        dataset = load_dataset()
+        priorities = {record.priority for record in dataset}
+
+        self.assertEqual(priorities, set(PRIORITIES))
+
+    def test_dataset_covers_all_assignment_groups(self) -> None:
+        dataset = load_dataset()
+        assignment_groups = {record.assignment_group for record in dataset}
+
+        self.assertEqual(assignment_groups, set(ASSIGNMENT_GROUPS))
+
+    def test_dataset_covers_all_resolution_actions(self) -> None:
+        dataset = load_dataset()
+        resolution_actions = {record.resolution_action for record in dataset}
+
+        self.assertEqual(resolution_actions, set(RESOLUTION_ACTIONS))
+
+    def test_dataset_preserves_ambiguous_and_follow_up_cases(self) -> None:
+        dataset = load_dataset()
+        ambiguity_count = sum(1 for record in dataset if record.ambiguity_note)
+        follow_up_count = sum(1 for record in dataset if record.related_ticket_id)
+
+        self.assertGreaterEqual(ambiguity_count, 4)
+        self.assertGreaterEqual(follow_up_count, 3)
106
 
107
  def test_load_dataset_accepts_utf8_bom(self) -> None:
108
  sample = (