Spaces: Roopalgn / AIHack-ITHelpDesk

Roopalgn committed on Apr 3

Commit

6920aae

1 Parent(s): 72d2634

Complete Roopal roadmap work for April 4-7


Files changed (7)

KNOWLEDGE.md +32 -0
PROJECT_STATUS.md +34 -0
README.md +40 -0
analysis/grounding_audit.md +77 -0
required.md +23 -0
tests/test_grader_unit.py +100 -1
tests/test_tasks_unit.py +40 -15

KNOWLEDGE.md CHANGED

@@ -237,12 +237,21 @@ The grader is deterministic and intentionally simple to explain.
 - `assignment_group` gets exact credit
 - `resolution_action` gets exact credit
 Task weighting:
 - Task 1: only `issue_type`
 - Task 2: `issue_type` 60%, `priority` 40%
 - Task 3: `issue_type` 35%, `priority` 20%, `assignment_group` 25%, `resolution_action` 20%
 ## Reward Mental Model
 Step reward:
@@ -270,6 +279,18 @@ Current structure:
 The dataset is meant to test routing judgment, not just keyword spotting.
 ## Inference Script In Simple Terms
 `inference.py` is the baseline agent runner.
@@ -344,6 +365,17 @@ An April 6 audit confirmed:
 - the current local benchmark reference is still `1.0000`, `0.8800`, `0.9400`, overall `0.9400`
 - the remaining work is execution validation, not documentation cleanup
 ## What Still Needs Hands-On Verification
 The biggest remaining checks are packaging and clean-machine checks, not merge-state local execution.

 - `assignment_group` gets exact credit
 - `resolution_action` gets exact credit
+Just as important, the grader is not fuzzy by default:
+- exact matches stay dominant
+- wrong issue types outside the declared similarity map score `0.0`
+- wrong priorities outside the declared proximity table score `0.0`
+- assignment group and resolution action never receive partial credit
 Task weighting:
 - Task 1: only `issue_type`
 - Task 2: `issue_type` 60%, `priority` 40%
 - Task 3: `issue_type` 35%, `priority` 20%, `assignment_group` 25%, `resolution_action` 20%
+This is now proven in checked-in unit tests rather than left as a docs claim.
 ## Reward Mental Model
 Step reward:
 The dataset is meant to test routing judgment, not just keyword spotting.
+## Grounding Note
+The taxonomy and limited partial-credit policy were reviewed against public IT-support references recorded in `analysis/grounding_audit.md`.
+The grounding inputs used for that review were:
+- `Classification of IT Support Tickets`
+- `Semantic Similarity of IT Support Tickets`
+- `MSDialog`
+The key conclusion was to keep the similarity map narrow. The current issue-type near misses are defensible, but broader additions would blur operationally distinct routing actions too much this late in the submission cycle.
 ## Inference Script In Simple Terms
 `inference.py` is the baseline agent runner.
 - the current local benchmark reference is still `1.0000`, `0.8800`, `0.9400`, overall `0.9400`
 - the remaining work is execution validation, not documentation cleanup
+### April 6 and April 7 Roopal-side doc pass
+That follow-up pass added the remaining Roopal-owned public-clarity items:
+- Hugging Face Spaces README frontmatter
+- explicit judge-facing explanation that scoring is deterministic and only partially fuzzy in declared places
+- an internal grounding note tying the label space to public IT-support datasets
+- a refreshed compliance snapshot in `required.md`
+The optional TRL / GRPO README example was intentionally deferred because the shared runtime-validation gates are not all green yet.
 ## What Still Needs Hands-On Verification
 The biggest remaining checks are packaging and clean-machine checks, not merge-state local execution.
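
Reviewer note (illustrative, not part of the commit): the task weighting documented in `KNOWLEDGE.md` can be sketched as a weighted sum over per-field scores. The weights below are copied from the doc; the dict layout and the helper name `weighted_score` are assumptions, and the real `TASK_WEIGHTS` in `server/grader.py` may be structured differently.

```python
# Weights as documented in KNOWLEDGE.md. The dict shape is an assumption;
# server/grader.py is not shown in this diff.
TASK_WEIGHTS = {
    1: {"issue_type": 1.0},
    2: {"issue_type": 0.60, "priority": 0.40},
    3: {
        "issue_type": 0.35,
        "priority": 0.20,
        "assignment_group": 0.25,
        "resolution_action": 0.20,
    },
}


def weighted_score(breakdown: dict[str, float], task_id: int) -> float:
    """Combine per-field scores using the weights declared for the task.

    Hypothetical helper: it only illustrates the documented arithmetic.
    """
    weights = TASK_WEIGHTS[task_id]
    return sum(weight * breakdown[field] for field, weight in weights.items())


# Same per-field breakdown as test_task_3_weights_apply_as_documented:
# 0.35 * 1.0 + 0.20 * 0.5 + 0.25 * 0.0 + 0.20 * 1.0
task3_score = weighted_score(
    {
        "issue_type": 1.0,
        "priority": 0.5,
        "assignment_group": 0.0,
        "resolution_action": 1.0,
    },
    task_id=3,
)
```

With that breakdown the sum works out to 0.65, which is the value the new task 3 weight test asserts.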

PROJECT_STATUS.md CHANGED

@@ -191,3 +191,37 @@ Still pending after the current checkpoint:
 - perform a Docker smoke test from the current merged repo state
 - do a clean-machine dry run if possible before final submission freeze

+## April 3, 2026 (Pulled Forward April 4-5 Roopal Scope)
+Status: complete for the Roopal-owned roadmap items originally scheduled for April 4 and April 5
+Roopal-side work completed:
+- expanded `tests/test_grader_unit.py` to lock scorer crispness with exhaustive issue-type and priority-table checks
+- added explicit invariants for task-weight sums, exact-match dominance, and deterministic repeated grading
+- expanded `tests/test_tasks_unit.py` to cover the frozen task difficulty ladder plus dataset coverage across all issue types, priorities, assignment groups, and resolution actions
+- added `analysis/grounding_audit.md` as the internal grounding note requested by the roadmap
+- reviewed candidate issue-type similarity expansions and decided to keep the current similarity map unchanged
+Decision notes:
+- scorer fuzziness is now proven by tests to exist only where the declared similarity map or priority table allows it
+- no additional issue-type similarity pairs were adopted in this pass because the reviewed candidates were too operationally fuzzy
+## April 3, 2026 (Pulled Forward April 6-7 Roopal Scope)
+Status: complete for the Roopal-owned roadmap items originally scheduled for April 6 and April 7
+Roopal-side work completed:
+- added Hugging Face Spaces README frontmatter
+- updated `README.md` with an explicit judge-facing explanation of deterministic, grounded scoring
+- updated `KNOWLEDGE.md` to state clearly that the grader is not fuzzy by default and to reference the grounding audit
+- updated `required.md` with a current compliance snapshot separating already-satisfied requirements from shared pending validation gates
+- completed the final Roopal-side consistency pass across `README.md`, `KNOWLEDGE.md`, and `required.md`
+Decision notes:
+- no scorer change was needed from the grounding review, so this pass stayed documentation-only
+- the optional TRL / GRPO README example remains deferred until the shared runtime-validation gates are green

README.md CHANGED

@@ -1,3 +1,17 @@
 # IT Helpdesk Ticket Routing OpenEnv
 > Meta PyTorch OpenEnv Hackathon Round 1 submission
@@ -152,6 +166,26 @@ average(per_ticket_scores) - 0.03 * max(0, steps_taken - queue_size)
 The result is clamped to `[0.0, 1.0]`.
 ## Dataset Snapshot
 The labeled dataset in `data/dataset.json` currently contains 45 tickets spanning straightforward and ambiguous helpdesk scenarios.
@@ -349,6 +383,8 @@ The repo is already aligned on:
 - typed models
 - grader and reward design
 - packaging metadata and Docker entry point
 An April 6 repo audit also confirmed that all required submission files are present:
@@ -359,4 +395,8 @@ An April 6 repo audit also confirmed that all required submission files are pres
 Still pending before final submission:
 - a Docker smoke test from a machine with Docker installed
 - a final clean-machine dry run if possible before submission freeze

+---
+title: IT Helpdesk Ticket Routing OpenEnv
+colorFrom: blue
+colorTo: indigo
+sdk: docker
+pinned: false
+app_port: 7860
+tags:
+  - openenv
+  - helpdesk
+  - ticket-routing
+  - customer-support
+---
 # IT Helpdesk Ticket Routing OpenEnv
 > Meta PyTorch OpenEnv Hackathon Round 1 submission
 The result is clamped to `[0.0, 1.0]`.
+## Grounded Scoring
+The grader is intentionally not fuzzy by default.
+- exact match is the dominant path for every field
+- `assignment_group` and `resolution_action` are exact-match only
+- `priority` only gets proximity credit from the declared table in `server/grader.py`
+- `issue_type` only gets partial credit for a small declared similarity map
+- wrong labels outside those explicit maps score `0.0`
+That scoring policy is now backed by checked-in unit tests in `tests/test_grader_unit.py` and `tests/test_tasks_unit.py`.
+The label set and partial-credit choices were also reviewed against public IT-support references captured in `analysis/grounding_audit.md`, including:
+- `Classification of IT Support Tickets`
+- `Semantic Similarity of IT Support Tickets`
+- `MSDialog`
+That grounding pass supported keeping the current similarity map small and explainable. No new issue-type similarity pairs were added from the review.
 ## Dataset Snapshot
 The labeled dataset in `data/dataset.json` currently contains 45 tickets spanning straightforward and ambiguous helpdesk scenarios.
 - typed models
 - grader and reward design
 - packaging metadata and Docker entry point
+- Hugging Face Spaces README frontmatter
+- judge-facing documentation of deterministic, grounded scoring
 An April 6 repo audit also confirmed that all required submission files are present:
 Still pending before final submission:
 - a Docker smoke test from a machine with Docker installed
+- `openenv validate` evidence on the current merged repo state
+- structured `inference.py` log-format verification on the current merged repo state
 - a final clean-machine dry run if possible before submission freeze
+The short TRL / GRPO README example from the roadmap is intentionally deferred until the shared runtime and validation gates are green.
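
Reviewer note (illustrative, not part of the commit): the README context lines above quote the episode formula `average(per_ticket_scores) - 0.03 * max(0, steps_taken - queue_size)`, clamped to `[0.0, 1.0]`. A minimal sketch of that arithmetic follows. The function name `episode_reward` and the handling of an empty score list are assumptions, not the repository's implementation.

```python
def episode_reward(
    per_ticket_scores: list[float],
    steps_taken: int,
    queue_size: int,
) -> float:
    """Average the per-ticket scores, subtract a step-overrun penalty, clamp.

    Hypothetical helper that mirrors the formula quoted in the README.
    """
    if not per_ticket_scores:
        # Assumption: an episode with no graded tickets scores zero.
        return 0.0
    average = sum(per_ticket_scores) / len(per_ticket_scores)
    # 0.03 is charged for each step beyond the queue size.
    penalty = 0.03 * max(0, steps_taken - queue_size)
    return min(1.0, max(0.0, average - penalty))
```

For example, scores of `1.0` and `0.8` over 4 steps on a queue of 2 average to 0.9 and lose 0.06 for the two extra steps.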

analysis/grounding_audit.md ADDED

@@ -0,0 +1,77 @@

+# Grounding Audit For Taxonomy And Similarity Decisions
+> Internal note for the roadmap work originally planned for April 5, 2026.
+> Reviewed on April 3, 2026 and pulled forward ahead of schedule.
+## Goal
+Ground the current ticket taxonomy and the limited partial-credit policy against real public IT-support data without turning external datasets into a runtime dependency.
+## Sources Reviewed
+1. [Classification of IT Support Tickets](https://zenodo.org/records/7648117)
+   - Zenodo dataset with 2,229 manually classified support tickets.
+   - Dataset description says the tickets were classified by three IT support professionals.
+   - The public preview exposes seven coarse categories: `Fileservice`, `Support general`, `Software`, `O365`, `Active Directory`, `Computer-Services`, and `EOL`.
+2. [Semantic Similarity of IT Support Tickets](https://zenodo.org/records/7426225)
+   - Zenodo dataset with 300 ticket pairs manually labeled for semantic similarity.
+   - The description says three IT support professionals performed the labeling.
+   - This is the best direct grounding for keeping similarity explicit and limited instead of treating the whole label space as fuzzy.
+3. [MSDialog dataset page](https://ciir.cs.umass.edu/downloads/msdialog/)
+   - Technical-support dialog corpus drawn from Microsoft Community.
+   - The site reports 35,000 dialogs in `MSDialog-Complete` and 2,199 labeled dialogs with 10,020 utterances in `MSDialog-Intent`.
+   - This grounds our use of follow-up cases, clarification-heavy threads, and helpdesk-style conversational language.
+## Mapping Principle
+The external datasets validate that real IT support traffic mixes access problems, software incidents, generic support questions, procurement-like requests, and multi-turn follow-ups. Our label set is more operational than the public category sets, so the mappings below are judgment calls based on source descriptions and public previews rather than exact label equivalence.
+## Grounding Examples
+1. Active Directory lockout, MFA trouble, or password reset -> `identity_access` -> exact-match dominant, with `onboarding` as the only defensible adjacent label when the request is really about new-user provisioning.
+2. New hire account setup or contractor access provisioning -> `onboarding` -> partial-credit adjacent to `identity_access`, because both can surface as account enablement work before ownership is fully resolved.
+3. Office or application crash, timeout, webhook failure, or migration-script breakage -> `application_support` -> partial-credit adjacent to `feature_request` only when the report reads like a capability gap rather than a break/fix issue.
+4. Feature wishlist or export-format enhancement request -> `feature_request` -> partial-credit adjacent to `application_support` only when the user reports the missing capability as if it were a defect.
+5. Vendor-evaluation question, demo request, or quote request -> `service_request` -> partial-credit adjacent to `general_inquiry` when the request is still exploratory rather than a committed operational action.
+6. Seat expansion or provisioning-style commercial request -> `service_request` -> partial-credit adjacent to `billing_license` when procurement and account-admin signals are mixed in the same ticket.
+7. Refund, invoice discrepancy, subscription cancellation, or payment-admin issue -> `billing_license` -> partial-credit adjacent to `service_request` only in commercial admin cases that overlap with a procurement or seat-change request.
+8. Broad capability question or lightweight product clarification -> `general_inquiry` -> partial-credit adjacent to `service_request` or `feature_request` when the request is vague enough to look like either evaluation or roadmap feedback.
+9. Spam lure or credential-phishing message sent to the inbox -> `spam_phishing` -> partial-credit adjacent to `security_compliance` only for security-themed inbound items, not for normal access or software tickets.
+10. GDPR deletion request, DPA request, audit finding, or mandatory MFA policy notice -> `security_compliance` -> exact-match dominant, with very limited adjacency to `spam_phishing` for suspicious security reports and a low-confidence edge to `billing_license` only in contractual paperwork contexts.
+11. Reopened outage thread or repeated bug report escalation -> `application_support` -> exact-match dominant; the main change across turns is usually `priority`, not `issue_type`.
+12. Repeated lockout complaint or suspension follow-up -> `identity_access` -> exact-match dominant; follow-up behavior is grounded by MSDialog-style multi-turn support flow rather than by adding new label fuzziness.
+## Review Of Current Similarity Pairs
+The current `ISSUE_TYPE_SIMILARITY` map stays intentionally small. The defensible themes are:
+- `billing_license` <-> `service_request`: commercial admin and procurement requests can overlap before the owning team is clear.
+- `application_support` <-> `identity_access`: SSO and login failures can initially look like either app failure or access failure.
+- `application_support` <-> `feature_request`: some users describe missing functionality in bug-report language.
+- `onboarding` <-> `identity_access`: provisioning and account enablement are adjacent in real helpdesk traffic.
+- `general_inquiry` <-> `feature_request`: vague product questions can blur into roadmap requests.
+- `general_inquiry` <-> `service_request`: vendor-evaluation and exploratory capability questions often overlap.
+- `spam_phishing` <-> `security_compliance`: both are security-facing, but they should stay separate from normal access or app-routing labels.
+- `security_compliance` <-> `billing_license`: kept only as a very low-score edge for contract and paperwork overlap; this is the weakest current pair and should not be expanded further without ticket-level evidence.
+## Candidate Expansions Reviewed And Rejected
+These pairs were reviewed during the roadmap pass originally planned for April 5 and are intentionally not being added:
+- `onboarding` <-> `service_request`: both can involve setup, but the owning teams and next actions diverge too quickly.
+- `feature_request` <-> `service_request`: roadmap asks and procurement actions are operationally different.
+- `security_compliance` <-> `identity_access`: policy obligations may mention accounts, but the compliance workflow is distinct from user access support.
+- `billing_license` <-> `identity_access`: nonpayment or suspension can mention lockout symptoms, but the root-cause owner is different.
+- `application_support` <-> `billing_license`: mixed commercial and outage narratives exist, but broad partial credit here would blur incident handling too much.
+## Decision
+No new issue-type similarity pairs should be added from this review.
+The safest grounded position is:
+- keep the current limited similarity map,
+- rely on exact-match scoring for most wrong labels,
+- let `priority`, `assignment_group`, and `resolution_action` keep the hard-task routing signal crisp.
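
Reviewer note (illustrative, not part of the commit): the audit's decision can be expressed as a simple set check that none of the rejected candidates appear among the current pairs. The pair names below are copied from the audit. Treating each pair as unordered, and the names `CURRENT_PAIRS`, `REJECTED_PAIRS`, and `is_adjacent`, are assumptions; the real `ISSUE_TYPE_SIMILARITY` map is keyed by `(predicted, expected)` tuples and carries scores that this diff does not show.

```python
# Unordered pairs listed under "Review Of Current Similarity Pairs".
CURRENT_PAIRS = {
    frozenset(pair)
    for pair in [
        ("billing_license", "service_request"),
        ("application_support", "identity_access"),
        ("application_support", "feature_request"),
        ("onboarding", "identity_access"),
        ("general_inquiry", "feature_request"),
        ("general_inquiry", "service_request"),
        ("spam_phishing", "security_compliance"),
        ("security_compliance", "billing_license"),
    ]
}

# Unordered pairs listed under "Candidate Expansions Reviewed And Rejected".
REJECTED_PAIRS = {
    frozenset(pair)
    for pair in [
        ("onboarding", "service_request"),
        ("feature_request", "service_request"),
        ("security_compliance", "identity_access"),
        ("billing_license", "identity_access"),
        ("application_support", "billing_license"),
    ]
}


def is_adjacent(label_a: str, label_b: str) -> bool:
    """Return True when the two labels form one of the current pairs."""
    return frozenset((label_a, label_b)) in CURRENT_PAIRS


# The decision in the audit amounts to these two sets never overlapping.
no_rejected_pair_adopted = CURRENT_PAIRS.isdisjoint(REJECTED_PAIRS)
```

A check of this shape could sit next to the exhaustive grader tests if the team wants the rejected list enforced in code rather than only recorded in the audit.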

required.md CHANGED

@@ -350,3 +350,26 @@ The project is ready when:
 7. HF deployment checks succeed or are as close to verified as possible before submission
 8. the docs are clean, current, and submission-ready
 9. the repo clearly presents Hackstreet Boys as the team

+## Current Compliance Snapshot
+As of April 3, 2026, the Roopal-side compliance review says these items are already in place:
+- real-world task definition is clear and stable
+- typed models, `reset()`, `step()`, `state()`, and `openenv.yaml` are present in the repo
+- 3-task easy -> medium -> hard ladder is present
+- graders are deterministic and bounded to `[0.0, 1.0]`
+- unit tests now prove scorer crispness, task invariants, and dataset coverage
+- baseline heuristic results are recorded in the docs
+- the README now includes Hugging Face Spaces frontmatter and a judge-facing grounded-scoring explanation
+- an internal grounding audit exists in `analysis/grounding_audit.md`
+The items still pending or shared with runtime-side work are:
+- `openenv validate` evidence on the merged repo state
+- Docker smoke evidence on the merged repo state
+- Hugging Face deployment ping and reset verification
+- structured `inference.py` log-format verification
+- clean-machine rerun evidence if practical
+The roadmap's short TRL / GRPO README example remains optional and should stay deferred until the pending validation items above are green.

tests/test_grader_unit.py CHANGED

@@ -5,7 +5,13 @@ import unittest
 import openenv_test_stubs  # noqa: F401
 from models import HelpdeskTicketAction, HelpdeskTicketRecord
-from server.grader import grade_action
 def _ticket(
@@ -66,6 +72,23 @@ class GraderUnitTests(unittest.TestCase):
         self.assertAlmostEqual(score, 0.4)
         self.assertEqual(breakdown, {"issue_type": 0.4})
     def test_unrelated_issue_type_gets_zero_not_fuzzy_credit(self) -> None:
         ticket = _ticket(issue_type="onboarding")
         action = HelpdeskTicketAction(issue_type="spam_phishing")
@@ -85,6 +108,29 @@ class GraderUnitTests(unittest.TestCase):
         self.assertAlmostEqual(breakdown["priority"], 0.6)
         self.assertAlmostEqual(score, 0.84)
     def test_task_2_weights_apply_as_documented(self) -> None:
         ticket = _ticket(priority="high")
         action = HelpdeskTicketAction(issue_type="billing_license", priority="medium")
@@ -108,6 +154,28 @@ class GraderUnitTests(unittest.TestCase):
         self.assertEqual(breakdown["assignment_group"], 0.0)
         self.assertAlmostEqual(score, 0.75)
     def test_resolution_action_is_exact_match_only(self) -> None:
         ticket = _ticket()
         action = HelpdeskTicketAction(
@@ -122,6 +190,37 @@ class GraderUnitTests(unittest.TestCase):
         self.assertEqual(breakdown["resolution_action"], 0.0)
         self.assertAlmostEqual(score, 0.8)
 if __name__ == "__main__":
     unittest.main()

 import openenv_test_stubs  # noqa: F401
 from models import HelpdeskTicketAction, HelpdeskTicketRecord
+from server.grader import (
+    ISSUE_TYPE_SIMILARITY,
+    PRIORITY_SCORES,
+    TASK_WEIGHTS,
+    grade_action,
+)
+from vocabulary import ISSUE_TYPES, PRIORITIES
 def _ticket(
         self.assertAlmostEqual(score, 0.4)
         self.assertEqual(breakdown, {"issue_type": 0.4})
+    def test_issue_type_scoring_matches_declared_similarity_table_exhaustively(self) -> None:
+        for expected in ISSUE_TYPES:
+            for predicted in ISSUE_TYPES:
+                with self.subTest(expected=expected, predicted=predicted):
+                    ticket = _ticket(issue_type=expected)
+                    action = HelpdeskTicketAction(issue_type=predicted)
+                    score, breakdown = grade_action(action, ticket, task_id=1)
+                    expected_score = (
+                        1.0
+                        if predicted == expected
+                        else ISSUE_TYPE_SIMILARITY.get((predicted, expected), 0.0)
+                    )
+                    self.assertAlmostEqual(score, expected_score)
+                    self.assertEqual(breakdown, {"issue_type": expected_score})
     def test_unrelated_issue_type_gets_zero_not_fuzzy_credit(self) -> None:
         ticket = _ticket(issue_type="onboarding")
         action = HelpdeskTicketAction(issue_type="spam_phishing")
         self.assertAlmostEqual(breakdown["priority"], 0.6)
         self.assertAlmostEqual(score, 0.84)
+    def test_priority_scoring_matches_declared_table_exhaustively(self) -> None:
+        for expected in PRIORITIES:
+            for predicted in PRIORITIES:
+                with self.subTest(expected=expected, predicted=predicted):
+                    ticket = _ticket(priority=expected)
+                    action = HelpdeskTicketAction(
+                        issue_type="billing_license",
+                        priority=predicted,
+                    )
+                    score, breakdown = grade_action(action, ticket, task_id=2)
+                    priority_score = (
+                        1.0
+                        if predicted == expected
+                        else PRIORITY_SCORES.get((predicted, expected), 0.0)
+                    )
+                    self.assertEqual(
+                        breakdown,
+                        {"issue_type": 1.0, "priority": priority_score},
+                    )
+                    self.assertAlmostEqual(score, 0.6 + 0.4 * priority_score)
     def test_task_2_weights_apply_as_documented(self) -> None:
         ticket = _ticket(priority="high")
         action = HelpdeskTicketAction(issue_type="billing_license", priority="medium")
         self.assertEqual(breakdown["assignment_group"], 0.0)
         self.assertAlmostEqual(score, 0.75)
+    def test_task_3_weights_apply_as_documented(self) -> None:
+        ticket = _ticket(priority="high")
+        action = HelpdeskTicketAction(
+            issue_type="billing_license",
+            priority="medium",
+            assignment_group="service_desk",
+            resolution_action="fulfill",
+        )
+        score, breakdown = grade_action(action, ticket, task_id=3)
+        self.assertEqual(
+            breakdown,
+            {
+                "issue_type": 1.0,
+                "priority": 0.5,
+                "assignment_group": 0.0,
+                "resolution_action": 1.0,
+            },
+        )
+        self.assertAlmostEqual(score, 0.65)
     def test_resolution_action_is_exact_match_only(self) -> None:
         ticket = _ticket()
         action = HelpdeskTicketAction(
         self.assertEqual(breakdown["resolution_action"], 0.0)
         self.assertAlmostEqual(score, 0.8)
+    def test_partial_credit_tables_never_override_exact_match(self) -> None:
+        for pair, value in ISSUE_TYPE_SIMILARITY.items():
+            with self.subTest(table="issue_type", pair=pair):
+                self.assertGreater(value, 0.0)
+                self.assertLess(value, 1.0)
+        for pair, value in PRIORITY_SCORES.items():
+            with self.subTest(table="priority", pair=pair):
+                self.assertGreater(value, 0.0)
+                self.assertLess(value, 1.0)
+    def test_task_weights_sum_to_one_for_each_task(self) -> None:
+        for task_id, weights in TASK_WEIGHTS.items():
+            with self.subTest(task_id=task_id):
+                self.assertAlmostEqual(sum(weights.values()), 1.0)
+    def test_grade_action_is_deterministic_for_same_inputs(self) -> None:
+        ticket = _ticket(issue_type="service_request", priority="medium")
+        action = HelpdeskTicketAction(
+            issue_type="general_inquiry",
+            priority="low",
+            assignment_group="license_ops",
+            resolution_action="assign",
+        )
+        first_score, first_breakdown = grade_action(action, ticket, task_id=3)
+        second_score, second_breakdown = grade_action(action, ticket, task_id=3)
+        self.assertEqual(first_score, second_score)
+        self.assertEqual(first_breakdown, second_breakdown)
 if __name__ == "__main__":
     unittest.main()
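
Reviewer note (illustrative, not part of the commit): the two exhaustive tests above encode one per-field rule: an exact match scores `1.0`, otherwise the score comes from the declared table keyed by `(predicted, expected)`, and anything absent from the table scores `0.0`. A minimal sketch of that rule follows. The helper `score_field` and the one-entry `EXAMPLE_PRIORITY_TABLE` are hypothetical; the single entry reuses the `0.5` that the task 3 weight test expects for predicting `medium` against `high`, and the real `PRIORITY_SCORES` table is not shown in this diff.

```python
def score_field(
    predicted: str,
    expected: str,
    partial_table: dict[tuple[str, str], float],
) -> float:
    """Exact match wins; otherwise fall back to the declared table or zero."""
    if predicted == expected:
        return 1.0
    return partial_table.get((predicted, expected), 0.0)


# Toy table for illustration only, keyed by (predicted, expected).
EXAMPLE_PRIORITY_TABLE = {("medium", "high"): 0.5}

exact = score_field("high", "high", EXAMPLE_PRIORITY_TABLE)
near_miss = score_field("medium", "high", EXAMPLE_PRIORITY_TABLE)
undeclared = score_field("low", "high", EXAMPLE_PRIORITY_TABLE)
```

Passing an empty table makes a field exact-match only, which is the behaviour the tests require for `assignment_group` and `resolution_action`.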

tests/test_tasks_unit.py CHANGED

@@ -9,7 +9,13 @@ import openenv_test_stubs  # noqa: F401
 from models import HelpdeskTicketRecord
 from server import tasks as task_module
 from server.tasks import TASKS, get_task_definition, load_dataset
-from vocabulary import TASK_IDS
 class TasksAndDatasetUnitTests(unittest.TestCase):
@@ -31,6 +37,12 @@ class TasksAndDatasetUnitTests(unittest.TestCase):
             ],
         )
     def test_invalid_task_id_raises(self) -> None:
         with self.assertRaisesRegex(ValueError, "Unsupported task_id"):
             get_task_definition(0)
@@ -64,20 +76,33 @@ class TasksAndDatasetUnitTests(unittest.TestCase):
         dataset = load_dataset()
         issue_types = {record.issue_type for record in dataset}
-        self.assertEqual(
-            issue_types,
-            {
-                "application_support",
-                "billing_license",
-                "feature_request",
-                "general_inquiry",
-                "identity_access",
-                "onboarding",
-                "security_compliance",
-                "service_request",
-                "spam_phishing",
-            },
-        )
     def test_load_dataset_accepts_utf8_bom(self) -> None:
         sample = (

 from models import HelpdeskTicketRecord
 from server import tasks as task_module
 from server.tasks import TASKS, get_task_definition, load_dataset
+from vocabulary import (
+    ASSIGNMENT_GROUPS,
+    ISSUE_TYPES,
+    PRIORITIES,
+    RESOLUTION_ACTIONS,
+    TASK_IDS,
+)
 class TasksAndDatasetUnitTests(unittest.TestCase):
             ],
         )
+    def test_task_difficulty_ladder_is_frozen(self) -> None:
+        self.assertEqual(
+            [TASKS[task_id]["difficulty"] for task_id in TASK_IDS],
+            ["easy", "medium", "hard"],
+        )
     def test_invalid_task_id_raises(self) -> None:
         with self.assertRaisesRegex(ValueError, "Unsupported task_id"):
             get_task_definition(0)
         dataset = load_dataset()
         issue_types = {record.issue_type for record in dataset}
+        self.assertEqual(issue_types, set(ISSUE_TYPES))
+    def test_dataset_covers_all_defined_priorities(self) -> None:
+        dataset = load_dataset()
+        priorities = {record.priority for record in dataset}
+        self.assertEqual(priorities, set(PRIORITIES))
+    def test_dataset_covers_all_assignment_groups(self) -> None:
+        dataset = load_dataset()
+        assignment_groups = {record.assignment_group for record in dataset}
+        self.assertEqual(assignment_groups, set(ASSIGNMENT_GROUPS))
+    def test_dataset_covers_all_resolution_actions(self) -> None:
+        dataset = load_dataset()
+        resolution_actions = {record.resolution_action for record in dataset}
+        self.assertEqual(resolution_actions, set(RESOLUTION_ACTIONS))
+    def test_dataset_preserves_ambiguous_and_follow_up_cases(self) -> None:
+        dataset = load_dataset()
+        ambiguity_count = sum(1 for record in dataset if record.ambiguity_note)
+        follow_up_count = sum(1 for record in dataset if record.related_ticket_id)
+        self.assertGreaterEqual(ambiguity_count, 4)
+        self.assertGreaterEqual(follow_up_count, 3)
     def test_load_dataset_accepts_utf8_bom(self) -> None:
         sample = (