Complete Roopal roadmap work for April 4-7
- KNOWLEDGE.md +32 -0
- PROJECT_STATUS.md +34 -0
- README.md +40 -0
- analysis/grounding_audit.md +77 -0
- required.md +23 -0
- tests/test_grader_unit.py +100 -1
- tests/test_tasks_unit.py +40 -15
KNOWLEDGE.md
CHANGED
|
@@ -237,12 +237,21 @@ The grader is deterministic and intentionally simple to explain.
|
|
| 237 |
- `assignment_group` gets exact credit
|
| 238 |
- `resolution_action` gets exact credit
|
| 239 |
|
| 240 |
Task weighting:
|
| 241 |
|
| 242 |
- Task 1: only `issue_type`
|
| 243 |
- Task 2: `issue_type` 60%, `priority` 40%
|
| 244 |
- Task 3: `issue_type` 35%, `priority` 20%, `assignment_group` 25%, `resolution_action` 20%
|
| 245 |
|
| 246 |
## Reward Mental Model
|
| 247 |
|
| 248 |
Step reward:
|
|
@@ -270,6 +279,18 @@ Current structure:
|
|
| 270 |
|
| 271 |
The dataset is meant to test routing judgment, not just keyword spotting.
|
| 272 |
|
| 273 |
## Inference Script In Simple Terms
|
| 274 |
|
| 275 |
`inference.py` is the baseline agent runner.
|
|
@@ -344,6 +365,17 @@ An April 6 audit confirmed:
|
|
| 344 |
- the current local benchmark reference is still `1.0000`, `0.8800`, `0.9400`, overall `0.9400`
|
| 345 |
- the remaining work is execution validation, not documentation cleanup
|
| 346 |
|
| 347 |
## What Still Needs Hands-On Verification
|
| 348 |
|
| 349 |
The biggest remaining checks are packaging and clean-machine checks, not merge-state local execution.
|
|
|
|
| 237 |
- `assignment_group` gets exact credit
|
| 238 |
- `resolution_action` gets exact credit
|
| 239 |
|
| 240 |
+
Just as important, the grader is not fuzzy by default:
|
| 241 |
+
|
| 242 |
+
- exact matches stay dominant
|
| 243 |
+
- wrong issue types outside the declared similarity map score `0.0`
|
| 244 |
+
- wrong priorities outside the declared proximity table score `0.0`
|
| 245 |
+
- assignment group and resolution action never receive partial credit
|
| 246 |
+
|
| 247 |
Task weighting:
|
| 248 |
|
| 249 |
- Task 1: only `issue_type`
|
| 250 |
- Task 2: `issue_type` 60%, `priority` 40%
|
| 251 |
- Task 3: `issue_type` 35%, `priority` 20%, `assignment_group` 25%, `resolution_action` 20%
|
| 252 |
|
| 253 |
+
This is now proven in checked-in unit tests rather than left as a docs claim.
|
| 254 |
+
|
| 255 |
## Reward Mental Model
|
| 256 |
|
| 257 |
Step reward:
|
|
|
|
| 279 |
|
| 280 |
The dataset is meant to test routing judgment, not just keyword spotting.
|
| 281 |
|
| 282 |
+
## Grounding Note
|
| 283 |
+
|
| 284 |
+
The taxonomy and limited partial-credit policy were reviewed against public IT-support references recorded in `analysis/grounding_audit.md`.
|
| 285 |
+
|
| 286 |
+
The grounding inputs used for that review were:
|
| 287 |
+
|
| 288 |
+
- `Classification of IT Support Tickets`
|
| 289 |
+
- `Semantic Similarity of IT Support Tickets`
|
| 290 |
+
- `MSDialog`
|
| 291 |
+
|
| 292 |
+
The key conclusion was to keep the similarity map narrow. The current issue-type near misses are defensible, but broader additions would blur operationally distinct routing actions too much this late in the submission cycle.
|
| 293 |
+
|
| 294 |
## Inference Script In Simple Terms
|
| 295 |
|
| 296 |
`inference.py` is the baseline agent runner.
|
|
|
|
| 365 |
- the current local benchmark reference is still `1.0000`, `0.8800`, `0.9400`, overall `0.9400`
|
| 366 |
- the remaining work is execution validation, not documentation cleanup
|
| 367 |
|
| 368 |
+
### April 6 and April 7 Roopal-side doc pass
|
| 369 |
+
|
| 370 |
+
That follow-up pass added the remaining Roopal-owned public-clarity items:
|
| 371 |
+
|
| 372 |
+
- Hugging Face Spaces README frontmatter
|
| 373 |
+
- explicit judge-facing explanation that scoring is deterministic and only partially fuzzy in declared places
|
| 374 |
+
- an internal grounding note tying the label space to public IT-support datasets
|
| 375 |
+
- a refreshed compliance snapshot in `required.md`
|
| 376 |
+
|
| 377 |
+
The optional TRL / GRPO README example was intentionally deferred because the shared runtime-validation gates are not all green yet.
|
| 378 |
+
|
| 379 |
## What Still Needs Hands-On Verification
|
| 380 |
|
| 381 |
The biggest remaining checks are packaging and clean-machine checks, not merge-state local execution.
|
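Reviewer sketch: the "not fuzzy by default" rule added to `KNOWLEDGE.md` amounts to a three-way lookup per field. The snippet below is a minimal illustration, not the code in `server/grader.py`. The function name `field_score` and the values in `EXAMPLE_SIMILARITY` are hypothetical; only the `(predicted, expected)` key order is taken from the unit tests in this commit.

```python
from typing import Mapping, Optional, Tuple

# Hypothetical partial-credit table for illustration only.
# The real declared values live in server/grader.py.
EXAMPLE_SIMILARITY: Mapping[Tuple[str, str], float] = {
    ("onboarding", "identity_access"): 0.4,
    ("identity_access", "onboarding"): 0.4,
}


def field_score(
    predicted: str,
    expected: str,
    partial_credit: Optional[Mapping[Tuple[str, str], float]] = None,
) -> float:
    """Score one field: exact match, declared near miss, or zero."""
    if predicted == expected:
        # Exact match always wins, regardless of any table entry.
        return 1.0
    if partial_credit is None:
        # Exact-match-only fields (e.g. assignment_group) get no partial credit.
        return 0.0
    # Only pairs declared in the table earn credit; everything else is 0.0.
    return partial_credit.get((predicted, expected), 0.0)
```

Passing no table models the exact-match-only fields, so the same helper covers `assignment_group` and `resolution_action` as well as the two fields that allow declared near misses.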
PROJECT_STATUS.md
CHANGED
|
@@ -191,3 +191,37 @@ Still pending after the current checkpoint:
|
|
| 191 |
|
| 192 |
- perform a Docker smoke test from the current merged repo state
|
| 193 |
- do a clean-machine dry run if possible before final submission freeze
|
|
|
|
| 191 |
|
| 192 |
- perform a Docker smoke test from the current merged repo state
|
| 193 |
- do a clean-machine dry run if possible before final submission freeze
|
| 194 |
+
|
| 195 |
+
## April 3, 2026 (Pulled Forward April 4-5 Roopal Scope)
|
| 196 |
+
|
| 197 |
+
Status: complete for the Roopal-owned roadmap items originally scheduled for April 4 and April 5
|
| 198 |
+
|
| 199 |
+
Roopal-side work completed:
|
| 200 |
+
|
| 201 |
+
- expanded `tests/test_grader_unit.py` to lock scorer crispness with exhaustive issue-type and priority-table checks
|
| 202 |
+
- added explicit invariants for task-weight sums, exact-match dominance, and deterministic repeated grading
|
| 203 |
+
- expanded `tests/test_tasks_unit.py` to cover the frozen task difficulty ladder plus dataset coverage across all issue types, priorities, assignment groups, and resolution actions
|
| 204 |
+
- added `analysis/grounding_audit.md` as the internal grounding note requested by the roadmap
|
| 205 |
+
- reviewed candidate issue-type similarity expansions and decided to keep the current similarity map unchanged
|
| 206 |
+
|
| 207 |
+
Decision notes:
|
| 208 |
+
|
| 209 |
+
- scorer fuzziness is now proven by tests to exist only where the declared similarity map or priority table allows it
|
| 210 |
+
- no additional issue-type similarity pairs were adopted in this pass because the reviewed candidates were too operationally fuzzy
|
| 211 |
+
|
| 212 |
+
## April 3, 2026 (Pulled Forward April 6-7 Roopal Scope)
|
| 213 |
+
|
| 214 |
+
Status: complete for the Roopal-owned roadmap items originally scheduled for April 6 and April 7
|
| 215 |
+
|
| 216 |
+
Roopal-side work completed:
|
| 217 |
+
|
| 218 |
+
- added Hugging Face Spaces README frontmatter
|
| 219 |
+
- updated `README.md` with an explicit judge-facing explanation of deterministic, grounded scoring
|
| 220 |
+
- updated `KNOWLEDGE.md` to state clearly that the grader is not fuzzy by default and to reference the grounding audit
|
| 221 |
+
- updated `required.md` with a current compliance snapshot separating already-satisfied requirements from shared pending validation gates
|
| 222 |
+
- completed the final Roopal-side consistency pass across `README.md`, `KNOWLEDGE.md`, and `required.md`
|
| 223 |
+
|
| 224 |
+
Decision notes:
|
| 225 |
+
|
| 226 |
+
- no scorer change was needed from the grounding review, so this pass stayed documentation-only
|
| 227 |
+
- the optional TRL / GRPO README example remains deferred until the shared runtime-validation gates are green
|
README.md
CHANGED
|
@@ -1,3 +1,17 @@
|
|
| 1 |
# IT Helpdesk Ticket Routing OpenEnv
|
| 2 |
|
| 3 |
> Meta PyTorch OpenEnv Hackathon Round 1 submission
|
|
@@ -152,6 +166,26 @@ average(per_ticket_scores) - 0.03 * max(0, steps_taken - queue_size)
|
|
| 152 |
|
| 153 |
The result is clamped to `[0.0, 1.0]`.
|
| 154 |
|
| 155 |
## Dataset Snapshot
|
| 156 |
|
| 157 |
The labeled dataset in `data/dataset.json` currently contains 45 tickets spanning straightforward and ambiguous helpdesk scenarios.
|
|
@@ -349,6 +383,8 @@ The repo is already aligned on:
|
|
| 349 |
- typed models
|
| 350 |
- grader and reward design
|
| 351 |
- packaging metadata and Docker entry point
|
| 352 |
|
| 353 |
An April 6 repo audit also confirmed that all required submission files are present:
|
| 354 |
|
|
@@ -359,4 +395,8 @@ An April 6 repo audit also confirmed that all required submission files are pres
|
|
| 359 |
Still pending before final submission:
|
| 360 |
|
| 361 |
- a Docker smoke test from a machine with Docker installed
|
| 362 |
- a final clean-machine dry run if possible before submission freeze
|
|
|
| 1 |
+
---
|
| 2 |
+
title: IT Helpdesk Ticket Routing OpenEnv
|
| 3 |
+
colorFrom: blue
|
| 4 |
+
colorTo: indigo
|
| 5 |
+
sdk: docker
|
| 6 |
+
pinned: false
|
| 7 |
+
app_port: 7860
|
| 8 |
+
tags:
|
| 9 |
+
- openenv
|
| 10 |
+
- helpdesk
|
| 11 |
+
- ticket-routing
|
| 12 |
+
- customer-support
|
| 13 |
+
---
|
| 14 |
+
|
| 15 |
# IT Helpdesk Ticket Routing OpenEnv
|
| 16 |
|
| 17 |
> Meta PyTorch OpenEnv Hackathon Round 1 submission
|
|
|
|
| 166 |
|
| 167 |
The result is clamped to `[0.0, 1.0]`.
|
| 168 |
|
| 169 |
+
## Grounded Scoring
|
| 170 |
+
|
| 171 |
+
The grader is intentionally not fuzzy by default.
|
| 172 |
+
|
| 173 |
+
- exact match is the dominant path for every field
|
| 174 |
+
- `assignment_group` and `resolution_action` are exact-match only
|
| 175 |
+
- `priority` only gets proximity credit from the declared table in `server/grader.py`
|
| 176 |
+
- `issue_type` only gets partial credit for a small declared similarity map
|
| 177 |
+
- wrong labels outside those explicit maps score `0.0`
|
| 178 |
+
|
| 179 |
+
That scoring policy is now backed by checked-in unit tests in `tests/test_grader_unit.py` and `tests/test_tasks_unit.py`.
|
| 180 |
+
|
| 181 |
+
The label set and partial-credit choices were also reviewed against public IT-support references captured in `analysis/grounding_audit.md`, including:
|
| 182 |
+
|
| 183 |
+
- `Classification of IT Support Tickets`
|
| 184 |
+
- `Semantic Similarity of IT Support Tickets`
|
| 185 |
+
- `MSDialog`
|
| 186 |
+
|
| 187 |
+
That grounding pass supported keeping the current similarity map small and explainable. No new issue-type similarity pairs were added from the review.
|
| 188 |
+
|
| 189 |
## Dataset Snapshot
|
| 190 |
|
| 191 |
The labeled dataset in `data/dataset.json` currently contains 45 tickets spanning straightforward and ambiguous helpdesk scenarios.
|
|
|
|
| 383 |
- typed models
|
| 384 |
- grader and reward design
|
| 385 |
- packaging metadata and Docker entry point
|
| 386 |
+
- Hugging Face Spaces README frontmatter
|
| 387 |
+
- judge-facing documentation of deterministic, grounded scoring
|
| 388 |
|
| 389 |
An April 6 repo audit also confirmed that all required submission files are present:
|
| 390 |
|
|
|
|
| 395 |
Still pending before final submission:
|
| 396 |
|
| 397 |
- a Docker smoke test from a machine with Docker installed
|
| 398 |
+
- `openenv validate` evidence on the current merged repo state
|
| 399 |
+
- structured `inference.py` log-format verification on the current merged repo state
|
| 400 |
- a final clean-machine dry run if possible before submission freeze
|
| 401 |
+
|
| 402 |
+
The short TRL / GRPO README example from the roadmap is intentionally deferred until the shared runtime and validation gates are green.
|
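Reviewer sketch: the README context in this diff gives the episode score as `average(per_ticket_scores) - 0.03 * max(0, steps_taken - queue_size)`, clamped to `[0.0, 1.0]`. A minimal Python rendering of that formula follows. The function name `episode_score` and the choice to return `0.0` for an empty score list are assumptions for illustration, not taken from the repo.

```python
from typing import Sequence


def episode_score(
    per_ticket_scores: Sequence[float],
    steps_taken: int,
    queue_size: int,
) -> float:
    """Average ticket score minus a step-overrun penalty, clamped to [0, 1]."""
    if not per_ticket_scores:
        # Assumption: an episode with no graded tickets scores zero.
        return 0.0
    average = sum(per_ticket_scores) / len(per_ticket_scores)
    # Each step beyond the queue size costs 0.03; staying within it costs nothing.
    penalty = 0.03 * max(0, steps_taken - queue_size)
    return min(1.0, max(0.0, average - penalty))
```

With two tickets scored `1.0` and `0.5`, finishing in two steps gives `0.75`, while taking four steps subtracts `0.06` from that average.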
analysis/grounding_audit.md
ADDED
|
@@ -0,0 +1,77 @@
|
|
|
|
| 1 |
+
# Grounding Audit For Taxonomy And Similarity Decisions
|
| 2 |
+
|
| 3 |
+
> Internal note for the roadmap work originally planned for April 5, 2026.
|
| 4 |
+
> Reviewed on April 3, 2026 and pulled forward ahead of schedule.
|
| 5 |
+
|
| 6 |
+
## Goal
|
| 7 |
+
|
| 8 |
+
Ground the current ticket taxonomy and the limited partial-credit policy against real public IT-support data without turning external datasets into a runtime dependency.
|
| 9 |
+
|
| 10 |
+
## Sources Reviewed
|
| 11 |
+
|
| 12 |
+
1. [Classification of IT Support Tickets](https://zenodo.org/records/7648117)
|
| 13 |
+
- Zenodo dataset with 2,229 manually classified support tickets.
|
| 14 |
+
- Dataset description says the tickets were classified by three IT support professionals.
|
| 15 |
+
- The public preview exposes seven coarse categories: `Fileservice`, `Support general`, `Software`, `O365`, `Active Directory`, `Computer-Services`, and `EOL`.
|
| 16 |
+
|
| 17 |
+
2. [Semantic Similarity of IT Support Tickets](https://zenodo.org/records/7426225)
|
| 18 |
+
- Zenodo dataset with 300 ticket pairs manually labeled for semantic similarity.
|
| 19 |
+
- The description says three IT support professionals performed the labeling.
|
| 20 |
+
- This is the best direct grounding for keeping similarity explicit and limited instead of treating the whole label space as fuzzy.
|
| 21 |
+
|
| 22 |
+
3. [MSDialog dataset page](https://ciir.cs.umass.edu/downloads/msdialog/)
|
| 23 |
+
- Technical-support dialog corpus drawn from Microsoft Community.
|
| 24 |
+
- The site reports 35,000 dialogs in `MSDialog-Complete` and 2,199 labeled dialogs with 10,020 utterances in `MSDialog-Intent`.
|
| 25 |
+
- This grounds our use of follow-up cases, clarification-heavy threads, and helpdesk-style conversational language.
|
| 26 |
+
|
| 27 |
+
## Mapping Principle
|
| 28 |
+
|
| 29 |
+
The external datasets validate that real IT support traffic mixes access problems, software incidents, generic support questions, procurement-like requests, and multi-turn follow-ups. Our label set is more operational than the public category sets, so the mappings below are judgment calls based on source descriptions and public previews rather than exact label equivalence.
|
| 30 |
+
|
| 31 |
+
## Grounding Examples
|
| 32 |
+
|
| 33 |
+
1. Active Directory lockout, MFA trouble, or password reset -> `identity_access` -> exact-match dominant, with `onboarding` as the only defensible adjacent label when the request is really about new-user provisioning.
|
| 34 |
+
2. New hire account setup or contractor access provisioning -> `onboarding` -> partial-credit adjacent to `identity_access`, because both can surface as account enablement work before ownership is fully resolved.
|
| 35 |
+
3. Office or application crash, timeout, webhook failure, or migration-script breakage -> `application_support` -> partial-credit adjacent to `feature_request` only when the report reads like a capability gap rather than a break/fix issue.
|
| 36 |
+
4. Feature wishlist or export-format enhancement request -> `feature_request` -> partial-credit adjacent to `application_support` only when the user reports the missing capability as if it were a defect.
|
| 37 |
+
5. Vendor-evaluation question, demo request, or quote request -> `service_request` -> partial-credit adjacent to `general_inquiry` when the request is still exploratory rather than a committed operational action.
|
| 38 |
+
6. Seat expansion or provisioning-style commercial request -> `service_request` -> partial-credit adjacent to `billing_license` when procurement and account-admin signals are mixed in the same ticket.
|
| 39 |
+
7. Refund, invoice discrepancy, subscription cancellation, or payment-admin issue -> `billing_license` -> partial-credit adjacent to `service_request` only in commercial admin cases that overlap with a procurement or seat-change request.
|
| 40 |
+
8. Broad capability question or lightweight product clarification -> `general_inquiry` -> partial-credit adjacent to `service_request` or `feature_request` when the request is vague enough to look like either evaluation or roadmap feedback.
|
| 41 |
+
9. Spam lure or credential-phishing message sent to the inbox -> `spam_phishing` -> partial-credit adjacent to `security_compliance` only for security-themed inbound items, not for normal access or software tickets.
|
| 42 |
+
10. GDPR deletion request, DPA request, audit finding, or mandatory MFA policy notice -> `security_compliance` -> exact-match dominant, with very limited adjacency to `spam_phishing` for suspicious security reports and a low-confidence edge to `billing_license` only in contractual paperwork contexts.
|
| 43 |
+
11. Reopened outage thread or repeated bug report escalation -> `application_support` -> exact-match dominant; the main change across turns is usually `priority`, not `issue_type`.
|
| 44 |
+
12. Repeated lockout complaint or suspension follow-up -> `identity_access` -> exact-match dominant; follow-up behavior is grounded by MSDialog-style multi-turn support flow rather than by adding new label fuzziness.
|
| 45 |
+
|
| 46 |
+
## Review Of Current Similarity Pairs
|
| 47 |
+
|
| 48 |
+
The current `ISSUE_TYPE_SIMILARITY` map stays intentionally small. The defensible themes are:
|
| 49 |
+
|
| 50 |
+
- `billing_license` <-> `service_request`: commercial admin and procurement requests can overlap before the owning team is clear.
|
| 51 |
+
- `application_support` <-> `identity_access`: SSO and login failures can initially look like either app failure or access failure.
|
| 52 |
+
- `application_support` <-> `feature_request`: some users describe missing functionality in bug-report language.
|
| 53 |
+
- `onboarding` <-> `identity_access`: provisioning and account enablement are adjacent in real helpdesk traffic.
|
| 54 |
+
- `general_inquiry` <-> `feature_request`: vague product questions can blur into roadmap requests.
|
| 55 |
+
- `general_inquiry` <-> `service_request`: vendor-evaluation and exploratory capability questions often overlap.
|
| 56 |
+
- `spam_phishing` <-> `security_compliance`: both are security-facing, but they should stay separate from normal access or app-routing labels.
|
| 57 |
+
- `security_compliance` <-> `billing_license`: kept only as a very low-score edge for contract and paperwork overlap; this is the weakest current pair and should not be expanded further without ticket-level evidence.
|
| 58 |
+
|
| 59 |
+
## Candidate Expansions Reviewed And Rejected
|
| 60 |
+
|
| 61 |
+
These pairs were reviewed during the April 5 roadmap pass and are intentionally not being added:
|
| 62 |
+
|
| 63 |
+
- `onboarding` <-> `service_request`: both can involve setup, but the owning teams and next actions diverge too quickly.
|
| 64 |
+
- `feature_request` <-> `service_request`: roadmap asks and procurement actions are operationally different.
|
| 65 |
+
- `security_compliance` <-> `identity_access`: policy obligations may mention accounts, but the compliance workflow is distinct from user access support.
|
| 66 |
+
- `billing_license` <-> `identity_access`: nonpayment or suspension can mention lockout symptoms, but the root-cause owner is different.
|
| 67 |
+
- `application_support` <-> `billing_license`: mixed commercial and outage narratives exist, but broad partial credit here would blur incident handling too much.
|
| 68 |
+
|
| 69 |
+
## Decision
|
| 70 |
+
|
| 71 |
+
No new issue-type similarity pairs should be added from this review.
|
| 72 |
+
|
| 73 |
+
The safest grounded position is:
|
| 74 |
+
|
| 75 |
+
- keep the current limited similarity map,
|
| 76 |
+
- rely on exact-match scoring for most wrong labels,
|
| 77 |
+
- let `priority`, `assignment_group`, and `resolution_action` keep the hard-task routing signal crisp.
|
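Reviewer sketch: the audit writes each similarity pair with `<->`, while the unit tests look entries up by a `(predicted, expected)` tuple. One way to keep the two views consistent is to declare each pair once and expand it in both directions. This is an assumption about how such a map could be built, not a description of `ISSUE_TYPE_SIMILARITY` itself; `build_symmetric_map` is a hypothetical helper and the scores below are placeholders.

```python
from typing import Dict, Iterable, Tuple


def build_symmetric_map(
    pairs: Iterable[Tuple[str, str, float]],
) -> Dict[Tuple[str, str], float]:
    """Expand undirected (left, right, score) pairs into a tuple-keyed lookup."""
    table: Dict[Tuple[str, str], float] = {}
    for left, right, score in pairs:
        if left == right:
            raise ValueError(f"pair must join two different labels: {left!r}")
        if not 0.0 < score < 1.0:
            # Partial credit must stay strictly below exact-match credit.
            raise ValueError(f"score for {left!r}/{right!r} must be in (0, 1)")
        table[(left, right)] = score
        table[(right, left)] = score
    return table


# Placeholder scores for two of the pairs named in the audit.
EXAMPLE_PAIRS = [
    ("billing_license", "service_request", 0.4),
    ("security_compliance", "billing_license", 0.1),
]
EXAMPLE_MAP = build_symmetric_map(EXAMPLE_PAIRS)
```

Rejected candidates such as `onboarding` <-> `service_request` simply never appear in the input list, so a lookup for them falls through to `0.0` in the grader.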
required.md
CHANGED
|
@@ -350,3 +350,26 @@ The project is ready when:
|
|
| 350 |
7. HF deployment checks succeed or are as close to verified as possible before submission
|
| 351 |
8. the docs are clean, current, and submission-ready
|
| 352 |
9. the repo clearly presents Hackstreet Boys as the team
|
|
|
| 350 |
7. HF deployment checks succeed or are as close to verified as possible before submission
|
| 351 |
8. the docs are clean, current, and submission-ready
|
| 352 |
9. the repo clearly presents Hackstreet Boys as the team
|
| 353 |
+
|
| 354 |
+
## Current Compliance Snapshot
|
| 355 |
+
|
| 356 |
+
As of April 3, 2026, the Roopal-side compliance review says these items are already in place:
|
| 357 |
+
|
| 358 |
+
- real-world task definition is clear and stable
|
| 359 |
+
- typed models, `reset()`, `step()`, `state()`, and `openenv.yaml` are present in the repo
|
| 360 |
+
- 3-task easy -> medium -> hard ladder is present
|
| 361 |
+
- graders are deterministic and bounded to `[0.0, 1.0]`
|
| 362 |
+
- unit tests now prove scorer crispness, task invariants, and dataset coverage
|
| 363 |
+
- baseline heuristic results are recorded in the docs
|
| 364 |
+
- the README now includes Hugging Face Spaces frontmatter and a judge-facing grounded-scoring explanation
|
| 365 |
+
- an internal grounding audit exists in `analysis/grounding_audit.md`
|
| 366 |
+
|
| 367 |
+
The items still pending or shared with runtime-side work are:
|
| 368 |
+
|
| 369 |
+
- `openenv validate` evidence on the merged repo state
|
| 370 |
+
- Docker smoke evidence on the merged repo state
|
| 371 |
+
- Hugging Face deployment ping and reset verification
|
| 372 |
+
- structured `inference.py` log-format verification
|
| 373 |
+
- clean-machine rerun evidence if practical
|
| 374 |
+
|
| 375 |
+
The roadmap's short TRL / GRPO README example remains optional and should stay deferred until the pending validation items above are green.
|
tests/test_grader_unit.py
CHANGED
|
@@ -5,7 +5,13 @@ import unittest
|
|
| 5 |
import openenv_test_stubs # noqa: F401
|
| 6 |
|
| 7 |
from models import HelpdeskTicketAction, HelpdeskTicketRecord
|
| 8 |
-
from server.grader import
|
| 9 |
|
| 10 |
|
| 11 |
def _ticket(
|
|
@@ -66,6 +72,23 @@ class GraderUnitTests(unittest.TestCase):
|
|
| 66 |
self.assertAlmostEqual(score, 0.4)
|
| 67 |
self.assertEqual(breakdown, {"issue_type": 0.4})
|
| 68 |
|
| 69 |
def test_unrelated_issue_type_gets_zero_not_fuzzy_credit(self) -> None:
|
| 70 |
ticket = _ticket(issue_type="onboarding")
|
| 71 |
action = HelpdeskTicketAction(issue_type="spam_phishing")
|
|
@@ -85,6 +108,29 @@ class GraderUnitTests(unittest.TestCase):
|
|
| 85 |
self.assertAlmostEqual(breakdown["priority"], 0.6)
|
| 86 |
self.assertAlmostEqual(score, 0.84)
|
| 87 |
|
| 88 |
def test_task_2_weights_apply_as_documented(self) -> None:
|
| 89 |
ticket = _ticket(priority="high")
|
| 90 |
action = HelpdeskTicketAction(issue_type="billing_license", priority="medium")
|
|
@@ -108,6 +154,28 @@ class GraderUnitTests(unittest.TestCase):
|
|
| 108 |
self.assertEqual(breakdown["assignment_group"], 0.0)
|
| 109 |
self.assertAlmostEqual(score, 0.75)
|
| 110 |
|
| 111 |
def test_resolution_action_is_exact_match_only(self) -> None:
|
| 112 |
ticket = _ticket()
|
| 113 |
action = HelpdeskTicketAction(
|
|
@@ -122,6 +190,37 @@ class GraderUnitTests(unittest.TestCase):
|
|
| 122 |
self.assertEqual(breakdown["resolution_action"], 0.0)
|
| 123 |
self.assertAlmostEqual(score, 0.8)
|
| 124 |
|
| 125 |
|
| 126 |
if __name__ == "__main__":
|
| 127 |
unittest.main()
|
|
|
|
| 5 |
import openenv_test_stubs # noqa: F401
|
| 6 |
|
| 7 |
from models import HelpdeskTicketAction, HelpdeskTicketRecord
|
| 8 |
+
from server.grader import (
|
| 9 |
+
ISSUE_TYPE_SIMILARITY,
|
| 10 |
+
PRIORITY_SCORES,
|
| 11 |
+
TASK_WEIGHTS,
|
| 12 |
+
grade_action,
|
| 13 |
+
)
|
| 14 |
+
from vocabulary import ISSUE_TYPES, PRIORITIES
|
| 15 |
|
| 16 |
|
| 17 |
def _ticket(
|
|
|
|
| 72 |
self.assertAlmostEqual(score, 0.4)
|
| 73 |
self.assertEqual(breakdown, {"issue_type": 0.4})
|
| 74 |
|
| 75 |
+
def test_issue_type_scoring_matches_declared_similarity_table_exhaustively(self) -> None:
|
| 76 |
+
for expected in ISSUE_TYPES:
|
| 77 |
+
for predicted in ISSUE_TYPES:
|
| 78 |
+
with self.subTest(expected=expected, predicted=predicted):
|
| 79 |
+
ticket = _ticket(issue_type=expected)
|
| 80 |
+
action = HelpdeskTicketAction(issue_type=predicted)
|
| 81 |
+
|
| 82 |
+
score, breakdown = grade_action(action, ticket, task_id=1)
|
| 83 |
+
|
| 84 |
+
expected_score = (
|
| 85 |
+
1.0
|
| 86 |
+
if predicted == expected
|
| 87 |
+
else ISSUE_TYPE_SIMILARITY.get((predicted, expected), 0.0)
|
| 88 |
+
)
|
| 89 |
+
self.assertAlmostEqual(score, expected_score)
|
| 90 |
+
self.assertEqual(breakdown, {"issue_type": expected_score})
|
| 91 |
+
|
| 92 |
def test_unrelated_issue_type_gets_zero_not_fuzzy_credit(self) -> None:
|
| 93 |
ticket = _ticket(issue_type="onboarding")
|
| 94 |
action = HelpdeskTicketAction(issue_type="spam_phishing")
|
|
|
|
| 108 |
self.assertAlmostEqual(breakdown["priority"], 0.6)
|
| 109 |
self.assertAlmostEqual(score, 0.84)
|
| 110 |
|
| 111 |
+
def test_priority_scoring_matches_declared_table_exhaustively(self) -> None:
|
| 112 |
+
for expected in PRIORITIES:
|
| 113 |
+
for predicted in PRIORITIES:
|
| 114 |
+
with self.subTest(expected=expected, predicted=predicted):
|
| 115 |
+
ticket = _ticket(priority=expected)
|
| 116 |
+
action = HelpdeskTicketAction(
|
| 117 |
+
issue_type="billing_license",
|
| 118 |
+
priority=predicted,
|
| 119 |
+
)
|
| 120 |
+
|
| 121 |
+
score, breakdown = grade_action(action, ticket, task_id=2)
|
| 122 |
+
|
| 123 |
+
priority_score = (
|
| 124 |
+
1.0
|
| 125 |
+
if predicted == expected
|
| 126 |
+
else PRIORITY_SCORES.get((predicted, expected), 0.0)
|
| 127 |
+
)
|
| 128 |
+
self.assertEqual(
|
| 129 |
+
breakdown,
|
| 130 |
+
{"issue_type": 1.0, "priority": priority_score},
|
| 131 |
+
)
|
| 132 |
+
self.assertAlmostEqual(score, 0.6 + 0.4 * priority_score)
|
| 133 |
+
|
| 134 |
def test_task_2_weights_apply_as_documented(self) -> None:
|
| 135 |
ticket = _ticket(priority="high")
|
| 136 |
action = HelpdeskTicketAction(issue_type="billing_license", priority="medium")
|
|
|
|
| 154 |
self.assertEqual(breakdown["assignment_group"], 0.0)
|
| 155 |
self.assertAlmostEqual(score, 0.75)
|
| 156 |
|
| 157 |
+
def test_task_3_weights_apply_as_documented(self) -> None:
|
| 158 |
+
ticket = _ticket(priority="high")
|
| 159 |
+
action = HelpdeskTicketAction(
|
| 160 |
+
issue_type="billing_license",
|
| 161 |
+
priority="medium",
|
| 162 |
+
assignment_group="service_desk",
|
| 163 |
+
resolution_action="fulfill",
|
| 164 |
+
)
|
| 165 |
+
|
| 166 |
+
score, breakdown = grade_action(action, ticket, task_id=3)
|
| 167 |
+
|
| 168 |
+
self.assertEqual(
|
| 169 |
+
breakdown,
|
| 170 |
+
{
|
| 171 |
+
"issue_type": 1.0,
|
| 172 |
+
"priority": 0.5,
|
| 173 |
+
"assignment_group": 0.0,
|
| 174 |
+
"resolution_action": 1.0,
|
| 175 |
+
},
|
| 176 |
+
)
|
| 177 |
+
self.assertAlmostEqual(score, 0.65)
|
| 178 |
+
|
| 179 |
def test_resolution_action_is_exact_match_only(self) -> None:
|
| 180 |
ticket = _ticket()
|
| 181 |
action = HelpdeskTicketAction(
|
|
|
|
| 190 |
self.assertEqual(breakdown["resolution_action"], 0.0)
|
| 191 |
self.assertAlmostEqual(score, 0.8)
|
| 192 |
|
| 193 |
+
def test_partial_credit_tables_never_override_exact_match(self) -> None:
|
| 194 |
+
for pair, value in ISSUE_TYPE_SIMILARITY.items():
|
| 195 |
+
with self.subTest(table="issue_type", pair=pair):
|
| 196 |
+
self.assertGreater(value, 0.0)
|
| 197 |
+
self.assertLess(value, 1.0)
|
| 198 |
+
|
| 199 |
+
for pair, value in PRIORITY_SCORES.items():
|
| 200 |
+
with self.subTest(table="priority", pair=pair):
|
| 201 |
+
self.assertGreater(value, 0.0)
|
| 202 |
+
self.assertLess(value, 1.0)
|
| 203 |
+
|
| 204 |
+
def test_task_weights_sum_to_one_for_each_task(self) -> None:
|
| 205 |
+
for task_id, weights in TASK_WEIGHTS.items():
|
| 206 |
+
with self.subTest(task_id=task_id):
|
| 207 |
+
self.assertAlmostEqual(sum(weights.values()), 1.0)
|
| 208 |
+
|
| 209 |
+
def test_grade_action_is_deterministic_for_same_inputs(self) -> None:
|
| 210 |
+
ticket = _ticket(issue_type="service_request", priority="medium")
|
| 211 |
+
action = HelpdeskTicketAction(
|
| 212 |
+
issue_type="general_inquiry",
|
| 213 |
+
priority="low",
|
| 214 |
+
assignment_group="license_ops",
|
| 215 |
+
resolution_action="assign",
|
| 216 |
+
)
|
| 217 |
+
|
| 218 |
+
first_score, first_breakdown = grade_action(action, ticket, task_id=3)
|
| 219 |
+
second_score, second_breakdown = grade_action(action, ticket, task_id=3)
|
| 220 |
+
|
| 221 |
+
self.assertEqual(first_score, second_score)
|
| 222 |
+
self.assertEqual(first_breakdown, second_breakdown)
|
| 223 |
+
|
| 224 |
|
| 225 |
if __name__ == "__main__":
|
| 226 |
unittest.main()
|
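Reviewer sketch: the `0.65` asserted in `test_task_3_weights_apply_as_documented` follows directly from the Task 3 weights documented in `KNOWLEDGE.md` (35% / 20% / 25% / 20%). The snippet below redoes that arithmetic outside the grader. `weighted_score` is a hypothetical helper; the weights and the breakdown values are copied from the docs and the test in this commit.

```python
from typing import Mapping

# Task 3 weights as documented in KNOWLEDGE.md.
TASK_3_WEIGHTS = {
    "issue_type": 0.35,
    "priority": 0.20,
    "assignment_group": 0.25,
    "resolution_action": 0.20,
}


def weighted_score(breakdown: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Combine per-field scores into one ticket score using the task weights."""
    return sum(weights[field] * breakdown[field] for field in weights)


# Per-field breakdown asserted by the Task 3 test.
breakdown = {
    "issue_type": 1.0,
    "priority": 0.5,
    "assignment_group": 0.0,
    "resolution_action": 1.0,
}
score = weighted_score(breakdown, TASK_3_WEIGHTS)
```

Term by term this is `0.35 + 0.10 + 0.00 + 0.20`, which is why the test compares against `0.65` with `assertAlmostEqual` rather than exact equality.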
tests/test_tasks_unit.py
CHANGED
|
@@ -9,7 +9,13 @@ import openenv_test_stubs # noqa: F401
|
|
| 9 |
from models import HelpdeskTicketRecord
|
| 10 |
from server import tasks as task_module
|
| 11 |
from server.tasks import TASKS, get_task_definition, load_dataset
|
| 12 |
-
from vocabulary import
|
| 13 |
|
| 14 |
|
| 15 |
class TasksAndDatasetUnitTests(unittest.TestCase):
|
|
@@ -31,6 +37,12 @@ class TasksAndDatasetUnitTests(unittest.TestCase):
|
|
| 31 |
],
|
| 32 |
)
|
| 33 |
|
| 34 |
def test_invalid_task_id_raises(self) -> None:
|
| 35 |
with self.assertRaisesRegex(ValueError, "Unsupported task_id"):
|
| 36 |
get_task_definition(0)
|
|
@@ -64,20 +76,33 @@ class TasksAndDatasetUnitTests(unittest.TestCase):
|
|
| 64 |
dataset = load_dataset()
|
| 65 |
issue_types = {record.issue_type for record in dataset}
|
| 66 |
|
| 67 |
-
self.assertEqual(
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
|
| 82 |
def test_load_dataset_accepts_utf8_bom(self) -> None:
|
| 83 |
sample = (
|
|
|
|
| 9 |
from models import HelpdeskTicketRecord
|
| 10 |
from server import tasks as task_module
|
| 11 |
from server.tasks import TASKS, get_task_definition, load_dataset
|
| 12 |
+
from vocabulary import (
|
| 13 |
+
ASSIGNMENT_GROUPS,
|
| 14 |
+
ISSUE_TYPES,
|
| 15 |
+
PRIORITIES,
|
| 16 |
+
RESOLUTION_ACTIONS,
|
| 17 |
+
TASK_IDS,
|
| 18 |
+
)
|
| 19 |
|
| 20 |
|
| 21 |
class TasksAndDatasetUnitTests(unittest.TestCase):
|
|
|
|
| 37 |
],
|
| 38 |
)
|
| 39 |
|
| 40 |
+
def test_task_difficulty_ladder_is_frozen(self) -> None:
|
| 41 |
+
self.assertEqual(
|
| 42 |
+
[TASKS[task_id]["difficulty"] for task_id in TASK_IDS],
|
| 43 |
+
["easy", "medium", "hard"],
|
| 44 |
+
)
|
| 45 |
+
|
| 46 |
def test_invalid_task_id_raises(self) -> None:
|
| 47 |
with self.assertRaisesRegex(ValueError, "Unsupported task_id"):
|
| 48 |
get_task_definition(0)
|
|
|
|
| 76 |
dataset = load_dataset()
|
| 77 |
issue_types = {record.issue_type for record in dataset}
|
| 78 |
|
| 79 |
+
self.assertEqual(issue_types, set(ISSUE_TYPES))
|
| 80 |
+
|
| 81 |
+
def test_dataset_covers_all_defined_priorities(self) -> None:
|
| 82 |
+
dataset = load_dataset()
|
| 83 |
+
priorities = {record.priority for record in dataset}
|
| 84 |
+
|
| 85 |
+
self.assertEqual(priorities, set(PRIORITIES))
|
| 86 |
+
|
| 87 |
+
def test_dataset_covers_all_assignment_groups(self) -> None:
|
| 88 |
+
dataset = load_dataset()
|
| 89 |
+
assignment_groups = {record.assignment_group for record in dataset}
|
| 90 |
+
|
| 91 |
+
self.assertEqual(assignment_groups, set(ASSIGNMENT_GROUPS))
|
| 92 |
+
|
| 93 |
+
def test_dataset_covers_all_resolution_actions(self) -> None:
|
| 94 |
+
dataset = load_dataset()
|
| 95 |
+
resolution_actions = {record.resolution_action for record in dataset}
|
| 96 |
+
|
| 97 |
+
self.assertEqual(resolution_actions, set(RESOLUTION_ACTIONS))
|
| 98 |
+
|
| 99 |
+
def test_dataset_preserves_ambiguous_and_follow_up_cases(self) -> None:
|
| 100 |
+
dataset = load_dataset()
|
| 101 |
+
ambiguity_count = sum(1 for record in dataset if record.ambiguity_note)
|
| 102 |
+
follow_up_count = sum(1 for record in dataset if record.related_ticket_id)
|
| 103 |
+
|
| 104 |
+
self.assertGreaterEqual(ambiguity_count, 4)
|
| 105 |
+
self.assertGreaterEqual(follow_up_count, 3)
|
| 106 |
|
| 107 |
def test_load_dataset_accepts_utf8_bom(self) -> None:
|
| 108 |
sample = (
|
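Reviewer sketch: the new coverage tests compare the set of labels seen in the dataset against each vocabulary with `assertEqual`. When one of them fails, it can help to see which labels are missing. The helper below is a hypothetical debugging aid that works on plain dictionaries rather than `HelpdeskTicketRecord` objects; the sample records and the three-label vocabulary are made up for illustration.

```python
from typing import Iterable, Mapping, Set


def missing_labels(
    records: Iterable[Mapping[str, str]],
    field: str,
    vocabulary: Iterable[str],
) -> Set[str]:
    """Return vocabulary labels that no record uses for the given field."""
    seen = {record[field] for record in records}
    return set(vocabulary) - seen


# Made-up sample data, not taken from data/dataset.json.
sample_records = [
    {"issue_type": "onboarding", "priority": "high"},
    {"issue_type": "spam_phishing", "priority": "low"},
]
gaps = missing_labels(sample_records, "priority", ["low", "medium", "high"])
```

Unlike the set-equality assertion in the tests, this helper only reports gaps in coverage; it does not flag dataset labels that fall outside the vocabulary.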