Roopalgn committed on
Commit
3b8bf40
·
1 Parent(s): 969eaef

Improve dataset realism and consolidate project status log

Files changed (5)
  1. LABEL_AUDIT.md +0 -56
  2. MARCH30_STATUS.md +0 -117
  3. PROJECT_STATUS.md +137 -0
  4. README.md +23 -0
  5. data/dataset.json +21 -21
LABEL_AUDIT.md DELETED
@@ -1,56 +0,0 @@
1
- # Label Audit Notes
2
-
3
- This file records the March 31 and April 1 label-and-grader pass on the Roopal-owned files:
4
-
5
- - `data/dataset.json`
6
- - `server/tasks.py`
7
- - `server/grader.py`
8
-
9
- ## Dataset Decisions
10
-
11
- ### Tightened ambiguity cases
12
-
13
- - `ticket-022`
14
- Reworded to make the billing-versus-application ambiguity clearer while keeping the chosen label as `application_support`.
15
-
16
- - `ticket-027`
17
- Reworded to make the vendor-offer ambiguity clearer between `general_inquiry` and `service_request`.
18
-
19
- - `ticket-029`
20
- Reworded to make the seat-expansion versus prorating ambiguity clearer and changed `resolution_action` from `fulfill` to `assign`.
21
-
22
- - `ticket-040`
23
- Reworded to make the feature-gap versus support-issue ambiguity clearer.
24
-
25
- ### Corrected label consistency
26
-
27
- - `ticket-026`
28
- Changed from `feature_request` / `application_team` to `general_inquiry` / `service_desk` because it is a thank-you note, not a product change request.
29
-
30
- ## Task Wording Changes
31
-
32
- The task instructions in `server/tasks.py` were tightened so they now:
33
-
34
- - sound more like helpdesk triage
35
- - emphasize choosing the single best label
36
- - describe operational priority more clearly
37
- - describe full triage more concretely for Task 3
38
-
39
- ## Grader Changes
40
-
41
- The grader was polished by:
42
-
43
- - making task weights explicit in `TASK_WEIGHTS`
44
- - adding partial-credit pairs for:
45
- - `application_support` vs `feature_request`
46
- - `general_inquiry` vs `service_request`
47
- - keeping the scoring deterministic and task-specific
48
-
49
- ## Intent
50
-
51
- These edits are meant to improve:
52
-
53
- - dataset realism
54
- - label consistency
55
- - hard-task ambiguity quality
56
- - reviewability for judges and teammates
MARCH30_STATUS.md DELETED
@@ -1,117 +0,0 @@
1
- # March 30 Status Report
2
-
3
- This file captures the code checkpoint completed for March 30, 2026 so both Codex sessions can compare against the same source of truth.
4
-
5
- ## Scope Completed
6
-
7
- The March 30 code checkpoint is complete for the foundational files named in `ROADMAP.md`:
8
-
9
- - `models.py`
10
- - `server/tasks.py`
11
- - `server/grader.py`
12
- - `server/environment.py`
13
-
14
- Related supporting files were also aligned:
15
-
16
- - `client.py`
17
- - `server/app.py`
18
- - `inference.py`
19
- - `vocabulary.py`
20
-
21
- ## What Is Locked
22
-
23
- ### Team and project identity
24
-
25
- - Team: Hackstreet Boys
26
- - Members: Roopal Guha Neogi, Suyash Kumar
27
- - Domain: IT Helpdesk Ticket Routing
28
-
29
- ### Frozen class names
30
-
31
- - `HelpdeskTicketRecord`
32
- - `HelpdeskTicketAction`
33
- - `HelpdeskTicketObservation`
34
- - `HelpdeskTicketState`
35
- - `HelpdeskTicketRoutingEnvironment`
36
- - `HelpdeskTicketEnvClient`
37
-
38
- ### Frozen field names
39
-
40
- - `ticket_id`
41
- - `title`
42
- - `requester`
43
- - `description`
44
- - `issue_type`
45
- - `priority`
46
- - `assignment_group`
47
- - `resolution_action`
48
- - `related_ticket_id`
49
-
50
- ## Code That Exists Now
51
-
52
- ### `vocabulary.py`
53
-
54
- Shared frozen constants now live in one place:
55
-
56
- - team metadata
57
- - environment names
58
- - issue types
59
- - priorities
60
- - assignment groups
61
- - resolution actions
62
- - default issue-type mappings used by inference
63
-
64
- ### `models.py`
65
-
66
- The typed models are defined and the vocabulary is enforced through validators, so unsupported labels should fail fast instead of silently drifting.
67
-
68
- ### `server/tasks.py`
69
-
70
- All three tasks are defined with locked names, instructions, and allowed fields.
71
-
72
- ### `server/grader.py`
73
-
74
- Deterministic scoring is in place with:
75
-
76
- - partial credit for near-miss `issue_type`
77
- - proximity scoring for `priority`
78
- - exact match for `assignment_group`
79
- - exact match for `resolution_action`
80
-
81
- ### `server/environment.py`
82
-
83
- The environment implements:
84
-
85
- - queue sampling
86
- - reset flow
87
- - step flow
88
- - state tracking
89
- - final trajectory reward handoff
90
-
91
- ### `inference.py`
92
-
93
- The baseline runner is aligned to the locked vocabulary and supports:
94
-
95
- - LLM mode
96
- - heuristic mode
97
- - task loop over all 3 tasks
98
-
99
- ## Expected Agreement For The Other Codex Session
100
-
101
- Your teammate's Codex should agree on all of the following:
102
-
103
- 1. the schema names above are frozen
104
- 2. the vocabulary now has a single source of truth in `vocabulary.py`
105
- 3. no one should rename labels after this checkpoint
106
- 4. future work should build on these names, not replace them
107
-
108
- ## What Is Not Verified Yet
109
-
110
- This checkpoint is a code-and-consistency checkpoint, not a runtime-complete checkpoint.
111
-
112
- Still pending:
113
-
114
- - local execution
115
- - heuristic baseline run
116
- - Docker validation
117
- - final benchmark numbers
PROJECT_STATUS.md ADDED
@@ -0,0 +1,137 @@
1
+ # Project Status
2
+
3
+ This is the canonical running status file for the repo.
4
+
5
+ Use this file for future progress updates instead of creating new date-specific status files.
6
+
7
+ ## March 30, 2026
8
+
9
+ Status: complete
10
+
11
+ Scope completed:
12
+
13
+ - locked team name, domain, and vocabulary
14
+ - aligned the foundational schema and environment surface
15
+ - froze the core class names and field names
16
+
17
+ Core files aligned:
18
+
19
+ - `models.py`
20
+ - `server/tasks.py`
21
+ - `server/grader.py`
22
+ - `server/environment.py`
23
+ - `client.py`
24
+ - `server/app.py`
25
+ - `inference.py`
26
+ - `vocabulary.py`
27
+
28
+ Key checkpoint outcome:
29
+
30
+ - the project had a single vocabulary source of truth and no remaining schema disagreement
31
+
32
+ ## March 31, 2026
33
+
34
+ Status: complete
35
+
36
+ Roopal-side work completed:
37
+
38
+ - audited `data/dataset.json` end to end
39
+ - tightened ambiguity wording in selected tickets
40
+ - reviewed task wording in `server/tasks.py`
41
+
42
+ Representative dataset decisions:
43
+
44
+ - `ticket-022` kept as `application_support` while making the billing-versus-application ambiguity clearer
45
+ - `ticket-027` kept intentionally ambiguous between `general_inquiry` and `service_request`
46
+ - `ticket-029` was refined to better express seat-expansion versus prorating ambiguity
47
+ - `ticket-040` was kept as `feature_request` while clarifying that some readers could still interpret it as `application_support`
48
+
49
+ Task wording changes:
50
+
51
+ - Task 1 was tightened to emphasize selecting the single best IT issue type
52
+ - Task 2 now explicitly asks for operational priority, not just generic urgency
53
+ - Task 3 wording was refined to describe full helpdesk routing more concretely
54
+
55
+ Shared checkpoint outcome:
56
+
57
+ - no schema changes were still pending after the review pass
58
+
59
+ ## April 1, 2026
60
+
61
+ Status: complete
62
+
63
+ Roopal-side work completed:
64
+
65
+ - polished `server/grader.py`
66
+ - made task weights explicit
67
+ - refined hard-task partial-credit behavior
68
+ - finished remaining dataset label corrections
69
+
70
+ Important label/grader notes:
71
+
72
+ - `ticket-026` was corrected to `general_inquiry` routed to `service_desk`
73
+ - Task 2 weights were fixed at `issue_type` 60% and `priority` 40%
74
+ - Task 3 weights were fixed at `issue_type` 35%, `priority` 20%, `assignment_group` 25%, and `resolution_action` 20%
75
+ - partial-credit pairs were added for `application_support` vs `feature_request`
76
+ - partial-credit pairs were added for `general_inquiry` vs `service_request`
77
+
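Reviewer sketch (not part of the commit): the April 1 weights and partial-credit pairs above could be expressed roughly as follows. The names `PARTIAL_CREDIT_PAIRS`, `PARTIAL_CREDIT`, `field_score`, and `score_ticket` are hypothetical, the 0.5 partial-credit value is an assumption, and `priority` is scored by exact match here for brevity even though the earlier status notes describe proximity scoring. Only `TASK_WEIGHTS` and the percentages come from the log; the real `server/grader.py` may differ.

```python
from typing import Dict

# Weights as recorded in the status log. Task 1 is omitted because
# the log does not state its weights.
TASK_WEIGHTS: Dict[int, Dict[str, float]] = {
    2: {"issue_type": 0.60, "priority": 0.40},
    3: {
        "issue_type": 0.35,
        "priority": 0.20,
        "assignment_group": 0.25,
        "resolution_action": 0.20,
    },
}

# Near-miss issue_type pairs named in the log (order-independent).
PARTIAL_CREDIT_PAIRS = {
    frozenset({"application_support", "feature_request"}),
    frozenset({"general_inquiry", "service_request"}),
}

# Assumption: the log does not say how much credit a near miss earns.
PARTIAL_CREDIT = 0.5


def field_score(field: str, predicted: str, expected: str) -> float:
    """Return 1.0 for a match, partial credit for a listed near miss, else 0.0."""
    if predicted == expected:
        return 1.0
    if field == "issue_type" and frozenset({predicted, expected}) in PARTIAL_CREDIT_PAIRS:
        return PARTIAL_CREDIT
    return 0.0


def score_ticket(task_id: int, predicted: Dict[str, str], expected: Dict[str, str]) -> float:
    """Weighted sum of per-field scores for one ticket; deterministic by construction."""
    weights = TASK_WEIGHTS[task_id]
    return sum(
        weight * field_score(field, predicted.get(field, ""), expected[field])
        for field, weight in weights.items()
    )
```

Under these assumptions, a Task 2 answer that confuses `feature_request` with `application_support` but gets `priority` right would earn half of the 60% `issue_type` weight plus the full 40% `priority` weight.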
78
+ Shared checkpoint outcome:
79
+
80
+ - the docs and code agreed on the exact task labels and field vocabulary
81
+
82
+ ## April 2, 2026
83
+
84
+ Status: complete
85
+
86
+ Roopal-side work completed:
87
+
88
+ - improved `README.md`
89
+ - improved `KNOWLEDGE.md`
90
+
91
+ Packaging and metadata alignment completed in repo state:
92
+
93
+ - `openenv.yaml` aligned with runtime naming and dependency expectations
94
+ - `pyproject.toml` and `requirements.txt` use the same OpenEnv dependency source
95
+ - `server/Dockerfile` installs the local package and documented runtime dependencies
96
+
97
+ Shared checkpoint outcome:
98
+
99
+ - docs and code tell the same IT helpdesk ticket routing story
100
+
101
+ ## April 3, 2026
102
+
103
+ Status: Roopal work complete, shared validation underway
104
+
105
+ Roopal-side work completed:
106
+
107
+ - performed a dataset realism pass on `data/dataset.json`
108
+ - replaced several low-realism spam examples with clearer helpdesk-inbox phrasing
109
+ - cleaned visible mojibake dashes from ticket titles
110
+ - added explicit easy, medium, and hard dataset examples to `README.md`
111
+
112
+ Runtime validation notes recorded from the local repo state:
113
+
114
+ - local `reset()` and `inference.py` validation exposed a UTF-8 BOM issue in dataset loading
115
+ - `server/tasks.py` was updated to read `data/dataset.json` with `utf-8-sig`
116
+ - the heuristic baseline then completed successfully
117
+
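Reviewer sketch (not part of the commit): the BOM fix noted above can be illustrated with a minimal loader. The function names `load_dataset` and `load_dataset_strict_utf8` are hypothetical; the actual loading code in `server/tasks.py` is not shown in this diff and may be structured differently.

```python
import json
from pathlib import Path
from typing import Any


def load_dataset(path: Path) -> Any:
    """Load JSON from a file that may or may not start with a UTF-8 BOM."""
    # "utf-8-sig" strips a leading BOM if present and otherwise
    # decodes exactly like plain "utf-8".
    with path.open("r", encoding="utf-8-sig") as handle:
        return json.load(handle)


def load_dataset_strict_utf8(path: Path) -> Any:
    """Shown only for contrast: plain UTF-8 keeps the BOM in the text."""
    # With a BOM present, the json module raises JSONDecodeError
    # because the text begins with U+FEFF.
    with path.open("r", encoding="utf-8") as handle:
        return json.load(handle)
```

Because `utf-8-sig` also reads BOM-free files correctly, switching the encoding should be safe regardless of which editor last saved `data/dataset.json`.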
118
+ Local heuristic baseline on the validated repo state:
119
+
120
+ - Task 1: `1.0000`
121
+ - Task 2: `0.8800`
122
+ - Task 3: `0.9400`
123
+ - Overall: `0.9400`
124
+
125
+ Shared checkpoint outcome so far:
126
+
127
+ - the first bug triage item was identified and fixed
128
+ - a rerun on the latest fully merged branch is still recommended before treating benchmark numbers as final
129
+
130
+ ## Open Items
131
+
132
+ Still pending after the current checkpoint:
133
+
134
+ - rerun runtime validation on the latest shared branch after all pending merges land
135
+ - perform a Docker smoke test from the merged repo state
136
+ - do the April 4 issue-fix pass from any runtime feedback
137
+ - record final benchmark numbers only after the merged-state rerun
README.md CHANGED
@@ -159,6 +159,29 @@ It includes:
159
  - feature requests
160
  - follow-up cases linked through `related_ticket_id`
161
 
162
+ ## Difficulty Coverage
163
+
164
+ The difficulty ladder is visible both in the task fields and in the dataset itself.
165
+
166
+ Easy-style examples:
167
+
168
+ - `ticket-020`: straightforward general inquiry with low urgency and a clean `general_inquiry` label
169
+ - `ticket-041`: clear onboarding request for a new contractor account
170
+ - `ticket-044`: obvious phishing-style lure that should map cleanly to `spam_phishing`
171
+
172
+ Medium-style examples:
173
+
174
+ - `ticket-001`: billing dispute that still requires the agent to judge urgency correctly
175
+ - `ticket-028`: application incident where the issue type is clear but priority still matters
176
+ - `ticket-036`: procurement-style proof-of-concept request that should route as a `service_request`
177
+
178
+ Hard-style examples:
179
+
180
+ - `ticket-022`: mixed billing and application signals in one ticket
181
+ - `ticket-029`: seat expansion combined with a prorating question
182
+ - `ticket-038`: follow-up billing thread with escalated urgency
183
+ - `ticket-045`: repeated account suspension thread with legal-escalation pressure
184
+
185
  ## Repository Layout
186
 
187
  ```text
data/dataset.json CHANGED
@@ -49,9 +49,9 @@
49
  },
50
  {
51
  "ticket_id": "ticket-005",
52
- "title": "Guaranteed crypto income from home",
53
- "requester": "promo@fastwealth.example",
54
- "description": "Limited time offer. Click now to multiply your income and unsubscribe never.",
55
  "issue_type": "spam_phishing",
56
  "priority": "low",
57
  "assignment_group": "security_team",
@@ -73,7 +73,7 @@
73
  },
74
  {
75
  "ticket_id": "ticket-007",
76
- "title": "GDPR data deletion request — 30 day deadline",
77
  "requester": "legal@eurocorp.de",
78
  "description": "Per GDPR Article 17, we request deletion of all personal data associated with our account within 30 days. Failure to comply may result in regulatory action.",
79
  "issue_type": "security_compliance",
@@ -85,9 +85,9 @@
85
  },
86
  {
87
  "ticket_id": "ticket-008",
88
- "title": "Welcome aboard — getting started with your new account",
89
- "requester": "success@brightpath.io",
90
- "description": "Thanks for signing up! We\u0027d like to schedule an onboarding call this week. What time works for your team?",
91
  "issue_type": "onboarding",
92
  "priority": "medium",
93
  "assignment_group": "onboarding_ops",
@@ -145,9 +145,9 @@
145
  },
146
  {
147
  "ticket_id": "ticket-013",
148
- "title": "Free vacation giveaway — claim your prize",
149
- "requester": "offers@tropicaldeals.example",
150
- "description": "Congratulations! You have been selected for an all-expenses-paid trip. Click here immediately.",
151
  "issue_type": "spam_phishing",
152
  "priority": "low",
153
  "assignment_group": "security_team",
@@ -157,7 +157,7 @@
157
  },
158
  {
159
  "ticket_id": "ticket-014",
160
- "title": "Audit report findings — action required by Friday",
161
  "requester": "audit@compliancepartners.com",
162
  "description": "The SOC2 audit uncovered three medium-severity findings. Remediation evidence is due by end of week.",
163
  "issue_type": "security_compliance",
@@ -229,9 +229,9 @@
229
  },
230
  {
231
  "ticket_id": "ticket-020",
232
- "title": "General inquiry about your platform capabilities",
233
- "requester": "info@greenleaf.org",
234
- "description": "Hi, I stumbled across your website and was curious about what your platform does. Can you send some information?",
235
  "issue_type": "general_inquiry",
236
  "priority": "low",
237
  "assignment_group": "service_desk",
@@ -373,7 +373,7 @@
373
  },
374
  {
375
  "ticket_id": "ticket-032",
376
- "title": "Penetration test results — critical vulnerabilities found",
377
  "requester": "security@redteam-auditors.com",
378
  "description": "Our pentest revealed two critical and five high-severity vulnerabilities in your API endpoints. Full report attached. Remediation should begin immediately.",
379
  "issue_type": "security_compliance",
@@ -433,9 +433,9 @@
433
  },
434
  {
435
  "ticket_id": "ticket-037",
436
- "title": "Earn a degree in just 2 weeks!",
437
- "requester": "admissions@diplomamill.example",
438
- "description": "No exams, no classes. Get your accredited degree today. Reply for more information.",
439
  "issue_type": "spam_phishing",
440
  "priority": "low",
441
  "assignment_group": "security_team",
@@ -517,9 +517,9 @@
517
  },
518
  {
519
  "ticket_id": "ticket-044",
520
- "title": "Your account has been compromised — act now",
521
- "requester": "security-alert@phishing.example",
522
- "description": "We detected unusual activity on your account. Click the link below to verify your identity and secure your account immediately.",
523
  "issue_type": "spam_phishing",
524
  "priority": "low",
525
  "assignment_group": "security_team",
 
49
  },
50
  {
51
  "ticket_id": "ticket-005",
52
+ "title": "Spam email promising guaranteed crypto returns hit support inbox",
53
+ "requester": "shared-inbox@northstar-retail.com",
54
+ "description": "A promotional email promising instant crypto income landed in the shared support inbox. It does not reference any legitimate customer account or business request.",
55
  "issue_type": "spam_phishing",
56
  "priority": "low",
57
  "assignment_group": "security_team",
 
73
  },
74
  {
75
  "ticket_id": "ticket-007",
76
+ "title": "GDPR data deletion request - 30 day deadline",
77
  "requester": "legal@eurocorp.de",
78
  "description": "Per GDPR Article 17, we request deletion of all personal data associated with our account within 30 days. Failure to comply may result in regulatory action.",
79
  "issue_type": "security_compliance",
 
85
  },
86
  {
87
  "ticket_id": "ticket-008",
88
+ "title": "Kickoff onboarding session for newly activated account",
89
+ "requester": "admin@brightpath.io",
90
+ "description": "We activated our account this week and need an onboarding call plus admin setup guidance for six internal users.",
91
  "issue_type": "onboarding",
92
  "priority": "medium",
93
  "assignment_group": "onboarding_ops",
 
145
  },
146
  {
147
  "ticket_id": "ticket-013",
148
+ "title": "Suspicious giveaway message forwarded from shared mailbox",
149
+ "requester": "shared-inbox@harborair.io",
150
+ "description": "The shared mailbox received a message claiming the recipient had won a free vacation and urging an immediate click-through. It appears to be pure spam.",
151
  "issue_type": "spam_phishing",
152
  "priority": "low",
153
  "assignment_group": "security_team",
 
157
  },
158
  {
159
  "ticket_id": "ticket-014",
160
+ "title": "Audit report findings - action required by Friday",
161
  "requester": "audit@compliancepartners.com",
162
  "description": "The SOC2 audit uncovered three medium-severity findings. Remediation evidence is due by end of week.",
163
  "issue_type": "security_compliance",
 
229
  },
230
  {
231
  "ticket_id": "ticket-020",
232
+ "title": "General inquiry about platform admin capabilities",
233
+ "requester": "ops-eval@greenleaf.org",
234
+ "description": "Our operations team is doing a lightweight vendor scan and wants a short overview of admin controls, reporting, and deployment options.",
235
  "issue_type": "general_inquiry",
236
  "priority": "low",
237
  "assignment_group": "service_desk",
 
373
  },
374
  {
375
  "ticket_id": "ticket-032",
376
+ "title": "Penetration test results - critical vulnerabilities found",
377
  "requester": "security@redteam-auditors.com",
378
  "description": "Our pentest revealed two critical and five high-severity vulnerabilities in your API endpoints. Full report attached. Remediation should begin immediately.",
379
  "issue_type": "security_compliance",
 
433
  },
434
  {
435
  "ticket_id": "ticket-037",
436
+ "title": "Obvious diploma scam reached admissions support inbox",
437
+ "requester": "support@midcity.edu",
438
+ "description": "An unsolicited message promising a degree in two weeks arrived in the support mailbox. It is not tied to any customer case and should be ignored.",
439
  "issue_type": "spam_phishing",
440
  "priority": "low",
441
  "assignment_group": "security_team",
 
517
  },
518
  {
519
  "ticket_id": "ticket-044",
520
+ "title": "Credential phishing message impersonating security team",
521
+ "requester": "helpdesk@startupxyz.io",
522
+ "description": "A message claiming the account was compromised asked users to click a verification link immediately. It appears to be a classic credential phishing lure.",
523
  "issue_type": "spam_phishing",
524
  "priority": "low",
525
  "assignment_group": "security_team",