Roopalgn committed on
Commit 6f27f26 · 1 Parent(s): 706f85f

Update final submission roadmap

Files changed (1)
  1. ROADMAP.md +257 -170
ROADMAP.md CHANGED
@@ -1,4 +1,4 @@
- # Hackstreet Boys Final Roadmap

  ## Team

@@ -7,251 +7,338 @@
  - Roopal Guha Neogi
  - Suyash Kumar
  - Submission deadline: April 8, 2026, 11:59 PM IST

- ## How To Use This File

- - `PROJECT_STATUS.md` is the canonical log of completed work.
- - This roadmap is now the remaining execution plan from the current merged repo state to final submission.
- - `analysis/comp.md`, `analysis/comp_know.md`, and `analysis/inference.md` are internal prioritization notes only. Use them to guide priorities, but do not mention competitor repos in public-facing docs.

- ## Current Repo State

- The repo has already established the core submission shape:

- - locked IT helpdesk ticket routing domain
- - locked vocabulary and task names
- - 3-task difficulty ladder
- - deterministic grading with partial credit
- - working heuristic baseline
- - merged local validation on `/health`, `/tasks`, and `inference.py`
- - current local benchmark reference:
-   - Task 1: `1.0000`
-   - Task 2: `0.8800`
-   - Task 3: `0.9400`
-   - Overall: `0.9400`

- The remaining work is no longer broad feature development. The remaining work is:

- 1. final packaging and deployment readiness
- 2. clean rerun evidence
- 3. small high-impact improvements that strengthen submission quality without risking regressions
- 4. freeze and submit early

- ## Submission Gates That Must Be True

- These are the practical must-pass items from `PLAN.md` and `KNOWLEDGE.md`:

- - the environment starts correctly
- - `reset()`, `step()`, and `state()` behave correctly
- - 3 tasks exist and remain meaningfully different
- - grader scores stay in `[0.0, 1.0]`
- - `inference.py` runs reproducibly without crashing
- - Docker builds and starts cleanly
- - docs and metadata are current
- - the repo is easy for judges to understand and rerun

- ## Final Priority Order

- If time gets tight, prioritize in this exact order:

- 1. merged Docker and deployment validation
- 2. clean-copy rerun
- 3. README and metadata readiness for Hugging Face / OpenEnv deployment
- 4. small reward and observation improvements that strengthen RL value
- 5. extra polish

- ## Ownership From Now Until Submission

- ### Roopal ownership

- Files already owned:

- - `data/dataset.json`
- - `server/tasks.py`
- - `server/grader.py`
- - `README.md`
- - `KNOWLEDGE.md`
- - `MENTAL_MODEL.md`

- Roopal mandatory finish-line responsibilities:

- - keep the docs judge-friendly and fully current
- - add Hugging Face Spaces README frontmatter
- - keep the task story and public explanation simple and strong
- - make only safe grader improvements that improve reward quality without destabilizing labels
- - sync benchmark references in docs if any runtime change alters the numbers

- Roopal optional high-value improvements:

- - add a short TRL / GRPO usage example to `README.md`
- - expand the issue-type similarity matrix with only a few safe, reviewable near-miss pairs
- - add one or two sharper hard-case examples in docs if useful

- ### Suyash ownership

- Files already owned:

- - `models.py`
- - `server/environment.py`
- - `server/app.py`
- - `server/reward.py`
- - `client.py`
- - `inference.py`
- - `openenv.yaml`
- - `server/Dockerfile`
- - `pyproject.toml`
- - `requirements.txt`

- Suyash mandatory finish-line responsibilities:

- - keep the runtime stable from the merged branch
- - confirm Docker evidence on the merged submission branch
- - add `.openenvignore` for cleaner `openenv push` packaging
- - verify deployment assumptions around `app_port: 7860`, `/health`, `/docs`, `/ws`, and `/web`
- - do a clean-copy install-and-run pass from a fresh clone if possible
- - rerun `inference.py` after any runtime-side change

- Suyash optional high-value improvements:

- - enrich observation history with slightly more useful prior-step context
- - support an optional `queue_size` reset kwarg if the change stays tiny and low-risk

- ### Shared responsibilities

- - do not rename schemas or vocabulary
- - rerun the benchmark after any code change that could affect behavior
- - keep `PROJECT_STATUS.md` honest
- - use the GitHub Actions Docker smoke workflow when local Docker is blocked by machine setup
- - stop adding risky features before the deadline day

- ## Improvements Worth Doing Before April 8

- These are the best ideas from the competitive analysis that are still worth doing this late.

- ### P0: Do before submission

- - add Hugging Face Spaces frontmatter to `README.md`
- - add `.openenvignore`
- - make sure the merged branch has a green Docker smoke result
- - do one clean-copy rerun outside the current working tree if possible

- ### P1: Do only if the repo remains stable

- - add a short TRL / GRPO integration example to `README.md`
- - expand `ISSUE_TYPE_SIMILARITY` with only a few obvious, defensible pairs such as:
-   - `onboarding` vs `service_request`
-   - `feature_request` vs `service_request`
-   - `security_compliance` vs `identity_access`
- - enrich `history` slightly if it helps multi-step reasoning and does not bloat observations

- ### P2: Defer unless everything else is already green

- - optional `queue_size` reset override

- ## Improvements To Avoid Before The Deadline

- These ideas came up in the analysis, but they are too risky or too large for the remaining time window:

- - MCP migration
- - transform-based reward refactor
- - large dataset expansion from 45 to 100 tickets
- - major schema changes
- - broad prompt or inference rewrites that could disturb the stable baseline
- - big dependency-management changes just for polish

- ## Date-By-Date Execution Plan

- ### April 6, 2026

- Primary goal:

- - lock down deployment readiness and clean rerun evidence

- Roopal:

- - add Hugging Face Spaces README frontmatter
- - keep judge-facing README language concise and strong
- - review whether a small issue-similarity expansion is safe enough to land

- Suyash:

- - add `.openenvignore`
- - verify the Docker smoke workflow on the merged branch
- - do a clean-copy install plus `inference.py` rerun from a fresh clone if possible

- Shared checkpoint:

- - Docker evidence is green
- - clean-copy rerun is complete or explicitly blocked
- - no stale claims remain in docs

- ### April 7, 2026

- Primary goal:

- - only high-signal improvements, then freeze

- Roopal:

- - add a short TRL / GRPO example if it can be written cleanly
- - make at most one final safe grader improvement if benchmark stability is preserved
- - do a final docs consistency pass across `README.md`, `KNOWLEDGE.md`, and `MENTAL_MODEL.md`

- Suyash:

- - make only tiny runtime improvements if they are clearly helpful and low-risk
- - otherwise freeze the runtime and packaging files
- - rerun the benchmark if any runtime-side change lands

- Shared checkpoint:

- - final benchmark numbers recorded if unchanged or freshly rerun if changed
- - docs, metadata, and runtime all tell the same story
- - feature work stops by the end of the day

- ### April 8, 2026

- Primary goal:

- - submit from a calm, validated repo state

- Morning:

- - run one final smoke test on the submission branch
- - verify Docker evidence still exists on the merged commit
- - verify `README.md`, `openenv.yaml`, and required files are present and current

- Afternoon:

- - make only typo-level or packaging-only fixes
- - do not make risky grader, dataset, or runtime changes

- Final submission rule:

- - stop risky edits several hours before the 11:59 PM IST deadline
- - submit early if the repo is already green

- ## What Counts As Complete

- ### April 6 complete means

- - merged Docker validation exists
- - clean-copy rerun evidence exists or a specific blocker is documented
- - deployment-readiness files are in place

- ### April 7 complete means

- - any remaining safe improvements are merged
- - final benchmark reference is recorded
- - docs and metadata are frozen

- ### April 8 complete means

- - final smoke test is done
- - submission has been sent

  ## Simple Rule To Remember

- Roopal owns the story, labels, and public clarity.
- Suyash owns the runtime, packaging, and reproducibility rails.
- Both of you should optimize for a clean, rerunnable, judge-friendly submission rather than chasing last-minute complexity.

+ # Hackstreet Boys Roadmap

  ## Team

  - Roopal Guha Neogi
  - Suyash Kumar
  - Submission deadline: April 8, 2026, 11:59 PM IST
+ - Current planning checkpoint: April 3, 2026

+ ## What We Are Optimizing For

+ These are the main wins for the final stretch, in order:

+ 1. **RL improvement**
+ 2. **Robustness**
+ 3. **Real-world grounding**
+ 4. **Submission safety**

+ In practice, that means:

+ - improve the reward and episode behavior only where changes are low-risk and test-backed
+ - add strong automated validation so the repo feels reliable, not hand-wavy
+ - ground our taxonomy and partial-credit choices against real external IT support data without trying to absorb that data into the runtime dataset this late
+ - avoid broad refactors that create new failure modes near submission

+ ## Honest Scope Call

+ What is viable before the deadline:

+ - unit tests
+ - smoke tests
+ - focused integration tests
+ - deterministic regression checks
+ - lightweight RL-oriented scoring improvements
+ - grounding audits against public real-world support datasets

+ What is **not** viable before the deadline:

+ - replacing `data/dataset.json` with an external dataset
+ - redesigning the taxonomy from scratch
+ - large architecture rewrites
+ - open-ended benchmark expansion without validation

+ ## Guardrails

+ To stay on track:

+ 1. do not merge external datasets into the main runtime dataset before submission
+ 2. do not broaden the action schema or rename fields
+ 3. do not make reward changes unless tests prove exact, zero, and partial-credit cases clearly
+ 4. every Codex-generated code change must end with tests or validation evidence
+ 5. prefer small, bounded implementation passes over one large all-at-once rewrite

+ ## Working Model With Codex

+ Using Codex to generate all implementation work is viable **if we keep each ask narrow and verifiable**.

+ Best pattern:

+ 1. ask for one bounded change set
+ 2. add or update tests in the same pass
+ 3. run the relevant checks
+ 4. only then move to the next improvement

+ Bad pattern:

+ - ask for tests, scoring changes, dataset expansion, CI, and docs all in one prompt

+ ## Last-Mile Phase Plan

+ ### Phase 1: Test Foundation
+ **Window:** April 3 to April 4

+ **Primary objective:** make the current env provably correct before we tune anything
+
+ Deliverables:
+
+ - add `pytest`-based test structure
+ - add unit tests for:
+   - `server/grader.py`
+   - `server/reward.py`
+   - `server/tasks.py`
+   - `models.py` where validation matters
+ - add smoke tests for:
+   - environment `reset()`
+   - environment `step()`
+   - deterministic seeded behavior
+   - score range `[0.0, 1.0]`
+ - add focused integration tests for:
+   - FastAPI endpoints such as `/health`, `/tasks`, `/reset`, `/step`, `/state`
+   - one full seeded episode through the app surface
+
+ Most important assertions in this phase:
+
+ - exact matches score `1.0`
+ - unrelated wrong labels score `0.0`
+ - only approved near-miss pairs receive partial credit
+ - assignment group and resolution action remain exact-match fields
+ - the environment is deterministic when seeded
+ - the baseline path still completes all tasks
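
The scoring assertions listed above could be pinned down with tests along these lines. This is a minimal sketch, not the repo's `server/grader.py`: `score_issue_type`, `score_exact_field`, the single entry in the similarity table, the `0.5` partial-credit value, and the group names `service_desk` and `security_team` are all hypothetical stand-ins chosen for illustration.

```python
# Hypothetical stand-in for the scorer; not the repo's server/grader.py.
# The approved pair below and its 0.5 credit are illustrative assumptions.
ISSUE_TYPE_SIMILARITY = {
    frozenset({"onboarding", "service_request"}): 0.5,
}


def score_issue_type(predicted: str, expected: str) -> float:
    """Return 1.0 for an exact match, partial credit only for an approved pair, else 0.0."""
    if predicted == expected:
        return 1.0
    # frozenset makes the lookup symmetric: (a, b) and (b, a) share one entry.
    return ISSUE_TYPE_SIMILARITY.get(frozenset({predicted, expected}), 0.0)


def score_exact_field(predicted: str, expected: str) -> float:
    """Exact-match-only field, e.g. assignment group or resolution action."""
    return 1.0 if predicted == expected else 0.0


def test_exact_match_scores_one():
    assert score_issue_type("onboarding", "onboarding") == 1.0


def test_unrelated_label_scores_zero():
    assert score_issue_type("onboarding", "security_compliance") == 0.0


def test_only_approved_pairs_get_partial_credit():
    assert score_issue_type("onboarding", "service_request") == 0.5
    assert score_issue_type("service_request", "onboarding") == 0.5


def test_exact_fields_never_get_partial_credit():
    assert score_exact_field("service_desk", "service_desk") == 1.0
    assert score_exact_field("service_desk", "security_team") == 0.0
```

Keeping the near-miss table as an explicit allow-list is what makes the "not always fuzzy" claim testable: any pair missing from the table falls through to `0.0`.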
+
+ Exit criteria:
+
+ - tests clearly prove the scorer is **not** "always fuzzy"
+ - core environment behavior is covered by automated checks
+ - we can change scoring logic later without guessing whether we broke it
+
+ ### Phase 2: RL Improvement Without Big Risk
+ **Window:** April 4 to April 5
+
+ **Primary objective:** make the reward surface better for RL while preserving determinism and judge clarity
+
+ Allowed improvements in this phase:
+
+ - refine `ISSUE_TYPE_SIMILARITY` only where justified and test-backed
+ - tighten priority partial-credit coverage if tests show obvious gaps
+ - improve episode history if it helps multi-step learning and does not complicate grading
+ - add deterministic regression checks around expected baseline behavior
+ - optionally add a safe `queue_size` override in `reset()` only if it is clean and fully tested
+
+ Non-goals for this phase:
+
+ - no new fields in the public schema
+ - no major reward-architecture refactor
+ - no broad rubric redesign
+
+ Decision rule:
+
+ - if a proposed RL improvement makes scoring harder to explain, skip it
+ - if it improves learning signal and is easy to test, keep it
+
+ Exit criteria:

+ - reward logic is still simple to explain
+ - exactness is preserved where it should be exact
+ - any extra partial credit is intentional, narrow, and documented by tests

+ ### Phase 3: Real-World Grounding Audit
+ **Window:** April 5 to April 6

+ **Primary objective:** show that our labels and ambiguity rules are grounded in real support data, without late-stage dataset merge risk

+ Grounding approach:

+ - audit our taxonomy against public real-world support datasets
+ - use those datasets as reference material, not as direct training/runtime data
+ - document what they validate about our domain, labels, and near-miss structure

+ Recommended external references:

+ - `Classification of IT Support Tickets` (Zenodo): manually classified IT support tickets
+ - `Semantic Similarity of IT Support Tickets` (Zenodo): manually labeled support-ticket similarity pairs
+ - `MSDialog`: Microsoft technical support conversations for realistic support-language patterns

+ Concrete work in this phase:

+ - compare our issue types to external category patterns
+ - review whether our ambiguous tickets reflect real support ambiguity
+ - justify or reject candidate partial-credit pairs using external examples
+ - note any obvious taxonomy blind spots for future work

+ Important constraint:

+ - do **not** import external rows into `data/dataset.json` at this stage
+ - do **not** claim full external-dataset benchmarking unless we actually run it

+ Exit criteria:

+ - we can honestly say our environment design is grounded against real support data
+ - any scoring adjustments introduced in Phase 2 have an external rationale, not just intuition

+ ### Phase 4: Hardening And Regression Safety
+ **Window:** April 6 to April 7

+ **Primary objective:** make the repo reliable from the outside, not just locally understandable

+ Deliverables:

+ - run the full test suite on the merged repo state
+ - keep or improve Docker smoke coverage
+ - if feasible, add CI for `pytest` in addition to Docker smoke
+ - rerun heuristic baseline and confirm it remains stable after test/scoring changes
+ - verify docs still match the implemented behavior

+ Exit criteria:

+ - runtime behavior, tests, and docs all agree
+ - no unresolved ambiguity remains about the baseline numbers
+ - Docker and app-surface behavior have at least one real validation path

+ ### Phase 5: Freeze And Submission Packaging
+ **Window:** April 7 to April 8

+ **Primary objective:** stop taking avoidable risk

+ Allowed work:

+ - bug fixes
+ - doc corrections
+ - metadata fixes
+ - smoke-test reruns
+ - submission packaging

+ Avoid in this phase:

+ - new dataset content
+ - scoring experiments
+ - structural refactors
+ - "nice-to-have" features

+ Exit criteria:

+ - the repo is stable
+ - the docs are accurate
+ - the submission story is clear

+ ## Test Strategy

+ ### Unit Tests

+ Goal:

+ - prove the scorer, reward helpers, and dataset/task loaders behave exactly as intended

+ Priority unit targets:

+ - `grade_action()` exact-match, zero-score, and partial-credit cases
+ - unsupported `task_id` behavior
+ - task weights summing to expected behavior
+ - reward helper bounds
+ - dataset loader behavior including Windows BOM handling
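
For the BOM target in the list above, one possible shape of the test is sketched here. `load_tickets` is a hypothetical helper and the ticket row is a placeholder; the repo's actual loader and dataset schema may differ. The sketch relies only on Python's standard `utf-8-sig` codec, which decodes plain UTF-8 unchanged and drops a leading byte-order mark when one is present.

```python
import json
import tempfile
from pathlib import Path


def load_tickets(path):
    """Load a JSON ticket file, tolerating a leading UTF-8 BOM (hypothetical helper)."""
    # "utf-8-sig" reads plain UTF-8 unchanged and strips a BOM when one is present.
    with open(path, encoding="utf-8-sig") as handle:
        return json.load(handle)


def test_loader_accepts_bom_and_plain_files():
    tickets = [{"id": "T-1", "issue_type": "onboarding"}]  # placeholder row, not real data
    payload = json.dumps(tickets).encode("utf-8")
    with tempfile.TemporaryDirectory() as tmp:
        plain = Path(tmp) / "plain.json"
        with_bom = Path(tmp) / "bom.json"
        plain.write_bytes(payload)
        # Some Windows editors prepend these three bytes when saving as UTF-8.
        with_bom.write_bytes(b"\xef\xbb\xbf" + payload)
        assert load_tickets(plain) == tickets
        assert load_tickets(with_bom) == tickets
```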

+ ### Smoke Tests

+ Goal:

+ - prove the environment works end to end with minimal assumptions

+ Priority smoke targets:

+ - `reset()` returns a valid observation
+ - `step()` advances the queue
+ - final reward stays in `[0.0, 1.0]`
+ - same seed gives the same episode behavior
+ - heuristic baseline completes without crashing
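
The seeded-determinism and reward-range targets above might be checked roughly as follows. `ToyQueueEnv` is a hypothetical stand-in written only for this sketch: the real environment lives in `server/environment.py`, and its `reset()`/`step()` signatures, observation shape, and seed handling are assumptions here.

```python
import random


class ToyQueueEnv:
    """Hypothetical stand-in for the environment; not the repo's implementation."""

    def __init__(self, tickets):
        self._tickets = list(tickets)
        self._queue = []

    def reset(self, seed=None):
        # A private Random instance keeps episodes independent of global RNG state.
        rng = random.Random(seed)
        self._queue = list(self._tickets)
        rng.shuffle(self._queue)
        return {"ticket": self._queue[0], "remaining": len(self._queue)}

    def step(self, action):
        expected = self._queue.pop(0)
        reward = 1.0 if action == expected["label"] else 0.0
        done = not self._queue
        observation = None if done else {"ticket": self._queue[0], "remaining": len(self._queue)}
        return observation, reward, done


def run_episode(env, seed):
    """Play one episode with a fixed toy policy and return (ticket id, reward) pairs."""
    observation = env.reset(seed=seed)
    trace, done = [], False
    while not done:
        ticket = observation["ticket"]
        # Toy policy: answer correctly on even ids, wrongly on odd ids.
        action = ticket["label"] if ticket["id"] % 2 == 0 else "wrong_label"
        observation, reward, done = env.step(action)
        trace.append((ticket["id"], reward))
    return trace


TICKETS = [{"id": i, "label": f"label_{i}"} for i in range(6)]


def test_same_seed_gives_same_episode():
    assert run_episode(ToyQueueEnv(TICKETS), seed=7) == run_episode(ToyQueueEnv(TICKETS), seed=7)


def test_episode_drains_queue_and_rewards_stay_in_range():
    trace = run_episode(ToyQueueEnv(TICKETS), seed=7)
    assert len(trace) == len(TICKETS)
    assert all(0.0 <= reward <= 1.0 for _, reward in trace)
```

Comparing whole traces, rather than only the final score, catches ordering differences that happen to produce the same total.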

+ ### Integration Tests

+ Goal:

+ - prove the real app surface behaves correctly, not just the pure Python helpers

+ Priority integration targets:

+ - `/health`
+ - `/tasks`
+ - `/reset`
+ - `/step`
+ - `/state`
+ - one full seeded episode through the app or client layer

+ ## RL Improvement Rules

+ We should improve RL usefulness in ways that keep the env judge-friendly.
+
+ Good RL improvements:
+
+ - clearer deterministic feedback
+ - better exact-vs-partial boundaries
+ - richer but still simple episode history
+ - deterministic controls that help reproducible rollouts
+
+ Bad RL improvements:
+
+ - vague similarity expansion without examples
+ - turning exact business-routing fields into fuzzy fields
+ - adding complexity that makes the README harder to explain
+
+ ## Grounding Rules
+
+ Grounding matters, but it must stay lightweight this late.
+
+ Good grounding work:
+
+ - audit our taxonomy against public support-ticket datasets
+ - use real support phrasing to validate dataset realism
+ - use labeled similarity pairs to justify a few near-miss cases
+
+ Bad grounding work:
+
+ - rushed ingestion of external datasets
+ - category remapping that forces taxonomy churn
+ - unsupported claims that our scores are benchmarked externally when they are not
+
+ ## Ownership Split For The Final Stretch
+
+ ### Roopal ownership
+
+ - grounding audit
+ - ticket realism review
+ - documentation updates
+ - competitive-positioning clarity
+
+ ### Suyash ownership

+ - tests
+ - runtime hardening
+ - scoring and reward implementation changes
+ - Docker and integration validation

+ ### Shared review items

+ - any changes to partial-credit rules
+ - any benchmark number updates
+ - final submission claims

+ ## Priority Order If Time Gets Tight

+ If the deadline compresses further, do this exact order:

+ 1. unit tests proving non-fuzzy scoring behavior
+ 2. smoke and integration tests for seeded deterministic runs
+ 3. grounding audit against external real-world support datasets
+ 4. low-risk RL reward improvements
+ 5. CI and extra polish

+ ## Definition Of Done For This Final Plan

+ We are done when:

+ 1. the scorer is test-backed and clearly not "always fuzzy"
+ 2. the environment has unit, smoke, and integration coverage
+ 3. the main RL improvements are implemented without hurting clarity
+ 4. grounding is supported by external real-world support datasets
+ 5. Docker, baseline behavior, and docs are all in sync

  ## Simple Rule To Remember

+ Improve learning signal.
+ Prove correctness.
+ Ground the story in real support data.
+ Do not take late-stage dataset-merging risk.
+ Do not take late-stage dataset-merging risk.