Roopalgn committed on
Commit
706f85f
·
2 Parent(s): 54d32f8 7a88607

Merge branch 'codex/apr5-apr6-roopal'

Files changed (1)
  1. ROADMAP.md +150 -232
ROADMAP.md CHANGED
@@ -1,4 +1,4 @@
1
- # Hackstreet Boys Roadmap
2
 
3
  ## Team
4
 
@@ -8,54 +8,64 @@
8
  - Suyash Kumar
9
  - Submission deadline: April 8, 2026, 11:59 PM IST
10
 
11
- ## Goal
12
 
13
- Ship a clean, well-documented OpenEnv environment for IT helpdesk ticket routing that:
 
 
14
 
15
- - passes all submission gates
16
- - scores well on real-world utility
17
- - has deterministic, defensible grading
18
- - is easy for judges to understand and rerun
19
 
20
- ## When You Start Coding
21
 
22
- Start coding immediately on **March 30, 2026** after a short 30 to 60 minute alignment pass.
23
 
24
- That first coding session should do only high-leverage foundation work:
25
 
26
- - lock the exact ticket vocabulary
27
- - freeze field names in `models.py`
28
- - confirm task fields in `server/tasks.py`
29
- - agree on grader labels in `server/grader.py`
30
- - agree that no one changes schema names casually after this point
31
 
32
- ### First coding targets on March 30, 2026
33
 
34
- Roopal should start with:
35
 
36
- - `data/dataset.json`
37
- - `server/tasks.py`
38
- - `server/grader.py`
39
-
40
- Suyash should start with:
41
-
42
- - `models.py`
43
- - `server/environment.py`
44
- - `inference.py`
45
 
46
- By the end of the first coding block, both of you should have:
47
 
48
- - matching field names
49
- - matching task labels
50
- - matching issue-type vocabulary
51
- - no unresolved schema disagreements
52
 
53
- ## Working Model For Two People
54
 
55
- The safest way for two people to work separately and merge cleanly is to divide ownership by file groups, not by abstract ideas.
56
 
57
  ### Roopal ownership
58
 
 
 
59
  - `data/dataset.json`
60
  - `server/tasks.py`
61
  - `server/grader.py`
@@ -63,17 +73,24 @@ The safest way for two people to work separately and merge cleanly is to divide
63
  - `KNOWLEDGE.md`
64
  - `MENTAL_MODEL.md`
65
 
66
- Primary responsibilities:
67
 
68
- - dataset quality
69
- - label consistency
70
- - task wording
71
- - grader realism
72
- - documentation clarity
73
- - judging-story polish
74
 
75
  ### Suyash ownership
76
 
 
 
77
  - `models.py`
78
  - `server/environment.py`
79
  - `server/app.py`
@@ -85,255 +102,156 @@ Primary responsibilities:
85
  - `pyproject.toml`
86
  - `requirements.txt`
87
 
88
- Primary responsibilities:
89
-
90
- - runtime correctness
91
- - OpenEnv interface
92
- - inference reliability
93
- - Docker and deployment readiness
94
- - integration behavior
95
-
96
- ## Merge Strategy
97
-
98
- To keep parallel work easy to combine:
99
-
100
- 1. avoid editing the same file on the same day unless planned
101
- 2. use one shared terminology list and do not invent alternate labels
102
- 3. sync once daily with a 10 minute review of:
103
- - changed files
104
- - open blockers
105
- - any schema changes
106
- 4. freeze the dataset schema early
107
- 5. freeze the action and observation field names early
108
-
109
- ## Shared Source Of Truth
110
-
111
- These files should be treated as authoritative:
112
-
113
- - `README.md` for the public project story
114
- - `PLAN.md` for project requirements and definition of done
115
- - `MENTAL_MODEL.md` for the current system shape
116
- - `openenv.yaml` for environment metadata
117
- - `server/tasks.py` and `server/grader.py` for task rules
118
-
119
- ## AI Usage Policy
120
-
121
- AI is permitted, so use it aggressively where it saves time, but do not outsource judgment.
122
-
123
- Good uses of AI:
124
-
125
- - draft clearer task descriptions
126
- - propose additional hard-case tickets
127
- - suggest edge cases and label audits
128
- - improve prompts in `inference.py`
129
- - generate test ideas and checklists
130
- - improve README structure and wording
131
-
132
- Human review required for:
133
-
134
- - final dataset labels
135
- - grader weights and partial-credit rules
136
- - any claims in README
137
- - final benchmark numbers
138
- - submission metadata and deployment settings
139
-
140
- ## Submission Criteria Checklist
141
-
142
- ### Must pass
143
-
144
- - environment starts correctly
145
- - `reset()`, `step()`, and `state()` behave correctly
146
- - 3 tasks exist and are meaningfully different
147
- - grader scores are in `[0.0, 1.0]`
148
- - `inference.py` runs without error
149
- - Docker builds and starts
150
- - docs are complete and current
151
-
152
- ### Must score well
153
-
154
- - the task feels like real IT helpdesk work
155
- - the hard task is genuinely harder
156
- - the grader gives partial credit in sensible ways
157
- - the environment is easy to understand and rerun
158
 
159
- ## Timeline
160
 
161
- ### March 30, 2026
162
 
163
- - lock team name, domain, and vocabulary
164
- - finish repo cleanup
165
- - agree on ownership split
166
- - start coding the core schema and task logic immediately after the vocabulary lock
167
- - target a same-day checkpoint on:
168
- - `models.py`
169
- - `server/tasks.py`
170
- - `server/grader.py`
171
- - `server/environment.py`
172
 
173
- ### March 31, 2026
174
 
175
- Roopal:
176
 
177
- - audit `data/dataset.json` labels end to end
178
- - tighten ambiguous cases
179
- - review task wording in `server/tasks.py`
180
- - continue code work in `server/grader.py` if partial-credit tuning is still needed
181
 
182
- Suyash:
183
 
184
- - sanity-check `models.py`, `server/environment.py`, and `client.py`
185
- - check that the field names align everywhere
186
- - continue code work in `inference.py` and `server/app.py`
187
 
188
- Shared checkpoint:
 
 
 
189
 
190
- - confirm no schema changes are still pending
191
 
192
- ### April 1, 2026
193
 
194
- Roopal:
195
 
196
- - polish `server/grader.py`
197
- - confirm hard-task logic and partial-credit behavior
198
- - finish any remaining dataset label corrections
199
 
200
- Suyash:
201
 
202
- - polish `inference.py`
203
- - confirm heuristic mode uses the new ticket vocabulary consistently
204
- - finish runtime code adjustments in `client.py`, `server/app.py`, and `server/reward.py`
205
 
206
- Shared checkpoint:
207
-
208
- - agree on the exact labels and examples used in docs
 
 
 
209
 
210
- ### April 2, 2026
211
-
212
- Roopal:
213
 
214
- - improve `README.md`
215
- - improve `KNOWLEDGE.md`
216
-
217
- Suyash:
218
-
219
- - validate `openenv.yaml`
220
- - validate `server/Dockerfile`
221
- - validate dependency files
222
-
223
- Shared checkpoint:
224
 
225
- - ensure docs and code tell the same story
226
 
227
- ### April 3, 2026
228
 
229
  Roopal:
230
 
231
- - do a dataset realism pass
232
- - make sure examples clearly cover easy, medium, and hard cases
 
233
 
234
  Suyash:
235
 
236
- - perform the first full local runtime pass
237
- - run heuristic inference
238
- - note bugs or schema mismatches
239
 
240
  Shared checkpoint:
241
 
242
- - bug triage and fix list
243
-
244
- ### Practical coding rule
245
-
246
- If you are wondering "should we still be planning or should we code now?", the answer is:
247
 
248
- - **March 30 to April 4, 2026 = active coding and fixes**
249
- - **April 5 to April 6, 2026 = validation, docs, and score recording**
250
- - **April 7 to April 8, 2026 = freeze, smoke tests, and submission**
251
-
252
- ### April 4, 2026
253
-
254
- Roopal:
255
-
256
- - fix data, wording, and documentation issues from runtime feedback
257
-
258
- Suyash:
259
-
260
- - fix environment, inference, and Docker issues from runtime feedback
261
-
262
- Shared checkpoint:
263
 
264
- - second full local run
265
 
266
- ### April 5, 2026
267
 
268
  Roopal:
269
 
270
- - finalize README and knowledge docs
271
- - prepare a concise judge-facing explanation of the domain
 
272
 
273
  Suyash:
274
 
275
- - confirm Docker flow
276
- - confirm all required env vars are documented and handled
 
277
 
278
  Shared checkpoint:
279
 
280
- - record benchmark numbers if stable
281
-
282
- ### April 6, 2026
283
 
284
- - full dry run from a clean copy if possible
285
- - verify every required file is present
286
- - check for stale claims and outdated wording
287
 
288
- ### April 7, 2026
289
 
290
- - freeze feature changes
291
- - only bug fixes, validation, and submission packaging
292
- - verify final docs, metadata, and benchmark numbers
293
 
294
- ### April 8, 2026
295
 
296
- - do one last deployment and smoke test early in the day
297
- - stop risky edits several hours before deadline
298
- - submit before 11:59 PM IST
299
 
300
- ## Integration Rules
301
 
302
- To keep merges painless:
 
303
 
304
- 1. do not rename schemas after April 1, 2026
305
- 2. do not change task labels after April 2, 2026 without both agreeing
306
- 3. do not edit ownership files casually
307
- 4. if one person must touch the other person's file, call it out before doing it
308
- 5. keep a short daily changelog in chat or a shared note
309
 
310
- ## Definition Of Done For Each Member
 
311
 
312
- ### Roopal done means
313
 
314
- - dataset labels are internally consistent
315
- - docs are submission-ready
316
- - the hard task feels meaningfully harder than the easy and medium tasks
317
 
318
- ### Suyash done means
 
 
319
 
320
- - the environment runs end to end
321
- - the inference script works in heuristic mode
322
- - Docker and metadata are in good shape
323
 
324
- ## Final Two-Day Priority Order
 
 
325
 
326
- If time gets tight, prioritize in this exact order:
327
 
328
- 1. working environment
329
- 2. working inference script
330
- 3. valid grader and tasks
331
- 4. Docker and metadata
332
- 5. README clarity
333
- 6. extra polish
334
 
335
  ## Simple Rule To Remember
336
 
337
- Roopal owns the story and the labels.
338
- Suyash owns the runtime and the rails.
339
- Both review the final submission together.
 
1
+ # Hackstreet Boys Final Roadmap
2
 
3
  ## Team
4
 
 
8
  - Suyash Kumar
9
  - Submission deadline: April 8, 2026, 11:59 PM IST
10
 
11
+ ## How To Use This File
12
 
13
+ - `PROJECT_STATUS.md` is the canonical log of completed work.
14
+ - This roadmap now covers only the remaining execution plan, from the current merged repo state to final submission.
15
+ - `analysis/comp.md`, `analysis/comp_know.md`, and `analysis/inference.md` are internal prioritization notes only. Use them to guide priorities, but do not mention competitor repos in public-facing docs.
16
 
17
+ ## Current Repo State
 
 
 
18
 
19
+ The repo has already established the core submission shape:
20
 
21
+ - locked IT helpdesk ticket routing domain
22
+ - locked vocabulary and task names
23
+ - 3-task difficulty ladder
24
+ - deterministic grading with partial credit
25
+ - working heuristic baseline
26
+ - merged local validation on `/health`, `/tasks`, and `inference.py`
27
+ - current local benchmark reference:
28
+ - Task 1: `1.0000`
29
+ - Task 2: `0.8800`
30
+ - Task 3: `0.9400`
31
+ - Overall: `0.9400`
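
As a quick sanity check on the benchmark reference, the Overall figure is consistent with an unweighted mean of the three task scores. The sketch below assumes that is how the overall score is aggregated; the roadmap does not state the aggregation rule, and the variable names are illustrative only.

```python
from statistics import mean

# Per-task scores copied from the local benchmark reference in the roadmap.
task_scores = {"task_1": 1.0000, "task_2": 0.8800, "task_3": 0.9400}

# Assumption: "Overall" is the unweighted mean of the per-task scores.
overall = round(mean(list(task_scores.values())), 4)

print(f"Overall: {overall:.4f}")
```

If a later runtime change moves any task score, rerunning a check like this is a cheap way to catch a stale Overall number in the docs.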
32
 
33
+ The remaining work is no longer broad feature development. It is:
34
 
35
+ 1. final packaging and deployment readiness
36
+ 2. clean rerun evidence
37
+ 3. small high-impact improvements that strengthen submission quality without risking regressions
38
+ 4. freeze and submit early
 
39
 
40
+ ## Submission Gates That Must Be True
41
 
42
+ These are the practical must-pass items from `PLAN.md` and `KNOWLEDGE.md`:
43
 
44
+ - the environment starts correctly
45
+ - `reset()`, `step()`, and `state()` behave correctly
46
+ - 3 tasks exist and remain meaningfully different
47
+ - grader scores stay in `[0.0, 1.0]`
48
+ - `inference.py` runs reproducibly without crashing
49
+ - Docker builds and starts cleanly
50
+ - docs and metadata are current
51
+ - the repo is easy for judges to understand and rerun
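
The score-range gate in this list can be guarded with a small helper. This is a minimal sketch, not the code in `server/grader.py`; the function name `clamp_score` is hypothetical, and rejecting NaN is an assumption about what "stays in `[0.0, 1.0]`" should mean.

```python
import math


def clamp_score(raw: float) -> float:
    """Clamp a raw grader score into [0.0, 1.0].

    NaN is rejected explicitly because comparisons against NaN are always
    False, so a plain min/max clamp cannot be trusted to catch it.
    """
    if math.isnan(raw):
        raise ValueError("grader score is NaN")
    return max(0.0, min(1.0, raw))


# Example: out-of-range values are pulled back to the boundaries.
print(clamp_score(1.2), clamp_score(-0.1), clamp_score(0.88))
```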
 
52
 
53
+ ## Final Priority Order
54
 
55
+ If time gets tight, prioritize in this exact order:
 
 
 
56
 
57
+ 1. merged Docker and deployment validation
58
+ 2. clean-copy rerun
59
+ 3. README and metadata readiness for Hugging Face / OpenEnv deployment
60
+ 4. small reward and observation improvements that strengthen RL value
61
+ 5. extra polish
62
 
63
+ ## Ownership From Now Until Submission
64
 
65
  ### Roopal ownership
66
 
67
+ Files already owned:
68
+
69
  - `data/dataset.json`
70
  - `server/tasks.py`
71
  - `server/grader.py`
 
73
  - `KNOWLEDGE.md`
74
  - `MENTAL_MODEL.md`
75
 
76
+ Roopal mandatory finish-line responsibilities:
77
+
78
+ - keep the docs judge-friendly and fully current
79
+ - add Hugging Face Spaces README frontmatter
80
+ - keep the task story and public explanation simple and strong
81
+ - make only safe grader improvements that improve reward quality without destabilizing labels
82
+ - sync benchmark references in docs if any runtime change alters the numbers
83
+
84
+ Roopal optional high-value improvements:
85
 
86
+ - add a short TRL / GRPO usage example to `README.md`
87
+ - expand the issue-type similarity matrix with only a few safe, reviewable near-miss pairs
88
+ - add one or two sharper hard-case examples in docs if useful
 
 
 
89
 
90
  ### Suyash ownership
91
 
92
+ Files already owned:
93
+
94
  - `models.py`
95
  - `server/environment.py`
96
  - `server/app.py`
 
102
  - `pyproject.toml`
103
  - `requirements.txt`
104
 
105
+ Suyash mandatory finish-line responsibilities:
106
 
107
+ - keep the runtime stable from the merged branch
108
+ - confirm Docker evidence on the merged submission branch
109
+ - add `.openenvignore` for cleaner `openenv push` packaging
110
+ - verify deployment assumptions around `app_port: 7860`, `/health`, `/docs`, `/ws`, and `/web`
111
+ - do a clean-copy install-and-run pass from a fresh clone if possible
112
+ - rerun `inference.py` after any runtime-side change
113
 
114
+ Suyash optional high-value improvements:
115
 
116
+ - enrich observation history with slightly more useful prior-step context
117
+ - support an optional `queue_size` reset kwarg if the change stays tiny and low-risk
118
 
119
+ ### Shared responsibilities
120
 
121
+ - do not rename schemas or vocabulary
122
+ - rerun the benchmark after any code change that could affect behavior
123
+ - keep `PROJECT_STATUS.md` honest
124
+ - use the GitHub Actions Docker smoke workflow when local Docker is blocked by machine setup
125
+ - stop adding risky features before the deadline day
126
 
127
+ ## Improvements Worth Doing Before April 8
 
 
 
128
 
129
+ These are the best ideas from the competitive analysis that are still worth doing this late.
130
 
131
+ ### P0: Do before submission
 
 
132
 
133
+ - add Hugging Face Spaces frontmatter to `README.md`
134
+ - add `.openenvignore`
135
+ - make sure the merged branch has a green Docker smoke result
136
+ - do one clean-copy rerun outside the current working tree if possible
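
For the frontmatter item in this list, a rough sketch of what the block at the top of `README.md` could contain is shown here, generated from Python so the values stay in one place. The `sdk` and `app_port` keys are standard Hugging Face Spaces settings and `7860` is the port named in this roadmap; the `title` value and the helper name `render_frontmatter` are placeholders, not taken from the repo.

```python
def render_frontmatter(fields: dict) -> str:
    """Render a flat mapping as a YAML frontmatter block for README.md."""
    lines = ["---"]
    for key, value in fields.items():
        lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines) + "\n"


frontmatter = render_frontmatter(
    {
        "title": "IT Helpdesk Ticket Routing",  # placeholder title
        "sdk": "docker",
        "app_port": 7860,
    }
)
print(frontmatter)
```

The helper only handles flat scalar values, which is all a minimal Spaces header needs.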
137
 
138
+ ### P1: Do only if the repo remains stable
139
 
140
+ - add a short TRL / GRPO integration example to `README.md`
141
+ - expand `ISSUE_TYPE_SIMILARITY` with only a few obvious, defensible pairs such as:
142
+ - `onboarding` vs `service_request`
143
+ - `feature_request` vs `service_request`
144
+ - `security_compliance` vs `identity_access`
145
+ - enrich `history` slightly if it helps multi-step reasoning and does not bloat observations
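
To make the similarity-matrix idea concrete, one way to store near-miss pairs is as unordered keys so that lookups are symmetric. This is a sketch under assumptions: the three pairs are the ones listed in this section, but the weights are invented for illustration, and the real structure of `ISSUE_TYPE_SIMILARITY` in the repo may differ.

```python
# Hypothetical partial-credit weights; the actual values are a grader decision.
ISSUE_TYPE_SIMILARITY = {
    frozenset({"onboarding", "service_request"}): 0.5,
    frozenset({"feature_request", "service_request"}): 0.5,
    frozenset({"security_compliance", "identity_access"}): 0.5,
}


def issue_type_score(predicted: str, expected: str) -> float:
    """Return 1.0 for an exact match, partial credit for a listed near miss, else 0.0."""
    if predicted == expected:
        return 1.0
    # frozenset keys make the lookup independent of argument order.
    return ISSUE_TYPE_SIMILARITY.get(frozenset({predicted, expected}), 0.0)


print(issue_type_score("onboarding", "service_request"))
print(issue_type_score("service_request", "onboarding"))
```

Keeping the table small and explicit makes each pair easy to review, which matches the "only a few obvious, defensible pairs" constraint.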
146
 
147
+ ### P2: Defer unless everything else is already green
148
 
149
+ - optional `queue_size` reset override
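
If the `queue_size` override is attempted, it can stay tiny by defaulting to the current behavior and validating the argument. The class below is a standalone sketch, not the code in `server/environment.py`; the class name, the default of 5, and the ticket shape are all hypothetical, and only the 45-ticket dataset size comes from this roadmap.

```python
from typing import List, Optional

DEFAULT_QUEUE_SIZE = 5  # hypothetical default, not taken from the repo


class QueueResetSketch:
    """Minimal stand-in for an environment whose reset() accepts queue_size."""

    def __init__(self, tickets: List[dict]) -> None:
        self._tickets = list(tickets)
        self.queue: List[dict] = []

    def reset(self, queue_size: Optional[int] = None) -> List[dict]:
        # Omitting the kwarg preserves the existing default behavior.
        size = DEFAULT_QUEUE_SIZE if queue_size is None else queue_size
        if not 1 <= size <= len(self._tickets):
            raise ValueError(f"queue_size must be in [1, {len(self._tickets)}]")
        # Deterministic slice, so grading stays reproducible across reruns.
        self.queue = self._tickets[:size]
        return list(self.queue)


env = QueueResetSketch([{"id": i} for i in range(45)])
print(len(env.reset()), len(env.reset(queue_size=3)))
```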
 
 
150
 
151
+ ## Improvements To Avoid Before The Deadline
152
 
153
+ These ideas came up in the analysis, but they are too risky or too large for the remaining time window:
 
 
154
 
155
+ - MCP migration
156
+ - transform-based reward refactor
157
+ - large dataset expansion from 45 to 100 tickets
158
+ - major schema changes
159
+ - broad prompt or inference rewrites that could disturb the stable baseline
160
+ - big dependency-management changes just for polish
161
 
162
+ ## Date-By-Date Execution Plan
 
 
163
 
164
+ ### April 6, 2026
165
 
166
+ Primary goal:
167
 
168
+ - lock down deployment readiness and clean rerun evidence
169
 
170
  Roopal:
171
 
172
+ - add Hugging Face Spaces README frontmatter
173
+ - keep judge-facing README language concise and strong
174
+ - review whether a small issue-similarity expansion is safe enough to land
175
 
176
  Suyash:
177
 
178
+ - add `.openenvignore`
179
+ - verify the Docker smoke workflow on the merged branch
180
+ - do a clean-copy install plus `inference.py` rerun from a fresh clone if possible
181
 
182
  Shared checkpoint:
183
 
184
+ - Docker evidence is green
185
+ - clean-copy rerun is complete or explicitly blocked
186
+ - no stale claims remain in docs
 
 
187
 
188
+ ### April 7, 2026
189
 
190
+ Primary goal:
191
 
192
+ - only high-signal improvements, then freeze
193
 
194
  Roopal:
195
 
196
+ - add a short TRL / GRPO example if it can be written cleanly
197
+ - make at most one final safe grader improvement if benchmark stability is preserved
198
+ - do a final docs consistency pass across `README.md`, `KNOWLEDGE.md`, and `MENTAL_MODEL.md`
199
 
200
  Suyash:
201
 
202
+ - make only tiny runtime improvements if they are clearly helpful and low-risk
203
+ - otherwise freeze the runtime and packaging files
204
+ - rerun the benchmark if any runtime-side change lands
205
 
206
  Shared checkpoint:
207
 
208
+ - final benchmark numbers are recorded: reused as-is if nothing changed, freshly rerun if anything did
209
+ - docs, metadata, and runtime all tell the same story
210
+ - feature work stops by the end of the day
211
 
212
+ ### April 8, 2026
 
 
213
 
214
+ Primary goal:
215
 
216
+ - submit from a calm, validated repo state
 
 
217
 
218
+ Morning:
219
 
220
+ - run one final smoke test on the submission branch
221
+ - verify Docker evidence still exists on the merged commit
222
+ - verify `README.md`, `openenv.yaml`, and required files are present and current
223
 
224
+ Afternoon:
225
 
226
+ - make only typo-level or packaging-only fixes
227
+ - do not make risky grader, dataset, or runtime changes
228
 
229
+ Final submission rule:
230
 
231
+ - stop risky edits several hours before the 11:59 PM IST deadline
232
+ - submit early if the repo is already green
233
 
234
+ ## What Counts As Complete
235
 
236
+ ### April 6 complete means
 
 
237
 
238
+ - merged Docker validation exists
239
+ - clean-copy rerun evidence exists or a specific blocker is documented
240
+ - deployment-readiness files are in place
241
 
242
+ ### April 7 complete means
 
 
243
 
244
+ - any remaining safe improvements are merged
245
+ - final benchmark reference is recorded
246
+ - docs and metadata are frozen
247
 
248
+ ### April 8 complete means
249
 
250
+ - final smoke test is done
251
+ - submission has been sent
252
 
253
  ## Simple Rule To Remember
254
 
255
+ Roopal owns the story, labels, and public clarity.
256
+ Suyash owns the runtime, packaging, and reproducibility rails.
257
+ Both of you should optimize for a clean, rerunnable, judge-friendly submission rather than chasing last-minute complexity.