Roopalgn committed on
Commit
375aa81
·
1 Parent(s): 6f27f26

Update final submission roadmap

Files changed (1)
  1. ROADMAP.md +398 -229
ROADMAP.md CHANGED
@@ -1,4 +1,4 @@
1
- # Hackstreet Boys Roadmap
2
 
3
  ## Team
4
 
@@ -7,338 +7,507 @@
7
  - Roopal Guha Neogi
8
  - Suyash Kumar
9
  - Submission deadline: April 8, 2026, 11:59 PM IST
10
- - Current planning checkpoint: April 3, 2026
 
 
 
 
 
 
 
11
 
12
  ## What We Are Optimizing For
13
 
14
- These are the main wins for the final stretch, in order:
 
 
 
 
 
 
 
 
 
15
 
16
- 1. **RL improvement**
17
- 2. **Robustness**
18
  3. **Real-world grounding**
19
- 4. **Submission safety**
 
 
 
 
 
20
 
21
- In practice, that means:
22
 
23
- - improve the reward and episode behavior only where changes are low-risk and test-backed
24
- - add strong automated validation so the repo feels reliable, not hand-wavy
25
- - ground our taxonomy and partial-credit choices against real external IT support data without trying to absorb that data into the runtime dataset this late
26
- - avoid broad refactors that create new failure modes near submission
27
 
28
- ## Honest Scope Call
 
 
 
 
 
 
 
 
 
 
29
 
30
- What is viable before the deadline:
31
 
32
- - unit tests
33
- - smoke tests
34
- - focused integration tests
35
- - deterministic regression checks
36
- - lightweight RL-oriented scoring improvements
37
- - grounding audits against public real-world support datasets
38
 
39
- What is **not** viable before the deadline:
40
 
41
- - replacing `data/dataset.json` with an external dataset
42
- - redesigning the taxonomy from scratch
43
- - large architecture rewrites
44
- - open-ended benchmark expansion without validation
 
 
 
 
45
 
46
- ## Guardrails
47
 
48
- To stay on track:
49
 
50
- 1. do not merge external datasets into the main runtime dataset before submission
51
- 2. do not broaden the action schema or rename fields
52
- 3. do not make reward changes unless tests prove exact, zero, and partial-credit cases clearly
53
- 4. every Codex-generated code change must end with tests or validation evidence
54
- 5. prefer small, bounded implementation passes over one large all-at-once rewrite
 
 
 
55
 
56
- ## Working Model With Codex
57
 
58
- Using Codex to generate all implementation work is viable **if we keep each ask narrow and verifiable**.
 
 
 
 
 
 
59
 
60
- Best pattern:
61
 
62
- 1. ask for one bounded change set
63
- 2. add or update tests in the same pass
64
- 3. run the relevant checks
65
- 4. only then move to the next improvement
66
 
67
- Bad pattern:
 
 
 
 
 
68
 
69
- - ask for tests, scoring changes, dataset expansion, CI, and docs all in one prompt
70
 
71
- ## Last-Mile Phase Plan
72
 
73
- ### Phase 1: Test Foundation
74
  **Window:** April 3 to April 4
75
 
76
- **Primary objective:** make the current env provably correct before we tune anything
77
-
78
- Deliverables:
79
-
80
- - add `pytest`-based test structure
81
- - add unit tests for:
82
- - `server/grader.py`
83
- - `server/reward.py`
84
- - `server/tasks.py`
85
- - `models.py` where validation matters
86
- - add smoke tests for:
87
- - environment `reset()`
88
- - environment `step()`
89
- - deterministic seeded behavior
90
- - score range `[0.0, 1.0]`
91
- - add focused integration tests for:
92
- - FastAPI endpoints such as `/health`, `/tasks`, `/reset`, `/step`, `/state`
93
- - one full seeded episode through the app surface
94
-
95
- Most important assertions in this phase:
96
-
97
- - exact matches score `1.0`
98
- - unrelated wrong labels score `0.0`
99
- - only approved near-miss pairs receive partial credit
100
- - assignment group and resolution action remain exact-match fields
101
- - the environment is deterministic when seeded
102
- - the baseline path still completes all tasks
103
-
104
- Exit criteria:
105
-
106
- - tests clearly prove the scorer is **not** "always fuzzy"
107
- - core environment behavior is covered by automated checks
108
- - we can change scoring logic later without guessing whether we broke it
109
-
110
- ### Phase 2: RL Improvement Without Big Risk
111
  **Window:** April 4 to April 5
112
 
113
- **Primary objective:** make the reward surface better for RL while preserving determinism and judge clarity
114
 
115
- Allowed improvements in this phase:
116
 
117
- - refine `ISSUE_TYPE_SIMILARITY` only where justified and test-backed
118
- - tighten priority partial-credit coverage if tests show obvious gaps
119
- - improve episode history if it helps multi-step learning and does not complicate grading
120
- - add deterministic regression checks around expected baseline behavior
121
- - optionally add a safe `queue_size` override in `reset()` only if it is clean and fully tested
122
 
123
- Non-goals for this phase:
124
 
125
- - no new fields in the public schema
126
- - no major reward-architecture refactor
127
- - no broad rubric redesign
 
 
128
 
129
- Decision rule:
130
 
131
- - if a proposed RL improvement makes scoring harder to explain, skip it
132
- - if it improves learning signal and is easy to test, keep it
 
 
 
133
 
134
- Exit criteria:
135
 
136
- - reward logic is still simple to explain
137
- - exactness is preserved where it should be exact
138
- - any extra partial credit is intentional, narrow, and documented by tests
139
 
140
- ### Phase 3: Real-World Grounding Audit
141
  **Window:** April 5 to April 6
142
 
143
- **Primary objective:** show that our labels and ambiguity rules are grounded in real support data, without late-stage dataset merge risk
 
 
144
 
145
- Grounding approach:
 
 
146
 
147
- - audit our taxonomy against public real-world support datasets
148
- - use those datasets as reference material, not as direct training/runtime data
149
- - document what they validate about our domain, labels, and near-miss structure
150
 
151
- Recommended external references:
 
 
152
 
153
- - `Classification of IT Support Tickets` (Zenodo): manually classified IT support tickets
154
- - `Semantic Similarity of IT Support Tickets` (Zenodo): manually labeled support-ticket similarity pairs
155
- - `MSDialog`: Microsoft technical support conversations for realistic support-language patterns
156
 
157
- Concrete work in this phase:
 
 
 
 
158
 
159
- - compare our issue types to external category patterns
160
- - review whether our ambiguous tickets reflect real support ambiguity
161
- - justify or reject candidate partial-credit pairs using external examples
162
- - note any obvious taxonomy blind spots for future work
163
 
164
- Important constraint:
 
 
 
165
 
166
- - do **not** import external rows into `data/dataset.json` at this stage
167
- - do **not** claim full external-dataset benchmarking unless we actually run it
168
 
169
- Exit criteria:
 
 
170
 
171
- - we can honestly say our environment design is grounded against real support data
172
- - any scoring adjustments introduced in Phase 2 have an external rationale, not just intuition
173
 
174
- ### Phase 4: Hardening And Regression Safety
175
  **Window:** April 6 to April 7
176
 
177
- **Primary objective:** make the repo reliable from the outside, not just locally understandable
178
 
179
- Deliverables:
180
 
181
- - run the full test suite on the merged repo state
182
- - keep or improve Docker smoke coverage
183
- - if feasible, add CI for `pytest` in addition to Docker smoke
184
- - rerun heuristic baseline and confirm it remains stable after test/scoring changes
185
- - verify docs still match the implemented behavior
186
 
187
- Exit criteria:
188
 
189
- - runtime behavior, tests, and docs all agree
190
- - no unresolved ambiguity remains about the baseline numbers
191
- - Docker and app-surface behavior have at least one real validation path
192
 
193
- ### Phase 5: Freeze And Submission Packaging
194
- **Window:** April 7 to April 8
195
 
196
- **Primary objective:** stop taking avoidable risk
 
 
197
 
198
- Allowed work:
199
 
200
- - bug fixes
201
- - doc corrections
202
- - metadata fixes
203
- - smoke-test reruns
204
- - submission packaging
205
 
206
- Avoid in this phase:
207
 
208
- - new dataset content
209
- - scoring experiments
210
- - structural refactors
211
- - "nice-to-have" features
212
 
213
- Exit criteria:
 
 
 
 
214
 
215
- - the repo is stable
216
- - the docs are accurate
217
- - the submission story is clear
218
 
219
- ## Test Strategy
220
 
221
- ### Unit Tests
 
 
 
 
 
222
 
223
- Goal:
224
 
225
- - prove the scorer, reward helpers, and dataset/task loaders behave exactly as intended
 
 
 
 
226
 
227
- Priority unit targets:
228
 
229
- - `grade_action()` exact-match, zero-score, and partial-credit cases
230
- - unsupported `task_id` behavior
231
- - task weights summing to expected behavior
232
- - reward helper bounds
233
- - dataset loader behavior including Windows BOM handling
234
 
235
- ### Smoke Tests
236
 
237
- Goal:
238
 
239
- - prove the environment works end to end with minimal assumptions
 
 
 
 
 
 
 
 
 
240
 
241
- Priority smoke targets:
242
 
243
- - `reset()` returns a valid observation
244
- - `step()` advances the queue
245
- - final reward stays in `[0.0, 1.0]`
246
- - same seed gives the same episode behavior
247
- - heuristic baseline completes without crashing
 
248
 
249
- ### Integration Tests
250
 
251
- Goal:
 
 
 
 
 
252
 
253
- - prove the real app surface behaves correctly, not just the pure Python helpers
254
 
255
- Priority integration targets:
 
 
 
 
 
256
 
257
- - `/health`
258
- - `/tasks`
259
- - `/reset`
260
- - `/step`
261
- - `/state`
262
- - one full seeded episode through the app or client layer
263
 
264
- ## RL Improvement Rules
265
 
266
- We should improve RL usefulness in ways that keep the env judge-friendly.
267
 
268
- Good RL improvements:
269
 
270
- - clearer deterministic feedback
271
- - better exact-vs-partial boundaries
272
- - richer but still simple episode history
273
- - deterministic controls that help reproducible rollouts
274
 
275
- Bad RL improvements:
 
 
276
 
277
- - vague similarity expansion without examples
278
- - turning exact business-routing fields into fuzzy fields
279
- - adding complexity that makes the README harder to explain
280
 
281
- ## Grounding Rules
 
 
282
 
283
- Grounding matters, but it must stay lightweight this late.
284
 
285
- Good grounding work:
 
 
286
 
287
- - audit our taxonomy against public support-ticket datasets
288
- - use real support phrasing to validate dataset realism
289
- - use labeled similarity pairs to justify a few near-miss cases
290
 
291
- Bad grounding work:
292
 
293
- - rushed ingestion of external datasets
294
- - category remapping that forces taxonomy churn
295
- - unsupported claims that our scores are benchmarked externally when they are not
296
 
297
- ## Ownership Split For The Final Stretch
298
 
299
- ### Roopal ownership
 
300
 
301
- - grounding audit
302
- - ticket realism review
303
- - documentation updates
304
- - competitive-positioning clarity
305
 
306
- ### Suyash ownership
307
 
308
- - tests
309
- - runtime hardening
310
- - scoring and reward implementation changes
311
- - Docker and integration validation
312
 
313
- ### Shared review items
314
 
315
- - any changes to partial-credit rules
316
- - any benchmark number updates
317
- - final submission claims
318
 
319
- ## Priority Order If Time Gets Tight
 
 
 
320
 
321
- If the deadline compresses further, do this exact order:
322
 
323
- 1. unit tests proving non-fuzzy scoring behavior
324
- 2. smoke and integration tests for seeded deterministic runs
325
- 3. grounding audit against external real-world support datasets
326
- 4. low-risk RL reward improvements
327
- 5. CI and extra polish
328
 
329
- ## Definition Of Done For This Final Plan
330
 
331
- We are done when:
332
 
333
- 1. the scorer is test-backed and clearly not "always fuzzy"
334
- 2. the environment has unit, smoke, and integration coverage
335
- 3. the main RL improvements are implemented without hurting clarity
336
- 4. grounding is supported by external real-world support datasets
337
- 5. Docker, baseline behavior, and docs are all in sync
 
 
338
 
339
  ## Simple Rule To Remember
340
 
341
- Improve learning signal.
342
- Prove correctness.
343
- Ground the story in real support data.
344
- Do not take late-stage dataset-merging risk.
 
1
+ # Hackstreet Boys Final Roadmap
2
 
3
  ## Team
4
 
 
7
  - Roopal Guha Neogi
8
  - Suyash Kumar
9
  - Submission deadline: April 8, 2026, 11:59 PM IST
10
+
11
+ ## How To Use This File
12
+
13
+ - `PROJECT_STATUS.md` is the canonical log of completed work.
14
+ - This roadmap is the remaining execution plan from the current repo state to final submission.
15
+ - `PLAN.md` defines the must-pass gates.
16
+ - `KNOWLEDGE.md` defines the current repo truth and judge-facing explanation.
17
+ - `analysis/comp.md`, `analysis/comp_know.md`, and `analysis/inference.md` are internal competitive notes only. Use them to prioritize work, but do not mention competitor repos in public-facing docs.
18
 
19
  ## What We Are Optimizing For
20
 
21
+ The highest-value wins from now to submission are:
22
+
23
+ 1. **Robustness**
24
+ - prove the env works through unit, smoke, and integration tests
25
+ - make Docker and clean reruns boring and reliable
26
+
27
+ 2. **RL improvement**
28
+ - keep the reward deterministic
29
+ - make sure scoring is not "always fuzzy"
30
+ - add only small, safe improvements that strengthen reward quality or episode usefulness
31
 
 
 
32
  3. **Real-world grounding**
33
+ - ground our taxonomy and partial-credit choices against real public support-ticket datasets
34
+ - do this as an audit / evidence layer, not as a late dataset merge
35
+
36
+ 4. **Submission readiness**
37
+ - satisfy every requirement from `PLAN.md` and `KNOWLEDGE.md`
38
+ - keep the repo easy for judges to understand and rerun
39
 
40
+ ## Current Repo State
41
 
42
+ The repo already has:
 
 
 
43
 
44
+ - locked IT helpdesk routing domain
45
+ - locked vocabulary and task names
46
+ - 3-task difficulty ladder
47
+ - deterministic grading with limited partial credit
48
+ - working heuristic baseline
49
+ - merged local validation on `/health`, `/tasks`, and `inference.py`
50
+ - current local benchmark reference:
51
+ - Task 1: `1.0000`
52
+ - Task 2: `0.8800`
53
+ - Task 3: `0.9400`
54
+ - Overall: `0.9400`
55
 
56
+ The remaining work should be treated as targeted strengthening, not broad feature invention.
57
 
58
+ ## Submission Gates That Must Still Hold
 
 
 
 
 
59
 
60
+ These come directly from `PLAN.md` and `KNOWLEDGE.md`:
61
 
62
+ - the environment starts correctly
63
+ - `reset()`, `step()`, and `state()` behave correctly
64
+ - 3 tasks exist and remain meaningfully different
65
+ - grader scores stay in `[0.0, 1.0]`
66
+ - `inference.py` runs reproducibly without crashing
67
+ - Docker builds and starts cleanly
68
+ - docs and metadata are current
69
+ - the repo is easy for judges to understand and rerun
70
 
71
+ ## Scope Decisions
72
 
73
+ ### Do Now
74
 
75
+ - add tests:
76
+ - unit
77
+ - smoke
78
+ - integration
79
+ - prove the scorer is crisp where it should be crisp
80
+ - add only safe RL-oriented improvements
81
+ - add external grounding evidence without changing the runtime dataset
82
+ - finish packaging / deployment readiness
83
 
84
+ ### Do Not Do Before Submission
85
 
86
+ - MCP migration
87
+ - transform-based reward refactor
88
+ - large dataset expansion
89
+ - external dataset merge into `data/dataset.json`
90
+ - major schema changes
91
+ - broad prompt / inference rewrites that could disturb the stable baseline
92
+ - dependency churn just for polish
93
 
94
+ ## Codex-First Working Rules
95
 
96
+ Because we are using Codex to generate code, we should optimize for small, bounded tasks:
 
 
 
97
 
98
+ 1. one prompt = one scoped change set
99
+ 2. keep ownership by file group
100
+ 3. require tests for any scorer or runtime change
101
+ 4. review the diff before accepting generated code
102
+ 5. rerun the relevant test slice after each meaningful change
103
+ 6. do not ask Codex for a giant multi-file redesign this late
104
 
105
+ ## Phased Plan
106
 
107
+ ## Phase 1: Test And Robustness Foundation
108
 
 
109
  **Window:** April 3 to April 4
110
 
111
+ **Goal:** eliminate the biggest competitive weakness identified in `analysis/comp.md` and `analysis/inference.md`: lack of checked-in tests.
112
+
113
+ ### Must produce
114
+
115
+ - `tests/` with at least:
116
+ - grader unit tests
117
+ - task / dataset loader unit tests
118
+ - reward / score-range unit tests
119
+ - environment smoke tests
120
+ - API integration tests
121
+
122
+ ### Test plan
123
+
124
+ #### Unit tests
125
+
126
+ - exact match gives `1.0`
127
+ - unsupported task IDs fail clearly
128
+ - only intended near-miss issue-type pairs get partial credit
129
+ - unrelated wrong issue types get `0.0`
130
+ - priority proximity rules behave exactly as defined
131
+ - assignment group and resolution action remain exact-match only
132
+ - task weights sum and apply correctly
133
+ - dataset loads cleanly with `utf-8-sig`
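The `utf-8-sig` item in the unit-test list can be illustrated with a minimal sketch. It assumes the dataset is a JSON file that may have been saved with a leading byte order mark; `load_dataset` and `write_with_bom` are hypothetical helper names, not functions confirmed to exist in `server/tasks.py`.

```python
import json
import tempfile
from pathlib import Path


def load_dataset(path: Path) -> list:
    """Load a JSON dataset, tolerating a leading UTF-8 byte order mark."""
    # "utf-8-sig" strips the BOM when present and behaves like UTF-8 otherwise.
    with path.open("r", encoding="utf-8-sig") as handle:
        return json.load(handle)


def write_with_bom(path: Path, rows: list) -> None:
    """Write JSON prefixed with a BOM, as some Windows editors do."""
    path.write_bytes(b"\xef\xbb\xbf" + json.dumps(rows).encode("utf-8"))


# Hypothetical sample row; the real schema of data/dataset.json may differ.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "dataset.json"
    write_with_bom(sample, [{"ticket_id": "T-1", "issue_type": "network"}])
    rows = load_dataset(sample)
```

A unit test along these lines would write a BOM-prefixed fixture and assert that the loader returns the same rows it would for a plain UTF-8 file.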
134
+
135
+ #### Smoke tests
136
+
137
+ - `reset()` returns a valid observation
138
+ - `step()` advances queue progress
139
+ - `state()` reflects runtime state
140
+ - seeded resets are deterministic
141
+ - scores remain in `[0.0, 1.0]`
142
+ - one full episode per task completes without errors
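The determinism and full-episode smoke checks above can be sketched against a toy environment. `ToyQueueEnv` and `rollout` are hypothetical stand-ins written for this example; the real environment's `reset()` and `step()` signatures and observation shape are assumptions and may differ.

```python
import random


class ToyQueueEnv:
    """Hypothetical stand-in for the real environment, not the repo's class."""

    TICKETS = ["vpn down", "password reset", "laptop broken", "email bounce"]

    def reset(self, seed: int) -> dict:
        # A private Random instance keeps the episode independent of global state.
        rng = random.Random(seed)
        self.queue = rng.sample(self.TICKETS, k=3)
        self.position = 0
        return {"ticket": self.queue[0], "remaining": len(self.queue)}

    def step(self, action: str) -> dict:
        # Each step consumes one ticket; the episode ends when the queue is empty.
        self.position += 1
        done = self.position >= len(self.queue)
        ticket = None if done else self.queue[self.position]
        return {"ticket": ticket, "done": done}


def rollout(seed: int) -> list:
    """Play one full episode and record every ticket that was shown."""
    env = ToyQueueEnv()
    seen = [env.reset(seed)["ticket"]]
    while True:
        obs = env.step("noop")
        if obs["done"]:
            return seen
        seen.append(obs["ticket"])
```

A smoke test would then compare two rollouts with the same seed for equality and check that each episode terminates after the expected number of tickets.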
143
+
144
+ #### Integration tests
145
+
146
+ - `/health`
147
+ - `/tasks`
148
+ - `/reset`
149
+ - `/step`
150
+ - `/state`
151
+ - one end-to-end seeded episode over HTTP or client path
152
+ - one heuristic `inference.py` regression check on expected overall behavior
153
+
154
+ ### Why this phase matters
155
+
156
+ - addresses the biggest repo-quality gap vs stronger competitors
157
+ - improves robustness
158
+ - gives us safe rails for all later RL and grounding changes
159
+
160
+ ## Phase 2: Scoring Calibration And Safe RL Improvements
161
+
162
  **Window:** April 4 to April 5
163
 
164
+ **Goal:** improve RL usefulness without destabilizing the submission.
165
 
166
+ ### Must produce
167
 
168
+ - scorer calibration evidence that the system is not "always fuzzy"
169
+ - only a few safe RL-oriented improvements if tests stay green
 
 
 
170
 
171
+ ### Required calibration checks
172
 
173
+ - exact-match path is dominant and clearly tested
174
+ - fuzziness exists only in explicitly defined cases
175
+ - wrong labels outside the similarity map score `0.0`
176
+ - assignment group and resolution action remain exact
177
+ - final episode reward stays bounded and deterministic
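The calibration checks above can be made concrete with a small scoring sketch. Everything here is an illustrative assumption: the label names, the single similarity pair, the priority ordering, the half-credit values, and the equal field weights are invented for the example and are not the repo's actual `ISSUE_TYPE_SIMILARITY` table or task weights.

```python
# All names and values below are illustrative assumptions, not the repo's real tables.
ISSUE_TYPE_SIMILARITY = {frozenset({"network", "vpn"}): 0.5}
PRIORITY_ORDER = ["low", "medium", "high", "critical"]
WEIGHTS = {
    "issue_type": 0.25,
    "priority": 0.25,
    "assignment_group": 0.25,
    "resolution_action": 0.25,
}


def score_issue_type(predicted, expected):
    """Full credit for an exact match, partial credit only for listed pairs."""
    if predicted == expected:
        return 1.0
    return ISSUE_TYPE_SIMILARITY.get(frozenset({predicted, expected}), 0.0)


def score_priority(predicted, expected):
    """Full credit for a match, half credit for an adjacent level, else zero."""
    if predicted not in PRIORITY_ORDER or expected not in PRIORITY_ORDER:
        return 0.0
    gap = abs(PRIORITY_ORDER.index(predicted) - PRIORITY_ORDER.index(expected))
    return {0: 1.0, 1: 0.5}.get(gap, 0.0)


def score_exact(predicted, expected):
    """Routing fields never receive partial credit."""
    return 1.0 if predicted == expected else 0.0


def grade(predicted: dict, expected: dict) -> float:
    """Weighted sum of per-field scores; stays in [0.0, 1.0] by construction."""
    parts = {
        "issue_type": score_issue_type(predicted.get("issue_type"), expected["issue_type"]),
        "priority": score_priority(predicted.get("priority"), expected["priority"]),
        "assignment_group": score_exact(predicted.get("assignment_group"), expected["assignment_group"]),
        "resolution_action": score_exact(predicted.get("resolution_action"), expected["resolution_action"]),
    }
    return sum(WEIGHTS[field] * value for field, value in parts.items())


# Hypothetical gold label used only to exercise the sketch.
gold = {
    "issue_type": "vpn",
    "priority": "high",
    "assignment_group": "netops",
    "resolution_action": "escalate",
}
```

Because partial credit comes only from an explicit lookup, a label outside the map falls through to `0.0`, which is the property the "not always fuzzy" tests are meant to pin down.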
178
 
179
+ ### Safe improvement candidates from `analysis/inference.md`
180
 
181
+ - expand `ISSUE_TYPE_SIMILARITY` with only a few defensible pairs, if backed by grounding review
182
+ - enrich `history` with:
183
+ - ticket title
184
+ - predicted fields
185
+ - optionally support `queue_size` as a reset kwarg only if the change is tiny and fully tested
186
 
187
+ ### Hard stop
188
 
189
+ - if a change touches behavior and shifts baseline numbers unexpectedly, stop and stabilize rather than stacking more changes
190
+
191
+ ## Phase 3: Real-World Grounding Audit
192
 
 
193
  **Window:** April 5 to April 6
194
 
195
+ **Goal:** add defensible evidence that our taxonomy and partial-credit logic are grounded in real support data, without merging external data into runtime.
196
+
197
+ ### Grounding strategy
198
 
199
+ - use real public support datasets as reference material
200
+ - compare their labels / examples against our taxonomy
201
+ - create an internal audit, not a runtime dependency
202
 
203
+ ### Recommended grounding references
 
 
204
 
205
+ - `Classification of IT Support Tickets` (Zenodo, 2,229 manually classified tickets)
206
+ - `Semantic Similarity of IT Support Tickets` (Zenodo, 300 manually labeled ticket pairs)
207
+ - `MSDialog` for real technical-support conversation patterns and terminology
208
 
209
+ ### Must produce
 
 
210
 
211
+ - an internal grounding note or checklist that captures:
212
+ - which public datasets were reviewed
213
+ - how our labels map to real-world ticket themes
214
+ - which partial-credit pairs are defensible
215
+ - which proposed similarity pairs were rejected as too fuzzy
216
 
217
+ ### Useful output
 
 
 
218
 
219
+ - 10 to 20 grounding examples:
220
+ - real ticket theme
221
+ - closest label in our taxonomy
222
+ - whether it should be exact-match only or partial-credit-adjacent
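One possible shape for the grounding examples listed above is a small typed record plus a tally. This is only a sketch: `GroundingExample`, its field names, and the two rows are placeholders invented for illustration, not findings from any of the referenced datasets.

```python
from collections import Counter
from dataclasses import dataclass

ALLOWED_RULES = {"exact", "partial-adjacent"}


@dataclass(frozen=True)
class GroundingExample:
    """One audited example: a real-world theme mapped onto our taxonomy."""

    ticket_theme: str
    closest_label: str
    credit_rule: str  # either "exact" or "partial-adjacent"

    def __post_init__(self):
        # Reject anything outside the two documented outcomes.
        if self.credit_rule not in ALLOWED_RULES:
            raise ValueError(f"unknown credit rule: {self.credit_rule!r}")


# Placeholder rows invented for illustration; they are not audit results.
examples = [
    GroundingExample("user cannot reach the VPN gateway", "vpn", "partial-adjacent"),
    GroundingExample("forgotten account password", "access", "exact"),
]


def summarize(rows) -> Counter:
    """Count how many examples fall under each credit rule."""
    return Counter(row.credit_rule for row in rows)
```

Keeping the audit in a structured form like this would make it easy to check that every proposed similarity pair is backed by at least one recorded example.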
223
 
224
+ ### Why this phase matters
 
225
 
226
+ - strengthens real-world credibility
227
+ - supports RL reward quality with evidence
228
+ - helps avoid arbitrary or over-fuzzy scorer changes
229
 
230
+ ## Phase 4: Packaging, Deployment, And Judge-Facing Polish
 
231
 
 
232
  **Window:** April 6 to April 7
233
 
234
+ **Goal:** close the submission-readiness gaps surfaced in `analysis/comp_know.md` and `analysis/inference.md`.
235
 
236
+ ### Must produce
237
 
238
+ - Hugging Face Spaces README frontmatter
239
+ - `.openenvignore`
240
+ - Docker smoke evidence on the merged branch
241
+ - one clean-copy rerun if possible
 
242
 
243
+ ### Nice-to-have only if green
244
 
245
+ - short TRL / GRPO example in `README.md`
246
+ - concise note in docs that grading is deterministic, partially structured, and not purely fuzzy
 
247
 
248
+ ### Do not do here
 
249
 
250
+ - no dataset expansion
251
+ - no major inference rewrite
252
+ - no architecture refactor
253
 
254
+ ## Phase 5: Freeze And Submit
255
 
256
+ **Window:** April 8
 
 
 
 
257
 
258
+ **Goal:** submit from a calm, validated repo state.
259
 
260
+ ### Final day rules
 
 
 
261
 
262
+ - only typo-level, doc-level, or packaging-only fixes
263
+ - no risky scorer changes
264
+ - no runtime refactors
265
+ - no dataset edits unless they fix a blocker
266
+ - stop risky edits several hours before submission
267
 
268
+ ## Ownership From Now Until Submission
269
+
270
+ ### Roopal ownership
271
 
272
+ Primary files:
273
 
274
+ - `data/dataset.json`
275
+ - `server/tasks.py`
276
+ - `server/grader.py`
277
+ - `README.md`
278
+ - `KNOWLEDGE.md`
279
+ - `MENTAL_MODEL.md`
280
 
281
+ Primary responsibilities:
282
 
283
+ - scorer calibration and label quality
284
+ - unit tests around grader / task rules / dataset invariants
285
+ - real-world grounding audit
286
+ - judge-facing explanation of deterministic scoring and real-world realism
287
+ - safe reward-quality improvements only when grounded and tested
288
 
289
+ Concrete deliverables:
290
 
291
+ - grader unit tests
292
+ - grounding mapping note
293
+ - any similarity-matrix update, if justified
294
+ - doc updates if benchmark numbers or scoring explanation change
295
+ - README frontmatter and judge-facing clarity
296
 
297
+ ### Suyash ownership
298
 
299
+ Primary files:
300
 
301
+ - `models.py`
302
+ - `server/environment.py`
303
+ - `server/app.py`
304
+ - `server/reward.py`
305
+ - `client.py`
306
+ - `inference.py`
307
+ - `openenv.yaml`
308
+ - `server/Dockerfile`
309
+ - `pyproject.toml`
310
+ - `requirements.txt`
311
 
312
+ Primary responsibilities:
313
 
314
+ - smoke and integration tests
315
+ - runtime stability
316
+ - Docker and deployment readiness
317
+ - inference reproducibility
318
+ - clean rerun evidence
319
+ - optional small RL-signal improvements on the runtime side
320
 
321
+ Concrete deliverables:
322
 
323
+ - env smoke tests
324
+ - API integration tests
325
+ - heuristic inference regression path
326
+ - `.openenvignore`
327
+ - Docker smoke confirmation
328
+ - clean-copy rerun if possible
329
 
330
+ ### Shared responsibilities
331
 
332
+ - do not rename schemas or vocabulary
333
+ - rerun the benchmark after any behavior-affecting change
334
+ - keep `PROJECT_STATUS.md` honest
335
+ - use the GitHub Actions Docker smoke workflow when local Docker is blocked
336
+ - review Codex-generated diffs before accepting them
337
+ - freeze feature work by the end of April 7
338
 
339
+ ## Date-By-Date Execution Plan
 
 
 
 
 
340
 
341
+ ## April 3, 2026
342
 
343
+ Primary goal:
344
 
345
+ - lock the execution plan and begin test scaffolding immediately
346
 
347
+ Roopal:
 
 
 
348
 
349
+ - finalize the exact scorer behaviors that must be proven by tests
350
+ - list the exact-match-only cases and intended partial-credit cases
351
+ - begin grader and task-loader unit tests
352
 
353
+ Suyash:
 
 
354
 
355
+ - scaffold `tests/`
356
+ - begin smoke tests for `reset()`, `step()`, `state()`, and deterministic seeded behavior
357
+ - confirm how integration tests will hit the app cleanly
358
 
359
+ Shared checkpoint:
360
 
361
+ - test strategy is agreed
362
+ - file ownership is clear
363
+ - no one is making unscoped runtime changes yet
364
 
365
+ ## April 4, 2026
 
 
366
 
367
+ Primary goal:
368
 
369
+ - land the first complete test layer
 
 
370
 
371
+ Roopal:
372
 
373
+ - complete grader, task, and dataset unit tests
374
+ - add explicit tests showing where fuzziness is allowed and where it is not
375
 
376
+ Suyash:
 
 
 
377
 
378
+ - complete smoke tests
379
+ - add first-pass integration tests for `/health`, `/tasks`, `/reset`, and `/step`
380
+
381
+ Shared checkpoint:
382
+
383
+ - checked-in tests exist
384
+ - the repo can prove deterministic scoring and score bounds
385
+ - any failing behavior is triaged before adding improvements
386
+
387
+ ## April 5, 2026
388
+
389
+ Primary goal:
390
+
391
+ - improve RL usefulness safely
392
+
393
+ Roopal:
394
+
395
+ - start the grounding audit using the selected public datasets
396
+ - decide whether any additional similarity pairs are truly defensible
397
+
398
+ Suyash:
399
+
400
+ - add integration coverage for full seeded episode flow and `state()`
401
+ - add a light heuristic regression path for `inference.py`
402
+ - optionally enrich observation history if tests are already green
403
+
404
+ Shared checkpoint:
405
+
406
+ - tests are stable
407
+ - any RL-oriented change is small and justified
408
+ - no baseline drift goes unexplained
409
+
410
+ ## April 6, 2026
411
+
412
+ Primary goal:
413
+
414
+ - finish grounding evidence and close packaging gaps
415
+
416
+ Roopal:
417
+
418
+ - finish grounding audit note
419
+ - land only the scorer adjustments supported by audit evidence, if any
420
+ - update docs to reflect deterministic, grounded scoring
421
+
422
+ Suyash:
423
+
424
+ - add `.openenvignore`
425
+ - verify Docker smoke workflow on the merged branch
426
+ - check deployment assumptions around `app_port`, `/docs`, `/health`, `/ws`, and `/web`
427
+
428
+ Shared checkpoint:
429
+
430
+ - grounding evidence exists
431
+ - packaging gaps are closed or explicitly blocked
432
+ - benchmark references are still current
433
+
434
+ ## April 7, 2026
435
+
436
+ Primary goal:
437
+
438
+ - freeze on a green, submission-ready repo
439
+
440
+ Roopal:
441
+
442
+ - final docs consistency pass across `README.md`, `KNOWLEDGE.md`, and `MENTAL_MODEL.md`
443
+ - add a short TRL / GRPO usage example only if everything else is already green
444
+
445
+ Suyash:
446
+
447
+ - do a clean-copy install-and-run pass if possible
448
+ - rerun heuristic baseline if any runtime-side change landed
449
+ - freeze runtime files by end of day
450
+
451
+ Shared checkpoint:
452
+
453
+ - tests are green
454
+ - Docker evidence exists
455
+ - docs, metadata, and runtime tell the same story
456
+ - feature work stops
457
+
458
+ ## April 8, 2026
459
+
460
+ Primary goal:
461
+
462
+ - submit early from a calm repo state
463
+
464
+ Morning:
465
+
466
+ - run final smoke / test slice on the submission branch
467
+ - verify required files are present
468
+ - verify README and metadata are current
469
+
470
+ Afternoon:
471
+
472
+ - only typo-level or packaging-only fixes
473
+ - no risky code changes
474
+
475
+ Final rule:
476
 
477
+ - stop risky edits several hours before 11:59 PM IST
478
+ - submit as soon as the repo is clearly green
 
 
479
 
480
+ ## Cut Order If Time Gets Tight
481
 
482
+ Cut these first:
 
 
483
 
484
+ 1. `queue_size` reset kwarg
485
+ 2. richer `history`
486
+ 3. TRL / GRPO README example
487
+ 4. any optional similarity expansion beyond the most defensible cases
488
 
489
+ Do not cut these:
490
 
491
+ 1. tests
492
+ 2. scorer crispness checks
493
+ 3. Docker / deployment validation
494
+ 4. grounding audit evidence
495
+ 5. final benchmark sanity rerun if behavior changed
496
 
497
+ ## Definition Of Done
498
 
499
+ The project is ready when:
500
 
501
+ 1. unit, smoke, and integration tests exist and cover the critical paths
502
+ 2. scoring is demonstrably deterministic and not fuzzy by default
503
+ 3. a grounding audit against real public support datasets exists
504
+ 4. the heuristic baseline still runs successfully
505
+ 5. Docker build and run are validated
506
+ 6. docs and metadata are current and judge-friendly
507
+ 7. the repo is frozen and submitted on time
508
 
509
  ## Simple Rule To Remember
510
 
511
+ Roopal owns the labels, scoring truth, grounding, and public clarity.
512
+ Suyash owns the runtime, tests beyond unit scope, packaging, and reproducibility rails.
513
+ Both of you should optimize for a clean, defensible, rerunnable submission rather than last-minute complexity.