Roopalgn committed on
Commit 72d2634 · 1 Parent(s): ae36543

Consolidate requirements docs and align roadmap with official submission rules

Files changed (9)
  1. KNOWLEDGE.md +63 -14
  2. MENTAL_MODEL.md +0 -173
  3. PLAN.md +0 -147
  4. PROJECT_STATUS.md +2 -2
  5. README.md +2 -3
  6. ROADMAP.md +32 -12
  7. analysis/comp_know.md +159 -197
  8. analysis/inference.md +0 -218
  9. required.md +352 -0
KNOWLEDGE.md CHANGED
@@ -1,6 +1,6 @@
1
  # IT Helpdesk Ticket Routing OpenEnv - Knowledge Guide
2
 
3
- ## What The Hackathon Is Looking For
4
 
5
  The judges want a real-world environment that follows the OpenEnv pattern and can be understood quickly.
6
 
@@ -14,9 +14,9 @@ That means this repo needs:
14
  6. a baseline `inference.py`
15
  7. Docker and metadata that are easy to rerun
16
 
17
- ## Why IT Helpdesk Ticket Routing Fits Well
18
 
19
- This domain is a strong fit because it is:
20
 
21
  - realistic
22
  - structured
@@ -32,12 +32,12 @@ This environment simulates a short helpdesk queue where an agent routes one tick
32
 
33
  ## Judge-Facing Explanation
34
 
35
- If a judge asks why this environment is a strong submission, the concise answer is:
36
 
37
  1. IT helpdesk routing is a real operational workflow with clear business value.
38
  2. The input is realistic free-form ticket text, but the output is typed and easy to grade deterministically.
39
  3. The three-task ladder creates a clean progression from basic classification to full queue routing.
40
- 4. The repo stays judge-friendly because the vocabulary, task labels, and scoring rules are all explicit and frozen.
41
 
42
  ## Frozen Project Identity
43
 
@@ -47,6 +47,34 @@ If a judge asks why this environment is a strong submission, the concise answer
47
  - OpenEnv name: `it_helpdesk_ticket_routing_openenv`
48
  - App environment name: `it_helpdesk_ticket_routing`
49
 
50
  ## Frozen Runtime Vocabulary
51
 
52
  ### Fields
@@ -145,11 +173,30 @@ On each step, the environment:
145
 
146
  Returns the internal state snapshot for debugging or inspection.
147
 
148
  ## Task Design
149
 
150
  ### Task 1: Issue Type Classification
151
 
152
- The agent only predicts:
153
 
154
  - `issue_type`
155
 
@@ -257,7 +304,7 @@ It supports:
257
 
258
  ## Validation Notes
259
 
260
- The repo has now gone through two useful validation phases.
261
 
262
  ### April 2 consistency pass
263
 
@@ -273,7 +320,7 @@ What needed to agree:
273
 
274
  ### April 3 and April 4 runtime-feedback pass
275
 
276
- The first local runtime pass was then completed and surfaced a practical issue:
277
 
278
  - `data/dataset.json` was saved with a UTF-8 BOM, which caused `json.load()` to fail during environment creation on Windows
279
 
@@ -288,13 +335,13 @@ The local heuristic baseline completed successfully after that fix with:
288
 
289
  A merged-state rerun on the current `main` branch matched those same numbers exactly.
290
 
291
- ## April 6 Repo Audit
292
 
293
- An April 6 documentation and repo audit confirmed:
294
 
295
- - all required runtime, data, metadata, and documentation files are present in the workspace
296
  - the docs consistently describe IT helpdesk ticket routing rather than the old email-triage domain
297
- - the current local benchmark reference is `1.0000`, `0.8800`, `0.9400`, overall `0.9400`
298
  - the remaining work is execution validation, not documentation cleanup
299
 
300
  ## What Still Needs Hands-On Verification
@@ -312,7 +359,9 @@ If you come back to this repo later, remember:
312
 
313
  - the domain is IT helpdesk ticket routing
314
  - the environment is a short queue, not a single-shot classifier
315
  - the agent predicts structured routing fields
316
- - grading is deterministic with limited partial credit
317
- - the inference script is the baseline player
318
  - merged-state local validation is complete, and Docker is the main remaining hands-on check
 
1
  # IT Helpdesk Ticket Routing OpenEnv - Knowledge Guide
2
 
3
+ ## What This Repo Needs To Prove
4
 
5
  The judges want a real-world environment that follows the OpenEnv pattern and can be understood quickly.
6
 
 
14
  6. a baseline `inference.py`
15
  7. Docker and metadata that are easy to rerun
16
 
17
+ ## Why This Domain Fits
18
 
19
+ IT helpdesk routing is a strong hackathon fit because it is:
20
 
21
  - realistic
22
  - structured
 
32
 
33
  ## Judge-Facing Explanation
34
 
35
+ If a judge asks why this environment is strong, the concise answer is:
36
 
37
  1. IT helpdesk routing is a real operational workflow with clear business value.
38
  2. The input is realistic free-form ticket text, but the output is typed and easy to grade deterministically.
39
  3. The three-task ladder creates a clean progression from basic classification to full queue routing.
40
+ 4. The repo stays judge-friendly because the vocabulary, task labels, and scoring rules are explicit and frozen.
41
 
42
  ## Frozen Project Identity
43
 
 
47
  - OpenEnv name: `it_helpdesk_ticket_routing_openenv`
48
  - App environment name: `it_helpdesk_ticket_routing`
49
 
50
+ ## Practical Mental Model
51
+
52
+ ```text
53
+ inference.py
54
+ |
55
+ v
56
+ client.py <----> server/app.py
57
+ |
58
+ v
59
+ server/environment.py
60
+ | | |
61
+ v v v
62
+ grader.py reward.py tasks.py
63
+ |
64
+ v
65
+ data/dataset.json
66
+ ```
67
+
68
+ The repo is a small OpenEnv stack:
69
+
70
+ - `inference.py` drives episodes
71
+ - `client.py` talks to the app
72
+ - `server/environment.py` manages queue state and episode flow
73
+ - `server/grader.py` scores actions
74
+ - `server/reward.py` computes step and final reward behavior
75
+ - `server/tasks.py` defines the task ladder and loads the dataset
76
+ - `data/dataset.json` stores the labeled helpdesk tickets
77
+
78
  ## Frozen Runtime Vocabulary
79
 
80
  ### Fields
 
173
 
174
  Returns the internal state snapshot for debugging or inspection.
175
 
176
+ ## Observation And State At A Glance
177
+
178
+ The observation exposes:
179
+
180
+ - task metadata
181
+ - the current ticket
182
+ - queue progress counters
183
+ - history
184
+ - reward and done status
185
+
186
+ The state tracks:
187
+
188
+ - current task
189
+ - seed
190
+ - queue ticket IDs
191
+ - current ticket index
192
+ - per-ticket scores
193
+ - total reward
194
+
195
  ## Task Design
196
 
197
  ### Task 1: Issue Type Classification
198
 
199
+ The agent predicts:
200
 
201
  - `issue_type`
202
 
 
304
 
305
  ## Validation Notes
306
 
307
+ The repo has already gone through two useful validation phases.
308
 
309
  ### April 2 consistency pass
310
 
 
320
 
321
  ### April 3 and April 4 runtime-feedback pass
322
 
323
+ The first local runtime pass surfaced one practical issue:
324
 
325
  - `data/dataset.json` was saved with a UTF-8 BOM, which caused `json.load()` to fail during environment creation on Windows
326
 
 
335
 
336
  A merged-state rerun on the current `main` branch matched those same numbers exactly.
337
 
338
+ ### April 6 repo audit
339
 
340
+ An April 6 audit confirmed:
341
 
342
+ - all required runtime, data, metadata, and documentation files are present
343
  - the docs consistently describe IT helpdesk ticket routing rather than the old email-triage domain
344
+ - the current local benchmark reference is still `1.0000`, `0.8800`, `0.9400`, overall `0.9400`
345
  - the remaining work is execution validation, not documentation cleanup
346
 
347
  ## What Still Needs Hands-On Verification
 
359
 
360
  - the domain is IT helpdesk ticket routing
361
  - the environment is a short queue, not a single-shot classifier
362
+ - the architecture is a compact OpenEnv stack
363
+ - one ticket is shown at a time
364
  - the agent predicts structured routing fields
365
+ - the grader gives deterministic partial credit
366
+ - `inference.py` is the baseline agent runner
367
  - merged-state local validation is complete, and Docker is the main remaining hands-on check
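The runtime-feedback notes in the `KNOWLEDGE.md` diff above report that a UTF-8 BOM in `data/dataset.json` made `json.load()` fail on Windows, and the deleted `MENTAL_MODEL.md` records that `server/tasks.py` loads the file with `utf-8-sig`. A minimal sketch of that idea follows. The helper names `load_dataset` and `demo` are hypothetical and are not taken from the repo; the real loader may look different.

```python
import json
import tempfile
from pathlib import Path


def load_dataset(path):
    """Load a JSON file, tolerating an optional UTF-8 BOM.

    Hypothetical helper: the repo's real loader lives in server/tasks.py
    and its name and signature are not shown in this diff.
    """
    # "utf-8-sig" strips a leading BOM if present and otherwise
    # decodes exactly like "utf-8".
    with open(path, "r", encoding="utf-8-sig") as handle:
        return json.load(handle)


def demo():
    """Write a BOM-prefixed JSON file and show that it still loads."""
    records = [{"ticket_id": "T-1", "issue_type": "onboarding"}]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "dataset.json"
        # Encoding with "utf-8-sig" prepends the BOM bytes EF BB BF,
        # which mimics a file saved by some Windows editors.
        path.write_bytes(json.dumps(records).encode("utf-8-sig"))
        return load_dataset(path)
```

Because `utf-8-sig` also reads BOM-free files, this choice should not require the dataset to be re-saved in any particular way.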
MENTAL_MODEL.md DELETED
@@ -1,173 +0,0 @@
1
- # IT Helpdesk Ticket Routing Mental Model
2
-
3
- This file is the practical mental model of the repo in its current form.
4
-
5
- ## What The Project Is
6
-
7
- This repository is an OpenEnv environment for IT helpdesk ticket routing.
8
-
9
- The environment presents a small queue of tickets. For each ticket, the agent must decide:
10
-
11
- - issue type
12
- - priority
13
- - assignment group
14
- - resolution action
15
-
16
- ## Main Runtime Flow
17
-
18
- ```text
19
- inference.py
20
- |
21
- v
22
- client.py <----> server/app.py
23
- |
24
- v
25
- server/environment.py
26
- | | |
27
- v v v
28
- grader.py reward.py tasks.py
29
- |
30
- v
31
- data/dataset.json
32
- ```
33
-
34
- ## Main Files
35
-
36
- - `models.py`
37
- Typed models for tickets, actions, observations, and state.
38
-
39
- - `server/environment.py`
40
- Main environment engine.
41
-
42
- - `server/grader.py`
43
- Deterministic partial-credit scorer.
44
-
45
- - `server/reward.py`
46
- Step and trajectory reward helpers.
47
-
48
- - `server/tasks.py`
49
- Task definitions and dataset loading.
50
-
51
- - `client.py`
52
- Typed client used for multi-step interaction.
53
-
54
- - `inference.py`
55
- Baseline runner with LLM mode and heuristic mode.
56
-
57
- ## Task Ladder
58
-
59
- ### Task 1
60
-
61
- - predict `issue_type`
62
-
63
- ### Task 2
64
-
65
- - predict `issue_type`
66
- - predict `priority`
67
-
68
- ### Task 3
69
-
70
- - predict `issue_type`
71
- - predict `priority`
72
- - predict `assignment_group`
73
- - predict `resolution_action`
74
-
75
- ## Label Vocabulary
76
-
77
- ### Issue types
78
-
79
- - `billing_license`
80
- - `identity_access`
81
- - `application_support`
82
- - `service_request`
83
- - `spam_phishing`
84
- - `general_inquiry`
85
- - `security_compliance`
86
- - `onboarding`
87
- - `feature_request`
88
-
89
- ### Assignment groups
90
-
91
- - `license_ops`
92
- - `service_desk`
93
- - `application_team`
94
- - `procurement`
95
- - `security_team`
96
- - `onboarding_ops`
97
-
98
- ### Resolution actions
99
-
100
- - `fulfill`
101
- - `escalate`
102
- - `assign`
103
- - `ignore`
104
- - `acknowledge`
105
-
106
- ## Observation And State
107
-
108
- The observation exposes:
109
-
110
- - task metadata
111
- - the current ticket
112
- - queue progress counters
113
- - history
114
- - reward and done status
115
-
116
- The state tracks:
117
-
118
- - current task
119
- - seed
120
- - queue ticket IDs
121
- - current ticket index
122
- - per-ticket scores
123
- - total reward
124
-
125
- ## Reward Logic
126
-
127
- - each step returns the current ticket score
128
- - the final reward is the average of per-ticket scores
129
- - a small overshoot penalty exists as a safeguard
130
-
131
- ## Runtime Notes
132
-
133
- The repo has now passed both the initial local heuristic run and a merged-state rerun on the current `main` branch.
134
-
135
- Current local baseline:
136
-
137
- - Task 1: `1.0000`
138
- - Task 2: `0.8800`
139
- - Task 3: `0.9400`
140
- - Overall: `0.9400`
141
-
142
- The merged-state rerun matched the same baseline numbers exactly.
143
-
144
- One practical implementation note from runtime validation:
145
-
146
- - `data/dataset.json` may be saved with a UTF-8 BOM on Windows, so `server/tasks.py` intentionally loads it with `utf-8-sig`
147
-
148
- ## Dataset Shape
149
-
150
- Each record includes:
151
-
152
- - `ticket_id`
153
- - `title`
154
- - `requester`
155
- - `description`
156
- - `issue_type`
157
- - `priority`
158
- - `assignment_group`
159
- - `resolution_action`
160
- - optional `ambiguity_note`
161
- - optional `related_ticket_id`
162
-
163
- ## Short Version
164
-
165
- If coming back later, remember this:
166
-
167
- - the repo is a helpdesk ticket router
168
- - the architecture is a small OpenEnv stack
169
- - one ticket is shown at a time
170
- - the agent predicts structured routing fields
171
- - the grader gives deterministic partial credit
172
- - `inference.py` is the baseline agent runner
173
- - the local heuristic path now works end to end on the current merged repo state
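The Reward Logic notes in the deleted `MENTAL_MODEL.md` say that each step returns the current ticket score and that the final reward is the average of per-ticket scores. The sketch below illustrates only that averaging step. `final_reward` is a hypothetical name, returning `0.0` for an empty queue is an assumption, and the overshoot penalty mentioned in the notes is not modeled because its form is not given in this diff.

```python
from statistics import fmean
from typing import Sequence


def final_reward(ticket_scores: Sequence[float]) -> float:
    """Average per-ticket scores into one episode-level reward.

    Hypothetical helper: server/reward.py may differ, and the overshoot
    penalty described in the notes is intentionally left out.
    """
    if not ticket_scores:
        # Assumption: an empty episode earns no reward.
        return 0.0
    for score in ticket_scores:
        # Grader scores are expected to stay inside [0.0, 1.0].
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
    return fmean(ticket_scores)
```

As a sanity check on the recorded baseline, the mean of the three task scores `1.0000`, `0.8800`, and `0.9400` is `0.9400`, which is consistent with the overall figure quoted in the notes, although the diff does not state how the overall number is computed.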
PLAN.md DELETED
@@ -1,147 +0,0 @@
1
- # IT Helpdesk Ticket Routing OpenEnv - Project Plan
2
-
3
- ## Project Goal
4
-
5
- Build a polished OpenEnv environment for IT helpdesk ticket routing that satisfies:
6
-
7
- - real-world utility
8
- - strong task and grader quality
9
- - clean environment design
10
- - OpenEnv spec compliance
11
- - reproducible baseline inference
12
- - Docker and Hugging Face deployment readiness
13
-
14
- ## Current Product Definition
15
-
16
- The environment simulates a helpdesk queue. An agent receives one ticket at a time and predicts:
17
-
18
- - `issue_type`
19
- - `priority`
20
- - `assignment_group`
21
- - `resolution_action`
22
-
23
- The project keeps three tasks:
24
-
25
- 1. Issue Type Classification
26
- 2. Issue Type And Priority
27
- 3. Full Ticket Routing
28
-
29
- ## What Must Be True At Submission
30
-
31
- ### Pass or fail requirements
32
-
33
- - the environment responds correctly
34
- - OpenEnv metadata is valid
35
- - `reset()`, `step()`, and `state()` work
36
- - there are at least 3 tasks
37
- - graders return scores in `[0.0, 1.0]`
38
- - `inference.py` runs and prints reproducible results
39
- - Docker builds and starts cleanly
40
-
41
- ### Scored requirements
42
-
43
- - the task should clearly feel like real helpdesk work
44
- - the hard task should require meaningful reasoning
45
- - partial credit should be useful and deterministic
46
- - docs should be clear enough for judges to understand quickly
47
-
48
- ## Core Files
49
-
50
- ### Runtime
51
-
52
- - `models.py`
53
- - `server/environment.py`
54
- - `server/grader.py`
55
- - `server/reward.py`
56
- - `server/tasks.py`
57
- - `server/app.py`
58
- - `client.py`
59
- - `inference.py`
60
-
61
- ### Data and metadata
62
-
63
- - `data/dataset.json`
64
- - `openenv.yaml`
65
- - `server/Dockerfile`
66
- - `pyproject.toml`
67
- - `requirements.txt`
68
-
69
- ### Docs
70
-
71
- - `README.md`
72
- - `KNOWLEDGE.md`
73
- - `MENTAL_MODEL.md`
74
-
75
- ## Technical Priorities
76
-
77
- ### P0
78
-
79
- 1. keep the environment behavior correct
80
- 2. verify the task definitions and graders
81
- 3. make the baseline script reliable
82
- 4. confirm dataset coverage and label consistency
83
-
84
- ### P1
85
-
86
- 1. validate Docker
87
- 2. validate deployment assumptions
88
- 3. record baseline scores
89
- 4. polish docs
90
-
91
- ### P2
92
-
93
- 1. strengthen ticket wording for realism
94
- 2. expand hard-case examples if needed
95
- 3. remove low-signal artifacts from the repo
96
-
97
- ## Quality Checks To Perform
98
-
99
- ### Environment
100
-
101
- - reset starts a clean episode
102
- - each step advances the queue correctly
103
- - the final step returns trajectory reward
104
- - state reflects the real internal status
105
-
106
- ### Grader
107
-
108
- - exact matches score `1.0`
109
- - near misses get partial credit where intended
110
- - unsupported task IDs fail clearly
111
- - scores vary across examples
112
-
113
- ### Inference
114
-
115
- - heuristic mode works without model credentials
116
- - LLM mode reads `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`
117
- - output is reproducible when the seed is fixed
118
-
119
- ### Docs
120
-
121
- - no outdated domain references remain
122
- - team and project metadata are correct
123
- - setup and run instructions are accurate
124
-
125
- ## Risks
126
-
127
- ### Runtime risk
128
-
129
- The first local execution pass and a merged-state rerun have already completed successfully. The remaining runtime risk is Docker and clean-machine behavior, not first-pass local execution.
130
-
131
- ### Benchmark risk
132
-
133
- The current merged-state local benchmark has already been recorded. The remaining benchmark risk is making sure Docker or clean-machine validation does not surface a late behavioral mismatch.
134
-
135
- ### Deployment risk
136
-
137
- Docker and Hugging Face behavior should be validated before the final submission window.
138
-
139
- ## Definition Of Done
140
-
141
- The project is ready when:
142
-
143
- 1. the environment runs locally end to end
144
- 2. the heuristic baseline runs successfully
145
- 3. Docker build and run both succeed
146
- 4. the docs are clean, current, and submission-ready
147
- 5. the repo clearly presents Hackstreet Boys as the team
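The Inference checks in the deleted `PLAN.md` state that heuristic mode works without model credentials and that LLM mode reads `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`. One way to express that switch is sketched below. `LLMConfig` and `resolve_llm_config` are hypothetical names, and falling back to heuristic mode when any variable is missing is an assumption, not confirmed behavior of `inference.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LLMConfig:
    """Settings for LLM mode (hypothetical container, not from the repo)."""

    api_base_url: str
    model_name: str
    hf_token: str


def resolve_llm_config(env: Optional[Mapping[str, str]] = None) -> Optional[LLMConfig]:
    """Return LLM settings, or None to signal heuristic mode.

    Assumption: all three variables must be set and non-empty for LLM mode.
    """
    source = os.environ if env is None else env
    names = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")
    values = [source.get(name, "").strip() for name in names]
    if not all(values):
        return None
    return LLMConfig(*values)
```

Passing the environment as a mapping keeps the function easy to test with a fixed dictionary, without touching the real process environment.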
PROJECT_STATUS.md CHANGED
@@ -136,7 +136,7 @@ Roopal-side work completed:
136
  - updated `README.md` to reflect the first local runtime pass
137
  - recorded the current heuristic baseline in repo docs as a working, non-final benchmark
138
  - updated `KNOWLEDGE.md` to distinguish consistency validation from runtime validation
139
- - updated `MENTAL_MODEL.md` with runtime-validated notes and the Windows BOM handling detail
140
 
141
  Documentation fixes made from runtime feedback:
142
 
@@ -182,7 +182,7 @@ Roopal-side work completed:
182
 
183
  - audited required submission files and confirmed they are present in the repo
184
  - completed a stale-claims and outdated-wording pass across the core docs
185
- - updated `PLAN.md` to reflect that first-pass local execution is no longer the main runtime risk
186
  - left the remaining work focused on Docker and clean-machine validation rather than documentation cleanup
187
 
188
  ## Open Items
 
136
  - updated `README.md` to reflect the first local runtime pass
137
  - recorded the current heuristic baseline in repo docs as a working, non-final benchmark
138
  - updated `KNOWLEDGE.md` to distinguish consistency validation from runtime validation
139
+ - updated the runtime mental-model notes later merged into `KNOWLEDGE.md`, including the Windows BOM handling detail
140
 
141
  Documentation fixes made from runtime feedback:
142
 
 
182
 
183
  - audited required submission files and confirmed they are present in the repo
184
  - completed a stale-claims and outdated-wording pass across the core docs
185
+ - updated the planning / requirements doc later consolidated into `required.md` to reflect that first-pass local execution is no longer the main runtime risk
186
  - left the remaining work focused on Docker and clean-machine validation rather than documentation cleanup
187
 
188
  ## Open Items
README.md CHANGED
@@ -212,8 +212,7 @@ pyproject.toml
212
  requirements.txt
213
  README.md
214
  KNOWLEDGE.md
215
- PLAN.md
216
- MENTAL_MODEL.md
217
  ROADMAP.md
218
  ```
219
 
@@ -355,7 +354,7 @@ An April 6 repo audit also confirmed that all required submission files are pres
355
 
356
  - runtime: `models.py`, `client.py`, `inference.py`, `server/app.py`, `server/environment.py`, `server/grader.py`, `server/reward.py`, `server/tasks.py`
357
  - data and metadata: `data/dataset.json`, `openenv.yaml`, `pyproject.toml`, `requirements.txt`, `server/Dockerfile`
358
- - docs and planning: `README.md`, `KNOWLEDGE.md`, `MENTAL_MODEL.md`, `PLAN.md`, `PROJECT_STATUS.md`, `ROADMAP.md`
359
 
360
  Still pending before final submission:
361
 
 
212
  requirements.txt
213
  README.md
214
  KNOWLEDGE.md
215
+ required.md
216
  ROADMAP.md
217
  ```
218
 
 
354
 
355
  - runtime: `models.py`, `client.py`, `inference.py`, `server/app.py`, `server/environment.py`, `server/grader.py`, `server/reward.py`, `server/tasks.py`
356
  - data and metadata: `data/dataset.json`, `openenv.yaml`, `pyproject.toml`, `requirements.txt`, `server/Dockerfile`
357
+ - docs and planning: `README.md`, `KNOWLEDGE.md`, `required.md`, `PROJECT_STATUS.md`, `ROADMAP.md`
358
 
359
  Still pending before final submission:
360
 
ROADMAP.md CHANGED
@@ -12,9 +12,9 @@
12
 
13
  - `PROJECT_STATUS.md` is the canonical log of completed work.
14
  - This roadmap is the remaining execution plan from the current repo state to final submission.
15
- - `PLAN.md` defines the must-pass gates.
16
  - `KNOWLEDGE.md` defines the current repo truth and judge-facing explanation.
17
- - `analysis/comp.md`, `analysis/comp_know.md`, and `analysis/inference.md` are internal competitive notes only. Use them to prioritize work, but do not mention competitor repos in public-facing docs.
18
 
19
  ## What We Are Optimizing For
20
 
@@ -34,7 +34,7 @@ The highest-value wins from now to submission are:
34
  - do this as an audit / evidence layer, not as a late dataset merge
35
 
36
  4. **Submission readiness**
37
- - satisfy every requirement from `PLAN.md` and `KNOWLEDGE.md`
38
  - keep the repo easy for judges to understand and rerun
39
 
40
  ## Current Repo State
@@ -57,14 +57,19 @@ The remaining work should be treated as targeted strengthening, not broad featur
57
 
58
  ## Submission Gates That Must Still Hold
59
 
60
- These come directly from `PLAN.md` and `KNOWLEDGE.md`:
61
 
62
  - the environment starts correctly
63
  - `reset()`, `step()`, and `state()` behave correctly
64
  - 3 tasks exist and remain meaningfully different
65
  - grader scores stay in `[0.0, 1.0]`
66
  - `inference.py` runs reproducibly without crashing
67
  - Docker builds and starts cleanly
68
  - docs and metadata are current
69
  - the repo is easy for judges to understand and rerun
70
 
@@ -80,6 +85,7 @@ These come directly from `PLAN.md` and `KNOWLEDGE.md`:
80
  - add only safe RL-oriented improvements
81
  - add external grounding evidence without changing the runtime dataset
82
  - finish packaging / deployment readiness
83
 
84
  ### Do Not Do Before Submission
85
 
@@ -108,7 +114,7 @@ Because we are using Codex to generate code, we should optimize for small, bound
108
 
109
  **Window:** April 3 to April 4
110
 
111
- **Goal:** eliminate the biggest competitive weakness identified in `analysis/comp.md` and `analysis/inference.md`: lack of checked-in tests.
112
 
113
  ### Must produce
114
 
@@ -176,7 +182,7 @@ Because we are using Codex to generate code, we should optimize for small, bound
176
  - assignment group and resolution action remain exact
177
  - final episode reward stays bounded and deterministic
178
 
179
- ### Safe improvement candidates from `analysis/inference.md`
180
 
181
  - expand `ISSUE_TYPE_SIMILARITY` with only a few defensible pairs, if backed by grounding review
182
  - enrich `history` with:
@@ -231,14 +237,17 @@ Because we are using Codex to generate code, we should optimize for small, bound
231
 
232
  **Window:** April 6 to April 7
233
 
234
- **Goal:** close the submission-readiness gaps surfaced in `analysis/comp_know.md` and `analysis/inference.md`.
235
 
236
  ### Must produce
237
 
238
  - Hugging Face Spaces README frontmatter
239
  - `.openenvignore`
240
  - Docker smoke evidence on the merged branch
241
  - one clean-copy rerun if possible
242
 
243
  ### Nice-to-have only if green
244
 
@@ -264,6 +273,7 @@ Because we are using Codex to generate code, we should optimize for small, bound
264
  - no runtime refactors
265
  - no dataset edits unless they fix a blocker
266
  - stop risky edits several hours before submission
267
 
268
  ## Ownership From Now Until Submission
269
 
@@ -276,7 +286,6 @@ Primary files:
276
  - `server/grader.py`
277
  - `README.md`
278
  - `KNOWLEDGE.md`
279
- - `MENTAL_MODEL.md`
280
 
281
  Primary responsibilities:
282
 
@@ -293,6 +302,7 @@ Concrete deliverables:
293
  - any similarity-matrix update, if justified
294
  - doc updates if benchmark numbers or scoring explanation change
295
  - README frontmatter and judge-facing clarity
296
 
297
  ### Suyash ownership
298
 
@@ -326,6 +336,7 @@ Concrete deliverables:
326
  - `.openenvignore`
327
  - Docker smoke confirmation
328
  - clean-copy rerun if possible
329
 
330
  ### Shared responsibilities
331
 
@@ -335,6 +346,7 @@ Concrete deliverables:
335
  - use the GitHub Actions Docker smoke workflow when local Docker is blocked
336
  - review Codex-generated diffs before accepting them
337
  - freeze feature work by the end of April 7
338
 
339
  ## Date-By-Date Execution Plan
340
 
@@ -355,6 +367,7 @@ Suyash:
355
  - scaffold `tests/`
356
  - begin smoke tests for `reset()`, `step()`, `state()`, and deterministic seeded behavior
357
  - confirm how integration tests will hit the app cleanly
358
 
359
  Shared checkpoint:
360
 
@@ -377,6 +390,7 @@ Suyash:
377
 
378
  - complete smoke tests
379
  - add first-pass integration tests for `/health`, `/tasks`, `/reset`, and `/step`
380
 
381
  Shared checkpoint:
382
 
@@ -400,6 +414,7 @@ Suyash:
400
  - add integration coverage for full seeded episode flow and `state()`
401
  - add a light heuristic regression path for `inference.py`
402
  - optionally enrich observation history if tests are already green
403
 
404
  Shared checkpoint:
405
 
@@ -424,6 +439,8 @@ Suyash:
424
  - add `.openenvignore`
425
  - verify Docker smoke workflow on the merged branch
426
  - check deployment assumptions around `app_port`, `/docs`, `/health`, `/ws`, and `/web`
427
 
428
  Shared checkpoint:
429
 
@@ -439,7 +456,7 @@ Primary goal:
439
 
440
  Roopal:
441
 
442
- - final docs consistency pass across `README.md`, `KNOWLEDGE.md`, and `MENTAL_MODEL.md`
443
  - add a short TRL / GRPO usage example only if everything else is already green
444
 
445
  Suyash:
@@ -466,6 +483,7 @@ Morning:
466
  - run final smoke / test slice on the submission branch
467
  - verify required files are present
468
  - verify README and metadata are current
469
 
470
  Afternoon:
471
 
@@ -493,6 +511,7 @@ Do not cut these:
493
  3. Docker / deployment validation
494
  4. grounding audit evidence
495
  5. final benchmark sanity rerun if behavior changed
496
 
497
  ## Definition Of Done
498
 
@@ -502,9 +521,10 @@ The project is ready when:
502
  2. scoring is demonstrably deterministic and not fuzzy by default
503
  3. a grounding audit against real public support datasets exists
504
  4. the heuristic baseline still runs successfully
505
- 5. Docker build and run are validated
506
- 6. docs and metadata are current and judge-friendly
507
- 7. the repo is frozen and submitted on time
508
 
509
  ## Simple Rule To Remember
510
 
 
12
 
13
  - `PROJECT_STATUS.md` is the canonical log of completed work.
14
  - This roadmap is the remaining execution plan from the current repo state to final submission.
15
+ - `required.md` is now the combined official-requirements and project-compliance file.
16
  - `KNOWLEDGE.md` defines the current repo truth and judge-facing explanation.
17
+ - `analysis/comp.md` and `analysis/comp_know.md` are internal competitive notes only. Use them to prioritize work, but do not mention competitor repos in public-facing docs.
18
 
19
  ## What We Are Optimizing For
20
 
 
34
  - do this as an audit / evidence layer, not as a late dataset merge
35
 
36
  4. **Submission readiness**
37
+ - satisfy every requirement from `required.md` and `KNOWLEDGE.md`
38
  - keep the repo easy for judges to understand and rerun
39
 
40
  ## Current Repo State
 
57
 
58
  ## Submission Gates That Must Still Hold
59
 
60
+ These come directly from `required.md` and `KNOWLEDGE.md`:
61
 
62
  - the environment starts correctly
63
  - `reset()`, `step()`, and `state()` behave correctly
64
  - 3 tasks exist and remain meaningfully different
65
  - grader scores stay in `[0.0, 1.0]`
66
  - `inference.py` runs reproducibly without crashing
67
+ - `inference.py` uses the OpenAI client with `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`
68
+ - structured stdout logs follow the official `[START]`, `[STEP]`, and `[END]` format
69
+ - `openenv validate` passes
70
  - Docker builds and starts cleanly
71
+ - HF deployment responds cleanly and reset works
72
+ - inference stays inside the official runtime / machine envelope
73
  - docs and metadata are current
74
  - the repo is easy for judges to understand and rerun
75
 
 
85
  - add only safe RL-oriented improvements
86
  - add external grounding evidence without changing the runtime dataset
87
  - finish packaging / deployment readiness
88
+ - verify official validation constraints, not just local happy-path behavior
89
 
90
  ### Do Not Do Before Submission
91
 
 
114
 
115
  **Window:** April 3 to April 4
116
 
117
+ **Goal:** eliminate the biggest competitive weakness identified in `analysis/comp.md` and `analysis/comp_know.md`: lack of checked-in tests.
118
 
119
  ### Must produce
120
 
 
182
  - assignment group and resolution action remain exact
183
  - final episode reward stays bounded and deterministic
184
 
185
+ ### Safe improvement candidates from `analysis/comp_know.md`
186
 
187
  - expand `ISSUE_TYPE_SIMILARITY` with only a few defensible pairs, if backed by grounding review
188
  - enrich `history` with:
 
237
 
238
  **Window:** April 6 to April 7
239
 
240
+ **Goal:** close the submission-readiness gaps surfaced in `analysis/comp_know.md`.
241
 
242
  ### Must produce
243
 
244
  - Hugging Face Spaces README frontmatter
245
  - `.openenvignore`
246
+ - `openenv validate` evidence
247
  - Docker smoke evidence on the merged branch
248
  - one clean-copy rerun if possible
249
+ - structured inference logging verified against the official format
250
+ - a practical check that inference remains inside the official runtime envelope
251
 
252
  ### Nice-to-have only if green
253
 
 
273
  - no runtime refactors
274
  - no dataset edits unless they fix a blocker
275
  - stop risky edits several hours before submission
276
+ - if possible, run the official validator or the closest local equivalent before final push
277
 
278
  ## Ownership From Now Until Submission
279
 
 
286
  - `server/grader.py`
287
  - `README.md`
288
  - `KNOWLEDGE.md`
289
 
290
  Primary responsibilities:
291
 
 
302
  - any similarity-matrix update, if justified
303
  - doc updates if benchmark numbers or scoring explanation change
304
  - README frontmatter and judge-facing clarity
305
+ - official requirement compliance review through `required.md`
306
 
307
  ### Suyash ownership
308
 
 
336
  - `.openenvignore`
337
  - Docker smoke confirmation
338
  - clean-copy rerun if possible
339
+ - structured inference logging compliance
340
 
341
  ### Shared responsibilities
342
 
 
346
  - use the GitHub Actions Docker smoke workflow when local Docker is blocked
347
  - review Codex-generated diffs before accepting them
348
  - freeze feature work by the end of April 7
349
+ - do not casually change the `[START]`, `[STEP]`, `[END]` inference log format once implemented
350
 
351
  ## Date-By-Date Execution Plan
352
 
 
367
  - scaffold `tests/`
368
  - begin smoke tests for `reset()`, `step()`, `state()`, and deterministic seeded behavior
369
  - confirm how integration tests will hit the app cleanly
370
+ - review `required.md` and identify the exact official validation items still not reflected in runtime / inference behavior
371
 
372
  Shared checkpoint:
373
 
 
390
 
391
  - complete smoke tests
392
  - add first-pass integration tests for `/health`, `/tasks`, `/reset`, and `/step`
393
+ - begin checking how current `inference.py` differs from the official structured logging requirement
394
 
395
  Shared checkpoint:
396
 
 
414
  - add integration coverage for full seeded episode flow and `state()`
415
  - add a light heuristic regression path for `inference.py`
416
  - optionally enrich observation history if tests are already green
417
+ - bring `inference.py` closer to official structured logging format if the change can be done safely
418
 
419
  Shared checkpoint:
420
 
 
439
  - add `.openenvignore`
440
  - verify Docker smoke workflow on the merged branch
441
  - check deployment assumptions around `app_port`, `/docs`, `/health`, `/ws`, and `/web`
442
+ - run `openenv validate` or the closest available validation path
443
+ - verify structured inference logging and runtime-envelope expectations
444
 
445
  Shared checkpoint:
446
 
 
456
 
457
  Roopal:
458
 
459
+ - final docs consistency pass across `README.md` and `KNOWLEDGE.md`
460
  - add a short TRL / GRPO usage example only if everything else is already green
461
 
462
  Suyash:
 
483
  - run final smoke / test slice on the submission branch
484
  - verify required files are present
485
  - verify README and metadata are current
486
+ - run the final validation checklist from `required.md`
487
 
488
  Afternoon:
489
 
 
511
  3. Docker / deployment validation
512
  4. grounding audit evidence
513
  5. final benchmark sanity rerun if behavior changed
514
+ 6. official structured inference logging compliance
515
 
516
  ## Definition Of Done
517
 
 
521
  2. scoring is demonstrably deterministic and not fuzzy by default
522
  3. a grounding audit against real public support datasets exists
523
  4. the heuristic baseline still runs successfully
524
+ 5. the inference path is compliant with the official log format
525
+ 6. `openenv validate` and Docker checks pass
526
+ 7. docs and metadata are current and judge-friendly
527
+ 8. the repo is frozen and submitted on time
528
 
529
  ## Simple Rule To Remember
530
 
analysis/comp_know.md CHANGED
@@ -1,275 +1,237 @@
1
- # Competition Knowledge Base OpenEnv Hackathon
2
 
3
- > Source: github.com/meta-pytorch/OpenEnv/tree/main/envs
4
- > Gathered: April 4, 2026
5
- > Purpose: Internal competitive intelligence NOT for commit/push
6
 
7
  ---
8
 
9
- ## Full Environment Inventory (27 envs)
10
 
11
  | Env | Domain | Complexity | Reward Type | Multi-step? | MCP? |
12
  |-----|--------|------------|-------------|-------------|------|
13
  | `atari_env` | Classic games | Medium | Dense | Yes | No |
14
  | `browsergym_env` | Web browser automation | Very High | Task-based | Yes | No |
15
- | `calendar_env` | Calendar/scheduling agent | High | SQL verifier | Yes | Yes (MCP) |
16
  | `carla_env` | Autonomous driving sim | Very High | Dense | Yes | No |
17
- | `chat_env` | Conversation/tokenization | Low | Custom transform | Yes | No |
18
- | `chess_env` | Chess game | Medium | Win/loss | Yes | No |
19
  | `coding_env` | Python code execution | Medium | Exit code / transform | Yes | No |
20
- | `connect4_env` | Connect 4 game | Low | Win/loss | Yes | No |
21
- | `dipg_safety_env` | Safety/policy | Medium | Unknown | Yes | No |
22
- | `dm_control_env` | DeepMind Control Suite | High | Dense | Yes | No |
23
- | `echo_env` | Reference/minimal | Minimal | Echo | No | No |
24
- | `finqa_env` | Financial QA (SEC 10-K) | High | Fuzzy numerical | Yes | Yes (MCP) |
25
- | `finrl_env` | Financial RL trading | High | Portfolio return | Yes | No |
26
- | `git_env` | Git operations | Medium | Task-based | Yes | No |
27
- | `grid_world_env` | Grid navigation | Low | Sparse | Yes | No |
28
- | `julia_env` | Julia code execution | Medium | Exit code | Yes | No |
29
- | `kernrl` | Kernel/OS operations | High | Unknown | Yes | No |
30
- | `maze_env` | Maze navigation | Low | Sparse | Yes | No |
31
- | `openapp_env` | Web app UI (BrowserGym) | Extreme | Task-based | Yes | No |
32
- | `openspiel_env` | Multi-agent games | High | Game outcome | Yes | No |
33
- | `reasoning_gym_env` | Reasoning tasks (100+ datasets) | Medium | Exact/partial | Single-step | No |
34
- | `repl_env` | REPL execution | Medium | Exit code | Yes | No |
35
- | `snake_env` | Snake game | Low | Score | Yes | No |
36
- | `sumo_rl_env` | Traffic simulation | High | Traffic flow | Yes | No |
37
- | `tbench2_env` | Terminal Bench 2 (shell tasks) | High | pytest pass/fail | Yes | No |
38
- | `textarena_env` | Text-based games | Medium | Game outcome | Yes | No |
39
- | `unity_env` | Unity 3D simulation | Very High | Task-based | Yes | No |
40
 
41
- ---
42
-
43
- ## Deep Dives: Most Relevant Envs
44
-
45
- ### 1. `finqa_env` — Financial QA
46
-
47
- **What it does**: Agents answer complex financial questions from SEC 10-K filings using SQL tool calls.
48
-
49
- **Architecture**:
50
- - Subclasses `MCPEnvironment` (not plain `Environment`) — uses FastMCP with `@mcp.tool` decorators
51
- - Tools: `get_descriptions`, `get_table_info`, `sql_query`, `submit_answer`
52
- - Dataset: 290 questions from HuggingFace (`snorkelai/finqa-data`)
53
- - Max steps: 50 per episode
54
- - Reward: Binary (1.0 / 0.0) with fuzzy numerical matching (1% relative tolerance + 1.0 absolute tolerance)
55
- - Handles `\boxed{}` LaTeX format, percentages, fractions, thousands separators, negative parens
56
 
57
- **Reward sophistication**: Very high. The `rewards.py` is ~300 lines handling multi-value answers, year-labeled pairs, percentage normalization, and both relative + absolute tolerance checks simultaneously.
58
-
59
- **Key differentiator**: MCP protocol for tool discovery. Client uses `await env.list_tools()` to discover tools at runtime. This is the most "agentic" env in the repo.
60
 
61
- **Integration**: Explicitly shows TRL/GRPO integration pattern in README.
62
 
63
- ---
64
 
65
- ### 2. `coding_env` Python Code Execution
 
 
 
66
 
67
- **What it does**: Executes arbitrary Python code in a sandboxed environment.
68
 
69
- **Architecture**:
70
- - `PythonCodeActEnv` wraps a `PyExecutor` (sandboxed subprocess)
71
- - `create_safe_coding_transform()` transform pipeline for reward computation
72
- - Action: `CodeAction(code: str)`
73
- - Observation: `CodeObservation(stdout, stderr, exit_code)`
74
- - State: `CodeState(episode_id, step_count, last_exit_code)`
75
- - Reward: computed by transform (not in step directly) — extensible pattern
76
 
77
- **Key differentiator**: Transform-based reward. The environment itself doesn't compute reward — a pluggable `Transform` object does. This is the cleanest separation of concerns in the repo.
78
 
79
- **Testing**: Has both unit tests (`test_python_codeact_reset`, `test_python_codeact_rewards`) and integration tests (`test_coding_env_integration`). Most tested env in the repo.
 
 
80
 
81
- ---
82
 
83
- ### 3. `reasoning_gym_env` Reasoning Tasks
 
 
84
 
85
- **What it does**: Wraps the `reasoning-gym` library (100+ reasoning datasets) as a single-step OpenEnv.
86
 
87
- **Architecture**:
88
- - Single-step episodes: `reset()` gives question, `step()` gives score + done=True
89
- - Composite datasets: mix multiple datasets with weights
90
- - Dataset persistence: same dataset reused across resets until config changes
91
- - Supports `dataset_name`, `seed`, `size`, `dataset_specs` in `reset()` kwargs
92
- - Reward: 0.0–1.0 (dataset-dependent, may use partial credit)
93
 
94
- **Key differentiator**: Massive breadth (100+ task types in one env). The `reset()` kwargs pattern for dataset configuration is very clean. Also has `openenv push` CLI for HuggingFace Spaces deployment.
95
 
96
- **Scale**: uv.lock is 551KB — large dependency tree from reasoning-gym.
 
 
97
 
98
  ---
99
 
100
- ### 4. `tbench2_env` Terminal Bench 2
101
 
102
- **What it does**: Wraps Terminal-Bench-2 shell tasks. Agent executes shell commands and is evaluated by pytest.
103
 
104
- **Architecture**:
105
- - Two modes: `local` (direct process) and `docker` (per-task container)
106
- - Rich action type: `exec`, `write`, `view`, `wait`, `kill`, `write_file`, `evaluate`, `close`
107
- - Session IDs for streaming/non-blocking processes
108
- - Reward: Binary (pytest pass/fail) on `evaluate` action
109
- - Intermediate steps: `reward=None`
110
 
111
- **Key differentiator**: Most realistic "agentic" shell environment. The session ID pattern for streaming processes is unique. Docker-in-Docker mode for full fidelity.
112
 
113
- ---
 
 
 
 
 
114
 
115
- ### 5. `openapp_env` — Web App UI
116
 
117
- **What it does**: Wraps OpenApps (calendar, todo, messenger, maps) + BrowserGym for browser-based UI agent training.
 
 
118
 
119
- **Architecture**:
120
- - Runs TWO services in Docker: OpenApps server (port 5001) + FastAPI (port 8000)
121
- - `start.sh` orchestrates both
122
- - BrowserGym for browser automation (Playwright/Chromium)
123
- - Docker image: ~5.7GB (includes Chromium)
124
- - Multimodal: screenshots + DOM observations
125
 
126
- **Key differentiator**: Most complex env in the repo. Multimodal (visual + text). Real browser interaction. Closest to real-world agent deployment.
 
 
127
 
128
  ---
129
 
130
- ### 6. `calendar_env` — Calendar Scheduling
131
-
132
- **What it does**: Calendar management tasks with SQL database verification.
133
-
134
- **Architecture**:
135
- - MCP-based (like finqa_env)
136
- - Has `client_notebooks/` — Jupyter notebook for interactive evaluation
137
- - Has `mcp_databases/` — SQLite databases for state
138
- - Scenario-based: `scenario_config.json` drives task + verifiers
139
- - Verifiers: SQL queries that check task completion
140
- - Supports OpenAI, Anthropic, Google providers
141
 
142
- **Key differentiator**: Scenario config pattern. Verifier-based reward (SQL queries check if the agent actually completed the task). Most "enterprise workflow" env.
 
 
 
 
 
 
143
 
144
  ---
145
 
146
- ### 7. `chat_env` — Chat/Tokenization
147
 
148
- **What it does**: Manages conversation history + tokenization for LLM RL training.
149
 
150
- **Architecture**:
151
- - Action: `ChatAction(tokens: torch.Tensor)` — takes raw model tokens
152
- - Observation: `ChatObservation(messages, tokens)` — both human-readable + model-ready
153
- - Transform-based reward (pluggable)
154
- - Dual representation: messages (human) + tokens (model)
155
- - No HTTP overhead option: can use directly without server
156
 
157
- **Key differentiator**: Designed for direct LLM RL training loop. The only env that takes raw PyTorch tensors as actions. Pairs with GRPO/PPO training loops directly.
158
 
159
- ---
160
-
161
- ## Structural Patterns Observed Across All Envs
162
-
163
- ### File Structure (canonical)
164
- ```
165
- env_name/
166
- ├── __init__.py # exports
167
- ├── models.py # Action, Observation, State
168
- ├── client.py # EnvClient subclass
169
- ├── openenv.yaml # metadata
170
- ├── pyproject.toml # packaging
171
- ├── README.md # HuggingFace Space frontmatter + docs
172
- └── server/
173
- ├── __init__.py
174
- ├── app.py # FastAPI
175
- ├── environment.py # core logic
176
- └── Dockerfile
177
- ```
178
 
179
- ### README Frontmatter (HuggingFace Spaces)
180
- Every env README has YAML frontmatter:
181
  ```yaml
182
  ---
183
- title: ...
184
- emoji: ...
185
- colorFrom: ...
186
- colorTo: ...
187
  sdk: docker
188
  pinned: false
189
- app_port: 8000
190
  base_path: /web
191
  tags:
192
  - openenv
 
 
 
193
  ---
194
  ```
195
- This is required for HuggingFace Spaces deployment. Our README does NOT have this.
196
-
197
- ### openenv.yaml — Minimal Pattern
198
- Most envs have very minimal `openenv.yaml` (just name + entry_point). Our yaml is the most detailed in the repo.
199
-
200
- ### Dockerfile Patterns
201
- - Most use `openenv-base:latest` as base image (not `python:3.11-slim`)
202
- - Our Dockerfile uses `python:3.11-slim` directly — this is the standalone/HF Spaces pattern
203
- - The `openenv-base` pattern is for the monorepo CI/CD workflow
204
-
205
- ### Testing
206
- - `coding_env`: most tested (unit + integration)
207
- - Most envs: no tests at all
208
- - Our env: no tests (matches majority)
209
-
210
- ### MCP vs HTTP
211
- - Most envs: plain HTTP (`Environment` base class)
212
- - `finqa_env`, `calendar_env`: MCP (`MCPEnvironment` base class, FastMCP tools)
213
- - MCP envs are more "agentic" — tools are discoverable at runtime
214
-
215
- ### Reward Patterns
216
- | Pattern | Envs | Description |
217
- |---------|------|-------------|
218
- | Binary (0/1) | finqa, tbench2, reasoning_gym | Pass/fail |
219
- | Dense partial | ours, chess, atari | Continuous [0,1] |
220
- | Transform-based | coding, chat | Pluggable reward function |
221
- | SQL verifier | calendar | DB state check |
222
- | Game outcome | chess, connect4, openspiel | Win/loss/draw |
223
 
224
- ---
 
 
225
 
226
- ## Deployment Patterns
227
 
228
- ### HuggingFace Spaces
229
- - `openenv push` CLI command (seen in reasoning_gym README)
230
- - Spaces get: `/web` (UI), `/docs` (Swagger), `/health`, `/ws` (WebSocket)
231
- - `base_path: /web` in README frontmatter
232
- - Our env: missing HF Spaces frontmatter in README
233
 
234
- ### Docker
235
- - Most envs: `openenv-base:latest` (monorepo CI)
236
- - Standalone envs (ours, openapp): `python:3.11-slim`
237
- - openapp: 5.7GB image (Chromium)
238
- - Our image: minimal (python:3.11-slim + pip deps)
239
 
240
  ---
241
 
242
- ## Dataset Sizes
243
 
244
- | Env | Dataset Size | Source |
245
- |-----|-------------|--------|
246
- | finqa | 290 questions | HuggingFace (snorkelai/finqa-data) |
247
- | reasoning_gym | 100+ datasets, configurable size | reasoning-gym library |
248
- | calendar | SQLite DBs | Custom |
249
- | ours | 45 tickets | Custom (data/dataset.json) |
250
- | coding | N/A (generates tasks) | N/A |
251
- | tbench2 | Terminal-Bench-2 repo | GitHub auto-download |
252
 
253
- ---
254
 
255
- ## Key Technical Observations
256
 
257
- 1. **MCP is the emerging pattern** for tool-using agents. finqa and calendar both use it. Our env uses plain HTTP — simpler but less "agentic."
258
 
259
- 2. **Transform-based rewards** (coding_env, chat_env) are the cleanest architecture for extensible reward shaping. Our reward is hardcoded in `reward.py`.
260
 
261
- 3. **`openenv push` CLI** exists for HuggingFace Spaces deployment. We should use it.
262
 
263
- 4. **README frontmatter** is required for HF Spaces. Our README is missing it.
264
 
265
- 5. **Composite/configurable datasets** (reasoning_gym) are a strong differentiator. Our dataset is fixed at 45 tickets.
266
 
267
- 6. **WebSocket endpoint** (`/ws`) is mentioned in reasoning_gym README as a HF Spaces feature. Our env already has `/ws` via the OpenEnv base.
268
 
269
- 7. **`uv.lock`** files appear in chat_env and reasoning_gym — reproducible dependency locking. We use `requirements.txt` only.
270
 
271
- 8. **`.openenvignore`** file in finqa_env — analogous to `.dockerignore` for the OpenEnv push CLI.
272
 
273
- 9. **`base_path: /web`** in HF Spaces frontmatter — the web UI is at `/web`, not `/`. Our env would need this.
274
 
275
- 10. **Episode length**: Most envs are either single-step (reasoning_gym) or unbounded (coding, tbench2). Our env is bounded (3–5 steps) — a clean middle ground.
1
+ # Competition Knowledge Base And Action Plan
2
 
3
+ > Source: github.com/meta-pytorch/OpenEnv/tree/main/envs
4
+ > Gathered: April 4, 2026
5
+ > Purpose: Internal competitive intelligence plus action planning - NOT for commit/push
6
 
7
  ---
8
 
9
+ ## Full Environment Inventory
10
 
11
  | Env | Domain | Complexity | Reward Type | Multi-step? | MCP? |
12
  |-----|--------|------------|-------------|-------------|------|
13
  | `atari_env` | Classic games | Medium | Dense | Yes | No |
14
  | `browsergym_env` | Web browser automation | Very High | Task-based | Yes | No |
15
+ | `calendar_env` | Calendar / scheduling agent | High | SQL verifier | Yes | Yes |
16
  | `carla_env` | Autonomous driving sim | Very High | Dense | Yes | No |
17
+ | `chat_env` | Conversation / tokenization | Low | Custom transform | Yes | No |
 
18
  | `coding_env` | Python code execution | Medium | Exit code / transform | Yes | No |
19
+ | `echo_env` | Reference / minimal | Minimal | Echo | No | No |
20
+ | `finqa_env` | Financial QA | High | Fuzzy numerical | Yes | Yes |
21
+ | `openapp_env` | Web app UI | Extreme | Task-based | Yes | No |
22
+ | `reasoning_gym_env` | Reasoning tasks | Medium | Exact / partial | Single-step | No |
23
+ | `tbench2_env` | Terminal tasks | High | Pytest pass/fail | Yes | No |
24
 
25
+ This is not the full raw repo dump anymore. It is the subset that matters most for competitive positioning and late-stage prioritization.
26
 
27
+ ---
 
 
28
 
29
+ ## Most Relevant Competitor Patterns
30
 
31
+ ### `finqa_env`
32
 
33
+ - strong MCP / tool-using architecture
34
+ - larger dataset than ours
35
+ - binary-style reward with fuzzy numerical matching
36
+ - explicit TRL / GRPO integration story
37
 
38
+ ### `coding_env`
39
 
40
+ - strongest test story
41
+ - clean transform-based reward separation
42
+ - reference example of strong code quality and architecture hygiene
 
 
 
 
43
 
44
+ ### `reasoning_gym_env`
45
 
46
+ - broadest dataset coverage
47
+ - configurable dataset / size pattern
48
+ - useful deployment references for `openenv push`
49
 
50
+ ### `tbench2_env`
51
 
52
+ - strong agentic shell-task realism
53
+ - binary evaluation via pytest
54
+ - little intermediate reward signal
55
 
56
+ ### `openapp_env`
57
 
58
+ - highest complexity
59
+ - multimodal / browser-based
60
+ - difficult to beat on ambition, easier to beat on simplicity and reproducibility
 
 
 
61
 
62
+ ### `calendar_env`
63
 
64
+ - enterprise workflow flavor
65
+ - scenario + verifier pattern
66
+ - stronger on MCP sophistication than on reward density
67
 
68
  ---
69
 
70
+ ## Structural Patterns Across The Field
71
 
72
+ ### Packaging
73
 
74
+ - every serious repo has `models.py`, `client.py`, `openenv.yaml`, `pyproject.toml`, `README.md`, and a `server/` package
75
+ - Hugging Face Spaces frontmatter is standard in competitor `README.md` files
76
+ - `.openenvignore` appears in some stronger submissions
 
 
 
77
 
78
+ ### Reward patterns
79
 
80
+ | Pattern | Examples | Notes |
81
+ |---------|----------|-------|
82
+ | Binary | `finqa_env`, `tbench2_env` | easy to verify, weaker RL signal |
83
+ | Dense partial | ours, games | stronger RL learning signal |
84
+ | Transform-based | `coding_env`, `chat_env` | architecturally clean |
85
+ | SQL / verifier based | `calendar_env` | strong task verification |
86
 
87
+ ### Testing patterns
88
 
89
+ - many repos have few or no tests
90
+ - `coding_env` is still the strongest example of checked-in testing
91
+ - this makes tests a high-value differentiator for us
92
 
93
+ ### Deployment patterns
 
 
 
 
 
94
 
95
+ - Spaces usually expose `/web`, `/docs`, `/health`, and `/ws`
96
+ - `openenv push` is the expected deployment workflow
97
+ - `README` frontmatter and Docker correctness matter more than polish extras
98
 
99
  ---
100
 
101
+ ## Key Technical Observations
102
 
103
+ 1. MCP is useful, but too big to add late.
104
+ 2. Transform-based reward is elegant, but not a deadline-critical refactor.
105
+ 3. HF Spaces frontmatter is expected and missing in our repo.
106
+ 4. `.openenvignore` is a cheap packaging win.
107
+ 5. Configurable datasets are nice, but merging an external dataset this late is too risky.
108
+ 6. Strong tests improve trust more than minor architectural polish.
109
+ 7. Dense, deterministic, partial-credit reward is one of our real advantages.
110
 
111
  ---
112
 
113
+ ## Actionable Inferences
114
 
115
+ ## Critical Missing Items
116
 
117
+ ### 1. README frontmatter for HF Spaces
 
 
 
 
 
118
 
119
+ This is still the cleanest obvious gap. Add it before submission.
120
 
121
+ Recommended fields:
122
 
 
 
123
  ```yaml
124
  ---
125
+ title: IT Helpdesk Ticket Routing OpenEnv
126
+ emoji: 🎫
127
+ colorFrom: blue
128
+ colorTo: indigo
129
  sdk: docker
130
  pinned: false
131
+ app_port: 7860
132
  base_path: /web
133
  tags:
134
  - openenv
135
+ - helpdesk
136
+ - ticket-routing
137
+ - nlp
138
  ---
139
  ```
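A small pre-submission check can confirm that the README actually opens with frontmatter of this shape. The sketch below is an illustration under stated assumptions: `parse_frontmatter` and `REQUIRED_KEYS` are hypothetical names, the parser only handles flat `key: value` lines rather than full YAML, and the set of keys treated as required is our own choice, not an official rule.

```python
# Keys we choose to insist on; this set is an assumption, not an official list.
REQUIRED_KEYS = {"title", "sdk", "app_port", "base_path"}


def parse_frontmatter(readme_text: str) -> dict:
    """Return top-level 'key: value' pairs found between the leading '---' fences."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        # Skip nested list items and indented lines; only flat keys are read.
        if ":" in line and not line.startswith((" ", "-")):
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return {}  # no closing fence, so treat the frontmatter as absent


def missing_keys(readme_text: str) -> set:
    """Report which of the chosen keys are absent from the frontmatter."""
    return REQUIRED_KEYS - parse_frontmatter(readme_text).keys()


if __name__ == "__main__":
    sample = "---\ntitle: Demo\nsdk: docker\napp_port: 7860\n---\n# Body\n"
    print(sorted(missing_keys(sample)))
```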
140
 
141
+ ### 2. `.openenvignore`
142
+
143
+ Cheap packaging improvement. Worth adding.
144
 
145
+ ### 3. Verified deployment assumptions
146
 
147
+ We should explicitly verify:
 
 
 
 
148
 
149
+ - `app_port: 7860`
150
+ - `/health`
151
+ - `/docs`
152
+ - `/ws`
153
+ - `/web`
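The HTTP part of this checklist can be probed with the standard library once the container is running. This is a minimal sketch that assumes the app is reachable on `localhost:7860`; the helper names are hypothetical. `/ws` is deliberately left out because it is a WebSocket endpoint and a plain HTTP GET does not exercise it properly.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

# Plain-HTTP paths from the checklist; /ws needs a WebSocket client instead.
HTTP_PATHS = ["/health", "/docs", "/web"]


def endpoint_urls(host: str = "localhost", port: int = 7860) -> list:
    """Build the URLs to probe for a locally running container."""
    return [f"http://{host}:{port}{path}" for path in HTTP_PATHS]


def probe(url: str, timeout: float = 5.0):
    """Return the HTTP status code, or None if the server is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        # The server answered, but with an error status such as 404.
        return error.code
    except OSError:
        # Covers URLError, refused connections, and timeouts.
        return None


if __name__ == "__main__":
    for url in endpoint_urls():
        print(url, probe(url))
```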
154
 
155
  ---
156
 
157
+ ## High-Value Improvements That Still Make Sense
158
 
159
+ ### 4. Strengthen the scorer only in grounded, tested ways
 
 
 
 
 
 
 
160
 
161
+ Possible additions to `ISSUE_TYPE_SIMILARITY`:
162
 
163
+ - `onboarding` vs `service_request`
164
+ - `feature_request` vs `service_request`
165
+ - `security_compliance` vs `identity_access`
166
+ - `billing_license` vs `identity_access`
167
+
168
+ Only do this if:
169
+
170
+ - the ambiguity is real
171
+ - the change is backed by tests
172
+ - it does not blur operationally distinct actions too much
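To make the "backed by tests" condition concrete, a partial-credit lookup can be written so that pair order never matters. The sketch below is not the project's grader: `SIMILARITY_SKETCH`, `issue_type_score`, and the `0.5` credit values are placeholders chosen for illustration, and the real `ISSUE_TYPE_SIMILARITY` structure may differ.

```python
# Hypothetical near-miss pairs; the 0.5 credit is a placeholder, not a tuned value.
SIMILARITY_SKETCH = {
    frozenset({"onboarding", "service_request"}): 0.5,
    frozenset({"security_compliance", "identity_access"}): 0.5,
}


def issue_type_score(predicted: str, expected: str) -> float:
    """Exact match scores 1.0, a listed near-miss gets partial credit, else 0.0."""
    if predicted == expected:
        return 1.0
    # frozenset keys make the lookup symmetric in (predicted, expected).
    return SIMILARITY_SKETCH.get(frozenset({predicted, expected}), 0.0)


if __name__ == "__main__":
    print(issue_type_score("onboarding", "service_request"))
    print(issue_type_score("onboarding", "billing_license"))
```

Because the table is data, each proposed pair can get its own test asserting symmetry and an upper bound below 1.0, which keeps operationally distinct labels from collapsing into each other.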
173
+
174
+ ### 5. Add richer `history` if low-risk
175
+
176
+ Candidate additions:
177
+
178
+ - ticket title
179
+ - predicted fields
180
 
181
+ This can help multi-step reasoning without changing the core task.
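A richer history entry could be assembled as in the sketch below. The function name and argument layout are hypothetical. The field names follow the earlier proposal in this repo's analysis notes (`ticket_id`, `title`, `predicted`, `score`, `breakdown`), but the real observation schema may differ.

```python
def build_history_entry(ticket_id, title, action_fields, score, breakdown):
    """Assemble one history record, dropping action fields the agent left unset."""
    return {
        "ticket_id": ticket_id,
        "title": title,
        # Only keep fields the agent actually predicted.
        "predicted": {k: v for k, v in action_fields.items() if v is not None},
        "score": score,
        "breakdown": breakdown,
    }


if __name__ == "__main__":
    entry = build_history_entry(
        ticket_id="T-001",
        title="Cannot connect to VPN",
        action_fields={"issue_type": "network", "priority": None},
        score=0.5,
        breakdown={"issue_type": 1.0, "priority": 0.0},
    )
    print(entry)
```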
182
 
183
+ ### 6. Add `queue_size` as an optional `reset()` kwarg
184
 
185
+ Nice RL/training flexibility, but lower priority than tests, scorer crispness, Docker, and deployment readiness.
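If this is picked up, the override can be isolated in one small helper so the default seeded behavior stays untouched. The sketch assumes a `QUEUE_SIZE_RANGE` of `(3, 5)`, inferred from the 3 to 5 step episodes described for this environment; `resolve_queue_size` is a hypothetical name.

```python
import random

# Assumed default bounds, based on the 3 to 5 step episodes described for this env.
QUEUE_SIZE_RANGE = (3, 5)


def resolve_queue_size(rng: random.Random, queue_size=None) -> int:
    """Use the caller's override when given, otherwise sample from the default range."""
    if queue_size is None:
        low, high = QUEUE_SIZE_RANGE
        return rng.randint(low, high)
    if not isinstance(queue_size, int) or queue_size < 1:
        raise ValueError("queue_size must be a positive integer")
    return queue_size


if __name__ == "__main__":
    print(resolve_queue_size(random.Random(7)))
    print(resolve_queue_size(random.Random(7), queue_size=4))
```

Sampling from the passed-in `rng` rather than the global `random` module keeps seeded episodes reproducible whether or not the override is used.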
186
 
187
+ ### 7. Add a short TRL / GRPO example to README
188
 
189
+ Good judge-facing signal once the repo is already green.
190
 
191
+ ---
192
+
193
+ ## Improvements To Defer
194
+
195
+ - MCP migration
196
+ - transform-based reward refactor
197
+ - major dataset expansion
198
+ - external dataset merge into runtime
199
+ - broad inference rewrite
200
+ - dependency churn just for polish
201
+
202
+ ---
203
 
204
+ ## Competitive Positioning
205
 
206
+ ### Our strengths
207
 
208
+ 1. strong real-world enterprise domain
209
+ 2. dense deterministic reward
210
+ 3. partial-credit grading that is still explainable
211
+ 4. clean 3-task difficulty ladder
212
+ 5. strong heuristic baseline
213
+ 6. compact, rerunnable environment design
214
+
215
+ ### Our weaknesses
216
+
217
+ 1. weaker checked-in test story unless we fix it
218
+ 2. missing HF Spaces frontmatter unless we fix it
219
+ 3. smaller dataset than some top competitors
220
+ 4. less ambitious architecture than the strongest simulator-style or MCP-heavy entries
221
+
222
+ ---
223
 
224
+ ## Priority Action List
225
+
226
+ | Priority | Action | Effort | Impact |
227
+ |----------|--------|--------|--------|
228
+ | P0 | Add tests and prove scorer crispness | 1-2 hrs | High |
229
+ | P0 | Add HF Spaces frontmatter to README | 5 min | High |
230
+ | P0 | Add `.openenvignore` | 5 min | Medium |
231
+ | P1 | Add grounding audit against public support datasets | 1-2 hrs | High |
232
+ | P1 | Expand similarity pairs only if grounded and tested | 20-40 min | Medium |
233
+ | P1 | Add richer `history` if low-risk | 20 min | Medium |
234
+ | P1 | Add TRL / GRPO README example | 30 min | High |
235
+ | P2 | Add `queue_size` kwarg | 15 min | Low |
236
+ | P3 | Expand dataset substantially | 2+ hrs | Medium but risky |
237
+ | P3 | Transform-based reward refactor | 1 hr | Low |
analysis/inference.md DELETED
@@ -1,218 +0,0 @@
1
- # Inferences & Actionable Advantages
2
-
3
- > Based on deep analysis of all 27 OpenEnv competition entries
4
- > Internal use only — NOT for commit/push
5
-
6
- ---
7
-
8
- ## Critical Missing Items (Fix Before Submission)
9
-
10
- ### 1. README HuggingFace Spaces Frontmatter — MISSING
11
-
12
- Every single env in the repo has YAML frontmatter at the top of README.md. Ours does not.
13
- This is required for `openenv push` and HuggingFace Spaces deployment to work correctly.
14
-
15
- **Add to top of `meta-AIHack/README.md`:**
16
- ```yaml
17
- ---
18
- title: IT Helpdesk Ticket Routing OpenEnv
19
- emoji: 🎫
20
- colorFrom: blue
21
- colorTo: indigo
22
- sdk: docker
23
- pinned: false
24
- app_port: 7860
25
- base_path: /web
26
- tags:
27
- - openenv
28
- - helpdesk
29
- - ticket-routing
30
- - nlp
31
- ---
32
- ```
33
-
34
- Note: our port is `7860` (HF Spaces default), not `8000`. Use `7860` here.
35
-
36
- ---
37
-
38
- ### 2. `.openenvignore` File — MISSING
39
-
40
- `finqa_env` has a `.openenvignore` file (analogous to `.dockerignore` for the `openenv push` CLI).
41
- Without it, `openenv push` may upload unnecessary files.
42
-
43
- **Create `meta-AIHack/.openenvignore`:**
44
- ```
45
- *.pyc
46
- __pycache__/
47
- .git/
48
- *.md
49
- PLAN.md
50
- ROADMAP.md
51
- MENTAL_MODEL.md
52
- KNOWLEDGE.md
53
- comp_intel/
54
- bugs/
55
- transcripts/
56
- ```
57
-
58
- ---
59
-
60
- ### 3. `base_path: /web` in openenv.yaml — CHECK
61
-
62
- The HF Spaces web UI is served at `/web`. The `reasoning_gym_env` README explicitly mentions:
63
- - Web Interface at `/web`
64
- - API Documentation at `/docs`
65
- - Health Check at `/health`
66
- - WebSocket at `/ws`
67
-
68
- Our `openenv.yaml` lists `/docs` in `api.endpoints` — good. But we should verify the web interface path is correct when deployed.
69
-
70
- ---
71
-
72
- ## High-Value Improvements (Implement If Time Allows)
73
-
74
- ### 4. Partial Credit Similarity Matrix — Expand
75
-
76
- Our `grader.py` has `ISSUE_TYPE_SIMILARITY` with 16 pairs and `PRIORITY_SCORES` with 10 pairs.
77
-
78
- **Observation from finqa_env**: Their reward uses both relative AND absolute tolerance simultaneously. Our grader uses a flat similarity dict.
79
-
80
- **Improvement**: Add more near-miss pairs to `ISSUE_TYPE_SIMILARITY`. Currently missing:
81
- - `("onboarding", "service_request")` — onboarding tickets often look like service requests
82
- - `("feature_request", "service_request")` — common confusion
83
- - `("security_compliance", "identity_access")` — MFA/SSO tickets can go either way
84
- - `("billing_license", "identity_access")` — license + account access overlap
85
-
86
- This directly improves the reward signal quality for RL training, which is what judges care about.
87
-
88
- ---
89
-
90
- ### 5. Dataset Size — Expand from 45 to ~100 tickets
91
-
92
- **Observation**: finqa has 290 questions, reasoning_gym has configurable sizes up to thousands.
93
- Our 45 tickets is the smallest custom dataset in the repo.
94
-
95
- **Improvement**: Add 55 more tickets to reach 100. Focus on:
96
- - More ambiguous cases (harder for LLMs)
97
- - More `related_ticket_id` chains (multi-ticket threads)
98
- - Edge cases: tickets that span two issue types
99
- - More `spam_phishing` examples (currently underrepresented)
100
-
101
- This makes the benchmark more robust and harder to overfit.
102
-
103
- ---
104
-
105
- ### 6. Transform-Based Reward (Optional Architecture Upgrade)
106
-
107
- **Observation**: `coding_env` uses a pluggable `Transform` object for reward computation instead of hardcoding it in `step()`. This is the cleanest pattern in the repo.
108
-
109
- **Improvement**: Refactor `server/reward.py` to expose a `HelpdeskRewardTransform` class that can be swapped. Low priority — our current design works fine — but it signals architectural sophistication to judges.
110
-
111
- ---
112
-
113
- ### 7. Configurable Queue Size via `reset()` kwargs
114
-
115
- **Observation**: `reasoning_gym_env` passes `size`, `seed`, `dataset_name` as `reset()` kwargs. This makes the env much more flexible for RL training (vary episode length, vary dataset).
116
-
117
- **Improvement**: Accept `queue_size` as a `reset()` kwarg (in addition to `task_id` and `seed`):
118
- ```python
119
- def reset(self, seed=None, episode_id=None, **kwargs):
120
- queue_size = kwargs.get("queue_size", None) # override QUEUE_SIZE_RANGE
121
- ...
122
- ```
123
-
124
- This lets RL trainers control episode length without modifying the env code.
125
-
126
- ---
127
-
128
- ### 8. `uv.lock` for Reproducible Dependencies
129
-
130
- **Observation**: `chat_env` and `reasoning_gym_env` both include `uv.lock` files for fully reproducible dependency resolution.
131
-
132
- **Improvement**: Run `uv lock` in `meta-AIHack/` and commit the `uv.lock`. This signals production-quality dependency management.
133
-
134
- ---
135
-
136
- ### 9. Explicit TRL/GRPO Integration Example in README
137
-
138
- **Observation**: `finqa_env` README explicitly shows a TRL GRPO integration snippet. This is exactly what Meta/PyTorch judges want to see — the env being used for actual RL training.
139
-
140
- **Improvement**: Add a section to our README showing how to use the env with TRL GRPO:
141
- ```python
142
- # Example: Using with TRL GRPO
143
- from trl import GRPOTrainer
144
- from client import HelpdeskTicketEnvClient
145
-
146
- async def rollout_func(prompts, trainer):
147
- sync_client = HelpdeskTicketEnvClient(base_url=ENV_URL).sync()
148
- with sync_client:
149
- result = sync_client.reset(seed=42, task_id=3)
150
- # ... agent loop
151
- return {"reward": final_reward, "completion": completion}
152
- ```
153
-
154
- ---
155
-
156
- ### 10. `history` Field — Richer Step History
157
-
158
- **Observation**: `finqa_env` passes full tool call history in observation metadata. Our `history` field currently only stores `{step, score, breakdown}`.
159
-
160
- **Improvement**: Include the ticket title and predicted fields in history so the agent can learn from its own past decisions within an episode:
161
- ```python
162
- history_entry = {
163
- "ticket_id": current_ticket.ticket_id,
164
- "title": current_ticket.title, # ADD THIS
165
- "predicted": {k: v for k, v in action.model_dump().items() if v is not None}, # ADD THIS
166
- "score": score,
167
- "breakdown": breakdown,
168
- }
169
- ```
170
-
171
- This gives the LLM agent richer context for multi-step reasoning.
172
-
173
- ---
174
-
175
- ## Competitive Positioning Insights
176
-
177
- ### Our Unique Strengths vs. The Field
178
-
179
- 1. **Richest `openenv.yaml`**: Ours is the most detailed metadata file in the entire repo. Most envs have 3-line yaml files. Ours has tasks, evaluation, grading, reproducibility, inference config. This signals thoroughness.
180
-
181
- 2. **Deterministic + Reproducible**: We explicitly set `deterministic: true` and `reproducible: true` in openenv.yaml. Only a few envs do this. Judges can rerun and get identical results.
182
-
183
- 3. **Task Ladder (3 difficulty levels)**: Most envs have a single task. We have 3 explicitly difficulty-graded tasks. This is a strong differentiator for RL curriculum learning.
184
-
185
- 4. **Partial Credit Grading**: Most envs use binary reward (0/1). Our grader gives partial credit for near-miss issue types and adjacent priorities. This produces a much richer reward signal for RL training.
186
-
187
- 5. **Dense Reward Signal**: Every step produces a reward (not just the final step). Most envs (tbench2, finqa) only reward at the end. Dense rewards are better for RL training.
188
-
189
- 6. **Heuristic Baseline**: We have a working keyword-based heuristic that achieves 0.94 overall. Most envs don't have a baseline agent. This lets judges immediately see the env working.
190
-
191
- 7. **Real-World Domain**: IT helpdesk routing is a real enterprise use case. Many envs are games or synthetic tasks. Ours has immediate practical applicability.
192
-
193
- 8. **Clean Episode Bounds**: 3–5 steps per episode. Not too short (single-step), not unbounded. Clean for RL training.
194
-
195
- ### Our Weaknesses vs. The Field
196
-
197
- 1. **No HF Spaces frontmatter** in README — fixable in 5 minutes
198
- 2. **Smallest dataset** (45 tickets) — expandable
199
- 3. **No MCP tools** — plain HTTP only (simpler but less "agentic")
200
- 4. **No tests** — matches most envs, but coding_env has tests
201
- 5. **No `uv.lock`** — minor
202
- 6. **No `.openenvignore`** — minor
203
-
204
- ---
-
- ## Priority Action List
-
- | Priority | Action | Effort | Impact |
- |----------|--------|--------|--------|
- | P0 | Add HF Spaces frontmatter to README | 5 min | High (required for deployment) |
- | P0 | Add `.openenvignore` | 5 min | Medium (cleaner push) |
- | P1 | Add TRL/GRPO example to README | 30 min | High (judges love this) |
- | P1 | Expand `ISSUE_TYPE_SIMILARITY` pairs | 20 min | Medium (better reward signal) |
- | P1 | Richer `history` entries (add title + predicted) | 20 min | Medium (better agent context) |
- | P2 | Expand dataset to ~100 tickets | 2 hrs | Medium (more robust benchmark) |
- | P2 | Add `queue_size` kwarg to `reset()` | 15 min | Low (flexibility) |
- | P3 | Add `uv.lock` | 5 min | Low (polish) |
- | P3 | Transform-based reward refactor | 1 hr | Low (architecture only) |
required.md ADDED
@@ -0,0 +1,352 @@
+ # Round 1 Requirements And Project Compliance Plan
+
+ ## Official Problem Statement
+
+ Round 1 requires building a complete, real-world OpenEnv environment that an AI agent can learn from through the standard `step()` / `reset()` / `state()` API.
+
+ ### Key requirements at a glance
+
+ - must simulate a real-world task, not a game or toy
+ - must implement the full OpenEnv spec with typed models and `openenv.yaml`
+ - must include at least 3 tasks with agent graders spanning easy -> medium -> hard
+ - graders must return scores in `[0.0, 1.0]`
+ - reward must provide meaningful partial-progress signal
+ - must include a reproducible baseline `inference.py`
+ - must deploy to Hugging Face Spaces with a working Dockerfile
+ - README must include environment description, action / observation spaces, setup, usage, and baseline scores
+
+ ## Official Functional Requirements
+
+ ### Real-world task simulation
+
+ The environment must simulate a task humans actually do. The official examples include:
+
+ - email triage
+ - code review
+ - data cleaning
+ - scheduling
+ - customer support
+ - content moderation
+
+ ### OpenEnv spec compliance
+
+ The environment must implement the OpenEnv interface with:
+
+ - typed Observation model
+ - typed Action model
+ - typed state model
+ - `step(action)`
+ - `reset()`
+ - `state()`
+ - `openenv.yaml`
+
+ This is expected to be checked through `openenv validate`.
+
+ ### Minimum 3 tasks with agent graders
+
+ Each task must have:
+
+ - a concrete objective
+ - a programmatic grader
+ - score output in `[0.0, 1.0]`
+ - deterministic success / failure criteria
+ - clear difficulty progression from easy to hard
+
+ ### Meaningful reward function
+
+ The reward should:
+
+ - provide signal across the full trajectory
+ - reward partial progress
+ - penalize clearly undesirable behavior
+
+ ### Baseline inference script
+
+ The baseline must:
+
+ - use the OpenAI client for LLM calls
+ - live at the project root as `inference.py`
+ - produce reproducible scores
+ - complete successfully across all 3 tasks
+
+ ## Official Non-Functional Requirements
+
+ ### Hugging Face Spaces
+
+ - must deploy as a containerized HF Space
+ - should be tagged with `openenv`
+ - should respond successfully when pinged
+
+ ### Containerized execution
+
+ - must include a working Dockerfile
+ - should start cleanly with `docker build` + `docker run`
+
+ ### Documentation
+
+ README must include:
+
+ - environment description and motivation
+ - action space definition
+ - observation space definition
+ - task descriptions with difficulty expectations
+ - setup and usage instructions
+ - baseline scores
+
+ ## Official Evaluation Criteria
+
+ ### Weights
+
+ | Parameter | Weight | What judges look for |
+ |-----------|--------|----------------------|
+ | Real-world utility | 30% | Genuine practical task and value |
+ | Task & grader quality | 25% | Clear objectives, fair graders, real progression |
+ | Environment design | 20% | Clean state, sensible API, good reward shaping |
+ | Code quality & spec compliance | 15% | OpenEnv compliance, structure, typing, tests, Docker |
+ | Creativity & novelty | 10% | Original domain, mechanics, reward ideas |
+
+ ### Phase 1: Automated validation
+
+ Pass / fail gate:
+
+ - HF Space deploys
+ - OpenEnv spec compliance
+ - Dockerfile builds
+ - baseline reproduces
+ - 3+ tasks with graders
+
+ ### Phase 2: Agentic evaluation
+
+ Scored:
+
+ - baseline agent rerun
+ - standard Open LLM agent run against the environment
+ - score variance check
+
+ ### Phase 3: Human review
+
+ Top submissions are reviewed by Meta and Hugging Face engineers for:
+
+ - real-world utility
+ - creativity
+ - exploit resistance
+
+ ## Official Disqualification Criteria
+
+ - environment does not deploy or respond
+ - plagiarized or trivially modified existing environment
+ - graders always return the same score
+ - no baseline inference script
+
+ ## Official Pre-Submission Checklist
+
+ All of these must pass:
+
+ - HF Space deploys and responds
+ - automated ping to the Space URL returns `200`
+ - reset path works on the deployed environment
+ - `openenv validate` passes
+ - Dockerfile builds
+ - baseline inference completes and produces scores
+ - 3+ tasks with graders are present and score in `[0.0, 1.0]`
+
+ ## Mandatory Additional Instructions
+
+ ### Required inference environment variables
+
+ - `API_BASE_URL`
+ - `MODEL_NAME`
+ - `HF_TOKEN`
+
+ The official text also mentions `OPENAI_API_KEY` in one place, but the more specific submission instructions above consistently emphasize `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`. We should follow the later, more specific instruction while continuing to use the OpenAI client.
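
As a rough illustration of the variable handling described above, the sketch below reads the three required variables and builds an OpenAI client from them. The names `InferenceConfig`, `load_config`, and `make_client` are hypothetical, and passing `HF_TOKEN` as the client's `api_key` is an assumption about how the endpoint authenticates, not something the official text states.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceConfig:
    """Hypothetical container for the three required settings."""
    api_base_url: str
    model_name: str
    hf_token: str


def load_config(env=None):
    """Read the required variables and fail loudly if any is missing or empty."""
    env = os.environ if env is None else env
    required = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing required env vars: " + ", ".join(missing))
    return InferenceConfig(env["API_BASE_URL"], env["MODEL_NAME"], env["HF_TOKEN"])


def make_client(config):
    """Build an OpenAI client pointed at the configured endpoint.

    Assumption: the endpoint accepts HF_TOKEN as the API key.
    """
    from openai import OpenAI  # lazy import so config loading works without the package

    return OpenAI(base_url=config.api_base_url, api_key=config.hf_token)
```

Failing early on a missing variable keeps a misconfigured run from silently falling back to a different code path.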
+
+ ### Inference script constraints
+
+ - script must be named `inference.py`
+ - it must live in the project root
+ - all LLM calls must use the OpenAI client
+ - stdout logs must strictly follow the `[START]`, `[STEP]`, and `[END]` format from the official sample
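
The exact field layout of the official sample is not reproduced in this document, so the sketch below checks only the ordering of the three tags. It assumes a single episode is logged as one `[START]` line, any number of `[STEP]` lines, and one `[END]` line; `check_log_order` is a hypothetical helper, and the real validator may be stricter.

```python
TAGS = ("[START]", "[STEP]", "[END]")


def tag_of(line):
    """Return the tag a line starts with, or None for untagged lines."""
    for tag in TAGS:
        if line.startswith(tag):
            return tag
    return None


def check_log_order(lines):
    """True if the tagged lines read [START], zero or more [STEP], then [END].

    Untagged lines are ignored. The payload after each tag is not inspected.
    """
    tags = [t for t in (tag_of(line) for line in lines) if t is not None]
    if len(tags) < 2:
        return False
    return (
        tags[0] == "[START]"
        and tags[-1] == "[END]"
        and all(t == "[STEP]" for t in tags[1:-1])
    )
```

A check like this could run over captured stdout in a smoke test before the official validator is available.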
+
+ ### Infra restrictions
+
+ - inference runtime should stay under 20 minutes
+ - env and inference should run on a machine with `vcpu=2` and `memory=8gb`
+
+ ### Validator
+
+ - run the official pre-submission validation script before final submission if possible
+
+ ---
+
+ ## Project Compliance Plan
+
+ ## Project Goal
+
+ Build a polished OpenEnv environment for IT helpdesk ticket routing that satisfies:
+
+ - real-world utility
+ - strong task and grader quality
+ - clean environment design
+ - OpenEnv spec compliance
+ - reproducible baseline inference
+ - Docker and Hugging Face deployment readiness
+
+ ## Current Product Definition
+
+ The environment simulates a helpdesk queue. An agent receives one ticket at a time and predicts:
+
+ - `issue_type`
+ - `priority`
+ - `assignment_group`
+ - `resolution_action`
+
+ The project keeps three tasks:
+
+ 1. Issue Type Classification
+ 2. Issue Type And Priority
+ 3. Full Ticket Routing
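
One way to read the three-task ladder is as a growing set of graded fields. The mapping below is a sketch: the integer task IDs and the helper `graded_fields` are hypothetical, and treating "Full Ticket Routing" as all four fields is an inference from the task name rather than a quoted rule.

```python
FIELDS = ("issue_type", "priority", "assignment_group", "resolution_action")

# Hypothetical task IDs mapped to the fields each task is assumed to grade.
TASK_FIELDS = {
    1: ("issue_type",),               # Issue Type Classification
    2: ("issue_type", "priority"),    # Issue Type And Priority
    3: FIELDS,                        # Full Ticket Routing (assumed: all four)
}


def graded_fields(task_id):
    """Return the fields graded for a task, rejecting unknown IDs explicitly."""
    try:
        return TASK_FIELDS[task_id]
    except KeyError:
        raise ValueError(f"unsupported task id: {task_id!r}") from None
```

Each task's field set contains the previous one, which is what makes the progression from easy to hard explicit.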
+
+ ## What Must Be True At Submission
+
+ ### Pass / fail requirements
+
+ - the environment responds correctly
+ - OpenEnv metadata is valid
+ - `reset()`, `step()`, and `state()` work
+ - there are at least 3 tasks
+ - graders return scores in `[0.0, 1.0]`
+ - `inference.py` runs and prints reproducible results
+ - `inference.py` uses the OpenAI client and required env vars
+ - structured stdout logging matches the official format
+ - `openenv validate` passes
+ - Docker builds and starts cleanly
+ - HF Space responds and reset works
+
+ ### Scored requirements
+
+ - the task clearly feels like real helpdesk work
+ - the hard task requires meaningful reasoning
+ - partial credit is useful and deterministic
+ - docs are clear enough for judges to understand quickly
+ - reward is informative over the trajectory, not only at the end
+
+ ## Core Files
+
+ ### Runtime
+
+ - `models.py`
+ - `server/environment.py`
+ - `server/grader.py`
+ - `server/reward.py`
+ - `server/tasks.py`
+ - `server/app.py`
+ - `client.py`
+ - `inference.py`
+
+ ### Data and metadata
+
+ - `data/dataset.json`
+ - `openenv.yaml`
+ - `server/Dockerfile`
+ - `pyproject.toml`
+ - `requirements.txt`
+
+ ### Docs
+
+ - `README.md`
+ - `KNOWLEDGE.md`
+ - `required.md`
+
+ ## Technical Priorities
+
+ ### P0
+
+ 1. keep environment behavior correct
+ 2. verify task definitions and graders
+ 3. make the baseline script reliable and compliant with official logging format
+ 4. confirm dataset coverage and label consistency
+ 5. validate the official submission gates, not just local behavior
+
+ ### P1
+
+ 1. validate Docker
+ 2. validate deployment assumptions
+ 3. record baseline scores
+ 4. polish docs
+ 5. verify the runtime envelope and structured inference logs
+
+ ### P2
+
+ 1. strengthen ticket wording for realism
+ 2. expand hard-case examples if needed
+ 3. remove low-signal artifacts from the repo
+
+ ## Quality Checks To Perform
+
+ ### Environment
+
+ - reset starts a clean episode
+ - each step advances the queue correctly
+ - the final step returns trajectory reward
+ - state reflects the real internal status
+ - episode boundaries are sensible
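
The environment checks above can be exercised against a toy queue. The sketch below is plain Python and deliberately does not use the OpenEnv base classes; `ToyQueueEnv`, `QueueState`, and the `info["trajectory_reward"]` key are hypothetical, and reporting the trajectory reward as the mean of per-step rewards is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class QueueState:
    """Internal status exposed through state()."""
    index: int = 0
    rewards: list = field(default_factory=list)
    done: bool = False


class ToyQueueEnv:
    """Serves tickets one at a time and scores each action with score_fn."""

    def __init__(self, tickets, score_fn):
        if not tickets:
            raise ValueError("at least one ticket is required")
        self._tickets = list(tickets)
        self._score_fn = score_fn
        self._state = QueueState()

    def reset(self):
        """Start a clean episode and return the first ticket."""
        self._state = QueueState()
        return self._tickets[0]

    def step(self, action):
        """Score the action, advance the queue, and report episode status."""
        if self._state.done:
            raise RuntimeError("episode finished; call reset()")
        reward = self._score_fn(self._tickets[self._state.index], action)
        self._state.rewards.append(reward)
        self._state.index += 1
        self._state.done = self._state.index >= len(self._tickets)
        info = {}
        if self._state.done:
            # Assumption: trajectory reward is the mean of per-step rewards.
            info["trajectory_reward"] = sum(self._state.rewards) / len(self._state.rewards)
        next_obs = None if self._state.done else self._tickets[self._state.index]
        return next_obs, reward, self._state.done, info

    def state(self):
        return self._state
```

Every step returns a reward, so the signal is dense, while the final step additionally carries the episode-level summary.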
+
+ ### Grader
+
+ - exact matches score `1.0`
+ - near misses get partial credit where intended
+ - unsupported task IDs fail clearly
+ - scores vary across examples
+ - graders do not collapse to constant scores
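
A minimal grader that would satisfy these checks might look like the sketch below. It is not the contents of `server/grader.py`: the priority labels, the 0.5 credit for an adjacent priority, the equal weighting of fields, and the integer task IDs are all assumptions chosen for illustration, and the third task is omitted for brevity.

```python
# Hypothetical ordered priority labels; adjacent labels count as near misses.
PRIORITIES = ("low", "medium", "high", "critical")


def score_exact(predicted, expected):
    return 1.0 if predicted == expected else 0.0


def score_priority(predicted, expected):
    """1.0 for an exact match, 0.5 for an adjacent level, 0.0 otherwise."""
    if predicted not in PRIORITIES or expected not in PRIORITIES:
        return 0.0
    distance = abs(PRIORITIES.index(predicted) - PRIORITIES.index(expected))
    if distance == 0:
        return 1.0
    return 0.5 if distance == 1 else 0.0


SCORERS = {"issue_type": score_exact, "priority": score_priority}
TASKS = {1: ("issue_type",), 2: ("issue_type", "priority")}  # task 3 omitted here


def grade(task_id, predicted, expected):
    """Return the mean per-field score in [0.0, 1.0] for a supported task."""
    if task_id not in TASKS:
        raise ValueError(f"unsupported task id: {task_id!r}")
    fields = TASKS[task_id]
    total = sum(SCORERS[name](predicted.get(name), expected[name]) for name in fields)
    return total / len(fields)
```

Because every per-field score lies in `[0.0, 1.0]` and the result is their mean, the task score stays in range without clamping.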
+
+ ### Inference
+
+ - heuristic mode works without model credentials
+ - LLM mode reads `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`
+ - uses the OpenAI client
+ - stdout follows `[START]`, `[STEP]`, and `[END]`
+ - output is reproducible when the seed is fixed
+ - runtime stays below the official time budget
+
+ ### Deployment and validation
+
+ - `openenv validate` passes
+ - Docker build succeeds
+ - Docker run succeeds
+ - HF ping / reset behavior works
+ - official validator script is run if practical
+
+ ### Docs
+
+ - no outdated domain references remain
+ - team and project metadata are correct
+ - setup and run instructions are accurate
+ - README reflects the current inference and deployment path
+
+ ## Risks
+
+ ### Runtime risk
+
+ The first local execution pass and merged-state rerun have already succeeded. The remaining runtime risk is Docker, clean-machine behavior, and official-validator-style behavior, not first-pass local execution.
+
+ ### Benchmark risk
+
+ The current local benchmark is already recorded. Remaining benchmark risk is whether deployment / validation changes expose a mismatch late.
+
+ ### Deployment risk
+
+ Docker, HF Spaces, `openenv validate`, and structured inference logging should be verified before the final submission window closes.
+
+ ## Definition Of Done
+
+ The project is ready when:
+
+ 1. the environment runs locally end to end
+ 2. unit, smoke, and integration tests cover the critical paths
+ 3. the heuristic baseline runs successfully
+ 4. the inference script is compliant with the official logging format
+ 5. `openenv validate` passes
+ 6. Docker build and run both succeed
+ 7. HF deployment checks succeed or are as close to verified as possible before submission
+ 8. the docs are clean, current, and submission-ready
+ 9. the repo clearly presents Hackstreet Boys as the team