Consolidate requirements docs and align roadmap with official submission rules
- KNOWLEDGE.md +63 -14
- MENTAL_MODEL.md +0 -173
- PLAN.md +0 -147
- PROJECT_STATUS.md +2 -2
- README.md +2 -3
- ROADMAP.md +32 -12
- analysis/comp_know.md +159 -197
- analysis/inference.md +0 -218
- required.md +352 -0
KNOWLEDGE.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
# IT Helpdesk Ticket Routing OpenEnv - Knowledge Guide
|
| 2 |
|
| 3 |
-
## What
|
| 4 |
|
| 5 |
The judges want a real-world environment that follows the OpenEnv pattern and can be understood quickly.
|
| 6 |
|
|
@@ -14,9 +14,9 @@ That means this repo needs:
|
|
| 14 |
6. a baseline `inference.py`
|
| 15 |
7. Docker and metadata that are easy to rerun
|
| 16 |
|
| 17 |
-
## Why
|
| 18 |
|
| 19 |
-
|
| 20 |
|
| 21 |
- realistic
|
| 22 |
- structured
|
|
@@ -32,12 +32,12 @@ This environment simulates a short helpdesk queue where an agent routes one tick
|
|
| 32 |
|
| 33 |
## Judge-Facing Explanation
|
| 34 |
|
| 35 |
-
If a judge asks why this environment is
|
| 36 |
|
| 37 |
1. IT helpdesk routing is a real operational workflow with clear business value.
|
| 38 |
2. The input is realistic free-form ticket text, but the output is typed and easy to grade deterministically.
|
| 39 |
3. The three-task ladder creates a clean progression from basic classification to full queue routing.
|
| 40 |
-
4. The repo stays judge-friendly because the vocabulary, task labels, and scoring rules are
|
| 41 |
|
| 42 |
## Frozen Project Identity
|
| 43 |
|
|
@@ -47,6 +47,34 @@ If a judge asks why this environment is a strong submission, the concise answer
|
|
| 47 |
- OpenEnv name: `it_helpdesk_ticket_routing_openenv`
|
| 48 |
- App environment name: `it_helpdesk_ticket_routing`
|
| 49 |
|
| 50 |
## Frozen Runtime Vocabulary
|
| 51 |
|
| 52 |
### Fields
|
|
@@ -145,11 +173,30 @@ On each step, the environment:
|
|
| 145 |
|
| 146 |
Returns the internal state snapshot for debugging or inspection.
|
| 147 |
|
| 148 |
## Task Design
|
| 149 |
|
| 150 |
### Task 1: Issue Type Classification
|
| 151 |
|
| 152 |
-
The agent
|
| 153 |
|
| 154 |
- `issue_type`
|
| 155 |
|
|
@@ -257,7 +304,7 @@ It supports:
|
|
| 257 |
|
| 258 |
## Validation Notes
|
| 259 |
|
| 260 |
-
The repo has
|
| 261 |
|
| 262 |
### April 2 consistency pass
|
| 263 |
|
|
@@ -273,7 +320,7 @@ What needed to agree:
|
|
| 273 |
|
| 274 |
### April 3 and April 4 runtime-feedback pass
|
| 275 |
|
| 276 |
-
The first local runtime pass
|
| 277 |
|
| 278 |
- `data/dataset.json` was saved with a UTF-8 BOM, which caused `json.load()` to fail during environment creation on Windows
|
| 279 |
|
|
@@ -288,13 +335,13 @@ The local heuristic baseline completed successfully after that fix with:
|
|
| 288 |
|
| 289 |
A merged-state rerun on the current `main` branch matched those same numbers exactly.
|
| 290 |
|
| 291 |
-
## April 6
|
| 292 |
|
| 293 |
-
An April 6
|
| 294 |
|
| 295 |
-
- all required runtime, data, metadata, and documentation files are present
|
| 296 |
- the docs consistently describe IT helpdesk ticket routing rather than the old email-triage domain
|
| 297 |
-
- the current local benchmark reference is `1.0000`, `0.8800`, `0.9400`, overall `0.9400`
|
| 298 |
- the remaining work is execution validation, not documentation cleanup
|
| 299 |
|
| 300 |
## What Still Needs Hands-On Verification
|
|
@@ -312,7 +359,9 @@ If you come back to this repo later, remember:
|
|
| 312 |
|
| 313 |
- the domain is IT helpdesk ticket routing
|
| 314 |
- the environment is a short queue, not a single-shot classifier
|
|
|
|
|
|
|
| 315 |
- the agent predicts structured routing fields
|
| 316 |
-
-
|
| 317 |
-
-
|
| 318 |
- merged-state local validation is complete, and Docker is the main remaining hands-on check
|
|
|
|
| 1 |
# IT Helpdesk Ticket Routing OpenEnv - Knowledge Guide
|
| 2 |
|
| 3 |
+
## What This Repo Needs To Prove
|
| 4 |
|
| 5 |
The judges want a real-world environment that follows the OpenEnv pattern and can be understood quickly.
|
| 6 |
|
|
|
|
| 14 |
6. a baseline `inference.py`
|
| 15 |
7. Docker and metadata that are easy to rerun
|
| 16 |
|
| 17 |
+
## Why This Domain Fits
|
| 18 |
|
| 19 |
+
IT helpdesk routing is a strong hackathon fit because it is:
|
| 20 |
|
| 21 |
- realistic
|
| 22 |
- structured
|
|
|
|
| 32 |
|
| 33 |
## Judge-Facing Explanation
|
| 34 |
|
| 35 |
+
If a judge asks why this environment is strong, the concise answer is:
|
| 36 |
|
| 37 |
1. IT helpdesk routing is a real operational workflow with clear business value.
|
| 38 |
2. The input is realistic free-form ticket text, but the output is typed and easy to grade deterministically.
|
| 39 |
3. The three-task ladder creates a clean progression from basic classification to full queue routing.
|
| 40 |
+
4. The repo stays judge-friendly because the vocabulary, task labels, and scoring rules are explicit and frozen.
|
| 41 |
|
| 42 |
## Frozen Project Identity
|
| 43 |
|
|
|
|
| 47 |
- OpenEnv name: `it_helpdesk_ticket_routing_openenv`
|
| 48 |
- App environment name: `it_helpdesk_ticket_routing`
|
| 49 |
|
| 50 |
+
## Practical Mental Model
|
| 51 |
+
|
| 52 |
+
```text
|
| 53 |
+
inference.py
|
| 54 |
+
|
|
| 55 |
+
v
|
| 56 |
+
client.py <----> server/app.py
|
| 57 |
+
|
|
| 58 |
+
v
|
| 59 |
+
server/environment.py
|
| 60 |
+
| | |
|
| 61 |
+
v v v
|
| 62 |
+
grader.py reward.py tasks.py
|
| 63 |
+
|
|
| 64 |
+
v
|
| 65 |
+
data/dataset.json
|
| 66 |
+
```
|
| 67 |
+
|
| 68 |
+
The repo is a small OpenEnv stack:
|
| 69 |
+
|
| 70 |
+
- `inference.py` drives episodes
|
| 71 |
+
- `client.py` talks to the app
|
| 72 |
+
- `server/environment.py` manages queue state and episode flow
|
| 73 |
+
- `server/grader.py` scores actions
|
| 74 |
+
- `server/reward.py` computes step and final reward behavior
|
| 75 |
+
- `server/tasks.py` defines the task ladder and loads the dataset
|
| 76 |
+
- `data/dataset.json` stores the labeled helpdesk tickets
|
| 77 |
+
|
| 78 |
## Frozen Runtime Vocabulary
|
| 79 |
|
| 80 |
### Fields
|
|
|
|
| 173 |
|
| 174 |
Returns the internal state snapshot for debugging or inspection.
|
| 175 |
|
| 176 |
+
## Observation And State At A Glance
|
| 177 |
+
|
| 178 |
+
The observation exposes:
|
| 179 |
+
|
| 180 |
+
- task metadata
|
| 181 |
+
- the current ticket
|
| 182 |
+
- queue progress counters
|
| 183 |
+
- history
|
| 184 |
+
- reward and done status
|
| 185 |
+
|
| 186 |
+
The state tracks:
|
| 187 |
+
|
| 188 |
+
- current task
|
| 189 |
+
- seed
|
| 190 |
+
- queue ticket IDs
|
| 191 |
+
- current ticket index
|
| 192 |
+
- per-ticket scores
|
| 193 |
+
- total reward
|
| 194 |
+
|
| 195 |
## Task Design
|
| 196 |
|
| 197 |
### Task 1: Issue Type Classification
|
| 198 |
|
| 199 |
+
The agent predicts:
|
| 200 |
|
| 201 |
- `issue_type`
|
| 202 |
|
|
|
|
| 304 |
|
| 305 |
## Validation Notes
|
| 306 |
|
| 307 |
+
The repo has already gone through two useful validation phases.
|
| 308 |
|
| 309 |
### April 2 consistency pass
|
| 310 |
|
|
|
|
| 320 |
|
| 321 |
### April 3 and April 4 runtime-feedback pass
|
| 322 |
|
| 323 |
+
The first local runtime pass surfaced one practical issue:
|
| 324 |
|
| 325 |
- `data/dataset.json` was saved with a UTF-8 BOM, which caused `json.load()` to fail during environment creation on Windows
|
| 326 |
|
|
|
|
| 335 |
|
| 336 |
A merged-state rerun on the current `main` branch matched those same numbers exactly.
|
| 337 |
|
| 338 |
+
### April 6 repo audit
|
| 339 |
|
| 340 |
+
An April 6 audit confirmed:
|
| 341 |
|
| 342 |
+
- all required runtime, data, metadata, and documentation files are present
|
| 343 |
- the docs consistently describe IT helpdesk ticket routing rather than the old email-triage domain
|
| 344 |
+
- the current local benchmark reference is still `1.0000`, `0.8800`, `0.9400`, overall `0.9400`
|
| 345 |
- the remaining work is execution validation, not documentation cleanup
|
| 346 |
|
| 347 |
## What Still Needs Hands-On Verification
|
|
|
|
| 359 |
|
| 360 |
- the domain is IT helpdesk ticket routing
|
| 361 |
- the environment is a short queue, not a single-shot classifier
|
| 362 |
+
- the architecture is a compact OpenEnv stack
|
| 363 |
+
- one ticket is shown at a time
|
| 364 |
- the agent predicts structured routing fields
|
| 365 |
+
- the grader gives deterministic partial credit
|
| 366 |
+
- `inference.py` is the baseline agent runner
|
| 367 |
- merged-state local validation is complete, and Docker is the main remaining hands-on check
|
MENTAL_MODEL.md
DELETED
|
@@ -1,173 +0,0 @@
|
|
| 1 |
-
# IT Helpdesk Ticket Routing Mental Model
|
| 2 |
-
|
| 3 |
-
This file is the practical mental model of the repo in its current form.
|
| 4 |
-
|
| 5 |
-
## What The Project Is
|
| 6 |
-
|
| 7 |
-
This repository is an OpenEnv environment for IT helpdesk ticket routing.
|
| 8 |
-
|
| 9 |
-
The environment presents a small queue of tickets. For each ticket, the agent must decide:
|
| 10 |
-
|
| 11 |
-
- issue type
|
| 12 |
-
- priority
|
| 13 |
-
- assignment group
|
| 14 |
-
- resolution action
|
| 15 |
-
|
| 16 |
-
## Main Runtime Flow
|
| 17 |
-
|
| 18 |
-
```text
|
| 19 |
-
inference.py
|
| 20 |
-
|
|
| 21 |
-
v
|
| 22 |
-
client.py <----> server/app.py
|
| 23 |
-
|
|
| 24 |
-
v
|
| 25 |
-
server/environment.py
|
| 26 |
-
| | |
|
| 27 |
-
v v v
|
| 28 |
-
grader.py reward.py tasks.py
|
| 29 |
-
|
|
| 30 |
-
v
|
| 31 |
-
data/dataset.json
|
| 32 |
-
```
|
| 33 |
-
|
| 34 |
-
## Main Files
|
| 35 |
-
|
| 36 |
-
- `models.py`
|
| 37 |
-
Typed models for tickets, actions, observations, and state.
|
| 38 |
-
|
| 39 |
-
- `server/environment.py`
|
| 40 |
-
Main environment engine.
|
| 41 |
-
|
| 42 |
-
- `server/grader.py`
|
| 43 |
-
Deterministic partial-credit scorer.
|
| 44 |
-
|
| 45 |
-
- `server/reward.py`
|
| 46 |
-
Step and trajectory reward helpers.
|
| 47 |
-
|
| 48 |
-
- `server/tasks.py`
|
| 49 |
-
Task definitions and dataset loading.
|
| 50 |
-
|
| 51 |
-
- `client.py`
|
| 52 |
-
Typed client used for multi-step interaction.
|
| 53 |
-
|
| 54 |
-
- `inference.py`
|
| 55 |
-
Baseline runner with LLM mode and heuristic mode.
|
| 56 |
-
|
| 57 |
-
## Task Ladder
|
| 58 |
-
|
| 59 |
-
### Task 1
|
| 60 |
-
|
| 61 |
-
- predict `issue_type`
|
| 62 |
-
|
| 63 |
-
### Task 2
|
| 64 |
-
|
| 65 |
-
- predict `issue_type`
|
| 66 |
-
- predict `priority`
|
| 67 |
-
|
| 68 |
-
### Task 3
|
| 69 |
-
|
| 70 |
-
- predict `issue_type`
|
| 71 |
-
- predict `priority`
|
| 72 |
-
- predict `assignment_group`
|
| 73 |
-
- predict `resolution_action`
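
The task ladder above can be sketched as a lookup from task to required fields. This is only an illustration: the integer task IDs, the `TASK_FIELDS` name, and the `required_fields` helper are assumptions, not the contents of `server/tasks.py`.

```python
# Hypothetical mapping from task ID to the fields the agent must predict.
# Field names come from the task ladder; the integer IDs are an assumption.
TASK_FIELDS = {
    1: ("issue_type",),
    2: ("issue_type", "priority"),
    3: ("issue_type", "priority", "assignment_group", "resolution_action"),
}


def required_fields(task_id: int) -> tuple:
    """Return the fields graded for a task, failing clearly on unknown IDs."""
    try:
        return TASK_FIELDS[task_id]
    except KeyError:
        raise ValueError(f"unsupported task id: {task_id}") from None
```

Each task's field set is a superset of the previous one, which is what makes the ladder a progression rather than three unrelated tasks.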
|
| 74 |
-
|
| 75 |
-
## Label Vocabulary
|
| 76 |
-
|
| 77 |
-
### Issue types
|
| 78 |
-
|
| 79 |
-
- `billing_license`
|
| 80 |
-
- `identity_access`
|
| 81 |
-
- `application_support`
|
| 82 |
-
- `service_request`
|
| 83 |
-
- `spam_phishing`
|
| 84 |
-
- `general_inquiry`
|
| 85 |
-
- `security_compliance`
|
| 86 |
-
- `onboarding`
|
| 87 |
-
- `feature_request`
|
| 88 |
-
|
| 89 |
-
### Assignment groups
|
| 90 |
-
|
| 91 |
-
- `license_ops`
|
| 92 |
-
- `service_desk`
|
| 93 |
-
- `application_team`
|
| 94 |
-
- `procurement`
|
| 95 |
-
- `security_team`
|
| 96 |
-
- `onboarding_ops`
|
| 97 |
-
|
| 98 |
-
### Resolution actions
|
| 99 |
-
|
| 100 |
-
- `fulfill`
|
| 101 |
-
- `escalate`
|
| 102 |
-
- `assign`
|
| 103 |
-
- `ignore`
|
| 104 |
-
- `acknowledge`
|
| 105 |
-
|
| 106 |
-
## Observation And State
|
| 107 |
-
|
| 108 |
-
The observation exposes:
|
| 109 |
-
|
| 110 |
-
- task metadata
|
| 111 |
-
- the current ticket
|
| 112 |
-
- queue progress counters
|
| 113 |
-
- history
|
| 114 |
-
- reward and done status
|
| 115 |
-
|
| 116 |
-
The state tracks:
|
| 117 |
-
|
| 118 |
-
- current task
|
| 119 |
-
- seed
|
| 120 |
-
- queue ticket IDs
|
| 121 |
-
- current ticket index
|
| 122 |
-
- per-ticket scores
|
| 123 |
-
- total reward
|
| 124 |
-
|
| 125 |
-
## Reward Logic
|
| 126 |
-
|
| 127 |
-
- each step returns the current ticket score
|
| 128 |
-
- the final reward is the average of per-ticket scores
|
| 129 |
-
- a small overshoot penalty exists as a safeguard
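
A rough sketch of the reward rules listed above, under stated assumptions. The averaging of per-ticket scores is from the notes; the shape of the overshoot penalty (a flat deduction per extra step), its default value, and the function names are hypothetical and may differ from `server/reward.py`.

```python
from statistics import fmean


def step_reward(ticket_score: float) -> float:
    """Each step returns the current ticket score, clamped to [0.0, 1.0]."""
    return min(1.0, max(0.0, ticket_score))


def final_reward(per_ticket_scores, extra_steps: int = 0,
                 overshoot_penalty: float = 0.05) -> float:
    """Average the per-ticket scores, then apply a small overshoot penalty.

    The penalty form is an assumption: a flat deduction for every step
    taken beyond the end of the queue.
    """
    if not per_ticket_scores:
        return 0.0
    base = fmean(per_ticket_scores)
    penalized = base - overshoot_penalty * extra_steps
    # Keep the trajectory reward bounded like the per-ticket scores.
    return min(1.0, max(0.0, penalized))
```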
|
| 130 |
-
|
| 131 |
-
## Runtime Notes
|
| 132 |
-
|
| 133 |
-
The repo has now passed both the initial local heuristic run and a merged-state rerun on the current `main` branch.
|
| 134 |
-
|
| 135 |
-
Current local baseline:
|
| 136 |
-
|
| 137 |
-
- Task 1: `1.0000`
|
| 138 |
-
- Task 2: `0.8800`
|
| 139 |
-
- Task 3: `0.9400`
|
| 140 |
-
- Overall: `0.9400`
|
| 141 |
-
|
| 142 |
-
The merged-state rerun matched the same baseline numbers exactly.
|
| 143 |
-
|
| 144 |
-
One practical implementation note from runtime validation:
|
| 145 |
-
|
| 146 |
-
- `data/dataset.json` may be saved with a UTF-8 BOM on Windows, so `server/tasks.py` intentionally loads it with `utf-8-sig`
|
| 147 |
-
|
| 148 |
-
## Dataset Shape
|
| 149 |
-
|
| 150 |
-
Each record includes:
|
| 151 |
-
|
| 152 |
-
- `ticket_id`
|
| 153 |
-
- `title`
|
| 154 |
-
- `requester`
|
| 155 |
-
- `description`
|
| 156 |
-
- `issue_type`
|
| 157 |
-
- `priority`
|
| 158 |
-
- `assignment_group`
|
| 159 |
-
- `resolution_action`
|
| 160 |
-
- optional `ambiguity_note`
|
| 161 |
-
- optional `related_ticket_id`
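
The record shape above could be checked with a small validator like the following. The key names are the ones listed; the `missing_keys` helper and the sample record are hypothetical and only illustrate the required versus optional split.

```python
from typing import List

# Keys every record must carry, per the dataset shape listed above.
REQUIRED_KEYS = (
    "ticket_id", "title", "requester", "description",
    "issue_type", "priority", "assignment_group", "resolution_action",
)
# Keys a record may carry but does not have to.
OPTIONAL_KEYS = ("ambiguity_note", "related_ticket_id")


def missing_keys(record: dict) -> List[str]:
    """Return the required keys that are absent from a ticket record."""
    return [key for key in REQUIRED_KEYS if key not in record]


# Made-up sample record for illustration only.
sample = {
    "ticket_id": "T-1", "title": "Cannot log in", "requester": "a.user",
    "description": "Password reset loop.", "issue_type": "identity_access",
    "priority": "high", "assignment_group": "service_desk",
    "resolution_action": "assign",
}
```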
|
| 162 |
-
|
| 163 |
-
## Short Version
|
| 164 |
-
|
| 165 |
-
If coming back later, remember this:
|
| 166 |
-
|
| 167 |
-
- the repo is a helpdesk ticket router
|
| 168 |
-
- the architecture is a small OpenEnv stack
|
| 169 |
-
- one ticket is shown at a time
|
| 170 |
-
- the agent predicts structured routing fields
|
| 171 |
-
- the grader gives deterministic partial credit
|
| 172 |
-
- `inference.py` is the baseline agent runner
|
| 173 |
-
- the local heuristic path now works end to end on the current merged repo state
|

PLAN.md
DELETED
|
@@ -1,147 +0,0 @@
|
|
| 1 |
-
# IT Helpdesk Ticket Routing OpenEnv - Project Plan
|
| 2 |
-
|
| 3 |
-
## Project Goal
|
| 4 |
-
|
| 5 |
-
Build a polished OpenEnv environment for IT helpdesk ticket routing that satisfies:
|
| 6 |
-
|
| 7 |
-
- real-world utility
|
| 8 |
-
- strong task and grader quality
|
| 9 |
-
- clean environment design
|
| 10 |
-
- OpenEnv spec compliance
|
| 11 |
-
- reproducible baseline inference
|
| 12 |
-
- Docker and Hugging Face deployment readiness
|
| 13 |
-
|
| 14 |
-
## Current Product Definition
|
| 15 |
-
|
| 16 |
-
The environment simulates a helpdesk queue. An agent receives one ticket at a time and predicts:
|
| 17 |
-
|
| 18 |
-
- `issue_type`
|
| 19 |
-
- `priority`
|
| 20 |
-
- `assignment_group`
|
| 21 |
-
- `resolution_action`
|
| 22 |
-
|
| 23 |
-
The project keeps three tasks:
|
| 24 |
-
|
| 25 |
-
1. Issue Type Classification
|
| 26 |
-
2. Issue Type And Priority
|
| 27 |
-
3. Full Ticket Routing
|
| 28 |
-
|
| 29 |
-
## What Must Be True At Submission
|
| 30 |
-
|
| 31 |
-
### Pass or fail requirements
|
| 32 |
-
|
| 33 |
-
- the environment responds correctly
|
| 34 |
-
- OpenEnv metadata is valid
|
| 35 |
-
- `reset()`, `step()`, and `state()` work
|
| 36 |
-
- there are at least 3 tasks
|
| 37 |
-
- graders return scores in `[0.0, 1.0]`
|
| 38 |
-
- `inference.py` runs and prints reproducible results
|
| 39 |
-
- Docker builds and starts cleanly
|
| 40 |
-
|
| 41 |
-
### Scored requirements
|
| 42 |
-
|
| 43 |
-
- the task should clearly feel like real helpdesk work
|
| 44 |
-
- the hard task should require meaningful reasoning
|
| 45 |
-
- partial credit should be useful and deterministic
|
| 46 |
-
- docs should be clear enough for judges to understand quickly
|
| 47 |
-
|
| 48 |
-
## Core Files
|
| 49 |
-
|
| 50 |
-
### Runtime
|
| 51 |
-
|
| 52 |
-
- `models.py`
|
| 53 |
-
- `server/environment.py`
|
| 54 |
-
- `server/grader.py`
|
| 55 |
-
- `server/reward.py`
|
| 56 |
-
- `server/tasks.py`
|
| 57 |
-
- `server/app.py`
|
| 58 |
-
- `client.py`
|
| 59 |
-
- `inference.py`
|
| 60 |
-
|
| 61 |
-
### Data and metadata
|
| 62 |
-
|
| 63 |
-
- `data/dataset.json`
|
| 64 |
-
- `openenv.yaml`
|
| 65 |
-
- `server/Dockerfile`
|
| 66 |
-
- `pyproject.toml`
|
| 67 |
-
- `requirements.txt`
|
| 68 |
-
|
| 69 |
-
### Docs
|
| 70 |
-
|
| 71 |
-
- `README.md`
|
| 72 |
-
- `KNOWLEDGE.md`
|
| 73 |
-
- `MENTAL_MODEL.md`
|
| 74 |
-
|
| 75 |
-
## Technical Priorities
|
| 76 |
-
|
| 77 |
-
### P0
|
| 78 |
-
|
| 79 |
-
1. keep the environment behavior correct
|
| 80 |
-
2. verify the task definitions and graders
|
| 81 |
-
3. make the baseline script reliable
|
| 82 |
-
4. confirm dataset coverage and label consistency
|
| 83 |
-
|
| 84 |
-
### P1
|
| 85 |
-
|
| 86 |
-
1. validate Docker
|
| 87 |
-
2. validate deployment assumptions
|
| 88 |
-
3. record baseline scores
|
| 89 |
-
4. polish docs
|
| 90 |
-
|
| 91 |
-
### P2
|
| 92 |
-
|
| 93 |
-
1. strengthen ticket wording for realism
|
| 94 |
-
2. expand hard-case examples if needed
|
| 95 |
-
3. remove low-signal artifacts from the repo
|
| 96 |
-
|
| 97 |
-
## Quality Checks To Perform
|
| 98 |
-
|
| 99 |
-
### Environment
|
| 100 |
-
|
| 101 |
-
- reset starts a clean episode
|
| 102 |
-
- each step advances the queue correctly
|
| 103 |
-
- the final step returns trajectory reward
|
| 104 |
-
- state reflects the real internal status
|
| 105 |
-
|
| 106 |
-
### Grader
|
| 107 |
-
|
| 108 |
-
- exact matches score `1.0`
|
| 109 |
-
- near misses get partial credit where intended
|
| 110 |
-
- unsupported task IDs fail clearly
|
| 111 |
-
- scores vary across examples
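
A sketch of how the grader checks above might look in code, under loud assumptions. The similarity pair and its `0.5` value, the equal weighting across fields, and exact matching for `priority` are all hypothetical; the real rules live in `server/grader.py`.

```python
# Hypothetical near-miss table: unordered label pair -> partial credit.
ISSUE_TYPE_SIMILARITY = {
    frozenset({"identity_access", "security_compliance"}): 0.5,
}


def score_issue_type(predicted: str, expected: str) -> float:
    """Exact match scores 1.0; a listed near miss earns partial credit."""
    if predicted == expected:
        return 1.0
    return ISSUE_TYPE_SIMILARITY.get(frozenset({predicted, expected}), 0.0)


def grade(action: dict, expected: dict, fields: tuple) -> float:
    """Average per-field scores so the result stays in [0.0, 1.0]."""
    if not fields:
        raise ValueError("no fields to grade")
    scores = []
    for field in fields:
        if field == "issue_type":
            scores.append(score_issue_type(action.get(field), expected[field]))
        else:
            # Assumption: every other field is graded by exact match.
            scores.append(1.0 if action.get(field) == expected[field] else 0.0)
    return sum(scores) / len(scores)
```

Because the table is a fixed lookup, the same action and label always produce the same score, which keeps the partial credit deterministic.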
|
| 112 |
-
|
| 113 |
-
### Inference
|
| 114 |
-
|
| 115 |
-
- heuristic mode works without model credentials
|
| 116 |
-
- LLM mode reads `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`
|
| 117 |
-
- output is reproducible when the seed is fixed
|
| 118 |
-
|
| 119 |
-
### Docs
|
| 120 |
-
|
| 121 |
-
- no outdated domain references remain
|
| 122 |
-
- team and project metadata are correct
|
| 123 |
-
- setup and run instructions are accurate
|
| 124 |
-
|
| 125 |
-
## Risks
|
| 126 |
-
|
| 127 |
-
### Runtime risk
|
| 128 |
-
|
| 129 |
-
The first local execution pass and a merged-state rerun have already completed successfully. The remaining runtime risk is Docker and clean-machine behavior, not first-pass local execution.
|
| 130 |
-
|
| 131 |
-
### Benchmark risk
|
| 132 |
-
|
| 133 |
-
The current merged-state local benchmark has already been recorded. The remaining benchmark risk is making sure Docker or clean-machine validation does not surface a late behavioral mismatch.
|
| 134 |
-
|
| 135 |
-
### Deployment risk
|
| 136 |
-
|
| 137 |
-
Docker and Hugging Face behavior should be validated before the final submission window.
|
| 138 |
-
|
| 139 |
-
## Definition Of Done
|
| 140 |
-
|
| 141 |
-
The project is ready when:
|
| 142 |
-
|
| 143 |
-
1. the environment runs locally end to end
|
| 144 |
-
2. the heuristic baseline runs successfully
|
| 145 |
-
3. Docker build and run both succeed
|
| 146 |
-
4. the docs are clean, current, and submission-ready
|
| 147 |
-
5. the repo clearly presents Hackstreet Boys as the team
|
PROJECT_STATUS.md
CHANGED
|
@@ -136,7 +136,7 @@ Roopal-side work completed:
|
|
| 136 |
- updated `README.md` to reflect the first local runtime pass
|
| 137 |
- recorded the current heuristic baseline in repo docs as a working, non-final benchmark
|
| 138 |
- updated `KNOWLEDGE.md` to distinguish consistency validation from runtime validation
|
| 139 |
-
- updated
|
| 140 |
|
| 141 |
Documentation fixes made from runtime feedback:
|
| 142 |
|
|
@@ -182,7 +182,7 @@ Roopal-side work completed:
|
|
| 182 |
|
| 183 |
- audited required submission files and confirmed they are present in the repo
|
| 184 |
- completed a stale-claims and outdated-wording pass across the core docs
|
| 185 |
-
- updated `
|
| 186 |
- left the remaining work focused on Docker and clean-machine validation rather than documentation cleanup
|
| 187 |
|
| 188 |
## Open Items
|
|
|
|
| 136 |
- updated `README.md` to reflect the first local runtime pass
|
| 137 |
- recorded the current heuristic baseline in repo docs as a working, non-final benchmark
|
| 138 |
- updated `KNOWLEDGE.md` to distinguish consistency validation from runtime validation
|
| 139 |
+
- updated the runtime mental-model notes later merged into `KNOWLEDGE.md`, including the Windows BOM handling detail
|
| 140 |
|
| 141 |
Documentation fixes made from runtime feedback:
|
| 142 |
|
|
|
|
| 182 |
|
| 183 |
- audited required submission files and confirmed they are present in the repo
|
| 184 |
- completed a stale-claims and outdated-wording pass across the core docs
|
| 185 |
+
- updated the planning / requirements doc later consolidated into `required.md` to reflect that first-pass local execution is no longer the main runtime risk
|
| 186 |
- left the remaining work focused on Docker and clean-machine validation rather than documentation cleanup
|
| 187 |
|
| 188 |
## Open Items
|
README.md
CHANGED
|
@@ -212,8 +212,7 @@ pyproject.toml
|
|
| 212 |
requirements.txt
|
| 213 |
README.md
|
| 214 |
KNOWLEDGE.md
|
| 215 |
-
|
| 216 |
-
MENTAL_MODEL.md
|
| 217 |
ROADMAP.md
|
| 218 |
```
|
| 219 |
|
|
@@ -355,7 +354,7 @@ An April 6 repo audit also confirmed that all required submission files are pres
|
|
| 355 |
|
| 356 |
- runtime: `models.py`, `client.py`, `inference.py`, `server/app.py`, `server/environment.py`, `server/grader.py`, `server/reward.py`, `server/tasks.py`
|
| 357 |
- data and metadata: `data/dataset.json`, `openenv.yaml`, `pyproject.toml`, `requirements.txt`, `server/Dockerfile`
|
| 358 |
-
- docs and planning: `README.md`, `KNOWLEDGE.md`, `
|
| 359 |
|
| 360 |
Still pending before final submission:
|
| 361 |
|
|
|
|
| 212 |
requirements.txt
|
| 213 |
README.md
|
| 214 |
KNOWLEDGE.md
|
| 215 |
+
required.md
|
|
|
|
| 216 |
ROADMAP.md
|
| 217 |
```
|
| 218 |
|
|
|
|
| 354 |
|
| 355 |
- runtime: `models.py`, `client.py`, `inference.py`, `server/app.py`, `server/environment.py`, `server/grader.py`, `server/reward.py`, `server/tasks.py`
|
| 356 |
- data and metadata: `data/dataset.json`, `openenv.yaml`, `pyproject.toml`, `requirements.txt`, `server/Dockerfile`
|
| 357 |
+
- docs and planning: `README.md`, `KNOWLEDGE.md`, `required.md`, `PROJECT_STATUS.md`, `ROADMAP.md`
|
| 358 |
|
| 359 |
Still pending before final submission:
|
| 360 |
|
ROADMAP.md
CHANGED
|
@@ -12,9 +12,9 @@
|
|
| 12 |
|
| 13 |
- `PROJECT_STATUS.md` is the canonical log of completed work.
|
| 14 |
- This roadmap is the remaining execution plan from the current repo state to final submission.
|
| 15 |
-
- `
|
| 16 |
- `KNOWLEDGE.md` defines the current repo truth and judge-facing explanation.
|
| 17 |
-
- `analysis/comp.md`
|
| 18 |
|
| 19 |
## What We Are Optimizing For
|
| 20 |
|
|
@@ -34,7 +34,7 @@ The highest-value wins from now to submission are:
|
|
| 34 |
- do this as an audit / evidence layer, not as a late dataset merge
|
| 35 |
|
| 36 |
4. **Submission readiness**
|
| 37 |
-
- satisfy every requirement from `
|
| 38 |
- keep the repo easy for judges to understand and rerun
|
| 39 |
|
| 40 |
## Current Repo State
|
|
@@ -57,14 +57,19 @@ The remaining work should be treated as targeted strengthening, not broad featur
|
|
| 57 |
|
| 58 |
## Submission Gates That Must Still Hold
|
| 59 |
|
| 60 |
-
These come directly from `
|
| 61 |
|
| 62 |
- the environment starts correctly
|
| 63 |
- `reset()`, `step()`, and `state()` behave correctly
|
| 64 |
- 3 tasks exist and remain meaningfully different
|
| 65 |
- grader scores stay in `[0.0, 1.0]`
|
| 66 |
- `inference.py` runs reproducibly without crashing
|
|
|
|
|
|
|
|
|
|
| 67 |
- Docker builds and starts cleanly
|
|
|
|
|
|
|
| 68 |
- docs and metadata are current
|
| 69 |
- the repo is easy for judges to understand and rerun
|
| 70 |
|
|
@@ -80,6 +85,7 @@ These come directly from `PLAN.md` and `KNOWLEDGE.md`:
|
|
| 80 |
- add only safe RL-oriented improvements
|
| 81 |
- add external grounding evidence without changing the runtime dataset
|
| 82 |
- finish packaging / deployment readiness
|
|
|
|
| 83 |
|
| 84 |
### Do Not Do Before Submission
|
| 85 |
|
|
@@ -108,7 +114,7 @@ Because we are using Codex to generate code, we should optimize for small, bound
|
|
| 108 |
|
| 109 |
**Window:** April 3 to April 4
|
| 110 |
|
| 111 |
-
**Goal:** eliminate the biggest competitive weakness identified in `analysis/comp.md` and `analysis/
|
| 112 |
|
| 113 |
### Must produce
|
| 114 |
|
|
@@ -176,7 +182,7 @@ Because we are using Codex to generate code, we should optimize for small, bound
|
|
| 176 |
- assignment group and resolution action remain exact
|
| 177 |
- final episode reward stays bounded and deterministic
|
| 178 |
|
| 179 |
-
### Safe improvement candidates from `analysis/
|
| 180 |
|
| 181 |
- expand `ISSUE_TYPE_SIMILARITY` with only a few defensible pairs, if backed by grounding review
|
| 182 |
- enrich `history` with:
|
|
@@ -231,14 +237,17 @@ Because we are using Codex to generate code, we should optimize for small, bound
|
|
| 231 |
|
| 232 |
**Window:** April 6 to April 7
|
| 233 |
|
| 234 |
-
**Goal:** close the submission-readiness gaps surfaced in `analysis/comp_know.md`
|
| 235 |
|
| 236 |
### Must produce
|
| 237 |
|
| 238 |
- Hugging Face Spaces README frontmatter
|
| 239 |
- `.openenvignore`
|
|
|
|
| 240 |
- Docker smoke evidence on the merged branch
|
| 241 |
- one clean-copy rerun if possible
|
|
|
|
|
|
|
| 242 |
|
| 243 |
### Nice-to-have only if green
|
| 244 |
|
|
@@ -264,6 +273,7 @@ Because we are using Codex to generate code, we should optimize for small, bound
|
|
| 264 |
- no runtime refactors
|
| 265 |
- no dataset edits unless they fix a blocker
|
| 266 |
- stop risky edits several hours before submission
|
|
|
|
| 267 |
|
| 268 |
## Ownership From Now Until Submission
|
| 269 |
|
|
@@ -276,7 +286,6 @@ Primary files:
|
|
| 276 |
- `server/grader.py`
|
| 277 |
- `README.md`
|
| 278 |
- `KNOWLEDGE.md`
|
| 279 |
-
- `MENTAL_MODEL.md`
|
| 280 |
|
| 281 |
Primary responsibilities:
|
| 282 |
|
|
@@ -293,6 +302,7 @@ Concrete deliverables:
|
|
| 293 |
- any similarity-matrix update, if justified
|
| 294 |
- doc updates if benchmark numbers or scoring explanation change
|
| 295 |
- README frontmatter and judge-facing clarity
|
|
|
|
| 296 |
|
| 297 |
### Suyash ownership
|
| 298 |
|
|
@@ -326,6 +336,7 @@ Concrete deliverables:
|
|
| 326 |
- `.openenvignore`
|
| 327 |
- Docker smoke confirmation
|
| 328 |
- clean-copy rerun if possible
|
|
|
|
| 329 |
|
| 330 |
### Shared responsibilities
|
| 331 |
|
|
@@ -335,6 +346,7 @@ Concrete deliverables:
|
|
| 335 |
- use the GitHub Actions Docker smoke workflow when local Docker is blocked
|
| 336 |
- review Codex-generated diffs before accepting them
|
| 337 |
- freeze feature work by the end of April 7
|
|
|
|
| 338 |
|
| 339 |
## Date-By-Date Execution Plan
|
| 340 |
|
|
@@ -355,6 +367,7 @@ Suyash:
|
|
| 355 |
- scaffold `tests/`
|
| 356 |
- begin smoke tests for `reset()`, `step()`, `state()`, and deterministic seeded behavior
|
| 357 |
- confirm how integration tests will hit the app cleanly
|
|
|
|
| 358 |
|
| 359 |
Shared checkpoint:
|
| 360 |
|
|
@@ -377,6 +390,7 @@ Suyash:
|
|
| 377 |
|
| 378 |
- complete smoke tests
|
| 379 |
- add first-pass integration tests for `/health`, `/tasks`, `/reset`, and `/step`
|
|
|
|
| 380 |
|
| 381 |
Shared checkpoint:
|
| 382 |
|
|
@@ -400,6 +414,7 @@ Suyash:
|
|
| 400 |
- add integration coverage for full seeded episode flow and `state()`
|
| 401 |
- add a light heuristic regression path for `inference.py`
|
| 402 |
- optionally enrich observation history if tests are already green
|
|
|
|
| 403 |
|
| 404 |
Shared checkpoint:
|
| 405 |
|
|
@@ -424,6 +439,8 @@ Suyash:
|
|
| 424 |
- add `.openenvignore`
|
| 425 |
- verify Docker smoke workflow on the merged branch
|
| 426 |
- check deployment assumptions around `app_port`, `/docs`, `/health`, `/ws`, and `/web`
|
|
|
|
|
|
|
| 427 |
|
| 428 |
Shared checkpoint:
|
| 429 |
|
|
@@ -439,7 +456,7 @@ Primary goal:
|
|
| 439 |
|
| 440 |
Roopal:
|
| 441 |
|
| 442 |
-
- final docs consistency pass across `README.md`
|
| 443 |
- add a short TRL / GRPO usage example only if everything else is already green
|
| 444 |
|
| 445 |
Suyash:
|
|
@@ -466,6 +483,7 @@ Morning:
|
|
| 466 |
- run final smoke / test slice on the submission branch
|
| 467 |
- verify required files are present
|
| 468 |
- verify README and metadata are current
|
|
|
|
| 469 |
|
| 470 |
Afternoon:
|
| 471 |
|
|
@@ -493,6 +511,7 @@ Do not cut these:
|
|
| 493 |
3. Docker / deployment validation
|
| 494 |
4. grounding audit evidence
|
| 495 |
5. final benchmark sanity rerun if behavior changed
|
|
|
|
| 496 |
|
| 497 |
## Definition Of Done
|
| 498 |
|
|
@@ -502,9 +521,10 @@ The project is ready when:
|
|
| 502 |
2. scoring is demonstrably deterministic and not fuzzy by default
|
| 503 |
3. a grounding audit against real public support datasets exists
|
| 504 |
4. the heuristic baseline still runs successfully
|
| 505 |
-
5.
|
| 506 |
-
6.
|
| 507 |
-
7.
|
|
|
|
| 508 |
|
| 509 |
## Simple Rule To Remember
|
| 510 |
|
|
|
|
| 12 |
|
| 13 |
- `PROJECT_STATUS.md` is the canonical log of completed work.
|
| 14 |
- This roadmap is the remaining execution plan from the current repo state to final submission.
|
| 15 |
+
- `required.md` is now the combined official-requirements and project-compliance file.
|
| 16 |
- `KNOWLEDGE.md` defines the current repo truth and judge-facing explanation.
|
| 17 |
+
- `analysis/comp.md` and `analysis/comp_know.md` are internal competitive notes only. Use them to prioritize work, but do not mention competitor repos in public-facing docs.
|
| 18 |
|
| 19 |
## What We Are Optimizing For
|
| 20 |
|
|
|
|
| 34 |
- do this as an audit / evidence layer, not as a late dataset merge
|
| 35 |
|
| 36 |
4. **Submission readiness**
|
| 37 |
+
- satisfy every requirement from `required.md` and `KNOWLEDGE.md`
|
| 38 |
- keep the repo easy for judges to understand and rerun
|
| 39 |
|
| 40 |
## Current Repo State
|
|
|
|
| 57 |
|
| 58 |
## Submission Gates That Must Still Hold
|
| 59 |
|
| 60 |
+
These come directly from `required.md` and `KNOWLEDGE.md`:
|
| 61 |
|
| 62 |
- the environment starts correctly
|
| 63 |
- `reset()`, `step()`, and `state()` behave correctly
|
| 64 |
- 3 tasks exist and remain meaningfully different
|
| 65 |
- grader scores stay in `[0.0, 1.0]`
|
| 66 |
- `inference.py` runs reproducibly without crashing
|
| 67 |
+
- `inference.py` uses the OpenAI client with `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`
|
| 68 |
+
- structured stdout logs follow the official `[START]`, `[STEP]`, and `[END]` format
|
| 69 |
+
- `openenv validate` passes
|
| 70 |
- Docker builds and starts cleanly
|
| 71 |
+
- HF deployment responds cleanly and reset works
|
| 72 |
+
- inference stays inside the official runtime / machine envelope
|
| 73 |
- docs and metadata are current
|
| 74 |
- the repo is easy for judges to understand and rerun
|
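The `inference.py` gate above names three environment variables. A minimal sketch of failing fast when any of them is missing is shown below; the `InferenceConfig` class and `load_inference_config` helper are hypothetical names used only for illustration, not code from this repo.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceConfig:
    """Hypothetical container for the three settings named in the gate list."""

    api_base_url: str
    model_name: str
    hf_token: str


def load_inference_config(env=None):
    """Read the required variables and raise early if any is missing or empty."""
    env = os.environ if env is None else env
    required = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return InferenceConfig(
        api_base_url=env["API_BASE_URL"],
        model_name=env["MODEL_NAME"],
        hf_token=env["HF_TOKEN"],
    )
```

Under this sketch, the loaded values would then be handed to the OpenAI client (for example as its `base_url` and `api_key` arguments); whether `HF_TOKEN` is meant to serve as the API key should be confirmed against `required.md`.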
| 75 |
|
|
|
|
| 85 |
- add only safe RL-oriented improvements
|
| 86 |
- add external grounding evidence without changing the runtime dataset
|
| 87 |
- finish packaging / deployment readiness
|
| 88 |
+
- verify official validation constraints, not just local happy-path behavior
|
| 89 |
|
| 90 |
### Do Not Do Before Submission
|
| 91 |
|
|
|
|
| 114 |
|
| 115 |
**Window:** April 3 to April 4
|
| 116 |
|
| 117 |
+
**Goal:** eliminate the biggest competitive weakness identified in `analysis/comp.md` and `analysis/comp_know.md`: lack of checked-in tests.
|
| 118 |
|
| 119 |
### Must produce
|
| 120 |
|
|
|
|
| 182 |
- assignment group and resolution action remain exact
|
| 183 |
- final episode reward stays bounded and deterministic
|
| 184 |
|
| 185 |
+
### Safe improvement candidates from `analysis/comp_know.md`
|
| 186 |
|
| 187 |
- expand `ISSUE_TYPE_SIMILARITY` with only a few defensible pairs, if backed by grounding review
|
| 188 |
- enrich `history` with:
|
|
|
|
| 237 |
|
| 238 |
**Window:** April 6 to April 7
|
| 239 |
|
| 240 |
+
**Goal:** close the submission-readiness gaps surfaced in `analysis/comp_know.md`.
|
| 241 |
|
| 242 |
### Must produce
|
| 243 |
|
| 244 |
- Hugging Face Spaces README frontmatter
|
| 245 |
- `.openenvignore`
|
| 246 |
+
- `openenv validate` evidence
|
| 247 |
- Docker smoke evidence on the merged branch
|
| 248 |
- one clean-copy rerun if possible
|
| 249 |
+
- structured inference logging verified against the official format
|
| 250 |
+
- a practical check that inference remains inside the official runtime envelope
|
| 251 |
|
| 252 |
### Nice-to-have only if green
|
| 253 |
|
|
|
|
| 273 |
- no runtime refactors
|
| 274 |
- no dataset edits unless they fix a blocker
|
| 275 |
- stop risky edits several hours before submission
|
| 276 |
+
- if possible, run the official validator or the closest local equivalent before final push
|
| 277 |
|
| 278 |
## Ownership From Now Until Submission
|
| 279 |
|
|
|
|
| 286 |
- `server/grader.py`
|
| 287 |
- `README.md`
|
| 288 |
- `KNOWLEDGE.md`
|
|
|
|
| 289 |
|
| 290 |
Primary responsibilities:
|
| 291 |
|
|
|
|
| 302 |
- any similarity-matrix update, if justified
|
| 303 |
- doc updates if benchmark numbers or scoring explanation change
|
| 304 |
- README frontmatter and judge-facing clarity
|
| 305 |
+
- official requirement compliance review through `required.md`
|
| 306 |
|
| 307 |
### Suyash ownership
|
| 308 |
|
|
|
|
| 336 |
- `.openenvignore`
|
| 337 |
- Docker smoke confirmation
|
| 338 |
- clean-copy rerun if possible
|
| 339 |
+
- structured inference logging compliance
|
| 340 |
|
| 341 |
### Shared responsibilities
|
| 342 |
|
|
|
|
| 346 |
- use the GitHub Actions Docker smoke workflow when local Docker is blocked
|
| 347 |
- review Codex-generated diffs before accepting them
|
| 348 |
- freeze feature work by the end of April 7
|
| 349 |
+
- do not casually change the `[START]`, `[STEP]`, `[END]` inference log format once implemented
|
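One way to keep the `[START]`, `[STEP]`, `[END]` log format stable is to route every line through a single helper. The sketch below assumes each tag is followed by a compact JSON payload; that payload layout and the field names (`task_id`, `step`, `reward`, `steps`) are assumptions, since the official field layout is not reproduced in this document and should be taken from `required.md`.

```python
import json

ALLOWED_TAGS = ("START", "STEP", "END")


def format_log_line(tag, payload):
    """Return one stdout line: a bracketed tag followed by compact JSON."""
    if tag not in ALLOWED_TAGS:
        raise ValueError(f"unknown log tag: {tag}")
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return f"[{tag}] {body}"


def episode_log_lines(task_id, step_rewards):
    """Build the full line sequence for one episode (illustrative fields only)."""
    lines = [format_log_line("START", {"task_id": task_id})]
    for index, reward in enumerate(step_rewards, start=1):
        lines.append(format_log_line("STEP", {"step": index, "reward": reward}))
    lines.append(format_log_line("END", {"steps": len(step_rewards)}))
    return lines
```

Centralizing the formatting this way means a test can pin the exact strings, so an accidental format change shows up as a test failure rather than a validator rejection.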
| 350 |
|
| 351 |
## Date-By-Date Execution Plan
|
| 352 |
|
|
|
|
| 367 |
- scaffold `tests/`
|
| 368 |
- begin smoke tests for `reset()`, `step()`, `state()`, and deterministic seeded behavior
|
| 369 |
- confirm how integration tests will hit the app cleanly
|
| 370 |
+
- review `required.md` and identify the exact official validation items still not reflected in runtime / inference behavior
|
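The seeded smoke test mentioned above can be sketched as follows. `ToyQueueEnv` is a hypothetical stand-in that only mimics the `reset()` / `step()` / `state()` shape; the real environment class, its observation type, and its return values will differ.

```python
import random


class ToyQueueEnv:
    """Hypothetical stand-in exposing reset(), step(), and state()."""

    TICKETS = ["vpn outage", "password reset", "license renewal", "new laptop"]

    def reset(self, seed=None):
        # A private Random instance keeps the episode reproducible per seed.
        rng = random.Random(seed)
        self._queue = rng.sample(self.TICKETS, k=3)
        self._step_count = 0
        return self._queue[0]

    def step(self, action):
        self._step_count += 1
        done = self._step_count >= len(self._queue)
        observation = None if done else self._queue[self._step_count]
        return observation, done

    def state(self):
        return {"step_count": self._step_count, "queue_size": len(self._queue)}


def run_episode(seed):
    """Play one full episode and return the observation trace and final state."""
    env = ToyQueueEnv()
    trace = [env.reset(seed=seed)]
    done = False
    while not done:
        observation, done = env.step(action="route")
        trace.append(observation)
    return trace, env.state()


def test_seeded_episode_is_deterministic():
    # Two runs with the same seed must produce identical traces and state.
    assert run_episode(seed=42) == run_episode(seed=42)
```

The same pattern should carry over to the real environment by swapping `ToyQueueEnv` for the actual class and comparing whatever the real observations and rewards are.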
| 371 |
|
| 372 |
Shared checkpoint:
|
| 373 |
|
|
|
|
| 390 |
|
| 391 |
- complete smoke tests
|
| 392 |
- add first-pass integration tests for `/health`, `/tasks`, `/reset`, and `/step`
|
| 393 |
+
- begin checking how current `inference.py` differs from the official structured logging requirement
|
| 394 |
|
| 395 |
Shared checkpoint:
|
| 396 |
|
|
|
|
| 414 |
- add integration coverage for full seeded episode flow and `state()`
|
| 415 |
- add a light heuristic regression path for `inference.py`
|
| 416 |
- optionally enrich observation history if tests are already green
|
| 417 |
+
- bring `inference.py` closer to official structured logging format if the change can be done safely
|
| 418 |
|
| 419 |
Shared checkpoint:
|
| 420 |
|
|
|
|
| 439 |
- add `.openenvignore`
|
| 440 |
- verify Docker smoke workflow on the merged branch
|
| 441 |
- check deployment assumptions around `app_port`, `/docs`, `/health`, `/ws`, and `/web`
|
| 442 |
+
- run `openenv validate` or the closest available validation path
|
| 443 |
+
- verify structured inference logging and runtime-envelope expectations
|
| 444 |
|
| 445 |
Shared checkpoint:
|
| 446 |
|
|
|
|
| 456 |
|
| 457 |
Roopal:
|
| 458 |
|
| 459 |
+
- final docs consistency pass across `README.md` and `KNOWLEDGE.md`
|
| 460 |
- add a short TRL / GRPO usage example only if everything else is already green
|
| 461 |
|
| 462 |
Suyash:
|
|
|
|
| 483 |
- run final smoke / test slice on the submission branch
|
| 484 |
- verify required files are present
|
| 485 |
- verify README and metadata are current
|
| 486 |
+
- run the final validation checklist from `required.md`
|
| 487 |
|
| 488 |
Afternoon:
|
| 489 |
|
|
|
|
| 511 |
3. Docker / deployment validation
|
| 512 |
4. grounding audit evidence
|
| 513 |
5. final benchmark sanity rerun if behavior changed
|
| 514 |
+
6. official structured inference logging compliance
|
| 515 |
|
| 516 |
## Definition Of Done
|
| 517 |
|
|
|
|
| 521 |
2. scoring is demonstrably deterministic and not fuzzy by default
|
| 522 |
3. a grounding audit against real public support datasets exists
|
| 523 |
4. the heuristic baseline still runs successfully
|
| 524 |
+
5. the inference path is compliant with the official log format
|
| 525 |
+
6. `openenv validate` and Docker checks pass
|
| 526 |
+
7. docs and metadata are current and judge-friendly
|
| 527 |
+
8. the repo is frozen and submitted on time
|
| 528 |
|
| 529 |
## Simple Rule To Remember
|
| 530 |
|
analysis/comp_know.md
CHANGED
|
@@ -1,275 +1,237 @@
|
|
| 1 |
-
# Competition Knowledge Base
|
| 2 |
|
| 3 |
-
> Source: github.com/meta-pytorch/OpenEnv/tree/main/envs
|
| 4 |
-
> Gathered: April 4, 2026
|
| 5 |
-
> Purpose: Internal competitive intelligence
|
| 6 |
|
| 7 |
---
|
| 8 |
|
| 9 |
-
## Full Environment Inventory
|
| 10 |
|
| 11 |
| Env | Domain | Complexity | Reward Type | Multi-step? | MCP? |
|
| 12 |
|-----|--------|------------|-------------|-------------|------|
|
| 13 |
| `atari_env` | Classic games | Medium | Dense | Yes | No |
|
| 14 |
| `browsergym_env` | Web browser automation | Very High | Task-based | Yes | No |
|
| 15 |
-
| `calendar_env` | Calendar/scheduling agent | High | SQL verifier | Yes | Yes
|
| 16 |
| `carla_env` | Autonomous driving sim | Very High | Dense | Yes | No |
|
| 17 |
-
| `chat_env` | Conversation/tokenization | Low | Custom transform | Yes | No |
|
| 18 |
-
| `chess_env` | Chess game | Medium | Win/loss | Yes | No |
|
| 19 |
| `coding_env` | Python code execution | Medium | Exit code / transform | Yes | No |
|
| 20 |
-
| `
|
| 21 |
-
| `
|
| 22 |
-
| `
|
| 23 |
-
| `
|
| 24 |
-
| `
|
| 25 |
-
| `finrl_env` | Financial RL trading | High | Portfolio return | Yes | No |
|
| 26 |
-
| `git_env` | Git operations | Medium | Task-based | Yes | No |
|
| 27 |
-
| `grid_world_env` | Grid navigation | Low | Sparse | Yes | No |
|
| 28 |
-
| `julia_env` | Julia code execution | Medium | Exit code | Yes | No |
|
| 29 |
-
| `kernrl` | Kernel/OS operations | High | Unknown | Yes | No |
|
| 30 |
-
| `maze_env` | Maze navigation | Low | Sparse | Yes | No |
|
| 31 |
-
| `openapp_env` | Web app UI (BrowserGym) | Extreme | Task-based | Yes | No |
|
| 32 |
-
| `openspiel_env` | Multi-agent games | High | Game outcome | Yes | No |
|
| 33 |
-
| `reasoning_gym_env` | Reasoning tasks (100+ datasets) | Medium | Exact/partial | Single-step | No |
|
| 34 |
-
| `repl_env` | REPL execution | Medium | Exit code | Yes | No |
|
| 35 |
-
| `snake_env` | Snake game | Low | Score | Yes | No |
|
| 36 |
-
| `sumo_rl_env` | Traffic simulation | High | Traffic flow | Yes | No |
|
| 37 |
-
| `tbench2_env` | Terminal Bench 2 (shell tasks) | High | pytest pass/fail | Yes | No |
|
| 38 |
-
| `textarena_env` | Text-based games | Medium | Game outcome | Yes | No |
|
| 39 |
-
| `unity_env` | Unity 3D simulation | Very High | Task-based | Yes | No |
|
| 40 |
|
| 41 |
-
-
|
| 42 |
-
|
| 43 |
-
## Deep Dives: Most Relevant Envs
|
| 44 |
-
|
| 45 |
-
### 1. `finqa_env` — Financial QA
|
| 46 |
-
|
| 47 |
-
**What it does**: Agents answer complex financial questions from SEC 10-K filings using SQL tool calls.
|
| 48 |
-
|
| 49 |
-
**Architecture**:
|
| 50 |
-
- Subclasses `MCPEnvironment` (not plain `Environment`) — uses FastMCP with `@mcp.tool` decorators
|
| 51 |
-
- Tools: `get_descriptions`, `get_table_info`, `sql_query`, `submit_answer`
|
| 52 |
-
- Dataset: 290 questions from HuggingFace (`snorkelai/finqa-data`)
|
| 53 |
-
- Max steps: 50 per episode
|
| 54 |
-
- Reward: Binary (1.0 / 0.0) with fuzzy numerical matching (1% relative tolerance + 1.0 absolute tolerance)
|
| 55 |
-
- Handles `\boxed{}` LaTeX format, percentages, fractions, thousands separators, negative parens
|
| 56 |
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
**Key differentiator**: MCP protocol for tool discovery. Client uses `await env.list_tools()` to discover tools at runtime. This is the most "agentic" env in the repo.
|
| 60 |
|
| 61 |
-
|
| 62 |
|
| 63 |
-
|
| 64 |
|
| 65 |
-
|
|
|
|
|
|
|
|
|
|
| 66 |
|
| 67 |
-
|
| 68 |
|
| 69 |
-
|
| 70 |
-
-
|
| 71 |
-
-
|
| 72 |
-
- Action: `CodeAction(code: str)`
|
| 73 |
-
- Observation: `CodeObservation(stdout, stderr, exit_code)`
|
| 74 |
-
- State: `CodeState(episode_id, step_count, last_exit_code)`
|
| 75 |
-
- Reward: computed by transform (not in step directly) — extensible pattern
|
| 76 |
|
| 77 |
-
|
| 78 |
|
| 79 |
-
|
|
|
|
|
|
|
| 80 |
|
| 81 |
-
|
| 82 |
|
| 83 |
-
|
|
|
|
|
|
|
| 84 |
|
| 85 |
-
|
| 86 |
|
| 87 |
-
|
| 88 |
-
-
|
| 89 |
-
-
|
| 90 |
-
- Dataset persistence: same dataset reused across resets until config changes
|
| 91 |
-
- Supports `dataset_name`, `seed`, `size`, `dataset_specs` in `reset()` kwargs
|
| 92 |
-
- Reward: 0.0–1.0 (dataset-dependent, may use partial credit)
|
| 93 |
|
| 94 |
-
|
| 95 |
|
| 96 |
-
|
|
|
|
|
|
|
| 97 |
|
| 98 |
---
|
| 99 |
|
| 100 |
-
##
|
| 101 |
|
| 102 |
-
|
| 103 |
|
| 104 |
-
|
| 105 |
-
-
|
| 106 |
-
-
|
| 107 |
-
- Session IDs for streaming/non-blocking processes
|
| 108 |
-
- Reward: Binary (pytest pass/fail) on `evaluate` action
|
| 109 |
-
- Intermediate steps: `reward=None`
|
| 110 |
|
| 111 |
-
|
| 112 |
|
| 113 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 114 |
|
| 115 |
-
###
|
| 116 |
|
| 117 |
-
|
|
|
|
|
|
|
| 118 |
|
| 119 |
-
|
| 120 |
-
- Runs TWO services in Docker: OpenApps server (port 5001) + FastAPI (port 8000)
|
| 121 |
-
- `start.sh` orchestrates both
|
| 122 |
-
- BrowserGym for browser automation (Playwright/Chromium)
|
| 123 |
-
- Docker image: ~5.7GB (includes Chromium)
|
| 124 |
-
- Multimodal: screenshots + DOM observations
|
| 125 |
|
| 126 |
-
|
|
|
|
|
|
|
| 127 |
|
| 128 |
---
|
| 129 |
|
| 130 |
-
##
|
| 131 |
-
|
| 132 |
-
**What it does**: Calendar management tasks with SQL database verification.
|
| 133 |
-
|
| 134 |
-
**Architecture**:
|
| 135 |
-
- MCP-based (like finqa_env)
|
| 136 |
-
- Has `client_notebooks/` — Jupyter notebook for interactive evaluation
|
| 137 |
-
- Has `mcp_databases/` — SQLite databases for state
|
| 138 |
-
- Scenario-based: `scenario_config.json` drives task + verifiers
|
| 139 |
-
- Verifiers: SQL queries that check task completion
|
| 140 |
-
- Supports OpenAI, Anthropic, Google providers
|
| 141 |
|
| 142 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 143 |
|
| 144 |
---
|
| 145 |
|
| 146 |
-
##
|
| 147 |
|
| 148 |
-
|
| 149 |
|
| 150 |
-
|
| 151 |
-
- Action: `ChatAction(tokens: torch.Tensor)` — takes raw model tokens
|
| 152 |
-
- Observation: `ChatObservation(messages, tokens)` — both human-readable + model-ready
|
| 153 |
-
- Transform-based reward (pluggable)
|
| 154 |
-
- Dual representation: messages (human) + tokens (model)
|
| 155 |
-
- No HTTP overhead option: can use directly without server
|
| 156 |
|
| 157 |
-
|
| 158 |
|
| 159 |
-
|
| 160 |
-
|
| 161 |
-
## Structural Patterns Observed Across All Envs
|
| 162 |
-
|
| 163 |
-
### File Structure (canonical)
|
| 164 |
-
```
|
| 165 |
-
env_name/
|
| 166 |
-
├── __init__.py # exports
|
| 167 |
-
├── models.py # Action, Observation, State
|
| 168 |
-
├── client.py # EnvClient subclass
|
| 169 |
-
├── openenv.yaml # metadata
|
| 170 |
-
├── pyproject.toml # packaging
|
| 171 |
-
├── README.md # HuggingFace Space frontmatter + docs
|
| 172 |
-
└── server/
|
| 173 |
-
├── __init__.py
|
| 174 |
-
├── app.py # FastAPI
|
| 175 |
-
├── environment.py # core logic
|
| 176 |
-
└── Dockerfile
|
| 177 |
-
```
|
| 178 |
|
| 179 |
-
### README Frontmatter (HuggingFace Spaces)
|
| 180 |
-
Every env README has YAML frontmatter:
|
| 181 |
```yaml
|
| 182 |
---
|
| 183 |
-
title:
|
| 184 |
-
emoji:
|
| 185 |
-
colorFrom:
|
| 186 |
-
colorTo:
|
| 187 |
sdk: docker
|
| 188 |
pinned: false
|
| 189 |
-
app_port:
|
| 190 |
base_path: /web
|
| 191 |
tags:
|
| 192 |
- openenv
|
|
|
|
|
|
|
|
|
|
| 193 |
---
|
| 194 |
```
|
| 195 |
-
This is required for HuggingFace Spaces deployment. Our README does NOT have this.
|
| 196 |
-
|
| 197 |
-
### openenv.yaml — Minimal Pattern
|
| 198 |
-
Most envs have very minimal `openenv.yaml` (just name + entry_point). Our yaml is the most detailed in the repo.
|
| 199 |
-
|
| 200 |
-
### Dockerfile Patterns
|
| 201 |
-
- Most use `openenv-base:latest` as base image (not `python:3.11-slim`)
|
| 202 |
-
- Our Dockerfile uses `python:3.11-slim` directly — this is the standalone/HF Spaces pattern
|
| 203 |
-
- The `openenv-base` pattern is for the monorepo CI/CD workflow
|
| 204 |
-
|
| 205 |
-
### Testing
|
| 206 |
-
- `coding_env`: most tested (unit + integration)
|
| 207 |
-
- Most envs: no tests at all
|
| 208 |
-
- Our env: no tests (matches majority)
|
| 209 |
-
|
| 210 |
-
### MCP vs HTTP
|
| 211 |
-
- Most envs: plain HTTP (`Environment` base class)
|
| 212 |
-
- `finqa_env`, `calendar_env`: MCP (`MCPEnvironment` base class, FastMCP tools)
|
| 213 |
-
- MCP envs are more "agentic" — tools are discoverable at runtime
|
| 214 |
-
|
| 215 |
-
### Reward Patterns
|
| 216 |
-
| Pattern | Envs | Description |
|
| 217 |
-
|---------|------|-------------|
|
| 218 |
-
| Binary (0/1) | finqa, tbench2, reasoning_gym | Pass/fail |
|
| 219 |
-
| Dense partial | ours, chess, atari | Continuous [0,1] |
|
| 220 |
-
| Transform-based | coding, chat | Pluggable reward function |
|
| 221 |
-
| SQL verifier | calendar | DB state check |
|
| 222 |
-
| Game outcome | chess, connect4, openspiel | Win/loss/draw |
|
| 223 |
|
| 224 |
-
|
|
|
|
|
|
|
| 225 |
|
| 226 |
-
##
|
| 227 |
|
| 228 |
-
|
| 229 |
-
- `openenv push` CLI command (seen in reasoning_gym README)
|
| 230 |
-
- Spaces get: `/web` (UI), `/docs` (Swagger), `/health`, `/ws` (WebSocket)
|
| 231 |
-
- `base_path: /web` in README frontmatter
|
| 232 |
-
- Our env: missing HF Spaces frontmatter in README
|
| 233 |
|
| 234 |
-
|
| 235 |
-
-
|
| 236 |
-
-
|
| 237 |
-
-
|
| 238 |
-
-
|
| 239 |
|
| 240 |
---
|
| 241 |
|
| 242 |
-
##
|
| 243 |
|
| 244 |
-
|
| 245 |
-
|-----|-------------|--------|
|
| 246 |
-
| finqa | 290 questions | HuggingFace (snorkelai/finqa-data) |
|
| 247 |
-
| reasoning_gym | 100+ datasets, configurable size | reasoning-gym library |
|
| 248 |
-
| calendar | SQLite DBs | Custom |
|
| 249 |
-
| ours | 45 tickets | Custom (data/dataset.json) |
|
| 250 |
-
| coding | N/A (generates tasks) | N/A |
|
| 251 |
-
| tbench2 | Terminal-Bench-2 repo | GitHub auto-download |
|
| 252 |
|
| 253 |
-
|
| 254 |
|
| 255 |
-
|
| 256 |
|
| 257 |
-
|
| 258 |
|
| 259 |
-
|
| 260 |
|
| 261 |
-
|
| 262 |
|
| 263 |
-
|
| 264 |
|
| 265 |
-
|
| 266 |
|
| 267 |
-
|
| 268 |
|
| 269 |
-
|
| 270 |
|
| 271 |
-
|
| 272 |
|
| 273 |
-
|
| 274 |
|
| 275 |
-
|
|
| 1 |
+
# Competition Knowledge Base And Action Plan
|
| 2 |
|
| 3 |
+
> Source: github.com/meta-pytorch/OpenEnv/tree/main/envs
|
| 4 |
+
> Gathered: April 4, 2026
|
| 5 |
+
> Purpose: Internal competitive intelligence plus action planning - NOT for commit/push
|
| 6 |
|
| 7 |
---
|
| 8 |
|
| 9 |
+
## Full Environment Inventory
|
| 10 |
|
| 11 |
| Env | Domain | Complexity | Reward Type | Multi-step? | MCP? |
|
| 12 |
|-----|--------|------------|-------------|-------------|------|
|
| 13 |
| `atari_env` | Classic games | Medium | Dense | Yes | No |
|
| 14 |
| `browsergym_env` | Web browser automation | Very High | Task-based | Yes | No |
|
| 15 |
+
| `calendar_env` | Calendar / scheduling agent | High | SQL verifier | Yes | Yes |
|
| 16 |
| `carla_env` | Autonomous driving sim | Very High | Dense | Yes | No |
|
| 17 |
+
| `chat_env` | Conversation / tokenization | Low | Custom transform | Yes | No |
|
|
|
|
| 18 |
| `coding_env` | Python code execution | Medium | Exit code / transform | Yes | No |
|
| 19 |
+
| `echo_env` | Reference / minimal | Minimal | Echo | No | No |
|
| 20 |
+
| `finqa_env` | Financial QA | High | Fuzzy numerical | Yes | Yes |
|
| 21 |
+
| `openapp_env` | Web app UI | Extreme | Task-based | Yes | No |
|
| 22 |
+
| `reasoning_gym_env` | Reasoning tasks | Medium | Exact / partial | Single-step | No |
|
| 23 |
+
| `tbench2_env` | Terminal tasks | High | Pytest pass/fail | Yes | No |
|
| 24 |
|
| 25 |
+
This is not the full raw repo dump anymore. It is the subset that matters most for competitive positioning and late-stage prioritization.
|
| 26 |
|
| 27 |
+
---
|
|
|
|
|
|
|
| 28 |
|
| 29 |
+
## Most Relevant Competitor Patterns
|
| 30 |
|
| 31 |
+
### `finqa_env`
|
| 32 |
|
| 33 |
+
- strong MCP / tool-using architecture
|
| 34 |
+
- larger dataset than ours
|
| 35 |
+
- binary-style reward with fuzzy numerical matching
|
| 36 |
+
- explicit TRL / GRPO integration story
|
| 37 |
|
| 38 |
+
### `coding_env`
|
| 39 |
|
| 40 |
+
- strongest test story
|
| 41 |
+
- clean transform-based reward separation
|
| 42 |
+
- reference example of strong code quality and architecture hygiene
|
|
|
|
|
|
|
|
|
|
|
|
|
| 43 |
|
| 44 |
+
### `reasoning_gym_env`
|
| 45 |
|
| 46 |
+
- broadest dataset coverage
|
| 47 |
+
- configurable dataset / size pattern
|
| 48 |
+
- useful deployment references for `openenv push`
|
| 49 |
|
| 50 |
+
### `tbench2_env`
|
| 51 |
|
| 52 |
+
- strong agentic shell-task realism
|
| 53 |
+
- binary evaluation via pytest
|
| 54 |
+
- little intermediate reward signal
|
| 55 |
|
| 56 |
+
### `openapp_env`
|
| 57 |
|
| 58 |
+
- highest complexity
|
| 59 |
+
- multimodal / browser-based
|
| 60 |
+
- difficult to beat on ambition, easier to beat on simplicity and reproducibility
|
|
|
|
|
|
|
|
|
|
| 61 |
|
| 62 |
+
### `calendar_env`
|
| 63 |
|
| 64 |
+
- enterprise workflow flavor
|
| 65 |
+
- scenario + verifier pattern
|
| 66 |
+
- stronger on MCP sophistication than on reward density
|
| 67 |
|
| 68 |
---
|
| 69 |
|
| 70 |
+
## Structural Patterns Across The Field
|
| 71 |
|
| 72 |
+
### Packaging
|
| 73 |
|
| 74 |
+
- every serious repo has `models.py`, `client.py`, `openenv.yaml`, `pyproject.toml`, `README.md`, and a `server/` package
|
| 75 |
+
- Hugging Face Spaces frontmatter is standard in competitor `README.md` files
|
| 76 |
+
- `.openenvignore` appears in some stronger submissions
|
|
|
|
|
|
|
|
|
|
| 77 |
|
| 78 |
+
### Reward patterns
|
| 79 |
|
| 80 |
+
| Pattern | Examples | Notes |
|
| 81 |
+
|---------|----------|-------|
|
| 82 |
+
| Binary | `finqa_env`, `tbench2_env` | easy to verify, weaker RL signal |
|
| 83 |
+
| Dense partial | ours, games | stronger RL learning signal |
|
| 84 |
+
| Transform-based | `coding_env`, `chat_env` | architecturally clean |
|
| 85 |
+
| SQL / verifier based | `calendar_env` | strong task verification |
|
| 86 |
|
| 87 |
+
### Testing patterns
|
| 88 |
|
| 89 |
+
- many repos have little or no tests
|
| 90 |
+
- `coding_env` is still the strongest example of checked-in testing
|
| 91 |
+
- this makes tests a high-value differentiator for us
|
| 92 |
|
| 93 |
+
### Deployment patterns
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 94 |
|
| 95 |
+
- Spaces usually expose `/web`, `/docs`, `/health`, and `/ws`
|
| 96 |
+
- `openenv push` is the expected deployment workflow
|
| 97 |
+
- `README` frontmatter and Docker correctness matter more than polish extras
|
| 98 |
|
| 99 |
---
|
| 100 |
|
| 101 |
+
## Key Technical Observations
|
| 102 |
|
| 103 |
+
1. MCP is useful, but too big to add late.
|
| 104 |
+
2. Transform-based reward is elegant, but not a deadline-critical refactor.
|
| 105 |
+
3. HF Spaces frontmatter is expected and missing in our repo.
|
| 106 |
+
4. `.openenvignore` is a cheap packaging win.
|
| 107 |
+
5. Configurable datasets are nice, but external dataset merge is too risky late.
|
| 108 |
+
6. Strong tests improve trust more than minor architectural polish.
|
| 109 |
+
7. Dense, deterministic, partial-credit reward is one of our real advantages.
|
| 110 |
|
| 111 |
---
|
| 112 |
|
| 113 |
+
## Actionable Inferences
|
| 114 |
|
| 115 |
+
## Critical Missing Items
|
| 116 |
|
| 117 |
+
### 1. README frontmatter for HF Spaces
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 118 |
|
| 119 |
+
This is still the most obvious gap, and one of the cheapest to close. Add it before submission.
|
| 120 |
|
| 121 |
+
Recommended fields:
|
| 122 |
|
|
|
|
|
|
|
| 123 |
```yaml
|
| 124 |
---
|
| 125 |
+
title: IT Helpdesk Ticket Routing OpenEnv
|
| 126 |
+
emoji: "ticket"
|
| 127 |
+
colorFrom: blue
|
| 128 |
+
colorTo: indigo
|
| 129 |
sdk: docker
|
| 130 |
pinned: false
|
| 131 |
+
app_port: 7860
|
| 132 |
base_path: /web
|
| 133 |
tags:
|
| 134 |
- openenv
|
| 135 |
+
- helpdesk
|
| 136 |
+
- ticket-routing
|
| 137 |
+
- nlp
|
| 138 |
---
|
| 139 |
```
|
| 140 |
|
| 141 |
+
### 2. `.openenvignore`
|
| 142 |
+
|
| 143 |
+
Cheap packaging improvement. Worth adding.
|
| 144 |
|
| 145 |
+
### 3. Verified deployment assumptions
|
| 146 |
|
| 147 |
+
We should explicitly verify:
|
|
|
|
|
|
|
|
|
|
|
|
|
| 148 |
|
| 149 |
+
- `app_port: 7860`
|
| 150 |
+
- `/health`
|
| 151 |
+
- `/docs`
|
| 152 |
+
- `/ws`
|
| 153 |
+
- `/web`
|
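The HTTP part of that verification could be scripted roughly as below. This is a sketch under assumptions: the base URL is whatever the deployed Space or local container exposes, `check_endpoints` is a hypothetical helper name, and `/ws` is left out because a WebSocket endpoint needs a WebSocket client rather than a plain GET.

```python
from urllib.request import urlopen

# Paths from the list above that can be probed with a plain HTTP GET.
EXPECTED_HTTP_PATHS = ("/health", "/docs", "/web")


def default_fetch_status(url):
    """Return the HTTP status code for a GET request to url."""
    with urlopen(url, timeout=10) as response:
        return response.status


def check_endpoints(base_url, fetch_status=default_fetch_status):
    """Probe each expected path and record a status code or an error string."""
    results = {}
    for path in EXPECTED_HTTP_PATHS:
        try:
            results[path] = fetch_status(base_url.rstrip("/") + path)
        except OSError as error:  # URLError and HTTPError both derive from OSError
            results[path] = f"error: {error}"
    return results
```

For a local container this might be called as `check_endpoints("http://localhost:7860")`; the injectable `fetch_status` argument exists only so the helper can be tested without a running server.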
| 154 |
|
| 155 |
---
|
| 156 |
|
| 157 |
+
## High-Value Improvements That Still Make Sense
|
| 158 |
|
| 159 |
+
### 4. Strengthen the scorer only in grounded, tested ways
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 160 |
|
| 161 |
+
Possible additions to `ISSUE_TYPE_SIMILARITY`:
|
| 162 |
|
| 163 |
+
- `onboarding` vs `service_request`
|
| 164 |
+
- `feature_request` vs `service_request`
|
| 165 |
+
- `security_compliance` vs `identity_access`
|
| 166 |
+
- `billing_license` vs `identity_access`
|
| 167 |
+
|
| 168 |
+
Only do this if:
|
| 169 |
+
|
| 170 |
+
- the ambiguity is real
|
| 171 |
+
- the change is backed by tests
|
| 172 |
+
- it does not blur operationally distinct actions too much
|
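To make the "backed by tests" condition concrete, here is a minimal sketch of a symmetric partial-credit lookup. The `frozenset` keys and the `0.5` values are assumptions chosen for illustration; the real `ISSUE_TYPE_SIMILARITY` table in the grader may be shaped and weighted differently.

```python
# Hypothetical partial-credit values; the real table lives in the grader.
ISSUE_TYPE_SIMILARITY = {
    frozenset({"onboarding", "service_request"}): 0.5,
    frozenset({"security_compliance", "identity_access"}): 0.5,
}


def issue_type_score(predicted, expected):
    """Return 1.0 for an exact match, partial credit for a listed near miss, else 0.0."""
    if predicted == expected:
        return 1.0
    # frozenset keys make the lookup independent of argument order.
    return ISSUE_TYPE_SIMILARITY.get(frozenset({predicted, expected}), 0.0)


def test_similarity_is_symmetric_and_bounded():
    labels = ["onboarding", "service_request", "security_compliance", "identity_access"]
    for a in labels:
        for b in labels:
            score = issue_type_score(a, b)
            assert score == issue_type_score(b, a)
            assert 0.0 <= score <= 1.0
```

A test of this shape guards the two properties the roadmap cares about: scores stay inside `[0.0, 1.0]`, and a newly added pair cannot accidentally score differently depending on which label was predicted.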
| 173 |
+
|
| 174 |
+
### 5. Add richer `history` if low-risk
|
| 175 |
+
|
| 176 |
+
Candidate additions:
|
| 177 |
+
|
| 178 |
+
- ticket title
|
| 179 |
+
- predicted fields
|
| 180 |
|
| 181 |
+
This can help multi-step reasoning without changing the core task.
|
| 182 |
|
| 183 |
+
### 6. Add `queue_size` as an optional `reset()` kwarg
|
| 184 |
|
| 185 |
+
Nice RL/training flexibility, but lower priority than tests, scorer crispness, Docker, and deployment readiness.
|
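If the `queue_size` kwarg is ever added, the override logic could look roughly like this. It is a sketch only: `resolve_queue_size` is a hypothetical helper, and the default bounds in `QUEUE_SIZE_RANGE` are placeholder values rather than the repo's actual configuration.

```python
import random

QUEUE_SIZE_RANGE = (3, 5)  # assumed default bounds, for illustration only


def resolve_queue_size(rng, dataset_size, queue_size=None):
    """Pick the episode length: an explicit override wins, else sample the default range."""
    if queue_size is None:
        low, high = QUEUE_SIZE_RANGE
        # Never ask for more tickets than the dataset contains.
        return min(rng.randint(low, high), dataset_size)
    is_plain_int = isinstance(queue_size, int) and not isinstance(queue_size, bool)
    if not is_plain_int or not 1 <= queue_size <= dataset_size:
        raise ValueError(f"queue_size must be an int in [1, {dataset_size}]")
    return queue_size
```

Inside `reset()`, this would be called with the episode's seeded random generator and `kwargs.get("queue_size")`, so that omitting the kwarg preserves the current seeded behavior.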
| 186 |
|
| 187 |
+
### 7. Add a short TRL / GRPO example to README
|
| 188 |
|
| 189 |
+
Good judge-facing signal once the repo is already green.
|
| 190 |
|
| 191 |
+
---
|
| 192 |
+
|
| 193 |
+
## Improvements To Defer
|
| 194 |
+
|
| 195 |
+
- MCP migration
|
| 196 |
+
- transform-based reward refactor
|
| 197 |
+
- major dataset expansion
|
| 198 |
+
- external dataset merge into runtime
|
| 199 |
+
- broad inference rewrite
|
| 200 |
+
- dependency churn just for polish
|
| 201 |
+
|
| 202 |
+
---
|
| 203 |
|
| 204 |
+
## Competitive Positioning
|
| 205 |
|
| 206 |
+
### Our strengths
|
| 207 |
|
| 208 |
+
1. strong real-world enterprise domain
|
| 209 |
+
2. dense deterministic reward
|
| 210 |
+
3. partial-credit grading that is still explainable
|
| 211 |
+
4. clean 3-task difficulty ladder
|
| 212 |
+
5. strong heuristic baseline
|
| 213 |
+
6. compact, rerunnable environment design
|
| 214 |
+
|
| 215 |
+
### Our weaknesses
|
| 216 |
+
|
| 217 |
+
1. weaker checked-in test story unless we fix it
|
| 218 |
+
2. missing HF Spaces frontmatter unless we fix it
|
| 219 |
+
3. smaller dataset than some top competitors
|
| 220 |
+
4. less ambitious architecture than the strongest simulator-style or MCP-heavy entries
|
| 221 |
+
|
| 222 |
+
---
|
| 223 |
|
| 224 |
+
## Priority Action List
|
| 225 |
+
|
| 226 |
+
| Priority | Action | Effort | Impact |
|
| 227 |
+
|----------|--------|--------|--------|
|
| 228 |
+
| P0 | Add tests and prove scorer crispness | 1-2 hrs | High |
|
| 229 |
+
| P0 | Add HF Spaces frontmatter to README | 5 min | High |
|
| 230 |
+
| P0 | Add `.openenvignore` | 5 min | Medium |
|
| 231 |
+
| P1 | Add grounding audit against public support datasets | 1-2 hrs | High |
|
| 232 |
+
| P1 | Expand similarity pairs only if grounded and tested | 20-40 min | Medium |
|
| 233 |
+
| P1 | Add richer `history` if low-risk | 20 min | Medium |
|
| 234 |
+
| P1 | Add TRL / GRPO README example | 30 min | High |
|
| 235 |
+
| P2 | Add `queue_size` kwarg | 15 min | Low |
|
| 236 |
+
| P3 | Expand dataset substantially | 2+ hrs | Medium but risky |
|
| 237 |
+
| P3 | Transform-based reward refactor | 1 hr | Low |
|
analysis/inference.md
DELETED
|
@@ -1,218 +0,0 @@
|
|
| 1 |
-
# Inferences & Actionable Advantages
|
| 2 |
-
|
| 3 |
-
> Based on deep analysis of all 27 OpenEnv competition entries
|
| 4 |
-
> Internal use only — NOT for commit/push
|
| 5 |
-
|
| 6 |
-
---
|
| 7 |
-
|
| 8 |
-
## Critical Missing Items (Fix Before Submission)
|
| 9 |
-
|
| 10 |
-
### 1. README HuggingFace Spaces Frontmatter — MISSING
|
| 11 |
-
|
| 12 |
-
Every single env in the repo has YAML frontmatter at the top of README.md. Ours does not.
|
| 13 |
-
This is required for `openenv push` and HuggingFace Spaces deployment to work correctly.
|
| 14 |
-
|
| 15 |
-
**Add to top of `meta-AIHack/README.md`:**
|
| 16 |
-
```yaml
|
| 17 |
-
---
|
| 18 |
-
title: IT Helpdesk Ticket Routing OpenEnv
|
| 19 |
-
emoji: 🎫
|
| 20 |
-
colorFrom: blue
|
| 21 |
-
colorTo: indigo
|
| 22 |
-
sdk: docker
|
| 23 |
-
pinned: false
|
| 24 |
-
app_port: 7860
|
| 25 |
-
base_path: /web
|
| 26 |
-
tags:
|
| 27 |
-
- openenv
|
| 28 |
-
- helpdesk
|
| 29 |
-
- ticket-routing
|
| 30 |
-
- nlp
|
| 31 |
-
---
|
| 32 |
-
```
|
| 33 |
-
|
| 34 |
-
Note: our port is `7860` (HF Spaces default), not `8000`. Use `7860` here.
|
| 35 |
-
|
| 36 |
-
---
|
| 37 |
-
|
| 38 |
-
### 2. `.openenvignore` File — MISSING
|
| 39 |
-
|
| 40 |
-
`finqa_env` has a `.openenvignore` file (analogous to `.dockerignore` for the `openenv push` CLI).
|
| 41 |
-
Without it, `openenv push` may upload unnecessary files.
|
| 42 |
-
|
| 43 |
-
**Create `meta-AIHack/.openenvignore`:**
|
| 44 |
-
```
|
| 45 |
-
*.pyc
|
| 46 |
-
__pycache__/
|
| 47 |
-
.git/
|
| 48 |
-
*.md
|
| 49 |
-
PLAN.md
|
| 50 |
-
ROADMAP.md
|
| 51 |
-
MENTAL_MODEL.md
|
| 52 |
-
KNOWLEDGE.md
|
| 53 |
-
comp_intel/
|
| 54 |
-
bugs/
|
| 55 |
-
transcripts/
|
| 56 |
-
```
|
| 57 |
-
|
| 58 |
-
---
|
| 59 |
-
|
| 60 |
-
### 3. `base_path: /web` in openenv.yaml — CHECK
|
| 61 |
-
|
| 62 |
-
The HF Spaces web UI is served at `/web`. The `reasoning_gym_env` README explicitly mentions:
|
| 63 |
-
- Web Interface at `/web`
|
| 64 |
-
- API Documentation at `/docs`
|
| 65 |
-
- Health Check at `/health`
|
| 66 |
-
- WebSocket at `/ws`
|
| 67 |
-
|
| 68 |
-
Our `openenv.yaml` lists `/docs` in `api.endpoints` — good. But we should verify the web interface path is correct when deployed.
|
| 69 |
-
|
| 70 |
-
---
|
| 71 |
-
|
| 72 |
-
## High-Value Improvements (Implement If Time Allows)
|
| 73 |
-
|
| 74 |
-
### 4. Partial Credit Similarity Matrix — Expand
|
| 75 |
-
|
| 76 |
-
Our `grader.py` has `ISSUE_TYPE_SIMILARITY` with 16 pairs and `PRIORITY_SCORES` with 10 pairs.
|
| 77 |
-
|
| 78 |
-
**Observation from finqa_env**: Their reward uses both relative AND absolute tolerance simultaneously. Our grader uses a flat similarity dict.
|
| 79 |
-
|
| 80 |
-
**Improvement**: Add more near-miss pairs to `ISSUE_TYPE_SIMILARITY`. Currently missing:
|
| 81 |
-
- `("onboarding", "service_request")` — onboarding tickets often look like service requests
|
| 82 |
-
- `("feature_request", "service_request")` — common confusion
|
| 83 |
-
- `("security_compliance", "identity_access")` — MFA/SSO tickets can go either way
|
| 84 |
-
- `("billing_license", "identity_access")` — license + account access overlap
|
| 85 |
-
|
| 86 |
-
This directly improves the reward signal quality for RL training, which is what judges care about.
|
| 87 |
-
|
| 88 |
-
---
|
| 89 |
-
|
| 90 |
-
### 5. Dataset Size — Expand from 45 to ~100 tickets
|
| 91 |
-
|
| 92 |
-
**Observation**: finqa has 290 questions, reasoning_gym has configurable sizes up to thousands.
|
| 93 |
-
Our 45 tickets is the smallest custom dataset in the repo.
|
| 94 |
-
|
| 95 |
-
**Improvement**: Add 55 more tickets to reach 100. Focus on:
|
| 96 |
-
- More ambiguous cases (harder for LLMs)
|
| 97 |
-
- More `related_ticket_id` chains (multi-ticket threads)
|
| 98 |
-
- Edge cases: tickets that span two issue types
|
| 99 |
-
- More `spam_phishing` examples (currently underrepresented)
|
| 100 |
-
|
| 101 |
-
This makes the benchmark more robust and harder to overfit.
|
| 102 |
-
|
| 103 |
-
---
|
| 104 |
-
|
| 105 |
-
### 6. Transform-Based Reward (Optional Architecture Upgrade)
|
| 106 |
-
|
| 107 |
-
**Observation**: `coding_env` uses a pluggable `Transform` object for reward computation instead of hardcoding it in `step()`. This is the cleanest pattern in the repo.
|
| 108 |
-
|
| 109 |
-
**Improvement**: Refactor `server/reward.py` to expose a `HelpdeskRewardTransform` class that can be swapped. Low priority — our current design works fine — but it signals architectural sophistication to judges.
|
| 110 |
-
|
| 111 |
-
---
|
| 112 |
-
|
| 113 |
-
### 7. Configurable Queue Size via `reset()` kwargs
|
| 114 |
-
|
| 115 |
-
**Observation**: `reasoning_gym_env` passes `size`, `seed`, `dataset_name` as `reset()` kwargs. This makes the env much more flexible for RL training (vary episode length, vary dataset).
|
| 116 |
-
|
| 117 |
-
**Improvement**: Accept `queue_size` as a `reset()` kwarg (in addition to `task_id` and `seed`):
|
| 118 |
-
```python
def reset(self, seed=None, episode_id=None, **kwargs):
    queue_size = kwargs.get("queue_size", None)  # override QUEUE_SIZE_RANGE
    ...
```
|
| 123 |
-
|
| 124 |
-
This lets RL trainers control episode length without modifying the env code.
|
| 125 |
-
|
| 126 |
-
---
|
| 127 |
-
|
| 128 |
-
### 8. `uv.lock` for Reproducible Dependencies
|
| 129 |
-
|
| 130 |
-
**Observation**: `chat_env` and `reasoning_gym_env` both include `uv.lock` files for fully reproducible dependency resolution.
|
| 131 |
-
|
| 132 |
-
**Improvement**: Run `uv lock` in `meta-AIHack/` and commit the `uv.lock`. This signals production-quality dependency management.
|
| 133 |
-
|
| 134 |
-
---
|
| 135 |
-
|
| 136 |
-
### 9. Explicit TRL/GRPO Integration Example in README
|
| 137 |
-
|
| 138 |
-
**Observation**: `finqa_env` README explicitly shows a TRL GRPO integration snippet. This is exactly what Meta/PyTorch judges want to see — the env being used for actual RL training.
|
| 139 |
-
|
| 140 |
-
**Improvement**: Add a section to our README showing how to use the env with TRL GRPO:
|
| 141 |
-
```python
# Example: Using with TRL GRPO (sketch; the agent loop is elided)
from trl import GRPOTrainer
from client import HelpdeskTicketEnvClient

ENV_URL = "http://localhost:8000"  # placeholder: point this at the running env server


def rollout_func(prompts, trainer):
    final_reward, completion = 0.0, ""  # placeholders, filled in by the agent loop
    sync_client = HelpdeskTicketEnvClient(base_url=ENV_URL).sync()
    with sync_client:
        result = sync_client.reset(seed=42, task_id=3)
        # ... agent loop: step through the queue, updating final_reward and completion
    return {"reward": final_reward, "completion": completion}
```
|
| 153 |
-
|
| 154 |
-
---
|
| 155 |
-
|
| 156 |
-
### 10. `history` Field — Richer Step History
|
| 157 |
-
|
| 158 |
-
**Observation**: `finqa_env` passes full tool call history in observation metadata. Our `history` field currently only stores `{step, score, breakdown}`.
|
| 159 |
-
|
| 160 |
-
**Improvement**: Include the ticket title and predicted fields in history so the agent can learn from its own past decisions within an episode:
|
| 161 |
-
```python
history_entry = {
    "ticket_id": current_ticket.ticket_id,
    "title": current_ticket.title,  # ADD THIS
    "predicted": {k: v for k, v in action.model_dump().items() if v is not None},  # ADD THIS
    "score": score,
    "breakdown": breakdown,
}
```
|
| 170 |
-
|
| 171 |
-
This gives the LLM agent richer context for multi-step reasoning.
|
| 172 |
-
|
| 173 |
-
---
|
| 174 |
-
|
| 175 |
-
## Competitive Positioning Insights
|
| 176 |
-
|
| 177 |
-
### Our Unique Strengths vs. The Field
|
| 178 |
-
|
| 179 |
-
1. **Richest `openenv.yaml`**: Ours is the most detailed metadata file in the entire repo. Most envs have 3-line yaml files. Ours has tasks, evaluation, grading, reproducibility, inference config. This signals thoroughness.
|
| 180 |
-
|
| 181 |
-
2. **Deterministic + Reproducible**: We explicitly set `deterministic: true` and `reproducible: true` in openenv.yaml. Only a few envs do this. Judges can rerun and get identical results.
|
| 182 |
-
|
| 183 |
-
3. **Task Ladder (3 difficulty levels)**: Most envs have a single task. We have 3 explicitly difficulty-graded tasks. This is a strong differentiator for RL curriculum learning.
|
| 184 |
-
|
| 185 |
-
4. **Partial Credit Grading**: Most envs use binary reward (0/1). Our grader gives partial credit for near-miss issue types and adjacent priorities. This produces a much richer reward signal for RL training.
|
| 186 |
-
|
| 187 |
-
5. **Dense Reward Signal**: Every step produces a reward (not just the final step). Most envs (tbench2, finqa) only reward at the end. Dense rewards are better for RL training.
|
| 188 |
-
|
| 189 |
-
6. **Heuristic Baseline**: We have a working keyword-based heuristic that achieves 0.94 overall. Most envs don't have a baseline agent. This lets judges immediately see the env working.
|
| 190 |
-
|
| 191 |
-
7. **Real-World Domain**: IT helpdesk routing is a real enterprise use case. Many envs are games or synthetic tasks. Ours has immediate practical applicability.
|
| 192 |
-
|
| 193 |
-
8. **Clean Episode Bounds**: 3–5 steps per episode. Not too short (single-step), not unbounded. Clean for RL training.
|
| 194 |
-
|
| 195 |
-
### Our Weaknesses vs. The Field
|
| 196 |
-
|
| 197 |
-
1. **No HF Spaces frontmatter** in README — fixable in 5 minutes
|
| 198 |
-
2. **Smallest dataset** (45 tickets) — expandable
|
| 199 |
-
3. **No MCP tools** — plain HTTP only (simpler but less "agentic")
|
| 200 |
-
4. **No tests** — matches most envs, but coding_env has tests
|
| 201 |
-
5. **No `uv.lock`** — minor
|
| 202 |
-
6. **No `.openenvignore`** — minor
|
| 203 |
-
|
| 204 |
-
---
|
| 205 |
-
|
| 206 |
-
## Priority Action List
|
| 207 |
-
|
| 208 |
-
| Priority | Action | Effort | Impact |
|----------|--------|--------|--------|
| P0 | Add HF Spaces frontmatter to README | 5 min | High (required for deployment) |
| P0 | Add `.openenvignore` | 5 min | Medium (cleaner push) |
| P1 | Add TRL/GRPO example to README | 30 min | High (judges love this) |
| P1 | Expand `ISSUE_TYPE_SIMILARITY` pairs | 20 min | Medium (better reward signal) |
| P1 | Richer `history` entries (add title + predicted) | 20 min | Medium (better agent context) |
| P2 | Expand dataset to ~100 tickets | 2 hrs | Medium (more robust benchmark) |
| P2 | Add `queue_size` kwarg to `reset()` | 15 min | Low (flexibility) |
| P3 | Add `uv.lock` | 5 min | Low (polish) |
| P3 | Transform-based reward refactor | 1 hr | Low (architecture only) |
|
required.md
ADDED
|
@@ -0,0 +1,352 @@
|
| 1 |
+
# Round 1 Requirements And Project Compliance Plan
|
| 2 |
+
|
| 3 |
+
## Official Problem Statement
|
| 4 |
+
|
| 5 |
+
Round 1 requires building a complete, real-world OpenEnv environment that an AI agent can learn from through the standard `step()` / `reset()` / `state()` API.
|
| 6 |
+
|
| 7 |
+
### Key requirements at a glance
|
| 8 |
+
|
| 9 |
+
- must simulate a real-world task, not a game or toy
|
| 10 |
+
- must implement the full OpenEnv spec with typed models and `openenv.yaml`
|
| 11 |
+
- must include at least 3 tasks with agent graders spanning easy -> medium -> hard
|
| 12 |
+
- graders must return scores in `[0.0, 1.0]`
|
| 13 |
+
- reward must provide meaningful partial-progress signal
|
| 14 |
+
- must include a reproducible baseline `inference.py`
|
| 15 |
+
- must deploy to Hugging Face Spaces with a working Dockerfile
|
| 16 |
+
- README must include environment description, action / observation spaces, setup, usage, and baseline scores
|
| 17 |
+
|
| 18 |
+
## Official Functional Requirements
|
| 19 |
+
|
| 20 |
+
### Real-world task simulation
|
| 21 |
+
|
| 22 |
+
The environment must simulate a task humans actually do. The official examples include:
|
| 23 |
+
|
| 24 |
+
- email triage
|
| 25 |
+
- code review
|
| 26 |
+
- data cleaning
|
| 27 |
+
- scheduling
|
| 28 |
+
- customer support
|
| 29 |
+
- content moderation
|
| 30 |
+
|
| 31 |
+
### OpenEnv spec compliance
|
| 32 |
+
|
| 33 |
+
The environment must implement the OpenEnv interface with:
|
| 34 |
+
|
| 35 |
+
- typed Observation model
|
| 36 |
+
- typed Action model
|
| 37 |
+
- typed state model
|
| 38 |
+
- `step(action)`
|
| 39 |
+
- `reset()`
|
| 40 |
+
- `state()`
|
| 41 |
+
- `openenv.yaml`
|
| 42 |
+
|
| 43 |
+
This is expected to be checked through `openenv validate`.
|
| 44 |
+
|
| 45 |
+
### Minimum 3 tasks with agent graders
|
| 46 |
+
|
| 47 |
+
Each task must have:
|
| 48 |
+
|
| 49 |
+
- a concrete objective
|
| 50 |
+
- a programmatic grader
|
| 51 |
+
- score output in `[0.0, 1.0]`
|
| 52 |
+
- deterministic success / failure criteria
|
| 53 |
+
- clear difficulty progression from easy to hard
|
| 54 |
+
|
| 55 |
+
### Meaningful reward function
|
| 56 |
+
|
| 57 |
+
The reward should:
|
| 58 |
+
|
| 59 |
+
- provide signal across the full trajectory
|
| 60 |
+
- reward partial progress
|
| 61 |
+
- penalize clearly undesirable behavior
|
| 62 |
+
|
| 63 |
+
### Baseline inference script
|
| 64 |
+
|
| 65 |
+
The baseline must:
|
| 66 |
+
|
| 67 |
+
- use the OpenAI client for LLM calls
|
| 68 |
+
- live at the project root as `inference.py`
|
| 69 |
+
- produce reproducible scores
|
| 70 |
+
- complete successfully across all 3 tasks
|
| 71 |
+
|
| 72 |
+
## Official Non-Functional Requirements
|
| 73 |
+
|
| 74 |
+
### Hugging Face Spaces
|
| 75 |
+
|
| 76 |
+
- must deploy as a containerized HF Space
|
| 77 |
+
- should be tagged with `openenv`
|
| 78 |
+
- should respond successfully when pinged
|
| 79 |
+
|
| 80 |
+
### Containerized execution
|
| 81 |
+
|
| 82 |
+
- must include a working Dockerfile
|
| 83 |
+
- should start cleanly with `docker build` + `docker run`
|
| 84 |
+
|
| 85 |
+
### Documentation
|
| 86 |
+
|
| 87 |
+
README must include:
|
| 88 |
+
|
| 89 |
+
- environment description and motivation
|
| 90 |
+
- action space definition
|
| 91 |
+
- observation space definition
|
| 92 |
+
- task descriptions with difficulty expectations
|
| 93 |
+
- setup and usage instructions
|
| 94 |
+
- baseline scores
|
| 95 |
+
|
| 96 |
+
## Official Evaluation Criteria
|
| 97 |
+
|
| 98 |
+
### Weights
|
| 99 |
+
|
| 100 |
+
| Parameter | Weight | What judges look for |
|-----------|--------|----------------------|
| Real-world utility | 30% | Genuine practical task and value |
| Task & grader quality | 25% | Clear objectives, fair graders, real progression |
| Environment design | 20% | Clean state, sensible API, good reward shaping |
| Code quality & spec compliance | 15% | OpenEnv compliance, structure, typing, tests, Docker |
| Creativity & novelty | 10% | Original domain, mechanics, reward ideas |
|
| 107 |
+
|
| 108 |
+
### Phase 1: Automated validation
|
| 109 |
+
|
| 110 |
+
Pass / fail gate:
|
| 111 |
+
|
| 112 |
+
- HF Space deploys
|
| 113 |
+
- OpenEnv spec compliance
|
| 114 |
+
- Dockerfile builds
|
| 115 |
+
- baseline reproduces
|
| 116 |
+
- 3+ tasks with graders
|
| 117 |
+
|
| 118 |
+
### Phase 2: Agentic evaluation
|
| 119 |
+
|
| 120 |
+
Scored:
|
| 121 |
+
|
| 122 |
+
- baseline agent rerun
|
| 123 |
+
- standard Open LLM agent run against the environment
|
| 124 |
+
- score variance check
|
| 125 |
+
|
| 126 |
+
### Phase 3: Human review
|
| 127 |
+
|
| 128 |
+
Top submissions are reviewed by Meta and Hugging Face engineers for:
|
| 129 |
+
|
| 130 |
+
- real-world utility
|
| 131 |
+
- creativity
|
| 132 |
+
- exploit resistance
|
| 133 |
+
|
| 134 |
+
## Official Disqualification Criteria
|
| 135 |
+
|
| 136 |
+
- environment does not deploy or respond
|
| 137 |
+
- plagiarized or trivially modified existing environment
|
| 138 |
+
- graders always return the same score
|
| 139 |
+
- no baseline inference script
|
| 140 |
+
|
| 141 |
+
## Official Pre-Submission Checklist
|
| 142 |
+
|
| 143 |
+
All of these must pass:
|
| 144 |
+
|
| 145 |
+
- HF Space deploys and responds
|
| 146 |
+
- automated ping to the Space URL returns `200`
|
| 147 |
+
- reset path works on the deployed environment
|
| 148 |
+
- `openenv validate` passes
|
| 149 |
+
- Dockerfile builds
|
| 150 |
+
- baseline inference completes and produces scores
|
| 151 |
+
- 3+ tasks with graders are present and score in `[0.0, 1.0]`
|
| 152 |
+
|
| 153 |
+
## Mandatory Additional Instructions
|
| 154 |
+
|
| 155 |
+
### Required inference environment variables
|
| 156 |
+
|
| 157 |
+
- `API_BASE_URL`
|
| 158 |
+
- `MODEL_NAME`
|
| 159 |
+
- `HF_TOKEN`
|
| 160 |
+
|
| 161 |
+
The official text also mentions `OPENAI_API_KEY` in one place, but the more specific submission instructions consistently emphasize `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`. We should follow those more specific instructions while continuing to use the OpenAI client.
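
One way `inference.py` could read these variables is sketched below. The helper names `load_inference_config` and `make_client` are hypothetical, and passing `HF_TOKEN` as the client's API key is an assumption about how the endpoint authenticates, not something the official text spells out.

```python
import os
from typing import Dict, Mapping, Optional

REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def load_inference_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read the three required variables, failing loudly if any is missing or empty."""
    source = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not source.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    return {name: source[name] for name in REQUIRED_VARS}


def make_client(config: Mapping[str, str]):
    """Build an OpenAI client pointed at API_BASE_URL (assumes HF_TOKEN is the API key)."""
    from openai import OpenAI  # imported lazily so config loading works without the package

    return OpenAI(base_url=config["API_BASE_URL"], api_key=config["HF_TOKEN"])
```

Failing before the first model call keeps a misconfigured run from spending part of the 20-minute budget on requests that cannot succeed.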
|
| 162 |
+
|
| 163 |
+
### Inference script constraints
|
| 164 |
+
|
| 165 |
+
- script must be named `inference.py`
|
| 166 |
+
- it must live in the project root
|
| 167 |
+
- all LLM calls must use the OpenAI client
|
| 168 |
+
- stdout logs must strictly follow the `[START]`, `[STEP]`, and `[END]` format from the official sample
|
| 169 |
+
|
| 170 |
+
### Infra restrictions
|
| 171 |
+
|
| 172 |
+
- inference runtime should stay under 20 minutes
|
| 173 |
+
- env and inference should run on a machine with `vcpu=2` and `memory=8gb`
|
| 174 |
+
|
| 175 |
+
### Validator
|
| 176 |
+
|
| 177 |
+
- run the official pre-submission validation script before final submission if possible
|
| 178 |
+
|
| 179 |
+
---
|
| 180 |
+
|
| 181 |
+
## Project Compliance Plan
|
| 182 |
+
|
| 183 |
+
## Project Goal
|
| 184 |
+
|
| 185 |
+
Build a polished OpenEnv environment for IT helpdesk ticket routing that satisfies:
|
| 186 |
+
|
| 187 |
+
- real-world utility
|
| 188 |
+
- strong task and grader quality
|
| 189 |
+
- clean environment design
|
| 190 |
+
- OpenEnv spec compliance
|
| 191 |
+
- reproducible baseline inference
|
| 192 |
+
- Docker and Hugging Face deployment readiness
|
| 193 |
+
|
| 194 |
+
## Current Product Definition
|
| 195 |
+
|
| 196 |
+
The environment simulates a helpdesk queue. An agent receives one ticket at a time and predicts:
|
| 197 |
+
|
| 198 |
+
- `issue_type`
|
| 199 |
+
- `priority`
|
| 200 |
+
- `assignment_group`
|
| 201 |
+
- `resolution_action`
|
| 202 |
+
|
| 203 |
+
The project keeps three tasks:
|
| 204 |
+
|
| 205 |
+
1. Issue Type Classification
|
| 206 |
+
2. Issue Type And Priority
|
| 207 |
+
3. Full Ticket Routing
|
| 208 |
+
|
| 209 |
+
## What Must Be True At Submission
|
| 210 |
+
|
| 211 |
+
### Pass / fail requirements
|
| 212 |
+
|
| 213 |
+
- the environment responds correctly
|
| 214 |
+
- OpenEnv metadata is valid
|
| 215 |
+
- `reset()`, `step()`, and `state()` work
|
| 216 |
+
- there are at least 3 tasks
|
| 217 |
+
- graders return scores in `[0.0, 1.0]`
|
| 218 |
+
- `inference.py` runs and prints reproducible results
|
| 219 |
+
- `inference.py` uses the OpenAI client and required env vars
|
| 220 |
+
- structured stdout logging matches the official format
|
| 221 |
+
- `openenv validate` passes
|
| 222 |
+
- Docker builds and starts cleanly
|
| 223 |
+
- HF Space responds and reset works
|
| 224 |
+
|
| 225 |
+
### Scored requirements
|
| 226 |
+
|
| 227 |
+
- the task clearly feels like real helpdesk work
|
| 228 |
+
- the hard task requires meaningful reasoning
|
| 229 |
+
- partial credit is useful and deterministic
|
| 230 |
+
- docs are clear enough for judges to understand quickly
|
| 231 |
+
- reward is informative over the trajectory, not only at the end
|
| 232 |
+
|
| 233 |
+
## Core Files
|
| 234 |
+
|
| 235 |
+
### Runtime
|
| 236 |
+
|
| 237 |
+
- `models.py`
|
| 238 |
+
- `server/environment.py`
|
| 239 |
+
- `server/grader.py`
|
| 240 |
+
- `server/reward.py`
|
| 241 |
+
- `server/tasks.py`
|
| 242 |
+
- `server/app.py`
|
| 243 |
+
- `client.py`
|
| 244 |
+
- `inference.py`
|
| 245 |
+
|
| 246 |
+
### Data and metadata
|
| 247 |
+
|
| 248 |
+
- `data/dataset.json`
|
| 249 |
+
- `openenv.yaml`
|
| 250 |
+
- `server/Dockerfile`
|
| 251 |
+
- `pyproject.toml`
|
| 252 |
+
- `requirements.txt`
|
| 253 |
+
|
| 254 |
+
### Docs
|
| 255 |
+
|
| 256 |
+
- `README.md`
|
| 257 |
+
- `KNOWLEDGE.md`
|
| 258 |
+
- `required.md`
|
| 259 |
+
|
| 260 |
+
## Technical Priorities
|
| 261 |
+
|
| 262 |
+
### P0
|
| 263 |
+
|
| 264 |
+
1. keep environment behavior correct
|
| 265 |
+
2. verify task definitions and graders
|
| 266 |
+
3. make the baseline script reliable and compliant with official logging format
|
| 267 |
+
4. confirm dataset coverage and label consistency
|
| 268 |
+
5. validate the official submission gates, not just local behavior
|
| 269 |
+
|
| 270 |
+
### P1
|
| 271 |
+
|
| 272 |
+
1. validate Docker
|
| 273 |
+
2. validate deployment assumptions
|
| 274 |
+
3. record baseline scores
|
| 275 |
+
4. polish docs
|
| 276 |
+
5. verify the runtime envelope and structured inference logs
|
| 277 |
+
|
| 278 |
+
### P2
|
| 279 |
+
|
| 280 |
+
1. strengthen ticket wording for realism
|
| 281 |
+
2. expand hard-case examples if needed
|
| 282 |
+
3. remove low-signal artifacts from the repo
|
| 283 |
+
|
| 284 |
+
## Quality Checks To Perform
|
| 285 |
+
|
| 286 |
+
### Environment
|
| 287 |
+
|
| 288 |
+
- reset starts a clean episode
|
| 289 |
+
- each step advances the queue correctly
|
| 290 |
+
- the final step returns trajectory reward
|
| 291 |
+
- state reflects the real internal status
|
| 292 |
+
- episode boundaries are sensible
|
| 293 |
+
|
| 294 |
+
### Grader
|
| 295 |
+
|
| 296 |
+
- exact matches score `1.0`
|
| 297 |
+
- near misses get partial credit where intended
|
| 298 |
+
- unsupported task IDs fail clearly
|
| 299 |
+
- scores vary across examples
|
| 300 |
+
- graders do not collapse to constant scores
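
The grader checks above can be turned into a small reusable assertion helper. This is a sketch under stated assumptions: `check_grader` and `toy_grade` are hypothetical names, the `(predicted, expected)` dict signature is assumed, and `toy_grade` is a stand-in rather than the logic in `server/grader.py`.

```python
from typing import Callable, Dict, List, Tuple

Labels = Dict[str, str]
GradeFn = Callable[[Labels, Labels], float]


def check_grader(grade: GradeFn, cases: List[Tuple[Labels, Labels]]) -> List[float]:
    """Run the grader over (predicted, expected) pairs and enforce the basic invariants."""
    scores = [grade(predicted, expected) for predicted, expected in cases]
    for (predicted, expected), score in zip(cases, scores):
        assert 0.0 <= score <= 1.0, f"score out of range: {score}"
        if predicted == expected:
            assert score == 1.0, "exact match must score 1.0"
    # A grader that always returns the same value is a disqualification risk.
    assert len(set(scores)) > 1, "grader returned the same score for every case"
    return scores


def toy_grade(predicted: Labels, expected: Labels) -> float:
    """Stand-in grader: fraction of expected fields predicted exactly (expected non-empty)."""
    hits = sum(predicted.get(field) == value for field, value in expected.items())
    return hits / len(expected)
```

Running the same helper against each of the three task graders with a handful of hand-labelled cases would cover the range, exact-match, and non-constant checks in one pass.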
|
| 301 |
+
|
| 302 |
+
### Inference
|
| 303 |
+
|
| 304 |
+
- heuristic mode works without model credentials
|
| 305 |
+
- LLM mode reads `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`
|
| 306 |
+
- uses the OpenAI client
|
| 307 |
+
- stdout follows `[START]`, `[STEP]`, and `[END]`
|
| 308 |
+
- output is reproducible when the seed is fixed
|
| 309 |
+
- runtime stays below the official time budget
|
| 310 |
+
|
| 311 |
+
### Deployment and validation
|
| 312 |
+
|
| 313 |
+
- `openenv validate` passes
|
| 314 |
+
- Docker build succeeds
|
| 315 |
+
- Docker run succeeds
|
| 316 |
+
- HF ping / reset behavior works
|
| 317 |
+
- official validator script is run if practical
|
| 318 |
+
|
| 319 |
+
### Docs
|
| 320 |
+
|
| 321 |
+
- no outdated domain references remain
|
| 322 |
+
- team and project metadata are correct
|
| 323 |
+
- setup and run instructions are accurate
|
| 324 |
+
- README reflects the current inference and deployment path
|
| 325 |
+
|
| 326 |
+
## Risks
|
| 327 |
+
|
| 328 |
+
### Runtime risk
|
| 329 |
+
|
| 330 |
+
The first local execution pass and the merged-state rerun have already succeeded. The remaining runtime risks are Docker, clean-machine behavior, and official-validator-style behavior, not first-pass local execution.
|
| 331 |
+
|
| 332 |
+
### Benchmark risk
|
| 333 |
+
|
| 334 |
+
The current local benchmark is already recorded. Remaining benchmark risk is whether deployment / validation changes expose a mismatch late.
|
| 335 |
+
|
| 336 |
+
### Deployment risk
|
| 337 |
+
|
| 338 |
+
Docker, HF Spaces, `openenv validate`, and structured inference logging should be verified before the final submission window closes.
|
| 339 |
+
|
| 340 |
+
## Definition Of Done
|
| 341 |
+
|
| 342 |
+
The project is ready when:
|
| 343 |
+
|
| 344 |
+
1. the environment runs locally end to end
|
| 345 |
+
2. unit, smoke, and integration tests cover the critical paths
|
| 346 |
+
3. the heuristic baseline runs successfully
|
| 347 |
+
4. the inference script is compliant with the official logging format
|
| 348 |
+
5. `openenv validate` passes
|
| 349 |
+
6. Docker build and run both succeed
|
| 350 |
+
7. HF deployment checks succeed or are as close to verified as possible before submission
|
| 351 |
+
8. the docs are clean, current, and submission-ready
|
| 352 |
+
9. the repo clearly presents Hackstreet Boys as the team
|