# Round 1 Requirements And Project Compliance Plan

## Official Problem Statement

Round 1 requires building a complete, real-world OpenEnv environment that an AI agent can learn from through the standard `step()` / `reset()` / `state()` API.

### Key requirements at a glance

- must simulate a real-world task, not a game or toy
- must implement the full OpenEnv spec with typed models and `openenv.yaml`
- must include at least 3 tasks with agent graders spanning easy -> medium -> hard
- graders must return scores in `[0.0, 1.0]`
- reward must provide meaningful partial-progress signal
- must include a reproducible baseline `inference.py`
- must deploy to Hugging Face Spaces with a working Dockerfile
- README must include environment description, action / observation spaces, setup, usage, and baseline scores

## Official Functional Requirements

### Real-world task simulation

The environment must simulate a task humans actually do. The official examples include:

- email triage
- code review
- data cleaning
- scheduling
- customer support
- content moderation

### OpenEnv spec compliance

The environment must implement the OpenEnv interface with:

- typed Observation model
- typed Action model
- typed state model
- `step(action)`
- `reset()`
- `state()`
- `openenv.yaml`

This is expected to be checked through `openenv validate`.
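
As a rough illustration of the interface shape listed above, the sketch below uses plain dataclasses and a toy labeling queue. It does not use the real OpenEnv base classes, and every name in it (`Observation`, `Action`, `State`, `QueueEnv`) is a hypothetical stand-in rather than the library's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observation:
    """What the agent sees after reset() or step()."""
    item: Optional[str]  # next work item, or None once the episode is over
    reward: float
    done: bool


@dataclass
class Action:
    """What the agent submits for the current item."""
    label: str


@dataclass
class State:
    """Episode bookkeeping exposed through state()."""
    step_count: int = 0
    total_reward: float = 0.0
    done: bool = False


class QueueEnv:
    """Toy environment (illustrative only): label each item in a fixed queue."""

    def __init__(self, items: List[str], answers: List[str]) -> None:
        if not items or len(items) != len(answers):
            raise ValueError("items and answers must be non-empty and aligned")
        self._items = items
        self._answers = answers
        self._state = State()

    def reset(self) -> Observation:
        # A reset always starts from a clean State.
        self._state = State()
        return Observation(item=self._items[0], reward=0.0, done=False)

    def step(self, action: Action) -> Observation:
        if self._state.done:
            raise RuntimeError("episode finished; call reset()")
        idx = self._state.step_count
        reward = 1.0 if action.label == self._answers[idx] else 0.0
        self._state.step_count += 1
        self._state.total_reward += reward
        self._state.done = self._state.step_count >= len(self._items)
        next_item = None if self._state.done else self._items[self._state.step_count]
        return Observation(item=next_item, reward=reward, done=self._state.done)

    def state(self) -> State:
        return self._state
```

A real submission would derive these models from whatever base types the OpenEnv package provides and would still need `openenv.yaml`; the sketch only shows how the three calls relate to each other.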

### Minimum 3 tasks with agent graders

Each task must have:

- a concrete objective
- a programmatic grader
- score output in `[0.0, 1.0]`
- deterministic success / failure criteria
- clear difficulty progression from easy to hard
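
One simple way to satisfy these properties is to score the fraction of required fields that match exactly. The function below is a minimal sketch under that assumption; `grade_fields` and its parameters are hypothetical names, not part of any official grader.

```python
from typing import Dict, Sequence


def grade_fields(
    predicted: Dict[str, str],
    expected: Dict[str, str],
    fields: Sequence[str],
) -> float:
    """Return the share of `fields` predicted exactly right, in [0.0, 1.0]."""
    if not fields:
        raise ValueError("at least one field is required")
    # A missing prediction counts as a miss rather than an error.
    hits = sum(1 for name in fields if predicted.get(name) == expected[name])
    return hits / len(fields)
```

Because the result depends only on its inputs, the same prediction always receives the same score, and harder tasks can be expressed by passing a longer `fields` list.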

### Meaningful reward function

The reward should:

- provide signal across the full trajectory
- reward partial progress
- penalize clearly undesirable behavior
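
The sketch below shows one possible shaping scheme: each step returns the grader score as its reward, invalid actions receive a small penalty, and the trajectory reward is the mean step score. The penalty value and both function names are assumptions made for illustration.

```python
from typing import Sequence


def step_reward(score: float, valid_action: bool, invalid_penalty: float = 0.1) -> float:
    """Per-step signal: the clamped grader score, or a penalty for invalid actions."""
    if not valid_action:
        return -invalid_penalty
    return max(0.0, min(1.0, score))


def trajectory_reward(step_scores: Sequence[float]) -> float:
    """Episode-level summary: the mean of the per-step grader scores."""
    if not step_scores:
        return 0.0
    return sum(step_scores) / len(step_scores)
```

Returning a score at every step, rather than only at the end, is what gives the agent signal across the full trajectory.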

### Baseline inference script

The baseline must:

- use the OpenAI client for LLM calls
- live at the project root as `inference.py`
- produce reproducible scores
- complete successfully across all 3 tasks

## Official Non-Functional Requirements

### Hugging Face Spaces

- must deploy as a containerized HF Space
- should be tagged with `openenv`
- should respond successfully when pinged

### Containerized execution

- must include a working Dockerfile
- should start cleanly with `docker build` + `docker run`

### Documentation

README must include:

- environment description and motivation
- action space definition
- observation space definition
- task descriptions with difficulty expectations
- setup and usage instructions
- baseline scores

## Official Evaluation Criteria

### Weights

| Parameter | Weight | What judges look for |
|-----------|--------|----------------------|
| Real-world utility | 30% | Genuine practical task and value |
| Task & grader quality | 25% | Clear objectives, fair graders, real progression |
| Environment design | 20% | Clean state, sensible API, good reward shaping |
| Code quality & spec compliance | 15% | OpenEnv compliance, structure, typing, tests, Docker |
| Creativity & novelty | 10% | Original domain, mechanics, reward ideas |

### Phase 1: Automated validation

Pass / fail gate:

- HF Space deploys
- OpenEnv spec compliance
- Dockerfile builds
- baseline reproduces
- 3+ tasks with graders

### Phase 2: Agentic evaluation

Scored:

- baseline agent rerun
- standard Open LLM agent run against the environment
- score variance check

### Phase 3: Human review

Top submissions are reviewed by Meta and Hugging Face engineers for:

- real-world utility
- creativity
- exploit resistance

## Official Disqualification Criteria

- environment does not deploy or respond
- plagiarized or trivially modified existing environment
- graders always return the same score
- no baseline inference script

## Official Pre-Submission Checklist

All of these must pass:

- HF Space deploys and responds
- automated ping to the Space URL returns `200`
- reset path works on the deployed environment
- `openenv validate` passes
- Dockerfile builds
- baseline inference completes and produces scores
- 3+ tasks with graders are present and score in `[0.0, 1.0]`

## Mandatory Additional Instructions

### Required inference environment variables

- `API_BASE_URL`
- `MODEL_NAME`
- `API_KEY`
- `HF_TOKEN`

Use `API_KEY` as the primary evaluator-injected credential for the OpenAI client. `HF_TOKEN` can remain as a backward-compatible local fallback, but submission-time LLM traffic should flow through the injected proxy key.
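
The precedence described above could be implemented roughly as follows. `resolve_llm_config` is a hypothetical helper, and the sketch assumes that an unset or empty `API_KEY` should fall through to `HF_TOKEN`.

```python
import os
from typing import Dict, Mapping


def resolve_llm_config(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the settings the OpenAI client needs, preferring API_KEY."""
    # API_KEY is the evaluator-injected credential; HF_TOKEN is a local fallback.
    api_key = env.get("API_KEY") or env.get("HF_TOKEN")
    missing = [name for name in ("API_BASE_URL", "MODEL_NAME") if not env.get(name)]
    if not api_key:
        missing.append("API_KEY (or HF_TOKEN)")
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "base_url": env["API_BASE_URL"],
        "model": env["MODEL_NAME"],
        "api_key": api_key,
    }
```

The returned `base_url` and `api_key` can then be passed to the OpenAI client constructor, for example `OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])`, with `cfg["model"]` supplied on each request.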

### Inference script constraints

- script must be named `inference.py`
- it must live in the project root
- all LLM calls must use the OpenAI client
- stdout logs must strictly follow the `[START]`, `[STEP]`, and `[END]` format from the official sample

### Infra restrictions

- inference runtime should stay under 20 minutes
- env and inference should run on a machine with `vcpu=2` and `memory=8gb`

### Validator

- run the official pre-submission validation script before final submission if possible

---

## Project Compliance Plan

## Project Goal

Build a polished OpenEnv environment for IT helpdesk ticket routing that satisfies:

- real-world utility
- strong task and grader quality
- clean environment design
- OpenEnv spec compliance
- reproducible baseline inference
- Docker and Hugging Face deployment readiness

## Current Product Definition

The environment simulates a helpdesk queue. An agent receives one ticket at a time and predicts:

- `issue_type`
- `priority`
- `assignment_group`
- `resolution_action`

The project keeps three tasks:

1. Issue Type Classification
2. Issue Type And Priority
3. Full Ticket Routing
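
The relationship between the four predicted fields and the three tasks might be modeled as below. The numeric task IDs, the `TicketAction` and `TASK_FIELDS` names, and the assumption that Full Ticket Routing grades all four fields are illustrative guesses based on the task names, not the contents of `models.py` or `server/tasks.py`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class TicketAction:
    """Agent prediction for one ticket; unused fields stay None."""
    issue_type: Optional[str] = None
    priority: Optional[str] = None
    assignment_group: Optional[str] = None
    resolution_action: Optional[str] = None


# Assumed mapping from task number to the fields that task grades.
TASK_FIELDS: Dict[int, Tuple[str, ...]] = {
    1: ("issue_type",),
    2: ("issue_type", "priority"),
    3: ("issue_type", "priority", "assignment_group", "resolution_action"),
}


def missing_fields(task_id: int, action: TicketAction) -> List[str]:
    """List the fields the task requires that the action left unset."""
    if task_id not in TASK_FIELDS:
        raise ValueError(f"unsupported task id: {task_id!r}")
    return [name for name in TASK_FIELDS[task_id] if getattr(action, name) is None]
```

Structuring the tasks as growing subsets of one action model keeps the easy -> medium -> hard ladder explicit while letting all three tasks share a single typed action.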

## What Must Be True At Submission

### Pass / fail requirements

- the environment responds correctly
- OpenEnv metadata is valid
- `reset()`, `step()`, and `state()` work
- there are at least 3 tasks
- graders return scores in `[0.0, 1.0]`
- `inference.py` runs and prints reproducible results
- `inference.py` uses the OpenAI client and required env vars
- structured stdout logging matches the official format
- `openenv validate` passes
- Docker builds and starts cleanly
- HF Space responds and reset works

### Scored requirements

- the task clearly feels like real helpdesk work
- the hard task requires meaningful reasoning
- partial credit is useful and deterministic
- docs are clear enough for judges to understand quickly
- reward is informative over the trajectory, not only at the end

## Core Files

### Runtime

- `models.py`
- `server/environment.py`
- `server/grader.py`
- `server/reward.py`
- `server/tasks.py`
- `server/app.py`
- `client.py`
- `inference.py`

### Data and metadata

- `data/dataset.json`
- `openenv.yaml`
- `server/Dockerfile`
- `pyproject.toml`
- `requirements.txt`

### Docs

- `README.md`
- `KNOWLEDGE.md`
- `required.md`

## Technical Priorities

### P0

1. keep environment behavior correct
2. verify task definitions and graders
3. make the baseline script reliable and compliant with official logging format
4. confirm dataset coverage and label consistency
5. validate the official submission gates, not just local behavior

### P1

1. validate Docker
2. validate deployment assumptions
3. record baseline scores
4. polish docs
5. verify the runtime envelope and structured inference logs

### P2

1. strengthen ticket wording for realism
2. expand hard-case examples if needed
3. remove low-signal artifacts from the repo

## Quality Checks To Perform

### Environment

- reset starts a clean episode
- each step advances the queue correctly
- the final step returns trajectory reward
- state reflects the real internal status
- episode boundaries are sensible

### Grader

- exact matches score `1.0`
- near misses get partial credit where intended
- unsupported task IDs fail clearly
- scores vary across examples
- graders do not collapse to constant scores
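
These checks translate naturally into assertions. The sketch below uses a hypothetical ordered priority scale and a hypothetical `grade_task` dispatcher to show what each check could look like; the real label set and partial-credit policy live in the dataset and `server/grader.py`.

```python
from typing import Callable, Dict

# Hypothetical ordered scale, used only to illustrate near-miss credit.
PRIORITY_ORDER = ["low", "medium", "high", "critical"]


def score_priority(predicted: str, expected: str) -> float:
    """1.0 for an exact match, 0.5 for an adjacent level, 0.0 otherwise."""
    if predicted not in PRIORITY_ORDER or expected not in PRIORITY_ORDER:
        return 0.0
    distance = abs(PRIORITY_ORDER.index(predicted) - PRIORITY_ORDER.index(expected))
    if distance == 0:
        return 1.0
    return 0.5 if distance == 1 else 0.0


GRADERS: Dict[str, Callable[[str, str], float]] = {"priority": score_priority}


def grade_task(task_id: str, predicted: str, expected: str) -> float:
    """Dispatch to a grader, failing loudly on unknown task IDs."""
    if task_id not in GRADERS:
        raise ValueError(f"unsupported task id: {task_id!r}")
    return GRADERS[task_id](predicted, expected)


def check_grader() -> None:
    """Assertions mirroring the checklist above."""
    assert grade_task("priority", "high", "high") == 1.0    # exact match
    assert grade_task("priority", "medium", "high") == 0.5  # near miss
    scores = {grade_task("priority", p, "high") for p in PRIORITY_ORDER}
    assert len(scores) > 1                                   # not constant
    assert all(0.0 <= s <= 1.0 for s in scores)              # bounded
    try:
        grade_task("no_such_task", "high", "high")
    except ValueError:
        pass
    else:
        raise AssertionError("unknown task id should raise")
```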

### Inference

- heuristic mode works without model credentials
- LLM mode reads `API_BASE_URL`, `MODEL_NAME`, and `API_KEY` (`HF_TOKEN` remains a local fallback)
- uses the OpenAI client
- stdout follows `[START]`, `[STEP]`, and `[END]`
- output is reproducible when the seed is fixed
- runtime stays below the official time budget
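
A minimal sketch of a credential-free heuristic mode with seeded reproducibility is shown below. The keyword rules, label names, and function names are hypothetical; the point is that a local `random.Random(seed)` makes the ticket order, and therefore the output, repeatable.

```python
import random
from typing import Dict, List

# Hypothetical keyword rules; a real heuristic would be tuned to the dataset.
KEYWORD_RULES: Dict[str, str] = {
    "password": "access",
    "vpn": "network",
    "laptop": "hardware",
}


def heuristic_issue_type(ticket_text: str, default: str = "other") -> str:
    """Pick the issue type of the first matching keyword, else a default."""
    text = ticket_text.lower()
    for keyword, issue_type in KEYWORD_RULES.items():
        if keyword in text:
            return issue_type
    return default


def run_episode(tickets: List[str], seed: int) -> List[str]:
    """Shuffle tickets with a seeded local RNG and classify each in order."""
    rng = random.Random(seed)  # local RNG, so global random state cannot leak in
    order = list(tickets)      # copy, so the caller's list is left untouched
    rng.shuffle(order)
    return [heuristic_issue_type(ticket) for ticket in order]
```

Running `run_episode` twice with the same tickets and seed returns identical predictions, which is the property the reproducibility check relies on.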

### Deployment and validation

- `openenv validate` passes
- Docker build succeeds
- Docker run succeeds
- HF ping / reset behavior works
- official validator script is run if practical

### Docs

- no outdated domain references remain
- team and project metadata are correct
- setup and run instructions are accurate
- README reflects the current inference and deployment path

## Risks

### Runtime risk

The first local execution pass, merged-state rerun, clean-copy rerun, and local validator pass have already succeeded. The remaining runtime risk is submission-day deployment execution, not first-pass local behavior.

### Benchmark risk

The current local benchmark is already recorded. Remaining benchmark risk is whether deployment / validation changes expose a mismatch late.

### Deployment risk

Docker smoke coverage, `openenv validate`, and structured inference logging are now verified in the repo state. If the final push creates a fresh deployment, the remaining deployment risk is the live Hugging Face Space ping and reset check.

## Definition Of Done

The project is ready when:

1. the environment runs locally end to end
2. unit, smoke, and integration tests cover the critical paths
3. the heuristic baseline runs successfully
4. the inference script is compliant with the official logging format
5. `openenv validate` passes
6. Docker build and run both succeed
7. HF deployment checks succeed or are as close to verified as possible before submission
8. the docs are clean, current, and submission-ready
9. the repo clearly presents Hackstreet Boys as the team

## Current Compliance Snapshot

As of April 8, 2026, the core submission requirements and the major benchmark upgrades are in place:

- real-world task definition is clear and stable
- typed models, `reset()`, `step()`, `state()`, and `openenv.yaml` are present in the repo
- 3-task easy -> medium -> hard ladder is present
- graders are deterministic and bounded to `[0.0, 1.0]`
- unit tests now prove scorer crispness, task invariants, and dataset coverage
- smoke tests now prove environment behavior, seeded determinism, score bounds, and full-episode completion
- integration tests now cover `/health`, `/tasks`, `/reset`, `/step`, `/state`, full seeded episodes, and heuristic regression
- baseline heuristic results are recorded in the docs
- the README now includes Hugging Face Spaces frontmatter and a judge-facing grounded-scoring explanation
- the label space and partial-credit policy were reviewed against public IT-support references during development
- `.openenvignore` is present
- Docker smoke coverage exists through the checked-in GitHub Actions workflow and recorded April 6 run
- `inference.py` structured `[START]`, `[STEP]`, and `[END]` logging is verified
- `uv.lock` is checked in and `openenv validate` now passes on the current repo state
- a clean-copy install-and-run pass has been completed

The remaining work is optional benchmark expansion rather than submission-readiness work:

- make the simulator even more emergent instead of partially authored
- broaden the data distribution further
- replace the local policy search loop with a more training-oriented learning setup if needed later

The short TRL / GRPO README example remains optional and is still deferred because it is not required for this project to be understandable, runnable, or judgeable.