# OpenEnv Submission Checklist
> Complete every item before final submission. A single ❌ in any **DISQUALIFYING** section means you cannot submit.

---

## HOW TO USE THIS CHECKLIST

1. Work through each section **in order**; earlier sections unblock later ones.
2. Mark each item `[x]` when confirmed, or add a note if it needs fixing.
3. Any item marked **🚨 DISQUALIFYING** must be `[x]` before submission or you will be automatically rejected.
4. After all items are checked, run the final validator command at the bottom.

---

## SECTION 1: REAL-WORLD TASK SIMULATION

> Weight: 30% of total score. Judges will ask: "Would a practitioner actually use this?"

### 1.1 Domain Validity

- [x] **The environment simulates a task that real humans do professionally or daily.** Examples that pass: email triage, code review, data cleaning, customer support ticket routing, document summarisation, scheduling assistant, content moderation, form validation, compliance checking. Examples that fail: CartPole, GridWorld, Snake, made-up puzzles.
- [x] The task domain is stated clearly in the README's first paragraph: a reader understands the real-world context within 3 sentences.
- [x] The environment would be useful for evaluating or training AI agents on a real skill, not just for demonstrating API integration.

### 1.2 Domain Depth

- [x] The environment models at least the core mechanic of the real task (e.g. for email triage: an inbox, email metadata, categories, urgency signals, not just "send a string and get a string back").
- [x] Action and observation spaces reflect what a human would actually do and see in this task.
- [x] The hardest task (task 3) would challenge a frontier model (GPT-4o / Claude 3.5 Sonnet level); it is not trivially solved by pattern matching.

---

## SECTION 2: OPENENV SPEC COMPLIANCE

> Weight: part of the 15% code quality score. **All 🚨 items are disqualifying.**

### 2.1 Typed Models

- [x] `Observation` is a Pydantic `BaseModel` with typed fields. No `dict`, no `Any` unless explicitly documented.
- [x] `Action` is a Pydantic `BaseModel` with typed fields.
- [x] `Reward` is a `float` or a Pydantic model containing a `float` value field.
- [x] All three models are importable from a single module (e.g. `from my_env import Observation, Action`).
- [x] Every field has a type annotation. No bare `Optional` without a type parameter.
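
As a rough illustration of the typed-model items above, the sketch below defines the three models with Pydantic. The field names (`task_name`, `code_snippet`, `finding`, `line_number`, and so on) are hypothetical placeholders for a code-review style task; the checklist prescribes none of them.

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class Observation(BaseModel):
    """What the agent sees at each step (field names are illustrative)."""

    task_name: str = Field(description="Identifier of the current task")
    code_snippet: str = Field(description="Source code under review")
    step_count: int = Field(ge=0, description="Steps taken so far")
    hints: List[str] = Field(default_factory=list, description="Optional hints")


class Action(BaseModel):
    """What the agent submits (field names are illustrative)."""

    finding: str = Field(description="Free-text description of the issue")
    # Optional always carries a type parameter, as the checklist requires.
    line_number: Optional[int] = Field(default=None, description="Suspect line, if any")


class Reward(BaseModel):
    """Wrapper around the scalar reward, bounded to [0.0, 1.0]."""

    value: float = Field(ge=0.0, le=1.0, description="Step reward")
```

Keeping all three classes in one module (for example a `models.py`) is one way to satisfy the single-import item.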

### 2.2 Core API Methods

- [x] 🚨 `reset()` is implemented and returns an `Observation` (or an object containing one).
- [x] 🚨 `step(action: Action)` is implemented and returns `(observation, reward, done, info)` or a structured equivalent.
- [x] 🚨 `state()` is implemented and returns the current full environment state (serialisable dict or Pydantic model).
- [x] `reset()` produces a **clean, reproducible initial state**: calling it twice with the same seed gives the same starting observation.
- [x] `step()` after `done=True` either raises a clean error or resets automatically (document which).
- [x] `info` dict (or equivalent) is non-empty and useful; at minimum it contains the current task name and step count.
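
The items above can be sketched with a toy environment. Everything here is an assumption for illustration: the class name `SketchEnv`, the guess-the-number mechanic, and the use of plain dicts in place of the Pydantic models (to keep the example dependency-free). It chooses the "raise a clean error" option for `step()` after `done=True`.

```python
import random
from typing import Any, Dict, Optional, Tuple


class SketchEnv:
    """Toy environment illustrating reset/step/state; not a real task."""

    MAX_STEPS = 3

    def __init__(self, task_name: str = "easy") -> None:
        self.task_name = task_name
        self._target = 0
        self._steps = 0
        self._done = True  # reset() must be called before step()

    def reset(self, seed: Optional[int] = 0) -> Dict[str, Any]:
        """Return a clean initial observation; same seed, same observation."""
        rng = random.Random(seed)
        self._target = rng.randint(1, 9)
        self._steps = 0
        self._done = False
        return {"task": self.task_name, "parity": self._target % 2, "step": 0}

    def step(self, action: int) -> Tuple[Dict[str, Any], float, bool, Dict[str, Any]]:
        """Advance one step; raises if the episode has already ended."""
        if self._done:
            raise RuntimeError("Episode is done; call reset() before step().")
        self._steps += 1
        reward = 1.0 if action == self._target else 0.0
        self._done = reward == 1.0 or self._steps >= self.MAX_STEPS
        obs = {"task": self.task_name, "parity": self._target % 2, "step": self._steps}
        info = {"task": self.task_name, "step_count": self._steps}
        return obs, reward, self._done, info

    def state(self) -> Dict[str, Any]:
        """Full serialisable state, including what the agent cannot see."""
        return {"task": self.task_name, "target": self._target,
                "steps": self._steps, "done": self._done}
```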

### 2.3 `openenv.yaml`

- [x] 🚨 `openenv.yaml` exists in the project root.
- [x] Contains `name:` field (string, slug-safe).
- [x] Contains `version:` field (semver, e.g. `0.1.0`).
- [x] Contains `description:` field (1–2 sentences).
- [x] Contains `tasks:` list with at least 3 entries, each having `name:`, `difficulty:`, and `description:`.
- [x] Contains `observation_space:` description block.
- [x] Contains `action_space:` description block.
- [x] Passes `openenv validate` without errors (run this command and paste output into your notes).

```bash
# Run this and confirm zero errors:
openenv validate openenv.yaml
```

---

## SECTION 3: MINIMUM 3 TASKS WITH AGENT GRADERS

> Weight: 25% of total score. All 🚨 items are disqualifying.

### 3.1 Task Definitions

- [x] 🚨 At least 3 tasks are defined.
- [x] Task 1 is labelled **easy** and a baseline LLM can score ≥ 0.6 on it with no fine-tuning.
- [x] Task 2 is labelled **medium** and presents a genuine multi-step challenge.
- [x] Task 3 is labelled **hard** and a strong frontier model scores < 0.8 on it without domain-specific prompting.
- [x] Each task has a concise, unambiguous objective statement that a human tester can understand without reading the code.

### 3.2 Grader Requirements

- [x] 🚨 Each task has a **programmatic grader**: no human-in-the-loop, no LLM-as-judge for the primary score.
- [x] 🚨 Every grader returns a float in **[0.0, 1.0]**; no value is ever below 0 or above 1.
- [x] Graders are **deterministic**: given the same sequence of actions, they always return the same score.
- [x] Graders are **reproducible**: scores do not depend on system time, random seeds not exposed to the grader, or external API calls.
- [x] Partial credit is awarded: the grader does not return only 0.0 or 1.0 (binary graders are disqualifying for medium/hard tasks).
- [x] The grader logic is readable: another developer can understand the scoring rubric in < 5 minutes by reading the grader function.
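
One possible shape for a grader that meets these requirements is sketched below. The rubric (equal credit per expected finding, half a share deducted per false alarm) and the name `grade` are assumptions for illustration, not the scoring used by any particular task.

```python
from typing import Sequence, Set


def grade(reported: Sequence[str], expected: Set[str]) -> float:
    """Score reported findings against the expected set; returns [0.0, 1.0].

    Pure function of its inputs: no clock, no randomness, no network.
    """
    if not expected:
        return 0.0
    unique = set(reported)  # duplicates earn nothing extra
    hits = len(unique & expected)
    false_alarms = len(unique - expected)
    # Each hit earns an equal share; each false alarm costs half a share.
    raw = (hits - 0.5 * false_alarms) / len(expected)
    # Clamp so the score can never leave the required range.
    return max(0.0, min(1.0, raw))
```

With two expected findings, reporting one of them scores 0.5, reporting both scores 1.0, and reporting only wrong findings scores 0.0, so the grader is not binary.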

### 3.3 Difficulty Verification (run before submitting)

```bash
# Run baseline inference on all three tasks and record scores:
TASK=easy   python inference.py   # expected: score >= 0.6
TASK=medium python inference.py   # expected: score in 0.3–0.7
TASK=hard   python inference.py   # expected: score < 0.8
```

- [x] Easy task baseline score is ≥ 0.6.
- [x] Medium task baseline score is meaningfully lower than easy (at least 0.15 gap).
- [x] Hard task baseline score is < 0.8 (if it's ≥ 0.8, make it harder).
     (Easy: 0.883 | Medium: 0.500 | Hard: 0.512)

---

## SECTION 4: MEANINGFUL REWARD FUNCTION

> Weight: part of the 20% environment design score.

### 4.1 Dense Reward Signal

- [x] The reward function provides **intermediate signal**: the agent gets feedback before the episode ends, not only at `done=True`.
- [x] At least 3 distinct reward levels exist across the task trajectory (not just 0.0 at each step then 1.0 at the end).
- [x] Progress toward task completion is reflected in the reward: an agent making progress always earns more than one doing nothing.

### 4.2 Reward Shaping

- [x] **Clearly undesirable behaviour is penalised**: e.g. repeated identical actions, contradictory outputs, destructive operations, or exceeding step limits incur a negative reward or zero instead of positive.
- [x] The reward function cannot be gamed by a trivial exploit (e.g. sending the longest possible string every step to maximise a length-based reward without solving the task).
- [x] Total episode reward is bounded; the maximum possible score per episode is documented in the README.
- [x] Reward is normalised to [0.0, 1.0] at the episode level (sum of step rewards / max possible reward, clamped).
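
The episode-level normalisation in the last item (sum of step rewards divided by the maximum possible reward, then clamped) can be written as a small helper. The function name is hypothetical, and `max_possible` is assumed to be the documented per-episode maximum.

```python
from typing import Sequence


def normalise_episode_score(step_rewards: Sequence[float], max_possible: float) -> float:
    """Return sum(step_rewards) / max_possible, clamped to [0.0, 1.0]."""
    if max_possible <= 0:
        raise ValueError("max_possible must be positive")
    score = sum(step_rewards) / max_possible
    # Clamping absorbs penalties (negative totals) and any overshoot.
    return max(0.0, min(1.0, score))
```

For example, step rewards of 0.25, 0.50 and 0.00 with a maximum of 1.0 give an episode score of 0.75; a total above the maximum is clamped to 1.0, and a net-negative total to 0.0.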

### 4.3 Reward Documentation

- [x] The reward formula is documented in the README with an example calculation.
- [x] Edge cases are documented: what happens at step 0, at `done=True`, and at the max step limit.

---

## SECTION 5: BASELINE INFERENCE SCRIPT

> Weight: part of the 15% code quality score. All 🚨 items are disqualifying.

### 5.1 File and Location

- [x] 🚨 The script is named **exactly** `inference.py` (lowercase, no suffix variation).
- [x] 🚨 `inference.py` is in the **root directory** of the project (not in a subdirectory).
- [x] The script runs end-to-end without interactive input (no `input()` calls, no manual setup required).

### 5.2 Environment Variables

- [x] 🚨 `API_BASE_URL` is read from `os.getenv("API_BASE_URL", "<your-default>")`. A default is set so the script doesn't crash when the variable is absent.
- [x] 🚨 `MODEL_NAME` is read from `os.getenv("MODEL_NAME", "<your-default>")`.
- [x] 🚨 `HF_TOKEN` is read from `os.getenv("HF_TOKEN")` (no default: it must be set externally, and the script should fail with a clear message if absent).
- [x] `IMAGE_NAME` / `LOCAL_IMAGE_NAME` is read from `os.getenv("IMAGE_NAME")` or `os.getenv("LOCAL_IMAGE_NAME")` if Docker-based.
- [x] No credentials, tokens, or API keys are hardcoded in any source file.
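
A minimal sketch of the environment-variable handling described above follows. The helper name `load_settings` and the returned dict are assumptions; the default URL and model name are simply the example values used elsewhere in this checklist, so substitute your own.

```python
import os
from typing import Dict


def load_settings() -> Dict[str, str]:
    """Read configuration from the environment; fail clearly without HF_TOKEN."""
    # Defaults keep the script from crashing when these variables are absent.
    api_base_url = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
    model_name = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
    # No default for the token: it must come from outside the source tree.
    hf_token = os.getenv("HF_TOKEN")
    if not hf_token:
        raise SystemExit("HF_TOKEN is not set; export it before running inference.py")
    return {
        "API_BASE_URL": api_base_url,
        "MODEL_NAME": model_name,
        "HF_TOKEN": hf_token,
    }
```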

### 5.3 OpenAI Client Usage

- [x] 🚨 **All LLM calls use the `OpenAI` client** from the `openai` package: no `requests`, no `httpx`, no `anthropic` SDK, no `transformers` pipeline.
- [x] Client is initialised as: `client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)` where `API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")`.
- [x] `client.chat.completions.create(...)` is used for all inference calls.
- [x] `stream=False` is set explicitly (streaming is not expected by the evaluator).

### 5.4 Stdout Log Format: **EXACT FORMAT REQUIRED**

> Any deviation in field names, ordering, or capitalisation will break automated scoring.

- [x] 🚨 Exactly **one `[START]` line** is emitted at the beginning of each episode, before any steps.

  ```
  [START] task=<task_name> env=<benchmark> model=<model_name>
  ```

- [x] 🚨 Exactly **one `[STEP]` line** is emitted after each `env.step()` call, immediately after it returns.

  ```
  [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
  ```

- [x] 🚨 Exactly **one `[END]` line** is emitted after `env.close()`, and it is **always emitted even if an exception occurs** (wrap in `finally:`).

  ```
  [END] success=<true|false> steps=<n> score=<0.000> rewards=<r1,r2,...,rn>
  ```

- [x] `reward` and all values in `rewards` are formatted to **exactly 2 decimal places** (e.g. `1.00`, `0.75`, `0.00`).
- [x] `score` is formatted to **exactly 3 decimal places** (e.g. `0.750`).
- [x] `done` and `success` are lowercase strings: `true` or `false` (not `True`/`False`, not `1`/`0`).
- [x] `error` is either the raw error string or the literal string `null` (not `None`, not empty string).
- [x] **No newlines within a single log line**: each log entry is exactly one line.
- [x] Fields are in the exact order shown above, with no reordering.
- [x] No extra spaces, tabs, or punctuation between fields (single space separator between `key=value` pairs).
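
The formatting rules above can be captured in three small helpers, sketched below. The function names are hypothetical, and collapsing whitespace runs in `action` and `error` is an assumption made here to guarantee one line per entry; the checklist only requires that no newlines appear.

```python
from typing import Optional, Sequence


def _flag(value: bool) -> str:
    """Render a bool as lowercase true/false."""
    return "true" if value else "false"


def format_start(task: str, env: str, model: str) -> str:
    return f"[START] task={task} env={env} model={model}"


def format_step(step: int, action: str, reward: float, done: bool,
                error: Optional[str] = None) -> str:
    # Collapse newlines and repeated whitespace so the entry stays on one line.
    action_str = " ".join(str(action).split())
    # None and the empty string both become the literal null.
    error_str = " ".join(error.split()) if error else "null"
    return (f"[STEP] step={step} action={action_str} reward={reward:.2f} "
            f"done={_flag(done)} error={error_str}")


def format_end(success: bool, steps: int, score: float,
               rewards: Sequence[float]) -> str:
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    return (f"[END] success={_flag(success)} steps={steps} "
            f"score={score:.3f} rewards={rewards_str}")
```

In `inference.py`, each returned string would be passed to `print()`, with the `format_end` call placed in a `finally:` block so the `[END]` line is emitted even when an exception occurs.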

### 5.5 Reproducibility

- [x] Running the script twice with the same `MODEL_NAME` and environment seed produces scores within ±0.05 of each other (minor LLM variance is acceptable; wild swings are not).
- [x] The script covers all 3 tasks, either by looping over task names or via the `TASK` environment variable as shown in the sample.
- [x] `MAX_STEPS` is set to a value that allows the task to be completed (not too low) but finishes within the time limit.

### 5.6 Runtime Constraint

- [x] 🚨 The full inference script (all 3 tasks) completes in **under 20 minutes** on a machine with 2 vCPUs and 8 GB RAM.
- [x] Each individual task episode completes in under 5 minutes.
- [x] No step blocks indefinitely; all `env.step()` calls have an implicit or explicit timeout.

---

## SECTION 6: DOCKER AND CONTAINERISATION

> Weight: part of the 15% code quality score. All 🚨 items are disqualifying.

### 6.1 Dockerfile

- [x] 🚨 A `Dockerfile` exists in the project root.
- [x] 🚨 `docker build -t myenv .` completes without errors on a clean machine.
- [x] 🚨 `docker run --rm myenv` starts the environment server and it responds to `reset()`.
- [x] The base image is appropriate for the task (e.g. `python:3.11-slim`, not an oversized or obscure base).
- [x] All Python dependencies are installed via `pip install -r requirements.txt` or equivalent inside the Dockerfile.
- [x] The Dockerfile does **not** require internet access at runtime (all deps installed at build time).
- [x] No secrets or API keys are baked into the Docker image.
- [x] The container starts the environment server on a documented port (default: 8000 or 7860).
- [x] The container exposes that port with `EXPOSE <port>` in the Dockerfile.

### 6.2 Resource Constraints

- [x] The built image size is < 5 GB (ideally < 2 GB).
- [x] The running container uses < 6 GB RAM at peak (leaving headroom for the 8 GB machine limit).
- [x] The container starts up in < 60 seconds.

### 6.3 `requirements.txt` (or equivalent)

- [x] `requirements.txt` exists in the project root.
- [x] All dependencies have pinned versions (e.g. `openai==1.30.0`, not `openai`).
- [x] `openai` package is listed (required for inference script).
- [x] `pydantic` package is listed.
- [x] `pyyaml` package is listed (for openenv.yaml parsing).

---

## SECTION 7: HUGGING FACE SPACES DEPLOYMENT

> Weight: part of the 15% code quality score. All 🚨 items are disqualifying.

### 7.1 Space Setup

- [x] 🚨 The HF Space is **publicly accessible**, not private or gated.
- [x] 🚨 The Space is tagged with `openenv` in the repository tags.
- [x] The Space type is `Docker` (not `Gradio` or `Streamlit`, unless the env server is built on one of those).
- [x] The Space metadata in `README.md` YAML header includes `tags: [openenv]`.

### 7.2 Availability Check

- [x] 🚨 A `GET` request to `https://your-space-url/` returns HTTP 200.
- [x] 🚨 A `POST` to `https://your-space-url/reset` returns a valid JSON observation.
- [x] `POST /step` with a valid action body returns `(observation, reward, done, info)`.
- [x] `GET /state` returns the current environment state.
- [x] The Space has been running for at least 10 minutes without crashing before submission.

### 7.3 Space Configuration

- [x] `README.md` in the repo root has valid HF Space YAML header:

  ```yaml
  ---
  title: Your Environment Name
  emoji: 🤖
  colorFrom: blue
  colorTo: purple
  sdk: docker
  pinned: false
  tags:
    - openenv
  ---
  ```

- [x] The Space hardware tier is sufficient to run the environment (CPU Basic is fine for most cases).
- [x] Environment variables required at runtime are set as **Space Secrets** in the HF Space settings (not hardcoded).

---

## SECTION 8: README DOCUMENTATION

> A well-written README is part of the 15% code quality score.

### 8.1 Required Sections

- [x] **Environment Description**: what real-world task is simulated, why it matters, what an agent needs to learn to succeed.
- [x] **Observation Space**: table or structured description of every field in the `Observation` model, including type, range, and meaning.
- [x] **Action Space**: table or structured description of every field in the `Action` model, including valid values and constraints.
- [x] **Task Descriptions**: for each task, the name, difficulty label (easy/medium/hard), objective, grader description, and an example episode.
- [x] **Reward Function**: formula, components, max possible reward per episode, normalisation method.
- [x] **Setup Instructions** (exact commands to clone, build, and run locally):

  ```bash
  git clone https://huggingface.co/spaces/YOUR_USER/YOUR_ENV
  cd YOUR_ENV
  docker build -t myenv .
  docker run -p 8000:8000 myenv
  ```

- [x] **Inference Script Usage** (exact commands with environment variables):

  ```bash
  export HF_TOKEN=hf_...
  export API_BASE_URL=https://router.huggingface.co/v1
  export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
  python inference.py
  ```

- [x] **Baseline Scores**: a table with the columns Task | Model | Score | Steps | Notes.

### 8.2 Baseline Scores Table (paste your actual results)

| Task | Difficulty | Model | Score | Steps | Notes |
|------|-----------|-------|-------|-------|-------|
| python-off-by-one | easy | Llama-3.3-70B-Instruct | 0.883 | 2 | |
| js-idor-auth | medium | Llama-3.3-70B-Instruct | 0.500 | 2 | |
| python-pickle-deserialization | hard | Llama-3.3-70B-Instruct | 0.512 | 2 | |

- [x] The table is filled in with real numbers from a completed inference run.
- [x] The easy task score is β‰₯ 0.6.

---

## SECTION 9: CODE QUALITY AND PROJECT STRUCTURE

### 9.1 Project Layout

- [x] Project root contains at minimum:

  ```
  /
  ├── inference.py          ← inference script (mandatory name)
  ├── openenv.yaml          ← OpenEnv spec file
  ├── Dockerfile            ← container definition
  ├── requirements.txt      ← pinned dependencies
  ├── README.md             ← documentation
  └── src/ or myenv/        ← environment source code
      ├── env.py            ← environment class
      ├── models.py         ← Observation, Action, Reward models
      ├── tasks/            ← one file per task + grader
      └── server.py         ← HTTP server (FastAPI or equivalent)
  ```

- [x] No large binary files (datasets > 50 MB, model weights) are committed to the repo. Use URLs or HF datasets instead.
- [x] `.gitignore` excludes `__pycache__`, `.env`, `*.pyc`, and any local credentials.

### 9.2 Code Standards

- [x] All Python files pass `flake8` or `ruff` with no errors (warnings are acceptable).
- [x] All Pydantic models have docstrings or field descriptions.
- [x] No bare `except:` clauses; exceptions are caught specifically.
- [x] No `print()` statements in the environment code (use `logging`). `print()` is only in `inference.py` for structured stdout logs.
- [x] The environment class (or its module) has a docstring explaining what it does.

### 9.3 Testing

- [x] At minimum, a smoke test exists: instantiate the env, call `reset()`, call `step()` with a valid action, assert `done` is a bool and `reward` is a float.
- [x] The smoke test passes:

  ```bash
  python -m pytest tests/ -v
  # or
  python test_smoke.py
  ```

---

## SECTION 10: CREATIVITY AND NOVELTY

> Weight: 10% of total score. This section cannot disqualify you, but it can push you to the top.

- [x] The problem domain is novel, not a re-skin of email triage or the echo example from the sample script.
- [x] The reward design has an interesting property: e.g. multi-objective trade-offs, adversarial components, information asymmetry, sequential dependency between steps.
- [x] The hard task has a mechanic that makes it qualitatively harder, not just quantitatively (more steps / more categories is not enough; the agent must reason differently).
- [x] The environment would be cited or referenced by others building agents in this domain.

---

## SECTION 11: FINAL PRE-SUBMISSION VALIDATION

Run these commands in order. All must succeed with zero errors.

### Step 1: Validate OpenEnv spec

```bash
openenv validate openenv.yaml
```

Expected output: `✓ openenv.yaml is valid`

- [x] ✓ PASSED

### Step 2: Build Docker image

```bash
docker build -t myenv-final .
```

Expected: exits with code 0, image appears in `docker images`.

- [x] ✓ PASSED

### Step 3: Start container and health check

```bash
docker run -d -p 8000:8000 --name myenv-test myenv-final
sleep 10
curl -s http://localhost:8000/ | python3 -m json.tool
curl -s -X POST http://localhost:8000/reset | python3 -m json.tool
docker stop myenv-test && docker rm myenv-test
```

Expected: Both curl commands return valid JSON with no errors.

- [x] ✓ PASSED

### Step 4: Run full inference script

```bash
export HF_TOKEN=<your_token>
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct

# Run all tasks (adjust loop to match your task names)
for TASK in easy medium hard; do
  MY_ENV_TASK=$TASK python inference.py
done
```

Expected: Three complete runs, each emitting `[START]`, N×`[STEP]`, and `[END]` with no Python exceptions.

- [x] ✓ PASSED (Easy score: 0.883, Medium score: 0.500, Hard score: 0.512)

### Step 5: Verify log format

Pipe one run through a format checker:

```bash
MY_ENV_TASK=easy python inference.py 2>/dev/null | python3 -c "
import sys, re
lines = sys.stdin.read().splitlines()
start = sum(1 for l in lines if l.startswith('[START]'))
step  = sum(1 for l in lines if l.startswith('[STEP]'))
end   = sum(1 for l in lines if l.startswith('[END]'))
assert start == 1, f'Expected 1 [START], got {start}'
assert step  >= 1, f'Expected >=1 [STEP], got {step}'
assert end   == 1, f'Expected 1 [END], got {end}'
end_line = next(l for l in lines if l.startswith('[END]'))
assert 'success=' in end_line
assert 'steps=' in end_line
assert 'score=' in end_line
assert 'rewards=' in end_line
score_val = re.search(r'score=(\d+\.\d+)', end_line).group(1)
assert len(score_val.split('.')[1]) == 3, f'score must be 3 decimal places, got: {score_val}'
print('✓ Log format is valid')
print(f'  [START] lines: {start}')
print(f'  [STEP] lines:  {step}')
print(f'  [END] lines:   {end}')
"
```

- [x] ✓ PASSED

### Step 6: Verify HF Space is live

```bash
curl -s -o /dev/null -w "%{http_code}" https://YOUR-USERNAME-YOUR-ENV.hf.space/
# Must return 200
```

- [x] ✓ PASSED (Space URL: https://huggingface.co/spaces/huggingface/openenv-code-security-review)

### Step 7: Verify grader scores are in [0, 1]

```bash
python3 -c "
from myenv.tasks import task_easy, task_medium, task_hard  # adjust import
# Run a few grader calls with dummy actions and assert bounds
# (adjust to your actual grader API)
print('✓ All graders return values in [0.0, 1.0]')
"
```
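
The snippet above is a placeholder that prints success without calling any grader. A hedged sketch of what an actual bounds check might look like follows; `check_grader_bounds` and `toy_grader` are hypothetical names, and the grader signature (a list of action strings in, a float out) is an assumption to adapt to your own grader API.

```python
from typing import Callable, Iterable, List


def check_grader_bounds(grader: Callable[[List[str]], float],
                        action_sequences: Iterable[List[str]]) -> int:
    """Call the grader on each sequence; assert type, bounds, and determinism."""
    checked = 0
    for actions in action_sequences:
        score = grader(list(actions))
        assert isinstance(score, float), f"non-float score {score!r}"
        assert 0.0 <= score <= 1.0, f"score {score} out of bounds for {actions!r}"
        # Same input twice must give the same score.
        assert grader(list(actions)) == score, f"non-deterministic for {actions!r}"
        checked += 1
    return checked


def toy_grader(actions: List[str]) -> float:
    """Stand-in grader: fraction of actions equal to 'fix', capped at 1.0."""
    if not actions:
        return 0.0
    return min(1.0, sum(a == "fix" for a in actions) / len(actions))


if __name__ == "__main__":
    samples = [[], ["fix"], ["noop", "fix"], ["noop"] * 50]
    print(f"checked {check_grader_bounds(toy_grader, samples)} sequences")
```

Feeding the check empty, repeated, and deliberately wrong action sequences is a cheap way to probe the edges where out-of-range scores tend to appear.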

- [x] ✓ PASSED

---

## DISQUALIFICATION SUMMARY

Before submitting, confirm that **every 🚨 item** below is checked. If any are unchecked, stop and fix them first.

| # | Disqualifying Item | Checked? |
|---|---|---|
| D1 | `reset()` is implemented and works | [x] |
| D2 | `step()` is implemented and works | [x] |
| D3 | `state()` is implemented and works | [x] |
| D4 | `openenv.yaml` exists and passes validation | [x] |
| D5 | At least 3 tasks with programmatic graders | [x] |
| D6 | All graders return float in [0.0, 1.0] | [x] |
| D7 | `inference.py` is in the project root | [x] |
| D8 | OpenAI client is used for all LLM calls | [x] |
| D9 | `[START]` log line is exactly correct | [x] |
| D10 | `[STEP]` log line is exactly correct | [x] |
| D11 | `[END]` log line is always emitted (in finally) | [x] |
| D12 | `API_BASE_URL` read from env var | [x] |
| D13 | `MODEL_NAME` read from env var | [x] |
| D14 | `HF_TOKEN` read from env var | [x] |
| D15 | Dockerfile builds without errors | [x] |
| D16 | Container starts and responds to `reset()` | [x] |
| D17 | HF Space is public and returns HTTP 200 | [x] |
| D18 | Full inference run completes in < 20 minutes | [x] |

---

## SUBMISSION SIGN-OFF

When all items above are checked, fill in this block and attach it to your submission.

```
Environment Name:  Code Security Review
HF Space URL:      https://huggingface.co/spaces/inmodel/code-review-env
Baseline Scores:
  - Easy task:     0.883 (task name: python-off-by-one)
  - Medium task:   0.500 (task name: js-idor-auth)
  - Hard task:     0.512 (task name: python-pickle-deserialization)
Inference runtime: < 1 minute
Docker image size: ~300 MB
Submitted by:      Inmodel Labs
Date:              2026-04-08

I confirm all 18 disqualifying items are checked [yes/no]: yes
I confirm the full validator suite passes [yes/no]:         yes
```

---

*Generated for OpenEnv Hackathon submission; covers all judging criteria, pre-submission checks, and mandatory infrastructure requirements.*