Divyank1607 committed on
Commit
742e175
·
2 Parent(s): 5ce4119ec7c2dd

Merge pull request #1 from DIVYANK-BHARDWAJ/Divyank


Add Java rate limiter and C++ event dispatcher tasks; update requirem…

Dockerfile CHANGED
@@ -4,7 +4,8 @@ WORKDIR /app
 
 # Install dependencies first (layer cache)
 COPY requirements.txt .
-RUN pip install --no-cache-dir -r requirements.txt
+RUN pip install --no-cache-dir --upgrade pip && \
+    pip install --no-cache-dir -r requirements.txt
 
 # Copy application code
 COPY server/ ./server/
OPENENV_SUBMISSION_CHECKLIST.md ADDED
@@ -0,0 +1,531 @@
# OpenEnv Submission Checklist

> Complete every item before final submission. A single ❌ in any **DISQUALIFYING** section means you cannot submit.

---

## HOW TO USE THIS CHECKLIST

1. Work through each section **in order** — earlier sections unblock later ones.
2. Mark each item `[x]` when confirmed, or add a note if it needs fixing.
3. Any item marked **🚨 DISQUALIFYING** must be `[x]` before submission or you will be automatically rejected.
4. After all items are checked, run the final validator command at the bottom.

---

## SECTION 1 — REAL-WORLD TASK SIMULATION

> Weight: 30% of total score. Judges will ask: "Would a practitioner actually use this?"

### 1.1 Domain Validity

- [ ] **The environment simulates a task that real humans do professionally or daily.** Examples that pass: email triage, code review, data cleaning, customer support ticket routing, document summarisation, scheduling assistant, content moderation, form validation, compliance checking. Examples that fail: CartPole, GridWorld, Snake, made-up puzzles.
- [ ] The task domain is stated clearly in the README's first paragraph — a reader understands the real-world context within 3 sentences.
- [ ] The environment would be useful for evaluating or training AI agents on a real skill, not just for demonstrating API integration.

### 1.2 Domain Depth

- [ ] The environment models at least the core mechanic of the real task (e.g. for email triage: an inbox, email metadata, categories, urgency signals — not just "send a string and get a string back").
- [ ] Action and observation spaces reflect what a human would actually do and see in this task.
- [ ] The hardest task (task 3) would challenge a frontier model (GPT-4o / Claude 3.5 Sonnet level) — it is not trivially solved by pattern matching.

---

## SECTION 2 — OPENENV SPEC COMPLIANCE

> Weight: part of the 15% code quality score. **All 🚨 items are disqualifying.**

### 2.1 Typed Models

- [ ] `Observation` is a Pydantic `BaseModel` with typed fields. No `dict`, no `Any` unless explicitly documented.
- [ ] `Action` is a Pydantic `BaseModel` with typed fields.
- [ ] `Reward` is a `float` or a Pydantic model containing a `float` value field.
- [ ] All three models are importable from a single module (e.g. `from my_env import Observation, Action`).
- [ ] Every field has a type annotation. No bare `Optional` without a type parameter.
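A minimal sketch of models that would satisfy 2.1 — the field names are illustrative only (a hypothetical `models.py`), not mandated by the spec:

```python
# models.py — illustrative typed models for 2.1; field names are examples only.
from typing import Optional

from pydantic import BaseModel


class Observation(BaseModel):
    """What the agent sees at each step."""
    task_name: str
    step_count: int
    code_snippet: str   # domain-specific payload (illustrative)
    max_steps: int


class Action(BaseModel):
    """What the agent submits at each step."""
    bug_identified: bool
    bug_type: str
    suggested_fix: Optional[str] = None   # typed Optional, never bare


class Reward(BaseModel):
    """Scalar reward wrapped in a model."""
    value: float
```

All three models live in one module, so `from models import Observation, Action, Reward` satisfies the importability item.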

### 2.2 Core API Methods

- [ ] 🚨 `reset()` is implemented and returns an `Observation` (or an object containing one).
- [ ] 🚨 `step(action: Action)` is implemented and returns `(observation, reward, done, info)` or a structured equivalent.
- [ ] 🚨 `state()` is implemented and returns the current full environment state (serialisable dict or Pydantic model).
- [ ] `reset()` produces a **clean, reproducible initial state** — calling it twice with the same seed gives the same starting observation.
- [ ] `step()` after `done=True` either raises a clean error or resets automatically (document which).
- [ ] `info` dict (or equivalent) is non-empty and useful — at minimum contains the current task name and step count.
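The contract above can be sketched as a toy environment — a hypothetical `ToyEnv`, using plain dicts for brevity where a real submission would return the typed models:

```python
# Illustrative skeleton of the reset()/step()/state() contract from 2.2.
import random


class ToyEnv:
    """Toy environment sketch (not the spec's reference implementation)."""

    def __init__(self, max_steps: int = 5):
        self.max_steps = max_steps
        self._rng = random.Random()
        self._steps = 0
        self._done = False

    def reset(self, seed: int = 0) -> dict:
        # Same seed -> same initial observation (reproducibility item).
        self._rng.seed(seed)
        self._steps = 0
        self._done = False
        return {"task_name": "demo", "step_count": 0, "payload": self._rng.randint(0, 99)}

    def step(self, action: dict):
        if self._done:
            # Documented post-done behaviour: raise a clean error.
            raise RuntimeError("step() called after done=True; call reset()")
        self._steps += 1
        self._done = self._steps >= self.max_steps
        obs = {"task_name": "demo", "step_count": self._steps, "payload": self._rng.randint(0, 99)}
        reward = 0.1 if action.get("valid") else 0.0
        info = {"task_name": "demo", "step_count": self._steps}  # non-empty info
        return obs, reward, self._done, info

    def state(self) -> dict:
        # Full, serialisable snapshot of the environment.
        return {"steps": self._steps, "done": self._done, "max_steps": self.max_steps}
```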

### 2.3 `openenv.yaml`

- [ ] 🚨 `openenv.yaml` exists in the project root.
- [ ] Contains `name:` field (string, slug-safe).
- [ ] Contains `version:` field (semver, e.g. `0.1.0`).
- [ ] Contains `description:` field (1–2 sentences).
- [ ] Contains `tasks:` list with at least 3 entries, each having `name:`, `difficulty:`, and `description:`.
- [ ] Contains `observation_space:` description block.
- [ ] Contains `action_space:` description block.
- [ ] Passes `openenv validate` without errors (run this command and paste the output into your notes).

```bash
# Run this and confirm zero errors:
openenv validate openenv.yaml
```

---

## SECTION 3 — MINIMUM 3 TASKS WITH AGENT GRADERS

> Weight: 25% of total score. All 🚨 items are disqualifying.

### 3.1 Task Definitions

- [ ] 🚨 At least 3 tasks are defined.
- [ ] Task 1 is labelled **easy** and a baseline LLM can score ≥ 0.6 on it with no fine-tuning.
- [ ] Task 2 is labelled **medium** and presents a genuine multi-step challenge.
- [ ] Task 3 is labelled **hard** and a strong frontier model scores < 0.8 on it without domain-specific prompting.
- [ ] Each task has a concise, unambiguous objective statement that a human tester can understand without reading the code.

### 3.2 Grader Requirements

- [ ] 🚨 Each task has a **programmatic grader** — no human-in-the-loop, no LLM-as-judge for the primary score.
- [ ] 🚨 Every grader returns a float in **[0.0, 1.0]** — no values below 0 or above 1 ever.
- [ ] Graders are **deterministic**: given the same sequence of actions, they always return the same score.
- [ ] Graders are **reproducible**: scores do not depend on system time, random seeds not exposed to the grader, or external API calls.
- [ ] Partial credit is awarded — the grader does not return only 0.0 or 1.0 (binary graders are disqualifying for medium/hard tasks).
- [ ] The grader logic is readable: another developer can understand the scoring rubric in < 5 minutes by reading the grader function.
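A grader meeting these requirements can be sketched as below — the rubric components and weights are illustrative, but the pattern (deterministic, partial credit, hard clamp to [0.0, 1.0]) is what the checklist asks for:

```python
# Sketch of a deterministic, partial-credit grader per 3.2.
# Rubric components and weights are illustrative examples.
def grade_bug_report(action: dict, ground_truth: dict) -> float:
    """Award partial credit for each rubric component the agent got right."""
    score = 0.0
    if action.get("bug_identified") == ground_truth["bug_identified"]:
        score += 0.4   # found (or correctly ruled out) the bug
    if action.get("bug_type") == ground_truth["bug_type"]:
        score += 0.3   # classified it correctly
    if action.get("severity") == ground_truth["severity"]:
        score += 0.3   # rated severity correctly
    return max(0.0, min(1.0, score))   # hard bound: never outside [0, 1]
```

No clocks, no randomness, no network calls — the same action against the same ground truth always yields the same score.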
92
+
93
+ ### 3.3 Difficulty Verification (run before submitting)
94
+
95
+ ```bash
96
+ # Run baseline inference on all three tasks and record scores:
97
+ TASK=easy python inference.py # expected: score >= 0.6
98
+ TASK=medium python inference.py # expected: score in 0.3–0.7
99
+ TASK=hard python inference.py # expected: score < 0.8
100
+ ```
101
+
102
+ - [ ] Easy task baseline score is β‰₯ 0.6.
103
+ - [ ] Medium task baseline score is meaningfully lower than easy (at least 0.15 gap).
104
+ - [ ] Hard task baseline score is < 0.8 (if it's β‰₯ 0.8, make it harder).
105
+
106
+ ---
107
+
108
+ ## SECTION 4 β€” MEANINGFUL REWARD FUNCTION
109
+
110
+ > Weight: part of the 20% environment design score.
111
+
112
+ ### 4.1 Dense Reward Signal
113
+
114
+ - [ ] The reward function provides **intermediate signal** β€” the agent gets feedback before the episode ends, not only at `done=True`.
115
+ - [ ] At least 3 distinct reward levels exist across the task trajectory (not just 0.0 at each step then 1.0 at the end).
116
+ - [ ] Progress toward task completion is reflected in the reward β€” an agent making progress always earns more than one doing nothing.
117
+
118
+ ### 4.2 Reward Shaping
119
+
120
+ - [ ] **Clearly undesirable behaviour is penalised**: e.g. repeated identical actions, contradictory outputs, destructive operations, or exceeding step limits incur a negative reward or zero instead of positive.
121
+ - [ ] The reward function cannot be gamed by a trivial exploit (e.g. sending the longest possible string every step to maximise a length-based reward without solving the task).
122
+ - [ ] Total episode reward is bounded β€” the maximum possible score per episode is documented in the README.
123
+ - [ ] Reward is normalised to [0.0, 1.0] at the episode level (sum of step rewards / max possible reward, clamped).
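The episode-level normalisation described in the last item is a one-liner worth getting right; a minimal sketch (function name is illustrative):

```python
# Episode-level normalisation per 4.2: divide the summed step rewards by the
# documented maximum, then clamp into [0.0, 1.0].
def episode_score(step_rewards: list, max_possible: float) -> float:
    if max_possible <= 0:
        return 0.0
    total = sum(step_rewards)
    return max(0.0, min(1.0, total / max_possible))
```

The clamp matters: negative shaping penalties or floating-point drift must never push the final score outside the bounds the graders promise.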

### 4.3 Reward Documentation

- [ ] The reward formula is documented in the README with an example calculation.
- [ ] Edge cases are documented: what happens at step 0, at `done=True`, and at the max step limit.

---

## SECTION 5 — BASELINE INFERENCE SCRIPT

> Weight: part of the 15% code quality score. All 🚨 items are disqualifying.

### 5.1 File and Location

- [ ] 🚨 The script is named **exactly** `inference.py` (lowercase, no suffix variation).
- [ ] 🚨 `inference.py` is in the **root directory** of the project (not in a subdirectory).
- [ ] The script runs end-to-end without interactive input (no `input()` calls, no manual setup required).

### 5.2 Environment Variables

- [ ] 🚨 `API_BASE_URL` is read from `os.getenv("API_BASE_URL", "<your-default>")`. A default is set so the script doesn't crash when the variable is absent.
- [ ] 🚨 `MODEL_NAME` is read from `os.getenv("MODEL_NAME", "<your-default>")`.
- [ ] 🚨 `HF_TOKEN` is read from `os.getenv("HF_TOKEN")` (no default — it must be set externally; the script should fail with a clear message if absent).
- [ ] `IMAGE_NAME` / `LOCAL_IMAGE_NAME` is read from `os.getenv("IMAGE_NAME")` or `os.getenv("LOCAL_IMAGE_NAME")` if Docker-based.
- [ ] No credentials, tokens, or API keys are hardcoded in any source file.
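The variable handling above can be sketched as a small loader (the `load_config` helper is illustrative; the defaults are placeholders):

```python
# Sketch of the env-var handling required by 5.2.
import os


def load_config() -> dict:
    api_base_url = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
    model_name = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
    hf_token = os.getenv("HF_TOKEN")  # no default: must be set externally
    if not hf_token:
        # Fail fast with a clear message, per the HF_TOKEN item.
        raise SystemExit("HF_TOKEN is not set; export it before running inference.py")
    return {"api_base_url": api_base_url, "model_name": model_name, "hf_token": hf_token}
```

Centralising the reads in one function makes the "no hardcoded credentials" item easy to audit.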

### 5.3 OpenAI Client Usage

- [ ] 🚨 **All LLM calls use the `OpenAI` client** from the `openai` package — no `requests`, no `httpx`, no `anthropic` SDK, no `transformers` pipeline.
- [ ] The client is initialised as: `client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)` where `API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")`.
- [ ] `client.chat.completions.create(...)` is used for all inference calls.
- [ ] `stream=False` is set explicitly (streaming is not expected by the evaluator).

### 5.4 Stdout Log Format — **EXACT FORMAT REQUIRED**

> Any deviation in field names, ordering, or capitalisation will break automated scoring.

- [ ] 🚨 Exactly **one `[START]` line** is emitted at the beginning of each episode, before any steps.

```
[START] task=<task_name> env=<benchmark> model=<model_name>
```

- [ ] 🚨 Exactly **one `[STEP]` line** is emitted after each `env.step()` call, immediately after it returns.

```
[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
```

- [ ] 🚨 Exactly **one `[END]` line** is emitted after `env.close()`, and it is **always emitted even if an exception occurs** (wrap in `finally:`).

```
[END] success=<true|false> steps=<n> score=<0.000> rewards=<r1,r2,...,rn>
```

- [ ] `reward` and all values in `rewards` are formatted to **exactly 2 decimal places** (e.g. `1.00`, `0.75`, `0.00`).
- [ ] `score` is formatted to **exactly 3 decimal places** (e.g. `0.750`).
- [ ] `done` and `success` are lowercase strings: `true` or `false` (not `True`/`False`, not `1`/`0`).
- [ ] `error` is either the raw error string or the literal string `null` (not `None`, not an empty string).
- [ ] **No newlines within a single log line** — each log entry is exactly one line.
- [ ] Fields are in the exact order shown above — no reordering.
- [ ] No extra spaces, tabs, or punctuation between fields (a single space separates `key=value` pairs).
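The three formats above can be produced by small helpers; this sketch returns each formatted line as well as printing it, which makes the format unit-testable (helper names are illustrative):

```python
# Sketch of helpers emitting the exact [START]/[STEP]/[END] lines from 5.4.
from typing import List, Optional


def start_line(task: str, env: str, model: str) -> str:
    line = f"[START] task={task} env={env} model={model}"
    print(line, flush=True)
    return line


def step_line(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> str:
    # reward to 2 dp, done lowercased, error literal "null" when absent.
    line = (f"[STEP] step={step} action={action} reward={reward:.2f} "
            f"done={str(done).lower()} error={error or 'null'}")
    print(line, flush=True)
    return line


def end_line(success: bool, steps: int, score: float, rewards: List[float]) -> str:
    # score to 3 dp, each reward to 2 dp, comma-joined with no spaces.
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    line = (f"[END] success={str(success).lower()} steps={steps} "
            f"score={score:.3f} rewards={rewards_str}")
    print(line, flush=True)
    return line
```

Call `end_line(...)` inside a `finally:` block so it is emitted even when an exception aborts the episode.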

### 5.5 Reproducibility

- [ ] Running the script twice with the same `MODEL_NAME` and environment seed produces scores within ±0.05 of each other (minor LLM variance is acceptable; wild swings are not).
- [ ] The script covers all 3 tasks — either by looping over task names or via a `TASK` environment variable as shown in the sample.
- [ ] `MAX_STEPS` is set to a value that allows the task to be completed (not too low) but finishes within the time limit.

### 5.6 Runtime Constraint

- [ ] 🚨 The full inference script (all 3 tasks) completes in **under 20 minutes** on a machine with 2 vCPUs and 8 GB RAM.
- [ ] Each individual task episode completes in under 5 minutes.
- [ ] No step blocks indefinitely — all `env.step()` calls have an implicit or explicit timeout.

---

## SECTION 6 — DOCKER AND CONTAINERISATION

> Weight: part of the 15% code quality score. All 🚨 items are disqualifying.

### 6.1 Dockerfile

- [ ] 🚨 A `Dockerfile` exists in the project root.
- [ ] 🚨 `docker build -t myenv .` completes without errors on a clean machine.
- [ ] 🚨 `docker run --rm myenv` starts the environment server and it responds to `reset()`.
- [ ] The base image is appropriate for the task (e.g. `python:3.11-slim`, not an oversized or obscure base).
- [ ] All Python dependencies are installed via `pip install -r requirements.txt` or equivalent inside the Dockerfile.
- [ ] The Dockerfile does **not** require internet access at runtime (all deps installed at build time).
- [ ] No secrets or API keys are baked into the Docker image.
- [ ] The container starts the environment server on a documented port (default: 8000 or 7860).
- [ ] The container exposes that port with `EXPOSE <port>` in the Dockerfile.

### 6.2 Resource Constraints

- [ ] The built image size is < 5 GB (ideally < 2 GB).
- [ ] The running container uses < 6 GB RAM at peak (leaving headroom for the 8 GB machine limit).
- [ ] The container starts up in < 60 seconds.

### 6.3 `requirements.txt` (or equivalent)

- [ ] `requirements.txt` exists in the project root.
- [ ] All dependencies have pinned versions (e.g. `openai==1.30.0`, not `openai`).
- [ ] The `openai` package is listed (required for the inference script).
- [ ] The `pydantic` package is listed.
- [ ] The `pyyaml` package is listed (for openenv.yaml parsing).

---

## SECTION 7 — HUGGING FACE SPACES DEPLOYMENT

> Weight: part of the 15% code quality score. All 🚨 items are disqualifying.

### 7.1 Space Setup

- [ ] 🚨 The HF Space is **publicly accessible** — not private or gated.
- [ ] 🚨 The Space is tagged with `openenv` in the repository tags.
- [ ] The Space type is `Docker` (not `Gradio` or `Streamlit`, unless the env server is built on one of those).
- [ ] The Space metadata in the `README.md` YAML header includes `tags: [openenv]`.

### 7.2 Availability Check

- [ ] 🚨 A `GET` request to `https://your-space-url/` returns HTTP 200.
- [ ] 🚨 A `POST` to `https://your-space-url/reset` returns a valid JSON observation.
- [ ] `POST /step` with a valid action body returns `(observation, reward, done, info)`.
- [ ] `GET /state` returns the current environment state.
- [ ] The Space has been running for at least 10 minutes without crashing before submission.

### 7.3 Space Configuration

- [ ] `README.md` in the repo root has a valid HF Space YAML header:

```yaml
---
title: Your Environment Name
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
tags:
  - openenv
---
```

- [ ] The Space hardware tier is sufficient to run the environment (CPU Basic is fine for most cases).
- [ ] Environment variables required at runtime are set as **Space Secrets** in the HF Space settings (not hardcoded).

---

## SECTION 8 — README DOCUMENTATION

> A well-written README is part of the 15% code quality score.

### 8.1 Required Sections

- [ ] **Environment Description** — what real-world task is simulated, why it matters, what an agent needs to learn to succeed.
- [ ] **Observation Space** — table or structured description of every field in the `Observation` model, including type, range, and meaning.
- [ ] **Action Space** — table or structured description of every field in the `Action` model, including valid values and constraints.
- [ ] **Task Descriptions** — for each task: name, difficulty label (easy/medium/hard), objective, grader description, example episode.
- [ ] **Reward Function** — formula, components, max possible reward per episode, normalisation method.
- [ ] **Setup Instructions** — exact commands to clone, build, and run locally:

```bash
git clone https://huggingface.co/spaces/YOUR_USER/YOUR_ENV
cd YOUR_ENV
docker build -t myenv .
docker run -p 8000:8000 myenv
```

- [ ] **Inference Script Usage** — exact commands with environment variables:

```bash
export HF_TOKEN=hf_...
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
python inference.py
```

- [ ] **Baseline Scores** — a table with columns: Task | Model | Score | Steps | Notes.

### 8.2 Baseline Scores Table (paste your actual results)

| Task | Difficulty | Model | Score | Steps | Notes |
|------|-----------|-------|-------|-------|-------|
| task_1 | easy | — | — | — | |
| task_2 | medium | — | — | — | |
| task_3 | hard | — | — | — | |

- [ ] The table is filled in with real numbers from a completed inference run.
- [ ] The easy task score is ≥ 0.6.

---

## SECTION 9 — CODE QUALITY AND PROJECT STRUCTURE

### 9.1 Project Layout

- [ ] Project root contains at minimum:

```
/
├── inference.py        ← inference script (mandatory name)
├── openenv.yaml        ← OpenEnv spec file
├── Dockerfile          ← container definition
├── requirements.txt    ← pinned dependencies
├── README.md           ← documentation
└── src/ or myenv/      ← environment source code
    ├── env.py          ← environment class
    ├── models.py       ← Observation, Action, Reward models
    ├── tasks/          ← one file per task + grader
    └── server.py       ← HTTP server (FastAPI or equivalent)
```

- [ ] No large binary files (datasets > 50 MB, model weights) are committed to the repo. Use URLs or HF datasets instead.
- [ ] `.gitignore` excludes `__pycache__`, `.env`, `*.pyc`, and any local credentials.

### 9.2 Code Standards

- [ ] All Python files pass `flake8` or `ruff` with no errors (warnings are acceptable).
- [ ] All Pydantic models have docstrings or field descriptions.
- [ ] No bare `except:` clauses — exceptions are caught specifically.
- [ ] No `print()` statements in the environment code (use `logging`). `print()` appears only in `inference.py` for the structured stdout logs.
- [ ] The environment class has a module-level docstring explaining what it does.

### 9.3 Testing

- [ ] At minimum, a smoke test exists: instantiate the env, call `reset()`, call `step()` with a valid action, assert `done` is a bool and `reward` is a float.
- [ ] The smoke test passes:

```bash
python -m pytest tests/ -v
# or
python test_smoke.py
```

---

## SECTION 10 — CREATIVITY AND NOVELTY

> Weight: 10% of total score. This section cannot disqualify you, but it can push you to the top.

- [ ] The problem domain is novel — not a re-skin of email triage or the echo example from the sample script.
- [ ] The reward design has an interesting property: e.g. multi-objective trade-offs, adversarial components, information asymmetry, sequential dependency between steps.
- [ ] The hard task has a mechanic that makes it qualitatively harder, not just quantitatively (more steps / more categories is not enough — the agent must reason differently).
- [ ] The environment would be cited or referenced by others building agents in this domain.

---

## SECTION 11 — FINAL PRE-SUBMISSION VALIDATION

Run these commands in order. All must succeed with zero errors.

### Step 1 — Validate OpenEnv spec

```bash
openenv validate openenv.yaml
```

Expected output: `✓ openenv.yaml is valid`

- [ ] ✓ PASSED

### Step 2 — Build Docker image

```bash
docker build -t myenv-final .
```

Expected: exits with code 0, image appears in `docker images`.

- [ ] ✓ PASSED

### Step 3 — Start container and health check

```bash
docker run -d -p 8000:8000 --name myenv-test myenv-final
sleep 10
curl -s http://localhost:8000/ | python3 -m json.tool
curl -s -X POST http://localhost:8000/reset | python3 -m json.tool
docker stop myenv-test && docker rm myenv-test
```

Expected: both curl commands return valid JSON with no errors.

- [ ] ✓ PASSED

### Step 4 — Run full inference script

```bash
export HF_TOKEN=<your_token>
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct

# Run all tasks (adjust loop to match your task names)
for TASK in easy medium hard; do
  MY_ENV_TASK=$TASK python inference.py
done
```

Expected: three complete runs, each emitting `[START]`, N×`[STEP]`, and `[END]` with no Python exceptions.

- [ ] ✓ PASSED — Easy score: ______  Medium score: ______  Hard score: ______

### Step 5 — Verify log format

Pipe one run through a format checker:

```bash
MY_ENV_TASK=easy python inference.py 2>/dev/null | python3 -c "
import sys, re
lines = sys.stdin.read().splitlines()
start = sum(1 for l in lines if l.startswith('[START]'))
step  = sum(1 for l in lines if l.startswith('[STEP]'))
end   = sum(1 for l in lines if l.startswith('[END]'))
assert start == 1, f'Expected 1 [START], got {start}'
assert step >= 1, f'Expected >=1 [STEP], got {step}'
assert end == 1, f'Expected 1 [END], got {end}'
end_line = next(l for l in lines if l.startswith('[END]'))
assert 'success=' in end_line
assert 'steps=' in end_line
assert 'score=' in end_line
assert 'rewards=' in end_line
score_val = re.search(r'score=(\d+\.\d+)', end_line).group(1)
assert len(score_val.split('.')[1]) == 3, f'score must be 3 decimal places, got: {score_val}'
print('✓ Log format is valid')
print(f'  [START] lines: {start}')
print(f'  [STEP]  lines: {step}')
print(f'  [END]   lines: {end}')
"
```

- [ ] ✓ PASSED

### Step 6 — Verify HF Space is live

```bash
curl -s -o /dev/null -w "%{http_code}" https://YOUR-USERNAME-YOUR-ENV.hf.space/
# Must return 200
```

- [ ] ✓ PASSED — Space URL: ______________________________

### Step 7 — Verify grader scores are in [0, 1]

```bash
python3 -c "
from myenv.tasks import task_easy, task_medium, task_hard  # adjust import
# Run a few grader calls with dummy actions and assert bounds
# (adjust to your actual grader API)
print('✓ All graders return values in [0.0, 1.0]')
"
```

- [ ] ✓ PASSED

---

## DISQUALIFICATION SUMMARY

Before submitting, confirm that **every 🚨 item** below is checked. If any are unchecked, stop and fix them first.

| # | Disqualifying Item | Checked? |
|---|---|---|
| D1 | `reset()` is implemented and works | ☐ |
| D2 | `step()` is implemented and works | ☐ |
| D3 | `state()` is implemented and works | ☐ |
| D4 | `openenv.yaml` exists and passes validation | ☐ |
| D5 | At least 3 tasks with programmatic graders | ☐ |
| D6 | All graders return a float in [0.0, 1.0] | ☐ |
| D7 | `inference.py` is in the project root | ☐ |
| D8 | OpenAI client is used for all LLM calls | ☐ |
| D9 | `[START]` log line is exactly correct | ☐ |
| D10 | `[STEP]` log line is exactly correct | ☐ |
| D11 | `[END]` log line is always emitted (in `finally`) | ☐ |
| D12 | `API_BASE_URL` read from env var | ☐ |
| D13 | `MODEL_NAME` read from env var | ☐ |
| D14 | `HF_TOKEN` read from env var | ☐ |
| D15 | Dockerfile builds without errors | ☐ |
| D16 | Container starts and responds to `reset()` | ☐ |
| D17 | HF Space is public and returns HTTP 200 | ☐ |
| D18 | Full inference run completes in < 20 minutes | ☐ |

---

## SUBMISSION SIGN-OFF

When all items above are checked, fill in this block and attach it to your submission.

```
Environment Name:   ___________________________________
HF Space URL:       ___________________________________
Baseline Scores:
  - Easy task:   ______  (task name: _____________)
  - Medium task: ______  (task name: _____________)
  - Hard task:   ______  (task name: _____________)
Inference runtime:  ______ minutes
Docker image size:  ______ MB
Submitted by:       ___________________________________
Date:               ___________________________________

I confirm all 18 disqualifying items are checked [yes/no]: ______
I confirm the full validator suite passes [yes/no]: ______
```

---

*Generated for OpenEnv Hackathon submission — covers all judging criteria, pre-submission checks, and mandatory infrastructure requirements.*
inference.py CHANGED
@@ -1,23 +1,22 @@
1
  """
2
  Baseline inference script for Code Security Review OpenEnv.
3
-
4
- Usage:
5
- python inference.py
6
 
7
  Required environment variables:
8
- API_BASE_URL β€” LLM API endpoint (default: HF router)
9
- MODEL_NAME β€” Model identifier
10
- HF_TOKEN β€” Hugging Face / API key
11
- ENV_BASE_URL β€” Running environment URL (default: http://localhost:7860)
12
  """
13
 
14
  import os
15
  import json
16
  import time
17
  import re
 
18
  from dotenv import load_dotenv
19
 
20
- # Load .env variables before anything else
21
  load_dotenv()
22
 
23
  import requests
@@ -28,6 +27,7 @@ API_BASE_URL = os.environ.get("API_BASE_URL", "https://router.huggingface.co/v1"
28
  MODEL_NAME = os.environ.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
29
  HF_TOKEN = os.environ.get("HF_TOKEN", "")
30
  ENV_BASE_URL = os.environ.get("ENV_BASE_URL", "http://localhost:7860")
 
31
 
32
  client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
33
 
@@ -47,9 +47,26 @@ Schema:
47
  "suggested_fix": "the corrected code snippet or a precise description of the fix"
48
  }"""
49
 
50
- # ── Helpers ───────────────────────────────────────────────────────────────────
 
 
 
 
 
 
 
 
 
 
 
 
 
51
 
52
- from typing import Optional
 
 
 
 
53
 
54
  def env_post(path: str, data: Optional[dict] = None, params: Optional[dict] = None) -> dict:
55
  url = f"{ENV_BASE_URL}{path}"
@@ -59,9 +76,8 @@ def env_post(path: str, data: Optional[dict] = None, params: Optional[dict] = No
59
 
60
 
61
  def parse_json_from_llm(text: str) -> dict:
62
- """Robustly extract JSON from LLM output, stripping markdown fences."""
63
  text = text.strip()
64
- # Strip ```json ... ``` or ``` ... ```
65
  text = re.sub(r"^```(?:json)?\s*", "", text)
66
  text = re.sub(r"\s*```$", "", text)
67
  return json.loads(text)
@@ -84,22 +100,23 @@ def build_prompt(obs: dict) -> str:
84
 
85
  # ── Task runner ───────────────────────────────────────────────────────────────
86
 
87
- def run_task(difficulty: str, task_num: int) -> dict:
88
  reset_resp = env_post("/reset", params={"difficulty": difficulty})
89
  obs = reset_resp["observation"]
 
90
 
91
- print(f"[START] task={task_num} difficulty={difficulty} task_id={obs['task_id']} max_steps={obs['max_steps']}")
92
 
93
- cumulative_reward = 0.0
94
- step_num = 0
95
  done = False
 
96
 
97
- while not done and step_num < obs["max_steps"]:
98
- step_num += 1
99
  prompt = build_prompt(obs)
100
 
101
  # ── LLM call ──────────────────────────────────────────────────────────
102
- t0 = time.time()
103
  try:
104
  response = client.chat.completions.create(
105
  model=MODEL_NAME,
@@ -112,74 +129,62 @@ def run_task(difficulty: str, task_num: int) -> dict:
112
  )
113
  raw = response.choices[0].message.content
114
  action_dict = parse_json_from_llm(raw)
 
 
115
          except Exception as exc:
-             print(f"[ERROR] task={task_num} step={step_num} llm_error={exc}")
              action_dict = {
                  "bug_identified": False,
-                 "bug_location": "",
                  "bug_type": "none",
-                 "bug_description": "",
                  "severity": "none",
                  "suggested_fix": "",
              }
-             latency = round(time.time() - t0, 2)

          # ── Step env ──────────────────────────────────────────────────────────
          step_resp = env_post("/step", data=action_dict)
          reward = step_resp["reward"]
          done = step_resp["done"]
          obs = step_resp["observation"]
-         info = step_resp.get("info", {})
-
-         cumulative_reward += reward
-
-         print(
-             f"[STEP] task={task_num} step={step_num} "
-             f"reward={reward:.3f} cumulative={cumulative_reward:.3f} "
-             f"done={done} latency_s={latency}"
-         )
-
-     result = {
-         "task_num": task_num,
-         "difficulty": difficulty,
-         "total_reward": round(cumulative_reward, 3),
-         "steps_taken": step_num,
-         "success": cumulative_reward >= 0.8,
      }
-     print(
-         f"[END] task={task_num} difficulty={difficulty} "
-         f"total_reward={result['total_reward']} success={result['success']}"
-     )
-     return result


  # ── Main ──────────────────────────────────────────────────────────────────────

  def main():
-     print(f"[INFO] model={MODEL_NAME} env={ENV_BASE_URL}")
-
-     tasks = [
-         ("easy", 1),
-         ("medium", 2),
-         ("hard", 3),
-     ]
      results = []

-     for difficulty, task_num in tasks:
          try:
-             r = run_task(difficulty, task_num)
          except Exception as exc:
-             print(f"[ERROR] task={task_num} difficulty={difficulty} error={exc}")
-             r = {"task_num": task_num, "difficulty": difficulty,
-                  "total_reward": 0.0, "success": False}
-         results.append(r)
-
-     avg = round(sum(r["total_reward"] for r in results) / len(results), 3)
-     successes = sum(1 for r in results if r.get("success"))
-     print(f"\n[SUMMARY] avg_reward={avg} tasks_passed={successes}/{len(results)}")
-     for r in results:
-         print(f"  [{r['difficulty']:6}] reward={r['total_reward']:.3f} success={r.get('success', False)}")


  if __name__ == "__main__":
      main()
 
  """
  Baseline inference script for Code Security Review OpenEnv.
+ Compliant with mandatory STDOUT format: [START], [STEP], [END].

  Required environment variables:
+     API_BASE_URL — LLM API endpoint
+     MODEL_NAME   — Model identifier
+     HF_TOKEN     — Hugging Face / API key
+     ENV_BASE_URL — Running environment URL (default: http://localhost:7860)
  """

  import os
  import json
  import time
  import re
+ from typing import List, Optional
  from dotenv import load_dotenv

+ # Load .env variables
  load_dotenv()

  import requests

  MODEL_NAME = os.environ.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
  HF_TOKEN = os.environ.get("HF_TOKEN", "")
  ENV_BASE_URL = os.environ.get("ENV_BASE_URL", "http://localhost:7860")
+ BENCHMARK = "code-review-env"

  client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)

      "suggested_fix": "the corrected code snippet or a precise description of the fix"
  }"""

+ # ── Logging Helpers ───────────────────────────────────────────────────────────
+
+ def log_start(task: str, env: str, model: str) -> None:
+     print(f"[START] task={task} env={env} model={model}", flush=True)
+
+
+ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
+     error_val = error if error else "null"
+     done_val = str(done).lower()
+     print(
+         f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
+         flush=True,
+     )
+
+
+ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+     rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+     print(f"[END] success={str(success).lower()} steps={steps} score={score:.2f} rewards={rewards_str}", flush=True)
+
+ # ── Helpers ───────────────────────────────────────────────────────────────────

  def env_post(path: str, data: Optional[dict] = None, params: Optional[dict] = None) -> dict:
      url = f"{ENV_BASE_URL}{path}"

  def parse_json_from_llm(text: str) -> dict:
+     """Robustly extract JSON from LLM output."""
      text = text.strip()
      text = re.sub(r"^```(?:json)?\s*", "", text)
      text = re.sub(r"\s*```$", "", text)
      return json.loads(text)
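The fence-stripping idea used by `parse_json_from_llm` in the diff above can be illustrated in isolation. The sketch below re-implements it as a standalone function (it is not imported from the script) so its behavior on fenced and unfenced model output is easy to check:

```python
import json
import re


def extract_json(text: str) -> dict:
    # Drop an optional leading ```json / ``` fence and a trailing ```
    # fence, then parse whatever remains as JSON. Mirrors the approach
    # in parse_json_from_llm above.
    text = text.strip()
    text = re.sub(r"^```(?:json)?\s*", "", text)
    text = re.sub(r"\s*```$", "", text)
    return json.loads(text)


fenced = '```json\n{"bug_identified": true, "severity": "high"}\n```'
print(extract_json(fenced)["severity"])  # high
```

Plain JSON with no fence passes through the two `re.sub` calls unchanged, so the same function handles both cases.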
 
  # ── Task runner ───────────────────────────────────────────────────────────────

+ def run_task(difficulty: str) -> dict:
      reset_resp = env_post("/reset", params={"difficulty": difficulty})
      obs = reset_resp["observation"]
+     task_id = obs['task_id']

+     log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)

+     rewards = []
+     steps_taken = 0
      done = False
+     last_error = None

+     while not done and steps_taken < obs["max_steps"]:
+         steps_taken += 1
          prompt = build_prompt(obs)

          # ── LLM call ──────────────────────────────────────────────────────────
          try:
              response = client.chat.completions.create(
                  model=MODEL_NAME,
              )
              raw = response.choices[0].message.content
              action_dict = parse_json_from_llm(raw)
+             action_str = json.dumps(action_dict)
+             last_error = None
          except Exception as exc:
+             last_error = str(exc)
              action_dict = {
                  "bug_identified": False,
+                 "bug_location": "error",
                  "bug_type": "none",
+                 "bug_description": last_error,
                  "severity": "none",
                  "suggested_fix": "",
              }
+             action_str = "{}"

          # ── Step env ──────────────────────────────────────────────────────────
          step_resp = env_post("/step", data=action_dict)
          reward = step_resp["reward"]
          done = step_resp["done"]
          obs = step_resp["observation"]
+
+         rewards.append(reward)
+         log_step(step=steps_taken, action=action_str, reward=reward, done=done, error=last_error)
+
+     # Calculate final score (normalized to [0, 1])
+     # Total reward is cumulative in this env, but we cap it at 1.0 for the score
+     total_reward = sum(rewards)
+     score = min(max(total_reward, 0.0), 1.0)
+     success = score >= 0.8
+
+     log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+
+     return {
+         "task_id": task_id,
+         "score": score,
+         "success": success,
+     }
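The scoring logic in `run_task` above sums the per-step rewards and clamps the total into [0, 1] before comparing against the 0.8 success threshold. A tiny standalone sketch of that clamp (re-stated here for clarity, not imported from the script):

```python
def normalize_score(rewards: list[float]) -> float:
    # Sum step rewards, then clamp into [0, 1], as run_task does:
    # negative totals floor to 0.0; cumulative totals above 1.0 cap at 1.0.
    total = sum(rewards)
    return min(max(total, 0.0), 1.0)


print(normalize_score([0.5, 0.4, 0.3]))  # 1.0 (capped)
print(normalize_score([-0.2]))           # 0.0 (floored)
print(normalize_score([0.3, 0.4]))       # 0.7 (below the 0.8 success bar)
```

The clamp means a run that overshoots 1.0 cumulatively still reports a score a downstream parser can treat as a probability-like value.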
 
 
 
 
 
  # ── Main ──────────────────────────────────────────────────────────────────────

  def main():
+     tasks = ["easy", "medium", "hard"]
      results = []

+     for difficulty in tasks:
          try:
+             r = run_task(difficulty)
+             results.append(r)
          except Exception as exc:
+             # print(f"DEBUG: Task failed: {exc}", flush=True)
+             log_end(success=False, steps=0, score=0.0, rewards=[])

+     if results:
+         avg = sum(r["score"] for r in results) / len(results)
+         # Optional: summary for human review (will not interfere with [END] parsers)
+         # print(f"\n[SUMMARY] avg_score={avg:.3f}")

  if __name__ == "__main__":
      main()
inference_output.log ADDED
Binary file (2.58 kB)
 
requirements.txt CHANGED
@@ -1,5 +1,7 @@
  fastapi==0.115.0
- uvicorn[standard]==0.30.6
  pydantic==2.7.4
  requests==2.32.3
  openai==1.40.0

  fastapi==0.115.0
+ uvicorn
+ httptools
+ uvloop
  pydantic==2.7.4
  requests==2.32.3
  openai==1.40.0
server/environment.py CHANGED
@@ -144,6 +144,147 @@ TASKS: Dict[str, dict] = {
          "severity_valid": ["critical"],
      },
  },
  }

@@ -303,4 +444,4 @@ class CodeReviewEnvironment:
      step_number=step_number,
      max_steps=MAX_STEPS,
      previous_feedback=previous_feedback,
- )

          "severity_valid": ["critical"],
      },
  },
+
+     # ── EXPERT ────────────────────────────────
+     "expert": {
+         "id": "task_expert_001",
+         "difficulty": "expert",
+         "language": "java",
+         "description": (
+             "This Java class implements a token bucket rate limiter. "
+             "Identify the logic bug that could allow users to bypass the rate limit."
+         ),
+         "code": (
+             "import java.util.concurrent.atomic.AtomicLong;\n\n"
+             "public class TokenBucketRateLimiter {\n"
+             "    private final long maxTokens;\n"
+             "    private final long refillRatePerSecond;\n"
+             "    private AtomicLong currentTokens;\n"
+             "    private AtomicLong lastRefillTimestamp;\n\n"
+             "    public TokenBucketRateLimiter(long maxTokens, long refillRatePerSecond) {\n"
+             "        this.maxTokens = maxTokens;\n"
+             "        this.refillRatePerSecond = refillRatePerSecond;\n"
+             "        this.currentTokens = new AtomicLong(maxTokens);\n"
+             "        this.lastRefillTimestamp = new AtomicLong(System.currentTimeMillis());\n"
+             "    }\n\n"
+             "    /**\n"
+             "     * Checks if the requested number of tokens is available.\n"
+             "     * Decrements the bucket if allowed.\n"
+             "     */\n"
+             "    public synchronized boolean allowRequest(int tokensNeeded) {\n"
+             "        refill();\n"
+             "        if (currentTokens.get() >= tokensNeeded) {\n"
+             "            currentTokens.addAndGet(-tokensNeeded);\n"
+             "            return true;\n"
+             "        }\n"
+             "        return false;\n"
+             "    }\n\n"
+             "    private void refill() {\n"
+             "        long now = System.currentTimeMillis();\n"
+             "        long timeElapsedMs = now - lastRefillTimestamp.get();\n"
+             "\n"
+             "        // Calculate how many tokens to add based on time elapsed\n"
+             "        long tokensToAdd = (timeElapsedMs / 1000) * refillRatePerSecond;\n\n"
+             "        if (tokensToAdd > 0) {\n"
+             "            // Hint: Look closely at how the tokens are updated here.\n"
+             "            // Consider what happens if a user stops making requests for a long time.\n"
+             "            currentTokens.addAndGet(tokensToAdd);\n"
+             "            lastRefillTimestamp.set(now);\n"
+             "        }\n"
+             "    }\n"
+             "}"
+         ),
+         "ground_truth": {
+             "bug_identified": True,
+             "bug_type_keywords": [
+                 "logic", "limit", "overflow", "cap", "bound", "maximum", "exceed",
+                 "logic error", "capacity",
+             ],
+             "location_keywords": [
+                 "currentTokens.addAndGet", "refill()", "tokensToAdd",
+                 "currentTokens.get()", "addAndGet(tokensToAdd)",
+             ],
+             "description_keywords": [
+                 "exceed", "maxTokens", "cap", "limit", "bound",
+                 "overflow", "infinite", "burst", "accumulate",
+             ],
+             "fix_keywords": [
+                 "Math.min", "min(", "set(", "if (currentTokens.get() > maxTokens)",
+                 "compareAndSet", "cap",
+             ],
+             "severity_valid": ["high", "medium"],
+         },
+     },
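The expert task above plants an uncapped-refill bug: `refill()` adds `tokensToAdd` with `currentTokens.addAndGet(tokensToAdd)` and never bounds the result at `maxTokens`, so an idle client banks an arbitrarily large burst. A minimal Python sketch of the capped refill the ground-truth `fix_keywords` point at (a hypothetical illustration, not code from the environment; the `now` parameter exists only to make the bug easy to demonstrate):

```python
import time


class TokenBucket:
    def __init__(self, max_tokens: int, refill_rate: int):
        self.max_tokens = max_tokens
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = max_tokens
        self.last_refill = time.monotonic()

    def _refill(self, now: float) -> None:
        elapsed = now - self.last_refill
        added = int(elapsed) * self.refill_rate
        if added > 0:
            # The buggy Java version effectively does `self.tokens += added`,
            # growing without bound; the fix is to cap at max_tokens
            # (Math.min(currentTokens + tokensToAdd, maxTokens) in Java).
            self.tokens = min(self.tokens + added, self.max_tokens)
            self.last_refill = now

    def allow(self, needed: int, now=None) -> bool:
        now = time.monotonic() if now is None else now
        self._refill(now)
        if self.tokens >= needed:
            self.tokens -= needed
            return True
        return False


bucket = TokenBucket(max_tokens=10, refill_rate=1)
bucket.tokens = 0
bucket.last_refill = 0.0
bucket._refill(3600.0)   # an hour of idleness
print(bucket.tokens)     # 10 (capped), not 3600
```

Without the `min(...)` cap, the hour of idleness would bank 3600 tokens and let a client blow straight through the intended rate limit in one burst.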
+
+     # ── EXPERT 2 (C++) ────────────────────────
+     "expert2": {
+         "id": "task_expert_002",
+         "difficulty": "expert2",
+         "language": "cpp",
+         "description": (
+             "This C++ class implements an event dispatcher. "
+             "Identify the concurrency bug that can occur when an event is dispatched."
+         ),
+         "code": (
+             "#include <iostream>\n"
+             "#include <vector>\n"
+             "#include <functional>\n"
+             "#include <mutex>\n"
+             "#include <algorithm>\n"
+             "#include <string>\n\n"
+             "class EventDispatcher {\n"
+             "public:\n"
+             "    using Callback = std::function<void(const std::string&)>;\n\n"
+             "    void subscribe(int listener_id, Callback cb) {\n"
+             "        std::lock_guard<std::mutex> lock(mut_);\n"
+             "        listeners_.push_back({listener_id, cb});\n"
+             "    }\n\n"
+             "    void unsubscribe(int listener_id) {\n"
+             "        std::lock_guard<std::mutex> lock(mut_);\n"
+             "        listeners_.erase(\n"
+             "            std::remove_if(listeners_.begin(), listeners_.end(),\n"
+             "                [listener_id](const Listener& l) { return l.id == listener_id; }),\n"
+             "            listeners_.end()\n"
+             "        );\n"
+             "    }\n\n"
+             "    void dispatch(const std::string& event_data) {\n"
+             "        std::lock_guard<std::mutex> lock(mut_);\n"
+             "        for (const auto& listener : listeners_) {\n"
+             "            // Hint: What happens if a listener decides to call unsubscribe()\n"
+             "            // from inside their own callback function when an event fires?\n"
+             "            listener.cb(event_data);\n"
+             "        }\n"
+             "    }\n\n"
+             "private:\n"
+             "    struct Listener {\n"
+             "        int id;\n"
+             "        Callback cb;\n"
+             "    };\n\n"
+             "    std::vector<Listener> listeners_;\n"
+             "    std::mutex mut_;\n"
+             "};"
+         ),
+         "ground_truth": {
+             "bug_identified": True,
+             "bug_type_keywords": [
+                 "deadlock", "concurrency", "lock", "recursive", "reentrant", "hang",
+                 "iterator validation", "undefined behavior",
+             ],
+             "location_keywords": [
+                 "listener.cb", "unsubscribe", "dispatch", "mut_", "std::lock_guard",
+                 "lock(mut_)",
+             ],
+             "description_keywords": [
+                 "deadlock", "already locked", "same thread", "recursive_mutex",
+                 "reentrant", "hangs", "blocks", "invalidate", "iterator",
+             ],
+             "fix_keywords": [
+                 "std::recursive_mutex", "copy", "local copy", "copy the vector",
+                 "unlock before", "queue", "deferred",
+             ],
+             "severity_valid": ["high", "critical"],
+         },
+     },
  }


      step_number=step_number,
      max_steps=MAX_STEPS,
      previous_feedback=previous_feedback,
+ )
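The expert2 task above seeds a self-deadlock: `dispatch` holds the non-recursive `mut_` while invoking each callback, so a callback that calls `unsubscribe` tries to re-lock the same mutex on the same thread (undefined behavior for `std::mutex`, typically a hang). One of the fixes the ground truth accepts, taking a local copy of the listeners and invoking them outside the lock, can be sketched in Python, whose `threading.Lock` is likewise non-reentrant (hypothetical illustration, not environment code):

```python
import threading


class EventDispatcher:
    def __init__(self):
        self._lock = threading.Lock()
        self._listeners = {}  # listener_id -> callback

    def subscribe(self, listener_id, cb):
        with self._lock:
            self._listeners[listener_id] = cb

    def unsubscribe(self, listener_id):
        with self._lock:
            self._listeners.pop(listener_id, None)

    def dispatch(self, event):
        # Copy under the lock, invoke outside it: a callback may now call
        # unsubscribe() without re-acquiring a lock it already holds
        # (the C++ version instead holds mut_ for the whole loop).
        with self._lock:
            snapshot = list(self._listeners.values())
        for cb in snapshot:
            cb(event)


fired = []
d = EventDispatcher()
d.subscribe(1, lambda e: (fired.append(e), d.unsubscribe(1)))
d.dispatch("ping")   # callback unsubscribes itself, no deadlock
d.dispatch("pong")   # listener already gone, nothing fires
print(fired)  # ['ping']
```

The copy-then-invoke pattern also sidesteps the related iterator-invalidation hazard: mutating the listener container during iteration never touches the snapshot being walked.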
test_env.py ADDED
@@ -0,0 +1,41 @@
+ import os
+ import sys
+ # Add the current directory to sys.path so we can import 'server'
+ sys.path.append(os.path.dirname(os.path.abspath(__file__)))
+
+ from server.environment import CodeReviewEnvironment
+ from server.models import CodeReviewAction
+
+ def run_test():
+     print("Initializing CodeReviewEnvironment...")
+     env = CodeReviewEnvironment()
+
+     print("\n--- 1. Testing 'easy' task (reset) ---")
+     obs = env.reset(difficulty="easy")
+     print(f"Task ID: {obs.task_id}")
+     print(f"Difficulty: {obs.difficulty}")
+     print(f"Task Description: {obs.task_description}")
+     print(f"Code Snippet:\n{obs.code_snippet}")
+     print("-" * 40)
+
+     print("\n--- 2. Submitting an accurate CodeReviewAction ---")
+     action = CodeReviewAction(
+         bug_identified=True,
+         bug_type="off-by-one error",
+         bug_location="range(1, len(arr) + 1)",
+         bug_description="The loop contains an off-by-one IndexError because it tries to access arr[i] which goes out of bounds.",
+         suggested_fix="Change to range(len(arr))",
+         severity="high",
+     )
+
+     obs, reward, done, info = env.step(action)
+     print(f"Step Reward: {reward}")
+     print(f"Is Done: {done}")
+     print(f"Info Breakdown:")
+     for k, v in info['breakdown'].items():
+         print(f"    {k}: {v}")
+     print(f"Total Score: {info['total_score']}")
+     print(f"Feedback: {info['feedback']}")
+
+ if __name__ == "__main__":
+     run_test()
validation_ascii.log ADDED
@@ -0,0 +1,3 @@
+ [END] success=false steps=0 score=0.00 rewards=
+ [END] success=false steps=0 score=0.00 rewards=
+ [END] success=false steps=0 score=0.00 rewards=
validation_output.log ADDED
Binary file (6.26 kB)