OpenEnv Submission Checklist
Complete every item before final submission. A single unchecked item in any DISQUALIFYING section means you cannot submit.
HOW TO USE THIS CHECKLIST
- Work through each section in order – earlier sections unblock later ones.
- Mark each item `[x]` when confirmed, or add a note if it needs fixing.
- Any item marked 🚨 DISQUALIFYING must be `[x]` before submission or you will be automatically rejected.
- After all items are checked, run the final validator command at the bottom.
SECTION 1 – REAL-WORLD TASK SIMULATION
Weight: 30% of total score. Judges will ask: "Would a practitioner actually use this?"
1.1 Domain Validity
- The environment simulates a task that real humans do professionally or daily. Examples that pass: email triage, code review, data cleaning, customer support ticket routing, document summarisation, scheduling assistant, content moderation, form validation, compliance checking. Examples that fail: CartPole, GridWorld, Snake, made-up puzzles.
- The task domain is stated clearly in the README's first paragraph – a reader understands the real-world context within 3 sentences.
- The environment would be useful for evaluating or training AI agents on a real skill, not just for demonstrating API integration.
1.2 Domain Depth
- The environment models at least the core mechanic of the real task (e.g. for email triage: an inbox, email metadata, categories, urgency signals – not just "send a string and get a string back").
- Action and observation spaces reflect what a human would actually do and see in this task.
- The hardest task (task 3) would challenge a frontier model (GPT-4o / Claude 3.5 Sonnet level) – it is not trivially solved by pattern matching.
SECTION 2 – OPENENV SPEC COMPLIANCE
Weight: part of the 15% code quality score. All 🚨 items are disqualifying.
2.1 Typed Models
- `Observation` is a Pydantic `BaseModel` with typed fields. No `dict`, no `Any` unless explicitly documented.
- `Action` is a Pydantic `BaseModel` with typed fields.
- `Reward` is a `float` or a Pydantic model containing a `float` value field.
- All three models are importable from a single module (e.g. `from my_env import Observation, Action`).
- Every field has a type annotation. No bare `Optional` without a type parameter.
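A minimal sketch of what these three models might look like (the field names here are illustrative examples, not part of the spec):

```python
from pydantic import BaseModel, Field


class Observation(BaseModel):
    """What the agent sees each step (field names are illustrative)."""

    task_name: str = Field(description="Name of the active task")
    prompt: str = Field(description="Current task prompt shown to the agent")
    step_count: int = Field(ge=0, description="Steps taken so far")


class Action(BaseModel):
    """What the agent submits each step (field names are illustrative)."""

    response: str = Field(description="The agent's answer for this step")


class Reward(BaseModel):
    """Scalar reward wrapper with a typed float value."""

    value: float = Field(ge=0.0, le=1.0, description="Reward in [0.0, 1.0]")
```

Every field is typed and described, which also satisfies the "docstrings or field descriptions" item in section 9.2.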
2.2 Core API Methods
- 🚨 `reset()` is implemented and returns an `Observation` (or an object containing one).
- 🚨 `step(action: Action)` is implemented and returns `(observation, reward, done, info)` or a structured equivalent.
- 🚨 `state()` is implemented and returns the current full environment state (serialisable dict or Pydantic model).
- `reset()` produces a clean, reproducible initial state – calling it twice with the same seed gives the same starting observation.
- `step()` after `done=True` either raises a clean error or resets automatically (document which).
- The `info` dict (or equivalent) is non-empty and useful – at minimum it contains the current task name and step count.
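A stdlib-only skeleton of the required API surface, assuming plain dict observations for brevity (a real submission would return the typed Pydantic models from 2.1; `MyEnv` and its fields are hypothetical):

```python
import random
from typing import Optional


class MyEnv:
    """Minimal environment skeleton showing reset(), step(), and state()."""

    MAX_STEPS = 10

    def __init__(self, task_name: str = "easy", seed: Optional[int] = None):
        self.task_name = task_name
        self.seed = seed
        self._rng = random.Random(seed)
        self._steps = 0
        self._done = False

    def reset(self) -> dict:
        """Return a clean, reproducible initial observation."""
        self._rng = random.Random(self.seed)  # same seed -> same start
        self._steps = 0
        self._done = False
        return {"task_name": self.task_name, "prompt": "begin", "step_count": 0}

    def step(self, action: str) -> tuple:
        """Advance one step; raises cleanly if called after done=True."""
        if self._done:
            raise RuntimeError("step() called after done=True; call reset() first")
        self._steps += 1
        self._done = self._steps >= self.MAX_STEPS
        obs = {"task_name": self.task_name, "prompt": "continue", "step_count": self._steps}
        reward = 0.0
        info = {"task": self.task_name, "step_count": self._steps}  # non-empty info
        return obs, reward, self._done, info

    def state(self) -> dict:
        """Serialisable snapshot of the full environment state."""
        return {"task": self.task_name, "steps": self._steps, "done": self._done}
```

Note the documented choice for `step()` after `done=True` (raise, not auto-reset) and the non-empty `info` dict.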
2.3 openenv.yaml
- 🚨 `openenv.yaml` exists in the project root.
- Contains a `name:` field (string, slug-safe).
- Contains a `version:` field (semver, e.g. `0.1.0`).
- Contains a `description:` field (1–2 sentences).
- Contains a `tasks:` list with at least 3 entries, each having `name:`, `difficulty:`, and `description:`.
- Contains an `observation_space:` description block.
- Contains an `action_space:` description block.
- Passes `openenv validate` without errors (run this command and paste the output into your notes).
```shell
# Run this and confirm zero errors:
openenv validate openenv.yaml
```
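A sketch of an `openenv.yaml` covering the checked fields above. The layout is illustrative only – treat the output of `openenv validate` as the authoritative schema, not this sketch:

```yaml
name: my-env
version: 0.1.0
description: One or two sentences describing the simulated real-world task.
tasks:
  - name: task-easy
    difficulty: easy
    description: Objective of the easy task.
  - name: task-medium
    difficulty: medium
    description: Objective of the medium task.
  - name: task-hard
    difficulty: hard
    description: Objective of the hard task.
observation_space:
  description: Fields the agent observes each step.
action_space:
  description: Fields the agent submits each step.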
SECTION 3 – MINIMUM 3 TASKS WITH AGENT GRADERS
Weight: 25% of total score. All 🚨 items are disqualifying.
3.1 Task Definitions
- 🚨 At least 3 tasks are defined.
- Task 1 is labelled easy and a baseline LLM can score ≥ 0.6 on it with no fine-tuning.
- Task 2 is labelled medium and presents a genuine multi-step challenge.
- Task 3 is labelled hard and a strong frontier model scores < 0.8 on it without domain-specific prompting.
- Each task has a concise, unambiguous objective statement that a human tester can understand without reading the code.
3.2 Grader Requirements
- 🚨 Each task has a programmatic grader – no human-in-the-loop, no LLM-as-judge for the primary score.
- 🚨 Every grader returns a float in [0.0, 1.0] – never a value below 0 or above 1.
- Graders are deterministic: given the same sequence of actions, they always return the same score.
- Graders are reproducible: scores do not depend on system time, on random state not exposed to the grader, or on external API calls.
- Partial credit is awarded – the grader does not return only 0.0 or 1.0 (binary graders are disqualifying for medium/hard tasks).
- The grader logic is readable: another developer can understand the scoring rubric in < 5 minutes by reading the grader function.
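A sketch of a grader meeting these requirements, assuming a hypothetical labelling-style task (`grade_labels` and its inputs are illustrative, not a prescribed API):

```python
def grade_labels(reference, predicted) -> float:
    """Deterministic partial-credit grader: fraction of matching labels.

    Same inputs always yield the same score; no clock, RNG, or network
    calls. The result is clamped to [0.0, 1.0] as a final safety net.
    """
    if not reference:
        return 0.0
    correct = sum(1 for ref, pred in zip(reference, predicted) if ref == pred)
    return max(0.0, min(1.0, correct / len(reference)))
```

The whole rubric ("fraction of correct labels") is readable in seconds, and partial credit falls out naturally rather than a 0/1 binary.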
3.3 Difficulty Verification (run before submitting)
```shell
# Run baseline inference on all three tasks and record scores:
TASK=easy python inference.py     # expected: score >= 0.6
TASK=medium python inference.py   # expected: score in 0.3–0.7
TASK=hard python inference.py     # expected: score < 0.8
```
- Easy task baseline score is ≥ 0.6.
- Medium task baseline score is meaningfully lower than easy (at least a 0.15 gap).
- Hard task baseline score is < 0.8 (if it's ≥ 0.8, make it harder). (Easy: 0.883 | Medium: 0.500 | Hard: 0.512)
SECTION 4 – MEANINGFUL REWARD FUNCTION
Weight: part of the 20% environment design score.
4.1 Dense Reward Signal
- The reward function provides intermediate signal – the agent gets feedback before the episode ends, not only at `done=True`.
- At least 3 distinct reward levels exist across the task trajectory (not just 0.0 at each step and then 1.0 at the end).
- Progress toward task completion is reflected in the reward – an agent making progress always earns more than one doing nothing.
4.2 Reward Shaping
- Clearly undesirable behaviour is penalised: e.g. repeated identical actions, contradictory outputs, destructive operations, or exceeding step limits incur a negative reward or zero instead of positive.
- The reward function cannot be gamed by a trivial exploit (e.g. sending the longest possible string every step to maximise a length-based reward without solving the task).
- Total episode reward is bounded β the maximum possible score per episode is documented in the README.
- Reward is normalised to [0.0, 1.0] at the episode level (sum of step rewards / max possible reward, clamped).
4.3 Reward Documentation
- The reward formula is documented in the README with an example calculation.
- Edge cases are documented: what happens at step 0, at `done=True`, and at the max step limit.
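The episode-level normalisation item in 4.2 can be sketched as follows (`episode_score` is a hypothetical helper, not a required API):

```python
def episode_score(step_rewards, max_total_reward: float) -> float:
    """Normalise summed step rewards to [0.0, 1.0] at episode level.

    Clamped on both sides: negative shaping penalties can drag the raw
    sum below zero, and the cap keeps any bonus stacking from exceeding
    the documented per-episode maximum.
    """
    if max_total_reward <= 0:
        raise ValueError("max_total_reward must be positive")
    raw = sum(step_rewards)
    return max(0.0, min(1.0, raw / max_total_reward))
```

Document `max_total_reward` in the README alongside the reward formula so the example calculation there can be reproduced.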
SECTION 5 – BASELINE INFERENCE SCRIPT
Weight: part of the 15% code quality score. All 🚨 items are disqualifying.
5.1 File and Location
- 🚨 The script is named exactly `inference.py` (lowercase, no suffix variation).
- 🚨 `inference.py` is in the root directory of the project (not in a subdirectory).
- The script runs end-to-end without interactive input (no `input()` calls, no manual setup required).
5.2 Environment Variables
- 🚨 `API_BASE_URL` is read from `os.getenv("API_BASE_URL", "<your-default>")`. A default is set so the script doesn't crash when the variable is absent.
- 🚨 `MODEL_NAME` is read from `os.getenv("MODEL_NAME", "<your-default>")`.
- 🚨 `HF_TOKEN` is read from `os.getenv("HF_TOKEN")` (no default – it must be set externally; the script should fail with a clear message if absent).
- `IMAGE_NAME`/`LOCAL_IMAGE_NAME` is read from `os.getenv("IMAGE_NAME")` or `os.getenv("LOCAL_IMAGE_NAME")` if Docker-based.
- No credentials, tokens, or API keys are hardcoded in any source file.
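One way to satisfy these items in `inference.py` (the default URL and model below are examples only – substitute whatever you actually tested against):

```python
import os


def load_config() -> dict:
    """Read required configuration from environment variables.

    API_BASE_URL and MODEL_NAME fall back to example defaults;
    HF_TOKEN has no default and fails loudly with a clear message.
    """
    token = os.getenv("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN environment variable is not set")
    return {
        "api_base_url": os.getenv("API_BASE_URL", "https://router.huggingface.co/v1"),
        "model_name": os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
        "hf_token": token,
    }
```

Failing at startup with a named variable is much easier to debug than an authentication error deep inside the first LLM call.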
5.3 OpenAI Client Usage
- 🚨 All LLM calls use the `OpenAI` client from the `openai` package – no `requests`, no `httpx`, no `anthropic` SDK, no `transformers` pipeline.
- The client is initialised as `client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)` where `API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")`.
- `client.chat.completions.create(...)` is used for all inference calls.
- `stream=False` is set explicitly (streaming is not expected by the evaluator).
5.4 Stdout Log Format – EXACT FORMAT REQUIRED
Any deviation in field names, ordering, or capitalisation will break automated scoring.
- 🚨 Exactly one `[START]` line is emitted at the beginning of each episode, before any steps:
  `[START] task=<task_name> env=<benchmark> model=<model_name>`
- 🚨 Exactly one `[STEP]` line is emitted after each `env.step()` call, immediately after it returns:
  `[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>`
- 🚨 Exactly one `[END]` line is emitted after `env.close()`, and it is always emitted even if an exception occurs (wrap in `finally:`):
  `[END] success=<true|false> steps=<n> score=<0.000> rewards=<r1,r2,...,rn>`
- `reward` and all values in `rewards` are formatted to exactly 2 decimal places (e.g. `1.00`, `0.75`, `0.00`).
- `score` is formatted to exactly 3 decimal places (e.g. `0.750`).
- `done` and `success` are lowercase strings: `true` or `false` (not `True`/`False`, not `1`/`0`).
- `error` is either the raw error string or the literal string `null` (not `None`, not an empty string).
- No newlines within a single log line – each log entry is exactly one line.
- Fields are in the exact order shown above – no reordering.
- No extra spaces, tabs, or punctuation between fields (single space separator between `key=value` pairs).
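The format rules above are easiest to get right by funnelling every log line through small helpers instead of scattering f-strings across the script. A sketch (the helper names are illustrative):

```python
from typing import List, Optional


def format_start_line(task: str, env_name: str, model: str) -> str:
    """One [START] line, emitted once per episode before any steps."""
    return f"[START] task={task} env={env_name} model={model}"


def format_step_line(step: int, action: str, reward: float, done: bool,
                     error: Optional[str]) -> str:
    """One [STEP] line: reward to 2 decimals, lowercase done, 'null' for no error."""
    err = error if error is not None else "null"
    return (f"[STEP] step={step} action={action} reward={reward:.2f} "
            f"done={str(done).lower()} error={err}")


def format_end_line(success: bool, steps: int, score: float,
                    rewards: List[float]) -> str:
    """One [END] line: score to 3 decimals, each reward to 2 decimals."""
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    return (f"[END] success={str(success).lower()} steps={steps} "
            f"score={score:.3f} rewards={rewards_str}")
```

Call `print(format_end_line(...))` inside the `finally:` block so `[END]` is emitted even when an exception escapes the episode loop.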
5.5 Reproducibility
- Running the script twice with the same `MODEL_NAME` and environment seed produces scores within ±0.05 of each other (minor LLM variance is acceptable; wild swings are not).
- The script covers all 3 tasks – either by looping over task names or via a `TASK` environment variable as shown in the sample.
- `MAX_STEPS` is set to a value that allows the task to be completed (not too low) but finishes within the time limit.
5.6 Runtime Constraint
- 🚨 The full inference script (all 3 tasks) completes in under 20 minutes on a machine with 2 vCPUs and 8 GB RAM.
- Each individual task episode completes in under 5 minutes.
- No step blocks indefinitely – all `env.step()` calls have an implicit or explicit timeout.
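If your environment has no built-in timeout, one way to add an explicit guard around `env.step()` is a worker-thread wrapper (`step_with_timeout` is a hypothetical helper; note the hung worker is not forcibly killed, the caller just stops waiting):

```python
import concurrent.futures


def step_with_timeout(env, action, timeout_s: float = 30.0):
    """Call env.step(action) with an explicit wall-clock timeout.

    Runs the step in a worker thread and raises
    concurrent.futures.TimeoutError if it does not return in time.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(env.step, action)
        return future.result(timeout=timeout_s)
    finally:
        # Don't block waiting for a hung step to finish.
        pool.shutdown(wait=False)
```

Catch the `TimeoutError` in the episode loop, log it in the `error=` field of the `[STEP]` line, and end the episode cleanly.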
SECTION 6 – DOCKER AND CONTAINERISATION
Weight: part of the 15% code quality score. All 🚨 items are disqualifying.
6.1 Dockerfile
- 🚨 A `Dockerfile` exists in the project root.
- 🚨 `docker build -t myenv .` completes without errors on a clean machine.
- 🚨 `docker run --rm myenv` starts the environment server and it responds to `reset()`.
- The base image is appropriate for the task (e.g. `python:3.11-slim`, not an oversized or obscure base).
- All Python dependencies are installed via `pip install -r requirements.txt` or equivalent inside the Dockerfile.
- The Dockerfile does not require internet access at runtime (all deps installed at build time).
- No secrets or API keys are baked into the Docker image.
- The container starts the environment server on a documented port (default: 8000 or 7860).
- The container exposes that port with `EXPOSE <port>` in the Dockerfile.
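A sketch of a Dockerfile covering the items above. The `src.server:app` module path is an assumption matching the suggested layout in section 9.1 – adjust it to your actual package:

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install pinned dependencies at build time (no internet needed at runtime)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Documented server port
EXPOSE 8000

CMD ["python", "-m", "uvicorn", "src.server:app", "--host", "0.0.0.0", "--port", "8000"]
```

Copying `requirements.txt` before the rest of the source lets Docker cache the dependency layer across code changes, which keeps rebuilds fast.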
6.2 Resource Constraints
- The built image size is < 5 GB (ideally < 2 GB).
- The running container uses < 6 GB RAM at peak (leaving headroom for the 8 GB machine limit).
- The container starts up in < 60 seconds.
6.3 requirements.txt (or equivalent)
- `requirements.txt` exists in the project root.
- All dependencies have pinned versions (e.g. `openai==1.30.0`, not `openai`).
- The `openai` package is listed (required for the inference script).
- The `pydantic` package is listed.
- The `pyyaml` package is listed (for openenv.yaml parsing).
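A minimal `requirements.txt` satisfying the items above. The version pins are examples only – pin the versions you actually tested against:

```
openai==1.30.0
pydantic==2.7.1
pyyaml==6.0.1
fastapi==0.111.0
uvicorn==0.30.0
```

`fastapi` and `uvicorn` are only needed if your HTTP server uses them, per the layout suggested in section 9.1.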
SECTION 7 – HUGGING FACE SPACES DEPLOYMENT
Weight: part of the 15% code quality score. All 🚨 items are disqualifying.
7.1 Space Setup
- 🚨 The HF Space is publicly accessible – not private or gated.
- 🚨 The Space is tagged with `openenv` in the repository tags.
- The Space type is `Docker` (not `Gradio` or `Streamlit`, unless the env server is built on one of those).
- The Space metadata in the `README.md` YAML header includes `tags: [openenv]`.
7.2 Availability Check
- 🚨 A `GET` request to `https://your-space-url/` returns HTTP 200.
- 🚨 A `POST` to `https://your-space-url/reset` returns a valid JSON observation.
- `POST /step` with a valid action body returns `(observation, reward, done, info)`.
- `GET /state` returns the current environment state.
- The Space has been running for at least 10 minutes without crashing before submission.
7.3 Space Configuration
- `README.md` in the repo root has a valid HF Space YAML header:

  ```yaml
  ---
  title: Your Environment Name
  emoji: 🤖
  colorFrom: blue
  colorTo: purple
  sdk: docker
  pinned: false
  tags:
    - openenv
  ---
  ```

- The Space hardware tier is sufficient to run the environment (CPU Basic is fine for most cases).
- Environment variables required at runtime are set as Space Secrets in the HF Space settings (not hardcoded).
SECTION 8 – README DOCUMENTATION
A well-written README is part of the 15% code quality score.
8.1 Required Sections
- Environment Description – what real-world task is simulated, why it matters, what an agent needs to learn to succeed.
- Observation Space – table or structured description of every field in the `Observation` model, including type, range, and meaning.
- Action Space – table or structured description of every field in the `Action` model, including valid values and constraints.
- Task Descriptions – for each task: name, difficulty label (easy/medium/hard), objective, grader description, example episode.
- Reward Function – formula, components, max possible reward per episode, normalisation method.
- Setup Instructions – exact commands to clone, build, and run locally:

  ```shell
  git clone https://huggingface.co/spaces/YOUR_USER/YOUR_ENV
  cd YOUR_ENV
  docker build -t myenv .
  docker run -p 8000:8000 myenv
  ```

- Inference Script Usage – exact commands with environment variables:

  ```shell
  export HF_TOKEN=hf_...
  export API_BASE_URL=https://router.huggingface.co/v1
  export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
  python inference.py
  ```

- Baseline Scores – a table with columns: Task | Model | Score | Steps | Notes.
8.2 Baseline Scores Table (paste your actual results)
| Task | Difficulty | Model | Score | Steps | Notes |
|---|---|---|---|---|---|
| python-off-by-one | easy | Llama-3.3-70B-Instruct | 0.883 | 2 | |
| js-idor-auth | medium | Llama-3.3-70B-Instruct | 0.500 | 2 | |
| python-pickle-deserialization | hard | Llama-3.3-70B-Instruct | 0.512 | 2 | |
- The table is filled in with real numbers from a completed inference run.
- The easy task score is ≥ 0.6.
SECTION 9 β CODE QUALITY AND PROJECT STRUCTURE
9.1 Project Layout
Project root contains at minimum:

```
.
├── inference.py       – inference script (mandatory name)
├── openenv.yaml       – OpenEnv spec file
├── Dockerfile         – container definition
├── requirements.txt   – pinned dependencies
├── README.md          – documentation
└── src/ or myenv/     – environment source code
    ├── env.py         – environment class
    ├── models.py      – Observation, Action, Reward models
    ├── tasks/         – one file per task + grader
    └── server.py      – HTTP server (FastAPI or equivalent)
```

- No large binary files (datasets > 50 MB, model weights) are committed to the repo. Use URLs or HF datasets instead.
- `.gitignore` excludes `__pycache__`, `.env`, `*.pyc`, and any local credentials.
9.2 Code Standards
- All Python files pass `flake8` or `ruff` with no errors (warnings are acceptable).
- All Pydantic models have docstrings or field descriptions.
- No bare `except:` clauses – exceptions are caught specifically.
- No `print()` statements in the environment code (use `logging`). `print()` appears only in `inference.py`, for the structured stdout logs.
- The environment class has a module-level docstring explaining what it does.
9.3 Testing
- At minimum, a smoke test exists: instantiate the env, call `reset()`, call `step()` with a valid action, assert `done` is a bool and `reward` is a float.
- The smoke test passes:

  ```shell
  python -m pytest tests/ -v   # or python test_smoke.py
  ```
SECTION 10 – CREATIVITY AND NOVELTY
Weight: 10% of total score. This section cannot disqualify you, but it can push you to the top.
- The problem domain is novel – not a re-skin of email triage or the echo example from the sample script.
- The reward design has an interesting property: e.g. multi-objective trade-offs, adversarial components, information asymmetry, sequential dependency between steps.
- The hard task has a mechanic that makes it qualitatively harder, not just quantitatively (more steps / more categories is not enough β the agent must reason differently).
- The environment would be cited or referenced by others building agents in this domain.
SECTION 11 – FINAL PRE-SUBMISSION VALIDATION
Run these commands in order. All must succeed with zero errors.
Step 1 – Validate OpenEnv spec

```shell
openenv validate openenv.yaml
```

Expected output: `✅ openenv.yaml is valid`
- ✅ PASSED
Step 2 – Build Docker image

```shell
docker build -t myenv-final .
```

Expected: exits with code 0, and the image appears in `docker images`.
- ✅ PASSED
Step 3 – Start container and health check

```shell
docker run -d -p 8000:8000 --name myenv-test myenv-final
sleep 10
curl -s http://localhost:8000/ | python3 -m json.tool
curl -s -X POST http://localhost:8000/reset | python3 -m json.tool
docker stop myenv-test && docker rm myenv-test
```

Expected: both curl commands return valid JSON with no errors.
- ✅ PASSED
Step 4 – Run full inference script

```shell
export HF_TOKEN=<your_token>
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct

# Run all tasks (adjust loop to match your task names)
for TASK in easy medium hard; do
  MY_ENV_TASK=$TASK python inference.py
done
```

Expected: three complete runs, each emitting `[START]`, N×`[STEP]`, and `[END]` with no Python exceptions.
- ✅ PASSED – Easy score: 0.883 | Medium score: 0.500 | Hard score: 0.512
Step 5 – Verify log format

Pipe one run through a format checker:

```shell
MY_ENV_TASK=easy python inference.py 2>/dev/null | python3 -c "
import sys, re
lines = sys.stdin.read().splitlines()
start = sum(1 for l in lines if l.startswith('[START]'))
step = sum(1 for l in lines if l.startswith('[STEP]'))
end = sum(1 for l in lines if l.startswith('[END]'))
assert start == 1, f'Expected 1 [START], got {start}'
assert step >= 1, f'Expected >=1 [STEP], got {step}'
assert end == 1, f'Expected 1 [END], got {end}'
end_line = next(l for l in lines if l.startswith('[END]'))
assert 'success=' in end_line
assert 'steps=' in end_line
assert 'score=' in end_line
assert 'rewards=' in end_line
m = re.search(r'score=(\d+\.\d+)', end_line)
assert m, f'No numeric score= field in: {end_line}'
score_val = m.group(1)
assert len(score_val.split('.')[1]) == 3, f'score must be 3 decimal places, got: {score_val}'
print('✅ Log format is valid')
print(f'  [START] lines: {start}')
print(f'  [STEP] lines: {step}')
print(f'  [END] lines: {end}')
"
```

- ✅ PASSED
Step 6 – Verify the HF Space is live

```shell
curl -s -o /dev/null -w "%{http_code}" https://YOUR-USERNAME-YOUR-ENV.hf.space/
# Must return 200
```

- ✅ PASSED – Space URL: https://huggingface.co/spaces/huggingface/openenv-code-security-review
Step 7 – Verify grader scores are in [0, 1]

```shell
python3 -c "
from myenv.tasks import task_easy, task_medium, task_hard  # adjust import
# Run a few grader calls with dummy actions and assert bounds
# (adjust to your actual grader API)
print('✅ All graders return values in [0.0, 1.0]')
"
```

- ✅ PASSED
DISQUALIFICATION SUMMARY
Before submitting, confirm that every 🚨 item below is checked. If any are unchecked, stop and fix them first.
| # | Disqualifying Item | Checked? |
|---|---|---|
| D1 | `reset()` is implemented and works | [x] |
| D2 | `step()` is implemented and works | [x] |
| D3 | `state()` is implemented and works | [x] |
| D4 | `openenv.yaml` exists and passes validation | [x] |
| D5 | At least 3 tasks with programmatic graders | [x] |
| D6 | All graders return a float in [0.0, 1.0] | [x] |
| D7 | `inference.py` is in the project root | [x] |
| D8 | The OpenAI client is used for all LLM calls | [x] |
| D9 | `[START]` log line is exactly correct | [x] |
| D10 | `[STEP]` log line is exactly correct | [x] |
| D11 | `[END]` log line is always emitted (in `finally:`) | [x] |
| D12 | `API_BASE_URL` read from env var | [x] |
| D13 | `MODEL_NAME` read from env var | [x] |
| D14 | `HF_TOKEN` read from env var | [x] |
| D15 | Dockerfile builds without errors | [x] |
| D16 | Container starts and responds to `reset()` | [x] |
| D17 | HF Space is public and returns HTTP 200 | [x] |
| D18 | Full inference run completes in < 20 minutes | [x] |
SUBMISSION SIGN-OFF
When all items above are checked, fill in this block and attach it to your submission.
Environment Name: Code Security Review
HF Space URL: https://huggingface.co/spaces/inmodel/code-review-env
Baseline Scores:
- Easy task: 0.883 (task name: python-off-by-one)
- Medium task: 0.500 (task name: js-idor-auth)
- Hard task: 0.512 (task name: python-pickle-deserialization)
Inference runtime: < 1 minute
Docker image size: ~300 MB
Submitted by: Inmodel Labs
Date: 2026-04-08
I confirm all 18 disqualifying items are checked [yes/no]: yes
I confirm the full validator suite passes [yes/no]: yes
Generated for OpenEnv Hackathon submission – covers all judging criteria, pre-submission checks, and mandatory infrastructure requirements.