# Meta OpenEnv Hackathon - Round 1

## Overview

Build a complete, real-world OpenEnv environment that an AI agent can learn from through the standard `step()` / `reset()` / `state()` API.

## Task Requirements

### Must-Have Features

1. **Real-world Task Simulation**
   - Must simulate tasks humans actually do
   - Not games or toys
   - Examples: email triage, code review, data cleaning, scheduling, customer support, content moderation

2. **OpenEnv Spec Compliance** (see the sketch after this list)
   - Typed Observation, Action, and Reward Pydantic models
   - `step(action)` → returns observation, reward, done, info
   - `reset()` → returns initial observation
   - `state()` → returns current state
   - `openenv.yaml` with metadata
   - Must pass `openenv validate`

3. **Minimum 3 Tasks with Agent Graders**
   - Each task defines a concrete objective
   - Programmatic grader scoring (0.0–1.0)
   - Difficulty range: easy → medium → hard
   - Clear, deterministic success/failure criteria

4. **Meaningful Reward Function**
   - Provides signal over full trajectory (not just binary)
   - Rewards partial progress toward completion
   - Penalizes undesirable behavior (infinite loops, destructive actions)

5. **Baseline Inference Script**
   - Uses OpenAI API client
   - Reads credentials from the `HF_TOKEN` environment variable (see Mandatory Requirements below)
   - Produces reproducible baseline scores on all 3 tasks
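
To make the spec, grader, and reward requirements concrete, here is a minimal sketch of what a compliant environment could look like. It is illustrative only: the domain, the class and field names (`TriageAction`, `TriageObservation`, `TriageEnv`, `grade_episode`), and the toy fixtures are all hypothetical, and a real submission should subclass the base types shipped with `openenv-core` rather than this stand-alone class.

```python
from typing import Any, Dict, Tuple

from pydantic import BaseModel


class TriageAction(BaseModel):
    """Hypothetical action: assign a label to one email."""
    email_id: str
    label: str  # e.g. "urgent", "archive", "spam"


class TriageObservation(BaseModel):
    """Hypothetical observation: the next email awaiting triage."""
    email_id: str
    subject: str
    remaining: int


class TriageEnv:
    """Illustrates the step()/reset()/state() surface only; the real
    base class and wire format come from openenv-core."""

    # Toy fixture standing in for a real task definition loaded from disk.
    _GOLD = {"e1": "urgent", "e2": "spam"}
    _SUBJECTS = {"e1": "Server down!", "e2": "You won a prize"}

    def reset(self) -> TriageObservation:
        self._labels: Dict[str, str] = {}
        return self._observe()

    def step(self, action: TriageAction) -> Tuple[TriageObservation, float, bool, dict]:
        self._labels[action.email_id] = action.label
        # Shaped reward: partial credit per email instead of one sparse
        # end-of-episode signal, as required above.
        reward = 1.0 if self._GOLD.get(action.email_id) == action.label else 0.0
        done = len(self._labels) == len(self._GOLD)
        return self._observe(), reward, done, {}

    def state(self) -> Dict[str, Any]:
        return {"labeled": len(self._labels), "total": len(self._GOLD)}

    def _observe(self) -> TriageObservation:
        pending = [e for e in self._GOLD if e not in self._labels]
        nxt = pending[0] if pending else "done"
        return TriageObservation(
            email_id=nxt, subject=self._SUBJECTS.get(nxt, ""), remaining=len(pending)
        )


def grade_episode(env: TriageEnv) -> float:
    """Hypothetical grader: fraction labeled correctly, deterministic, in [0, 1]."""
    correct = sum(env._labels.get(e) == g for e, g in env._GOLD.items())
    return correct / len(env._GOLD)
```

A real grader would load task fixtures from disk and inspect `state()` rather than private attributes; the point is that scoring is deterministic and always lands in [0.0, 1.0].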

## Non-Functional Requirements

### Deployment
- **Hugging Face Space**: Environment must run as containerized HF Space tagged with `openenv`
- **Dockerfile**: Working containerization with clean `docker build + docker run`

### Documentation
README must include:
- Environment description and motivation
- Action and observation space definitions
- Task descriptions with expected difficulty
- Setup and usage instructions
- Baseline scores

## Evaluation Criteria & Scoring

### Scoring Breakdown (100 points)

| Criterion | Weight | Description |
|-----------|--------|-------------|
| **Real-world utility** | 30% | Does the environment model a genuine task? Would someone use this for training/evaluating agents? |
| **Task & grader quality** | 25% | Well-defined tasks with clear objectives? Accurate graders? Meaningful difficulty progression? |
| **Environment design** | 20% | Clean state management, sensible action/observation spaces, good reward shaping, proper episode boundaries |
| **Code quality & spec compliance** | 15% | Follows OpenEnv spec, clean structure, typed models, documented, tested, working Dockerfile |
| **Creativity & novelty** | 10% | Novel problem domain, interesting mechanics, clever reward design, original approach |

### Detailed Scoring Rubrics

#### Real-world Utility (30%)
- **0–5**: Toy/artificial problem with no practical application
- **6–15**: Valid domain but shallow modeling
- **16–25**: Good domain modeling, useful for agent evaluation
- **26–30**: Excellent; fills a real gap, immediate value for the RL/agent community

#### Task & Grader Quality (25%)
- 3+ tasks with difficulty range?
- Graders produce scores between 0.0–1.0?
- Graders deterministic and reproducible?
- Hard task genuinely challenges frontier models?

#### Environment Design (20%)
- `reset()` produces clean state?
- Action/observation types well-designed and documented?
- Reward function provides useful varying signal (not sparse)?
- Episode boundaries sensible?

#### Code Quality & Spec Compliance (15%)
- `openenv validate` passes?
- `docker build && docker run` works?
- HF Space deploys and responds?
- Baseline script runs and reproduces scores?

#### Creativity & Novelty (10%)
- Domain not seen in OpenEnv before?
- Reward design has interesting properties?
- Clever mechanics that make environment engaging?

## Judging Process

### Phase 1: Automated Validation (Pass/Fail Gate)
- HF Space deploys
- OpenEnv spec compliance
- Dockerfile builds
- Baseline reproduces
- 3+ tasks with graders

### Phase 2: Agentic Evaluation (Scored)
- Baseline agent re-run
- Standard Open LLM agent (e.g., Nemotron 3 Super) run against all environments
- Score variance check

### Phase 3: Human Review
Top submissions reviewed by Meta and Hugging Face engineers for:
- Real-world utility
- Creativity
- Exploit checks

### Disqualification Criteria
- Environment does not deploy or respond
- Plagiarized or trivially modified existing environments
- Graders that always return the same score
- No baseline inference script

## Pre-Submission Checklist

All must pass or you're disqualified:

- [ ] HF Space deploys and returns HTTP 200 for `reset()`
- [ ] OpenEnv spec compliance validated
- [ ] Dockerfile builds successfully
- [ ] Baseline script reproduces without error
- [ ] 3+ tasks with graders (scores in 0.0–1.0 range)

## Mandatory Requirements

### Environment Variables
Must be defined in your environment configuration:

```bash
API_BASE_URL    # The API endpoint for the LLM
MODEL_NAME      # The model identifier to use for inference
HF_TOKEN        # Your Hugging Face / API key
LOCAL_IMAGE_NAME # (Optional) Name of local image if using from_docker_image()
```

### Script Requirements
- **Filename**: `inference.py` (must be in root directory)
- **LLM Calls**: Must use OpenAI Client with above variables
- **Logging Format**: Must follow [START], [STEP], [END] format (see below)

### Infrastructure Restrictions
- **Runtime**: Inference script must complete in < 20 minutes
- **Resources**: Must run on vcpu=2, memory=8GB

## STDOUT Logging Format

### Required Format
The script must emit exactly three line types to stdout, in this order:

```
[START] task=<task_name> env=<benchmark> model=<model_name>
[STEP]  step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
[END]   success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
```

### Format Rules
- One [START] line at episode begin
- One [STEP] line per step, immediately after `env.step()` returns
- One [END] line after `env.close()`, always emitted (even on exception)
- `reward` and `rewards` formatted to 2 decimal places
- `done` and `success` are lowercase booleans: `true` or `false`
- `error` is the raw `last_action_error` string, or `null` if none
- All fields on a single line with no newlines within a line
- Each task should return a score in [0, 1]

### Example Output
```
[START] task=click-test env=miniwob model=Qwen3-VL-30B
[STEP] step=1 action=click('123') reward=0.00 done=false error=null
[STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
[STEP] step=3 action=click('789') reward=1.00 done=true error=null
[END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00
```
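
The formatting rules (two decimal places, lowercase booleans, the literal word `null`) are easy to get wrong when interpolating Python values directly, so it can help to centralize them in small helpers. The functions below are an optional, hypothetical convenience, not part of the required interface; only the emitted lines themselves are mandated:

```python
from typing import List, Optional


def fmt_bool(b: bool) -> str:
    """Render Python booleans as the required lowercase true/false."""
    return "true" if b else "false"


def log_step(step: int, action: str, reward: float, done: bool,
             error: Optional[str]) -> None:
    # reward carries exactly 2 decimal places; error is the raw
    # last_action_error string, or the literal word null.
    print(f"[STEP] step={step} action={action} reward={reward:.2f} "
          f"done={fmt_bool(done)} error={error if error is not None else 'null'}")


def log_end(success: bool, rewards: List[float], score: float) -> None:
    # rewards joins every per-step reward in order, comma-separated.
    print(f"[END] success={fmt_bool(success)} steps={len(rewards)} "
          f"score={score:.2f} rewards={','.join(f'{r:.2f}' for r in rewards)}")
```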

## Sample Inference Script

```python
"""
Inference Script Example
===================================
MANDATORY
- Before submitting, ensure the following variables are defined in your environment configuration:
    API_BASE_URL   The API endpoint for the LLM.
    MODEL_NAME     The model identifier to use for inference.
    HF_TOKEN       Your Hugging Face / API key.
    LOCAL_IMAGE_NAME The name of the local Docker image to use, if you are using the from_docker_image() method

- Defaults are set only for API_BASE_URL and MODEL_NAME 
    (and should reflect your active inference setup):
    API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
    MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")
    
- The inference script must be named `inference.py` and placed in the root directory of the project
- Participants must use OpenAI Client for all LLM calls using above variables

STDOUT FORMAT
- The script must emit exactly three line types to stdout, in this order:

    [START] task=<task_name> env=<benchmark> model=<model_name>
    [STEP]  step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
    [END]   success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>

  Rules:
    - One [START] line at episode begin.
    - One [STEP] line per step, immediately after env.step() returns.
    - One [END] line after env.close(), always emitted (even on exception).
    - reward and rewards are formatted to 2 decimal places.
    - done and success are lowercase booleans: true or false.
    - error is the raw last_action_error string, or null if none.
    - All fields on a single line with no newlines within a line.
    - Each task should return a score in [0, 1]

  Example:
    [START] task=click-test env=miniwob model=Qwen3-VL-30B
    [STEP] step=1 action=click('123') reward=0.00 done=false error=null
    [STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
    [STEP] step=3 action=click('789') reward=1.00 done=true error=null
    [END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00
"""

import asyncio
import os
import textwrap
from typing import List, Optional

from openai import OpenAI

from my_env_v4 import MyEnvV4Action, MyEnvV4Env

LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")  # Only needed if using from_docker_image()
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
TASK_NAME = os.getenv("MY_ENV_V4_TASK", "echo")
BENCHMARK = os.getenv("MY_ENV_V4_BENCHMARK", "my_env_v4")
MAX_STEPS = 8
TEMPERATURE = 0.7

# TODO: Implement the rest of your inference script here (a sketch follows below)
```
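
Continuing from the constants defined in the template above, here is a hedged sketch of how the remaining episode loop might look. The environment construction path (`from_docker_image()`), the action field (`message=...`), and the attributes on the step result are assumptions standing in for your own environment's API; the OpenAI client usage and the `[START]`/`[STEP]`/`[END]` lines follow the rules stated in the docstring:

```python
def run_episode() -> None:
    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
    # from_docker_image() is one construction path mentioned above;
    # substitute your environment's actual constructor.
    env = MyEnvV4Env.from_docker_image(LOCAL_IMAGE_NAME)

    print(f"[START] task={TASK_NAME} env={BENCHMARK} model={MODEL_NAME}")
    rewards: List[float] = []
    success = False
    try:
        obs = env.reset()
        for step in range(1, MAX_STEPS + 1):
            completion = client.chat.completions.create(
                model=MODEL_NAME,
                temperature=TEMPERATURE,
                messages=[{"role": "user", "content": str(obs)}],
            )
            # Collapse newlines so each log line stays on a single line.
            action_str = (completion.choices[0].message.content or "").replace("\n", " ")
            # "message" is a placeholder field; use your action model's real
            # fields. Adapt the unpacking below if your step() returns a
            # tuple instead of a result object.
            result = env.step(MyEnvV4Action(message=action_str))
            obs, reward, done = result.observation, result.reward or 0.0, result.done
            rewards.append(reward)
            error = getattr(result, "last_action_error", None)
            print(f"[STEP] step={step} action={action_str} reward={reward:.2f} "
                  f"done={str(done).lower()} error={error if error else 'null'}")
            if done:
                success = True
                break
    finally:
        env.close()
        # Replace the final reward with your task grader's score in [0, 1].
        score = rewards[-1] if rewards else 0.0
        print(f"[END] success={str(success).lower()} steps={len(rewards)} "
              f"score={score:.2f} rewards={','.join(f'{r:.2f}' for r in rewards)}")


if __name__ == "__main__":
    run_episode()
```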

## Pre-Validation Script

```bash
#!/usr/bin/env bash
#
# validate-submission.sh - OpenEnv Submission Validator
#
# Checks that your HF Space is live, Docker image builds, and openenv validate passes.
#
# Prerequisites:
#   - Docker:       https://docs.docker.com/get-docker/
#   - openenv-core: pip install openenv-core
#   - curl (usually pre-installed)
#
# Run:
#   curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
#
#   Or download and run locally:
#     chmod +x validate-submission.sh
#     ./validate-submission.sh <ping_url> [repo_dir]
#
# Arguments:
#   ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)
#   repo_dir   Path to your repo (default: current directory)
#
# Examples:
#   ./validate-submission.sh https://my-team.hf.space
#   ./validate-submission.sh https://my-team.hf.space ./my-repo
#

set -uo pipefail

DOCKER_BUILD_TIMEOUT=600

if [ -t 1 ]; then
  RED='\033[0;31m'
  GREEN='\033[0;32m'
  YELLOW='\033[1;33m'
  BOLD='\033[1m'
  NC='\033[0m'
else
  RED=''
  GREEN=''
  YELLOW=''
  BOLD=''
  NC=''
fi

# NOTE: The checks below are a hedged sketch of the full validator, based
# on the description above (Space liveness, Docker build, openenv validate).
# The real script's flags and output may differ.

PING_URL="${1:-}"
REPO_DIR="${2:-.}"

if [ -z "$PING_URL" ]; then
  echo -e "${RED}Usage: $0 <ping_url> [repo_dir]${NC}" >&2
  exit 1
fi

FAILURES=0

echo -e "${BOLD}[1/3] Checking HF Space liveness...${NC}"
if curl -fsS -m 30 "$PING_URL" >/dev/null; then
  echo -e "${GREEN}OK: Space responded${NC}"
else
  echo -e "${RED}FAIL: no response from $PING_URL${NC}"
  FAILURES=$((FAILURES + 1))
fi

echo -e "${BOLD}[2/3] Building Docker image...${NC}"
# 'timeout' ships with GNU coreutils; remove it if unavailable.
if timeout "$DOCKER_BUILD_TIMEOUT" docker build -t openenv-submission-check "$REPO_DIR"; then
  echo -e "${GREEN}OK: docker build succeeded${NC}"
else
  echo -e "${RED}FAIL: docker build failed or timed out${NC}"
  FAILURES=$((FAILURES + 1))
fi

echo -e "${BOLD}[3/3] Running openenv validate...${NC}"
if (cd "$REPO_DIR" && openenv validate); then
  echo -e "${GREEN}OK: openenv validate passed${NC}"
else
  echo -e "${RED}FAIL: openenv validate failed${NC}"
  FAILURES=$((FAILURES + 1))
fi

if [ "$FAILURES" -eq 0 ]; then
  echo -e "${GREEN}${BOLD}All checks passed.${NC}"
else
  echo -e "${RED}${BOLD}${FAILURES} check(s) failed.${NC}"
  exit 1
fi
```

## Tips for Success

1. **Choose a Real Problem**: Pick a task that has genuine value for the AI/agent community
2. **Design Good Rewards**: Provide meaningful signals throughout the episode, not just at the end
3. **Test Thoroughly**: Ensure your environment works cleanly with `docker build && docker run`
4. **Document Well**: Clear README helps reviewers understand your contribution
5. **Start Simple**: Get the basic OpenEnv spec working first, then add complexity
6. **Run Validator**: Use the pre-validation script before submitting

## Resources

- OpenEnv Documentation: [Link to be added]
- Hugging Face Spaces: https://huggingface.co/spaces
- OpenAI API Client: https://platform.openai.com/docs/api-reference

## Submission Deadline

[To be announced]

---

**Good luck with your submission! 🚀**