# Meta OpenEnv Hackathon - Round 1

## Overview

Build a complete, real-world OpenEnv environment that an AI agent can learn from through the standard `step()` / `reset()` / `state()` API.
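
For orientation, the interaction loop looks like this. `TriageEnv` and `TriageAction` are hypothetical stand-ins for your environment's typed client (they are sketched under Task Requirements below):

```python
# Minimal agent-environment loop (sketch); substitute your environment's
# typed client and action model for the hypothetical TriageEnv/TriageAction.
env = TriageEnv()
obs = env.reset()                          # initial observation
done = False
while not done:
    action = TriageAction(command="noop")  # your agent's policy goes here
    result = env.step(action)              # observation, reward, done, info
    obs, done = result.observation, result.done
print(env.state())                         # inspect the current episode state
env.close()
```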

## Task Requirements

### Must-Have Features

1. **Real-world Task Simulation**
   - Must simulate tasks humans actually do
   - Not games or toys
   - Examples: email triage, code review, data cleaning, scheduling, customer support, content moderation
2. **OpenEnv Spec Compliance** (see the sketch after this list)
   - Typed Observation, Action, and Reward Pydantic models
   - `step(action)` → returns observation, reward, done, info
   - `reset()` → returns initial observation
   - `state()` → returns current state
   - `openenv.yaml` with metadata
   - Must pass `openenv validate`
3. **Minimum 3 Tasks with Agent Graders**
   - Each task defines a concrete objective
   - Programmatic grader scoring (0.0–1.0)
   - Difficulty range: easy → medium → hard
   - Clear, deterministic success/failure criteria
4. **Meaningful Reward Function**
   - Provides signal over the full trajectory (not just binary)
   - Rewards partial progress toward completion
   - Penalizes undesirable behavior (infinite loops, destructive actions)
5. **Baseline Inference Script**
   - Uses the OpenAI API client
   - Reads credentials from the `OPENAI_API_KEY` environment variable
   - Produces reproducible baseline scores on all 3 tasks
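
To make item 2 concrete, here is a minimal sketch of the typed models and the three required methods, for a hypothetical email-triage environment. All class and field names (`TriageAction`, `TriageObservation`, `StepResult`, `TriageEnv`) are illustrative, not part of the spec:

```python
from typing import List, Optional

from pydantic import BaseModel


class TriageAction(BaseModel):
    command: str                     # e.g. "label", "archive", "reply"
    target_id: Optional[str] = None


class TriageObservation(BaseModel):
    inbox: List[str]                 # subject lines still to be processed
    last_action_error: Optional[str] = None


class StepResult(BaseModel):
    observation: TriageObservation
    reward: float                    # shaped per-step signal
    done: bool
    info: dict = {}


class TriageEnv:
    """Skeleton of the three required methods; bodies omitted."""

    def reset(self) -> TriageObservation:
        """Start a new episode and return a clean initial observation."""
        raise NotImplementedError

    def step(self, action: TriageAction) -> StepResult:
        """Apply one action; return observation, reward, done, info."""
        raise NotImplementedError

    def state(self) -> dict:
        """Return the current internal episode state."""
        raise NotImplementedError
```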

## Non-Functional Requirements

### Deployment

- **Hugging Face Space**: Environment must run as a containerized HF Space tagged with `openenv`
- **Dockerfile**: Working containerization with a clean `docker build && docker run`

### Documentation

The README must include:

- Environment description and motivation
- Action and observation space definitions
- Task descriptions with expected difficulty
- Setup and usage instructions
- Baseline scores

## Evaluation Criteria & Scoring

### Scoring Breakdown (100 points)

| Criterion | Weight | Description |
|-----------|--------|-------------|
| **Real-world utility** | 30% | Does the environment model a genuine task? Would someone use this for training/evaluating agents? |
| **Task & grader quality** | 25% | Well-defined tasks with clear objectives? Accurate graders? Meaningful difficulty progression? |
| **Environment design** | 20% | Clean state management, sensible action/observation spaces, good reward shaping, proper episode boundaries |
| **Code quality & spec compliance** | 15% | Follows OpenEnv spec, clean structure, typed models, documented, tested, working Dockerfile |
| **Creativity & novelty** | 10% | Novel problem domain, interesting mechanics, clever reward design, original approach |

### Detailed Scoring Rubrics

#### Real-world Utility (30%)

- **0–5**: Toy/artificial problem with no practical application
- **6–15**: Valid domain but shallow modeling
- **16–25**: Good domain modeling, useful for agent evaluation
- **26–30**: Excellent – fills a real gap, immediate value for the RL/agent community

#### Task & Grader Quality (25%)

- 3+ tasks with a difficulty range?
- Graders produce scores in the 0.0–1.0 range? (see the grader sketch below)
- Graders deterministic and reproducible?
- Hard task genuinely challenges frontier models?
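
For illustration, a grader in this style is deterministic and returns partial credit in [0.0, 1.0]. The state layout (`labels` mapping email id to assigned label) is hypothetical:

```python
from typing import Dict


def grade_triage(final_state: Dict, expected_labels: Dict[str, str]) -> float:
    """Deterministic partial-credit grader: fraction of emails labeled correctly."""
    if not expected_labels:
        return 0.0
    assigned = final_state.get("labels", {})
    correct = sum(
        1 for email_id, label in expected_labels.items()
        if assigned.get(email_id) == label
    )
    return correct / len(expected_labels)  # always in [0.0, 1.0]
```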

#### Environment Design (20%)

- `reset()` produces a clean state?
- Action/observation types well-designed and documented?
- Reward function provides a useful, varying signal (not sparse)? (see the reward-shaping sketch below)
- Episode boundaries sensible?
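
One common way to get a varying, non-sparse signal is to reward the per-step change in graded progress and charge small penalties. A sketch, with hypothetical inputs:

```python
def shaped_reward(
    progress_before: float,   # grader score in [0, 1] before the action (hypothetical)
    progress_after: float,    # grader score in [0, 1] after the action
    destructive: bool,        # e.g. the action deleted data still needed
    step_penalty: float = 0.01,
) -> float:
    """Dense per-step reward: progress delta minus small costs."""
    reward = progress_after - progress_before  # credit partial progress
    reward -= step_penalty                     # discourages aimless loops
    if destructive:
        reward -= 0.5                          # penalize destructive actions
    return reward
```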

#### Code Quality & Spec Compliance (15%)

- `openenv validate` passes?
- `docker build && docker run` works?
- HF Space deploys and responds?
- Baseline script runs and reproduces scores?

#### Creativity & Novelty (10%)

- Domain not seen in OpenEnv before?
- Reward design has interesting properties?
- Clever mechanics that make the environment engaging?

## Judging Process

### Phase 1: Automated Validation (Pass/Fail Gate)

- HF Space deploys
- OpenEnv spec compliance
- Dockerfile builds
- Baseline reproduces
- 3+ tasks with graders

### Phase 2: Agentic Evaluation (Scored)

- Baseline agent re-run
- Standard open LLM agent (e.g., Nemotron 3 Super) run against all environments
- Score variance check

### Phase 3: Human Review

Top submissions are reviewed by Meta and Hugging Face engineers for:

- Real-world utility
- Creativity
- Exploit checks

### Disqualification Criteria

- Environment does not deploy or respond
- Plagiarized or trivially modified existing environments
- Graders that always return the same score
- No baseline inference script

## Pre-Submission Checklist

All must pass or you're disqualified:

- [ ] HF Space deploys (returns HTTP 200 for `reset()`)
- [ ] OpenEnv spec compliance validated
- [ ] Dockerfile builds successfully
- [ ] Baseline script reproduces without error
- [ ] 3+ tasks with graders (scores in the 0.0–1.0 range)

## Mandatory Requirements

### Environment Variables

The following must be defined in your environment configuration:

```bash
API_BASE_URL      # The API endpoint for the LLM
MODEL_NAME        # The model identifier to use for inference
HF_TOKEN          # Your Hugging Face / API key
LOCAL_IMAGE_NAME  # (Optional) Name of the local image if using from_docker_image()
```

### Script Requirements

- **Filename**: `inference.py` (must be in the root directory)
- **LLM Calls**: Must use the OpenAI client with the variables above
- **Logging Format**: Must follow the [START], [STEP], [END] format (see below)

### Infrastructure Restrictions

- **Runtime**: Inference script must complete in under 20 minutes
- **Resources**: Must run on vcpu=2, memory=8GB

## STDOUT Logging Format

### Required Format

The script must emit exactly three line types to stdout, in this order:

```
[START] task=<task_name> env=<benchmark> model=<model_name>
[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
[END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
```

### Format Rules

- One [START] line at episode begin
- One [STEP] line per step, immediately after `env.step()` returns
- One [END] line after `env.close()`, always emitted (even on exception)
- `reward` and `rewards` formatted to 2 decimal places
- `done` and `success` are lowercase booleans: `true` or `false`
- `error` is the raw `last_action_error` string, or `null` if none
- All fields on a single line, with no newlines within a line
- Each task should return a score in [0, 1]

### Example Output

```
[START] task=click-test env=miniwob model=Qwen3-VL-30B
[STEP] step=1 action=click('123') reward=0.00 done=false error=null
[STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
[STEP] step=3 action=click('789') reward=1.00 done=true error=null
[END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00
```
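
The formatting rules are easy to get subtly wrong (decimal places, lowercase booleans, `null` for missing errors), so it helps to centralize them. A sketch of two small helpers:

```python
from typing import List, Optional


def log_step(step: int, action: str, reward: float, done: bool,
             error: Optional[str]) -> None:
    # 2 decimal places, lowercase booleans, `null` when there is no error
    print(f"[STEP] step={step} action={action} reward={reward:.2f} "
          f"done={str(done).lower()} error={error or 'null'}")


def log_end(success: bool, steps: int, score: float,
            rewards: List[float]) -> None:
    joined = ",".join(f"{r:.2f}" for r in rewards)
    print(f"[END] success={str(success).lower()} steps={steps} "
          f"score={score:.2f} rewards={joined}")
```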

## Sample Inference Script

```python
"""
Inference Script Example
========================

MANDATORY
- Before submitting, ensure the following variables are defined in your
  environment configuration:
    API_BASE_URL      The API endpoint for the LLM.
    MODEL_NAME        The model identifier to use for inference.
    HF_TOKEN          Your Hugging Face / API key.
    LOCAL_IMAGE_NAME  The name of the local image to use for the environment
                      if you are using the from_docker_image() method.
- Defaults are set only for API_BASE_URL and MODEL_NAME
  (and should reflect your active inference setup):
    API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
    MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")
- The inference script must be named `inference.py` and placed in the root
  directory of the project.
- Participants must use the OpenAI client for all LLM calls, using the
  variables above.

STDOUT FORMAT
- The script must emit exactly three line types to stdout, in this order:
    [START] task=<task_name> env=<benchmark> model=<model_name>
    [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
    [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>

Rules:
- One [START] line at episode begin.
- One [STEP] line per step, immediately after env.step() returns.
- One [END] line after env.close(), always emitted (even on exception).
- reward and rewards are formatted to 2 decimal places.
- done and success are lowercase booleans: true or false.
- error is the raw last_action_error string, or null if none.
- All fields on a single line, with no newlines within a line.
- Each task should return a score in [0, 1].

Example:
    [START] task=click-test env=miniwob model=Qwen3-VL-30B
    [STEP] step=1 action=click('123') reward=0.00 done=false error=null
    [STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
    [STEP] step=3 action=click('789') reward=1.00 done=true error=null
    [END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00
"""
import os
from typing import List

from openai import OpenAI

from my_env_v4 import MyEnvV4Action, MyEnvV4Env

IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")  # set if you use from_docker_image()
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
TASK_NAME = os.getenv("MY_ENV_V4_TASK", "echo")
BENCHMARK = os.getenv("MY_ENV_V4_BENCHMARK", "my_env_v4")
MAX_STEPS = 8
TEMPERATURE = 0.7


def main() -> None:
    # Minimal episode loop (sketch): the result-field names used below
    # (observation, reward, done) follow the spec described above; adapt
    # them and the MyEnvV4Action fields to your environment's typed models.
    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
    env = MyEnvV4Env.from_docker_image(IMAGE_NAME) if IMAGE_NAME else MyEnvV4Env()
    rewards: List[float] = []
    success, steps = False, 0
    print(f"[START] task={TASK_NAME} env={BENCHMARK} model={MODEL_NAME}")
    try:
        obs = env.reset()  # initial observation
        for steps in range(1, MAX_STEPS + 1):
            # Ask the model for the next action given the current observation.
            completion = client.chat.completions.create(
                model=MODEL_NAME,
                temperature=TEMPERATURE,
                messages=[
                    {"role": "system", "content": "Reply with a single action."},
                    {"role": "user", "content": str(obs)},
                ],
            )
            action_str = (completion.choices[0].message.content or "").strip()
            result = env.step(MyEnvV4Action(command=action_str))  # field name is illustrative
            obs = result.observation
            reward = result.reward or 0.0
            rewards.append(reward)
            error = getattr(obs, "last_action_error", None) or "null"
            print(f"[STEP] step={steps} action={action_str} reward={reward:.2f} "
                  f"done={str(result.done).lower()} error={error}")
            if result.done:
                success = reward > 0.0
                break
    finally:
        env.close()
        score = rewards[-1] if rewards else 0.0  # assumes the final reward is the task score
        joined = ",".join(f"{r:.2f}" for r in rewards)
        print(f"[END] success={str(success).lower()} steps={steps} "
              f"score={score:.2f} rewards={joined}")


if __name__ == "__main__":
    main()
```

## Pre-Validation Script

```bash
#!/usr/bin/env bash
#
# validate-submission.sh - OpenEnv Submission Validator
#
# Checks that your HF Space is live, your Docker image builds, and
# `openenv validate` passes.
#
# Prerequisites:
#   - Docker: https://docs.docker.com/get-docker/
#   - openenv-core: pip install openenv-core
#   - curl (usually pre-installed)
#
# Run:
#   curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
#
# Or download and run locally:
#   chmod +x validate-submission.sh
#   ./validate-submission.sh <ping_url> [repo_dir]
#
# Arguments:
#   ping_url   Your Hugging Face Space URL (e.g. https://your-space.hf.space)
#   repo_dir   Path to your repo (default: current directory)
#
# Examples:
#   ./validate-submission.sh https://my-team.hf.space
#   ./validate-submission.sh https://my-team.hf.space ./my-repo
#
set -uo pipefail

DOCKER_BUILD_TIMEOUT=600

if [ -t 1 ]; then
  RED='\033[0;31m'
  GREEN='\033[0;32m'
  YELLOW='\033[1;33m'
  BOLD='\033[1m'
  NC='\033[0m'
else
  RED=''
  GREEN=''
  YELLOW=''
  BOLD=''
  NC=''
fi

# The checks below are a minimal sketch; adjust paths and flags to your setup.
PING_URL="${1:-}"
REPO_DIR="${2:-.}"
FAILURES=0

if [ -z "$PING_URL" ]; then
  echo -e "${RED}Usage:${NC} $0 <ping_url> [repo_dir]"
  exit 1
fi

# 1. HF Space is live and responding
if curl -fsS --max-time 30 "$PING_URL" > /dev/null; then
  echo -e "${GREEN}PASS${NC} Space responds at $PING_URL"
else
  echo -e "${RED}FAIL${NC} Space did not respond at $PING_URL"
  FAILURES=$((FAILURES + 1))
fi

# 2. Docker image builds
if timeout "$DOCKER_BUILD_TIMEOUT" docker build -q "$REPO_DIR" > /dev/null; then
  echo -e "${GREEN}PASS${NC} docker build succeeded"
else
  echo -e "${RED}FAIL${NC} docker build failed (or timed out)"
  FAILURES=$((FAILURES + 1))
fi

# 3. OpenEnv spec compliance
if (cd "$REPO_DIR" && openenv validate); then
  echo -e "${GREEN}PASS${NC} openenv validate passed"
else
  echo -e "${RED}FAIL${NC} openenv validate failed"
  FAILURES=$((FAILURES + 1))
fi

exit "$FAILURES"
```

## Tips for Success

1. **Choose a Real Problem**: Pick a task that has genuine value for the AI/agent community
2. **Design Good Rewards**: Provide meaningful signal throughout the episode, not just at the end
3. **Test Thoroughly**: Ensure your environment works cleanly with `docker build && docker run`
4. **Document Well**: A clear README helps reviewers understand your contribution
5. **Start Simple**: Get the basic OpenEnv spec working first, then add complexity
6. **Run the Validator**: Use the pre-validation script before submitting

## Resources

- OpenEnv Documentation: [Link to be added]
- Hugging Face Spaces: https://huggingface.co/spaces
- OpenAI API Client: https://platform.openai.com/docs/api-reference

## Submission Deadline

[To be announced]

---

**Good luck with your submission!**