# Meta OpenEnv Hackathon - Round 1

## Overview

Build a complete, real-world OpenEnv environment that an AI agent can learn from through the standard `step()` / `reset()` / `state()` API.

## Task Requirements

### Must-Have Features

1. **Real-world Task Simulation**
   - Must simulate tasks humans actually do
   - Not games or toys
   - Examples: email triage, code review, data cleaning, scheduling, customer support, content moderation

2. **OpenEnv Spec Compliance**
   - Typed Observation, Action, and Reward Pydantic models
   - `step(action)` → returns observation, reward, done, info
   - `reset()` → returns initial observation
   - `state()` → returns current state
   - `openenv.yaml` with metadata
   - Must pass `openenv validate`
   - (A minimal sketch of this interface follows this list)

3. **Minimum 3 Tasks with Agent Graders**
   - Each task defines a concrete objective
   - Programmatic grader scoring (0.0–1.0)
   - Difficulty range: easy → medium → hard
   - Clear, deterministic success/failure criteria

4. **Meaningful Reward Function**
   - Provides signal over the full trajectory (not just binary)
   - Rewards partial progress toward completion
   - Penalizes undesirable behavior (infinite loops, destructive actions)

5. **Baseline Inference Script**
   - Uses the OpenAI API client
   - Reads credentials from the `OPENAI_API_KEY` environment variable
   - Produces reproducible baseline scores on all 3 tasks
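To make the spec-compliance and reward-shaping requirements concrete, here is a minimal sketch of the expected shape of an environment. It uses plain Pydantic models and a hand-rolled class rather than the actual `openenv-core` base classes; the `EmailTriageEnv` name, its grader, and the partial-credit reward are illustrative assumptions, not part of the spec.

```python
# Minimal sketch only: these models and the environment class are illustrative
# assumptions, NOT the real openenv-core API. Consult the OpenEnv docs for the
# actual base classes and the openenv.yaml schema.
from typing import Optional

from pydantic import BaseModel


class Action(BaseModel):
    """A single agent command, e.g. 'archive 0' for a hypothetical email-triage env."""
    command: str
    argument: str = ""


class Observation(BaseModel):
    """What the agent sees after each step."""
    inbox_preview: list[str]
    last_action_error: Optional[str] = None


class Reward(BaseModel):
    value: float  # graders and per-task scores stay in [0.0, 1.0]


class EmailTriageEnv:
    """Hypothetical environment following the step()/reset()/state() contract."""

    def __init__(self, task: str = "easy"):
        self.task = task
        self.reset()

    def reset(self) -> Observation:
        # Clean initial state on every reset
        self._emails = ["invoice overdue", "weekly newsletter", "urgent: server down"]
        self._handled: set[int] = set()
        self._steps = 0
        return Observation(inbox_preview=self._emails)

    def state(self) -> dict:
        return {"task": self.task, "handled": sorted(self._handled), "steps": self._steps}

    def step(self, action: Action) -> tuple[Observation, float, bool, dict]:
        self._steps += 1
        error = None
        reward = 0.0
        if action.command == "archive" and action.argument.isdigit():
            idx = int(action.argument)
            if 0 <= idx < len(self._emails) and idx not in self._handled:
                self._handled.add(idx)
                reward = 1.0 / len(self._emails)  # partial credit per email handled
            else:
                error = f"invalid or repeated index: {action.argument}"
                reward = -0.05  # small penalty discourages loops and blind retries
        else:
            error = f"unknown command: {action.command}"
        done = len(self._handled) == len(self._emails) or self._steps >= 20
        obs = Observation(inbox_preview=self._emails, last_action_error=error)
        return obs, reward, done, {"score": len(self._handled) / len(self._emails)}
```

Your real submission must use the OpenEnv base classes, ship an `openenv.yaml`, and pass `openenv validate`; the sketch only illustrates typed I/O, partial-credit rewards with penalties, and clean episode boundaries.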
## Non-Functional Requirements

### Deployment

- **Hugging Face Space**: Environment must run as a containerized HF Space tagged with `openenv`
- **Dockerfile**: Working containerization with a clean `docker build` + `docker run`

### Documentation

README must include:

- Environment description and motivation
- Action and observation space definitions
- Task descriptions with expected difficulty
- Setup and usage instructions
- Baseline scores

## Evaluation Criteria & Scoring

### Scoring Breakdown (100 points)

| Criterion | Weight | Description |
|-----------|--------|-------------|
| **Real-world utility** | 30% | Does the environment model a genuine task? Would someone use this for training/evaluating agents? |
| **Task & grader quality** | 25% | Well-defined tasks with clear objectives? Accurate graders? Meaningful difficulty progression? |
| **Environment design** | 20% | Clean state management, sensible action/observation spaces, good reward shaping, proper episode boundaries |
| **Code quality & spec compliance** | 15% | Follows OpenEnv spec, clean structure, typed models, documented, tested, working Dockerfile |
| **Creativity & novelty** | 10% | Novel problem domain, interesting mechanics, clever reward design, original approach |

### Detailed Scoring Rubrics

#### Real-world Utility (30%)

- **0–5**: Toy/artificial problem with no practical application
- **6–15**: Valid domain but shallow modeling
- **16–25**: Good domain modeling, useful for agent evaluation
- **26–30**: Excellent; fills a real gap, immediate value for the RL/agent community

#### Task & Grader Quality (25%)

- 3+ tasks with a difficulty range?
- Graders produce scores between 0.0–1.0?
- Graders deterministic and reproducible?
- Hard task genuinely challenges frontier models?

#### Environment Design (20%)

- `reset()` produces a clean state?
- Action/observation types well-designed and documented?
- Reward function provides useful, varying signal (not sparse)?
- Episode boundaries sensible?

#### Code Quality & Spec Compliance (15%)

- `openenv validate` passes?
- `docker build && docker run` works?
- HF Space deploys and responds?
- Baseline script runs and reproduces scores?

#### Creativity & Novelty (10%)

- Domain not seen in OpenEnv before?
- Reward design has interesting properties?
- Clever mechanics that make the environment engaging?

## Judging Process

### Phase 1: Automated Validation (Pass/Fail Gate)

- HF Space deploys
- OpenEnv spec compliance
- Dockerfile builds
- Baseline reproduces
- 3+ tasks with graders

### Phase 2: Agentic Evaluation (Scored)

- Baseline agent re-run
- Standard open LLM agent (e.g., Nemotron 3 Super) run against all environments
- Score variance check

### Phase 3: Human Review

Top submissions reviewed by Meta and Hugging Face engineers for:

- Real-world utility
- Creativity
- Exploit checks

### Disqualification Criteria

- Environment does not deploy or respond
- Plagiarized or trivially modified existing environments
- Graders that always return the same score
- No baseline inference script

## Pre-Submission Checklist

All must pass or you're disqualified:

- [ ] HF Space deploys (200 response to `reset()`)
- [ ] OpenEnv spec compliance validated
- [ ] Dockerfile builds successfully
- [ ] Baseline script reproduces without error
- [ ] 3+ tasks with graders (scores in the 0.0–1.0 range)

## Mandatory Requirements

### Environment Variables

Must be defined in your environment configuration:

```bash
API_BASE_URL      # The API endpoint for the LLM
MODEL_NAME        # The model identifier to use for inference
HF_TOKEN          # Your Hugging Face / API key
LOCAL_IMAGE_NAME  # (Optional) Name of the local image if using from_docker_image()
```

### Script Requirements

- **Filename**: `inference.py` (must be in the root directory)
- **LLM Calls**: Must use the OpenAI client with the above variables
- **Logging Format**: Must follow the [START], [STEP], [END] format (see below)

### Infrastructure Restrictions

- **Runtime**: Inference script must complete in < 20 minutes
- **Resources**: Must run on vcpu=2, memory=8GB

## STDOUT Logging Format

### Required Format

The script must emit exactly three line types to stdout, in this order:

```
[START] task=<task_name> env=<env_name> model=<model_name>
[STEP] step=<n> action=<action> reward=<0.00> done=<true|false> error=<message|null>
[END] success=<true|false> steps=<n> score=<0.00> rewards=<0.00,0.00,...>
```

### Format Rules

- One [START] line at episode begin
- One [STEP] line per step, immediately after `env.step()` returns
- One [END] line after `env.close()`, always emitted (even on exception)
- `reward` and `rewards` are formatted to 2 decimal places
- `done` and `success` are lowercase booleans: `true` or `false`
- `error` is the raw `last_action_error` string, or `null` if none
- All fields on a single line, with no newlines within a line
- Each task should return a score in [0, 1]

### Example Output

```
[START] task=click-test env=miniwob model=Qwen3-VL-30B
[STEP] step=1 action=click('123') reward=0.00 done=false error=null
[STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
[STEP] step=3 action=click('789') reward=1.00 done=true error=null
[END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00
```
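One way to satisfy the formatting rules above is to centralize them in small helpers so every episode prints identical line shapes. This is a sketch, not provided tooling: the function names are hypothetical, and only the output format (2-decimal rewards, lowercase booleans, `null` for a missing error) comes from the rules.

```python
# Sketch of log-line helpers for the required [START]/[STEP]/[END] format.
# Function names are illustrative; only the output format comes from the rules above.
from typing import Optional


def fmt_bool(value: bool) -> str:
    return "true" if value else "false"  # lowercase booleans


def fmt_error(error: Optional[str]) -> str:
    return error if error else "null"  # raw error string, or null


def log_start(task: str, env: str, model: str) -> None:
    print(f"[START] task={task} env={env} model={model}", flush=True)


def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
    print(
        f"[STEP] step={step} action={action} reward={reward:.2f} "
        f"done={fmt_bool(done)} error={fmt_error(error)}",
        flush=True,
    )


def log_end(success: bool, steps: int, score: float, rewards: list[float]) -> None:
    reward_csv = ",".join(f"{r:.2f}" for r in rewards)  # 2 decimal places each
    print(
        f"[END] success={fmt_bool(success)} steps={steps} "
        f"score={score:.2f} rewards={reward_csv}",
        flush=True,
    )
```

Called in order (`log_start` once per episode, `log_step` after every `env.step()`, `log_end` in a `finally` block after `env.close()`), these helpers reproduce the example output above exactly.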
## Sample Inference Script

```python
"""
Inference Script Example
===================================

MANDATORY
- Before submitting, ensure the following variables are defined in your
  environment configuration:
      API_BASE_URL      The API endpoint for the LLM.
      MODEL_NAME        The model identifier to use for inference.
      HF_TOKEN          Your Hugging Face / API key.
      LOCAL_IMAGE_NAME  The name of the local image to use for the environment
                        if you are using the from_docker_image() method.
- Defaults are set only for API_BASE_URL and MODEL_NAME (and should reflect
  your active inference setup):
      API_BASE_URL = os.getenv("API_BASE_URL", "")
      MODEL_NAME = os.getenv("MODEL_NAME", "")
- The inference script must be named `inference.py` and placed in the root
  directory of the project.
- Participants must use the OpenAI client for all LLM calls, using the above
  variables.

STDOUT FORMAT
- The script must emit exactly three line types to stdout, in this order:
      [START] task=<task_name> env=<env_name> model=<model_name>
      [STEP] step=<n> action=<action> reward=<0.00> done=<true|false> error=<message|null>
      [END] success=<true|false> steps=<n> score=<0.00> rewards=<0.00,0.00,...>

  Rules:
  - One [START] line at episode begin.
  - One [STEP] line per step, immediately after env.step() returns.
  - One [END] line after env.close(), always emitted (even on exception).
  - reward and rewards are formatted to 2 decimal places.
  - done and success are lowercase booleans: true or false.
  - error is the raw last_action_error string, or null if none.
  - All fields on a single line, with no newlines within a line.
  - Each task should return a score in [0, 1].

  Example:
      [START] task=click-test env=miniwob model=Qwen3-VL-30B
      [STEP] step=1 action=click('123') reward=0.00 done=false error=null
      [STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
      [STEP] step=3 action=click('789') reward=1.00 done=true error=null
      [END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00
"""

import asyncio
import os
import textwrap
from typing import List, Optional

from openai import OpenAI

from my_env_v4 import MyEnvV4Action, MyEnvV4Env

LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")  # Only needed if using from_docker_image()
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"

TASK_NAME = os.getenv("MY_ENV_V4_TASK", "echo")
BENCHMARK = os.getenv("MY_ENV_V4_BENCHMARK", "my_env_v4")
MAX_STEPS = 8
TEMPERATURE = 0.7

# TODO: Implement the rest of your inference script here
```
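The sample script stops at the TODO. Below is one hedged way to finish it: the `MyEnvV4Env` constructor, its `close()` method, and the `MyEnvV4Action(command=...)` signature are assumptions (the spec above only guarantees `reset()` and `step()` returning observation, reward, done, info), and the prompt and action parsing are placeholders. Only the OpenAI chat-completions call and the [START]/[STEP]/[END] lines follow the stated requirements.

```python
# Hypothetical completion of the TODO above. MyEnvV4Env's exact interface is an
# assumption; adapt the constructor, action type, and info fields to your environment.
def run_episode() -> None:
    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
    env = MyEnvV4Env(task=TASK_NAME)  # assumed constructor
    rewards: list[float] = []
    success, score, step = False, 0.0, 0

    print(f"[START] task={TASK_NAME} env={BENCHMARK} model={MODEL_NAME}", flush=True)
    try:
        obs = env.reset()
        for step in range(1, MAX_STEPS + 1):
            completion = client.chat.completions.create(
                model=MODEL_NAME,
                temperature=TEMPERATURE,
                messages=[
                    {"role": "system", "content": "Reply with a single action command."},
                    {"role": "user", "content": str(obs)},
                ],
            )
            action_text = completion.choices[0].message.content.strip()
            obs, reward, done, info = env.step(MyEnvV4Action(command=action_text))  # assumed signature
            rewards.append(reward)
            error = getattr(obs, "last_action_error", None)
            print(
                f"[STEP] step={step} action={action_text} reward={reward:.2f} "
                f"done={str(done).lower()} error={error if error else 'null'}",
                flush=True,
            )
            if done:
                success = bool(info.get("success", reward >= 1.0))
                score = float(info.get("score", reward))
                break
    finally:
        env.close()  # assumed close() method, per the logging rules
        print(
            f"[END] success={str(success).lower()} steps={step} score={score:.2f} "
            f"rewards={','.join(f'{r:.2f}' for r in rewards)}",
            flush=True,
        )


if __name__ == "__main__":
    run_episode()
```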
## Pre-Validation Script

```bash
#!/usr/bin/env bash
#
# validate-submission.sh — OpenEnv Submission Validator
#
# Checks that your HF Space is live, the Docker image builds, and `openenv validate` passes.
#
# Prerequisites:
#   - Docker: https://docs.docker.com/get-docker/
#   - openenv-core: pip install openenv-core
#   - curl (usually pre-installed)
#
# Run:
#   curl -fsSL https://raw.githubusercontent.com///main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
#
# Or download and run locally:
#   chmod +x validate-submission.sh
#   ./validate-submission.sh <ping_url> [repo_dir]
#
# Arguments:
#   ping_url   Your Hugging Face Space URL (e.g. https://your-space.hf.space)
#   repo_dir   Path to your repo (default: current directory)
#
# Examples:
#   ./validate-submission.sh https://my-team.hf.space
#   ./validate-submission.sh https://my-team.hf.space ./my-repo
#
set -uo pipefail

DOCKER_BUILD_TIMEOUT=600

# Colors for terminal output (disabled when stdout is not a TTY)
if [ -t 1 ]; then
  RED='\033[0;31m'
  GREEN='\033[0;32m'
  YELLOW='\033[1;33m'
  BOLD='\033[1m'
  NC='\033[0m'
else
  RED=''
  GREEN=''
  YELLOW=''
  BOLD=''
  NC=''
fi

# TODO: Add the rest of the validation script
```

## Tips for Success

1. **Choose a Real Problem**: Pick a task that has genuine value for the AI/agent community
2. **Design Good Rewards**: Provide meaningful signals throughout the episode, not just at the end
3. **Test Thoroughly**: Ensure your environment works cleanly with `docker build && docker run`
4. **Document Well**: A clear README helps reviewers understand your contribution
5. **Start Simple**: Get the basic OpenEnv spec working first, then add complexity
6. **Run the Validator**: Use the pre-validation script before submitting

## Resources

- OpenEnv Documentation: [Link to be added]
- Hugging Face Spaces: https://huggingface.co/spaces
- OpenAI API Client: https://platform.openai.com/docs/api-reference

## Submission Deadline

[To be announced]

---

**Good luck with your submission! 🚀**