# Meta OpenEnv Hackathon - Round 1
## Overview
Build a complete, real-world OpenEnv environment that an AI agent can learn from through the standard `step()` / `reset()` / `state()` API.
## Task Requirements
### Must-Have Features
1. **Real-world Task Simulation**
- Must simulate tasks humans actually do
- Not games or toys
- Examples: email triage, code review, data cleaning, scheduling, customer support, content moderation
2. **OpenEnv Spec Compliance**
- Typed Observation, Action, and Reward Pydantic models (typed models and a grader are sketched after this list)
- `step(action)` → returns observation, reward, done, info
- `reset()` → returns initial observation
- `state()` → returns current state
- `openenv.yaml` with metadata
- Must pass `openenv validate`
3. **Minimum 3 Tasks with Agent Graders**
- Each task defines a concrete objective
- Programmatic grader scoring (0.0–1.0)
- Difficulty range: easy → medium → hard
- Clear, deterministic success/failure criteria
4. **Meaningful Reward Function**
- Provides signal over the full trajectory (not just a binary end-of-episode reward)
- Rewards partial progress toward completion
- Penalizes undesirable behavior (infinite loops, destructive actions)
5. **Baseline Inference Script**
- Uses OpenAI API client
- Reads credentials from `OPENAI_API_KEY` environment variable
- Produces reproducible baseline scores on all 3 tasks
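The spec-compliance, grader, and reward requirements above are easiest to see in code. Below is a minimal sketch for a hypothetical email-triage environment; the class and function names (`TriageAction`, `TriageObservation`, `grade_inbox`) and the penalty weight are illustrative placeholders, not part of the OpenEnv spec.

```python
# Illustrative sketch only: TriageAction, TriageObservation, and grade_inbox are
# hypothetical names for an email-triage environment, not OpenEnv-mandated types.
from typing import Optional
from pydantic import BaseModel


class TriageAction(BaseModel):
    """Typed action: which email to act on and what to do with it."""
    email_id: str
    operation: str  # e.g. "archive", "reply", "escalate"


class TriageObservation(BaseModel):
    """Typed observation returned by reset() and step()."""
    inbox_summary: str
    remaining_emails: int
    last_action_error: Optional[str] = None


def grade_inbox(correctly_handled: int, total: int, destructive_actions: int) -> float:
    """Partial-credit grader: returns a score in [0.0, 1.0].

    Rewards partial progress and penalizes destructive actions.
    """
    if total == 0:
        return 0.0
    score = correctly_handled / total - 0.1 * destructive_actions
    return max(0.0, min(1.0, score))
```

A shaped per-step reward can then be derived from the change in this score between steps, which provides the trajectory-level signal the reward-function requirement asks for.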
## Non-Functional Requirements
### Deployment
- **Hugging Face Space**: Environment must run as a containerized HF Space tagged with `openenv`
- **Dockerfile**: Working containerization with clean `docker build + docker run`
### Documentation
README must include:
- Environment description and motivation
- Action and observation space definitions
- Task descriptions with expected difficulty
- Setup and usage instructions
- Baseline scores
## Evaluation Criteria & Scoring
### Scoring Breakdown (100 points)
| Criterion | Weight | Description |
|-----------|--------|-------------|
| **Real-world utility** | 30% | Does the environment model a genuine task? Would someone use this for training/evaluating agents? |
| **Task & grader quality** | 25% | Well-defined tasks with clear objectives? Accurate graders? Meaningful difficulty progression? |
| **Environment design** | 20% | Clean state management, sensible action/observation spaces, good reward shaping, proper episode boundaries |
| **Code quality & spec compliance** | 15% | Follows OpenEnv spec, clean structure, typed models, documented, tested, working Dockerfile |
| **Creativity & novelty** | 10% | Novel problem domain, interesting mechanics, clever reward design, original approach |
### Detailed Scoring Rubrics
#### Real-world Utility (30%)
- **0–5**: Toy/artificial problem with no practical application
- **6–15**: Valid domain but shallow modeling
- **16–25**: Good domain modeling, useful for agent evaluation
- **26–30**: Excellent; fills a real gap and offers immediate value for the RL/agent community
#### Task & Grader Quality (25%)
- 3+ tasks with difficulty range?
- Graders produce scores between 0.0–1.0?
- Graders deterministic and reproducible?
- Hard task genuinely challenges frontier models?
#### Environment Design (20%)
- `reset()` produces clean state?
- Action/observation types well-designed and documented?
- Reward function provides useful varying signal (not sparse)?
- Episode boundaries sensible?
#### Code Quality & Spec Compliance (15%)
- `openenv validate` passes?
- `docker build && docker run` works?
- HF Space deploys and responds?
- Baseline script runs and reproduces scores?
#### Creativity & Novelty (10%)
- Domain not seen in OpenEnv before?
- Reward design has interesting properties?
- Clever mechanics that make environment engaging?
## Judging Process
### Phase 1: Automated Validation (Pass/Fail Gate)
- HF Space deploys
- OpenEnv spec compliance
- Dockerfile builds
- Baseline reproduces
- 3+ tasks with graders
### Phase 2: Agentic Evaluation (Scored)
- Baseline agent re-run
- Standard Open LLM agent (e.g., Nemotron 3 Super) run against all environments
- Score variance check
### Phase 3: Human Review
Top submissions reviewed by Meta and Hugging Face engineers for:
- Real-world utility
- Creativity
- Exploit checks
### Disqualification Criteria
- Environment does not deploy or respond
- Plagiarized or trivially modified existing environments
- Graders that always return the same score
- No baseline inference script
## Pre-Submission Checklist
All must pass or you're disqualified:
- [ ] HF Space deploys (returns a 200 response to `reset()`; see the liveness-check sketch after this checklist)
- [ ] OpenEnv spec compliance validated
- [ ] Dockerfile builds successfully
- [ ] Baseline script reproduces without error
- [ ] 3+ tasks with graders (scores in 0.0–1.0 range)
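Before submitting, a quick liveness check helps catch a Space that fails to boot. The snippet below is only a sketch: it verifies that a placeholder Space URL answers with HTTP 200; the exact route that serves `reset()` depends on how your OpenEnv HTTP server is wired, so point it at the appropriate path for your environment.

```python
# Quick liveness check: verifies the Space URL answers with HTTP 200.
# SPACE_URL is a placeholder; the exact reset() route depends on your server.
import urllib.request

SPACE_URL = "https://your-space.hf.space"  # replace with your Space URL

with urllib.request.urlopen(SPACE_URL, timeout=30) as resp:
    print(f"GET {SPACE_URL} -> {resp.status}")
    assert resp.status == 200, "Space did not return HTTP 200"
```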
## Mandatory Requirements
### Environment Variables
Must be defined in your environment configuration:
```bash
API_BASE_URL # The API endpoint for the LLM
MODEL_NAME # The model identifier to use for inference
HF_TOKEN # Your Hugging Face / API key
LOCAL_IMAGE_NAME # (Optional) Name of local image if using from_docker_image()
```
### Script Requirements
- **Filename**: `inference.py` (must be in the root directory)
- **LLM Calls**: Must use the OpenAI client with the variables above (see the sketch below)
- **Logging Format**: Must follow the [START], [STEP], [END] format (see below)
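Wiring the OpenAI client to the mandatory environment variables is a few lines of boilerplate. A minimal sketch is shown below; the prompt content is a placeholder.

```python
# Minimal sketch: construct the OpenAI client from the mandatory environment variables.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["API_BASE_URL"],  # LLM endpoint
    api_key=os.environ["HF_TOKEN"],       # Hugging Face token / API key
)

response = client.chat.completions.create(
    model=os.environ["MODEL_NAME"],
    messages=[{"role": "user", "content": "placeholder prompt"}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```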
### Infrastructure Restrictions
- **Runtime**: Inference script must complete in < 20 minutes
- **Resources**: Must run on vcpu=2, memory=8GB
## STDOUT Logging Format
### Required Format
The script must emit exactly three line types to stdout, in this order:
```
[START] task=<task_name> env=<benchmark> model=<model_name>
[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
[END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
```
### Format Rules
- One [START] line at episode start
- One [STEP] line per step, immediately after `env.step()` returns
- One [END] line after `env.close()`, always emitted (even on exception)
- `reward` and `rewards` formatted to 2 decimal places
- `done` and `success` are lowercase booleans: `true` or `false`
- `error` is the raw `last_action_error` string, or `null` if none
- All fields on a single line; no embedded newlines
- Each task should return a score in [0, 1]
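To make these rules hard to violate, the formatting can be centralized in a few small helpers. The sketch below is one way to do it; the function names are arbitrary and nothing about them is required.

```python
# Sketch of helpers that emit the required [START]/[STEP]/[END] lines.
# Function names are arbitrary; only the output format matters.
from typing import List, Optional


def log_start(task: str, env_name: str, model: str) -> None:
    print(f"[START] task={task} env={env_name} model={model}", flush=True)


def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
    print(
        f"[STEP] step={step} action={action} reward={reward:.2f} "
        f"done={str(done).lower()} error={error if error else 'null'}",
        flush=True,
    )


def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    print(
        f"[END] success={str(success).lower()} steps={steps} "
        f"score={score:.2f} rewards={rewards_str}",
        flush=True,
    )
```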
### Example Output
```
[START] task=click-test env=miniwob model=Qwen3-VL-30B
[STEP] step=1 action=click('123') reward=0.00 done=false error=null
[STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
[STEP] step=3 action=click('789') reward=1.00 done=true error=null
[END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00
```
## Sample Inference Script
```python
"""
Inference Script Example
===================================
MANDATORY
- Before submitting, ensure the following variables are defined in your environment configuration:
API_BASE_URL The API endpoint for the LLM.
MODEL_NAME The model identifier to use for inference.
HF_TOKEN Your Hugging Face / API key.
LOCAL_IMAGE_NAME The name of the local image to use for the environment when using the from_docker_image() method
- Defaults are set only for API_BASE_URL and MODEL_NAME
(and should reflect your active inference setup):
API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")
- The inference script must be named `inference.py` and placed in the root directory of the project
- Participants must use OpenAI Client for all LLM calls using above variables
STDOUT FORMAT
- The script must emit exactly three line types to stdout, in this order:
[START] task=<task_name> env=<benchmark> model=<model_name>
[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
[END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
Rules:
- One [START] line at episode start.
- One [STEP] line per step, immediately after env.step() returns.
- One [END] line after env.close(), always emitted (even on exception).
- reward and rewards are formatted to 2 decimal places.
- done and success are lowercase booleans: true or false.
- error is the raw last_action_error string, or null if none.
- All fields on a single line with no newlines within a line.
- Each task should return a score in [0, 1]
Example:
[START] task=click-test env=miniwob model=Qwen3-VL-30B
[STEP] step=1 action=click('123') reward=0.00 done=false error=null
[STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
[STEP] step=3 action=click('789') reward=1.00 done=true error=null
[END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00
"""
import asyncio
import os
import textwrap
from typing import List, Optional
from openai import OpenAI
from my_env_v4 import MyEnvV4Action, MyEnvV4Env
IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")  # Only needed if you use from_docker_image()
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
TASK_NAME = os.getenv("MY_ENV_V4_TASK", "echo")
BENCHMARK = os.getenv("MY_ENV_V4_BENCHMARK", "my_env_v4")
MAX_STEPS = 8
TEMPERATURE = 0.7
# TODO: Implement the rest of your inference script here
```
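The omitted portion could be filled in roughly as sketched below, continuing from the variables and imports in the sample above. It assumes the environment follows the spec described earlier (`reset()` returns the initial observation, `step()` returns observation, reward, done, info, and `close()` ends the episode); the `from_docker_image()` call, the `message` field on `MyEnvV4Action`, the prompt construction, and the success/score heuristics are all placeholders to replace with your environment's real logic.

```python
def run_episode() -> None:
    """Sketch of the omitted episode loop; adapt the placeholders to your environment."""
    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
    # from_docker_image() is the optional local-image path described above (LOCAL_IMAGE_NAME).
    env = MyEnvV4Env.from_docker_image(IMAGE_NAME) if IMAGE_NAME else MyEnvV4Env()

    rewards: List[float] = []
    success = False
    steps_taken = 0
    print(f"[START] task={TASK_NAME} env={BENCHMARK} model={MODEL_NAME}", flush=True)
    try:
        observation = env.reset()  # reset() returns the initial observation
        for steps_taken in range(1, MAX_STEPS + 1):
            completion = client.chat.completions.create(
                model=MODEL_NAME,
                messages=[{"role": "user", "content": str(observation)}],  # placeholder prompt
                temperature=TEMPERATURE,
            )
            action_str = (completion.choices[0].message.content or "").strip()

            # Per the spec above, step() returns observation, reward, done, info.
            observation, reward, done, _info = env.step(MyEnvV4Action(message=action_str))
            reward = float(reward or 0.0)
            done = bool(done)
            error = getattr(observation, "last_action_error", None)  # assumed field location
            rewards.append(reward)
            print(
                f"[STEP] step={steps_taken} action={action_str} reward={reward:.2f} "
                f"done={str(done).lower()} error={error if error else 'null'}",
                flush=True,
            )
            if done:
                success = reward > 0  # placeholder success criterion
                break
    finally:
        env.close()
        score = rewards[-1] if rewards else 0.0  # placeholder: substitute your grader's score
        rewards_str = ",".join(f"{r:.2f}" for r in rewards)
        print(
            f"[END] success={str(success).lower()} steps={steps_taken} "
            f"score={score:.2f} rewards={rewards_str}",
            flush=True,
        )


if __name__ == "__main__":
    run_episode()
```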
## Pre-Validation Script
```bash
#!/usr/bin/env bash
#
# validate-submission.sh - OpenEnv Submission Validator
#
# Checks that your HF Space is live, Docker image builds, and openenv validate passes.
#
# Prerequisites:
# - Docker: https://docs.docker.com/get-docker/
# - openenv-core: pip install openenv-core
# - curl (usually pre-installed)
#
# Run:
# curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
#
# Or download and run locally:
# chmod +x validate-submission.sh
# ./validate-submission.sh <ping_url> [repo_dir]
#
# Arguments:
# ping_url Your HuggingFace Space URL (e.g. https://your-space.hf.space)
# repo_dir Path to your repo (default: current directory)
#
# Examples:
# ./validate-submission.sh https://my-team.hf.space
# ./validate-submission.sh https://my-team.hf.space ./my-repo
#
set -uo pipefail
DOCKER_BUILD_TIMEOUT=600
if [ -t 1 ]; then
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BOLD='\033[1m'
NC='\033[0m'
else
RED=''
GREEN=''
YELLOW=''
BOLD=''
NC=''
fi
# TODO: Add the rest of the validation script
```
## Tips for Success
1. **Choose a Real Problem**: Pick a task that has genuine value for the AI/agent community
2. **Design Good Rewards**: Provide meaningful signals throughout the episode, not just at the end
3. **Test Thoroughly**: Ensure your environment works cleanly with `docker build && docker run`
4. **Document Well**: Clear README helps reviewers understand your contribution
5. **Start Simple**: Get the basic OpenEnv spec working first, then add complexity
6. **Run Validator**: Use the pre-validation script before submitting
## Resources
- OpenEnv Documentation: [Link to be added]
- Hugging Face Spaces: https://huggingface.co/spaces
- OpenAI API Client: https://platform.openai.com/docs/api-reference
## Submission Deadline
[To be announced]
---
**Good luck with your submission! 🚀**