Scaler School of Technology × Meta PyTorch Hackathon
OpenEnv Hackathon Dashboard
URL: https://www.scaler.com/school-of-technology/meta-pytorch-hackathon/dashboard#form
Timeline
| Stage | Dates |
|---|---|
| Registration | 14th March – 3rd April |
| Declaration | Before Round 1 |
| Prepare | Now – 25th March |
| Round 1 | 25th March – 8th April |
| Results | 10th April |
| Finals | 25th – 26th April |
Community
- Discord: Join the Discord Community – all announcements, mentor access, and team matching happen here.
Participation
- Currently registered as Solo Warrior
- Locked for Round 1 – cannot switch to a team until Round 1 is over.
Problem Statement
The Task
Build a complete, real-world OpenEnv environment that an AI agent can learn from through the standard `step()`/`reset()`/`state()` API.
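Concretely, an episode under this API reduces to a reset-then-step loop. The toy environment below is an invented stand-in (not the real OpenEnv client) that only illustrates the calling pattern:

```python
class EchoEnv:
    """Toy stand-in implementing the step()/reset()/state() contract."""

    def __init__(self):
        self._last = ""
        self._steps = 0

    def reset(self):
        """Clear episode state and return the initial observation."""
        self._last, self._steps = "", 0
        return {"message": "ready"}

    def step(self, action):
        """Apply an action; return (observation, reward, done, info)."""
        self._last = action
        self._steps += 1
        done = self._steps >= 3          # episodes here last 3 steps
        reward = 1.0 if action else 0.0  # trivial reward for any non-empty action
        return {"message": action}, reward, done, {"steps": self._steps}

    def state(self):
        """Expose the current internal state."""
        return {"last_action": self._last, "steps": self._steps}

env = EchoEnv()
obs, done, total = env.reset(), False, 0.0
while not done:
    obs, reward, done, info = env.step("hello")
    total += reward
print(total)  # 3.0
```

In a real submission the loop is the same, but the action comes from a model rather than a hard-coded string.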
Key Requirements at a Glance
- Must simulate a real-world task (not games or toys)
- Implement full OpenEnv spec: typed models, `step()`/`reset()`/`state()`, `openenv.yaml`
- Minimum 3 tasks with agent graders (easy → medium → hard, scores 0.0–1.0)
- Meaningful reward function with partial progress signals
- Baseline inference script with reproducible scores
- Deploy to Hugging Face Spaces + working Dockerfile
- README with environment description, action/observation spaces, setup instructions
Detailed Requirements
Real-world task simulation
The environment must simulate a task humans actually do. Not games, not toys. Examples: email triage, code review, data cleaning, scheduling, customer support, content moderation.
OpenEnv spec compliance
Implement the full OpenEnv interface:
- Typed `Observation`, `Action`, and `Reward` Pydantic models
- `step(action)` → returns observation, reward, done, info
- `reset()` → returns initial observation
- `state()` → returns current state
- `openenv.yaml` with metadata
- Tested via `openenv validate`
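The typed models might be sketched as below. Field names are invented for illustration, and plain dataclasses stand in for the Pydantic models the spec actually requires, to keep the sketch dependency-free:

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    command: str                                  # e.g. "label_email" (hypothetical)
    arguments: dict = field(default_factory=dict)

@dataclass
class Observation:
    text: str                                     # what the agent currently sees
    done: bool = False

@dataclass
class Reward:
    value: float                                  # grader scores stay within 0.0-1.0
```

With Pydantic, the same classes would subclass `BaseModel` and gain runtime validation for free.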
Minimum 3 tasks with agent graders
Each task defines a concrete objective an agent must accomplish, with a programmatic grader that scores performance (0.0–1.0). Tasks should range easy → medium → hard. Graders must have clear, deterministic success/failure criteria.
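As one illustration (task and labels invented), a grader for a hypothetical email-triage task can score the fraction of correctly labelled emails – deterministic, reproducible, and bounded to [0.0, 1.0]:

```python
def grade_email_triage(predicted_labels, gold_labels):
    """Deterministic grader: fraction of emails labelled correctly, in [0.0, 1.0]."""
    if not gold_labels:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted_labels, gold_labels))
    return correct / len(gold_labels)

score = grade_email_triage(["spam", "inbox", "spam"], ["spam", "inbox", "inbox"])
print(round(score, 2))  # 0.67
```

Because the score is a simple ratio over fixed gold data, re-running the grader always produces the same number.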
Meaningful reward function
Provides signal over the full trajectory (not just binary end-of-episode). Rewards partial progress toward task completion. Penalizes clearly undesirable behavior (e.g. infinite loops, destructive actions).
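A common shaping pattern consistent with this requirement (progress values invented for illustration) is to reward the per-step change in a progress measure and subtract a fixed penalty for destructive actions:

```python
def shaped_reward(progress_before, progress_after, destructive_action=False):
    """Dense reward: credit the change in task progress each step,
    not just a binary signal at episode end."""
    reward = progress_after - progress_before    # partial-progress signal
    if destructive_action:
        reward -= 0.5                            # penalize undesirable behaviour
    return reward

print(shaped_reward(0.25, 0.5))        # 0.25 -- agent moved closer to completion
print(shaped_reward(0.25, 0.5, True))  # -0.25 -- progress made, but destructively
```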
Baseline inference script
Uses the OpenAI API client to run a model against the environment. Reads API credentials from environment variables (OPENAI_API_KEY). Produces a reproducible baseline score on all 3 tasks.
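A skeletal version of such a script might look like this. The model name and the dict-based env interface are assumptions, and the OpenAI import is deferred so the prompt helper can be exercised without the package installed:

```python
import os

def build_messages(observation_text):
    """Turn the current observation into a chat prompt for the model."""
    return [
        {"role": "system", "content": "You are an agent acting in an OpenEnv environment."},
        {"role": "user", "content": observation_text},
    ]

def run_baseline(env, model="YOUR_MODEL_NAME"):
    """Roll out one episode, letting the model choose each action."""
    from openai import OpenAI  # requires `pip install openai`

    # Credentials come from environment variables, per the requirement.
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    obs, done, total = env.reset(), False, 0.0
    while not done:
        resp = client.chat.completions.create(
            model=model, messages=build_messages(obs["text"])
        )
        action = resp.choices[0].message.content
        obs, reward, done, info = env.step(action)
        total += reward
    return total
```

Running this over all 3 tasks and printing the totals yields the reproducible baseline scores.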
Non-Functional Requirements
Deploys to a Hugging Face Space
Environment must run as a containerized HF Space tagged with `openenv`.
Must include a working Dockerfile. The environment should start cleanly with `docker build` + `docker run`.
Documentation
README must include:
- Environment description and motivation
- Action and observation space definitions
- Task descriptions with expected difficulty
- Setup and usage instructions
- Baseline scores
Evaluation Criteria
| Parameter | Weight | Description |
|---|---|---|
| Real-world utility | 30% | Does the environment model a genuine task? Would someone actually use this to train or evaluate agents? |
| Task & grader quality | 25% | Are tasks well-defined with clear objectives? Do graders accurately and fairly measure success? Meaningful difficulty progression? |
| Environment design | 20% | Clean state management, sensible action/observation spaces, good reward shaping, proper episode boundaries |
| Code quality & spec compliance | 15% | Follows OpenEnv spec, clean project structure, typed models, documented, tested, Dockerfile works |
| Creativity & novelty | 10% | Novel problem domain, interesting mechanics, clever reward design, original approach |
Scoring Breakdown (Real-world utility)
- 0–5: Toy/artificial problem with no practical application
- 6–15: Valid domain but shallow modeling of the real task
- 16–25: Good domain modeling, would be useful for agent evaluation
- 26–30: Excellent – fills a real gap, immediate value for the RL/agent community
Scoring Checklist Questions
Task & grader quality:
- 3+ tasks with difficulty range?
- Graders produce scores between 0.0 and 1.0?
- Graders deterministic and reproducible?
- Hard task genuinely challenges frontier models?
Environment design:
- `reset()` produces clean state?
- Action/observation types well-designed and documented?
- Reward function provides useful varying signal (not just sparse)?
- Episode boundaries sensible?
Code quality & spec compliance:
- `openenv validate` passes?
- `docker build && docker run` works?
- HF Space deploys and responds?
- Baseline script runs and reproduces scores?
Creativity & novelty:
- Domain we haven't seen in OpenEnv before?
- Reward design has interesting properties?
- Clever mechanics that make the environment engaging?
How Judging Works
- Phase 1 – Automated Validation: pass/fail gate – HF Space deploys, OpenEnv spec compliance, Dockerfile builds, baseline reproduces, 3+ tasks with graders.
- Phase 2 – Agentic Evaluation: scored – baseline agent re-run, a standard open LLM agent (e.g. Nemotron 3 Super) run against all environments, score variance check.
- Phase 3 – Human Review: top submissions reviewed by Meta and Hugging Face engineers for real-world utility, creativity, and exploit checks.
Disqualification Criteria
- Environment does not deploy or respond
- Plagiarized or trivially modified existing environments
- Graders that always return the same score
- No baseline inference script
Pre-Submission Checklist – all must pass or you're disqualified
| Check | Requirement |
|---|---|
| HF Space deploys | Automated ping to the Space URL – must return 200 and respond to `reset()` |
| OpenEnv spec compliance | Validate `openenv.yaml`, typed models, `step()`/`reset()`/`state()` endpoints |
| Dockerfile builds | Automated `docker build` on the submitted repo |
| Baseline reproduces | Run the submitted inference script – must complete without error and produce scores |
| 3+ tasks with graders | Enumerate tasks, run each grader, verify scores in the 0.0–1.0 range |
| Infra restrictions | Runtime of inference script should be less than 20 min. Must run on `vcpu=2`, `memory=8gb` |
| Validator | Run the pre-submission validation script before submitting |
Mandatory Additional Instructions
Before submitting, ensure the following variables are defined in your environment configuration:
| Variable | Description |
|---|---|
| API_BASE_URL | The API endpoint for the LLM |
| MODEL_NAME | The model identifier to use for inference |
| HF_TOKEN | Your Hugging Face / API key |
- The inference script must be named `inference.py` and placed in the root directory of the project.
- Participants must use the OpenAI Client for all LLM calls, using the above variables.
- Participants must emit structured stdout logs strictly following the `[START]`, `[STEP]`, and `[END]` format defined in the sample inference script. Any deviation in field names, ordering, or formatting will result in incorrect evaluation scoring. Refer to `sample_inference.py` for the complete format specification and examples.
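Only as a rough illustration of the shape (the binding field names, ordering, and formatting are whatever `sample_inference.py` defines), a tagged-log helper could look like:

```python
import json

def format_event(tag, **fields):
    """Build one structured log line. Field names below (task, step, reward,
    score) are placeholders -- follow sample_inference.py exactly."""
    return f"[{tag}] {json.dumps(fields, sort_keys=True)}"

# Emit to stdout, where the evaluation harness reads them:
print(format_event("START", task="task_1"))
print(format_event("STEP", step=1, reward=0.25))
print(format_event("END", score=0.75))
```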
Infra Restrictions
- Runtime of inference script must be less than 20 minutes.
- Ensure your env and inference can run on a machine with `vcpu=2`, `memory=8gb`.
Validator
Run the pre-submission validation script at `pre_validate.sh` before submitting.
Sample Inference Script
See `sample_inference.py` for the complete example, including the mandatory `[START]`, `[STEP]`, and `[END]` structured log format.
Submission
- Submission window opens: 28th March
- Deadline: 8 April 2026, 11:59 PM IST
Step 1
Choose solo or team before you can start the assessment.
Step 2
Complete Step 1 first. Problem Statement is live. Build and submit.
Study Material
4 modules · ~3.5 hours
Each module: read the README first, then open the notebook in Colab. No local setup needed.
Module 1 – Essential for Round 1 (45 min)
What you'll do: Connect to 3 real AI environments hosted online – an Echo bot, a Catch game, and Wordle – and interact with each using the exact same code pattern.
Module 2 – Essential for Round 1 (50 min)
What you'll do: Write 4 different game-playing strategies for a Catch game, run a competition between them, then switch to a completely different game using the same code.
Module 3 – Essential for Round 1 (45 min)
What you'll do: Clone an existing environment, modify it, run it on your machine, then deploy your version live to Hugging Face Spaces with one command.
Module 4 – Most Important for Round 1
What you'll do: Build a complete word-guessing game environment from scratch β define the rules, implement the logic, test it locally, and deploy it live. About 100 lines of real code.
- View full course repository
Guide
What to Expect
Example of what a problem statement looks like:
"Build a mini-game RL environment with clearly defined tasks, automated graders, and deploy it live to Hugging Face Spaces."
Prerequisites (from Step 1 assessment)
- Write graders that verify task completion
- Define reward logic for scoring
- Package using OpenEnv for automated evaluation
Install before April 1st:
| Tool | Requirement | Command |
|---|---|---|
| Python 3.10+ | Install 3.10, 3.11, or 3.12 | python --version |
| Git + GitHub account | Push your submission to GitHub or HF | git --version |
| Hugging Face CLI | Deploy to HF Spaces | pip install huggingface_hub && huggingface-cli login |
| OpenEnv | The framework | pip install openenv-core |
| Google Colab | Prep course runs in Colab (free tier works) | colab.research.google.com |
| Docker | Isolated container testing | docker --version |
| VS Code (Recommended) | Best Python + Docker support | – |
Step 1 Evaluation Criteria
| Criteria | Standard |
|---|---|
| Runtime correctness | Runs without errors |
| Interface compliance | Follows OpenEnv standard |
| Task design | Clear, realistic, testable |
| Grading logic | Reward system makes sense |
How to Submit
When Round 1 starts on 1 April:
Step 1 – Application Form: Choose your problem domain. The task is open-ended – build any real-world OpenEnv environment simulating a task a human would actually do.
Step 2 – Scaffold: Generate the project structure.
openenv init my_env
Step 3 – Build: Define your environment in the generated files.
Step 4 – Test locally:
uv run server
Step 5 – Deploy:
openenv push --repo-id your-username/my-env
Step 6 – Submit: Paste your HF Spaces URL on the platform before the deadline.
- Submission window opens 28th March
- Deadline: 8 April 2026, 11:59 PM IST
Note: Only team leaders can make the final submission.
Note: The Guide above references "4–5 problem statements" – this is outdated. Round 1 is open-ended. There is no fixed list of problem statements to choose from. Build any real-world environment simulating a task a human would actually do (e.g. email triage, code review, data cleaning). The requirements and evaluation criteria remain the same.
FAQs
How does the team/solo declaration work?
If you choose to compete solo, you will participate individually for Round 1.
If you form a team (2–3 members), only the Team Lead fills out the team formation form before the Round 1 assessment window opens and adds teammates using their registered email IDs. Once a team is confirmed, it cannot be changed.
Note: Since Round 2 is a 48-hour in-person hackathon, solo participants who qualify will be matched with other qualifying participants to form teams for the final round.
Who should fill the team form?
Only the team lead completes the team registration form. Teammates do not need to fill out anything at this stage. Once the Team Lead submits the form, listed members will receive an invite on their dashboards. The team will be reflected on their dashboards only after they accept the invite.
What if someone already added me to their team?
This will only happen once you accept their invite; your dashboard will then automatically update to reflect the team you have joined. After confirmation, you will not be able to switch to solo mode or join/form another team. Team assignments are permanent once confirmed.
Can I change my team or switch to solo after confirming?
No. Teams are permanent once confirmed, no changes are allowed. Solo declarations are locked for Round 1. A confirmation prompt is shown before submission, so please review carefully before proceeding.
Do I need to complete the prep course?
While not mandatory, it is strongly recommended.
What happens during Round 1?
You will select one problem statement from a set of challenges and build an RL environment using the OpenEnv framework.
Can I update my submission?
Yes. You may update your submission multiple times until the Round 1 deadline (5th April, 11:59 PM IST). Only the latest submission will be evaluated.
How are submissions evaluated?
Round 1 uses an LLM-based evaluator with structured rubrics. The finale includes LLM screening, manual review, and judging by Meta's global team. Evaluation criteria include runtime correctness, OpenEnv interface compliance, task design quality, grading logic, and overall code quality.
What framework must be used?
All environments must be built using the OpenEnv framework by Meta and Hugging Face.
What happens after Round 1?
Results will be announced on 10 April. The top 3,000 teams will advance to the Grand Finale, a 48-hour on-campus hackathon at Scaler School of Technology, Bangalore (25th–26th April).
What do I need to submit?
A public GitHub repository with your environment code, a requirements.txt, a demo script, and a README. A deployed Hugging Face Spaces URL showcasing your working demo.
Where can I get help?
Join the Discord community for announcements and support.
For account or registration issues, email: help_openenvhackathon@scaler.com
Support
Need help? Reach out to us: