
Scaler School of Technology - Meta PyTorch Hackathon

OpenEnv Hackathon Dashboard

URL: https://www.scaler.com/school-of-technology/meta-pytorch-hackathon/dashboard#form


Timeline

| Stage | Dates |
| --- | --- |
| Registration | 14th March – 3rd April |
| Declaration | Before Round 1 |
| Prepare | Now – 25th March |
| Round 1 | 25th March – 8th April |
| Results | 10th April |
| Finals | 25th – 26th April |

Community

  • Discord: Join the Discord Community; all announcements, mentor access, and team matching happen there.

Participation

  • Currently registered as Solo Warrior
  • Locked for Round 1: cannot switch to a team until Round 1 is over.

Problem Statement

The Task

Build a complete, real-world OpenEnv environment that an AI agent can learn from through the standard step() / reset() / state() API.


Key Requirements at a Glance

  • Must simulate a real-world task (not games or toys)
  • Implement full OpenEnv spec: typed models, step()/reset()/state(), openenv.yaml
  • Minimum 3 tasks with agent graders (easy → medium → hard, scores 0.0–1.0)
  • Meaningful reward function with partial progress signals
  • Baseline inference script with reproducible scores
  • Deploy to Hugging Face Spaces + working Dockerfile
  • README with environment description, action/observation spaces, setup instructions

Detailed Requirements

Real-world task simulation

The environment must simulate a task humans actually do. Not games, not toys. Examples: email triage, code review, data cleaning, scheduling, customer support, content moderation.

OpenEnv spec compliance

Implement the full OpenEnv interface:

  • Typed Observation, Action, and Reward Pydantic models
  • step(action) → returns observation, reward, done, info
  • reset() → returns initial observation
  • state() → returns current state
  • openenv.yaml with metadata
  • Tested via openenv validate
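
As an illustrative sketch of that contract, here is a toy email-triage environment. The class name, fields, and reward values are assumptions, and stdlib dataclasses stand in for the Pydantic models the spec requires:

```python
from dataclasses import dataclass

# Illustrative types; the spec requires Pydantic models with your own fields.
@dataclass
class Observation:
    inbox: list       # e.g. pending emails in an email-triage environment
    step_count: int = 0

@dataclass
class Action:
    name: str         # e.g. "archive", "label", "reply"
    target: str = ""  # which item the action applies to

class EmailTriageEnv:
    """Minimal shape of the step()/reset()/state() contract."""

    def reset(self) -> Observation:
        self._emails = ["invoice", "spam", "meeting"]
        self._processed = set()
        self._steps = 0
        return Observation(inbox=list(self._emails))

    def step(self, action: Action):
        self._steps += 1
        reward = 0.0
        if (action.name == "archive"
                and action.target in self._emails
                and action.target not in self._processed):
            self._processed.add(action.target)
            reward = 1.0 / len(self._emails)  # partial-progress signal
        done = len(self._processed) == len(self._emails)
        obs = Observation(
            inbox=[e for e in self._emails if e not in self._processed],
            step_count=self._steps,
        )
        return obs, reward, done, {}

    def state(self) -> dict:
        return {"processed": sorted(self._processed), "steps": self._steps}
```

In the real environment the models would subclass pydantic.BaseModel and the server would expose these methods as HTTP endpoints.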

Minimum 3 tasks with agent graders

Each task defines a concrete objective an agent must accomplish, with a programmatic grader that scores performance (0.0–1.0). Tasks should range: easy → medium → hard. Graders must have clear, deterministic success/failure criteria.
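
As an illustration, a grader for a hypothetical email-triage task might score the fraction of correctly labeled emails. The function name and state shape are assumptions, not part of the spec:

```python
def grade_triage(final_state: dict, expected_labels: dict) -> float:
    """Deterministic grader: fraction of emails labeled correctly, in [0.0, 1.0].

    Both arguments map an email id to a label; these names are illustrative.
    """
    if not expected_labels:
        return 0.0
    correct = sum(
        1 for email_id, label in expected_labels.items()
        if final_state.get(email_id) == label
    )
    return round(correct / len(expected_labels), 4)
```

Because the score is a pure function of the final state, rerunning the grader on the same trajectory always yields the same number, which is what the determinism requirement asks for.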

Meaningful reward function

Provides signal over the full trajectory (not just binary end-of-episode). Rewards partial progress toward task completion. Penalizes clearly undesirable behavior (e.g. infinite loops, destructive actions).
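
One way to sketch such a reward; the thresholds and penalty values here are assumptions, not part of the spec:

```python
def shaped_reward(progress_before: float, progress_after: float,
                  action_was_destructive: bool, step_count: int,
                  max_steps: int = 50) -> float:
    """Illustrative shaped reward: pays for progress deltas, penalizes
    destructive actions, and caps runaway episodes."""
    reward = progress_after - progress_before  # partial-progress signal
    if action_was_destructive:
        reward -= 0.5                          # e.g. deleting needed data
    if step_count >= max_steps:
        reward -= 1.0                          # discourage infinite loops
    return reward
```

The key property is that the agent receives a nonzero signal on intermediate steps, not only at episode end.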

Baseline inference script

Uses the OpenAI API client to run a model against the environment. Reads API credentials from environment variables (OPENAI_API_KEY). Produces a reproducible baseline score on all 3 tasks.
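
A hedged sketch of that scaffolding, assuming the OpenAI Python client (openai>=1.0) and the environment variables named in this guide; the prompt format and the fallback defaults are placeholders:

```python
import os

def load_llm_config() -> dict:
    """Read credentials and endpoint settings from environment variables.

    OPENAI_API_KEY is required; API_BASE_URL and MODEL_NAME come from the
    Mandatory Additional Instructions. Defaults below are illustrative.
    """
    key = os.environ.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {
        "api_key": key,
        "base_url": os.environ.get("API_BASE_URL", "https://api.openai.com/v1"),
        "model": os.environ.get("MODEL_NAME", "gpt-4o-mini"),  # placeholder
    }

def run_baseline(env, client, model: str, max_steps: int = 20) -> float:
    """Skeleton episode loop: ask the model for an action, step the env,
    accumulate reward. `client` is an openai.OpenAI instance; the prompt
    shape is an assumption, not part of the spec."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": f"Observation: {obs}. Reply with one action."}],
        )
        action = resp.choices[0].message.content.strip()
        obs, reward, done, _ = env.step(action)
        total += reward
        if done:
            break
    return total
```

Running the loop once per task with a fixed seed and fixed model gives the reproducible baseline scores the requirement asks for.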


Non-Functional Requirements

Deploys to a Hugging Face Space

Environment must run as a containerized HF Space tagged with openenv. Must include a working Dockerfile. The environment should start cleanly with docker build + docker run.
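
A minimal local smoke test of the container, assuming the server listens on port 8000; the image name, port, and endpoint are placeholders to adjust for your project:

```shell
# Build the image and run it the way the HF Space will.
docker build -t my-openenv-env .
docker run -d --name env-smoke -p 8000:8000 my-openenv-env

# Give the server a moment to start, then check that it responds.
sleep 5
curl -sf http://localhost:8000/health || echo "server not responding yet"

# Clean up the test container.
docker rm -f env-smoke
```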

Documentation

README must include:

  • Environment description and motivation
  • Action and observation space definitions
  • Task descriptions with expected difficulty
  • Setup and usage instructions
  • Baseline scores

Evaluation Criteria

| Parameter | Weight | Description |
| --- | --- | --- |
| Real-world utility | 30% | Does the environment model a genuine task? Would someone actually use this to train or evaluate agents? |
| Task & grader quality | 25% | Are tasks well-defined with clear objectives? Do graders accurately and fairly measure success? Meaningful difficulty progression? |
| Environment design | 20% | Clean state management, sensible action/observation spaces, good reward shaping, proper episode boundaries |
| Code quality & spec compliance | 15% | Follows OpenEnv spec, clean project structure, typed models, documented, tested, Dockerfile works |
| Creativity & novelty | 10% | Novel problem domain, interesting mechanics, clever reward design, original approach |

Scoring Breakdown (Real-world utility)

  • 0–5: Toy/artificial problem with no practical application
  • 6–15: Valid domain but shallow modeling of the real task
  • 16–25: Good domain modeling, would be useful for agent evaluation
  • 26–30: Excellent; fills a real gap, immediate value for the RL/agent community

Scoring Checklist Questions

Task & grader quality:

  • 3+ tasks with difficulty range?
  • Graders produce scores between 0.0–1.0?
  • Graders deterministic and reproducible?
  • Hard task genuinely challenges frontier models?

Environment design:

  • reset() produces clean state?
  • Action/observation types well-designed and documented?
  • Reward function provides useful varying signal (not just sparse)?
  • Episode boundaries sensible?

Code quality & spec compliance:

  • openenv validate passes?
  • docker build && docker run works?
  • HF Space deploys and responds?
  • Baseline script runs and reproduces scores?
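
Several of these checks can be automated before submission. The helper below is an illustrative sketch, not part of the OpenEnv tooling; env_factory and the task tuple shape are assumptions:

```python
def self_check(env_factory, tasks) -> None:
    """Pre-submission sanity checks mirroring the scoring checklist.

    `env_factory` builds a fresh environment; `tasks` maps a task name to a
    (run_fn, grader) pair where run_fn drives one episode and returns the
    final state, and grader maps that state to a score in [0.0, 1.0].
    """
    assert len(tasks) >= 3, "need at least 3 tasks"
    for name, (run_fn, grader) in tasks.items():
        score = grader(run_fn(env_factory()))
        assert 0.0 <= score <= 1.0, f"{name}: score {score} outside [0.0, 1.0]"
        # Determinism: an identical second run must grade identically.
        assert grader(run_fn(env_factory())) == score, \
            f"{name}: grader not deterministic"
```

Running this after every change catches range violations and nondeterminism long before the automated gate does.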

Creativity & novelty:

  • Domain we haven't seen in OpenEnv before?
  • Reward design has interesting properties?
  • Clever mechanics that make the environment engaging?

How Judging Works

  • Phase 1 (Automated Validation): a pass/fail gate. Checks that the HF Space deploys, the OpenEnv spec is followed, the Dockerfile builds, the baseline reproduces, and there are 3+ tasks with graders.
  • Phase 2 (Agentic Evaluation): scored. The baseline agent is re-run, a standard open LLM agent (e.g. Nemotron 3 Super) is run against all environments, and score variance is checked.
  • Phase 3 (Human Review): top submissions are reviewed by Meta and Hugging Face engineers for real-world utility and creativity, and checked for exploits.

Disqualification Criteria

  • Environment does not deploy or respond
  • Plagiarized or trivially modified existing environments
  • Graders that always return the same score
  • No baseline inference script

Pre-Submission Checklist (all must pass or you're disqualified)

| Check | Requirement |
| --- | --- |
| HF Space deploys | Automated ping to the Space URL; must return 200 and respond to reset() |
| OpenEnv spec compliance | Validate openenv.yaml, typed models, and the step()/reset()/state() endpoints |
| Dockerfile builds | Automated docker build on the submitted repo |
| Baseline reproduces | Run the submitted inference script; must complete without error and produce scores |
| 3+ tasks with graders | Enumerate tasks, run each grader, verify scores fall in the 0.0–1.0 range |
| Infra restrictions | Inference script runtime under 20 min; must run on 2 vCPUs and 8 GB of memory |
| Validator | Run the pre-submission validation script before submitting |

Mandatory Additional Instructions

Before submitting, ensure the following variables are defined in your environment configuration:

| Variable | Description |
| --- | --- |
| API_BASE_URL | The API endpoint for the LLM |
| MODEL_NAME | The model identifier to use for inference |
| HF_TOKEN | Your Hugging Face / API key |

  • The inference script must be named inference.py and placed in the root directory of the project.
  • Participants must use the OpenAI Client for all LLM calls using the above variables.
  • Participants must emit structured stdout logs strictly following the [START], [STEP], and [END] format defined in the sample inference script. Any deviation in field names, ordering, or formatting will result in incorrect evaluation scoring. Refer to sample_inference.py for the complete format specification and examples.
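
The helper below only illustrates the general pattern of tagged, structured stdout lines. The payload fields shown are hypothetical; copy the exact field names, ordering, and formatting from sample_inference.py:

```python
import json
import sys

def log_event(tag: str, payload: dict) -> None:
    """Write one tagged, JSON-structured line to stdout and flush immediately,
    so the evaluator sees events as they happen."""
    sys.stdout.write(f"[{tag}] {json.dumps(payload)}\n")
    sys.stdout.flush()

# HYPOTHETICAL field names; take the real ones from sample_inference.py:
# log_event("START", {"task": "task_1"})
# log_event("STEP", {"step": 1, "action": "archive", "reward": 0.33})
# log_event("END", {"task": "task_1", "score": 1.0})
```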

Infra Restrictions

  • Runtime of the inference script must be less than 20 minutes.
  • Ensure your environment and inference can run on a machine with 2 vCPUs and 8 GB of memory.

Validator

Run the pre-submission validation script at pre_validate.sh before submitting.
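
Taken together with the spec and Dockerfile checks, the automated gate can be rehearsed locally before submitting (the image tag is a placeholder):

```shell
# Spec compliance, container build, and the official validator, in order.
openenv validate
docker build -t env-check .
bash pre_validate.sh
```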

Sample Inference Script

See sample_inference.py for the complete example, including the mandatory [START], [STEP], and [END] structured log format.


Submission

  • Submission window opens: 28th March
  • Deadline: 8 April 2026, 11:59 PM IST

Step 1

Choose solo or team before you can start the assessment.

Step 2

Complete Step 1 first. Problem Statement is live. Build and submit.


Study Material

4 modules · ~3.5 hours

Each module: read the README first, then open the notebook in Colab. No local setup needed.

Module 1 - Essential for Round 1 (45 min)

What you'll do: Connect to 3 real AI environments hosted online (an Echo bot, a Catch game, and Wordle) and interact with each using the exact same code pattern.

Module 2 - Essential for Round 1 (50 min)

What you'll do: Write 4 different game-playing strategies for a Catch game, run a competition between them, then switch to a completely different game using the same code.

Module 3 - Essential for Round 1 (45 min)

What you'll do: Clone an existing environment, modify it, run it on your machine, then deploy your version live to Hugging Face Spaces with one command.

Module 4 - Most Important for Round 1

What you'll do: Build a complete word-guessing game environment from scratch: define the rules, implement the logic, test it locally, and deploy it live. About 100 lines of real code.

  • View full course repository

Guide

What to Expect

Example of what a problem statement looks like:

"Build a mini-game RL environment with clearly defined tasks, automated graders, and deploy it live to Hugging Face Spaces."

Prerequisites (from Step 1 assessment)

  • Write graders that verify task completion
  • Define reward logic for scoring
  • Package using OpenEnv for automated evaluation

Install before April 1st:

| Tool | Requirement | Command |
| --- | --- | --- |
| Python 3.10+ | Install 3.10, 3.11, or 3.12 | python --version |
| Git + GitHub account | Push your submission to GitHub or HF | git --version |
| Hugging Face CLI | Deploy to HF Spaces | pip install huggingface_hub, then huggingface-cli login |
| OpenEnv | The framework | pip install openenv-core |
| Google Colab | Prep course runs in Colab (free tier works) | colab.research.google.com |
| Docker | Isolated container testing | docker --version |
| VS Code (Recommended) | Best Python + Docker support | |
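
The table above boils down to a short setup session; the versions the checks print should match the stated requirements:

```shell
# Verify interpreters and tools are present.
python --version   # expect 3.10, 3.11, or 3.12
git --version
docker --version

# Install the Hugging Face CLI and OpenEnv, then authenticate.
pip install huggingface_hub openenv-core
huggingface-cli login
```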

Step 1 Evaluation Criteria

| Criteria | Standard |
| --- | --- |
| Runtime correctness | Runs without errors |
| Interface compliance | Follows OpenEnv standard |
| Task design | Clear, realistic, testable |
| Grading logic | Reward system makes sense |

How to Submit

When Round 1 starts on 1 April:

Step 1 - Application Form: Choose your problem domain. The task is open-ended: build any real-world OpenEnv environment simulating a task a human would actually do.

Step 2 - Scaffold

openenv init my_env

Generate project structure.

Step 3 - Build: Define your environment in the generated files.

Step 4 - Test locally

uv run server

Step 5 - Deploy

openenv push --repo-id your-username/my-env

Step 6 - Submit: Paste your HF Spaces URL on the platform before the deadline.

  • Submission window opens 28th March
  • Deadline: 8 April 2026, 11:59 PM IST

Note: Only team leaders can make the final submission.

Note: The Guide above references "4–5 problem statements"; this is outdated. Round 1 is open-ended. There is no fixed list of problem statements to choose from. Build any real-world environment simulating a task a human would actually do (e.g. email triage, code review, data cleaning). The requirements and evaluation criteria remain the same.


FAQs

How does the team/solo declaration work?

If you choose to compete solo, you will participate individually for Round 1.

If you form a team (2–3 members), only the Team Lead fills out the team formation form before the Round 1 assessment window opens and adds teammates using their registered email IDs. Once a team is confirmed, it cannot be changed.

Note: Since Round 2 is a 48-hour in-person hackathon, solo participants who qualify will be matched with other qualifying participants to form teams for the final round.

Who should fill the team form?

Only the team lead completes the team registration form. Teammates do not need to fill out anything at this stage. Once the Team Lead submits the form, listed members will receive an invite on their dashboards. The team will be reflected on their dashboards only after they accept the invite.

What if someone already added me to their team?

You only join a team once you accept their invite; your dashboard will then automatically update to reflect the team you have joined. After confirmation, you will not be able to switch to solo mode or join/form another team. Team assignments are permanent once confirmed.

Can I change my team or switch to solo after confirming?

No. Teams are permanent once confirmed; no changes are allowed. Solo declarations are locked for Round 1. A confirmation prompt is shown before submission, so review carefully before proceeding.

Do I need to complete the prep course?

While not mandatory, it is strongly recommended.

What happens during Round 1?

You will build a real-world RL environment using the OpenEnv framework. Round 1 is open-ended, so there is no fixed set of problem statements to choose from.

Can I update my submission?

Yes. You may update your submission multiple times until the Round 1 deadline (8 April, 11:59 PM IST). Only the latest submission will be evaluated.

How are submissions evaluated?

Round 1 uses an LLM-based evaluator with structured rubrics. The finale includes LLM screening, manual review, and judging by Meta's global team. Evaluation criteria include runtime correctness, OpenEnv interface compliance, task design quality, grading logic, and overall code quality.

What framework must be used?

All environments must be built using the OpenEnv framework by Meta and Hugging Face.

What happens after Round 1?

Results will be announced on 10 April. The top 3,000 teams will advance to the Grand Finale, a 48-hour on-campus hackathon at Scaler School of Technology, Bangalore (25th–26th April).

What do I need to submit?

A public GitHub repository with your environment code, a requirements.txt, a demo script, and a README. A deployed Hugging Face Spaces URL showcasing your working demo.

Where can I get help?

Join the Discord community for announcements and support.

For account or registration issues, email: help_openenvhackathon@scaler.com


Support

Need help? Join the Discord community, or email help_openenvhackathon@scaler.com for account and registration issues.