
Scaler School of Technology - Meta PyTorch Hackathon

OpenEnv Hackathon Dashboard

URL: https://www.scaler.com/school-of-technology/meta-pytorch-hackathon/dashboard#form


Timeline

| Stage | Dates |
| --- | --- |
| Registration | 14th March – 3rd April |
| Declaration | Before Round 1 |
| Prepare | Now – 25th March |
| Round 1 | 25th March – 8th April |
| Results | 10th April |
| Finals | 25th – 26th April |

Community

  • Discord: Join the Discord Community; all announcements, mentor access, and team matching happen there.

Participation

  • Currently registered as Solo Warrior
  • Locked for Round 1: cannot switch to a team until Round 1 is over.

Problem Statement

The Task

Build a complete, real-world OpenEnv environment that an AI agent can learn from through the standard step() / reset() / state() API.


Key Requirements at a Glance

  • Must simulate a real-world task (not games or toys)
  • Implement full OpenEnv spec: typed models, step()/reset()/state(), openenv.yaml
  • Minimum 3 tasks with agent graders (easy → medium → hard, scores 0.0–1.0)
  • Meaningful reward function with partial progress signals
  • Baseline inference script with reproducible scores
  • Deploy to Hugging Face Spaces + working Dockerfile
  • README with environment description, action/observation spaces, setup instructions

Detailed Requirements

Real-world task simulation

The environment must simulate a task humans actually do. Not games, not toys. Examples: email triage, code review, data cleaning, scheduling, customer support, content moderation.

OpenEnv spec compliance

Implement the full OpenEnv interface:

  • Typed Observation, Action, and Reward Pydantic models
  • step(action) → returns observation, reward, done, info
  • reset() → returns initial observation
  • state() → returns current state
  • openenv.yaml with metadata
  • Tested via openenv validate
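
As an illustrative sketch of that contract, here is a toy email-triage environment. The class name, fields, and reward values are assumptions, and stdlib dataclasses stand in for the Pydantic models the spec requires:

```python
from dataclasses import dataclass

# Illustrative types; the spec requires Pydantic models with your own fields.
@dataclass
class Observation:
    inbox: list       # e.g. pending emails in an email-triage environment
    step_count: int = 0

@dataclass
class Action:
    name: str         # e.g. "archive", "label", "reply"
    target: str = ""  # which item the action applies to

class EmailTriageEnv:
    """Minimal shape of the step()/reset()/state() contract."""

    def reset(self) -> Observation:
        self._emails = ["invoice", "spam", "meeting"]
        self._processed = set()
        self._steps = 0
        return Observation(inbox=list(self._emails))

    def step(self, action: Action):
        self._steps += 1
        reward = 0.0
        if (action.name == "archive"
                and action.target in self._emails
                and action.target not in self._processed):
            self._processed.add(action.target)
            reward = 1.0 / len(self._emails)  # partial-progress signal
        done = len(self._processed) == len(self._emails)
        obs = Observation(
            inbox=[e for e in self._emails if e not in self._processed],
            step_count=self._steps,
        )
        return obs, reward, done, {}

    def state(self) -> dict:
        return {"processed": sorted(self._processed), "steps": self._steps}
```

In the real environment the models would subclass pydantic.BaseModel and the server would expose these methods as HTTP endpoints.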

Minimum 3 tasks with agent graders

Each task defines a concrete objective an agent must accomplish, with a programmatic grader that scores performance (0.0–1.0). Tasks should range: easy → medium → hard. Graders must have clear, deterministic success/failure criteria.
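
As an illustration, a grader for a hypothetical email-triage task might score the fraction of correctly labeled emails. The function name and state shape are assumptions, not part of the spec:

```python
def grade_triage(final_state: dict, expected_labels: dict) -> float:
    """Deterministic grader: fraction of emails labeled correctly, in [0.0, 1.0].

    Both arguments map an email id to a label; these names are illustrative.
    """
    if not expected_labels:
        return 0.0
    correct = sum(
        1 for email_id, label in expected_labels.items()
        if final_state.get(email_id) == label
    )
    return round(correct / len(expected_labels), 4)
```

Because the score is a pure function of the final state, rerunning the grader on the same trajectory always yields the same number, which is what the determinism requirement asks for.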

Meaningful reward function

Provides signal over the full trajectory (not just binary end-of-episode). Rewards partial progress toward task completion. Penalizes clearly undesirable behavior (e.g. infinite loops, destructive actions).
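
One way to sketch such a reward; the thresholds and penalty values here are assumptions, not part of the spec:

```python
def shaped_reward(progress_before: float, progress_after: float,
                  action_was_destructive: bool, step_count: int,
                  max_steps: int = 50) -> float:
    """Illustrative shaped reward: pays for progress deltas, penalizes
    destructive actions, and caps runaway episodes."""
    reward = progress_after - progress_before  # partial-progress signal
    if action_was_destructive:
        reward -= 0.5                          # e.g. deleting needed data
    if step_count >= max_steps:
        reward -= 1.0                          # discourage infinite loops
    return reward
```

The key property is that the agent receives a nonzero signal on intermediate steps, not only at episode end.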

Baseline inference script

Uses the OpenAI API client to run a model against the environment. Reads API credentials from environment variables (OPENAI_API_KEY). Produces a reproducible baseline score on all 3 tasks.
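
A hedged sketch of that scaffolding, assuming the OpenAI Python client (openai>=1.0) and the environment variables named in this guide; the prompt format and the fallback defaults are placeholders:

```python
import os

def load_llm_config() -> dict:
    """Read credentials and endpoint settings from environment variables.

    OPENAI_API_KEY is required; API_BASE_URL and MODEL_NAME come from the
    Mandatory Additional Instructions. Defaults below are illustrative.
    """
    key = os.environ.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {
        "api_key": key,
        "base_url": os.environ.get("API_BASE_URL", "https://api.openai.com/v1"),
        "model": os.environ.get("MODEL_NAME", "gpt-4o-mini"),  # placeholder
    }

def run_baseline(env, client, model: str, max_steps: int = 20) -> float:
    """Skeleton episode loop: ask the model for an action, step the env,
    accumulate reward. `client` is an openai.OpenAI instance; the prompt
    shape is an assumption, not part of the spec."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": f"Observation: {obs}. Reply with one action."}],
        )
        action = resp.choices[0].message.content.strip()
        obs, reward, done, _ = env.step(action)
        total += reward
        if done:
            break
    return total
```

Running the loop once per task with a fixed seed and fixed model gives the reproducible baseline scores the requirement asks for.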


Non-Functional Requirements

Deploys to a Hugging Face Space

Environment must run as a containerized HF Space tagged with openenv. Must include a working Dockerfile. The environment should start cleanly with docker build + docker run.
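
A minimal local smoke test of the container, assuming the server listens on port 8000; the image name, port, and endpoint are placeholders to adjust for your project:

```shell
# Build the image and run it the way the HF Space will.
docker build -t my-openenv-env .
docker run -d --name env-smoke -p 8000:8000 my-openenv-env

# Give the server a moment to start, then check that it responds.
sleep 5
curl -sf http://localhost:8000/health || echo "server not responding yet"

# Clean up the test container.
docker rm -f env-smoke
```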

Documentation

README must include:

  • Environment description and motivation
  • Action and observation space definitions
  • Task descriptions with expected difficulty
  • Setup and usage instructions
  • Baseline scores

Evaluation Criteria

| Parameter | Weight | Description |
| --- | --- | --- |
| Real-world utility | 30% | Does the environment model a genuine task? Would someone actually use this to train or evaluate agents? |
| Task & grader quality | 25% | Are tasks well-defined with clear objectives? Do graders accurately and fairly measure success? Meaningful difficulty progression? |
| Environment design | 20% | Clean state management, sensible action/observation spaces, good reward shaping, proper episode boundaries |
| Code quality & spec compliance | 15% | Follows OpenEnv spec, clean project structure, typed models, documented, tested, Dockerfile works |
| Creativity & novelty | 10% | Novel problem domain, interesting mechanics, clever reward design, original approach |

Scoring Breakdown (Real-world utility)

  • 0–5: Toy/artificial problem with no practical application
  • 6–15: Valid domain but shallow modeling of the real task
  • 16–25: Good domain modeling, would be useful for agent evaluation
  • 26–30: Excellent; fills a real gap, immediate value for the RL/agent community

Scoring Checklist Questions

Task & grader quality:

  • 3+ tasks with difficulty range?
  • Graders produce scores between 0.0–1.0?
  • Graders deterministic and reproducible?
  • Hard task genuinely challenges frontier models?

Environment design:

  • reset() produces clean state?
  • Action/observation types well-designed and documented?
  • Reward function provides useful varying signal (not just sparse)?
  • Episode boundaries sensible?

Code quality & spec compliance:

  • openenv validate passes?
  • docker build && docker run works?
  • HF Space deploys and responds?
  • Baseline script runs and reproduces scores?
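
Several of these checks can be automated before submission. The helper below is an illustrative sketch, not part of the OpenEnv tooling; env_factory and the task tuple shape are assumptions:

```python
def self_check(env_factory, tasks) -> None:
    """Pre-submission sanity checks mirroring the scoring checklist.

    `env_factory` builds a fresh environment; `tasks` maps a task name to a
    (run_fn, grader) pair where run_fn drives one episode and returns the
    final state, and grader maps that state to a score in [0.0, 1.0].
    """
    assert len(tasks) >= 3, "need at least 3 tasks"
    for name, (run_fn, grader) in tasks.items():
        score = grader(run_fn(env_factory()))
        assert 0.0 <= score <= 1.0, f"{name}: score {score} outside [0.0, 1.0]"
        # Determinism: an identical second run must grade identically.
        assert grader(run_fn(env_factory())) == score, \
            f"{name}: grader not deterministic"
```

Running this after every change catches range violations and nondeterminism long before the automated gate does.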

Creativity & novelty:

  • Domain we haven't seen in OpenEnv before?
  • Reward design has interesting properties?
  • Clever mechanics that make the environment engaging?

How Judging Works

  • Phase 1 (Automated Validation): a pass/fail gate. Checks that the HF Space deploys, the OpenEnv spec is followed, the Dockerfile builds, the baseline reproduces, and there are 3+ tasks with graders.
  • Phase 2 (Agentic Evaluation): scored. The baseline agent is re-run, a standard open LLM agent (e.g. Nemotron 3 Super) is run against all environments, and score variance is checked.
  • Phase 3 (Human Review): top submissions are reviewed by Meta and Hugging Face engineers for real-world utility and creativity, and checked for exploits.

Disqualification Criteria

  • Environment does not deploy or respond
  • Plagiarized or trivially modified existing environments
  • Graders that always return the same score
  • No baseline inference script

Pre-Submission Checklist (all must pass or you're disqualified)

| Check | Requirement |
| --- | --- |
| HF Space deploys | Automated ping to the Space URL; must return 200 and respond to reset() |
| OpenEnv spec compliance | Validate openenv.yaml, typed models, and the step()/reset()/state() endpoints |
| Dockerfile builds | Automated docker build on the submitted repo |
| Baseline reproduces | Run the submitted inference script; must complete without error and produce scores |
| 3+ tasks with graders | Enumerate tasks, run each grader, verify scores fall in the 0.0–1.0 range |
| Infra restrictions | Inference script runtime under 20 min; must run on 2 vCPUs and 8 GB of memory |
| Validator | Run the pre-submission validation script before submitting |

Mandatory Additional Instructions

Before submitting, ensure the following variables are defined in your environment configuration:

| Variable | Description |
| --- | --- |
| API_BASE_URL | The API endpoint for the LLM |
| MODEL_NAME | The model identifier to use for inference |
| HF_TOKEN | Your Hugging Face / API key |

  • The inference script must be named inference.py and placed in the root directory of the project.
  • Participants must use the OpenAI Client for all LLM calls using the above variables.
  • Participants must emit structured stdout logs strictly following the [START], [STEP], and [END] format defined in the sample inference script. Any deviation in field names, ordering, or formatting will result in incorrect evaluation scoring. Refer to sample_inference.py for the complete format specification and examples.
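
The helper below only illustrates the general pattern of tagged, structured stdout lines. The payload fields shown are hypothetical; copy the exact field names, ordering, and formatting from sample_inference.py:

```python
import json
import sys

def log_event(tag: str, payload: dict) -> None:
    """Write one tagged, JSON-structured line to stdout and flush immediately,
    so the evaluator sees events as they happen."""
    sys.stdout.write(f"[{tag}] {json.dumps(payload)}\n")
    sys.stdout.flush()

# HYPOTHETICAL field names; take the real ones from sample_inference.py:
# log_event("START", {"task": "task_1"})
# log_event("STEP", {"step": 1, "action": "archive", "reward": 0.33})
# log_event("END", {"task": "task_1", "score": 1.0})
```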

Infra Restrictions

  • Runtime of the inference script must be less than 20 minutes.
  • Ensure your environment and inference can run on a machine with 2 vCPUs and 8 GB of memory.

Validator

Run the pre-submission validation script at pre_validate.sh before submitting.
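
Taken together with the spec and Dockerfile checks, the automated gate can be rehearsed locally before submitting (the image tag is a placeholder):

```shell
# Spec compliance, container build, and the official validator, in order.
openenv validate
docker build -t env-check .
bash pre_validate.sh
```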

Sample Inference Script

See sample_inference.py for the complete example, including the mandatory [START], [STEP], and [END] structured log format.


Submission

  • Submission window opens: 28th March
  • Deadline: 8 April 2026, 11:59 PM IST

Step 1

Choose solo or team before you can start the assessment.

Step 2

Complete Step 1 first. Problem Statement is live. Build and submit.


Study Material

4 modules · ~3.5 hours

Each module: read the README first, then open the notebook in Colab. No local setup needed.

Module 1 - Essential for Round 1 (45 min)

What you'll do: Connect to 3 real AI environments hosted online (an Echo bot, a Catch game, and Wordle) and interact with each using the exact same code pattern.

Module 2 - Essential for Round 1 (50 min)

What you'll do: Write 4 different game-playing strategies for a Catch game, run a competition between them, then switch to a completely different game using the same code.

Module 3 - Essential for Round 1 (45 min)

What you'll do: Clone an existing environment, modify it, run it on your machine, then deploy your version live to Hugging Face Spaces with one command.

Module 4 - Most Important for Round 1

What you'll do: Build a complete word-guessing game environment from scratch: define the rules, implement the logic, test it locally, and deploy it live. About 100 lines of real code.

  • View full course repository

Guide

What to Expect

Example of what a problem statement looks like:

"Build a mini-game RL environment with clearly defined tasks, automated graders, and deploy it live to Hugging Face Spaces."

Prerequisites (from Step 1 assessment)

  • Write graders that verify task completion
  • Define reward logic for scoring
  • Package using OpenEnv for automated evaluation

Install before April 1st:

| Tool | Requirement | Command |
| --- | --- | --- |
| Python 3.10+ | Install 3.10, 3.11, or 3.12 | python --version |
| Git + GitHub account | Push your submission to GitHub or HF | git --version |
| Hugging Face CLI | Deploy to HF Spaces | pip install huggingface_hub, then huggingface-cli login |
| OpenEnv | The framework | pip install openenv-core |
| Google Colab | Prep course runs in Colab (free tier works) | colab.research.google.com |
| Docker | Isolated container testing | docker --version |
| VS Code (Recommended) | Best Python + Docker support | |
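
The table above boils down to a short setup session; the versions the checks print should match the stated requirements:

```shell
# Verify interpreters and tools are present.
python --version   # expect 3.10, 3.11, or 3.12
git --version
docker --version

# Install the Hugging Face CLI and OpenEnv, then authenticate.
pip install huggingface_hub openenv-core
huggingface-cli login
```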

Step 1 Evaluation Criteria

| Criteria | Standard |
| --- | --- |
| Runtime correctness | Runs without errors |
| Interface compliance | Follows OpenEnv standard |
| Task design | Clear, realistic, testable |
| Grading logic | Reward system makes sense |

How to Submit

When Round 1 starts on 1 April:

Step 1 - Application Form: Choose your problem domain. The task is open-ended: build any real-world OpenEnv environment simulating a task a human would actually do.

Step 2 - Scaffold

openenv init my_env

Generate project structure.

Step 3 - Build: Define your environment in the generated files.

Step 4 - Test locally

uv run server

Step 5 - Deploy

openenv push --repo-id your-username/my-env

Step 6 - Submit: Paste your HF Spaces URL on the platform before the deadline.

  • Submission window opens 28th March
  • Deadline: 8 April 2026, 11:59 PM IST

Note: Only team leaders can make the final submission.

Note: The Guide above references "4–5 problem statements"; this is outdated. Round 1 is open-ended. There is no fixed list of problem statements to choose from. Build any real-world environment simulating a task a human would actually do (e.g. email triage, code review, data cleaning). The requirements and evaluation criteria remain the same.


FAQs

How does the team/solo declaration work?

If you choose to compete solo, you will participate individually for Round 1.

If you form a team (2–3 members), only the Team Lead fills out the team formation form before the Round 1 assessment window opens and adds teammates using their registered email IDs. Once a team is confirmed, it cannot be changed.

Note: Since Round 2 is a 48-hour in-person hackathon, solo participants who qualify will be matched with other qualifying participants to form teams for the final round.

Who should fill the team form?

Only the team lead completes the team registration form. Teammates do not need to fill out anything at this stage. Once the Team Lead submits the form, listed members will receive an invite on their dashboards. The team will be reflected on their dashboards only after they accept the invite.

What if someone already added me to their team?

You only join a team once you accept their invite; your dashboard will then automatically update to reflect the team you have joined. After confirmation, you will not be able to switch to solo mode or join/form another team. Team assignments are permanent once confirmed.

Can I change my team or switch to solo after confirming?

No. Teams are permanent once confirmed; no changes are allowed. Solo declarations are locked for Round 1. A confirmation prompt is shown before submission, so review carefully before proceeding.

Do I need to complete the prep course?

While not mandatory, it is strongly recommended.

What happens during Round 1?

You will build a real-world RL environment using the OpenEnv framework. Round 1 is open-ended, so there is no fixed set of problem statements to choose from.

Can I update my submission?

Yes. You may update your submission multiple times until the Round 1 deadline (8 April, 11:59 PM IST). Only the latest submission will be evaluated.

How are submissions evaluated?

Round 1 uses an LLM-based evaluator with structured rubrics. The finale includes LLM screening, manual review, and judging by Meta's global team. Evaluation criteria include runtime correctness, OpenEnv interface compliance, task design quality, grading logic, and overall code quality.

What framework must be used?

All environments must be built using the OpenEnv framework by Meta and Hugging Face.

What happens after Round 1?

Results will be announced on 10 April. The top 3,000 teams will advance to the Grand Finale, a 48-hour on-campus hackathon at Scaler School of Technology, Bangalore (25th–26th April).

What do I need to submit?

A public GitHub repository with your environment code, a requirements.txt, a demo script, and a README. A deployed Hugging Face Spaces URL showcasing your working demo.

Where can I get help?

Join the Discord community for announcements and support.

For account or registration issues, email: help_openenvhackathon@scaler.com


Support

Need help? Join the Discord community, or email help_openenvhackathon@scaler.com for account and registration issues.