scheduling_env / docs /hackathon-guide-rl-environments.md
Akshaykumarbm's picture
Upload folder using huggingface_hub
7bdbe90 verified

Building RL Environments for Hackathon: Complete Guide

Overview

This guide provides comprehensive insights for building real-world Reinforcement Learning (RL) environments using the OpenM (Open Environment) library for hackathon participation.


1. Fundamentals of Reinforcement Learning

The Mechanism

  • How it Works: Model generates candidate implementations (actions) → Environment verifies/tests → Environment provides reward signal (score) based on pre-defined rubrics
  • Purpose: Tells the model what is good or bad through trial and error rather than long-context prompts

Position in Training Pipeline

  • Typically follows Supervised Fine-Tuning (SFT)
  • Used to "squeeze out" final performance gains on specific capabilities
  • More efficient alternative to "in-context learning" (which degrades with longer prompts)

Key Challenges

Reward Hacking

  • Models learn to "game" the verifier to get high scores without actually solving the task
  • Mitigation: Inspect output trajectories or use multiple reward functions

Curriculum Learning

  • Start with easy tasks and build complexity progressively
  • Ensures model receives consistent reward signal
  • Prevents "wasting compute" on tasks that are too difficult initially

2. Introduction to OpenM

What is OpenM?

  • Collaborative project between Meta, Hugging Face, and others
  • Standardizes RL environments (like Hugging Face standardized language models)
  • Single, consistent API for environments
  • Interoperable with training frameworks (TRL, Unsloth, etc.)

Core Components

Standard OpenM environment requires defining:

  • Actions (as Pydantic objects)
  • Observations (as Pydantic objects)
  • States (as Pydantic objects)

3. Technical Implementation

CLI Workflow

# Initialize skeleton environment
openm init

# Validate setup
openm validate

# Deploy to Hugging Face Spaces
openm push

Agent Integration

  • Use coding agents (like Codeex) with OpenM "skills"
  • Automatically generate environment code from prompts

Deployment

  • Environments deployed as Docker containers on Hugging Face
  • Provides web interface for manual testing and debugging
  • Important: Dockerfile must be moved outside /server folder to main project directory

4. Hackathon Requirements

Environment Quality

Real-World Focus (Critical)

  • Must build: Real-world task environments (healthcare, email triage, code optimization)
  • Avoid: "Toy" environments, games (Wordle, Connect 4, etc.)
  • Goal: Environment that could realistically be used in model's post-training RL run

Complexity Requirements

  • Map long-running tasks with multiple trajectories/routes
  • Agent should have various possible approaches to solve the task

Technical Requirements

Mandatory Inference Script

  • Required for every submission
  • Used by organizers to evaluate environment effectiveness
  • Measures how well environment provides rewards to model

API Configuration

  • No OpenAI API key required
  • Use Hugging Face token instead
  • Use provided HF Router (API base URL) for model calls
  • HF Router handles model calls through Hugging Face

Docker Setup

  • Move Dockerfile outside /server folder to main project directory
  • Run openm validate before submission

Reward Signal Design

Requirements

  • Score typically between 0 and 1
  • Must deliver valid signal indicating "good" or "bad" performance
  • Grading Diversity: Must not return same score every time
  • Should distinguish between different performance levels

Best Practices

  • Start with achievable tasks for the model
  • Ensure task is feasible but challenging
  • Avoid tasks too difficult or out-of-distribution for the model

5. Grading Criteria

Evaluation based on:

  1. Utility of the Idea

    • How useful is the task for real-world AI?
    • Does it represent authentic human tasks?
  2. Quality of the Grader

    • Returns diverse scores (not same score every time)
    • Value between 0 and 1
    • Distinguishes performance levels
  3. Technical Design

    • Environment architecture and implementation
    • Successful execution of inference script
  4. Novelty

    • Key criterion for high scores
    • Create something not thought of yet
    • Solve problems in unique domains
    • Plagiarism is strictly prohibited

6. Submission Guidelines

Deadline

  • Round One: April 8th

Submission Process

  • Push environment to Hugging Face Spaces using openm push
  • Submit URL of Hugging Face Space
  • Multiple submissions allowed (latest accurate submission used)

Collaboration

  • Teams are highly encouraged
  • Helps manage technical and creative requirements

7. High-Value Environment Ideas

Healthcare Domain

  • Medical triage tools
  • Navigating medical records
  • Healthcare-specific software tool utilization

Productivity and Operations

  • Email Triage: Prioritize, categorize, respond to complex inbox
  • Calendar Management: Coordinate schedules, handle conflicts across multiple participants

Technical and Code Optimization

  • Kernel Optimization: Benchmark and optimize PyTorch/GPU kernels for speed and efficiency
  • Repository Maintenance: Navigate GitHub to identify/fix bugs, run test suites

Logistics and Travel

  • Complex Flight Booking: Navigate changing availability, multi-leg transfers, request missing information from users

API and Tool Integration

  • Wide set of real-world tools
  • Interactive APIs that agents must learn to use correctly

8. Best Practices Summary

Do's

  • Focus on real-world utility
  • Design long-running, multi-trajectory tasks
  • Implement diverse grading systems
  • Start with curriculum learning approach
  • Validate thoroughly before submission
  • Work in teams for better results
  • Aim for novelty and uniqueness

Don'ts

  • Avoid toy environments or games
  • Don't create tasks too difficult for models
  • Don't implement single-score graders
  • Avoid plagiarism
  • Don't submit without testing inference script
  • Don't use tasks without clear reward signals

9. Technical Checklist

  • Initialize project with openm init
  • Define Actions, Observations, States as Pydantic objects
  • Implement diverse reward function (0-1 range)
  • Create mandatory inference script
  • Configure HF token and router (not OpenAI key)
  • Move Dockerfile to main directory (outside /server)
  • Run openm validate to verify setup
  • Test environment locally
  • Deploy with openm push to Hugging Face Spaces
  • Submit Hugging Face Space URL before April 8th

Resources

  • OpenM Library: Standardized RL environment framework
  • Hugging Face Spaces: Deployment platform
  • HF Router: API for model access
  • Training Frameworks: TRL, Unsloth (compatible with OpenM)

This guide synthesizes best practices for building competitive RL environments for hackathons. Focus on real-world utility, technical excellence, and novel approaches for the best results.