scheduling_env / docs /hackathon-guide-rl-environments.md
Akshaykumarbm's picture
Upload folder using huggingface_hub
7bdbe90 verified
# Building RL Environments for Hackathon: Complete Guide
## Overview
This guide provides comprehensive insights for building real-world Reinforcement Learning (RL) environments using the OpenM (Open Environment) library for hackathon participation.
---
## 1. Fundamentals of Reinforcement Learning
### The Mechanism
- **How it Works:** Model generates candidate implementations (actions) → Environment verifies/tests → Environment provides reward signal (score) based on pre-defined rubrics
- **Purpose:** Tells the model what is good or bad through trial and error rather than long-context prompts
### Position in Training Pipeline
- Typically follows **Supervised Fine-Tuning (SFT)**
- Used to "squeeze out" final performance gains on specific capabilities
- More efficient alternative to "in-context learning" (which degrades with longer prompts)
### Key Challenges
#### Reward Hacking
- Models learn to "game" the verifier to get high scores without actually solving the task
- **Mitigation:** Inspect output trajectories or use multiple reward functions
#### Curriculum Learning
- Start with easy tasks and build complexity progressively
- Ensures model receives consistent reward signal
- Prevents "wasting compute" on tasks that are too difficult initially
---
## 2. Introduction to OpenM
### What is OpenM?
- Collaborative project between Meta, Hugging Face, and others
- Standardizes RL environments (like Hugging Face standardized language models)
- Single, consistent API for environments
- Interoperable with training frameworks (TRL, Unsloth, etc.)
### Core Components
Standard OpenM environment requires defining:
- **Actions** (as Pydantic objects)
- **Observations** (as Pydantic objects)
- **States** (as Pydantic objects)
---
## 3. Technical Implementation
### CLI Workflow
```bash
# Initialize skeleton environment
openm init
# Validate setup
openm validate
# Deploy to Hugging Face Spaces
openm push
```
### Agent Integration
- Use coding agents (like Codeex) with OpenM "skills"
- Automatically generate environment code from prompts
### Deployment
- Environments deployed as Docker containers on Hugging Face
- Provides web interface for manual testing and debugging
- **Important:** Dockerfile must be moved outside `/server` folder to main project directory
---
## 4. Hackathon Requirements
### Environment Quality
#### Real-World Focus (Critical)
- **Must build:** Real-world task environments (healthcare, email triage, code optimization)
- **Avoid:** "Toy" environments, games (Wordle, Connect 4, etc.)
- **Goal:** Environment that could realistically be used in model's post-training RL run
#### Complexity Requirements
- Map **long-running tasks** with multiple trajectories/routes
- Agent should have various possible approaches to solve the task
### Technical Requirements
#### Mandatory Inference Script
- **Required for every submission**
- Used by organizers to evaluate environment effectiveness
- Measures how well environment provides rewards to model
#### API Configuration
- **No OpenAI API key required**
- Use **Hugging Face token** instead
- Use provided **HF Router** (API base URL) for model calls
- HF Router handles model calls through Hugging Face
#### Docker Setup
- Move Dockerfile outside `/server` folder to main project directory
- Run `openm validate` before submission
### Reward Signal Design
#### Requirements
- Score typically between 0 and 1
- Must deliver valid signal indicating "good" or "bad" performance
- **Grading Diversity:** Must not return same score every time
- Should distinguish between different performance levels
#### Best Practices
- Start with achievable tasks for the model
- Ensure task is feasible but challenging
- Avoid tasks too difficult or out-of-distribution for the model
---
## 5. Grading Criteria
Evaluation based on:
1. **Utility of the Idea**
- How useful is the task for real-world AI?
- Does it represent authentic human tasks?
2. **Quality of the Grader**
- Returns diverse scores (not same score every time)
- Value between 0 and 1
- Distinguishes performance levels
3. **Technical Design**
- Environment architecture and implementation
- Successful execution of inference script
4. **Novelty**
- Key criterion for high scores
- Create something not thought of yet
- Solve problems in unique domains
- **Plagiarism is strictly prohibited**
---
## 6. Submission Guidelines
### Deadline
- **Round One:** April 8th
### Submission Process
- Push environment to **Hugging Face Spaces** using `openm push`
- Submit URL of Hugging Face Space
- Multiple submissions allowed (latest accurate submission used)
### Collaboration
- Teams are **highly encouraged**
- Helps manage technical and creative requirements
---
## 7. High-Value Environment Ideas
### Healthcare Domain
- Medical triage tools
- Navigating medical records
- Healthcare-specific software tool utilization
### Productivity and Operations
- **Email Triage:** Prioritize, categorize, respond to complex inbox
- **Calendar Management:** Coordinate schedules, handle conflicts across multiple participants
### Technical and Code Optimization
- **Kernel Optimization:** Benchmark and optimize PyTorch/GPU kernels for speed and efficiency
- **Repository Maintenance:** Navigate GitHub to identify/fix bugs, run test suites
### Logistics and Travel
- **Complex Flight Booking:** Navigate changing availability, multi-leg transfers, request missing information from users
### API and Tool Integration
- Wide set of real-world tools
- Interactive APIs that agents must learn to use correctly
---
## 8. Best Practices Summary
### Do's
- Focus on real-world utility
- Design long-running, multi-trajectory tasks
- Implement diverse grading systems
- Start with curriculum learning approach
- Validate thoroughly before submission
- Work in teams for better results
- Aim for novelty and uniqueness
### Don'ts
- Avoid toy environments or games
- Don't create tasks too difficult for models
- Don't implement single-score graders
- Avoid plagiarism
- Don't submit without testing inference script
- Don't use tasks without clear reward signals
---
## 9. Technical Checklist
- [ ] Initialize project with `openm init`
- [ ] Define Actions, Observations, States as Pydantic objects
- [ ] Implement diverse reward function (0-1 range)
- [ ] Create mandatory inference script
- [ ] Configure HF token and router (not OpenAI key)
- [ ] Move Dockerfile to main directory (outside /server)
- [ ] Run `openm validate` to verify setup
- [ ] Test environment locally
- [ ] Deploy with `openm push` to Hugging Face Spaces
- [ ] Submit Hugging Face Space URL before April 8th
---
## Resources
- **OpenM Library:** Standardized RL environment framework
- **Hugging Face Spaces:** Deployment platform
- **HF Router:** API for model access
- **Training Frameworks:** TRL, Unsloth (compatible with OpenM)
---
*This guide synthesizes best practices for building competitive RL environments for hackathons. Focus on real-world utility, technical excellence, and novel approaches for the best results.*