Spaces:

Akshaykumarbm
/

scheduling_env

Sleeping

App Files Files Community

scheduling_env / docs /hackathon-guide-rl-environments.md

Akshaykumarbm

Upload folder using huggingface_hub

7bdbe90 verified about 1 month ago

preview code

raw

history blame contribute delete

7.12 kB

	# Building RL Environments for Hackathon: Complete Guide

	## Overview
	This guide provides comprehensive insights for building real-world Reinforcement Learning (RL) environments using the OpenM (Open Environment) library for hackathon participation.

	---

	## 1. Fundamentals of Reinforcement Learning

	### The Mechanism
	- How it Works: Model generates candidate implementations (actions) → Environment verifies/tests → Environment provides reward signal (score) based on pre-defined rubrics
	- Purpose: Tells the model what is good or bad through trial and error rather than long-context prompts

	### Position in Training Pipeline
	- Typically follows Supervised Fine-Tuning (SFT)
	- Used to "squeeze out" final performance gains on specific capabilities
	- More efficient alternative to "in-context learning" (which degrades with longer prompts)

	### Key Challenges

	#### Reward Hacking
	- Models learn to "game" the verifier to get high scores without actually solving the task
	- Mitigation: Inspect output trajectories or use multiple reward functions

	#### Curriculum Learning
	- Start with easy tasks and build complexity progressively
	- Ensures model receives consistent reward signal
	- Prevents "wasting compute" on tasks that are too difficult initially

	---

	## 2. Introduction to OpenM

	### What is OpenM?
	- Collaborative project between Meta, Hugging Face, and others
	- Standardizes RL environments (like Hugging Face standardized language models)
	- Single, consistent API for environments
	- Interoperable with training frameworks (TRL, Unsloth, etc.)

	### Core Components
	Standard OpenM environment requires defining:
	- Actions (as Pydantic objects)
	- Observations (as Pydantic objects)
	- States (as Pydantic objects)

	---

	## 3. Technical Implementation

	### CLI Workflow
	```bash
	# Initialize skeleton environment
	openm init

	# Validate setup
	openm validate

	# Deploy to Hugging Face Spaces
	openm push
	```

	### Agent Integration
	- Use coding agents (like Codeex) with OpenM "skills"
	- Automatically generate environment code from prompts

	### Deployment
	- Environments deployed as Docker containers on Hugging Face
	- Provides web interface for manual testing and debugging
	- Important: Dockerfile must be moved outside `/server` folder to main project directory

	---

	## 4. Hackathon Requirements

	### Environment Quality

	#### Real-World Focus (Critical)
	- Must build: Real-world task environments (healthcare, email triage, code optimization)
	- Avoid: "Toy" environments, games (Wordle, Connect 4, etc.)
	- Goal: Environment that could realistically be used in model's post-training RL run

	#### Complexity Requirements
	- Map long-running tasks with multiple trajectories/routes
	- Agent should have various possible approaches to solve the task

	### Technical Requirements

	#### Mandatory Inference Script
	- Required for every submission
	- Used by organizers to evaluate environment effectiveness
	- Measures how well environment provides rewards to model

	#### API Configuration
	- No OpenAI API key required
	- Use Hugging Face token instead
	- Use provided HF Router (API base URL) for model calls
	- HF Router handles model calls through Hugging Face

	#### Docker Setup
	- Move Dockerfile outside `/server` folder to main project directory
	- Run `openm validate` before submission

	### Reward Signal Design

	#### Requirements
	- Score typically between 0 and 1
	- Must deliver valid signal indicating "good" or "bad" performance
	- Grading Diversity: Must not return same score every time
	- Should distinguish between different performance levels

	#### Best Practices
	- Start with achievable tasks for the model
	- Ensure task is feasible but challenging
	- Avoid tasks too difficult or out-of-distribution for the model

	---

	## 5. Grading Criteria

	Evaluation based on:

	1. Utility of the Idea
	- How useful is the task for real-world AI?
	- Does it represent authentic human tasks?

	2. Quality of the Grader
	- Returns diverse scores (not same score every time)
	- Value between 0 and 1
	- Distinguishes performance levels

	3. Technical Design
	- Environment architecture and implementation
	- Successful execution of inference script

	4. Novelty
	- Key criterion for high scores
	- Create something not thought of yet
	- Solve problems in unique domains
	- Plagiarism is strictly prohibited

	---

	## 6. Submission Guidelines

	### Deadline
	- Round One: April 8th

	### Submission Process
	- Push environment to Hugging Face Spaces using `openm push`
	- Submit URL of Hugging Face Space
	- Multiple submissions allowed (latest accurate submission used)

	### Collaboration
	- Teams are highly encouraged
	- Helps manage technical and creative requirements

	---

	## 7. High-Value Environment Ideas

	### Healthcare Domain
	- Medical triage tools
	- Navigating medical records
	- Healthcare-specific software tool utilization

	### Productivity and Operations
	- Email Triage: Prioritize, categorize, respond to complex inbox
	- Calendar Management: Coordinate schedules, handle conflicts across multiple participants

	### Technical and Code Optimization
	- Kernel Optimization: Benchmark and optimize PyTorch/GPU kernels for speed and efficiency
	- Repository Maintenance: Navigate GitHub to identify/fix bugs, run test suites

	### Logistics and Travel
	- Complex Flight Booking: Navigate changing availability, multi-leg transfers, request missing information from users

	### API and Tool Integration
	- Wide set of real-world tools
	- Interactive APIs that agents must learn to use correctly

	---

	## 8. Best Practices Summary

	### Do's
	- Focus on real-world utility
	- Design long-running, multi-trajectory tasks
	- Implement diverse grading systems
	- Start with curriculum learning approach
	- Validate thoroughly before submission
	- Work in teams for better results
	- Aim for novelty and uniqueness

	### Don'ts
	- Avoid toy environments or games
	- Don't create tasks too difficult for models
	- Don't implement single-score graders
	- Avoid plagiarism
	- Don't submit without testing inference script
	- Don't use tasks without clear reward signals

	---

	## 9. Technical Checklist

	- [ ] Initialize project with `openm init`
	- [ ] Define Actions, Observations, States as Pydantic objects
	- [ ] Implement diverse reward function (0-1 range)
	- [ ] Create mandatory inference script
	- [ ] Configure HF token and router (not OpenAI key)
	- [ ] Move Dockerfile to main directory (outside /server)
	- [ ] Run `openm validate` to verify setup
	- [ ] Test environment locally
	- [ ] Deploy with `openm push` to Hugging Face Spaces
	- [ ] Submit Hugging Face Space URL before April 8th

	---

	## Resources

	- OpenM Library: Standardized RL environment framework
	- Hugging Face Spaces: Deployment platform
	- HF Router: API for model access
	- Training Frameworks: TRL, Unsloth (compatible with OpenM)

	---

	This guide synthesizes best practices for building competitive RL environments for hackathons. Focus on real-world utility, technical excellence, and novel approaches for the best results.