Spaces:
Sleeping
Sleeping
| # Building RL Environments for Hackathon: Complete Guide | |
| ## Overview | |
| This guide provides comprehensive insights for building real-world Reinforcement Learning (RL) environments using the OpenM (Open Environment) library for hackathon participation. | |
| --- | |
| ## 1. Fundamentals of Reinforcement Learning | |
| ### The Mechanism | |
| - **How it Works:** Model generates candidate implementations (actions) → Environment verifies/tests → Environment provides reward signal (score) based on pre-defined rubrics | |
| - **Purpose:** Tells the model what is good or bad through trial and error rather than long-context prompts | |
| ### Position in Training Pipeline | |
| - Typically follows **Supervised Fine-Tuning (SFT)** | |
| - Used to "squeeze out" final performance gains on specific capabilities | |
| - More efficient alternative to "in-context learning" (which degrades with longer prompts) | |
| ### Key Challenges | |
| #### Reward Hacking | |
| - Models learn to "game" the verifier to get high scores without actually solving the task | |
| - **Mitigation:** Inspect output trajectories or use multiple reward functions | |
| #### Curriculum Learning | |
| - Start with easy tasks and build complexity progressively | |
| - Ensures model receives consistent reward signal | |
| - Prevents "wasting compute" on tasks that are too difficult initially | |
| --- | |
| ## 2. Introduction to OpenM | |
| ### What is OpenM? | |
| - Collaborative project between Meta, Hugging Face, and others | |
| - Standardizes RL environments (like Hugging Face standardized language models) | |
| - Single, consistent API for environments | |
| - Interoperable with training frameworks (TRL, Unsloth, etc.) | |
| ### Core Components | |
| Standard OpenM environment requires defining: | |
| - **Actions** (as Pydantic objects) | |
| - **Observations** (as Pydantic objects) | |
| - **States** (as Pydantic objects) | |
| --- | |
| ## 3. Technical Implementation | |
| ### CLI Workflow | |
| ```bash | |
| # Initialize skeleton environment | |
| openm init | |
| # Validate setup | |
| openm validate | |
| # Deploy to Hugging Face Spaces | |
| openm push | |
| ``` | |
| ### Agent Integration | |
| - Use coding agents (like Codeex) with OpenM "skills" | |
| - Automatically generate environment code from prompts | |
| ### Deployment | |
| - Environments deployed as Docker containers on Hugging Face | |
| - Provides web interface for manual testing and debugging | |
| - **Important:** Dockerfile must be moved outside `/server` folder to main project directory | |
| --- | |
| ## 4. Hackathon Requirements | |
| ### Environment Quality | |
| #### Real-World Focus (Critical) | |
| - **Must build:** Real-world task environments (healthcare, email triage, code optimization) | |
| - **Avoid:** "Toy" environments, games (Wordle, Connect 4, etc.) | |
| - **Goal:** Environment that could realistically be used in model's post-training RL run | |
| #### Complexity Requirements | |
| - Map **long-running tasks** with multiple trajectories/routes | |
| - Agent should have various possible approaches to solve the task | |
| ### Technical Requirements | |
| #### Mandatory Inference Script | |
| - **Required for every submission** | |
| - Used by organizers to evaluate environment effectiveness | |
| - Measures how well environment provides rewards to model | |
| #### API Configuration | |
| - **No OpenAI API key required** | |
| - Use **Hugging Face token** instead | |
| - Use provided **HF Router** (API base URL) for model calls | |
| - HF Router handles model calls through Hugging Face | |
| #### Docker Setup | |
| - Move Dockerfile outside `/server` folder to main project directory | |
| - Run `openm validate` before submission | |
| ### Reward Signal Design | |
| #### Requirements | |
| - Score typically between 0 and 1 | |
| - Must deliver valid signal indicating "good" or "bad" performance | |
| - **Grading Diversity:** Must not return same score every time | |
| - Should distinguish between different performance levels | |
| #### Best Practices | |
| - Start with achievable tasks for the model | |
| - Ensure task is feasible but challenging | |
| - Avoid tasks too difficult or out-of-distribution for the model | |
| --- | |
| ## 5. Grading Criteria | |
| Evaluation based on: | |
| 1. **Utility of the Idea** | |
| - How useful is the task for real-world AI? | |
| - Does it represent authentic human tasks? | |
| 2. **Quality of the Grader** | |
| - Returns diverse scores (not same score every time) | |
| - Value between 0 and 1 | |
| - Distinguishes performance levels | |
| 3. **Technical Design** | |
| - Environment architecture and implementation | |
| - Successful execution of inference script | |
| 4. **Novelty** | |
| - Key criterion for high scores | |
| - Create something not thought of yet | |
| - Solve problems in unique domains | |
| - **Plagiarism is strictly prohibited** | |
| --- | |
| ## 6. Submission Guidelines | |
| ### Deadline | |
| - **Round One:** April 8th | |
| ### Submission Process | |
| - Push environment to **Hugging Face Spaces** using `openm push` | |
| - Submit URL of Hugging Face Space | |
| - Multiple submissions allowed (latest accurate submission used) | |
| ### Collaboration | |
| - Teams are **highly encouraged** | |
| - Helps manage technical and creative requirements | |
| --- | |
| ## 7. High-Value Environment Ideas | |
| ### Healthcare Domain | |
| - Medical triage tools | |
| - Navigating medical records | |
| - Healthcare-specific software tool utilization | |
| ### Productivity and Operations | |
| - **Email Triage:** Prioritize, categorize, respond to complex inbox | |
| - **Calendar Management:** Coordinate schedules, handle conflicts across multiple participants | |
| ### Technical and Code Optimization | |
| - **Kernel Optimization:** Benchmark and optimize PyTorch/GPU kernels for speed and efficiency | |
| - **Repository Maintenance:** Navigate GitHub to identify/fix bugs, run test suites | |
| ### Logistics and Travel | |
| - **Complex Flight Booking:** Navigate changing availability, multi-leg transfers, request missing information from users | |
| ### API and Tool Integration | |
| - Wide set of real-world tools | |
| - Interactive APIs that agents must learn to use correctly | |
| --- | |
| ## 8. Best Practices Summary | |
| ### Do's | |
| - Focus on real-world utility | |
| - Design long-running, multi-trajectory tasks | |
| - Implement diverse grading systems | |
| - Start with curriculum learning approach | |
| - Validate thoroughly before submission | |
| - Work in teams for better results | |
| - Aim for novelty and uniqueness | |
| ### Don'ts | |
| - Avoid toy environments or games | |
| - Don't create tasks too difficult for models | |
| - Don't implement single-score graders | |
| - Avoid plagiarism | |
| - Don't submit without testing inference script | |
| - Don't use tasks without clear reward signals | |
| --- | |
| ## 9. Technical Checklist | |
| - [ ] Initialize project with `openm init` | |
| - [ ] Define Actions, Observations, States as Pydantic objects | |
| - [ ] Implement diverse reward function (0-1 range) | |
| - [ ] Create mandatory inference script | |
| - [ ] Configure HF token and router (not OpenAI key) | |
| - [ ] Move Dockerfile to main directory (outside /server) | |
| - [ ] Run `openm validate` to verify setup | |
| - [ ] Test environment locally | |
| - [ ] Deploy with `openm push` to Hugging Face Spaces | |
| - [ ] Submit Hugging Face Space URL before April 8th | |
| --- | |
| ## Resources | |
| - **OpenM Library:** Standardized RL environment framework | |
| - **Hugging Face Spaces:** Deployment platform | |
| - **HF Router:** API for model access | |
| - **Training Frameworks:** TRL, Unsloth (compatible with OpenM) | |
| --- | |
| *This guide synthesizes best practices for building competitive RL environments for hackathons. Focus on real-world utility, technical excellence, and novel approaches for the best results.* | |