NeoCodes-dev committed
Commit 43cbf31 · verified · Parent: 32247b2

Upload folder using huggingface_hub

Dockerfile ADDED
@@ -0,0 +1,37 @@
FROM python:3.11-slim

WORKDIR /app

# System deps: git for cloning repos, grep for search
RUN apt-get update && apt-get install -y \
    git \
    grep \
    && rm -rf /var/lib/apt/lists/*

# Copy project files
COPY pyproject.toml .
COPY openenv.yaml .
COPY rlm_forge/ rlm_forge/

# Install Python deps
RUN pip install --no-cache-dir -e .

# Pre-install common test dependencies for target repos
RUN pip install --no-cache-dir pytest text-unidecode freezegun

# AMENDMENT 2: Pre-clone curated repos to avoid network I/O on every reset()
RUN mkdir -p /app/repos && \
    git clone --depth=1 https://github.com/un33k/python-slugify /app/repos/python-slugify && \
    git clone --depth=1 https://github.com/python-humanize/humanize /app/repos/humanize

# Install curated repo dependencies
RUN pip install --no-cache-dir -e /app/repos/python-slugify || true
RUN pip install --no-cache-dir -e /app/repos/humanize || true

EXPOSE 8000

ENV PYTHONUNBUFFERED=1
ENV RLM_FORGE_PRE_CLONED_DIR=/app/repos

ENV ENABLE_WEB_INTERFACE=true
CMD ["python", "-m", "uvicorn", "rlm_forge.server.app:app", "--host", "0.0.0.0", "--port", "8000"]
Hackathon.md ADDED
@@ -0,0 +1,108 @@
## Rules

- Your project **must** use OpenEnv (stable release 0.2.1) deployed on HF Spaces

- You must show a minimal training script for your environment using Unsloth or HF TRL in Colab.

- You must upload a **one minute** demo video to YouTube talking about your submission.

## Hackathon Problem Statements

Your project must address at least **one of the five** required problem statements.

- Some problem statements include **optional partner-sponsored sub-problem statements**, which are additional focus areas related to the main theme.

- Your project may align with **multiple partner sub-problem statements**, but you can only be **judged for a maximum of two**. Please **select up to two** when submitting.

- Projects that match these partner sub-problem statements are eligible for **extra partner prizes**, judged separately from the main track winners.

- Each partner sub-problem statement carries a prize of **$10,000 USD**.

**Statement 1: Multi-Agent Interactions**

Environments for this theme involve cooperation, competition, negotiation, and coalition formation. Learning from these environments will enable agents to model the beliefs and incentives of others in partially observable settings. This drives theory-of-mind reasoning and emergent strategic behavior.

- **Expected Outcome:** an environment that can be used to train multi-agent task handling in an LLM

- **Example Environments:** Market simulations, compute-allocation negotiations, collaborative puzzle worlds, mixed cooperative/competitive strategy games.

- **Partner Sub-Themes:**

  - **Fleet AI:** Scalable Oversight: Environments that train oversight agents to monitor, analyze, and explain the behavior of other AI agents operating in complex, multi-agent settings.
  - **Halluminate:** Multi-Actor Environments: Build a realistic environment where an agent interacts with and manages multiple actors (agents) to discover and achieve the task.

**Statement 2: (Super) Long-Horizon Planning & Instruction Following**

You will build environments that require deep, multi-step reasoning with sparse or delayed rewards. The goal is for agents trained in these environments to decompose goals, track state over extended trajectories, and recover from early mistakes. The aim is to push beyond shallow next-token reasoning toward structured planning and durable internal representations.

- **Expected Outcome:** an environment that can capture and improve LLM behavior on challenging long-horizon tasks that require sessions running beyond context memory limits.

- **Example Environments:** Research-planning simulators, large-scale codebase refactoring tasks, strategic resource management worlds, long-horizon logistics optimization, extremely complicated long-horizon instruction following (e.g., 300 instructions scattered around).

- **Partner Sub-Themes:**

  - **Mercor:** Make an environment with capped/uncapped rewards where frontier model rewards scale with token output.

  - **Scale AI:** Environments for long-horizon workflows for non-code use cases within a business setting, focusing on either Sales, Project Management, or HR & IT.

**Statement 3: World Modeling**

- **Statement 3.1: Professional Tasks:** Here you will develop environments that require real interaction with tools, APIs, or dynamic systems, where the model is expected to do real, hard work instead of exploiting shortcuts to arrive at the desired outcome. Learning from these environments will enable agents to maintain consistent internal state, update beliefs based on outcomes, and orchestrate multi-step workflows. The goal is to strengthen causal reasoning and persistent world models.

  - **Expected Outcome:** an environment capturing the nuances of a defined partially observable world and improving LLM interaction with it

  - **Example Environments:** Dynamic browser/API ecosystems, enterprise applications, scientific workflow loops (papers → code → experiments), economic simulations with feedback, tool-discovery benchmarks.

  - **Partner Sub-Theme:**

    - **Scaler AI Labs:** Multi-App RL Environment for Enterprise Workflows: Create RL environments that demonstrate complex workflows, business-rule nuances, etc. in a large enterprise.

- **Statement 3.2: Personalized Tasks:** Here we will develop an environment that offers real personalized task handling: imagine replying to personal messages, resolving dinner plans that clash with work, or answering tough emails. Think any personal-assistant task.

  - **Expected Outcome:** An environment that gives the model a realistic simulation of handling personal tasks and conflicts, and managing them as delegations

  - **Example Environments:** Executive assistant meeting planner, dinner and drive planning, email and message replying, etc.

  - **Partner Sub-Theme:**

    - **Patronus AI:** Consumer Workflows with Schema Drift: Multi-step consumer workflow environments where the underlying data schemas, API contracts, and terms & conditions/policies/rules change.

**Statement 4: Self-Improvement**

The focus here is to create environments where agents can learn to generate new challenges, escalate difficulty, and improve through self-play or adaptive curricula. Rather than optimizing fixed tasks, the goal is for agents to learn to drive their own capability growth. The objective is recursive skill amplification.

- **Expected Outcome:** an environment for improving self-play of an LLM over a defined set of tasks

- **Example Environments:** Self-play negotiation arenas, auto-generated math/proof tasks, evolving coding competitions, adaptive RL curricula.

- **Partner Sub-Theme:**

  - **Snorkel AI:** Simulated Experts-in-the-Loop: An environment that simulates interactions with real subject-matter experts, with changing requirements/preferences.

**Statement 5: Wild Card - Impress Us!**

If your idea doesn't fit the boxes above, we do not want to limit your focus: we want, and WILL, reward out-of-the-box tasks. Be creative, but make sure your submission meaningfully adds value to LLM training on a specific task.

**Judging Criteria**

- **Environment Innovation (40%) -** Is the environment novel, creative, or challenging? Does it meaningfully test the agent's behavior?
- **Storytelling (30%) -** Does the team clearly explain the problem, environment, and agent behavior? Is the demo engaging and easy to follow?
- **Training Script Showing Improvement in Rewards (20%) -** Does the demo provide observable evidence of training progress (reward curves, metrics, or before/after behavior)?
- **Reward and Training Pipeline Setup (10%) -** Is the reward logic coherent, and does the pipeline produce meaningful improvement in the agent's inference (how it acts in the environment)?

**Judging Process**

Judging proceeds in two rounds:

- Hackers will be assigned groups of judges; ~3 minutes to pitch followed by 1-2 minutes of Q&A.

- The top **six** teams in the ranking will get to demo on stage to a panel of judges; ~3 minutes to pitch followed by 2-3 minutes of Q&A.

## Prizes

- **1st Place:** $15,000 USD Cash

- **2nd Place:** $9,000 USD Cash

- **3rd Place:** $6,000 USD Cash
README.md CHANGED
@@ -1,10 +1,137 @@
 ---
-title: Rlm Forge
-emoji: 📈
-colorFrom: pink
-colorTo: gray
-sdk: docker
-pinned: false
+title: RLM-Forge
+emoji: 🚀
+colorFrom: blue
+colorTo: indigo
+base_path: /web
 ---

-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# RLM-Forge

**Recursive Language Model training environment for AI coding agents.**

RLM-Forge is an [OpenEnv](https://github.com/meta-pytorch/OpenEnv) environment that trains language models to solve coding tasks on real Python repositories using Recursive Language Model (RLM) patterns.

## How It Works

1. **Clone** a real Python repo (e.g., python-slugify, humanize)
2. **Extract** a source file and replace it with a broken stub (correct signatures, wrong implementations)
3. **Agent** explores the repo via a sandboxed multi-step REPL with built-in tools
4. **Reward** = test pass rate (55%) + structural validity (15%) + efficiency (30%)
5. **Train** with GRPO to improve the agent's coding ability over time

### The REPL Tools

The agent has access to these functions in the sandbox:

| Function | Description |
|----------|-------------|
| `read_file(path)` | Read a file from the repo |
| `list_dir(path='.')` | List directory contents |
| `search(pattern, path='.')` | Grep for a pattern |
| `write_file(path, content)` | Write/create a file |
| `run_tests(test_path=None)` | Run pytest |
| `spawn_agent(scope, mission)` | Explore a directory scope |
| `FINAL()` | Signal implementation is complete |

## Project Structure

```
rlm_forge/
├── __init__.py              # Package exports
├── models.py                # Pydantic models (Action, Observation, State)
├── client.py                # EnvClient for remote connections
└── server/
    ├── app.py               # FastAPI server (create_app)
    ├── environment.py       # Core Environment (reset/step)
    ├── sandbox.py           # Sandboxed Python REPL
    ├── repo_manager.py      # Repo cloning & dependency management
    ├── feature_extractor.py # Source file extraction & stub generation
    └── reward.py            # Composite reward computation
```

## Quick Start

### Install

```bash
uv sync
```

### Run the Server

```bash
uv run uvicorn rlm_forge.server.app:app --host 0.0.0.0 --port 8000
```

### Use the Environment Directly

```python
from rlm_forge.server.environment import RLMForgeEnvironment
from rlm_forge.models import RLMForgeAction

env = RLMForgeEnvironment()
obs = env.reset(seed=1)
print(obs.task_description)

# Agent takes actions
obs = env.step(RLMForgeAction(code="print(read_file('test.py'))"))
obs = env.step(RLMForgeAction(code="write_file('slugify/slugify.py', '...')"))
obs = env.step(RLMForgeAction(code="FINAL()"))
print(f"Reward: {obs.reward}")
```

### Connect via Client

```python
from rlm_forge.client import RLMForgeClient
from rlm_forge.models import RLMForgeAction

client = RLMForgeClient(base_url="http://localhost:8000")
client.connect()

result = client.reset(seed=1)
result = client.step(RLMForgeAction(code="print(list_dir())"))
result = client.step(RLMForgeAction(code="FINAL()"))
print(f"Reward: {result.reward}")
```

## Training

See `rlm_forge_training.ipynb` for the full GRPO training notebook. Designed for Google Colab with an H100 GPU.

Key training approach:
- **Multi-step trajectory concatenation**: The full episode (all code actions) is treated as one GRPO "completion"
- **Group Relative Policy Optimization**: Multiple completions per task, with advantages computed relative to the group mean (see the sketch below)
- **LoRA fine-tuning**: 4-bit quantized Qwen2.5-Coder-32B with a LoRA adapter

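As a minimal illustration of the group-relative part of GRPO (a standalone sketch, not the notebook's exact code): each task is sampled several times, and a completion's advantage is its reward relative to the group, so no value network is needed.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: reward relative to the group of completions
    sampled for the same task, normalized by the group's spread."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

# Four episodes of the same task: the best rollout gets a positive advantage.
print(group_relative_advantages([0.82, 0.55, 0.55, 0.31]))
```
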
## Reward Breakdown

| Component | Weight | Description |
|-----------|--------|-------------|
| Test Pass Rate | 55% | Fraction of tests passing |
| Structural Validity | 15% | AST parse check + import check |
| Efficiency | 30% | Tiered by iteration budget used |

## Curated Repos

| Repo | Source File | Tests | Difficulty |
|------|-------------|-------|------------|
| python-slugify | `slugify/slugify.py` | 82 | Easy |
| humanize (number) | `src/humanize/number.py` | 219 | Medium |
| humanize (time) | `src/humanize/time.py` | varies | Medium |

## Docker

```bash
docker build -t rlm-forge .
docker run -p 8000:8000 rlm-forge
```

The Dockerfile pre-clones curated repos to avoid network I/O on each `reset()`.

## Deploy to HF Spaces

```bash
openenv push -r your-username/rlm-forge
```
RLM_Forge_Project_Overview.md ADDED
@@ -0,0 +1,608 @@
# RLM-Forge: A Recursive Language Model Training Environment for AI Coding Agents

## Project Overview

RLM-Forge is an OpenEnv environment designed to train small language models to utilize the Recursive Language Model (RLM) framework for solving complex coding tasks on large repositories. It is inspired by the research paper "Recursive Language Models" (Zhang, Kraska, & Khattab, MIT CSAIL, December 2025), which demonstrated that LLMs can process inputs orders of magnitude beyond their context windows by treating prompts as external environment variables and interacting with them through code execution in a REPL.

The core innovation of RLM-Forge is combining the RLM paradigm with depth-limited sub-agents for repository exploration, creating an environment where a root agent can orchestrate multiple sub-agents — each with its own scoped REPL and file-system tools — to understand and modify codebases far too large for any single model's context window.

The environment is self-supervised: it clones open-source repositories, programmatically removes a file or module that has associated test coverage, and tasks the agent with rebuilding that feature using only the surrounding codebase. The removed feature's test suite serves as an automatic, objective reward signal.

---

## Motivation & Research Background

### The Problem

Modern AI coding agents (Claude Code, Cursor, Codex CLI) struggle with very large repositories because a single agent must somehow fit enough context to understand the entire system. Context windows are finite, and even within those limits, model quality degrades as context grows longer — a phenomenon known as "context rot."

### The RLM Insight

The Recursive Language Models paper (arXiv:2512.24601) proposes a paradigm shift: instead of feeding long prompts directly into the neural network, treat the prompt as part of an external environment. The model interacts with the context through code — slicing, searching, chunking — and only pulls small pieces into its context window at a time. Crucially, the model can programmatically invoke sub-LM calls on constructed snippets, enabling recursive decomposition.

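As a toy illustration of this pattern (hypothetical helper names; the paper's actual harness differs in detail), the long input lives as a REPL variable and the model writes code against it rather than reading it directly:

```python
def llm_query(prompt: str) -> str:
    # Stand-in for the sub-LM call primitive; a real RLM invokes a model here.
    return f"[sub-LM answer over {len(prompt)} chars]"

# The huge input is an environment variable, never fed to the model whole.
context = "\n".join(f"line {i}: payload-{i % 997}" for i in range(100_000))

# The root model writes code to pull out small, relevant slices...
relevant = [ln for ln in context.splitlines() if "payload-424" in ln]

# ...and delegates semantic work on a constructed snippet to a sub-LM call.
print(llm_query("Summarize these lines:\n" + "\n".join(relevant[:20])))
```
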
Key findings from the paper:
- RLMs handle inputs of 10M+ tokens (two orders of magnitude beyond context windows)
- On information-dense tasks, RLMs outperform base models by 28-58% absolute
- The approach is model-agnostic and works with both closed and open-source models
- Costs remain comparable to base model calls at the median
- Emergent strategies appear without explicit training: regex filtering, intelligent chunking, answer verification, variable-based output stitching

### The Gap We Fill

The paper's "Future Work" section explicitly identifies the opportunity we are pursuing:

> "Explicitly training models to be used as RLMs (e.g. as root or sub-LMs) could provide additional performance improvements... We hypothesize that RLM trajectories can be viewed as a form of reasoning, which can be trained by bootstrapping existing frontier models."

We plan to allow a recursion depth of 1 (or possibly 2), so that a root agent can spawn sub-agents, and those sub-agents have access to their own REPL and file-system tools, but cannot spawn sub-agents of their own.

This allows the model to be trained as both a root agent and a sub-agent, which is key to the success of the RLM-Forge environment.

### Why Coding Tasks?

Coding is the ideal domain for RLM training because:
1. **Natural structure**: Repositories have files, modules, imports, and tests — providing clear decomposition targets
2. **Objective evaluation**: Test suites provide automatic, binary reward signals
3. **Unlimited data**: Every well-tested open-source repository is a potential training example
4. **Real-world impact**: Improved coding agents have immediate practical value
5. **Complexity scaling**: Repositories naturally range from simple (100 LOC) to enormous (1M+ LOC), providing a natural curriculum

---

## Architecture Design

### Environment Type

RLM-Forge is an **OpenEnv environment** built on the OpenEnv 0.2.1 framework. It follows the standard OpenEnv pattern:

```
rlm_forge/
├── __init__.py
├── README.md
├── models.py             # Action, Observation, State (Pydantic models)
├── client.py             # HTTPEnvClient subclass
├── openenv.yaml          # Environment manifest
├── pyproject.toml
├── uv.lock
└── server/
    ├── __init__.py
    ├── app.py                # FastAPI server using create_app()
    ├── environment.py        # Core Environment implementation
    ├── repo_manager.py       # Repository cloning, feature extraction, test discovery
    ├── sandbox.py            # Sandboxed code execution (REPL)
    ├── sub_agent.py          # Sub-agent lifecycle management
    ├── reward.py             # Composite reward computation
    ├── feature_extractor.py  # Module/file removal and test mapping
    └── Dockerfile
```

### Core Concepts

#### The Root Agent

The root agent operates in an iterative REPL loop. It receives a task description and a high-level manifest of the repository (directory tree, file sizes, README excerpt). It does NOT see the actual source code in its context window. Instead, it writes Python code to:
- Explore the repository structure
- Read specific files
- Search for patterns (grep, regex, AST parsing)
- Spawn sub-agents to explore specific directories or modules
- Write implementation code
- Save files to rebuild the removed feature

#### Sub-Agents (Depth = 1)

Sub-agents are scoped explorers. When the root agent spawns a sub-agent, it specifies:
- A target scope (directory path or set of files)
- A mission (what to look for, what to report back)
- A budget (maximum iterations)

The sub-agent gets its own sandboxed REPL with:
- Read-only access to its scoped portion of the repository
- The ability to execute Python code (read files, parse ASTs, search, analyze)
- An `llm_query()` function for semantic understanding of code snippets
- NO ability to spawn further sub-agents (depth limit = 1)

The sub-agent runs its own iteration loop and returns a structured report to the root agent's REPL environment as a variable.

**Important distinction from the RLM paper**: In the paper, sub-calls are stateless LM calls — simple prompt-in, text-out. In RLM-Forge, sub-agents have their own REPL state, their own iteration loop, and their own tool access. They are mini-RLMs, not plain LM calls. This is the "depth-1 recursive RLM with tools" architecture. Sub-agents CANNOT spawn their own sub-agents.

#### The REPL Environment

Both root and sub-agents operate within sandboxed Python REPL environments (a minimal execution sketch follows the list). Key properties:
- **Persistent state**: Variables persist across iterations within an episode
- **Sandboxed execution**: Code runs in an isolated environment with controlled file system access
- **Truncated output**: stdout/stderr is truncated to prevent context overflow (configurable limit)
- **Iteration tracking**: The environment tracks iteration count against a configurable maximum
- **Built-in functions**:
  - `llm_query(prompt: str) -> str` — Invoke a sub-LM for semantic understanding
  - `spawn_agent(scope: str, mission: str, budget: int) -> dict` — Spawn a sub-agent (root only)
  - `read_file(path: str) -> str` — Read a file from the repository
  - `list_dir(path: str) -> list` — List directory contents
  - `search(pattern: str, path: str) -> list` — Grep/regex search
  - `write_file(path: str, content: str)` — Write implementation files (root only)
  - `run_tests(test_path: str) -> dict` — Run specific test files and get results
  - `FINAL()` — Signal episode completion

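A deliberately unsafe toy sketch of the persistence and truncation properties (not the environment's actual sandbox, which adds isolation and file-system scoping):

```python
import contextlib
import io

class MiniREPL:
    """Persistent-namespace executor: variables survive across iterations."""

    def __init__(self, truncate_at: int = 5000):
        self.namespace: dict = {}
        self.truncate_at = truncate_at

    def execute(self, code: str) -> dict:
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(code, self.namespace)  # real sandboxing needs far more than this
            ok = True
        except Exception as exc:
            buf.write(repr(exc))
            ok = False
        return {"stdout": buf.getvalue()[: self.truncate_at], "success": ok}

repl = MiniREPL()
repl.execute("x = 40 + 2")        # state persists across calls...
print(repl.execute("print(x)"))   # ...so this yields {'stdout': '42\n', 'success': True}
```
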
---

## Episode Lifecycle

### Phase 1: Environment Setup (on `reset()`)

1. **Repository selection**: The environment selects a repository from its configured dataset (a list of Git repository URLs or local paths)
2. **Clone and baseline**: Clone the repository. Run the full test suite to establish a baseline (all tests should pass)
3. **Feature extraction**: Select a target file or module for removal:
   - Identify files/modules that have dedicated test files with clear mappings (e.g., `src/auth.py` → `tests/test_auth.py`)
   - Prefer modules with moderate complexity (configurable LOC range)
   - Record which tests are associated with the target
   - Record the original content of the target (this is the ground truth, never shown to the agent)
4. **Feature removal**: Delete the target file(s) from the repository working copy
5. **Manifest generation**: Create a high-level manifest for the agent (one plausible shape is sketched after this list):
   - Directory tree structure
   - File sizes and languages
   - README excerpt (first N characters)
   - List of failing tests (names and file paths)
   - Task description: "The following module has been removed: `[path]`. N tests in `[test_path]` are now failing. Your task is to implement the missing module so that all tests pass."
6. **REPL initialization**: Set up the root agent's REPL environment with the repository loaded and built-in functions available
7. **Return initial observation**: The observation includes the manifest, the task description, the failing test list, and REPL environment metadata (available variables, available functions)

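A toy manifest builder for step 5 (field names illustrative, not the environment's exact schema):

```python
import os

def generate_manifest(repo_path: str, readme_chars: int = 800) -> dict:
    """Walk the repo and record a directory tree, file sizes, and a README excerpt."""
    files = []
    for root, _dirs, names in os.walk(repo_path):
        for name in names:
            path = os.path.join(root, name)
            files.append({
                "path": os.path.relpath(path, repo_path),
                "bytes": os.path.getsize(path),
            })
    readme_excerpt = ""
    readme_path = os.path.join(repo_path, "README.md")
    if os.path.exists(readme_path):
        with open(readme_path, encoding="utf-8", errors="replace") as fh:
            readme_excerpt = fh.read(readme_chars)
    return {"files": files, "readme_excerpt": readme_excerpt}
```
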
### Phase 2: Agent Interaction (the `step()` loop)

Each step, the agent submits an action containing Python code to execute. The environment (step 1 is sketched after this list):

1. **Extracts code blocks** from the agent's response
2. **Executes each code block** in the sandboxed REPL
3. **Captures output** (stdout, stderr, success/failure, any variables set)
4. **Checks for sub-agent spawns**: If the code calls `spawn_agent()`, the environment:
   - Creates a new scoped REPL for the sub-agent
   - Runs the sub-agent's iteration loop (the sub-agent is driven by an `llm_query()` call internally, or by a policy if training the sub-agent)
   - Returns the sub-agent's report as a variable in the root agent's REPL
5. **Checks for termination**: The episode ends if:
   - The agent calls `FINAL()` — voluntary completion
   - Maximum iterations reached — forced termination
   - Maximum wall-clock time exceeded — timeout
6. **Returns observation**: stdout/stderr (truncated), success boolean, iteration count, list of available variables, any sub-agent reports

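A minimal sketch of the code-block extraction in step 1 (a hypothetical helper, assuming fenced Markdown blocks in the model's response):

```python
import re

TICKS = "`" * 3  # avoid writing a literal fence inside this fenced example
FENCE = re.compile(TICKS + r"(?:python)?\s*\n(.*?)" + TICKS, re.DOTALL)

def extract_code_blocks(response: str) -> list[str]:
    blocks = [m.strip() for m in FENCE.findall(response)]
    # Fall back to treating the whole response as code if nothing is fenced.
    return blocks or [response.strip()]

reply = f"Let me look around first.\n{TICKS}python\nprint(list_dir())\n{TICKS}"
print(extract_code_blocks(reply))  # -> ['print(list_dir())']
```
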
### Phase 3: Evaluation (on episode completion)

When the episode ends (either through `FINAL()` or the iteration limit):

1. **Collect implementation**: Gather all files the agent wrote via `write_file()`
2. **Run target tests**: Execute the test files associated with the removed feature
3. **Run regression tests**: Execute the full test suite to check for regressions
4. **Compute composite reward** (see Reward Function below)
5. **Return final observation** with done=True, reward, and detailed test results

---

## Reward Function

The reward is a weighted composite of three components. Weights are configurable via environment parameters, with these defaults:

### Test Pass Rate (Default: 55% of total reward)

```
test_pass_reward = (num_target_tests_passed / num_target_tests_total)
```

This is the primary signal. The agent is rewarded proportionally to how many of the removed feature's tests it gets passing. Partial credit is given — passing 7 out of 10 tests yields 0.70 on this component.

### Structural Validity (Default: 15% of total reward)

```
structural_reward = weighted_average(
    parse_success,    # Does the code parse without syntax errors? (weight: 0.3)
    import_success,   # Do imports resolve correctly? (weight: 0.3)
    no_regressions,   # Do previously-passing tests still pass? (weight: 0.4)
)
```

This penalizes agents that produce invalid code or hack solutions that break the rest of the codebase. The regression check is particularly important — it prevents the agent from modifying shared utilities in ways that pass target tests but break everything else.

### Efficiency Bonus (Default: 30% of total reward)

```
if iterations_used <= budget * 0.5:
    efficiency_reward = 1.0    # Full bonus for fast solutions
elif iterations_used <= budget * 0.75:
    efficiency_reward = 0.75   # Reduced bonus
elif iterations_used <= budget:
    efficiency_reward = 0.5    # Minimal bonus for using full budget
else:
    efficiency_reward = 0.0    # No bonus if forced termination

# Sub-agent efficiency modifier
sub_agent_penalty = max(0, 1.0 - (num_sub_agents_spawned / max_reasonable_sub_agents))
efficiency_reward *= (0.7 + 0.3 * sub_agent_penalty)
```

This encourages the agent to learn efficient exploration and decomposition strategies. It rewards agents that solve problems quickly and use sub-agents judiciously rather than spawning one for every directory.

### Total Reward Computation

```
total_reward = (
    test_weight * test_pass_reward +
    structural_weight * structural_reward +
    efficiency_weight * efficiency_reward
)
```

Where `test_weight`, `structural_weight`, and `efficiency_weight` are configurable and default to 0.55, 0.15, and 0.30 respectively.

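Putting the three components together (a runnable sketch assembled from the formulas above; the sub-agent modifier is omitted for brevity):

```python
def total_reward(tests_passed: int, tests_total: int,
                 parse_ok: bool, imports_ok: bool, no_regressions: bool,
                 iterations_used: int, budget: int,
                 test_w: float = 0.55, structural_w: float = 0.15,
                 efficiency_w: float = 0.30) -> float:
    test_r = tests_passed / tests_total if tests_total else 0.0
    structural_r = 0.3 * parse_ok + 0.3 * imports_ok + 0.4 * no_regressions
    if iterations_used <= budget * 0.5:
        efficiency_r = 1.0
    elif iterations_used <= budget * 0.75:
        efficiency_r = 0.75
    elif iterations_used <= budget:
        efficiency_r = 0.5
    else:
        efficiency_r = 0.0
    return test_w * test_r + structural_w * structural_r + efficiency_w * efficiency_r

# Passing 7/10 tests with clean code in 20 of 50 iterations:
print(round(total_reward(7, 10, True, True, True, 20, 50), 3))  # 0.835
```
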
---

## Data Models (Pydantic Schemas)

### Action

```python
class RLMForgeAction(Action):
    """Agent's action: Python code to execute in the REPL."""
    code: str = Field(..., description="Python code to execute in the REPL environment")
    action_type: str = Field(
        default="execute",
        description="Type of action: 'execute' for code, 'final' to submit solution"
    )
```

### Observation

```python
class RLMForgeObservation(Observation):
    """What the agent sees after each step."""
    # REPL execution results
    stdout: str = Field(default="", description="Truncated stdout from code execution")
    stderr: str = Field(default="", description="Truncated stderr from code execution")
    success: bool = Field(default=True, description="Whether code executed without errors")

    # Episode tracking
    iteration: int = Field(default=0, description="Current iteration number")
    max_iterations: int = Field(default=50, description="Maximum allowed iterations")

    # Repository context (provided on reset, may be refreshed)
    repo_manifest: Optional[dict] = Field(default=None, description="Repository structure manifest")
    task_description: Optional[str] = Field(default=None, description="The coding task to complete")
    failing_tests: Optional[list[str]] = Field(default=None, description="List of currently failing test names")

    # REPL state
    available_variables: list[str] = Field(default_factory=list, description="Variables currently in REPL scope")
    available_functions: list[str] = Field(default_factory=list, description="Built-in functions available")

    # Sub-agent reports (populated when sub-agents complete)
    sub_agent_reports: list[dict] = Field(default_factory=list, description="Reports from completed sub-agents")

    # Test results (populated on final evaluation)
    test_results: Optional[dict] = Field(default=None, description="Detailed test results on completion")
```

### State

```python
class RLMForgeState(State):
    """Internal environment state, not directly sent to agent."""
    episode_id: Optional[str] = None
    step_count: int = 0

    # Repository info
    repo_url: str = ""
    repo_local_path: str = ""
    removed_feature_path: str = ""
    removed_feature_content: dict[str, str] = {}  # filename -> original content
    target_test_files: list[str] = []
    baseline_test_count: int = 0

    # Agent progress
    files_written: dict[str, str] = {}  # filename -> content written by agent
    sub_agents_spawned: int = 0
    total_llm_queries: int = 0

    # Evaluation
    final_reward: Optional[float] = None
    test_pass_rate: Optional[float] = None
    has_regressions: Optional[bool] = None
```

---

## Feature Extraction Pipeline

The feature extraction pipeline is responsible for selecting what to remove from a repository and mapping it to tests. This is a critical component that must work reliably.

### Strategy: File and Module Level Extraction

The pipeline operates in two modes:

#### Single-File Mode
1. Scan the repository for Python/Rust/TS/Julia source files
2. For each source file, look for a corresponding test file using common patterns (a mapping sketch follows this list):
   - `src/foo.py` → `tests/test_foo.py`
   - `src/foo.py` → `tests/foo_test.py`
   - `src/foo/bar.py` → `tests/test_bar.py`
   - `lib/foo.rs` → `tests/foo.rs` or `tests/test_foo.rs`
   - `src/foo.ts` → `__tests__/foo.test.ts` or `tests/foo.spec.ts`
3. Verify the test file actually imports from / tests the source file
4. Run the test file in isolation to confirm it passes
5. Score candidates by:
   - Number of tests (prefer 5-30 tests; too few = trivial, too many = too complex)
   - Source file LOC (prefer 50-500 lines for hackathon scope)
   - Import complexity (prefer files that are imported by few other files, to minimize cascade)

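A minimal sketch of the pattern lookup in step 2 (Python-only, hypothetical helper):

```python
from pathlib import Path

def candidate_test_files(source: Path, repo_root: Path) -> list[Path]:
    """Map src/foo.py to the conventional pytest locations that exist."""
    stem = source.stem
    candidates = [
        repo_root / "tests" / f"test_{stem}.py",
        repo_root / "tests" / f"{stem}_test.py",
    ]
    return [p for p in candidates if p.is_file()]

# e.g. candidate_test_files(Path("src/slugify.py"), Path("."))
```
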
#### Module Mode
1. Scan for directories that represent modules (contain `__init__.py` or are listed in the package config)
2. Find test directories or files that correspond to the module
3. Apply the same scoring criteria, but at the module (directory) level
4. Prefer small, self-contained modules (2-8 files)

### Output of Feature Extraction

```python
@dataclass
class ExtractedFeature:
    """Represents a feature to be removed for training."""
    source_paths: list[str]           # Files to remove
    test_paths: list[str]             # Test files that exercise this feature
    original_content: dict[str, str]  # Map of path -> original file content
    num_tests: int                    # Number of individual test cases
    estimated_complexity: str         # "easy", "medium", "hard"
    import_dependents: list[str]      # Files that import from the removed feature
    task_description: str             # Auto-generated task description for the agent
```

---

## Sub-Agent Mechanism

### Spawning a Sub-Agent

From the root agent's REPL:

```python
report = spawn_agent(
    scope="/src/database/",
    mission="Explore the database module. Report: 1) What ORM or database library is used, 2) What models/tables exist, 3) What patterns are used for queries, 4) The public API of this module",
    budget=10  # max iterations for the sub-agent
)
# `report` is now a dict variable in the root agent's REPL
print(report["summary"])
print(report["files_examined"])
```

### Sub-Agent Lifecycle

1. **Initialization**: A new sandboxed REPL is created with read-only access to the specified scope
2. **Mission prompt**: The sub-agent receives a system prompt with:
   - Its scoped directory listing
   - The mission description from the root agent
   - The available built-in functions (read_file, list_dir, search, llm_query)
   - Its iteration budget
3. **Iteration loop**: The sub-agent iterates (driven by `llm_query` internally):
   - Writes code to explore its scope
   - Executes code, observes results
   - Refines its understanding
   - Calls `FINAL(report)` when done or when its budget is exhausted
4. **Report return**: The sub-agent's final report (a structured dict) is injected as a variable into the root agent's REPL (one plausible shape is shown below)

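An illustrative report value (the field names are hypothetical, not a fixed schema):

```python
report = {
    "summary": "SQLAlchemy ORM; 4 models (User, Team, Invoice, AuditLog).",
    "files_examined": ["src/database/models.py", "src/database/session.py"],
    "public_api": ["get_session()", "User.by_email()", "Invoice.total()"],
    "iterations_used": 6,
}
```
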
### Sub-Agent Constraints

- **Read-only file access**: Sub-agents can read files within their scope but cannot write files
- **No sub-agent spawning**: Sub-agents cannot spawn their own sub-agents (depth = 1)
- **Scoped access**: Sub-agents can only access files within their assigned directory scope
- **Budget limited**: Each sub-agent has a maximum iteration count
- **Concurrent limit**: The root agent can have at most N sub-agents per episode (configurable, default 10)

---

## Repository Dataset

### Requirements for Training Repositories

Each repository used as a training dataset must have:
1. **Strong test coverage** with test files that clearly map to source modules
2. **Modular architecture** where individual files/modules can be removed without collapsing the entire project
3. **Medium-large size** (10,000 - 150,000 LOC)
4. **Active maintenance** (commits within the last 3 months)
5. **Permissive license** (MIT, Apache 2.0, BSD)
6. **80%+ in one of**: Python, Rust, TypeScript, or Julia

### Repository Configuration

```yaml
# repos.yaml - Dataset configuration
repositories:
  - url: "https://github.com/org/repo1"
    language: "python"
    difficulty: "medium"
    test_command: "pytest"
    source_dir: "src/"
    test_dir: "tests/"

  - url: "https://github.com/org/repo2"
    language: "rust"
    difficulty: "hard"
    test_command: "cargo test"
    source_dir: "src/"
    test_dir: "tests/"

  # ... more repositories

settings:
  max_file_loc: 500              # Max LOC for single-file extraction
  max_module_files: 8            # Max files for module extraction
  min_tests: 3                   # Minimum tests for a valid feature
  max_tests: 50                  # Maximum tests (avoid overly complex features)
  preferred_test_range: [5, 30]  # Sweet spot for test count
```

---

## Hackathon Problem Statement Alignment

RLM-Forge addresses multiple hackathon problem statements:

### Primary: Statement 2 — Long-Horizon Planning & Instruction Following

The environment requires deep, multi-step reasoning with delayed rewards. The agent must:
- Decompose the goal of rebuilding a feature into exploration sub-tasks
- Track state across an extended REPL trajectory (potentially dozens of iterations)
- Recover from wrong turns (exploring irrelevant code, writing buggy implementations)
- Plan sub-agent deployments strategically

### Secondary: Statement 3.1 — World Modeling (Professional Tasks)

The environment involves real interaction with tools and dynamic systems:
- File system exploration with real code
- Test execution with real pass/fail results
- Code execution in a sandboxed REPL
- Multi-step workflows: explore → understand → plan → implement → verify

### Partner Sub-Theme: Mercor (Statement 2)

"Make an environment with capped/uncapped rewards where frontier model rewards scale with token output." — RLM-Forge naturally fits this: longer, more sophisticated RLM trajectories that correctly process more of the codebase should earn higher rewards, as they'll pass more tests.

---

## Implementation Plan

### Phase 1: Core Environment Scaffold

1. Set up the OpenEnv project structure using `openenv init`
2. Define all Pydantic models (Action, Observation, State)
3. Implement the basic `Environment` class with `reset()` and `step()` stubs
4. Implement the sandboxed REPL (code execution with safety restrictions)
5. Implement the `app.py` FastAPI server and `client.py`
6. Verify the environment scaffold works with `openenv validate`

### Phase 2: Repository & Feature Pipeline

1. Implement `repo_manager.py` — repository cloning, caching, test suite discovery
2. Implement `feature_extractor.py` — file/module selection, test mapping, feature removal
3. Build the manifest generator (directory tree, file metadata, task description)
4. Test the pipeline end-to-end on 2-3 repositories
5. Handle multi-language support (Python pytest, Rust cargo test, TS jest/vitest)

### Phase 3: Sub-Agent System

1. Implement `sub_agent.py` — sub-agent REPL creation, scoping, lifecycle
2. Implement `spawn_agent()` as a built-in REPL function
3. Implement the sub-agent iteration loop with `llm_query()` integration
4. Implement the sub-agent report format and injection into the root REPL
5. Add sub-agent budget tracking and concurrent limits
6. Test sub-agent spawning and report aggregation

### Phase 4: Reward & Evaluation

1. Implement `reward.py` — test execution, pass rate calculation, regression detection
2. Implement structural validity checks (parsing, import resolution)
3. Implement efficiency scoring
4. Implement the composite reward computation with configurable weights
5. Test reward computation on sample episodes

### Phase 5: Integration, Docker & HF Spaces

1. Full integration testing — run complete episodes end-to-end
2. Build the Dockerfile with all dependencies (git, language runtimes, test frameworks)
3. Configure the Gradio web UI for the HF Space
4. Deploy to HF Spaces using `openenv push`
5. Verify the deployed environment works remotely

### Phase 6: Minimal Training Demo

1. Create a Google Colab notebook
2. Set up Unsloth + a small model (Qwen2.5-1.5B or similar)
3. Connect to the deployed environment
4. Implement a GRPO training loop with the environment's reward function
5. Run a few training steps to demonstrate the pipeline works
6. Save results and training curves

### Phase 7: Demo Video & Submission

1. Record the 1-minute YouTube demo video
2. Final testing and bug fixes
3. Submit to the hackathon

---

## Key Technical Resources

### OpenEnv Framework
- OpenEnv GitHub: `https://github.com/meta-pytorch/OpenEnv`
- OpenEnv 0.2.1 stable release
- Environment builder guide: `docs/source/getting_started/environment-builder.md`
- Existing REPL environment: `src/envs/repl_env/` (study this closely as a reference)
- Existing coding environment: `src/envs/coding_env/` (another key reference)
- 2048 RL training tutorial: `docs/source/tutorials/rl-training-2048.md`

### RLM Paper
- arXiv: `https://arxiv.org/abs/2512.24601`
- Key sections: §2 (methods), §3.1 (emergent patterns), §5 (limitations/future work)
- System prompts: Appendix D (pages 24-28)
- Example trajectories: Appendix B (pages 13-20)

### Training Stack
- Unsloth: Memory-efficient fine-tuning with LoRA
- HuggingFace TRL: GRPO (Group Relative Policy Optimization)
- Google Colab: Free T4 GPU for the training demo

### Sandboxing
- Docker isolation (primary — OpenEnv already uses this)
- RestrictedPython or similar for additional code execution safety
- Filesystem scoping via chroot or bind mounts

---

## Configuration & Defaults

All key parameters should be configurable through the environment's reset kwargs or openenv.yaml:

```yaml
# openenv.yaml
name: rlm_forge
version: "0.1.0"
description: "RLM training environment for AI coding agents"

defaults:
  # Episode parameters
  max_iterations: 50
  max_wall_clock_seconds: 600
  max_sub_agents: 10
  sub_agent_budget: 15
  output_truncation_chars: 5000

  # Reward weights
  test_pass_weight: 0.55
  structural_validity_weight: 0.15
  efficiency_weight: 0.30

  # Feature extraction
  extraction_mode: "mixed"  # "file", "module", or "mixed"
  min_source_loc: 50
  max_source_loc: 500
  min_tests: 3
  max_tests: 50

  # Sub-agent configuration
  sub_agent_max_iterations: 15
  sub_agent_output_truncation: 3000
  sub_agent_read_only: true
  sub_agent_depth_limit: 1
```

---

## Success Criteria

### For the Hackathon

1. **Working environment** deployed on HF Spaces that accepts reset/step/state API calls
2. **Feature extraction** working on at least 2-3 demonstration repositories
3. **Sub-agent spawning** functional with scoped REPL access
4. **Reward computation** returning meaningful composite scores
5. **Minimal training notebook** in Colab showing a GRPO training loop connecting to the environment
6. **1-minute demo video** explaining the concept and showing the environment in action

### For Long-Term Value

1. The environment generalizes across programming languages and repository structures
2. The reward signal is informative enough for models to learn meaningful exploration strategies
3. Sub-agent reports genuinely improve root agent performance vs. no sub-agents
4. Trained models show transfer to unseen repositories
5. The environment can serve as a benchmark for comparing coding agent architectures
__init__.py ADDED
@@ -0,0 +1,2 @@
# OpenEnv environment root package.
from rlm_forge import *  # noqa: F401, F403
client.py ADDED
@@ -0,0 +1,2 @@
# Re-export client for OpenEnv standard layout.
from rlm_forge.client import *  # noqa: F401, F403
main.py ADDED
@@ -0,0 +1,6 @@
def main():
    print("Hello from rlm-forge!")


if __name__ == "__main__":
    main()
models.py ADDED
@@ -0,0 +1,2 @@
# Re-export models for OpenEnv standard layout.
from rlm_forge.models import *  # noqa: F401, F403
openenv.yaml ADDED
@@ -0,0 +1,20 @@
name: rlm_forge
version: "0.1.0"
description: "RLM-Forge: Recursive Language Model training environment for AI coding agents"

defaults:
  max_iterations: 10
  max_sub_agents: 10
  output_truncation_chars: 5000

  # Reward weights
  test_pass_weight: 0.55
  structural_validity_weight: 0.15
  efficiency_weight: 0.30

  # Feature extraction
  extraction_mode: "curated"
  min_source_loc: 50
  max_source_loc: 500
  min_tests: 3
  max_tests: 50
pyproject.toml ADDED
@@ -0,0 +1,23 @@
[project]
name = "rlm-forge"
version = "0.1.0"
description = "RLM-Forge: Recursive Language Model training environment for AI coding agents"
readme = "README.md"
requires-python = ">=3.11"
dependencies = [
    "fastapi>=0.135.1",
    "freezegun>=1.5.5",
    "gitpython>=3.1.0",
    "openenv-core[core]>=0.2.0",
    "pydantic>=2.0.0",
    "pytest>=9.0.2",
    "requests>=2.31.0",
    "text-unidecode>=1.3",
    "uvicorn[standard]>=0.24.0",
]

[project.scripts]
server = "server.app:main"

[tool.setuptools.packages.find]
include = ["rlm_forge*"]
rlm_forge/__init__.py ADDED
@@ -0,0 +1,5 @@
"""RLM-Forge: Recursive Language Model Training Environment for AI Coding Agents."""

from .models import RLMForgeAction, RLMForgeObservation, RLMForgeState

__all__ = ["RLMForgeAction", "RLMForgeObservation", "RLMForgeState"]
rlm_forge/client.py ADDED
@@ -0,0 +1,26 @@
"""Client for connecting to a remote RLM-Forge environment."""

from typing import Any, Dict

from openenv.core import EnvClient
from openenv.core.env_client import StepResult

from .models import RLMForgeAction, RLMForgeObservation, RLMForgeState


class RLMForgeClient(EnvClient[RLMForgeAction, RLMForgeObservation, RLMForgeState]):
    """Client for the RLM-Forge environment."""

    def _step_payload(self, action: RLMForgeAction) -> Dict[str, Any]:
        return {"code": action.code, "action_type": action.action_type}

    def _parse_result(self, payload: Dict[str, Any]) -> StepResult[RLMForgeObservation]:
        obs = RLMForgeObservation(**payload["observation"])
        return StepResult(
            observation=obs,
            reward=payload.get("reward"),
            done=payload.get("done", False),
        )

    def _parse_state(self, payload: Dict[str, Any]) -> RLMForgeState:
        return RLMForgeState(**payload)
rlm_forge/models.py ADDED
@@ -0,0 +1,65 @@
"""Pydantic models for RLM-Forge environment."""

from typing import Optional
from pydantic import Field
from openenv.core.env_server.types import Action, Observation, State


class RLMForgeAction(Action):
    """Agent submits Python code to execute in the REPL."""

    code: str = Field(..., description="Python code to execute in the REPL environment")
    action_type: str = Field(
        default="execute",
        description="Type of action: 'execute' for code, 'final' to submit solution",
    )


class RLMForgeObservation(Observation):
    """What the agent sees after each step.

    Inherits from Observation base:
        done: bool = False
        reward: Optional[float] = None
        metadata: Dict[str, Any] = {}
    """

    stdout: str = Field(default="", description="Truncated stdout from code execution")
    stderr: str = Field(default="", description="Truncated stderr from code execution")
    success: bool = Field(default=True, description="Whether code executed without errors")
    iteration: int = Field(default=0, description="Current iteration number")
    max_iterations: int = Field(default=10, description="Maximum allowed iterations")
    repo_manifest: Optional[dict] = Field(
        default=None, description="Repository structure manifest"
    )
    task_description: Optional[str] = Field(
        default=None, description="The coding task to complete"
    )
    failing_tests: Optional[list[str]] = Field(
        default=None, description="List of currently failing test names"
    )
    available_functions: list[str] = Field(
        default_factory=list, description="Built-in functions available in the REPL"
    )
    test_results: Optional[dict] = Field(
        default=None, description="Detailed test results on completion"
    )


class RLMForgeState(State):
    """Internal environment state, not directly sent to agent.

    Inherits from State base:
        episode_id: Optional[str] = None
        step_count: int = 0
    """

    repo_url: str = ""
    repo_local_path: str = ""
    removed_feature_path: str = ""
    removed_feature_content: str = ""
    target_test_files: list[str] = Field(default_factory=list)
    baseline_test_count: int = 0
    files_written: dict[str, str] = Field(default_factory=dict)
    sub_agents_spawned: int = 0
    final_reward: Optional[float] = None
rlm_forge/server/__init__.py ADDED
File without changes
rlm_forge/server/app.py ADDED
@@ -0,0 +1,30 @@
"""FastAPI server for RLM-Forge environment."""

from openenv.core.env_server import create_app

from ..models import RLMForgeAction, RLMForgeObservation
from .environment import RLMForgeEnvironment

# OpenEnv's HTTP server calls the factory per-request.
# Use a singleton so reset/step share the same environment instance.
_singleton_env = None


def _env_factory():
    global _singleton_env
    if _singleton_env is None:
        _singleton_env = RLMForgeEnvironment()
    return _singleton_env


app = create_app(
    _env_factory,
    RLMForgeAction,
    RLMForgeObservation,
    env_name="rlm_forge",
)

if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
rlm_forge/server/environment.py ADDED
@@ -0,0 +1,192 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Core RLM-Forge Environment implementation."""
2
+
3
+ import os
4
+ import random
5
+ import uuid
6
+ from typing import Any, Optional
7
+
8
+ from openenv.core.env_server import Environment
9
+
10
+ from ..models import RLMForgeAction, RLMForgeObservation, RLMForgeState
+ from .feature_extractor import CURATED_PAIRS, FeatureExtractor
+ from .repo_manager import RepoManager
+ from .reward import RewardComputer
+ from .sandbox import REPLSandbox
+
+
+ class RLMForgeEnvironment(
+     Environment[RLMForgeAction, RLMForgeObservation, RLMForgeState]
+ ):
+     """RLM-Forge: Recursive Language Model training environment for coding agents.
+
+     Clones a Python repo, removes a source file with test coverage, and provides
+     a multi-step REPL for the agent to explore and rebuild the feature.
+     """
+
+     SUPPORTS_CONCURRENT_SESSIONS = False
+
+     def __init__(self):
+         super().__init__()
+         self.repo_manager = RepoManager()
+         self.feature_extractor = FeatureExtractor()
+         self.reward_computer = RewardComputer()
+         self._state = RLMForgeState()
+         self._sandbox: Optional[REPLSandbox] = None
+         self._max_iterations = 10
+
+     def reset(
+         self,
+         seed: Optional[int] = None,
+         episode_id: Optional[str] = None,
+         **kwargs: Any,
+     ) -> RLMForgeObservation:
+         """Clone repo, remove feature, return initial observation."""
+         # Clean up previous episode
+         if self._state.repo_local_path:
+             self.repo_manager.cleanup(self._state.repo_local_path)
+
+         if seed is not None:
+             random.seed(seed)
+
+         # Select a curated pair
+         pair = random.choice(CURATED_PAIRS)
+
+         # AMENDMENT 2: Use pre-cloned repos if available, else clone from network
+         pre_cloned_dir = os.environ.get("RLM_FORGE_PRE_CLONED_DIR", "")
+         repo_name = pair["repo_url"].rstrip("/").split("/")[-1]
+         pre_cloned_path = os.path.join(pre_cloned_dir, repo_name) if pre_cloned_dir else ""
+
+         if pre_cloned_path and os.path.isdir(pre_cloned_path):
+             repo_path = self.repo_manager.copy_pre_cloned(pre_cloned_path)
+         else:
+             repo_path = self.repo_manager.clone_repo(pair["repo_url"])
+
+         # Install dependencies (best-effort)
+         self.repo_manager.install_dependencies(repo_path)
+
+         # Extract feature (remove source file)
+         feature = self.feature_extractor.extract_feature(
+             repo_path, pair["source_file"], pair["test_file"]
+         )
+
+         # Generate manifest
+         manifest = self.repo_manager.generate_manifest(repo_path)
+
+         # Create sandbox
+         self._sandbox = REPLSandbox(repo_path)
+
+         # Get initial failing test info
+         initial_test_result = self._sandbox._run_tests(pair["test_file"])
+         failing_tests = [
+             f"FAILING: {pair['test_file']} "
+             f"({initial_test_result.get('failed', '?')} failures, "
+             f"{initial_test_result.get('errors', '?')} errors)"
+         ]
+
+         # Initialize state
+         self._state = RLMForgeState(
+             episode_id=episode_id or str(uuid.uuid4()),
+             step_count=0,
+             repo_url=pair["repo_url"],
+             repo_local_path=repo_path,
+             removed_feature_path=pair["source_file"],
+             removed_feature_content=feature.original_content,
+             target_test_files=[pair["test_file"]],
+             baseline_test_count=feature.num_tests,
+         )
+
+         return RLMForgeObservation(
+             stdout="Environment initialized. Repository cloned and feature removed.",
+             stderr="",
+             success=True,
+             iteration=0,
+             max_iterations=self._max_iterations,
+             repo_manifest=manifest,
+             task_description=feature.task_description,
+             failing_tests=failing_tests,
+             available_functions=self._sandbox.available_functions,
+             done=False,
+             reward=None,
+         )
+
+     def step(
+         self,
+         action: RLMForgeAction,
+         timeout_s: Optional[float] = None,
+         **kwargs: Any,
+     ) -> RLMForgeObservation:
+         """Execute code in REPL, check for termination, compute reward if done."""
+         if self._sandbox is None:
+             raise RuntimeError("Environment not initialized. Call reset() first.")
+
+         self._state.step_count += 1
+
+         # Check for explicit final action or iteration limit
+         if action.action_type == "final":
+             return self._finalize_episode()
+
+         if self._state.step_count >= self._max_iterations:
+             return self._finalize_episode()
+
+         # Execute code in sandbox
+         result = self._sandbox.execute(action.code)
+
+         # Check if FINAL() was called in the code
+         if result["final_called"]:
+             return self._finalize_episode()
+
+         return RLMForgeObservation(
+             stdout=result["stdout"],
+             stderr=result["stderr"],
+             success=result["success"],
+             iteration=self._state.step_count,
+             max_iterations=self._max_iterations,
+             available_functions=self._sandbox.available_functions,
+             done=False,
+             reward=None,
+         )
+
+     def _finalize_episode(self) -> RLMForgeObservation:
+         """Compute reward and return final observation."""
+         assert self._sandbox is not None
+
+         reward_result = self.reward_computer.compute(
+             repo_path=self._state.repo_local_path,
+             target_test=self._state.target_test_files[0],
+             files_written=self._sandbox.files_written,
+             max_iterations=self._max_iterations,
+             iterations_used=self._state.step_count,
+             baseline_test_count=self._state.baseline_test_count,
+         )
+
+         self._state.final_reward = reward_result["total_reward"]
+         self._state.files_written = self._sandbox.files_written
+         self._state.sub_agents_spawned = self._sandbox._sub_agents_spawned
+
+         return RLMForgeObservation(
+             stdout=f"Episode complete. Reward: {reward_result['total_reward']:.3f}",
+             stderr="",
+             success=True,
+             iteration=self._state.step_count,
+             max_iterations=self._max_iterations,
+             test_results=reward_result,
+             done=True,
+             reward=reward_result["total_reward"],
+         )
+
+     @property
+     def state(self) -> RLMForgeState:
+         return self._state
+
+     def close(self):
+         """No-op for HTTP singleton. Use cleanup() for explicit teardown."""
+         # OpenEnv HTTP server calls close() after each request handler.
+         # For singleton mode, we must NOT destroy state here.
+         # Actual cleanup happens in reset() (previous episode) or explicit cleanup().
+         pass
+
+     def cleanup(self):
+         """Explicit teardown: remove cloned repo."""
+         if self._state.repo_local_path:
+             self.repo_manager.cleanup(self._state.repo_local_path)
+             self._state.repo_local_path = ""
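
For orientation, a minimal driver sketch for the loop above (it mirrors the smoke test in the training notebook later in this commit; the three action strings are illustrative):

    from rlm_forge.server.environment import RLMForgeEnvironment
    from rlm_forge.models import RLMForgeAction

    env = RLMForgeEnvironment()
    obs = env.reset(seed=0)  # clone/copy the repo and stub out the source file
    for code in ["print(list_dir())", "print(run_tests())", "FINAL()"]:
        obs = env.step(RLMForgeAction(code=code))
        if obs.done:  # FINAL() (or the iteration cap) triggers reward computation
            print(obs.reward, obs.test_results)
            break
    env.cleanup()  # remove the cloned working copy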
rlm_forge/server/feature_extractor.py ADDED
@@ -0,0 +1,310 @@
+ """Semi-automatic feature extraction: discovers (source, test) pairs and removes features."""
+
+ import ast
+ import os
+ from dataclasses import dataclass
+ from typing import Optional
+
+
+ @dataclass
+ class ExtractedFeature:
+     """Represents a feature removed from a repo for training."""
+
+     source_path: str
+     test_path: str
+     original_content: str
+     num_tests: int
+     difficulty: str
+     task_description: str
+
+
+ # Curated fallback pairs — known-good (repo, source, test) triples
+ # AMENDMENT 1: python-slugify test file is test.py at root, NOT test/test_slugify.py
+ CURATED_PAIRS = [
+     {
+         "repo_url": "https://github.com/un33k/python-slugify",
+         "source_file": "slugify/slugify.py",
+         "test_file": "test.py",
+         "test_command": "pytest test.py -v",
+         "difficulty": "easy",
+     },
+     {
+         "repo_url": "https://github.com/python-humanize/humanize",
+         "source_file": "src/humanize/number.py",
+         "test_file": "tests/test_number.py",
+         "test_command": "pytest tests/test_number.py -v",
+         "difficulty": "medium",
+     },
+     {
+         "repo_url": "https://github.com/python-humanize/humanize",
+         "source_file": "src/humanize/time.py",
+         "test_file": "tests/test_time.py",
+         "test_command": "pytest tests/test_time.py -v",
+         "difficulty": "medium",
+     },
+ ]
+
+
+ class FeatureExtractor:
+     """Discovers and extracts (source, test) pairs from Python repos."""
+
+     def discover_pairs(self, repo_path: str) -> list[dict]:
+         """Auto-discover (source, test) pairs via filename pattern matching."""
+         pairs = []
+         test_files = self._find_test_files(repo_path)
+
+         for test_file in test_files:
+             source_file = self._match_source_file(repo_path, test_file)
+             if source_file and self._verify_import(repo_path, test_file, source_file):
+                 num_tests = self._count_tests(os.path.join(repo_path, test_file))
+                 source_loc = self._count_lines(os.path.join(repo_path, source_file))
+
+                 # Filter by complexity sweet spot
+                 if 3 <= num_tests <= 50 and 30 <= source_loc <= 500:
+                     pairs.append(
+                         {
+                             "source_path": source_file,
+                             "test_path": test_file,
+                             "num_tests": num_tests,
+                             "source_loc": source_loc,
+                         }
+                     )
+
+         # Sort by best fit (distance from the ~12-test, ~150-LOC sweet spot)
+         pairs.sort(
+             key=lambda p: abs(p["num_tests"] - 12) + abs(p["source_loc"] - 150)
+         )
+         return pairs
+
+     def _find_test_files(self, repo_path: str) -> list[str]:
+         """Find all test files in the repo."""
+         test_files = []
+         for root, dirs, files in os.walk(repo_path):
+             dirs[:] = [d for d in dirs if not d.startswith(".") and d != "__pycache__"]
+             for f in files:
+                 if f.endswith(".py") and (
+                     f.startswith("test_") or f.endswith("_test.py")
+                 ):
+                     rel = os.path.relpath(os.path.join(root, f), repo_path)
+                     test_files.append(rel)
+         return test_files
+
+     def _match_source_file(
+         self, repo_path: str, test_file: str
+     ) -> Optional[str]:
+         """Given test_foo.py, find foo.py in common source locations."""
+         test_basename = os.path.basename(test_file)
+
+         if test_basename.startswith("test_"):
+             source_name = test_basename[5:]  # Remove "test_" prefix
+         elif test_basename.endswith("_test.py"):
+             source_name = test_basename[:-8] + ".py"
+         else:
+             return None
+
+         # Search common source locations
+         search_dirs = ["src", "lib", "."]
+
+         # Also try package directories (dirs with __init__.py)
+         try:
+             for item in os.listdir(repo_path):
+                 item_path = os.path.join(repo_path, item)
+                 if os.path.isdir(item_path) and os.path.exists(
+                     os.path.join(item_path, "__init__.py")
+                 ):
+                     search_dirs.append(item)
+         except Exception:
+             pass
+
+         for search_dir in search_dirs:
+             if search_dir == ".":
+                 candidate = source_name
+             else:
+                 candidate = os.path.join(search_dir, source_name)
+
+             if os.path.exists(os.path.join(repo_path, candidate)):
+                 return candidate
+
+             # Also search immediate subdirectories of this location
+             src_dir = os.path.join(repo_path, search_dir)
+             if os.path.isdir(src_dir):
+                 for sub in os.listdir(src_dir):
+                     sub_candidate = os.path.join(search_dir, sub, source_name)
+                     if os.path.exists(os.path.join(repo_path, sub_candidate)):
+                         return sub_candidate
+
+         return None
+
+     def _verify_import(
+         self, repo_path: str, test_file: str, source_file: str
+     ) -> bool:
+         """Check if test_file likely imports from source_file (basic heuristic)."""
+         try:
+             base_name = os.path.splitext(os.path.basename(source_file))[0]
+             with open(os.path.join(repo_path, test_file)) as f:
+                 test_content = f.read()
+             return base_name in test_content
+         except Exception:
+             return False
+
+     def _count_tests(self, test_file_path: str) -> int:
+         """Count test functions/methods in a test file using AST."""
+         try:
+             with open(test_file_path) as f:
+                 tree = ast.parse(f.read())
+             count = 0
+             for node in ast.walk(tree):
+                 if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
+                     if node.name.startswith("test_"):
+                         count += 1
+             return count
+         except Exception:
+             return 0
+
+     def _generate_stub(self, original_content: str) -> str:
+         """Generate a stub module with correct function/class signatures but broken implementations.
+
+         Parses the original source with AST to extract all top-level function
+         and class definitions, then generates a stub that:
+         - Has the same imports (so dependencies resolve)
+         - Has the same function/class names with correct signatures
+         - Returns None from every function body
+         """
+         try:
+             tree = ast.parse(original_content)
+         except SyntaxError:
+             return "# Stub: original file could not be parsed\n"
+
+         lines = ["# STUB: This file needs to be reimplemented.\n"]
+         lines.append("# All functions return None — tests will fail.\n\n")
+
+         # Preserve imports from the original
+         for node in ast.iter_child_nodes(tree):
+             if isinstance(node, (ast.Import, ast.ImportFrom)):
+                 segment = ast.get_source_segment(original_content, node)
+                 if segment:
+                     lines.append(segment + "\n")
+
+         lines.append("\n")
+
+         # Generate stub functions/classes
+         for node in ast.iter_child_nodes(tree):
+             if isinstance(node, ast.FunctionDef):
+                 # Extract the full signature from source using body start line
+                 func_lines = original_content.splitlines()
+                 # Signature spans from the def line to the line before the body
+                 body_start = node.body[0].lineno  # 1-indexed
+                 sig_lines = func_lines[node.lineno - 1 : body_start - 1]
+                 signature = "\n".join(sig_lines)
+                 if not signature.rstrip().endswith(":"):
+                     signature = signature.rstrip() + ":"
+                 lines.append(f"{signature}\n")
+                 lines.append("    return None\n\n")
+
+             elif isinstance(node, ast.ClassDef):
+                 lines.append(f"class {node.name}:\n")
+                 lines.append("    pass\n\n")
+
+             elif isinstance(node, ast.Assign):
+                 # Preserve top-level variable assignments
+                 segment = ast.get_source_segment(original_content, node)
+                 if segment:
+                     lines.append(segment + "\n")
+
+         return "".join(lines)
+
+     def _patch_init_files(self, repo_path: str, removed_source: str) -> None:
+         """Remove imports of the deleted module from __init__.py files.
+
+         When a module like `package/number.py` is removed, the package's
+         `__init__.py` may do `from package.number import ...` which would
+         crash the entire package import. We comment out those lines.
+         """
+         module_base = os.path.splitext(os.path.basename(removed_source))[0]
+         source_dir = os.path.dirname(removed_source)
+
+         # Check __init__.py in the same directory as the removed file
+         init_path = os.path.join(repo_path, source_dir, "__init__.py")
+         if not os.path.exists(init_path):
+             return
+
+         try:
+             with open(init_path, "r") as f:
+                 lines = f.readlines()
+
+             patched = []
+             in_multiline_import = False
+             for line in lines:
+                 # Comment out imports referencing the removed module
+                 if in_multiline_import:
+                     patched.append(f"# [RLM-FORGE REMOVED] {line}")
+                     if ")" in line:
+                         in_multiline_import = False
+                 elif f".{module_base}" in line and ("import" in line or "from" in line):
+                     patched.append(f"# [RLM-FORGE REMOVED] {line}")
+                     if "(" in line and ")" not in line:
+                         in_multiline_import = True
+                 elif f'"{module_base}"' in line or f"'{module_base}'" in line:
+                     # __all__ string references are left untouched
+                     patched.append(line)
+                 else:
+                     patched.append(line)
+
+             with open(init_path, "w") as f:
+                 f.writelines(patched)
+         except Exception:
+             pass
+
+     def _count_lines(self, file_path: str) -> int:
+         try:
+             with open(file_path) as f:
+                 return sum(1 for _ in f)
+         except Exception:
+             return 0
+
+     def extract_feature(
+         self, repo_path: str, source_path: str, test_path: str
+     ) -> ExtractedFeature:
+         """Remove source file and create the ExtractedFeature."""
+         full_source = os.path.join(repo_path, source_path)
+         full_test = os.path.join(repo_path, test_path)
+
+         # Save original content
+         with open(full_source, "r") as f:
+             original_content = f.read()
+
+         # Count tests
+         num_tests = self._count_tests(full_test)
+
+         # Replace the source file with a stub that has correct signatures
+         # but wrong implementations. This ensures:
+         # - Other modules can still import from it (no cascading ImportErrors)
+         # - Tests FAIL (not ERROR), giving a better reward signal
+         # - The agent's job is to write the correct implementation
+         stub = self._generate_stub(original_content)
+         with open(full_source, "w") as f:
+             f.write(stub)
+
+         # Generate task description
+         task_description = (
+             f"The file `{source_path}` has been replaced with a broken stub. "
+             f"{num_tests} tests in `{test_path}` are now failing. "
+             f"Your task is to explore the repository, understand the expected behavior "
+             f"from the tests and other code, and rewrite `{source_path}` with a correct "
+             f"implementation so that all tests pass.\n\n"
+             f"Available tools:\n"
+             f"  read_file(path) - Read a file from the repo\n"
+             f"  list_dir(path='.') - List directory contents\n"
+             f"  search(pattern, path='.') - Grep for a pattern\n"
+             f"  write_file(path, content) - Write/create a file\n"
+             f"  run_tests(test_path=None) - Run pytest on a test file\n"
+             f"  spawn_agent(scope, mission, budget=5) - Explore a directory scope\n"
+             f"  FINAL() - Signal that your implementation is complete\n\n"
+             f"Call FINAL() when you believe your implementation is complete."
+         )
+
+         return ExtractedFeature(
+             source_path=source_path,
+             test_path=test_path,
+             original_content=original_content,
+             num_tests=num_tests,
+             difficulty="medium",
+             task_description=task_description,
+         )
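
To make _generate_stub concrete, here is a hypothetical input module and, roughly, the stub it produces (the slug function is invented for illustration):

    original = (
        "import re\n"
        "\n"
        "def slug(text):\n"
        "    return re.sub(r'\\W+', '-', text)\n"
    )
    print(FeatureExtractor()._generate_stub(original))
    # Output (roughly):
    #   # STUB: This file needs to be reimplemented.
    #   # All functions return None — tests will fail.
    #
    #   import re
    #
    #   def slug(text):
    #       return None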
rlm_forge/server/repo_manager.py ADDED
@@ -0,0 +1,106 @@
+ """Repository cloning, dependency installation, and manifest generation."""
+
+ import os
+ import shutil
+ import subprocess
+ import sys
+ import tempfile
+
+
+ class RepoManager:
+     """Manages repository cloning and lifecycle."""
+
+     def __init__(self, cache_dir: str = "/tmp/rlm_forge_repos"):
+         self.cache_dir = cache_dir
+         os.makedirs(cache_dir, exist_ok=True)
+
+     def clone_repo(self, repo_url: str) -> str:
+         """Clone repo to a unique temp directory. Returns path."""
+         work_dir = tempfile.mkdtemp(dir=self.cache_dir, prefix="rlm_")
+         subprocess.run(
+             ["git", "clone", "--depth=1", repo_url, work_dir],
+             check=True,
+             capture_output=True,
+             timeout=120,
+         )
+         return work_dir
+
+     def copy_pre_cloned(self, pre_cloned_path: str) -> str:
+         """Copy a pre-cloned repo directory for a fresh episode. Returns new path."""
+         work_dir = tempfile.mkdtemp(dir=self.cache_dir, prefix="rlm_")
+         # Remove the empty temp dir first, then copy
+         shutil.rmtree(work_dir)
+         shutil.copytree(pre_cloned_path, work_dir)
+         return work_dir
+
+     def install_dependencies(self, repo_path: str) -> bool:
+         """Best-effort dependency installation using uv pip (falls back to pip)."""
+         uv_path = shutil.which("uv")
+
+         # Build install command: prefer uv pip, fall back to sys.executable -m pip
+         def _pip_install(args: list[str]) -> bool:
+             if uv_path:
+                 cmd = [uv_path, "pip", "install"] + args
+             else:
+                 cmd = [sys.executable, "-m", "pip", "install"] + args
+             try:
+                 subprocess.run(
+                     cmd, capture_output=True, timeout=120, check=True
+                 )
+                 return True
+             except Exception:
+                 return False
+
+         # Try pyproject.toml / setup.py first
+         has_pyproject = os.path.exists(os.path.join(repo_path, "pyproject.toml"))
+         has_setup = os.path.exists(os.path.join(repo_path, "setup.py"))
+         if has_pyproject or has_setup:
+             if _pip_install(["-e", repo_path]):
+                 return True
+
+         # Try requirements.txt
+         req_file = os.path.join(repo_path, "requirements.txt")
+         if os.path.exists(req_file):
+             if _pip_install(["-r", req_file]):
+                 return True
+
+         return False
+
+     def generate_manifest(self, repo_path: str) -> dict:
+         """Generate a high-level manifest of the repo structure."""
+         manifest: dict = {"files": [], "total_files": 0, "total_loc": 0}
+
+         for root, dirs, files in os.walk(repo_path):
+             dirs[:] = [
+                 d for d in dirs if not d.startswith(".") and d != "__pycache__"
+             ]
+             for f in files:
+                 if f.endswith(".py"):
+                     full_path = os.path.join(root, f)
+                     rel_path = os.path.relpath(full_path, repo_path)
+                     try:
+                         with open(full_path) as fh:
+                             loc = sum(1 for _ in fh)
+                     except Exception:
+                         loc = 0
+                     manifest["files"].append({"path": rel_path, "loc": loc})
+                     manifest["total_files"] += 1
+                     manifest["total_loc"] += loc
+
+         # Read README excerpt if available
+         for readme_name in ["README.md", "README.rst", "README.txt", "README"]:
+             readme_path = os.path.join(repo_path, readme_name)
+             if os.path.exists(readme_path):
+                 try:
+                     with open(readme_path) as f:
+                         manifest["readme_excerpt"] = f.read()[:2000]
+                 except Exception:
+                     pass
+                 break
+
+         return manifest
+
+     def cleanup(self, repo_path: str):
+         """Remove cloned repo directory."""
+         if repo_path and repo_path.startswith(self.cache_dir):
+             shutil.rmtree(repo_path, ignore_errors=True)
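
For reference, generate_manifest returns a plain dict of this shape (the values below are illustrative, not from a real run; readme_excerpt is only present when a README is found):

    manifest = {
        "files": [
            {"path": "slugify/slugify.py", "loc": 180},
            {"path": "test.py", "loc": 450},
        ],
        "total_files": 2,
        "total_loc": 630,
        "readme_excerpt": "first 2000 chars of the README",
    }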
rlm_forge/server/reward.py ADDED
@@ -0,0 +1,169 @@
+ """Composite reward computation for RLM-Forge episodes."""
+
+ import ast
+ import os
+ import re
+ import subprocess
+
+
+ class RewardComputer:
+     """Computes composite reward: test pass rate + structural validity + efficiency."""
+
+     def __init__(
+         self,
+         test_weight: float = 0.55,
+         structural_weight: float = 0.15,
+         efficiency_weight: float = 0.30,
+     ):
+         self.test_weight = test_weight
+         self.structural_weight = structural_weight
+         self.efficiency_weight = efficiency_weight
+
+     def compute(
+         self,
+         repo_path: str,
+         target_test: str,
+         files_written: dict[str, str],
+         max_iterations: int,
+         iterations_used: int,
+         baseline_test_count: int,
+     ) -> dict:
+         """Compute composite reward. Returns detailed breakdown."""
+         # 1. Test pass rate (55%)
+         test_result = self._run_target_tests(repo_path, target_test)
+         total_tests = max(test_result["total"], baseline_test_count, 1)
+         test_pass_rate = test_result["passed"] / total_tests
+
+         # 2. Structural validity (15%)
+         structural_score = self._compute_structural(repo_path, files_written)
+
+         # 3. Efficiency (30%)
+         efficiency_score = self._compute_efficiency(iterations_used, max_iterations)
+
+         # Composite
+         total = (
+             self.test_weight * test_pass_rate
+             + self.structural_weight * structural_score
+             + self.efficiency_weight * efficiency_score
+         )
+
+         return {
+             "total_reward": round(total, 4),
+             "test_pass_rate": round(test_pass_rate, 4),
+             "tests_passed": test_result["passed"],
+             "tests_failed": test_result["failed"],
+             "tests_total": test_result["total"],
+             "structural_score": round(structural_score, 4),
+             "efficiency_score": round(efficiency_score, 4),
+             "breakdown": {
+                 "test_component": round(self.test_weight * test_pass_rate, 4),
+                 "structural_component": round(
+                     self.structural_weight * structural_score, 4
+                 ),
+                 "efficiency_component": round(
+                     self.efficiency_weight * efficiency_score, 4
+                 ),
+             },
+             "test_output": test_result.get("output", "")[:2000],
+         }
+
+     def _run_target_tests(self, repo_path: str, test_path: str) -> dict:
+         """Run the target test file and parse results."""
+         import sys
+
+         cmd = [sys.executable, "-m", "pytest", "-v", "--tb=short", "--no-header"]
+         cmd.append(os.path.join(repo_path, test_path))
+
+         try:
+             result = subprocess.run(
+                 cmd,
+                 capture_output=True,
+                 text=True,
+                 timeout=60,
+                 cwd=repo_path,
+             )
+             raw_output = result.stdout + result.stderr
+             # Strip ANSI color codes for reliable parsing
+             output = re.sub(r"\x1b\[[0-9;]*m", "", raw_output)
+             passed = len(re.findall(r" PASSED", output))
+             failed = len(re.findall(r" FAILED", output))
+             errors = len(re.findall(r" ERROR", output))
+
+             return {
+                 "passed": passed,
+                 "failed": failed,
+                 "errors": errors,
+                 "total": passed + failed + errors,
+                 "output": output[:3000],
+                 "returncode": result.returncode,
+             }
+         except subprocess.TimeoutExpired:
+             return {
+                 "passed": 0,
+                 "failed": 0,
+                 "errors": 1,
+                 "total": 1,
+                 "output": "Test execution timed out",
+                 "returncode": -1,
+             }
+
+     def _compute_structural(
+         self, repo_path: str, files_written: dict[str, str]
+     ) -> float:
+         """Check structural validity of written files."""
+         if not files_written:
+             return 0.0
+
+         file_scores = []
+         for path, content in files_written.items():
+             # Parse check (weight 0.3)
+             try:
+                 ast.parse(content)
+                 parse_ok = 1.0
+             except SyntaxError:
+                 parse_ok = 0.0
+
+             # Import check (weight 0.3)
+             module_name = path.removesuffix(".py").replace("/", ".")
+             try:
+                 import sys
+
+                 result = subprocess.run(
+                     [
+                         sys.executable,
+                         "-c",
+                         f"import importlib; importlib.import_module('{module_name}')",
+                     ],
+                     capture_output=True,
+                     timeout=10,
+                     cwd=repo_path,
+                 )
+                 import_ok = 1.0 if result.returncode == 0 else 0.0
+             except Exception:
+                 import_ok = 0.0
+
+             file_scores.append(0.3 * parse_ok + 0.3 * import_ok)
+
+         avg_file_score = sum(file_scores) / len(file_scores)
+
+         # Regression check (weight 0.4)
+         # For hackathon: assume no regressions since we only modify the removed file
+         regression_score = 0.4
+
+         return avg_file_score + regression_score
+
+     def _compute_efficiency(
+         self, iterations_used: int, max_iterations: int
+     ) -> float:
+         """Tiered efficiency score."""
+         if max_iterations <= 0:
+             return 0.0
+         ratio = iterations_used / max_iterations
+         if ratio <= 0.5:
+             return 1.0
+         elif ratio <= 0.75:
+             return 0.75
+         elif ratio <= 1.0:
+             return 0.5
+         else:
+             return 0.0
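
As a worked check of these weights against the baseline smoke test in the notebook later in this commit (stub in place, 14 of 82 tests passing, no files written, FINAL() on step 2 of 10):

    test_component       = 0.55 * (14 / 82)  # ≈ 0.0939
    structural_component = 0.15 * 0.0        # files_written is empty
    efficiency_component = 0.30 * 1.0        # 2/10 = 0.2 falls in the <= 0.5 tier
    total_reward         = 0.0939 + 0.0 + 0.30  # = 0.3939, matching the notebook output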
rlm_forge/server/sandbox.py ADDED
@@ -0,0 +1,213 @@
+ """Sandboxed Python REPL using exec() with persistent globals."""
+
+ import contextlib
+ import io
+ import os
+ import re
+ import subprocess
+
+
+ class REPLSandbox:
+     """Sandboxed Python REPL with built-in tool functions for repo exploration."""
+
+     def __init__(self, repo_path: str, max_output_chars: int = 5000):
+         self.repo_path = os.path.realpath(repo_path)
+         self.max_output_chars = max_output_chars
+         self.files_written: dict[str, str] = {}
+         self._final_called = False
+         self._sub_agents_spawned = 0
+
+         self.globals_dict: dict = {"__builtins__": __builtins__}
+         self.globals_dict.update(
+             {
+                 "read_file": self._read_file,
+                 "list_dir": self._list_dir,
+                 "search": self._search,
+                 "write_file": self._write_file,
+                 "run_tests": self._run_tests,
+                 "spawn_agent": self._spawn_agent,
+                 "FINAL": self._final,
+             }
+         )
+
+     def execute(self, code: str) -> dict:
+         """Execute code in the sandbox, return stdout/stderr/success."""
+         stdout_capture = io.StringIO()
+         stderr_capture = io.StringIO()
+
+         try:
+             with contextlib.redirect_stdout(stdout_capture), contextlib.redirect_stderr(
+                 stderr_capture
+             ):
+                 exec(code, self.globals_dict)
+             success = True
+         except Exception as e:
+             stderr_capture.write(f"{type(e).__name__}: {e}\n")
+             success = False
+
+         stdout = stdout_capture.getvalue()[: self.max_output_chars]
+         stderr = stderr_capture.getvalue()[: self.max_output_chars]
+
+         return {
+             "stdout": stdout,
+             "stderr": stderr,
+             "success": success,
+             "final_called": self._final_called,
+         }
+
+     def _validate_path(self, path: str) -> str:
+         """Ensure path stays within repo. Returns the real absolute path."""
+         full_path = os.path.join(self.repo_path, path)
+         real_path = os.path.realpath(full_path)
+         # Compare against repo_path plus a separator so that a sibling
+         # directory like "/tmp/repo2" cannot slip past a bare prefix
+         # check on "/tmp/repo"
+         if real_path != self.repo_path and not real_path.startswith(
+             self.repo_path + os.sep
+         ):
+             raise PermissionError(f"Access denied: {path}")
+         return real_path
+
+     def _read_file(self, path: str) -> str:
+         """Read a file from the repo. Path relative to repo root."""
+         real_path = self._validate_path(path)
+         with open(real_path, "r") as f:
+             content = f.read()
+         if len(content) > 10000:
+             content = content[:10000] + "\n... [truncated]"
+         return content
+
+     def _list_dir(self, path: str = ".") -> list[str]:
+         """List directory contents relative to repo root."""
+         real_path = self._validate_path(path)
+         entries = os.listdir(real_path)
+         result = []
+         for e in sorted(entries):
+             full = os.path.join(real_path, e)
+             suffix = "/" if os.path.isdir(full) else ""
+             result.append(e + suffix)
+         return result
+
+     def _search(self, pattern: str, path: str = ".") -> list[str]:
+         """Grep for pattern in repo files. Returns list of matches."""
+         real_path = self._validate_path(path)
+         results = []
+         try:
+             output = subprocess.run(
+                 ["grep", "-rn", "--include=*.py", pattern, real_path],
+                 capture_output=True,
+                 text=True,
+                 timeout=10,
+             )
+             for line in output.stdout.strip().split("\n")[:50]:
+                 if line:
+                     results.append(line.replace(self.repo_path + "/", ""))
+         except Exception:
+             pass
+         return results
+
+     def _write_file(self, path: str, content: str) -> str:
+         """Write a file to the repo. Records it for evaluation."""
+         real_path = self._validate_path(path)
+         os.makedirs(os.path.dirname(real_path), exist_ok=True)
+         with open(real_path, "w") as f:
+             f.write(content)
+         self.files_written[path] = content
+         return f"Written {len(content)} chars to {path}"
+
+     def _run_tests(self, test_path: str | None = None) -> dict:
+         """Run pytest on specified test file(s). Returns pass/fail summary."""
+         import sys
+
+         cmd = [sys.executable, "-m", "pytest", "-v", "--tb=short", "--no-header"]
+         if test_path:
+             cmd.append(os.path.join(self.repo_path, test_path))
+         else:
+             cmd.append(self.repo_path)
+
+         try:
+             result = subprocess.run(
+                 cmd,
+                 capture_output=True,
+                 text=True,
+                 timeout=60,
+                 cwd=self.repo_path,
+             )
+             raw_output = result.stdout + result.stderr
+             # Strip ANSI color codes for reliable parsing
+             output = re.sub(r"\x1b\[[0-9;]*m", "", raw_output)
+             passed = len(re.findall(r" PASSED", output))
+             failed = len(re.findall(r" FAILED", output))
+             errors = len(re.findall(r" ERROR", output))
+             output_truncated = output[: self.max_output_chars]
+
+             return {
+                 "passed": passed,
+                 "failed": failed,
+                 "errors": errors,
+                 "total": passed + failed + errors,
+                 "output": output_truncated,
+                 "returncode": result.returncode,
+             }
+         except subprocess.TimeoutExpired:
+             return {
+                 "passed": 0,
+                 "failed": 0,
+                 "errors": 1,
+                 "total": 1,
+                 "output": "Test execution timed out (60s limit)",
+                 "returncode": -1,
+             }
+
+     def _spawn_agent(self, scope: str, mission: str, budget: int = 5) -> dict:
+         """Stateless sub-LM call. Gathers scoped context and returns a structured report."""
+         self._sub_agents_spawned += 1
+         scope_path = os.path.join(self.repo_path, scope)
+
+         if not os.path.exists(scope_path):
+             return {
+                 "error": f"Scope path not found: {scope}",
+                 "summary": "",
+                 "files_examined": [],
+             }
+
+         # Build file listing for the scope
+         files = []
+         for root, dirs, filenames in os.walk(scope_path):
+             dirs[:] = [d for d in dirs if not d.startswith(".") and d != "__pycache__"]
+             for f in filenames:
+                 if f.endswith(".py"):
+                     rel = os.path.relpath(os.path.join(root, f), self.repo_path)
+                     files.append(rel)
+
+         # Read first few files to build context
+         context_parts = []
+         for fpath in files[:5]:
+             try:
+                 content = self._read_file(fpath)
+                 context_parts.append(f"--- {fpath} ---\n{content[:2000]}")
+             except Exception:
+                 pass
+
+         report = {
+             "summary": (
+                 f"Explored scope '{scope}' for mission: {mission}. "
+                 f"Found {len(files)} Python files."
+             ),
+             "files_examined": files[:10],
+             "file_contents_preview": context_parts[:3],
+             "mission": mission,
+         }
+         return report
+
+     def _final(self) -> str:
+         """Signal episode completion."""
+         self._final_called = True
+         return "Episode marked as complete. Evaluating..."
+
+     @property
+     def available_functions(self) -> list[str]:
+         return [
+             "read_file(path)",
+             "list_dir(path='.')",
+             "search(pattern, path='.')",
+             "write_file(path, content)",
+             "run_tests(test_path=None)",
+             "spawn_agent(scope, mission, budget=5)",
+             "FINAL()",
+         ]
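
A quick sketch of the sandbox in isolation (the repo path is illustrative):

    sandbox = REPLSandbox("/tmp/rlm_forge_repos/rlm_demo")
    result = sandbox.execute("print(len(list_dir()))")
    print(result["stdout"], result["success"])
    # Escapes are caught inside execute(): the PermissionError raised by
    # _validate_path lands in stderr and success comes back False.
    result = sandbox.execute("read_file('../../etc/passwd')")
    print(result["success"], result["stderr"])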
rlm_forge_training.ipynb ADDED
@@ -0,0 +1,802 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# RLM-Forge: Training LLMs with GRPO on Coding Tasks\n",
8
+ "\n",
9
+ "**RLM-Forge** is an OpenEnv environment that trains language models to solve coding tasks using Recursive Language Model (RLM) patterns. The environment:\n",
10
+ "\n",
11
+ "1. Clones a Python repository\n",
12
+ "2. Replaces a source file with a broken stub (correct signatures, wrong implementations)\n",
13
+ "3. Provides a sandboxed REPL with tools (read_file, list_dir, search, write_file, run_tests)\n",
14
+ "4. The agent must explore the repo, understand the tests, and rewrite the source file\n",
15
+ "5. Reward = test pass rate (55%) + structural validity (15%) + efficiency (30%)\n",
16
+ "\n",
17
+ "This notebook trains a model using **GRPO (Group Relative Policy Optimization)** with multi-step trajectory concatenation."
18
+ ]
19
+ },
20
+ {
21
+ "cell_type": "markdown",
22
+ "metadata": {},
23
+ "source": [
24
+ "## 1. Setup & Installation"
25
+ ]
26
+ },
27
+ {
28
+ "cell_type": "code",
29
+ "execution_count": null,
30
+ "metadata": {},
31
+ "outputs": [],
32
+ "source": [
33
+ "%%capture\n",
34
+ "# Install dependencies\n",
35
+ "!pip install -q \"openenv-core[core]>=0.2.0\" trl transformers accelerate bitsandbytes peft datasets\n",
36
+ "!pip install -q text-unidecode freezegun pytest vllm\n",
37
+ "\n",
38
+ "# Clone RLM-Forge repo\n",
39
+ "!git clone https://github.com/kking112/rlm-forge.git content/rlm-forge 2>/dev/null || true\n",
40
+ "# Or upload files manually — adjust path as needed\n",
41
+ "# import sys\n",
42
+ "# sys.path.insert(0, \"content/rlm-forge\")\n",
43
+ "\n",
44
+ "# Install RLM-Forge\n",
45
+ "!pip install -q -e content/rlm-forge"
46
+ ]
47
+ },
48
+ {
49
+ "cell_type": "code",
50
+ "execution_count": 1,
51
+ "metadata": {},
52
+ "outputs": [
53
+ {
54
+ "name": "stdout",
55
+ "output_type": "stream",
56
+ "text": [
57
+ "GPU: NVIDIA RTX PRO 6000 Blackwell Workstation Edition\n",
58
+ "PyTorch: 2.10.0+cu128\n"
59
+ ]
60
+ }
61
+ ],
62
+ "source": [
63
+ "import torch\n",
64
+ "import json\n",
65
+ "import re\n",
66
+ "import random\n",
67
+ "from typing import Optional\n",
68
+ "from dataclasses import dataclass\n",
69
+ "\n",
70
+ "print(f\"GPU: {torch.cuda.get_device_name(0)}\")\n",
71
+ "# print(f\"VRAM: {torch.cuda.get_device_properties(0). / 1e9:.1f} GB\")\n",
72
+ "print(f\"PyTorch: {torch.__version__}\")"
73
+ ]
74
+ },
75
+ {
76
+ "cell_type": "markdown",
77
+ "metadata": {},
78
+ "source": [
79
+ "## 2. Environment Smoke Test\n",
80
+ "\n",
81
+ "Verify the environment works before training."
82
+ ]
83
+ },
84
+ {
85
+ "cell_type": "code",
86
+ "execution_count": 2,
87
+ "metadata": {},
88
+ "outputs": [
89
+ {
90
+ "name": "stdout",
91
+ "output_type": "stream",
92
+ "text": [
93
+ "Task: The file `slugify/slugify.py` has been replaced with a broken stub. 82 tests in `test.py` are now failing. Your task is to explore the repository, understand the expected behavior from the tests and o...\n",
94
+ "Available tools: ['read_file(path)', \"list_dir(path='.')\", \"search(pattern, path='.')\", 'write_file(path, content)', 'run_tests(test_path=None)', 'spawn_agent(scope, mission, budget=5)', 'FINAL()']\n",
95
+ "\n",
96
+ "Step 1 stdout: ['.git/', '.github/', '.gitignore', '.pytest_cache/', '.vscode/', 'CHANGELOG.md', 'LICENSE', 'MANIFEST.in', 'README.md', '__pycache__/', 'dev.requirements.txt', 'format.sh', 'pyproject.toml', 'python_\n",
97
+ "\n",
98
+ "Baseline reward (no implementation): 0.3939\n",
99
+ "Test results: {'total_reward': 0.3939, 'test_pass_rate': 0.1707, 'tests_passed': 14, 'tests_failed': 68, 'tests_total': 82, 'structural_score': 0.0, 'efficiency_score': 1.0, 'breakdown': {'test_component': 0.0939, 'structural_component': 0.0, 'efficiency_component': 0.3}, 'test_output': '============================= test session starts ==============================\\ncollecting ... collected 82 items\\n\\ntest.py::TestSlugify::test_accented_text FAILED [ 1%]\\ntest.py::TestSlugify::test_accented_text_with_non_word_characters FAILED [ 2%]\\ntest.py::TestSlugify::test_contains_numbers FAILED [ 3%]\\ntest.py::TestSlugify::test_custom_separator FAILED [ 4%]\\ntest.py::TestSlugify::test_cyrillic_text FAILED [ 6%]\\ntest.py::TestSlugify::test_differently_cased_stopword_match FAILED [ 7%]\\ntest.py::TestSlugify::test_ends_with_number FAILED [ 8%]\\ntest.py::TestSlugify::test_extraneous_seperators FAILED [ 9%]\\ntest.py::TestSlugify::test_html_decimal_off FAILED [ 10%]\\ntest.py::TestSlugify::test_html_decimal_on FAILED [ 12%]\\ntest.py::TestSlugify::test_html_entities_off FAILED [ 13%]\\ntest.py::TestSlugify::test_html_entities_on FAILED [ 14%]\\ntest.py::TestSlugify::test_html_hexadecimal_off FAILED [ 15%]\\ntest.py::TestSlugify::test_html_hexadecimal_on FAILED [ 17%]\\ntest.py::TestSlugify::test_max_length FAILED [ 18%]\\ntest.py::TestSlugify::test_max_length_cutoff_not_required FAILED [ 19%]\\ntest.py::TestSlugify::test_multi_character_separator FAILED [ 20%]\\ntest.py::TestSlugify::test_multiple_stopword_occurances FAILED [ 21%]\\ntest.py::TestSlugify::test_multiple_stopwords FAILED [ 23%]\\ntest.py::TestSlugify::test_non_word_characters FAILED [ 24%]\\ntest.py::TestSlugify::test_numbers_and_symbols FAILED [ 25%]\\ntest.py::TestSlugify::test_numbers_only FAILED [ 26%]\\ntest.py::TestSlugify::test_phonetic_conversion_of_eastern_scripts FAILED [ 28%]\\ntest.py::TestSlugify::test_pre_translation P'}\n"
100
+ ]
101
+ }
102
+ ],
103
+ "source": [
104
+ "from rlm_forge.server.environment import RLMForgeEnvironment\n",
105
+ "from rlm_forge.models import RLMForgeAction\n",
106
+ "\n",
107
+ "env = RLMForgeEnvironment()\n",
108
+ "\n",
109
+ "# Run a quick episode\n",
110
+ "obs = env.reset(seed=1)\n",
111
+ "print(f\"Task: {obs.task_description[:200]}...\")\n",
112
+ "print(f\"Available tools: {obs.available_functions}\")\n",
113
+ "\n",
114
+ "# Take a step — list files\n",
115
+ "obs2 = env.step(RLMForgeAction(code=\"print(list_dir())\"))\n",
116
+ "print(f\"\\nStep 1 stdout: {obs2.stdout[:200]}\")\n",
117
+ "\n",
118
+ "# Finalize and get reward\n",
119
+ "obs3 = env.step(RLMForgeAction(code=\"FINAL()\"))\n",
120
+ "print(f\"\\nBaseline reward (no implementation): {obs3.reward:.4f}\")\n",
121
+ "print(f\"Test results: {obs3.test_results}\")\n",
122
+ "\n",
123
+ "env.cleanup()"
124
+ ]
125
+ },
126
+ {
127
+ "cell_type": "markdown",
128
+ "metadata": {},
129
+ "source": [
130
+ "## 3. Load Model\n",
131
+ "\n",
132
+ "We use Qwen2.5-Coder-32B-Instruct with 4-bit quantization for inference, and train a LoRA adapter with GRPO."
133
+ ]
134
+ },
135
+ {
136
+ "cell_type": "code",
137
+ "execution_count": null,
138
+ "metadata": {},
139
+ "outputs": [],
140
+ "source": [
141
+ "# Model config — adjust based on available VRAM\n",
142
+ "# MODEL_ID = \"Qwen/Qwen2.5-Coder-32B-Instruct\" # 32B for H100\n",
143
+ "MODEL_ID = \"Qwen/Qwen2.5-Coder-7B-Instruct\" # Fallback for smaller GPUs\n",
144
+ "HF_TOKEN = '' #! Fill in HF TOKEN HERE!\n",
145
+ "MAX_STEPS_PER_EPISODE = 6 # Max REPL interactions per episode\n",
146
+ "NUM_EPISODES_PER_PROMPT = 4 # GRPO group size (completions per prompt)\n",
147
+ "NUM_TRAINING_PROMPTS = 16 # Total unique prompts (episodes) for training\n",
148
+ "GRPO_EPOCHS = 2 # Training epochs over collected data\n",
149
+ "BATCH_SIZE = 2\n",
150
+ "GRAD_ACCUM = 4"
151
+ ]
152
+ },
153
+ {
154
+ "cell_type": "code",
155
+ "execution_count": 4,
156
+ "metadata": {},
157
+ "outputs": [],
158
+ "source": [
159
+ "from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig\n",
160
+ "from peft import LoraConfig, get_peft_model\n",
161
+ "\n",
162
+ "# 4-bit quantization for 32B model on H100\n",
163
+ "bnb_config = BitsAndBytesConfig(\n",
164
+ " load_in_4bit=True,\n",
165
+ " bnb_4bit_quant_type=\"nf4\",\n",
166
+ " bnb_4bit_compute_dtype=torch.bfloat16,\n",
167
+ " bnb_4bit_use_double_quant=True,\n",
168
+ ")\n",
169
+ "\n",
170
+ "tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True,token=HF_TOKEN)\n",
171
+ "if tokenizer.pad_token is None:\n",
172
+ " tokenizer.pad_token = tokenizer.eos_token\n"
173
+ ]
174
+ },
175
+ {
176
+ "cell_type": "code",
177
+ "execution_count": null,
178
+ "metadata": {},
179
+ "outputs": [
180
+ {
181
+ "name": "stderr",
182
+ "output_type": "stream",
183
+ "text": [
184
+ "`torch_dtype` is deprecated! Use `dtype` instead!\n",
185
+ "/home/neo/Desktop/Projects/OpenEnv_Hackathon_SF/V1/.venv/lib/python3.13/site-packages/torch/jit/_script.py:362: DeprecationWarning: `torch.jit.script_method` is deprecated. Please switch to `torch.compile` or `torch.export`.\n",
186
+ " warnings.warn(\n"
187
+ ]
188
+ },
189
+ {
190
+ "data": {
191
+ "application/vnd.jupyter.widget-view+json": {
192
+ "model_id": "1c8654adb1804c6e944f84026e38a81b",
193
+ "version_major": 2,
194
+ "version_minor": 0
195
+ },
196
+ "text/plain": [
197
+ "Fetching 4 files: 0%| | 0/4 [00:00<?, ?it/s]"
198
+ ]
199
+ },
200
+ "metadata": {},
201
+ "output_type": "display_data"
202
+ },
203
+ {
204
+ "data": {
205
+ "application/vnd.jupyter.widget-view+json": {
206
+ "model_id": "ac2529cdddcf450ea1d5a50f2cea7814",
207
+ "version_major": 2,
208
+ "version_minor": 0
209
+ },
210
+ "text/plain": [
211
+ "model-00003-of-00004.safetensors: 0%| | 0.00/4.33G [00:00<?, ?B/s]"
212
+ ]
213
+ },
214
+ "metadata": {},
215
+ "output_type": "display_data"
216
+ },
217
+ {
218
+ "data": {
219
+ "application/vnd.jupyter.widget-view+json": {
220
+ "model_id": "dd5ac4ddfc7b46e2a1ca9515752aa745",
221
+ "version_major": 2,
222
+ "version_minor": 0
223
+ },
224
+ "text/plain": [
225
+ "model-00001-of-00004.safetensors: 0%| | 0.00/4.88G [00:00<?, ?B/s]"
226
+ ]
227
+ },
228
+ "metadata": {},
229
+ "output_type": "display_data"
230
+ },
231
+ {
232
+ "data": {
233
+ "application/vnd.jupyter.widget-view+json": {
234
+ "model_id": "a354cc2372c6467eb49761d5fc153940",
235
+ "version_major": 2,
236
+ "version_minor": 0
237
+ },
238
+ "text/plain": [
239
+ "model-00004-of-00004.safetensors: 0%| | 0.00/1.09G [00:00<?, ?B/s]"
240
+ ]
241
+ },
242
+ "metadata": {},
243
+ "output_type": "display_data"
244
+ },
245
+ {
246
+ "data": {
247
+ "application/vnd.jupyter.widget-view+json": {
248
+ "model_id": "ecca5855d3f647c8b3e49f43214bca5c",
249
+ "version_major": 2,
250
+ "version_minor": 0
251
+ },
252
+ "text/plain": [
253
+ "model-00002-of-00004.safetensors: 0%| | 0.00/4.93G [00:00<?, ?B/s]"
254
+ ]
255
+ },
256
+ "metadata": {},
257
+ "output_type": "display_data"
258
+ }
259
+ ],
260
+ "source": [
261
+ "\n",
262
+ "model = AutoModelForCausalLM.from_pretrained(\n",
263
+ " MODEL_ID,\n",
264
+ " quantization_config=bnb_config,\n",
265
+ " device_map=\"auto\",\n",
266
+ " torch_dtype=torch.bfloat16,\n",
267
+ " trust_remote_code=True,\n",
268
+ " # attn_implementation=\"flash_attention_2\",\n",
269
+ " token=HF_TOKEN\n",
270
+ ")\n",
271
+ "\n",
272
+ "# LoRA config for efficient training\n",
273
+ "lora_config = LoraConfig(\n",
274
+ " r=16,\n",
275
+ " lora_alpha=32,\n",
276
+ " lora_dropout=0.05,\n",
277
+ " target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\", \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
278
+ " task_type=\"CAUSAL_LM\",\n",
279
+ ")\n",
280
+ "\n",
281
+ "model = get_peft_model(model, lora_config)\n",
282
+ "model.print_trainable_parameters()"
283
+ ]
284
+ },
285
+ {
286
+ "cell_type": "markdown",
287
+ "metadata": {},
288
+ "source": [
289
+ "## 4. Trajectory Collection\n",
290
+ "\n",
291
+ "The key idea: treat the full multi-step episode as one \"completion\" for GRPO.\n",
292
+ "\n",
293
+ "**Prompt** = system message + task description + initial observation\n",
294
+ "**Completion** = sequence of all code actions (with observation feedback between them)\n",
295
+ "**Reward** = final composite reward from the environment\n",
296
+ "\n",
297
+ "We roll out multiple episodes per prompt (GRPO group) and use relative rewards within each group."
298
+ ]
299
+ },
300
+ {
301
+ "cell_type": "code",
302
+ "execution_count": null,
303
+ "metadata": {},
304
+ "outputs": [],
305
+ "source": [
306
+ "SYSTEM_PROMPT = \"\"\"You are an expert Python developer. You are given a repository where a source file has been replaced with a broken stub. Your task is to explore the repository, understand the expected behavior from the tests, and rewrite the source file so all tests pass.\n",
307
+ "\n",
308
+ "You interact via a Python REPL. Available functions:\n",
309
+ "- read_file(path) — Read a file from the repo\n",
310
+ "- list_dir(path='.') — List directory contents\n",
311
+ "- search(pattern, path='.') — Grep for a pattern\n",
312
+ "- write_file(path, content) — Write/create a file\n",
313
+ "- run_tests(test_path=None) — Run pytest on a test file\n",
314
+ "- FINAL() — Signal that your implementation is complete\n",
315
+ "\n",
316
+ "Strategy:\n",
317
+ "1. Read the failing test file to understand expected behavior\n",
318
+ "2. Read other source files for context (imports, dependencies)\n",
319
+ "3. Write the implementation\n",
320
+ "4. Run tests to verify\n",
321
+ "5. Fix any failures\n",
322
+ "6. Call FINAL() when done\n",
323
+ "\n",
324
+ "Output ONLY valid Python code. No markdown, no explanations — just code to execute.\"\"\"\n",
325
+ "\n",
326
+ "\n",
327
+ "def build_prompt(task_description: str, failing_tests: list[str]) -> list[dict]:\n",
328
+ " \"\"\"Build the chat prompt for the initial observation.\"\"\"\n",
329
+ " user_msg = f\"{task_description}\\n\\nFailing tests:\\n\" + \"\\n\".join(failing_tests)\n",
330
+ " return [\n",
331
+ " {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n",
332
+ " {\"role\": \"user\", \"content\": user_msg},\n",
333
+ " ]\n",
334
+ "\n",
335
+ "\n",
336
+ "def extract_code_from_response(response: str) -> str:\n",
337
+ " \"\"\"Extract executable Python code from model response.\"\"\"\n",
338
+ " # Try to find code blocks first\n",
339
+ " code_blocks = re.findall(r\"```(?:python)?\\n(.*?)```\", response, re.DOTALL)\n",
340
+ " if code_blocks:\n",
341
+ " return \"\\n\".join(code_blocks)\n",
342
+ " # Otherwise treat the whole response as code\n",
343
+ " lines = response.strip().split(\"\\n\")\n",
344
+ " code_lines = []\n",
345
+ " for line in lines:\n",
346
+ " stripped = line.strip()\n",
347
+ " if stripped and not stripped.startswith(\"#\") and any(c in stripped for c in \"=()[]{}:\"):\n",
348
+ " code_lines.append(line)\n",
349
+ " elif stripped.startswith(\"#\") or stripped.startswith(\"import\") or stripped.startswith(\"from\"):\n",
350
+ " code_lines.append(line)\n",
351
+ " elif not stripped:\n",
352
+ " code_lines.append(line)\n",
353
+ " else:\n",
354
+ " code_lines.append(f\"# {line}\")\n",
355
+ " return \"\\n\".join(code_lines)\n",
356
+ "\n",
357
+ "\n",
358
+ "print(\"Prompt builder ready.\")"
359
+ ]
360
+ },
361
+ {
362
+ "cell_type": "code",
363
+ "execution_count": null,
364
+ "metadata": {},
365
+ "outputs": [],
366
+ "source": [
367
+ "@dataclass\n",
368
+ "class Trajectory:\n",
369
+ " \"\"\"A full multi-step episode trajectory for GRPO training.\"\"\"\n",
370
+ " prompt_text: str # Tokenized prompt (system + task)\n",
371
+ " completion_text: str # All model outputs concatenated\n",
372
+ " reward: float # Final episode reward\n",
373
+ " steps: int # Number of steps taken\n",
374
+ " seed: int # Environment seed (for reproducibility)\n",
375
+ " tests_passed: int\n",
376
+ " tests_total: int\n",
377
+ "\n",
378
+ "\n",
379
+ "def run_episode(\n",
380
+ " model,\n",
381
+ " tokenizer,\n",
382
+ " env: RLMForgeEnvironment,\n",
383
+ " seed: int,\n",
384
+ " max_steps: int = MAX_STEPS_PER_EPISODE,\n",
385
+ " temperature: float = 0.7,\n",
386
+ " max_new_tokens: int = 2048,\n",
387
+ ") -> Trajectory:\n",
388
+ " \"\"\"Run a single episode: generate code actions, execute them, collect trajectory.\"\"\"\n",
389
+ " obs = env.reset(seed=seed)\n",
390
+ "\n",
391
+ " messages = build_prompt(obs.task_description, obs.failing_tests or [])\n",
392
+ " prompt_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
393
+ "\n",
394
+ " all_completions = [] # All model outputs for this episode\n",
395
+ "\n",
396
+ " for step_i in range(max_steps):\n",
397
+ " # Build the full conversation so far for the model\n",
398
+ " if step_i > 0:\n",
399
+ " # Add the observation as assistant feedback\n",
400
+ " messages.append({\"role\": \"user\", \"content\": f\"REPL output:\\n{obs.stdout}\\n{obs.stderr}\"})\n",
401
+ "\n",
402
+ " # Generate next action\n",
403
+ " full_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
404
+ " inputs = tokenizer(full_text, return_tensors=\"pt\", truncation=True, max_length=8192).to(model.device)\n",
405
+ "\n",
406
+ " with torch.no_grad():\n",
407
+ " outputs = model.generate(\n",
408
+ " **inputs,\n",
409
+ " max_new_tokens=max_new_tokens,\n",
410
+ " temperature=temperature,\n",
411
+ " top_p=0.95,\n",
412
+ " do_sample=True,\n",
413
+ " pad_token_id=tokenizer.pad_token_id,\n",
414
+ " )\n",
415
+ "\n",
416
+ " # Decode only the new tokens\n",
417
+ " new_tokens = outputs[0][inputs[\"input_ids\"].shape[1]:]\n",
418
+ " response = tokenizer.decode(new_tokens, skip_special_tokens=True)\n",
419
+ " all_completions.append(response)\n",
420
+ "\n",
421
+ " # Add to conversation history\n",
422
+ " messages.append({\"role\": \"assistant\", \"content\": response})\n",
423
+ "\n",
424
+ " # Extract and execute code\n",
425
+ " code = extract_code_from_response(response)\n",
426
+ "\n",
427
+ " # Check if model wants to finalize\n",
428
+ " if \"FINAL()\" in code:\n",
429
+ " obs = env.step(RLMForgeAction(code=code))\n",
430
+ " break\n",
431
+ " else:\n",
432
+ " obs = env.step(RLMForgeAction(code=code))\n",
433
+ "\n",
434
+ " if obs.done:\n",
435
+ " break\n",
436
+ "\n",
437
+ " # If we exhausted steps without FINAL, force finalize\n",
438
+ " if not obs.done:\n",
439
+ " obs = env.step(RLMForgeAction(code=\"FINAL()\"))\n",
440
+ "\n",
441
+ " # Build the full completion text (all model outputs joined)\n",
442
+ " completion_text = \"\\n<|step|>\\n\".join(all_completions)\n",
443
+ "\n",
444
+ " reward = obs.reward or 0.0\n",
445
+ " test_results = obs.test_results or {}\n",
446
+ "\n",
447
+ " return Trajectory(\n",
448
+ " prompt_text=prompt_text,\n",
449
+ " completion_text=completion_text,\n",
450
+ " reward=reward,\n",
451
+ " steps=step_i + 1,\n",
452
+ " seed=seed,\n",
453
+ " tests_passed=test_results.get(\"tests_passed\", 0),\n",
454
+ " tests_total=test_results.get(\"tests_total\", 0),\n",
455
+ " )\n",
456
+ "\n",
457
+ "\n",
458
+ "print(\"Episode runner ready.\")"
459
+ ]
460
+ },
461
+ {
462
+ "cell_type": "markdown",
463
+ "metadata": {},
464
+ "source": [
465
+ "## 5. Collect Baseline Trajectories\n",
466
+ "\n",
467
+ "Run episodes to collect (prompt, completion, reward) tuples before training. This establishes the pre-training baseline."
468
+ ]
469
+ },
470
+ {
471
+ "cell_type": "code",
472
+ "execution_count": null,
473
+ "metadata": {},
474
+ "outputs": [],
475
+ "source": [
476
+ "def collect_trajectories(\n",
477
+ " model,\n",
478
+ " tokenizer,\n",
479
+ " num_prompts: int = NUM_TRAINING_PROMPTS,\n",
480
+ " episodes_per_prompt: int = NUM_EPISODES_PER_PROMPT,\n",
481
+ " temperature: float = 0.7,\n",
482
+ ") -> list[list[Trajectory]]:\n",
483
+ " \"\"\"Collect GRPO groups: multiple trajectories per unique prompt/seed.\"\"\"\n",
484
+ " env = RLMForgeEnvironment()\n",
485
+ " all_groups = []\n",
486
+ "\n",
487
+ " for prompt_idx in range(num_prompts):\n",
488
+ " seed = prompt_idx * 100 # Deterministic seeds\n",
489
+ " group = []\n",
490
+ "\n",
491
+ " for ep_idx in range(episodes_per_prompt):\n",
492
+ " print(f\" Prompt {prompt_idx+1}/{num_prompts}, Episode {ep_idx+1}/{episodes_per_prompt}...\", end=\" \")\n",
493
+ " traj = run_episode(\n",
494
+ " model, tokenizer, env,\n",
495
+ " seed=seed, # Same seed = same task for GRPO group\n",
496
+ " temperature=temperature + 0.1 * ep_idx, # Vary temperature for diversity\n",
497
+ " )\n",
498
+ " group.append(traj)\n",
499
+ " print(f\"reward={traj.reward:.3f}, steps={traj.steps}, \"\n",
500
+ " f\"tests={traj.tests_passed}/{traj.tests_total}\")\n",
501
+ "\n",
502
+ " all_groups.append(group)\n",
503
+ "\n",
504
+ " env.cleanup()\n",
505
+ " return all_groups\n",
506
+ "\n",
507
+ "\n",
508
+ "# Collect pre-training baseline\n",
509
+ "print(\"=\" * 60)\n",
510
+ "print(\"COLLECTING BASELINE TRAJECTORIES\")\n",
511
+ "print(\"=\" * 60)\n",
512
+ "baseline_groups = collect_trajectories(model, tokenizer)\n",
513
+ "\n",
514
+ "# Summary stats\n",
515
+ "all_rewards = [t.reward for g in baseline_groups for t in g]\n",
516
+ "print(f\"\\nBaseline: mean_reward={sum(all_rewards)/len(all_rewards):.4f}, \"\n",
517
+ " f\"min={min(all_rewards):.4f}, max={max(all_rewards):.4f}\")"
518
+ ]
519
+ },
520
+ {
521
+ "cell_type": "markdown",
522
+ "metadata": {},
523
+ "source": [
524
+ "## 6. GRPO Training\n",
525
+ "\n",
526
+ "Train with Group Relative Policy Optimization. For each group of trajectories (same prompt, different completions), compute advantages relative to the group mean reward, then update the policy to increase probability of higher-reward trajectories."
527
+ ]
528
+ },
529
+ {
530
+ "cell_type": "code",
531
+ "execution_count": null,
532
+ "metadata": {},
533
+ "outputs": [],
534
+ "source": [
535
+ "from datasets import Dataset\n",
536
+ "from trl import GRPOConfig, GRPOTrainer\n",
537
+ "\n",
538
+ "\n",
539
+ "def trajectories_to_dataset(groups: list[list[Trajectory]]) -> Dataset:\n",
540
+ " \"\"\"Convert trajectory groups into a HuggingFace Dataset for GRPO training.\"\"\"\n",
541
+ " records = []\n",
542
+ " for group in groups:\n",
543
+ " prompt = group[0].prompt_text\n",
544
+ " for traj in group:\n",
545
+ " records.append({\n",
546
+ " \"prompt\": prompt,\n",
547
+ " \"completion\": traj.completion_text,\n",
548
+ " \"reward\": traj.reward,\n",
549
+ " })\n",
550
+ " return Dataset.from_list(records)\n",
551
+ "\n",
552
+ "\n",
553
+ "def build_reward_fn(groups: list[list[Trajectory]]):\n",
554
+ " \"\"\"Build a reward function from pre-collected trajectories.\"\"\"\n",
555
+ " reward_map = {}\n",
556
+ " for group in groups:\n",
557
+ " for traj in group:\n",
558
+ " key = traj.completion_text[:200]\n",
559
+ " reward_map[key] = traj.reward\n",
560
+ "\n",
561
+ " def reward_fn(completions: list[str], **kwargs) -> list[float]:\n",
562
+ " rewards = []\n",
563
+ " for c in completions:\n",
564
+ " key = c[:200]\n",
565
+ " rewards.append(reward_map.get(key, 0.0))\n",
566
+ " return rewards\n",
567
+ "\n",
568
+ " return reward_fn\n",
569
+ "\n",
570
+ "\n",
571
+ "# Build dataset from baseline trajectories\n",
572
+ "train_dataset = trajectories_to_dataset(baseline_groups)\n",
573
+ "print(f\"Training dataset: {len(train_dataset)} examples\")\n",
574
+ "print(f\"Sample prompt length: {len(train_dataset[0]['prompt'])} chars\")\n",
575
+ "print(f\"Sample completion length: {len(train_dataset[0]['completion'])} chars\")\n",
576
+ "print(f\"Sample reward: {train_dataset[0]['reward']:.4f}\")"
577
+ ]
578
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# GRPO training configuration\n",
+ "grpo_config = GRPOConfig(\n",
+ "    output_dir=\"./rlm_forge_grpo_output\",\n",
+ "    num_train_epochs=GRPO_EPOCHS,\n",
+ "    per_device_train_batch_size=BATCH_SIZE,\n",
+ "    gradient_accumulation_steps=GRAD_ACCUM,\n",
+ "    learning_rate=1e-5,\n",
+ "    warmup_ratio=0.1,\n",
+ "    max_completion_length=4096,\n",
+ "    # max_prompt_length=4096,\n",
+ "    num_generations=NUM_EPISODES_PER_PROMPT,  # GRPO group size\n",
+ "    logging_steps=1,\n",
+ "    save_strategy=\"epoch\",\n",
+ "    bf16=True,\n",
+ "    gradient_checkpointing=True,\n",
+ "    # GRPO-specific\n",
+ "    beta=0.1,  # KL penalty coefficient\n",
+ "    report_to=\"none\",\n",
+ ")\n",
+ "\n",
+ "# Build reward function from collected trajectories\n",
+ "reward_fn = build_reward_fn(baseline_groups)\n",
+ "\n",
+ "# Prepare prompts dataset (unique prompts only; GRPO generates completions)\n",
+ "prompt_dataset = Dataset.from_list([\n",
+ "    {\"prompt\": group[0].prompt_text}\n",
+ "    for group in baseline_groups\n",
+ "])\n",
+ "\n",
+ "# Initialize GRPO trainer\n",
+ "trainer = GRPOTrainer(\n",
+ "    model=model,\n",
+ "    args=grpo_config,\n",
+ "    train_dataset=prompt_dataset,\n",
+ "    reward_funcs=reward_fn,\n",
+ "    processing_class=tokenizer,\n",
+ ")\n",
+ "\n",
+ "print(\"GRPO Trainer initialized. Starting training...\")\n",
+ "trainer.train()\n",
+ "print(\"Training complete!\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 7. Post-Training Evaluation\n",
+ "\n",
+ "Collect new trajectories with the trained model and compare rewards to the baseline."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Collect post-training trajectories with the same seeds\n",
+ "print(\"=\" * 60)\n",
+ "print(\"COLLECTING POST-TRAINING TRAJECTORIES\")\n",
+ "print(\"=\" * 60)\n",
+ "post_groups = collect_trajectories(model, tokenizer, temperature=0.5)\n",
+ "\n",
+ "post_rewards = [t.reward for g in post_groups for t in g]\n",
+ "baseline_rewards = [t.reward for g in baseline_groups for t in g]\n",
+ "\n",
+ "print(f\"\\n{'='*60}\")\n",
+ "print(\"RESULTS COMPARISON\")\n",
+ "print(f\"{'='*60}\")\n",
+ "print(f\"Baseline: mean={sum(baseline_rewards)/len(baseline_rewards):.4f}, \"\n",
+ "      f\"max={max(baseline_rewards):.4f}\")\n",
+ "print(f\"Trained:  mean={sum(post_rewards)/len(post_rewards):.4f}, \"\n",
+ "      f\"max={max(post_rewards):.4f}\")\n",
+ "print(f\"Improvement: {(sum(post_rewards)/len(post_rewards) - sum(baseline_rewards)/len(baseline_rewards)):.4f}\")\n",
+ "\n",
+ "# Per-task comparison\n",
+ "print(\"\\nPer-task breakdown:\")\n",
+ "for i, (bg, pg) in enumerate(zip(baseline_groups, post_groups)):\n",
+ "    b_mean = sum(t.reward for t in bg) / len(bg)\n",
+ "    p_mean = sum(t.reward for t in pg) / len(pg)\n",
+ "    delta = p_mean - b_mean\n",
+ "    arrow = \"\\u2191\" if delta > 0 else \"\\u2193\" if delta < 0 else \"\\u2192\"\n",
+ "    print(f\"  Task {i}: baseline={b_mean:.3f} \\u2192 trained={p_mean:.3f} ({arrow} {abs(delta):.3f})\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 8. Visualize Results"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import matplotlib.pyplot as plt\n",
+ "import numpy as np\n",
+ "\n",
+ "fig, axes = plt.subplots(1, 3, figsize=(16, 5))\n",
+ "\n",
+ "# 1. Reward distribution: baseline vs trained\n",
+ "ax1 = axes[0]\n",
+ "ax1.hist(baseline_rewards, bins=20, alpha=0.6, label=\"Baseline\", color=\"steelblue\")\n",
+ "ax1.hist(post_rewards, bins=20, alpha=0.6, label=\"After GRPO\", color=\"coral\")\n",
+ "ax1.set_xlabel(\"Episode Reward\")\n",
+ "ax1.set_ylabel(\"Count\")\n",
+ "ax1.set_title(\"Reward Distribution\")\n",
+ "ax1.legend()\n",
+ "ax1.axvline(np.mean(baseline_rewards), color=\"steelblue\", linestyle=\"--\", alpha=0.8)\n",
+ "ax1.axvline(np.mean(post_rewards), color=\"coral\", linestyle=\"--\", alpha=0.8)\n",
+ "\n",
+ "# 2. Per-task mean reward comparison\n",
+ "ax2 = axes[1]\n",
+ "task_ids = list(range(len(baseline_groups)))\n",
+ "b_means = [np.mean([t.reward for t in g]) for g in baseline_groups]\n",
+ "p_means = [np.mean([t.reward for t in g]) for g in post_groups]\n",
+ "x = np.arange(len(task_ids))\n",
+ "width = 0.35\n",
+ "ax2.bar(x - width/2, b_means, width, label=\"Baseline\", color=\"steelblue\", alpha=0.8)\n",
+ "ax2.bar(x + width/2, p_means, width, label=\"After GRPO\", color=\"coral\", alpha=0.8)\n",
+ "ax2.set_xlabel(\"Task ID\")\n",
+ "ax2.set_ylabel(\"Mean Reward\")\n",
+ "ax2.set_title(\"Per-Task Reward Improvement\")\n",
+ "ax2.legend()\n",
+ "ax2.set_xticks(x)\n",
+ "\n",
+ "# 3. Test pass rate improvement\n",
+ "ax3 = axes[2]\n",
+ "b_pass_rates = [np.mean([t.tests_passed / max(t.tests_total, 1) for t in g]) for g in baseline_groups]\n",
+ "p_pass_rates = [np.mean([t.tests_passed / max(t.tests_total, 1) for t in g]) for g in post_groups]\n",
+ "ax3.bar(x - width/2, b_pass_rates, width, label=\"Baseline\", color=\"steelblue\", alpha=0.8)\n",
+ "ax3.bar(x + width/2, p_pass_rates, width, label=\"After GRPO\", color=\"coral\", alpha=0.8)\n",
+ "ax3.set_xlabel(\"Task ID\")\n",
+ "ax3.set_ylabel(\"Test Pass Rate\")\n",
+ "ax3.set_title(\"Test Pass Rate Improvement\")\n",
+ "ax3.legend()\n",
+ "ax3.set_xticks(x)\n",
+ "\n",
+ "plt.tight_layout()\n",
+ "plt.savefig(\"rlm_forge_results.png\", dpi=150, bbox_inches=\"tight\")\n",
+ "plt.show()\n",
+ "\n",
+ "print(\"\\nOverall test pass rate:\")\n",
+ "print(f\"  Baseline: {np.mean(b_pass_rates):.1%}\")\n",
+ "print(f\"  Trained:  {np.mean(p_pass_rates):.1%}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 9. Save Model & Training Log"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Save the trained LoRA adapter\n",
+ "model.save_pretrained(\"./rlm_forge_lora_adapter\")\n",
+ "tokenizer.save_pretrained(\"./rlm_forge_lora_adapter\")\n",
+ "\n",
+ "# Save training log\n",
+ "training_log = {\n",
+ "    \"model_id\": MODEL_ID,\n",
+ "    \"num_prompts\": NUM_TRAINING_PROMPTS,\n",
+ "    \"episodes_per_prompt\": NUM_EPISODES_PER_PROMPT,\n",
+ "    \"max_steps_per_episode\": MAX_STEPS_PER_EPISODE,\n",
+ "    \"grpo_epochs\": GRPO_EPOCHS,\n",
+ "    \"baseline_mean_reward\": float(np.mean(baseline_rewards)),\n",
+ "    \"baseline_max_reward\": float(max(baseline_rewards)),\n",
+ "    \"trained_mean_reward\": float(np.mean(post_rewards)),\n",
+ "    \"trained_max_reward\": float(max(post_rewards)),\n",
+ "    \"improvement\": float(np.mean(post_rewards) - np.mean(baseline_rewards)),\n",
+ "    \"baseline_test_pass_rate\": float(np.mean(b_pass_rates)),\n",
+ "    \"trained_test_pass_rate\": float(np.mean(p_pass_rates)),\n",
+ "}\n",
+ "\n",
+ "with open(\"training_log.json\", \"w\") as f:\n",
+ "    json.dump(training_log, f, indent=2)\n",
+ "\n",
+ "print(\"Saved LoRA adapter to ./rlm_forge_lora_adapter\")\n",
+ "print(\"Saved training log to training_log.json\")\n",
+ "print(\"\\nFinal summary:\")\n",
+ "print(json.dumps(training_log, indent=2))"
+ ]
+ }
+ ],
+ "metadata": {
+ "accelerator": "GPU",
+ "gpuClass": "premium",
+ "kernelspec": {
+ "display_name": ".venv",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.13.3"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+ }
rlm_forge_training.py ADDED
@@ -0,0 +1,470 @@
+ import torch
+ import json
+ import re
+ import random
+ from typing import Optional
+ from dataclasses import dataclass
+ from datasets import Dataset
+ from trl import GRPOConfig, GRPOTrainer
+
+ print(f"GPU: {torch.cuda.get_device_name(0)}")
+ # print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
+ print(f"PyTorch: {torch.__version__}")
+
+ from rlm_forge.server.environment import RLMForgeEnvironment
+ from rlm_forge.models import RLMForgeAction
+
+ env = RLMForgeEnvironment()
+
+ # Run a quick episode
+ obs = env.reset(seed=1)
+ print(f"Task: {obs.task_description[:200]}...")
+ print(f"Available tools: {obs.available_functions}")
+
+ # Take a step — list files
+ obs2 = env.step(RLMForgeAction(code="print(list_dir())"))
+ print(f"\nStep 1 stdout: {obs2.stdout[:200]}")
+
+ # Finalize and get reward
+ obs3 = env.step(RLMForgeAction(code="FINAL()"))
+ print(f"\nBaseline reward (no implementation): {obs3.reward:.4f}")
+ print(f"Test results: {obs3.test_results}")
+
+ env.cleanup()
+
+
+ # Model config — adjust based on available VRAM
+ # MODEL_ID = "Qwen/Qwen2.5-Coder-32B-Instruct"  # 32B for H100
+ MODEL_ID = "Qwen/Qwen2.5-Coder-7B-Instruct"  # Fallback for smaller GPUs
+ HF_TOKEN = ''  # Set a Hugging Face token here if the model repo is gated
+ MAX_STEPS_PER_EPISODE = 6    # Max REPL interactions per episode
+ NUM_EPISODES_PER_PROMPT = 2  # GRPO group size (completions per prompt)
+ NUM_TRAINING_PROMPTS = 8     # 16  # Total unique prompts (episodes) for training
+ GRPO_EPOCHS = 2              # Training epochs over collected data
+ BATCH_SIZE = 2
+ GRAD_ACCUM = 4
+
+ from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
+ from peft import LoraConfig, get_peft_model
+
+ # 4-bit quantization for the 32B model on H100
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.bfloat16,
+     bnb_4bit_use_double_quant=True,
+ )
+
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True, token=HF_TOKEN)
+ if tokenizer.pad_token is None:
+     tokenizer.pad_token = tokenizer.eos_token
+
+
+ model = AutoModelForCausalLM.from_pretrained(
+     MODEL_ID,
+     quantization_config=bnb_config,
+     device_map="auto",
+     torch_dtype=torch.bfloat16,
+     trust_remote_code=True,
+     # attn_implementation="flash_attention_2",
+     token=HF_TOKEN,
+ )
+
+ # LoRA config for efficient training
+ lora_config = LoraConfig(
+     r=16,
+     lora_alpha=32,
+     lora_dropout=0.05,
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
+     task_type="CAUSAL_LM",
+ )
+
+ model = get_peft_model(model, lora_config)
+ model.print_trainable_parameters()
+
+ SYSTEM_PROMPT = """You are an expert Python developer. You are given a repository where a source file has been replaced with a broken stub. Your task is to explore the repository, understand the expected behavior from the tests, and rewrite the source file so all tests pass.
+
+ You interact via a Python REPL. Available functions:
+ - read_file(path) — Read a file from the repo
+ - list_dir(path='.') — List directory contents
+ - search(pattern, path='.') — Grep for a pattern
+ - write_file(path, content) — Write/create a file
+ - run_tests(test_path=None) — Run pytest on a test file
+ - FINAL() — Signal that your implementation is complete
+
+ Strategy:
+ 1. Read the failing test file to understand expected behavior
+ 2. Read other source files for context (imports, dependencies)
+ 3. Write the implementation
+ 4. Run tests to verify
+ 5. Fix any failures
+ 6. Call FINAL() when done
+
+ Output ONLY valid Python code. No markdown, no explanations — just code to execute."""
+
+
+ def build_prompt(task_description: str, failing_tests: list[str]) -> list[dict]:
+     """Build the chat prompt for the initial observation."""
+     user_msg = f"{task_description}\n\nFailing tests:\n" + "\n".join(failing_tests)
+     return [
+         {"role": "system", "content": SYSTEM_PROMPT},
+         {"role": "user", "content": user_msg},
+     ]
+
+
+ def extract_code_from_response(response: str) -> str:
+     """Extract executable Python code from a model response."""
+     # Try to find fenced code blocks first
+     code_blocks = re.findall(r"```(?:python)?\n(.*?)```", response, re.DOTALL)
+     if code_blocks:
+         return "\n".join(code_blocks)
+     # Otherwise treat the whole response as code, commenting out prose lines
+     lines = response.strip().split("\n")
+     code_lines = []
+     for line in lines:
+         stripped = line.strip()
+         if stripped and not stripped.startswith("#") and any(c in stripped for c in "=()[]{}:"):
+             code_lines.append(line)
+         elif stripped.startswith("#") or stripped.startswith("import") or stripped.startswith("from"):
+             code_lines.append(line)
+         elif not stripped:
+             code_lines.append(line)
+         else:
+             code_lines.append(f"# {line}")
+     return "\n".join(code_lines)
+
+
+ print("Prompt builder ready.")
+
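+ # Illustrative self-check (editor's sketch, not in the original script): the
+ # extractor should pull code out of a fenced block unchanged.
+ _sample = "Here is the fix:\n```python\nprint(list_dir())\n```"
+ assert extract_code_from_response(_sample).strip() == "print(list_dir())"
+
+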
+ @dataclass
+ class Trajectory:
+     """A full multi-step episode trajectory for GRPO training."""
+     prompt_text: str       # Rendered chat prompt (system + task)
+     completion_text: str   # All model outputs concatenated
+     reward: float          # Final episode reward
+     steps: int             # Number of steps taken
+     seed: int              # Environment seed (for reproducibility)
+     tests_passed: int
+     tests_total: int
+
+
+ def run_episode(
+     model,
+     tokenizer,
+     env: RLMForgeEnvironment,
+     seed: int,
+     max_steps: int = MAX_STEPS_PER_EPISODE,
+     temperature: float = 0.7,
+     max_new_tokens: int = 2048,
+ ) -> Trajectory:
+     """Run a single episode: generate code actions, execute them, collect the trajectory."""
+     obs = env.reset(seed=seed)
+
+     messages = build_prompt(obs.task_description, obs.failing_tests or [])
+     prompt_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+     all_completions = []  # All model outputs for this episode
+
+     for step_i in range(max_steps):
+         # Build the full conversation so far for the model
+         if step_i > 0:
+             # Feed the previous REPL output back as a user message
+             messages.append({"role": "user", "content": f"REPL output:\n{obs.stdout}\n{obs.stderr}"})
+
+         # Generate the next action
+         full_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+         inputs = tokenizer(full_text, return_tensors="pt", truncation=True, max_length=8192).to(model.device)
+
+         with torch.no_grad():
+             outputs = model.generate(
+                 **inputs,
+                 max_new_tokens=max_new_tokens,
+                 temperature=temperature,
+                 top_p=0.95,
+                 do_sample=True,
+                 pad_token_id=tokenizer.pad_token_id,
+             )
+
+         # Decode only the new tokens
+         new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
+         response = tokenizer.decode(new_tokens, skip_special_tokens=True)
+         all_completions.append(response)
+
+         # Add to conversation history
+         messages.append({"role": "assistant", "content": response})
+
+         # Extract and execute code
+         code = extract_code_from_response(response)
+         obs = env.step(RLMForgeAction(code=code))
+
+         # Stop if the model finalized or the environment ended the episode
+         if "FINAL()" in code or obs.done:
+             break
+
+     # If we exhausted steps without FINAL, force finalization
+     if not obs.done:
+         obs = env.step(RLMForgeAction(code="FINAL()"))
+
+     # Build the full completion text (all model outputs joined)
+     completion_text = "\n<|step|>\n".join(all_completions)
+
+     reward = obs.reward or 0.0
+     test_results = obs.test_results or {}
+
+     return Trajectory(
+         prompt_text=prompt_text,
+         completion_text=completion_text,
+         reward=reward,
+         steps=step_i + 1,
+         seed=seed,
+         tests_passed=test_results.get("tests_passed", 0),
+         tests_total=test_results.get("tests_total", 0),
+     )
+
+
+ print("Episode runner ready.")
+
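+ # Quick smoke test (editor's sketch, not in the original script): uncomment to
+ # sanity-check the runner on one short, low-temperature episode before collecting.
+ # _env = RLMForgeEnvironment()
+ # _traj = run_episode(model, tokenizer, _env, seed=0, max_steps=2, temperature=0.2)
+ # print(f"smoke test: reward={_traj.reward:.3f}, steps={_traj.steps}")
+ # _env.cleanup()
+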
+
+ def collect_trajectories(
+     model,
+     tokenizer,
+     num_prompts: int = NUM_TRAINING_PROMPTS,
+     episodes_per_prompt: int = NUM_EPISODES_PER_PROMPT,
+     temperature: float = 0.7,
+ ) -> list[list[Trajectory]]:
+     """Collect GRPO groups: multiple trajectories per unique prompt/seed."""
+     env = RLMForgeEnvironment()
+     all_groups = []
+
+     for prompt_idx in range(num_prompts):
+         seed = prompt_idx * 100  # Deterministic seeds
+         group = []
+
+         for ep_idx in range(episodes_per_prompt):
+             print(f"  Prompt {prompt_idx+1}/{num_prompts}, Episode {ep_idx+1}/{episodes_per_prompt}...", end=" ")
+             traj = run_episode(
+                 model, tokenizer, env,
+                 seed=seed,  # Same seed = same task for the GRPO group
+                 temperature=temperature + 0.1 * ep_idx,  # Vary temperature for diversity
+             )
+             group.append(traj)
+             print(f"reward={traj.reward:.3f}, steps={traj.steps}, "
+                   f"tests={traj.tests_passed}/{traj.tests_total}")
+
+         all_groups.append(group)
+
+     env.cleanup()
+     return all_groups
+
+ # GRPO Training configuration
+ grpo_config = GRPOConfig(
+     output_dir="./rlm_forge_grpo_output",
+     num_train_epochs=GRPO_EPOCHS,
+     per_device_train_batch_size=BATCH_SIZE,
+     gradient_accumulation_steps=GRAD_ACCUM,
+     learning_rate=1e-5,
+     warmup_ratio=0.1,
+     max_completion_length=4096,
+     # max_prompt_length=4096,
+     num_generations=NUM_EPISODES_PER_PROMPT,  # GRPO group size
+     logging_steps=1,
+     save_strategy="epoch",
+     bf16=True,
+     gradient_checkpointing=True,
+     # GRPO-specific
+     beta=0.1,  # KL penalty coefficient
+     report_to="none",
+ )
+
+ # Collect pre-training baseline
+ print("=" * 60)
+ print("COLLECTING BASELINE TRAJECTORIES")
+ print("=" * 60)
+ baseline_groups = collect_trajectories(model, tokenizer)
+
+ # Summary stats
+ all_rewards = [t.reward for g in baseline_groups for t in g]
+ print(f"\nBaseline: mean_reward={sum(all_rewards)/len(all_rewards):.4f}, "
+       f"min={min(all_rewards):.4f}, max={max(all_rewards):.4f}")
+
+
+ def trajectories_to_dataset(groups: list[list[Trajectory]]) -> Dataset:
+     """Convert trajectory groups into a HuggingFace Dataset for GRPO training."""
+     records = []
+     for group in groups:
+         prompt = group[0].prompt_text
+         for traj in group:
+             records.append({
+                 "prompt": prompt,
+                 "completion": traj.completion_text,
+                 "reward": traj.reward,
+             })
+     return Dataset.from_list(records)
+
+
+ def build_reward_fn(groups: list[list[Trajectory]]):
+     """Build a reward function from pre-collected trajectories."""
+     reward_map = {}
+     for group in groups:
+         for traj in group:
+             key = traj.completion_text[:200]
+             reward_map[key] = traj.reward
+
+     def reward_fn(completions: list[str], **kwargs) -> list[float]:
+         rewards = []
+         for c in completions:
+             key = c[:200]
+             # Completions whose 200-char prefix was not seen during collection fall back to 0.0
+             rewards.append(reward_map.get(key, 0.0))
+         return rewards
+
+     return reward_fn
+
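+
+ # Editor's sketch (not in the original script): an online alternative that scores
+ # each freshly generated completion by executing it in the environment, instead of
+ # looking up pre-collected rewards by completion prefix. Assumes the same
+ # RLMForgeEnvironment/RLMForgeAction semantics used above, and approximates a
+ # multi-step episode with a single action followed by FINAL().
+ def build_live_reward_fn(seed: int):
+     def live_reward_fn(completions: list[str], **kwargs) -> list[float]:
+         rewards = []
+         env = RLMForgeEnvironment()
+         for c in completions:
+             env.reset(seed=seed)
+             obs = env.step(RLMForgeAction(code=extract_code_from_response(c)))
+             if not obs.done:
+                 obs = env.step(RLMForgeAction(code="FINAL()"))
+             rewards.append(obs.reward or 0.0)
+         env.cleanup()
+         return rewards
+     return live_reward_fn
+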
+
+ # Build dataset from baseline trajectories
+ train_dataset = trajectories_to_dataset(baseline_groups)
+ print(f"Training dataset: {len(train_dataset)} examples")
+ print(f"Sample prompt length: {len(train_dataset[0]['prompt'])} chars")
+ print(f"Sample completion length: {len(train_dataset[0]['completion'])} chars")
+ print(f"Sample reward: {train_dataset[0]['reward']:.4f}")
+
+
+ # Build reward function from collected trajectories
+ reward_fn = build_reward_fn(baseline_groups)
+
+ # Prepare prompts dataset (unique prompts only; GRPO generates completions)
+ prompt_dataset = Dataset.from_list([
+     {"prompt": group[0].prompt_text}
+     for group in baseline_groups
+ ])
+
+ # Initialize GRPO trainer
+ trainer = GRPOTrainer(
+     model=model,
+     args=grpo_config,
+     train_dataset=prompt_dataset,
+     reward_funcs=reward_fn,
+     processing_class=tokenizer,
+ )
+
+ print("GRPO Trainer initialized. Starting training...")
+ trainer.train()
+ print("Training complete!")
+
+
+ # Collect post-training trajectories with the same seeds
+ print("=" * 60)
+ print("COLLECTING POST-TRAINING TRAJECTORIES")
+ print("=" * 60)
+ post_groups = collect_trajectories(model, tokenizer, temperature=0.5)
+
+ post_rewards = [t.reward for g in post_groups for t in g]
+ baseline_rewards = [t.reward for g in baseline_groups for t in g]
+
+ print(f"\n{'='*60}")
+ print("RESULTS COMPARISON")
+ print(f"{'='*60}")
+ print(f"Baseline: mean={sum(baseline_rewards)/len(baseline_rewards):.4f}, "
+       f"max={max(baseline_rewards):.4f}")
+ print(f"Trained:  mean={sum(post_rewards)/len(post_rewards):.4f}, "
+       f"max={max(post_rewards):.4f}")
+ print(f"Improvement: {(sum(post_rewards)/len(post_rewards) - sum(baseline_rewards)/len(baseline_rewards)):.4f}")
+
+ # Per-task comparison
+ print("\nPer-task breakdown:")
+ for i, (bg, pg) in enumerate(zip(baseline_groups, post_groups)):
+     b_mean = sum(t.reward for t in bg) / len(bg)
+     p_mean = sum(t.reward for t in pg) / len(pg)
+     delta = p_mean - b_mean
+     arrow = "\u2191" if delta > 0 else "\u2193" if delta < 0 else "\u2192"
+     print(f"  Task {i}: baseline={b_mean:.3f} \u2192 trained={p_mean:.3f} ({arrow} {abs(delta):.3f})")
+
+ import matplotlib.pyplot as plt
+ import numpy as np
+
+ fig, axes = plt.subplots(1, 3, figsize=(16, 5))
+
+ # 1. Reward distribution: baseline vs trained
+ ax1 = axes[0]
+ ax1.hist(baseline_rewards, bins=20, alpha=0.6, label="Baseline", color="steelblue")
+ ax1.hist(post_rewards, bins=20, alpha=0.6, label="After GRPO", color="coral")
+ ax1.set_xlabel("Episode Reward")
+ ax1.set_ylabel("Count")
+ ax1.set_title("Reward Distribution")
+ ax1.legend()
+ ax1.axvline(np.mean(baseline_rewards), color="steelblue", linestyle="--", alpha=0.8)
+ ax1.axvline(np.mean(post_rewards), color="coral", linestyle="--", alpha=0.8)
+
+ # 2. Per-task mean reward comparison
+ ax2 = axes[1]
+ task_ids = list(range(len(baseline_groups)))
+ b_means = [np.mean([t.reward for t in g]) for g in baseline_groups]
+ p_means = [np.mean([t.reward for t in g]) for g in post_groups]
+ x = np.arange(len(task_ids))
+ width = 0.35
+ ax2.bar(x - width/2, b_means, width, label="Baseline", color="steelblue", alpha=0.8)
+ ax2.bar(x + width/2, p_means, width, label="After GRPO", color="coral", alpha=0.8)
+ ax2.set_xlabel("Task ID")
+ ax2.set_ylabel("Mean Reward")
+ ax2.set_title("Per-Task Reward Improvement")
+ ax2.legend()
+ ax2.set_xticks(x)
+
+ # 3. Test pass rate improvement
+ ax3 = axes[2]
+ b_pass_rates = [np.mean([t.tests_passed / max(t.tests_total, 1) for t in g]) for g in baseline_groups]
+ p_pass_rates = [np.mean([t.tests_passed / max(t.tests_total, 1) for t in g]) for g in post_groups]
+ ax3.bar(x - width/2, b_pass_rates, width, label="Baseline", color="steelblue", alpha=0.8)
+ ax3.bar(x + width/2, p_pass_rates, width, label="After GRPO", color="coral", alpha=0.8)
+ ax3.set_xlabel("Task ID")
+ ax3.set_ylabel("Test Pass Rate")
+ ax3.set_title("Test Pass Rate Improvement")
+ ax3.legend()
+ ax3.set_xticks(x)
+
+ plt.tight_layout()
+ plt.savefig("rlm_forge_results.png", dpi=150, bbox_inches="tight")
+ plt.show()
+
+ print("\nOverall test pass rate:")
+ print(f"  Baseline: {np.mean(b_pass_rates):.1%}")
+ print(f"  Trained:  {np.mean(p_pass_rates):.1%}")
+
+
+ # Save the trained LoRA adapter
+ model.save_pretrained("./rlm_forge_lora_adapter")
+ tokenizer.save_pretrained("./rlm_forge_lora_adapter")
+
+ # Save training log
+ training_log = {
+     "model_id": MODEL_ID,
+     "num_prompts": NUM_TRAINING_PROMPTS,
+     "episodes_per_prompt": NUM_EPISODES_PER_PROMPT,
+     "max_steps_per_episode": MAX_STEPS_PER_EPISODE,
+     "grpo_epochs": GRPO_EPOCHS,
+     "baseline_mean_reward": float(np.mean(baseline_rewards)),
+     "baseline_max_reward": float(max(baseline_rewards)),
+     "trained_mean_reward": float(np.mean(post_rewards)),
+     "trained_max_reward": float(max(post_rewards)),
+     "improvement": float(np.mean(post_rewards) - np.mean(baseline_rewards)),
+     "baseline_test_pass_rate": float(np.mean(b_pass_rates)),
+     "trained_test_pass_rate": float(np.mean(p_pass_rates)),
+ }
+
+ with open("training_log.json", "w") as f:
+     json.dump(training_log, f, indent=2)
+
+ print("Saved LoRA adapter to ./rlm_forge_lora_adapter")
+ print("Saved training log to training_log.json")
+ print("\nFinal summary:")
+ print(json.dumps(training_log, indent=2))
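+
+
+ # Editor's sketch (not in the original script): reload the saved adapter later
+ # for inference, using the standard peft reload pattern.
+ def load_trained_adapter():
+     from peft import PeftModel
+     base = AutoModelForCausalLM.from_pretrained(
+         MODEL_ID,
+         quantization_config=bnb_config,
+         device_map="auto",
+         torch_dtype=torch.bfloat16,
+         trust_remote_code=True,
+         token=HF_TOKEN,
+     )
+     return PeftModel.from_pretrained(base, "./rlm_forge_lora_adapter")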
server/__init__.py ADDED
@@ -0,0 +1 @@
+ # Server package for OpenEnv deployment.
server/app.py ADDED
@@ -0,0 +1,23 @@
+ """FastAPI server entry point for the RLM-Forge environment.
+
+ This module provides the standardized OpenEnv server entry point.
+ It wraps the rlm_forge.server.app module for multi-mode deployment.
+
+ Usage:
+     uv run server
+     python -m server.app
+     uvicorn server.app:app --host 0.0.0.0 --port 8000
+ """
+
+ from rlm_forge.server.app import app  # noqa: F401
+
+
+ def main(host: str = "0.0.0.0", port: int = 8000):
+     """Entry point for direct execution via uv run or python -m."""
+     import uvicorn
+
+     uvicorn.run(app, host=host, port=port)
+
+
+ if __name__ == "__main__":
+     main()
uv.lock ADDED
(diff not rendered: file too large)