---
title: RLM-Forge
emoji: πŸš€
colorFrom: blue
colorTo: indigo
sdk: gradio
python_version: 3.11
sdk_version: 6.9.0
app_file: server/app.py
base_path: /rlm_forge
---

# RLM-Forge

**Recursive Language Model training environment for AI coding agents.**

RLM-Forge is an [OpenEnv](https://github.com/meta-pytorch/OpenEnv) environment that trains language models to solve coding tasks on real Python repositories using Recursive Language Model (RLM) patterns.

## How It Works

1. **Clone** a real Python repo (e.g., python-slugify, humanize)
2. **Extract** a source file and replace it with a broken stub (correct signatures, wrong implementations)
3. **Agent** explores the repo via a sandboxed multi-step REPL with built-in tools
4. **Reward** = test pass rate (55%) + structural validity (15%) + efficiency (30%)
5. **Train** with GRPO to improve the agent's coding ability over time
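The stub generation in step 2 can be sketched with the standard `ast` module: parse the source, swap every function body for a placeholder while keeping the signature and docstring, and unparse. This is an illustrative sketch (using `NotImplementedError` bodies), not the actual `feature_extractor.py` implementation:

```python
import ast

def make_stub(source: str) -> str:
    """Replace every function body with a stub, keeping signatures
    and docstrings intact (illustrative sketch)."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            body = []
            # Keep the docstring, if any, so the agent can see intent.
            if (node.body and isinstance(node.body[0], ast.Expr)
                    and isinstance(node.body[0].value, ast.Constant)
                    and isinstance(node.body[0].value.value, str)):
                body.append(node.body[0])
            body.append(ast.parse("raise NotImplementedError").body[0])
            node.body = body
    return ast.unparse(tree)
```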

### The REPL Tools

The agent has access to these functions in the sandbox:

| Function | Description |
|----------|-------------|
| `read_file(path)` | Read a file from the repo |
| `list_dir(path='.')` | List directory contents |
| `search(pattern, path='.')` | Grep for a pattern |
| `write_file(path, content)` | Write/create a file |
| `run_tests(test_path=None)` | Run pytest |
| `spawn_agent(scope, mission)` | Explore a directory scope |
| `FINAL()` | Signal implementation is complete |
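A minimal local stand-in for the three read-only tools might look like the following. This is a standard-library sketch for intuition only; the real sandboxed implementations live in `server/sandbox.py` and will differ:

```python
import re
from pathlib import Path

REPO_ROOT = Path(".")  # assumed sandbox working directory

def read_file(path: str) -> str:
    """Read a file from the repo."""
    return (REPO_ROOT / path).read_text()

def list_dir(path: str = ".") -> list[str]:
    """List directory contents."""
    return sorted(p.name for p in (REPO_ROOT / path).iterdir())

def search(pattern: str, path: str = ".") -> list[str]:
    """Grep-like search over .py files: returns 'file:lineno:line' hits."""
    hits = []
    for f in (REPO_ROOT / path).rglob("*.py"):
        for i, line in enumerate(f.read_text(errors="ignore").splitlines(), 1):
            if re.search(pattern, line):
                hits.append(f"{f}:{i}:{line.strip()}")
    return hits
```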

## Project Structure

```
rlm_forge/
β”œβ”€β”€ __init__.py              # Package exports
β”œβ”€β”€ models.py                # Pydantic models (Action, Observation, State)
β”œβ”€β”€ client.py                # EnvClient for remote connections
└── server/
    β”œβ”€β”€ app.py               # FastAPI server (create_app)
    β”œβ”€β”€ environment.py       # Core Environment (reset/step)
    β”œβ”€β”€ sandbox.py           # Sandboxed Python REPL
    β”œβ”€β”€ repo_manager.py      # Repo cloning & dependency management
    β”œβ”€β”€ feature_extractor.py # Source file extraction & stub generation
    └── reward.py            # Composite reward computation
```

## Quick Start

### Install

```bash
uv sync
```

### Run the Server

```bash
uv run uvicorn rlm_forge.server.app:app --host 0.0.0.0 --port 8000
```

### Use the Environment Directly

```python
from rlm_forge.server.environment import RLMForgeEnvironment
from rlm_forge.models import RLMForgeAction

env = RLMForgeEnvironment()
obs = env.reset(seed=1)
print(obs.task_description)

# Agent takes actions
obs = env.step(RLMForgeAction(code="print(read_file('test.py'))"))
obs = env.step(RLMForgeAction(code="write_file('slugify/slugify.py', '...')"))
obs = env.step(RLMForgeAction(code="FINAL()"))
print(f"Reward: {obs.reward}")
```

### Connect via Client

```python
from rlm_forge.client import RLMForgeClient
from rlm_forge.models import RLMForgeAction

client = RLMForgeClient(base_url="http://localhost:8000")
client.connect()

result = client.reset(seed=1)
result = client.step(RLMForgeAction(code="print(list_dir())"))
result = client.step(RLMForgeAction(code="FINAL()"))
print(f"Reward: {result.reward}")
```

## Training

See `rlm_forge_training.ipynb` for the full GRPO training notebook. The notebook is designed for Google Colab with an H100 GPU.

Key training approach:
- **Multi-step trajectory concatenation**: Full episode (all code actions) treated as one GRPO "completion"
- **Group Relative Policy Optimization**: Multiple completions per task, advantages computed relative to group mean
- **LoRA fine-tuning**: 4-bit quantized Qwen2.5-Coder-32B with LoRA adapter
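The group-relative advantage at the heart of GRPO can be sketched in a few lines: for each task, sample a group of completions, then normalize each completion's reward against the group's mean (and, commonly, its standard deviation). A sketch of the idea, not the notebook's exact code:

```python
import statistics

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Advantage of each completion relative to its group:
    (r - mean) / (std + eps). Sketch of the GRPO baseline."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the baseline is the group mean rather than a learned value function, no critic network is needed.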

## Reward Breakdown

| Component | Weight | Description |
|-----------|--------|-------------|
| Test Pass Rate | 55% | Fraction of tests passing |
| Structural Validity | 15% | AST parse check + import check |
| Efficiency | 30% | Tiered by iteration budget used |
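The weighted combination in the table above reduces to a single weighted sum. A sketch for intuition; see `server/reward.py` for the actual computation:

```python
def composite_reward(pass_rate: float, structural: float, efficiency: float) -> float:
    """Weighted sum matching the table: 55% tests, 15% structure,
    30% efficiency. Each component is expected in [0, 1]."""
    return 0.55 * pass_rate + 0.15 * structural + 0.30 * efficiency
```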

## Curated Repos

| Repo | Source File | Tests | Difficulty |
|------|-----------|-------|------------|
| python-slugify | `slugify/slugify.py` | 82 | Easy |
| humanize (number) | `src/humanize/number.py` | 219 | Medium |
| humanize (time) | `src/humanize/time.py` | varies | Medium |

## Docker

```bash
docker build -t rlm-forge .
docker run -p 8000:8000 rlm-forge
```

The Dockerfile pre-clones curated repos to avoid network I/O on each `reset()`.

## Deploy to HF Spaces

```bash
openenv push -r your-username/rlm-forge
```