---
title: REPL Environment Server
emoji: 🎮
colorFrom: yellow
colorTo: indigo
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# REPL Environment for OpenEnv

A Python REPL environment for training language models on code execution tasks, based on the [Recursive Language Models (RLM)](https://arxiv.org/abs/2512.24601) paradigm.

## Overview

The RLM paradigm allows language models to:
- Execute Python code in a sandboxed REPL environment
- Make recursive calls to themselves or other LMs via `llm_query()` / `llm_query_batched()`
- Handle contexts far larger than the model's own window by programmatically decomposing and exploring data
- Terminate with explicit `FINAL(answer)` or `answer = {"content": ..., "ready": True}` signals
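
The decomposition idea in the list above can be sketched in plain Python. This is a minimal illustration, not code from this package: `chunk_text` and `map_reduce` are hypothetical helper names, and `query` stands in for whatever single-prompt LLM function you supply.

```python
from typing import Callable


def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into fixed-size chunks, optionally overlapping."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def map_reduce(context: str, query: Callable[[str], str], chunk_size: int = 1000) -> str:
    """Ask `query` about each chunk, then once more about the combined notes."""
    notes = [query(f"Extract key facts: {chunk}") for chunk in chunk_text(context, chunk_size)]
    return query("Combine these notes: " + "\n".join(notes))
```

Inside the REPL, a model could write similar code itself, using the injected `llm_query()` as the `query` argument.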

## Features

- **Unified API**: Same `REPLEnv` class works for both local and remote execution
- **Sandboxed Python Execution**: Safe code execution with restricted builtins
- **Context Loading**: Load large contexts that agents can explore programmatically
- **Multiple Finalization Patterns**:
  - Direct call: `FINAL(answer)` - helper function injected into namespace
  - Print pattern: `print('FINAL(answer)')` or `print('FINAL_VAR(var_name)')`
  - Prime Intellect style: `answer = {"content": "...", "ready": True}`
- **Iteration Limits**: Configurable maximum steps per episode
- **Reward Signals**: Customizable reward functions for RL training
- **Optional LLM Oracle**: Can enable `llm_query()` and `llm_query_batched()` for recursive calls
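
To give a feel for what "restricted builtins" can mean, here is a minimal standalone sketch. It is not the environment's implementation: `SAFE_BUILTINS` and `run_restricted` are hypothetical names, and the allow-list is an assumption. Note that limiting builtins on its own is generally not considered a complete security boundary, which is one reason to run the server in Docker.

```python
import contextlib
import io

# Hypothetical allow-list; the real environment's list may differ.
SAFE_BUILTINS = {
    "len": len, "range": range, "print": print, "str": str, "int": int,
    "float": float, "list": list, "dict": dict, "sum": sum, "min": min,
    "max": max, "sorted": sorted, "enumerate": enumerate,
}


def run_restricted(code: str, namespace: dict) -> tuple[str, str]:
    """Execute code with allow-listed builtins; return (stdout, error message)."""
    # exec() uses namespace["__builtins__"] when it is present
    namespace.setdefault("__builtins__", dict(SAFE_BUILTINS))
    stdout = io.StringIO()
    error = ""
    try:
        with contextlib.redirect_stdout(stdout):
            exec(code, namespace)
    except Exception as exc:  # report the failure instead of raising
        error = f"{type(exc).__name__}: {exc}"
    return stdout.getvalue(), error
```

Because the same `namespace` dict is reused across calls, variables persist between steps, while names such as `open` or `import` statements fail with an error message.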

## Quick Start

### Local Mode (No Server Required)

```python
from repl_env import REPLEnv

# Create environment - runs locally by default
with REPLEnv() as env:
    result = env.reset(
        context="This is a large document with lots of text...",
        task_prompt="Find the word count"
    )

    # Execute code iteratively
    result = env.execute("words = context.split()")
    result = env.execute("count = len(words)")
    result = env.execute("print(f'FINAL({count})')")

    print(f"Done: {result.done}")
    print(f"Final Answer: {env.state().final_answer}")
```

### Remote Server Mode

```python
from repl_env import REPLEnv

# Connect to a running server - same API!
with REPLEnv(base_url="https://my-server.hf.space") as env:
    result = env.reset(context="...", task_prompt="...")
    result = env.execute("count = len(context)")
    result = env.execute("print(f'FINAL({count})')")
```

### Local Mode with LLM Support

```python
from repl_env import REPLEnv

def my_llm_query(prompt: str) -> str:
    # `your_llm` is a placeholder for your own model client
    return your_llm.generate(prompt)

def my_llm_query_batched(prompts: list[str]) -> list[str]:
    return [my_llm_query(p) for p in prompts]

# Pass LLM functions for recursive calls
with REPLEnv(llm_query_fn=my_llm_query, llm_batch_fn=my_llm_query_batched) as env:
    result = env.reset(context=large_document, task_prompt="Summarize this")

    # Now the executed code can use llm_query() and llm_query_batched()!
    result = env.execute("summary = llm_query('Summarize: ' + context[:1000])")
```

### From Docker or HuggingFace Hub

```python
from repl_env import REPLEnv

# Start from Docker image
env = REPLEnv.from_docker_image("repl-env:latest")

# Or from HuggingFace Hub
env = REPLEnv.from_hub("openenv/repl-env")
```

## API Reference

### REPLEnv

```python
class REPLEnv:
    def __init__(
        self,
        base_url: str | None = None,      # Server URL (None = local mode)
        *,
        # Local-only options
        llm_query_fn: Callable | None = None,    # Function for llm_query()
        llm_batch_fn: Callable | None = None,    # Function for llm_query_batched()
        max_output_length: int = 8192,           # Max stdout/stderr chars
        context_preview_length: int = 500,       # Chars in context preview
        reward_on_success: float = 1.0,          # Reward on FINAL()
        reward_on_iteration: float = 0.0,        # Reward per step
        reward_on_failure: float = -0.1,         # Reward on max iterations
        reward_on_error: float = -0.05,          # Reward on execution error
        # Remote-only options
        connect_timeout_s: float = 10.0,
        message_timeout_s: float = 60.0,
    ): ...

    def reset(
        self,
        *,
        context: str = "",              # Text to analyze (as `context` variable)
        task_prompt: str = "",          # Task description
        max_iterations: int = 30,       # Max code execution steps
        seed: int | None = None,        # Random seed
        episode_id: str | None = None,  # Custom episode ID
        hf_token: str | None = None,    # HF token for llm_query (remote mode)
        llm_model: str | None = None,   # Model for llm_query (remote mode)
    ) -> StepResult[REPLObservation]: ...

    def execute(self, code: str) -> StepResult[REPLObservation]: ...
    def step(self, action: REPLAction) -> StepResult[REPLObservation]: ...
    def submit_final_answer(self, answer: str) -> StepResult[REPLObservation]: ...
    def state(self) -> REPLState: ...
    def close(self) -> None: ...
```
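
The `max_output_length` option caps how much stdout/stderr is returned per step. How the environment truncates is not documented here, so the following is only a sketch of one plausible rule; `truncate_output` and the marker text are hypothetical.

```python
def truncate_output(text: str, max_length: int = 8192) -> str:
    """Keep the first max_length characters and note how many were dropped."""
    if len(text) <= max_length:
        return text
    omitted = len(text) - max_length
    return text[:max_length] + f"\n... [truncated {omitted} chars]"
```

A cap like this keeps a single `print(context)` from flooding the model's conversation history.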

### Action Space

```python
class REPLAction:
    code: str = ""                    # Python code to execute
    is_final: bool = False            # Whether this signals the final answer
    final_answer: str | None = None   # The final answer (if is_final=True)
```

### Observation Space

```python
class REPLObservation:
    result: CodeBlockResult      # Execution result (stdout, stderr, etc.)
    context_preview: str | None  # First `context_preview_length` chars of context (default 500)
    context_length: int          # Total context length
    available_variables: list    # Variables in namespace
    iteration: int               # Current iteration
    max_iterations: int          # Max iterations
    done: bool                   # Episode complete?
    reward: float                # Step reward
    metadata: dict               # Additional info (final_answer, etc.)
```

## Finalization Patterns

### Pattern 1: Direct FINAL() call (recommended)
```python
result = env.execute("answer = 42")
result = env.execute("FINAL(answer)")
# -> done=True, final_answer="42"
```

### Pattern 2: FINAL() via print
```python
result = env.execute("answer = 42")
result = env.execute("print(f'FINAL({answer})')")
# -> done=True, final_answer="42"
```

### Pattern 3: FINAL_VAR() for variable reference
```python
result = env.execute("my_result = 'The answer is 42'")
# Direct call (recommended) - pass variable name as string
# FINAL_VAR looks up the variable and returns FINAL(value)
result = env.execute('FINAL_VAR("my_result")')
# -> done=True, final_answer="The answer is 42"

# Also works via print (for regex detection)
result = env.execute("print('FINAL_VAR(my_result)')")
# -> done=True, final_answer="The answer is 42"
```

### Pattern 4: Prime Intellect style answer dict
```python
result = env.execute("answer['content'] = '42'")
result = env.execute("answer['ready'] = True")
# -> done=True, final_answer="42"
```
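
The print-based and dict-based patterns above imply some detection step after each execution. A minimal sketch of such a step, assuming regex matching on stdout and a check of the `answer` dict, might look like this. The function name `detect_final` and both regular expressions are hypothetical and may not match the environment's actual rules.

```python
import re
from typing import Optional

# FINAL_VAR(name), with or without quotes around the variable name
FINAL_VAR_RE = re.compile(r"FINAL_VAR\(\s*['\"]?(\w+)['\"]?\s*\)")
# FINAL(anything), matched greedily up to the last closing parenthesis
FINAL_RE = re.compile(r"FINAL\((.*)\)", re.DOTALL)


def detect_final(stdout: str, namespace: dict) -> Optional[str]:
    """Return the final answer signalled by stdout or the answer dict, else None."""
    match = FINAL_VAR_RE.search(stdout)
    if match and match.group(1) in namespace:
        return str(namespace[match.group(1)])
    match = FINAL_RE.search(stdout)
    if match:
        return match.group(1).strip()
    answer = namespace.get("answer")
    if isinstance(answer, dict) and answer.get("ready"):
        return str(answer.get("content", ""))
    return None
```

In this sketch `FINAL_VAR` is checked first, so a printed variable reference is resolved to its value rather than returned as literal text.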

## Prompts Module

The `prompts` module provides RLM-style prompts and parsing utilities:

```python
from repl_env.prompts import (
    # System prompts (from official RLM repo)
    RLM_SYSTEM_PROMPT,           # Base prompt with llm_query_batched
    RLM_SYSTEM_PROMPT_QWEN,      # For Qwen models (adds cost warning)

    # Prompt building
    QueryMetadata,               # Context metadata dataclass
    build_rlm_system_prompt,     # Build system messages with metadata
    build_user_prompt,           # Build user prompt for each iteration
    build_initial_prompt,        # Convenience wrapper for iteration 0

    # Parsing utilities
    extract_code_blocks,         # Extract code from ```repl``` or ```python``` blocks
    format_observation,          # Format execution result for LLM
)

# Example: Build messages using official RLM style
query_metadata = QueryMetadata(
    context_lengths=[len(context)],
    context_total_length=len(context),
    context_type="str",
)
messages = build_rlm_system_prompt(RLM_SYSTEM_PROMPT_QWEN, query_metadata)
messages.append(build_user_prompt(root_prompt="Count words in the context", iteration=0))

# Extract code from LLM response (supports ```repl``` and ```python```)
response = "Here's my solution:\n```repl\ncount = len(context.split())\nFINAL(count)\n```"
code_blocks = extract_code_blocks(response)  # ["count = len(context.split())\nFINAL(count)"]
```
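
For readers curious how fenced-block extraction can work, here is a small standalone sketch. It is not the source of `extract_code_blocks`; the name `extract_blocks_sketch` and the regular expression are assumptions made for illustration.

```python
import re

FENCE = "`" * 3  # built programmatically so this snippet is easy to embed
BLOCK_RE = re.compile(
    FENCE + r"(?:repl|python)[ \t]*\n(.*?)\n?" + FENCE,
    re.DOTALL,
)


def extract_blocks_sketch(response: str) -> list[str]:
    """Return the bodies of all repl/python fenced blocks, in order."""
    return [body.strip() for body in BLOCK_RE.findall(response)]
```

Blocks tagged with any other language are ignored, so prose and shell snippets in a model reply are never executed.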

## Examples

See the `examples/` directory for complete working examples:

- **`examples/repl_with_llm.py`** - Full RLM loop with local Qwen model
- **`examples/repl_oolong_simple.py`** - RLM on Oolong benchmark with HuggingFace Inference API

Run examples:
```bash
# Full RLM example with local model (requires GPU)
python examples/repl_with_llm.py

# Oolong benchmark with HF Inference API (requires HF_TOKEN)
python examples/repl_oolong_simple.py
```

## Model Usage

### Inference Loop

A typical model inference loop where the LLM generates code and the environment executes it:

```python
from repl_env import REPLEnv
from repl_env.prompts import RLM_SYSTEM_PROMPT, build_initial_prompt, extract_code_blocks, format_observation

# Works with both local and remote!
with REPLEnv(base_url="http://localhost:8000") as env:  # or REPLEnv() for local
    result = env.reset(
        context="The quick brown fox jumps over the lazy dog. " * 1000,
        task_prompt="Count how many times 'fox' appears"
    )

    messages = [
        {"role": "system", "content": RLM_SYSTEM_PROMPT},
        {"role": "user", "content": build_initial_prompt(
            task_prompt="Count how many times 'fox' appears",
            context_length=result.observation.context_length,
            context_preview=result.observation.context_preview,
            variables=result.observation.available_variables,
        )},
    ]

    # Bound the number of LLM turns so a reply without code cannot loop forever
    for _ in range(result.observation.max_iterations):
        if result.done:
            break

        # Get code from LLM
        response = your_llm.chat(messages)
        code_blocks = extract_code_blocks(response)

        for code in code_blocks:
            result = env.execute(code)
            if result.done:
                break

        # Update conversation
        messages.append({"role": "assistant", "content": response})
        messages.append({"role": "user", "content": format_observation(result.observation)})

    print(f"Final answer: {env.state().final_answer}")
```

### Recursive LLM Calls (RLM Paradigm)

The key insight of RLM is that models can make recursive calls to themselves or to other LLMs from within the code they generate:

```python
from repl_env import REPLEnv

def llm_query(prompt: str) -> str:
    """Single LLM call - model can call this from executed code"""
    return your_llm.generate(prompt)

def llm_query_batched(prompts: list[str]) -> list[str]:
    """Batch LLM calls for efficiency (parallel in production)"""
    return [your_llm.generate(p) for p in prompts]

# Create environment with LLM oracle (local mode)
with REPLEnv(llm_query_fn=llm_query, llm_batch_fn=llm_query_batched) as env:
    result = env.reset(
        context=massive_document,  # Could be 100K+ chars
        task_prompt="Summarize each section and find key themes"
    )

    # The model can now generate code like this:
    code = """
# Split document into sections
sections = context.split('\\n\\n')

# Use LLM to summarize each section (recursive call!)
summaries = llm_query_batched([f"Summarize: {s[:1000]}" for s in sections[:10]])

# Combine summaries
combined = '\\n'.join(summaries)

# Final synthesis using another LLM call
answer['content'] = llm_query(f"Find key themes in: {combined}")
answer['ready'] = True
"""

    result = env.execute(code)
    print(f"Done: {result.done}, Answer: {env.state().final_answer}")
```
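
The batched function above runs prompts one after another. If your LLM client is thread-safe, a parallel version can be built with the standard library. This is a sketch under that assumption; `make_batched` is a hypothetical helper, not part of `repl_env`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def make_batched(
    query: Callable[[str], str], max_workers: int = 8
) -> Callable[[list[str]], list[str]]:
    """Wrap a single-prompt function into an order-preserving parallel batch function."""

    def batched(prompts: list[str]) -> list[str]:
        if not prompts:
            return []
        workers = min(max_workers, len(prompts))
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Executor.map yields results in the same order as the inputs
            return list(pool.map(query, prompts))

    return batched
```

The result can be passed as `llm_batch_fn=make_batched(llm_query)` alongside `llm_query_fn=llm_query`.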

### RL Training Integration

For RL training, integrate with frameworks like TRL, prime-rl, or verifiers:

```python
from repl_env import REPLEnv

def collect_trajectory(env, policy, context, task):
    """Collect a single trajectory for RL training"""
    result = env.reset(context=context, task_prompt=task)

    trajectory = []
    total_reward = 0

    while not result.done:
        # Policy generates code
        code = policy.generate(result.observation)

        # Step environment
        next_result = env.execute(code)

        # Store transition
        trajectory.append({
            "observation": result.observation,
            "action": code,
            "reward": next_result.reward,
            "next_observation": next_result.observation,
            "done": next_result.done,
        })

        total_reward += next_result.reward
        result = next_result

    return trajectory, total_reward

# Training loop
with REPLEnv(
    reward_on_success=1.0,
    reward_on_iteration=0.0,
    reward_on_error=-0.05,
    reward_on_failure=-0.1,
) as env:
    for epoch in range(num_epochs):
        for context, task, ground_truth in dataset:
            trajectory, reward = collect_trajectory(env, policy, context, task)

            # Verify answer correctness (optional external reward)
            if trajectory:
                final_answer = env.state().final_answer
                if final_answer == ground_truth:
                    reward += verification_bonus

            # Update policy (use your RL framework - PPO, GRPO, DPO, etc.)
            policy.update(trajectory, reward)
```
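
Many policy-gradient methods consume per-step returns rather than a single episode total. As a sketch, the rewards stored in each transition can be turned into discounted returns like this; `discounted_returns` is a hypothetical helper and the discount factor is your choice.

```python
def discounted_returns(rewards: list[float], gamma: float = 0.99) -> list[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

With the trajectory format above, the input would be `[t["reward"] for t in trajectory]`.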

### Reward Configuration

Configure rewards for different outcomes:

```python
env = REPLEnv(
    reward_on_success=1.0,    # When FINAL() is called
    reward_on_iteration=0.0,  # Per step (can be negative to encourage efficiency)
    reward_on_error=-0.05,    # When code execution fails
    reward_on_failure=-0.1,   # When max iterations reached without answer
)
```
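
To make the four outcomes concrete, the sketch below maps a step's outcome to one reward value. The precedence order (success, then failure, then error, then plain iteration) is an assumption made for illustration, and `RewardConfig` and `step_reward` are hypothetical names; the environment's actual rule may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    """Mirrors the four reward_on_* constructor arguments."""
    on_success: float = 1.0
    on_iteration: float = 0.0
    on_error: float = -0.05
    on_failure: float = -0.1


def step_reward(
    cfg: RewardConfig, *, finalized: bool, errored: bool, out_of_steps: bool
) -> float:
    """Pick one reward per step; the precedence order is an assumption."""
    if finalized:
        return cfg.on_success
    if out_of_steps:
        return cfg.on_failure
    if errored:
        return cfg.on_error
    return cfg.on_iteration
```

Under this assumed rule, setting `on_iteration` to a small negative value makes longer episodes earn less, which is the efficiency incentive mentioned in the comment above.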

## Environment Configuration

| Environment Variable | Description | Default |
|---------------------|-------------|---------|
| `REPL_CONTEXT` | Initial context to load | "" |
| `REPL_TASK_PROMPT` | Task description | "" |
| `REPL_MAX_ITERATIONS` | Max steps per episode | 30 |
| `HF_TOKEN` | HuggingFace token for llm_query (server fallback) | None |
| `LLM_MODEL` | Model for llm_query/llm_query_batched | Qwen/Qwen3-Coder-480B-A35B-Instruct |
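
As an illustration of how these variables and defaults could be read, here is a small sketch. `ServerConfig` and `load_config` are hypothetical names and this is not the server's own startup code; only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ServerConfig:
    context: str
    task_prompt: str
    max_iterations: int
    hf_token: Optional[str]
    llm_model: str


def load_config(env: Mapping[str, str] = os.environ) -> ServerConfig:
    """Read the documented variables, falling back to the documented defaults."""
    return ServerConfig(
        context=env.get("REPL_CONTEXT", ""),
        task_prompt=env.get("REPL_TASK_PROMPT", ""),
        max_iterations=int(env.get("REPL_MAX_ITERATIONS", "30")),
        hf_token=env.get("HF_TOKEN"),
        llm_model=env.get("LLM_MODEL", "Qwen/Qwen3-Coder-480B-A35B-Instruct"),
    )
```

Passing a plain dict instead of `os.environ` makes the function easy to test without touching the process environment.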

## Running the Server

### Using UV
```bash
cd envs/repl_env
uv run --project . server
```

### Using Docker
```bash
docker build -t repl-env:latest -f server/Dockerfile .
docker run -p 8000:8000 repl-env:latest
```

### Testing
```bash
pytest tests/
```

## References

- [RLM Paper (arXiv:2512.24601)](https://arxiv.org/abs/2512.24601)
- [RLM Implementation](https://github.com/alexzhang13/rlm)
- [Alex Zhang's RLM Blog](https://alexzhang13.github.io/blog/2025/rlm/)
- [Prime Intellect RLM Blog](https://www.primeintellect.ai/blog/rlm)