--- title: REPL Environment Server emoji: 🎮 colorFrom: yellow colorTo: indigo sdk: docker pinned: false app_port: 8000 base_path: /web tags: - openenv --- # REPL Environment for OpenEnv A Python REPL environment for training language models on code execution tasks, based on the [Recursive Language Models (RLM)](https://arxiv.org/abs/2512.24601) paradigm. ## Overview The RLM paradigm allows language models to: - Execute Python code in a sandboxed REPL environment - Make recursive calls to themselves or other LMs via `llm_query()` / `llm_query_batched()` - Handle near-infinite context by programmatically decomposing and exploring data - Terminate with explicit `FINAL(answer)` or `answer = {"content": ..., "ready": True}` signals ## Features - **Unified API**: Same `REPLEnv` class works for both local and remote execution - **Sandboxed Python Execution**: Safe code execution with restricted builtins - **Context Loading**: Load large contexts that agents can explore programmatically - **Multiple Finalization Patterns**: - Direct call: `FINAL(answer)` - helper function injected into namespace - Print pattern: `print('FINAL(answer)')` or `print('FINAL_VAR(var_name)')` - Prime Intellect style: `answer = {"content": "...", "ready": True}` - **Iteration Limits**: Configurable maximum steps per episode - **Reward Signals**: Customizable reward functions for RL training - **Optional LLM Oracle**: Can enable `llm_query()` and `llm_query_batched()` for recursive calls ## Quick Start ### Local Mode (No Server Required) ```python from repl_env import REPLEnv # Create environment - runs locally by default with REPLEnv() as env: result = env.reset( context="This is a large document with lots of text...", task_prompt="Find the word count" ) # Execute code iteratively result = env.execute("words = context.split()") result = env.execute("count = len(words)") result = env.execute("print(f'FINAL({count})')") print(f"Done: {result.done}") print(f"Final Answer: {env.state().final_answer}") ``` ### Remote Server Mode ```python from repl_env import REPLEnv # Connect to a running server - same API! with REPLEnv(base_url="https://my-server.hf.space") as env: result = env.reset(context="...", task_prompt="...") result = env.execute("count = len(context)") result = env.execute("print(f'FINAL({count})')") ``` ### Local Mode with LLM Support ```python from repl_env import REPLEnv def my_llm_query(prompt: str) -> str: return your_llm.generate(prompt) def my_llm_query_batched(prompts: list[str]) -> list[str]: return [my_llm_query(p) for p in prompts] # Pass LLM functions for recursive calls with REPLEnv(llm_query_fn=my_llm_query, llm_batch_fn=my_llm_query_batched) as env: result = env.reset(context=large_document, task_prompt="Summarize this") # Now the executed code can use llm_query() and llm_query_batched()! result = env.execute("summary = llm_query('Summarize: ' + context[:1000])") ``` ### From Docker or HuggingFace Hub ```python from repl_env import REPLEnv # Start from Docker image env = REPLEnv.from_docker_image("repl-env:latest") # Or from HuggingFace Hub env = REPLEnv.from_hub("openenv/repl-env") ``` ## API Reference ### REPLEnv ```python class REPLEnv: def __init__( self, base_url: str | None = None, # Server URL (None = local mode) *, # Local-only options llm_query_fn: Callable | None = None, # Function for llm_query() llm_batch_fn: Callable | None = None, # Function for llm_query_batched() max_output_length: int = 8192, # Max stdout/stderr chars context_preview_length: int = 500, # Chars in context preview reward_on_success: float = 1.0, # Reward on FINAL() reward_on_iteration: float = 0.0, # Reward per step reward_on_failure: float = -0.1, # Reward on max iterations reward_on_error: float = -0.05, # Reward on execution error # Remote-only options connect_timeout_s: float = 10.0, message_timeout_s: float = 60.0, ): ... def reset( self, *, context: str = "", # Text to analyze (as `context` variable) task_prompt: str = "", # Task description max_iterations: int = 30, # Max code execution steps seed: int | None = None, # Random seed episode_id: str | None = None, # Custom episode ID hf_token: str | None = None, # HF token for llm_query (remote mode) llm_model: str | None = None, # Model for llm_query (remote mode) ) -> StepResult[REPLObservation]: ... def execute(self, code: str) -> StepResult[REPLObservation]: ... def step(self, action: REPLAction) -> StepResult[REPLObservation]: ... def submit_final_answer(self, answer: str) -> StepResult[REPLObservation]: ... def state(self) -> REPLState: ... def close(self) -> None: ... ``` ### Action Space ```python class REPLAction: code: str = "" # Python code to execute is_final: bool = False # Whether this signals the final answer final_answer: str | None = None # The final answer (if is_final=True) ``` ### Observation Space ```python class REPLObservation: result: CodeBlockResult # Execution result (stdout, stderr, etc.) context_preview: str | None # First 500 chars of context context_length: int # Total context length available_variables: list # Variables in namespace iteration: int # Current iteration max_iterations: int # Max iterations done: bool # Episode complete? reward: float # Step reward metadata: dict # Additional info (final_answer, etc.) ``` ## Finalization Patterns ### Pattern 1: Direct FINAL() call (recommended) ```python result = env.execute("answer = 42") result = env.execute("FINAL(answer)") # -> done=True, final_answer="42" ``` ### Pattern 2: FINAL() via print ```python result = env.execute("answer = 42") result = env.execute("print(f'FINAL({answer})')") # -> done=True, final_answer="42" ``` ### Pattern 3: FINAL_VAR() for variable reference ```python result = env.execute("my_result = 'The answer is 42'") # Direct call (recommended) - pass variable name as string # FINAL_VAR looks up the variable and returns FINAL(value) result = env.execute('FINAL_VAR("my_result")') # -> done=True, final_answer="The answer is 42" # Also works via print (for regex detection) result = env.execute("print('FINAL_VAR(my_result)')") # -> done=True, final_answer="The answer is 42" ``` ### Pattern 4: Prime Intellect style answer dict ```python result = env.execute("answer['content'] = '42'") result = env.execute("answer['ready'] = True") # -> done=True, final_answer="42" ``` ## Prompts Module The `prompts` module provides RLM-style prompts and parsing utilities: ```python from repl_env.prompts import ( # System prompts (from official RLM repo) RLM_SYSTEM_PROMPT, # Base prompt with llm_query_batched RLM_SYSTEM_PROMPT_QWEN, # For Qwen models (adds cost warning) # Prompt building QueryMetadata, # Context metadata dataclass build_rlm_system_prompt, # Build system messages with metadata build_user_prompt, # Build user prompt for each iteration build_initial_prompt, # Convenience wrapper for iteration 0 # Parsing utilities extract_code_blocks, # Extract code from ```repl``` or ```python``` blocks format_observation, # Format execution result for LLM ) # Example: Build messages using official RLM style query_metadata = QueryMetadata( context_lengths=[len(context)], context_total_length=len(context), context_type="str", ) messages = build_rlm_system_prompt(RLM_SYSTEM_PROMPT_QWEN, query_metadata) messages.append(build_user_prompt(root_prompt="Count words in the context", iteration=0)) # Extract code from LLM response (supports ```repl``` and ```python```) response = "Here's my solution:\n```repl\ncount = len(context.split())\nFINAL(count)\n```" code_blocks = extract_code_blocks(response) # ["count = len(context.split())\nFINAL(count)"] ``` ## Examples See the `examples/` directory for complete working examples: - **`examples/repl_with_llm.py`** - Full RLM loop with local Qwen model - **`examples/repl_oolong_simple.py`** - RLM on Oolong benchmark with HuggingFace Inference API Run examples: ```bash # Full RLM example with local model (requires GPU) python examples/repl_with_llm.py # Oolong benchmark with HF Inference API (requires HF_TOKEN) python examples/repl_oolong_simple.py ``` ## Model Usage ### Inference Loop A typical model inference loop where the LLM generates code and the environment executes it: ```python from repl_env import REPLEnv from repl_env.prompts import RLM_SYSTEM_PROMPT, build_initial_prompt, extract_code_blocks, format_observation # Works with both local and remote! with REPLEnv(base_url="http://localhost:8000") as env: # or REPLEnv() for local result = env.reset( context="The quick brown fox jumps over the lazy dog. " * 1000, task_prompt="Count how many times 'fox' appears" ) messages = [ {"role": "system", "content": RLM_SYSTEM_PROMPT}, {"role": "user", "content": build_initial_prompt( task_prompt="Count how many times 'fox' appears", context_length=result.observation.context_length, context_preview=result.observation.context_preview, variables=result.observation.available_variables, )}, ] while not result.done: # Get code from LLM response = your_llm.chat(messages) code_blocks = extract_code_blocks(response) for code in code_blocks: result = env.execute(code) if result.done: break # Update conversation messages.append({"role": "assistant", "content": response}) messages.append({"role": "user", "content": format_observation(result.observation)}) print(f"Final answer: {env.state().final_answer}") ``` ### Recursive LLM Calls (RLM Paradigm) The key insight of RLM is that models can make recursive calls to themselves or other LLMs from within the code: ```python from repl_env import REPLEnv def llm_query(prompt: str) -> str: """Single LLM call - model can call this from executed code""" return your_llm.generate(prompt) def llm_query_batched(prompts: list[str]) -> list[str]: """Batch LLM calls for efficiency (parallel in production)""" return [your_llm.generate(p) for p in prompts] # Create environment with LLM oracle (local mode) with REPLEnv(llm_query_fn=llm_query, llm_batch_fn=llm_query_batched) as env: result = env.reset( context=massive_document, # Could be 100K+ chars task_prompt="Summarize each section and find key themes" ) # The model can now generate code like this: code = """ # Split document into sections sections = context.split('\\n\\n') # Use LLM to summarize each section (recursive call!) summaries = llm_query_batched([f"Summarize: {s[:1000]}" for s in sections[:10]]) # Combine summaries combined = '\\n'.join(summaries) # Final synthesis using another LLM call answer['content'] = llm_query(f"Find key themes in: {combined}") answer['ready'] = True """ result = env.execute(code) print(f"Done: {result.done}, Answer: {env.state().final_answer}") ``` ### RL Training Integration For RL training, integrate with frameworks like TRL, prime-rl, or verifiers: ```python from repl_env import REPLEnv def collect_trajectory(env, policy, context, task): """Collect a single trajectory for RL training""" result = env.reset(context=context, task_prompt=task) trajectory = [] total_reward = 0 while not result.done: # Policy generates code code = policy.generate(result.observation) # Step environment next_result = env.execute(code) # Store transition trajectory.append({ "observation": result.observation, "action": code, "reward": next_result.reward, "next_observation": next_result.observation, "done": next_result.done, }) total_reward += next_result.reward result = next_result return trajectory, total_reward # Training loop with REPLEnv( reward_on_success=1.0, reward_on_iteration=0.0, reward_on_error=-0.05, reward_on_failure=-0.1, ) as env: for epoch in range(num_epochs): for context, task, ground_truth in dataset: trajectory, reward = collect_trajectory(env, policy, context, task) # Verify answer correctness (optional external reward) if trajectory: final_answer = env.state().final_answer if final_answer == ground_truth: reward += verification_bonus # Update policy (use your RL framework - PPO, GRPO, DPO, etc.) policy.update(trajectory, reward) ``` ### Reward Configuration Configure rewards for different outcomes: ```python env = REPLEnv( reward_on_success=1.0, # When FINAL() is called reward_on_iteration=0.0, # Per step (can be negative to encourage efficiency) reward_on_error=-0.05, # When code execution fails reward_on_failure=-0.1, # When max iterations reached without answer ) ``` ## Environment Configuration | Environment Variable | Description | Default | |---------------------|-------------|---------| | `REPL_CONTEXT` | Initial context to load | "" | | `REPL_TASK_PROMPT` | Task description | "" | | `REPL_MAX_ITERATIONS` | Max steps per episode | 30 | | `HF_TOKEN` | HuggingFace token for llm_query (server fallback) | None | | `LLM_MODEL` | Model for llm_query/llm_query_batched | Qwen/Qwen3-Coder-480B-A35B-Instruct | ## Running the Server ### Using UV ```bash cd envs/repl_env uv run --project . server ``` ### Using Docker ```bash docker build -t repl-env:latest -f server/Dockerfile . docker run -p 8000:8000 repl-env:latest ``` ### Testing ```bash pytest tests/ ``` ## References - [RLM Paper (arXiv:2512.24601)](https://arxiv.org/abs/2512.24601) - [RLM Implementation](https://github.com/alexzhang13/rlm) - [Alex Zhang's RLM Blog](https://alexzhang13.github.io/blog/2025/rlm/) - [Prime Intellect RLM Blog](https://www.primeintellect.ai/blog/rlm)