---
title: Explainer Env Environment Server
emoji: "\U0001F4BB"
colorFrom: pink
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# Research -> Interactive Explainer Environment

An OpenEnv RL environment that trains small language models to create interactive educational content. Given a STEM topic, the agent explores with explicit research tools, generates a **Marimo** reactive notebook or **Manim** math animation, and gets one repair attempt if lint/build validation fails.

## Episode Flow

```
reset()         --> topic + tier assigned
    |
explore x 0..3  --> choose research tools + queries
    |
generate x 1    --> produce marimo/manim code
    |
repair x 0..1   --> fix lint/build errors if needed --> episode ends
```

## Actions

**Explore** -- search for information relevant to the assigned topic:

```python
ExplainerAction(
    action_type="explore",
    tool="search_arxiv",
    query="merge sort divide and conquer visual explanation",
    intent="find examples and visual intuition",
)
```

Available tools: `search_wikipedia`, `search_hf_papers`, `search_arxiv`, `search_scholar`, `fetch_docs`, and `search_hf_hub`.

**Generate** -- produce educational code using accumulated research:

```python
ExplainerAction(
    action_type="generate",
    format="marimo",  # or "manim"
    code="import marimo...",
    narration="...",  # manim only
)
```

**Repair** -- revise generated code using lint/build feedback:

```python
ExplainerAction(
    action_type="repair",
    format="marimo",
    code="import marimo...",
    repair_notes="fixed the reported Marimo validation error",
)
```

## Reward System

Multi-component rewards across exploration, generation, and repair. See [rewards/README.md](rewards/README.md) for the full breakdown.

**Exploration** (per-step): tool choice, query quality, source quality, coverage delta, novelty, diversity, gated by information sufficiency. Step cost of -0.05 forces the agent to justify each search.

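As a rough illustration of how the exploration components might combine, the sketch below averages the six per-step signals, applies a sufficiency gate, and adds the step cost. Only the -0.05 step cost comes from this README; the `ExploreSignals` class, the `explore_reward` function, the unweighted mean, and the reading of "gated by information sufficiency" as "no further credit once enough context has been gathered" are all assumptions made for illustration. The actual weighting lives in `rewards/exploration.py` and may differ.

```python
from dataclasses import dataclass, astuple

STEP_COST = -0.05  # from the README: charged on every explore step


@dataclass
class ExploreSignals:
    """Hypothetical container for per-step scores, each assumed to lie in [0, 1]."""

    tool_choice: float
    query_quality: float
    source_quality: float
    coverage_delta: float
    novelty: float
    diversity: float


def explore_reward(signals: ExploreSignals, already_sufficient: bool) -> float:
    """Hypothetical combination: unweighted mean, sufficiency gate, then step cost."""
    components = astuple(signals)
    base = sum(components) / len(components)
    # Assumption: once the gathered context is already sufficient, an extra
    # search earns no component credit but still pays the step cost.
    gate = 0.0 if already_sufficient else 1.0
    return gate * base + STEP_COST
```

Under these assumptions, a search made after the context is already sufficient nets a negative reward, which is one way a step cost can push the agent to justify each query.
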
**Generation/repair**: keyword coverage, format match, structural quality (via `marimo check` CLI or manim scene analysis), narration (manim only), context usage, and repair success.

Key design: `marimo check` CLI catches 5 breaking rules (MB001-MB005) in ~100ms. Code that doesn't parse scores 0. Code that doesn't execute gets quality * 0.4.

## Quick Start

```bash
# Install & run locally
cd explainer_env && uv sync
uv run server  # http://localhost:8000

# Client usage
python -c "
from client import ExplainerEnv
from models import ExplainerAction

with ExplainerEnv(base_url='http://localhost:8000').sync() as sc:
    result = sc.reset()
    print(f'Topic: {result.observation.topic}, Tier: {result.observation.tier}')

    # Explore
    result = sc.step(ExplainerAction(
        action_type='explore',
        tool='search_wikipedia',
        query=result.observation.topic,
        intent='overview',
    ))
    print(f'Explore reward: {result.reward:.3f}')

    # Generate
    result = sc.step(ExplainerAction(
        action_type='generate',
        format='marimo',
        code='import marimo as mo\napp = mo.App()\n@app.cell\ndef _():\n    mo.md(\"# Hello\")\n    return\n',
    ))
    print(f'Generate reward: {result.reward:.3f}, done: {result.done}')
"
```

## Concurrent WebSocket Sessions

The server supports multiple concurrent WebSocket connections for parallel GRPO training rollouts:

```python
from client import ExplainerEnv
from models import ExplainerAction
from concurrent.futures import ThreadPoolExecutor


def run_episode(client_id: int):
    with ExplainerEnv(base_url="http://localhost:8000").sync() as sc:
        result = sc.reset()
        result = sc.step(ExplainerAction(
            action_type="explore",
            tool="search_wikipedia",
            query=result.observation.topic,
        ))
        result = sc.step(ExplainerAction(
            action_type="generate",
            format="marimo",
            code="import marimo as mo\napp = mo.App()\n@app.cell\ndef _():\n    return\n",
        ))
        return client_id, result.reward


with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(run_episode, range(4)))
```

## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/reset` | POST | Start new episode, get topic assignment |
| `/step` | POST | Submit action, get observation + reward |
| `/state` | GET | Current episode state |
| `/schema` | GET | Action/Observation JSON schemas |
| `/ws` | WebSocket | Low-latency session for training |
| `/docs` | GET | Interactive API docs |

## Task Bank

26 tasks across 4 categories (ML, Math, Algorithms, Statistics), 3 difficulty levels (easy, medium, hard), and 3 audience tiers (beginner, intermediate, advanced). Each task specifies keywords for reward scoring and an optional preferred output format.

## File Structure

```
explainer_env/
├── server/
│   ├── explainer_env_environment.py  # Environment logic (reset/step/state)
│   ├── app.py                        # FastAPI server (create_app)
│   └── Dockerfile                    # Multi-stage Docker build
├── rewards/
│   ├── exploration.py                # Explore-phase reward components
│   ├── generation.py                 # Generate/repair reward components
│   ├── sources.py                    # Compatibility wrapper for research tools
│   ├── sandbox.py                    # Code validation (marimo check, AST, execution)
│   └── README.md                     # Reward system documentation
├── research/                         # Research tools, structured results, retrieval
├── models.py                         # ExplainerAction, ExplainerObservation
├── task_bank.py                      # 26 curated STEM tasks
├── client.py                         # ExplainerEnv WebSocket client
└── openenv.yaml                      # OpenEnv manifest
```
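
For orientation on what the validation step in `rewards/sandbox.py` is responsible for, the two gates from the Reward System section (unparseable code scores 0, non-executing code keeps 40% of its quality score) can be sketched as follows. This is a minimal illustration, not the module's actual code: `gate_quality` is a hypothetical helper, and a bare `exec` in a fresh namespace stands in for whatever isolation the real sandbox uses.

```python
import ast

EXECUTION_FAILURE_FACTOR = 0.4  # from the README: non-executing code gets quality * 0.4


def gate_quality(code: str, quality: float) -> float:
    """Hypothetical helper: apply the parse and execution gates to a quality score."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return 0.0  # code that doesn't parse scores 0
    try:
        # Assumption: exec in a fresh namespace replaces the real sandbox here.
        # Do not run untrusted code this way outside an isolated environment.
        exec(compile(tree, "<generated>", "exec"), {"__name__": "__sketch__"})
    except Exception:
        return quality * EXECUTION_FAILURE_FACTOR
    return quality
```

For example, `gate_quality("def f(:", 0.8)` hits the parse gate, while `gate_quality("raise ValueError('boom')", 0.5)` parses but fails at runtime and is scaled by 0.4.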