---
title: Explainer Env Environment Server
emoji: "\U0001F4BB"
colorFrom: pink
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---
# Research -> Interactive Explainer Environment

An OpenEnv RL environment that trains small language models to create interactive educational content. Given a STEM topic, the agent explores with explicit research tools, generates a **Marimo** reactive notebook or **Manim** math animation, and gets one repair attempt if lint/build validation fails.
## Episode Flow

```
reset() --> topic + tier assigned
  |
explore x 0..3 --> choose research tools + queries
  |
generate x 1 --> produce marimo/manim code
  |
repair x 0..1 --> fix lint/build errors if needed --> episode ends
```
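The phase limits above can be sketched as a small validator. This is an illustrative helper, not part of the environment's API; the function name and shape are assumptions made for this sketch:

```python
# Sketch of the episode's phase limits: up to 3 explore steps,
# exactly one generate step, and at most one repair step.
MAX_EXPLORES = 3

def is_valid_episode(actions: list[str]) -> bool:
    """Check that a sequence of action types respects the episode flow."""
    i = 0
    explores = 0
    # Phase 1: zero to three explore steps.
    while i < len(actions) and actions[i] == "explore":
        explores += 1
        i += 1
    if explores > MAX_EXPLORES:
        return False
    # Phase 2: exactly one generate step.
    if i >= len(actions) or actions[i] != "generate":
        return False
    i += 1
    # Phase 3: an optional single repair, then the episode ends.
    if i < len(actions) and actions[i] == "repair":
        i += 1
    return i == len(actions)
```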
## Actions

**Explore** -- search for information relevant to the assigned topic:

```python
ExplainerAction(
    action_type="explore",
    tool="search_arxiv",
    query="merge sort divide and conquer visual explanation",
    intent="find examples and visual intuition",
)
```

Available tools: `search_wikipedia`, `search_hf_papers`, `search_arxiv`, `search_scholar`, `fetch_docs`, and `search_hf_hub`.
**Generate** -- produce educational code using accumulated research:

```python
ExplainerAction(
    action_type="generate",
    format="marimo",  # or "manim"
    code="import marimo...",
    narration="...",  # manim only
)
```
**Repair** -- revise generated code using lint/build feedback:

```python
ExplainerAction(
    action_type="repair",
    format="marimo",
    code="import marimo...",
    repair_notes="fixed the reported Marimo validation error",
)
```
## Reward System

Multi-component rewards across exploration, generation, and repair. See [rewards/README.md](rewards/README.md) for the full breakdown.

**Exploration** (per-step): tool choice, query quality, source quality, coverage delta, novelty, and diversity, all gated by information sufficiency. A step cost of -0.05 forces the agent to justify each search.
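As a rough sketch (not the actual reward code; component names and the averaging scheme here are assumptions), a per-step exploration reward with sufficiency gating and a fixed step cost might combine like this:

```python
STEP_COST = -0.05  # flat per-search cost from the description above

def exploration_reward(components: dict[str, float], sufficiency: float) -> float:
    """Illustrative combination: average component scores in [0, 1],
    gate the positive signal by information sufficiency in [0, 1],
    then apply the step cost. Uninformative searches net a negative reward."""
    positive = sum(components.values()) / max(len(components), 1)
    return positive * sufficiency + STEP_COST
```

Under this scheme a perfect search with full sufficiency scores just below 1.0, while a search that adds no information costs exactly the step penalty.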
**Generation/repair**: keyword coverage, format match, structural quality (via the `marimo check` CLI or Manim scene analysis), narration (Manim only), context usage, and repair success.

Key design: the `marimo check` CLI catches 5 breaking rules (MB001-MB005) in ~100ms. Code that doesn't parse scores 0; code that doesn't execute gets quality * 0.4.
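The parse/execute gate described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the environment's actual `rewards/sandbox.py`:

```python
import ast

def gated_score(code: str, quality: float, executes: bool) -> float:
    """Gate a quality score the way the text describes: code that
    fails to parse scores 0; code that parses but fails to execute
    keeps only 40% of its quality score."""
    try:
        ast.parse(code)       # parse gate: syntax errors zero everything out
    except SyntaxError:
        return 0.0
    if not executes:          # execution gate: quality * 0.4
        return quality * 0.4
    return quality
```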
## Quick Start

```bash
# Install & run locally
cd explainer_env && uv sync
uv run server  # http://localhost:8000

# Client usage
python -c "
from client import ExplainerEnv
from models import ExplainerAction

with ExplainerEnv(base_url='http://localhost:8000').sync() as sc:
    result = sc.reset()
    print(f'Topic: {result.observation.topic}, Tier: {result.observation.tier}')

    # Explore
    result = sc.step(ExplainerAction(
        action_type='explore',
        tool='search_wikipedia',
        query=result.observation.topic,
        intent='overview',
    ))
    print(f'Explore reward: {result.reward:.3f}')

    # Generate
    result = sc.step(ExplainerAction(
        action_type='generate',
        format='marimo',
        code='import marimo as mo\napp = mo.App()\n@app.cell\ndef _():\n    mo.md(\"# Hello\")\n    return\n',
    ))
    print(f'Generate reward: {result.reward:.3f}, done: {result.done}')
"
```
## Concurrent WebSocket Sessions

The server supports multiple concurrent WebSocket connections for parallel GRPO training rollouts:

```python
from concurrent.futures import ThreadPoolExecutor

from client import ExplainerEnv
from models import ExplainerAction

def run_episode(client_id: int):
    with ExplainerEnv(base_url="http://localhost:8000").sync() as sc:
        result = sc.reset()
        result = sc.step(ExplainerAction(
            action_type="explore",
            tool="search_wikipedia",
            query=result.observation.topic,
        ))
        result = sc.step(ExplainerAction(
            action_type="generate",
            format="marimo",
            code="import marimo as mo\napp = mo.App()\n@app.cell\ndef _():\n    return\n",
        ))
        return client_id, result.reward

with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(run_episode, range(4)))
```
## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/reset` | POST | Start new episode, get topic assignment |
| `/step` | POST | Submit action, get observation + reward |
| `/state` | GET | Current episode state |
| `/schema` | GET | Action/Observation JSON schemas |
| `/ws` | WebSocket | Low-latency session for training |
| `/docs` | GET | Interactive API docs |
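For raw HTTP use, a `/step` request body might look like the sketch below. The exact field layout is an assumption based on `ExplainerAction`; query `GET /schema` on a running server for the authoritative JSON schemas:

```python
import json

# Illustrative /step payload wrapping an explore action. Field names
# mirror the ExplainerAction examples above; the outer "action" key
# is an assumption -- check GET /schema for the real request shape.
explore_payload = {
    "action_type": "explore",
    "tool": "search_wikipedia",
    "query": "merge sort",
    "intent": "overview",
}
body = json.dumps({"action": explore_payload})
```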
## Task Bank

26 tasks across 4 categories (ML, Math, Algorithms, Statistics), 3 difficulty levels (easy, medium, hard), and 3 audience tiers (beginner, intermediate, advanced). Each task specifies keywords for reward scoring and an optional preferred output format.
## File Structure

```
explainer_env/
├── server/
│   ├── explainer_env_environment.py  # Environment logic (reset/step/state)
│   ├── app.py                        # FastAPI server (create_app)
│   └── Dockerfile                    # Multi-stage Docker build
├── rewards/
│   ├── exploration.py                # Explore-phase reward components
│   ├── generation.py                 # Generate/repair reward components
│   ├── sources.py                    # Compatibility wrapper for research tools
│   ├── sandbox.py                    # Code validation (marimo check, AST, execution)
│   └── README.md                     # Reward system documentation
├── research/                         # Research tools, structured results, retrieval
├── models.py                         # ExplainerAction, ExplainerObservation
├── task_bank.py                      # 26 curated STEM tasks
├── client.py                         # ExplainerEnv WebSocket client
└── openenv.yaml                      # OpenEnv manifest
```