---
title: Explainer Env Environment Server
emoji: "\U0001F4BB"
colorFrom: pink
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# Research -> Interactive Explainer Environment

An OpenEnv RL environment that trains small language models to create interactive educational content. Given a STEM topic, the agent explores with explicit research tools, generates a **Marimo** reactive notebook or **Manim** math animation, and gets one repair attempt if lint/build validation fails.

## Episode Flow

```
reset() --> topic + tier assigned
  |
explore x 0..3 --> choose research tools + queries
  |
generate x 1 --> produce marimo/manim code
  |
repair x 0..1 --> fix lint/build errors if needed --> episode ends
```
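
The step budget in the diagram can be sketched as a small phase tracker. This is only an illustration of the flow, not the environment's own code: `EpisodePhaseTracker`, `allowed_actions`, and `record` are hypothetical names, and how the real server responds to an out-of-order action is not specified here.

```python
from dataclasses import dataclass

MAX_EXPLORE_STEPS = 3  # from the diagram: explore x 0..3


@dataclass
class EpisodePhaseTracker:
    """Hypothetical helper mirroring the flow diagram; not the environment's class."""

    explore_steps: int = 0
    generated: bool = False
    repaired: bool = False

    def allowed_actions(self, validation_failed: bool = False) -> set[str]:
        """Return the action types that are legal at this point in the episode."""
        if not self.generated:
            allowed = {"generate"}
            if self.explore_steps < MAX_EXPLORE_STEPS:
                allowed.add("explore")
            return allowed
        # After generation, one repair is offered only if validation failed.
        if validation_failed and not self.repaired:
            return {"repair"}
        return set()  # nothing left to do: the episode has ended

    def record(self, action_type: str, validation_failed: bool = False) -> None:
        """Advance the tracker, rejecting actions the diagram does not allow."""
        if action_type not in self.allowed_actions(validation_failed):
            raise ValueError(f"{action_type!r} is not allowed in the current phase")
        if action_type == "explore":
            self.explore_steps += 1
        elif action_type == "generate":
            self.generated = True
        else:
            self.repaired = True
```

An agent loop could call `allowed_actions()` before each step to avoid spending a turn on an action the episode can no longer accept.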

## Actions

**Explore** -- search for information relevant to the assigned topic:
```python
ExplainerAction(
    action_type="explore",
    tool="search_arxiv",
    query="merge sort divide and conquer visual explanation",
    intent="find examples and visual intuition",
)
```

Available tools: `search_wikipedia`, `search_hf_papers`, `search_arxiv`, `search_scholar`, `fetch_docs`, and `search_hf_hub`.

**Generate** -- produce educational code using accumulated research:
```python
ExplainerAction(
    action_type="generate",
    format="marimo",      # or "manim"
    code="import marimo...",
    narration="...",       # manim only
)
```

**Repair** -- revise generated code using lint/build feedback:
```python
ExplainerAction(
    action_type="repair",
    format="marimo",
    code="import marimo...",
    repair_notes="fixed the reported Marimo validation error",
)
```
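
Checking an action on the client side before sending it can save a wasted step. The sketch below is an assumption-laden illustration: `ActionSketch` is a hypothetical stand-in that only copies the field names shown in the examples above, and `validate_action` is not part of the package. The real `ExplainerAction` in `models.py` may enforce different or additional rules.

```python
from dataclasses import dataclass
from typing import Optional

# Tool and format names are taken from this README.
RESEARCH_TOOLS = frozenset({
    "search_wikipedia", "search_hf_papers", "search_arxiv",
    "search_scholar", "fetch_docs", "search_hf_hub",
})
OUTPUT_FORMATS = frozenset({"marimo", "manim"})


@dataclass
class ActionSketch:
    """Hypothetical stand-in for ExplainerAction, with the fields used above."""

    action_type: str
    tool: Optional[str] = None
    query: Optional[str] = None
    intent: Optional[str] = None
    format: Optional[str] = None
    code: Optional[str] = None
    narration: Optional[str] = None
    repair_notes: Optional[str] = None


def validate_action(action: ActionSketch) -> list[str]:
    """Return a list of problems; an empty list means the action looks well formed."""
    problems: list[str] = []
    if action.action_type == "explore":
        if action.tool not in RESEARCH_TOOLS:
            problems.append(f"unknown tool: {action.tool!r}")
        if not action.query:
            problems.append("explore needs a non-empty query")
    elif action.action_type in ("generate", "repair"):
        if action.format not in OUTPUT_FORMATS:
            problems.append(f"unknown format: {action.format!r}")
        if not action.code:
            problems.append(f"{action.action_type} needs non-empty code")
    else:
        problems.append(f"unknown action_type: {action.action_type!r}")
    return problems
```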

## Reward System

Multi-component rewards across exploration, generation, and repair. See [rewards/README.md](rewards/README.md) for the full breakdown.

**Exploration** (per-step): tool choice, query quality, source quality, coverage delta, novelty, diversity, gated by information sufficiency. Each explore step also carries a cost of -0.05, so a search has to add enough new information to pay for itself.

**Generation/repair**: keyword coverage, format match, structural quality (via `marimo check` CLI or manim scene analysis), narration (manim only), context usage, and repair success.

Key design: the `marimo check` CLI catches 5 breaking rules (MB001-MB005) in ~100ms. Code that fails to parse scores 0. Code that parses but fails to execute has its quality score multiplied by 0.4.
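
The parse and execution gates can be sketched as follows. This is a minimal illustration and not the code in `rewards/sandbox.py`: `parses` and `gated_score` are hypothetical names, the parse check uses Python's standard `ast.parse`, and whether the code executes is passed in as a flag rather than determined by actually running it.

```python
import ast

EXECUTION_FAILURE_FACTOR = 0.4  # from the text: non-executing code gets quality * 0.4


def parses(code: str) -> bool:
    """Return True if the code is syntactically valid Python."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return False
    return True


def gated_score(code: str, quality: float, executes: bool) -> float:
    """Apply the two gates to a quality score computed elsewhere (hypothetical helper)."""
    if not parses(code):
        return 0.0  # unparseable code scores nothing
    if not executes:
        return quality * EXECUTION_FAILURE_FACTOR  # parses but fails at run time
    return quality
```

Under these assumptions, a notebook with a quality score of 0.5 that parses but crashes on execution would end up with 0.2, while the same notebook with a syntax error would get 0.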

## Quick Start

```bash
# Install & run locally
cd explainer_env && uv sync
uv run server  # http://localhost:8000

# Client usage
python -c "
from client import ExplainerEnv
from models import ExplainerAction

with ExplainerEnv(base_url='http://localhost:8000').sync() as sc:
    result = sc.reset()
    print(f'Topic: {result.observation.topic}, Tier: {result.observation.tier}')

    # Explore
    result = sc.step(ExplainerAction(
        action_type='explore',
        tool='search_wikipedia',
        query=result.observation.topic,
        intent='overview',
    ))
    print(f'Explore reward: {result.reward:.3f}')

    # Generate
    result = sc.step(ExplainerAction(
        action_type='generate',
        format='marimo',
        code='import marimo as mo\napp = mo.App()\n@app.cell\ndef _():\n    mo.md(\"# Hello\")\n    return\n',
    ))
    print(f'Generate reward: {result.reward:.3f}, done: {result.done}')
"
```

## Concurrent WebSocket Sessions

The server supports multiple concurrent WebSocket connections for parallel GRPO training rollouts:

```python
from client import ExplainerEnv
from models import ExplainerAction
from concurrent.futures import ThreadPoolExecutor

def run_episode(client_id: int):
    with ExplainerEnv(base_url="http://localhost:8000").sync() as sc:
        result = sc.reset()
        result = sc.step(ExplainerAction(
            action_type="explore",
            tool="search_wikipedia",
            query=result.observation.topic,
        ))
        result = sc.step(ExplainerAction(
            action_type="generate", format="marimo",
            code="import marimo as mo\napp = mo.App()\n@app.cell\ndef _():\n    return\n",
        ))
        return client_id, result.reward

with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(run_episode, range(4)))
```

## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/reset` | POST | Start new episode, get topic assignment |
| `/step` | POST | Submit action, get observation + reward |
| `/state` | GET | Current episode state |
| `/schema` | GET | Action/Observation JSON schemas |
| `/ws` | WebSocket | Low-latency session for training |
| `/docs` | GET | Interactive API docs |
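
For callers that do not use the bundled client, the HTTP endpoints can be reached with the standard library. The sketch below only builds the requests. The JSON body shapes (an empty object for `/reset`, an `{"action": ...}` envelope for `/step`) are assumptions, so check `/schema` or `/docs` on a running server for the actual contract. `build_reset_request` and `build_step_request` are hypothetical helper names.

```python
import json
from urllib.request import Request

JSON_HEADERS = {"Content-Type": "application/json"}


def build_reset_request(base_url: str) -> Request:
    """Build a POST /reset request with an empty JSON body (assumed shape)."""
    return Request(f"{base_url}/reset", data=b"{}", headers=JSON_HEADERS, method="POST")


def build_step_request(base_url: str, action: dict) -> Request:
    """Build a POST /step request; the {"action": ...} envelope is an assumption."""
    body = json.dumps({"action": action}).encode("utf-8")
    return Request(f"{base_url}/step", data=body, headers=JSON_HEADERS, method="POST")
```

Passing either request to `urllib.request.urlopen` sends it. For training loops, the `/ws` session is the lower-latency option listed in the table.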

## Task Bank

26 tasks across 4 categories (ML, Math, Algorithms, Statistics), 3 difficulty levels (easy, medium, hard), and 3 audience tiers (beginner, intermediate, advanced). Each task specifies keywords for reward scoring and an optional preferred output format.
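
To make the role of task keywords concrete, here is a sketch of a task record and a simple keyword coverage score. Everything in it is illustrative: `TaskSketch`, its field names, and the case-insensitive substring matching in `keyword_coverage` are assumptions, and the real entries in `task_bank.py` and the scoring in `rewards/generation.py` may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskSketch:
    """Hypothetical task record; field names are assumptions, not the real schema."""

    topic: str
    category: str    # e.g. "Algorithms"
    difficulty: str  # "easy", "medium", or "hard"
    tier: str        # "beginner", "intermediate", or "advanced"
    keywords: tuple[str, ...]
    preferred_format: Optional[str] = None  # "marimo", "manim", or None


def keyword_coverage(task: TaskSketch, code: str) -> float:
    """Fraction of task keywords that appear in the generated code (case-insensitive)."""
    if not task.keywords:
        return 0.0
    text = code.lower()
    hits = sum(1 for keyword in task.keywords if keyword.lower() in text)
    return hits / len(task.keywords)
```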

## File Structure

```
explainer_env/
├── server/
│   ├── explainer_env_environment.py  # Environment logic (reset/step/state)
│   ├── app.py                        # FastAPI server (create_app)
│   └── Dockerfile                    # Multi-stage Docker build
├── rewards/
│   ├── exploration.py                # Explore-phase reward components
│   ├── generation.py                 # Generate/repair reward components
│   ├── sources.py                    # Compatibility wrapper for research tools
│   ├── sandbox.py                    # Code validation (marimo check, AST, execution)
│   └── README.md                     # Reward system documentation
├── research/                         # Research tools, structured results, retrieval
├── models.py                         # ExplainerAction, ExplainerObservation
├── task_bank.py                      # 26 curated STEM tasks
├── client.py                         # ExplainerEnv WebSocket client
└── openenv.yaml                      # OpenEnv manifest
```