---
title: Explainer Env Environment Server
emoji: "\U0001F4BB"
colorFrom: pink
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---
# Research -> Interactive Explainer Environment

An OpenEnv RL environment that trains small language models to create interactive educational content. Given a STEM topic, the agent explores with explicit research tools, generates a **Marimo** reactive notebook or **Manim** math animation, and gets one repair attempt if lint/build validation fails.
## Episode Flow

```
reset() --> topic + tier assigned
  |
explore x 0..3 --> choose research tools + queries
  |
generate x 1 --> produce marimo/manim code
  |
repair x 0..1 --> fix lint/build errors if needed --> episode ends
```
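The phase limits above can be sketched as a small validator. This is an illustrative helper, not part of the environment's API; the function name and shape are assumptions made for this sketch:

```python
# Sketch of the episode's phase limits: up to 3 explore steps,
# exactly one generate step, and at most one repair step.
MAX_EXPLORES = 3

def is_valid_episode(actions: list[str]) -> bool:
    """Check that a sequence of action types respects the episode flow."""
    i = 0
    explores = 0
    # Phase 1: zero to three explore steps.
    while i < len(actions) and actions[i] == "explore":
        explores += 1
        i += 1
    if explores > MAX_EXPLORES:
        return False
    # Phase 2: exactly one generate step.
    if i >= len(actions) or actions[i] != "generate":
        return False
    i += 1
    # Phase 3: an optional single repair, then the episode ends.
    if i < len(actions) and actions[i] == "repair":
        i += 1
    return i == len(actions)
```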
## Actions

**Explore** -- search for information relevant to the assigned topic:

```python
ExplainerAction(
    action_type="explore",
    tool="search_arxiv",
    query="merge sort divide and conquer visual explanation",
    intent="find examples and visual intuition",
)
```

Available tools: `search_wikipedia`, `search_hf_papers`, `search_arxiv`, `search_scholar`, `fetch_docs`, and `search_hf_hub`.
**Generate** -- produce educational code using accumulated research:

```python
ExplainerAction(
    action_type="generate",
    format="marimo",  # or "manim"
    code="import marimo...",
    narration="...",  # manim only
)
```
**Repair** -- revise generated code using lint/build feedback:

```python
ExplainerAction(
    action_type="repair",
    format="marimo",
    code="import marimo...",
    repair_notes="fixed the reported Marimo validation error",
)
```
## Reward System

Multi-component rewards across exploration, generation, and repair. See [rewards/README.md](rewards/README.md) for the full breakdown.

**Exploration** (per-step): tool choice, query quality, source quality, coverage delta, novelty, and diversity, all gated by information sufficiency. A step cost of -0.05 forces the agent to justify each search.
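As a rough sketch (not the actual reward code; component names and the averaging scheme here are assumptions), a per-step exploration reward with sufficiency gating and a fixed step cost might combine like this:

```python
STEP_COST = -0.05  # flat per-search cost from the description above

def exploration_reward(components: dict[str, float], sufficiency: float) -> float:
    """Illustrative combination: average component scores in [0, 1],
    gate the positive signal by information sufficiency in [0, 1],
    then apply the step cost. Uninformative searches net a negative reward."""
    positive = sum(components.values()) / max(len(components), 1)
    return positive * sufficiency + STEP_COST
```

Under this scheme a perfect search with full sufficiency scores just below 1.0, while a search that adds no information costs exactly the step penalty.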
**Generation/repair**: keyword coverage, format match, structural quality (via the `marimo check` CLI or Manim scene analysis), narration (Manim only), context usage, and repair success.

Key design: the `marimo check` CLI catches 5 breaking rules (MB001-MB005) in ~100ms. Code that doesn't parse scores 0; code that doesn't execute gets quality * 0.4.
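The parse/execute gate described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the environment's actual `rewards/sandbox.py`:

```python
import ast

def gated_score(code: str, quality: float, executes: bool) -> float:
    """Gate a quality score the way the text describes: code that
    fails to parse scores 0; code that parses but fails to execute
    keeps only 40% of its quality score."""
    try:
        ast.parse(code)       # parse gate: syntax errors zero everything out
    except SyntaxError:
        return 0.0
    if not executes:          # execution gate: quality * 0.4
        return quality * 0.4
    return quality
```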
## Quick Start

```bash
# Install & run locally
cd explainer_env && uv sync
uv run server  # http://localhost:8000

# Client usage
python -c "
from client import ExplainerEnv
from models import ExplainerAction

with ExplainerEnv(base_url='http://localhost:8000').sync() as sc:
    result = sc.reset()
    print(f'Topic: {result.observation.topic}, Tier: {result.observation.tier}')

    # Explore
    result = sc.step(ExplainerAction(
        action_type='explore',
        tool='search_wikipedia',
        query=result.observation.topic,
        intent='overview',
    ))
    print(f'Explore reward: {result.reward:.3f}')

    # Generate
    result = sc.step(ExplainerAction(
        action_type='generate',
        format='marimo',
        code='import marimo as mo\napp = mo.App()\n@app.cell\ndef _():\n    mo.md(\"# Hello\")\n    return\n',
    ))
    print(f'Generate reward: {result.reward:.3f}, done: {result.done}')
"
```
## Concurrent WebSocket Sessions

The server supports multiple concurrent WebSocket connections for parallel GRPO training rollouts:

```python
from concurrent.futures import ThreadPoolExecutor

from client import ExplainerEnv
from models import ExplainerAction

def run_episode(client_id: int):
    with ExplainerEnv(base_url="http://localhost:8000").sync() as sc:
        result = sc.reset()
        result = sc.step(ExplainerAction(
            action_type="explore",
            tool="search_wikipedia",
            query=result.observation.topic,
        ))
        result = sc.step(ExplainerAction(
            action_type="generate",
            format="marimo",
            code="import marimo as mo\napp = mo.App()\n@app.cell\ndef _():\n    return\n",
        ))
        return client_id, result.reward

with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(run_episode, range(4)))
```
## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/reset` | POST | Start new episode, get topic assignment |
| `/step` | POST | Submit action, get observation + reward |
| `/state` | GET | Current episode state |
| `/schema` | GET | Action/Observation JSON schemas |
| `/ws` | WebSocket | Low-latency session for training |
| `/docs` | GET | Interactive API docs |
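For raw HTTP use, a `/step` request body might look like the sketch below. The exact field layout is an assumption based on `ExplainerAction`; query `GET /schema` on a running server for the authoritative JSON schemas:

```python
import json

# Illustrative /step payload wrapping an explore action. Field names
# mirror the ExplainerAction examples above; the outer "action" key
# is an assumption -- check GET /schema for the real request shape.
explore_payload = {
    "action_type": "explore",
    "tool": "search_wikipedia",
    "query": "merge sort",
    "intent": "overview",
}
body = json.dumps({"action": explore_payload})
```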
## Task Bank

26 tasks across 4 categories (ML, Math, Algorithms, Statistics), 3 difficulty levels (easy, medium, hard), and 3 audience tiers (beginner, intermediate, advanced). Each task specifies keywords for reward scoring and an optional preferred output format.
## File Structure

```
explainer_env/
├── server/
│   ├── explainer_env_environment.py  # Environment logic (reset/step/state)
│   ├── app.py                        # FastAPI server (create_app)
│   └── Dockerfile                    # Multi-stage Docker build
├── rewards/
│   ├── exploration.py                # Explore-phase reward components
│   ├── generation.py                 # Generate/repair reward components
│   ├── sources.py                    # Compatibility wrapper for research tools
│   ├── sandbox.py                    # Code validation (marimo check, AST, execution)
│   └── README.md                     # Reward system documentation
├── research/                         # Research tools, structured results, retrieval
├── models.py                         # ExplainerAction, ExplainerObservation
├── task_bank.py                      # 26 curated STEM tasks
├── client.py                         # ExplainerEnv WebSocket client
└── openenv.yaml                      # OpenEnv manifest
```