---
title: Explainer Env Environment Server
emoji: "\U0001F4BB"
colorFrom: pink
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - OpenEnv
  - RL
---
<p align="center">
  The dashboard is served by this Space at <code>/web/</code> in the custom tab.
</p>

<p align="center">
  <span style="font-size:2.2em; font-weight:bold;">See. Interact. Understand.</span>
</p>

# Teaching Small Models to Build Interactive Explainers

What if a small language model could do more than answer a STEM question?

What if it could research the topic, decide what kind of visual explanation would help, build a working interactive notebook or animation, and then fix its own code when validation fails?

That is the idea behind this project: an OpenEnv reinforcement learning environment for training small language models to create visual, executable educational content.

Built for the [OpenEnv Hackathon](https://openenv.dev) in India, April 25-26, 2026.

## The Problem

Most educational answers from language models are still text-first. That is fine for simple definitions, but it breaks down for topics that are easier to understand by seeing and trying:

- gradient descent is clearer when you move the learning rate and watch the loss curve change
- Fourier transforms are clearer when frequencies become visible
- sorting algorithms are clearer when every comparison and swap is animated
- probability and statistics are clearer when samples, distributions, and uncertainty move on screen

The goal here is not just to generate an explanation. The goal is to train a model to build an artifact that teaches.

The artifact can be:

- a [Marimo](https://marimo.io/) reactive Python notebook for interactive explanations, sliders, charts, tables, and data exploration
- a [Manim](https://www.manim.community/) animation for step-by-step math and algorithm visuals
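
For orientation, here is a minimal sketch of what each artifact shape looks like. The cell contents and the scene are illustrative examples, not taken from the environment's dataset.

```python
# --- Marimo-style notebook (format="marimo") ---
# A Marimo file defines an App with reactive cells; the last expression in a
# cell is its visible output, and cells re-run when their inputs change.
import marimo

app = marimo.App()

@app.cell
def _():
    import marimo as mo
    lr = mo.ui.slider(0.01, 1.0, step=0.01, label="learning rate")
    lr  # visible output: the slider widget
    return lr, mo

@app.cell
def _(lr, mo):
    # Toy loss values that react whenever the slider changes.
    losses = [(1 - lr.value) ** step for step in range(20)]
    mo.md(f"Final loss after 20 steps: {losses[-1]:.4f}")
    return (losses,)

if __name__ == "__main__":
    app.run()
```

```python
# --- Manim-style animation (format="manim") ---
# A Manim artifact defines a Scene subclass with a construct() method.
from manim import Scene, Axes, Create

class LossCurve(Scene):
    def construct(self):
        axes = Axes(x_range=[0, 5], y_range=[0, 4])
        curve = axes.plot(lambda x: (x - 2) ** 2, x_range=[0, 4])
        self.play(Create(axes), Create(curve))
```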

## Why RL?

Reinforcement learning matters here because "make a good explainer" is not a one-shot task.

The model has to make a sequence of decisions:

1. understand the assigned topic
2. decide what to research
3. choose the right search or documentation tool
4. stop exploring when it has enough context
5. generate runnable Marimo or Manim code
6. use validation feedback to repair failures

This is exactly the kind of workflow where RL is useful. The model is rewarded for the process, not just the final text.

## The Episode

Every episode starts with a STEM topic, an audience tier, keywords, and a target difficulty.

The agent then moves through three phases.
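
As a rough sketch, the task handed to the agent at reset might look like the following. The field names are illustrative assumptions, not the environment's actual schema.

```python
# Hypothetical task specification at episode reset
# (field names are illustrative, not the real schema).
task = {
    "topic": "gradient descent",
    "audience_tier": "undergraduate",
    "keywords": ["learning rate", "loss surface", "convergence"],
    "difficulty": "intermediate",
    "preferred_format": "marimo",  # or "manim"
}
```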

### 1. Explore

The agent can call explicit research tools:

- `search_wikipedia` for fundamentals
- `search_hf_papers` for ML and AI papers
- `search_arxiv` for scientific papers
- `search_hf_hub` for models, datasets, Spaces, and examples

It gets up to three exploration steps. This keeps the task long enough to learn research behavior, but short enough for practical GRPO training.
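
An exploration step could be expressed as a single tool-call action, roughly like this (the shape is an assumption for illustration, not the exact action schema):

```python
# Hypothetical explore action: pick one research tool and a query.
explore_action = {
    "action": "explore",
    "tool": "search_wikipedia",
    "query": "gradient descent learning rate convergence",
}
```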

### 2. Generate

The agent submits one JSON action with a complete Python artifact:

- `format="marimo"` for a reactive notebook
- `format="manim"` for an animation scene

The code is not judged only by how it looks. It is parsed, linted, checked, and run.
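
A generate action might look roughly like this, with the full artifact source carried in one field (field names are assumptions for illustration):

```python
# Hypothetical generate action carrying the complete artifact source
# (field names are illustrative, not the environment's exact schema).
generate_action = {
    "action": "generate",
    "format": "marimo",
    "code": "import marimo\n\napp = marimo.App()\n...",  # complete notebook source
}
```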

### 3. Repair

If the generated artifact fails validation, the environment returns the error message. The agent gets one repair attempt.

This is important because code generation usually fails in boring ways: invalid Marimo cell dependencies, duplicate notebook globals, missing Manim scene classes, syntax errors, or examples that look plausible but do not execute.

The repair step teaches the model to read the validation feedback and change the code, instead of blindly regenerating another answer.
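
The repair turn can be thought of as resubmitting a corrected artifact in response to the reported error (again, the exact fields are illustrative assumptions):

```python
# Hypothetical repair action after a validation failure
# (field names are illustrative).
repair_action = {
    "action": "repair",
    "format": "marimo",
    "code": "import marimo\n\napp = marimo.App()\n...",  # corrected source
}
```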

## The Reward Signal

The reward system is deliberately practical. It avoids using an LLM judge inside the training loop because that would be slower, noisier, and harder to reproduce.

Instead, the environment rewards things that can be checked quickly.

### Exploration Reward

The model gets rewarded when it:

- chooses a useful tool for the topic
- writes a relevant query
- retrieves useful sources
- increases keyword coverage
- adds new information instead of repeating the same search
- stops when the context is already good enough

There is also a small step cost. Exploring forever should not be the winning strategy.
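
Conceptually, the exploration reward is a weighted sum of cheap checks minus a per-step cost. The weights and term names below are illustrative assumptions, not the environment's actual values:

```python
# Sketch of an exploration reward: checkable terms minus a step cost.
# Weights and term names are illustrative assumptions.
def exploration_reward(tool_is_relevant: bool,
                       query_overlap: float,          # 0..1 overlap with topic keywords
                       new_keyword_coverage: float,   # 0..1 coverage gained this step
                       repeated_search: bool,
                       step_cost: float = 0.05) -> float:
    reward = 0.0
    reward += 0.2 if tool_is_relevant else 0.0
    reward += 0.3 * query_overlap
    reward += 0.4 * new_keyword_coverage
    reward -= 0.2 if repeated_search else 0.0
    return reward - step_cost
```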

### Generation Reward

The generated code is rewarded for:

- valid JSON action format
- matching the requested artifact type
- covering the key concepts
- passing Marimo or Manim validation
- actually running or rendering

Broken code cannot score well just because it mentions the right words. The validation checks act like gates.
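
The gating idea can be sketched like this: format and validation checks act as hard gates, so concept coverage only counts once the artifact actually parses and runs. The weights are illustrative assumptions:

```python
# Sketch of a gated generation reward: validation failures cap the score so
# keyword-stuffing cannot compensate for broken code. Weights are illustrative.
def generation_reward(valid_json: bool,
                      correct_format: bool,
                      concept_coverage: float,   # 0..1 fraction of key concepts present
                      passes_validation: bool,
                      executes: bool) -> float:
    if not (valid_json and correct_format):
        return 0.0
    score = 0.3 * concept_coverage
    if passes_validation:
        score += 0.3
        if executes:
            score += 0.4
    return score
```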

### Repair Reward

The repair step rewards the model for:

- fixing the reported error
- passing validation after the fix
- avoiding repeated unchanged code

This makes the environment closer to a real development loop: build, test, read the error, fix.
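
One simple way to express that loop, assuming the previous submission is kept around for comparison (the specific checks and values are illustrative):

```python
# Sketch of a repair reward: no credit for resubmitting identical code, partial
# credit for addressing the reported error, most credit only if validation now
# passes. Values are illustrative assumptions.
def repair_reward(previous_code: str, repaired_code: str,
                  error_addressed: bool, passes_validation: bool) -> float:
    if repaired_code.strip() == previous_code.strip():
        return -0.2  # discourage resubmitting unchanged code
    reward = 0.2 if error_addressed else 0.0
    if passes_validation:
        reward += 0.6
    return reward
```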

## Why Retrieval Is Part of the Environment

Small models do not have unlimited context, and this task can easily become context-heavy. The agent may research equations, examples, library APIs, visualization patterns, and task-specific concepts before generating code.

So the environment filters research results before sending them back.
The retrieval pipeline fetches and chunks candidate sources, then ranks the chunks with a single small embedding model, bge-small-en-v1.5, and returns only the most useful snippets in the observation.

The goal is the same: provide the model with enough relevant context to build a better explainer, without overwhelming it with irrelevant text.
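
A minimal version of that ranking step, assuming the sentence-transformers library and cosine similarity over chunk embeddings (the chunking and top-k values are illustrative):

```python
# Minimal sketch of embedding-based snippet ranking with bge-small-en-v1.5.
# Chunk size and top-k are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")

def top_snippets(query: str, documents: list[str], k: int = 5,
                 chunk_chars: int = 500) -> list[str]:
    # Split fetched documents into fixed-size character chunks.
    chunks = [doc[i:i + chunk_chars]
              for doc in documents
              for i in range(0, len(doc), chunk_chars)]
    query_emb = encoder.encode(query, convert_to_tensor=True)
    chunk_embs = encoder.encode(chunks, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, chunk_embs)[0]
    ranked = sorted(zip(chunks, scores.tolist()), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```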

## What We Trained First

Before RL, the model needs to know the shape of the artifacts.

Even larger models often produce Marimo and Manim code that looks reasonable but fails under real validation. So the first step is supervised fine-tuning on examples built from:

- curated STEM tasks
- Marimo examples and documentation patterns
- Manim examples, guides, and reference snippets
- generate and repair action templates

The current target model is:

```text
unsloth/Ministral-3-3B-Instruct-2512-unsloth-bnb-4bit
```

The SFT adapter is here:

[kgdrathan/ministral-3-3b-4bit-marimo-manim](https://huggingface.co/kgdrathan/ministral-3-3b-4bit-marimo-manim/)
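
If you want to try the checkpoint, loading the 4-bit base model and applying the adapter with PEFT should look roughly like this (untested sketch; the prompt and generation settings are illustrative):

```python
# Sketch: load the 4-bit base model and apply the SFT adapter with PEFT.
# Untested; the prompt and generation settings are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "unsloth/Ministral-3-3B-Instruct-2512-unsloth-bnb-4bit"
adapter_id = "kgdrathan/ministral-3-3b-4bit-marimo-manim"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_id)

prompt = "Create a Marimo notebook that explains gradient descent with a learning-rate slider."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```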

## Links

- Environment Space: [kgdrathan-explainer-env](https://kgdrathan-explainer-env.hf.space)
- SFT adapter: [kgdrathan/ministral-3-3b-4bit-marimo-manim](https://huggingface.co/kgdrathan/ministral-3-3b-4bit-marimo-manim/)
- Reward details: [explainer_env/rewards/README.md](explainer_env/rewards/README.md)