---
title: Explainer Env Environment Server
emoji: "\U0001F4BB"
colorFrom: pink
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - OpenEnv
  - RL
---
<p align="center">
  The dashboard is served by this Space at <code>/web/</code> in the custom tab.
</p>

<p align="center">
  <span style="font-size:2.2em; font-weight:bold;">See. Interact. Understand.</span>
</p>

# Teaching Small Models to Build Interactive Explainers

What if a small language model could do more than answer a STEM question?

What if it could research the topic, decide what kind of visual explanation would help, build a working interactive notebook or animation, and then fix its own code when validation fails?

That is the idea behind this project: an OpenEnv reinforcement learning environment for training small language models to create visual, executable educational content.

Built for the [OpenEnv Hackathon](https://openenv.dev) in India, April 25-26, 2026.

## The Problem

Most educational answers from language models are still text-first. That is fine for simple definitions, but it breaks down for topics that are easier to understand by seeing and trying:

- gradient descent is clearer when you move the learning rate and watch the loss curve change
- Fourier transforms are clearer when frequencies become visible
- sorting algorithms are clearer when every comparison and swap is animated
- probability and statistics are clearer when samples, distributions, and uncertainty move on screen

The goal here is not just to generate an explanation. The goal is to train a model to build an artifact that teaches.

The artifact can be:

- a [Marimo](https://marimo.io/) reactive Python notebook for interactive explanations, sliders, charts, tables, and data exploration
- a [Manim](https://www.manim.community/) animation for step-by-step math and algorithm visuals
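
For orientation, here is a minimal sketch of what each artifact shape looks like. The cell contents and the scene are illustrative examples, not taken from the environment's dataset.

```python
# --- Marimo-style notebook (format="marimo") ---
# A Marimo file defines an App with reactive cells; the last expression in a
# cell is its visible output, and cells re-run when their inputs change.
import marimo

app = marimo.App()

@app.cell
def _():
    import marimo as mo
    lr = mo.ui.slider(0.01, 1.0, step=0.01, label="learning rate")
    lr  # visible output: the slider widget
    return lr, mo

@app.cell
def _(lr, mo):
    # Toy loss values that react whenever the slider changes.
    losses = [(1 - lr.value) ** step for step in range(20)]
    mo.md(f"Final loss after 20 steps: {losses[-1]:.4f}")
    return (losses,)

if __name__ == "__main__":
    app.run()
```

```python
# --- Manim-style animation (format="manim") ---
# A Manim artifact defines a Scene subclass with a construct() method.
from manim import Scene, Axes, Create

class LossCurve(Scene):
    def construct(self):
        axes = Axes(x_range=[0, 5], y_range=[0, 4])
        curve = axes.plot(lambda x: (x - 2) ** 2, x_range=[0, 4])
        self.play(Create(axes), Create(curve))
```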

## Why RL?

Reinforcement learning matters here because "make a good explainer" is not a one-shot task.

The model has to make a sequence of decisions:

1. understand the assigned topic
2. decide what to research
3. choose the right search or documentation tool
4. stop exploring when it has enough context
5. generate runnable Marimo or Manim code
6. use validation feedback to repair failures

This is exactly the kind of workflow where RL is useful. The model is rewarded for the process, not just the final text.

## The Episode

Every episode starts with a STEM topic, an audience tier, keywords, and a target difficulty.

The agent then moves through three phases.
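
As a rough sketch, the task handed to the agent at reset might look like the following. The field names are illustrative assumptions, not the environment's actual schema.

```python
# Hypothetical task specification at episode reset
# (field names are illustrative, not the real schema).
task = {
    "topic": "gradient descent",
    "audience_tier": "undergraduate",
    "keywords": ["learning rate", "loss surface", "convergence"],
    "difficulty": "intermediate",
    "preferred_format": "marimo",  # or "manim"
}
```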

### 1. Explore

The agent can call explicit research tools:

- `search_wikipedia` for fundamentals
- `search_hf_papers` for ML and AI papers
- `search_arxiv` for scientific papers
- `search_hf_hub` for models, datasets, Spaces, and examples

It gets up to three exploration steps. This keeps the task long enough to learn research behavior, but short enough for practical GRPO training.
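
An exploration step could be expressed as a single tool-call action, roughly like this (the shape is an assumption for illustration, not the exact action schema):

```python
# Hypothetical explore action: pick one research tool and a query.
explore_action = {
    "action": "explore",
    "tool": "search_wikipedia",
    "query": "gradient descent learning rate convergence",
}
```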

### 2. Generate

The agent submits one JSON action with a complete Python artifact:

- `format="marimo"` for a reactive notebook
- `format="manim"` for an animation scene

The code is not judged only by how it looks. It is parsed, linted, checked, and run.
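
A generate action might look roughly like this, with the full artifact source carried in one field (field names are assumptions for illustration):

```python
# Hypothetical generate action carrying the complete artifact source
# (field names are illustrative, not the environment's exact schema).
generate_action = {
    "action": "generate",
    "format": "marimo",
    "code": "import marimo\n\napp = marimo.App()\n...",  # complete notebook source
}
```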

### 3. Repair

If the generated artifact fails validation, the environment returns the error message. The agent gets one repair attempt.

This is important because code generation usually fails in boring ways: invalid Marimo cell dependencies, duplicate notebook globals, missing Manim scene classes, syntax errors, or examples that look plausible but do not execute.

The repair step teaches the model to read the validation feedback and change the code, instead of blindly regenerating another answer.
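
The repair turn can be thought of as resubmitting a corrected artifact in response to the reported error (again, the exact fields are illustrative assumptions):

```python
# Hypothetical repair action after a validation failure
# (field names are illustrative).
repair_action = {
    "action": "repair",
    "format": "marimo",
    "code": "import marimo\n\napp = marimo.App()\n...",  # corrected source
}
```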

## The Reward Signal

The reward system is deliberately practical. It avoids using an LLM judge inside the training loop because that would be slower, noisier, and harder to reproduce.

Instead, the environment rewards things that can be checked quickly.

### Exploration Reward

The model gets rewarded when it:

- chooses a useful tool for the topic
- writes a relevant query
- retrieves useful sources
- increases keyword coverage
- adds new information instead of repeating the same search
- stops when the context is already good enough

There is also a small step cost. Exploring forever should not be the winning strategy.
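
Conceptually, the exploration reward is a weighted sum of cheap checks minus a per-step cost. The weights and term names below are illustrative assumptions, not the environment's actual values:

```python
# Sketch of an exploration reward: checkable terms minus a step cost.
# Weights and term names are illustrative assumptions.
def exploration_reward(tool_is_relevant: bool,
                       query_overlap: float,          # 0..1 overlap with topic keywords
                       new_keyword_coverage: float,   # 0..1 coverage gained this step
                       repeated_search: bool,
                       step_cost: float = 0.05) -> float:
    reward = 0.0
    reward += 0.2 if tool_is_relevant else 0.0
    reward += 0.3 * query_overlap
    reward += 0.4 * new_keyword_coverage
    reward -= 0.2 if repeated_search else 0.0
    return reward - step_cost
```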

### Generation Reward

The generated code is rewarded for:

- valid JSON action format
- matching the requested artifact type
- covering the key concepts
- passing Marimo or Manim validation
- actually running or rendering

Broken code cannot score well just because it mentions the right words. The validation checks act like gates.
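
The gating idea can be sketched like this: format and validation checks act as hard gates, so concept coverage only counts once the artifact actually parses and runs. The weights are illustrative assumptions:

```python
# Sketch of a gated generation reward: validation failures cap the score so
# keyword-stuffing cannot compensate for broken code. Weights are illustrative.
def generation_reward(valid_json: bool,
                      correct_format: bool,
                      concept_coverage: float,   # 0..1 fraction of key concepts present
                      passes_validation: bool,
                      executes: bool) -> float:
    if not (valid_json and correct_format):
        return 0.0
    score = 0.3 * concept_coverage
    if passes_validation:
        score += 0.3
        if executes:
            score += 0.4
    return score
```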

### Repair Reward

The repair step rewards the model for:

- fixing the reported error
- passing validation after the fix
- avoiding repeated unchanged code

This makes the environment closer to a real development loop: build, test, read the error, fix.
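
One simple way to express that loop, assuming the previous submission is kept around for comparison (the specific checks and values are illustrative):

```python
# Sketch of a repair reward: no credit for resubmitting identical code, partial
# credit for addressing the reported error, most credit only if validation now
# passes. Values are illustrative assumptions.
def repair_reward(previous_code: str, repaired_code: str,
                  error_addressed: bool, passes_validation: bool) -> float:
    if repaired_code.strip() == previous_code.strip():
        return -0.2  # discourage resubmitting unchanged code
    reward = 0.2 if error_addressed else 0.0
    if passes_validation:
        reward += 0.6
    return reward
```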

## Why Retrieval Is Part of the Environment

Small models do not have unlimited context, and this task can easily become context-heavy. The agent may research equations, examples, library APIs, visualization patterns, and task-specific concepts before generating code.

So the environment filters research results before sending them back.
The retrieval pipeline fetches and chunks candidate sources, then ranks the chunks with a single small embedding model, bge-small-en-v1.5, and returns only the most useful snippets in the observation.

The goal is the same: provide the model with enough relevant context to build a better explainer, without overwhelming it with irrelevant text.
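
A minimal version of that ranking step, assuming the sentence-transformers library and cosine similarity over chunk embeddings (the chunking and top-k values are illustrative):

```python
# Minimal sketch of embedding-based snippet ranking with bge-small-en-v1.5.
# Chunk size and top-k are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")

def top_snippets(query: str, documents: list[str], k: int = 5,
                 chunk_chars: int = 500) -> list[str]:
    # Split fetched documents into fixed-size character chunks.
    chunks = [doc[i:i + chunk_chars]
              for doc in documents
              for i in range(0, len(doc), chunk_chars)]
    query_emb = encoder.encode(query, convert_to_tensor=True)
    chunk_embs = encoder.encode(chunks, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, chunk_embs)[0]
    ranked = sorted(zip(chunks, scores.tolist()), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```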

## What We Trained First

Before RL, the model needs to know the shape of the artifacts.

Even larger models often produce Marimo and Manim code that looks reasonable but fails under real validation. So the first step is supervised fine-tuning on examples built from:

- curated STEM tasks
- Marimo examples and documentation patterns
- Manim examples, guides, and reference snippets
- generate and repair action templates

The current target model is:

```text
unsloth/Ministral-3-3B-Instruct-2512-unsloth-bnb-4bit
```

The SFT adapter is here:

[kgdrathan/ministral-3-3b-4bit-marimo-manim](https://huggingface.co/kgdrathan/ministral-3-3b-4bit-marimo-manim/)
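
If you want to try the checkpoint, loading the 4-bit base model and applying the adapter with PEFT should look roughly like this (untested sketch; the prompt and generation settings are illustrative):

```python
# Sketch: load the 4-bit base model and apply the SFT adapter with PEFT.
# Untested; the prompt and generation settings are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "unsloth/Ministral-3-3B-Instruct-2512-unsloth-bnb-4bit"
adapter_id = "kgdrathan/ministral-3-3b-4bit-marimo-manim"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_id)

prompt = "Create a Marimo notebook that explains gradient descent with a learning-rate slider."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```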

## Links

- Environment Space: [kgdrathan-explainer-env](https://kgdrathan-explainer-env.hf.space)
- SFT adapter: [kgdrathan/ministral-3-3b-4bit-marimo-manim](https://huggingface.co/kgdrathan/ministral-3-3b-4bit-marimo-manim/)
- Reward details: [explainer_env/rewards/README.md](explainer_env/rewards/README.md)