---
title: Explainer Env Environment Server
emoji: "\U0001F4BB"
colorFrom: pink
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - OpenEnv
  - RL
---

The dashboard is served by this Space at `/web/` in the custom tab.

See. Interact. Understand.

# Teaching Small Models to Build Interactive Explainers

What if a small language model could do more than answer a STEM question? What if it could research the topic, decide what kind of visual explanation would help, build a working interactive notebook or animation, and then fix its own code when validation fails?

That is the idea behind this project: an OpenEnv reinforcement learning environment for training small language models to create visual, executable educational content.

Built for the [OpenEnv Hackathon](https://openenv.dev) in India, April 25-26, 2026.

![Expected episode flow](assets/episode_flow.jpg)

## The Problem

Most educational answers from language models are still text-first. That is fine for simple definitions, but it breaks down for topics that are easier to understand by seeing and trying:

- gradient descent is clearer when you move the learning rate and watch the loss curve change
- Fourier transforms are clearer when frequencies become visible
- sorting algorithms are clearer when every comparison and swap is animated
- probability and statistics are clearer when samples, distributions, and uncertainty move on screen

The goal here is not just to generate an explanation. The goal is to train a model to build an artifact that teaches. The artifact can be:

- a [Marimo](https://marimo.io/) reactive Python notebook for interactive explanations, sliders, charts, tables, and data exploration
- a [Manim](https://www.manim.community/) animation for step-by-step math and algorithm visuals

## Why RL?

"Make a good explainer" is not a one-shot task. The model has to make a sequence of decisions:

1. understand the assigned topic
2. decide what to research
3. choose the right search or documentation tool
4. stop exploring when it has enough context
5. generate runnable Marimo or Manim code
6. use validation feedback to repair failures

This is exactly the kind of workflow where RL is useful.
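The decision sequence above can be sketched as a toy episode loop. Everything here is illustrative, not the environment's actual API: the class names, the observation fields, and the stand-in validation check are all made up for the sketch.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """What the agent sees after a step (illustrative fields)."""
    phase: str           # "explore", "repair", or "done"
    feedback: str = ""   # search snippets or validation errors
    done: bool = False

class ToyExplainerEnv:
    """Minimal stand-in for the explore -> generate -> repair episode."""

    def __init__(self, max_explore_steps: int = 3, repair_attempts: int = 1):
        self.max_explore_steps = max_explore_steps
        self.explore_steps = 0
        self.repairs_left = repair_attempts

    def step(self, action: dict) -> Observation:
        if action["type"] == "search" and self.explore_steps < self.max_explore_steps:
            self.explore_steps += 1
            return Observation("explore", feedback=f"snippets for {action['query']!r}")
        if action["type"] == "generate":
            # Stand-in validation: the real environment parses, lints,
            # and executes the artifact instead of this substring check.
            if "import marimo" in action["code"]:
                return Observation("done", done=True)
            if self.repairs_left > 0:
                self.repairs_left -= 1
                return Observation("repair", feedback="validation failed: no marimo app found")
            return Observation("done", feedback="failed", done=True)
        return Observation("done", done=True)

env = ToyExplainerEnv()
obs = env.step({"type": "search", "query": "gradient descent"})   # explore
obs = env.step({"type": "generate", "code": "print('plot')"})     # fails validation
obs = env.step({"type": "generate", "code": "import marimo"})     # repaired artifact passes
```

The point of the sketch is the shape of the loop: research actions, one generate action, and a single repair round driven by validation feedback.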
The model is rewarded for the process, not just the final text.

## The Episode

Every episode starts with a STEM topic, an audience tier, keywords, and a target difficulty. The agent then moves through three phases.

### 1. Explore

The agent can call explicit research tools:

- `search_wikipedia` for fundamentals
- `search_hf_papers` for ML and AI papers
- `search_arxiv` for scientific papers
- `search_hf_hub` for models, datasets, Spaces, and examples

It gets up to three exploration steps. This keeps the task long enough to learn research behavior, but short enough for practical GRPO training.

### 2. Generate

The agent submits one JSON action with a complete Python artifact:

- `format="marimo"` for a reactive notebook
- `format="manim"` for an animation scene

The code is not judged only by how it looks. It is parsed, linted, checked, and run.

### 3. Repair

If the generated artifact fails validation, the environment returns the error message. The agent gets one repair attempt.

This is important because code generation usually fails in boring ways: invalid Marimo cell dependencies, duplicate notebook globals, missing Manim scene classes, syntax errors, or examples that look plausible but do not execute. The repair step teaches the model to read the validation feedback and change the code, instead of blindly regenerating another answer.

## The Reward Signal

The reward system is deliberately practical. It avoids using an LLM judge inside the training loop because that would be slower, noisier, and harder to reproduce. Instead, the environment rewards things that can be checked quickly.

### Exploration Reward

The model gets rewarded when it:

- chooses a useful tool for the topic
- writes a relevant query
- retrieves useful sources
- increases keyword coverage
- adds new information instead of repeating the same search
- stops when the context is already good enough

There is also a small step cost. Exploring forever should not be the winning strategy.
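As a rough illustration of this kind of shaping, an exploration-step reward might combine keyword-coverage gain, source novelty, and a step cost. The function and its coefficients below are invented for the sketch; the real terms live in the environment's reward code.

```python
def exploration_step_reward(covered_before: set, covered_after: set,
                            seen_sources: set, new_sources: set,
                            step_cost: float = 0.05) -> float:
    """Toy shaping for one exploration step; coefficients are made up."""
    coverage_gain = len(covered_after - covered_before)    # keywords newly covered
    novelty = len(new_sources - seen_sources)              # sources not retrieved before
    return 0.2 * coverage_gain + 0.1 * novelty - step_cost

# A useful step: two new keywords covered from one fresh source.
good = exploration_step_reward({"loss"}, {"loss", "gradient", "learning rate"},
                               seen_sources=set(),
                               new_sources={"wiki:Gradient_descent"})

# A wasted step: same coverage, same source as before -> only the step cost.
wasted = exploration_step_reward({"loss"}, {"loss"},
                                 seen_sources={"wiki:Gradient_descent"},
                                 new_sources={"wiki:Gradient_descent"})
```

The key property is the sign structure, not the numbers: repeating a search that adds nothing nets a negative reward, so endless exploration loses to stopping.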
### Generation Reward

The generated code is rewarded for:

- valid JSON action format
- matching the requested artifact type
- covering the key concepts
- passing Marimo or Manim validation
- actually running or rendering

Broken code cannot score well just because it mentions the right words. The validation checks act like gates.

### Repair Reward

The repair step rewards the model for:

- fixing the reported error
- passing validation after the fix
- avoiding repeated unchanged code

This makes the environment closer to a real development loop: build, test, read the error, fix.

## Why Retrieval Is Part of the Environment

Small models do not have unlimited context, and this task can easily become context-heavy. The agent may research equations, examples, library APIs, visualization patterns, and task-specific concepts before generating code. So the environment filters research results before sending them back.

![RAG for long-horizon exploration tasks](assets/why-rag.jpg)

The retrieval pipeline uses only bge-small-en-v1.5 to fetch, chunk, and rank candidate sources, returning the most useful snippets in the observation. The goal is the same: provide the model with enough relevant context to build a better explainer, without overwhelming it with irrelevant text.

## What We Trained First

Before RL, the model needs to know the shape of the artifacts. Even larger models often produce Marimo and Manim code that looks reasonable but fails under real validation.
So the first step is supervised fine-tuning on examples built from:

- curated STEM tasks
- Marimo examples and documentation patterns
- Manim examples, guides, and reference snippets
- generate and repair action templates

The current target model is:

```text
unsloth/Ministral-3-3B-Instruct-2512-unsloth-bnb-4bit
```

The SFT adapter is here: [kgdrathan/ministral-3-3b-4bit-marimo-manim](https://huggingface.co/kgdrathan/ministral-3-3b-4bit-marimo-manim/)

![SFT training curves](https://huggingface.co/kgdrathan/ministral-3-3b-4bit-marimo-manim/resolve/main/training_curves.png)

## Links

- Environment Space: [kgdrathan-explainer-env](https://kgdrathan-explainer-env.hf.space)
- SFT adapter: [kgdrathan/ministral-3-3b-4bit-marimo-manim](https://huggingface.co/kgdrathan/ministral-3-3b-4bit-marimo-manim/)
- Reward details: [explainer_env/rewards/README.md](explainer_env/rewards/README.md)