---
title: Explainer Env Environment Server
emoji: "\U0001F4BB"
colorFrom: pink
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- OpenEnv
- RL
---
<p align="center">
The dashboard is served by this Space at <code>/web/</code> in the custom tab.
</p>
<p align="center">
<span style="font-size:2.2em; font-weight:bold;">See. Interact. Understand.</span>
</p>
# Teaching Small Models to Build Interactive Explainers
What if a small language model could do more than answer a STEM question?
What if it could research the topic, decide what kind of visual explanation would help, build a working interactive notebook or animation, and then fix its own code when validation fails?
That is the idea behind this project: an OpenEnv reinforcement learning environment for training small language models to create visual, executable educational content.
Built for the [OpenEnv Hackathon](https://openenv.dev) in India, April 25-26, 2026.

## The Problem
Most educational answers from language models are still text-first. That is fine for simple definitions, but it breaks down for topics that are easier to understand by seeing and trying:
- gradient descent is clearer when you move the learning rate and watch the loss curve change
- Fourier transforms are clearer when frequencies become visible
- sorting algorithms are clearer when every comparison and swap is animated
- probability and statistics are clearer when samples, distributions, and uncertainty move on screen
The goal here is not just to generate an explanation. The goal is to train a model to build an artifact that teaches.
The artifact can be:
- a [Marimo](https://marimo.io/) reactive Python notebook for interactive explanations, sliders, charts, tables, and data exploration
- a [Manim](https://www.manim.community/) animation for step-by-step math and algorithm visuals
## Why RL?
Reinforcement learning matters here because "make a good explainer" is not a one-shot task.
The model has to make a sequence of decisions:
1. understand the assigned topic
2. decide what to research
3. choose the right search or documentation tool
4. stop exploring when it has enough context
5. generate runnable Marimo or Manim code
6. use validation feedback to repair failures
This is exactly the kind of workflow where RL is useful. The model is rewarded for the process, not just the final text.
## The Episode
Every episode starts with a STEM topic, an audience tier, keywords, and a target difficulty.
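In rough shape, a task might look something like this. The field names here are illustrative assumptions, not the environment's actual schema:
```python
# Illustrative task payload; key names are assumptions, not the real schema.
task = {
    "topic": "gradient descent",                      # the STEM topic to explain
    "audience_tier": "undergraduate",                 # who the explainer is for
    "keywords": ["learning rate", "loss", "minima"],  # concepts to cover
    "difficulty": "intermediate",                     # target difficulty
}
```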
The agent then moves through three phases.
### 1. Explore
The agent can call explicit research tools:
- `search_wikipedia` for fundamentals
- `search_hf_papers` for ML and AI papers
- `search_arxiv` for scientific papers
- `search_hf_hub` for models, datasets, Spaces, and examples
It gets up to three exploration steps. This keeps the task long enough to learn research behavior, but short enough for practical GRPO training.
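Sketched as an action, one exploration step might look like this. The tool names come from the list above, but the surrounding JSON shape is an assumption:
```python
import json

# Hypothetical exploration action; only the tool names are documented above.
action = json.dumps({
    "tool": "search_wikipedia",
    "query": "gradient descent learning rate convergence",
})
```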
### 2. Generate
The agent submits one JSON action with a complete Python artifact:
- `format="marimo"` for a reactive notebook
- `format="manim"` for an animation scene
The code is not judged only by how it looks. It is parsed, linted, checked, and run.
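As a rough sketch, a generation action might look like the following. Only the `format` values come from the description above; the other keys and the tiny Marimo notebook are illustrative:
```python
import json

# A minimal Marimo notebook body; real artifacts cover the full topic.
marimo_code = """\
import marimo

app = marimo.App()

@app.cell
def _():
    import marimo as mo
    lr = mo.ui.slider(0.01, 1.0, step=0.01, label="learning rate")
    lr
    return (lr,)
"""

# Hypothetical action shape: "format" is documented above, "code" is an assumption.
action = json.dumps({"format": "marimo", "code": marimo_code})
```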
### 3. Repair
If the generated artifact fails validation, the environment returns the error message. The agent gets one repair attempt.
This is important because code generation usually fails in boring ways: invalid Marimo cell dependencies, duplicate notebook globals, missing Manim scene classes, syntax errors, or examples that look plausible but do not execute.
The repair step teaches the model to read the validation feedback and change the code, instead of blindly regenerating another answer.
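The feedback the agent sees might look roughly like this (the observation keys are assumptions):
```python
# Hypothetical validation feedback returned to the agent; keys are assumptions.
observation = {
    "validation_passed": False,
    "error": "Marimo check failed: name 'np' is not defined in cell 2",
    "repair_attempts_remaining": 1,
}
```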
## The Reward Signal
The reward system is deliberately practical. It avoids using an LLM judge inside the training loop because that would be slower, noisier, and harder to reproduce.
Instead, the environment rewards things that can be checked quickly.
### Exploration Reward
The model gets rewarded when it:
- chooses a useful tool for the topic
- writes a relevant query
- retrieves useful sources
- increases keyword coverage
- adds new information instead of repeating the same search
- stops when the context is already good enough
There is also a small step cost. Exploring forever should not be the winning strategy.
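A minimal sketch of this idea, with made-up weights (the actual terms and coefficients live in the reward code, not this README):
```python
def exploration_reward(prev_coverage: float, new_coverage: float,
                       novel_results: int, step_cost: float = 0.05) -> float:
    """Illustrative exploration reward; all weights are assumptions."""
    coverage_gain = max(0.0, new_coverage - prev_coverage)  # reward new keyword coverage
    novelty = 0.1 * min(novel_results, 3)                   # reward new information, capped
    return coverage_gain + novelty - step_cost              # small per-step cost
```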
### Generation Reward
The generated code is rewarded for:
- valid JSON action format
- matching the requested artifact type
- covering the key concepts
- passing Marimo or Manim validation
- actually running or rendering
Broken code cannot score well just because it mentions the right words. The validation checks act like gates.
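A sketch of that gating logic, with illustrative weights only:
```python
def generation_reward(valid_json: bool, format_matches: bool,
                      concept_coverage: float, validation_passed: bool,
                      ran_successfully: bool) -> float:
    """Illustrative gated reward; real weights live in explainer_env/rewards."""
    if not (valid_json and format_matches):
        return 0.0                    # hard gate: malformed actions score nothing
    score = 0.3 * concept_coverage    # mentioning the right ideas helps...
    if validation_passed:
        score += 0.4                  # ...but passing validation matters more
        if ran_successfully:
            score += 0.3              # and actually executing matters most
    return score
```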
### Repair Reward
The repair step rewards the model for:
- fixing the reported error
- passing validation after the fix
- avoiding repeated unchanged code
This makes the environment closer to a real development loop: build, test, read the error, fix.
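In sketch form, again with assumed values:
```python
def repair_reward(old_code: str, new_code: str, now_passes: bool) -> float:
    """Illustrative repair incentives; thresholds are assumptions."""
    if new_code.strip() == old_code.strip():
        return -0.2                       # penalize resubmitting unchanged code
    return 1.0 if now_passes else 0.1     # small credit for a genuine attempt
```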
## Why Retrieval Is Part of the Environment
Small models do not have unlimited context, and this task can easily become context-heavy. The agent may research equations, examples, library APIs, visualization patterns, and task-specific concepts before generating code.
So the environment filters research results before sending them back.

The retrieval pipeline fetches and chunks candidate sources, then ranks the chunks with bge-small-en-v1.5 embeddings, returning only the most useful snippets in the observation. No LLM judge is involved here either.
The goal is to give the model enough relevant context to build a better explainer, without overwhelming it with irrelevant text.
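The ranking step looks roughly like this sketch using sentence-transformers (fetching and chunking are omitted; the function shape is an assumption, only the model name comes from above):
```python
from sentence_transformers import SentenceTransformer, util

# Rank text chunks against a query with bge-small-en-v1.5 embeddings.
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

def rank_chunks(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    q_emb = model.encode(query, normalize_embeddings=True)
    c_emb = model.encode(chunks, normalize_embeddings=True)
    scores = util.cos_sim(q_emb, c_emb)[0]          # cosine similarity per chunk
    best = scores.argsort(descending=True)[:top_k]  # indices of the top matches
    return [chunks[int(i)] for i in best]
```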
## What We Trained First
Before RL, the model needs to know the shape of the artifacts.
Even larger models often produce Marimo and Manim code that looks reasonable but fails under real validation. So the first step is supervised fine-tuning on examples built from:
- curated STEM tasks
- Marimo examples and documentation patterns
- Manim examples, guides, and reference snippets
- generate and repair action templates
The current target model is:
```text
unsloth/Ministral-3-3B-Instruct-2512-unsloth-bnb-4bit
```
The SFT adapter is here:
[kgdrathan/ministral-3-3b-4bit-marimo-manim](https://huggingface.co/kgdrathan/ministral-3-3b-4bit-marimo-manim/)
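One common way to load the adapter on top of the base model is with PEFT, sketched below. This is an assumption about usage, not the project's training or serving code, and loading a bnb-4bit base requires bitsandbytes:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "unsloth/Ministral-3-3B-Instruct-2512-unsloth-bnb-4bit"
adapter = "kgdrathan/ministral-3-3b-4bit-marimo-manim"

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)  # attach the SFT adapter
```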

## Links
- Environment Space: [kgdrathan-explainer-env](https://kgdrathan-explainer-env.hf.space)
- SFT adapter: [kgdrathan/ministral-3-3b-4bit-marimo-manim](https://huggingface.co/kgdrathan/ministral-3-3b-4bit-marimo-manim/)
- Reward details: [explainer_env/rewards/README.md](explainer_env/rewards/README.md)