Upload folder using huggingface_hub
README.md
CHANGED
|
@@ -13,114 +13,167 @@ tags:
|
|
| 13 |
- RL
|
| 14 |
---
|
| 15 |
|
| 16 |
-
# Research -> Interactive Explainer Environment
|
| 17 |
|
| 18 |
-
|
| 19 |
|
| 20 |
-
|
| 21 |
|
| 22 |
-
|
| 23 |
-
you move the learning rate and watch the loss curve. Fourier transforms make more sense
|
| 24 |
-
when frequencies appear visually. Algorithms make more sense when you can see each step
|
| 25 |
-
happen.
|
| 26 |
|
| 27 |
-
|
| 28 |
-
writing paragraphs.
|
| 29 |
|
| 30 |
-
|
| 31 |
|
| 32 |
-
|
| 33 |
-
2. builds a [Marimo](https://marimo.io/) notebook or [Manim](https://www.manim.community/) animation,
|
| 34 |
-
3. receives validation feedback,
|
| 35 |
-
4. gets one chance to repair the artifact.
|
| 36 |
|
| 37 |
-
Expected
|
| 38 |
-

|
| 39 |
|
| 40 |
-
##
|
| 41 |
|
| 42 |
-
|
| 43 |
|
| 44 |
-
|
| 45 |
|
| 46 |
-
|
| 47 |
-
- Correct action in correct phase of episode?
|
| 48 |
|
| 49 |
-
|
| 50 |
|
| 51 |
-
|
|
|
|
| 52 |
|
| 53 |
-
|
| 54 |
-
- Is the tool relevant to the topic?
|
| 55 |
-
- Is search query related?
|
| 56 |
-
- Useful content in retrieved sources?
|
| 57 |
-
- Coverage of task keywords?
|
| 58 |
-
- Avoidance of similar **repetitive** searches?
|
| 59 |
-
- Too many **explore steps**?
|
| 60 |
|
| 61 |
-
|
| 62 |
|
| 63 |
-
|
| 64 |
-
- Artifact includes the important topic **keywords**?
|
| 65 |
-
- Code **parses**?
|
| 66 |
-
- Marimo: `marimo check` pass?
|
| 67 |
-
- Manim: code defines a valid scene structure?
|
| 68 |
-
- Can the artifact actually run/export/render?
|
| 69 |
|
| 70 |
-
|
| 71 |
|
| 72 |
-
|
| 73 |
-
- Passes validation?
|
| 74 |
-
- Avoided repetition?
|
| 75 |
|
| 76 |
-
|
| 77 |
|
| 78 |
-
|
| 79 |
|
| 80 |
-
|
| 81 |
|
| 82 |
-
|
| 83 |
-
This will be a long-horizon task.<br>
|
| 84 |
-
It needs a lot of context, which comes from the exploration steps.<br>
|
| 85 |
|
| 86 |
-
|
| 87 |
-
Model: `bge-small-en-v1.5`
|
| 88 |
|
| 89 |
-
|
| 90 |
|
| 91 |
-
|
| 92 |
|
| 93 |
-
|
| 94 |
-
- We need it to be < 3B parameters
|
| 95 |
-
- We need it to be intelligent enough
|
| 96 |
|
| 97 |
-
|
| 98 |
|
| 99 |
-
|
|
|
|
| 100 |
|
| 101 |
-
|
| 102 |
-
We are extracting tutorial, example, and guide code from the cloned Marimo and Manim repos.<br>
|
| 103 |
-
Created samples.<br>
|
| 104 |
-
And ran SFT to teach/align our SLM to the expected Marimo/Manim code style.<br>
|
| 105 |
|
| 106 |
-
##
|
| 107 |
|
| 108 |
-
|
| 109 |
-
|
| 110 |
|
|
|
|
| 111 |
|
| 112 |
-
|
| 113 |
|
|
|
|
| 114 |
|
| 115 |
-
|
| 116 |
-
|
|
|
|
| 117 |
|
|
|
|
| 118 |
|
| 119 |
-
##
|
| 120 |
|
| 121 |
-
|
| 122 |
-
Remaining: RL GRPO training (some errors in the code)<br>
|
| 123 |
|
|
|
|
| 124 |
|
|
|
|
| 125 |
|
| 126 |
|
| 13 |
- RL
|
| 14 |
---
|
| 15 |
|
|
|
|
| 16 |
|
| 17 |
+
<p align="center">
|
| 18 |
+
<span style="font-size:2.2em; font-weight:bold;">See. Interact. Understand.</span>
|
| 19 |
+
</p>
|
| 20 |
|
| 21 |
+
# Teaching Small Models to Build Interactive Explainers
|
| 22 |
|
| 23 |
+
What if a small language model could do more than answer a STEM question?
|
| 24 |
|
| 25 |
+
What if it could research the topic, decide what kind of visual explanation would help, build a working interactive notebook or animation, and then fix its own code when validation fails?
|
|
|
|
| 26 |
|
| 27 |
+
That is the idea behind this project: an OpenEnv reinforcement learning environment for training small language models to create visual, executable educational content.
|
| 28 |
|
| 29 |
+
Built for the [OpenEnv Hackathon](https://openenv.dev) in India, April 25-26, 2026.
|
| 30 |
|
| 31 |
+

|
|
|
|
| 32 |
|
| 33 |
+
## The Problem
|
| 34 |
|
| 35 |
+
Most educational answers from language models are still text-first. That is fine for simple definitions, but it breaks down for topics that are easier to understand by seeing and trying:
|
| 36 |
|
| 37 |
+
- gradient descent is clearer when you move the learning rate and watch the loss curve change
|
| 38 |
+
- Fourier transforms are clearer when frequencies become visible
|
| 39 |
+
- sorting algorithms are clearer when every comparison and swap is animated
|
| 40 |
+
- probability and statistics are clearer when samples, distributions, and uncertainty move on screen
|
| 41 |
|
| 42 |
+
The goal here is not just to generate an explanation. The goal is to train a model to build an artifact that teaches.
|
|
|
|
| 43 |
|
| 44 |
+
The artifact can be:
|
| 45 |
|
| 46 |
+
- a [Marimo](https://marimo.io/) reactive Python notebook for interactive explanations, sliders, charts, tables, and data exploration
|
| 47 |
+
- a [Manim](https://www.manim.community/) animation for step-by-step math and algorithm visuals
|
| 48 |
|
| 49 |
+
## Why RL?
|
| 50 |
|
| 51 |
+
RL matters here because "make a good explainer" is not a one-shot task.
|
| 52 |
|
| 53 |
+
The model has to make a sequence of decisions:
|
| 54 |
|
| 55 |
+
1. understand the assigned topic
|
| 56 |
+
2. decide what to research
|
| 57 |
+
3. choose the right search or documentation tool
|
| 58 |
+
4. stop exploring when it has enough context
|
| 59 |
+
5. generate runnable Marimo or Manim code
|
| 60 |
+
6. use validation feedback to repair failures
|
| 61 |
|
| 62 |
+
This is exactly the kind of workflow where RL is useful. The model is rewarded for the process, not just the final text.
|
| 63 |
|
| 64 |
+
## The Episode
|
| 65 |
|
| 66 |
+
Every episode starts with a STEM topic, an audience tier, keywords, and a target difficulty.
|
| 67 |
|
| 68 |
+
The agent then moves through three phases.
|
| 69 |
|
| 70 |
+
### 1. Explore
|
| 71 |
|
| 72 |
+
The agent can call explicit research tools:
|
|
|
|
| 73 |
|
| 74 |
+
- `search_wikipedia` for fundamentals
|
| 75 |
+
- `search_hf_papers` for ML and AI papers
|
| 76 |
+
- `search_arxiv` for scientific papers
|
| 77 |
+
- `search_hf_hub` for models, datasets, Spaces, and examples
|
| 78 |
|
| 79 |
+
It gets up to three exploration steps. This keeps the task long enough to learn research behavior, but short enough for practical GRPO training.
|
| 80 |
|
| 81 |
+
### 2. Generate
|
| 82 |
|
| 83 |
+
The agent submits one JSON action with a complete Python artifact:
|
| 84 |
|
| 85 |
+
- `format="marimo"` for a reactive notebook
|
| 86 |
+
- `format="manim"` for an animation scene
|
| 87 |
|
| 88 |
+
The code is not judged only by how it looks. It is parsed, linted, checked, and run.
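
As a rough illustration of the generate step, the sketch below builds one JSON action and runs the cheapest check, parsing. Only the two `format` values come from the text above. The `action_type` and `code` field names and the `parses` helper are hypothetical, and the real environment also lints, checks, and runs the artifact.

```python
import ast
import json

ALLOWED_FORMATS = {"marimo", "manim"}  # the two formats named above


def parses(action_json: str) -> bool:
    """Return True if the action is valid JSON, names a known format,
    and carries Python source that at least parses."""
    try:
        action = json.loads(action_json)
    except json.JSONDecodeError:
        return False
    if not isinstance(action, dict) or action.get("format") not in ALLOWED_FORMATS:
        return False
    code = action.get("code")  # hypothetical field name
    if not isinstance(code, str):
        return False
    try:
        ast.parse(code)  # syntax check only; nothing is imported or executed
    except (SyntaxError, ValueError):
        return False
    return True


# Hypothetical payload; every field name except "format" is an assumption.
example_action = json.dumps({
    "action_type": "generate",
    "format": "manim",
    "code": (
        "from manim import Scene\n\n"
        "class Demo(Scene):\n"
        "    def construct(self):\n"
        "        pass\n"
    ),
})
```

A parse check like this is only the first gate: code can parse and still fail `marimo check` or fail to render.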
|
| 89 |
|
| 90 |
+
### 3. Repair
|
| 91 |
+
|
| 92 |
+
If the generated artifact fails validation, the environment returns the error message. The agent gets one repair attempt.
|
| 93 |
+
|
| 94 |
+
This is important because code generation usually fails in boring ways: invalid Marimo cell dependencies, duplicate notebook globals, missing Manim scene classes, syntax errors, or examples that look plausible but do not execute.
|
| 95 |
+
|
| 96 |
+
The repair step teaches the model to read the validation feedback and change the code, instead of blindly regenerating another answer.
|
| 97 |
+
|
| 98 |
+
## The Reward Signal
|
| 99 |
+
|
| 100 |
+
The reward system is deliberately practical. It avoids using an LLM judge inside the training loop because that would be slower, noisier, and harder to reproduce.
|
| 101 |
+
|
| 102 |
+
Instead, the environment rewards things that can be checked quickly.
|
| 103 |
+
|
| 104 |
+
### Exploration Reward
|
| 105 |
+
|
| 106 |
+
The model gets rewarded when it:
|
| 107 |
+
|
| 108 |
+
- chooses a useful tool for the topic
|
| 109 |
+
- writes a relevant query
|
| 110 |
+
- retrieves useful sources
|
| 111 |
+
- increases keyword coverage
|
| 112 |
+
- adds new information instead of repeating the same search
|
| 113 |
+
- stops when the context is already good enough
|
| 114 |
+
|
| 115 |
+
There is also a small step cost. Exploring forever should not be the winning strategy.
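
One way to picture how these signals might combine is the sketch below. It is not the environment's reward code: the function name, the equal weighting of keywords, and the `step_cost` value are all assumptions, and it only models keyword coverage, repetition, and the step cost.

```python
def exploration_reward(query, snippets, keywords, seen_queries, covered,
                       step_cost=0.05):
    """Score one exploration step; all weights here are illustrative.

    Returns (reward, updated_covered_keywords).
    """
    normalized = " ".join(query.lower().split())
    if normalized in seen_queries:
        # Repeating a search adds no information, so only the cost applies.
        return -step_cost, covered
    seen_queries.add(normalized)

    text = " ".join(snippets).lower()
    newly_covered = {k for k in keywords if k.lower() in text} - covered
    gain = len(newly_covered) / max(len(keywords), 1)
    return gain - step_cost, covered | newly_covered
```

Because the gain only counts keywords that were not covered before, a second search that returns the same material earns nothing and still pays the step cost.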
|
| 116 |
+
|
| 117 |
+
### Generation Reward
|
| 118 |
+
|
| 119 |
+
The generated code is rewarded for:
|
| 120 |
|
| 121 |
+
- valid JSON action format
|
| 122 |
+
- matching the requested artifact type
|
| 123 |
+
- covering the key concepts
|
| 124 |
+
- passing Marimo or Manim validation
|
| 125 |
+
- actually running or rendering
|
| 126 |
|
| 127 |
+
Broken code cannot score well just because it mentions the right words. The validation checks act like gates.
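
A minimal sketch of what gating could look like, assuming the five checks listed above are available as booleans and a coverage fraction. The `Checks` dataclass and every weight are hypothetical, not the values used by the environment.

```python
from dataclasses import dataclass


@dataclass
class Checks:
    valid_json: bool
    format_matches: bool
    keyword_coverage: float  # fraction of task keywords present, 0.0 to 1.0
    passes_validation: bool
    runs: bool


def generation_reward(c: Checks) -> float:
    """Gated score: later components only count once earlier gates pass."""
    if not (c.valid_json and c.format_matches):
        return 0.0
    score = 0.1  # small credit for a well-formed action
    if not c.passes_validation:
        return score  # keyword coverage is ignored for broken code
    score += 0.3 + 0.2 * c.keyword_coverage
    if c.runs:
        score += 0.4
    return score
```

With this shape, an artifact that mentions every keyword but fails validation scores the same as one that mentions none.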
|
| 128 |
|
| 129 |
+
### Repair Reward
|
| 130 |
|
| 131 |
+
The repair step rewards the model for:
|
| 132 |
|
| 133 |
+
- fixing the reported error
|
| 134 |
+
- passing validation after the fix
|
| 135 |
+
- avoiding repeated unchanged code
|
| 136 |
|
| 137 |
+
This makes the environment closer to a real development loop: build, test, read the error, fix.
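
The "avoiding repeated unchanged code" check can be approximated very cheaply. The sketch below is an assumption about how such a check might work, with made-up reward values: it compares the two submissions after trimming trailing whitespace and penalizes a resubmission that changes nothing.

```python
def _normalize(code: str) -> str:
    """Ignore trailing whitespace and leading/trailing blank lines."""
    return "\n".join(line.rstrip() for line in code.strip().splitlines())


def repair_reward(old_code: str, new_code: str, passes_validation: bool) -> float:
    """Illustrative repair score; the reward values are assumptions."""
    if _normalize(old_code) == _normalize(new_code):
        # Resubmitting the same code cannot have fixed the reported error.
        return -0.5
    return 1.0 if passes_validation else 0.0
```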
|
| 138 |
|
| 139 |
+
## Why Retrieval Is Part of the Environment
|
| 140 |
|
| 141 |
+
Small models do not have unlimited context, and this task can easily become context-heavy. The agent may research equations, examples, library APIs, visualization patterns, and task-specific concepts before generating code.
|
|
|
|
| 142 |
|
| 143 |
+
So the environment filters research results before sending them back.
|
| 144 |
|
| 145 |
+

|
| 146 |
|
| 147 |
+
The retrieval pipeline fetches and chunks candidate sources, then ranks the chunks with a single embedding model, `bge-small-en-v1.5`, and returns the most useful snippets in the observation.
|
| 148 |
+
|
| 149 |
+
The goal is to provide the model with enough relevant context to build a better explainer, without overwhelming it with irrelevant text.
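
The chunk-and-rank idea can be sketched as follows. This is not the environment's code: `embed` is a toy bag-of-words stand-in for `bge-small-en-v1.5` so the example runs without downloading a model, and the chunk size and `top_k` defaults are hypothetical.

```python
import math
from collections import Counter


def chunk(text: str, size: int = 40) -> list[str]:
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_snippets(query: str, sources: list[str], top_k: int = 3) -> list[str]:
    """Chunk every source, then return the top_k chunks closest to the query."""
    q = embed(query)
    chunks = [c for s in sources for c in chunk(s)]
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]
```

Swapping `embed` for a real sentence-embedding model would keep the rest of the pipeline unchanged, since only the similarity scores differ.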
|
| 150 |
+
|
| 151 |
+
## What We Trained First
|
| 152 |
+
|
| 153 |
+
Before RL, the model needs to know the shape of the artifacts.
|
| 154 |
+
|
| 155 |
+
Even larger models often produce Marimo and Manim code that looks reasonable but fails under real validation. So the first step is supervised fine-tuning on examples built from:
|
| 156 |
+
|
| 157 |
+
- curated STEM tasks
|
| 158 |
+
- Marimo examples and documentation patterns
|
| 159 |
+
- Manim examples, guides, and reference snippets
|
| 160 |
+
- generate and repair action templates
|
| 161 |
+
|
| 162 |
+
The current target model is:
|
| 163 |
+
|
| 164 |
+
```text
|
| 165 |
+
unsloth/Ministral-3-3B-Instruct-2512-unsloth-bnb-4bit
|
| 166 |
+
```
|
| 167 |
+
|
| 168 |
+
The SFT adapter is here:
|
| 169 |
+
|
| 170 |
+
[kgdrathan/ministral-3-3b-4bit-marimo-manim](https://huggingface.co/kgdrathan/ministral-3-3b-4bit-marimo-manim/)
|
| 171 |
+
|
| 172 |
+

|
| 173 |
+
|
| 174 |
+
## Links
|
| 175 |
|
| 176 |
+
- Environment Space: [kgdrathan-explainer-env](https://kgdrathan-explainer-env.hf.space)
|
| 177 |
+
- Dashboard Space: [kgdrathan-explainer-env-dashboard](https://kgdrathan-explainer-env-dashboard.hf.space/)
|
| 178 |
+
- SFT adapter: [kgdrathan/ministral-3-3b-4bit-marimo-manim](https://huggingface.co/kgdrathan/ministral-3-3b-4bit-marimo-manim/)
|
| 179 |
+
- Reward details: [explainer_env/rewards/README.md](explainer_env/rewards/README.md)
|