kgdrathan committed on
Commit
f3394fa
·
verified ·
1 Parent(s): 2b5f8f2

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +122 -69
README.md CHANGED
@@ -13,114 +13,167 @@ tags:
13
  - RL
14
  ---
15
 
16
- # Research -> Interactive Explainer Environment
17
 
18
- See. Interact. Understand.
 
 
19
 
20
- That is the philosophy of this repo.
21
 
22
- Some topics are hard to learn from text alone. Gradient descent makes more sense when
23
- you move the learning rate and watch the loss curve. Fourier transforms make more sense
24
- when frequencies appear visually. Algorithms make more sense when you can see each step
25
- happen.
26
 
27
- So this environment trains a model to create interactive explanations instead of only
28
- writing paragraphs.
29
 
30
- Given a topic, the agent:
31
 
32
- 1. researches the topic,
33
- 2. builds a [Marimo](https://marimo.io/) notebook or [Manim](https://www.manim.community/) animation,
34
- 3. receives validation feedback,
35
- 4. gets one chance to repair the artifact.
36
 
37
- Expected Episode Flow:
38
- ![Expected Episode Flow](./assets/episode_flow.jpg)
39
 
40
- ## Rewarding Better
41
 
42
- The rewards are designed to be verifiable. We do not need an LLM judge.
43
 
44
- For every step, we first reward the basics:
 
 
 
45
 
46
- - Is the action valid JSON, with all required fields?
47
- - Is it the correct action for the current phase of the episode?
48
 
49
- Then each action gets its own reward logic.
50
 
51
- ### `explore`
 
52
 
53
- - **Relevance**
54
- - Is the tool relevant to the topic?
55
- - Is search query related?
56
- - Useful content in retrieved sources?
57
- - Coverage of task keywords?
58
- - Avoidance of similar **repetitive** searches?
59
- - Too many **explore steps**?
60
 
61
- ### `generate`
62
 
63
- - Correct **format** selected: Marimo or Manim?
64
- - Artifact includes the important topic **keywords**?
65
- - Code **parses**?
66
- - Marimo: `marimo check` pass?
67
- - Manim: code defines a valid scene structure?
68
- - Can the artifact actually run/export/render?
69
 
70
- ### `repair`
 
 
 
 
 
71
 
72
- - Error (lint/build) addressed?
73
- - Passes validation?
74
- - Avoided repetition?
75
 
76
- > More details in [rewards/README.md](rewards/README.md)
77
 
78
- ## Quirks
79
 
80
- ### Why is RAG done here?
81
 
82
- We are training SLMs.<br>
83
- This is a long-horizon task.<br>
84
- It needs a lot of context, which comes from the exploration steps.<br>
85
 
86
- To keep only relevant context in the observation, we use an embedding model:<br>
87
- Model: `bge-small-en-v1.5`
88
 
89
- ![RAG](./assets/why-rag.jpg)
 
 
 
90
 
91
- ### Selection of the SLM
92
 
93
- - We need SLMs with long context
94
- - We need it to be < 3B parameters
95
- - We need it to be intelligent enough
96
 
97
- We have selected `Ministral-3-3B`
98
 
99
- ### Why is SFT done in our case?
 
100
 
101
- Even 8B models do not write correct Marimo/Manim code.<br>
102
- We extract tutorial, example, and guide code from the cloned Marimo and Manim repos.<br>
103
- From these, we created samples.<br>
104
- We then ran SFT to align our SLM to the expected Marimo/Manim code style.<br>
105
 
106
- ## Links
107
 
108
- SFT Code: [train/sft_unsloth.py](https://gitlab.com/kgdrathan/openenv-explainer/-/blob/main/train/sft_unsloth.py) and [adapter model](https://huggingface.co/kgdrathan/ministral-3-3b-4bit-marimo-manim/)<br>
109
- ![training curves](https://huggingface.co/kgdrathan/ministral-3-3b-4bit-marimo-manim/resolve/main/training_curves.png)<br>
 
 
 
110
 
 
111
 
112
- RL GRPO Code: [train/grpo_unsloth.py](https://gitlab.com/kgdrathan/openenv-explainer/-/blob/main/train/grpo_unsloth.py)
113
 
 
114
 
115
- > Dashboard is for looking at logs and interacting with the environment.
116
- Dashboard for interacting with the environment: [explainer-env-dashboard](https://kgdrathan-explainer-env-dashboard.hf.space/)
 
117
 
 
118
 
119
- ## Status
120
 
121
- Completed: Environment and SFT<br>
122
- Remaining: RL GRPO training (some errors in the code)<br>
123
 
 
124
 
 
125
126
 
 
 
 
 
 
13
  - RL
14
  ---
15
 
 
16
 
17
+ <p align="center">
18
+ <span style="font-size:2.2em; font-weight:bold;">See. Interact. Understand.</span>
19
+ </p>
20
 
21
+ # Teaching Small Models to Build Interactive Explainers
22
 
23
+ What if a small language model could do more than answer a STEM question?
 
 
 
24
 
25
+ What if it could research the topic, decide what kind of visual explanation would help, build a working interactive notebook or animation, and then fix its own code when validation fails?
 
26
 
27
+ That is the idea behind this project: an OpenEnv reinforcement learning environment for training small language models to create visual, executable educational content.
28
 
29
+ Built for the [OpenEnv Hackathon](https://openenv.dev) in India, April 25-26, 2026.
 
 
 
30
 
31
+ ![Expected episode flow](assets/episode_flow.jpg)
 
32
 
33
+ ## The Problem
34
 
35
+ Most educational answers from language models are still text-first. That is fine for simple definitions, but it breaks down for topics that are easier to understand by seeing and trying:
36
 
37
+ - gradient descent is clearer when you move the learning rate and watch the loss curve change
38
+ - Fourier transforms are clearer when frequencies become visible
39
+ - sorting algorithms are clearer when every comparison and swap is animated
40
+ - probability and statistics are clearer when samples, distributions, and uncertainty move on screen
41
 
42
+ The goal here is not just to generate an explanation. The goal is to train a model to build an artifact that teaches.
 
43
 
44
+ The artifact can be:
45
 
46
+ - a [Marimo](https://marimo.io/) reactive Python notebook for interactive explanations, sliders, charts, tables, and data exploration
47
+ - a [Manim](https://www.manim.community/) animation for step-by-step math and algorithm visuals
48
 
49
+ ## Why RL?
 
 
 
 
 
 
50
 
51
+ RL matters here because "make a good explainer" is not a one-shot task.
52
 
53
+ The model has to make a sequence of decisions:
 
 
 
 
 
54
 
55
+ 1. understand the assigned topic
56
+ 2. decide what to research
57
+ 3. choose the right search or documentation tool
58
+ 4. stop exploring when it has enough context
59
+ 5. generate runnable Marimo or Manim code
60
+ 6. use validation feedback to repair failures
61
 
62
+ This is exactly the kind of workflow where RL is useful. The model is rewarded for the process, not just the final text.
 
 
63
 
64
+ ## The Episode
65
 
66
+ Every episode starts with a STEM topic, an audience tier, keywords, and a target difficulty.
67
 
68
+ The agent then moves through three phases.
69
 
70
+ ### 1. Explore
 
 
71
 
72
+ The agent can call explicit research tools:
 
73
 
74
+ - `search_wikipedia` for fundamentals
75
+ - `search_hf_papers` for ML and AI papers
76
+ - `search_arxiv` for scientific papers
77
+ - `search_hf_hub` for models, datasets, Spaces, and examples
78
 
79
+ It gets up to three exploration steps. This keeps the task long enough to learn research behavior, but short enough for practical GRPO training.
80
 
81
+ ### 2. Generate
 
 
82
 
83
+ The agent submits one JSON action with a complete Python artifact:
84
 
85
+ - `format="marimo"` for a reactive notebook
86
+ - `format="manim"` for an animation scene
87
 
88
+ The code is not judged only by how it looks. It is parsed, linted, checked, and run.
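As a rough illustration of the first of those checks, a generate action could be validated and its code parsed before anything is run. This is a minimal sketch, not the environment's actual validator: the `format` values come from the list above, while the `code` field name and the function `check_generate_action` are assumptions made for the example.

```python
import ast
import json

ALLOWED_FORMATS = {"marimo", "manim"}  # the two formats named above


def check_generate_action(raw: str) -> tuple[bool, str]:
    """Return (ok, message) for a raw JSON generate action.

    The `code` field name is an assumption for this sketch.
    """
    try:
        action = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(action, dict):
        return False, "action must be a JSON object"
    if action.get("format") not in ALLOWED_FORMATS:
        return False, "format must be 'marimo' or 'manim'"
    code = action.get("code")
    if not isinstance(code, str) or not code.strip():
        return False, "code must be a non-empty string"
    try:
        ast.parse(code)  # syntax check only; nothing is executed
    except SyntaxError as exc:
        return False, f"syntax error on line {exc.lineno}: {exc.msg}"
    return True, "ok"
```

A syntax check like this is only the cheapest gate. Linting, Marimo or Manim validation, and an actual run would follow it.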
 
 
 
89
 
90
+ ### 3. Repair
91
+
92
+ If the generated artifact fails validation, the environment returns the error message. The agent gets one repair attempt.
93
+
94
+ This is important because code generation usually fails in boring ways: invalid Marimo cell dependencies, duplicate notebook globals, missing Manim scene classes, syntax errors, or examples that look plausible but do not execute.
95
+
96
+ The repair step teaches the model to read the validation feedback and change the code, instead of blindly regenerating another answer.
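The three phases can be pictured as a small state machine. The sketch below is an assumption about how such bookkeeping might look, not the environment's code: `EpisodeState`, `next_phase`, and the phase names are hypothetical, while the limits (three exploration steps, one repair attempt) come from the text above.

```python
from dataclasses import dataclass

MAX_EXPLORE_STEPS = 3  # "up to three exploration steps"


@dataclass
class EpisodeState:
    phase: str = "explore"  # explore -> repair (if validation fails) -> done
    explore_steps: int = 0


def next_phase(state: EpisodeState, action: str, validation_passed: bool = False) -> EpisodeState:
    """Advance the episode; raise ValueError for an action in the wrong phase."""
    if state.phase == "explore" and action == "explore":
        if state.explore_steps >= MAX_EXPLORE_STEPS:
            raise ValueError("exploration budget exhausted")
        return EpisodeState("explore", state.explore_steps + 1)
    if state.phase == "explore" and action == "generate":
        # The agent may stop exploring early; a failed artifact goes to repair.
        return EpisodeState("done" if validation_passed else "repair", state.explore_steps)
    if state.phase == "repair" and action == "repair":
        return EpisodeState("done", state.explore_steps)  # only one repair attempt
    raise ValueError(f"action {action!r} is not allowed in phase {state.phase!r}")
```

Rejecting out-of-phase actions with an explicit error makes it straightforward to penalize them in the reward.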
97
+
98
+ ## The Reward Signal
99
+
100
+ The reward system is deliberately practical. It avoids using an LLM judge inside the training loop because that would be slower, noisier, and harder to reproduce.
101
+
102
+ Instead, the environment rewards things that can be checked quickly.
103
+
104
+ ### Exploration Reward
105
+
106
+ The model gets rewarded when it:
107
+
108
+ - chooses a useful tool for the topic
109
+ - writes a relevant query
110
+ - retrieves useful sources
111
+ - increases keyword coverage
112
+ - adds new information instead of repeating the same search
113
+ - stops when the context is already good enough
114
+
115
+ There is also a small step cost. Exploring forever should not be the winning strategy.
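One way to picture the keyword-coverage and step-cost terms is the sketch below. It is illustrative only: the function names and the `STEP_COST` value are hypothetical, and the real reward also covers tool choice, query relevance, and source quality, which are omitted here.

```python
STEP_COST = 0.05  # hypothetical value; the text only says the cost is small


def keyword_coverage(text: str, keywords: list[str]) -> float:
    """Fraction of task keywords that appear in `text` (case-insensitive)."""
    if not keywords:
        return 0.0
    lowered = text.lower()
    return sum(kw.lower() in lowered for kw in keywords) / len(keywords)


def explore_reward(context_before: str, new_snippets: str, keywords: list[str]) -> float:
    """Reward the coverage gained by this step, minus a fixed step cost."""
    before = keyword_coverage(context_before, keywords)
    after = keyword_coverage(context_before + " " + new_snippets, keywords)
    return (after - before) - STEP_COST
```

Because only the gain in coverage is rewarded, repeating a search that adds nothing new yields a negative reward equal to the step cost.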
116
+
117
+ ### Generation Reward
118
+
119
+ The generated code is rewarded for:
120
 
121
+ - valid JSON action format
122
+ - matching the requested artifact type
123
+ - covering the key concepts
124
+ - passing Marimo or Manim validation
125
+ - actually running or rendering
126
 
127
+ Broken code cannot score well just because it mentions the right words. The validation checks act like gates.
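The gating idea can be sketched as follows. The weights and the names `GenerationChecks` and `generation_reward` are assumptions chosen for illustration, not the values used in the environment.

```python
from dataclasses import dataclass


@dataclass
class GenerationChecks:
    valid_action: bool       # JSON action parsed with the expected fields
    format_matches: bool     # requested Marimo/Manim format was used
    keyword_coverage: float  # fraction of key concepts covered, in [0, 1]
    validates: bool          # Marimo or Manim validation passed
    runs: bool               # artifact actually ran or rendered


def generation_reward(c: GenerationChecks) -> float:
    """Gated score: later checks only count once earlier ones pass."""
    if not c.valid_action:
        return 0.0
    score = 0.1
    if not c.format_matches:
        return score
    score += 0.1
    if not c.validates:
        return score  # keyword coverage cannot rescue broken code
    score += 0.3 + 0.2 * c.keyword_coverage
    if c.runs:
        score += 0.3
    return score
```

With these hypothetical weights, an artifact that mentions every keyword but fails validation scores 0.2, while one that validates and runs with full coverage scores 1.0.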
128
 
129
+ ### Repair Reward
130
 
131
+ The repair step rewards the model for:
132
 
133
+ - fixing the reported error
134
+ - passing validation after the fix
135
+ - avoiding repeated unchanged code
136
 
137
+ This makes the environment closer to a real development loop: build, test, read the error, fix.
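A minimal sketch of the "avoid repeated unchanged code" check, under the assumption that trivial whitespace edits should not count as a change. The function names and the 0.4/0.6 split are hypothetical.

```python
def normalize(code: str) -> str:
    """Drop blank lines and trailing whitespace so trivial edits do not count."""
    return "\n".join(line.rstrip() for line in code.splitlines() if line.strip())


def repair_reward(old_code: str, new_code: str, error_fixed: bool, passes_validation: bool) -> float:
    """Hypothetical scoring: no credit for resubmitting the same code."""
    if normalize(old_code) == normalize(new_code):
        return 0.0
    reward = 0.0
    if error_fixed:
        reward += 0.4  # the reported error no longer occurs
    if passes_validation:
        reward += 0.6  # the repaired artifact passes validation
    return reward
```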
138
 
139
+ ## Why Retrieval Is Part of the Environment
140
 
141
+ Small models do not have unlimited context, and this task can easily become context-heavy. The agent may research equations, examples, library APIs, visualization patterns, and task-specific concepts before generating code.
 
142
 
143
+ So the environment filters research results before sending them back.
144
 
145
+ ![RAG for long-horizon exploration tasks](assets/why-rag.jpg)
146
 
147
+ The retrieval pipeline fetches and chunks candidate sources, ranks the chunks with a single embedding model, `bge-small-en-v1.5`, and returns the most useful snippets in the observation.
148
+
149
+ The goal is to provide the model with enough relevant context to build a better explainer, without overwhelming it with irrelevant text.
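The chunk-and-rank step can be sketched without any model dependency. In the example below, `embed` is a bag-of-words stand-in used only to keep the code self-contained; the environment uses `bge-small-en-v1.5` embeddings instead, and the function names and chunk size here are assumptions.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 40) -> list[str]:
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_snippets(query: str, documents: list[str], k: int = 3, size: int = 40) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    chunks = [c for doc in documents for c in chunk(doc, size)]
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

Swapping `embed` for a dense embedding model changes the similarity quality, not the overall shape of the pipeline.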
150
+
151
+ ## What We Trained First
152
+
153
+ Before RL, the model needs to know the shape of the artifacts.
154
+
155
+ Even larger models often produce Marimo and Manim code that looks reasonable but fails under real validation. So the first step is supervised fine-tuning on examples built from:
156
+
157
+ - curated STEM tasks
158
+ - Marimo examples and documentation patterns
159
+ - Manim examples, guides, and reference snippets
160
+ - generate and repair action templates
161
+
162
+ The current target model is:
163
+
164
+ ```text
165
+ unsloth/Ministral-3-3B-Instruct-2512-unsloth-bnb-4bit
166
+ ```
167
+
168
+ The SFT adapter is here:
169
+
170
+ [kgdrathan/ministral-3-3b-4bit-marimo-manim](https://huggingface.co/kgdrathan/ministral-3-3b-4bit-marimo-manim/)
171
+
172
+ ![SFT training curves](https://huggingface.co/kgdrathan/ministral-3-3b-4bit-marimo-manim/resolve/main/training_curves.png)
173
+
174
+ ## Links
175
 
176
+ - Environment Space: [kgdrathan-explainer-env](https://kgdrathan-explainer-env.hf.space)
177
+ - Dashboard Space: [kgdrathan-explainer-env-dashboard](https://kgdrathan-explainer-env-dashboard.hf.space/)
178
+ - SFT adapter: [kgdrathan/ministral-3-3b-4bit-marimo-manim](https://huggingface.co/kgdrathan/ministral-3-3b-4bit-marimo-manim/)
179
+ - Reward details: [explainer_env/rewards/README.md](explainer_env/rewards/README.md)