Spaces: kgdrathan/explainer-env


kgdrathan committed about 1 month ago

Commit f3394fa (verified) · 1 parent: 2b5f8f2

Upload folder using huggingface_hub


Files changed (1)

README.md +122 -69

README.md CHANGED

@@ -13,114 +13,167 @@ tags:
   - RL
 ---
-# Research -> Interactive Explainer Environment
-See. Interact. Understand.
-That is the philosophy of this repo.
-Some topics are hard to learn from text alone. Gradient descent makes more sense when
-you move the learning rate and watch the loss curve. Fourier transforms make more sense
-when frequencies appear visually. Algorithms make more sense when you can see each step
-happen.
-So this environment trains a model to create interactive explanations instead of only
-writing paragraphs.
-Given a topic, the agent:
-1. researches the topic,
-2. builds a [Marimo](https://marimo.io/) notebook or [Manim](https://www.manim.community/) animation,
-3. receives validation feedback,
-4. gets one chance to repair the artifact.
-Expected Episode Flow:
-![Expected Episode Flow](./assets/episode_flow.jpg)
-## Rewarding Better
-The rewards are designed to be verifiable. We do not need an LLM judge.
-For every step, we first reward the basics:
-- Is action a valid JSON? (fields as well)
-- Correct action in correct phase of episode?
-Then each action gets its own reward logic.
-### `explore`
-- **Relevance**
-  - Is relevant tool for the topic?
-  - Is search query related?
-  - Useful content in retrieved sources?
-  - Coverage of task keywords?
-- Avoidance of similar **repetitive** searches?
-- Too many **explore steps**?
-### `generate`
-- Correct **format** selected: Marimo or Manim?
-- Artifact includes the important topic **keywords**?
-- Code **parses**?
-  - Marimo: `marimo check` pass?
-  - Manim: code defines a valid scene structure?
-  - Can the artifact actually run/export/render?
-### `repair`
-- Error (lint/build) addressed?
-- Passes validation?
-- Avoided repetition?
-> More details in [rewards/README.md](rewards/README.md)
-## Quirks
-### Why RAG is done here?
-We are training SLMs.<br>
-This will be a long-horizon task.<br>
-Where we need to use a lot of context - which comes from the exploration steps.<br>
-To only keep relevant context in the observation:<br>
-Model: `bge-small-en-v1.5`
-![RAG](./assets/why-rag.jpg)
-### Selection of the SLM
-- We need SLMs with long context
-- We need it to be < 3B parameters
-- We need it intelligent enough
-We have selected `Ministral-3-3B`
-### Why SFT is done in our case?
-Even 8B models do not reliably write correct Marimo/Manim code.<br>
-We are extracting tutorial, example, and guide code from the cloned Marimo and Manim repos.<br>
-We created samples from that code.<br>
-Then we ran SFT to align our SLM with the expected Marimo/Manim code style.<br>
-## Links
-SFT Code: [train/sft_unsloth.py](https://gitlab.com/kgdrathan/openenv-explainer/-/blob/main/train/sft_unsloth.py) and [adapter model](https://huggingface.co/kgdrathan/ministral-3-3b-4bit-marimo-manim/)<br>
-![training curves](https://huggingface.co/kgdrathan/ministral-3-3b-4bit-marimo-manim/resolve/main/training_curves.png)<br>
-RL GRPO Code: [train/grpo_unsloth.py](https://gitlab.com/kgdrathan/openenv-explainer/-/blob/main/train/grpo_unsloth.py)
-> Dashboard is for looking at logs and interacting with the environment.
-Dashboard for interacting with the environment: [explainer-env-dashboard](https://kgdrathan-explainer-env-dashboard.hf.space/)
-## Status
-Completed: Environment and SFT<br>
-Remaining: RL GRPO training (some errors in the code)<br>

   - RL
 ---
+<p align="center">
+  <span style="font-size:2.2em; font-weight:bold;">See. Interact. Understand.</span>
+</p>
+# Teaching Small Models to Build Interactive Explainers
+What if a small language model could do more than answer a STEM question?
+What if it could research the topic, decide what kind of visual explanation would help, build a working interactive notebook or animation, and then fix its own code when validation fails?
+That is the idea behind this project: an OpenEnv reinforcement learning environment for training small language models to create visual, executable educational content.
+Built for the [OpenEnv Hackathon](https://openenv.dev) in India, April 25-26, 2026.
+![Expected episode flow](assets/episode_flow.jpg)
+## The Problem
+Most educational answers from language models are still text-first. That is fine for simple definitions, but it breaks down for topics that are easier to understand by seeing and trying:
+- gradient descent is clearer when you move the learning rate and watch the loss curve change
+- Fourier transforms are clearer when frequencies become visible
+- sorting algorithms are clearer when every comparison and swap is animated
+- probability and statistics are clearer when samples, distributions, and uncertainty move on screen
+The goal here is not just to generate an explanation. The goal is to train a model to build an artifact that teaches.
+The artifact can be:
+- a [Marimo](https://marimo.io/) reactive Python notebook for interactive explanations, sliders, charts, tables, and data exploration
+- a [Manim](https://www.manim.community/) animation for step-by-step math and algorithm visuals
+## Why RL?
+Reinforcement learning fits here because "make a good explainer" is not a one-shot task.
+The model has to make a sequence of decisions:
+1. understand the assigned topic
+2. decide what to research
+3. choose the right search or documentation tool
+4. stop exploring when it has enough context
+5. generate runnable Marimo or Manim code
+6. use validation feedback to repair failures
+This is exactly the kind of workflow where RL is useful. The model is rewarded for the process, not just the final text.
+## The Episode
+Every episode starts with a STEM topic, an audience tier, keywords, and a target difficulty.
+The agent then moves through three phases.
+### 1. Explore
+The agent can call explicit research tools:
+- `search_wikipedia` for fundamentals
+- `search_hf_papers` for ML and AI papers
+- `search_arxiv` for scientific papers
+- `search_hf_hub` for models, datasets, Spaces, and examples
+It gets up to three exploration steps. This keeps the task long enough to learn research behavior, but short enough for practical GRPO training.
+### 2. Generate
+The agent submits one JSON action with a complete Python artifact:
+- `format="marimo"` for a reactive notebook
+- `format="manim"` for an animation scene
+The code is not judged only by how it looks. It is parsed, linted, checked, and run.
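A minimal sketch of how a generate action might be checked before any code is run. Only the two `format` values come from the README; the `action` and `code` field names and the `parse_generate_action` helper are hypothetical.

```python
import json

ALLOWED_FORMATS = {"marimo", "manim"}  # the two formats named in the README


def parse_generate_action(raw: str) -> dict:
    """Parse a generate action and check its fields.

    The "action" and "code" field names are assumptions for illustration.
    """
    try:
        action = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"action is not valid JSON: {exc}") from exc
    if not isinstance(action, dict):
        raise ValueError("action must be a JSON object")
    if action.get("action") != "generate":
        raise ValueError("expected a generate action")
    if action.get("format") not in ALLOWED_FORMATS:
        raise ValueError("format must be 'marimo' or 'manim'")
    code = action.get("code")
    if not isinstance(code, str) or not code.strip():
        raise ValueError("code must be a non-empty string")
    return action
```

Rejecting malformed actions early keeps the more expensive lint and run checks for submissions that are at least structurally valid.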
+### 3. Repair
+If the generated artifact fails validation, the environment returns the error message. The agent gets one repair attempt.
+This is important because code generation usually fails in boring ways: invalid Marimo cell dependencies, duplicate notebook globals, missing Manim scene classes, syntax errors, or examples that look plausible but do not execute.
+The repair step teaches the model to read the validation feedback and change the code, instead of blindly regenerating another answer.
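The three phases can be read as a small state machine. The sketch below is one possible encoding, assuming the agent may move to generation before its exploration budget is spent; the phase names, the `"done"` state, and both helper functions are hypothetical, while the limits (three exploration steps, one repair attempt) come from the text above.

```python
MAX_EXPLORE_STEPS = 3  # "up to three exploration steps"


def allowed_actions(phase: str, explore_steps_used: int) -> set:
    """Actions the agent may take in a given phase."""
    if phase == "explore":
        # Once the budget is spent, the only move left is to generate.
        if explore_steps_used >= MAX_EXPLORE_STEPS:
            return {"generate"}
        return {"explore", "generate"}
    if phase == "repair":
        return {"repair"}
    return set()  # "done": the episode is over


def next_phase(phase: str, action: str, validation_passed=None) -> str:
    """Advance the episode; raises if the action does not fit the phase."""
    if phase == "explore" and action == "explore":
        return "explore"
    if phase == "explore" and action == "generate":
        # A failed validation opens the single repair attempt.
        return "done" if validation_passed else "repair"
    if phase == "repair" and action == "repair":
        return "done"  # only one repair attempt is allowed
    raise ValueError(f"action {action!r} is not valid in phase {phase!r}")
```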
+## The Reward Signal
+The reward system is deliberately practical. It avoids using an LLM judge inside the training loop because that would be slower, noisier, and harder to reproduce.
+Instead, the environment rewards things that can be checked quickly.
+### Exploration Reward
+The model gets rewarded when it:
+- chooses a useful tool for the topic
+- writes a relevant query
+- retrieves useful sources
+- increases keyword coverage
+- adds new information instead of repeating the same search
+- stops when the context is already good enough
+There is also a small step cost. Exploring forever should not be the winning strategy.
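As a rough illustration, the coverage gain, repetition penalty, and step cost could be combined as below. The function names and all numeric weights are assumptions for this sketch, not the environment's actual values, and tool choice and query relevance are left out for brevity.

```python
def keyword_coverage(keywords, text):
    """Fraction of task keywords that appear in the text (case-insensitive)."""
    if not keywords:
        return 0.0
    lowered = text.lower()
    return sum(kw.lower() in lowered for kw in keywords) / len(keywords)


def explore_reward(keywords, query, retrieved_text, context_so_far, past_queries,
                   step_cost=0.05, repeat_penalty=0.2):
    """Reward new keyword coverage; penalize repeated queries and every step.

    step_cost and repeat_penalty are illustrative assumptions.
    """
    before = keyword_coverage(keywords, context_so_far)
    after = keyword_coverage(keywords, context_so_far + " " + retrieved_text)
    reward = (after - before) - step_cost
    normalized = query.strip().lower()
    if normalized in {q.strip().lower() for q in past_queries}:
        reward -= repeat_penalty
    return reward
```

Because only the gain in coverage is rewarded, a search that returns already-known material earns nothing and still pays the step cost.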
+### Generation Reward
+The generated code is rewarded for:
+- valid JSON action format
+- matching the requested artifact type
+- covering the key concepts
+- passing Marimo or Manim validation
+- actually running or rendering
+Broken code cannot score well just because it mentions the right words. The validation checks act like gates.
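The gating idea can be sketched with static checks alone. This example uses Python's `ast` module as a stand-in for the environment's real validation (which also lints and runs the artifact); the `generation_reward` and `defines_scene_class` helpers and the 0.5/0.5 weighting are hypothetical.

```python
import ast


def defines_scene_class(tree: ast.Module) -> bool:
    """True if any class lists a base whose name ends in 'Scene'."""
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            for base in node.bases:
                name = getattr(base, "id", None) or getattr(base, "attr", "")
                if name.endswith("Scene"):
                    return True
    return False


def generation_reward(code: str, fmt: str, requested_fmt: str, keywords) -> float:
    """Gate on validation first, then add keyword coverage on top."""
    if fmt != requested_fmt:
        return 0.0  # gate: wrong artifact type
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return 0.0  # gate: unparseable code scores nothing
    if fmt == "manim" and not defines_scene_class(tree):
        return 0.0  # gate: no scene class to render
    lowered = code.lower()
    coverage = sum(kw.lower() in lowered for kw in keywords) / max(len(keywords), 1)
    return 0.5 + 0.5 * coverage  # illustrative weighting
```

Keyword coverage only contributes after every gate has passed, so code that mentions the right terms but does not parse still scores zero.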
+### Repair Reward
+The repair step rewards the model for:
+- fixing the reported error
+- passing validation after the fix
+- avoiding repeated unchanged code
+This makes the environment closer to a real development loop: build, test, read the error, fix.
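A minimal sketch of the repair scoring, assuming the validator reports a single error string and returns nothing on success. The `normalize` and `repair_reward` helpers and the reward values are hypothetical.

```python
def normalize(code: str) -> str:
    """Strip trailing whitespace and blank edges so trivial edits do not count."""
    return "\n".join(line.rstrip() for line in code.strip().splitlines())


def repair_reward(old_code, new_code, old_error, new_error=None):
    """Score a single repair attempt. new_error is None when validation passes."""
    if normalize(old_code) == normalize(new_code):
        return -0.5  # resubmitting effectively unchanged code
    if new_error is None:
        return 1.0  # validation passes after the fix
    if new_error != old_error:
        return 0.3  # the reported error is gone, but something else fails
    return 0.0  # same error as before
```

Comparing normalized text instead of a similarity ratio avoids penalizing a correct one-character fix in a long file.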
+## Why Retrieval Is Part of the Environment
+Small models do not have unlimited context, and this task can easily become context-heavy. The agent may research equations, examples, library APIs, visualization patterns, and task-specific concepts before generating code.
+So the environment filters research results before sending them back.
+![RAG for long-horizon exploration tasks](assets/why-rag.jpg)
+The retrieval pipeline fetches and chunks candidate sources, ranks the chunks using only `bge-small-en-v1.5`, and returns the most useful snippets in the observation.
+The goal is to give the model enough relevant context to build a better explainer without overwhelming it with irrelevant text.
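The chunk-and-rank step can be illustrated without downloading a model. In this sketch a lowercase word-count vector stands in for the `bge-small-en-v1.5` embeddings, so the scores are not what the real pipeline would produce; every function name and the chunk size are assumptions.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Stand-in embedding: lowercase word counts (NOT bge-small-en-v1.5)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_snippets(query, documents, k=3, max_words=40):
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    chunks = [c for doc in documents for c in chunk_text(doc, max_words)]
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

Swapping `embed` for a real sentence-embedding model would leave the chunking and ranking logic unchanged.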
+## What We Trained First
+Before RL, the model needs to know the shape of the artifacts.
+Even larger models often produce Marimo and Manim code that looks reasonable but fails under real validation. So the first step is supervised fine-tuning on examples built from:
+- curated STEM tasks
+- Marimo examples and documentation patterns
+- Manim examples, guides, and reference snippets
+- generate and repair action templates
+The current target model is:
+```text
+unsloth/Ministral-3-3B-Instruct-2512-unsloth-bnb-4bit
+```
+The SFT adapter is here:
+[kgdrathan/ministral-3-3b-4bit-marimo-manim](https://huggingface.co/kgdrathan/ministral-3-3b-4bit-marimo-manim/)
+![SFT training curves](https://huggingface.co/kgdrathan/ministral-3-3b-4bit-marimo-manim/resolve/main/training_curves.png)
+## Links
+- Environment Space: [kgdrathan-explainer-env](https://kgdrathan-explainer-env.hf.space)
+- Dashboard Space: [kgdrathan-explainer-env-dashboard](https://kgdrathan-explainer-env-dashboard.hf.space/)
+- SFT adapter: [kgdrathan/ministral-3-3b-4bit-marimo-manim](https://huggingface.co/kgdrathan/ministral-3-3b-4bit-marimo-manim/)
+- Reward details: [explainer_env/rewards/README.md](explainer_env/rewards/README.md)