Spaces: kgdrathan/explainer-env

kgdrathan committed on Apr 26

Commit bdf789e (verified) · 1 Parent(s): b12f1bd

Upload folder using huggingface_hub


Files changed (4)

.gitattributes +2 -0
README.md +60 -73
assets/episode_flow.jpg +3 -0
assets/why-rag.jpg +3 -0

.gitattributes CHANGED

@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+assets/episode_flow.jpg filter=lfs diff=lfs merge=lfs -text
+assets/why-rag.jpg filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -9,7 +9,9 @@ pinned: false
 app_port: 8000
 base_path: /web
 tags:
-  - openenv
 # Research -> Interactive Explainer Environment
@@ -25,108 +27,93 @@ happen.
 So this environment trains a model to create interactive explanations instead of only
 writing paragraphs.
-Given a STEM topic, the agent:
 1. researches the topic,
 2. builds a [Marimo](https://marimo.io/) notebook or [Manim](https://www.manim.community/) animation,
 3. receives validation feedback,
 4. gets one chance to repair the artifact.
-```text
-research -> create -> verify -> repair
-```
-```mermaid
-flowchart TD
-    A[Topic assigned] --> B{Explore<br/>up to 3 times}
-    T[Research tools<br/>Wikipedia<br/>arXiv<br/>Semantic Scholar<br/>HF Papers<br/>HF Hub<br/>Docs] -.-> B
-    B --> C[Collected context]
-    C --> D[Generate code<br/>Marimo notebook or Manim animation]
-    D --> E{Lint / build<br/>passes?}
-    E -- yes --> F[Done]
-    E -- no --> G[Feedback returned<br/>errors + hints]
-    G --> H[Repair once]
-    H --> E
-```
-## Why This Environment Exists
-The goal is not just "make an LLM explain a concept."
-The goal is to make the model produce something a learner can see and interact with.
-Marimo gives us sliders, plots, tables, and reactive notebooks. Manim gives us visual
-math and step-by-step animations.
-This makes the environment a good fit for training models that teach through artifacts,
-not just fluent text.
-## What Was Challenging
-Reward design was the hardest part.
-If the reward is too vague, the model does not know what improved. If the reward is too
-easy to game, the model can collect points while still producing broken or useless
-artifacts.
-So the rewards are mostly verifiable:
-- Did the model return valid JSON?
-- Did it choose a useful research tool?
-- Did the search add new information?
-- Did the generated code parse?
-- Did `marimo check` pass?
-- Did Manim or Marimo actually run?
-- Did the repair fix the previous error?
-This keeps the training grounded. A broken notebook should not score well just because
-it mentions the right keywords.
-Training was the other challenge. A base model may know Python and may know a topic, but
-it does not automatically know this environment's workflow. It has to learn how to act:
-when to research, when to stop, how to emit actions, how to write Marimo/Manim code, and
-how to repair from feedback.
-## Why SFT First
-We use SFT as a warm start before RL.
-The SFT data includes the task bank, synthetic explore/generate/repair examples, and
-real Marimo and Manim examples. This teaches the model the shape of valid artifacts
-before GRPO starts optimizing rewards.
-In simple terms:
-- Pre-trained model: can talk about topics, but may not follow the environment.
-- SFT model: learns the expected action format and Marimo/Manim style.
-- RL model: should improve using environment rewards like successful execution and
-repair.
-## Current Status
-Completed:
-- OpenEnv server and client.
-- Explore -> generate -> repair episode flow.
-- Research tools for web/docs/papers/HF Hub.
-- Marimo and Manim validation.
-- Verifiable reward components.
-- SFT data preparation.
-- SFT and GRPO training scripts.
-Still remaining:
-- Run full SFT and GRPO training.
-- Add final pre-trained vs SFT vs RL reward comparisons.
-- Add reward curves and final model results.
-## Links
-- Environment: [kgdrathan-explainer-env.hf.space](https://kgdrathan-explainer-env.hf.space)
-- Design notes: [design.md](../design.md)
-- Reward details: [rewards/README.md](rewards/README.md)
-- SFT data script: [train/prepare_data.py](https://gitlab.com/kgdrathan/openenv-explainer/-/blob/main/train/prepare_data.py)
-- SFT training script: [train/sft_unsloth.py](https://gitlab.com/kgdrathan/openenv-explainer/-/blob/main/train/sft_unsloth.py)
-- RL training script: [train/grpo_unsloth.py](https://gitlab.com/kgdrathan/openenv-explainer/-/blob/main/train/grpo_unsloth.py)

 app_port: 8000
 base_path: /web
 tags:
+  - OpenEnv
+  - RL
+---
 # Research -> Interactive Explainer Environment
 So this environment trains a model to create interactive explanations instead of only
 writing paragraphs.
+Given a topic, the agent:
 1. researches the topic,
 2. builds a [Marimo](https://marimo.io/) notebook or [Manim](https://www.manim.community/) animation,
 3. receives validation feedback,
 4. gets one chance to repair the artifact.
+Expected Episode Flow:
+![Expected Episode Flow](./assets/episode_flow.jpg)
+## Rewarding Better
+The rewards are designed to be verifiable. We do not need an LLM judge.
+For every step, we first reward the basics:
+- Is the action valid JSON, with the required fields?
+- Is it the correct action for the current phase of the episode?
+Then each action gets its own reward logic.
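The two basic checks above (valid JSON with the required fields, and the right action for the current phase) could be sketched roughly as follows. This is only an illustration: the `action` field name, the phase names, the allowed-action table, and the score values are all assumptions, not taken from the repository.

```python
import json

# Hypothetical phase -> allowed actions table; the real environment may differ.
ALLOWED_ACTIONS = {
    "explore": {"explore", "generate"},
    "repair": {"repair"},
}
REQUIRED_FIELDS = {"action"}  # assumed minimal schema


def basic_reward(raw: str, phase: str) -> float:
    """Score the basics: 0.0 for malformed output, 0.5 for a valid
    action in the wrong phase, 1.0 when both checks pass."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return 0.0
    # The action must be a JSON object carrying every required field.
    if not isinstance(payload, dict) or not REQUIRED_FIELDS.issubset(payload):
        return 0.0
    action = payload["action"]
    if not isinstance(action, str):
        return 0.0
    if action not in ALLOWED_ACTIONS.get(phase, set()):
        return 0.5
    return 1.0
```

Keeping this check separate from the per-action logic means a malformed step can be scored without touching any tool or validator.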
+### `explore`
+- **Relevance**
+  - Is the tool relevant to the topic?
+  - Is the search query related to the topic?
+  - Do the retrieved sources contain useful content?
+  - Are the task keywords covered?
+- Does it avoid **repetitive**, near-duplicate searches?
+- Did it take too many **explore steps**?
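One simple way to detect repetitive searches is token overlap between the new query and earlier ones. The sketch below uses Jaccard similarity; the threshold and penalty values are hypothetical, and the actual reward code may use a different similarity measure.

```python
from typing import Iterable


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two queries."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def repetition_penalty(
    query: str,
    previous: Iterable[str],
    threshold: float = 0.6,   # assumed cut-off for "too similar"
    penalty: float = -0.5,    # assumed penalty value
) -> float:
    """Return a negative reward when the query nearly repeats an earlier one."""
    worst = max((jaccard(query, p) for p in previous), default=0.0)
    return penalty if worst >= threshold else 0.0
```

Word order is ignored here, so a reshuffled query still counts as a repeat.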
+### `generate`
+- Is the correct **format** selected: Marimo or Manim?
+- Does the artifact include the important topic **keywords**?
+- Does the code **parse**?
+  - Marimo: does `marimo check` pass?
+  - Manim: does the code define a valid scene structure?
+  - Can the artifact actually run/export/render?
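The parse and scene-structure checks can be approximated statically with Python's `ast` module, before any expensive render. The sketch below assumes that "valid scene structure" means a class whose base name ends in `Scene` and that defines `construct()`; the environment's real validator may be stricter.

```python
import ast


def parses(code: str) -> bool:
    """True if the generated code is syntactically valid Python."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return False
    return True


def has_manim_scene(code: str) -> bool:
    """True if some class inherits from a *Scene base and defines construct()."""
    try:
        tree = ast.parse(code)
    except (SyntaxError, ValueError):
        return False
    for node in ast.walk(tree):
        if not isinstance(node, ast.ClassDef):
            continue
        # Handle both `Scene` (ast.Name) and `manim.Scene` (ast.Attribute).
        base_names = [
            b.id if isinstance(b, ast.Name) else getattr(b, "attr", "")
            for b in node.bases
        ]
        defines_construct = any(
            isinstance(item, ast.FunctionDef) and item.name == "construct"
            for item in node.body
        )
        if defines_construct and any(n.endswith("Scene") for n in base_names):
            return True
    return False
```

Because nothing is imported or executed, this check is cheap and safe to run on untrusted model output; an actual run/export/render step would still be needed afterwards.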
+### `repair`
+- Is the lint/build error addressed?
+- Does the artifact pass validation?
+- Does it avoid repetition?
+> More details in [rewards/README.md](rewards/README.md)
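As a rough illustration of the repair checks, the sketch below compares the artifact and its errors before and after the repair step. The function name, the similarity cut-off, and the score values are hypothetical and are not taken from `rewards/README.md`.

```python
from difflib import SequenceMatcher
from typing import Sequence


def repair_reward(
    old_code: str,
    new_code: str,
    old_errors: Sequence[str],
    new_errors: Sequence[str],
    max_similarity: float = 0.98,  # assumed cut-off for "resubmitted as-is"
) -> float:
    """1.0 if validation now passes, 0.5 if some earlier error is gone,
    0.0 if nothing improved or the artifact was essentially resubmitted."""
    if SequenceMatcher(None, old_code, new_code).ratio() >= max_similarity:
        return 0.0  # repetition: the model did not really change the artifact
    if not new_errors:
        return 1.0  # passes validation
    fixed = set(old_errors) - set(new_errors)
    return 0.5 if fixed else 0.0
```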
+## Quirks
+### Why is RAG used here?
+We are training SLMs.<br>
+This is a long-horizon task.<br>
+It needs a lot of context, which comes from the exploration steps.<br>
+We use RAG to keep only the relevant context in the observation.<br>
+Embedding model: `bge-small-en-v1.5`
+![RAG](./assets/why-rag.jpg)
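The retrieval step can be pictured as: embed the query and every collected chunk, rank the chunks by cosine similarity, and keep only the top few in the observation. The sketch below uses a toy bag-of-words embedding so that it runs without downloads; in the actual setup the vectors would presumably come from `bge-small-en-v1.5`. The function names and the `k` default are assumptions.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in embedding (word counts). Not the real embedding model."""
    return dict(Counter(text.lower().split()))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(
    query: str,
    chunks: List[str],
    k: int = 2,
    embed: Callable[[str], Vector] = toy_embed,
) -> List[str]:
    """Return the k chunks most similar to the query, best first."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]
```

Passing the embedding function as a parameter keeps the ranking logic independent of the model, so the toy embedding can be swapped for a dense one without changing `top_k`.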
+### Selection of the SLM
+- The SLM needs a long context window.
+- It needs to be < 3B parameters.
+- It needs to be capable enough for the task.
+We selected `Mistral-3-3B`.
+### Why do we use SFT?
+Even 8B models do not reliably write correct Marimo/Manim code.<br>
+We extracted tutorial, example, and guide code from cloned Marimo and Manim repos.<br>
+We built training samples from that code.<br>
+We then ran SFT to align our SLM with the expected Marimo/Manim code style.<br>
+## Links
+SFT Code: [train/sft_unsloth.py](https://gitlab.com/kgdrathan/openenv-explainer/-/blob/main/train/sft_unsloth.py)
+RL GRPO Code: [train/grpo_unsloth.py](https://gitlab.com/kgdrathan/openenv-explainer/-/blob/main/train/grpo_unsloth.py)
+Dashboard for interacting with the environment: [explainer-env-dashboard](https://kgdrathan-explainer-env-dashboard.hf.space/)
+> The dashboard is for viewing logs and interacting with the environment.
+## Status
+Completed: Environment and SFT
+Remaining: RL GRPO training

assets/episode_flow.jpg ADDED

Git LFS Details

SHA256: 2cd20b6790652d92b4432a0863385e400992f3d4d06864d0dcb456472199fd47
Pointer size: 131 Bytes
Size of remote file: 451 kB

assets/why-rag.jpg ADDED

Git LFS Details

SHA256: f80333be6e9112e1640ae24a5cb9d5c2e22756ca539020759f1254ef9056c5d8
Pointer size: 131 Bytes
Size of remote file: 647 kB