kgdrathan committed on
Commit
bdf789e
·
verified ·
1 Parent(s): b12f1bd

Upload folder using huggingface_hub

Files changed (4)
  1. .gitattributes +2 -0
  2. README.md +60 -73
  3. assets/episode_flow.jpg +3 -0
  4. assets/why-rag.jpg +3 -0
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
 
 
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ assets/episode_flow.jpg filter=lfs diff=lfs merge=lfs -text
37
+ assets/why-rag.jpg filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -9,7 +9,9 @@ pinned: false
9
  app_port: 8000
10
  base_path: /web
11
  tags:
12
- - openenv
 
 
13
 
14
  # Research -> Interactive Explainer Environment
15
 
@@ -25,108 +27,93 @@ happen.
25
  So this environment trains a model to create interactive explanations instead of only
26
  writing paragraphs.
27
 
28
- Given a STEM topic, the agent:
29
 
30
  1. researches the topic,
31
  2. builds a [Marimo](https://marimo.io/) notebook or [Manim](https://www.manim.community/) animation,
32
  3. receives validation feedback,
33
  4. gets one chance to repair the artifact.
34
 
35
- ```text
36
- research -> create -> verify -> repair
37
- ```
38
 
39
- ```mermaid
40
- flowchart TD
41
- A[Topic assigned] --> B{Explore<br/>up to 3 times}
42
- T[Research tools<br/>Wikipedia<br/>arXiv<br/>Semantic Scholar<br/>HF Papers<br/>HF Hub<br/>Docs] -.-> B
43
- B --> C[Collected context]
44
- C --> D[Generate code<br/>Marimo notebook or Manim animation]
45
- D --> E{Lint / build<br/>passes?}
46
- E -- yes --> F[Done]
47
- E -- no --> G[Feedback returned<br/>errors + hints]
48
- G --> H[Repair once]
49
- H --> E
50
- ```
51
 
 
52
 
 
53
 
54
- ## Why This Environment Exists
 
55
 
56
- The goal is not just "make an LLM explain a concept."
57
 
58
- The goal is to make the model produce something a learner can see and interact with.
59
- Marimo gives us sliders, plots, tables, and reactive notebooks. Manim gives us visual
60
- math and step-by-step animations.
61
 
62
- This makes the environment a good fit for training models that teach through artifacts,
63
- not just fluent text.
 
 
 
 
 
64
 
65
- ## What Was Challenging
66
 
67
- Reward design was the hardest part.
 
 
 
 
 
68
 
69
- If the reward is too vague, the model does not know what improved. If the reward is too
70
- easy to game, the model can collect points while still producing broken or useless
71
- artifacts.
72
 
73
- So the rewards are mostly verifiable:
 
 
74
 
75
- - Did the model return valid JSON?
76
- - Did it choose a useful research tool?
77
- - Did the search add new information?
78
- - Did the generated code parse?
79
- - Did `marimo check` pass?
80
- - Did Manim or Marimo actually run?
81
- - Did the repair fix the previous error?
82
 
83
- This keeps the training grounded. A broken notebook should not score well just because
84
- it mentions the right keywords.
85
 
86
- Training was the other challenge. A base model may know Python and may know a topic, but
87
- it does not automatically know this environment's workflow. It has to learn how to act:
88
- when to research, when to stop, how to emit actions, how to write Marimo/Manim code, and
89
- how to repair from feedback.
90
 
91
- ## Why SFT First
 
 
92
 
93
- We use SFT as a warm start before RL.
 
94
 
95
- The SFT data includes the task bank, synthetic explore/generate/repair examples, and
96
- real Marimo and Manim examples. This teaches the model the shape of valid artifacts
97
- before GRPO starts optimizing rewards.
98
 
99
- In simple terms:
100
 
101
- - Pre-trained model: can talk about topics, but may not follow the environment.
102
- - SFT model: learns the expected action format and Marimo/Manim style.
103
- - RL model: should improve using environment rewards like successful execution and
104
- repair.
105
 
106
- ## Current Status
107
 
108
- Completed:
109
 
110
- - OpenEnv server and client.
111
- - Explore -> generate -> repair episode flow.
112
- - Research tools for web/docs/papers/HF Hub.
113
- - Marimo and Manim validation.
114
- - Verifiable reward components.
115
- - SFT data preparation.
116
- - SFT and GRPO training scripts.
117
 
118
- Still remaining:
119
 
120
- - Run full SFT and GRPO training.
121
- - Add final pre-trained vs SFT vs RL reward comparisons.
122
- - Add reward curves and final model results.
123
 
124
- ## Links
 
 
 
 
 
125
 
126
- - Environment: [kgdrathan-explainer-env.hf.space](https://kgdrathan-explainer-env.hf.space)
127
- - Design notes: [design.md](../design.md)
128
- - Reward details: [rewards/README.md](rewards/README.md)
129
- - SFT data script: [train/prepare_data.py](https://gitlab.com/kgdrathan/openenv-explainer/-/blob/main/train/prepare_data.py)
130
- - SFT training script: [train/sft_unsloth.py](https://gitlab.com/kgdrathan/openenv-explainer/-/blob/main/train/sft_unsloth.py)
131
- - RL training script: [train/grpo_unsloth.py](https://gitlab.com/kgdrathan/openenv-explainer/-/blob/main/train/grpo_unsloth.py)
132
 
 
9
  app_port: 8000
10
  base_path: /web
11
  tags:
12
+ - OpenEnv
13
+ - RL
14
+ ---
15
 
16
  # Research -> Interactive Explainer Environment
17
 
 
27
  So this environment trains a model to create interactive explanations instead of only
28
  writing paragraphs.
29
 
30
+ Given a topic, the agent:
31
 
32
  1. researches the topic,
33
  2. builds a [Marimo](https://marimo.io/) notebook or [Manim](https://www.manim.community/) animation,
34
  3. receives validation feedback,
35
  4. gets one chance to repair the artifact.
36
 
37
+ Expected Episode Flow:
38
+ ![Expected Episode Flow](./assets/episode_flow.jpg)
 
39
 
40
+ ## Rewarding Better
 
 
 
 
 
 
 
 
 
 
 
41
 
42
+ The rewards are designed to be verifiable. We do not need an LLM judge.
43
 
44
+ For every step, we first reward the basics:
45
 
46
+ - Is the action valid JSON, including its fields?
47
+ - Is it the correct action for the current phase of the episode?
48
 
49
+ Then each action gets its own reward logic.
50
 
51
+ ### `explore`
 
 
52
 
53
+ - **Relevance**
54
+ - Is the tool relevant to the topic?
55
+ - Is the search query related to the topic?
56
+ - Useful content in retrieved sources?
57
+ - Coverage of task keywords?
58
+ - Avoidance of similar **repetitive** searches?
59
+ - Too many **explore steps**?
60
 
61
+ ### `generate`
62
 
63
+ - Correct **format** selected: Marimo or Manim?
64
+ - Artifact includes the important topic **keywords**?
65
+ - Code **parses**?
66
+ - Marimo: `marimo check` pass?
67
+ - Manim: code defines a valid scene structure?
68
+ - Can the artifact actually run/export/render?
69
 
70
+ ### `repair`
 
 
71
 
72
+ - Error (lint/build) addressed?
73
+ - Passes validation?
74
+ - Avoided repetition?
75
 
76
+ > More details in [rewards/README.md](rewards/README.md)
 
 
 
 
 
 
77
 
78
+ ## Quirks
 
79
 
80
+ ### Why is RAG used here?
 
 
 
81
 
82
+ We are training SLMs.<br>
83
+ This is a long-horizon task.<br>
84
+ It needs a lot of context, which comes from the exploration steps.<br>
85
 
86
+ To keep only the relevant context in the observation, we use retrieval with an embedding model:<br>
87
+ Model: `bge-small-en-v1.5`
88
 
89
+ ![RAG](./assets/why-rag.jpg)
 
 
90
 
91
+ ### Selection of the SLM
92
 
93
+ - We need SLMs with long context
94
+ - We need it to be < 3B parameters
95
+ - We need it to be capable enough for the task
 
96
 
97
+ We selected `Mistral-3-3B`.
98
 
99
+ ### Why do we use SFT?
100
 
101
+ Even 8B models do not reliably write correct Marimo/Manim code.<br>
102
+ We extract tutorial, example, and guide code from the cloned Marimo and Manim repos,<br>
103
+ create training samples from it,<br>
104
+ and run SFT to align our SLM with the expected Marimo/Manim code style.<br>
 
 
 
105
 
106
+ ## Links
107
 
108
+ SFT Code: [train/sft_unsloth.py](https://gitlab.com/kgdrathan/openenv-explainer/-/blob/main/train/sft_unsloth.py)
109
+ RL GRPO Code: [train/grpo_unsloth.py](https://gitlab.com/kgdrathan/openenv-explainer/-/blob/main/train/grpo_unsloth.py)
110
+ Dashboard for interacting with the environment: [explainer-env-dashboard](https://kgdrathan-explainer-env-dashboard.hf.space/)
111
 
112
+ > The dashboard is for viewing logs and interacting with the environment.
113
+
114
+ ## Status
115
+
116
+ Completed: Environment and SFT
117
+ Remaining: RL GRPO training
118
 
 
 
 
 
 
 
119
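Review note on the "Rewarding Better" section added in this diff: the two per-step "basics" checks (valid JSON with its fields, and the correct action for the current phase) could be sketched roughly as below. This is only an illustration, not the code in `rewards/`. The field names, the `type` key, the phase-to-action mapping, and the 0.5 weights are all assumptions made up for the example.

```python
import json

# Hypothetical required fields per action type (not taken from the repo).
REQUIRED_FIELDS = {
    "explore": ("tool", "query"),
    "generate": ("format", "code"),
    "repair": ("code",),
}

# Hypothetical mapping of episode phase -> action types allowed in that phase.
ALLOWED_ACTIONS = {
    "explore": {"explore", "generate"},  # the agent may stop exploring and generate
    "generate": {"generate"},
    "repair": {"repair"},
}


def basic_reward(raw_action: str, phase: str) -> float:
    """Score the per-step basics: parseable JSON, required fields, right phase."""
    try:
        action = json.loads(raw_action)
    except json.JSONDecodeError:
        return 0.0  # not valid JSON at all
    if not isinstance(action, dict):
        return 0.0  # valid JSON, but not an action object

    kind = action.get("type")
    if not isinstance(kind, str) or kind not in REQUIRED_FIELDS:
        return 0.0  # unknown action type

    score = 0.0
    if all(field in action for field in REQUIRED_FIELDS[kind]):
        score += 0.5  # all expected fields are present
    if kind in ALLOWED_ACTIONS.get(phase, set()):
        score += 0.5  # action is allowed in the current phase
    return score
```

Because both checks are deterministic, the same action always gets the same score, which is what makes this part of the reward verifiable without an LLM judge.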
 
assets/episode_flow.jpg ADDED

Git LFS Details

  • SHA256: 2cd20b6790652d92b4432a0863385e400992f3d4d06864d0dcb456472199fd47
  • Pointer size: 131 Bytes
  • Size of remote file: 451 kB
assets/why-rag.jpg ADDED

Git LFS Details

  • SHA256: f80333be6e9112e1640ae24a5cb9d5c2e22756ca539020759f1254ef9056c5d8
  • Pointer size: 131 Bytes
  • Size of remote file: 647 kB
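Review note on the "Why RAG" section and `assets/why-rag.jpg`: the idea of keeping only relevant exploration context in the observation can be illustrated with a small ranking sketch. The README names `bge-small-en-v1.5` as the model; the bag-of-words `embed` function here is a stand-in assumption so the example runs without downloading a model, and `select_context` and `top_k` are hypothetical names, not part of the repo.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_context(topic: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the topic, best first."""
    query = embed(topic)
    ranked = sorted(chunks, key=lambda c: cosine(query, embed(c)), reverse=True)
    return ranked[:top_k]


chunks = [
    "The Fourier transform decomposes a signal into frequencies.",
    "Bananas are rich in potassium.",
    "A fast Fourier transform computes the discrete Fourier transform.",
]
selected = select_context("Fourier transform", chunks, top_k=2)
```

With a real embedding model, only `embed` would change; the ranking and truncation that keep the observation short for a small model would stay the same.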