Upload folder using huggingface_hub
- .gitattributes +2 -0
- README.md +60 -73
- assets/episode_flow.jpg +3 -0
- assets/why-rag.jpg +3 -0
.gitattributes
CHANGED
|
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
|
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
+
assets/episode_flow.jpg filter=lfs diff=lfs merge=lfs -text
|
| 37 |
+
assets/why-rag.jpg filter=lfs diff=lfs merge=lfs -text
|
README.md
CHANGED
|
@@ -9,7 +9,9 @@ pinned: false
|
|
| 9 |
app_port: 8000
|
| 10 |
base_path: /web
|
| 11 |
tags:
|
| 12 |
-
-
|
|
|
|
|
|
|
| 13 |
|
| 14 |
# Research -> Interactive Explainer Environment
|
| 15 |
|
|
@@ -25,108 +27,93 @@ happen.
|
|
| 25 |
So this environment trains a model to create interactive explanations instead of only
|
| 26 |
writing paragraphs.
|
| 27 |
|
| 28 |
-
Given a
|
| 29 |
|
| 30 |
1. researches the topic,
|
| 31 |
2. builds a [Marimo](https://marimo.io/) notebook or [Manim](https://www.manim.community/) animation,
|
| 32 |
3. receives validation feedback,
|
| 33 |
4. gets one chance to repair the artifact.
|
| 34 |
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
```
|
| 38 |
|
| 39 |
-
|
| 40 |
-
flowchart TD
|
| 41 |
-
A[Topic assigned] --> B{Explore<br/>up to 3 times}
|
| 42 |
-
T[Research tools<br/>Wikipedia<br/>arXiv<br/>Semantic Scholar<br/>HF Papers<br/>HF Hub<br/>Docs] -.-> B
|
| 43 |
-
B --> C[Collected context]
|
| 44 |
-
C --> D[Generate code<br/>Marimo notebook or Manim animation]
|
| 45 |
-
D --> E{Lint / build<br/>passes?}
|
| 46 |
-
E -- yes --> F[Done]
|
| 47 |
-
E -- no --> G[Feedback returned<br/>errors + hints]
|
| 48 |
-
G --> H[Repair once]
|
| 49 |
-
H --> E
|
| 50 |
-
```
|
| 51 |
|
|
|
|
| 52 |
|
|
|
|
| 53 |
|
| 54 |
-
|
|
|
|
| 55 |
|
| 56 |
-
|
| 57 |
|
| 58 |
-
|
| 59 |
-
Marimo gives us sliders, plots, tables, and reactive notebooks. Manim gives us visual
|
| 60 |
-
math and step-by-step animations.
|
| 61 |
|
| 62 |
-
|
| 63 |
-
|
| 64 |
|
| 65 |
-
##
|
| 66 |
|
| 67 |
-
|
| 68 |
|
| 69 |
-
|
| 70 |
-
easy to game, the model can collect points while still producing broken or useless
|
| 71 |
-
artifacts.
|
| 72 |
|
| 73 |
-
|
|
|
|
|
|
|
| 74 |
|
| 75 |
-
|
| 76 |
-
- Did it choose a useful research tool?
|
| 77 |
-
- Did the search add new information?
|
| 78 |
-
- Did the generated code parse?
|
| 79 |
-
- Did `marimo check` pass?
|
| 80 |
-
- Did Manim or Marimo actually run?
|
| 81 |
-
- Did the repair fix the previous error?
|
| 82 |
|
| 83 |
-
|
| 84 |
-
it mentions the right keywords.
|
| 85 |
|
| 86 |
-
|
| 87 |
-
it does not automatically know this environment's workflow. It has to learn how to act:
|
| 88 |
-
when to research, when to stop, how to emit actions, how to write Marimo/Manim code, and
|
| 89 |
-
how to repair from feedback.
|
| 90 |
|
| 91 |
-
|
|
|
|
|
|
|
| 92 |
|
| 93 |
-
|
|
|
|
| 94 |
|
| 95 |
-
|
| 96 |
-
real Marimo and Manim examples. This teaches the model the shape of valid artifacts
|
| 97 |
-
before GRPO starts optimizing rewards.
|
| 98 |
|
| 99 |
-
|
| 100 |
|
| 101 |
-
-
|
| 102 |
-
-
|
| 103 |
-
-
|
| 104 |
-
repair.
|
| 105 |
|
| 106 |
-
|
| 107 |
|
| 108 |
-
|
| 109 |
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
- Verifiable reward components.
|
| 115 |
-
- SFT data preparation.
|
| 116 |
-
- SFT and GRPO training scripts.
|
| 117 |
|
| 118 |
-
|
| 119 |
|
| 120 |
-
|
| 121 |
-
|
| 122 |
-
|
| 123 |
|
| 124 |
-
|
| 125 |
|
| 126 |
-
- Environment: [kgdrathan-explainer-env.hf.space](https://kgdrathan-explainer-env.hf.space)
|
| 127 |
-
- Design notes: [design.md](../design.md)
|
| 128 |
-
- Reward details: [rewards/README.md](rewards/README.md)
|
| 129 |
-
- SFT data script: [train/prepare_data.py](https://gitlab.com/kgdrathan/openenv-explainer/-/blob/main/train/prepare_data.py)
|
| 130 |
-
- SFT training script: [train/sft_unsloth.py](https://gitlab.com/kgdrathan/openenv-explainer/-/blob/main/train/sft_unsloth.py)
|
| 131 |
-
- RL training script: [train/grpo_unsloth.py](https://gitlab.com/kgdrathan/openenv-explainer/-/blob/main/train/grpo_unsloth.py)
|
| 132 |
|
|
|
|
| 9 |
app_port: 8000
|
| 10 |
base_path: /web
|
| 11 |
tags:
|
| 12 |
+
- OpenEnv
|
| 13 |
+
- RL
|
| 14 |
+
---
|
| 15 |
|
| 16 |
# Research -> Interactive Explainer Environment
|
| 17 |
|
|
|
|
| 27 |
So this environment trains a model to create interactive explanations instead of only
|
| 28 |
writing paragraphs.
|
| 29 |
|
| 30 |
+
Given a topic, the agent:
|
| 31 |
|
| 32 |
1. researches the topic,
|
| 33 |
2. builds a [Marimo](https://marimo.io/) notebook or [Manim](https://www.manim.community/) animation,
|
| 34 |
3. receives validation feedback,
|
| 35 |
4. gets one chance to repair the artifact.
|
| 36 |
|
| 37 |
+
Expected Episode Flow:
|
| 38 |
+

|
|
|
|
| 39 |
|
| 40 |
+
## Rewarding Better
|
| 41 |
|
| 42 |
+
The rewards are designed to be verifiable. We do not need an LLM judge.
|
| 43 |
|
| 44 |
+
For every step, we first reward the basics:
|
| 45 |
|
| 46 |
+
- Is the action valid JSON, with the expected fields?
|
| 47 |
+
- Is it the correct action for the current phase of the episode?
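The two basic checks above could be sketched roughly as follows. This is a minimal illustration, not the environment's actual code: the field names (`type`, `tool`, `query`, `format`, `code`), the phase names, and the 0.5 weights are all assumptions made for the example.

```python
import json

# Hypothetical schema: required fields per action type (assumed, not from the environment).
REQUIRED_FIELDS = {
    "explore": {"tool", "query"},
    "generate": {"format", "code"},
    "repair": {"code"},
}

# Hypothetical phase table: which action types are allowed in which phase.
ALLOWED_IN_PHASE = {
    "research": {"explore", "generate"},
    "repair": {"repair"},
}


def basic_reward(raw_action: str, phase: str) -> float:
    """Return 0.5 for a well-formed action, plus 0.5 if it fits the phase."""
    try:
        action = json.loads(raw_action)
    except json.JSONDecodeError:
        return 0.0  # not valid JSON at all
    if not isinstance(action, dict):
        return 0.0  # valid JSON, but not an object
    kind = action.get("type")
    if not isinstance(kind, str) or kind not in REQUIRED_FIELDS:
        return 0.0  # unknown or missing action type
    if not REQUIRED_FIELDS[kind].issubset(action.keys()):
        return 0.0  # a required field is missing
    score = 0.5
    if kind in ALLOWED_IN_PHASE.get(phase, set()):
        score += 0.5
    return score
```

Because both checks are deterministic, the same action always receives the same score, which is what keeps this part of the reward verifiable without an LLM judge.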
|
| 48 |
|
| 49 |
+
Then each action gets its own reward logic.
|
| 50 |
|
| 51 |
+
### `explore`
|
|
|
|
|
|
|
| 52 |
|
| 53 |
+
- **Relevance**
|
| 54 |
+
- Is the tool relevant to the topic?
|
| 55 |
+
- Is the search query related to the topic?
|
| 56 |
+
- Useful content in retrieved sources?
|
| 57 |
+
- Coverage of task keywords?
|
| 58 |
+
- Avoidance of similar **repetitive** searches?
|
| 59 |
+
- Too many **explore steps**?
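One simple way to detect repetitive searches is token overlap between the new query and earlier ones. The sketch below uses Jaccard similarity with an arbitrary threshold of 0.8. Both the metric and the threshold are assumptions for illustration; the environment may measure repetition differently.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase token sets of two queries."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_repetitive(query: str, previous: list[str], threshold: float = 0.8) -> bool:
    """True if the query overlaps heavily with any earlier query."""
    return any(jaccard(query, old) >= threshold for old in previous)
```

A reward function could then subtract a small penalty whenever `is_repetitive` returns `True` for an `explore` step.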
|
| 60 |
|
| 61 |
+
### `generate`
|
| 62 |
|
| 63 |
+
- Correct **format** selected: Marimo or Manim?
|
| 64 |
+
- Artifact includes the important topic **keywords**?
|
| 65 |
+
- Code **parses**?
|
| 66 |
+
- Marimo: `marimo check` pass?
|
| 67 |
+
- Manim: code defines a valid scene structure?
|
| 68 |
+
- Can the artifact actually run/export/render?
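The parse check and the Manim scene-structure check can both be done statically with Python's `ast` module, before anything is rendered. The sketch below is an assumption about how such checks might look, not the environment's implementation. It treats a scene as "valid" when a class inherits from a base whose name ends in `Scene` and defines `construct`, which is only a heuristic.

```python
import ast


def code_parses(source: str) -> bool:
    """True if the source is syntactically valid Python."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


def has_manim_scene(source: str) -> bool:
    """Heuristic: some class subclasses a *Scene base and defines construct()."""
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    for node in ast.walk(tree):
        if not isinstance(node, ast.ClassDef):
            continue
        # Handle both `Scene` (ast.Name) and `manim.Scene` (ast.Attribute) bases.
        base_names = [
            base.id if isinstance(base, ast.Name) else getattr(base, "attr", "")
            for base in node.bases
        ]
        is_scene = any(name.endswith("Scene") for name in base_names)
        has_construct = any(
            isinstance(item, ast.FunctionDef) and item.name == "construct"
            for item in node.body
        )
        if is_scene and has_construct:
            return True
    return False
```

Static checks like these are cheap, so they can run on every `generate` step, leaving the slower run/export/render check for artifacts that already pass.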
|
| 69 |
|
| 70 |
+
### `repair`
|
|
|
|
|
|
|
| 71 |
|
| 72 |
+
- Error (lint/build) addressed?
|
| 73 |
+
- Passes validation?
|
| 74 |
+
- Avoided repetition?
|
| 75 |
|
| 76 |
+
> More details in [rewards/README.md](rewards/README.md)
|
| 77 |
|
| 78 |
+
## Quirks
|
|
|
|
| 79 |
|
| 80 |
+
### Why is RAG done here?
|
|
|
|
|
|
|
|
|
|
| 81 |
|
| 82 |
+
We are training SLMs.<br>
|
| 83 |
+
This will be a long-horizon task.<br>
|
| 84 |
+
It needs a lot of context, which comes from the exploration steps.<br>
|
| 85 |
|
| 86 |
+
To keep only the relevant context in the observation, we retrieve it with an embedding model:<br>
|
| 87 |
+
Model: `bge-small-en-v1.5`
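The retrieval step can be pictured as ranking the collected text chunks by cosine similarity to a query and keeping the top few. The sketch below is a minimal illustration under stated assumptions: `top_k_chunks` and `toy_embed` are hypothetical names, and `toy_embed` is a bag-of-words stand-in so the example runs without downloads. In practice the `embed` callable would wrap an embedding model such as `bge-small-en-v1.5`.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_chunks(
    query: str,
    chunks: Sequence[str],
    embed: Callable[[str], Vector],
    k: int = 3,
) -> list[str]:
    """Return the k chunks most similar to the query under `embed`."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


def toy_embed(text: str) -> list[float]:
    """Stand-in embedding (word counts over a tiny vocabulary), for illustration only."""
    vocab = ["manim", "marimo", "scene", "slider"]
    words = text.lower().split()
    return [float(words.count(term)) for term in vocab]
```

For example, `top_k_chunks("manim scene", chunks, toy_embed, k=1)` keeps only the chunk that mentions Manim scenes, so the observation passed to the SLM stays short.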
|
| 88 |
|
| 89 |
+

|
|
|
|
|
|
|
| 90 |
|
| 91 |
+
### Selection of the SLM
|
| 92 |
|
| 93 |
+
- We need SLMs with long context
|
| 94 |
+
- We need it to be < 3B parameters
|
| 95 |
+
- We need it to be intelligent enough
|
|
|
|
| 96 |
|
| 97 |
+
We selected `Mistral-3-3B`.
|
| 98 |
|
| 99 |
+
### Why is SFT done in our case?
|
| 100 |
|
| 101 |
+
Even 8B models do not reliably write correct Marimo/Manim code.<br>
|
| 102 |
+
We extract tutorial, example, and guide code from cloned copies of the Marimo and Manim repos.<br>
|
| 103 |
+
From that code we created training samples.<br>
|
| 104 |
+
Then we ran SFT to align our SLM with the expected Marimo/Manim code style.<br>
|
| 105 |
|
| 106 |
+
## Links
|
| 107 |
|
| 108 |
+
SFT Code: [train/sft_unsloth.py](https://gitlab.com/kgdrathan/openenv-explainer/-/blob/main/train/sft_unsloth.py)
|
| 109 |
+
RL GRPO Code: [train/grpo_unsloth.py](https://gitlab.com/kgdrathan/openenv-explainer/-/blob/main/train/grpo_unsloth.py)
|
| 110 |
+
Dashboard for interacting with the environment: [explainer-env-dashboard](https://kgdrathan-explainer-env-dashboard.hf.space/)
|
| 111 |
|
| 112 |
+
> The dashboard is for looking at logs and interacting with the environment.
|
| 113 |
+
|
| 114 |
+
## Status
|
| 115 |
+
|
| 116 |
+
Completed: Environment and SFT
|
| 117 |
+
Remaining: RL GRPO training
|
| 118 |
|
| 119 |
|
assets/episode_flow.jpg
ADDED
|
Git LFS Details
|
assets/why-rag.jpg
ADDED
|
Git LFS Details
|