Update README.md
README.md
CHANGED
@@ -25,24 +25,15 @@ https://omrisapir.substack.com/publish/post/167273414
 
 ## Model Details
 - **Base model:** `Qwen/Qwen2.5-Math-1.5B`
-- **Method:** TreeRPO
+- **Method:** TreeRPO
 - **Reward signal:** Deterministic exact-match checker (binary). Interior node reward = average of descendant leaf rewards.
-- **No reference policy / KL:** β = 0 (stability from clipping + relative baseline)
-- **Data efficiency:** 5K SFT CoT examples + 5K RL prompts (vs. multi-million-scale baselines)
 - **Intended domain:** Grade-school & intermediate math word problems (GSM8K style)
 
 ## Intended Use
 Research on hierarchical RL for reasoning; math tutoring prototypes with human oversight; experimentation in deterministic pass/fail domains (e.g., potential extension to code with unit tests).
 
 **Not intended for:** Open-ended unsafe dialogue, factual QA outside math, high‑stakes decision making.
-
-## Training Summary
-| Phase | Data | Epochs | Notes |
-|-------|------|--------|-------|
-| SFT | 5K CoT examples (NuminaMath-CoT subset) | 1 | Standard causal LM fine-tune |
-| RL (TreeRPO) | 5K prompts (disjoint) | 1 | Max depth 7; typical branch factor 2 |
-
-Key hyperparameters: segment length threshold `L_min = 150`, entropy threshold over top‑20 logits `H_th = 1.0`, sampling (temp=0.6, top-p=0.85, top-k=25), PPO/GRPO clip ε=0.2, β=0. Trained on a single 48GB GPU (~18h RL phase).
+single 48GB GPU (~18h RL phase).
 
 ## Evaluation (GSM8K Test Set, 1,319 problems)
 
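The reward rule in the Model Details lines ("Interior node reward = average of descendant leaf rewards") can be illustrated with a small sketch. This is not the TreeRPO implementation: the `Node` class and the `leaf_rewards` and `node_reward` helpers are hypothetical names introduced here, and the only thing taken from the README is that leaves get a binary reward from the exact-match checker and interior nodes average over all leaves beneath them.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node (illustration only, not from the TreeRPO code)."""
    # Binary reward for a leaf: 1.0 if the exact-match checker passed, else 0.0.
    leaf_reward: Optional[float] = None
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node: Node) -> List[float]:
    """Collect the rewards of every leaf at or below this node."""
    if not node.children:
        if node.leaf_reward is None:
            raise ValueError("leaf node has no reward")
        return [node.leaf_reward]
    rewards: List[float] = []
    for child in node.children:
        rewards.extend(leaf_rewards(child))
    return rewards


def node_reward(node: Node) -> float:
    """Reward of a node = mean over all of its descendant leaf rewards."""
    rewards = leaf_rewards(node)
    return sum(rewards) / len(rewards)


# Toy tree: the root has two children. The first branches into a correct
# and an incorrect leaf; the second is itself a correct leaf.
root = Node(children=[
    Node(children=[Node(leaf_reward=1.0), Node(leaf_reward=0.0)]),
    Node(leaf_reward=1.0),
])

print(node_reward(root.children[0]))  # mean of [1.0, 0.0]
print(node_reward(root))              # mean of [1.0, 0.0, 1.0]
```

Note that the sketch averages over leaves directly rather than averaging the children's rewards. In unbalanced trees the two differ: here the leaf average at the root is 2/3, while the mean of the two child rewards would be 0.75.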