omrisap/Qwen2.5-Math-1.5B-TreeRPO

Text Generation

reinforcement-learning

text-generation-inference


omrisap committed on Jul 20, 2025

Commit 6d87d1b · verified · 1 Parent(s): f796f74

Update README.md

Files changed (1)

README.md +2 -11

README.md CHANGED

@@ -25,24 +25,15 @@ https://omrisapir.substack.com/publish/post/167273414
 ## Model Details
 - **Base model:** `Qwen/Qwen2.5-Math-1.5B`
-- **Method:** TreeRPO (tree-structured GRPO: depth ≤ 7, branching by entropy + length)
 - **Reward signal:** Deterministic exact-match checker (binary). Interior node reward = average of descendant leaf rewards.
-- **No reference policy / KL:** β = 0 (stability from clipping + relative baseline)
-- **Data efficiency:** 5K SFT CoT examples + 5K RL prompts (vs. multi-million-scale baselines)
 - **Intended domain:** Grade-school & intermediate math word problems (GSM8K style)
 ## Intended Use
 Research on hierarchical RL for reasoning; math tutoring prototypes with human oversight; experimentation in deterministic pass/fail domains (e.g., potential extension to code with unit tests).
 **Not intended for:** Open-ended unsafe dialogue, factual QA outside math, high‑stakes decision making.
-## Training Summary
-| Phase | Data | Epochs | Notes |
-|-------|------|--------|-------|
-| SFT | 5K CoT examples (NuminaMath-CoT subset) | 1 | Standard causal LM fine-tune |
-| RL (TreeRPO) | 5K prompts (disjoint) | 1 | Max depth 7; typical branch factor 2 |
-Key hyperparameters: segment length threshold `L_min = 150`, entropy threshold over top‑20 logits `H_th = 1.0`, sampling (temp=0.6, top-p=0.85, top-k=25), PPO/GRPO clip ε=0.2, β=0. Trained on a single 48GB GPU (~18h RL phase).
 ## Evaluation (GSM8K Test Set, 1,319 problems)

 ## Model Details
 - **Base model:** `Qwen/Qwen2.5-Math-1.5B`
+- **Method:** TreeRPO
 - **Reward signal:** Deterministic exact-match checker (binary). Interior node reward = average of descendant leaf rewards.
 - **Intended domain:** Grade-school & intermediate math word problems (GSM8K style)
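
The reward rule in the list above (binary exact-match at the leaves, interior node reward equal to the average of descendant leaf rewards) can be illustrated with a minimal sketch. The `Node` class, the whitespace-trimmed string comparison, and the function names are hypothetical and chosen only for illustration; the actual TreeRPO checker and tree representation are not shown in this README and may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node: a leaf holds a final answer, an interior node holds children."""
    answer: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node: Node, gold: str) -> List[float]:
    """Collect the binary exact-match reward of every leaf under `node`."""
    if not node.children:
        # Assumed checker: exact string match after trimming whitespace.
        return [1.0 if (node.answer or "").strip() == gold.strip() else 0.0]
    rewards: List[float] = []
    for child in node.children:
        rewards.extend(leaf_rewards(child, gold))
    return rewards


def node_reward(node: Node, gold: str) -> float:
    """Reward of a node: mean reward over all of its descendant leaves."""
    rewards = leaf_rewards(node, gold)
    return sum(rewards) / len(rewards)


# Toy tree: one branch with two leaves (one correct), plus one correct leaf.
branch = Node(children=[Node(answer="42"), Node(answer="41")])
root = Node(children=[branch, Node(answer="42")])
print(node_reward(branch, "42"), node_reward(root, "42"))
```

In this sketch the average is taken over all descendant leaves directly, so a subtree with more leaves contributes more to its ancestors' rewards than a single-leaf sibling. Averaging per child instead would be a different design and would give different values on unbalanced trees.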
 ## Intended Use
 Research on hierarchical RL for reasoning; math tutoring prototypes with human oversight; experimentation in deterministic pass/fail domains (e.g., potential extension to code with unit tests).
 **Not intended for:** Open-ended unsafe dialogue, factual QA outside math, high‑stakes decision making.
+ single 48GB GPU (~18h RL phase).
 ## Evaluation (GSM8K Test Set, 1,319 problems)