omrisap committed on
Commit 6d87d1b · verified
1 Parent(s): f796f74

Update README.md

Files changed (1)
  1. README.md +2 -11
README.md CHANGED
@@ -25,24 +25,15 @@ https://omrisapir.substack.com/publish/post/167273414
25
 
26
  ## Model Details
27
  - **Base model:** `Qwen/Qwen2.5-Math-1.5B`
28
- - **Method:** TreeRPO (tree-structured GRPO: depth ≤ 7, branching by entropy + length)
29
  - **Reward signal:** Deterministic exact-match checker (binary). Interior node reward = average of descendant leaf rewards.
30
- - **No reference policy / KL:** β = 0 (stability from clipping + relative baseline)
31
- - **Data efficiency:** 5K SFT CoT examples + 5K RL prompts (vs. multi-million-scale baselines)
32
  - **Intended domain:** Grade-school & intermediate math word problems (GSM8K style)
33
 
34
  ## Intended Use
35
  Research on hierarchical RL for reasoning; math tutoring prototypes with human oversight; experimentation in deterministic pass/fail domains (e.g., potential extension to code with unit tests).
36
 
37
  **Not intended for:** Open-ended unsafe dialogue, factual QA outside math, high‑stakes decision making.
38
-
39
- ## Training Summary
40
- | Phase | Data | Epochs | Notes |
41
- |-------|------|--------|-------|
42
- | SFT | 5K CoT examples (NuminaMath-CoT subset) | 1 | Standard causal LM fine-tune |
43
- | RL (TreeRPO) | 5K prompts (disjoint) | 1 | Max depth 7; typical branch factor 2 |
44
-
45
- Key hyperparameters: segment length threshold `L_min = 150`, entropy threshold over top‑20 logits `H_th = 1.0`, sampling (temp=0.6, top-p=0.85, top-k=25), PPO/GRPO clip ε=0.2, β=0. Trained on a single 48GB GPU (~18h RL phase).
46
 
47
  ## Evaluation (GSM8K Test Set, 1,319 problems)
48
 
 
25
 
26
  ## Model Details
27
  - **Base model:** `Qwen/Qwen2.5-Math-1.5B`
28
+ - **Method:** TreeRPO
29
  - **Reward signal:** Deterministic exact-match checker (binary). Interior node reward = average of descendant leaf rewards.
30
  - **Intended domain:** Grade-school & intermediate math word problems (GSM8K style)
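
The reward rule in the **Reward signal** bullet (binary exact match at the leaves, interior node reward = average of descendant leaf rewards) can be illustrated with a small sketch. This is not the author's training code: the `Node` class, `exact_match`, and `assign_rewards` are hypothetical names introduced here, and plain string equality after whitespace stripping is an assumed stand-in for the actual checker.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node: leaves carry a final answer, interior nodes carry children."""
    answer: Optional[str] = None
    children: List["Node"] = field(default_factory=list)
    reward: float = 0.0


def exact_match(pred: str, gold: str) -> float:
    """Binary reward: 1.0 on exact match after stripping whitespace (assumed normalization)."""
    return 1.0 if pred.strip() == gold.strip() else 0.0


def assign_rewards(node: Node, gold: str) -> List[float]:
    """Set node.reward for the whole subtree and return the leaf rewards below node."""
    if not node.children:
        node.reward = exact_match(node.answer or "", gold)
        return [node.reward]
    leaf_rewards: List[float] = []
    for child in node.children:
        leaf_rewards.extend(assign_rewards(child, gold))
    # Interior reward is the mean over all descendant leaves,
    # not the mean of the children's own rewards.
    node.reward = sum(leaf_rewards) / len(leaf_rewards)
    return leaf_rewards


# Toy tree: one branch with two leaves (one correct), one branch that is a single correct leaf.
branch_a = Node(children=[Node(answer="42"), Node(answer="41")])
branch_b = Node(answer="42")
root = Node(children=[branch_a, branch_b])
assign_rewards(root, gold="42")
```

In this toy tree `branch_a` has one correct leaf out of two and the root has two correct leaves out of three. Averaging over leaves rather than over children matters when the tree is unbalanced: a mean of the two child rewards here would give 0.75 instead of 2/3.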
31
 
32
  ## Intended Use
33
  Research on hierarchical RL for reasoning; math tutoring prototypes with human oversight; experimentation in deterministic pass/fail domains (e.g., potential extension to code with unit tests).
34
 
35
  **Not intended for:** Open-ended unsafe dialogue, factual QA outside math, high‑stakes decision making.
36
+ single 48GB GPU (~18h RL phase).
37
 
38
  ## Evaluation (GSM8K Test Set, 1,319 problems)
39