Seungjun committed on
Commit
37abb6f
·
verified ·
1 Parent(s): 501f285

Update README.md

Files changed (1)
  1. README.md +8 -24
README.md CHANGED
@@ -44,7 +44,7 @@ Produce a LoRA adapter (rank ≤ 32) on Nemotron-3-Nano-30B that maximizes accur
44
  - Generate **~100K synthetic prompts total**
45
  - Match train.csv's category distribution
46
  - Apply floor of ~5K per category (ensure coverage even for rare categories)
47
- - Mix easy/medium/hard within each category
48
 
49
  ### Step 1c: Distill reasoning traces via API (only paid step in the whole pipeline)
50
  - Use SOTA model (Claude / GPT-4 / etc.) to generate reasoning traces for each prompt
@@ -87,8 +87,8 @@ Extract `\boxed{}` from rollout → 1 if matches ground truth (exact or numerica
87
 
88
  **Round 1 DAPO**
89
  - Prompt pool: **~30K–50K prompts**
90
- - ~40% overlap with SFT prompts (tests generalization)
91
- - ~60% fresh synthetic prompts
92
  - Only prompts + ground truth needed - rollouts generated live
93
  - Evaluate on train.csv → save score + LoRA checkpoint
94
 
@@ -102,18 +102,18 @@ Extract `\boxed{}` from rollout → 1 if matches ground truth (exact or numerica
102
 
103
  **Round N SFT**
104
  - Prompt pool for self-distillation: **~150K prompts**
105
- - ~70K: reuse previous round's SFT prompts (regenerate with stronger model)
106
- - ~80K: fresh synthetic prompts for diversity
107
  - Generate 4–8 candidate traces per prompt (temperature ~0.7)
108
  - Filter: keep traces whose `\boxed{}` matches ground truth
109
- - Expected yield: 85–95% (model is much stronger now)
110
  - Keep 1–2 best traces per prompt (shortest correct, or diverse)
111
- - Mix in ~25K rows from previous round's SFT data (prevents drift / mode collapse)
112
  - **Final training set: ~175K rows**
113
  - Evaluate on train.csv → save score + LoRA checkpoint
114
 
115
  **Round N DAPO**
116
- - Prompt pool: ~30K–50K, skewed harder
117
  - Goldilocks filter: keep prompts where current SFT model succeeds 20–80% of the time
118
  - Drop prompts the model aces (no signal)
119
  - Drop prompts the model always fails (no signal)
@@ -148,22 +148,6 @@ Realistic target: **2–3 full rounds** within the 2-month window.
148
 
149
  **Key insight:** API is used exactly once (Phase 1 SFT). Everything after that uses Python scripts for prompts and the previous round's model for traces.
150
 
151
-
152
-
153
- SFT - Match Train.csv Distribution
154
- For SFT (both Phase 1 and Phase 2+), mirror train.csv's difficulty mix.
155
- Reason: SFT teaches the model what reasoning looks like across the full distribution it'll encounter at test time. If the hidden test set mirrors train.csv (very likely), you want your model fluent at every difficulty level it'll see.
156
- Risk of over-weighting hard problems in SFT: model gets sophisticated reasoning but fumbles easy problems due to underexposure.
157
- Risk of over-weighting easy problems: model never sees complex reasoning patterns.
158
- Default: match train.csv.
159
- DAPO - Skew Harder, Shift Each Round
160
- DAPO is different. It only learns from prompts in the Goldilocks zone (20–80% success rate). Prompts too easy or too hard produce no gradient signal.
161
- As your model improves across rounds, the Goldilocks zone shifts toward harder problems because what was hard becomes medium, what was medium becomes easy.
162
- Concrete distribution per round:
163
- Round | Easy | Medium | Hard || Phase 1 DAPO | 20% | 50% | 30% || Phase 2 DAPO | 10% | 40% | 50% || Phase 3 DAPO | 5% | 30% | 65%
164
-
165
- ---
166
-
167
  ## Hyperparameters Quick Reference
168
 
169
  | Setting | Value |
 
44
  - Generate **~100K synthetic prompts total**
45
  - Match train.csv's category distribution
46
  - Apply floor of ~5K per category (ensure coverage even for rare categories)
47
+ - Mix easy/medium/hard within each category - this should also follow the difficulty distribution of the given train.csv
48
 
49
  ### Step 1c: Distill reasoning traces via API (only paid step in the whole pipeline)
50
  - Use SOTA model (Claude / GPT-4 / etc.) to generate reasoning traces for each prompt
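
The allocation rule in the bullets above (match train.csv's category distribution, then apply a per-category floor) can be sketched as follows. This is a minimal illustration, not the author's script: the function name `allocate_prompts` and its parameters are hypothetical, and the floor is applied after proportional rounding, so the total can land slightly above the ~100K target.

```python
from collections import Counter

def allocate_prompts(train_categories, total=100_000, floor=5_000):
    """Split `total` prompts across categories in proportion to their
    frequency in train.csv, then raise any category below `floor`.

    `train_categories` is assumed to be one category label per row.
    """
    counts = Counter(train_categories)
    n_rows = sum(counts.values())
    # Proportional share per category, rounded to whole prompts.
    quotas = {cat: round(total * c / n_rows) for cat, c in counts.items()}
    # The floor may push the sum above `total`; the plan only says "~100K".
    return {cat: max(q, floor) for cat, q in quotas.items()}
```

For example, with a 90/9/1 split across three categories, the rarest category is lifted from 1,000 to 5,000 prompts while the other two keep their proportional share.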
 
87
 
88
  **Round 1 DAPO**
89
  - Prompt pool: **~30K–50K prompts**
90
+ - ~40% overlap with SFT prompts (tests generalization) - still wondering what the difficulty distribution of these problems should be
91
+ - ~60% fresh synthetic prompts - still wondering what the difficulty distribution of these problems should be
92
  - Only prompts + ground truth needed - rollouts generated live
93
  - Evaluate on train.csv → save score + LoRA checkpoint
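
One way to assemble the Round 1 DAPO pool from the ~40% / ~60% split described above is sketched here. The helper `build_dapo_pool`, its default pool size of 40K (a midpoint of the stated range), and the fixed seed are all assumptions for illustration. The open question about difficulty distribution is not addressed by this sketch, which samples uniformly.

```python
import random

def build_dapo_pool(sft_prompts, fresh_prompts, pool_size=40_000,
                    overlap_frac=0.4, seed=0):
    """Sample a DAPO prompt pool: `overlap_frac` from SFT prompts,
    the remainder from fresh synthetic prompts (uniform sampling)."""
    rng = random.Random(seed)
    n_overlap = round(pool_size * overlap_frac)
    n_fresh = pool_size - n_overlap
    if n_overlap > len(sft_prompts) or n_fresh > len(fresh_prompts):
        raise ValueError("not enough prompts to fill the requested pool")
    # Sample without replacement from each source, then interleave.
    pool = rng.sample(sft_prompts, n_overlap) + rng.sample(fresh_prompts, n_fresh)
    rng.shuffle(pool)
    return pool
```

Both inputs are expected to be lists, since `random.sample` requires a sequence.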
94
 
 
102
 
103
  **Round N SFT**
104
  - Prompt pool for self-distillation: **~150K prompts**
105
+ - ~70K: reuse previous round's SFT prompts (regenerate with stronger model) - still wondering what the difficulty distribution of these problems should be
106
+ - ~80K: fresh synthetic prompts for diversity - still wondering what the difficulty distribution of these problems should be
107
  - Generate 4–8 candidate traces per prompt (temperature ~0.7)
108
  - Filter: keep traces whose `\boxed{}` matches ground truth
109
+ - Expected yield: 85–95% (model is much stronger now) - still wondering how many problem sets we should target to get here
110
  - Keep 1–2 best traces per prompt (shortest correct, or diverse)
111
+ - Mix in ~25K rows from previous round's SFT data (prevents drift / mode collapse) - still wondering what the difficulty distribution of these problems should be
112
  - **Final training set: ~175K rows**
113
  - Evaluate on train.csv → save score + LoRA checkpoint
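
The filter-and-keep steps in the Round N SFT list above might look like the sketch below. It covers only the "shortest correct" selection with exact string matching; the numerical-match and "diverse" variants mentioned in the README are not covered. `extract_boxed` and `select_traces` are hypothetical names, and taking the last `\boxed{}` in a trace is an assumption.

```python
def extract_boxed(text):
    # Return the contents of the last \boxed{...} in `text`, or None.
    # Braces are counted so nested groups such as \frac{1}{2} survive.
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth, chars = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces

def select_traces(traces, ground_truth, keep=2):
    """Keep up to `keep` traces whose boxed answer equals the ground
    truth exactly, preferring the shortest ones."""
    target = ground_truth.strip()
    correct = [t for t in traces if extract_boxed(t) == target]
    return sorted(correct, key=len)[:keep]
```

Running `select_traces` over the 4–8 candidates for each prompt yields the 1–2 traces that go into the training set. Prompts with no correct candidate return an empty list and drop out.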
114
 
115
  **Round N DAPO**
116
+ - Prompt pool: ~30K–50K, skewed harder - still wondering what the difficulty distribution of these problems should be
117
  - Goldilocks filter: keep prompts where current SFT model succeeds 20–80% of the time
118
  - Drop prompts the model aces (no signal)
119
  - Drop prompts the model always fails (no signal)
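
A minimal sketch of the Goldilocks filter described above, assuming each prompt already has a list of 0/1 rollout outcomes from the current SFT model. The name `goldilocks_filter` and the inclusive treatment of the 20% and 80% bounds are assumptions, since the README does not say whether the endpoints are kept.

```python
def goldilocks_filter(rollout_outcomes, low=0.2, high=0.8):
    """Return prompts whose empirical success rate lies in [low, high].

    `rollout_outcomes` maps each prompt to a list of 0/1 results,
    one per sampled rollout.
    """
    kept = []
    for prompt, outcomes in rollout_outcomes.items():
        if not outcomes:
            continue  # no rollouts, so no estimate of difficulty
        rate = sum(outcomes) / len(outcomes)
        # Always-solved and never-solved prompts give no learning signal.
        if low <= rate <= high:
            kept.append(prompt)
    return kept
```

With only a handful of rollouts per prompt the success rate is a coarse estimate, so prompts near the bounds may move in or out of the pool between runs.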
 
148
 
149
  **Key insight:** API is used exactly once (Phase 1 SFT). Everything after that uses Python scripts for prompts and the previous round's model for traces.
150
 
151
  ## Hyperparameters Quick Reference
152
 
153
  | Setting | Value |