Spaces: NVIDIAReasoningChallenge / README

Seungjun committed on Apr 22

Commit 37abb6f (verified) · 1 Parent(s): 501f285

Update README.md

Files changed (1)

README.md +8 -24

@@ -44,7 +44,7 @@ Produce a LoRA adapter (rank ≤ 32) on Nemotron-3-Nano-30B that maximizes accur
 - Generate **~100K synthetic prompts total**
   - Match train.csv's category distribution
   - Apply floor of ~5K per category (ensure coverage even for rare categories)
-  - Mix easy/medium/hard within each category
 ### Step 1c: Distill reasoning traces via API (only paid step in the whole pipeline)
 - Use SOTA model (Claude / GPT-4 / etc.) to generate reasoning traces for each prompt
@@ -87,8 +87,8 @@ Extract `\boxed{}` from rollout → 1 if matches ground truth (exact or numerica
 **Round 1 DAPO**
 - Prompt pool: **~30K–50K prompts**
-  - ~40% overlap with SFT prompts (tests generalization)
-  - ~60% fresh synthetic prompts
 - Only prompts + ground truth needed — rollouts generated live
 - Evaluate on train.csv → save score + LoRA checkpoint
@@ -102,18 +102,18 @@ Extract `\boxed{}` from rollout → 1 if matches ground truth (exact or numerica
 **Round N SFT**
 - Prompt pool for self-distillation: **~150K prompts**
-  - ~70K: reuse previous round's SFT prompts (regenerate with stronger model)
-  - ~80K: fresh synthetic prompts for diversity
 - Generate 4–8 candidate traces per prompt (temperature ~0.7)
 - Filter: keep traces whose `\boxed{}` matches ground truth
-- Expected yield: 85–95% (model is much stronger now)
 - Keep 1–2 best traces per prompt (shortest correct, or diverse)
-- Mix in ~25K rows from previous round's SFT data (prevents drift / mode collapse)
 - **Final training set: ~175K rows**
 - Evaluate on train.csv → save score + LoRA checkpoint
 **Round N DAPO**
-- Prompt pool: ~30K–50K, skewed harder
 - Goldilocks filter: keep prompts where current SFT model succeeds 20–80% of the time
   - Drop prompts the model aces (no signal)
   - Drop prompts the model always fails (no signal)
@@ -148,22 +148,6 @@ Realistic target: **2–3 full rounds** within the 2-month window.
 **Key insight:** API is used exactly once (Phase 1 SFT). Everything after that uses Python scripts for prompts and the previous round's model for traces.
-SFT — Match Train.csv Distribution
-For SFT (both Phase 1 and Phase 2+), mirror train.csv's difficulty mix.
-Reason: SFT teaches the model what reasoning looks like across the full distribution it'll encounter at test time. If the hidden test set mirrors train.csv (very likely), you want your model fluent at every difficulty level it'll see.
-Risk of over-weighting hard problems in SFT: model gets sophisticated reasoning but fumbles easy problems due to underexposure.
-Risk of over-weighting easy problems: model never sees complex reasoning patterns.
-Default: match train.csv.
-DAPO — Skew Harder, Shift Each Round
-DAPO is different. It only learns from prompts in the Goldilocks zone (20–80% success rate). Prompts too easy or too hard produce no gradient signal.
-As your model improves across rounds, the Goldilocks zone shifts toward harder problems because what was hard becomes medium, what was medium becomes easy.
-Concrete distribution per round:
-| Round | Easy | Medium | Hard |
-|---|---|---|---|
-| Phase 1 DAPO | 20% | 50% | 30% |
-| Phase 2 DAPO | 10% | 40% | 50% |
-| Phase 3 DAPO | 5% | 30% | 65% |
----
 ## Hyperparameters Quick Reference
 | Setting | Value |

 - Generate **~100K synthetic prompts total**
   - Match train.csv's category distribution
   - Apply floor of ~5K per category (ensure coverage even for rare categories)
+  - Mix easy/medium/hard within each category - this should also follow the difficulty distribution in the given train.csv
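The allocation rule above (match train.csv's category distribution, with a floor per category) can be sketched as follows. This is a minimal sketch, not the project's actual script: the function name `allocate_with_floor` and the category names in the usage note are hypothetical, and it assumes every category has a positive count in train.csv. Because shares are rounded, the result only approximately sums to the target, which fits the "~100K" wording.

```python
def allocate_with_floor(category_counts, total=100_000, floor=5_000):
    """Split `total` prompts across categories in proportion to their
    counts, while guaranteeing at least `floor` prompts per category.

    `category_counts` maps category name -> row count in train.csv
    (assumed positive). Returns a dict: category -> prompts to generate.
    """
    if floor * len(category_counts) > total:
        raise ValueError("floor * number of categories exceeds total")

    pinned = set()  # rare categories that are held at the floor
    while True:
        free = {c: n for c, n in category_counts.items() if c not in pinned}
        budget = total - floor * len(pinned)
        free_sum = sum(free.values())
        # Proportional share of the remaining budget for unpinned categories.
        shares = {c: budget * n / free_sum for c, n in free.items()}
        below = {c for c, s in shares.items() if s < floor}
        if not below:
            break
        # Pin the categories that fell under the floor and re-split the rest.
        pinned |= below

    alloc = {c: floor for c in pinned}
    alloc.update({c: round(s) for c, s in shares.items()})
    return alloc
```

For example, `allocate_with_floor({"algebra": 700, "geometry": 290, "logic": 10})` lifts the rare `logic` category to 5,000 prompts and splits the remaining 95,000 between the other two in a 700:290 ratio.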
 ### Step 1c: Distill reasoning traces via API (only paid step in the whole pipeline)
 - Use SOTA model (Claude / GPT-4 / etc.) to generate reasoning traces for each prompt
 **Round 1 DAPO**
 - Prompt pool: **~30K–50K prompts**
+  - ~40% overlap with SFT prompts (tests generalization) - still wondering what the difficulty distribution of these problems should be
+  - ~60% fresh synthetic prompts - still wondering what the difficulty distribution of these problems should be
 - Only prompts + ground truth needed — rollouts generated live
 - Evaluate on train.csv → save score + LoRA checkpoint
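One way to assemble the Round 1 DAPO pool from the two sources is sketched below. The function name `build_dapo_pool`, the default pool size of 40K (a point inside the stated 30K–50K range), and the fixed seed are assumptions for illustration. The sketch samples uniformly and does not address the open question of difficulty distribution.

```python
import random


def build_dapo_pool(sft_prompts, fresh_prompts, pool_size=40_000,
                    overlap_frac=0.4, seed=0):
    """Mix ~40% previously seen SFT prompts with ~60% fresh prompts.

    Both inputs are lists of prompts. random.Random.sample raises
    ValueError if a source holds fewer prompts than requested.
    """
    rng = random.Random(seed)  # seeded so the pool is reproducible
    n_overlap = round(pool_size * overlap_frac)
    n_fresh = pool_size - n_overlap
    pool = rng.sample(sft_prompts, n_overlap) + rng.sample(fresh_prompts, n_fresh)
    rng.shuffle(pool)  # avoid having all overlap prompts first
    return pool
```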
 **Round N SFT**
 - Prompt pool for self-distillation: **~150K prompts**
+  - ~70K: reuse previous round's SFT prompts (regenerate with stronger model) - still wondering what the difficulty distribution of these problems should be
+  - ~80K: fresh synthetic prompts for diversity - still wondering what the difficulty distribution of these problems should be
 - Generate 4–8 candidate traces per prompt (temperature ~0.7)
 - Filter: keep traces whose `\boxed{}` matches ground truth
+- Expected yield: 85–95% (model is much stronger now) - still wondering how many problems we should target to reach this yield
 - Keep 1–2 best traces per prompt (shortest correct, or diverse)
+- Mix in ~25K rows from previous round's SFT data (prevents drift / mode collapse) - still wondering what the difficulty distribution of these problems should be
 - **Final training set: ~175K rows**
 - Evaluate on train.csv → save score + LoRA checkpoint
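The filter-and-keep steps in Round N SFT could look roughly like the sketch below. The helper names (`extract_boxed`, `is_correct`, `select_traces`) are hypothetical, the numeric tolerance is an assumption, and "best" is interpreted here as "shortest correct", which is only one of the two options the list mentions.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in `text`, or None.

    Braces are matched by depth so nested LaTeX such as
    \\boxed{\\frac{1}{2}} is returned whole.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth, out = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
    return None  # unbalanced braces: treat as no answer


def is_correct(trace, truth, tol=1e-6):
    """Exact string match, falling back to a numeric comparison.

    The tolerance `tol` is an assumed value, not taken from the README.
    """
    answer = extract_boxed(trace)
    if answer is None:
        return False
    if answer == truth.strip():
        return True
    try:
        return abs(float(answer) - float(truth)) <= tol
    except ValueError:
        return False


def select_traces(candidates, truth, keep=2):
    """Keep up to `keep` correct traces, shortest first."""
    correct = [t for t in candidates if is_correct(t, truth)]
    return sorted(correct, key=len)[:keep]
```

Calling `select_traces(candidates, truth)` on the 4–8 sampled traces for one prompt returns an empty list when no candidate is correct, so such prompts drop out of the training set naturally.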
 **Round N DAPO**
+- Prompt pool: ~30K–50K, skewed harder - still wondering what the difficulty distribution of these problems should be
 - Goldilocks filter: keep prompts where current SFT model succeeds 20–80% of the time
   - Drop prompts the model aces (no signal)
   - Drop prompts the model always fails (no signal)
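The Goldilocks filter can be sketched as a pass over per-prompt success counts. This is a minimal sketch: the function name `goldilocks_filter` and the input format (prompt ID mapped to number of correct rollouts) are assumptions, and treating the 20% and 80% bounds as inclusive is a choice the README does not specify.

```python
def goldilocks_filter(success_counts, n_rollouts, low=0.2, high=0.8):
    """Keep prompts whose success rate under the current SFT model
    falls within [low, high].

    `success_counts` maps prompt_id -> number of correct rollouts
    out of `n_rollouts` attempts per prompt.
    """
    kept = []
    for prompt_id, n_success in success_counts.items():
        rate = n_success / n_rollouts
        # Always-solved and never-solved prompts carry no learning signal.
        if low <= rate <= high:
            kept.append(prompt_id)
    return kept
```

With 8 rollouts per prompt, a prompt solved 8 or 0 times is dropped, while one solved 2 to 6 times is kept.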
 **Key insight:** API is used exactly once (Phase 1 SFT). Everything after that uses Python scripts for prompts and the previous round's model for traces.
 ## Hyperparameters Quick Reference
 | Setting | Value |