Update README.md with research/reproducibility notes
README.md CHANGED
@@ -18,6 +18,11 @@ This model was produced using **Simple Self-Distillation (SSD)**, a method that
|
 - **Self-distillation sampling:** temperature=1.1, top_p=0.95, top_k=20
 - **Evaluation sampling:** temperature=0.7, top_p=0.95, top_k=20
 
+## Notes
+
+- These are research checkpoints for reproducibility.
+- They are not optimized Qwen releases.
+- They don't represent a broader open-source model strategy.
+
 ## Method
 
 SSD samples solutions from the base model using non-unit temperature and top-k/top-p truncation, then fine-tunes on those samples via standard supervised learning. Despite its simplicity, SSD yields large gains on competitive programming benchmarks, with improvements concentrating on harder problems. The mechanism traces to resolving a *precision–exploration conflict*: SSD reshapes token distributions in a context-dependent way so that a single global decoding configuration becomes far more effective at evaluation time.
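The decoding setup the README describes (temperature scaling plus top-k/top-p truncation, e.g. temperature=1.1, top_k=20, top_p=0.95 for self-distillation sampling) can be sketched in plain NumPy. This is a minimal illustration of the truncated-sampling step only, not the SSD training code; the function name `truncated_sample` and the toy logits are hypothetical.

```python
import numpy as np

def truncated_sample(logits, temperature=1.1, top_k=20, top_p=0.95, rng=None):
    """Sample one token id using temperature scaling plus top-k/top-p truncation."""
    rng = rng if rng is not None else np.random.default_rng(0)
    # Temperature scaling: >1 flattens the distribution, <1 sharpens it.
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    # Top-k truncation: keep only the k highest-scoring tokens, sorted descending.
    order = np.argsort(scaled)[::-1][:top_k]
    probs = np.exp(scaled[order] - scaled[order].max())
    probs /= probs.sum()
    # Nucleus (top-p) truncation: keep the smallest prefix of the sorted
    # distribution whose cumulative mass reaches top_p, then renormalize.
    cutoff = int(np.searchsorted(np.cumsum(probs), top_p)) + 1
    kept = order[:cutoff]
    kept_probs = probs[:cutoff] / probs[:cutoff].sum()
    return int(rng.choice(kept, p=kept_probs))

# Toy 4-token vocabulary: the low-probability tail is cut off before sampling.
token = truncated_sample([5.0, 2.0, 1.0, -3.0], temperature=1.1, top_k=3, top_p=0.95)
```

In SSD, samples drawn this way from the base model would then be used as ordinary supervised fine-tuning targets; the evaluation-time configuration differs only in its lower temperature (0.7).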