## Conclusions
We ran 90 experiments, generated over 1 trillion tokens, and spent more than 111,000 GPU hours to figure out what actually matters for synthetic pretraining data. The answer is surprisingly simple: **prompt design is the single biggest lever**. Structured formats like Table, Math, FAQ, and Tutorial consistently beat both curated web baselines and prior synthetic methods, producing our best configuration, FinePhrase: 1.35 billion samples and 486 billion completion tokens generated from 339 million source documents. You don't need a large rephrasing model to get there: a 1B model is sufficient for most prompts, and even low-quality source data works fine when paired with a strong mix-in dataset. Template diversity matters more than template polish, and a messier model that produces varied outputs can outperform a polished one that repeats the same structure. SmolLM2-1.7B emerged as the best rephrasing model across all prompts, beating larger models from other families. There is no reliable proxy metric that can replace training and evaluating a model, so there is no shortcut around the full pipeline. We open-source all infrastructure, prompts, and benchmarking code through DataTrove so you can build on these findings without reinventing the plumbing. That said, there's plenty left to explore.
### What's Next?
The biggest bottleneck to scaling synthetic data experiments is the compute cost of generation itself. Producing the 10B tokens needed for a single ablation with `Gemma-3-1B-IT` takes roughly 3,800 H100 GPU hours. Several avenues could bring this cost down significantly. **Diffusion language models** are promising: their parallel generation capabilities yield reported 2-10x inference speedups over autoregressive approaches. Models like [LLaDA2.1-flash](https://huggingface.co/inclusionAI/LLaDA2.1-flash) show that diffusion LMs can match autoregressive models on standard benchmarks while generating tokens in parallel, and SGLang already supports serving them, but broader ecosystem support (e.g., vLLM) is still missing. DFlash [@dflash] could further speed up generation, though it is currently cumbersome to use and has limited model support. [Mercury 2](https://www.inceptionlabs.ai/blog/introducing-mercury-2) [@mercury2] pushes this further, reaching over 1,000 tokens per second on NVIDIA Blackwell GPUs through parallel refinement rather than sequential decoding, with 5x+ speedups over autoregressive baselines. On the autoregressive side, speculative decoding support in vLLM remains limited (e.g., draft models are not well supported), leaving significant inference speedups on the table.
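A quick back-of-the-envelope check makes the stakes concrete. Only the 10B-token and 3,800-GPU-hour figures come from the text; the speedup factors below are the illustrative 2-10x range reported for diffusion LMs, not measured results from our pipeline:

```python
# Implied per-GPU throughput of the generation pipeline, and the GPU hours
# a hypothetical inference speedup would save per ablation.
tokens = 10e9          # tokens needed for one ablation (from the text)
gpu_hours = 3_800      # H100 GPU hours for that generation (from the text)

tokens_per_gpu_second = tokens / (gpu_hours * 3600)
print(f"implied throughput: {tokens_per_gpu_second:,.0f} tokens/s per GPU")

# Illustrative speedups in the 2-10x range cited for diffusion LMs.
for speedup in (2, 5, 10):
    print(f"{speedup:>2}x speedup -> ~{gpu_hours / speedup:,.0f} GPU hours per ablation")
```

At roughly 730 tokens per second per GPU today, even a 2x speedup frees almost 2,000 GPU hours per ablation, which is why faster generation dominates the wishlist.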
Beyond faster generation, we answered several questions about best practices, but many remain wide open:
- **Data repetition**: Can you repeat data more often without performance loss if the repetitions are rephrased?
- **Mixing ratio**: We mixed unrephrased source data with synthetic data at equal proportions. @demystifyingsynth found ~30% rephrased synthetic to be optimal for their setup, but this likely depends on model size, data budget, and synthetic data type. How little synthetic data can you get away with: 50%, 20%, 5%? What are the best data mixes for pretraining at scale?
- **Generation parameters**: What influence do temperature, `top_p`, and other sampling settings have on rephrasing quality?
- **Context extension**: Does context extension via chunked rollouts during mid-training improve downstream performance?
- **Best-of-N filtering**: Can we generate multiple rollouts per example and score them to keep only the best one?
- **Scaling to larger models**: Our student sweep (0.5B to 6.2B) confirms that larger students extract more value from synthetic data, consistent with @rewire, and reveals generator differences above 1B that smaller students hide. Training at even larger scales (10B+) could amplify these gaps further.
- **Automatic prompt optimization**: Does prompt optimization with tools like DSPy [@dspy] improve rephrasing performance?
- **Longer pretraining**: Our ablations trained for 21B tokens. Do the same findings hold at 100B+ token scales, and do prompt rankings shift with longer training?
- **Source filtering**: Should we filter documents before or after rephrasing? For instance, applying a math prompt to non-mathematical documents likely wastes compute and adds noise.
- **Larger ablations and mixtures**: We want to run more extensive mixture experiments, exploring how synthetic data interacts with source data at scale, in line with the recent [smol-data](https://huggingface.co/spaces/HuggingFaceTB/smol-data) effort.
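To make the mixing-ratio question concrete, here is a minimal sketch of how a ratio sweep could be set up. `make_mix` is a hypothetical helper, not part of the released DataTrove pipeline; our ablations only used the 50/50 point of this sweep:

```python
import random

def make_mix(source_docs, synthetic_docs, synthetic_fraction, n_samples, seed=0):
    """Sample a training mix with a target fraction of synthetic documents.

    Hypothetical helper for illustration: sweeping synthetic_fraction over
    values like 0.5, 0.2, 0.05 is one way to probe how little synthetic
    data you can get away with.
    """
    rng = random.Random(seed)
    n_synth = int(n_samples * synthetic_fraction)
    mix = rng.choices(synthetic_docs, k=n_synth)
    mix += rng.choices(source_docs, k=n_samples - n_synth)
    rng.shuffle(mix)
    return mix

# Toy corpora stand in for real source and rephrased documents.
source = [f"src-{i}" for i in range(100)]
synthetic = [f"syn-{i}" for i in range(100)]

for frac in (0.5, 0.2, 0.05):
    mix = make_mix(source, synthetic, frac, n_samples=1000)
    share = sum(d.startswith("syn") for d in mix) / len(mix)
    print(f"target {frac:.0%} synthetic -> actual {share:.1%}")
```

Each ratio would then be trained and evaluated end to end, since no proxy metric reliably predicts the outcome.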
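The best-of-N idea from the list above can be sketched in a few lines. Both `generate` and `score` are stand-ins: in practice, generation would call the rephrasing model and scoring could be a perplexity- or classifier-based quality signal; nothing here comes from the released pipeline:

```python
def best_of_n(prompt, generate, score, n=4):
    """Generate n rollouts for one example and keep the highest-scoring one.

    `generate` and `score` are hypothetical callables: a rephrasing-model
    call and a quality scorer, respectively.
    """
    rollouts = [generate(prompt) for _ in range(n)]
    return max(rollouts, key=score)

# Toy demo: a fake generator that returns canned rephrasings, scored by length.
canned = iter(["short", "a much longer rephrasing", "medium length"])
best = best_of_n("Rewrite this:", lambda p: next(canned), score=len, n=3)
print(best)  # keeps the longest rollout
```

The open question is whether the extra generation compute (N times more rollouts) pays for itself in downstream performance, which again requires full training runs to answer.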
| The playbook is open. Build on it. |