import Image from "../../components/Image.astro";
import HtmlEmbed from "../../components/HtmlEmbed.astro";
import Wide from "../../components/Wide.astro";
import Sidenote from "../../components/Sidenote.astro";
import Note from "../../components/Note.astro";
import syntheticDataScaleImg from "../assets/image/synthetic-data-scale.jpg";
## Introduction
We ran 90 experiments, generated over 1 trillion tokens, and spent 12.7 GPU years to find the best recipe for synthetic pretraining data. The result is **FinePhrase**, a 486B token dataset that clearly outperforms all existing synthetic data baselines. It's [available on the Hub](https://huggingface.co/datasets/HuggingFaceFW/finephrase), and this post walks you through everything we learned along the way.
<Sidenote>
Reading time: One weekend
</Sidenote>
<HtmlEmbed
id="finephrase-vs-baselines"
src="d3-benchmark-comparison.html"
desc="FinePhrase compared against synthetic data baselines across evaluation metrics."
config={{
defaultView: "line",
datasets: {
"mix-fw_edu_hq-table_smollm2_1.7b_hq": { display: "FinePhrase (table)", color: "#EBA937" },
cosmopedia: { display: "Cosmopedia", color: "#e15759" },
nemotron_hq_synth: { display: "Nemotron-HQ-Synth", color: "#76b900" },
rewire: { display: "REWIRE", color: "#1877F2" },
synth_query_reasoning_answer: { display: "SYNTH", color: "#b07aa1" }
},
speedupAnnotation: {
baselineRun: "nemotron_hq_synth",
targetRun: "mix-fw_edu_hq-table_smollm2_1.7b_hq"
}
}}
/>
If you read some of the latest LLM papers (e.g., Nemotron 3 [@nemotron3], Qwen3 [@qwen3], Phi-4 [@phi4], Arcee Trinity [@arceetrinitymanifesto; @arceetrinitylarge]), you may have noticed that synthetic data has become a key component of LLM training data, and one of the standard tools for building high-quality training sets. Looking back, we can see several paradigm shifts in LLM data, especially for pretraining, and synthetic data is the natural next step:
- After training the first language models on small-ish datasets like Wikipedia, people started scaling up pretraining corpora to include more and more data from the web. Datasets like C4 [@c4] and The Pile [@thepile] pushed into hundreds of gigabytes. Then FineWeb [@fineweb] and DCLM [@datacomp] brought things to the trillion-token scale, covering most of the crawlable web.
- When approaching the scaling limits of web data, the discussion shifted from volume to quality. Researchers started with stronger heuristics and deduplication pipelines, then switched to neural classifiers looking for "educational" or "instruction-like" data. FineWeb-Edu used Llama 3 70B [@llama3] to score educational quality; DCLM used model-based filtering to train a 7B model to 64% MMLU with 2.6T tokens. With higher-quality data, some repetition seemed fine.
- Now that we have mostly exhausted web text data and concluded that quality is more important, synthetic data has become an interesting option to up-cycle the data that the classifiers would have normally excluded and thus increase the volume of data again. Cosmopedia [@cosmopedia] was an early example, generating 25B tokens of textbooks and stories with Mixtral [@mixtral]. Today the latest LLMs are trained on trillions of synthetic tokens, matching the volume of unaltered data.
- But publicly indexed web data is only part of the picture. Massive amounts of user-generated content (emails, messages, proprietary codebases) remain untapped because they contain PII, toxic content, or copyrighted material. Generative Data Refinement (GDR) [@gdr] shows that LLMs can anonymize and detoxify such data while preserving its utility for training, outperforming industry-grade PII detectors with a single zero-shot prompt. By conditioning rewrites on each real example, GDR also preserves the diversity of the original data, avoiding the mode collapse that plagues purely synthetic generation. This could dramatically expand the usable data pool beyond what's publicly crawlable.
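To make "rephrasing" concrete before we formalize it later, here is a minimal sketch of what one step of such a pipeline might look like. The prompt wording, the chat-message wrapping, and the example document are illustrative assumptions, not the exact setup used for FinePhrase:

```python
# Minimal, hypothetical sketch of one rephrasing step: turn a raw web
# document into a chat request for a rephrasing model. The prompt text
# here is an illustrative placeholder, not the prompt used in this post.

REPHRASE_PROMPT = (
    "Rewrite the following web page as a clear, well-structured passage "
    "suitable for teaching, preserving all factual content:\n\n{document}"
)

def build_rephrase_messages(document: str) -> list[dict]:
    """Build the chat messages that would be sent to an inference server."""
    return [{"role": "user", "content": REPHRASE_PROMPT.format(document=document)}]

messages = build_rephrase_messages(
    "The mitochondria is the powerhouse of the cell. It produces ATP..."
)
```

In a real pipeline, `messages` would be batched and sent to an inference engine; conditioning each generation on a real document (rather than generating from scratch) is what preserves the diversity of the source corpus.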
We are seeing a radical shift in compute allocation for model training: while model training itself dominated the compute budget early on, more and more compute is now allocated to curating and improving the training datasets, both in pretraining and post-training.
The scale is staggering: NVIDIA used LLMs to rephrase around 2 trillion tokens of web text for their [Nemotron-CC dataset](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2) [@nemotroncc], while Z.ai generated 500 billion reasoning tokens to mid-train the GLM-4.5 series [@glm45]. Here's how much synthetic data recent models are using:
<figure id="synthetic-data-scale">
<Image src={syntheticDataScaleImg} alt="Scale of synthetic data in recent LLM training runs" />
<figcaption>Scale of synthetic data usage in recent LLM training runs. Several recent models were trained on hundreds of billions to trillions of synthetic tokens.</figcaption>
</figure>
Synthetic data also plays a central role in post-training via *distillation*, where a capable model generates targeted training data for domains like reasoning, instruction-following, and tool-use. For example, [SmolLM3](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook) [@smollm3] was post-trained almost entirely on data generated from models like DeepSeek-R1 [@deepseekr1] and Qwen3.
<Sidenote>
During SmolLM2 [@smollm2] training, the model was decent at coding and math but totally went off the rails with small talk queries (e.g. "How are you?", "Hi", "What's up?"). Synthetically generating a [small talk dataset](https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k/viewer/default/train_sft?row=0) quickly solved this.
</Sidenote>
However, doing synthetic data generation properly still resembles alchemy: Which model should you use? Which prompts work best, and how many do you need? And how do you even scale this effectively?
Our goal is to turn this alchemy into chemistry: replace intuition with systematic, reproducible experiments. Here's how we go about it:
<Sidenote>
Lavoisier replaced phlogiston theory with precise measurements and repeatable experiments, earning him the title "father of modern chemistry".
</Sidenote>
We start by [setting up the problem](#rephrasing-the-web): what rephrasing is, which approaches exist, and what we want to test. Then we dive into the 90 [Experiments](#experiments) we ran to figure out which prompts, models, and datasets actually work. The [Analyses](#analyses) section zooms out to ask *why* things work the way they do. Next comes the [Infrastructure](#infrastructure) that made all of this possible, including detailed throughput benchmarking of popular models (crucial for getting the most data for your buck). Finally, we [put it all together](#applying-the-recipe-at-scale) into FinePhrase, our best configuration.
The sections below are fairly self-contained, so feel free to jump around and skip whatever seems less interesting to you.
<Note variant="info" title="But wait, what about model collapse?">
You might be wondering: doesn't training on synthetic data inevitably lead to model collapse? This is a common misconception that stems from research [@modelcollapse] showing severe degradation when models are trained exclusively and iteratively on their own outputs, without any new information or human data.
In practice, nobody trains models this way. Real-world pipelines mix synthetic with human data, use diverse reference materials in prompts, and apply synthetic data strategically rather than replacing entire training corpora. A large-scale empirical study training over 1,000 LLMs [@demystifyingsynth] confirms this nuanced picture: training on rephrased synthetic data mixed with natural web text (at around 30% synthetic) can speed up pretraining convergence by 5-10x, with no signs of degradation. Model collapse happens in a closed loop on a model's own outputs without new signal, which is not how practitioners use synthetic data. The real concern is frontier models generating training data for other frontier models in isolation. Thoughtful integration of synthetic data that introduces new knowledge or perspectives is a different story entirely. In FineWeb [@fineweb] we also found no degradation from naturally occurring AI-generated data on the web.
</Note>
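To make the roughly 30% synthetic fraction mentioned above concrete, here is a back-of-the-envelope sketch of how a fixed token budget splits between natural and synthetic data. The 1T-token budget is a made-up number for illustration:

```python
# Back-of-the-envelope sketch: splitting a fixed pretraining token budget
# between natural web text and rephrased synthetic data. The budget and
# fraction are illustrative, not a prescription.

def synthetic_mix(total_tokens: float, synthetic_fraction: float = 0.30) -> dict:
    """Return token counts for the synthetic and natural parts of the mix."""
    synthetic = total_tokens * synthetic_fraction
    return {"synthetic": synthetic, "natural": total_tokens - synthetic}

mix = synthetic_mix(1e12)  # hypothetical 1T-token budget
# roughly 300B synthetic tokens and 700B natural web tokens
```

The point is simply that synthetic data complements rather than replaces the natural corpus; the exact ratio is one of the knobs the experiments below explore.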
Want to learn how to make GPUs go brrr and generate synthetic tokens at scale like this? This blog is for you!
<Wide>
<HtmlEmbed
id="intro-throughput"
src="inference-throughput-compare.html"
config={{ modelCount: 1 }}
caption="Drag the slider to scale up GPUs and watch the tokens fly. By the end of this post, you'll know exactly how to set this up."
/>
</Wide>
Now let's start by defining what rephrasing actually means and laying out the design space.