Commit 7adb03a · Parent(s): 10a0e93
did some general polish and cleanup
Browse files

- app/src/content/assets/image/{Screenshot_2026-01-20_at_09_42_21_2f81384e-bcac-80e6-b3fa-d06567e56b15.png → auto-dataset-card.png} +0 -0
- app/src/content/assets/image/{SyDLepVveg_2f81384e-bcac-806f-acb7-fd65c71dd9df.jpg → synthetic-data-scale.jpg} +0 -0
- app/src/content/chapters/experiments.mdx +1 -1
- app/src/content/chapters/infrastructure.mdx +18 -22
- app/src/content/chapters/introduction.mdx +16 -7
app/src/content/assets/image/{Screenshot_2026-01-20_at_09_42_21_2f81384e-bcac-80e6-b3fa-d06567e56b15.png → auto-dataset-card.png}
RENAMED (file without changes)

app/src/content/assets/image/{SyDLepVveg_2f81384e-bcac-806f-acb7-fd65c71dd9df.jpg → synthetic-data-scale.jpg}
RENAMED (file without changes)
app/src/content/chapters/experiments.mdx
CHANGED
@@ -4,8 +4,8 @@ import Sidenote from "../../components/Sidenote.astro";
| 4 | import Glossary from "../../components/Glossary.astro";
| 5 | import FigRef from "../../components/FigRef.astro";
| 6 |
| 7 | - {/* TODO: Benchmarking: plot compare against default, mention how expensive one sweep is, automatically produce plot from baseline to be optimized and spit out the result */}
| 8 | {/* TODO: think about what dataset to build and release as artifact: do more rephrasing with smollm2 */}
| 9 | {/* TODO: add a plot for the table with the benchmark results */}
| 10 | {/* TODO: Analyze if certain models are more verbose than others (how many tokens did they produce per prompt?) (wait for last rephrasing job to be done) */}
| 11 | {/* TODO: Run dclm and edu score impact analysis on model verbosity data (wait for last rephrasing job to be done) */}

| 4 | import Glossary from "../../components/Glossary.astro";
| 5 | import FigRef from "../../components/FigRef.astro";
| 6 |
| 7 | {/* TODO: think about what dataset to build and release as artifact: do more rephrasing with smollm2 */}
| 8 | + {/* TODO: shorten the vllm inference benchmark or put stuff into the appendix */}
| 9 | {/* TODO: add a plot for the table with the benchmark results */}
| 10 | {/* TODO: Analyze if certain models are more verbose than others (how many tokens did they produce per prompt?) (wait for last rephrasing job to be done) */}
| 11 | {/* TODO: Run dclm and edu score impact analysis on model verbosity data (wait for last rephrasing job to be done) */}
app/src/content/chapters/infrastructure.mdx
CHANGED
@@ -3,22 +3,12 @@ import HtmlEmbed from "../../components/HtmlEmbed.astro";
| 3 | import Sidenote from "../../components/Sidenote.astro";
| 4 | import FigRef from "../../components/FigRef.astro";
| 5 | import Accordion from "../../components/Accordion.astro";
| 6 | - import
| 7 | - import Screenshot_2026_01_20_at_09_42_21_2f81384e_bcac_80e6_b3fa_d06567e56b15 from "../assets/image/Screenshot_2026-01-20_at_09_42_21_2f81384e-bcac-80e6-b3fa-d06567e56b15.png";
| 8 |
| 9 | ## Infrastructure
| 10 |
| 11 | When you start generating your first synthetic tokens with LLMs you notice quickly that this is an extremely slow and compute-heavy process. Even though we can cache KV values from previous tokens, we still need one forward pass for *every* token, and every web document typically has a few thousand tokens. The first step before running any large-scale experiments is setting up infrastructure that generates as efficiently and scalably as possible.
| 12 |
| 13 | - Synthetic data has emerged as a key ingredient in training modern LLMs, providing a path past the pretraining data wall, where high-quality text (or ["fossil fuel"](https://youtu.be/1yvBqasHLZs?si=YgaaCSfngJNi3OSb&t=475)) becomes scarce and collecting more internet data yields diminishing returns. For example, NVIDIA used LLMs to rephrase around 2 trillion tokens (!) of web text in their [Nemotron-CC dataset](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2) [@nemotroncc], while Z.ai generated 500 billion reasoning tokens to mid-train the GLM-4.5 series of models [@glm45]. <FigRef target="synthetic-data-scale" /> shows the staggering scale of synthetic data usage in recent model training runs.
| 14 | -
| 15 | - <figure id="synthetic-data-scale">
| 16 | - <Image src={SyDLepVveg_2f81384e_bcac_806f_acb7_fd65c71dd9df} alt="Scale of synthetic data in recent LLM training runs" />
| 17 | - <figcaption>Scale of synthetic data usage in recent LLM training runs. Several recent models were trained on hundreds of billions to trillions of synthetic tokens.</figcaption>
| 18 | - </figure>
| 19 | -
| 20 | - Synthetic data also plays a central role in post-training via *distillation*, where a capable model generates high-quality responses for targeted domains such as reasoning, instruction-following, and tool-use. For example, [SmolLM3](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook) [@smollm3] was post-trained almost entirely on a few billion tokens of data generated from models like DeepSeek-R1 [@deepseekr1] and Qwen3.
| 21 | -
| 22 | So what does it actually take to generate a trillion tokens of synthetic data? Thanks to fast inference engines like [vLLM](https://github.com/vllm-project/vllm) [@vllm] and [SGLang](https://github.com/sgl-project/sglang) [@sglang], the bottleneck isn't the generation itself but the *infrastructure* around it: orchestrating thousands of prompts, keeping GPUs saturated, checkpointing outputs, and pushing everything to storage without losing progress when a worker crashes.
| 23 |
| 24 | We made major extensions to [DataTrove](https://github.com/huggingface/datatrove) [@datatrove] to manage this entire process. These extensions package the scaffolding we built for our own synthetic data pipelines and make it accessible to anyone who wants to generate high-quality datasets at scale. DataTrove supports both local generation and large-scale distributed runs on Slurm clusters, handling chunking, checkpointing, distributed queueing, and Hugging Face dataset management so you can focus on synthetic data design rather than operational glue.
@@ -267,7 +257,7 @@ Need multiple samples per document? Set `rollouts_per_document` in your `Inferen
| 267 | We want you to be able to just press a button, let the GPUs go brrrr, and check back in to the finished dataset. DataTrove continuously uploads data to your specified Hugging Face dataset repo whenever a chunk is finished. At the end, the `InferenceDatasetCardGenerator` pipeline step checks the logs directory, collects information about the throughput, and uploads a dataset card to document your new synthetic dataset. <FigRef target="auto-dataset-card" /> shows an example of the auto-generated dataset card.
| 268 |
| 269 | <figure id="auto-dataset-card">
| 270 | - <Image src={
| 271 | <figcaption>Example of an auto-generated dataset card with throughput metrics, uploaded to the Hugging Face Hub after inference completes.</figcaption>
| 272 | </figure>
| 273 |
@@ -311,12 +301,7 @@ The entire benchmarking code (experiment launcher, analysis scripts, and sample
| 311 |
| 312 | We benchmarked **18 models** spanning 4 size categories (tiny to large) on **H100 GPUs** (8 GPUs per node) using vLLM as the inference engine. The goal: find the optimal serving configuration for each model to maximize output tokens per second per GPU.
| 313 |
| 314 | - All models were evaluated on the same task: rewriting documents from [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (sample-10BT split) as step-by-step tutorials. Each run processed up to 10,000 examples with:
| 315 | - - **Model max context**: 8,192 tokens
| 316 | - - **Max output tokens**: 4,096 tokens
| 317 | - - **Temperature**: 0.0 (deterministic, seed=42 for reproducibility)
| 318 |
| 319 | - Since all runs use temperature 0.0 and a fixed seed, the variance across runs is negligible. We therefore report single-run throughput numbers without confidence intervals.
| 320 |
| 321 | - 🐣 **Tiny** ({'<'}1B): [SmolLM2-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct), [SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct), [gemma-3-270m-it](https://huggingface.co/google/gemma-3-270m-it), [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
| 322 | - 🦆 **Small** (1B–10B): [SmolLM2-1.7B-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct), [gemma-3-1b-it](https://huggingface.co/google/gemma-3-1b-it), [gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it), [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B), [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B), [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)
@@ -325,12 +310,14 @@ Since all runs use temperature 0.0 and a fixed seed, the variance across runs is
| 325 |
| 326 | The lineup spans four model families (SmolLM2, Gemma 3 [@gemma3], Qwen3, and GPT-OSS [@gptoss]) and includes both 🧱 dense transformers and 🔀 Mixture-of-Experts (MoE) architectures.
| 327 |
| 328 | -
| 329 |
| 330 | <Sidenote>
| 331 | -
| 332 | </Sidenote>
| 333 |
| 334 | #### Tiered optimization
| 335 |
| 336 | We adopted a **two-tier sequential optimization** approach. The second tier builds on the best configuration found in the previous tier:
@@ -392,8 +379,17 @@ The benchmark config defines **801 unique configurations** across 8 experiment g
| 392 |
| 393 | #### What these numbers mean in practice
| 394 |
| 395 | - Let's make this concrete
| 396 |
| 397 |
| 398 | #### Key findings
| 399 |
@@ -436,7 +432,7 @@ Speculative decoding adds overhead: the verification step has a compute cost, an
| 436 |
| 437 | ##### Models with large speedups
| 438 |
| 439 | - **[gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) and [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) (1.95x and 1.78x via tp=2).** Both are MoE models that are severely **memory-bound at tp=1**. gpt-oss-120b (120B total, ~
| 440 |
| 441 | **SmolLM2 models (1.34x-1.75x via speculative decoding).** These models are tiny enough that a single GPU has abundant memory. The bottleneck is the sequential nature of autoregressive decoding. Speculative decoding generates multiple tokens per verification step:
| 442 |
@@ -457,7 +453,7 @@ Interestingly, **ngram works better for the 135M model but suffix wins for the 1
| 457 | | [gemma-3-27b-it](https://huggingface.co/google/gemma-3-27b-it) | 19-26% | 2.1-2.6 | −11% |
| 458 | | [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) | 20-31% | 2.2-2.9 | −16% |
| 459 |
| 460 | - The small SmolLM2 models achieve 64-84% acceptance rates with 5-6 tokens accepted per step, making speculation highly profitable. The medium/large models ([Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B), gemma-3-12b/27b-it, gpt-oss-120b) only achieve 20-30% acceptance with ~2.3 tokens per step, barely better than no speculation. A likely explanation is that larger models generate more diverse, paraphrased text that diverges further from the input prompt, giving n-gram matching fewer opportunities for exact phrase reuse. At these low acceptance rates, the overhead of drafting and verifying rejected tokens outweighs the benefit.
| 461 |
| 462 | The tutorial-rewriting task is particularly amenable to speculative decoding because the output frequently contains phrases from the input document, giving both ngram and suffix methods high acceptance rates. Tasks that preserve even more of the input text (such as summarization, text continuation, or guided rewriting where the model is explicitly asked to maintain the original author's voice) would likely see even larger speedups from speculative decoding, since draft acceptance rates would be higher.
| 463 |
| 3 | import Sidenote from "../../components/Sidenote.astro";
| 4 | import FigRef from "../../components/FigRef.astro";
| 5 | import Accordion from "../../components/Accordion.astro";
| 6 | + import datasetCardImg from "../assets/image/auto-dataset-card.png";
| 7 |
| 8 | ## Infrastructure
| 9 |
| 10 | When you start generating your first synthetic tokens with LLMs you notice quickly that this is an extremely slow and compute-heavy process. Even though we can cache KV values from previous tokens, we still need one forward pass for *every* token, and every web document typically has a few thousand tokens. The first step before running any large-scale experiments is setting up infrastructure that generates as efficiently and scalably as possible.
| 11 |
| 12 | So what does it actually take to generate a trillion tokens of synthetic data? Thanks to fast inference engines like [vLLM](https://github.com/vllm-project/vllm) [@vllm] and [SGLang](https://github.com/sgl-project/sglang) [@sglang], the bottleneck isn't the generation itself but the *infrastructure* around it: orchestrating thousands of prompts, keeping GPUs saturated, checkpointing outputs, and pushing everything to storage without losing progress when a worker crashes.
| 13 |
| 14 | We made major extensions to [DataTrove](https://github.com/huggingface/datatrove) [@datatrove] to manage this entire process. These extensions package the scaffolding we built for our own synthetic data pipelines and make it accessible to anyone who wants to generate high-quality datasets at scale. DataTrove supports both local generation and large-scale distributed runs on Slurm clusters, handling chunking, checkpointing, distributed queueing, and Hugging Face dataset management so you can focus on synthetic data design rather than operational glue.
| 257 | We want you to be able to just press a button, let the GPUs go brrrr, and check back in to the finished dataset. DataTrove continuously uploads data to your specified Hugging Face dataset repo whenever a chunk is finished. At the end, the `InferenceDatasetCardGenerator` pipeline step checks the logs directory, collects information about the throughput, and uploads a dataset card to document your new synthetic dataset. <FigRef target="auto-dataset-card" /> shows an example of the auto-generated dataset card.
| 258 |
| 259 | <figure id="auto-dataset-card">
| 260 | + <Image src={datasetCardImg} alt="Auto-generated dataset card on the Hugging Face Hub" />
| 261 | <figcaption>Example of an auto-generated dataset card with throughput metrics, uploaded to the Hugging Face Hub after inference completes.</figcaption>
| 262 | </figure>
| 263 |
| 301 |
| 302 | We benchmarked **18 models** spanning 4 size categories (tiny to large) on **H100 GPUs** (8 GPUs per node) using vLLM as the inference engine. The goal: find the optimal serving configuration for each model to maximize output tokens per second per GPU.
| 303 |
| 304 |
| 305 |
| 306 | - 🐣 **Tiny** ({'<'}1B): [SmolLM2-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct), [SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct), [gemma-3-270m-it](https://huggingface.co/google/gemma-3-270m-it), [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
| 307 | - 🦆 **Small** (1B–10B): [SmolLM2-1.7B-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct), [gemma-3-1b-it](https://huggingface.co/google/gemma-3-1b-it), [gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it), [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B), [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B), [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)

| 310 |
| 311 | The lineup spans four model families (SmolLM2, Gemma 3 [@gemma3], Qwen3, and GPT-OSS [@gptoss]) and includes both 🧱 dense transformers and 🔀 Mixture-of-Experts (MoE) architectures.
| 312 |
| 313 | + All models were evaluated on the same task: rewriting documents from [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (sample-10BT split) as step-by-step tutorials. Each run processed up to 10,000 examples with a model max context of 8,192 tokens, up to 4,096 output tokens, and temperature 0.0.
| 314 |
| 315 | <Sidenote>
| 316 | + Since all runs use temperature 0.0 and a fixed seed (42), the variance across runs is negligible. We therefore report single-run throughput numbers without confidence intervals.
| 317 | </Sidenote>
| 318 |
| 319 | + All experiments ran on NVIDIA H100 80GB GPUs with 8 GPUs per node. We used vLLM as the inference engine with automatic prefix caching enabled and the flash_attn backend. The Flash-Attn [@flashattention2] vLLM backend is more than 50% faster than FlashInfer [@flashinfer] across our setups. This aligns with vLLM's [backend priority](https://docs.vllm.ai/en/latest/design/attention_backends/#backend-priority-cuda): on Ampere/Hopper (SM 8.x–9.x) Flash Attention is tried first, whereas on Blackwell (SM 10.x) FlashInfer has priority and may be faster there.
| 320 | +
| 321 | #### Tiered optimization
| 322 |
| 323 | We adopted a **two-tier sequential optimization** approach. The second tier builds on the best configuration found in the previous tier:
| 379 |
| 380 | #### What these numbers mean in practice
| 381 |
| 382 | + Let's make this concrete with [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b), a strong MoE model that balances quality and throughput well. Say you want to generate 10 billion tokens. With the baseline vLLM configuration (tp=1, 3,138 tps/gpu), that takes **885 GPU-hours** and costs roughly **2,656 USD** at 3 USD/H100-hour. With the optimized configuration (tp=2, 6,117 tps/gpu), it drops to **454 GPU-hours** and **1,362 USD**, a saving of **431 GPU-hours and ~1,300 USD** (49%) from nothing more than picking the right serving parameters. Scale this up to a trillion tokens and the savings run into hundreds of thousands of dollars.
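The cost arithmetic in this paragraph is easy to reproduce. A minimal sketch (the helper name `generation_cost` is ours; the throughput numbers and the 3 USD/H100-hour rate are the figures quoted above):

```python
def generation_cost(tokens: float, tps_per_gpu: float, usd_per_gpu_hour: float = 3.0):
    """GPU-hours and cost to generate `tokens` at a given per-GPU throughput."""
    gpu_hours = tokens / tps_per_gpu / 3600
    return gpu_hours, gpu_hours * usd_per_gpu_hour

# gpt-oss-120b, generating 10 billion tokens
baseline = generation_cost(10e9, 3_138)   # tp=1  -> ~885 GPU-h, ~2,656 USD
optimized = generation_cost(10e9, 6_117)  # tp=2  -> ~454 GPU-h, ~1,362 USD

print(f"baseline:  {baseline[0]:.0f} GPU-h, {baseline[1]:.0f} USD")
print(f"optimized: {optimized[0]:.0f} GPU-h, {optimized[1]:.0f} USD")
```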
| 383 | +
| 384 | + These per-GPU numbers also answer a natural question: how many GPUs does it take to generate **a billion tokens per hour**? With the optimized configurations from our sweep:
| 385 | +
| 386 | + - **SmolLM2-135M** (45,540 tps/gpu): **7 H100 GPUs** (1 node)
| 387 | + - **Qwen3-4B** (8,086 tps/gpu): **35 H100 GPUs** (~5 nodes)
| 388 | + - **Qwen3-8B** (6,443 tps/gpu): **44 H100 GPUs** (~6 nodes)
| 389 | + - **GPT-OSS-120B** (6,117 tps/gpu): **46 H100 GPUs** (~6 nodes)
| 390 | + - **Gemma-3-27B** (1,724 tps/gpu): **162 H100 GPUs** (~20 nodes)
| 391 |
| 392 | + Notice that gpt-oss-120b matches Qwen3-8B in per-GPU throughput despite being a much larger model. Two things make this possible: only ~5B of its 120B parameters are active per token (MoE), and the weights are MXFP4-quantized so the full model fits on a single 80GB GPU. That makes large MoE models the sweet spot for quality-per-GPU: a single 8-GPU node running gpt-oss-120b generates ~176 million tokens per hour, and six nodes get you past the billion-token-per-hour mark.
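The GPU counts in the list above follow from dividing the target rate by per-GPU throughput and rounding up to whole GPUs. A quick sketch (the helper name `gpus_for_rate` is ours; the tps figures are the sweep results quoted above):

```python
import math

def gpus_for_rate(tokens_per_hour: float, tps_per_gpu: float) -> int:
    """Minimum number of whole GPUs needed to sustain a target generation rate."""
    return math.ceil(tokens_per_hour / (tps_per_gpu * 3600))

TARGET = 1e9  # one billion tokens per hour
for model, tps in [("SmolLM2-135M", 45_540), ("Qwen3-4B", 8_086),
                   ("Qwen3-8B", 6_443), ("gpt-oss-120b", 6_117),
                   ("gemma-3-27b-it", 1_724)]:
    print(f"{model}: {gpus_for_rate(TARGET, tps)} H100 GPUs")
```

The per-node figure checks out the same way: 8 GPUs × 6,117 tps × 3,600 s ≈ 176 million tokens per hour.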
| 393 |
| 394 | #### Key findings
| 395 |

| 432 |
| 433 | ##### Models with large speedups
| 434 |
| 435 | + **[gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) and [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) (1.95x and 1.78x via tp=2).** Both are MoE models that are severely **memory-bound at tp=1**. gpt-oss-120b (120B total, ~5B active) fits on a single GPU but leaves almost no room for the KV cache: server logs show only ~45,520 tokens of KV capacity at tp=1, enough for roughly 5 concurrent sequences at our 8,192-token context length. At tp=2 that jumps to ~810,000 tokens of KV capacity, enough for ~99 concurrent sequences. Moving to tp=2 halves per-GPU model memory and roughly doubles KV cache capacity, allowing the scheduler to batch far more sequences. The same pattern holds for Qwen3-30B-A3B (30B total, ~3B active). For these large MoE models, tp>1 is critical not for compute parallelism but for **KV cache headroom**: the compute overhead of cross-GPU communication is minimal because only the active parameters participate in each forward pass.
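The concurrency figures here are simply KV capacity divided by the context length. A sketch (the function name is ours; the capacities are the server-log numbers quoted above; flooring ~810,000 / 8,192 gives 98 whole sequences, matching the ~99 quoted since the capacity itself is approximate):

```python
def max_concurrent_sequences(kv_capacity_tokens: int, context_len: int = 8_192) -> int:
    """Upper bound on sequences the scheduler can batch if each uses the full context."""
    return kv_capacity_tokens // context_len

# KV capacities from the vLLM server logs for gpt-oss-120b
print(max_concurrent_sequences(45_520))   # tp=1 -> 5
print(max_concurrent_sequences(810_000))  # tp=2 -> 98
```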
| 436 |
| 437 | **SmolLM2 models (1.34x-1.75x via speculative decoding).** These models are tiny enough that a single GPU has abundant memory. The bottleneck is the sequential nature of autoregressive decoding. Speculative decoding generates multiple tokens per verification step:
| 438 |

| 453 | | [gemma-3-27b-it](https://huggingface.co/google/gemma-3-27b-it) | 19-26% | 2.1-2.6 | −11% |
| 454 | | [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) | 20-31% | 2.2-2.9 | −16% |
| 455 |
| 456 | + The small SmolLM2 models achieve 64-84% acceptance rates with 5-6 tokens accepted per step, making speculation highly profitable. The medium/large models ([Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B), [gemma-3-12b-it](https://huggingface.co/google/gemma-3-12b-it)/[27b-it](https://huggingface.co/google/gemma-3-27b-it), [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b)) only achieve 20-30% acceptance with ~2.3 tokens per step, barely better than no speculation. A likely explanation is that larger models generate more diverse, paraphrased text that diverges further from the input prompt, giving n-gram matching fewer opportunities for exact phrase reuse. At these low acceptance rates, the overhead of drafting and verifying rejected tokens outweighs the benefit.
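One way to build intuition for why low acceptance kills speculation: under a simplified model of our own (not the post's measurement methodology) where each drafted token is accepted independently with probability alpha, the expected tokens emitted per verification step with k drafts is (1 - alpha^(k+1)) / (1 - alpha). This won't reproduce the table's per-step numbers exactly, but it shows the same cliff between high and low acceptance:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step with k drafted tokens,
    assuming i.i.d. per-token acceptance probability alpha (plus one bonus token):
    sum_{i=0..k} alpha^i = (1 - alpha^(k+1)) / (1 - alpha)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

print(expected_tokens_per_step(0.8, 5))   # high acceptance: ~3.69 tokens/step
print(expected_tokens_per_step(0.25, 5))  # low acceptance:  ~1.33 tokens/step
```

At low per-token acceptance the expected gain stays near one token per step, so the fixed drafting and verification overhead dominates.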
| 457 |
| 458 | The tutorial-rewriting task is particularly amenable to speculative decoding because the output frequently contains phrases from the input document, giving both ngram and suffix methods high acceptance rates. Tasks that preserve even more of the input text (such as summarization, text continuation, or guided rewriting where the model is explicitly asked to maintain the original author's voice) would likely see even larger speedups from speculative decoding, since draft acceptance rates would be higher.
| 459 |
app/src/content/chapters/introduction.mdx
CHANGED
@@ -1,6 +1,8 @@
| 1 |
| 2 | import HtmlEmbed from "../../components/HtmlEmbed.astro";
| 3 | import FigRef from "../../components/FigRef.astro";
| 4 |
| 5 | {/*
| 6 | Notes:
@@ -16,23 +18,30 @@ If you read some of the latest LLM papers (e.g., Nemotron 3 [@nemotron3], Qwen3
| 16 | - When approaching the scaling limits of web data people started to more aggressively filter the data and the discussion shifted from volume to quality. Starting with stronger heuristics including deduplication pipelines and eventually switching to neural classifiers looking for "educational" or "instruction-like" data. The first model trainings were conservative with repeating data but with higher quality data some repetitions seemed fine.
| 17 | - Now that we have mostly exhausted web text data and concluded that quality is more important, synthetic data has become an interesting option to up-cycle the data that the classifiers would have normally excluded and thus increase the volume of data again. The latest LLMs were trained on trillions of synthetic tokens, matching the volume of unaltered data.
| 18 |
| 19 | - Besides pretraining, synthetic data generation also has become a useful tool for post-training. It is applied to fill gaps identified in models. A fun anecdote is the SmolLM2 [@smollm2] training, where we noticed the model was decent at coding and math, but totally went off the rails with small talk queries (e.g. "How are you?", "Hi", "What's up?"). Synthetically generating a [small talk dataset](https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k/viewer/default/train_sft?row=0) quickly solved this issue.
| 20 | -
| 21 | We are seeing a radical shift in compute allocation for model training: while the model training dominated the compute budget early on, we see more and more compute allocated to curate and improve the training datasets, both in pretraining and post-training.
| 22 |
| 23 | -
| 24 |
| 25 | In this blog post we take a journey to answer all these questions systematically. We ran XXX experiments and generated YYY tokens in total to find the ideal settings for synthetic data.
| 26 |
| 27 | Here's the plan:
| 28 |
| 29 | - We start with the infrastructure needed for synthetic data generation at scale. This includes some extensions we made to the datatrove library and crucially detailed throughput benchmarking of popular models you might want to use for synthetic data generation. This is super important to get the most data for your bucks.
| 30 |
| 31 | - We continue with a walkthrough of the different approaches for synthetic data in pretraining, from explaining what prior work did to the prompts we are experimenting with.
| 32 |
| 33 | - Finally we present the suite of XXX experiments we ran to figure out best practices regarding what models, prompts and settings work well.
| 34 |
| 35 | - Here's a preview of where we end up: FinePhrase, our best configuration, clearly outperforms all existing synthetic data baselines (<FigRef target="finephrase-vs-baselines" />). The rest of this post explains
| 36 |
| 37 | <HtmlEmbed
| 38 | id="finephrase-vs-baselines"
| 1 |
| 2 | + import Image from "../../components/Image.astro";
| 3 | import HtmlEmbed from "../../components/HtmlEmbed.astro";
| 4 | import FigRef from "../../components/FigRef.astro";
| 5 | + import syntheticDataScaleImg from "../assets/image/synthetic-data-scale.jpg";
| 6 |
| 7 | {/*
| 8 | Notes:

| 18 | - When approaching the scaling limits of web data people started to more aggressively filter the data and the discussion shifted from volume to quality. Starting with stronger heuristics including deduplication pipelines and eventually switching to neural classifiers looking for "educational" or "instruction-like" data. The first model trainings were conservative with repeating data but with higher quality data some repetitions seemed fine.
| 19 | - Now that we have mostly exhausted web text data and concluded that quality is more important, synthetic data has become an interesting option to up-cycle the data that the classifiers would have normally excluded and thus increase the volume of data again. The latest LLMs were trained on trillions of synthetic tokens, matching the volume of unaltered data.
| 20 |
| 21 | We are seeing a radical shift in compute allocation for model training: while the model training dominated the compute budget early on, we see more and more compute allocated to curate and improve the training datasets, both in pretraining and post-training.
| 22 |
| 23 | + The scale is staggering: NVIDIA used LLMs to rephrase around 2 trillion tokens of web text for their [Nemotron-CC dataset](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2) [@nemotroncc], while Z.ai generated 500 billion reasoning tokens to mid-train the GLM-4.5 series [@glm45]. <FigRef target="synthetic-data-scale" /> shows just how much synthetic data recent models are using.
| 24 | +
| 25 | + <figure id="synthetic-data-scale">
| 26 | + <Image src={syntheticDataScaleImg} alt="Scale of synthetic data in recent LLM training runs" />
| 27 | + <figcaption>Scale of synthetic data usage in recent LLM training runs. Several recent models were trained on hundreds of billions to trillions of synthetic tokens.</figcaption>
| 28 | + </figure>
| 29 | +
| 30 | + Synthetic data also plays a central role in post-training via *distillation*, where a capable model generates targeted training data for domains like reasoning, instruction-following, and tool-use. For example, [SmolLM3](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook) [@smollm3] was post-trained almost entirely on data generated from models like DeepSeek-R1 [@deepseekr1] and Qwen3. Another fun anecdote is the SmolLM2 [@smollm2] training, where we noticed the model was decent at coding and math, but totally went off the rails with small talk queries (e.g. "How are you?", "Hi", "What's up?"). Synthetically generating a [small talk dataset](https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k/viewer/default/train_sft?row=0) quickly solved this issue.
| 31 | +
| 32 | + However, how to do synthetic data generation properly still resembles alchemy these days: Which model should you use? Which prompts work best and how many do you need? And how do you even scale this effectively?
| 33 |
| 34 | In this blog post we take a journey to answer all these questions systematically. We ran XXX experiments and generated YYY tokens in total to find the ideal settings for synthetic data.
| 35 |
| 36 | Here's the plan:
| 37 |
| 38 | + We start with the [Infrastructure](#infrastructure) needed for synthetic data generation at scale. This includes some extensions we made to the datatrove library and crucially detailed throughput benchmarking of popular models you might want to use for synthetic data generation. This is super important to get the most data for your bucks.
| 39 |
| 40 | + We continue with the [Setup](#setup), a walkthrough of the different approaches for synthetic data in pretraining, from explaining what prior work did to the prompts we are experimenting with.
| 41 |
| 42 | + Finally we present the suite of XXX [Experiments](#experiments) we ran to figure out best practices regarding what models, prompts and settings work well.
| 43 |
| 44 | + Here's a preview of where we end up: FinePhrase, our best configuration, clearly outperforms all existing synthetic data baselines (<FigRef target="finephrase-vs-baselines" />). The rest of this post explains what's needed to get there.
| 45 |
| 46 | <HtmlEmbed
| 47 | id="finephrase-vs-baselines"