joelniklaus HF Staff committed on
Commit 462a612 · 1 Parent(s): 1e2edcd

improved the infrastructure section based on Lewis' feedback

app/src/content/chapters/5-infrastructure.mdx CHANGED
@@ -7,15 +7,15 @@ import ReadingTime from "../../components/ReadingTime.astro";

## Infrastructure

- <ReadingTime words={4983} visuals={9} />

- When you start generating your first synthetic tokens with LLMs you notice quickly that this is an extremely slow and compute-heavy process. Even though we can cache KV values from previous tokens, we still need one forward pass for *every* token, and every web document typically has a few thousand tokens. The first step before running any large-scale experiments is setting up infrastructure that generates as efficiently and scalably as possible.

- So what does it actually take to generate a trillion tokens of synthetic data? Thanks to fast inference engines like [vLLM](https://github.com/vllm-project/vllm) [@vllm] and [SGLang](https://github.com/sgl-project/sglang) [@sglang], the bottleneck isn't the generation itself but the *infrastructure* around it: orchestrating thousands of prompts, keeping GPUs saturated, checkpointing outputs, and pushing everything to storage without losing progress when a worker crashes.

- We made major extensions to [DataTrove](https://github.com/huggingface/datatrove) [@datatrove] to manage this entire process. These extensions package the scaffolding we built for our own synthetic data pipelines and make it accessible to anyone who wants to generate high-quality datasets at scale. DataTrove supports both local generation and large-scale distributed runs on Slurm clusters, handling chunking, checkpointing, distributed queueing, and Hugging Face dataset management so you can focus on synthetic data design rather than operational glue.

- In this section we show how DataTrove can be used to generate billions of tokens across several model scales, ranging from 100 million to 1 trillion parameters. <FigRef target="datatrove-pipeline" /> gives an overview of the pipeline. Let's dive in!

<HtmlEmbed
id="datatrove-pipeline"
@@ -75,7 +75,11 @@ At the heart of our inference system lies a powerful abstraction: the **rollout

This design separates *what* you want to generate from *how* the inference engine batches and executes requests. You focus on your application logic. The runner handles efficient GPU utilization.

- #### Example 1: Simple Single-Request Rollout

The simplest rollout sends one request per document and returns the result directly:

@@ -94,7 +98,9 @@ The returned `InferenceResult` is automatically stored under `document.metadata[

**Use case: Rephrasing web documents for LLM training.** You're building a training corpus by rephrasing web documents into cleaner, more consistent prose. Most documents fit within context, outputs stay under 4k tokens, and you want minimal overhead. One request per document, no chunking logic, no coordination. The rollout wraps each document in a rephrasing prompt and returns the rewritten text directly.

- #### Example 2: Chunked Rollout for Long Documents

When documents exceed your model's context window, you can split them into chunks and stitch generations together:

@@ -129,7 +135,9 @@ Each chunk builds on the previous generation, allowing the model to maintain coh

**Use case: Translating long web documents.** You're translating multilingual web content into English at massive scale. Many documents exceed context limits, so you split them into 512-token chunks and translate with a sliding window. Each chunk is translated while keeping the previous (already translated) chunk in the prompt for context. This maintains coherence across chunk boundaries. The [FineTranslations](https://huggingface.co/datasets/HuggingFaceFW/finetranslations) project used this approach to translate over 1 trillion tokens across 500+ languages.

- #### Example 3: CPU-Heavy Preprocessing with Process Pools

For rollouts that require expensive CPU work (parsing, image processing, etc.), you can offload preprocessing to a process pool via `shared_context`:

@@ -183,12 +191,14 @@ InferenceRunner(
The pool is initialized lazily and shared across all rollout invocations, keeping CPU-bound work off the async event loop.

**Use case: PDF document understanding.** You're building a pipeline to extract structured information from scanned PDFs. Each document requires CPU-intensive OCR preprocessing before the text can be sent to the LLM for extraction. By offloading the OCR to a process pool, you keep the GPU fed with generation requests while workers handle the parsing in parallel.

- #### Running Multiple Rollouts per Document

Need multiple samples per document? Set `rollouts_per_document` in your `InferenceConfig`. All successful outputs are collected under `document.metadata["rollout_results"]` as a list.

**Use case: Best-of-N sampling for code generation.** When generating code solutions, you want multiple attempts per problem to increase the chance of a correct answer. Set `rollouts_per_document=10` and later filter for solutions that pass your test suite.

### Throughput Benchmarking

@@ -277,7 +287,7 @@ The benchmark config defines **801 unique configurations** across 8 experiment g

#### What these numbers mean in practice

- Let's make this concrete with [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b), a strong MoE model that balances quality and throughput well. Say you want to generate 10 billion tokens. With the baseline vLLM configuration (tp=1, 3,138 tps/gpu), that takes **885 GPU-hours** and costs roughly **2,656 USD** at 3 USD/H100-hour. With the optimized configuration (tp=2, 6,117 tps/gpu), it drops to **454 GPU-hours** and **1,362 USD**, a saving of **431 GPU-hours and ~1,300 USD** (49%) from nothing more than picking the right serving parameters. Scale this up to a trillion tokens and the savings run into hundreds of thousands of dollars.

These per-GPU numbers also answer a natural question: how many GPUs does it take to generate **a billion tokens per hour**? With the optimized configurations from our sweep:
 
@@ -7,15 +7,15 @@ import ReadingTime from "../../components/ReadingTime.astro";

## Infrastructure

+ <ReadingTime words={4780} visuals={9} />

+ Each of our 90 experiments requires rephrasing around 10 billion tokens of web text. Even with KV caching, every output token still needs its own forward pass, and a typical web document runs to a few thousand tokens. With the wrong serving configuration, a single experiment can take weeks instead of days. Multiply that by 90 and the difference between a good and a bad setup is months of GPU time.

+ Thanks to fast inference engines like [vLLM](https://github.com/vllm-project/vllm) [@vllm] and [SGLang](https://github.com/sgl-project/sglang) [@sglang], the bottleneck isn't the generation itself but the *infrastructure* around it: orchestrating thousands of prompts, keeping GPUs saturated, checkpointing outputs, and pushing everything to storage without losing progress when a worker crashes.

+ We made major extensions to [DataTrove](https://github.com/huggingface/datatrove) [@datatrove] to handle this. DataTrove supports both local generation and large-scale distributed runs on Slurm clusters, handling chunking, checkpointing, distributed queueing, and Hugging Face dataset management so you can focus on synthetic data design rather than operational glue. We used it for every experiment in this blog post, from 10k-example test runs to the full FinePhrase production pipeline.

+ <FigRef target="datatrove-pipeline" /> gives an overview of the pipeline. Let's dive in!

<HtmlEmbed
id="datatrove-pipeline"
 
@@ -75,7 +75,11 @@ At the heart of our inference system lies a powerful abstraction: the **rollout

This design separates *what* you want to generate from *how* the inference engine batches and executes requests. You focus on your application logic. The runner handles efficient GPU utilization.

+ <Sidenote>
+ For rephrasing, the simple single-request rollout is all you need. The other rollout patterns below show how DataTrove handles more complex use cases like translating long documents, CPU-heavy preprocessing, and best-of-N sampling.
+ </Sidenote>
+
+ <Accordion title="Simple single-request rollout" open>

The simplest rollout sends one request per document and returns the result directly:

 
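As a sketch of the single-request pattern just described — using hypothetical stand-in names rather than DataTrove's actual API — a rollout can be as small as one async function that wraps the document in a prompt and stores the output:

```python
# Sketch of the single-request rollout pattern. All names here (Document,
# fake_generate, simple_rollout) are illustrative stand-ins, not DataTrove's API.
import asyncio
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)

REPHRASE_PROMPT = "Rewrite the following text in clean, consistent prose:\n\n{}"

async def fake_generate(prompt: str) -> str:
    """Stand-in for one request to the inference engine."""
    await asyncio.sleep(0)   # yield to the event loop, as a real call would
    return prompt.upper()    # dummy "generation"

async def simple_rollout(document: Document) -> str:
    # One request per document, result returned directly.
    return await fake_generate(REPHRASE_PROMPT.format(document.text))

doc = Document(text="some web text")
doc.metadata["inference_results"] = asyncio.run(simple_rollout(doc))
```

The runner, not the rollout, decides how many such requests are in flight at once, which is what keeps the GPUs saturated.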
@@ -94,7 +98,9 @@ The returned `InferenceResult` is automatically stored under `document.metadata[

**Use case: Rephrasing web documents for LLM training.** You're building a training corpus by rephrasing web documents into cleaner, more consistent prose. Most documents fit within context, outputs stay under 4k tokens, and you want minimal overhead. One request per document, no chunking logic, no coordination. The rollout wraps each document in a rephrasing prompt and returns the rewritten text directly.

+ </Accordion>
+
+ <Accordion title="Chunked rollout for long documents">

When documents exceed your model's context window, you can split them into chunks and stitch generations together:

 
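The chunk-and-stitch idea can be sketched in a few lines. Chunk size, prompt format, and the translate call below are illustrative placeholders, not the actual pipeline code:

```python
# Sketch of the chunked rollout: split a long document into fixed-size chunks
# and carry the previous generation into the next prompt for coherence.
import asyncio

CHUNK_SIZE = 5  # words per chunk here; the text uses 512-token chunks

async def fake_translate(prompt: str) -> str:
    """Stand-in for a translation request; echoes the chunk part of the prompt."""
    await asyncio.sleep(0)
    return prompt.rsplit("TEXT: ", 1)[-1]

def chunks(words, size):
    for i in range(0, len(words), size):
        yield " ".join(words[i : i + size])

async def chunked_rollout(text: str) -> str:
    previous = ""    # previously generated chunk, kept in the next prompt
    outputs = []
    for chunk in chunks(text.split(), CHUNK_SIZE):
        prompt = f"CONTEXT: {previous}\nTEXT: {chunk}"
        previous = await fake_translate(prompt)
        outputs.append(previous)
    return " ".join(outputs)  # stitch the generations back together

result = asyncio.run(chunked_rollout("one two three four five six seven"))
```

Because each prompt sees the previous output, the model can keep terminology and style consistent across chunk boundaries.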
@@ -129,7 +135,9 @@ Each chunk builds on the previous generation, allowing the model to maintain coh

**Use case: Translating long web documents.** You're translating multilingual web content into English at massive scale. Many documents exceed context limits, so you split them into 512-token chunks and translate with a sliding window. Each chunk is translated while keeping the previous (already translated) chunk in the prompt for context. This maintains coherence across chunk boundaries. The [FineTranslations](https://huggingface.co/datasets/HuggingFaceFW/finetranslations) project used this approach to translate over 1 trillion tokens across 500+ languages.

+ </Accordion>
+
+ <Accordion title="CPU-heavy preprocessing with process pools">

For rollouts that require expensive CPU work (parsing, image processing, etc.), you can offload preprocessing to a process pool via `shared_context`:

 
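The lazy-pool pattern looks roughly like this. `shared_context` and `heavy_parse` are illustrative names; a thread pool stands in for the process pool so the sketch runs anywhere, and for genuinely CPU-bound work you would swap in `concurrent.futures.ProcessPoolExecutor`:

```python
# Sketch: expensive preprocessing runs in an executor pool so it never blocks
# the async event loop that keeps the GPU fed with generation requests.
import asyncio
from concurrent.futures import ThreadPoolExecutor  # ProcessPoolExecutor for real CPU-bound work

def heavy_parse(raw: bytes) -> str:
    # Stand-in for OCR / PDF parsing (CPU-intensive in a real pipeline).
    return raw.decode().strip().lower()

async def rollout(raw: bytes, shared_context: dict) -> str:
    # Initialize the pool lazily, once, and share it across all rollout calls.
    pool = shared_context.setdefault("pool", ThreadPoolExecutor(max_workers=2))
    loop = asyncio.get_running_loop()
    text = await loop.run_in_executor(pool, heavy_parse, raw)  # off the event loop
    return f"EXTRACTED: {text}"  # a real rollout would now send `text` to the LLM

shared_context = {}
out = asyncio.run(rollout(b"  Scanned PDF Text  ", shared_context))
```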
@@ -183,12 +191,14 @@ InferenceRunner(

The pool is initialized lazily and shared across all rollout invocations, keeping CPU-bound work off the async event loop.

**Use case: PDF document understanding.** You're building a pipeline to extract structured information from scanned PDFs. Each document requires CPU-intensive OCR preprocessing before the text can be sent to the LLM for extraction. By offloading the OCR to a process pool, you keep the GPU fed with generation requests while workers handle the parsing in parallel.
+ </Accordion>

+ <Accordion title="Multiple rollouts per document">

Need multiple samples per document? Set `rollouts_per_document` in your `InferenceConfig`. All successful outputs are collected under `document.metadata["rollout_results"]` as a list.

**Use case: Best-of-N sampling for code generation.** When generating code solutions, you want multiple attempts per problem to increase the chance of a correct answer. Set `rollouts_per_document=10` and later filter for solutions that pass your test suite.
+ </Accordion>
 
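The filtering step after such a run can be sketched as follows. The generator and the pass/fail check are dummies; only the `rollouts_per_document` / `rollout_results` shape mirrors the config described above:

```python
# Sketch of best-of-N filtering: run N rollouts per document, keep the ones
# that pass a check. generate_solution and passes_tests are stand-ins.
import random

def generate_solution(problem: str, seed: int) -> str:
    rng = random.Random(seed)                    # deterministic per rollout
    return f"def solve(): return {rng.randint(0, 3)}"

def passes_tests(solution: str) -> bool:
    return solution.endswith("return 2")         # stand-in for a real test suite

rollouts_per_document = 10
# Mirrors document.metadata["rollout_results"]: one output per rollout.
rollout_results = [generate_solution("problem", seed=s) for s in range(rollouts_per_document)]
accepted = [s for s in rollout_results if passes_tests(s)]
```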

### Throughput Benchmarking

 
@@ -277,7 +287,7 @@ The benchmark config defines **801 unique configurations** across 8 experiment g

#### What these numbers mean in practice

+ Let's make this concrete. Each of our ablation experiments rephrases roughly 10 billion tokens. Consider [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b), a strong MoE model that balances quality and throughput well. With the baseline vLLM configuration (tp=1, 3,138 tps/gpu), a single 10B-token experiment takes **885 GPU-hours** and costs roughly **2,656 USD** at 3 USD/H100-hour. With the optimized configuration (tp=2, 6,117 tps/gpu), it drops to **454 GPU-hours** and **1,362 USD**, a saving of **431 GPU-hours and ~1,300 USD** (49%) from nothing more than picking the right serving parameters. Over 90 experiments, that difference adds up to tens of thousands of GPU-hours and well over 100,000 USD.
 
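The GPU-hour arithmetic in that paragraph is easy to reproduce; the 3 USD/H100-hour price and the two throughput figures are the ones quoted above:

```python
# Reproduce the cost arithmetic: GPU-hours and USD to generate `tokens` tokens
# at a given per-GPU throughput (tokens per second per GPU).
def generation_cost(tokens: float, tps_per_gpu: float, usd_per_gpu_hour: float = 3.0):
    gpu_hours = tokens / tps_per_gpu / 3600
    return gpu_hours, gpu_hours * usd_per_gpu_hour

baseline_h, baseline_usd = generation_cost(10e9, 3_138)    # vLLM defaults, tp=1
optimized_h, optimized_usd = generation_cost(10e9, 6_117)  # tuned config, tp=2
```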
These per-GPU numbers also answer a natural question: how many GPUs does it take to generate **a billion tokens per hour**? With the optimized configurations from our sweep: