Commit 462a612 (parent: 1e2edcd): improved the infrastructure section based on Lewis' feedback

app/src/content/chapters/5-infrastructure.mdx (changed)
@@ -7,15 +7,15 @@ import ReadingTime from "../../components/ReadingTime.astro";

 ## Infrastructure

-<ReadingTime words={

-

-

-We made major extensions to [DataTrove](https://github.com/huggingface/datatrove) [@datatrove] to

-

 <HtmlEmbed
 id="datatrove-pipeline"

@@ -75,7 +75,11 @@ At the heart of our inference system lies a powerful abstraction: the **rollout

 This design separates *what* you want to generate from *how* the inference engine batches and executes requests. You focus on your application logic. The runner handles efficient GPU utilization.

-

 The simplest rollout sends one request per document and returns the result directly:

@@ -94,7 +98,9 @@ The returned `InferenceResult` is automatically stored under `document.metadata[

 **Use case: Rephrasing web documents for LLM training.** You're building a training corpus by rephrasing web documents into cleaner, more consistent prose. Most documents fit within context, outputs stay under 4k tokens, and you want minimal overhead. One request per document, no chunking logic, no coordination. The rollout wraps each document in a rephrasing prompt and returns the rewritten text directly.

-

 When documents exceed your model's context window, you can split them into chunks and stitch generations together:

@@ -129,7 +135,9 @@ Each chunk builds on the previous generation, allowing the model to maintain coh

 **Use case: Translating long web documents.** You're translating multilingual web content into English at massive scale. Many documents exceed context limits, so you split them into 512-token chunks and translate with a sliding window. Each chunk is translated while keeping the previous (already translated) chunk in the prompt for context. This maintains coherence across chunk boundaries. The [FineTranslations](https://huggingface.co/datasets/HuggingFaceFW/finetranslations) project used this approach to translate over 1 trillion tokens across 500+ languages.

-

 For rollouts that require expensive CPU work (parsing, image processing, etc.), you can offload preprocessing to a process pool via `shared_context`:

@@ -183,12 +191,14 @@ InferenceRunner(
 The pool is initialized lazily and shared across all rollout invocations, keeping CPU-bound work off the async event loop.

 **Use case: PDF document understanding.** You're building a pipeline to extract structured information from scanned PDFs. Each document requires CPU-intensive OCR preprocessing before the text can be sent to the LLM for extraction. By offloading the OCR to a process pool, you keep the GPU fed with generation requests while workers handle the parsing in parallel.

-

 Need multiple samples per document? Set `rollouts_per_document` in your `InferenceConfig`. All successful outputs are collected under `document.metadata["rollout_results"]` as a list.

 **Use case: Best-of-N sampling for code generation.** When generating code solutions, you want multiple attempts per problem to increase the chance of a correct answer. Set `rollouts_per_document=10` and later filter for solutions that pass your test suite.

 ### Throughput Benchmarking

@@ -277,7 +287,7 @@ The benchmark config defines **801 unique configurations** across 8 experiment g

 #### What these numbers mean in practice

-Let's make this concrete

 These per-GPU numbers also answer a natural question: how many GPUs does it take to generate **a billion tokens per hour**? With the optimized configurations from our sweep:

 ## Infrastructure

+<ReadingTime words={4780} visuals={9} />

+Each of our 90 experiments requires rephrasing around 10 billion tokens of web text. Even with KV caching, every output token still needs its own forward pass, and every web document has a few thousand tokens. With the wrong serving configuration, a single experiment can take weeks instead of days. Multiply that by 90 and the difference between a good and bad setup is months of GPU time.

+Thanks to fast inference engines like [vLLM](https://github.com/vllm-project/vllm) [@vllm] and [SGLang](https://github.com/sgl-project/sglang) [@sglang], the bottleneck isn't the generation itself but the *infrastructure* around it: orchestrating thousands of prompts, keeping GPUs saturated, checkpointing outputs, and pushing everything to storage without losing progress when a worker crashes.

+We made major extensions to [DataTrove](https://github.com/huggingface/datatrove) [@datatrove] to handle this. DataTrove supports both local generation and large-scale distributed runs on Slurm clusters, handling chunking, checkpointing, distributed queueing, and Hugging Face dataset management so you can focus on synthetic data design rather than operational glue. We used it for every experiment in this blog post, from 10k-example test runs to the full FinePhrase production pipeline.

+<FigRef target="datatrove-pipeline" /> gives an overview of the pipeline. Let's dive in!

 <HtmlEmbed
 id="datatrove-pipeline"

 This design separates *what* you want to generate from *how* the inference engine batches and executes requests. You focus on your application logic. The runner handles efficient GPU utilization.

+<Sidenote>
+For rephrasing, the simple single-request rollout is all you need. The other rollout patterns below show how DataTrove handles more complex use cases like translating long documents, CPU-heavy preprocessing, and best-of-N sampling.
+</Sidenote>
+
+<Accordion title="Simple single-request rollout" open>

 The simplest rollout sends one request per document and returns the result directly:
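The unchanged rollout code itself is collapsed in this diff. As a rough sketch of the single-request pattern it describes (all names here are hypothetical, not DataTrove's actual API):

```python
# Hypothetical sketch of a single-request rollout: one prompt per document,
# result returned directly. Document, rephrase_rollout, and generate are
# illustrative stand-ins, not DataTrove's actual API.
import asyncio
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)

PROMPT = "Rephrase the following web text into clean, consistent prose:\n\n{text}"

async def rephrase_rollout(document: Document, generate) -> str:
    # `generate` stands in for the runner-provided coroutine that sends one
    # request to the inference engine and returns the generated text.
    return await generate(PROMPT.format(text=document.text))
```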


 **Use case: Rephrasing web documents for LLM training.** You're building a training corpus by rephrasing web documents into cleaner, more consistent prose. Most documents fit within context, outputs stay under 4k tokens, and you want minimal overhead. One request per document, no chunking logic, no coordination. The rollout wraps each document in a rephrasing prompt and returns the rewritten text directly.

+</Accordion>
+
+<Accordion title="Chunked rollout for long documents">

 When documents exceed your model's context window, you can split them into chunks and stitch generations together:
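The chunked rollout code is likewise collapsed in the diff. A minimal sketch of the chunk-and-stitch pattern, keeping the previous chunk's output in the prompt for context (illustrative names, not the file's actual implementation):

```python
# Hypothetical sketch of a chunked "sliding window" rollout: split the
# document into fixed-size chunks, generate per chunk with the previous
# chunk's output as context, then stitch the pieces back together.
# Not DataTrove's actual API.
import asyncio

def split_chunks(tokens: list[str], size: int) -> list[list[str]]:
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

async def chunked_rollout(tokens: list[str], generate, chunk_size: int = 512) -> str:
    pieces = []
    previous = ""  # output for the preceding chunk, used as context
    for chunk in split_chunks(tokens, chunk_size):
        prompt = f"Previous translation:\n{previous}\n\nTranslate:\n{' '.join(chunk)}"
        previous = await generate(prompt)
        pieces.append(previous)
    # Stitch the per-chunk generations back together.
    return " ".join(pieces)
```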


 **Use case: Translating long web documents.** You're translating multilingual web content into English at massive scale. Many documents exceed context limits, so you split them into 512-token chunks and translate with a sliding window. Each chunk is translated while keeping the previous (already translated) chunk in the prompt for context. This maintains coherence across chunk boundaries. The [FineTranslations](https://huggingface.co/datasets/HuggingFaceFW/finetranslations) project used this approach to translate over 1 trillion tokens across 500+ languages.

+</Accordion>
+
+<Accordion title="CPU-heavy preprocessing with process pools">

 For rollouts that require expensive CPU work (parsing, image processing, etc.), you can offload preprocessing to a process pool via `shared_context`:


 The pool is initialized lazily and shared across all rollout invocations, keeping CPU-bound work off the async event loop.
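The `shared_context` code is collapsed in the diff; the general idea of keeping CPU-bound work off the event loop can be sketched with the standard library (a hypothetical rollout shape, not DataTrove's actual `shared_context` API):

```python
# Hypothetical sketch: offload CPU-heavy preprocessing to a lazily created
# process pool cached in `shared_context`, so the async event loop stays
# free to dispatch generation requests. Not DataTrove's actual API.
import asyncio
from concurrent.futures import ProcessPoolExecutor

def expensive_parse(raw: str) -> str:
    # Stand-in for CPU-bound work such as OCR or HTML parsing;
    # runs inside a worker process.
    return raw.strip().upper()

async def ocr_rollout(raw_page: str, generate, shared_context: dict) -> str:
    # Create the pool on first use and reuse it across rollout invocations.
    pool = shared_context.setdefault("pool", ProcessPoolExecutor(max_workers=2))
    loop = asyncio.get_running_loop()
    parsed = await loop.run_in_executor(pool, expensive_parse, raw_page)
    return await generate(f"Extract structured fields from:\n{parsed}")
```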

 **Use case: PDF document understanding.** You're building a pipeline to extract structured information from scanned PDFs. Each document requires CPU-intensive OCR preprocessing before the text can be sent to the LLM for extraction. By offloading the OCR to a process pool, you keep the GPU fed with generation requests while workers handle the parsing in parallel.
+</Accordion>

+<Accordion title="Multiple rollouts per document">

 Need multiple samples per document? Set `rollouts_per_document` in your `InferenceConfig`. All successful outputs are collected under `document.metadata["rollout_results"]` as a list.

 **Use case: Best-of-N sampling for code generation.** When generating code solutions, you want multiple attempts per problem to increase the chance of a correct answer. Set `rollouts_per_document=10` and later filter for solutions that pass your test suite.
+</Accordion>
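To make the best-of-N flow concrete, here is a hedged sketch of the downstream filtering step. It assumes the `rollout_results` metadata layout described in the text; `passes_tests` is a hypothetical stand-in for a real test harness:

```python
# Hypothetical best-of-N filter: keep documents with at least one rollout
# that passes the test suite. Assumes outputs live under
# document.metadata["rollout_results"]; passes_tests is a stand-in.

def passes_tests(solution: str) -> bool:
    # Stand-in check; a real harness would execute the candidate code
    # against unit tests in a sandbox.
    return "return" in solution

def select_passing(documents: list[dict]) -> list[dict]:
    kept = []
    for doc in documents:
        candidates = doc["metadata"].get("rollout_results", [])
        winners = [c for c in candidates if passes_tests(c)]
        if winners:
            # Keep the first passing sample (any selection policy works here).
            doc["metadata"]["best_rollout"] = winners[0]
            kept.append(doc)
    return kept
```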

 ### Throughput Benchmarking


 #### What these numbers mean in practice

+Let's make this concrete. Each of our ablation experiments rephrases roughly 10 billion tokens. Consider [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b), a strong MoE model that balances quality and throughput well. With the baseline vLLM configuration (tp=1, 3,138 tps/gpu), a single 10B-token experiment takes **885 GPU-hours** and costs roughly **2,656 USD** at 3 USD/H100-hour. With the optimized configuration (tp=2, 6,117 tps/gpu), it drops to **454 GPU-hours** and **1,362 USD**, a saving of **431 GPU-hours and ~1,300 USD** (49%) from nothing more than picking the right serving parameters. Over 90 experiments, that difference adds up to tens of thousands of GPU-hours and well over 100,000 USD.
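The figures above follow directly from tokens divided by throughput; a quick sanity check of the arithmetic, using only the numbers quoted in the text:

```python
# Sanity-check the cost arithmetic: GPU-hours = tokens / (tps * 3600),
# cost = GPU-hours * hourly rate. Figures are the ones quoted in the text.
TOKENS = 10e9   # tokens per ablation experiment
RATE = 3.0      # USD per H100-hour

def gpu_hours(tps_per_gpu: float) -> float:
    return TOKENS / (tps_per_gpu * 3600)

baseline = gpu_hours(3138)    # vLLM tp=1
optimized = gpu_hours(6117)   # vLLM tp=2

print(round(baseline))         # 885 GPU-hours
print(round(optimized))        # 454 GPU-hours
print(round(baseline * RATE))  # 2656 USD
print(round(optimized * RATE)) # 1362 USD
```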

 These per-GPU numbers also answer a natural question: how many GPUs does it take to generate **a billion tokens per hour**? With the optimized configurations from our sweep:
