Commit 462a612 (parent: 1e2edcd): improved the infrastructure section based on Lewis' feedback

app/src/content/chapters/5-infrastructure.mdx (changed)
@@ -7,15 +7,15 @@ import ReadingTime from "../../components/ReadingTime.astro";

 ## Infrastructure

-<ReadingTime words={

-

-

-We made major extensions to [DataTrove](https://github.com/huggingface/datatrove) [@datatrove] to

-

 <HtmlEmbed
 id="datatrove-pipeline"

@@ -75,7 +75,11 @@ At the heart of our inference system lies a powerful abstraction: the **rollout

 This design separates *what* you want to generate from *how* the inference engine batches and executes requests. You focus on your application logic. The runner handles efficient GPU utilization.

-

 The simplest rollout sends one request per document and returns the result directly:

@@ -94,7 +98,9 @@ The returned `InferenceResult` is automatically stored under `document.metadata[

 **Use case: Rephrasing web documents for LLM training.** You're building a training corpus by rephrasing web documents into cleaner, more consistent prose. Most documents fit within context, outputs stay under 4k tokens, and you want minimal overhead. One request per document, no chunking logic, no coordination. The rollout wraps each document in a rephrasing prompt and returns the rewritten text directly.

-

 When documents exceed your model's context window, you can split them into chunks and stitch generations together:

@@ -129,7 +135,9 @@ Each chunk builds on the previous generation, allowing the model to maintain coh

 **Use case: Translating long web documents.** You're translating multilingual web content into English at massive scale. Many documents exceed context limits, so you split them into 512-token chunks and translate with a sliding window. Each chunk is translated while keeping the previous (already translated) chunk in the prompt for context. This maintains coherence across chunk boundaries. The [FineTranslations](https://huggingface.co/datasets/HuggingFaceFW/finetranslations) project used this approach to translate over 1 trillion tokens across 500+ languages.

-

 For rollouts that require expensive CPU work (parsing, image processing, etc.), you can offload preprocessing to a process pool via `shared_context`:

@@ -183,12 +191,14 @@ InferenceRunner(
 The pool is initialized lazily and shared across all rollout invocations, keeping CPU-bound work off the async event loop.

 **Use case: PDF document understanding.** You're building a pipeline to extract structured information from scanned PDFs. Each document requires CPU-intensive OCR preprocessing before the text can be sent to the LLM for extraction. By offloading the OCR to a process pool, you keep the GPU fed with generation requests while workers handle the parsing in parallel.

-

 Need multiple samples per document? Set `rollouts_per_document` in your `InferenceConfig`. All successful outputs are collected under `document.metadata["rollout_results"]` as a list.

 **Use case: Best-of-N sampling for code generation.** When generating code solutions, you want multiple attempts per problem to increase the chance of a correct answer. Set `rollouts_per_document=10` and later filter for solutions that pass your test suite.

 ### Throughput Benchmarking

@@ -277,7 +287,7 @@ The benchmark config defines **801 unique configurations** across 8 experiment g

 #### What these numbers mean in practice

-Let's make this concrete

 These per-GPU numbers also answer a natural question: how many GPUs does it take to generate **a billion tokens per hour**? With the optimized configurations from our sweep:

 ## Infrastructure

+<ReadingTime words={4780} visuals={9} />

+Each of our 90 experiments requires rephrasing around 10 billion tokens of web text. Even with KV caching, every output token still needs its own forward pass, and every web document has a few thousand tokens. With the wrong serving configuration, a single experiment can take weeks instead of days. Multiply that by 90 and the difference between a good and bad setup is months of GPU time.

+Thanks to fast inference engines like [vLLM](https://github.com/vllm-project/vllm) [@vllm] and [SGLang](https://github.com/sgl-project/sglang) [@sglang], the bottleneck isn't the generation itself but the *infrastructure* around it: orchestrating thousands of prompts, keeping GPUs saturated, checkpointing outputs, and pushing everything to storage without losing progress when a worker crashes.

+We made major extensions to [DataTrove](https://github.com/huggingface/datatrove) [@datatrove] to handle this. DataTrove supports both local generation and large-scale distributed runs on Slurm clusters, handling chunking, checkpointing, distributed queueing, and Hugging Face dataset management so you can focus on synthetic data design rather than operational glue. We used it for every experiment in this blog post, from 10k-example test runs to the full FinePhrase production pipeline.

+<FigRef target="datatrove-pipeline" /> gives an overview of the pipeline. Let's dive in!

 <HtmlEmbed
 id="datatrove-pipeline"

 This design separates *what* you want to generate from *how* the inference engine batches and executes requests. You focus on your application logic. The runner handles efficient GPU utilization.

+<Sidenote>
+For rephrasing, the simple single-request rollout is all you need. The other rollout patterns below show how DataTrove handles more complex use cases like translating long documents, CPU-heavy preprocessing, and best-of-N sampling.
+</Sidenote>
+
+<Accordion title="Simple single-request rollout" open>

 The simplest rollout sends one request per document and returns the result directly:
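The unchanged rollout code itself is collapsed in this diff. As a rough sketch of the single-request pattern it describes (all names here are hypothetical, not DataTrove's actual API):

```python
# Hypothetical sketch of a single-request rollout: one prompt per document,
# result returned directly. Document, rephrase_rollout, and generate are
# illustrative stand-ins, not DataTrove's actual API.
import asyncio
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)

PROMPT = "Rephrase the following web text into clean, consistent prose:\n\n{text}"

async def rephrase_rollout(document: Document, generate) -> str:
    # `generate` stands in for the runner-provided coroutine that sends one
    # request to the inference engine and returns the generated text.
    return await generate(PROMPT.format(text=document.text))
```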


 **Use case: Rephrasing web documents for LLM training.** You're building a training corpus by rephrasing web documents into cleaner, more consistent prose. Most documents fit within context, outputs stay under 4k tokens, and you want minimal overhead. One request per document, no chunking logic, no coordination. The rollout wraps each document in a rephrasing prompt and returns the rewritten text directly.

+</Accordion>
+
+<Accordion title="Chunked rollout for long documents">

 When documents exceed your model's context window, you can split them into chunks and stitch generations together:
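The chunked rollout code is likewise collapsed in the diff. A minimal sketch of the chunk-and-stitch pattern, keeping the previous chunk's output in the prompt for context (illustrative names, not the file's actual implementation):

```python
# Hypothetical sketch of a chunked "sliding window" rollout: split the
# document into fixed-size chunks, generate per chunk with the previous
# chunk's output as context, then stitch the pieces back together.
# Not DataTrove's actual API.
import asyncio

def split_chunks(tokens: list[str], size: int) -> list[list[str]]:
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

async def chunked_rollout(tokens: list[str], generate, chunk_size: int = 512) -> str:
    pieces = []
    previous = ""  # output for the preceding chunk, used as context
    for chunk in split_chunks(tokens, chunk_size):
        prompt = f"Previous translation:\n{previous}\n\nTranslate:\n{' '.join(chunk)}"
        previous = await generate(prompt)
        pieces.append(previous)
    # Stitch the per-chunk generations back together.
    return " ".join(pieces)
```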


 **Use case: Translating long web documents.** You're translating multilingual web content into English at massive scale. Many documents exceed context limits, so you split them into 512-token chunks and translate with a sliding window. Each chunk is translated while keeping the previous (already translated) chunk in the prompt for context. This maintains coherence across chunk boundaries. The [FineTranslations](https://huggingface.co/datasets/HuggingFaceFW/finetranslations) project used this approach to translate over 1 trillion tokens across 500+ languages.

+</Accordion>
+
+<Accordion title="CPU-heavy preprocessing with process pools">

 For rollouts that require expensive CPU work (parsing, image processing, etc.), you can offload preprocessing to a process pool via `shared_context`:


 The pool is initialized lazily and shared across all rollout invocations, keeping CPU-bound work off the async event loop.
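The `shared_context` code is collapsed in the diff; the general idea of keeping CPU-bound work off the event loop can be sketched with the standard library (a hypothetical rollout shape, not DataTrove's actual `shared_context` API):

```python
# Hypothetical sketch: offload CPU-heavy preprocessing to a lazily created
# process pool cached in `shared_context`, so the async event loop stays
# free to dispatch generation requests. Not DataTrove's actual API.
import asyncio
from concurrent.futures import ProcessPoolExecutor

def expensive_parse(raw: str) -> str:
    # Stand-in for CPU-bound work such as OCR or HTML parsing;
    # runs inside a worker process.
    return raw.strip().upper()

async def ocr_rollout(raw_page: str, generate, shared_context: dict) -> str:
    # Create the pool on first use and reuse it across rollout invocations.
    pool = shared_context.setdefault("pool", ProcessPoolExecutor(max_workers=2))
    loop = asyncio.get_running_loop()
    parsed = await loop.run_in_executor(pool, expensive_parse, raw_page)
    return await generate(f"Extract structured fields from:\n{parsed}")
```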

 **Use case: PDF document understanding.** You're building a pipeline to extract structured information from scanned PDFs. Each document requires CPU-intensive OCR preprocessing before the text can be sent to the LLM for extraction. By offloading the OCR to a process pool, you keep the GPU fed with generation requests while workers handle the parsing in parallel.
+</Accordion>

+<Accordion title="Multiple rollouts per document">

 Need multiple samples per document? Set `rollouts_per_document` in your `InferenceConfig`. All successful outputs are collected under `document.metadata["rollout_results"]` as a list.

 **Use case: Best-of-N sampling for code generation.** When generating code solutions, you want multiple attempts per problem to increase the chance of a correct answer. Set `rollouts_per_document=10` and later filter for solutions that pass your test suite.
+</Accordion>
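To make the best-of-N flow concrete, here is a hedged sketch of the downstream filtering step. It assumes the `rollout_results` metadata layout described in the text; `passes_tests` is a hypothetical stand-in for a real test harness:

```python
# Hypothetical best-of-N filter: keep documents with at least one rollout
# that passes the test suite. Assumes outputs live under
# document.metadata["rollout_results"]; passes_tests is a stand-in.

def passes_tests(solution: str) -> bool:
    # Stand-in check; a real harness would execute the candidate code
    # against unit tests in a sandbox.
    return "return" in solution

def select_passing(documents: list[dict]) -> list[dict]:
    kept = []
    for doc in documents:
        candidates = doc["metadata"].get("rollout_results", [])
        winners = [c for c in candidates if passes_tests(c)]
        if winners:
            # Keep the first passing sample (any selection policy works here).
            doc["metadata"]["best_rollout"] = winners[0]
            kept.append(doc)
    return kept
```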

 ### Throughput Benchmarking


 #### What these numbers mean in practice

+Let's make this concrete. Each of our ablation experiments rephrases roughly 10 billion tokens. Consider [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b), a strong MoE model that balances quality and throughput well. With the baseline vLLM configuration (tp=1, 3,138 tps/gpu), a single 10B-token experiment takes **885 GPU-hours** and costs roughly **2,656 USD** at 3 USD/H100-hour. With the optimized configuration (tp=2, 6,117 tps/gpu), it drops to **454 GPU-hours** and **1,362 USD**, a saving of **431 GPU-hours and ~1,300 USD** (49%) from nothing more than picking the right serving parameters. Over 90 experiments, that difference adds up to tens of thousands of GPU-hours and well over 100,000 USD.
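The figures above follow directly from tokens divided by throughput; a quick sanity check of the arithmetic, using only the numbers quoted in the text:

```python
# Sanity-check the cost arithmetic: GPU-hours = tokens / (tps * 3600),
# cost = GPU-hours * hourly rate. Figures are the ones quoted in the text.
TOKENS = 10e9   # tokens per ablation experiment
RATE = 3.0      # USD per H100-hour

def gpu_hours(tps_per_gpu: float) -> float:
    return TOKENS / (tps_per_gpu * 3600)

baseline = gpu_hours(3138)    # vLLM tp=1
optimized = gpu_hours(6117)   # vLLM tp=2

print(round(baseline))         # 885 GPU-hours
print(round(optimized))        # 454 GPU-hours
print(round(baseline * RATE))  # 2656 USD
print(round(optimized * RATE)) # 1362 USD
```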

 These per-GPU numbers also answer a natural question: how many GPUs does it take to generate **a billion tokens per hour**? With the optimized configurations from our sweep:
