joelniklaus HF Staff committed on
Commit
10a0e93
·
1 Parent(s): 93ebc2c

made some small fixes

app/src/content/chapters/appendix.mdx CHANGED
@@ -2,7 +2,7 @@
2
 
3
  ### Details on the experiments
4
 
5
- For our ablations we train a 1.2B parameter language model using a Qwen2-style [@qwen2] architecture with 28 layers, a hidden dimension of 2048, 16 attention heads with 8 key-value heads (grouped-query attention [@gqa]), and an intermediate size of 6144. The model used the Llama 3.2 [@llama3] tokenizer ( `hynky/Llama-3.2-1B-no-bos` ) with a vocabulary size of 128,256 tokens. Training was conducted on 64 NVIDIA H100 80GB GPUs across 8 nodes using pure data parallelism (DP=64) with a global batch size of 512 and a sequence length of 4,096 tokens, accumulating to approximately 21 billion tokens over 10,000 steps. We employed the AdamW [@adamw] optimizer with a learning rate of 5×10⁻⁴, β₁=0.9, β₂=0.95, weight decay of 0.1, and gradient clipping at 1.0. All training used bfloat16 precision with Flash Attention 2 [@flashattention2], fused operations (RMS normalization and rotary embeddings [@rope]), and document masking to prevent cross-document attention. We aim to rephrase at least 10B tokens per experiment, but due to the wildly varying number of completion tokens per prompt we sometimes end up with fewer; in these cases we train on some of the data twice.
6
 
7
  ### Prompts
8
 
@@ -212,7 +212,7 @@ Original Draft: [TEXT]
212
 
213
  ### Decay vs Scratch
214
 
215
- We explored two distinct training paradigms. In the **from-scratch** setup ( `decay_exp=false` ), models were trained for the full 10,000 steps (~21B tokens) on a single dataset or mixture of datasets. In contrast, the **decay** experiments ( `decay_exp=true` ) aimed to obtain quicker signal with fewer rephrased tokens by leveraging a two-stage training approach. These decay experiments resumed training from a checkpoint at step 9,000 of a model previously trained on lower-quality data (FineWeb-Edu-LQ), then continued training with a new dataset (or mixture) for the final 1,000 steps (~2B tokens) during the learning rate decay phase. We selected FineWeb-Edu-LQ for the first training phase so that the effects of the ablated data mixtures stand out more clearly. This design allowed us to evaluate the impact of high-quality rephrased or synthetic data more efficiently, requiring around 2B rephrased tokens rather than the full 21B needed for from-scratch training, thus reducing computational cost by 90% per experimental condition while still providing meaningful signal about data quality effects. To enable the decay experiments, we used a warmup-stable-decay (WSD) [@minicpm] learning rate schedule with 1% warmup (100 steps), 89% stable training, and 10% linear decay (1,000 steps) to a minimum of 5×10⁻⁵.
216
 
217
  #### Variance across seeds and data seeds
218
 
 
2
 
3
  ### Details on the experiments
4
 
5
+ For our ablations we train a 1.2B parameter language model using a Qwen2-style [@qwen2] architecture with 28 layers, a hidden dimension of 2048, 16 attention heads with 8 key-value heads (grouped-query attention [@gqa]), and an intermediate size of 6144. The model used the Llama 3.2 [@llama3] tokenizer (`hynky/Llama-3.2-1B-no-bos`) with a vocabulary size of 128,256 tokens. Training was conducted on 64 NVIDIA H100 80GB GPUs across 8 nodes using pure data parallelism (DP=64) with a global batch size of 512 and a sequence length of 4,096 tokens, accumulating to approximately 21 billion tokens over 10,000 steps. We employed the AdamW [@adamw] optimizer with a learning rate of 5×10⁻⁴, β₁=0.9, β₂=0.95, weight decay of 0.1, and gradient clipping at 1.0. All training used bfloat16 precision with Flash Attention 2 [@flashattention2], fused operations (RMS normalization and rotary embeddings [@rope]), and document masking to prevent cross-document attention. We aim to rephrase at least 10B tokens per experiment, but due to the wildly varying number of completion tokens per prompt we sometimes end up with fewer; in these cases we train on some of the data twice.
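The token accounting above can be sketched in a few lines. This is our own illustrative config object with the values from the text, not the actual training code:

```python
from dataclasses import dataclass

@dataclass
class AblationConfig:
    # Qwen2-style architecture (values from the paragraph above)
    num_layers: int = 28
    hidden_size: int = 2048
    num_attention_heads: int = 16
    num_key_value_heads: int = 8          # grouped-query attention
    intermediate_size: int = 6144
    vocab_size: int = 128_256
    # Training setup
    global_batch_size: int = 512
    sequence_length: int = 4_096
    train_steps: int = 10_000

    @property
    def total_tokens(self) -> int:
        # tokens per step = batch size * sequence length
        return self.global_batch_size * self.sequence_length * self.train_steps

cfg = AblationConfig()
print(f"{cfg.total_tokens / 1e9:.1f}B tokens")  # ~21.0B
```

The 512 × 4,096 × 10,000 product is where the "approximately 21 billion tokens" figure comes from.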
6
 
7
  ### Prompts
8
 
 
212
 
213
  ### Decay vs Scratch
214
 
215
+ We explored two distinct training paradigms. In the **from-scratch** setup (`decay_exp=false`), models were trained for the full 10,000 steps (~21B tokens) on a single dataset or mixture of datasets. In contrast, the **decay** experiments (`decay_exp=true`) aimed to obtain quicker signal with fewer rephrased tokens by leveraging a two-stage training approach. These decay experiments resumed training from a checkpoint at step 9,000 of a model previously trained on lower-quality data (FineWeb-Edu-LQ), then continued training with a new dataset (or mixture) for the final 1,000 steps (~2B tokens) during the learning rate decay phase. We selected FineWeb-Edu-LQ for the first training phase so that the effects of the ablated data mixtures stand out more clearly. This design allowed us to evaluate the impact of high-quality rephrased or synthetic data more efficiently, requiring around 2B rephrased tokens rather than the full 21B needed for from-scratch training, thus reducing computational cost by 90% per experimental condition while still providing meaningful signal about data quality effects. To enable the decay experiments, we used a warmup-stable-decay (WSD) [@minicpm] learning rate schedule with 1% warmup (100 steps), 89% stable training, and 10% linear decay (1,000 steps) to a minimum of 5×10⁻⁵.
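As a sketch, the WSD schedule described above (1% warmup, stable phase at the peak learning rate, 10% linear decay to 5×10⁻⁵) could be written like this. The function and argument names are ours, not the training code's:

```python
def wsd_lr(step: int, total_steps: int = 10_000, peak_lr: float = 5e-4,
           min_lr: float = 5e-5, warmup_frac: float = 0.01,
           decay_frac: float = 0.10) -> float:
    """Warmup-stable-decay learning rate at a given step (illustrative)."""
    warmup_steps = int(total_steps * warmup_frac)        # 100 steps
    decay_start = int(total_steps * (1 - decay_frac))    # step 9,000
    if step < warmup_steps:
        # linear warmup from 0 to peak_lr
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # stable phase at the peak learning rate
        return peak_lr
    # linear decay from peak_lr down to min_lr over the last 10% of steps
    frac = (step - decay_start) / (total_steps - decay_start)
    return peak_lr + frac * (min_lr - peak_lr)
```

Note that the decay experiments resume exactly at step 9,000, i.e. at the start of the decay phase, so the entire decay budget is spent on the ablated dataset.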
216
 
217
  #### Variance across seeds and data seeds
218
 
app/src/content/chapters/conclusions.mdx CHANGED
@@ -13,7 +13,7 @@ While we answered several questions about best practices for synthetic data gene
13
  - **Generation parameters**: What influence do temperature, `top_p`, and other sampling settings have on rephrasing quality?
14
 - **Context extension**: Does extending context via chunked rollouts during mid-training improve downstream performance?
15
  - **Best-of-N filtering**: Can we generate multiple rollouts per example and filter for the highest quality one?
16
- - **Scaling to larger models**: @rewire report larger gains for bigger models trained on their data. Can we reproduce this?
17
  - **Automatic prompt optimization**: Does prompt optimization with tools like DSPy [@dspy] improve rephrasing performance?
18
  - **Scaling to more data**: Our ablations trained for 21B tokens. It remains unclear how these findings transfer to larger scales, both in model parameters and data.
19
 
 
13
  - **Generation parameters**: What influence do temperature, `top_p`, and other sampling settings have on rephrasing quality?
14
 - **Context extension**: Does extending context via chunked rollouts during mid-training improve downstream performance?
15
  - **Best-of-N filtering**: Can we generate multiple rollouts per example and filter for the highest quality one?
16
+ - **Scaling to larger models**: REWIRE [@rewire] reports larger gains for bigger models trained on their data. Can we reproduce this?
17
  - **Automatic prompt optimization**: Does prompt optimization with tools like DSPy [@dspy] improve rephrasing performance?
18
  - **Scaling to more data**: Our ablations trained for 21B tokens. It remains unclear how these findings transfer to larger scales, both in model parameters and data.
19
 
app/src/content/chapters/infrastructure.mdx CHANGED
@@ -130,7 +130,7 @@ Bigger chunks improve throughput but increase the work lost if you need to resum
130
 
131
  At the heart of our inference system lies a powerful abstraction: the **rollout function**. A rollout is simply an async callable that receives a `Document`, a `generate(payload)` callback, and any extra resources you've configured. Inside the rollout, you have complete freedom to orchestrate one or many `generate` calls: sequentially, in parallel, or any combination.
132
 
133
- This design separates *what* you want to generate from *how* the inference engine batches and executes requests. You focus on your application logic; the runner handles efficient GPU utilization.
134
 
135
  #### Example 1: Simple Single-Request Rollout
136
 
@@ -348,7 +348,7 @@ We adopted a **two-tier sequential optimization** approach. The second tier buil
348
  - **gmu**: 0.9, 0.95 -- fraction of GPU memory allocated to the KV cache
349
  - **spec**: none, ngram-6, ngram-8, suffix-32 -- speculative decoding methods
350
 
351
- This tiered approach reduces the search space dramatically. A full Cartesian product of all parameters would require ~600 configurations per model; the tiered approach needs only ~15+8 = ~23 per model.
352
 
353
  <Sidenote>
354
  We call it "tier 0" because these parameters are prerequisites: for larger models, getting `tp` right is not an optimization but a necessity. Without sufficient tensor parallelism the model either doesn't fit in memory or leaves almost no room for the KV cache. In earlier exploratory experiments, we found that `tp`, `mns`, and `mnbt` have by far the largest impact on throughput, which is why they form tier 0.
 
130
 
131
  At the heart of our inference system lies a powerful abstraction: the **rollout function**. A rollout is simply an async callable that receives a `Document`, a `generate(payload)` callback, and any extra resources you've configured. Inside the rollout, you have complete freedom to orchestrate one or many `generate` calls: sequentially, in parallel, or any combination.
132
 
133
+ This design separates *what* you want to generate from *how* the inference engine batches and executes requests. You focus on your application logic. The runner handles efficient GPU utilization.
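To make the shape of this abstraction concrete, here is a minimal sketch. The signatures are hypothetical and only mirror the description above (a `Document`, a `generate(payload)` callback); the actual runner API may differ:

```python
import asyncio

async def simple_rollout(document: dict, generate) -> str:
    # A single generate() call per document: the simplest possible rollout.
    return await generate({"prompt": f"Rephrase:\n{document['text']}"})

async def fan_out_rollout(document: dict, generate) -> list[str]:
    # Several generate() calls issued in parallel; the runner is free to
    # batch these together with requests from other in-flight rollouts.
    prompts = [f"Rephrase ({style}):\n{document['text']}"
               for style in ("wiki", "qa", "textbook")]
    return await asyncio.gather(*(generate({"prompt": p}) for p in prompts))
```

The key point is that the rollout only describes the request graph; scheduling, batching, and GPU placement stay entirely on the runner's side.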
134
 
135
  #### Example 1: Simple Single-Request Rollout
136
 
 
348
  - **gmu**: 0.9, 0.95 -- fraction of GPU memory allocated to the KV cache
349
  - **spec**: none, ngram-6, ngram-8, suffix-32 -- speculative decoding methods
350
 
351
+ This tiered approach reduces the search space dramatically. A full Cartesian product of all parameters would require ~600 configurations per model. The tiered approach needs only ~15+8 = ~23 per model.
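The cost arithmetic can be checked in a few lines. The tier-1 grid comes from the list above; the tier-0 count of ~15 configurations is taken as given from the text:

```python
from itertools import product

# Tier-1 grid from the list above.
TIER1 = {"gmu": [0.9, 0.95],
         "spec": ["none", "ngram-6", "ngram-8", "suffix-32"]}

n_tier1 = len(list(product(*TIER1.values())))  # 2 * 4 = 8
n_tier0 = 15  # (tp, mns, mnbt) combinations benchmarked per model (approx.)

# Sequential tiers: search tier 0 first, then sweep tier 1 on the winner.
sequential = n_tier0 + n_tier1
print(f"{sequential} configurations per model")  # vs ~600 for a full product
```

In general, tiering turns a multiplicative search cost into an additive one, which is why the savings grow with every parameter added to a later tier.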
352
 
353
  <Sidenote>
354
  We call it "tier 0" because these parameters are prerequisites: for larger models, getting `tp` right is not an optimization but a necessity. Without sufficient tensor parallelism the model either doesn't fit in memory or leaves almost no room for the KV cache. In earlier exploratory experiments, we found that `tp`, `mns`, and `mnbt` have by far the largest impact on throughput, which is why they form tier 0.
app/src/content/chapters/introduction.mdx CHANGED
@@ -5,28 +5,28 @@ import FigRef from "../../components/FigRef.astro";
5
  {/*
6
  Notes:
7
 
8
- - Finepdfs-edu outperforms even DCLM quite clearly. This would change the whole story completely so it would be quite time consuming to adapt. Therefore we leave it out for now.
9
  */}
10
 
11
  ## Introduction
12
 
13
- If you read some of the latest LLM papers (e.g., Nemotron 3 [@nemotron3], Qwen3 [@qwen3], Phi-4 [@phi4], Arcee Trinity [@arceetrinity]), you may have noticed that synthetic data has become a key component of LLM training data. It is quickly becoming one of the standard tools for building high-quality datasets for LLM training. Looking back, we can see several paradigm shifts in LLM data, especially for pretraining, and synthetic data is the natural latest step:
14
 
15
  - After training the first language models on small-ish datasets like Wikipedia, people started scaling up the pretraining corpora including more and more data from the web. We went from training on just a few billion tokens to training on trillions of tokens including most of the web text.
16
 - When approaching the scaling limits of web data, people started to filter the data more aggressively, and the discussion shifted from volume to quality. This began with stronger heuristics, including deduplication pipelines, and eventually moved to neural classifiers looking for "educational" or "instruction-like" data. Early model trainings were conservative about repeating data, but with higher-quality data some repetition seemed fine.
17
  - Now that we have mostly exhausted web text data and concluded that quality is more important, synthetic data has become an interesting option to up-cycle the data that the classifiers would have normally excluded and thus increase the volume of data again. The latest LLMs were trained on trillions of synthetic tokens, matching the volume of unaltered data.
18
 
19
- Besides pretraining, synthetic data generation has also become a useful tool for post-training, where it is applied to fill gaps identified in models. A fun anecdote is the SmolLM2 [@smollm2] training, where we noticed the model was decent at coding and math, but totally went off the rails with small talk queries (e.g. "How are you?", "Hi", "What's up?"). Synthetically generating a small talk dataset ([https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k/](https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k/viewer/default/train_sft?row=0)) quickly solved this issue.
20
 
21
- We are seeing a radical shift in compute allocation for model training: while model training itself dominated the compute budget early on, more and more compute is now allocated to curating and improving the training datasets, both in pretraining and post-training.
22
 
23
- However, how to do synthetic data generation properly still resembles alchemy these days: Which model should you use? Which prompts work best, and how many do you need? And how do you even scale this effectively?
24
 
25
  In this blog post we take a journey to answer all these questions systematically. We ran XXX experiments and generated YYY tokens in total to find the ideal settings for synthetic data.
26
 
27
- Here's the plan:
28
 
29
- We start with the infrastructure needed for synthetic data generation at scale. This includes some extensions we made to the datatrove library and, crucially, detailed throughput benchmarking of popular models you might want to use for synthetic data generation. This is super important to get the most data for your buck.
30
 
31
  We continue with a walkthrough of the different approaches for synthetic data in pretraining, from explaining what prior work did to the prompts we are experimenting with.
32
 
 
5
  {/*
6
  Notes:
7
 
8
+ - Finepdfs-edu outperforms even DCLM quite clearly. This would change the whole story completely so it would be quite time consuming to adapt. Therefore we leave it out for now.
9
  */}
10
 
11
  ## Introduction
12
 
13
+ If you read some of the latest LLM papers (e.g., Nemotron 3 [@nemotron3], Qwen3 [@qwen3], Phi-4 [@phi4], Arcee Trinity [@arceetrinity]), you may have noticed that synthetic data has become a key component of LLM training data. It is quickly becoming one of the standard tools for building high-quality datasets for LLM training. Looking back, we can see several paradigm shifts in LLM data, especially for pretraining, and synthetic data is the natural latest step:
14
 
15
  - After training the first language models on small-ish datasets like Wikipedia, people started scaling up the pretraining corpora including more and more data from the web. We went from training on just a few billion tokens to training on trillions of tokens including most of the web text.
16
 - When approaching the scaling limits of web data, people started to filter the data more aggressively, and the discussion shifted from volume to quality. This began with stronger heuristics, including deduplication pipelines, and eventually moved to neural classifiers looking for "educational" or "instruction-like" data. Early model trainings were conservative about repeating data, but with higher-quality data some repetition seemed fine.
17
  - Now that we have mostly exhausted web text data and concluded that quality is more important, synthetic data has become an interesting option to up-cycle the data that the classifiers would have normally excluded and thus increase the volume of data again. The latest LLMs were trained on trillions of synthetic tokens, matching the volume of unaltered data.
18
 
19
+ Besides pretraining, synthetic data generation has also become a useful tool for post-training, where it is applied to fill gaps identified in models. A fun anecdote is the SmolLM2 [@smollm2] training, where we noticed the model was decent at coding and math, but totally went off the rails with small talk queries (e.g. "How are you?", "Hi", "What's up?"). Synthetically generating a [small talk dataset](https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k/viewer/default/train_sft?row=0) quickly solved this issue.
20
 
21
+ We are seeing a radical shift in compute allocation for model training: while model training itself dominated the compute budget early on, more and more compute is now allocated to curating and improving the training datasets, both in pretraining and post-training.
22
 
23
+ However, how to do synthetic data generation properly still resembles alchemy these days: Which model should you use? Which prompts work best, and how many do you need? And how do you even scale this effectively?
24
 
25
  In this blog post we take a journey to answer all these questions systematically. We ran XXX experiments and generated YYY tokens in total to find the ideal settings for synthetic data.
26
 
27
+ Here's the plan:
28
 
29
+ We start with the infrastructure needed for synthetic data generation at scale. This includes some extensions we made to the datatrove library and, crucially, detailed throughput benchmarking of popular models you might want to use for synthetic data generation. This is super important to get the most data for your buck.
30
 
31
  We continue with a walkthrough of the different approaches for synthetic data in pretraining, from explaining what prior work did to the prompts we are experimenting with.
32
 
app/src/content/embeds/d3-optimization-sweep.html CHANGED
@@ -165,7 +165,7 @@
165
  const TIERS = ['Baseline', 'Tier 0', 'Tier 1'];
166
  const SHAPE_SIZE = 42;
167
  const TIER_Y_OFFSET = { 'Baseline': -0.38, 'Tier 0': 0, 'Tier 1': 0.38 };
168
- const margin = { top: 24, right: 62, bottom: 42, left: 148 };
169
 
170
  // ── Colors & shapes ──
171
  function getFamilyColors() {
@@ -240,7 +240,7 @@
240
  const gRoot = svg.append('g');
241
 
242
  // ── State ──
243
- const state = { metric: 'throughput', sort: 'speedup' };
244
 
245
  function sortedData() {
246
  const d = [...DATA];
 
165
  const TIERS = ['Baseline', 'Tier 0', 'Tier 1'];
166
  const SHAPE_SIZE = 42;
167
  const TIER_Y_OFFSET = { 'Baseline': -0.38, 'Tier 0': 0, 'Tier 1': 0.38 };
168
+ const margin = { top: 20, right: 40, bottom: 40, left: 130 };
169
 
170
  // ── Colors & shapes ──
171
  function getFamilyColors() {
 
240
  const gRoot = svg.append('g');
241
 
242
  // ── State ──
243
+ const state = { metric: 'speedup', sort: 'speedup' };
244
 
245
  function sortedData() {
246
  const d = [...DATA];