joelniklaus HF Staff committed
Commit d7053a5 · 1 Parent(s): f42775d

added more references
app/src/content/bibliography.bib CHANGED
@@ -120,7 +120,165 @@
  url = {https://arxiv.org/abs/2508.06471}
  }
 
+ @misc{nemotron3,
+ title = {NVIDIA Nemotron 3: Efficient and Open Intelligence},
+ author = {{NVIDIA}},
+ year = {2025},
+ eprint = {2512.20856},
+ archiveprefix = {arXiv},
+ primaryclass = {cs.CL},
+ url = {https://arxiv.org/abs/2512.20856}
+ }
+
+ @misc{qwen3,
+ title = {Qwen3 Technical Report},
+ author = {An Yang and Anfeng Li and Baosong Yang and Beichen Zhang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Gao and Chengen Huang and Chenxu Lv and Chujie Zheng and Dayiheng Liu and Fan Zhou and Fei Huang and Junyang Lin and Jingren Zhou},
+ year = {2025},
+ eprint = {2505.09388},
+ archiveprefix = {arXiv},
+ primaryclass = {cs.CL},
+ url = {https://arxiv.org/abs/2505.09388}
+ }
+
+ @misc{qwen2,
+ title = {Qwen2 Technical Report},
+ author = {An Yang and Baosong Yang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Zhou and Chengpeng Li and Chengyuan Li and Dayiheng Liu and Fei Huang and Guanting Dong and Haoran Wei and Huan Lin and Jialong Tang and Jialin Wang and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Ma and Jin Xu and Jingren Zhou and Junyang Lin},
+ year = {2024},
+ eprint = {2407.10671},
+ archiveprefix = {arXiv},
+ primaryclass = {cs.CL},
+ url = {https://arxiv.org/abs/2407.10671}
+ }
+
+ @misc{phi4,
+ title = {Phi-4 Technical Report},
+ author = {Marah Abdin and Sahaj Agarwal and Ahmed Awadallah and Vidhisha Balachandran and Harkirat Behl and Lingjiao Chen and Gustavo de Rosa and Suriya Gunasekar and Mojan Javaheripi and Neel Jain and Piero Kauffmann and Yin Tat Lee and Yuanzhi Li and Anh Nguyen and Olatunji Ruwase and Olli Saarikivi and Adil Salim and Shital Shah and Michael Santacroce and Harsha Nori and Xin Wang and Rachel Ward and Philipp Witte and Cyril Zhang and Yi Zhang},
+ year = {2024},
+ eprint = {2412.08905},
+ archiveprefix = {arXiv},
+ primaryclass = {cs.CL},
+ url = {https://arxiv.org/abs/2412.08905}
+ }
+
+ @misc{arceetrinity,
+ title = {Arcee Trinity Large Technical Report},
+ author = {{Arcee AI}},
+ year = {2025},
+ eprint = {2512.04695},
+ archiveprefix = {arXiv},
+ primaryclass = {cs.LG},
+ url = {https://arxiv.org/abs/2512.04695}
+ }
+
+ @misc{llama3,
+ title = {The Llama 3 Herd of Models},
+ author = {Aaron Grattafiori and Abhimanyu Dubey and Abhinav Jauhri and Abhinav Pandey and Abhishek Kadian and Ahmad Al-Dahle and Aiesha Letman and Akhil Mathur and Alan Schelten and Amit Sangani and others},
+ year = {2024},
+ eprint = {2407.21783},
+ archiveprefix = {arXiv},
+ primaryclass = {cs.AI},
+ url = {https://arxiv.org/abs/2407.21783}
+ }
+
+ @misc{mixtral,
+ title = {Mixtral of Experts},
+ author = {Albert Q. Jiang and Alexandre Sablayrolles and Antoine Roux and Arthur Mensch and Blanche Savary and Chris Bamford and Devendra Singh Chaplot and Diego de las Casas and Emma Bou Hanna and Florian Bressand and Gianna Lengyel and Guillaume Bour and Guillaume Lample and L{\'e}lio Renard Lavaud and Lucile Saulnier and Marie-Anne Lachaux and Pierre Stock and Sandeep Subramanian and Sophia Yang and Szymon Antoniak and Teven Le Scao and Th{\'e}ophile Gervet and Thibaut Lavril and Thomas Wang and Timoth{\'e}e Lacroix and William El Sayed},
+ year = {2024},
+ eprint = {2401.04088},
+ archiveprefix = {arXiv},
+ primaryclass = {cs.LG},
+ url = {https://arxiv.org/abs/2401.04088}
+ }
+
+ @misc{deepseekr1,
+ title = {DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning},
+ author = {{DeepSeek-AI}},
+ year = {2025},
+ eprint = {2501.12948},
+ archiveprefix = {arXiv},
+ primaryclass = {cs.CL},
+ url = {https://arxiv.org/abs/2501.12948}
+ }
+
+ @misc{gemma3,
+ title = {Gemma 3 Technical Report},
+ author = {{Gemma Team}},
+ year = {2025},
+ eprint = {2503.19786},
+ archiveprefix = {arXiv},
+ primaryclass = {cs.CL},
+ url = {https://arxiv.org/abs/2503.19786}
+ }
+
+ @software{falcon3,
+ title = {Falcon 3 Family of Open Models},
+ author = {{Technology Innovation Institute}},
+ year = {2024},
+ url = {https://huggingface.co/blog/falcon3},
+ note = {Hugging Face Blog}
+ }
+
+ @misc{granite3,
+ title = {Granite 3.0 Language Models},
+ author = {{IBM Granite Team}},
+ year = {2024},
+ url = {https://github.com/ibm-granite/granite-3.0-language-models},
+ note = {Technical Report}
+ }
+
+ @misc{kimik2,
+ title = {Kimi K2: Open Agentic Intelligence},
+ author = {{Moonshot AI}},
+ year = {2025},
+ eprint = {2507.20534},
+ archiveprefix = {arXiv},
+ primaryclass = {cs.CL},
+ url = {https://arxiv.org/abs/2507.20534}
+ }
+
+ @misc{gptoss,
+ title = {gpt-oss-120b \& gpt-oss-20b Model Card},
+ author = {{OpenAI}},
+ year = {2025},
+ eprint = {2508.10925},
+ archiveprefix = {arXiv},
+ primaryclass = {cs.CL},
+ url = {https://arxiv.org/abs/2508.10925}
+ }
+
  % Inference
+ @inproceedings{vllm,
+ title = {Efficient Memory Management for Large Language Model Serving with PagedAttention},
+ author = {Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
+ booktitle = {Proceedings of the 29th Symposium on Operating Systems Principles},
+ year = {2023},
+ eprint = {2309.06180},
+ archiveprefix = {arXiv},
+ primaryclass = {cs.LG},
+ url = {https://arxiv.org/abs/2309.06180}
+ }
+
+ @inproceedings{sglang,
+ title = {SGLang: Efficient Execution of Structured Language Model Programs},
+ author = {Lianmin Zheng and Liangsheng Yin and Zhiqiang Xie and Chuyue Sun and Jeff Huang and Cody Hao Yu and Shiyi Cao and Christos Kozyrakis and Ion Stoica and Joseph E. Gonzalez and Clark Barrett and Ying Sheng},
+ booktitle = {Advances in Neural Information Processing Systems},
+ year = {2024},
+ eprint = {2312.07104},
+ archiveprefix = {arXiv},
+ primaryclass = {cs.AI},
+ url = {https://arxiv.org/abs/2312.07104}
+ }
+
+ @misc{flashattention2,
+ title = {FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning},
+ author = {Tri Dao},
+ year = {2023},
+ eprint = {2307.08691},
+ archiveprefix = {arXiv},
+ primaryclass = {cs.LG},
+ url = {https://arxiv.org/abs/2307.08691}
+ }
+
  @misc{dflash,
  title = {DFlash: Block Diffusion for Flash Speculative Decoding},
  author = {Jian Chen and Yesheng Liang and Zhijian Liu},
@@ -130,3 +288,15 @@
  primaryclass = {cs.CL},
  url = {https://arxiv.org/abs/2602.06036}
  }
+
+ % Tools
+ @inproceedings{dspy,
+ title = {DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines},
+ author = {Omar Khattab and Arnav Singhvi and Paridhi Maheshwari and Zhiyuan Zhang and Keshav Santhanam and Sri Vardhamanan and Saiful Haq and Ashutosh Sharma and Thomas T. Joshi and Hanna Moazam and Heather Miller and Matei Zaharia and Christopher Potts},
+ booktitle = {International Conference on Learning Representations},
+ year = {2024},
+ eprint = {2310.03714},
+ archiveprefix = {arXiv},
+ primaryclass = {cs.CL},
+ url = {https://arxiv.org/abs/2310.03714}
+ }
app/src/content/chapters/appendix.mdx CHANGED
@@ -2,7 +2,7 @@
 
  ### Details on the experiments
 
- For our ablations we train a 1.2B parameter language model using a Qwen2-style architecture with 28 layers, a hidden dimension of 2048, 16 attention heads with 8 key-value heads (grouped-query attention), and an intermediate size of 6144. The model utilized the Llama 3.2 tokenizer ( `hynky/Llama-3.2-1B-no-bos` ) with a vocabulary size of 128,256 tokens. Training was conducted on 64 NVIDIA H100 80GB GPUs across 8 nodes using pure data parallelism (DP=64) with a global batch size of 512 and a sequence length of 4,096 tokens, accumulating to approximately 21 billion tokens total over 10,000 steps. We employed the AdamW optimizer with a learning rate of 5×10⁻⁴, β₁=0.9, β₂=0.95, weight decay of 0.1, and gradient clipping at 1.0. All training utilized bfloat16 precision with Flash Attention 2, fused operations (RMS normalization and rotary embeddings), and document masking to prevent cross-document attention. We aim to rephrase at least 10B tokens per experiment but due to wildly varying number of completion tokens by prompt we sometimes get less than that. In these cases we train on some of the data twice.
+ For our ablations we train a 1.2B parameter language model using a Qwen2-style [@qwen2] architecture with 28 layers, a hidden dimension of 2048, 16 attention heads with 8 key-value heads (grouped-query attention), and an intermediate size of 6144. The model uses the Llama 3.2 [@llama3] tokenizer (`hynky/Llama-3.2-1B-no-bos`) with a vocabulary size of 128,256 tokens. Training was conducted on 64 NVIDIA H100 80GB GPUs across 8 nodes using pure data parallelism (DP=64) with a global batch size of 512 and a sequence length of 4,096 tokens, accumulating to approximately 21 billion tokens total over 10,000 steps. We employed the AdamW optimizer with a learning rate of 5×10⁻⁴, β₁=0.9, β₂=0.95, weight decay of 0.1, and gradient clipping at 1.0. All training used bfloat16 precision with Flash Attention 2 [@flashattention2], fused operations (RMS normalization and rotary embeddings), and document masking to prevent cross-document attention. We aim to rephrase at least 10B tokens per experiment, but due to the wildly varying number of completion tokens per prompt we sometimes get less than that. In these cases we train on some of the data twice.
 
  ### Prompts
 
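As a quick sanity check, the ~21B total tokens quoted in the ablation setup above follow directly from the stated batch configuration:

```python
# Total training tokens implied by the ablation configuration above.
global_batch_size = 512   # sequences per optimizer step
sequence_length = 4_096   # tokens per sequence
steps = 10_000

total_tokens = global_batch_size * sequence_length * steps
print(f"{total_tokens / 1e9:.1f}B tokens")  # -> 21.0B tokens
```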
app/src/content/chapters/conclusions.mdx CHANGED
@@ -15,5 +15,5 @@ While we answered some questions in this work, many still remain such as:
  - Experiment with chunked rollouts context extension in mid-training
  - Experiment with multiple rollouts per example and filtering for the highest quality one
  - In REWIRE, they show larger gains for bigger models trained on their data. Can we reproduce this?
- - Does automatic prompt optimization with tools like dspy improve rephrasing performance?
+ - Does automatic prompt optimization with tools like DSPy [@dspy] improve rephrasing performance? (a minimal sketch of the idea follows below)
  - The ablations only trained for 21B tokens. It is still unclear how these findings transfer to larger scales in terms of both model parameters and data.
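To make the DSPy question above concrete, here is a minimal, hypothetical sketch of what prompt optimization for rephrasing could look like. The signature, metric, trainset, and model name are illustrative assumptions, not part of this work:

```python
import dspy

# Hypothetical setup: any model supported by dspy.LM would do.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class Rephrase(dspy.Signature):
    """Rephrase noisy web text into clean, educational prose."""
    source_text: str = dspy.InputField()
    rephrased_text: str = dspy.OutputField()

rephraser = dspy.Predict(Rephrase)

# Stand-in metric: a real pipeline would likely score educational
# quality with a classifier rather than this placeholder check.
def quality_metric(example, prediction, trace=None):
    return float(len(prediction.rephrased_text) > 0)

trainset = [
    dspy.Example(
        source_text="BEST deals!!! cheap flights click here...",
        rephrased_text="A short overview of how airline pricing works.",
    ).with_inputs("source_text")
]

# MIPROv2 searches over instructions and demos to maximize the metric.
optimizer = dspy.MIPROv2(metric=quality_metric, auto="light")
optimized_rephraser = optimizer.compile(rephraser, trainset=trainset)
```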
app/src/content/chapters/experiments.mdx CHANGED
@@ -226,7 +226,7 @@ FAQ prompt: Surprisingly, the 1b model is better for both lq and hq data.
  In general we cannot reproduce REWIRE's claim that large models are needed for lq data. Overall we rarely see benefits of using models larger than 1b. So as long as the model has some baseline level (in our experiments already reached at the 1b scale) we see no evidence for a clear benefit of using larger models for rephrasing. For these reasons we default to the 1b size for maximum throughput from here on. We hypothesize that most rephrasing tasks are simple enough for smaller models to handle sufficiently well.
  #### Does the model family matter?
 
- Some model families may be better suited for rephrasing than others based on their training data. This is why we test top families at the 1B scale on the four top-performing prompts tutorial, faq, table, math. We find that for the tutorial prompt at the 1B scale Llama-3.2, Granite-3, Gemma-3, and Qwen3 and Falcon3 perform roughly at the same level. SmolLM2 clearly outperforms.
+ Some model families may be better suited for rephrasing than others based on their training data. This is why we test the top families at the 1B scale on the four top-performing prompts: tutorial, faq, table, and math. We find that for the tutorial prompt at the 1B scale, Llama-3.2, Granite-3 [@granite3], Gemma-3, Qwen3, and Falcon3 [@falcon3] perform roughly at the same level. SmolLM2 clearly outperforms them.
 
  <HtmlEmbed
  id="model-family-tutorial"
app/src/content/chapters/infrastructure.mdx CHANGED
@@ -10,9 +10,9 @@ Synthetic data has emerged as a key ingredient in training modern LLMs, providing
 
  <Image src={SyDLepVveg_2f81384e_bcac_806f_acb7_fd65c71dd9df} alt="Image" />
 
- Synthetic data also plays a central role in post-training via *distillation* , where a capable model is used to generate high-quality responses for targeted domains such as reasoning, instruction-following, and tool-use. This data can then be used for supervised fine-tuning or preference optimization, allowing developers to shape a model's behaviour with labels that would be expensive or impractical to obtain from humans. For example, [SmolLM3](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook) was post-trained almost entirely on a few billion tokens of data generated from models like DeepSeek-R1 and Qwen3.
+ Synthetic data also plays a central role in post-training via *distillation*, where a capable model is used to generate high-quality responses for targeted domains such as reasoning, instruction-following, and tool-use. This data can then be used for supervised fine-tuning or preference optimization, allowing developers to shape a model's behaviour with labels that would be expensive or impractical to obtain from humans. For example, [SmolLM3](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook) was post-trained almost entirely on a few billion tokens of data generated from models like DeepSeek-R1 [@deepseekr1] and Qwen3.
 
- So what does it actually take to generate a trillion tokens of synthetic data? Thanks to fast inference engines like [vLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang), it turns out that the bottleneck isn't the generation itself but the *infrastructure* around it: orchestrating thousands of prompts, keeping GPUs saturated, checkpointing outputs, and pushing everything to storage without losing progress when a worker crashes.
+ So what does it actually take to generate a trillion tokens of synthetic data? Thanks to fast inference engines like [vLLM](https://github.com/vllm-project/vllm) [@vllm] and [SGLang](https://github.com/sgl-project/sglang) [@sglang], it turns out that the bottleneck isn't the generation itself but the *infrastructure* around it: orchestrating thousands of prompts, keeping GPUs saturated, checkpointing outputs, and pushing everything to storage without losing progress when a worker crashes.
 
  Today we're excited to announce major extensions to [DataTrove](https://github.com/huggingface/datatrove) to manage this entire process. These extensions package the scaffolding we built for our own synthetic data pipelines and make it accessible to anyone who wants to generate high-quality datasets at scale. DataTrove supports both local generation and large-scale distributed runs on Slurm clusters, handling chunking, checkpointing, distributed queueing, and Hugging Face dataset management so you can focus on synthetic data design rather than operational glue.
 
@@ -44,7 +44,6 @@ python examples/inference/benchmark/generate_data.py \
  --model-max-context 32768 \
  --output-dataset-name s1K-datatrove \
  --tasks 1 \
- --examples-per-chunk 50 \
  --dp 8 \
  --tp 1 \
  --local-execution
@@ -58,7 +57,7 @@ Most arguments are self-explanatory, but let's take a look at the main ones that
 
  Bigger chunks improve throughput but increase the work lost if you need to resume, so tune `examples-per-chunk` accordingly while using `tasks` mainly to spread the workload across independent jobs.
 
- Local execution is handy for small-scale datasets or models, but what if you want to generate data from a trillion parameter model like Kimi K2 😱? For that we use the in-built Slurm executor to scale the job across multiple nodes and tasks:
+ Local execution is handy for small-scale datasets or models, but what if you want to generate data from a trillion-parameter model like Kimi K2 [@kimik2] 😱? For that we use the built-in Slurm executor to scale the job across multiple nodes and tasks:
 
  ```shell
  python examples/inference/benchmark/generate_data.py \
@@ -69,7 +68,6 @@ python examples/inference/benchmark/generate_data.py \
  --max-tokens 8 \
  --trust-remote-code \
  --output-dataset-name s1K-1.1-benchmark-Kimi-K2-Instruct \
- --examples-per-chunk 10 \
  --tasks 1 \
  --workers 1 \
  --max-examples 100 \
@@ -253,7 +251,7 @@ Below we present our results scaling from 1B to 1T parameters.
 
  We consistently achieve the highest throughput at the lowest tensor parallelism (TP) and pipeline parallelism (PP) without running out of memory (OOM). We hypothesize this occurs because, except for the largest Qwen model and Kimi-K2-Instruct, no model has more than 6B active parameters.
 
- Interestingly, model family appears to significantly impact performance. At the same 4B scale, Qwen3 achieves nearly 20% higher throughput than Gemma-3. GPT-OSS-20B nearly matches Qwen3-4B's throughput despite having 5x the total parameters (21B vs 4B) and slightly fewer active parameters (3.6B vs 4B). Even more notably, GPT-OSS-120B nearly doubles the throughput of Qwen3-Next-80B-A3B despite having both more total and more active parameters. This performance difference, along with the fact that GPT-OSS-120B runs on TP2 while Qwen3-Next-80B-A3B OOMs, is likely attributable to GPT-OSS being loaded in weight-quantized mode (mxfp4) by default, compared to bf16 for the other models.
+ Interestingly, model family appears to significantly impact performance. At the same 4B scale, Qwen3 achieves nearly 20% higher throughput than Gemma-3 [@gemma3]. GPT-OSS-20B [@gptoss] nearly matches Qwen3-4B's throughput despite having 5x the total parameters (21B vs 4B) and slightly fewer active parameters (3.6B vs 4B). Even more notably, GPT-OSS-120B nearly doubles the throughput of Qwen3-Next-80B-A3B despite having both more total and more active parameters. This performance difference, along with the fact that GPT-OSS-120B runs on TP2 while Qwen3-Next-80B-A3B OOMs, is likely attributable to GPT-OSS being loaded in weight-quantized mode (mxfp4) by default, compared to bf16 for the other models.
 
  We also explored what would be required to generate 1T tokens in a day. We believe GPT-OSS-120B offers a strong balance between quality and throughput. Generating 1T tokens in a day would require 279 nodes, resulting in a cost of approximately $161K at roughly $3 per H100 hour. For a slightly lower quality option using GPT-OSS-20B, we would need 119 nodes at a cost of $69K.
 
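The cost figures above can be verified with back-of-the-envelope arithmetic, assuming 8 H100 GPUs per node (the node configuration itself is an assumption; the text only states the node counts and the $3/hour rate):

```python
# Back-of-the-envelope cost check for "1T tokens in a day".
GPUS_PER_NODE = 8         # assumed H100s per node
HOURS_PER_DAY = 24
USD_PER_GPU_HOUR = 3.0    # ~$3 per H100 hour, as stated above

for model, nodes in [("GPT-OSS-120B", 279), ("GPT-OSS-20B", 119)]:
    cost = nodes * GPUS_PER_NODE * HOURS_PER_DAY * USD_PER_GPU_HOUR
    print(f"{model}: {nodes} nodes -> ${cost:,.0f}")

# GPT-OSS-120B: 279 nodes -> $160,704  (~$161K)
# GPT-OSS-20B:  119 nodes -> $68,544   (~$69K)
```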
app/src/content/chapters/introduction.mdx CHANGED
@@ -7,7 +7,7 @@ Notes:
 
  ## Introduction
 
- If you read some of the latest LLM papers [add some refs, e.g. Nemotron 3, Arcee's trinity], you may have noticed that synthetic data has become a key component for LLM training data. It is quickly becoming one of the standard tools for building high quality datasets for LLM training. If we look back we can see several paradigm shifts for LLM data, especially for pretraining, and synthetic data is the natural latest step:
+ If you read some of the latest LLM papers (e.g. Nemotron 3 [@nemotron3], Qwen3 [@qwen3], Phi-4 [@phi4], Arcee Trinity [@arceetrinity]), you may have noticed that synthetic data has become a key component of LLM training data. It is quickly becoming one of the standard tools for building high-quality datasets for LLM training. Looking back, we can see several paradigm shifts for LLM data, especially for pretraining, and synthetic data is the natural latest step:
 
  - After training the first language models on small-ish datasets like Wikipedia, people started scaling up the pretraining corpora, including more and more data from the web. We went from training on just a few billion tokens to training on trillions of tokens including most of the web text.
  - When approaching the scaling limits of web data, people started to more aggressively filter the data and the discussion shifted from volume to quality, starting with stronger heuristics including deduplication pipelines and eventually switching to neural classifiers looking for "educational" or "instruction-like" data. The first model trainings were conservative with repeating data, but with higher-quality data some repetitions seemed fine.
app/src/content/chapters/setup.mdx CHANGED
@@ -31,13 +31,13 @@ We compare against several baseline datasets for pretraining and data rephrasing
 
  **DCLM (DataComp-LM)** [@datacomp] **:** A standardized benchmark providing a 240T token corpus from Common Crawl with model-based filtering as a key curation strategy. DCLM-Baseline enables training a 7B parameter model to 64% accuracy on MMLU with 2.6T tokens.
 
- **Fineweb-Edu-HQ and Fineweb-Edu-LQ** [@fineweb] **:** Subsets of FineWeb-Edu, a 1.3T token educational dataset filtered using Llama-3-70B-Instruct scoring samples on educational quality from 0 to 5. We use HQ (scores 4 or 5) and LQ (scores 0 or 1) to investigate the impact of seed data quality on rephrasing.
+ **Fineweb-Edu-HQ and Fineweb-Edu-LQ** [@fineweb] **:** Subsets of FineWeb-Edu, a 1.3T token educational dataset filtered using Llama-3-70B-Instruct [@llama3] scoring samples on educational quality from 0 to 5. We use HQ (scores 4 or 5) and LQ (scores 0 or 1) to investigate the impact of seed data quality on rephrasing (a minimal filtering sketch follows below).
 
  **Ultra-Fineweb-1.4** [@ultrafineweb] **:** A 1T English token and 120B Chinese token dataset created by applying efficient verification-based filtering to FineWeb. Uses a lightweight fastText classifier and optimized seed data selection to improve data quality.
 
- **Nemotron-HQ-Synth** [@nemotroncc] **:** Part of Nemotron-CC, a 6.3T token dataset using classifier ensembling and synthetic data rephrasing. The High-Quality-Synthetic subset contains synthetically rephrased data using Qwen3-30B-A3B.
+ **Nemotron-HQ-Synth** [@nemotroncc] **:** Part of Nemotron-CC, a 6.3T token dataset using classifier ensembling and synthetic data rephrasing. The High-Quality-Synthetic subset contains synthetically rephrased data using Qwen3-30B-A3B [@qwen3].
 
- **Cosmopedia** [@cosmopedia] **:** A 30 million file synthetic dataset with 25 billion tokens generated by Mixtral-8x7B-Instruct, containing textbooks, blog posts, and stories across diverse topics. Created through careful prompt engineering conditioning on curated educational sources and web data clusters.
+ **Cosmopedia** [@cosmopedia] **:** A 30 million file synthetic dataset with 25 billion tokens generated by Mixtral-8x7B-Instruct [@mixtral], containing textbooks, blog posts, and stories across diverse topics. Created through careful prompt engineering conditioning on curated educational sources and web data clusters.
 
  **SYNTH** [@synthpleias] **:** A fully synthetic dataset built from 50,000 Wikipedia articles expanded into problems and resolution paths including math exercises, creative writing, and information extraction. Uses multiple specialized synthetic pipelines with fine-tuned models and grounding in encyclopedic content.
 
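For reference, a minimal sketch of the HQ/LQ split described above, assuming FineWeb-Edu's integer score annotation column (the dataset path and `int_score` column name are assumptions based on the public dataset card):

```python
from datasets import load_dataset

# Sketch: split FineWeb-Edu-style data by its educational score annotation.
ds = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)

hq = ds.filter(lambda x: x["int_score"] >= 4)  # scores 4 or 5 -> HQ
lq = ds.filter(lambda x: x["int_score"] <= 1)  # scores 0 or 1 -> LQ
```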