joelniklaus (HF Staff) committed
Commit a99a2cf · Parent: c0312c5

removed reading time component

app/src/components/ReadingTime.astro DELETED
@@ -1,28 +0,0 @@
----
-interface Props {
-  words: number;
-  visuals?: number;
-}
-
-const { words, visuals = 0 } = Astro.props;
-const WORDS_PER_MIN = 250;
-const MINS_PER_VISUAL = 2;
-const totalMinutes = Math.ceil(words / WORDS_PER_MIN + visuals * MINS_PER_VISUAL);
-const hours = Math.floor(totalMinutes / 60);
-const remainingMinutes = totalMinutes % 60;
-const label = hours > 0
-  ? (remainingMinutes > 0 ? `~${hours}h ${remainingMinutes}min read` : `~${hours}h read`)
-  : `~${totalMinutes} min read`;
----
-
-<span class="reading-time">{label}</span>
-
-<style is:global>
-  .reading-time {
-    display: block;
-    font-size: 0.85rem;
-    color: var(--muted-color);
-    margin-top: -24px;
-    margin-bottom: var(--spacing-4);
-  }
-</style>
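
For reference, the frontmatter logic this commit deletes is a plain words-plus-visuals estimate, and it can be exercised outside Astro. Below is a small TypeScript sketch of the same computation; the `readingTimeLabel` function name is ours, while the constants and label formatting mirror the deleted component:

```typescript
// Standalone port of the deleted ReadingTime.astro frontmatter logic.
const WORDS_PER_MIN = 250;   // assumed average reading speed
const MINS_PER_VISUAL = 2;   // flat time cost per figure/visual

function readingTimeLabel(words: number, visuals: number = 0): string {
  // Round the combined estimate up to whole minutes, then split into h/min.
  const totalMinutes = Math.ceil(words / WORDS_PER_MIN + visuals * MINS_PER_VISUAL);
  const hours = Math.floor(totalMinutes / 60);
  const remainingMinutes = totalMinutes % 60;
  return hours > 0
    ? (remainingMinutes > 0
        ? `~${hours}h ${remainingMinutes}min read`
        : `~${hours}h read`)
    : `~${totalMinutes} min read`;
}

// e.g. the old introduction usage <ReadingTime words={756} visuals={3} />
console.log(readingTimeLabel(756, 3)); // "~10 min read"
```

The rest of the commit drops these per-chapter estimates in favor of a single tongue-in-cheek Sidenote in the introduction.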
app/src/content/chapters/1-introduction.mdx CHANGED
@@ -3,15 +3,16 @@ import Image from "../../components/Image.astro";
 import HtmlEmbed from "../../components/HtmlEmbed.astro";
 import Sidenote from "../../components/Sidenote.astro";
 import FigRef from "../../components/FigRef.astro";
-import ReadingTime from "../../components/ReadingTime.astro";
 import syntheticDataScaleImg from "../assets/image/synthetic-data-scale.jpg";
 
 
 ## Introduction
 
-<ReadingTime words={756} visuals={3} />
 
-We ran 90 experiments, generated over 1 trillion tokens, and spent 12.7 GPU years to find the best recipe for synthetic pretraining data. The result is **FinePhrase**, a dataset that clearly outperforms all existing synthetic data baselines (<FigRef target="finephrase-vs-baselines" />). It's [available on the Hub](https://huggingface.co/datasets/HuggingFaceFW/finephrase), and this post walks you through everything we learned along the way.
+We ran 90 experiments, generated over 1 trillion tokens, and spent 12.7 GPU years to find the best recipe for synthetic pretraining data. The result is **FinePhrase**, a 486B token dataset that clearly outperforms all existing synthetic data baselines (<FigRef target="finephrase-vs-baselines" />). It's [available on the Hub](https://huggingface.co/datasets/HuggingFaceFW/finephrase), and this post walks you through everything we learned along the way.
+<Sidenote>
+Reading time: One weekend
+</Sidenote>
 
 <HtmlEmbed
   id="finephrase-vs-baselines"
app/src/content/chapters/2-setup.mdx CHANGED
@@ -1,10 +1,8 @@
 import Accordion from "../../components/Accordion.astro";
 import Sidenote from "../../components/Sidenote.astro";
-import ReadingTime from "../../components/ReadingTime.astro";
 
-## Rephrasing the Web
 
-<ReadingTime words={1243} visuals={0} />
+## Rephrasing the Web
 
 Several teams have already shown that rephrasing web content into cleaner formats can beat training on raw data: WRAP [@wrap] rewrites text in different styles, Nemotron-CC [@nemotroncc] extracts QA pairs and knowledge lists, REWIRE [@rewire] does guided rewriting, and BeyondWeb [@beyondweb] tries continuation and summarization. But nobody has done a systematic comparison across all these approaches, and the field still lacks a clear framework for what "rephrasing" even means. So let's fix that.
 
app/src/content/chapters/3-experiments.mdx CHANGED
@@ -3,7 +3,7 @@ import Note from "../../components/Note.astro";
 import Sidenote from "../../components/Sidenote.astro";
 import Glossary from "../../components/Glossary.astro";
 import FigRef from "../../components/FigRef.astro";
-import ReadingTime from "../../components/ReadingTime.astro";
+
 
 {/* TODO: Integrate decay experiment as another analysis for proxy */}
 {/* TODO: share on a bunch of discords/slacks/hackernews/locallama */}
@@ -18,8 +18,6 @@ Notes:
 
 ## Experiments
 
-<ReadingTime words={2063} visuals={14} />
-
 Time to put all of this to the test. We ran 90 experiments to systematically answer our questions, and the journey took some unexpected turns. <FigRef target="experiment-overview" /> shows the full landscape: source datasets flowing through prompt strategies to model families. We start by seeing how existing datasets stack up, then dissect what makes their prompts tick. From there we design our own prompts, explore how the rephrasing model affects quality, and investigate the interplay between synthetic and original data. Along the way, we stumble into some surprising findings about typos and template collapse.
 
 <HtmlEmbed
app/src/content/chapters/4-analyses.mdx CHANGED
@@ -2,11 +2,9 @@ import HtmlEmbed from "../../components/HtmlEmbed.astro";
 import FigRef from "../../components/FigRef.astro";
 import Note from "../../components/Note.astro";
 import Wide from "../../components/Wide.astro";
-import ReadingTime from "../../components/ReadingTime.astro";
 
-## Analyses
 
-<ReadingTime words={1433} visuals={6} />
+## Analyses
 
 The experiments tell us *what* works. Now let's zoom out and ask *why*. We look at the cost of running these experiments, whether cheap proxy metrics can replace expensive training runs, what the rephrased outputs actually look like, and why a messier model sometimes wins.
 
app/src/content/chapters/5-infrastructure.mdx CHANGED
@@ -3,11 +3,9 @@ import Sidenote from "../../components/Sidenote.astro";
 import FigRef from "../../components/FigRef.astro";
 import Accordion from "../../components/Accordion.astro";
 import Wide from "../../components/Wide.astro";
-import ReadingTime from "../../components/ReadingTime.astro";
 
-## Infrastructure
 
-<ReadingTime words={4780} visuals={9} />
+## Infrastructure
 
 Each of our 90 experiments requires rephrasing around 10 billion tokens of web text. Even with KV caching, every output token still needs its own forward pass, and every web document has a few thousand tokens. With the wrong serving configuration, a single experiment takes weeks instead of days. Multiply that by 90 and the difference between a good and bad setup is literally months of GPU time.
 
app/src/content/chapters/6-finephrase.mdx CHANGED
@@ -3,14 +3,12 @@ import HtmlEmbed from "../../components/HtmlEmbed.astro";
 import Sidenote from "../../components/Sidenote.astro";
 import FigRef from "../../components/FigRef.astro";
 import Wide from "../../components/Wide.astro";
-import ReadingTime from "../../components/ReadingTime.astro";
+
 import datasetCardImg from "../assets/image/auto-dataset-card.png";
 import finephraseProgressImg from "../assets/image/finephrase-progress.png";
 
 ## Applying the Recipe at Scale
 
-<ReadingTime words={1693} visuals={10} />
-
 With the experiments done and the infrastructure battle-tested, it's time to put everything together. We take our findings and build [FinePhrase](https://huggingface.co/datasets/HuggingFaceFW/finephrase), a large-scale synthetic dataset that rephrases 339 million documents from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (sample-350BT) into four structured formats, producing 1.35 billion samples and 486 billion completion tokens of synthetic pretraining data.
 
 The recipe writes itself from the experiments: take the best model (SmolLM2-1.7B-Instruct), the best prompts (FAQ, Math, Table, Tutorial), the optimized inference settings from our throughput benchmarks, and the DataTrove infrastructure. Launch 100 parallel Slurm workers, each running on a single H100 GPU with suffix-32 speculative decoding. Let it run for about two weeks on spare compute on our cluster.
app/src/content/chapters/7-conclusions.mdx CHANGED
@@ -1,9 +1,5 @@
-import ReadingTime from "../../components/ReadingTime.astro";
-
 ## Conclusions
 
-<ReadingTime words={624} visuals={0} />
-
 We ran 90 experiments, generated over 1 trillion tokens, and spent more than 111,000 GPU hours to figure out what actually matters for synthetic pretraining data. The answer is surprisingly simple: **prompt design is the single biggest lever**. Structured formats like Table, Math, FAQ, and Tutorial consistently beat both curated web baselines and prior synthetic methods, producing our best configuration, FinePhrase: 1.35 billion samples and 486 billion completion tokens generated from 339 million source documents. You don't need a large rephrasing model to get there: a 1B model is sufficient for most prompts, and even low-quality source data works fine when paired with a strong mix-in dataset. Template diversity matters more than template polish, and a messier model that produces varied outputs can outperform a polished one that repeats the same structure. SmolLM2-1.7B emerged as the best rephrasing model across all prompts, beating larger models from other families. There is no reliable proxy metric that can replace training and evaluating a model, so there is no shortcut around the full pipeline. We open-source all infrastructure, prompts, and benchmarking code through DataTrove so you can build on these findings without reinventing the plumbing.
 
 ### What's Next?
app/src/content/chapters/8-appendix.mdx CHANGED
@@ -1,9 +1,5 @@
-import ReadingTime from "../../components/ReadingTime.astro";
-
 ## Appendix
 
-<ReadingTime words={721} visuals={16} />
-
 ### Details on the experiments
 
 For our ablations we train a 1.2B parameter language model using a Qwen2-style [@qwen2] architecture with 28 layers, a hidden dimension of 2048, 16 attention heads with 8 key-value heads (grouped-query attention [@gqa]), and an intermediate size of 6144. The model utilized the Llama 3.2 [@llama3] tokenizer (`hynky/Llama-3.2-1B-no-bos`) with a vocabulary size of 128,256 tokens. Training was conducted on 64 NVIDIA H100 80GB GPUs across 8 nodes using pure data parallelism (DP=64) with a global batch size of 512 and a sequence length of 4,096 tokens, accumulating to approximately 21 billion tokens total over 10,000 steps. We employed the AdamW [@adamw] optimizer with a learning rate of 5×10⁻⁴, β₁=0.9, β₂=0.95, weight decay of 0.1, and gradient clipping at 1.0. All training utilized bfloat16 precision with Flash Attention 2 [@flashattention2], fused operations (RMS normalization and rotary embeddings [@rope]), and document masking to prevent cross-document attention. We aim to rephrase at least 10B tokens per experiment but due to wildly varying number of completion tokens by prompt we sometimes get less than that. In these cases we train on some of the data twice.