Commit · 9ff8887
1 Parent(s): d22ec15
updated high level statistics about number of experiments
app/src/content/chapters/1-introduction.mdx
CHANGED
|
@@ -10,7 +10,7 @@ import syntheticDataScaleImg from "../assets/image/synthetic-data-scale.jpg";
|
|
| 10 |
## Introduction
|
| 11 |
|
| 12 |
|
| 13 |
-
We ran
|
| 14 |
<Sidenote>
|
| 15 |
Reading time: One weekend
|
| 16 |
</Sidenote>
|
|
@@ -66,7 +66,7 @@ Our goal is to turn this alchemy into chemistry: replace intuition with systemat
|
|
| 66 |
Lavoisier replaced phlogiston theory with precise measurements and repeatable experiments, earning him the title "father of modern chemistry".
|
| 67 |
</Sidenote>
|
| 68 |
|
| 69 |
-
We start by [setting up the problem](#rephrasing-the-web): what rephrasing is, which approaches exist, and what we want to test. Then we dive into the
|
| 70 |
The sections below are fairly self-contained, so feel free to jump around and skip whatever seems less interesting to you.
|
| 71 |
|
| 72 |
<Note variant="info" title="But wait, what about model collapse?">
|
|
|
|
| 10 |
## Introduction
|
| 11 |
|
| 12 |
|
| 13 |
+
We ran 333 train-and-evaluate experiments over 90 rephrasing configurations to find the best recipe for synthetic pretraining data. Generating those synthetic corpora meant over 1 trillion tokens and 12.7 GPU years. The result is **FinePhrase**, a 486B token dataset that clearly outperforms all existing synthetic data baselines. It's [available on the Hub](https://huggingface.co/datasets/HuggingFaceFW/finephrase), and this post walks you through everything we learned along the way.
|
| 14 |
<Sidenote>
|
| 15 |
Reading time: One weekend
|
| 16 |
</Sidenote>
|
|
|
|
| 66 |
Lavoisier replaced phlogiston theory with precise measurements and repeatable experiments, earning him the title "father of modern chemistry".
|
| 67 |
</Sidenote>
|
| 68 |
|
| 69 |
+
We start by [setting up the problem](#rephrasing-the-web): what rephrasing is, which approaches exist, and what we want to test. Then we dive into the [Experiments](#experiments) we ran to figure out which prompts, models, and datasets actually work. The [Analyses](#analyses) section zooms out to ask *why* things work the way they do. Next comes the [Infrastructure](#infrastructure) that made all of this possible, including detailed throughput benchmarking of popular models (super important for getting the most data for your buck). Finally, we [put it all together](#applying-the-recipe-at-scale) into FinePhrase, our best configuration.
|
| 70 |
The sections below are fairly self-contained, so feel free to jump around and skip whatever seems less interesting to you.
|
| 71 |
|
| 72 |
<Note variant="info" title="But wait, what about model collapse?">
|
app/src/content/chapters/3-experiments.mdx
CHANGED
|
@@ -16,7 +16,7 @@ Notes:
|
|
| 16 |
|
| 17 |
## Experiments
|
| 18 |
|
| 19 |
-
Time to put all of this to the test. We
|
| 20 |
|
| 21 |
<HtmlEmbed
|
| 22 |
id="experiment-overview"
|
|
|
|
| 16 |
|
| 17 |
## Experiments
|
| 18 |
|
| 19 |
+
Time to put all of this to the test. We generated 90 rephrasing configurations, then ran 333 train-and-evaluate experiments to systematically answer our questions. The journey took some unexpected turns. Here's the full landscape of the rephrasing space we explored, with source datasets flowing through prompt strategies to model families:
|
| 20 |
|
| 21 |
<HtmlEmbed
|
| 22 |
id="experiment-overview"
|
app/src/content/chapters/4-analyses.mdx
CHANGED
|
@@ -11,7 +11,7 @@ The experiments tell us *what* works. Now let's zoom out and ask *why*. We start
|
|
| 11 |
|
| 12 |
### Did Rephrasing Leak the Benchmarks?
|
| 13 |
|
| 14 |
-
Before reading anything into
|
| 15 |
|
| 16 |
[^contam-method]: We run a 10-gram overlap audit built on DataTrove's decontamination indexer [@datatrove]. We hash every 10-gram from the benchmark answers plus the query/answer boundary, then scan each corpus for exact normalized matches. Query-only n-grams are excluded, since they are mostly prompt-template boilerplate (think "Question: ... Answer:") that floods the index with false positives, and degenerate repeated-token windows are skipped to avoid number-normalization artifacts. Each corpus is subsampled to roughly 5B tokens so the per-document rates are comparable. The [audit entry point](https://github.com/huggingface/finephrase/blob/main/finephrase/cli/audit_contamination.py) and the full per-dataset report are open.
|
| 17 |
|
|
|
|
| 11 |
|
| 12 |
### Did Rephrasing Leak the Benchmarks?
|
| 13 |
|
| 14 |
+
Before reading anything into these benchmark scores, we should rule out the most boring explanation for a win: the model saw the test set during pretraining. Rephrasing makes this worth checking, since a generator could in principle copy a memorized benchmark question straight into the training corpus. So we audited 56 corpora for overlap with the evaluation benchmarks: the eight format prompts, each rephrased by our six small (1–1.7B parameter) generators, plus the source and baseline datasets they are compared against.[^contam-method]
|
| 15 |
|
| 16 |
[^contam-method]: We run a 10-gram overlap audit built on DataTrove's decontamination indexer [@datatrove]. We hash every 10-gram from the benchmark answers plus the query/answer boundary, then scan each corpus for exact normalized matches. Query-only n-grams are excluded, since they are mostly prompt-template boilerplate (think "Question: ... Answer:") that floods the index with false positives, and degenerate repeated-token windows are skipped to avoid number-normalization artifacts. Each corpus is subsampled to roughly 5B tokens so the per-document rates are comparable. The [audit entry point](https://github.com/huggingface/finephrase/blob/main/finephrase/cli/audit_contamination.py) and the full per-dataset report are open.
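The idea behind the footnote's audit can be illustrated with a minimal plain-Python sketch. This is not DataTrove's API or the actual audit code: the function names, the regex-based normalization, and the reading of "degenerate repeated-token windows" as windows made of a single repeated token are all assumptions made for illustration, and the query/answer boundary handling is left out.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase and keep alphanumeric tokens (an assumed, simplified normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens: list[str], n: int = 10):
    """Yield every n-token window, skipping windows that repeat a single token."""
    for i in range(len(tokens) - n + 1):
        window = tuple(tokens[i : i + n])
        if len(set(window)) == 1:  # degenerate window, e.g. "0 0 0 0 ..."
            continue
        yield window


def build_index(benchmark_answers: list[str], n: int = 10) -> set[tuple[str, ...]]:
    """Collect the n-grams of all benchmark answers; the set hashes them for fast lookup."""
    return {g for answer in benchmark_answers for g in ngrams(normalize(answer), n)}


def is_contaminated(document: str, index: set[tuple[str, ...]], n: int = 10) -> bool:
    """True if any normalized n-gram of the document exactly matches the index."""
    return any(g in index for g in ngrams(normalize(document), n))


# Hypothetical benchmark answer and documents, purely for demonstration.
answers = ["the quick brown fox jumps over the lazy dog near the old river bank"]
index = build_index(answers)
leaky_doc = "He said: The quick brown fox jumps over the lazy dog near the old river!"
clean_doc = "An unrelated paragraph about serving configurations and throughput of large language models."
```

With this toy index, `is_contaminated(leaky_doc, index)` flags the first document because it shares a full normalized 10-gram with the answer, while `clean_doc` passes.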
|
| 17 |
|
app/src/content/chapters/5-infrastructure.mdx
CHANGED
|
@@ -6,7 +6,7 @@ import Wide from "../../components/Wide.astro";
|
|
| 6 |
|
| 7 |
## Infrastructure
|
| 8 |
|
| 9 |
-
Each of our 90
|
| 10 |
|
| 11 |
Thanks to fast inference engines like [vLLM](https://github.com/vllm-project/vllm) [@vllm] and [SGLang](https://github.com/sgl-project/sglang) [@sglang], the raw generation speed is no longer the bottleneck. The hard part is the *infrastructure* around it: orchestrating thousands of prompts, keeping GPUs saturated, checkpointing outputs, and pushing everything to storage without losing progress when a worker crashes.
|
| 12 |
|
|
@@ -294,7 +294,7 @@ The chart shows the gains, but what do they translate to in actual GPU time and
|
|
| 294 |
|
| 295 |
#### What these numbers mean in practice
|
| 296 |
|
| 297 |
-
Let's make this concrete with some back-of-the-envelope math. Each of our ablation
|
| 298 |
|
| 299 |
These per-GPU numbers also answer a natural question: how many GPUs does it take to generate **a billion tokens per hour**? With the optimized configurations from our sweep:
|
| 300 |
|
|
|
|
| 6 |
|
| 7 |
## Infrastructure
|
| 8 |
|
| 9 |
+
Each of our 90 rephrasing configurations requires generating around 10 billion tokens of web text. Even with KV caching, every output token still needs its own forward pass, and every web document has a few thousand tokens. With the wrong serving configuration, a single generation run takes weeks instead of days. Multiply that by 90 and the difference between a good and bad setup is literally months of GPU time.
|
| 10 |
|
| 11 |
Thanks to fast inference engines like [vLLM](https://github.com/vllm-project/vllm) [@vllm] and [SGLang](https://github.com/sgl-project/sglang) [@sglang], the raw generation speed is no longer the bottleneck. The hard part is the *infrastructure* around it: orchestrating thousands of prompts, keeping GPUs saturated, checkpointing outputs, and pushing everything to storage without losing progress when a worker crashes.
|
| 12 |
|
|
|
|
| 294 |
|
| 295 |
#### What these numbers mean in practice
|
| 296 |
|
| 297 |
+
Let's make this concrete with some back-of-the-envelope math. Each of our ablation configurations rephrases roughly 10 billion tokens. Consider [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b), a strong MoE model that balances quality and throughput well. With the baseline vLLM configuration (tp=1, 3,138 tps/gpu), a single 10B-token generation run takes **885 GPU-hours** and costs roughly **2,656 USD** at 3 USD/H100-hour. With the optimized configuration (tp=2, 6,117 tps/gpu), it drops to **454 GPU-hours** and **1,362 USD**. That's a saving of **431 GPU-hours and ~1,300 USD** (49%) from nothing more than picking the right serving parameters. Over 90 rephrasing configurations, that difference adds up to tens of thousands of GPU-hours and well over 100,000 USD.
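The back-of-the-envelope math above can be reproduced with a short sketch. The throughput figures, the 10B-token run size, and the 3 USD/H100-hour price are the ones quoted in the paragraph; the function and variable names are hypothetical and only serve the illustration.

```python
def generation_cost(total_tokens: float, tokens_per_sec_per_gpu: float,
                    usd_per_gpu_hour: float = 3.0) -> tuple[float, float]:
    """Return (gpu_hours, usd) needed to generate total_tokens at a per-GPU throughput."""
    gpu_hours = total_tokens / tokens_per_sec_per_gpu / 3600
    return gpu_hours, gpu_hours * usd_per_gpu_hour


TOKENS_PER_RUN = 10e9   # roughly 10B tokens per rephrasing configuration
NUM_CONFIGS = 90        # number of rephrasing configurations

# gpt-oss-120b throughput figures quoted above (tokens per second per GPU)
baseline_hours, baseline_usd = generation_cost(TOKENS_PER_RUN, 3138)    # tp=1
optimized_hours, optimized_usd = generation_cost(TOKENS_PER_RUN, 6117)  # tp=2

saved_hours = baseline_hours - optimized_hours
saved_usd = baseline_usd - optimized_usd

print(f"baseline:  {baseline_hours:.0f} GPU-hours, {baseline_usd:.0f} USD")
print(f"optimized: {optimized_hours:.0f} GPU-hours, {optimized_usd:.0f} USD")
print(f"saved per run: {saved_hours:.0f} GPU-hours, {saved_usd:.0f} USD "
      f"({saved_hours / baseline_hours:.0%})")
print(f"saved over {NUM_CONFIGS} runs: {saved_hours * NUM_CONFIGS:.0f} GPU-hours, "
      f"{saved_usd * NUM_CONFIGS:.0f} USD")
```

Swapping in another model's measured tokens per second per GPU, or a different hourly price, gives the same comparison for other serving setups.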
|
| 298 |
|
| 299 |
These per-GPU numbers also answer a natural question: how many GPUs does it take to generate **a billion tokens per hour**? With the optimized configurations from our sweep:
|
| 300 |
|
app/src/content/chapters/6-finephrase.mdx
CHANGED
|
@@ -215,4 +215,4 @@ What makes this result especially compelling is the cost efficiency. Here is how
|
|
| 215 |
|
| 216 |
FinePhrase achieves **~33M tokens per GPU hour**, roughly 30x more efficient than REWIRE and over 13x more than Cosmopedia. It generates more tokens than REWIRE while using 24x less compute, thanks to the combined payoff of a 1.7B model (vs 70B), optimized inference settings, and speculative decoding. The takeaway: you do not need large models for high-quality synthetic data generation.
|
| 217 |
|
| 218 |
-
That's the full picture:
|
|
|
|
| 215 |
|
| 216 |
FinePhrase achieves **~33M tokens per GPU hour**, roughly 30x more efficient than REWIRE and over 13x more than Cosmopedia. It generates more tokens than REWIRE while using 24x less compute, thanks to the combined payoff of a 1.7B model (vs 70B), optimized inference settings, and speculative decoding. The takeaway: you do not need large models for high-quality synthetic data generation.
|
| 217 |
|
| 218 |
+
That's the full picture: 333 train-and-evaluate experiments over 90 rephrasing configurations, a battle-tested infrastructure, and 486 billion tokens of public synthetic data. Let's wrap up with what we learned and where to go next.
|
app/src/content/chapters/7-conclusions.mdx
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
## Conclusions
|
| 2 |
|
| 3 |
-
We ran
|
| 4 |
|
| 5 |
### What's Next?
|
| 6 |
|
|
|
|
| 1 |
## Conclusions
|
| 2 |
|
| 3 |
+
We ran 333 train-and-evaluate experiments over 90 rephrasing configurations to figure out what actually matters for synthetic pretraining data. Generating those synthetic corpora alone meant over 1 trillion tokens and more than 111,000 GPU hours. The answer is surprisingly simple: **prompt design is the single biggest lever**. Structured formats like Table, Math, FAQ, and Tutorial consistently beat both curated web baselines and prior synthetic methods, producing our best configuration, FinePhrase: 1.35 billion samples and 486 billion completion tokens generated from 339 million source documents. You don't need a large rephrasing model to get there: a 1B model is sufficient for most prompts, and even low-quality source data works fine when paired with a strong mix-in dataset. Template diversity matters more than template polish, and a messier model that produces varied outputs can outperform a polished one that repeats the same structure. SmolLM2-1.7B emerged as the best rephrasing model across all prompts, beating larger models from other families. There is no reliable proxy metric that can replace training and evaluating a model, so there is no shortcut around the full pipeline. We open-source all infrastructure, prompts, and benchmarking code through DataTrove so you can build on these findings without reinventing the plumbing. That said, there's plenty left to explore.
|
| 4 |
|
| 5 |
### What's Next?
|
| 6 |
|