import Image from "../../components/Image.astro";
import HtmlEmbed from "../../components/HtmlEmbed.astro";
import Sidenote from "../../components/Sidenote.astro";
import Wide from "../../components/Wide.astro";
import datasetCardImg from "../assets/image/auto-dataset-card.png";
import finephraseProgressImg from "../assets/image/finephrase-progress.png";
## Building FinePhrase
With the experiments done and the infrastructure battle-tested, it's time to put everything together. We take our findings and build [FinePhrase](https://huggingface.co/datasets/HuggingFaceFW/finephrase), a large-scale synthetic dataset that rephrases 339 million documents from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (sample-350BT) into four structured formats, producing 1.35 billion samples and 486 billion completion tokens of synthetic pretraining data.
The recipe writes itself from the experiments: take the best model (SmolLM2-1.7B-Instruct), the best prompts (FAQ, Math, Table, Tutorial), the optimized inference settings from our throughput benchmarks, and the DataTrove infrastructure. Launch 100 parallel Slurm workers, each running on a single H100 GPU with suffix-32 speculative decoding. Let it run for about two weeks on spare compute on our cluster.
To get a sense of the scale: our infrastructure benchmarks showed that SmolLM2-1.7B-Instruct achieves ~9,200 tokens per second per GPU with suffix-32 speculative decoding. With 100 GPUs running in parallel, that is ~920,000 tokens per second, or about 3.3 billion tokens per hour. Rephrasing ~339 million documents four times (once per prompt) at an average of ~359 tokens per sample means roughly 486 billion tokens of total generation. At our throughput rate, that takes approximately 612 GPU-days, or about 6 wall-clock days with 100 GPUs (in practice closer to two weeks accounting for restarts, failed workers, and cluster contention).
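As a sanity check, the arithmetic above can be reproduced in a few lines (all figures taken directly from the text):

```python
# Back-of-envelope check of the FinePhrase scale estimates.
TOKENS_PER_SEC_PER_GPU = 9_200   # SmolLM2-1.7B-Instruct with suffix-32 spec decoding
NUM_GPUS = 100
DOCS = 339e6                     # FineWeb-Edu sample-350BT documents
PROMPTS = 4                      # FAQ, Math, Table, Tutorial
AVG_TOKENS_PER_SAMPLE = 359

total_tokens = DOCS * PROMPTS * AVG_TOKENS_PER_SAMPLE            # ~4.87e11 (~486B)
cluster_tps = TOKENS_PER_SEC_PER_GPU * NUM_GPUS                  # 920,000 tok/s
gpu_days = total_tokens / TOKENS_PER_SEC_PER_GPU / 86_400        # ~612 GPU-days
wall_clock_days = gpu_days / NUM_GPUS                            # ~6 days with 100 GPUs
```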
### The Recipe
Every configuration choice traces directly back to a finding from our experiments or infrastructure benchmarks:
- **Model**: [SmolLM2-1.7B-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct), which dominated all other model families across every prompt in our [model family comparison](#does-the-model-family-matter)
- **Prompts**: [FAQ](#faq), [Math](#math), [Table](#table), and [Tutorial](#tutorial), the four prompts that [consistently beat DCLM](#can-new-prompts-beat-dclm) in our experiments
- **Source data**: [FineWeb-Edu sample-350BT](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), since our experiments showed that [source quality is secondary](#does-the-source-dataset-matter) when paired with a strong mix-in dataset
- **Inference settings**: tp=1 with suffix-32 speculative decoding, mns=2048, mnbt=16384, gmu=0.90, all derived from the [throughput benchmark](#throughput-benchmarking) that found a 1.75x speedup for SmolLM2-1.7B-Instruct with this configuration
The entire FinePhrase production run is defined in a [single script](https://github.com/huggingface/datatrove/blob/main/examples/inference/finephrase.py) that is intentionally thin. It declares the configuration and calls the [`generate_data`](https://github.com/huggingface/datatrove/blob/main/examples/inference/generate_data.py) script introduced in the [Infrastructure](#infrastructure) section (the same script we used for all throughput benchmarking). Here is the core configuration:
```python
KWARGS = {
    "model_name_or_path": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
    "model_max_context": 8192,
    "max_tokens": 2048,
    "input_dataset_name": "HuggingFaceFW/fineweb-edu",
    "input_dataset_config": "sample-350BT",
    "output_dataset_name": "HuggingFaceFW/finephrase",
    "max_num_seqs": 2048,
    "max_num_batched_tokens": 16384,
    "gpu_memory_utilization": 0.90,
    "speculative_config": '{"method":"suffix","num_speculative_tokens":32}',
    "enable_monitoring": True,
    "examples_per_chunk": 100_000,
    "workers": 100,
    "tasks": 100,
}
PROMPT_TEMPLATES = {
    "math": "Rewrite the document to create a mathematical word problem ...",
    "table": "Rewrite the document as a structured table ...",
    "faq": "Rewrite the document as a comprehensive FAQ ...",
    "tutorial": "Rewrite the document as a clear, step-by-step tutorial ...",
}
for name, template in PROMPT_TEMPLATES.items():
    generate_data_main(**KWARGS, name=f"finephrase_{name}",
                       prompt_template=[name, template])
```
<Sidenote>
We set `max_tokens=2048` instead of 4096 because SmolLM2-1.7B-Instruct rarely generates more than 2K tokens per document anyway. Halving the max token budget lets vLLM allocate more KV cache for concurrent sequences.
</Sidenote>
All the operational complexity lives in DataTrove itself: chunked processing with checkpoint-based resume, distributed Slurm execution, incremental Hub uploads, and automatic dataset card generation. The [`generate_data`](https://github.com/huggingface/datatrove/blob/main/examples/inference/generate_data.py) script wires these pieces together into a single CLI for synthetic data generation, which is why the FinePhrase production script is under 100 lines of code. Before any GPU time is spent, it runs pre-flight checks: `check_hf_auth()` verifies you have a write token, `ensure_repo_exists()` creates the output dataset repo, and `validate_config()` catches invalid parallelism settings and validates that prompt templates contain the `[[DOCUMENT]]` placeholder. It reads the model's `GenerationConfig` from the Hub to inherit default sampling parameters rather than requiring you to hardcode them. The rollout function automatically truncates documents that exceed the context budget at newline boundaries, which is critical at 339 million documents where some will inevitably be too long.
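The placeholder check is the kind of cheap validation that saves a failed two-week run. A minimal sketch of the idea (a simplified, hypothetical version, not the actual `validate_config` implementation):

```python
# Hypothetical sketch: fail fast before any GPU time is spent if a prompt
# template is missing the [[DOCUMENT]] placeholder that the rollout fills in.
def validate_prompt_templates(templates: dict) -> None:
    missing = [name for name, tpl in templates.items() if "[[DOCUMENT]]" not in tpl]
    if missing:
        raise ValueError(f"Templates missing [[DOCUMENT]] placeholder: {missing}")
```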
On Slurm, a single `generate_data` call orchestrates three coordinated jobs: the inference job (100 parallel workers doing the actual generation), a monitor job (updating the dataset card with progress bars and ETAs), and a datacard job (generating final statistics after completion). The monitor tracks the inference job ID and stops if inference fails. The datacard job uses Slurm's `afterok` dependency to run only on success. Once the jobs are running, the next challenge is keeping track of progress and getting results onto the Hub automatically.
### Automatic HF Upload and Progress Monitoring
We want you to be able to just press a button, let the GPUs go brrrr, and check back in to the finished dataset. DataTrove continuously uploads data to your specified Hugging Face dataset repo whenever a chunk is finished, using `ParquetWriter` with `hf://` paths so data appears on the Hub within minutes of generation, not after the full run completes. At the end, the `InferenceDatasetCardGenerator` pipeline step checks the logs directory, collects information about the throughput, and uploads a dataset card to document your new synthetic dataset. Here's an example of the auto-generated dataset card:
<figure id="auto-dataset-card">
<Image src={datasetCardImg} alt="Auto-generated dataset card on the Hugging Face Hub" />
<figcaption>Example of an auto-generated dataset card with throughput metrics, uploaded to the Hugging Face Hub after inference completes.</figcaption>
</figure>
For long-running inference jobs like FinePhrase (which runs for about two weeks), the `InferenceProgressMonitor` runs as a separate Slurm job alongside the inference workers. It periodically scans the output directory, counts completed chunks across all 100 tasks, and updates the dataset card on the Hub with a progress bar and ETA for each prompt template. Here's the live progress dashboard during the FinePhrase generation run:
<figure id="finephrase-progress">
<Image src={finephraseProgressImg} alt="Live progress monitoring of the FinePhrase generation run" />
<figcaption>Live progress dashboard for FinePhrase, showing per-prompt completion status, document counts, and ETAs. The monitor runs as a separate Slurm job and updates the dataset card hourly.</figcaption>
</figure>
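At its core, the ETA in that dashboard is simple extrapolation: count completed chunks, divide by elapsed time, and project the remainder. A schematic sketch (illustrative only, not the `InferenceProgressMonitor` internals):

```python
# Illustrative ETA logic: extrapolate remaining time from throughput so far.
def eta_seconds(completed_chunks: int, total_chunks: int, elapsed_s: float) -> float:
    """Estimate remaining seconds given chunks finished so far."""
    if completed_chunks == 0:
        return float("inf")   # no signal yet
    rate = completed_chunks / elapsed_s              # chunks per second
    return (total_chunks - completed_chunks) / rate
```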
Both the progress monitor and the dataset card generator are configured through an `InferenceDatasetCardParams` object that captures the full run metadata. The `generate_data` script creates these pipelines automatically, but here is what happens under the hood:
```python
params = InferenceDatasetCardParams(
    output_repo_id="HuggingFaceFW/finephrase",
    input_dataset_name="HuggingFaceFW/fineweb-edu",
    input_dataset_split="train",
    model_name="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    # ... other params
)
monitor_pipeline = [
    InferenceProgressMonitor(params=params, update_interval=3600)
]
datacard_pipeline = [
    InferenceDatasetCardGenerator(params=params)
]
```
That's the happy path. But running 100 parallel workers for two weeks surfaced plenty of unhappy paths too.
### Improvements to DataTrove
Building FinePhrase wasn't just about running inference at scale. Processing 339 million documents across 100 parallel workers for two weeks stress-tests infrastructure in ways that small experiments never do. Every failure mode you can imagine showed up: documents that crash the model, workers racing to commit to the same repo, Slurm jobs dying on startup, and caches corrupting under contention. We merged over a dozen PRs to make this work. Here are the most impactful ones.
#### Graceful error handling for bad documents
At 339 million documents, some will inevitably trigger errors: documents too long for the context window even after truncation, malformed content that produces invalid tokens, or edge cases in the tokenizer. Before [PR #450](https://github.com/huggingface/datatrove/pull/450), a single bad document would crash the entire worker, losing all progress for that task. The `skip_bad_requests` option lets the `InferenceRunner` catch provider-side `BadRequestError` exceptions, log the problematic document, and continue processing the rest of the chunk.
```python
InferenceRunner(
    rollout_fn=simple_rollout,
    config=inference_config,
    skip_bad_requests=True,  # Log and skip instead of crashing
)
```
#### Fast resume with checkpoint-aware skipping
The first version of `skip_bad_requests` had a subtle problem: skipped documents were not written to checkpoints. This meant chunks containing bad documents never reached completion, `last_chunk` never advanced, and every restart re-parsed the entire checkpoint history from scratch. For FinePhrase with 100,000 documents per chunk, this made restarts painfully slow (sometimes leading to multiple hours of wasted GPU time per worker). [PR #464](https://github.com/huggingface/datatrove/pull/464) fixes this by writing skipped documents to checkpoints with a special marker so they count toward chunk completion but are excluded from the final output. It also speeds up resume by sorting checkpoint files and skipping replay for chunks that are already complete.
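The marker idea can be sketched in a few lines (a hypothetical illustration of the mechanism, not the actual checkpoint format):

```python
# Hypothetical sketch of the skip-marker mechanism from PR #464: skipped
# documents still land in the checkpoint so the chunk can reach completion,
# but a marker field lets the final output writer filter them out.
def checkpoint_record(doc_id, generation):
    rec = {"id": doc_id, "text": generation or ""}
    if generation is None:          # bad request: record it anyway for chunk accounting
        rec["__skipped__"] = True   # hypothetical marker field
    return rec

def final_output(records):
    """Exclude skipped records from what gets uploaded."""
    return [r for r in records if not r.get("__skipped__")]
```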
#### Hardening Hub uploads against transient failures
With 100 workers writing to the same Hugging Face Hub repository, transient failures aren't rare; they're guaranteed. We encountered three distinct failure modes and fixed each one:
- **Commit races** ([PR #448](https://github.com/huggingface/datatrove/pull/448)): Two workers commit simultaneously and one gets `412 Precondition Failed` with "A commit has happened since." The fix adds retry logic with exponential backoff to the `DiskWriter`, which all Hub-writing paths go through.
- **Transient server errors** ([PR #463](https://github.com/huggingface/datatrove/pull/463)): `503 Service Unavailable` and other transient API errors were not retried consistently. This PR normalizes retry logic across `DiskWriter` and `HuggingFaceDatasetWriter` so all transient errors are handled uniformly.
- **LFS verification failures** ([PR #455](https://github.com/huggingface/datatrove/pull/455)): Large file uploads occasionally fail LFS verification on the server side. A one-line fix adds `"lfs-verify"` to the list of retryable error messages.
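The pattern behind all three fixes is the same: recognize a transient error and retry with exponential backoff. A generic sketch (a simplified stand-in for illustration, not DataTrove's actual `DiskWriter` code):

```python
import random
import time

# Error-message substrings treated as transient (hypothetical list for this sketch).
RETRYABLE = ("412", "503", "lfs-verify")

def with_retries(fn, max_attempts=5, base_delay=1.0):
    """Call fn(), retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RuntimeError as err:
            transient = any(tag in str(err) for tag in RETRYABLE)
            if not transient or attempt == max_attempts - 1:
                raise  # non-retryable, or out of attempts
            # Jitter de-synchronizes the 100 workers hitting the same repo
            time.sleep(base_delay * 2**attempt * random.uniform(0.5, 1.5))
```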
#### Isolating the Xet cache per Slurm task
Hugging Face Hub uses [Xet](https://huggingface.co/docs/hub/storage-backends#xet-storage-backend) as a storage backend, and its local cache is not designed for concurrent access from 100 parallel processes. Shared cache access caused corruption and failures. [PR #465](https://github.com/huggingface/datatrove/pull/465) gives each Slurm task its own cache directory derived from the job, task, and process IDs:
```bash
export HF_XET_CACHE="/tmp/hf_xet/${SLURM_JOB_ID}_${SLURM_ARRAY_TASK_ID}_${SLURM_PROCID}"
mkdir -p "$HF_XET_CACHE"
```
#### Multi-config dataset support
FinePhrase runs four prompt templates that produce four independent dataset configs (faq, math, table, tutorial). Without config-awareness, all four templates would fight over a single dataset card and progress counters would exceed 100%. [PR #447](https://github.com/huggingface/datatrove/pull/447) adds first-class config support: outputs go to config-specific folders (`hf://datasets/HuggingFaceFW/finephrase/faq/`, `.../math/`, etc.), the dataset card merges information from all configs, and the progress monitor tracks each config independently so you see four separate progress bars (as in the [progress dashboard above](#finephrase-progress)).
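In code terms, the config-aware routing boils down to simple path construction following the folder layout described above (illustrative sketch):

```python
# Each prompt template's config gets its own folder in the dataset repo,
# so the four generation runs never write over each other.
def config_output_folder(repo_id: str, config_name: str) -> str:
    return f"hf://datasets/{repo_id}/{config_name}"

folders = {c: config_output_folder("HuggingFaceFW/finephrase", c)
           for c in ("faq", "math", "table", "tutorial")}
```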
#### Configurable server startup
vLLM server startup time varies wildly depending on model size, optimization level, and cluster load. With `optimization_level=3` (the highest throughput setting), vLLM compiles CUDA graphs during startup, which can take several minutes. Fixed startup timeouts would kill healthy jobs that were simply slow to initialize. [PR #451](https://github.com/huggingface/datatrove/pull/451) makes all startup parameters configurable via `InferenceConfig`: timeout, max attempts, retry delay, and max retries.
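The shape of such a configurable readiness check is a deadline-bounded polling loop. A minimal sketch (parameter names here are illustrative, not the exact `InferenceConfig` fields; the clock is injectable for testing):

```python
import time

# Hypothetical server-readiness loop: poll until ready or until the
# configurable timeout elapses, instead of a single fixed wait.
def wait_for_server(is_ready, timeout_s=600.0, poll_s=5.0,
                    now=time.monotonic, sleep=time.sleep):
    """Return True once is_ready() succeeds, False if timeout_s elapses first."""
    deadline = now() + timeout_s
    while now() < deadline:
        if is_ready():
            return True
        sleep(poll_s)
    return False
```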
#### Fixing SLURM CPU binding
A one-liner, but without it nothing runs. Slurm's default CPU binding policy conflicts with how DataTrove launches vLLM servers, sometimes causing jobs to fail immediately with `srun: error: CPU binding outside of job step allocation`. [PR #457](https://github.com/huggingface/datatrove/pull/457) passes `--cpu-bind=none` to srun, disabling the restrictive binding policy.
```python
SlurmPipelineExecutor(
    srun_args={"cpu-bind": "none"},
    ...
)
```
With all these fixes in place, the pipeline ran to completion. So what does the resulting dataset actually look like?
### What's in the Dataset?
Browse some real examples from FinePhrase below. Each sample shows the original FineWeb-Edu source document alongside all four rephrased versions. Navigate through samples to see how the same web document becomes a FAQ, a math problem, a structured table, and a step-by-step tutorial:
<Wide>
<HtmlEmbed
id="finephrase-explorer"
src="finephrase-explorer.html"
caption="Browse real examples from the FinePhrase dataset. Each sample shows the original source document alongside all four rephrased versions (FAQ, Math, Table, Tutorial). Use the arrows or Random button to navigate between samples."
/>
</Wide>
### How Does FinePhrase Compare?
In the introduction we showed a single FinePhrase prompt (table) against the baselines. Now that the full dataset is built, here's how all four FinePhrase prompts stack up against the strongest synthetic data baselines:
<HtmlEmbed
id="finephrase-all-prompts"
src="d3-benchmark-comparison.html"
desc="All four FinePhrase prompts compared against synthetic data baselines across evaluation metrics."
config={{
defaultView: "line",
datasets: {
"mix-fw_edu_hq-table_smollm2_1.7b_hq": { display: "FinePhrase (table)", color: "#EBA937" },
"mix-fw_edu_hq-math_smollm2_1.7b_hq": { display: "FinePhrase (math)", color: "#E09530" },
"mix-fw_edu_hq-faq_smollm2_1.7b_hq": { display: "FinePhrase (faq)", color: "#D58228" },
"mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": { display: "FinePhrase (tutorial)", color: "#CA7020" },
cosmopedia: { display: "Cosmopedia", color: "#e15759" },
nemotron_hq_synth: { display: "Nemotron-HQ-Synth", color: "#76b900" },
rewire: { display: "REWIRE", color: "#1877F2" },
synth_query_reasoning_answer: { display: "SYNTH", color: "#b07aa1" }
}
}}
/>
All four FinePhrase prompts outperform every synthetic baseline by a clear margin. Table and math lead the pack, with FAQ and tutorial close behind. The per-benchmark breakdown (switch with the dropdown above) tells a familiar story. FinePhrase prompts dominate on ARC, SQuAD, and DROP (knowledge and reading comprehension), while the baselines hold a slight edge on HellaSwag and PIQA (commonsense). This is the same commonsense-vs-knowledge trade-off we observed throughout the experiments, and it's exactly why FinePhrase is designed to be mixed with original data rather than used alone. The aggregate wins because the knowledge gains far outweigh the commonsense losses.
What makes this result especially compelling is the cost efficiency. Here is how FinePhrase compares to other synthetic data projects:
<figure id="cost-efficiency">
| Dataset | Generator | Tokens | GPU Hours | Tokens/GPU-Hour |
|:----------------|:------------------|---------:|-----------:|----------------:|
| Cosmopedia | Mixtral 8x7B | 25B | {'>'} 10K | {'<'} 2.5M |
| SYNTH | custom fine-tuned | 80B | 4K | 20M |
| REWIRE | Llama-3.3 70B | 400B | **~352K** | ~1.1M |
| Nemotron-CC | Mistral NeMo 12B | **1.9T** | n/a | n/a |
| **FinePhrase** | SmolLM2-1.7B | 486B | ~14.7K | **~33.1M**|
<figcaption>Compute cost comparison across synthetic data generation projects. All GPU hours are H100. REWIRE hours extrapolated from their reported 88K per 100B tokens. Nemotron-CC did not report generation cost.</figcaption>
</figure>
FinePhrase achieves **~33M tokens per GPU hour**, roughly 30x more efficient than REWIRE and over 13x more than Cosmopedia. It generates more tokens than REWIRE while using 24x less compute, thanks to the combined payoff of a 1.7B model (vs 70B), optimized inference settings, and speculative decoding. The takeaway: you do not need large models for high-quality synthetic data generation.
That's the full picture: 90 experiments, a battle-tested infrastructure, and 486 billion tokens of public synthetic data. Let's wrap up with what we learned and where to go next.