import Sidenote from "../../components/Sidenote.astro";
import Tabs from "../../components/Tabs.astro";
import Tab from "../../components/Tab.astro";
## Rephrasing the Web
Several teams have already shown that rephrasing web content into cleaner formats can beat training on raw data: WRAP [@wrap] rewrites text in different styles; Nemotron-CC [@nemotroncc] extracts QA pairs and knowledge lists; REWIRE [@rewire] does guided rewriting; BeyondWeb [@beyondweb] tries continuation and summarization; and EntiGraph [@syntheticcpt] uses entity-centric augmentation to synthesize diverse knowledge representations from small corpora. But no one has systematically compared these approaches, and the field still lacks a clear framework for what "rephrasing" even means. So let's fix that.
### What is Rephrasing?
**Rephrasing** means running existing documents through a language model to produce variants that keep the meaning but change the presentation. That sounds simple, but the design space is huge. A document could be reformatted as a tutorial with worked examples, restructured as FAQ pairs, expanded with explanatory commentary, condensed into knowledge lists, or rewritten in Wikipedia style. Each transformation targets different capabilities: tutorials may help step-by-step reasoning, FAQs might boost question-answering, and math reformulations could strengthen quantitative skills. Which transformations actually work, and when? That's what we set out to answer.
### Three Axes of Synthetic Data
We think about synthetic data generation along three axes, each raising its own question:
1. **Rephrasing strategy**: Which transformations actually improve downstream performance? We compare prompts from prior work (REWIRE's guided rewriting, Nemotron's QA pairs and knowledge extraction) against novel formats (tutorials, FAQs, tables, math reformulations).
2. **Generator model**: How do model properties affect rephrase quality? We test across model families (Gemma [@gemma3], Llama [@llama3], Qwen [@qwen3], Granite [@granite3], Falcon [@falcon3], SmolLM [@smollm2]), model generations (Qwen 1.5 [@qwen] through Qwen 3 [@qwen3]), and scales (270M to 27B parameters).
3. **Source data quality**: When does seed quality matter? We rephrase both high-quality (FineWeb-Edu-HQ [@fineweb], DCLM [@datacomp]) and low-quality (FineWeb-Edu-LQ, Cosmopedia [@cosmopedia]) sources to test whether rephrasing recovers value from noisy documents or just amplifies existing quality differences.
Prior work has explored these dimensions mostly in isolation. But their interactions are where the interesting questions live. Does the best strategy depend on source quality? Can small models rephrase high-quality data effectively, or do you need bigger models to salvage noisy documents? And cutting across all three axes: **how do synthetic and original data interact?** We compare synthetic-only training against mixing synthetic with original data, vary the choice of mix-in dataset, and test whether combining multiple prompts or model families increases diversity enough to replace original data entirely. Here's how we set up the pipeline to test all of this.
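As a concrete illustration of the synthetic/original mixing axis, here is a minimal sketch of how a training mix might be assembled at a given synthetic fraction. The function name and the ratios in the example are hypothetical, not the mixes actually used in our experiments:

```python
import random

def mix_datasets(synthetic, original, n_docs, synth_fraction, seed=0):
    """Sample a training mix with a target fraction of synthetic documents.

    Hypothetical helper: `synthetic` and `original` are lists of documents,
    and `synth_fraction` is the share of the mix drawn from the synthetic pool.
    """
    rng = random.Random(seed)
    n_synth = round(n_docs * synth_fraction)
    mix = rng.sample(synthetic, n_synth) + rng.sample(original, n_docs - n_synth)
    rng.shuffle(mix)  # interleave so every batch sees both sources
    return mix
```

Setting `synth_fraction=1.0` gives the synthetic-only condition; sweeping it toward 0 recovers training on original data alone.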
### How We Run Rephrasing
In practice, we rephrase documents using instruction-tuned models ranging from 270M to 27B parameters (primarily Gemma-3 [@gemma3] variants) on filtered web corpora including FineWeb-Edu [@fineweb] and DCLM [@datacomp], processing roughly 20B input tokens per quality tier. Our pipeline runs documents through customizable prompt templates that transform raw web text into structured formats (articles, tutorials, FAQs, discussions, commentaries) as well as distillation and continuation tasks inspired by prior work.
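To make the template step concrete, here is a sketch of what the prompt layer might look like. The strategy names and template wording below are illustrative stand-ins, not the exact prompts used in the pipeline:

```python
# Hypothetical templates keyed by rephrasing strategy; {document} is
# replaced with the raw web text before the prompt is sent to the model.
PROMPT_TEMPLATES = {
    "tutorial": (
        "Rewrite the following web document as a step-by-step tutorial "
        "with worked examples:\n\n{document}"
    ),
    "faq": (
        "Convert the following web document into question-answer "
        "pairs:\n\n{document}"
    ),
    "wikipedia": (
        "Rewrite the following web document in the style of a Wikipedia "
        "article:\n\n{document}"
    ),
}

def build_prompt(strategy: str, document: str) -> str:
    """Fill the template for one rephrasing strategy with a raw document."""
    return PROMPT_TEMPLATES[strategy].format(document=document)
```

Swapping the template while holding everything else fixed is what lets us ablate the rephrasing-strategy axis in isolation.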
For inference we use vLLM [@vllm] with tensor parallelism, chunked prefill, and speculative decoding [@speculativedecoding] (n-gram prompt lookup with ~7 draft tokens, acceptance rates around 0.7). Every rephrased document gets scored by both the FineWeb-Edu classifier and the DCLM quality scorer, and we track token counts, quality score deltas, and metadata including thinking traces when available. The whole thing runs distributed across 100 parallel tasks on a SLURM cluster with checkpointing, targeting 10B tokens of synthetic data for downstream ablations. More on the infrastructure in a [later section](#infrastructure).
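The per-document bookkeeping described above can be sketched as a small record type. The field names are illustrative, not the pipeline's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RephraseRecord:
    """Metadata tracked for one rephrased document (hypothetical schema)."""
    input_tokens: int
    output_tokens: int
    edu_score_before: float   # FineWeb-Edu classifier score on the source
    edu_score_after: float    # same classifier on the rephrased output
    thinking_trace: Optional[str] = None  # kept when the model emits one

    @property
    def score_delta(self) -> float:
        """Quality gain (or loss) introduced by rephrasing."""
        return self.edu_score_after - self.edu_score_before

    @property
    def token_ratio(self) -> float:
        """Output-to-input length ratio, guarding against empty inputs."""
        return self.output_tokens / max(self.input_tokens, 1)
```

Aggregating `score_delta` over a corpus is what tells us whether a given generator and prompt actually moved the quality needle.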
### Source Datasets
Before diving into experiments, here's a quick overview of the datasets we compare against. We use "source data" and "seed data" interchangeably throughout.
**DCLM (DataComp-LM)**: a standardized benchmark providing a 240T-token corpus from Common Crawl, with model-based filtering as its key curation strategy. DCLM enables training a 7B-parameter model to 64% accuracy on MMLU with 2.6T tokens [@datacomp].
**FineWeb-Edu (HQ/LQ)**: subsets of FineWeb-Edu, a 1.3T-token educational dataset filtered using Llama-3-70B-Instruct [@llama3] to score samples on educational quality from 0 to 5. We use HQ (scores 4 or 5) and LQ (scores 0 or 1) to investigate the impact of seed-data quality on rephrasing [@fineweb].
**Ultra-FineWeb**: a dataset of 1T English tokens and 120B Chinese tokens created by applying efficient verification-based filtering to FineWeb, using a lightweight fastText classifier and optimized seed-data selection to improve quality [@ultrafineweb].
**Nemotron-CC (High-Quality-Synthetic)**: part of Nemotron-CC, a 6.3T-token dataset built with classifier ensembling and synthetic rephrasing. The High-Quality-Synthetic subset contains data synthetically rephrased using Qwen3-30B-A3B [@qwen3] [@nemotroncc].
**Cosmopedia**: a synthetic dataset of 30 million files and 25 billion tokens generated by Mixtral-8x7B-Instruct [@mixtral], containing textbooks, blog posts, and stories across diverse topics. Created through careful prompt engineering conditioned on curated educational sources and web-data clusters [@cosmopedia].
**SYNTH**: a fully synthetic dataset built from 50,000 Wikipedia articles expanded into problems and resolution paths, including math exercises, creative writing, and information extraction. It uses multiple specialized synthetic pipelines with fine-tuned models, grounded in encyclopedic content [@synthpleias].
**REWIRE**: a method for recycling the web via guided rewriting that enriches low-quality documents discarded by filtering pipelines. Mixing high-quality raw text with rewritten text yields improvements of 1.0, 1.3, and 2.5 percentage points at the 1B, 3B, and 7B scales across 22 tasks [@rewire].
With the datasets defined, we need a consistent way to tell whether one configuration is better than another.
### How We Measure Success
To evaluate each configuration, we follow the ablation methodology from FineWeb [@fineweb]: train a 1.2B parameter language model with a Qwen2-style architecture [@qwen2] (details in the [Appendix](#details-on-the-experiments)) on 20B tokens and evaluate on 12 benchmarks across six categories, using 3-shot prompting with a single seed.
Since our model is small and trained on only 20B tokens, we use the **cloze format** (CF) for most tasks rather than standard multiple choice. CF frames evaluation as next-token prediction, which gives a more reliable signal for smaller models that may struggle with instruction following or multiple-choice formatting. The twelve benchmarks, grouped by category:
- **General Knowledge**: ARC [@arc], MMLU Redux [@mmluredux]
- **Reading Comprehension**: SQuAD v2 [@squad2], DROP [@drop]
- **Reasoning**: OpenBookQA [@openbookqa], XCSQA [@xcsqa]
- **Natural Language Understanding**: WinoGrande [@winogrande], PIQA [@piqa], HellaSwag [@hellaswag]
- **Math**: GSM8K [@gsm8k]
- **Table Understanding**: WikiTableQuestions [@wikitablequestions], TriviaQA [@triviaqa]
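In spirit, cloze-format scoring reduces to picking the answer continuation the model finds most likely. A minimal sketch, assuming length-normalized log-probabilities (how our harness normalizes, if at all, is an implementation detail not shown here); `logprob_fn` is a stand-in for a real model call:

```python
def cloze_choice(logprob_fn, question, options):
    """Pick the option whose continuation the model scores highest.

    `logprob_fn(prompt, continuation)` is a hypothetical model interface
    returning (total_logprob, n_tokens) for the continuation.
    """
    def normalized(opt):
        total, n_tokens = logprob_fn(question, opt)
        return total / max(n_tokens, 1)  # length-normalize the log-prob
    return max(options, key=normalized)
```

Because this only compares continuation likelihoods, it sidesteps the need for the model to understand A/B/C/D answer formatting.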
With all that context out of the way, let's get to the fun part: the experiments.