import Accordion from "../../components/Accordion.astro";
## Setup
Recent work like WRAP [@wrap], Nemotron-CC [@nemotroncc], REWIRE [@rewire], and BeyondWeb [@beyondweb] has shown that rephrasing web content into higher-quality formats can outperform training on raw data alone. But the field still lacks a clear framework for what "rephrasing" actually means and a systematic investigation of which factors make it work. That's what we discuss in this section.
### What is Rephrasing?
At its core, **rephrasing** means running existing documents through a language model to produce variants that keep the meaning but change the presentation. That sounds simple, but the design space is huge. A document could be reformatted as a tutorial with worked examples, restructured as FAQ pairs, expanded with explanatory commentary, condensed into knowledge lists, or rewritten in Wikipedia style. Each transformation targets different capabilities: tutorials may help step-by-step reasoning, FAQs might boost question-answering, and math reformulations could strengthen quantitative skills. Which transformations actually work, and when? That's what we set out to answer.
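Mechanically, a rephrasing strategy is just a prompt template wrapped around a source document before it is sent to a generator model. As a minimal sketch (the template texts and names below are illustrative, not our exact prompts):

```python
# Illustrative rephrasing templates. The actual prompts used in our
# experiments differ, but the mechanism is the same: wrap the source
# document in an instruction and let the generator model rewrite it.
TEMPLATES = {
    "wikipedia": "Rewrite the following text as a Wikipedia-style article:\n\n{doc}",
    "faq": "Convert the following text into a list of question-answer pairs:\n\n{doc}",
    "tutorial": "Turn the following text into a step-by-step tutorial with worked examples:\n\n{doc}",
    "knowledge_list": "Condense the following text into a list of factual statements:\n\n{doc}",
}

def build_rephrasing_prompt(document: str, strategy: str) -> str:
    """Fill the chosen template with the source document."""
    return TEMPLATES[strategy].format(doc=document)

prompt = build_rephrasing_prompt(
    "Photosynthesis converts light into chemical energy.", "faq"
)
```

The choice of template is the first of the three design axes discussed below; the same document can seed many different synthetic variants.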
### Three Axes of Synthetic Data
We think about synthetic data generation along three axes:
1. **Rephrasing strategy**: the prompt, format, and transformation type that converts a source document into a synthetic variant.
2. **Generator model**: the size, architecture, and capabilities of the model doing the rephrasing.
3. **Source data quality**: the characteristics of the seed documents being transformed, from high-quality filtered corpora to noisy web text.
Prior work has explored these dimensions mostly in isolation. But their interactions are where the interesting questions live. Does the best rephrasing strategy depend on source quality? Can small models rephrase high-quality data effectively, or do you need bigger models to salvage noisy documents? When does aggressive transformation help versus hurt?
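One way to make the axes concrete is to treat every experiment as a point in a small configuration grid; enumerating the grid is exactly what surfaces the interaction questions. A sketch (the field values here are hypothetical labels, not our exact run names):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RephrasingRun:
    """One experiment = one point along the three axes."""
    strategy: str   # axis 1: prompt / transformation type, e.g. "faq"
    generator: str  # axis 2: model doing the rephrasing
    source: str     # axis 3: seed corpus / quality tier

# A full grid over the axes makes interactions testable: the same
# strategy is applied by every generator to every source tier.
runs = [
    RephrasingRun(strategy, generator, source)
    for strategy in ("wikipedia", "faq", "tutorial")
    for generator in ("gemma-3-1b-it", "gemma-3-27b-it")
    for source in ("fineweb-edu-hq", "fineweb-edu-lq")
]
```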
### Research Questions
FinePhrase tackles these questions through systematic experimentation across all three axes:
1. **Which rephrasing strategies work best?** We compare prompts from prior work (REWIRE's guided rewriting, Nemotron's QA pairs and knowledge extraction) against novel formats (tutorials, FAQs, tables, math reformulations) to find which transformations consistently improve downstream performance.
2. **How do generator model properties affect quality?** We test across model families (Gemma [@gemma3], Llama [@llama3], Qwen [@qwen3], Granite [@granite3], Falcon [@falcon3], SmolLM [@smollm2]), model generations (Qwen 1.5 [@qwen] through Qwen 3 [@qwen3]), and scales (270M to 27B parameters).
3. **When does source data quality matter?** We rephrase both high-quality (FineWeb-Edu-HQ [@fineweb], DCLM [@datacomp]) and low-quality (FineWeb-Edu-LQ, Cosmopedia [@cosmopedia]) sources to test whether rephrasing recovers value from noisy documents or just amplifies existing quality differences.
4. **How do synthetic and original data interact?** We compare synthetic-only training against mixing synthetic with original data, vary the choice of mix-in dataset, and test whether combining multiple prompts or model families increases diversity enough to replace original data entirely.
### Rephrasing Setup
We rephrase documents using instruction-tuned models ranging from 270M to 27B parameters (primarily Gemma-3 [@gemma3] variants) on filtered web corpora including FineWeb-Edu [@fineweb] and DCLM [@datacomp], processing roughly 20B input tokens per quality tier. Our pipeline runs documents through customizable prompt templates that transform raw web text into structured formats (articles, tutorials, FAQs, discussions, commentaries) as well as distillation and continuation tasks inspired by prior work, producing between ~XB and XXB output tokens depending on the strategy.
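Schematically, each document flows through the pipeline as template fill, generation, and bookkeeping. The sketch below stubs out the model call and uses whitespace token counts purely for illustration; all names are hypothetical:

```python
def rephrase_document(document: str, template: str, generate) -> dict:
    """Run one document through a prompt template and record token counts.
    `generate` stands in for the real model call (e.g. a batched request
    to an inference server); whitespace splitting stands in for a real
    tokenizer."""
    prompt = template.format(doc=document)
    output = generate(prompt)
    return {
        "input_tokens": len(document.split()),
        "output_tokens": len(output.split()),
        "text": output,
    }

# Stub generator: a real pipeline would batch prompts to the model.
record = rephrase_document(
    "Water boils at 100 degrees Celsius at sea level.",
    "Rewrite the following text as a short FAQ:\n\n{doc}",
    generate=lambda prompt: "Q: At what temperature does water boil? A: 100 C at sea level.",
)
```

Tracking input versus output token counts per strategy is what lets us report the output-token budgets above.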
For inference we use vLLM [@vllm] with tensor parallelism, chunked prefill, and speculative decoding [@speculativedecoding] (n-gram prompt lookup with ~7 draft tokens, acceptance rates around 0.7). Every rephrased document gets scored by both the FineWeb-Edu classifier and the DCLM quality scorer, and we track token counts, quality score deltas, and metadata including thinking traces when available. The whole thing runs distributed across 100 parallel tasks on a SLURM cluster with checkpointing, targeting 10B tokens of synthetic data for downstream ablations.
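N-gram prompt lookup needs no separate draft model: when the last few generated tokens also appear earlier in the context, the tokens that followed them there are proposed as a draft and verified by the main model in one forward pass. Rephrasing is a good fit because outputs copy long spans from the source document, so lookups hit often. A toy version of the lookup step, operating on word lists rather than token IDs (real implementations, such as vLLM's, handle many more edge cases):

```python
def ngram_draft(context: list, ngram: int = 2, num_draft: int = 7) -> list:
    """Propose draft tokens by matching the trailing n-gram earlier in context."""
    if len(context) < ngram:
        return []
    tail = context[-ngram:]
    # Search for the most recent earlier occurrence of the trailing n-gram.
    for start in range(len(context) - ngram - 1, -1, -1):
        if context[start:start + ngram] == tail:
            # Tokens that followed the match become the draft.
            return context[start + ngram:start + ngram + num_draft]
    return []

ctx = "the cat sat on the mat and then the cat".split()
print(ngram_draft(ctx, ngram=2, num_draft=3))  # ['sat', 'on', 'the']
```

The main model then scores the draft in a single pass and keeps the longest accepted prefix; the ~0.7 acceptance rate we observe is what makes the ~7-token drafts pay off.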
### Source Datasets
We compare against several baseline datasets for pretraining and data rephrasing. We use "source data" and "seed data" interchangeably throughout.
<Accordion title="DCLM" open>
A standardized benchmark providing a 240T token corpus from Common Crawl with model-based filtering as a key curation strategy. DCLM (DataComp-LM) enables training a 7B parameter model to 64% accuracy on MMLU with 2.6T tokens [@datacomp].
</Accordion>
<Accordion title="FineWeb-Edu-HQ / FineWeb-Edu-LQ">
Subsets of FineWeb-Edu, a 1.3T token educational dataset filtered with a classifier trained on educational-quality scores (0 to 5) annotated by Llama-3-70B-Instruct [@llama3]. We use HQ (scores 4 or 5) and LQ (scores 0 or 1) to investigate the impact of seed data quality on rephrasing [@fineweb].
</Accordion>
<Accordion title="Ultra-FineWeb">
A dataset of 1T English tokens and 120B Chinese tokens created by applying efficient verification-based filtering to FineWeb. It uses a lightweight fastText classifier and optimized seed data selection to improve data quality [@ultrafineweb].
</Accordion>
<Accordion title="Nemotron-HQ-Synth">
Part of Nemotron-CC, a 6.3T token dataset using classifier ensembling and synthetic data rephrasing. The High-Quality-Synthetic subset contains synthetically rephrased data using Qwen3-30B-A3B [@qwen3] [@nemotroncc].
</Accordion>
<Accordion title="Cosmopedia">
A synthetic dataset of 30 million samples (25 billion tokens) generated by Mixtral-8x7B-Instruct [@mixtral], containing textbooks, blog posts, and stories across diverse topics. It was created through careful prompt engineering, conditioning on curated educational sources and web data clusters [@cosmopedia].
</Accordion>
<Accordion title="SYNTH">
A fully synthetic dataset built from 50,000 Wikipedia articles expanded into problems and resolution paths, including math exercises, creative writing, and information extraction. It uses multiple specialized synthetic pipelines with fine-tuned models and grounding in encyclopedic content [@synthpleias].
</Accordion>
<Accordion title="REWIRE">
A method for recycling the web with guided rewriting that enriches low-quality documents discarded by filtering pipelines. Mixing high-quality raw texts with rewritten texts leads to improvements of 1.0, 1.3, and 2.5 percentage points at the 1B, 3B, and 7B scales across 22 tasks [@rewire].
</Accordion>
### Ablation Setup
Following the ablation methodology from FineWeb [@fineweb], we train a 1.2B parameter language model with a Qwen2-style architecture [@qwen2] (details in the Appendix) and evaluate on 12 benchmarks spanning reasoning, question answering, and math:
- **Reasoning**: ARC [@arc], HellaSwag [@hellaswag], MMLU Redux [@mmluredux], XCSQA [@xcsqa], OpenBookQA [@openbookqa], Winogrande [@winogrande], PIQA [@piqa]
- **Question answering**: SQuAD v2 [@squad2], DROP [@drop], WikiTableQuestions [@wikitablequestions], TriviaQA [@triviaqa]
- **Math**: GSM8K [@gsm8k]
Since our model is small and trained on only 20B tokens, we use the **continuation format** (CF) for most tasks rather than standard multiple-choice. CF frames evaluation as next-token prediction, which gives more reliable signal for smaller models that may struggle with instruction following or multiple-choice formatting. All evaluations use 3-shot prompting with a single seed.
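Concretely, in continuation format each answer option is scored by the log-likelihood the model assigns to it as a continuation of the question (typically length-normalized), and the highest-scoring option wins. A sketch with a stand-in scoring function in place of real model log-probabilities:

```python
def pick_answer(question: str, options: list, loglikelihood) -> str:
    """Choose the option whose continuation the model finds most likely.
    `loglikelihood(context, continuation)` stands in for a real model's
    summed token log-probabilities; scores are normalized by length so
    longer options are not penalized."""
    scores = []
    for option in options:
        ll = loglikelihood(question, option)
        scores.append(ll / max(len(option.split()), 1))
    return options[scores.index(max(scores))]

# Toy likelihood: favors the option sharing the most words with the question.
def toy_ll(context, continuation):
    overlap = len(set(context.lower().split()) & set(continuation.lower().split()))
    return -1.0 + overlap

best = pick_answer(
    "The sky on a clear day is",
    ["blue on a clear day", "made of green cheese"],
    toy_ll,
)
```

No instruction following is required: the model only has to assign probabilities to text, which is exactly what a small pretrained model can do.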
### A Note on Model Collapse
A common misconception is that any use of synthetic data inevitably degrades model performance. This stems from research [@modelcollapse] showing severe degradation when models are trained exclusively and iteratively on their own outputs, without any new information or human data.
In practice, nobody trains models this way. Real-world synthetic data pipelines mix synthetic with human data, use diverse reference materials in prompts, and apply synthetic data strategically for specific purposes rather than replacing entire training corpora. Model collapse happens in a closed loop on a model's own outputs without new signal, which is not how practitioners use synthetic data.
The real concern is frontier models generating training data for other frontier models in isolation. Thoughtful integration of synthetic data that introduces new knowledge or perspectives is a different story entirely. In FineWeb [@fineweb] we also found no degradation from naturally occurring AI-generated data on the web.