joelniklaus (HF Staff) committed
Commit a32050b
1 Parent(s): d997d47

combine two sections for brevity

app/src/content/chapters/2-setup.mdx CHANGED

@@ -12,22 +12,13 @@ Several teams have already shown that rephrasing web content into cleaner format
 
 ### Three Axes of Synthetic Data
 
-We think about synthetic data generation along three axes:
-
-1. **Rephrasing strategy**: the prompt, format, and transformation type that converts a source document into a synthetic variant.
-2. **Generator model**: the size, architecture, and capabilities of the model doing the rephrasing.
-3. **Source data quality**: the characteristics of the seed documents being transformed, from high-quality filtered corpora to noisy web text.
-
-Prior work has explored these dimensions mostly in isolation. But their interactions are where the interesting questions live. Does the best rephrasing strategy depend on source quality? Can small models rephrase high-quality data effectively, or do you need bigger models to salvage noisy documents? When does aggressive transformation help versus hurt?
-
-### What We Want to Find Out
-
-Here are the concrete questions we're trying to answer:
-
-1. **Which rephrasing strategies work best?** We compare prompts from prior work (REWIRE's guided rewriting, Nemotron's QA pairs and knowledge extraction) against novel formats (tutorials, FAQs, tables, math reformulations) to find which transformations consistently improve downstream performance.
-2. **How do generator model properties affect quality?** We test across model families (Gemma [@gemma3], Llama [@llama3], Qwen [@qwen3], Granite [@granite3], Falcon [@falcon3], SmolLM [@smollm2]), model generations (Qwen 1.5 [@qwen] through Qwen 3 [@qwen3]), and scales (270M to 27B parameters).
-3. **When does source data quality matter?** We rephrase both high-quality (FineWeb-Edu-HQ [@fineweb], DCLM [@datacomp]) and low-quality (FineWeb-Edu-LQ, Cosmopedia [@cosmopedia]) sources to test whether rephrasing recovers value from noisy documents or just amplifies existing quality differences.
-4. **How do synthetic and original data interact?** We compare synthetic-only training against mixing synthetic with original data, vary the choice of mix-in dataset, and test whether combining multiple prompts or model families increases diversity enough to replace original data entirely.
+We think about synthetic data generation along three axes, each raising its own question:
+
+1. **Rephrasing strategy**: Which transformations actually improve downstream performance? We compare prompts from prior work (REWIRE's guided rewriting, Nemotron's QA pairs and knowledge extraction) against novel formats (tutorials, FAQs, tables, math reformulations).
+2. **Generator model**: How do model properties affect rephrase quality? We test across model families (Gemma [@gemma3], Llama [@llama3], Qwen [@qwen3], Granite [@granite3], Falcon [@falcon3], SmolLM [@smollm2]), model generations (Qwen 1.5 [@qwen] through Qwen 3 [@qwen3]), and scales (270M to 27B parameters).
+3. **Source data quality**: When does seed quality matter? We rephrase both high-quality (FineWeb-Edu-HQ [@fineweb], DCLM [@datacomp]) and low-quality (FineWeb-Edu-LQ, Cosmopedia [@cosmopedia]) sources to test whether rephrasing recovers value from noisy documents or just amplifies existing quality differences.
+
+Prior work has explored these dimensions mostly in isolation. But their interactions are where the interesting questions live. Does the best strategy depend on source quality? Can small models rephrase high-quality data effectively, or do you need bigger models to salvage noisy documents? And cutting across all three axes: **how do synthetic and original data interact?** We compare synthetic-only training against mixing synthetic with original data, vary the choice of mix-in dataset, and test whether combining multiple prompts or model families increases diversity enough to replace original data entirely.
 
 ### How We Run Rephrasing