joelniklaus (HF Staff) committed
Commit a32050b
1 Parent(s): d997d47

combine two sections for brevity

app/src/content/chapters/2-setup.mdx CHANGED

@@ -12,22 +12,13 @@ Several teams have already shown that rephrasing web content into cleaner format
 
 ### Three Axes of Synthetic Data
 
-We think about synthetic data generation along three axes:
-
-1. **Rephrasing strategy**: the prompt, format, and transformation type that converts a source document into a synthetic variant.
-2. **Generator model**: the size, architecture, and capabilities of the model doing the rephrasing.
-3. **Source data quality**: the characteristics of the seed documents being transformed, from high-quality filtered corpora to noisy web text.
-
-Prior work has explored these dimensions mostly in isolation. But their interactions are where the interesting questions live. Does the best rephrasing strategy depend on source quality? Can small models rephrase high-quality data effectively, or do you need bigger models to salvage noisy documents? When does aggressive transformation help versus hurt?
-
-### What We Want to Find Out
-
-Here are the concrete questions we're trying to answer:
-
-1. **Which rephrasing strategies work best?** We compare prompts from prior work (REWIRE's guided rewriting, Nemotron's QA pairs and knowledge extraction) against novel formats (tutorials, FAQs, tables, math reformulations) to find which transformations consistently improve downstream performance.
-2. **How do generator model properties affect quality?** We test across model families (Gemma [@gemma3], Llama [@llama3], Qwen [@qwen3], Granite [@granite3], Falcon [@falcon3], SmolLM [@smollm2]), model generations (Qwen 1.5 [@qwen] through Qwen 3 [@qwen3]), and scales (270M to 27B parameters).
-3. **When does source data quality matter?** We rephrase both high-quality (FineWeb-Edu-HQ [@fineweb], DCLM [@datacomp]) and low-quality (FineWeb-Edu-LQ, Cosmopedia [@cosmopedia]) sources to test whether rephrasing recovers value from noisy documents or just amplifies existing quality differences.
-4. **How do synthetic and original data interact?** We compare synthetic-only training against mixing synthetic with original data, vary the choice of mix-in dataset, and test whether combining multiple prompts or model families increases diversity enough to replace original data entirely.
+We think about synthetic data generation along three axes, each raising its own question:
+
+1. **Rephrasing strategy**: Which transformations actually improve downstream performance? We compare prompts from prior work (REWIRE's guided rewriting, Nemotron's QA pairs and knowledge extraction) against novel formats (tutorials, FAQs, tables, math reformulations).
+2. **Generator model**: How do model properties affect rephrase quality? We test across model families (Gemma [@gemma3], Llama [@llama3], Qwen [@qwen3], Granite [@granite3], Falcon [@falcon3], SmolLM [@smollm2]), model generations (Qwen 1.5 [@qwen] through Qwen 3 [@qwen3]), and scales (270M to 27B parameters).
+3. **Source data quality**: When does seed quality matter? We rephrase both high-quality (FineWeb-Edu-HQ [@fineweb], DCLM [@datacomp]) and low-quality (FineWeb-Edu-LQ, Cosmopedia [@cosmopedia]) sources to test whether rephrasing recovers value from noisy documents or just amplifies existing quality differences.
+
+Prior work has explored these dimensions mostly in isolation. But their interactions are where the interesting questions live. Does the best strategy depend on source quality? Can small models rephrase high-quality data effectively, or do you need bigger models to salvage noisy documents? And cutting across all three axes: **how do synthetic and original data interact?** We compare synthetic-only training against mixing synthetic with original data, vary the choice of mix-in dataset, and test whether combining multiple prompts or model families increases diversity enough to replace original data entirely.
 
 ### How We Run Rephrasing