Commit a32050b · Parent(s): d997d47
combine two sections for brevity
app/src/content/chapters/2-setup.mdx CHANGED
@@ -12,22 +12,13 @@ Several teams have already shown that rephrasing web content into cleaner format
 
 ### Three Axes of Synthetic Data
 
-We think about synthetic data generation along three axes:
+We think about synthetic data generation along three axes, each raising its own question:
 
-1. **Rephrasing strategy**:
-2. **Generator model**:
-3. **Source data quality**:
+1. **Rephrasing strategy**: Which transformations actually improve downstream performance? We compare prompts from prior work (REWIRE's guided rewriting, Nemotron's QA pairs and knowledge extraction) against novel formats (tutorials, FAQs, tables, math reformulations).
+2. **Generator model**: How do model properties affect rephrase quality? We test across model families (Gemma [@gemma3], Llama [@llama3], Qwen [@qwen3], Granite [@granite3], Falcon [@falcon3], SmolLM [@smollm2]), model generations (Qwen 1.5 [@qwen] through Qwen 3 [@qwen3]), and scales (270M to 27B parameters).
+3. **Source data quality**: When does seed quality matter? We rephrase both high-quality (FineWeb-Edu-HQ [@fineweb], DCLM [@datacomp]) and low-quality (FineWeb-Edu-LQ, Cosmopedia [@cosmopedia]) sources to test whether rephrasing recovers value from noisy documents or just amplifies existing quality differences.
 
-Prior work has explored these dimensions mostly in isolation. But their interactions are where the interesting questions live. Does the best
-
-### What We Want to Find Out
-
-Here are the concrete questions we're trying to answer:
-
-1. **Which rephrasing strategies work best?** We compare prompts from prior work (REWIRE's guided rewriting, Nemotron's QA pairs and knowledge extraction) against novel formats (tutorials, FAQs, tables, math reformulations) to find which transformations consistently improve downstream performance.
-2. **How do generator model properties affect quality?** We test across model families (Gemma [@gemma3], Llama [@llama3], Qwen [@qwen3], Granite [@granite3], Falcon [@falcon3], SmolLM [@smollm2]), model generations (Qwen 1.5 [@qwen] through Qwen 3 [@qwen3]), and scales (270M to 27B parameters).
-3. **When does source data quality matter?** We rephrase both high-quality (FineWeb-Edu-HQ [@fineweb], DCLM [@datacomp]) and low-quality (FineWeb-Edu-LQ, Cosmopedia [@cosmopedia]) sources to test whether rephrasing recovers value from noisy documents or just amplifies existing quality differences.
-4. **How do synthetic and original data interact?** We compare synthetic-only training against mixing synthetic with original data, vary the choice of mix-in dataset, and test whether combining multiple prompts or model families increases diversity enough to replace original data entirely.
+Prior work has explored these dimensions mostly in isolation. But their interactions are where the interesting questions live. Does the best strategy depend on source quality? Can small models rephrase high-quality data effectively, or do you need bigger models to salvage noisy documents? And cutting across all three axes: **how do synthetic and original data interact?** We compare synthetic-only training against mixing synthetic with original data, vary the choice of mix-in dataset, and test whether combining multiple prompts or model families increases diversity enough to replace original data entirely.
 
 ### How We Run Rephrasing
 
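The diff above describes two mechanical steps: rendering a source document through a strategy-specific rephrasing prompt, and mixing synthetic rephrasings with original data at a chosen ratio. A minimal sketch of both steps, assuming illustrative names and templates throughout (`PROMPT_TEMPLATES`, `build_rephrase_prompt`, `mix_synthetic` are hypothetical, not the project's actual code):

```python
import random

# Illustrative stand-ins for the rephrasing strategies named in the diff
# (QA pairs, tutorials, FAQs). The real prompts are longer and strategy-tuned.
PROMPT_TEMPLATES = {
    "qa_pairs": "Rewrite the document below as question-answer pairs.\n\n{doc}",
    "tutorial": "Rewrite the document below as a step-by-step tutorial.\n\n{doc}",
    "faq": "Rewrite the document below as an FAQ.\n\n{doc}",
}


def build_rephrase_prompt(doc: str, strategy: str) -> str:
    """Fill one strategy's template with a source document; the result is
    what would be sent to the generator model."""
    return PROMPT_TEMPLATES[strategy].format(doc=doc)


def mix_synthetic(original, synthetic, synthetic_fraction, total, seed=0):
    """Sample a training pool of `total` documents in which roughly
    `synthetic_fraction` are synthetic rephrasings and the rest are
    original documents (synthetic_fraction=1.0 gives synthetic-only)."""
    rng = random.Random(seed)  # fixed seed for a reproducible mixture
    n_synth = round(total * synthetic_fraction)
    pool = rng.sample(synthetic, n_synth) + rng.sample(original, total - n_synth)
    rng.shuffle(pool)
    return pool
```

Varying `strategy`, the generator model the prompt is sent to, the source corpus, and `synthetic_fraction` corresponds to the axes and the mixing question in the section above.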