Commit d00ac2c · Parent(s): 05b639b
moved analyses to separate main chapter
app/src/content/article.mdx
CHANGED
```diff
@@ -75,6 +75,7 @@ import Introduction from "./chapters/introduction.mdx";
 import Infrastructure from "./chapters/infrastructure.mdx";
 import Setup from "./chapters/setup.mdx";
 import Experiments from "./chapters/experiments.mdx";
+import Analyses from "./chapters/analyses.mdx";
 import Conclusions from "./chapters/conclusions.mdx";
 import Appendix from "./chapters/appendix.mdx";
 
@@ -86,6 +87,8 @@ import Appendix from "./chapters/appendix.mdx";
 
 <Experiments />
 
+<Analyses />
+
 <Conclusions />
 
 <Appendix />
```
app/src/content/chapters/analyses.mdx
ADDED
@@ -0,0 +1,66 @@
## Analyses

Our final experiment explores an even more counterintuitive finding.

{/*

### Does edu-score or DCLM-score predict model performance?

Running these ablations is expensive, so we looked for informative proxies that predict whether a given dataset will yield better downstream benchmark performance. Since the FineWeb-Edu score and DCLM score work well for human-written data, we surmised they could also work for synthetic data.

TODO: Run this analysis and add a small report

*/}
### Math Rephrasing: When "Worse" Outputs Win

We compared two ~1.7B parameter models for generating math word problems: SmolLM2 and Qwen3. SmolLM2's outputs looked objectively worse, yet models trained on them performed better.

**Qwen3 produced beautiful, structured outputs:**

- 100% had proper Problem/Solution sections
- 99% had step-by-step formatting
- 60% included LaTeX math notation
Here's a typical Qwen3 output:

```
**Problem:**
A disc rotates at 120 rpm. How many revolutions in 5 minutes?

**Solution:**
1. Revolutions per minute = 120
2. Number of minutes = 5
3. Total revolutions = 120 × 5

$$120 \times 5 = 600$$

The disc makes 600 revolutions in 5 minutes.
```
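The three structural percentages above can be estimated with simple string checks. A minimal sketch (the marker strings and regexes are illustrative assumptions, not the exact criteria we used):

```python
import re

def structure_stats(samples: list[str]) -> dict[str, float]:
    """Return the share (in %) of samples carrying each structural marker."""
    def share(pred) -> float:
        return 100.0 * sum(1 for s in samples if pred(s)) / len(samples)

    return {
        # Assumed marker: bold Problem/Solution headers, as in the example above.
        "problem_solution": share(lambda s: "**Problem:**" in s and "**Solution:**" in s),
        # A line starting with "1.", "2.", ... counts as step-by-step formatting.
        "numbered_steps": share(lambda s: bool(re.search(r"^\s*\d+\.", s, re.M))),
        # Display math or common TeX commands count as LaTeX notation.
        "latex": share(lambda s: "$$" in s or "\\times" in s or "\\frac" in s),
    }
```

Run over a fixed slice of each generator's output (we used 1,000 samples) to get comparable percentages.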
**SmolLM2 was messier:**

- Only 68% had complete solutions
- Wide variance in output length (4 to 4,000 tokens)
- Mix of formats: questions, partial answers, full solutions

SmolLM2 outputs ranged from proper solutions to just questions like *"What is the difference between X and Y?"* or even 4-token fragments like *"Areas Where We Service"*.

Yet models trained on SmolLM2's data **outperformed** those trained on Qwen3's data on downstream benchmarks. We suspect this is due to **template collapse**: Qwen3's outputs were *too* consistent. 115 out of 1,000 samples started with identical text, while SmolLM2's most common opening appeared only 3 times.
| Metric | SmolLM2 | Qwen3 |
| --- | --- | --- |
| Most common opening | 3/1,000 samples | 115/1,000 samples |
| Output length range (tokens) | 4 to 4,000 | 100 to 2,600 |
| Unique patterns | High | Low |
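Template collapse is cheap to measure: count how many samples share their opening tokens. A sketch, assuming whitespace tokenization and plain-string samples:

```python
from collections import Counter

def most_common_opening(samples: list[str], prefix_tokens: int = 8) -> tuple[str, int]:
    """Most frequent opening (first `prefix_tokens` whitespace tokens) and its count.
    A high count relative to len(samples) signals template collapse."""
    prefixes = [" ".join(s.split()[:prefix_tokens]) for s in samples]
    return Counter(prefixes).most_common(1)[0]

# Toy data mimicking the numbers above: 115 identical openings out of 1,000.
samples = ["**Problem:** A disc rotates at 120 rpm..."] * 115
samples += [f"Variant {i}: a different opening every time" for i in range(885)]
opening, count = most_common_opening(samples)
print(f"{count}/{len(samples)}")  # 115/1000
```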
SmolLM2's quality distribution was actually reasonable:

| Quality | Criteria | Share |
| --- | --- | --- |
| Excellent | Has "solution" + numbered steps + 80+ tokens | 45% |
| Good | Has "solution" + 50+ tokens | 22% |
| Partial | 30+ tokens but missing structure | 25% |
| Poor | {'<'}30 tokens | 8% |

For pretraining data, diversity beats consistency. Models that don't follow instructions perfectly can produce better training data than those that do.
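The bucket criteria above translate directly into a small heuristic classifier. A minimal sketch (whitespace token counts and the numbered-step regex are assumptions, not the exact implementation):

```python
import re

def quality_bucket(text: str) -> str:
    """Assign a sample to the quality buckets used above:
    Excellent > Good > Partial > Poor (first match wins)."""
    n_tokens = len(text.split())  # whitespace tokens as a cheap proxy
    has_solution = "solution" in text.lower()
    # A line beginning with "1.", "2.", ... counts as numbered steps.
    has_steps = bool(re.search(r"^\s*\d+\.", text, re.MULTILINE))

    if has_solution and has_steps and n_tokens >= 80:
        return "Excellent"
    if has_solution and n_tokens >= 50:
        return "Good"
    if n_tokens >= 30:
        return "Partial"
    return "Poor"

print(quality_bucket("Areas Where We Service"))  # Poor: a 4-token fragment
```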
|
app/src/content/chapters/experiments.mdx
CHANGED
```diff
@@ -571,67 +571,3 @@ We compare REWIRE's [original prompt](#guided_rewrite_original) (with typos) aga
 }}
 />
 
```

(The 64 removed lines are the "Analyses" section, identical to the content of the new analyses.mdx above.)