joelniklaus HF Staff committed on
Commit d00ac2c · 1 Parent(s): 05b639b

moved analyses to separate main chapter

app/src/content/article.mdx CHANGED
@@ -75,6 +75,7 @@ import Introduction from "./chapters/introduction.mdx";
  import Infrastructure from "./chapters/infrastructure.mdx";
  import Setup from "./chapters/setup.mdx";
  import Experiments from "./chapters/experiments.mdx";
+ import Analyses from "./chapters/analyses.mdx";
  import Conclusions from "./chapters/conclusions.mdx";
  import Appendix from "./chapters/appendix.mdx";
 
@@ -86,6 +87,8 @@ import Appendix from "./chapters/appendix.mdx";
 
  <Experiments />
 
+ <Analyses />
+
  <Conclusions />
 
  <Appendix />
app/src/content/chapters/analyses.mdx ADDED
@@ -0,0 +1,66 @@
+ ## Analyses
+
+ Our final experiment explores an even more counterintuitive finding.
+
+ {/*
+
+ ### Does edu-score or DCLM-score predict model performance?
+
+ Running these ablations is super expensive. So we were looking for informative proxies that can predict whether a certain dataset will result in better downstream benchmark performance. Since the FineWeb-Edu-score and DCLM-score work well for human data, we surmised they could also work for synthetic data.
+
+ TODO: Run this analysis and add a small report
+
+ */}
+
+ ### Math Rephrasing: When "Worse" Outputs Win
+
+ We compared two ~1.7B parameter models for generating math word problems: SmolLM2 and Qwen3. SmolLM2's outputs looked objectively worse, yet models trained on them performed better.
+
+ **Qwen3 produced beautiful, structured outputs:**
+
+ - 100% had proper Problem/Solution sections
+ - 99% had step-by-step formatting
+ - 60% included LaTeX math notation
+
+ Here's a typical Qwen3 output:
+
+ ```
+ **Problem:**
+ A disc rotates at 120 rpm. How many revolutions in 5 minutes?
+
+ **Solution:**
+ 1. Revolutions per minute = 120
+ 2. Number of minutes = 5
+ 3. Total revolutions = 120 × 5
+
+ $$120 \times 5 = 600$$
+
+ The disc makes 600 revolutions in 5 minutes.
+ ```
+
+ **SmolLM2 was messier:**
+
+ - Only 68% had complete solutions
+ - Wide variance in output length (4 to 4,000 tokens)
+ - Mix of formats: questions, partial answers, full solutions
+
+ SmolLM2 outputs ranged from proper solutions to just questions like *"What is the difference between X and Y?"* or even 4-token fragments like *"Areas Where We Service"*.
+
+ Yet models trained on SmolLM2's data **outperformed** those trained on Qwen3's data on downstream benchmarks. We suspect this is due to **template collapse**: Qwen3's outputs were *too* consistent. 115 out of 1,000 samples started with identical text, while SmolLM2's most common pattern appeared only 3 times.
+
+ | Metric | SmolLM2 | Qwen3 |
+ | --- | --- | --- |
+ | Most common start | 3/1000 | 115/1000 |
+ | Output length range (tokens) | 4-4,000 | 100-2,600 |
+ | Unique patterns | High | Low |
+
+ SmolLM2's quality distribution was actually reasonable:
+
+ | Quality | Criteria | Share |
+ | --- | --- | --- |
+ | Excellent | Has "solution" + numbered steps + 80+ tokens | 45% |
+ | Good | Has "solution" + 50+ tokens | 22% |
+ | Partial | 30+ tokens but missing structure | 25% |
+ | Poor | {'<'}30 tokens | 8% |
+
+ For pretraining data, diversity beats consistency. Models that don't follow instructions perfectly can produce better training data than those that do.
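
The quality buckets in the table above are simple lexical heuristics. As a minimal sketch of how such a classifier might look — whitespace token counting and a regex for numbered steps are illustrative assumptions, not the chapter's exact implementation:

```python
import re

def classify_quality(text: str) -> str:
    """Bucket a generated math sample per the chapter's criteria.

    Assumptions (not specified in the chapter): tokens are approximated
    by whitespace splitting, and "numbered steps" means a line starting
    with a digit and a period.
    """
    n_tokens = len(text.split())
    has_solution = "solution" in text.lower()
    # Numbered steps like "1. ..." at the start of a line.
    has_steps = bool(re.search(r"^\s*\d+\.", text, re.MULTILINE))

    if has_solution and has_steps and n_tokens >= 80:
        return "Excellent"
    if has_solution and n_tokens >= 50:
        return "Good"
    if n_tokens >= 30:
        return "Partial"
    return "Poor"

# A 4-token fragment like the SmolLM2 example lands in "Poor".
print(classify_quality("Areas Where We Service"))  # Poor
```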
app/src/content/chapters/experiments.mdx CHANGED
@@ -571,67 +571,3 @@ We compare REWIRE's [original prompt](#guided_rewrite_original) (with typos) aga
  }}
  />
 
- Our final experiment explores an even more counterintuitive finding.
-
- {/*
-
- ### Does edu-score or DCLM-score predict model performance?
-
- Running these ablations is super expensive. So we were looking for informative proxies that can predict whether a certain dataset will result in better downstream benchmark performance. Since the FineWeb-Edu-score and DCLM-score work well for human data, we surmised they could also work for synthetic data.
-
- TODO: Run this analysis and add a small report
-
- */}
-
- ### Math Rephrasing: When "Worse" Outputs Win
-
- We compared two ~1.7B parameter models for generating math word problems: SmolLM2 and Qwen3. SmolLM2's outputs looked objectively worse, yet models trained on them performed better.
-
- **Qwen3 produced beautiful, structured outputs:**
-
- - 100% had proper Problem/Solution sections
- - 99% had step-by-step formatting
- - 60% included LaTeX math notation
-
- Here's a typical Qwen3 output:
-
- ```
- **Problem:**
- A disc rotates at 120 rpm. How many revolutions in 5 minutes?
-
- **Solution:**
- 1. Revolutions per minute = 120
- 2. Number of minutes = 5
- 3. Total revolutions = 120 × 5
-
- $$120 \times 5 = 600$$
-
- The disc makes 600 revolutions in 5 minutes.
- ```
-
- **SmolLM2 was messier:**
-
- - Only 68% had complete solutions
- - Wide variance in output length (4 to 4,000 tokens)
- - Mix of formats: questions, partial answers, full solutions
-
- SmolLM2 outputs ranged from proper solutions to just questions like *"What is the difference between X and Y?"* or even 4-token fragments like *"Areas Where We Service"*.
-
- Yet models trained on SmolLM2's data **outperformed** those trained on Qwen3's data on downstream benchmarks. We suspect this is due to **template collapse**: Qwen3's outputs were *too* consistent. 115 out of 1,000 samples started with identical text, while SmolLM2's most common pattern appeared only 3 times.
-
- | Metric | SmolLM2 | Qwen3 |
- | --- | --- | --- |
- | Most common start | 3/1000 | 115/1000 |
- | Output length range (tokens) | 4-4,000 | 100-2,600 |
- | Unique patterns | High | Low |
-
- SmolLM2's quality distribution was actually reasonable:
-
- | Quality | Criteria | Share |
- | --- | --- | --- |
- | Excellent | Has "solution" + numbered steps + 80+ tokens | 45% |
- | Good | Has "solution" + 50+ tokens | 22% |
- | Partial | 30+ tokens but missing structure | 25% |
- | Poor | {'<'}30 tokens | 8% |
-
- For pretraining data, diversity beats consistency. Models that don't follow instructions perfectly can produce better training data than those that do.
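
The template-collapse diagnostic in the new chapter (the most common opening appearing in 115/1000 Qwen3 samples versus 3/1000 for SmolLM2) amounts to a prefix-frequency count. A minimal sketch, assuming whitespace tokenization and an 8-token prefix window — both illustrative choices, not the commit's actual method:

```python
from collections import Counter

def template_collapse_stats(samples, prefix_tokens=8):
    """Return (most common prefix, its count, number of unique prefixes).

    A high top count (e.g. 115/1000) suggests template collapse; a low
    one (e.g. 3/1000) suggests diverse outputs. Whitespace splitting
    stands in for a real tokenizer here.
    """
    prefixes = Counter(
        " ".join(s.split()[:prefix_tokens]) for s in samples
    )
    top_prefix, top_count = prefixes.most_common(1)[0]
    return top_prefix, top_count, len(prefixes)

# Toy example: a "collapsed" generator repeats one opening 9 times out of 10.
collapsed = ["**Problem:** A disc rotates at 120 rpm."] * 9 + ["A train leaves at noon."]
diverse = [f"Question {i}: compute {i} + {i}." for i in range(10)]

print(template_collapse_stats(collapsed)[1])  # 9
print(template_collapse_stats(diverse)[1])    # 1
```

On real data one would run this over the 1,000-sample batches per generator and compare the top counts directly.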