joelniklaus HF Staff committed on
Commit d00ac2c · 1 Parent(s): 05b639b

moved analyses to separate main chapter

app/src/content/article.mdx CHANGED
@@ -75,6 +75,7 @@ import Introduction from "./chapters/introduction.mdx";
  import Infrastructure from "./chapters/infrastructure.mdx";
  import Setup from "./chapters/setup.mdx";
  import Experiments from "./chapters/experiments.mdx";
+ import Analyses from "./chapters/analyses.mdx";
  import Conclusions from "./chapters/conclusions.mdx";
  import Appendix from "./chapters/appendix.mdx";
 
@@ -86,6 +87,8 @@ import Appendix from "./chapters/appendix.mdx";
 
  <Experiments />
 
+ <Analyses />
+
  <Conclusions />
 
  <Appendix />
app/src/content/chapters/analyses.mdx ADDED
@@ -0,0 +1,66 @@
+ ## Analyses
+
+ Our final experiment explores an even more counterintuitive finding.
+
+ {/*
+
+ ### Does edu-score or DCLM-score predict model performance?
+
+ Running these ablations is super expensive. So we were looking for informative proxies that can predict whether a certain dataset will result in better downstream benchmark performance. Since the FineWeb-Edu-score and DCLM-score work well for human data, we surmised they could also work for synthetic data.
+
+ TODO: Run this analysis and add a small report
+
+ */}
+
+ ### Math Rephrasing: When "Worse" Outputs Win
+
+ We compared two ~1.7B parameter models for generating math word problems: SmolLM2 and Qwen3. SmolLM2's outputs looked objectively worse, yet models trained on them performed better.
+
+ **Qwen3 produced beautiful, structured outputs:**
+
+ - 100% had proper Problem/Solution sections
+ - 99% had step-by-step formatting
+ - 60% included LaTeX math notation
+
+ Here's a typical Qwen3 output:
+
+ ```
+ **Problem:**
+ A disc rotates at 120 rpm. How many revolutions in 5 minutes?
+
+ **Solution:**
+ 1. Revolutions per minute = 120
+ 2. Number of minutes = 5
+ 3. Total revolutions = 120 × 5
+
+ $$120 \times 5 = 600$$
+
+ The disc makes 600 revolutions in 5 minutes.
+ ```
+
+ **SmolLM2 was messier:**
+
+ - Only 68% had complete solutions
+ - Wide variance in output length (4 to 4,000 tokens)
+ - Mix of formats: questions, partial answers, full solutions
+
+ SmolLM2 outputs ranged from proper solutions to just questions like *"What is the difference between X and Y?"* or even 4-token fragments like *"Areas Where We Service"*.
+
+ Yet models trained on SmolLM2's data **outperformed** those trained on Qwen3's data on downstream benchmarks. We suspect this is due to **template collapse**: Qwen3's outputs were *too* consistent. 115 out of 1,000 samples started with identical text, while SmolLM2's most common pattern appeared only 3 times.
+
+ | Metric | SmolLM2 | Qwen3 |
+ | --- | --- | --- |
+ | Most common start | 3/1000 | 115/1000 |
+ | Output length range (tokens) | 4-4,000 | 100-2,600 |
+ | Unique patterns | High | Low |
+
+ SmolLM2's quality distribution was actually reasonable:
+
+ | Quality | Criteria | Share |
+ | --- | --- | --- |
+ | Excellent | Has "solution" + numbered steps + 80+ tokens | 45% |
+ | Good | Has "solution" + 50+ tokens | 22% |
+ | Partial | 30+ tokens but missing structure | 25% |
+ | Poor | {'<'}30 tokens | 8% |
+
+ For pretraining data, diversity beats consistency. Models that don't follow instructions perfectly can produce better training data than those that do.
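
The quality buckets in the table above are simple lexical heuristics. As a minimal sketch of how such a classifier might look — whitespace token counting and a regex for numbered steps are illustrative assumptions, not the chapter's exact implementation:

```python
import re

def classify_quality(text: str) -> str:
    """Bucket a generated math sample per the chapter's criteria.

    Assumptions (not specified in the chapter): tokens are approximated
    by whitespace splitting, and "numbered steps" means a line starting
    with a digit and a period.
    """
    n_tokens = len(text.split())
    has_solution = "solution" in text.lower()
    # Numbered steps like "1. ..." at the start of a line.
    has_steps = bool(re.search(r"^\s*\d+\.", text, re.MULTILINE))

    if has_solution and has_steps and n_tokens >= 80:
        return "Excellent"
    if has_solution and n_tokens >= 50:
        return "Good"
    if n_tokens >= 30:
        return "Partial"
    return "Poor"

# A 4-token fragment like the SmolLM2 example lands in "Poor".
print(classify_quality("Areas Where We Service"))  # Poor
```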
app/src/content/chapters/experiments.mdx CHANGED
@@ -571,67 +571,3 @@ We compare REWIRE's [original prompt](#guided_rewrite_original) (with typos) aga
  }}
  />
 
- Our final experiment explores an even more counterintuitive finding.
-
- {/*
-
- ### Does edu-score or DCLM-score predict model performance?
-
- Running these ablations is super expensive. So we were looking for informative proxies that can predict whether a certain dataset will result in better downstream benchmark performance. Since the FineWeb-Edu-score and DCLM-score work well for human data, we surmised they could also work for synthetic data.
-
- TODO: Run this analysis and add a small report
-
- */}
-
- ### Math Rephrasing: When "Worse" Outputs Win
-
- We compared two ~1.7B parameter models for generating math word problems: SmolLM2 and Qwen3. SmolLM2's outputs looked objectively worse, yet models trained on them performed better.
-
- **Qwen3 produced beautiful, structured outputs:**
-
- - 100% had proper Problem/Solution sections
- - 99% had step-by-step formatting
- - 60% included LaTeX math notation
-
- Here's a typical Qwen3 output:
-
- ```
- **Problem:**
- A disc rotates at 120 rpm. How many revolutions in 5 minutes?
-
- **Solution:**
- 1. Revolutions per minute = 120
- 2. Number of minutes = 5
- 3. Total revolutions = 120 × 5
-
- $$120 \times 5 = 600$$
-
- The disc makes 600 revolutions in 5 minutes.
- ```
-
- **SmolLM2 was messier:**
-
- - Only 68% had complete solutions
- - Wide variance in output length (4 to 4,000 tokens)
- - Mix of formats: questions, partial answers, full solutions
-
- SmolLM2 outputs ranged from proper solutions to just questions like *"What is the difference between X and Y?"* or even 4-token fragments like *"Areas Where We Service"*.
-
- Yet models trained on SmolLM2's data **outperformed** those trained on Qwen3's data on downstream benchmarks. We suspect this is due to **template collapse**: Qwen3's outputs were *too* consistent. 115 out of 1,000 samples started with identical text, while SmolLM2's most common pattern appeared only 3 times.
-
- | Metric | SmolLM2 | Qwen3 |
- | --- | --- | --- |
- | Most common start | 3/1000 | 115/1000 |
- | Output length range (tokens) | 4-4,000 | 100-2,600 |
- | Unique patterns | High | Low |
-
- SmolLM2's quality distribution was actually reasonable:
-
- | Quality | Criteria | Share |
- | --- | --- | --- |
- | Excellent | Has "solution" + numbered steps + 80+ tokens | 45% |
- | Good | Has "solution" + 50+ tokens | 22% |
- | Partial | 30+ tokens but missing structure | 25% |
- | Poor | {'<'}30 tokens | 8% |
-
- For pretraining data, diversity beats consistency. Models that don't follow instructions perfectly can produce better training data than those that do.
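
The template-collapse diagnostic in the new chapter (the most common opening appearing in 115/1000 Qwen3 samples versus 3/1000 for SmolLM2) amounts to a prefix-frequency count. A minimal sketch, assuming whitespace tokenization and an 8-token prefix window — both illustrative choices, not the commit's actual method:

```python
from collections import Counter

def template_collapse_stats(samples, prefix_tokens=8):
    """Return (most common prefix, its count, number of unique prefixes).

    A high top count (e.g. 115/1000) suggests template collapse; a low
    one (e.g. 3/1000) suggests diverse outputs. Whitespace splitting
    stands in for a real tokenizer here.
    """
    prefixes = Counter(
        " ".join(s.split()[:prefix_tokens]) for s in samples
    )
    top_prefix, top_count = prefixes.most_common(1)[0]
    return top_prefix, top_count, len(prefixes)

# Toy example: a "collapsed" generator repeats one opening 9 times out of 10.
collapsed = ["**Problem:** A disc rotates at 120 rpm."] * 9 + ["A train leaves at noon."]
diverse = [f"Question {i}: compute {i} + {i}." for i in range(10)]

print(template_collapse_stats(collapsed)[1])  # 9
print(template_collapse_stats(diverse)[1])    # 1
```

On real data one would run this over the 1,000-sample batches per generator and compare the top counts directly.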