import HtmlEmbed from "../../components/HtmlEmbed.astro";
import Note from "../../components/Note.astro";
import Wide from "../../components/Wide.astro";
import Accordion from "../../components/Accordion.astro";
import Sidenote from "../../components/Sidenote.astro";
## Analyses
The experiments tell us *what* works. Now let's zoom out and ask *why*. We look at the cost of running these experiments, whether cheap proxy metrics can replace expensive training runs, whether our proxy model is too small to reveal quality differences, what the rephrased outputs actually look like, and why a messier model sometimes wins.
### Is More Compute Worth It?
Running 90 experiments is not cheap. GPU time varies by two orders of magnitude: the cheapest run (Table with SmolLM2) took 8 days, while the most expensive (Guided Rewrite with Gemma-3 27B) consumed over 15 months of GPU time. Here's each experiment's downstream performance plotted against its GPU cost on a log scale, with a Pareto frontier connecting the most efficient configurations:
**The Pareto frontier is dominated by small models with simple prompts.** The best cost-performance tradeoffs come from 1B-class models (Gemma-3-1B, SmolLM2-1.7B) paired with format prompts like Math, Table, and FAQ. Scaling up to 12B or 27B models increases GPU time by 5–10x while simultaneously *decreasing* performance.
**The message is clear: invest in prompt design, not model size.** A well-chosen prompt on a 1B model will outperform a generic prompt on a 27B model at a tiny fraction of the cost. The only scenario where larger models might be justified is for complex prompts (like Guided Rewrite) that require more capable instruction following, but even there the gains are marginal.
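The frontier itself is cheap to compute: sort runs by GPU cost and keep each run that beats every cheaper one. A minimal sketch (the numbers below are illustrative, not our actual runs):

```python
def pareto_frontier(points):
    """Return the (cost, score) points not dominated by any cheaper run.
    points: iterable of (gpu_hours, macro_score) tuples."""
    best_score = float("-inf")
    frontier = []
    # Sort by cost ascending; break cost ties by higher score first.
    for cost, score in sorted(points, key=lambda p: (p[0], -p[1])):
        if score > best_score:  # strictly beats everything cheaper
            frontier.append((cost, score))
            best_score = score
    return frontier

# Illustrative (cost, score) pairs only:
experiments = [(200, 0.14), (150, 0.15), (900, 0.13), (400, 0.16), (1200, 0.17)]
print(pareto_frontier(experiments))  # [(150, 0.15), (400, 0.16), (1200, 0.17)]
```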
Even the cheapest configurations still take over a week of GPU time, and we only know which ones work *after* rephrasing 10B tokens and then training a model. Wouldn't it be nice if we could just score the rephrased outputs directly and skip the expensive train-then-evaluate loop?
### Can Quality Scores Predict Performance?
FineWeb-Edu-score and DCLM-score are great quality filters for human-written web data. If they also work for synthetic data, we could score rephrased outputs directly and iterate on prompts without running the full pipeline each time. We computed Spearman rank correlations between various edu-score and DCLM-score metrics (input scores, output scores, score differences, and relative improvements) and all downstream benchmark results across our 90 experiments.[^broken-scores] Here's the full correlation matrix:
[^broken-scores]: Seven early runs had incorrect input quality scores due to a scoring pipeline bug and are excluded from the quality score analyses: `article-1b-hq`, `commentary-1b-hq`, `discussion-1b-hq`, `tutorial-1b-hq`, `tutorial-12b-hq`, `faq-1b-lq`, and `faq-12b-lq`. Their downstream benchmark results are unaffected and included in all other analyses.
{/*
Seven early runs have incorrect input quality scores due to a scoring pipeline bug and
are excluded in both charts' JS via BROKEN_INPUT_SCORES rather than patched in the JSON:
article/commentary/discussion/tutorial-1b-hq, tutorial-12b-hq, faq-1b-lq, faq-12b-lq
*/}
**DCLM-score is a moderate predictor of aggregate performance.** The DCLM-score difference (output minus input) shows the strongest correlation with `agg_score_macro` (ρ = 0.61, p {'<'} 0.001), followed by the output DCLM-score (ρ = 0.56). These are moderate correlations at best. The DCLM-score variants are particularly predictive for table understanding (ρ = 0.47–0.54) and reading comprehension (ρ = 0.49–0.52).
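The ρ values here are Spearman rank correlations, i.e. Pearson correlations computed on ranks (in practice `scipy.stats.spearmanr` handles this, p-values included). A dependency-free sketch of the statistic itself, assuming non-constant inputs:

```python
def ranks(xs):
    """Average ranks (1-based); tied values share the mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# A perfectly monotone relationship gives rho close to 1:
print(spearman([0.2, 0.5, 0.9], [0.11, 0.14, 0.17]))
```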
**Edu-score tells a more nuanced story.** The input edu-score (the score of the original data before rephrasing) correlates with aggregate performance (ρ = 0.27, p {'<'} 0.05), but the output edu-score (the score of the rephrased data) shows essentially no correlation (ρ = −0.08, not significant). Starting with higher-quality source data matters, but the edu-score of the synthetic output is not a reliable proxy at all.
{/*
**The HellaSwag/PIQA anomaly deserves a closer look.** Edu-score improvement shows strong *positive* correlations with HellaSwag (ρ = 0.60) and PIQA (ρ = 0.58), while being *negatively* correlated with math (ρ = −0.39) and reading comprehension (ρ = −0.30). We investigated whether this was a confound from prompt type (FAQ and tutorial prompts both increase edu-scores and might independently help NLU). The correlation survives partial correlation controlling for prompt type (ρ = 0.65 for HellaSwag, ρ = 0.56 for PIQA, both p {'<'} 0.001) and for model size within the Gemma family (ρ = 0.60 and 0.68). So the effect is real. However, the practical magnitude is tiny: HellaSwag scores range from 0.066 to 0.092 across all 90 experiments (CV = 5.8%), compared to `agg_score_macro` ranging from 0.096 to 0.172 (CV = 10.5%). The edu-score captures something about sentence-completion and physical-intuition quality, but the absolute differences are so small that optimizing for it would be chasing noise.
*/}
**Neither score is a reliable universal proxy.** WinoGrande shows essentially zero correlation with any predictor. The strongest individual correlations (ρ ≈ 0.56–0.61) are still only moderate, explaining roughly 30% of the variance at best. **The bottom line: for synthetic data, there is no shortcut. You have to train models and evaluate them.**
The correlation matrix tells us that quality scores are weak predictors, but not *how* scores change through rephrasing. The slope chart below visualizes this: each experiment is a line connecting its input score (left), output score (middle), and downstream `agg_score_macro` (right). Toggle between DCLM and edu-score views to see both perspectives:
**DCLM scores almost universally increase through rephrasing.** Nearly every experiment shows an upward slope from input to output DCLM score, regardless of prompt type or model. The rephrasing models produce cleaner, more structured text that the DCLM classifier rewards. But the slope from output DCLM score to downstream performance is much flatter and noisier, confirming that a high DCLM score does not guarantee good training data.
**Edu-scores tell the opposite story.** Most experiments *decrease* the edu-score through rephrasing, particularly those starting from high-quality sources (FineWeb-Edu-HQ has high baseline edu-scores). The edu-score classifier penalizes format changes like tables, FAQs, and math notation that our best prompts produce. This is a case where the proxy metric actively misleads: the "quality degradation" measured by edu-score corresponds to format transformations that *improve* downstream performance.
So quality scores designed for filtering web data don't transfer to synthetic data. If we can't shortcut the evaluation, we should at least make sure the evaluation itself is trustworthy. One obvious concern: all our model-size experiments used a 1.7B student. What if that student is simply too small to tell good data from great data?
### Is Our Proxy Model Too Small?
In the [model size experiment](#does-the-model-size-matter) we found that generator size barely matters past 1B. But all of that was on a single 1.7B student. What if that student is just too small to tell good data from great data? A small model might cap out on all the mixes equally, making 1B and 27B generator data look the same when a bigger student could tell them apart. To check, we trained students at four sizes on identical data mixes.
| Preset | Parameters | hidden | intermediate | tp | recompute layer | micro batch | eval batch |
|--------|---------------|--------|--------------|----|-----------------|-------------|------------|
| 0.5B | 483,714,048 | 1024 | 3072 | 1 | off | 4 | 32 |
| 1.7B | 1,672,071,168 | 2048 | 6144 | 1 | off | 2 | 16 |
| 2.9B | 2,860,792,320 | 2560 | 9216 | 1 | on | 1 | 8 |
| 6.2B | 6,162,714,624 | 4096 | 12288 | 2 | on | 1 | 4 |
`tp` is tensor-parallel width. Recompute layer is activation checkpointing; when on, it trades extra compute in the backward pass for lower memory use during training. Micro batch and eval batch are the micro-batch size during training and the batch size for evaluation runs.
We swept Gemma-3 generators (270M through 27B) on three prompts ([guided_rewrite](#guided_rewrite_original), [math](#math), [tutorial](#tutorial)), always mixing with FineWeb-Edu-HQ. The chart below averages across all three prompts at each student size:
**A small student squashes differences between generators.** At 0.5B, the spread across generator sizes is just 0.004 in macro score. Bump the student to 2.9B and it jumps to 0.021, revealing a clear ranking: 270M lowest, 1B in the middle, larger generators on top. Going further to 6.2B doesn't help much: the spread stays about the same, but the ordering among large generators gets noisier.
**2.9B is the sweet spot.** It cleanly separates 270M, 1B, and larger generators without the extra cost of 6.2B, which barely widens the spread.
We ran this student sweep later in the project. The earlier experiments still use the 1.7B student because those runs were planned or launched before these results existed.
**The 1.7B student hid differences above 1B.** On [guided_rewrite](#guided_rewrite_original), the gap between 1B and the best 4B+ generator is just +0.009 at the 1.7B student, easy to write off as noise. At 2.9B that same gap jumps to +0.017, at 6.2B it's +0.013. So "bigger generators don't help" was partly the 1.7B student squashing those differences. **With a bigger student, three tiers show up:** 270M is clearly worst, 1B sits in the middle, and 4B+ generators form a top group.
**Bigger students get more out of the same data.** @rewire report the same pattern: their rewritten data adds +1.0pp at 1B, +1.3pp at 3B, and +2.5pp at 7B over filtered web data alone. We see it too: average macro score climbs from 0.109 (0.5B student) to 0.143 (1.7B) to 0.150 (2.9B) to 0.157 (6.2B), and the generator spread roughly doubles from 0.5B to 2.9B. **But scaling the student helps more than scaling the generator.** Going from a 1.7B to a 6.2B student adds roughly +0.014 to average macro score. Going from a 1B to the best 4B+ generator adds only +0.004 to +0.014 depending on the prompt.
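The "spread" used throughout is simply the max-minus-min macro score across generator sizes at a fixed student size. A sketch with illustrative numbers (not the real sweep):

```python
from collections import defaultdict

def generator_spread(results):
    """results: list of (student_size, generator_size, macro_score).
    Returns {student_size: max - min score across generators}."""
    by_student = defaultdict(list)
    for student, _generator, score in results:
        by_student[student].append(score)
    return {s: max(scores) - min(scores) for s, scores in by_student.items()}

# Illustrative numbers only, not the real experiments:
runs = [("0.5B", "270M", 0.107), ("0.5B", "27B", 0.111),
        ("2.9B", "270M", 0.138), ("2.9B", "27B", 0.159)]
print(generator_spread(runs))  # the spread widens with student size
```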
#### Digging into individual prompts
The overview above averages across prompts. To see how each prompt behaves at each student size, use the interactive chart below:
**Math is the exception.** At 0.5B, the 1B generator is actually the *best* in the sweep, beating 12B and 27B. That lead fades at larger student sizes (12B takes over at 2.9B and 6.2B), but 1B stays surprisingly competitive. Switch the chart to GSM8K and the pattern gets even sharper: [math](#math) data from the small Gemma is unusually strong for how cheap it is.
All of the above is about *who* generates the data and *how big* the student is. But what about the data itself? Let's start with the simplest property: how long is the output?
### Do Chatty Models Make Better Data?
Different prompt formats produce wildly different output lengths. Here are the output tokens per document across four prompt types, broken down by model family:
Table and Math prompts tend to be concise, while FAQ and Tutorial prompts generate significantly more tokens per document. The spread within each prompt type varies across model families: some models are consistently verbose regardless of the prompt, while others adapt their output length to the task.
But does this variation actually affect downstream performance? Our prompts produce outputs ranging from 25% of the input length (Table) to 150% (Guided Rewrite at 12B). Here's each experiment's compression ratio plotted against its benchmark score:
**There is no meaningful relationship between compression ratio and performance.** Highly compressive prompts (Commentary at 0.26x, Table at 0.25x) and expansive ones (Guided Rewrite at 1.5x) both appear across the full range of performance scores. The best-performing experiments cluster around 0.3x–0.8x compression, but this likely reflects the distribution of prompt types rather than any causal effect of compression itself. FAQ and Tutorial prompts, which happen to compress moderately, also happen to be the strongest prompts for other reasons (pedagogical restructuring, diverse output formats). What matters is the content and structure of the output, not its length relative to the input.
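To be precise about the metric: compression ratio here is output tokens divided by input tokens, so 0.25x means the rephrased document is a quarter of the original's length. A trivial sketch with made-up token counts:

```python
def compression_ratio(n_input_tokens, n_output_tokens):
    """Output length relative to input: <1 compresses, >1 expands."""
    return n_output_tokens / n_input_tokens

# Illustrative per-experiment token totals, not real data:
experiments = [(1200, 300), (800, 208), (1000, 1500)]
ratios = [compression_ratio(i, o) for i, o in experiments]
print(ratios)  # [0.25, 0.26, 1.5]
```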
So output length doesn't predict quality either. But we stumbled onto something more interesting while looking at output *diversity*: a case where a model that follows instructions poorly actually produces better training data.
### Math Rephrasing: When "Worse" Outputs Win
This was one of our most surprising findings. We compared two ~1.7B parameter models for generating math word problems: SmolLM2 and Qwen3. SmolLM2's outputs looked objectively worse, yet models trained on them performed better.
**Qwen3 produced beautiful, structured outputs:**
- 100% had proper Problem/Solution sections
- 99% had step-by-step formatting
- 60% included LaTeX math notation
Here's a typical Qwen3 output:
```
**Problem:**
A disc rotates at 120 rpm. How many revolutions in 5 minutes?
**Solution:**
1. Revolutions per minute = 120
2. Number of minutes = 5
3. Total revolutions = 120 × 5
$$120 \times 5 = 600$$
The disc makes 600 revolutions in 5 minutes.
```
**SmolLM2 was messier:**
- Only 68% had complete solutions
- Wide variance in output length (4 to 4,000 tokens)
- Mix of formats: questions, partial answers, full solutions
SmolLM2 outputs ranged from proper solutions to just questions like *"What is the difference between X and Y?"* or even 4-token fragments like *"Areas Where We Service"*.
Yet models trained on SmolLM2's data **outperformed** those trained on Qwen3's data on downstream benchmarks. We suspect this is due to **template collapse**: Qwen3's outputs were *too* consistent. 115 out of 1,000 samples started with identical text, while SmolLM2's most common pattern appeared only 3 times.
| Metric | SmolLM2 | Qwen3 |
| --- | --- | --- |
| Most common start | 3/1000 | 115/1000 |
| Output length range | 4–4,000 | 100–2,600 |
| Unique patterns | High | Low |
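Template collapse is straightforward to measure: count how many samples share the same opening tokens. A sketch of this kind of check, with synthetic stand-ins for the model outputs (the prefix length and examples are illustrative, not the exact script we used):

```python
from collections import Counter

def most_common_start(samples, n_tokens=8):
    """Fraction of samples sharing the single most frequent opening
    n_tokens. High values signal template collapse."""
    starts = Counter(" ".join(s.split()[:n_tokens]) for s in samples)
    _top_start, count = starts.most_common(1)[0]
    return _top_start, count / len(samples)

# Synthetic stand-ins, not real model outputs:
collapsed = ["**Problem:** A train travels at 60 mph for 2 hours."] * 115 \
          + [f"**Problem:** Variant {i} of some problem." for i in range(885)]
varied = [f"Question {i}: what is {i} plus {i}?" for i in range(1000)]
print(most_common_start(collapsed)[1])  # 0.115
print(most_common_start(varied)[1])    # 0.001
```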
SmolLM2's quality distribution was actually reasonable:
| Quality | Criteria | Share |
| --- | --- | --- |
| Excellent | Has "solution" + numbered steps + 80+ tokens | 45% |
| Good | Has "solution" + 50+ tokens | 22% |
| Partial | 30+ tokens but missing structure | 25% |
| Poor | {'<'}30 tokens | 8% |
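The buckets above can be approximated with a simple heuristic classifier. The sketch below uses whitespace-split word counts as a stand-in for model tokens and checks for the keyword and numbered-step criteria from the table (an approximation, not the exact script we used):

```python
def quality_bucket(text):
    """Heuristic bucketing mirroring the quality table (approximate)."""
    tokens = text.split()  # crude stand-in for model tokenization
    has_solution = "solution" in text.lower()
    has_steps = any(line.strip()[:2] in ("1.", "2.")
                    for line in text.splitlines())
    if has_solution and has_steps and len(tokens) >= 80:
        return "excellent"
    if has_solution and len(tokens) >= 50:
        return "good"
    if len(tokens) >= 30:
        return "partial"
    return "poor"

print(quality_bucket("Areas Where We Service"))  # 'poor'
```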
The lesson: for pretraining data, diversity beats consistency. A model that doesn't follow instructions perfectly can actually produce better training data than one that does. This also helps explain why SmolLM2 dominates the model family comparison: it produces more varied outputs, which may matter more than precise instruction following.
### Takeaways
**Cost**: Small models with simple prompts dominate the Pareto frontier. Invest in prompt design, not model size.
**Quality scores**: Neither edu-score nor DCLM-score reliably predicts downstream performance for synthetic data. There is no shortcut to training and evaluating.
**Proxy model size**: A 2.9B student reveals three tiers (270M {'<'} 1B {'<'} 4B+) that the 1.7B student compressed. Generator gains above 1B are real but smaller than student-side gains. Student scale is the bigger lever.
**Verbosity**: Output length has no meaningful relationship with performance. What matters is content, not compression ratio.
**Diversity**: Template collapse hurts more than noisy outputs. A messier model that produces varied text can outperform a polished one that repeats the same template.
With the experiments and analyses behind us, let's talk about the infrastructure that made all of this possible.