joelniklaus HF Staff committed on
Commit
d9c713c
·
1 Parent(s): 9894e4e

added nicer overview plot

app/src/content/assets/data/student-scaling-summary.csv ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:36ec2938094074a57aae49f4477a1420f45ae52c11156c58dc51ff474a1b7401
3
+ size 380
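
The three added lines are a Git LFS pointer rather than the CSV itself: the actual 380-byte file lives in LFS storage and is addressed by its SHA-256 digest. As a rough illustration, a pointer like this could be read as follows. `parseLfsPointer` is a hypothetical helper written for this note, not something from the repository.

```javascript
// Hypothetical helper (not part of the repo): parse a Git LFS pointer file.
// Each non-empty line has the form "<key> <value>".
function parseLfsPointer(text) {
  const fields = {};
  for (const line of text.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed) continue;
    const idx = trimmed.indexOf(' ');
    if (idx === -1) continue; // skip malformed lines
    fields[trimmed.slice(0, idx)] = trimmed.slice(idx + 1);
  }
  // The oid is "<algorithm>:<hex digest>".
  const [algo, hash] = (fields.oid || '').split(':');
  return { version: fields.version, algo, hash, size: Number(fields.size) };
}

// The pointer content shown in the diff above.
const pointer = parseLfsPointer([
  'version https://git-lfs.github.com/spec/v1',
  'oid sha256:36ec2938094074a57aae49f4477a1420f45ae52c11156c58dc51ff474a1b7401',
  'size 380',
].join('\n'));

console.log(pointer.algo, pointer.size);
```

Reviewers who want to inspect the real data need the LFS object itself; the pointer only confirms its size and digest.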
app/src/content/chapters/4-analyses.mdx CHANGED
@@ -93,13 +93,30 @@ In the [model size experiment](#does-the-model-size-matter) we found that genera
93
 
94
  </Accordion>
95
 
96
- {/*
97
  <Sidenote>
98
- Why stop at 6.2B? Flash Attention 2 [@flashattention2] in our stack can't go past hidden size 4096 with 16 attention heads without running out of memory. And as we'll see, the 6.2B student doesn't separate generators any better than the 2.9B, so there's little reason to push further.
99
  </Sidenote>
100
- */}
101
 
102
- We swept Gemma-3 generators (270M through 27B) on three prompts ([guided_rewrite](#guided_rewrite_original), [math](#math), [tutorial](#tutorial)), always mixing with FineWeb-Edu-HQ. Pick a student size and prompt below:
103
 
104
  <HtmlEmbed
105
  id="student-capacity-generator-sweep"
@@ -240,21 +257,8 @@ We swept Gemma-3 generators (270M through 27B) on three prompts ([guided_rewrite
240
  }}
241
  />
242
 
243
- **A small student squashes differences between generators.** At 0.5B, the spread across generator sizes is about half of what we see at 2.9B (on [guided_rewrite](#guided_rewrite_original): 0.012 vs 0.024 macro spread at 10k steps). Bump the student to 2.9B and a clear ranking appears: 270M lowest, 1B in the middle, larger generators on top (12B often wins on [guided_rewrite](#guided_rewrite_original) and [tutorial](#tutorial)). Going further to 6.2B doesn't help much: the total spread stays about the same, but the ordering among large generators gets noisier (4B can edge out 12B).
244
-
245
  **Math is the exception.** At 0.5B, the 1B generator is actually the *best* in the sweep, beating 12B and 27B. That lead fades at larger student sizes (12B takes over at 2.9B and 6.2B), but 1B stays surprisingly competitive. Switch the chart to GSM8K and the pattern gets even sharper: [math](#math) data from the small Gemma is unusually strong for how cheap it is.
246
 
247
- **2.9B is the sweet spot.** It cleanly separates 270M, 1B, and larger generators without the extra cost of 6.2B, which barely widens the spread.
248
-
249
- <Sidenote>
250
- We ran this student sweep later in the project. The earlier experiments still use the 1.7B student because those runs were planned or launched before these results existed.
251
- </Sidenote>
252
-
253
-
254
- **The 1.7B student hid differences above 1B.** On [guided_rewrite](#guided_rewrite_original), the gap between 1B and the best 4B+ generator is just +0.009 at the 1.7B student, easy to write off as noise. At 2.9B that same gap jumps to +0.017, at 6.2B it's +0.013. [Tutorial](#tutorial) tells the same story (+0.004 → +0.006 → +0.014). So "bigger generators don't help" was partly the 1.7B student squashing those differences. **With a bigger student, three tiers show up:** 270M is clearly worst, 1B sits in the middle, and 4B+ generators form a top group. The gap from 1B to the top is real, just smaller than the jump from 270M to 1B.
255
-
256
- **Bigger students get more out of the same data.** @rewire report the same pattern: their rewritten data adds +1.0pp at 1B, +1.3pp at 3B, and +2.5pp at 7B over filtered web data alone. We see it too: average macro score climbs from 0.109 (0.5B student) to 0.143 (1.7B) to 0.150 (2.9B) to 0.157 (6.2B), and the generator spread roughly doubles from 0.5B to 2.9B.
257
-
258
  All of the above is about *who* generates the data and *how big* the student is. But what about the data itself? Let's start with the simplest property: how long is the output?
259
 
260
  ### Do Chatty Models Make Better Data?
 
93
 
94
  </Accordion>
95
 
96
+ We swept Gemma-3 generators (270M through 27B) on three prompts ([guided_rewrite](#guided_rewrite_original), [math](#math), [tutorial](#tutorial)), always mixing with FineWeb-Edu-HQ. The chart below averages across all three prompts at each student size:
97
+
98
+ <HtmlEmbed
99
+ id="student-scaling-overview"
100
+ src="d3-student-scaling.html"
101
+ data="student-scaling-summary.csv"
102
+ desc="Average aggregate macro score by student size and generator size. Larger students widen the spread between generators."
103
+ />
104
+
105
+ **A small student squashes differences between generators.** At 0.5B, the spread across generator sizes is just 0.004 in macro score. Bump the student to 2.9B and it jumps to 0.021, revealing a clear ranking: 270M lowest, 1B in the middle, larger generators on top. Going further to 6.2B doesn't help much: the spread stays about the same, but the ordering among large generators gets noisier.
106
+
107
+ **2.9B is the sweet spot.** It cleanly separates 270M, 1B, and larger generators without the extra cost of 6.2B, which barely widens the spread.
108
+
109
  <Sidenote>
110
+ We ran this student sweep later in the project. The earlier experiments still use the 1.7B student because those runs were planned or launched before these results existed.
111
  </Sidenote>
 
112
 
113
+ **The 1.7B student hid differences above 1B.** On [guided_rewrite](#guided_rewrite_original), the gap between 1B and the best 4B+ generator is just +0.009 at the 1.7B student, easy to write off as noise. At 2.9B that same gap jumps to +0.017, at 6.2B it's +0.013. So "bigger generators don't help" was partly the 1.7B student squashing those differences. **With a bigger student, three tiers show up:** 270M is clearly worst, 1B sits in the middle, and 4B+ generators form a top group.
114
+
115
+ **Bigger students get more out of the same data.** @rewire report the same pattern: their rewritten data adds +1.0pp at 1B, +1.3pp at 3B, and +2.5pp at 7B over filtered web data alone. We see it too: average macro score climbs from 0.109 (0.5B student) to 0.143 (1.7B) to 0.150 (2.9B) to 0.157 (6.2B), and the generator spread widens from 0.004 at 0.5B to 0.021 at 2.9B. **But scaling the student helps more than scaling the generator.** Going from a 1.7B to a 6.2B student adds roughly +0.014 to average macro score. Going from a 1B to the best 4B+ generator adds only +0.004 to +0.014 depending on the prompt.
116
+
117
+ #### Digging into individual prompts
118
+
119
+ The overview above averages across prompts. To see how each prompt behaves at each student size, use the interactive chart below:
120
 
121
  <HtmlEmbed
122
  id="student-capacity-generator-sweep"
 
257
  }}
258
  />
259
 
260
  **Math is the exception.** At 0.5B, the 1B generator is actually the *best* in the sweep, beating 12B and 27B. That lead fades at larger student sizes (12B takes over at 2.9B and 6.2B), but 1B stays surprisingly competitive. Switch the chart to GSM8K and the pattern gets even sharper: [math](#math) data from the small Gemma is unusually strong for how cheap it is.
261
 
262
  All of the above is about *who* generates the data and *how big* the student is. But what about the data itself? Let's start with the simplest property: how long is the output?
263
 
264
  ### Do Chatty Models Make Better Data?
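
The "spread" figures in the `4-analyses.mdx` changes above are the gap between the best and worst generator at a given student size. A minimal sketch of that computation, assuming rows shaped like `{ student, generator, score }`. The labels and scores below are placeholders chosen for illustration, not the project's results, and `spreadByStudent` is a hypothetical name.

```javascript
// Placeholder rows for illustration only (not real results).
const rows = [
  { student: 'small', generator: 'a', score: 0.100 },
  { student: 'small', generator: 'b', score: 0.105 },
  { student: 'small', generator: 'c', score: 0.110 },
  { student: 'large', generator: 'a', score: 0.140 },
  { student: 'large', generator: 'b', score: 0.150 },
  { student: 'large', generator: 'c', score: 0.160 },
];

// Hypothetical helper: max score minus min score per student size.
function spreadByStudent(data) {
  const bounds = new Map();
  for (const { student, score } of data) {
    const b = bounds.get(student) || { lo: Infinity, hi: -Infinity };
    b.lo = Math.min(b.lo, score);
    b.hi = Math.max(b.hi, score);
    bounds.set(student, b);
  }
  const out = {};
  for (const [student, { lo, hi }] of bounds) out[student] = hi - lo;
  return out;
}

const spreads = spreadByStudent(rows);
console.log(spreads);
```

A larger value for one student size means that student separates the generators more clearly, which is the comparison the overview chart is meant to support.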
app/src/content/embeds/d3-student-scaling.html ADDED
@@ -0,0 +1,282 @@
1
+ <div class="d3-student-scaling"></div>
2
+ <style>
3
+ .d3-student-scaling { position: relative; }
4
+ .d3-student-scaling .legend {
5
+ display: flex; flex-direction: column; align-items: flex-start; gap: 6px; margin: 8px 0 0 0;
6
+ }
7
+ .d3-student-scaling .legend .legend-title {
8
+ font-size: 12px; font-weight: 700; color: var(--text-color);
9
+ }
10
+ .d3-student-scaling .legend .items {
11
+ display: flex; flex-wrap: wrap; gap: 8px 14px;
12
+ }
13
+ .d3-student-scaling .legend .item {
14
+ display: inline-flex; align-items: center; gap: 6px; font-size: 12px; color: var(--text-color);
15
+ }
16
+ .d3-student-scaling .legend .swatch {
17
+ width: 14px; height: 14px; border-radius: 3px; border: 1px solid var(--border-color);
18
+ }
19
+ .d3-student-scaling .d3-tooltip {
20
+ position: absolute; top: 0; left: 0;
21
+ transform: translate(-9999px, -9999px);
22
+ pointer-events: none; padding: 8px 12px; border-radius: 8px;
23
+ font-size: 12px; line-height: 1.4;
24
+ border: 1px solid var(--border-color);
25
+ background: var(--surface-bg); color: var(--text-color);
26
+ box-shadow: 0 4px 24px rgba(0,0,0,.18);
27
+ opacity: 0; transition: opacity .12s ease;
28
+ }
29
+ .d3-student-scaling .spread-label {
30
+ font-size: 11px; font-weight: 700; font-family: var(--font-mono, monospace);
31
+ fill: var(--muted-color);
32
+ }
33
+ .d3-student-scaling .axes path,
34
+ .d3-student-scaling .axes line { stroke: var(--axis-color); }
35
+ .d3-student-scaling .axes text { fill: var(--tick-color); font-size: 12px; }
36
+ .d3-student-scaling .grid line { stroke: var(--grid-color); }
37
+ .d3-student-scaling .x-label,
38
+ .d3-student-scaling .y-label { fill: var(--text-color); font-size: 13px; }
39
+ </style>
40
+ <script>
41
+ (() => {
42
+ const ensureD3 = (cb) => {
43
+ if (window.d3 && typeof window.d3.select === 'function') return cb();
44
+ let s = document.getElementById('d3-cdn-script');
45
+ if (!s) { s = document.createElement('script'); s.id = 'd3-cdn-script'; s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js'; document.head.appendChild(s); }
46
+ const onReady = () => { if (window.d3 && typeof window.d3.select === 'function') cb(); };
47
+ s.addEventListener('load', onReady, { once: true });
48
+ if (window.d3) onReady();
49
+ };
50
+
51
+ const bootstrap = () => {
52
+ const scriptEl = document.currentScript;
53
+ let container = scriptEl ? scriptEl.previousElementSibling : null;
54
+ if (!(container && container.classList && container.classList.contains('d3-student-scaling'))) {
55
+ const cs = Array.from(document.querySelectorAll('.d3-student-scaling'))
56
+ .filter(el => !(el.dataset && el.dataset.mounted === 'true'));
57
+ container = cs[cs.length - 1] || null;
58
+ }
59
+ if (!container) return;
60
+ if (container.dataset) {
61
+ if (container.dataset.mounted === 'true') return;
62
+ container.dataset.mounted = 'true';
63
+ }
64
+
65
+ container.style.position = container.style.position || 'relative';
66
+
67
+ // Tooltip
68
+ const tip = document.createElement('div');
69
+ tip.className = 'd3-tooltip';
70
+ const tipInner = document.createElement('div');
71
+ tipInner.style.textAlign = 'left';
72
+ tip.appendChild(tipInner);
73
+ container.appendChild(tip);
74
+
75
+ const showTip = (html, event) => {
76
+ tipInner.innerHTML = html;
77
+ tip.style.opacity = '1';
78
+ const cr = container.getBoundingClientRect();
79
+ const [mx, my] = [event.clientX - cr.left, event.clientY - cr.top];
80
+ const tw = tip.offsetWidth;
81
+ const x = mx + tw + 16 > cr.width ? mx - tw - 12 : mx + 12;
82
+ tip.style.transform = `translate(${x}px, ${my - 40}px)`;
83
+ };
84
+ const hideTip = () => { tip.style.opacity = '0'; tip.style.transform = 'translate(-9999px,-9999px)'; };
85
+
86
+ const svg = d3.select(container).append('svg').attr('width', '100%').style('display', 'block');
87
+ const gRoot = svg.append('g');
88
+
89
+ const students = ['0.5B', '1.7B', '2.9B', '6.2B'];
90
+ const generators = ['270M', '1B', '4B', '12B', '27B'];
91
+
92
+ const getColors = () => {
93
+ if (window.ColorPalettes) return window.ColorPalettes.getColors('categorical', 5);
94
+ return ['#f85149', '#f0883e', '#3fb950', '#58a6ff', '#bc8cff'];
95
+ };
96
+
97
+ let chartData = null;
98
+
99
+ const fetchCSV = async () => {
100
+ const paths = ['/data/student-scaling-summary.csv', './assets/data/student-scaling-summary.csv'];
101
+ for (const p of paths) {
102
+ try { const r = await fetch(p, { cache: 'no-cache' }); if (r.ok) return await r.text(); } catch(_){}
103
+ }
104
+ throw new Error('CSV not found');
105
+ };
106
+
107
+ // Legend
108
+ const legend = document.createElement('div');
109
+ legend.className = 'legend';
110
+ const legendTitle = document.createElement('div');
111
+ legendTitle.className = 'legend-title';
112
+ legendTitle.textContent = 'Legend';
113
+ legend.appendChild(legendTitle);
114
+ const legendItems = document.createElement('div');
115
+ legendItems.className = 'items';
116
+ legend.appendChild(legendItems);
117
+ container.appendChild(legend);
118
+
119
+ function buildLegend(colors) {
120
+ legendItems.innerHTML = '';
121
+ generators.forEach((g, i) => {
122
+ const item = document.createElement('span');
123
+ item.className = 'item';
124
+ const sw = document.createElement('span');
125
+ sw.className = 'swatch';
126
+ sw.style.background = colors[i];
127
+ const txt = document.createElement('span');
128
+ txt.textContent = g;
129
+ item.appendChild(sw);
130
+ item.appendChild(txt);
131
+ legendItems.appendChild(item);
132
+ });
133
+ }
134
+
135
+ const margin = { top: 20, right: 30, bottom: 50, left: 60 };
136
+
137
+ function render() {
138
+ if (!chartData) return;
139
+ const colors = getColors();
140
+ buildLegend(colors);
141
+
142
+ const width = container.clientWidth || 800;
143
+ const height = Math.max(280, Math.round(width / 2.8));
144
+ svg.attr('width', width).attr('height', height);
145
+ gRoot.attr('transform', `translate(${margin.left},${margin.top})`);
146
+
147
+ const iw = width - margin.left - margin.right;
148
+ const ih = height - margin.top - margin.bottom;
149
+
150
+ // Scales
151
+ const x0 = d3.scaleBand().domain(students).range([0, iw]).paddingInner(0.25).paddingOuter(0.1);
152
+ const x1 = d3.scaleBand().domain(generators).range([0, x0.bandwidth()]).padding(0.08);
153
+ const yMin = 0.095;
154
+ const yMax = d3.max(chartData, d => d.score) * 1.06;
155
+ const y = d3.scaleLinear().domain([yMin, yMax]).range([ih, 0]).nice();
156
+
157
+ gRoot.selectAll('*').remove();
158
+
159
+ // Grid
160
+ const gridG = gRoot.append('g').attr('class', 'grid');
161
+ gridG.selectAll('line').data(y.ticks(6)).join('line')
162
+ .attr('x1', 0).attr('x2', iw)
163
+ .attr('y1', d => y(d)).attr('y2', d => y(d));
164
+
165
+ // Axes
166
+ const axesG = gRoot.append('g').attr('class', 'axes');
167
+ axesG.append('g').attr('transform', `translate(0,${ih})`)
168
+ .call(d3.axisBottom(x0).tickSize(0).tickPadding(10))
169
+ .selectAll('text').style('font-size', '13px').style('font-weight', '600');
170
+
171
+ axesG.append('g')
172
+ .call(d3.axisLeft(y).ticks(6).tickFormat(d3.format('.3f')).tickSize(-iw))
173
+ .call(g => g.selectAll('.tick line').attr('stroke', 'var(--grid-color)'))
174
+ .call(g => g.select('.domain').remove());
175
+
176
+ // Axis labels
177
+ gRoot.append('text').attr('class', 'x-label')
178
+ .attr('x', iw / 2).attr('y', ih + 42)
179
+ .attr('text-anchor', 'middle').text('Student model size');
180
+
181
+ gRoot.append('text').attr('class', 'y-label')
182
+ .attr('transform', 'rotate(-90)')
183
+ .attr('x', -ih / 2).attr('y', -48)
184
+ .attr('text-anchor', 'middle').text('Aggregate macro score');
185
+
186
+ // Bars
187
+ const barsG = gRoot.append('g');
188
+
189
+ students.forEach(student => {
190
+ const sData = chartData.filter(d => d.student === student);
191
+ const gBars = barsG.append('g').attr('transform', `translate(${x0(student)},0)`);
192
+
193
+ gBars.selectAll('rect').data(sData).join('rect')
194
+ .attr('x', d => x1(d.generator))
195
+ .attr('y', d => y(d.score))
196
+ .attr('width', x1.bandwidth())
197
+ .attr('height', d => Math.max(0.5, ih - y(d.score)))
198
+ .attr('fill', (d, i) => colors[generators.indexOf(d.generator)])
199
+ .attr('rx', 2)
200
+ .on('mouseenter', (event, d) => {
201
+ const gi = generators.indexOf(d.generator);
202
+ showTip(
203
+ `<strong>${d.student} student + ${d.generator} generator</strong><br/>` +
204
+ `Macro score: <strong>${d.score.toFixed(4)}</strong>`,
205
+ event
206
+ );
207
+ })
208
+ .on('mousemove', (event, d) => {
209
+ showTip(
210
+ `<strong>${d.student} student + ${d.generator} generator</strong><br/>` +
211
+ `Macro score: <strong>${d.score.toFixed(4)}</strong>`,
212
+ event
213
+ );
214
+ })
215
+ .on('mouseleave', hideTip);
216
+ });
217
+
218
+ // Spread annotations
219
+ const spreadG = gRoot.append('g');
220
+ students.forEach(student => {
221
+ const sData = chartData.filter(d => d.student === student);
222
+ const lo = d3.min(sData, d => d.score);
223
+ const hi = d3.max(sData, d => d.score);
224
+ const spread = hi - lo;
225
+ const cx = x0(student) + x0.bandwidth() + 6;
226
+
227
+ // Arrow line
228
+ spreadG.append('line')
229
+ .attr('x1', cx).attr('x2', cx)
230
+ .attr('y1', y(hi) + 2).attr('y2', y(lo) - 2)
231
+ .attr('stroke', 'var(--muted-color)')
232
+ .attr('stroke-width', 1.5);
233
+
234
+ // Arrow heads
235
+ [y(hi), y(lo)].forEach((yy, idx) => {
236
+ const dir = idx === 0 ? 1 : -1;
237
+ spreadG.append('path')
238
+ .attr('d', `M${cx - 3},${yy + dir * 4} L${cx},${yy} L${cx + 3},${yy + dir * 4}`)
239
+ .attr('stroke', 'var(--muted-color)')
240
+ .attr('stroke-width', 1.5)
241
+ .attr('fill', 'none');
242
+ });
243
+
244
+ // Label (above the arrow)
245
+ spreadG.append('text')
246
+ .attr('class', 'spread-label')
247
+ .attr('x', cx)
248
+ .attr('y', y(hi) - 8)
249
+ .attr('text-anchor', 'middle')
250
+ .attr('dominant-baseline', 'auto')
251
+ .text(`\u0394 ${spread.toFixed(3)}`);
252
+ });
253
+ }
254
+
255
+ fetchCSV().then(text => {
256
+ chartData = d3.csvParse(text, d => ({
257
+ student: d.student,
258
+ generator: d.generator,
259
+ score: +d.score
260
+ }));
261
+ render();
262
+ }).catch(err => {
263
+ const pre = document.createElement('pre');
264
+ pre.style.color = 'red';
265
+ pre.textContent = `Error loading data: ${err.message}`;
266
+ container.appendChild(pre);
267
+ });
268
+
269
+ if (window.ResizeObserver) {
270
+ new ResizeObserver(() => render()).observe(container);
271
+ } else {
272
+ window.addEventListener('resize', render);
273
+ }
274
+ };
275
+
276
+ if (document.readyState === 'loading') {
277
+ document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
278
+ } else {
279
+ ensureD3(bootstrap);
280
+ }
281
+ })();
282
+ </script>
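
One detail in `showTip` that is easy to miss: the tooltip flips to the left of the cursor when it would otherwise overflow the container's right edge. Pulled out as a pure function, the rule looks roughly like this. `tooltipX` is a hypothetical name used only for this note; the offsets (16 and 12) are the ones in the embed.

```javascript
// Hypothetical pure version of the horizontal placement rule in showTip.
// mx: cursor x relative to the container
// tw: tooltip width
// cw: container width
function tooltipX(mx, tw, cw) {
  const overflowsRight = mx + tw + 16 > cw;
  return overflowsRight ? mx - tw - 12 : mx + 12;
}

const roomy = tooltipX(100, 150, 800);   // enough room: place right of the cursor
const flipped = tooltipX(700, 150, 800); // would overflow: place left of the cursor
console.log(roomy, flipped);
```

Keeping the rule in a small function like this would make it testable without a DOM, though the inline version in the embed behaves the same way.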