<html lang="en" data-theme="dark">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>The Synthetic Data Playbook</title>
<link rel="preconnect" href="https://fonts.googleapis.com">
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700;800&display=swap" rel="stylesheet">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/reveal.js@5.1.0/dist/reveal.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/reveal.js@5.1.0/dist/theme/night.css">
<link rel="stylesheet" href="style.css">
</head>
<body>
<div class="reveal">
<div class="slides">
<!-- ============================================================ -->
<!-- SECTION 1: MOTIVATION AND RECAP (~40%, slides 1-8, ~9 min) -->
<!-- ============================================================ -->
<!-- SLIDE 1: Title -->
<section class="center-slide">
<div style="margin-top:-160px;">
<h2>The Synthetic Data Playbook</h2>
<br>
<h3>How to Cook Better Training Data for LLMs</h3>
<br>
<p style="margin-top:20px;font-size:0.8em;color:rgba(255,255,255,1);">
SE 26
</p>
</div>
<img src="assets/bern-skyline.png" style="position:absolute;bottom:0;left:50%;transform:translateX(-50%);width:100%;height:auto;opacity:0.6;pointer-events:none;">
<aside class="notes">
~30s. Welcome, introduce yourself. "Today I'll show you how we made LLMs better
by rewriting their training data instead of just filtering it."
</aside>
</section>
<!-- SLIDE 2: Digital Sovereignty -->
<section>
<p class="section-label">Why This Matters</p>
<h2>The Data Black Box</h2>
<div style="font-size:0.65em;margin-top:20px;">
<p>Frontier labs (OpenAI, Google, Anthropic) don't disclose how they build their training data.</p>
<p class="fragment">Neither do the Chinese labs (DeepSeek or Qwen).</p>
<p class="fragment" style="margin-top:20px;">
Training data is the <span class="highlight">most important ingredient</span> in building an LLM,
yet the recipes are kept secret.
</p>
<div class="fragment" style="margin-top:30px;background:rgba(255,255,255,0.04);border:1px solid rgba(255,255,255,0.1);border-radius:16px;padding:24px;">
<p style="font-weight:700;color:#f0c674;margin-bottom:8px;font-size:1.1em;">Digital Sovereignty</p>
<p>If you can't build the data, you can't build the model.<br>
If you can't build the model, you depend on those who can.</p>
<p style="margin-top:12px;">This work puts the knowledge <span class="accent">out in the open</span> for everyone:
governments, universities, startups, and individuals.</p>
</div>
</div>
<aside class="notes">
~1 min. "Before we dive in, let me explain why this matters beyond the technical.
None of the frontier labs, not OpenAI, not Google, not Anthropic, and not the Chinese labs either,
tell you how they build their training data. It's the most important ingredient and it's a black box.
This is a digital sovereignty issue. If you can't build the data yourself,
you can't build the model, and you're dependent on whoever can.
Our work makes this knowledge open and accessible to everyone."
</aside>
</section>
<!-- SLIDE 3: LLMs and Pretraining Recap -->
<section>
<p class="section-label">Quick Recap</p>
<h2>LLMs: What's Under the Hood</h2>
<div class="two-col" style="font-size:0.65em;margin-top:20px;">
<div class="col">
<p>You use these every day: ChatGPT, Copilot, Claude.</p>
<p class="fragment">Under the hood: a giant function that takes <span class="accent">tokens in</span> and predicts <span class="accent">tokens out</span>.</p>
<p class="fragment">Trained by reading <span class="highlight">billions of web pages</span>, learning to predict the next word.</p>
<p class="fragment" style="margin-top:20px;font-weight:700;color:#f0c674;">
Data quality defines model quality.
</p>
</div>
<div class="col fragment" style="text-align:center;">
<div style="background:rgba(255,255,255,0.04);border:1px solid rgba(255,255,255,0.1);border-radius:16px;padding:30px 20px;">
<div style="font-size:0.9em;color:rgba(255,255,255,0.5);">Input text</div>
<div style="font-size:2em;margin:10px 0;">↓</div>
<div style="background:rgba(124,111,247,0.15);border:1px solid rgba(124,111,247,0.3);border-radius:12px;padding:16px;font-weight:700;font-size:1.1em;">
LLM<br><span style="font-size:0.6em;font-weight:400;color:rgba(255,255,255,0.4);">billions of parameters</span>
</div>
<div style="font-size:2em;margin:10px 0;">↓</div>
<div style="font-size:0.9em;color:rgba(255,255,255,0.5);">Output text</div>
</div>
</div>
</div>
<aside class="notes">
~1.5 min. "Quick recap so we have shared vocabulary." Click through fragments.
Emphasize: model quality = data quality. Like training a code model on all of GitHub.
</aside>
</section>
<!-- SLIDE 4: The Data Quality Problem -->
<section>
<p class="section-label">The Problem</p>
<h2>You Start With the Entire Internet...</h2>
<h3 class="fragment">...and throw away 98.6% of it</h3>
<div class="fragment">
<img src="assets/dclm-filtering-pipeline.png" class="img-contain" style="margin-top:10px;max-height:400px;">
<p style="font-size:0.45em;color:rgba(255,255,255,0.3);margin-top:8px;">
DCLM: 240T tokens from Common Crawl → 1.4% survives as DCLM-Baseline
</p>
</div>
<aside class="notes">
~1 min. "This is the DCLM dataset pipeline. You scrape the whole internet, 240 trillion tokens.
Then heuristic filters, deduplication, model-based filtering. Only 1.4% of documents survive.
All this engineering just to clean the data. What if there was a better way?"
</aside>
</section>
<!-- SLIDE 5: Synthetic Data -->
<section>
<p class="section-label">The Idea</p>
<h2>Rewrite Instead of Filter</h2>
<div class="before-after">
<div class="panel bad">
<div class="panel-title">Raw Web Text</div>
<p style="margin:0;line-height:1.6;">
<span style="color:rgba(255,255,255,0.3);">★★★ BeSt DeAls!!!</span><br>
Photosynthesis is the process by wich plants convert sunlit into energy.
It occurs in the chloroplasts<br>
<span style="color:rgba(255,255,255,0.3);">Click here for more → → →</span><br>
<span style="color:rgba(255,255,255,0.3);">© 2019 AllScienceInfo.biz</span><br>
Carbon dioxide and water are transformed into glucose and oxygen...
<span style="color:rgba(255,255,255,0.3);">[AD] [AD] [POPUP]</span>
</p>
</div>
<div class="arrow fragment" data-fragment-index="0">→</div>
<div class="panel good fragment" data-fragment-index="0">
<div class="panel-title">LLM-Rewritten FAQ</div>
<p style="margin:0;line-height:1.6;">
<strong>Q: What is photosynthesis?</strong><br>
A: Photosynthesis is the process by which plants convert sunlight into chemical energy.
It occurs in organelles called chloroplasts.<br><br>
<strong>Q: What are the inputs and outputs?</strong><br>
A: Plants take in carbon dioxide (CO₂) and water (H₂O), and using light energy,
produce glucose (C₆H₁₂O₆) and oxygen (O₂).
</p>
</div>
</div>
<p class="fragment" style="font-size:0.55em;margin-top:16px;">
Same knowledge, better packaging.<br>
You keep <span class="highlight">100%</span> of your data instead of discarding 98.6%.
</p>
<aside class="notes">
~1.5 min. Walk through the before/after. Left: messy web text with spam, typos, ads, broken formatting.
Right: same knowledge, but restructured as a clean FAQ. The LLM acts as a rewriter.
Key insight: you preserve the knowledge, you just improve the presentation. No data wasted.
</aside>
</section>
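The rewriter on this slide is just an LLM fed the raw document inside a rephrasing instruction. A minimal sketch of how such a request could be assembled, assuming a hypothetical FAQ prompt (the wording below is illustrative, not the exact prompt used in the experiments):

```python
# Illustrative rephrasing prompt; the real prompts ship with the release.
REPHRASE_PROMPT = """Rewrite the web page below as a clean FAQ.
Keep every fact; drop ads, navigation, and boilerplate.

Web page:
{document}

FAQ:"""

def build_rephrase_request(document: str, max_chars: int = 8000) -> str:
    """Truncate overly long documents and fill the prompt template."""
    return REPHRASE_PROMPT.format(document=document[:max_chars])
```

The filled prompt is then sent to the generator model; batching many such requests is what the inference stack in Section 3 optimizes.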
<!-- SLIDE 6: Research Question -->
<section>
<p class="section-label">Our Research</p>
<h2>What's the Best Recipe?</h2>
<p style="font-size:0.6em;color:rgba(255,255,255,0.6);margin-bottom:12px;">
Three knobs to tune: <span class="accent">source data</span>, <span class="accent">prompt strategy</span>, and
<span class="accent">generator model</span>.
</p>
<div style="display:flex;align-items:center;gap:24px;">
<iframe src="charts/experiment-flow.html" style="flex:0 1 75%;height:540px;border:none;border-radius:8px;background:transparent;" loading="lazy"></iframe>
<div class="fragment" style="display:flex;flex-direction:column;gap:24px;min-width:120px;text-align:center;align-self:flex-start;margin-top:40px;padding-left:40px;">
<div class="stat-box"><div class="num" style="font-size:1.6em;">70+</div><div class="label">experiments</div></div>
<div class="stat-box"><div class="num" style="font-size:1.6em;">1T+</div><div class="label">tokens generated</div></div>
<div class="stat-box"><div class="num" style="font-size:1.6em;">60k+</div><div class="label">GPU hours</div></div>
</div>
</div>
<aside class="notes">
~1 min. "We ran a massive ablation study. Three axes: what source data you start from,
what prompt you give the rewriter, and which model does the rewriting.
This Sankey shows our 70+ experiments flowing from source → prompt → model.
Over 1 trillion tokens generated, 60k+ GPU hours."
</aside>
</section>
<!-- SLIDE 7: How We Evaluate -->
<section>
<p class="section-label">Methodology</p>
<h2>Our Integration Test Suite</h2>
<div style="font-size:0.7em;margin-top:30px;">
<p style="color:rgba(255,255,255,0.5);">For each experiment, we:</p>
<ul>
<li class="fragment">Train a <span class="accent">1.2B parameter</span> model from scratch</li>
<li class="fragment">Feed it <span class="accent">20B tokens</span> of synthetic and original data</li>
<li class="fragment">Test on <span class="accent">12 benchmarks</span> (reading, math, reasoning, knowledge...)</li>
<li class="fragment">Compare against curated web datasets as baselines</li>
</ul>
<p class="fragment" style="margin-top:16px;font-size:0.9em;color:rgba(255,255,255,0.6);">
This is expensive, so we tried proxies:
</p>
<ul class="fragment" style="font-size:0.85em;margin-top:4px;">
<li>DCLM/Edu scores (used for filtering pretraining data)</li>
<li>Smaller training runs</li>
</ul>
<p class="fragment" style="margin-top:4px;font-size:0.9em;">
None correlated well enough.
</p>
<p class="fragment" style="margin-top:10px;color:#f0c674;font-weight:600;">
No shortcuts: you must train and evaluate to know if your data is good.
</p>
</div>
<aside class="notes">
~1 min. "Think of it like an integration test suite for data quality.
We train a model on each dataset variant and see how it scores.
12 benchmarks covering reading comprehension, math, general knowledge, reasoning.
65 separate training runs. No proxy metric can replace this."
</aside>
</section>
<!-- SLIDE 8: Spoiler -->
<section>
<p class="section-label">Spoiler</p>
<h2>FinePhrase Wins</h2>
<p style="font-size:0.55em;color:rgba(255,255,255,0.5);margin-bottom:10px;">
Our best synthetic recipe outperforms all tested baselines, including curated web data.
</p>
<iframe src="charts/benchmark.html" class="chart-frame" style="height:360px;" loading="lazy"></iframe>
<p class="fragment" style="font-size:0.6em;margin-top:10px;color:rgba(255,255,255,0.6);">
Let's unpack <span class="accent">how</span>.
</p>
<aside class="notes">
~1 min. "Here's the punchline up front. FinePhrase, our best configuration,
beats all baselines including DCLM, Nemotron, REWIRE, and Cosmopedia.
Let's unpack the three key findings that got us here."
Transition to Section 2.
</aside>
</section>
<!-- ============================================================ -->
<!-- SECTION 2: EXPERIMENTAL RESULTS (~20%, slides 9-12, ~4 min) -->
<!-- ============================================================ -->
<!-- SLIDE 9: Prompts Matter Most -->
<section>
<p class="section-label">Finding #1</p>
<h2>Prompt Design Is the #1 Lever</h2>
<div class="two-col" style="font-size:0.6em;grid-template-columns:1.5fr 1fr;gap:20px;">
<div class="col" style="text-align:center;">
<iframe src="charts/benchmark-prompts.html" class="chart-frame" loading="lazy"
style="height:480px;" id="prompts-chart"></iframe>
</div>
<div class="col">
<p>Structured prompts beat everything:</p>
<ul>
<li class="fragment"><span class="highlight">Math</span> reformatting</li>
<li class="fragment"><span class="highlight">Table</span> extraction</li>
<li class="fragment"><span class="highlight">FAQ</span> generation</li>
<li class="fragment"><span class="highlight">Tutorial</span> rewriting</li>
</ul>
<p class="fragment" style="margin-top:20px;">
These beat curated web data <em>and</em> all prior synthetic baselines.
</p>
<p class="fragment" style="color:#f0c674;font-weight:600;margin-top:10px;">
The prompt matters more than the model or the source data.
</p>
</div>
</div>
<aside class="notes">
~1 min. "Finding number one, and the most important: prompt design is the biggest lever.
Structured formats like Math, Table, FAQ, Tutorial consistently outperform
both curated web data and prior synthetic approaches.
The prompt matters more than which model you use or what source data you start from."
</aside>
</section>
<!-- SLIDE 10: Smol Models Are Enough -->
<section>
<p class="section-label">Finding #2</p>
<h2>Smol Models Are Enough</h2>
<div class="two-col" style="font-size:0.6em;grid-template-columns:1.6fr 1fr;gap:16px;">
<div class="col" style="text-align:center;">
<iframe src="charts/benchmark-family.html" class="chart-frame" loading="lazy"
style="height:440px;"></iframe>
</div>
<div class="col">
<p>1B matches 4B, 12B, and 27B model performance.</p>
<p class="fragment"><span class="accent">SmolLM2-1.7B</span> beats Qwen, Gemma, Llama, Falcon, and Granite.</p>
<p class="fragment" style="margin-top:20px;">And it's <em>much</em> faster:</p>
<ul>
<li class="fragment"><span class="highlight">3.0x</span> faster than Gemma-3-12B<br><span style="color:rgba(255,255,255,0.4);">(9,220 vs 3,046 tps/gpu)</span></li>
<li class="fragment"><span class="highlight">5.3x</span> faster than Gemma-3-27B<br><span style="color:rgba(255,255,255,0.4);">(9,220 vs 1,724 tps/gpu)</span></li>
</ul>
<p class="fragment" style="color:#f0c674;font-weight:600;margin-top:20px;">
Better quality <em>and</em> faster inference.
</p>
</div>
</div>
<aside class="notes">
~1 min. "Finding two: you don't need a big model.
1B parameters match 4B, 12B, even 27B for rephrasing quality.
SmolLM2 at 1.7B beats all other model families.
And it's 3x faster than Gemma-12B, 5.3x faster than Gemma-27B.
Better quality AND faster inference. You don't need a big model."
</aside>
</section>
<!-- SLIDE 11: Diversity Paradox -->
<section>
<p class="section-label">Finding #3</p>
<h2>Diversity Beats Consistency</h2>
<div style="font-size:0.65em;">
<div class="two-col">
<div class="col">
<p><span class="highlight">Messy beats polished.</span></p>
<p class="fragment" data-fragment-index="1">SmolLM2's varied, inconsistent outputs outperform
Qwen3's template-locked, clean outputs.</p>
<p class="fragment" data-fragment-index="3" style="margin-top:20px;">
<span class="accent">Synthetic-only fails.</span><br>
You must mix synthetic data with original web data.
</p>
<p class="fragment" data-fragment-index="4" style="margin-top:20px;">
The mix-in dataset matters as much as the synthetic data itself.
</p>
</div>
<div class="col fragment" data-fragment-index="2">
<div style="background:rgba(255,255,255,0.04);border:1px solid rgba(255,255,255,0.08);border-radius:16px;padding:24px;">
<div style="font-size:1em;font-weight:700;margin-bottom:12px;">Template Collapse</div>
<div style="font-size:0.9em;line-height:1.6;">
<p style="color:rgba(255,255,255,0.5);">Qwen3 Math outputs:</p>
<p><span class="danger">115 / 1000</span> samples start with the exact same sentence</p>
<p style="margin-top:12px;color:rgba(255,255,255,0.5);">SmolLM2 Math outputs:</p>
<p><span class="accent">Highly varied</span> formatting and structure</p>
<p style="margin-top:16px;color:#f0c674;font-weight:600;">
Diversity beats consistency for pretraining.
</p>
</div>
</div>
</div>
</div>
</div>
<aside class="notes">
~1 min. "Finding three: diversity beats polish. This was counterintuitive.
Qwen3's math outputs are very clean and consistent, but 115 out of 1000 start identically.
SmolLM2's outputs are messier but more varied. The varied outputs win.
Also: synthetic-only training fails. You need to mix in original data.
The mix-in dataset influence is sometimes larger than the synthetic data itself."
</aside>
</section>
<!-- SLIDE 12: Results Summary -->
<section>
<p class="section-label">Summary</p>
<h2>What We Found</h2>
<ul class="takeaway-list" style="margin-top:30px;">
<li class="fragment">
<span class="accent">Prompt design</span> is the #1 lever.
Structured formats (Math, Table, FAQ, Tutorial) outperform everything.
</li>
<li class="fragment">
<span class="accent">1B models suffice.</span>
SmolLM2-1.7B is the best rephraser across the board.
</li>
<li class="fragment">
<span class="accent">Mix original data in.</span>
Synthetic-only fails. The mix-in dataset matters.
</li>
<li class="fragment">
<span class="accent">Diversity wins over polish.</span>
Varied, messy outputs beat clean, template-locked ones.
</li>
</ul>
<aside class="notes">
~30s. Quick recap of findings. Click through each point. These four bullets
are the core message of the talk. Transition: "Now let's talk about
the engineering challenge of actually doing this at scale."
</aside>
</section>
<!-- ============================================================ -->
<!-- SECTION 3: INFRASTRUCTURE (~20%, slides 13-16, ~4 min) -->
<!-- ============================================================ -->
<!-- SLIDE 13: Engineering Challenge -->
<section>
<p class="section-label">Infrastructure</p>
<h2>How Do You Rephrase 1T Tokens?</h2>
<div style="font-size:0.65em;margin-top:30px;">
<p>Each experiment generates ~15B tokens.</p>
<p class="fragment">70+ experiments = <span class="accent">1T+ tokens</span> of LLM output.</p>
<p class="fragment" style="margin-top:20px;">
At ~4,750 tokens/sec/GPU (mean across all experiments):
</p>
<div class="fragment stat-row" style="margin-top:20px;">
<div class="stat-box"><div class="num">~880</div><div class="label">GPU-hours per experiment</div></div>
<div class="stat-box"><div class="num">~$3k</div><div class="label">cloud cost per experiment</div></div>
<div class="stat-box"><div class="num">~$215k</div><div class="label">total compute budget</div></div>
</div>
<p class="fragment" style="margin-top:20px;color:#f0c674;font-weight:600;">
You need a scalable, fault-tolerant pipeline.
</p>
</div>
<aside class="notes">
~1 min. "Now the engineering side. Each experiment is 15 billion tokens of LLM generation.
70+ experiments. That's over a trillion tokens total. At $3.50/GPU-hour,
each experiment costs about $3,000. You need infrastructure that handles failures,
checkpoints, and scales across many nodes."
</aside>
</section>
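The slide's stat boxes follow from one back-of-the-envelope formula. A worked version of the arithmetic, assuming the $3.50/GPU-hour rate from the speaker notes:

```python
def gpu_hours(tokens: float, tokens_per_sec_per_gpu: float) -> float:
    """GPU-hours needed to generate `tokens` at a given per-GPU throughput."""
    return tokens / tokens_per_sec_per_gpu / 3600

# Numbers from the slide: ~15B tokens per experiment at ~4,750 tok/s/GPU.
per_experiment = gpu_hours(15e9, 4750)   # ≈ 877 GPU-hours ("~880")
cost = per_experiment * 3.50             # ≈ $3,070 per experiment ("~$3k")
total = 70 * per_experiment              # ≈ 61k GPU-hours across 70 runs ("60k+")
```

Note that the ~$215k total budget includes more than the generation itself (e.g., the 65+ training runs from the evaluation suite), so it is larger than 70 × $3k.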
<!-- SLIDE 14: DataTrove + vLLM -->
<section>
<p class="section-label">Infrastructure</p>
<h2>DataTrove + vLLM</h2>
<iframe src="charts/pipeline.html" class="chart-frame" loading="lazy" style="height:440px;margin-bottom:0;"></iframe>
<div style="font-size:0.55em;color:rgba(255,255,255,0.5);margin-top:2px;">
DataTrove orchestrates the pipeline. vLLM serves the model with optimized batching and prefix caching.
</div>
<aside class="notes">
~1 min. "We built on DataTrove, our open-source data processing library.
The pipeline is Read → Transform → Write. The Transform step calls vLLM,
a high-throughput inference engine with tensor parallelism, chunked prefill, and prefix caching.
Everything runs on Slurm with checkpointing and auto-recovery.
Outputs go straight to a Hugging Face dataset with auto-generated cards."
</aside>
</section>
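The Read → Transform → Write shape from the speaker notes can be sketched as three composable generator stages. This is a generic illustration of the pattern, not DataTrove's actual API (its readers, writers, and the batched vLLM call are more involved):

```python
from typing import Callable, Iterable, Iterator

def read(paths: Iterable[str]) -> Iterator[dict]:
    """Yield one document dict per input; stand-in for a JSONL shard reader."""
    for p in paths:
        yield {"id": p, "text": f"contents of {p}"}

def transform(docs: Iterator[dict], rephrase: Callable[[str], str]) -> Iterator[dict]:
    """Rewrite each document; in the real pipeline this batch-calls vLLM."""
    for doc in docs:
        doc["text"] = rephrase(doc["text"])
        yield doc

def write(docs: Iterator[dict]) -> list[dict]:
    """Drain the stream; stand-in for a checkpointed shard writer."""
    return list(docs)

# Streaming composition: documents flow through without being held in memory.
out = write(transform(read(["a.jsonl"]), rephrase=str.upper))
```

Because each stage is a generator, fault tolerance reduces to tracking which input shards have completed, which is what makes the Slurm checkpointing and auto-recovery in the notes tractable.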
<!-- SLIDE 15: Throughput Optimization -->
<section>
<p class="section-label">Infrastructure</p>
<h2>Throughput Optimization</h2>
<p style="font-size:0.55em;color:rgba(255,255,255,0.5);margin-bottom:8px;">
18 models benchmarked on H100 GPUs. Two tiers of optimization.
</p>
<iframe src="charts/throughput.html" class="chart-frame" loading="lazy" style="height:420px;"></iframe>
<aside class="notes">
~1 min. "We benchmarked 18 models across two tiers of optimization.
Tier 0: tensor parallelism, batch sizes, sequence lengths. Tier 1: GPU memory utilization, speculative decoding.
For large MoE models like GPT-OSS-120B, Tier 0 alone gives 1.95x speedup, cutting cost by nearly half.
Speculative decoding helps small models but can hurt others (Gemma 3 regresses due to vocab size)."
</aside>
</section>
<!-- SLIDE 16: Cost-Performance -->
<section>
<p class="section-label">Infrastructure</p>
<h2>Cost vs. Performance</h2>
<p style="font-size:0.55em;color:rgba(255,255,255,0.5);margin-bottom:8px;">
Small models + good prompts dominate the Pareto frontier.
</p>
<iframe src="charts/cost-efficiency.html" class="chart-frame" loading="lazy" style="height:420px;"></iframe>
<p class="fragment" style="font-size:0.6em;margin-top:4px;color:#f0c674;font-weight:600;">
Invest in prompt design, not model size.
</p>
<aside class="notes">
~1 min. "This scatter plot shows GPU time vs downstream performance for all experiments.
The Pareto frontier is dominated by small models with structured prompts.
The baselines on the left have zero rephrasing cost. Our best synthetic setups
beat them while remaining cost-efficient. Key takeaway: optimize throughput first,
then worry about model size."
</aside>
</section>
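For a (cost, score) scatter like this one, the Pareto frontier is the set of experiments no other experiment beats on both axes at once. A minimal sketch of how such a frontier can be computed (illustrative helper, not part of the released code):

```python
def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Keep (cost, score) points not dominated by a cheaper, better-scoring point."""
    frontier: list[tuple[float, float]] = []
    for cost, score in sorted(points):        # sweep in ascending cost
        if not frontier or score > frontier[-1][1]:
            frontier.append((cost, score))    # strictly improves on best so far
    return frontier
```

A single ascending-cost sweep suffices because any point whose score does not exceed the best score seen at lower cost is dominated.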
<!-- ============================================================ -->
<!-- SECTION 4: CONCLUSIONS (~20%, slides 17-21, ~4 min) -->
<!-- ============================================================ -->
<!-- SLIDE 17: The FinePhrase Recipe -->
<section>
<p class="section-label">Conclusion</p>
<h2>The FinePhrase Recipe</h2>
<div class="recipe-diagram fragment">
<div class="box">
<div style="font-size:1.4em;">📄</div>
<div style="font-weight:700;">Source Data</div>
<div style="font-size:0.85em;color:rgba(255,255,255,0.4);">Web text<br>(even low-quality)</div>
</div>
<div class="plus">+</div>
<div class="box">
<div style="font-size:1.4em;">📝</div>
<div style="font-weight:700;">Structured Prompt</div>
<div style="font-size:0.85em;color:rgba(255,255,255,0.4);">Math / Table /<br>FAQ / Tutorial</div>
</div>
<div class="plus">+</div>
<div class="box">
<div style="font-size:1.4em;">🤖</div>
<div style="font-weight:700;">SmolLM2-1.7B</div>
<div style="font-size:0.85em;color:rgba(255,255,255,0.4);">Small, fast,<br>diverse outputs</div>
</div>
<div class="equals">=</div>
<div class="box result">
<div style="font-size:1.4em;">✨</div>
<div style="font-weight:700;color:#7c6ff7;">FinePhrase</div>
<div style="font-size:0.85em;color:rgba(255,255,255,0.4);">Best synthetic<br>pretraining data</div>
</div>
</div>
<p class="fragment" style="font-size:0.6em;color:rgba(255,255,255,0.5);margin-top:20px;">
Mixed with high-quality original data (e.g., FineWeb-Edu) for best results.
</p>
<aside class="notes">
~1 min. "Here's the recipe in one slide. Take any web text, even low-quality,
apply a structured prompt (Math, Table, FAQ, Tutorial), run it through SmolLM2-1.7B,
and mix the output with high-quality original data. That's FinePhrase.
It outperforms all tested baselines."
</aside>
</section>
<!-- SLIDE 18: What Surprised Us -->
<section>
<p class="section-label">Conclusion</p>
<h2>What Surprised Us</h2>
<div class="surprise-grid fragment">
<div class="surprise-card">
<div class="icon">🤷</div>
<h4>Typos Don't Matter</h4>
<p>REWIRE's original prompt had typos. Fixing them made no measurable difference to downstream performance.</p>
</div>
<div class="surprise-card">
<div class="icon">📊</div>
<h4>Proxy Scores Lie</h4>
<p>Edu-score and DCLM-score do not reliably predict downstream performance. You must train and evaluate.</p>
</div>
<div class="surprise-card">
<div class="icon">🎲</div>
<h4>Messier Is Better</h4>
<p>Varied, inconsistent outputs from SmolLM2 beat Qwen3's polished, template-locked outputs every time.</p>
</div>
</div>
<aside class="notes">
~1 min. "Three things that surprised us. First: typos in prompts don't matter.
REWIRE's prompt had actual typos and fixing them changed nothing.
Second: quality proxy scores like edu-score don't predict performance. You must train.
Third: messy, varied outputs consistently beat clean, polished ones. Diversity is king."
</aside>
</section>
<!-- SLIDE 19: Everything Is Open -->
<section>
<p class="section-label">Open Source</p>
<h2>Everything Is Open</h2>
<div style="font-size:0.65em;margin-top:20px;">
<ul>
<li class="fragment">All prompts, configs, and pipeline code</li>
<li class="fragment">Generated datasets on the Hugging Face Hub</li>
<li class="fragment">Throughput benchmarks for 18 models</li>
<li class="fragment">Blog post with interactive charts</li>
</ul>
<div class="fragment" style="margin-top:30px;">
<p style="font-weight:700;color:#f0c674;font-size:1.1em;">Future directions:</p>
<ul style="color:rgba(255,255,255,0.5);">
<li>Diffusion LMs for faster inference</li>
<li>Scaling to more data (ablations trained on only 21B tokens)</li>
<li>Mixing ratio: how little synthetic data can you get away with?</li>
<li>Best-of-N filtering on synthetic outputs</li>
</ul>
</div>
</div>
<aside class="notes">
~1 min. "We're releasing everything. All prompts, the pipeline code in DataTrove,
the generated datasets on the Hub, throughput benchmarks.
The blog post itself has interactive charts you can explore.
Future work: we're looking at diffusion LMs for faster inference,
scaling beyond our 21B token ablations, exploring mixing ratios to find how little
synthetic data you actually need, and using best-of-N filtering on synthetic outputs."
</aside>
</section>
<!-- SLIDE 20: Academia Hub -->
<section>
<img src="assets/academia-hub.png" class="img-contain" style="max-height:560px;border-radius:12px;box-shadow:0 8px 40px rgba(0,0,0,0.4);">
<aside class="notes">
~30s. "If you're at a university or research lab, check out our Academia Hub:
institution-wide access to the Hugging Face Hub with priority GPU access,
inference credits, storage, and enterprise admin."
</aside>
</section>
<!-- SLIDE 21: Q&A -->
<section class="center-slide">
<h2>Thank You</h2>
<p style="font-size:0.6em;color:rgba(255,255,255,0.5);margin-top:10px;">Questions?</p>
<div style="display:flex;align-items:center;justify-content:center;gap:28px;margin-top:30px;">
<img src="assets/profile.jpg" style="width:90px;height:90px;border-radius:50%;border:2px solid rgba(255,255,255,0.15);object-fit:cover;">
<div style="text-align:left;font-size:0.55em;">
<div style="font-weight:700;font-size:1.2em;margin-bottom:8px;">Joel Niklaus</div>
<div style="display:flex;align-items:center;gap:8px;margin-bottom:6px;">
<svg width="18" height="18" viewBox="0 0 24 24" fill="rgba(255,255,255,0.7)"><path d="M20.447 20.452h-3.554v-5.569c0-1.328-.027-3.037-1.852-3.037-1.853 0-2.136 1.445-2.136 2.939v5.667H9.351V9h3.414v1.561h.046c.477-.9 1.637-1.85 3.37-1.85 3.601 0 4.267 2.37 4.267 5.455v6.286zM5.337 7.433a2.062 2.062 0 01-2.063-2.065 2.064 2.064 0 112.063 2.065zm1.782 13.019H3.555V9h3.564v11.452zM22.225 0H1.771C.792 0 0 .774 0 1.729v20.542C0 23.227.792 24 1.771 24h20.451C23.2 24 24 23.227 24 22.271V1.729C24 .774 23.2 0 22.222 0h.003z"/></svg>
<a href="https://linkedin.com/in/joelniklaus" target="_blank" style="color:rgba(255,255,255,0.7);text-decoration:none;">joelniklaus</a>
</div>
<div style="display:flex;align-items:center;gap:8px;">
<svg width="18" height="18" viewBox="0 0 24 24" fill="rgba(255,255,255,0.7)"><path d="M18.244 2.25h3.308l-7.227 8.26 8.502 11.24H16.17l-5.214-6.817L4.99 21.75H1.68l7.73-8.835L1.254 2.25H8.08l4.713 6.231zm-1.161 17.52h1.833L7.084 4.126H5.117z"/></svg>
<a href="https://x.com/joelniklaus" target="_blank" style="color:rgba(255,255,255,0.7);text-decoration:none;">@joelniklaus</a>
</div>
</div>
</div>
<p style="margin-top:24px;font-size:0.55em;color:rgba(255,255,255,0.4);">
Stay tuned for the blog post with many more details.
</p>
<aside class="notes">
Q&A time. Mention they can reach out on LinkedIn or X. Have the blog open in a browser tab
for live demos if questions come up.
</aside>
</section>
</div><!-- /slides -->
</div><!-- /reveal -->
<script src="https://cdn.jsdelivr.net/npm/reveal.js@5.1.0/dist/reveal.js"></script>
<script src="https://cdn.jsdelivr.net/npm/reveal.js@5.1.0/plugin/notes/notes.js"></script>
<script>
Reveal.initialize({
hash: true,
slideNumber: 'c/t',
showSlideNumber: 'speaker',
transition: 'fade',
transitionSpeed: 'fast',
center: false,
width: 1200,
height: 700,
margin: 0.06,
plugins: [RevealNotes],
});
</script>
</body>
</html>