<html lang="en" data-theme="dark">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>The Synthetic Data Playbook</title>
<link rel="preconnect" href="https://fonts.googleapis.com">
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700;800&display=swap" rel="stylesheet">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/reveal.js@5.1.0/dist/reveal.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/reveal.js@5.1.0/dist/theme/night.css">
<link rel="stylesheet" href="style.css">
</head>
<body>
<div class="reveal">
<div class="slides">
<!-- ============================================================ -->
<!-- SECTION 1: MOTIVATION AND RECAP (~40%, slides 1-8, ~9 min) -->
<!-- ============================================================ -->
<!-- SLIDE 1: Title -->
<section class="center-slide">
<div style="margin-top:-160px;">
<h2>The Synthetic Data Playbook</h2>
<br>
<h3>How to Cook Better Training Data for LLMs</h3>
<br>
<p style="margin-top:20px;font-size:0.8em;color:rgba(255,255,255,1);">
SE 26
</p>
</div>
<img src="assets/bern-skyline.png" style="position:absolute;bottom:0;left:50%;transform:translateX(-50%);width:100%;height:auto;opacity:0.6;pointer-events:none;">
<aside class="notes">
~30s. Welcome, introduce yourself. "Today I'll show you how we made LLMs better
by rewriting their training data instead of just filtering it."
</aside>
</section>
<!-- SLIDE 2: Digital Sovereignty -->
<section>
<p class="section-label">Why This Matters</p>
<h2>The Data Black Box</h2>
<div style="font-size:0.65em;margin-top:20px;">
<p>Frontier labs (OpenAI, Google, Anthropic) don't disclose how they build their training data.</p>
<p class="fragment">Neither do the Chinese labs (DeepSeek or Qwen).</p>
<p class="fragment" style="margin-top:20px;">
Training data is the <span class="highlight">most important ingredient</span> in building an LLM,
yet the recipes are kept secret.
</p>
<div class="fragment" style="margin-top:30px;background:rgba(255,255,255,0.04);border:1px solid rgba(255,255,255,0.1);border-radius:16px;padding:24px;">
<p style="font-weight:700;color:#f0c674;margin-bottom:8px;font-size:1.1em;">Digital Sovereignty</p>
<p>If you can't build the data, you can't build the model.<br>
If you can't build the model, you depend on those who can.</p>
<p style="margin-top:12px;">This work puts the knowledge <span class="accent">out in the open</span> for everyone:
governments, universities, startups, and individuals.</p>
</div>
</div>
<aside class="notes">
~1 min. "Before we dive in, let me explain why this matters beyond the technical.
None of the frontier labs, not OpenAI, not Google, not Anthropic, and not the Chinese labs either,
tell you how they build their training data. It's the most important ingredient and it's a black box.
This is a digital sovereignty issue. If you can't build the data yourself,
you can't build the model, and you're dependent on whoever can.
Our work makes this knowledge open and accessible to everyone."
</aside>
</section>
<!-- SLIDE 3: LLMs and Pretraining Recap -->
<section>
<p class="section-label">Quick Recap</p>
<h2>LLMs: What's Under the Hood</h2>
<div class="two-col" style="font-size:0.65em;margin-top:20px;">
<div class="col">
<p>You use these every day: ChatGPT, Copilot, Claude.</p>
<p class="fragment">Under the hood: a giant function that takes <span class="accent">tokens in</span> and predicts <span class="accent">tokens out</span>.</p>
<p class="fragment">Trained by reading <span class="highlight">billions of web pages</span>, learning to predict the next word.</p>
<p class="fragment" style="margin-top:20px;font-weight:700;color:#f0c674;">
Data quality defines model quality.
</p>
</div>
<div class="col fragment" style="text-align:center;">
<div style="background:rgba(255,255,255,0.04);border:1px solid rgba(255,255,255,0.1);border-radius:16px;padding:30px 20px;">
<div style="font-size:0.9em;color:rgba(255,255,255,0.5);">Input text</div>
<div style="font-size:2em;margin:10px 0;">↓</div>
<div style="background:rgba(124,111,247,0.15);border:1px solid rgba(124,111,247,0.3);border-radius:12px;padding:16px;font-weight:700;font-size:1.1em;">
LLM<br><span style="font-size:0.6em;font-weight:400;color:rgba(255,255,255,0.4);">billions of parameters</span>
</div>
<div style="font-size:2em;margin:10px 0;">↓</div>
<div style="font-size:0.9em;color:rgba(255,255,255,0.5);">Output text</div>
</div>
</div>
</div>
<aside class="notes">
~1.5 min. "Quick recap so we have shared vocabulary." Click through fragments.
Emphasize: model quality = data quality. Like training a code model on all of GitHub.
</aside>
</section>
<!-- SLIDE 4: The Data Quality Problem -->
<section>
<p class="section-label">The Problem</p>
<h2>You Start With the Entire Internet...</h2>
<h3 class="fragment">...and throw away 98.6% of it</h3>
<div class="fragment">
<img src="assets/dclm-filtering-pipeline.png" class="img-contain" style="margin-top:10px;max-height:400px;">
<p style="font-size:0.45em;color:rgba(255,255,255,0.3);margin-top:8px;">
DCLM: 240T tokens from Common Crawl → 1.4% survives as DCLM-Baseline
</p>
</div>
<aside class="notes">
~1 min. "This is the DCLM dataset pipeline. You scrape the whole internet, 240 trillion tokens.
Then heuristic filters, deduplication, model-based filtering. Only 1.4% of documents survive.
All this engineering just to clean the data. What if there was a better way?"
</aside>
</section>
<!-- SLIDE 5: Synthetic Data -->
<section>
<p class="section-label">The Idea</p>
<h2>Rewrite Instead of Filter</h2>
<div class="before-after">
<div class="panel bad">
<div class="panel-title">Raw Web Text</div>
<p style="margin:0;line-height:1.6;">
<span style="color:rgba(255,255,255,0.3);">★★★ BeSt DeAls!!!</span><br>
Photosynthesis is the process by wich plants convert sunlit into energy.
It occurs in the chloroplasts<br>
<span style="color:rgba(255,255,255,0.3);">Click here for more → → →</span><br>
<span style="color:rgba(255,255,255,0.3);">© 2019 AllScienceInfo.biz</span><br>
Carbon dioxide and water are transformed into glucose and oxygen...
<span style="color:rgba(255,255,255,0.3);">[AD] [AD] [POPUP]</span>
</p>
</div>
<div class="arrow fragment" data-fragment-index="0">→</div>
<div class="panel good fragment" data-fragment-index="0">
<div class="panel-title">LLM-Rewritten FAQ</div>
<p style="margin:0;line-height:1.6;">
<strong>Q: What is photosynthesis?</strong><br>
A: Photosynthesis is the process by which plants convert sunlight into chemical energy.
It occurs in organelles called chloroplasts.<br><br>
<strong>Q: What are the inputs and outputs?</strong><br>
A: Plants take in carbon dioxide (CO₂) and water (H₂O), and using light energy,
produce glucose (C₆H₁₂O₆) and oxygen (O₂).
</p>
</div>
</div>
<p class="fragment" style="font-size:0.55em;margin-top:16px;">
Same knowledge, better packaging.<br>
You keep <span class="highlight">100%</span> of your data instead of discarding 98.6%.
</p>
<aside class="notes">
~1.5 min. Walk through the before/after. Left: messy web text with spam, typos, ads, broken formatting.
Right: same knowledge, but restructured as a clean FAQ. The LLM acts as a rewriter.
Key insight: you preserve the knowledge, you just improve the presentation. No data wasted.
</aside>
</section>
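The rewriter on this slide is just an LLM fed the raw document inside a rephrasing instruction. A minimal sketch of how such a request could be assembled, assuming a hypothetical FAQ prompt (the wording below is illustrative, not the exact prompt used in the experiments):

```python
# Illustrative rephrasing prompt; the real prompts ship with the release.
REPHRASE_PROMPT = """Rewrite the web page below as a clean FAQ.
Keep every fact; drop ads, navigation, and boilerplate.

Web page:
{document}

FAQ:"""

def build_rephrase_request(document: str, max_chars: int = 8000) -> str:
    """Truncate overly long documents and fill the prompt template."""
    return REPHRASE_PROMPT.format(document=document[:max_chars])
```

The filled prompt is then sent to the generator model; batching many such requests is what the inference stack in Section 3 optimizes.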
<!-- SLIDE 6: Research Question -->
<section>
<p class="section-label">Our Research</p>
<h2>What's the Best Recipe?</h2>
<p style="font-size:0.6em;color:rgba(255,255,255,0.6);margin-bottom:12px;">
Three knobs to tune: <span class="accent">source data</span>, <span class="accent">prompt strategy</span>, and
<span class="accent">generator model</span>.
</p>
<div style="display:flex;align-items:center;gap:24px;">
<iframe src="charts/experiment-flow.html" style="flex:0 1 75%;height:540px;border:none;border-radius:8px;background:transparent;" loading="lazy"></iframe>
<div class="fragment" style="display:flex;flex-direction:column;gap:24px;min-width:120px;text-align:center;align-self:flex-start;margin-top:40px;padding-left:40px;">
<div class="stat-box"><div class="num" style="font-size:1.6em;">70+</div><div class="label">experiments</div></div>
<div class="stat-box"><div class="num" style="font-size:1.6em;">1T+</div><div class="label">tokens generated</div></div>
<div class="stat-box"><div class="num" style="font-size:1.6em;">60k+</div><div class="label">GPU hours</div></div>
</div>
</div>
<aside class="notes">
~1 min. "We ran a massive ablation study. Three axes: what source data you start from,
what prompt you give the rewriter, and which model does the rewriting.
This Sankey shows our 70+ experiments flowing from source → prompt → model.
Over 1 trillion tokens generated, 60k+ GPU hours."
</aside>
</section>
<!-- SLIDE 7: How We Evaluate -->
<section>
<p class="section-label">Methodology</p>
<h2>Our Integration Test Suite</h2>
<div style="font-size:0.7em;margin-top:30px;">
<p style="color:rgba(255,255,255,0.5);">For each experiment, we:</p>
<ul>
<li class="fragment">Train a <span class="accent">1.2B parameter</span> model from scratch</li>
<li class="fragment">Feed it <span class="accent">20B tokens</span> of synthetic and original data</li>
<li class="fragment">Test on <span class="accent">12 benchmarks</span> (reading, math, reasoning, knowledge...)</li>
<li class="fragment">Compare against curated web datasets as baselines</li>
</ul>
<p class="fragment" style="margin-top:16px;font-size:0.9em;color:rgba(255,255,255,0.6);">
This is expensive, so we tried proxies:
</p>
<ul class="fragment" style="font-size:0.85em;margin-top:4px;">
<li>DCLM/Edu scores (used for filtering pretraining data)</li>
<li>Smaller training runs</li>
</ul>
<p class="fragment" style="margin-top:4px;font-size:0.9em;">
None correlated well enough.
</p>
<p class="fragment" style="margin-top:10px;color:#f0c674;font-weight:600;">
No shortcuts: you must train and evaluate to know if your data is good.
</p>
</div>
<aside class="notes">
~1 min. "Think of it like an integration test suite for data quality.
We train a model on each dataset variant and see how it scores.
12 benchmarks covering reading comprehension, math, general knowledge, reasoning.
65 separate training runs. No proxy metric can replace this."
</aside>
</section>
<!-- SLIDE 8: Spoiler -->
<section>
<p class="section-label">Spoiler</p>
<h2>FinePhrase Wins</h2>
<p style="font-size:0.55em;color:rgba(255,255,255,0.5);margin-bottom:10px;">
Our best synthetic recipe outperforms all tested baselines, including curated web data.
</p>
<iframe src="charts/benchmark.html" class="chart-frame" style="height:360px;" loading="lazy"></iframe>
<p class="fragment" style="font-size:0.6em;margin-top:10px;color:rgba(255,255,255,0.6);">
Let's unpack <span class="accent">how</span>.
</p>
<aside class="notes">
~1 min. "Here's the punchline up front. FinePhrase, our best configuration,
beats all baselines including DCLM, Nemotron, REWIRE, and Cosmopedia.
Let's unpack the three key findings that got us here."
Transition to Section 2.
</aside>
</section>
<!-- ============================================================ -->
<!-- SECTION 2: EXPERIMENTAL RESULTS (~20%, slides 9-12, ~4 min) -->
<!-- ============================================================ -->
<!-- SLIDE 9: Prompts Matter Most -->
<section>
<p class="section-label">Finding #1</p>
<h2>Prompt Design Is the #1 Lever</h2>
<div class="two-col" style="font-size:0.6em;grid-template-columns:1.5fr 1fr;gap:20px;">
<div class="col" style="text-align:center;">
<iframe src="charts/benchmark-prompts.html" class="chart-frame" loading="lazy"
style="height:480px;" id="prompts-chart"></iframe>
</div>
<div class="col">
<p>Structured prompts beat everything:</p>
<ul>
<li class="fragment"><span class="highlight">Math</span> reformatting</li>
<li class="fragment"><span class="highlight">Table</span> extraction</li>
<li class="fragment"><span class="highlight">FAQ</span> generation</li>
<li class="fragment"><span class="highlight">Tutorial</span> rewriting</li>
</ul>
<p class="fragment" style="margin-top:20px;">
These beat curated web data <em>and</em> all prior synthetic baselines.
</p>
<p class="fragment" style="color:#f0c674;font-weight:600;margin-top:10px;">
The prompt matters more than the model or the source data.
</p>
</div>
</div>
<aside class="notes">
~1 min. "Finding number one, and the most important: prompt design is the biggest lever.
Structured formats like Math, Table, FAQ, Tutorial consistently outperform
both curated web data and prior synthetic approaches.
The prompt matters more than which model you use or what source data you start from."
</aside>
</section>
<!-- SLIDE 10: Smol Models Are Enough -->
<section>
<p class="section-label">Finding #2</p>
<h2>Smol Models Are Enough</h2>
<div class="two-col" style="font-size:0.6em;grid-template-columns:1.6fr 1fr;gap:16px;">
<div class="col" style="text-align:center;">
<iframe src="charts/benchmark-family.html" class="chart-frame" loading="lazy"
style="height:440px;"></iframe>
</div>
<div class="col">
<p>1B matches 4B, 12B, and 27B model performance.</p>
<p class="fragment"><span class="accent">SmolLM2-1.7B</span> beats Qwen, Gemma, Llama, Falcon, and Granite.</p>
<p class="fragment" style="margin-top:20px;">And it's <em>much</em> faster:</p>
<ul>
<li class="fragment"><span class="highlight">3.0x</span> faster than Gemma-3-12B<br><span style="color:rgba(255,255,255,0.4);">(9,220 vs 3,046 tps/gpu)</span></li>
<li class="fragment"><span class="highlight">5.3x</span> faster than Gemma-3-27B<br><span style="color:rgba(255,255,255,0.4);">(9,220 vs 1,724 tps/gpu)</span></li>
</ul>
<p class="fragment" style="color:#f0c674;font-weight:600;margin-top:20px;">
Better quality <em>and</em> faster inference.
</p>
</div>
</div>
<aside class="notes">
~1 min. "Finding two: you don't need a big model.
1B parameters match 4B, 12B, even 27B for rephrasing quality.
SmolLM2 at 1.7B beats all other model families.
And it's 3x faster than Gemma-12B, 5.3x faster than Gemma-27B.
Better quality AND faster inference. You don't need a big model."
</aside>
</section>
<!-- SLIDE 11: Diversity Paradox -->
<section>
<p class="section-label">Finding #3</p>
<h2>Diversity Beats Consistency</h2>
<div style="font-size:0.65em;">
<div class="two-col">
<div class="col">
<p><span class="highlight">Messy beats polished.</span></p>
<p class="fragment" data-fragment-index="1">SmolLM2's varied, inconsistent outputs outperform
Qwen3's template-locked, clean outputs.</p>
<p class="fragment" data-fragment-index="3" style="margin-top:20px;">
<span class="accent">Synthetic-only fails.</span><br>
You must mix synthetic data with original web data.
</p>
<p class="fragment" data-fragment-index="4" style="margin-top:20px;">
The mix-in dataset matters as much as the synthetic data itself.
</p>
</div>
<div class="col fragment" data-fragment-index="2">
<div style="background:rgba(255,255,255,0.04);border:1px solid rgba(255,255,255,0.08);border-radius:16px;padding:24px;">
<div style="font-size:1em;font-weight:700;margin-bottom:12px;">Template Collapse</div>
<div style="font-size:0.9em;line-height:1.6;">
<p style="color:rgba(255,255,255,0.5);">Qwen3 Math outputs:</p>
<p><span class="danger">115 / 1000</span> samples start with the exact same sentence</p>
<p style="margin-top:12px;color:rgba(255,255,255,0.5);">SmolLM2 Math outputs:</p>
<p><span class="accent">Highly varied</span> formatting and structure</p>
<p style="margin-top:16px;color:#f0c674;font-weight:600;">
Diversity beats consistency for pretraining.
</p>
</div>
</div>
</div>
</div>
</div>
<aside class="notes">
~1 min. "Finding three: diversity beats polish. This was counterintuitive.
Qwen3's math outputs are very clean and consistent, but 115 out of 1000 start identically.
SmolLM2's outputs are messier but more varied. The varied outputs win.
Also: synthetic-only training fails. You need to mix in original data.
The mix-in dataset influence is sometimes larger than the synthetic data itself."
</aside>
</section>
<!-- SLIDE 12: Results Summary -->
<section>
<p class="section-label">Summary</p>
<h2>What We Found</h2>
<ul class="takeaway-list" style="margin-top:30px;">
<li class="fragment">
<span class="accent">Prompt design</span> is the #1 lever.
Structured formats (Math, Table, FAQ, Tutorial) outperform everything.
</li>
<li class="fragment">
<span class="accent">1B models suffice.</span>
SmolLM2-1.7B is the best rephraser across the board.
</li>
<li class="fragment">
<span class="accent">Mix original data in.</span>
Synthetic-only fails. The mix-in dataset matters.
</li>
<li class="fragment">
<span class="accent">Diversity wins over polish.</span>
Varied, messy outputs beat clean, template-locked ones.
</li>
</ul>
<aside class="notes">
~30s. Quick recap of findings. Click through each point. These four bullets
are the core message of the talk. Transition: "Now let's talk about
the engineering challenge of actually doing this at scale."
</aside>
</section>
<!-- ============================================================ -->
<!-- SECTION 3: INFRASTRUCTURE (~20%, slides 13-16, ~4 min) -->
<!-- ============================================================ -->
<!-- SLIDE 13: Engineering Challenge -->
<section>
<p class="section-label">Infrastructure</p>
<h2>How Do You Rephrase 1T Tokens?</h2>
<div style="font-size:0.65em;margin-top:30px;">
<p>Each experiment generates ~15B tokens.</p>
<p class="fragment">70+ experiments = <span class="accent">1T+ tokens</span> of LLM output.</p>
<p class="fragment" style="margin-top:20px;">
At ~4,750 tokens/sec/GPU (mean across all experiments):
</p>
<div class="fragment stat-row" style="margin-top:20px;">
<div class="stat-box"><div class="num">~880</div><div class="label">GPU-hours per experiment</div></div>
<div class="stat-box"><div class="num">~$3k</div><div class="label">cloud cost per experiment</div></div>
<div class="stat-box"><div class="num">~$215k</div><div class="label">total compute budget</div></div>
</div>
<p class="fragment" style="margin-top:20px;color:#f0c674;font-weight:600;">
You need a scalable, fault-tolerant pipeline.
</p>
</div>
<aside class="notes">
~1 min. "Now the engineering side. Each experiment is 15 billion tokens of LLM generation.
70+ experiments. That's over a trillion tokens total. At $3.50/GPU-hour,
each experiment costs about $3,000. You need infrastructure that handles failures,
checkpoints, and scales across many nodes."
</aside>
</section>
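The slide's stat boxes follow from one back-of-the-envelope formula. A worked version of the arithmetic, assuming the $3.50/GPU-hour rate from the speaker notes:

```python
def gpu_hours(tokens: float, tokens_per_sec_per_gpu: float) -> float:
    """GPU-hours needed to generate `tokens` at a given per-GPU throughput."""
    return tokens / tokens_per_sec_per_gpu / 3600

# Numbers from the slide: ~15B tokens per experiment at ~4,750 tok/s/GPU.
per_experiment = gpu_hours(15e9, 4750)   # ≈ 877 GPU-hours ("~880")
cost = per_experiment * 3.50             # ≈ $3,070 per experiment ("~$3k")
total = 70 * per_experiment              # ≈ 61k GPU-hours across 70 runs ("60k+")
```

Note that the ~$215k total budget includes more than the generation itself (e.g., the 65+ training runs from the evaluation suite), so it is larger than 70 × $3k.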
<!-- SLIDE 14: DataTrove + vLLM -->
<section>
<p class="section-label">Infrastructure</p>
<h2>DataTrove + vLLM</h2>
<iframe src="charts/pipeline.html" class="chart-frame" loading="lazy" style="height:440px;margin-bottom:0;"></iframe>
<div style="font-size:0.55em;color:rgba(255,255,255,0.5);margin-top:2px;">
DataTrove orchestrates the pipeline. vLLM serves the model with optimized batching and prefix caching.
</div>
<aside class="notes">
~1 min. "We built on DataTrove, our open-source data processing library.
The pipeline is Read → Transform → Write. The Transform step calls vLLM,
a high-throughput inference engine with tensor parallelism, chunked prefill, and prefix caching.
Everything runs on Slurm with checkpointing and auto-recovery.
Outputs go straight to a Hugging Face dataset with auto-generated cards."
</aside>
</section>
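The Read → Transform → Write shape from the speaker notes can be sketched as three composable generator stages. This is a generic illustration of the pattern, not DataTrove's actual API (its readers, writers, and the batched vLLM call are more involved):

```python
from typing import Callable, Iterable, Iterator

def read(paths: Iterable[str]) -> Iterator[dict]:
    """Yield one document dict per input; stand-in for a JSONL shard reader."""
    for p in paths:
        yield {"id": p, "text": f"contents of {p}"}

def transform(docs: Iterator[dict], rephrase: Callable[[str], str]) -> Iterator[dict]:
    """Rewrite each document; in the real pipeline this batch-calls vLLM."""
    for doc in docs:
        doc["text"] = rephrase(doc["text"])
        yield doc

def write(docs: Iterator[dict]) -> list[dict]:
    """Drain the stream; stand-in for a checkpointed shard writer."""
    return list(docs)

# Streaming composition: documents flow through without being held in memory.
out = write(transform(read(["a.jsonl"]), rephrase=str.upper))
```

Because each stage is a generator, fault tolerance reduces to tracking which input shards have completed, which is what makes the Slurm checkpointing and auto-recovery in the notes tractable.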
<!-- SLIDE 15: Throughput Optimization -->
<section>
<p class="section-label">Infrastructure</p>
<h2>Throughput Optimization</h2>
<p style="font-size:0.55em;color:rgba(255,255,255,0.5);margin-bottom:8px;">
18 models benchmarked on H100 GPUs. Two tiers of optimization.
</p>
<iframe src="charts/throughput.html" class="chart-frame" loading="lazy" style="height:420px;"></iframe>
<aside class="notes">
~1 min. "We benchmarked 18 models across two tiers of optimization.
Tier 0: tensor parallelism, batch sizes, sequence lengths. Tier 1: GPU memory utilization, speculative decoding.
For large MoE models like GPT-OSS-120B, Tier 0 alone gives 1.95x speedup, cutting cost by nearly half.
Speculative decoding helps small models but can hurt others (Gemma 3 regresses due to vocab size)."
</aside>
</section>
<!-- SLIDE 16: Cost-Performance -->
<section>
<p class="section-label">Infrastructure</p>
<h2>Cost vs. Performance</h2>
<p style="font-size:0.55em;color:rgba(255,255,255,0.5);margin-bottom:8px;">
Small models + good prompts dominate the Pareto frontier.
</p>
<iframe src="charts/cost-efficiency.html" class="chart-frame" loading="lazy" style="height:420px;"></iframe>
<p class="fragment" style="font-size:0.6em;margin-top:4px;color:#f0c674;font-weight:600;">
Invest in prompt design, not model size.
</p>
<aside class="notes">
~1 min. "This scatter plot shows GPU time vs downstream performance for all experiments.
The Pareto frontier is dominated by small models with structured prompts.
The baselines on the left have zero rephrasing cost. Our best synthetic setups
beat them while remaining cost-efficient. Key takeaway: optimize throughput first,
then worry about model size."
</aside>
</section>
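For a (cost, score) scatter like this one, the Pareto frontier is the set of experiments no other experiment beats on both axes at once. A minimal sketch of how such a frontier can be computed (illustrative helper, not part of the released code):

```python
def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Keep (cost, score) points not dominated by a cheaper, better-scoring point."""
    frontier: list[tuple[float, float]] = []
    for cost, score in sorted(points):        # sweep in ascending cost
        if not frontier or score > frontier[-1][1]:
            frontier.append((cost, score))    # strictly improves on best so far
    return frontier
```

A single ascending-cost sweep suffices because any point whose score does not exceed the best score seen at lower cost is dominated.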
<!-- ============================================================ -->
<!-- SECTION 4: CONCLUSIONS (~20%, slides 17-21, ~4 min) -->
<!-- ============================================================ -->
<!-- SLIDE 17: The FinePhrase Recipe -->
<section>
<p class="section-label">Conclusion</p>
<h2>The FinePhrase Recipe</h2>
<div class="recipe-diagram fragment">
<div class="box">
<div style="font-size:1.4em;">📄</div>
<div style="font-weight:700;">Source Data</div>
<div style="font-size:0.85em;color:rgba(255,255,255,0.4);">Web text<br>(even low-quality)</div>
</div>
<div class="plus">+</div>
<div class="box">
<div style="font-size:1.4em;">📝</div>
<div style="font-weight:700;">Structured Prompt</div>
<div style="font-size:0.85em;color:rgba(255,255,255,0.4);">Math / Table /<br>FAQ / Tutorial</div>
</div>
<div class="plus">+</div>
<div class="box">
<div style="font-size:1.4em;">🤖</div>
<div style="font-weight:700;">SmolLM2-1.7B</div>
<div style="font-size:0.85em;color:rgba(255,255,255,0.4);">Small, fast,<br>diverse outputs</div>
</div>
<div class="equals">=</div>
<div class="box result">
<div style="font-size:1.4em;">✨</div>
<div style="font-weight:700;color:#7c6ff7;">FinePhrase</div>
<div style="font-size:0.85em;color:rgba(255,255,255,0.4);">Best synthetic<br>pretraining data</div>
</div>
</div>
<p class="fragment" style="font-size:0.6em;color:rgba(255,255,255,0.5);margin-top:20px;">
Mixed with high-quality original data (e.g., FineWeb-Edu) for best results.
</p>
<aside class="notes">
~1 min. "Here's the recipe in one slide. Take any web text, even low-quality,
apply a structured prompt (Math, Table, FAQ, Tutorial), run it through SmolLM2-1.7B,
and mix the output with high-quality original data. That's FinePhrase.
It outperforms all tested baselines."
</aside>
</section>
<!-- SLIDE 18: What Surprised Us -->
<section>
<p class="section-label">Conclusion</p>
<h2>What Surprised Us</h2>
<div class="surprise-grid fragment">
<div class="surprise-card">
<div class="icon">🤷</div>
<h4>Typos Don't Matter</h4>
<p>REWIRE's original prompt had typos. Fixing them made no measurable difference to downstream performance.</p>
</div>
<div class="surprise-card">
<div class="icon">📊</div>
<h4>Proxy Scores Lie</h4>
<p>Edu-score and DCLM-score do not reliably predict downstream performance. You must train and evaluate.</p>
</div>
<div class="surprise-card">
<div class="icon">🎲</div>
<h4>Messier Is Better</h4>
<p>Varied, inconsistent outputs from SmolLM2 beat Qwen3's polished, template-locked outputs every time.</p>
</div>
</div>
<aside class="notes">
~1 min. "Three things that surprised us. First: typos in prompts don't matter.
REWIRE's prompt had actual typos and fixing them changed nothing.
Second: quality proxy scores like edu-score don't predict performance. You must train.
Third: messy, varied outputs consistently beat clean, polished ones. Diversity is king."
</aside>
</section>
<!-- SLIDE 19: Everything Is Open -->
<section>
<p class="section-label">Open Source</p>
<h2>Everything Is Open</h2>
<div style="font-size:0.65em;margin-top:20px;">
<ul>
<li class="fragment">All prompts, configs, and pipeline code</li>
<li class="fragment">Generated datasets on the Hugging Face Hub</li>
<li class="fragment">Throughput benchmarks for 18 models</li>
<li class="fragment">Blog post with interactive charts</li>
</ul>
<div class="fragment" style="margin-top:30px;">
<p style="font-weight:700;color:#f0c674;font-size:1.1em;">Future directions:</p>
<ul style="color:rgba(255,255,255,0.5);">
<li>Diffusion LMs for faster inference</li>
<li>Scaling to more data (ablations trained on only 21B tokens)</li>
<li>Mixing ratio: how little synthetic data can you get away with?</li>
<li>Best-of-N filtering on synthetic outputs</li>
</ul>
</div>
</div>
<aside class="notes">
~1 min. "We're releasing everything. All prompts, the pipeline code in DataTrove,
the generated datasets on the Hub, throughput benchmarks.
The blog post itself has interactive charts you can explore.
Future work: we're looking at diffusion LMs for faster inference,
scaling beyond our 21B token ablations, exploring mixing ratios to find how little
synthetic data you actually need, and using best-of-N filtering on synthetic outputs."
</aside>
</section>
<!-- SLIDE 20: Academia Hub -->
<section>
<img src="assets/academia-hub.png" class="img-contain" style="max-height:560px;border-radius:12px;box-shadow:0 8px 40px rgba(0,0,0,0.4);">
<aside class="notes">
~30s. "If you're at a university or research lab, check out our Academia Hub:
institution-wide access to the Hugging Face Hub with priority GPU access,
inference credits, storage, and enterprise admin."
</aside>
</section>
<!-- SLIDE 21: Q&A -->
<section class="center-slide">
<h2>Thank You</h2>
<p style="font-size:0.6em;color:rgba(255,255,255,0.5);margin-top:10px;">Questions?</p>
<div style="display:flex;align-items:center;justify-content:center;gap:28px;margin-top:30px;">
<img src="assets/profile.jpg" style="width:90px;height:90px;border-radius:50%;border:2px solid rgba(255,255,255,0.15);object-fit:cover;">
<div style="text-align:left;font-size:0.55em;">
<div style="font-weight:700;font-size:1.2em;margin-bottom:8px;">Joel Niklaus</div>
<div style="display:flex;align-items:center;gap:8px;margin-bottom:6px;">
<svg width="18" height="18" viewBox="0 0 24 24" fill="rgba(255,255,255,0.7)"><path d="M20.447 20.452h-3.554v-5.569c0-1.328-.027-3.037-1.852-3.037-1.853 0-2.136 1.445-2.136 2.939v5.667H9.351V9h3.414v1.561h.046c.477-.9 1.637-1.85 3.37-1.85 3.601 0 4.267 2.37 4.267 5.455v6.286zM5.337 7.433a2.062 2.062 0 01-2.063-2.065 2.064 2.064 0 112.063 2.065zm1.782 13.019H3.555V9h3.564v11.452zM22.225 0H1.771C.792 0 0 .774 0 1.729v20.542C0 23.227.792 24 1.771 24h20.451C23.2 24 24 23.227 24 22.271V1.729C24 .774 23.2 0 22.222 0h.003z"/></svg>
<a href="https://linkedin.com/in/joelniklaus" target="_blank" style="color:rgba(255,255,255,0.7);text-decoration:none;">joelniklaus</a>
</div>
<div style="display:flex;align-items:center;gap:8px;">
<svg width="18" height="18" viewBox="0 0 24 24" fill="rgba(255,255,255,0.7)"><path d="M18.244 2.25h3.308l-7.227 8.26 8.502 11.24H16.17l-5.214-6.817L4.99 21.75H1.68l7.73-8.835L1.254 2.25H8.08l4.713 6.231zm-1.161 17.52h1.833L7.084 4.126H5.117z"/></svg>
<a href="https://x.com/joelniklaus" target="_blank" style="color:rgba(255,255,255,0.7);text-decoration:none;">@joelniklaus</a>
</div>
</div>
</div>
<p style="margin-top:24px;font-size:0.55em;color:rgba(255,255,255,0.4);">
Stay tuned for the blog post with many more details.
</p>
<aside class="notes">
Q&A time. Mention they can reach out on LinkedIn or X. Have the blog open in a browser tab
for live demos if questions come up.
</aside>
</section>
</div><!-- /slides -->
</div><!-- /reveal -->
<script src="https://cdn.jsdelivr.net/npm/reveal.js@5.1.0/dist/reveal.js"></script>
<script src="https://cdn.jsdelivr.net/npm/reveal.js@5.1.0/plugin/notes/notes.js"></script>
<script>
Reveal.initialize({
hash: true,
slideNumber: 'c/t',
showSlideNumber: 'speaker',
transition: 'fade',
transitionSpeed: 'fast',
center: false,
width: 1200,
height: 700,
margin: 0.06,
plugins: [RevealNotes],
});
</script>
</body>
</html>