joelniklaus (HF Staff) committed
Commit 2b72bd7 · 1 Parent(s): a99a2cf

improved phrasing and banner
app/src/content/chapters/1-introduction.mdx CHANGED
@@ -52,9 +52,9 @@ During SmolLM2 [@smollm2] training, the model was decent at coding and math but
 
 However, how to do synthetic data generation properly still resembles alchemy these days: Which model should you use? Which prompts work best and how many do you need? And how do you even scale this effectively?
 
-Here's the plan:
+Our goal is to turn this alchemy into chemistry: replace intuition with systematic, reproducible experiments. Here's how we go about it:
 <Sidenote>
-The sections are fairly self-contained, so feel free to jump around and skip whatever seems less interesting to you.
+Lavoisier replaced phlogiston theory with precise measurements and repeatable experiments, earning him the title "father of modern chemistry".
 </Sidenote>
 
 We start by [setting up the problem](#rephrasing-the-web): what rephrasing is, which approaches exist, and what we want to test. Then we dive into the 90 [Experiments](#experiments) we ran to figure out which prompts, models, and datasets actually work. The [Analyses](#analyses) section zooms out to ask *why* things work the way they do. Next comes the [Infrastructure](#infrastructure) that made all of this possible, including detailed throughput benchmarking of popular models (super important for getting the most data for your bucks). Finally, we [put it all together](#applying-the-recipe-at-scale) into FinePhrase, our best configuration.
app/src/content/embeds/banner.html CHANGED
@@ -685,7 +685,7 @@
   .attr('fill', isDark ? 'rgba(255,255,255,0.4)' : 'rgba(0,0,0,0.38)')
   .attr('font-size', subFS).attr('font-weight', 500)
   .attr('letter-spacing', '0.14em')
-  .text(`${numExperiments} EXPERIMENTS \u00B7 ${totalDocsB}B DOCUMENTS`));
+  .text(`${numExperiments} EXPERIMENTS \u00B7 ${totalDocsB}B DOCUMENTS \u00B7 1 PAGE \u2248 100M TOKENS`));
 
   // Legend
   const familyCounts = {};
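
For reference, `\u00B7` is a middle dot ("·") and `\u2248` is an approximately-equals sign ("≈"), so the new subtitle reads as a dot-separated label. A minimal standalone sketch of the interpolation below — the values are hypothetical (`numExperiments = 90` matches the 90 experiments mentioned in the introduction; `totalDocsB` is an illustrative placeholder, as the real values come from the banner's data):

```javascript
// Hypothetical stand-ins for the values computed in banner.html.
const numExperiments = 90;   // the introduction mentions 90 experiments
const totalDocsB = 1.2;      // illustrative placeholder, in billions

// \u00B7 renders as "·" (middle dot); \u2248 renders as "≈".
const label = `${numExperiments} EXPERIMENTS \u00B7 ${totalDocsB}B DOCUMENTS \u00B7 1 PAGE \u2248 100M TOKENS`;

console.log(label); // → "90 EXPERIMENTS · 1.2B DOCUMENTS · 1 PAGE ≈ 100M TOKENS"
```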