joelniklaus (HF Staff) committed
Commit e0ccb24 · Parent(s): f7eff68

add todos and rephrased conclusion paragraph

app/src/content/chapters/3-experiments.mdx CHANGED
@@ -10,6 +10,10 @@ import FigRef from "../../components/FigRef.astro";
 {/* TODO: Integrate decay experiment as another analysis for proxy */}
 {/* TODO: share on a bunch of discords/slacks/hackernews/locallama */}
 {/* TODO: brainstorm better banner, be artsy */}
+{/* TODO: run variance experiments with pretraining from scratch */}
+{/* TODO: run scaling experiments with longer pretraining phase */}
+{/* TODO: filter docs before/after rephrasing (non-mathematical document for math prompt) */}
+{/* TODO: try multiple rollouts and scoring */}
 {/* TODO: banner idea: 1T tokens = 8M books
 5cm per book = 400km

@@ -599,4 +603,4 @@ Here are the key takeaways from our experiments:
 - **Q: Do typos in the prompt hurt?**<br/>
 A: No. Typos have no negative effect on downstream performance.

-The bottom line: the details of synthetic rephrasing matter a lot, and knowing which ones matter is the key to scaling it up. Prompt design is the single biggest lever, with structured formats like Math, Table, FAQ, and Tutorial consistently beating curated baselines. But equally important is knowing where you can cut corners without losing quality. You don't need a large rephrasing model (1B is enough for simple prompts, 4B for complex ones). You don't need pristine source data (even low-quality sources work with a strong mix-in). Smaller models generate faster, directly translating into higher throughput. And tolerating lower-quality sources opens up a much bigger and more diverse data pool to draw from. The practical recipe is straightforward: pick a strong structured prompt, use the smallest model that handles it, blend with high-quality original data, and spend your remaining compute on volume.
+So what actually matters? Prompt design, above all else. Structured formats like Math, Table, FAQ, and Tutorial consistently beat curated baselines. Everything else is surprisingly forgiving. A 1B model handles simple prompts just fine, 4B covers the complex ones, and going bigger buys you nothing. Source data quality barely matters either, as long as you mix in strong original data. That last point is worth emphasizing: low-quality sources with a good mix-in match high-quality sources, which means you can draw from a much larger and more diverse data pool. The recipe we landed on is simple: pick a structured prompt, use the smallest model that handles it, blend with high-quality original data, and pour the saved compute into volume.
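
The recipe summarized in the new conclusion paragraph (structured prompt, smallest model that handles it, mix-in of high-quality originals) can be sketched roughly as below. This is an illustrative assumption, not code from the repository: `call_small_model`, `STRUCTURED_PROMPT`, and the default mix-in ratio are hypothetical stand-ins for whatever rephrasing model and blend the experiments actually used.

```python
import random

# Hypothetical stand-in for a structured rephrasing prompt (e.g. FAQ format).
STRUCTURED_PROMPT = (
    "Rewrite the following document as an FAQ with question/answer pairs:\n\n{doc}"
)


def call_small_model(prompt: str) -> str:
    # Placeholder: in practice this would query a ~1B-parameter model
    # (or ~4B for more complex prompts such as Math or Tutorial).
    return "Q: ...\nA: ..."


def build_training_mix(low_quality_docs, high_quality_docs,
                       mix_in_ratio=0.5, seed=0):
    """Rephrase low-quality docs, then blend in original high-quality data.

    mix_in_ratio is the target fraction of original (non-rephrased)
    documents in the final mix.
    """
    rephrased = [
        call_small_model(STRUCTURED_PROMPT.format(doc=d))
        for d in low_quality_docs
    ]
    # Number of originals needed so they make up mix_in_ratio of the blend.
    n_original = int(len(rephrased) * mix_in_ratio / (1 - mix_in_ratio))
    rng = random.Random(seed)
    originals = rng.sample(high_quality_docs,
                           min(n_original, len(high_quality_docs)))
    mix = rephrased + originals
    rng.shuffle(mix)
    return mix
```

With `mix_in_ratio=0.5` the blend is half rephrased, half original; the remaining compute budget then goes into sheer volume rather than a larger rephrasing model.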