joelniklaus HF Staff committed on
Commit 5779b6e · 1 Parent(s): 8c41d9b

reformulate to make formulation about prompts from beyondweb clearer

app/src/content/chapters/3-experiments.mdx CHANGED
@@ -55,7 +55,7 @@ Nemotron-HQ-Synth and REWIRE are both mixes of several prompts. So what's actual
 
 #### Which Individual Prompts Match DCLM?
 
-We isolate each prompt from Nemotron-HQ-Synth ([diverse_qa_pairs](#diverse_qa_pairs), [extract_knowledge](#extract_knowledge), [distill](#distill), [wikipedia_style_rephrasing](#wikipedia_style_rephrasing), [knowledge_list](#knowledge_list)), the REWIRE [guided_rewrite](#guided_rewrite_original) prompt, and the two prompts from BeyondWeb [@beyondweb] ([continue](#continue), [summarize](#summarize)), all using Gemma-3-1B on FineWeb-Edu-HQ as source:
+We isolate each prompt from Nemotron-HQ-Synth ([diverse_qa_pairs](#diverse_qa_pairs), [extract_knowledge](#extract_knowledge), [distill](#distill), [wikipedia_style_rephrasing](#wikipedia_style_rephrasing), [knowledge_list](#knowledge_list)), the REWIRE [guided_rewrite](#guided_rewrite_original) prompt, and the two baseline prompts from BeyondWeb [@beyondweb] ([continue](#continue), [summarize](#summarize)), all using Gemma-3-1B on FineWeb-Edu-HQ as source:
 
 <Sidenote>
 The BeyondWeb dataset was never released and the paper omits key details, yet claims strong performance. We tested their [continue](#continue) and [summarize](#summarize) prompts to verify those claims and make the knowledge publicly available.
@@ -82,7 +82,7 @@ The BeyondWeb dataset was never released and the paper omits key details, yet cl
 }}
 />
 
-On aggregate, only [diverse_qa_pairs](#diverse_qa_pairs) and REWIRE's [guided_rewrite](#guided_rewrite_original) match DCLM. The BeyondWeb-inspired [continue](#continue) and [summarize](#summarize) prompts don't reach DCLM level. So out of all the prompts from prior work, only two actually match our baseline. That's a pretty underwhelming hit rate.
+On aggregate, only [diverse_qa_pairs](#diverse_qa_pairs) and REWIRE's [guided_rewrite](#guided_rewrite_original) match DCLM. The BeyondWeb [continue](#continue) and [summarize](#summarize) baseline prompts don't reach DCLM level. So out of all the prompts from prior work, only two actually match our baseline. That's a pretty underwhelming hit rate.
 
 But the aggregate hides a striking pattern. Switch to individual benchmarks with the dropdown and you'll see that DCLM dominates on HellaSwag and PIQA (commonsense reasoning), beating every single synthetic prompt. Meanwhile, almost all synthetic prompts comfortably beat DCLM on ARC (science knowledge) and SQuAD (reading comprehension). Rephrasing is essentially trading commonsense reasoning for factual recall. The aggregate score papers over this because gains on one side roughly cancel losses on the other. Keep an eye on this trade-off as you read on: it explains why mixing in original data matters, why DCLM is the best mix-in, and why synthetic-only training underperforms.
88