joelniklaus HF Staff committed on
Commit 23fd567 · 1 Parent(s): 1b2a671

fix trinity links

app/src/content/bibliography.bib CHANGED
@@ -230,14 +230,20 @@
   url = {https://arxiv.org/abs/2412.08905}
 }
 
-@misc{arceetrinity,
-  title = {Arcee Trinity Large Technical Report},
-  author = {{Arcee AI}},
+@misc{arceetrinitymanifesto,
+  title = {The Trinity Manifesto},
+  author = {Lucas Atkins},
   year = {2025},
-  eprint = {2512.04695},
-  archiveprefix = {arXiv},
-  primaryclass = {cs.LG},
-  url = {https://arxiv.org/abs/2512.04695}
+  note = {Blog post},
+  url = {https://www.arcee.ai/blog/the-trinity-manifesto}
+}
+
+@misc{arceetrinitylarge,
+  title = {Trinity Large},
+  author = {Lucas Atkins},
+  year = {2026},
+  note = {Blog post},
+  url = {https://www.arcee.ai/blog/trinity-large}
 }
 
 @misc{llama3,
app/src/content/chapters/1-introduction.mdx CHANGED
@@ -31,7 +31,7 @@ Reading time: One weekend
   }}
 />
 
-If you read some of the latest LLM papers (e.g., Nemotron 3 [@nemotron3], Qwen3 [@qwen3], Phi-4 [@phi4], Arcee Trinity [@arceetrinity]), you may have noticed that synthetic data has become a key component for LLM training data. It is quickly becoming one of the standard tools for building high quality datasets for LLM training. If we look back we can see several paradigm shifts for LLM data, especially for pretraining, and synthetic data is the natural latest step:
+If you read some of the latest LLM papers (e.g., Nemotron 3 [@nemotron3], Qwen3 [@qwen3], Phi-4 [@phi4], Arcee Trinity [@arceetrinitymanifesto; @arceetrinitylarge]), you may have noticed that synthetic data has become a key component for LLM training data. It is quickly becoming one of the standard tools for building high quality datasets for LLM training. If we look back we can see several paradigm shifts for LLM data, especially for pretraining, and synthetic data is the natural latest step:
 
 - After training the first language models on small-ish datasets like Wikipedia, people started scaling up the pretraining corpora including more and more data from the web. Datasets like C4 [@c4] and The Pile [@thepile] pushed into hundreds of gigabytes. Then FineWeb [@fineweb] and DCLM [@datacomp] brought things to the trillion-token scale, covering most of the crawlable web.
 - When approaching the scaling limits of web data, the discussion shifted from volume to quality. Researchers started with stronger heuristics and deduplication pipelines, then switched to neural classifiers looking for "educational" or "instruction-like" data. FineWeb-Edu used Llama 3 70B [@llama3] to score educational quality, DCLM used model-based filtering to train a 7B model to 64% MMLU with 2.6T tokens. With higher quality data, some repetitions seemed fine.
@@ -76,5 +76,4 @@ Want to learn how to make GPUs go brrr and generate synthetic tokens at scale li
   caption="Drag the slider to scale up GPUs and watch the tokens fly. By the end of this post, you'll know exactly how to set this up."
 />
 </Wide>
-
-Now let's start by defining what rephrasing actually means and laying out the design space.
+Now let's start by defining what rephrasing actually means and laying out the design space.
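The bug this commit fixes — an MDX chapter citing a key (`arceetrinity`) that no longer exists in `bibliography.bib` — is easy to catch automatically. Below is a minimal sketch of such a check; it is not part of this repository, and the `[@key]` / `[@a; @b]` citation syntax it parses is assumed from the diff above. The function and its regexes are illustrative, not an existing tool.

```python
import re


def unresolved_keys(bib_text, mdx_texts):
    """Return citation keys cited in MDX sources but absent from the .bib file."""
    # Collect entry keys, e.g. "arceetrinitylarge" from "@misc{arceetrinitylarge,".
    bib_keys = set(re.findall(r"@\w+\{([^,\s]+)\s*,", bib_text))
    missing = set()
    for text in mdx_texts:
        # Matches both single [@key] and multi-cite [@a; @b] groups.
        for group in re.findall(r"\[@([^\]]+)\]", text):
            for key in group.split(";"):
                key = key.strip().lstrip("@")
                if key not in bib_keys:
                    missing.add(key)
    return missing
```

Run over `app/src/content/bibliography.bib` and the files under `app/src/content/chapters/`, a non-empty result (e.g. `{"arceetrinity"}` before this commit) could fail a pre-commit hook or CI step, so stale keys never reach the rendered page.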