joelniklaus HF Staff committed on
Commit 23fd567 · 1 Parent(s): 1b2a671

fix trinity links

app/src/content/bibliography.bib CHANGED
@@ -230,14 +230,20 @@
   url = {https://arxiv.org/abs/2412.08905}
 }
 
-@misc{arceetrinity,
-  title = {Arcee Trinity Large Technical Report},
-  author = {{Arcee AI}},
+@misc{arceetrinitymanifesto,
+  title = {The Trinity Manifesto},
+  author = {Lucas Atkins},
   year = {2025},
-  eprint = {2512.04695},
-  archiveprefix = {arXiv},
-  primaryclass = {cs.LG},
-  url = {https://arxiv.org/abs/2512.04695}
+  note = {Blog post},
+  url = {https://www.arcee.ai/blog/the-trinity-manifesto}
+}
+
+@misc{arceetrinitylarge,
+  title = {Trinity Large},
+  author = {Lucas Atkins},
+  year = {2026},
+  note = {Blog post},
+  url = {https://www.arcee.ai/blog/trinity-large}
 }
 
 @misc{llama3,
app/src/content/chapters/1-introduction.mdx CHANGED
@@ -31,7 +31,7 @@ Reading time: One weekend
   }}
 />
 
-If you read some of the latest LLM papers (e.g., Nemotron 3 [@nemotron3], Qwen3 [@qwen3], Phi-4 [@phi4], Arcee Trinity [@arceetrinity]), you may have noticed that synthetic data has become a key component for LLM training data. It is quickly becoming one of the standard tools for building high quality datasets for LLM training. If we look back we can see several paradigm shifts for LLM data, especially for pretraining, and synthetic data is the natural latest step:
+If you read some of the latest LLM papers (e.g., Nemotron 3 [@nemotron3], Qwen3 [@qwen3], Phi-4 [@phi4], Arcee Trinity [@arceetrinitymanifesto; @arceetrinitylarge]), you may have noticed that synthetic data has become a key component for LLM training data. It is quickly becoming one of the standard tools for building high quality datasets for LLM training. If we look back we can see several paradigm shifts for LLM data, especially for pretraining, and synthetic data is the natural latest step:
 
 - After training the first language models on small-ish datasets like Wikipedia, people started scaling up the pretraining corpora including more and more data from the web. Datasets like C4 [@c4] and The Pile [@thepile] pushed into hundreds of gigabytes. Then FineWeb [@fineweb] and DCLM [@datacomp] brought things to the trillion-token scale, covering most of the crawlable web.
 - When approaching the scaling limits of web data, the discussion shifted from volume to quality. Researchers started with stronger heuristics and deduplication pipelines, then switched to neural classifiers looking for "educational" or "instruction-like" data. FineWeb-Edu used Llama 3 70B [@llama3] to score educational quality, DCLM used model-based filtering to train a 7B model to 64% MMLU with 2.6T tokens. With higher quality data, some repetitions seemed fine.
@@ -76,5 +76,4 @@ Want to learn how to make GPUs go brrr and generate synthetic tokens at scale li
   caption="Drag the slider to scale up GPUs and watch the tokens fly. By the end of this post, you'll know exactly how to set this up."
 />
 </Wide>
-
-Now let's start by defining what rephrasing actually means and laying out the design space.
+Now let's start by defining what rephrasing actually means and laying out the design space.
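The bug this commit fixes — an MDX chapter citing a key (`arceetrinity`) that no longer exists in `bibliography.bib` — is easy to catch automatically. Below is a minimal sketch of such a check; it is not part of this repository, and the `[@key]` / `[@a; @b]` citation syntax it parses is assumed from the diff above. The function and its regexes are illustrative, not an existing tool.

```python
import re


def unresolved_keys(bib_text, mdx_texts):
    """Return citation keys cited in MDX sources but absent from the .bib file."""
    # Collect entry keys, e.g. "arceetrinitylarge" from "@misc{arceetrinitylarge,".
    bib_keys = set(re.findall(r"@\w+\{([^,\s]+)\s*,", bib_text))
    missing = set()
    for text in mdx_texts:
        # Matches both single [@key] and multi-cite [@a; @b] groups.
        for group in re.findall(r"\[@([^\]]+)\]", text):
            for key in group.split(";"):
                key = key.strip().lstrip("@")
                if key not in bib_keys:
                    missing.add(key)
    return missing
```

Run over `app/src/content/bibliography.bib` and the files under `app/src/content/chapters/`, a non-empty result (e.g. `{"arceetrinity"}` before this commit) could fail a pre-commit hook or CI step, so stale keys never reach the rendered page.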