Commit · 23fd567
Parent(s): 1b2a671
fix trinity links
app/src/content/bibliography.bib
CHANGED
@@ -230,14 +230,20 @@
 url = {https://arxiv.org/abs/2412.08905}
 }
 
-@misc{
-title = {
-author = {
+@misc{arceetrinitymanifesto,
+title = {The Trinity Manifesto},
+author = {Lucas Atkins},
 year = {2025},
-
-
-
-
+note = {Blog post},
+url = {https://www.arcee.ai/blog/the-trinity-manifesto}
+}
+
+@misc{arceetrinitylarge,
+title = {Trinity Large},
+author = {Lucas Atkins},
+year = {2026},
+note = {Blog post},
+url = {https://www.arcee.ai/blog/trinity-large}
 }
 
 @misc{llama3,
app/src/content/chapters/1-introduction.mdx
CHANGED
@@ -31,7 +31,7 @@ Reading time: One weekend
 }}
 />
 
-If you read some of the latest LLM papers (e.g., Nemotron 3 [@nemotron3], Qwen3 [@qwen3], Phi-4 [@phi4], Arcee Trinity [@
+If you read some of the latest LLM papers (e.g., Nemotron 3 [@nemotron3], Qwen3 [@qwen3], Phi-4 [@phi4], Arcee Trinity [@arceetrinitymanifesto; @arceetrinitylarge]), you may have noticed that synthetic data has become a key component for LLM training data. It is quickly becoming one of the standard tools for building high quality datasets for LLM training. If we look back we can see several paradigm shifts for LLM data, especially for pretraining, and synthetic data is the natural latest step:
 
 - After training the first language models on small-ish datasets like Wikipedia, people started scaling up the pretraining corpora including more and more data from the web. Datasets like C4 [@c4] and The Pile [@thepile] pushed into hundreds of gigabytes. Then FineWeb [@fineweb] and DCLM [@datacomp] brought things to the trillion-token scale, covering most of the crawlable web.
 - When approaching the scaling limits of web data, the discussion shifted from volume to quality. Researchers started with stronger heuristics and deduplication pipelines, then switched to neural classifiers looking for "educational" or "instruction-like" data. FineWeb-Edu used Llama 3 70B [@llama3] to score educational quality, DCLM used model-based filtering to train a 7B model to 64% MMLU with 2.6T tokens. With higher quality data, some repetitions seemed fine.
@@ -76,5 +76,4 @@ Want to learn how to make GPUs go brrr and generate synthetic tokens at scale li
 caption="Drag the slider to scale up GPUs and watch the tokens fly. By the end of this post, you'll know exactly how to set this up."
 />
 </Wide>
-
-Now let's start by defining what rephrasing actually means and laying out the design space.
+Now let's start by defining what rephrasing actually means and laying out the design space.