Spaces:
Running on CPU Upgrade
Commit 1b2a671 · Parent(s): 33a5dfc
updated synth results in the table
app/src/content/chapters/6-finephrase.mdx CHANGED
@@ -202,15 +202,15 @@ What makes this result especially compelling is the cost efficiency. Here is how
 
 <figure id="cost-efficiency">
 
-| Dataset
-|:----------------
-| Cosmopedia
-| SYNTH
-| REWIRE
-| Nemotron-CC
-| **FinePhrase**
-
-<figcaption>Compute cost comparison across synthetic data generation projects. All GPU hours are H100. REWIRE hours extrapolated from their reported 88K per 100B tokens.
+| Dataset        | Generator         | Tokens   | GPU Hours | Tokens/GPU-Hour |
+|:---------------|:------------------|---------:|----------:|----------------:|
+| Cosmopedia     | Mixtral 8x7B      | 25B      | {'>'} 10K | {'<'} 2.5M      |
+| SYNTH          | custom fine-tuned | 80B      | 4K        | 20M             |
+| REWIRE         | Llama-3.3 70B     | 400B     | **~352K** | ~1.1M           |
+| Nemotron-CC    | Mistral NeMo 12B  | **1.9T** | n/a       | n/a             |
+| **FinePhrase** | SmolLM2-1.7B      | 486B     | ~14.7K    | **~33.1M**      |
+
+<figcaption>Compute cost comparison across synthetic data generation projects. All GPU hours are H100. REWIRE hours extrapolated from their reported 88K per 100B tokens. Nemotron-CC did not report generation cost.</figcaption>
 </figure>
 
 FinePhrase achieves **~33M tokens per GPU hour**, roughly 30x more efficient than REWIRE and over 13x more than Cosmopedia. It generates more tokens than REWIRE while using 24x less compute, thanks to the combined payoff of a 1.7B model (vs 70B), optimized inference settings, and speculative decoding. The takeaway: you do not need large models for high-quality synthetic data generation.
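The efficiency ratios in the closing paragraph follow directly from the table. As a quick sanity check, the arithmetic can be sketched like this (values copied from the table; Cosmopedia's "{'>'} 10K" hours is treated as exactly 10K, which gives an upper bound on its rate, and Nemotron-CC is omitted since its cost was not reported):

```python
# Sanity-check the tokens-per-GPU-hour figures and the relative
# efficiency claims using the numbers from the table above.

projects = {
    # name: (tokens generated, H100 GPU hours)
    "Cosmopedia": (25e9, 10_000),   # ">10K" hours -> rate is an upper bound
    "SYNTH": (80e9, 4_000),
    "REWIRE": (400e9, 352_000),     # extrapolated from 88K hours per 100B tokens
    "FinePhrase": (486e9, 14_700),
}

rate = {name: tokens / hours for name, (tokens, hours) in projects.items()}

print(f"FinePhrase: {rate['FinePhrase'] / 1e6:.1f}M tokens/GPU-hour")
print(f"vs REWIRE: {rate['FinePhrase'] / rate['REWIRE']:.1f}x more efficient")
print(f"vs Cosmopedia (at least): {rate['FinePhrase'] / rate['Cosmopedia']:.1f}x")
print(f"REWIRE / FinePhrase compute: {352_000 / 14_700:.1f}x")
```

Running this reproduces the paragraph's claims: ~33.1M tokens/GPU-hour, roughly 30x REWIRE's rate, over 13x Cosmopedia's, and about 24x less compute than REWIRE.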