joelniklaus (HF Staff) committed
Commit 1b2a671 · Parent: 33a5dfc

updated synth results in the table

app/src/content/chapters/6-finephrase.mdx CHANGED
@@ -202,15 +202,15 @@ What makes this result especially compelling is the cost efficiency. Here is how
 
 <figure id="cost-efficiency">
 
-| Dataset           | Generator          | Tokens   | GPU Hours  | Tokens/GPU-Hour |
-|:------------------|:-------------------|---------:|-----------:|----------------:|
-| Cosmopedia        | Mixtral 8x7B       | 25B      | {'>'} 10K  | {'<'} 2.5M      |
-| SYNTH             | custom fine-tuned  | ~200B    | ~20K\*     | ~10M\*          |
-| REWIRE            | Llama-3.3 70B      | 400B     | **~352K**  | ~1.1M           |
-| Nemotron-CC       | Mistral NeMo 12B   | **1.9T** | n/a        | n/a             |
-| **FinePhrase**    | SmolLM2-1.7B       | 486B     | ~14.7K     | **~33.1M**      |
-
-<figcaption>Compute cost comparison across synthetic data generation projects. All GPU hours are H100. REWIRE hours extrapolated from their reported 88K per 100B tokens. \*SYNTH's 20K hours include both generation and model training, making their per-token rate an upper bound. Nemotron-CC did not report generation cost.</figcaption>
+| Dataset           | Generator          | Tokens   | GPU Hours  | Tokens/GPU-Hour |
+|:------------------|:-------------------|---------:|-----------:|----------------:|
+| Cosmopedia        | Mixtral 8x7B       | 25B      | {'>'} 10K  | {'<'} 2.5M      |
+| SYNTH             | custom fine-tuned  | 80B      | 4K         | 20M             |
+| REWIRE            | Llama-3.3 70B      | 400B     | **~352K**  | ~1.1M           |
+| Nemotron-CC       | Mistral NeMo 12B   | **1.9T** | n/a        | n/a             |
+| **FinePhrase**    | SmolLM2-1.7B       | 486B     | ~14.7K     | **~33.1M**      |
+
+<figcaption>Compute cost comparison across synthetic data generation projects. All GPU hours are H100. REWIRE hours extrapolated from their reported 88K per 100B tokens. Nemotron-CC did not report generation cost.</figcaption>
 </figure>
 
 FinePhrase achieves **~33M tokens per GPU hour**, roughly 30x more efficient than REWIRE and over 13x more than Cosmopedia. It generates more tokens than REWIRE while using 24x less compute, thanks to the combined payoff of a 1.7B model (vs 70B), optimized inference settings, and speculative decoding. The takeaway: you do not need large models for high-quality synthetic data generation.
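The throughput and ratio claims in the closing paragraph can be checked directly from the Tokens and GPU Hours columns. A minimal sketch (numbers copied from the table above; the dictionary name is illustrative, not part of the FinePhrase codebase):

```python
# Reproduce the Tokens/GPU-Hour column from the table's raw figures.
# (tokens, H100 GPU hours) per dataset; Cosmopedia's hours are a lower bound.
throughput_inputs = {
    "Cosmopedia": (25e9, 10e3),
    "REWIRE":     (400e9, 352e3),
    "FinePhrase": (486e9, 14.7e3),
}

rates = {name: tokens / hours for name, (tokens, hours) in throughput_inputs.items()}
for name, rate in rates.items():
    print(f"{name}: ~{rate / 1e6:.1f}M tokens/GPU-hour")

# Efficiency ratio of FinePhrase over REWIRE (the text rounds this to ~30x),
# and the compute ratio (352K vs 14.7K GPU hours, i.e. ~24x less compute).
print(f"efficiency ratio: {rates['FinePhrase'] / rates['REWIRE']:.0f}x")
print(f"compute ratio:    {352e3 / 14.7e3:.0f}x")
```

Running it recovers ~33.1M tokens/GPU-hour for FinePhrase versus ~1.1M for REWIRE, consistent with the rounded figures in the paragraph.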