joelniklaus (HF Staff) committed
Commit 1b2a671 · Parent: 33a5dfc

updated synth results in the table

app/src/content/chapters/6-finephrase.mdx CHANGED
@@ -202,15 +202,15 @@ What makes this result especially compelling is the cost efficiency. Here is how
 
 <figure id="cost-efficiency">
 
-| Dataset           | Generator          | Tokens   | GPU Hours  | Tokens/GPU-Hour |
-|:------------------|:-------------------|---------:|-----------:|----------------:|
-| Cosmopedia        | Mixtral 8x7B       | 25B      | {'>'} 10K  | {'<'} 2.5M      |
-| SYNTH             | custom fine-tuned  | ~200B    | ~20K\*     | ~10M\*          |
-| REWIRE            | Llama-3.3 70B      | 400B     | **~352K**  | ~1.1M           |
-| Nemotron-CC       | Mistral NeMo 12B   | **1.9T** | n/a        | n/a             |
-| **FinePhrase**    | SmolLM2-1.7B       | 486B     | ~14.7K     | **~33.1M**      |
-
-<figcaption>Compute cost comparison across synthetic data generation projects. All GPU hours are H100. REWIRE hours extrapolated from their reported 88K per 100B tokens. \*SYNTH's 20K hours include both generation and model training, making their per-token rate an upper bound. Nemotron-CC did not report generation cost.</figcaption>
+| Dataset           | Generator          | Tokens   | GPU Hours  | Tokens/GPU-Hour |
+|:------------------|:-------------------|---------:|-----------:|----------------:|
+| Cosmopedia        | Mixtral 8x7B       | 25B      | {'>'} 10K  | {'<'} 2.5M      |
+| SYNTH             | custom fine-tuned  | 80B      | 4K         | 20M             |
+| REWIRE            | Llama-3.3 70B      | 400B     | **~352K**  | ~1.1M           |
+| Nemotron-CC       | Mistral NeMo 12B   | **1.9T** | n/a        | n/a             |
+| **FinePhrase**    | SmolLM2-1.7B       | 486B     | ~14.7K     | **~33.1M**      |
+
+<figcaption>Compute cost comparison across synthetic data generation projects. All GPU hours are H100. REWIRE hours extrapolated from their reported 88K per 100B tokens. Nemotron-CC did not report generation cost.</figcaption>
 </figure>
 
 FinePhrase achieves **~33M tokens per GPU hour**, roughly 30x more efficient than REWIRE and over 13x more than Cosmopedia. It generates more tokens than REWIRE while using 24x less compute, thanks to the combined payoff of a 1.7B model (vs 70B), optimized inference settings, and speculative decoding. The takeaway: you do not need large models for high-quality synthetic data generation.
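The throughput and ratio claims in the closing paragraph can be checked directly from the Tokens and GPU Hours columns. A minimal sketch (numbers copied from the table above; the dictionary name is illustrative, not part of the FinePhrase codebase):

```python
# Reproduce the Tokens/GPU-Hour column from the table's raw figures.
# (tokens, H100 GPU hours) per dataset; Cosmopedia's hours are a lower bound.
throughput_inputs = {
    "Cosmopedia": (25e9, 10e3),
    "REWIRE":     (400e9, 352e3),
    "FinePhrase": (486e9, 14.7e3),
}

rates = {name: tokens / hours for name, (tokens, hours) in throughput_inputs.items()}
for name, rate in rates.items():
    print(f"{name}: ~{rate / 1e6:.1f}M tokens/GPU-hour")

# Efficiency ratio of FinePhrase over REWIRE (the text rounds this to ~30x),
# and the compute ratio (352K vs 14.7K GPU hours, i.e. ~24x less compute).
print(f"efficiency ratio: {rates['FinePhrase'] / rates['REWIRE']:.0f}x")
print(f"compute ratio:    {352e3 / 14.7e3:.0f}x")
```

Running it recovers ~33.1M tokens/GPU-hour for FinePhrase versus ~1.1M for REWIRE, consistent with the rounded figures in the paragraph.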