/>

</Wide>

### How Does FinePhrase Compare?

In the introduction we showed a single FinePhrase prompt (table) against the baselines. Now that the full dataset is built, here's how all four FinePhrase prompts stack up against the strongest synthetic data baselines:

<HtmlEmbed
  id="finephrase-all-prompts"
  src="d3-benchmark-comparison.html"
  desc="All four FinePhrase prompts compared against synthetic data baselines across evaluation metrics."
  config={{
    defaultView: "line",
    datasets: {
      "mix-fw_edu_hq-table_smollm2_1.7b_hq": { display: "FinePhrase (table)", color: "#EBA937" },
      "mix-fw_edu_hq-math_smollm2_1.7b_hq": { display: "FinePhrase (math)", color: "#D4892A" },
      "mix-fw_edu_hq-faq_smollm2_1.7b_hq": { display: "FinePhrase (faq)", color: "#BD6E1E" },
      "mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": { display: "FinePhrase (tutorial)", color: "#A65514" },
      cosmopedia: "Cosmopedia",
      nemotron_hq_synth: "Nemotron-HQ-Synth",
      rewire: "REWIRE",
      synth_query_reasoning_answer: "SYNTH"
    }
  }}
/>

All four FinePhrase prompts outperform every synthetic baseline by a clear margin. Table and math lead the pack, with FAQ and tutorial close behind. The per-benchmark breakdown (switch with the dropdown above) tells a familiar story: FinePhrase prompts dominate on ARC, SQuAD, and DROP (knowledge and reading comprehension), while the baselines hold a slight edge on HellaSwag and PIQA (commonsense). This is the same commonsense-vs-knowledge trade-off we observed throughout the experiments, and it's exactly why FinePhrase is designed to be mixed with original data rather than used alone. The aggregate wins because the knowledge gains far outweigh the commonsense losses.
What makes this result especially compelling is the cost efficiency. Here is how FinePhrase compares to other synthetic data projects:

<figure id="cost-efficiency">

| Dataset        | Generator         | Tokens   | GPU Hours | Tokens/GPU-Hour |
|:---------------|:------------------|---------:|----------:|----------------:|
| Cosmopedia     | Mixtral 8x7B      | 25B      | {'>'} 10K | {'<'} 2.5M      |
| SYNTH          | custom fine-tuned | ~200B    | ~20K\*    | ~10M\*          |
| REWIRE         | Llama-3.3 70B     | 400B     | **~352K** | ~1.1M           |
| Nemotron-CC    | Mistral NeMo 12B  | **1.9T** | n/a       | n/a             |
| **FinePhrase** | SmolLM2-1.7B      | 486B     | ~14.7K    | **~33.1M**      |

<figcaption>Compute cost comparison across synthetic data generation projects. All GPU hours are H100. REWIRE hours extrapolated from their reported 88K per 100B tokens. \*SYNTH's 20K hours include both generation and model training, making their per-token rate an upper bound. Nemotron-CC did not report generation cost.</figcaption>

</figure>

FinePhrase achieves **~33M tokens per GPU hour**, roughly 30x more efficient than REWIRE and over 13x more than Cosmopedia. It generates more tokens than REWIRE while using 24x less compute, thanks to the combined payoff of a 1.7B model (vs 70B), optimized inference settings, and speculative decoding. The takeaway: you do not need large models for high-quality synthetic data generation.
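The efficiency figures above come down to simple arithmetic. As a sanity check, here is a minimal sketch that reproduces them; the numbers are copied from the comparison table, and the function and variable names are illustrative, not part of the FinePhrase codebase:

```python
def tokens_per_gpu_hour(tokens: float, gpu_hours: float) -> float:
    """Generation throughput in tokens per H100 GPU-hour."""
    return tokens / gpu_hours

# (total tokens generated, H100 GPU-hours), from the table above
runs = {
    "FinePhrase": (486e9, 14_700),   # 486B tokens in ~14.7K hours
    "REWIRE":     (400e9, 352_000),  # ~352K hours, extrapolated from 88K per 100B tokens
    "Cosmopedia": (25e9, 10_000),    # ">10K" hours, so this rate is an upper bound
}

rates = {name: tokens_per_gpu_hour(t, h) for name, (t, h) in runs.items()}

print(f"FinePhrase: ~{rates['FinePhrase'] / 1e6:.1f}M tokens/GPU-hour")
print(f"vs REWIRE: ~{rates['FinePhrase'] / rates['REWIRE']:.0f}x more efficient")
print(f"vs Cosmopedia: ~{rates['FinePhrase'] / rates['Cosmopedia']:.0f}x more efficient")
```

Running this recovers the ~33M tokens/GPU-hour headline rate and the roughly 30x and 13x efficiency multiples quoted above.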
That's the full picture: 90 experiments, a battle-tested infrastructure, and 486 billion tokens of public synthetic data. Let's wrap up with what we learned and where to go next.