finephrase

joelniklaus HF Staff committed on Mar 6

Commit

18b5be6

1 Parent(s): 33e5bb8

renamed chapter

Files changed (1)

app/src/content/chapters/6-finephrase.mdx CHANGED

@@ -7,7 +7,7 @@ import Wide from "../../components/Wide.astro";
 import datasetCardImg from "../assets/image/auto-dataset-card.png";
 import finephraseProgressImg from "../assets/image/finephrase-progress.png";
-## Applying the Recipe at Scale
+## Building FinePhrase
 With the experiments done and the infrastructure battle-tested, it's time to put everything together. We take our findings and build [FinePhrase](https://huggingface.co/datasets/HuggingFaceFW/finephrase), a large-scale synthetic dataset that rephrases 339 million documents from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (sample-350BT) into four structured formats, producing 1.35 billion samples and 486 billion completion tokens of synthetic pretraining data.