Commit 18b5be6
Parent(s): 33e5bb8
renamed chapter
app/src/content/chapters/6-finephrase.mdx
CHANGED
@@ -7,7 +7,7 @@ import Wide from "../../components/Wide.astro";
 import datasetCardImg from "../assets/image/auto-dataset-card.png";
 import finephraseProgressImg from "../assets/image/finephrase-progress.png";
 
-##
+## Building FinePhrase
 
 With the experiments done and the infrastructure battle-tested, it's time to put everything together. We take our findings and build [FinePhrase](https://huggingface.co/datasets/HuggingFaceFW/finephrase), a large-scale synthetic dataset that rephrases 339 million documents from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (sample-350BT) into four structured formats, producing 1.35 billion samples and 486 billion completion tokens of synthetic pretraining data.
 