Commit · e46c12a
Parent(s): 462a612
reordered finephrase section to end with the dataset
app/src/content/chapters/6-finephrase.mdx
CHANGED
@@ -104,18 +104,6 @@ datacard_pipeline = [
 ]
 ```
 
-### What's in the Dataset?
-
-<FigRef target="finephrase-explorer" /> lets you browse real examples from FinePhrase. Each sample shows the original FineWeb-Edu source document alongside all four rephrased versions. Navigate through samples to see how the same web document becomes a FAQ, a math problem, a structured table, and a step-by-step tutorial.
-
-<Wide>
-<HtmlEmbed
-id="finephrase-explorer"
-src="finephrase-explorer.html"
-caption="Browse real examples from the FinePhrase dataset. Each sample shows the original source document alongside all four rephrased versions (FAQ, Math, Table, Tutorial). Use the arrows or Random button to navigate between samples."
-/>
-</Wide>
-
 ### Improvements to DataTrove
 
 Building FinePhrase was not just about running inference at scale. It required hardening DataTrove's inference pipeline to handle the realities of processing 339 million documents across 100 parallel workers over two weeks. Every failure mode you can imagine showed up: documents that crash the model, workers racing to commit to the same repo, Slurm jobs dying on startup, and caches corrupting under contention. We merged over a dozen PRs to make this work. Here are the most impactful ones.
@@ -171,3 +159,15 @@ SlurmPipelineExecutor(
 ...
 )
 ```
+
+### What's in the Dataset?
+
+<FigRef target="finephrase-explorer" /> lets you browse real examples from FinePhrase. Each sample shows the original FineWeb-Edu source document alongside all four rephrased versions. Navigate through samples to see how the same web document becomes a FAQ, a math problem, a structured table, and a step-by-step tutorial.
+
+<Wide>
+<HtmlEmbed
+id="finephrase-explorer"
+src="finephrase-explorer.html"
+caption="Browse real examples from the FinePhrase dataset. Each sample shows the original source document alongside all four rephrased versions (FAQ, Math, Table, Tutorial). Use the arrows or Random button to navigate between samples."
+/>
+</Wide>