joelniklaus HF Staff committed on
Commit e46c12a · 1 Parent(s): 462a612

reordered finephrase section to end with the dataset

app/src/content/chapters/6-finephrase.mdx CHANGED
@@ -104,18 +104,6 @@ datacard_pipeline = [
  ]
  ```
 
- ### What's in the Dataset?
-
- <FigRef target="finephrase-explorer" /> lets you browse real examples from FinePhrase. Each sample shows the original FineWeb-Edu source document alongside all four rephrased versions. Navigate through samples to see how the same web document becomes a FAQ, a math problem, a structured table, and a step-by-step tutorial.
-
- <Wide>
- <HtmlEmbed
- id="finephrase-explorer"
- src="finephrase-explorer.html"
- caption="Browse real examples from the FinePhrase dataset. Each sample shows the original source document alongside all four rephrased versions (FAQ, Math, Table, Tutorial). Use the arrows or Random button to navigate between samples."
- />
- </Wide>
-
  ### Improvements to DataTrove
 
  Building FinePhrase was not just about running inference at scale. It required hardening DataTrove's inference pipeline to handle the realities of processing 339 million documents across 100 parallel workers over two weeks. Every failure mode you can imagine showed up: documents that crash the model, workers racing to commit to the same repo, Slurm jobs dying on startup, and caches corrupting under contention. We merged over a dozen PRs to make this work. Here are the most impactful ones.
@@ -171,3 +159,15 @@ SlurmPipelineExecutor(
  ...
  )
  ```
+
+ ### What's in the Dataset?
+
+ <FigRef target="finephrase-explorer" /> lets you browse real examples from FinePhrase. Each sample shows the original FineWeb-Edu source document alongside all four rephrased versions. Navigate through samples to see how the same web document becomes a FAQ, a math problem, a structured table, and a step-by-step tutorial.
+
+ <Wide>
+ <HtmlEmbed
+ id="finephrase-explorer"
+ src="finephrase-explorer.html"
+ caption="Browse real examples from the FinePhrase dataset. Each sample shows the original source document alongside all four rephrased versions (FAQ, Math, Table, Tutorial). Use the arrows or Random button to navigate between samples."
+ />
+ </Wide>