Commit a45cf8d · improve mermaid diagram
Parent(s): 230393c
app/src/content/chapters/infrastructure.mdx CHANGED
@@ -3,6 +3,7 @@ import HtmlEmbed from "../../components/HtmlEmbed.astro";
 import Sidenote from "../../components/Sidenote.astro";
 import FigRef from "../../components/FigRef.astro";
 import Accordion from "../../components/Accordion.astro";
+import Wide from "../../components/Wide.astro";
 import datasetCardImg from "../assets/image/auto-dataset-card.png";
 
 ## Infrastructure
@@ -15,16 +16,18 @@ We made major extensions to [DataTrove](https://github.com/huggingface/datatrove
 
 In this section we show how DataTrove can be used to generate a billion tokens per hour across several model scales, ranging from 100 million to 1 trillion parameters. <FigRef target="datatrove-pipeline" /> gives an overview of the pipeline. Let's dive in!
 
+<Wide>
 <figure id="datatrove-pipeline">
 
 ```mermaid
+%%{init: {"flowchart": {"diagramPadding": 12, "padding": 12, "nodeSpacing": 22, "rankSpacing": 52, "subGraphTitleMargin": {"top": 8, "bottom": 24}}, "themeVariables": {"fontSize": "18px"}} }%%
 flowchart TB
     subgraph Input["📥 Input"]
         HF_IN["`HF Hub Dataset`"]
     end
 
     subgraph Pipeline["⚙️ DataTrove Pipeline"]
-        direction
+        direction TB
         READ["`**Read**
 HuggingFaceDatasetReader`"]
         TRANSFORM["`**Transform**
@@ -34,8 +37,16 @@ ParquetWriter`"]
     READ --> TRANSFORM --> WRITE
     end
 
+    subgraph Execution["🖥️ Execution Mode"]
+        direction TB
+        LOCAL["`**Local**
+single node, multi-GPU`"]
+        SLURM["`**Slurm Cluster**
+multi-node, auto-scaling`"]
+    end
+
     subgraph Inference["🚀 Inference Engine"]
-        direction
+        direction TB
         ROLLOUT["`**Custom Rollout**
 async callable`"]
         VLLM["`**vLLM / SGLang**
@@ -43,14 +54,6 @@ Server`"]
     ROLLOUT -- "generate(payload)" --> VLLM
     end
 
-    subgraph Execution["🖥️ Execution Mode"]
-        direction LR
-        LOCAL["`**Local**
-single node, multi-GPU`"]
-        SLURM["`**Slurm Cluster**
-multi-node, auto-scaling`"]
-    end
-
     subgraph Output["📤 Output"]
         direction LR
         HF_OUT["`HF Hub Dataset`"]
@@ -60,15 +63,17 @@ multi-node, auto-scaling`"]
     end
 
     HF_IN --> READ
+    HF_IN ~~~ LOCAL
     TRANSFORM --> ROLLOUT
     Pipeline --> Execution
     WRITE --> HF_OUT
     WRITE --> CARD
-
+    WRITE --> MONITOR
 ```
 
-<figcaption>Overview of the DataTrove synthetic data generation pipeline. Documents flow
+<figcaption>Overview of the DataTrove synthetic data generation pipeline. Documents flow through a three-stage pipeline (Read, Transform, Write), with the InferenceRunner dispatching rollout functions to vLLM/SGLang. The system supports local and Slurm-based execution with automatic upload and progress monitoring.</figcaption>
 </figure>
+</Wide>
 
 ### Generating synthetic data at scale
 
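The three-stage flow in the diagram above (Read → Transform → Write, with an async rollout callable dispatched to the inference engine) can be sketched in plain Python. This is an illustrative stand-in only, not DataTrove's actual API: `read`, `rollout`, `transform`, `write`, `fake_generate`, and `run_pipeline` are hypothetical names, and a real pipeline would wire up `HuggingFaceDatasetReader`, an `InferenceRunner`, and `ParquetWriter` against a vLLM/SGLang server.

```python
# Illustrative sketch of the Read -> Transform -> Write flow from the diagram.
# All names here are hypothetical stand-ins, not DataTrove's actual API.
import asyncio

def read(dataset):
    # Read stage: yield documents from an input dataset
    # (stand-in for HuggingFaceDatasetReader).
    for row in dataset:
        yield {"id": row["id"], "text": row["text"]}

async def rollout(payload, generate):
    # Custom rollout: an async callable handed to the inference engine
    # (stand-in for a generate(payload) call to vLLM/SGLang).
    return await generate(payload)

async def transform(docs, generate):
    # Transform stage: enrich each document with a model completion.
    return [{**doc, "completion": await rollout(doc["text"], generate)}
            for doc in docs]

def write(docs, sink):
    # Write stage: append records to an output sink
    # (stand-in for ParquetWriter).
    sink.extend(docs)
    return sink

async def fake_generate(payload):
    # Hypothetical local backend that just echoes the prompt; a real
    # setup would send the payload to a vLLM/SGLang server instead.
    return f"completion for: {payload}"

def run_pipeline(dataset):
    docs = read(dataset)
    records = asyncio.run(transform(docs, fake_generate))
    return write(records, [])
```

Running `run_pipeline([{"id": 1, "text": "hello"}])` yields one record with an added `completion` field; a real pipeline would issue the rollout calls concurrently rather than awaiting them one at a time.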
app/src/styles/components/_mermaid.css CHANGED
@@ -28,6 +28,13 @@
   fill: var(--text-color) !important;
 }
 
+#datatrove-pipeline .mermaid .cluster-label,
+#datatrove-pipeline .mermaid .cluster-label .nodeLabel,
+#datatrove-pipeline .mermaid .cluster-label .nodeLabel p {
+  font-size: 16px !important;
+  line-height: 1.2 !important;
+}
+
 /* Masquer le flicker pendant la conversion */
 .mermaid-zoom-wrapper.converting {
   opacity: 0.7;