joelniklaus (HF Staff) committed
Commit a45cf8d · 1 Parent(s): 230393c

improve mermaid diagram

app/src/content/chapters/infrastructure.mdx CHANGED

@@ -3,6 +3,7 @@ import HtmlEmbed from "../../components/HtmlEmbed.astro";
 import Sidenote from "../../components/Sidenote.astro";
 import FigRef from "../../components/FigRef.astro";
 import Accordion from "../../components/Accordion.astro";
+import Wide from "../../components/Wide.astro";
 import datasetCardImg from "../assets/image/auto-dataset-card.png";
 
 ## Infrastructure
@@ -15,16 +16,18 @@ We made major extensions to [DataTrove](https://github.com/huggingface/datatrove
 
 In this section we show how DataTrove can be used to generate a billion tokens per hour across several model scales, ranging from 100 million to 1 trillion parameters. <FigRef target="datatrove-pipeline" /> gives an overview of the pipeline. Let's dive in!
 
+<Wide>
 <figure id="datatrove-pipeline">
 
 ```mermaid
+%%{init: {"flowchart": {"diagramPadding": 12, "padding": 12, "nodeSpacing": 22, "rankSpacing": 52, "subGraphTitleMargin": {"top": 8, "bottom": 24}}, "themeVariables": {"fontSize": "18px"}} }%%
 flowchart TB
 subgraph Input["📥 Input"]
 HF_IN["`HF Hub Dataset`"]
 end
 
 subgraph Pipeline["⚙️ DataTrove Pipeline"]
-direction LR
+direction TB
 READ["`**Read**
 HuggingFaceDatasetReader`"]
 TRANSFORM["`**Transform**
@@ -34,8 +37,16 @@ ParquetWriter`"]
 READ --> TRANSFORM --> WRITE
 end
 
+subgraph Execution["🖥️ Execution Mode"]
+direction TB
+LOCAL["`**Local**
+single node, multi-GPU`"]
+SLURM["`**Slurm Cluster**
+multi-node, auto-scaling`"]
+end
+
 subgraph Inference["🚀 Inference Engine"]
-direction LR
+direction TB
 ROLLOUT["`**Custom Rollout**
 async callable`"]
 VLLM["`**vLLM / SGLang**
@@ -43,14 +54,6 @@ Server`"]
 ROLLOUT -- "generate(payload)" --> VLLM
 end
 
-subgraph Execution["🖥️ Execution Mode"]
-direction LR
-LOCAL["`**Local**
-single node, multi-GPU`"]
-SLURM["`**Slurm Cluster**
-multi-node, auto-scaling`"]
-end
-
 subgraph Output["📤 Output"]
 direction LR
 HF_OUT["`HF Hub Dataset`"]
@@ -60,15 +63,17 @@ multi-node, auto-scaling`"]
 end
 
 HF_IN --> READ
+HF_IN ~~~ LOCAL
 TRANSFORM --> ROLLOUT
 Pipeline --> Execution
 WRITE --> HF_OUT
 WRITE --> CARD
-Pipeline --> MONITOR
+WRITE --> MONITOR
 ```
 
-<figcaption>Overview of the DataTrove synthetic data generation pipeline. Documents flow from the HF Hub through a three-stage pipeline (Read, Transform, Write), with the InferenceRunner dispatching custom rollout functions to vLLM/SGLang servers. The system supports both local and Slurm-based execution, with automatic dataset upload and progress monitoring.</figcaption>
+<figcaption>Overview of the DataTrove synthetic data generation pipeline. Documents flow through a three-stage pipeline (Read, Transform, Write), with the InferenceRunner dispatching rollout functions to vLLM/SGLang. The system supports local and Slurm-based execution with automatic upload and progress monitoring.</figcaption>
 </figure>
+</Wide>
 
 ### Generating synthetic data at scale
 
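The Read → Transform → Write flow shown in the diagram above can be illustrated with a small, dependency-free Python sketch of chained generator stages. Note this is a hypothetical stand-in for illustration only, not DataTrove's actual API: `read_docs`, `transform`, `write_docs`, and `fake_rollout` are invented names, while the real components named in the diagram are `HuggingFaceDatasetReader` and `ParquetWriter`.

```python
# Minimal sketch of the Read -> Transform -> Write dataflow from the
# diagram. All function names here are hypothetical stand-ins, not
# DataTrove API.

def read_docs(source):
    """Read stage: yield raw documents one at a time (streaming)."""
    for doc in source:
        yield doc

def transform(docs, rollout):
    """Transform stage: pass each document's payload to a rollout
    callable (in the diagram, this call reaches a vLLM/SGLang server)."""
    for doc in docs:
        doc = dict(doc)  # copy so the input dataset stays untouched
        doc["completion"] = rollout(doc["prompt"])
        yield doc

def write_docs(docs):
    """Write stage: materialize finished documents (stand-in for a
    Parquet writer)."""
    return list(docs)

# Usage: the stages compose lazily, so documents stream end to end.
source = [{"prompt": "2+2="}, {"prompt": "3+3="}]
fake_rollout = lambda p: f"<answer to {p!r}>"
docs = write_docs(transform(read_docs(source), fake_rollout))
print(docs[0]["completion"])  # -> <answer to '2+2='>
```

Because each stage is a generator, no stage holds the full dataset in memory, which mirrors how a streaming pipeline can sustain high throughput.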
app/src/styles/components/_mermaid.css CHANGED

@@ -28,6 +28,13 @@
   fill: var(--text-color) !important;
 }
 
+#datatrove-pipeline .mermaid .cluster-label,
+#datatrove-pipeline .mermaid .cluster-label .nodeLabel,
+#datatrove-pipeline .mermaid .cluster-label .nodeLabel p {
+  font-size: 16px !important;
+  line-height: 1.2 !important;
+}
+
 /* Hide the flicker during conversion */
 .mermaid-zoom-wrapper.converting {
   opacity: 0.7;