Commit a45cf8d · improve mermaid diagram
Parent(s): 230393c
app/src/content/chapters/infrastructure.mdx CHANGED
@@ -3,6 +3,7 @@ import HtmlEmbed from "../../components/HtmlEmbed.astro";
 import Sidenote from "../../components/Sidenote.astro";
 import FigRef from "../../components/FigRef.astro";
 import Accordion from "../../components/Accordion.astro";
+import Wide from "../../components/Wide.astro";
 import datasetCardImg from "../assets/image/auto-dataset-card.png";
 
 ## Infrastructure
@@ -15,16 +16,18 @@ We made major extensions to [DataTrove](https://github.com/huggingface/datatrove
 
 In this section we show how DataTrove can be used to generate a billion tokens per hour across several model scales, ranging from 100 million to 1 trillion parameters. <FigRef target="datatrove-pipeline" /> gives an overview of the pipeline. Let's dive in!
 
+<Wide>
 <figure id="datatrove-pipeline">
 
 ```mermaid
+%%{init: {"flowchart": {"diagramPadding": 12, "padding": 12, "nodeSpacing": 22, "rankSpacing": 52, "subGraphTitleMargin": {"top": 8, "bottom": 24}}, "themeVariables": {"fontSize": "18px"}} }%%
 flowchart TB
     subgraph Input["📥 Input"]
         HF_IN["`HF Hub Dataset`"]
     end
 
     subgraph Pipeline["⚙️ DataTrove Pipeline"]
-        direction
+        direction TB
         READ["`**Read**
 HuggingFaceDatasetReader`"]
         TRANSFORM["`**Transform**
@@ -34,8 +37,16 @@ ParquetWriter`"]
     READ --> TRANSFORM --> WRITE
     end
 
+    subgraph Execution["🖥️ Execution Mode"]
+        direction TB
+        LOCAL["`**Local**
+single node, multi-GPU`"]
+        SLURM["`**Slurm Cluster**
+multi-node, auto-scaling`"]
+    end
+
     subgraph Inference["🚀 Inference Engine"]
-        direction
+        direction TB
         ROLLOUT["`**Custom Rollout**
 async callable`"]
         VLLM["`**vLLM / SGLang**
@@ -43,14 +54,6 @@ Server`"]
     ROLLOUT -- "generate(payload)" --> VLLM
     end
 
-    subgraph Execution["🖥️ Execution Mode"]
-        direction LR
-        LOCAL["`**Local**
-single node, multi-GPU`"]
-        SLURM["`**Slurm Cluster**
-multi-node, auto-scaling`"]
-    end
-
     subgraph Output["📤 Output"]
         direction LR
         HF_OUT["`HF Hub Dataset`"]
@@ -60,15 +63,17 @@ multi-node, auto-scaling`"]
     end
 
     HF_IN --> READ
+    HF_IN ~~~ LOCAL
     TRANSFORM --> ROLLOUT
     Pipeline --> Execution
     WRITE --> HF_OUT
     WRITE --> CARD
-
+    WRITE --> MONITOR
 ```
 
-<figcaption>Overview of the DataTrove synthetic data generation pipeline. Documents flow
+<figcaption>Overview of the DataTrove synthetic data generation pipeline. Documents flow through a three-stage pipeline (Read, Transform, Write), with the InferenceRunner dispatching rollout functions to vLLM/SGLang. The system supports local and Slurm-based execution with automatic upload and progress monitoring.</figcaption>
 </figure>
+</Wide>
 
 ### Generating synthetic data at scale
 
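The three-stage flow in the diagram above (Read → Transform → Write, with an async rollout callable dispatched to the inference engine) can be sketched in plain Python. This is an illustrative stand-in only, not DataTrove's actual API: `read`, `rollout`, `transform`, `write`, `fake_generate`, and `run_pipeline` are hypothetical names, and a real pipeline would wire up `HuggingFaceDatasetReader`, an `InferenceRunner`, and `ParquetWriter` against a vLLM/SGLang server.

```python
# Illustrative sketch of the Read -> Transform -> Write flow from the diagram.
# All names here are hypothetical stand-ins, not DataTrove's actual API.
import asyncio

def read(dataset):
    # Read stage: yield documents from an input dataset
    # (stand-in for HuggingFaceDatasetReader).
    for row in dataset:
        yield {"id": row["id"], "text": row["text"]}

async def rollout(payload, generate):
    # Custom rollout: an async callable handed to the inference engine
    # (stand-in for a generate(payload) call to vLLM/SGLang).
    return await generate(payload)

async def transform(docs, generate):
    # Transform stage: enrich each document with a model completion.
    return [{**doc, "completion": await rollout(doc["text"], generate)}
            for doc in docs]

def write(docs, sink):
    # Write stage: append records to an output sink
    # (stand-in for ParquetWriter).
    sink.extend(docs)
    return sink

async def fake_generate(payload):
    # Hypothetical local backend that just echoes the prompt; a real
    # setup would send the payload to a vLLM/SGLang server instead.
    return f"completion for: {payload}"

def run_pipeline(dataset):
    docs = read(dataset)
    records = asyncio.run(transform(docs, fake_generate))
    return write(records, [])
```

Running `run_pipeline([{"id": 1, "text": "hello"}])` yields one record with an added `completion` field; a real pipeline would issue the rollout calls concurrently rather than awaiting them one at a time.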
app/src/styles/components/_mermaid.css CHANGED
@@ -28,6 +28,13 @@
   fill: var(--text-color) !important;
 }
 
+#datatrove-pipeline .mermaid .cluster-label,
+#datatrove-pipeline .mermaid .cluster-label .nodeLabel,
+#datatrove-pipeline .mermaid .cluster-label .nodeLabel p {
+  font-size: 16px !important;
+  line-height: 1.2 !important;
+}
+
 /* Masquer le flicker pendant la conversion */
 .mermaid-zoom-wrapper.converting {
   opacity: 0.7;