joelniklaus (HF Staff) committed

Commit 6554803 · Parent(s): 45dbf50

add a visualization of the infrastructure using a mermaid diagram
app/src/content/chapters/experiments.mdx CHANGED
@@ -5,7 +5,6 @@ import Glossary from "../../components/Glossary.astro";
 
 {/* TODO: Benchmarking: plot compare against default, mention how expensive one sweep is, automatically produce plot from baseline to be optimized and spit out the result */}
 {/* TODO: think about what dataset to build and release as artifact: do more rephrasing with smollm2 */}
-{/* TODO: add a visualization for the infrastructure */}
 {/* TODO: add a plot for the table with the benchmark results */}
 {/* TODO: Analyze if certain models are more verbose than others (how many tokens did they produce per prompt?) (wait for last rephrasing job to be done) */}
 {/* TODO: Run dclm and edu score impact analysis on model verbosity data (wait for last rephrasing job to be done) */}
app/src/content/chapters/infrastructure.mdx CHANGED
@@ -18,6 +18,61 @@ Today we're excited to announce major extensions to [DataTrove](https://github.c
 
 In this blog post we show how DataTrove can be used to generate a billion tokens per hour across several model scales, ranging from 1 billion to 1 trillion parameters. Let's dive in!
 
+<figure>
+
+```mermaid
+flowchart TB
+    subgraph Input["📥 Input"]
+        HF_IN["`HF Hub Dataset`"]
+    end
+
+    subgraph Pipeline["⚙️ DataTrove Pipeline"]
+        direction LR
+        READ["`**Read**
+        HuggingFaceDatasetReader`"]
+        TRANSFORM["`**Transform**
+        InferenceRunner`"]
+        WRITE["`**Write**
+        ParquetWriter`"]
+        READ --> TRANSFORM --> WRITE
+    end
+
+    subgraph Inference["🚀 Inference Engine"]
+        direction LR
+        ROLLOUT["`**Custom Rollout**
+        async callable`"]
+        VLLM["`**vLLM / SGLang**
+        Server`"]
+        ROLLOUT -- "generate(payload)" --> VLLM
+    end
+
+    subgraph Execution["🖥️ Execution Mode"]
+        direction LR
+        LOCAL["`**Local**
+        single node, multi-GPU`"]
+        SLURM["`**Slurm Cluster**
+        multi-node, auto-scaling`"]
+    end
+
+    subgraph Output["📤 Output"]
+        direction LR
+        HF_OUT["`HF Hub Dataset`"]
+        CARD["`**Dataset Card**
+        + Metrics`"]
+        MONITOR["`Progress Monitor`"]
+    end
+
+    HF_IN --> READ
+    TRANSFORM --> ROLLOUT
+    Pipeline --> Execution
+    WRITE --> HF_OUT
+    WRITE --> CARD
+    Pipeline --> MONITOR
+```
+
+<figcaption>Overview of the DataTrove synthetic data generation pipeline. Documents flow from the HF Hub through a three-stage pipeline (Read, Transform, Write), with the InferenceRunner dispatching custom rollout functions to vLLM/SGLang servers. The system supports both local and Slurm-based execution, with automatic dataset upload and progress monitoring.</figcaption>
+</figure>
+
 ### Generating synthetic data at scale
 
 At the core of the repo is `examples/inference/benchmark/generate_data.py`, a Typer-powered entry point that orchestrates the full synthetic data loop:
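The three-stage flow the diagram and caption describe (Read → Transform → Write, with the transform stage dispatching async rollouts against an inference server) can be sketched in plain Python. This is a minimal illustrative toy, not DataTrove's actual API: `Document`, `read`, `rollout`, `transform`, `write`, and `fake_generate` are all hypothetical stand-ins for `HuggingFaceDatasetReader`, `InferenceRunner`, `ParquetWriter`, and a vLLM/SGLang endpoint.

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)

def read(records):
    # Read stage: yield Documents from an input source
    # (stand-in for HuggingFaceDatasetReader streaming from the HF Hub).
    for r in records:
        yield Document(text=r)

async def rollout(doc, generate):
    # Custom rollout: an async callable that sends a payload to an
    # inference server (stand-in for a vLLM / SGLang endpoint).
    doc.metadata["completion"] = await generate({"prompt": doc.text})
    return doc

async def transform(docs, generate):
    # Transform stage: dispatch rollouts concurrently, as an
    # InferenceRunner-style stage would across many requests.
    return await asyncio.gather(*(rollout(d, generate) for d in docs))

def write(docs):
    # Write stage: serialize output rows (stand-in for ParquetWriter).
    return [{"text": d.text, **d.metadata} for d in docs]

async def fake_generate(payload):
    # Stand-in for a model server; simply echoes the prompt.
    await asyncio.sleep(0)
    return f"rephrased: {payload['prompt']}"

def run_pipeline(records):
    docs = list(read(records))
    docs = asyncio.run(transform(docs, fake_generate))
    return write(docs)

rows = run_pipeline(["doc one", "doc two"])
print(rows[0]["completion"])  # rephrased: doc one
```

The real pipeline swaps each stand-in for a DataTrove block and runs the whole loop under a local or Slurm executor, but the data flow is the same shape.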