joelniklaus (HF Staff) committed

Commit 6554803 · Parent(s): 45dbf50

add a visualization of the infrastructure using a mermaid diagram
app/src/content/chapters/experiments.mdx CHANGED
@@ -5,7 +5,6 @@ import Glossary from "../../components/Glossary.astro";
 
 {/* TODO: Benchmarking: plot compare against default, mention how expensive one sweep is, automatically produce plot from baseline to be optimized and spit out the result */}
 {/* TODO: think about what dataset to build and release as artifact: do more rephrasing with smollm2 */}
-{/* TODO: add a visualization for the infrastructure */}
 {/* TODO: add a plot for the table with the benchmark results */}
 {/* TODO: Analyze if certain models are more verbose than others (how many tokens did they produce per prompt?) (wait for last rephrasing job to be done) */}
 {/* TODO: Run dclm and edu score impact analysis on model verbosity data (wait for last rephrasing job to be done) */}
app/src/content/chapters/infrastructure.mdx CHANGED
@@ -18,6 +18,61 @@ Today we're excited to announce major extensions to [DataTrove](https://github.c
 
 In this blog post we show how DataTrove can be used to generate a billion tokens per hour across several model scales, ranging from 1 billion to 1 trillion parameters. Let's dive in!
 
+<figure>
+
+```mermaid
+flowchart TB
+    subgraph Input["📥 Input"]
+        HF_IN["`HF Hub Dataset`"]
+    end
+
+    subgraph Pipeline["⚙️ DataTrove Pipeline"]
+        direction LR
+        READ["`**Read**
+        HuggingFaceDatasetReader`"]
+        TRANSFORM["`**Transform**
+        InferenceRunner`"]
+        WRITE["`**Write**
+        ParquetWriter`"]
+        READ --> TRANSFORM --> WRITE
+    end
+
+    subgraph Inference["🚀 Inference Engine"]
+        direction LR
+        ROLLOUT["`**Custom Rollout**
+        async callable`"]
+        VLLM["`**vLLM / SGLang**
+        Server`"]
+        ROLLOUT -- "generate(payload)" --> VLLM
+    end
+
+    subgraph Execution["🖥️ Execution Mode"]
+        direction LR
+        LOCAL["`**Local**
+        single node, multi-GPU`"]
+        SLURM["`**Slurm Cluster**
+        multi-node, auto-scaling`"]
+    end
+
+    subgraph Output["📤 Output"]
+        direction LR
+        HF_OUT["`HF Hub Dataset`"]
+        CARD["`**Dataset Card**
+        + Metrics`"]
+        MONITOR["`Progress Monitor`"]
+    end
+
+    HF_IN --> READ
+    TRANSFORM --> ROLLOUT
+    Pipeline --> Execution
+    WRITE --> HF_OUT
+    WRITE --> CARD
+    Pipeline --> MONITOR
+```
+
+<figcaption>Overview of the DataTrove synthetic data generation pipeline. Documents flow from the HF Hub through a three-stage pipeline (Read, Transform, Write), with the InferenceRunner dispatching custom rollout functions to vLLM/SGLang servers. The system supports both local and Slurm-based execution, with automatic dataset upload and progress monitoring.</figcaption>
+</figure>
+
 ### Generating synthetic data at scale
 
 At the core of the repo is `examples/inference/benchmark/generate_data.py`, a Typer-powered entry point that orchestrates the full synthetic data loop:
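The three-stage flow the diagram and caption describe (Read → Transform → Write, with the transform stage dispatching async rollouts against an inference server) can be sketched in plain Python. This is a minimal illustrative toy, not DataTrove's actual API: `Document`, `read`, `rollout`, `transform`, `write`, and `fake_generate` are all hypothetical stand-ins for `HuggingFaceDatasetReader`, `InferenceRunner`, `ParquetWriter`, and a vLLM/SGLang endpoint.

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)

def read(records):
    # Read stage: yield Documents from an input source
    # (stand-in for HuggingFaceDatasetReader streaming from the HF Hub).
    for r in records:
        yield Document(text=r)

async def rollout(doc, generate):
    # Custom rollout: an async callable that sends a payload to an
    # inference server (stand-in for a vLLM / SGLang endpoint).
    doc.metadata["completion"] = await generate({"prompt": doc.text})
    return doc

async def transform(docs, generate):
    # Transform stage: dispatch rollouts concurrently, as an
    # InferenceRunner-style stage would across many requests.
    return await asyncio.gather(*(rollout(d, generate) for d in docs))

def write(docs):
    # Write stage: serialize output rows (stand-in for ParquetWriter).
    return [{"text": d.text, **d.metadata} for d in docs]

async def fake_generate(payload):
    # Stand-in for a model server; simply echoes the prompt.
    await asyncio.sleep(0)
    return f"rephrased: {payload['prompt']}"

def run_pipeline(records):
    docs = list(read(records))
    docs = asyncio.run(transform(docs, fake_generate))
    return write(docs)

rows = run_pipeline(["doc one", "doc two"])
print(rows[0]["completion"])  # rephrased: doc one
```

The real pipeline swaps each stand-in for a DataTrove block and runs the whole loop under a local or Slurm executor, but the data flow is the same shape.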