Commit 6554803 (parent: 45dbf50): add a visualization of the infrastructure using a mermaid diagram
app/src/content/chapters/experiments.mdx
CHANGED
````diff
@@ -5,7 +5,6 @@ import Glossary from "../../components/Glossary.astro";
 
 {/* TODO: Benchmarking: plot compare against default, mention how expensive one sweep is, automatically produce plot from baseline to be optimized and spit out the result */}
 {/* TODO: think about what dataset to build and release as artifact: do more rephrasing with smollm2 */}
-{/* TODO: add a visualization for the infrastructure */}
 {/* TODO: add a plot for the table with the benchmark results */}
 {/* TODO: Analyze if certain models are more verbose than others (how many tokens did they produce per prompt?) (wait for last rephrasing job to be done) */}
 {/* TODO: Run dclm and edu score impact analysis on model verbosity data (wait for last rephrasing job to be done) */}
````
app/src/content/chapters/infrastructure.mdx
CHANGED
````diff
@@ -18,6 +18,61 @@ Today we're excited to announce major extensions to [DataTrove](https://github.c
 
 In this blog post we show how DataTrove can be used to generate a billion tokens per hour across several model scales, ranging from 1 billion to 1 trillion parameters. Let's dive in!
 
+<figure>
+
+```mermaid
+flowchart TB
+    subgraph Input["📥 Input"]
+        HF_IN["`HF Hub Dataset`"]
+    end
+
+    subgraph Pipeline["⚙️ DataTrove Pipeline"]
+        direction LR
+        READ["`**Read**
+        HuggingFaceDatasetReader`"]
+        TRANSFORM["`**Transform**
+        InferenceRunner`"]
+        WRITE["`**Write**
+        ParquetWriter`"]
+        READ --> TRANSFORM --> WRITE
+    end
+
+    subgraph Inference["🚀 Inference Engine"]
+        direction LR
+        ROLLOUT["`**Custom Rollout**
+        async callable`"]
+        VLLM["`**vLLM / SGLang**
+        Server`"]
+        ROLLOUT -- "generate(payload)" --> VLLM
+    end
+
+    subgraph Execution["🖥️ Execution Mode"]
+        direction LR
+        LOCAL["`**Local**
+        single node, multi-GPU`"]
+        SLURM["`**Slurm Cluster**
+        multi-node, auto-scaling`"]
+    end
+
+    subgraph Output["📤 Output"]
+        direction LR
+        HF_OUT["`HF Hub Dataset`"]
+        CARD["`**Dataset Card**
+        + Metrics`"]
+        MONITOR["`Progress Monitor`"]
+    end
+
+    HF_IN --> READ
+    TRANSFORM --> ROLLOUT
+    Pipeline --> Execution
+    WRITE --> HF_OUT
+    WRITE --> CARD
+    Pipeline --> MONITOR
+```
+
+<figcaption>Overview of the DataTrove synthetic data generation pipeline. Documents flow from the HF Hub through a three-stage pipeline (Read, Transform, Write), with the InferenceRunner dispatching custom rollout functions to vLLM/SGLang servers. The system supports both local and Slurm-based execution, with automatic dataset upload and progress monitoring.</figcaption>
+</figure>
+
 ### Generating synthetic data at scale
 
 At the core of the repo is `examples/inference/benchmark/generate_data.py`, a Typer-powered entry point that orchestrates the full synthetic data loop:
````
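The diagram added in this commit names the real DataTrove building blocks (HuggingFaceDatasetReader, InferenceRunner, ParquetWriter) wired into a Read → Transform → Write loop, with an async rollout callable talking to the inference server. As a rough sketch of that three-stage shape — plain-Python stand-ins only, not the actual DataTrove API; every class, function, and signature below is illustrative:

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Document:
    """Minimal stand-in for a DataTrove document record."""
    id: str
    text: str
    metadata: dict = field(default_factory=dict)


def read(records):
    """Read stage: yield Documents from raw records
    (stand-in for HuggingFaceDatasetReader)."""
    for i, text in enumerate(records):
        yield Document(id=str(i), text=text)


async def rollout(doc, generate):
    """Custom rollout: build a payload and call the inference server
    (stand-in for the async callable handed to InferenceRunner)."""
    payload = {"prompt": f"Rephrase: {doc.text}"}
    doc.metadata["completion"] = await generate(payload)
    return doc


def transform(docs, generate):
    """Transform stage: run the async rollout over every document
    (stand-in for InferenceRunner dispatching to vLLM/SGLang)."""
    async def run():
        return await asyncio.gather(*(rollout(d, generate) for d in docs))
    return asyncio.run(run())


def write(docs):
    """Write stage: flatten documents into output rows
    (stand-in for ParquetWriter)."""
    return [{"id": d.id, "text": d.text, **d.metadata} for d in docs]


# Usage with a stubbed generate() endpoint in place of a vLLM/SGLang server:
async def fake_generate(payload):
    return payload["prompt"].upper()

rows = write(transform(read(["hello world", "data trove"]), fake_generate))
```

In the real pipeline the three stages are DataTrove pipeline steps and `generate` is served by a vLLM or SGLang instance; the point of the sketch is only the composition order the diagram shows.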