---
license: mit
language:
- en
metrics:
- accuracy
pipeline_tag: text-generation
tags:
  - openlm
  - language-modeling
  - causal-lm
  - webgraphmix
  - dclm
datasets:
  - WebOrganizer/Corpus-200B
  - PrincetonPLI/cc-centrality-scores
library_name: open_lm
---

# WebGraphMix OpenLM 1B Checkpoints

Pretrained **OpenLM 1B** checkpoints from [**Hubs or Fringes? Pretraining Data Selection via Web Graph Centrality**](https://arxiv.org/abs/2606.11499) (WebGraphMix).

These models reproduce the headline **1B-scale experiments (Table 1 of the paper)**: one model per data-selection method, each trained on a mixture derived from [WebOrganizer/Corpus-200B](https://huggingface.co/datasets/WebOrganizer/Corpus-200B) and evaluated on **DCLM CORE v2** (`mmlu_and_lowvar`, 23 tasks).

| Resource | Link |
|----------|------|
| Paper | [arXiv:2606.11499](https://arxiv.org/abs/2606.11499) |
| Project page | [princeton-pli.github.io/WebGraphMix](https://princeton-pli.github.io/WebGraphMix/) |
| Code | [github.com/princeton-pli/WebGraphMix](https://github.com/princeton-pli/WebGraphMix) |
| Centrality scores | [PrincetonPLI/cc-centrality-scores](https://huggingface.co/datasets/PrincetonPLI/cc-centrality-scores) |
| Base corpus | [WebOrganizer/Corpus-200B](https://huggingface.co/datasets/WebOrganizer/Corpus-200B) |

## Checkpoints

Each folder contains one OpenLM PyTorch checkpoint (`epoch_11.pt`, the final epoch). The eval metadata file `open_lm_1b_eval_params.txt` is shared by all four checkpoints and sits at the top level.

| Folder | Method | Training mixture | DCLM CORE v2 avg. |
|--------|--------|------------------|-------------------|
| `random_selection` | Random baseline | Uniform sampling from Corpus-200B pool | 39.8% |
| `dclm_fasttext_only` | Quality (DCLM-fasttext) | Documents above DCLM-fasttext quality threshold | 42.3% |
| `betweenness_alpha0.5` | **WebGraphMix** | 50/50 mix of top/bottom betweenness-centrality hosts | 41.4% |
| `betweenness_alpha0.5_mult_div_dclm_fasttext` | **WebGraphMix+** | Betweenness 50/50 mix × DCLM-fasttext quality filter | 43.8% |

> Scores are `aggregated_results` from the `mmlu_and_lowvar` eval suite (23 low-variance ICL tasks). See the [WebGraphMix repo](https://github.com/princeton-pli/WebGraphMix) to reproduce evaluation.
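
To read the table as gains over the random baseline, the short Python sketch below recomputes the differences from the scores listed above. The dictionary only restates the table; the variable names are illustrative and are not part of the WebGraphMix codebase.

```python
# DCLM CORE v2 averages (percent), copied from the table above.
scores = {
    "random_selection": 39.8,
    "dclm_fasttext_only": 42.3,
    "betweenness_alpha0.5": 41.4,
    "betweenness_alpha0.5_mult_div_dclm_fasttext": 43.8,
}

baseline = scores["random_selection"]

# Gain of each method over the random baseline, in percentage points.
gains = {name: round(score - baseline, 1) for name, score in scores.items()}

# Print the methods from largest to smallest gain.
for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<45} {scores[name]:.1f}%  ({gain:+.1f} pts vs. random)")
```

On these numbers, the quality filter and the centrality mix each improve on random selection (+2.5 and +1.6 points), and their combination gives the largest gain (+4.0 points).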

## Model details

| Setting | Value |
|---|---|
| Architecture | OpenLM 1B (`open_lm_1b_swiglutorch`) |
| Parameters | ~1.44B (1.34B non-embedding) |
| Hidden dim / layers / heads | 2048 / 24 / 16 |
| Context length | 2048 |
| Vocab | 50,432 (GPT-NeoX tokenizer) |
| FFN | SwiGLU (torch) |
| Norm | `gain_only_lp_layer_norm` |
| QK norm | enabled |
| Training tokens | ~28.8B (`1b_1x_fast` Chinchilla scale) |
| Global batch size | 256 |
| LR / warmup / weight decay | 0.003 / 5000 steps / 0.033 |
| Seed | 124 |
| Precision | AMP bfloat16 + FSDP |
| OpenLM version | 0.0.34 |

All four models share the same architecture and optimizer settings; they differ only in the **importance-sampled pretraining mixture**.
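
The training-token figure matches the common Chinchilla heuristic of roughly 20 tokens per parameter. The sketch below checks that arithmetic and derives the implied tokens per step and step count. It assumes that the global batch size of 256 counts sequences and that every sequence is packed to the full 2048-token context; both are assumptions about typical OpenLM usage, not statements from this card.

```python
# Values from the table above.
total_params = 1.44e9        # ~1.44B parameters
context_length = 2048        # tokens per sequence
global_batch_size = 256      # ASSUMPTION: counted in sequences, not tokens

# ASSUMPTION: "Chinchilla scale" means ~20 training tokens per parameter.
tokens_per_param = 20
training_tokens = total_params * tokens_per_param

# ASSUMPTION: every sequence is packed to the full context length.
tokens_per_step = global_batch_size * context_length
approx_steps = training_tokens / tokens_per_step

print(f"training tokens : {training_tokens / 1e9:.1f}B")
print(f"tokens per step : {tokens_per_step:,}")
print(f"optimizer steps : ~{approx_steps:,.0f}")
```

Under these assumptions, a step processes 524,288 tokens and training runs for roughly 55,000 steps, so the 5000 warmup steps cover about the first 9% of training.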

## Download

```bash
huggingface-cli download PrincetonPLI/WebGraphMix-openlm-1B \
  --local-dir ./dclm/checkpoints \
  --repo-type model
```

Or from the WebGraphMix repo:

```bash
git clone https://github.com/princeton-pli/WebGraphMix.git
cd WebGraphMix
./experiments/artifacts/download.sh checkpoints
```

Expected layout after download (under `dclm/` when using the commands above):

```text
checkpoints/
├── open_lm_1b_eval_params.txt
├── random_selection/epoch_11.pt
├── dclm_fasttext_only/epoch_11.pt
├── betweenness_alpha0.5/epoch_11.pt
└── betweenness_alpha0.5_mult_div_dclm_fasttext/epoch_11.pt
```

Size: about **17 GB per checkpoint** (about 68 GB for all four).
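
Because each file is large, it can be worth confirming that a download finished before starting an evaluation. The following is a minimal Python sketch that checks for the files named in the layout above; the helper name `missing_files` and the default path `dclm/checkpoints` are illustrative assumptions, not part of the WebGraphMix tooling.

```python
from pathlib import Path

# Relative paths taken from the expected layout above.
EXPECTED_FILES = [
    "open_lm_1b_eval_params.txt",
    "random_selection/epoch_11.pt",
    "dclm_fasttext_only/epoch_11.pt",
    "betweenness_alpha0.5/epoch_11.pt",
    "betweenness_alpha0.5_mult_div_dclm_fasttext/epoch_11.pt",
]


def missing_files(checkpoint_root):
    """Return the expected relative paths that are absent under checkpoint_root."""
    root = Path(checkpoint_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # ASSUMPTION: checkpoints were downloaded to ./dclm/checkpoints.
    missing = missing_files("dclm/checkpoints")
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected checkpoint files are present.")
```

The check only tests for presence; it does not validate file sizes or contents.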

## Evaluate (recommended)

The checkpoints are stored in **native OpenLM PyTorch format**. The easiest path is the WebGraphMix evaluation pipeline:

```bash
conda env create -f environment.yml && conda activate webgraphmix
cd dclm && pip install -e . && cd ..

export REPO_ROOT=$(pwd)
./experiments/artifacts/download.sh checkpoints

# Default: WebGraphMix 50/50 betweenness
./experiments/eval/mmlu_and_lowvar.sh

# Other checkpoints
./experiments/eval/mmlu_and_lowvar.sh random_selection
./experiments/eval/mmlu_and_lowvar.sh dclm_fasttext_only
./experiments/eval/mmlu_and_lowvar.sh betweenness_alpha0.5_mult_div_dclm_fasttext
```

Aggregate scores across models:

```bash
cd dclm/exp_data/evals && python benchmark_score_comparison.py
```

By default, evaluation runs under FSDP on two or more GPUs; a single GPU may run out of memory on the 1B model.

## Convert to Hugging Face format (optional)

To load a checkpoint through the `open_lm` Hugging Face wrappers for `transformers`, convert it first:

```bash
export REPO_ROOT=/path/to/WebGraphMix
export CHECKPOINT_INPUT_DIR=$REPO_ROOT/dclm/checkpoints
export CHECKPOINT_HF_OUTPUT_DIR=$REPO_ROOT/dclm/checkpoints_hf
python dclm/convert_openlm_to_hf_1b.py
```

This produces Hugging Face–compatible folders with `OpenLMConfig` / `OpenLMForCausalLM` weights and the GPT-NeoX tokenizer.

## Training data (summary)

| Checkpoint | Mixture description |
|------------|---------------------|
| `random_selection` | Uniform random document sampling |
| `dclm_fasttext_only` | DCLM-fasttext quality filter only |
| `betweenness_alpha0.5` | 50% documents from highest-betweenness hosts + 50% from lowest-betweenness hosts |
| `betweenness_alpha0.5_mult_div_dclm_fasttext` | Same 50/50 betweenness mix, combined with DCLM-fasttext quality scores (multiply/divide scheme) |

Centrality scores come from [PrincetonPLI/cc-centrality-scores](https://huggingface.co/datasets/PrincetonPLI/cc-centrality-scores). Full sampling and tokenization steps are documented in the [WebGraphMix README](https://github.com/princeton-pli/WebGraphMix).
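
To make the 50/50 description concrete, here is a toy Python sketch of a top/bottom host mix with `alpha = 0.5`. It is an illustration only, not the repository's sampling code: the function `mix_top_bottom`, its `tail_fraction` cutoff, and the uniform within-group sampling are all hypothetical, and the actual pipeline (described above as importance sampling) should be taken from the WebGraphMix README.

```python
import random


def mix_top_bottom(doc_hosts, host_scores, n_docs, alpha=0.5,
                   tail_fraction=0.25, seed=0):
    """Pick n_docs document ids: a share alpha from high-score hosts,
    the rest from low-score hosts.

    doc_hosts:   dict mapping document id -> host name
    host_scores: dict mapping host name -> centrality score
    tail_fraction is a HYPOTHETICAL knob: the share of hosts treated as
    "highest" and as "lowest"; the card does not state the real cutoff.
    """
    if len(host_scores) < 2 or not 0 < tail_fraction <= 0.5:
        raise ValueError("need >= 2 hosts and 0 < tail_fraction <= 0.5")

    rng = random.Random(seed)
    ranked = sorted(host_scores, key=host_scores.get)  # low -> high score
    k = max(1, int(len(ranked) * tail_fraction))
    low_hosts, high_hosts = set(ranked[:k]), set(ranked[-k:])

    low_docs = sorted(d for d, h in doc_hosts.items() if h in low_hosts)
    high_docs = sorted(d for d, h in doc_hosts.items() if h in high_hosts)

    n_high = round(alpha * n_docs)
    n_low = n_docs - n_high
    if n_high > len(high_docs) or n_low > len(low_docs):
        raise ValueError("not enough documents in one of the host groups")

    # Uniform sampling without replacement inside each group (an assumption).
    return rng.sample(high_docs, n_high) + rng.sample(low_docs, n_low)


# Toy data: 8 made-up hosts with scores 0..7 and 5 documents per host.
hosts = {f"host{i}.example": float(i) for i in range(8)}
docs = {f"doc{i}-{j}": f"host{i}.example" for i in range(8) for j in range(5)}
sample = mix_top_bottom(docs, hosts, n_docs=10)
print(sorted(sample))
```

With this toy data, the sample holds five documents from the two highest-scoring hosts and five from the two lowest-scoring hosts; hosts in the middle of the ranking contribute nothing.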

## Citation

```bibtex
@article{badoni2026webgraphmix,
  title={Hubs or Fringes? Pretraining Data Selection via Web Graph Centrality},
  author={Badoni, Vedant and Chen, Danqi and Wang, Xinyi},
  journal={arXiv preprint arXiv:2606.11499},
  year={2026}
}
```

## License

Released under the [MIT License](https://github.com/princeton-pli/WebGraphMix/blob/main/dclm/LICENSE), consistent with the [DCLM](https://github.com/mlfoundations/dclm) codebase used for training and evaluation.