---
license: mit
language:
- en
metrics:
- accuracy
pipeline_tag: text-generation
tags:
- openlm
- language-modeling
- causal-lm
- webgraphmix
- dclm
datasets:
- WebOrganizer/Corpus-200B
- PrincetonPLI/cc-centrality-scores
library_name: open_lm
---
# WebGraphMix OpenLM 1B Checkpoints
Pretrained **OpenLM 1B** checkpoints from [**Hubs or Fringes? Pretraining Data Selection via Web Graph Centrality**](https://arxiv.org/abs/2606.11499) (WebGraphMix).
These checkpoints reproduce the paper's headline **1B-scale experiments (Table 1)**: four data-selection methods, each trained on a mixture derived from [WebOrganizer/Corpus-200B](https://huggingface.co/datasets/WebOrganizer/Corpus-200B) and evaluated on **DCLM CORE v2** (`mmlu_and_lowvar`, 23 tasks).
| Resource | Link |
|----------|------|
| Paper | [arXiv:2606.11499](https://arxiv.org/abs/2606.11499) |
| Project page | [princeton-pli.github.io/WebGraphMix](https://princeton-pli.github.io/WebGraphMix/) |
| Code | [github.com/princeton-pli/WebGraphMix](https://github.com/princeton-pli/WebGraphMix) |
| Centrality scores | [PrincetonPLI/cc-centrality-scores](https://huggingface.co/datasets/PrincetonPLI/cc-centrality-scores) |
| Base corpus | [WebOrganizer/Corpus-200B](https://huggingface.co/datasets/WebOrganizer/Corpus-200B) |
## Checkpoints
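Each folder contains one OpenLM PyTorch checkpoint (`epoch_11.pt`, the final epoch). The eval parameters file shared by all four models, `open_lm_1b_eval_params.txt`, sits at the top level next to the folders.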
| Folder | Method | Training mixture | DCLM CORE v2 avg. |
|--------|--------|------------------|-------------------|
| `random_selection` | Random baseline | Uniform sampling from Corpus-200B pool | 39.8% |
| `dclm_fasttext_only` | Quality (DCLM-fasttext) | Documents above DCLM-fasttext quality threshold | 42.3% |
| `betweenness_alpha0.5` | **WebGraphMix** | 50/50 mix of top/bottom betweenness-centrality hosts | 41.4% |
| `betweenness_alpha0.5_mult_div_dclm_fasttext` | **WebGraphMix+** | Betweenness 50/50 mix × DCLM-fasttext quality filter | 43.8% |
> Scores are `aggregated_results` from the `mmlu_and_lowvar` eval suite (23 low-variance ICL tasks). See the [WebGraphMix repo](https://github.com/princeton-pli/WebGraphMix) to reproduce evaluation.
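To make the comparison in the table easier to read, the short sketch below computes each method's gain over the random baseline in percentage points. The scores are copied from the table; the helper name `gains_over` is made up for this illustration.

```python
# DCLM CORE v2 averages copied from the checkpoint table (percent).
scores = {
    "random_selection": 39.8,
    "dclm_fasttext_only": 42.3,
    "betweenness_alpha0.5": 41.4,
    "betweenness_alpha0.5_mult_div_dclm_fasttext": 43.8,
}


def gains_over(baseline: str, table: dict[str, float]) -> dict[str, float]:
    """Return each method's gain over `baseline`, in percentage points."""
    base = table[baseline]
    return {
        name: round(value - base, 1)
        for name, value in table.items()
        if name != baseline
    }


gains = gains_over("random_selection", scores)
# Print the largest gain first.
for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: +{gain:.1f} pts")
```

These are differences between single reported averages, so they say nothing about run-to-run variance.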
## Model details
| Setting | Value |
|---|---|
| Architecture | OpenLM 1B (`open_lm_1b_swiglutorch`) |
| Parameters | ~1.44B (1.34B non-embedding) |
| Hidden dim / layers / heads | 2048 / 24 / 16 |
| Context length | 2048 |
| Vocab | 50,432 (GPT-NeoX tokenizer) |
| FFN | SwiGLU (torch) |
| Norm | `gain_only_lp_layer_norm` |
| QK norm | enabled |
| Training tokens | ~28.8B (`1b_1x_fast` Chinchilla scale) |
| Global batch size | 256 |
| LR / warmup / weight decay | 0.003 / 5000 steps / 0.033 |
| Seed | 124 |
| Precision | AMP bfloat16 + FSDP |
| OpenLM version | 0.0.34 |
All four models share the same architecture and optimizer settings; they differ only in the **importance-sampled pretraining mixture**.
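As a sanity check on the table, the sketch below recomputes the parameter count and the number of optimizer steps from the listed settings. Three things are assumptions rather than facts from this card: the SwiGLU hidden width is taken to be 8/3 of the hidden dim rounded up to a multiple of 256, the output head is taken to be untied from the input embedding, and the global batch size is taken to count sequences of 2048 tokens. Norm gains are ignored because they are negligible at this scale.

```python
# Settings copied from the model details table.
DIM, LAYERS, VOCAB, CONTEXT = 2048, 24, 50_432, 2048
GLOBAL_BATCH, TRAIN_TOKENS, WARMUP_STEPS = 256, 28.8e9, 5000


def swiglu_hidden(dim: int, multiple: int = 256) -> int:
    # ASSUMPTION: hidden width is 8/3 * dim, rounded up to a multiple of 256.
    return multiple * ((int(8 * dim / 3) + multiple - 1) // multiple)


def param_counts(dim: int, layers: int, vocab: int) -> tuple[int, int]:
    """Return (total, non_embedding) weight counts, ignoring norm gains."""
    attn = 4 * dim * dim                # Q, K, V and output projections
    ffn = 3 * dim * swiglu_hidden(dim)  # gate, up and down projections
    blocks = layers * (attn + ffn)
    embed = vocab * dim
    # ASSUMPTION: untied output head, so the vocab matrix is counted twice.
    total = blocks + 2 * embed
    # "Non-embedding" here means everything except the input embedding.
    return total, total - embed


total, non_embedding = param_counts(DIM, LAYERS, VOCAB)
# ASSUMPTION: the global batch size counts sequences of CONTEXT tokens.
tokens_per_step = GLOBAL_BATCH * CONTEXT
steps = TRAIN_TOKENS / tokens_per_step

print(f"total params:      {total / 1e9:.2f}B")
print(f"non-embedding:     {non_embedding / 1e9:.2f}B")
print(f"tokens per param:  {TRAIN_TOKENS / total:.1f}")
print(f"optimizer steps:   {steps:,.0f}")
print(f"warmup fraction:   {WARMUP_STEPS / steps:.1%}")
```

Under these assumptions the totals round to the 1.44B and 1.34B figures in the table, and the token budget works out to about 20 tokens per parameter.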
## Download
```bash
huggingface-cli download PrincetonPLI/WebGraphMix-openlm-1B \
--local-dir ./dclm/checkpoints \
--repo-type model
```
Or from the WebGraphMix repo:
```bash
git clone https://github.com/princeton-pli/WebGraphMix.git
cd WebGraphMix
./experiments/artifacts/download.sh checkpoints
```
Expected layout after download:
```text
checkpoints/
├── open_lm_1b_eval_params.txt
├── random_selection/epoch_11.pt
├── dclm_fasttext_only/epoch_11.pt
├── betweenness_alpha0.5/epoch_11.pt
└── betweenness_alpha0.5_mult_div_dclm_fasttext/epoch_11.pt
```
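A quick way to confirm the download finished is to check for the files in the layout above. This is a minimal sketch: the function name `missing_files` is illustrative, and the default path assumes the `--local-dir ./dclm/checkpoints` used in the download command.

```python
from pathlib import Path

# Relative paths taken from the expected layout.
EXPECTED = [
    "open_lm_1b_eval_params.txt",
    "random_selection/epoch_11.pt",
    "dclm_fasttext_only/epoch_11.pt",
    "betweenness_alpha0.5/epoch_11.pt",
    "betweenness_alpha0.5_mult_div_dclm_fasttext/epoch_11.pt",
]


def missing_files(root) -> list[str]:
    """Return the expected files that are not present under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]


if __name__ == "__main__":
    # ASSUMPTION: checkpoints were downloaded to ./dclm/checkpoints.
    missing = missing_files("./dclm/checkpoints")
    print("layout OK" if not missing else f"missing: {missing}")
```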
Size: roughly **17 GB per checkpoint** (about 68 GB for all four).
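Given the size of each checkpoint, it may be preferable to fetch a single method's folder rather than the whole repository. The sketch below uses `snapshot_download` from `huggingface_hub` with `allow_patterns`; the helper names are illustrative, and the patterns assume the folder layout shown above.

```python
METHODS = (
    "random_selection",
    "dclm_fasttext_only",
    "betweenness_alpha0.5",
    "betweenness_alpha0.5_mult_div_dclm_fasttext",
)


def allow_patterns(method: str) -> list[str]:
    """Glob patterns for one checkpoint folder plus the shared eval params."""
    if method not in METHODS:
        raise ValueError(f"unknown method {method!r}; expected one of {METHODS}")
    # The trailing "/" keeps "betweenness_alpha0.5" from matching the
    # longer "betweenness_alpha0.5_mult_div_dclm_fasttext" folder.
    return ["open_lm_1b_eval_params.txt", f"{method}/*"]


def download_one(method: str, local_dir: str = "./dclm/checkpoints") -> str:
    """Download a single checkpoint folder and return the local path."""
    # Imported lazily so the helpers above work without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="PrincetonPLI/WebGraphMix-openlm-1B",
        repo_type="model",
        local_dir=local_dir,
        allow_patterns=allow_patterns(method),
    )


if __name__ == "__main__":
    print(allow_patterns("betweenness_alpha0.5"))
```

Calling `download_one("betweenness_alpha0.5")` would then pull only that folder and the eval parameters file.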
## Evaluate (recommended)
The checkpoints are stored in **native OpenLM PyTorch format**. The easiest path is the WebGraphMix evaluation pipeline:
```bash
# Run from the root of the cloned WebGraphMix repo
conda env create -f environment.yml && conda activate webgraphmix
cd dclm && pip install -e . && cd ..
export REPO_ROOT="$(pwd)"
# Fetch the checkpoints
./experiments/artifacts/download.sh checkpoints
# Default: WebGraphMix 50/50 betweenness
./experiments/eval/mmlu_and_lowvar.sh
# Other checkpoints
./experiments/eval/mmlu_and_lowvar.sh random_selection
./experiments/eval/mmlu_and_lowvar.sh dclm_fasttext_only
./experiments/eval/mmlu_and_lowvar.sh betweenness_alpha0.5_mult_div_dclm_fasttext
```
Aggregate scores across models:
```bash
cd dclm/exp_data/evals && python benchmark_score_comparison.py
```
Evaluation uses at least 2 GPUs by default (FSDP); on a single GPU the 1B model may run out of memory.
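Before launching a full evaluation, it can be useful to confirm that a checkpoint file loads and to see what it contains. The sketch below is generic: this card does not document the top-level keys of `epoch_11.pt`, so the code simply walks whatever nested dictionaries it finds and lists tensor shapes. The function names are illustrative. `mmap=True` avoids reading the whole file into RAM, and depending on the PyTorch version and the file's contents, `torch.load` may also need `weights_only=False`.

```python
def tensor_shapes(obj, prefix=""):
    """Yield (name, shape, numel) for every tensor-like leaf in nested dicts."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from tensor_shapes(value, f"{prefix}{key}/")
    elif hasattr(obj, "shape") and hasattr(obj, "numel"):
        yield prefix.rstrip("/"), tuple(obj.shape), obj.numel()


def inspect_checkpoint(path: str, show: int = 10) -> None:
    """Print a short summary of the tensors stored in a checkpoint file."""
    import torch  # imported lazily; only needed when a file is actually loaded

    ckpt = torch.load(path, map_location="cpu", mmap=True)
    rows = list(tensor_shapes(ckpt))
    total = sum(numel for _, _, numel in rows)
    print(f"{len(rows)} tensors, {total / 1e9:.2f}B elements")
    for name, shape, _ in rows[:show]:
        print(name, shape)


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        inspect_checkpoint(sys.argv[1])
```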
## Convert to Hugging Face format (optional)
To load the checkpoints with `transformers` through the `open_lm` Hugging Face wrappers, convert them first:
```bash
export REPO_ROOT=/path/to/WebGraphMix
export CHECKPOINT_INPUT_DIR=$REPO_ROOT/dclm/checkpoints
export CHECKPOINT_HF_OUTPUT_DIR=$REPO_ROOT/dclm/checkpoints_hf
python dclm/convert_openlm_to_hf_1b.py
```
This produces Hugging Face–compatible folders with `OpenLMConfig` / `OpenLMForCausalLM` weights and the GPT-NeoX tokenizer.
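Once converted, a folder can in principle be loaded like any local Hugging Face model. The sketch below rests on two assumptions not confirmed by this card: that the converter writes a standard `config.json` into each output folder, and that importing `open_lm.hf` registers the OpenLM classes with the `transformers` Auto factories. The path in the usage note is hypothetical.

```python
from pathlib import Path


def generate(model_dir, prompt: str, max_new_tokens: int = 32) -> str:
    """Greedy-decode a short continuation from a converted checkpoint folder."""
    model_dir = Path(model_dir)
    # ASSUMPTION: the converter writes a standard config.json per folder.
    if not (model_dir / "config.json").is_file():
        raise FileNotFoundError(f"no config.json under {model_dir}")

    # Heavy imports are deferred until a valid folder is found.
    import open_lm.hf  # noqa: F401  ASSUMPTION: registers OpenLM with Auto classes
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(inputs["input_ids"], max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

For example, `generate("dclm/checkpoints_hf/betweenness_alpha0.5", "The web graph is")` would run if the converter names its output folders after the checkpoints, which should be checked against the actual output directory.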
## Training data (summary)
| Checkpoint | Mixture description |
|------------|---------------------|
| `random_selection` | Uniform random document sampling |
| `dclm_fasttext_only` | DCLM-fasttext quality filter only |
| `betweenness_alpha0.5` | 50% documents from highest-betweenness hosts + 50% from lowest-betweenness hosts |
| `betweenness_alpha0.5_mult_div_dclm_fasttext` | Same 50/50 betweenness mix, combined with DCLM-fasttext quality scores (multiply/divide scheme) |
Centrality scores come from [PrincetonPLI/cc-centrality-scores](https://huggingface.co/datasets/PrincetonPLI/cc-centrality-scores). Full sampling and tokenization steps are documented in the [WebGraphMix README](https://github.com/princeton-pli/WebGraphMix).
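To illustrate the idea behind the 50/50 betweenness mixture, here is a toy sampler. It is not the paper's sampling procedure: the `tail` fraction that defines "highest" and "lowest" hosts, the function name, and the uniform sampling within each group are all assumptions made for the example, and the DCLM-fasttext combination is not modeled.

```python
import random


def betweenness_mix(doc_hosts, host_scores, n_docs, alpha=0.5, tail=0.25, seed=0):
    """Toy mix: `alpha` of the budget from top-centrality hosts, rest from bottom.

    doc_hosts:   dict mapping document id -> host name
    host_scores: dict mapping host name -> betweenness centrality
    tail:        HYPOTHETICAL fraction of hosts treated as top / bottom
    """
    ranked = sorted(host_scores, key=host_scores.get)  # ascending centrality
    k = max(1, int(len(ranked) * tail))
    low_hosts, high_hosts = set(ranked[:k]), set(ranked[-k:])

    high_docs = [d for d, h in doc_hosts.items() if h in high_hosts]
    low_docs = [d for d, h in doc_hosts.items() if h in low_hosts]

    rng = random.Random(seed)
    n_high = min(round(n_docs * alpha), len(high_docs))
    n_low = min(n_docs - n_high, len(low_docs))
    # Sample without replacement inside each group.
    return rng.sample(high_docs, n_high) + rng.sample(low_docs, n_low)


# Tiny synthetic example: 8 hosts with 10 documents each.
host_scores = {f"h{i}": float(i) for i in range(8)}
doc_hosts = {f"d{i}_{j}": f"h{i}" for i in range(8) for j in range(10)}
picked = betweenness_mix(doc_hosts, host_scores, n_docs=10)
print(sorted(doc_hosts[d] for d in picked))
```

With `alpha=0.5`, half of the selected documents come from the two highest-scoring hosts and half from the two lowest-scoring ones; mid-centrality hosts contribute nothing.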
## Citation
```bibtex
@article{badoni2026webgraphmix,
title={Hubs or Fringes? Pretraining Data Selection via Web Graph Centrality},
author={Badoni, Vedant and Chen, Danqi and Wang, Xinyi},
year={2026}
}
```
## License
Released under the [MIT License](https://github.com/princeton-pli/WebGraphMix/blob/main/dclm/LICENSE), consistent with the [DCLM](https://github.com/mlfoundations/dclm) codebase used for training and evaluation.