license: mit
language:
- en
metrics:
- accuracy
pipeline_tag: text-generation
tags:
- openlm
- language-modeling
- causal-lm
- webgraphmix
- dclm
datasets:
- WebOrganizer/Corpus-200B
- PrincetonPLI/cc-centrality-scores
library_name: open_lm
WebGraphMix OpenLM 1B Checkpoints
Pretrained OpenLM 1B checkpoints from the paper "Hubs or Fringes? Pretraining Data Selection via Web Graph Centrality" (WebGraphMix).
These checkpoints replicate the headline 1B-scale experiments in Table 1 of the paper: four models, one per data-selection method, each trained on a mixture derived from WebOrganizer/Corpus-200B and evaluated on DCLM CORE v2 (mmlu_and_lowvar, 23 tasks).
| Resource | Link |
|---|---|
| Paper | arXiv:2606.11499 |
| Project page | princeton-pli.github.io/WebGraphMix |
| Code | github.com/princeton-pli/WebGraphMix |
| Centrality scores | PrincetonPLI/cc-centrality-scores |
| Base corpus | WebOrganizer/Corpus-200B |
Checkpoints
Each folder contains an OpenLM PyTorch checkpoint (epoch_11.pt, the final epoch). The eval metadata file (open_lm_1b_eval_params.txt) is shared by all four checkpoints and sits at the top level.
| Folder | Method | Training mixture | DCLM CORE v2 avg. |
|---|---|---|---|
| random_selection | Random baseline | Uniform sampling from Corpus-200B pool | 39.8% |
| dclm_fasttext_only | Quality (DCLM-fasttext) | Documents above DCLM-fasttext quality threshold | 42.3% |
| betweenness_alpha0.5 | WebGraphMix | 50/50 mix of top/bottom betweenness-centrality hosts | 41.4% |
| betweenness_alpha0.5_mult_div_dclm_fasttext | WebGraphMix+ | Betweenness 50/50 mix × DCLM-fasttext quality filter | 43.8% |
Scores are aggregated_results from the mmlu_and_lowvar eval suite (23 low-variance ICL tasks). See the WebGraphMix repo to reproduce evaluation.
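To make the comparison in the table easier to read, the short sketch below computes each method's gain over the random baseline in percentage points. The numbers are copied from the table; the variable names are only illustrative.

```python
# DCLM CORE v2 averages (%), copied from the checkpoint table.
scores = {
    "random_selection": 39.8,
    "dclm_fasttext_only": 42.3,
    "betweenness_alpha0.5": 41.4,
    "betweenness_alpha0.5_mult_div_dclm_fasttext": 43.8,
}

baseline = scores["random_selection"]

# Gain over the random baseline, in percentage points.
deltas = {name: round(score - baseline, 1) for name, score in scores.items()}

# Print the methods from largest to smallest gain.
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:45s} {scores[name]:5.1f}  ({delta:+.1f} vs. random)")
```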
Model details
| Setting | Value |
|---|---|
| Architecture | OpenLM 1B (open_lm_1b_swiglutorch) |
| Parameters | ~1.44B (1.34B non-embedding) |
| Hidden dim / layers / heads | 2048 / 24 / 16 |
| Context length | 2048 |
| Vocab | 50,432 (GPT-NeoX tokenizer) |
| FFN | SwiGLU (torch) |
| Norm | gain_only_lp_layer_norm |
| QK norm | enabled |
| Training tokens | ~28.8B (1b_1x_fast Chinchilla scale) |
| Global batch size | 256 |
| LR / warmup / weight decay | 0.003 / 5000 steps / 0.033 |
| Seed | 124 |
| Precision | AMP bfloat16 + FSDP |
| OpenLM version | 0.0.34 |
All four models share the same architecture and optimizer settings; they differ only in the importance-sampled pretraining mixture.
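As a rough sanity check on the figures above, the sketch below derives a few quantities from them. It assumes the global batch size counts sequences of the full 2048-token context; that reading, and the derived step count, are not stated in this card.

```python
# Values copied from the model details table.
params_total = 1.44e9      # approximate total parameters
vocab, hidden = 50_432, 2048
global_batch, context = 256, 2048
train_tokens = 28.8e9      # approximate training tokens

# One vocab x hidden embedding matrix: 103,284,736 parameters (~0.10B),
# which roughly matches the gap between the 1.44B and 1.34B figures.
embedding_params = vocab * hidden

# ASSUMPTION: each of the 256 sequences per step is a full 2048-token context.
tokens_per_step = global_batch * context        # 524,288

# Derived values; approximate because the inputs are rounded.
approx_steps = train_tokens / tokens_per_step   # about 54.9k optimizer steps
tokens_per_param = train_tokens / params_total  # about 20 tokens per parameter

print(f"embedding params:  {embedding_params:,}")
print(f"tokens per step:   {tokens_per_step:,}")
print(f"approx. steps:     {approx_steps:,.0f}")
print(f"tokens per param:  {tokens_per_param:.1f}")
```

The ratio of about 20 tokens per parameter is the usual Chinchilla rule of thumb, consistent with the "Chinchilla scale" label in the table.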
Download
huggingface-cli download PrincetonPLI/WebGraphMix-openlm-1B \
--local-dir ./dclm/checkpoints \
--repo-type model
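Since each checkpoint is large, it can be convenient to fetch a single folder rather than the whole repository. The sketch below uses `snapshot_download` from the `huggingface_hub` Python package with `allow_patterns`. It assumes the files on the Hub follow the layout shown under "Expected layout"; the helper names `patterns_for` and `download_one` are hypothetical.

```python
REPO_ID = "PrincetonPLI/WebGraphMix-openlm-1B"


def patterns_for(folder):
    """Glob patterns for one checkpoint folder plus the shared eval params file.

    ASSUMPTION: the Hub repo mirrors the layout described in this card.
    """
    return [f"{folder}/*", "open_lm_1b_eval_params.txt"]


def download_one(folder, local_dir="./dclm/checkpoints"):
    """Download a single checkpoint folder and return the local directory path."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="model",
        local_dir=local_dir,
        allow_patterns=patterns_for(folder),
    )


# Example (performs a multi-GB download when uncommented):
# download_one("betweenness_alpha0.5")
```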
Or from the WebGraphMix repo:
git clone https://github.com/princeton-pli/WebGraphMix.git
cd WebGraphMix
./experiments/artifacts/download.sh checkpoints
Expected layout after download:
checkpoints/
├── open_lm_1b_eval_params.txt
├── random_selection/epoch_11.pt
├── dclm_fasttext_only/epoch_11.pt
├── betweenness_alpha0.5/epoch_11.pt
└── betweenness_alpha0.5_mult_div_dclm_fasttext/epoch_11.pt
Approximate size: ~17 GB per checkpoint (~68 GB total).
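A small check like the following can confirm that a download is complete before starting evaluation. It only tests that the files from the layout above exist; the function name `missing_files` is illustrative and not part of the WebGraphMix repo.

```python
from pathlib import Path

# Relative paths taken from the expected layout above.
EXPECTED = [
    "open_lm_1b_eval_params.txt",
    "random_selection/epoch_11.pt",
    "dclm_fasttext_only/epoch_11.pt",
    "betweenness_alpha0.5/epoch_11.pt",
    "betweenness_alpha0.5_mult_div_dclm_fasttext/epoch_11.pt",
]


def missing_files(root):
    """Return the expected relative paths that are not present under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files("./dclm/checkpoints")
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected files are present.")
```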
Evaluate (recommended)
The checkpoints are stored in native OpenLM PyTorch format. The easiest path is the WebGraphMix evaluation pipeline:
conda env create -f environment.yml && conda activate webgraphmix
cd dclm && pip install -e . && cd ..
export REPO_ROOT=$(pwd)
./experiments/artifacts/download.sh checkpoints
# Default: WebGraphMix 50/50 betweenness
./experiments/eval/mmlu_and_lowvar.sh
# Other checkpoints
./experiments/eval/mmlu_and_lowvar.sh random_selection
./experiments/eval/mmlu_and_lowvar.sh dclm_fasttext_only
./experiments/eval/mmlu_and_lowvar.sh betweenness_alpha0.5_mult_div_dclm_fasttext
Aggregate scores across models:
cd dclm/exp_data/evals && python benchmark_score_comparison.py
Evaluation uses at least 2 GPUs by default (FSDP); on a single GPU the 1B model may run out of memory.
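For a quick look at a checkpoint file outside the evaluation pipeline, the sketch below loads it on the CPU with `torch.load` and lists its top-level entries. It assumes the file unpickles to a dictionary and makes no claim about which keys are present. Note that `weights_only=False` unpickles arbitrary Python objects, so it should only be used on files from a trusted source, and that loading a file of this size needs a comparable amount of free RAM.

```python
def describe(value):
    """Short human-readable description of one checkpoint entry."""
    shape = getattr(value, "shape", None)
    if shape is not None:
        return f"tensor{tuple(shape)}"
    if isinstance(value, dict):
        return f"dict with {len(value)} entries"
    return type(value).__name__


def summarize(checkpoint):
    """One line per top-level key of a loaded checkpoint dictionary."""
    return [f"{key}: {describe(value)}" for key, value in checkpoint.items()]


def load_and_summarize(path):
    """Load a checkpoint on the CPU and summarize it.

    ASSUMPTION: the file unpickles to a dict.
    """
    import torch  # imported lazily; only needed for real checkpoints

    checkpoint = torch.load(path, map_location="cpu", weights_only=False)
    return summarize(checkpoint)


# Example (needs the downloaded file and enough RAM):
# for line in load_and_summarize("dclm/checkpoints/random_selection/epoch_11.pt"):
#     print(line)
```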
Convert to Hugging Face format (optional)
To load a checkpoint with transformers and the open_lm Hugging Face wrappers, convert it first:
export REPO_ROOT=/path/to/WebGraphMix
export CHECKPOINT_INPUT_DIR=$REPO_ROOT/dclm/checkpoints
export CHECKPOINT_HF_OUTPUT_DIR=$REPO_ROOT/dclm/checkpoints_hf
python dclm/convert_openlm_to_hf_1b.py
This produces Hugging Face–compatible folders with OpenLMConfig / OpenLMForCausalLM weights and the GPT-NeoX tokenizer.
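Once converted, a folder can presumably be loaded in the usual transformers way. The sketch below follows the pattern used by other OpenLM-based model cards, where importing `open_lm.hf` makes the OpenLM classes available to the `Auto*` loaders. The exact folder names under the output directory are an assumption and are not confirmed by this card, so the path in the example may need adjusting.

```python
def load_converted(folder):
    """Load tokenizer and model from a converted Hugging Face folder."""
    # Importing open_lm.hf is assumed to register OpenLMConfig and
    # OpenLMForCausalLM with transformers as a side effect.
    import open_lm.hf  # noqa: F401
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(folder)
    model = AutoModelForCausalLM.from_pretrained(folder)
    return tokenizer, model


def generate(folder, prompt, max_new_tokens=32):
    """Greedy generation from a converted checkpoint."""
    tokenizer, model = load_converted(folder)
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(inputs["input_ids"], max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)


# Example; the folder name below is a guess at the converter's output:
# print(generate("dclm/checkpoints_hf/betweenness_alpha0.5", "The web graph is"))
```

These are base language models without instruction tuning, so the output is a plain continuation of the prompt.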
Training data (summary)
| Checkpoint | Mixture description |
|---|---|
| random_selection | Uniform random document sampling |
| dclm_fasttext_only | DCLM-fasttext quality filter only |
| betweenness_alpha0.5 | 50% documents from highest-betweenness hosts + 50% from lowest-betweenness hosts |
| betweenness_alpha0.5_mult_div_dclm_fasttext | Same 50/50 betweenness mix, combined with DCLM-fasttext quality scores (multiply/divide scheme) |
Centrality scores come from PrincetonPLI/cc-centrality-scores. Full sampling and tokenization steps are documented in the WebGraphMix README.
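To give an intuition for the 50/50 betweenness mix, here is a toy sketch of drawing documents from the highest- and lowest-centrality hosts. It is not the authors' sampling code: the function name, the `tail_frac` cutoff that defines "highest" and "lowest", the reading of `alpha` as the share drawn from high-centrality hosts, and sampling with replacement are all assumptions made for illustration. The multiply/divide combination with quality scores is not covered.

```python
import random


def mix_by_centrality(docs, host_score, n_samples, alpha=0.5, tail_frac=0.25, seed=0):
    """Toy high/low centrality mix (illustrative only).

    docs:        list of (doc_id, host) pairs
    host_score:  dict mapping host -> centrality score
    alpha:       ASSUMED share of samples drawn from high-centrality hosts
    tail_frac:   HYPOTHETICAL fraction of hosts counted as "highest"/"lowest"
    """
    hosts = sorted(host_score, key=host_score.get)  # ascending centrality
    k = max(1, int(len(hosts) * tail_frac))
    low_hosts, high_hosts = set(hosts[:k]), set(hosts[-k:])

    low_docs = [doc for doc, host in docs if host in low_hosts]
    high_docs = [doc for doc, host in docs if host in high_hosts]

    n_high = round(alpha * n_samples)
    rng = random.Random(seed)
    # Sample with replacement from each pool, then concatenate.
    return rng.choices(high_docs, k=n_high) + rng.choices(low_docs, k=n_samples - n_high)


# Tiny example: four hosts, one document each.
host_score = {"a.org": 1.0, "b.org": 2.0, "c.org": 3.0, "d.org": 4.0}
docs = [("doc_a", "a.org"), ("doc_b", "b.org"), ("doc_c", "c.org"), ("doc_d", "d.org")]
sample = mix_by_centrality(docs, host_score, n_samples=4)
print(sample)
```

With the defaults, half of the sample comes from the single most central host and half from the least central one; mid-centrality hosts are never drawn.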
Citation
@article{badoni2026webgraphmix,
title={Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality},
author={Badoni, Vedant and Chen, Danqi and Wang, Xinyi},
journal={arXiv preprint arXiv:2606.11499},
year={2026}
}
License
Released under the MIT License, consistent with the DCLM codebase used for training and evaluation.