license: mit
language:
  - en
metrics:
  - accuracy
pipeline_tag: text-generation
tags:
  - openlm
  - language-modeling
  - causal-lm
  - webgraphmix
  - dclm
datasets:
  - WebOrganizer/Corpus-200B
  - PrincetonPLI/cc-centrality-scores
library_name: open_lm

WebGraphMix OpenLM 1B Checkpoints

Pretrained OpenLM 1B checkpoints from the paper "Hubs or Fringes? Pretraining Data Selection via Web Graph Centrality" (WebGraphMix).

These models reproduce the headline 1B-scale experiments from Table 1: one model per data-selection method, each trained on a mixture derived from WebOrganizer/Corpus-200B and evaluated on DCLM CORE v2 (the mmlu_and_lowvar suite, 23 tasks).

Resource	Link
Paper	arXiv:2606.11499
Project page	princeton-pli.github.io/WebGraphMix
Code	github.com/princeton-pli/WebGraphMix
Centrality scores	PrincetonPLI/cc-centrality-scores
Base corpus	WebOrganizer/Corpus-200B

Checkpoints

Each folder contains one OpenLM PyTorch checkpoint (epoch_11.pt, the final epoch). The eval metadata file shared by all four models, open_lm_1b_eval_params.txt, sits at the top level of the repository.

Folder	Method	Training mixture	DCLM CORE v2 avg.
`random_selection`	Random baseline	Uniform sampling from Corpus-200B pool	39.8%
`dclm_fasttext_only`	Quality (DCLM-fasttext)	Documents above DCLM-fasttext quality threshold	42.3%
`betweenness_alpha0.5`	WebGraphMix	50/50 mix of top/bottom betweenness-centrality hosts	41.4%
`betweenness_alpha0.5_mult_div_dclm_fasttext`	WebGraphMix+	Betweenness 50/50 mix × DCLM-fasttext quality filter	43.8%

Scores are aggregated_results from the mmlu_and_lowvar eval suite (23 low-variance ICL tasks). See the WebGraphMix repo to reproduce evaluation.
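
To make the gaps between methods easier to read, the short sketch below recomputes each method's gain over the random baseline in percentage points. The scores are copied from the table above; the helper name `gains_over` is only illustrative and is not part of the WebGraphMix codebase.

```python
# DCLM CORE v2 averages copied from the checkpoint table (percent).
scores = {
    "random_selection": 39.8,
    "dclm_fasttext_only": 42.3,
    "betweenness_alpha0.5": 41.4,
    "betweenness_alpha0.5_mult_div_dclm_fasttext": 43.8,
}


def gains_over(baseline: str, table: dict[str, float]) -> dict[str, float]:
    """Return each method's gain over `baseline`, in percentage points."""
    base = table[baseline]
    return {
        name: round(score - base, 1)
        for name, score in table.items()
        if name != baseline
    }


gains = gains_over("random_selection", scores)
# Print the largest gain first.
for name, delta in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<45} {delta:+.1f} pp")
```

By the same arithmetic, WebGraphMix+ is 1.5 points above the DCLM-fasttext filter alone (43.8 vs. 42.3).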

Model details


Property	Value
Architecture	OpenLM 1B (`open_lm_1b_swiglutorch`)
Parameters	~1.44B (1.34B non-embedding)
Hidden dim / layers / heads	2048 / 24 / 16
Context length	2048
Vocab	50,432 (GPT-NeoX tokenizer)
FFN	SwiGLU (torch)
Norm	`gain_only_lp_layer_norm`
QK norm	enabled
Training tokens	~28.8B (`1b_1x_fast` Chinchilla scale)
Global batch size	256
LR / warmup / weight decay	0.003 / 5000 steps / 0.033
Seed	124
Precision	AMP bfloat16 + FSDP
OpenLM version	0.0.34

All four models share the same architecture and optimizer settings; they differ only in the importance-sampled pretraining mixture.
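
As a sanity check on the table, the parameter and token figures can be approximately rederived from the listed dimensions. This is a back-of-the-envelope sketch, not output from OpenLM. It assumes bias-free projections, a SwiGLU hidden width of 8/3 times the model width rounded up to a multiple of 256, an output head that is not tied to the input embedding, about 20 training tokens per parameter, and a global batch size counted in sequences. None of these is stated in this card.

```python
# Values taken from the model-details table.
d_model, n_layers, vocab, seq_len = 2048, 24, 50_432, 2048
global_batch = 256  # ASSUMPTION: counted in sequences, not tokens

# ASSUMPTION: SwiGLU hidden width is 8/3 * d_model, rounded up to a multiple of 256.
ffn_hidden = 256 * ((int(2 * 4 * d_model / 3) + 255) // 256)

attn = 4 * d_model * d_model      # Q, K, V and output projections, no biases
ffn = 3 * d_model * ffn_hidden    # gate, up and down projections
blocks = n_layers * (attn + ffn)
embed = vocab * d_model           # input embedding matrix

# ASSUMPTION: untied output head, so the vocabulary matrix is counted twice.
# The small norm gain vectors are ignored.
total = blocks + 2 * embed
non_embedding = total - embed

tokens = 20 * total               # ASSUMPTION: ~20 tokens per parameter
steps = tokens // (global_batch * seq_len)

print(f"total params          ~{total / 1e9:.2f}B")
print(f"non-embedding params  ~{non_embedding / 1e9:.2f}B")
print(f"training tokens       ~{tokens / 1e9:.1f}B")
print(f"optimizer steps       ~{steps:,}")
```

Under these assumptions the estimate agrees with the ~1.44B / 1.34B parameter counts and the ~28.8B token budget listed above, and the 5000 warmup steps come out to roughly 9% of the run.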

Download

huggingface-cli download PrincetonPLI/WebGraphMix-openlm-1B \
  --local-dir ./dclm/checkpoints \
  --repo-type model

Or from the WebGraphMix repo:

git clone https://github.com/princeton-pli/WebGraphMix.git
cd WebGraphMix
./experiments/artifacts/download.sh checkpoints

Expected layout after download:

checkpoints/
├── open_lm_1b_eval_params.txt
├── random_selection/epoch_11.pt
├── dclm_fasttext_only/epoch_11.pt
├── betweenness_alpha0.5/epoch_11.pt
└── betweenness_alpha0.5_mult_div_dclm_fasttext/epoch_11.pt
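
A small check along these lines can confirm that a download matches the tree above before starting a long evaluation run. The function name `missing_files` and the example root path are illustrative; only the file and folder names come from the layout shown here.

```python
from pathlib import Path

# Checkpoint folders listed in the expected layout.
CHECKPOINTS = (
    "random_selection",
    "dclm_fasttext_only",
    "betweenness_alpha0.5",
    "betweenness_alpha0.5_mult_div_dclm_fasttext",
)


def missing_files(root: str) -> list[str]:
    """Return the expected files that are absent under `root`, as relative paths."""
    expected = ["open_lm_1b_eval_params.txt"]
    expected += [f"{name}/epoch_11.pt" for name in CHECKPOINTS]
    return [rel for rel in expected if not (Path(root) / rel).is_file()]


# Example usage (path is an assumption; point it at your own download directory):
# missing = missing_files("./dclm/checkpoints")
# print("layout OK" if not missing else f"missing: {missing}")
```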

Size: roughly 17 GB per checkpoint (about 68 GB for all four).
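
Given the size of each checkpoint, it may be preferable to fetch a single folder rather than the whole repository. The sketch below uses `snapshot_download` from the `huggingface_hub` Python package with `allow_patterns`. It assumes the repository mirrors the layout shown above; the helper names `patterns_for` and `download_one` are illustrative.

```python
def patterns_for(folder: str) -> list[str]:
    """Glob patterns selecting one checkpoint folder plus the shared eval file."""
    return [f"{folder}/*", "open_lm_1b_eval_params.txt"]


def download_one(folder: str, local_dir: str = "./dclm/checkpoints") -> str:
    """Download a single checkpoint folder and return the local directory path."""
    # Imported lazily so patterns_for() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="PrincetonPLI/WebGraphMix-openlm-1B",
        repo_type="model",
        local_dir=local_dir,
        allow_patterns=patterns_for(folder),  # skip the other three checkpoints
    )


# Example usage (downloads roughly 17 GB):
# download_one("betweenness_alpha0.5")
```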

Evaluate (recommended)

The checkpoints are stored in native OpenLM PyTorch format. The easiest path is the WebGraphMix evaluation pipeline:

conda env create -f environment.yml && conda activate webgraphmix
cd dclm && pip install -e . && cd ..

export REPO_ROOT=$(pwd)
./experiments/artifacts/download.sh checkpoints

# Default: WebGraphMix 50/50 betweenness
./experiments/eval/mmlu_and_lowvar.sh

# Other checkpoints
./experiments/eval/mmlu_and_lowvar.sh random_selection
./experiments/eval/mmlu_and_lowvar.sh dclm_fasttext_only
./experiments/eval/mmlu_and_lowvar.sh betweenness_alpha0.5_mult_div_dclm_fasttext

Aggregate scores across models:

cd dclm/exp_data/evals && python benchmark_score_comparison.py

By default, evaluation runs under FSDP on at least 2 GPUs; on a single GPU the 1B model may run out of memory.
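
To evaluate all four checkpoints in one go, the shell commands above can be wrapped in a small driver. This is a sketch, not a script shipped with the repo: `eval_commands` and `run_all` are hypothetical names, and it relies only on the invocations shown above (no argument for the default checkpoint, the folder name otherwise).

```python
import subprocess

SCRIPT = "./experiments/eval/mmlu_and_lowvar.sh"
DEFAULT = "betweenness_alpha0.5"  # evaluated when the script gets no argument
CHECKPOINTS = [
    "random_selection",
    "dclm_fasttext_only",
    "betweenness_alpha0.5",
    "betweenness_alpha0.5_mult_div_dclm_fasttext",
]


def eval_commands(names=CHECKPOINTS):
    """Build one argv list per checkpoint; the default one takes no argument."""
    return [[SCRIPT] if name == DEFAULT else [SCRIPT, name] for name in names]


def run_all(repo_root: str) -> None:
    """Run the evaluations sequentially from the repo root, stopping on the first failure."""
    for argv in eval_commands():
        subprocess.run(argv, cwd=repo_root, check=True)


# Example usage, after the setup steps above:
# run_all("/path/to/WebGraphMix")
```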

Convert to Hugging Face format (optional)

To load the checkpoints with transformers through the open_lm Hugging Face wrappers, convert them first:

export REPO_ROOT=/path/to/WebGraphMix
export CHECKPOINT_INPUT_DIR=$REPO_ROOT/dclm/checkpoints
export CHECKPOINT_HF_OUTPUT_DIR=$REPO_ROOT/dclm/checkpoints_hf
python dclm/convert_openlm_to_hf_1b.py

This produces Hugging Face–compatible folders with OpenLMConfig / OpenLMForCausalLM weights and the GPT-NeoX tokenizer.
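
Once converted, a folder could be loaded roughly as follows. Treat this as a sketch under assumptions rather than documented usage: it assumes the converter writes one subfolder per checkpoint name under `CHECKPOINT_HF_OUTPUT_DIR`, and that `OpenLMForCausalLM` is importable from `open_lm.hf`. Neither is confirmed by this card, so adjust paths and imports to what the conversion script actually produces. `hf_dir` and `generate` are illustrative helper names.

```python
from pathlib import Path


def hf_dir(output_root: str, checkpoint: str) -> Path:
    """ASSUMED layout: one converted folder per checkpoint name."""
    return Path(output_root) / checkpoint


def generate(output_root: str, checkpoint: str, prompt: str, max_new_tokens: int = 32) -> str:
    """Load a converted checkpoint and return a short greedy continuation of `prompt`."""
    # Lazy imports: these require `transformers` and `open_lm` to be installed.
    from transformers import AutoTokenizer
    from open_lm.hf import OpenLMForCausalLM  # ASSUMED import path for the wrapper

    path = str(hf_dir(output_root, checkpoint))
    tokenizer = AutoTokenizer.from_pretrained(path)   # GPT-NeoX tokenizer saved by the converter
    model = OpenLMForCausalLM.from_pretrained(path)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(inputs["input_ids"], max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example usage:
# print(generate("/path/to/WebGraphMix/dclm/checkpoints_hf", "betweenness_alpha0.5", "The web graph"))
```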

Training data (summary)

Checkpoint	Mixture description
`random_selection`	Uniform random document sampling
`dclm_fasttext_only`	DCLM-fasttext quality filter only
`betweenness_alpha0.5`	50% documents from highest-betweenness hosts + 50% from lowest-betweenness hosts
`betweenness_alpha0.5_mult_div_dclm_fasttext`	Same 50/50 betweenness mix, combined with DCLM-fasttext quality scores (multiply/divide scheme)

Centrality scores come from PrincetonPLI/cc-centrality-scores. Full sampling and tokenization steps are documented in the WebGraphMix README.
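
To give a feel for what a 50/50 betweenness mix means, here is a toy illustration. It is not the sampling procedure used in the paper or the repo: the function `mix_by_betweenness`, the `tail_frac` cutoff that defines "highest" and "lowest" hosts, and the example hosts and scores are all made up for this sketch. The multiply/divide combination with DCLM-fasttext scores is not modeled.

```python
import random


def mix_by_betweenness(docs, host_score, n_total, alpha=0.5, tail_frac=0.25, seed=0):
    """Sample `n_total` doc ids: a fraction `alpha` from the highest-betweenness
    hosts and the remainder from the lowest-betweenness hosts.

    docs:       list of (doc_id, host) pairs
    host_score: dict mapping host -> betweenness centrality
    tail_frac:  HYPOTHETICAL share of hosts treated as "highest" / "lowest"
    """
    ranked = sorted(host_score, key=host_score.get)   # ascending centrality
    k = max(1, int(len(ranked) * tail_frac))
    low_hosts, high_hosts = set(ranked[:k]), set(ranked[-k:])

    high_docs = [doc for doc, host in docs if host in high_hosts]
    low_docs = [doc for doc, host in docs if host in low_hosts]

    n_high = round(alpha * n_total)
    rng = random.Random(seed)
    # random.sample raises ValueError if a pool holds fewer docs than requested.
    return rng.sample(high_docs, n_high) + rng.sample(low_docs, n_total - n_high)


# Toy data: four hosts with made-up scores, five documents each.
host_score = {
    "hub.example": 0.90,
    "mid-a.example": 0.40,
    "mid-b.example": 0.30,
    "fringe.example": 0.01,
}
docs = [(f"{host}/{i}", host) for host in host_score for i in range(5)]

sample = mix_by_betweenness(docs, host_score, n_total=4)
print(sample)
```

With `alpha=0.5` the sample holds two documents from the hub host and two from the fringe host; documents from mid-ranked hosts are never drawn in this toy version.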

Citation

@article{badoni2026webgraphmix,
  title={Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality},
  author={Badoni, Vedant and Chen, Danqi and Wang, Xinyi},
  year={2026}
}

License

Released under the MIT License, consistent with the DCLM codebase used for training and evaluation.