license: mit
language:
  - en
metrics:
  - accuracy
pipeline_tag: text-generation
tags:
  - openlm
  - language-modeling
  - causal-lm
  - webgraphmix
  - dclm
datasets:
  - WebOrganizer/Corpus-200B
  - PrincetonPLI/cc-centrality-scores
library_name: open_lm

WebGraphMix OpenLM 1B Checkpoints

Pretrained OpenLM 1B checkpoints from "Hubs or Fringes? Pretraining Data Selection via Web Graph Centrality" (WebGraphMix).

These models replicate the headline 1B-scale experiments from Table 1 of the paper: four data-selection methods, each trained on a mixture derived from WebOrganizer/Corpus-200B and evaluated on DCLM CORE v2 (the mmlu_and_lowvar suite, 23 tasks).

Checkpoints

Each folder contains an OpenLM PyTorch checkpoint (epoch_11.pt, final epoch) plus shared eval metadata.

| Folder | Method | Training mixture | DCLM CORE v2 avg. |
|---|---|---|---|
| random_selection | Random baseline | Uniform sampling from Corpus-200B pool | 39.8% |
| dclm_fasttext_only | Quality (DCLM-fasttext) | Documents above DCLM-fasttext quality threshold | 42.3% |
| betweenness_alpha0.5 | WebGraphMix | 50/50 mix of top/bottom betweenness-centrality hosts | 41.4% |
| betweenness_alpha0.5_mult_div_dclm_fasttext | WebGraphMix+ | Betweenness 50/50 mix × DCLM-fasttext quality filter | 43.8% |

Scores are aggregated_results from the mmlu_and_lowvar eval suite (23 low-variance ICL tasks). See the WebGraphMix repo to reproduce evaluation.
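To make the comparison concrete, the short sketch below computes each method's gain over the random baseline in percentage points. The scores are copied from the table above; the function name `gains_over_baseline` is illustrative and not part of the WebGraphMix codebase.

```python
# Average DCLM CORE v2 scores (%) copied from the checkpoint table.
SCORES = {
    "random_selection": 39.8,
    "dclm_fasttext_only": 42.3,
    "betweenness_alpha0.5": 41.4,
    "betweenness_alpha0.5_mult_div_dclm_fasttext": 43.8,
}


def gains_over_baseline(scores, baseline="random_selection"):
    """Return each method's gain over the baseline, in percentage points."""
    base = scores[baseline]
    return {
        name: round(value - base, 1)
        for name, value in scores.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, gain in gains_over_baseline(SCORES).items():
        print(f"{name}: {gain:+.1f} points")
```

On these numbers the centrality mix alone gains 1.6 points, the quality filter alone gains 2.5 points, and their combination gains 4.0 points.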

Model details

| Property | Value |
|---|---|
| Architecture | OpenLM 1B (open_lm_1b_swiglutorch) |
| Parameters | ~1.44B (1.34B non-embedding) |
| Hidden dim / layers / heads | 2048 / 24 / 16 |
| Context length | 2048 |
| Vocab | 50,432 (GPT-NeoX tokenizer) |
| FFN | SwiGLU (torch) |
| Norm | gain_only_lp_layer_norm |
| QK norm | enabled |
| Training tokens | ~28.8B (1b_1x_fast Chinchilla scale) |
| Global batch size | 256 |
| LR / warmup / weight decay | 0.003 / 5000 steps / 0.033 |
| Seed | 124 |
| Precision | AMP bfloat16 + FSDP |
| OpenLM version | 0.0.34 |

All four models share the same architecture and optimizer settings; they differ only in the importance-sampled pretraining mixture.
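As a sanity check on the figures above, the following sketch derives a few quantities from them. It assumes that the global batch size counts sequences and that every sequence is packed to the full 2048-token context; neither assumption is stated in this card, so treat the step count as an estimate.

```python
# Values taken from the model details above (token and parameter counts are approximate).
TRAINING_TOKENS = 28.8e9
TOTAL_PARAMS = 1.44e9
GLOBAL_BATCH_SIZE = 256   # assumed to be measured in sequences
CONTEXT_LENGTH = 2048     # assumed: every sequence is fully packed
VOCAB_SIZE = 50_432
HIDDEN_DIM = 2048


def tokens_per_param():
    """Training tokens per model parameter."""
    return TRAINING_TOKENS / TOTAL_PARAMS


def tokens_per_step():
    """Tokens consumed by one optimizer step under the assumptions above."""
    return GLOBAL_BATCH_SIZE * CONTEXT_LENGTH


def optimizer_steps():
    """Estimated number of optimizer steps for the full token budget."""
    return TRAINING_TOKENS / tokens_per_step()


def embedding_params():
    """Size of a single vocab-by-hidden embedding matrix."""
    return VOCAB_SIZE * HIDDEN_DIM


if __name__ == "__main__":
    print(f"tokens per parameter: {tokens_per_param():.1f}")
    print(f"tokens per step:      {tokens_per_step():,}")
    print(f"optimizer steps:      {optimizer_steps():,.0f}")
    print(f"embedding parameters: {embedding_params():,}")
```

Under these assumptions the budget works out to about 20 tokens per parameter, 524,288 tokens per step, and roughly 54,900 optimizer steps. A single embedding matrix holds 103,284,736 parameters, which is close to the 0.10B gap between the total and non-embedding counts.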

Download

huggingface-cli download PrincetonPLI/WebGraphMix-openlm-1B \
  --local-dir ./dclm/checkpoints \
  --repo-type model

Or from the WebGraphMix repo:

git clone https://github.com/princeton-pli/WebGraphMix.git
cd WebGraphMix
./experiments/artifacts/download.sh checkpoints
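Because each checkpoint is large, it can be useful to fetch only one of them. The sketch below uses `snapshot_download` from `huggingface_hub` with `allow_patterns`. It assumes the Hub repository mirrors the folder names listed in this card (one folder per method plus `open_lm_1b_eval_params.txt` at the top level); the helper names `patterns_for` and `download_one` are illustrative.

```python
REPO_ID = "PrincetonPLI/WebGraphMix-openlm-1B"

# Folder names as listed in this card.
CHECKPOINTS = (
    "random_selection",
    "dclm_fasttext_only",
    "betweenness_alpha0.5",
    "betweenness_alpha0.5_mult_div_dclm_fasttext",
)


def patterns_for(checkpoint):
    """Build glob patterns selecting one checkpoint folder plus the shared eval params."""
    if checkpoint not in CHECKPOINTS:
        raise ValueError(f"unknown checkpoint: {checkpoint!r}")
    # The trailing "/" keeps "betweenness_alpha0.5" from also matching
    # "betweenness_alpha0.5_mult_div_dclm_fasttext".
    return [f"{checkpoint}/*", "open_lm_1b_eval_params.txt"]


def download_one(checkpoint, local_dir="./dclm/checkpoints"):
    """Download a single checkpoint folder into local_dir and return the local path."""
    # Imported lazily so the helpers above work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="model",
        local_dir=local_dir,
        allow_patterns=patterns_for(checkpoint),
    )


if __name__ == "__main__":
    print(download_one("betweenness_alpha0.5"))
```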

Expected layout after download:

checkpoints/
├── open_lm_1b_eval_params.txt
├── random_selection/epoch_11.pt
├── dclm_fasttext_only/epoch_11.pt
├── betweenness_alpha0.5/epoch_11.pt
└── betweenness_alpha0.5_mult_div_dclm_fasttext/epoch_11.pt

Size: roughly 17 GB per checkpoint (about 68 GB total).
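A quick way to confirm that a download finished is to compare the directory against the expected layout. The sketch below is a hypothetical helper, not part of the WebGraphMix repo; it checks only that the files exist, not their sizes or hashes.

```python
from pathlib import Path

# Relative paths from the expected layout shown above.
EXPECTED_FILES = (
    "open_lm_1b_eval_params.txt",
    "random_selection/epoch_11.pt",
    "dclm_fasttext_only/epoch_11.pt",
    "betweenness_alpha0.5/epoch_11.pt",
    "betweenness_alpha0.5_mult_div_dclm_fasttext/epoch_11.pt",
)


def missing_files(root):
    """Return the expected relative paths that are not regular files under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files("./dclm/checkpoints")
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected files are present.")
```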

Evaluate (recommended)

The checkpoints are stored in native OpenLM PyTorch format. The easiest path is the WebGraphMix evaluation pipeline:

conda env create -f environment.yml && conda activate webgraphmix
cd dclm && pip install -e . && cd ..

export REPO_ROOT=$(pwd)
./experiments/artifacts/download.sh checkpoints

# Default: WebGraphMix 50/50 betweenness
./experiments/eval/mmlu_and_lowvar.sh

# Other checkpoints
./experiments/eval/mmlu_and_lowvar.sh random_selection
./experiments/eval/mmlu_and_lowvar.sh dclm_fasttext_only
./experiments/eval/mmlu_and_lowvar.sh betweenness_alpha0.5_mult_div_dclm_fasttext

Aggregate scores across models:

cd dclm/exp_data/evals && python benchmark_score_comparison.py

Evaluation uses ≥2 GPUs by default (FSDP); a single GPU may run out of memory on the 1B model.

Convert to Hugging Face format (optional)

To load the checkpoints with transformers and the open_lm Hugging Face wrappers:

export REPO_ROOT=/path/to/WebGraphMix
export CHECKPOINT_INPUT_DIR=$REPO_ROOT/dclm/checkpoints
export CHECKPOINT_HF_OUTPUT_DIR=$REPO_ROOT/dclm/checkpoints_hf
python dclm/convert_openlm_to_hf_1b.py

This produces Hugging Face–compatible folders with OpenLMConfig / OpenLMForCausalLM weights and the GPT-NeoX tokenizer.

Training data (summary)

| Checkpoint | Mixture description |
|---|---|
| random_selection | Uniform random document sampling |
| dclm_fasttext_only | DCLM-fasttext quality filter only |
| betweenness_alpha0.5 | 50% documents from highest-betweenness hosts + 50% from lowest-betweenness hosts |
| betweenness_alpha0.5_mult_div_dclm_fasttext | Same 50/50 betweenness mix, combined with DCLM-fasttext quality scores (multiply/divide scheme) |

Centrality scores come from PrincetonPLI/cc-centrality-scores. Full sampling and tokenization steps are documented in the WebGraphMix README.
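The idea behind the 50/50 betweenness mix can be illustrated with a toy sketch. This is not the authors' pipeline, which this card describes as importance-sampled and which is documented in the WebGraphMix README. The sketch assumes each document carries the betweenness score of its host and picks deterministically from the two ends of the ranking; the function `mix_by_centrality` and its parameters are hypothetical.

```python
def mix_by_centrality(docs, budget, alpha=0.5):
    """Pick `budget` documents: a fraction `alpha` from the highest-centrality
    end of the ranking and the remainder from the lowest-centrality end.

    docs: list of (doc_id, host_centrality) pairs.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if budget > len(docs):
        raise ValueError("budget exceeds the number of documents")

    # Rank documents from most to least central host.
    ranked = sorted(docs, key=lambda d: d[1], reverse=True)
    n_top = round(alpha * budget)
    n_bottom = budget - n_top

    top = ranked[:n_top]
    # Slicing from the end; guard n_bottom == 0, since ranked[len(ranked):] is
    # empty but ranked[-0:] would return the whole list.
    bottom = ranked[len(ranked) - n_bottom:] if n_bottom else []
    return [doc_id for doc_id, _ in top + bottom]


if __name__ == "__main__":
    toy_docs = [("a", 0.9), ("b", 0.1), ("c", 0.5), ("d", 0.7), ("e", 0.2), ("f", 0.4)]
    print(mix_by_centrality(toy_docs, budget=4, alpha=0.5))
```

With `alpha=0.5` and a budget of 4, the toy example keeps the two documents from the most central hosts (`a`, `d`) and the two from the least central hosts (`e`, `b`), skipping the middle of the ranking.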

Citation

@article{badoni2026webgraphmix,
  title={Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality},
  author={Badoni, Vedant and Chen, Danqi and Wang, Xinyi},
  year={2026}
}

License

Released under the MIT License, consistent with the DCLM codebase used for training and evaluation.