---
license: mit
language:
- en
metrics:
- accuracy
pipeline_tag: text-generation
tags:
- openlm
- language-modeling
- causal-lm
- webgraphmix
- dclm
datasets:
- WebOrganizer/Corpus-200B
- PrincetonPLI/cc-centrality-scores
library_name: open_lm
---

# WebGraphMix OpenLM 1B Checkpoints

Pretrained **OpenLM 1B** checkpoints from [**Hubs or Fringes? Pretraining Data Selection via Web Graph Centrality**](https://arxiv.org/abs/2606.11499) (WebGraphMix).

These models reproduce the paper's headline **1B-scale experiments (Table 1)**: four data-selection methods, each trained on a mixture derived from [WebOrganizer/Corpus-200B](https://huggingface.co/datasets/WebOrganizer/Corpus-200B) and evaluated on **DCLM CORE v2** (`mmlu_and_lowvar`, 23 tasks).

| Resource | Link |
|----------|------|
| Paper | [arXiv:2606.11499](https://arxiv.org/abs/2606.11499) |
| Project page | [princeton-pli.github.io/WebGraphMix](https://princeton-pli.github.io/WebGraphMix/) |
| Code | [github.com/princeton-pli/WebGraphMix](https://github.com/princeton-pli/WebGraphMix) |
| Centrality scores | [PrincetonPLI/cc-centrality-scores](https://huggingface.co/datasets/PrincetonPLI/cc-centrality-scores) |
| Base corpus | [WebOrganizer/Corpus-200B](https://huggingface.co/datasets/WebOrganizer/Corpus-200B) |

## Checkpoints

Each folder contains one OpenLM PyTorch checkpoint (`epoch_11.pt`, the final epoch). The evaluation parameters file `open_lm_1b_eval_params.txt` is shared by all four models and sits alongside the folders.

| Folder | Method | Training mixture | DCLM CORE v2 avg. |
|--------|--------|------------------|-------------------|
| `random_selection` | Random baseline | Uniform sampling from Corpus-200B pool | 39.8% |
| `dclm_fasttext_only` | Quality (DCLM-fasttext) | Documents above DCLM-fasttext quality threshold | 42.3% |
| `betweenness_alpha0.5` | **WebGraphMix** | 50/50 mix of top/bottom betweenness-centrality hosts | 41.4% |
| `betweenness_alpha0.5_mult_div_dclm_fasttext` | **WebGraphMix+** | Betweenness 50/50 mix × DCLM-fasttext quality filter | 43.8% |

> Scores are `aggregated_results` from the `mmlu_and_lowvar` eval suite (23 low-variance ICL tasks). See the [WebGraphMix repo](https://github.com/princeton-pli/WebGraphMix) to reproduce evaluation.

## Model details

| Setting | Value |
|---|---|
| Architecture | OpenLM 1B (`open_lm_1b_swiglutorch`) |
| Parameters | ~1.44B (1.34B non-embedding) |
| Hidden dim / layers / heads | 2048 / 24 / 16 |
| Context length | 2048 |
| Vocab | 50,432 (GPT-NeoX tokenizer) |
| FFN | SwiGLU (torch) |
| Norm | `gain_only_lp_layer_norm` |
| QK norm | enabled |
| Training tokens | ~28.8B (`1b_1x_fast` Chinchilla scale) |
| Global batch size | 256 |
| LR / warmup / weight decay | 0.003 / 5000 steps / 0.033 |
| Seed | 124 |
| Precision | AMP bfloat16 + FSDP |
| OpenLM version | 0.0.34 |

All four models share the same architecture and optimizer settings; they differ only in the **importance-sampled pretraining mixture**.
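
The token budget and step count implied by the settings above can be checked with a little shell arithmetic. This is a back-of-the-envelope sketch, not something taken from the training logs: it assumes that the global batch size counts sequences, that every sequence fills the 2048-token context, and that the rounded figures of 1.44B parameters and 28.8B tokens are close enough. Under those assumptions the budget works out to 20 tokens per parameter, and the 5000 warmup steps cover roughly 9% of training. All variable names are illustrative.

```bash
#!/usr/bin/env bash
# Rough check of the training budget listed in the model details.
# Assumptions (not stated in the card): batch size is measured in sequences
# and each sequence fills the full 2048-token context.
params=1440000000      # ~1.44B parameters (rounded)
tokens=28800000000     # ~28.8B training tokens (rounded)
batch_size=256         # sequences per optimizer step (assumed unit)
context_len=2048       # tokens per sequence

tokens_per_param=$(( tokens / params ))
tokens_per_step=$(( batch_size * context_len ))
total_steps=$(( tokens / tokens_per_step ))   # integer division, rounds down

echo "tokens per parameter: ${tokens_per_param}"
echo "tokens per step:      ${tokens_per_step}"
echo "approx. total steps:  ${total_steps}"
```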

## Download

```bash
huggingface-cli download PrincetonPLI/WebGraphMix-openlm-1B \
  --local-dir ./dclm/checkpoints \
  --repo-type model
```
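
To fetch a single model instead of all four, the download can be restricted to one folder. The helper below is a sketch under a few assumptions: the Hub repository mirrors the folder layout described in this card, and the installed CLI supports the `--include` pattern filter and positional filenames of `huggingface-cli download`. `download_checkpoint` and the `HF_CLI` override are hypothetical names introduced here; setting `HF_CLI=echo` prints the commands instead of transferring any data.

```bash
#!/usr/bin/env bash
# Download one checkpoint folder plus the shared eval parameters file.
# HF_CLI is a hypothetical override for dry runs; it defaults to the real CLI.
HF_CLI="${HF_CLI:-huggingface-cli}"
REPO_ID="PrincetonPLI/WebGraphMix-openlm-1B"

download_checkpoint() {
  local name="$1"                          # e.g. betweenness_alpha0.5
  local dest="${2:-./dclm/checkpoints}"    # same target as the full download
  # First call: only files inside the chosen folder (glob filter).
  "$HF_CLI" download "$REPO_ID" \
    --include "${name}/*" \
    --local-dir "$dest" \
    --repo-type model &&
  # Second call: the single shared file, passed as a positional filename.
  "$HF_CLI" download "$REPO_ID" open_lm_1b_eval_params.txt \
    --local-dir "$dest" \
    --repo-type model
}

# Example (uncomment to run):
# download_checkpoint betweenness_alpha0.5
```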

Or from the WebGraphMix repo:

```bash
git clone https://github.com/princeton-pli/WebGraphMix.git
cd WebGraphMix
./experiments/artifacts/download.sh checkpoints
```

Expected layout after download:

```text
checkpoints/
├── open_lm_1b_eval_params.txt
├── random_selection/epoch_11.pt
├── dclm_fasttext_only/epoch_11.pt
├── betweenness_alpha0.5/epoch_11.pt
└── betweenness_alpha0.5_mult_div_dclm_fasttext/epoch_11.pt
```
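
A quick way to confirm that a download matches this layout is to test for each expected file. The function below is a minimal sketch; `check_layout` is a name made up for this example, and the default root `./dclm/checkpoints` is an assumption based on the `--local-dir` used in the download command. It prints one line per missing file and returns a non-zero status if anything is absent, so it can be chained as `check_layout ./dclm/checkpoints && echo "layout OK"`.

```bash
#!/usr/bin/env bash
# Report every file from the expected layout that is missing under a root directory.
check_layout() {
  local root="${1:-./dclm/checkpoints}"   # assumed default location
  local missing=0
  local rel
  for rel in \
    open_lm_1b_eval_params.txt \
    random_selection/epoch_11.pt \
    dclm_fasttext_only/epoch_11.pt \
    betweenness_alpha0.5/epoch_11.pt \
    betweenness_alpha0.5_mult_div_dclm_fasttext/epoch_11.pt
  do
    if [ ! -f "${root}/${rel}" ]; then
      echo "missing: ${rel}"
      missing=1
    fi
  done
  # 0 when everything is present, 1 when at least one file is missing.
  return "$missing"
}
```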

Size: roughly **17 GB per checkpoint** (about 68 GB for all four).
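
Because the full set needs tens of gigabytes, it can be worth checking free disk space before starting the download. The sketch below uses the POSIX `df -Pk` output format; `has_free_space` is a hypothetical helper name, and it treats a gigabyte as 1024 × 1024 KiB, which makes the check slightly conservative. A typical call would be `has_free_space . 68 || echo "not enough free disk space"`.

```bash
#!/usr/bin/env bash
# Succeed only if the filesystem holding DIR has at least REQUIRED_GB free.
has_free_space() {
  local dir="$1"
  local required_gb="$2"
  local avail_kb
  # In `df -Pk` output, the 4th column of the 2nd line is the available space in KiB.
  avail_kb="$(df -Pk "$dir" | awk 'NR==2 {print $4}')"
  [ "$avail_kb" -ge $(( required_gb * 1024 * 1024 )) ]
}
```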

## Evaluate (recommended)

The checkpoints are stored in **native OpenLM PyTorch format**. The easiest path is the WebGraphMix evaluation pipeline:

```bash
conda env create -f environment.yml && conda activate webgraphmix
cd dclm && pip install -e . && cd ..

export REPO_ROOT="$(pwd)"
./experiments/artifacts/download.sh checkpoints

# Default: WebGraphMix 50/50 betweenness
./experiments/eval/mmlu_and_lowvar.sh

# Other checkpoints
./experiments/eval/mmlu_and_lowvar.sh random_selection
./experiments/eval/mmlu_and_lowvar.sh dclm_fasttext_only
./experiments/eval/mmlu_and_lowvar.sh betweenness_alpha0.5_mult_div_dclm_fasttext
```
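
The per-checkpoint calls above can also be wrapped in a loop that skips any model that has not been downloaded. This is a sketch rather than part of the repo: `eval_all`, `EVAL_SCRIPT`, and `CKPT_ROOT` are illustrative names, the checkpoint root is assumed to be `./dclm/checkpoints`, and the script is assumed to accept `betweenness_alpha0.5` as an explicit argument in the same way it accepts the other three folder names.

```bash
#!/usr/bin/env bash
# Run the eval script once for every checkpoint folder that is present.
# EVAL_SCRIPT and CKPT_ROOT are illustrative variables, not defined by the repo.
EVAL_SCRIPT="${EVAL_SCRIPT:-./experiments/eval/mmlu_and_lowvar.sh}"
CKPT_ROOT="${CKPT_ROOT:-./dclm/checkpoints}"

eval_all() {
  local name
  for name in \
    random_selection \
    dclm_fasttext_only \
    betweenness_alpha0.5 \
    betweenness_alpha0.5_mult_div_dclm_fasttext
  do
    if [ -f "${CKPT_ROOT}/${name}/epoch_11.pt" ]; then
      # Keep going even if one evaluation fails; report the failure on stderr.
      "$EVAL_SCRIPT" "$name" || echo "eval failed: ${name}" >&2
    else
      echo "skipping ${name}: checkpoint not found" >&2
    fi
  done
}

# Example (uncomment to run):
# eval_all
```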

Aggregate scores across models:

```bash
cd dclm/exp_data/evals && python benchmark_score_comparison.py
```

Evaluation uses at least 2 GPUs by default (FSDP); a single GPU may run out of memory on the 1B model.

## Convert to Hugging Face format (optional)

To load the checkpoints with `transformers` through the `open_lm` Hugging Face wrappers, convert them first:

```bash
export REPO_ROOT=/path/to/WebGraphMix
export CHECKPOINT_INPUT_DIR="$REPO_ROOT/dclm/checkpoints"
export CHECKPOINT_HF_OUTPUT_DIR="$REPO_ROOT/dclm/checkpoints_hf"
python dclm/convert_openlm_to_hf_1b.py
```

This produces Hugging Face–compatible folders with `OpenLMConfig` / `OpenLMForCausalLM` weights and the GPT-NeoX tokenizer.

## Training data (summary)

| Checkpoint | Mixture description |
|------------|---------------------|
| `random_selection` | Uniform random document sampling |
| `dclm_fasttext_only` | DCLM-fasttext quality filter only |
| `betweenness_alpha0.5` | 50% of documents from the highest-betweenness hosts and 50% from the lowest-betweenness hosts |
| `betweenness_alpha0.5_mult_div_dclm_fasttext` | Same 50/50 betweenness mix, combined with DCLM-fasttext quality scores (multiply/divide scheme) |

Centrality scores come from [PrincetonPLI/cc-centrality-scores](https://huggingface.co/datasets/PrincetonPLI/cc-centrality-scores). Full sampling and tokenization steps are documented in the [WebGraphMix README](https://github.com/princeton-pli/WebGraphMix).

## Citation

```bibtex
@article{badoni2026webgraphmix,
  title={Hubs or Fringes? Pretraining Data Selection via Web Graph Centrality},
  author={Badoni, Vedant and Chen, Danqi and Wang, Xinyi},
  journal={arXiv preprint arXiv:2606.11499},
  year={2026}
}
```

## License

Released under the [MIT License](https://github.com/princeton-pli/WebGraphMix/blob/main/dclm/LICENSE), consistent with the [DCLM](https://github.com/mlfoundations/dclm) codebase used for training and evaluation.