+---
+license: mit
+language:
+- en
+metrics:
+- accuracy
+pipeline_tag: text-generation
+tags:
+  - openlm
+  - language-modeling
+  - causal-lm
+  - webgraphmix
+  - dclm
+datasets:
+  - WebOrganizer/Corpus-200B
+  - PrincetonPLI/cc-centrality-scores
+library_name: open_lm
+---
+# WebGraphMix OpenLM 1B Checkpoints
+Pretrained **OpenLM 1B** checkpoints from [**Hubs or Fringes? Pretraining Data Selection via Web Graph Centrality**](https://arxiv.org/abs/2606.11499) (WebGraphMix).
+These models replicate the headline **1B-scale Table 1** experiments: one model per data-selection method (four in total), each trained on a mixture derived from [WebOrganizer/Corpus-200B](https://huggingface.co/datasets/WebOrganizer/Corpus-200B) and evaluated on **DCLM CORE v2** (`mmlu_and_lowvar`, 23 tasks).
+| Resource | Link |
+|----------|------|
+| Paper | [arXiv:2606.11499](https://arxiv.org/abs/2606.11499) |
+| Project page | [princeton-pli.github.io/WebGraphMix](https://princeton-pli.github.io/WebGraphMix/) |
+| Code | [github.com/princeton-pli/WebGraphMix](https://github.com/princeton-pli/WebGraphMix) |
+| Centrality scores | [PrincetonPLI/cc-centrality-scores](https://huggingface.co/datasets/PrincetonPLI/cc-centrality-scores) |
+| Base corpus | [WebOrganizer/Corpus-200B](https://huggingface.co/datasets/WebOrganizer/Corpus-200B) |
+## Checkpoints
+Each folder contains one OpenLM PyTorch checkpoint (`epoch_11.pt`, the final epoch). The evaluation parameters file (`open_lm_1b_eval_params.txt`) is shared by all four models and sits at the top level.
+| Folder | Method | Training mixture | DCLM CORE v2 avg. |
+|--------|--------|------------------|-------------------|
+| `random_selection` | Random baseline | Uniform sampling from Corpus-200B pool | 40.5% |
+| `dclm_fasttext_only` | Quality (DCLM-fasttext) | Documents above DCLM-fasttext quality threshold | 43.1% |
+| `betweenness_alpha0.5` | **WebGraphMix** | 50/50 mix of top/bottom betweenness-centrality hosts | 41.4% |
+| `betweenness_alpha0.5_mult_div_dclm_fasttext` | **WebGraphMix+** | Betweenness 50/50 mix × DCLM-fasttext quality filter | 43.4% |
+> Scores are `aggregated_results` from the `mmlu_and_lowvar` eval suite (23 low-variance ICL tasks). See the [WebGraphMix repo](https://github.com/princeton-pli/WebGraphMix) to reproduce evaluation.
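
For a quick read of the table, the gaps between methods can be computed directly from the reported averages. The snippet below is only arithmetic on the four numbers above; the dictionary keys are the folder names, and the helper name `delta` is made up for this illustration.

```python
# DCLM CORE v2 averages (percent), copied from the table above.
scores = {
    "random_selection": 40.5,
    "dclm_fasttext_only": 43.1,
    "betweenness_alpha0.5": 41.4,
    "betweenness_alpha0.5_mult_div_dclm_fasttext": 43.4,
}


def delta(method: str, baseline: str) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(scores[method] - scores[baseline], 1)


# WebGraphMix vs. the random baseline
print(delta("betweenness_alpha0.5", "random_selection"))  # 0.9
# WebGraphMix+ vs. the DCLM-fasttext filter alone
print(delta("betweenness_alpha0.5_mult_div_dclm_fasttext", "dclm_fasttext_only"))  # 0.3
```

These are differences between single reported averages; the card gives no variance estimates, so they should be read as point comparisons only.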
+## Model details
+| | |
+|---|---|
+| Architecture | OpenLM 1B (`open_lm_1b_swiglutorch`) |
+| Parameters | ~1.44B (1.34B non-embedding) |
+| Hidden dim / layers / heads | 2048 / 24 / 16 |
+| Context length | 2048 |
+| Vocab | 50,432 (GPT-NeoX tokenizer) |
+| FFN | SwiGLU (torch) |
+| Norm | `gain_only_lp_layer_norm` |
+| QK norm | enabled |
+| Training tokens | ~28.8B (`1b_1x_fast` Chinchilla scale) |
+| Global batch size | 256 |
+| LR / warmup / weight decay | 0.003 / 5000 steps / 0.033 |
+| Seed | 124 |
+| Precision | AMP bfloat16 + FSDP |
+| OpenLM version | 0.0.34 |
+All four models share the same architecture and optimizer settings; they differ only in the **importance-sampled pretraining mixture**.
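
The figures in the table can be cross-checked with a little arithmetic. The sketch below rests on two assumptions that the card does not state: that "Chinchilla scale" means roughly 20 training tokens per parameter, and that the global batch size counts full 2048-token sequences per optimizer step. Under those assumptions it derives the token budget, an approximate step count, and the size of one embedding matrix.

```python
params = 1.44e9           # approximate total parameters (from the table)
non_embedding = 1.34e9    # approximate non-embedding parameters (from the table)
vocab_size = 50_432
hidden_dim = 2048
context_length = 2048
global_batch_size = 256   # ASSUMPTION: sequences per optimizer step
tokens_per_param = 20     # ASSUMPTION: common Chinchilla rule of thumb
warmup_steps = 5000

train_tokens = params * tokens_per_param
tokens_per_step = global_batch_size * context_length
total_steps = train_tokens / tokens_per_step
embedding_params = vocab_size * hidden_dim

print(f"training tokens  : {train_tokens / 1e9:.1f}B")
print(f"tokens per step  : {tokens_per_step:,}")
print(f"optimizer steps  : ~{total_steps:,.0f}")
print(f"warmup fraction  : {warmup_steps / total_steps:.1%}")
print(f"embedding matrix : {embedding_params / 1e9:.2f}B parameters")
print(f"total - non-emb. : {(params - non_embedding) / 1e9:.2f}B parameters")
```

If the assumptions hold, the token budget matches the ~28.8B in the table, and the gap between total and non-embedding parameters is about the size of one vocabulary-by-hidden embedding matrix.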
+## Download
+```bash
+huggingface-cli download PrincetonPLI/WebGraphMix-openlm-1B \
+  --local-dir ./dclm/checkpoints \
+  --repo-type model
+```
+Or from the WebGraphMix repo:
+```bash
+git clone https://github.com/princeton-pli/WebGraphMix.git
+cd WebGraphMix
+./experiments/artifacts/download.sh checkpoints
+```
+Expected layout after download:
+```text
+checkpoints/
+├── open_lm_1b_eval_params.txt
+├── random_selection/epoch_11.pt
+├── dclm_fasttext_only/epoch_11.pt
+├── betweenness_alpha0.5/epoch_11.pt
+└── betweenness_alpha0.5_mult_div_dclm_fasttext/epoch_11.pt
+```
+Size: **~17 GB per checkpoint** (~68 GB total).
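
Since the download is large, it can help to confirm that every expected file arrived before starting an evaluation. The following is a minimal sketch using only the standard library; the function name `missing_files` is hypothetical, and the default path simply mirrors the `--local-dir` used in the `huggingface-cli` command above.

```python
from pathlib import Path

# Folder names taken from the layout above.
CHECKPOINT_FOLDERS = [
    "random_selection",
    "dclm_fasttext_only",
    "betweenness_alpha0.5",
    "betweenness_alpha0.5_mult_div_dclm_fasttext",
]


def missing_files(root) -> list:
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    expected = ["open_lm_1b_eval_params.txt"]
    expected += [f"{name}/epoch_11.pt" for name in CHECKPOINT_FOLDERS]
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files("./dclm/checkpoints")
    print("all files present" if not missing else f"missing: {missing}")
```

The check only tests for file existence; it does not verify sizes or hashes.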
+## Evaluate (recommended)
+The checkpoints are stored in **native OpenLM PyTorch format**. The easiest path is the WebGraphMix evaluation pipeline:
+```bash
+# Run from the root of a WebGraphMix clone
+conda env create -f environment.yml && conda activate webgraphmix
+cd dclm && pip install -e . && cd ..
+export REPO_ROOT="$(pwd)"
+# Fetch the checkpoints
+./experiments/artifacts/download.sh checkpoints
+# Default: WebGraphMix 50/50 betweenness
+./experiments/eval/mmlu_and_lowvar.sh
+# Other checkpoints: pass the folder name as the first argument
+./experiments/eval/mmlu_and_lowvar.sh random_selection
+./experiments/eval/mmlu_and_lowvar.sh dclm_fasttext_only
+./experiments/eval/mmlu_and_lowvar.sh betweenness_alpha0.5_mult_div_dclm_fasttext
+```
+Aggregate scores across models:
+```bash
+cd dclm/exp_data/evals && python benchmark_score_comparison.py
+```
+By default, evaluation runs on at least 2 GPUs with FSDP; on a single GPU the 1B model may run out of memory.
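
To evaluate all four checkpoints in one go, the commands above can be driven from a small script. This is a sketch rather than part of the WebGraphMix repo: `build_commands` and `run_all` are hypothetical helpers, and the script only assumes what the commands above show, namely that `mmlu_and_lowvar.sh` takes an optional folder name and evaluates the `betweenness_alpha0.5` checkpoint when called without one.

```python
import subprocess

EVAL_SCRIPT = "./experiments/eval/mmlu_and_lowvar.sh"

# Folder names that are passed explicitly in the commands above.
NON_DEFAULT = [
    "random_selection",
    "dclm_fasttext_only",
    "betweenness_alpha0.5_mult_div_dclm_fasttext",
]


def build_commands():
    """Return one argv list per checkpoint; the default run takes no argument."""
    return [[EVAL_SCRIPT]] + [[EVAL_SCRIPT, name] for name in NON_DEFAULT]


def run_all(dry_run=True):
    """Print each command and, unless dry_run is set, execute it in order."""
    for argv in build_commands():
        print(" ".join(argv))
        if not dry_run:
            # check=True stops at the first failing evaluation.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)  # switch to dry_run=False from the repo root to run
```

Running the evaluations sequentially keeps each one on the full set of GPUs, which matters given the memory note above.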
+## Convert to Hugging Face format (optional)
+To load the models with `transformers` through the `open_lm` Hugging Face wrappers, convert the checkpoints first:
+```bash
+export REPO_ROOT=/path/to/WebGraphMix
+export CHECKPOINT_INPUT_DIR=$REPO_ROOT/dclm/checkpoints
+export CHECKPOINT_HF_OUTPUT_DIR=$REPO_ROOT/dclm/checkpoints_hf
+python dclm/convert_openlm_to_hf_1b.py
+```
+This produces Hugging Face–compatible folders with `OpenLMConfig` / `OpenLMForCausalLM` weights and the GPT-NeoX tokenizer.
+## Training data (summary)
+| Checkpoint | Mixture description |
+|------------|---------------------|
+| `random_selection` | Uniform random document sampling |
+| `dclm_fasttext_only` | DCLM-fasttext quality filter only |
+| `betweenness_alpha0.5` | 50% documents from highest-betweenness hosts + 50% from lowest-betweenness hosts |
+| `betweenness_alpha0.5_mult_div_dclm_fasttext` | Same 50/50 betweenness mix, combined with DCLM-fasttext quality scores (multiply/divide scheme) |
+Centrality scores come from [PrincetonPLI/cc-centrality-scores](https://huggingface.co/datasets/PrincetonPLI/cc-centrality-scores). Full sampling and tokenization steps are documented in the [WebGraphMix README](https://github.com/princeton-pli/WebGraphMix).
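
To make the 50/50 mixture concrete, here is a toy sketch of drawing documents from the highest- and lowest-scoring hosts. It is an illustration only, not the pipeline used to train these models: the function `mix_top_bottom`, the `pool_fraction` cutoff, and the example hosts are all hypothetical, and the quality-score combination used for WebGraphMix+ is not modeled at all.

```python
import random


def mix_top_bottom(host_scores, docs_by_host, n_docs, alpha=0.5,
                   pool_fraction=0.1, seed=0):
    """Draw n_docs documents: a share alpha from the highest-scoring hosts
    and the remainder from the lowest-scoring hosts.

    pool_fraction (hypothetical) sets how many hosts count as top or bottom.
    Raises ValueError if a pool holds fewer documents than requested.
    """
    rng = random.Random(seed)
    ranked = sorted(host_scores, key=host_scores.get)  # ascending by score
    k = max(1, int(len(ranked) * pool_fraction))
    bottom_hosts, top_hosts = ranked[:k], ranked[-k:]
    top_docs = [d for h in top_hosts for d in docs_by_host[h]]
    bottom_docs = [d for h in bottom_hosts for d in docs_by_host[h]]
    n_top = round(n_docs * alpha)
    n_bottom = n_docs - n_top
    return rng.sample(top_docs, n_top) + rng.sample(bottom_docs, n_bottom)


# Toy data: four made-up hosts with made-up betweenness scores.
host_scores = {"a.example": 0.9, "b.example": 0.5, "c.example": 0.1, "d.example": 0.0}
docs_by_host = {h: [f"{h}/doc{i}" for i in range(3)] for h in host_scores}

sample = mix_top_bottom(host_scores, docs_by_host, n_docs=4, pool_fraction=0.25)
print(sample)
```

With `alpha=0.5` and four requested documents, two come from the top host and two from the bottom host; the hosts in the middle of the ranking contribute nothing.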
+## Citation
+```bibtex
+@article{badoni2026webgraphmix,
+  title={Hubs or Fringes? Pretraining Data Selection via Web Graph Centrality},
+  author={Badoni, Vedant and Chen, Danqi and Wang, Xinyi},
+  journal={arXiv preprint arXiv:2606.11499},
+  year={2026}
+}
+```
+## License
+Released under the [MIT License](https://github.com/princeton-pli/WebGraphMix/blob/main/dclm/LICENSE), consistent with the [DCLM](https://github.com/mlfoundations/dclm) codebase used for training and evaluation.