Update README.md
README.md
CHANGED
@@ -1 +1,161 @@
---
license: mit
language:
- en
metrics:
- accuracy
pipeline_tag: text-generation
tags:
- openlm
- language-modeling
- causal-lm
- webgraphmix
- dclm
datasets:
- WebOrganizer/Corpus-200B
- PrincetonPLI/cc-centrality-scores
library_name: open_lm
---

# WebGraphMix OpenLM 1B Checkpoints

Pretrained **OpenLM 1B** checkpoints from [**Hubs or Fringes? Pretraining Data Selection via Web Graph Centrality**](https://arxiv.org/abs/2606.11499) (WebGraphMix).

These models replicate the headline **1B-scale Table 1** experiments: four data-selection methods trained on mixtures derived from [WebOrganizer/Corpus-200B](https://huggingface.co/datasets/WebOrganizer/Corpus-200B), evaluated on **DCLM CORE v2** (`mmlu_and_lowvar`, 23 tasks).

| Resource | Link |
|----------|------|
| Paper | [arXiv:2606.11499](https://arxiv.org/abs/2606.11499) |
| Project page | [princeton-pli.github.io/WebGraphMix](https://princeton-pli.github.io/WebGraphMix/) |
| Code | [github.com/princeton-pli/WebGraphMix](https://github.com/princeton-pli/WebGraphMix) |
| Centrality scores | [PrincetonPLI/cc-centrality-scores](https://huggingface.co/datasets/PrincetonPLI/cc-centrality-scores) |
| Base corpus | [WebOrganizer/Corpus-200B](https://huggingface.co/datasets/WebOrganizer/Corpus-200B) |

## Checkpoints

Each folder contains an OpenLM PyTorch checkpoint (`epoch_11.pt`, final epoch) plus shared eval metadata.

| Folder | Method | Training mixture | DCLM CORE v2 avg. |
|--------|--------|------------------|-------------------|
| `random_selection` | Random baseline | Uniform sampling from Corpus-200B pool | 40.5% |
| `dclm_fasttext_only` | Quality (DCLM-fasttext) | Documents above DCLM-fasttext quality threshold | 43.1% |
| `betweenness_alpha0.5` | **WebGraphMix** | 50/50 mix of top/bottom betweenness-centrality hosts | 41.4% |
| `betweenness_alpha0.5_mult_div_dclm_fasttext` | **WebGraphMix+** | Betweenness 50/50 mix × DCLM-fasttext quality filter | 43.4% |

> Scores are `aggregated_results` from the `mmlu_and_lowvar` eval suite (23 low-variance ICL tasks). See the [WebGraphMix repo](https://github.com/princeton-pli/WebGraphMix) to reproduce evaluation.
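
To make the differences easier to read, the short sketch below computes each method's gain over the random baseline in percentage points. The scores are copied from the checkpoint table; the helper name `gains_over` is only illustrative and is not part of the WebGraphMix code.

```python
# DCLM CORE v2 averages (percent), copied from the checkpoint table.
scores = {
    "random_selection": 40.5,
    "dclm_fasttext_only": 43.1,
    "betweenness_alpha0.5": 41.4,
    "betweenness_alpha0.5_mult_div_dclm_fasttext": 43.4,
}


def gains_over(baseline: str, table: dict[str, float]) -> dict[str, float]:
    """Return each method's score minus the baseline score, in percentage points."""
    base = table[baseline]
    return {
        name: round(score - base, 1)
        for name, score in table.items()
        if name != baseline
    }


gains = gains_over("random_selection", scores)
for name, delta in gains.items():
    print(f"{name}: {delta:+.1f} pts")
```

Read this way, the two checkpoints that use DCLM-fasttext scores sit within 0.3 points of each other, while the betweenness-only mix lands between them and the random baseline.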

## Model details

| Setting | Value |
|---|---|
| Architecture | OpenLM 1B (`open_lm_1b_swiglutorch`) |
| Parameters | ~1.44B (1.34B non-embedding) |
| Hidden dim / layers / heads | 2048 / 24 / 16 |
| Context length | 2048 |
| Vocab | 50,432 (GPT-NeoX tokenizer) |
| FFN | SwiGLU (torch) |
| Norm | `gain_only_lp_layer_norm` |
| QK norm | enabled |
| Training tokens | ~28.8B (`1b_1x_fast` Chinchilla scale) |
| Global batch size | 256 |
| LR / warmup / weight decay | 0.003 / 5000 steps / 0.033 |
| Seed | 124 |
| Precision | AMP bfloat16 + FSDP |
| OpenLM version | 0.0.34 |

All four models share the same architecture and optimizer settings; they differ only in the **importance-sampled pretraining mixture**.
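
As a sanity check on the numbers in the table, the sketch below works through three pieces of arithmetic: the size of one token-embedding matrix, the tokens-per-parameter ratio, and the approximate number of optimizer steps. It assumes the global batch size counts sequences (not tokens) and that every sequence is a full 2048 tokens; neither is stated in this README, so treat the step count as an estimate.

```python
# Values copied from the model-details table.
HIDDEN_DIM = 2048
VOCAB_SIZE = 50_432
CONTEXT_LENGTH = 2048
GLOBAL_BATCH_SIZE = 256   # assumption: counts sequences, not tokens
TOTAL_PARAMS = 1.44e9     # approximate, as reported
TRAINING_TOKENS = 28.8e9  # approximate, as reported

# One token-embedding matrix has vocab x hidden entries (~0.10B),
# which matches the gap between 1.44B total and 1.34B non-embedding.
embedding_params = VOCAB_SIZE * HIDDEN_DIM

# Tokens seen per parameter; about 20 is the usual Chinchilla-style ratio.
tokens_per_param = TRAINING_TOKENS / TOTAL_PARAMS

# Tokens consumed per optimizer step under the stated assumptions.
tokens_per_step = GLOBAL_BATCH_SIZE * CONTEXT_LENGTH
approx_steps = TRAINING_TOKENS / tokens_per_step

print(f"embedding params : {embedding_params:,}")
print(f"tokens per param : {tokens_per_param:.1f}")
print(f"tokens per step  : {tokens_per_step:,}")
print(f"approx. steps    : {approx_steps:,.0f}")
```

Under these assumptions the 5000 warmup steps would cover roughly the first 9% of training.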

## Download

```bash
huggingface-cli download PrincetonPLI/WebGraphMix-openlm-1B \
  --local-dir ./dclm/checkpoints \
  --repo-type model
```

Or from the WebGraphMix repo:

```bash
git clone https://github.com/princeton-pli/WebGraphMix.git
cd WebGraphMix
./experiments/artifacts/download.sh checkpoints
```

Expected layout after download:

```text
checkpoints/
├── open_lm_1b_eval_params.txt
├── random_selection/epoch_11.pt
├── dclm_fasttext_only/epoch_11.pt
├── betweenness_alpha0.5/epoch_11.pt
└── betweenness_alpha0.5_mult_div_dclm_fasttext/epoch_11.pt
```
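
Because the downloads are large, it can help to confirm the layout before starting an evaluation. The following is a minimal sketch that checks for the files listed in the tree above; the helper name `missing_files` and the default path `./dclm/checkpoints` (taken from the `--local-dir` flag in the download command) are illustrative choices, not part of the WebGraphMix tooling.

```python
from pathlib import Path

# Relative paths taken from the expected layout.
EXPECTED_FILES = [
    "open_lm_1b_eval_params.txt",
    "random_selection/epoch_11.pt",
    "dclm_fasttext_only/epoch_11.pt",
    "betweenness_alpha0.5/epoch_11.pt",
    "betweenness_alpha0.5_mult_div_dclm_fasttext/epoch_11.pt",
]


def missing_files(checkpoint_root: str) -> list[str]:
    """Return the expected relative paths that are not regular files under the root."""
    root = Path(checkpoint_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files("./dclm/checkpoints")
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected checkpoint files are present.")
```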

Approximate size: **~17 GB per checkpoint** (~68 GB total).
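
Given the size of each checkpoint, you may prefer to fetch a single folder rather than the whole repository. The sketch below uses `snapshot_download` from `huggingface_hub` with `allow_patterns`. It assumes the Hub repository root mirrors the expected layout (one folder per checkpoint plus `open_lm_1b_eval_params.txt` at the top level); the helper names are hypothetical.

```python
def patterns_for(folder: str) -> list[str]:
    """Glob patterns for one checkpoint folder plus the shared eval metadata file."""
    return [f"{folder}/*", "open_lm_1b_eval_params.txt"]


def download_one(folder: str, local_dir: str = "./dclm/checkpoints") -> str:
    """Download a single checkpoint folder and return the local directory path."""
    # Imported lazily so patterns_for() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="PrincetonPLI/WebGraphMix-openlm-1B",
        repo_type="model",
        local_dir=local_dir,
        # Assumption: folder names on the Hub match the layout shown in this README.
        allow_patterns=patterns_for(folder),
    )


# Example (downloads roughly 17 GB):
# download_one("betweenness_alpha0.5")
```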

## Evaluate (recommended)

The checkpoints are stored in **native OpenLM PyTorch format**. The easiest path is the WebGraphMix evaluation pipeline:

```bash
conda env create -f environment.yml && conda activate webgraphmix
cd dclm && pip install -e . && cd ..

export REPO_ROOT=$(pwd)
./experiments/artifacts/download.sh checkpoints

# Default: WebGraphMix 50/50 betweenness
./experiments/eval/mmlu_and_lowvar.sh

# Other checkpoints
./experiments/eval/mmlu_and_lowvar.sh random_selection
./experiments/eval/mmlu_and_lowvar.sh dclm_fasttext_only
./experiments/eval/mmlu_and_lowvar.sh betweenness_alpha0.5_mult_div_dclm_fasttext
```

Aggregate scores across models:

```bash
cd dclm/exp_data/evals && python benchmark_score_comparison.py
```

Evaluation uses ≥2 GPUs by default (FSDP); a single GPU may OOM on the 1B model.
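
If you want to evaluate all four checkpoints in one go, a small driver can issue the same commands in sequence. This is only a sketch: it assumes it is run from the repository root with the environment above already activated, and it calls the script with no argument for the default checkpoint, exactly as in the commands shown. The function names are hypothetical.

```python
import subprocess

EVAL_SCRIPT = "./experiments/eval/mmlu_and_lowvar.sh"

# Checkpoints that need an explicit argument; the default run takes none.
NON_DEFAULT = [
    "random_selection",
    "dclm_fasttext_only",
    "betweenness_alpha0.5_mult_div_dclm_fasttext",
]


def eval_commands() -> list[list[str]]:
    """Build one command per checkpoint, starting with the default (no argument)."""
    return [[EVAL_SCRIPT]] + [[EVAL_SCRIPT, name] for name in NON_DEFAULT]


def run_all(dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it and stop on failure."""
    for cmd in eval_commands():
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)  # switch to dry_run=False to actually launch the evals
```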

## Convert to Hugging Face format (optional)

To load with `transformers` + `open_lm` HF wrappers:

```bash
export REPO_ROOT=/path/to/WebGraphMix
export CHECKPOINT_INPUT_DIR=$REPO_ROOT/dclm/checkpoints
export CHECKPOINT_HF_OUTPUT_DIR=$REPO_ROOT/dclm/checkpoints_hf
python dclm/convert_openlm_to_hf_1b.py
```

This produces Hugging Face–compatible folders with `OpenLMConfig` / `OpenLMForCausalLM` weights and the GPT-NeoX tokenizer.

## Training data (summary)

| Checkpoint | Mixture description |
|------------|---------------------|
| `random_selection` | Uniform random document sampling |
| `dclm_fasttext_only` | DCLM-fasttext quality filter only |
| `betweenness_alpha0.5` | 50% documents from highest-betweenness hosts + 50% from lowest-betweenness hosts |
| `betweenness_alpha0.5_mult_div_dclm_fasttext` | Same 50/50 betweenness mix, combined with DCLM-fasttext quality scores (multiply/divide scheme) |

Centrality scores come from [PrincetonPLI/cc-centrality-scores](https://huggingface.co/datasets/PrincetonPLI/cc-centrality-scores). Full sampling and tokenization steps are documented in the [WebGraphMix README](https://github.com/princeton-pli/WebGraphMix).
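
To give a feel for what a 50/50 hub/fringe mix means, here is a deliberately simplified toy sketch. It is not the paper's importance-sampling pipeline: the function name, the `tail_fraction` cutoff that decides which hosts count as "highest" or "lowest", and the uniform sampling within each group are all assumptions made for illustration.

```python
import random


def hub_fringe_mix(host_scores, docs_by_host, n_docs, alpha=0.5,
                   tail_fraction=0.25, seed=0):
    """Toy mix: a share `alpha` of documents from the highest-scoring hosts,
    the remainder from the lowest-scoring hosts."""
    rng = random.Random(seed)
    ranked = sorted(host_scores, key=host_scores.get)  # ascending by centrality
    k = max(1, int(len(ranked) * tail_fraction))       # hypothetical cutoff
    fringe_hosts, hub_hosts = ranked[:k], ranked[-k:]

    hub_docs = [d for h in hub_hosts for d in docs_by_host[h]]
    fringe_docs = [d for h in fringe_hosts for d in docs_by_host[h]]

    n_hub = round(n_docs * alpha)
    n_fringe = n_docs - n_hub
    # Sample without replacement; raises ValueError if a group is too small.
    return rng.sample(hub_docs, n_hub) + rng.sample(fringe_docs, n_fringe)


# Tiny example with four made-up hosts and betweenness scores.
host_scores = {"a.example": 0.9, "b.example": 0.5, "c.example": 0.1, "d.example": 0.0}
docs_by_host = {h: [f"{h}/doc{i}" for i in range(3)] for h in host_scores}
sample = hub_fringe_mix(host_scores, docs_by_host, n_docs=4, alpha=0.5)
print(sample)
```

With `alpha=0.5`, half of the sampled documents come from the top-ranked host and half from the bottom-ranked one; the actual recipe used for these checkpoints is documented in the WebGraphMix repository.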

## Citation

```bibtex
@article{badoni2026webgraphmix,
  title={Hubs or Fringes? Pretraining Data Selection via Web Graph Centrality},
  author={Badoni, Vedant and Chen, Danqi and Wang, Xinyi},
  journal={arXiv preprint arXiv:2606.11499},
  year={2026}
}
```

## License

Released under the [MIT License](https://github.com/princeton-pli/WebGraphMix/blob/main/dclm/LICENSE), consistent with the [DCLM](https://github.com/mlfoundations/dclm) codebase used for training and evaluation.