PrincetonPLI committed
Commit eff8ebc · verified · 1 Parent(s): 7c01a36

Update README.md

Files changed (1): README.md (+161 -1)
@@ -1 +1,161 @@
- openlm model checkpoints for replicate main table numbers
+ ---
+ license: mit
+ language:
+ - en
+ metrics:
+ - accuracy
+ pipeline_tag: text-generation
+ tags:
+ - openlm
+ - language-modeling
+ - causal-lm
+ - webgraphmix
+ - dclm
+ datasets:
+ - WebOrganizer/Corpus-200B
+ - PrincetonPLI/cc-centrality-scores
+ library_name: open_lm
+ ---
+
+ # WebGraphMix OpenLM 1B Checkpoints
+
+ Pretrained **OpenLM 1B** checkpoints from [**Hubs or Fringes? Pretraining Data Selection via Web Graph Centrality**](https://arxiv.org/abs/2606.11499) (WebGraphMix).
+
+ These models reproduce the headline **1B-scale Table 1** experiments: one model per data-selection method (four in total), each trained on a mixture derived from [WebOrganizer/Corpus-200B](https://huggingface.co/datasets/WebOrganizer/Corpus-200B) and evaluated on **DCLM CORE v2** (`mmlu_and_lowvar`, 23 tasks).
+
+ | Resource | Link |
+ |----------|------|
+ | Paper | [arXiv:2606.11499](https://arxiv.org/abs/2606.11499) |
+ | Project page | [princeton-pli.github.io/WebGraphMix](https://princeton-pli.github.io/WebGraphMix/) |
+ | Code | [github.com/princeton-pli/WebGraphMix](https://github.com/princeton-pli/WebGraphMix) |
+ | Centrality scores | [PrincetonPLI/cc-centrality-scores](https://huggingface.co/datasets/PrincetonPLI/cc-centrality-scores) |
+ | Base corpus | [WebOrganizer/Corpus-200B](https://huggingface.co/datasets/WebOrganizer/Corpus-200B) |
+
+ ## Checkpoints
+
+ Each folder contains one OpenLM PyTorch checkpoint (`epoch_11.pt`, the final epoch). The eval metadata file `open_lm_1b_eval_params.txt` is shared by all four and sits at the top level.
+
+ | Folder | Method | Training mixture | DCLM CORE v2 avg. |
+ |--------|--------|------------------|-------------------|
+ | `random_selection` | Random baseline | Uniform sampling from Corpus-200B pool | 40.5% |
+ | `dclm_fasttext_only` | Quality (DCLM-fasttext) | Documents above DCLM-fasttext quality threshold | 43.1% |
+ | `betweenness_alpha0.5` | **WebGraphMix** | 50/50 mix of top/bottom betweenness-centrality hosts | 41.4% |
+ | `betweenness_alpha0.5_mult_div_dclm_fasttext` | **WebGraphMix+** | Betweenness 50/50 mix × DCLM-fasttext quality filter | 43.4% |
+
+ > Scores are `aggregated_results` from the `mmlu_and_lowvar` eval suite (23 low-variance ICL tasks). See the [WebGraphMix repo](https://github.com/princeton-pli/WebGraphMix) to reproduce evaluation.
+
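To make the comparison in the table easier to read, the short sketch below computes each method's gain over the random baseline in percentage points. The scores are copied from the table; the helper name `gains_over` is only illustrative and is not part of the WebGraphMix code.

```python
# Scores copied from the Checkpoints table (DCLM CORE v2 avg., in percent).
scores = {
    "random_selection": 40.5,
    "dclm_fasttext_only": 43.1,
    "betweenness_alpha0.5": 41.4,
    "betweenness_alpha0.5_mult_div_dclm_fasttext": 43.4,
}


def gains_over(baseline: str, table: dict[str, float]) -> dict[str, float]:
    """Return each method's gain over `baseline`, in percentage points."""
    base = table[baseline]
    return {
        name: round(score - base, 1)
        for name, score in table.items()
        if name != baseline
    }


gains = gains_over("random_selection", scores)
for name, delta in gains.items():
    print(f"{name}: {delta:+.1f} pts")
```

Read this way, the quality filter alone adds 2.6 points, betweenness mixing alone adds 0.9, and the combination adds 2.9.
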
+ ## Model details
+
+ | Setting | Value |
+ |---|---|
+ | Architecture | OpenLM 1B (`open_lm_1b_swiglutorch`) |
+ | Parameters | ~1.44B (1.34B non-embedding) |
+ | Hidden dim / layers / heads | 2048 / 24 / 16 |
+ | Context length | 2048 |
+ | Vocab | 50,432 (GPT-NeoX tokenizer) |
+ | FFN | SwiGLU (torch) |
+ | Norm | `gain_only_lp_layer_norm` |
+ | QK norm | enabled |
+ | Training tokens | ~28.8B (`1b_1x_fast` Chinchilla scale) |
+ | Global batch size | 256 |
+ | LR / warmup / weight decay | 0.003 / 5000 steps / 0.033 |
+ | Seed | 124 |
+ | Precision | AMP bfloat16 + FSDP |
+ | OpenLM version | 0.0.34 |
+
+ All four models share the same architecture and optimizer settings; they differ only in the **importance-sampled pretraining mixture**.
+
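The parameter and token counts in the table can be sanity-checked with a few lines of arithmetic. This is a rough sketch: the 20-tokens-per-parameter multiplier is the usual Chinchilla rule of thumb and is an assumption here, since the table only says "Chinchilla scale".

```python
# Values taken from the Model details table.
VOCAB_SIZE = 50_432
HIDDEN_DIM = 2_048
TOTAL_PARAMS = 1.44e9          # "~1.44B"
NON_EMBEDDING_PARAMS = 1.34e9  # "1.34B non-embedding"

# Assumption (not stated in the table): Chinchilla-style budget of
# roughly 20 training tokens per parameter.
TOKENS_PER_PARAM = 20

# One vocab-by-hidden embedding matrix.
embedding_params = VOCAB_SIZE * HIDDEN_DIM

# Gap between total and non-embedding parameters, as reported.
reported_gap = TOTAL_PARAMS - NON_EMBEDDING_PARAMS

# Token budget implied by the rule of thumb.
token_budget = TOKENS_PER_PARAM * TOTAL_PARAMS

print(f"embedding matrix: {embedding_params / 1e9:.3f}B parameters")
print(f"reported gap:     {reported_gap / 1e9:.3f}B parameters")
print(f"token budget:     {token_budget / 1e9:.1f}B tokens")
```

One embedding matrix comes to about 0.103B parameters, which is consistent with the ~0.10B gap between the two reported counts, and 20 × 1.44B gives the ~28.8B training tokens listed.
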
+ ## Download
+
+ ```bash
+ huggingface-cli download PrincetonPLI/WebGraphMix-openlm-1B \
+   --local-dir ./dclm/checkpoints \
+   --repo-type model
+ ```
+
+ Or from the WebGraphMix repo:
+
+ ```bash
+ git clone https://github.com/princeton-pli/WebGraphMix.git
+ cd WebGraphMix
+ ./experiments/artifacts/download.sh checkpoints
+ ```
+
+ Expected layout after download (inside `dclm/` when using the `--local-dir` above):
+
+ ```text
+ checkpoints/
+ ├── open_lm_1b_eval_params.txt
+ ├── random_selection/epoch_11.pt
+ ├── dclm_fasttext_only/epoch_11.pt
+ ├── betweenness_alpha0.5/epoch_11.pt
+ └── betweenness_alpha0.5_mult_div_dclm_fasttext/epoch_11.pt
+ ```
+
+ Size: **~17 GB per checkpoint** (~68 GB total).
+
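Because the download is large, it can help to confirm that every expected file arrived before starting an evaluation. The sketch below checks the layout listed above; the `missing_files` helper is hypothetical (not part of the WebGraphMix repo), and the default path assumes the `--local-dir ./dclm/checkpoints` location from the first command.

```python
from pathlib import Path

# Relative paths copied from the expected layout above.
EXPECTED_FILES = [
    "open_lm_1b_eval_params.txt",
    "random_selection/epoch_11.pt",
    "dclm_fasttext_only/epoch_11.pt",
    "betweenness_alpha0.5/epoch_11.pt",
    "betweenness_alpha0.5_mult_div_dclm_fasttext/epoch_11.pt",
]


def missing_files(root) -> list[str]:
    """Return the expected relative paths that are not regular files under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # Assumed download location; adjust if you used a different --local-dir.
    missing = missing_files("./dclm/checkpoints")
    print("missing:", missing or "none")
```

This only checks that the files exist; it does not verify their size or integrity.
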
+ ## Evaluate (recommended)
+
+ The checkpoints are stored in **native OpenLM PyTorch format**. The easiest path is the WebGraphMix evaluation pipeline:
+
+ ```bash
+ # Run from the root of the WebGraphMix clone.
+ conda env create -f environment.yml && conda activate webgraphmix
+ cd dclm && pip install -e . && cd ..
+
+ export REPO_ROOT="$(pwd)"
+ ./experiments/artifacts/download.sh checkpoints
+
+ # Default: WebGraphMix 50/50 betweenness
+ ./experiments/eval/mmlu_and_lowvar.sh
+
+ # Other checkpoints
+ ./experiments/eval/mmlu_and_lowvar.sh random_selection
+ ./experiments/eval/mmlu_and_lowvar.sh dclm_fasttext_only
+ ./experiments/eval/mmlu_and_lowvar.sh betweenness_alpha0.5_mult_div_dclm_fasttext
+ ```
+
+ Aggregate scores across models:
+
+ ```bash
+ cd dclm/exp_data/evals && python benchmark_score_comparison.py
+ ```
+
+ Evaluation uses ≥2 GPUs by default (FSDP); on a single GPU the 1B model may run out of memory.
+
+ ## Convert to Hugging Face format (optional)
+
+ To load the checkpoints with `transformers` through the `open_lm` Hugging Face wrappers, convert them first:
+
+ ```bash
+ export REPO_ROOT=/path/to/WebGraphMix
+ export CHECKPOINT_INPUT_DIR="$REPO_ROOT/dclm/checkpoints"
+ export CHECKPOINT_HF_OUTPUT_DIR="$REPO_ROOT/dclm/checkpoints_hf"
+ python dclm/convert_openlm_to_hf_1b.py
+ ```
+
+ This produces Hugging Face–compatible folders with `OpenLMConfig` / `OpenLMForCausalLM` weights and the GPT-NeoX tokenizer.
+
+ ## Training data (summary)
+
+ | Checkpoint | Mixture description |
+ |------------|---------------------|
+ | `random_selection` | Uniform random document sampling |
+ | `dclm_fasttext_only` | DCLM-fasttext quality filter only |
+ | `betweenness_alpha0.5` | 50% of documents from highest-betweenness hosts + 50% from lowest-betweenness hosts |
+ | `betweenness_alpha0.5_mult_div_dclm_fasttext` | Same 50/50 betweenness mix, combined with DCLM-fasttext quality scores (multiply/divide scheme) |
+
+ Centrality scores come from [PrincetonPLI/cc-centrality-scores](https://huggingface.co/datasets/PrincetonPLI/cc-centrality-scores). Full sampling and tokenization steps are documented in the [WebGraphMix README](https://github.com/princeton-pli/WebGraphMix).
+
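As a rough intuition for the 50/50 betweenness mixture, the toy sketch below draws half of a sample from documents on the highest-scoring hosts and half from the lowest-scoring ones. It is not the authors' sampling procedure: the function `sample_mixture`, the hard top/bottom pools, the `pool_frac` cutoff, and the reading of `alpha` as the share drawn from high-centrality hosts are all assumptions made for illustration. The actual pipeline is described in the WebGraphMix README.

```python
import random


def sample_mixture(docs, host_score, n, alpha=0.5, pool_frac=0.2, seed=0):
    """Toy two-ended sampler (illustrative only).

    docs:       list of (doc_id, host) pairs
    host_score: dict mapping host -> centrality score
    n:          number of documents to draw
    alpha:      assumed share drawn from the highest-scoring hosts
    pool_frac:  assumed fraction of hosts in each pool (keep <= 0.5)
    """
    rng = random.Random(seed)
    ranked = sorted(host_score, key=host_score.get)  # ascending by score
    k = max(1, int(len(ranked) * pool_frac))
    low_hosts, high_hosts = set(ranked[:k]), set(ranked[-k:])

    high_docs = [d for d, h in docs if h in high_hosts]
    low_docs = [d for d, h in docs if h in low_hosts]

    n_high = round(alpha * n)
    # random.Random.sample raises ValueError if a pool is too small.
    return rng.sample(high_docs, n_high) + rng.sample(low_docs, n - n_high)


# Ten made-up hosts with scores 0..9 and five documents each.
host_score = {f"h{i}": float(i) for i in range(10)}
docs = [(f"h{i}/doc{j}", f"h{i}") for i in range(10) for j in range(5)]
picked = sample_mixture(docs, host_score, n=4)
print(picked)
```

With these made-up inputs, two of the four picked documents come from hosts `h8` and `h9` and two from `h0` and `h1`.
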
+ ## Citation
+
+ ```bibtex
+ @article{badoni2026webgraphmix,
+   title={Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality},
+   author={Badoni, Vedant and Chen, Danqi and Wang, Xinyi},
+   year={2026}
+ }
+ ```
+
+ ## License
+
+ Released under the [MIT License](https://github.com/princeton-pli/WebGraphMix/blob/main/dclm/LICENSE), consistent with the [DCLM](https://github.com/mlfoundations/dclm) codebase used for training and evaluation.