	---
	license: mit
	language:
	- en
	metrics:
	- accuracy
	pipeline_tag: text-generation
	tags:
	- openlm
	- language-modeling
	- causal-lm
	- webgraphmix
	- dclm
	datasets:
	- WebOrganizer/Corpus-200B
	- PrincetonPLI/cc-centrality-scores
	library_name: open_lm
	---

	# WebGraphMix OpenLM 1B Checkpoints

	Pretrained OpenLM 1B checkpoints from [Hubs or Fringes? Pretraining Data Selection via Web Graph Centrality](https://arxiv.org/abs/2606.11499) (WebGraphMix).

	These models replicate the headline 1B-scale Table 1 experiments: four data-selection methods trained on mixtures derived from [WebOrganizer/Corpus-200B](https://huggingface.co/datasets/WebOrganizer/Corpus-200B), evaluated on DCLM CORE v2 (`mmlu_and_lowvar`, 23 tasks).

	| Resource | Link |
	|----------|------|
	| Paper | [arXiv:2606.11499](https://arxiv.org/abs/2606.11499) |
	| Project page | [princeton-pli.github.io/WebGraphMix](https://princeton-pli.github.io/WebGraphMix/) |
	| Code | [github.com/princeton-pli/WebGraphMix](https://github.com/princeton-pli/WebGraphMix) |
	| Centrality scores | [PrincetonPLI/cc-centrality-scores](https://huggingface.co/datasets/PrincetonPLI/cc-centrality-scores) |
	| Base corpus | [WebOrganizer/Corpus-200B](https://huggingface.co/datasets/WebOrganizer/Corpus-200B) |

	## Checkpoints

	Each folder contains one OpenLM PyTorch checkpoint (`epoch_11.pt`, the final epoch). The eval metadata file (`open_lm_1b_eval_params.txt`) sits at the top level and is shared by all four checkpoints.

	| Folder | Method | Training mixture | DCLM CORE v2 avg. |
	|--------|--------|------------------|-------------------|
	| `random_selection` | Random baseline | Uniform sampling from Corpus-200B pool | 39.8% |
	| `dclm_fasttext_only` | Quality (DCLM-fasttext) | Documents above DCLM-fasttext quality threshold | 42.3% |
	| `betweenness_alpha0.5` | WebGraphMix | 50/50 mix of top/bottom betweenness-centrality hosts | 41.4% |
	| `betweenness_alpha0.5_mult_div_dclm_fasttext` | WebGraphMix+ | Betweenness 50/50 mix × DCLM-fasttext quality filter | 43.8% |

	> Scores are `aggregated_results` from the `mmlu_and_lowvar` eval suite (23 low-variance ICL tasks). See the [WebGraphMix repo](https://github.com/princeton-pli/WebGraphMix) to reproduce evaluation.
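
To make the comparison concrete, the gap between each method and the random baseline can be computed from the reported averages. This is only a small arithmetic sketch: the dictionary copies the numbers from the table, and the variable names are illustrative.

```python
# DCLM CORE v2 averages (percent), copied from the checkpoint table.
scores = {
    "random_selection": 39.8,
    "dclm_fasttext_only": 42.3,
    "betweenness_alpha0.5": 41.4,
    "betweenness_alpha0.5_mult_div_dclm_fasttext": 43.8,
}

baseline = scores["random_selection"]

# Absolute gain over the random baseline, in percentage points.
gains = {name: round(score - baseline, 1) for name, score in scores.items()}

# Print methods from largest to smallest gain.
for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:45s} {scores[name]:.1f}%  ({gain:+.1f} pts vs. random)")
```

These are differences between single reported averages, so they say nothing about run-to-run variance.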

	## Model details

	| Setting | Value |
	|---|---|
	| Architecture | OpenLM 1B (`open_lm_1b_swiglutorch`) |
	| Parameters | ~1.44B (1.34B non-embedding) |
	| Hidden dim / layers / heads | 2048 / 24 / 16 |
	| Context length | 2048 |
	| Vocab | 50,432 (GPT-NeoX tokenizer) |
	| FFN | SwiGLU (torch) |
	| Norm | `gain_only_lp_layer_norm` |
	| QK norm | enabled |
	| Training tokens | ~28.8B (`1b_1x_fast` Chinchilla scale) |
	| Global batch size | 256 |
	| LR / warmup / weight decay | 0.003 / 5000 steps / 0.033 |
	| Seed | 124 |
	| Precision | AMP bfloat16 + FSDP |
	| OpenLM version | 0.0.34 |

	All four models share the same architecture and optimizer settings; they differ only in the importance-sampled pretraining mixture.
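
As a rough sanity check on the training budget, the numbers in the table can be related to each other with a few lines of arithmetic. The sketch assumes that the global batch size is counted in sequences of the full context length; that reading is an assumption, not something this card states, and the real step count depends on the training configuration.

```python
# Approximate values taken from the model details table.
params = 1.44e9          # total parameters
train_tokens = 28.8e9    # training tokens
global_batch = 256       # ASSUMPTION: counted in sequences, not tokens
context_len = 2048       # tokens per sequence

tokens_per_param = train_tokens / params
tokens_per_step = global_batch * context_len
approx_steps = train_tokens / tokens_per_step

print(f"tokens per parameter: {tokens_per_param:.1f}")
print(f"tokens per step:      {tokens_per_step:,}")
print(f"optimizer steps:      ~{approx_steps:,.0f}")
```

Under these assumptions the budget works out to about 20 tokens per parameter, which matches the usual reading of "Chinchilla scale".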

	## Download

	```bash
	huggingface-cli download PrincetonPLI/WebGraphMix-openlm-1B \
	--local-dir ./dclm/checkpoints \
	--repo-type model
	```

	Or from the WebGraphMix repo:

	```bash
	git clone https://github.com/princeton-pli/WebGraphMix.git
	cd WebGraphMix
	./experiments/artifacts/download.sh checkpoints
	```

	Expected layout after download:

	```text
	checkpoints/
	├── open_lm_1b_eval_params.txt
	├── random_selection/epoch_11.pt
	├── dclm_fasttext_only/epoch_11.pt
	├── betweenness_alpha0.5/epoch_11.pt
	└── betweenness_alpha0.5_mult_div_dclm_fasttext/epoch_11.pt
	```

	Approximate size: 17 GB per checkpoint (68 GB total).
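
Because each checkpoint is large, it can be convenient to fetch a single folder rather than the whole repository. The sketch below uses `snapshot_download` from `huggingface_hub` with `allow_patterns`. It assumes the Hub repository mirrors the layout shown above; `patterns_for` and `download_one` are illustrative helper names, not part of the WebGraphMix code.

```python
from pathlib import Path

REPO_ID = "PrincetonPLI/WebGraphMix-openlm-1B"  # repo id from the CLI command above
SHARED_FILES = ["open_lm_1b_eval_params.txt"]   # top-level file in the expected layout


def patterns_for(checkpoint: str) -> list[str]:
    """Glob patterns selecting one checkpoint folder plus the shared files."""
    return [f"{checkpoint}/*", *SHARED_FILES]


def download_one(checkpoint: str, local_dir: str = "./dclm/checkpoints") -> Path:
    """Download a single checkpoint folder instead of the full repository."""
    # Imported lazily so patterns_for() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=REPO_ID,
        repo_type="model",
        local_dir=local_dir,
        allow_patterns=patterns_for(checkpoint),
    )
    return Path(path)
```

For example, `download_one("betweenness_alpha0.5")` would fetch only that folder and the shared eval parameters file, assuming the folder names on the Hub match the table.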

	## Evaluate (recommended)

	The checkpoints are stored in native OpenLM PyTorch format. The easiest path is the WebGraphMix evaluation pipeline:

	```bash
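	# Run from the root of a WebGraphMix clone.
	conda env create -f environment.yml && conda activate webgraphmix
	cd dclm && pip install -e . && cd ..

	# Quote the path so directories containing spaces still work.
	export REPO_ROOT="$(pwd)"
	./experiments/artifacts/download.sh checkpoints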

	# Default: WebGraphMix 50/50 betweenness
	./experiments/eval/mmlu_and_lowvar.sh

	# Other checkpoints
	./experiments/eval/mmlu_and_lowvar.sh random_selection
	./experiments/eval/mmlu_and_lowvar.sh dclm_fasttext_only
	./experiments/eval/mmlu_and_lowvar.sh betweenness_alpha0.5_mult_div_dclm_fasttext
	```

	Aggregate scores across models:

	```bash
	cd dclm/exp_data/evals && python benchmark_score_comparison.py
	```

	Evaluation uses at least 2 GPUs by default (FSDP); on a single GPU the 1B model may run out of memory.

	## Convert to Hugging Face format (optional)

	To load the checkpoints with `transformers` through the `open_lm` Hugging Face wrappers, convert them first:

	```bash
	export REPO_ROOT=/path/to/WebGraphMix
	export CHECKPOINT_INPUT_DIR=$REPO_ROOT/dclm/checkpoints
	export CHECKPOINT_HF_OUTPUT_DIR=$REPO_ROOT/dclm/checkpoints_hf
	python dclm/convert_openlm_to_hf_1b.py
	```

	This produces Hugging Face–compatible folders with `OpenLMConfig` / `OpenLMForCausalLM` weights and the GPT-NeoX tokenizer.

	## Training data (summary)

	| Checkpoint | Mixture description |
	|------------|---------------------|
	| `random_selection` | Uniform random document sampling |
	| `dclm_fasttext_only` | DCLM-fasttext quality filter only |
	| `betweenness_alpha0.5` | 50% documents from highest-betweenness hosts + 50% from lowest-betweenness hosts |
	| `betweenness_alpha0.5_mult_div_dclm_fasttext` | Same 50/50 betweenness mix, combined with DCLM-fasttext quality scores (multiply/divide scheme) |

	Centrality scores come from [PrincetonPLI/cc-centrality-scores](https://huggingface.co/datasets/PrincetonPLI/cc-centrality-scores). Full sampling and tokenization steps are documented in the [WebGraphMix README](https://github.com/princeton-pli/WebGraphMix).
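
As a toy illustration of what a 50/50 top/bottom mix means, the sketch below draws a fraction `alpha` of samples from the highest-scoring hosts and the rest from the lowest-scoring ones. It is not the authors' sampling pipeline, which is documented in the WebGraphMix README. The host names, the scores, the `tail_fraction` cutoff, and the function `mix_hosts` are all hypothetical.

```python
import random


def mix_hosts(host_scores, n_samples, alpha=0.5, tail_fraction=0.25, seed=0):
    """Draw `alpha` of the samples from top-ranked hosts and the rest from bottom-ranked hosts.

    host_scores: dict mapping host -> betweenness score (toy values here).
    tail_fraction: HYPOTHETICAL cutoff defining the "highest" and "lowest" hosts.
    """
    rng = random.Random(seed)
    ranked = sorted(host_scores, key=host_scores.get)  # ascending by score
    k = max(1, int(len(ranked) * tail_fraction))
    bottom, top = ranked[:k], ranked[-k:]

    n_top = round(alpha * n_samples)
    # Each draw stands in for one document attributed to the chosen host.
    picks = rng.choices(top, k=n_top) + rng.choices(bottom, k=n_samples - n_top)
    rng.shuffle(picks)
    return picks, set(top), set(bottom)


# Made-up scores for eight hypothetical hosts.
toy_scores = {f"host{i}.example": float(i) for i in range(8)}
picks, top, bottom = mix_hosts(toy_scores, n_samples=100, alpha=0.5)
print(sum(p in top for p in picks), "draws from top hosts")
print(sum(p in bottom for p in picks), "draws from bottom hosts")
```

With `alpha=0.5` the draws split evenly between the two ends of the ranking; how the released mixtures define the two ends and weight documents is left to the repository documentation.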

	## Citation

	```bibtex
	@article{badoni2026webgraphmix,
	  title={Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality},
	  author={Badoni, Vedant and Chen, Danqi and Wang, Xinyi},
	  journal={arXiv preprint arXiv:2606.11499},
	  year={2026}
	}
	```

	## License

	Released under the [MIT License](https://github.com/princeton-pli/WebGraphMix/blob/main/dclm/LICENSE), consistent with the [DCLM](https://github.com/mlfoundations/dclm) codebase used for training and evaluation.