---
language:
- en
tags:
- pytorch
- causal-lm
- pythia
- polypythias
license: apache-2.0
datasets:
- EleutherAI/pile
- EleutherAI/pile-preshuffled-seeds
library_name: transformers
arxiv: 2503.09543
---

# PolyPythias

This model is part of the **PolyPythias** suite, an extension of the [Pythia](https://github.com/EleutherAI/pythia) project providing 45 additional training runs across 5 model sizes with 9 different random seeds each. These models enable systematic study of training stability and reproducibility in language models.

## Paper

**[PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs](https://arxiv.org/abs/2503.09543)**

Oskar van der Wal, Pietro Lesci, Max Müller-Eberstein, Naomi Saphra, Hailey Schoelkopf, Willem Zuidema, and Stella Biderman. *ICLR 2025*.

## Model Details

| Size | Parameters | Layers | Model Dim | Heads | Original Model |
|------|------------|--------|-----------|-------|----------------|
| 14M  | 14M        | 6      | 128       | 4     | [pythia-14m](https://huggingface.co/EleutherAI/pythia-14m) |
| 31M  | 31M        | 6      | 256       | 8     | [pythia-31m](https://huggingface.co/EleutherAI/pythia-31m) |
| 70M  | 70M        | 6      | 512       | 8     | [pythia-70m](https://huggingface.co/EleutherAI/pythia-70m) |
| 160M | 160M       | 12     | 768       | 12    | [pythia-160m](https://huggingface.co/EleutherAI/pythia-160m) |
| 410M | 410M       | 24     | 1024      | 16    | [pythia-410m](https://huggingface.co/EleutherAI/pythia-410m) |

All models were trained on 300B tokens from [The Pile](https://pile.eleuther.ai/).

## Naming Convention

- **`pythia-{size}m`** - Original Pythia model (seed 1234)
- **`pythia-{size}m-seed{1-9}`** - PolyPythias variants with different random seeds
- **`pythia-160m-data-seed{1-3}`** - 160M models with only data ordering varied (weight init fixed)
- **`pythia-160m-weight-seed{1-3}`** - 160M models with only weight initialization varied (data order fixed)

The decoupled seed variants (data-seed and weight-seed) allow researchers to separately study the effects of data ordering vs. weight initialization.

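As an example of the convention, the repository IDs for every run at one model size (the original seed plus the nine PolyPythias seeds) can be generated programmatically; a minimal sketch:

```python
# Illustrative sketch: build the Hugging Face repo IDs for one model size
# following the naming convention above.
size = "70m"
repo_ids = ["EleutherAI/pythia-" + size]  # original run (seed 1234)
repo_ids += [f"EleutherAI/pythia-{size}-seed{s}" for s in range(1, 10)]  # seeds 1-9
print(repo_ids)
```
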
## Quick Start

```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

# Load the final checkpoint
model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-70m-seed3")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m-seed3")

# Generate text
inputs = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```

## Available Checkpoints

Each model provides **154 intermediate checkpoints** saved as Git branches:

| Checkpoint | Training Tokens | Description |
|------------|-----------------|-------------|
| `step0` | 0 | Initialization (before training) |
| `step1`, `step2`, `step4`, ..., `step512` | 2M - 1B | 10 log-spaced early checkpoints |
| `step1000`, `step2000`, ..., `step143000` | 2B - 300B | 143 evenly-spaced checkpoints |

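Putting the schedule together, the full list of 154 branch names can be reconstructed from the table above; a small sketch:

```python
# Reconstruct the 154 checkpoint branch names from the schedule above:
# step0, ten log-spaced steps (1, 2, 4, ..., 512), then every 1000 steps up to 143000.
log_spaced = [2**i for i in range(10)]            # 1, 2, 4, ..., 512
linear = list(range(1000, 143001, 1000))          # 1000, 2000, ..., 143000
steps = [0] + log_spaced + linear
branches = [f"step{s}" for s in steps]
assert len(branches) == 154
```
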
To load a specific checkpoint:

```python
model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m-seed3",
    revision="step50000",  # Any checkpoint step
)
```

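To check which checkpoint branches a given repository actually exposes, you can list its Git refs with `huggingface_hub`; a minimal sketch:

```python
from huggingface_hub import list_repo_refs

# List the checkpoint branches (step0 ... step143000) published for one model.
refs = list_repo_refs("EleutherAI/pythia-70m-seed3")
step_branches = sorted(
    (b.name for b in refs.branches if b.name.startswith("step")),
    key=lambda name: int(name.removeprefix("step")),
)
print(len(step_branches), step_branches[:5])
```
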
## Training Data

All models were trained on The Pile using pre-shuffled data orderings. The shuffled index files for each seed are available at:

**[EleutherAI/pile-preshuffled-seeds](https://huggingface.co/datasets/EleutherAI/pile-preshuffled-seeds)**

This dataset contains `.idx` files for seeds 0-9 used with `MMapIndexedDataset` to load the memory-mapped Pile data in the correct order for each seed.

### Reproducing Training Data Order

To reproduce the exact data ordering used for a specific seed:

1. Download the Pile dataset and tokenize it using the Pythia tokenizer.
2. Download the corresponding seed folder from `pile-preshuffled-seeds`:
   ```python
   # Using huggingface_hub
   from huggingface_hub import snapshot_download
   snapshot_download(
       repo_id="EleutherAI/pile-preshuffled-seeds",
       repo_type="dataset",
       allow_patterns="seed3/*",  # Download only seed3
       local_dir="./pile-seeds"
   )
   ```
3. Use the `.idx` files with GPT-NeoX's `MMapIndexedDataset` (see the indexing sketch after this list):
   ```python
   # MMapIndexedDataset is defined in the GPT-NeoX codebase (megatron/data/indexed_dataset.py)
   from megatron.data.indexed_dataset import MMapIndexedDataset

   # path_prefix points to the .bin/.idx pair for the chosen seed, without the file extension
   dataset = MMapIndexedDataset(path_prefix, skip_warmup=True)
   ```
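
Once the dataset from step 3 is loaded, the sequences seen at any training step can be recovered by index. The sketch below assumes the Pythia training batch size of 1024 sequences per optimizer step; verify this against the Pythia repository before relying on it:

```python
# Sketch: recover the token sequences seen at a given training step.
# Assumes 1024 sequences per optimizer step (the Pythia training setup).
BATCH_SIZE = 1024

step = 50_000  # matches the `revision="step50000"` checkpoint example above
start, end = step * BATCH_SIZE, (step + 1) * BATCH_SIZE
batch = [dataset[i] for i in range(start, end)]  # arrays of token IDs for this step
```
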
For complete training reproduction instructions, see the [Pythia GitHub repository](https://github.com/EleutherAI/pythia).

## All PolyPythias Models

The complete collection is available at: [EleutherAI/polypythias](https://huggingface.co/collections/EleutherAI/polypythias)

### 14M Parameter Models
- [pythia-14m-seed1](https://huggingface.co/EleutherAI/pythia-14m-seed1) through [pythia-14m-seed9](https://huggingface.co/EleutherAI/pythia-14m-seed9)

### 31M Parameter Models
- [pythia-31m-seed1](https://huggingface.co/EleutherAI/pythia-31m-seed1) through [pythia-31m-seed9](https://huggingface.co/EleutherAI/pythia-31m-seed9)

### 70M Parameter Models
- [pythia-70m-seed1](https://huggingface.co/EleutherAI/pythia-70m-seed1) through [pythia-70m-seed9](https://huggingface.co/EleutherAI/pythia-70m-seed9)

### 160M Parameter Models
- [pythia-160m-seed1](https://huggingface.co/EleutherAI/pythia-160m-seed1) through [pythia-160m-seed9](https://huggingface.co/EleutherAI/pythia-160m-seed9)
- [pythia-160m-data-seed1](https://huggingface.co/EleutherAI/pythia-160m-data-seed1) through [pythia-160m-data-seed3](https://huggingface.co/EleutherAI/pythia-160m-data-seed3)
- [pythia-160m-weight-seed1](https://huggingface.co/EleutherAI/pythia-160m-weight-seed1) through [pythia-160m-weight-seed3](https://huggingface.co/EleutherAI/pythia-160m-weight-seed3)

### 410M Parameter Models
- [pythia-410m-seed1](https://huggingface.co/EleutherAI/pythia-410m-seed1) through [pythia-410m-seed9](https://huggingface.co/EleutherAI/pythia-410m-seed9)

## Evaluation Results

Evaluation results for all models are available in the [polypythias-evals](https://huggingface.co/datasets/EleutherAI/polypythias-evals) dataset.

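One simple way to pull these results locally is to snapshot the dataset repository with `huggingface_hub` and inspect the files; a minimal sketch (the file layout is not documented in this card):

```python
from huggingface_hub import snapshot_download

# Download the raw evaluation files; inspect the local folder to see how
# results are organized per model and checkpoint.
local_path = snapshot_download(
    repo_id="EleutherAI/polypythias-evals",
    repo_type="dataset",
    local_dir="./polypythias-evals",
)
print(local_path)
```
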
## Limitations

These models are released for research purposes only. They are **not** intended for deployment in production systems.

- **Not instruction-tuned**: These are base language models that predict the next token; they will not follow instructions the way chat-tuned assistants such as ChatGPT do
- **May generate harmful content**: The Pile contains diverse internet text that includes biased, offensive, and factually incorrect content
- **English only**: Models were trained primarily on English text
- **No safety filtering**: Outputs are not filtered for safety or accuracy

## License

Apache 2.0

## Contact

For questions about these models, please use:
- [EleutherAI Discord](https://discord.gg/eleutherai) - #release-discussion channel
- [GitHub Issues](https://github.com/EleutherAI/pythia/issues)

## Citation

If you use these models, please cite:

```bibtex
@inproceedings{vanderwal2025polypythias,
  title={PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs},
  author={van der Wal, Oskar and Lesci, Pietro and M{\"u}ller-Eberstein, Max and Saphra, Naomi and Schoelkopf, Hailey and Zuidema, Willem and Biderman, Stella},
  booktitle={International Conference on Learning Representations},
  year={2025},
  url={https://arxiv.org/abs/2503.09543}
}
```
|