---
language:
- en
tags:
- pytorch
- causal-lm
- pythia
- polypythias
license: apache-2.0
datasets:
- EleutherAI/pile
- EleutherAI/pile-preshuffled-seeds
library_name: transformers
arxiv: 2503.09543
---
# PolyPythias
This model is part of the **PolyPythias** suite, an extension of the [Pythia](https://github.com/EleutherAI/pythia) project providing 45 additional training runs across 5 model sizes with 9 different random seeds each. These models enable systematic study of training stability and reproducibility in language models.
## Paper
**[PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs](https://arxiv.org/abs/2503.09543)**
Oskar van der Wal, Pietro Lesci, Max Müller-Eberstein, Naomi Saphra, Hailey Schoelkopf, Willem Zuidema, and Stella Biderman. *ICLR 2025*.
## Model Details
| Size | Parameters | Layers | Model Dim | Heads | Original Model |
|------|------------|--------|-----------|-------|----------------|
| 14M | 14M | 6 | 128 | 4 | [pythia-14m](https://huggingface.co/EleutherAI/pythia-14m) |
| 31M | 31M | 6 | 256 | 8 | [pythia-31m](https://huggingface.co/EleutherAI/pythia-31m) |
| 70M | 70M | 6 | 512 | 8 | [pythia-70m](https://huggingface.co/EleutherAI/pythia-70m) |
| 160M | 160M | 12 | 768 | 12 | [pythia-160m](https://huggingface.co/EleutherAI/pythia-160m) |
| 410M | 410M | 24 | 1024 | 16 | [pythia-410m](https://huggingface.co/EleutherAI/pythia-410m) |
All models were trained on 300B tokens from [The Pile](https://pile.eleuther.ai/).
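The 300B figure can be sanity-checked from the training schedule. A quick back-of-the-envelope check, assuming the original Pythia hyperparameters of 1,024 sequences of 2,048 tokens per optimizer step (these values come from the Pythia setup, not from this card):

```python
# Assumed Pythia training hyperparameters (from the original Pythia setup):
STEPS = 143_000     # total optimizer steps (final checkpoint is step143000)
BATCH_SIZE = 1_024  # sequences per step
SEQ_LEN = 2_048     # tokens per sequence

total_tokens = STEPS * BATCH_SIZE * SEQ_LEN
print(f"{total_tokens:,}")  # 299,892,736,000 -> ~300B tokens
```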
## Naming Convention
- **`pythia-{size}m`** - Original Pythia model (seed 1234)
- **`pythia-{size}m-seed{1-9}`** - PolyPythias variants with different random seeds
- **`pythia-160m-data-seed{1-3}`** - 160M models with only data ordering varied (weight init fixed)
- **`pythia-160m-weight-seed{1-3}`** - 160M models with only weight initialization varied (data order fixed)
The decoupled seed variants (data-seed and weight-seed) allow researchers to separately study the effects of data ordering vs. weight initialization.
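Because the naming convention is regular, the full set of repo IDs can be generated programmatically. A minimal sketch (repo names follow the convention above; only seeds 1–9 and the three decoupled 160M seeds exist):

```python
sizes = ["14m", "31m", "70m", "160m", "410m"]

# 45 full-seed runs: every size x seeds 1-9
polypythias = [
    f"EleutherAI/pythia-{size}-seed{seed}"
    for size in sizes
    for seed in range(1, 10)
]

# 6 decoupled 160M runs: data-only and weight-only seeds 1-3
decoupled = [
    f"EleutherAI/pythia-160m-{kind}-seed{seed}"
    for kind in ("data", "weight")
    for seed in range(1, 4)
]

assert len(polypythias) == 45 and len(decoupled) == 6
```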
## Quick Start
```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer
# Load the final checkpoint
model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-70m-seed3")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m-seed3")
# Generate text
inputs = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```
## Available Checkpoints
Each model provides **154 intermediate checkpoints** saved as Git branches:
| Checkpoint | Training Tokens | Description |
|------------|-----------------|-------------|
| `step0` | 0 | Initialization (before training) |
| `step1`, `step2`, `step4`, ..., `step512` | 2M - 1B | 10 log-spaced early checkpoints |
| `step1000`, `step2000`, ..., `step143000` | 2B - 300B | 143 evenly-spaced checkpoints |
To load a specific checkpoint:
```python
model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m-seed3",
    revision="step50000",  # any checkpoint step
)
```
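The full checkpoint schedule from the table can also be reconstructed programmatically, which is convenient for sweeping over revisions. A minimal sketch (the tokens-per-step figure assumes Pythia's batch of 1,024 sequences × 2,048 tokens, which is not stated in this card):

```python
# Reconstruct the 154-checkpoint schedule described in the table above
early = [2**i for i in range(10)]           # step1 ... step512 (log-spaced)
regular = list(range(1000, 143_001, 1000))  # step1000 ... step143000
steps = [0] + early + regular
assert len(steps) == 154

# Approximate tokens seen at each checkpoint
# (assumed: 1,024 sequences x 2,048 tokens per step)
tokens_per_step = 1_024 * 2_048
revisions = {f"step{s}": s * tokens_per_step for s in steps}
```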
## Training Data
All models were trained on The Pile using pre-shuffled data orderings. The shuffled index files for each seed are available at:
**[EleutherAI/pile-preshuffled-seeds](https://huggingface.co/datasets/EleutherAI/pile-preshuffled-seeds)**
This dataset contains `.idx` files for seeds 0-9 used with `MMapIndexedDataset` to load the memory-mapped Pile data in the correct order for each seed.
### Reproducing Training Data Order
To reproduce the exact data ordering used for a specific seed:
1. Download the Pile dataset and tokenize it using the Pythia tokenizer
2. Download the corresponding seed folder from `pile-preshuffled-seeds`:
```python
# Download only the seed3 index files using huggingface_hub
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="EleutherAI/pile-preshuffled-seeds",
    repo_type="dataset",
    allow_patterns="seed3/*",  # download only seed3
    local_dir="./pile-seeds",
)
```
3. Use the `.idx` files with the `MMapIndexedDataset` loader:
```python
# The module path depends on your checkout: the Pythia repo ships this
# class in utils/mmap_dataset.py; GPT-NeoX provides an equivalent loader.
from utils.mmap_dataset import MMapIndexedDataset

# path_prefix: path to the .bin/.idx pair, without the file extension
dataset = MMapIndexedDataset(path_prefix, skip_warmup=True)
```
For complete training reproduction instructions, see the [Pythia GitHub repository](https://github.com/EleutherAI/pythia).
## All PolyPythias Models
The complete collection is available at: [EleutherAI/polypythias](https://huggingface.co/collections/EleutherAI/polypythias)
### 14M Parameter Models
- [pythia-14m-seed1](https://huggingface.co/EleutherAI/pythia-14m-seed1) through [pythia-14m-seed9](https://huggingface.co/EleutherAI/pythia-14m-seed9)
### 31M Parameter Models
- [pythia-31m-seed1](https://huggingface.co/EleutherAI/pythia-31m-seed1) through [pythia-31m-seed9](https://huggingface.co/EleutherAI/pythia-31m-seed9)
### 70M Parameter Models
- [pythia-70m-seed1](https://huggingface.co/EleutherAI/pythia-70m-seed1) through [pythia-70m-seed9](https://huggingface.co/EleutherAI/pythia-70m-seed9)
### 160M Parameter Models
- [pythia-160m-seed1](https://huggingface.co/EleutherAI/pythia-160m-seed1) through [pythia-160m-seed9](https://huggingface.co/EleutherAI/pythia-160m-seed9)
- [pythia-160m-data-seed1](https://huggingface.co/EleutherAI/pythia-160m-data-seed1) through [pythia-160m-data-seed3](https://huggingface.co/EleutherAI/pythia-160m-data-seed3)
- [pythia-160m-weight-seed1](https://huggingface.co/EleutherAI/pythia-160m-weight-seed1) through [pythia-160m-weight-seed3](https://huggingface.co/EleutherAI/pythia-160m-weight-seed3)
### 410M Parameter Models
- [pythia-410m-seed1](https://huggingface.co/EleutherAI/pythia-410m-seed1) through [pythia-410m-seed9](https://huggingface.co/EleutherAI/pythia-410m-seed9)
## Evaluation Results
Evaluation results for all models are available in the [polypythias-evals](https://huggingface.co/datasets/EleutherAI/polypythias-evals) dataset.
## Limitations
These models are released for research purposes only. They are **not** intended for deployment in production systems.
- **Not instruction-tuned**: These are base language models that predict the next token; they will not follow instructions the way chat-tuned assistants do
- **May generate harmful content**: The Pile contains diverse internet text that includes biased, offensive, and factually incorrect content
- **English only**: Models were trained primarily on English text
- **No safety filtering**: Outputs are not filtered for safety or accuracy
## License
Apache 2.0
## Contact
For questions about these models, please use:
- [EleutherAI Discord](https://discord.gg/eleutherai) - #release-discussion channel
- [GitHub Issues](https://github.com/EleutherAI/pythia/issues)
## Citation
If you use these models, please cite:
```bibtex
@inproceedings{vanderwal2025polypythias,
title={PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs},
author={van der Wal, Oskar and Lesci, Pietro and M{\"u}ller-Eberstein, Max and Saphra, Naomi and Schoelkopf, Hailey and Zuidema, Willem and Biderman, Stella},
booktitle={International Conference on Learning Representations},
year={2025},
url={https://arxiv.org/abs/2503.09543}
}
```