---
language:
- en
tags:
- pytorch
- causal-lm
- pythia
- polypythias
license: apache-2.0
datasets:
- EleutherAI/pile
- EleutherAI/pile-preshuffled-seeds
library_name: transformers
arxiv: 2503.09543
---

# PolyPythias

This model is part of the **PolyPythias** suite, an extension of the [Pythia](https://github.com/EleutherAI/pythia) project providing 45 additional training runs across 5 model sizes with 9 different random seeds each. These models enable systematic study of training stability and reproducibility in language models.

## Paper

**[PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs](https://arxiv.org/abs/2503.09543)**

Oskar van der Wal, Pietro Lesci, Max Müller-Eberstein, Naomi Saphra, Hailey Schoelkopf, Willem Zuidema, and Stella Biderman. *ICLR 2025*.

## Model Details

| Size | Parameters | Layers | Model Dim | Heads | Original Model |
|------|------------|--------|-----------|-------|----------------|
| 14M  | 14M        | 6      | 128       | 4     | [pythia-14m](https://huggingface.co/EleutherAI/pythia-14m) |
| 31M  | 31M        | 6      | 256       | 8     | [pythia-31m](https://huggingface.co/EleutherAI/pythia-31m) |
| 70M  | 70M        | 6      | 512       | 8     | [pythia-70m](https://huggingface.co/EleutherAI/pythia-70m) |
| 160M | 160M       | 12     | 768       | 12    | [pythia-160m](https://huggingface.co/EleutherAI/pythia-160m) |
| 410M | 410M       | 24     | 1024      | 16    | [pythia-410m](https://huggingface.co/EleutherAI/pythia-410m) |

All models were trained on 300B tokens from [The Pile](https://pile.eleuther.ai/).

## Naming Convention

- **`pythia-{size}m`** - Original Pythia model (seed 1234)
- **`pythia-{size}m-seed{1-9}`** - PolyPythias variants with different random seeds
- **`pythia-160m-data-seed{1-3}`** - 160M models with only data ordering varied (weight init fixed)
- **`pythia-160m-weight-seed{1-3}`** - 160M models with only weight initialization varied (data order fixed)

The decoupled seed variants (data-seed and weight-seed) allow researchers to separately study the effects of data ordering vs. weight initialization.

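As an example of the convention, the repository IDs for every run at one model size (the original seed plus the nine PolyPythias seeds) can be generated programmatically; a minimal sketch:

```python
# Illustrative sketch: build the Hugging Face repo IDs for one model size
# following the naming convention above.
size = "70m"
repo_ids = ["EleutherAI/pythia-" + size]  # original run (seed 1234)
repo_ids += [f"EleutherAI/pythia-{size}-seed{s}" for s in range(1, 10)]  # seeds 1-9
print(repo_ids)
```
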
## Quick Start

```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

# Load the final checkpoint
model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-70m-seed3")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m-seed3")

# Generate text
inputs = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```

## Available Checkpoints

Each model provides **154 intermediate checkpoints** saved as Git branches:

| Checkpoint | Training Tokens | Description |
|------------|-----------------|-------------|
| `step0` | 0 | Initialization (before training) |
| `step1`, `step2`, `step4`, ..., `step512` | 2M - 1B | 10 log-spaced early checkpoints |
| `step1000`, `step2000`, ..., `step143000` | 2B - 300B | 143 evenly-spaced checkpoints |

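Putting the schedule together, the full list of 154 branch names can be reconstructed from the table above; a small sketch:

```python
# Reconstruct the 154 checkpoint branch names from the schedule above:
# step0, ten log-spaced steps (1, 2, 4, ..., 512), then every 1000 steps up to 143000.
log_spaced = [2**i for i in range(10)]            # 1, 2, 4, ..., 512
linear = list(range(1000, 143001, 1000))          # 1000, 2000, ..., 143000
steps = [0] + log_spaced + linear
branches = [f"step{s}" for s in steps]
assert len(branches) == 154
```
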
To load a specific checkpoint:

```python
model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m-seed3",
    revision="step50000",  # Any checkpoint step
)
```

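To check which checkpoint branches a given repository actually exposes, you can list its Git refs with `huggingface_hub`; a minimal sketch:

```python
from huggingface_hub import list_repo_refs

# List the checkpoint branches (step0 ... step143000) published for one model.
refs = list_repo_refs("EleutherAI/pythia-70m-seed3")
step_branches = sorted(
    (b.name for b in refs.branches if b.name.startswith("step")),
    key=lambda name: int(name.removeprefix("step")),
)
print(len(step_branches), step_branches[:5])
```
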
## Training Data

All models were trained on The Pile using pre-shuffled data orderings. The shuffled index files for each seed are available at:

**[EleutherAI/pile-preshuffled-seeds](https://huggingface.co/datasets/EleutherAI/pile-preshuffled-seeds)**

This dataset contains `.idx` files for seeds 0-9 used with `MMapIndexedDataset` to load the memory-mapped Pile data in the correct order for each seed.

### Reproducing Training Data Order

To reproduce the exact data ordering used for a specific seed:

1. Download the Pile dataset and tokenize it using the Pythia tokenizer.
2. Download the corresponding seed folder from `pile-preshuffled-seeds`:
   ```python
   # Using huggingface_hub
   from huggingface_hub import snapshot_download
   snapshot_download(
       repo_id="EleutherAI/pile-preshuffled-seeds",
       repo_type="dataset",
       allow_patterns="seed3/*",  # Download only seed3
       local_dir="./pile-seeds"
   )
   ```
3. Use the `.idx` files with GPT-NeoX's `MMapIndexedDataset` (see the indexing sketch after this list):
   ```python
   # MMapIndexedDataset is defined in the GPT-NeoX codebase (megatron/data/indexed_dataset.py)
   from megatron.data.indexed_dataset import MMapIndexedDataset

   # path_prefix points to the .bin/.idx pair for the chosen seed, without the file extension
   dataset = MMapIndexedDataset(path_prefix, skip_warmup=True)
   ```
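
Once the dataset from step 3 is loaded, the sequences seen at any training step can be recovered by index. The sketch below assumes the Pythia training batch size of 1024 sequences per optimizer step; verify this against the Pythia repository before relying on it:

```python
# Sketch: recover the token sequences seen at a given training step.
# Assumes 1024 sequences per optimizer step (the Pythia training setup).
BATCH_SIZE = 1024

step = 50_000  # matches the `revision="step50000"` checkpoint example above
start, end = step * BATCH_SIZE, (step + 1) * BATCH_SIZE
batch = [dataset[i] for i in range(start, end)]  # arrays of token IDs for this step
```
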
For complete training reproduction instructions, see the [Pythia GitHub repository](https://github.com/EleutherAI/pythia).

## All PolyPythias Models

The complete collection is available at: [EleutherAI/polypythias](https://huggingface.co/collections/EleutherAI/polypythias)

### 14M Parameter Models
- [pythia-14m-seed1](https://huggingface.co/EleutherAI/pythia-14m-seed1) through [pythia-14m-seed9](https://huggingface.co/EleutherAI/pythia-14m-seed9)

### 31M Parameter Models
- [pythia-31m-seed1](https://huggingface.co/EleutherAI/pythia-31m-seed1) through [pythia-31m-seed9](https://huggingface.co/EleutherAI/pythia-31m-seed9)

### 70M Parameter Models
- [pythia-70m-seed1](https://huggingface.co/EleutherAI/pythia-70m-seed1) through [pythia-70m-seed9](https://huggingface.co/EleutherAI/pythia-70m-seed9)

### 160M Parameter Models
- [pythia-160m-seed1](https://huggingface.co/EleutherAI/pythia-160m-seed1) through [pythia-160m-seed9](https://huggingface.co/EleutherAI/pythia-160m-seed9)
- [pythia-160m-data-seed1](https://huggingface.co/EleutherAI/pythia-160m-data-seed1) through [pythia-160m-data-seed3](https://huggingface.co/EleutherAI/pythia-160m-data-seed3)
- [pythia-160m-weight-seed1](https://huggingface.co/EleutherAI/pythia-160m-weight-seed1) through [pythia-160m-weight-seed3](https://huggingface.co/EleutherAI/pythia-160m-weight-seed3)

### 410M Parameter Models
- [pythia-410m-seed1](https://huggingface.co/EleutherAI/pythia-410m-seed1) through [pythia-410m-seed9](https://huggingface.co/EleutherAI/pythia-410m-seed9)

## Evaluation Results

Evaluation results for all models are available in the [polypythias-evals](https://huggingface.co/datasets/EleutherAI/polypythias-evals) dataset.

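One simple way to pull these results locally is to snapshot the dataset repository with `huggingface_hub` and inspect the files; a minimal sketch (the file layout is not documented in this card):

```python
from huggingface_hub import snapshot_download

# Download the raw evaluation files; inspect the local folder to see how
# results are organized per model and checkpoint.
local_path = snapshot_download(
    repo_id="EleutherAI/polypythias-evals",
    repo_type="dataset",
    local_dir="./polypythias-evals",
)
print(local_path)
```
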
## Limitations

These models are released for research purposes only. They are **not** intended for deployment in production systems.

- **Not instruction-tuned**: These are base language models that predict the next token; they will not follow instructions the way chat-tuned assistants such as ChatGPT do
- **May generate harmful content**: The Pile contains diverse internet text that includes biased, offensive, and factually incorrect content
- **English only**: Models were trained primarily on English text
- **No safety filtering**: Outputs are not filtered for safety or accuracy

## License

Apache 2.0

## Contact

For questions about these models, please use:
- [EleutherAI Discord](https://discord.gg/eleutherai) - #release-discussion channel
- [GitHub Issues](https://github.com/EleutherAI/pythia/issues)

## Citation

If you use these models, please cite:

```bibtex
@inproceedings{vanderwal2025polypythias,
  title={PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs},
  author={van der Wal, Oskar and Lesci, Pietro and M{\"u}ller-Eberstein, Max and Saphra, Naomi and Schoelkopf, Hailey and Zuidema, Willem and Biderman, Stella},
  booktitle={International Conference on Learning Representations},
  year={2025},
  url={https://arxiv.org/abs/2503.09543}
}
```
|