Pythia-160M GPT-NeoX Checkpoints
This repository contains the raw GPT-NeoX training checkpoints for Pythia-160M. These are the native checkpoint files produced during training, stored in DeepSpeed's checkpoint format.
If you want to perform inference, use the HuggingFace Transformers-compatible weights at EleutherAI/pythia-160m instead. This repository is intended for research that requires access to optimizer states or the original training format.
Contents
Each branch contains a full training checkpoint at a given step, including:
- `layer_XX-model_00-model_states.pt` – model weight shards (one per layer)
- `mp_rank_00_model_states.pt` – model state metadata
- `zero_pp_rank_*_optim_states.pt` – ZeRO optimizer states (Adam moments, etc.)
- `160M.yml` – GPT-NeoX training configuration
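The shard files can be inspected directly with `torch.load`. The snippet below is a minimal sketch, assuming a checkpoint branch has already been downloaded to a local directory (the path is a placeholder) and that the layer shards are dicts mapping parameter names to tensors:

```python
import glob
import torch

# Placeholder path to a locally downloaded checkpoint branch (e.g. step143000).
ckpt_dir = "/path/to/neox/checkpoint"

# Layer shards: each is a dict of parameter name -> tensor for one layer.
layer_files = sorted(glob.glob(f"{ckpt_dir}/layer_*-model_00-model_states.pt"))
first_layer = torch.load(layer_files[0], map_location="cpu")
for name, tensor in first_layer.items():
    print(name, tuple(tensor.shape))

# ZeRO optimizer shards hold the Adam moments and related optimizer state.
optim_files = sorted(glob.glob(f"{ckpt_dir}/zero_pp_rank_*_optim_states.pt"))
print(torch.load(optim_files[0], map_location="cpu").keys())
```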
Branches
154 checkpoints are available as branches:
- `step0` – initialization
- `step{1,2,4,8,16,32,64,128,256,512}` – log-spaced early checkpoints
- `step1000` through `step143000` – every 1,000 steps
Branch step143000 corresponds to the final model.
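A single checkpoint can be fetched by pointing `huggingface_hub` at the corresponding branch. This is a sketch only; the repository id below is a placeholder for this repo's actual id on the Hub:

```python
from huggingface_hub import snapshot_download

# Placeholder repo id: substitute this repository's actual id on the Hub.
local_dir = snapshot_download(
    repo_id="EleutherAI/<this-repo>",
    revision="step1000",  # branch name selects the training step
)
print(local_dir)
```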
Converting to HuggingFace Format
To convert a checkpoint to HuggingFace Transformers format, use the conversion script from GPT-NeoX:
```bash
python tools/convert_neox_to_hf.py \
    --input_dir /path/to/neox/checkpoint \
    --config_file /path/to/config.yml \
    --output_dir /path/to/hf/output
```
Pre-converted weights for all checkpoints are available at EleutherAI/pythia-160m.
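For example, the pre-converted weights for any training step can be loaded directly with Transformers by passing the branch name as `revision`:

```python
from transformers import AutoTokenizer, GPTNeoXForCausalLM

# Load the HF-format weights for a specific training step via the branch name.
model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-160m", revision="step143000")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m", revision="step143000")
```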
Training Details
Pythia-160M was trained on the Pile.
All Pythia models were trained for 143,000 steps with a batch size of 2M tokens (2,097,152 tokens per step), seeing a total of 299,892,736,000 tokens. See the Pythia paper and GitHub repository for full training details.
| Pythia Model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate |
|---|---|---|---|---|---|---|
| 70M | 18,915,328 | 6 | 512 | 8 | 2M | 1.0 × 10⁻³ |
| 160M | 85,056,000 | 12 | 768 | 12 | 2M | 6.0 × 10⁻⁴ |
| 410M | 302,311,424 | 24 | 1024 | 16 | 2M | 3.0 × 10⁻⁴ |
| 1B | 805,736,448 | 16 | 2048 | 8 | 2M | 3.0 × 10⁻⁴ |
| 1.4B | 1,208,602,624 | 24 | 2048 | 16 | 2M | 2.0 × 10⁻⁴ |
| 2.8B | 2,517,652,480 | 32 | 2560 | 32 | 2M | 1.6 × 10⁻⁴ |
| 6.9B | 6,444,163,072 | 32 | 4096 | 32 | 2M | 1.2 × 10⁻⁴ |
| 12B | 11,327,027,200 | 36 | 5120 | 40 | 2M | 1.2 × 10⁻⁴ |
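As a quick sanity check, the total token count quoted above follows directly from the step count and batch size:

```python
# 143,000 steps at 2,097,152 tokens per step gives the quoted total.
steps = 143_000
tokens_per_step = 2_097_152  # 2M-token batch size
assert steps * tokens_per_step == 299_892_736_000
```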
Citation
```bibtex
@article{biderman2023pythia,
  title={Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling},
  author={Biderman, Stella and Schoelkopf, Hailey and Anthony, Quentin Gregory and Bradley, Herbie and O'Brien, Kyle and Hallahan, Eric and Khan, Mohammad Aflah and Purohit, Shivanshu and Prashanth, USVSN Sai and Raff, Edward and others},
  journal={International Conference on Machine Learning},
  year={2023}
}
```