Pythia-160M GPT-NeoX Checkpoints
This repository contains the raw GPT-NeoX training checkpoints for Pythia-160M. These are the native checkpoint files produced during training, stored in DeepSpeed's checkpoint format.
If you want to perform inference, use the HuggingFace Transformers-compatible weights at EleutherAI/pythia-160m instead. This repository is intended for research that requires access to optimizer states or the original training format.
Contents
Each branch contains a full training checkpoint at a given step, including:
- `layer_XX-model_00-model_states.pt` – model weight shards (one per layer)
- `mp_rank_00_model_states.pt` – model state metadata
- `zero_pp_rank_*_optim_states.pt` – ZeRO optimizer states (Adam moments, etc.)
- `160M.yml` – GPT-NeoX training configuration
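The shard files can be inspected directly with `torch.load`. The snippet below is a minimal sketch, assuming a checkpoint branch has already been downloaded to a local directory (the path is a placeholder) and that the layer shards are dicts mapping parameter names to tensors:

```python
import glob
import torch

# Placeholder path to a locally downloaded checkpoint branch (e.g. step143000).
ckpt_dir = "/path/to/neox/checkpoint"

# Layer shards: each is a dict of parameter name -> tensor for one layer.
layer_files = sorted(glob.glob(f"{ckpt_dir}/layer_*-model_00-model_states.pt"))
first_layer = torch.load(layer_files[0], map_location="cpu")
for name, tensor in first_layer.items():
    print(name, tuple(tensor.shape))

# ZeRO optimizer shards hold the Adam moments and related optimizer state.
optim_files = sorted(glob.glob(f"{ckpt_dir}/zero_pp_rank_*_optim_states.pt"))
print(torch.load(optim_files[0], map_location="cpu").keys())
```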
Branches
154 checkpoints are available as branches:
- `step0` – initialization
- `step{1,2,4,8,16,32,64,128,256,512}` – log-spaced early checkpoints
- `step1000` through `step143000` – every 1,000 steps
Branch step143000 corresponds to the final model.
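A single checkpoint can be fetched by pointing `huggingface_hub` at the corresponding branch. This is a sketch only; the repository id below is a placeholder for this repo's actual id on the Hub:

```python
from huggingface_hub import snapshot_download

# Placeholder repo id: substitute this repository's actual id on the Hub.
local_dir = snapshot_download(
    repo_id="EleutherAI/<this-repo>",
    revision="step1000",  # branch name selects the training step
)
print(local_dir)
```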
Converting to HuggingFace Format
To convert a checkpoint to HuggingFace Transformers format, use the conversion script from GPT-NeoX:
```bash
python tools/convert_neox_to_hf.py \
    --input_dir /path/to/neox/checkpoint \
    --config_file /path/to/config.yml \
    --output_dir /path/to/hf/output
```
Pre-converted weights for all checkpoints are available at EleutherAI/pythia-160m.
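For example, the pre-converted weights for any training step can be loaded directly with Transformers by passing the branch name as `revision`:

```python
from transformers import AutoTokenizer, GPTNeoXForCausalLM

# Load the HF-format weights for a specific training step via the branch name.
model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-160m", revision="step143000")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m", revision="step143000")
```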
Training Details
Pythia-160M was trained on the Pile.
All Pythia models were trained for 143,000 steps with a batch size of 2M tokens (2,097,152 tokens per step), seeing a total of 299,892,736,000 tokens. See the Pythia paper and GitHub repository for full training details.
| Pythia Model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate |
|---|---|---|---|---|---|---|
| 70M | 18,915,328 | 6 | 512 | 8 | 2M | 1.0 × 10⁻³ |
| 160M | 85,056,000 | 12 | 768 | 12 | 2M | 6.0 × 10⁻⁴ |
| 410M | 302,311,424 | 24 | 1024 | 16 | 2M | 3.0 × 10⁻⁴ |
| 1B | 805,736,448 | 16 | 2048 | 8 | 2M | 3.0 × 10⁻⁴ |
| 1.4B | 1,208,602,624 | 24 | 2048 | 16 | 2M | 2.0 × 10⁻⁴ |
| 2.8B | 2,517,652,480 | 32 | 2560 | 32 | 2M | 1.6 × 10⁻⁴ |
| 6.9B | 6,444,163,072 | 32 | 4096 | 32 | 2M | 1.2 × 10⁻⁴ |
| 12B | 11,327,027,200 | 36 | 5120 | 40 | 2M | 1.2 × 10⁻⁴ |
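As a quick sanity check, the total token count quoted above follows directly from the step count and batch size:

```python
# 143,000 steps at 2,097,152 tokens per step gives the quoted total.
steps = 143_000
tokens_per_step = 2_097_152  # 2M-token batch size
assert steps * tokens_per_step == 299_892_736_000
```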
Citation
```bibtex
@article{biderman2023pythia,
  title={Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling},
  author={Biderman, Stella and Schoelkopf, Hailey and Anthony, Quentin Gregory and Bradley, Herbie and O'Brien, Kyle and Hallahan, Eric and Khan, Mohammad Aflah and Purohit, Shivanshu and Prashanth, USVSN Sai and Raff, Edward and others},
  journal={International Conference on Machine Learning},
  year={2023}
}
```