Pythia-160M GPT-NeoX Checkpoints

This repository contains the raw GPT-NeoX training checkpoints for Pythia-160M. These are the native checkpoint files produced during training, stored in DeepSpeed's checkpoint format.

If you want to perform inference, use the HuggingFace Transformers-compatible weights at EleutherAI/pythia-160m instead. This repository is intended for research that requires access to optimizer states or the original training format.
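For example, a checkpoint from any training step can be loaded for inference directly from the pre-converted repository by selecting the matching branch (a minimal sketch, assuming the transformers and torch packages are installed):

    from transformers import GPTNeoXForCausalLM, AutoTokenizer

    # Each training step is a branch of EleutherAI/pythia-160m; "step143000" is the final model.
    model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-160m", revision="step143000")
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m", revision="step143000")

    inputs = tokenizer("Hello, I am", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0]))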

Contents

Each branch contains a full training checkpoint at a given step, including:

  • layer_XX-model_00-model_states.pt – model weight shards (one per layer)
  • mp_rank_00_model_states.pt – model state metadata
  • zero_pp_rank_*_optim_states.pt – ZeRO optimizer states (Adam moments, etc.)
  • 160M.yml – GPT-NeoX training configuration
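These files are ordinary pickled PyTorch objects and can be inspected with torch.load. The sketch below is run from inside a downloaded checkpoint branch and assumes torch >= 1.13 (for the weights_only argument); the exact contents of each file may vary:

    import glob
    import torch

    # One weight shard per layer; each shard maps parameter names to tensors.
    shard_path = sorted(glob.glob("layer_*-model_00-model_states.pt"))[0]
    shard = torch.load(shard_path, map_location="cpu", weights_only=False)
    print(shard_path, list(shard.keys()))

    # A ZeRO optimizer-state partition (Adam moments, step counts, partition metadata).
    optim_path = sorted(glob.glob("zero_pp_rank_*_optim_states.pt"))[0]
    optim = torch.load(optim_path, map_location="cpu", weights_only=False)
    print(optim_path, type(optim))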

Branches

154 checkpoints are available as branches:

  • step0 – initialization
  • step{1,2,4,8,16,32,64,128,256,512} – log-spaced early checkpoints
  • step1000 through step143000 – every 1,000 steps

Branch step143000 corresponds to the final model.
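A single branch can be fetched without cloning the entire repository, for example with huggingface_hub (a sketch; the repository id is taken from this model card):

    from huggingface_hub import snapshot_download

    # Each training step lives on its own branch, so pass it as the revision.
    local_dir = snapshot_download(
        repo_id="EleutherAI/neox-ckpt-pythia-160m",
        revision="step143000",  # or any other branch listed above, e.g. "step1000"
    )
    print(local_dir)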

Converting to HuggingFace Format

To convert a checkpoint to HuggingFace Transformers format, use the conversion script from GPT-NeoX:

python tools/convert_neox_to_hf.py \
    --input_dir /path/to/neox/checkpoint \
    --config_file /path/to/config.yml \
    --output_dir /path/to/hf/output
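
The resulting directory loads like any other Transformers model. A minimal sketch (whether the conversion script also writes tokenizer files may depend on the GPT-NeoX version, so the tokenizer is loaded from the pre-converted repository here):

    from transformers import GPTNeoXForCausalLM, AutoTokenizer

    model = GPTNeoXForCausalLM.from_pretrained("/path/to/hf/output")
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")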

Pre-converted weights for all checkpoints are available at EleutherAI/pythia-160m.

Training Details

All Pythia models were trained on the Pile. Each model was trained for 143,000 steps at a batch size of 2M tokens (2,097,152 tokens per step), seeing 299,892,736,000 tokens in total. See the Pythia paper and the GitHub repository for full training details.
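The total follows directly from the step count and batch size:

    # 143,000 steps x 2,097,152 tokens per step = 299,892,736,000 tokens.
    steps = 143_000
    tokens_per_step = 2_097_152  # the "2M" batch is 2**21 tokens
    assert steps * tokens_per_step == 299_892_736_000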

Pythia Model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate
70M          | 18,915,328           | 6      | 512       | 8     | 2M         | 1.0 × 10⁻³
160M         | 85,056,000           | 12     | 768       | 12    | 2M         | 6.0 × 10⁻⁴
410M         | 302,311,424          | 24     | 1024      | 16    | 2M         | 3.0 × 10⁻⁴
1B           | 805,736,448          | 16     | 2048      | 8     | 2M         | 3.0 × 10⁻⁴
1.4B         | 1,208,602,624        | 24     | 2048      | 16    | 2M         | 2.0 × 10⁻⁴
2.8B         | 2,517,652,480        | 32     | 2560      | 32    | 2M         | 1.6 × 10⁻⁴
6.9B         | 6,444,163,072        | 32     | 4096      | 32    | 2M         | 1.2 × 10⁻⁴
12B          | 11,327,027,200       | 36     | 5120      | 40    | 2M         | 1.2 × 10⁻⁴

Citation

@inproceedings{biderman2023pythia,
  title={Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling},
  author={Biderman, Stella and Schoelkopf, Hailey and Anthony, Quentin Gregory and Bradley, Herbie and O'Brien, Kyle and Hallahan, Eric and Khan, Mohammad Aflah and Purohit, Shivanshu and Prashanth, USVSN Sai and Raff, Edward and others},
  booktitle={International Conference on Machine Learning},
  year={2023}
}