---
license: mit
tags:
- sorting
- mechanistic-interpretability
- transformers
- toy-model
---

# SortGPT Checkpoints

Checkpoints for small decoder-only transformers trained on the **integer sorting task**.

## Task

The model takes a sequence of `k` integers from `{0, ..., N-1}` followed by a SEP token, and must output the sorted sequence:

```
[unsorted_tokens | SEP | sorted_tokens]
```

The full sequence has length `2*k + 1`. The SEP token index is `N` (i.e., `vocab_size = N + 1`).

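
The sequence layout above can be illustrated with a minimal sketch. The helper name `make_sorting_example` is hypothetical and not part of this repository, and the sketch assumes ascending order and integers drawn without repetition; the actual data pipeline may differ.

```python
import random


def make_sorting_example(k: int, N: int, rng: random.Random) -> list[int]:
    """Build one [unsorted_tokens | SEP | sorted_tokens] sequence (illustrative only)."""
    sep = N  # SEP uses index N, so vocab_size = N + 1
    # Assumption: k distinct integers per sequence, sorted in ascending order.
    unsorted_tokens = rng.sample(range(N), k)
    return unsorted_tokens + [sep] + sorted(unsorted_tokens)


rng = random.Random(0)
seq = make_sorting_example(k=4, N=16, rng=rng)
print(seq)       # 4 unsorted tokens, SEP (16), then the same 4 tokens sorted
print(len(seq))  # 2*k + 1
```
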
## Grid

| Parameter           | Values                |
|---------------------|-----------------------|
| `k` (length)        | 16, 32                |
| `N` (integer range) | 128, 256, 512, 1024   |
| Seeds               | 1, 2, 3, 4, 5         |
| `n_embd`            | 64                    |
| `n_layers`          | 2                     |
| `n_heads`           | 1                     |
| `init_std`          | 0.01                  |
| `lr`                | 0.03                  |
| `max_iters`         | 100,000               |

8 configs (2 values of `k` × 4 values of `N`) × 5 seeds = **40 runs**, each with 20 checkpoints (one every 5,000 steps).

## Architecture

Small GPT-2-style decoder-only transformer:

- Token embeddings only, with no positional embeddings (`without_pos=True`)
- 2 pre-norm transformer blocks, each with causal self-attention + MLP
- Final LayerNorm + LM head
- Weight tying between token embedding and LM head

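
To give a sense of scale, the sketch below estimates the number of weight-matrix parameters for each `N` in the grid. It is a rough estimate under assumptions the model card does not state: a 4x MLP expansion (the GPT-2 convention), with biases and LayerNorm parameters ignored. The function `approx_param_count` is hypothetical; `model.py` remains the authoritative definition.

```python
def approx_param_count(N: int, n_embd: int = 64, n_layers: int = 2, mlp_ratio: int = 4) -> int:
    """Rough weight-matrix parameter count (ignores biases and LayerNorm)."""
    vocab_size = N + 1  # integers 0..N-1 plus the SEP token
    # Counted once: the LM head shares the token embedding via weight tying.
    embedding = vocab_size * n_embd
    # Q, K, V and output projections, each n_embd x n_embd.
    attention = 4 * n_embd * n_embd
    # Assumption: MLP hidden width is mlp_ratio * n_embd (up- and down-projection).
    mlp = 2 * mlp_ratio * n_embd * n_embd
    return embedding + n_layers * (attention + mlp)


for N in (128, 256, 512, 1024):
    print(N, approx_param_count(N))
```

Because there are no positional embeddings, the only term that grows with `N` is the shared embedding matrix.
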
## File Structure

```
checkpoints/
  k{16,32}_N{128,256,512,1024}/
    seed{1,2,3,4,5}/
      std0p01_iseed{S}__ckpt{iter}.pt
model.py              # Model definition + loading utilities
```

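
The naming pattern above can be expanded into a full list of checkpoint paths. This is a sketch under two assumptions: that `{S}` matches the seed directory number (as in `seed1/std0p01_iseed1__...`), and that checkpoints are saved at steps 5,000 through 100,000. The helper `checkpoint_path` is hypothetical and not part of the repository.

```python
from itertools import product

KS = (16, 32)
NS = (128, 256, 512, 1024)
SEEDS = (1, 2, 3, 4, 5)
# Assumption: 20 checkpoints per run at steps 5,000, 10,000, ..., 100,000.
ITERS = range(5_000, 100_001, 5_000)


def checkpoint_path(k: int, N: int, seed: int, it: int, root: str = "checkpoints") -> str:
    """Build a relative checkpoint path following the documented layout."""
    # Assumption: the init seed in the file name equals the seed directory number.
    return f"{root}/k{k}_N{N}/seed{seed}/std0p01_iseed{seed}__ckpt{it}.pt"


all_paths = [checkpoint_path(k, N, s, it) for k, N, s, it in product(KS, NS, SEEDS, ITERS)]
print(len(all_paths))  # 8 configs x 5 seeds x 20 checkpoints
print(all_paths[0])
```
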
## Loading a Checkpoint

```python
# Copy model.py to your project, then:
from model import load_model_from_checkpoint

model = load_model_from_checkpoint("checkpoints/k32_N512/seed1/std0p01_iseed1__ckpt100000.pt")
```

Each `.pt` file is a dict with keys:

- `model_config`: dict of `GPTConfig` fields
- `model_state_dict`: PyTorch state dict
- `checkpoint_iter`, `init_seed`, `init_std`, `l1_init_scale`

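
To inspect a checkpoint's metadata without building the model, the raw dict can be loaded with `torch.load` and checked against the keys listed above. The helpers `missing_keys` and `load_raw_checkpoint` are hypothetical and not part of `model.py`; this is a sketch that assumes the file is a plain dict as described.

```python
EXPECTED_KEYS = {
    "model_config",
    "model_state_dict",
    "checkpoint_iter",
    "init_seed",
    "init_std",
    "l1_init_scale",
}


def missing_keys(ckpt: dict) -> list[str]:
    """Return the documented keys that are absent from a checkpoint dict."""
    return sorted(EXPECTED_KEYS - set(ckpt))


def load_raw_checkpoint(path: str) -> dict:
    """Load a .pt file on CPU and verify it has the documented keys."""
    import torch  # imported lazily so missing_keys works without PyTorch installed

    ckpt = torch.load(path, map_location="cpu")
    absent = missing_keys(ckpt)
    if absent:
        raise KeyError(f"checkpoint is missing keys: {absent}")
    return ckpt
```

For example, `load_raw_checkpoint(path)["model_config"]` would return the stored `GPTConfig` fields, and `["checkpoint_iter"]` the training step at which the file was saved.
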
## Training Details

- Optimizer: AdamW (`betas=(0.9, 0.95)`)
- LR schedule: cosine decay with linear warmup
- Batch size: 128
- Data: randomly sampled sorting problems (no duplicates)
- `data_seed`: 1337 (shared across all runs)
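
The schedule named above (cosine decay with linear warmup) can be sketched as follows. The peak rate `0.03` and `max_iters=100_000` come from the grid; the warmup length and the final learning rate are not stated in this card, so `warmup_iters=1_000` and `min_lr=0.0` are placeholder assumptions, and `lr_at` is a hypothetical helper rather than the training script's own function.

```python
import math


def lr_at(step: int, base_lr: float = 0.03, max_iters: int = 100_000,
          warmup_iters: int = 1_000, min_lr: float = 0.0) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay."""
    # warmup_iters and min_lr are illustrative assumptions, not documented values.
    if step < warmup_iters:
        return base_lr * step / warmup_iters
    progress = min((step - warmup_iters) / (max_iters - warmup_iters), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 500, 1_000, 50_500, 100_000):
    print(step, lr_at(step))
```
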