---
license: mit
tags:
- sorting
- mechanistic-interpretability
- transformers
- toy-model
---
# SortGPT Checkpoints
Checkpoints for small decoder-only transformers trained on the **integer sorting task**.
## Task
The model receives `k` integers drawn from `{0, ..., N-1}` followed by a SEP token, and must then output the same integers in sorted order:
```
[unsorted_tokens | SEP | sorted_tokens]
```
The full sequence length is `2*k + 1` (`k` unsorted tokens, one SEP token, `k` sorted tokens). The SEP token has index `N`, so `vocab_size = N + 1`.
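As a minimal sketch of this layout (not the authors' data pipeline), the snippet below builds one example with the standard library. The name `make_example` is hypothetical, and sampling without replacement is an assumption based on the "no duplicates" note under Training Details.

```python
import random

def make_example(k: int, N: int, rng: random.Random) -> list[int]:
    """Return [unsorted_tokens | SEP | sorted_tokens] as a flat list of token ids."""
    sep = N  # SEP uses the index right after the largest integer token
    unsorted = rng.sample(range(N), k)  # assumption: k distinct integers
    return unsorted + [sep] + sorted(unsorted)

rng = random.Random(0)
seq = make_example(k=16, N=128, rng=rng)
print(len(seq))  # 2*k + 1 = 33
print(seq[16])   # 128, the SEP token
```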
## Grid
| Parameter | Values |
|--------------|-----------------------|
| `k` (length) | 16, 32 |
| `N` (integer range) | 128, 256, 512, 1024 |
| Seeds | 1, 2, 3, 4, 5 |
| `n_embd` | 64 |
| `n_layers` | 2 |
| `n_heads` | 1 |
| `init_std` | 0.01 |
| `lr` | 0.03 |
| `max_iters` | 100,000 |
8 configs × 5 seeds = **40 runs**, each with 20 checkpoints (every 5,000 steps).
## Architecture
Small GPT-2-style decoder-only transformer:
- Token embeddings (no positional embeddings — `without_pos=True`)
- 2 pre-norm transformer blocks, each with causal self-attention + MLP
- Final LayerNorm, followed by an LM head whose weights are tied to the token embedding
## File Structure
```
checkpoints/
  k{16,32}_N{128,256,512,1024}/
    seed{1,2,3,4,5}/
      std0p01_iseed{S}__ckpt{iter}.pt
model.py    # Model definition + loading utilities
```
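The naming scheme above can be expanded into concrete paths. The sketch below is illustrative only: `checkpoint_path` is a hypothetical helper, `{S}` is assumed to equal the seed directory number (as in `seed1/std0p01_iseed1__...`), and the saved iterations are assumed to be 5,000, 10,000, ..., 100,000.

```python
from pathlib import Path

KS = (16, 32)
NS = (128, 256, 512, 1024)
SEEDS = (1, 2, 3, 4, 5)
# Assumption: 20 checkpoints saved at 5,000, 10,000, ..., 100,000.
ITERS = tuple(range(5_000, 100_001, 5_000))

def checkpoint_path(k: int, N: int, seed: int, it: int, root: str = "checkpoints") -> Path:
    """Build the path of one checkpoint following the documented layout."""
    return Path(root) / f"k{k}_N{N}" / f"seed{seed}" / f"std0p01_iseed{seed}__ckpt{it}.pt"

all_paths = [
    checkpoint_path(k, N, seed, it)
    for k in KS for N in NS for seed in SEEDS for it in ITERS
]
print(len(all_paths))  # 8 configs * 5 seeds * 20 checkpoints = 800
```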
## Loading a Checkpoint
```python
# Copy model.py to your project, then:
from model import load_model_from_checkpoint
model = load_model_from_checkpoint("checkpoints/k32_N512/seed1/std0p01_iseed1__ckpt100000.pt")
```
Each `.pt` file is a dict with keys:
- `model_config`: dict of `GPTConfig` fields
- `model_state_dict`: PyTorch state dict
- `checkpoint_iter`, `init_seed`, `init_std`, `l1_init_scale`
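To inspect a checkpoint without building the model, the raw dict can be loaded directly. This is a rough sketch rather than part of `model.py`: `load_raw_checkpoint` and `missing_keys` are hypothetical helpers, and the key set is copied from the list above.

```python
EXPECTED_KEYS = {
    "model_config", "model_state_dict",
    "checkpoint_iter", "init_seed", "init_std", "l1_init_scale",
}

def load_raw_checkpoint(path):
    """Load the raw checkpoint dict on CPU (requires PyTorch)."""
    import torch  # imported lazily so the key check below works without it
    return torch.load(path, map_location="cpu")

def missing_keys(ckpt: dict) -> set:
    """Return the documented keys that are absent from a loaded checkpoint dict."""
    return EXPECTED_KEYS - set(ckpt)

# Usage, with the path from the loading example:
# ckpt = load_raw_checkpoint("checkpoints/k32_N512/seed1/std0p01_iseed1__ckpt100000.pt")
# assert not missing_keys(ckpt)
# print(ckpt["checkpoint_iter"], ckpt["model_config"])
```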
## Training Details
- Optimizer: AdamW (`betas=(0.9, 0.95)`)
- LR schedule: cosine decay with linear warmup
- Batch size: 128
- Data: randomly sampled sorting problems (no duplicates)
- `data_seed`: 1337 (shared across all runs)
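For reference, a cosine schedule with linear warmup typically looks like the sketch below. It uses `lr = 0.03` and `max_iters = 100,000` from the grid, but `warmup_iters` and `min_lr` are hypothetical placeholders, since this README does not state them; the actual training script may differ.

```python
import math

def lr_at(it: int, max_lr: float = 0.03, max_iters: int = 100_000,
          warmup_iters: int = 1_000, min_lr: float = 0.0) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr at max_iters."""
    # warmup_iters and min_lr are assumed values, not taken from the README.
    if it < warmup_iters:
        return max_lr * (it + 1) / warmup_iters
    progress = min(1.0, (it - warmup_iters) / (max_iters - warmup_iters))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rate at each saved checkpoint step under these assumptions.
for it in (5_000, 50_000, 100_000):
    print(it, lr_at(it))
```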