	---
	license: mit
	tags:
	- sorting
	- mechanistic-interpretability
	- transformers
	- toy-model
	---

	# SortGPT Checkpoints

	Checkpoints for small decoder-only transformers trained on the integer sorting task.

	## Task

	The model receives a sequence of `k` integers from `{0, ..., N-1}` followed by a SEP token, and must then output the same integers in sorted order:

	```
	[unsorted_tokens | SEP | sorted_tokens]
	```

	The full sequence length is `2*k + 1` (`k` unsorted tokens, one SEP token, `k` sorted tokens). The SEP token index is `N`, so `vocab_size = N + 1`.
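
As a minimal sketch of the sequence layout described above, the snippet below builds one example in plain Python. The helper name `make_example` is hypothetical, and sampling without replacement is an assumption based on the "no duplicates" note under Training Details; the data pipeline used for training may differ.

```python
import random

def make_example(k: int, N: int, rng: random.Random) -> list[int]:
    """Build one [unsorted | SEP | sorted] sequence of length 2*k + 1."""
    sep = N  # SEP uses index N, so vocab_size = N + 1
    # Assumption: the k integers are distinct (sampled without replacement).
    unsorted_tokens = rng.sample(range(N), k)
    return unsorted_tokens + [sep] + sorted(unsorted_tokens)

seq = make_example(k=16, N=128, rng=random.Random(0))
print(len(seq))  # 33, i.e. 2*16 + 1
```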

	## Grid

	| Parameter    | Values               |
	|--------------|----------------------|
	| `k` (length) | 16, 32               |
	| `N` (vocab)  | 128, 256, 512, 1024  |
	| Seeds        | 1, 2, 3, 4, 5        |
	| `n_embd`     | 64                   |
	| `n_layers`   | 2                    |
	| `n_heads`    | 1                    |
	| `init_std`   | 0.01                 |
	| `lr`         | 0.03                 |
	| `max_iters`  | 100,000              |

	8 configs × 5 seeds = 40 runs, each with 20 checkpoints (every 5,000 steps).

	## Architecture

	Small GPT-2-style decoder-only transformer:
	- Token embeddings (no positional embeddings — `without_pos=True`)
	- 2 pre-norm transformer blocks, each with causal self-attention + MLP
	- Final LayerNorm, followed by an LM head whose weights are tied to the token embedding
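
To give a feel for how small these models are, here is a rough parameter estimate for the architecture listed above. It is a sketch under stated assumptions, not a count taken from `model.py`: it assumes a GPT-2-style 4x MLP expansion and ignores biases and LayerNorm parameters. The function name `approx_param_count` is hypothetical.

```python
def approx_param_count(N: int, n_embd: int = 64, n_layers: int = 2) -> int:
    """Rough weight-matrix count for one model in the grid.

    Assumptions (not confirmed by the model card): a 4x MLP expansion as in
    GPT-2; biases and LayerNorm parameters are ignored.
    """
    vocab_size = N + 1                   # N integers plus the SEP token
    embedding = vocab_size * n_embd      # counted once: shared with the LM head
    attention = 4 * n_embd * n_embd      # Q, K, V and output projections
    mlp = 2 * (4 * n_embd) * n_embd      # up- and down-projection
    return embedding + n_layers * (attention + mlp)

for N in (128, 256, 512, 1024):
    print(N, approx_param_count(N))
```

Because there are no positional embeddings and the LM head is tied, the only term that grows with `N` is the token embedding.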

	## File Structure

	```
	checkpoints/
	  k{16,32}_N{128,256,512,1024}/
	    seed{1,2,3,4,5}/
	      std0p01_iseed{S}__ckpt{iter}.pt
	model.py    # Model definition + loading utilities
	```
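
The naming pattern above can be expanded into a full list of checkpoint paths. The sketch below assumes that `{S}` matches the seed directory number (as in the `seed1/..._iseed1_...` path used in the loading example) and that checkpoints were written at steps 5,000, 10,000, ..., 100,000; `checkpoint_path` is a hypothetical helper, not part of `model.py`.

```python
from itertools import product

KS = (16, 32)
NS = (128, 256, 512, 1024)
SEEDS = (1, 2, 3, 4, 5)
# Assumption: 20 checkpoints per run, saved every 5,000 steps up to 100,000.
ITERS = range(5_000, 100_001, 5_000)

def checkpoint_path(k: int, N: int, seed: int, it: int) -> str:
    """Format one path following the file structure shown above."""
    return f"checkpoints/k{k}_N{N}/seed{seed}/std0p01_iseed{seed}__ckpt{it}.pt"

paths = [checkpoint_path(k, N, s, it)
         for k, N, s, it in product(KS, NS, SEEDS, ITERS)]
print(len(paths))  # 800 = 8 configs x 5 seeds x 20 checkpoints
```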

	## Loading a Checkpoint

	```python
	# Copy model.py to your project, then:
	from model import load_model_from_checkpoint

	model = load_model_from_checkpoint("checkpoints/k32_N512/seed1/std0p01_iseed1__ckpt100000.pt")
	```

	Each `.pt` file is a dict with keys:
	- `model_config`: dict of `GPTConfig` fields
	- `model_state_dict`: PyTorch state dict
	- `checkpoint_iter`, `init_seed`, `init_std`, `l1_init_scale`
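
If you only want to inspect a file without instantiating the model, the dict can be read directly. The following is a minimal sketch that relies only on the keys listed above; `summarize_checkpoint` and `load_and_summarize` are hypothetical helpers, not functions from `model.py`.

```python
def summarize_checkpoint(ckpt: dict) -> dict:
    """Collect the metadata fields of a checkpoint dict plus a tensor count."""
    state = ckpt["model_state_dict"]
    return {
        "config": dict(ckpt["model_config"]),
        "iter": ckpt["checkpoint_iter"],
        "init_seed": ckpt["init_seed"],
        "init_std": ckpt["init_std"],
        "n_tensors": len(state),
    }

def load_and_summarize(path: str) -> dict:
    """Load a .pt file on CPU and summarize it."""
    import torch  # imported lazily so summarize_checkpoint works without PyTorch
    # map_location="cpu" avoids needing a GPU just to read the file.
    ckpt = torch.load(path, map_location="cpu")
    return summarize_checkpoint(ckpt)
```

For example, `load_and_summarize("checkpoints/k32_N512/seed1/std0p01_iseed1__ckpt100000.pt")` should return the stored config alongside the iteration and seed, assuming the file is present locally.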

	## Training Details

	- Optimizer: AdamW (`betas=(0.9, 0.95)`)
	- LR schedule: cosine decay with linear warmup
	- Batch size: 128
	- Data: randomly sampled sorting problems (no duplicates)
	- `data_seed`: 1337 (shared across all runs)
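
The learning-rate schedule named above (cosine decay with linear warmup) can be sketched as follows. The peak rate and `max_iters` come from the Grid table; the warmup length and the final learning rate are not stated in this card, so `warmup_iters` and `min_lr` below are hypothetical values chosen only for illustration.

```python
import math

def lr_at(it: int, max_lr: float = 0.03, max_iters: int = 100_000,
          warmup_iters: int = 1_000, min_lr: float = 0.0) -> float:
    """Cosine decay with linear warmup.

    warmup_iters and min_lr are assumptions; the actual training values
    are not given in the model card.
    """
    if it < warmup_iters:
        # Linear ramp that reaches max_lr on the last warmup step.
        return max_lr * (it + 1) / warmup_iters
    progress = (it - warmup_iters) / max(1, max_iters - warmup_iters)
    progress = min(progress, 1.0)  # clamp once training length is exceeded
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine

for step in (0, 1_000, 50_000, 100_000):
    print(step, lr_at(step))
```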