GPT-1900 (D26, 8B tokens)

A 1.2B-parameter GPT-style language model trained exclusively on pre-1900 English text.

Model Details

  • Architecture: Custom GPT with RoPE, QK-norm, ReLU², value embeddings (ResFormer), per-layer residual/skip scalars
  • Parameters: ~1.2B
  • Layers: 26
  • Hidden dim: 1664
  • Attention heads: 13 (query) / 13 (kv)
  • Head dim: 128
  • Context length: 2048 tokens
  • Vocab size: 32,768 (BPE, GPT-4 style split pattern)
  • Training: ~8B tokens of pre-1900 English text, FP8 (tensorwise), Muon+AdamW optimizer
  • Final val BPB: 1.211
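As a sanity check on the nominal size, the dimensions above support a rough parameter estimate. This is a back-of-the-envelope sketch: it assumes a standard 4d² attention block plus 8d² MLP (4x width) per layer and an untied LM head, and it ignores the ResFormer value embeddings, norms, and per-layer scalars, which account for the gap to the nominal 1.2B.

```python
def approx_params(n_layer=26, n_embd=1664, vocab_size=32768):
    """Rough transformer parameter count from the config above."""
    attn = 4 * n_embd ** 2                # Q, K, V, output projections
    mlp = 8 * n_embd ** 2                 # up + down projections at 4x width
    blocks = n_layer * (attn + mlp)
    embeddings = 2 * vocab_size * n_embd  # token embedding + untied LM head
    return blocks + embeddings

print(f"{approx_params() / 1e9:.2f}B")  # ~0.97B before value embeddings etc.
```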

Checkpoint Contents

model_007226.pt          # Model weights (4.9 GB)
meta_007226.json         # Training config and metadata
optim_007226_rank*.pt    # Optimizer state, 8 FSDP shards (for resuming training)
tokenizer/               # BPE tokenizer (tiktoken format) + token byte counts
nanochat/                # Source code to load and run the model
eval_results.csv         # Benchmark eval results at this checkpoint
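Before loading, it can help to verify that a download is complete. A minimal sketch (the `missing_files` helper is hypothetical; its patterns come from the listing above, and the optimizer shards are only needed if you intend to resume training):

```python
import fnmatch

# Patterns from the checkpoint listing above; optimizer shards are
# deliberately excluded since they are only needed to resume training.
REQUIRED = ["model_007226.pt", "meta_007226.json", "tokenizer/*"]

def missing_files(present, required=REQUIRED):
    """Return the required patterns with no match among `present` paths."""
    return [pat for pat in required
            if not any(fnmatch.fnmatch(name, pat) for name in present)]

# Example: a directory missing the tokenizer
print(missing_files(["model_007226.pt", "meta_007226.json"]))  # ['tokenizer/*']
```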

Quick Start

import json
import torch
from nanochat.gpt import GPT, GPTConfig
from nanochat.tokenizer import RustBPETokenizer

# Load tokenizer
tokenizer = RustBPETokenizer.from_directory("tokenizer")

# Load model config from the checkpoint metadata
with open("meta_007226.json") as f:
    meta = json.load(f)

config = GPTConfig(**meta["model_config"])

# Instantiate on the meta device (no memory allocated), then materialize on GPU
with torch.device("meta"):
    model = GPT(config)
model.to_empty(device="cuda")
model.init_weights()

# Load weights; strip the torch.compile wrapper prefix if present
state_dict = torch.load("model_007226.pt", map_location="cuda")
state_dict = {k.removeprefix("_orig_mod."): v for k, v in state_dict.items()}
model.load_state_dict(state_dict, strict=True, assign=True)
model.eval()

# Generate
bos = tokenizer.get_bos_token_id()
tokens = tokenizer.encode("It was a dark and stormy night", prepend=bos)
with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    for token in model.generate(tokens, max_tokens=100, temperature=0.8):
        print(tokenizer.decode([token]), end="", flush=True)
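For context on the `temperature=0.8` argument: generation rescales the logits by `1/temperature` before sampling, so lower values sharpen the distribution toward the argmax. An illustration of the mechanism in plain Python (a generic sketch, not nanochat's exact `generate` implementation):

```python
import math
import random

def sample_with_temperature(logits, temperature=0.8, rng=random.Random(0)):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    # Inverse-CDF sampling over the unnormalized weights
    r = rng.random() * sum(exps)
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e
        if r < acc:
            return i
    return len(exps) - 1

# As temperature -> 0, sampling approaches greedy argmax decoding.
print(sample_with_temperature([0.0, 5.0, 1.0], temperature=0.01))  # 1
```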

Dependencies

torch>=2.9
tiktoken
rustbpe

Eval Results (step 7226)

Task             Accuracy   Centered
hellaswag        0.318      0.091
arc_easy         0.411      0.215
lambada_openai   0.332      0.332
piqa             0.586      0.172
winograd         0.674      0.348
copa             0.570      0.140
CORE             -          0.126
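The Centered column rescales raw accuracy so that chance performance maps to 0 and perfect performance to 1 (the convention used by the DCLM CORE metric; the values above are consistent with chance baselines of 0.25 for the 4-way tasks, 0.5 for the binary ones, and 0 for lambada_openai). The CORE score appears to average over the full CORE task suite rather than just these six tasks. A minimal sketch of the centering formula:

```python
def centered(accuracy, chance):
    """Rescale accuracy so chance -> 0.0 and perfect -> 1.0."""
    return (accuracy - chance) / (1.0 - chance)

print(round(centered(0.318, 0.25), 3))  # hellaswag (4-way): 0.091
print(round(centered(0.674, 0.50), 3))  # winograd (binary): 0.348
```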

Training

Trained with the nanochat framework using 8x H100 GPUs with FSDP. To resume training, load the optimizer shards (optim_007226_rank*.pt) — one per FSDP rank.
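When resuming, each FSDP rank loads only its own shard. A hypothetical helper for mapping shard filenames back to ranks (the `rank_of` name is illustrative; in practice you would build the list from a directory listing):

```python
import re

def rank_of(path):
    """Extract the FSDP rank from an optimizer shard filename, or None."""
    m = re.search(r"rank(\d+)\.pt$", path)
    return int(m.group(1)) if m else None

shards = [f"optim_007226_rank{r}.pt" for r in range(8)]
by_rank = {rank_of(p): p for p in shards}
# Rank 3 would then load its shard with torch.load(by_rank[3]).
print(by_rank[3])  # optim_007226_rank3.pt
```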
