GPT-1900 (D26, 8B tokens)

A 1.2B-parameter GPT-style language model trained exclusively on pre-1900 English text.

Model Details

  • Architecture: Custom GPT with RoPE, QK-norm, ReLU², value embeddings (ResFormer), per-layer residual/skip scalars
  • Parameters: ~1.2B
  • Layers: 26
  • Hidden dim: 1664
  • Attention heads: 13 (query) / 13 (kv)
  • Head dim: 128
  • Context length: 2048 tokens
  • Vocab size: 32,768 (BPE, GPT-4 style split pattern)
  • Training: ~8B tokens of pre-1900 English text, FP8 (tensorwise), Muon+AdamW optimizer
  • Final val BPB: 1.211
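As a sanity check on the nominal size, the dimensions above support a rough parameter estimate. This is a back-of-the-envelope sketch: it assumes a standard 4d² attention block plus 8d² MLP (4x width) per layer and an untied LM head, and it ignores the ResFormer value embeddings, norms, and per-layer scalars, which account for the gap to the nominal 1.2B.

```python
def approx_params(n_layer=26, n_embd=1664, vocab_size=32768):
    """Rough transformer parameter count from the config above."""
    attn = 4 * n_embd ** 2                # Q, K, V, output projections
    mlp = 8 * n_embd ** 2                 # up + down projections at 4x width
    blocks = n_layer * (attn + mlp)
    embeddings = 2 * vocab_size * n_embd  # token embedding + untied LM head
    return blocks + embeddings

print(f"{approx_params() / 1e9:.2f}B")  # ~0.97B before value embeddings etc.
```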

Checkpoint Contents

model_007226.pt          # Model weights (4.9 GB)
meta_007226.json         # Training config and metadata
optim_007226_rank*.pt    # Optimizer state, 8 FSDP shards (for resuming training)
tokenizer/               # BPE tokenizer (tiktoken format) + token byte counts
nanochat/                # Source code to load and run the model
eval_results.csv         # Benchmark eval results at this checkpoint
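Before loading, it can help to verify that a download is complete. A minimal sketch (the `missing_files` helper is hypothetical; its patterns come from the listing above, and the optimizer shards are only needed if you intend to resume training):

```python
import fnmatch

# Patterns from the checkpoint listing above; optimizer shards are
# deliberately excluded since they are only needed to resume training.
REQUIRED = ["model_007226.pt", "meta_007226.json", "tokenizer/*"]

def missing_files(present, required=REQUIRED):
    """Return the required patterns with no match among `present` paths."""
    return [pat for pat in required
            if not any(fnmatch.fnmatch(name, pat) for name in present)]

# Example: a directory missing the tokenizer
print(missing_files(["model_007226.pt", "meta_007226.json"]))  # ['tokenizer/*']
```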

Quick Start

import json
import torch
from nanochat.gpt import GPT, GPTConfig
from nanochat.tokenizer import RustBPETokenizer

# Load tokenizer
tokenizer = RustBPETokenizer.from_directory("tokenizer")

# Load model config from the checkpoint metadata
with open("meta_007226.json") as f:
    meta = json.load(f)

config = GPTConfig(**meta["model_config"])

# Instantiate on the meta device (no memory allocated), then materialize on GPU
with torch.device("meta"):
    model = GPT(config)
model.to_empty(device="cuda")
model.init_weights()

# Load weights; strip the torch.compile wrapper prefix if present
state_dict = torch.load("model_007226.pt", map_location="cuda")
state_dict = {k.removeprefix("_orig_mod."): v for k, v in state_dict.items()}
model.load_state_dict(state_dict, strict=True, assign=True)
model.eval()

# Generate
bos = tokenizer.get_bos_token_id()
tokens = tokenizer.encode("It was a dark and stormy night", prepend=bos)
with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    for token in model.generate(tokens, max_tokens=100, temperature=0.8):
        print(tokenizer.decode([token]), end="", flush=True)
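For context on the `temperature=0.8` argument: generation rescales the logits by `1/temperature` before sampling, so lower values sharpen the distribution toward the argmax. An illustration of the mechanism in plain Python (a generic sketch, not nanochat's exact `generate` implementation):

```python
import math
import random

def sample_with_temperature(logits, temperature=0.8, rng=random.Random(0)):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    # Inverse-CDF sampling over the unnormalized weights
    r = rng.random() * sum(exps)
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e
        if r < acc:
            return i
    return len(exps) - 1

# As temperature -> 0, sampling approaches greedy argmax decoding.
print(sample_with_temperature([0.0, 5.0, 1.0], temperature=0.01))  # 1
```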

Dependencies

torch>=2.9
tiktoken
rustbpe

Eval Results (step 7226)

Task             Accuracy   Centered
hellaswag        0.318      0.091
arc_easy         0.411      0.215
lambada_openai   0.332      0.332
piqa             0.586      0.172
winograd         0.674      0.348
copa             0.570      0.140
CORE             -          0.126
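The Centered column rescales raw accuracy so that chance performance maps to 0 and perfect performance to 1 (the convention used by the DCLM CORE metric; the values above are consistent with chance baselines of 0.25 for the 4-way tasks, 0.5 for the binary ones, and 0 for lambada_openai). The CORE score appears to average over the full CORE task suite rather than just these six tasks. A minimal sketch of the centering formula:

```python
def centered(accuracy, chance):
    """Rescale accuracy so chance -> 0.0 and perfect -> 1.0."""
    return (accuracy - chance) / (1.0 - chance)

print(round(centered(0.318, 0.25), 3))  # hellaswag (4-way): 0.091
print(round(centered(0.674, 0.50), 3))  # winograd (binary): 0.348
```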

Training

Trained with the nanochat framework using 8x H100 GPUs with FSDP. To resume training, load the optimizer shards (optim_007226_rank*.pt) — one per FSDP rank.
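When resuming, each FSDP rank loads only its own shard. A hypothetical helper for mapping shard filenames back to ranks (the `rank_of` name is illustrative; in practice you would build the list from a directory listing):

```python
import re

def rank_of(path):
    """Extract the FSDP rank from an optimizer shard filename, or None."""
    m = re.search(r"rank(\d+)\.pt$", path)
    return int(m.group(1)) if m else None

shards = [f"optim_007226_rank{r}.pt" for r in range(8)]
by_rank = {rank_of(p): p for p in shards}
# Rank 3 would then load its shard with torch.load(by_rank[3]).
print(by_rank[3])  # optim_007226_rank3.pt
```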
