mhla/gpt1905-d34

@@ -6,34 +6,38 @@ tags:
 - gpt
 - pre-1900
 - historical
 - nanochat
 ---
-# GPT-1905 D34 Base (fully trained)
-3.29B parameter GPT-style language model trained on pre-1905 English text. Training complete (19,103 steps, 40B tokens).
-## Model Details
-- **Architecture:** Custom GPT with RoPE, QK-norm, ReLU², value embeddings (ResFormer), per-layer residual/skip scalars
-- **Parameters:** 3.29B
-- **Layers:** 34
-- **Hidden dim:** 2176
-- **Attention heads:** 17 (query) / 17 (kv)
-- **Head dim:** 128
-- **Context length:** 2048 tokens
-- **Vocab size:** 32,768 (BPE, GPT-4 style split pattern)
-- **Training:** Base pretraining on pre-1905 corpus, 19,103 steps, 40B tokens
-## Checkpoint Contents
-```
-model_019103.pt          # Model weights
-meta_019103.json         # Training config and metadata
-optim_019103_rank*.pt    # Optimizer state shards (if present, for resuming training)
-tokenizer/                   # BPE tokenizer (tiktoken format) + token byte counts
-nanochat/                    # Source code to load and run the model
-```
 ## Quick Start
@@ -48,7 +52,6 @@ with open("meta_019103.json") as f:
     meta = json.load(f)
 config = GPTConfig(**meta["model_config"])
 with torch.device("meta"):
     model = GPT(config)
 model.to_empty(device="cuda")
@@ -58,14 +61,20 @@ state_dict = torch.load("model_019103.pt", map_location="cuda")
 state_dict = {k.removeprefix("_orig_mod."): v for k, v in state_dict.items()}
 model.load_state_dict(state_dict, strict=True, assign=True)
 model.eval()
 bos = tokenizer.get_bos_token_id()
-tokens = tokenizer.encode("It was a dark and stormy night", prepend=bos)
 with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
-    for token in model.generate(tokens, max_tokens=100, temperature=0.8):
         print(tokenizer.decode([token]), end="", flush=True)
 ```
 ## Dependencies
 ```
@@ -73,3 +82,12 @@ torch>=2.9
 tiktoken
 rustbpe
 ```

 - gpt
 - pre-1900
 - historical
+- physics
 - nanochat
 ---
+# GPT-1905
+A 3.29B parameter language model trained on pre-1905 English text. Like [GPT-1900](https://huggingface.co/mhla/gpt1900-d34-22btok), but with a cutoff extended to 1905 — just before Einstein's *annus mirabilis*. This model knows of Planck's early work and Lorentz's electron theory, but has never heard of special relativity or the photon.
+Trained on **~40B tokens** from digitized books and newspapers published before 1905.
+## Training
+- **Data:** Pre-1905 English text corpus (institutional books + American Stories newspapers)
+- **Tokens:** ~40B
+- **Steps:** 19,103
+- **Val BPB:** 0.787
+- **Hardware:** 8x8 H100 GPUs
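As a rough consistency check on the training figures above, the token and step counts imply a per-step batch size, and the validation BPB can be read as a per-byte perplexity. This is only a sketch from the rounded numbers in the card: the variable names are illustrative, the actual batch size is not stated, and the context length is taken from the architecture table.

```python
# Figures copied from the model card; "~40B" is an approximation.
total_tokens = 40e9
steps = 19_103
context_length = 2048  # from the architecture table
val_bpb = 0.787        # validation bits per byte

# Average number of tokens consumed per optimizer step.
tokens_per_step = total_tokens / steps

# Equivalent number of full-length sequences per step.
sequences_per_step = tokens_per_step / context_length

# Bits per byte b corresponds to a per-byte perplexity of 2**b.
per_byte_perplexity = 2 ** val_bpb

print(f"tokens per step:     {tokens_per_step:,.0f}")
print(f"sequences per step:  {sequences_per_step:,.0f}")
print(f"per-byte perplexity: {per_byte_perplexity:.3f}")
```

The result is close to 2^21 tokens per step, which would correspond to 1024 sequences of 2048 tokens, but that is an inference from the rounded totals rather than a documented setting.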
+## Architecture
+Custom GPT with RoPE, QK-norm, ReLU² activation, value embeddings (ResFormer), and per-layer residual/skip scalars. Built with the [nanochat](https://github.com/karpathy/nanochat) framework.
+| Parameter | Value |
+|---|---|
+| Parameters | 3.29B |
+| Layers | 34 |
+| Hidden dim | 2176 |
+| Attention heads | 17 (query) / 17 (kv) |
+| Head dim | 128 |
+| Context length | 2048 tokens |
+| Vocab size | 32,768 (BPE, GPT-4 style split pattern) |
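The table values can be checked against each other: the hidden dimension should equal the number of query heads times the head dimension, and equal query and KV head counts mean plain multi-head attention rather than grouped-query attention. A minimal sketch, with variable names chosen for illustration (they are not necessarily the field names used in `GPTConfig`):

```python
# Values copied from the architecture table.
hidden_dim = 2176
n_query_heads = 17
n_kv_heads = 17
head_dim = 128

# The attention projection width must match the hidden dimension.
assert n_query_heads * head_dim == hidden_dim

# Grouped-query attention would use fewer KV heads than query heads.
uses_grouped_query_attention = n_kv_heads < n_query_heads
print("grouped-query attention:", uses_grouped_query_attention)
```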
 ## Quick Start
     meta = json.load(f)
 config = GPTConfig(**meta["model_config"])
 with torch.device("meta"):
     model = GPT(config)
 model.to_empty(device="cuda")
 state_dict = {k.removeprefix("_orig_mod."): v for k, v in state_dict.items()}
 model.load_state_dict(state_dict, strict=True, assign=True)
 model.eval()
+```
+### Generate text
+```python
 bos = tokenizer.get_bos_token_id()
+tokens = tokenizer.encode("The luminiferous aether", prepend=bos)
 with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
+    for token in model.generate(tokens, max_tokens=200, temperature=0.8):
         print(tokenizer.decode([token]), end="", flush=True)
 ```
 ## Dependencies
 ```
 tiktoken
 rustbpe
 ```
+## Related
+- [mhla/pre1900-corpus](https://huggingface.co/datasets/mhla/pre1900-corpus) — Pre-1900 training corpus with metadata
+- [mhla/gpt1900-physics-clm](https://huggingface.co/datasets/mhla/gpt1900-physics-clm) — Physics texts for continued pretraining
+- [mhla/gpt1900-instruct-v3-data](https://huggingface.co/datasets/mhla/gpt1900-instruct-v3-data) — Instruction-tuning conversation pairs
+- [mhla/gpt1900-contradiction-eval](https://huggingface.co/datasets/mhla/gpt1900-contradiction-eval) — Physics contradiction evaluation problems