DeseretLM-200M

A 209.7M-parameter chat model trained from scratch on synthetic text written exclusively in the Deseret Alphabet, a 19th-century phonetic writing system for English. The model produces and consumes Deseret-only text; English is auto-translated at the input boundary.

Total training cost: ~$45 USD on a single H100.

What this is

This is, to our knowledge, the first language model trained from scratch in the Deseret Alphabet. It demonstrates that a phonetic-orthography-only LLM can be built end-to-end on a hobby budget by:

  1. Reverse-engineering a translation pipeline (English → Deseret) and validating it against the 1869 Book of Mormon (99.965 % parity).
  2. Synthesizing a ~11 B-token pre-training corpus from FineWeb-Edu.
  3. Synthesizing a 200 k-conversation chat corpus from UltraChat.
  4. Training a Llama-style transformer for ~12 hours on a single H100.

Architecture

Standard decoder-only transformer with modern components:

| Hyperparameter | Value |
| --- | --- |
| Parameters | 209,716,224 |
| Layers | 16 |
| Hidden size (d_model) | 1024 |
| Attention heads | 16 |
| MLP intermediate (SwiGLU) | 2730 |
| Vocab size | 8,192 |
| Context length | 1024 |
| Normalization | RMSNorm |
| Positional encoding | RoPE (base 10000) |
| Tied embeddings | yes |
| Activation | SwiGLU |
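
The parameter total can be re-derived from the other rows of the table. The sketch below assumes details the card does not state explicitly: no bias terms, full multi-head attention (four d_model × d_model projections), two RMSNorm gains per layer plus one final RMSNorm, and no learned positional parameters (RoPE has none). Under those assumptions the arithmetic reproduces the listed count exactly.

```python
def count_parameters(n_layers=16, d_model=1024, d_ff=2730, vocab=8192):
    """Count weights for a bias-free Llama-style decoder with tied embeddings."""
    embedding = vocab * d_model          # shared with the output head (tied)
    attention = 4 * d_model * d_model    # Q, K, V and output projections
    mlp = 3 * d_model * d_ff             # SwiGLU: gate, up and down projections
    norms = 2 * d_model                  # two RMSNorm gain vectors per layer
    final_norm = d_model                 # RMSNorm before the output head
    return embedding + n_layers * (attention + mlp + norms) + final_norm


print(count_parameters())  # 209716224
```

Only about 4% of the weights (8,388,608) sit in the embedding matrix, which is what the small 8k vocabulary and tied embeddings buy at this scale.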

Tokenizer

Byte-level BPE with 8k vocab, trained on the full Deseret corpus. Special tokens at IDs 0–5: <|pad|> <|bos|> <|eos|> <|user|> <|assistant|> <|system|>.

Get the tokenizer at chrisjpatty/deseret-8k-bpe.
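
One reason a byte-level BPE needs learned merges here: Deseret letters live in the Unicode range U+10400 to U+1044F, so each one occupies four bytes in UTF-8. The snippet below only illustrates that overhead with the standard library; `bytes_per_letter` is a hypothetical helper for this example and is not part of the released tokenizer.

```python
# Four Deseret letters written as escapes so the snippet survives any editor encoding.
word = "\U00010410\U00010432\U0001044A\U0001042C"

raw = word.encode("utf-8")
print(len(word), len(raw))   # 4 16
print(raw[:4].hex(" "))      # f0 90 90 90


def bytes_per_letter(text: str) -> float:
    """Average UTF-8 bytes per non-space character (the token cost before any merges)."""
    letters = [ch for ch in text if not ch.isspace()]
    return sum(len(ch.encode("utf-8")) for ch in letters) / len(letters)


print(bytes_per_letter(word))     # 4.0
print(bytes_per_letter("hello"))  # 1.0
```

Before any merges are applied, a Deseret word therefore costs four byte tokens per letter, versus one per letter for ASCII English.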

Chat template

<|bos|> <|user|> {user content tokens} <|assistant|> {assistant content tokens} <|eos|>

Multi-turn: repeat the user/assistant pair. Loss during SFT was computed only on assistant tokens + the terminal <|eos|>.
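
As a rough illustration of the template and the loss masking, the sketch below flattens a conversation into token IDs plus a 0/1 mask. It makes two assumptions that the card does not spell out: the special-token IDs follow the order listed in the Tokenizer section, and a multi-turn conversation carries a single <|eos|> at the very end. `build_example` is a hypothetical helper, not the project's training code.

```python
# Assumed special-token IDs, following the order listed in the Tokenizer section.
PAD, BOS, EOS, USER, ASSISTANT, SYSTEM = range(6)


def build_example(turns):
    """Flatten (user_ids, assistant_ids) pairs into input IDs and a loss mask.

    mask[i] == 1 means token i is a training target during SFT.
    """
    ids, mask = [BOS], [0]
    for user_ids, assistant_ids in turns:
        ids += [USER, *user_ids, ASSISTANT, *assistant_ids]
        # Role markers and user content are context only; assistant content is scored.
        mask += [0] * (len(user_ids) + 2) + [1] * len(assistant_ids)
    ids.append(EOS)   # terminal <|eos|> is also a target
    mask.append(1)
    return ids, mask


ids, mask = build_example([([10, 11], [20, 21, 22]), ([12], [23])])
print(ids)   # [1, 3, 10, 11, 4, 20, 21, 22, 3, 12, 4, 23, 2]
print(mask)  # [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1]
```

In a real training loop the mask would be applied to the shifted next-token targets, typically by setting masked positions to the ignore index of the cross-entropy loss.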

Training data

| Stage | Dataset | Size | License |
| --- | --- | --- | --- |
| Pre-training | chrisjpatty/fineweb-edu-deseret | 11.13 B tokens | ODC-By 1.0 |
| SFT | chrisjpatty/ultrachat-deseret | 200 k conversations | MIT |

Training recipe

Pre-training (~$40, ~12 hr on 1× H100 80GB SXM):

  • Optimizer: AdamW (β=(0.9, 0.95), wd=0.1, eps=1e-8, fused)
  • LR: 3e-4 peak, cosine decay to 3e-5
  • Warmup: 2000 steps
  • Batch: 32 × grad-accum 16 × ctx 1024 = 524 288 tokens/step
  • Steps: 20 000 (~10.5 B tokens, ~50 tokens/parameter, about 2.5× the ~20 tokens/parameter Chinchilla-optimal ratio)
  • Precision: bf16
  • Grad clip: 1.0
  • Gradient norm + parameter norm logged throughout
  • NaN guard with emergency checkpoint
  • Final loss: 2.68 train / 2.67 val
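
The schedule and token budget above can be checked with a few lines of arithmetic. The card gives only the peak, the floor, and the warmup length, so the linear warmup shape and a cosine that spans the remaining 18 000 steps are assumptions of this sketch; `lr_at` is a hypothetical name.

```python
import math


def lr_at(step, peak=3e-4, floor=3e-5, warmup=2000, total=20_000):
    """Linear warmup to `peak`, then cosine decay to `floor` at `total` steps."""
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / (total - warmup))
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


tokens_per_step = 32 * 16 * 1024      # micro-batch x grad-accum x context length
total_tokens = tokens_per_step * 20_000

print(tokens_per_step)                        # 524288
print(total_tokens)                           # 10485760000
print(round(total_tokens / 209_716_224, 1))   # 50.0
```

Under these assumptions the learning rate reaches 3e-4 at the end of warmup and lands on 3e-5 at step 20 000, and the run covers slightly less than one pass over the 11.13 B-token corpus.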

SFT (~$5, ~33 min on same pod):

  • LR: 1e-5 peak, cosine decay to 1e-6
  • Warmup: 200 steps
  • Batch: 16 × grad-accum 4 × max_len 1024
  • 1 epoch over 200k UltraChat-Deseret conversations
  • Loss only on assistant tokens
  • Final loss: 1.58
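
For a sense of scale, the SFT settings imply a fairly short run. The estimate below assumes exactly 200,000 conversations, one conversation per sequence, and no sequence packing; none of these is confirmed by the card, so the step count is a back-of-envelope figure.

```python
import math

conversations = 200_000
sequences_per_step = 16 * 4          # micro-batch x grad-accum
steps = math.ceil(conversations / sequences_per_step)

print(steps)                         # 3125
print(f"{200 / steps:.1%}")          # 6.4%
print(sequences_per_step * 1024)     # 65536
```

That is roughly 3 125 optimizer steps, of which the 200 warmup steps are about 6.4%, with at most 65 536 tokens per step when every sequence fills max_len.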

Validation

The translation pipeline that produced the training data was validated against the Illinois Deseret Consortium's parallel transcription of the 1869 Book of Mormon (the authoritative published Deseret text), achieving:

  • 100.00 % parity on the IDC spelling dictionary (1,359 entries)
  • 99.965 % word-level parity on the full 1869 Book of Mormon (108k+ words)
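
The card does not describe how parity was measured. One plausible reading, sketched below, is a position-by-position comparison of whitespace-separated words between the translator output and the reference transcription; `word_parity` is a hypothetical helper and not the project's validation harness.

```python
def word_parity(predicted: str, reference: str) -> float:
    """Fraction of aligned whitespace-separated words that match exactly."""
    pred_words, ref_words = predicted.split(), reference.split()
    if len(pred_words) != len(ref_words):
        raise ValueError("texts must contain the same number of words")
    if not ref_words:
        return 1.0
    matches = sum(p == r for p, r in zip(pred_words, ref_words))
    return matches / len(ref_words)


print(word_parity("a b c d", "a b x d"))    # 0.75
print(round(108_000 * (1 - 0.99965)))       # 38
```

Read this way, 99.965 % parity over a text of 108k words corresponds to roughly 38 mismatched words.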

The model itself was not benchmarked against standard NLP evals (these don't exist for Deseret).

Usage

import torch
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer
# Use the model code from https://github.com/chrisjpatty/deseretlm  (or vendor model/transformer.py)
from model.transformer import Transformer, TransformerConfig

# Download files
ckpt_path = hf_hub_download(repo_id="chrisjpatty/deseretlm-200m", filename="final.pt")
tok_path = hf_hub_download(repo_id="chrisjpatty/deseret-8k-bpe", filename="deseret_8k.json")

# Load
device = torch.device("mps") if torch.backends.mps.is_available() else \
         torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
ckpt = torch.load(ckpt_path, map_location=device)
cfg = TransformerConfig(**ckpt["cfg"])
model = Transformer(cfg).to(device).eval()
model.load_state_dict({k.replace("_orig_mod.", ""): v for k, v in ckpt["model"].items()})
tok = Tokenizer.from_file(tok_path)

# Build prompt with chat template
bos, eos = tok.token_to_id("<|bos|>"), tok.token_to_id("<|eos|>")
u, a = tok.token_to_id("<|user|>"), tok.token_to_id("<|assistant|>")
prompt_des = "𐐐𐐲𐑊𐐬, 𐐸𐐭 𐐪𐑉 𐐷𐐭?"   # "Hello, who are you?"
ids = [bos, u] + tok.encode(prompt_des).ids + [a]
x = torch.tensor([ids], dtype=torch.long, device=device)

# Generate
with torch.no_grad():
    for _ in range(256):
        logits, _ = model(x)
        next_id = int(torch.multinomial(torch.softmax(logits[0, -1] / 0.7, -1), 1).item())
        x = torch.cat([x, torch.tensor([[next_id]], device=device)], dim=1)
        if next_id == eos:
            break

reply = tok.decode(x[0, len(ids):].tolist())
print(reply)

Known limitations

This is a small from-scratch model on a tight budget. Expect:

  • Coherent Deseret, sometimes off-topic answers. The language is fluent but instruction-following is weak. E.g., asked "who are you?" the model may answer about music players. It learned the chat format well; it learned what to say less well.
  • No factual knowledge guarantees at this scale. A 200M-parameter model trained on ~10.5B tokens has limited capacity for facts.
  • Modern American English phonology, not 1860s New England. Words like "ask" are rendered /æsk/ (modern) not /ɑsk/ (period). The translator is internally consistent but stylistically modern.
  • No safety tuning, no RLHF/DPO. Single SFT pass over UltraChat-200k only.
  • Limited multi-turn coherence beyond ~3 turns.

Reproduction

Full code, including the translator, training scripts, and validation harness, lives in the project repo (linked from the citation below). Total compute budget: under $50 on a rented H100. Time: ~24 hours wall-clock, including data prep on a single Mac.

Citation

DeseretLM-200M: a small language model trained from scratch in the Deseret Alphabet.
Christopher Patty (chrisjpatty), 2026.

If you build on the datasets, please also cite the source corpora; see the dataset cards for fineweb-edu-deseret and ultrachat-deseret.
