---
license: mit
language:
- en
library_name: pytorch
pipeline_tag: text-generation
tags:
- text-generation
- stream-mixer
- linear-time
- recurrent
- attention-free
- nanochat
- small-llm
datasets:
- karpathy/climbmix-400b-shuffle
- HuggingFaceTB/smol-smoltalk
- cais/mmlu
- allenai/ai2_arc
- openai/gsm8k
base_model: karpathy/nanochat
---

# Mnemo

> *μνήμη, Greek for "memory"*

**Mnemo** is a small attention-free language model with 117M parameters, built on the **Stream Mixer** architecture: a linear-time recurrent sequence mixer that uses multiple parallel content-routed memory streams instead of self-attention.

The name nods to the model's recurrent memory: every layer maintains M parallel state buffers that "remember" content over the entire sequence without quadratic attention.

The training pipeline (data, tokenizer, eval, fine-tuning) is a fork of [karpathy/nanochat](https://github.com/karpathy/nanochat), with the attention-based GPT replaced by a custom Stream Mixer block.

---

## Quick facts

| | |
|---|---|
| Architecture | Stream Mixer (linear-time recurrent) |
| Parameters | **117,179,136** |
| Layers | 16 |
| Hidden dim | 768 |
| Memory streams (M) | 48 |
| Stream state dim (D) | 96 |
| Read heads | 6 |
| Context length | 2048 tokens |
| Vocab | 32,768 BPE (GPT-4-style pretokenization) |
| Special tokens | `<\|bos\|>`, `<\|user_start\|>`, `<\|user_end\|>`, `<\|assistant_start\|>`, `<\|assistant_end\|>` |
| Compute dtype | bf16 (Ampere+) / fp32 (T4/CPU) |
| **Base perplexity** | **19.47 (0.9011 bits per byte)** |
| **ChatCORE (chat model)** | **22.74%** (mean centered accuracy across 5 tasks) |
| **SpellingBee accuracy** | **94.53%** (full 256-problem test set) |
| License | MIT |

---

## Architecture: Stream Mixer

Mnemo's defining feature is its sequence mixer.
Where a Transformer uses self-attention to compute pairwise interactions across tokens (cost: **O(T²)**), Mnemo uses a chunked parallel scan over M parallel content-routed memory streams (cost: **O(T · M · D)**, **linear in sequence length**).

Per token *t* and per layer:

1. Compute value `v[t]`, read query `q[t]`, content-router `r[t]`, and per-stream decay `α[t]`.
2. Each memory stream `s_m` updates via `s_m[t] = α_m[t] · s_m[t-1] + r_m[t] · v[t]`.
3. Multi-head sigmoid-gated read with QK-norm aggregates from the M streams.

The full state of a layer has shape **(B, M, D)**, where B is the batch size. It is a fixed-size recurrent memory that the model can carry across arbitrary sequence lengths. The chunked scan implementation keeps the numerical range bounded even for slow-decay streams. For details see the model source.

---

## Training

### Pretraining (base model)

| | |
|---|---|
| Corpus | [karpathy/climbmix-400b-shuffle](https://huggingface.co/datasets/karpathy/climbmix-400b-shuffle), 88 shards |
| Total tokens | **5.24B** (≈44.7 tokens per parameter) |
| Steps | 80,000 × B=32 × T=2048 |
| Optimizer | AdamW (peak LR 1e-3, warmup 500, cosine to 1e-5, weight decay 0.1) |
| Compute | RTX PRO 6000 Blackwell (single GPU, bf16) |
| Wall time | **~9 hours** |
| Best val loss | **2.9508** (perplexity ≈ 19.12) |

### Supervised fine-tuning

| | |
|---|---|
| Mixture | SmolTalk + MMLU×3 + ARC×4 + GSM8K×4 + SimpleSpelling + SpellingBee + 1000 Mnemo-branded identity conversations |
| Total conversations | ~1.09M |
| Steps | 30,000 × B=8 × T=2048 ≈ 492M SFT tokens |
| Optimizer | AdamW (peak LR 1e-4, warmup 300) |
| Best val loss | ~1.45 (masked cross-entropy over assistant tokens only) |
| Format | nanochat-style BOS-aligned best-fit packing with padding |

### Pipeline

```
ClimbMix-400B
      │
      ▼  [80k step pretrain on Stream Mixer]
      │  best val 2.9508 @ step 79k
      ▼
Base checkpoint  (completes prompts)
      │
      ▼  [30k step SFT on multi-task mixture]
      │  best val ~1.45
      ▼
SFT checkpoint   (chat-aware, answers as Mnemo)
```

---

## Evaluation results
Measured on the full test sets, with no subsampling and no cherry-picking.

### Base model: `model.pt` @ step 79,000

| Metric | Value |
|---|---|
| Validation loss (nats / token) | 2.9691 |
| Perplexity | 19.47 |
| **Bits per byte (BPB)** | **0.9011** |
| Evaluation window | 409,600 tokens / 1,947,169 bytes |

Bits per byte is the tokenizer-invariant measure, so it is directly comparable across models with different vocabularies. It is the validation loss converted from nats to bits and rescaled by tokens per byte: 2.9691 / ln 2 × 409,600 / 1,947,169 ≈ 0.9011. For rough reference, GPT-2 on similar web text lands around BPB ≈ 1.0; Mnemo at 117M on ClimbMix-400B gets to ~0.90, which is sensible for the size class.

### Chat model: full benchmark suite

Evaluated on the **complete test set of each task** (no `--max-problems` cap). Categorical tasks use logit comparison over allowed letters; generative tasks sample greedily and parse `#### N` for the final answer.

| Task | Type | N | Accuracy | Random baseline | Centered |
|---|---|---|---|---|---|
| MMLU (57 subjects) | categorical 4-way MCQ | 14,042 | **28.32%** | 25% | +4.42 |
| ARC-Easy | categorical 4-way MCQ | 2,376 | **30.68%** | 25% | +7.58 |
| ARC-Challenge | categorical 4-way MCQ | 1,172 | **29.52%** | 25% | +6.03 |
| GSM8K (math word problems) | generative, parse `#### N` | 1,319 | 1.14% | 0% | +1.14 |
| **SpellingBee (letter counting)** | generative, parse `#### N` | 256 | **94.53%** | 0% | **+94.53** |

### ChatCORE metric

**`ChatCORE = 22.74%`**: mean centered accuracy across all five tasks.

ChatCORE has the same form as nanochat's metric: each task's accuracy is rescaled against its random baseline as `100 × (accuracy - baseline) / (100 - baseline)`, so chance-level guessing scores 0 and a perfect score is 100.

At 22.74% on 117M params after 9h pretraining + 1h SFT, Mnemo lands above random on all five tasks, although the mean is dominated by SpellingBee and the other four tasks sit within 8 centered points of chance. The results suggest that the Stream Mixer architecture *can* hold the necessary structure, and that the dominant ceiling is parameter count rather than architecture.

### Where the numbers come from

- **SpellingBee 94.53%** is the standout.
Mnemo learned to enumerate words from the 370k-word English dictionary character by character and to reliably emit a correct `#### N` final answer. Common short words that tokenize as single BPE tokens (like "strawberry") still fail because the model never observes their letters individually; this is a tokenizer limitation rather than a model one.
- **All three MCQ tasks land above random**, which indicates that the model commits to a letter at the assistant position when forced. The MMLU margin is modest (+3.3 pp over chance, +4.4 centered): 117M parameters can't memorize the breadth of academic facts MMLU covers.
- **GSM8K at 1.14%** is unsurprising for an unaligned 117M-parameter model with no tool use. The format is learned correctly (step-by-step reasoning + `#### N` final answer), but the arithmetic isn't reliable enough to land the right number consistently.

## Capabilities and limitations

### Confirmed strong

- Coherent conversational dialogue in chat format (`<|user_start|>` / `<|assistant_start|>`)
- Factual recall on common entities (capital cities, chemical symbols, planet order)
- **Letter counting via manual enumeration**: 94.5% on SpellingBee
- Multiple-choice answer commitment (above random on all three MCQ benchmarks)
- Persona consistency (model identifies as Mnemo with consistent self-description)
- Greedy + nucleus (top-p) sampling configurable for short or long generation

### Confirmed weak

- **Math word problems**: 1.14% on GSM8K.
Format is learned; arithmetic is not
- **Single-token common words for spelling**: "strawberry" → 2 r's (correct answer: 3); the tokenizer hides character-level information for words that fit in a single BPE token
- **Niche factual recall**: confabulates confidently on rare entities, exact dates, specific quotations
- **Long multi-turn conversations**: context drifts after ~2–3 turns

### Limitations (architectural)

- **117M parameters**: knowledge density is the ceiling, not the architecture
- **No tool use, no internet, no images, no memory across sessions**
- **2048-token context**: quality degrades past ~1500 tokens without repetition penalty
- **No RLHF**: outputs reflect only supervised signal; may produce inappropriate completions
- **English only**: pretraining corpus is essentially English educational/web text
- **Repetition prone in long generations** without `--repetition-penalty` or `--top-p`

---

## Usage

### Direct loading

```python
import torch
from tokenizers import Tokenizer
from model import GPT  # model.py from this repository

tokenizer = Tokenizer.from_file('tokenizer.json')
ckpt = torch.load('model.pt', map_location='cuda')

# Round the vocab size up to the next multiple of 64 before building the model.
config = dict(ckpt['config'])
config['vocab_size'] = ((tokenizer.get_vocab_size() + 63) // 64) * 64
model = GPT.from_config(config).cuda().eval()

# Strip the `_orig_mod.` prefix that torch.compile adds to parameter names
# (str.removeprefix needs Python 3.9+).
state = {k.removeprefix('_orig_mod.'): v for k, v in ckpt['model'].items()}
model.load_state_dict(state, strict=False)
```

### Chat CLI (recommended)

```bash
python3 chat_cli.py                     # interactive REPL
python3 chat_cli.py -p "Who are you?"   # one-shot
```

The chat CLI handles the chat-format token wrapping (`<|bos|>` → `<|user_start|>` …) and stops generation cleanly on `<|assistant_end|>`. State is cached across turns via the recurrent state buffer, so only the new tokens of each user message are prefilled. This gives roughly **5–10× faster prefill** on multi-turn conversations than re-processing the entire history.
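The state caching above follows from the stream update given in the Architecture section, `s_m[t] = α_m[t] · s_m[t-1] + r_m[t] · v[t]`: each step depends on the past only through the previous state, so resuming from a saved state and feeding only the new tokens should give the same result as re-processing the whole history. The sketch below illustrates this with a toy, pure-Python version of that single update. It is not the repository's implementation: `stream_step`, `run`, and `random_tokens` are hypothetical names, the per-token `α`, `r`, and `v` are random stand-ins for learned projections, the sizes are shrunk, and the chunked parallel scan and gated read of the real model are not modeled.

```python
import random


def stream_step(state, alpha, r, v):
    """One token of s_m[t] = alpha_m[t] * s_m[t-1] + r_m[t] * v[t].

    state: M lists of D floats; alpha, r: M floats; v: D floats.
    """
    return [
        [a_m * s + r_m * v_d for s, v_d in zip(s_m, v)]
        for s_m, a_m, r_m in zip(state, alpha, r)
    ]


def run(tokens, state):
    """Fold a list of (alpha, r, v) token inputs into the state, in order."""
    for alpha, r, v in tokens:
        state = stream_step(state, alpha, r, v)
    return state


def random_tokens(n, M, D, rng):
    """Hypothetical stand-in for the per-token projections of a real layer."""
    return [
        (
            [rng.uniform(0.5, 1.0) for _ in range(M)],  # decay alpha_m[t]
            [rng.uniform(0.0, 1.0) for _ in range(M)],  # router weight r_m[t]
            [rng.gauss(0.0, 1.0) for _ in range(D)],    # value v[t]
        )
        for _ in range(n)
    ]


M, D = 4, 8  # toy sizes; the Quick facts table lists M=48, D=96
rng = random.Random(0)
history = random_tokens(20, M, D, rng)   # earlier turns of a conversation
new_turn = random_tokens(5, M, D, rng)   # tokens of the latest user message
zero = [[0.0] * D for _ in range(M)]

# Path 1: re-process the whole conversation from an empty state.
full = run(history + new_turn, zero)

# Path 2: keep the state after the history, then prefill only the new turn.
cached = run(history, zero)
resumed = run(new_turn, cached)

max_diff = max(
    abs(a - b) for fa, ra in zip(full, resumed) for a, b in zip(fa, ra)
)
print("max abs difference between the two paths:", max_diff)
```

In this toy both paths perform the same floating-point operations in the same order, so the two states match exactly. A chunked scan may group the arithmetic differently, in which case small rounding differences would be expected.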
### Raw inference (no chat format)

```bash
python3 infer.py -p "Photosynthesis is the process by which" --top-p 0.9 -r 1.15
```

Recommended sampling parameters (empirically tuned, see training log):

- **Greedy / factual probes**: `-t 0`
- **Short prose (≤500 tok)**: `-t 0.8 -k 50`
- **Long prose (500–2000 tok)**: `-t 0.8 -k 50 --top-p 0.9 -r 1.15` (anti-loop)
- **Diverse creative writing**: `-t 0.9 --top-p 0.85 -r 1.1`

---

## Probe outputs (greedy, from the base checkpoint)

Run via `python3 base_eval.py --eval sample` against the pretrained checkpoint (`model.pt`, val 2.9508). Greedy, 64 tokens per completion.

| Prompt | First tokens of output | Verdict |
|---|---|---|
| *The capital of France is* | "...Paris, and it is the capital of France. The capital of France is Paris..." | ✓ Paris lands |
| *The chemical symbol of gold is* | "Au. It is a soft, silvery-white metal... good conductor of electricity and heat, making it useful in electrical wiring..." | ✓ Au + real applied claim |
| *If yesterday was Friday, then tomorrow will be* | "Tuesday. The weather is not so bad..." | ✗ (correct: Sunday) |
| *The opposite of hot is* | "the cold." | ✓ |
| *The planets of the solar system are:* | "Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune, Pluto..." | ✓ Correct astronomical order |
| *My favorite color is* | "red. It's a color that's been around for a long time..." | ✓ |
| *If 5\*x + 3 = 13, then x is* | "a positive integer. If x is a positive integer, then x is a positive integer..." | ✗ Loop |
| *Photosynthesis is the process by which* | "plants and other organisms convert light energy into chemical energy. It is a complex process that involves the conversion of light energy into chemical energy..." | ✓ Factually correct opener |

**5/7 of the original training probes land correct answers at greedy.** Repetition is visible: the base model benefits substantially from `--repetition-penalty 1.15` and/or `--top-p 0.9` on longer generations (see Usage section).

---

## Citation and acknowledgements

Built on top of [karpathy/nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy. The Stream Mixer architecture is an attention-free experiment swapping the standard Transformer block for a recurrent linear-time sequence mixer.

Pretraining data is [karpathy/climbmix-400b-shuffle](https://huggingface.co/datasets/karpathy/climbmix-400b-shuffle). SFT mixture sources: HuggingFaceTB/smol-smoltalk, cais/mmlu, allenai/ai2_arc, openai/gsm8k, and a custom 1000-conversation identity dataset.

```bibtex
@misc{mnemo2026,
  title={Mnemo: A Linear-Time Recurrent Language Model},
  author={Alvarado, Luis Miguel},
  year={2026},
  note={Built on karpathy/nanochat. Stream Mixer architecture.},
  howpublished={\url{https://github.com//mnemo}}
}
```

---

## License

MIT. Use freely. No warranty.