| --- |
| license: mit |
| language: |
| - en |
| library_name: pytorch |
| pipeline_tag: text-generation |
| tags: |
| - text-generation |
| - stream-mixer |
| - linear-time |
| - recurrent |
| - attention-free |
| - nanochat |
| - small-llm |
| datasets: |
| - karpathy/climbmix-400b-shuffle |
| - HuggingFaceTB/smol-smoltalk |
| - cais/mmlu |
| - allenai/ai2_arc |
| - openai/gsm8k |
| base_model: karpathy/nanochat |
| --- |
| |
| # Mnemo |
|
|
| > *μνήμη — Greek for "memory"* |
|
|
| **Mnemo** is a small attention-free language model with 117M parameters, built on the |
| **Stream Mixer** architecture — a linear-time recurrent sequence mixer that uses |
| multiple parallel content-routed memory streams instead of self-attention. The name |
| nods to the model's recurrent memory: every layer maintains M parallel state buffers |
| that "remember" content over the entire sequence without quadratic attention. |
|
|
| The training pipeline (data, tokenizer, eval, fine-tuning) is a fork of |
| [karpathy/nanochat](https://github.com/karpathy/nanochat), with the attention-based |
| GPT replaced by a custom Stream Mixer block. |
|
|
| --- |
|
|
| ## Quick facts |
|
|
| | | | |
| |---|---| |
| | Architecture | Stream Mixer (linear-time recurrent) | |
| | Parameters | **117,179,136** | |
| | Layers | 16 | |
| | Hidden dim | 768 | |
| | Memory streams (M) | 48 | |
| | Stream state dim (D) | 96 | |
| | Read heads | 6 | |
| | Context length | 2048 tokens | |
| | Vocab | 32,768 BPE (GPT-4-style pretokenization) | |
| | Special tokens | `<\|bos\|>`, `<\|user_start\|>`, `<\|user_end\|>`, `<\|assistant_start\|>`, `<\|assistant_end\|>` | |
| | Compute dtype | bf16 (Ampere+) / fp32 (T4/CPU) | |
| | **Base perplexity (BPB)** | **19.47 (0.9011 bits-per-byte)** | |
| | **Chat ChatCORE metric** | **22.74%** (mean centered across 5 tasks) | |
| | **SpellingBee accuracy** | **94.53%** (256/256 test set) | |
| | License | MIT | |
|
|
| --- |
|
|
| ## Architecture: Stream Mixer |
|
|
| Mnemo's defining feature is its sequence mixer. Where a Transformer uses self-attention |
| to compute pairwise interactions across tokens (cost: **O(T²)**), Mnemo uses a chunked |
| parallel scan over M parallel content-routed memory streams (cost: **O(T · M · D)** — |
| **linear in sequence length**). |
|
|
| Per token *t* and per layer: |
|
|
| 1. Compute value `v[t]`, read query `q[t]`, content-router `r[t]`, and per-stream decay `α[t]`. |
| 2. Each memory stream `s_m` updates via `s_m[t] = α_m[t] · s_m[t-1] + r_m[t] · v[t]`. |
| 3. Multi-head sigmoid-gated read with QK-norm aggregates from the M streams. |
|
|
| The full state across a layer is **(B, M, D)** — a fixed-size recurrent memory that |
| the model can carry across arbitrary sequence lengths. The chunked scan implementation |
| keeps numerical range bounded even for slow-decay streams. |
|
|
| For details see the model source. |
|
|
| --- |
|
|
| ## Training |
|
|
| ### Pretraining (base model) |
|
|
| | | | |
| |---|---| |
| | Corpus | [karpathy/climbmix-400b-shuffle](https://huggingface.co/datasets/karpathy/climbmix-400b-shuffle) — 88 shards | |
| | Total tokens | **5.24B** (44.7× over params) | |
| | Steps | 80,000 × B=32 × T=2048 | |
| | Optimizer | AdamW (peak LR 1e-3, warmup 500, cosine to 1e-5, weight decay 0.1) | |
| | Compute | RTX PRO 6000 Blackwell (single GPU, bf16) | |
| | Wall time | **~9 hours** | |
| | Best val loss | **2.9508** (perplexity ≈ 19.12) | |
|
|
| ### Supervised fine-tuning |
|
|
| | | | |
| |---|---| |
| | Mixture | SmolTalk + MMLU×3 + ARC×4 + GSM8K×4 + SimpleSpelling + SpellingBee + 1000 Mnemo-branded identity convs | |
| | Total conversations | ~1.09M | |
| | Steps | 30,000 × B=8 × T=2048 = ~500M SFT tokens | |
| | Optimizer | AdamW (peak LR 1e-4, warmup 300) | |
| | Best val loss | ~1.45 (masked cross-entropy over assistant tokens only) | |
| | Format | nanochat-style BOS-aligned best-fit packing with padding | |
|
|
| ### Pipeline |
|
|
| ``` |
| ClimbMix-400B |
| │ |
| ▼ |
| [80k step pretrain on Stream Mixer] |
| │ best val 2.9508 @ step 79k |
| ▼ |
| Base checkpoint (completes prompts) |
| │ |
| ▼ |
| [30k step SFT on multi-task mixture] |
| │ best val ~1.45 |
| ▼ |
| SFT checkpoint (chat-aware — answers as Mnemo) |
| ``` |
|
|
| --- |
|
|
| ## Evaluation results |
|
|
| Measured on the full test sets — no subsampling, no cherry-picking. |
|
|
| ### Base model — `model.pt` @ step 79,000 |
|
|
| | Metric | Value | |
| |---|---| |
| | Validation loss (nats / token) | 2.9691 | |
| | Perplexity | 19.47 | |
| | **Bits per byte (BPB)** | **0.9011** | |
| | Evaluation window | 409,600 tokens / 1,947,169 bytes | |
|
|
| Bits-per-byte is the tokenizer-invariant measure — directly comparable across models with different vocabularies. For reference, GPT-2 on similar web text lands around BPB ≈ 1.0; Mnemo at 117M on ClimbMix-400B gets to ~0.90, which is sensible for the size class. |
|
|
| ### Chat model — full benchmark suite |
|
|
| Evaluated on the **complete test set of each task** (no `--max-problems` cap). |
| Categorical tasks use logit comparison over allowed letters; generative tasks |
| sample greedily and parse `#### N` for the final answer. |
|
|
| | Task | Type | N | Accuracy | Random baseline | Centered | |
| |---|---|---|---|---|---| |
| | MMLU (57 subjects) | categorical 4-way MCQ | 14,042 | **28.32%** | 25% | +4.42 | |
| | ARC-Easy | categorical 4-way MCQ | 2,376 | **30.68%** | 25% | +7.58 | |
| | ARC-Challenge | categorical 4-way MCQ | 1,172 | **29.52%** | 25% | +6.03 | |
| | GSM8K (math word problems) | generative, parse `#### N` | 1,319 | 1.14% | 0% | +1.14 | |
| | **SpellingBee (letter counting)** | generative, parse `#### N` | 256 | **94.53%** | 0% | **+94.53** | |
|
|
| ### ChatCORE metric |
|
|
| **`ChatCORE = 22.74%`** — mean centered accuracy across all five tasks. |
|
|
| ChatCORE is the same shape as nanochat's metric: it normalizes each task to its random baseline (so a fair guess scores 0, and perfect scores 100). At 22.74% on 117M params after 9h pretraining + 1h SFT, Mnemo lands meaningfully above random across all tasks. The Stream Mixer architecture clearly *can* hold the necessary structure — the dominant ceiling is parameter count, not architecture. |
|
|
| ### Where the numbers come from |
|
|
| - **SpellingBee 94.53%** is the standout. Mnemo learned to character-by-character enumerate words from the 370k-word English dictionary and reliably emit a correct `#### N` final answer. Common short words that tokenize as single BPE tokens (like "strawberry") still fail because the model never observes their letters individually — this is a tokenizer limitation, not a model one. |
| - **All three MCQ tasks above random** confirms the model genuinely commits to a letter at the assistant position when forced. The MMLU advantage (+4.4 pp) is modest — 117M can't memorize the breadth of academic facts MMLU covers. |
| - **GSM8K at 1.14%** is honest for an unaligned 117M-parameter model with no tool use. The format is correctly learned (step-by-step reasoning + `#### N` final answer) but the arithmetic isn't reliable enough to land the right number consistently. |
|
|
| ## Capabilities and limitations |
|
|
| ### Confirmed strong |
|
|
| - Coherent conversational dialogue in chat format (`<|user_start|>` / `<|assistant_start|>`) |
| - Factual recall on common entities (capital cities, chemical symbols, planets ordered) |
| - **Letter counting via manual enumeration** — 94.5% on SpellingBee |
| - Multiple-choice answer commitment (above random on all three MCQ benchmarks) |
| - Persona consistency (model identifies as Mnemo with consistent self-description) |
| - Greedy + nucleus (top-p) sampling configurable for short or long generation |
|
|
| ### Confirmed weak |
|
|
| - **Math word problems** — 1.14% on GSM8K. Format is learned, arithmetic is not |
| - **Single-token common words for spelling** — "strawberry" → 2 r's (real answer: 3); tokenizer hides character-level information for words that fit in a single BPE token |
| - **Niche factual recall** — confabulates confidently on rare entities, exact dates, specific quotations |
| - **Long multi-turn conversations** — context drifts after ~2-3 turns |
|
|
| ### Limitations (architectural) |
|
|
| - **117M parameters** — knowledge density is the ceiling, not the architecture |
| - **No tool use, no internet, no images, no memory across sessions** |
| - **2048-token context** — quality degrades past ~1500 tokens without repetition penalty |
| - **No RLHF** — outputs reflect only supervised signal; may produce inappropriate completions |
| - **English only** — pretraining corpus is essentially English educational/web text |
| - **Repetition prone in long generations** without `--repetition-penalty` or `--top-p` |
|
|
| --- |
|
|
| ## Usage |
|
|
| ### Direct loading |
|
|
| ```python |
| import torch |
| from tokenizers import Tokenizer |
| from model import GPT |
| |
| tokenizer = Tokenizer.from_file('tokenizer.json') |
| ckpt = torch.load('model.pt', map_location='cuda') |
| |
| config = dict(ckpt['config']) |
| config['vocab_size'] = ((tokenizer.get_vocab_size() + 63) // 64) * 64 |
| model = GPT.from_config(config).cuda().eval() |
| |
| state = {k.removeprefix('_orig_mod.'): v for k, v in ckpt['model'].items()} |
| model.load_state_dict(state, strict=False) |
| ``` |
|
|
| ### Chat CLI (recommended) |
|
|
| ```bash |
| python3 chat_cli.py # interactive REPL |
| python3 chat_cli.py -p "Who are you?" # one-shot |
| ``` |
|
|
| The chat CLI handles the chat-format token wrapping (`<|bos|>` → `<|user_start|>` …) |
| and stops generation cleanly on `<|assistant_end|>`. State is cached across turns |
| via the recurrent state buffer — only the new tokens of each user message are |
| prefilled, giving roughly **5–10× faster prefill** on multi-turn conversations than |
| re-processing the entire history. |
|
|
| ### Raw inference (no chat format) |
|
|
| ```bash |
| python3 infer.py -p "Photosynthesis is the process by which" --top-p 0.9 -r 1.15 |
| ``` |
|
|
| Recommended sampling parameters (empirically tuned, see training log): |
| - **Greedy / factual probes**: `-t 0` |
| - **Short prose (≤500 tok)**: `-t 0.8 -k 50` |
| - **Long prose (500–2000 tok)**: `-t 0.8 -k 50 --top-p 0.9 -r 1.15` (anti-loop) |
| - **Diverse creative writing**: `-t 0.9 --top-p 0.85 -r 1.1` |
|
|
| --- |
|
|
| ## Probe outputs (greedy, from the base checkpoint) |
|
|
| Run via `python3 base_eval.py --eval sample` against the pretrained checkpoint (`model.pt`, val 2.9508). Greedy, 64 tokens per completion. |
|
|
| | Prompt | First tokens of output | Verdict | |
| |---|---|---| |
| | *The capital of France is* | "...Paris, and it is the capital of France. The capital of France is Paris..." | ✓ Paris lands | |
| | *The chemical symbol of gold is* | "Au. It is a soft, silvery-white metal... good conductor of electricity and heat, making it useful in electrical wiring..." | ✓ Au + real applied claim | |
| | *If yesterday was Friday, then tomorrow will be* | "Tuesday. The weather is not so bad..." | ✗ (correct: Sunday) | |
| | *The opposite of hot is* | "the cold." | ✓ | |
| | *The planets of the solar system are:* | "Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune, Pluto..." | ✓ Correct astronomical order | |
| | *My favorite color is* | "red. It's a color that's been around for a long time..." | ✓ | |
| | *If 5\*x + 3 = 13, then x is* | "a positive integer. If x is a positive integer, then x is a positive integer..." | ✗ Loop | |
| | *Photosynthesis is the process by which* | "plants and other organisms convert light energy into chemical energy. It is a complex process that involves the conversion of light energy into chemical energy..." | ✓ Factually correct opener | |
|
|
| **5/7 of the original training probes land correct answers at greedy.** Repetition is visible — the base model benefits substantially from `--repetition-penalty 1.15` and/or `--top-p 0.9` on longer generations (see Usage section). |
|
|
| --- |
|
|
| ## Citation and acknowledgements |
|
|
| Built on top of [karpathy/nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy. |
| The Stream Mixer architecture is an attention-free experiment swapping the standard |
| Transformer block for a recurrent linear-time sequence mixer. |
|
|
| Pretraining data is [karpathy/climbmix-400b-shuffle](https://huggingface.co/datasets/karpathy/climbmix-400b-shuffle). |
| SFT mixture sources: HuggingFaceTB/smol-smoltalk, cais/mmlu, allenai/ai2_arc, openai/gsm8k, |
| and a custom 1000-conversation identity dataset. |
| |
| ```bibtex |
| @misc{mnemo2026, |
| title={Mnemo: A Linear-Time Recurrent Language Model}, |
| author={Alvarado, Luis Miguel}, |
| year={2026}, |
| note={Built on karpathy/nanochat. Stream Mixer architecture.}, |
| howpublished={\url{https://github.com/<your-handle>/mnemo}} |
| } |
| ``` |
| |
| --- |
| |
| ## License |
| |
| MIT. Use freely. No warranty. |
| |