---
license: mit
language:
- en
library_name: pytorch
pipeline_tag: text-generation
tags:
- text-generation
- stream-mixer
- linear-time
- recurrent
- attention-free
- nanochat
- small-llm
datasets:
- karpathy/climbmix-400b-shuffle
- HuggingFaceTB/smol-smoltalk
- cais/mmlu
- allenai/ai2_arc
- openai/gsm8k
base_model: karpathy/nanochat
---
# Mnemo
> *μνήμη — Greek for "memory"*
**Mnemo** is a small attention-free language model with 117M parameters, built on the
**Stream Mixer** architecture — a linear-time recurrent sequence mixer that uses
multiple parallel content-routed memory streams instead of self-attention. The name
nods to the model's recurrent memory: every layer maintains M parallel state buffers
that "remember" content over the entire sequence without quadratic attention.
The training pipeline (data, tokenizer, eval, fine-tuning) is a fork of
[karpathy/nanochat](https://github.com/karpathy/nanochat), with the attention-based
GPT replaced by a custom Stream Mixer block.
---
## Quick facts
| | |
|---|---|
| Architecture | Stream Mixer (linear-time recurrent) |
| Parameters | **117,179,136** |
| Layers | 16 |
| Hidden dim | 768 |
| Memory streams (M) | 48 |
| Stream state dim (D) | 96 |
| Read heads | 6 |
| Context length | 2048 tokens |
| Vocab | 32,768 BPE (GPT-4-style pretokenization) |
| Special tokens | `<\|bos\|>`, `<\|user_start\|>`, `<\|user_end\|>`, `<\|assistant_start\|>`, `<\|assistant_end\|>` |
| Compute dtype | bf16 (Ampere+) / fp32 (T4/CPU) |
| **Base perplexity / bits per byte** | **19.47 / 0.9011** |
| **ChatCORE (chat model)** | **22.74%** (mean centered accuracy across 5 tasks) |
| **SpellingBee accuracy** | **94.53%** (242/256 on the full test set) |
| License | MIT |
---
## Architecture: Stream Mixer
Mnemo's defining feature is its sequence mixer. Where a Transformer uses self-attention
to compute pairwise interactions across tokens (cost: **O(T²)**), Mnemo uses a chunked
parallel scan over M parallel content-routed memory streams (cost: **O(T · M · D)** —
**linear in sequence length**).
Per token *t* and per layer:
1. Compute value `v[t]`, read query `q[t]`, content-router `r[t]`, and per-stream decay `α[t]`.
2. Each memory stream `s_m` updates via `s_m[t] = α_m[t] · s_m[t-1] + r_m[t] · v[t]`.
3. Multi-head sigmoid-gated read with QK-norm aggregates from the M streams.
The full state across a layer is **(B, M, D)** — a fixed-size recurrent memory that
the model can carry across arbitrary sequence lengths. The chunked scan implementation
keeps numerical range bounded even for slow-decay streams.
For details see the model source.
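
The update in step 2 can be sketched in plain Python. This is a minimal sequential reference under stated assumptions, not the model's implementation: it omits the learned projections, the read step, the batch dimension, and the chunked scan, and the names `stream_update` and `run_streams` are hypothetical.

```python
def stream_update(state, alpha, route, value):
    """One step of s_m[t] = alpha_m[t] * s_m[t-1] + r_m[t] * v[t] for all M streams.

    state: M lists of length D; alpha and route: length-M lists; value: length-D list.
    """
    return [
        [a * s_d + r * v_d for s_d, v_d in zip(s_m, value)]
        for s_m, a, r in zip(state, alpha, route)
    ]


def run_streams(alphas, routes, values, num_streams, state_dim):
    """Sequential loop over T tokens (the real model uses a chunked scan instead)."""
    state = [[0.0] * state_dim for _ in range(num_streams)]
    for alpha, route, value in zip(alphas, routes, values):
        state = stream_update(state, alpha, route, value)
    return state


# Toy sizes (M=2, D=3) for readability; the model card lists M=48, D=96.
values = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
alphas = [[1.0, 0.0], [1.0, 0.0]]  # stream 0 never forgets, stream 1 forgets everything
routes = [[1.0, 1.0], [1.0, 1.0]]  # both streams accept every token
final_state = run_streams(alphas, routes, values, num_streams=2, state_dim=3)
print(final_state)
```

Stream 0 accumulates both tokens while stream 1 keeps only the latest one, which is the behaviour the per-stream decay is meant to control. The state stays the same size regardless of how many tokens have been processed.
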
---
## Training
### Pretraining (base model)
| | |
|---|---|
| Corpus | [karpathy/climbmix-400b-shuffle](https://huggingface.co/datasets/karpathy/climbmix-400b-shuffle) — 88 shards |
| Total tokens | **5.24B** (44.7× over params) |
| Steps | 80,000 × B=32 × T=2048 |
| Optimizer | AdamW (peak LR 1e-3, warmup 500, cosine to 1e-5, weight decay 0.1) |
| Compute | RTX PRO 6000 Blackwell (single GPU, bf16) |
| Wall time | **~9 hours** |
| Best val loss | **2.9508** (perplexity ≈ 19.12) |
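
The derived figures in this table follow from the step count, batch size, and sequence length. A quick check, using only numbers quoted in this card:

```python
import math

steps, batch_size, seq_len = 80_000, 32, 2048
n_params = 117_179_136

total_tokens = steps * batch_size * seq_len
tokens_per_param = total_tokens / n_params
best_val_ppl = math.exp(2.9508)  # perplexity = exp(loss in nats per token)

print(f"{total_tokens / 1e9:.2f}B tokens")
print(f"{tokens_per_param:.1f} tokens per parameter")
print(f"perplexity {best_val_ppl:.2f}")
```
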
### Supervised fine-tuning
| | |
|---|---|
| Mixture | SmolTalk + MMLU×3 + ARC×4 + GSM8K×4 + SimpleSpelling + SpellingBee + 1000 Mnemo-branded identity convs |
| Total conversations | ~1.09M |
| Steps | 30,000 × B=8 × T=2048 = ~500M SFT tokens |
| Optimizer | AdamW (peak LR 1e-4, warmup 300) |
| Best val loss | ~1.45 (masked cross-entropy over assistant tokens only) |
| Format | nanochat-style BOS-aligned best-fit packing with padding |
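
"Masked cross-entropy over assistant tokens only" means that user and padding positions contribute nothing to the loss. A minimal sketch of that idea, assuming per-token log-probabilities are already available; the function name and the toy numbers are hypothetical and not taken from the training code:

```python
import math


def masked_cross_entropy(token_log_probs, assistant_mask):
    """Mean negative log-likelihood over positions where assistant_mask is 1."""
    picked = [-lp for lp, m in zip(token_log_probs, assistant_mask) if m]
    if not picked:
        raise ValueError("no assistant tokens to supervise")
    return sum(picked) / len(picked)


# Hypothetical log-probabilities the model assigns to the correct next token.
log_probs = [math.log(0.1), math.log(0.2), math.log(0.5), math.log(0.25)]
mask = [0, 0, 1, 1]  # first two positions are user tokens, last two are assistant tokens
loss = masked_cross_entropy(log_probs, mask)
print(loss)
```

The two user positions have the lowest probabilities here, but they do not affect the result because the mask excludes them.
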
### Pipeline
```
ClimbMix-400B
│
▼
[80k step pretrain on Stream Mixer]
│ best val 2.9508 @ step 79k
▼
Base checkpoint (completes prompts)
│
▼
[30k step SFT on multi-task mixture]
│ best val ~1.45
▼
SFT checkpoint (chat-aware — answers as Mnemo)
```
---
## Evaluation results
Measured on the full test sets — no subsampling, no cherry-picking.
### Base model — `model.pt` @ step 79,000
| Metric | Value |
|---|---|
| Validation loss (nats / token) | 2.9691 |
| Perplexity | 19.47 |
| **Bits per byte (BPB)** | **0.9011** |
| Evaluation window | 409,600 tokens / 1,947,169 bytes |
Bits per byte is a tokenizer-invariant measure, so it is comparable across models with different vocabularies when they are scored on the same text. For reference, GPT-2 on similar web text lands around BPB ≈ 1.0; Mnemo at 117M parameters gets to ~0.90 on ClimbMix-400B, which is plausible for the size class, keeping in mind that the two figures come from similar but not identical text.
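
The perplexity and bits-per-byte values in the table can be reproduced from the validation loss and the size of the evaluation window. A short check using the standard conversions (nats to bits via division by ln 2):

```python
import math

val_loss_nats_per_token = 2.9691
n_tokens = 409_600
n_bytes = 1_947_169

perplexity = math.exp(val_loss_nats_per_token)
total_bits = val_loss_nats_per_token * n_tokens / math.log(2)
bits_per_byte = total_bits / n_bytes

print(f"perplexity = {perplexity:.2f}")
print(f"bits per byte = {bits_per_byte:.4f}")
```
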
### Chat model — full benchmark suite
Evaluated on the **complete test set of each task** (no `--max-problems` cap).
Categorical tasks use logit comparison over allowed letters; generative tasks
sample greedily and parse `#### N` for the final answer.
| Task | Type | N | Accuracy | Random baseline | Centered |
|---|---|---|---|---|---|
| MMLU (57 subjects) | categorical 4-way MCQ | 14,042 | **28.32%** | 25% | +4.42 |
| ARC-Easy | categorical 4-way MCQ | 2,376 | **30.68%** | 25% | +7.58 |
| ARC-Challenge | categorical 4-way MCQ | 1,172 | **29.52%** | 25% | +6.03 |
| GSM8K (math word problems) | generative, parse `#### N` | 1,319 | 1.14% | 0% | +1.14 |
| **SpellingBee (letter counting)** | generative, parse `#### N` | 256 | **94.53%** | 0% | **+94.53** |
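
The two scoring modes described above the table can be sketched as follows. This is an illustration under assumptions, not the evaluation harness itself: `pick_letter` and `parse_final_answer` are hypothetical names, and the exact regular expression used by the real parser may differ.

```python
import re


def pick_letter(letter_logits, allowed=("A", "B", "C", "D")):
    """Categorical scoring: choose the allowed letter with the highest logit."""
    return max(allowed, key=lambda letter: letter_logits[letter])


def parse_final_answer(completion):
    """Generative scoring: return the number after the last '####' marker, or None."""
    matches = re.findall(r"####\s*(-?[\d,]*\.?\d+)", completion)
    if not matches:
        return None
    return matches[-1].replace(",", "")


letter = pick_letter({"A": -1.2, "B": 0.7, "C": 0.1, "D": -3.0})
answer = parse_final_answer("Counting the letters one by one gives three.\n#### 3")
print(letter, answer)
```

Restricting the comparison to the allowed letters is what guarantees the random baseline of 25% on the 4-way tasks, since the model cannot "answer" with anything else.
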
### ChatCORE metric
**`ChatCORE = 22.74%`** — mean centered accuracy across all five tasks.
ChatCORE has the same shape as nanochat's metric: each task is rescaled against its random baseline, so random guessing scores 0 and a perfect score is 100. At 22.74% on 117M parameters after 9h of pretraining plus 1h of SFT, Mnemo scores above random on every task, although the mean is dominated by SpellingBee (+94.53) and the three MCQ tasks sit only 4 to 8 centered points above chance. This suggests the Stream Mixer architecture can hold the necessary structure, and that parameter count rather than architecture is the more likely ceiling.
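
The centering can be reproduced from the results table. The formula below is an assumption inferred from the description (random baseline maps to 0, perfect maps to 100); it uses the rounded accuracies from the table, so individual centered values can differ from the published column in the last digit.

```python
def centered(accuracy_pct, baseline_pct):
    """Rescale so the random baseline maps to 0 and a perfect score maps to 100."""
    return 100.0 * (accuracy_pct - baseline_pct) / (100.0 - baseline_pct)


# (accuracy %, random baseline %) copied from the results table.
results = {
    "MMLU": (28.32, 25.0),
    "ARC-Easy": (30.68, 25.0),
    "ARC-Challenge": (29.52, 25.0),
    "GSM8K": (1.14, 0.0),
    "SpellingBee": (94.53, 0.0),
}

per_task = {name: centered(acc, base) for name, (acc, base) in results.items()}
chatcore = sum(per_task.values()) / len(per_task)
print(f"ChatCORE = {chatcore:.2f}%")
```
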
### Where the numbers come from
- **SpellingBee 94.53%** is the standout. Mnemo learned to enumerate words from the 370k-word English dictionary character by character and reliably emit a correct `#### N` final answer. Common short words that tokenize as a single BPE token (like "strawberry") still fail because the model never observes their letters individually; this is a tokenizer limitation, not a model one.
- **All three MCQ tasks above random** indicates that the model commits to a letter at the assistant position when forced. The MMLU advantage is modest (+3.32 pp raw, +4.42 centered): 117M parameters can't memorize the breadth of academic facts MMLU covers.
- **GSM8K at 1.14%** is honest for an unaligned 117M-parameter model with no tool use. The format is correctly learned (step-by-step reasoning + `#### N` final answer) but the arithmetic isn't reliable enough to land the right number consistently.
## Capabilities and limitations
### Confirmed strong
- Coherent conversational dialogue in chat format (`<|user_start|>` / `<|assistant_start|>`)
- Factual recall on common entities (capital cities, chemical symbols, planets ordered)
- **Letter counting via manual enumeration** — 94.5% on SpellingBee
- Multiple-choice answer commitment (above random on all three MCQ benchmarks)
- Persona consistency (model identifies as Mnemo with consistent self-description)
- Greedy + nucleus (top-p) sampling configurable for short or long generation
### Confirmed weak
- **Math word problems** — 1.14% on GSM8K. Format is learned, arithmetic is not
- **Single-token common words for spelling** — "strawberry" → 2 r's (real answer: 3); tokenizer hides character-level information for words that fit in a single BPE token
- **Niche factual recall** — confabulates confidently on rare entities, exact dates, specific quotations
- **Long multi-turn conversations** — context drifts after ~2-3 turns
### Limitations (general)
- **117M parameters** — knowledge density is the ceiling, not the architecture
- **No tool use, no internet, no images, no memory across sessions**
- **2048-token context** — quality degrades past ~1500 tokens without repetition penalty
- **No RLHF** — outputs reflect only supervised signal; may produce inappropriate completions
- **English only** — pretraining corpus is essentially English educational/web text
- **Repetition prone in long generations** without `--repetition-penalty` or `--top-p`
---
## Usage
### Direct loading
```python
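import torch
from tokenizers import Tokenizer

from model import GPT  # model definition shipped alongside the checkpoint

# Use the GPU when one is available, otherwise fall back to CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = Tokenizer.from_file('tokenizer.json')
ckpt = torch.load('model.pt', map_location=device)

config = dict(ckpt['config'])
# Round the vocabulary size up to the next multiple of 64.
config['vocab_size'] = ((tokenizer.get_vocab_size() + 63) // 64) * 64
model = GPT.from_config(config).to(device).eval()

# Strip the '_orig_mod.' prefix that torch.compile adds to parameter names.
state = {k.removeprefix('_orig_mod.'): v for k, v in ckpt['model'].items()}
model.load_state_dict(state, strict=False)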
```
### Chat CLI (recommended)
```bash
python3 chat_cli.py # interactive REPL
python3 chat_cli.py -p "Who are you?" # one-shot
```
The chat CLI handles the chat-format token wrapping (`<|bos|>` → `<|user_start|>` …)
and stops generation cleanly on `<|assistant_end|>`. State is cached across turns
via the recurrent state buffer — only the new tokens of each user message are
prefilled, giving roughly **5–10× faster prefill** on multi-turn conversations than
re-processing the entire history.
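
Why state caching helps can be shown with a toy recurrence. This sketch is not the chat CLI's code: `ToyRecurrentChat` is a hypothetical stand-in whose state is a single float, and it only counts token steps rather than measuring time, so it does not reproduce the speedup figure quoted above.

```python
class ToyRecurrentChat:
    """Toy stand-in for a recurrent model: the whole state is one float."""

    def __init__(self, decay=0.5):
        self.decay = decay
        self.state = 0.0
        self.tokens_processed = 0

    def prefill(self, token_ids):
        """Fold new tokens into the cached state; earlier tokens are never revisited."""
        for tok in token_ids:
            self.state = self.decay * self.state + tok
            self.tokens_processed += 1
        return self.state


turns = [[1, 2, 3], [4, 5], [6]]

# Cached: each turn only prefills its own tokens.
cached = ToyRecurrentChat()
for turn in turns:
    cached.prefill(turn)

# Uncached: every turn re-processes the whole history from a fresh state.
uncached_cost = 0
history = []
for turn in turns:
    history.extend(turn)
    fresh = ToyRecurrentChat()
    fresh.prefill(history)
    uncached_cost += fresh.tokens_processed

print(cached.state == fresh.state, cached.tokens_processed, uncached_cost)
```

Both paths end in the same state, but the cached path touches each token once while the uncached path's cost grows with the length of the accumulated history on every turn.
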
### Raw inference (no chat format)
```bash
python3 infer.py -p "Photosynthesis is the process by which" --top-p 0.9 -r 1.15
```
Recommended sampling parameters (empirically tuned, see training log):
- **Greedy / factual probes**: `-t 0`
- **Short prose (≤500 tok)**: `-t 0.8 -k 50`
- **Long prose (500–2000 tok)**: `-t 0.8 -k 50 --top-p 0.9 -r 1.15` (anti-loop)
- **Diverse creative writing**: `-t 0.9 --top-p 0.85 -r 1.1`
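
How these knobs interact can be sketched with a small sampler. This is a generic illustration, not the code in `infer.py`: the function name `sample_next` is hypothetical, the mapping of `-t`, `-k`, `--top-p`, and `-r` onto its arguments is an assumption, and the repetition penalty shown is the common divide-or-multiply variant, which may differ from the repository's.

```python
import math
import random


def sample_next(logits, generated, temperature=0.8, top_k=None, top_p=None,
                repetition_penalty=1.0, rng=random):
    """Pick the next token id from a list of logits (illustrative only)."""
    logits = list(logits)
    # Repetition penalty: make already-generated tokens less likely.
    for tok in set(generated):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty
    if temperature == 0:
        return max(range(len(logits)), key=logits.__getitem__)  # greedy
    # Softmax with temperature, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep the k most likely tokens
    if top_p is not None:
        kept, mass = [], 0.0
        for p, i in ranked:  # keep the smallest prefix whose mass reaches top_p
            kept.append((p, i))
            mass += p
            if mass >= top_p:
                break
        ranked = kept
    probs, ids = zip(*ranked)
    return rng.choices(ids, weights=probs, k=1)[0]


greedy = sample_next([1.0, 3.0, 2.9], generated=[], temperature=0)
penalized = sample_next([1.0, 3.0, 2.9], generated=[1], temperature=0,
                        repetition_penalty=1.15)
print(greedy, penalized)
```

In the example, token 1 wins under plain greedy decoding, but once it has already been generated the 1.15 penalty drops its logit below token 2's, which is the anti-loop effect the long-prose settings rely on.
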
---
## Probe outputs (greedy, from the base checkpoint)
Run via `python3 base_eval.py --eval sample` against the pretrained checkpoint (`model.pt`, val 2.9508). Greedy, 64 tokens per completion.
| Prompt | First tokens of output | Verdict |
|---|---|---|
| *The capital of France is* | "...Paris, and it is the capital of France. The capital of France is Paris..." | ✓ Paris lands |
| *The chemical symbol of gold is* | "Au. It is a soft, silvery-white metal... good conductor of electricity and heat, making it useful in electrical wiring..." | ✓ Au + real applied claim (the "silvery-white" description is wrong; gold is yellow) |
| *If yesterday was Friday, then tomorrow will be* | "Tuesday. The weather is not so bad..." | ✗ (correct: Sunday) |
| *The opposite of hot is* | "the cold." | ✓ |
| *The planets of the solar system are:* | "Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune, Pluto..." | ✓ Correct astronomical order (Pluto was reclassified as a dwarf planet in 2006) |
| *My favorite color is* | "red. It's a color that's been around for a long time..." | ✓ |
| *If 5\*x + 3 = 13, then x is* | "a positive integer. If x is a positive integer, then x is a positive integer..." | ✗ Loop |
| *Photosynthesis is the process by which* | "plants and other organisms convert light energy into chemical energy. It is a complex process that involves the conversion of light energy into chemical energy..." | ✓ Factually correct opener |
**6 of the 8 probes above land correct answers under greedy decoding.** Repetition is visible: the base model benefits substantially from `--repetition-penalty 1.15` and/or `--top-p 0.9` on longer generations (see Usage section).
---
## Citation and acknowledgements
Built on top of [karpathy/nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy.
The Stream Mixer architecture is an attention-free experiment swapping the standard
Transformer block for a recurrent linear-time sequence mixer.
Pretraining data is [karpathy/climbmix-400b-shuffle](https://huggingface.co/datasets/karpathy/climbmix-400b-shuffle).
SFT mixture sources: HuggingFaceTB/smol-smoltalk, cais/mmlu, allenai/ai2_arc, openai/gsm8k,
and a custom 1000-conversation identity dataset.
```bibtex
@misc{mnemo2026,
title={Mnemo: A Linear-Time Recurrent Language Model},
author={Alvarado, Luis Miguel},
year={2026},
note={Built on karpathy/nanochat. Stream Mixer architecture.},
howpublished={\url{https://github.com/<your-handle>/mnemo}}
}
```
---
## License
MIT. Use freely. No warranty.