---
license: mit
language:
- en
library_name: pytorch
pipeline_tag: text-generation
tags:
- text-generation
- stream-mixer
- linear-time
- recurrent
- attention-free
- nanochat
- small-llm
datasets:
- karpathy/climbmix-400b-shuffle
- HuggingFaceTB/smol-smoltalk
- cais/mmlu
- allenai/ai2_arc
- openai/gsm8k
base_model: karpathy/nanochat
---

# Mnemo

> *μνήμη — Greek for "memory"*

**Mnemo** is a small attention-free language model with 117M parameters, built on the
**Stream Mixer** architecture — a linear-time recurrent sequence mixer that uses
multiple parallel content-routed memory streams instead of self-attention. The name
nods to the model's recurrent memory: every layer maintains M parallel state buffers
that "remember" content over the entire sequence without quadratic attention.

The training pipeline (data, tokenizer, eval, fine-tuning) is a fork of
[karpathy/nanochat](https://github.com/karpathy/nanochat), with the attention-based
GPT replaced by a custom Stream Mixer block.

---

## Quick facts

| | |
|---|---|
| Architecture | Stream Mixer (linear-time recurrent) |
| Parameters | **117,179,136** |
| Layers | 16 |
| Hidden dim | 768 |
| Memory streams (M) | 48 |
| Stream state dim (D) | 96 |
| Read heads | 6 |
| Context length | 2048 tokens |
| Vocab | 32,768 BPE (GPT-4-style pretokenization) |
| Special tokens | `<\|bos\|>`, `<\|user_start\|>`, `<\|user_end\|>`, `<\|assistant_start\|>`, `<\|assistant_end\|>` |
| Compute dtype | bf16 (Ampere+) / fp32 (T4/CPU) |
| **Base perplexity / BPB** | **19.47 perplexity (0.9011 bits per byte)** |
| **ChatCORE (chat model)** | **22.74%** (mean centered accuracy across 5 tasks) |
| **SpellingBee accuracy** | **94.53%** (242/256, full test set) |
| License | MIT |

---

## Architecture: Stream Mixer

Mnemo's defining feature is its sequence mixer. Where a Transformer uses self-attention
to compute pairwise interactions across tokens (cost: **O(T²)**), Mnemo uses a chunked
parallel scan over M parallel content-routed memory streams (cost: **O(T · M · D)** —
**linear in sequence length**).

Per token *t* and per layer:

1. Compute value `v[t]`, read query `q[t]`, content-router `r[t]`, and per-stream decay `α[t]`.
2. Each memory stream `s_m` updates via `s_m[t] = α_m[t] · s_m[t-1] + r_m[t] · v[t]`.
3. Multi-head sigmoid-gated read with QK-norm aggregates from the M streams.

The full state across a layer is **(B, M, D)** — a fixed-size recurrent memory that
the model can carry across arbitrary sequence lengths. The chunked scan implementation
keeps numerical range bounded even for slow-decay streams.
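
As a rough illustration of step 2, the per-stream update can be written as a plain sequential scan. This is a minimal NumPy sketch, not the model's chunked implementation: the function name `stream_scan`, the random inputs, and the decay range are assumptions made for the example, and the read in step 3 is omitted.

```python
import numpy as np

def stream_scan(v, r, alpha, s0=None):
    """Sequentially apply s_m[t] = alpha_m[t] * s_m[t-1] + r_m[t] * v[t].

    v:     (T, D) values
    r:     (T, M) content-router weights
    alpha: (T, M) per-stream decay
    s0:    (M, D) initial state, zeros if omitted
    Returns the state after every token, shape (T, M, D).
    """
    T, D = v.shape
    M = r.shape[1]
    s = np.zeros((M, D)) if s0 is None else s0.copy()
    states = np.empty((T, M, D))
    for t in range(T):
        # Decay each stream, then write the routed value into it.
        s = alpha[t][:, None] * s + r[t][:, None] * v[t][None, :]
        states[t] = s
    return states

# Illustrative sizes: M and D match the Quick facts table, T is arbitrary.
rng = np.random.default_rng(0)
T, M, D = 8, 48, 96
v = rng.standard_normal((T, D))
r = rng.standard_normal((T, M))
alpha = rng.uniform(0.5, 1.0, size=(T, M))

states = stream_scan(v, r, alpha)
print(states.shape)  # (8, 48, 96)

# Resuming from the state after token 3 matches scanning the whole sequence.
resumed = stream_scan(v[4:], r[4:], alpha[4:], s0=states[3])
print(np.allclose(resumed[-1], states[-1]))
```

Each step touches an `(M, D)` state once, so the cost grows linearly with the number of tokens, and the saved state is all that is needed to continue from a later position.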

For details see the model source.

---

## Training

### Pretraining (base model)

| | |
|---|---|
| Corpus | [karpathy/climbmix-400b-shuffle](https://huggingface.co/datasets/karpathy/climbmix-400b-shuffle) — 88 shards |
| Total tokens | **5.24B** (≈44.7 tokens per parameter) |
| Steps | 80,000 × B=32 × T=2048 |
| Optimizer | AdamW (peak LR 1e-3, warmup 500, cosine to 1e-5, weight decay 0.1) |
| Compute | RTX PRO 6000 Blackwell (single GPU, bf16) |
| Wall time | **~9 hours** |
| Best val loss | **2.9508** (perplexity ≈ 19.12) |
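
The optimizer row describes a warmup followed by cosine decay. Below is a minimal sketch of such a schedule, assuming linear warmup and a decay that spans all remaining steps; the card states neither detail, and `lr_at` is a hypothetical helper rather than a function from the training code. The last lines recompute the token budget from the table.

```python
import math

PEAK_LR, MIN_LR = 1e-3, 1e-5
WARMUP_STEPS, TOTAL_STEPS = 500, 80_000

def lr_at(step: int) -> float:
    """Learning rate at an optimizer step (assumed: linear warmup, cosine decay)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine

# Token budget implied by the table: steps x batch size x sequence length.
total_tokens = TOTAL_STEPS * 32 * 2048
tokens_per_param = total_tokens / 117_179_136

for step in (0, 250, 500, 40_000, 80_000):
    print(step, f"{lr_at(step):.2e}")
print(total_tokens, round(tokens_per_param, 1))
```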

### Supervised fine-tuning

| | |
|---|---|
| Mixture | SmolTalk + MMLU×3 + ARC×4 + GSM8K×4 + SimpleSpelling + SpellingBee + 1000 Mnemo-branded identity convs |
| Total conversations | ~1.09M |
| Steps | 30,000 × B=8 × T=2048 ≈ 492M SFT tokens (including padding) |
| Optimizer | AdamW (peak LR 1e-4, warmup 300) |
| Best val loss | ~1.45 (masked cross-entropy over assistant tokens only) |
| Format | nanochat-style BOS-aligned best-fit packing with padding |

### Pipeline

```
ClimbMix-400B
   │
   ▼
[80k step pretrain on Stream Mixer]
   │  best val 2.9508 @ step 79k
   ▼
Base checkpoint  (completes prompts)
   │
   ▼
[30k step SFT on multi-task mixture]
   │  best val ~1.45
   ▼
SFT checkpoint  (chat-aware — answers as Mnemo)
```

---

## Evaluation results

Measured on the full test sets — no subsampling, no cherry-picking.

### Base model — `model.pt` @ step 79,000

| Metric | Value |
|---|---|
| Validation loss (nats / token) | 2.9691 |
| Perplexity | 19.47 |
| **Bits per byte (BPB)** | **0.9011** |
| Evaluation window | 409,600 tokens / 1,947,169 bytes |

Bits-per-byte is the tokenizer-invariant measure — directly comparable across models with different vocabularies. For reference, GPT-2 on similar web text lands around BPB ≈ 1.0; Mnemo at 117M on ClimbMix-400B gets to ~0.90, which is sensible for the size class.
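
The figures in the table are related by simple conversions. The sketch below recomputes perplexity and bits per byte from the reported loss, token count, and byte count, assuming the loss is the mean over the whole evaluation window; the variable names are illustrative.

```python
import math

val_loss_nats = 2.9691   # mean cross-entropy per token, in nats
n_tokens = 409_600       # tokens in the evaluation window
n_bytes = 1_947_169      # bytes of text covered by those tokens

perplexity = math.exp(val_loss_nats)
# Total nats over the window, converted to bits, divided by the byte count.
bits_per_byte = val_loss_nats * n_tokens / (math.log(2) * n_bytes)

print(f"perplexity    = {perplexity:.2f}")
print(f"bits per byte = {bits_per_byte:.4f}")
```

Both results should match the table up to rounding. Dividing by bytes rather than tokens is what removes the dependence on the tokenizer.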

### Chat model — full benchmark suite

Evaluated on the **complete test set of each task** (no `--max-problems` cap).
Categorical tasks use logit comparison over allowed letters; generative tasks
sample greedily and parse `#### N` for the final answer.

| Task | Type | N | Accuracy | Random baseline | Centered |
|---|---|---|---|---|---|
| MMLU (57 subjects) | categorical 4-way MCQ | 14,042 | **28.32%** | 25% | +4.42 |
| ARC-Easy | categorical 4-way MCQ | 2,376 | **30.68%** | 25% | +7.58 |
| ARC-Challenge | categorical 4-way MCQ | 1,172 | **29.52%** | 25% | +6.03 |
| GSM8K (math word problems) | generative, parse `#### N` | 1,319 | 1.14% | 0% | +1.14 |
| **SpellingBee (letter counting)** | generative, parse `#### N` | 256 | **94.53%** | 0% | **+94.53** |

### ChatCORE metric

**`ChatCORE = 22.74%`** — mean centered accuracy across all five tasks.

ChatCORE follows the shape of nanochat's metric: each task's accuracy is rescaled against its random baseline as `(accuracy - baseline) / (1 - baseline)`, so chance-level guessing scores 0 and a perfect run scores 100. At 22.74% for 117M parameters after 9h of pretraining and 1h of SFT, Mnemo scores above random on every task, although SpellingBee contributes most of the mean. The results suggest that the Stream Mixer architecture *can* hold the necessary structure and that the dominant ceiling is parameter count, not architecture.
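
The centering can be reproduced from the table. This sketch assumes the rescaling `(accuracy - baseline) / (1 - baseline)`, which is inferred from the reported numbers and agrees with the Centered column up to rounding; it is not taken from the evaluation code.

```python
# (accuracy %, random baseline %) per task, copied from the results table.
results = {
    "MMLU": (28.32, 25.0),
    "ARC-Easy": (30.68, 25.0),
    "ARC-Challenge": (29.52, 25.0),
    "GSM8K": (1.14, 0.0),
    "SpellingBee": (94.53, 0.0),
}

def centered(accuracy: float, baseline: float) -> float:
    """Rescale so the random baseline maps to 0 and a perfect score to 100."""
    return 100.0 * (accuracy - baseline) / (100.0 - baseline)

scores = {task: centered(acc, base) for task, (acc, base) in results.items()}
chatcore = sum(scores.values()) / len(scores)

for task, score in scores.items():
    print(f"{task:14s} {score:+.2f}")
print(f"ChatCORE       {chatcore:.2f}")
```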

### Where the numbers come from

- **SpellingBee 94.53%** is the standout. Mnemo learned to enumerate words from the 370k-word English dictionary character by character and to reliably emit a correct `#### N` final answer. Common short words that tokenize as single BPE tokens (like "strawberry") still fail because the model never observes their letters individually; this is a tokenizer limitation, not a model one.
- **All three MCQ tasks land above random**, which indicates that the model commits to a letter at the assistant position when forced. The MMLU margin is modest (+3.3 pp over random, +4.4 centered): 117M parameters can't memorize the breadth of academic facts MMLU covers.
- **GSM8K at 1.14%** is in line with expectations for an unaligned 117M-parameter model with no tool use. The format is learned correctly (step-by-step reasoning followed by a `#### N` final answer), but the arithmetic isn't reliable enough to land the right number consistently.

## Capabilities and limitations

### Confirmed strong

- Coherent conversational dialogue in chat format (`<|user_start|>` / `<|assistant_start|>`)
- Factual recall on common entities (capital cities, chemical symbols, the order of the planets)
- **Letter counting via manual enumeration** — 94.5% on SpellingBee
- Multiple-choice answer commitment (above random on all three MCQ benchmarks)
- Persona consistency (model identifies as Mnemo with consistent self-description)
- Greedy + nucleus (top-p) sampling configurable for short or long generation

### Confirmed weak

- **Math word problems** — 1.14% on GSM8K. Format is learned, arithmetic is not
- **Single-token common words for spelling** — "strawberry" → 2 r's (real answer: 3); tokenizer hides character-level information for words that fit in a single BPE token
- **Niche factual recall** — confabulates confidently on rare entities, exact dates, specific quotations
- **Long multi-turn conversations** — context drifts after ~2-3 turns

### Limitations (architectural)

- **117M parameters** — knowledge density is the ceiling, not the architecture
- **No tool use, no internet, no images, no memory across sessions**
- **2048-token context** — quality degrades past ~1500 tokens without repetition penalty
- **No RLHF** — outputs reflect only supervised signal; may produce inappropriate completions
- **English only** — pretraining corpus is essentially English educational/web text
- **Repetition-prone in long generations** without `--repetition-penalty` or `--top-p`

---

## Usage

### Direct loading

```python
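import torch
from tokenizers import Tokenizer
from model import GPT  # model.py from this repository

# Fall back to CPU (fp32) when no GPU is available.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = Tokenizer.from_file('tokenizer.json')
ckpt = torch.load('model.pt', map_location=device)

# Rebuild the config; the vocab size is padded up to a multiple of 64.
config = dict(ckpt['config'])
config['vocab_size'] = ((tokenizer.get_vocab_size() + 63) // 64) * 64
model = GPT.from_config(config).to(device).eval()

# Checkpoints saved from a torch.compile'd model prefix every key with '_orig_mod.'.
state = {k.removeprefix('_orig_mod.'): v for k, v in ckpt['model'].items()}
# strict=False tolerates key mismatches; inspect the returned
# missing_keys / unexpected_keys to confirm the load was complete.
model.load_state_dict(state, strict=False)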
```

### Chat CLI (recommended)

```bash
python3 chat_cli.py                   # interactive REPL
python3 chat_cli.py -p "Who are you?"  # one-shot
```

The chat CLI handles the chat-format token wrapping (`<|bos|>` → `<|user_start|>` …)
and stops generation cleanly on `<|assistant_end|>`. State is cached across turns
via the recurrent state buffer — only the new tokens of each user message are
prefilled, giving roughly **5–10× faster prefill** on multi-turn conversations than
re-processing the entire history.
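
To make the token wrapping concrete, here is a minimal sketch that renders a conversation into a prompt string using the special tokens listed in Quick facts. The exact layout is an assumption modeled on nanochat's chat format, and `render_prompt` is a hypothetical helper; `chat_cli.py` remains the reference.

```python
BOS = "<|bos|>"
USER_START, USER_END = "<|user_start|>", "<|user_end|>"
ASSISTANT_START, ASSISTANT_END = "<|assistant_start|>", "<|assistant_end|>"

def render_prompt(turns: list[tuple[str, str]]) -> str:
    """Render (role, text) turns and leave the prompt open for the assistant's reply."""
    parts = [BOS]
    for role, text in turns:
        if role == "user":
            parts.append(f"{USER_START}{text}{USER_END}")
        elif role == "assistant":
            parts.append(f"{ASSISTANT_START}{text}{ASSISTANT_END}")
        else:
            raise ValueError(f"unknown role: {role!r}")
    # Generation continues from here until the model emits ASSISTANT_END.
    parts.append(ASSISTANT_START)
    return "".join(parts)

print(render_prompt([("user", "Who are you?")]))
# <|bos|><|user_start|>Who are you?<|user_end|><|assistant_start|>
```

The string form is only for readability. A real pipeline would typically map each special token to its ID directly instead of tokenizing it as text.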

### Raw inference (no chat format)

```bash
python3 infer.py -p "Photosynthesis is the process by which" --top-p 0.9 -r 1.15
```

Recommended sampling parameters (empirically tuned, see training log):
- **Greedy / factual probes**: `-t 0`
- **Short prose (≤500 tok)**: `-t 0.8 -k 50`
- **Long prose (500–2000 tok)**: `-t 0.8 -k 50 --top-p 0.9 -r 1.15` (anti-loop)
- **Diverse creative writing**: `-t 0.9 --top-p 0.85 -r 1.1`
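
For readers unfamiliar with how these knobs interact, the sketch below combines a repetition penalty, temperature, top-k, and nucleus (top-p) filtering in one sampling step. It is a generic illustration and not the code in `infer.py`: the function `sample_next`, the order of the filters, and the reading of `-t` and `-k` as temperature and top-k are assumptions.

```python
import math
import random

def sample_next(logits, prev_tokens=(), temperature=0.8, top_k=50,
                top_p=0.9, repetition_penalty=1.15, rng=random):
    """Pick the next token id from raw logits (a list of floats)."""
    logits = list(logits)
    # Repetition penalty: push down the logit of every token already generated.
    for tok in set(prev_tokens):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty
    # Temperature 0 means greedy decoding.
    if temperature == 0:
        return max(range(len(logits)), key=logits.__getitem__)
    # Top-k: keep only the k highest-scoring candidates.
    order = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)
    if top_k:
        order = order[:top_k]
    # Softmax with temperature over the surviving candidates.
    top = logits[order[0]]
    weights = [math.exp((logits[i] - top) / temperature) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    if top_p is not None and top_p < 1.0:
        cumulative, cutoff = 0.0, len(order)
        for n, p in enumerate(probs, start=1):
            cumulative += p
            if cumulative >= top_p:
                cutoff = n
                break
        order, probs = order[:cutoff], probs[:cutoff]
    return rng.choices(order, weights=probs, k=1)[0]

example_logits = [2.0, 1.0, 0.5, -1.0]
print(sample_next(example_logits, temperature=0))  # 0
```

A larger penalty or a smaller `top_p` narrows the candidate set, which is why the long-prose preset pairs the two to suppress loops.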

---

## Probe outputs (greedy, from the base checkpoint)

Run via `python3 base_eval.py --eval sample` against the pretrained checkpoint (`model.pt`, val 2.9508). Greedy, 64 tokens per completion.

| Prompt | First tokens of output | Verdict |
|---|---|---|
| *The capital of France is* | "...Paris, and it is the capital of France. The capital of France is Paris..." | ✓ Paris lands |
| *The chemical symbol of gold is* | "Au. It is a soft, silvery-white metal... good conductor of electricity and heat, making it useful in electrical wiring..." | ✓ Au + real applied claim |
| *If yesterday was Friday, then tomorrow will be* | "Tuesday. The weather is not so bad..." | ✗ (correct: Sunday) |
| *The opposite of hot is* | "the cold." | ✓ |
| *The planets of the solar system are:* | "Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune, Pluto..." | ✓ Correct astronomical order |
| *My favorite color is* | "red. It's a color that's been around for a long time..." | ✓ |
| *If 5\*x + 3 = 13, then x is* | "a positive integer. If x is a positive integer, then x is a positive integer..." | ✗ Loop |
| *Photosynthesis is the process by which* | "plants and other organisms convert light energy into chemical energy. It is a complex process that involves the conversion of light energy into chemical energy..." | ✓ Factually correct opener |

**5/7 of the original training probes land correct answers under greedy decoding.** Repetition is visible: the base model benefits substantially from `--repetition-penalty 1.15` and/or `--top-p 0.9` on longer generations (see the Usage section).

---

## Citation and acknowledgements

Built on top of [karpathy/nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy.
The Stream Mixer architecture is an attention-free experiment swapping the standard
Transformer block for a recurrent linear-time sequence mixer.

Pretraining data is [karpathy/climbmix-400b-shuffle](https://huggingface.co/datasets/karpathy/climbmix-400b-shuffle).
SFT mixture sources: HuggingFaceTB/smol-smoltalk, cais/mmlu, allenai/ai2_arc, openai/gsm8k,
and a custom 1000-conversation identity dataset.

```bibtex
@misc{mnemo2026,
  title={Mnemo: A Linear-Time Recurrent Language Model},
  author={Alvarado, Luis Miguel},
  year={2026},
  note={Built on karpathy/nanochat. Stream Mixer architecture.},
  howpublished={\url{https://github.com/<your-handle>/mnemo}}
}
```

---

## License

MIT. Use freely. No warranty.