---
license: mit
language:
- en
library_name: pytorch
pipeline_tag: text-generation
tags:
- text-generation
- stream-mixer
- linear-time
- recurrent
- attention-free
- nanochat
- small-llm
datasets:
- karpathy/climbmix-400b-shuffle
- HuggingFaceTB/smol-smoltalk
- cais/mmlu
- allenai/ai2_arc
- openai/gsm8k
base_model: karpathy/nanochat
---

# Mnemo

> *μνήμη — Greek for "memory"*

**Mnemo** is a small attention-free language model with 117M parameters, built on the
**Stream Mixer** architecture — a linear-time recurrent sequence mixer that uses
multiple parallel content-routed memory streams instead of self-attention. The name
nods to the model's recurrent memory: every layer maintains M parallel state buffers
that "remember" content over the entire sequence without quadratic attention.

The training pipeline (data, tokenizer, eval, fine-tuning) is a fork of
[karpathy/nanochat](https://github.com/karpathy/nanochat), with the attention-based
GPT replaced by a custom Stream Mixer block.

---

## Quick facts

| | |
|---|---|
| Architecture | Stream Mixer (linear-time recurrent) |
| Parameters | **117,179,136** |
| Layers | 16 |
| Hidden dim | 768 |
| Memory streams (M) | 48 |
| Stream state dim (D) | 96 |
| Read heads | 6 |
| Context length | 2048 tokens |
| Vocab | 32,768 BPE (GPT-4-style pretokenization) |
| Special tokens | `<\|bos\|>`, `<\|user_start\|>`, `<\|user_end\|>`, `<\|assistant_start\|>`, `<\|assistant_end\|>` |
| Compute dtype | bf16 (Ampere+) / fp32 (T4/CPU) |
| **Base perplexity** | **19.47** (**0.9011** bits per byte) |
| **ChatCORE (chat model)** | **22.74%** (mean centered accuracy across 5 tasks) |
| **SpellingBee accuracy** | **94.53%** (242/256 on the test set) |
| License | MIT |

---

## Architecture: Stream Mixer

Mnemo's defining feature is its sequence mixer. Where a Transformer uses self-attention
to compute pairwise interactions across tokens (cost: **O(T²)**), Mnemo uses a chunked
parallel scan over M parallel content-routed memory streams (cost: **O(T · M · D)** —
**linear in sequence length**).

Per token *t* and per layer:

1. Compute value `v[t]`, read query `q[t]`, content-router `r[t]`, and per-stream decay `α[t]`.
2. Each memory stream `s_m` updates via `s_m[t] = α_m[t] · s_m[t-1] + r_m[t] · v[t]`.
3. Multi-head sigmoid-gated read with QK-norm aggregates from the M streams.

The full recurrent state of a layer has shape **(B, M, D)** (batch × streams × state dim):
a fixed-size memory that the model can carry across arbitrary sequence lengths. The
chunked scan implementation keeps the numerical range bounded even for slow-decay streams.
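
The recurrence in step 2 can be sketched as a plain sequential loop. This is a minimal reference under stated assumptions, not the chunked scan the model uses: it handles a single sequence (no batch axis), takes `r` and `alpha` as given inputs, and omits the gated read of step 3. The function name `stream_scan` and the toy sizes are hypothetical.

```python
def stream_scan(v, r, alpha, state=None):
    """Sequential reference for s_m[t] = alpha_m[t] * s_m[t-1] + r_m[t] * v[t].

    v:     T rows of D floats (values)
    r:     T rows of M floats (content-router weights)
    alpha: T rows of M floats (per-stream decays)
    state: optional M rows of D floats carried over from earlier tokens
    Returns the final M x D state for one sequence (no batch axis).
    """
    n_streams, dim = len(r[0]), len(v[0])
    if state is None:
        s = [[0.0] * dim for _ in range(n_streams)]
    else:
        s = [row[:] for row in state]  # copy so the caller's state is untouched
    for v_t, r_t, a_t in zip(v, r, alpha):
        for m in range(n_streams):
            # Decay the old state of stream m, then write the routed value into it.
            s[m] = [a_t[m] * s_md + r_t[m] * v_td for s_md, v_td in zip(s[m], v_t)]
    return s


# Toy sizes for illustration (the model itself uses M=48, D=96).
v = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.0]]                    # T=4, D=2
r = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [1.0, 1.0, 0.0], [0.5, 0.5, 0.5]]  # T=4, M=3
alpha = [[0.9, 0.5, 0.99]] * 4                                            # T=4, M=3

full = stream_scan(v, r, alpha)

# Processing the sequence in two pieces and carrying the state gives the same result.
head = stream_scan(v[:2], r[:2], alpha[:2])
chunked = stream_scan(v[2:], r[2:], alpha[2:], state=head)
```

Because the state has a fixed M × D size, carrying `head` forward costs the same regardless of how many tokens produced it, which is what makes incremental prefill across chat turns possible.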

For details see the model source.

---

## Training

### Pretraining (base model)

| | |
|---|---|
| Corpus | [karpathy/climbmix-400b-shuffle](https://huggingface.co/datasets/karpathy/climbmix-400b-shuffle) — 88 shards |
| Total tokens | **5.24B** (44.7× over params) |
| Steps | 80,000 × B=32 × T=2048 |
| Optimizer | AdamW (peak LR 1e-3, warmup 500, cosine to 1e-5, weight decay 0.1) |
| Compute | RTX PRO 6000 Blackwell (single GPU, bf16) |
| Wall time | **~9 hours** |
| Best val loss | **2.9508** (perplexity ≈ 19.12) |

### Supervised fine-tuning

| | |
|---|---|
| Mixture | SmolTalk + MMLU×3 + ARC×4 + GSM8K×4 + SimpleSpelling + SpellingBee + 1000 Mnemo-branded identity convs |
| Total conversations | ~1.09M |
| Steps | 30,000 × B=8 × T=2048 ≈ 492M SFT tokens |
| Optimizer | AdamW (peak LR 1e-4, warmup 300) |
| Best val loss | ~1.45 (masked cross-entropy over assistant tokens only) |
| Format | nanochat-style BOS-aligned best-fit packing with padding |
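
The token counts in both tables follow from steps × batch size × sequence length. The short check below reproduces them; the helper name `tokens_seen` is hypothetical, and because SFT batches are packed with padding, the SFT figure is an upper bound on real (non-padding) tokens.

```python
def tokens_seen(steps, batch_size, seq_len):
    """Token positions processed over a run: steps x batch size x sequence length."""
    return steps * batch_size * seq_len


N_PARAMS = 117_179_136  # parameter count from the Quick facts table

pretrain_tokens = tokens_seen(80_000, 32, 2048)
sft_tokens = tokens_seen(30_000, 8, 2048)
tokens_per_param = pretrain_tokens / N_PARAMS

print(f"pretraining: {pretrain_tokens / 1e9:.2f}B tokens ({tokens_per_param:.1f}x params)")
print(f"SFT:         {sft_tokens / 1e6:.0f}M token positions (including padding)")
```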

### Pipeline

```
ClimbMix-400B
   │
   ▼
[80k step pretrain on Stream Mixer]
   │  best val 2.9508 @ step 79k
   ▼
Base checkpoint  (completes prompts)
   │
   ▼
[30k step SFT on multi-task mixture]
   │  best val ~1.45
   ▼
SFT checkpoint  (chat-aware — answers as Mnemo)
```

---

## Evaluation results

Measured on the full test sets — no subsampling, no cherry-picking.

### Base model — `model.pt` @ step 79,000

| Metric | Value |
|---|---|
| Validation loss (nats / token) | 2.9691 |
| Perplexity | 19.47 |
| **Bits per byte (BPB)** | **0.9011** |
| Evaluation window | 409,600 tokens / 1,947,169 bytes |

Bits per byte is tokenizer-independent, so it can be compared across models with different vocabularies (exactly so only when both are measured on the same text). For rough reference, GPT-2 on similar web text lands around BPB ≈ 1.0; Mnemo at 117M parameters reaches ~0.90 on ClimbMix-400B, which is plausible for the size class.
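
The perplexity and BPB figures in the table can be recovered from the mean loss and the size of the evaluation window. The sketch below assumes the loss is the mean cross-entropy in nats per token over exactly that window; the function names are hypothetical.

```python
import math


def perplexity(loss_nats_per_token):
    """Perplexity is the exponential of the mean per-token loss in nats."""
    return math.exp(loss_nats_per_token)


def bits_per_byte(loss_nats_per_token, n_tokens, n_bytes):
    """Total nats over the window, converted to bits and divided by the byte count."""
    total_nats = loss_nats_per_token * n_tokens
    return total_nats / (math.log(2) * n_bytes)


ppl = perplexity(2.9691)
bpb = bits_per_byte(2.9691, n_tokens=409_600, n_bytes=1_947_169)
print(f"perplexity = {ppl:.2f}, bits per byte = {bpb:.4f}")
```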

### Chat model — full benchmark suite

Evaluated on the **complete test set of each task** (no `--max-problems` cap).
Categorical tasks use logit comparison over the allowed letters; generative tasks
decode greedily and parse `#### N` for the final answer.
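
As an illustration of the `#### N` convention, the sketch below pulls the final answer out of a generated string. It is not the evaluation code from this repository: the regular expression, the helper name `extract_answer`, and the choice to strip thousands separators are all assumptions.

```python
import re

# Matches "#### " followed by an optionally signed number, e.g. "#### 42" or "#### 1,234.5".
_ANSWER_RE = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_answer(completion):
    """Return the number after '####' as a string without commas, or None if absent."""
    match = _ANSWER_RE.search(completion)
    if match is None:
        return None
    return match.group(1).replace(",", "")


def is_correct(completion, reference):
    """Exact string match between the predicted and reference final answers."""
    predicted = extract_answer(completion)
    return predicted is not None and predicted == extract_answer(reference)
```

A completion that reasons correctly but never emits the `####` marker scores as wrong under this kind of parsing, which is why learning the format matters for both GSM8K and SpellingBee.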

| Task | Type | N | Accuracy | Random baseline | Centered |
|---|---|---|---|---|---|
| MMLU (57 subjects) | categorical 4-way MCQ | 14,042 | **28.32%** | 25% | +4.42 |
| ARC-Easy | categorical 4-way MCQ | 2,376 | **30.68%** | 25% | +7.58 |
| ARC-Challenge | categorical 4-way MCQ | 1,172 | **29.52%** | 25% | +6.03 |
| GSM8K (math word problems) | generative, parse `#### N` | 1,319 | 1.14% | 0% | +1.14 |
| **SpellingBee (letter counting)** | generative, parse `#### N` | 256 | **94.53%** | 0% | **+94.53** |

### ChatCORE metric

**`ChatCORE = 22.74%`** — mean centered accuracy across all five tasks.

ChatCORE has the same shape as nanochat's metric: each task's accuracy is rescaled against its random baseline, so random guessing scores 0 and a perfect score is 100. At 22.74% on 117M params after 9h pretraining + 1h SFT, Mnemo lands above random on all five tasks, although the mean is dominated by SpellingBee (+94.53); the other four tasks contribute between +1.14 and +7.58. These results suggest that the Stream Mixer architecture *can* hold the necessary structure and that the dominant ceiling is parameter count rather than architecture.
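
The metric can be recomputed from the table with the usual centering formula, `(accuracy - baseline) / (100 - baseline)`. The sketch below assumes that formula and uses the rounded accuracies from the table; the helper name `centered` is hypothetical.

```python
def centered(accuracy, baseline):
    """Rescale a percentage so that chance maps to 0 and a perfect score to 100."""
    return 100.0 * (accuracy - baseline) / (100.0 - baseline)


# (accuracy %, random baseline %) for each task, taken from the table above.
results = {
    "MMLU": (28.32, 25.0),
    "ARC-Easy": (30.68, 25.0),
    "ARC-Challenge": (29.52, 25.0),
    "GSM8K": (1.14, 0.0),
    "SpellingBee": (94.53, 0.0),
}

per_task = {name: centered(acc, base) for name, (acc, base) in results.items()}
chatcore = sum(per_task.values()) / len(per_task)
print(f"ChatCORE = {chatcore:.2f}%")
```

Because the accuracies in the table are rounded, per-task values recomputed this way can differ from the Centered column in the last digit, while the mean still rounds to 22.74.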

### Where the numbers come from

- **SpellingBee 94.53%** is the standout. Mnemo learned to enumerate words from the 370k-word English dictionary character by character and reliably emit a correct `#### N` final answer. Common words that tokenize as a single BPE token (like "strawberry") still fail because the model never observes their letters individually; this is a tokenizer limitation, not a model one.
- **All three MCQ tasks above random** confirms the model genuinely commits to a letter at the assistant position when forced. The MMLU margin is modest (+3.3 pp over chance, +4.42 centered): 117M parameters can't memorize the breadth of academic facts MMLU covers.
- **GSM8K at 1.14%** is honest for an unaligned 117M-parameter model with no tool use. The format is correctly learned (step-by-step reasoning + `#### N` final answer) but the arithmetic isn't reliable enough to land the right number consistently.

## Capabilities and limitations

### Confirmed strong

- Coherent conversational dialogue in chat format (`<|user_start|>` / `<|assistant_start|>`)
- Factual recall on common entities (capital cities, chemical symbols, the order of the planets)
- **Letter counting via manual enumeration** — 94.5% on SpellingBee
- Multiple-choice answer commitment (above random on all three MCQ benchmarks)
- Persona consistency (model identifies as Mnemo with consistent self-description)
- Greedy + nucleus (top-p) sampling configurable for short or long generation

### Confirmed weak

- **Math word problems** — 1.14% on GSM8K. Format is learned, arithmetic is not
- **Single-token common words for spelling** — "strawberry" → 2 r's (real answer: 3); tokenizer hides character-level information for words that fit in a single BPE token
- **Niche factual recall** — confabulates confidently on rare entities, exact dates, specific quotations
- **Long multi-turn conversations** — context drifts after ~2-3 turns

### Limitations (model and training)

- **117M parameters** — knowledge density is the ceiling, not the architecture
- **No tool use, no internet, no images, no memory across sessions**
- **2048-token context** — quality degrades past ~1500 tokens without repetition penalty
- **No RLHF** — outputs reflect only supervised signal; may produce inappropriate completions
- **English only** — pretraining corpus is essentially English educational/web text
- **Repetition prone in long generations** without `--repetition-penalty` or `--top-p`

---

## Usage

### Direct loading

```python
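import torch
from tokenizers import Tokenizer
from model import GPT  # model class from this repository

# Fall back to CPU (fp32) when no CUDA device is available.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = Tokenizer.from_file('tokenizer.json')
ckpt = torch.load('model.pt', map_location=device)

# Round the vocab size up to the next multiple of 64 before building the model.
config = dict(ckpt['config'])
config['vocab_size'] = ((tokenizer.get_vocab_size() + 63) // 64) * 64
model = GPT.from_config(config).to(device).eval()

# Checkpoints saved from a torch.compile'd model prefix every key with '_orig_mod.'.
state = {k.removeprefix('_orig_mod.'): v for k, v in ckpt['model'].items()}
# strict=False tolerates key mismatches, so print them instead of ignoring them.
missing, unexpected = model.load_state_dict(state, strict=False)
print(f'missing keys: {missing}\nunexpected keys: {unexpected}')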
```

### Chat CLI (recommended)

```bash
python3 chat_cli.py                   # interactive REPL
python3 chat_cli.py -p "Who are you?"  # one-shot
```

The chat CLI handles the chat-format token wrapping (`<|bos|>` → `<|user_start|>` …)
and stops generation cleanly on `<|assistant_end|>`. State is cached across turns
via the recurrent state buffer — only the new tokens of each user message are
prefilled, giving roughly **5–10× faster prefill** on multi-turn conversations than
re-processing the entire history.
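
For readers who want to build prompts by hand, the sketch below shows one way the wrapping could look. The exact layout is an assumption modelled on nanochat's chat format, the function name `render_chat` is hypothetical, and the real CLI works on token IDs (each special token is a single ID) rather than on strings.

```python
BOS = "<|bos|>"
USER_START, USER_END = "<|user_start|>", "<|user_end|>"
ASSISTANT_START, ASSISTANT_END = "<|assistant_start|>", "<|assistant_end|>"


def render_chat(turns, add_generation_prompt=True):
    """Render (role, text) pairs into one prompt string (assumed layout).

    With add_generation_prompt=True the string ends on <|assistant_start|>,
    so the model's next tokens are the assistant reply; generation should
    stop when <|assistant_end|> is produced.
    """
    parts = [BOS]
    for role, text in turns:
        if role == "user":
            parts += [USER_START, text, USER_END]
        elif role == "assistant":
            parts += [ASSISTANT_START, text, ASSISTANT_END]
        else:
            raise ValueError(f"unknown role: {role!r}")
    if add_generation_prompt:
        parts.append(ASSISTANT_START)
    return "".join(parts)


prompt = render_chat([("user", "Who are you?")])
```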

### Raw inference (no chat format)

```bash
python3 infer.py -p "Photosynthesis is the process by which" --top-p 0.9 -r 1.15
```

Recommended sampling parameters (empirically tuned, see training log):
- **Greedy / factual probes**: `-t 0`
- **Short prose (≤500 tok)**: `-t 0.8 -k 50`
- **Long prose (500–2000 tok)**: `-t 0.8 -k 50 --top-p 0.9 -r 1.15` (anti-loop)
- **Diverse creative writing**: `-t 0.9 --top-p 0.85 -r 1.1`
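
To make the `--top-p` and `-r` settings concrete, here is a minimal pure-Python sketch of the two operations. It is not the code in `infer.py`: the repetition penalty shown is the common variant that divides positive logits and multiplies negative ones, which is an assumption about how the flag is implemented, and both function names are hypothetical.

```python
import math


def apply_repetition_penalty(logits, seen_ids, penalty):
    """Make already generated tokens less likely (penalty > 1)."""
    out = list(logits)
    for i in set(seen_ids):
        # Dividing a positive logit or multiplying a negative one both lower it.
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


def top_p_filter(logits, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p.

    Returns {token_id: renormalized probability} for the kept tokens.
    """
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
nucleus = top_p_filter(logits, top_p=0.9)
penalized = apply_repetition_penalty([2.0, -1.0, 0.5], seen_ids=[0, 1], penalty=1.15)
```

With `top_p=0.9` the lowest-probability token in the example is dropped before sampling, which trims the long tail that tends to derail long generations, while the penalty pushes the model away from tokens it has already emitted.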

---

## Probe outputs (greedy, from the base checkpoint)

Run via `python3 base_eval.py --eval sample` against the pretrained checkpoint (`model.pt`, val 2.9508). Greedy, 64 tokens per completion.

| Prompt | First tokens of output | Verdict |
|---|---|---|
| *The capital of France is* | "...Paris, and it is the capital of France. The capital of France is Paris..." | ✓ Paris lands |
| *The chemical symbol of gold is* | "Au. It is a soft, silvery-white metal... good conductor of electricity and heat, making it useful in electrical wiring..." | ✓ Au and a valid application, though "silvery-white" is wrong (gold is yellow) |
| *If yesterday was Friday, then tomorrow will be* | "Tuesday. The weather is not so bad..." | ✗ (correct: Sunday) |
| *The opposite of hot is* | "the cold." | ✓ |
| *The planets of the solar system are:* | "Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune, Pluto..." | ✓ Correct astronomical order |
| *My favorite color is* | "red. It's a color that's been around for a long time..." | ✓ |
| *If 5\*x + 3 = 13, then x is* | "a positive integer. If x is a positive integer, then x is a positive integer..." | ✗ Loop |
| *Photosynthesis is the process by which* | "plants and other organisms convert light energy into chemical energy. It is a complex process that involves the conversion of light energy into chemical energy..." | ✓ Factually correct opener |

**Six of the eight probes above are marked ✓; counting only the seven with a single checkable answer (all but *My favorite color is*), five are correct under greedy decoding.** Repetition is visible: the base model benefits substantially from `--repetition-penalty 1.15` and/or `--top-p 0.9` on longer generations (see the Usage section).

---

## Citation and acknowledgements

Built on top of [karpathy/nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy.
The Stream Mixer architecture is an attention-free experiment swapping the standard
Transformer block for a recurrent linear-time sequence mixer.

Pretraining data is [karpathy/climbmix-400b-shuffle](https://huggingface.co/datasets/karpathy/climbmix-400b-shuffle).
SFT mixture sources: HuggingFaceTB/smol-smoltalk, cais/mmlu, allenai/ai2_arc, openai/gsm8k,
and a custom 1000-conversation identity dataset.

```bibtex
@misc{mnemo2026,
  title={Mnemo: A Linear-Time Recurrent Language Model},
  author={Alvarado, Luis Miguel},
  year={2026},
  note={Built on karpathy/nanochat. Stream Mixer architecture.},
  howpublished={\url{https://github.com/<your-handle>/mnemo}}
}
```

---

## License

MIT. Use freely. No warranty.