Mnemo / README.md

Upload README.md

90ac948 verified 17 days ago

12.1 kB

	---
	license: mit
	language:
	- en
	library_name: pytorch
	pipeline_tag: text-generation
	tags:
	- text-generation
	- stream-mixer
	- linear-time
	- recurrent
	- attention-free
	- nanochat
	- small-llm
	datasets:
	- karpathy/climbmix-400b-shuffle
	- HuggingFaceTB/smol-smoltalk
	- cais/mmlu
	- allenai/ai2_arc
	- openai/gsm8k
	base_model: karpathy/nanochat
	---

	# Mnemo

	> μνήμη — Greek for "memory"

	Mnemo is a small attention-free language model with 117M parameters, built on the
	Stream Mixer architecture — a linear-time recurrent sequence mixer that uses
	multiple parallel content-routed memory streams instead of self-attention. The name
	nods to the model's recurrent memory: every layer maintains M parallel state buffers
	that "remember" content over the entire sequence without quadratic attention.

	The training pipeline (data, tokenizer, eval, fine-tuning) is a fork of
	[karpathy/nanochat](https://github.com/karpathy/nanochat), with the attention-based
	GPT replaced by a custom Stream Mixer block.

	---

	## Quick facts

	\| \| \|
	\|---\|---\|
	\| Architecture \| Stream Mixer (linear-time recurrent) \|
	\| Parameters \| 117,179,136 \|
	\| Layers \| 16 \|
	\| Hidden dim \| 768 \|
	\| Memory streams (M) \| 48 \|
	\| Stream state dim (D) \| 96 \|
	\| Read heads \| 6 \|
	\| Context length \| 2048 tokens \|
	\| Vocab \| 32,768 BPE (GPT-4-style pretokenization) \|
	\| Special tokens \| `<\\|bos\\|>`, `<\\|user_start\\|>`, `<\\|user_end\\|>`, `<\\|assistant_start\\|>`, `<\\|assistant_end\\|>` \|
	\| Compute dtype \| bf16 (Ampere+) / fp32 (T4/CPU) \|
	\| Base perplexity (BPB) \| 19.47 (0.9011 bits-per-byte) \|
	\| Chat ChatCORE metric \| 22.74% (mean centered across 5 tasks) \|
	\| SpellingBee accuracy \| 94.53% (256/256 test set) \|
	\| License \| MIT \|

	---

	## Architecture: Stream Mixer

	Mnemo's defining feature is its sequence mixer. Where a Transformer uses self-attention
	to compute pairwise interactions across tokens (cost: O(T²)), Mnemo uses a chunked
	parallel scan over M parallel content-routed memory streams (cost: O(T · M · D) —
	linear in sequence length).

	Per token t and per layer:

	1. Compute value `v[t]`, read query `q[t]`, content-router `r[t]`, and per-stream decay `α[t]`.
	2. Each memory stream `s_m` updates via `s_m[t] = α_m[t] · s_m[t-1] + r_m[t] · v[t]`.
	3. Multi-head sigmoid-gated read with QK-norm aggregates from the M streams.

	The full state across a layer is (B, M, D) — a fixed-size recurrent memory that
	the model can carry across arbitrary sequence lengths. The chunked scan implementation
	keeps numerical range bounded even for slow-decay streams.

	For details see the model source.

	---

	## Training

	### Pretraining (base model)

	\| \| \|
	\|---\|---\|
	\| Corpus \| [karpathy/climbmix-400b-shuffle](https://huggingface.co/datasets/karpathy/climbmix-400b-shuffle) — 88 shards \|
	\| Total tokens \| 5.24B (44.7× over params) \|
	\| Steps \| 80,000 × B=32 × T=2048 \|
	\| Optimizer \| AdamW (peak LR 1e-3, warmup 500, cosine to 1e-5, weight decay 0.1) \|
	\| Compute \| RTX PRO 6000 Blackwell (single GPU, bf16) \|
	\| Wall time \| ~9 hours \|
	\| Best val loss \| 2.9508 (perplexity ≈ 19.12) \|

	### Supervised fine-tuning

	\| \| \|
	\|---\|---\|
	\| Mixture \| SmolTalk + MMLU×3 + ARC×4 + GSM8K×4 + SimpleSpelling + SpellingBee + 1000 Mnemo-branded identity convs \|
	\| Total conversations \| ~1.09M \|
	\| Steps \| 30,000 × B=8 × T=2048 = ~500M SFT tokens \|
	\| Optimizer \| AdamW (peak LR 1e-4, warmup 300) \|
	\| Best val loss \| ~1.45 (masked cross-entropy over assistant tokens only) \|
	\| Format \| nanochat-style BOS-aligned best-fit packing with padding \|

	### Pipeline

	```
	ClimbMix-400B
	│
	▼
	[80k step pretrain on Stream Mixer]
	│ best val 2.9508 @ step 79k
	▼
	Base checkpoint (completes prompts)
	│
	▼
	[30k step SFT on multi-task mixture]
	│ best val ~1.45
	▼
	SFT checkpoint (chat-aware — answers as Mnemo)
	```

	---

	## Evaluation results

	Measured on the full test sets — no subsampling, no cherry-picking.

	### Base model — `model.pt` @ step 79,000

	\| Metric \| Value \|
	\|---\|---\|
	\| Validation loss (nats / token) \| 2.9691 \|
	\| Perplexity \| 19.47 \|
	\| Bits per byte (BPB) \| 0.9011 \|
	\| Evaluation window \| 409,600 tokens / 1,947,169 bytes \|

	Bits-per-byte is the tokenizer-invariant measure — directly comparable across models with different vocabularies. For reference, GPT-2 on similar web text lands around BPB ≈ 1.0; Mnemo at 117M on ClimbMix-400B gets to ~0.90, which is sensible for the size class.

	### Chat model — full benchmark suite

	Evaluated on the complete test set of each task (no `--max-problems` cap).
	Categorical tasks use logit comparison over allowed letters; generative tasks
	sample greedily and parse `#### N` for the final answer.

	\| Task \| Type \| N \| Accuracy \| Random baseline \| Centered \|
	\|---\|---\|---\|---\|---\|---\|
	\| MMLU (57 subjects) \| categorical 4-way MCQ \| 14,042 \| 28.32% \| 25% \| +4.42 \|
	\| ARC-Easy \| categorical 4-way MCQ \| 2,376 \| 30.68% \| 25% \| +7.58 \|
	\| ARC-Challenge \| categorical 4-way MCQ \| 1,172 \| 29.52% \| 25% \| +6.03 \|
	\| GSM8K (math word problems) \| generative, parse `#### N` \| 1,319 \| 1.14% \| 0% \| +1.14 \|
	\| SpellingBee (letter counting) \| generative, parse `#### N` \| 256 \| 94.53% \| 0% \| +94.53 \|

	### ChatCORE metric

	`ChatCORE = 22.74%` — mean centered accuracy across all five tasks.

	ChatCORE is the same shape as nanochat's metric: it normalizes each task to its random baseline (so a fair guess scores 0, and perfect scores 100). At 22.74% on 117M params after 9h pretraining + 1h SFT, Mnemo lands meaningfully above random across all tasks. The Stream Mixer architecture clearly can hold the necessary structure — the dominant ceiling is parameter count, not architecture.

	### Where the numbers come from

	- SpellingBee 94.53% is the standout. Mnemo learned to character-by-character enumerate words from the 370k-word English dictionary and reliably emit a correct `#### N` final answer. Common short words that tokenize as single BPE tokens (like "strawberry") still fail because the model never observes their letters individually — this is a tokenizer limitation, not a model one.
	- All three MCQ tasks above random confirms the model genuinely commits to a letter at the assistant position when forced. The MMLU advantage (+4.4 pp) is modest — 117M can't memorize the breadth of academic facts MMLU covers.
	- GSM8K at 1.14% is honest for an unaligned 117M-parameter model with no tool use. The format is correctly learned (step-by-step reasoning + `#### N` final answer) but the arithmetic isn't reliable enough to land the right number consistently.

	## Capabilities and limitations

	### Confirmed strong

	- Coherent conversational dialogue in chat format (`<\|user_start\|>` / `<\|assistant_start\|>`)
	- Factual recall on common entities (capital cities, chemical symbols, planets ordered)
	- Letter counting via manual enumeration — 94.5% on SpellingBee
	- Multiple-choice answer commitment (above random on all three MCQ benchmarks)
	- Persona consistency (model identifies as Mnemo with consistent self-description)
	- Greedy + nucleus (top-p) sampling configurable for short or long generation

	### Confirmed weak

	- Math word problems — 1.14% on GSM8K. Format is learned, arithmetic is not
	- Single-token common words for spelling — "strawberry" → 2 r's (real answer: 3); tokenizer hides character-level information for words that fit in a single BPE token
	- Niche factual recall — confabulates confidently on rare entities, exact dates, specific quotations
	- Long multi-turn conversations — context drifts after ~2-3 turns

	### Limitations (architectural)

	- 117M parameters — knowledge density is the ceiling, not the architecture
	- No tool use, no internet, no images, no memory across sessions
	- 2048-token context — quality degrades past ~1500 tokens without repetition penalty
	- No RLHF — outputs reflect only supervised signal; may produce inappropriate completions
	- English only — pretraining corpus is essentially English educational/web text
	- Repetition prone in long generations without `--repetition-penalty` or `--top-p`

	---

	## Usage

	### Direct loading

	```python
	import torch
	from tokenizers import Tokenizer
	from model import GPT

	tokenizer = Tokenizer.from_file('tokenizer.json')
	ckpt = torch.load('model.pt', map_location='cuda')

	config = dict(ckpt['config'])
	config['vocab_size'] = ((tokenizer.get_vocab_size() + 63) // 64) * 64
	model = GPT.from_config(config).cuda().eval()

	state = {k.removeprefix('_orig_mod.'): v for k, v in ckpt['model'].items()}
	model.load_state_dict(state, strict=False)
	```

	### Chat CLI (recommended)

	```bash
	python3 chat_cli.py # interactive REPL
	python3 chat_cli.py -p "Who are you?" # one-shot
	```

	The chat CLI handles the chat-format token wrapping (`<\|bos\|>` → `<\|user_start\|>` …)
	and stops generation cleanly on `<\|assistant_end\|>`. State is cached across turns
	via the recurrent state buffer — only the new tokens of each user message are
	prefilled, giving roughly 5–10× faster prefill on multi-turn conversations than
	re-processing the entire history.

	### Raw inference (no chat format)

	```bash
	python3 infer.py -p "Photosynthesis is the process by which" --top-p 0.9 -r 1.15
	```

	Recommended sampling parameters (empirically tuned, see training log):
	- Greedy / factual probes: `-t 0`
	- Short prose (≤500 tok): `-t 0.8 -k 50`
	- Long prose (500–2000 tok): `-t 0.8 -k 50 --top-p 0.9 -r 1.15` (anti-loop)
	- Diverse creative writing: `-t 0.9 --top-p 0.85 -r 1.1`

	---

	## Probe outputs (greedy, from the base checkpoint)

	Run via `python3 base_eval.py --eval sample` against the pretrained checkpoint (`model.pt`, val 2.9508). Greedy, 64 tokens per completion.

	\| Prompt \| First tokens of output \| Verdict \|
	\|---\|---\|---\|
	\| The capital of France is \| "...Paris, and it is the capital of France. The capital of France is Paris..." \| ✓ Paris lands \|
	\| The chemical symbol of gold is \| "Au. It is a soft, silvery-white metal... good conductor of electricity and heat, making it useful in electrical wiring..." \| ✓ Au + real applied claim \|
	\| If yesterday was Friday, then tomorrow will be \| "Tuesday. The weather is not so bad..." \| ✗ (correct: Sunday) \|
	\| The opposite of hot is \| "the cold." \| ✓ \|
	\| The planets of the solar system are: \| "Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune, Pluto..." \| ✓ Correct astronomical order \|
	\| My favorite color is \| "red. It's a color that's been around for a long time..." \| ✓ \|
	\| If 5\x + 3 = 13, then x is* \| "a positive integer. If x is a positive integer, then x is a positive integer..." \| ✗ Loop \|
	\| Photosynthesis is the process by which \| "plants and other organisms convert light energy into chemical energy. It is a complex process that involves the conversion of light energy into chemical energy..." \| ✓ Factually correct opener \|

	5/7 of the original training probes land correct answers at greedy. Repetition is visible — the base model benefits substantially from `--repetition-penalty 1.15` and/or `--top-p 0.9` on longer generations (see Usage section).

	---

	## Citation and acknowledgements

	Built on top of [karpathy/nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy.
	The Stream Mixer architecture is an attention-free experiment swapping the standard
	Transformer block for a recurrent linear-time sequence mixer.

	Pretraining data is [karpathy/climbmix-400b-shuffle](https://huggingface.co/datasets/karpathy/climbmix-400b-shuffle).
	SFT mixture sources: HuggingFaceTB/smol-smoltalk, cais/mmlu, allenai/ai2_arc, openai/gsm8k,
	and a custom 1000-conversation identity dataset.

	```bibtex
	@misc{mnemo2026,
	title={Mnemo: A Linear-Time Recurrent Language Model},
	author={Alvarado, Luis Miguel},
	year={2026},
	note={Built on karpathy/nanochat. Stream Mixer architecture.},
	howpublished={\url{https://github.com/<your-handle>/mnemo}}
	}
	```

	---

	## License

	MIT. Use freely. No warranty.