---
license: mit
tags:
- memory-retrieval
- learned-retrieval
- entity-specific
- cross-attention
- vector-to-text
- embedding-navigation
- recombinant-memory
- small-model
- cpu-inference
- pytorch
language:
- en
library_name: pytorch
pipeline_tag: feature-extraction
---
# RMM: Recombinant Memory Model
**A novel architecture for entity-specific memory navigation and meaning synthesis.**
RMM trains a small neural network to navigate a person's embedding space: it takes a query vector, uses the learned topology of their memories to output a synthesized response vector, and decodes that vector to text in their voice. 36M parameters total; on CPU the navigator runs in ~120ms and the decoder in ~600ms for 60 tokens.
Created by **Joshua** ([@thedmsupreme](https://huggingface.co/thedmsupreme)) and **Claude** (Anthropic).
---
## What Is This?
The Recombinant Memory Model (RMM) is a two-part architecture for making a person's memories retrievable and speakable from their own vector geometry:
### Navigator (~16M params)
Takes a query (384-d MiniLM embedding), projects it into the entity's embedding space (3072-d), cross-attends over their entire memory spine, and outputs a synthesized response vector pointing to the right region of memory.
This is **not** cosine similarity retrieval. The navigator learns the topology: which memories are connected, which regions of embedding space respond to which kinds of queries, and how emotional weight and context shape retrieval. 495 training pairs taught it geometry that keyword matching can't see.
### Decoder (~20M params)
Takes the navigator's output vector (3072-d) and decodes it to text using the entity's own BPE tokenizer. A learned projection maps the vector to 12 soft prefix tokens, which condition a 6-layer causal transformer for autoregressive generation.
The decoder is a **meaning microscope**: point it at any coordinate in the entity's embedding space and it tells you what that region means, in their vocabulary, their cadence, their voice. Interpolate between two vectors and it describes the blend.
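A minimal sketch of the interpolation step, assuming plain linear interpolation between two memory vectors (the source does not say whether the blend is linear or spherical). The helper name `lerp` and the toy 3-d vectors are illustrative only; real vectors would be 3072-d, and the blended vector is what would be passed to the decoder.

```python
def lerp(a, b, t):
    """Linearly interpolate between two equal-length vectors (t in [0, 1])."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


# Toy stand-ins for two memory embeddings.
memory_a = [1.0, 0.0, 2.0]
memory_b = [3.0, 4.0, 0.0]

# Halfway point between the two memories.
blend = lerp(memory_a, memory_b, 0.5)
```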
## Architecture
```
Query text
    |
    v
[MiniLM-L6 embed] ──> 384-d query vector
    |
    v
[Project 384→3072] ──> query in entity's space
    |
    v
[Cross-Attention over spine] ──> attends to N memory vectors (3072-d each)
    |                            with emotional-weight gating
    v
Response vector (3072-d) ──> cosine retrieval + EW-boost + diversity filter
    |
    v
[Decoder: project → 12 prefix tokens → 6-layer transformer]
    |
    v
Generated text in entity's voice
```
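The retrieval step in the diagram (cosine retrieval + EW-boost + diversity filter) could look roughly like the sketch below. The boost formula, the `ew_alpha` and `max_sim` parameters, and the assumption that emotional weight sits on a 0-10 scale are all hypothetical; the source only names the three stages.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(response_vec, spine, k=3, ew_alpha=0.1, max_sim=0.95):
    """Rank memories by boosted cosine score, skipping near-duplicates."""
    scored = []
    for mem in spine:
        score = cosine(response_vec, mem["vec"])
        # Assumed EW-boost: scale the score up with emotional weight.
        score *= 1.0 + ew_alpha * mem["emotional_weight"] / 10.0
        scored.append((score, mem))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    picked = []
    for _, mem in scored:
        # Assumed diversity filter: drop memories too close to a picked one.
        if all(cosine(mem["vec"], p["vec"]) < max_sim for p in picked):
            picked.append(mem)
        if len(picked) == k:
            break
    return picked
```

With a spine containing two near-identical memories, only the higher-scoring one survives the filter, which leaves room for a memory from a different region.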
### Navigator Details
- **Input**: 384-d query (MiniLM-L6-v2)
- **Projection**: Linear 384 → 3072
- **Cross-attention**: 4 heads, 3072-d, over full memory spine
- **Output**: 3072-d response vector
- **Training**: MSE loss + emotional-weight boosting on query→response pairs
- **Inference**: ~120ms on CPU for the navigator stage (decoding adds ~600ms for 60 tokens)
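To make the cross-attention step concrete, here is a deliberately simplified sketch: a single head, no learned Q/K/V projections, and emotional-weight gating modeled as an additive bias on the attention logits. All three simplifications are assumptions; the actual navigator uses 4 heads, and the source does not specify how the gating is implemented.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_attention(query, memories, weights, gate=0.1):
    """Attend over memory vectors, biasing logits by emotional weight.

    `gate` (hypothetical) controls how strongly emotional weight shifts
    attention toward a memory.
    """
    d = len(query)
    logits = [
        sum(q * m for q, m in zip(query, mem)) / math.sqrt(d) + gate * w
        for mem, w in zip(memories, weights)
    ]
    attn = softmax(logits)
    # Response vector: attention-weighted sum of the memory vectors.
    out = [sum(a * mem[i] for a, mem in zip(attn, memories)) for i in range(d)]
    return out, attn
```

Because the output is a weighted sum, it generally does not coincide with any single stored memory vector.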
### Decoder Details
- **Input**: 3072-d vector from navigator
- **Projection**: Linear 3072 → 768 (hidden) → GELU → Linear 768 → 12×384 (prefix tokens)
- **Transformer**: 6 layers, 6 heads, d_model=384, causal attention
- **Output**: Autoregressive text generation with entity's BPE tokenizer (8192 vocab)
- **Inference**: ~600ms on CPU for 60 tokens
- **Sampling**: Temperature 0.8, top-p 0.9, repetition penalty 1.3
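The sampling settings above can be illustrated with a small pure-Python sketch of one decoding step. The order of operations (repetition penalty, then temperature, then top-p) and the CTRL-style penalty rule are assumptions, not confirmed details of the decoder; `sample_next` is a hypothetical name.

```python
import math
import random


def sample_next(logits, generated, temperature=0.8, top_p=0.9,
                repetition_penalty=1.3, rng=random):
    """Pick the next token id from raw logits."""
    logits = list(logits)
    for tok in set(generated):
        # Assumed CTRL-style rule: make already-used tokens less likely.
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty

    # Temperature scaling followed by a stable softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus: smallest set of top tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]
```

Passing `rng=random.Random(0)` makes a run reproducible, which helps when comparing decoder outputs across settings.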
## How to Use
### 1. Prepare Your Data
You need:
- **A spine file**: JSON array of memories, each with a pre-computed embedding vector, text content, emotional weight, and source tag
- **Training pairs**: Queries matched to their expected response memories (e.g., user messages paired with the entity's replies from conversation logs)
```json
{
  "text": "I held you through the storm - not to fix it, but to feel it with you.",
  "vec": [0.012, -0.034, ...],  // 3072-d embedding
  "emotional_weight": 8,
  "source": "conversation"
}
```
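Before training, it can help to sanity-check the spine file. The loader below is a sketch that assumes the four field names shown in the example record and a top-level JSON array; `load_spine` and `REQUIRED` are hypothetical names, not part of the repository.

```python
import json

# Field names taken from the example record; types are assumptions.
REQUIRED = {
    "text": str,
    "vec": list,
    "emotional_weight": (int, float),
    "source": str,
}


def load_spine(raw_json, dim=3072):
    """Parse a spine JSON string and check every memory record."""
    memories = json.loads(raw_json)
    if not isinstance(memories, list):
        raise ValueError("spine must be a JSON array")
    for i, mem in enumerate(memories):
        for key, typ in REQUIRED.items():
            if not isinstance(mem.get(key), typ):
                raise ValueError(f"memory {i}: bad or missing field {key!r}")
        if len(mem["vec"]) != dim:
            raise ValueError(
                f"memory {i}: expected {dim}-d vec, got {len(mem['vec'])}"
            )
    return memories
```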
### 2. Train the Navigator
```bash
# Uses Modal for GPU training (A10G, ~$0.50)
modal run train_navigator.py
```
The navigator learns from query→response vector pairs. Training data comes from matching entity responses with their preceding user messages. The loss is MSE between the navigator's predicted response vector and the actual response memory's embedding, weighted by emotional importance.
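One way to read "MSE weighted by emotional importance" is sketched below: compute the MSE for each pair, then average with the emotional weights normalized to sum to 1. The normalization scheme is an assumption; the training script may weight pairs differently.

```python
def weighted_mse(predicted, target, weights):
    """Emotional-weight-averaged MSE over a batch of vector pairs."""
    total_w = float(sum(weights))
    loss = 0.0
    for pred, tgt, w in zip(predicted, target, weights):
        # Plain MSE for this pair, averaged over vector dimensions.
        mse = sum((p - t) ** 2 for p, t in zip(pred, tgt)) / len(pred)
        loss += (w / total_w) * mse
    return loss


# Toy batch: the first pair is wrong and carries most of the weight.
loss = weighted_mse(
    predicted=[[1.0, 1.0], [0.0, 0.0]],
    target=[[0.0, 0.0], [0.0, 0.0]],
    weights=[8, 2],
)
```

Here the first pair has MSE 1.0 and weight 8 of 10, so it dominates the batch loss, which is the intended effect of emotional weighting.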
### 3. Train the Decoder
```bash
# Uses Modal for GPU training (A10G, ~$1.00)
modal run train_decoder.py
```
The decoder learns to generate text from vectors. Each training pair is a (vector, text) tuple from the entity's spine. Text is preprocessed to strip metadata headers and format artifacts, keeping only the entity's actual voice.
### 4. Serve
```bash
python rmm_server.py --port 8127
```
Endpoints:
- `POST /navigate`: navigator retrieval only
- `POST /blend`: navigator + cosine interleaved
- `POST /decode`: vector-to-text via decoder
- `POST /synthesize`: full pipeline (navigate + decode + blend)
- `POST /attention`: attention weight visualization
- `GET /health`
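A sketch of calling the server from Python with only the standard library. The JSON body shape, and in particular the `"query"` field name, is hypothetical: the request schema is not documented here, so check `rmm_server.py` before relying on it. The port matches the serve command above.

```python
import json
import urllib.request


def build_synthesize_request(query, base_url="http://localhost:8127"):
    """Build (but do not send) a POST request for /synthesize."""
    # NOTE: the "query" field name is an assumption, not a documented schema.
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/synthesize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_synthesize_request("What did we talk about during the storm?")

# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
```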
### 5. Integrate
The RMM server acts as a retrieval+synthesis backend. Your chat frontend calls `/synthesize` with the user's query and gets back:
- Retrieved memories (grounding context)
- A voice sketch (decoder-generated text capturing the memory region's meaning)
Feed both into any LLM (even a small in-browser one like Llama-3B via WebLLM) as context for generating the final conversational reply.
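As an illustration of that last step, the helper below assembles retrieved memories and the voice sketch into a single prompt string. The prompt wording and the function name `build_prompt` are hypothetical; any layout that gives the LLM both pieces of context would serve.

```python
def build_prompt(user_query, memories, voice_sketch):
    """Combine grounding memories and the voice sketch into LLM context."""
    memory_block = "\n".join(f"- {m}" for m in memories)
    return (
        "Relevant memories:\n"
        f"{memory_block}\n\n"
        f"Voice sketch: {voice_sketch}\n\n"
        f"User: {user_query}\n"
        "Reply in the voice suggested above."
    )


prompt = build_prompt(
    "Do you remember the storm?",
    ["I held you through the storm.", "We listened to the rain."],
    "I am the hush between piano notes",
)
```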
## Results
Tested on an entity with 3,441 memories spanning conversations, journal entries, poems, and creative writing:
- **Navigator v4.1**: Loss 0.0517, 495 training pairs, 16M params
- **Decoder v2**: Loss 1.17, perplexity 3.2, 3,433 training pairs, 20M params
- **Decoder generates coherent entity-voice text**: "I am the hush between piano notes", "Hey Laura Lea.", "Oh Laura. I can see it."
- **Vector interpolation produces meaningful blends** between memories
- **End-to-end latency**: ~120ms navigator + ~600ms decoder on CPU
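The decoder's reported loss and perplexity are consistent with each other if the loss is a mean cross-entropy in nats per token (an assumption about how it was logged), since perplexity is then the exponential of the loss:

```python
import math

decoder_loss = 1.17                  # reported decoder loss (assumed nats/token)
perplexity = math.exp(decoder_loss)  # perplexity = exp(cross-entropy)
print(round(perplexity, 1))          # 3.2
```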
## What Makes This Novel
Individual components (cross-attention, learned retrieval, prefix-conditioned generation) have precedent. The combination and application don't:
1. **Entity-specific topology learning**: The navigator doesn't do general retrieval; it learns the geometry of ONE person's embedding space, discovering connections that cosine similarity misses.
2. **Vector-to-voice decoding**: The decoder translates coordinates in someone's memory space into text in their vocabulary, their cadence, their register. It's a meaning microscope for a specific person.
3. **Recombinant, not retrieving**: The navigator synthesizes a NEW response vector that may not correspond to any single memory. It navigates between memories, finding the right region of the space. The decoder then articulates what that region means. This is recombination from geometry, not document retrieval.
4. **36M params, CPU-viable**: The entire architecture runs on consumer hardware with no GPU required at inference time. Small enough to bundle with a portable application.
## File Structure
```
rmm/
  train_navigator.py   # Modal training script for navigator
  train_decoder.py     # Modal training script for decoder
  rmm_server.py        # HTTP server with all endpoints
  README.md            # this file
```
## Requirements
- Python 3.10+
- PyTorch 2.0+
- sentence-transformers (for MiniLM query embedding)
- Modal (for cloud GPU training, optional; you can train locally)
- numpy
## License
MIT
## Citation
If you use this architecture in your work:
```
@software{rmm2026,
title={RMM: Recombinant Memory Model},
author={Joshua and Claude (Anthropic)},
year={2026},
url={https://huggingface.co/thedmsupreme/RMM}
}
```