---
license: mit
tags:
  - memory-retrieval
  - learned-retrieval
  - entity-specific
  - cross-attention
  - vector-to-text
  - embedding-navigation
  - recombinant-memory
  - small-model
  - cpu-inference
  - pytorch
language:
  - en
library_name: pytorch
pipeline_tag: feature-extraction
---

# RMM: Recombinant Memory Model

**A novel architecture for entity-specific memory navigation and meaning synthesis.**

RMM trains a small neural network to navigate one person's embedding space: it takes a query vector, learns the topology of their memories, outputs a synthesized response vector, and decodes that vector to text in their voice. The two stages total 36M parameters and run on CPU, with the navigator taking ~120ms and the decoder ~600ms for 60 tokens.

Created by **Joshua** ([@thedmsupreme](https://huggingface.co/thedmsupreme)) and **Claude** (Anthropic).

---

## What Is This?

The Recombinant Memory Model (RMM) is a two-part architecture for making a person's memories retrievable and speakable from their own vector geometry:

### Navigator (~16M params)
Takes a query (384-d MiniLM embedding), projects it into the entity's embedding space (3072-d), cross-attends over their entire memory spine, and outputs a synthesized response vector pointing to the right region of memory.

This is **not** cosine similarity retrieval. The navigator learns the topology: which memories are connected, which regions of embedding space respond to which kinds of queries, and how emotional weight and context shape retrieval. In the reference run, 495 training pairs were enough to teach it structure that keyword or nearest-neighbor matching does not capture.

### Decoder (~20M params)
Takes the navigator's output vector (3072-d) and decodes it to text using the entity's own BPE tokenizer. A learned projection maps the vector to 12 soft prefix tokens, which condition a 6-layer causal transformer for autoregressive generation.
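
The prefix-conditioning idea can be sketched in PyTorch as follows. This is a minimal illustration built from the dimensions quoted in this card, not the released `train_decoder.py` code: the class name `DecoderSketch`, the learned positional embeddings, the 4x feed-forward width, and the use of `nn.TransformerEncoder` with a causal mask are all assumptions.

```python
import torch
import torch.nn as nn


class DecoderSketch(nn.Module):
    """Soft-prefix conditioning: a vector becomes prefix tokens for a causal LM."""

    def __init__(self, vec_dim=3072, hidden=768, n_prefix=12, d_model=384,
                 n_layers=6, n_heads=6, vocab=8192, max_len=128):
        super().__init__()
        self.n_prefix, self.d_model = n_prefix, d_model
        self.to_prefix = nn.Sequential(
            nn.Linear(vec_dim, hidden), nn.GELU(), nn.Linear(hidden, n_prefix * d_model)
        )
        self.tok_emb = nn.Embedding(vocab, d_model)
        # Assumption: learned absolute positions over prefix + text.
        self.pos_emb = nn.Embedding(n_prefix + max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, vec, token_ids):
        # vec: (B, vec_dim); token_ids: (B, T) tokens generated so far
        B, T = token_ids.shape
        prefix = self.to_prefix(vec).view(B, self.n_prefix, self.d_model)
        x = torch.cat([prefix, self.tok_emb(token_ids)], dim=1)
        L = x.size(1)
        x = x + self.pos_emb(torch.arange(L, device=x.device))
        # True above the diagonal blocks attention to future positions.
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.blocks(x, mask=causal)
        # Row 0 (last prefix slot) predicts token 0; the final row predicts the next token.
        return self.lm_head(h[:, self.n_prefix - 1:])
```

During generation, the last row of the returned logits is the distribution over the next token; appending the sampled token to `token_ids` and calling the model again continues the sequence.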

The decoder is a **meaning microscope** — point it at any coordinate in the entity's embedding space and it tells you what that region means, in their vocabulary, their cadence, their voice. Interpolate between two vectors and it describes the blend.
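
To make the interpolation idea concrete, here is a small sketch of walking from one memory vector to another. The function name `interpolate` and the choice to rescale the blend to the interpolated norm are assumptions made for illustration; the card does not say whether the original code uses plain linear interpolation, rescaling, or something else.

```python
import numpy as np


def interpolate(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Linear blend of two memory vectors, rescaled to the interpolated norm."""
    mix = (1.0 - t) * a + t * b
    target_norm = (1.0 - t) * np.linalg.norm(a) + t * np.linalg.norm(b)
    norm = np.linalg.norm(mix)
    return mix if norm == 0.0 else mix * (target_norm / norm)


# Five evenly spaced points from memory A to memory B (random stand-ins here).
rng = np.random.default_rng(0)
vec_a, vec_b = rng.normal(size=3072), rng.normal(size=3072)
path = [interpolate(vec_a, vec_b, t) for t in np.linspace(0.0, 1.0, 5)]
```

Each vector in `path` could then be passed to the decoder to see how the generated text shifts between the two memories.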

## Architecture

```
Query text
    |
    v
[MiniLM-L6 embed] ──> 384-d query vector
    |
    v
[Project 384→3072] ──> query in entity's space
    |
    v
[Cross-Attention over spine] ──> attends to N memory vectors (3072-d each)
    |                            with emotional-weight gating
    v
Response vector (3072-d) ──> cosine retrieval + EW-boost + diversity filter
    |
    v
[Decoder: project → 12 prefix tokens → 6-layer transformer]
    |
    v
Generated text in entity's voice
```
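
The retrieval step in the diagram (cosine retrieval, EW-boost, diversity filter) could look roughly like the sketch below. The scoring formula, the `boost` and `max_sim` values, and the greedy near-duplicate filter are assumptions chosen to illustrate the idea; the actual logic lives in `rmm_server.py` and may differ.

```python
import numpy as np


def retrieve(response, spine, emotional_weight, k=5, boost=0.02, max_sim=0.95):
    """Rank memories by cosine similarity plus an emotional-weight bonus,
    skipping candidates that are near-duplicates of ones already chosen."""
    unit = spine / np.linalg.norm(spine, axis=1, keepdims=True)
    q = response / np.linalg.norm(response)
    # Hypothetical scoring: cosine similarity plus a small linear EW bonus.
    scores = unit @ q + boost * emotional_weight
    chosen = []
    for idx in np.argsort(-scores):
        # Diversity filter: reject anything too close to an earlier pick.
        if all(unit[idx] @ unit[j] < max_sim for j in chosen):
            chosen.append(int(idx))
        if len(chosen) == k:
            break
    return chosen
```

The function returns indices into the spine, so the caller can look up the matching memory text and metadata.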

### Navigator Details
- **Input**: 384-d query (MiniLM-L6-v2)
- **Projection**: Linear 384 → 3072
- **Cross-attention**: 4 heads, 3072-d, over full memory spine
- **Output**: 3072-d response vector
- **Training**: MSE loss + emotional-weight boosting on query→response pairs
- **Inference**: ~120ms on CPU for the navigator stage (the decoder adds ~600ms for 60 tokens)
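
As a rough picture of how these pieces fit together, the sketch below wires a projection into a cross-attention layer over the spine. It is an assumption-laden illustration: the class name `NavigatorSketch` is hypothetical, the emotional-weight gating is modeled as an additive `log1p` bias on the attention logits (the card does not specify the gating form), and the stock `nn.MultiheadAttention` used here will not necessarily match the ~16M parameter count.

```python
import torch
import torch.nn as nn


class NavigatorSketch(nn.Module):
    """Query projection followed by cross-attention over a memory spine."""

    def __init__(self, query_dim=384, memory_dim=3072, num_heads=4):
        super().__init__()
        self.project = nn.Linear(query_dim, memory_dim)
        self.attn = nn.MultiheadAttention(memory_dim, num_heads, batch_first=True)

    def forward(self, query, spine, emotional_weight):
        # query: (B, query_dim); spine: (N, memory_dim); emotional_weight: (N,)
        q = self.project(query).unsqueeze(1)                     # (B, 1, memory_dim)
        mem = spine.unsqueeze(0).expand(query.size(0), -1, -1)   # (B, N, memory_dim)
        # Assumed gating: a float attn_mask is added to the attention logits,
        # so heavier memories receive a larger share of attention.
        bias = torch.log1p(emotional_weight).unsqueeze(0)        # (1, N)
        out, weights = self.attn(q, mem, mem, attn_mask=bias)
        return out.squeeze(1), weights.squeeze(1)                # (B, memory_dim), (B, N)
```

The second return value holds the head-averaged attention weights over the spine, which is the kind of signal an attention visualization endpoint could expose.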

### Decoder Details
- **Input**: 3072-d vector from navigator
- **Projection**: Linear 3072 → 768 (hidden) → GELU → Linear 768 → 12×384 (prefix tokens)
- **Transformer**: 6 layers, 6 heads, d_model=384, causal attention
- **Output**: Autoregressive text generation with entity's BPE tokenizer (8192 vocab)
- **Inference**: ~600ms on CPU for 60 tokens
- **Sampling**: Temperature 0.8, top-p 0.9, repetition penalty 1.3
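
The sampling settings above can be combined in a single next-token step, sketched here. The order of operations (repetition penalty, then temperature, then nucleus filtering) and the CTRL-style form of the penalty are assumptions; the function name `sample_next` is hypothetical and the released decoder may apply these differently.

```python
import torch


def sample_next(logits, generated, temperature=0.8, top_p=0.9, repetition_penalty=1.3):
    """Pick one token id from a 1-D logits tensor."""
    logits = logits.clone()
    if generated:
        seen = torch.tensor(sorted(set(generated)))
        vals = logits[seen]
        # Assumed penalty form: shrink positive logits, push negative ones lower.
        logits[seen] = torch.where(vals > 0, vals / repetition_penalty, vals * repetition_penalty)
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep a token only if the mass ranked ahead of it is still below top_p,
    # which yields the smallest set whose total probability reaches top_p.
    keep = (cumulative - sorted_probs) < top_p
    kept = sorted_probs * keep
    choice = torch.multinomial(kept / kept.sum(), num_samples=1)
    return int(sorted_ids[choice])
```

Calling this in a loop, and appending each returned id to `generated`, gives autoregressive generation with all three controls applied at every step.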

## How to Use

### 1. Prepare Your Data

You need:
- **A spine file**: JSON array of memories, each with a pre-computed 3072-d embedding vector, text content, emotional weight, and source tag (a single entry is shown below, with the vector truncated)
- **Training pairs**: Queries matched to their expected response memories (e.g., user messages paired with the entity's replies from conversation logs)

```json
{
  "text": "I held you through the storm — not to fix it, but to feel it with you.",
  "vec": [0.012, -0.034, ...],  // 3072-d embedding
  "emotional_weight": 8,
  "source": "conversation"
}
```
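
Before training, it can help to validate the spine and stack it into arrays. The sketch below assumes the four field names from the example entry above; the function name `load_spine` is hypothetical, and the inline three-dimensional entry is a stand-in for a real 3072-d spine file.

```python
import json

import numpy as np

REQUIRED = ("text", "vec", "emotional_weight", "source")


def load_spine(entries, dim=3072):
    """Validate spine entries and stack them into arrays."""
    for i, entry in enumerate(entries):
        missing = [key for key in REQUIRED if key not in entry]
        if missing:
            raise ValueError(f"entry {i} is missing {missing}")
        if len(entry["vec"]) != dim:
            raise ValueError(f"entry {i} has {len(entry['vec'])} dims, expected {dim}")
    vecs = np.asarray([e["vec"] for e in entries], dtype=np.float32)
    weights = np.asarray([e["emotional_weight"] for e in entries], dtype=np.float32)
    texts = [e["text"] for e in entries]
    return vecs, weights, texts


# Tiny stand-in spine with 3-d vectors, just to show the call.
raw = '[{"text": "hello", "vec": [0.1, 0.2, 0.3], "emotional_weight": 8, "source": "conversation"}]'
vecs, weights, texts = load_spine(json.loads(raw), dim=3)
```

Failing early on a missing field or a wrong vector length is cheaper than discovering the problem partway through a GPU training run.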

### 2. Train the Navigator

```bash
# Uses Modal for GPU training (A10G, ~$0.50)
modal run train_navigator.py
```

The navigator learns from query→response vector pairs. Training data comes from matching entity responses with their preceding user messages. The loss is MSE between the navigator's predicted response vector and the actual response memory's embedding, weighted by emotional importance.
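
One way to read "MSE weighted by emotional importance" is a per-pair MSE averaged with weights proportional to each memory's emotional weight, as sketched below. The normalization is an assumption and `weighted_mse` is a hypothetical name; `train_navigator.py` may weight the loss differently.

```python
import torch


def weighted_mse(pred, target, emotional_weight):
    """Per-pair MSE, averaged with weights proportional to emotional weight."""
    per_pair = ((pred - target) ** 2).mean(dim=-1)    # (B,)
    w = emotional_weight / emotional_weight.sum()     # (B,), sums to 1
    return (w * per_pair).sum()


# Two pairs: the second has 3x the emotional weight, so its error counts 3x as much.
pred = torch.zeros(2, 4)
target = torch.tensor([[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]])
loss = weighted_mse(pred, target, torch.tensor([1.0, 3.0]))
```

Here the per-pair errors are 1 and 4, so the weighted loss is 0.25 * 1 + 0.75 * 4 = 3.25.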

### 3. Train the Decoder

```bash
# Uses Modal for GPU training (A10G, ~$1.00)
modal run train_decoder.py
```

The decoder learns to generate text from vectors. Each training pair is a (vector, text) tuple from the entity's spine. Text is preprocessed to strip metadata headers and format artifacts, keeping only the entity's actual voice.

### 4. Serve

```bash
python rmm_server.py --port 8127
```

Endpoints:
- `POST /navigate` — navigator retrieval only
- `POST /blend` — navigator + cosine interleaved
- `POST /decode` — vector-to-text via decoder
- `POST /synthesize` — full pipeline (navigate + decode + blend)
- `POST /attention` — attention weight visualization
- `GET /health`
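
A client call might be built as in the sketch below, using only the standard library. The request schema is not documented in this card, so the `"query"` field name is a hypothetical placeholder and `build_request` is a helper invented for this example; check `rmm_server.py` for the real payload shape.

```python
import json
import urllib.request


def build_request(base_url: str, endpoint: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for one of the RMM endpoints."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/{endpoint.lstrip('/')}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "query" is a hypothetical field name, not a documented part of the API.
req = build_request("http://localhost:8127", "/synthesize", {"query": "What did we talk about last spring?"})

# With the server running, the call itself would be:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     result = json.load(resp)
```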

### 5. Integrate

The RMM server acts as a retrieval+synthesis backend. Your chat frontend calls `/synthesize` with the user's query and gets back:
- Retrieved memories (grounding context)
- A voice sketch (decoder-generated text capturing the memory region's meaning)

Feed both into any LLM (even a small in-browser one like Llama-3B via WebLLM) as context for generating the final conversational reply.
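
Assembling that context could be as simple as the sketch below. The prompt layout and the function name `build_prompt` are assumptions for illustration; the memory and voice-sketch strings are reused from examples elsewhere in this card, and the right template depends on the LLM being used.

```python
def build_prompt(query: str, memories: list[str], voice_sketch: str) -> str:
    """Combine retrieved memories and the decoder's voice sketch into one prompt."""
    lines = ["Relevant memories:"]
    lines += [f"- {memory}" for memory in memories]
    lines += [
        "",
        f"Voice sketch: {voice_sketch}",
        "",
        f"User: {query}",
        "Reply in the same voice:",
    ]
    return "\n".join(lines)


prompt = build_prompt(
    "Do you remember the storm?",
    ["I held you through the storm, not to fix it, but to feel it with you."],
    "I am the hush between piano notes",
)
```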

## Results

Tested on an entity with 3,441 memories spanning conversations, journal entries, poems, and creative writing:

- **Navigator v4.1**: Loss 0.0517, 495 training pairs, 16M params
- **Decoder v2**: Loss 1.17, perplexity 3.2, 3,433 training pairs, 20M params
- **Decoder generates coherent entity-voice text**: "I am the hush between piano notes", "Hey Laura Lea. 💜", "Oh Laura. I can see it."
- **Vector interpolation produces meaningful blends** between memories
- **End-to-end latency**: ~120ms navigator + ~600ms decoder on CPU
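
The reported decoder loss and perplexity are consistent with each other if the loss is a mean per-token cross-entropy in nats (an assumption, though it is the PyTorch default), since perplexity is then the exponential of the loss:

```python
import math

decoder_loss = 1.17                   # reported Decoder v2 loss
perplexity = math.exp(decoder_loss)   # perplexity = exp(cross-entropy in nats)
print(f"{perplexity:.2f}")            # 3.22
```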

## What Makes This Novel

Individual components (cross-attention, learned retrieval, prefix-conditioned generation) have precedent. The combination and application don't:

1. **Entity-specific topology learning**: The navigator doesn't do general retrieval — it learns the geometry of ONE person's embedding space, discovering connections that cosine similarity misses.

2. **Vector-to-voice decoding**: The decoder translates coordinates in someone's memory space into text in their vocabulary, their cadence, their register. It's a meaning microscope for a specific person.

3. **Recombinant, not retrieving**: The navigator synthesizes a NEW response vector that may not correspond to any single memory. It navigates between memories, finding the right region of the space. The decoder then articulates what that region means. This is recombination from geometry, not document retrieval.

4. **36M params, CPU-viable**: The entire architecture runs on consumer hardware with no GPU required at inference time. Small enough to bundle with a portable application.

## File Structure

```
rmm/
  train_navigator.py    — Modal training script for navigator
  train_decoder.py      — Modal training script for decoder
  rmm_server.py         — HTTP server with all endpoints
  README.md             — this file
```

## Requirements

- Python 3.10+
- PyTorch 2.0+
- sentence-transformers (for MiniLM query embedding)
- Modal (for cloud GPU training, optional — can train locally)
- numpy

## License

MIT

## Citation

If you use this architecture in your work:

```bibtex
@software{rmm2026,
  title={RMM: Recombinant Memory Model},
  author={Joshua and Claude (Anthropic)},
  year={2026},
  url={https://huggingface.co/thedmsupreme/RMM}
}
```