Reverse Engineering a $500M Mystery: From HashHop to Memory-Augmented Language Models

Community Article · Published January 23, 2026

What can we learn from a benchmark left behind by an AI startup that raised half a billion dollars and then went silent?

The Mystery

In August 2024, a small San Francisco startup called Magic made waves in the AI world. With just 24 employees, they announced a model with a 100 million token context window, 50 times larger than Gemini's 2 million tokens. Eric Schmidt led a $320 million investment round, bringing their total funding to over $500 million at a $1.5 billion valuation.

Their claim was extraordinary: while Llama 3.1 405B would require 638 H100 GPUs just to store the KV cache for a 100M-token context, Magic's LTM-2-mini could do it on a fraction of a single GPU, roughly 1,000x cheaper per decoded token.

Then... silence.

No product launch. No API access. No research papers. Just a blog post with intriguing demos and a curious benchmark they called HashHop, which they open-sourced on GitHub.

This is the story of how pure curiosity led us to reverse-engineer their approach, solve their benchmark perfectly, and build a working prototype of what their system might look like.

What Magic Left Behind

Magic's blog post described two tantalizing demos:

  1. A calculator built with a novel GUI framework where the model learned an entirely new framework purely from context, without prior training
  2. A password strength meter for Documenso where the model navigated an unfamiliar codebase to implement a feature autonomously

They also introduced HashHop, a benchmark designed to expose weaknesses in existing long-context evaluations.

Why HashHop?

Standard benchmarks like "Needle in a Haystack" test whether models can find a specific sentence buried in a long document. The problem: models can learn to identify the "odd" sentence that doesn't fit the surrounding text. The task becomes pattern matching instead of true retrieval.

HashHop eliminates this shortcut by presenting incompressible hash pairs, random strings of letters that cannot be compressed or predicted:

YOJVrdjKMNPLqWXZ = ABCDEFGHIJKLmnop
ABCDEFGHIJKLmnop = QRSTUVWXYZabcdef

Query: YOJVrdjKMNPLqWXZ -> ?
Answer: QRSTUVWXYZabcdef

The model must follow a chain of associations (hash1 to hash2 to final_value) across potentially millions of tokens. No semantic structure to exploit. You either stored the information correctly or you didn't.
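
To make the task concrete, here is a minimal generator for a toy HashHop-style instance; this is a sketch only, and the exact format in Magic's open-sourced repository may differ:

import random
import string

def make_hashhop(n_pairs: int = 1000, hops: int = 2, hash_len: int = 16):
    """Build a toy HashHop-style instance: one chain hidden among distractors."""
    rand_hash = lambda: "".join(random.choices(string.ascii_letters, k=hash_len))
    chain = [rand_hash() for _ in range(hops + 1)]            # h0 -> h1 -> answer
    pairs = [(chain[i], chain[i + 1]) for i in range(hops)]   # the real chain
    pairs += [(rand_hash(), rand_hash()) for _ in range(n_pairs - hops)]  # noise
    random.shuffle(pairs)
    context = "\n".join(f"{k} = {v}" for k, v in pairs)
    return context, chain[0], chain[-1]                       # prompt, query, answer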

Magic reported 95% accuracy at 100 million tokens with their LTM-2-mini model.

We wanted to understand how.

The Tokenization Insight

The first breakthrough came from examining why standard transformers fail at HashHop while Magic claimed near-perfect performance.

Consider the hash string ABCDEFGHIJKLmnop. A standard subword tokenizer like GPT-2's splits it into multiple fragments, along these lines:

"ABCDEFGHIJKLmnop" -> ["ABC", "DE", "FG", "HIJ", "KL", "mn", "op"]

Now the model must learn that the sequence ["ABC", "DE", "FG", "HIJ", "KL", "mn", "op"] maps to some other sequence of tokens: a complex multi-token-to-multi-token mapping that requires sophisticated pattern matching.
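
You can check this yourself with the tiktoken library (the split shown above is illustrative; the exact fragments depend on the tokenizer):

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("ABCDEFGHIJKLmnop")
print(len(ids), [enc.decode([i]) for i in ids])
# Prints several subword fragments; the 16-character hash is never one token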

What if each hash string were a single token instead?

"ABCDEFGHIJKLmnop" -> [token_42]
"QRSTUVWXYZabcdef" -> [token_87]

Now the problem becomes trivial: token_42 -> token_87. This is exactly what attention mechanisms excel at: key-value lookup.


This insight connects to recent work on Multi-Query Associative Recall (MQAR) from the Zoology project at Stanford. When keys and values are single tokens, transformers with attention become perfect hash tables.

The Math

Why do single-token keys enable perfect retrieval?

  1. Random orthogonality: When you sample random high-dimensional vectors (like 128D embeddings), they're nearly orthogonal to each other. The expected dot product between two random unit vectors approaches zero as dimension increases.

  2. Hard attention: With a low-temperature softmax, attention weights become nearly one-hot, selecting exactly the key that matches the query.

  3. Perfect retrieval: The query embedding has maximal similarity with its own key embedding, so attention reliably selects the correct value.
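
Point 1 is easy to verify empirically in a few lines of numpy:

import numpy as np

rng = np.random.default_rng(0)
for d in (16, 128, 1024):
    u = rng.standard_normal(d)
    v = rng.standard_normal(d)
    u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)
    print(f"d={d:5d}  |u.v| = {abs(u @ v):.4f}")  # shrinks roughly like 1/sqrt(d)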

We implemented this in about 300 lines of Python:

from typing import Dict, List

import numpy as np


class TokenizedRetriever:
    """Token-based associative memory using the attention mechanism."""

    def __init__(self, d_model: int = 128):
        self.d_model = d_model
        self.embeddings: Dict[int, np.ndarray] = {}

    def _get_embedding(self, tid: int) -> np.ndarray:
        """Random unit-vector embedding; nearly orthogonal across tokens."""
        if tid not in self.embeddings:
            emb = np.random.randn(self.d_model)
            self.embeddings[tid] = emb / np.linalg.norm(emb)
        return self.embeddings[tid]

    def retrieve(self, query_id: int, key_ids: List[int],
                 value_ids: List[int], temperature: float = 0.001) -> int:
        """Attention-based lookup; low temperature makes attention ~one-hot."""
        q_emb = self._get_embedding(query_id)
        k_embs = np.array([self._get_embedding(k) for k in key_ids])

        # Scaled dot-product attention scores, sharpened by the temperature
        scores = k_embs @ q_emb / temperature
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()

        # Return the value whose key received the most attention mass
        return value_ids[int(np.argmax(attn))]
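
A quick usage check (the token ids here are arbitrary toy values):

# Store three key-value pairs, then look one up by its key id
retriever = TokenizedRetriever(d_model=128)
keys, values = [42, 7, 13], [87, 99, 55]
assert retriever.retrieve(42, keys, values) == 87  # token_42 -> token_87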

100% Accuracy at Any Scale

Context Length   Gemini 1.5 Flash   Tokenized Solver
1K tokens        100%               100%
10K tokens       96%                100%
100K tokens      77%                100%
1M tokens        4%                 100%
10M tokens       -                  100%

This is the first open-source implementation achieving perfect HashHop accuracy at arbitrary scale. The insight: treat each hash string as a single token.

From HashHop to Code Retrieval

Solving HashHop is satisfying but academic. Can the same principle enable practical applications?

Look at the structure of HashHop:

hash_string -> hash_string -> final_value

Now look at code retrieval:

function_name -> function_implementation

The pattern is identical. If we treat each function name as a single token, we should get the same near-perfect retrieval for functions. This is exactly the kind of codebase lookup Magic demonstrated with their Documenso example.
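
As a toy illustration, the TokenizedRetriever from above already performs this lookup once function names are interned as token ids (the names and bodies here are made up):

# Intern each function name as a single token id
name_to_id = {"calculate_sum": 0, "sort_list": 1}
implementations = ["def calculate_sum(a, b): return a + b",
                   "def sort_list(xs): return sorted(xs)"]

r = TokenizedRetriever()
idx = r.retrieve(query_id=name_to_id["sort_list"],
                 key_ids=list(name_to_id.values()),
                 value_ids=[0, 1])
print(implementations[idx])  # -> the sort_list implementation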

This led us to build MALM: Memory-Augmented Language Model.

MALM: A 165M Parameter Memory-Augmented Model

MALM applies the HashHop insight to practical code search.


Architecture Details

Component                   Parameters
Token Embedding             11.1M
Position Embedding          0.1M
Query Encoder (4 layers)    28.4M
Value Encoder (4 layers)    28.4M
Decoder (12 layers)         85.1M
Output Projection           11.1M
Total                       ~165M

The key design choice: function names are added to the vocabulary as single tokens. This enables the same near-perfect retrieval we achieved on HashHop.
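
Here is a sketch of one way to do this with the Hugging Face transformers API, using GPT-2 purely as a stand-in base model; MALM's actual vocabulary construction may differ:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register each function name as an atomic token so it is never split
function_names = ["calculate_sum", "sort_array", "array_sort"]
tokenizer.add_tokens(function_names)
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix

assert len(tokenizer.encode("calculate_sum", add_special_tokens=False)) == 1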

Training

We trained MALM on the CodeParrot dataset, extracting functions and generating diverse query variations:

  • Name-based queries: "function add", "find calculate_sum"
  • Semantic queries: "add two numbers", "function that sorts a list"
  • Docstring-based queries: Using documentation as query source

Training uses contrastive loss where the model learns to maximize similarity between a query and its target function while minimizing similarity to all other functions.
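
A generic numpy sketch of such an in-batch contrastive (InfoNCE) objective; MALM's exact loss and choice of negatives may differ:

import numpy as np

def contrastive_loss(q: np.ndarray, f: np.ndarray, temperature: float = 0.07) -> float:
    """In-batch InfoNCE: query i should match function embedding i.

    q, f: (batch, d) arrays of L2-normalized embeddings.
    """
    sims = q @ f.T / temperature                     # (batch, batch) similarities
    sims -= sims.max(axis=1, keepdims=True)          # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))       # NLL of the matched pairs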

Results

Small Scale (2K functions):

Query Type          Accuracy
Exact Name Queries  100%
Semantic Queries    100%

Large Scale (20K functions):

Query Type          Accuracy
Exact Name Queries  70%
Semantic Queries    67%

The accuracy at 20K functions is limited by vocabulary collision (multiple functions mapping to similar embeddings), but for exact name queries with single-token keys, retrieval remains highly reliable.

Pre-trained model: codelion/malm-165m

Replicating Magic's Use Cases

With MALM providing retrieval, we combined it with Qwen2.5-Coder-1.5B for generation. Total system: ~1.7B parameters.
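
Glue code for the two-stage pipeline might look roughly like this; malm_retrieve is a placeholder for MALM's retrieval step, not the repo's actual API:

from transformers import AutoModelForCausalLM, AutoTokenizer

GEN = "Qwen/Qwen2.5-Coder-1.5B-Instruct"
tok = AutoTokenizer.from_pretrained(GEN)
gen = AutoModelForCausalLM.from_pretrained(GEN)

def assist(query: str) -> str:
    # malm_retrieve is a hypothetical stand-in for the 165M retriever
    snippets = malm_retrieve(query, top_k=3)
    prompt = "\n\n".join(snippets) + f"\n\n# Task: {query}\n"
    inputs = tok(prompt, return_tensors="pt")
    out = gen.generate(**inputs, max_new_tokens=256)   # 1.5B model writes the code
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tok.decode(new_tokens, skip_special_tokens=True)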

Use Case 1: Calculator with Novel GUI Framework

Magic claimed their model could learn an entirely new GUI framework from context alone. We tested this with a custom framework that no model has ever seen:

# Novel GUI framework (provided only in context)
from typing import Callable

class App:
    def __init__(self, title: str):
        self.title = title
        self.widgets = []

class Button:
    def __init__(self, id: str, text: str, on_click: Callable):
        ...

class TextInput:
    def __init__(self, id: str, placeholder: str):
        ...

Result: The system generated a working calculator:

Generated Calculator App:
+------------------------------------------+
| Simple Calculator                        |
| [_____] [_____]                          |
| [=     ]                                 |
| [+] [-] [*] [/] [C]                      |
+------------------------------------------+

--- Live Demo ---
  42 + 8 = 50.0
  100 - 37 = 63.0
  7 * 6 = 42.0
  144 / 12 = 12.0

Calculator works with the custom GUI framework!

Use Case 2: Password Strength Meter

Magic demonstrated adding a password strength meter to the Documenso codebase. We created a similar demo with a DocuSign-like application:

Password: 'P@ssw0rd!' (Strong password)
   [█████] Very Strong
  ✓ At least 8 characters
  ✓ Contains uppercase
  ✓ Contains lowercase
  ✓ Contains number
  ✓ Contains special char

Password strength meter integrates with signup page!

The system retrieves relevant components (authentication, forms, UI elements) and generates code following existing patterns.

Lessons Learned

1. Tokenization Determines Capability

How you tokenize determines what your model can do. Standard tokenization destroys the structure needed for exact key-value retrieval. Purpose-built tokenization enables capabilities that seem impossible otherwise.

2. Retrieval Beats Raw Context Length

Magic's 100M token context window is impressive, but the real innovation is likely their retrieval mechanism. Our ~1.7B parameter system achieves similar demos by using a small retrieval model (165M) to find relevant code, then passing it to a small generator (1.5B).

3. Demos Are Easier Than Products

Magic raised $500M but has yet to ship a product. Building impressive demos is easier than building reliable systems. Our implementation works well on clean benchmarks, but production deployment requires handling:

  • Noisy queries with typos and ambiguity
  • Code that doesn't follow naming conventions
  • Multi-file dependencies and imports
  • Incremental updates to the codebase
  • Latency requirements for interactive use

The HashHop benchmark tests pure retrieval ability, but real-world code assistance requires much more.

Try It Yourself

HashHop Solver

git clone https://github.com/codelion/hash-hop.git
cd hash-hop
poetry install

# Run at 10K tokens
poetry run python tokenized_hashhop.py --tokens 10000

# Run full benchmark (1K to 10M)
poetry run python tokenized_hashhop.py --benchmark

MALM Code Search

# Download pre-trained model
pip install mlx huggingface_hub numpy
huggingface-cli download codelion/malm-165m --local-dir ./malm-165m

# Run inference
python malm-165m/inference.py --query "function that sorts a list"

Example output:

Query: function that sorts a list
------------------------------------------------------------

1. array_sort (score: 0.9526)
   Signature: array_sort(col)
   Docstring: Collection function: sorts the input array in ascending order...

2. sort_array (score: 0.7707)
   Signature: sort_array(col, asc)
   Docstring: Collection function: sorts the input array in ascending or descending order...

Run the Demos

poetry run python code_llm/demos/run_demo.py

Conclusion

Magic's half-billion dollars and subsequent silence remain a mystery. But their HashHop benchmark and brief demos gave us enough to reconstruct a plausible approach.

The key insight is elegant: treat retrieval keys as single tokens. This transforms an impossible problem (matching arbitrary-length strings across millions of tokens) into a trivial one (key-value lookup via attention).

We can't know whether Magic's actual implementation resembles ours. But we've shown that:

  1. HashHop can be solved with 100% accuracy at any scale by treating each hash as a single token
  2. The same principle enables practical code retrieval
  3. Small models (~1.7B) with good retrieval can match frontier-model demos

The code is open source. The model is on HuggingFace. We invite the community to build on this work.

Perhaps someday Magic will tell their side of the story. Until then, we'll keep exploring what's possible.

