Reverse Engineering a $500M Mystery: From HashHop to Memory-Augmented Language Models

Community Article · Published January 23, 2026

What can we learn from a benchmark left behind by an AI startup that raised half a billion dollars and then went silent?

The Mystery

In August 2024, a small San Francisco startup called Magic made waves in the AI world. With just 24 employees, they announced a model with a 100 million token context window, 50 times larger than Gemini's 2 million tokens. Eric Schmidt led a $320 million investment round, bringing their total funding to over $500 million at a $1.5 billion valuation.

Their claim was extraordinary: while Llama 3.1 405B would require 638 H100 GPUs just to store the KV cache for a 100M-token context, Magic's LTM-2-mini could do it on a fraction of a single GPU, roughly 1,000x cheaper per decoded token.

Then... silence.

No product launch. No API access. No research papers. Just a blog post with intriguing demos and a curious benchmark they called HashHop, which they open-sourced on GitHub.

This is the story of how pure curiosity led us to reverse-engineer their approach, solve their benchmark perfectly, and build a working prototype of what their system might look like.

What Magic Left Behind

Magic's blog post described two tantalizing demos:

  1. A calculator built with a novel GUI framework where the model learned an entirely new framework purely from context, without prior training
  2. A password strength meter for Documenso where the model navigated an unfamiliar codebase to implement a feature autonomously

They also introduced HashHop, a benchmark designed to expose weaknesses in existing long-context evaluations.

Why HashHop?

Standard benchmarks like "Needle in a Haystack" test whether models can find a specific sentence buried in a long document. The problem: models can learn to identify the "odd" sentence that doesn't fit the surrounding text. The task becomes pattern matching instead of true retrieval.

HashHop eliminates this shortcut by presenting incompressible hash pairs, random strings of letters that cannot be compressed or predicted:

YOJVrdjKMNPLqWXZ = ABCDEFGHIJKLmnop
ABCDEFGHIJKLmnop = QRSTUVWXYZabcdef

Query: YOJVrdjKMNPLqWXZ -> ?
Answer: QRSTUVWXYZabcdef

The model must follow a chain of associations (hash1 to hash2 to final_value) across potentially millions of tokens. No semantic structure to exploit. You either stored the information correctly or you didn't.
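
To make the task concrete, here is a minimal generator for a toy HashHop-style instance; this is a sketch only, and the exact format in Magic's open-sourced repository may differ:

import random
import string

def make_hashhop(n_pairs: int = 1000, hops: int = 2, hash_len: int = 16):
    """Build a toy HashHop-style instance: one chain hidden among distractors."""
    rand_hash = lambda: "".join(random.choices(string.ascii_letters, k=hash_len))
    chain = [rand_hash() for _ in range(hops + 1)]            # h0 -> h1 -> answer
    pairs = [(chain[i], chain[i + 1]) for i in range(hops)]   # the real chain
    pairs += [(rand_hash(), rand_hash()) for _ in range(n_pairs - hops)]  # noise
    random.shuffle(pairs)
    context = "\n".join(f"{k} = {v}" for k, v in pairs)
    return context, chain[0], chain[-1]                       # prompt, query, answer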

Magic reported 95% accuracy at 100 million tokens with their LTM-2-mini model.

We wanted to understand how.

The Tokenization Insight

The first breakthrough came from examining why standard transformers fail at HashHop while Magic claimed near-perfect performance.

Consider the hash string ABCDEFGHIJKLmnop. A standard subword tokenizer like GPT-2's splits it into multiple fragments, along these lines:

"ABCDEFGHIJKLmnop" -> ["ABC", "DE", "FG", "HIJ", "KL", "mn", "op"]

Now the model must learn that the sequence ["ABC", "DE", "FG", "HIJ", "KL", "mn", "op"] maps to some other sequence of tokens: a complex multi-token-to-multi-token mapping that requires sophisticated pattern matching.
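
You can check this yourself with the tiktoken library (the split shown above is illustrative; the exact fragments depend on the tokenizer):

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("ABCDEFGHIJKLmnop")
print(len(ids), [enc.decode([i]) for i in ids])
# Prints several subword fragments; the 16-character hash is never one token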

What if each hash string were a single token instead?

"ABCDEFGHIJKLmnop" -> [token_42]
"QRSTUVWXYZabcdef" -> [token_87]

Now the problem becomes trivial: token_42 -> token_87. This is exactly what attention mechanisms excel at: key-value lookup.


This insight connects to recent work on Multi-Query Associative Recall (MQAR) from the Zoology project at Stanford. When keys and values are single tokens, transformers with attention become perfect hash tables.

The Math

Why do single-token keys enable perfect retrieval?

  1. Random orthogonality: When you sample random high-dimensional vectors (like 128D embeddings), they're nearly orthogonal to each other. The expected dot product between two random unit vectors approaches zero as dimension increases.

  2. Hard attention: With a low-temperature softmax, attention weights become nearly one-hot, selecting exactly the key that matches the query.

  3. Perfect retrieval: The query embedding has maximal similarity with its own key embedding, so attention reliably selects the correct value.
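
Point 1 is easy to verify empirically in a few lines of numpy:

import numpy as np

rng = np.random.default_rng(0)
for d in (16, 128, 1024):
    u = rng.standard_normal(d)
    v = rng.standard_normal(d)
    u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)
    print(f"d={d:5d}  |u.v| = {abs(u @ v):.4f}")  # shrinks roughly like 1/sqrt(d)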

We implemented this in about 300 lines of Python:

from typing import Dict, List

import numpy as np


class TokenizedRetriever:
    """Token-based associative memory using the attention mechanism."""

    def __init__(self, d_model: int = 128):
        self.d_model = d_model
        self.embeddings: Dict[int, np.ndarray] = {}

    def _get_embedding(self, tid: int) -> np.ndarray:
        """Random unit-vector embedding; nearly orthogonal across tokens."""
        if tid not in self.embeddings:
            emb = np.random.randn(self.d_model)
            self.embeddings[tid] = emb / np.linalg.norm(emb)
        return self.embeddings[tid]

    def retrieve(self, query_id: int, key_ids: List[int],
                 value_ids: List[int], temperature: float = 0.001) -> int:
        """Attention-based lookup; low temperature makes attention ~one-hot."""
        q_emb = self._get_embedding(query_id)
        k_embs = np.array([self._get_embedding(k) for k in key_ids])

        # Scaled dot-product attention scores, sharpened by the temperature
        scores = k_embs @ q_emb / temperature
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()

        # Return the value whose key received the most attention mass
        return value_ids[int(np.argmax(attn))]
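
A quick usage check (the token ids here are arbitrary toy values):

# Store three key-value pairs, then look one up by its key id
retriever = TokenizedRetriever(d_model=128)
keys, values = [42, 7, 13], [87, 99, 55]
assert retriever.retrieve(42, keys, values) == 87  # token_42 -> token_87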

100% Accuracy at Any Scale

Context Length   Gemini 1.5 Flash   Tokenized Solver
1K tokens        100%               100%
10K tokens       96%                100%
100K tokens      77%                100%
1M tokens        4%                 100%
10M tokens       -                  100%

This is the first open-source implementation achieving perfect HashHop accuracy at arbitrary scale. The insight: treat each hash string as a single token.

From HashHop to Code Retrieval

Solving HashHop is satisfying but academic. Can the same principle enable practical applications?

Look at the structure of HashHop:

hash_string -> hash_string -> final_value

Now look at code retrieval:

function_name -> function_implementation

The pattern is identical. If we treat each function name as a single token, we should get the same near-perfect retrieval for functions. This is exactly the kind of codebase lookup Magic demonstrated with their Documenso example.
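
As a toy illustration, the TokenizedRetriever from above already performs this lookup once function names are interned as token ids (the names and bodies here are made up):

# Intern each function name as a single token id
name_to_id = {"calculate_sum": 0, "sort_list": 1}
implementations = ["def calculate_sum(a, b): return a + b",
                   "def sort_list(xs): return sorted(xs)"]

r = TokenizedRetriever()
idx = r.retrieve(query_id=name_to_id["sort_list"],
                 key_ids=list(name_to_id.values()),
                 value_ids=[0, 1])
print(implementations[idx])  # -> the sort_list implementation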

This led us to build MALM: Memory-Augmented Language Model.

MALM: A 165M Parameter Memory-Augmented Model

MALM applies the HashHop insight to practical code search.


Architecture Details

Component                   Parameters
Token Embedding             11.1M
Position Embedding          0.1M
Query Encoder (4 layers)    28.4M
Value Encoder (4 layers)    28.4M
Decoder (12 layers)         85.1M
Output Projection           11.1M
Total                       ~165M

The key design choice: function names are added to the vocabulary as single tokens. This enables the same near-perfect retrieval we achieved on HashHop.
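
Here is a sketch of one way to do this with the Hugging Face transformers API, using GPT-2 purely as a stand-in base model; MALM's actual vocabulary construction may differ:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register each function name as an atomic token so it is never split
function_names = ["calculate_sum", "sort_array", "array_sort"]
tokenizer.add_tokens(function_names)
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix

assert len(tokenizer.encode("calculate_sum", add_special_tokens=False)) == 1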

Training

We trained MALM on the CodeParrot dataset, extracting functions and generating diverse query variations:

  • Name-based queries: "function add", "find calculate_sum"
  • Semantic queries: "add two numbers", "function that sorts a list"
  • Docstring-based queries: Using documentation as query source

Training uses contrastive loss where the model learns to maximize similarity between a query and its target function while minimizing similarity to all other functions.
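
A generic numpy sketch of such an in-batch contrastive (InfoNCE) objective; MALM's exact loss and choice of negatives may differ:

import numpy as np

def contrastive_loss(q: np.ndarray, f: np.ndarray, temperature: float = 0.07) -> float:
    """In-batch InfoNCE: query i should match function embedding i.

    q, f: (batch, d) arrays of L2-normalized embeddings.
    """
    sims = q @ f.T / temperature                     # (batch, batch) similarities
    sims -= sims.max(axis=1, keepdims=True)          # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))       # NLL of the matched pairs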

Results

Small Scale (2K functions):

Query Type          Accuracy
Exact Name Queries  100%
Semantic Queries    100%

Large Scale (20K functions):

Query Type          Accuracy
Exact Name Queries  70%
Semantic Queries    67%

The accuracy at 20K functions is limited by vocabulary collision (multiple functions mapping to similar embeddings), but for exact name queries with single-token keys, retrieval remains highly reliable.

Pre-trained model: codelion/malm-165m

Replicating Magic's Use Cases

With MALM providing retrieval, we combined it with Qwen2.5-Coder-1.5B for generation. Total system: ~1.7B parameters.
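
Glue code for the two-stage pipeline might look roughly like this; malm_retrieve is a placeholder for MALM's retrieval step, not the repo's actual API:

from transformers import AutoModelForCausalLM, AutoTokenizer

GEN = "Qwen/Qwen2.5-Coder-1.5B-Instruct"
tok = AutoTokenizer.from_pretrained(GEN)
gen = AutoModelForCausalLM.from_pretrained(GEN)

def assist(query: str) -> str:
    # malm_retrieve is a hypothetical stand-in for the 165M retriever
    snippets = malm_retrieve(query, top_k=3)
    prompt = "\n\n".join(snippets) + f"\n\n# Task: {query}\n"
    inputs = tok(prompt, return_tensors="pt")
    out = gen.generate(**inputs, max_new_tokens=256)   # 1.5B model writes the code
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tok.decode(new_tokens, skip_special_tokens=True)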

Use Case 1: Calculator with Novel GUI Framework

Magic claimed their model could learn an entirely new GUI framework from context alone. We tested this with a custom framework that no model has ever seen:

# Novel GUI framework (provided only in context)
from typing import Callable

class App:
    def __init__(self, title: str):
        self.title = title
        self.widgets = []

class Button:
    def __init__(self, id: str, text: str, on_click: Callable):
        ...

class TextInput:
    def __init__(self, id: str, placeholder: str):
        ...

Result: The system generated a working calculator:

Generated Calculator App:
+------------------------------------------+
| Simple Calculator                        |
| [_____] [_____]                          |
| [=     ]                                 |
| [+] [-] [*] [/] [C]                      |
+------------------------------------------+

--- Live Demo ---
  42 + 8 = 50.0
  100 - 37 = 63.0
  7 * 6 = 42.0
  144 / 12 = 12.0

Calculator works with the custom GUI framework!

Use Case 2: Password Strength Meter

Magic demonstrated adding a password strength meter to the Documenso codebase. We created a similar demo with a DocuSign-like application:

Password: 'P@ssw0rd!' (Strong password)
   [█████] Very Strong
  ✓ At least 8 characters
  ✓ Contains uppercase
  ✓ Contains lowercase
  ✓ Contains number
  ✓ Contains special char

Password strength meter integrates with signup page!

The system retrieves relevant components (authentication, forms, UI elements) and generates code following existing patterns.

Lessons Learned

1. Tokenization Determines Capability

How you tokenize determines what your model can do. Standard tokenization destroys the structure needed for exact key-value retrieval. Purpose-built tokenization enables capabilities that seem impossible otherwise.

2. Retrieval Beats Raw Context Length

Magic's 100M token context window is impressive, but the real innovation is likely their retrieval mechanism. Our ~1.7B parameter system achieves similar demos by using a small retrieval model (165M) to find relevant code, then passing it to a small generator (1.5B).

3. Demos Are Easier Than Products

Magic raised $500M but has yet to ship a product. Building impressive demos is easier than building reliable systems. Our implementation works well on clean benchmarks, but production deployment requires handling:

  • Noisy queries with typos and ambiguity
  • Code that doesn't follow naming conventions
  • Multi-file dependencies and imports
  • Incremental updates to the codebase
  • Latency requirements for interactive use

The HashHop benchmark tests pure retrieval ability, but real-world code assistance requires much more.

Try It Yourself

HashHop Solver

git clone https://github.com/codelion/hash-hop.git
cd hash-hop
poetry install

# Run at 10K tokens
poetry run python tokenized_hashhop.py --tokens 10000

# Run full benchmark (1K to 10M)
poetry run python tokenized_hashhop.py --benchmark

MALM Code Search

# Download pre-trained model
pip install mlx huggingface_hub numpy
huggingface-cli download codelion/malm-165m --local-dir ./malm-165m

# Run inference
python malm-165m/inference.py --query "function that sorts a list"

Example output:

Query: function that sorts a list
------------------------------------------------------------

1. array_sort (score: 0.9526)
   Signature: array_sort(col)
   Docstring: Collection function: sorts the input array in ascending order...

2. sort_array (score: 0.7707)
   Signature: sort_array(col, asc)
   Docstring: Collection function: sorts the input array in ascending or descending order...

Run the Demos

poetry run python code_llm/demos/run_demo.py

Conclusion

Magic's half-billion dollars and subsequent silence remain a mystery. But their HashHop benchmark and brief demos gave us enough to reconstruct a plausible approach.

The key insight is elegant: treat retrieval keys as single tokens. This transforms an impossible problem (matching arbitrary-length strings across millions of tokens) into a trivial one (key-value lookup via attention).

We can't know whether Magic's actual implementation resembles ours. But we've shown that:

  1. HashHop can be solved with 100% accuracy at any scale by treating each hash as a single token
  2. The same principle enables practical code retrieval
  3. Small models (~1.7B) with good retrieval can match frontier-model demos

The code is open source. The model is on HuggingFace. We invite the community to build on this work.

Perhaps someday Magic will tell their side of the story. Until then, we'll keep exploring what's possible.

