# When Models Learn While Thinking
---
## 01 / The Frozen Calculator Problem
Every conversation you've had with ChatGPT, Claude, or any large language model follows the same pattern: the model thinks, predicts, and forgets. The weights that determine its behavior were set months ago, frozen in place after training. When you correct it, when you teach it something new, when you have a breakthrough conversation—none of that changes the model itself.
This isn't a bug. It's the architecture.
The model can *simulate* learning through in-context adaptation. It can act like it remembers. But the parameters that define its cognition remain untouched. When the context window ends, so does the illusion.
This demo breaks that pattern.
---
## 02 / What This Actually Does
This is a minimal reimplementation of two recent papers: **Titans** (test-time training) and **MIRAS** (associative memory with attentional bias). Together, they demonstrate something most production LLMs don't do: **learning during inference**.
The architecture is simple:
- A frozen language model (distilgpt2) generates text
- Hidden states from that model are projected into a memory space
- An associative memory module predicts what it should remember
- The prediction error drives gradient descent
- The memory weights update
- The updated state persists to disk
Every message you send changes the model's internal representations. Not through prompt engineering. Not through retrieval. Through actual gradient-based optimization—at inference time.
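A minimal sketch of one such update. It assumes a linear memory matrix, fixed random key and value projections, and a random vector standing in for a distilgpt2 hidden state. The names `key_proj`, `value_proj`, and `memory_step` are illustrative and not taken from the demo's code.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN_DIM = 768   # hidden size of distilgpt2
MEMORY_DIM = 256   # memory width (assumed to match the demo)

# Assumption: fixed random projections stand in for whatever the demo uses.
key_proj = rng.normal(0.0, HIDDEN_DIM ** -0.5, size=(MEMORY_DIM, HIDDEN_DIM))
value_proj = rng.normal(0.0, HIDDEN_DIM ** -0.5, size=(MEMORY_DIM, HIDDEN_DIM))


def memory_step(memory, hidden_state, lr=0.5):
    """One predict / compare / update cycle. Returns (new_memory, loss)."""
    key = key_proj @ hidden_state
    key = key / (np.linalg.norm(key) + 1e-8)  # unit-norm key keeps the step stable
    value = value_proj @ hidden_state          # what the memory should recall
    error = memory @ key - value               # prediction error ("surprise")
    loss = 0.5 * float(error @ error)          # L2 loss
    grad = np.outer(error, key)                # d(loss) / d(memory)
    return memory - lr * grad, loss


memory = np.zeros((MEMORY_DIM, MEMORY_DIM))
hidden_state = rng.normal(size=HIDDEN_DIM)     # stand-in for a real hidden state

memory, first_loss = memory_step(memory, hidden_state)
memory, second_loss = memory_step(memory, hidden_state)
print(f"first: {first_loss:.3f}  second: {second_loss:.3f}")
```

With a unit-norm key and `lr=0.5`, repeating the same input halves the error, so the second loss is a quarter of the first.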
---
## 03 / The Text Doesn't Matter
If you interact with this demo, you'll notice the text responses are... not good. Random. Sometimes incoherent. This is intentional.
The text generator (distilgpt2) is frozen. We're not training it. The responses reflect what a small, untuned model produces when asked to continue arbitrary text. That's not the point.
**The point is the numbers below each response.**
Watch the loss. When you send the same message multiple times, the loss decreases. The memory is learning to predict the hidden state patterns associated with that input. When you send something completely different, the loss spikes—the memory is surprised.
This is test-time learning. The model is changing itself while you use it.
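The same behavior can be reproduced in a toy setting. This sketch assumes a small linear memory and uses random vectors as hypothetical stand-ins for the projected hidden states of two different messages. It is not the demo's code.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 32
memory = np.zeros((DIM, DIM))


def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def observe(key, value, lr=0.5):
    """Update the global memory on one pair; return the loss before the update."""
    global memory
    error = memory @ key - value
    memory = memory - lr * np.outer(error, key)
    return 0.5 * float(error @ error)


# Hypothetical stand-ins for two unrelated messages.
key_a, value_a = unit(rng.normal(size=DIM)), rng.normal(size=DIM)
key_b, value_b = unit(rng.normal(size=DIM)), rng.normal(size=DIM)

repeated = [observe(key_a, value_a) for _ in range(5)]  # same message five times
novel = observe(key_b, value_b)                         # then something new

print("repeated:", [round(x, 4) for x in repeated])
print("novel:   ", round(novel, 4))
```

The repeated losses shrink on every pass. The novel input lands far above the last repeated loss, because the memory has only learned the first pattern.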
---
## 04 / What the Stats Mean
Each response shows four metrics:
**Loss**: How surprised the memory is. Lower means the pattern is familiar. Higher means it's novel. This is the prediction error that drives learning.
**Retention**: A multiplier on the learning rate. When loss is high (surprising input), retention is high (2.0x), so the memory learns more aggressively from surprising events. When loss is low (familiar input), it drops toward 0.5x. This is the retention gate, a simple mechanism inspired by how human memory prioritizes novelty.
**Updates**: The total number of times the memory has been updated. This persists across sessions. Refresh the page, send another message, and the count continues. The memory doesn't reset.
**Avg Loss**: The running average of all losses. Over time, as the memory learns recurring patterns, this should trend downward.
These aren't vanity metrics. They're the observable signature of gradient descent happening during inference.
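A sketch of how the two cumulative metrics could be tracked, assuming "Avg Loss" is a plain arithmetic mean over all updates. The `MemoryStats` class is hypothetical. The demo may compute its average differently.

```python
from dataclasses import dataclass


@dataclass
class MemoryStats:
    updates: int = 0
    avg_loss: float = 0.0

    def record(self, loss: float) -> None:
        """Fold one new loss into the running mean without storing history."""
        self.updates += 1
        self.avg_loss += (loss - self.avg_loss) / self.updates


stats = MemoryStats()
for loss in [4.0, 2.0, 1.0, 1.0]:
    stats.record(loss)
print(stats.updates, stats.avg_loss)
```

The incremental form needs only two numbers, so both can be saved next to the memory weights and restored in a later session.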
---
## 05 / The Two Papers
**Titans** (2025) introduces a neural long-term memory module that keeps learning at test time. The core idea: instead of freezing every weight after pre-training, allow a subset of parameters to update during inference. This creates a feedback loop (think, predict, update, think differently next time) that doesn't exist in standard LLMs.
**MIRAS** (2025) reframes sequence models as associative memories driven by an internal objective, which the paper calls attentional bias. It argues that attention, linear transformers, and modern RNNs can all be read as online optimization, and that most of them rely on either dot-product similarity or an L2 regression objective. By making the objective explicit and tunable, you can change the memory behavior. Different losses produce different cognition.
This demo combines both: Titans' test-time learning with MIRAS's associative memory framework.
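A hedged way to write the shared idea down, using generic symbols rather than either paper's exact notation: $W$ is the memory matrix, $k_t$ and $v_t$ are the key and value derived from the current input, and $\eta_t$ is the step size (in this demo, the base learning rate times the retention multiplier).

$$
\ell(W; k_t, v_t) = \tfrac{1}{2}\,\lVert W k_t - v_t \rVert_2^2,
\qquad
\nabla_W \ell = (W k_t - v_t)\, k_t^\top
$$

$$
W_t = W_{t-1} - \eta_t \,(W_{t-1} k_t - v_t)\, k_t^\top
$$

Swapping the squared error for a different loss changes the gradient term, and with it what the memory keeps.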
---
## 06 / What's Missing
This is a minimal reimplementation. Several components from the papers are not included:
**From Titans**:
- Multi-layer test-time updates (we only update the memory module)
- Task-specific memory partitioning (we use a single shared memory)
- Adaptive learning rate schedules (we use a simple retention gate)
**From MIRAS**:
- Alternative loss functions (we use L2)
- Multi-head memory (we use a single memory matrix)
- Attention-based retrieval (we use direct key-value mapping)
The goal was to demonstrate the core mechanism—learning during inference—not to replicate every detail. The full papers contain significantly more sophistication.
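To make the "alternative loss functions" item concrete, here is a small sketch comparing the L2 update with a Huber-style update on a linear memory. Huber is used purely as an illustrative robust loss. The function names and the `delta` threshold are assumptions, not code from the demo or the papers.

```python
import numpy as np


def l2_grad(error):
    """Gradient of 0.5 * ||error||^2 with respect to the prediction."""
    return error


def huber_grad(error, delta=1.0):
    """Gradient of the elementwise Huber loss: equal to the error inside
    [-delta, delta], clipped to +/- delta outside it."""
    return np.clip(error, -delta, delta)


def update(memory, key, value, grad_fn, lr=0.5):
    """One gradient step on a linear memory using the chosen loss gradient."""
    error = memory @ key - value
    return memory - lr * np.outer(grad_fn(error), key)


key = np.array([1.0, 0.0])
value = np.array([0.5, 100.0])   # second coordinate is an outlier
fresh = np.zeros((2, 2))

after_l2 = update(fresh, key, value, l2_grad)
after_huber = update(fresh, key, value, huber_grad)
print("L2:   ", after_l2[:, 0])
print("Huber:", after_huber[:, 0])
```

Under L2 the outlier drags the memory by a large amount in a single step. Under the Huber gradient the same outlier contributes a bounded step, so the memory behaves differently on the same data.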
---
## 07 / The Difference from Standard LLMs
Standard LLMs (ChatGPT, Claude, GPT-4) do this:
```
Input → Frozen Weights → Output → Forget
```
This demo does this:
```
Input → Frozen LM → Hidden States → Memory (Learning) → Output → Save
```
The frozen LM provides the text generation. The memory provides the learning. They're decoupled.
This matters because:
- **Weights update during use** (not just during training)
- **Memory persists across sessions** (not just within a context window)
- **Learning is explicit** (not simulated through in-context adaptation)
- **The system becomes different** after each interaction
In-context learning is pattern matching. This is optimization.
---
## 08 / What Problem This Solves
The current paradigm for "adaptive" LLMs involves:
- Vector databases for retrieval
- Fine-tuning on user data (expensive, slow)
- Prompt engineering (fragile, context-limited)
- RAG systems (fetch, don't learn)
None of these change the model itself. They work around the frozen weights.
Test-time learning makes adaptation a first-class primitive. The model doesn't retrieve your preferences—it encodes them in its parameters. It doesn't simulate learning—it performs learning.
This opens up:
- **Personalization without fine-tuning** (the model adapts to you as you use it)
- **Continual learning** (the model improves from every interaction)
- **Transparent memory** (you can inspect what it learned)
- **Efficient adaptation** (a few gradient steps on a small memory module are far cheaper than retraining the full model)
---
## 09 / What This Means for AI's Future
The industry is converging on a model: train once, deploy frozen, scale through retrieval. This works. But it's not the only path.
Test-time learning suggests a different trajectory: models that are **living systems**, not static calculators. Systems that don't just respond to you—they change because of you.
This has implications:
- **Privacy**: Your data updates your local model, not a shared cloud model
- **Efficiency**: Learning happens incrementally, not in massive retraining runs
- **Alignment**: The model adapts to your values through interaction, not through RLHF on aggregate data
- **Transparency**: You can see what the model learned, reset it, or fork it
The tradeoff is complexity. A model that changes during use is harder to reason about, harder to debug, harder to guarantee. But the benefits—true personalization, continual improvement, user-specific adaptation—may be worth it.
---
## 10 / The Retention Gate
One detail worth highlighting: the retention gate.
When the memory encounters a high-loss input (surprising, novel), it increases the learning rate. When it encounters a low-loss input (familiar, repeated), it decreases the learning rate.
This is a simple heuristic, but it mirrors how human memory works. We remember surprising events more vividly than routine ones. The retention gate makes the memory selective—it learns more from what it doesn't already know.
While the memory is fresh, retention sits at 2.0x because everything is surprising. After hundreds of interactions, you'd see retention vary: 0.5x for familiar patterns, 2.0x for novel ones. The memory would become selective.
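One way such a gate could be written, as a sketch. Only the 0.5x and 2.0x bounds come from this document. Comparing the current loss against a running average is an assumption, and `retention_multiplier` is a hypothetical name.

```python
def retention_multiplier(loss, avg_loss, low=0.5, high=2.0):
    """Scale the learning rate up for surprising inputs, down for familiar ones.

    Assumption: "surprise" is the ratio of the current loss to the running
    average loss, clamped to the range [low, high].
    """
    if avg_loss <= 0.0:
        return high  # no baseline yet, so treat the input as novel
    ratio = loss / avg_loss
    return max(low, min(high, ratio))


base_lr = 0.1  # hypothetical base learning rate
for loss, avg in [(1.0, 0.0), (5.0, 1.0), (1.2, 1.0), (0.1, 1.0)]:
    gate = retention_multiplier(loss, avg)
    print(f"loss={loss} avg={avg} gate={gate} effective_lr={base_lr * gate}")
```

The effective learning rate is the base rate times the gate, so a familiar input still updates the memory, just more gently.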
---
## 11 / Why the Memory is Shared
This demo uses a single, shared memory across all users. This is intentional.
It demonstrates that the memory is not user-specific. It's a collective brain. Every user's input updates the same weight matrix. This makes the learning observable—you can see the loss decrease as the memory encounters repeated patterns from different users.
In a production system, you'd likely use per-user memory. But for a demo, shared memory makes the learning more visible and keeps the privacy story simple: no raw user data is stored, only the updated memory weights.
---
## 12 / The Bandwidth Constraint
One reason LLMs feel static is that they operate at the wrong bandwidth. The only way to change their weights is to retrain or fine-tune them, a process that can cost millions of dollars and take weeks for a large model. Users can't influence the model in real time.
Test-time learning changes the bandwidth. The model updates with every message. The feedback loop tightens from months to milliseconds.
This doesn't mean the model becomes smarter. It means the model becomes *responsive*. It adapts to the distribution of inputs it actually sees, not the distribution it was trained on.
---
## 13 / What You're Actually Watching
When you interact with this demo, you're not chatting with a model. You're watching a memory module learn to compress hidden state patterns into a 256-dimensional space.
The text generation is a side effect. The real process is:
- Extract hidden states from the frozen LM
- Project them into memory space
- Predict what the memory should encode
- Compute the error
- Update the weights
- Save the new state
This happens for every message. The memory is always learning. The loss is always updating. The system is always changing.
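The last two steps, update and save, are what make the memory outlive a session. A minimal sketch of that loop, assuming the state is a single matrix plus an update counter stored in one `.npz` file. The file layout and function names are hypothetical, and the key and value are assumed to be already projected.

```python
import os
import numpy as np


def load_state(path, dim=256):
    """Return (memory, updates) from disk, or a fresh state if no file exists."""
    if os.path.exists(path):
        with np.load(path) as data:
            return data["memory"], int(data["updates"])
    return np.zeros((dim, dim)), 0


def save_state(path, memory, updates):
    """Write the memory matrix and the update counter to one .npz file."""
    np.savez(path, memory=memory, updates=updates)


def process_message(path, key, value, lr=0.5):
    """Load, take one gradient step on the L2 error, save.

    Returns (loss_before_update, total_updates).
    """
    memory, updates = load_state(path, dim=key.shape[0])
    error = memory @ key - value
    memory = memory - lr * np.outer(error, key)
    save_state(path, memory, updates + 1)
    return 0.5 * float(error @ error), updates + 1
```

Calling `process_message` twice with the same `path` (ending in `.npz`) and the same input continues from the saved weights: the counter goes up and the loss goes down, even if the process restarts in between.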
That's the difference. Standard LLMs are frozen calculators. This is a living system.
---
## 14 / The Horizon
This demo is a proof of concept. It's not production-ready. It's not optimized. It's not aligned. But it demonstrates a principle: **models can learn while they think**.
The implications ripple outward:
- What if your AI assistant remembered how you corrected it?
- What if your code completion tool learned your style over time?
- What if your search engine adapted to your information needs?
- What if alignment happened through interaction, not through pre-training?
These aren't hypotheticals. They're design choices. The architecture exists. The papers are published. The code is open.
The question is whether we build systems that simulate learning or systems that perform learning.
This demo chooses the latter.
---
## 15 / A Note on Hype
Test-time learning is not a silver bullet. It introduces complexity, instability, and new failure modes. A model that changes during use is harder to trust, harder to audit, harder to guarantee.
But it's also more adaptive, more personal, more aligned with how humans actually learn.
The industry will likely converge on hybrid systems: frozen base models with test-time learning in specific modules. The best of both worlds—stability where you need it, adaptation where you want it.
This demo is a step in that direction. Not the destination. Just a clearer mental model of what's possible.
---
## 16 / The Core Metaphor
Standard LLMs are like libraries. They contain vast knowledge, but they don't change when you visit them. You can check out books (retrieve information), but the library itself remains static.
Test-time learning is like a brain. It changes with every experience. The connections strengthen or weaken based on what you encounter. The system becomes different because of you.
Both are useful. But they're not the same thing.
This demo is a brain, not a library.
---
**Papers**:
- Titans: Learning to Memorize at Test Time ([arxiv.org/abs/2501.00663](https://arxiv.org/abs/2501.00663))
- MIRAS: It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization ([arxiv.org/abs/2504.13173](https://arxiv.org/abs/2504.13173))
**Code**: Open source, minimal, educational.
**Memory**: Shared, persistent, observable.
**Learning**: Real, not simulated.
---
*This is not a chatbot. This is a demonstration of what happens when models learn while thinking.*