Abstract
TIDE addresses a structural limitation in LLM design by introducing EmbeddingMemory, which mitigates the rare-token and contextual-collapse problems through context-free semantic vectors and depth-conditioned routing.
We revisit a universally accepted but under-examined design choice in every modern LLM: a token index is looked up once at the input embedding layer and then permanently discarded. This single-injection assumption induces two structural failures: (i) the Rare Token Problem, where the Zipfian distribution of the vocabulary leaves rare-token embeddings chronically under-trained, since they receive only a small fraction of the cumulative gradient signal that common tokens receive; and (ii) the Contextual Collapse Problem, where parameter-limited models map distributionally similar tokens to indistinguishable hidden states. To address both, we propose TIDE, which augments the standard transformer with EmbeddingMemory: an ensemble of K independent MemoryBlocks that map token indices to context-free semantic vectors, computed once and injected into every layer through a depth-conditioned softmax router with a learnable null bank. We establish, both theoretically and empirically, that TIDE mitigates the failures induced by single-injection token identity and improves performance across multiple language modeling and downstream tasks.
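To make the mechanism concrete, here is a minimal PyTorch sketch of one plausible reading of the abstract. The class names follow the paper's terminology (EmbeddingMemory, MemoryBlocks, the null bank), but every implementation detail is an assumption, since only the abstract is available: we treat each MemoryBlock as a plain embedding table, condition the router on a learned depth embedding, and inject the routed mixture additively into the residual stream.

```python
# Hypothetical sketch of TIDE's EmbeddingMemory, reconstructed from the
# abstract alone. Block structure, router conditioning, and the injection
# rule are all assumptions, not the authors' confirmed implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmbeddingMemory(nn.Module):
    """K independent MemoryBlocks mapping token index -> context-free vectors."""

    def __init__(self, vocab_size: int, d_model: int, k: int):
        super().__init__()
        # Assumption: each MemoryBlock is a simple embedding table.
        self.blocks = nn.ModuleList(
            nn.Embedding(vocab_size, d_model) for _ in range(k)
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Computed once per sequence from token indices, independent of context.
        # Shape: (batch, seq, K, d_model)
        return torch.stack([blk(token_ids) for blk in self.blocks], dim=2)


class DepthConditionedRouter(nn.Module):
    """Softmax router over the K MemoryBlocks plus a learnable null bank."""

    def __init__(self, d_model: int, k: int, n_layers: int):
        super().__init__()
        self.depth_emb = nn.Embedding(n_layers, d_model)  # depth conditioning
        self.proj = nn.Linear(d_model, k + 1)             # K blocks + null slot
        self.null_bank = nn.Parameter(torch.zeros(d_model))

    def forward(self, memory: torch.Tensor, layer_idx: int) -> torch.Tensor:
        b, s, k, d = memory.shape
        # Assumption: routing weights depend only on depth; the actual router
        # might also condition on the hidden state.
        logits = self.proj(self.depth_emb.weight[layer_idx])   # (K + 1,)
        w = F.softmax(logits, dim=-1)
        null = self.null_bank.expand(b, s, 1, d)               # opt-out option
        candidates = torch.cat([memory, null], dim=2)          # (b, s, K+1, d)
        # Weighted mixture to be injected at this layer.
        return (w.view(1, 1, -1, 1) * candidates).sum(dim=2)   # (b, s, d)


# Usage: compute the memory once, then inject at every transformer layer.
mem = EmbeddingMemory(vocab_size=32000, d_model=512, k=4)
router = DepthConditionedRouter(d_model=512, k=4, n_layers=12)
token_ids = torch.randint(0, 32000, (2, 16))
m = mem(token_ids)                        # computed once, reused at all depths
h = torch.randn(2, 16, 512)               # stand-in for layer hidden states
for layer_idx in range(12):
    h = h + router(m, layer_idx)          # depth-conditioned identity injection
```

The null bank gives the router a way to down-weight token-identity injection at depths where it would hurt, which is presumably why it is learnable rather than fixed at zero; how the paper actually parameterizes it is not stated in the abstract.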
Community
We investigate the rare-token and contextual-collapse problems in LLM design and propose injecting token identity information into each transformer layer.
The idea of an embedding memory that keeps token identity alive through every layer is a clean counterpoint to the standard one-shot embedding. My main question is about ablations: how much of the gain comes from the MemoryBlocks themselves versus the depth-conditioned router and the null bank? I'd love to see results when you dial K down to 1 and/or remove the router entirely. Btw, the arXivLens breakdown (https://arxivlens.com/PaperView/Details/tide-every-layer-knows-the-token-beneath-the-context-6994-fbcacec9) did a nice job unpacking the method details. I also wonder how this scales to truly massive vocabularies or multilingual setups, where token identity can be ambiguous across scripts.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Mixture of Chapters: Scaling Learnt Memory in Transformers (2026)
- Attention Residuals (2026)
- Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling (2026)
- ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models (2026)
- Hierarchical vs. Flat Iteration in Shared-Weight Transformers (2026)
- Learning When to Attend: Conditional Memory Access for Long-Context LLMs (2026)
- Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation (2026)