arxiv:2605.06216

TIDE: Every Layer Knows the Token Beneath the Context

Published on May 7 · Submitted by Ajay Jaiswal on May 8
Abstract

TIDE addresses limitations in LLM design by introducing EmbeddingMemory to overcome rare token and contextual collapse problems through context-free semantic vector computation and depth-conditioned routing.

AI-generated summary

We revisit a universally accepted but under-examined design choice in every modern LLM: a token index is looked up once at the input embedding layer and then permanently discarded. This single-injection assumption induces two structural failures: (i) the Rare Token Problem, where the Zipf-type distribution of the vocabulary leaves rare-token embeddings chronically under-trained, since they receive only a fraction of the cumulative gradient signal that common tokens do; and (ii) the Contextual Collapse Problem, where limited-parameter models map distributionally similar tokens to indistinguishable hidden states. To address both, we propose TIDE, which augments the standard transformer with EmbeddingMemory: an ensemble of K independent MemoryBlocks that map token indices to context-free semantic vectors, computed once and injected into every layer through a depth-conditioned softmax router with a learnable null bank. We theoretically and empirically establish the benefits of TIDE in addressing the issues associated with single-shot token identity injection, as well as in improving performance across multiple language modeling and downstream tasks.
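The mechanism the abstract describes can be sketched in a few lines. The following is a minimal toy sketch, not the paper's implementation: all shapes, initializations, and the additive injection into the hidden state are assumptions, and the router here is a plain per-layer softmax over the K MemoryBlocks plus a null bank, as the abstract suggests.

```python
import numpy as np

rng = np.random.default_rng(0)

V, d, K, L = 100, 16, 4, 3  # toy vocab size, hidden dim, MemoryBlocks, layers (all assumed)

# K independent MemoryBlocks: each maps a token index to a context-free vector.
memory_blocks = rng.normal(scale=0.02, size=(K, V, d))
# Learnable null bank: an extra routing target that lets a layer opt out of injection.
null_bank = np.zeros(d)
# Depth-conditioned router: per-layer logits over the K blocks + the null bank.
router_logits = rng.normal(size=(L, K + 1))

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def embedding_memory(token_ids, layer):
    """Context-free identity vector for each token, mixed by depth-conditioned weights."""
    w = softmax(router_logits[layer])              # (K+1,) mixing weights, sum to 1
    blocks = memory_blocks[:, token_ids, :]        # (K, T, d) per-block lookups
    mixed = np.einsum("k,ktd->td", w[:K], blocks)  # weighted sum over blocks
    return mixed + w[K] * null_bank                # null bank absorbs leftover mass

tokens = np.array([3, 17, 42])
hidden = rng.normal(size=(len(tokens), d))
for layer in range(L):
    # Unlike the standard single-injection design, token identity is re-injected
    # at every layer; here injection is modeled as a simple additive update.
    hidden = hidden + embedding_memory(tokens, layer)
```

Because each lookup depends only on the token index, the identity vectors can be computed once per sequence and reused across layers; only the router weights vary with depth.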

Community

Paper submitter

We investigate the rare-token and contextual-collapse problems in LLM design and propose injecting token identity information into each transformer layer.

the idea of embedding memory that keeps token identity alive through every layer is a clean counterpoint to the standard one-shot embedding. my main question is ablations: how much of the gains come from the memory blocks themselves versus the depth-conditioned router and the null bank; would love to see results when you dial k down to 1 and/or remove the router entirely. btw the arxivlens breakdown (https://arxivlens.com/PaperView/Details/tide-every-layer-knows-the-token-beneath-the-context-6994-fbcacec9) did a nice job unpacking the method details. i wonder how this scales to truly massive vocabularies or multilingual setups where token identity can be ambiguous across scripts.



Get this paper in your agent:

hf papers read 2605.06216
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2605.06216 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2605.06216 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2605.06216 in a Space README.md to link it from this page.

Collections including this paper 2