arxiv:2607.01218

The State-Prediction Separation Hypothesis

Published on Jul 1

Submitted by Nathan Godey on Jul 2

Cornell LIL Lab

Abstract

Separating state prediction from token prediction in Transformers improves language modeling performance and efficiency across different scales.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Transformers use the same forward computation stream both to predict the next token and to store useful state for future token predictions. We formulate the state-prediction separation hypothesis: disentangling the two roles yields better language modeling performance. We design a Transformer variant that uses two computation streams to separate the two functions, and conduct pretraining experiments across various scales. Our experiments show that state-prediction separation consistently offers better data and compute efficiency, improving validation loss and outperforming standard Transformers by 2–3 percentage points on average on downstream tasks. We also conduct extensive empirical analysis that rules out potential confounders and demonstrates the fundamental difference in the gradients our design entails.
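The abstract does not spell out how the two streams are wired, so the following is only a toy sketch of one possible reading, not the paper's architecture. It assumes that keys and values are computed from a "state" stream alone, and that a separate "prediction" stream reads that state through causal attention and is the only input to the output head. Every name here (`two_stream_layer`, `Wq_state`, `Wq_pred`, and so on) is hypothetical, and NumPy is used purely for illustration.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head causal attention: position t attends to positions <= t."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def two_stream_layer(state, pred, params):
    """Hypothetical layer: both streams read keys/values from the state stream only.

    Assumption (not from the paper): the prediction stream is never used as
    keys or values, so it cannot act as memory for later positions.
    """
    k = state @ params["Wk"]
    v = state @ params["Wv"]
    new_state = state + causal_attention(state @ params["Wq_state"], k, v)
    new_pred = pred + causal_attention(pred @ params["Wq_pred"], k, v)
    return new_state, new_pred

def init_params(d, rng):
    names = ["Wq_state", "Wq_pred", "Wk", "Wv"]
    return {n: rng.normal(scale=d ** -0.5, size=(d, d)) for n in names}

rng = np.random.default_rng(0)
T, d, vocab = 5, 8, 11
params = init_params(d, rng)
x = rng.normal(size=(T, d))          # stand-in for token embeddings
W_out = rng.normal(size=(d, vocab))  # stand-in for the output head

# Both streams start from the same embeddings in this sketch.
state, pred = two_stream_layer(x, x.copy(), params)
logits = pred @ W_out  # only the prediction stream reaches the output head
```

In this toy version, changing the prediction stream's input leaves the state stream untouched, which is the kind of decoupling the hypothesis is about. A standard Transformer would correspond to using a single array for both roles.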

Community

nthngdy

Paper submitter about 13 hours ago

Code coming soon!
