Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility
Abstract
SPEED is a phase-asymmetric KV-visibility policy that reduces long-context inference costs in decoder-only language models by processing prompt tokens in lower layers during prefill while maintaining full-depth attention during decoding.
Long-context inference in decoder-only language models is costly because long prompts are processed during Prefill, cached at every layer, and repeatedly attended to during autoregressive Decode. We introduce Shallow Prefill, dEEp Decode (SPEED), a phase-asymmetric KV-visibility policy that materializes non-anchor prompt-token KV states only in lower layers while keeping Decode-phase tokens full-depth. Unlike previous approaches that make upper-layer prompt KV states cheaper to store or construct, SPEED removes prefill tokens from the upper-layer Decode visibility set altogether. With a minimal beginning-of-sequence (BoS) anchor, this simple change preserves broad benchmark quality while reducing long-context cost. In a controlled Llama-3.1-8B instruction-tuning study, SPEED using only 75% of layers for prefill tokens reaches an average score of 51.2 on OLMES-style benchmarks, compared with 51.4 for the full-depth baseline, while improving time to first token (TTFT) by 33% and time per output token (TPOT) by 22%, and reducing active KV memory by 25.0% at 128K context. Layer-wise diagnostics suggest that this cutoff retains the main prompt-selection and representation-stabilization regions of the full-depth model. These results show that long-context prompt tokens need not always persist as full-depth KV-cache objects when Decode-phase tokens remain full-depth.
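To make the policy concrete, here is a minimal sketch of a phase-asymmetric KV cache. The class name SpeedKVCache, the parameters layer_cutoff and anchor_len, the [batch, seq, kv_heads, head_dim] tensor layout, and the list-based cache are all illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of a SPEED-style phase-asymmetric KV cache.
# Assumptions (not from the paper): tensors are [batch, seq, kv_heads, head_dim],
# and the anchor set is the leading `anchor_len` prompt tokens (e.g., BoS).
import torch


class SpeedKVCache:
    def __init__(self, num_layers: int, layer_cutoff: int, anchor_len: int = 1):
        self.num_layers = num_layers      # total decoder layers, e.g., 32
        self.layer_cutoff = layer_cutoff  # prompt KV is kept in layers [0, layer_cutoff)
        self.anchor_len = anchor_len      # anchor tokens stay visible at every layer
        self.keys = [[] for _ in range(num_layers)]
        self.values = [[] for _ in range(num_layers)]

    def append_prefill(self, layer: int, k: torch.Tensor, v: torch.Tensor) -> None:
        """Cache prompt-token KV; upper layers retain only the anchor prefix."""
        if layer >= self.layer_cutoff:
            k, v = k[:, : self.anchor_len], v[:, : self.anchor_len]
        self.keys[layer].append(k)
        self.values[layer].append(v)

    def append_decode(self, layer: int, k: torch.Tensor, v: torch.Tensor) -> None:
        """Decode-phase tokens remain full-depth: cached at every layer."""
        self.keys[layer].append(k)
        self.values[layer].append(v)

    def visible_kv(self, layer: int) -> tuple[torch.Tensor, torch.Tensor]:
        """KV states a new Decode token may attend to at this layer. At upper
        layers this is anchors plus Decode tokens; non-anchor prompt KV was
        never materialized there."""
        return (torch.cat(self.keys[layer], dim=1),
                torch.cat(self.values[layer], dim=1))
```

With Llama-3.1-8B's 32 decoder layers, the paper's 75% setting would correspond to layer_cutoff = 24, so non-anchor prompt KV occupies only the bottom 24 layers.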
Community
The paper studies why long prompts are expensive in decoder-only LMs: prompt KV states are usually materialized at every layer during Prefill and then attended to throughout Decode. SPEED keeps a small set of anchor prompt tokens visible in upper layers, stores non-anchor prompt tokens only in lower layers, and keeps new Decode tokens full-depth. In the Llama-3.1-8B study, using 75% of layers for Prefill tokens preserved benchmark quality to within 0.2 average score while improving TTFT and TPOT and reducing active KV memory at 128K context; the memory arithmetic is sketched below.
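The reported 25.0% active-KV reduction follows from the layer fraction: when the prompt dominates the cache, storing non-anchor prompt KV in 75% of layers scales prompt KV memory by 0.75. A back-of-envelope check, assuming fp16 and Llama-3.1-8B-style GQA dimensions (8 KV heads × 128 head dim, 32 layers) and ignoring the anchor and Decode tokens as negligible at 128K:

```python
# Back-of-envelope active-KV memory for prompt tokens at 128K context.
# Illustrative assumptions: 32 layers, 8 KV heads x 128 head dim, fp16.
layers, kv_heads, head_dim, bytes_fp16 = 32, 8, 128, 2
per_token_per_layer = 2 * kv_heads * head_dim * bytes_fp16  # K and V
prompt_tokens = 128 * 1024

full = prompt_tokens * layers * per_token_per_layer
speed = prompt_tokens * int(0.75 * layers) * per_token_per_layer  # prompt KV in 75% of layers

print(f"full-depth: {full / 2**30:.1f} GiB, SPEED: {speed / 2**30:.1f} GiB "
      f"({1 - speed / full:.0%} reduction)")
# -> full-depth: 16.0 GiB, SPEED: 12.0 GiB (25% reduction)
```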