arxiv:2603.12201

IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

Published on Mar 12 · Submitted by Yushi Bai on Mar 13 · #2 Paper of the day

Abstract

IndexCache reduces sparse attention computation in large language models by reusing top-k token selections across layers, achieving significant speedups with minimal quality loss.

AI-generated summary

Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from O(L^2) to O(Lk). However, the indexer itself retains O(L^2) complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers to retain indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82× prefill speedup and 1.48× decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).
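The Full/Shared partition described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `top_k_indices` and `run_layers`, the `"F"`/`"S"` pattern string, and the per-layer score lists are all hypothetical, and "nearest Full layer" is assumed here to mean the nearest preceding Full layer, so that indices are always available when a Shared layer needs them.

```python
from typing import List, Sequence, Tuple


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the positions of the k largest scores, sorted by position."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def run_layers(
    pattern: str, indexer_scores: List[List[float]], k: int
) -> Tuple[List[List[int]], int]:
    """Walk the layers of a toy model for a single query.

    pattern: one character per layer, "F" (Full) or "S" (Shared).
    indexer_scores: hypothetical indexer scores per layer; entries for
        Shared layers are never read, because those layers skip the indexer.
    Returns the token indices used at each layer and the indexer call count.
    """
    # Assumption: layer 0 must be Full so every Shared layer has a
    # preceding Full layer to borrow indices from.
    if not pattern or pattern[0] != "F":
        raise ValueError("the first layer must be a Full layer")

    cached: List[int] = []
    per_layer: List[List[int]] = []
    indexer_calls = 0
    for layer, kind in enumerate(pattern):
        if kind == "F":
            # Full layer: run its own indexer and refresh the cache.
            cached = top_k_indices(indexer_scores[layer], k)
            indexer_calls += 1
        # Shared layer: reuse the cached indices unchanged.
        per_layer.append(cached)
    return per_layer, indexer_calls
```

With a pattern such as `"FSSS"`, only one of four layers runs an indexer, which corresponds to the 75% reduction in indexer computations quoted above.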

Community

Paper author Paper submitter

We introduce IndexCache, a method to accelerate DeepSeek Sparse Attention (DSA) by exploiting cross-layer redundancy in token selection. In DSA, a lightweight indexer at each layer selects top-k tokens for sparse attention, but adjacent layers produce nearly identical selections (70-100% overlap). IndexCache removes up to 75% of these redundant indexers by letting most layers reuse indices from a small set of retained layers. We propose a training-free greedy search to find which layers to keep, and a training-aware multi-layer distillation loss that adapts the model to aggressive sharing. On a 30B model, IndexCache achieves 1.82× prefill and 1.48× decode speedup at 200K context with negligible quality loss. Results on GLM-5 (744B) confirm scalability to production.
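The training-free greedy search mentioned above could look roughly like the following sketch. The exact procedure in the paper may differ: this version assumes a backward elimination that starts with every layer Full and repeatedly converts to Shared the layer whose removal hurts calibration loss the least. The function name `greedy_select_full_layers` and the `calib_loss` callback (which would evaluate language modeling loss on a calibration set for a given set of Full layers) are hypothetical.

```python
from typing import Callable, FrozenSet, List


def greedy_select_full_layers(
    num_layers: int,
    num_keep: int,
    calib_loss: Callable[[FrozenSet[int]], float],
) -> List[int]:
    """Pick which layers keep their indexers (Full layers).

    calib_loss: hypothetical callback returning the calibration-set loss
        when only the given layers run their own indexers.
    """
    if not 1 <= num_keep <= num_layers:
        raise ValueError("num_keep must be between 1 and num_layers")

    full = set(range(num_layers))
    while len(full) > num_keep:
        # Assumption: layer 0 is never converted, so every Shared layer
        # has an earlier Full layer to reuse indices from.
        candidates = sorted(full - {0})
        # Try converting each candidate to Shared and keep the conversion
        # that yields the lowest calibration loss.
        best = min(candidates, key=lambda l: calib_loss(frozenset(full - {l})))
        full.remove(best)
    return sorted(full)
```

Each round evaluates the loss once per remaining candidate, so the search costs on the order of `num_layers ** 2` calibration passes and needs no weight updates, which matches the training-free framing. In practice `calib_loss` would wrap a forward pass of the model; a toy function is enough to exercise the control flow.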

