arxiv:2604.04921

TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Published on Apr 6
· Submitted by Wei Huang on Apr 7
#3 Paper of the day
· NVIDIA

Abstract

TriAttention addresses KV cache memory bottlenecks in LLMs by leveraging Q/K vector concentration in pre-RoPE space to improve key importance estimation and enable efficient long-context generation.

AI-generated summary

Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, queries rotate with position under RoPE, so few recent queries are representative of future ones, leading to poor top-key selection and unstable reasoning. To avoid this, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions -- a property we call Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific relative distances (e.g., the nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention, which estimates key importance from these centers: via the trigonometric series, the distance preference characterized by the centers scores keys according to their positions, and Q/K norms serve as an additional importance signal. On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction, whereas leading baselines reach only about half that accuracy at the same efficiency. TriAttention enables OpenClaw deployment on a single consumer GPU, where long contexts would otherwise cause out-of-memory with Full Attention.
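The trigonometric series mentioned in the abstract is plausibly the standard RoPE identity: the dot product between a fixed pre-RoPE query center and a fixed pre-RoPE key center, as a function of their relative distance d, expands into a sum of cosines and sines over the RoPE frequencies. Below is a minimal NumPy sketch of that identity, not the paper's actual implementation; the function name `rope_distance_series`, the random "centers", and the default base are illustrative assumptions.

```python
import numpy as np

def rope_distance_series(q_center, k_center, distances, base=10000.0):
    """Dot product between fixed pre-RoPE centers after RoPE rotation,
    as a function of relative distance d (standard RoPE identity):

        q^T R(d) k = sum_i [ a_i * cos(d * theta_i) + b_i * sin(d * theta_i) ]

    where (a_i, b_i) come from the 2D component pairs of q and k.
    Illustrative sketch only -- not TriAttention's actual scoring rule.
    """
    q_center = np.asarray(q_center, dtype=float)
    k_center = np.asarray(k_center, dtype=float)
    dim = q_center.shape[0]
    assert dim % 2 == 0, "head dim must be even for RoPE pairs"
    # Standard RoPE frequencies: theta_i = base^(-2i/dim)
    theta = base ** (-np.arange(dim // 2) / (dim // 2))
    q0, q1 = q_center[0::2], q_center[1::2]
    k0, k1 = k_center[0::2], k_center[1::2]
    a = q0 * k0 + q1 * k1  # cosine coefficients
    b = q1 * k0 - q0 * k1  # sine coefficients
    d = np.asarray(distances, dtype=float)[:, None]
    # One score per distance: the trigonometric series evaluated at d
    return (a * np.cos(d * theta) + b * np.sin(d * theta)).sum(axis=-1)

# Illustrative usage: score candidate keys purely by their distance from
# the current query position, using hypothetical concentration centers.
rng = np.random.default_rng(0)
q_c = rng.normal(size=64)  # hypothetical pre-RoPE query center
k_c = rng.normal(size=64)  # hypothetical pre-RoPE key center
position_scores = rope_distance_series(q_c, k_c, np.arange(128))
```

At d = 0 the series collapses to the plain dot product q · k, which is a quick sanity check on the expansion. The paper reportedly combines such position-based scores with Q/K norms as an extra signal; the per-key norm weighting is omitted here.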

Community

Running a 32B model on a 24GB GPU 💻 leaves very little room for the KV cache. OpenClaw ships with such lengthy default instructions that Full Attention hits out-of-memory before the agent can even start. TriAttention compresses the KV cache on the fly 🚀🚀, letting the agent run to completion.


Get this paper in your agent:

hf papers read 2604.04921
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2604.04921 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Spaces citing this paper 0

No Space linking this paper

Collections including this paper 1