ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention
Abstract
ThriftAttention lowers the cost of long-context attention by applying higher precision only to the most critical query-key interactions, achieving near-full-precision quality at close to low-bitwidth efficiency.
Efficient attention algorithms are critical to mitigate the quadratic cost of attention in long-context workloads. Prior work utilises block-scaled quantisation techniques on Blackwell GPUs to move attention computation to 4-bit precision to accelerate inference. However, these techniques result in significant quality degradation in long-context settings. We show that the output impact of quantisation error is highly non-uniform and increases with the importance of each query-key interaction, concentrating functionally relevant error in a small number of attention blocks that contain the most important tokens. We propose ThriftAttention, a low-bit attention variant that delivers near-FP16 long-context quality at FP4 inference efficiency. This approach proceeds in two stages. First, a heuristic rapidly selects a small number of important query-key block pairs for FP16 precision. Second, the selected blocks are computed in FP16 and the remaining blocks in FP4, with both paths merged via online softmax into a single output. We demonstrate across long-context benchmarks and model families that by computing only 5% of query-key blocks in FP16, ThriftAttention recovers on average 89.1% of the FP4-to-FP16 performance gap. We show ThriftAttention's advantage grows with sequence length, mitigating the systematic FP4 quality degradation observed at longer contexts. The code is available at https://github.com/joesharratt1229/ThriftAttention.
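The two-stage procedure described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several pieces are assumptions made only for illustration: the selection heuristic (dot products of mean-pooled blocks, top blocks chosen per query block), the `fake_quantise` helper (a uniform 4-bit grid standing in for block-scaled FP4), the non-causal setting, and all function and parameter names. The part that follows the abstract directly is the merge: high- and low-precision blocks feed the same running max, normaliser, and accumulator of an online softmax.

```python
import numpy as np

def fake_quantise(x, levels=16):
    """Hypothetical stand-in for FP4: round to a uniform symmetric grid.
    Real block-scaled FP4 uses a different number format and scaling."""
    scale = np.abs(x).max() / (levels / 2 - 1)
    return x if scale == 0 else np.round(x / scale) * scale

def reference_attention(q, k, v):
    """Plain full-precision softmax attention (non-causal), for comparison."""
    s = q @ k.T / np.sqrt(q.shape[1])
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v

def mixed_precision_attention(q, k, v, block=4, budget=0.05,
                              quantise=fake_quantise):
    """Sketch: a fraction `budget` of key blocks per query block stays in
    full precision; the rest go through `quantise`. Assumes sequence
    lengths are divisible by `block`."""
    n, d = q.shape
    nq, nk = n // block, k.shape[0] // block

    # Stage 1 (assumed heuristic): score block pairs with pooled dot products.
    q_pool = q.reshape(nq, block, d).mean(axis=1)
    k_pool = k.reshape(nk, block, d).mean(axis=1)
    n_hi = min(nk, int(np.ceil(budget * nk)))
    top = np.argsort(-(q_pool @ k_pool.T), axis=1)[:, :n_hi]
    selected = [set(row.tolist()) for row in top]

    # Stage 2: one online-softmax pass over all key blocks.
    out = np.zeros((n, v.shape[1]))
    for i in range(nq):
        qi = q[i * block:(i + 1) * block]
        m = np.full(block, -np.inf)        # running row-wise max
        l = np.zeros(block)                # running softmax normaliser
        acc = np.zeros((block, v.shape[1]))  # running weighted sum of values
        for j in range(nk):
            kj = k[j * block:(j + 1) * block]
            vj = v[j * block:(j + 1) * block]
            if j in selected[i]:           # high-precision path
                s, vv = qi @ kj.T, vj
            else:                          # simulated low-precision path
                s, vv = quantise(qi) @ quantise(kj).T, quantise(vj)
            s = s / np.sqrt(d)
            m_new = np.maximum(m, s.max(axis=1))
            corr = np.exp(m - m_new)       # rescale earlier partial results
            p = np.exp(s - m_new[:, None])
            l = l * corr + p.sum(axis=1)
            acc = acc * corr[:, None] + p @ vv
            m = m_new
        out[i * block:(i + 1) * block] = acc / l[:, None]
    return out
```

With `budget=1.0`, or with an identity `quantise`, the sketch reduces to ordinary softmax attention, which gives a simple sanity check on the merge step. It says nothing about speed: any efficiency gain depends on hardware FP4 kernels, which this NumPy code does not model.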
Community
Mixed precision attention provides a means to get near-FP16 output quality at close to FP4 (sub-byte) inference latency. On long-context evaluation benchmarks, promoting just 5% of the attention computation to FP16 recovers roughly 90% of the performance gap between FP4 and FP16 attention.
the selective fp16 promotion of a tiny fraction of query-key blocks walks a neat tightrope between fidelity and throughput. would love to see how robust the 5% block budget is across highly skewed long-context distributions, and whether the identity of the promoted blocks shifts a lot when the input changes. the arxivlens breakdown helped me parse the method details, and the online softmax fusion across the fp4 and fp16 paths feels almost deceptively simple in practice. if this generalizes to other hardware and even longer contexts, it could be a practical default for mixed-precision attention in real deployments.