arxiv:2606.13233

ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

Published on Jun 11

Abstract

ReSET addresses accuracy degradation in low-precision reasoning by adapting decoding temperature based on entropy signals, while a custom CUDA kernel improves decoding latency for autoregressive models.

Large reasoning models (LRMs) improve complex problem-solving by generating long intermediate reasoning traces, but this substantially increases inference costs. NVFP4 inference offers a promising approach to reduce both computational and memory costs through hardware-supported low-precision execution. However, directly applying NVFP4 to LRMs introduces two practical limitations: reasoning accuracy degrades under quantization, and existing NVFP4 kernels do not fully realize latency benefits in small-batch autoregressive decoding. In this work, we analyze the effect of NVFP4 quantization on token-level uncertainty during reasoning. We show that quantization increases incorrect sampling at low-entropy symbolic tokens, while causing over-concentration on a small set of tokens in high-uncertainty reasoning steps. Based on this observation, we propose ReSET, a reasoning-step entropy-based temperature-scaling method that estimates step-level uncertainty online and adapts the decoding temperature using both token-level and step-level entropy signals. To address the latency gap, we further design a CUDA-core small-M NVFP4 kernel for latency-critical autoregressive decoding. Across reasoning benchmarks and model scales, ReSET improves NVFP4 reasoning accuracy by up to ~2 points over the NVFP4 baseline. Our CUDA-core small-M kernel further improves latency-critical decoding, delivering up to 2.5× kernel-level speedup over NVFP4 vLLM and approximately 2× end-to-end decoding speedup over BF16. Code is available at https://github.com/aiha-lab/ReSET.
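The abstract does not state the exact temperature rule, so the following is only a minimal sketch of the general idea, written under assumptions: it lowers the temperature at low-entropy tokens (to reduce incorrect sampling) and raises it when the running entropy of the current reasoning step is high (to counter over-concentration). The class name `StepAwareTemperature`, the thresholds, the temperature values, and the running-mean step estimate are all hypothetical and are not taken from the paper or its repository.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class StepAwareTemperature:
    """Hypothetical controller; every constant here is an illustrative assumption."""

    def __init__(self, base=1.0, low=0.7, high=1.2,
                 token_thresh=0.5, step_thresh=1.5):
        self.base = base                  # default decoding temperature
        self.low = low                    # used at confident (low-entropy) tokens
        self.high = high                  # used inside uncertain reasoning steps
        self.token_thresh = token_thresh  # token-entropy cutoff, in nats
        self.step_thresh = step_thresh    # step-entropy cutoff, in nats
        self.step_sum = 0.0
        self.step_count = 0

    def new_step(self):
        """Reset the running estimate; the caller decides where a step begins."""
        self.step_sum = 0.0
        self.step_count = 0

    def temperature(self, logits):
        """Pick a temperature from token-level and step-level entropy."""
        h_token = entropy(softmax(logits))
        self.step_sum += h_token
        self.step_count += 1
        h_step = self.step_sum / self.step_count  # online mean over the step
        if h_token < self.token_thresh:
            return self.low   # sharpen near-deterministic tokens
        if h_step > self.step_thresh:
            return self.high  # flatten when the whole step is uncertain
        return self.base
```

In a decoding loop, one would call `temperature(logits)` before sampling each token and `new_step()` whenever a reasoning-step boundary is detected; how the boundary is detected (for example, on a paragraph break) is likewise an assumption here.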
