Deeper is Not Always Better: Mitigating the Alignment Tax via Confident Layer Decoding
Abstract
Autoregressive generation in large language models traditionally uses the final layer for token prediction, but a new decoding strategy dynamically selects more reliable intermediate layers based on entropy-guided search, improving reasoning performance with minimal computational overhead.
Autoregressive generation in large language models (LLMs) conventionally decodes from the final layer, assuming that deeper representations yield more reliable next-token predictions. We revisit this assumption by revealing a recurring Guess-Refine-Perturb dynamic: early layers form coarse guesses, intermediate layers refine reasoning-relevant semantics, and final layers can perturb these refined predictions toward generic or alignment-preferred tokens. We introduce Confident Decoding, a training-free decoding strategy that dynamically selects the most reliable near-final layer through entropy-guided conservative backward search. We further provide a theoretical formulation of layer selection as an optimal stopping problem, showing that under bounded projection noise and dominant late-stage alignment perturbation, our search rule filters perturbation while bounding the loss relative to the oracle refinement layer. Experiments across dense and Mixture-of-Experts LLMs demonstrate consistent gains on challenging reasoning benchmarks, including GPQA-Diamond, Omni-MATH, and HLE, with zero memory overhead and less than 2% latency increase. These results suggest dynamically bypassing final-layer perturbations can unlock stronger reasoning behavior from aligned LLMs.
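One way to make the entropy-guided selection concrete is sketched below. The notation is an assumption introduced here for illustration, not taken from the paper: h_\ell is the hidden state at layer \ell, W_U is the unembedding matrix, L is the final layer, and k is a hypothetical size for the near-final search window. The paper's actual rule is a conservative backward search, which may differ from the plain windowed minimum shown on the last line.

```latex
% Per-layer next-token distribution (logit-lens style projection; assumed form)
p_\ell(v \mid x_{<t}) = \mathrm{softmax}\big(W_U \, \mathrm{norm}(h_\ell)\big)_v

% Predictive entropy at layer \ell over the vocabulary \mathcal{V}
H_\ell = -\sum_{v \in \mathcal{V}} p_\ell(v \mid x_{<t}) \, \log p_\ell(v \mid x_{<t})

% Simplified selection: lowest-entropy layer within the last k layers
\hat{\ell} = \arg\min_{L-k \,\le\, \ell \,\le\, L} H_\ell
```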
Community
💡 Deeper is Not Always Better: Bypassing the "Alignment Tax" in LLMs
Standard practice assumes that the deeper a layer is in an autoregressive LLM, the more accurate its token representation becomes. In our latest collaborative research at the Qwen Team, we show this isn't always true.
Through an information-theoretic analysis of residual streams, we exposed a recurring Guess-Refine-Perturb phase structure in aligned models. While intermediate layers crystallize highly accurate logical and semantic reasoning, dense post-training alignment (e.g. RLHF or DPO) forces low-rank steering perturbations in the final layers. For complex scientific or mathematical problems, this causes an "Alignment Tax"—dragging pristine reasoning back toward generic, hyper-frequent filler words.
To solve this without retraining, we present Confident Decoding:
- Entropy Valley Tracking: Uses an entropy-guided, conservative backward search to dynamically decode tokens at the peak of model confidence before late-stage steering conflicts arise.
- Universal Efficacy: Tested across dense and MoE families (Qwen3.5, Gemma-4, gpt-oss), with gains on frontier benchmarks: up to +22.4% on the Omni-MATH Level 4 category, and absolute improvements of +9.4% on LiveCodeBench and +6.5% on GPQA-Diamond.
- Production Viability: Requires zero modification to the core forward pass or KV Cache. It functions natively inside high-throughput engines like vLLM with less than 2% wall-clock latency overhead.
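The backward search described in the list above can be illustrated with a minimal sketch. This is not the released implementation: the function names, the `window` and `margin` parameters, and the stopping rule (stop at the first earlier layer that is not clearly more confident) are all assumptions made for illustration. It takes per-layer logits for one decoding step as plain Python lists.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_layer(layer_logits: Sequence[Sequence[float]],
                 window: int = 3, margin: float = 0.05) -> int:
    """Pick a decoding layer by walking backward from the final layer.

    Hypothetical rule: move to an earlier layer only if its entropy is lower
    than the current best by more than `margin`, look back at most `window`
    layers, and stop at the first layer that does not improve.
    """
    final = len(layer_logits) - 1
    best, best_h = final, entropy(softmax(layer_logits[final]))
    for layer in range(final - 1, max(final - window, 0) - 1, -1):
        h = entropy(softmax(layer_logits[layer]))
        if h < best_h - margin:
            best, best_h = layer, h
        else:
            break  # conservative: do not skip past a less confident layer
    return best


def confident_decode(layer_logits: Sequence[Sequence[float]],
                     window: int = 3, margin: float = 0.05) -> Tuple[int, int]:
    """Return (selected layer index, greedy token id at that layer)."""
    layer = select_layer(layer_logits, window, margin)
    logits = layer_logits[layer]
    token = max(range(len(logits)), key=logits.__getitem__)
    return layer, token


# Toy example: five layers, vocabulary of three tokens. Layer 3 is sharply
# peaked, while the final layer (index 4) is flatter.
toy = [[0, 0, 0], [1, 0, 0], [3, 0, 0], [6, 0, 0], [2, 1.5, 0]]
print(confident_decode(toy))
```

In the toy input the search leaves the final layer for layer 3 and then stops, because layer 2 is less confident than layer 3. Setting `window=0` falls back to ordinary final-layer decoding.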
Optimizing where to stop inside the network opens up a new vertical paradigm for test-time compute (TTC).
Paper: https://arxiv.org/pdf/2606.21906
Project: https://github.com/QwenLM/Confident-Decoding
Qwen3.7-Max/Plus is already live as a closed API — any plans for open-weight releases of the 3.7 family? (like 3.6-35B-A3B / 3.6-27B alongside 3.6-Max)
Would love to run it locally via llama.cpp / GGUF.
Absolutely will do.
The 'Guess-Refine-Perturb' dynamic in Confident Decoding is a refreshing take on the alignment tax. Most of us just accept that the final layer is the 'truth', but the idea that alignment often manifests as a late-stage perturbation toward generic tokens is a critical insight for anyone trying to squeeze more raw reasoning out of a model.
From an engineering perspective, a training-free decoding strategy that uses entropy to pick the layer is a huge win—it's the kind of low-overhead tweak that actually moves the needle on deployability without needing a full retraining cycle. I'm curious to see how this holds up across different model architectures (e.g., MoE vs Dense) where the layer dynamics might differ. Definitely worth testing on local weights to see if we can recover 'lost' capabilities without breaking the safety guardrails.
Great point!
@XUANMINGZHANG, here is how I see it; I hope you can convince the management at Alibaba.
Right now Qwen models have the best capability-to-parameter ratio, but if you lock your models down to hosted-only access, almost 99% of Western companies will not use them, both because of politics and because traffic would be routed through Chinese servers.
Why not release extremely capable models at 70B dense or 122B MoE to demonstrate your technical superiority, then keep the best Opus-class model locked in your cloud? That way, employees at Western companies will be your inside advocates and will also boost your company's technical profile (and hence its valuation).
Don't keep the people who support your team and Alibaba in the dark. I think you guys have a real shot at winning this AI war. Western companies like Anthropic, OpenAI or Google are just smoke and mirrors; their models need to be 2x to 3x the size, and are highly inefficient, to achieve the same capabilities.