Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL
Abstract
Adaptive Layerwise Perturbation (ALP) addresses policy staleness and training-inference mismatch in large language model reinforcement learning by injecting learnable perturbations into hidden states to stabilize training and improve exploration.
Off-policy problems such as policy staleness and training-inference mismatch have become a major bottleneck for training stability and further exploration in LLM RL. As inference-efficiency optimizations widen the distribution gap between the inference policy and the updated policy, importance ratios become heavy-tailed. Heavy-tailed ratios arise when the policy is locally sharp, which further inflates gradients and can push updates outside the trust region. To address this, we propose Adaptive Layerwise Perturbation (ALP), which injects small learnable perturbations into the input hidden states of each layer during updates; the resulting perturbed policy serves as the numerator of the importance ratio against the unchanged inference policy in the objective. Intuitively, by adding controlled noise to intermediate representations, ALP prevents the updated policy from deviating too sharply from the inference policy and enlarges the policy family to cover the inference policy under mismatch noise. The flattened distribution thus naturally tightens the gap between the updated and inference policies and reduces the tail of the importance ratios, maintaining training stability. Experiments on single-turn math and multi-turn tool-integrated reasoning tasks show that ALP not only improves final performance but also avoids blow-ups of the importance-ratio tail and KL spikes during iterative training, while boosting exploration. Ablations show that representation-level perturbations across all layers are most effective, substantially outperforming partial-layer and logits-only variants.
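The mechanism described in the abstract can be sketched as a toy example: each layer's input hidden state receives a small learnable perturbation, and the perturbed policy forms the numerator of the importance ratio against the unperturbed inference policy. This is a minimal illustration under stated assumptions, not the paper's implementation; `PerturbedLayer`, `init_scale`, and the use of plain linear layers as stand-ins for transformer blocks are all hypothetical choices.

```python
import torch
import torch.nn as nn

class PerturbedLayer(nn.Module):
    """Wrap a layer and add a small learnable perturbation to its input
    hidden state (name and init scale are illustrative assumptions)."""
    def __init__(self, layer, hidden_size, init_scale=1e-3):
        super().__init__()
        self.layer = layer
        # one learnable perturbation vector per wrapped layer
        self.delta = nn.Parameter(init_scale * torch.randn(hidden_size))

    def forward(self, h):
        return self.layer(h + self.delta)

torch.manual_seed(0)
hidden, vocab = 16, 10
blocks = [nn.Linear(hidden, hidden) for _ in range(3)]  # stand-ins for transformer layers
head = nn.Linear(hidden, vocab)

# updated (perturbed) policy: every layer gets its own perturbation
update_policy = nn.Sequential(*[PerturbedLayer(b, hidden) for b in blocks])
x = torch.randn(2, hidden)
logp_update = head(update_policy(x)).log_softmax(-1)

# unchanged inference policy: same weights, no perturbation
with torch.no_grad():
    logp_infer = head(nn.Sequential(*blocks)(x)).log_softmax(-1)

# tokenwise importance ratio: perturbed update policy over inference policy
ratio = (logp_update - logp_infer).exp()
assert ratio.shape == (2, vocab) and torch.isfinite(ratio).all()
```

Because the perturbations are initialized near zero, the ratio starts close to 1 and the gap between the two policies stays controlled as the perturbations are learned jointly with the policy weights.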
Community
The following papers were recommended by the Semantic Scholar API
- Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs (2026)
- VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training (2026)
- A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization (2026)
- Rethinking the Trust Region in LLM Reinforcement Learning (2026)
- LLMs Can Learn to Reason Via Off-Policy RL (2026)
- SOUP: Token-level Single-sample Mix-policy Reinforcement Learning for Large Language Models (2026)
- QuRL: Efficient Reinforcement Learning with Quantized Rollout (2026)