arxiv:2601.11061

Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs

Published on Jan 16 · Submitted by Ruizhe Li on Jan 20

Authors: Ruizhe Li, Jiahui Geng, Wenxi Li, Chris Lee

Abstract

Spurious rewards in reinforcement learning with verifiable rewards trigger a memorization shortcut in LLMs, identified through neural circuit analysis and causal steering techniques.

AI-generated summary

Reinforcement Learning with Verifiable Rewards (RLVR) is highly effective for enhancing LLM reasoning, yet recent evidence shows models like Qwen 2.5 achieve significant gains even with spurious or incorrect rewards. We investigate this phenomenon and identify a "Perplexity Paradox": spurious RLVR triggers a divergence where answer-token perplexity drops while prompt-side coherence degrades, suggesting the model is bypassing reasoning in favor of memorization. Using Path Patching, Logit Lens, JSD analysis, and Neural Differential Equations, we uncover a hidden Anchor-Adapter circuit that facilitates this shortcut. We localize a Functional Anchor in the middle layers (L18-20) that triggers the retrieval of memorized solutions, followed by Structural Adapters in later layers (L21+) that transform representations to accommodate the shortcut signal. Finally, we demonstrate that scaling specific MLP keys within this circuit allows for bidirectional causal steering, artificially amplifying or suppressing contamination-driven performance. Our results provide a mechanistic roadmap for identifying and mitigating data contamination in RLVR-tuned models. Code is available at https://github.com/idwts/How-RLVR-Activates-Memorization-Shortcuts.
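To make the "Perplexity Paradox" concrete, here is a minimal Python sketch of how one might compare prompt-side and answer-side perplexity before and after tuning. It is not taken from the paper's repository. The function names, the use of rising prompt perplexity as a stand-in for "prompt-side coherence degrades", and the toy log-probabilities are all assumptions made for illustration.

```python
import math

def perplexity(logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(logprobs) / len(logprobs))

def split_perplexity(token_logprobs, prompt_len):
    """Return (prompt_ppl, answer_ppl) for one sequence.

    token_logprobs: natural-log probabilities the model assigned to each
    token of prompt + answer, in order (hypothetical input format).
    prompt_len: number of leading tokens that belong to the prompt.
    """
    if not 0 < prompt_len < len(token_logprobs):
        raise ValueError("prompt_len must leave tokens on both sides")
    prompt_part = token_logprobs[:prompt_len]
    answer_part = token_logprobs[prompt_len:]
    return perplexity(prompt_part), perplexity(answer_part)

def shows_divergence(before, after):
    """True if answer perplexity fell while prompt perplexity rose.

    before / after: (prompt_ppl, answer_ppl) tuples, e.g. for a base
    checkpoint and a spuriously rewarded checkpoint. Treating higher
    prompt perplexity as "degraded coherence" is an assumption here.
    """
    prompt_before, answer_before = before
    prompt_after, answer_after = after
    return answer_after < answer_before and prompt_after > prompt_before

# Toy numbers, invented for illustration only (not results from the paper).
base_logprobs = [math.log(0.5)] * 3 + [math.log(0.25)] * 2
tuned_logprobs = [math.log(0.25)] * 3 + [math.log(0.5)] * 2

base_split = split_perplexity(base_logprobs, prompt_len=3)
tuned_split = split_perplexity(tuned_logprobs, prompt_len=3)
print("base  (prompt, answer):", base_split)
print("tuned (prompt, answer):", tuned_split)
print("divergence:", shows_divergence(base_split, tuned_split))
```

In practice the per-token log-probabilities would come from scoring the same benchmark items with both checkpoints; the split point is simply where the prompt ends and the reference answer begins.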

Community

rzdiversity

Paper author Paper submitter about 11 hours ago

RLVR is the secret sauce for reasoning models, but it has a dark side. The Spurious Rewards Paradox reveals how models exploit latent contamination to achieve SOTA benchmark results without genuine reasoning. By identifying the specific Anchor-Adapter circuit, our paper shows we can now causally steer a model's reliance on shortcuts. Check out the code (https://github.com/idwts/How-RLVR-Activates-Memorization-Shortcuts) and the circuit analysis in our paper to see how reasoning might just be hidden memorization in disguise.

avahal

about 7 hours ago

arXivlens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/spurious-rewards-paradox-mechanistically-understanding-how-rlvr-activates-memorization-shortcuts-in-llms-5861-d1dfddca

Executive Summary
Detailed Breakdown
Practical Applications
