On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral

⚡ Introduction

LLDS is a lightweight, likelihood-preserving regularizer designed to stabilize tool-integrated reinforcement learning (e.g., GRPO / Search-R1 style training). It prevents training collapse by regularizing only when the likelihood of a good action decreases, and only on the tokens responsible for that decrease.

We identify Lazy Likelihood Displacement (LLD) as a key mechanism behind collapse in tool-integrated GRPO training.
LLDS activates selectively: it penalizes likelihood reduction on a preserving set (e.g., non-negative-advantage actions).
We release our LLDS-tuned Qwen2.5-3B-Base checkpoint for search-integrated reasoning and QA.
In model names, A denotes the action-level gate and R the response-level gate; the action-level (A) gate achieves the best performance.
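
To make the selective activation concrete, here is a minimal sketch of a gated likelihood-preserving penalty. It is an illustration under stated assumptions, not the released training code: the function names, the data layout (one list of `(old_logprob, new_logprob)` pairs per action), and the exact gating rule (a segment is regularized only when its summed log-likelihood drops) are all hypothetical choices made to mirror the description above.

```python
from typing import List, Tuple

# One action = the tokens the policy generated between tool calls, stored as
# (old_logprob, new_logprob) pairs. This layout is an assumption of the sketch.
Action = List[Tuple[float, float]]


def token_drops(tokens: Action) -> float:
    """Sum of per-token log-likelihood decreases; increases contribute 0."""
    return sum(max(0.0, old - new) for old, new in tokens)


def decreased(tokens: Action) -> bool:
    """True when the summed log-likelihood of these tokens went down."""
    return sum(new for _, new in tokens) < sum(old for old, _ in tokens)


def llds_penalty(actions: List[Action], advantage: float, gate: str = "action") -> float:
    """Hypothetical penalty for one sampled response made of several actions."""
    # Preserving set: only non-negative-advantage responses are regularized.
    if advantage < 0:
        return 0.0
    if gate == "response":
        # Response-level gate (R): check the whole response at once.
        all_tokens = [pair for action in actions for pair in action]
        return token_drops(all_tokens) if decreased(all_tokens) else 0.0
    if gate == "action":
        # Action-level gate (A): check each action separately.
        return sum(token_drops(a) for a in actions if decreased(a))
    raise ValueError(f"unknown gate: {gate!r}")


# Two actions: the first became more likely overall, the second less likely.
response = [[(-1.0, -1.5), (-2.0, -1.0)], [(-0.5, -1.0)]]
action_level = llds_penalty(response, advantage=1.0, gate="action")
response_level = llds_penalty(response, advantage=1.0, gate="response")
```

In this toy example the response's total log-likelihood is unchanged, so the response-level gate stays closed, while the action-level gate still penalizes the second action, whose likelihood dropped. The penalty would then be added to the GRPO loss with some weight, which is left out here.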

🔍 Tool-Integrated Search Inference (Search-R1 style)

We support tool-integrated inference using the same workflow as Search-R1, where the LLM interacts with a local retrieval server for multi-step reasoning.

The pipeline consists of two parts:

1. Launch a local retrieval server
2. Run inference with the LLDS model

1️⃣ Launch the local retrieval server

Search-R1 recommends running the retriever in a separate environment.

conda activate retriever
bash retrieval_launch.sh

2️⃣ Run inference with LLDS-R-GRPO-Qwen2.5-3B-Base

conda activate searchr1
python infer.py

Before running, set the model name and the question (these variables are assumed to live in infer.py); the values shown are placeholders:

MODEL_NAME = "<YOUR_ORG>/<YOUR_MODEL_NAME>" # e.g. my-org/LLDS-R-GRPO-Qwen2.5-3B-Base

question = "Your question here"
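
For readers unfamiliar with tool-integrated inference, the following sketch shows the general shape of a generate, search, resume loop. It is not the contents of infer.py: the `<search>`, `<information>`, and `<answer>` tags follow the Search-R1 convention as an assumption, and `run_search_loop`, `fake_generate`, and `fake_retrieve` are hypothetical names, with the two stubs standing in for the model and the retrieval server.

```python
import re
from typing import Callable, List

SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def run_search_loop(
    question: str,
    generate: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    max_turns: int = 4,
) -> str:
    """Alternate between generation and retrieval until an answer appears."""
    prompt = question
    for _ in range(max_turns):
        completion = generate(prompt)
        prompt += completion
        answer = ANSWER_RE.search(completion)
        if answer:
            return answer.group(1).strip()
        query = SEARCH_RE.search(completion)
        if query is None:
            break  # neither a search request nor an answer: stop early
        docs = retrieve(query.group(1).strip())
        # Feed the retrieved passages back so the next turn can use them.
        prompt += "\n<information>" + "\n".join(docs) + "</information>\n"
    return ""


# Stubs standing in for the LLM and the retrieval server.
def fake_generate(prompt: str) -> str:
    if "<information>" in prompt:
        return "<answer>Paris</answer>"
    return "<search>capital of France</search>"


def fake_retrieve(query: str) -> List[str]:
    return ["Paris is the capital of France."]


result = run_search_loop("What is the capital of France?", fake_generate, fake_retrieve)
```

In a real setup, `generate` would wrap the LLDS checkpoint and `retrieve` would query the server started in step 1.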

📖 Citation

@article{deng2025grpo,
  title={On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral},
  author={Deng, Wenlong and Li, Yushu and Gong, Boying and Ren, Yi and Thrampoulidis, Christos and Li, Xiaoxiao},
  journal={arXiv preprint arXiv:2512.04220},
  year={2025}
}