LLDS-Search
Collection
On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral
12 items
We aim to improve agent reinforcement learning along three axes: stability (S), efficiency (E), and generalization (G).