SEGAgentRL/LLDS-R-GRPO-Qwen2.5-3B-Base

Reinforcement Learning

text-generation

QuestionAnswering

text-generation-inference


dwenlong committed on Jan 15

Commit aef9fb2 · verified · 1 Parent(s): 29300e7

Update README.md

Files changed (1)

README.md +1 -0

@@ -29,6 +29,7 @@ It prevents training collapse by regularizing **only when** the likelihood of (g
 - We identify **Lazy Likelihood Displacement (LLD)** as a key mechanism behind collapse in tool-integrated GRPO training.
 - LLDS activates **selectively**: it penalizes likelihood reduction on a *preserving set* (e.g., non-negative-advantage actions).
 - We release our **LLDS-tuned Qwen2.5-3B-Base** checkpoint for search-integrated reasoning and QA.
+- **A refers to the action-level gate** and R refers to the response-level gate; **the action-level (A) gate achieves the best performance**.
 ## 🔍 Tool-Integrated Search Inference (Search-R1 style)