Update README.md
README.md
CHANGED
@@ -29,6 +29,7 @@ It prevents training collapse by regularizing **only when** the likelihood of (g
 - We identify **Lazy Likelihood Displacement (LLD)** as a key mechanism behind collapse in tool-integrated GRPO training.
 - LLDS activates **selectively**: it penalizes likelihood reduction on a *preserving set* (e.g., non-negative-advantage actions).
 - We release our **LLDS-tuned Qwen2.5-3B-Base** checkpoint for search-integrated reasoning and QA.
+- **A refers to the action-level gate** and R refers to the response-level gate; **the action-level (A) gate achieves the best performance**.
 
 
 ## 🔍 Tool-Integrated Search Inference (Search-R1 style)
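
The difference between the two gates named in the added bullet can be illustrated with a minimal Python sketch. This is not the repository's implementation: the function name `llds_penalty`, the segment representation, the choice of preserving set, and the penalty form (a sum of positive per-token log-probability drops) are all assumptions made for illustration. A response is modelled as a list of action segments, each holding per-token log-probabilities under the old and the updated policy. The response-level (R) gate makes one on/off decision for the whole response, while the action-level (A) gate decides separately for each action segment.

```python
from typing import List, Tuple

# One action segment: (old_logps, new_logps), i.e. per-token log-probabilities
# under the old and the updated policy. Hypothetical representation.
Segment = Tuple[List[float], List[float]]


def llds_penalty(segments: List[Segment], advantage: float, gate: str = "A") -> float:
    """Illustrative selective penalty on likelihood reduction (assumed form)."""
    if gate not in ("A", "R"):
        raise ValueError("gate must be 'A' or 'R'")

    # Preserving set (assumption): only responses with non-negative advantage.
    if advantage < 0:
        return 0.0

    def drop(seg: Segment) -> float:
        # Total log-likelihood lost by this segment under the updated policy.
        old, new = seg
        return sum(old) - sum(new)

    if gate == "R":
        # Response-level gate: a single decision for the whole response.
        if sum(drop(s) for s in segments) <= 0:
            return 0.0
        active = segments
    else:
        # Action-level gate: keep only segments whose likelihood decreased.
        active = [s for s in segments if drop(s) > 0]

    # Within the active segments, penalize only tokens whose
    # log-probability actually decreased.
    return sum(max(0.0, o - n) for old, new in active for o, n in zip(old, new))


# Toy data: the first action gained likelihood, the second lost the same amount.
segments = [([-1.0, -2.0], [-0.5, -1.0]), ([-1.0, -1.0], [-2.0, -1.5])]
print(llds_penalty(segments, advantage=1.0, gate="A"))  # 1.5
print(llds_penalty(segments, advantage=1.0, gate="R"))  # 0.0
```

In this toy case the response-level gate stays off because the gain in the first action cancels the loss in the second, whereas the action-level gate still penalizes the second action. The sketch only shows how the two gates can differ; it says nothing about why the action-level gate performs best in the reported results.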