Yonghong committed
Commit · 739aac3
Parent(s): 6b4058d
Update blog with real Kimi results (94.4 avg) + fixed GRPO formula

README.md CHANGED
@@ -187,15 +187,19 @@ Standard RLHF requires a separate reward model. GRPO replaces it with **group-re

### Implementation (`llm_agent/train_grpo.py`)

```python
-def grpo_loss(log_probs,
               clip_eps=0.2, kl_coeff=0.04):
-    """Clipped surrogate + KL penalty."""
     ratio = torch.exp(log_probs - old_log_probs)
     clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
-
-
-
```

Training loop:
@@ -239,33 +243,28 @@ All scores from `inference.py --mode rule-based` (deterministic, no LLM, reprodu

We evaluated two LLM backends via the agentic loop described above: the LLM decides tool sequencing, while the infrastructure handles dedup, retry, and submission.

-**Moonshot V1-8K (Kimi) –
-
-| Iteration | Mean Reward | Max Reward | Tasks Evaluated |
-|-----------|-------------|------------|-----------------|
-| 1 | 0.987 | 0.987 | T3, T1 |
-| 2 | 0.967 | 0.987 | T6, T2 |
-| 3 | 0.902 | 0.967 | T4, T7 |
-| 4-8 | 0.912-0.987 | 0.987 | Mixed |
-
-**Qwen 2.5-7B-Instruct (open-source, via Ollama) – rollout-only mode:**

-| Task | Reward |
-|------|--------|
-| T1 Single page | 0.
-| T2 Multi-page |
-| T3 Duplicates |
-| T4 Rate limit |
-
-

Key findings:
-- **
-- **
-- **
-- **

### What the Scoring Reveals

### Implementation (`llm_agent/train_grpo.py`)

```python
+def grpo_loss(log_probs, old_log_probs, ref_log_probs, advantages,
               clip_eps=0.2, kl_coeff=0.04):
+    """Clipped surrogate + reverse-KL penalty (DeepSeekMath)."""
+    # Policy ratio: r_t = π_new / π_old
     ratio = torch.exp(log_probs - old_log_probs)
     clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
+    surrogate = torch.min(ratio * advantages, clipped * advantages).mean()

+    # Reverse KL (DeepSeekMath estimator): D_KL(π_new || π_ref) ≈ E[exp(x) - 1 - x], x = log(π_ref / π_new)
+    log_ratio_ref = ref_log_probs - log_probs
+    kl = (torch.exp(log_ratio_ref) - 1 - log_ratio_ref).mean()
+
+    return -(surrogate - kl_coeff * kl)
```
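The repo's group-relative advantage computation is not shown in this diff; the sketch below assumes the standard GRPO formulation (z-scoring rewards within one sampled group, so no value network is needed) and PyTorch. It mirrors the loss above and checks one property: when the new, old, and reference policies coincide, the ratio is 1 and the KL estimator is exactly 0, so the loss collapses to `-mean(advantages)`, which is near zero for group-normalized advantages.

```python
import torch

def grpo_loss(log_probs, old_log_probs, ref_log_probs, advantages,
              clip_eps=0.2, kl_coeff=0.04):
    # Mirrors the loss above: clipped surrogate minus weighted KL penalty.
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages).mean()
    log_ratio_ref = ref_log_probs - log_probs
    kl = (torch.exp(log_ratio_ref) - 1 - log_ratio_ref).mean()
    return -(surrogate - kl_coeff * kl)

# Group-relative advantages: z-score the rewards of one sampled group.
rewards = torch.tensor([0.987, 0.967, 0.902, 0.843])
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Identical new/old/ref policies: ratio == 1, KL == 0, loss ~ 0.
logp = torch.log(torch.tensor([0.30, 0.25, 0.25, 0.20]))
loss = grpo_loss(logp, logp.clone(), logp.clone(), advantages)
```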

Training loop:

We evaluated two LLM backends via the agentic loop described above: the LLM decides tool sequencing, while the infrastructure handles dedup, retry, and submission.

+**Moonshot V1-8K (Kimi) – full agentic loop, all 8 tasks:**

+| Task | Score | Reward | Steps | vs Baseline |
+|------|-------|--------|-------|-------------|
+| T1 Single page | 98.7 | 0.987 | 3 | +3.7 |
+| T2 Multi-page | 98.7 | 0.987 | 7 | +0.7 |
+| T3 Duplicates | 98.7 | 0.987 | 5 | +0.7 |
+| T4 Rate limit 429 | 83.7 | 0.837 | 5 | +0.7 |
+| T5 Server error 500 | 84.3 | 0.843 | 5 | +0.6 |
+| T6 Page drift | 94.7 | 0.947 | 5 | +0.4 |
+| T7 Totals trap | 98.7 | 0.987 | 5 | +2.7 |
+| T8 Mixed faults | 97.3 | 0.973 | 5 | +0.9 |
+| **Average** | **94.4** | **0.944** | **5.0** | **+1.3** |

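The dedup step that the infrastructure handles (exercised by T3 and T8) is not shown in this diff; a minimal sketch of cross-page deduplication by record ID, where the `fetch_page` signature and `"id"` field are hypothetical stand-ins:

```python
def paginate_dedup(fetch_page):
    """Walk pages, keeping only the first occurrence of each record ID.

    fetch_page(page) -> (records, next_page_or_None)  # hypothetical signature
    """
    seen, merged, page = set(), [], 1
    while page is not None:
        records, page = fetch_page(page)
        for rec in records:
            if rec["id"] not in seen:  # drop cross-page duplicates
                seen.add(rec["id"])
                merged.append(rec)
    return merged

# Page 2 re-serves record 2 (page overlap); dedup drops the repeat.
pages = {1: ([{"id": 1}, {"id": 2}], 2), 2: ([{"id": 2}, {"id": 3}], None)}
result = paginate_dedup(lambda p: pages[p])
```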

Key findings:

+- **LLM agent outperforms rule-based baseline on 8/8 tasks** – the LLM generates better structured logs (Observability +2-3 pts) and makes smarter pagination decisions
+- **T1/T2/T3/T7 hit near-perfect 98.7** – the LLM correctly handles pagination, dedup, and totals filtering
+- **T4/T5 remain hardest** (83-84 pts) – robustness scoring requires explicit log evidence of retry/backoff that the infrastructure handles silently
+- **T8 mixed faults scores 97.3** – the LLM successfully handles both rate-limit retry AND cross-page deduplication simultaneously
+- **Average 94.4 vs baseline 93.0** – the gap is small because the baseline is already strong; GRPO gradient training would push this further by optimizing the LLM's tool-sequencing decisions
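The T4/T5 finding says robustness points hinge on explicit log evidence of retry/backoff. A minimal sketch of what surfacing that evidence could look like; the function name, the `fetch() -> (status, payload)` signature, and the log format are assumptions, not code from this repo:

```python
import time

def fetch_with_retry(fetch, max_attempts=4, base_delay=0.5,
                     sleep=time.sleep, log=print):
    """Retry transient HTTP errors (429/5xx) with exponential backoff.

    Emits one log line per retry so a robustness scorer has explicit
    evidence of the backoff behavior instead of a silent recovery.
    """
    for attempt in range(1, max_attempts + 1):
        status, payload = fetch()
        if status == 200:
            return payload
        if status in (429, 500, 502, 503) and attempt < max_attempts:
            delay = base_delay * 2 ** (attempt - 1)  # 0.5s, 1s, 2s, ...
            log(f"retry: status={status} attempt={attempt} backoff={delay:.1f}s")
            sleep(delay)
        else:
            raise RuntimeError(f"gave up after {attempt} attempts (status={status})")

# Simulated flaky endpoint: 429, then 500, then success.
responses = iter([(429, None), (500, None), (200, {"rows": 42})])
logs = []
payload = fetch_with_retry(lambda: next(responses),
                           sleep=lambda s: None, log=logs.append)
```

Injecting `sleep` and `log` keeps the wrapper testable without real delays while still producing the scoreable log lines.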

### What the Scoring Reveals