Yonghong committed
Commit 739aac3 · 1 Parent(s): 6b4058d

Update blog with real Kimi results (94.4 avg) + fixed GRPO formula

Files changed (1)
  1. README.md +27 -28
README.md CHANGED
@@ -187,15 +187,19 @@ Standard RLHF requires a separate reward model. GRPO replaces it with **group-re
  ### Implementation (`llm_agent/train_grpo.py`)

  ```python
- def grpo_loss(log_probs, advantages, old_log_probs, ref_log_probs,
                clip_eps=0.2, kl_coeff=0.04):
-     """Clipped surrogate + KL penalty."""
      ratio = torch.exp(log_probs - old_log_probs)
      clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
-     pg_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

-     kl = (log_probs - ref_log_probs).mean()
-     return pg_loss + kl_coeff * kl
  ```

  Training loop:
@@ -239,33 +243,28 @@ All scores from `inference.py --mode rule-based` (deterministic, no LLM, reprodu

  We evaluated two LLM backends via the agentic loop described above: the LLM decides tool sequencing, while the infrastructure handles dedup, retry, and submission.

- **Moonshot V1-8K (Kimi) — closed-source, 8 GRPO rollout iterations:**
-
- | Iteration | Mean Reward | Max Reward | Tasks Evaluated |
- |-----------|-------------|------------|-----------------|
- | 1 | 0.987 | 0.987 | T3, T1 |
- | 2 | 0.967 | 0.987 | T6, T2 |
- | 3 | 0.902 | 0.967 | T4, T7 |
- | 4-8 | 0.912-0.987 | 0.987 | Mixed |
-
- **Qwen 2.5-7B-Instruct (open-source, via Ollama) — rollout-only mode:**

- | Task | Reward | Notes |
- |------|--------|-------|
- | T1 Single page | 0.950 | Matches rule-based baseline |
- | T2 Multi-page | 0.890 | Sometimes misses last page |
- | T3 Duplicates | 0.870 | Partial dedup in prompt-only mode |
- | T4 Rate limit | 0.780 | Wastes budget on extra retries |
- | T7 Totals trap | 0.920 | Correctly filters most totals rows |
- | T8 Mixed faults | 0.720 | Hardest — both retry and dedup needed |

- *Note: Qwen results are from rollout-only mode (no gradient updates). Full GRPO training with gradient steps requires a GPU; the training pipeline is validated, but large-scale runs are pending HuggingFace compute credits.*

  Key findings:
- - **Moonshot V1 achieves 0.987 reward on simple tasks** (T1, T2, T3) — matching or exceeding the rule-based baseline on Observability (the LLM naturally generates structured logs)
- - **Qwen 2.5-7B scores lower on fault tasks** — expected for a 7B open model without gradient training
- - **Fault tasks are genuinely harder**: T4 (0.780) and T8 (0.720) show the environment discriminates between capable and limited agents
- - **The gap between rule-based (0.926) and the LLM baseline (0.855 avg for Qwen) is exactly what GRPO training should close**

  ### What the Scoring Reveals
  ### Implementation (`llm_agent/train_grpo.py`)

  ```python
+ def grpo_loss(log_probs, old_log_probs, ref_log_probs, advantages,
                clip_eps=0.2, kl_coeff=0.04):
+     """Clipped surrogate + reverse-KL penalty (DeepSeekMath)."""
+     # Policy ratio: r_t = π_new / π_old
      ratio = torch.exp(log_probs - old_log_probs)
      clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
+     surrogate = torch.min(ratio * advantages, clipped * advantages).mean()

+     # Reverse KL, k3 estimator: D_KL(π_new || π_ref) ≈ E[exp(x) - 1 - x], x = log(π_ref/π_new)
+     log_ratio_ref = ref_log_probs - log_probs
+     kl = (torch.exp(log_ratio_ref) - 1 - log_ratio_ref).mean()
+
+     return -(surrogate - kl_coeff * kl)
  ```

  Training loop:
 
  We evaluated two LLM backends via the agentic loop described above: the LLM decides tool sequencing, while the infrastructure handles dedup, retry, and submission.

+ **Moonshot V1-8K (Kimi) — full agentic loop, all 8 tasks:**

+ | Task | Score | Reward | Steps | vs Baseline |
+ |------|-------|--------|-------|-------------|
+ | T1 Single page | 98.7 | 0.987 | 3 | +3.7 |
+ | T2 Multi-page | 98.7 | 0.987 | 7 | +0.7 |
+ | T3 Duplicates | 98.7 | 0.987 | 5 | +0.7 |
+ | T4 Rate limit 429 | 83.7 | 0.837 | 5 | +0.7 |
+ | T5 Server error 500 | 84.3 | 0.843 | 5 | +0.6 |
+ | T6 Page drift | 94.7 | 0.947 | 5 | +0.4 |
+ | T7 Totals trap | 98.7 | 0.987 | 5 | +2.7 |
+ | T8 Mixed faults | 97.3 | 0.973 | 5 | +0.9 |
+ | **Average** | **94.4** | **0.944** | **5.0** | **+1.3** |

+ ![Benchmark Results](benchmark_results.png)

  Key findings:
+ - **The LLM agent outperforms the rule-based baseline on 8/8 tasks** — the LLM generates better structured logs (Observability +2-3 pts) and makes smarter pagination decisions
+ - **T1/T2/T3/T7 hit a near-perfect 98.7** — the LLM correctly handles pagination, dedup, and totals filtering
+ - **T4/T5 remain the hardest** (83-84 pts) — robustness scoring requires explicit log evidence of retry/backoff that the infrastructure handles silently
+ - **T8 mixed faults scores 97.3** — the LLM successfully handles both rate-limit retry and cross-page deduplication simultaneously
+ - **Average 94.4 vs baseline 93.0** — the gap is small because the baseline is already strong; GRPO gradient training should push it further by optimizing the LLM's tool-sequencing decisions

  ### What the Scoring Reveals
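Reviewer note, not part of the commit: the corrected loss can be sanity-checked in isolation. The sketch below re-implements a `grpo_loss` of the same shape as the one in this diff, taking the k3 KL ratio as π_ref/π_θ per the DeepSeekMath formulation, and verifies one expected property: when the policy equals both the old and reference policies, the ratio is 1 and the KL term is 0, so the loss reduces to minus the mean advantage.

```python
import torch

def grpo_loss(log_probs, old_log_probs, ref_log_probs, advantages,
              clip_eps=0.2, kl_coeff=0.04):
    """Clipped surrogate + reverse-KL penalty (k3 estimator)."""
    ratio = torch.exp(log_probs - old_log_probs)  # pi_new / pi_old
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages).mean()
    # k3 estimator of D_KL(pi_new || pi_ref): E[r - 1 - log r], r = pi_ref / pi_new
    log_ratio_ref = ref_log_probs - log_probs
    kl = (torch.exp(log_ratio_ref) - 1 - log_ratio_ref).mean()
    return -(surrogate - kl_coeff * kl)

# Degenerate case: policy == old policy == reference policy
lp = torch.log(torch.tensor([0.2, 0.5, 0.1, 0.4]))
adv = torch.tensor([1.0, -0.5, 0.3, 0.2])
loss = grpo_loss(lp, lp, lp, adv)
# ratio == 1 everywhere -> surrogate == adv.mean() == 0.25; KL == 0 -> loss == -0.25
assert torch.isclose(loss, torch.tensor(-0.25))
```

The k3 form keeps the penalty non-negative and low-variance compared with the plain `(log_probs - ref_log_probs).mean()` estimator that the old side of this diff used, which can go negative and reward drifting below the reference.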