### GRPO Rollout Training Curve (8 iterations, Moonshot V1-8K)
We ran 8 iterations of GRPO-style rollouts with group_size=2, sampling 2 random tasks per iteration. Each rollout is a full agentic episode with real LLM tool-calling decisions.
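The rollout loop above can be sketched as follows; `run_episode`, the task list, and the reward range are illustrative stand-ins, not the repo's actual harness API:

```python
import random
import statistics

TASKS = [f"T{i}" for i in range(1, 9)]   # T1..T8
GROUP_SIZE = 2
N_ITERATIONS = 8

def run_episode(task: str) -> float:
    """Placeholder for one full agentic episode; returns a scalar reward."""
    return random.uniform(0.85, 0.99)

history = []
for it in range(N_ITERATIONS):
    sampled = random.sample(TASKS, 2)            # 2 random tasks per iteration
    for task in sampled:
        group = [run_episode(task) for _ in range(GROUP_SIZE)]
        mean_r = statistics.mean(group)
        # GRPO-style group-relative advantage: each rollout is scored
        # against its own group's mean, with no learned value baseline.
        advantages = [r - mean_r for r in group]
        history.append({"iter": it, "task": task,
                        "rewards": group, "advantages": advantages})
```

By construction the advantages within each group sum to zero, which is what makes the group mean act as the baseline.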
*(Figure: GRPO rollout training curve, two panels.)*
The left chart shows reward across iterations with min-max range and rolling average. The right chart shows per-task mean reward across all iterations where that task appeared. The orange dotted line marks the rule-based baseline (0.930).
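A minimal sketch of the aggregation behind the two charts; the reward values here are illustrative placeholders, not the run's actual numbers:

```python
from collections import defaultdict

iter_rewards = [0.95, 0.88, 0.97, 0.91, 0.96, 0.93, 0.98, 0.94]  # mean per iteration

def rolling_avg(xs, window=3):
    """Trailing rolling average, shrinking the window at the start."""
    out = []
    for i in range(len(xs)):
        chunk = xs[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

smoothed = rolling_avg(iter_rewards)

# Right chart: per-task mean over all iterations where the task appeared.
observations = [("T1", 0.987), ("T4", 0.850), ("T1", 0.987), ("T8", 0.973)]
per_task = defaultdict(list)
for task, reward in observations:
    per_task[task].append(reward)
task_means = {t: sum(rs) / len(rs) for t, rs in per_task.items()}

BASELINE = 0.930  # rule-based baseline, the orange dotted line
iters_above = sum(r > BASELINE for r in iter_rewards)
```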
Key observations:
- **Mean reward consistently above baseline** (0.930) in 6/8 iterations
- **Iterations with fault tasks (T4/T5) pull the mean down** — these are genuinely harder and require the agent to handle 429/500 errors gracefully
- **T8 mixed faults achieves 0.973** — demonstrating the LLM can handle combined rate-limit + dedup challenges
- **Per-task variance is low** (small error bars) — the agent's behavior is consistent across rollouts
Key findings:
- **LLM agent outperforms rule-based baseline on 8/8 tasks** — the LLM generates better structured logs (Observability +2-3 pts) and makes smarter pagination decisions
- **T1/T2/T3/T7 hit near-perfect 98.7** — the LLM correctly handles pagination, dedup, and totals filtering
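The pagination-plus-dedup behavior these findings credit can be sketched as cursor-following with an id set; the page shape (`items` / `next_cursor`) is an assumed convention, not the repo's actual tool schema:

```python
def collect_all(fetch_page):
    """Follow pagination cursors to exhaustion, deduplicating by item id."""
    seen = set()
    items = []
    cursor = None
    while True:
        page = fetch_page(cursor)
        for item in page["items"]:
            if item["id"] not in seen:   # dedup across overlapping pages
                seen.add(item["id"])
                items.append(item)
        cursor = page.get("next_cursor")
        if cursor is None:               # last page reached
            break
    return items

# Toy pages with one id repeated across the page boundary.
_pages = {
    None: {"items": [{"id": 1}, {"id": 2}], "next_cursor": "p2"},
    "p2": {"items": [{"id": 2}, {"id": 3}], "next_cursor": None},
}
result = collect_all(lambda c: _pages[c])
```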