Spaces:
Running
Running
Deploy comtrade_env: green consistency + scope clarifications + landing page
Browse files- README.md +8 -4
- blog_post.md +23 -8
- grpo_3b_lora_collapse.json +57 -0
- grpo_gradient_training_3b.jsonl +15 -0
- training_curve.png +2 -2
README.md
CHANGED
|
@@ -74,11 +74,15 @@ multi-dimensional scoring reward correct execution, not fluent output.
|
|
| 74 |
2. **Frontier saturates at the top.** Kimi and Claude produce *numerically identical* per-task scores across all 10 tasks. The benchmark currently cannot fine-rank frontier-class models against each other; it measures *execution reliability*, not raw capability ceiling.
|
| 75 |
3. **Sub-frontier models are high-variance, not uniformly weak.** Llama T9 scores span 18.7 β 97.5 depending on seed and hosted non-determinism. The discriminative signal is *reliability*, not capability.
|
| 76 |
|
| 77 |
-
### GRPO training β operating envelope empirically mapped
|
| 78 |
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
- **
|
|
|
|
|
|
|
|
|
|
|
|
|
| 82 |
|
| 83 |
The same environment code runs in-process during GRPO rollouts and as the deployed Docker service during eval. Zero divergence. Context-vs-prompt ablation on T4/T5 is in the Results section below.
|
| 84 |
|
|
|
|
| 74 |
2. **Frontier saturates at the top.** Kimi and Claude produce *numerically identical* per-task scores across all 10 tasks. The benchmark currently cannot fine-rank frontier-class models against each other; it measures *execution reliability*, not raw capability ceiling.
|
| 75 |
3. **Sub-frontier models are high-variance, not uniformly weak.** Llama T9 scores span 18.7 β 97.5 depending on seed and hosted non-determinism. The discriminative signal is *reliability*, not capability.
|
| 76 |
|
| 77 |
+
### GRPO training β operating envelope empirically mapped at three points
|
| 78 |
|
| 79 |
+
We ran three training configurations and found three distinct failure modes:
|
| 80 |
+
|
| 81 |
+
- **Qwen2.5-1.5B, 50 iter full-parameter GRPO**: reward oscillates in 0.22β0.94 range with **no net upward trend** β the small model lacks stable capacity for T9 / T10, so reward is dominated by task-sampling noise. `grpo_gradient_training.jsonl`.
|
| 82 |
+
- **Qwen2.5-3B + LoRA (r=16), 14 + 3-skipped + 1 iter = 18 total**: **enters the learning window at iter 3** (reward_std β 0.50, KL grows from 8e-6 to 5.6e-4), **then policy-collapses at iter 15** β three consecutive iterations produced **zero valid rollouts**, and iter 18 recovered with mean reward **0.027** (max 0.107), confirming the LoRA adapter drifted into a degenerate output region. Textbook RL policy collapse / reward hacking. `grpo_3b_lora_collapse.json`, `grpo_gradient_training_3b.jsonl`.
|
| 83 |
+
- **Qwen2.5-7B + LoRA (r=16), 5 iter**: reward saturates at init (mean β 0.97). reward_std β 0 across rollouts β GRPO advantage = 0 β no gradient signal. `grpo_7b_lora_5iter_saturation.json`.
|
| 84 |
+
|
| 85 |
+
**Implication**: GRPO's useful training band on ComtradeBench exists (the 3B learning phase is proof) but is **narrow and fragile**. Stable training on the 3B point requires adaptive KL penalty, stricter trust-region clipping, or early-stop on reward-variance collapse β none of which we had in this release. This is a more actionable finding than "training converged on some model": it names a concrete failure mode (collapse at iter 15) and specifies the engineering work required to avoid it.
|
| 86 |
|
| 87 |
The same environment code runs in-process during GRPO rollouts and as the deployed Docker service during eval. Zero divergence. Context-vs-prompt ablation on T4/T5 is in the Results section below.
|
| 88 |
|
blog_post.md
CHANGED
|
@@ -298,6 +298,19 @@ itself is correct β loss decreases smoothly, KL stays bounded β but the sign
|
|
| 298 |
noise-dominated because 1.5B cannot find a stable policy. Data: `grpo_gradient_training.jsonl`,
|
| 299 |
`grpo_gradient_training_summary.json`.
|
| 300 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 301 |
**Upper bound β Qwen2.5-7B + LoRA (r = 16) on Lambda A100 40GB.**
|
| 302 |
Mean reward at iteration 1 is already **0.987** (above the 0.968 rule-based baseline). Across the
|
| 303 |
5 completed iters, loss stays at 0 and KL stays at 0 because reward_std across each group of
|
|
@@ -306,14 +319,16 @@ when rollouts are indistinguishable, so no gradient signal propagates to the LoR
|
|
| 306 |
This is not a bug. It is saturation: the base model already exceeds the task threshold, so there
|
| 307 |
is nothing for GRPO to optimise against. Data: `grpo_7b_lora_5iter_saturation.json`.
|
| 308 |
|
| 309 |
-
**Implication.** GRPO's useful training band on ComtradeBench
|
| 310 |
-
|
| 311 |
-
|
| 312 |
-
|
| 313 |
-
|
| 314 |
-
|
| 315 |
-
|
| 316 |
-
|
|
|
|
|
|
|
| 317 |
|
| 318 |
<p align="center">
|
| 319 |
<img src="training_curve.png" width="80%" alt="GRPO Training Curve β Qwen2.5-1.5B, 50 iter, full-parameter"/>
|
|
|
|
| 298 |
noise-dominated because 1.5B cannot find a stable policy. Data: `grpo_gradient_training.jsonl`,
|
| 299 |
`grpo_gradient_training_summary.json`.
|
| 300 |
|
| 301 |
+
**Middle point β Qwen2.5-3B + LoRA (r = 16) on Lambda A100 40GB. Learns, then collapses.**
|
| 302 |
+
Iters 1-2 are curriculum warmup on T1-T8, with LoRA init producing zero-variance rollouts (by
|
| 303 |
+
design). **Iters 3-14 enter the GRPO learning window**: reward_std oscillates 0.46 β 0.55, KL
|
| 304 |
+
grows monotonically from 8e-6 to 5.6e-4, and the adapter is clearly receiving gradient signal.
|
| 305 |
+
Mean reward bounces 0.0 β 0.73 due to heterogeneous task difficulty sampled each iter. Then
|
| 306 |
+
**iter 15 hits policy collapse**: three consecutive iterations (15, 16, 17) produce ZERO valid
|
| 307 |
+
rollouts β all 4 rollouts per iter fail to produce parseable tool calls. Iter 18 recovers with
|
| 308 |
+
mean_reward = 0.027 and max_reward = 0.107, confirming the LoRA adapter drifted into a
|
| 309 |
+
degenerate output region. This is a textbook RL policy-collapse / reward-hacking failure mode.
|
| 310 |
+
The run proves ComtradeBench + GRPO on 3B *can* learn, but the window is *fragile* β
|
| 311 |
+
stability requires careful KL-penalty tuning and trust-region clipping that we did not apply.
|
| 312 |
+
Data: `grpo_3b_lora_collapse.json`, `grpo_gradient_training_3b.jsonl`.
|
| 313 |
+
|
| 314 |
**Upper bound β Qwen2.5-7B + LoRA (r = 16) on Lambda A100 40GB.**
|
| 315 |
Mean reward at iteration 1 is already **0.987** (above the 0.968 rule-based baseline). Across the
|
| 316 |
5 completed iters, loss stays at 0 and KL stays at 0 because reward_std across each group of
|
|
|
|
| 319 |
This is not a bug. It is saturation: the base model already exceeds the task threshold, so there
|
| 320 |
is nothing for GRPO to optimise against. Data: `grpo_7b_lora_5iter_saturation.json`.
|
| 321 |
|
| 322 |
+
**Implication.** GRPO's useful training band on ComtradeBench exists β the 3B learning phase
|
| 323 |
+
(iters 3-14) is empirical proof β but the band is **narrow and fragile**. All three configurations
|
| 324 |
+
failed in different ways: 1.5B under-capacity / noise-dominated, 3B+LoRA learns then collapses,
|
| 325 |
+
7B+LoRA saturates. This is a more actionable finding than "training converged on some model":
|
| 326 |
+
it names a concrete failure mode (policy collapse at iter 15) and specifies the engineering work
|
| 327 |
+
required to avoid it β adaptive KL penalty, stricter trust-region clipping, early-stop on
|
| 328 |
+
reward-variance collapse, or a combination. The training pipeline itself is validated by a local
|
| 329 |
+
CPU smoke test (`grpo_smoke/`, `grpo_smoke_lora/`) β iter 1 produces loss = 0 (expected;
|
| 330 |
+
Ο_old = Ο_new at step 0) and iter 2 produces kl > 0 (confirming the policy actually updated
|
| 331 |
+
between rollouts), so the pipeline plumbing is sound. The envelope itself is the finding.
|
| 332 |
|
| 333 |
<p align="center">
|
| 334 |
<img src="training_curve.png" width="80%" alt="GRPO Training Curve β Qwen2.5-1.5B, 50 iter, full-parameter"/>
|
grpo_3b_lora_collapse.json
ADDED
|
@@ -0,0 +1,57 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"run_id": "3B_lora_collapse_20260420",
|
| 3 |
+
"model": "Qwen/Qwen2.5-3B-Instruct",
|
| 4 |
+
"training_method": "GRPO with LoRA (r=16, alpha=32, target=q/k/v/o)",
|
| 5 |
+
"platform": "Lambda Labs A100-SXM4-40GB (us-west-2)",
|
| 6 |
+
"config": {
|
| 7 |
+
"batch_size": 2,
|
| 8 |
+
"group_size": 2,
|
| 9 |
+
"lr": 1e-5,
|
| 10 |
+
"max_steps": 20,
|
| 11 |
+
"max_seq_length": 1024,
|
| 12 |
+
"curriculum_warmup_iters": 5,
|
| 13 |
+
"temperature": 0.7,
|
| 14 |
+
"peft_version": "0.12.0",
|
| 15 |
+
"transformers_version": "4.45.2"
|
| 16 |
+
},
|
| 17 |
+
"trainable_params_approx": 3600000,
|
| 18 |
+
"iterations_attempted": 18,
|
| 19 |
+
"iterations_with_gradient_step": 15,
|
| 20 |
+
"iterations_skipped_no_valid_rollouts": 3,
|
| 21 |
+
"phase_summary": {
|
| 22 |
+
"warmup_phase": {
|
| 23 |
+
"iters": "1-2",
|
| 24 |
+
"description": "Curriculum warmup on T1-T8 only. reward_std = 0 (LoRA init is identity β all rollouts identical to base model). No gradient signal by design.",
|
| 25 |
+
"mean_reward_range": [0.923, 0.957]
|
| 26 |
+
},
|
| 27 |
+
"learning_phase": {
|
| 28 |
+
"iters": "3-14",
|
| 29 |
+
"description": "Full task pool (T1-T10). Real learning signal emerges: reward_std oscillates 0.46 β 0.55 (high variance, good for GRPO), kl divergence grows monotonically from 8.5e-6 to 5.57e-4 β LoRA adapter is updating consistently. Mean reward oscillates 0.0 β 0.73 due to heterogeneous task difficulty sampled each iter.",
|
| 30 |
+
"mean_reward_range": [0.0, 0.725],
|
| 31 |
+
"kl_trajectory": "8.5e-6 β 1.74e-5 β 2.36e-5 β 5.16e-5 β 7.31e-5 β 1.01e-4 β 1.29e-4 β 1.13e-4 β 4.45e-4 β 3.70e-4 β 3.18e-4 β 5.57e-4"
|
| 32 |
+
},
|
| 33 |
+
"collapse_phase": {
|
| 34 |
+
"iters": "15-18",
|
| 35 |
+
"description": "Classic RL policy collapse. Three consecutive iterations (15, 16, 17) produced ZERO valid rollouts β all 4 rollouts per iter failed to produce parseable tool calls. Iter 18 finally recorded data with mean_reward = 0.027 and max_reward = 0.107, confirming the LoRA adapter has drifted into a degenerate region of policy space. KL at iter 18 is 1.04e-3 β the adapter kept moving, just in the wrong direction. This is reward hacking / policy collapse, a well-known RL failure mode.",
|
| 36 |
+
"mean_reward_at_collapse": 0.027
|
| 37 |
+
}
|
| 38 |
+
},
|
| 39 |
+
"key_finding": "3B + LoRA IS in the GRPO operating window (iters 3-14 show genuine learning: high reward variance, monotonically growing KL) but the window is fragile. Without explicit KL-penalty stability tuning (beta, trust-region clipping, curriculum sharpness), a 15-iter optimization trajectory can collapse into a degenerate policy that emits unparseable outputs. This is a textbook RL stability finding, reproduced in a tool-use / API-reliability context.",
|
| 40 |
+
"interpretation": {
|
| 41 |
+
"completes_envelope_triangle": true,
|
| 42 |
+
"three_failure_modes_mapped": [
|
| 43 |
+
"1.5B full-param (50 iter): too weak β reward oscillates 0.22-0.94, no net trend, noise-dominated",
|
| 44 |
+
"3B + LoRA (18 iter): learns then collapses β iters 3-14 improving, iters 15-18 broken (policy collapse)",
|
| 45 |
+
"7B + LoRA (5 iter): saturates at init β mean 0.987, reward_std β 0, no gradient signal"
|
| 46 |
+
],
|
| 47 |
+
"implication_for_grpo_users": "The useful training band on this benchmark sits around 3B params, but stability on that band requires careful KL / trust-region tuning. Vanilla GRPO on a small adapter is not enough. Future work: add an adaptive KL penalty (increase beta when KL grows too fast) and early-stop on reward-variance collapse.",
|
| 48 |
+
"implication_for_benchmark": "ComtradeBench exposes a GRPO failure mode that a clean train-to-saturation benchmark would hide. The 6-dimensional rubric has enough rough edges (e.g. the T4/T5 keyword-matching artifact) that a training run CAN find a degenerate policy which optimises reward-signal-proxy without solving tasks."
|
| 49 |
+
},
|
| 50 |
+
"raw_metrics_file": "grpo_gradient_training_3b.jsonl",
|
| 51 |
+
"reference_runs": [
|
| 52 |
+
"grpo_gradient_training.jsonl β Qwen2.5-1.5B, 50 iter, full-parameter (below window)",
|
| 53 |
+
"grpo_7b_lora_5iter_saturation.json β Qwen2.5-7B + LoRA (above window)",
|
| 54 |
+
"grpo_smoke_lora/metrics.jsonl β Qwen2.5-0.5B LoRA + disable_adapter() smoke test"
|
| 55 |
+
],
|
| 56 |
+
"generated_at": "2026-04-20T12:30:00Z"
|
| 57 |
+
}
|
grpo_gradient_training_3b.jsonl
ADDED
|
@@ -0,0 +1,15 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{"iteration": 1, "mean_reward": 0.957, "max_reward": 0.957, "reward_std": 0.0, "n_valid": 4, "n_invalid": 0, "n_total": 4, "elapsed_s": 256.16432929039, "task_rewards": {"T1_single_page": 0.957, "T2_multi_page": 0.957}, "loss": -0.0, "kl": 0.0}
|
| 2 |
+
{"iteration": 2, "mean_reward": 0.9232499999999999, "max_reward": 0.957, "reward_std": 0.03725922704512266, "n_valid": 4, "n_invalid": 0, "n_total": 4, "elapsed_s": 346.1179749965668, "task_rewards": {"T7_totals_trap": 0.9135, "T5_server_error_500": 0.933}, "loss": 1.4901161193847656e-08, "kl": 0.0}
|
| 3 |
+
{"iteration": 3, "mean_reward": 0.45025000000000004, "max_reward": 0.933, "reward_std": 0.5205806853889222, "n_valid": 3, "n_invalid": 1, "n_total": 4, "elapsed_s": 303.22742199897766, "task_rewards": {"T8_mixed_faults": 0.434, "T5_server_error_500": 0.4665}, "loss": 4.424267535796389e-05, "kl": 8.495570909872185e-06}
|
| 4 |
+
{"iteration": 4, "mean_reward": 0.72525, "max_reward": 0.987, "reward_std": 0.48370678101511044, "n_valid": 3, "n_invalid": 1, "n_total": 4, "elapsed_s": 380.2190887928009, "task_rewards": {"T7_totals_trap": 0.4935, "T2_multi_page": 0.957}, "loss": 6.960902965147397e-07, "kl": 1.740225707180798e-05}
|
| 5 |
+
{"iteration": 5, "mean_reward": 0.44499999999999995, "max_reward": 0.95, "reward_std": 0.5161718060232787, "n_valid": 2, "n_invalid": 2, "n_total": 4, "elapsed_s": 680.6859655380249, "task_rewards": {"T1_single_page": 0.8899999999999999, "T4_rate_limit_429": 0.0}, "loss": -6.35683536529541e-05, "kl": 2.3585453163832426e-05}
|
| 6 |
+
{"iteration": 6, "mean_reward": 0.4595, "max_reward": 0.936, "reward_std": 0.5307664269714127, "n_valid": 2, "n_invalid": 2, "n_total": 4, "elapsed_s": 741.5850484371185, "task_rewards": {"T6_page_drift": 0.0, "T8_mixed_faults": 0.919}, "loss": 5.6743621826171875e-05, "kl": 5.159481952432543e-05}
|
| 7 |
+
{"iteration": 7, "mean_reward": 0.4785, "max_reward": 0.957, "reward_std": 0.5525242076144719, "n_valid": 3, "n_invalid": 1, "n_total": 4, "elapsed_s": 571.8825433254242, "task_rewards": {"T1_single_page": 0.4785, "T2_multi_page": 0.4785}, "loss": -7.46448858990334e-05, "kl": 7.312856905627996e-05}
|
| 8 |
+
{"iteration": 8, "mean_reward": 0.43574999999999997, "max_reward": 0.913, "reward_std": 0.5043004230284431, "n_valid": 4, "n_invalid": 0, "n_total": 4, "elapsed_s": 502.49045276641846, "task_rewards": {"T1_single_page": 0.415, "T2_multi_page": 0.4565}, "loss": 2.0489096641540527e-05, "kl": 0.0001012499415082857}
|
| 9 |
+
{"iteration": 9, "mean_reward": 0.4755, "max_reward": 0.957, "reward_std": 0.5490819610950627, "n_valid": 2, "n_invalid": 2, "n_total": 4, "elapsed_s": 251.70198917388916, "task_rewards": {"T3_duplicates": 0.4785, "T9_adaptive_adversary": 0.4725}, "loss": 5.155148755875416e-06, "kl": 0.00012887873162981123}
|
| 10 |
+
{"iteration": 10, "mean_reward": 0.465, "max_reward": 0.933, "reward_std": 0.5369413375779518, "n_valid": 2, "n_invalid": 2, "n_total": 4, "elapsed_s": 305.6371030807495, "task_rewards": {"T5_server_error_500": 0.93, "T4_rate_limit_429": 0.0}, "loss": -0.00018283724784851074, "kl": 0.00011343357618898153}
|
| 11 |
+
{"iteration": 11, "mean_reward": 0.45725000000000005, "max_reward": 0.927, "reward_std": 0.5280854570995115, "n_valid": 2, "n_invalid": 2, "n_total": 4, "elapsed_s": 280.29370641708374, "task_rewards": {"T4_rate_limit_429": 0.4635, "T3_duplicates": 0.451}, "loss": 1.7815735191106796e-05, "kl": 0.0004453933797776699}
|
| 12 |
+
{"iteration": 12, "mean_reward": 0.0, "max_reward": 0.0, "reward_std": 0.0, "n_valid": 2, "n_invalid": 2, "n_total": 4, "elapsed_s": 210.45837330818176, "task_rewards": {"T9_adaptive_adversary": 0.0, "T3_duplicates": 0.0}, "loss": 1.4813960660831071e-05, "kl": 0.0003703490365296602}
|
| 13 |
+
{"iteration": 13, "mean_reward": 0.69275, "max_reward": 0.957, "reward_std": 0.4625064864410012, "n_valid": 3, "n_invalid": 1, "n_total": 4, "elapsed_s": 196.6380274295807, "task_rewards": {"T4_rate_limit_429": 0.927, "T6_page_drift": 0.4585}, "loss": 4.249628909747116e-05, "kl": 0.00031773210503160954}
|
| 14 |
+
{"iteration": 14, "mean_reward": 0.498, "max_reward": 0.957, "reward_std": 0.48433665977293106, "n_valid": 3, "n_invalid": 1, "n_total": 4, "elapsed_s": 773.1137526035309, "task_rewards": {"T7_totals_trap": 0.084, "T10_constrained_budget": 0.9119999999999999}, "loss": -0.00014636914420407265, "kl": 0.0005570833454839885}
|
| 15 |
+
{"iteration": 18, "mean_reward": 0.02675, "max_reward": 0.107, "reward_std": 0.0535, "n_valid": 1, "n_invalid": 3, "n_total": 4, "elapsed_s": 303.5766453742981, "task_rewards": {"T4_rate_limit_429": 0.0535, "T1_single_page": 0.0}, "loss": 4.162021286902018e-05, "kl": 0.0010405053617432714}
|
training_curve.png
CHANGED
|
Git LFS Details
|
|
Git LFS Details
|