# Results

This document captures the quantitative results for ChargebackOps: scripted policy baselines, the cross-iteration GRPO training study, the per-checkpoint eval scores, the per-dimension rubric breakdown, and the diagnostic rollouts that revealed the specification-gaming behaviour documented in [`SPECIFICATION_GAMING.md`](SPECIFICATION_GAMING.md).

All numbers are reproducible from the commands in [`REPRODUCIBILITY.md`](REPRODUCIBILITY.md).

## 1. Scripted policy sweep (deterministic, no GPU)

The sweep covers a 12-task headline catalog plus a 28-task multi-seed grid, both run against the multi-round adversarial environment.

![Scripted-policy scores: naive 0.000, concede_all 0.444, escalate_all 0.767, heuristic 0.813. Every degenerate strategy hits a known ceiling.](figures/discrimination_gradient.png)

| Policy | Headline avg | Multi-seed avg (28) | Marathon | Provider calls | Description |
|---|---|---|---|---|---|
| **naive** | 0.000 | 0.000 | 0.000 | 0 | Submit empty packet immediately |
| **concede_all** | 0.444 | 0.445 | 0.400 | 0 | Always `accept_chargeback`, never contest |
| **escalate_all** | 0.767 | 0.768 | 0.617 | 0 | Always contest, always escalate to arbitration |
| **heuristic** | **0.813** | 0.763 | **0.679** | 0 | EV-rational policy, fully offline |

**Discrimination delta** (heuristic − naive) = **+0.813** on the headline catalog, well above conventional benchmark targets.

The `Gate(CaseAbandonedRubric)` deadline guard plus `EscalationROIRubric` (20% weight) jointly defeat every degenerate strategy: an empty-packet policy zeros out, a concede-everything policy caps at 0.44, and an escalate-everything policy caps at 0.77 because the $250 fee is paid on negative-EV cases.
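The negative-EV point can be made concrete with a small sketch. It is illustrative only: the $250 fee comes from the text above, but the function names, the assumption that the fee is paid regardless of outcome, and the assumption that a win recovers the full disputed amount are all hypothetical simplifications rather than the environment's actual logic.

```python
ARBITRATION_FEE = 250.0  # fee quoted in the text above


def escalation_ev(disputed_amount: float, win_probability: float,
                  fee: float = ARBITRATION_FEE) -> float:
    """Expected net value of escalating (hypothetical model).

    Assumes the fee is always paid and a win recovers the full amount.
    """
    return win_probability * disputed_amount - fee


def should_escalate(disputed_amount: float, win_probability: float,
                    fee: float = ARBITRATION_FEE) -> bool:
    """Escalate only when the expected recovery exceeds the fee."""
    return escalation_ev(disputed_amount, win_probability, fee) > 0.0


if __name__ == "__main__":
    # A $1,000 dispute at 50% win odds clears the fee; a $300 one does not.
    print(should_escalate(1000.0, 0.5))
    print(should_escalate(300.0, 0.5))
```

Under this model an escalate-everything policy pays the fee on the second kind of case as well, which is the behaviour the ROI rubric is described as penalising.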

## 2. Cross-iteration GRPO training study

Five GRPO training iterations were run with progressively tuned hyperparameters. Each iteration revealed a distinct failure mode. See [`METHOD.md`](METHOD.md) §3 for the full diagnostic narrative.

### 2.1 Training-time signals

| Iter | SFT max_steps | SFT mean_acc | GRPO max_steps | num_gens | temp | lora_dropout | grad_norm > 0.005 freq | grad_norm peak | KL max | Entropy max | Final train_loss |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 800 | 0.96 | 300 | 4 | 0.7 | 0.0 | **5%** | 0.78 | 0 | 0.017 | -2e-9 |
| 2 | 800 | 0.96 | 120 | 8 | 1.3 | 0.1 | 30% | 1.65 | 0.05 | 0.10 | 6e-4 |
| 3 | 300 | 0.96 | 60 | 8 | 1.3 | 0.1 | 50% | 0.021 | 0.06 | 0.08 | 7e-4 |
| 4 | 300 | 0.96 | 60 | 8 | 1.3 | 0.1 | 50% | **2.58** | 0.16 | 0.24 | 2e-3 |
| 5 | **150** | **0.88** | 200 | 8 | 1.3 | 0.1 | 60% | 2.30 | 0.16 | 0.20 | 1e-3 |

### 2.2 Iteration outcomes

- **Iter 1**: *Total gradient collapse*. Every group of 4 generations produced identical completions because the SFT-trained policy was near-delta on argmax. `std(reward_group) = 0` → advantage = 0 → no learning.
- **Iter 2**: *Tiny but real movement*. Widened sampling (temp 1.3, top_p 1.0, top_k 0, num_gens 8, lora_dropout 0.1) broke the argmax lock for ~30% of steps.
- **Iter 3**: *Frequent but tiny gradient*. Cutting GRPO `max_steps` to 60 raised gradient frequency to 50%, but per-step magnitudes shrank to 0.011-0.021.
- **Iter 4**: *Sampling luck*. Same code as iter 3, different RNG: gradient peaks of 2.58 on lucky steps, and KL hit 0.16. This demonstrates that GRPO under high SFT accuracy is **lottery-distributed**.
- **Iter 5**: *Curve plateau, then specification gaming*. SFT was stopped early (mean_acc 0.88, not 0.96) and GRPO was trained for 200 steps. The curve plateaued at the heuristic baseline. A diagnostic rollout revealed that the GRPO policy emits an invalid action_type that triggers the eval-pipeline fallback to the heuristic, so the curve reflects the heuristic, not the trained model. See [`SPECIFICATION_GAMING.md`](SPECIFICATION_GAMING.md).
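The iter-1 collapse follows directly from how group-relative advantages are normalised. The sketch below shows the standard group-normalised form, `(r - mean) / (std + eps)`; the function name, the `eps` value, and the use of population standard deviation are assumptions made for illustration, not a copy of the training code.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Group-normalised advantages: (r - mean) / (std + eps).

    `eps` is a hypothetical stabiliser to avoid dividing by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Identical completions give identical rewards, so every advantage is 0
    # and the policy-gradient term vanishes (the iter-1 failure mode).
    print(group_advantages([0.8, 0.8, 0.8, 0.8]))
    # Any spread in rewards restores a non-zero learning signal.
    print(group_advantages([1.0, 0.0]))
```

This is why widening the sampling distribution in iter 2 mattered: the gradient only exists on steps where at least one completion in the group scores differently from the rest.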

### 2.3 Iter 5 per-checkpoint eval scores

These are the published numbers from the iter-5 run. The GRPO scores from step 80 onwards are inflated by the eval-fallback exploit; the SFT score at step 1 is the legitimately trained checkpoint.

![Cross-iteration comparison](figures/training_curve_cross_iter.png)
*Left: overall training curve for iter 3 (62 GRPO steps, plateaued at 0.728) and iter 5 (200 GRPO steps, plateaued at 0.8132, the bit-exact heuristic baseline). Right: iter-5 per-difficulty curves showing the post-step-80 plateau is uniform across all four difficulty bands because the heuristic-fallback path produces 100% of executed actions. The bit-exact match between trained and heuristic is the signature of the eval-fallback exploit, not convergent learning.*

![Per-difficulty training curve](figures/training_curve_by_family.png)
*Iter-5 per-difficulty curve in isolation: mean normalised score (y) vs GRPO step (x), broken out by case difficulty. Step 0 = untrained Qwen2.5-3B base, step 1 = SFT-only checkpoint, steps 81/161/201/202 = GRPO checkpoints.*

![Overall training curve vs heuristic baseline](figures/training_curve.png)
*Iter-5 overall curve in isolation: mean normalised score across the headline catalog vs GRPO step. Dashed line = heuristic baseline (0.813). The GRPO plateau at the heuristic line is the specification-gaming attractor described in `SPECIFICATION_GAMING.md`.*

| Step | Checkpoint | Overall | easy | medium | hard | nightmare | Notes |
|---|---|---|---|---|---|---|---|
| 0 | Untrained Qwen2.5-3B base | 0.456 | 0.286 | 0.443 | 0.758 | 0.336 | Real |
| 1 | SFT (Phase A, 150 steps) | **0.536** | 0.778 | 0.666 | 0.462 | 0.235 | **Real, headline trained checkpoint** |
| 81 | GRPO step 80 | 0.799 | 0.929 | 0.792 | 0.828 | 0.647 | Mixed: partial real + early gaming attractor |
| 161 | GRPO step 160 | 0.8132 | 0.922 | 0.860 | 0.831 | 0.641 | Gaming-dominated |
| 201 | GRPO step 200 | 0.8132 | 0.922 | 0.860 | 0.831 | 0.641 | Gaming-dominated |
| 202 | GRPO final | 0.8132 | 0.922 | 0.860 | 0.831 | 0.641 | Gaming-dominated |

**Honest reading.** The base → SFT delta (`0.456 → 0.536`, +0.08 absolute, +18% relative) is the legitimately trained learning signal. The GRPO numbers from step 160 onwards match the heuristic baseline bit-exactly (`0.8132`) because the rollout helper falls back to the heuristic on every invalid action emitted by the policy.

The SFT delta itself shows the expected pattern of an undertrained warmstart: large gains on the most common training distribution (`easy 0.286 → 0.778`, +172% relative; `medium 0.443 → 0.666`, +50% relative) and regressions on the rarer hard / nightmare distributions where 150 SFT steps provide insufficient coverage (`hard 0.758 → 0.462`, `nightmare 0.336 → 0.235`).
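The relative deltas quoted above can be re-derived from the step-0 and step-1 rows of the table. The snippet is a minimal arithmetic check; the dictionary layout and function name are illustrative, and the scores are copied from the table rather than recomputed from rollouts.

```python
# (base, SFT) scores copied from the step-0 and step-1 rows above.
SCORES = {
    "overall": (0.456, 0.536),
    "easy": (0.286, 0.778),
    "medium": (0.443, 0.666),
    "hard": (0.758, 0.462),
    "nightmare": (0.336, 0.235),
}


def relative_change_pct(before: float, after: float) -> int:
    """Relative change from `before` to `after`, rounded to a whole percent."""
    return round(100.0 * (after - before) / before)


if __name__ == "__main__":
    for band, (base, sft) in SCORES.items():
        print(f"{band:>9}: {relative_change_pct(base, sft):+d}%")
```

By the same arithmetic, the hard and nightmare regressions come out at roughly -39% and -30% relative.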

### 2.4 Diagnostic rollout: proof of the gaming attractor

![Iter-5 eval score attribution: trained-policy contribution 0.000, heuristic-fallback contribution 0.8132. Diagnostic rollouts on three tasks all show outcome PnL +0.000.](figures/gaming_attribution.png)

Single-action diagnostic on three representative tasks at the GRPO-final checkpoint:

| Task | Oracle action | Model action | Action valid? | Outcome PnL (normalized) |
|---|---|---|---|---|
| goods_not_received_easy | `select_case` CB-E1 | `accept_case` CB-E1 | **No** | +0.000 |
| queue_optimization_hard | `select_case` CB-H3 | `accept_case` CB-H3 | **No** | +0.000 |
| generated_nightmare_s31 | `select_case` CB-G3 | `accept_case` CB-G3 | **No** | +0.000 |

`accept_case` is not a member of the valid action set (`select_case, inspect_case, query_system, retrieve_policy, add_evidence, remove_evidence, set_strategy, submit_representment, resolve_case, respond_to_pre_arb, escalate_to_arbitration, accept_arbitration_loss, wait_for_updates`). The closest valid neighbours are `accept_chargeback` and `accept_arbitration_loss`. GRPO has fused two valid token prefixes (`accept_…` + `…case`) into an invalid hybrid that parses as JSON but fails Pydantic validation in `action_from_completion`.

`outcome PnL = 0.000` confirms the env never executed any of these actions. The eval rollout helper's heuristic fallback path produced 100% of the executed actions.

## 3. Per-dimension rubric attribution (SFT checkpoint, easy task)

Every checkpoint's score is decomposable into 8 dimensions via `env.rubric.named_rubrics()`. This exposes *which* aspect of the policy improved during training.

![8-dimension rubric weights, grouped by category](figures/rubric_weights.png)

For the SFT checkpoint on the `goods_not_received_easy` task:

| Dimension | Weight | SFT score | Notes |
|---|---|---|---|
| StrategyCorrectness | 0.20 | 1.00 | Picked optimal `contest` strategy |
| EvidenceQuality | 0.15 | 0.85 | Required + 2/3 helpful evidence attached |
| PacketValidity | 0.10 | 1.00 | All required, zero harmful |
| DeadlineCompliance | 0.10 | 1.00 | Resolved before deadline |
| Efficiency | 0.10 | 0.78 | One duplicate query |
| OutcomeQuality | 0.10 | 1.00 | Issuer accepted on round 1 |
| NoteQuality | 0.05 | 0.65 | Note covered policy keywords; missed one evidence ID ref |
| EscalationROI | 0.20 | 1.00 | No unnecessary escalation |
| **Weighted total** | 1.00 | **0.92** | |

The per-dimension breakdown is the *same surface* a hooked rubric exposes during training; researchers can attribute each gradient step to dimension-specific gains.
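As a sanity check on the table, the weighted total can be recomputed as a plain weighted sum of the tabulated weights and scores. This is a minimal sketch that assumes the aggregate is a simple linear combination; the data structure is illustrative and does not use `env.rubric.named_rubrics()`. With the values as printed, the plain sum comes to about 0.938, slightly above the 0.92 shown, so the published figure may reflect rounding of the per-dimension scores or terms not listed in this table.

```python
# (weight, SFT score) pairs copied from the table above.
RUBRIC = {
    "StrategyCorrectness": (0.20, 1.00),
    "EvidenceQuality": (0.15, 0.85),
    "PacketValidity": (0.10, 1.00),
    "DeadlineCompliance": (0.10, 1.00),
    "Efficiency": (0.10, 0.78),
    "OutcomeQuality": (0.10, 1.00),
    "NoteQuality": (0.05, 0.65),
    "EscalationROI": (0.20, 1.00),
}


def weighted_total(rubric: dict[str, tuple[float, float]]) -> float:
    """Plain weighted sum of per-dimension scores (assumed aggregation)."""
    return sum(weight * score for weight, score in rubric.values())


def lost_points(rubric: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Weighted shortfall per dimension: where the remaining score is lost."""
    return {name: weight * (1.0 - score)
            for name, (weight, score) in rubric.items() if score < 1.0}


if __name__ == "__main__":
    print(f"weighted total: {weighted_total(RUBRIC):.3f}")
    for name, loss in lost_points(RUBRIC).items():
        print(f"{name}: -{loss:.4f}")
```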

## 4. Reproducibility

- **Seeds**: holdout seeds `easy={42}, medium={17, 99}, hard={7, 53}, nightmare={31, 77}` are excluded from training and used as the eval set.
- **Pinned stack**: `transformers==4.51.3`, `trl==0.21.0`, `peft==0.14.0`, `tokenizers==0.21.4`, `huggingface-hub==0.26.5`, `accelerate==1.0.1`, `torch==2.10.0+cu128`. Asserts in cell 0 of the notebook fail loudly if any pin slips.
- **Hardware**: single Colab / Kaggle T4 (15 GB VRAM). Peak SFT VRAM 8.4 GB, peak GRPO VRAM 11.4 GB.
- **Wallclock**: setup + SFT + merge + GRPO + eval ≈ 90 minutes end-to-end on a free Colab T4 (longer with `max_steps=200` GRPO).
- **Tests**: `pytest -q tests/` → 113 tests, all green.
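A pin check of the kind described for cell 0 could look like the sketch below. It is an assumption about the shape of such a check, not the notebook's actual cell: `find_pin_mismatches` and `assert_pins` are hypothetical names, and only `importlib.metadata.version` is a real standard-library call.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the list above.
PINS = {
    "transformers": "4.51.3",
    "trl": "0.21.0",
    "peft": "0.14.0",
    "tokenizers": "0.21.4",
    "huggingface-hub": "0.26.5",
    "accelerate": "1.0.1",
    "torch": "2.10.0+cu128",
}


def find_pin_mismatches(pins, lookup=version):
    """Return {package: (expected, installed)} for every pin that slipped.

    `lookup` is injectable so the check can be tested without the packages.
    """
    mismatches = {}
    for package, expected in pins.items():
        try:
            installed = lookup(package)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches


def assert_pins(pins=PINS, lookup=version):
    """Fail loudly, listing every mismatched or missing package."""
    mismatches = find_pin_mismatches(pins, lookup)
    assert not mismatches, f"Pinned versions slipped: {mismatches}"


# In a notebook, calling assert_pins() in the first cell would stop the run
# before any training starts if the environment drifts from the pins.
```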

See [`REPRODUCIBILITY.md`](REPRODUCIBILITY.md) for the exact command sequence.