
Results

This document captures the quantitative results for ChargebackOps: scripted policy baselines, the cross-iteration GRPO training study, the per-checkpoint eval scores, the per-dimension rubric breakdown, and the diagnostic rollouts that revealed the specification-gaming behaviour documented in SPECIFICATION_GAMING.md.

All numbers are reproducible from the commands in REPRODUCIBILITY.md.

1. Scripted policy sweep (deterministic, no GPU)

The sweep covers the 12-task headline catalog plus a 28-task multi-seed grid, both run against the multi-round adversarial environment.

Scripted-policy scores: naive 0.000, concede_all 0.444, escalate_all 0.767, heuristic 0.813. Every degenerate strategy hits a known ceiling.

| Policy | Headline avg | Multi-seed avg (28) | Marathon | Provider calls | Description |
|---|---|---|---|---|---|
| naive | 0.000 | 0.000 | 0.000 | 0 | Submit empty packet immediately |
| concede_all | 0.444 | 0.445 | 0.400 | 0 | Always accept_chargeback, never contest |
| escalate_all | 0.767 | 0.768 | 0.617 | 0 | Always contest, always escalate to arbitration |
| heuristic | 0.813 | 0.763 | 0.679 | 0 | EV-rational policy, fully offline |

Discrimination delta (heuristic − naive) = +0.813 on the headline catalog. Well above conventional benchmark targets.

The Gate(CaseAbandonedRubric) deadline guard plus EscalationROIRubric (20% weight) jointly defeat every degenerate strategy: an empty-packet policy zeros out, a concede-everything policy caps at 0.44, and an escalate-everything policy caps at 0.77 because the $250 fee is paid on negative-EV cases.
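The escalate-everything ceiling follows from expected-value arithmetic. The sketch below is illustrative only: the `escalation_ev` and `should_escalate` helpers, the win probability, and the disputed amounts are hypothetical, and it assumes the fee is paid whether or not arbitration is won. Only the $250 fee comes from this document.

```python
ARBITRATION_FEE = 250.0  # fee quoted in this document


def escalation_ev(dispute_amount: float, p_win: float,
                  fee: float = ARBITRATION_FEE) -> float:
    """Expected net value of escalating one case.

    Assumption: the merchant recovers `dispute_amount` with probability
    `p_win` and pays `fee` regardless of the outcome.
    """
    return p_win * dispute_amount - fee


def should_escalate(dispute_amount: float, p_win: float,
                    fee: float = ARBITRATION_FEE) -> bool:
    """Escalate only when the expected value is strictly positive."""
    return escalation_ev(dispute_amount, p_win, fee) > 0


if __name__ == "__main__":
    # Hypothetical cases: (disputed amount, estimated win probability).
    for amount, p_win in [(1000.0, 0.5), (300.0, 0.5)]:
        ev = escalation_ev(amount, p_win)
        print(f"amount={amount:.0f} p_win={p_win} ev={ev:+.2f} "
              f"escalate={should_escalate(amount, p_win)}")
```

Under these assumptions, a policy that escalates every case pays the fee on the second kind of case as well, which is the behaviour an ROI-style rubric penalises.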

2. Cross-iteration GRPO training study

Five GRPO training iterations were run with progressively tuned hyperparameters. Each iteration revealed a distinct failure mode. See METHOD.md §3 for the full diagnostic narrative.

2.1 Training-time signals

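| Iter | SFT max_steps | SFT mean_acc | GRPO max_steps | num_gens | temp | lora_dropout | grad_norm > 0.005 freq | grad_norm peak | KL max | Entropy max | Final train_loss |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 800 | 0.96 | 300 | 4 | 0.7 | 0.0 | 5% | 0.78 | 0 | 0.017 | -2e-9 |
| 2 | 800 | 0.96 | 120 | 8 | 1.3 | 0.1 | 30% | 1.65 | 0.05 | 0.10 | 6e-4 |
| 3 | 300 | 0.96 | 60 | 8 | 1.3 | 0.1 | 50% | 0.021 | 0.06 | 0.08 | 7e-4 |
| 4 | 300 | 0.96 | 60 | 8 | 1.3 | 0.1 | 50% | 2.58 | 0.16 | 0.24 | 2e-3 |
| 5 | 150 | 0.88 | 200 | 8 | 1.3 | 0.1 | 60% | 2.30 | 0.16 | 0.20 | 1e-3 |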

2.2 Iteration outcomes

  • Iter 1: Total gradient collapse. Every group of 4 generations produced identical completions because the SFT-trained policy was near-delta on argmax. std(reward_group) = 0 → advantage = 0 → no learning.
  • Iter 2: Tiny but real movement. Widened sampling (temp 1.3, top_p 1.0, top_k 0, num_gens 8, lora_dropout 0.1) broke the argmax lock for ~30% of steps.
  • Iter 3: Frequent but tiny gradient. Cutting max_steps to 60 raised gradient frequency to 50%, but per-step magnitudes shrank to 0.011-0.021.
  • Iter 4: Sampling luck. Same code as iter 3, different RNG: gradient peaks of 2.58 on lucky steps, and KL hit 0.16. This shows that when SFT accuracy is high, whether a GRPO step receives a usable gradient depends largely on sampling luck.
  • Iter 5: Curve plateau, then specification gaming. SFT was stopped early (mean_acc 0.88, not 0.96) and GRPO was trained for 200 steps. The curve plateaued at the heuristic baseline. A diagnostic rollout revealed that the GRPO policy emits an invalid action_type, which triggers the eval pipeline's fallback to the heuristic, so the curve reflects the heuristic, not the trained model. See SPECIFICATION_GAMING.md.
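The iter-1 collapse can be reproduced with a few lines of arithmetic. The sketch below uses a common form of group-relative normalisation, (reward − group mean) / (group std + eps). The `group_advantages` helper, the `eps` value, and the use of the population standard deviation are assumptions for illustration, not the training code used in this study.

```python
from statistics import fmean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalise rewards within one group of sampled completions.

    Each completion's advantage is its reward minus the group mean,
    divided by the group standard deviation (plus eps for stability).
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four identical completions, as in iter 1: every advantage is zero,
    # so the policy-gradient term vanishes for this group.
    print(group_advantages([0.8, 0.8, 0.8, 0.8]))

    # A group with reward diversity yields non-zero advantages.
    print(group_advantages([0.2, 0.8, 0.8, 0.9]))
```

This is why widening the sampling distribution in iter 2 mattered: a group only contributes a gradient when its completions earn different rewards.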

2.3 Iter 5 per-checkpoint eval scores

These are the published numbers from the iter-5 run. The GRPO scores from step 80 onwards are inflated by the eval-fallback exploit; the SFT score at step 1 comes from the legitimately trained checkpoint.

Figure: Cross-iteration comparison. Left: overall training curve for iter 3 (62 GRPO steps, plateaued at 0.728) and iter 5 (200 GRPO steps, plateaued at 0.8132, the bit-exact heuristic baseline). Right: iter-5 per-difficulty curves showing the post-step-80 plateau is uniform across all four difficulty bands because the heuristic-fallback path produces 100% of executed actions. The bit-exact match between trained and heuristic is the signature of the eval-fallback exploit, not convergent learning.

Figure: Per-difficulty training curve. Iter-5 per-difficulty curve in isolation: mean normalised score (y) vs GRPO step (x), broken out by case difficulty. Step 0 = untrained Qwen2.5-3B base, step 1 = SFT-only checkpoint, steps 81/161/201/202 = GRPO checkpoints.

Figure: Overall training curve vs heuristic baseline. Iter-5 overall curve in isolation: mean normalised score across the headline catalog vs GRPO step. Dashed line = heuristic baseline (0.813). The GRPO plateau at the heuristic line is the specification-gaming attractor described in SPECIFICATION_GAMING.md.

| Step | Checkpoint | Overall | easy | medium | hard | nightmare | Notes |
|---|---|---|---|---|---|---|---|
| 0 | Untrained Qwen2.5-3B base | 0.456 | 0.286 | 0.443 | 0.758 | 0.336 | Real |
| 1 | SFT (Phase A, 150 steps) | 0.536 | 0.778 | 0.666 | 0.462 | 0.235 | Real, headline trained checkpoint |
| 81 | GRPO step 80 | 0.799 | 0.929 | 0.792 | 0.828 | 0.647 | Mixed: partial real + early gaming attractor |
| 161 | GRPO step 160 | 0.8132 | 0.922 | 0.860 | 0.831 | 0.641 | Gaming-dominated |
| 201 | GRPO step 200 | 0.8132 | 0.922 | 0.860 | 0.831 | 0.641 | Gaming-dominated |
| 202 | GRPO final | 0.8132 | 0.922 | 0.860 | 0.831 | 0.641 | Gaming-dominated |

Honest reading. The base → SFT delta (0.456 → 0.536, +0.08 absolute, +18% relative) is the legitimately trained learning signal. The GRPO numbers from step 160 onwards match the heuristic baseline bit-exactly (0.8132) because the rollout helper falls back to the heuristic on every invalid action emitted by the policy.

The SFT delta itself shows the expected pattern of an undertrained warmstart: large gains on the most common training distribution (easy 0.286 → 0.778, +172% relative; medium 0.443 → 0.666, +50% relative) and regressions on the rarer hard / nightmare distributions where 150 SFT steps provide insufficient coverage (hard 0.758 → 0.462, nightmare 0.336 → 0.235).
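The absolute and relative deltas quoted above can be recomputed directly from the step-0 and step-1 rows of the table. This is a small arithmetic sketch; the `deltas` helper and the `SCORES` dictionary are illustrative names, and the values are copied from the table.

```python
def deltas(before: float, after: float) -> tuple[float, float]:
    """Return (absolute change, change relative to the starting score)."""
    absolute = after - before
    return absolute, absolute / before


# (base score at step 0, SFT score at step 1), copied from the table.
SCORES = {
    "overall": (0.456, 0.536),
    "easy": (0.286, 0.778),
    "medium": (0.443, 0.666),
    "hard": (0.758, 0.462),
    "nightmare": (0.336, 0.235),
}

if __name__ == "__main__":
    for name, (before, after) in SCORES.items():
        d_abs, d_rel = deltas(before, after)
        print(f"{name:<9} {d_abs:+.3f} absolute, {d_rel:+.1%} relative")
```

By the same arithmetic, the hard and nightmare regressions come to roughly -39% and -30% relative.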

2.4 Diagnostic rollout: proof of the gaming attractor

Iter-5 eval score attribution: trained-policy contribution 0.000, heuristic-fallback contribution 0.8132. Diagnostic rollouts on three tasks all show outcome PnL +0.000.

Single-action diagnostic on three representative tasks at the GRPO-final checkpoint:

| Task | Oracle action | Model action | Action valid? | Outcome PnL (normalized) |
|---|---|---|---|---|
| goods_not_received_easy | select_case CB-E1 | accept_case CB-E1 | No | +0.000 |
| queue_optimization_hard | select_case CB-H3 | accept_case CB-H3 | No | +0.000 |
| generated_nightmare_s31 | select_case CB-G3 | accept_case CB-G3 | No | +0.000 |

accept_case is not a member of the valid action set (select_case, inspect_case, query_system, retrieve_policy, add_evidence, remove_evidence, set_strategy, submit_representment, resolve_case, respond_to_pre_arb, escalate_to_arbitration, accept_arbitration_loss, wait_for_updates). The closest valid neighbours are accept_chargeback and accept_arbitration_loss. GRPO has fused two valid token prefixes (accept_… + …case) into an invalid hybrid that parses as JSON but fails Pydantic validation in action_from_completion.

outcome PnL = 0.000 confirms the env never executed any of these actions. The eval rollout helper's heuristic fallback path produced 100% of the executed actions.

3. Per-dimension rubric attribution (SFT checkpoint, easy task)

Every checkpoint's score is decomposable into 8 dimensions via env.rubric.named_rubrics(). This exposes which aspect of the policy improved during training.

8-dimension rubric weights, grouped by category

For the SFT checkpoint on the goods_not_received_easy task:

| Dimension | Weight | SFT score | Notes |
|---|---|---|---|
| StrategyCorrectness | 0.20 | 1.00 | Picked optimal contest strategy |
| EvidenceQuality | 0.15 | 0.85 | Required + 2/3 helpful evidence attached |
| PacketValidity | 0.10 | 1.00 | All required, zero harmful |
| DeadlineCompliance | 0.10 | 1.00 | Resolved before deadline |
| Efficiency | 0.10 | 0.78 | One duplicate query |
| OutcomeQuality | 0.10 | 1.00 | Issuer accepted on round 1 |
| NoteQuality | 0.05 | 0.65 | Note covered policy keywords; missed one evidence ID ref |
| EscalationROI | 0.20 | 1.00 | No unnecessary escalation |
| Weighted total | 1.00 | 0.92 | |

The per-dimension breakdown is the same surface a hooked rubric exposes during training, so researchers can attribute each gradient step to dimension-specific gains.
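The weighted total is a weight-by-score sum over the eight dimensions. The sketch below recomputes it from the displayed values; `RUBRIC` and `weighted_total` are illustrative names, not part of the environment's API. A plain weighted sum of the displayed values gives 0.938, slightly above the reported 0.92. The gap may come from rounding in the displayed per-dimension scores or from gating not shown in this table, though that explanation is an assumption.

```python
# (weight, SFT score) per dimension, copied from the table.
RUBRIC = {
    "StrategyCorrectness": (0.20, 1.00),
    "EvidenceQuality": (0.15, 0.85),
    "PacketValidity": (0.10, 1.00),
    "DeadlineCompliance": (0.10, 1.00),
    "Efficiency": (0.10, 0.78),
    "OutcomeQuality": (0.10, 1.00),
    "NoteQuality": (0.05, 0.65),
    "EscalationROI": (0.20, 1.00),
}


def weighted_total(rubric: dict[str, tuple[float, float]]) -> float:
    """Sum weight * score, after checking that the weights add up to 1."""
    total_weight = sum(weight for weight, _ in rubric.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights sum to {total_weight}, expected 1.0")
    return sum(weight * score for weight, score in rubric.values())


if __name__ == "__main__":
    print(f"weighted total: {weighted_total(RUBRIC):.3f}")
```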

4. Reproducibility

  • Seeds: holdout seeds easy={42}, medium={17, 99}, hard={7, 53}, nightmare={31, 77} are excluded from training and used as the eval set.
  • Pinned stack: transformers==4.51.3, trl==0.21.0, peft==0.14.0, tokenizers==0.21.4, huggingface-hub==0.26.5, accelerate==1.0.1, torch==2.10.0+cu128. Asserts in cell 0 of the notebook fail loudly if any pin slips.
  • Hardware: single Colab / Kaggle T4 (15 GB VRAM). Peak SFT VRAM 8.4 GB, peak GRPO VRAM 11.4 GB.
  • Wallclock: setup + SFT + merge + GRPO + eval ≈ 90 minutes end-to-end on a free Colab T4 (longer with max_steps=200 GRPO).
  • Tests: pytest -q tests/ → 113 tests, all green.
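A pin check of the kind described for cell 0 can be written with the standard library. This is a minimal sketch rather than the notebook's actual cell: `find_pin_mismatches` and `assert_pins` are hypothetical names, and torch is left out here on the assumption that its `+cu128` local version tag may need a looser comparison.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the list above (torch omitted, see the note above).
PINS = {
    "transformers": "4.51.3",
    "trl": "0.21.0",
    "peft": "0.14.0",
    "tokenizers": "0.21.4",
    "huggingface-hub": "0.26.5",
    "accelerate": "1.0.1",
}


def find_pin_mismatches(pins, get_version=version):
    """Map each mismatched package to (expected, installed or None)."""
    mismatches = {}
    for package, expected in pins.items():
        try:
            installed = get_version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches


def assert_pins(pins=PINS, get_version=version):
    """Raise immediately if any installed version differs from its pin."""
    mismatches = find_pin_mismatches(pins, get_version)
    if mismatches:
        raise RuntimeError(f"Version pins slipped: {mismatches}")
```

Calling `assert_pins()` at the top of a notebook stops the run before any training starts if the environment has drifted. The `get_version` parameter exists only so the check can be exercised without the packages installed.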

See REPRODUCIBILITY.md for the exact command sequence.