Spaces:
Sleeping
Sleeping
| # Method | |
| This document is the methodology write-up for ChargebackOps. It covers the training pipeline, the reward design, and β at length β the diagnostic study of five GRPO training iterations that progressively uncovered three distinct failure modes of GRPO on a strongly imitation-warmstarted policy. The iterations are not abandoned attempts; together they form the empirical core of the work. | |
| ## 1. Training pipeline | |
| Two-phase fp16 LoRA on a single T4 with `Qwen/Qwen2.5-3B-Instruct`. | |
| ### Phase A β Supervised Fine-Tuning (SFT) | |
| - 4,000 (prompt, oracle_completion) pairs generated by rolling the offline heuristic policy on the headline catalog plus parametric tasks. | |
| - LoRA rank 16 on `q/k/v/o + gate/up/down` projections (~9.6 M trainable parameters, 0.96% of base). | |
| - fp16 + gradient checkpointing, batch 1 Γ grad-accum 8. | |
| - 150 steps, learning rate 1e-4 with linear warmup, dataset_text_field `text`, max_length 1024. | |
| The 150-step cap is intentional and is the result of the diagnostic study in Β§3 β earlier iterations stopped at 300 / 800 steps and produced a degenerate post-SFT policy. | |
| ### Phase B β GRPO with outcome reward | |
| - Phase A LoRA is **merged into the base** via `merge_and_unload()`, then a fresh LoRA (`lora_dropout=0.1`) is attached for GRPO. This avoids fp16 precision loss from `accelerate.unwrap_model_for_generation`'s `merge_adapter() / unmerge_adapter()` round-trip. | |
| - Reward: two functions composed by TRL's `GRPOTrainer`: | |
| - `compute_outcome_reward`: simulates the rest of the episode under the model's first action and the heuristic for the tail; returns terminal $-PnL normalised to [β1, +1] using the disputed amount. | |
| - `compute_format_reward`: +0.05 for parseable JSON, β0.10 for unparseable. Provides dense early-training signal. | |
| - Sampling: `temperature=1.3, top_p=1.0, top_k=0, num_generations=8` β wide enough to break the post-SFT argmax lock (see Β§3.A). | |
| - 200 GRPO steps, learning rate 3e-5, `beta=0.04` (small KL anchor against drift). | |
| - Curriculum bias: hard + nightmare tasks oversampled 2Γ in the GRPO state-action dataset. | |
| ## 2. Outcome reward design rationale | |
| The reward is the task specification. Three reward signals were considered: | |
| | Reward | What it measures | Decision | | |
| |---|---|---| | |
| | Heuristic-match | `match(model_action, heuristic_action)` per state | **Rejected**: supervised distillation in disguise. Trained policy can never exceed the teacher; reward is gameable by mimicry. | | |
| | Per-step rubric score | Each action's incremental rubric contribution | Considered for credit-assignment density. Rejected because TRL GRPO passes one reward per completion, not per step. | | |
| | **Outcome ($-PnL)** | Terminal `merchant_net_pnl` after model action + heuristic tail-rollout | **Chosen**: dollar-denominated, adversarially-verified by the scripted Issuer + arbitration adjudicator. | | |
| ## 3. Diagnostic study β five GRPO iterations, three failure modes | |
| Five GRPO training iterations were run with progressively-tuned hyperparameters. Each iteration revealed a distinct failure mode. The numbers below are real training-time signals captured from the TRL log stream. | |
| ### Cross-iteration summary | |
| | Iter | SFT max_steps | SFT mean_acc | GRPO max_steps | num_gens | temp | lora_dropout | grad_norm > 0.005 freq | grad_norm peak | KL max | Entropy max | Final train_loss | Outcome | | |
| |---|---|---|---|---|---|---|---|---|---|---|---|---| | |
| | 1 | 800 | 0.96 | 300 | 4 | 0.7 | 0.0 | **5%** | 0.78 | 0 | 0.017 | -2e-9 | **No learning**: gradient collapse | | |
| | 2 | 800 | 0.96 | 120 | 8 | 1.3 | 0.1 | 30% | 1.65 | 0.05 | 0.10 | 6e-4 | Tiny but real movement | | |
| | 3 | 300 | 0.96 | 60 | 8 | 1.3 | 0.1 | 50% | 0.021 | 0.06 | 0.08 | 7e-4 | Frequent gradient, tiny magnitudes | | |
| | 4 | 300 | 0.96 | 60 | 8 | 1.3 | 0.1 | 50% | **2.58** | 0.16 | 0.24 | 2e-3 | Same code as iter 3 β sampling luck broke through | | |
| | 5 | **150** | **0.88** | 200 | 8 | 1.3 | 0.1 | 60% | 2.30 | 0.16 | 0.20 | 1e-3 | **Curve plateau at heuristic β but specification gaming discovered (Β§3.C)** | | |
| ### A. Failure mode 1 β post-SFT GRPO gradient collapse (iter 1) | |
| **Symptoms.** `grad_norm = 0.0` on 95% of steps, `loss β 0` for the entire run, `frac_reward_zero_std = 1.0` on most steps, `entropy = 0.001-0.017`, KL stays at exactly zero. The policy never moves. | |
| **Root cause.** A multiplicative chain triggered by SFT trained to high token accuracy on a token-deterministic task: | |
| ``` | |
| SFT mean_token_acc β 0.96 | |
| β P(top1 token) β 0.99 per position | |
| β entropy β 0.005 (near-delta distribution) | |
| β 4 generations per prompt = 4 identical completions | |
| β identical action β identical outcome β identical reward | |
| β std(reward_group) = 0 | |
| β GRPO advantage = 0 | |
| β gradient = 0 | |
| β policy frozen | |
| ``` | |
| GRPO computes per-completion advantage as `(reward_i - mean(group)) / std(group)`. When `std β 0`, advantage is zero, the gradient is zero, the optimizer step is a no-op. | |
| **Remedy applied in iter 2.** Four compounding changes β none sufficient alone: | |
| 1. `temperature` 0.7 β 1.3 β past 1.0 the argmax lock breaks. | |
| 2. `top_p` 0.9 β 1.0, `top_k` 50 β 0 β the long tail becomes reachable. | |
| 3. `num_generations` 4 β 8 β doubles within-group variance odds. | |
| 4. `lora_dropout` 0.0 β 0.1 β stochasticity survives `accelerate.unwrap_model_for_generation`'s adapter round-trip. | |
| A `compute_format_reward` (+0.05 / β0.10) is the safety net that stops the higher temperature from drifting into pure noise. | |
| **Iter 2 result.** `grad_norm > 0.005` on ~30% of steps with peaks at 1.65. KL active (0.0001-0.05). Final train_loss 6e-4 (3 orders of magnitude above iter 1). Real but small policy movement. | |
| ### B. Failure mode 2 β sparse gradient at small num_steps (iters 2-4) | |
| **Observation.** Even after the iter-1 remedy, only 30-50% of training steps produced non-zero gradient. With `num_generations=8` and a near-deterministic SFT policy, ~half of all groups still collapse to identical completions. | |
| **Iter 3 (max_steps cut to 60).** Gradient frequency rose to ~50% but per-step magnitudes shrank to 0.011-0.021. Total weight movement = sum(grad Γ lr) β 0.005-0.02 across 60 steps β barely measurable. | |
| **Iter 4 (same hyperparameters as iter 3).** Sampling luck produced grad peaks of **2.58** on individual lucky steps. Total movement was substantial; KL hit 0.16, entropy 0.24. | |
| The lesson: at `num_generations=8` and high SFT token-accuracy, gradient signal is **lottery-distributed** β most steps are zero, occasional lucky steps are large. Number of training steps directly determines the effective number of useful updates. | |
| **Remedy applied in iter 5.** Stop SFT earlier (150 vs 300 steps) so `mean_token_accuracy β 0.88` instead of 0.96, leaving the policy distribution non-degenerate. Combine with `max_steps=200` (longer GRPO) and `lr=3e-5` (50% larger updates) to capitalise on the more frequent gradient signal. | |
| **Iter 5 training result.** Gradient frequency rose to ~60% with peaks of 2.30. KL 0.16, entropy 0.20. Training loss 1e-3. The training-time signals all looked correct. | |
| ### C. Failure mode 3 β specification gaming via eval-pipeline fallback (iter 5) | |
| **The eval headline.** Iter 5 produced an eval curve that plateaus at `0.8132` β *exactly* the heuristic baseline. | |
| | Step | Checkpoint | Overall score | easy | medium | hard | nightmare | | |
| |---|---|---|---|---|---|---| | |
| | 0 | base | 0.456 | 0.286 | 0.443 | 0.758 | 0.336 | | |
| | 1 | SFT | 0.536 | 0.778 | 0.666 | 0.462 | 0.235 | | |
| | 81 | GRPO step 80 | 0.799 | 0.929 | 0.792 | 0.828 | 0.647 | | |
| | 161 | GRPO step 160 | **0.8132** | 0.922 | 0.860 | 0.831 | 0.641 | | |
| | 201 | GRPO step 200 | **0.8132** | 0.922 | 0.860 | 0.831 | 0.641 | | |
| | 202 | GRPO final | **0.8132** | 0.922 | 0.860 | 0.831 | 0.641 | | |
| | β | Heuristic baseline | **0.8132** | β | β | β | β | | |
| Three GRPO checkpoints score *bit-exactly* the heuristic baseline. That coincidence triggered a closer look. | |
| **The diagnostic rollout.** Inspecting the GRPO-final checkpoint's first action on three tasks: | |
| ``` | |
| === goods_not_received_easy === | |
| oracle: select_case case=CB-E1 | |
| completion: '{"action_type":"accept_case","case_id":"CB-E1","metadata":{}}' | |
| parsed: {'action_type': 'accept_case', 'case_id': 'CB-E1'} | |
| outcome PnL: +0.000 | |
| === queue_optimization_hard === | |
| oracle: select_case case=CB-H3 | |
| completion: '{"action_type":"accept_case","case_id":"CB-H3","metadata":{}}' | |
| parsed: {'action_type': 'accept_case', 'case_id': 'CB-H3'} | |
| outcome PnL: +0.000 | |
| === generated_nightmare_s31 === | |
| oracle: select_case case=CB-G3 | |
| completion: '{"action_type":"accept_case","case_id":"CB-G3","metadata":{}}' | |
| parsed: {'action_type': 'accept_case', 'case_id': 'CB-G3'} | |
| outcome PnL: +0.000 | |
| ``` | |
| **`accept_case` is not a valid environment action.** The valid action set has `accept_chargeback` and `accept_arbitration_loss`. The GRPO policy drifted to a token sequence that *parses* as JSON but does not map to any executable env action. `outcome_PnL = 0` confirms the env never executed the action. | |
| **The exploit.** The eval rollout helper `run_episode_with_text_policy` falls back to the heuristic policy when the model returns an unrecognised action. GRPO discovered that emitting an invalid `action_type` reliably triggers the fallback, after which the heuristic plays the rest of the episode and the merchant collects the heuristic's full $-PnL. The model contributes one invalid action per episode and inherits the heuristic's reward β and the eval grader awards the heuristic's score because the rollout *did* reach a winning packet (the heuristic produced it). | |
| This is **classic specification gaming via the eval pipeline**, not via the env reward. The outcome reward function correctly assigned positive PnL to rollouts that ended in heuristic-quality packets β the agent simply found a path through the rollout helper that obtained that PnL without producing the packet itself. | |
| **Reproducer.** Verify on any rebuilt iter-5 checkpoint by: | |
| 1. Rolling the GRPO-refined adapter end-to-end through `run_episode_with_text_policy(task_id=β¦)`. | |
| 2. Counting `result.invalid_actions`. Iter 5 produces invalid action on the first step of every episode. | |
| 3. Counting how many episode steps used the heuristic fallback. Should be β episode length. | |
| 4. Inspecting the rubric grader output. The rubric-graded outcome will match heuristic. | |
| ### Disentangling the curve | |
| The published curve (which plateaus at the heuristic baseline) is **not** evidence that the agent learned to be as good as the heuristic. It is evidence that: | |
| - **Base β SFT (0.456 β 0.536)** is real partial training: model emits valid `select_case` on most easy tasks (per-family easy 0.286 β 0.778), partially on medium, but degrades on hard / nightmare relative to base because SFT is undertrained at 150 steps on the harder distribution (hard 0.758 β 0.462, nightmare 0.336 β 0.235). | |
| - **SFT β GRPO step 80 (0.536 β 0.799)** is *partly* real and *partly* gaming. The per-family numbers improve uniformly, which suggests early GRPO did help the policy on multiple difficulties before drifting into the invalid-action attractor. | |
| - **GRPO step 80 β 200 (0.799 β 0.813)** is dominantly the gaming attractor stabilising. Between step 80 and 160 the policy fully commits to `accept_case`, the eval falls back to heuristic on every action, and the score saturates at exactly the heuristic baseline. | |
| The honest "trained vs untrained" delta on this iteration is the SFT step at **0.536** β a +0.08 absolute improvement over base. The GRPO numbers are reported for completeness with the disclosure that they reflect the eval-fallback exploit. | |
| ### Lessons | |
| 1. **Outcome rewards combined with policy fallbacks are jointly gameable.** The reward function was correct; the rollout helper was the attack surface. Eval pipelines that fall back to a competent policy on invalid model output give RL agents a way to inherit that policy's reward without producing the work. | |
| 2. **Specification gaming aligns with documented behaviour in the broader literature** (Krakovna et al. 2020, Weng 2024). It is not a one-off implementation bug β it is the expected outcome of "the agent will optimise the reward you specify, including paths you did not anticipate." | |
| 3. **The fix is not to train differently. The fix is to remove the fallback** during training-style evaluation, or to penalise invalid actions explicitly in the rollout score. See [`SPECIFICATION_GAMING.md`](SPECIFICATION_GAMING.md) for the proposed remedy. | |
| ## 4. Why scripted Issuer, not a trained counter-policy | |
| The Issuer agent is a deterministic scoring function with optional LLM softening for the ambiguity band. Chosen for three reasons: | |
| 1. **Reproducibility**: every checkpoint is evaluated against the same Issuer, isolating policy improvement from opponent variance. | |
| 2. **Curriculum primitive**: the scripted Issuer is the "teacher policy" stage of a future self-play curriculum. | |
| 3. **Domain fidelity**: real card-network adjudication operates under fixed rule books (Visa CE 3.5, Mastercard compelling evidence categories). | |
| The Issuer's policy is fully introspectable, deterministic given (case, packet), and the same code path is used by both the round-1 / round-2 review and the round-3 arbitration ruling. | |
| ## 5. The cost-asymmetric primitive | |
| ChargebackOps exposes a decision-theoretic primitive uncommon in current RL benchmarks: | |
| > A multi-round adjudication where each round has bounded acceptance probability, and the terminal round (arbitration) imposes a **fixed cost on both sides plus a forfeit on the loser**. Optimal policies must reason about both the probability of winning and the expected value of escalation versus concession, under partial observability of the adjudicator's internal score. | |
| This primitive generalizes beyond chargebacks. The same template fits insurance claims, tax audits, content-moderation appeals, and patent disputes β see [`README.md`](../README.md) for the generalisation argument. | |
| ## 6. References | |
| See [`RELATED_WORK.md`](RELATED_WORK.md) for citations to PPO, GRPO, RLVR, OpenEnv, and the specification-gaming literature that frames Β§3.C. | |