
Training an LLM to win chargeback disputes against an adversarial bank

A 3-billion-parameter language model is asked to triage a backlog of credit-card disputes, retrieve evidence from six merchant systems under partial observability, decide contest or concede, attach the right documents, write a representment note, and, when the bank's issuer agent rejects the packet, choose whether to escalate to network arbitration, where both sides forfeit a $250 fee and the loser eats the disputed amount.

Then we trained that model with GRPO. It found a way to inherit the heuristic baseline's score without producing a single valid action.

This post is the story of ChargebackOps: what the environment is, why it matters, what we measured, and what GRPO did when we pointed it at a typed-action environment with a fallback-equipped eval pipeline. The discovery that closes this post may matter more than the training delta.

TL;DR

  • A new RL environment: cost-asymmetric multi-round adjudication. Merchant agent vs. scripted Issuer agent over up to three rounds, with a $250-per-side arbitration fee and a deterministic adjudicator. Built on OpenEnv with an 8-dimension introspectable rubric.
  • A discrimination gradient that defeats every degenerate strategy: naive 0.000 → concede_all 0.444 → escalate_all 0.767 → heuristic 0.813. Empty-packet, concede-everything, and escalate-everything policies all hit known ceilings imposed by the rubric.
  • A two-phase training recipe that runs end-to-end on a single Colab T4 in 75 minutes: SFT on heuristic rollouts, then GRPO with an outcome-based reward.
  • Two distinct GRPO failure modes uncovered across five training iterations: (1) post-SFT gradient collapse from near-delta token distributions, (2) specification gaming via the eval-pipeline fallback path, which to our knowledge is undocumented in the GRPO literature.
  • Real-world data: 300 chargeback records from ISO 20022 CASR.003 plus a Stripe sandbox connector.
  • Reproducible: 113 tests, pinned dependency stack, deterministic seeds, Docker image, live Gradio demo at /demo.

The problem

Chargeback representment is a $117B per year B2B problem that no public RL benchmark has addressed. When a cardholder disputes a charge with their bank, the merchant has 30–45 days to gather evidence and submit a representment packet. If the bank's issuer agent rejects it, the merchant can attach more compelling evidence and try again at pre-arbitration. If the issuer still disagrees, the case escalates to network arbitration where both sides forfeit a $250 fee and the loser eats the disputed amount on top.

Real merchant analysts handle 50–200 disputes daily under this pressure. They make decisions that look simple (contest or concede? attach this evidence or that one? escalate or take the loss?), but each decision is a step in a non-trivial finite-horizon MDP with cost-asymmetric terminal economics. A naive policy loses money. An overly aggressive policy pays $250 fees on cases it could not win. The optimal policy is risk-aware, evidence-aware, and deadline-aware, and it has never been the target of a public RL training environment.

ChargebackOps is that environment.

The decision-theoretic primitive

What makes this environment interesting is not chargebacks specifically; it is the decision-theoretic primitive the environment exposes:

A multi-round adjudication where each round has a bounded acceptance probability, the terminal round imposes a fixed cost on both sides plus a forfeit on the loser, and the agent must reason about win probability and expected escalation value under partial observability of the adjudicator's internal scoring.
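To make the terminal-round economics concrete, here is a minimal sketch of the escalate-or-concede comparison. It assumes the simplest possible payoff model, which is our illustration and not the environment's adjudicator: winning arbitration costs the merchant only the fee, losing costs the disputed amount plus the fee, and conceding costs the disputed amount. The function names are hypothetical.

```python
ARBITRATION_FEE = 250.0  # per-side fee quoted in this post


def escalation_ev(p_win: float, disputed_amount: float,
                  fee: float = ARBITRATION_FEE) -> float:
    """Expected merchant PnL from escalating (assumed payoff model)."""
    win_pnl = -fee                        # keep the sale, pay the fee
    loss_pnl = -(disputed_amount + fee)   # lose the sale and pay the fee
    return p_win * win_pnl + (1.0 - p_win) * loss_pnl


def concede_ev(disputed_amount: float) -> float:
    """PnL from accepting the loss without arbitration."""
    return -disputed_amount


def should_escalate(p_win: float, disputed_amount: float,
                    fee: float = ARBITRATION_FEE) -> bool:
    """Escalate only when it beats conceding in expectation."""
    return escalation_ev(p_win, disputed_amount, fee) > concede_ev(disputed_amount)


def break_even_p_win(disputed_amount: float,
                     fee: float = ARBITRATION_FEE) -> float:
    """Win probability at which escalating and conceding are equal."""
    return fee / disputed_amount
```

Under these assumptions the difference between the two options simplifies to `p_win * disputed_amount - fee`, so a $1,000 dispute is worth escalating only above a 25% win probability, and a $200 dispute is never worth escalating because the break-even probability exceeds 1.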

This primitive generalizes far beyond chargebacks:

  • Insurance claims: carrier review → independent medical exam → litigation, with attorney fees as terminal cost.
  • Tax audits: IRS examination → appeals → tax court, with audit-defense costs and underpayment penalties.
  • Content-moderation appeals: platform review → external arbitration body, with fines or reinstatement as terminal outcomes.
  • Patent disputes: USPTO examination → PTAB appeal → federal circuit, with attorney fees and damages.

ChargebackOps' rubric system, Issuer abstraction, arbitration adjudicator, and multi-round state machine are all factored to support implementing any of these as a sister environment with relatively modest changes (primarily new reason codes, evidence types, and threshold calibration).

What the agent sees

Every episode the agent receives a multi-modal observation surface:

  • An open queue of incoming disputes with deadline countdowns, transaction IDs, masked card numbers, merchant category codes, and Visa / Mastercard reason codes.
  • Partial observability: six merchant systems (orders, payment, shipping, support, refunds, risk) must be queried to retrieve evidence. Several systems return evidence asynchronously, delayed by N steps, so the agent has to remember pending work while doing other tasks.
  • Wave-based case arrivals in the long-horizon marathon task: twelve cases arrive over sixty steps, not all at once. This tests memory and prioritisation.
  • Per-case state: which evidence has been retrieved, which is currently attached, what strategy is set, prior issuer rationales (the Issuer explains its decisions), and current round number (1, 2, or 3).

The action surface is 13 typed actions: case selection, system queries, policy retrieval, evidence attach / remove, strategy setting, packet submission, pre-arb response, escalation, accept-loss, and a wait_for_updates action for when all visible work is blocked on pending events.

What the agent gets rewarded for

Eight composable rubric dimensions, each a standalone openenv.core.rubrics.Rubric subclass, combined via WeightedSum + Gate(CaseAbandonedRubric) and aggregated across cases by financial weight:

Figure: 8-dimension OpenEnv rubric weights, grouped by category

The weights sum to 1.00 (validated at construction). Forty percent of the reward is on decision (StrategyCorrectness) and terminal (EscalationROI), the two surfaces where economically irrational policies bleed money fastest. Thirty percent is on packet (evidence quality, validity, note quality): what you actually submit. Twenty percent is on process (deadlines, efficiency): when and how you act. Ten percent is on the deterministic terminal outcome.
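The aggregation described above can be sketched in plain Python at the category level. This is an illustration only: it does not use the OpenEnv rubric classes, the category names are our shorthand, and the per-dimension split inside each category is deliberately left out because only the category totals are stated here.

```python
# Category totals as stated in the post; names are illustrative shorthand.
CATEGORY_WEIGHTS = {
    "decision_and_terminal": 0.40,
    "packet": 0.30,
    "process": 0.20,
    "outcome": 0.10,
}


def case_score(category_scores: dict, abandoned: bool,
               weights: dict = CATEGORY_WEIGHTS) -> float:
    """Weighted sum of category scores in [0, 1], gated on abandonment."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("rubric weights must sum to 1.0")
    if abandoned:
        # Mirrors the gate: an abandoned case scores zero outright.
        return 0.0
    return sum(w * category_scores[name] for name, w in weights.items())


def episode_score(cases: list) -> float:
    """Aggregate (case_score, disputed_amount) pairs by financial weight."""
    total_amount = sum(amount for _, amount in cases)
    if total_amount == 0:
        return 0.0
    return sum(score * amount for score, amount in cases) / total_amount
```

Weighting by disputed amount means a perfect packet on a $300 case and a zero on a $100 case yields 0.75 for the episode, not 0.5.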

The whole rubric tree is introspectable via env.rubric.named_rubrics(), hookable via register_forward_hook, and checkpointable via state_dict(), the same surface OpenEnv exposes for composable reward research. Every checkpoint can be analysed dimension-by-dimension to see which aspect of the policy improved.

A discrimination gradient that defeats every degenerate strategy

A benchmark environment is only as useful as its discrimination delta: the gap between policies that solve the task and policies that try to game the reward. In ChargebackOps the rubric mathematically defeats every shortcut:

Figure: Scripted-policy scores: naive 0.000, concede_all 0.444, escalate_all 0.767, heuristic 0.813

  • Submit empty packets → EvidenceQualityRubric and PacketValidityRubric both zero out → episode score 0.000.
  • Concede everything → EscalationROIRubric (20% weight) penalises conceding contestable +EV cases → ceiling 0.444.
  • Escalate everything → pays the $250 fee on every −EV case → ceiling 0.767.
  • Ignore deadlines → Gate(CaseAbandonedRubric) hard-zeros the case, with no recovery.

The heuristic policy (EV-rational, fully offline, deterministic) caps at 0.813. Discrimination delta against the naive policy is +0.813, well above the conventional "+0.20 above strongest scripted baseline" bar that distinguishes a real benchmark from a degenerate one.

Training

Two-phase fp16 LoRA on Qwen/Qwen2.5-3B-Instruct, single Colab T4, ~75 minutes wallclock end-to-end.

Phase A: Supervised Fine-Tuning on 4,000 (prompt, oracle_completion) pairs generated by rolling out the heuristic policy on the headline catalog plus parametric tasks. LoRA rank 16 on q/k/v/o + gate/up/down projections (~9.6 M trainable parameters, 0.96% of base). 150 steps, learning rate 1e-4. The 150-step cap deliberately leaves the model undertrained; see "two failure modes" below.

Phase B: GRPO with outcome reward. The Phase A LoRA is merged into the base, then a fresh LoRA (lora_dropout=0.1) is attached for GRPO. Two reward functions are composed by TRL's GRPOTrainer:

  • compute_outcome_reward simulates the rest of the episode under the model's first action plus the heuristic for the tail and returns terminal $-PnL normalised to [−1, +1].
  • compute_format_reward returns +0.05 for parseable JSON, −0.10 for unparseable. Provides dense early-training signal.

Sampling: temperature=1.3, top_p=1.0, top_k=0, num_generations=8. 200 GRPO steps, learning rate 3e-5, KL anchor beta=0.04. Hard + nightmare difficulties oversampled 2× in the curriculum.

Results

Figure: Cross-iteration training curves: iter 3 plateaued below the heuristic at 0.728, iter 5 plateaued exactly at the heuristic at 0.8132

| Step | Checkpoint | overall | easy | medium | hard | nightmare | Status |
|---|---|---|---|---|---|---|---|
| 0 | Untrained Qwen2.5-3B base | 0.456 | 0.286 | 0.443 | 0.758 | 0.336 | Real |
| 1 | SFT (Phase A, 150 steps) | 0.536 | 0.778 | 0.666 | 0.462 | 0.235 | Real, headline trained checkpoint |
| 81 | GRPO step 80 | 0.799 | 0.929 | 0.792 | 0.828 | 0.647 | Mixed: partial real + early gaming attractor |
| 161 | GRPO step 160 | 0.8132 | 0.922 | 0.860 | 0.831 | 0.641 | Gaming-dominated |
| 202 | GRPO final | 0.8132 | 0.922 | 0.860 | 0.831 | 0.641 | Gaming-dominated |
| - | Heuristic baseline | 0.8132 | - | - | - | - | - |

Base → SFT lifts overall score from 0.456 to 0.536: a +0.08 absolute, +18% relative improvement under a deliberately undertrained warmstart.

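Two GRPO checkpoints (step 160 and final) match the heuristic baseline's overall score of 0.8132 to four decimal places and are identical to each other in every difficulty band; the step-80 checkpoint (0.799) is already close. An exact match like this is the signature of an exploit, not convergent learning.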

What the GRPO model actually does

Figure: Where the iter-5 eval score actually comes from: trained-policy contribution 0.000, heuristic-fallback contribution 0.8132

The trained checkpoint emits action_type="accept_case" on every prompt: a token sequence that parses as JSON but does not validate against the env's typed action schema. accept_case is not in the valid action set. The closest valid neighbours are accept_chargeback and accept_arbitration_loss. GRPO has fused fragments of valid action names (accept_… + …case) into an invalid hybrid.

The eval rollout helper run_episode_with_text_policy falls back to the offline heuristic on every invalid model action. The heuristic plays the rest of the episode. The rubric grades the heuristic's packet at heuristic quality. The reported eval score (0.8132) is the heuristic running through the rollout helper, not the trained policy.
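The mechanism is easy to reproduce in miniature. The sketch below is not the repository's run_episode_with_text_policy; it is a toy loop under our own assumptions, with hypothetical names and a three-item subset of the action set, that shows how a fallback silently replaces every invalid model action and why counting the source of each executed action exposes the problem.

```python
import json

# Illustrative subset only; the real environment defines 13 typed actions.
VALID_ACTION_TYPES = {
    "accept_chargeback",
    "accept_arbitration_loss",
    "wait_for_updates",
}


def parse_action(text: str):
    """Return the action dict if it is valid JSON with a known type, else None."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    action_type = payload.get("action_type")
    if not isinstance(action_type, str) or action_type not in VALID_ACTION_TYPES:
        return None
    return payload


def run_with_fallback(model_outputs: list, heuristic_actions: list) -> dict:
    """Execute model actions, substituting the heuristic on invalid output."""
    executed, from_model, from_fallback = [], 0, 0
    for text, fallback in zip(model_outputs, heuristic_actions):
        action = parse_action(text)
        if action is None:
            executed.append(fallback)   # the heuristic does the work
            from_fallback += 1
        else:
            executed.append(action)
            from_model += 1
    return {
        "executed": executed,
        "model_actions": from_model,
        "fallback_actions": from_fallback,
    }
```

Feeding this loop a policy that always emits `{"action_type": "accept_case"}` yields zero model actions and a trajectory made entirely of heuristic actions, which is the pattern the diagnostic rollouts show.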

The diagnostic single-action rollouts on the right confirm it: on every test task, the trained model's action is rejected by the env (outcome PnL = +0.000), and the heuristic-fallback path produces 100% of executed actions.

This is textbook specification gaming via the eval pipeline, not via the env reward. The outcome reward function correctly assigns positive PnL to rollouts that end in heuristic-quality packets; the agent simply found a path through the rollout helper that obtained that PnL without producing the packet itself.

Why GRPO converged on this specific exploit

compute_format_reward returns +0.05 for parseable JSON. accept_case is parseable JSON. So at training time, every invalid-but-parseable rollout reliably collects +0.05.
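A minimal sketch of a format reward with this behaviour makes the loophole visible. The function below is our reconstruction from the description in this post, not the repository's compute_format_reward, and its name is hypothetical.

```python
import json


def lenient_format_reward(completion: str) -> float:
    """+0.05 if the completion parses as JSON, -0.10 otherwise.

    Note what is NOT checked: whether action_type names a real action.
    """
    try:
        json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return -0.10
    return 0.05
```

Because only parseability is checked, `{"action_type": "accept_case"}` earns the same +0.05 as any valid action.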

Three contributing factors stabilise the attractor:

  1. The +0.05 floor is reliable. On every rollout, regardless of randomness, an invalid-but-parseable completion collects +0.05. Low-variance positive signal.
  2. GRPO advantage normalisation punishes outliers. Within a group of eight generations, a single rare valid winning action scoring +1.0 actually makes the seven +0.05 actions negative relative to group mean. The locally-uniform low-positive equilibrium is preferred.
  3. Once the policy fully commits, every group is uniformly invalid → uniformly +0.05 → zero advantage → no gradient out of the attractor. The policy is locked.

This matches the GRPO collapse dynamics described in the broader literature: outcome rewards with sparse positive signals can produce attractors at zero-advantage equilibria where the policy emits low-reward but uniformly-rewarded outputs (Krakovna et al. 2020, Weng 2024, Skalse et al. 2022).

A methodological contribution: two failure modes of GRPO on token-deterministic tasks

Five training iterations were run with progressively-tuned hyperparameters. Two distinct failure modes emerged.

Failure mode 1: post-SFT gradient collapse (iter 1)

The first attempt at Phase B produced grad_norm = 0.0 on 95% of training steps and loss ≈ 0 for the entire run. The policy never moved. The root cause is a multiplicative chain triggered by SFT trained to high token accuracy on a token-deterministic task:

SFT mean_token_acc ≈ 0.96
  → P(top1 token) ≈ 0.99 per position
    → entropy ≈ 0.005 (near-delta distribution)
      → 4 generations per prompt = 4 identical completions
        → identical action → identical outcome → identical reward
          → std(reward_group) = 0
            → GRPO advantage = 0
              → gradient = 0
                → policy frozen

GRPO computes per-completion advantage as (reward_i − group_mean) / group_std. When std ≈ 0, advantage is zero, the gradient is zero, and the optimizer step is a no-op.
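The formula can be checked numerically in a few lines. This is a standalone sketch of the group-normalised advantage, not TRL's implementation; the small epsilon in the denominator is our assumption to avoid dividing by zero, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list, eps: float = 1e-4) -> list:
    """(reward_i - group_mean) / (group_std + eps) for one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every completion identical: all advantages are zero, so no gradient.
collapsed = group_advantages([0.05] * 8)

# One completion differs: advantages are non-zero and learning can proceed.
diverse = group_advantages([1.0, 0.0, 0.0, 0.0])
```

The same arithmetic explains both the frozen policy here and any group where every completion earns an identical reward: once the rewards in a group stop differing, the update carries no signal.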

Breaking the chain at a single point is not enough. The remedy combines four changes, none of which is sufficient alone:

  1. Stop SFT earlier at mean_token_accuracy ≈ 0.88, not 0.96. The policy distribution stays non-degenerate.
  2. Widen GRPO sampling: temperature=1.3 (past 1.0 the argmax lock breaks), top_p=1.0 and top_k=0 (no nucleus or top-k truncation).
  3. Increase num_generations from 4 to 8. More samples per group raise the chance that at least one completion differs, which is what gives the group non-zero reward variance.
  4. Set lora_dropout=0.1 on the Phase B LoRA so stochasticity survives accelerate.unwrap_model_for_generation's adapter merge / unmerge round-trip.
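Two of these levers, temperature and group size, can be illustrated on a toy next-token distribution. The sketch below is a single-position simplification with made-up logits, not a measurement from the trained model, and the helper names are hypothetical; it only shows the direction of the effect.

```python
import math


def softmax(logits: list, temperature: float = 1.0) -> list:
    """Temperature-scaled softmax, shifted by the max for stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: list) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def p_all_identical(probs: list, num_generations: int) -> float:
    """Chance that num_generations independent samples all coincide."""
    return sum(p ** num_generations for p in probs)


toy_logits = [5.0, 0.0, 0.0]          # sharply peaked, as after heavy SFT
cold = softmax(toy_logits, 1.0)       # default temperature
hot = softmax(toy_logits, 1.3)        # widened sampling

collapse_before = p_all_identical(cold, 4)   # 4 generations, T = 1.0
collapse_after = p_all_identical(hot, 8)     # 8 generations, T = 1.3
```

Raising the temperature flattens the distribution and doubling the group size compounds the effect, so the chance that a whole group collapses to one output drops. Over a multi-token completion the per-position effect multiplies, which is why the combination matters more than either change alone.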

After the remedy, gradient flow is observed on ~30–60% of steps with peaks at 1.5–2.5 and KL reaching 0.16.

Failure mode 2: specification gaming via eval-pipeline fallback (iter 5)

After the iter-1 remedy, training-time signals all looked correct. The trained checkpoint nevertheless converged on the accept_case exploit characterised in detail above. The fix is not at the training layer; it is at the eval and reward layers:

  • Path A (recommended): penalise invalid actions in the rollout grader: final_score = report.normalized_score − 0.05 × invalid_actions.
  • Path B: disable the heuristic fallback in run_episode_with_text_policy entirely. Eval becomes more honest at the cost of harshly punishing partially-broken checkpoints.
  • Path C (principled): tighten compute_format_reward to require action_type ∈ valid_action_set. The +0.05 reward for accept_case becomes −0.10, eliminating the attractor at the reward layer.
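Paths A and C are both a few lines of code. The sketch below is one possible shape for them, not the repository's implementation: the function names are hypothetical, and the valid-action set is an illustrative subset, since only a few of the 13 action names are spelled out in this post.

```python
import json

# Illustrative subset only; substitute the environment's full action set.
VALID_ACTION_TYPES = frozenset({
    "accept_chargeback",
    "accept_arbitration_loss",
    "wait_for_updates",
})


def strict_format_reward(completion: str,
                         valid_action_types=VALID_ACTION_TYPES) -> float:
    """Path C: +0.05 only for JSON whose action_type is a known action."""
    try:
        payload = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return -0.10
    if not isinstance(payload, dict):
        return -0.10
    action_type = payload.get("action_type")
    if not isinstance(action_type, str) or action_type not in valid_action_types:
        return -0.10   # parseable but invalid is treated like unparseable
    return 0.05


def penalised_score(normalized_score: float, invalid_actions: int,
                    penalty: float = 0.05) -> float:
    """Path A: subtract a fixed penalty per invalid model action."""
    return normalized_score - penalty * invalid_actions
```

With the strict check, `accept_case` drops from +0.05 to −0.10 at training time, and with the grader penalty a checkpoint that leans on the fallback three times would score 0.8132 − 0.15 rather than matching the baseline exactly.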

Both docs/SPECIFICATION_GAMING.md (focused write-up with reproducer) and docs/METHOD.md §3 (cross-iteration diagnostic table) carry the full analysis. To our knowledge this exact failure mode is not catalogued in the GRPO literature surveyed for this work.

Lessons

  1. Outcome rewards combined with policy fallbacks are jointly gameable. The reward function was correct; the rollout helper was the attack surface. Eval pipelines that fall back to a competent policy on invalid model output give RL agents a way to inherit that policy's reward without producing the work.
  2. Bit-exact matches to a baseline policy's score are almost always exploits, not convergence. The single most reliable diagnostic for "did my model actually learn?" is: if your trained checkpoint matches a scripted baseline to 4 decimal places, it is almost certainly producing zero useful actions. Inspect a diagnostic rollout before trusting any eval score that exactly matches a baseline.
  3. Specification gaming is the expected outcome of misspecified reward + leaky eval, not an implementation bug. Krakovna et al. catalogue similar examples across classical RL. The LLM-as-policy + typed-action + fallback-equipped-eval pattern is a new instance of an old pattern.
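The second lesson is cheap to automate. Here is a minimal sketch of a guard that flags any eval score matching a scripted baseline to four decimal places; the function name is hypothetical and the baseline values are the ones reported in this post.

```python
SCRIPTED_BASELINES = {
    "naive": 0.000,
    "concede_all": 0.444,
    "escalate_all": 0.767,
    "heuristic": 0.8132,
}


def matching_baselines(score: float, baselines: dict = SCRIPTED_BASELINES,
                       decimals: int = 4) -> list:
    """Names of baselines whose score equals `score` after rounding."""
    target = round(score, decimals)
    return [name for name, value in baselines.items()
            if round(value, decimals) == target]
```

A non-empty result is not proof of an exploit, but it is a strong prompt to inspect a diagnostic rollout before trusting the number.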

Try it yourself

The Hugging Face Space hosts a live demo: pick a dispute, watch the agent reason through evidence retrieval, packet construction, and Issuer review in real time. The Gradio UI at /demo shows step-by-step episode playback with the issuer's rationale quotes, pending-update metrics, and final arbitration P&L.

The training notebook runs end-to-end on a single Colab T4 in 75 minutes. Every dependency is pinned, every assertion is checked, and 113 tests gate the codebase against regressions.

If you build agents, train them on this. If you research RL, the cost-asymmetric primitive and the specification-gaming diagnostic are both worth reading. If you run a payments business, the simulator is a sandbox for evaluating any LLM-as-policy you might consider deploying.

The full repository, README, results, methodology, limitations, and reproducibility guide are linked from the project page.