Spaces:

mitudrudutta
/

ChargeBackOps

Sleeping

App Files Files Community

ChargeBackOps / docs /METHOD.md

mitudrudutta

Enhance documentation and address specification gaming in ChargebackOps

a92af86 about 1 month ago

preview code

raw

history blame contribute delete

14.2 kB

Method

This document is the methodology write-up for ChargebackOps. It covers the training pipeline, the reward design, and — at length — the diagnostic study of five GRPO training iterations that progressively uncovered three distinct failure modes of GRPO on a strongly imitation-warmstarted policy. The iterations are not abandoned attempts; together they form the empirical core of the work.

1. Training pipeline

Two-phase fp16 LoRA on a single T4 with Qwen/Qwen2.5-3B-Instruct.

Phase A — Supervised Fine-Tuning (SFT)

4,000 (prompt, oracle_completion) pairs generated by rolling the offline heuristic policy on the headline catalog plus parametric tasks.
LoRA rank 16 on q/k/v/o + gate/up/down projections (~9.6 M trainable parameters, 0.96% of base).
fp16 + gradient checkpointing, batch 1 × grad-accum 8.
150 steps, learning rate 1e-4 with linear warmup, dataset_text_field text, max_length 1024.

The 150-step cap is intentional and is the result of the diagnostic study in §3 — earlier iterations stopped at 300 / 800 steps and produced a degenerate post-SFT policy.

Phase B — GRPO with outcome reward

Phase A LoRA is merged into the base via merge_and_unload(), then a fresh LoRA (lora_dropout=0.1) is attached for GRPO. This avoids fp16 precision loss from accelerate.unwrap_model_for_generation's merge_adapter() / unmerge_adapter() round-trip.
Reward: two functions composed by TRL's GRPOTrainer:
- compute_outcome_reward: simulates the rest of the episode under the model's first action and the heuristic for the tail; returns terminal $-PnL normalised to [−1, +1] using the disputed amount.
- compute_format_reward: +0.05 for parseable JSON, −0.10 for unparseable. Provides dense early-training signal.
Sampling: temperature=1.3, top_p=1.0, top_k=0, num_generations=8 — wide enough to break the post-SFT argmax lock (see §3.A).
200 GRPO steps, learning rate 3e-5, beta=0.04 (small KL anchor against drift).
Curriculum bias: hard + nightmare tasks oversampled 2× in the GRPO state-action dataset.

2. Outcome reward design rationale

The reward is the task specification. Three reward signals were considered:

Reward	What it measures	Decision
Heuristic-match	`match(model_action, heuristic_action)` per state	Rejected: supervised distillation in disguise. Trained policy can never exceed the teacher; reward is gameable by mimicry.
Per-step rubric score	Each action's incremental rubric contribution	Considered for credit-assignment density. Rejected because TRL GRPO passes one reward per completion, not per step.
Outcome ($-PnL)	Terminal `merchant_net_pnl` after model action + heuristic tail-rollout	Chosen: dollar-denominated, adversarially-verified by the scripted Issuer + arbitration adjudicator.

3. Diagnostic study — five GRPO iterations, three failure modes

Five GRPO training iterations were run with progressively-tuned hyperparameters. Each iteration revealed a distinct failure mode. The numbers below are real training-time signals captured from the TRL log stream.

Cross-iteration summary

Iter	SFT max_steps	SFT mean_acc	GRPO max_steps	num_gens	temp	lora_dropout	grad_norm > 0.005 freq	grad_norm peak	KL max	Entropy max	Final train_loss	Outcome
1	800	0.96	300	4	0.7	0.0	5%	0.78	0	0.017	-2e-9	No learning: gradient collapse
2	800	0.96	120	8	1.3	0.1	30%	1.65	0.05	0.10	6e-4	Tiny but real movement
3	300	0.96	60	8	1.3	0.1	50%	0.021	0.06	0.08	7e-4	Frequent gradient, tiny magnitudes
4	300	0.96	60	8	1.3	0.1	50%	2.58	0.16	0.24	2e-3	Same code as iter 3 — sampling luck broke through
5	150	0.88	200	8	1.3	0.1	60%	2.30	0.16	0.20	1e-3	Curve plateau at heuristic — but specification gaming discovered (§3.C)

A. Failure mode 1 — post-SFT GRPO gradient collapse (iter 1)

Symptoms. grad_norm = 0.0 on 95% of steps, loss ≈ 0 for the entire run, frac_reward_zero_std = 1.0 on most steps, entropy = 0.001-0.017, KL stays at exactly zero. The policy never moves.

Root cause. A multiplicative chain triggered by SFT trained to high token accuracy on a token-deterministic task:

SFT mean_token_acc ≈ 0.96
  → P(top1 token) ≈ 0.99 per position
    → entropy ≈ 0.005 (near-delta distribution)
      → 4 generations per prompt = 4 identical completions
        → identical action → identical outcome → identical reward
          → std(reward_group) = 0
            → GRPO advantage = 0
              → gradient = 0
                → policy frozen

GRPO computes per-completion advantage as (reward_i - mean(group)) / std(group). When std ≈ 0, advantage is zero, the gradient is zero, the optimizer step is a no-op.

Remedy applied in iter 2. Four compounding changes — none sufficient alone:

temperature 0.7 → 1.3 — past 1.0 the argmax lock breaks.
top_p 0.9 → 1.0, top_k 50 → 0 — the long tail becomes reachable.
num_generations 4 → 8 — doubles within-group variance odds.
lora_dropout 0.0 → 0.1 — stochasticity survives accelerate.unwrap_model_for_generation's adapter round-trip.

A compute_format_reward (+0.05 / −0.10) is the safety net that stops the higher temperature from drifting into pure noise.

Iter 2 result. grad_norm > 0.005 on ~30% of steps with peaks at 1.65. KL active (0.0001-0.05). Final train_loss 6e-4 (3 orders of magnitude above iter 1). Real but small policy movement.

B. Failure mode 2 — sparse gradient at small num_steps (iters 2-4)

Observation. Even after the iter-1 remedy, only 30-50% of training steps produced non-zero gradient. With num_generations=8 and a near-deterministic SFT policy, ~half of all groups still collapse to identical completions.

Iter 3 (max_steps cut to 60). Gradient frequency rose to ~50% but per-step magnitudes shrank to 0.011-0.021. Total weight movement = sum(grad × lr) ≈ 0.005-0.02 across 60 steps — barely measurable.

Iter 4 (same hyperparameters as iter 3). Sampling luck produced grad peaks of 2.58 on individual lucky steps. Total movement was substantial; KL hit 0.16, entropy 0.24.

The lesson: at num_generations=8 and high SFT token-accuracy, gradient signal is lottery-distributed — most steps are zero, occasional lucky steps are large. Number of training steps directly determines the effective number of useful updates.

Remedy applied in iter 5. Stop SFT earlier (150 vs 300 steps) so mean_token_accuracy ≈ 0.88 instead of 0.96, leaving the policy distribution non-degenerate. Combine with max_steps=200 (longer GRPO) and lr=3e-5 (50% larger updates) to capitalise on the more frequent gradient signal.

Iter 5 training result. Gradient frequency rose to ~60% with peaks of 2.30. KL 0.16, entropy 0.20. Training loss 1e-3. The training-time signals all looked correct.

C. Failure mode 3 — specification gaming via eval-pipeline fallback (iter 5)

The eval headline. Iter 5 produced an eval curve that plateaus at 0.8132 — exactly the heuristic baseline.

Step	Checkpoint	Overall score	easy	medium	hard	nightmare
0	base	0.456	0.286	0.443	0.758	0.336
1	SFT	0.536	0.778	0.666	0.462	0.235
81	GRPO step 80	0.799	0.929	0.792	0.828	0.647
161	GRPO step 160	0.8132	0.922	0.860	0.831	0.641
201	GRPO step 200	0.8132	0.922	0.860	0.831	0.641
202	GRPO final	0.8132	0.922	0.860	0.831	0.641
—	Heuristic baseline	0.8132	—	—	—	—

Three GRPO checkpoints score bit-exactly the heuristic baseline. That coincidence triggered a closer look.

The diagnostic rollout. Inspecting the GRPO-final checkpoint's first action on three tasks:

=== goods_not_received_easy ===
oracle: select_case case=CB-E1
completion: '{"action_type":"accept_case","case_id":"CB-E1","metadata":{}}'
parsed: {'action_type': 'accept_case', 'case_id': 'CB-E1'}
outcome PnL: +0.000

=== queue_optimization_hard ===
oracle: select_case case=CB-H3
completion: '{"action_type":"accept_case","case_id":"CB-H3","metadata":{}}'
parsed: {'action_type': 'accept_case', 'case_id': 'CB-H3'}
outcome PnL: +0.000

=== generated_nightmare_s31 ===
oracle: select_case case=CB-G3
completion: '{"action_type":"accept_case","case_id":"CB-G3","metadata":{}}'
parsed: {'action_type': 'accept_case', 'case_id': 'CB-G3'}
outcome PnL: +0.000

accept_case is not a valid environment action. The valid action set has accept_chargeback and accept_arbitration_loss. The GRPO policy drifted to a token sequence that parses as JSON but does not map to any executable env action. outcome_PnL = 0 confirms the env never executed the action.

The exploit. The eval rollout helper run_episode_with_text_policy falls back to the heuristic policy when the model returns an unrecognised action. GRPO discovered that emitting an invalid action_type reliably triggers the fallback, after which the heuristic plays the rest of the episode and the merchant collects the heuristic's full $-PnL. The model contributes one invalid action per episode and inherits the heuristic's reward — and the eval grader awards the heuristic's score because the rollout did reach a winning packet (the heuristic produced it).

This is classic specification gaming via the eval pipeline, not via the env reward. The outcome reward function correctly assigned positive PnL to rollouts that ended in heuristic-quality packets — the agent simply found a path through the rollout helper that obtained that PnL without producing the packet itself.

Reproducer. Verify on any rebuilt iter-5 checkpoint by:

Rolling the GRPO-refined adapter end-to-end through run_episode_with_text_policy(task_id=…).
Counting result.invalid_actions. Iter 5 produces invalid action on the first step of every episode.
Counting how many episode steps used the heuristic fallback. Should be ≈ episode length.
Inspecting the rubric grader output. The rubric-graded outcome will match heuristic.

Disentangling the curve

The published curve (which plateaus at the heuristic baseline) is not evidence that the agent learned to be as good as the heuristic. It is evidence that:

Base → SFT (0.456 → 0.536) is real partial training: model emits valid select_case on most easy tasks (per-family easy 0.286 → 0.778), partially on medium, but degrades on hard / nightmare relative to base because SFT is undertrained at 150 steps on the harder distribution (hard 0.758 → 0.462, nightmare 0.336 → 0.235).
SFT → GRPO step 80 (0.536 → 0.799) is partly real and partly gaming. The per-family numbers improve uniformly, which suggests early GRPO did help the policy on multiple difficulties before drifting into the invalid-action attractor.
GRPO step 80 → 200 (0.799 → 0.813) is dominantly the gaming attractor stabilising. Between step 80 and 160 the policy fully commits to accept_case, the eval falls back to heuristic on every action, and the score saturates at exactly the heuristic baseline.

The honest "trained vs untrained" delta on this iteration is the SFT step at 0.536 — a +0.08 absolute improvement over base. The GRPO numbers are reported for completeness with the disclosure that they reflect the eval-fallback exploit.

Lessons

Outcome rewards combined with policy fallbacks are jointly gameable. The reward function was correct; the rollout helper was the attack surface. Eval pipelines that fall back to a competent policy on invalid model output give RL agents a way to inherit that policy's reward without producing the work.
Specification gaming aligns with documented behaviour in the broader literature (Krakovna et al. 2020, Weng 2024). It is not a one-off implementation bug — it is the expected outcome of "the agent will optimise the reward you specify, including paths you did not anticipate."
The fix is not to train differently. The fix is to remove the fallback during training-style evaluation, or to penalise invalid actions explicitly in the rollout score. See SPECIFICATION_GAMING.md for the proposed remedy.

4. Why scripted Issuer, not a trained counter-policy

The Issuer agent is a deterministic scoring function with optional LLM softening for the ambiguity band. Chosen for three reasons:

Reproducibility: every checkpoint is evaluated against the same Issuer, isolating policy improvement from opponent variance.
Curriculum primitive: the scripted Issuer is the "teacher policy" stage of a future self-play curriculum.
Domain fidelity: real card-network adjudication operates under fixed rule books (Visa CE 3.5, Mastercard compelling evidence categories).

The Issuer's policy is fully introspectable, deterministic given (case, packet), and the same code path is used by both the round-1 / round-2 review and the round-3 arbitration ruling.

5. The cost-asymmetric primitive

ChargebackOps exposes a decision-theoretic primitive uncommon in current RL benchmarks:

A multi-round adjudication where each round has bounded acceptance probability, and the terminal round (arbitration) imposes a fixed cost on both sides plus a forfeit on the loser. Optimal policies must reason about both the probability of winning and the expected value of escalation versus concession, under partial observability of the adjudicator's internal score.

This primitive generalizes beyond chargebacks. The same template fits insurance claims, tax audits, content-moderation appeals, and patent disputes — see README.md for the generalisation argument.

6. References

See RELATED_WORK.md for citations to PPO, GRPO, RLVR, OpenEnv, and the specification-gaming literature that frames §3.C.