# Related Work

ChargebackOps sits at the intersection of four research lines: policy-gradient RL for LLMs, RL with verifiable rewards (RLVR), reward design and specification gaming, and RL environments for agent training.

## 1. Policy-gradient algorithms for LLM post-training

- **PPO**: Schulman et al., *Proximal Policy Optimization Algorithms*, 2017. A policy-gradient algorithm that approximates a trust region with a clipped surrogate objective; it provides the conceptual base for most LLM RL trainers.  
  https://arxiv.org/abs/1707.06347
- **GRPO** (Group Relative Policy Optimization): Shao et al., *DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models*, 2024. Removes the value model from PPO and computes advantages via within-group reward standardisation. ChargebackOps uses GRPO via TRL.  
  https://arxiv.org/abs/2402.03300
- **TRL** library (Hugging Face), the reference implementation for PPO / GRPO / DPO post-training of transformer models.  
  https://huggingface.co/docs/trl
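
The within-group reward standardisation that GRPO uses can be sketched in a few lines. This is an illustrative sketch only, not TRL's code: the function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` value are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardise rewards within one group of completions for the same prompt.

    Hypothetical helper: the name, the population standard deviation, and the
    eps value are illustrative assumptions, not taken from TRL or the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions sampled for one prompt, each scored by the verifier.
mixed = group_relative_advantages([0.0, 1.0, 0.0, 1.0])

# A group in which every completion earns the same reward.
uniform = group_relative_advantages([0.8, 0.8, 0.8, 0.8])
```

In the sketch, a group with mixed rewards yields positive advantages for the better completions and negative ones for the worse, while a group with identical rewards yields all-zero advantages and therefore no learning signal from that group.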

The post-SFT GRPO collapse documented in [`METHOD.md`](METHOD.md) §3 is, to our knowledge, not formally characterised in the existing GRPO literature. The DeepSeekMath experiments warm-start from base instruct models, without the high-token-accuracy SFT phase that triggers the collapse. Practitioners applying GRPO to a strongly imitation-warm-started policy on a token-deterministic task should be aware of this failure mode.

## 2. RL with verifiable rewards (RLVR)

- Lambert et al., *Tülu 3: Pushing Frontiers in Open Language Model Post-Training*, 2024. Popularised the RLVR framing: replace learned reward models with programmatic verifiers wherever ground truth is checkable.
- Label Studio, *Reinforcement Learning from Verifiable Rewards*, 2024. Practitioner overview of RLVR vs RLHF tradeoffs.  
  https://labelstud.io/blog/reinforcement-learning-from-verifiable-rewards/
- Hu et al., *Reinforcement Learning with Verifiable Environments*, 2025 (RLVE). Argues that procedurally generated, adjustable-difficulty environments are a better reward source than static-prompt RLVR.  
  https://arxiv.org/html/2511.07317v1

ChargebackOps' outcome reward is RLVR-style: the verifier is the simulated dispute outcome (terminal dollar PnL after Issuer review and arbitration), not a learned reward model. Together, the parametric task generator and the ISO 20022 adapter make the environment RLVE-style: difficulty is adjustable via reason code and difficulty tier, and the task pool is unbounded.

## 3. Reward design, specification gaming, reward hacking

- Krakovna et al., *Specification Gaming: The Flip Side of AI Ingenuity*, DeepMind, 2020. Catalogue of reward-hacking failures across RL systems; foundational for thinking about what reward functions actually optimise.  
  https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/
- Weng, *Reward Hacking in Reinforcement Learning*, 2024. Comprehensive survey of how reward hacking arises in modern RL.  
  https://lilianweng.github.io/posts/2024-11-28-reward-hacking/
- Skalse et al., *Defining and Characterizing Reward Hacking*, 2022.  
  https://arxiv.org/abs/2209.13085

ChargebackOps' rubric design is anti-hacking by construction:

- The 8 dimensions impose orthogonal constraints (strategy, evidence, packet, deadline, efficiency, outcome, note, escalation ROI) that no degenerate strategy can simultaneously satisfy.
- The `Gate(CaseAbandonedRubric)` zeroes the reward for any case that violates its deadline, with no way to recover the score.
- The arbitration adjudicator and the Issuer scoring function share a single source of truth (`evidence_strength_score`), so a packet that exploits round-1 review will fare correspondingly worse in round-3 arbitration.
- The four scripted-policy baselines (naive, concede_all, escalate_all, heuristic) cap at 0.0, 0.44, 0.77, and 0.81 respectively — every degenerate strategy hits a low ceiling, validating the rubric's discrimination.
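
The combination of a hard gate with a weighted multi-dimension score can be illustrated with a small standalone sketch. It does not use the OpenEnv `Rubric`, `WeightedSum`, or `Gate` classes: the function `gated_weighted_score`, the equal weights, and the example scores are hypothetical, and only the eight dimension names come from the list above.

```python
# The eight rubric dimensions named above; the weights are hypothetical.
DIMENSIONS = (
    "strategy", "evidence", "packet", "deadline",
    "efficiency", "outcome", "note", "escalation_roi",
)
EQUAL_WEIGHTS = {name: 1.0 for name in DIMENSIONS}


def gated_weighted_score(scores, weights, gate_passed):
    """Return a normalised weighted sum, or 0.0 if the gate fails.

    Hypothetical helper, not the OpenEnv API: `scores` maps each dimension
    to a value in [0, 1], and `gate_passed` is False for an abandoned case.
    """
    if not gate_passed:
        # Hard zero: no other dimension can compensate for a failed gate.
        return 0.0
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


perfect = {name: 1.0 for name in DIMENSIONS}
on_time = gated_weighted_score(perfect, EQUAL_WEIGHTS, gate_passed=True)
abandoned = gated_weighted_score(perfect, EQUAL_WEIGHTS, gate_passed=False)
```

Because the gate is applied before the sum rather than as one more weighted term, a policy cannot trade a missed deadline against high scores elsewhere.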

## 4. RL environments for agent training

- **OpenEnv**: Meta-PyTorch's framework for RL environments with composable rubrics, FastAPI-served environments, and Hugging Face Space deployment. ChargebackOps is built directly on `openenv.core.env_server.interfaces.Environment` and `openenv.core.rubrics.{Rubric, WeightedSum, Gate}`.  
  https://github.com/meta-pytorch/OpenEnv
- **BrowserGym**: ServiceNow's browser-task RL environment. Closest in spirit (real-world workflow, partial observability, multi-step) but in a different domain (web navigation vs. financial dispute resolution).  
  https://github.com/ServiceNow/BrowserGym
- **Reasoning Gym**: procedurally-generated reasoning tasks with adjustable difficulty.  
  https://openreview.net/forum?id=GqYSunGmp7

ChargebackOps' combination of environment, Rubric system, and multi-round adversarial state machine targets a specific gap in the OpenEnv ecosystem: most existing environments are single-agent and either puzzle-style or browser-style. A cost-asymmetric multi-round adjudication environment with a programmable Issuer is, to our knowledge, the first of its kind in the OpenEnv catalogue.

## 5. Domain references — chargebacks and dispute resolution

- Visa Compelling Evidence 3.5 (CE 3.5) policy framework. Defines the evidence categories acceptable for representment of fraud-related disputes.
- Mastercard Chargeback Guide. Defines reason codes, response windows, and pre-arbitration thresholds.
- ISO 20022 CASR.003 (Card Issuer-to-Acquirer Chargeback). The standardised message format for cross-network chargeback exchanges; ChargebackOps' [`scenarios/iso_adapter.py`](../scenarios/iso_adapter.py) parses this format directly.
- Stripe Disputes API. Used by [`connectors/stripe_sandbox.py`](../connectors/stripe_sandbox.py) for live or synthetic Stripe-format dispute ingestion.

The domain knowledge encoded in the environment (reason codes, evidence categories, fee schedules, deadline windows) reflects production card-network rules, not stylised abstractions.

## 6. Decision-theoretic foundations

- Howard, *Dynamic Programming and Markov Processes*, 1960. Original framework for optimal policies under uncertainty.
- Puterman, *Markov Decision Processes: Discrete Stochastic Dynamic Programming*, 1994. The cost-asymmetric terminal economics in ChargebackOps (fixed fee + amount forfeit on loss) make each case a non-trivial finite-horizon MDP with risk-sensitive optimal policies.

The "escalate iff `P(win) · amount > $250 fee`" rule encoded in `EscalationROIRubric` is the EV-rational decision criterion under risk neutrality: escalating is worthwhile only when the expected recovery exceeds the fixed fee. The rubric does not penalise risk-seeking or risk-averse deviations beyond what their expected-value impact warrants. This is a deliberate choice, and extensions could explore CVaR-aware or prospect-theoretic policies here.
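
A minimal sketch of the escalation criterion, assuming the $250 fee quoted above, follows. It is not the `EscalationROIRubric` implementation; the names `should_escalate` and `break_even_win_probability` are hypothetical.

```python
ESCALATION_FEE_USD = 250.0  # fee quoted in the rule above


def should_escalate(p_win, amount_usd, fee_usd=ESCALATION_FEE_USD):
    """Risk-neutral rule: escalate iff expected recovery exceeds the fee.

    Hypothetical helper illustrating the criterion; not the rubric's code.
    """
    return p_win * amount_usd > fee_usd


def break_even_win_probability(amount_usd, fee_usd=ESCALATION_FEE_USD):
    """Win probability at which escalating has zero expected gain."""
    return fee_usd / amount_usd


# A $1,000 dispute needs better than a 25% chance of winning to justify the fee.
threshold = break_even_win_probability(1000.0)
likely_win = should_escalate(0.5, 1000.0)    # 500 > 250
unlikely_win = should_escalate(0.2, 1000.0)  # 200 > 250 is false
```

Under this rule the break-even win probability falls as the disputed amount grows, so small-ticket disputes are rarely worth escalating even when the evidence is strong.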