Related Work

ChargebackOps sits at the intersection of four research lines: policy-gradient RL for LLMs, RL with verifiable rewards (RLVR), reward design and specification gaming, and RL environments for agent training.

1. Policy-gradient algorithms for LLM post-training

PPO: Schulman et al., Proximal Policy Optimization Algorithms, 2017. Introduced the clipped surrogate objective, a first-order stand-in for a trust-region constraint; provides the conceptual base for most LLM RL trainers.
https://arxiv.org/abs/1707.06347
GRPO (Group Relative Policy Optimization): Shao et al., DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, 2024. Removes the value model from PPO and computes advantages via within-group reward standardisation. ChargebackOps uses GRPO via TRL.
https://arxiv.org/abs/2402.03300
TRL library (Hugging Face), a widely used open-source implementation of PPO / GRPO / DPO post-training for transformer models.
https://huggingface.co/docs/trl
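The within-group reward standardisation that GRPO uses in place of a value model can be sketched in a few lines. This is a minimal illustration, not TRL's implementation: the function name `group_relative_advantages`, the `eps` stabiliser, and the choice of population standard deviation are assumptions made for the example.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardise rewards within one group of completions for the same prompt."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed rewards yields positive and negative advantages.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# A group with identical rewards yields all-zero advantages.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

When every completion in a group earns the same reward, as in `uniform`, each advantage is zero and that group contributes no policy-gradient signal.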

The post-SFT GRPO collapse documented in METHOD.md §3 is, to our knowledge, not formally characterised in the existing literature on GRPO. The DeepSeekMath experiments start RL from an instruction-tuned model, without the high-token-accuracy SFT phase that triggers the collapse. Practitioners applying GRPO to a policy that has been strongly warm-started by imitation on a token-deterministic task should be aware of this failure mode.

2. RL with verifiable rewards (RLVR)

Lambert et al., Tülu 3: Pushing Frontiers in Open Language Model Post-Training, 2024. Popularised the RLVR framing — replace learned reward models with programmatic verifiers where ground truth is checkable.
Label Studio, Reinforcement Learning from Verifiable Rewards, 2024. Practitioner overview of RLVR vs RLHF tradeoffs.
https://labelstud.io/blog/reinforcement-learning-from-verifiable-rewards/
Hu et al., Reinforcement Learning with Verifiable Environments, 2025 (RLVE). Argues that procedurally-generated, adjustable-difficulty environments are a superior reward source vs static-prompt RLVR.
https://arxiv.org/html/2511.07317v1

ChargebackOps' outcome reward is RLVR-style: the verifier is the simulated dispute outcome (terminal dollar PnL after Issuer review and arbitration), not a learned reward model. Together, the parametric task generator and the ISO 20022 adapter make the environment RLVE-style: difficulty is adjustable via reason code and difficulty tier, and the task pool is unbounded.

3. Reward design, specification gaming, reward hacking

Krakovna et al., Specification Gaming: The Flip Side of AI Ingenuity, DeepMind, 2020. Catalogue of reward-hacking failures across RL systems; foundational for thinking about what reward functions actually optimise.
https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/
Weng, Reward Hacking in Reinforcement Learning, 2024. Comprehensive survey of how reward hacking arises in modern RL.
https://lilianweng.github.io/posts/2024-11-28-reward-hacking/
Skalse et al., Defining and Characterizing Reward Hacking, 2022.
https://arxiv.org/abs/2209.13085

ChargebackOps' rubric design is anti-hacking by construction:

The 8 dimensions impose orthogonal constraints (strategy, evidence, packet, deadline, efficiency, outcome, note, escalation ROI) that no degenerate strategy can simultaneously satisfy.
Gate(CaseAbandonedRubric) forces a hard zero on any case that violates its deadline, with no recovery from the other dimensions.
The arbitration adjudicator and the Issuer scoring function share a single source of truth (evidence_strength_score), so a packet that exploits round-1 review will fare correspondingly worse in round-3 arbitration.
The four scripted-policy baselines (naive, concede_all, escalate_all, heuristic) cap at 0.0, 0.44, 0.77, and 0.81 respectively — every degenerate strategy hits a low ceiling, validating the rubric's discrimination.
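The combination of a weighted sum over the eight dimensions with a hard gate can be illustrated in plain Python. This is a stand-in sketch that does not use the OpenEnv `Rubric`, `WeightedSum`, or `Gate` classes, whose signatures are not shown here; the function name `gated_weighted_score` and the equal weights are hypothetical, chosen only for the example.

```python
# The eight rubric dimensions named above.
DIMENSIONS = (
    "strategy", "evidence", "packet", "deadline",
    "efficiency", "outcome", "note", "escalation_roi",
)


def gated_weighted_score(scores, weights, case_abandoned):
    """Weighted mean of per-dimension scores, forced to zero by the gate."""
    if case_abandoned:
        # Hard gate: no other dimension can recover an abandoned case.
        return 0.0
    total_weight = sum(weights[d] for d in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / total_weight


# Hypothetical equal weights and a perfect score on every dimension.
equal_weights = {d: 1.0 for d in DIMENSIONS}
perfect = {d: 1.0 for d in DIMENSIONS}

on_time = gated_weighted_score(perfect, equal_weights, case_abandoned=False)
abandoned = gated_weighted_score(perfect, equal_weights, case_abandoned=True)
```

Even a packet that is perfect on every dimension scores zero once the gate trips, which is the "no recovery" property described above.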

4. RL environments for agent training

OpenEnv: Meta-PyTorch's framework for RL environments with composable rubrics, FastAPI-served environments, and Hugging Face Space deployment. ChargebackOps is built directly on openenv.core.env_server.interfaces.Environment and openenv.core.rubrics.{Rubric, WeightedSum, Gate}.
https://github.com/meta-pytorch/OpenEnv
BrowserGym: ServiceNow's browser-task RL environment. Closest in spirit (real-world workflow, partial observability, multi-step) but in a different domain (web navigation vs. financial dispute resolution).
https://github.com/ServiceNow/BrowserGym
Reasoning Gym: procedurally-generated reasoning tasks with adjustable difficulty.
https://openreview.net/forum?id=GqYSunGmp7

ChargebackOps combines an environment, a rubric system, and a multi-round adversarial state machine to target a specific gap in the OpenEnv ecosystem: most existing environments are single-agent and either puzzle-style or browser-style. A cost-asymmetric multi-round adjudication environment with a programmable Issuer is, to our knowledge, the first of its kind in the OpenEnv catalogue.

5. Domain references — chargebacks and dispute resolution

Visa Compelling Evidence 3.5 (CE 3.5) policy framework. Defines the evidence categories acceptable for representment of fraud-related disputes.
Mastercard Chargeback Guide. Defines reason codes, response windows, and pre-arbitration thresholds.
ISO 20022 CASR.003 (Card Issuer-to-Acquirer Chargeback). The standardised message format for cross-network chargeback exchanges; ChargebackOps' scenarios/iso_adapter.py parses this format directly.
Stripe Disputes API. Used by connectors/stripe_sandbox.py for live or synthetic Stripe-format dispute ingestion.

The domain knowledge encoded in the environment (reason codes, evidence categories, fee schedules, deadline windows) reflects production card-network rules, not stylised abstractions.

6. Decision-theoretic foundations

Howard, Dynamic Programming and Markov Processes, 1960. Original framework for optimal policies under uncertainty.
Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994. The cost-asymmetric terminal economics in ChargebackOps (fixed fee + amount forfeit on loss) make each case a non-trivial finite-horizon MDP with risk-sensitive optimal policies.

The "escalate iff P(win) · amount > $250 fee" rule encoded in EscalationROIRubric is the EV-rational decision criterion under risk neutrality. The rubric does not penalise risk-seeking or risk-averse deviations beyond what their expected-value impact warrants — this is a deliberate choice and a place where extensions could explore CVaR-aware or prospect-theoretic policies.
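The escalation rule quoted above can be written directly as a comparison of expected recovery against the fee. The sketch below is illustrative only and is not the EscalationROIRubric implementation: the function names and the `ESCALATION_FEE` constant are assumptions introduced for the example.

```python
ESCALATION_FEE = 250.0  # dollar fee quoted in the rule above


def should_escalate(p_win, amount, fee=ESCALATION_FEE):
    """Risk-neutral rule: escalate iff expected recovery exceeds the fee."""
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must lie in [0, 1]")
    if amount <= 0:
        raise ValueError("amount must be positive")
    return p_win * amount > fee


def breakeven_win_probability(amount, fee=ESCALATION_FEE):
    """Win probability at which expected recovery equals the fee."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    return fee / amount
```

For a $1,000 dispute the break-even win probability is 0.25. For disputes of $250 or less the break-even probability is at least 1, so the rule never escalates them regardless of evidence strength.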