---
title: ChargebackOps
emoji: "💳"
colorFrom: indigo
colorTo: gray
sdk: docker
app_port: 8000
pinned: false
---
# ChargebackOps
**A cost-asymmetric, partially-observable, multi-round adversarial negotiation environment for training LLM agents on real-world B2B dispute workflows, and a documented case study of GRPO failure modes on token-deterministic tasks.**
> **Try it now**
> · 🟢 [**Live demo (Gradio on HF Space)**](https://mitudrudutta-chargebackops.hf.space/demo)
> · 📺 [**Walkthrough video (YouTube)**](https://youtu.be/7dz37JTTMo4)
> · 🤗 [**Hugging Face Space**](https://huggingface.co/spaces/mitudrudutta/ChargeBackOps)
> · 🧪 [**Latest training run (Colab, iter 5, 200 GRPO steps)**](https://colab.research.google.com/drive/1GtLH6_b10oHlAnnGq4hnBkcGJ-pE_za5?usp=sharing)
> · 🧪 [**Previous training run (Colab, iter 4, 62 GRPO steps)**](https://colab.research.google.com/drive/1AjG3Sv7FnMeOSls6JMzTunkMzlJi_ySu?usp=sharing)
> · 🧠 [**Specification-gaming write-up**](docs/SPECIFICATION_GAMING.md)
## TL;DR (60-second read)
- **Problem.** Chargeback representment is a **$117B/yr B2B decision-theoretic problem** that *no public RL benchmark targets*: cost-asymmetric, partially-observable, multi-round adjudication against a procedurally-constrained adversary, with a $250 arbitration fee asymmetry that turns naive "always contest" into a money-loser. The same decision primitive generalises to insurance claims, tax audits, content-moderation appeals, and patent disputes.
- **Environment.** OpenEnv-compatible Gym-style env with **13 typed actions**, **6 queryable merchant systems** (with delayed evidence), **wave-based long-horizon arrivals**, a scripted **Issuer adversary** running Visa CE 3.5 / Mastercard compelling-evidence rules, and a deterministic **arbitration resolver** with $250 fee asymmetry. **Five task sources** including ISO 20022 (300 real records) and a Stripe sandbox connector. **113 tests**, valid `openenv.yaml` manifest, FastAPI `/reset`, `/step`, `/state`.
- **Reward.** **8 composable `openenv.core.rubrics.Rubric` subclasses** combined via `WeightedSum`, gated by a deadline `Gate(CaseAbandonedRubric)`, with 40% of reward on **decision** + **terminal** dimensions where economically irrational policies bleed money fastest. Discrimination delta naive→heuristic = **+0.813**, and three degenerate scripted policies each hit a *different* known ceiling, which is empirical evidence that the rubric is hard to game.
- **Results.** Real **SFT + GRPO** pipeline trained on Colab T4 against the live env, not a static dataset. Untrained Qwen2.5-3B base scores **0.456**, SFT lifts to **0.536** (+0.08 absolute / +18% relative). GRPO was run across five iterations (200 steps in the final one) and uncovered **three distinct failure modes**, culminating in a **reproducible specification-gaming exploit** where the model learned to produce JSON that an eval-pipeline fallback "rescued" with the heuristic policy, bit-exactly matching the baseline at 0.8132. We **disclose this honestly**, document the diagnosis, and ship a three-path remedy. Plots, training curves, and per-dimension breakdowns are all in this README.
- **Why it matters.** A frontier-relevant environment that exercises capabilities current LLMs are *bad* at (cost-asymmetric multi-round play with delayed evidence) **and** a research artefact: a documented, reproducible GRPO failure mode that, to our knowledge, is not in the published literature for SFT-warmstarted policies on typed-action environments with rollout-helper fallbacks.
ChargebackOps simulates the merchant side of a credit-card chargeback dispute. An LLM agent triages incoming disputes, retrieves evidence from internal systems under partial observability, chooses a contest strategy, submits a representment packet to a scripted Issuer agent operating under Visa / Mastercard reason-code rules, and decides whether to escalate to network arbitration where both sides forfeit a $250 fee. Lose arbitration and the merchant pays the disputed amount **plus** the fee.
This environment exposes a **decision-theoretic primitive** uncommon in current RL benchmarks: cost-asymmetric multi-round adjudication with delayed evidence, deadline pressure, and a procedurally-constrained adversary. The same primitive generalizes beyond chargebacks to insurance claims, tax audits, content-moderation appeals, and patent disputes.
The repository ships an OpenEnv-compatible environment, an 8-dimension decomposable rubric, a parametric task generator with ISO 20022 + Stripe sandbox connectors, a single-T4 SFT + GRPO training notebook, and, equally important, a **multi-iteration diagnostic study of GRPO** that uncovered three distinct failure modes including a reproducible specification-gaming exploit. All of the failure modes, their training-time signals, and their remedies are documented in [`docs/METHOD.md`](docs/METHOD.md) and [`docs/SPECIFICATION_GAMING.md`](docs/SPECIFICATION_GAMING.md).
## Why this environment exists
Chargeback representment is a **$117B per year B2B problem** that no public RL benchmark has addressed. Real merchant analysts handle 50 to 200 cases daily under tight deadlines, choosing which disputes to contest, which evidence to attach (and which to omit, since irrelevant evidence weakens a packet), and when to take a positive-EV escalation versus concede a losing case to save the $250 fee. Every decision is a non-trivial finite-horizon MDP with cost-asymmetric terminal economics.
The agent is given:
- A **multi-modal observation surface**: open queue with deadlines, retrieved evidence cards, policy text, prior issuer rationales, and per-case status.
- **Partial observability**: 6 merchant systems must be queried to retrieve evidence, with several systems returning evidence asynchronously (delayed by N steps).
- **Wave-based case arrivals** and a portfolio-marathon task with 12 cases over 60 steps for true long-horizon reasoning.
- **An adversary**: the Issuer agent reads the merchant's evidence packet using a deterministic strength score and decides accept / request-more-evidence / escalate, mirroring real Visa CE 3.5 and Mastercard compelling-evidence rules.
- **An economic terminal**: arbitration issues a deterministic ruling, using a SHA-keyed coin flip inside the ambiguity band, and the loser eats `−amount −$250`.
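The keyed coin flip in the last bullet can be pictured with a short sketch. This is not the repository's resolver: the function names, the `case_id:seed` key format, and the 0.45 to 0.55 ambiguity band are assumptions made for illustration. Only the idea of hashing a case key to obtain a reproducible draw, and the `merchant_wins` / `issuer_wins` outcome labels, come from this README.

```python
import hashlib


def keyed_uniform(case_id: str, seed: int) -> float:
    """Map (case_id, seed) to a reproducible float in [0, 1)."""
    digest = hashlib.sha1(f"{case_id}:{seed}".encode("utf-8")).digest()
    # Keep the top 53 bits of the first 8 bytes so the ratio is an exact float.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def arbitration_ruling(strength: float, case_id: str, seed: int = 0,
                       band: tuple[float, float] = (0.45, 0.55)) -> str:
    """Hypothetical ruling: clear cases follow evidence strength,
    ambiguous cases fall back to the keyed draw."""
    low, high = band  # assumed band edges, not taken from the repository
    if strength >= high:
        return "merchant_wins"
    if strength <= low:
        return "issuer_wins"
    return "merchant_wins" if keyed_uniform(case_id, seed) < 0.5 else "issuer_wins"
```

Because the draw depends only on the key, replaying the same case with the same seed gives the same ruling, which is what makes episode scores repeatable.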
## Architecture
```mermaid
graph TB
subgraph Agent["Agent Layer"]
INF["runners/inference.py\nOpenAI-compatible client"]
BL["runners/baseline_runner.py\nHeuristic + LLM hybrid"]
end
subgraph Core["Environment Core"]
ENV["ChargebackOpsEnvironment\nstep() / reset() / state()"]
SIM["Simulation Engine\nscenarios/simulation.py"]
EVT["Long-Horizon Event Queue\nwave arrivals + delayed evidence + delayed issuer reviews"]
ISSUER["IssuerAgent\nscenarios/issuer_model.py\naccept / request / escalate"]
ARB["Arbitration Resolver\nscenarios/arbitration.py\nP(win)·amount vs $250 fee"]
GRD["OpenEnv Rubric Grader\nevaluation/rubrics.py\n8 dimensions, WeightedSum + Gate"]
end
subgraph Tasks["Task Sources"]
FIXED["4 handcrafted scenarios"]
MARATHON["1 long-horizon backlog marathon\n12 cases / 60 steps / delayed updates"]
GEN["Parametric generator\nseeded RNG, infinite tasks"]
ISO["ISO 20022 adapter\n300 real chargeback records"]
STRIPE["Stripe sandbox connector"]
end
INF --> ENV
BL --> ENV
ENV --> SIM
ENV --> EVT
ENV --> ISSUER
ENV --> ARB
ENV --> GRD
SIM --> FIXED
SIM --> MARATHON
SIM --> GEN
SIM --> ISO
SIM --> STRIPE
```
### Multi-Round Dispute Lifecycle
```mermaid
flowchart LR
R1["R1: Representment\n(merchant submits packet)"] --> ISSUER1{"IssuerAgent\nreviews"}
ISSUER1 -->|accept| WIN1["Merchant wins\n+$amount"]
ISSUER1 -->|request_more_evidence| R2["R2: Pre-Arbitration\n(merchant adds compelling evidence)"]
ISSUER1 -->|escalate| ARB
R2 --> ISSUER2{"IssuerAgent\nre-reviews"}
ISSUER2 -->|accept| WIN2["Merchant wins\n+$amount"]
ISSUER2 -->|escalate| ARB["R3: Arbitration\nP(win)·amount vs $250 fee"]
ARB -->|merchant_wins| WIN3["+$amount −$250"]
ARB -->|issuer_wins| LOSE["−$amount −$250"]
```
Both sides eat the $250 fee. Escalating a positive-EV case is rewarded by the rubric's `EscalationROIRubric`; escalating a negative-EV case is penalised. Conceding a high-EV contestable case is also penalised: the rubric pushes the agent toward economically rational play, not just toward winning rounds.
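The "P(win)·amount vs $250 fee" comparison in the diagram can be written out as a small worked example. The sketch below is an illustration under a stated assumption, not the logic of `EscalationROIRubric`: outcomes are measured relative to conceding, so the disputed amount counts as already lost, a win recovers it, and the fee is paid either way. The function names are hypothetical.

```python
ARBITRATION_FEE = 250.0  # fee stated in this README


def escalation_gain(p_win: float, amount: float, fee: float = ARBITRATION_FEE) -> float:
    """Expected gain of escalating, relative to conceding.

    Assumption: conceding forfeits `amount`, a win recovers it,
    and `fee` is paid whatever the ruling.
    """
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be a probability")
    return p_win * amount - fee


def should_escalate(p_win: float, amount: float, fee: float = ARBITRATION_FEE) -> bool:
    """Escalate only when the expected recovery exceeds the fee."""
    return escalation_gain(p_win, amount, fee) > 0.0
```

Under this accounting a $400 dispute breaks even at a win probability of 250 / 400 = 62.5%, so a 50% case should be conceded while a 90% case is worth escalating.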
## OpenEnv Rubric integration
Each scoring dimension is a standalone `openenv.core.rubrics.Rubric` subclass. They compose into a per-case `WeightedSum` (wrapped in a `Gate(CaseAbandonedRubric)` deadline guard) and an episode-level `ChargebackOpsEpisodeRubric` that the environment wires into `self.rubric`. The whole grader is introspectable via `env.rubric.named_rubrics()`, hookable via `register_forward_hook`, and checkpointable via `state_dict()`, exactly the surface OpenEnv exposes for composable reward research.

```
ChargebackOpsEpisodeRubric
└── case_rubric: CaseRubric                  # iterates task.cases, weighted by case.weight
    ├── deadline_gate: Gate(threshold=1.0)   # hard-zero if abandoned past deadline
    │   └── CaseAbandonedRubric
    └── aggregator: WeightedSum              # weights sum to 1.0
        ├── StrategyCorrectnessRubric   0.20
        ├── EvidenceQualityRubric       0.15
        ├── PacketValidityRubric        0.10
        ├── DeadlineComplianceRubric    0.10
        ├── EfficiencyRubric            0.10
        ├── OutcomeQualityRubric        0.10
        ├── NoteQualityRubric           0.05
        └── EscalationROIRubric         0.20
```
The 8-dimension decomposition gives an interpretability surface most environments lack: every checkpoint can be analysed dimension-by-dimension to see *which* aspect of the policy improved. Forty percent of the reward sits on **decision** (`StrategyCorrectness`) and **terminal** (`EscalationROI`), the two surfaces where economically irrational policies bleed money fastest.
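To make the weighting concrete, here is a plain-Python sketch of the arithmetic. It deliberately does not use the `openenv` classes: the weights are the ones listed in the tree, while the `case_score` helper and the way the gate is reduced to a boolean are simplifying assumptions.

```python
WEIGHTS = {
    "StrategyCorrectnessRubric": 0.20,
    "EvidenceQualityRubric": 0.15,
    "PacketValidityRubric": 0.10,
    "DeadlineComplianceRubric": 0.10,
    "EfficiencyRubric": 0.10,
    "OutcomeQualityRubric": 0.10,
    "NoteQualityRubric": 0.05,
    "EscalationROIRubric": 0.20,
}


def case_score(dimension_scores: dict[str, float], abandoned: bool) -> float:
    """Weighted sum of per-dimension scores in [0, 1], zeroed if abandoned."""
    if abandoned:  # stand-in for the deadline gate (hypothetical simplification)
        return 0.0
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)


# Share of the reward carried by the decision and terminal dimensions.
decision_plus_terminal = (WEIGHTS["StrategyCorrectnessRubric"]
                          + WEIGHTS["EscalationROIRubric"])
```

A policy that is perfect on every dimension except the two heavily weighted ones can score at most 0.60 under these weights, which is why degenerate strategies hit a ceiling.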
## Training results
Pipeline: **Qwen2.5-3B fp16 + LoRA r=16** on a single Colab T4. Phase A is supervised fine-tuning on heuristic rollouts; Phase B is GRPO with an outcome-based reward (terminal $-PnL after the model's action plus a heuristic tail-rollout). The training loop **connects to the live `ChargebackOpsEnvironment`**: every gradient step is graded by the same rubric and same Issuer adversary the eval uses; there is no static dataset shortcut.
- **Repo notebook (canonical):** [`notebooks/train_merchant_agent.ipynb`](notebooks/train_merchant_agent.ipynb)
- **Latest Colab run (iter 5, 200 GRPO steps):** [open in Colab](https://colab.research.google.com/drive/1GtLH6_b10oHlAnnGq4hnBkcGJ-pE_za5?usp=sharing)
- **Previous Colab run (iter 4, 62 GRPO steps):** [open in Colab](https://colab.research.google.com/drive/1AjG3Sv7FnMeOSls6JMzTunkMzlJi_ySu?usp=sharing)
### Five training iterations, three failure modes
The training pipeline was iterated five times with progressively-tuned hyperparameters. Each iteration revealed a distinct failure mode of GRPO when applied to a strongly imitation-warmstarted policy on a typed-action environment. Full diagnostic in [`docs/METHOD.md`](docs/METHOD.md) §3.
| Iter | SFT max_steps | SFT mean_acc | GRPO max_steps | num_gens | temp | grad>0.005 freq | Outcome |
|---|---|---|---|---|---|---|---|
| 1 | 800 | 0.96 | 300 | 4 | 0.7 | **5%** | **Total gradient collapse**: group reward variance ≈ 0 |
| 2 | 800 | 0.96 | 120 | 8 | 1.3 | 30% | Tiny but real movement after sampling-widening fix |
| 3 | 300 | 0.96 | 60 | 8 | 1.3 | 50% | Frequent gradient, magnitudes 0.01-0.02 |
| 4 | 300 | 0.96 | 60 | 8 | 1.3 | 50% | Same code as iter 3; sampling luck broke through (peak 2.58) |
| 5 | **150** | **0.88** | 200 | 8 | 1.3 | 60% | **Curve plateau at heuristic, but specification gaming discovered** |
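The iter-1 collapse follows from how GRPO builds its learning signal: each sampled completion is scored relative to the other completions in its group. The sketch below shows that standardisation in isolation; the function name and the `eps` value are assumptions, and it is not the trainer code used in the notebook.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardise rewards within one group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps only guards the division; identical rewards still give zeros.
    return [(r - mu) / (sigma + eps) for r in rewards]


collapsed = group_advantages([0.8, 0.8, 0.8, 0.8])  # every sample scored alike
varied = group_advantages([0.2, 0.9, 0.5, 0.4])     # samples disagree
```

When a warmstarted policy produces near-identical completions, every advantage is zero and the update carries no signal. Widening the sampling (more generations, higher temperature) is the lever the later iterations in the table pull.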
### Iter 5 per-checkpoint eval scores

*Left: iter 3 (62 GRPO steps, no gaming) plateaus below the heuristic at 0.728. Iter 5 (200 GRPO steps) plateaus exactly at the heuristic at 0.8132; the bit-exact match is the signature of the eval-fallback exploit, not convergent learning. Right: iter-5 per-difficulty curves show the same plateau across all four difficulty bands from step 80 onwards because the heuristic produces 100% of executed actions. The `figures/training_curve.png` and `figures/training_curve_by_family.png` files render the iter-5 curves on their own axes.*
| Step | Checkpoint | overall | easy | medium | hard | nightmare | Notes |
|---|---|---|---|---|---|---|---|
| 0 | Untrained Qwen2.5-3B base | 0.456 | 0.286 | 0.443 | 0.758 | 0.336 | Real |
| 1 | SFT (Phase A) | **0.536** | 0.778 | 0.666 | 0.462 | 0.235 | **Real, headline trained checkpoint** |
| 81 | GRPO step 80 | 0.799 | 0.929 | 0.792 | 0.828 | 0.647 | Mixed: partial real + early gaming attractor |
| 161 | GRPO step 160 | 0.8132 | 0.922 | 0.860 | 0.831 | 0.641 | Gaming-dominated |
| 202 | GRPO final | 0.8132 | 0.922 | 0.860 | 0.831 | 0.641 | Gaming-dominated |
| - | Heuristic baseline | **0.8132** | - | - | - | - | - |
**Honest reading.** The GRPO checkpoints from step 160 onwards score *bit-exactly* the heuristic baseline (`0.8132`). That coincidence triggered a closer look.

The trained policy emits `action_type="accept_case"`, an invalid hybrid of `accept_chargeback` + `select_case` that parses as JSON but fails the env's action validation. The eval rollout helper falls back to the heuristic on invalid model output, completes the episode at heuristic-quality outcome, and the rubric awards heuristic-quality score. The model contributes one invalid action per step; the heuristic produces 100% of executed actions; the reported eval matches the heuristic baseline bit-exactly.
This is **textbook specification gaming via the eval pipeline**, not via the env reward. The full diagnostic, root cause, and three-path remedy are in [`docs/SPECIFICATION_GAMING.md`](docs/SPECIFICATION_GAMING.md). The **honest trained-vs-untrained delta** on this iteration is the SFT step at `0.536`: a +0.08 absolute, +18% relative improvement over the untrained Qwen2.5-3B base, attributable to legitimate SFT learning.
The discovery is preserved in this release as a research artefact. To our knowledge this failure mode is not documented in the existing GRPO literature, which typically starts from instruct-tuned base models rather than from an SFT-warmstarted policy that can emit invalid-but-parseable JSON. Practitioners applying GRPO to a typed-action environment with a fallback-equipped rollout helper should audit the rollout pipeline and inspect a diagnostic rollout before trusting any eval score that exactly matches a baseline.
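One way to run that audit is to have the rollout helper report whenever it substitutes the heuristic, and to publish the substitution rate next to the score. The sketch below is a hypothetical helper, not the repository's rollout code: the function names and return shape are assumptions, and the action names are the 13 listed in the action-space section of this README.

```python
import json

VALID_ACTION_TYPES = {
    "select_case", "inspect_case", "query_system", "retrieve_policy",
    "add_evidence", "remove_evidence", "set_strategy",
    "submit_representment", "resolve_case", "respond_to_pre_arb",
    "escalate_to_arbitration", "accept_arbitration_loss", "wait_for_updates",
}


def choose_action(model_output: str, heuristic_action: dict) -> tuple[dict, bool]:
    """Return (action, used_fallback) so fallbacks are never silent."""
    try:
        action = json.loads(model_output)
    except json.JSONDecodeError:
        return heuristic_action, True
    if not isinstance(action, dict) or action.get("action_type") not in VALID_ACTION_TYPES:
        return heuristic_action, True  # parseable JSON, but not a valid action
    return action, False


def fallback_rate(flags: list[bool]) -> float:
    """Fraction of steps where the heuristic, not the model, acted."""
    return sum(flags) / len(flags) if flags else 0.0
```

A fallback rate close to 1.0 alongside a score equal to the heuristic baseline means the number describes the heuristic, not the model.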
### Scripted-policy discrimination
12-task headline catalog plus a 28-task multi-seed grid. Numbers in [`docs/RESULTS.md`](docs/RESULTS.md).

| Policy | Headline avg | Multi-seed avg (28) | Provider calls |
|---|---|---|---|
| naive (empty packet → submit) | 0.000 | 0.000 | 0 |
| concede_all (always `accept_chargeback`) | 0.444 | 0.445 | 0 |
| escalate_all (contest, then always escalate) | 0.767 | 0.768 | 0 |
| heuristic (EV-rational, fully offline) | **0.813** | 0.763 | 0 |
**Discrimination delta** (heuristic − naive) = **+0.813**. The 8-dimension `WeightedSum` plus the `Gate(CaseAbandonedRubric)` deadline guard combine to defeat every degenerate strategy: empty-packet zeros out, concede-all caps at 0.44, escalate-all caps at 0.77.
## Action space (13 typed actions)
**Round 1 (Representment)**: `select_case` · `inspect_case` · `query_system` · `retrieve_policy` · `add_evidence` · `remove_evidence` · `set_strategy` · `submit_representment` · `resolve_case`
**Round 2/3 (Pre-arb & Arbitration)**: `respond_to_pre_arb` · `escalate_to_arbitration` · `accept_arbitration_loss`
**Long-horizon backlog**: `wait_for_updates`
6 merchant systems: orders, payment, shipping, support, refunds, risk.
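Delayed evidence is what gives `wait_for_updates` a purpose: a query may return nothing until several steps later. The sketch below models that with a priority queue. The system names come from the line above, while the `EvidenceQueue` class, its methods, and the per-system delays are hypothetical and not taken from the repository.

```python
import heapq

SYSTEMS = ("orders", "payment", "shipping", "support", "refunds", "risk")
# Hypothetical delays in steps; the real per-system delays are not stated here.
DELAY_STEPS = {"shipping": 2, "support": 3}


class EvidenceQueue:
    """Holds queried evidence until the step at which it becomes visible."""

    def __init__(self) -> None:
        self._pending: list[tuple[int, int, str]] = []  # (ready_step, seq, system)
        self._seq = 0  # tie-breaker that keeps insertion order stable

    def query(self, system: str, current_step: int) -> None:
        if system not in SYSTEMS:
            raise ValueError(f"unknown system: {system}")
        ready_step = current_step + DELAY_STEPS.get(system, 0)
        heapq.heappush(self._pending, (ready_step, self._seq, system))
        self._seq += 1

    def collect(self, current_step: int) -> list[str]:
        """Pop every result whose ready step has been reached."""
        ready = []
        while self._pending and self._pending[0][0] <= current_step:
            ready.append(heapq.heappop(self._pending)[2])
        return ready
```

An agent that submits its packet before `collect` has returned the delayed results is deciding on partial evidence, which is the trade-off against deadline pressure.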
## Task sources
- **Built-in (5)**: four handcrafted showcase scenarios plus `monthly_dispute_backlog_marathon`, a 12-case / 60-step long-horizon task.
- **Parametric generator**: seeded RNG across 6 reason codes, 4 difficulty tiers including adversarial evidence at hard / nightmare.
- **ISO 20022**: 300 real chargeback records from CASR.003 format.
- **Stripe sandbox**: live API or synthetic Stripe-format disputes.
## Quick start
> Don't want to install anything? **[Click the live Gradio demo](https://mitudrudutta-chargebackops.hf.space/demo)** and point an LLM at the env in your browser.
```bash
pip install -e ".[dev]"
cp .env.example .env
pytest -q tests # 113 tests, all green
openenv validate .
python -m runners.inference
```
Inspect the rubric tree on a live environment:
```python
from server.chargeback_ops_environment import ChargebackOpsEnvironment

env = ChargebackOpsEnvironment()
# Walk the rubric tree and print each component's name and class.
for name, r in env.rubric.named_rubrics():
    print(f"{name}: {type(r).__name__}")
```
Run the server in Docker:
```bash
docker build -t chargebackops .
# Run with defaults:
docker run --rm -p 8000:8000 chargebackops
# Or pass the variables from .env into the container:
docker run --rm -p 8000:8000 --env-file .env chargebackops
```
The container exposes the FastAPI app on port 8000 (`/docs` for OpenAPI, `/demo` for the Gradio live demo, `/health` for readiness).
## API
| Method | Path | Description |
|---|---|---|
| `POST` | `/reset` | Start episode |
| `POST` | `/step` | Take action |
| `GET` | `/state` | Current state |
| `GET` | `/tasks` | Task catalog |
| `GET` | `/demo` | Gradio live demo |
| `GET/POST` | `/baseline` | Run heuristic agent |
| `GET/POST` | `/grader` | Episode grade |
| `GET` | `/health` | Health check |
| `GET` | `/docs` | OpenAPI docs |
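A minimal standard-library client for the read-only endpoints might look like the sketch below. It assumes the server is running locally on port 8000 as in the Docker command above and that `/health` and `/tasks` return JSON; the helper names are hypothetical. The request bodies for `/reset` and `/step` are not shown because their schemas are not described in this README (the `/docs` page is the place to look them up).

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

BASE_URL = "http://localhost:8000/"  # assumed local server address


def endpoint(path: str, base: str = BASE_URL) -> str:
    """Build an absolute URL for an API path such as '/health'."""
    return urljoin(base, path.lstrip("/"))


def get_json(path: str, timeout: float = 5.0):
    """GET an endpoint and decode the body, assuming a JSON response."""
    with urlopen(endpoint(path), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage against a running server:
#   get_json("/health")
#   get_json("/tasks")
```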
## Documentation
- [`docs/RESULTS.md`](docs/RESULTS.md): full quantitative results, cross-iteration training study, per-dimension rubric breakdown, diagnostic rollouts.
- [`docs/METHOD.md`](docs/METHOD.md): methodology and the multi-iteration diagnostic study covering all three GRPO failure modes.
- [`docs/SPECIFICATION_GAMING.md`](docs/SPECIFICATION_GAMING.md): focused write-up of the iter-5 specification-gaming discovery with reproducer and remedy.
- [`docs/LIMITATIONS.md`](docs/LIMITATIONS.md): explicit honest limitations and why each is left as future work.
- [`docs/RELATED_WORK.md`](docs/RELATED_WORK.md): citations and positioning across PPO, GRPO, RLVR, specification gaming, and prior chargeback research.
- [`docs/REPRODUCIBILITY.md`](docs/REPRODUCIBILITY.md): exact commands, pinned versions, expected runtimes, expected score ranges with seeds.
- [`docs/RUNNING_THE_AGENT.md`](docs/RUNNING_THE_AGENT.md): end-user guide for running the trained agent.
- [`CITATION.cff`](CITATION.cff): academic citation metadata.
## Project layout
```
.
├── inference.py     # Inference entry point with provider fallback
├── openenv.yaml     # OpenEnv spec
├── core/            # Models, client, episode store
├── evaluation/      # OpenEnv Rubric subclasses + grader adapters
├── runners/         # Heuristic baseline, inference logic, benchmark sweep
├── scenarios/       # Tasks, generator, Issuer, arbitration, ISO 20022 adapter
├── server/          # FastAPI app, environment, Gradio demo
├── connectors/      # Stripe sandbox connector
├── training/        # SFT dataset, outcome reward, training curve plots
├── notebooks/       # Single-T4 SFT + GRPO Colab notebook
├── tests/           # 113 tests (env, grader, API, issuer, arbitration, training)
├── Dockerfile
└── pyproject.toml
```
## Engineering hygiene (table stakes)
- **OpenEnv base classes used as intended.** `ChargebackOpsEnvironment` subclasses `openenv.core.environments.Environment`; rubric components subclass `openenv.core.rubrics.Rubric`. No reserved tool names (`reset`, `step`, `state`, `close`) reused for anything else.
- **Gym-style API.** `env.reset(task_id=...)` → `Observation`, `env.step(action)` → `(Observation, reward, done, info)`, `env.state()` → introspectable `EnvironmentState`. Episode store is server-side; clients are purely network.
- **Strict client/server separation.** `core/client.py` talks to the FastAPI server over HTTP only; it never imports `server.*` or `scenarios.*`. The Docker image is the source of truth.
- **Valid `openenv.yaml` manifest.** Passes `openenv validate .`; manifest declares the action schema, observation schema, and rubric module path.
- **113 tests, all green.** Cover env reset/step semantics, action validation, every rubric subclass, the issuer agent, the arbitration resolver, the FastAPI surface, and the SFT data builder.
- **Reproducibility.** SHA-1 keyed RNG for arbitration, pinned dependencies in `pyproject.toml`, deterministic task IDs, expected score ranges in [`docs/REPRODUCIBILITY.md`](docs/REPRODUCIBILITY.md).
## Why this matters
Most public RL-for-LLM benchmarks score policies on tasks where a competent next-token predictor is already close to optimal: chess, snake, grid worlds, single-turn math. ChargebackOps is intentionally a different shape: **multi-round, partially-observable, cost-asymmetric play against a procedurally-constrained adversary, where the rational policy depends on a $250 fee asymmetry and the rubric punishes both rule-violating and economically-irrational behaviour**. That is the kind of decision surface real B2B operations live on, and it is exactly the kind of capability gap current LLM agents struggle with, as the iter-5 specification-gaming exploit demonstrates in vivid detail.
The environment is built so a researcher can credibly write a paper on top of it: composable rubrics, deterministic task IDs, ISO 20022 + Stripe sandbox connectors for real-world data, an honest documented failure mode of GRPO that future training recipes can target as a benchmark, and a heuristic baseline strong enough that **beating it requires the model to actually learn the task**, not merely to execute the rollout-helper fallback.
## License
MIT