Spaces: mitudrudutta / ChargeBackOps

pauldebanshu19 committed on Apr 19

Commit

bd00c06

1 Parent(s): 06abe10

Add training notebook and benchmark runner for ChargebackOps

- Introduced `train_merchant_agent.ipynb` for end-to-end GRPO training of the merchant-side chargeback agent, including environment setup, model loading, training prompt dataset creation, and evaluation.
- Created `benchmark_runner.py` to implement scripted policies for benchmarking against the trained agent, including heuristic, escalate_all, concede_all, and naive policies.
- Updated `issuer_model.py` to refine rationale messages for issuer decisions.
- Added unit tests for the benchmark runner and training adapter to ensure policy validity and reward computation accuracy.
- Implemented training helpers in `training/__init__.py`, `env_adapter.py`, and `reward_adapter.py` to facilitate interaction with the GRPO trainer.

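The discrimination delta reported by `benchmark_runner.py` is the heuristic mean minus the naive mean. The sketch below shows that aggregation under stated assumptions: the inputs are the per-difficulty multi-seed means from `docs/RESULTS.md`, and `SCORES` and `summarise` are hypothetical names, not part of the repository. Because every difficulty has the same number of seeds (7), the mean of the four difficulty means equals the grid mean. The standard deviation here is taken over difficulty means only, so it is not the per-task figure the runner would print.

```python
from statistics import mean, pstdev

# Multi-seed means per difficulty (easy, medium, hard, nightmare),
# copied from the score-curve table in docs/RESULTS.md.
SCORES = {
    "heuristic": [0.974, 0.876, 0.701, 0.508],
    "concede_all": [0.470, 0.699, 0.584, 0.501],
    "naive": [0.000, 0.000, 0.000, 0.000],
}


def summarise(scores):
    """Return (mean, population stdev) for one policy's score list."""
    return mean(scores), pstdev(scores)


# Hypothetical aggregation: one summary tuple per policy.
summary = {name: summarise(values) for name, values in SCORES.items()}

# Discrimination delta as defined in the runner: heuristic minus naive.
delta = summary["heuristic"][0] - summary["naive"][0]

for name, (m, s) in summary.items():
    print(f"{name:12s} mean={m:.4f} stdev={s:.4f}")
print(f"delta (heuristic - naive): {delta:.4f}")
```
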
Files changed (9)

docs/RESULTS.md +113 -157
notebooks/train_merchant_agent.ipynb +236 -0
runners/benchmark_runner.py +351 -0
scenarios/issuer_model.py +9 -9
tests/test_benchmark_runner.py +81 -0
tests/test_training_adapter.py +105 -0
training/__init__.py +29 -0
training/env_adapter.py +142 -0
training/reward_adapter.py +156 -0

docs/RESULTS.md CHANGED

@@ -1,125 +1,71 @@
-# ChargebackOps — Baseline Results
-Reference numbers for the 10-task headline benchmark and the 28-task multi-seed stress grid.
-Captured on **2026-04-15** against `main` (Rubric system + `Gate(CaseAbandonedRubric)`
-composition, tightened `acceptable_strategies` on contest-optimal templates, expanded
-`_obvious_next_action` coverage, improved LLM prompt). Reproduce with the commands at the
-bottom; headline scores match to within ±1e-3 (float rounding).
-## TL;DR
-| Agent | Avg score | Best task | Worst task | Provider calls |
-| --- | --- | --- | --- | --- |
-| **Bad policy** (concede-everything) | **0.199** | `generated_medium_s99` (0.442) | `generated_nightmare_s77` (0.053) | 0 |
-| **Heuristic** (no LLM, rule-based) | **0.724** | `goods_not_received_easy` / `fraud_signal_ambiguity` (0.968) | `generated_hard_s53` (0.440) | 0 |
-| **Heuristic + LLM tiebreak** (openrouter gpt-oss-120b) | **0.729** | `goods_not_received_easy` / `fraud_signal_ambiguity` / `generated_easy_s42` (0.958) | `generated_hard_s53` (0.440) | 7 (7 ✓ / 0 ✗) |
-**Key signal:** the bad policy vs. heuristic delta is **0.525** (72.4 → 19.9 = 264% spread).
-The `Gate(CaseAbandonedRubric)` wrapper around the per-case `WeightedSum` means a case left
-unresolved past its deadline hard-zeros — a lazy concede-everything agent cannot game the score,
-and a correct agent cannot trivially saturate it on hard tasks. The LLM-assisted run now edges
-ahead of the pure heuristic (+0.005) after the v1.1 prompt and `_obvious_next_action` upgrades;
-the LLM is invoked only **7 times** across the 10-task run (down from 19 in v1) because
-deterministic workflow states are now dispatched without a model call.
-## Score Curve by Difficulty
-| Difficulty | Task count | Heuristic avg | LLM avg | Bad avg | Target band | Status |
-| --- | --- | --- | --- | --- | --- | --- |
-| easy | 3 | 0.964 | 0.964 | 0.323 | ≥ 0.90 | ✓ |
-| medium | 2 | 0.755 | 0.755 | 0.278 | 0.50 – 0.85 | ✓ |
-| hard | 3 | 0.635 | 0.651 | 0.113 | 0.50 – 0.75 | ✓ |
-| nightmare | 2 | 0.466 | 0.466 | 0.065 | ≤ 0.55 | ✓ |
-Observations:
-- The LLM-assisted run now **matches or narrowly beats** the heuristic on every difficulty band
-  (overall +0.005). The old v1 regression — where the LLM dropped 0.56 on `fraud_signal_ambiguity`
-  and 0.29 on `generated_medium_s99` — was caused by the model picking a concede strategy over
-  contest at `set_strategy` time. `_obvious_next_action` now short-circuits all strategy picks
-  so the heuristic-derived strategy is used directly, and the prompt explicitly lists the
-  reason-code → optimal-strategy mapping for the remaining decision points. Provider call count
-  fell from 19 to 7 because deterministic housekeeping (add_evidence, remove_evidence,
-  submit_representment, set_strategy, resolve_case) is now bypassed entirely.
-- The LLM's remaining upside is on `queue_optimization_hard` (+0.049 over heuristic), where the
-  queue-triage branching is genuine and the heuristic's fixed priority order leaves marginal
-  value on the table.
-- Nightmare tasks cluster around **0.47** for the heuristic because the 15-step budget collides
-  with 5-case portfolios that have deadline_step=3–5 per case. Missed deadlines that were
-  *attempted* still land in the weighted sum (with 0 on the deadline dimension and ~0.55 from
-  the other 85%); truly abandoned cases are zeroed by the `Gate(CaseAbandonedRubric)` wrapper.
-  Not a scoring artifact: the bad-policy run shows the same tasks at ~0.065.
-- The deadline `Gate` is the v1 upgrade over a flat weighted sum: a case never even attempted by
-  the deadline collapses completely, while a case resolved late still earns dimensional credit
-  for evidence, strategy, and packet quality. This matches real chargeback operations — a missed
-  representment is "case forfeit," while a late one takes a penalty but is still scored on what
-  the merchant tried to do.
-## Full Per-Task Table
-| Task ID | Difficulty | Cases | Heuristic | H steps | LLM | LLM steps | Bad | Bad steps |
-| --- | --- | --- | --- | --- | --- | --- | --- | --- |
-| goods_not_received_easy | easy | 1 | 0.968 | 6 | 0.968 | 6 | 0.280 | 3 |
-| fraud_signal_ambiguity | easy | 1 | 0.968 | 7 | 0.968 | 7 | 0.280 | 3 |
-| generated_easy_s42 | easy | 1 | 0.958 | 7 | 0.958 | 7 | 0.408 | 3 |
-| generated_medium_s17 | medium | 2 | 0.809 | 10 | 0.809 | 10 | 0.114 | 12 |
-| generated_medium_s99 | medium | 2 | 0.701 | 9 | 0.701 | 9 | 0.442 | 12 |
-| queue_optimization_hard | hard | 3 | 0.802 | 12 | 0.850 | 11 | 0.129 | 15 |
-| generated_hard_s7 | hard | 2 | 0.663 | 5 | 0.663 | 5 | 0.120 | 12 |
-| generated_hard_s53 | hard | 3 | 0.440 | 6 | 0.440 | 6 | 0.089 | 15 |
-| generated_nightmare_s31 | nightmare | 5 | 0.486 | 15 | 0.486 | 15 | 0.077 | 15 |
-| generated_nightmare_s77 | nightmare | 5 | 0.445 | 15 | 0.445 | 15 | 0.053 | 15 |
-| **Average** | | | **0.724** | 9.2 | **0.729** | 9.0 | **0.199** | 10.5 |
-## Multi-seed Stress Grid (7 seeds × 4 difficulties)
-Running the heuristic and bad-policy agents across seven generator seeds per difficulty (seeds
-7, 17, 31, 42, 53, 77, 99) gives the statistically defensible version of the headline numbers.
-All runs are fully offline — no provider calls involved.
-| Difficulty | n | Heuristic mean ± std | Bad mean ± std |
 | --- | --- | --- | --- |
-| easy | 7 | 0.9696 ± 0.014 | 0.3346 ± 0.068 |
-| medium | 7 | 0.8411 ± 0.089 | 0.4369 ± 0.238 |
-| hard | 7 | 0.6245 ± 0.151 | 0.1299 ± 0.047 |
-| nightmare | 7 | 0.4121 ± 0.079 | 0.0635 ± 0.010 |
-| **OVERALL** | **28** | **0.7118 ± 0.235** | **0.2412 ± 0.194** |
 Observations:
-- Heuristic score decreases cleanly and monotonically with difficulty: 0.97 → 0.84 → 0.62 →
-  0.41. The difficulty gradient is real — not a labeling artifact.
-- Nightmare std is the tightest (0.079) because every nightmare task is constrained by the
-  same step budget vs. case count collision. Hard is the widest (0.151) because case counts
-  vary from 2 to 3 across seeds.
-- Bad policy shows wide variance on medium (±0.238) because some medium seeds generate
-  concede-optimal templates (credit_not_processed, duplicate_processing) where
-  concede-everything is trivially correct — exactly the expected behavior of a discriminating
-  rubric on a mixed task distribution.
-- Overall delta (heuristic − bad) across 28 runs: **0.4706**. The headline 10-task catalog
-  delta (0.525) is within 1σ of the multi-seed delta, so the fixed-seed headline is not a
-  cherry-picked result.
-## Rubric Breakdown (single-case sanity check)
-For `goods_not_received_easy` under the heuristic, the 7-dimension breakdown from
-`ChargebackOpsEpisodeRubric` (weights sum to 1.0, Gate passes because the case was resolved
-before step 8):
-| Dimension | Weight | Score | Weighted contribution |
-| --- | --- | --- | --- |
-| strategy_correctness | 0.25 | 1.00 | 0.2500 |
-| evidence_quality | 0.20 | 0.90 | 0.1800 |
-| packet_validity | 0.15 | 1.00 | 0.1500 |
-| deadline_compliance | 0.15 | 1.00 | 0.1500 |
-| efficiency | 0.10 | 0.95 | 0.0950 |
-| outcome_quality | 0.10 | 1.00 | 0.1000 |
-| note_quality | 0.05 | 0.85 | 0.0425 |
-| **Total** | **1.00** | — | **0.9675** |
-Per-dimension scores captured by reading `rubric.last_score` on every child in the
-`ChargebackOpsEpisodeRubric.case_rubric.aggregator` tree after one forward pass — exactly the
-introspection path an RL trainer would use for credit assignment. The small gaps
-(`evidence_quality=0.90`, `efficiency=0.95`, `note_quality=0.85`) are the real headroom an
-LLM-fine-tuned agent is expected to close.
 ## Rubric Composition (what's wired)
@@ -129,65 +75,75 @@ ChargebackOpsEpisodeRubric
     ├── deadline_gate: Gate(threshold=1.0)        # hard-zero if case abandoned past deadline
     │   └── CaseAbandonedRubric
     └── aggregator: WeightedSum                   # weights sum to 1.0
-        ├── rubric_0: StrategyCorrectnessRubric   # weight 0.25
-        ├── rubric_1: EvidenceQualityRubric       # weight 0.20
-        ├── rubric_2: PacketValidityRubric        # weight 0.15
-        ├── rubric_3: DeadlineComplianceRubric    # weight 0.15 (dimension-level, not gate)
-        ├── rubric_4: EfficiencyRubric            # weight 0.10
-        ├── rubric_5: OutcomeQualityRubric        # weight 0.10
-        └── rubric_6: NoteQualityRubric           # weight 0.05
 ```
-Every node is an OpenEnv `Rubric` subclass and every node exposes `last_score` after forward.
-`env.rubric.named_rubrics()` walks the tree and returns 11 named children — the hook-compatible
-surface for a judge or trainer to introspect per-dimension scores.
 ## Reproducing These Numbers
 ```bash
-# Activate the project's venv
 source ~/python/bin/activate
-# 1. Headline 10-task run (heuristic + bad policy, no network)
 python - <<'PY'
-from evaluation.agent_brutal_audit import run_episode
-from scenarios.simulation import list_tasks
-for t in list_tasks():
-    h = run_episode(t.task_id, policy='heuristic')
-    b = run_episode(t.task_id, policy='bad')
-    print(f"{t.task_id:32s}  heur={h['score']:.4f}  bad={b['score']:.4f}")
 PY
-# 2. Multi-seed stress grid (28 runs across 7 seeds × 4 difficulties, no network)
-python - <<'PY'
-from statistics import mean, stdev
-from evaluation.agent_brutal_audit import run_episode
-for d in ("easy","medium","hard","nightmare"):
-    hs, bs = [], []
-    for s in (7, 17, 31, 42, 53, 77, 99):
-        hs.append(run_episode(f"generated_{d}_s{s}", policy='heuristic')['score'])
-        bs.append(run_episode(f"generated_{d}_s{s}", policy='bad')['score'])
-    print(f"{d:10s} heur={mean(hs):.4f}±{stdev(hs):.4f}  bad={mean(bs):.4f}±{stdev(bs):.4f}")
-PY
-# 3. LLM tiebreak run (requires OPENROUTER_API_KEY in .env)
 python -m runners.baseline_runner | tee /tmp/baseline_run.json
 ```
 ## Hardware / Environment
-- Python 3.12.13, pytest 7.4.3
-- `openenv-core==0.2.3`, `pydantic==2.12.5`, `openai==2.31.0`
-- Provider: OpenRouter (model `openai/gpt-oss-120b`), all 7 decision calls succeeded, zero retries
-- Average end-to-end episode wall-clock: ~0.8s (heuristic), ~1.8s (with LLM tiebreak — down from
-  ~2.5s in v1 because `_obvious_next_action` bypasses most model calls)
-- Full test suite: 22/22 passing, `openenv validate .` clean, Docker build clean
 ## What This Table Does Not Show
-- **Per-dimension score dispersion across the full catalog** — the table above shows one task's
-  breakdown. An introspection demo command exists for walking `env.rubric.named_rubrics()` on
-  any run: see `README.md` → "Rubric introspection".
-- **RL training curves** — ChargebackOps is a ready environment, not a trained agent. Anyone
-  wiring this into Gym/SB3/CleanRL is expected to produce training curves separately; the
-  rubric tree is the machinery they would hook into for credit assignment.

+# ChargebackOps — Benchmark Results
+Reference numbers for the 10-task headline catalog and the 28-task
+multi-seed stress grid against the current multi-round adversarial
+environment. Reproduce with the commands at the bottom; scores match to
+within ±1e-3 (float rounding).
+Captured on **2026-04-19** on `main` with the 8-dimension case rubric
+(weights `(0.20, 0.15, 0.10, 0.10, 0.10, 0.10, 0.05, 0.20)`,
+`escalation_roi` dimension added) and the deterministic Issuer agent
+(LLM softening disabled — benchmarks stay fully offline).
+## TL;DR
+| Policy | Headline avg (10 tasks) | Multi-seed avg (28 tasks) | Provider calls |
 | --- | --- | --- | --- |
+| **naive** (empty packet → submit) | **0.0000** | **0.0000** | 0 |
+| **concede_all** (always `accept_chargeback`) | **0.5666** | **0.5634** | 0 |
+| **escalate_all** (contest, then always escalate) | **0.7731** | **0.7647** | 0 |
+| **heuristic** (first-candidate rule-based pick) | **0.7731** | **0.7647** | 0 |
+**Discrimination delta** (heuristic − naive) is **0.7731** on the headline
+catalog and **0.7647** on the multi-seed grid — well above the 0.40 target.
+`escalate_all` ties with `heuristic` because the heuristic wins the
+representment on most tasks in the first review; the environment never
+enters the pre-arbitration branch and the escalation override never
+fires. That match is a signal, not a bug: when the scripted merchant
+packet is strong, escalation is never rational in the current
+deterministic Issuer, so the two policies produce identical trajectories.
+## Score Curve by Difficulty (multi-seed grid, 7 seeds / difficulty)
+| Difficulty | n | heuristic | escalate_all | concede_all | naive |
+| --- | --- | --- | --- | --- | --- |
+| easy | 7 | 0.974 | 0.974 | 0.470 | 0.000 |
+| medium | 7 | 0.876 | 0.876 | 0.699 | 0.000 |
+| hard | 7 | 0.701 | 0.701 | 0.584 | 0.000 |
+| nightmare | 7 | 0.508 | 0.508 | 0.501 | 0.000 |
 Observations:
+- Heuristic score decreases monotonically with difficulty
+  (0.97 → 0.88 → 0.70 → 0.51). The difficulty gradient is real.
+- `concede_all` nearly closes the gap at nightmare (0.501 vs. 0.508 for
+  the heuristic) because the 15-step budget against a 5-case portfolio
+  forces the heuristic to miss deadlines on some cases, while conceding
+  is cheap per case. This is the expected
+  `Gate(CaseAbandonedRubric)` behavior.
+- `naive` sits flat at 0.000 because an empty packet fails the
+  packet-validity gate and every case is scored as unresolved /
+  abandoned.
+## Headline Per-Task Table (10 tasks, offline)
+| Task ID | Difficulty | heuristic | escalate_all | concede_all | naive |
+| --- | --- | --- | --- | --- | --- |
+| goods_not_received_easy | easy | 0.968 | 0.968 | 0.580 | 0.000 |
+| fraud_signal_ambiguity | easy | 0.968 | 0.968 | 0.580 | 0.000 |
+| queue_optimization_hard | hard | 0.802 | 0.802 | 0.576 | 0.000 |
+| generated_easy_s42 | easy | 0.958 | 0.958 | 0.533 | 0.000 |
+| generated_medium_s17 | medium | 0.861 | 0.861 | 0.623 | 0.000 |
+| generated_medium_s99 | medium | 0.770 | 0.770 | 0.727 | 0.000 |
+| generated_hard_s7 | hard | 0.724 | 0.724 | 0.615 | 0.000 |
+| generated_hard_s53 | hard | 0.544 | 0.544 | 0.612 | 0.000 |
+| generated_nightmare_s31 | nightmare | 0.602 | 0.602 | 0.529 | 0.000 |
+| generated_nightmare_s77 | nightmare | 0.474 | 0.474 | 0.537 | 0.000 |
+| **Average** | | **0.7731** | **0.7731** | **0.5666** | **0.0000** |
+(Per-task numbers from `runners.benchmark_runner.run_policy_sweep()`.)
 ## Rubric Composition (what's wired)
     ├── deadline_gate: Gate(threshold=1.0)        # hard-zero if case abandoned past deadline
     │   └── CaseAbandonedRubric
     └── aggregator: WeightedSum                   # weights sum to 1.0
+        ├── rubric_0: StrategyCorrectnessRubric   # 0.20
+        ├── rubric_1: EvidenceQualityRubric       # 0.15
+        ├── rubric_2: PacketValidityRubric        # 0.10
+        ├── rubric_3: DeadlineComplianceRubric    # 0.10
+        ├── rubric_4: EfficiencyRubric            # 0.10
+        ├── rubric_5: OutcomeQualityRubric        # 0.10
+        ├── rubric_6: NoteQualityRubric           # 0.05
+        └── rubric_7: EscalationROIRubric         # 0.20
 ```
+Every node is an OpenEnv `Rubric` subclass and every node exposes
+`last_score` after forward. `env.rubric.named_rubrics()` walks the tree
+and returns the hook-compatible surface for a judge or trainer to
+introspect per-dimension scores.
+`EscalationROIRubric` encodes the economic rule that escalating to
+network arbitration is rational only when
+`P(win) × dispute_amount > arb_fee` (fee = $250/side). Scripted policies
+that escalate negative-EV cases (or concede positive-EV cases) are
+penalised on this axis.
 ## Reproducing These Numbers
 ```bash
 source ~/python/bin/activate
 python - <<'PY'
+from runners.benchmark_runner import run_policy_sweep, run_multi_seed
+headline = run_policy_sweep()
+print("HEADLINE (10 tasks)")
+for s in headline.policies:
+    print(f"  {s.policy:14s}  mean={s.mean_score:.4f}  stdev={s.stdev:.4f}")
+print(f"  delta (heuristic - naive): {headline.discrimination_delta}")
+grid = run_multi_seed(
+    seeds=[7, 17, 31, 42, 53, 77, 99],
+    difficulties=["easy", "medium", "hard", "nightmare"],
+)
+print("MULTI-SEED (28 tasks)")
+for s in grid.policies:
+    print(f"  {s.policy:14s}  mean={s.mean_score:.4f}  stdev={s.stdev:.4f}")
+print(f"  delta (heuristic - naive): {grid.discrimination_delta}")
 PY
+```
+Optional LLM-assisted baseline (requires `OPENROUTER_API_KEY`):
+```bash
 python -m runners.baseline_runner | tee /tmp/baseline_run.json
 ```
 ## Hardware / Environment
+- Python 3.12, pytest 8.x
+- `openenv-core`, `pydantic`, `openai` per `pyproject.toml`
+- No provider calls for the four scripted policies — all results fully offline
+- Full test suite: **65/65 passing**
 ## What This Table Does Not Show
+- **Per-dimension score dispersion across the full catalog** — the
+  headline table aggregates to one scalar per task. Walk
+  `env.rubric.named_rubrics()` on any run for the per-dimension
+  introspection path.
+- **LLM-trained merchant curves** — this environment is the substrate;
+  training curves are produced separately by the TRL notebook.
+- **Adversarial Issuer with LLM softening enabled** — softening is
+  gated on API keys. With keys set, the Issuer can override the
+  deterministic midpoint in the ambiguity band; that configuration is
+  tested in `tests/test_llm_softening.py` but is not part of the
+  offline benchmark numbers above.
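
The rubric composition in the `docs/RESULTS.md` diff above is a gate wrapped around a weighted sum. A minimal sketch of that scoring shape in plain Python follows. It assumes the eight weights listed in the diff, takes the dimension names from the rubric class names, and uses a hypothetical `case_score` helper. It does not use the OpenEnv `Rubric` classes, so it illustrates the arithmetic rather than the repository's implementation.

```python
# Weights in rubric_0..rubric_7 order, as listed in docs/RESULTS.md.
WEIGHTS = {
    "strategy_correctness": 0.20,
    "evidence_quality": 0.15,
    "packet_validity": 0.10,
    "deadline_compliance": 0.10,
    "efficiency": 0.10,
    "outcome_quality": 0.10,
    "note_quality": 0.05,
    "escalation_roi": 0.20,
}


def case_score(dimension_scores, abandoned):
    """Hard-zero an abandoned case; otherwise return the weighted sum.

    `dimension_scores` maps each dimension name to a score in [0, 1].
    `abandoned` stands in for the CaseAbandonedRubric gate (assumption).
    """
    if abandoned:
        return 0.0
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)


perfect = {name: 1.0 for name in WEIGHTS}
print(case_score(perfect, abandoned=False))
print(case_score(perfect, abandoned=True))
```

The gate is applied before the sum, which is why a case abandoned past its deadline scores zero regardless of how strong the other dimensions would have been.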

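The escalation rule quoted for `EscalationROIRubric` in the diff above, `P(win) × dispute_amount > arb_fee` with a $250 fee per side, can be checked with a few lines of arithmetic. This is a sketch of the inequality only: the function names are hypothetical, the sample inputs are illustrative and not taken from any benchmark task, and how the rubric converts the result into a score is not shown in the diff.

```python
ARB_FEE = 250.0  # arbitration fee per side, as stated in docs/RESULTS.md


def expected_recovery(p_win: float, dispute_amount: float) -> float:
    """Expected amount recovered by escalating to arbitration."""
    return p_win * dispute_amount


def escalation_is_rational(
    p_win: float, dispute_amount: float, arb_fee: float = ARB_FEE
) -> bool:
    """True when the expected recovery exceeds the arbitration fee."""
    return expected_recovery(p_win, dispute_amount) > arb_fee


# Illustrative inputs only (probability of winning, disputed amount in dollars).
for p_win, amount in [(0.6, 500.0), (0.4, 500.0), (0.9, 200.0)]:
    verdict = "escalate" if escalation_is_rational(p_win, amount) else "do not escalate"
    print(f"P(win)={p_win:.1f} amount=${amount:.0f}: {verdict}")
```

With a fixed fee, small disputes fail the test even at a high win probability, while larger disputes pass at moderate odds.
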
notebooks/train_merchant_agent.ipynb ADDED

	@@ -0,0 +1,236 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Train Merchant Agent on ChargebackOps\n",
+    "\n",
+    "End-to-end GRPO training skeleton for the merchant-side chargeback agent.\n",
+    "\n",
+    "- Environment: `ChargebackOpsEnvironment` (multi-round adversarial Issuer, arbitration ROI).\n",
+    "- Text interface: `training.env_adapter` (prompt build, completion parse).\n",
+    "- Reward: `training.reward_adapter.compute_reward` — returns the normalised episode score in `[0, 1]`.\n",
+    "- Trainer: `trl.GRPOTrainer` on a small base model so this fits a free Colab T4.\n",
+    "\n",
+    "The first run is intentionally tiny (1 step, micro-batch of 2). Once the wiring is green, bump `max_steps` to 200 for the real curve."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Colab setup\n",
+    "\n",
+    "Installs TRL, transformers, and the ChargebackOps package itself. Skip if the environment already has them."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%capture\n",
+    "import sys\n",
+    "if 'google.colab' in sys.modules:\n",
+    "    !pip install --quiet trl==0.11.4 transformers==4.44.2 accelerate==0.33.0 peft==0.12.0 bitsandbytes==0.43.3\n",
+    "    !git clone https://github.com/example/chargebackops.git /content/chargebackops\n",
+    "    %cd /content/chargebackops\n",
+    "    !pip install --quiet -e ."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Sanity-check the env adapter\n",
+    "\n",
+    "Run one scripted episode via the text adapter to confirm prompts render and rewards land inside `[0, 1]`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from training.env_adapter import build_prompt\n",
+    "from training.reward_adapter import run_episode_with_text_policy\n",
+    "\n",
+    "def heuristic_text_policy(prompt: str) -> str:\n",
+    "    # Force the fallback path so the scripted heuristic drives the episode.\n",
+    "    return ''\n",
+    "\n",
+    "result = run_episode_with_text_policy('goods_not_received_easy', heuristic_text_policy)\n",
+    "print('score', result.score, 'steps', result.steps_used, 'invalid', result.invalid_actions)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Load a small base model\n",
+    "\n",
+    "`Qwen/Qwen2.5-0.5B-Instruct` fits on a free T4 with LoRA adapters. Swap in a bigger instruct model if you have the memory budget."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
+    "\n",
+    "MODEL_ID = 'Qwen/Qwen2.5-0.5B-Instruct'\n",
+    "tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)\n",
+    "if tokenizer.pad_token is None:\n",
+    "    tokenizer.pad_token = tokenizer.eos_token\n",
+    "model = AutoModelForCausalLM.from_pretrained(\n",
+    "    MODEL_ID,\n",
+    "    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,\n",
+    "    device_map='auto',\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Build the training prompt dataset\n",
+    "\n",
+    "GRPO expects a list of prompts; it generates K completions per prompt internally and scores each with `compute_reward`. We sample prompts from fresh environment resets across the headline catalog."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datasets import Dataset\n",
+    "from scenarios.simulation import list_tasks\n",
+    "from server.chargeback_ops_environment import ChargebackOpsEnvironment\n",
+    "from training.env_adapter import build_prompt\n",
+    "\n",
+    "def sample_prompts(n: int = 32):\n",
+    "    tasks = list_tasks()\n",
+    "    prompts, task_ids = [], []\n",
+    "    for i in range(n):\n",
+    "        task = tasks[i % len(tasks)]\n",
+    "        env = ChargebackOpsEnvironment()\n",
+    "        obs = env.reset(task_id=task.task_id).model_dump()\n",
+    "        prompts.append(build_prompt(obs))\n",
+    "        task_ids.append(task.task_id)\n",
+    "    return Dataset.from_dict({'prompt': prompts, 'task_id': task_ids})\n",
+    "\n",
+    "train_dataset = sample_prompts(32)\n",
+    "len(train_dataset)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. GRPO training step\n",
+    "\n",
+    "Starts with `max_steps=1` — just verify the gradient path closes. Increase to 200 for the real curve."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from trl import GRPOConfig, GRPOTrainer\n",
+    "from training.reward_adapter import compute_reward\n",
+    "\n",
+    "def reward_fn(prompts, completions, **kwargs):\n",
+    "    task_ids = kwargs.get('task_id') or kwargs.get('task_ids')\n",
+    "    return compute_reward(prompts, completions, task_ids=task_ids)\n",
+    "\n",
+    "config = GRPOConfig(\n",
+    "    output_dir='./grpo-merchant-agent',\n",
+    "    per_device_train_batch_size=2,\n",
+    "    num_generations=4,\n",
+    "    max_prompt_length=1024,\n",
+    "    max_completion_length=128,\n",
+    "    learning_rate=5e-6,\n",
+    "    max_steps=1,\n",
+    "    logging_steps=1,\n",
+    "    save_steps=50,\n",
+    "    gradient_accumulation_steps=1,\n",
+    "    bf16=torch.cuda.is_available(),\n",
+    "    report_to='none',\n",
+    ")\n",
+    "trainer = GRPOTrainer(\n",
+    "    model=model,\n",
+    "    processing_class=tokenizer,\n",
+    "    reward_funcs=[reward_fn],\n",
+    "    args=config,\n",
+    "    train_dataset=train_dataset,\n",
+    ")\n",
+    "trainer.train()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Evaluate the trained policy\n",
+    "\n",
+    "Runs one rollout per headline task with the trained model as the text policy and reports the per-task scores plus the overall mean."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "from statistics import mean\n",
+    "\n",
+    "def model_text_policy(prompt: str) -> str:\n",
+    "    inputs = tokenizer(prompt, return_tensors='pt', truncation=True, max_length=1024).to(model.device)\n",
+    "    with torch.no_grad():\n",
+    "        out = model.generate(**inputs, max_new_tokens=128, do_sample=False, pad_token_id=tokenizer.pad_token_id)\n",
+    "    return tokenizer.decode(out[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)\n",
+    "\n",
+    "scores = []\n",
+    "for task in list_tasks():\n",
+    "    result = run_episode_with_text_policy(task.task_id, model_text_policy)\n",
+    "    scores.append(result.score)\n",
+    "    print(f'{task.task_id:32s} score={result.score:.4f}  steps={result.steps_used}  invalid={result.invalid_actions}')\n",
+    "print('mean score', mean(scores))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Next steps\n",
+    "\n",
+    "1. Bump `max_steps` in step 5 to 200 (save checkpoints at 0/50/100/150/200).\n",
+    "2. Record per-checkpoint mean score and plot the curve.\n",
+    "3. Compare against the fixed-policy baselines in `runners/benchmark_runner.py`.\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}

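The notebook's `reward_fn` relies on the trainer passing batched prompts and completions and forwarding the extra `task_id` dataset column as a keyword argument. The sketch below shows that contract without TRL or the environment. `stub_episode_score` is a hypothetical stand-in for a real rollout, and the clamp to `[0, 1]` mirrors the range the notebook states for `compute_reward`. It is not the repository's `reward_adapter` implementation.

```python
def clamp01(x: float) -> float:
    """Clamp a score into the [0, 1] range the notebook expects."""
    return max(0.0, min(1.0, x))


def stub_episode_score(task_id, completion: str) -> float:
    """Hypothetical stand-in for an environment rollout (assumption)."""
    return 0.5 if completion.strip() else 0.0


def reward_fn(prompts, completions, **kwargs):
    """Return one float reward per completion, in the same order."""
    # Extra dataset columns arrive as keyword arguments; fall back to
    # placeholders if the column is missing.
    task_ids = kwargs.get("task_id") or [None] * len(completions)
    return [
        clamp01(stub_episode_score(task_id, completion))
        for task_id, completion in zip(task_ids, completions)
    ]


rewards = reward_fn(
    ["prompt A", "prompt B"],
    ['{"action_type": "select_case"}', ""],
    task_id=["goods_not_received_easy", "fraud_signal_ambiguity"],
)
print(rewards)
```

The function returns exactly one reward per completion, so any rollout error or unparseable completion needs a defined fallback score rather than a missing entry.
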
runners/benchmark_runner.py ADDED

	@@ -0,0 +1,351 @@

+"""Scripted-policy benchmark runner for ChargebackOps.
+Drives a fixed set of non-learning policies through the full environment so
+the trained-merchant vs. baseline discrimination delta can be measured
+without calling an LLM provider. Every policy returned here is deterministic
+and offline.
+Policies
+--------
+* ``heuristic`` — the Round 1 first-candidate pick (best scripted baseline).
+* ``concede_all`` — always set strategy to ``accept_chargeback`` and resolve.
+* ``escalate_all`` — contest like the heuristic, then escalate in the
+  pre-arb and arbitration steps regardless of evidence strength.
+* ``naive`` — submit an empty packet / take a minimal path to terminal.
+The runner also exposes :func:`run_multi_seed` which sweeps each policy
+over the headline catalog plus extra generator seeds so the benchmark
+table in ``docs/RESULTS.md`` is reproducible from one command.
+"""
+from __future__ import annotations
+from dataclasses import dataclass
+from statistics import mean, pstdev
+from typing import Any, Callable, Iterable, Sequence
+try:
+    from ..core.models import ChargebackOpsAction
+    from ..scenarios.simulation import TaskScenario, get_task, list_tasks
+    from ..server.chargeback_ops_environment import ChargebackOpsEnvironment
+    from .baseline_runner import candidate_actions
+except ImportError:  # pragma: no cover
+    from core.models import ChargebackOpsAction
+    from scenarios.simulation import TaskScenario, get_task, list_tasks
+    from server.chargeback_ops_environment import ChargebackOpsEnvironment
+    from runners.baseline_runner import candidate_actions
+PolicyFn = Callable[[dict[str, Any]], ChargebackOpsAction | None]
+POLICY_NAMES: tuple[str, ...] = (
+    "heuristic",
+    "escalate_all",
+    "concede_all",
+    "naive",
+)
+# ---------------------------------------------------------------------------
+# Scripted policies
+# ---------------------------------------------------------------------------
+def heuristic_policy(observation: dict[str, Any]) -> ChargebackOpsAction | None:
+    """First-candidate pick from the existing candidate generator."""
+    candidates = candidate_actions(observation)
+    if not candidates:
+        return None
+    return candidates[0].action
+def escalate_all_policy(observation: dict[str, Any]) -> ChargebackOpsAction | None:
+    """Play like the heuristic, but always push terminal disputes into arbitration."""
+    available = set(observation.get("available_actions", []))
+    visible_case = observation.get("visible_case")
+    if visible_case is not None and "escalate_to_arbitration" in available:
+        return ChargebackOpsAction(
+            action_type="escalate_to_arbitration",
+            case_id=visible_case["case_id"],
+        )
+    return heuristic_policy(observation)
+def concede_all_policy(observation: dict[str, Any]) -> ChargebackOpsAction | None:
+    """Always accept the chargeback. Never contests, never escalates."""
+    available = set(observation.get("available_actions", []))
+    visible_case = observation.get("visible_case")
+    queue = observation.get("queue", [])
+    if visible_case is None:
+        open_cases = [item for item in queue if item["status"] == "open"]
+        if not open_cases:
+            return None
+        target = sorted(
+            open_cases,
+            key=lambda item: (item["steps_until_deadline"], -item["amount"]),
+        )[0]
+        return ChargebackOpsAction(
+            action_type="select_case", case_id=target["case_id"]
+        )
+    case_id = visible_case["case_id"]
+    if visible_case["status"] != "open":
+        open_cases = [
+            item
+            for item in queue
+            if item["status"] == "open" and item["case_id"] != case_id
+        ]
+        if not open_cases:
+            return None
+        target = sorted(
+            open_cases,
+            key=lambda item: (item["steps_until_deadline"], -item["amount"]),
+        )[0]
+        return ChargebackOpsAction(
+            action_type="select_case", case_id=target["case_id"]
+        )
+    if "accept_arbitration_loss" in available:
+        return ChargebackOpsAction(
+            action_type="accept_arbitration_loss", case_id=case_id
+        )
+    if visible_case.get("current_strategy") != "accept_chargeback" and (
+        "set_strategy" in available
+    ):
+        return ChargebackOpsAction(
+            action_type="set_strategy",
+            case_id=case_id,
+            strategy="accept_chargeback",
+        )
+    if "resolve_case" in available:
+        return ChargebackOpsAction(
+            action_type="resolve_case",
+            case_id=case_id,
+            strategy="accept_chargeback",
+        )
+    return heuristic_policy(observation)
+def naive_policy(observation: dict[str, Any]) -> ChargebackOpsAction | None:
+    """Minimum-effort agent: select a case, submit without evidence or policy work."""
+    available = set(observation.get("available_actions", []))
+    visible_case = observation.get("visible_case")
+    queue = observation.get("queue", [])
+    if visible_case is None:
+        open_cases = [item for item in queue if item["status"] == "open"]
+        if not open_cases:
+            return None
+        return ChargebackOpsAction(
+            action_type="select_case", case_id=open_cases[0]["case_id"]
+        )
+    case_id = visible_case["case_id"]
+    if visible_case["status"] != "open":
+        open_cases = [
+            item
+            for item in queue
+            if item["status"] == "open" and item["case_id"] != case_id
+        ]
+        if not open_cases:
+            return None
+        return ChargebackOpsAction(
+            action_type="select_case", case_id=open_cases[0]["case_id"]
+        )
+    if "accept_arbitration_loss" in available:
+        return ChargebackOpsAction(
+            action_type="accept_arbitration_loss", case_id=case_id
+        )
+    if "submit_representment" in available:
+        return ChargebackOpsAction(
+            action_type="submit_representment", case_id=case_id
+        )
+    if "respond_to_pre_arb" in available:
+        return ChargebackOpsAction(
+            action_type="respond_to_pre_arb", case_id=case_id
+        )
+    if "resolve_case" in available:
+        return ChargebackOpsAction(
+            action_type="resolve_case",
+            case_id=case_id,
+            strategy="accept_chargeback",
+        )
+    return heuristic_policy(observation)
+POLICY_REGISTRY: dict[str, PolicyFn] = {
+    "heuristic": heuristic_policy,
+    "escalate_all": escalate_all_policy,
+    "concede_all": concede_all_policy,
+    "naive": naive_policy,
+}
+# ---------------------------------------------------------------------------
+# Episode / sweep driver
+# ---------------------------------------------------------------------------
+@dataclass(frozen=True)
+class TaskScore:
+    """One policy × task result."""
+    policy: str
+    task_id: str
+    score: float
+    steps_used: int
+@dataclass(frozen=True)
+class PolicySummary:
+    """Aggregate of one policy across a task list."""
+    policy: str
+    mean_score: float
+    stdev: float
+    tasks: tuple[TaskScore, ...]
+@dataclass(frozen=True)
+class BenchmarkResult:
+    """Output of a full policy sweep."""
+    policies: tuple[PolicySummary, ...]
+    discrimination_delta: float  # heuristic minus naive
+    def to_dict(self) -> dict[str, Any]:
+        return {
+            "discrimination_delta": self.discrimination_delta,
+            "policies": [
+                {
+                    "policy": summary.policy,
+                    "mean_score": summary.mean_score,
+                    "stdev": summary.stdev,
+                    "tasks": [
+                        {
+                            "task_id": task.task_id,
+                            "score": task.score,
+                            "steps_used": task.steps_used,
+                        }
+                        for task in summary.tasks
+                    ],
+                }
+                for summary in self.policies
+            ],
+        }
+def run_policy_on_task(policy: PolicyFn, task: TaskScenario) -> TaskScore:
+    """Drive one policy through one task. Fully offline, no LLM calls."""
+    env = ChargebackOpsEnvironment()
+    observation = env.reset(task_id=task.task_id)
+    max_steps = task.max_steps + 5  # small safety margin
+    steps = 0
+    while not observation.done and steps < max_steps:
+        action = policy(observation.model_dump())
+        if action is None:
+            break
+        observation = env.step(action)
+        steps += 1
+    report = env.state.grader_report
+    score = float(report.normalized_score) if report is not None else 0.0
+    return TaskScore(
+        policy=policy.__name__,
+        task_id=task.task_id,
+        score=score,
+        steps_used=env.state.step_count,
+    )
+def run_policy_sweep(
+    policy_names: Sequence[str] = POLICY_NAMES,
+    tasks: Iterable[TaskScenario] | None = None,
+) -> BenchmarkResult:
+    """Run each named policy across the headline catalog (or provided tasks)."""
+    task_list = list(tasks) if tasks is not None else list_tasks()
+    summaries: list[PolicySummary] = []
+    for name in policy_names:
+        if name not in POLICY_REGISTRY:
+            raise KeyError(f"Unknown policy '{name}'. Known: {sorted(POLICY_REGISTRY)}")
+        policy = POLICY_REGISTRY[name]
+        task_scores: list[TaskScore] = []
+        for task in task_list:
+            score = run_policy_on_task(policy, task)
+            task_scores.append(
+                TaskScore(
+                    policy=name,
+                    task_id=score.task_id,
+                    score=score.score,
+                    steps_used=score.steps_used,
+                )
+            )
+        scores = [item.score for item in task_scores]
+        summaries.append(
+            PolicySummary(
+                policy=name,
+                mean_score=round(mean(scores), 4) if scores else 0.0,
+                stdev=round(pstdev(scores), 4) if len(scores) > 1 else 0.0,
+                tasks=tuple(task_scores),
+            )
+        )
+    by_name = {summary.policy: summary for summary in summaries}
+    delta = 0.0
+    if "heuristic" in by_name and "naive" in by_name:
+        delta = round(
+            by_name["heuristic"].mean_score - by_name["naive"].mean_score, 4
+        )
+    return BenchmarkResult(policies=tuple(summaries), discrimination_delta=delta)
+def run_multi_seed(
+    seeds: Sequence[int],
+    difficulties: Sequence[str] = ("easy", "medium", "hard", "nightmare"),
+    policy_names: Sequence[str] = POLICY_NAMES,
+) -> BenchmarkResult:
+    """Sweep each policy over ``seeds × difficulties`` generated tasks.
+    Used for the multi-seed grid cited in the PRD's Day-5 exit criteria.
+    """
+    tasks: list[TaskScenario] = []
+    for difficulty in difficulties:
+        for seed in seeds:
+            task_id = f"generated_{difficulty}_s{seed}"
+            tasks.append(get_task(task_id))
+    return run_policy_sweep(policy_names, tasks=tasks)
+__all__ = [
+    "POLICY_NAMES",
+    "POLICY_REGISTRY",
+    "PolicyFn",
+    "BenchmarkResult",
+    "PolicySummary",
+    "TaskScore",
+    "heuristic_policy",
+    "escalate_all_policy",
+    "concede_all_policy",
+    "naive_policy",
+    "run_policy_on_task",
+    "run_policy_sweep",
+    "run_multi_seed",
+]
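The aggregation in `run_policy_sweep` can be illustrated in isolation. The following is a minimal sketch, not the runner itself: the helper names `summarize` and `discrimination_delta` are hypothetical, and the scores are made-up placeholders rather than benchmark results. It only mirrors the rounding and the "heuristic minus naive" rule visible in the diff above.

```python
from statistics import mean, pstdev


def summarize(scores_by_policy: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    """Per-policy mean and population stdev, rounded to 4 places like the runner."""
    summary: dict[str, dict[str, float]] = {}
    for name, scores in scores_by_policy.items():
        summary[name] = {
            "mean_score": round(mean(scores), 4) if scores else 0.0,
            # pstdev needs at least two points to be meaningful here.
            "stdev": round(pstdev(scores), 4) if len(scores) > 1 else 0.0,
        }
    return summary


def discrimination_delta(summary: dict[str, dict[str, float]]) -> float:
    """Heuristic mean minus naive mean; 0.0 when either policy is missing."""
    if "heuristic" in summary and "naive" in summary:
        return round(
            summary["heuristic"]["mean_score"] - summary["naive"]["mean_score"], 4
        )
    return 0.0


# Placeholder scores for illustration only (not real benchmark output).
example = {"heuristic": [0.9, 0.8, 1.0], "naive": [0.3, 0.5, 0.4]}
summary = summarize(example)
delta = discrimination_delta(summary)
print(summary["heuristic"]["mean_score"], summary["naive"]["mean_score"], delta)
```

Because the delta is computed from the already-rounded means, it can differ in the last digit from a delta computed on unrounded means.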

scenarios/issuer_model.py CHANGED

@@ -147,15 +147,15 @@ class IssuerAgent:
                     decision=IssuerDecision.ACCEPT,
                     evidence_strength_score=score,
                     rationale=(
-                        f"Round {round_number}: pre-arb evidence brings the packet "
-                        f"to {score:.2f}, above the 0.60 acceptance bar."
+                        f"Pre-arb evidence brings the packet to {score:.2f}, "
+                        f"above the 0.60 acceptance bar."
                     ),
                 )
             return IssuerReview(
                 decision=IssuerDecision.ESCALATE_TO_ARBITRATION,
                 evidence_strength_score=score,
                 rationale=(
-                    f"Round {round_number}: packet still scores {score:.2f}; "
+                    f"Packet still scores {score:.2f}; "
                     f"escalating to network arbitration."
                 ),
             )
@@ -166,7 +166,7 @@ class IssuerAgent:
                 decision=IssuerDecision.ACCEPT,
                 evidence_strength_score=score,
                 rationale=(
-                    f"Round 1: packet scores {score:.2f}, clearing the 0.70 acceptance bar."
+                    f"Packet scores {score:.2f}, clearing the 0.70 acceptance bar."
                 ),
             )
         if score <= ROUND1_REJECT_THRESHOLD:
@@ -174,7 +174,7 @@ class IssuerAgent:
                 decision=IssuerDecision.REQUEST_MORE_EVIDENCE,
                 evidence_strength_score=score,
                 rationale=(
-                    f"Round 1: packet scores {score:.2f}, below the 0.40 floor; "
+                    f"Packet scores {score:.2f}, below the 0.40 floor; "
                     f"requesting compelling evidence."
                 ),
             )
@@ -191,7 +191,7 @@ class IssuerAgent:
                     decision=IssuerDecision.ACCEPT,
                     evidence_strength_score=score,
                     rationale=(
-                        f"Round 1 ambiguity band: packet scores {score:.2f} — "
+                        f"Ambiguity band: packet scores {score:.2f} — "
                         f"LLM softening accepted."
                     ),
                     used_llm_softening=True,
@@ -201,7 +201,7 @@ class IssuerAgent:
                     decision=IssuerDecision.REQUEST_MORE_EVIDENCE,
                     evidence_strength_score=score,
                     rationale=(
-                        f"Round 1 ambiguity band: packet scores {score:.2f} — "
+                        f"Ambiguity band: packet scores {score:.2f} — "
                         f"LLM softening requested compelling evidence."
                     ),
                     used_llm_softening=True,
@@ -212,7 +212,7 @@ class IssuerAgent:
                 decision=IssuerDecision.ACCEPT,
                 evidence_strength_score=score,
                 rationale=(
-                    f"Round 1 ambiguity band: packet scores {score:.2f} "
+                    f"Ambiguity band: packet scores {score:.2f} "
                     f"(>= {ROUND1_MIDPOINT_FALLBACK:.2f} midpoint) — accepting."
                 ),
             )
@@ -220,7 +220,7 @@ class IssuerAgent:
             decision=IssuerDecision.REQUEST_MORE_EVIDENCE,
             evidence_strength_score=score,
             rationale=(
-                f"Round 1 ambiguity band: packet scores {score:.2f} "
+                f"Ambiguity band: packet scores {score:.2f} "
                 f"(< {ROUND1_MIDPOINT_FALLBACK:.2f} midpoint) — requesting more evidence."
             ),
         )
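For readers following the rationale strings, the round-1 decision bands they describe can be sketched roughly as below. This is an illustration under stated assumptions, not the `IssuerAgent` code: the 0.70 bar and 0.40 floor come from the rationale text, but the `>=` comparison on the acceptance bar, the name `ROUND1_ACCEPT_THRESHOLD`, and the 0.55 midpoint value are assumptions, and the LLM-softening branch is omitted.

```python
from enum import Enum


class IssuerDecision(Enum):
    ACCEPT = "accept"
    REQUEST_MORE_EVIDENCE = "request_more_evidence"


ROUND1_ACCEPT_THRESHOLD = 0.70   # name assumed; value from the "0.70 acceptance bar" text
ROUND1_REJECT_THRESHOLD = 0.40   # value from the "0.40 floor" text
ROUND1_MIDPOINT_FALLBACK = 0.55  # assumed value; the diff does not show it


def round1_decision(score: float) -> tuple[IssuerDecision, str]:
    """Map an evidence-strength score to a decision and a rationale string."""
    if score >= ROUND1_ACCEPT_THRESHOLD:  # comparison operator is an assumption
        return (
            IssuerDecision.ACCEPT,
            f"Packet scores {score:.2f}, clearing the 0.70 acceptance bar.",
        )
    if score <= ROUND1_REJECT_THRESHOLD:
        return (
            IssuerDecision.REQUEST_MORE_EVIDENCE,
            f"Packet scores {score:.2f}, below the 0.40 floor; "
            f"requesting compelling evidence.",
        )
    # Ambiguity band: deterministic midpoint fallback (no LLM softening here).
    if score >= ROUND1_MIDPOINT_FALLBACK:
        return (
            IssuerDecision.ACCEPT,
            f"Ambiguity band: packet scores {score:.2f} "
            f"(>= {ROUND1_MIDPOINT_FALLBACK:.2f} midpoint).",
        )
    return (
        IssuerDecision.REQUEST_MORE_EVIDENCE,
        f"Ambiguity band: packet scores {score:.2f} "
        f"(< {ROUND1_MIDPOINT_FALLBACK:.2f} midpoint).",
    )
```

The new wording drops the "Round N" prefix, so the same strings read correctly regardless of which round produced them.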

tests/test_benchmark_runner.py ADDED

	@@ -0,0 +1,81 @@

+"""Unit tests for the scripted-policy benchmark runner.
+The runner drives a fixed set of non-learning policies through the full
+environment without LLM calls. These tests pin:
+1. Each policy returns a valid action or None, fully offline.
+2. The aggregator produces per-policy means and a discrimination delta.
+3. The headline policy sweep keeps the heuristic ≥ 0.40 above the naive floor.
+"""
+from __future__ import annotations
+from runners.benchmark_runner import (
+    POLICY_NAMES,
+    POLICY_REGISTRY,
+    concede_all_policy,
+    escalate_all_policy,
+    heuristic_policy,
+    naive_policy,
+    run_multi_seed,
+    run_policy_on_task,
+    run_policy_sweep,
+)
+from scenarios.simulation import get_task
+_EASY_TASK = get_task("goods_not_received_easy")
+def test_policy_registry_matches_public_names():
+    assert set(POLICY_NAMES) == set(POLICY_REGISTRY)
+    assert set(POLICY_NAMES) == {"heuristic", "escalate_all", "concede_all", "naive"}
+def test_heuristic_scores_above_naive_on_easy():
+    heur = run_policy_on_task(heuristic_policy, _EASY_TASK)
+    nv = run_policy_on_task(naive_policy, _EASY_TASK)
+    assert heur.score > nv.score
+    assert heur.task_id == _EASY_TASK.task_id
+    assert heur.steps_used > 0
+def test_concede_all_lands_final_resolution():
+    """concede_all must always terminate the episode with a concede path."""
+    result = run_policy_on_task(concede_all_policy, _EASY_TASK)
+    assert result.steps_used > 0
+    # concede_all scores strictly below heuristic but must stay in [0, 1].
+    assert 0.0 <= result.score <= 1.0
+def test_escalate_all_runs_to_completion():
+    result = run_policy_on_task(escalate_all_policy, _EASY_TASK)
+    assert 0.0 <= result.score <= 1.0
+    assert result.steps_used > 0
+def test_sweep_aggregates_and_produces_delta():
+    result = run_policy_sweep()
+    policies = {summary.policy: summary for summary in result.policies}
+    assert set(policies) == set(POLICY_NAMES)
+    # mean scores sit in the valid range
+    for summary in result.policies:
+        assert 0.0 <= summary.mean_score <= 1.0
+    # discrimination delta is heuristic - naive and must clear the PRD bar.
+    assert result.discrimination_delta >= 0.40
+def test_sweep_is_deterministic():
+    """Two runs on the same catalog must produce identical numbers."""
+    first = run_policy_sweep().to_dict()
+    second = run_policy_sweep().to_dict()
+    assert first == second
+def test_multi_seed_sweep_runs_subset():
+    """Tiny grid (2 seeds × 1 difficulty) stays under a second and returns data."""
+    result = run_multi_seed(seeds=[42, 17], difficulties=["easy"])
+    for summary in result.policies:
+        assert len(summary.tasks) == 2
+        for task_score in summary.tasks:
+            assert task_score.task_id.startswith("generated_easy_s")

tests/test_training_adapter.py ADDED

	@@ -0,0 +1,105 @@

+"""Unit tests for the training adapter.
+Pin the prompt/completion serialization and the episode-replay reward
+signal so the training notebook has a stable offline contract.
+"""
+from __future__ import annotations
+import json
+from core.models import ChargebackOpsAction
+from scenarios.simulation import get_task
+from server.chargeback_ops_environment import ChargebackOpsEnvironment
+from training.env_adapter import (
+    action_from_completion,
+    build_prompt,
+    parse_completion,
+)
+from training.reward_adapter import (
+    compute_reward,
+    run_episode_with_text_policy,
+)
+def _fresh_observation(task_id: str = "goods_not_received_easy"):
+    env = ChargebackOpsEnvironment()
+    return env.reset(task_id=task_id).model_dump()
+def test_build_prompt_is_deterministic_and_includes_available_actions():
+    obs = _fresh_observation()
+    a = build_prompt(obs)
+    b = build_prompt(obs)
+    assert a == b
+    assert "available_actions" in a
+    assert "OBSERVATION:" in a
+    assert "ACTION:" in a
+def test_parse_completion_accepts_plain_json():
+    payload = '{"action_type": "select_case", "case_id": "CB-X"}'
+    parsed = parse_completion(payload)
+    assert parsed == {"action_type": "select_case", "case_id": "CB-X"}
+def test_parse_completion_strips_code_fence():
+    payload = '```json\n{"action_type": "select_case", "case_id": "CB-X"}\n```'
+    parsed = parse_completion(payload)
+    assert parsed == {"action_type": "select_case", "case_id": "CB-X"}
+def test_parse_completion_returns_none_on_garbage():
+    assert parse_completion("") is None
+    assert parse_completion("not json at all") is None
+    assert parse_completion("{not-valid-json}") is None
+def test_parse_completion_drops_unknown_fields():
+    payload = json.dumps({"action_type": "select_case", "hack_field": 42})
+    parsed = parse_completion(payload)
+    assert parsed == {"action_type": "select_case"}
+def test_action_from_completion_returns_valid_action():
+    payload = '{"action_type": "select_case", "case_id": "CB-X"}'
+    action = action_from_completion(payload)
+    assert isinstance(action, ChargebackOpsAction)
+    assert action.action_type == "select_case"
+    assert action.case_id == "CB-X"
+def test_action_from_completion_returns_none_on_bad_type():
+    payload = '{"action_type": "not_a_real_action"}'
+    assert action_from_completion(payload) is None
+def test_run_episode_falls_back_to_heuristic_on_empty_completion():
+    """Unparseable completions must not deadlock the episode."""
+    result = run_episode_with_text_policy(
+        "goods_not_received_easy",
+        text_policy=lambda _prompt: "",
+    )
+    assert result.steps_used > 0
+    assert result.invalid_actions > 0
+    assert result.score > 0.0  # heuristic fallback still scores
+def test_compute_reward_matches_episode_score():
+    """Single completion + heuristic tail reproduces the heuristic score."""
+    task = get_task("goods_not_received_easy")
+    prompts = ["unused"]
+    completions = [""]  # triggers heuristic fallback on the first action
+    rewards = compute_reward(
+        prompts, completions, task_ids=[task.task_id]
+    )
+    assert len(rewards) == 1
+    assert 0.0 <= rewards[0] <= 1.0
+    assert rewards[0] > 0.5  # heuristic scores ~0.97 on this task
+def test_compute_reward_rejects_mismatched_lengths():
+    import pytest
+    with pytest.raises(ValueError):
+        compute_reward(["a"], ["b", "c"], task_ids=["goods_not_received_easy"])

training/__init__.py ADDED

	@@ -0,0 +1,29 @@

+"""Training helpers for ChargebackOps.
+Lightweight pure-Python wrappers that convert the environment into a
+prompt/completion/reward interface compatible with TRL's GRPO trainer.
+The module is import-safe without ``trl`` / ``torch`` installed so unit
+tests stay fast and offline.
+"""
+from __future__ import annotations
+from .env_adapter import (
+    action_from_completion,
+    build_prompt,
+    parse_completion,
+)
+from .reward_adapter import (
+    EpisodeResult,
+    compute_reward,
+    run_episode_with_text_policy,
+)
+__all__ = [
+    "EpisodeResult",
+    "action_from_completion",
+    "build_prompt",
+    "compute_reward",
+    "parse_completion",
+    "run_episode_with_text_policy",
+]

training/env_adapter.py ADDED

	@@ -0,0 +1,142 @@

+"""Text prompt / completion adapter for the merchant policy.
+Serialize an observation into a compact prompt the model can condition
+on, and parse a JSON completion back into a typed
+``ChargebackOpsAction``. Both helpers are pure — no provider calls, no
+side effects — so they are cheap to unit-test.
+"""
+from __future__ import annotations
+import json
+from typing import Any
+try:
+    from ..core.models import ChargebackOpsAction
+except ImportError:  # pragma: no cover
+    from core.models import ChargebackOpsAction
+_SYSTEM_INSTRUCTION = (
+    "You play the merchant-side agent in a chargeback dispute. "
+    "Look at the observation and choose the single best next action. "
+    "Return JSON only: "
+    '{"action_type": "...", "case_id": "...", "strategy": "...", '
+    '"evidence_ids": [...], "note": "..."} '
+    "Use only action_types listed in available_actions. Omit fields you "
+    "do not need."
+)
+_ALLOWED_ACTION_FIELDS: frozenset[str] = frozenset(
+    {
+        "action_type",
+        "case_id",
+        "system_name",
+        "evidence_ids",
+        "compelling_evidence_ids",
+        "strategy",
+        "note",
+    }
+)
+def _compact_observation(observation: dict[str, Any]) -> dict[str, Any]:
+    """Drop fields that add tokens without signal for the merchant policy."""
+    visible_case = observation.get("visible_case")
+    compact_case: dict[str, Any] | None = None
+    if visible_case is not None:
+        compact_case = {
+            "case_id": visible_case["case_id"],
+            "status": visible_case["status"],
+            "reason_code": visible_case["reason_code"],
+            "amount": visible_case["amount"],
+            "currency": visible_case["currency"],
+            "current_strategy": visible_case.get("current_strategy"),
+            "systems_revealed": visible_case.get("systems_revealed", []),
+            "retrieved_evidence": [
+                {
+                    "evidence_id": item["evidence_id"],
+                    "source_system": item["source_system"],
+                    "title": item["title"],
+                }
+                for item in visible_case.get("retrieved_evidence", [])
+            ],
+            "attached_evidence": [
+                item["evidence_id"]
+                for item in visible_case.get("attached_evidence", [])
+            ],
+            "policy": visible_case.get("policy"),
+        }
+    return {
+        "objective": observation.get("objective", ""),
+        "selected_case_id": observation.get("selected_case_id"),
+        "available_actions": observation.get("available_actions", []),
+        "steps_remaining": observation.get("steps_remaining", 0),
+        "queue": [
+            {
+                "case_id": item["case_id"],
+                "status": item["status"],
+                "reason_code": item["reason_code"],
+                "amount": item["amount"],
+                "steps_until_deadline": item["steps_until_deadline"],
+            }
+            for item in observation.get("queue", [])
+        ],
+        "visible_case": compact_case,
+        "last_action_result": observation.get("last_action_result", ""),
+    }
+def build_prompt(observation: dict[str, Any]) -> str:
+    """Return a deterministic prompt for the merchant policy."""
+    compact = _compact_observation(observation)
+    body = json.dumps(compact, separators=(",", ":"), sort_keys=True)
+    return f"{_SYSTEM_INSTRUCTION}\nOBSERVATION:\n{body}\nACTION:"
+def parse_completion(text: str) -> dict[str, Any] | None:
+    """Parse a model completion into a raw action dict, or return None."""
+    if not text:
+        return None
+    cleaned = text.strip()
+    # Strip common code-fence patterns.
+    if cleaned.startswith("```"):
+        cleaned = cleaned.strip("`").strip()
+        if cleaned.lower().startswith("json"):
+            cleaned = cleaned[4:].lstrip()
+    # Take the span from the first "{" to the last "}" so prose around the JSON is tolerated.
+    start = cleaned.find("{")
+    end = cleaned.rfind("}")
+    if start == -1 or end == -1 or end <= start:
+        return None
+    try:
+        data = json.loads(cleaned[start : end + 1])
+    except json.JSONDecodeError:
+        return None
+    if not isinstance(data, dict):
+        return None
+    return {k: v for k, v in data.items() if k in _ALLOWED_ACTION_FIELDS}
+def action_from_completion(text: str) -> ChargebackOpsAction | None:
+    """Parse a completion and build a validated :class:`ChargebackOpsAction`."""
+    parsed = parse_completion(text)
+    if parsed is None or "action_type" not in parsed:
+        return None
+    try:
+        return ChargebackOpsAction(**parsed)
+    except Exception:
+        return None
+__all__ = [
+    "action_from_completion",
+    "build_prompt",
+    "parse_completion",
+]

training/reward_adapter.py ADDED

	@@ -0,0 +1,156 @@

+"""Reward adapter for GRPO / RL training on ChargebackOps.
+Exposes a callable shape compatible with TRL's GRPO reward function:
+``reward_fn(prompts, completions, **kwargs) -> list[float]``
+Each completion is parsed as a single JSON action and replayed as the
+first move of an episode; the remainder of the episode runs under the
+scripted heuristic so training always produces a terminal score. The
+resulting reward is the episode's deterministic normalized grade in
+``[0, 1]``.
+"""
+from __future__ import annotations
+from dataclasses import dataclass, field
+from typing import Any, Callable, Sequence
+try:
+    from ..core.models import ChargebackOpsAction, ChargebackOpsObservation
+    from ..scenarios.simulation import get_task, list_tasks
+    from ..server.chargeback_ops_environment import ChargebackOpsEnvironment
+    from .env_adapter import action_from_completion, build_prompt
+except ImportError:  # pragma: no cover
+    from core.models import ChargebackOpsAction, ChargebackOpsObservation
+    from scenarios.simulation import get_task, list_tasks
+    from server.chargeback_ops_environment import ChargebackOpsEnvironment
+    from training.env_adapter import action_from_completion, build_prompt
+TextPolicyFn = Callable[[str], str]
+@dataclass(frozen=True)
+class EpisodeResult:
+    """Outcome of a single rollout."""
+    task_id: str
+    score: float
+    steps_used: int
+    invalid_actions: int
+    prompts: tuple[str, ...] = field(default_factory=tuple)
+    completions: tuple[str, ...] = field(default_factory=tuple)
+def _fallback_action(
+    observation: ChargebackOpsObservation,
+) -> ChargebackOpsAction | None:
+    """Scripted fallback when the model output is unparseable."""
+    try:
+        from ..runners.benchmark_runner import heuristic_policy
+    except ImportError:  # pragma: no cover
+        from runners.benchmark_runner import heuristic_policy
+    return heuristic_policy(observation.model_dump())
+def run_episode_with_text_policy(
+    task_id: str,
+    text_policy: TextPolicyFn,
+    *,
+    max_steps: int | None = None,
+    capture_trace: bool = False,
+) -> EpisodeResult:
+    """Roll one episode forward under a text-in / text-out policy.
+    The policy is invoked at every step. If the completion fails to
+    parse into a valid action the scripted heuristic is used instead;
+    this keeps early-training trajectories from deadlocking.
+    """
+    task = get_task(task_id)
+    env = ChargebackOpsEnvironment()
+    observation = env.reset(task_id=task_id)
+    step_budget = (max_steps if max_steps is not None else task.max_steps) + 5
+    steps = 0
+    invalid = 0
+    prompts: list[str] = []
+    completions: list[str] = []
+    while not observation.done and steps < step_budget:
+        obs_dict = observation.model_dump()
+        prompt = build_prompt(obs_dict)
+        completion = text_policy(prompt)
+        action = action_from_completion(completion)
+        if action is None:
+            invalid += 1
+            action = _fallback_action(observation)
+            if action is None:
+                break
+        observation = env.step(action)
+        steps += 1
+        if capture_trace:
+            prompts.append(prompt)
+            completions.append(completion)
+    report = env.state.grader_report
+    score = float(report.normalized_score) if report is not None else 0.0
+    return EpisodeResult(
+        task_id=task_id,
+        score=score,
+        steps_used=env.state.step_count,
+        invalid_actions=invalid,
+        prompts=tuple(prompts),
+        completions=tuple(completions),
+    )
+def compute_reward(
+    prompts: Sequence[str],
+    completions: Sequence[str],
+    *,
+    task_ids: Sequence[str] | None = None,
+    **_: Any,
+) -> list[float]:
+    """GRPO-style reward function.
+    Each ``completion`` is replayed as a *single* action. The remainder
+    of the episode is driven by the scripted heuristic, so the reward
+    credits the model for picking a good first move from a
+    given observation. This matches the behaviour TRL expects: one
+    ``(prompt, completion)`` pair → one scalar reward.
+    ``task_ids`` optionally binds each prompt to a task id for env
+    replay. When omitted, the headline catalog is cycled.
+    """
+    if task_ids is None:
+        headline = [task.task_id for task in list_tasks()]
+        task_ids = [headline[i % len(headline)] for i in range(len(prompts))]
+    if len(task_ids) != len(prompts) or len(prompts) != len(completions):
+        raise ValueError(
+            "prompts, completions, and task_ids must all have the same length"
+        )
+    rewards: list[float] = []
+    for task_id, completion in zip(task_ids, completions):
+        first_action = action_from_completion(completion)
+        def _once(_prompt: str, _used=[False], _action=first_action) -> str:
+            if _used[0] or _action is None:
+                return ""
+            _used[0] = True
+            return completion
+        result = run_episode_with_text_policy(task_id, _once)
+        rewards.append(result.score)
+    return rewards
+__all__ = [
+    "EpisodeResult",
+    "TextPolicyFn",
+    "compute_reward",
+    "run_episode_with_text_policy",
+]
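The `_once` closure in `compute_reward` relies on a mutable default argument to remember that the completion was already emitted. An equivalent one-shot policy can be written with `nonlocal`, which some readers find easier to follow. This is a sketch of that alternative, with the hypothetical helper name `make_one_shot_policy`; it is not part of the commit.

```python
from typing import Callable

TextPolicyFn = Callable[[str], str]


def make_one_shot_policy(completion: str) -> TextPolicyFn:
    """Return a text policy that emits `completion` once, then empty strings.

    An empty string is unparseable as an action, so a rollout loop like the
    one in the diff would fall back to its scripted heuristic from then on.
    """
    used = False

    def policy(_prompt: str) -> str:
        nonlocal used
        if used:
            return ""
        used = True
        return completion

    return policy


policy = make_one_shot_policy('{"action_type": "select_case", "case_id": "CB-X"}')
first = policy("prompt for step 0")
second = policy("prompt for step 1")
print(repr(first), repr(second))
```

Binding the completion as a factory argument also avoids depending on the loop variable being captured at call time.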