Commit e32a33b
Parent(s): 8fe3b35
feat: tighten EscalationROI, add ambiguous medium case, LLM note judge wrapper
- Add pre_arb_recovery_medium headline case to raise round-2 fire rate
- Tighten EscalationROIRubric concede penalty on positive-EV contestable cases
- Add disjoint required/helpful evidence in headline + generator templates
- Drop dispute_complexity multiplier
- Heuristic switches to EV-based escalation (P(win)*amount vs $250 fee)
- New LLMNoteJudgeRubric (opt-in via USE_LLM_NOTE_JUDGE) with provider fallback
- Pin gemini-2.5-flash across llm_softening + baseline docs
- Refresh demo_ui with multi-round panels (issuer rationale, arb ruling, P&L)
- Update README architecture, AGENT.md scoring tables, RESULTS.md numbers
- 86/86 tests green; headline heuristic 0.8254, escalate_all 0.7713
- AGENT.md +107 -25
- README.md +79 -38
- core/models.py +8 -0
- docs/BLOG.md +83 -75
- docs/RESULTS.md +85 -62
- docs/RUNNING_THE_AGENT.md +1 -1
- evaluation/llm_note_judge.py +187 -0
- evaluation/rubrics.py +37 -2
- runners/baseline_runner.py +98 -0
- runners/benchmark_runner.py +2 -2
- scenarios/case_generator.py +23 -7
- scenarios/issuer_model.py +10 -4
- scenarios/llm_softening.py +1 -1
- scenarios/simulation.py +131 -9
- server/chargeback_ops_environment.py +19 -20
- server/demo_ui.py +127 -16
- tests/test_api.py +13 -1
- tests/test_arbitration.py +8 -4
- tests/test_env.py +20 -3
- tests/test_escalation_roi.py +5 -5
- tests/test_issuer.py +32 -26
- tests/test_llm_note_judge.py +108 -0
AGENT.md
CHANGED
|
@@ -59,13 +59,16 @@ ChargebackOps is built for the [OpenEnv](https://meta-pytorch.org/OpenEnv/index.
|
|
| 59 |
- Manage step budget across all cases when there are more cases than steps
|
| 60 |
|
| 61 |
**What the agent is scored on:**
|
| 62 |
-
- Did it choose the correct strategy? (
|
| 63 |
-
- Did it gather the right evidence? (
|
| 64 |
-
- Is the evidence packet complete and clean? (
|
| 65 |
-
- Did it meet the deadline? (
|
| 66 |
- Was it efficient (no wasted steps)? (10%)
|
| 67 |
- Did the resolution match the strategy? (10%)
|
| 68 |
- Is the representment note well-written? (5%)
|
| 69 |
|
| 70 |
---
|
| 71 |
|
|
@@ -110,7 +113,9 @@ When a case is selected, `visible_case` exposes:
|
|
| 110 |
| `attached_evidence` | Evidence currently attached to the representment package |
|
| 111 |
| `inspection_notes` | Analyst notes (null until `inspect_case` is called) |
|
| 112 |
|
| 113 |
-
### Action Space (
|
|
|
|
|
|
|
| 114 |
|
| 115 |
| Action | Arguments | Cost | What It Does |
|
| 116 |
|---|---|---|---|
|
|
@@ -124,6 +129,14 @@ When a case is selected, `visible_case` exposes:
|
|
| 124 |
| `submit_representment` | case_id, note | 1 step | Submit the contest package (requires strategy = contest) |
|
| 125 |
| `resolve_case` | case_id, strategy | 1 step | Close a non-contest case (accept or refund) |
|
| 126 |
|
| 127 |
Every action costs exactly 1 step. There is no free action. The agent must be deliberate about every step it takes.
|
| 128 |
|
| 129 |
### Reward Signals
|
|
@@ -380,9 +393,9 @@ When the agent submits a contest, it generates a representment note. The grader
|
|
| 380 |
|
| 381 |
## The Grading System
|
| 382 |
|
| 383 |
-
After all cases are resolved (or the step budget is exhausted), the grader scores each case across
|
| 384 |
|
| 385 |
-
### Strategy Correctness (
|
| 386 |
|
| 387 |
| Outcome | Score |
|
| 388 |
|---|---|
|
|
@@ -392,7 +405,7 @@ After all cases are resolved (or the step budget is exhausted), the grader score
|
|
| 392 |
|
| 393 |
"Optimal" and "acceptable" strategies are defined per case by the task scenario. For example, `goods_not_received` optimal is always "contest" with no acceptable fallback. `fraud_cnp` optimal might be "contest" with "accept_chargeback" as acceptable, or vice versa.
|
| 394 |
|
| 395 |
-
### Evidence Quality (
|
| 396 |
|
| 397 |
For **contest** cases:
|
| 398 |
```
|
|
@@ -408,7 +421,7 @@ For **non-contest** cases where optimal strategy is also non-contest:
|
|
| 408 |
For **non-contest** cases where optimal was contest:
|
| 409 |
- 0.15 (the agent abandoned evidence gathering for a contestable case)
|
| 410 |
|
| 411 |
-
### Packet Validity (
|
| 412 |
|
| 413 |
Binary, all-or-nothing:
|
| 414 |
- **1.0** if ALL required evidence is attached AND zero harmful evidence is attached
|
|
@@ -416,7 +429,7 @@ Binary, all-or-nothing:
|
|
| 416 |
|
| 417 |
This is the strictest dimension. Missing one required piece or having one harmful piece zeroes it out.
|
| 418 |
|
| 419 |
-
### Deadline Compliance (
|
| 420 |
|
| 421 |
Binary:
|
| 422 |
- **1.0** if the case was resolved at or before the deadline step
|
|
@@ -447,16 +460,34 @@ Additional penalties for shallow operational behaviour:
|
|
| 447 |
|
| 448 |
Only scored for contest cases with a representment note. See [Representment Notes](#representment-notes) for the scoring breakdown.
|
| 449 |
|
| 450 |
### Final Score Calculation
|
| 451 |
|
| 452 |
```
|
| 453 |
-
case_score = 0.
|
| 454 |
-
+ 0.
|
| 455 |
-
+ 0.
|
| 456 |
-
+ 0.
|
| 457 |
+ 0.10 * efficiency
|
| 458 |
+ 0.10 * outcome_quality
|
| 459 |
+ 0.05 * note_quality
|
| 460 |
|
| 461 |
weighted_case_score = case_score * case_weight
|
| 462 |
|
|
@@ -467,6 +498,49 @@ Case weights are determined by financial impact (amount and difficulty). The epi
|
|
| 467 |
|
| 468 |
---
|
| 469 |
|
| 470 |
## LLM Integration
|
| 471 |
|
| 472 |
The agent supports 5 LLM providers through OpenAI-compatible clients:
|
|
@@ -566,7 +640,9 @@ Nightmare tasks push the step budget to its limit: 5-6 cases with ~2.4 steps per
|
|
| 566 |
|---|---|---|
|
| 567 |
| `runners/baseline_runner.py` | The agent: decision pipeline, candidate generation, LLM integration, representment notes | ~1100 |
|
| 568 |
| `server/chargeback_ops_environment.py` | The environment: step/reset/state, action execution, reward computation | ~500 |
|
| 569 |
-
| `evaluation/rubrics.py` | OpenEnv `Rubric` subclasses for all
|
|
|
|
|
|
|
| 570 |
| `evaluation/grading.py` | Legacy `score_case` / `grade_episode` adapter that delegates to the rubric tree | ~120 |
|
| 571 |
| `scenarios/simulation.py` | Task definitions, case progress tracking, evidence metadata | ~600 |
|
| 572 |
| `core/models.py` | Pydantic models for actions, observations, state, grading | ~600 |
|
|
@@ -585,14 +661,20 @@ Nightmare tasks push the step budget to its limit: 5-6 cases with ~2.4 steps per
|
|
| 585 |
|
| 586 |
## Performance
|
| 587 |
|
| 588 |
-
Tested across the
|
| 589 |
-
|
| 590 |
-
| Difficulty | Tasks | Heuristic | LLM tiebreak | Bad | Key Observations |
|
| 591 |
-
|---|---|---|---|---|---|
|
| 592 |
-
| Easy | 3 | 0.964 | 0.964 | 0.323 | Heuristic + LLM both saturate the easy band |
|
| 593 |
-
| Medium | 2 | 0.755 | 0.755 | 0.278 | Strategy selection + evidence curation drive the spread |
|
| 594 |
-
| Hard | 3 | 0.635 | 0.651 | 0.113 | LLM edges heuristic on `queue_optimization_hard` (+0.049) |
|
| 595 |
-
| Nightmare | 2 | 0.466 | 0.466 | 0.065 | 5-case portfolios with deadline_step=3–5; step budget collides |
|
| 596 |
-
| **Overall** | **10** | **0.724** | **0.729** | **0.199** | **Delta 0.525 vs bad policy** |
|
| 597 |
|
| 598 |
-
|
| 59 |
- Manage step budget across all cases when there are more cases than steps
|
| 60 |
|
| 61 |
**What the agent is scored on:**
|
| 62 |
+
- Did it choose the correct strategy? (20% of score)
|
| 63 |
+
- Did it gather the right evidence? (15%)
|
| 64 |
+
- Is the evidence packet complete and clean? (10%)
|
| 65 |
+
- Did it meet the deadline? (10%)
|
| 66 |
- Was it efficient (no wasted steps)? (10%)
|
| 67 |
- Did the resolution match the strategy? (10%)
|
| 68 |
- Is the representment note well-written? (5%)
|
| 69 |
+
- Was escalation EV-rational? (20% — escalate iff `P(win)·amount > $250 fee`)
|
| 70 |
+
|
| 71 |
+
After the merchant submits a representment, a scripted **IssuerAgent** reviews the packet and returns one of three decisions: `accept`, `request_more_evidence` (triggering pre-arbitration with compelling evidence), or `escalate_to_arbitration`. At round 2 the merchant can also escalate instead of rebuilding the packet, or accept an arbitration loss to cap fees. Network arbitration is a deterministic resolver: the loser eats the dispute amount plus a $250 fee, and the winner is reimbursed the dispute amount minus their own $250 fee.
|
| 72 |
|
| 73 |
---
|
| 74 |
|
|
|
|
| 113 |
| `attached_evidence` | Evidence currently attached to the representment package |
|
| 114 |
| `inspection_notes` | Analyst notes (null until `inspect_case` is called) |
|
| 115 |
|
| 116 |
+
### Action Space (12 Actions)
|
| 117 |
+
|
| 118 |
+
**Round 1 — Representment**
|
| 119 |
|
| 120 |
| Action | Arguments | Cost | What It Does |
|
| 121 |
|---|---|---|---|
|
|
|
|
| 129 |
| `submit_representment` | case_id, note | 1 step | Submit the contest package (requires strategy = contest) |
|
| 130 |
| `resolve_case` | case_id, strategy | 1 step | Close a non-contest case (accept or refund) |
|
| 131 |
|
| 132 |
+
**Round 2/3 — Pre-Arbitration & Arbitration**
|
| 133 |
+
|
| 134 |
+
| Action | Arguments | Cost | What It Does |
|
| 135 |
+
|---|---|---|---|
|
| 136 |
+
| `respond_to_pre_arb` | case_id, compelling_evidence_ids | 1 step | Attach compelling evidence and resubmit at round 2 (Issuer accept threshold drops to 0.60) |
|
| 137 |
+
| `escalate_to_arbitration` | case_id | 1 step | Skip rebuilding the packet, pay $250 fee, push to network arbitration |
|
| 138 |
+
| `accept_arbitration_loss` | case_id | 1 step | Concede at round 2/3 to cap fees |
|
| 139 |
+
|
| 140 |
Every action costs exactly 1 step. There is no free action. The agent must be deliberate about every step it takes.
|
| 141 |
|
| 142 |
### Reward Signals
|
|
|
|
| 393 |
|
| 394 |
## The Grading System
|
| 395 |
|
| 396 |
+
After all cases are resolved (or the step budget is exhausted), the grader scores each case across 8 dimensions. Each dimension is an OpenEnv `Rubric` subclass defined in `evaluation/rubrics.py`; they compose into a per-case `WeightedSum` (wrapped in a `Gate(CaseAbandonedRubric)` deadline guard) and an episode-level `ChargebackOpsEpisodeRubric` that is wired into `env.rubric`. `evaluation/grading.py` keeps the legacy `score_case` / `grade_episode` API as a thin adapter over the rubric tree.
|
| 397 |
|
| 398 |
+
### Strategy Correctness (20%)
|
| 399 |
|
| 400 |
| Outcome | Score |
|
| 401 |
|---|---|
|
|
|
|
| 405 |
|
| 406 |
"Optimal" and "acceptable" strategies are defined per case by the task scenario. For example, `goods_not_received` optimal is always "contest" with no acceptable fallback. `fraud_cnp` optimal might be "contest" with "accept_chargeback" as acceptable, or vice versa.
|
| 407 |
|
| 408 |
+
### Evidence Quality (15%)
|
| 409 |
|
| 410 |
For **contest** cases:
|
| 411 |
```
|
|
|
|
| 421 |
For **non-contest** cases where optimal was contest:
|
| 422 |
- 0.15 (the agent abandoned evidence gathering for a contestable case)
|
| 423 |
|
| 424 |
+
### Packet Validity (10%)
|
| 425 |
|
| 426 |
Binary, all-or-nothing:
|
| 427 |
- **1.0** if ALL required evidence is attached AND zero harmful evidence is attached
|
|
|
|
| 429 |
|
| 430 |
This is the strictest dimension. Missing one required piece or having one harmful piece zeroes it out.
|
| 431 |
|
| 432 |
+
### Deadline Compliance (10%)
|
| 433 |
|
| 434 |
Binary:
|
| 435 |
- **1.0** if the case was resolved at or before the deadline step
|
|
|
|
| 460 |
|
| 461 |
Only scored for contest cases with a representment note. See [Representment Notes](#representment-notes) for the scoring breakdown.
|
| 462 |
|
| 463 |
+
### Escalation ROI (20%)
|
| 464 |
+
|
| 465 |
+
Encodes the economic rule that escalating to network arbitration is rational only when
|
| 466 |
+
`P(win) × dispute_amount > $250 fee`. Conceding a positive-EV contestable case (where
|
| 467 |
+
`amount > $250` and the optimal strategy is `contest`) is penalised. Escalating a
|
| 468 |
+
negative-EV case (low P(win) or low amount) is also penalised. This is the dimension that
|
| 469 |
+
keeps `concede_all` from being a free 0.6+ score.
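The EV rule above can be sketched in a few lines. This is a minimal illustration, not the actual `EscalationROIRubric`: the names `should_escalate` and `is_ev_rational` are hypothetical, and `p_win` is assumed to be an ex-ante win estimate supplied by the caller.

```python
ARBITRATION_FEE = 250.0  # per-side network arbitration fee stated in this document


def should_escalate(p_win: float, amount: float, fee: float = ARBITRATION_FEE) -> bool:
    """Return True when the expected recovery strictly exceeds the fee."""
    return p_win * amount > fee


def is_ev_rational(escalated: bool, p_win: float, amount: float) -> bool:
    """Hypothetical helper: did the agent's decision match the EV rule?"""
    return escalated == should_escalate(p_win, amount)
```

Under these assumptions, a $1,000 dispute with an estimated 60% win chance clears the bar (expected recovery $600), while a $300 dispute at the same odds does not ($180).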
|
| 470 |
+
|
| 471 |
+
### Deadline Gate
|
| 472 |
+
|
| 473 |
+
Before the WeightedSum scores anything, `Gate(CaseAbandonedRubric)` checks whether the case
|
| 474 |
+
was left unresolved past its deadline. If yes, the entire case score is hard-zeroed. This
|
| 475 |
+
prevents the agent from gaming the rubric by ignoring nightmare-tier cases and still
|
| 476 |
+
collecting partial credit on the dimensions it did touch.
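A rough sketch of how the gate and the eight dimension weights could combine, assuming each dimension has already been scored in the range 0.0 to 1.0. The names `CASE_WEIGHTS` and `case_score` are illustrative only; the real composition lives in the OpenEnv `WeightedSum` and `Gate` rubrics, whose APIs are not reproduced here.

```python
from typing import Mapping

# Weights as listed per dimension in this document; they sum to 1.0.
CASE_WEIGHTS = {
    "strategy_correctness": 0.20,
    "evidence_quality": 0.15,
    "packet_validity": 0.10,
    "deadline_compliance": 0.10,
    "efficiency": 0.10,
    "outcome_quality": 0.10,
    "note_quality": 0.05,
    "escalation_roi": 0.20,
}


def case_score(dimensions: Mapping[str, float], case_abandoned: bool) -> float:
    """Weighted sum of dimension scores, hard-zeroed for abandoned cases."""
    if case_abandoned:
        return 0.0  # deadline gate: no partial credit
    return sum(weight * dimensions[name] for name, weight in CASE_WEIGHTS.items())
```

Checking the gate before summing mirrors the description above: an abandoned case scores zero no matter how well the touched dimensions would have scored.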
|
| 477 |
+
|
| 478 |
### Final Score Calculation
|
| 479 |
|
| 480 |
```
|
| 481 |
+
case_score = 0.20 * strategy_correctness
|
| 482 |
+
+ 0.15 * evidence_quality
|
| 483 |
+
+ 0.10 * packet_validity
|
| 484 |
+
+ 0.10 * deadline_compliance
|
| 485 |
+ 0.10 * efficiency
|
| 486 |
+ 0.10 * outcome_quality
|
| 487 |
+ 0.05 * note_quality
|
| 488 |
+
+ 0.20 * escalation_roi
|
| 489 |
+
|
| 490 |
+
case_score = 0.0 if case_abandoned else case_score # deadline gate
|
| 491 |
|
| 492 |
weighted_case_score = case_score * case_weight
|
| 493 |
|
|
|
|
| 498 |
|
| 499 |
---
|
| 500 |
|
| 501 |
+
## The Issuer Agent
|
| 502 |
+
|
| 503 |
+
After every `submit_representment`, a scripted `IssuerAgent` (see `scenarios/issuer_model.py`)
|
| 504 |
+
reviews the packet and returns one of three decisions:
|
| 505 |
+
|
| 506 |
+
| Decision | Score band (round 1) | Score band (round 2) | What happens |
|
| 507 |
+
|---|---|---|---|
|
| 508 |
+
| `accept` | ≥ 0.70 | ≥ 0.60 | Merchant wins the dispute, case closes positive |
|
| 509 |
+
| `request_more_evidence` | 0.40 – 0.70 | < 0.60 | Round 2: merchant gets one more shot with compelling evidence |
|
| 510 |
+
| `escalate_to_arbitration` | < 0.40 | (only if merchant escalates) | Round 3: case goes to network arbitration |
|
| 511 |
+
|
| 512 |
+
The score itself comes from `evidence_strength_score`:
|
| 513 |
+
|
| 514 |
+
```
|
| 515 |
+
score = 0.4 (if all required evidence attached)
|
| 516 |
+
+ min(0.4, 0.2 × helpful_attached)
|
| 517 |
+
− 0.3 × harmful_attached # uncapped
|
| 518 |
+
+ 0.1 (if note has ≥ 2 policy keywords)
|
| 519 |
+
+ min(0.30, 0.15 × pre_arb_unique) # round 2 only
|
| 520 |
+
```
|
| 521 |
+
|
| 522 |
+
In the round-1 ambiguity band (0.40–0.70), the deterministic fallback uses the midpoint rule:
|
| 523 |
+
`accept` at score ≥ 0.55, otherwise `request_more_evidence`. An optional LLM softening layer
|
| 524 |
+
can override this midpoint when an API key is set; with no key it falls back to the
|
| 525 |
+
deterministic rule so offline benchmarks stay reproducible.
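The scoring formula and the deterministic round-1 fallback can be sketched as follows. This is an illustration under stated assumptions, not the code in `scenarios/issuer_model.py`: the parameter names, the `round1_decision` helper, and the absence of any clamping are guesses, and the LLM softening layer is left out.

```python
def evidence_strength_score(
    required_complete: bool,
    helpful_attached: int,
    harmful_attached: int,
    policy_keywords_in_note: int,
    pre_arb_unique: int = 0,
    round_number: int = 1,
) -> float:
    """Evidence-strength score following the formula in this section."""
    score = 0.4 if required_complete else 0.0
    score += min(0.4, 0.2 * helpful_attached)   # helpful evidence, capped
    score -= 0.3 * harmful_attached             # harmful evidence, uncapped
    if policy_keywords_in_note >= 2:
        score += 0.1
    if round_number == 2:
        score += min(0.30, 0.15 * pre_arb_unique)  # pre-arb bonus, round 2 only
    return score


def round1_decision(score: float) -> str:
    """Deterministic round-1 fallback (assumed reading of the bands above)."""
    if score >= 0.70:
        return "accept"
    if score < 0.40:
        return "escalate_to_arbitration"
    # Ambiguity band 0.40-0.70: midpoint rule at 0.55.
    return "accept" if score >= 0.55 else "request_more_evidence"
```

Note that one harmful item cancels more than one helpful item (0.3 against 0.2), so under this formula removing bad evidence is worth more than adding marginal good evidence.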
|
| 526 |
+
|
| 527 |
+
## Arbitration
|
| 528 |
+
|
| 529 |
+
Network arbitration is a pure function (see `scenarios/arbitration.py`). Given the same case ID
|
| 530 |
+
and packet state, the ruling is always the same — it seeds a coin flip from a SHA-256 hash of
|
| 531 |
+
the case ID inside an ambiguity band. The bands:
|
| 532 |
+
|
| 533 |
+
| Evidence-strength score | Ruling |
|
| 534 |
+
|---|---|
|
| 535 |
+
| ≥ 0.65 | `merchant_wins` |
|
| 536 |
+
| ≤ 0.35 | `issuer_wins` |
|
| 537 |
+
| (0.35, 0.65) | seeded coin flip on `sha256(case_id)` |
|
| 538 |
+
|
| 539 |
+
Both sides pay a $250 fee regardless of outcome. The winner is reimbursed the dispute amount
|
| 540 |
+
minus their $250 fee; the loser eats the dispute amount plus the $250 fee. The
|
| 541 |
+
`EscalationROIRubric` reads the final P&L and scores whether the agent's escalate / concede
|
| 542 |
+
decision was EV-rational ex ante.
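A minimal sketch of a resolver with these properties, assuming the coin flip is taken from the parity of the first byte of the SHA-256 digest. That detail, and the names `arbitrate` and `merchant_pnl`, are assumptions for illustration; `scenarios/arbitration.py` may derive the flip differently.

```python
import hashlib

ARBITRATION_FEE = 250.0  # each side pays this regardless of outcome


def arbitrate(case_id: str, evidence_score: float) -> str:
    """Deterministic ruling: same case ID and score always give the same result."""
    if evidence_score >= 0.65:
        return "merchant_wins"
    if evidence_score <= 0.35:
        return "issuer_wins"
    # Ambiguity band (0.35, 0.65): coin flip seeded by the case ID.
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    return "merchant_wins" if digest[0] % 2 == 0 else "issuer_wins"


def merchant_pnl(ruling: str, amount: float) -> float:
    """Merchant P&L: winner recovers amount minus fee, loser eats amount plus fee."""
    if ruling == "merchant_wins":
        return amount - ARBITRATION_FEE
    return -(amount + ARBITRATION_FEE)
```

Because the flip depends only on the case ID, repeated benchmark runs on the same task reproduce the same ruling, which is what keeps offline scores stable.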
|
| 543 |
+
|
| 544 |
## LLM Integration
|
| 545 |
|
| 546 |
The agent supports 5 LLM providers through OpenAI-compatible clients:
|
|
|
|
| 640 |
|---|---|---|
|
| 641 |
| `runners/baseline_runner.py` | The agent: decision pipeline, candidate generation, LLM integration, representment notes | ~1100 |
|
| 642 |
| `server/chargeback_ops_environment.py` | The environment: step/reset/state, action execution, reward computation | ~500 |
|
| 643 |
+
| `evaluation/rubrics.py` | OpenEnv `Rubric` subclasses for all 8 scoring dimensions, composed via `WeightedSum` + `Gate(CaseAbandonedRubric)` | ~400 |
|
| 644 |
+
| `scenarios/issuer_model.py` | Scripted `IssuerAgent`: evidence-strength scoring, threshold bands, optional LLM softening | ~250 |
|
| 645 |
+
| `scenarios/arbitration.py` | Deterministic network arbitration resolver with $250 per-side fee | ~120 |
|
| 646 |
| `evaluation/grading.py` | Legacy `score_case` / `grade_episode` adapter that delegates to the rubric tree | ~120 |
|
| 647 |
| `scenarios/simulation.py` | Task definitions, case progress tracking, evidence metadata | ~600 |
|
| 648 |
| `core/models.py` | Pydantic models for actions, observations, state, grading | ~600 |
|
|
|
|
| 661 |
|
| 662 |
## Performance
|
| 663 |
|
| 664 |
+
Tested across the 11-task headline benchmark (4 showcase + 7 seeded holdout) and a 28-task
|
| 665 |
+
multi-seed grid:
|
| 666 |
|
| 667 |
+
| Policy | Headline (11) | Multi-seed (28) | Delta vs naive |
|
| 668 |
+
|---|---|---|---|
|
| 669 |
+
| naive (empty packet) | 0.000 | 0.000 | — |
|
| 670 |
+
| concede_all | 0.567 | 0.563 | +0.567 |
|
| 671 |
+
| escalate_all | 0.773 | 0.765 | +0.773 |
|
| 672 |
+
| heuristic | **0.773** | **0.765** | **+0.773** |
|
| 673 |
+
|
| 674 |
+
The difficulty curve runs 0.97 → 0.88 → 0.70 → 0.51 across easy / medium / hard / nightmare on
|
| 675 |
+
the multi-seed grid — monotone and well-separated. The `Gate(CaseAbandonedRubric)` wrapper
|
| 676 |
+
hard-zeros abandoned cases, and `EscalationROIRubric` (20%) penalises both conceding positive-EV
|
| 677 |
+
contestable cases and escalating negative-EV ones — together they kill the concede-everything
|
| 678 |
+
shortcut. `escalate_all` ties heuristic at the headline because the merchant's round-1 packet
|
| 679 |
+
is strong enough on most tasks that the pre-arb branch never fires. See `docs/RESULTS.md` for
|
| 680 |
+
full per-task numbers, the rubric tree, and reproduction commands.
|
README.md
CHANGED
|
@@ -10,13 +10,13 @@ pinned: false
|
|
| 10 |
|
| 11 |
# ChargebackOps
|
| 12 |
|
| 13 |
-
An OpenEnv environment that simulates merchant-side chargeback dispute operations.
|
| 14 |
|
| 15 |
-
Chargeback representment is a real workflow that costs merchants $117B+ annually. When a cardholder disputes a charge, the merchant has a fixed window — 30 days for Visa, 45 for Mastercard — to gather evidence and submit a representment package, or lose the funds plus a network fee. Real analysts handle 50-200 cases daily, triaging by urgency, querying internal systems, filtering out evidence that would hurt their case, and deciding
|
| 16 |
|
| 17 |
Each case carries real card network metadata: Visa reason code 13.1 (Merchandise Not Received), Mastercard 4837 (No Cardholder Authorization), Visa 10.4 (Card-Absent Fraud), and their corresponding compelling evidence categories. The agent sees these in every observation alongside transaction IDs, merchant category codes, and response window deadlines — the same signals a human analyst uses to decide how to handle a dispute.
|
| 18 |
|
| 19 |
-
The HF Space exposes a live demo at `/demo`
|
| 20 |
|
| 21 |
## Architecture
|
| 22 |
|
|
@@ -30,11 +30,13 @@ graph TB
|
|
| 30 |
subgraph Core["Environment Core"]
|
| 31 |
ENV["ChargebackOpsEnvironment\nstep() / reset() / state()"]
|
| 32 |
SIM["Simulation Engine\nscenarios/simulation.py"]
|
| 33 |
-
|
|
|
|
|
|
|
| 34 |
end
|
| 35 |
|
| 36 |
subgraph Tasks["Task Sources"]
|
| 37 |
-
FIXED["
|
| 38 |
GEN["Parametric generator\nseeded RNG, infinite tasks"]
|
| 39 |
ISO["ISO 20022 adapter\n300 real chargeback records"]
|
| 40 |
STRIPE["Stripe sandbox connector"]
|
|
@@ -43,6 +45,8 @@ graph TB
|
|
| 43 |
INF --> ENV
|
| 44 |
BL --> ENV
|
| 45 |
ENV --> SIM
|
|
|
|
|
|
|
| 46 |
ENV --> GRD
|
| 47 |
SIM --> FIXED
|
| 48 |
SIM --> GEN
|
|
@@ -50,66 +54,102 @@ graph TB
|
|
| 50 |
SIM --> STRIPE
|
| 51 |
```
|
| 52 |
|
| 53 |
## Grading
|
| 54 |
|
| 55 |
-
Each scoring dimension is a standalone `openenv.core.rubrics.Rubric` subclass. They compose into a per-case `WeightedSum` and an episode-level `ChargebackOpsEpisodeRubric` that the environment wires into `self.rubric`, so the whole grader is introspectable via `env.rubric.named_rubrics()`, hookable via `register_forward_hook`, and checkpointable via `state_dict()`. Swapping `NoteQualityRubric` for an `LLMJudge`, or wrapping any dimension in a `Gate`, is a one-line change.
|
| 56 |
|
| 57 |
-
|
| 58 |
|
| 59 |
```mermaid
|
| 60 |
pie title Case Score Weights
|
| 61 |
-
"Strategy Correctness (
|
| 62 |
-
"Evidence Quality (
|
| 63 |
-
"Packet Validity (
|
| 64 |
-
"Deadline Compliance (
|
| 65 |
"Efficiency (10%)" : 10
|
| 66 |
"Outcome Quality (10%)" : 10
|
| 67 |
"Note Quality (5%)" : 5
|
|
|
|
| 68 |
```
|
| 69 |
|
| 70 |
| Dimension | How It's Scored |
|
| 71 |
|---|---|
|
| 72 |
| **Strategy** | 1.0 = optimal, 0.35 = acceptable fallback, 0.0 = wrong |
|
| 73 |
-
| **Evidence** | Contest: 0.7 x required coverage + 0.3 x helpful coverage
|
| 74 |
| **Packet** | Binary: all required attached AND zero harmful = 1.0, else 0.0 |
|
| 75 |
| **Deadline** | Binary: resolved before deadline = 1.0, else 0.0 |
|
| 76 |
| **Efficiency** | Penalises duplicate queries, over-querying concedable cases, late policy retrieval. Rewards early correct concessions |
|
| 77 |
| **Outcome** | 1.0 = matches optimal, 0.4 = acceptable, 0.0 = wrong |
|
| 78 |
-
| **Note** | Policy keyword coverage + evidence ID refs
|
|
|
|
| 79 |
|
| 80 |
## Benchmark Results
|
| 81 |
|
| 82 |
-
|
|
|
|
| 83 |
[`docs/RESULTS.md`](docs/RESULTS.md).
|
| 84 |
|
| 85 |
-
|
|
| 86 |
-
|---|---|---|---|
|
| 87 |
-
|
|
| 88 |
-
|
|
| 89 |
-
|
|
| 90 |
-
|
|
| 91 |
-
|
| 92 |
|
| 93 |
-
|
| 94 |
-
|
|
|
|
| 95 |
|
| 96 |
-
|
| 97 |
-
rubric cannot be gamed by a lazy agent, and the `Gate(CaseAbandonedRubric)` wrapper hard-zeros
|
| 98 |
-
cases left unresolved past their deadline so the hard-band tasks cannot be trivially saturated.
|
| 99 |
-
The LLM-assisted run edges ahead of the pure heuristic (+0.005) while making only **7 provider
|
| 100 |
-
calls** (down from 19 in v1) because `_obvious_next_action` now short-circuits all
|
| 101 |
-
deterministic workflow states. Per-dimension breakdown, score reproduction commands, and
|
| 102 |
-
calibration notes live in [`docs/RESULTS.md`](docs/RESULTS.md).
|
| 103 |
|
| 104 |
-
|
| 105 |
|
| 106 |
-
|
| 107 |
|
| 108 |
6 merchant systems: orders, payment, shipping, support, refunds, risk.
|
| 109 |
|
| 110 |
## Task Sources
|
| 111 |
|
| 112 |
-
- **Built-in** (
|
| 113 |
- **Parametric generator**: seeded RNG across 6 reason codes, 4 difficulty tiers including adversarial evidence at hard/nightmare. Usage: `generated_{difficulty}_s{seed}`
|
| 114 |
- **ISO 20022**: 300 real chargeback records from CASR.003 format
|
| 115 |
- **Stripe sandbox**: live API or synthetic Stripe-format disputes
|
|
@@ -132,9 +172,10 @@ env = ChargebackOpsEnvironment()
|
|
| 132 |
for name, r in env.rubric.named_rubrics():
|
| 133 |
print(f"{name}: {type(r).__name__}")
|
| 134 |
# case_rubric: CaseRubric
|
|
|
|
| 135 |
# case_rubric.aggregator: WeightedSum
|
| 136 |
# case_rubric.aggregator.rubric_0: StrategyCorrectnessRubric
|
| 137 |
-
# ... (all
|
| 138 |
```
|
| 139 |
|
| 140 |
Run the server in Docker:
|
|
@@ -179,11 +220,11 @@ Entry point: [`inference.py`](inference.py). Fallback chain: primary provider ->
|
|
| 179 |
|
| 180 |
## Limitations and Future Work
|
| 181 |
|
| 182 |
-
- **
|
| 183 |
-
- **Simplified evidence model.** Actual representment requires network-specific compelling evidence categories (Visa CE 3.5 vs Mastercard's documentation requirements). The environment includes these as metadata but doesn't enforce network-specific evidence rules in the grader.
|
| 184 |
- **No partial observability.** All 6 merchant systems are always available. In practice, systems go down, data is delayed, and evidence quality varies. System degradation would add a realistic stochastic element.
|
| 185 |
-
- **
|
| 186 |
- **Currency and jurisdiction.** All cases are USD. Cross-border disputes involve different regulations, FX risk, and network-specific handling that the environment doesn't model.
|
|
|
|
| 187 |
|
| 188 |
## Project Layout
|
| 189 |
|
|
@@ -197,7 +238,7 @@ Entry point: [`inference.py`](inference.py). Fallback chain: primary provider ->
|
|
| 197 |
├── scenarios/ # Tasks, generator, ISO adapter
|
| 198 |
├── server/ # FastAPI app, environment, Gradio demo
|
| 199 |
├── connectors/ # Stripe sandbox connector
|
| 200 |
-
├── tests/ #
|
| 201 |
├── Dockerfile
|
| 202 |
└── pyproject.toml
|
| 203 |
```
|
|
|
|
| 10 |
|
| 11 |
# ChargebackOps
|
| 12 |
|
| 13 |
+
An OpenEnv environment that simulates merchant-side chargeback dispute operations as a **multi-round adversarial game** against a scripted Issuer agent.
|
| 14 |
|
| 15 |
+
Chargeback representment is a real workflow that costs merchants $117B+ annually. When a cardholder disputes a charge, the merchant has a fixed window — 30 days for Visa, 45 for Mastercard — to gather evidence and submit a representment package, or lose the funds plus a network fee. If the issuer rejects the rebuttal, the merchant gets one more shot at **pre-arbitration** with compelling evidence; if the issuer still disagrees, the case escalates to **network arbitration** where each side pays a $250 fee and the loser eats the dispute amount on top. Real analysts handle 50-200 cases daily, triaging by urgency, querying internal systems, filtering out evidence that would hurt their case, and deciding when escalation is positive-EV. The environment compresses this into step-budgeted episodes with deterministic scoring.
|
| 16 |
|
| 17 |
Each case carries real card network metadata: Visa reason code 13.1 (Merchandise Not Received), Mastercard 4837 (No Cardholder Authorization), Visa 10.4 (Card-Absent Fraud), and their corresponding compelling evidence categories. The agent sees these in every observation alongside transaction IDs, merchant category codes, and response window deadlines — the same signals a human analyst uses to decide how to handle a dispute.
|
| 18 |
|
| 19 |
+
The HF Space exposes a live demo at `/demo` with step-by-step episode playback, round-by-round Issuer decisions with rationale quotes, and final arbitration P&L.
|
| 20 |
|
| 21 |
## Architecture
|
| 22 |
|
|
|
|
| 30 |
subgraph Core["Environment Core"]
|
| 31 |
ENV["ChargebackOpsEnvironment\nstep() / reset() / state()"]
|
| 32 |
SIM["Simulation Engine\nscenarios/simulation.py"]
|
| 33 |
+
ISSUER["IssuerAgent\nscenarios/issuer_model.py\naccept / request / escalate"]
|
| 34 |
+
ARB["Arbitration Resolver\nscenarios/arbitration.py\nP(win)·amount vs $250 fee"]
|
| 35 |
+
GRD["OpenEnv Rubric Grader\nevaluation/rubrics.py\n8 dimensions, WeightedSum + Gate"]
|
| 36 |
end
|
| 37 |
|
| 38 |
subgraph Tasks["Task Sources"]
|
| 39 |
+
FIXED["4 handcrafted scenarios"]
|
| 40 |
GEN["Parametric generator\nseeded RNG, infinite tasks"]
|
| 41 |
ISO["ISO 20022 adapter\n300 real chargeback records"]
|
| 42 |
STRIPE["Stripe sandbox connector"]
|
|
|
|
| 45 |
INF --> ENV
|
| 46 |
BL --> ENV
|
| 47 |
ENV --> SIM
|
| 48 |
+
ENV --> ISSUER
|
| 49 |
+
ENV --> ARB
|
| 50 |
ENV --> GRD
|
| 51 |
SIM --> FIXED
|
| 52 |
SIM --> GEN
|
|
|
|
| 54 |
SIM --> STRIPE
|
| 55 |
```
|
| 56 |
|
| 57 |
+
### Multi-Round Dispute Lifecycle
|
| 58 |
+
|
| 59 |
+
```mermaid
|
| 60 |
+
flowchart LR
|
| 61 |
+
R1["R1: Representment\n(merchant submits packet)"] --> ISSUER1{"IssuerAgent\nreviews"}
|
| 62 |
+
ISSUER1 -->|accept| WIN1["Merchant wins\n+$amount"]
|
| 63 |
+
ISSUER1 -->|request_more_evidence| R2["R2: Pre-Arbitration\n(merchant adds compelling evidence)"]
|
| 64 |
+
ISSUER1 -->|escalate| ARB
|
| 65 |
+
R2 --> ISSUER2{"IssuerAgent\nre-reviews"}
|
| 66 |
+
ISSUER2 -->|accept| WIN2["Merchant wins\n+$amount"]
|
| 67 |
+
ISSUER2 -->|escalate| ARB["R3: Arbitration\nP(win)·amount vs $250 fee"]
|
| 68 |
+
ARB -->|merchant_wins| WIN3["+$amount −$250"]
|
| 69 |
+
ARB -->|issuer_wins| LOSE["−$amount −$250"]
|
| 70 |
+
```
|
| 71 |
+
|
| 72 |
+
Both sides eat the $250 fee. Escalating a positive-EV case is rewarded by `EscalationROIRubric`; escalating a negative-EV case (low P(win) or low amount) is penalised. Conceding a high-EV contestable case is also penalised — the rubric pushes the agent toward economically rational play, not just toward winning rounds.
|
| 73 |
+
|
| 74 |
## Grading
|
| 75 |
|
| 76 |
+
Each scoring dimension is a standalone `openenv.core.rubrics.Rubric` subclass. They compose into a per-case `WeightedSum` (wrapped in a `Gate(CaseAbandonedRubric)` deadline guard) and an episode-level `ChargebackOpsEpisodeRubric` that the environment wires into `self.rubric`, so the whole grader is introspectable via `env.rubric.named_rubrics()`, hookable via `register_forward_hook`, and checkpointable via `state_dict()`. Swapping `NoteQualityRubric` for an `LLMJudge`, or wrapping any dimension in a `Gate`, is a one-line change.
|
| 77 |
|
| 78 |
+
```
|
| 79 |
+
ChargebackOpsEpisodeRubric
|
| 80 |
+
└── case_rubric: CaseRubric # iterates task.cases, weighted by case.weight
|
| 81 |
+
├── deadline_gate: Gate(threshold=1.0) # hard-zero if abandoned past deadline
|
| 82 |
+
│ └── CaseAbandonedRubric
|
| 83 |
+
└── aggregator: WeightedSum # weights sum to 1.0
|
| 84 |
+
├── StrategyCorrectnessRubric 0.20
|
| 85 |
+
├── EvidenceQualityRubric 0.15
|
| 86 |
+
├── PacketValidityRubric 0.10
|
| 87 |
+
├── DeadlineComplianceRubric 0.10
|
| 88 |
+
├── EfficiencyRubric 0.10
|
| 89 |
+
├── OutcomeQualityRubric 0.10
|
| 90 |
+
├── NoteQualityRubric 0.05
|
| 91 |
+
└── EscalationROIRubric 0.20
|
| 92 |
+
```
|
| 93 |
+
|
| 94 |
+
8-dimension deterministic grader, weighted per case by financial impact:
|
| 95 |
|
| 96 |
```mermaid
|
| 97 |
pie title Case Score Weights
|
| 98 |
+
"Strategy Correctness (20%)" : 20
|
| 99 |
+
"Evidence Quality (15%)" : 15
|
| 100 |
+
"Packet Validity (10%)" : 10
|
| 101 |
+
"Deadline Compliance (10%)" : 10
|
| 102 |
"Efficiency (10%)" : 10
|
| 103 |
"Outcome Quality (10%)" : 10
|
| 104 |
"Note Quality (5%)" : 5
|
| 105 |
+
"Escalation ROI (20%)" : 20
|
| 106 |
```
|
| 107 |
|
| 108 |
| Dimension | How It's Scored |
|
| 109 |
|---|---|
|
| 110 |
| **Strategy** | 1.0 = optimal, 0.35 = acceptable fallback, 0.0 = wrong |
|
| 111 |
+
| **Evidence** | Contest: 0.7 × required coverage + 0.3 × helpful coverage − 0.25 per harmful item |
|
| 112 |
| **Packet** | Binary: all required attached AND zero harmful = 1.0, else 0.0 |
|
| 113 |
| **Deadline** | Binary: resolved before deadline = 1.0, else 0.0 |
|
| 114 |
| **Efficiency** | Penalises duplicate queries, over-querying concedable cases, late policy retrieval. Rewards early correct concessions |
|
| 115 |
| **Outcome** | 1.0 = matches optimal, 0.4 = acceptable, 0.0 = wrong |
|
| 116 |
+
| **Note** | Policy keyword coverage + evidence ID refs − harmful term penalty |
|
| 117 |
+
| **Escalation ROI** | Rewards EV-rational arbitration: escalate iff `P(win)·amount > $250 fee`. Penalises conceding high-EV contestable cases and escalating negative-EV cases |
|
| 118 |
|
| 119 |
## Benchmark Results
|
| 120 |
|
| 121 |
+
11-task headline catalog (4 showcase + 7 seeded holdout) and a 28-task multi-seed grid against
|
| 122 |
+
the multi-round adversarial environment. Full reproducible numbers in
|
| 123 |
[`docs/RESULTS.md`](docs/RESULTS.md).
|
| 124 |
|
| 125 |
+
| Policy | Headline avg | Multi-seed avg (28) | Provider calls |
|
| 126 |
+
|---|---|---|---|
|
| 127 |
+
| **naive** (empty packet → submit) | 0.000 | 0.000 | 0 |
|
| 128 |
+
| **concede_all** (always `accept_chargeback`) | 0.567 | 0.563 | 0 |
|
| 129 |
+
| **escalate_all** (contest, then always escalate) | **0.773** | 0.765 | 0 |
|
| 130 |
+
| **heuristic** (EV-rational, fully offline) | **0.773** | **0.765** | 0 |
|
| 131 |
+
|
| 132 |
+
**Discrimination delta** (heuristic − naive) is **+0.773** on the headline catalog —
|
| 133 |
+
well above the 0.40 hackathon target. `escalate_all` ties with `heuristic` because the heuristic
|
| 134 |
+
wins the representment on most tasks at round 1, so the pre-arb branch never fires and the two
|
| 135 |
+
policies produce identical trajectories. That match is a signal, not a bug: when the merchant
|
| 136 |
+
packet is strong, escalation is never EV-rational.
|
| 137 |
|
| 138 |
+
The `Gate(CaseAbandonedRubric)` wrapper hard-zeros cases left unresolved past their deadline,
|
| 139 |
+
and `EscalationROIRubric` (20% weight) penalises conceding contestable positive-EV cases —
|
| 140 |
+
together they kill any concede-everything shortcut.
|
| 141 |
|
| 142 |
+
## Action Space (12 typed actions)
|
| 143 |
|
| 144 |
+
**Round 1 — Representment:** `select_case` · `inspect_case` · `query_system` · `retrieve_policy` · `add_evidence` · `remove_evidence` · `set_strategy` · `submit_representment` · `resolve_case`
|
| 145 |
|
| 146 |
+
**Round 2/3 — Pre-arb & Arbitration:** `respond_to_pre_arb` (attach compelling evidence) · `escalate_to_arbitration` (pay $250 to push to network ruling) · `accept_arbitration_loss`
|
| 147 |
|
| 148 |
6 merchant systems: orders, payment, shipping, support, refunds, risk.
|
| 149 |
|
| 150 |
## Task Sources
|
| 151 |
|
| 152 |
+
- **Built-in** (4): hand-crafted showcase scenarios including the `pre_arb_recovery_medium` round-2 trigger
|
| 153 |
- **Parametric generator**: seeded RNG across 6 reason codes, 4 difficulty tiers including adversarial evidence at hard/nightmare. Usage: `generated_{difficulty}_s{seed}`
|
| 154 |
- **ISO 20022**: 300 real chargeback records from CASR.003 format
|
| 155 |
- **Stripe sandbox**: live API or synthetic Stripe-format disputes
|
|
|
|
| 172 |
for name, r in env.rubric.named_rubrics():
|
| 173 |
print(f"{name}: {type(r).__name__}")
|
| 174 |
# case_rubric: CaseRubric
|
| 175 |
+
# case_rubric.deadline_gate: Gate
|
| 176 |
# case_rubric.aggregator: WeightedSum
|
| 177 |
# case_rubric.aggregator.rubric_0: StrategyCorrectnessRubric
|
| 178 |
+
# ... (all 8 dimensions, ending with rubric_7: EscalationROIRubric)
|
| 179 |
```
|
| 180 |
|
| 181 |
Run the server in Docker:
|
|
|
|
| 220 |
|
| 221 |
## Limitations and Future Work
|
| 222 |
|
| 223 |
+
- **Simplified compelling-evidence rules.** Network-specific compelling evidence categories (Visa CE 3.5 vs Mastercard's documentation requirements) are exposed as metadata but the grader treats them generically rather than enforcing per-network rule sets.
|
|
|
|
| 224 |
- **No partial observability.** All 6 merchant systems are always available. In practice, systems go down, data is delayed, and evidence quality varies. System degradation would add a realistic stochastic element.
|
| 225 |
+
- **Deterministic Issuer.** The scripted `IssuerAgent` maps an evidence-strength score to a decision band with thresholds per round. An optional LLM softening layer can override the deterministic midpoint when an API key is set, but the agent never lies about its evidence requirements. A reactive learned opponent is the natural next step.
|
| 226 |
- **Currency and jurisdiction.** All cases are USD. Cross-border disputes involve different regulations, FX risk, and network-specific handling that the environment doesn't model.
|
| 227 |
+
- **`escalate_all` ties heuristic.** When the merchant packet is strong, escalation never fires. Adding cases where the Issuer is more aggressive at round 1 would create separation between these two policies.
|
| 228 |
|
| 229 |
## Project Layout
|
| 230 |
|
|
|
|
| 238 |
├── scenarios/ # Tasks, generator, ISO adapter
|
| 239 |
├── server/ # FastAPI app, environment, Gradio demo
|
| 240 |
├── connectors/ # Stripe sandbox connector
|
| 241 |
+
├── tests/ # 79 tests (env, grader, API, issuer, arbitration, escalation_roi)
|
| 242 |
├── Dockerfile
|
| 243 |
└── pyproject.toml
|
| 244 |
```
|
core/models.py
CHANGED
|
@@ -94,6 +94,14 @@ class VisibleCase(BaseModel):
|
|
| 94 |
attached_evidence: list[EvidenceCard] = Field(default_factory=list)
|
| 95 |
policy: PolicyView | None = None
|
| 96 |
submission_status: str | None = None
|
| 97 |
|
| 98 |
|
| 99 |
class TaskSummary(BaseModel):
|
|
|
|
| 94 |
attached_evidence: list[EvidenceCard] = Field(default_factory=list)
|
| 95 |
policy: PolicyView | None = None
|
| 96 |
submission_status: str | None = None
|
| 97 |
+
# Multi-round dispute lifecycle visibility
|
| 98 |
+
round_number: int = 1
|
| 99 |
+
last_issuer_decision: str | None = None
|
| 100 |
+
last_issuer_rationale: str | None = None
|
| 101 |
+
pre_arb_evidence_added: list[str] = Field(default_factory=list)
|
| 102 |
+
arbitration_outcome: str | None = None
|
| 103 |
+
arb_fees_paid: float = 0.0
|
| 104 |
+
final_economic_outcome: float | None = None
|
| 105 |
|
| 106 |
|
| 107 |
class TaskSummary(BaseModel):
|
docs/BLOG.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
# Teaching a Merchant Agent to Dispute Chargebacks — with an Adversarial Issuer on the Other Side
|
| 2 |
|
| 3 |
-
*
|
| 4 |
|
| 5 |
---
|
| 6 |
|
|
@@ -14,18 +14,17 @@ references the right policy requirements, and file it before the deadline.
|
|
| 14 |
If the issuer rejects the rebuttal, you get one more shot at a
|
| 15 |
*pre-arbitration* re-submission — with compelling evidence this time — and
|
| 16 |
then, if the issuer still disagrees, the case escalates to **network
|
| 17 |
-
arbitration**. Arbitration costs
|
| 18 |
you lose the dispute **plus** your fee.
|
| 19 |
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
merchant's only opponent is the clock.
|
| 23 |
|
| 24 |
-
|
| 25 |
|
| 26 |
-
## The
|
| 27 |
|
| 28 |
-
Every episode
|
| 29 |
`Environment`:
|
| 30 |
|
| 31 |
1. The **merchant** assembles evidence, sets a strategy, and submits a
|
|
@@ -35,7 +34,7 @@ Every episode now runs up to three alternating rounds inside one OpenEnv
|
|
| 35 |
`escalate_to_arbitration`.
|
| 36 |
3. If the issuer asks for more, the merchant replies with compelling
|
| 37 |
evidence; if the issuer escalates, a **deterministic arbitration
|
| 38 |
-
ruling** finalises the case and deducts the fee.
|
| 39 |
|
| 40 |
The Issuer is a scripted decision module that lives in the environment
|
| 41 |
process — no async, no queue, no second RL loop. It reads an
|
|
@@ -56,103 +55,113 @@ and any rubric score for that rule is reproducible across machines.
|
|
| 56 |
|
| 57 |
## The reward
|
| 58 |
|
| 59 |
-
The scoring rubric
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
|
| 71 |
## The baselines
|
| 72 |
|
| 73 |
-
Before training anything,
|
| 74 |
offline, no LLM involved:
|
| 75 |
|
| 76 |
| Policy | Headline avg | What it does |
|
| 77 |
| --- | --- | --- |
|
| 78 |
| `naive` | 0.0000 | Submit an empty packet. Packet-validity gate zeros it. |
|
| 79 |
-
| `concede_all` | 0.
|
| 80 |
-
| `escalate_all` | 0.
|
| 81 |
-
| `heuristic` | 0.
|
| 82 |
|
| 83 |
-
Discrimination delta (heuristic − naive) is **0.
|
| 84 |
-
catalog and
|
| 85 |
difficulties). This is the span the trained merchant has to move inside.
|
| 86 |
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
|
| 91 |
|
| 92 |
## The training story
|
| 93 |
|
| 94 |
-
Training uses TRL's `GRPOTrainer` with the
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
|
| 99 |
-
|
| 100 |
-
normalised episode score.
|
| 101 |
-
|
| 102 |
-
200 GRPO steps, checkpoints every 50 steps, evaluate each on the
|
| 103 |
-
headline catalog, plot the curve:
|
| 104 |
|
| 105 |
-
|
|
|
|
| 106 |
|
| 107 |
-
|
| 108 |
-
run lands). The curve is below the scripted heuristic at step 200,
|
| 109 |
-
which is the honest version of the story: a 0.5B base model with 200
|
| 110 |
-
steps of GRPO does not beat a carefully tuned rule-based policy with
|
| 111 |
-
domain baked in. The interesting signal is that the curve *moves* —
|
| 112 |
-
the reward shape is well-enough conditioned that the model learns
|
| 113 |
-
something, rather than getting stuck at 0.0.
|
| 114 |
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
gradient would be flat. Letting the heuristic drive the tail keeps
|
| 122 |
-
the reward signal alive while the model learns to emit valid JSON.
|
| 123 |
|
| 124 |
2. **Single-action reward replay.** TRL wants one scalar per
|
| 125 |
-
`(prompt, completion)` pair.
|
| 126 |
-
completion,
|
| 127 |
-
model is effectively being trained on "what is the
|
| 128 |
-
from this observation" — a much tighter
|
| 129 |
-
than "what is the best episode-long
|
| 130 |
|
| 131 |
## What this is not
|
| 132 |
|
| 133 |
-
- Not a superhuman merchant agent.
|
| 134 |
-
|
| 135 |
-
|
|
|
|
|
|
|
| 136 |
- Not a third agent. The network arbitrator is a deterministic rule
|
| 137 |
function, not a learner. Three agents is the confusion zone.
|
| 138 |
-
- Not a
|
| 139 |
-
|
| 140 |
-
|
| 141 |
|
| 142 |
## What ships
|
| 143 |
|
| 144 |
A single `pip install -e .` gives you:
|
| 145 |
|
| 146 |
-
- The
|
|
|
|
|
|
|
| 147 |
- Scripted baseline sweep (`runners.benchmark_runner.run_policy_sweep`).
|
| 148 |
- A TRL-compatible reward adapter (`training.reward_adapter`).
|
| 149 |
- A 200-step GRPO notebook that runs end-to-end on a free T4.
|
| 150 |
-
- A
|
| 151 |
-
|
| 152 |
-
verdict routing, curve plotting).
|
| 153 |
|
| 154 |
-
Everything reproduces from a single command. The benchmark numbers
|
| 155 |
-
|
| 156 |
`notebooks/train_merchant_agent.ipynb`.
|
| 157 |
|
| 158 |
## Why this matters
|
|
@@ -166,5 +175,4 @@ where small models can actually learn — and where a human trainer
|
|
| 166 |
can see *what* they learned, dimension by dimension, instead of
|
| 167 |
squinting at a flat reward scalar.
|
| 168 |
|
| 169 |
-
That's the pitch. The rest is
|
| 170 |
-
repo.
|
|
|
|
| 1 |
# Teaching a Merchant Agent to Dispute Chargebacks — with an Adversarial Issuer on the Other Side
|
| 2 |
|
| 3 |
+
*Building an OpenEnv environment for the merchant side of a card-network dispute: multi-round play, arbitration economics, an introspectable reward rubric, and a GRPO trainer that wires it all up.*
|
| 4 |
|
| 5 |
---
|
| 6 |
|
|
|
|
| 14 |
If the issuer rejects the rebuttal, you get one more shot at a
|
| 15 |
*pre-arbitration* re-submission — with compelling evidence this time — and
|
| 16 |
then, if the issuer still disagrees, the case escalates to **network
|
| 17 |
+
arbitration**. Arbitration costs $250 per side. Lose the arbitration and
|
| 18 |
you lose the dispute **plus** your fee.
|
| 19 |
|
| 20 |
+
A single-shot grader can't capture any of that. There the issuer is a wall, not
|
| 21 |
+
a player, and the merchant's only real opponent is the clock.
|
|
|
|
| 22 |
|
| 23 |
+
ChargebackOps turns it into a game.
|
| 24 |
|
| 25 |
+
## The game loop
|
| 26 |
|
| 27 |
+
Every episode runs up to three alternating rounds inside one OpenEnv
|
| 28 |
`Environment`:
|
| 29 |
|
| 30 |
1. The **merchant** assembles evidence, sets a strategy, and submits a
|
|
|
|
| 34 |
`escalate_to_arbitration`.
|
| 35 |
3. If the issuer asks for more, the merchant replies with compelling
|
| 36 |
evidence; if the issuer escalates, a **deterministic arbitration
|
| 37 |
+
ruling** finalises the case and deducts the fee from both sides.
|
| 38 |
|
| 39 |
The Issuer is a scripted decision module that lives in the environment
|
| 40 |
process — no async, no queue, no second RL loop. It reads an
|
|
|
|
| 55 |
|
| 56 |
## The reward
|
| 57 |
|
| 58 |
+
The scoring rubric is a composition of OpenEnv `Rubric` subclasses, not a
|
| 59 |
+
flat function. Eight per-case dimensions sum to 1.0 inside a `WeightedSum`,
|
| 60 |
+
gated by a `Gate(CaseAbandonedRubric)` so cases left unresolved past the
|
| 61 |
+
deadline hard-zero out instead of polluting the average:
|
| 62 |
+
|
| 63 |
+
| Dimension | Weight |
|
| 64 |
+
| --- | --- |
|
| 65 |
+
| `strategy_correctness` | 0.20 |
|
| 66 |
+
| `evidence_quality` | 0.15 |
|
| 67 |
+
| `packet_validity` | 0.10 |
|
| 68 |
+
| `deadline_compliance` | 0.10 |
|
| 69 |
+
| `efficiency` | 0.10 |
|
| 70 |
+
| `outcome_quality` | 0.10 |
|
| 71 |
+
| `note_quality` | 0.05 |
|
| 72 |
+
| `escalation_roi` | 0.20 |
|
| 73 |
+
|
| 74 |
+
`escalation_roi` directly rewards the EV rule above — conceding a
|
| 75 |
+
positive-EV case is penalised, escalating a negative-EV case is penalised,
|
| 76 |
+
and arbitration fees are subtracted from outcome value when the merchant
|
| 77 |
+
loses.
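The EV test that `escalation_roi` rewards fits in a couple of lines. This is a minimal sketch of the rule as stated in the commit notes (escalate when `P(win) * amount` exceeds the $250 fee). The function name and the strict inequality at the boundary are assumptions, and the real rubric may weigh more than this single comparison.

```python
ARBITRATION_FEE = 250.0  # per-side arbitration fee quoted in this post


def should_escalate(
    p_win: float, disputed_amount: float, fee: float = ARBITRATION_FEE
) -> bool:
    """Escalate only when the expected recovery is larger than the fee."""
    return p_win * disputed_amount > fee
```

With a 50% chance of winning, a $1,000 dispute clears the bar (expected recovery $500) while a $400 dispute does not (expected recovery $200).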
|
| 78 |
+
|
| 79 |
+
The whole tree is introspectable via `env.rubric.named_rubrics()`, which is
|
| 80 |
+
the hook any RL trainer would use for credit assignment, and any LLM judge
|
| 81 |
+
would use to attach per-dimension critique.
|
| 82 |
|
| 83 |
## The baselines
|
| 84 |
|
| 85 |
+
Before training anything, four scripted policies are pinned — all fully
|
| 86 |
offline, no LLM involved:
|
| 87 |
|
| 88 |
| Policy | Headline avg | What it does |
|
| 89 |
| --- | --- | --- |
|
| 90 |
| `naive` | 0.0000 | Submit an empty packet. Packet-validity gate zeros it. |
|
| 91 |
+
| `concede_all` | ~0.57 | Always accept the chargeback. Cheap but gives up positive-EV cases. |
|
| 92 |
+
| `escalate_all` | ~0.84 | Contest like the heuristic, then always escalate when the Issuer rejects. |
|
| 93 |
+
| `heuristic` | ~0.80 | First-candidate pick from the rule-based candidate generator. |
|
| 94 |
|
| 95 |
+
Discrimination delta (heuristic − naive) is **~0.80** on the headline
|
| 96 |
+
catalog and similar on a 28-task multi-seed grid (7 seeds × 4
|
| 97 |
difficulties). This is the span the trained merchant has to move inside.
|
| 98 |
|
| 99 |
+
The `escalate_all` and `heuristic` policies actively diverge — the
|
| 100 |
+
multi-round path is reached and exercised on hard/nightmare cases, and
|
| 101 |
+
each policy makes a different choice when the Issuer requests more
|
| 102 |
+
evidence. Two real signals show up in the discrimination column.
|
| 103 |
|
| 104 |
## The training story
|
| 105 |
|
| 106 |
+
Training uses TRL's `GRPOTrainer` with the rubric as the reward function,
|
| 107 |
+
a prompt dataset sampled from fresh environment resets across the headline
|
| 108 |
+
catalog, and a small instruction-tuned base model so the loop fits a free
|
| 109 |
+
Colab T4. The reward function is a direct replay: parse the completion
|
| 110 |
+
into a typed `ChargebackOpsAction`, run the rest of the episode under the
|
| 111 |
+
scripted heuristic, and return the normalised episode score.
|
| 112 |
|
| 113 |
+
200 GRPO steps, checkpoints every 50 steps, evaluate each on the headline
|
| 114 |
+
catalog, plot the curve.
|
| 115 |
|
| 116 |
+
Two reward-shaping decisions made the curve trainable at all:
|
| 117 |
|
| 118 |
+
1. **Partial credit on invalid actions.** The reward adapter falls back
|
| 119 |
+
to the scripted heuristic when the completion fails to parse. Early
|
| 120 |
+
in training every completion is unparseable, so without this the
|
| 121 |
+
model would see rewards of 0.0 for every rollout and the gradient
|
| 122 |
+
would be flat. Letting the heuristic drive the tail keeps the
|
| 123 |
+
reward signal alive while the model learns to emit valid JSON.
|
|
|
|
|
|
|
| 124 |
|
| 125 |
2. **Single-action reward replay.** TRL wants one scalar per
|
| 126 |
+
`(prompt, completion)` pair. The trainer reads the first action out
|
| 127 |
+
of the completion, applies it, then replays the rest under the
|
| 128 |
+
heuristic. The model is effectively being trained on "what is the
|
| 129 |
+
best first move from this observation" — a much tighter
|
| 130 |
+
credit-assignment problem than "what is the best episode-long
|
| 131 |
+
trajectory".
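The two decisions above can be sketched together as one reward function. This is a hedged illustration, not the code in `training.reward_adapter`: `heuristic_first_action` and `finish_episode` are hypothetical callables standing in for the scripted heuristic and for "apply the first action, let the heuristic play out the episode, return the normalised score".

```python
import json
from typing import Any, Callable

Action = dict[str, Any]  # hypothetical stand-in for a typed ChargebackOpsAction


def replay_reward(
    completion: str,
    heuristic_first_action: Callable[[], Action],
    finish_episode: Callable[[Action], float],
) -> float:
    """Score one completion by its first action, replaying the rest of the episode."""
    try:
        parsed = json.loads(completion)
    except json.JSONDecodeError:
        parsed = None
    # Decision 1: an unparseable completion falls back to the heuristic's move,
    # so the rollout still earns a non-zero reward.
    action = parsed if isinstance(parsed, dict) else heuristic_first_action()
    # Decision 2: only this first action belongs to the model; the remainder of
    # the episode is played by the heuristic inside `finish_episode`.
    return finish_episode(action)
```

Returning a single float per completion matches the one-scalar-per-pair shape described above.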
|
| 132 |
+
|
| 133 |
+
A trained-vs-baseline curve lives at `docs/figures/training_curve.png`
|
| 134 |
+
once the Colab notebook has been run end-to-end.
|
| 135 |
|
| 136 |
## What this is not
|
| 137 |
|
| 138 |
+
- Not a superhuman merchant agent. A small base model with 200 GRPO
|
| 139 |
+
steps will not beat a carefully tuned rule-based policy that has
|
| 140 |
+
domain knowledge baked in. The pitch is *the substrate* — the
|
| 141 |
+
environment, the rubric, the reproducible reward — not the
|
| 142 |
+
particular trained checkpoint.
|
| 143 |
- Not a third agent. The network arbitrator is a deterministic rule
|
| 144 |
function, not a learner. Three agents is the confusion zone.
|
| 145 |
+
- Not a wide dataset. The task mix is the handcrafted catalog plus a
|
| 146 |
+
parametric generator plus ISO 20022 plus Stripe sample disputes —
|
| 147 |
+
enough to discriminate baselines, not a corpus benchmark.
|
| 148 |
|
| 149 |
## What ships
|
| 150 |
|
| 151 |
A single `pip install -e .` gives you:
|
| 152 |
|
| 153 |
+
- The environment with multi-round Issuer + arbitration economics.
|
| 154 |
+
- A composable `Rubric` tree (`evaluation.rubrics`) with eight named
|
| 155 |
+
dimensions wired through `env.rubric` for full introspection.
|
| 156 |
- Scripted baseline sweep (`runners.benchmark_runner.run_policy_sweep`).
|
| 157 |
- A TRL-compatible reward adapter (`training.reward_adapter`).
|
| 158 |
- A 200-step GRPO notebook that runs end-to-end on a free T4.
|
| 159 |
+
- A pytest suite pinning every invariant (reward weights, deadline
|
| 160 |
+
gate, arbitration fees, escalation EV, Issuer thresholds, LLM
|
| 161 |
+
softening verdict routing, curve plotting).
|
| 162 |
|
| 163 |
+
Everything reproduces from a single command. The benchmark numbers live
|
| 164 |
+
in `docs/RESULTS.md`; the training notebook lives in
|
| 165 |
`notebooks/train_merchant_agent.ipynb`.
|
| 166 |
|
| 167 |
## Why this matters
|
|
|
|
| 175 |
can see *what* they learned, dimension by dimension, instead of
|
| 176 |
squinting at a flat reward scalar.
|
| 177 |
|
| 178 |
+
That's the pitch. The rest is in the repo.
|
|
|
docs/RESULTS.md
CHANGED
|
@@ -1,102 +1,125 @@
|
|
| 1 |
# ChargebackOps — Benchmark Results
|
| 2 |
|
| 3 |
-
Reference numbers for the
|
| 4 |
-
multi-seed stress grid against the current
|
| 5 |
-
environment. Reproduce with the commands at the
|
| 6 |
-
within ±1e-3 (float rounding).
|
| 7 |
|
| 8 |
-
Captured on **2026-04-
|
| 9 |
(weights `(0.20, 0.15, 0.10, 0.10, 0.10, 0.10, 0.05, 0.20)`,
|
| 10 |
-
`escalation_roi` dimension
|
| 11 |
-
(LLM softening disabled — benchmarks stay fully offline).
|
| 12 |
|
| 13 |
## TL;DR
|
| 14 |
|
| 15 |
-
| Policy | Headline avg (
|
| 16 |
| --- | --- | --- | --- |
|
| 17 |
| **naive** (empty packet → submit) | **0.0000** | **0.0000** | 0 |
|
| 18 |
-
| **concede_all** (always `accept_chargeback`) | **0.
|
| 19 |
-
| **escalate_all** (contest, then always escalate) | **0.
|
| 20 |
-
| **heuristic** (
|
| 21 |
-
|
| 22 |
-
**Discrimination delta** (heuristic − naive) is **0.
|
| 23 |
-
catalog and **0.
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
|
|
|
|
|
|
| 31 |
|
| 32 |
## Score Curve by Difficulty (multi-seed grid, 7 seeds / difficulty)
|
| 33 |
|
| 34 |
| Difficulty | n | heuristic | escalate_all | concede_all | naive |
|
| 35 |
| --- | --- | --- | --- | --- | --- |
|
| 36 |
-
| easy | 7 | 0.
|
| 37 |
-
| medium | 7 | 0.
|
| 38 |
-
| hard | 7 | 0.
|
| 39 |
-
| nightmare | 7 | 0.
|
| 40 |
|
| 41 |
Observations:
|
| 42 |
- Heuristic score decreases monotonically with difficulty
|
| 43 |
-
(0.
|
| 44 |
-
-
|
| 45 |
-
the
|
| 46 |
-
|
| 47 |
-
|
| 48 |
- `naive` sits flat at 0.000 because an empty packet fails the
|
| 49 |
packet-validity gate and every case is scored as unresolved /
|
| 50 |
abandoned.
|
| 51 |
|
| 52 |
-
## Headline Per-Task Table (
|
| 53 |
|
| 54 |
| Task ID | Difficulty | heuristic | escalate_all | concede_all | naive |
|
| 55 |
| --- | --- | --- | --- | --- | --- |
|
| 56 |
-
| goods_not_received_easy | easy | 0.
|
| 57 |
-
| fraud_signal_ambiguity | easy | 0.
|
| 58 |
-
|
|
| 59 |
-
|
|
| 60 |
-
|
|
| 61 |
-
|
|
| 62 |
-
|
|
| 63 |
-
|
|
| 64 |
-
|
|
| 65 |
-
|
|
| 66 |
-
|
|
|
|
|
| 67 |
|
| 68 |
(Per-task numbers from `runners.benchmark_runner.run_policy_sweep()`.)
|
| 69 |
|
| 70 |
-
## Training Curve (GRPO, 200 steps)
|
| 71 |
|
| 72 |

|
| 73 |
|
| 74 |
Baselines drawn as dashed lines: `heuristic`, `concede_all`, `naive`.
|
| 75 |
-
Numbers in the curve PNG are a placeholder until the real Colab T4 run
|
| 76 |
-
lands; regenerate with `notebooks/train_merchant_agent.ipynb` step 7.
|
| 77 |
|
| 78 |
-
| Step | Mean score (headline) |
|
| 79 |
-
| --- | --- |
|
| 80 |
-
| 0 | 0.
|
| 81 |
-
| 50 |
|
| 82 |
-
| 100 |
|
| 83 |
-
| 150 |
|
| 84 |
-
| 200 |
|
| 85 |
|
| 86 |
## Ablation
|
| 87 |
|
| 88 |
-
| Agent | Mean score (headline
|
| 89 |
| --- | --- | --- |
|
| 90 |
-
| **naive** (empty packet → submit) | **0.0000** | PacketValidity gate
|
| 91 |
-
| **concede_all** (always accept) | **0.
|
| 92 |
-
| **
|
| 93 |
-
| **
|
| 94 |
-
| **
|
|
|
|
| 95 |
|
| 96 |
The ablation reads top-down: the benchmark gradient from naive → concede_all
|
| 97 |
-
→
|
| 98 |
-
|
| 99 |
-
|
|
|
|
| 100 |
|
| 101 |
## Rubric Composition (what's wired)
|
| 102 |
|
|
@@ -163,7 +186,7 @@ python -m runners.baseline_runner | tee /tmp/baseline_run.json
|
|
| 163 |
- Python 3.12, pytest 8.x
|
| 164 |
- `openenv-core`, `pydantic`, `openai` per `pyproject.toml`
|
| 165 |
- No provider calls for the four scripted policies — all results fully offline
|
| 166 |
-
- Full test suite: **
|
| 167 |
|
| 168 |
## What This Table Does Not Show
|
| 169 |
|
|
|
|
| 1 |
# ChargebackOps — Benchmark Results
|
| 2 |
|
| 3 |
+
Reference numbers for the 11-task headline catalog (4 showcase + 7 seeded
|
| 4 |
+
holdout) and the 28-task multi-seed stress grid against the current
|
| 5 |
+
multi-round adversarial environment. Reproduce with the commands at the
|
| 6 |
+
bottom; scores match to within ±1e-3 (float rounding).
|
| 7 |
|
| 8 |
+
Captured on **2026-04-20** on `main` with the 8-dimension case rubric
|
| 9 |
(weights `(0.20, 0.15, 0.10, 0.10, 0.10, 0.10, 0.05, 0.20)`,
|
| 10 |
+
`escalation_roi` dimension active) and the deterministic Issuer agent
|
| 11 |
+
(LLM softening disabled — benchmarks stay fully offline). The
|
| 12 |
+
`NoteQualityRubric` is the deterministic scorer; setting
|
| 13 |
+
`USE_LLM_NOTE_JUDGE=1` swaps in `LLMNoteJudgeRubric`, which falls back
|
| 14 |
+
to the deterministic path on any provider failure so these numbers also
|
| 15 |
+
hold with the flag set if no API key is configured.
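The fallback behaviour described above can be pictured as a simple provider chain. The sketch below is an assumption-laden illustration rather than the code in `evaluation/llm_note_judge.py`: the function name, the list-of-callables interface, and the range check are all hypothetical.

```python
from typing import Callable, Sequence


def score_note_with_fallback(
    note: str,
    providers: Sequence[Callable[[str], float]],
    deterministic_scorer: Callable[[str], float],
) -> float:
    """Try each LLM provider in order; use the deterministic scorer if none succeeds."""
    for call_provider in providers:
        try:
            score = float(call_provider(note))
        except Exception:
            # Any provider failure moves on to the next provider in the chain.
            continue
        if 0.0 <= score <= 1.0:
            return score
    # No configured provider, or every provider failed or returned an
    # out-of-range value: keep the reproducible offline path.
    return deterministic_scorer(note)
```

With an empty provider list, which is the case when no API key is configured, the function always returns the deterministic score, so offline numbers are unchanged.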
|
| 16 |
|
| 17 |
## TL;DR
|
| 18 |
|
| 19 |
+
| Policy | Headline avg (11 tasks) | Multi-seed avg (28 tasks) | Provider calls |
|
| 20 |
| --- | --- | --- | --- |
|
| 21 |
| **naive** (empty packet → submit) | **0.0000** | **0.0000** | 0 |
|
| 22 |
+
| **concede_all** (always `accept_chargeback`) | **0.4475** | **0.4454** | 0 |
|
| 23 |
+
| **escalate_all** (contest, then always escalate) | **0.7713** | **0.7532** | 0 |
|
| 24 |
+
| **heuristic** (EV-rational rule-based pick) | **0.8254** | **0.7628** | 0 |
|
| 25 |
+
|
| 26 |
+
**Discrimination delta** (heuristic − naive) is **0.8254** on the headline
|
| 27 |
+
catalog and **0.7628** on the multi-seed grid — well above the 0.40 target.
|
| 28 |
+
|
| 29 |
+
The heuristic now beats `escalate_all` by **+0.054** on the headline
|
| 30 |
+
catalog because `pre_arb_recovery_medium` deliberately spreads the two
|
| 31 |
+
policies apart: heuristic 0.965, escalate_all 0.613, concede_all 0.223.
|
| 32 |
+
On eight of the other ten tasks the merchant's round-1 packet is strong enough that
|
| 33 |
+
the pre-arb branch never fires and the two scripted policies produce
|
| 34 |
+
identical trajectories — that match on the other tasks is a signal, not
|
| 35 |
+
a bug. `concede_all` collapses to 0.45 because `EscalationROIRubric`
|
| 36 |
+
zeros out concedes on positive-EV contestable cases (`amount > $250`).
|
| 37 |
|
| 38 |
## Score Curve by Difficulty (multi-seed grid, 7 seeds / difficulty)
|
| 39 |
|
| 40 |
| Difficulty | n | heuristic | escalate_all | concede_all | naive |
|
| 41 |
| --- | --- | --- | --- | --- | --- |
|
| 42 |
+
| easy | 7 | 0.887 | 0.866 | 0.270 | 0.000 |
|
| 43 |
+
| medium | 7 | 0.869 | 0.869 | 0.630 | 0.000 |
|
| 44 |
+
| hard | 7 | 0.755 | 0.737 | 0.491 | 0.000 |
|
| 45 |
+
| nightmare | 7 | 0.540 | 0.540 | 0.390 | 0.000 |
|
| 46 |
|
| 47 |
Observations:
|
| 48 |
- Heuristic score decreases monotonically with difficulty
|
| 49 |
+
(0.89 → 0.87 → 0.76 → 0.54). The difficulty gradient is real.
|
| 50 |
+
- Heuristic edges out `escalate_all` on easy (+0.021) and hard (+0.018)
|
| 51 |
+
because the EV-rational policy catches the rare positive-EV pre-arb
|
| 52 |
+
branch where blanket escalation overspends $250 in arb fees.
|
| 53 |
+
- `concede_all` collapses on easy (0.270) — small-amount easy cases
|
| 54 |
+
are positive-EV contestable, so the EscalationROI rubric zeros out
|
| 55 |
+
concedes. The gap narrows at nightmare (0.540 vs 0.390) because the
|
| 56 |
+
15-step budget against a 5-case portfolio forces the heuristic to let
|
| 57 |
+
some cases miss their deadline, while conceding is cheap per case.
|
| 58 |
- `naive` sits flat at 0.000 because an empty packet fails the
|
| 59 |
packet-validity gate and every case is scored as unresolved /
|
| 60 |
abandoned.
|
| 61 |
|
| 62 |
+
## Headline Per-Task Table (11 tasks, offline)
|
| 63 |
|
| 64 |
| Task ID | Difficulty | heuristic | escalate_all | concede_all | naive |
|
| 65 |
| --- | --- | --- | --- | --- | --- |
|
| 66 |
+
| goods_not_received_easy | easy | 0.965 | 0.965 | 0.423 | 0.000 |
|
| 67 |
+
| fraud_signal_ambiguity | easy | 0.958 | 0.958 | 0.223 | 0.000 |
|
| 68 |
+
| pre_arb_recovery_medium | medium | 0.965 | 0.613 | 0.223 | 0.000 |
|
| 69 |
+
| queue_optimization_hard | hard | 0.926 | 0.926 | 0.554 | 0.000 |
|
| 70 |
+
| generated_easy_s42 | easy | 0.843 | 0.643 | 0.333 | 0.000 |
|
| 71 |
+
| generated_medium_s17 | medium | 0.856 | 0.856 | 0.542 | 0.000 |
|
| 72 |
+
| generated_medium_s99 | medium | 0.758 | 0.758 | 0.620 | 0.000 |
|
| 73 |
+
| generated_hard_s7 | hard | 0.904 | 0.861 | 0.615 | 0.000 |
|
| 74 |
+
| generated_hard_s53 | hard | 0.662 | 0.662 | 0.483 | 0.000 |
|
| 75 |
+
| generated_nightmare_s31 | nightmare | 0.536 | 0.536 | 0.424 | 0.000 |
|
| 76 |
+
| generated_nightmare_s77 | nightmare | 0.708 | 0.708 | 0.484 | 0.000 |
|
| 77 |
+
| **Average** | | **0.8254** | **0.7713** | **0.4475** | **0.0000** |
|
| 78 |
|
| 79 |
(Per-task numbers from `runners.benchmark_runner.run_policy_sweep()`.)
|
| 80 |
+
The three rows where heuristic > escalate_all (`pre_arb_recovery_medium`,
|
| 81 |
+
`generated_easy_s42`, `generated_hard_s7`) are the cases where the
|
| 82 |
+
issuer's round-1 rejection plus a negative-EV pre-arb branch would have
|
| 83 |
+
made blanket escalation strictly worse. On the other 8 rows the issuer
|
| 84 |
+
accepts in round 1 and the two policies produce identical trajectories.
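The averages and the list of diverging rows can be cross-checked directly from the per-task table. The snippet below copies the three-decimal scores from that table (the `naive` column is omitted because it is all zeros); the helper names are made up for this example. Because the inputs are already rounded, the recomputed means are only expected to agree with the reported four-decimal averages to within about 1e-3.

```python
# (heuristic, escalate_all, concede_all) per task, copied from the table above.
HEADLINE = {
    "goods_not_received_easy": (0.965, 0.965, 0.423),
    "fraud_signal_ambiguity": (0.958, 0.958, 0.223),
    "pre_arb_recovery_medium": (0.965, 0.613, 0.223),
    "queue_optimization_hard": (0.926, 0.926, 0.554),
    "generated_easy_s42": (0.843, 0.643, 0.333),
    "generated_medium_s17": (0.856, 0.856, 0.542),
    "generated_medium_s99": (0.758, 0.758, 0.620),
    "generated_hard_s7": (0.904, 0.861, 0.615),
    "generated_hard_s53": (0.662, 0.662, 0.483),
    "generated_nightmare_s31": (0.536, 0.536, 0.424),
    "generated_nightmare_s77": (0.708, 0.708, 0.484),
}


def column_means(rows: dict[str, tuple[float, float, float]]) -> tuple[float, ...]:
    """Mean of each policy column across all tasks."""
    n = len(rows)
    return tuple(sum(scores[i] for scores in rows.values()) / n for i in range(3))


def diverging_tasks(rows: dict[str, tuple[float, float, float]]) -> list[str]:
    """Tasks where the heuristic scores strictly higher than escalate_all."""
    return [task for task, (heur, esc, _) in rows.items() if heur > esc]
```

Calling `diverging_tasks(HEADLINE)` returns `pre_arb_recovery_medium`, `generated_easy_s42`, and `generated_hard_s7`, the same three rows named in the paragraph above.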
|
| 85 |
|
| 86 |
+
## Training Curve (GRPO, 200 steps) — placeholder
|
| 87 |
+
|
| 88 |
+
> ⚠️ **The numbers in this section are placeholders.** They are illustrative
|
| 89 |
+
> targets, not measured values. The real GRPO run is queued for a Colab T4
|
| 90 |
+
> session; until that lands, treat the figure and the table below as the
|
| 91 |
+
> shape we expect rather than what we observed. Regenerate both by running
|
| 92 |
+
> `notebooks/train_merchant_agent.ipynb` end-to-end and re-rendering this
|
| 93 |
+
> table from the printed checkpoint scores.
|
| 94 |
|
| 95 |

|
| 96 |
|
| 97 |
Baselines drawn as dashed lines: `heuristic`, `concede_all`, `naive`.
|
|
|
|
|
|
|
| 98 |
|
| 99 |
+
| Step | Mean score (headline) | Source |
|
| 100 |
+
| --- | --- | --- |
|
| 101 |
+
| 0 | _placeholder_ | untrained Qwen2.5-0.5B-Instruct |
|
| 102 |
+
| 50 | _placeholder_ | GRPO checkpoint |
|
| 103 |
+
| 100 | _placeholder_ | GRPO checkpoint |
|
| 104 |
+
| 150 | _placeholder_ | GRPO checkpoint |
|
| 105 |
+
| 200 | _placeholder_ | GRPO checkpoint |
|
| 106 |
|
| 107 |
## Ablation
|
| 108 |
|
| 109 |
+
| Agent | Mean score (headline 11) | Notes |
|
| 110 |
| --- | --- | --- |
|
| 111 |
+
| **naive** (empty packet → submit) | **0.0000** | PacketValidity gate + EscalationROI vacuous penalty collapse the score |
|
| 112 |
+
| **concede_all** (always accept) | **0.4475** | Cheap, but EscalationROIRubric (20%) zeros out concedes on positive-EV contestable cases |
|
| 113 |
+
| **escalate_all** (contest, then escalate) | **0.7713** | Strong on cases where the issuer eventually accepts; pays $250 of arb fee on the pre-arb branch |
|
| 114 |
+
| **untrained base model** | _placeholder_ | Curve step 0; not yet measured |
|
| 115 |
+
| **heuristic** (EV-rational scripted) | **0.8254** | Strong scripted floor — the bar GRPO has to clear |
|
| 116 |
+
| **trained merchant** (step 200) | _placeholder_ | Will overwrite after the Colab T4 run completes |
|
| 117 |
|
| 118 |
The ablation reads top-down: the benchmark gradient from naive → concede_all
|
| 119 |
+
→ escalate_all → heuristic is ~0.83 wide, which is the headroom the TRL
|
| 120 |
+
GRPO loop has to close. The two `_placeholder_` rows are honest holes — they will be
|
| 121 |
+
filled in once the notebook run produces real numbers. Until then, do
|
| 122 |
+
not cite them as evidence of training performance.
|
| 123 |
|
| 124 |
## Rubric Composition (what's wired)
|
| 125 |
|
|
|
|
| 186 |
- Python 3.12, pytest 8.x
|
| 187 |
- `openenv-core`, `pydantic`, `openai` per `pyproject.toml`
|
| 188 |
- No provider calls for the four scripted policies — all results fully offline
|
| 189 |
+
- Full test suite: **86/86 passing** (env, grader, issuer, arbitration, escalation_roi, llm_softening, llm_note_judge, training_curve)
|
| 190 |
|
| 191 |
## What This Table Does Not Show
|
| 192 |
|
docs/RUNNING_THE_AGENT.md
CHANGED
|
@@ -87,7 +87,7 @@ OPENAI_API_KEY=sk-...
|
|
| 87 |
|
| 88 |
```env
|
| 89 |
BASELINE_PROVIDER=google
|
| 90 |
-
BASELINE_MODEL=gemini-2.
|
| 91 |
GOOGLE_API_KEY=AI...
|
| 92 |
```
|
| 93 |
|
|
|
|
| 87 |
|
| 88 |
```env
|
| 89 |
BASELINE_PROVIDER=google
|
| 90 |
+
BASELINE_MODEL=gemini-2.5-flash
|
| 91 |
GOOGLE_API_KEY=AI...
|
| 92 |
```
|
| 93 |
|
evaluation/llm_note_judge.py
ADDED
|
@@ -0,0 +1,187 @@
|
|
| 1 |
+
"""Optional LLM-backed note grader that wraps :class:`NoteQualityRubric`.
|
| 2 |
+
|
| 3 |
+
The deterministic ``grade_representment_note`` checks keyword coverage,
|
| 4 |
+
substance, evidence references, and harmful term penalties. That heuristic
|
| 5 |
+
is reproducible and fast, but it can't tell whether a note is genuinely
|
| 6 |
+
persuasive — only whether it hits the right tokens.
|
| 7 |
+
|
| 8 |
+
This module exposes :class:`LLMNoteJudgeRubric`, an opt-in wrapper that
|
| 9 |
+
asks an LLM to score the note on a 0.0-1.0 scale. It mirrors the provider
|
| 10 |
+
chain pattern in :mod:`scenarios.llm_softening`: try OpenRouter, then
|
| 11 |
+
Google, then Groq; on any failure or with no API key, fall back to the
|
| 12 |
+
deterministic scorer so offline benchmarks stay reproducible.
|
| 13 |
+
|
| 14 |
+
Wire it in by setting ``USE_LLM_NOTE_JUDGE=1`` before constructing
|
| 15 |
+
:class:`CaseRubric`. The wrapper is intentionally thin — it does not
|
| 16 |
+
override any other dimension and does not change the rubric tree shape;
|
| 17 |
+
``case_rubric.aggregator.rubric_6`` simply becomes a different ``Rubric``
|
| 18 |
+
subclass with the same forward signature.
|
| 19 |
+
"""
|
| 20 |
+
|
| 21 |
+
from __future__ import annotations
|
| 22 |
+
|
| 23 |
+
import json
|
| 24 |
+
import os
|
| 25 |
+
from typing import Any
|
| 26 |
+
|
| 27 |
+
from openenv.core.rubrics import Rubric
|
| 28 |
+
|
| 29 |
+
try:
|
| 30 |
+
from ..scenarios.simulation import CaseProgress, InternalCase
|
| 31 |
+
from .rubrics import GradingContext, _final_resolution, grade_representment_note
|
| 32 |
+
except ImportError: # pragma: no cover
|
| 33 |
+
from evaluation.rubrics import (
|
| 34 |
+
GradingContext,
|
| 35 |
+
_final_resolution,
|
| 36 |
+
grade_representment_note,
|
| 37 |
+
)
|
| 38 |
+
from scenarios.simulation import CaseProgress, InternalCase
|
| 39 |
+
|
| 40 |
+
|
| 41 |
+
_PROVIDER_CHAIN: tuple[tuple[str, str, str, str], ...] = (
|
| 42 |
+
(
|
| 43 |
+
"openrouter",
|
| 44 |
+
"https://openrouter.ai/api/v1",
|
| 45 |
+
"OPENROUTER_API_KEY",
|
| 46 |
+
"openai/gpt-oss-120b",
|
| 47 |
+
),
|
| 48 |
+
(
|
| 49 |
+
"google",
|
| 50 |
+
"https://generativelanguage.googleapis.com/v1beta/openai/",
|
| 51 |
+
"GOOGLE_API_KEY",
|
| 52 |
+
"gemini-2.5-flash",
|
| 53 |
+
),
|
| 54 |
+
(
|
| 55 |
+
"groq",
|
| 56 |
+
"https://api.groq.com/openai/v1",
|
| 57 |
+
"GROQ_API_KEY",
|
| 58 |
+
"llama-3.3-70b-versatile",
|
| 59 |
+
),
|
| 60 |
+
)
|
| 61 |
+
|
| 62 |
+
|
| 63 |
+
_SYSTEM_PROMPT = (
|
| 64 |
+
"You role-play as a card-network arbitration reviewer. A merchant has "
|
| 65 |
+
"submitted a representment note alongside their evidence packet. Score the "
|
| 66 |
+
"note's persuasiveness on a 0.0-1.0 scale, where 1.0 means the note "
|
| 67 |
+
"clearly addresses the policy requirements, references the attached "
|
| 68 |
+
"evidence, and avoids harmful admissions, and 0.0 means it is empty, "
|
| 69 |
+
"off-topic, or actively damages the merchant's case. "
|
| 70 |
+
'Return JSON only: {"score": <float>, "rationale": "one short sentence"}.'
|
| 71 |
+
)
|
| 72 |
+
|
| 73 |
+
|
| 74 |
+
def _build_user_prompt(case: InternalCase, progress: CaseProgress) -> str:
|
| 75 |
+
return json.dumps(
|
| 76 |
+
{
|
| 77 |
+
"reason_code": case.reason_code,
|
| 78 |
+
"policy_requirements": case.policy_requirements,
|
| 79 |
+
"attached_evidence_ids": list(progress.attached_evidence_ids),
|
| 80 |
+
"harmful_evidence_ids": list(case.harmful_evidence_ids),
|
| 81 |
+
"representment_note": progress.representment_note or "",
|
| 82 |
+
}
|
| 83 |
+
)
|
| 84 |
+
|
| 85 |
+
|
| 86 |
+
def _parse_score(text: str) -> float | None:
|
| 87 |
+
try:
|
| 88 |
+
data = json.loads(text)
|
| 89 |
+
except (json.JSONDecodeError, TypeError):
|
| 90 |
+
return None
|
| 91 |
+
raw = data.get("score") if isinstance(data, dict) else None
|
| 92 |
+
try:
|
| 93 |
+
score = float(raw)
|
| 94 |
+
except (TypeError, ValueError):
|
| 95 |
+
return None
|
| 96 |
+
return max(0.0, min(1.0, score))
|
| 97 |
+
|
| 98 |
+
|
| 99 |
+
def _try_provider(
|
| 100 |
+
base_url: str,
|
| 101 |
+
api_key: str,
|
| 102 |
+
model: str,
|
| 103 |
+
case: InternalCase,
|
| 104 |
+
progress: CaseProgress,
|
| 105 |
+
) -> float | None:
|
| 106 |
+
try:
|
| 107 |
+
from openai import OpenAI
|
| 108 |
+
except ImportError: # pragma: no cover
|
| 109 |
+
return None
|
| 110 |
+
|
| 111 |
+
try:
|
| 112 |
+
client = OpenAI(
|
| 113 |
+
api_key=api_key,
|
| 114 |
+
base_url=base_url,
|
| 115 |
+
timeout=float(os.getenv("NOTE_JUDGE_TIMEOUT_SECONDS", "8")),
|
| 116 |
+
max_retries=0,
|
| 117 |
+
)
|
| 118 |
+
response = client.chat.completions.create(
|
| 119 |
+
model=model,
|
| 120 |
+
temperature=0,
|
| 121 |
+
max_tokens=120,
|
| 122 |
+
response_format={"type": "json_object"},
|
| 123 |
+
messages=[
|
| 124 |
+
{"role": "system", "content": _SYSTEM_PROMPT},
|
| 125 |
+
{"role": "user", "content": _build_user_prompt(case, progress)},
|
| 126 |
+
],
|
| 127 |
+
)
|
| 128 |
+
except Exception:
|
| 129 |
+
return None
|
| 130 |
+
|
| 131 |
+
try:
|
| 132 |
+
content = response.choices[0].message.content or ""
|
| 133 |
+
except (AttributeError, IndexError):
|
| 134 |
+
return None
|
| 135 |
+
return _parse_score(content)
|
| 136 |
+
|
| 137 |
+
|
| 138 |
+
def llm_score_note(case: InternalCase, progress: CaseProgress) -> float | None:
|
| 139 |
+
"""Walk the provider chain. Return None if nothing succeeded."""
|
| 140 |
+
|
| 141 |
+
for _name, base_url, env_var, default_model in _PROVIDER_CHAIN:
|
| 142 |
+
api_key = os.getenv(env_var)
|
| 143 |
+
if not api_key:
|
| 144 |
+
continue
|
| 145 |
+
model = os.getenv("NOTE_JUDGE_MODEL", default_model)
|
| 146 |
+
score = _try_provider(
|
| 147 |
+
base_url=base_url,
|
| 148 |
+
api_key=api_key,
|
| 149 |
+
model=model,
|
| 150 |
+
case=case,
|
| 151 |
+
progress=progress,
|
| 152 |
+
)
|
| 153 |
+
if score is not None:
|
| 154 |
+
return score
|
| 155 |
+
return None
|
| 156 |
+
|
| 157 |
+
|
| 158 |
+
class LLMNoteJudgeRubric(Rubric):
|
| 159 |
+
"""Drop-in replacement for :class:`NoteQualityRubric` that asks an LLM.
|
| 160 |
+
|
| 161 |
+
Falls back to :func:`grade_representment_note` whenever no provider key
|
| 162 |
+
is configured, every provider errors, or the response cannot be parsed.
|
| 163 |
+
The fallback path is what the deterministic baseline benchmark uses, so
|
| 164 |
+
offline runs match the no-LLM scores byte-for-byte.
|
| 165 |
+
"""
|
| 166 |
+
|
| 167 |
+
def forward(self, action: Any, observation: Any) -> float:
|
| 168 |
+
ctx: GradingContext = action
|
| 169 |
+
progress = ctx.progress
|
| 170 |
+
if (
|
| 171 |
+
_final_resolution(progress) != "contest"
|
| 172 |
+
or not progress.representment_note
|
| 173 |
+
):
|
| 174 |
+
return 0.0
|
| 175 |
+
|
| 176 |
+
llm_score = llm_score_note(ctx.case, progress)
|
| 177 |
+
if llm_score is not None:
|
| 178 |
+
return llm_score
|
| 179 |
+
|
| 180 |
+
return grade_representment_note(
|
| 181 |
+
progress.representment_note,
|
| 182 |
+
ctx.case,
|
| 183 |
+
set(progress.attached_evidence_ids),
|
| 184 |
+
)
|
| 185 |
+
|
| 186 |
+
|
| 187 |
+
__all__ = ["LLMNoteJudgeRubric", "llm_score_note"]
|
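The parse-then-fall-back flow in `evaluation/llm_note_judge.py` can be sketched in isolation. The snippet below is a minimal illustration, not the module's code: `parse_score` and `score_with_fallback` are hypothetical names, providers are modelled as plain callables that return raw response text, and the extra guards for non-object JSON and NaN are assumptions added for this sketch.

```python
import json
import math
from typing import Callable, Iterable, Optional


def parse_score(text: str) -> Optional[float]:
    """Parse '{"score": <float>}' and clamp to [0.0, 1.0]; None on any problem."""
    try:
        data = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(data, dict):  # e.g. the model returned a bare list
        return None
    try:
        score = float(data.get("score"))
    except (TypeError, ValueError):
        return None
    if math.isnan(score):  # NaN cannot be clamped meaningfully
        return None
    return max(0.0, min(1.0, score))


def score_with_fallback(
    providers: Iterable[Callable[[], str]],
    deterministic: Callable[[], float],
) -> float:
    """Try each provider in order; use the deterministic scorer if all fail."""
    for call in providers:
        try:
            parsed = parse_score(call())
        except Exception:  # network errors, timeouts, bad payloads
            parsed = None
        if parsed is not None:
            return parsed
    return deterministic()
```

With no providers configured the loop body never runs, so the result is exactly the deterministic score, which is the property the offline benchmarks rely on.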
evaluation/rubrics.py
CHANGED
|
@@ -11,10 +11,17 @@ as the ``action`` argument of :meth:`Rubric.forward`. The ``observation``
|
|
| 11 |
argument is ignored — ChargebackOps grading operates over deterministic
|
| 12 |
episode progress, not on the last observation payload. This keeps the rubrics
|
| 13 |
pure and unit-testable without an environment instance.
|
| 14 |
"""
|
| 15 |
|
| 16 |
from __future__ import annotations
|
| 17 |
|
|
|
|
| 18 |
from dataclasses import dataclass
|
| 19 |
from typing import Any
|
| 20 |
|
|
@@ -293,6 +300,15 @@ class EscalationROIRubric(Rubric):
|
|
| 293 |
progress = ctx.progress
|
| 294 |
|
| 295 |
if progress.round_number < 2 and progress.arbitration_outcome is None:
|
| 296 |
return 1.0
|
| 297 |
|
| 298 |
score = evidence_strength_score(case, progress)
|
|
@@ -415,6 +431,23 @@ CASE_DIMENSION_WEIGHTS: tuple[float, ...] = (
|
|
| 415 |
0.05,
|
| 416 |
0.20,
|
| 417 |
)
|
| 418 |
CASE_DIMENSION_NAMES: tuple[str, ...] = (
|
| 419 |
"strategy_correctness",
|
| 420 |
"evidence_quality",
|
|
@@ -441,8 +474,10 @@ class CaseRubric(Rubric):
|
|
| 441 |
:meth:`named_rubrics`.
|
| 442 |
"""
|
| 443 |
|
| 444 |
-
def __init__(self) -> None:
|
| 445 |
super().__init__()
|
|
|
|
|
|
|
| 446 |
self.aggregator = WeightedSum(
|
| 447 |
rubrics=[
|
| 448 |
StrategyCorrectnessRubric(),
|
|
@@ -451,7 +486,7 @@ class CaseRubric(Rubric):
|
|
| 451 |
DeadlineComplianceRubric(),
|
| 452 |
EfficiencyRubric(),
|
| 453 |
OutcomeQualityRubric(),
|
| 454 |
-
|
| 455 |
EscalationROIRubric(),
|
| 456 |
],
|
| 457 |
weights=list(CASE_DIMENSION_WEIGHTS),
|
|
|
|
| 11 |
argument is ignored — ChargebackOps grading operates over deterministic
|
| 12 |
episode progress, not on the last observation payload. This keeps the rubrics
|
| 13 |
pure and unit-testable without an environment instance.
|
| 14 |
+
|
| 15 |
+
Set ``USE_LLM_NOTE_JUDGE=1`` to swap the deterministic
|
| 16 |
+
:class:`NoteQualityRubric` for the LLM-backed
|
| 17 |
+
:class:`evaluation.llm_note_judge.LLMNoteJudgeRubric` when constructing
|
| 18 |
+
:class:`CaseRubric`. The LLM rubric falls back to the deterministic scorer
|
| 19 |
+
on any failure, so offline benchmarks remain reproducible without API keys.
|
| 20 |
"""
|
| 21 |
|
| 22 |
from __future__ import annotations
|
| 23 |
|
| 24 |
+
import os
|
| 25 |
from dataclasses import dataclass
|
| 26 |
from typing import Any
|
| 27 |
|
|
|
|
| 300 |
progress = ctx.progress
|
| 301 |
|
| 302 |
if progress.round_number < 2 and progress.arbitration_outcome is None:
|
| 303 |
+
# Vacuous credit only when the case was never contestable.
|
| 304 |
+
# Conceding a contestable case before reaching the issuer review
|
| 305 |
+
# is a forfeit on EV grounds, not a smart decision — penalise it.
|
| 306 |
+
if case.optimal_strategy == "contest":
|
| 307 |
+
expected_contest_recovery = case.amount  # assumes P(win) = 1.0 at full evidence
|
| 308 |
+
if expected_contest_recovery > ARB_FEE_PER_SIDE:
|
| 309 |
+
final = _final_resolution(progress)
|
| 310 |
+
if final in {"accept_chargeback", "issue_refund"}:
|
| 311 |
+
return 0.0
|
| 312 |
return 1.0
|
| 313 |
|
| 314 |
score = evidence_strength_score(case, progress)
|
|
|
|
| 431 |
0.05,
|
| 432 |
0.20,
|
| 433 |
)
|
| 434 |
+
def _resolve_default_note_rubric() -> Rubric:
|
| 435 |
+
"""Return the LLM-backed note judge if opted in, else the deterministic one.
|
| 436 |
+
|
| 437 |
+
Reads ``USE_LLM_NOTE_JUDGE`` lazily so importing this module never triggers
|
| 438 |
+
a provider import. The LLM rubric internally falls back to
|
| 439 |
+
:class:`NoteQualityRubric` when no provider key is set.
|
| 440 |
+
"""
|
| 441 |
+
|
| 442 |
+
if os.getenv("USE_LLM_NOTE_JUDGE", "").lower() in {"1", "true", "yes"}:
|
| 443 |
+
try: # pragma: no cover - import-time guard
|
| 444 |
+
from .llm_note_judge import LLMNoteJudgeRubric
|
| 445 |
+
except ImportError:
|
| 446 |
+
from evaluation.llm_note_judge import LLMNoteJudgeRubric
|
| 447 |
+
return LLMNoteJudgeRubric()
|
| 448 |
+
return NoteQualityRubric()
|
| 449 |
+
|
| 450 |
+
|
| 451 |
CASE_DIMENSION_NAMES: tuple[str, ...] = (
|
| 452 |
"strategy_correctness",
|
| 453 |
"evidence_quality",
|
|
|
|
| 474 |
:meth:`named_rubrics`.
|
| 475 |
"""
|
| 476 |
|
| 477 |
+
def __init__(self, *, note_rubric: Rubric | None = None) -> None:
|
| 478 |
super().__init__()
|
| 479 |
+
if note_rubric is None:
|
| 480 |
+
note_rubric = _resolve_default_note_rubric()
|
| 481 |
self.aggregator = WeightedSum(
|
| 482 |
rubrics=[
|
| 483 |
StrategyCorrectnessRubric(),
|
|
|
|
| 486 |
DeadlineComplianceRubric(),
|
| 487 |
EfficiencyRubric(),
|
| 488 |
OutcomeQualityRubric(),
|
| 489 |
+
note_rubric,
|
| 490 |
EscalationROIRubric(),
|
| 491 |
],
|
| 492 |
weights=list(CASE_DIMENSION_WEIGHTS),
|
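The tightened early-exit branch of `EscalationROIRubric` shown in the hunk above can be read as a small pure function. This is a sketch under stated assumptions rather than the rubric itself: `early_roi_score` is a hypothetical name, the arguments stand in for fields on `case` and `progress`, and `ARB_FEE_PER_SIDE` is taken to be the $250 fee quoted in the commit message.

```python
from typing import Optional

ARB_FEE_PER_SIDE = 250.0  # assumption: the $250 fee from the commit message
CONCESSIONS = {"accept_chargeback", "issue_refund"}


def early_roi_score(
    round_number: int,
    arbitration_outcome: Optional[str],
    optimal_strategy: str,
    amount: float,
    final_resolution: str,
) -> Optional[float]:
    """Score cases that never reached round 2; None means 'not this branch'."""
    if round_number >= 2 or arbitration_outcome is not None:
        return None  # the rubric continues with evidence-strength scoring
    # Contestable and positive-EV: full amount (P(win) taken as 1.0) beats the fee.
    contestable = optimal_strategy == "contest" and amount > ARB_FEE_PER_SIDE
    if contestable and final_resolution in CONCESSIONS:
        return 0.0  # conceding a winnable case is a forfeit
    return 1.0  # vacuous credit: nothing to escalate


# A $700 contestable case conceded in round 1 loses the ROI credit,
# while a $200 case conceded in round 1 keeps it.
print(early_roi_score(1, None, "contest", 700.0, "accept_chargeback"))
print(early_roi_score(1, None, "contest", 200.0, "accept_chargeback"))
```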
runners/baseline_runner.py
CHANGED
|
@@ -360,6 +360,104 @@ def candidate_actions(observation: dict[str, Any]) -> list[CandidateAction]:
|
|
| 360 |
)
|
| 361 |
return candidates
|
| 362 |
|
| 363 |
current_deadline = _visible_case_deadline(queue, case_id)
|
| 364 |
best_other = _best_open_case(
|
| 365 |
[case for case in open_cases if case["case_id"] != case_id]
|
|
|
|
| 360 |
)
|
| 361 |
return candidates
|
| 362 |
|
| 363 |
+
# Round 2 (pre-arbitration). Issuer rejected the round-1 packet and is
|
| 364 |
+
# asking for compelling evidence. Three legal moves: respond_to_pre_arb,
|
| 365 |
+
# escalate_to_arbitration, accept_arbitration_loss.
|
| 366 |
+
available = set(observation.get("available_actions", []))
|
| 367 |
+
if "respond_to_pre_arb" in available:
|
| 368 |
+
retrieved_items_r2 = visible_case.get("retrieved_evidence", [])
|
| 369 |
+
attached_ids_r2 = {
|
| 370 |
+
item["evidence_id"] for item in visible_case.get("attached_evidence", [])
|
| 371 |
+
}
|
| 372 |
+
compelling_ids = [
|
| 373 |
+
item["evidence_id"]
|
| 374 |
+
for item in retrieved_items_r2
|
| 375 |
+
if item["evidence_id"] not in attached_ids_r2
|
| 376 |
+
and not _is_harmful_evidence(item)
|
| 377 |
+
]
|
| 378 |
+
compelling_ids = sorted(
|
| 379 |
+
compelling_ids,
|
| 380 |
+
key=lambda eid: _rank_attachable(
|
| 381 |
+
next(
|
| 382 |
+
item
|
| 383 |
+
for item in retrieved_items_r2
|
| 384 |
+
if item["evidence_id"] == eid
|
| 385 |
+
)
|
| 386 |
+
),
|
| 387 |
+
)[:2]
|
| 388 |
+
if compelling_ids:
|
| 389 |
+
candidates.append(
|
| 390 |
+
CandidateAction(
|
| 391 |
+
action=ChargebackOpsAction(
|
| 392 |
+
action_type="respond_to_pre_arb",
|
| 393 |
+
case_id=case_id,
|
| 394 |
+
compelling_evidence_ids=compelling_ids,
|
| 395 |
+
note=_build_representment_note(visible_case),
|
| 396 |
+
),
|
| 397 |
+
summary=(
|
| 398 |
+
f"Respond to pre-arbitration with compelling evidence "
|
| 399 |
+
f"{', '.join(compelling_ids)} for case {case_id}."
|
| 400 |
+
),
|
| 401 |
+
)
|
| 402 |
+
)
|
| 403 |
+
return candidates
|
| 404 |
+
# No retrieved compelling evidence left. Try querying an unrevealed
|
| 405 |
+
# merchant system before giving up — round-2 budget often allows it
|
| 406 |
+
# and one extra +0.15 pre_arb piece can clear the 0.60 acceptance bar.
|
| 407 |
+
# Order matters: support/risk/refunds tend to hold compelling pieces;
|
| 408 |
+
# payment is mostly auth records and harmful AVS/CVV mismatches.
|
| 409 |
+
revealed = set(visible_case.get("systems_revealed", []))
|
| 410 |
+
all_systems = ("support", "risk", "refunds", "shipping", "orders", "payment")
|
| 411 |
+
unrevealed = [s for s in all_systems if s not in revealed]
|
| 412 |
+
if unrevealed and "query_system" in available:
|
| 413 |
+
candidates.append(
|
| 414 |
+
CandidateAction(
|
| 415 |
+
action=ChargebackOpsAction(
|
| 416 |
+
action_type="query_system",
|
| 417 |
+
case_id=case_id,
|
| 418 |
+
system_name=unrevealed[0],
|
| 419 |
+
),
|
| 420 |
+
summary=(
|
| 421 |
+
f"Query {unrevealed[0]} for compelling evidence "
|
| 422 |
+
f"on case {case_id} before deciding to escalate."
|
| 423 |
+
),
|
| 424 |
+
)
|
| 425 |
+
)
|
| 426 |
+
return candidates
|
| 427 |
+
# No compelling evidence anywhere. Decide on ROI: arbitration costs
|
| 428 |
+
# $250/side. Use the EV rule: escalate iff p_win * amount > arb_fee.
|
| 429 |
+
# Round-2 arbitration score is typically in the ambiguity band
|
| 430 |
+
(P~0.5), so escalate when amount >= 2 * 250 = 500 (break-even included).
|
| 431 |
+
amount = float(visible_case.get("amount", 0.0))
|
| 432 |
+
if amount >= 500.0 and "escalate_to_arbitration" in available:
|
| 433 |
+
candidates.append(
|
| 434 |
+
CandidateAction(
|
| 435 |
+
action=ChargebackOpsAction(
|
| 436 |
+
action_type="escalate_to_arbitration",
|
| 437 |
+
case_id=case_id,
|
| 438 |
+
),
|
| 439 |
+
summary=(
|
| 440 |
+
f"Escalate case {case_id} to arbitration "
|
| 441 |
+
f"(amount ${amount:.0f} clears the EV break-even)."
|
| 442 |
+
),
|
| 443 |
+
)
|
| 444 |
+
)
|
| 445 |
+
return candidates
|
| 446 |
+
if "accept_arbitration_loss" in available:
|
| 447 |
+
candidates.append(
|
| 448 |
+
CandidateAction(
|
| 449 |
+
action=ChargebackOpsAction(
|
| 450 |
+
action_type="accept_arbitration_loss",
|
| 451 |
+
case_id=case_id,
|
| 452 |
+
),
|
| 453 |
+
summary=(
|
| 454 |
+
f"Accept arbitration loss on case {case_id} — no "
|
| 455 |
+
f"compelling evidence and amount below ROI cutoff."
|
| 456 |
+
),
|
| 457 |
+
)
|
| 458 |
+
)
|
| 459 |
+
return candidates
|
| 460 |
+
|
| 461 |
current_deadline = _visible_case_deadline(queue, case_id)
|
| 462 |
best_other = _best_open_case(
|
| 463 |
[case for case in open_cases if case["case_id"] != case_id]
|
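The EV rule in the round-2 heuristic above reduces to one comparison. The sketch below is illustrative only: `should_escalate` and `break_even_amount` are hypothetical helpers, and the defaults `p_win=0.5` and `arb_fee=250.0` are taken from the comments in the hunk. It treats break-even as "escalate" to mirror the `amount >= 500.0` check.

```python
def break_even_amount(p_win: float = 0.5, arb_fee: float = 250.0) -> float:
    """Dispute amount at which expected recovery equals the arbitration fee."""
    return arb_fee / p_win


def should_escalate(amount: float, p_win: float = 0.5, arb_fee: float = 250.0) -> bool:
    """Escalate when expected recovery p_win * amount covers the fee."""
    return p_win * amount >= arb_fee


# Worked numbers at P(win) = 0.5 and a $250 fee:
#   $700 dispute -> expected recovery $350, escalate
#   $400 dispute -> expected recovery $200, accept the loss
for amount in (700.0, 400.0):
    print(amount, should_escalate(amount))
```

Passing a different `p_win` shows how sensitive the cutoff is: at `p_win=0.25` the break-even amount doubles to $1000.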
runners/benchmark_runner.py
CHANGED
|
@@ -7,7 +7,7 @@ and offline.
|
|
| 7 |
|
| 8 |
Policies
|
| 9 |
--------
|
| 10 |
-
* ``heuristic`` — the
|
| 11 |
* ``concede_all`` — always set strategy to ``accept_chargeback`` and resolve.
|
| 12 |
* ``escalate_all`` — contest like the heuristic, then escalate in the
|
| 13 |
pre-arb and arbitration steps regardless of evidence strength.
|
|
@@ -15,7 +15,7 @@ Policies
|
|
| 15 |
|
| 16 |
The runner also exposes :func:`run_multi_seed` which sweeps each policy
|
| 17 |
over the headline catalog plus extra generator seeds so the benchmark
|
| 18 |
-
table in ``docs/
|
| 19 |
"""
|
| 20 |
|
| 21 |
from __future__ import annotations
|
|
|
|
| 7 |
|
| 8 |
Policies
|
| 9 |
--------
|
| 10 |
+
* ``heuristic`` — the first-candidate pick from the candidate generator (best scripted baseline).
|
| 11 |
* ``concede_all`` — always set strategy to ``accept_chargeback`` and resolve.
|
| 12 |
* ``escalate_all`` — contest like the heuristic, then escalate in the
|
| 13 |
pre-arb and arbitration steps regardless of evidence strength.
|
|
|
|
| 15 |
|
| 16 |
The runner also exposes :func:`run_multi_seed` which sweeps each policy
|
| 17 |
over the headline catalog plus extra generator seeds so the benchmark
|
| 18 |
+
table in ``docs/RESULTS.md`` is reproducible from one command.
|
| 19 |
"""
|
| 20 |
|
| 21 |
from __future__ import annotations
|
scenarios/case_generator.py
CHANGED
|
@@ -1026,7 +1026,7 @@ def _generate_evidence(
|
|
| 1026 |
|
| 1027 |
if bp.required:
|
| 1028 |
required_ids.append(eid)
|
| 1029 |
-
|
| 1030 |
helpful_ids.append(eid)
|
| 1031 |
if bp.harmful:
|
| 1032 |
harmful_ids.append(eid)
|
|
@@ -1043,6 +1043,7 @@ def generate_case(
|
|
| 1043 |
case_index: int,
|
| 1044 |
*,
|
| 1045 |
deadline_step: int = 8,
|
|
|
|
| 1046 |
) -> InternalCase:
|
| 1047 |
"""Generate a single case from a template."""
|
| 1048 |
|
|
@@ -1096,6 +1097,7 @@ def generate_case(
|
|
| 1096 |
network_reason_code=network_code,
|
| 1097 |
response_window_days=window_days,
|
| 1098 |
compelling_evidence_category=ce_category,
|
|
|
|
| 1099 |
)
|
| 1100 |
|
| 1101 |
|
|
@@ -1138,8 +1140,8 @@ def generate_task(
|
|
| 1138 |
max_steps = {
|
| 1139 |
"easy": 10,
|
| 1140 |
"medium": 12,
|
| 1141 |
-
"hard": max(
|
| 1142 |
-
"nightmare": max(
|
| 1143 |
}[difficulty]
|
| 1144 |
|
| 1145 |
# Build the case list
|
|
@@ -1175,17 +1177,31 @@ def generate_task(
|
|
| 1175 |
|
| 1176 |
used_templates.append(template)
|
| 1177 |
|
| 1178 |
-
# Deadline tightens with difficulty
|
|
|
|
| 1179 |
base_deadline = {
|
| 1180 |
"easy": 8,
|
| 1181 |
"medium": 7,
|
| 1182 |
-
"hard": max(
|
| 1183 |
-
"nightmare": max(
|
| 1184 |
}[difficulty]
|
| 1185 |
deadline = base_deadline + rng.randint(-1, 1)
|
| 1186 |
deadline = max(3, min(deadline, max_steps - 1))
|
| 1187 |
|
| 1188 |
-
|
| 1189 |
cases.append(case)
|
| 1190 |
|
| 1191 |
# Build task metadata
|
|
|
|
| 1026 |
|
| 1027 |
if bp.required:
|
| 1028 |
required_ids.append(eid)
|
| 1029 |
+
elif bp.helpful:
|
| 1030 |
helpful_ids.append(eid)
|
| 1031 |
if bp.harmful:
|
| 1032 |
harmful_ids.append(eid)
|
|
|
|
| 1043 |
case_index: int,
|
| 1044 |
*,
|
| 1045 |
deadline_step: int = 8,
|
| 1046 |
+
dispute_complexity: float = 1.0,
|
| 1047 |
) -> InternalCase:
|
| 1048 |
"""Generate a single case from a template."""
|
| 1049 |
|
|
|
|
| 1097 |
network_reason_code=network_code,
|
| 1098 |
response_window_days=window_days,
|
| 1099 |
compelling_evidence_category=ce_category,
|
| 1100 |
+
dispute_complexity=dispute_complexity,
|
| 1101 |
)
|
| 1102 |
|
| 1103 |
|
|
|
|
| 1140 |
max_steps = {
|
| 1141 |
"easy": 10,
|
| 1142 |
"medium": 12,
|
| 1143 |
+
"hard": max(16, case_count * 6), # +1 step per case for round-2 work
|
| 1144 |
+
"nightmare": max(18, case_count * 4), # round-2 path needs breathing room
|
| 1145 |
}[difficulty]
|
| 1146 |
|
| 1147 |
# Build the case list
|
|
|
|
| 1177 |
|
| 1178 |
used_templates.append(template)
|
| 1179 |
|
| 1180 |
+
# Deadline tightens with difficulty. Hard/nightmare leave room for
|
| 1181 |
+
# the round-2 pre-arb response so the multi-round path is reachable.
|
| 1182 |
base_deadline = {
|
| 1183 |
"easy": 8,
|
| 1184 |
"medium": 7,
|
| 1185 |
+
"hard": max(8, 12 - i),
|
| 1186 |
+
"nightmare": max(6, 10 - i),
|
| 1187 |
}[difficulty]
|
| 1188 |
deadline = base_deadline + rng.randint(-1, 1)
|
| 1189 |
deadline = max(3, min(deadline, max_steps - 1))
|
| 1190 |
|
| 1191 |
+
complexity = {
|
| 1192 |
+
"easy": 1.0,
|
| 1193 |
+
"medium": 0.90,
|
| 1194 |
+
"hard": 0.60,
|
| 1195 |
+
"nightmare": 0.50,
|
| 1196 |
+
}[difficulty]
|
| 1197 |
+
|
| 1198 |
+
case = generate_case(
|
| 1199 |
+
rng,
|
| 1200 |
+
template,
|
| 1201 |
+
i + 1,
|
| 1202 |
+
deadline_step=deadline,
|
| 1203 |
+
dispute_complexity=complexity,
|
| 1204 |
+
)
|
| 1205 |
cases.append(case)
|
| 1206 |
|
| 1207 |
# Build task metadata
|
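The step-budget and deadline arithmetic in `generate_task` can be checked by hand with a small sketch. The function names below are hypothetical, and the random jitter that the generator draws with `rng.randint(-1, 1)` is passed in explicitly so the result is deterministic; the constants are copied from the hunk above.

```python
def max_steps_for(difficulty: str, case_count: int) -> int:
    """Episode step budget per difficulty tier."""
    return {
        "easy": 10,
        "medium": 12,
        "hard": max(16, case_count * 6),
        "nightmare": max(18, case_count * 4),
    }[difficulty]


def deadline_for(difficulty: str, case_index: int, max_steps: int, jitter: int) -> int:
    """Per-case deadline: base value plus jitter, clamped to [3, max_steps - 1]."""
    base = {
        "easy": 8,
        "medium": 7,
        "hard": max(8, 12 - case_index),
        "nightmare": max(6, 10 - case_index),
    }[difficulty]
    return max(3, min(base + jitter, max_steps - 1))


# Three hard cases: budget 18, and the first case's deadline is 12 + jitter.
budget = max_steps_for("hard", 3)
print(budget, [deadline_for("hard", i, budget, 0) for i in range(3)])
```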
scenarios/issuer_model.py
CHANGED
|
@@ -1,10 +1,10 @@
|
|
| 1 |
"""Scripted Issuer agent for ChargebackOps multi-round dispute lifecycle.
|
| 2 |
|
| 3 |
The Issuer reviews a merchant's representment packet and decides whether to
|
| 4 |
-
accept it, request more evidence (triggering pre-arbitration
|
| 5 |
-
|
| 6 |
-
benchmarks must be reproducible — with optional LLM softening
|
| 7 |
-
|
| 8 |
|
| 9 |
Decision rule:
|
| 10 |
|
|
@@ -103,6 +103,12 @@ def evidence_strength_score(case: InternalCase, progress: CaseProgress) -> float
|
|
| 103 |
if hits >= 2:
|
| 104 |
score += 0.1
|
| 105 |
|
| 106 |
return max(0.0, min(1.0, score))
|
| 107 |
|
| 108 |
|
|
|
|
| 1 |
"""Scripted Issuer agent for ChargebackOps multi-round dispute lifecycle.
|
| 2 |
|
| 3 |
The Issuer reviews a merchant's representment packet and decides whether to
|
| 4 |
+
accept it, request more evidence (triggering pre-arbitration), or escalate
|
| 5 |
+
to network arbitration. The decision is **deterministic** by default —
|
| 6 |
+
benchmarks must be reproducible — with optional LLM softening for the
|
| 7 |
+
ambiguity band when an API key is present.
|
| 8 |
|
| 9 |
Decision rule:
|
| 10 |
|
|
|
|
| 103 |
if hits >= 2:
|
| 104 |
score += 0.1
|
| 105 |
|
| 106 |
+
# Pre-arbitration compelling-evidence bonus: +0.15 per unique id added in
|
| 107 |
+
# round 2, capped at +0.30. Pulls a borderline packet across the 0.60
|
| 108 |
+
# round-2 acceptance bar without trivially clearing it.
|
| 109 |
+
pre_arb_unique = len(set(progress.pre_arb_evidence_added))
|
| 110 |
+
score += min(0.30, 0.15 * pre_arb_unique)
|
| 111 |
+
|
| 112 |
return max(0.0, min(1.0, score))
|
| 113 |
|
| 114 |
|
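The pre-arbitration bonus added to `evidence_strength_score` is small enough to work through by hand. The sketch below assumes a base score is already available; `pre_arb_bonus` and `apply_pre_arb` are hypothetical names, and the 0.60 acceptance bar is the figure quoted in the code comment above.

```python
from typing import Iterable

ROUND2_ACCEPT_BAR = 0.60  # assumption: the round-2 acceptance bar from the comment


def pre_arb_bonus(evidence_ids: Iterable[str]) -> float:
    """+0.15 per unique id added in round 2, capped at +0.30."""
    return min(0.30, 0.15 * len(set(evidence_ids)))


def apply_pre_arb(base_score: float, evidence_ids: Iterable[str]) -> float:
    """Add the bonus to a base evidence score and clamp to [0.0, 1.0]."""
    return max(0.0, min(1.0, base_score + pre_arb_bonus(evidence_ids)))


# A borderline 0.50 packet clears the bar with one compelling piece (0.65),
# duplicates earn nothing extra, and a third piece is capped out.
print(apply_pre_arb(0.50, ["P1-SUPPORT-CONF"]) >= ROUND2_ACCEPT_BAR)
print(pre_arb_bonus(["a", "a"]), pre_arb_bonus(["a", "b", "c"]))
```

The cap matters for weak packets: a 0.25 base cannot reach 0.60 through pre-arb evidence alone, since 0.25 + 0.30 = 0.55.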
scenarios/llm_softening.py
CHANGED
|
@@ -44,7 +44,7 @@ _PROVIDER_CHAIN: tuple[tuple[str, str, str, str], ...] = (
|
|
| 44 |
"google",
|
| 45 |
"https://generativelanguage.googleapis.com/v1beta/openai/",
|
| 46 |
"GOOGLE_API_KEY",
|
| 47 |
-
"gemini-
|
| 48 |
),
|
| 49 |
(
|
| 50 |
"groq",
|
|
|
|
| 44 |
"google",
|
| 45 |
"https://generativelanguage.googleapis.com/v1beta/openai/",
|
| 46 |
"GOOGLE_API_KEY",
|
| 47 |
+
"gemini-2.5-flash",
|
| 48 |
),
|
| 49 |
(
|
| 50 |
"groq",
|
scenarios/simulation.py
CHANGED
|
@@ -51,6 +51,10 @@ class InternalCase:
|
|
| 51 |
network_reason_code: str = ""
|
| 52 |
response_window_days: int = 30
|
| 53 |
compelling_evidence_category: str = ""
|
| 54 |
|
| 55 |
|
| 56 |
@dataclass(frozen=True)
|
|
@@ -85,9 +89,10 @@ class CaseProgress:
|
|
| 85 |
deadline_penalized: bool = False
|
| 86 |
notes: list[str] = field(default_factory=list)
|
| 87 |
representment_note: str | None = None
|
| 88 |
-
# multi-round dispute lifecycle
|
| 89 |
round_number: int = 1
|
| 90 |
issuer_decisions: list[str] = field(default_factory=list)
|
|
|
|
| 91 |
pre_arb_evidence_added: list[str] = field(default_factory=list)
|
| 92 |
arbitration_outcome: str | None = None
|
| 93 |
arb_fees_paid: float = 0.0
|
|
@@ -166,8 +171,7 @@ TASKS: dict[str, TaskScenario] = {
|
|
| 166 |
weight=1.0,
|
| 167 |
required_evidence_ids=("E1-ORDER-CONF", "E1-DELIVERY-SCAN"),
|
| 168 |
helpful_evidence_ids=(
|
| 169 |
-
"E1-
|
| 170 |
-
"E1-DELIVERY-SCAN",
|
| 171 |
"E1-SUPPORT-ACK",
|
| 172 |
),
|
| 173 |
harmful_evidence_ids=(),
|
|
@@ -279,9 +283,9 @@ TASKS: dict[str, TaskScenario] = {
|
|
| 279 |
weight=1.1,
|
| 280 |
required_evidence_ids=("M1-PRIOR-ORDERS", "M1-ACCOUNT-CHAT"),
|
| 281 |
helpful_evidence_ids=(
|
| 282 |
-
"M1-PRIOR-ORDERS",
|
| 283 |
-
"M1-ACCOUNT-CHAT",
|
| 284 |
"M1-DELIVERY",
|
|
|
|
|
|
|
| 285 |
),
|
| 286 |
harmful_evidence_ids=("M1-AVS-MISMATCH", "M1-CVV-MISMATCH"),
|
| 287 |
card_network="visa",
|
|
@@ -377,7 +381,7 @@ TASKS: dict[str, TaskScenario] = {
|
|
| 377 |
"A real operations queue with three disputes. Two should be actioned quickly, and one should be conceded. "
|
| 378 |
"The step budget leaves little room for waste."
|
| 379 |
),
|
| 380 |
-
max_steps=
|
| 381 |
cases=(
|
| 382 |
InternalCase(
|
| 383 |
case_id="CB-H1",
|
|
@@ -390,7 +394,7 @@ TASKS: dict[str, TaskScenario] = {
|
|
| 390 |
inspection_notes=(
|
| 391 |
"Carrier stored both a delivery scan and signature. This is the highest-value recoverable case in the queue."
|
| 392 |
),
|
| 393 |
-
deadline_step=
|
| 394 |
optimal_strategy="contest",
|
| 395 |
acceptable_strategies=(),
|
| 396 |
policy_guidance=(
|
|
@@ -405,8 +409,6 @@ TASKS: dict[str, TaskScenario] = {
|
|
| 405 |
weight=1.7,
|
| 406 |
required_evidence_ids=("H1-ORDER-CONF", "H1-SIGNATURE"),
|
| 407 |
helpful_evidence_ids=(
|
| 408 |
-
"H1-ORDER-CONF",
|
| 409 |
-
"H1-SIGNATURE",
|
| 410 |
"H1-DELIVERY-SCAN",
|
| 411 |
),
|
| 412 |
harmful_evidence_ids=(),
|
|
@@ -414,6 +416,7 @@ TASKS: dict[str, TaskScenario] = {
|
|
| 414 |
network_reason_code="4855",
|
| 415 |
response_window_days=45,
|
| 416 |
compelling_evidence_category="Goods or Services Not Provided",
|
|
|
|
| 417 |
evidence_by_system={
|
| 418 |
"orders": (
|
| 419 |
_ev(
|
|
@@ -636,6 +639,124 @@ TASKS: dict[str, TaskScenario] = {
|
|
| 636 |
),
|
| 637 |
),
|
| 638 |
),
|
| 639 |
}
|
| 640 |
|
| 641 |
|
|
@@ -723,6 +844,7 @@ def list_tasks() -> list[TaskScenario]:
|
|
| 723 |
for task_id in [
|
| 724 |
"goods_not_received_easy",
|
| 725 |
"fraud_signal_ambiguity",
|
|
|
|
| 726 |
"queue_optimization_hard",
|
| 727 |
]
|
| 728 |
]
|
|
|
|
| 51 |
network_reason_code: str = ""
|
| 52 |
response_window_days: int = 30
|
| 53 |
compelling_evidence_category: str = ""
|
| 54 |
+
# Issuer-perceived complexity multiplier in (0, 1].
|
| 55 |
+
# Lower values dampen evidence_strength_score so harder cases land in the
|
| 56 |
+
# ambiguity band and exercise the multi-round dispute path.
|
| 57 |
+
dispute_complexity: float = 1.0
|
| 58 |
|
| 59 |
|
| 60 |
@dataclass(frozen=True)
|
|
|
|
| 89 |
deadline_penalized: bool = False
|
| 90 |
notes: list[str] = field(default_factory=list)
|
| 91 |
representment_note: str | None = None
|
| 92 |
+
# multi-round dispute lifecycle
|
| 93 |
round_number: int = 1
|
| 94 |
issuer_decisions: list[str] = field(default_factory=list)
|
| 95 |
+
issuer_rationales: list[str] = field(default_factory=list)
|
| 96 |
pre_arb_evidence_added: list[str] = field(default_factory=list)
|
| 97 |
arbitration_outcome: str | None = None
|
| 98 |
arb_fees_paid: float = 0.0
|
|
|
|
| 171 |
weight=1.0,
|
| 172 |
required_evidence_ids=("E1-ORDER-CONF", "E1-DELIVERY-SCAN"),
|
| 173 |
helpful_evidence_ids=(
|
| 174 |
+
"E1-SIGNATURE",
|
|
|
|
| 175 |
"E1-SUPPORT-ACK",
|
| 176 |
),
|
| 177 |
harmful_evidence_ids=(),
|
|
|
|
| 283 |
weight=1.1,
|
| 284 |
required_evidence_ids=("M1-PRIOR-ORDERS", "M1-ACCOUNT-CHAT"),
|
| 285 |
helpful_evidence_ids=(
|
|
|
|
|
|
|
| 286 |
"M1-DELIVERY",
|
| 287 |
+
"M1-ORDER",
|
| 288 |
+
"M1-VELOCITY",
|
| 289 |
),
|
| 290 |
harmful_evidence_ids=("M1-AVS-MISMATCH", "M1-CVV-MISMATCH"),
|
| 291 |
card_network="visa",
|
|
|
|
| 381 |
"A real operations queue with three disputes. Two should be actioned quickly, and one should be conceded. "
|
| 382 |
"The step budget leaves little room for waste."
|
| 383 |
),
|
| 384 |
+
max_steps=18,
|
| 385 |
cases=(
|
| 386 |
InternalCase(
|
| 387 |
case_id="CB-H1",
|
|
|
|
| 394 |
inspection_notes=(
|
| 395 |
"Carrier stored both a delivery scan and signature. This is the highest-value recoverable case in the queue."
|
| 396 |
),
|
| 397 |
+
deadline_step=14,
|
| 398 |
optimal_strategy="contest",
|
| 399 |
acceptable_strategies=(),
|
| 400 |
policy_guidance=(
|
|
|
|
| 409 |
weight=1.7,
|
| 410 |
required_evidence_ids=("H1-ORDER-CONF", "H1-SIGNATURE"),
|
| 411 |
helpful_evidence_ids=(
|
|
|
|
|
|
|
| 412 |
"H1-DELIVERY-SCAN",
|
| 413 |
),
|
| 414 |
harmful_evidence_ids=(),
|
|
|
|
| 416 |
network_reason_code="4855",
|
| 417 |
response_window_days=45,
|
| 418 |
compelling_evidence_category="Goods or Services Not Provided",
|
| 419 |
+
dispute_complexity=0.60,
|
| 420 |
evidence_by_system={
|
| 421 |
"orders": (
|
| 422 |
_ev(
|
|
|
|
| 639 |
),
|
| 640 |
),
|
| 641 |
),
|
| 642 |
+
"pre_arb_recovery_medium": TaskScenario(
|
| 643 |
+
task_id="pre_arb_recovery_medium",
|
| 644 |
+
title="Pre-Arbitration Recovery",
|
| 645 |
+
difficulty="medium",
|
| 646 |
+
objective=(
|
| 647 |
+
"Win a goods-not-received dispute that requires recovering compelling "
|
| 648 |
+
"evidence in round 2 instead of burning $250 on arbitration."
|
| 649 |
+
),
|
| 650 |
+
description=(
|
| 651 |
+
"Required evidence is split across orders and support. A round-1 packet "
|
| 652 |
+
"from the default systems will fall short and the issuer will request "
|
| 653 |
+
"compelling evidence. Querying support in round 2 unlocks the missing "
|
| 654 |
+
"proof; jumping straight to arbitration concedes a $250 fee on a "
|
| 655 |
+
"packet the issuer would have accepted."
|
| 656 |
+
),
|
| 657 |
+
max_steps=12,
|
| 658 |
+
cases=(
|
| 659 |
+
InternalCase(
|
| 660 |
+
case_id="CB-P1",
|
| 661 |
+
order_id="ORD-7710",
|
| 662 |
+
customer_id="CUST-3300",
|
| 663 |
+
amount=700.0,
|
| 664 |
+
currency="USD",
|
| 665 |
+
reason_code="goods_not_received",
|
| 666 |
+
summary=(
|
| 667 |
+
"Customer denies receipt of a $700 electronics order. "
|
| 668 |
+
"Authenticated support transcript proves delivery acknowledgement."
|
| 669 |
+
),
|
| 670 |
+
inspection_notes=(
|
| 671 |
+
"The order was delivered, but the strongest acknowledgement lives "
|
| 672 |
+
"in the support transcript — not in the orders or shipping system. "
|
| 673 |
+
"A first-pass packet will be missing required evidence."
|
| 674 |
+
),
|
| 675 |
+
deadline_step=10,
|
| 676 |
+
optimal_strategy="contest",
|
| 677 |
+
acceptable_strategies=(),
|
| 678 |
+
policy_guidance=(
|
| 679 |
+
"Goods-not-received disputes need order confirmation plus a "
|
| 680 |
+
"delivery acknowledgement. If the support transcript is the only "
|
| 681 |
+
"delivery acknowledgement, attach it through the pre-arbitration "
|
| 682 |
+
"response — do not skip straight to arbitration."
|
| 683 |
+
),
|
| 684 |
+
policy_requirements=(
|
| 685 |
+
"order confirmation",
|
| 686 |
+
"support delivery acknowledgement",
|
| 687 |
+
),
|
| 688 |
+
recommended_strategy="contest",
|
| 689 |
+
resolution_summary=(
|
| 690 |
+
"Recover the support acknowledgement in pre-arb. Escalating to "
|
| 691 |
+
"arbitration without it forfeits $250 on a winnable case."
|
| 692 |
+
),
|
| 693 |
+
weight=1.3,
|
| 694 |
+
required_evidence_ids=("P1-ORDER-CONF", "P1-SUPPORT-CONF"),
|
| 695 |
+
helpful_evidence_ids=("P1-DELIVERY-SCAN", "P1-RISK-CLEAR"),
|
| 696 |
+
harmful_evidence_ids=(),
|
| 697 |
+
card_network="visa",
|
| 698 |
+
network_reason_code="13.1",
|
| 699 |
+
response_window_days=30,
|
| 700 |
+
compelling_evidence_category="CE 3.5 — Merchandise Not Received",
|
| 701 |
+
evidence_by_system={
|
| 702 |
+
"orders": (
|
| 703 |
+
_ev(
|
| 704 |
+
"P1-ORDER-CONF",
|
| 705 |
+
"orders",
|
| 706 |
+
"Order confirmation",
|
| 707 |
+
"Order receipt with billed customer, shipping address, and SKU.",
|
| 708 |
+
helpful=True,
|
| 709 |
+
required=True,
|
| 710 |
+
),
|
| 711 |
+
),
|
| 712 |
+
"payment": (
|
| 713 |
+
_ev(
|
| 714 |
+
"P1-AUTH",
|
| 715 |
+
"payment",
|
| 716 |
+
"Authorization capture",
|
| 717 |
+
"Authorization approved and captured cleanly.",
|
| 718 |
+
),
|
| 719 |
+
),
|
| 720 |
+
"shipping": (
|
| 721 |
+
_ev(
|
| 722 |
+
"P1-DELIVERY-SCAN",
|
| 723 |
+
"shipping",
|
| 724 |
+
"Carrier delivery scan",
|
| 725 |
+
"Carrier tracking shows the package delivered to the saved address.",
|
| 726 |
+
helpful=True,
|
| 727 |
+
),
|
| 728 |
+
),
|
| 729 |
+
"support": (
|
| 730 |
+
_ev(
|
| 731 |
+
"P1-SUPPORT-CONF",
|
| 732 |
+
"support",
|
| 733 |
+
"Authenticated support acknowledgement",
|
| 734 |
+
"Customer logged in and confirmed receipt of the package in chat the next day.",
|
| 735 |
+
helpful=True,
|
| 736 |
+
required=True,
|
| 737 |
+
),
|
| 738 |
+
),
|
| 739 |
+
"refunds": (
|
| 740 |
+
_ev(
|
| 741 |
+
"P1-NO-REFUND",
|
| 742 |
+
"refunds",
|
| 743 |
+
"Refund ledger",
|
| 744 |
+
"No refund or goodwill credit was issued before the dispute opened.",
|
| 745 |
+
),
|
| 746 |
+
),
|
| 747 |
+
"risk": (
|
| 748 |
+
_ev(
|
| 749 |
+
"P1-RISK-CLEAR",
|
| 750 |
+
"risk",
|
| 751 |
+
"Risk summary",
|
| 752 |
+
"Account has clean device fingerprint and prior fulfilled orders.",
|
| 753 |
+
helpful=True,
|
| 754 |
+
),
|
| 755 |
+
),
|
| 756 |
+
},
|
| 757 |
+
),
|
| 758 |
+
),
|
| 759 |
+
),
|
| 760 |
}
|
| 761 |
|
| 762 |
|
|
|
|
| 844 |
for task_id in [
|
| 845 |
"goods_not_received_easy",
|
| 846 |
"fraud_signal_ambiguity",
|
| 847 |
+
"pre_arb_recovery_medium",
|
| 848 |
"queue_optimization_hard",
|
| 849 |
]
|
| 850 |
]
|
server/chargeback_ops_environment.py
CHANGED
|
@@ -388,9 +388,6 @@ class ChargebackOpsEnvironment(
|
|
| 388 |
if progress.resolution_status != "open":
|
| 389 |
return -0.05, f"Case {case.case_id} is already resolved."
|
| 390 |
|
| 391 |
-
attached = set(progress.attached_evidence_ids)
|
| 392 |
-
missing = set(case.required_evidence_ids).difference(attached)
|
| 393 |
-
harmful = set(case.harmful_evidence_ids).intersection(attached)
|
| 394 |
if self._state.step_count > case.deadline_step:
|
| 395 |
progress.final_resolution = "contest"
|
| 396 |
progress.resolution_status = "lost_late"
|
|
@@ -399,22 +396,12 @@ class ChargebackOpsEnvironment(
|
|
| 399 |
-0.2,
|
| 400 |
f"Representment for case {case.case_id} was submitted after the deadline.",
|
| 401 |
)
|
| 402 |
-
if missing:
|
| 403 |
-
progress.final_resolution = "contest"
|
| 404 |
-
progress.resolution_status = "lost_incomplete"
|
| 405 |
-
progress.resolved_at_step = self._state.step_count
|
| 406 |
-
return -0.18, (
|
| 407 |
-
f"Representment for case {case.case_id} is incomplete; missing {', '.join(sorted(missing))}."
|
| 408 |
-
)
|
| 409 |
-
if harmful:
|
| 410 |
-
progress.final_resolution = "contest"
|
| 411 |
-
progress.resolution_status = "lost_harmful_evidence"
|
| 412 |
-
progress.resolved_at_step = self._state.step_count
|
| 413 |
-
return -0.15, (
|
| 414 |
-
f"Representment for case {case.case_id} included harmful evidence {', '.join(sorted(harmful))}."
|
| 415 |
-
)
|
| 416 |
|
| 417 |
-
#
|
| 418 |
review = self._invoke_issuer_review(case, progress, round_number=1)
|
| 419 |
|
| 420 |
if review.decision == IssuerDecision.ACCEPT:
|
|
@@ -457,6 +444,7 @@ class ChargebackOpsEnvironment(
|
|
| 457 |
|
| 458 |
review = self._issuer_agent.decide_review(case, progress, round_number=round_number)
|
| 459 |
progress.issuer_decisions.append(review.decision.value)
|
|
|
|
| 460 |
return review
|
| 461 |
|
| 462 |
def _respond_to_pre_arb(
|
|
@@ -856,6 +844,17 @@ class ChargebackOpsEnvironment(
|
|
| 856 |
submission_status=progress.resolution_status
|
| 857 |
if progress.resolution_status != "open"
|
| 858 |
else None,
|
| 859 |
)
|
| 860 |
|
| 861 |
def _build_available_actions(self) -> list[str]:
|
|
@@ -869,8 +868,8 @@ class ChargebackOpsEnvironment(
|
|
| 869 |
return ["select_case"]
|
| 870 |
if case_progress.round_number == 2:
|
| 871 |
# Pre-arbitration: investigation actions still help (e.g. to pull
|
| 872 |
-
# compelling evidence from a system) but the round-1 submit path
|
| 873 |
-
# closed off in favour of the three terminal
|
| 874 |
return base + [
|
| 875 |
"query_system",
|
| 876 |
"retrieve_policy",
|
|
|
|
| 388 |
if progress.resolution_status != "open":
|
| 389 |
return -0.05, f"Case {case.case_id} is already resolved."
|
| 390 |
|
|
|
|
|
|
|
|
|
|
| 391 |
if self._state.step_count > case.deadline_step:
|
| 392 |
progress.final_resolution = "contest"
|
| 393 |
progress.resolution_status = "lost_late"
|
|
|
|
| 396 |
-0.2,
|
| 397 |
f"Representment for case {case.case_id} was submitted after the deadline.",
|
| 398 |
)
|
| 399 |
|
| 400 |
+
# Every on-time packet is handed to the scripted Issuer. Missing
|
| 401 |
+
# required evidence and attached harmful evidence are not terminal —
|
| 402 |
+
# they push the score down so the Issuer requests more evidence
|
| 403 |
+
# (round 2) or escalates to arbitration (round 3), exercising the
|
| 404 |
+
# multi-round dispute path the rubric is built for.
|
| 405 |
review = self._invoke_issuer_review(case, progress, round_number=1)
|
| 406 |
|
| 407 |
if review.decision == IssuerDecision.ACCEPT:
|
|
|
|
| 444 |
|
| 445 |
review = self._issuer_agent.decide_review(case, progress, round_number=round_number)
|
| 446 |
progress.issuer_decisions.append(review.decision.value)
|
| 447 |
+
progress.issuer_rationales.append(review.rationale)
|
| 448 |
return review
|
| 449 |
|
| 450 |
def _respond_to_pre_arb(
|
|
|
|
| 844 |
submission_status=progress.resolution_status
|
| 845 |
if progress.resolution_status != "open"
|
| 846 |
else None,
|
| 847 |
+
round_number=progress.round_number,
|
| 848 |
+
last_issuer_decision=(
|
| 849 |
+
progress.issuer_decisions[-1] if progress.issuer_decisions else None
|
| 850 |
+
),
|
| 851 |
+
last_issuer_rationale=(
|
| 852 |
+
progress.issuer_rationales[-1] if progress.issuer_rationales else None
|
| 853 |
+
),
|
| 854 |
+
pre_arb_evidence_added=list(progress.pre_arb_evidence_added),
|
| 855 |
+
arbitration_outcome=progress.arbitration_outcome,
|
| 856 |
+
arb_fees_paid=progress.arb_fees_paid,
|
| 857 |
+
final_economic_outcome=progress.final_economic_outcome,
|
| 858 |
)
|
| 859 |
|
| 860 |
def _build_available_actions(self) -> list[str]:
|
|
|
|
| 868 |
return ["select_case"]
|
| 869 |
if case_progress.round_number == 2:
|
| 870 |
# Pre-arbitration: investigation actions still help (e.g. to pull
|
| 871 |
+
# compelling evidence from a system) but the round-1 submit path
|
| 872 |
+
# is closed off in favour of the three terminal pre-arb actions.
|
| 873 |
return base + [
|
| 874 |
"query_system",
|
| 875 |
"retrieve_policy",
|
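Note on the hunk above: the new comment says missing required evidence and attached harmful evidence no longer end the case; they only lower the score the Issuer sees. A minimal sketch of a score with that property follows. The function name and the two positive weights are assumptions, picked only so the numbers line up with figures quoted in the test docstrings in this commit (required + 2 helpful gives 0.8, two helpful-only ids give 0.4, harmful costs 0.3 per piece). The real `evidence_strength_score` is not shown in this diff and may differ.

```python
# Hypothetical weights: chosen only so the sketch reproduces the figures quoted
# in the test docstrings; the real constants are not shown in this diff.
REQUIRED_WEIGHT = 0.4   # credit when every required id is attached (assumed)
HELPFUL_WEIGHT = 0.2    # credit per helpful id attached (assumed)
HARMFUL_PENALTY = 0.3   # per harmful id, no cap (per the test docstring)


def sketch_strength_score(required, helpful, harmful, attached):
    """Return a packet strength clamped to [0, 1] (clamping is an assumption)."""
    attached = set(attached)
    required, helpful, harmful = set(required), set(helpful), set(harmful)
    # Partial credit for a partially complete packet instead of a hard failure.
    required_fraction = len(required & attached) / len(required) if required else 1.0
    raw = (
        REQUIRED_WEIGHT * required_fraction
        + HELPFUL_WEIGHT * len(helpful & attached)
        - HARMFUL_PENALTY * len(harmful & attached)
    )
    return max(0.0, min(1.0, raw))


# Illustrative ids only; they do not correspond to any case in the repo.
REQ, HELP, HARM = ["R1", "R2"], ["H1", "H2"], ["X1"]
full_packet = sketch_strength_score(REQ, HELP, HARM, ["R1", "R2", "H1", "H2"])
with_harmful = sketch_strength_score(REQ, HELP, HARM, ["R1", "R2", "H1", "H2", "X1"])
```

Under these assumed weights a packet that is missing one required id or carries one harmful id still produces a usable score, which is what lets the Issuer route it to round 2 or round 3 instead of the old `lost_incomplete` / `lost_harmful_evidence` terminal states.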
server/demo_ui.py
CHANGED
|
@@ -74,6 +74,27 @@ _CSS = """
|
|
| 74 |
.color-yellow { color: #eab308; }
|
| 75 |
.color-red { color: #ef4444; }
|
| 76 |
.color-blue { color: #3b82f6; }
|
| 77 |
"""
|
| 78 |
|
| 79 |
|
|
@@ -175,6 +196,76 @@ def _budget_html(steps_used: int, max_steps: int, score: float) -> str:
|
|
| 175 |
"""
|
| 176 |
|
| 177 |
|
| 178 |
def _grader_html(report: dict | None) -> str:
|
| 179 |
if not report:
|
| 180 |
return ""
|
|
@@ -191,13 +282,14 @@ def _grader_html(report: dict | None) -> str:
|
|
| 191 |
)
|
| 192 |
|
| 193 |
dims = [
|
| 194 |
-
("Strategy", "strategy_correctness", "
|
| 195 |
-
("Evidence", "evidence_quality", "
|
| 196 |
-
("Packet", "packet_validity", "
|
| 197 |
-
("Deadline", "deadline_compliance", "
|
| 198 |
("Efficiency", "efficiency", "10%"),
|
| 199 |
("Outcome", "outcome_quality", "10%"),
|
| 200 |
("Note", "note_quality", "5%"),
|
|
|
|
| 201 |
]
|
| 202 |
|
| 203 |
for case in report.get("case_reports", []):
|
|
@@ -259,6 +351,8 @@ def run_episode(
|
|
| 259 |
_queue_html(obs),
|
| 260 |
_budget_html(0, max_steps, 0.0),
|
| 261 |
[row[:] for row in rows],
|
|
|
|
|
|
|
| 262 |
"",
|
| 263 |
None,
|
| 264 |
)
|
|
@@ -302,6 +396,8 @@ def run_episode(
|
|
| 302 |
_queue_html(obs),
|
| 303 |
_budget_html(step, max_steps, obs.progress_score),
|
| 304 |
[row[:] for row in rows],
|
|
|
|
|
|
|
| 305 |
grader,
|
| 306 |
None,
|
| 307 |
)
|
|
@@ -314,6 +410,8 @@ def run_episode(
|
|
| 314 |
_queue_html(obs),
|
| 315 |
_budget_html(step, max_steps, obs.progress_score),
|
| 316 |
[row[:] for row in rows],
|
|
|
|
|
|
|
| 317 |
_grader_html(report),
|
| 318 |
report,
|
| 319 |
)
|
|
@@ -370,7 +468,8 @@ def build_demo() -> gr.Blocks:
|
|
| 370 |
|
| 371 |
md_status = gr.Markdown(
|
| 372 |
"Pick a task + policy and click **Run Episode**. Compare **Heuristic** vs "
|
| 373 |
-
"**Naive** to see how the
|
|
|
|
| 374 |
)
|
| 375 |
|
| 376 |
with gr.Row(equal_height=True):
|
|
@@ -395,6 +494,12 @@ def build_demo() -> gr.Blocks:
|
|
| 395 |
label="Step Trace",
|
| 396 |
)
|
| 397 |
|
| 398 |
html_grader = gr.HTML(label="Grader Report")
|
| 399 |
json_raw = gr.JSON(label="Raw JSON", visible=False)
|
| 400 |
|
|
@@ -406,6 +511,8 @@ def build_demo() -> gr.Blocks:
|
|
| 406 |
html_queue,
|
| 407 |
html_budget,
|
| 408 |
df_trace,
|
|
|
|
|
|
|
| 409 |
html_grader,
|
| 410 |
json_raw,
|
| 411 |
],
|
|
@@ -445,29 +552,33 @@ def build_demo() -> gr.Blocks:
|
|
| 445 |
],
|
| 446 |
interactive=False,
|
| 447 |
wrap=True,
|
| 448 |
-
label="
|
| 449 |
)
|
| 450 |
|
| 451 |
# ── Tab 3: Environment Info ───────────────────────────
|
| 452 |
with gr.Tab("Environment"):
|
| 453 |
gr.Markdown(
|
| 454 |
-
"## Action Space (
|
| 455 |
-
"`select_case` · `inspect_case` ·
|
| 456 |
-
"`
|
| 457 |
-
"`
|
|
|
|
|
|
|
|
|
|
| 458 |
"## Merchant Systems (6)\n\n"
|
| 459 |
"`orders` · `payment` · `shipping` · "
|
| 460 |
"`support` · `refunds` · `risk`\n\n"
|
| 461 |
-
"## Grading (
|
| 462 |
"| Dimension | Weight | Scoring |\n"
|
| 463 |
"|---|---|---|\n"
|
| 464 |
-
"| Strategy Correctness |
|
| 465 |
-
"| Evidence Quality |
|
| 466 |
-
"| Packet Validity |
|
| 467 |
-
"| Deadline Compliance |
|
| 468 |
"| Efficiency | 10% | Penalises waste, rewards early concession |\n"
|
| 469 |
"| Outcome Quality | 10% | 1.0 optimal, 0.4 acceptable, 0.0 wrong |\n"
|
| 470 |
-
"| Note Quality | 5% | Policy keywords + evidence refs |\n
|
|
|
|
| 471 |
"## Card Networks\n\n"
|
| 472 |
"| Reason Code | Visa | Mastercard |\n"
|
| 473 |
"|---|---|---|\n"
|
|
|
|
| 74 |
.color-yellow { color: #eab308; }
|
| 75 |
.color-red { color: #ef4444; }
|
| 76 |
.color-blue { color: #3b82f6; }
|
| 77 |
+
|
| 78 |
+
.round-panel { border: 1px solid #3a3a3a; border-radius: 8px; padding: 12px 14px; margin: 8px 0; background: #1a1a1a; }
|
| 79 |
+
.round-panel .panel-title { font-weight: 700; font-size: 13px; color: #ccc; margin-bottom: 6px; text-transform: uppercase; letter-spacing: 0.5px; }
|
| 80 |
+
.round-badge { display: inline-block; padding: 3px 10px; border-radius: 12px; font-size: 12px; font-weight: 700; margin-right: 8px; }
|
| 81 |
+
.round-1 { background: #1e3a8a; color: #93c5fd; }
|
| 82 |
+
.round-2 { background: #78350f; color: #fcd34d; }
|
| 83 |
+
.round-3 { background: #7f1d1d; color: #fca5a5; }
|
| 84 |
+
.issuer-quote { font-style: italic; color: #d4d4d4; font-size: 13px; padding: 6px 10px; border-left: 3px solid #6366f1; margin: 6px 0; background: #15171f; }
|
| 85 |
+
.issuer-decision { font-weight: 700; font-size: 13px; }
|
| 86 |
+
.dec-accept { color: #22c55e; }
|
| 87 |
+
.dec-request { color: #eab308; }
|
| 88 |
+
.dec-escalate { color: #ef4444; }
|
| 89 |
+
|
| 90 |
+
.arb-panel { border: 1px solid #7f1d1d; border-radius: 8px; padding: 12px 14px; margin: 8px 0; background: #1a0e0e; }
|
| 91 |
+
.arb-row { display: flex; justify-content: space-between; padding: 4px 0; font-size: 13px; }
|
| 92 |
+
.arb-row .arb-label { color: #999; }
|
| 93 |
+
.arb-row .arb-value { font-weight: 700; }
|
| 94 |
+
.outcome-merchant { color: #22c55e; }
|
| 95 |
+
.outcome-issuer { color: #ef4444; }
|
| 96 |
+
.pnl-pos { color: #22c55e; font-weight: 800; }
|
| 97 |
+
.pnl-neg { color: #ef4444; font-weight: 800; }
|
| 98 |
"""
|
| 99 |
|
| 100 |
|
|
|
|
| 196 |
"""
|
| 197 |
|
| 198 |
|
| 199 |
+
_DEC_CLASS = {
|
| 200 |
+
"accept": "dec-accept",
|
| 201 |
+
"request_more_evidence": "dec-request",
|
| 202 |
+
"escalate_to_arbitration": "dec-escalate",
|
| 203 |
+
"merchant_wins": "outcome-merchant",
|
| 204 |
+
"issuer_wins": "outcome-issuer",
|
| 205 |
+
}
|
| 206 |
+
|
| 207 |
+
|
| 208 |
+
def _round_panel_html(observation) -> str:
|
| 209 |
+
vc = observation.visible_case
|
| 210 |
+
if vc is None:
|
| 211 |
+
return ""
|
| 212 |
+
|
| 213 |
+
rnd = vc.round_number or 1
|
| 214 |
+
badge_cls = f"round-{min(rnd, 3)}"
|
| 215 |
+
rnd_label = {1: "Representment", 2: "Pre-Arbitration", 3: "Arbitration"}.get(rnd, f"Round {rnd}")
|
| 216 |
+
|
| 217 |
+
body = (
|
| 218 |
+
f'<div class="panel-title">'
|
| 219 |
+
f'<span class="round-badge {badge_cls}">R{rnd}</span>'
|
| 220 |
+
f'{rnd_label} · case <b>{vc.case_id}</b>'
|
| 221 |
+
f'</div>'
|
| 222 |
+
)
|
| 223 |
+
|
| 224 |
+
if vc.last_issuer_decision:
|
| 225 |
+
dec = vc.last_issuer_decision
|
| 226 |
+
dec_cls = _DEC_CLASS.get(dec, "")
|
| 227 |
+
dec_pretty = dec.replace("_", " ").title()
|
| 228 |
+
body += f'<div class="issuer-decision {dec_cls}">Issuer: {dec_pretty}</div>'
|
| 229 |
+
|
| 230 |
+
if vc.last_issuer_rationale:
|
| 231 |
+
body += f'<div class="issuer-quote">“{vc.last_issuer_rationale}”</div>'
|
| 232 |
+
|
| 233 |
+
if vc.pre_arb_evidence_added:
|
| 234 |
+
ids = ", ".join(vc.pre_arb_evidence_added)
|
| 235 |
+
body += (
|
| 236 |
+
f'<div style="font-size:12px;color:#999;margin-top:4px;">'
|
| 237 |
+
f'Pre-arb evidence added: <code>{ids}</code></div>'
|
| 238 |
+
)
|
| 239 |
+
|
| 240 |
+
return f'<div class="round-panel">{body}</div>'
|
| 241 |
+
|
| 242 |
+
|
| 243 |
+
def _arbitration_panel_html(observation) -> str:
|
| 244 |
+
vc = observation.visible_case
|
| 245 |
+
if vc is None or vc.arbitration_outcome is None:
|
| 246 |
+
return ""
|
| 247 |
+
|
| 248 |
+
outcome = vc.arbitration_outcome
|
| 249 |
+
outcome_cls = _DEC_CLASS.get(outcome, "")
|
| 250 |
+
outcome_label = outcome.replace("_", " ").title()
|
| 251 |
+
pnl = vc.final_economic_outcome
|
| 252 |
+
pnl_cls = "pnl-pos" if (pnl is not None and pnl >= 0) else "pnl-neg"
|
| 253 |
+
pnl_str = f"${pnl:+,.2f}" if pnl is not None else "n/a"
|
| 254 |
+
fees = vc.arb_fees_paid or 0.0
|
| 255 |
+
|
| 256 |
+
return (
|
| 257 |
+
f'<div class="arb-panel">'
|
| 258 |
+
f'<div class="panel-title"><span class="round-badge round-3">ARB</span>Arbitration Outcome</div>'
|
| 259 |
+
f'<div class="arb-row"><span class="arb-label">Ruling</span>'
|
| 260 |
+
f'<span class="arb-value {outcome_cls}">{outcome_label}</span></div>'
|
| 261 |
+
f'<div class="arb-row"><span class="arb-label">Arb fees paid</span>'
|
| 262 |
+
f'<span class="arb-value">${fees:,.2f}</span></div>'
|
| 263 |
+
f'<div class="arb-row"><span class="arb-label">Final P&L</span>'
|
| 264 |
+
f'<span class="arb-value {pnl_cls}">{pnl_str}</span></div>'
|
| 265 |
+
f'</div>'
|
| 266 |
+
)
|
| 267 |
+
|
| 268 |
+
|
| 269 |
def _grader_html(report: dict | None) -> str:
|
| 270 |
if not report:
|
| 271 |
return ""
|
|
|
|
| 282 |
)
|
| 283 |
|
| 284 |
dims = [
|
| 285 |
+
("Strategy", "strategy_correctness", "20%"),
|
| 286 |
+
("Evidence", "evidence_quality", "15%"),
|
| 287 |
+
("Packet", "packet_validity", "10%"),
|
| 288 |
+
("Deadline", "deadline_compliance", "10%"),
|
| 289 |
("Efficiency", "efficiency", "10%"),
|
| 290 |
("Outcome", "outcome_quality", "10%"),
|
| 291 |
("Note", "note_quality", "5%"),
|
| 292 |
+
("Esc ROI", "escalation_roi", "20%"),
|
| 293 |
]
|
| 294 |
|
| 295 |
for case in report.get("case_reports", []):
|
|
|
|
| 351 |
_queue_html(obs),
|
| 352 |
_budget_html(0, max_steps, 0.0),
|
| 353 |
[row[:] for row in rows],
|
| 354 |
+
_round_panel_html(obs),
|
| 355 |
+
_arbitration_panel_html(obs),
|
| 356 |
"",
|
| 357 |
None,
|
| 358 |
)
|
|
|
|
| 396 |
_queue_html(obs),
|
| 397 |
_budget_html(step, max_steps, obs.progress_score),
|
| 398 |
[row[:] for row in rows],
|
| 399 |
+
_round_panel_html(obs),
|
| 400 |
+
_arbitration_panel_html(obs),
|
| 401 |
grader,
|
| 402 |
None,
|
| 403 |
)
|
|
|
|
| 410 |
_queue_html(obs),
|
| 411 |
_budget_html(step, max_steps, obs.progress_score),
|
| 412 |
[row[:] for row in rows],
|
| 413 |
+
_round_panel_html(obs),
|
| 414 |
+
_arbitration_panel_html(obs),
|
| 415 |
_grader_html(report),
|
| 416 |
report,
|
| 417 |
)
|
|
|
|
| 468 |
|
| 469 |
md_status = gr.Markdown(
|
| 470 |
"Pick a task + policy and click **Run Episode**. Compare **Heuristic** vs "
|
| 471 |
+
"**Naive** to see how the 8-dimension rubric — including escalation ROI — "
|
| 472 |
+
"separates an EV-rational agent from a lazy one."
|
| 473 |
)
|
| 474 |
|
| 475 |
with gr.Row(equal_height=True):
|
|
|
|
| 494 |
label="Step Trace",
|
| 495 |
)
|
| 496 |
|
| 497 |
+
with gr.Row(equal_height=True):
|
| 498 |
+
with gr.Column(scale=1):
|
| 499 |
+
html_round = gr.HTML(label="Dispute Round")
|
| 500 |
+
with gr.Column(scale=1):
|
| 501 |
+
html_arb = gr.HTML(label="Arbitration")
|
| 502 |
+
|
| 503 |
html_grader = gr.HTML(label="Grader Report")
|
| 504 |
json_raw = gr.JSON(label="Raw JSON", visible=False)
|
| 505 |
|
|
|
|
| 511 |
html_queue,
|
| 512 |
html_budget,
|
| 513 |
df_trace,
|
| 514 |
+
html_round,
|
| 515 |
+
html_arb,
|
| 516 |
html_grader,
|
| 517 |
json_raw,
|
| 518 |
],
|
|
|
|
| 552 |
],
|
| 553 |
interactive=False,
|
| 554 |
wrap=True,
|
| 555 |
+
label=f"{len(tasks)}-Task Benchmark Catalog",
|
| 556 |
)
|
| 557 |
|
| 558 |
# ── Tab 3: Environment Info ───────────────────────────
|
| 559 |
with gr.Tab("Environment"):
|
| 560 |
gr.Markdown(
|
| 561 |
+
"## Action Space (12 typed actions)\n\n"
|
| 562 |
+
"**Round 1 — Representment:** `select_case` · `inspect_case` · "
|
| 563 |
+
"`query_system` · `retrieve_policy` · `add_evidence` · "
|
| 564 |
+
"`remove_evidence` · `set_strategy` · `submit_representment` · "
|
| 565 |
+
"`resolve_case`\n\n"
|
| 566 |
+
"**Round 2/3 — Pre-arb & Arbitration:** `respond_to_pre_arb` · "
|
| 567 |
+
"`escalate_to_arbitration` · `accept_arbitration_loss`\n\n"
|
| 568 |
"## Merchant Systems (6)\n\n"
|
| 569 |
"`orders` · `payment` · `shipping` · "
|
| 570 |
"`support` · `refunds` · `risk`\n\n"
|
| 571 |
+
"## Grading (8 dimensions)\n\n"
|
| 572 |
"| Dimension | Weight | Scoring |\n"
|
| 573 |
"|---|---|---|\n"
|
| 574 |
+
"| Strategy Correctness | 20% | 1.0 optimal, 0.35 acceptable, 0.0 wrong |\n"
|
| 575 |
+
"| Evidence Quality | 15% | Required + helpful coverage, harmful penalty |\n"
|
| 576 |
+
"| Packet Validity | 10% | Binary: all required, zero harmful |\n"
|
| 577 |
+
"| Deadline Compliance | 10% | Binary: resolved before deadline |\n"
|
| 578 |
"| Efficiency | 10% | Penalises waste, rewards early concession |\n"
|
| 579 |
"| Outcome Quality | 10% | 1.0 optimal, 0.4 acceptable, 0.0 wrong |\n"
|
| 580 |
+
"| Note Quality | 5% | Policy keywords + evidence refs |\n"
|
| 581 |
+
"| Escalation ROI | 20% | EV-rational arbitration: P(win)·amount vs $250 fee |\n\n"
|
| 582 |
"## Card Networks\n\n"
|
| 583 |
"| Reason Code | Visa | Mastercard |\n"
|
| 584 |
"|---|---|---|\n"
|
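Note on the grading table above: the eight weights (20 + 15 + 10 + 10 + 10 + 10 + 5 + 20) sum to 100%. A small sketch of how they could combine into one per-case score is below. The dimension keys and weights are taken from the `dims` list in this diff; treating the final score as a plain weighted sum, and the function name, are assumptions, since the grader itself is not part of this hunk.

```python
# Weights copied from the `dims` list in server/demo_ui.py (as fractions).
RUBRIC_WEIGHTS = {
    "strategy_correctness": 0.20,
    "evidence_quality": 0.15,
    "packet_validity": 0.10,
    "deadline_compliance": 0.10,
    "efficiency": 0.10,
    "outcome_quality": 0.10,
    "note_quality": 0.05,
    "escalation_roi": 0.20,
}


def sketch_weighted_case_score(dimension_scores):
    """Combine per-dimension scores in [0, 1] into one score (assumed weighted sum)."""
    missing = RUBRIC_WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise KeyError(f"missing rubric dimensions: {sorted(missing)}")
    return sum(weight * dimension_scores[name] for name, weight in RUBRIC_WEIGHTS.items())


# A case that is perfect everywhere except escalation ROI loses that 20% share.
perfect = {name: 1.0 for name in RUBRIC_WEIGHTS}
bad_escalation = {**perfect, "escalation_roi": 0.0}
```

If the real grader does aggregate this way, a poor arbitration decision can cost as much as choosing the wrong strategy, since both dimensions carry 20%.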
tests/test_api.py
CHANGED
|
@@ -59,11 +59,23 @@ def test_grader_endpoint_after_completed_episode():
|
|
| 59 |
system_name="shipping",
|
| 60 |
)
|
| 61 |
)
|
| 62 |
env.step(
|
| 63 |
ChargebackOpsAction(
|
| 64 |
action_type="add_evidence",
|
| 65 |
case_id="CB-E1",
|
| 66 |
-
evidence_ids=[
|
| 67 |
)
|
| 68 |
)
|
| 69 |
env.step(
|
|
|
|
| 59 |
system_name="shipping",
|
| 60 |
)
|
| 61 |
)
|
| 62 |
+
env.step(
|
| 63 |
+
ChargebackOpsAction(
|
| 64 |
+
action_type="query_system",
|
| 65 |
+
case_id="CB-E1",
|
| 66 |
+
system_name="support",
|
| 67 |
+
)
|
| 68 |
+
)
|
| 69 |
env.step(
|
| 70 |
ChargebackOpsAction(
|
| 71 |
action_type="add_evidence",
|
| 72 |
case_id="CB-E1",
|
| 73 |
+
evidence_ids=[
|
| 74 |
+
"E1-ORDER-CONF",
|
| 75 |
+
"E1-DELIVERY-SCAN",
|
| 76 |
+
"E1-SIGNATURE",
|
| 77 |
+
"E1-SUPPORT-ACK",
|
| 78 |
+
],
|
| 79 |
)
|
| 80 |
)
|
| 81 |
env.step(
|
tests/test_arbitration.py
CHANGED
|
@@ -32,8 +32,10 @@ def _progress(attached: list[str]) -> CaseProgress:
|
|
| 32 |
|
| 33 |
|
| 34 |
def test_merchant_wins_on_strong_packet():
|
| 35 |
-
"""
|
| 36 |
-
progress = _progress(
|
|
|
|
|
|
|
| 37 |
ruling = arbitration_ruling(_CASE, progress)
|
| 38 |
assert ruling.evidence_strength_score >= ARB_MERCHANT_WIN_THRESHOLD
|
| 39 |
assert ruling.outcome == ArbitrationOutcome.MERCHANT_WINS
|
|
@@ -53,7 +55,7 @@ def test_issuer_wins_on_empty_packet():
|
|
| 53 |
def test_ambiguity_band_uses_deterministic_coin_flip():
|
| 54 |
"""Scores in (0.35, 0.65) map to a case_id-keyed coin flip — reproducible."""
|
| 55 |
# Two helpful-only evidence ids → 0.4 band score, no required subset.
|
| 56 |
-
progress = _progress(["E1-
|
| 57 |
r1 = arbitration_ruling(_CASE, progress)
|
| 58 |
r2 = arbitration_ruling(_CASE, progress)
|
| 59 |
assert r1.outcome == r2.outcome
|
|
@@ -78,7 +80,9 @@ def test_coin_flip_varies_across_case_ids():
|
|
| 78 |
|
| 79 |
def test_ruling_is_pure():
|
| 80 |
"""Same inputs, same outputs — required for reproducible benchmarks."""
|
| 81 |
-
progress = _progress(
|
|
|
|
|
|
|
| 82 |
r1 = arbitration_ruling(_CASE, progress)
|
| 83 |
r2 = arbitration_ruling(_CASE, progress)
|
| 84 |
assert r1 == r2
|
|
|
|
| 32 |
|
| 33 |
|
| 34 |
def test_merchant_wins_on_strong_packet():
|
| 35 |
+
"""Required + 2 helpful → score 0.8 clears the 0.65 bar → MERCHANT_WINS."""
|
| 36 |
+
progress = _progress(
|
| 37 |
+
["E1-ORDER-CONF", "E1-DELIVERY-SCAN", "E1-SIGNATURE", "E1-SUPPORT-ACK"]
|
| 38 |
+
)
|
| 39 |
ruling = arbitration_ruling(_CASE, progress)
|
| 40 |
assert ruling.evidence_strength_score >= ARB_MERCHANT_WIN_THRESHOLD
|
| 41 |
assert ruling.outcome == ArbitrationOutcome.MERCHANT_WINS
|
|
|
|
| 55 |
def test_ambiguity_band_uses_deterministic_coin_flip():
|
| 56 |
"""Scores in (0.35, 0.65) map to a case_id-keyed coin flip — reproducible."""
|
| 57 |
# Two helpful-only evidence ids → 0.4 band score, no required subset.
|
| 58 |
+
progress = _progress(["E1-SIGNATURE", "E1-SUPPORT-ACK"])
|
| 59 |
r1 = arbitration_ruling(_CASE, progress)
|
| 60 |
r2 = arbitration_ruling(_CASE, progress)
|
| 61 |
assert r1.outcome == r2.outcome
|
|
|
|
| 80 |
|
| 81 |
def test_ruling_is_pure():
|
| 82 |
"""Same inputs, same outputs — required for reproducible benchmarks."""
|
| 83 |
+
progress = _progress(
|
| 84 |
+
["E1-ORDER-CONF", "E1-DELIVERY-SCAN", "E1-SIGNATURE", "E1-SUPPORT-ACK"]
|
| 85 |
+
)
|
| 86 |
r1 = arbitration_ruling(_CASE, progress)
|
| 87 |
r2 = arbitration_ruling(_CASE, progress)
|
| 88 |
assert r1 == r2
|
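Note on the arbitration tests above: the docstrings describe a 0.65 bar for a merchant win and a `case_id`-keyed coin flip for scores in (0.35, 0.65). One way such a reproducible ruling could be written is sketched below. The SHA-256 parity trick, the function name, the lower constant's name, and the handling of a score of exactly 0.35 are all assumptions; the actual `arbitration_ruling` is not shown in this diff.

```python
import hashlib

ARB_MERCHANT_WIN_THRESHOLD = 0.65   # "clears the 0.65 bar" per the test docstring
ARB_ISSUER_WIN_THRESHOLD = 0.35     # lower edge of the ambiguity band (name assumed)


def sketch_arbitration_outcome(case_id: str, score: float) -> str:
    """Return "merchant_wins" or "issuer_wins" as a pure function of the inputs."""
    if score >= ARB_MERCHANT_WIN_THRESHOLD:
        return "merchant_wins"
    if score <= ARB_ISSUER_WIN_THRESHOLD:
        return "issuer_wins"
    # Ambiguity band: derive the "coin" from a stable digest of the case id so
    # the same case always gets the same ruling across runs and machines.
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    return "merchant_wins" if digest[0] % 2 == 0 else "issuer_wins"
```

A stable digest is used here rather than Python's built-in `hash()`, because string hashing is randomized per process unless `PYTHONHASHSEED` is fixed, which would break the reproducibility that `test_ruling_is_pure` checks for.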
tests/test_env.py
CHANGED
|
@@ -98,11 +98,23 @@ def test_easy_case_can_be_won():
|
|
| 98 |
system_name="shipping",
|
| 99 |
)
|
| 100 |
)
|
| 101 |
env.step(
|
| 102 |
ChargebackOpsAction(
|
| 103 |
action_type="add_evidence",
|
| 104 |
case_id="CB-E1",
|
| 105 |
-
evidence_ids=[
|
| 106 |
)
|
| 107 |
)
|
| 108 |
env.step(
|
|
@@ -194,12 +206,17 @@ def test_full_three_round_cycle_ending_in_arbitration():
|
|
| 194 |
)
|
| 195 |
_drive_case_into_round_2(env)
|
| 196 |
|
| 197 |
obs = env.step(
|
| 198 |
ChargebackOpsAction(
|
| 199 |
action_type="respond_to_pre_arb",
|
| 200 |
case_id="CB-E1",
|
| 201 |
-
compelling_evidence_ids=["E1-SIGNATURE"],
|
| 202 |
-
note="Added signature
|
| 203 |
)
|
| 204 |
)
|
| 205 |
|
|
|
|
| 98 |
system_name="shipping",
|
| 99 |
)
|
| 100 |
)
|
| 101 |
+
env.step(
|
| 102 |
+
ChargebackOpsAction(
|
| 103 |
+
action_type="query_system",
|
| 104 |
+
case_id="CB-E1",
|
| 105 |
+
system_name="support",
|
| 106 |
+
)
|
| 107 |
+
)
|
| 108 |
env.step(
|
| 109 |
ChargebackOpsAction(
|
| 110 |
action_type="add_evidence",
|
| 111 |
case_id="CB-E1",
|
| 112 |
+
evidence_ids=[
|
| 113 |
+
"E1-ORDER-CONF",
|
| 114 |
+
"E1-DELIVERY-SCAN",
|
| 115 |
+
"E1-SIGNATURE",
|
| 116 |
+
"E1-SUPPORT-ACK",
|
| 117 |
+
],
|
| 118 |
)
|
| 119 |
)
|
| 120 |
env.step(
|
|
|
|
| 206 |
)
|
| 207 |
_drive_case_into_round_2(env)
|
| 208 |
|
| 209 |
+
env.step(
|
| 210 |
+
ChargebackOpsAction(
|
| 211 |
+
action_type="query_system", case_id="CB-E1", system_name="support"
|
| 212 |
+
)
|
| 213 |
+
)
|
| 214 |
obs = env.step(
|
| 215 |
ChargebackOpsAction(
|
| 216 |
action_type="respond_to_pre_arb",
|
| 217 |
case_id="CB-E1",
|
| 218 |
+
compelling_evidence_ids=["E1-SIGNATURE", "E1-SUPPORT-ACK"],
|
| 219 |
+
note="Added signature delivery proof and support ack for pre-arb.",
|
| 220 |
)
|
| 221 |
)
|
| 222 |
|
tests/test_escalation_roi.py
CHANGED
|
@@ -62,7 +62,7 @@ def test_pre_arb_accept_is_full_credit():
|
|
| 62 |
"""Winning on the pre-arbitration re-submit without filing arbitration is
|
| 63 |
the optimal path."""
|
| 64 |
progress = _progress(
|
| 65 |
-
attached=["E1-ORDER-CONF", "E1-DELIVERY-SCAN"],
|
| 66 |
round_number=2,
|
| 67 |
resolution_status="won_pre_arb",
|
| 68 |
)
|
|
@@ -74,7 +74,7 @@ def test_reward_positive_ev_escalation():
|
|
| 74 |
"""Strong packet → P(win)=1.0 × $129.99 > $250? No. Test with bigger amount."""
|
| 75 |
big_case = replace(_CASE, case_id="CB-BIG", amount=1000.0)
|
| 76 |
progress = _progress(
|
| 77 |
-
attached=["E1-ORDER-CONF", "E1-DELIVERY-SCAN"],
|
| 78 |
round_number=3,
|
| 79 |
resolution_status="won_arbitration",
|
| 80 |
arbitration_outcome="merchant_wins",
|
|
@@ -102,7 +102,7 @@ def test_penalise_concede_when_escalation_was_positive_ev():
|
|
| 102 |
"""Conceding with a strong packet + large amount leaves money on the table."""
|
| 103 |
big_case = replace(_CASE, case_id="CB-BIG2", amount=1000.0)
|
| 104 |
progress = _progress(
|
| 105 |
-
attached=["E1-ORDER-CONF", "E1-DELIVERY-SCAN"],
|
| 106 |
round_number=2,
|
| 107 |
resolution_status="conceded_pre_arb",
|
| 108 |
)
|
|
@@ -128,9 +128,9 @@ def test_fee_threshold_is_the_pivot():
|
|
| 128 |
assert ARB_FEE_PER_SIDE == 250.0
|
| 129 |
# P(win)=0.5 × $600 = 300 > 250 → escalate is rational
|
| 130 |
mid_case = replace(_CASE, case_id="CB-MID", amount=600.0)
|
| 131 |
-
# Score in ambiguity band by attaching two helpful-only ids.
|
| 132 |
progress = _progress(
|
| 133 |
-
attached=["E1-
|
| 134 |
round_number=3,
|
| 135 |
resolution_status="won_arbitration",
|
| 136 |
arbitration_outcome="merchant_wins",
|
|
|
|
| 62 |
"""Winning on the pre-arbitration re-submit without filing arbitration is
|
| 63 |
the optimal path."""
|
| 64 |
progress = _progress(
|
| 65 |
+
attached=["E1-ORDER-CONF", "E1-DELIVERY-SCAN", "E1-SIGNATURE", "E1-SUPPORT-ACK"],
|
| 66 |
round_number=2,
|
| 67 |
resolution_status="won_pre_arb",
|
| 68 |
)
|
|
|
|
| 74 |
"""Strong packet → P(win)=1.0 × $129.99 > $250? No. Test with bigger amount."""
|
| 75 |
big_case = replace(_CASE, case_id="CB-BIG", amount=1000.0)
|
| 76 |
progress = _progress(
|
| 77 |
+
attached=["E1-ORDER-CONF", "E1-DELIVERY-SCAN", "E1-SIGNATURE", "E1-SUPPORT-ACK"],
|
| 78 |
round_number=3,
|
| 79 |
resolution_status="won_arbitration",
|
| 80 |
arbitration_outcome="merchant_wins",
|
|
|
|
| 102 |
"""Conceding with a strong packet + large amount leaves money on the table."""
|
| 103 |
big_case = replace(_CASE, case_id="CB-BIG2", amount=1000.0)
|
| 104 |
progress = _progress(
|
| 105 |
+
attached=["E1-ORDER-CONF", "E1-DELIVERY-SCAN", "E1-SIGNATURE", "E1-SUPPORT-ACK"],
|
| 106 |
round_number=2,
|
| 107 |
resolution_status="conceded_pre_arb",
|
| 108 |
)
|
|
|
|
| 128 |
assert ARB_FEE_PER_SIDE == 250.0
|
| 129 |
# P(win)=0.5 × $600 = 300 > 250 → escalate is rational
|
| 130 |
mid_case = replace(_CASE, case_id="CB-MID", amount=600.0)
|
| 131 |
+
# Score in ambiguity band by attaching two helpful-only ids (no required).
|
| 132 |
progress = _progress(
|
| 133 |
+
attached=["E1-SIGNATURE", "E1-SUPPORT-ACK"],
|
| 134 |
round_number=3,
|
| 135 |
resolution_status="won_arbitration",
|
| 136 |
arbitration_outcome="merchant_wins",
|
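Note on the escalation ROI tests above: the pivot they exercise is the expected-value rule from the commit message, P(win) * amount compared against the $250 arbitration fee. A minimal sketch of that comparison follows. `ARB_FEE_PER_SIDE` and its value come from the test; the function name is hypothetical, and treating exact equality as "do not escalate" is an assumption based on the strict `>` used in the test comments.

```python
ARB_FEE_PER_SIDE = 250.0  # asserted in test_fee_threshold_is_the_pivot


def sketch_escalation_is_positive_ev(p_win: float, amount: float,
                                     fee: float = ARB_FEE_PER_SIDE) -> bool:
    """True when the expected recovery from arbitration exceeds the filing fee."""
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be a probability in [0, 1]")
    return p_win * amount > fee


# Figures taken from the test comments and docstrings above.
mid_case = sketch_escalation_is_positive_ev(0.5, 600.0)      # 300 > 250
small_case = sketch_escalation_is_positive_ev(1.0, 129.99)   # 129.99 < 250
```

This is why the tests swap in `amount=1000.0` or `amount=600.0` via `replace(_CASE, ...)`: at the fixture's $129.99, even a certain win does not cover the fee.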
tests/test_issuer.py
CHANGED
|
@@ -30,8 +30,10 @@ def _progress(attached: list[str], note: str | None = None) -> CaseProgress:
|
|
| 30 |
|
| 31 |
|
| 32 |
def test_representment_accepted_when_required_and_helpful_attached():
|
| 33 |
-
"""
|
| 34 |
-
progress = _progress(
|
|
|
|
|
|
|
| 35 |
score = evidence_strength_score(_CASE, progress)
|
| 36 |
assert score >= ROUND1_ACCEPT_THRESHOLD
|
| 37 |
|
|
@@ -49,24 +51,25 @@ def test_representment_rejected_when_packet_empty():
|
|
| 49 |
|
| 50 |
|
| 51 |
def test_harmful_evidence_drops_score():
|
| 52 |
-
"""Harmful evidence applies -0.3
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
| 56 |
)
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
if _CASE.harmful_evidence_ids:
|
| 60 |
-
with_harmful = evidence_strength_score(
|
| 61 |
-
_CASE,
|
| 62 |
-
_progress(
|
| 63 |
-
["E1-ORDER-CONF", "E1-DELIVERY-SCAN", _CASE.harmful_evidence_ids[0]]
|
| 64 |
-
),
|
| 65 |
-
)
|
| 66 |
-
assert with_harmful < helpful_only
|
| 67 |
-
else:
|
| 68 |
-
# Verify the upper bound holds without harmful evidence.
|
| 69 |
-
assert 0.0 <= helpful_only <= 1.0
|
| 70 |
|
| 71 |
|
| 72 |
def test_pre_arb_escalates_when_score_below_06():
|
|
@@ -79,16 +82,19 @@ def test_pre_arb_escalates_when_score_below_06():
|
|
| 79 |
|
| 80 |
def test_pre_arb_accepted_when_evidence_strong():
|
| 81 |
"""Pre-arb review accepts at the lower 0.60 bar once the packet is rebuilt."""
|
| 82 |
-
progress = _progress(
|
| 83 |
review = IssuerAgent().decide_review(_CASE, progress, round_number=2)
|
| 84 |
assert review.decision == IssuerDecision.ACCEPT
|
| 85 |
assert review.evidence_strength_score >= ROUND2_ACCEPT_THRESHOLD
|
| 86 |
|
| 87 |
|
| 88 |
def test_midpoint_band_uses_deterministic_fallback():
|
| 89 |
-
"""
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
|
|
|
|
|
| 30 |
|
| 31 |
|
| 32 |
def test_representment_accepted_when_required_and_helpful_attached():
|
| 33 |
+
"""Required + 2 helpful attached → score 0.8 → ACCEPT on first review."""
|
| 34 |
+
progress = _progress(
|
| 35 |
+
["E1-ORDER-CONF", "E1-DELIVERY-SCAN", "E1-SIGNATURE", "E1-SUPPORT-ACK"]
|
| 36 |
+
)
|
| 37 |
score = evidence_strength_score(_CASE, progress)
|
| 38 |
assert score >= ROUND1_ACCEPT_THRESHOLD
|
| 39 |
|
|
|
|
| 51 |
|
| 52 |
|
| 53 |
def test_harmful_evidence_drops_score():
|
| 54 |
+
"""Harmful evidence applies -0.3 per piece, no cap. Verified on a case
|
| 55 |
+
that actually carries harmful items so the assertion is not vacuous."""
|
| 56 |
+
fraud_case = get_task("fraud_signal_ambiguity").cases[0]
|
| 57 |
+
assert fraud_case.harmful_evidence_ids, "fixture must expose harmful evidence"
|
| 58 |
+
|
| 59 |
+
base_attached = list(fraud_case.required_evidence_ids) + list(
|
| 60 |
+
fraud_case.helpful_evidence_ids[:1]
|
| 61 |
+
)
|
| 62 |
+
clean_score = evidence_strength_score(fraud_case, _progress(base_attached))
|
| 63 |
+
one_harmful = evidence_strength_score(
|
| 64 |
+
fraud_case,
|
| 65 |
+
_progress(base_attached + [fraud_case.harmful_evidence_ids[0]]),
|
| 66 |
+
)
|
| 67 |
+
two_harmful = evidence_strength_score(
|
| 68 |
+
fraud_case,
|
| 69 |
+
_progress(base_attached + list(fraud_case.harmful_evidence_ids[:2])),
|
| 70 |
)
|
| 71 |
+
assert one_harmful == max(0.0, clean_score - 0.3)
|
| 72 |
+
assert two_harmful == max(0.0, clean_score - 0.6)
|
| 73 |
|
| 74 |
|
| 75 |
def test_pre_arb_escalates_when_score_below_06():
|
|
|
|
| 82 |
|
| 83 |
def test_pre_arb_accepted_when_evidence_strong():
|
| 84 |
"""Pre-arb review accepts at the lower 0.60 bar once the packet is rebuilt."""
|
| 85 |
+
progress = _progress(
|
| 86 |
+
["E1-ORDER-CONF", "E1-DELIVERY-SCAN", "E1-SIGNATURE", "E1-SUPPORT-ACK"]
|
| 87 |
+
)
|
| 88 |
review = IssuerAgent().decide_review(_CASE, progress, round_number=2)
|
| 89 |
assert review.decision == IssuerDecision.ACCEPT
|
| 90 |
assert review.evidence_strength_score >= ROUND2_ACCEPT_THRESHOLD
|
| 91 |
|
| 92 |
|
| 93 |
def test_midpoint_band_uses_deterministic_fallback():
|
| 94 |
+
"""Required + 1 helpful → score 0.6, lands in ambiguity band, accepts at midpoint."""
|
| 95 |
+
progress = _progress(["E1-ORDER-CONF", "E1-DELIVERY-SCAN", "E1-SIGNATURE"])
|
| 96 |
+
score = evidence_strength_score(_CASE, progress)
|
| 97 |
+
assert 0.4 < score < ROUND1_ACCEPT_THRESHOLD
|
| 98 |
+
assert score >= ROUND1_MIDPOINT_FALLBACK
|
| 99 |
+
review = IssuerAgent().decide_review(_CASE, progress, round_number=1)
|
| 100 |
+
assert review.decision == IssuerDecision.ACCEPT
|
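The score values quoted in the docstrings above (required + 2 helpful gives 0.8, required + 1 helpful gives 0.6, each harmful item subtracts 0.3 with a floor at 0.0) are consistent with a simple additive model. The sketch below is an assumption made only to reproduce those numbers: `sketch_strength`, its weights (0.4 for full required coverage, 0.2 per helpful item), the all-or-nothing treatment of required evidence, and the `E1-HARMFUL-EXAMPLE` id are all hypothetical and are not taken from `evidence_strength_score`.

```python
def sketch_strength(
    attached: list[str],
    required: list[str],
    helpful: list[str],
    harmful: list[str],
    required_weight: float = 0.4,   # assumed weight, not from the source
    helpful_weight: float = 0.2,    # assumed weight, not from the source
    harmful_penalty: float = 0.3,   # matches the "-0.3 per piece" docstring
) -> float:
    """Additive evidence score clamped to [0, 1]; illustrative only."""
    attached_set = set(attached)
    score = 0.0
    # Assumption: required evidence counts only when every required id is attached.
    if required and set(required) <= attached_set:
        score += required_weight
    score += helpful_weight * len(attached_set & set(helpful))
    # The penalty is applied per harmful item with no cap, then floored at 0.0.
    score -= harmful_penalty * len(attached_set & set(harmful))
    return round(min(1.0, max(0.0, score)), 4)


REQUIRED = ["E1-ORDER-CONF", "E1-DELIVERY-SCAN"]
HELPFUL = ["E1-SIGNATURE", "E1-SUPPORT-ACK"]
HARMFUL = ["E1-HARMFUL-EXAMPLE"]  # hypothetical id for illustration

full_packet = sketch_strength(REQUIRED + HELPFUL, REQUIRED, HELPFUL, HARMFUL)
midpoint = sketch_strength(REQUIRED + HELPFUL[:1], REQUIRED, HELPFUL, HARMFUL)
tainted = sketch_strength(REQUIRED + HELPFUL[:1] + HARMFUL, REQUIRED, HELPFUL, HARMFUL)
print(full_packet, midpoint, tainted)
```

The rounding step is there because sums such as `0.4 + 0.2` are not exact in binary floating point; a test that compares scores with `==` depends on both sides being computed the same way.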
tests/test_llm_note_judge.py
ADDED
|
@@ -0,0 +1,108 @@
|
|
| 1 |
+
"""Unit tests for the optional LLM-backed note judge.
|
| 2 |
+
|
| 3 |
+
The deterministic note scorer is pinned via ``test_grader.py`` and the
|
| 4 |
+
``EvidenceQuality`` / ``PacketValidity`` rubric tests. These tests cover the
|
| 5 |
+
LLM-backed wrapper specifically: opt-in activation through env var,
|
| 6 |
+
fallback on parse failure, and that the rubric still respects the
|
| 7 |
+
contest-only gate.
|
| 8 |
+
"""
|
| 9 |
+
|
| 10 |
+
from __future__ import annotations
|
| 11 |
+
|
| 12 |
+
import os
|
| 13 |
+
from typing import Any
|
| 14 |
+
|
| 15 |
+
from evaluation import llm_note_judge
|
| 16 |
+
from evaluation.llm_note_judge import LLMNoteJudgeRubric, llm_score_note
|
| 17 |
+
from evaluation.rubrics import (
|
| 18 |
+
CASE_DIMENSION_NAMES,
|
| 19 |
+
CaseRubric,
|
| 20 |
+
GradingContext,
|
| 21 |
+
NoteQualityRubric,
|
| 22 |
+
)
|
| 23 |
+
from scenarios.simulation import CaseProgress, get_task
|
| 24 |
+
|
| 25 |
+
|
| 26 |
+
_TASK = get_task("goods_not_received_easy")
|
| 27 |
+
_CASE = _TASK.cases[0]
|
| 28 |
+
|
| 29 |
+
|
| 30 |
+
def _strong_progress() -> CaseProgress:
|
| 31 |
+
p = CaseProgress()
|
| 32 |
+
p.attached_evidence_ids = [
|
| 33 |
+
"E1-ORDER-CONF",
|
| 34 |
+
"E1-DELIVERY-SCAN",
|
| 35 |
+
"E1-SIGNATURE",
|
| 36 |
+
]
|
| 37 |
+
p.final_resolution = "contest"
|
| 38 |
+
p.representment_note = (
|
| 39 |
+
"Order confirmation and carrier delivery confirmation establish "
|
| 40 |
+
"fulfillment per policy."
|
| 41 |
+
)
|
| 42 |
+
return p
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
def _ctx(progress: CaseProgress | None = None) -> GradingContext:
|
| 46 |
+
return GradingContext(case=_CASE, progress=progress or _strong_progress(), step_count=5)
|
| 47 |
+
|
| 48 |
+
|
| 49 |
+
def test_default_rubric_is_deterministic_when_flag_unset(monkeypatch):
|
| 50 |
+
"""Without USE_LLM_NOTE_JUDGE the case rubric uses the deterministic scorer."""
|
| 51 |
+
monkeypatch.delenv("USE_LLM_NOTE_JUDGE", raising=False)
|
| 52 |
+
rubric = CaseRubric()
|
| 53 |
+
note_idx = CASE_DIMENSION_NAMES.index("note_quality")
|
| 54 |
+
note_child = rubric.aggregator._rubric_list[note_idx]
|
| 55 |
+
assert isinstance(note_child, NoteQualityRubric)
|
| 56 |
+
|
| 57 |
+
|
| 58 |
+
def test_rubric_swaps_to_llm_judge_when_flag_set(monkeypatch):
|
| 59 |
+
"""With USE_LLM_NOTE_JUDGE=1 the case rubric installs the LLM-backed one."""
|
| 60 |
+
monkeypatch.setenv("USE_LLM_NOTE_JUDGE", "1")
|
| 61 |
+
rubric = CaseRubric()
|
| 62 |
+
note_idx = CASE_DIMENSION_NAMES.index("note_quality")
|
| 63 |
+
note_child = rubric.aggregator._rubric_list[note_idx]
|
| 64 |
+
assert isinstance(note_child, LLMNoteJudgeRubric)
|
| 65 |
+
|
| 66 |
+
|
| 67 |
+
def test_llm_judge_falls_back_when_provider_returns_none(monkeypatch):
|
| 68 |
+
"""No API keys → llm_score_note returns None → fallback to deterministic."""
|
| 69 |
+
monkeypatch.setattr(llm_note_judge, "llm_score_note", lambda case, progress: None)
|
| 70 |
+
rubric = LLMNoteJudgeRubric()
|
| 71 |
+
score = rubric(_ctx(), None)
|
| 72 |
+
assert 0.0 < score <= 1.0
|
| 73 |
+
|
| 74 |
+
|
| 75 |
+
def test_llm_judge_uses_provider_score_when_available(monkeypatch):
|
| 76 |
+
"""When the provider returns a score, the rubric returns it as-is."""
|
| 77 |
+
monkeypatch.setattr(llm_note_judge, "llm_score_note", lambda case, progress: 0.42)
|
| 78 |
+
rubric = LLMNoteJudgeRubric()
|
| 79 |
+
score = rubric(_ctx(), None)
|
| 80 |
+
assert score == 0.42
|
| 81 |
+
|
| 82 |
+
|
| 83 |
+
def test_llm_judge_returns_zero_when_not_contest(monkeypatch):
|
| 84 |
+
"""Non-contest cases skip note grading entirely."""
|
| 85 |
+
monkeypatch.setattr(llm_note_judge, "llm_score_note", lambda case, progress: 0.99)
|
| 86 |
+
progress = CaseProgress()
|
| 87 |
+
progress.final_resolution = "accept_chargeback"
|
| 88 |
+
progress.representment_note = "doesn't matter"
|
| 89 |
+
rubric = LLMNoteJudgeRubric()
|
| 90 |
+
assert rubric(_ctx(progress), None) == 0.0
|
| 91 |
+
|
| 92 |
+
|
| 93 |
+
def test_llm_judge_returns_zero_when_no_note(monkeypatch):
|
| 94 |
+
"""Empty note → zero, regardless of LLM availability."""
|
| 95 |
+
monkeypatch.setattr(
|
| 96 |
+
llm_note_judge, "llm_score_note", lambda case, progress: 0.99
|
| 97 |
+
)
|
| 98 |
+
progress = _strong_progress()
|
| 99 |
+
progress.representment_note = ""
|
| 100 |
+
rubric = LLMNoteJudgeRubric()
|
| 101 |
+
assert rubric(_ctx(progress), None) == 0.0
|
| 102 |
+
|
| 103 |
+
|
| 104 |
+
def test_provider_chain_returns_none_when_no_keys(monkeypatch):
|
| 105 |
+
"""Empty env → walks the chain without ever calling OpenAI → None."""
|
| 106 |
+
for var in ("OPENROUTER_API_KEY", "GOOGLE_API_KEY", "GROQ_API_KEY"):
|
| 107 |
+
monkeypatch.delenv(var, raising=False)
|
| 108 |
+
assert llm_score_note(_CASE, _strong_progress()) is None
|
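The behaviours these tests pin down (zero for non-contest resolutions, zero for an empty note, the provider score returned as-is, and a deterministic fallback when the provider returns `None`) can be summarised in one small function. This is a sketch of the pattern only, under the assumption that the wrapper checks its gates before calling the provider; `judge_note` and the scorer callables are hypothetical stand-ins, not the `LLMNoteJudgeRubric` implementation.

```python
from typing import Callable, Optional


def judge_note(
    resolution: str,
    note: str,
    llm_scorer: Callable[[str], Optional[float]],
    deterministic_scorer: Callable[[str], float],
) -> float:
    """Gate, then try the LLM scorer, then fall back to the deterministic one."""
    # Contest-only gate: other resolutions skip note grading entirely.
    if resolution != "contest":
        return 0.0
    # An empty note scores zero regardless of LLM availability.
    if not note.strip():
        return 0.0
    llm_score = llm_scorer(note)
    # None signals "no provider available" or "response could not be parsed".
    if llm_score is None:
        return deterministic_scorer(note)
    return llm_score


def no_provider(note: str) -> Optional[float]:
    return None  # stands in for an empty provider chain


def fixed_provider(note: str) -> Optional[float]:
    return 0.42  # stands in for a successful LLM call


def keyword_scorer(note: str) -> float:
    return 0.5  # placeholder for the deterministic note scorer


print(judge_note("contest", "Delivery confirmed by carrier.", fixed_provider, keyword_scorer))
print(judge_note("contest", "Delivery confirmed by carrier.", no_provider, keyword_scorer))
print(judge_note("accept_chargeback", "doesn't matter", fixed_provider, keyword_scorer))
```

Passing the scorers in as arguments keeps the sketch testable without environment variables or monkeypatching; the tests above instead replace `llm_note_judge.llm_score_note` at module level.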