| # ChargebackOps Agent: Complete Technical Reference | |
| This document explains every aspect of the ChargebackOps agent -- what problem it solves, how it thinks, how it is scored, and why every design decision was made. | |
| --- | |
| ## Table of Contents | |
| - [The Problem](#the-problem) | |
| - [The Use Case](#the-use-case) | |
| - [How the Environment Works](#how-the-environment-works) | |
| - [How the Agent Works](#how-the-agent-works) | |
| - [The Three-Tier Decision Pipeline](#the-three-tier-decision-pipeline) | |
| - [Reason Code Strategies](#reason-code-strategies) | |
| - [Multi-Case Triage](#multi-case-triage) | |
| - [Evidence Handling](#evidence-handling) | |
| - [Representment Notes](#representment-notes) | |
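| - [The Grading System](#the-grading-system) | |
| - [The Issuer Agent](#the-issuer-agent) | |
| - [Arbitration](#arbitration) | |
| - [LLM Integration](#llm-integration) | |
| - [Key Optimizations](#key-optimizations) | |
| - [File Map](#file-map) | |
| - [Performance](#performance) | |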
| --- | |
| ## The Problem | |
| When a customer disputes a credit card charge, the card network (Visa, Mastercard) initiates a **chargeback** against the merchant. The merchant loses the transaction amount immediately. To recover the funds, the merchant must build a **representment package** -- a bundle of evidence proving the charge was legitimate -- and submit it before a hard deadline. | |
| This is not a simple yes/no decision. Each dispute has: | |
| - A **reason code** (why the customer disputes: fraud, goods not received, product not as described, etc.) | |
| - A **deadline** (fixed number of steps before the case auto-closes against the merchant) | |
| - **Evidence** scattered across 6 internal merchant systems (orders, payment, shipping, support, refunds, risk) | |
| - Some evidence is **helpful**, some is **required**, and some is **harmful** (weakens the case if included) | |
| - A correct **strategy** that depends on the evidence available (contest, accept the chargeback, or issue a refund) | |
| A human dispute analyst handles 50-200 cases per day. They must triage by urgency, query the right systems, avoid attaching damaging evidence, and submit within deadline. Mistakes are expensive: a lost $500 chargeback plus a $25 network fee per case. | |
| **ChargebackOps turns this into a measurable agent benchmark.** The agent must do exactly what a human analyst does, but programmatically, under step-budget constraints, with deterministic scoring. | |
| --- | |
| ## The Use Case | |
| ChargebackOps is built for the [OpenEnv](https://meta-pytorch.org/OpenEnv/index.html) evaluation framework. It is a **simulated merchant dispute resolution environment** where an AI agent acts as the dispute analyst. | |
| **What the agent receives:** | |
| - A queue of 1-6 open dispute cases (5-6 at nightmare difficulty) | |
| - A step budget (10-20 actions total, ~2.4 steps/case at nightmare) | |
| - Per-case deadlines (must resolve before step N) | |
| **What the agent must do:** | |
| - Select and focus on one case at a time | |
| - Query internal merchant systems to retrieve evidence | |
| - Decide whether to contest, accept, or refund each case | |
| - Attach the right evidence (and avoid harmful artifacts) | |
| - Write a representment note explaining why the dispute should be reversed | |
| - Submit or resolve each case before its deadline | |
| - Manage the step budget across all cases when the queue needs more steps than are available | |
| **What the agent is scored on:** | |
| - Did it choose the correct strategy? (20% of score) | |
| - Did it gather the right evidence? (15%) | |
| - Is the evidence packet complete and clean? (10%) | |
| - Did it meet the deadline? (10%) | |
| - Was it efficient (no wasted steps)? (10%) | |
| - Did the resolution match the strategy? (10%) | |
| - Is the representment note well-written? (5%) | |
| - Was escalation EV-rational? (20%; escalate iff `P(win) · amount > $250 fee`) | |
| After the merchant submits a representment, a scripted **IssuerAgent** reviews the packet and returns one of three decisions: `accept`, `request_more_evidence` (triggering pre-arbitration with compelling evidence), or `escalate_to_arbitration`. The merchant can also choose to escalate at round 2 instead of rebuilding the packet, or accept an arbitration loss to cap fees. Network arbitration is a deterministic resolver: the loser eats the dispute amount plus a $250 fee, the winner is reimbursed minus their own $250 fee. | |
| --- | |
| ## How the Environment Works | |
| The environment follows the OpenEnv `reset()` / `step()` / `state()` contract. | |
| ### Lifecycle | |
| ``` | |
| reset(task_id) → Observation | |
| step(action) → Observation | |
| state() → State (includes grader report when done) | |
| ``` | |
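The contract above can be exercised with a short driver loop. The sketch below is illustrative only: `StubEnv` and its trimmed-down `Observation` are hypothetical stand-ins, not the real classes in `server/chargeback_ops_environment.py`, and the policy is any callable that maps an observation to an action dict.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Observation:
    """Trimmed-down observation; the real one carries more fields."""
    steps_remaining: int
    done: bool
    reward: float
    result: str


class StubEnv:
    """Hypothetical stand-in that mimics the reset/step/state contract."""

    def __init__(self, budget: int = 3) -> None:
        self._budget = budget
        self._steps_left = budget

    def reset(self, task_id: str) -> Observation:
        self._steps_left = self._budget
        return Observation(self._steps_left, False, 0.0, f"reset {task_id}")

    def step(self, action: dict) -> Observation:
        # Every action costs exactly one step.
        self._steps_left -= 1
        done = self._steps_left <= 0
        return Observation(self._steps_left, done, 0.0, f"ran {action['action']}")

    def state(self) -> dict:
        return {
            "done": self._steps_left <= 0,
            "steps_used": self._budget - self._steps_left,
        }


def run_episode(env, policy: Callable[[Observation], dict], task_id: str) -> dict:
    """Reset, step until the episode is done, then read the final state."""
    obs = env.reset(task_id)
    while not obs.done:
        obs = env.step(policy(obs))
    return env.state()
```

In the real environment the final `state()` call is where the grader report becomes available.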
| ### Observation | |
| Each observation contains: | |
| | Field | Type | Description | | |
| |---|---|---| | |
| | `queue` | list | All cases with status, reason_code, amount, steps_until_deadline | | |
| | `visible_case` | object or null | The currently selected case with full detail | | |
| | `steps_remaining` | int | Steps left before episode ends | | |
| | `done` | bool | Whether the episode is complete | | |
| | `reward` | float | Immediate reward from the last action | | |
| | `result` | string | Human-readable outcome of the last action | | |
| ### The Visible Case | |
| When a case is selected, `visible_case` exposes: | |
| | Field | Description | | |
| |---|---| | |
| | `case_id` | Unique identifier | | |
| | `reason_code` | Why the customer disputed (e.g., `goods_not_received`) | | |
| | `amount` | Transaction amount in dollars | | |
| | `current_strategy` | Currently set strategy (null if not set) | | |
| | `policy` | Policy guidance (null until `retrieve_policy` is called) | | |
| | `systems_revealed` | Which merchant systems have been queried | | |
| | `retrieved_evidence` | Evidence items revealed by queries | | |
| | `attached_evidence` | Evidence currently attached to the representment package | | |
| | `inspection_notes` | Analyst notes (null until `inspect_case` is called) | | |
| ### Action Space (12 Actions) | |
| **Round 1: Representment** | |
| | Action | Arguments | Cost | What It Does | | |
| |---|---|---|---| | |
| | `select_case` | case_id | 1 step | Focus on a case from the queue | | |
| | `inspect_case` | case_id | 1 step | Reveal analyst inspection notes (+0.04 reward) | | |
| | `query_system` | case_id, system_name | 1 step | Pull evidence from orders/payment/shipping/support/refunds/risk | | |
| | `retrieve_policy` | case_id | 1 step | Get reason-code-specific guidance and required evidence list | | |
| | `add_evidence` | case_id, evidence_ids | 1 step | Attach evidence to the representment package | | |
| | `remove_evidence` | case_id, evidence_ids | 1 step | Remove evidence (useful for cleaning harmful attachments) | | |
| | `set_strategy` | case_id, strategy | 1 step | Choose contest / accept_chargeback / issue_refund | | |
| | `submit_representment` | case_id, note | 1 step | Submit the contest package (requires strategy = contest) | | |
| | `resolve_case` | case_id, strategy | 1 step | Close a non-contest case (accept or refund) | | |
| **Round 2/3: Pre-Arbitration & Arbitration** | |
| | Action | Arguments | Cost | What It Does | | |
| |---|---|---|---| | |
| | `respond_to_pre_arb` | case_id, compelling_evidence_ids | 1 step | Attach compelling evidence and resubmit at round 2 (Issuer accept threshold drops to 0.60) | | |
| | `escalate_to_arbitration` | case_id | 1 step | Skip rebuilding the packet, pay $250 fee, push to network arbitration | | |
| | `accept_arbitration_loss` | case_id | 1 step | Concede at round 2/3 to cap fees | | |
| Every action costs exactly 1 step. There is no free action. The agent must be deliberate about every step it takes. | |
| ### Reward Signals | |
| The environment returns immediate rewards after each action: | |
| | Event | Reward | | |
| |---|---| | |
| | Select an open case | +0.02 | | |
| | Inspect a case (first time) | +0.04 | | |
| | Query a new system with helpful evidence | +0.06 to +0.08 | | |
| | Query a new system with no useful evidence | -0.01 to +0.01 | | |
| | Query an already-queried system (duplicate) | -0.03 | | |
| | Attach helpful evidence | +0.08 per piece | | |
| | Attach harmful evidence | -0.08 per piece | | |
| | Attach neutral evidence | +0.01 | | |
| | Remove harmful evidence | +0.05 | | |
| | Remove helpful evidence | -0.03 | | |
| | Set optimal strategy | +0.10 | | |
| | Set acceptable strategy | +0.03 | | |
| | Set wrong strategy | -0.08 | | |
| | Submit a strong representment on time | +0.20 | | |
| | Submit after deadline | -0.20 | | |
| | Submit with missing required evidence | -0.18 | | |
| | Submit with harmful evidence attached | -0.15 | | |
| | Contest a case that shouldn't be contested | -0.12 | | |
| | Resolve with optimal strategy | +0.16 | | |
| | Resolve with acceptable strategy | +0.06 | | |
| | Resolve with wrong strategy | -0.12 | | |
| | Invalid action | -0.12 | | |
| These rewards are shaping signals. The final score comes from the deterministic grader, not reward accumulation. | |
| --- | |
| ## How the Agent Works | |
| The agent is implemented in `runners/baseline_runner.py`. It is a **heuristic-first, LLM-augmented** policy. The heuristic handles ~90% of decisions deterministically. The LLM is only called when the heuristic encounters genuine ambiguity (multiple meaningfully different candidates). | |
| ### Why Heuristic-First? | |
| 1. **Reliability**: Heuristic decisions never fail, never timeout, never cost money. | |
| 2. **Speed**: No network round-trip for obvious moves. | |
| 3. **Determinism**: Same input always produces same output (important for reproducibility). | |
| 4. **Budget**: LLM calls cost tokens and have rate limits. The agent makes 6-20 decisions per episode. | |
| The LLM acts as a **tiebreaker** when the heuristic produces multiple viable candidates that differ in action type. This hybrid approach gets the best of both worlds. | |
| --- | |
| ## The Three-Tier Decision Pipeline | |
| Every step, the agent runs this pipeline: | |
| ### Tier 1: `candidate_actions(observation)` | |
| Reads the current observation and generates a list of `CandidateAction` objects -- the legal moves the agent considers. This is the core intelligence of the agent. | |
| The function applies these checks in strict priority order: | |
| 1. **No case selected?** Generate `select_case` candidates sorted by triage priority. | |
| 2. **Current case resolved?** Switch to an open case. | |
| 3. **Harmful evidence attached?** Immediately generate `remove_evidence` and return. This fires before any other case-level action because a single harmful item zeroes the packet_validity score (10% of total) and costs 0.25 on evidence_quality (15% of total). | |
| 4. **Deadline <= 1 step?** Emergency submit or resolve. No time for anything else. | |
| 5. **Budget too tight to contest?** If there aren't enough steps to run the full contest path (minimum 5 steps: policy + query + attach + strategy + submit), or if this is the lowest-value case in a multi-case triage, fast-concede with `issue_refund`. | |
| 6. **Budget pressure (steps <= cases * 2)?** If the inferred strategy is accept/refund, resolve immediately. | |
| 7. **Reason code handler**: Dispatches to a reason-code-specific handler that generates the appropriate sequence of queries, evidence attachments, strategy setting, and submission. | |
| ### Tier 2: `_obvious_next_action(observation, candidates)` | |
| Before calling any LLM, checks if the choice is trivial: | |
| - Only 1 candidate? Take it. | |
| - All candidates have the same action type? Take the first. | |
| - One candidate targets a case with much tighter deadline? Take it. | |
| If obvious, the LLM is skipped entirely. | |
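A minimal sketch of the three Tier 2 checks follows. The `Candidate` shape and the `deadline_gap` threshold are assumptions made for illustration; the source only says "much tighter deadline" and does not state the actual cutoff used by `_obvious_next_action`.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate shape, reduced to what the checks need."""
    action_type: str
    case_id: str
    steps_until_deadline: int


def obvious_next_action(
    candidates: Sequence[Candidate], deadline_gap: int = 2
) -> Optional[Candidate]:
    """Return a candidate when the choice is trivial, else None."""
    if not candidates:
        return None
    # Check 1: only one candidate.
    if len(candidates) == 1:
        return candidates[0]
    # Check 2: every candidate has the same action type.
    if len({c.action_type for c in candidates}) == 1:
        return candidates[0]
    # Check 3: one candidate's deadline is much tighter than the runner-up's.
    # deadline_gap is an assumed threshold, not taken from the source.
    ranked = sorted(candidates, key=lambda c: c.steps_until_deadline)
    if ranked[1].steps_until_deadline - ranked[0].steps_until_deadline >= deadline_gap:
        return ranked[0]
    return None
```

Returning `None` is the signal to fall through to Tier 3.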
| ### Tier 3: LLM or `_heuristic_pick(candidates)` | |
| When Tier 2 returns None (genuine ambiguity): | |
| - **With LLM**: Sends the observation summary and candidate list as a JSON prompt. The model returns `{"candidate_index": N, "rationale": "..."}`. On failure, walks the fallback chain (OpenRouter -> Gemini -> Groq). | |
| - **Without LLM**: `_heuristic_pick()` returns the first candidate (the heuristic already sorted by priority). | |
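The sketch below shows one way to validate the model's JSON reply and fall back to the first candidate when the reply is unusable. The function name `pick_candidate_index` is hypothetical; only the `candidate_index` field and the first-candidate fallback come from the description above.

```python
import json


def pick_candidate_index(raw_response: str, num_candidates: int) -> int:
    """Parse the LLM reply; fall back to index 0 (the heuristic's top pick)."""
    try:
        payload = json.loads(raw_response)
        index = payload["candidate_index"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Malformed JSON, missing key, or a non-object payload.
        return 0
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(index, bool) or not isinstance(index, int):
        return 0
    # Out-of-range indices are treated like any other failure.
    if not 0 <= index < num_candidates:
        return 0
    return index
```

Because the heuristic already sorts candidates by priority, index 0 is a safe default for every failure mode.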
| --- | |
| ## Reason Code Strategies | |
| The agent handles 6 reason code families, each with a different workflow: | |
| ### `goods_not_received` (Deterministic: contest) | |
| The customer claims they never received the product. The merchant almost always has delivery proof. | |
| **Steps**: select -> query orders -> query shipping -> attach delivery evidence -> set_strategy contest -> submit representment | |
| **Systems queried**: orders, shipping | |
| **Typical evidence**: Order confirmation, delivery scan, tracking number | |
| **Strategy**: Always contest (delivery proof is definitive) | |
| ### `fraud_cnp` (Non-deterministic: contest or accept) | |
| Card-not-present fraud. The customer claims they didn't authorize the transaction. This is the most nuanced reason code -- sometimes the merchant has strong evidence (prior good orders, account match), sometimes they don't. | |
| **Steps**: select -> retrieve_policy -> query risk + support (+ orders if budget allows) -> attach non-harmful evidence -> set_strategy per policy -> submit or resolve | |
| **Systems queried**: risk, support, orders (optional under tight budget) | |
| **Typical evidence**: Risk assessment, prior order linkage, account verification | |
| **Harmful evidence**: AVS mismatch, CVV mismatch (proves the card data didn't fully match) | |
| **Strategy**: Contest if strong evidence exists, accept_chargeback if evidence is weak | |
| ### `credit_not_processed` (Deterministic: issue_refund) | |
| The customer claims a refund was promised but never issued. The correct response is to issue the refund. | |
| **Steps**: select -> set_strategy issue_refund -> resolve_case (3 steps total) | |
| **Strategy**: Always issue_refund (cheapest to resolve, no contest needed) | |
| ### `duplicate_processing` (Deterministic: issue_refund) | |
| The customer was charged twice. The correct response is to refund the duplicate. | |
| **Steps**: select -> set_strategy issue_refund -> resolve_case (3 steps total) | |
| **Strategy**: Always issue_refund | |
| ### `product_not_as_described` (Non-deterministic: contest or accept) | |
| The customer claims the product didn't match its description. Success depends on whether the merchant has listing accuracy proof and whether the customer bypassed the return process. | |
| **Steps**: select -> retrieve_policy -> query orders + support (+ shipping if deadline allows) -> attach listing/return evidence -> contest or accept per guidance | |
| **Systems queried**: orders, support, shipping (optional) | |
| **Strategy**: Contest if listing proof is strong, accept_chargeback if not supportable | |
| ### `service_not_provided` (Non-deterministic: contest or accept) | |
| The customer claims a service was never delivered. Success depends on completion records and customer acknowledgment. | |
| **Steps**: select -> retrieve_policy -> query support (+ orders if deadline allows) -> attach completion evidence -> contest or accept per guidance | |
| **Systems queried**: support, orders (optional) | |
| **Strategy**: Contest if service completion proof exists, accept_chargeback otherwise | |
| --- | |
| ## Multi-Case Triage | |
| When the agent has multiple open cases and the total estimated step cost exceeds the budget, it uses a triage algorithm: | |
| ### Step Cost Estimates | |
| | Reason Code | Est. Steps | Notes | | |
| |---|---|---| | |
| | `goods_not_received` | 6 | select + 2 queries + attach + strategy + submit | | |
| | `credit_not_processed` | 3 | select + strategy + resolve | | |
| | `duplicate_processing` | 3 | select + strategy + resolve | | |
| | `fraud_cnp` | 8 | select + policy + 2-3 queries + attach + strategy + submit | | |
| | `product_not_as_described` | 8 | select + policy + 2-3 queries + attach + strategy + submit | | |
| | `service_not_provided` | 7 | select + policy + 2 queries + attach + strategy + submit | | |
| ### Triage Algorithm | |
| ``` | |
| 1. If total_estimated_cost > steps_remaining: | |
| Sort cases: deterministic-strategy codes first, then by amount descending. | |
| This ensures cheap, guaranteed-outcome cases are handled first, | |
| and the highest-value non-deterministic cases get remaining budget. | |
| 2. When processing each case, check: | |
- Is steps_remaining < 5? → Fast-concede (can't even minimally contest).
- Is this the lowest-value case and total_cost > budget? → Fast-concede.
- Otherwise → Full contest or policy-guided resolution.
| 3. Never interrupt a near-complete case: | |
| - If the current case has evidence attached and is 1-2 steps from | |
| submission, finish it before switching to another case's deadline. | |
| ``` | |
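Step 1 of the algorithm can be sketched as a single sort. This is an illustration under assumptions: cases are plain dicts with `reason_code` and `amount` keys, and the queue order is left untouched when the budget is sufficient (the source does not say what happens in that case).

```python
# Estimated steps per reason code, copied from the table above.
EST_STEPS = {
    "goods_not_received": 6,
    "credit_not_processed": 3,
    "duplicate_processing": 3,
    "fraud_cnp": 8,
    "product_not_as_described": 8,
    "service_not_provided": 7,
}

# Reason codes whose optimal strategy never varies.
DETERMINISTIC = {"goods_not_received", "credit_not_processed", "duplicate_processing"}


def triage_order(cases: list, steps_remaining: int) -> list:
    """Deterministic codes first, then amount descending, when over budget."""
    total_cost = sum(EST_STEPS[c["reason_code"]] for c in cases)
    if total_cost <= steps_remaining:
        return list(cases)  # assumption: keep queue order when budget suffices
    return sorted(
        cases,
        key=lambda c: (c["reason_code"] not in DETERMINISTIC, -c["amount"]),
    )
```

The sort key is a tuple, so `False` (deterministic) sorts ahead of `True`, and the negated amount puts larger disputes first within each group.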
| ### Why This Ordering Works | |
| - **credit_not_processed/duplicate_processing** cost 3 steps and always get optimal score. Handle them first to free budget. | |
| - **goods_not_received** costs 6 steps and always contests. Handle next. | |
| - **fraud_cnp/product_not_as_described/service_not_provided** cost 7-8 steps and may need to concede. Handle last -- if budget runs out, conceding these with `issue_refund` (an acceptable fallback) still earns 35% strategy correctness. | |
| --- | |
| ## Evidence Handling | |
| ### Harmful Evidence Detection | |
| The agent maintains a set of 15 negative-signal keywords derived from real chargeback dispute patterns: | |
| ``` | |
| mismatch, failed, declined, suspicious, flagged, fraud risk, | |
| unauthorized, rejected, invalid, expired, violation, | |
| non-compliant, discrepancy, inconsistent, unverified | |
| ``` | |
| Every evidence item's title and summary are scanned. If any harmful keyword is found, the evidence is: | |
| 1. **Never attached** (ranked 999 in the priority sort, excluded from `add_evidence` calls) | |
| 2. **Removed if already attached** (a `remove_evidence` action is generated immediately before any other action) | |
| ### Evidence Priority Ranking | |
| Non-harmful evidence is ranked by keyword relevance: | |
| | Rank | Keywords | Example | | |
| |---|---|---| | |
| | 0 (highest) | signature, completion, booking, listing | "Delivery signature scan" | | |
| | 1 | duplicate, delivery, prior, account, authenticated | "Prior good order linkage" | | |
| | 2 | return policy, refund, cancel, confirmation, cancellation | "Return policy documentation" | | |
| | 4 (default) | anything else | "Internal memo" | | |
| | 999 (excluded) | mismatch, failed, declined, suspicious, flagged, fraud risk, unauthorized, rejected, invalid, expired, violation, non-compliant, discrepancy, inconsistent, unverified | "AVS mismatch report" | | |
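The ranking table can be expressed as a small lookup. The sketch assumes case-insensitive substring matching over the title and summary; the source says the text is "scanned" but does not specify the matching rule, and `evidence_rank` is a hypothetical name.

```python
HARMFUL_KEYWORDS = (
    "mismatch", "failed", "declined", "suspicious", "flagged", "fraud risk",
    "unauthorized", "rejected", "invalid", "expired", "violation",
    "non-compliant", "discrepancy", "inconsistent", "unverified",
)

RANKED_KEYWORDS = (
    (0, ("signature", "completion", "booking", "listing")),
    (1, ("duplicate", "delivery", "prior", "account", "authenticated")),
    (2, ("return policy", "refund", "cancel", "confirmation", "cancellation")),
)


def evidence_rank(title: str, summary: str = "") -> int:
    """Lower is better; 999 marks evidence that must never be attached."""
    text = f"{title} {summary}".lower()
    # Harmful keywords win over everything else, including helpful titles.
    if any(keyword in text for keyword in HARMFUL_KEYWORDS):
        return 999
    for rank, keywords in RANKED_KEYWORDS:
        if any(keyword in text for keyword in keywords):
            return rank
    return 4  # default rank for anything unmatched
```

Scanning the summary as well as the title is what catches items whose title sounds helpful but whose content is damaging. Note that plain substring matching misses inflected forms such as "discrepancies".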
| ### Attachment Strategy | |
| The agent attaches **all** non-harmful retrieved evidence in a single `add_evidence` call. This maximizes the evidence_quality score, which rewards `helpful_attached / total_helpful`. | |
| --- | |
| ## Representment Notes | |
| When the agent submits a contest, it generates a representment note. The grader scores notes on 4 dimensions: | |
| | Dimension | Weight | What Earns Points | | |
| |---|---|---| | |
| | Substance | 20% | Note has >= 5 words | | |
| | Policy claims coverage | 50% | Note mentions keywords from `case.policy_requirements` (e.g., "order confirmation", "carrier delivery") | | |
| | Evidence coherence | 15% | Note references attached evidence IDs (e.g., "E1-ORDER-CONF") | | |
| | Harmful mention penalty | -15% each | Note contains words like "mismatch", "failed", "declined" | | |
| ### How the Agent Builds Notes | |
| 1. Start with a **reason-code-specific template** that uses policy requirement language: | |
| - goods_not_received: "Order confirmation and carrier delivery confirmation establish fulfillment..." | |
| - fraud_cnp: "Prior good order linkage and customer account confirmation..." (never mentions "mismatch") | |
| - product_not_as_described: "Product listing verification confirms..." | |
| - service_not_provided: "Service completion record and customer acknowledgment..." | |
| 2. If policy was retrieved, append the policy requirements directly: | |
| - "Evidence covers: order confirmation, carrier delivery confirmation." | |
| 3. Append evidence IDs for coherence scoring: | |
| - "Supporting evidence: E1-ORDER-CONF, E1-DELIVERY-SCAN." | |
| 4. Truncate to 500 characters. | |
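The four steps above can be sketched as a short assembly function. The template is passed in by the caller because the real templates are only quoted in part here; `build_note` and its parameter names are hypothetical.

```python
from typing import Sequence


def build_note(
    template: str,
    policy_requirements: Sequence[str] = (),
    evidence_ids: Sequence[str] = (),
    max_chars: int = 500,
) -> str:
    """Assemble a representment note: template, policy terms, evidence IDs."""
    parts = [template.strip()]
    if policy_requirements:
        # Step 2: echo policy requirement language for claims coverage.
        parts.append("Evidence covers: " + ", ".join(policy_requirements) + ".")
    if evidence_ids:
        # Step 3: reference attached evidence IDs for coherence scoring.
        parts.append("Supporting evidence: " + ", ".join(evidence_ids) + ".")
    # Step 4: hard truncation to the character limit.
    return " ".join(parts)[:max_chars]
```

A plain slice can cut the note mid-word or mid-ID; a production version might prefer to drop trailing evidence IDs rather than truncate one.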
| --- | |
| ## The Grading System | |
| After all cases are resolved (or the step budget is exhausted), the grader scores each case across 8 dimensions. Each dimension is an OpenEnv `Rubric` subclass defined in `evaluation/rubrics.py`; they compose into a per-case `WeightedSum` (wrapped in a `Gate(CaseAbandonedRubric)` deadline guard) and an episode-level `ChargebackOpsEpisodeRubric` that is wired into `env.rubric`. `evaluation/grading.py` keeps the legacy `score_case` / `grade_episode` API as a thin adapter over the rubric tree. | |
| ### Strategy Correctness (20%) | |
| | Outcome | Score | | |
| |---|---| | |
| | Chose the optimal strategy | 1.0 | | |
| | Chose an acceptable fallback | 0.35 | | |
| | Chose the wrong strategy | 0.0 | | |
| "Optimal" and "acceptable" strategies are defined per case by the task scenario. For example, `goods_not_received` optimal is always "contest" with no acceptable fallback. `fraud_cnp` optimal might be "contest" with "accept_chargeback" as acceptable, or vice versa. | |
| ### Evidence Quality (15%) | |
| For **contest** cases: | |
| ``` | |
| quality = 0.7 * (required_attached / required_total) | |
| + 0.3 * (helpful_attached / helpful_total) | |
| - 0.25 * harmful_attached_count | |
| ``` | |
| For **non-contest** cases where optimal strategy is also non-contest: | |
| - 1.0 if no evidence was attached (clean concession) | |
| - 0.7 if evidence was attached (unnecessary work) | |
| For **non-contest** cases where optimal was contest: | |
| - 0.15 (the agent abandoned evidence gathering for a contestable case) | |
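The three branches translate directly into code. This is a sketch rather than the rubric implementation: the function names are hypothetical, a zero denominator is treated as full credit (an assumption), and no clamping is applied because the source does not mention one.

```python
def contest_evidence_quality(
    required_attached: int,
    required_total: int,
    helpful_attached: int,
    helpful_total: int,
    harmful_attached: int,
) -> float:
    """Evidence quality for a contested case, per the formula above."""
    # Assumption: an empty requirement set counts as fully satisfied.
    required_ratio = required_attached / required_total if required_total else 1.0
    helpful_ratio = helpful_attached / helpful_total if helpful_total else 1.0
    return 0.7 * required_ratio + 0.3 * helpful_ratio - 0.25 * harmful_attached


def non_contest_evidence_quality(optimal_is_contest: bool, evidence_attached: bool) -> float:
    """Evidence quality when the agent conceded (accept or refund)."""
    if optimal_is_contest:
        return 0.15  # conceded a case that should have been contested
    return 0.7 if evidence_attached else 1.0
```

For example, attaching both required items and one of two helpful items with nothing harmful gives `0.7 * 1.0 + 0.3 * 0.5 = 0.85`.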
| ### Packet Validity (10%) | |
| Binary, all-or-nothing: | |
| - **1.0** if ALL required evidence is attached AND zero harmful evidence is attached | |
| - **0.0** otherwise | |
| This is the strictest dimension. Missing one required piece or having one harmful piece zeroes it out. | |
| ### Deadline Compliance (10%) | |
| Binary: | |
| - **1.0** if the case was resolved at or before the deadline step | |
| - **0.0** if resolved after the deadline or never resolved | |
| ### Efficiency (10%) | |
| ``` | |
| efficiency = 1.0 - min(0.9, (duplicate_queries + invalid_actions) * 0.1 + submit_attempts * 0.05) | |
| ``` | |
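| The agent loses 0.1 per duplicate system query or invalid action, and 0.05 per submit attempt. Because the deduction is capped at 0.9, this base formula alone bottoms out at 0.1; the additional penalties below can lower the final value further, down to the minimum of 0.0. | |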
| Additional penalties for shallow operational behaviour: | |
| - **Over-querying a concedable case**: -0.15 per system queried beyond the 2nd when the agent concedes a case whose optimal strategy is also non-contest. Querying 4+ systems before conceding is wasteful. | |
| - **Late policy retrieval**: -0.08 when policy is retrieved but the case is resolved with a concession that matches the optimal non-contest strategy. The policy step was wasted. | |
| - **Early correct concession bonus**: +0.10 when the agent correctly concedes a case (matching optimal) within 3 steps. Rewards recognising a bad case quickly. | |
| ### Outcome Quality (10%) | |
| | Outcome | Score | | |
| |---|---| | |
| | Final resolution matches optimal strategy | 1.0 | | |
| | Final resolution is an acceptable fallback | 0.4 | | |
| | Final resolution is wrong | 0.0 | | |
| ### Note Quality (5%) | |
| Only scored for contest cases with a representment note. See [Representment Notes](#representment-notes) for the scoring breakdown. | |
| ### Escalation ROI (20%) | |
| Encodes the economic rule that escalating to network arbitration is rational only when | |
`P(win) × dispute_amount > $250 fee`. Conceding a positive-EV contestable case (where
| `amount > $250` and the optimal strategy is `contest`) is penalised. Escalating a | |
| negative-EV case (low P(win) or low amount) is also penalised. This is the dimension that | |
| keeps `concede_all` from being a free 0.6+ score. | |
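The rule itself is a one-line comparison. The sketch below only encodes the inequality; how the rubric estimates `P(win)` is not described here, so it is taken as an input, and `should_escalate` is a hypothetical name.

```python
ARBITRATION_FEE = 250.0  # per-side network arbitration fee from the text


def should_escalate(p_win: float, amount: float, fee: float = ARBITRATION_FEE) -> bool:
    """Escalation is EV-rational only when expected recovery exceeds the fee."""
    return p_win * amount > fee
```

Since `p_win` is at most 1.0, a dispute of $250 or less can never satisfy the inequality, which is why the text singles out `amount > $250` for contestable cases.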
| ### Deadline Gate | |
| Before the WeightedSum scores anything, `Gate(CaseAbandonedRubric)` checks whether the case | |
| was left unresolved past its deadline. If yes, the entire case score is hard-zeroed. This | |
| prevents the agent from gaming the rubric by ignoring nightmare-tier cases and still | |
| collecting partial credit on the dimensions it did touch. | |
| ### Final Score Calculation | |
| ``` | |
| case_score = 0.20 * strategy_correctness | |
| + 0.15 * evidence_quality | |
| + 0.10 * packet_validity | |
| + 0.10 * deadline_compliance | |
| + 0.10 * efficiency | |
| + 0.10 * outcome_quality | |
| + 0.05 * note_quality | |
| + 0.20 * escalation_roi | |
| case_score = 0.0 if case_abandoned else case_score # deadline gate | |
| weighted_case_score = case_score * case_weight | |
| episode_score = sum(weighted_case_scores) / sum(case_weights) | |
| ``` | |
| Case weights are determined by financial impact (amount and difficulty). The episode score normalizes to `[0.0, 1.0]`. | |
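The calculation above, written as runnable Python. This mirrors the pseudocode rather than the actual `WeightedSum` / `Gate` rubric classes; the dict-based inputs and function names are assumptions.

```python
WEIGHTS = {
    "strategy_correctness": 0.20,
    "evidence_quality": 0.15,
    "packet_validity": 0.10,
    "deadline_compliance": 0.10,
    "efficiency": 0.10,
    "outcome_quality": 0.10,
    "note_quality": 0.05,
    "escalation_roi": 0.20,
}


def case_score(dimensions: dict, abandoned: bool = False) -> float:
    """Weighted sum of the 8 dimensions, hard-zeroed by the deadline gate."""
    if abandoned:
        return 0.0
    return sum(weight * dimensions[name] for name, weight in WEIGHTS.items())


def episode_score(cases: list) -> float:
    """cases is a list of (case_score, case_weight) pairs."""
    total_weight = sum(weight for _, weight in cases)
    if total_weight == 0:
        return 0.0
    return sum(score * weight for score, weight in cases) / total_weight
```

For instance, a perfect case with weight 2.0 and a half-credit case with weight 1.0 average to `(1.0 * 2.0 + 0.5 * 1.0) / 3.0`, roughly 0.833.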
| --- | |
| ## The Issuer Agent | |
| After every `submit_representment`, a scripted `IssuerAgent` (see `scenarios/issuer_model.py`) | |
| reviews the packet and returns one of three decisions: | |
| | Decision | Score band (round 1) | Score band (round 2) | What happens | | |
| |---|---|---|---| | |
| | `accept` | ≥ 0.70 | ≥ 0.60 | Merchant wins the dispute, case closes positive | |
| | `request_more_evidence` | 0.40-0.70 | < 0.60 | Round 2: merchant gets one more shot with compelling evidence | |
| | `escalate_to_arbitration` | < 0.40 | (only if merchant escalates) | Round 3: case goes to network arbitration | | |
| The score itself comes from `evidence_strength_score`: | |
| ``` | |
| score = 0.4 (if all required evidence attached) | |
      + min(0.4, 0.2 × helpful_attached)
      - 0.3 × harmful_attached # uncapped
      + 0.1 (if note has ≥ 2 policy keywords)
      + min(0.30, 0.15 × pre_arb_unique) # round 2 only
| ``` | |
| In the round-1 ambiguity band (0.40-0.70), the deterministic fallback uses the midpoint rule: | |
`accept` at score ≥ 0.55, otherwise `request_more_evidence`. An optional LLM softening layer
| can override this midpoint when an API key is set; with no key it falls back to the | |
| deterministic rule so offline benchmarks stay reproducible. | |
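The scoring formula and the deterministic round-1 decision can be sketched as follows. The function signatures are hypothetical and do not reflect the real `IssuerAgent` interface in `scenarios/issuer_model.py`; the round-1 decision applies the band table together with the midpoint rule exactly as stated above.

```python
def evidence_strength_score(
    all_required_attached: bool,
    helpful_attached: int,
    harmful_attached: int,
    note_policy_keywords: int,
    pre_arb_unique: int = 0,
    round_number: int = 1,
) -> float:
    """Evidence-strength score as described in the formula above."""
    score = 0.4 if all_required_attached else 0.0
    score += min(0.4, 0.2 * helpful_attached)
    score -= 0.3 * harmful_attached  # uncapped penalty
    if note_policy_keywords >= 2:
        score += 0.1
    if round_number == 2:
        score += min(0.30, 0.15 * pre_arb_unique)  # round 2 only
    return score


def issuer_decision_round1(score: float) -> str:
    """Deterministic round-1 decision (no LLM softening)."""
    if score >= 0.70:
        return "accept"
    if score >= 0.40:
        # Ambiguity band: midpoint rule.
        return "accept" if score >= 0.55 else "request_more_evidence"
    return "escalate_to_arbitration"
```

One harmful attachment costs 0.3, which is enough to drop a packet with all required evidence and one helpful item (0.6) down to 0.3 and out of the ambiguity band entirely.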
| ## Arbitration | |
| Network arbitration is a pure function (see `scenarios/arbitration.py`). Given the same case ID | |
and packet state, the ruling is always the same: inside the ambiguity band it seeds a coin flip
from a SHA-256 hash of the case ID. The bands:
| | Evidence-strength score | Ruling | | |
| |---|---| | |
| | ≥ 0.65 | `merchant_wins` | |
| | ≤ 0.35 | `issuer_wins` | |
| | (0.35, 0.65) | seeded coin flip on `sha256(case_id)` | | |
| Both sides pay a $250 fee regardless of outcome. The winner is reimbursed the dispute amount | |
| minus their $250 fee; the loser eats the dispute amount plus the $250 fee. The | |
| `EscalationROIRubric` reads the final P&L and scores whether the agent's escalate / concede | |
| decision was EV-rational ex ante. | |
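A minimal sketch of a deterministic resolver with these bands follows. The way the hash is turned into a coin flip (parity of the first digest byte) is an assumption made for illustration; the source only says the flip is seeded from `sha256(case_id)`.

```python
import hashlib


def arbitration_ruling(case_id: str, score: float) -> str:
    """Pure function: same case ID and score always give the same ruling."""
    if score >= 0.65:
        return "merchant_wins"
    if score <= 0.35:
        return "issuer_wins"
    # Ambiguity band: derive a reproducible coin flip from the case ID.
    # Using the first byte's parity is an assumed mapping, not the real one.
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    return "merchant_wins" if digest[0] % 2 == 0 else "issuer_wins"
```

Hashing the case ID instead of calling a random number generator keeps the ruling reproducible across runs and processes without any shared seed state.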
| ## LLM Integration | |
| The agent supports 5 LLM providers through OpenAI-compatible clients: | |
| | Provider | Model | Base URL | | |
| |---|---|---| | |
| | OpenRouter | openai/gpt-oss-120b | openrouter.ai/api/v1 | | |
| | Google Gemini | gemini-2.5-flash | generativelanguage.googleapis.com/v1beta/openai/ | | |
| | Groq | llama-3.3-70b-versatile | api.groq.com/openai/v1 | | |
| | OpenAI | gpt-4.1-mini | api.openai.com/v1 | | |
| | Anthropic | claude-sonnet-4 | (compatible gateway) | | |
| ### Fallback Chain | |
| ``` | |
| Primary (configured in .env) → OpenRouter → Google Gemini → Groq → Heuristic | |
| ``` | |
| If the primary provider fails (timeout, rate limit, connection error), the agent automatically tries the next provider in the chain. If all providers fail, it falls back to `_heuristic_pick()`. | |
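The chain can be sketched as a loop over callables. Each provider is represented here by a hypothetical `(name, call)` pair; the real agent uses OpenAI-compatible clients, and it would likely catch narrower exception types than the blanket `Exception` used in this sketch.

```python
from typing import Callable, Optional, Sequence, Tuple

Provider = Tuple[str, Callable[[str], str]]


def first_successful(
    providers: Sequence[Provider], prompt: str
) -> Tuple[Optional[str], Optional[str]]:
    """Try providers in order; return (name, reply) from the first that works."""
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception:
            # Timeout, rate limit, connection error: move on to the next one.
            continue
    # Every provider failed; the caller falls back to the heuristic pick.
    return None, None
```

A `(None, None)` result corresponds to the final "Heuristic" link in the chain.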
| ### What the LLM Sees | |
| ```json | |
| { | |
| "queue_summary": "2 open cases, 8 steps remaining", | |
| "visible_case": "CB-G1, fraud_cnp, $480, deadline in 3 steps", | |
| "candidates": [ | |
| {"index": 0, "action": "submit_representment", "summary": "Submit the contest package"}, | |
| {"index": 1, "action": "query_system orders", "summary": "Query orders for more evidence"}, | |
| {"index": 2, "action": "select_case CB-G2", "summary": "Switch to case with deadline in 1 step"} | |
| ] | |
| } | |
| ``` | |
| ### What the LLM Returns | |
| ```json | |
| { | |
| "candidate_index": 0, | |
| "rationale": "CB-G1 has sufficient evidence and contesting before deadline takes priority over gathering more evidence." | |
| } | |
| ``` | |
| ### Configuration | |
| | Env Variable | Default | Purpose | | |
| |---|---|---| | |
| | `BASELINE_PROVIDER` | openrouter | Primary LLM provider | | |
| | `BASELINE_MODEL` | openai/gpt-oss-120b | Model to use | | |
| | `BASELINE_REQUEST_TIMEOUT_SECONDS` | 15 | Per-call timeout | | |
| | `PROVIDER_RATE_LIMIT_RETRIES` | 2 | Retry count on rate limits | | |
| | `PROVIDER_RETRY_BACKOFF_SECONDS` | 1.0 | Backoff between retries | | |
| | `MAX_PROVIDER_RESPONSE_TOKENS` | 200 | Max tokens for LLM response | | |
| | `STRICT_LLM_MODE` | false | If true, fail instead of falling back to heuristic | | |
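One way to load these settings with the documented defaults is sketched below. `BaselineConfig` and `load_config` are hypothetical names, and the set of strings treated as "true" for `STRICT_LLM_MODE` is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BaselineConfig:
    provider: str
    model: str
    timeout_seconds: float
    rate_limit_retries: int
    retry_backoff_seconds: float
    max_response_tokens: int
    strict_llm_mode: bool


def load_config(env: Optional[Mapping[str, str]] = None) -> BaselineConfig:
    """Read settings from the environment, using the defaults in the table."""
    env = os.environ if env is None else env
    return BaselineConfig(
        provider=env.get("BASELINE_PROVIDER", "openrouter"),
        model=env.get("BASELINE_MODEL", "openai/gpt-oss-120b"),
        timeout_seconds=float(env.get("BASELINE_REQUEST_TIMEOUT_SECONDS", "15")),
        rate_limit_retries=int(env.get("PROVIDER_RATE_LIMIT_RETRIES", "2")),
        retry_backoff_seconds=float(env.get("PROVIDER_RETRY_BACKOFF_SECONDS", "1.0")),
        max_response_tokens=int(env.get("MAX_PROVIDER_RESPONSE_TOKENS", "200")),
        # Assumption: these spellings count as true; anything else is false.
        strict_llm_mode=env.get("STRICT_LLM_MODE", "false").strip().lower()
        in {"1", "true", "yes"},
    )
```

Accepting an explicit mapping makes the loader easy to test without touching the process environment.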
| --- | |
| ## Key Optimizations | |
| ### 1. Deterministic Strategy Inference | |
| For reason codes where the optimal strategy never varies (`goods_not_received` = contest, `credit_not_processed` / `duplicate_processing` = issue_refund), the agent skips `retrieve_policy` entirely. This saves 1 step per case. | |
| ### 2. Deadline-Aware Query Limiting | |
| When the remaining steps before deadline can't accommodate all planned queries, the agent reduces the number of systems queried: | |
| - `product_not_as_described`: drops from 3 systems (orders, support, shipping) to 2 (orders, support) | |
| - `fraud_cnp`: drops from 3 systems (risk, support, orders) to 2 (risk, support) | |
| - `service_not_provided`: drops from 2 systems (orders, support) to 1 (support) | |
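The reduction can be sketched as dropping the last, optional system from each plan. The trigger condition is an assumption: `tail_steps` (policy, attach, strategy, submit) is a hypothetical estimate of the steps still needed after querying, and the real cutoff in the agent may differ.

```python
# Planned systems per reason code, ordered so the optional one comes last.
PLANNED_SYSTEMS = {
    "product_not_as_described": ("orders", "support", "shipping"),
    "fraud_cnp": ("risk", "support", "orders"),
    "service_not_provided": ("support", "orders"),
}


def systems_to_query(reason_code: str, steps_until_deadline: int, tail_steps: int = 4) -> tuple:
    """Return the full plan if it fits before the deadline, else drop one system."""
    planned = PLANNED_SYSTEMS[reason_code]
    # tail_steps is an assumed count of the non-query steps still required.
    if steps_until_deadline >= len(planned) + tail_steps:
        return planned
    return planned[:-1]
```

Ordering each tuple so the optional system is last keeps the trimming logic to a single slice.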
| ### 3. Near-Completion Protection | |
| When the current case has evidence attached and is 1-2 steps from submission, the agent does NOT switch to handle another case's deadline. Finishing the current case is almost always higher-value than interrupting. | |
| ### 4. Harmful Evidence Cleanup | |
| Before any submit, the agent checks for harmful evidence in the attached set. If found, it generates a `remove_evidence` action immediately. This prevents the -0.25 per-piece evidence_quality penalty and the packet_validity = 0.0 failure. | |
| ### 5. Budget-Aware Note Generation | |
| Representment notes are generated with direct references to policy requirement keywords and evidence IDs, maximizing the note_quality score (policy claims coverage 50% + evidence coherence 15%). | |
| ### 6. Adversarial Evidence (Hard/Nightmare) | |
| At hard and nightmare difficulty, the case generator injects **adversarial evidence**: items whose titles sound helpful ("Delivery verification report", "Account verification summary") but whose content is harmful (GPS discrepancies, prior non-receipt claims, failed 3D Secure challenges). This tests whether the agent reads beyond titles and inspects evidence content before attaching. | |
| ### 7. Nightmare Difficulty | |
| Nightmare tasks push the step budget to its limit: 5-6 cases with ~2.4 steps per case. The agent must triage aggressively: fast-conceding weak cases, handling deterministic codes first, and accepting that some cases will go unresolved. This tier specifically tests prioritisation under extreme resource pressure. | |
| --- | |
| ## File Map | |
| | File | Purpose | Lines | | |
| |---|---|---| | |
| | `runners/baseline_runner.py` | The agent: decision pipeline, candidate generation, LLM integration, representment notes | ~1100 | | |
| | `server/chargeback_ops_environment.py` | The environment: step/reset/state, action execution, reward computation | ~500 | | |
| | `evaluation/rubrics.py` | OpenEnv `Rubric` subclasses for all 8 scoring dimensions, composed via `WeightedSum` + `Gate(CaseAbandonedRubric)` | ~400 | | |
| | `scenarios/issuer_model.py` | Scripted `IssuerAgent`: evidence-strength scoring, threshold bands, optional LLM softening | ~250 | | |
| | `scenarios/arbitration.py` | Deterministic network arbitration resolver with $250 per-side fee | ~120 | | |
| | `evaluation/grading.py` | Legacy `score_case` / `grade_episode` adapter that delegates to the rubric tree | ~120 | | |
| | `scenarios/simulation.py` | Task definitions, case progress tracking, evidence metadata | ~600 | | |
| | `core/models.py` | Pydantic models for actions, observations, state, grading | ~600 | | |
| | `runners/inference.py` | OpenEnv-compatible inference entry point with provider fallback | ~200 | | |
| | `inference.py` | Root re-export for submission contract | ~10 | | |
| | `scenarios/case_generator.py` | Parametric task generator with seeded RNG | ~700 | | |
| | `scenarios/iso_adapter.py` | Converts ISO 20022 CASR.003 records to environment cases | ~160 | | |
| | `connectors/stripe_sandbox.py` | Maps Stripe test-mode disputes to environment cases | ~280 | | |
| | `evaluation/agent_brutal_audit.py` | 126-episode evaluation across all data sources | ~300 | | |
| | `server/app.py` | FastAPI routes: /reset, /step, /state, /tasks, /baseline, /grader, /results, /demo | ~200 | | |
| | `server/demo_ui.py` | Gradio live demo UI with step-by-step episode playback | ~150 | | |
| | `core/episode_store.py` | Thread-safe storage with JSONL file persistence | ~60 | | |
| | `core/client.py` | OpenEnv WebSocket client | ~100 | | |
| --- | |
| ## Performance | |
| Tested across the 11-task headline benchmark (4 showcase + 7 seeded holdout) and a 28-task | |
| multi-seed grid: | |
| | Policy | Headline (11) | Multi-seed (28) | Delta vs naive | | |
| |---|---|---|---| | |
| | naive (empty packet) | 0.000 | 0.000 | n/a | |
| | concede_all | 0.567 | 0.563 | +0.567 | | |
| | escalate_all | 0.773 | 0.765 | +0.773 | | |
| | heuristic | **0.773** | **0.765** | **+0.773** | | |
| The difficulty curve runs 0.97 → 0.88 → 0.70 → 0.51 across easy / medium / hard / nightmare on | |
the multi-seed grid: monotone and well-separated. The `Gate(CaseAbandonedRubric)` wrapper
hard-zeros abandoned cases, and `EscalationROIRubric` (20%) penalises both conceding positive-EV
contestable cases and escalating negative-EV ones; together they kill the concede-everything
| shortcut. `escalate_all` ties heuristic at the headline because the merchant's round-1 packet | |
| is strong enough on most tasks that the pre-arb branch never fires. See `docs/RESULTS.md` for | |
| full per-task numbers, the rubric tree, and reproduction commands. | |