	# ChargebackOps Agent: Complete Technical Reference

	This document explains every aspect of the ChargebackOps agent -- what problem it solves, how it thinks, how it is scored, and why every design decision was made.

	---

	## Table of Contents

	- [The Problem](#the-problem)
	- [The Use Case](#the-use-case)
	- [How the Environment Works](#how-the-environment-works)
	- [How the Agent Works](#how-the-agent-works)
	- [The Three-Tier Decision Pipeline](#the-three-tier-decision-pipeline)
	- [Reason Code Strategies](#reason-code-strategies)
	- [Multi-Case Triage](#multi-case-triage)
	- [Evidence Handling](#evidence-handling)
	- [Representment Notes](#representment-notes)
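	- [The Grading System](#the-grading-system)
	- [The Issuer Agent](#the-issuer-agent)
	- [Arbitration](#arbitration)
	- [LLM Integration](#llm-integration)
	- [Key Optimizations](#key-optimizations)
	- [File Map](#file-map)
	- [Performance](#performance)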

	---

	## The Problem

	When a customer disputes a credit card charge, the card network (Visa, Mastercard) initiates a chargeback against the merchant. The merchant loses the transaction amount immediately. To recover the funds, the merchant must build a representment package -- a bundle of evidence proving the charge was legitimate -- and submit it before a hard deadline.

	This is not a simple yes/no decision. Each dispute has:

	- A reason code (why the customer disputes: fraud, goods not received, product not as described, etc.)
	- A deadline (fixed number of steps before the case auto-closes against the merchant)
	- Evidence scattered across 6 internal merchant systems (orders, payment, shipping, support, refunds, risk)
	- Some evidence is helpful, some is required, and some is harmful (weakens the case if included)
	- A correct strategy that depends on the evidence available (contest, accept the chargeback, or issue a refund)

	A human dispute analyst handles 50-200 cases per day. They must triage by urgency, query the right systems, avoid attaching damaging evidence, and submit within deadline. Mistakes are expensive: a lost $500 chargeback plus a $25 network fee per case.

	ChargebackOps turns this into a measurable agent benchmark. The agent must do exactly what a human analyst does, but programmatically, under step-budget constraints, with deterministic scoring.

	---

	## The Use Case

	ChargebackOps is built for the [OpenEnv](https://meta-pytorch.org/OpenEnv/index.html) evaluation framework. It is a simulated merchant dispute resolution environment where an AI agent acts as the dispute analyst.

	What the agent receives:
	- A queue of 1-6 open dispute cases (5-6 at nightmare difficulty)
	- A step budget (10-20 actions total, ~2.4 steps/case at nightmare)
	- Per-case deadlines (must resolve before step N)

	What the agent must do:
	- Select and focus on one case at a time
	- Query internal merchant systems to retrieve evidence
	- Decide whether to contest, accept, or refund each case
	- Attach the right evidence (and avoid harmful artifacts)
	- Write a representment note explaining why the dispute should be reversed
	- Submit or resolve each case before its deadline
	- Manage step budget across all cases when there are more cases than steps

	What the agent is scored on:
	- Did it choose the correct strategy? (20% of score)
	- Did it gather the right evidence? (15%)
	- Is the evidence packet complete and clean? (10%)
	- Did it meet the deadline? (10%)
	- Was it efficient (no wasted steps)? (10%)
	- Did the resolution match the strategy? (10%)
	- Is the representment note well-written? (5%)
	- Was escalation EV-rational? (20% — escalate iff `P(win)·amount > $250 fee`)

	After the merchant submits a representment, a scripted IssuerAgent reviews the packet and returns one of three decisions: `accept`, `request_more_evidence` (triggering pre-arbitration with compelling evidence), or `escalate_to_arbitration`. The merchant can also choose to escalate at round 2 instead of rebuilding the packet, or accept an arbitration loss to cap fees. Network arbitration is a deterministic resolver: the loser eats the dispute amount plus a $250 fee, the winner is reimbursed minus their own $250 fee.

	---

	## How the Environment Works

	The environment follows the OpenEnv `reset()` / `step()` / `state()` contract.

	### Lifecycle

	```
	reset(task_id) → Observation
	step(action) → Observation
	state() → State (includes grader report when done)
	```

	### Observation

	Each observation contains:

	\| Field \| Type \| Description \|
	\|---\|---\|---\|
	\| `queue` \| list \| All cases with status, reason_code, amount, steps_until_deadline \|
	\| `visible_case` \| object or null \| The currently selected case with full detail \|
	\| `steps_remaining` \| int \| Steps left before episode ends \|
	\| `done` \| bool \| Whether the episode is complete \|
	\| `reward` \| float \| Immediate reward from the last action \|
	\| `result` \| string \| Human-readable outcome of the last action \|

	### The Visible Case

	When a case is selected, `visible_case` exposes:

	\| Field \| Description \|
	\|---\|---\|
	\| `case_id` \| Unique identifier \|
	\| `reason_code` \| Why the customer disputed (e.g., `goods_not_received`) \|
	\| `amount` \| Transaction amount in dollars \|
	\| `current_strategy` \| Currently set strategy (null if not set) \|
	\| `policy` \| Policy guidance (null until `retrieve_policy` is called) \|
	\| `systems_revealed` \| Which merchant systems have been queried \|
	\| `retrieved_evidence` \| Evidence items revealed by queries \|
	\| `attached_evidence` \| Evidence currently attached to the representment package \|
	\| `inspection_notes` \| Analyst notes (null until `inspect_case` is called) \|

	### Action Space (12 Actions)

	Round 1 — Representment

	\| Action \| Arguments \| Cost \| What It Does \|
	\|---\|---\|---\|---\|
	\| `select_case` \| case_id \| 1 step \| Focus on a case from the queue \|
	\| `inspect_case` \| case_id \| 1 step \| Reveal analyst inspection notes (+0.04 reward) \|
	\| `query_system` \| case_id, system_name \| 1 step \| Pull evidence from orders/payment/shipping/support/refunds/risk \|
	\| `retrieve_policy` \| case_id \| 1 step \| Get reason-code-specific guidance and required evidence list \|
	\| `add_evidence` \| case_id, evidence_ids \| 1 step \| Attach evidence to the representment package \|
	\| `remove_evidence` \| case_id, evidence_ids \| 1 step \| Remove evidence (useful for cleaning harmful attachments) \|
	\| `set_strategy` \| case_id, strategy \| 1 step \| Choose contest / accept_chargeback / issue_refund \|
	\| `submit_representment` \| case_id, note \| 1 step \| Submit the contest package (requires strategy = contest) \|
	\| `resolve_case` \| case_id, strategy \| 1 step \| Close a non-contest case (accept or refund) \|

	Round 2/3 — Pre-Arbitration & Arbitration

	\| Action \| Arguments \| Cost \| What It Does \|
	\|---\|---\|---\|---\|
	\| `respond_to_pre_arb` \| case_id, compelling_evidence_ids \| 1 step \| Attach compelling evidence and resubmit at round 2 (Issuer accept threshold drops to 0.60) \|
	\| `escalate_to_arbitration` \| case_id \| 1 step \| Skip rebuilding the packet, pay $250 fee, push to network arbitration \|
	\| `accept_arbitration_loss` \| case_id \| 1 step \| Concede at round 2/3 to cap fees \|

	Every action costs exactly 1 step. There is no free action. The agent must be deliberate about every step it takes.

	### Reward Signals

	The environment returns immediate rewards after each action:

	\| Event \| Reward \|
	\|---\|---\|
	\| Select an open case \| +0.02 \|
	\| Inspect a case (first time) \| +0.04 \|
	\| Query a new system with helpful evidence \| +0.06 to +0.08 \|
	\| Query a new system with no useful evidence \| -0.01 to +0.01 \|
	\| Query an already-queried system (duplicate) \| -0.03 \|
	\| Attach helpful evidence \| +0.08 per piece \|
	\| Attach harmful evidence \| -0.08 per piece \|
	\| Attach neutral evidence \| +0.01 \|
	\| Remove harmful evidence \| +0.05 \|
	\| Remove helpful evidence \| -0.03 \|
	\| Set optimal strategy \| +0.10 \|
	\| Set acceptable strategy \| +0.03 \|
	\| Set wrong strategy \| -0.08 \|
	\| Submit a strong representment on time \| +0.20 \|
	\| Submit after deadline \| -0.20 \|
	\| Submit with missing required evidence \| -0.18 \|
	\| Submit with harmful evidence attached \| -0.15 \|
	\| Contest a case that shouldn't be contested \| -0.12 \|
	\| Resolve with optimal strategy \| +0.16 \|
	\| Resolve with acceptable strategy \| +0.06 \|
	\| Resolve with wrong strategy \| -0.12 \|
	\| Invalid action \| -0.12 \|

	These rewards are shaping signals. The final score comes from the deterministic grader, not reward accumulation.

	---

	## How the Agent Works

	The agent is implemented in `runners/baseline_runner.py`. It is a heuristic-first, LLM-augmented policy. The heuristic handles ~90% of decisions deterministically. The LLM is called only when the heuristic encounters genuine ambiguity, meaning several candidates that differ meaningfully from one another.

	### Why Heuristic-First?

	1. Reliability: Heuristic decisions never fail, never timeout, never cost money.
	2. Speed: No network round-trip for obvious moves.
	3. Determinism: Same input always produces same output (important for reproducibility).
	4. Budget: LLM calls cost tokens and have rate limits. The agent makes 6-20 decisions per episode.

	The LLM acts as a tiebreaker when the heuristic produces multiple viable candidates that differ in action type. This hybrid approach gets the best of both worlds.

	---

	## The Three-Tier Decision Pipeline

	Every step, the agent runs this pipeline:

	### Tier 1: `candidate_actions(observation)`

	Reads the current observation and generates a list of `CandidateAction` objects -- the legal moves the agent considers. This is the core intelligence of the agent.

	The function applies these checks in strict priority order:

	1. No case selected? Generate `select_case` candidates sorted by triage priority.

	2. Current case resolved? Switch to an open case.

	3. Harmful evidence attached? Immediately generate `remove_evidence` and return. This fires before any other action on the selected case because a single harmful attachment zeroes the packet_validity score (10% of total) and also lowers evidence_quality (15% of total).

	4. Deadline <= 1 step? Emergency submit or resolve. No time for anything else.

	5. Budget too tight to contest? If there aren't enough steps to run the full contest path (minimum 5 steps: policy + query + attach + strategy + submit), or if this is the lowest-value case in a multi-case triage, fast-concede with `issue_refund`.

	6. Budget pressure (`steps <= cases * 2`)? If the inferred strategy is accept/refund, resolve immediately.

	7. Reason code handler: Dispatches to a reason-code-specific handler that generates the appropriate sequence of queries, evidence attachments, strategy setting, and submission.

	### Tier 2: `_obvious_next_action(observation, candidates)`

	Before calling any LLM, checks if the choice is trivial:
	- Only 1 candidate? Take it.
	- All candidates have the same action type? Take the first.
	- One candidate targets a case with much tighter deadline? Take it.

	If obvious, the LLM is skipped entirely.
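
The three Tier 2 checks can be sketched as follows. This is an illustrative sketch, not the code in `runners/baseline_runner.py`: the `CandidateAction` fields shown here and the `DEADLINE_GAP` threshold for "much tighter" are assumptions, since the text does not give either.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CandidateAction:
    # Hypothetical shape; the real CandidateAction may carry other fields.
    action_type: str
    case_id: str
    steps_until_deadline: int


# Assumed gap (in steps) at which one deadline counts as "much tighter".
DEADLINE_GAP = 3


def obvious_next_action(
    candidates: Sequence[CandidateAction],
) -> Optional[CandidateAction]:
    """Return a candidate when the choice is trivial, otherwise None."""
    if not candidates:
        return None
    # Rule 1: a single candidate needs no tiebreak.
    if len(candidates) == 1:
        return candidates[0]
    # Rule 2: same action type everywhere, so trust the heuristic ordering.
    if len({c.action_type for c in candidates}) == 1:
        return candidates[0]
    # Rule 3: one candidate is clearly more urgent than the runner-up.
    ordered = sorted(candidates, key=lambda c: c.steps_until_deadline)
    gap = ordered[1].steps_until_deadline - ordered[0].steps_until_deadline
    if gap >= DEADLINE_GAP:
        return ordered[0]
    return None  # genuine ambiguity: defer to Tier 3
```

Returning `None` rather than a default keeps the "is this obvious?" question separate from the "which one?" question that Tier 3 answers.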

	### Tier 3: LLM or `_heuristic_pick(candidates)`

	When Tier 2 returns None (genuine ambiguity):
	- With LLM: Sends the observation summary and candidate list as a JSON prompt. The model returns `{"candidate_index": N, "rationale": "..."}`. On failure, walks the fallback chain (OpenRouter -> Gemini -> Groq).
	- Without LLM: `_heuristic_pick()` returns the first candidate (the heuristic already sorted by priority).

	---

	## Reason Code Strategies

	The agent handles 6 reason code families, each with a different workflow:

	### `goods_not_received` (Deterministic: contest)

	The customer claims they never received the product. The merchant almost always has delivery proof.

	Steps: select -> query orders -> query shipping -> attach delivery evidence -> set_strategy contest -> submit representment

	Systems queried: orders, shipping
	Typical evidence: Order confirmation, delivery scan, tracking number
	Strategy: Always contest (delivery proof is definitive)

	### `fraud_cnp` (Non-deterministic: contest or accept)

	Card-not-present fraud. The customer claims they didn't authorize the transaction. This is the most nuanced reason code -- sometimes the merchant has strong evidence (prior good orders, account match), sometimes they don't.

	Steps: select -> retrieve_policy -> query risk + support (+ orders if budget allows) -> attach non-harmful evidence -> set_strategy per policy -> submit or resolve

	Systems queried: risk, support, orders (optional under tight budget)
	Typical evidence: Risk assessment, prior order linkage, account verification
	Harmful evidence: AVS mismatch, CVV mismatch (proves the card data didn't fully match)
	Strategy: Contest if strong evidence exists, accept_chargeback if evidence is weak

	### `credit_not_processed` (Deterministic: issue_refund)

	The customer claims a refund was promised but never issued. The correct response is to issue the refund.

	Steps: select -> set_strategy issue_refund -> resolve_case (3 steps total)
	Strategy: Always issue_refund (cheapest to resolve, no contest needed)

	### `duplicate_processing` (Deterministic: issue_refund)

	The customer was charged twice. The correct response is to refund the duplicate.

	Steps: select -> set_strategy issue_refund -> resolve_case (3 steps total)
	Strategy: Always issue_refund

	### `product_not_as_described` (Non-deterministic: contest or accept)

	The customer claims the product didn't match its description. Success depends on whether the merchant has listing accuracy proof and whether the customer bypassed the return process.

	Steps: select -> retrieve_policy -> query orders + support (+ shipping if deadline allows) -> attach listing/return evidence -> contest or accept per guidance

	Systems queried: orders, support, shipping (optional)
	Strategy: Contest if listing proof is strong, accept_chargeback if not supportable

	### `service_not_provided` (Non-deterministic: contest or accept)

	The customer claims a service was never delivered. Success depends on completion records and customer acknowledgment.

	Steps: select -> retrieve_policy -> query support (+ orders if deadline allows) -> attach completion evidence -> contest or accept per guidance

	Systems queried: support, orders (optional)
	Strategy: Contest if service completion proof exists, accept_chargeback otherwise

	---

	## Multi-Case Triage

	When the agent has multiple open cases and the total estimated step cost exceeds the budget, it uses a triage algorithm:

	### Step Cost Estimates

	\| Reason Code \| Est. Steps \| Notes \|
	\|---\|---\|---\|
	\| `goods_not_received` \| 6 \| select + 2 queries + attach + strategy + submit \|
	\| `credit_not_processed` \| 3 \| select + strategy + resolve \|
	\| `duplicate_processing` \| 3 \| select + strategy + resolve \|
	\| `fraud_cnp` \| 8 \| select + policy + 2-3 queries + attach + strategy + submit \|
	\| `product_not_as_described` \| 8 \| select + policy + 2-3 queries + attach + strategy + submit \|
	\| `service_not_provided` \| 7 \| select + policy + 2 queries + attach + strategy + submit \|

	### Triage Algorithm

	```
	1. If total_estimated_cost > steps_remaining:
	     Sort cases: deterministic-strategy codes first, then by amount descending.
	     This ensures cheap, guaranteed-outcome cases are handled first,
	     and the highest-value non-deterministic cases get remaining budget.

	2. When processing each case, check:
	     - Is steps_remaining < 5? → Fast-concede (can't even minimally contest).
	     - Is this the lowest-value case and total_cost > budget? → Fast-concede.
	     - Otherwise → Full contest or policy-guided resolution.

	3. Never interrupt a near-complete case:
	     - If the current case has evidence attached and is 1-2 steps from
	       submission, finish it before switching to another case's deadline.
	```
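
Steps 1 and 2 of the algorithm above translate fairly directly into Python. The sketch below uses the step estimates from the table; the `QueuedCase` type and both function names are hypothetical, and the near-completion rule (step 3) is left out because it depends on per-case progress state not shown here.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Step estimates copied from the table above.
EST_STEPS = {
    "goods_not_received": 6,
    "credit_not_processed": 3,
    "duplicate_processing": 3,
    "fraud_cnp": 8,
    "product_not_as_described": 8,
    "service_not_provided": 7,
}
DETERMINISTIC = {"goods_not_received", "credit_not_processed", "duplicate_processing"}
MIN_CONTEST_STEPS = 5  # policy + query + attach + strategy + submit


@dataclass(frozen=True)
class QueuedCase:
    # Hypothetical record; the real queue entries carry more fields.
    case_id: str
    reason_code: str
    amount: float


def total_cost(cases: Sequence[QueuedCase]) -> int:
    return sum(EST_STEPS[c.reason_code] for c in cases)


def triage_order(cases: Sequence[QueuedCase], steps_remaining: int) -> List[QueuedCase]:
    """Reorder only when the estimated cost exceeds the budget."""
    if total_cost(cases) <= steps_remaining:
        return list(cases)
    # False sorts before True, so deterministic codes come first;
    # within each group, larger amounts come first.
    return sorted(cases, key=lambda c: (c.reason_code not in DETERMINISTIC, -c.amount))


def should_fast_concede(
    case: QueuedCase, cases: Sequence[QueuedCase], steps_remaining: int
) -> bool:
    if steps_remaining < MIN_CONTEST_STEPS:
        return True
    lowest = min(cases, key=lambda c: c.amount)
    return total_cost(cases) > steps_remaining and case == lowest
```

With a $900 `fraud_cnp`, a $50 `duplicate_processing`, and a $300 `goods_not_received` case on a 10-step budget, the estimated cost is 17 steps, so the two deterministic cases move ahead of the fraud case and the $50 case is the fast-concede candidate.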

	### Why This Ordering Works

	- credit_not_processed/duplicate_processing cost 3 steps and always get optimal score. Handle them first to free budget.
	- goods_not_received costs 6 steps and always contests. Handle next.
	- fraud_cnp/product_not_as_described/service_not_provided cost 7-8 steps and may need to concede. Handle last -- if budget runs out, conceding these with `issue_refund` (an acceptable fallback) still earns 35% strategy correctness.

	---

	## Evidence Handling

	### Harmful Evidence Detection

	The agent maintains a set of 15 negative-signal keywords derived from real chargeback dispute patterns:

	```
	mismatch, failed, declined, suspicious, flagged, fraud risk,
	unauthorized, rejected, invalid, expired, violation,
	non-compliant, discrepancy, inconsistent, unverified
	```

	Every evidence item's title and summary are scanned. If any harmful keyword is found, the evidence is:
	1. Never attached (ranked 999 in the priority sort, excluded from `add_evidence` calls)
	2. Removed if already attached (a `remove_evidence` action is generated immediately before any other action)

	### Evidence Priority Ranking

	Non-harmful evidence is ranked by keyword relevance:

	\| Rank \| Keywords \| Example \|
	\|---\|---\|---\|
	\| 0 (highest) \| signature, completion, booking, listing \| "Delivery signature scan" \|
	\| 1 \| duplicate, delivery, prior, account, authenticated \| "Prior good order linkage" \|
	\| 2 \| return policy, refund, cancel, confirmation, cancellation \| "Return policy documentation" \|
	\| 4 (default) \| anything else \| "Internal memo" \|
	\| 999 (excluded) \| mismatch, failed, declined, suspicious, flagged, fraud risk, unauthorized, rejected, invalid, expired, violation, non-compliant, discrepancy, inconsistent, unverified \| "AVS mismatch report" \|
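
The keyword scan and the ranking table can be combined into one small function. The sketch below assumes case-insensitive substring matching over title plus summary, which the text implies but does not spell out; `evidence_rank` and `attachable` are illustrative names.

```python
from typing import List, Tuple

HARMFUL_KEYWORDS = (
    "mismatch", "failed", "declined", "suspicious", "flagged", "fraud risk",
    "unauthorized", "rejected", "invalid", "expired", "violation",
    "non-compliant", "discrepancy", "inconsistent", "unverified",
)
RANKED_KEYWORDS = (
    (0, ("signature", "completion", "booking", "listing")),
    (1, ("duplicate", "delivery", "prior", "account", "authenticated")),
    (2, ("return policy", "refund", "cancel", "confirmation", "cancellation")),
)
DEFAULT_RANK = 4
EXCLUDED_RANK = 999


def evidence_rank(title: str, summary: str = "") -> int:
    """Lower is better; 999 means the item must never be attached."""
    text = f"{title} {summary}".lower()
    # Harmful keywords win over everything, including helpful-sounding titles.
    if any(keyword in text for keyword in HARMFUL_KEYWORDS):
        return EXCLUDED_RANK
    for rank, keywords in RANKED_KEYWORDS:
        if any(keyword in text for keyword in keywords):
            return rank
    return DEFAULT_RANK


def attachable(items: List[Tuple[str, str, str]]) -> List[str]:
    """items are (evidence_id, title, summary); returns IDs in attach order."""
    ranked = sorted((evidence_rank(title, summary), eid) for eid, title, summary in items)
    return [eid for rank, eid in ranked if rank != EXCLUDED_RANK]
```

Checking harmful keywords first matters for adversarial items: a "Delivery verification report" whose summary mentions a GPS discrepancy is excluded even though "delivery" would otherwise rank it at 1.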

	### Attachment Strategy

	The agent attaches all non-harmful retrieved evidence in a single `add_evidence` call. This costs one step regardless of how many items are attached and maximizes the evidence_quality score, which rewards `helpful_attached / helpful_total`.

	---

	## Representment Notes

	When the agent submits a contest, it generates a representment note. The grader scores notes on 4 dimensions:

	\| Dimension \| Weight \| What Earns Points \|
	\|---\|---\|---\|
	\| Substance \| 20% \| Note has >= 5 words \|
	\| Policy claims coverage \| 50% \| Note mentions keywords from `case.policy_requirements` (e.g., "order confirmation", "carrier delivery") \|
	\| Evidence coherence \| 15% \| Note references attached evidence IDs (e.g., "E1-ORDER-CONF") \|
	\| Harmful mention penalty \| -15% each \| Note contains words like "mismatch", "failed", "declined" \|

	### How the Agent Builds Notes

	1. Start with a reason-code-specific template that uses policy requirement language:
	   - goods_not_received: "Order confirmation and carrier delivery confirmation establish fulfillment..."
	   - fraud_cnp: "Prior good order linkage and customer account confirmation..." (never mentions "mismatch")
	   - product_not_as_described: "Product listing verification confirms..."
	   - service_not_provided: "Service completion record and customer acknowledgment..."

	2. If policy was retrieved, append the policy requirements directly:
	   - "Evidence covers: order confirmation, carrier delivery confirmation."

	3. Append evidence IDs for coherence scoring:
	   - "Supporting evidence: E1-ORDER-CONF, E1-DELIVERY-SCAN."

	4. Truncate to 500 characters.
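
The four steps above amount to string assembly followed by a hard cut. A minimal sketch, assuming the opening sentence is passed in by the caller (the full templates are elided in this document, so none is reproduced here) and that `build_note` is a hypothetical name:

```python
from typing import Sequence

MAX_NOTE_CHARS = 500


def build_note(
    opening: str,
    policy_requirements: Sequence[str],
    evidence_ids: Sequence[str],
    limit: int = MAX_NOTE_CHARS,
) -> str:
    """Assemble opening + policy coverage + evidence IDs, then truncate."""
    parts = [opening.strip()]
    if policy_requirements:
        # Only present when retrieve_policy was called for the case.
        parts.append("Evidence covers: " + ", ".join(policy_requirements) + ".")
    if evidence_ids:
        parts.append("Supporting evidence: " + ", ".join(evidence_ids) + ".")
    return " ".join(parts)[:limit]
```

Because truncation happens last, a long opening can cut off the evidence IDs at the end of the note; keeping the opening short protects the evidence-coherence portion of the score.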

	---

	## The Grading System

	After all cases are resolved (or the step budget is exhausted), the grader scores each case across 8 dimensions. Each dimension is an OpenEnv `Rubric` subclass defined in `evaluation/rubrics.py`; they compose into a per-case `WeightedSum` (wrapped in a `Gate(CaseAbandonedRubric)` deadline guard) and an episode-level `ChargebackOpsEpisodeRubric` that is wired into `env.rubric`. `evaluation/grading.py` keeps the legacy `score_case` / `grade_episode` API as a thin adapter over the rubric tree.

	### Strategy Correctness (20%)

	\| Outcome \| Score \|
	\|---\|---\|
	\| Chose the optimal strategy \| 1.0 \|
	\| Chose an acceptable fallback \| 0.35 \|
	\| Chose the wrong strategy \| 0.0 \|

	"Optimal" and "acceptable" strategies are defined per case by the task scenario. For example, `goods_not_received` optimal is always "contest" with no acceptable fallback. `fraud_cnp` optimal might be "contest" with "accept_chargeback" as acceptable, or vice versa.

	### Evidence Quality (15%)

	For contest cases:
	```
	quality = 0.7 * (required_attached / required_total)
	        + 0.3 * (helpful_attached / helpful_total)
	        - 0.25 * harmful_attached_count
	```

	For non-contest cases where optimal strategy is also non-contest:
	- 1.0 if no evidence was attached (clean concession)
	- 0.7 if evidence was attached (unnecessary work)

	For non-contest cases where optimal was contest:
	- 0.15 (the agent abandoned evidence gathering for a contestable case)
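
The three branches can be written as two small functions. This is a sketch of the rules as stated, not the code in `evaluation/rubrics.py`: the function names are illustrative, and treating an empty denominator as a ratio of 1.0 is an assumption. No clamping is applied because the text does not mention any.

```python
def contest_quality(
    required_attached: int,
    required_total: int,
    helpful_attached: int,
    helpful_total: int,
    harmful_attached: int,
) -> float:
    """Evidence quality for a contested case."""
    # Assumption: nothing required/helpful to attach counts as fully covered.
    required_ratio = required_attached / required_total if required_total else 1.0
    helpful_ratio = helpful_attached / helpful_total if helpful_total else 1.0
    return 0.7 * required_ratio + 0.3 * helpful_ratio - 0.25 * harmful_attached


def concession_quality(optimal_is_contest: bool, evidence_attached: bool) -> float:
    """Evidence quality when the agent accepted or refunded."""
    if optimal_is_contest:
        return 0.15  # a contestable case was abandoned
    return 0.7 if evidence_attached else 1.0
```

For example, attaching both required items and one of two helpful items with nothing harmful gives `0.7 * 1.0 + 0.3 * 0.5 = 0.85`; adding one harmful item to an otherwise complete packet gives `1.0 - 0.25 = 0.75`.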

	### Packet Validity (10%)

	Binary, all-or-nothing:
	- 1.0 if ALL required evidence is attached AND zero harmful evidence is attached
	- 0.0 otherwise

	This is the strictest dimension. Missing one required piece or having one harmful piece zeroes it out.

	### Deadline Compliance (10%)

	Binary:
	- 1.0 if the case was resolved at or before the deadline step
	- 0.0 if resolved after the deadline or never resolved

	### Efficiency (10%)

	```
	efficiency = 1.0 - min(0.9, (duplicate_queries + invalid_actions) * 0.1 + submit_attempts * 0.05)
	```

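	The agent loses 0.1 per duplicate system query or invalid action, and 0.05 per submit attempt. Because the penalty term is capped at 0.9, this formula alone never drops below 0.1; the stated minimum efficiency of 0.0 is only reachable once the additional penalties below are applied.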
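
A direct transcription of the base formula makes the cap easy to check. The function name is illustrative, and the additional penalties and bonus described next are deliberately not modelled here.

```python
def base_efficiency(duplicate_queries: int, invalid_actions: int, submit_attempts: int) -> float:
    """Base efficiency term only; later adjustments are not included."""
    penalty = (duplicate_queries + invalid_actions) * 0.1 + submit_attempts * 0.05
    # The penalty is capped at 0.9, so this term bottoms out at 0.1.
    return 1.0 - min(0.9, penalty)
```

Read literally, a clean run with a single submit scores 0.95 rather than 1.0, since every submit attempt carries the 0.05 charge. One duplicate query plus one invalid action plus one submit gives `1.0 - 0.25 = 0.75`.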

	Additional penalties for shallow operational behaviour:
	- Over-querying a concedable case: -0.15 per system queried beyond the 2nd when the agent concedes a case whose optimal strategy is also non-contest. Querying 4+ systems before conceding is wasteful.
	- Late policy retrieval: -0.08 when policy is retrieved but the case is resolved with a concession that matches the optimal non-contest strategy. The policy step was wasted.
	- Early correct concession bonus: +0.10 when the agent correctly concedes a case (matching optimal) within 3 steps. Rewards recognising a bad case quickly.

	### Outcome Quality (10%)

	\| Outcome \| Score \|
	\|---\|---\|
	\| Final resolution matches optimal strategy \| 1.0 \|
	\| Final resolution is an acceptable fallback \| 0.4 \|
	\| Final resolution is wrong \| 0.0 \|

	### Note Quality (5%)

	Only scored for contest cases with a representment note. See [Representment Notes](#representment-notes) for the scoring breakdown.

	### Escalation ROI (20%)

	Encodes the economic rule that escalating to network arbitration is rational only when
	`P(win) × dispute_amount > $250 fee`. Conceding a positive-EV contestable case (where
	`amount > $250` and the optimal strategy is `contest`) is penalised. Escalating a
	negative-EV case (low P(win) or low amount) is also penalised. This is the dimension that
	keeps `concede_all` from being a free 0.6+ score.
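
The decision rule itself is a one-line inequality. The sketch below states it along with the break-even win probability it implies; both function names are illustrative, and how the rubric estimates `P(win)` is not covered by this document.

```python
ARBITRATION_FEE = 250.0


def escalation_is_rational(p_win: float, amount: float, fee: float = ARBITRATION_FEE) -> bool:
    """True when the expected recovery strictly exceeds the arbitration fee."""
    return p_win * amount > fee


def break_even_p_win(amount: float, fee: float = ARBITRATION_FEE) -> float:
    """Win probability above which escalation becomes EV-positive."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    return fee / amount
```

At a 60% win probability, a $480 dispute is worth escalating (expected recovery $288) while a $400 dispute is not (expected recovery $240). A dispute of $250 or less can never clear the bar, because the break-even probability is 1.0 or higher.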

	### Deadline Gate

	Before the WeightedSum scores anything, `Gate(CaseAbandonedRubric)` checks whether the case
	was left unresolved past its deadline. If yes, the entire case score is hard-zeroed. This
	prevents the agent from gaming the rubric by ignoring nightmare-tier cases and still
	collecting partial credit on the dimensions it did touch.

	### Final Score Calculation

	```
	case_score = 0.20 * strategy_correctness
	           + 0.15 * evidence_quality
	           + 0.10 * packet_validity
	           + 0.10 * deadline_compliance
	           + 0.10 * efficiency
	           + 0.10 * outcome_quality
	           + 0.05 * note_quality
	           + 0.20 * escalation_roi

	case_score = 0.0 if case_abandoned else case_score # deadline gate

	weighted_case_score = case_score * case_weight

	episode_score = sum(weighted_case_scores) / sum(case_weights)
	```

	Case weights are determined by financial impact (amount and difficulty). The episode score normalizes to `[0.0, 1.0]`.
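
The calculation above can be expressed as a weighted sum behind a gate, then a weight-normalized average. This is a plain-Python sketch of the arithmetic only; it does not use the OpenEnv `Rubric`, `WeightedSum`, or `Gate` classes, and the function names are illustrative.

```python
from typing import Mapping, Sequence, Tuple

WEIGHTS = {
    "strategy_correctness": 0.20,
    "evidence_quality": 0.15,
    "packet_validity": 0.10,
    "deadline_compliance": 0.10,
    "efficiency": 0.10,
    "outcome_quality": 0.10,
    "note_quality": 0.05,
    "escalation_roi": 0.20,
}


def case_score(dimensions: Mapping[str, float], abandoned: bool = False) -> float:
    """Weighted sum of the 8 dimensions, hard-zeroed for abandoned cases."""
    if abandoned:  # deadline gate
        return 0.0
    return sum(weight * dimensions[name] for name, weight in WEIGHTS.items())


def episode_score(cases: Sequence[Tuple[float, float]]) -> float:
    """cases are (case_score, case_weight) pairs."""
    total_weight = sum(weight for _, weight in cases)
    if total_weight == 0:
        return 0.0
    return sum(score * weight for score, weight in cases) / total_weight
```

Because the weights sum to 1.0 and the episode score divides by the total case weight, a perfect case with weight 2 and an abandoned case with weight 1 average to 2/3.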

	---

	## The Issuer Agent

	After every `submit_representment`, a scripted `IssuerAgent` (see `scenarios/issuer_model.py`)
	reviews the packet and returns one of three decisions:

	\| Decision \| Score band (round 1) \| Score band (round 2) \| What happens \|
	\|---\|---\|---\|---\|
	\| `accept` \| ≥ 0.70 \| ≥ 0.60 \| Merchant wins the dispute, case closes positive \|
	\| `request_more_evidence` \| 0.40 – 0.70 \| < 0.60 \| Round 2: merchant gets one more shot with compelling evidence \|
	\| `escalate_to_arbitration` \| < 0.40 \| (only if merchant escalates) \| Round 3: case goes to network arbitration \|

	The score itself comes from `evidence_strength_score`:

	```
	score = 0.4                          (if all required evidence attached)
	      + min(0.4, 0.2 × helpful_attached)
	      − 0.3 × harmful_attached       # uncapped
	      + 0.1                          (if note has ≥ 2 policy keywords)
	      + min(0.30, 0.15 × pre_arb_unique)   # round 2 only
	```

	In the round-1 ambiguity band (0.40–0.70), the deterministic fallback uses the midpoint rule:
	`accept` at score ≥ 0.55, otherwise `request_more_evidence`. An optional LLM softening layer
	can override this midpoint when an API key is set; with no key it falls back to the
	deterministic rule so offline benchmarks stay reproducible.
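
Put together, the score formula and the round-1 deterministic rule look roughly like this. It is a sketch of the behaviour described in this section, not the contents of `scenarios/issuer_model.py`; the parameter names and `issuer_decision_round1` are illustrative, and the optional LLM softening layer is omitted.

```python
def evidence_strength_score(
    all_required_attached: bool,
    helpful_attached: int,
    harmful_attached: int,
    note_policy_keywords: int,
    pre_arb_unique: int = 0,
    round_number: int = 1,
) -> float:
    score = 0.4 if all_required_attached else 0.0
    score += min(0.4, 0.2 * helpful_attached)
    score -= 0.3 * harmful_attached  # uncapped
    if note_policy_keywords >= 2:
        score += 0.1
    if round_number == 2:
        score += min(0.30, 0.15 * pre_arb_unique)
    return score


def issuer_decision_round1(score: float) -> str:
    """Deterministic round-1 decision, including the midpoint fallback."""
    if score >= 0.70:
        return "accept"
    if score < 0.40:
        return "escalate_to_arbitration"
    # Ambiguity band 0.40-0.70: midpoint rule.
    return "accept" if score >= 0.55 else "request_more_evidence"
```

A packet with all required evidence, two helpful items, and a keyword-rich note reaches the round-1 maximum of about 0.9. A packet with only the required evidence scores 0.4 and lands in `request_more_evidence`, and a single harmful attachment costs more than one helpful item adds.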

	## Arbitration

	Network arbitration is a pure function (see `scenarios/arbitration.py`). Given the same case ID
	and packet state, the ruling is always the same — it seeds a coin flip from a SHA-256 hash of
	the case ID inside an ambiguity band. The bands:

	\| Evidence-strength score \| Ruling \|
	\|---\|---\|
	\| ≥ 0.65 \| `merchant_wins` \|
	\| ≤ 0.35 \| `issuer_wins` \|
	\| (0.35, 0.65) \| seeded coin flip on `sha256(case_id)` \|

	Both sides pay a $250 fee regardless of outcome. The winner is reimbursed the dispute amount
	minus their $250 fee; the loser eats the dispute amount plus the $250 fee. The
	`EscalationROIRubric` reads the final P&L and scores whether the agent's escalate / concede
	decision was EV-rational ex ante.
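
One way to get a reproducible coin flip from a case ID is to hash it and read a bit from the digest. The sketch below follows the bands in the table; using the parity of the first digest byte is an assumption, since this document does not say how `scenarios/arbitration.py` maps the hash to an outcome. `merchant_pnl` is a hypothetical helper that reads the fee rule from the merchant's side.

```python
import hashlib

ARBITRATION_FEE = 250.0


def arbitration_ruling(case_id: str, score: float) -> str:
    """Pure function of (case_id, score): same inputs, same ruling."""
    if score >= 0.65:
        return "merchant_wins"
    if score <= 0.35:
        return "issuer_wins"
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    # Assumed mapping: even first byte means the merchant wins the flip.
    return "merchant_wins" if digest[0] % 2 == 0 else "issuer_wins"


def merchant_pnl(ruling: str, amount: float, fee: float = ARBITRATION_FEE) -> float:
    """Merchant outcome: reimbursed minus fee on a win, amount plus fee on a loss."""
    if ruling == "merchant_wins":
        return amount - fee
    return -(amount + fee)
```

On a $480 dispute the merchant nets $230 on a win and is out $730 on a loss, which is why a coin-flip case at that amount is a poor candidate for escalation.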

	## LLM Integration

	The agent supports 5 LLM providers through OpenAI-compatible clients:

	\| Provider \| Model \| Base URL \|
	\|---\|---\|---\|
	\| OpenRouter \| openai/gpt-oss-120b \| openrouter.ai/api/v1 \|
	\| Google Gemini \| gemini-2.5-flash \| generativelanguage.googleapis.com/v1beta/openai/ \|
	\| Groq \| llama-3.3-70b-versatile \| api.groq.com/openai/v1 \|
	\| OpenAI \| gpt-4.1-mini \| api.openai.com/v1 \|
	\| Anthropic \| claude-sonnet-4 \| (compatible gateway) \|

	### Fallback Chain

	```
	Primary (configured in .env) → OpenRouter → Google Gemini → Groq → Heuristic
	```

	If the primary provider fails (timeout, rate limit, connection error), the agent automatically tries the next provider in the chain. If all providers fail, it falls back to `_heuristic_pick()`.
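
The chain can be modelled as an ordered list of callables with the heuristic as the final fallback. This sketch is provider-agnostic: each provider is a plain function rather than a real API client, `pick_with_fallback` is a hypothetical name, and the `strict` flag mirrors the documented `STRICT_LLM_MODE` behaviour of failing instead of falling back.

```python
from typing import Callable, Sequence, Tuple

# (provider name, function that takes a prompt and returns a candidate index)
Provider = Tuple[str, Callable[[str], int]]


def pick_with_fallback(
    prompt: str,
    providers: Sequence[Provider],
    heuristic: Callable[[], int],
    strict: bool = False,
) -> Tuple[str, int]:
    """Try each provider in order; fall back to the heuristic if all fail."""
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception:
            # A real client would catch only timeout, rate-limit, and
            # connection errors; the broad catch keeps the sketch short.
            continue
    if strict:
        raise RuntimeError("all LLM providers failed and strict mode is enabled")
    return "heuristic", heuristic()
```

Returning the provider name alongside the choice makes it easy to log which link in the chain actually answered.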

	### What the LLM Sees

	```json
	{
	  "queue_summary": "2 open cases, 8 steps remaining",
	  "visible_case": "CB-G1, fraud_cnp, $480, deadline in 3 steps",
	  "candidates": [
	    {"index": 0, "action": "submit_representment", "summary": "Submit the contest package"},
	    {"index": 1, "action": "query_system orders", "summary": "Query orders for more evidence"},
	    {"index": 2, "action": "select_case CB-G2", "summary": "Switch to case with deadline in 1 step"}
	  ]
	}
	```

	### What the LLM Returns

	```json
	{
	  "candidate_index": 0,
	  "rationale": "CB-G1 has sufficient evidence and contesting before deadline takes priority over gathering more evidence."
	}
	```
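
Model output is untrusted, so the response should be validated before it selects an action. A minimal sketch, assuming any malformed or out-of-range reply falls back to index 0, which matches `_heuristic_pick()` returning the first candidate; `parse_choice` is a hypothetical name.

```python
import json


def parse_choice(raw: str, n_candidates: int) -> int:
    """Extract candidate_index from the reply, defaulting to 0 on any problem."""
    try:
        payload = json.loads(raw)
        index = payload["candidate_index"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Not JSON, missing key, or a JSON value that is not an object.
        return 0
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(index, bool) or not isinstance(index, int):
        return 0
    return index if 0 <= index < n_candidates else 0
```

The `rationale` field is ignored for control flow; it is useful for logs but should not influence which action runs.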

	### Configuration

	\| Env Variable \| Default \| Purpose \|
	\|---\|---\|---\|
	\| `BASELINE_PROVIDER` \| openrouter \| Primary LLM provider \|
	\| `BASELINE_MODEL` \| openai/gpt-oss-120b \| Model to use \|
	\| `BASELINE_REQUEST_TIMEOUT_SECONDS` \| 15 \| Per-call timeout \|
	\| `PROVIDER_RATE_LIMIT_RETRIES` \| 2 \| Retry count on rate limits \|
	\| `PROVIDER_RETRY_BACKOFF_SECONDS` \| 1.0 \| Backoff between retries \|
	\| `MAX_PROVIDER_RESPONSE_TOKENS` \| 200 \| Max tokens for LLM response \|
	\| `STRICT_LLM_MODE` \| false \| If true, fail instead of falling back to heuristic \|
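
These settings can be read into a typed object with the documented defaults. The variable names and defaults below come from the table; the `BaselineConfig` dataclass, `load_config`, and the set of strings accepted as "true" are assumptions made for this sketch.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BaselineConfig:
    provider: str
    model: str
    timeout_seconds: float
    rate_limit_retries: int
    retry_backoff_seconds: float
    max_response_tokens: int
    strict_llm_mode: bool


def load_config(env: Optional[Mapping[str, str]] = None) -> BaselineConfig:
    """Read settings from env (defaults to os.environ), applying table defaults."""
    env = os.environ if env is None else env
    strict_raw = env.get("STRICT_LLM_MODE", "false")
    return BaselineConfig(
        provider=env.get("BASELINE_PROVIDER", "openrouter"),
        model=env.get("BASELINE_MODEL", "openai/gpt-oss-120b"),
        timeout_seconds=float(env.get("BASELINE_REQUEST_TIMEOUT_SECONDS", "15")),
        rate_limit_retries=int(env.get("PROVIDER_RATE_LIMIT_RETRIES", "2")),
        retry_backoff_seconds=float(env.get("PROVIDER_RETRY_BACKOFF_SECONDS", "1.0")),
        max_response_tokens=int(env.get("MAX_PROVIDER_RESPONSE_TOKENS", "200")),
        # Assumed truthy spellings; the source only documents "true"/"false".
        strict_llm_mode=strict_raw.strip().lower() in {"1", "true", "yes"},
    )
```

Passing a mapping instead of reading `os.environ` directly keeps the loader easy to test without mutating the process environment.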

	---

	## Key Optimizations

	### 1. Deterministic Strategy Inference

	For reason codes where the optimal strategy never varies (`goods_not_received` = contest, `credit_not_processed` / `duplicate_processing` = issue_refund), the agent skips `retrieve_policy` entirely. This saves 1 step per case.

	### 2. Deadline-Aware Query Limiting

	When the remaining steps before deadline can't accommodate all planned queries, the agent reduces the number of systems queried:

	- `product_not_as_described`: drops from 3 systems (orders, support, shipping) to 2 (orders, support)
	- `fraud_cnp`: drops from 3 systems (risk, support, orders) to 2 (risk, support)
	- `service_not_provided`: drops from 2 systems (orders, support) to 1 (support)

	### 3. Near-Completion Protection

	When the current case has evidence attached and is 1-2 steps from submission, the agent does NOT switch to handle another case's deadline. Finishing the current case is almost always higher-value than interrupting.

	### 4. Harmful Evidence Cleanup

	Before any submit, the agent checks for harmful evidence in the attached set. If found, it generates a `remove_evidence` action immediately. This prevents the -0.25 per-piece evidence_quality penalty and the packet_validity = 0.0 failure.

	### 5. Budget-Aware Note Generation

	Representment notes are generated with direct references to policy requirement keywords and evidence IDs, maximizing the note_quality score (policy claims coverage 50% + evidence coherence 15%).

	### 6. Adversarial Evidence (Hard/Nightmare)

	At hard and nightmare difficulty, the case generator injects adversarial evidence — items whose titles sound helpful ("Delivery verification report", "Account verification summary") but whose content is harmful (GPS discrepancies, prior non-receipt claims, failed 3D Secure challenges). This tests whether the agent reads beyond titles and inspects evidence content before attaching.

	### 7. Nightmare Difficulty

	Nightmare tasks push the step budget to its limit: 5-6 cases with ~2.4 steps per case. The agent must triage aggressively — fast-conceding weak cases, handling deterministic codes first, and accepting that some cases will go unresolved. This tier specifically tests prioritisation under extreme resource pressure.

	---

	## File Map

	\| File \| Purpose \| Lines \|
	\|---\|---\|---\|
	\| `runners/baseline_runner.py` \| The agent: decision pipeline, candidate generation, LLM integration, representment notes \| ~1100 \|
	\| `server/chargeback_ops_environment.py` \| The environment: step/reset/state, action execution, reward computation \| ~500 \|
	\| `evaluation/rubrics.py` \| OpenEnv `Rubric` subclasses for all 8 scoring dimensions, composed via `WeightedSum` + `Gate(CaseAbandonedRubric)` \| ~400 \|
	\| `scenarios/issuer_model.py` \| Scripted `IssuerAgent`: evidence-strength scoring, threshold bands, optional LLM softening \| ~250 \|
	\| `scenarios/arbitration.py` \| Deterministic network arbitration resolver with $250 per-side fee \| ~120 \|
	\| `evaluation/grading.py` \| Legacy `score_case` / `grade_episode` adapter that delegates to the rubric tree \| ~120 \|
	\| `scenarios/simulation.py` \| Task definitions, case progress tracking, evidence metadata \| ~600 \|
	\| `core/models.py` \| Pydantic models for actions, observations, state, grading \| ~600 \|
	\| `runners/inference.py` \| OpenEnv-compatible inference entry point with provider fallback \| ~200 \|
	\| `inference.py` \| Root re-export for submission contract \| ~10 \|
	\| `scenarios/case_generator.py` \| Parametric task generator with seeded RNG \| ~700 \|
	\| `scenarios/iso_adapter.py` \| Converts ISO 20022 CASR.003 records to environment cases \| ~160 \|
	\| `connectors/stripe_sandbox.py` \| Maps Stripe test-mode disputes to environment cases \| ~280 \|
	\| `evaluation/agent_brutal_audit.py` \| 126-episode evaluation across all data sources \| ~300 \|
	\| `server/app.py` \| FastAPI routes: /reset, /step, /state, /tasks, /baseline, /grader, /results, /demo \| ~200 \|
	\| `server/demo_ui.py` \| Gradio live demo UI with step-by-step episode playback \| ~150 \|
	\| `core/episode_store.py` \| Thread-safe storage with JSONL file persistence \| ~60 \|
	\| `core/client.py` \| OpenEnv WebSocket client \| ~100 \|

	---

	## Performance

	Tested across the 11-task headline benchmark (4 showcase + 7 seeded holdout) and a 28-task
	multi-seed grid:

	\| Policy \| Headline (11) \| Multi-seed (28) \| Delta vs naive \|
	\|---\|---\|---\|---\|
	\| naive (empty packet) \| 0.000 \| 0.000 \| — \|
	\| concede_all \| 0.567 \| 0.563 \| +0.567 \|
	\| escalate_all \| 0.773 \| 0.765 \| +0.773 \|
	\| heuristic \| 0.773 \| 0.765 \| +0.773 \|

	The difficulty curve runs 0.97 → 0.88 → 0.70 → 0.51 across easy / medium / hard / nightmare on
	the multi-seed grid — monotone and well-separated. The `Gate(CaseAbandonedRubric)` wrapper
	hard-zeros abandoned cases, and `EscalationROIRubric` (20%) penalises both conceding positive-EV
	contestable cases and escalating negative-EV ones — together they kill the concede-everything
	shortcut. `escalate_all` ties heuristic at the headline because the merchant's round-1 packet
	is strong enough on most tasks that the pre-arb branch never fires. See `docs/RESULTS.md` for
	full per-task numbers, the rubric tree, and reproduction commands.