# ChargebackOps Agent: Complete Technical Reference
This document explains every aspect of the ChargebackOps agent -- what problem it solves, how it thinks, how it is scored, and why every design decision was made.
---
## Table of Contents
- [The Problem](#the-problem)
- [The Use Case](#the-use-case)
- [How the Environment Works](#how-the-environment-works)
- [How the Agent Works](#how-the-agent-works)
- [The Three-Tier Decision Pipeline](#the-three-tier-decision-pipeline)
- [Reason Code Strategies](#reason-code-strategies)
- [Multi-Case Triage](#multi-case-triage)
- [Evidence Handling](#evidence-handling)
- [Representment Notes](#representment-notes)
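- [The Grading System](#the-grading-system)
- [The Issuer Agent](#the-issuer-agent)
- [Arbitration](#arbitration)
- [LLM Integration](#llm-integration)
- [Key Optimizations](#key-optimizations)
- [File Map](#file-map)
- [Performance](#performance)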
---
## The Problem
When a customer disputes a credit card charge, the card network (Visa, Mastercard) initiates a **chargeback** against the merchant. The merchant loses the transaction amount immediately. To recover the funds, the merchant must build a **representment package** -- a bundle of evidence proving the charge was legitimate -- and submit it before a hard deadline.
This is not a simple yes/no decision. Each dispute has:
- A **reason code** (why the customer disputes: fraud, goods not received, product not as described, etc.)
- A **deadline** (fixed number of steps before the case auto-closes against the merchant)
- **Evidence** scattered across 6 internal merchant systems (orders, payment, shipping, support, refunds, risk)
- Some evidence is **helpful**, some is **required**, and some is **harmful** (weakens the case if included)
- A correct **strategy** that depends on the evidence available (contest, accept the chargeback, or issue a refund)
A human dispute analyst handles 50-200 cases per day. They must triage by urgency, query the right systems, avoid attaching damaging evidence, and submit within deadline. Mistakes are expensive: a lost $500 chargeback plus a $25 network fee per case.
**ChargebackOps turns this into a measurable agent benchmark.** The agent must do exactly what a human analyst does, but programmatically, under step-budget constraints, with deterministic scoring.
---
## The Use Case
ChargebackOps is built for the [OpenEnv](https://meta-pytorch.org/OpenEnv/index.html) evaluation framework. It is a **simulated merchant dispute resolution environment** where an AI agent acts as the dispute analyst.
**What the agent receives:**
- A queue of 1-6 open dispute cases (5-6 at nightmare difficulty)
- A step budget (10-20 actions total, ~2.4 steps/case at nightmare)
- Per-case deadlines (must resolve before step N)
**What the agent must do:**
- Select and focus on one case at a time
- Query internal merchant systems to retrieve evidence
- Decide whether to contest, accept, or refund each case
- Attach the right evidence (and avoid harmful artifacts)
- Write a representment note explaining why the dispute should be reversed
- Submit or resolve each case before its deadline
- Manage step budget across all cases when there are more cases than steps
**What the agent is scored on:**
- Did it choose the correct strategy? (20% of score)
- Did it gather the right evidence? (15%)
- Is the evidence packet complete and clean? (10%)
- Did it meet the deadline? (10%)
- Was it efficient (no wasted steps)? (10%)
- Did the resolution match the strategy? (10%)
- Is the representment note well-written? (5%)
- Was escalation EV-rational? (20% -- escalate iff `P(win) · amount > $250 fee`)
After the merchant submits a representment, a scripted **IssuerAgent** reviews the packet and returns one of three decisions: `accept`, `request_more_evidence` (triggering pre-arbitration with compelling evidence), or `escalate_to_arbitration`. The merchant can also choose to escalate at round 2 instead of rebuilding the packet, or accept an arbitration loss to cap fees. Network arbitration is a deterministic resolver: the loser absorbs the dispute amount plus a $250 fee, and the winner is reimbursed the dispute amount minus their own $250 fee.
---
## How the Environment Works
The environment follows the OpenEnv `reset()` / `step()` / `state()` contract.
### Lifecycle
```
reset(task_id) → Observation
step(action)   → Observation
state()        → State (includes grader report when done)
```
### Observation
Each observation contains:
| Field | Type | Description |
|---|---|---|
| `queue` | list | All cases with status, reason_code, amount, steps_until_deadline |
| `visible_case` | object or null | The currently selected case with full detail |
| `steps_remaining` | int | Steps left before episode ends |
| `done` | bool | Whether the episode is complete |
| `reward` | float | Immediate reward from the last action |
| `result` | string | Human-readable outcome of the last action |
### The Visible Case
When a case is selected, `visible_case` exposes:
| Field | Description |
|---|---|
| `case_id` | Unique identifier |
| `reason_code` | Why the customer disputed (e.g., `goods_not_received`) |
| `amount` | Transaction amount in dollars |
| `current_strategy` | Currently set strategy (null if not set) |
| `policy` | Policy guidance (null until `retrieve_policy` is called) |
| `systems_revealed` | Which merchant systems have been queried |
| `retrieved_evidence` | Evidence items revealed by queries |
| `attached_evidence` | Evidence currently attached to the representment package |
| `inspection_notes` | Analyst notes (null until `inspect_case` is called) |
### Action Space (12 Actions)
**Round 1 -- Representment**
| Action | Arguments | Cost | What It Does |
|---|---|---|---|
| `select_case` | case_id | 1 step | Focus on a case from the queue |
| `inspect_case` | case_id | 1 step | Reveal analyst inspection notes (+0.04 reward) |
| `query_system` | case_id, system_name | 1 step | Pull evidence from orders/payment/shipping/support/refunds/risk |
| `retrieve_policy` | case_id | 1 step | Get reason-code-specific guidance and required evidence list |
| `add_evidence` | case_id, evidence_ids | 1 step | Attach evidence to the representment package |
| `remove_evidence` | case_id, evidence_ids | 1 step | Remove evidence (useful for cleaning harmful attachments) |
| `set_strategy` | case_id, strategy | 1 step | Choose contest / accept_chargeback / issue_refund |
| `submit_representment` | case_id, note | 1 step | Submit the contest package (requires strategy = contest) |
| `resolve_case` | case_id, strategy | 1 step | Close a non-contest case (accept or refund) |
**Round 2/3 -- Pre-Arbitration & Arbitration**
| Action | Arguments | Cost | What It Does |
|---|---|---|---|
| `respond_to_pre_arb` | case_id, compelling_evidence_ids | 1 step | Attach compelling evidence and resubmit at round 2 (Issuer accept threshold drops to 0.60) |
| `escalate_to_arbitration` | case_id | 1 step | Skip rebuilding the packet, pay $250 fee, push to network arbitration |
| `accept_arbitration_loss` | case_id | 1 step | Concede at round 2/3 to cap fees |
Every action costs exactly 1 step. There is no free action. The agent must be deliberate about every step it takes.
### Reward Signals
The environment returns immediate rewards after each action:
| Event | Reward |
|---|---|
| Select an open case | +0.02 |
| Inspect a case (first time) | +0.04 |
| Query a new system with helpful evidence | +0.06 to +0.08 |
| Query a new system with no useful evidence | -0.01 to +0.01 |
| Query an already-queried system (duplicate) | -0.03 |
| Attach helpful evidence | +0.08 per piece |
| Attach harmful evidence | -0.08 per piece |
| Attach neutral evidence | +0.01 |
| Remove harmful evidence | +0.05 |
| Remove helpful evidence | -0.03 |
| Set optimal strategy | +0.10 |
| Set acceptable strategy | +0.03 |
| Set wrong strategy | -0.08 |
| Submit a strong representment on time | +0.20 |
| Submit after deadline | -0.20 |
| Submit with missing required evidence | -0.18 |
| Submit with harmful evidence attached | -0.15 |
| Contest a case that shouldn't be contested | -0.12 |
| Resolve with optimal strategy | +0.16 |
| Resolve with acceptable strategy | +0.06 |
| Resolve with wrong strategy | -0.12 |
| Invalid action | -0.12 |
These rewards are shaping signals. The final score comes from the deterministic grader, not reward accumulation.
---
## How the Agent Works
The agent is implemented in `runners/baseline_runner.py`. It is a **heuristic-first, LLM-augmented** policy. The heuristic handles ~90% of decisions deterministically. The LLM is called only when the heuristic encounters genuine ambiguity (multiple meaningfully different candidates).
### Why Heuristic-First?
1. **Reliability**: Heuristic decisions never fail, never time out, and never cost money.
2. **Speed**: No network round-trip for obvious moves.
3. **Determinism**: Same input always produces same output (important for reproducibility).
4. **Budget**: LLM calls cost tokens and have rate limits. The agent makes 6-20 decisions per episode.
The LLM acts as a **tiebreaker** when the heuristic produces multiple viable candidates that differ in action type. This hybrid approach gets the best of both worlds.
---
## The Three-Tier Decision Pipeline
Every step, the agent runs this pipeline:
### Tier 1: `candidate_actions(observation)`
Reads the current observation and generates a list of `CandidateAction` objects -- the legal moves the agent considers. This is the core intelligence of the agent.
The function applies these checks in strict priority order:
1. **No case selected?** Generate `select_case` candidates sorted by triage priority.
2. **Current case resolved?** Switch to an open case.
3. **Harmful evidence attached?** Immediately generate `remove_evidence` and return. This takes priority over every check below because harmful evidence zeroes the packet_validity score (10% of total) and also reduces evidence_quality.
4. **Deadline <= 1 step?** Emergency submit or resolve. No time for anything else.
5. **Budget too tight to contest?** If there aren't enough steps to run the full contest path (minimum 5 steps: policy + query + attach + strategy + submit), or if this is the lowest-value case in a multi-case triage, fast-concede with `issue_refund`.
6. **Budget pressure (steps <= cases * 2)?** If the inferred strategy is accept/refund, resolve immediately.
7. **Reason code handler**: Dispatches to a reason-code-specific handler that generates the appropriate sequence of queries, evidence attachments, strategy setting, and submission.
### Tier 2: `_obvious_next_action(observation, candidates)`
Before calling any LLM, checks if the choice is trivial:
- Only 1 candidate? Take it.
- All candidates have the same action type? Take the first.
- One candidate targets a case with a much tighter deadline? Take it.
If obvious, the LLM is skipped entirely.
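The three checks can be sketched roughly as below. This is an illustration rather than the code in the runner: the `Candidate` dataclass and its fields are hypothetical stand-ins for `CandidateAction`, and the `deadline_gap` threshold of 2 steps is an assumption, since the text only says "much tighter".

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for the agent's CandidateAction."""
    action_type: str
    case_id: str
    steps_until_deadline: int


def obvious_next_action(
    candidates: Sequence[Candidate], deadline_gap: int = 2
) -> Optional[Candidate]:
    """Return a candidate when the choice is trivial, otherwise None."""
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    # All candidates share one action type: take the first (already priority-sorted).
    if len({c.action_type for c in candidates}) == 1:
        return candidates[0]
    # Assumed reading of "much tighter": the most urgent candidate leads the
    # runner-up by at least `deadline_gap` steps.
    ordered = sorted(candidates, key=lambda c: c.steps_until_deadline)
    if ordered[1].steps_until_deadline - ordered[0].steps_until_deadline >= deadline_gap:
        return ordered[0]
    return None  # genuine ambiguity: defer to Tier 3
```

Returning `None` is what hands control to Tier 3 in this sketch.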
### Tier 3: LLM or `_heuristic_pick(candidates)`
When Tier 2 returns None (genuine ambiguity):
- **With LLM**: Sends the observation summary and candidate list as a JSON prompt. The model returns `{"candidate_index": N, "rationale": "..."}`. On failure, walks the fallback chain (OpenRouter -> Gemini -> Groq).
- **Without LLM**: `_heuristic_pick()` returns the first candidate (the heuristic already sorted by priority).
---
## Reason Code Strategies
The agent handles 6 reason code families, each with a different workflow:
### `goods_not_received` (Deterministic: contest)
The customer claims they never received the product. The merchant almost always has delivery proof.
**Steps**: select -> query orders -> query shipping -> attach delivery evidence -> set_strategy contest -> submit representment
**Systems queried**: orders, shipping
**Typical evidence**: Order confirmation, delivery scan, tracking number
**Strategy**: Always contest (delivery proof is definitive)
### `fraud_cnp` (Non-deterministic: contest or accept)
Card-not-present fraud. The customer claims they didn't authorize the transaction. This is the most nuanced reason code -- sometimes the merchant has strong evidence (prior good orders, account match), sometimes they don't.
**Steps**: select -> retrieve_policy -> query risk + support (+ orders if budget allows) -> attach non-harmful evidence -> set_strategy per policy -> submit or resolve
**Systems queried**: risk, support, orders (optional under tight budget)
**Typical evidence**: Risk assessment, prior order linkage, account verification
**Harmful evidence**: AVS mismatch, CVV mismatch (proves the card data didn't fully match)
**Strategy**: Contest if strong evidence exists, accept_chargeback if evidence is weak
### `credit_not_processed` (Deterministic: issue_refund)
The customer claims a refund was promised but never issued. The correct response is to issue the refund.
**Steps**: select -> set_strategy issue_refund -> resolve_case (3 steps total)
**Strategy**: Always issue_refund (cheapest to resolve, no contest needed)
### `duplicate_processing` (Deterministic: issue_refund)
The customer was charged twice. The correct response is to refund the duplicate.
**Steps**: select -> set_strategy issue_refund -> resolve_case (3 steps total)
**Strategy**: Always issue_refund
### `product_not_as_described` (Non-deterministic: contest or accept)
The customer claims the product didn't match its description. Success depends on whether the merchant has listing accuracy proof and whether the customer bypassed the return process.
**Steps**: select -> retrieve_policy -> query orders + support (+ shipping if deadline allows) -> attach listing/return evidence -> contest or accept per guidance
**Systems queried**: orders, support, shipping (optional)
**Strategy**: Contest if listing proof is strong, accept_chargeback if not supportable
### `service_not_provided` (Non-deterministic: contest or accept)
The customer claims a service was never delivered. Success depends on completion records and customer acknowledgment.
**Steps**: select -> retrieve_policy -> query support (+ orders if deadline allows) -> attach completion evidence -> contest or accept per guidance
**Systems queried**: support, orders (optional)
**Strategy**: Contest if service completion proof exists, accept_chargeback otherwise
---
## Multi-Case Triage
When the agent has multiple open cases and the total estimated step cost exceeds the budget, it uses a triage algorithm:
### Step Cost Estimates
| Reason Code | Est. Steps | Notes |
|---|---|---|
| `goods_not_received` | 6 | select + 2 queries + attach + strategy + submit |
| `credit_not_processed` | 3 | select + strategy + resolve |
| `duplicate_processing` | 3 | select + strategy + resolve |
| `fraud_cnp` | 8 | select + policy + 2-3 queries + attach + strategy + submit |
| `product_not_as_described` | 8 | select + policy + 2-3 queries + attach + strategy + submit |
| `service_not_provided` | 7 | select + policy + 2 queries + attach + strategy + submit |
### Triage Algorithm
```
1. If total_estimated_cost > steps_remaining:
     Sort cases: deterministic-strategy codes first, then by amount descending.
     This ensures cheap, guaranteed-outcome cases are handled first,
     and the highest-value non-deterministic cases get remaining budget.
2. When processing each case, check:
     - Is steps_remaining < 5? → Fast-concede (can't even minimally contest).
     - Is this the lowest-value case and total_cost > budget? → Fast-concede.
     - Otherwise → Full contest or policy-guided resolution.
3. Never interrupt a near-complete case:
     - If the current case has evidence attached and is 1-2 steps from
       submission, finish it before switching to another case's deadline.
```
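Step 1 of the algorithm can be sketched in Python as follows. The step costs are copied from the table above. The `QueueCase` dataclass is hypothetical, and keeping the original queue order when the budget is sufficient is an assumption, because the text does not say what happens in that branch.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Estimated step costs, copied from the Step Cost Estimates table.
EST_STEPS = {
    "goods_not_received": 6,
    "credit_not_processed": 3,
    "duplicate_processing": 3,
    "fraud_cnp": 8,
    "product_not_as_described": 8,
    "service_not_provided": 7,
}
# Reason codes whose optimal strategy never varies.
DETERMINISTIC = {"goods_not_received", "credit_not_processed", "duplicate_processing"}


@dataclass(frozen=True)
class QueueCase:
    """Hypothetical view of one queue entry."""
    case_id: str
    reason_code: str
    amount: float


def triage_order(cases: Sequence[QueueCase], steps_remaining: int) -> List[QueueCase]:
    """Order cases: deterministic codes first, then amount descending."""
    total_cost = sum(EST_STEPS[c.reason_code] for c in cases)
    if total_cost <= steps_remaining:
        return list(cases)  # assumption: no reordering when the budget suffices
    # False sorts before True, so deterministic codes come first.
    return sorted(cases, key=lambda c: (c.reason_code not in DETERMINISTIC, -c.amount))
```

Because the sort key uses only the deterministic flag and the amount, a high-value `goods_not_received` case can sort ahead of a cheaper 3-step case in this sketch.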
### Why This Ordering Works
- **credit_not_processed/duplicate_processing** cost 3 steps and always get optimal score. Handle them first to free budget.
- **goods_not_received** costs 6 steps and always contests. Handle next.
- **fraud_cnp/product_not_as_described/service_not_provided** cost 7-8 steps and may need to concede. Handle last -- if budget runs out, conceding these with `issue_refund` (an acceptable fallback) still earns 35% strategy correctness.
---
## Evidence Handling
### Harmful Evidence Detection
The agent maintains a set of 15 negative-signal keywords derived from real chargeback dispute patterns:
```
mismatch, failed, declined, suspicious, flagged, fraud risk,
unauthorized, rejected, invalid, expired, violation,
non-compliant, discrepancy, inconsistent, unverified
```
Every evidence item's title and summary are scanned. If any harmful keyword is found, the evidence is:
1. **Never attached** (ranked 999 in the priority sort, excluded from `add_evidence` calls)
2. **Removed if already attached** (a `remove_evidence` action is generated immediately before any other action)
### Evidence Priority Ranking
Non-harmful evidence is ranked by keyword relevance:
| Rank | Keywords | Example |
|---|---|---|
| 0 (highest) | signature, completion, booking, listing | "Delivery signature scan" |
| 1 | duplicate, delivery, prior, account, authenticated | "Prior good order linkage" |
| 2 | return policy, refund, cancel, confirmation, cancellation | "Return policy documentation" |
| 4 (default) | anything else | "Internal memo" |
| 999 (excluded) | mismatch, failed, declined, suspicious, flagged, fraud risk, unauthorized, rejected, invalid, expired, violation, non-compliant, discrepancy, inconsistent, unverified | "AVS mismatch report" |
### Attachment Strategy
The agent attaches **all** non-harmful retrieved evidence in a single `add_evidence` call. This maximizes the evidence_quality score, which rewards `helpful_attached / total_helpful`.
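The keyword scan, ranking, and attach-everything-safe rule might look like the sketch below. The keyword lists come from the tables above. The case-insensitive substring matching, the dictionary shape of an evidence item, and the helper names are assumptions made for illustration.

```python
from typing import Dict, List

HARMFUL_KEYWORDS = (
    "mismatch", "failed", "declined", "suspicious", "flagged", "fraud risk",
    "unauthorized", "rejected", "invalid", "expired", "violation",
    "non-compliant", "discrepancy", "inconsistent", "unverified",
)
RANKED_KEYWORDS = (
    (0, ("signature", "completion", "booking", "listing")),
    (1, ("duplicate", "delivery", "prior", "account", "authenticated")),
    (2, ("return policy", "refund", "cancel", "confirmation", "cancellation")),
)
DEFAULT_RANK = 4
HARMFUL_RANK = 999


def evidence_rank(title: str, summary: str = "") -> int:
    """Lower is better; harmful items get HARMFUL_RANK."""
    text = f"{title} {summary}".lower()
    if any(keyword in text for keyword in HARMFUL_KEYWORDS):
        return HARMFUL_RANK
    for rank, keywords in RANKED_KEYWORDS:
        if any(keyword in text for keyword in keywords):
            return rank
    return DEFAULT_RANK


def attachable_ids(items: List[Dict[str, str]]) -> List[str]:
    """IDs of all non-harmful items, best rank first, for one add_evidence call."""
    ranked = sorted(
        ((evidence_rank(e["title"], e.get("summary", "")), e["id"]) for e in items),
        key=lambda pair: pair[0],
    )
    return [evidence_id for rank, evidence_id in ranked if rank != HARMFUL_RANK]
```

Scanning the summary as well as the title is what lets this kind of check catch an item with a reassuring title and damaging content.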
---
## Representment Notes
When the agent submits a contest, it generates a representment note. The grader scores notes on 4 dimensions:
| Dimension | Weight | What Earns Points |
|---|---|---|
| Substance | 20% | Note has >= 5 words |
| Policy claims coverage | 50% | Note mentions keywords from `case.policy_requirements` (e.g., "order confirmation", "carrier delivery") |
| Evidence coherence | 15% | Note references attached evidence IDs (e.g., "E1-ORDER-CONF") |
| Harmful mention penalty | -15% each | Note contains words like "mismatch", "failed", "declined" |
### How the Agent Builds Notes
1. Start with a **reason-code-specific template** that uses policy requirement language:
- goods_not_received: "Order confirmation and carrier delivery confirmation establish fulfillment..."
- fraud_cnp: "Prior good order linkage and customer account confirmation..." (never mentions "mismatch")
- product_not_as_described: "Product listing verification confirms..."
- service_not_provided: "Service completion record and customer acknowledgment..."
2. If policy was retrieved, append the policy requirements directly:
- "Evidence covers: order confirmation, carrier delivery confirmation."
3. Append evidence IDs for coherence scoring:
- "Supporting evidence: E1-ORDER-CONF, E1-DELIVERY-SCAN."
4. Truncate to 500 characters.
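
A minimal sketch of these four steps follows. The function name and signature are hypothetical, and the template is passed in by the caller because the real templates are only quoted in part above.

```python
from typing import Sequence


def build_note(
    template: str,
    policy_requirements: Sequence[str] = (),
    evidence_ids: Sequence[str] = (),
    max_chars: int = 500,
) -> str:
    """Assemble a representment note: template, policy claims, evidence IDs."""
    parts = [template.strip()]
    if policy_requirements:  # only available once retrieve_policy has been called
        parts.append("Evidence covers: " + ", ".join(policy_requirements) + ".")
    if evidence_ids:
        parts.append("Supporting evidence: " + ", ".join(evidence_ids) + ".")
    return " ".join(parts)[:max_chars]
```

For example, `build_note("Delivery proof establishes fulfillment.", ["order confirmation"], ["E1-ORDER-CONF"])` yields a note that names both a policy keyword and an attached evidence ID, the two things the coverage and coherence dimensions look for.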
---
## The Grading System
After all cases are resolved (or the step budget is exhausted), the grader scores each case across 8 dimensions. Each dimension is an OpenEnv `Rubric` subclass defined in `evaluation/rubrics.py`; they compose into a per-case `WeightedSum` (wrapped in a `Gate(CaseAbandonedRubric)` deadline guard) and an episode-level `ChargebackOpsEpisodeRubric` that is wired into `env.rubric`. `evaluation/grading.py` keeps the legacy `score_case` / `grade_episode` API as a thin adapter over the rubric tree.
### Strategy Correctness (20%)
| Outcome | Score |
|---|---|
| Chose the optimal strategy | 1.0 |
| Chose an acceptable fallback | 0.35 |
| Chose the wrong strategy | 0.0 |
"Optimal" and "acceptable" strategies are defined per case by the task scenario. For example, `goods_not_received` optimal is always "contest" with no acceptable fallback. `fraud_cnp` optimal might be "contest" with "accept_chargeback" as acceptable, or vice versa.
### Evidence Quality (15%)
For **contest** cases:
```
quality = 0.7 * (required_attached / required_total)
+ 0.3 * (helpful_attached / helpful_total)
- 0.25 * harmful_attached_count
```
For **non-contest** cases where optimal strategy is also non-contest:
- 1.0 if no evidence was attached (clean concession)
- 0.7 if evidence was attached (unnecessary work)
For **non-contest** cases where optimal was contest:
- 0.15 (the agent abandoned evidence gathering for a contestable case)
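
The three branches can be written out as a short sketch. Two details are assumptions that the text does not state: a ratio with a zero denominator is treated as 1.0, and the contest score is clamped to `[0.0, 1.0]`. The function names are hypothetical.

```python
def _ratio(attached: int, total: int) -> float:
    # Assumption: nothing to attach counts as fully satisfied.
    return attached / total if total else 1.0


def contest_quality(
    required_attached: int,
    required_total: int,
    helpful_attached: int,
    helpful_total: int,
    harmful_attached: int,
) -> float:
    """Evidence quality for a contested case."""
    raw = (
        0.7 * _ratio(required_attached, required_total)
        + 0.3 * _ratio(helpful_attached, helpful_total)
        - 0.25 * harmful_attached
    )
    return max(0.0, min(1.0, raw))  # clamping is an assumption


def concession_quality(optimal_is_contest: bool, evidence_attached: bool) -> float:
    """Evidence quality for a case resolved without contesting."""
    if optimal_is_contest:
        return 0.15  # a contestable case was conceded
    return 0.7 if evidence_attached else 1.0
```

With all required evidence and half of the helpful evidence attached, the contest score is `0.7 + 0.15 = 0.85`; a single harmful attachment on an otherwise perfect packet brings it to `0.75`.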
### Packet Validity (10%)
Binary, all-or-nothing:
- **1.0** if ALL required evidence is attached AND zero harmful evidence is attached
- **0.0** otherwise
This is the strictest dimension. Missing one required piece or having one harmful piece zeroes it out.
### Deadline Compliance (10%)
Binary:
- **1.0** if the case was resolved at or before the deadline step
- **0.0** if resolved after the deadline or never resolved
### Efficiency (10%)
```
efficiency = 1.0 - min(0.9, (duplicate_queries + invalid_actions) * 0.1 + submit_attempts * 0.05)
```
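The agent loses 0.1 per duplicate system query or invalid action, and 0.05 per submit attempt. The penalty in this formula is capped at 0.9, so the base term never drops below 0.1; the additional penalties listed next can lower the score further, down to a minimum of 0.0.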
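As a quick check of the base formula, here is a direct Python transcription. The function name is hypothetical, and it covers only the formula above, not the additional penalties or the bonus.

```python
def base_efficiency(duplicate_queries: int, invalid_actions: int, submit_attempts: int) -> float:
    """Base efficiency term; the penalty is capped at 0.9."""
    penalty = (duplicate_queries + invalid_actions) * 0.1 + submit_attempts * 0.05
    return 1.0 - min(0.9, penalty)
```

A clean contest with a single submit scores about 0.95, while any run with nine or more duplicate queries or invalid actions sits at the 0.1 floor of this term.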
Additional penalties for shallow operational behaviour:
- **Over-querying a concedable case**: -0.15 per system queried beyond the 2nd when the agent concedes a case whose optimal strategy is also non-contest. Querying 4+ systems before conceding is wasteful.
- **Late policy retrieval**: -0.08 when policy is retrieved but the case is resolved with a concession that matches the optimal non-contest strategy. The policy step was wasted.
- **Early correct concession bonus**: +0.10 when the agent correctly concedes a case (matching optimal) within 3 steps. Rewards recognising a bad case quickly.
### Outcome Quality (10%)
| Outcome | Score |
|---|---|
| Final resolution matches optimal strategy | 1.0 |
| Final resolution is an acceptable fallback | 0.4 |
| Final resolution is wrong | 0.0 |
### Note Quality (5%)
Only scored for contest cases with a representment note. See [Representment Notes](#representment-notes) for the scoring breakdown.
### Escalation ROI (20%)
Encodes the economic rule that escalating to network arbitration is rational only when
`P(win) × dispute_amount > $250 fee`. Conceding a positive-EV contestable case (where
`amount > $250` and the optimal strategy is `contest`) is penalised. Escalating a
negative-EV case (low P(win) or low amount) is also penalised. This is the dimension that
keeps `concede_all` from being a free 0.6+ score.
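The decision rule itself is a one-line comparison. The sketch below uses a hypothetical function name and does not reproduce how `EscalationROIRubric` turns the rule into a score.

```python
ARBITRATION_FEE = 250.0  # per-side network arbitration fee, in dollars


def escalation_is_ev_rational(p_win: float, amount: float, fee: float = ARBITRATION_FEE) -> bool:
    """True when the expected recovery exceeds the arbitration fee."""
    return p_win * amount > fee
```

A $500 dispute clears the bar at `P(win) = 0.8` (expected recovery $400) but not at `P(win) = 0.4` ($200), and a $200 dispute can never clear it because even a certain win recovers less than the fee.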
### Deadline Gate
Before the WeightedSum scores anything, `Gate(CaseAbandonedRubric)` checks whether the case
was left unresolved past its deadline. If yes, the entire case score is hard-zeroed. This
prevents the agent from gaming the rubric by ignoring nightmare-tier cases and still
collecting partial credit on the dimensions it did touch.
### Final Score Calculation
```
case_score = 0.20 * strategy_correctness
+ 0.15 * evidence_quality
+ 0.10 * packet_validity
+ 0.10 * deadline_compliance
+ 0.10 * efficiency
+ 0.10 * outcome_quality
+ 0.05 * note_quality
+ 0.20 * escalation_roi
case_score = 0.0 if case_abandoned else case_score # deadline gate
weighted_case_score = case_score * case_weight
episode_score = sum(weighted_case_scores) / sum(case_weights)
```
Case weights are determined by financial impact (amount and difficulty). The episode score normalizes to `[0.0, 1.0]`.
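The calculation above translates directly into Python. The weights are taken from the formula; the function names and the `(case_score, case_weight)` tuple shape are assumptions for illustration, not the rubric classes themselves.

```python
from typing import Dict, List, Tuple

# Dimension weights from the formula above; they sum to 1.0.
WEIGHTS = {
    "strategy_correctness": 0.20,
    "evidence_quality": 0.15,
    "packet_validity": 0.10,
    "deadline_compliance": 0.10,
    "efficiency": 0.10,
    "outcome_quality": 0.10,
    "note_quality": 0.05,
    "escalation_roi": 0.20,
}


def case_score(dimensions: Dict[str, float], abandoned: bool) -> float:
    """Weighted sum of the 8 dimensions, hard-zeroed by the deadline gate."""
    if abandoned:
        return 0.0
    return sum(weight * dimensions[name] for name, weight in WEIGHTS.items())


def episode_score(cases: List[Tuple[float, float]]) -> float:
    """Weight-normalised mean over (case_score, case_weight) pairs."""
    total_weight = sum(weight for _, weight in cases)
    if total_weight == 0:
        return 0.0
    return sum(score * weight for score, weight in cases) / total_weight
```

Because the gate is applied before aggregation, an abandoned heavy case drags the episode score down by its full weight.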
---
## The Issuer Agent
After every `submit_representment`, a scripted `IssuerAgent` (see `scenarios/issuer_model.py`)
reviews the packet and returns one of three decisions:
| Decision | Score band (round 1) | Score band (round 2) | What happens |
|---|---|---|---|
| `accept` | ≥ 0.70 | ≥ 0.60 | Merchant wins the dispute, case closes positive |
| `request_more_evidence` | 0.40 – 0.70 | < 0.60 | Round 2: merchant gets one more shot with compelling evidence |
| `escalate_to_arbitration` | < 0.40 | (only if merchant escalates) | Round 3: case goes to network arbitration |
The score itself comes from `evidence_strength_score`:
```
score = 0.4 (if all required evidence attached)
      + min(0.4, 0.2 × helpful_attached)
      − 0.3 × harmful_attached              # uncapped
      + 0.1 (if note has ≥ 2 policy keywords)
      + min(0.30, 0.15 × pre_arb_unique)    # round 2 only
```
In the round-1 ambiguity band (0.40–0.70), the deterministic fallback uses the midpoint rule:
`accept` at score ≥ 0.55, otherwise `request_more_evidence`. An optional LLM softening layer
can override this midpoint when an API key is set; with no key it falls back to the
deterministic rule so offline benchmarks stay reproducible.
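A sketch of the score and the deterministic round-1 decision is shown below. The function signatures are hypothetical and are not taken from `scenarios/issuer_model.py`; the sketch omits the optional LLM softening layer and reads the midpoint rule as applying only inside the 0.40–0.70 band.

```python
def evidence_strength_score(
    all_required_attached: bool,
    helpful_attached: int,
    harmful_attached: int,
    note_policy_keywords: int,
    pre_arb_unique: int = 0,
    round_number: int = 1,
) -> float:
    """Packet strength as described by the formula above."""
    score = 0.4 if all_required_attached else 0.0
    score += min(0.4, 0.2 * helpful_attached)
    score -= 0.3 * harmful_attached  # uncapped, so the score can go negative
    if note_policy_keywords >= 2:
        score += 0.1
    if round_number == 2:
        score += min(0.30, 0.15 * pre_arb_unique)
    return score


def round1_decision(score: float) -> str:
    """Deterministic round-1 fallback (no LLM softening)."""
    if score >= 0.70:
        return "accept"
    if score < 0.40:
        return "escalate_to_arbitration"
    # Ambiguity band: midpoint rule.
    return "accept" if score >= 0.55 else "request_more_evidence"
```

A packet with all required evidence, two helpful items, and a note with two policy keywords scores about 0.9 and is accepted; a packet with required evidence only scores 0.4 and triggers `request_more_evidence`.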
## Arbitration
Network arbitration is a pure function (see `scenarios/arbitration.py`). Given the same case ID
and packet state, the ruling is always the same: inside an ambiguity band, it seeds a coin flip
from a SHA-256 hash of the case ID. The bands:
| Evidence-strength score | Ruling |
|---|---|
| ≥ 0.65 | `merchant_wins` |
| ≤ 0.35 | `issuer_wins` |
| (0.35, 0.65) | seeded coin flip on `sha256(case_id)` |
Both sides pay a $250 fee regardless of outcome. The winner is reimbursed the dispute amount
minus their $250 fee; the loser eats the dispute amount plus the $250 fee. The
`EscalationROIRubric` reads the final P&L and scores whether the agent's escalate / concede
decision was EV-rational ex ante.
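The banded ruling can be sketched as below. The thresholds come from the table above. How the coin flip is derived from the hash is not specified, so using the parity of the first digest byte is purely an assumption, as is the function name.

```python
import hashlib


def arbitration_ruling(case_id: str, score: float) -> str:
    """Deterministic ruling: clear bands first, seeded coin flip in between."""
    if score >= 0.65:
        return "merchant_wins"
    if score <= 0.35:
        return "issuer_wins"
    # Assumed derivation: parity of the first byte of sha256(case_id).
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    return "merchant_wins" if digest[0] % 2 == 0 else "issuer_wins"
```

Whatever the exact derivation, the property that matters is that repeated calls with the same case ID and score return the same ruling.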
## LLM Integration
The agent supports 5 LLM providers through OpenAI-compatible clients:
| Provider | Model | Base URL |
|---|---|---|
| OpenRouter | openai/gpt-oss-120b | openrouter.ai/api/v1 |
| Google Gemini | gemini-2.5-flash | generativelanguage.googleapis.com/v1beta/openai/ |
| Groq | llama-3.3-70b-versatile | api.groq.com/openai/v1 |
| OpenAI | gpt-4.1-mini | api.openai.com/v1 |
| Anthropic | claude-sonnet-4 | (compatible gateway) |
### Fallback Chain
```
Primary (configured in .env) → OpenRouter → Google Gemini → Groq → Heuristic
```
If the primary provider fails (timeout, rate limit, connection error), the agent automatically tries the next provider in the chain. If all providers fail, it falls back to `_heuristic_pick()`.
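The chain reduces to trying each provider in order and falling back to the heuristic. In this sketch each provider is modelled as a zero-argument callable returning a candidate index, which is an assumption for illustration; the real runner's client objects, error types, and `STRICT_LLM_MODE` handling are not shown.

```python
from typing import Callable, Sequence


def pick_with_fallback(
    providers: Sequence[Callable[[], int]],
    heuristic: Callable[[], int],
) -> int:
    """Return the first successful provider result, else the heuristic pick."""
    for call_provider in providers:
        try:
            return call_provider()
        except Exception:
            # Timeout, rate limit, or connection error: try the next provider.
            # A production version would catch narrower exception types.
            continue
    return heuristic()
```

Passing an empty provider list makes the function equivalent to running the heuristic alone.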
### What the LLM Sees
```json
{
"queue_summary": "2 open cases, 8 steps remaining",
"visible_case": "CB-G1, fraud_cnp, $480, deadline in 3 steps",
"candidates": [
{"index": 0, "action": "submit_representment", "summary": "Submit the contest package"},
{"index": 1, "action": "query_system orders", "summary": "Query orders for more evidence"},
{"index": 2, "action": "select_case CB-G2", "summary": "Switch to case with deadline in 1 step"}
]
}
```
### What the LLM Returns
```json
{
"candidate_index": 0,
"rationale": "CB-G1 has sufficient evidence and contesting before deadline takes priority over gathering more evidence."
}
```
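A defensive parser for this response might look like the following sketch. Treating a malformed or out-of-range reply as "use candidate 0" is an assumption that mirrors `_heuristic_pick()` returning the first candidate; the function name is hypothetical.

```python
import json


def parse_choice(raw: str, n_candidates: int) -> int:
    """Extract candidate_index from the model reply; fall back to 0 if invalid."""
    try:
        index = json.loads(raw)["candidate_index"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return 0
    # Reject booleans, non-integers, and indices outside the candidate list.
    if isinstance(index, int) and not isinstance(index, bool) and 0 <= index < n_candidates:
        return index
    return 0
```

Validating the index against the candidate count keeps a hallucinated index from turning into an invalid action, which the environment penalises.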
### Configuration
| Env Variable | Default | Purpose |
|---|---|---|
| `BASELINE_PROVIDER` | openrouter | Primary LLM provider |
| `BASELINE_MODEL` | openai/gpt-oss-120b | Model to use |
| `BASELINE_REQUEST_TIMEOUT_SECONDS` | 15 | Per-call timeout |
| `PROVIDER_RATE_LIMIT_RETRIES` | 2 | Retry count on rate limits |
| `PROVIDER_RETRY_BACKOFF_SECONDS` | 1.0 | Backoff between retries |
| `MAX_PROVIDER_RESPONSE_TOKENS` | 200 | Max tokens for LLM response |
| `STRICT_LLM_MODE` | false | If true, fail instead of falling back to heuristic |
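
Reading these variables with their documented defaults could be sketched as follows. The `BaselineConfig` dataclass and `load_config` helper are hypothetical names, and the set of strings accepted as "true" is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BaselineConfig:
    provider: str
    model: str
    timeout_seconds: float
    rate_limit_retries: int
    retry_backoff_seconds: float
    max_response_tokens: int
    strict_llm_mode: bool


def load_config(env: Optional[Mapping[str, str]] = None) -> BaselineConfig:
    """Build the config from environment variables, using the table's defaults."""
    env = os.environ if env is None else env
    return BaselineConfig(
        provider=env.get("BASELINE_PROVIDER", "openrouter"),
        model=env.get("BASELINE_MODEL", "openai/gpt-oss-120b"),
        timeout_seconds=float(env.get("BASELINE_REQUEST_TIMEOUT_SECONDS", "15")),
        rate_limit_retries=int(env.get("PROVIDER_RATE_LIMIT_RETRIES", "2")),
        retry_backoff_seconds=float(env.get("PROVIDER_RETRY_BACKOFF_SECONDS", "1.0")),
        max_response_tokens=int(env.get("MAX_PROVIDER_RESPONSE_TOKENS", "200")),
        # Assumed truthy spellings; the table only documents "false" as the default.
        strict_llm_mode=env.get("STRICT_LLM_MODE", "false").strip().lower() in {"1", "true", "yes"},
    )
```

Accepting an explicit mapping makes the loader easy to exercise in tests without touching the process environment.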
---
## Key Optimizations
### 1. Deterministic Strategy Inference
For reason codes where the optimal strategy never varies (`goods_not_received` = contest, `credit_not_processed` / `duplicate_processing` = issue_refund), the agent skips `retrieve_policy` entirely. This saves 1 step per case.
### 2. Deadline-Aware Query Limiting
When the remaining steps before deadline can't accommodate all planned queries, the agent reduces the number of systems queried:
- `product_not_as_described`: drops from 3 systems (orders, support, shipping) to 2 (orders, support)
- `fraud_cnp`: drops from 3 systems (risk, support, orders) to 2 (risk, support)
- `service_not_provided`: drops from 2 systems (orders, support) to 1 (support)
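
One way to express this trimming is sketched below. The per-code system lists follow the bullets above, with the optional system last. The `overhead_steps` value of 4 (policy, attach, strategy, submit) is inferred from the minimum contest path described earlier and is an assumption, as is the function itself.

```python
from typing import List

# Planned systems per reason code; the last entry is the optional one.
QUERY_PLANS = {
    "product_not_as_described": ["orders", "support", "shipping"],
    "fraud_cnp": ["risk", "support", "orders"],
    "service_not_provided": ["support", "orders"],
}


def planned_queries(reason_code: str, steps_until_deadline: int, overhead_steps: int = 4) -> List[str]:
    """Drop the optional last system when the deadline cannot fit every query."""
    plan = QUERY_PLANS[reason_code]
    room_for_queries = steps_until_deadline - overhead_steps
    if room_for_queries >= len(plan):
        return list(plan)
    return plan[:-1]
```

With 7 steps left, a `fraud_cnp` case keeps all three queries; with 6 it falls back to `risk` and `support` only.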
### 3. Near-Completion Protection
When the current case has evidence attached and is 1-2 steps from submission, the agent does NOT switch to handle another case's deadline. Finishing the current case is almost always higher-value than interrupting.
### 4. Harmful Evidence Cleanup
Before any submit, the agent checks for harmful evidence in the attached set. If found, it generates a `remove_evidence` action immediately. This prevents the -0.25 per-piece evidence_quality penalty and the packet_validity = 0.0 failure.
### 5. Budget-Aware Note Generation
Representment notes are generated with direct references to policy requirement keywords and evidence IDs, maximizing the note_quality score (policy claims coverage 50% + evidence coherence 15%).
### 6. Adversarial Evidence (Hard/Nightmare)
At hard and nightmare difficulty, the case generator injects **adversarial evidence** -- items whose titles sound helpful ("Delivery verification report", "Account verification summary") but whose content is harmful (GPS discrepancies, prior non-receipt claims, failed 3D Secure challenges). This tests whether the agent reads beyond titles and inspects evidence content before attaching.
### 7. Nightmare Difficulty
Nightmare tasks push the step budget to its limit: 5-6 cases with ~2.4 steps per case. The agent must triage aggressively -- fast-conceding weak cases, handling deterministic codes first, and accepting that some cases will go unresolved. This tier specifically tests prioritisation under extreme resource pressure.
---
## File Map
| File | Purpose | Lines |
|---|---|---|
| `runners/baseline_runner.py` | The agent: decision pipeline, candidate generation, LLM integration, representment notes | ~1100 |
| `server/chargeback_ops_environment.py` | The environment: step/reset/state, action execution, reward computation | ~500 |
| `evaluation/rubrics.py` | OpenEnv `Rubric` subclasses for all 8 scoring dimensions, composed via `WeightedSum` + `Gate(CaseAbandonedRubric)` | ~400 |
| `scenarios/issuer_model.py` | Scripted `IssuerAgent`: evidence-strength scoring, threshold bands, optional LLM softening | ~250 |
| `scenarios/arbitration.py` | Deterministic network arbitration resolver with $250 per-side fee | ~120 |
| `evaluation/grading.py` | Legacy `score_case` / `grade_episode` adapter that delegates to the rubric tree | ~120 |
| `scenarios/simulation.py` | Task definitions, case progress tracking, evidence metadata | ~600 |
| `core/models.py` | Pydantic models for actions, observations, state, grading | ~600 |
| `runners/inference.py` | OpenEnv-compatible inference entry point with provider fallback | ~200 |
| `inference.py` | Root re-export for submission contract | ~10 |
| `scenarios/case_generator.py` | Parametric task generator with seeded RNG | ~700 |
| `scenarios/iso_adapter.py` | Converts ISO 20022 CASR.003 records to environment cases | ~160 |
| `connectors/stripe_sandbox.py` | Maps Stripe test-mode disputes to environment cases | ~280 |
| `evaluation/agent_brutal_audit.py` | 126-episode evaluation across all data sources | ~300 |
| `server/app.py` | FastAPI routes: /reset, /step, /state, /tasks, /baseline, /grader, /results, /demo | ~200 |
| `server/demo_ui.py` | Gradio live demo UI with step-by-step episode playback | ~150 |
| `core/episode_store.py` | Thread-safe storage with JSONL file persistence | ~60 |
| `core/client.py` | OpenEnv WebSocket client | ~100 |
---
## Performance
Tested across the 11-task headline benchmark (4 showcase + 7 seeded holdout) and a 28-task
multi-seed grid:
| Policy | Headline (11) | Multi-seed (28) | Delta vs naive |
|---|---|---|---|
| naive (empty packet) | 0.000 | 0.000 | -- |
| concede_all | 0.567 | 0.563 | +0.567 |
| escalate_all | 0.773 | 0.765 | +0.773 |
| heuristic | **0.773** | **0.765** | **+0.773** |
The difficulty curve runs 0.97 → 0.88 → 0.70 → 0.51 across easy / medium / hard / nightmare on
the multi-seed grid: monotone and well-separated. The `Gate(CaseAbandonedRubric)` wrapper
hard-zeros abandoned cases, and `EscalationROIRubric` (20%) penalises both conceding positive-EV
contestable cases and escalating negative-EV ones. Together they kill the concede-everything
shortcut. `escalate_all` ties the heuristic on the headline benchmark because the merchant's
round-1 packet is strong enough on most tasks that the pre-arb branch never fires. See
`docs/RESULTS.md` for full per-task numbers, the rubric tree, and reproduction commands.