# ChargebackOps Agent: Complete Technical Reference

This document explains every aspect of the ChargebackOps agent -- what problem it solves, how it thinks, how it is scored, and why every design decision was made.

---

## Table of Contents

- [The Problem](#the-problem)
- [The Use Case](#the-use-case)
- [How the Environment Works](#how-the-environment-works)
- [How the Agent Works](#how-the-agent-works)
- [The Three-Tier Decision Pipeline](#the-three-tier-decision-pipeline)
- [Reason Code Strategies](#reason-code-strategies)
- [Multi-Case Triage](#multi-case-triage)
- [Evidence Handling](#evidence-handling)
- [Representment Notes](#representment-notes)
- [The Grading System](#the-grading-system)
- [The Issuer Agent](#the-issuer-agent)
- [Arbitration](#arbitration)
- [LLM Integration](#llm-integration)
- [Key Optimizations](#key-optimizations)
- [File Map](#file-map)
- [Performance](#performance)

---

## The Problem

When a customer disputes a credit card charge, the card network (Visa, Mastercard) initiates a **chargeback** against the merchant. The merchant loses the transaction amount immediately. To recover the funds, the merchant must build a **representment package** -- a bundle of evidence proving the charge was legitimate -- and submit it before a hard deadline.

This is not a simple yes/no decision. Each dispute has:

- A **reason code** (why the customer disputes: fraud, goods not received, product not as described, etc.)
- A **deadline** (fixed number of steps before the case auto-closes against the merchant)
- **Evidence** scattered across 6 internal merchant systems (orders, payment, shipping, support, refunds, risk)
- Evidence of mixed value: some is **helpful**, some is **required**, and some is **harmful** (weakens the case if included)
- A correct **strategy** that depends on the evidence available (contest, accept the chargeback, or issue a refund)

A human dispute analyst handles 50-200 cases per day. They must triage by urgency, query the right systems, avoid attaching damaging evidence, and submit within deadline.
Mistakes are expensive: a lost $500 chargeback plus a $25 network fee per case.

**ChargebackOps turns this into a measurable agent benchmark.** The agent must do exactly what a human analyst does, but programmatically, under step-budget constraints, with deterministic scoring.

---

## The Use Case

ChargebackOps is built for the [OpenEnv](https://meta-pytorch.org/OpenEnv/index.html) evaluation framework. It is a **simulated merchant dispute resolution environment** where an AI agent acts as the dispute analyst.

**What the agent receives:**

- A queue of 1-6 open dispute cases (5-6 at nightmare difficulty)
- A step budget (10-20 actions total, ~2.4 steps/case at nightmare)
- Per-case deadlines (must resolve before step N)

**What the agent must do:**

- Select and focus on one case at a time
- Query internal merchant systems to retrieve evidence
- Decide whether to contest, accept, or refund each case
- Attach the right evidence (and avoid harmful artifacts)
- Write a representment note explaining why the dispute should be reversed
- Submit or resolve each case before its deadline
- Manage the step budget across all cases when it cannot cover every case in full

**What the agent is scored on:**

- Did it choose the correct strategy? (20% of score)
- Did it gather the right evidence? (15%)
- Is the evidence packet complete and clean? (10%)
- Did it meet the deadline? (10%)
- Was it efficient (no wasted steps)? (10%)
- Did the resolution match the strategy? (10%)
- Is the representment note well-written? (5%)
- Was escalation EV-rational? (20%; escalate iff `P(win)·amount > $250 fee`)

After the merchant submits a representment, a scripted **IssuerAgent** reviews the packet and returns one of three decisions: `accept`, `request_more_evidence` (triggering pre-arbitration with compelling evidence), or `escalate_to_arbitration`. The merchant can also choose to escalate at round 2 instead of rebuilding the packet, or accept an arbitration loss to cap fees.
Network arbitration is a deterministic resolver: the loser eats the dispute amount plus a $250 fee, the winner is reimbursed the dispute amount minus their own $250 fee.

---

## How the Environment Works

The environment follows the OpenEnv `reset()` / `step()` / `state()` contract.

### Lifecycle

```
reset(task_id)  → Observation
step(action)    → Observation
state()         → State (includes grader report when done)
```

### Observation

Each observation contains:

| Field | Type | Description |
|---|---|---|
| `queue` | list | All cases with status, reason_code, amount, steps_until_deadline |
| `visible_case` | object or null | The currently selected case with full detail |
| `steps_remaining` | int | Steps left before episode ends |
| `done` | bool | Whether the episode is complete |
| `reward` | float | Immediate reward from the last action |
| `result` | string | Human-readable outcome of the last action |

### The Visible Case

When a case is selected, `visible_case` exposes:

| Field | Description |
|---|---|
| `case_id` | Unique identifier |
| `reason_code` | Why the customer disputed (e.g., `goods_not_received`) |
| `amount` | Transaction amount in dollars |
| `current_strategy` | Currently set strategy (null if not set) |
| `policy` | Policy guidance (null until `retrieve_policy` is called) |
| `systems_revealed` | Which merchant systems have been queried |
| `retrieved_evidence` | Evidence items revealed by queries |
| `attached_evidence` | Evidence currently attached to the representment package |
| `inspection_notes` | Analyst notes (null until `inspect_case` is called) |

### Action Space (12 Actions)

**Round 1: Representment**

| Action | Arguments | Cost | What It Does |
|---|---|---|---|
| `select_case` | case_id | 1 step | Focus on a case from the queue |
| `inspect_case` | case_id | 1 step | Reveal analyst inspection notes (+0.04 reward) |
| `query_system` | case_id, system_name | 1 step | Pull evidence from orders/payment/shipping/support/refunds/risk |
| `retrieve_policy` | case_id | 1 step | Get reason-code-specific guidance and required evidence list |
| `add_evidence` | case_id, evidence_ids | 1 step | Attach evidence to the representment package |
| `remove_evidence` | case_id, evidence_ids | 1 step | Remove evidence (useful for cleaning harmful attachments) |
| `set_strategy` | case_id, strategy | 1 step | Choose contest / accept_chargeback / issue_refund |
| `submit_representment` | case_id, note | 1 step | Submit the contest package (requires strategy = contest) |
| `resolve_case` | case_id, strategy | 1 step | Close a non-contest case (accept or refund) |

**Round 2/3: Pre-Arbitration & Arbitration**

| Action | Arguments | Cost | What It Does |
|---|---|---|---|
| `respond_to_pre_arb` | case_id, compelling_evidence_ids | 1 step | Attach compelling evidence and resubmit at round 2 (Issuer accept threshold drops to 0.60) |
| `escalate_to_arbitration` | case_id | 1 step | Skip rebuilding the packet, pay $250 fee, push to network arbitration |
| `accept_arbitration_loss` | case_id | 1 step | Concede at round 2/3 to cap fees |

Every action costs exactly 1 step. There is no free action. The agent must be deliberate about every step it takes.
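Because every action costs one step, the cost of a plan is simply its length. The sketch below illustrates that accounting for a full contest path built from the actions in the tables above. The `Action` record and its field names are hypothetical, chosen for illustration; the project's real action models may be shaped differently.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Action:
    """Hypothetical action record; field names are illustrative only."""
    name: str
    case_id: str
    args: dict = field(default_factory=dict)


def contest_plan(case_id, systems, evidence_ids, note):
    """Build a contest path: select, query each system, attach, set strategy, submit."""
    plan = [Action("select_case", case_id)]
    plan += [Action("query_system", case_id, {"system_name": s}) for s in systems]
    plan.append(Action("add_evidence", case_id, {"evidence_ids": list(evidence_ids)}))
    plan.append(Action("set_strategy", case_id, {"strategy": "contest"}))
    plan.append(Action("submit_representment", case_id, {"note": note}))
    return plan


def step_cost(plan):
    """Every action costs exactly 1 step, so cost equals plan length."""
    return len(plan)


plan = contest_plan(
    "CB-1",
    ["orders", "shipping"],
    ["E1-ORDER-CONF", "E1-DELIVERY-SCAN"],
    "Order confirmation and carrier delivery confirmation establish fulfillment.",
)
print(step_cost(plan))  # 6
```

Each extra system queried adds one step, which is why the query count is the main lever when the budget is tight.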
### Reward Signals

The environment returns immediate rewards after each action:

| Event | Reward |
|---|---|
| Select an open case | +0.02 |
| Inspect a case (first time) | +0.04 |
| Query a new system with helpful evidence | +0.06 to +0.08 |
| Query a new system with no useful evidence | -0.01 to +0.01 |
| Query an already-queried system (duplicate) | -0.03 |
| Attach helpful evidence | +0.08 per piece |
| Attach harmful evidence | -0.08 per piece |
| Attach neutral evidence | +0.01 |
| Remove harmful evidence | +0.05 |
| Remove helpful evidence | -0.03 |
| Set optimal strategy | +0.10 |
| Set acceptable strategy | +0.03 |
| Set wrong strategy | -0.08 |
| Submit a strong representment on time | +0.20 |
| Submit after deadline | -0.20 |
| Submit with missing required evidence | -0.18 |
| Submit with harmful evidence attached | -0.15 |
| Contest a case that shouldn't be contested | -0.12 |
| Resolve with optimal strategy | +0.16 |
| Resolve with acceptable strategy | +0.06 |
| Resolve with wrong strategy | -0.12 |
| Invalid action | -0.12 |

These rewards are shaping signals. The final score comes from the deterministic grader, not reward accumulation.

---

## How the Agent Works

The agent is implemented in `runners/baseline_runner.py`. It is a **heuristic-first, LLM-augmented** policy. The heuristic handles ~90% of decisions deterministically. The LLM is only called when the heuristic encounters genuine ambiguity (multiple meaningfully different candidates).

### Why Heuristic-First?

1. **Reliability**: Heuristic decisions never fail, never time out, never cost money.
2. **Speed**: No network round-trip for obvious moves.
3. **Determinism**: Same input always produces same output (important for reproducibility).
4. **Budget**: LLM calls cost tokens and have rate limits. The agent makes 6-20 decisions per episode.

The LLM acts as a **tiebreaker** when the heuristic produces multiple viable candidates that differ in action type. This hybrid approach gets the best of both worlds.
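The tiebreaker rule can be illustrated with a small sketch. It assumes candidates are represented only by their action-type strings and that the tiebreaker is any callable returning an index; both are simplifications for illustration, not the agent's actual interfaces.

```python
from typing import Callable, Optional, Sequence


def is_ambiguous(action_types: Sequence[str]) -> bool:
    """True when the candidates differ in action type."""
    return len(set(action_types)) > 1


def pick(action_types: Sequence[str],
         tiebreaker: Optional[Callable[[Sequence[str]], int]] = None) -> int:
    """Return the index of the chosen candidate.

    Candidates are assumed to be pre-sorted by heuristic priority, so index 0
    is the heuristic's choice. The tiebreaker (a stand-in for the LLM) is only
    consulted when the candidates differ in action type.
    """
    if not action_types:
        raise ValueError("no candidates to choose from")
    if tiebreaker is None or not is_ambiguous(action_types):
        return 0
    index = tiebreaker(action_types)
    # Ignore out-of-range answers and keep the heuristic's choice.
    return index if 0 <= index < len(action_types) else 0


# Same action type everywhere: the tiebreaker is never consulted.
print(pick(["query_system", "query_system"], tiebreaker=lambda c: 1))  # 0
# Differing action types: the tiebreaker decides.
print(pick(["submit_representment", "select_case"], tiebreaker=lambda c: 1))  # 1
```

Keeping the heuristic ordering as the default means the result is deterministic whenever no tiebreaker is configured.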

---

## The Three-Tier Decision Pipeline

Every step, the agent runs this pipeline:

### Tier 1: `candidate_actions(observation)`

Reads the current observation and generates a list of `CandidateAction` objects -- the legal moves the agent considers. This is the core intelligence of the agent.

The function applies these checks in strict priority order:

1. **No case selected?** Generate `select_case` candidates sorted by triage priority.
2. **Current case resolved?** Switch to an open case.
3. **Harmful evidence attached?** Immediately generate `remove_evidence` and return. This fires before any other work on the case because a single harmful attachment zeroes the packet_validity score (10% of total) and lowers evidence_quality.
4. **Deadline <= 1 step?** Emergency submit or resolve. No time for anything else.
5. **Budget too tight to contest?** If there aren't enough steps to run the full contest path (minimum 5 steps: policy + query + attach + strategy + submit), or if this is the lowest-value case in a multi-case triage, fast-concede with `issue_refund`.
6. **Budget pressure (steps <= cases * 2)?** If the inferred strategy is accept/refund, resolve immediately.
7. **Reason code handler**: Dispatches to a reason-code-specific handler that generates the appropriate sequence of queries, evidence attachments, strategy setting, and submission.

### Tier 2: `_obvious_next_action(observation, candidates)`

Before calling any LLM, checks if the choice is trivial:

- Only 1 candidate? Take it.
- All candidates have the same action type? Take the first.
- One candidate targets a case with much tighter deadline? Take it.

If obvious, the LLM is skipped entirely.

### Tier 3: LLM or `_heuristic_pick(candidates)`

When Tier 2 returns None (genuine ambiguity):

- **With LLM**: Sends the observation summary and candidate list as a JSON prompt. The model returns `{"candidate_index": N, "rationale": "..."}`. On failure, walks the fallback chain (OpenRouter -> Gemini -> Groq).
- **Without LLM**: `_heuristic_pick()` returns the first candidate (the heuristic already sorted by priority).

---

## Reason Code Strategies

The agent handles 6 reason code families, each with a different workflow:

### `goods_not_received` (Deterministic: contest)

The customer claims they never received the product. The merchant almost always has delivery proof.

**Steps**: select -> query orders -> query shipping -> attach delivery evidence -> set_strategy contest -> submit representment

**Systems queried**: orders, shipping

**Typical evidence**: Order confirmation, delivery scan, tracking number

**Strategy**: Always contest (delivery proof is definitive)

### `fraud_cnp` (Non-deterministic: contest or accept)

Card-not-present fraud. The customer claims they didn't authorize the transaction. This is the most nuanced reason code -- sometimes the merchant has strong evidence (prior good orders, account match), sometimes they don't.

**Steps**: select -> retrieve_policy -> query risk + support (+ orders if budget allows) -> attach non-harmful evidence -> set_strategy per policy -> submit or resolve

**Systems queried**: risk, support, orders (optional under tight budget)

**Typical evidence**: Risk assessment, prior order linkage, account verification

**Harmful evidence**: AVS mismatch, CVV mismatch (proves the card data didn't fully match)

**Strategy**: Contest if strong evidence exists, accept_chargeback if evidence is weak

### `credit_not_processed` (Deterministic: issue_refund)

The customer claims a refund was promised but never issued. The correct response is to issue the refund.

**Steps**: select -> set_strategy issue_refund -> resolve_case (3 steps total)

**Strategy**: Always issue_refund (cheapest to resolve, no contest needed)

### `duplicate_processing` (Deterministic: issue_refund)

The customer was charged twice. The correct response is to refund the duplicate.
**Steps**: select -> set_strategy issue_refund -> resolve_case (3 steps total)

**Strategy**: Always issue_refund

### `product_not_as_described` (Non-deterministic: contest or accept)

The customer claims the product didn't match its description. Success depends on whether the merchant has listing accuracy proof and whether the customer bypassed the return process.

**Steps**: select -> retrieve_policy -> query orders + support (+ shipping if deadline allows) -> attach listing/return evidence -> contest or accept per guidance

**Systems queried**: orders, support, shipping (optional)

**Strategy**: Contest if listing proof is strong, accept_chargeback if not supportable

### `service_not_provided` (Non-deterministic: contest or accept)

The customer claims a service was never delivered. Success depends on completion records and customer acknowledgment.

**Steps**: select -> retrieve_policy -> query support (+ orders if deadline allows) -> attach completion evidence -> contest or accept per guidance

**Systems queried**: support, orders (optional)

**Strategy**: Contest if service completion proof exists, accept_chargeback otherwise

---

## Multi-Case Triage

When the agent has multiple open cases and the total estimated step cost exceeds the budget, it uses a triage algorithm:

### Step Cost Estimates

| Reason Code | Est. Steps | Notes |
|---|---|---|
| `goods_not_received` | 6 | select + 2 queries + attach + strategy + submit |
| `credit_not_processed` | 3 | select + strategy + resolve |
| `duplicate_processing` | 3 | select + strategy + resolve |
| `fraud_cnp` | 8 | select + policy + 2-3 queries + attach + strategy + submit |
| `product_not_as_described` | 8 | select + policy + 2-3 queries + attach + strategy + submit |
| `service_not_provided` | 7 | select + policy + 2 queries + attach + strategy + submit |

### Triage Algorithm

```
1. If total_estimated_cost > steps_remaining:
   Sort cases: deterministic-strategy codes first, then by amount descending.
   This ensures cheap, guaranteed-outcome cases are handled first,
   and the highest-value non-deterministic cases get remaining budget.

2. When processing each case, check:
   - Is steps_remaining < 5? → Fast-concede (can't even minimally contest).
   - Is this the lowest-value case and total_cost > budget? → Fast-concede.
   - Otherwise → Full contest or policy-guided resolution.

3. Never interrupt a near-complete case:
   - If the current case has evidence attached and is 1-2 steps from
     submission, finish it before switching to another case's deadline.
```

### Why This Ordering Works

- **credit_not_processed/duplicate_processing** cost 3 steps and always get optimal score. Handle them first to free budget.
- **goods_not_received** costs 6 steps and always contests. Handle next.
- **fraud_cnp/product_not_as_described/service_not_provided** cost 7-8 steps and may need to concede. Handle last -- if budget runs out, conceding these with `issue_refund` (an acceptable fallback) still earns 35% strategy correctness.

---

## Evidence Handling

### Harmful Evidence Detection

The agent maintains a set of 15 negative-signal keywords derived from real chargeback dispute patterns:

```
mismatch, failed, declined, suspicious, flagged, fraud risk,
unauthorized, rejected, invalid, expired, violation,
non-compliant, discrepancy, inconsistent, unverified
```

Every evidence item's title and summary are scanned. If any harmful keyword is found, the evidence is:

1. **Never attached** (ranked 999 in the priority sort, excluded from `add_evidence` calls)
2. **Removed if already attached** (a `remove_evidence` action is generated immediately before any other action)

### Evidence Priority Ranking

Non-harmful evidence is ranked by keyword relevance:

| Rank | Keywords | Example |
|---|---|---|
| 0 (highest) | signature, completion, booking, listing | "Delivery signature scan" |
| 1 | duplicate, delivery, prior, account, authenticated | "Prior good order linkage" |
| 2 | return policy, refund, cancel, confirmation, cancellation | "Return policy documentation" |
| 4 (default) | anything else | "Internal memo" |
| 999 (excluded) | mismatch, failed, declined, suspicious, flagged, fraud risk, unauthorized, rejected, invalid, expired, violation, non-compliant, discrepancy, inconsistent, unverified | "AVS mismatch report" |

### Attachment Strategy

The agent attaches **all** non-harmful retrieved evidence in a single `add_evidence` call. This maximizes the evidence_quality score, which rewards `helpful_attached / total_helpful`.

---

## Representment Notes

When the agent submits a contest, it generates a representment note. The grader scores notes on 4 dimensions:

| Dimension | Weight | What Earns Points |
|---|---|---|
| Substance | 20% | Note has >= 5 words |
| Policy claims coverage | 50% | Note mentions keywords from `case.policy_requirements` (e.g., "order confirmation", "carrier delivery") |
| Evidence coherence | 15% | Note references attached evidence IDs (e.g., "E1-ORDER-CONF") |
| Harmful mention penalty | -15% each | Note contains words like "mismatch", "failed", "declined" |

### How the Agent Builds Notes

1. Start with a **reason-code-specific template** that uses policy requirement language:
   - goods_not_received: "Order confirmation and carrier delivery confirmation establish fulfillment..."
   - fraud_cnp: "Prior good order linkage and customer account confirmation..." (never mentions "mismatch")
   - product_not_as_described: "Product listing verification confirms..."
   - service_not_provided: "Service completion record and customer acknowledgment..."
2. If policy was retrieved, append the policy requirements directly:
   - "Evidence covers: order confirmation, carrier delivery confirmation."
3. Append evidence IDs for coherence scoring:
   - "Supporting evidence: E1-ORDER-CONF, E1-DELIVERY-SCAN."
4. Truncate to 500 characters.

---

## The Grading System

After all cases are resolved (or the step budget is exhausted), the grader scores each case across 8 dimensions. Each dimension is an OpenEnv `Rubric` subclass defined in `evaluation/rubrics.py`; they compose into a per-case `WeightedSum` (wrapped in a `Gate(CaseAbandonedRubric)` deadline guard) and an episode-level `ChargebackOpsEpisodeRubric` that is wired into `env.rubric`. `evaluation/grading.py` keeps the legacy `score_case` / `grade_episode` API as a thin adapter over the rubric tree.

### Strategy Correctness (20%)

| Outcome | Score |
|---|---|
| Chose the optimal strategy | 1.0 |
| Chose an acceptable fallback | 0.35 |
| Chose the wrong strategy | 0.0 |

"Optimal" and "acceptable" strategies are defined per case by the task scenario. For example, `goods_not_received` optimal is always "contest" with no acceptable fallback. `fraud_cnp` optimal might be "contest" with "accept_chargeback" as acceptable, or vice versa.

### Evidence Quality (15%)

For **contest** cases:

```
quality = 0.7 * (required_attached / required_total)
        + 0.3 * (helpful_attached / helpful_total)
        - 0.25 * harmful_attached_count
```

For **non-contest** cases where optimal strategy is also non-contest:

- 1.0 if no evidence was attached (clean concession)
- 0.7 if evidence was attached (unnecessary work)

For **non-contest** cases where optimal was contest:

- 0.15 (the agent abandoned evidence gathering for a contestable case)

### Packet Validity (10%)

Binary, all-or-nothing:

- **1.0** if ALL required evidence is attached AND zero harmful evidence is attached
- **0.0** otherwise

This is the strictest dimension.
Missing one required piece or having one harmful piece zeroes it out.

### Deadline Compliance (10%)

Binary:

- **1.0** if the case was resolved at or before the deadline step
- **0.0** if resolved after the deadline or never resolved

### Efficiency (10%)

```
efficiency = 1.0 - min(0.9, (duplicate_queries + invalid_actions) * 0.1 + submit_attempts * 0.05)
```

The agent loses 0.1 per duplicate system query or invalid action, and 0.05 per submit attempt. The deduction in this formula is capped at 0.9, so the formula alone cannot fall below 0.1; with the additional penalties below applied, the minimum efficiency is 0.0.

Additional penalties for shallow operational behaviour:

- **Over-querying a concedable case**: -0.15 per system queried beyond the 2nd when the agent concedes a case whose optimal strategy is also non-contest. Querying 4+ systems before conceding is wasteful.
- **Late policy retrieval**: -0.08 when policy is retrieved but the case is resolved with a concession that matches the optimal non-contest strategy. The policy step was wasted.
- **Early correct concession bonus**: +0.10 when the agent correctly concedes a case (matching optimal) within 3 steps. Rewards recognising a bad case quickly.

### Outcome Quality (10%)

| Outcome | Score |
|---|---|
| Final resolution matches optimal strategy | 1.0 |
| Final resolution is an acceptable fallback | 0.4 |
| Final resolution is wrong | 0.0 |

### Note Quality (5%)

Only scored for contest cases with a representment note. See [Representment Notes](#representment-notes) for the scoring breakdown.

### Escalation ROI (20%)

Encodes the economic rule that escalating to network arbitration is rational only when `P(win) × dispute_amount > $250 fee`. Conceding a positive-EV contestable case (where `amount > $250` and the optimal strategy is `contest`) is penalised. Escalating a negative-EV case (low P(win) or low amount) is also penalised. This is the dimension that keeps `concede_all` from being a free 0.6+ score.

### Deadline Gate

Before the WeightedSum scores anything, `Gate(CaseAbandonedRubric)` checks whether the case was left unresolved past its deadline.
If yes, the entire case score is hard-zeroed. This prevents the agent from gaming the rubric by ignoring nightmare-tier cases and still collecting partial credit on the dimensions it did touch.

### Final Score Calculation

```
case_score = 0.20 * strategy_correctness
           + 0.15 * evidence_quality
           + 0.10 * packet_validity
           + 0.10 * deadline_compliance
           + 0.10 * efficiency
           + 0.10 * outcome_quality
           + 0.05 * note_quality
           + 0.20 * escalation_roi

case_score = 0.0 if case_abandoned else case_score   # deadline gate

weighted_case_score = case_score * case_weight
episode_score = sum(weighted_case_scores) / sum(case_weights)
```

Case weights are determined by financial impact (amount and difficulty). The episode score normalizes to `[0.0, 1.0]`.

---

## The Issuer Agent

After every `submit_representment`, a scripted `IssuerAgent` (see `scenarios/issuer_model.py`) reviews the packet and returns one of three decisions:

| Decision | Score band (round 1) | Score band (round 2) | What happens |
|---|---|---|---|
| `accept` | ≥ 0.70 | ≥ 0.60 | Merchant wins the dispute, case closes positive |
| `request_more_evidence` | 0.40 – 0.70 | < 0.60 | Round 2: merchant gets one more shot with compelling evidence |
| `escalate_to_arbitration` | < 0.40 | (only if merchant escalates) | Round 3: case goes to network arbitration |

The score itself comes from `evidence_strength_score`:

```
score = 0.4 (if all required evidence attached)
      + min(0.4, 0.2 × helpful_attached)
      − 0.3 × harmful_attached              # uncapped
      + 0.1 (if note has ≥ 2 policy keywords)
      + min(0.30, 0.15 × pre_arb_unique)    # round 2 only
```

In the round-1 ambiguity band (0.40–0.70), the deterministic fallback uses the midpoint rule: `accept` at score ≥ 0.55, otherwise `request_more_evidence`. An optional LLM softening layer can override this midpoint when an API key is set; with no key it falls back to the deterministic rule so offline benchmarks stay reproducible.
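The scoring formula and the deterministic round-1 rule can be put together in a short sketch. The function signatures and argument names here are assumptions made for illustration (the real implementation lives in `scenarios/issuer_model.py` and may differ); only the coefficients and thresholds come from the text above.

```python
def evidence_strength_score(required_complete: bool,
                            helpful_attached: int,
                            harmful_attached: int,
                            note_policy_keywords: int,
                            pre_arb_unique: int = 0,
                            round_number: int = 1) -> float:
    """Evidence-strength score as described in the formula above."""
    score = 0.4 if required_complete else 0.0
    score += min(0.4, 0.2 * helpful_attached)   # helpful evidence, capped at 0.4
    score -= 0.3 * harmful_attached             # harmful evidence, uncapped
    if note_policy_keywords >= 2:
        score += 0.1
    if round_number == 2:                       # pre-arbitration bonus, round 2 only
        score += min(0.30, 0.15 * pre_arb_unique)
    return score


def round1_decision(score: float) -> str:
    """Deterministic round-1 decision (no LLM softening layer)."""
    if score >= 0.55:
        # Covers the >= 0.70 accept band and the upper half of the
        # 0.40-0.70 ambiguity band (midpoint rule).
        return "accept"
    if score >= 0.40:
        return "request_more_evidence"
    return "escalate_to_arbitration"


# Required evidence plus one helpful item clears the midpoint.
print(round1_decision(evidence_strength_score(True, 1, 0, 0)))   # accept
# Required evidence only sits at the bottom of the ambiguity band.
print(round1_decision(evidence_strength_score(True, 0, 0, 0)))   # request_more_evidence
# Missing required evidence falls below 0.40.
print(round1_decision(evidence_strength_score(False, 1, 0, 0)))  # escalate_to_arbitration
```

Because the harmful-evidence term is uncapped, a single harmful attachment costs more than one helpful item adds, which matches the agent's policy of removing harmful evidence before anything else.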
## Arbitration

Network arbitration is a pure function (see `scenarios/arbitration.py`). Given the same case ID and packet state, the ruling is always the same: inside an ambiguity band it seeds a coin flip from a SHA-256 hash of the case ID. The bands:

| Evidence-strength score | Ruling |
|---|---|
| ≥ 0.65 | `merchant_wins` |
| ≤ 0.35 | `issuer_wins` |
| (0.35, 0.65) | seeded coin flip on `sha256(case_id)` |

Both sides pay a $250 fee regardless of outcome. The winner is reimbursed the dispute amount minus their $250 fee; the loser eats the dispute amount plus the $250 fee. The `EscalationROIRubric` reads the final P&L and scores whether the agent's escalate / concede decision was EV-rational ex ante.

## LLM Integration

The agent supports 5 LLM providers through OpenAI-compatible clients:

| Provider | Model | Base URL |
|---|---|---|
| OpenRouter | openai/gpt-oss-120b | openrouter.ai/api/v1 |
| Google Gemini | gemini-2.5-flash | generativelanguage.googleapis.com/v1beta/openai/ |
| Groq | llama-3.3-70b-versatile | api.groq.com/openai/v1 |
| OpenAI | gpt-4.1-mini | api.openai.com/v1 |
| Anthropic | claude-sonnet-4 | (compatible gateway) |

### Fallback Chain

```
Primary (configured in .env) → OpenRouter → Google Gemini → Groq → Heuristic
```

If the primary provider fails (timeout, rate limit, connection error), the agent automatically tries the next provider in the chain. If all providers fail, it falls back to `_heuristic_pick()`.
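A minimal sketch of walking such a chain is shown below. Providers are modelled as plain callables that take a prompt and return a candidate index; this is a stand-in for illustration and does not use any real client library. Treating an out-of-range index as a provider failure is also an assumption, not documented behaviour.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical provider type: takes a prompt, returns a candidate index.
Provider = Callable[[str], int]


def choose_with_fallback(prompt: str,
                         providers: Sequence[Tuple[str, Provider]],
                         n_candidates: int) -> Tuple[int, str]:
    """Try each provider in order; fall back to the heuristic (index 0)."""
    for name, call in providers:
        try:
            index = call(prompt)
        except Exception:
            # Timeout, rate limit, connection error, etc.: try the next one.
            continue
        if isinstance(index, int) and 0 <= index < n_candidates:
            return index, name
    # All providers failed: the first candidate is the heuristic's pick.
    return 0, "heuristic"


def failing_provider(prompt: str) -> int:
    raise TimeoutError("simulated timeout")


chain = [
    ("primary", failing_provider),
    ("openrouter", failing_provider),
    ("gemini", lambda prompt: 2),
    ("groq", lambda prompt: 1),
]
print(choose_with_fallback("pick one", chain, n_candidates=3))  # (2, 'gemini')
```

Returning the provider name alongside the index makes it easy to log which link in the chain actually answered.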
### What the LLM Sees

```json
{
  "queue_summary": "2 open cases, 8 steps remaining",
  "visible_case": "CB-G1, fraud_cnp, $480, deadline in 3 steps",
  "candidates": [
    {"index": 0, "action": "submit_representment", "summary": "Submit the contest package"},
    {"index": 1, "action": "query_system orders", "summary": "Query orders for more evidence"},
    {"index": 2, "action": "select_case CB-G2", "summary": "Switch to case with deadline in 1 step"}
  ]
}
```

### What the LLM Returns

```json
{
  "candidate_index": 0,
  "rationale": "CB-G1 has sufficient evidence and contesting before deadline takes priority over gathering more evidence."
}
```

### Configuration

| Env Variable | Default | Purpose |
|---|---|---|
| `BASELINE_PROVIDER` | openrouter | Primary LLM provider |
| `BASELINE_MODEL` | openai/gpt-oss-120b | Model to use |
| `BASELINE_REQUEST_TIMEOUT_SECONDS` | 15 | Per-call timeout |
| `PROVIDER_RATE_LIMIT_RETRIES` | 2 | Retry count on rate limits |
| `PROVIDER_RETRY_BACKOFF_SECONDS` | 1.0 | Backoff between retries |
| `MAX_PROVIDER_RESPONSE_TOKENS` | 200 | Max tokens for LLM response |
| `STRICT_LLM_MODE` | false | If true, fail instead of falling back to heuristic |

---

## Key Optimizations

### 1. Deterministic Strategy Inference

For reason codes where the optimal strategy never varies (`goods_not_received` = contest, `credit_not_processed` / `duplicate_processing` = issue_refund), the agent skips `retrieve_policy` entirely. This saves 1 step per case.

### 2. Deadline-Aware Query Limiting

When the remaining steps before deadline can't accommodate all planned queries, the agent reduces the number of systems queried:

- `product_not_as_described`: drops from 3 systems (orders, support, shipping) to 2 (orders, support)
- `fraud_cnp`: drops from 3 systems (risk, support, orders) to 2 (risk, support)
- `service_not_provided`: drops from 2 systems (orders, support) to 1 (support)

### 3. Near-Completion Protection

When the current case has evidence attached and is 1-2 steps from submission, the agent does NOT switch to handle another case's deadline. Finishing the current case is almost always higher-value than interrupting.

### 4. Harmful Evidence Cleanup

Before any submit, the agent checks for harmful evidence in the attached set. If found, it generates a `remove_evidence` action immediately. This prevents the -0.25 per-piece evidence_quality penalty and the packet_validity = 0.0 failure.

### 5. Budget-Aware Note Generation

Representment notes are generated with direct references to policy requirement keywords and evidence IDs, maximizing the note_quality score (policy claims coverage 50% + evidence coherence 15%).

### 6. Adversarial Evidence (Hard/Nightmare)

At hard and nightmare difficulty, the case generator injects **adversarial evidence** -- items whose titles sound helpful ("Delivery verification report", "Account verification summary") but whose content is harmful (GPS discrepancies, prior non-receipt claims, failed 3D Secure challenges). This tests whether the agent reads beyond titles and inspects evidence content before attaching.

### 7. Nightmare Difficulty

Nightmare tasks push the step budget to its limit: 5-6 cases with ~2.4 steps per case. The agent must triage aggressively -- fast-conceding weak cases, handling deterministic codes first, and accepting that some cases will go unresolved. This tier specifically tests prioritisation under extreme resource pressure.
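The ordering rule behind this kind of triage (deterministic codes first, then by amount descending, applied only when the estimated cost exceeds the budget) can be sketched as follows. The dictionary-based case shape and the `triage_order` name are hypothetical; the step estimates are the ones from the Step Cost Estimates table.

```python
# Estimated steps per reason code, taken from the Step Cost Estimates table.
EST_STEPS = {
    "goods_not_received": 6,
    "credit_not_processed": 3,
    "duplicate_processing": 3,
    "fraud_cnp": 8,
    "product_not_as_described": 8,
    "service_not_provided": 7,
}

# Reason codes whose optimal strategy never varies.
DETERMINISTIC = {"goods_not_received", "credit_not_processed", "duplicate_processing"}


def triage_order(cases, steps_remaining):
    """Return cases in processing order.

    Hypothetical helper: each case is a dict with `case_id`, `reason_code`
    and `amount`. Reordering only happens when the budget is too small.
    """
    total_cost = sum(EST_STEPS[c["reason_code"]] for c in cases)
    if total_cost <= steps_remaining:
        return list(cases)
    return sorted(
        cases,
        key=lambda c: (c["reason_code"] not in DETERMINISTIC, -c["amount"]),
    )


queue = [
    {"case_id": "A", "reason_code": "fraud_cnp", "amount": 900.0},
    {"case_id": "B", "reason_code": "duplicate_processing", "amount": 40.0},
    {"case_id": "C", "reason_code": "goods_not_received", "amount": 300.0},
    {"case_id": "D", "reason_code": "service_not_provided", "amount": 1200.0},
]
# Estimated cost is 8 + 3 + 6 + 7 = 24 steps against a 12-step budget.
print([c["case_id"] for c in triage_order(queue, steps_remaining=12)])
# ['C', 'B', 'D', 'A']
```

With a budget that covers the whole queue, the function leaves the order untouched, so triage only changes behaviour under pressure.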

---

## File Map

| File | Purpose | Lines |
|---|---|---|
| `runners/baseline_runner.py` | The agent: decision pipeline, candidate generation, LLM integration, representment notes | ~1100 |
| `server/chargeback_ops_environment.py` | The environment: step/reset/state, action execution, reward computation | ~500 |
| `evaluation/rubrics.py` | OpenEnv `Rubric` subclasses for all 8 scoring dimensions, composed via `WeightedSum` + `Gate(CaseAbandonedRubric)` | ~400 |
| `scenarios/issuer_model.py` | Scripted `IssuerAgent`: evidence-strength scoring, threshold bands, optional LLM softening | ~250 |
| `scenarios/arbitration.py` | Deterministic network arbitration resolver with $250 per-side fee | ~120 |
| `evaluation/grading.py` | Legacy `score_case` / `grade_episode` adapter that delegates to the rubric tree | ~120 |
| `scenarios/simulation.py` | Task definitions, case progress tracking, evidence metadata | ~600 |
| `core/models.py` | Pydantic models for actions, observations, state, grading | ~600 |
| `runners/inference.py` | OpenEnv-compatible inference entry point with provider fallback | ~200 |
| `inference.py` | Root re-export for submission contract | ~10 |
| `scenarios/case_generator.py` | Parametric task generator with seeded RNG | ~700 |
| `scenarios/iso_adapter.py` | Converts ISO 20022 CASR.003 records to environment cases | ~160 |
| `connectors/stripe_sandbox.py` | Maps Stripe test-mode disputes to environment cases | ~280 |
| `evaluation/agent_brutal_audit.py` | 126-episode evaluation across all data sources | ~300 |
| `server/app.py` | FastAPI routes: /reset, /step, /state, /tasks, /baseline, /grader, /results, /demo | ~200 |
| `server/demo_ui.py` | Gradio live demo UI with step-by-step episode playback | ~150 |
| `core/episode_store.py` | Thread-safe storage with JSONL file persistence | ~60 |
| `core/client.py` | OpenEnv WebSocket client | ~100 |

---

## Performance

Tested across the 11-task headline benchmark (4 showcase + 7 seeded holdout) and a 28-task multi-seed grid:

| Policy | Headline (11) | Multi-seed (28) | Delta vs naive |
|---|---|---|---|
| naive (empty packet) | 0.000 | 0.000 | n/a |
| concede_all | 0.567 | 0.563 | +0.567 |
| escalate_all | 0.773 | 0.765 | +0.773 |
| heuristic | **0.773** | **0.765** | **+0.773** |

The difficulty curve runs 0.97 → 0.88 → 0.70 → 0.51 across easy / medium / hard / nightmare on the multi-seed grid: monotone and well-separated. The `Gate(CaseAbandonedRubric)` wrapper hard-zeros abandoned cases, and `EscalationROIRubric` (20%) penalises both conceding positive-EV contestable cases and escalating negative-EV ones. Together they kill the concede-everything shortcut. `escalate_all` ties the heuristic on both benchmarks because the merchant's round-1 packet is strong enough on most tasks that the pre-arb branch never fires.

See `docs/RESULTS.md` for full per-task numbers, the rubric tree, and reproduction commands.