
ChargebackOps Agent: Complete Technical Reference

This document explains every aspect of the ChargebackOps agent -- what problem it solves, how it thinks, how it is scored, and why every design decision was made.



The Problem

When a customer disputes a credit card charge, the card network (Visa, Mastercard) initiates a chargeback against the merchant. The merchant loses the transaction amount immediately. To recover the funds, the merchant must build a representment package -- a bundle of evidence proving the charge was legitimate -- and submit it before a hard deadline.

This is not a simple yes/no decision. Each dispute has:

  • A reason code (why the customer disputes: fraud, goods not received, product not as described, etc.)
  • A deadline (fixed number of steps before the case auto-closes against the merchant)
  • Evidence scattered across 6 internal merchant systems (orders, payment, shipping, support, refunds, risk)
  • Some evidence is helpful, some is required, and some is harmful (weakens the case if included)
  • A correct strategy that depends on the evidence available (contest, accept the chargeback, or issue a refund)

A human dispute analyst handles 50-200 cases per day. They must triage by urgency, query the right systems, avoid attaching damaging evidence, and submit within deadline. Mistakes are expensive: a lost $500 chargeback plus a $25 network fee per case.

ChargebackOps turns this into a measurable agent benchmark. The agent must do exactly what a human analyst does, but programmatically, under step-budget constraints, with deterministic scoring.


The Use Case

ChargebackOps is built for the OpenEnv evaluation framework. It is a simulated merchant dispute resolution environment where an AI agent acts as the dispute analyst.

What the agent receives:

  • A queue of 1-6 open dispute cases (5-6 at nightmare difficulty)
  • A step budget (10-20 actions total, ~2.4 steps/case at nightmare)
  • Per-case deadlines (must resolve before step N)

What the agent must do:

  • Select and focus on one case at a time
  • Query internal merchant systems to retrieve evidence
  • Decide whether to contest, accept, or refund each case
  • Attach the right evidence (and avoid harmful artifacts)
  • Write a representment note explaining why the dispute should be reversed
  • Submit or resolve each case before its deadline
  • Manage step budget across all cases when there are more cases than steps

What the agent is scored on:

  • Did it choose the correct strategy? (20% of score)
  • Did it gather the right evidence? (15%)
  • Is the evidence packet complete and clean? (10%)
  • Did it meet the deadline? (10%)
  • Was it efficient (no wasted steps)? (10%)
  • Did the resolution match the strategy? (10%)
  • Is the representment note well-written? (5%)
  • Was escalation EV-rational? (20%; escalate iff P(win) × amount > $250 fee)

After the merchant submits a representment, a scripted IssuerAgent reviews the packet and returns one of three decisions: accept, request_more_evidence (triggering pre-arbitration with compelling evidence), or escalate_to_arbitration. The merchant can also choose to escalate at round 2 instead of rebuilding the packet, or accept an arbitration loss to cap fees. Network arbitration is a deterministic resolver: the loser eats the dispute amount plus a $250 fee, the winner is reimbursed minus their own $250 fee.


How the Environment Works

The environment follows the OpenEnv reset() / step() / state() contract.

Lifecycle

reset(task_id) → Observation
step(action)   → Observation
state()        β†’ State (includes grader report when done)

Observation

Each observation contains:

| Field | Type | Description |
| --- | --- | --- |
| queue | list | All cases with status, reason_code, amount, steps_until_deadline |
| visible_case | object or null | The currently selected case with full detail |
| steps_remaining | int | Steps left before episode ends |
| done | bool | Whether the episode is complete |
| reward | float | Immediate reward from the last action |
| result | string | Human-readable outcome of the last action |

The Visible Case

When a case is selected, visible_case exposes:

| Field | Description |
| --- | --- |
| case_id | Unique identifier |
| reason_code | Why the customer disputed (e.g., goods_not_received) |
| amount | Transaction amount in dollars |
| current_strategy | Currently set strategy (null if not set) |
| policy | Policy guidance (null until retrieve_policy is called) |
| systems_revealed | Which merchant systems have been queried |
| retrieved_evidence | Evidence items revealed by queries |
| attached_evidence | Evidence currently attached to the representment package |
| inspection_notes | Analyst notes (null until inspect_case is called) |

Action Space (12 Actions)

Round 1 -- Representment

| Action | Arguments | Cost | What It Does |
| --- | --- | --- | --- |
| select_case | case_id | 1 step | Focus on a case from the queue |
| inspect_case | case_id | 1 step | Reveal analyst inspection notes (+0.04 reward) |
| query_system | case_id, system_name | 1 step | Pull evidence from orders/payment/shipping/support/refunds/risk |
| retrieve_policy | case_id | 1 step | Get reason-code-specific guidance and required evidence list |
| add_evidence | case_id, evidence_ids | 1 step | Attach evidence to the representment package |
| remove_evidence | case_id, evidence_ids | 1 step | Remove evidence (useful for cleaning harmful attachments) |
| set_strategy | case_id, strategy | 1 step | Choose contest / accept_chargeback / issue_refund |
| submit_representment | case_id, note | 1 step | Submit the contest package (requires strategy = contest) |
| resolve_case | case_id, strategy | 1 step | Close a non-contest case (accept or refund) |

Round 2/3 -- Pre-Arbitration & Arbitration

| Action | Arguments | Cost | What It Does |
| --- | --- | --- | --- |
| respond_to_pre_arb | case_id, compelling_evidence_ids | 1 step | Attach compelling evidence and resubmit at round 2 (Issuer accept threshold drops to 0.60) |
| escalate_to_arbitration | case_id | 1 step | Skip rebuilding the packet, pay $250 fee, push to network arbitration |
| accept_arbitration_loss | case_id | 1 step | Concede at round 2/3 to cap fees |

Every action costs exactly 1 step. There is no free action. The agent must be deliberate about every step it takes.

Reward Signals

The environment returns immediate rewards after each action:

| Event | Reward |
| --- | --- |
| Select an open case | +0.02 |
| Inspect a case (first time) | +0.04 |
| Query a new system with helpful evidence | +0.06 to +0.08 |
| Query a new system with no useful evidence | -0.01 to +0.01 |
| Query an already-queried system (duplicate) | -0.03 |
| Attach helpful evidence | +0.08 per piece |
| Attach harmful evidence | -0.08 per piece |
| Attach neutral evidence | +0.01 |
| Remove harmful evidence | +0.05 |
| Remove helpful evidence | -0.03 |
| Set optimal strategy | +0.10 |
| Set acceptable strategy | +0.03 |
| Set wrong strategy | -0.08 |
| Submit a strong representment on time | +0.20 |
| Submit after deadline | -0.20 |
| Submit with missing required evidence | -0.18 |
| Submit with harmful evidence attached | -0.15 |
| Contest a case that shouldn't be contested | -0.12 |
| Resolve with optimal strategy | +0.16 |
| Resolve with acceptable strategy | +0.06 |
| Resolve with wrong strategy | -0.12 |
| Invalid action | -0.12 |

These rewards are shaping signals. The final score comes from the deterministic grader, not reward accumulation.


How the Agent Works

The agent is implemented in baseline_runner.py. It is a heuristic-first, LLM-augmented policy. The heuristic handles ~90% of decisions deterministically. The LLM is only called when the heuristic encounters genuine ambiguity (multiple meaningfully different candidates).

Why Heuristic-First?

  1. Reliability: Heuristic decisions never fail, never timeout, never cost money.
  2. Speed: No network round-trip for obvious moves.
  3. Determinism: Same input always produces same output (important for reproducibility).
  4. Budget: LLM calls cost tokens and have rate limits. The agent makes 6-20 decisions per episode.

The LLM acts as a tiebreaker when the heuristic produces multiple viable candidates that differ in action type. This hybrid approach gets the best of both worlds.


The Three-Tier Decision Pipeline

Every step, the agent runs this pipeline:

Tier 1: candidate_actions(observation)

Reads the current observation and generates a list of CandidateAction objects -- the legal moves the agent considers. This is the core intelligence of the agent.

The function applies these checks in strict priority order:

  1. No case selected? Generate select_case candidates sorted by triage priority.

  2. Current case resolved? Switch to an open case.

  3. Harmful evidence attached? Immediately generate remove_evidence and return. This preempts every later check because a single harmful attachment zeroes the packet_validity score (10% of total) and also lowers evidence_quality.

  4. Deadline <= 1 step? Emergency submit or resolve. No time for anything else.

  5. Budget too tight to contest? If there aren't enough steps to run the full contest path (minimum 5 steps: policy + query + attach + strategy + submit), or if this is the lowest-value case in a multi-case triage, fast-concede with issue_refund.

  6. Budget pressure (steps <= cases * 2)? If the inferred strategy is accept/refund, resolve immediately.

  7. Reason code handler: Dispatches to a reason-code-specific handler that generates the appropriate sequence of queries, evidence attachments, strategy setting, and submission.

Tier 2: _obvious_next_action(observation, candidates)

Before calling any LLM, checks if the choice is trivial:

  • Only 1 candidate? Take it.
  • All candidates have the same action type? Take the first.
  • One candidate targets a case with much tighter deadline? Take it.

If obvious, the LLM is skipped entirely.
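The Tier 2 checks can be sketched as follows. This is a minimal illustration, not the code in baseline_runner.py: the Candidate class stands in for CandidateAction, and the DEADLINE_GAP threshold for "much tighter" is an assumption, since the document does not state the real cutoff.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for the agent's CandidateAction."""
    action_type: str
    case_id: str
    steps_until_deadline: int


# Assumption: "much tighter" means at least this many steps ahead of the runner-up.
DEADLINE_GAP = 2


def obvious_next_action(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Return the trivial choice, or None when a Tier 3 tiebreak is needed."""
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    # All candidates share one action type: the list is already priority-sorted.
    if len({c.action_type for c in candidates}) == 1:
        return candidates[0]
    by_deadline = sorted(candidates, key=lambda c: c.steps_until_deadline)
    gap = by_deadline[1].steps_until_deadline - by_deadline[0].steps_until_deadline
    return by_deadline[0] if gap >= DEADLINE_GAP else None
```

Returning None is the signal that the candidates genuinely differ, which is the only case where an LLM call is worth its latency.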

Tier 3: LLM or _heuristic_pick(candidates)

When Tier 2 returns None (genuine ambiguity):

  • With LLM: Sends the observation summary and candidate list as a JSON prompt. The model returns {"candidate_index": N, "rationale": "..."}. On failure, walks the fallback chain (OpenRouter -> Gemini -> Groq).
  • Without LLM: _heuristic_pick() returns the first candidate (the heuristic already sorted by priority).

Reason Code Strategies

The agent handles 6 reason code families, each with a different workflow:

goods_not_received (Deterministic: contest)

The customer claims they never received the product. The merchant almost always has delivery proof.

Steps: select -> query orders -> query shipping -> attach delivery evidence -> set_strategy contest -> submit representment

Systems queried: orders, shipping

Typical evidence: Order confirmation, delivery scan, tracking number

Strategy: Always contest (delivery proof is definitive)

fraud_cnp (Non-deterministic: contest or accept)

Card-not-present fraud. The customer claims they didn't authorize the transaction. This is the most nuanced reason code -- sometimes the merchant has strong evidence (prior good orders, account match), sometimes they don't.

Steps: select -> retrieve_policy -> query risk + support (+ orders if budget allows) -> attach non-harmful evidence -> set_strategy per policy -> submit or resolve

Systems queried: risk, support, orders (optional under tight budget)

Typical evidence: Risk assessment, prior order linkage, account verification

Harmful evidence: AVS mismatch, CVV mismatch (proves the card data didn't fully match)

Strategy: Contest if strong evidence exists, accept_chargeback if evidence is weak

credit_not_processed (Deterministic: issue_refund)

The customer claims a refund was promised but never issued. The correct response is to issue the refund.

Steps: select -> set_strategy issue_refund -> resolve_case (3 steps total)

Strategy: Always issue_refund (cheapest to resolve, no contest needed)

duplicate_processing (Deterministic: issue_refund)

The customer was charged twice. The correct response is to refund the duplicate.

Steps: select -> set_strategy issue_refund -> resolve_case (3 steps total)

Strategy: Always issue_refund

product_not_as_described (Non-deterministic: contest or accept)

The customer claims the product didn't match its description. Success depends on whether the merchant has listing accuracy proof and whether the customer bypassed the return process.

Steps: select -> retrieve_policy -> query orders + support (+ shipping if deadline allows) -> attach listing/return evidence -> contest or accept per guidance

Systems queried: orders, support, shipping (optional)

Strategy: Contest if listing proof is strong, accept_chargeback if not supportable

service_not_provided (Non-deterministic: contest or accept)

The customer claims a service was never delivered. Success depends on completion records and customer acknowledgment.

Steps: select -> retrieve_policy -> query support (+ orders if deadline allows) -> attach completion evidence -> contest or accept per guidance

Systems queried: support, orders (optional)

Strategy: Contest if service completion proof exists, accept_chargeback otherwise


Multi-Case Triage

When the agent has multiple open cases and the total estimated step cost exceeds the budget, it uses a triage algorithm:

Step Cost Estimates

| Reason Code | Est. Steps | Notes |
| --- | --- | --- |
| goods_not_received | 6 | select + 2 queries + attach + strategy + submit |
| credit_not_processed | 3 | select + strategy + resolve |
| duplicate_processing | 3 | select + strategy + resolve |
| fraud_cnp | 8 | select + policy + 2-3 queries + attach + strategy + submit |
| product_not_as_described | 8 | select + policy + 2-3 queries + attach + strategy + submit |
| service_not_provided | 7 | select + policy + 2 queries + attach + strategy + submit |

Triage Algorithm

1. If total_estimated_cost > steps_remaining:
     Sort cases: deterministic-strategy codes first, then by amount descending.
     This ensures cheap, guaranteed-outcome cases are handled first,
     and the highest-value non-deterministic cases get remaining budget.

2. When processing each case, check:
   - Is steps_remaining < 5? → Fast-concede (can't even minimally contest).
   - Is this the lowest-value case and total_cost > budget? → Fast-concede.
   - Otherwise → Full contest or policy-guided resolution.

3. Never interrupt a near-complete case:
   - If the current case has evidence attached and is 1-2 steps from
     submission, finish it before switching to another case's deadline.
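Steps 1 and 2 of the triage algorithm can be sketched in Python. This is an illustrative reading of the pseudocode, not the agent's actual implementation: QueueCase, triage_order, and should_fast_concede are hypothetical names, and the step estimates are copied from the table above.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Step-cost estimates from the table above.
EST_STEPS = {
    "goods_not_received": 6,
    "credit_not_processed": 3,
    "duplicate_processing": 3,
    "fraud_cnp": 8,
    "product_not_as_described": 8,
    "service_not_provided": 7,
}
DETERMINISTIC_CODES = {"goods_not_received", "credit_not_processed", "duplicate_processing"}
MIN_CONTEST_STEPS = 5  # policy + query + attach + strategy + submit


@dataclass(frozen=True)
class QueueCase:
    """Hypothetical minimal view of a queue entry."""
    case_id: str
    reason_code: str
    amount: float


def total_cost(cases: Sequence[QueueCase]) -> int:
    return sum(EST_STEPS[c.reason_code] for c in cases)


def triage_order(cases: Sequence[QueueCase], steps_remaining: int) -> List[QueueCase]:
    """Reorder only when the queue does not fit in the remaining budget."""
    if total_cost(cases) <= steps_remaining:
        return list(cases)
    # Deterministic-strategy codes first (False sorts before True), then amount descending.
    return sorted(cases, key=lambda c: (c.reason_code not in DETERMINISTIC_CODES, -c.amount))


def should_fast_concede(case: QueueCase, cases: Sequence[QueueCase], steps_remaining: int) -> bool:
    if steps_remaining < MIN_CONTEST_STEPS:
        return True
    lowest_value = min(cases, key=lambda c: c.amount)
    return total_cost(cases) > steps_remaining and case == lowest_value
```

The near-completion rule in step 3 depends on per-case progress state and is left out of this sketch.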

Why This Ordering Works

  • credit_not_processed/duplicate_processing cost 3 steps and always get optimal score. Handle them first to free budget.
  • goods_not_received costs 6 steps and always contests. Handle next.
  • fraud_cnp/product_not_as_described/service_not_provided cost 7-8 steps and may need to concede. Handle last -- if budget runs out, conceding these with issue_refund (an acceptable fallback) still earns 35% strategy correctness.

Evidence Handling

Harmful Evidence Detection

The agent maintains a set of 15 negative-signal keywords derived from real chargeback dispute patterns:

mismatch, failed, declined, suspicious, flagged, fraud risk,
unauthorized, rejected, invalid, expired, violation,
non-compliant, discrepancy, inconsistent, unverified

Every evidence item's title and summary are scanned. If any harmful keyword is found, the evidence is:

  1. Never attached (ranked 999 in the priority sort, excluded from add_evidence calls)
  2. Removed if already attached (a remove_evidence action is generated immediately before any other action)
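A minimal sketch of the keyword scan, assuming plain case-insensitive substring matching over title and summary. The function names and the dictionary shape are illustrative, not taken from the agent's source.

```python
from typing import Dict, List, Tuple

# The 15 negative-signal keywords listed above.
HARMFUL_KEYWORDS = (
    "mismatch", "failed", "declined", "suspicious", "flagged", "fraud risk",
    "unauthorized", "rejected", "invalid", "expired", "violation",
    "non-compliant", "discrepancy", "inconsistent", "unverified",
)


def is_harmful(title: str, summary: str) -> bool:
    """True if any harmful keyword appears in the title or summary."""
    text = f"{title} {summary}".lower()
    return any(keyword in text for keyword in HARMFUL_KEYWORDS)


def harmful_attached_ids(attached: Dict[str, Tuple[str, str]]) -> List[str]:
    """Evidence IDs (mapped to (title, summary)) that a remove_evidence call should target."""
    return [eid for eid, (title, summary) in attached.items() if is_harmful(title, summary)]
```

Scanning the summary as well as the title matters for items whose title sounds helpful but whose content is damaging, for example `is_harmful("Delivery verification report", "GPS discrepancy at drop-off")` returns True.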

Evidence Priority Ranking

Non-harmful evidence is ranked by keyword relevance:

| Rank | Keywords | Example |
| --- | --- | --- |
| 0 (highest) | signature, completion, booking, listing | "Delivery signature scan" |
| 1 | duplicate, delivery, prior, account, authenticated | "Prior good order linkage" |
| 2 | return policy, refund, cancel, confirmation, cancellation | "Return policy documentation" |
| 4 (default) | anything else | "Internal memo" |
| 999 (excluded) | mismatch, failed, declined, suspicious, flagged, fraud risk, unauthorized, rejected, invalid, expired, violation, non-compliant, discrepancy, inconsistent, unverified | "AVS mismatch report" |

Attachment Strategy

The agent attaches all non-harmful retrieved evidence in a single add_evidence call. This maximizes the evidence_quality score, which rewards helpful_attached / total_helpful.
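The ranking table and the single-call attachment rule can be sketched together. This assumes each evidence item is reduced to one text string (title plus summary) and that ties are broken by evidence ID. Both are illustrative choices, and evidence_rank and attachable_ids are hypothetical names.

```python
from typing import Dict, List

HARMFUL = (
    "mismatch", "failed", "declined", "suspicious", "flagged", "fraud risk",
    "unauthorized", "rejected", "invalid", "expired", "violation",
    "non-compliant", "discrepancy", "inconsistent", "unverified",
)
RANKED_KEYWORDS = (
    (0, ("signature", "completion", "booking", "listing")),
    (1, ("duplicate", "delivery", "prior", "account", "authenticated")),
    (2, ("return policy", "refund", "cancel", "confirmation", "cancellation")),
)
DEFAULT_RANK = 4
EXCLUDED_RANK = 999


def evidence_rank(text: str) -> int:
    """Lower is better; harmful items always get EXCLUDED_RANK."""
    lowered = text.lower()
    if any(k in lowered for k in HARMFUL):
        return EXCLUDED_RANK
    for rank, keywords in RANKED_KEYWORDS:
        if any(k in lowered for k in keywords):
            return rank
    return DEFAULT_RANK


def attachable_ids(retrieved: Dict[str, str]) -> List[str]:
    """All non-harmful evidence IDs, best-ranked first, for one add_evidence call."""
    ranked = sorted(retrieved.items(), key=lambda kv: (evidence_rank(kv[1]), kv[0]))
    return [eid for eid, text in ranked if evidence_rank(text) < EXCLUDED_RANK]
```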


Representment Notes

When the agent submits a contest, it generates a representment note. The grader scores notes on 4 dimensions:

| Dimension | Weight | What Earns Points |
| --- | --- | --- |
| Substance | 20% | Note has >= 5 words |
| Policy claims coverage | 50% | Note mentions keywords from case.policy_requirements (e.g., "order confirmation", "carrier delivery") |
| Evidence coherence | 15% | Note references attached evidence IDs (e.g., "E1-ORDER-CONF") |
| Harmful mention penalty | -15% each | Note contains words like "mismatch", "failed", "declined" |

How the Agent Builds Notes

  1. Start with a reason-code-specific template that uses policy requirement language:

    • goods_not_received: "Order confirmation and carrier delivery confirmation establish fulfillment..."
    • fraud_cnp: "Prior good order linkage and customer account confirmation..." (never mentions "mismatch")
    • product_not_as_described: "Product listing verification confirms..."
    • service_not_provided: "Service completion record and customer acknowledgment..."
  2. If policy was retrieved, append the policy requirements directly:

    • "Evidence covers: order confirmation, carrier delivery confirmation."
  3. Append evidence IDs for coherence scoring:

    • "Supporting evidence: E1-ORDER-CONF, E1-DELIVERY-SCAN."
  4. Truncate to 500 characters.
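The four note-building steps can be sketched as one small function. This is an illustration only: build_note is a hypothetical name, and the opening template is passed in by the caller because the real template strings are truncated in this document.

```python
from typing import Sequence

MAX_NOTE_CHARS = 500


def build_note(
    template: str,
    policy_requirements: Sequence[str],
    evidence_ids: Sequence[str],
    max_chars: int = MAX_NOTE_CHARS,
) -> str:
    """Template, then policy coverage (if retrieved), then evidence IDs, then truncate."""
    parts = [template.strip()]
    if policy_requirements:
        parts.append("Evidence covers: " + ", ".join(policy_requirements) + ".")
    if evidence_ids:
        parts.append("Supporting evidence: " + ", ".join(evidence_ids) + ".")
    return " ".join(parts)[:max_chars]


# The opening sentence here is a shortened stand-in for the real template.
note = build_note(
    "Order confirmation and carrier delivery confirmation establish fulfillment.",
    ["order confirmation", "carrier delivery confirmation"],
    ["E1-ORDER-CONF", "E1-DELIVERY-SCAN"],
)
```

Truncating last keeps the template and policy language intact when the note runs long, so any cut falls on the trailing evidence IDs.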


The Grading System

After all cases are resolved (or the step budget is exhausted), the grader scores each case across 8 dimensions. Each dimension is an OpenEnv Rubric subclass defined in evaluation/rubrics.py; they compose into a per-case WeightedSum (wrapped in a Gate(CaseAbandonedRubric) deadline guard) and an episode-level ChargebackOpsEpisodeRubric that is wired into env.rubric. evaluation/grading.py keeps the legacy score_case / grade_episode API as a thin adapter over the rubric tree.

Strategy Correctness (20%)

| Outcome | Score |
| --- | --- |
| Chose the optimal strategy | 1.0 |
| Chose an acceptable fallback | 0.35 |
| Chose the wrong strategy | 0.0 |

"Optimal" and "acceptable" strategies are defined per case by the task scenario. For example, goods_not_received optimal is always "contest" with no acceptable fallback. fraud_cnp optimal might be "contest" with "accept_chargeback" as acceptable, or vice versa.

Evidence Quality (15%)

For contest cases:

quality = 0.7 * (required_attached / required_total)
        + 0.3 * (helpful_attached / helpful_total)
        - 0.25 * harmful_attached_count
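As a worked sketch of the contest-case formula: the clamp to [0, 1] and the handling of zero denominators are assumptions added here so the function is total, since the document does not say how the grader treats those cases.

```python
def contest_evidence_quality(
    required_attached: int,
    required_total: int,
    helpful_attached: int,
    helpful_total: int,
    harmful_attached: int,
) -> float:
    """0.7 * required coverage + 0.3 * helpful coverage - 0.25 per harmful item."""
    # Assumption: an empty requirement set counts as fully covered.
    required_ratio = required_attached / required_total if required_total else 1.0
    helpful_ratio = helpful_attached / helpful_total if helpful_total else 1.0
    raw = 0.7 * required_ratio + 0.3 * helpful_ratio - 0.25 * harmful_attached
    # Assumption: the score is clamped to [0, 1].
    return max(0.0, min(1.0, raw))
```

For example, a packet with all required and helpful evidence plus one harmful item scores 0.75, so a single bad attachment costs more than omitting most of the helpful evidence.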

For non-contest cases where optimal strategy is also non-contest:

  • 1.0 if no evidence was attached (clean concession)
  • 0.7 if evidence was attached (unnecessary work)

For non-contest cases where optimal was contest:

  • 0.15 (the agent abandoned evidence gathering for a contestable case)

Packet Validity (10%)

Binary, all-or-nothing:

  • 1.0 if ALL required evidence is attached AND zero harmful evidence is attached
  • 0.0 otherwise

This is the strictest dimension. Missing one required piece or having one harmful piece zeroes it out.

Deadline Compliance (10%)

Binary:

  • 1.0 if the case was resolved at or before the deadline step
  • 0.0 if resolved after the deadline or never resolved

Efficiency (10%)

efficiency = 1.0 - min(0.9, (duplicate_queries + invalid_actions) * 0.1 + submit_attempts * 0.05)

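The agent loses 0.1 per duplicate system query or invalid action, and 0.05 per submit attempt. Because the penalty is capped at 0.9, this base term never falls below 0.1; only the additional penalties below can push the dimension toward its 0.0 floor.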
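The base formula translates directly to Python. This sketch covers only the base term, not the additional behavioural penalties, and base_efficiency is a hypothetical name.

```python
def base_efficiency(duplicate_queries: int, invalid_actions: int, submit_attempts: int) -> float:
    """Base efficiency term: 1.0 minus a penalty capped at 0.9."""
    penalty = (duplicate_queries + invalid_actions) * 0.1 + submit_attempts * 0.05
    return 1.0 - min(0.9, penalty)
```

A clean run with one submit attempt gives 0.95, and two duplicate queries plus one invalid action and one submit give roughly 0.65.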

Additional penalties for shallow operational behaviour:

  • Over-querying a concedable case: -0.15 per system queried beyond the 2nd when the agent concedes a case whose optimal strategy is also non-contest. Querying 4+ systems before conceding is wasteful.
  • Late policy retrieval: -0.08 when policy is retrieved but the case is resolved with a concession that matches the optimal non-contest strategy. The policy step was wasted.
  • Early correct concession bonus: +0.10 when the agent correctly concedes a case (matching optimal) within 3 steps. Rewards recognising a bad case quickly.

Outcome Quality (10%)

| Outcome | Score |
| --- | --- |
| Final resolution matches optimal strategy | 1.0 |
| Final resolution is an acceptable fallback | 0.4 |
| Final resolution is wrong | 0.0 |

Note Quality (5%)

Only scored for contest cases with a representment note. See Representment Notes for the scoring breakdown.

Escalation ROI (20%)

Encodes the economic rule that escalating to network arbitration is rational only when P(win) × dispute_amount > $250 fee. Conceding a positive-EV contestable case (where amount > $250 and the optimal strategy is contest) is penalised. Escalating a negative-EV case (low P(win) or low amount) is also penalised. This is the dimension that keeps concede_all from being a free 0.6+ score.
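The expected-value rule itself is a one-line comparison. The sketch below shows only that rule, not how EscalationROIRubric converts it into a score, and the function names are illustrative.

```python
ARBITRATION_FEE = 250.0


def escalation_is_rational(p_win: float, amount: float, fee: float = ARBITRATION_FEE) -> bool:
    """Escalate only if the expected recovery exceeds the arbitration fee."""
    return p_win * amount > fee


def break_even_p_win(amount: float, fee: float = ARBITRATION_FEE) -> float:
    """Win probability at which escalation exactly breaks even."""
    return fee / amount
```

On a $500 dispute the break-even win probability is 0.5, so escalating at P(win) = 0.8 is rational while escalating at 0.4 is not. Any dispute of $250 or less can never clear the bar, since P(win) cannot exceed 1.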

Deadline Gate

Before the WeightedSum scores anything, Gate(CaseAbandonedRubric) checks whether the case was left unresolved past its deadline. If yes, the entire case score is hard-zeroed. This prevents the agent from gaming the rubric by ignoring nightmare-tier cases and still collecting partial credit on the dimensions it did touch.

Final Score Calculation

case_score = 0.20 * strategy_correctness
           + 0.15 * evidence_quality
           + 0.10 * packet_validity
           + 0.10 * deadline_compliance
           + 0.10 * efficiency
           + 0.10 * outcome_quality
           + 0.05 * note_quality
           + 0.20 * escalation_roi

case_score = 0.0 if case_abandoned else case_score   # deadline gate

weighted_case_score = case_score * case_weight

episode_score = sum(weighted_case_scores) / sum(case_weights)

Case weights are determined by financial impact (amount and difficulty). The episode score normalizes to [0.0, 1.0].
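The calculation above can be written out in plain Python. The real implementation composes OpenEnv Rubric objects, so this is only a sketch of the arithmetic, with function names chosen for illustration.

```python
from typing import Dict, Sequence, Tuple

WEIGHTS = {
    "strategy_correctness": 0.20,
    "evidence_quality": 0.15,
    "packet_validity": 0.10,
    "deadline_compliance": 0.10,
    "efficiency": 0.10,
    "outcome_quality": 0.10,
    "note_quality": 0.05,
    "escalation_roi": 0.20,
}


def case_score(dimensions: Dict[str, float], case_abandoned: bool = False) -> float:
    """Weighted sum of the 8 dimensions, hard-zeroed by the deadline gate."""
    if case_abandoned:
        return 0.0
    return sum(weight * dimensions[name] for name, weight in WEIGHTS.items())


def episode_score(scored_cases: Sequence[Tuple[float, float]]) -> float:
    """Weight-normalised mean over (case_score, case_weight) pairs."""
    total_weight = sum(weight for _, weight in scored_cases)
    if total_weight == 0:
        return 0.0
    return sum(score * weight for score, weight in scored_cases) / total_weight
```

Because the weights sum to 1.0 and every dimension lies in [0, 1], both the case score and the episode score stay in [0, 1].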


The Issuer Agent

After every submit_representment, a scripted IssuerAgent (see scenarios/issuer_model.py) reviews the packet and returns one of three decisions:

| Decision | Score band (round 1) | Score band (round 2) | What happens |
| --- | --- | --- | --- |
| accept | ≥ 0.70 | ≥ 0.60 | Merchant wins the dispute, case closes positive |
| request_more_evidence | 0.40 – 0.70 | < 0.60 | Round 2: merchant gets one more shot with compelling evidence |
| escalate_to_arbitration | < 0.40 | (only if merchant escalates) | Round 3: case goes to network arbitration |

The score itself comes from evidence_strength_score:

score = 0.4 (if all required evidence attached)
      + min(0.4, 0.2 × helpful_attached)
      - 0.3 × harmful_attached            # uncapped
      + 0.1 (if note has ≥ 2 policy keywords)
      + min(0.30, 0.15 × pre_arb_unique)  # round 2 only

In the round-1 ambiguity band (0.40–0.70), the deterministic fallback uses the midpoint rule: accept at score ≥ 0.55, otherwise request_more_evidence. An optional LLM softening layer can override this midpoint when an API key is set; with no key it falls back to the deterministic rule so offline benchmarks stay reproducible.
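The scoring formula and the deterministic round-1 decision can be sketched together. This is a reading of the description above, not the code in scenarios/issuer_model.py: the function signatures are assumptions, and the optional LLM softening layer is omitted.

```python
def evidence_strength_score(
    all_required_attached: bool,
    helpful_attached: int,
    harmful_attached: int,
    policy_keyword_hits: int,
    pre_arb_unique: int = 0,
    round_no: int = 1,
) -> float:
    score = 0.4 if all_required_attached else 0.0
    score += min(0.4, 0.2 * helpful_attached)
    score -= 0.3 * harmful_attached  # uncapped, so the score can go negative
    score += 0.1 if policy_keyword_hits >= 2 else 0.0
    if round_no == 2:
        score += min(0.30, 0.15 * pre_arb_unique)
    return score


def round1_decision(score: float) -> str:
    """Deterministic round-1 fallback, including the midpoint rule."""
    if score >= 0.70:
        return "accept"
    if score >= 0.40:
        # Ambiguity band: midpoint rule at 0.55.
        return "accept" if score >= 0.55 else "request_more_evidence"
    return "escalate_to_arbitration"
```

With these numbers, a packet that has all required evidence and at least one helpful item already scores about 0.6 and is accepted under the midpoint rule, which is consistent with the pre-arbitration branch firing rarely for strong round-1 packets.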

Arbitration

Network arbitration is a pure function (see scenarios/arbitration.py). Given the same case ID and packet state, the ruling is always the same: inside an ambiguity band, it seeds a coin flip from a SHA-256 hash of the case ID. The bands:

| Evidence-strength score | Ruling |
| --- | --- |
| ≥ 0.65 | merchant_wins |
| ≤ 0.35 | issuer_wins |
| (0.35, 0.65) | seeded coin flip on sha256(case_id) |

Both sides pay a $250 fee regardless of outcome. The winner is reimbursed the dispute amount minus their $250 fee; the loser eats the dispute amount plus the $250 fee. The EscalationROIRubric reads the final P&L and scores whether the agent's escalate / concede decision was EV-rational ex ante.
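A minimal sketch of a deterministic resolver with these bands. How the real resolver turns the hash into a coin flip is not stated in this document, so the parity-of-first-byte rule below is an assumption chosen only to show the idea.

```python
import hashlib


def arbitration_ruling(case_id: str, strength: float) -> str:
    """Same case_id and strength always produce the same ruling."""
    if strength >= 0.65:
        return "merchant_wins"
    if strength <= 0.35:
        return "issuer_wins"
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    # Assumption: the parity of the first digest byte decides the flip.
    return "merchant_wins" if digest[0] % 2 == 0 else "issuer_wins"
```

Seeding from the case ID rather than a random generator means reruns of the same benchmark produce identical rulings without any shared RNG state.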

LLM Integration

The agent supports 5 LLM providers through OpenAI-compatible clients:

| Provider | Model | Base URL |
| --- | --- | --- |
| OpenRouter | openai/gpt-oss-120b | openrouter.ai/api/v1 |
| Google Gemini | gemini-2.5-flash | generativelanguage.googleapis.com/v1beta/openai/ |
| Groq | llama-3.3-70b-versatile | api.groq.com/openai/v1 |
| OpenAI | gpt-4.1-mini | api.openai.com/v1 |
| Anthropic | claude-sonnet-4 | (compatible gateway) |

Fallback Chain

Primary (configured in .env) → OpenRouter → Google Gemini → Groq → Heuristic

If the primary provider fails (timeout, rate limit, connection error), the agent automatically tries the next provider in the chain. If all providers fail, it falls back to _heuristic_pick().
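The fallback walk can be sketched with providers modelled as plain callables. This is an assumption for illustration: the real agent uses OpenAI-compatible clients, and pick_with_fallback is a hypothetical name. Catching Exception broadly stands in for the timeout, rate-limit, and connection errors named above.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical provider interface: takes the prompt, returns a candidate index.
ProviderCall = Callable[[str], int]


def pick_with_fallback(
    prompt: str,
    providers: Sequence[Tuple[str, ProviderCall]],
    n_candidates: int,
) -> Tuple[int, str]:
    """Try providers in order; fall back to the first (heuristic-sorted) candidate."""
    for name, call in providers:
        try:
            index = call(prompt)
        except Exception:
            continue  # provider failed, move on to the next one
        if 0 <= index < n_candidates:
            return index, name
    return 0, "heuristic"
```

Returning the provider name alongside the index makes it easy to log which tier actually made each decision. An out-of-range index is treated like a failure, so a malformed reply cannot select a nonexistent candidate.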

What the LLM Sees

{
  "queue_summary": "2 open cases, 8 steps remaining",
  "visible_case": "CB-G1, fraud_cnp, $480, deadline in 3 steps",
  "candidates": [
    {"index": 0, "action": "submit_representment", "summary": "Submit the contest package"},
    {"index": 1, "action": "query_system orders", "summary": "Query orders for more evidence"},
    {"index": 2, "action": "select_case CB-G2", "summary": "Switch to case with deadline in 1 step"}
  ]
}

What the LLM Returns

{
  "candidate_index": 0,
  "rationale": "CB-G1 has sufficient evidence and contesting before deadline takes priority over gathering more evidence."
}
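A defensive parser for this response shape might look like the sketch below. The function name is hypothetical, and it returns None whenever the reply is unusable so the caller can move on to the next provider or the heuristic.

```python
import json
from typing import Optional


def parse_candidate_index(raw: str, n_candidates: int) -> Optional[int]:
    """Extract a valid candidate_index from the model's JSON reply, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    index = data.get("candidate_index")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(index, bool) or not isinstance(index, int):
        return None
    return index if 0 <= index < n_candidates else None
```

Validating the range against the number of candidates matters because the index is used directly to select an action.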

Configuration

| Env Variable | Default | Purpose |
| --- | --- | --- |
| BASELINE_PROVIDER | openrouter | Primary LLM provider |
| BASELINE_MODEL | openai/gpt-oss-120b | Model to use |
| BASELINE_REQUEST_TIMEOUT_SECONDS | 15 | Per-call timeout |
| PROVIDER_RATE_LIMIT_RETRIES | 2 | Retry count on rate limits |
| PROVIDER_RETRY_BACKOFF_SECONDS | 1.0 | Backoff between retries |
| MAX_PROVIDER_RESPONSE_TOKENS | 200 | Max tokens for LLM response |
| STRICT_LLM_MODE | false | If true, fail instead of falling back to heuristic |

Key Optimizations

1. Deterministic Strategy Inference

For reason codes where the optimal strategy never varies (goods_not_received = contest, credit_not_processed / duplicate_processing = issue_refund), the agent skips retrieve_policy entirely. This saves 1 step per case.

2. Deadline-Aware Query Limiting

When the remaining steps before deadline can't accommodate all planned queries, the agent reduces the number of systems queried:

  • product_not_as_described: drops from 3 systems (orders, support, shipping) to 2 (orders, support)
  • fraud_cnp: drops from 3 systems (risk, support, orders) to 2 (risk, support)
  • service_not_provided: drops from 2 systems (orders, support) to 1 (support)

3. Near-Completion Protection

When the current case has evidence attached and is 1-2 steps from submission, the agent does NOT switch to handle another case's deadline. Finishing the current case is almost always higher-value than interrupting.

4. Harmful Evidence Cleanup

Before any submit, the agent checks for harmful evidence in the attached set. If found, it generates a remove_evidence action immediately. This prevents the -0.25 per-piece evidence_quality penalty and the packet_validity = 0.0 failure.

5. Budget-Aware Note Generation

Representment notes are generated with direct references to policy requirement keywords and evidence IDs, maximizing the note_quality score (policy claims coverage 50% + evidence coherence 15%).

6. Adversarial Evidence (Hard/Nightmare)

At hard and nightmare difficulty, the case generator injects adversarial evidence: items whose titles sound helpful ("Delivery verification report", "Account verification summary") but whose content is harmful (GPS discrepancies, prior non-receipt claims, failed 3D Secure challenges). This tests whether the agent reads beyond titles and inspects evidence content before attaching.

7. Nightmare Difficulty

Nightmare tasks push the step budget to its limit: 5-6 cases with ~2.4 steps per case. The agent must triage aggressively: fast-conceding weak cases, handling deterministic codes first, and accepting that some cases will go unresolved. This tier specifically tests prioritisation under extreme resource pressure.


File Map

| File | Purpose | Lines |
| --- | --- | --- |
| runners/baseline_runner.py | The agent: decision pipeline, candidate generation, LLM integration, representment notes | ~1100 |
| server/chargeback_ops_environment.py | The environment: step/reset/state, action execution, reward computation | ~500 |
| evaluation/rubrics.py | OpenEnv Rubric subclasses for all 8 scoring dimensions, composed via WeightedSum + Gate(CaseAbandonedRubric) | ~400 |
| scenarios/issuer_model.py | Scripted IssuerAgent: evidence-strength scoring, threshold bands, optional LLM softening | ~250 |
| scenarios/arbitration.py | Deterministic network arbitration resolver with $250 per-side fee | ~120 |
| evaluation/grading.py | Legacy score_case / grade_episode adapter that delegates to the rubric tree | ~120 |
| scenarios/simulation.py | Task definitions, case progress tracking, evidence metadata | ~600 |
| core/models.py | Pydantic models for actions, observations, state, grading | ~600 |
| runners/inference.py | OpenEnv-compatible inference entry point with provider fallback | ~200 |
| inference.py | Root re-export for submission contract | ~10 |
| scenarios/case_generator.py | Parametric task generator with seeded RNG | ~700 |
| scenarios/iso_adapter.py | Converts ISO 20022 CASR.003 records to environment cases | ~160 |
| connectors/stripe_sandbox.py | Maps Stripe test-mode disputes to environment cases | ~280 |
| evaluation/agent_brutal_audit.py | 126-episode evaluation across all data sources | ~300 |
| server/app.py | FastAPI routes: /reset, /step, /state, /tasks, /baseline, /grader, /results, /demo | ~200 |
| server/demo_ui.py | Gradio live demo UI with step-by-step episode playback | ~150 |
| core/episode_store.py | Thread-safe storage with JSONL file persistence | ~60 |
| core/client.py | OpenEnv WebSocket client | ~100 |

Performance

Tested across the 11-task headline benchmark (4 showcase + 7 seeded holdout) and a 28-task multi-seed grid:

| Policy | Headline (11) | Multi-seed (28) | Delta vs naive |
| --- | --- | --- | --- |
| naive (empty packet) | 0.000 | 0.000 | - |
| concede_all | 0.567 | 0.563 | +0.567 |
| escalate_all | 0.773 | 0.765 | +0.773 |
| heuristic | 0.773 | 0.765 | +0.773 |

The difficulty curve runs 0.97 → 0.88 → 0.70 → 0.51 across easy / medium / hard / nightmare on the multi-seed grid, which is monotone and well-separated. The Gate(CaseAbandonedRubric) wrapper hard-zeros abandoned cases, and EscalationROIRubric (20%) penalises both conceding positive-EV contestable cases and escalating negative-EV ones; together they kill the concede-everything shortcut. escalate_all ties heuristic at the headline because the merchant's round-1 packet is strong enough on most tasks that the pre-arb branch never fires. See docs/RESULTS.md for full per-task numbers, the rubric tree, and reproduction commands.