ChargebackOps Agent: Complete Technical Reference

This document explains every aspect of the ChargebackOps agent -- what problem it solves, how it thinks, how it is scored, and why every design decision was made.

The Problem
The Use Case
How the Environment Works
How the Agent Works
The Three-Tier Decision Pipeline
Reason Code Strategies
Multi-Case Triage
Evidence Handling
Representment Notes
The Grading System
LLM Integration
Key Optimizations
File Map

The Problem

When a customer disputes a credit card charge, the card network (Visa, Mastercard) initiates a chargeback against the merchant. The merchant loses the transaction amount immediately. To recover the funds, the merchant must build a representment package -- a bundle of evidence proving the charge was legitimate -- and submit it before a hard deadline.

This is not a simple yes/no decision. Each dispute has:

A reason code (why the customer disputes: fraud, goods not received, product not as described, etc.)
A deadline (fixed number of steps before the case auto-closes against the merchant)
Evidence scattered across 6 internal merchant systems (orders, payment, shipping, support, refunds, risk)
Some evidence is helpful, some is required, and some is harmful (weakens the case if included)
A correct strategy that depends on the evidence available (contest, accept the chargeback, or issue a refund)

A human dispute analyst handles 50-200 cases per day. They must triage by urgency, query the right systems, avoid attaching damaging evidence, and submit within deadline. Mistakes are expensive: a lost $500 chargeback plus a $25 network fee per case.

ChargebackOps turns this into a measurable agent benchmark. The agent must do exactly what a human analyst does, but programmatically, under step-budget constraints, with deterministic scoring.

The Use Case

ChargebackOps is built for the OpenEnv evaluation framework. It is a simulated merchant dispute resolution environment where an AI agent acts as the dispute analyst.

What the agent receives:

A queue of 1-6 open dispute cases (5-6 at nightmare difficulty)
A step budget (10-20 actions total, ~2.4 steps/case at nightmare)
Per-case deadlines (must resolve before step N)

What the agent must do:

Select and focus on one case at a time
Query internal merchant systems to retrieve evidence
Decide whether to contest, accept, or refund each case
Attach the right evidence (and avoid harmful artifacts)
Write a representment note explaining why the dispute should be reversed
Submit or resolve each case before its deadline
Manage step budget across all cases when there are more cases than steps

What the agent is scored on:

Did it choose the correct strategy? (20% of score)
Did it gather the right evidence? (15%)
Is the evidence packet complete and clean? (10%)
Did it meet the deadline? (10%)
Was it efficient (no wasted steps)? (10%)
Did the resolution match the strategy? (10%)
Is the representment note well-written? (5%)
Was escalation EV-rational? (20% — escalate iff P(win)·amount > $250 fee)

After the merchant submits a representment, a scripted IssuerAgent reviews the packet and returns one of three decisions: accept, request_more_evidence (triggering pre-arbitration with compelling evidence), or escalate_to_arbitration. The merchant can also choose to escalate at round 2 instead of rebuilding the packet, or accept an arbitration loss to cap fees. Network arbitration is a deterministic resolver: the loser eats the dispute amount plus a $250 fee, the winner is reimbursed minus their own $250 fee.

How the Environment Works

The environment follows the OpenEnv reset() / step() / state() contract.

Lifecycle

reset(task_id) → Observation
step(action)   → Observation
state()        → State (includes grader report when done)

Observation

Each observation contains:

Field	Type	Description
`queue`	list	All cases with status, reason_code, amount, steps_until_deadline
`visible_case`	object or null	The currently selected case with full detail
`steps_remaining`	int	Steps left before episode ends
`done`	bool	Whether the episode is complete
`reward`	float	Immediate reward from the last action
`result`	string	Human-readable outcome of the last action

The Visible Case

When a case is selected, visible_case exposes:

Field	Description
`case_id`	Unique identifier
`reason_code`	Why the customer disputed (e.g., `goods_not_received`)
`amount`	Transaction amount in dollars
`current_strategy`	Currently set strategy (null if not set)
`policy`	Policy guidance (null until `retrieve_policy` is called)
`systems_revealed`	Which merchant systems have been queried
`retrieved_evidence`	Evidence items revealed by queries
`attached_evidence`	Evidence currently attached to the representment package
`inspection_notes`	Analyst notes (null until `inspect_case` is called)

Action Space (12 Actions)

Round 1 — Representment

Action	Arguments	Cost	What It Does
`select_case`	case_id	1 step	Focus on a case from the queue
`inspect_case`	case_id	1 step	Reveal analyst inspection notes (+0.04 reward)
`query_system`	case_id, system_name	1 step	Pull evidence from orders/payment/shipping/support/refunds/risk
`retrieve_policy`	case_id	1 step	Get reason-code-specific guidance and required evidence list
`add_evidence`	case_id, evidence_ids	1 step	Attach evidence to the representment package
`remove_evidence`	case_id, evidence_ids	1 step	Remove evidence (useful for cleaning harmful attachments)
`set_strategy`	case_id, strategy	1 step	Choose contest / accept_chargeback / issue_refund
`submit_representment`	case_id, note	1 step	Submit the contest package (requires strategy = contest)
`resolve_case`	case_id, strategy	1 step	Close a non-contest case (accept or refund)

Round 2/3 — Pre-Arbitration & Arbitration

Action	Arguments	Cost	What It Does
`respond_to_pre_arb`	case_id, compelling_evidence_ids	1 step	Attach compelling evidence and resubmit at round 2 (Issuer accept threshold drops to 0.60)
`escalate_to_arbitration`	case_id	1 step	Skip rebuilding the packet, pay $250 fee, push to network arbitration
`accept_arbitration_loss`	case_id	1 step	Concede at round 2/3 to cap fees

Every action costs exactly 1 step. There is no free action. The agent must be deliberate about every step it takes.

Reward Signals

The environment returns immediate rewards after each action:

Event	Reward
Select an open case	+0.02
Inspect a case (first time)	+0.04
Query a new system with helpful evidence	+0.06 to +0.08
Query a new system with no useful evidence	-0.01 to +0.01
Query an already-queried system (duplicate)	-0.03
Attach helpful evidence	+0.08 per piece
Attach harmful evidence	-0.08 per piece
Attach neutral evidence	+0.01
Remove harmful evidence	+0.05
Remove helpful evidence	-0.03
Set optimal strategy	+0.10
Set acceptable strategy	+0.03
Set wrong strategy	-0.08
Submit a strong representment on time	+0.20
Submit after deadline	-0.20
Submit with missing required evidence	-0.18
Submit with harmful evidence attached	-0.15
Contest a case that shouldn't be contested	-0.12
Resolve with optimal strategy	+0.16
Resolve with acceptable strategy	+0.06
Resolve with wrong strategy	-0.12
Invalid action	-0.12

These rewards are shaping signals. The final score comes from the deterministic grader, not reward accumulation.

How the Agent Works

The agent is implemented in baseline_runner.py. It is a heuristic-first, LLM-augmented policy. The heuristic handles ~90% of decisions deterministically. The LLM is only called when the heuristic encounters genuine ambiguity (multiple meaningfully different candidates).

Why Heuristic-First?

Reliability: Heuristic decisions never fail, never time out, never cost money.
Speed: No network round-trip for obvious moves.
Determinism: Same input always produces same output (important for reproducibility).
Budget: LLM calls cost tokens and have rate limits. The agent makes 6-20 decisions per episode.

The LLM acts as a tiebreaker when the heuristic produces multiple viable candidates that differ in action type. This hybrid approach gets the best of both worlds.

The Three-Tier Decision Pipeline

Every step, the agent runs this pipeline:

Tier 1: `candidate_actions(observation)`

Reads the current observation and generates a list of CandidateAction objects -- the legal moves the agent considers. This is the core intelligence of the agent.

The function applies these checks in strict priority order:

1. No case selected? Generate select_case candidates sorted by triage priority.
2. Current case resolved? Switch to an open case.
3. Harmful evidence attached? Immediately generate remove_evidence and return. This takes priority over any other work on the case because a single harmful item zeroes the packet_validity score (10% of total) and subtracts 0.25 from evidence_quality (15% of total).
4. Deadline <= 1 step? Emergency submit or resolve. No time for anything else.
5. Budget too tight to contest? If there aren't enough steps to run the full contest path (minimum 5 steps: policy + query + attach + strategy + submit), or if this is the lowest-value case in a multi-case triage, fast-concede with issue_refund.
6. Budget pressure (steps <= cases * 2)? If the inferred strategy is accept/refund, resolve immediately.
7. Reason code handler: Dispatches to a reason-code-specific handler that generates the appropriate sequence of queries, evidence attachments, strategy setting, and submission.

Tier 2: `_obvious_next_action(observation, candidates)`

Before calling any LLM, checks if the choice is trivial:

Only 1 candidate? Take it.
All candidates have the same action type? Take the first.
One candidate targets a case with a much tighter deadline? Take it.

If obvious, the LLM is skipped entirely.
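The three Tier 2 checks can be sketched as a small function. This is an illustrative sketch, not the code in baseline_runner.py: the `Candidate` fields and the `deadline_gap` threshold that defines "much tighter" are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    # Hypothetical shape; the real CandidateAction may carry other fields.
    action_type: str
    case_id: str
    steps_until_deadline: int


def obvious_next_action(
    candidates: list[Candidate], deadline_gap: int = 3
) -> Optional[Candidate]:
    """Return a candidate when the choice is trivial, else None."""
    if not candidates:
        return None
    # Check 1: only one candidate.
    if len(candidates) == 1:
        return candidates[0]
    # Check 2: every candidate has the same action type, so order decides.
    if len({c.action_type for c in candidates}) == 1:
        return candidates[0]
    # Check 3: one case is far more urgent than the runner-up.
    # The gap of 3 steps is an assumed threshold, not taken from the source.
    ordered = sorted(candidates, key=lambda c: c.steps_until_deadline)
    if ordered[1].steps_until_deadline - ordered[0].steps_until_deadline >= deadline_gap:
        return ordered[0]
    return None  # genuine ambiguity: hand over to Tier 3
```

Returning None rather than a default keeps the "is this obvious?" question separate from the "which one?" question that Tier 3 answers.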

Tier 3: LLM or `_heuristic_pick(candidates)`

When Tier 2 returns None (genuine ambiguity):

With LLM: Sends the observation summary and candidate list as a JSON prompt. The model returns {"candidate_index": N, "rationale": "..."}. On failure, walks the fallback chain (OpenRouter -> Gemini -> Groq).
Without LLM: _heuristic_pick() returns the first candidate (the heuristic already sorted by priority).

Reason Code Strategies

The agent handles 6 reason code families, each with a different workflow:

`goods_not_received` (Deterministic: contest)

The customer claims they never received the product. The merchant almost always has delivery proof.

Steps: select -> query orders -> query shipping -> attach delivery evidence -> set_strategy contest -> submit representment

Systems queried: orders, shipping
Typical evidence: Order confirmation, delivery scan, tracking number
Strategy: Always contest (delivery proof is definitive)

`fraud_cnp` (Non-deterministic: contest or accept)

Card-not-present fraud. The customer claims they didn't authorize the transaction. This is the most nuanced reason code -- sometimes the merchant has strong evidence (prior good orders, account match), sometimes they don't.

Steps: select -> retrieve_policy -> query risk + support (+ orders if budget allows) -> attach non-harmful evidence -> set_strategy per policy -> submit or resolve

Systems queried: risk, support, orders (optional under tight budget)
Typical evidence: Risk assessment, prior order linkage, account verification
Harmful evidence: AVS mismatch, CVV mismatch (proves the card data didn't fully match)
Strategy: Contest if strong evidence exists, accept_chargeback if evidence is weak

`credit_not_processed` (Deterministic: issue_refund)

The customer claims a refund was promised but never issued. The correct response is to issue the refund.

Steps: select -> set_strategy issue_refund -> resolve_case (3 steps total)
Strategy: Always issue_refund (cheapest to resolve, no contest needed)

`duplicate_processing` (Deterministic: issue_refund)

The customer was charged twice. The correct response is to refund the duplicate.

Steps: select -> set_strategy issue_refund -> resolve_case (3 steps total)
Strategy: Always issue_refund

`product_not_as_described` (Non-deterministic: contest or accept)

The customer claims the product didn't match its description. Success depends on whether the merchant has listing accuracy proof and whether the customer bypassed the return process.

Steps: select -> retrieve_policy -> query orders + support (+ shipping if deadline allows) -> attach listing/return evidence -> contest or accept per guidance

Systems queried: orders, support, shipping (optional)
Strategy: Contest if listing proof is strong, accept_chargeback if not supportable

`service_not_provided` (Non-deterministic: contest or accept)

The customer claims a service was never delivered. Success depends on completion records and customer acknowledgment.

Steps: select -> retrieve_policy -> query support (+ orders if deadline allows) -> attach completion evidence -> contest or accept per guidance

Systems queried: support, orders (optional)
Strategy: Contest if service completion proof exists, accept_chargeback otherwise

Multi-Case Triage

When the agent has multiple open cases and the total estimated step cost exceeds the budget, it uses a triage algorithm:

Step Cost Estimates

Reason Code	Est. Steps	Notes
`goods_not_received`	6	select + 2 queries + attach + strategy + submit
`credit_not_processed`	3	select + strategy + resolve
`duplicate_processing`	3	select + strategy + resolve
`fraud_cnp`	8	select + policy + 2-3 queries + attach + strategy + submit
`product_not_as_described`	8	select + policy + 2-3 queries + attach + strategy + submit
`service_not_provided`	7	select + policy + 2 queries + attach + strategy + submit

Triage Algorithm

1. If total_estimated_cost > steps_remaining:
     Sort cases: deterministic-strategy codes first, then by amount descending.
     This ensures cheap, guaranteed-outcome cases are handled first,
     and the highest-value non-deterministic cases get remaining budget.

2. When processing each case, check:
   - Is steps_remaining < 5? → Fast-concede (can't even minimally contest).
   - Is this the lowest-value case and total_cost > budget? → Fast-concede.
   - Otherwise → Full contest or policy-guided resolution.

3. Never interrupt a near-complete case:
   - If the current case has evidence attached and is 1-2 steps from
     submission, finish it before switching to another case's deadline.
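
Step 1 of the algorithm above can be sketched in a few lines of Python. The step estimates and the deterministic-code set come from this document; the dictionary keys (`case_id`, `reason_code`, `amount`) and the function name are assumptions made for the example.

```python
# Estimated steps per reason code, copied from the table above.
EST_STEPS = {
    "goods_not_received": 6,
    "credit_not_processed": 3,
    "duplicate_processing": 3,
    "fraud_cnp": 8,
    "product_not_as_described": 8,
    "service_not_provided": 7,
}

# Reason codes whose optimal strategy never varies.
DETERMINISTIC = {"goods_not_received", "credit_not_processed", "duplicate_processing"}


def triage_order(cases: list[dict], steps_remaining: int) -> list[dict]:
    """Reorder the queue only when the estimated cost exceeds the budget."""
    total_cost = sum(EST_STEPS[c["reason_code"]] for c in cases)
    if total_cost <= steps_remaining:
        return list(cases)  # budget is sufficient, keep the queue order
    # Deterministic codes first (False sorts before True), then amount descending.
    return sorted(
        cases,
        key=lambda c: (c["reason_code"] not in DETERMINISTIC, -c["amount"]),
    )
```

With four cases costing 8 + 3 + 6 + 3 = 20 estimated steps against a 12-step budget, the three deterministic cases move ahead of the fraud_cnp case even if the fraud case has the largest amount.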

Why This Ordering Works

credit_not_processed/duplicate_processing cost 3 steps and always get optimal score. Handle them first to free budget.
goods_not_received costs 6 steps and always contests. Handle next.
fraud_cnp/product_not_as_described/service_not_provided cost 7-8 steps and may need to concede. Handle last -- if budget runs out, conceding these with issue_refund (an acceptable fallback) still earns 35% strategy correctness.

Evidence Handling

Harmful Evidence Detection

The agent maintains a set of 15 negative-signal keywords derived from real chargeback dispute patterns:

mismatch, failed, declined, suspicious, flagged, fraud risk,
unauthorized, rejected, invalid, expired, violation,
non-compliant, discrepancy, inconsistent, unverified

Every evidence item's title and summary are scanned. If any harmful keyword is found, the evidence is:

Never attached (ranked 999 in the priority sort, excluded from add_evidence calls)
Removed if already attached (a remove_evidence action is generated immediately before any other action)

Evidence Priority Ranking

Non-harmful evidence is ranked by keyword relevance:

Rank	Keywords	Example
0 (highest)	signature, completion, booking, listing	"Delivery signature scan"
1	duplicate, delivery, prior, account, authenticated	"Prior good order linkage"
2	return policy, refund, cancel, confirmation, cancellation	"Return policy documentation"
4 (default)	anything else	"Internal memo"
999 (excluded)	mismatch, failed, declined, suspicious, flagged, fraud risk, unauthorized, rejected, invalid, expired, violation, non-compliant, discrepancy, inconsistent, unverified	"AVS mismatch report"
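
A minimal sketch of the keyword scan and ranking described above. The keyword lists are taken from this section; the use of case-insensitive substring matching, the item keys (`id`, `title`, `summary`), and the function names are assumptions, so the real matcher may differ.

```python
HARMFUL_KEYWORDS = (
    "mismatch", "failed", "declined", "suspicious", "flagged", "fraud risk",
    "unauthorized", "rejected", "invalid", "expired", "violation",
    "non-compliant", "discrepancy", "inconsistent", "unverified",
)

RANKED_KEYWORDS = (
    (0, ("signature", "completion", "booking", "listing")),
    (1, ("duplicate", "delivery", "prior", "account", "authenticated")),
    (2, ("return policy", "refund", "cancel", "confirmation", "cancellation")),
)


def evidence_rank(title: str, summary: str = "") -> int:
    """Rank one evidence item; 999 means harmful and never attached."""
    text = f"{title} {summary}".lower()
    # Harmful keywords win over everything, including a helpful-sounding title.
    if any(keyword in text for keyword in HARMFUL_KEYWORDS):
        return 999
    for rank, keywords in RANKED_KEYWORDS:
        if any(keyword in text for keyword in keywords):
            return rank
    return 4  # default rank for anything else


def attachable_ids(items: list[dict]) -> list[str]:
    """Return non-harmful evidence IDs, best rank first."""
    ranked = [(evidence_rank(i["title"], i.get("summary", "")), i["id"]) for i in items]
    return [evidence_id for rank, evidence_id in sorted(ranked) if rank != 999]
```

Scanning the summary as well as the title matters: an item titled "Delivery verification report" ranks 999 as soon as its summary mentions a discrepancy.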

Attachment Strategy

The agent attaches all non-harmful retrieved evidence in a single add_evidence call. This maximizes the helpful-evidence term of the evidence_quality score, helpful_attached / helpful_total.

Representment Notes

When the agent submits a contest, it generates a representment note. The grader scores notes on 4 dimensions:

Dimension	Weight	What Earns Points
Substance	20%	Note has >= 5 words
Policy claims coverage	50%	Note mentions keywords from `case.policy_requirements` (e.g., "order confirmation", "carrier delivery")
Evidence coherence	15%	Note references attached evidence IDs (e.g., "E1-ORDER-CONF")
Harmful mention penalty	-15% each	Note contains words like "mismatch", "failed", "declined"

How the Agent Builds Notes

1. Start with a reason-code-specific template that uses policy requirement language:
   - goods_not_received: "Order confirmation and carrier delivery confirmation establish fulfillment..."
   - fraud_cnp: "Prior good order linkage and customer account confirmation..." (never mentions "mismatch")
   - product_not_as_described: "Product listing verification confirms..."
   - service_not_provided: "Service completion record and customer acknowledgment..."
2. If policy was retrieved, append the policy requirements directly:
   - "Evidence covers: order confirmation, carrier delivery confirmation."
3. Append evidence IDs for coherence scoring:
   - "Supporting evidence: E1-ORDER-CONF, E1-DELIVERY-SCAN."
4. Truncate to 500 characters.
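
The note assembly can be sketched as one function. The "Evidence covers:" and "Supporting evidence:" phrasing and the 500-character cap come from this section; the function name and signature are assumptions. The template is passed in as an argument because the full template strings are not reproduced in this document.

```python
def build_note(
    template: str,
    policy_requirements: list[str],
    evidence_ids: list[str],
    max_len: int = 500,
) -> str:
    """Assemble a representment note: template, policy terms, evidence IDs."""
    parts = [template.strip()]
    if policy_requirements:
        # Only present when retrieve_policy was called for the case.
        parts.append("Evidence covers: " + ", ".join(policy_requirements) + ".")
    if evidence_ids:
        parts.append("Supporting evidence: " + ", ".join(evidence_ids) + ".")
    return " ".join(parts)[:max_len]


# Usage with a placeholder opening sentence (hypothetical, not a real template).
note = build_note(
    "Delivery records support the charge.",
    ["order confirmation", "carrier delivery confirmation"],
    ["E1-ORDER-CONF", "E1-DELIVERY-SCAN"],
)
```

A hard slice at 500 characters can cut the evidence IDs off a long note, so a production version might truncate the template first; the sketch keeps the simple behaviour stated above.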

The Grading System

After all cases are resolved (or the step budget is exhausted), the grader scores each case across 8 dimensions. Each dimension is an OpenEnv Rubric subclass defined in evaluation/rubrics.py; they compose into a per-case WeightedSum (wrapped in a Gate(CaseAbandonedRubric) deadline guard) and an episode-level ChargebackOpsEpisodeRubric that is wired into env.rubric. evaluation/grading.py keeps the legacy score_case / grade_episode API as a thin adapter over the rubric tree.

Strategy Correctness (20%)

Outcome	Score
Chose the optimal strategy	1.0
Chose an acceptable fallback	0.35
Chose the wrong strategy	0.0

"Optimal" and "acceptable" strategies are defined per case by the task scenario. For example, goods_not_received optimal is always "contest" with no acceptable fallback. fraud_cnp optimal might be "contest" with "accept_chargeback" as acceptable, or vice versa.

Evidence Quality (15%)

For contest cases:

quality = 0.7 * (required_attached / required_total)
        + 0.3 * (helpful_attached / helpful_total)
        - 0.25 * harmful_attached_count

For non-contest cases where optimal strategy is also non-contest:

1.0 if no evidence was attached (clean concession)
0.7 if evidence was attached (unnecessary work)

For non-contest cases where optimal was contest:

0.15 (the agent abandoned evidence gathering for a contestable case)
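
The three branches above can be written as one function. This is a sketch under assumptions: the parameter names are illustrative, a zero denominator is treated as a full ratio, "evidence was attached" is approximated from the three counts, and the result is not clamped because this document does not say whether the grader clamps it.

```python
def evidence_quality(
    final_strategy: str,
    optimal_strategy: str,
    required_attached: int,
    required_total: int,
    helpful_attached: int,
    helpful_total: int,
    harmful_attached: int,
) -> float:
    """Evidence-quality score following the three cases described above."""
    if final_strategy == "contest":
        # Assumption: an empty requirement list counts as fully satisfied.
        required_ratio = required_attached / required_total if required_total else 1.0
        helpful_ratio = helpful_attached / helpful_total if helpful_total else 1.0
        return 0.7 * required_ratio + 0.3 * helpful_ratio - 0.25 * harmful_attached
    if optimal_strategy == "contest":
        return 0.15  # conceded a contestable case
    # Both the final and the optimal strategy are non-contest.
    anything_attached = (required_attached + helpful_attached + harmful_attached) > 0
    return 0.7 if anything_attached else 1.0
```

For example, a contest with both required items, one of two helpful items, and nothing harmful scores 0.7 * 1.0 + 0.3 * 0.5 = 0.85; attaching one harmful item to an otherwise perfect packet drops it from 1.0 to 0.75.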

Packet Validity (10%)

Binary, all-or-nothing:

1.0 if ALL required evidence is attached AND zero harmful evidence is attached
0.0 otherwise

This is the strictest dimension. Missing one required piece or having one harmful piece zeroes it out.

Deadline Compliance (10%)

Binary:

1.0 if the case was resolved at or before the deadline step
0.0 if resolved after the deadline or never resolved

Efficiency (10%)

efficiency = 1.0 - min(0.9, (duplicate_queries + invalid_actions) * 0.1 + submit_attempts * 0.05)

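The agent loses 0.1 per duplicate system query or invalid action, and 0.05 per submit attempt. Because the combined penalty is capped at 0.9, this base formula alone never drops below 0.1. Minimum efficiency is 0.0, which is reachable only after the additional penalties below are applied.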

Additional penalties for shallow operational behaviour:

Over-querying a concedable case: -0.15 per system queried beyond the 2nd when the agent concedes a case whose optimal strategy is also non-contest. Querying 4+ systems before conceding is wasteful.
Late policy retrieval: -0.08 when policy is retrieved but the case is resolved with a concession that matches the optimal non-contest strategy. The policy step was wasted.
Early correct concession bonus: +0.10 when the agent correctly concedes a case (matching optimal) within 3 steps. Rewards recognising a bad case quickly.

Outcome Quality (10%)

Outcome	Score
Final resolution matches optimal strategy	1.0
Final resolution is an acceptable fallback	0.4
Final resolution is wrong	0.0

Note Quality (5%)

Only scored for contest cases with a representment note. See Representment Notes for the scoring breakdown.

Escalation ROI (20%)

Encodes the economic rule that escalating to network arbitration is rational only when P(win) × dispute_amount > $250 fee. Conceding a positive-EV contestable case (where amount > $250 and the optimal strategy is contest) is penalised. Escalating a negative-EV case (low P(win) or low amount) is also penalised. This is the dimension that keeps concede_all from being a free 0.6+ score.
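
The escalation rule is a one-line inequality, sketched here with a helper for the break-even win probability. The function names are assumptions; the $250 fee and the inequality come from this document.

```python
ARBITRATION_FEE = 250.0  # per-side network arbitration fee


def escalation_is_rational(p_win: float, amount: float, fee: float = ARBITRATION_FEE) -> bool:
    """True when the expected recovery exceeds the arbitration fee."""
    return p_win * amount > fee


def break_even_p_win(amount: float, fee: float = ARBITRATION_FEE) -> float:
    """Win probability above which escalating is worthwhile for this amount."""
    return fee / amount
```

For a $480 dispute the break-even probability is 250 / 480, roughly 0.52, so escalating at P(win) = 0.8 is rational (0.8 * 480 = 384 > 250) while escalating at P(win) = 0.5 is not (240 < 250). A dispute of $250 or less can never clear the bar, whatever the win probability.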

Deadline Gate

Before the WeightedSum scores anything, Gate(CaseAbandonedRubric) checks whether the case was left unresolved past its deadline. If yes, the entire case score is hard-zeroed. This prevents the agent from gaming the rubric by ignoring nightmare-tier cases and still collecting partial credit on the dimensions it did touch.

Final Score Calculation

case_score = 0.20 * strategy_correctness
           + 0.15 * evidence_quality
           + 0.10 * packet_validity
           + 0.10 * deadline_compliance
           + 0.10 * efficiency
           + 0.10 * outcome_quality
           + 0.05 * note_quality
           + 0.20 * escalation_roi

case_score = 0.0 if case_abandoned else case_score   # deadline gate

weighted_case_score = case_score * case_weight

episode_score = sum(weighted_case_scores) / sum(case_weights)

Case weights are determined by financial impact (amount and difficulty). The episode score normalizes to [0.0, 1.0].
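
The weighted sum, the deadline gate, and the episode normalisation can be checked with a short sketch. The weights are copied from the formula above; the function names and the dictionary-based input are assumptions, and the real implementation composes OpenEnv Rubric objects instead.

```python
WEIGHTS = {
    "strategy_correctness": 0.20,
    "evidence_quality": 0.15,
    "packet_validity": 0.10,
    "deadline_compliance": 0.10,
    "efficiency": 0.10,
    "outcome_quality": 0.10,
    "note_quality": 0.05,
    "escalation_roi": 0.20,
}


def case_score(dimensions: dict[str, float], abandoned: bool = False) -> float:
    """Weighted sum over the 8 dimensions, zeroed by the deadline gate."""
    if abandoned:
        return 0.0
    return sum(weight * dimensions[name] for name, weight in WEIGHTS.items())


def episode_score(scored_cases: list[tuple[float, float]]) -> float:
    """Weight-normalised mean over (case_score, case_weight) pairs."""
    total_weight = sum(weight for _, weight in scored_cases)
    return sum(score * weight for score, weight in scored_cases) / total_weight


# A case that is perfect except for an acceptable-fallback strategy (0.35).
example = {name: 1.0 for name in WEIGHTS}
example["strategy_correctness"] = 0.35
example_score = case_score(example)  # 1.0 - 0.20 * 0.65 = 0.87
```

The example shows how much a single dimension can cost: dropping strategy correctness from 1.0 to 0.35 removes 0.13 from the case score.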

The Issuer Agent

After every submit_representment, a scripted IssuerAgent (see scenarios/issuer_model.py) reviews the packet and returns one of three decisions:

Decision	Score band (round 1)	Score band (round 2)	What happens
`accept`	≥ 0.70	≥ 0.60	Merchant wins the dispute, case closes positive
`request_more_evidence`	0.40 – 0.70	< 0.60	Round 2: merchant gets one more shot with compelling evidence
`escalate_to_arbitration`	< 0.40	(only if merchant escalates)	Round 3: case goes to network arbitration

The score itself comes from evidence_strength_score:

score = 0.4 (if all required evidence attached)
      + min(0.4, 0.2 × helpful_attached)
      − 0.3 × harmful_attached            # uncapped
      + 0.1 (if note has ≥ 2 policy keywords)
      + min(0.30, 0.15 × pre_arb_unique)  # round 2 only

In the round-1 ambiguity band (0.40–0.70), the deterministic fallback uses the midpoint rule: accept at score ≥ 0.55, otherwise request_more_evidence. An optional LLM softening layer can override this midpoint when an API key is set; with no key it falls back to the deterministic rule so offline benchmarks stay reproducible.
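
A sketch of the score formula and the deterministic round-1 decision as this section describes them. The function and parameter names are assumptions, the score is left unclamped because the harmful term is stated to be uncapped, and the handling of the exact band edges (0.40 and 0.70) is a guess where the text is ambiguous.

```python
def evidence_strength_score(
    all_required_attached: bool,
    helpful_attached: int,
    harmful_attached: int,
    policy_keywords_in_note: int,
    pre_arb_unique: int = 0,
    round_number: int = 1,
) -> float:
    """Evidence-strength score following the formula above."""
    score = 0.4 if all_required_attached else 0.0
    score += min(0.4, 0.2 * helpful_attached)
    score -= 0.3 * harmful_attached  # uncapped penalty
    if policy_keywords_in_note >= 2:
        score += 0.1
    if round_number == 2:
        score += min(0.30, 0.15 * pre_arb_unique)
    return score


def round1_decision(score: float) -> str:
    """Deterministic round-1 decision, with the midpoint rule inside the band."""
    if score >= 0.70:
        return "accept"
    if score < 0.40:
        return "escalate_to_arbitration"
    # Ambiguity band: deterministic fallback accepts at the 0.55 midpoint.
    return "accept" if score >= 0.55 else "request_more_evidence"
```

A packet with all required evidence, two helpful items, and a note with two policy keywords scores 0.4 + 0.4 + 0.1 = 0.9 and is accepted outright. The same packet with only one helpful item and one harmful item scores 0.4 + 0.2 - 0.3 + 0.1 = 0.4 and lands at the bottom of the band.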

Arbitration

Network arbitration is a pure function (see scenarios/arbitration.py): given the same case ID and packet state, the ruling is always the same. Inside the ambiguity band, the outcome is a coin flip seeded from a SHA-256 hash of the case ID. The bands:

Evidence-strength score	Ruling
≥ 0.65	`merchant_wins`
≤ 0.35	`issuer_wins`
(0.35, 0.65)	seeded coin flip on `sha256(case_id)`

Both sides pay a $250 fee regardless of outcome. The winner is reimbursed the dispute amount minus their $250 fee; the loser eats the dispute amount plus the $250 fee. The EscalationROIRubric reads the final P&L and scores whether the agent's escalate / concede decision was EV-rational ex ante.
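
The resolver and the fee arithmetic can be sketched as below. The bands and the $250 fee come from this section; how the hash is turned into a coin flip (parity of the first digest byte) and the sign convention of the P&L helper are assumptions, so scenarios/arbitration.py may differ on both.

```python
import hashlib


def arbitration_ruling(case_id: str, evidence_strength: float) -> str:
    """Deterministic ruling: clear bands first, seeded coin flip in between."""
    if evidence_strength >= 0.65:
        return "merchant_wins"
    if evidence_strength <= 0.35:
        return "issuer_wins"
    # Assumed mapping: even first byte of sha256(case_id) means the merchant wins.
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    return "merchant_wins" if digest[0] % 2 == 0 else "issuer_wins"


def merchant_pnl(ruling: str, amount: float, fee: float = 250.0) -> float:
    """Merchant outcome: reimbursed amount minus fee, or amount plus fee lost."""
    if ruling == "merchant_wins":
        return amount - fee
    return -(amount + fee)
```

On a $480 dispute, a win nets 480 - 250 = 230 and a loss costs 480 + 250 = 730. Because the coin flip depends only on the case ID, repeated runs on the same case always produce the same ruling.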

LLM Integration

The agent supports 5 LLM providers through OpenAI-compatible clients:

Provider	Model	Base URL
OpenRouter	openai/gpt-oss-120b	openrouter.ai/api/v1
Google Gemini	gemini-2.5-flash	generativelanguage.googleapis.com/v1beta/openai/
Groq	llama-3.3-70b-versatile	api.groq.com/openai/v1
OpenAI	gpt-4.1-mini	api.openai.com/v1
Anthropic	claude-sonnet-4	(compatible gateway)

Fallback Chain

Primary (configured in .env) → OpenRouter → Google Gemini → Groq → Heuristic

If the primary provider fails (timeout, rate limit, connection error), the agent automatically tries the next provider in the chain. If all providers fail, it falls back to _heuristic_pick().
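
The chain amounts to "try each provider in order, then fall back to the heuristic". The sketch below models each provider as a zero-argument callable that returns a candidate index; the function names, the callable-based interface, and the broad `except Exception` are assumptions for illustration. The `strict` flag mirrors the STRICT_LLM_MODE setting described in this document.

```python
from typing import Callable, Sequence


def pick_with_fallback(
    providers: Sequence[tuple[str, Callable[[], int]]],
    heuristic_pick: Callable[[], int],
    strict: bool = False,
) -> tuple[str, int]:
    """Return (source, candidate_index) from the first provider that succeeds."""
    for name, call in providers:
        try:
            return name, call()
        except Exception:
            # Timeout, rate limit, or connection error: try the next provider.
            # Real code would likely catch narrower exception types.
            continue
    if strict:
        raise RuntimeError("all providers failed and strict mode is enabled")
    return "heuristic", heuristic_pick()


def simulated_timeout() -> int:
    """Stand-in for a provider call that fails."""
    raise TimeoutError("simulated provider timeout")


# The primary provider fails, the second one answers with candidate 2.
source, index = pick_with_fallback(
    [("openrouter", simulated_timeout), ("gemini", lambda: 2)],
    heuristic_pick=lambda: 0,
)
```

Returning the source name alongside the index makes it easy to log which tier actually made each decision.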

What the LLM Sees

{
  "queue_summary": "2 open cases, 8 steps remaining",
  "visible_case": "CB-G1, fraud_cnp, $480, deadline in 3 steps",
  "candidates": [
    {"index": 0, "action": "submit_representment", "summary": "Submit the contest package"},
    {"index": 1, "action": "query_system orders", "summary": "Query orders for more evidence"},
    {"index": 2, "action": "select_case CB-G2", "summary": "Switch to case with deadline in 1 step"}
  ]
}

What the LLM Returns

{
  "candidate_index": 0,
  "rationale": "CB-G1 has sufficient evidence and contesting before deadline takes priority over gathering more evidence."
}
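
Model output should be validated before it is trusted. The sketch below parses a response of the shape shown above and rejects anything that is not a valid index into the candidate list; the function name and the choice to return None on any problem are assumptions, with None meaning "fall back to the next provider or the heuristic".

```python
import json
from typing import Optional


def parse_choice(raw: str, n_candidates: int) -> Optional[int]:
    """Return a valid candidate index, or None if the response is unusable."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    index = payload.get("candidate_index")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(index, bool) or not isinstance(index, int):
        return None
    return index if 0 <= index < n_candidates else None
```

The range check matters as much as the JSON check: an index of 7 against three candidates is well-formed JSON but would crash a naive list lookup.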

Configuration

Env Variable	Default	Purpose
`BASELINE_PROVIDER`	openrouter	Primary LLM provider
`BASELINE_MODEL`	openai/gpt-oss-120b	Model to use
`BASELINE_REQUEST_TIMEOUT_SECONDS`	15	Per-call timeout
`PROVIDER_RATE_LIMIT_RETRIES`	2	Retry count on rate limits
`PROVIDER_RETRY_BACKOFF_SECONDS`	1.0	Backoff between retries
`MAX_PROVIDER_RESPONSE_TOKENS`	200	Max tokens for LLM response
`STRICT_LLM_MODE`	false	If true, fail instead of falling back to heuristic
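
A sketch of reading these variables with their documented defaults. The variable names and defaults come from the table above; the `BaselineConfig` class, the `load_config` function, and the rule that only the string "true" enables strict mode are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class BaselineConfig:
    provider: str
    model: str
    request_timeout_seconds: float
    rate_limit_retries: int
    retry_backoff_seconds: float
    max_response_tokens: int
    strict_llm_mode: bool


def load_config(env: Mapping[str, str] = os.environ) -> BaselineConfig:
    """Build the config from environment variables, using the table's defaults."""
    return BaselineConfig(
        provider=env.get("BASELINE_PROVIDER", "openrouter"),
        model=env.get("BASELINE_MODEL", "openai/gpt-oss-120b"),
        request_timeout_seconds=float(env.get("BASELINE_REQUEST_TIMEOUT_SECONDS", "15")),
        rate_limit_retries=int(env.get("PROVIDER_RATE_LIMIT_RETRIES", "2")),
        retry_backoff_seconds=float(env.get("PROVIDER_RETRY_BACKOFF_SECONDS", "1.0")),
        max_response_tokens=int(env.get("MAX_PROVIDER_RESPONSE_TOKENS", "200")),
        # Assumption: only "true" (any case) switches strict mode on.
        strict_llm_mode=env.get("STRICT_LLM_MODE", "false").strip().lower() == "true",
    )
```

Passing the mapping in as a parameter keeps the function testable without touching the real process environment.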

Key Optimizations

1. Deterministic Strategy Inference

For reason codes where the optimal strategy never varies (goods_not_received = contest, credit_not_processed / duplicate_processing = issue_refund), the agent skips retrieve_policy entirely. This saves 1 step per case.

2. Deadline-Aware Query Limiting

When the remaining steps before deadline can't accommodate all planned queries, the agent reduces the number of systems queried:

product_not_as_described: drops from 3 systems (orders, support, shipping) to 2 (orders, support)
fraud_cnp: drops from 3 systems (risk, support, orders) to 2 (risk, support)
service_not_provided: drops from 2 systems (orders, support) to 1 (support)
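
The reduction can be sketched as a lookup with a deadline check. The full and reduced system lists come from the bullets above; the trigger condition is an assumption, since this document does not state it. The sketch reserves three steps after the queries (attach, set strategy, submit) and falls back to the reduced list when the full list no longer fits.

```python
# (full plan, reduced plan) per reason code, from the list above.
QUERY_PLANS = {
    "product_not_as_described": (["orders", "support", "shipping"], ["orders", "support"]),
    "fraud_cnp": (["risk", "support", "orders"], ["risk", "support"]),
    "service_not_provided": (["support", "orders"], ["support"]),
}


def planned_queries(
    reason_code: str, steps_until_deadline: int, reserved_steps: int = 3
) -> list[str]:
    """Pick the full query plan if it fits before the deadline, else the reduced one."""
    full, reduced = QUERY_PLANS[reason_code]
    # Assumed budget rule: queries must leave room for attach + strategy + submit.
    steps_for_queries = steps_until_deadline - reserved_steps
    return list(full) if steps_for_queries >= len(full) else list(reduced)
```

With six steps until the deadline a fraud_cnp case keeps all three queries; with five it drops the optional orders query.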

3. Near-Completion Protection

When the current case has evidence attached and is 1-2 steps from submission, the agent does NOT switch to handle another case's deadline. Finishing the current case is almost always higher-value than interrupting.

4. Harmful Evidence Cleanup

Before any submit, the agent checks for harmful evidence in the attached set. If found, it generates a remove_evidence action immediately. This prevents the -0.25 per-piece evidence_quality penalty and the packet_validity = 0.0 failure.

5. Rubric-Aware Note Generation

Representment notes are generated with direct references to policy requirement keywords and evidence IDs, maximizing the note_quality score (policy claims coverage 50% + evidence coherence 15%).

6. Adversarial Evidence (Hard/Nightmare)

At hard and nightmare difficulty, the case generator injects adversarial evidence — items whose titles sound helpful ("Delivery verification report", "Account verification summary") but whose content is harmful (GPS discrepancies, prior non-receipt claims, failed 3D Secure challenges). This tests whether the agent reads beyond titles and inspects evidence content before attaching.

7. Nightmare Difficulty

Nightmare tasks push the step budget to its limit: 5-6 cases with ~2.4 steps per case. The agent must triage aggressively — fast-conceding weak cases, handling deterministic codes first, and accepting that some cases will go unresolved. This tier specifically tests prioritisation under extreme resource pressure.

File Map

File	Purpose	Lines
`runners/baseline_runner.py`	The agent: decision pipeline, candidate generation, LLM integration, representment notes	~1100
`server/chargeback_ops_environment.py`	The environment: step/reset/state, action execution, reward computation	~500
`evaluation/rubrics.py`	OpenEnv `Rubric` subclasses for all 8 scoring dimensions, composed via `WeightedSum` + `Gate(CaseAbandonedRubric)`	~400
`scenarios/issuer_model.py`	Scripted `IssuerAgent`: evidence-strength scoring, threshold bands, optional LLM softening	~250
`scenarios/arbitration.py`	Deterministic network arbitration resolver with $250 per-side fee	~120
`evaluation/grading.py`	Legacy `score_case` / `grade_episode` adapter that delegates to the rubric tree	~120
`scenarios/simulation.py`	Task definitions, case progress tracking, evidence metadata	~600
`core/models.py`	Pydantic models for actions, observations, state, grading	~600
`runners/inference.py`	OpenEnv-compatible inference entry point with provider fallback	~200
`inference.py`	Root re-export for submission contract	~10
`scenarios/case_generator.py`	Parametric task generator with seeded RNG	~700
`scenarios/iso_adapter.py`	Converts ISO 20022 CASR.003 records to environment cases	~160
`connectors/stripe_sandbox.py`	Maps Stripe test-mode disputes to environment cases	~280
`evaluation/agent_brutal_audit.py`	126-episode evaluation across all data sources	~300
`server/app.py`	FastAPI routes: /reset, /step, /state, /tasks, /baseline, /grader, /results, /demo	~200
`server/demo_ui.py`	Gradio live demo UI with step-by-step episode playback	~150
`core/episode_store.py`	Thread-safe storage with JSONL file persistence	~60
`core/client.py`	OpenEnv WebSocket client	~100

Performance

Tested across the 11-task headline benchmark (4 showcase + 7 seeded holdout) and a 28-task multi-seed grid:

Policy	Headline (11)	Multi-seed (28)	Delta vs naive
naive (empty packet)	0.000	0.000	—
concede_all	0.567	0.563	+0.567
escalate_all	0.773	0.765	+0.773
heuristic	0.773	0.765	+0.773

The difficulty curve runs 0.97 → 0.88 → 0.70 → 0.51 across easy / medium / hard / nightmare on the multi-seed grid — monotone and well-separated. The Gate(CaseAbandonedRubric) wrapper hard-zeros abandoned cases, and EscalationROIRubric (20%) penalises both conceding positive-EV contestable cases and escalating negative-EV ones — together they kill the concede-everything shortcut. escalate_all ties heuristic at the headline because the merchant's round-1 packet is strong enough on most tasks that the pre-arb branch never fires. See docs/RESULTS.md for full per-task numbers, the rubric tree, and reproduction commands.