ChargebackOps Agent: Complete Technical Reference
This document explains every aspect of the ChargebackOps agent -- what problem it solves, how it thinks, how it is scored, and why every design decision was made.
Table of Contents
- The Problem
- The Use Case
- How the Environment Works
- How the Agent Works
- The Three-Tier Decision Pipeline
- Reason Code Strategies
- Multi-Case Triage
- Evidence Handling
- Representment Notes
- The Grading System
- LLM Integration
- Key Optimizations
- File Map
The Problem
When a customer disputes a credit card charge, the card network (Visa, Mastercard) initiates a chargeback against the merchant. The merchant loses the transaction amount immediately. To recover the funds, the merchant must build a representment package -- a bundle of evidence proving the charge was legitimate -- and submit it before a hard deadline.
This is not a simple yes/no decision. Each dispute has:
- A reason code (why the customer disputes: fraud, goods not received, product not as described, etc.)
- A deadline (fixed number of steps before the case auto-closes against the merchant)
- Evidence scattered across 6 internal merchant systems (orders, payment, shipping, support, refunds, risk)
- Some evidence is helpful, some is required, and some is harmful (weakens the case if included)
- A correct strategy that depends on the evidence available (contest, accept the chargeback, or issue a refund)
A human dispute analyst handles 50-200 cases per day. They must triage by urgency, query the right systems, avoid attaching damaging evidence, and submit within deadline. Mistakes are expensive: a lost $500 chargeback plus a $25 network fee per case.
ChargebackOps turns this into a measurable agent benchmark. The agent must do exactly what a human analyst does, but programmatically, under step-budget constraints, with deterministic scoring.
The Use Case
ChargebackOps is built for the OpenEnv evaluation framework. It is a simulated merchant dispute resolution environment where an AI agent acts as the dispute analyst.
What the agent receives:
- A queue of 1-6 open dispute cases (5-6 at nightmare difficulty)
- A step budget (10-20 actions total, ~2.4 steps/case at nightmare)
- Per-case deadlines (must resolve before step N)
What the agent must do:
- Select and focus on one case at a time
- Query internal merchant systems to retrieve evidence
- Decide whether to contest, accept, or refund each case
- Attach the right evidence (and avoid harmful artifacts)
- Write a representment note explaining why the dispute should be reversed
- Submit or resolve each case before its deadline
- Manage step budget across all cases when there are more cases than steps
What the agent is scored on:
- Did it choose the correct strategy? (20% of score)
- Did it gather the right evidence? (15%)
- Is the evidence packet complete and clean? (10%)
- Did it meet the deadline? (10%)
- Was it efficient (no wasted steps)? (10%)
- Did the resolution match the strategy? (10%)
- Is the representment note well-written? (5%)
- Was escalation EV-rational? (20%; escalate iff P(win) · amount > $250 fee)
After the merchant submits a representment, a scripted IssuerAgent reviews the packet and returns one of three decisions: accept, request_more_evidence (triggering pre-arbitration with compelling evidence), or escalate_to_arbitration. The merchant can also choose to escalate at round 2 instead of rebuilding the packet, or accept an arbitration loss to cap fees. Network arbitration is a deterministic resolver: the loser eats the dispute amount plus a $250 fee, the winner is reimbursed minus their own $250 fee.
How the Environment Works
The environment follows the OpenEnv reset() / step() / state() contract.
Lifecycle
reset(task_id) → Observation
step(action) → Observation
state() → State (includes grader report when done)
Observation
Each observation contains:
| Field | Type | Description |
|---|---|---|
| queue | list | All cases with status, reason_code, amount, steps_until_deadline |
| visible_case | object or null | The currently selected case with full detail |
| steps_remaining | int | Steps left before episode ends |
| done | bool | Whether the episode is complete |
| reward | float | Immediate reward from the last action |
| result | string | Human-readable outcome of the last action |
The Visible Case
When a case is selected, visible_case exposes:
| Field | Description |
|---|---|
| case_id | Unique identifier |
| reason_code | Why the customer disputed (e.g., goods_not_received) |
| amount | Transaction amount in dollars |
| current_strategy | Currently set strategy (null if not set) |
| policy | Policy guidance (null until retrieve_policy is called) |
| systems_revealed | Which merchant systems have been queried |
| retrieved_evidence | Evidence items revealed by queries |
| attached_evidence | Evidence currently attached to the representment package |
| inspection_notes | Analyst notes (null until inspect_case is called) |
Action Space (12 Actions)
Round 1: Representment
| Action | Arguments | Cost | What It Does |
|---|---|---|---|
| select_case | case_id | 1 step | Focus on a case from the queue |
| inspect_case | case_id | 1 step | Reveal analyst inspection notes (+0.04 reward) |
| query_system | case_id, system_name | 1 step | Pull evidence from orders/payment/shipping/support/refunds/risk |
| retrieve_policy | case_id | 1 step | Get reason-code-specific guidance and required evidence list |
| add_evidence | case_id, evidence_ids | 1 step | Attach evidence to the representment package |
| remove_evidence | case_id, evidence_ids | 1 step | Remove evidence (useful for cleaning harmful attachments) |
| set_strategy | case_id, strategy | 1 step | Choose contest / accept_chargeback / issue_refund |
| submit_representment | case_id, note | 1 step | Submit the contest package (requires strategy = contest) |
| resolve_case | case_id, strategy | 1 step | Close a non-contest case (accept or refund) |
Round 2/3: Pre-Arbitration & Arbitration
| Action | Arguments | Cost | What It Does |
|---|---|---|---|
| respond_to_pre_arb | case_id, compelling_evidence_ids | 1 step | Attach compelling evidence and resubmit at round 2 (Issuer accept threshold drops to 0.60) |
| escalate_to_arbitration | case_id | 1 step | Skip rebuilding the packet, pay $250 fee, push to network arbitration |
| accept_arbitration_loss | case_id | 1 step | Concede at round 2/3 to cap fees |
Every action costs exactly 1 step. There is no free action. The agent must be deliberate about every step it takes.
Reward Signals
The environment returns immediate rewards after each action:
| Event | Reward |
|---|---|
| Select an open case | +0.02 |
| Inspect a case (first time) | +0.04 |
| Query a new system with helpful evidence | +0.06 to +0.08 |
| Query a new system with no useful evidence | -0.01 to +0.01 |
| Query an already-queried system (duplicate) | -0.03 |
| Attach helpful evidence | +0.08 per piece |
| Attach harmful evidence | -0.08 per piece |
| Attach neutral evidence | +0.01 |
| Remove harmful evidence | +0.05 |
| Remove helpful evidence | -0.03 |
| Set optimal strategy | +0.10 |
| Set acceptable strategy | +0.03 |
| Set wrong strategy | -0.08 |
| Submit a strong representment on time | +0.20 |
| Submit after deadline | -0.20 |
| Submit with missing required evidence | -0.18 |
| Submit with harmful evidence attached | -0.15 |
| Contest a case that shouldn't be contested | -0.12 |
| Resolve with optimal strategy | +0.16 |
| Resolve with acceptable strategy | +0.06 |
| Resolve with wrong strategy | -0.12 |
| Invalid action | -0.12 |
These rewards are shaping signals. The final score comes from the deterministic grader, not reward accumulation.
How the Agent Works
The agent is implemented in runners/baseline_runner.py. It is a heuristic-first, LLM-augmented policy. The heuristic handles ~90% of decisions deterministically. The LLM is only called when the heuristic encounters genuine ambiguity (multiple meaningfully different candidates).
Why Heuristic-First?
- Reliability: Heuristic decisions never fail, never time out, never cost money.
- Speed: No network round-trip for obvious moves.
- Determinism: Same input always produces same output (important for reproducibility).
- Budget: LLM calls cost tokens and have rate limits. The agent makes 6-20 decisions per episode.
The LLM acts as a tiebreaker when the heuristic produces multiple viable candidates that differ in action type. This hybrid approach gets the best of both worlds.
The Three-Tier Decision Pipeline
Every step, the agent runs this pipeline:
Tier 1: candidate_actions(observation)
Reads the current observation and generates a list of CandidateAction objects -- the legal moves the agent considers. This is the core intelligence of the agent.
The function applies these checks in strict priority order:
1. No case selected? Generate select_case candidates sorted by triage priority.
2. Current case resolved? Switch to an open case.
3. Harmful evidence attached? Immediately generate remove_evidence and return. This fires ahead of every later check because a single harmful item zeroes the packet_validity score (10% of total) and subtracts 0.25 from evidence_quality (15% of total).
4. Deadline <= 1 step? Emergency submit or resolve. No time for anything else.
5. Budget too tight to contest? If there aren't enough steps to run the full contest path (minimum 5 steps: policy + query + attach + strategy + submit), or if this is the lowest-value case in a multi-case triage, fast-concede with issue_refund.
6. Budget pressure (steps <= cases * 2)? If the inferred strategy is accept/refund, resolve immediately.
7. Reason code handler: Dispatches to a reason-code-specific handler that generates the appropriate sequence of queries, evidence attachments, strategy setting, and submission.
Tier 2: _obvious_next_action(observation, candidates)
Before calling any LLM, checks if the choice is trivial:
- Only 1 candidate? Take it.
- All candidates have the same action type? Take the first.
- One candidate targets a case with much tighter deadline? Take it.
If obvious, the LLM is skipped entirely.
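The three Tier 2 checks can be sketched as a small pure function. This is an illustrative sketch only: the `CandidateAction` fields shown here and the `deadline_gap` threshold for "much tighter" are assumptions, not the actual definitions in the runner.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CandidateAction:
    # Hypothetical shape; the real CandidateAction may carry more fields.
    action_type: str
    case_id: str
    steps_until_deadline: int


def obvious_next_action(
    candidates: Sequence[CandidateAction],
    deadline_gap: int = 3,  # assumed meaning of "much tighter deadline"
) -> Optional[CandidateAction]:
    """Return the trivial choice, or None when the pick is genuinely ambiguous."""
    if not candidates:
        return None
    # Check 1: a single candidate needs no tiebreak.
    if len(candidates) == 1:
        return candidates[0]
    # Check 2: all candidates share one action type, so take the first.
    if len({c.action_type for c in candidates}) == 1:
        return candidates[0]
    # Check 3: one candidate's deadline is far tighter than the runner-up's.
    ranked = sorted(candidates, key=lambda c: c.steps_until_deadline)
    if ranked[1].steps_until_deadline - ranked[0].steps_until_deadline >= deadline_gap:
        return ranked[0]
    return None  # ambiguous: hand off to Tier 3
```

Returning None rather than raising keeps the hand-off to Tier 3 explicit: the caller either asks the LLM or falls back to the first candidate.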
Tier 3: LLM or _heuristic_pick(candidates)
When Tier 2 returns None (genuine ambiguity):
- With LLM: Sends the observation summary and candidate list as a JSON prompt. The model returns {"candidate_index": N, "rationale": "..."}. On failure, walks the fallback chain (OpenRouter -> Gemini -> Groq).
- Without LLM: _heuristic_pick() returns the first candidate (the heuristic already sorted by priority).
Reason Code Strategies
The agent handles 6 reason code families, each with a different workflow:
goods_not_received (Deterministic: contest)
The customer claims they never received the product. The merchant almost always has delivery proof.
Steps: select -> query orders -> query shipping -> attach delivery evidence -> set_strategy contest -> submit representment
Systems queried: orders, shipping
Typical evidence: Order confirmation, delivery scan, tracking number
Strategy: Always contest (delivery proof is definitive)
fraud_cnp (Non-deterministic: contest or accept)
Card-not-present fraud. The customer claims they didn't authorize the transaction. This is the most nuanced reason code -- sometimes the merchant has strong evidence (prior good orders, account match), sometimes they don't.
Steps: select -> retrieve_policy -> query risk + support (+ orders if budget allows) -> attach non-harmful evidence -> set_strategy per policy -> submit or resolve
Systems queried: risk, support, orders (optional under tight budget)
Typical evidence: Risk assessment, prior order linkage, account verification
Harmful evidence: AVS mismatch, CVV mismatch (proves the card data didn't fully match)
Strategy: Contest if strong evidence exists, accept_chargeback if evidence is weak
credit_not_processed (Deterministic: issue_refund)
The customer claims a refund was promised but never issued. The correct response is to issue the refund.
Steps: select -> set_strategy issue_refund -> resolve_case (3 steps total)
Strategy: Always issue_refund (cheapest to resolve, no contest needed)
duplicate_processing (Deterministic: issue_refund)
The customer was charged twice. The correct response is to refund the duplicate.
Steps: select -> set_strategy issue_refund -> resolve_case (3 steps total)
Strategy: Always issue_refund
product_not_as_described (Non-deterministic: contest or accept)
The customer claims the product didn't match its description. Success depends on whether the merchant has listing accuracy proof and whether the customer bypassed the return process.
Steps: select -> retrieve_policy -> query orders + support (+ shipping if deadline allows) -> attach listing/return evidence -> contest or accept per guidance
Systems queried: orders, support, shipping (optional)
Strategy: Contest if listing proof is strong, accept_chargeback if not supportable
service_not_provided (Non-deterministic: contest or accept)
The customer claims a service was never delivered. Success depends on completion records and customer acknowledgment.
Steps: select -> retrieve_policy -> query support (+ orders if deadline allows) -> attach completion evidence -> contest or accept per guidance
Systems queried: support, orders (optional)
Strategy: Contest if service completion proof exists, accept_chargeback otherwise
Multi-Case Triage
When the agent has multiple open cases and the total estimated step cost exceeds the budget, it uses a triage algorithm:
Step Cost Estimates
| Reason Code | Est. Steps | Notes |
|---|---|---|
| goods_not_received | 6 | select + 2 queries + attach + strategy + submit |
| credit_not_processed | 3 | select + strategy + resolve |
| duplicate_processing | 3 | select + strategy + resolve |
| fraud_cnp | 8 | select + policy + 2-3 queries + attach + strategy + submit |
| product_not_as_described | 8 | select + policy + 2-3 queries + attach + strategy + submit |
| service_not_provided | 7 | select + policy + 2 queries + attach + strategy + submit |
Triage Algorithm
1. If total_estimated_cost > steps_remaining:
Sort cases: deterministic-strategy codes first, then by amount descending.
This ensures cheap, guaranteed-outcome cases are handled first,
and the highest-value non-deterministic cases get remaining budget.
2. When processing each case, check:
   - Is steps_remaining < 5? → Fast-concede (can't even minimally contest).
   - Is this the lowest-value case and total_cost > budget? → Fast-concede.
   - Otherwise → Full contest or policy-guided resolution.
3. Never interrupt a near-complete case:
- If the current case has evidence attached and is 1-2 steps from
submission, finish it before switching to another case's deadline.
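Steps 1 and 2 of the algorithm can be sketched as follows. The `QueueCase` type and both function names are hypothetical; the step estimates come from the table above, and the sort key follows the wording "deterministic-strategy codes first, then by amount descending".

```python
from dataclasses import dataclass
from typing import List, Sequence

# Step estimates copied from the Step Cost Estimates table.
EST_STEPS = {
    "goods_not_received": 6,
    "credit_not_processed": 3,
    "duplicate_processing": 3,
    "fraud_cnp": 8,
    "product_not_as_described": 8,
    "service_not_provided": 7,
}
DETERMINISTIC_CODES = frozenset(
    {"goods_not_received", "credit_not_processed", "duplicate_processing"}
)
MIN_CONTEST_STEPS = 5  # policy + query + attach + strategy + submit


@dataclass(frozen=True)
class QueueCase:
    # Hypothetical minimal view of a queue entry.
    case_id: str
    reason_code: str
    amount: float


def total_cost(cases: Sequence[QueueCase]) -> int:
    return sum(EST_STEPS[c.reason_code] for c in cases)


def triage_order(cases: Sequence[QueueCase], steps_remaining: int) -> List[QueueCase]:
    """Reorder only when the queue cannot be fully worked within budget."""
    if total_cost(cases) <= steps_remaining:
        return list(cases)
    # False sorts before True, so deterministic codes come first;
    # negating the amount gives descending order within each group.
    return sorted(
        cases, key=lambda c: (c.reason_code not in DETERMINISTIC_CODES, -c.amount)
    )


def should_fast_concede(
    case: QueueCase, cases: Sequence[QueueCase], steps_remaining: int
) -> bool:
    """Concede when a minimal contest cannot fit, or when this is the
    lowest-value case in an over-budget queue."""
    if steps_remaining < MIN_CONTEST_STEPS:
        return True
    lowest = min(cases, key=lambda c: c.amount)
    return total_cost(cases) > steps_remaining and case == lowest
```

The sketch leaves out step 3 (near-completion protection), which needs per-case progress state that is not modelled here.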
Why This Ordering Works
- credit_not_processed/duplicate_processing cost 3 steps and always get optimal score. Handle them first to free budget.
- goods_not_received costs 6 steps and always contests. Handle next.
- fraud_cnp/product_not_as_described/service_not_provided cost 7-8 steps and may need to concede. Handle last -- if budget runs out, conceding these with issue_refund (an acceptable fallback) still earns 35% strategy correctness.
Evidence Handling
Harmful Evidence Detection
The agent maintains a set of 15 negative-signal keywords derived from real chargeback dispute patterns:
mismatch, failed, declined, suspicious, flagged, fraud risk,
unauthorized, rejected, invalid, expired, violation,
non-compliant, discrepancy, inconsistent, unverified
Every evidence item's title and summary are scanned. If any harmful keyword is found, the evidence is:
- Never attached (ranked 999 in the priority sort, excluded from add_evidence calls)
- Removed if already attached (a remove_evidence action is generated immediately before any other action)
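A minimal sketch of the scan, assuming case-insensitive substring matching over the title and summary (the exact matching rule is not stated above, and the function name is hypothetical):

```python
# The 15 negative-signal keywords listed above.
HARMFUL_KEYWORDS = frozenset({
    "mismatch", "failed", "declined", "suspicious", "flagged", "fraud risk",
    "unauthorized", "rejected", "invalid", "expired", "violation",
    "non-compliant", "discrepancy", "inconsistent", "unverified",
})


def is_harmful(title: str, summary: str) -> bool:
    """True if any harmful keyword appears in the title or the summary."""
    text = f"{title} {summary}".lower()
    return any(keyword in text for keyword in HARMFUL_KEYWORDS)
```

Scanning the summary as well as the title matters for items whose titles sound helpful but whose content is not, for example a "Delivery verification report" whose summary mentions a GPS discrepancy.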
Evidence Priority Ranking
Non-harmful evidence is ranked by keyword relevance:
| Rank | Keywords | Example |
|---|---|---|
| 0 (highest) | signature, completion, booking, listing | "Delivery signature scan" |
| 1 | duplicate, delivery, prior, account, authenticated | "Prior good order linkage" |
| 2 | return policy, refund, cancel, confirmation, cancellation | "Return policy documentation" |
| 4 (default) | anything else | "Internal memo" |
| 999 (excluded) | mismatch, failed, declined, suspicious, flagged, fraud risk, unauthorized, rejected, invalid, expired, violation, non-compliant, discrepancy, inconsistent, unverified | "AVS mismatch report" |
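The ranking table can be read as a first-match lookup. The sketch below assumes substring matching and that harmful keywords are checked before any positive keyword; `evidence_rank` and `attachable` are hypothetical names.

```python
from typing import List, Tuple

HARMFUL = (
    "mismatch", "failed", "declined", "suspicious", "flagged", "fraud risk",
    "unauthorized", "rejected", "invalid", "expired", "violation",
    "non-compliant", "discrepancy", "inconsistent", "unverified",
)
RANKED_KEYWORDS = (
    (0, ("signature", "completion", "booking", "listing")),
    (1, ("duplicate", "delivery", "prior", "account", "authenticated")),
    (2, ("return policy", "refund", "cancel", "confirmation", "cancellation")),
)
DEFAULT_RANK = 4
EXCLUDED_RANK = 999


def evidence_rank(title: str, summary: str = "") -> int:
    """Lower is better; 999 means the item must never be attached."""
    text = f"{title} {summary}".lower()
    if any(word in text for word in HARMFUL):
        return EXCLUDED_RANK
    for rank, keywords in RANKED_KEYWORDS:
        if any(word in text for word in keywords):
            return rank
    return DEFAULT_RANK


def attachable(items: List[Tuple[str, str]]) -> List[str]:
    """Return (evidence_id, title) items as IDs, best rank first, harmful dropped."""
    ranked = [(evidence_rank(title), eid) for eid, title in items]
    return [eid for rank, eid in sorted(ranked) if rank != EXCLUDED_RANK]
```

The example titles from the table map to the expected ranks under these assumptions: "Delivery signature scan" ranks 0 because "signature" is matched before "delivery".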
Attachment Strategy
The agent attaches all non-harmful retrieved evidence in a single add_evidence call. This maximizes the evidence_quality score, which rewards helpful_attached / total_helpful.
Representment Notes
When the agent submits a contest, it generates a representment note. The grader scores notes on 4 dimensions:
| Dimension | Weight | What Earns Points |
|---|---|---|
| Substance | 20% | Note has >= 5 words |
| Policy claims coverage | 50% | Note mentions keywords from case.policy_requirements (e.g., "order confirmation", "carrier delivery") |
| Evidence coherence | 15% | Note references attached evidence IDs (e.g., "E1-ORDER-CONF") |
| Harmful mention penalty | -15% each | Note contains words like "mismatch", "failed", "declined" |
How the Agent Builds Notes
Start with a reason-code-specific template that uses policy requirement language:
- goods_not_received: "Order confirmation and carrier delivery confirmation establish fulfillment..."
- fraud_cnp: "Prior good order linkage and customer account confirmation..." (never mentions "mismatch")
- product_not_as_described: "Product listing verification confirms..."
- service_not_provided: "Service completion record and customer acknowledgment..."
If policy was retrieved, append the policy requirements directly:
- "Evidence covers: order confirmation, carrier delivery confirmation."
Append evidence IDs for coherence scoring:
- "Supporting evidence: E1-ORDER-CONF, E1-DELIVERY-SCAN."
Truncate to 500 characters.
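The assembly steps above can be sketched as a single function. The template text is passed in by the caller because the full templates are not reproduced in this document; `build_note` is a hypothetical name.

```python
from typing import Sequence

MAX_NOTE_CHARS = 500


def build_note(
    template: str,
    policy_requirements: Sequence[str],
    evidence_ids: Sequence[str],
) -> str:
    """Template, then policy requirements, then evidence IDs, cut to 500 chars."""
    parts = [template.strip()]
    if policy_requirements:  # only available once retrieve_policy was called
        parts.append("Evidence covers: " + ", ".join(policy_requirements) + ".")
    if evidence_ids:
        parts.append("Supporting evidence: " + ", ".join(evidence_ids) + ".")
    return " ".join(parts)[:MAX_NOTE_CHARS]


note = build_note(
    "Delivery was confirmed.",  # placeholder, not the agent's real template
    ["order confirmation", "carrier delivery confirmation"],
    ["E1-ORDER-CONF", "E1-DELIVERY-SCAN"],
)
```

A plain character slice can cut the note mid-word or drop trailing evidence IDs, so putting the policy language ahead of the IDs protects the higher-weighted dimension.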
The Grading System
After all cases are resolved (or the step budget is exhausted), the grader scores each case across 8 dimensions. Each dimension is an OpenEnv Rubric subclass defined in evaluation/rubrics.py; they compose into a per-case WeightedSum (wrapped in a Gate(CaseAbandonedRubric) deadline guard) and an episode-level ChargebackOpsEpisodeRubric that is wired into env.rubric. evaluation/grading.py keeps the legacy score_case / grade_episode API as a thin adapter over the rubric tree.
Strategy Correctness (20%)
| Outcome | Score |
|---|---|
| Chose the optimal strategy | 1.0 |
| Chose an acceptable fallback | 0.35 |
| Chose the wrong strategy | 0.0 |
"Optimal" and "acceptable" strategies are defined per case by the task scenario. For example, goods_not_received optimal is always "contest" with no acceptable fallback. fraud_cnp optimal might be "contest" with "accept_chargeback" as acceptable, or vice versa.
Evidence Quality (15%)
For contest cases:
quality = 0.7 * (required_attached / required_total)
+ 0.3 * (helpful_attached / helpful_total)
- 0.25 * harmful_attached_count
For non-contest cases where optimal strategy is also non-contest:
- 1.0 if no evidence was attached (clean concession)
- 0.7 if evidence was attached (unnecessary work)
For non-contest cases where optimal was contest:
- 0.15 (the agent abandoned evidence gathering for a contestable case)
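The rules above can be expressed as two small functions. Clamping the contest score to [0, 1] and treating an empty required or helpful set as fully satisfied are assumptions of this sketch; the function names are hypothetical.

```python
def contest_evidence_quality(
    required_attached: int,
    required_total: int,
    helpful_attached: int,
    helpful_total: int,
    harmful_attached: int,
) -> float:
    """0.7 * required coverage + 0.3 * helpful coverage - 0.25 per harmful item."""
    # Assumption: an empty denominator counts as full coverage.
    required = required_attached / required_total if required_total else 1.0
    helpful = helpful_attached / helpful_total if helpful_total else 1.0
    raw = 0.7 * required + 0.3 * helpful - 0.25 * harmful_attached
    return max(0.0, min(1.0, raw))  # assumption: score is clamped to [0, 1]


def concession_evidence_quality(optimal_is_contest: bool, evidence_attached: bool) -> float:
    """Score for a case the agent resolved without contesting."""
    if optimal_is_contest:
        return 0.15  # abandoned a contestable case
    return 0.7 if evidence_attached else 1.0
```

For example, a packet with every required and helpful item plus one harmful item scores 0.75, which shows how a single bad attachment costs more than missing most of the helpful evidence.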
Packet Validity (10%)
Binary, all-or-nothing:
- 1.0 if ALL required evidence is attached AND zero harmful evidence is attached
- 0.0 otherwise
This is the strictest dimension. Missing one required piece or having one harmful piece zeroes it out.
Deadline Compliance (10%)
Binary:
- 1.0 if the case was resolved at or before the deadline step
- 0.0 if resolved after the deadline or never resolved
Efficiency (10%)
efficiency = 1.0 - min(0.9, (duplicate_queries + invalid_actions) * 0.1 + submit_attempts * 0.05)
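The agent loses 0.1 per duplicate system query or invalid action, and 0.05 per submit attempt. Because the deduction is capped at 0.9, this base formula alone cannot fall below 0.1; the overall minimum of 0.0 is reachable only through the additional penalties below.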
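A direct transcription of the base formula, before any of the operational penalties or bonuses are applied (`base_efficiency` is a hypothetical name):

```python
def base_efficiency(duplicate_queries: int, invalid_actions: int, submit_attempts: int) -> float:
    """1.0 minus a deduction that is capped at 0.9."""
    deduction = (duplicate_queries + invalid_actions) * 0.1 + submit_attempts * 0.05
    return 1.0 - min(0.9, deduction)
```

A clean contest with one submit attempt scores 0.95 under this formula, and two duplicate queries plus one invalid action and one submit attempt score 0.65.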
Additional penalties for shallow operational behaviour:
- Over-querying a concedable case: -0.15 per system queried beyond the 2nd when the agent concedes a case whose optimal strategy is also non-contest. Querying 4+ systems before conceding is wasteful.
- Late policy retrieval: -0.08 when policy is retrieved but the case is resolved with a concession that matches the optimal non-contest strategy. The policy step was wasted.
- Early correct concession bonus: +0.10 when the agent correctly concedes a case (matching optimal) within 3 steps. Rewards recognising a bad case quickly.
Outcome Quality (10%)
| Outcome | Score |
|---|---|
| Final resolution matches optimal strategy | 1.0 |
| Final resolution is an acceptable fallback | 0.4 |
| Final resolution is wrong | 0.0 |
Note Quality (5%)
Only scored for contest cases with a representment note. See Representment Notes for the scoring breakdown.
Escalation ROI (20%)
Encodes the economic rule that escalating to network arbitration is rational only when
P(win) × dispute_amount > $250 fee. Conceding a positive-EV contestable case (where
amount > $250 and the optimal strategy is contest) is penalised. Escalating a
negative-EV case (low P(win) or low amount) is also penalised. This is the dimension that
keeps concede_all from being a free 0.6+ score.
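The economic rule reduces to a one-line comparison. The sketch below shows only the decision rule, not how EscalationROIRubric turns it into a score; the function name is hypothetical.

```python
ARBITRATION_FEE = 250.0


def escalation_is_rational(p_win: float, amount: float, fee: float = ARBITRATION_FEE) -> bool:
    """Escalate only when the expected recovery exceeds the arbitration fee."""
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be a probability in [0, 1]")
    return p_win * amount > fee
```

With a $500 dispute, the break-even win probability is 0.5: escalating at P(win) = 0.6 is rational (expected recovery $300), while escalating at 0.4 is not ($200). Any dispute of $250 or less can never clear the bar.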
Deadline Gate
Before the WeightedSum scores anything, Gate(CaseAbandonedRubric) checks whether the case
was left unresolved past its deadline. If yes, the entire case score is hard-zeroed. This
prevents the agent from gaming the rubric by ignoring nightmare-tier cases and still
collecting partial credit on the dimensions it did touch.
Final Score Calculation
case_score = 0.20 * strategy_correctness
+ 0.15 * evidence_quality
+ 0.10 * packet_validity
+ 0.10 * deadline_compliance
+ 0.10 * efficiency
+ 0.10 * outcome_quality
+ 0.05 * note_quality
+ 0.20 * escalation_roi
case_score = 0.0 if case_abandoned else case_score # deadline gate
weighted_case_score = case_score * case_weight
episode_score = sum(weighted_case_scores) / sum(case_weights)
Case weights are determined by financial impact (amount and difficulty). The episode score normalizes to [0.0, 1.0].
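The calculation above, written out as runnable Python. This is a standalone sketch of the arithmetic, not the rubric tree itself; how case_weight is derived from amount and difficulty is not specified here, so weights are taken as inputs.

```python
from typing import Dict, Sequence, Tuple

WEIGHTS: Dict[str, float] = {
    "strategy_correctness": 0.20,
    "evidence_quality": 0.15,
    "packet_validity": 0.10,
    "deadline_compliance": 0.10,
    "efficiency": 0.10,
    "outcome_quality": 0.10,
    "note_quality": 0.05,
    "escalation_roi": 0.20,
}


def case_score(dimensions: Dict[str, float], abandoned: bool = False) -> float:
    """Weighted sum of the 8 dimensions, hard-zeroed by the deadline gate."""
    if abandoned:
        return 0.0
    return sum(weight * dimensions[name] for name, weight in WEIGHTS.items())


def episode_score(scored_cases: Sequence[Tuple[float, float]]) -> float:
    """Weight-normalized mean over (case_score, case_weight) pairs."""
    total_weight = sum(weight for _, weight in scored_cases)
    if total_weight == 0:
        return 0.0
    return sum(score * weight for score, weight in scored_cases) / total_weight
```

Because the dimension weights sum to 1.0 and the episode score is a weighted mean, the result stays in [0.0, 1.0] as long as each dimension does.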
The Issuer Agent
After every submit_representment, a scripted IssuerAgent (see scenarios/issuer_model.py)
reviews the packet and returns one of three decisions:
| Decision | Score band (round 1) | Score band (round 2) | What happens |
|---|---|---|---|
| accept | ≥ 0.70 | ≥ 0.60 | Merchant wins the dispute, case closes positive |
| request_more_evidence | 0.40 – 0.70 | < 0.60 | Round 2: merchant gets one more shot with compelling evidence |
| escalate_to_arbitration | < 0.40 | (only if merchant escalates) | Round 3: case goes to network arbitration |
The score itself comes from evidence_strength_score:
score = 0.4 (if all required evidence attached)
      + min(0.4, 0.2 × helpful_attached)
      - 0.3 × harmful_attached            # uncapped
      + 0.1 (if note has ≥ 2 policy keywords)
      + min(0.30, 0.15 × pre_arb_unique)  # round 2 only
In the round-1 ambiguity band (0.40–0.70), the deterministic fallback uses the midpoint rule:
accept at score ≥ 0.55, otherwise request_more_evidence. An optional LLM softening layer
can override this midpoint when an API key is set; with no key it falls back to the
deterministic rule so offline benchmarks stay reproducible.
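The scoring formula and the deterministic round-1 rule can be sketched as below. Parameter and function names are hypothetical, and the optional LLM softening layer is left out, so this corresponds to the no-API-key path only.

```python
def evidence_strength_score(
    all_required_attached: bool,
    helpful_attached: int,
    harmful_attached: int,
    note_policy_keywords: int,
    pre_arb_unique: int = 0,
    round_no: int = 1,
) -> float:
    """Transcription of the evidence_strength_score formula above."""
    score = 0.4 if all_required_attached else 0.0
    score += min(0.4, 0.2 * helpful_attached)
    score -= 0.3 * harmful_attached  # uncapped penalty
    if note_policy_keywords >= 2:
        score += 0.1
    if round_no == 2:  # compelling-evidence bonus applies in round 2 only
        score += min(0.30, 0.15 * pre_arb_unique)
    return score


def issuer_decision_round1(score: float) -> str:
    """Round-1 bands with the deterministic midpoint rule inside 0.40-0.70."""
    if score >= 0.70:
        return "accept"
    if score < 0.40:
        return "escalate_to_arbitration"
    return "accept" if score >= 0.55 else "request_more_evidence"
```

Note that under the midpoint rule the effective round-1 accept threshold is 0.55, so a packet with all required evidence, no helpful extras, and a keyword-rich note (score 0.5) still draws a request for more evidence.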
Arbitration
Network arbitration is a pure function (see scenarios/arbitration.py). Given the same case ID
and packet state, the ruling is always the same: it seeds a coin flip from a SHA-256 hash of
the case ID inside an ambiguity band. The bands:
| Evidence-strength score | Ruling |
|---|---|
| ≥ 0.65 | merchant_wins |
| ≤ 0.35 | issuer_wins |
| (0.35, 0.65) | seeded coin flip on sha256(case_id) |
Both sides pay a $250 fee regardless of outcome. The winner is reimbursed the dispute amount
minus their $250 fee; the loser eats the dispute amount plus the $250 fee. The
EscalationROIRubric reads the final P&L and scores whether the agent's escalate / concede
decision was EV-rational ex ante.
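A sketch of a resolver with these properties. How the real resolver turns the hash into a coin flip is not described here; using the parity of the first digest byte is an assumption made purely for illustration, as is the sign convention in `merchant_pnl`.

```python
import hashlib

ARBITRATION_FEE = 250.0


def arbitration_ruling(case_id: str, score: float) -> str:
    """Pure function of (case_id, score): same inputs, same ruling."""
    if score >= 0.65:
        return "merchant_wins"
    if score <= 0.35:
        return "issuer_wins"
    # Ambiguity band: deterministic "coin flip" derived from the case ID.
    # Assumption: parity of the first SHA-256 digest byte decides the flip.
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    return "merchant_wins" if digest[0] % 2 == 0 else "issuer_wins"


def merchant_pnl(ruling: str, amount: float, fee: float = ARBITRATION_FEE) -> float:
    """Merchant-side outcome: win recovers amount minus fee, loss costs amount plus fee."""
    if ruling == "merchant_wins":
        return amount - fee
    return -(amount + fee)
```

Hashing the case ID instead of drawing from a random generator means no seed has to be threaded through the environment, and reruns of the same episode produce the same ruling.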
LLM Integration
The agent supports 5 LLM providers through OpenAI-compatible clients:
| Provider | Model | Base URL |
|---|---|---|
| OpenRouter | openai/gpt-oss-120b | openrouter.ai/api/v1 |
| Google Gemini | gemini-2.5-flash | generativelanguage.googleapis.com/v1beta/openai/ |
| Groq | llama-3.3-70b-versatile | api.groq.com/openai/v1 |
| OpenAI | gpt-4.1-mini | api.openai.com/v1 |
| Anthropic | claude-sonnet-4 | (compatible gateway) |
Fallback Chain
Primary (configured in .env) → OpenRouter → Google Gemini → Groq → Heuristic
If the primary provider fails (timeout, rate limit, connection error), the agent automatically tries the next provider in the chain. If all providers fail, it falls back to _heuristic_pick().
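The chain can be sketched with plain callables standing in for provider clients. Everything here is illustrative: `pick_with_fallback` is a hypothetical name, each provider is modelled as a function from prompt to candidate index, and the `strict` flag mirrors the documented STRICT_LLM_MODE behaviour.

```python
from typing import Callable, Sequence, Tuple

Provider = Tuple[str, Callable[[str], int]]  # (name, prompt -> candidate index)


def pick_with_fallback(
    prompt: str,
    providers: Sequence[Provider],
    heuristic_index: int = 0,
    strict: bool = False,
) -> Tuple[str, int]:
    """Try each provider in order; fall back to the heuristic if all fail."""
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception:
            # Sketch only: real code would catch timeout, rate-limit and
            # connection errors specifically, and apply retries with backoff.
            continue
    if strict:
        raise RuntimeError("all providers failed and strict mode is enabled")
    return "heuristic", heuristic_index
```

Returning the provider name alongside the index makes it easy to log which tier actually produced each decision.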
What the LLM Sees
{
"queue_summary": "2 open cases, 8 steps remaining",
"visible_case": "CB-G1, fraud_cnp, $480, deadline in 3 steps",
"candidates": [
{"index": 0, "action": "submit_representment", "summary": "Submit the contest package"},
{"index": 1, "action": "query_system orders", "summary": "Query orders for more evidence"},
{"index": 2, "action": "select_case CB-G2", "summary": "Switch to case with deadline in 1 step"}
]
}
What the LLM Returns
{
"candidate_index": 0,
"rationale": "CB-G1 has sufficient evidence and contesting before deadline takes priority over gathering more evidence."
}
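Since the model's reply is free text, it is worth validating before use. The following is a minimal sketch of such a check, assuming the reply arrives as a raw JSON string; `parse_choice` is a hypothetical helper, and returning None is meant to signal "treat as a provider failure".

```python
import json
from typing import Optional


def parse_choice(raw: str, n_candidates: int) -> Optional[int]:
    """Return a valid candidate index, or None if the reply is unusable."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    index = payload.get("candidate_index")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(index, bool) or not isinstance(index, int):
        return None
    return index if 0 <= index < n_candidates else None
```

The range check matters because an out-of-range index would otherwise surface later as an invalid action, which the environment penalises.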
Configuration
| Env Variable | Default | Purpose |
|---|---|---|
| BASELINE_PROVIDER | openrouter | Primary LLM provider |
| BASELINE_MODEL | openai/gpt-oss-120b | Model to use |
| BASELINE_REQUEST_TIMEOUT_SECONDS | 15 | Per-call timeout |
| PROVIDER_RATE_LIMIT_RETRIES | 2 | Retry count on rate limits |
| PROVIDER_RETRY_BACKOFF_SECONDS | 1.0 | Backoff between retries |
| MAX_PROVIDER_RESPONSE_TOKENS | 200 | Max tokens for LLM response |
| STRICT_LLM_MODE | false | If true, fail instead of falling back to heuristic |
Key Optimizations
1. Deterministic Strategy Inference
For reason codes where the optimal strategy never varies (goods_not_received = contest, credit_not_processed / duplicate_processing = issue_refund), the agent skips retrieve_policy entirely. This saves 1 step per case.
2. Deadline-Aware Query Limiting
When the remaining steps before deadline can't accommodate all planned queries, the agent reduces the number of systems queried:
- product_not_as_described: drops from 3 systems (orders, support, shipping) to 2 (orders, support)
- fraud_cnp: drops from 3 systems (risk, support, orders) to 2 (risk, support)
- service_not_provided: drops from 2 systems (orders, support) to 1 (support)
3. Near-Completion Protection
When the current case has evidence attached and is 1-2 steps from submission, the agent does NOT switch to handle another case's deadline. Finishing the current case is almost always higher-value than interrupting.
4. Harmful Evidence Cleanup
Before any submit, the agent checks for harmful evidence in the attached set. If found, it generates a remove_evidence action immediately. This prevents the -0.25 per-piece evidence_quality penalty and the packet_validity = 0.0 failure.
5. Budget-Aware Note Generation
Representment notes are generated with direct references to policy requirement keywords and evidence IDs, maximizing the note_quality score (policy claims coverage 50% + evidence coherence 15%).
6. Adversarial Evidence (Hard/Nightmare)
At hard and nightmare difficulty, the case generator injects adversarial evidence: items whose titles sound helpful ("Delivery verification report", "Account verification summary") but whose content is harmful (GPS discrepancies, prior non-receipt claims, failed 3D Secure challenges). This tests whether the agent reads beyond titles and inspects evidence content before attaching.
7. Nightmare Difficulty
Nightmare tasks push the step budget to its limit: 5-6 cases with ~2.4 steps per case. The agent must triage aggressively: fast-conceding weak cases, handling deterministic codes first, and accepting that some cases will go unresolved. This tier specifically tests prioritisation under extreme resource pressure.
File Map
| File | Purpose | Lines |
|---|---|---|
| runners/baseline_runner.py | The agent: decision pipeline, candidate generation, LLM integration, representment notes | ~1100 |
| server/chargeback_ops_environment.py | The environment: step/reset/state, action execution, reward computation | ~500 |
| evaluation/rubrics.py | OpenEnv Rubric subclasses for all 8 scoring dimensions, composed via WeightedSum + Gate(CaseAbandonedRubric) | ~400 |
| scenarios/issuer_model.py | Scripted IssuerAgent: evidence-strength scoring, threshold bands, optional LLM softening | ~250 |
| scenarios/arbitration.py | Deterministic network arbitration resolver with $250 per-side fee | ~120 |
| evaluation/grading.py | Legacy score_case / grade_episode adapter that delegates to the rubric tree | ~120 |
| scenarios/simulation.py | Task definitions, case progress tracking, evidence metadata | ~600 |
| core/models.py | Pydantic models for actions, observations, state, grading | ~600 |
| runners/inference.py | OpenEnv-compatible inference entry point with provider fallback | ~200 |
| inference.py | Root re-export for submission contract | ~10 |
| scenarios/case_generator.py | Parametric task generator with seeded RNG | ~700 |
| scenarios/iso_adapter.py | Converts ISO 20022 CASR.003 records to environment cases | ~160 |
| connectors/stripe_sandbox.py | Maps Stripe test-mode disputes to environment cases | ~280 |
| evaluation/agent_brutal_audit.py | 126-episode evaluation across all data sources | ~300 |
| server/app.py | FastAPI routes: /reset, /step, /state, /tasks, /baseline, /grader, /results, /demo | ~200 |
| server/demo_ui.py | Gradio live demo UI with step-by-step episode playback | ~150 |
| core/episode_store.py | Thread-safe storage with JSONL file persistence | ~60 |
| core/client.py | OpenEnv WebSocket client | ~100 |
Performance
Tested across the 11-task headline benchmark (4 showcase + 7 seeded holdout) and a 28-task multi-seed grid:
| Policy | Headline (11) | Multi-seed (28) | Delta vs naive |
|---|---|---|---|
| naive (empty packet) | 0.000 | 0.000 | n/a |
| concede_all | 0.567 | 0.563 | +0.567 |
| escalate_all | 0.773 | 0.765 | +0.773 |
| heuristic | 0.773 | 0.765 | +0.773 |
The difficulty curve runs 0.97 → 0.88 → 0.70 → 0.51 across easy / medium / hard / nightmare on
the multi-seed grid, monotone and well-separated. The Gate(CaseAbandonedRubric) wrapper
hard-zeros abandoned cases, and EscalationROIRubric (20%) penalises both conceding positive-EV
contestable cases and escalating negative-EV ones; together they kill the concede-everything
shortcut. escalate_all ties heuristic at the headline because the merchant's round-1 packet
is strong enough on most tasks that the pre-arb branch never fires. See docs/RESULTS.md for
full per-task numbers, the rubric tree, and reproduction commands.