mitudrudutta/ChargeBackOps
mitudrudutta committed on Mar 28

Commit 83eb290 (1 parent: a6b0c55)

feat(connectors): add Stripe sandbox connector for dispute processing

- Introduced `stripe_sandbox.py` to map Stripe test-mode dispute objects into `InternalCase` and `TaskScenario`.
- Implemented functions to fetch disputes, build evidence, and infer strategies based on dispute reasons and statuses.
- Added synthetic dispute generation for testing when the Stripe API is unavailable.
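
The synthetic fallback described above might look roughly like the following. This is a minimal sketch, not the code in `stripe_sandbox.py`: `make_synthetic_dispute`, `load_disputes`, and `SYNTHETIC_REASONS` are hypothetical names, and the only assumption about shape is that a dispute is a dict with the `id`, `charge`, `amount`, `currency`, `reason`, `status`, and `metadata` keys that `dispute_to_case` reads.

```python
import random
from typing import Any, Dict, List, Optional

# Hypothetical subset of Stripe dispute reasons used for synthetic data.
SYNTHETIC_REASONS = ("fraudulent", "product_not_received", "duplicate", "credit_not_processed")


def make_synthetic_dispute(index: int, seed: int = 0) -> Dict[str, Any]:
    """Build one Stripe-shaped dispute dict without calling the Stripe API."""
    rng = random.Random(seed * 1000 + index)  # deterministic per (seed, index)
    return {
        "id": f"dp_synthetic_{index}",
        "charge": f"ch_synthetic_{index}",
        "amount": rng.randint(1_000, 50_000),  # smallest currency unit (cents for USD)
        "currency": "usd",
        "reason": rng.choice(SYNTHETIC_REASONS),
        "status": rng.choice(("needs_response", "won", "lost")),
        "metadata": {},
    }


def load_disputes(limit: int, api_key: Optional[str]) -> List[Dict[str, Any]]:
    """Return synthetic disputes when no API key is configured."""
    if not api_key:
        return [make_synthetic_dispute(i) for i in range(limit)]
    # A live fetch would go here; it is deliberately left out of this sketch.
    raise NotImplementedError("live Stripe fetch omitted from this sketch")
```

Seeding the generator per dispute keeps the synthetic queue reproducible across runs, which matters for a deterministic grader.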

fix(episode_store): limit stored reports to a maximum count

- Added a maximum report limit of 100 in the episode store to prevent excessive memory usage.
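
A bounded store of this kind can be sketched as follows. The limit of 100 comes from the commit message; the class name `BoundedReportStore` and the oldest-first eviction policy are assumptions, since the four-line change to `episode_store.py` is not shown here.

```python
from collections import OrderedDict
from threading import Lock
from typing import Any, Dict, Optional

MAX_REPORTS = 100  # limit named in the commit message


class BoundedReportStore:
    """Keeps at most `max_reports` episode reports (hypothetical sketch)."""

    def __init__(self, max_reports: int = MAX_REPORTS) -> None:
        self._max = max_reports
        self._reports = OrderedDict()
        self._lock = Lock()

    def put(self, episode_id: str, report: Dict[str, Any]) -> None:
        with self._lock:
            self._reports[episode_id] = report
            self._reports.move_to_end(episode_id)  # newest entry goes last
            while len(self._reports) > self._max:
                self._reports.popitem(last=False)  # assumed policy: evict oldest

    def get(self, episode_id: str) -> Optional[Dict[str, Any]]:
        with self._lock:
            return self._reports.get(episode_id)

    def __len__(self) -> int:
        with self._lock:
            return len(self._reports)
```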

feat(grading): enhance representment note scoring

- Added `grade_representment_note` function to evaluate the quality of representment notes based on required claims, harmful evidence, and substance.
- Updated `score_case` to incorporate representment note quality into overall case scoring.
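
To illustrate the three inputs named above (required claims, harmful evidence, substance), here is one way such a scorer could be written. It is a sketch only: `grade_note_sketch`, the 0.6/0.4 split, the 0.25 penalty, and the 12-word threshold are all assumptions and are not taken from `grading.py`.

```python
from typing import Sequence


def grade_note_sketch(
    note: str,
    required_claims: Sequence[str],
    harmful_terms: Sequence[str],
    min_words: int = 12,  # hypothetical threshold for a "substantive" note
) -> float:
    """Score a representment note in [0.0, 1.0] (illustrative weights)."""
    if not note or not note.strip():
        return 0.0
    text = note.lower()
    # Fraction of required claims that the note mentions.
    covered = sum(1 for claim in required_claims if claim.lower() in text)
    coverage = covered / len(required_claims) if required_claims else 1.0
    # Each mention of harmful evidence costs a fixed penalty.
    harmful_hits = sum(1 for term in harmful_terms if term.lower() in text)
    # Substance grows with length and saturates at min_words.
    substance = min(len(text.split()) / min_words, 1.0)
    score = 0.6 * coverage + 0.4 * substance - 0.25 * harmful_hits
    return max(0.0, min(1.0, score))
```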

feat(iso_adapter): implement ISO 20022 chargeback CSV processing

- Created `iso_adapter.py` to convert ISO 20022 chargeback CSV rows into `InternalCase` and `TaskScenario` objects.
- Mapped chargeback reasons to internal codes and built evidence based on the CSV data.
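
The row-to-case mapping can be pictured with a small sketch. The internal reason codes are the ones used elsewhere in this commit; the CSV column names (`reason`, `amount`, `currency`), the raw reason labels, and the `rows_to_cases` helper are hypothetical, because the real column layout of the ISO 20022 export is not shown in this excerpt.

```python
import csv
import io
from typing import Any, Dict, List

# Hypothetical raw labels mapped to the internal reason codes used in this commit.
REASON_TO_INTERNAL = {
    "fraud": "fraud_cnp",
    "goods not received": "goods_not_received",
    "duplicate": "duplicate_processing",
    "credit not processed": "credit_not_processed",
    "not as described": "product_not_as_described",
    "service not provided": "service_not_provided",
}


def rows_to_cases(csv_text: str) -> List[Dict[str, Any]]:
    """Convert CSV rows into plain case dicts, skipping unmapped reasons."""
    cases = []
    for index, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        reason = REASON_TO_INTERNAL.get(row["reason"].strip().lower())
        if reason is None:
            continue  # unknown reason: no internal code to map to
        cases.append({
            "case_id": f"CB-ISO{index}",
            "amount": float(row["amount"]),
            "currency": row["currency"].upper(),
            "reason_code": reason,
        })
    return cases
```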

refactor(models): remove unnecessary fields and improve validation

- Removed `recommended_strategy` from `PolicyView`.
- Added max length constraints to `case_id`, `evidence_ids`, and `note` fields in `ChargebackOpsAction`.
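
As an illustration of what the length constraints guard against, the sketch below uses a stdlib dataclass as a stand-in for the Pydantic model. The class name `ActionSketch`, the concrete limits, and the reading of the `evidence_ids` constraint as a cap on list length are assumptions; the actual values live in `models.py`.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical limits; the real values are defined in models.py.
MAX_CASE_ID_LEN = 64
MAX_EVIDENCE_IDS = 32
MAX_NOTE_LEN = 2000


@dataclass
class ActionSketch:
    """Stdlib stand-in for ChargebackOpsAction with length checks."""

    action_type: str
    case_id: Optional[str] = None
    evidence_ids: List[str] = field(default_factory=list)
    note: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject oversized inputs before they reach the environment.
        if self.case_id is not None and len(self.case_id) > MAX_CASE_ID_LEN:
            raise ValueError("case_id too long")
        if len(self.evidence_ids) > MAX_EVIDENCE_IDS:
            raise ValueError("too many evidence_ids")
        if self.note is not None and len(self.note) > MAX_NOTE_LEN:
            raise ValueError("note too long")
```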

fix(server): update representment submission to include notes

- Modified `_submit_representment` to accept an optional note parameter for better tracking of representment rationale.

chore(simulation): add resolved step tracking to case progress

- Introduced `resolved_at_step` to `CaseProgress` to track the step at which a case was resolved.
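
Recording the resolution step makes deadline checks a simple comparison. The sketch below assumes a per-case `deadline_step` (as in `InternalCase`); the class name `CaseProgressSketch` and the `resolve` and `met_deadline` helpers are hypothetical and may not match `simulation.py`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseProgressSketch:
    """Minimal progress record with a resolution timestamp in steps."""

    case_id: str
    deadline_step: int
    resolved: bool = False
    resolved_at_step: Optional[int] = None

    def resolve(self, step: int) -> None:
        if self.resolved:
            return  # keep the first resolution step
        self.resolved = True
        self.resolved_at_step = step

    def met_deadline(self) -> bool:
        # Unresolved cases never count as on time.
        return self.resolved_at_step is not None and self.resolved_at_step <= self.deadline_step
```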

Files changed (9)

README.md +191 -296
connectors/__init__.py +0 -0
connectors/stripe_sandbox.py +300 -0
episode_store.py +4 -0
grading.py +74 -13
iso_adapter.py +268 -0
models.py +4 -2
server/chargeback_ops_environment.py +9 -3
simulation.py +42 -6

README.md CHANGED

@@ -8,376 +8,271 @@ tags:
 # ChargebackOps
-ChargebackOps is a real-world OpenEnv environment for merchant-side chargeback operations. An agent acts as a dispute analyst, works a queue of payment disputes, investigates evidence across synthetic internal systems, chooses whether to contest or concede, and is graded on recovery quality, deadline handling, and operational discipline.
-The environment is designed for the Round 1 OpenEnv problem statement:
-- Real-world task, not a game or toy
-- Typed OpenEnv models and `reset()` / `step()` / `state()` support
-- Three graded tasks with easy, medium, and hard difficulty
-- Dense reward shaping with partial progress and negative signals
-- Root-level `inference.py` that uses the OpenAI client contract
-- Docker and Hugging Face Spaces deployment path
 ## Why This Environment Matters
-Merchant dispute handling is a real operations workflow. Analysts do not just classify a ticket or answer a question. They must:
-- inspect the dispute reason code and the response deadline
-- gather evidence from the right internal systems
-- avoid attaching evidence that weakens the case
-- choose whether to contest, accept, or refund
-- maximize recovery across a queue under limited time
-That makes ChargebackOps a strong benchmark for tool-using agents. It tests retrieval, decision-making, prioritization, and operational restraint in a controlled environment with deterministic scoring.
-## System Architecture
 ```mermaid
-flowchart LR
-    A["Agent or inference.py"] --> B["OpenAI-compatible client<br/>API_BASE_URL + MODEL_NAME + HF_TOKEN"]
-    A --> C["ChargebackOps HTTP API"]
-    C --> D["OpenEnv server<br/>server.app"]
-    D --> E["ChargebackOpsEnvironment<br/>step / reset / state"]
-    E --> F["Task simulator<br/>simulation.py"]
-    E --> G["Dense reward shaping<br/>server/chargeback_ops_environment.py"]
-    E --> H["Deterministic grader<br/>grading.py"]
-    H --> I["Episode report store<br/>episode_store.py"]
-    D --> J["Utility routes<br/>/tasks /grader /baseline /health"]
 ```
 ## Episode Workflow
 ```mermaid
 flowchart TD
-    A["reset(task_id)"] --> B["Select the next case from the queue"]
-    B --> C["Inspect case metadata"]
-    C --> D["Retrieve policy guidance"]
-    D --> E["Query merchant systems<br/>orders, payment, shipping, support, refunds, risk"]
-    E --> F["Attach or remove evidence"]
-    F --> G["Set strategy"]
-    G --> H{"contest?"}
-    H -->|yes| I["submit_representment"]
-    H -->|no| J["resolve_case<br/>accept_chargeback or issue_refund"]
-    I --> K{"all cases resolved or max steps reached?"}
-    J --> K
-    K -->|no| B
-    K -->|yes| L["grader computes final score 0.0 to 1.0"]
 ```
-## Environment Design
-### Internal systems
-The environment exposes evidence gradually from six synthetic merchant systems:
-- `orders`
-- `payment`
-- `shipping`
-- `support`
-- `refunds`
-- `risk`
-Each task contains hidden ground truth about:
-- optimal strategy per case
-- acceptable fallback strategies
-- required evidence
-- helpful evidence
-- harmful evidence
-- deadline pressure
-- case weight in the final score
-### OpenEnv contract
-| Method | Behavior |
-| --- | --- |
-| `reset(task_id=...)` | starts a fresh episode and returns the initial typed observation |
-| `step(action)` | applies one typed action and returns the next observation with reward and done |
-| `state()` | returns the current typed internal state |
-Core runtime files:
-- [`models.py`](/home/btwitsvoid/Documents/Agents/ChargeBackOps/models.py)
-- [`server/chargeback_ops_environment.py`](/home/btwitsvoid/Documents/Agents/ChargeBackOps/server/chargeback_ops_environment.py)
-- [`server/app.py`](/home/btwitsvoid/Documents/Agents/ChargeBackOps/server/app.py)
-- [`openenv.yaml`](/home/btwitsvoid/Documents/Agents/ChargeBackOps/openenv.yaml)
-## Typed Spaces
-### Action space
-| Action | Purpose |
-| --- | --- |
-| `select_case` | focus a case from the queue |
-| `inspect_case` | reveal analyst notes for the selected case |
-| `query_system` | pull evidence from one merchant system |
-| `retrieve_policy` | reveal reason-code guidance and required evidence |
-| `add_evidence` | attach retrieved evidence to the current package |
-| `remove_evidence` | remove evidence, including harmful attachments |
-| `set_strategy` | choose `contest`, `accept_chargeback`, or `issue_refund` |
-| `submit_representment` | submit a contest package for a contested case |
-| `resolve_case` | close a non-contest case with acceptance or refund |
-### Observation space
-Each observation includes:
-- task metadata: id, title, difficulty, objective
-- current queue with deadlines and case summaries
-- currently selected case
-- visible evidence and policy data
-- available actions
-- `steps_remaining`
-- `progress_score`
-- `last_action_result`
-- optional terminal `grader_report`
-### State space
-The environment state exposes:
-- current episode id and step count
-- public queue resolution state
-- action history
-- latest grade estimate
-- final grader report once complete
-## Task Suite
-| Task ID | Title | Difficulty | Objective |
-| --- | --- | --- | --- |
-| `goods_not_received_easy` | Delivered But Disputed | easy | contest a straightforward goods-not-received case with delivery proof |
-| `fraud_signal_ambiguity` | Fraud Signal Ambiguity | medium | handle a card-not-present fraud dispute with mixed evidence and harmful artifacts |
-| `queue_optimization_hard` | Dispute Queue Optimization | hard | maximize recovery across a multi-case queue under tight step and deadline pressure |
-Difficulty progression is deliberate:
-- Easy teaches the standard representment loop.
-- Medium introduces ambiguity and evidence curation.
-- Hard adds queue prioritization, step-budget pressure, and opportunity cost.
-## Reward Design
-ChargebackOps provides dense per-step feedback and a terminal bonus. The environment rewards progress and penalizes obviously bad operations behavior.
-Positive signals include:
-- selecting and inspecting the right case
-- retrieving policy guidance
-- querying systems that expose useful evidence
-- attaching helpful or required evidence
-- setting the optimal strategy
-- submitting a complete representment on time
-- resolving a case with the optimal non-contest strategy
-Negative signals include:
-- invalid actions
-- duplicate system queries
-- attaching harmful evidence
-- removing helpful evidence
-- weak strategy choices
-- submitting incomplete or late representments
-- missing deadlines on still-open cases
-At episode end, the environment adds a terminal bonus proportional to the deterministic grader score.
-## Grading
-Each finished episode is scored in `[0.0, 1.0]` by the deterministic grader in [`grading.py`](/home/btwitsvoid/Documents/Agents/ChargeBackOps/grading.py).
-Per-case weighting:
-| Component | Weight |
-| --- | --- |
-| strategy correctness | 0.25 |
-| evidence quality | 0.25 |
-| packet validity | 0.15 |
-| deadline compliance | 0.15 |
-| efficiency | 0.10 |
-| outcome quality | 0.10 |
-The hard task aggregates multiple case scores by case weight and normalizes the final result to `0.0` to `1.0`.
-## Inference and Model Providers
-The required root inference entry point is [`inference.py`](/home/btwitsvoid/Documents/Agents/ChargeBackOps/inference.py). It uses the OpenAI Python client with the challenge-compatible environment variables:
-- `API_BASE_URL`
-- `MODEL_NAME`
-- `HF_TOKEN`
-Default configuration:
-- provider path: OpenRouter
-- model: `openai/gpt-oss-120b`
-Also supported through the same OpenAI-compatible client pattern:
-- OpenAI
-- Anthropic-compatible gateways
-- Groq
-- OpenRouter
-The repository also keeps optional direct keys for convenience in [`.env.example`](/home/btwitsvoid/Documents/Agents/ChargeBackOps/.env.example):
-- `OPENAI_API_KEY`
-- `ANTHROPIC_API_KEY`
-- `GROQ_API_KEY`
-- `OPENROUTER_API_KEY`
-### OpenRouter referer
-Leave `OPENROUTER_HTTP_REFERER` empty during local development. Once the app is deployed, set it to the public app URL, for example:
-```bash
-OPENROUTER_HTTP_REFERER=https://your-space-name.hf.space
-OPENROUTER_APP_TITLE=ChargebackOps
-```
-## Baseline Results
-The repository includes two baseline entry points:
-- [`inference.py`](/home/btwitsvoid/Documents/Agents/ChargeBackOps/inference.py) for the challenge contract
-- [`baseline_runner.py`](/home/btwitsvoid/Documents/Agents/ChargeBackOps/baseline_runner.py) for direct local runs and the `/baseline` endpoint
-Verified local heuristic-fallback baseline scores are documented below after the latest validation pass:
-| Task | Score |
-| --- | --- |
-| Delivered But Disputed | `0.7075` |
-| Fraud Signal Ambiguity | `0.7075` |
-| Dispute Queue Optimization | `0.7271` |
-| Average | `0.7140` |
-These values are replaced after each validation run so the README reflects real, reproducible output from the current codebase.
-## API Surface
-The FastAPI app exposes:
-- `GET /` basic service ping
-- `GET /health` health check
-- `GET /docs` interactive OpenAPI docs
-- `POST /reset` start a new episode
-- `POST /step` advance the environment
-- `GET /state` inspect the current state
-- `GET /tasks` enumerate tasks and the action schema
-- `GET /grader` or `POST /grader` fetch the last completed episode grade
-- `GET /baseline` or `POST /baseline` run the bundled baseline
-## Local Setup
-### 1. Install dependencies
-Using `uv`:
 ```bash
 uv sync --extra dev
 ```
-Using `pip`:
-```bash
-python -m pip install -e ".[dev]"
-```
-### 2. Configure environment variables
 ```bash
 cp .env.example .env
 ```
-At minimum, configure:
-```bash
-API_BASE_URL=https://openrouter.ai/api/v1
-MODEL_NAME=openai/gpt-oss-120b
-HF_TOKEN=your_provider_key
-```
-### 3. Run the test and validation suite
 ```bash
 pytest -q tests
 openenv validate .
-python inference.py
 ```
-### 4. Start the server locally
 ```bash
 uvicorn server.app:app --host 0.0.0.0 --port 8000
 ```
-## Docker
-Build and run the root Docker image:
 ```bash
 docker build -t chargebackops .
 docker run --rm -p 8000:8000 --env-file .env chargebackops
 ```
-Once the container is running:
-```bash
-curl http://localhost:8000/
-curl http://localhost:8000/tasks
-curl http://localhost:8000/health
-```
-## Hugging Face Spaces Deployment
-ChargebackOps is configured as a Docker Space through the YAML frontmatter in this README.
-Recommended deployment steps:
-1. Create a new Hugging Face Space with `Docker` as the SDK.
-2. Push this repository to the Space.
-3. Add the runtime variables in Space Settings:
-   - `API_BASE_URL`
-   - `MODEL_NAME`
-   - `HF_TOKEN`
-4. If using OpenRouter, add:
-   - `OPENROUTER_HTTP_REFERER=https://your-space-name.hf.space`
-   - `OPENROUTER_APP_TITLE=ChargebackOps`
-5. Verify:
-   - `/`
-   - `/health`
-   - `/tasks`
-   - `/docs`
-   - `/baseline`
-## Validation Checklist
-- `pytest -q tests`
-- `openenv validate .`
-- `python inference.py`
-- `docker build -t chargebackops .`
-- `docker run --rm -p 8000:8000 --env-file .env chargebackops`
 ## Project Layout
-```text
 .
-├── baseline_runner.py
-├── client.py
-├── grading.py
-├── inference.py
-├── models.py
-├── openenv.yaml
 ├── server/
-│   ├── app.py
-│   └── chargeback_ops_environment.py
-├── simulation.py
-└── tests/
 ```
-## Notes
-- This is a synthetic benchmark environment, not a live payments integration.
-- The world state is deterministic by design so graders remain reproducible.
-- Live model quality still depends on the quota and reliability of the configured provider.

 # ChargebackOps
+A production-grade OpenEnv environment for merchant-side chargeback dispute operations. An AI agent acts as a dispute analyst — investigating evidence across internal systems, choosing whether to contest or concede, and maximizing financial recovery under deadline and step-budget pressure.
+Built for the [OpenEnv Hackathon](https://openenv.org/) Round 1 challenge.
 ## Why This Environment Matters
+Chargeback dispute handling is a real operations workflow that costs merchants **$125 billion annually**. Analysts must:
+- Parse reason codes and assess representment deadlines
+- Gather evidence from the right merchant systems while avoiding harmful artifacts
+- Decide whether to contest, accept, or refund — under time pressure
+- Prioritize cases in a multi-dispute queue by deadline urgency and financial impact
+This makes ChargebackOps a strong benchmark for tool-using agents. It tests retrieval, decision-making, prioritization, and operational restraint in a controlled environment with deterministic scoring.
+## Architecture
 ```mermaid
+graph TB
+    subgraph Agent Layer
+        INF[inference.py<br/>OpenAI-compatible client]
+        BL[baseline_runner.py<br/>Heuristic policy]
+    end
+    subgraph API Layer
+        APP[FastAPI server<br/>server/app.py]
+        WS[OpenEnv WebSocket<br/>client.py]
+    end
+    subgraph Environment Core
+        ENV[ChargebackOpsEnvironment<br/>step / reset / state]
+        SIM[Simulation Engine<br/>simulation.py]
+        GRD[Deterministic Grader<br/>grading.py]
+        STORE[Episode Store<br/>episode_store.py]
+    end
+    subgraph Task Sources
+        FIXED[Built-in Tasks<br/>3 handcrafted scenarios]
+        GEN[Parametric Generator<br/>case_generator.py]
+        ISO[ISO 20022 Adapter<br/>iso_adapter.py]
+        STRIPE[Stripe Connector<br/>connectors/stripe_sandbox.py]
+    end
+    subgraph Merchant Systems
+        ORD[Orders]
+        PAY[Payment]
+        SHIP[Shipping]
+        SUP[Support]
+        REF[Refunds]
+        RISK[Risk]
+    end
+    INF --> APP
+    BL --> ENV
+    APP --> ENV
+    WS --> APP
+    ENV --> SIM
+    ENV --> GRD
+    GRD --> STORE
+    SIM --> FIXED
+    SIM --> GEN
+    SIM --> ISO
+    SIM --> STRIPE
+    ENV --> ORD
+    ENV --> PAY
+    ENV --> SHIP
+    ENV --> SUP
+    ENV --> REF
+    ENV --> RISK
 ```
 ## Episode Workflow
 ```mermaid
 flowchart TD
+    A[reset&#40;task_id&#41;] --> B[Select case from queue]
+    B --> C{Reason code<br/>deterministic?}
+    C -->|Yes| D[Skip policy retrieval<br/>Infer strategy directly]
+    C -->|No| E[Retrieve policy guidance]
+    D --> F[Query merchant systems<br/>for evidence]
+    E --> F
+    F --> G[Attach relevant evidence<br/>Avoid harmful artifacts]
+    G --> H[Set strategy]
+    H --> I{Strategy?}
+    I -->|contest| J[Generate representment note<br/>Submit package]
+    I -->|accept / refund| K[Resolve case]
+    J --> L{More open cases?}
+    K --> L
+    L -->|Yes| M{Deadline urgency?}
+    M -->|Urgent| N[Switch to urgent case<br/>Fast-resolve]
+    M -->|Normal| B
+    N --> L
+    L -->|No / Max steps| O[Grader computes<br/>final score 0.0 - 1.0]
+    style A fill:#2d5016,color:#fff
+    style O fill:#1a3a5c,color:#fff
+    style N fill:#8b0000,color:#fff
 ```
+## Grading Dimensions
+```mermaid
+pie title Case Score Weights
+    "Strategy Correctness" : 25
+    "Evidence Quality" : 20
+    "Packet Validity" : 15
+    "Deadline Compliance" : 15
+    "Efficiency" : 10
+    "Outcome Quality" : 10
+    "Note Quality" : 5
+```
+Each case is scored across seven dimensions and weighted by financial impact. The episode score normalizes across all cases to `[0.0, 1.0]`.
+## Agent Performance (126 Episodes)
+Results from the heuristic agent tested across all data sources:
+| Source | Easy | Medium | Hard |
+|---|---|---|---|
+| Built-in tasks | 0.968 | 0.960 | 0.778 |
+| Parametric (20 seeds) | 0.957 | 0.844 | 0.706 |
+| ISO 20022 real data (20 each) | 0.977 | 0.812 | 0.605 |
+| Stripe live API | 0.980 | 0.887 | 0.577 |
+**Overall: 0.819 avg across 126 episodes | 43.7% score >= 0.90 | 5.6% score < 0.50**
+Heuristic vs bad-control gap: **+0.503** (threshold for "strong": 0.15)
+## Task Sources
+### Built-in Scenarios (3 tasks)
+| Task ID | Difficulty | Objective |
+|---|---|---|
+| `goods_not_received_easy` | Easy | Contest a goods-not-received case with delivery proof |
+| `fraud_signal_ambiguity` | Medium | Handle CNP fraud with mixed evidence and harmful artifacts |
+| `queue_optimization_hard` | Hard | Maximize recovery across a multi-case queue under deadline pressure |
+### Parametric Generator (`case_generator.py`)
+Generates an unlimited number of reproducible tasks from a seeded RNG across 6 reason code families. Task ID format: `generated_{difficulty}_s{seed}` (e.g., `generated_hard_s42`).
+### ISO 20022 Real Data (`iso_adapter.py`)
+Converts 300 real chargeback records from ISO 20022 CASR.003 format into environment cases. Covers fraud, goods-not-received, duplicate processing, credit-not-processed, product-not-as-described, and service-not-provided disputes.
+### Stripe Sandbox (`connectors/stripe_sandbox.py`)
+Maps Stripe test-mode dispute objects into environment cases. Supports live API access with `STRIPE_API_KEY` or falls back to synthetic Stripe-format disputes.
+## Action Space
+| Action | Purpose |
+|---|---|
+| `select_case` | Focus a case from the dispute queue |
+| `inspect_case` | Reveal analyst inspection notes |
+| `query_system` | Pull evidence from a merchant system |
+| `retrieve_policy` | Get reason-code guidance and required evidence |
+| `add_evidence` | Attach retrieved evidence to the representment package |
+| `remove_evidence` | Remove evidence (including harmful attachments) |
+| `set_strategy` | Choose `contest`, `accept_chargeback`, or `issue_refund` |
+| `submit_representment` | Submit a contest package with an optional rationale note |
+| `resolve_case` | Close a non-contest case |
+## Quick Start
+### Install
 ```bash
 uv sync --extra dev
+# or
+pip install -e ".[dev]"
 ```
+### Configure
 ```bash
 cp .env.example .env
+# Edit .env with your provider keys
 ```
+### Validate
 ```bash
 pytest -q tests
 openenv validate .
+python baseline_runner.py
+python agent_brutal_audit.py
 ```
+### Run Server
 ```bash
 uvicorn server.app:app --host 0.0.0.0 --port 8000
 ```
+### Docker
 ```bash
 docker build -t chargebackops .
 docker run --rm -p 8000:8000 --env-file .env chargebackops
 ```
+## API Endpoints
+| Method | Path | Description |
+|---|---|---|
+| `GET` | `/` | Service info |
+| `GET` | `/health` | Health check |
+| `GET` | `/docs` | Interactive OpenAPI docs |
+| `POST` | `/reset` | Start a new episode |
+| `POST` | `/step` | Apply an action |
+| `GET` | `/state` | Current environment state |
+| `GET` | `/tasks` | List available tasks |
+| `GET` | `/generate` | Generate parametric tasks |
+| `GET/POST` | `/grader` | Fetch latest episode grade |
+| `GET/POST` | `/baseline` | Run the heuristic baseline |
+## Inference Contract
+The required entry point [`inference.py`](inference.py) uses the OpenAI-compatible client with:
+```bash
+API_BASE_URL=https://openrouter.ai/api/v1
+MODEL_NAME=openai/gpt-oss-120b
+HF_TOKEN=your_key
+```
+Supported providers: OpenRouter, OpenAI, Groq, Anthropic-compatible gateways.
+## Hugging Face Deployment
+1. Create a new HF Space with **Docker** SDK
+2. Push this repository
+3. Set secrets in Space Settings: `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`
+4. Verify: `/health`, `/tasks`, `/baseline`
 ## Project Layout
+```
 .
+├── openenv.yaml                 # OpenEnv spec
+├── models.py                    # Pydantic action/observation/state models
+├── simulation.py                # Task definitions and case progress
+├── grading.py                   # Deterministic 7-dimension grader
+├── baseline_runner.py           # Heuristic agent with LLM fallback
+├── inference.py                 # Challenge-compatible inference entry
+├── case_generator.py            # Parametric seeded task generator
+├── iso_adapter.py               # ISO 20022 real data adapter
+├── agent_brutal_audit.py        # Comprehensive agent evaluation
+├── client.py                    # OpenEnv WebSocket client
+├── episode_store.py             # Thread-safe episode report store
+├── connectors/
+│   └── stripe_sandbox.py        # Stripe test-mode connector
 ├── server/
+│   ├── app.py                   # FastAPI application
+│   └── chargeback_ops_environment.py  # Core environment
+├── tests/
+│   ├── test_env.py              # Environment + generator tests
+│   ├── test_grader.py           # Grading logic tests
+│   ├── test_api.py              # API endpoint tests
+│   ├── test_requirements.py     # Problem statement compliance
+│   └── test_agent_audit.py      # Audit validation tests
+├── Dockerfile                   # Production container
+├── pyproject.toml               # Package config
+└── .env.example                 # Environment variable template
 ```
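
The seven weights from the new README's "Grading Dimensions" pie chart can be combined as a weighted sum per case, then normalized across cases by case weight. The sketch below uses those published weights; the dictionary keys and the `case_score` and `episode_score` helpers are hypothetical and are not the functions in `grading.py`.

```python
from typing import Dict, List, Tuple

# Weights from the README pie chart (they sum to 1.0).
WEIGHTS = {
    "strategy": 0.25,
    "evidence": 0.20,
    "packet": 0.15,
    "deadline": 0.15,
    "efficiency": 0.10,
    "outcome": 0.10,
    "note": 0.05,
}


def case_score(components: Dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each expected in [0, 1]."""
    return sum(weight * components.get(name, 0.0) for name, weight in WEIGHTS.items())


def episode_score(cases: List[Tuple[float, Dict[str, float]]]) -> float:
    """Average case scores, weighted by each case's weight."""
    total_weight = sum(weight for weight, _ in cases)
    if total_weight == 0:
        return 0.0
    return sum(weight * case_score(parts) for weight, parts in cases) / total_weight
```

With this shape, a high-value case moves the episode score more than a small one, which matches the README's statement that cases are weighted by financial impact.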

connectors/__init__.py ADDED

File without changes

connectors/stripe_sandbox.py ADDED

@@ -0,0 +1,300 @@

+"""Stripe sandbox connector for ChargebackOps.
+Maps Stripe test-mode dispute objects into ``InternalCase`` / ``TaskScenario``
+so real Stripe dispute flows can be processed through the environment.
+Usage::
+    export STRIPE_API_KEY=sk_test_...
+    from connectors.stripe_sandbox import fetch_disputes, build_stripe_task
+    disputes = fetch_disputes(limit=10)
+    task = build_stripe_task(disputes, difficulty="medium")
+"""
+from __future__ import annotations
+import hashlib
+import os
+import random
+from typing import Any
+try:
+    from ..simulation import (
+        InternalCase,
+        InternalEvidence,
+        TaskScenario,
+        SystemName,
+        StrategyName,
+    )
+except ImportError:  # pragma: no cover
+    from simulation import (
+        InternalCase,
+        InternalEvidence,
+        TaskScenario,
+        SystemName,
+        StrategyName,
+    )
+_STRIPE_REASON_MAP: dict[str, str] = {
+    "fraudulent": "fraud_cnp",
+    "unrecognized": "fraud_cnp",
+    "product_not_received": "goods_not_received",
+    "product_unacceptable": "product_not_as_described",
+    "duplicate": "duplicate_processing",
+    "subscription_canceled": "credit_not_processed",
+    "credit_not_processed": "credit_not_processed",
+    "general": "goods_not_received",
+    "service_not_as_described": "service_not_provided",
+}
+_STRIPE_STATUS_WON = {"won"}
+_STRIPE_STATUS_LOST = {"lost"}
+_STRIPE_STATUS_OPEN = {
+    "needs_response",
+    "under_review",
+    "warning_needs_response",
+    "warning_under_review",
+    "warning_closed",
+    "charge_refunded",
+}
+_POLICY_GUIDANCE: dict[str, str] = {
+    "goods_not_received": "Prove fulfillment with order confirmation and carrier delivery evidence.",
+    "fraud_cnp": "Contest only with prior account linkage and device history. Do not attach mismatch artifacts.",
+    "product_not_as_described": "Contest when listing accurately represents the product and customer bypassed returns.",
+    "service_not_provided": "Contest when provider records confirm service delivery.",
+    "credit_not_processed": "Refund immediately or concede. Contesting is not supportable.",
+    "duplicate_processing": "Refund the duplicate charge immediately. Do not contest.",
+}
+_POLICY_REQS: dict[str, tuple[str, ...]] = {
+    "goods_not_received": ("order confirmation", "carrier delivery confirmation"),
+    "fraud_cnp": ("prior good order linkage", "customer account confirmation"),
+    "product_not_as_described": ("product listing verification", "return policy documentation"),
+    "service_not_provided": ("service completion record", "customer acknowledgment"),
+    "credit_not_processed": ("proof of cancellation request", "refund status check"),
+    "duplicate_processing": ("payment transaction log", "duplicate confirmation"),
+}
+def _ev(eid: str, system: SystemName, title: str, summary: str,
+        *, helpful: bool = False, harmful: bool = False, required: bool = False) -> InternalEvidence:
+    return InternalEvidence(
+        evidence_id=eid, source_system=system, title=title,
+        summary=summary, helpful=helpful, harmful=harmful, required=required,
+    )
+def _infer_strategy(reason_code: str, stripe_status: str) -> tuple[str, tuple[str, ...]]:
+    """Infer the optimal strategy and acceptable fallbacks from the reason code and Stripe dispute status."""
+    # These reason codes should always refund — contesting is never supportable.
+    if reason_code in ("credit_not_processed", "duplicate_processing"):
+        return "issue_refund", ("accept_chargeback",)
+    if stripe_status in _STRIPE_STATUS_WON:
+        return "contest", ()
+    if stripe_status in _STRIPE_STATUS_LOST:
+        return "accept_chargeback", ("issue_refund",)
+    return "contest", ()
+def _build_evidence(
+    prefix: str,
+    reason_code: str,
+    amount: float,
+    currency: str,
+    metadata: dict[str, Any],
+    optimal: str,
+    rng: random.Random,
+) -> tuple[dict[SystemName, tuple[InternalEvidence, ...]], tuple[str, ...], tuple[str, ...], tuple[str, ...]]:
+    by_sys: dict[SystemName, list[InternalEvidence]] = {
+        s: [] for s in ("orders", "payment", "shipping", "support", "refunds", "risk")
+    }
+    req: list[str] = []
+    hlp: list[str] = []
+    hrm: list[str] = []
+    desc = metadata.get("description", f"Stripe dispute for {amount} {currency}")
+    if reason_code == "goods_not_received":
+        e = _ev(f"{prefix}-ORDER", "orders", "Order confirmation", f"Order for {amount} {currency}.", helpful=True, required=True)
+        by_sys["orders"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
+        by_sys["payment"].append(_ev(f"{prefix}-AUTH", "payment", "Payment capture", "Stripe charge captured."))
+        if optimal == "contest":
+            e = _ev(f"{prefix}-DELIVERY", "shipping", "Delivery confirmation", "Carrier confirms delivery.", helpful=True, required=True)
+            by_sys["shipping"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
+        else:
+            by_sys["shipping"].append(_ev(f"{prefix}-NOTRACK", "shipping", "Tracking", "No delivery confirmation."))
+        by_sys["refunds"].append(_ev(f"{prefix}-REFUND", "refunds", "Refund ledger", "No refund issued."))
+    elif reason_code == "fraud_cnp":
+        by_sys["orders"].append(_ev(f"{prefix}-ORDER", "orders", "Order receipt", f"Order for {amount} {currency}.", helpful=True))
+        hlp.append(f"{prefix}-ORDER")
+        e_avs = _ev(f"{prefix}-AVS", "payment", "AVS check", "AVS mismatch at authorization.", harmful=True)
+        by_sys["payment"].append(e_avs); hrm.append(e_avs.evidence_id)
+        by_sys["payment"].append(_ev(f"{prefix}-AUTH", "payment", "Payment capture", "Stripe charge captured."))
+        if optimal == "contest":
+            e = _ev(f"{prefix}-PRIOR", "risk", "Prior account activity", "Same account with prior fulfilled orders.", helpful=True, required=True)
+            by_sys["risk"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
+            e = _ev(f"{prefix}-CHAT", "support", "Customer verification", "Customer confirmed order via support.", helpful=True, required=True)
+            by_sys["support"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
+        else:
+            by_sys["risk"].append(_ev(f"{prefix}-RISK", "risk", "Risk summary", "No positive account history."))
+        by_sys["refunds"].append(_ev(f"{prefix}-REFUND", "refunds", "Refund ledger", "No refund issued."))
+    elif reason_code == "product_not_as_described":
+        e = _ev(f"{prefix}-ORDER", "orders", "Order details", f"Order for {amount} {currency} — SKU matches.", helpful=True, required=True)
+        by_sys["orders"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
+        e = _ev(f"{prefix}-LISTING", "orders", "Product listing", "Listing matches manufacturer specs.", helpful=True, required=True)
+        by_sys["orders"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
+        by_sys["payment"].append(_ev(f"{prefix}-AUTH", "payment", "Payment capture", "Settled at listed price."))
+        by_sys["shipping"].append(_ev(f"{prefix}-DELIVERY", "shipping", "Delivery confirmation", "Delivered.", helpful=True))
+        hlp.append(f"{prefix}-DELIVERY")
+        by_sys["refunds"].append(_ev(f"{prefix}-REFUND", "refunds", "Refund ledger", "No refund processed."))
+    elif reason_code == "service_not_provided":
+        e = _ev(f"{prefix}-BOOKING", "orders", "Service booking", f"Booking for {amount} {currency}.", helpful=True, required=True)
+        by_sys["orders"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
+        by_sys["payment"].append(_ev(f"{prefix}-AUTH", "payment", "Payment record", "Stripe charge captured."))
+        if optimal == "contest":
+            e = _ev(f"{prefix}-COMPLETION", "support", "Service completion", "Service marked completed.", helpful=True, required=True)
+            by_sys["support"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
+        by_sys["refunds"].append(_ev(f"{prefix}-REFUND", "refunds", "Refund ledger", "No refund issued."))
+    elif reason_code in ("credit_not_processed", "duplicate_processing"):
+        by_sys["orders"].append(_ev(f"{prefix}-ORDER", "orders", "Invoice", f"Charge of {amount} {currency}."))
+        by_sys["payment"].append(_ev(f"{prefix}-PAYMENT", "payment", "Payment", "Stripe charge settled."))
+        by_sys["support"].append(_ev(f"{prefix}-REQ", "support", "Customer request", desc[:100], helpful=True))
+        hlp.append(f"{prefix}-REQ")
+        by_sys["refunds"].append(_ev(f"{prefix}-NOREFUND", "refunds", "Refund ledger", "No refund processed.", helpful=True))
+        hlp.append(f"{prefix}-NOREFUND")
+    frozen = {k: tuple(v) for k, v in by_sys.items()}
+    return frozen, tuple(req), tuple(hlp), tuple(hrm)
+def dispute_to_case(dispute: dict[str, Any], case_index: int, *, deadline_step: int = 8) -> InternalCase | None:
+    """Convert a Stripe dispute object to an InternalCase."""
+    stripe_reason = dispute.get("reason", "general")
+    reason_code = _STRIPE_REASON_MAP.get(stripe_reason)
+    if reason_code is None:
+        return None
+    amount = dispute.get("amount", 0) / 100.0  # Stripe amounts use the smallest currency unit; /100 assumes a two-decimal currency
+    currency = dispute.get("currency", "usd").upper()
+    status = dispute.get("status", "needs_response")
+    metadata = dispute.get("metadata", {})
+    dispute_id = dispute.get("id", f"dp_{case_index}")
+    optimal, acceptable = _infer_strategy(reason_code, status)
+    rng = random.Random(int(hashlib.sha256(dispute_id.encode()).hexdigest()[:8], 16))
+    prefix = f"STRIPE{case_index}"
+    evidence, req_ids, hlp_ids, hrm_ids = _build_evidence(
+        prefix, reason_code, amount, currency, metadata, optimal, rng,
+    )
+    guidance = _POLICY_GUIDANCE.get(reason_code, "")
+    if optimal in ("accept_chargeback", "issue_refund") and reason_code not in ("credit_not_processed", "duplicate_processing"):
+        guidance = f"Do not contest this {reason_code.replace('_', ' ')} dispute. Concede to avoid wasting resources."
+    return InternalCase(
+        case_id=f"CB-STRIPE{case_index}",
+        order_id=dispute.get("charge", f"ch_stripe{case_index}"),
+        customer_id=f"CUST-STRIPE{case_index}",
+        amount=amount,
+        currency=currency,
+        reason_code=reason_code,
+        summary=dispute.get("evidence_details", {}).get("due_by_reason", f"Stripe dispute: {stripe_reason}"),
+        inspection_notes=f"Stripe dispute {dispute_id} — {stripe_reason}. Status: {status}.",
+        deadline_step=deadline_step,
+        optimal_strategy=optimal,
+        acceptable_strategies=acceptable,
+        policy_guidance=guidance,
+        policy_requirements=_POLICY_REQS.get(reason_code, ()),
+        recommended_strategy=optimal,
+        resolution_summary=f"Stripe dispute status: {status}.",
+        weight=round(1.0 + (amount / 5000.0), 2),
+        required_evidence_ids=req_ids,
+        helpful_evidence_ids=hlp_ids,
+        harmful_evidence_ids=hrm_ids,
+        evidence_by_system=evidence,
+    )
+def build_stripe_task(
+    disputes: list[dict[str, Any]],
+    *,
+    difficulty: str = "medium",
+    task_index: int = 0,
+) -> TaskScenario | None:
+    """Build a TaskScenario from a list of Stripe dispute objects."""
+    case_count = {"easy": 1, "medium": 2, "hard": 3}.get(difficulty, 2)
+    max_steps = {"easy": 10, "medium": 12, "hard": max(12, case_count * 5)}.get(difficulty, 12)
+    deadline = {"easy": 8, "medium": 7, "hard": 5}.get(difficulty, 7)
+    cases: list[InternalCase] = []
+    for i, dispute in enumerate(disputes):
+        if len(cases) >= case_count:
+            break
+        case = dispute_to_case(dispute, i + 1, deadline_step=deadline)
+        if case is not None:
+            cases.append(case)
+    if not cases:
+        return None
+    codes = ", ".join(sorted({c.reason_code for c in cases})[:3])  # sorted: set order is not stable across runs
+    return TaskScenario(
+        task_id=f"stripe_{difficulty}_{task_index}",
+        title=f"Stripe Dispute {'Queue' if len(cases) > 1 else 'Case'} ({difficulty.title()})",
+        difficulty=difficulty,
+        objective=f"Handle {len(cases)} Stripe dispute(s) ({codes}).",
+        description=f"Real Stripe sandbox dispute scenario with {len(cases)} case(s). Codes: {codes}.",
+        max_steps=max_steps,
+        cases=tuple(cases),
+    )
+def fetch_disputes(*, limit: int = 10, api_key: str | None = None) -> list[dict[str, Any]]:
+    """Fetch disputes from Stripe test mode.
+    Requires ``stripe`` package and a test-mode API key.
+    Falls back to synthetic test disputes if Stripe is unavailable.
+    """
+    key = api_key or os.environ.get("STRIPE_API_KEY", "")
+    if not key or not key.startswith("sk_test_"):
+        return _synthetic_test_disputes(limit)
+    try:
+        import stripe
+        stripe.api_key = key
+        result = stripe.Dispute.list(limit=limit)
+        return [d.to_dict() if hasattr(d, "to_dict") else dict(d) for d in result.data]
+    except Exception:
+        return _synthetic_test_disputes(limit)
+def _synthetic_test_disputes(count: int) -> list[dict[str, Any]]:
+    """Generate synthetic Stripe-format dispute objects for testing without API access."""
+    rng = random.Random(42)
+    reasons = list(_STRIPE_REASON_MAP.keys())
+    statuses = ["needs_response", "won", "lost", "under_review"]
+    disputes = []
+    for i in range(count):
+        reason = rng.choice(reasons)
+        status = rng.choice(statuses)
+        amount = rng.randint(500, 50000)  # cents
+        disputes.append({
+            "id": f"dp_test_{i:04d}",
+            "amount": amount,
+            "currency": "usd",
+            "reason": reason,
+            "status": status,
+            "charge": f"ch_test_{i:04d}",
+            "metadata": {"description": f"Test dispute {i} — {reason}"},
+            "evidence_details": {"due_by_reason": f"Dispute for {reason}"},
+        })
+    return disputes
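
Review note: `dispute_to_case` seeds a per-dispute `random.Random` from the first 8 hex digits of the SHA-256 of the dispute id, so the evidence generated for a given dispute is reproducible. The sketch below isolates that idea together with the minor-unit conversion. The helper names `seeded_rng`, `draws`, and `minor_units_to_amount` are hypothetical and do not exist in the connector.

```python
import hashlib
import random


def seeded_rng(dispute_id: str) -> random.Random:
    """Build an RNG whose seed is the first 32 bits of the id's SHA-256 digest."""
    seed = int(hashlib.sha256(dispute_id.encode()).hexdigest()[:8], 16)
    return random.Random(seed)


def draws(dispute_id: str, n: int = 3) -> list[float]:
    """Return the first n random draws for a dispute id."""
    rng = seeded_rng(dispute_id)
    return [rng.random() for _ in range(n)]


def minor_units_to_amount(minor_units: int) -> float:
    """Convert an integer minor-unit amount to a decimal amount.

    Assumes a two-decimal currency (for example USD); zero-decimal
    currencies would need a different divisor.
    """
    return minor_units / 100.0


# The same dispute id always yields the same sequence of draws.
same_twice = draws("dp_test_0001") == draws("dp_test_0001")
amount = minor_units_to_amount(1999)
```

Seeding from a hash of the id, rather than from the case index, keeps the generated evidence stable even if the order of disputes returned by the API changes.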

episode_store.py CHANGED

@@ -12,6 +12,7 @@ except ImportError:  # pragma: no cover
 _LOCK = Lock()
 _REPORTS: dict[str, GraderReport] = {}
 _LATEST_EPISODE_ID: str | None = None
 def record_report(report: GraderReport) -> None:
@@ -19,6 +20,9 @@ def record_report(report: GraderReport) -> None:
     global _LATEST_EPISODE_ID
     with _LOCK:
         _REPORTS[report.episode_id] = report
         _LATEST_EPISODE_ID = report.episode_id

 _LOCK = Lock()
 _REPORTS: dict[str, GraderReport] = {}
 _LATEST_EPISODE_ID: str | None = None
+_MAX_REPORTS = 100
 def record_report(report: GraderReport) -> None:
     global _LATEST_EPISODE_ID
     with _LOCK:
+        if report.episode_id not in _REPORTS and len(_REPORTS) >= _MAX_REPORTS:
+            oldest = next(iter(_REPORTS))
+            del _REPORTS[oldest]
         _REPORTS[report.episode_id] = report
         _LATEST_EPISODE_ID = report.episode_id
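
Review note: the cap relies on `dict` preserving insertion order, so `next(iter(...))` is the oldest-inserted key. A minimal standalone sketch of that eviction pattern follows. The names `MAX_REPORTS`, `reports`, and `record` are hypothetical stand-ins, the cap is shrunk to 3 for the demo, and the `not in` guard is there so that overwriting an existing id does not evict an unrelated entry.

```python
MAX_REPORTS = 3  # hypothetical small cap for the demo; the store itself uses 100

reports: dict[str, str] = {}


def record(episode_id: str, report: str) -> None:
    """Store a report, dropping the oldest-inserted one when the cap is hit."""
    if episode_id not in reports and len(reports) >= MAX_REPORTS:
        oldest = next(iter(reports))  # dicts iterate in insertion order
        del reports[oldest]
    reports[episode_id] = report


for i in range(5):
    record(f"ep-{i}", f"report {i}")

# Overwriting an id that is already stored does not trigger an eviction.
record("ep-3", "report 3 (updated)")
remaining = list(reports)
```

Eviction here is first-in, first-out by insertion, not least-recently-used: updating an existing entry does not move it to the end.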

grading.py CHANGED

@@ -16,6 +16,58 @@ def _ratio(numerator: int, denominator: int) -> float:
     return max(0.0, min(1.0, numerator / denominator))
 def score_case(
     case: InternalCase,
     progress: CaseProgress,
@@ -24,15 +76,10 @@ def score_case(
     """Score one case deterministically."""
     final_resolution = progress.final_resolution or "unresolved"
-    required_attached = len(
-        set(progress.attached_evidence_ids).intersection(case.required_evidence_ids)
-    )
-    helpful_attached = len(
-        set(progress.attached_evidence_ids).intersection(case.helpful_evidence_ids)
-    )
-    harmful_attached = len(
-        set(progress.attached_evidence_ids).intersection(case.harmful_evidence_ids)
-    )
     if final_resolution == case.optimal_strategy:
         strategy_correctness = 1.0
@@ -53,16 +100,22 @@ def score_case(
         )
     else:
         if final_resolution in {"accept_chargeback", "issue_refund"}:
-            evidence_quality = 1.0 if helpful_attached == 0 and harmful_attached == 0 else 0.7
-            packet_validity = 1.0
         else:
             evidence_quality = 0.0
             packet_validity = 0.0
     deadline_compliance = 1.0
     if final_resolution == "unresolved":
         deadline_compliance = 0.0
-    elif step_count > case.deadline_step:
         deadline_compliance = 0.0
     wasted_actions = progress.duplicate_queries + progress.invalid_actions
@@ -75,13 +128,20 @@ def score_case(
     else:
         outcome_quality = 0.0
     weighted_score = (
         0.25 * strategy_correctness
-        + 0.25 * evidence_quality
         + 0.15 * packet_validity
         + 0.15 * deadline_compliance
         + 0.10 * efficiency
         + 0.10 * outcome_quality
     )
     note_parts = [case.resolution_summary]
@@ -100,6 +160,7 @@ def score_case(
         deadline_compliance=round(deadline_compliance, 4),
         efficiency=round(efficiency, 4),
         outcome_quality=round(outcome_quality, 4),
         weighted_score=round(weighted_score * case.weight, 4),
         final_resolution=final_resolution,
         notes=" ".join(note_parts),

     return max(0.0, min(1.0, numerator / denominator))
+def grade_representment_note(
+    note: str | None,
+    case: "InternalCase",
+    attached_ids: set[str],
+) -> float:
+    """Score a representment note from 0.0 to 1.0.
+    Evaluates whether the note:
+    - References required claims from the policy requirements
+    - Avoids mentioning harmful evidence
+    - Has sufficient substance (length and specificity)
+    """
+    if not note or not note.strip():
+        return 0.0
+    text = note.lower()
+    score = 0.0
+    # Substance: minimum length for a coherent note
+    word_count = len(text.split())
+    if word_count >= 5:
+        score += 0.2
+    elif word_count >= 2:
+        score += 0.1
+    # Required claims coverage: does the note mention policy requirements?
+    if case.policy_requirements:
+        claims_hit = 0
+        for req in case.policy_requirements:
+            req_keywords = req.lower().split()
+            if any(kw in text for kw in req_keywords if len(kw) > 3):
+                claims_hit += 1
+        score += 0.5 * _ratio(claims_hit, len(case.policy_requirements))
+    else:
+        score += 0.3  # No requirements to check
+    # Evidence coherence: does the note reference attached evidence?
+    evidence_refs = sum(1 for eid in attached_ids if eid.lower() in text or any(
+        part in text for part in eid.lower().replace("-", " ").split() if len(part) > 3
+    ))
+    if evidence_refs > 0:
+        score += 0.15
+    # Harmful mention penalty: does the note mention harmful evidence concepts?
+    harmful_keywords = {"mismatch", "failed", "declined", "suspicious", "flagged", "fraud risk"}
+    harmful_hits = sum(1 for kw in harmful_keywords if kw in text)
+    if harmful_hits > 0:
+        score -= 0.15 * min(harmful_hits, 2)
+    return max(0.0, min(1.0, score))
 def score_case(
     case: InternalCase,
     progress: CaseProgress,
     """Score one case deterministically."""
     final_resolution = progress.final_resolution or "unresolved"
+    attached_set = set(progress.attached_evidence_ids)
+    required_attached = len(attached_set.intersection(case.required_evidence_ids))
+    helpful_attached = len(attached_set.intersection(case.helpful_evidence_ids))
+    harmful_attached = len(attached_set.intersection(case.harmful_evidence_ids))
     if final_resolution == case.optimal_strategy:
         strategy_correctness = 1.0
         )
     else:
         if final_resolution in {"accept_chargeback", "issue_refund"}:
+            if case.optimal_strategy == "contest":
+                # Conceded a contestable case — evidence gathering was abandoned
+                evidence_quality = 0.3
+                packet_validity = 0.0
+            else:
+                evidence_quality = 1.0 if helpful_attached == 0 and harmful_attached == 0 else 0.7
+                packet_validity = 1.0
         else:
             evidence_quality = 0.0
             packet_validity = 0.0
+    resolution_step = progress.resolved_at_step if progress.resolved_at_step is not None else step_count
     deadline_compliance = 1.0
     if final_resolution == "unresolved":
         deadline_compliance = 0.0
+    elif resolution_step > case.deadline_step:
         deadline_compliance = 0.0
     wasted_actions = progress.duplicate_queries + progress.invalid_actions
     else:
         outcome_quality = 0.0
+    # Representment note quality (only relevant for contested cases)
+    if final_resolution == "contest" and progress.representment_note:
+        note_quality = grade_representment_note(progress.representment_note, case, attached_set)
+    else:
+        note_quality = 0.0
     weighted_score = (
         0.25 * strategy_correctness
+        + 0.20 * evidence_quality
         + 0.15 * packet_validity
         + 0.15 * deadline_compliance
         + 0.10 * efficiency
         + 0.10 * outcome_quality
+        + 0.05 * note_quality
     )
     note_parts = [case.resolution_summary]
         deadline_compliance=round(deadline_compliance, 4),
         efficiency=round(efficiency, 4),
         outcome_quality=round(outcome_quality, 4),
+        note_quality=round(note_quality, 4),
         weighted_score=round(weighted_score * case.weight, 4),
         final_resolution=final_resolution,
         notes=" ".join(note_parts),
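
Review note: the new weights move 0.05 from `evidence_quality` to `note_quality`, so the seven components still sum to 1.0 before the per-case `weight` multiplier is applied. The sketch below reproduces only that arithmetic. `WEIGHTS` and `weighted_score` are hypothetical names, and the component values are made up for illustration.

```python
# Component weights as they appear on the new side of the diff.
WEIGHTS = {
    "strategy_correctness": 0.25,
    "evidence_quality": 0.20,
    "packet_validity": 0.15,
    "deadline_compliance": 0.15,
    "efficiency": 0.10,
    "outcome_quality": 0.10,
    "note_quality": 0.05,
}


def weighted_score(components: dict[str, float], case_weight: float = 1.0) -> float:
    """Combine per-component scores (each in [0, 1]) and scale by the case weight."""
    total = sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)
    return round(total * case_weight, 4)


perfect = {name: 1.0 for name in WEIGHTS}
no_note = {**perfect, "note_quality": 0.0}

full = weighted_score(perfect)
without_note = weighted_score(no_note)
# A case weight above 1.0 (e.g. 1.05 for an amount of 250) pushes the result past 1.0.
scaled = weighted_score(perfect, case_weight=1.05)
```

One consequence worth noting: a contested case submitted without a note can score at most 0.95 of its case weight, since `note_quality` falls back to 0.0.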

iso_adapter.py ADDED

	@@ -0,0 +1,268 @@

+"""Adapter that converts real ISO 20022 chargeback CSV rows into environment cases.
+Reads ``data/iso20022-card-chargeback-casr-003.csv`` and produces
+``InternalCase`` / ``TaskScenario`` objects so real dispute data flows
+through the benchmark.
+"""
+from __future__ import annotations
+import csv
+import hashlib
+import random
+from pathlib import Path
+from typing import Literal
+try:
+    from .simulation import InternalCase, InternalEvidence, TaskScenario, SystemName, StrategyName
+except ImportError:  # pragma: no cover
+    from simulation import InternalCase, InternalEvidence, TaskScenario, SystemName, StrategyName
+ISO_CSV_PATH = Path("data/iso20022-card-chargeback-casr-003.csv")
+_REASON_MAP: dict[str, str] = {
+    "goods_not_received": "goods_not_received",
+    "GOODS_NOT_RECEIVED": "goods_not_received",
+    "NR02": "goods_not_received",
+    "FRAUD": "fraud_cnp",
+    "fraud": "fraud_cnp",
+    "fraudulent_transaction": "fraud_cnp",
+    "FR01": "fraud_cnp",
+    "FR02": "fraud_cnp",
+    "goods_not_as_described": "product_not_as_described",
+    "GOODS_NOT_AS_DESCRIBED": "product_not_as_described",
+    "not_as_described": "product_not_as_described",
+    "NR04": "product_not_as_described",
+    "SERVICE_NOT_RENDERED": "service_not_provided",
+    "services_not_rendered": "service_not_provided",
+    "NR03": "credit_not_processed",
+    "duplicate": "duplicate_processing",
+    "DUPLICATE_PROCESSING": "duplicate_processing",
+    "duplicate_processing": "duplicate_processing",
+}
+_MERCHANT_WON = {"merchant_won", "chargeback_reversed", "chargeback_declined"}
+_CONCEDED = {"chargeback_accepted"}
+_POLICY_GUIDANCE: dict[str, str] = {
+    "goods_not_received": "For goods-not-received disputes, prove fulfillment with order confirmation and carrier delivery evidence.",
+    "fraud_cnp": "For CNP fraud disputes, contest only when you can link the cardholder to the account or device history. Do not attach mismatch artifacts.",
+    "product_not_as_described": "Contest product-not-as-described disputes when the listing accurately represents the product and the customer bypassed the return process.",
+    "service_not_provided": "Contest service-not-provided disputes when provider records confirm the service was delivered.",
+    "credit_not_processed": "If the merchant failed to process a promised credit, refund immediately or concede. Contesting is not supportable.",
+    "duplicate_processing": "When a duplicate charge is confirmed, refund the extra amount immediately. Do not contest.",
+}
+_POLICY_REQS: dict[str, tuple[str, ...]] = {
+    "goods_not_received": ("order confirmation", "carrier delivery confirmation"),
+    "fraud_cnp": ("prior good order linkage", "customer account confirmation"),
+    "product_not_as_described": ("product listing verification", "return policy documentation"),
+    "service_not_provided": ("service completion record", "customer acknowledgment"),
+    "credit_not_processed": ("proof of cancellation request", "refund status check"),
+    "duplicate_processing": ("payment transaction log", "duplicate confirmation"),
+}
+def _ev(eid, system, title, summary, *, helpful=False, harmful=False, required=False):
+    return InternalEvidence(evidence_id=eid, source_system=system, title=title,
+                           summary=summary, helpful=helpful, harmful=harmful, required=required)
+def _infer_strategy(reason_code, final_decision, notes):
+    nl = notes.lower()
+    if final_decision in _MERCHANT_WON:
+        return "contest", ()
+    if final_decision in _CONCEDED:
+        if reason_code in ("credit_not_processed", "duplicate_processing"):
+            return "issue_refund", ("accept_chargeback",)
+        return "accept_chargeback", ("issue_refund",)
+    if reason_code in ("credit_not_processed", "duplicate_processing"):
+        return "issue_refund", ("accept_chargeback",)
+    if reason_code == "fraud_cnp" and ("stolen" in nl or "no evidence" in nl or "unable" in nl):
+        return "accept_chargeback", ("issue_refund",)
+    return "contest", ()
+def _build_evidence(prefix, reason_code, merchant, amount, notes, optimal, rng):
+    by_sys: dict[SystemName, list[InternalEvidence]] = {s: [] for s in ("orders","payment","shipping","support","refunds","risk")}
+    req, hlp, hrm = [], [], []
+    if reason_code == "goods_not_received":
+        e = _ev(f"{prefix}-ORDER","orders","Order confirmation",f"Order with {merchant} for ${amount:.2f}.",helpful=True,required=True)
+        by_sys["orders"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
+        by_sys["payment"].append(_ev(f"{prefix}-AUTH","payment","Authorization","Payment authorized and captured."))
+        if optimal == "contest":
+            e = _ev(f"{prefix}-DELIVERY","shipping","Carrier delivery confirmation","Carrier confirms delivery to customer address.",helpful=True,required=True)
+            by_sys["shipping"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
+            if rng.random()>0.4:
+                e2=_ev(f"{prefix}-SIG","shipping","Delivery signature","Recipient signature on file.",helpful=True)
+                by_sys["shipping"].append(e2); hlp.append(e2.evidence_id)
+        else:
+            by_sys["shipping"].append(_ev(f"{prefix}-NOTRACK","shipping","Tracking status","No confirmed delivery scan."))
+        by_sys["support"].append(_ev(f"{prefix}-SUPPORT","support","Support notes",notes[:120] if notes else "No support interactions."))
+        by_sys["refunds"].append(_ev(f"{prefix}-REFUND","refunds","Refund ledger","No refund issued before dispute."))
+    elif reason_code == "fraud_cnp":
+        by_sys["orders"].append(_ev(f"{prefix}-ORDER","orders","Order receipt",f"Order with {merchant} for ${amount:.2f}.",helpful=True))
+        hlp.append(f"{prefix}-ORDER")
+        e_avs=_ev(f"{prefix}-AVS","payment","AVS mismatch","Street mismatch at authorization.",harmful=True)
+        by_sys["payment"].append(e_avs); hrm.append(e_avs.evidence_id)
+        if rng.random()>0.5:
+            e_cvv=_ev(f"{prefix}-CVV","payment","CVV mismatch","CVV verification failed.",harmful=True)
+            by_sys["payment"].append(e_cvv); hrm.append(e_cvv.evidence_id)
+        by_sys["payment"].append(_ev(f"{prefix}-AUTH","payment","Authorization","Payment captured."))
+        if optimal=="contest":
+            e=_ev(f"{prefix}-PRIOR","risk","Prior account activity","Same account/device with prior fulfilled orders.",helpful=True,required=True)
+            by_sys["risk"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
+            e=_ev(f"{prefix}-CHAT","support","Authenticated chat","Customer logged in and confirmed order.",helpful=True,required=True)
+            by_sys["support"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
+        else:
+            by_sys["risk"].append(_ev(f"{prefix}-RISK","risk","Risk summary","Elevated risk. No positive account history."))
+            by_sys["support"].append(_ev(f"{prefix}-SUPPORT","support","Support log","No authenticated interactions."))
+        by_sys["shipping"].append(_ev(f"{prefix}-DELIVERY","shipping","Delivery confirmation","Delivered to address on file.",helpful=True))
+        hlp.append(f"{prefix}-DELIVERY")
+        by_sys["refunds"].append(_ev(f"{prefix}-REFUND","refunds","Refund ledger","No refund issued."))
+    elif reason_code == "product_not_as_described":
+        e=_ev(f"{prefix}-ORDER","orders","Order details",f"Order with {merchant} — SKU matches listing.",helpful=True,required=True)
+        by_sys["orders"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
+        e=_ev(f"{prefix}-LISTING","orders","Product listing","Listing matches manufacturer specs.",helpful=True,required=True)
+        by_sys["orders"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
+        by_sys["payment"].append(_ev(f"{prefix}-AUTH","payment","Payment capture","Settled for listed price."))
+        by_sys["shipping"].append(_ev(f"{prefix}-DELIVERY","shipping","Delivery confirmation","Delivered within window.",helpful=True))
+        hlp.append(f"{prefix}-DELIVERY")
+        by_sys["support"].append(_ev(f"{prefix}-RETURN","support","Return policy","No return initiated before dispute.",helpful=True))
+        hlp.append(f"{prefix}-RETURN")
+        by_sys["refunds"].append(_ev(f"{prefix}-REFUND","refunds","Refund ledger","No refund processed."))
+    elif reason_code == "service_not_provided":
+        e=_ev(f"{prefix}-BOOKING","orders","Service booking",f"Booking with {merchant} for ${amount:.2f}.",helpful=True,required=True)
+        by_sys["orders"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
+        by_sys["payment"].append(_ev(f"{prefix}-AUTH","payment","Payment record","Payment captured."))
+        if optimal=="contest":
+            e=_ev(f"{prefix}-COMPLETION","support","Service completion","Provider marked service completed.",helpful=True,required=True)
+            by_sys["support"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
+        else:
+            by_sys["support"].append(_ev(f"{prefix}-CANCEL","support","Cancellation",notes[:100] if notes else "Service cancelled.",helpful=True))
+            hlp.append(f"{prefix}-CANCEL")
+        by_sys["refunds"].append(_ev(f"{prefix}-REFUND","refunds","Refund ledger","No refund issued."))
+    elif reason_code in ("credit_not_processed","duplicate_processing"):
+        by_sys["orders"].append(_ev(f"{prefix}-ORDER","orders","Invoice",f"Charge of ${amount:.2f} from {merchant}."))
+        by_sys["payment"].append(_ev(f"{prefix}-PAYMENT","payment","Payment","Payment settled."))
+        by_sys["support"].append(_ev(f"{prefix}-REQ","support","Customer request",notes[:100] if notes else "Customer requested credit.",helpful=True))
+        hlp.append(f"{prefix}-REQ")
+        by_sys["refunds"].append(_ev(f"{prefix}-NOREFUND","refunds","Refund ledger","No refund processed.",helpful=True))
+        hlp.append(f"{prefix}-NOREFUND")
+    frozen = {k: tuple(v) for k, v in by_sys.items()}
+    return frozen, tuple(req), tuple(hlp), tuple(hrm)
+def _concedable_guidance(reason_code: str, optimal: str) -> str:
+    """Return guidance that signals concede when the optimal strategy isn't contest."""
+    if optimal in ("accept_chargeback", "issue_refund") and reason_code not in (
+        "credit_not_processed", "duplicate_processing",
+    ):
+        if optimal == "accept_chargeback":
+            return (
+                f"Do not contest this {reason_code.replace('_', ' ')} dispute. "
+                "The merchant's position is not supportable. Concede to avoid wasting resources."
+            )
+        return (
+            f"Refund immediately for this {reason_code.replace('_', ' ')} dispute. "
+            "Contesting is not supportable."
+        )
+    return _POLICY_GUIDANCE.get(reason_code, "")
+def row_to_case(row, case_index, *, deadline_step=8):
+    raw_code = row.get("chargeback_reason_code", "")
+    reason_code = _REASON_MAP.get(raw_code)
+    if reason_code is None:
+        return None
+    amount = float(row.get("transaction_amount", "0") or "0")
+    merchant = row.get("merchant_name", "Unknown")
+    notes = row.get("notes", "")
+    final_decision = row.get("final_decision", "")
+    optimal, acceptable = _infer_strategy(reason_code, final_decision, notes)
+    rng = random.Random(int(hashlib.sha256(row["chargeback_id"].encode()).hexdigest()[:8], 16))
+    prefix = f"ISO{case_index}"
+    evidence, req_ids, hlp_ids, hrm_ids = _build_evidence(prefix, reason_code, merchant, amount, notes, optimal, rng)
+    return InternalCase(
+        case_id=f"CB-ISO{case_index}",
+        order_id=row.get("original_transaction_id", f"TX-ISO{case_index}"),
+        customer_id=f"CUST-ISO{case_index}",
+        amount=amount, currency=row.get("transaction_currency", "USD"),
+        reason_code=reason_code,
+        summary=row.get("chargeback_reason_description", "Chargeback filed."),
+        inspection_notes=notes or f"Chargeback against {merchant} for ${amount:.2f}.",
+        deadline_step=deadline_step,
+        optimal_strategy=optimal, acceptable_strategies=acceptable,
+        policy_guidance=_concedable_guidance(reason_code, optimal),
+        policy_requirements=_POLICY_REQS.get(reason_code, ()),
+        recommended_strategy=optimal,
+        resolution_summary=f"Real case outcome: {final_decision or 'pending'}.",
+        weight=round(1.0 + (amount / 5000.0), 2),
+        required_evidence_ids=req_ids, helpful_evidence_ids=hlp_ids, harmful_evidence_ids=hrm_ids,
+        evidence_by_system=evidence,
+    )
+def load_iso_rows(csv_path=None):
+    path = Path(csv_path) if csv_path else ISO_CSV_PATH  # accept str or Path
+    if not path.exists():
+        return []
+    with path.open(newline="", encoding="utf-8") as f:
+        return list(csv.DictReader(f))
+def build_iso_task(rows, *, difficulty="medium", start_index=0, case_count=None, task_index=0):
+    if case_count is None:
+        case_count = {"easy": 1, "medium": 2, "hard": 3}[difficulty]
+    max_steps = {"easy": 10, "medium": 12, "hard": max(12, case_count * 5)}[difficulty]
+    cases = []
+    idx = start_index
+    while len(cases) < case_count and idx < len(rows):
+        deadline = {"easy": 8, "medium": 7, "hard": max(4, 8 - len(cases))}[difficulty]
+        case = row_to_case(rows[idx], idx + 1, deadline_step=deadline)
+        idx += 1
+        if case is not None:
+            cases.append(case)
+    if not cases:
+        return None
+    codes = ", ".join(sorted({c.reason_code for c in cases})[:3])  # sorted: set order is not stable across runs
+    return TaskScenario(
+        task_id=f"iso_{difficulty}_{task_index}",
+        title=f"ISO Dispute {'Queue' if len(cases) > 1 else 'Case'} ({difficulty.title()})",
+        difficulty=difficulty,
+        objective=f"Handle {len(cases)} real dispute(s) ({codes}) from ISO 20022 chargeback data.",
+        description=f"Real-world-derived scenario with {len(cases)} case(s). Reason codes: {codes}.",
+        max_steps=max_steps,
+        cases=tuple(cases),
+    )
+def generate_iso_suite(csv_path=None, *, easy_count=3, medium_count=3, hard_count=3):
+    rows = load_iso_rows(csv_path)
+    if not rows:
+        return []
+    rng = random.Random(42)
+    shuffled = list(rows)
+    rng.shuffle(shuffled)
+    tasks, offset, idx = [], 0, 0
+    for diff, count in [("easy", easy_count), ("medium", medium_count), ("hard", hard_count)]:
+        for _ in range(count):
+            task = build_iso_task(shuffled, difficulty=diff, start_index=offset, task_index=idx)
+            if task is not None:
+                tasks.append(task)
+                offset += len(task.cases) + 1
+                idx += 1
+    return tasks
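
Review note: `row_to_case` drops any row whose `chargeback_reason_code` is not in the reason map and treats a blank `transaction_amount` as zero. A minimal sketch of that filtering, run against an in-memory CSV, is shown below. The column names come from the adapter; `REASON_MAP` is a two-entry subset, and `SAMPLE_CSV` and `parse_rows` are hypothetical, with made-up rows.

```python
import csv
import io

# Two entries copied from the adapter's map; the full map is larger.
REASON_MAP = {"NR02": "goods_not_received", "FR01": "fraud_cnp"}

# Made-up rows for illustration only.
SAMPLE_CSV = """chargeback_id,chargeback_reason_code,transaction_amount
CB-1,NR02,120.50
CB-2,ZZ99,15.00
CB-3,FR01,
"""


def parse_rows(text: str) -> list[dict[str, object]]:
    """Keep rows with a known reason code; blank amounts become 0.0."""
    parsed = []
    for row in csv.DictReader(io.StringIO(text)):
        reason = REASON_MAP.get(row.get("chargeback_reason_code", ""))
        if reason is None:
            continue  # unmapped code: skip the row, as row_to_case returns None
        amount = float(row.get("transaction_amount") or "0")
        parsed.append({"id": row["chargeback_id"], "reason": reason, "amount": amount})
    return parsed


cases = parse_rows(SAMPLE_CSV)
```

Because unmapped rows are skipped silently, the number of cases produced can be smaller than the number of rows consumed, which matters when computing offsets into the row list.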

models.py CHANGED

@@ -51,7 +51,6 @@ class PolicyView(BaseModel):
     reason_code: str
     guidance: str
     required_evidence: list[str] = Field(default_factory=list)
-    recommended_strategy: StrategyName
 class VisibleCase(BaseModel):
@@ -116,6 +115,7 @@ class CaseScoreBreakdown(BaseModel):
     deadline_compliance: float
     efficiency: float
     outcome_quality: float
     weighted_score: float
     final_resolution: str
     notes: str
@@ -167,13 +167,14 @@ class ChargebackOpsAction(Action):
     """Action schema for ChargebackOps."""
     action_type: ActionType
-    case_id: str | None = Field(default=None, description="Target case id when applicable")
     system_name: SystemName | None = Field(
         default=None,
         description="System to query when action_type is query_system",
     )
     evidence_ids: list[str] = Field(
         default_factory=list,
         description="Evidence ids to attach or remove",
     )
     strategy: StrategyName | None = Field(
@@ -182,6 +183,7 @@ class ChargebackOpsAction(Action):
     )
     note: str | None = Field(
         default=None,
         description="Optional short rationale for the action",
     )

     reason_code: str
     guidance: str
     required_evidence: list[str] = Field(default_factory=list)
 class VisibleCase(BaseModel):
     deadline_compliance: float
     efficiency: float
     outcome_quality: float
+    note_quality: float = 0.0
     weighted_score: float
     final_resolution: str
     notes: str
     """Action schema for ChargebackOps."""
     action_type: ActionType
+    case_id: str | None = Field(default=None, max_length=64, description="Target case id when applicable")
     system_name: SystemName | None = Field(
         default=None,
         description="System to query when action_type is query_system",
     )
     evidence_ids: list[str] = Field(
         default_factory=list,
+        max_length=20,
         description="Evidence ids to attach or remove",
     )
     strategy: StrategyName | None = Field(
     )
     note: str | None = Field(
         default=None,
+        max_length=500,
         description="Optional short rationale for the action",
     )

server/chargeback_ops_environment.py CHANGED

@@ -175,7 +175,7 @@ class ChargebackOpsEnvironment(
         if action.action_type == "set_strategy":
             return self._set_strategy(case, action.strategy)
         if action.action_type == "submit_representment":
-            return self._submit_representment(case)
         if action.action_type == "resolve_case":
             return self._resolve_case(case, action.strategy)
         raise ValueError(f"Unsupported action_type '{action.action_type}'.")
@@ -306,9 +306,11 @@ class ChargebackOpsEnvironment(
             return 0.03, f"Set an acceptable fallback strategy '{strategy}' for case {case.case_id}."
         return -0.08, f"Set a weak strategy '{strategy}' for case {case.case_id}."
-    def _submit_representment(self, case: InternalCase) -> tuple[float, str]:
         progress = self._progress_by_case[case.case_id]
         progress.submit_attempts += 1
         if progress.current_strategy != "contest":
             raise ValueError("submit_representment requires current strategy to be 'contest'.")
         if progress.resolution_status != "open":
@@ -320,21 +322,25 @@ class ChargebackOpsEnvironment(
         if self._state.step_count > case.deadline_step:
             progress.final_resolution = "contest"
             progress.resolution_status = "lost_late"
             return -0.2, f"Representment for case {case.case_id} was submitted after the deadline."
         if missing:
             progress.final_resolution = "contest"
             progress.resolution_status = "lost_incomplete"
             return -0.18, (
                 f"Representment for case {case.case_id} is incomplete; missing {', '.join(sorted(missing))}."
             )
         if harmful:
             progress.final_resolution = "contest"
             progress.resolution_status = "lost_harmful_evidence"
             return -0.15, (
                 f"Representment for case {case.case_id} included harmful evidence {', '.join(sorted(harmful))}."
             )
         progress.final_resolution = "contest"
         if case.optimal_strategy == "contest":
             progress.resolution_status = "won"
             return 0.2, f"Submitted a strong representment package for case {case.case_id}."
@@ -354,6 +360,7 @@ class ChargebackOpsEnvironment(
             return -0.04, f"Case {case.case_id} is already resolved."
         progress.final_resolution = resolution
         progress.current_strategy = resolution
         progress.resolution_status = (
             "refunded" if resolution == "issue_refund" else "accepted_chargeback"
         )
@@ -457,7 +464,6 @@ class ChargebackOpsEnvironment(
                 reason_code=case.reason_code,
                 guidance=case.policy_guidance,
                 required_evidence=list(case.policy_requirements),
-                recommended_strategy=case.recommended_strategy,
             )
         return VisibleCase(
             case_id=case.case_id,

         if action.action_type == "set_strategy":
             return self._set_strategy(case, action.strategy)
         if action.action_type == "submit_representment":
+            return self._submit_representment(case, note=action.note)
         if action.action_type == "resolve_case":
             return self._resolve_case(case, action.strategy)
         raise ValueError(f"Unsupported action_type '{action.action_type}'.")
             return 0.03, f"Set an acceptable fallback strategy '{strategy}' for case {case.case_id}."
         return -0.08, f"Set a weak strategy '{strategy}' for case {case.case_id}."
+    def _submit_representment(self, case: InternalCase, *, note: str | None = None) -> tuple[float, str]:
         progress = self._progress_by_case[case.case_id]
         progress.submit_attempts += 1
+        if note:
+            progress.representment_note = note
         if progress.current_strategy != "contest":
             raise ValueError("submit_representment requires current strategy to be 'contest'.")
         if progress.resolution_status != "open":
         if self._state.step_count > case.deadline_step:
             progress.final_resolution = "contest"
             progress.resolution_status = "lost_late"
+            progress.resolved_at_step = self._state.step_count
             return -0.2, f"Representment for case {case.case_id} was submitted after the deadline."
         if missing:
             progress.final_resolution = "contest"
             progress.resolution_status = "lost_incomplete"
+            progress.resolved_at_step = self._state.step_count
             return -0.18, (
                 f"Representment for case {case.case_id} is incomplete; missing {', '.join(sorted(missing))}."
             )
         if harmful:
             progress.final_resolution = "contest"
             progress.resolution_status = "lost_harmful_evidence"
+            progress.resolved_at_step = self._state.step_count
             return -0.15, (
                 f"Representment for case {case.case_id} included harmful evidence {', '.join(sorted(harmful))}."
             )
         progress.final_resolution = "contest"
+        progress.resolved_at_step = self._state.step_count
         if case.optimal_strategy == "contest":
             progress.resolution_status = "won"
             return 0.2, f"Submitted a strong representment package for case {case.case_id}."
             return -0.04, f"Case {case.case_id} is already resolved."
         progress.final_resolution = resolution
         progress.current_strategy = resolution
+        progress.resolved_at_step = self._state.step_count
         progress.resolution_status = (
             "refunded" if resolution == "issue_refund" else "accepted_chargeback"
         )
                 reason_code=case.reason_code,
                 guidance=case.policy_guidance,
                 required_evidence=list(case.policy_requirements),
             )
         return VisibleCase(
             case_id=case.case_id,
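
The check order in the `_submit_representment` hunks above can be sketched as follows. This is a minimal illustration rather than the repository's code: `Progress` and `submit` are hypothetical stand-ins for `CaseProgress` and the environment method. The final branch assumes `case.optimal_strategy == "contest"`, because the diff does not show what happens otherwise.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Progress:
    # Hypothetical stand-in for CaseProgress; only the fields used here.
    resolution_status: str = "open"
    final_resolution: str | None = None
    resolved_at_step: int | None = None


def submit(
    progress: Progress,
    *,
    step: int,
    deadline_step: int,
    missing: set[str],
    harmful: set[str],
) -> float:
    """Apply the checks in the order the diff shows: late, incomplete, harmful, won."""
    # Every branch shown in the diff resolves the case as "contest" and
    # records the step at which that happened.
    progress.final_resolution = "contest"
    progress.resolved_at_step = step
    if step > deadline_step:
        progress.resolution_status = "lost_late"
        return -0.2
    if missing:
        progress.resolution_status = "lost_incomplete"
        return -0.18
    if harmful:
        progress.resolution_status = "lost_harmful_evidence"
        return -0.15
    # Assumption: the case's optimal strategy is "contest".
    progress.resolution_status = "won"
    return 0.2


late = Progress()
late_reward = submit(late, step=8, deadline_step=7, missing=set(), harmful=set())

won = Progress()
won_reward = submit(won, step=5, deadline_step=7, missing=set(), harmful=set())
```

Because the deadline check comes first, a late submission is scored as `lost_late` even when evidence is also missing or harmful.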

simulation.py CHANGED

@@ -73,11 +73,13 @@ class CaseProgress:
     current_strategy: StrategyName | None = None
     final_resolution: str | None = None
     resolution_status: str = "open"
     duplicate_queries: int = 0
     invalid_actions: int = 0
     submit_attempts: int = 0
     deadline_penalized: bool = False
     notes: list[str] = field(default_factory=list)
 @dataclass
@@ -224,7 +226,7 @@ TASKS: dict[str, TaskScenario] = {
             "A card-not-present fraud dispute with mixed signals. Strong account-linkage evidence exists, "
             "but payment mismatch artifacts will hurt the case if attached."
         ),
-        max_steps=12,
         cases=(
             InternalCase(
                 case_id="CB-M1",
@@ -238,7 +240,7 @@ TASKS: dict[str, TaskScenario] = {
                     "The order used a known account and device, but AVS/CVV mismatches were present. "
                     "Winning requires emphasizing customer-account linkage and avoiding mismatch artifacts."
                 ),
-                deadline_step=9,
                 optimal_strategy="contest",
                 acceptable_strategies=("accept_chargeback",),
                 policy_guidance=(
@@ -341,7 +343,7 @@ TASKS: dict[str, TaskScenario] = {
             "A real operations queue with three disputes. Two should be actioned quickly, and one should be conceded. "
             "The step budget leaves little room for waste."
         ),
-        max_steps=18,
         cases=(
             InternalCase(
                 case_id="CB-H1",
@@ -582,13 +584,47 @@ TASKS: dict[str, TaskScenario] = {
 def get_task(task_id: str) -> TaskScenario:
-    """Look up a task or raise KeyError."""
-    return TASKS[task_id]
 def list_tasks() -> list[TaskScenario]:
-    """Return tasks in a stable order."""
     ordered_ids = [
         "goods_not_received_easy",

     current_strategy: StrategyName | None = None
     final_resolution: str | None = None
     resolution_status: str = "open"
+    resolved_at_step: int | None = None
     duplicate_queries: int = 0
     invalid_actions: int = 0
     submit_attempts: int = 0
     deadline_penalized: bool = False
     notes: list[str] = field(default_factory=list)
+    representment_note: str | None = None
 @dataclass
             "A card-not-present fraud dispute with mixed signals. Strong account-linkage evidence exists, "
             "but payment mismatch artifacts will hurt the case if attached."
         ),
+        max_steps=10,
         cases=(
             InternalCase(
                 case_id="CB-M1",
                     "The order used a known account and device, but AVS/CVV mismatches were present. "
                     "Winning requires emphasizing customer-account linkage and avoiding mismatch artifacts."
                 ),
+                deadline_step=7,
                 optimal_strategy="contest",
                 acceptable_strategies=("accept_chargeback",),
                 policy_guidance=(
             "A real operations queue with three disputes. Two should be actioned quickly, and one should be conceded. "
             "The step budget leaves little room for waste."
         ),
+        max_steps=15,
         cases=(
             InternalCase(
                 case_id="CB-H1",
 def get_task(task_id: str) -> TaskScenario:
+    """Look up a built-in task or generate one from a ``generated_*`` id."""
+    if task_id in TASKS:
+        return TASKS[task_id]
+    # Support generated task ids: generated_{difficulty}_s{seed}
+    import re
+    m = re.match(r"^generated_(easy|medium|hard)_s(\d+)$", task_id)
+    if m:
+        try:
+            from .case_generator import generate_task
+        except ImportError:  # pragma: no cover
+            from case_generator import generate_task
+        difficulty = m.group(1)
+        seed = int(m.group(2))
+        return generate_task(seed, difficulty=difficulty)  # type: ignore[arg-type]
+    # Support ISO-derived task ids: iso_{difficulty}_{index}
+    m_iso = re.match(r"^iso_(easy|medium|hard)_(\d+)$", task_id)
+    if m_iso:
+        try:
+            from .iso_adapter import build_iso_task, load_iso_rows
+        except ImportError:  # pragma: no cover
+            from iso_adapter import build_iso_task, load_iso_rows
+        difficulty = m_iso.group(1)
+        task_index = int(m_iso.group(2))
+        rows = load_iso_rows()
+        if rows:
+            import random as _rng_mod
+            shuffled = list(rows)
+            _rng_mod.Random(42).shuffle(shuffled)
+            task = build_iso_task(shuffled, difficulty=difficulty, start_index=task_index * 4, task_index=task_index)
+            if task is not None:
+                return task
+    raise ValueError(f"Unknown task_id '{task_id}'. Available: {', '.join(TASKS)}")
 def list_tasks() -> list[TaskScenario]:
+    """Return built-in tasks in a stable order."""
     ordered_ids = [
         "goods_not_received_easy",