Spaces: mitudrudutta/ChargeBackOps

mitudrudutta committed on Apr 2

Commit

5edab41

1 Parent(s): 3af94fa

docs: sync AGENT.md and README with current grading values, add LICENSE

Update stale documentation: strategy fallback 0.55→0.35, outcome
acceptable 0.6→0.4, harmful keywords 6→15, performance tables now
reflect the current 10-task benchmark. Add adversarial evidence and
nightmare difficulty sections to AGENT.md. Add MIT license.

Files changed (4)

AGENT.md +36 -20
LICENSE +21 -0
README.md +23 -22
pyproject.toml +1 -0

AGENT.md CHANGED

@@ -45,8 +45,8 @@ A human dispute analyst handles 50-200 cases per day. They must triage by urgenc
 ChargebackOps is built for the [OpenEnv](https://meta-pytorch.org/OpenEnv/index.html) evaluation framework. It is a **simulated merchant dispute resolution environment** where an AI agent acts as the dispute analyst.
 **What the agent receives:**
-- A queue of 1-4 open dispute cases
-- A step budget (10-20 actions total)
 - Per-case deadlines (must resolve before step N)
 **What the agent must do:**
@@ -311,7 +311,7 @@ When the agent has multiple open cases and the total estimated step cost exceeds
 - **credit_not_processed/duplicate_processing** cost 3 steps and always get optimal score. Handle them first to free budget.
 - **goods_not_received** costs 6 steps and always contests. Handle next.
-- **fraud_cnp/product_not_as_described/service_not_provided** cost 7-8 steps and may need to concede. Handle last -- if budget runs out, conceding these with `issue_refund` (an acceptable fallback) still earns 55% strategy correctness.
 ---
@@ -319,10 +319,12 @@ When the agent has multiple open cases and the total estimated step cost exceeds
 ### Harmful Evidence Detection
-The agent maintains a set of 6 harmful keywords derived from the grading system:
 ```
-mismatch, failed, declined, suspicious, flagged, fraud risk
 ```
 Every evidence item's title and summary are scanned. If any harmful keyword is found, the evidence is:
@@ -339,7 +341,7 @@ Non-harmful evidence is ranked by keyword relevance:
 | 1 | duplicate, delivery, prior, account, authenticated | "Prior good order linkage" |
 | 2 | return policy, refund, cancel, confirmation, cancellation | "Return policy documentation" |
 | 4 (default) | anything else | "Internal memo" |
-| 999 (excluded) | mismatch, failed, declined, suspicious, flagged, fraud risk | "AVS mismatch report" |
 ### Attachment Strategy
@@ -385,7 +387,7 @@ After all cases are resolved (or the step budget is exhausted), the deterministi
 | Outcome | Score |
 |---|---|
 | Chose the optimal strategy | 1.0 |
-| Chose an acceptable fallback | 0.55 |
 | Chose the wrong strategy | 0.0 |
 "Optimal" and "acceptable" strategies are defined per case by the task scenario. For example, `goods_not_received` optimal is always "contest" with no acceptable fallback. `fraud_cnp` optimal might be "contest" with "accept_chargeback" as acceptable, or vice versa.
@@ -404,7 +406,7 @@ For **non-contest** cases where optimal strategy is also non-contest:
 - 0.7 if evidence was attached (unnecessary work)
 For **non-contest** cases where optimal was contest:
-- 0.3 (the agent abandoned evidence gathering for a contestable case)
 ### Packet Validity (15%)
@@ -423,17 +425,22 @@ Binary:
 ### Efficiency (10%)
 ```
-efficiency = 1.0 - min(0.9, duplicate_queries * 0.1 + submit_attempts * 0.05)
 ```
-The agent loses 0.1 per duplicate system query and 0.05 per submit attempt. Minimum efficiency is 0.1.
 ### Outcome Quality (10%)
 | Outcome | Score |
 |---|---|
 | Final resolution matches optimal strategy | 1.0 |
-| Final resolution is an acceptable fallback | 0.6 |
 | Final resolution is wrong | 0.0 |
 ### Note Quality (5%)
@@ -543,6 +550,14 @@ Before any submit, the agent checks for harmful evidence in the attached set. If
 Representment notes are generated with direct references to policy requirement keywords and evidence IDs, maximizing the note_quality score (policy claims coverage 50% + evidence coherence 15%).
 ---
 ## File Map
@@ -560,7 +575,8 @@ Representment notes are generated with direct references to policy requirement k
 | `scenarios/iso_adapter.py` | Converts ISO 20022 CASR.003 records to environment cases | ~160 |
 | `connectors/stripe_sandbox.py` | Maps Stripe test-mode disputes to environment cases | ~280 |
 | `evaluation/agent_brutal_audit.py` | 126-episode evaluation across all data sources | ~300 |
-| `server/app.py` | FastAPI routes: /reset, /step, /state, /tasks, /baseline, /grader, /results | ~200 |
 | `core/episode_store.py` | Thread-safe storage with JSONL file persistence | ~60 |
 | `core/client.py` | OpenEnv WebSocket client | ~100 |
@@ -568,14 +584,14 @@ Representment notes are generated with direct references to policy requirement k
 ## Performance
-Tested across 63 episodes (3 built-in + 60 parametric at 20 seeds per difficulty):
-| Source | Avg Score | Perfect (>= 0.90) | Failed (< 0.50) |
 |---|---|---|---|
-| Built-in (3) | 0.933 | 67% | 0% |
-| Parametric easy (20) | 0.980 | 100% | 0% |
-| Parametric medium (20) | 0.868 | 60% | 0% |
-| Parametric hard (20) | 0.722 | 0% | 0% |
-| **Overall (63)** | **0.861** | **54%** | **0%** |
-The agent scores **zero failures** (no episode below 0.50) and over half of all episodes above 0.90.

 ChargebackOps is built for the [OpenEnv](https://meta-pytorch.org/OpenEnv/index.html) evaluation framework. It is a **simulated merchant dispute resolution environment** where an AI agent acts as the dispute analyst.
 **What the agent receives:**
+- A queue of 1-6 open dispute cases (5-6 at nightmare difficulty)
+- A step budget (10-20 actions total, ~2.4 steps/case at nightmare)
 - Per-case deadlines (must resolve before step N)
 **What the agent must do:**
 - **credit_not_processed/duplicate_processing** cost 3 steps and always get optimal score. Handle them first to free budget.
 - **goods_not_received** costs 6 steps and always contests. Handle next.
+- **fraud_cnp/product_not_as_described/service_not_provided** cost 7-8 steps and may need to concede. Handle last -- if budget runs out, conceding these with `issue_refund` (an acceptable fallback) still earns 35% strategy correctness.
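The ordering rule in the three bullets above can be sketched as a sort by estimated step cost. This is a minimal illustration under stated assumptions, not the repository's code: `STEP_COST`, `triage_order`, and the `reason_code` field are hypothetical names, and the quoted "7-8 steps" range is represented by its upper bound.

```python
# Hypothetical step-cost table built from the figures quoted above.
STEP_COST = {
    "credit_not_processed": 3,
    "duplicate_processing": 3,
    "goods_not_received": 6,
    "fraud_cnp": 8,                 # quoted as 7-8 steps; upper bound assumed
    "product_not_as_described": 8,  # same assumption
    "service_not_provided": 8,      # same assumption
}


def triage_order(cases):
    """Return cases cheapest-first. sorted() is stable, so ties keep queue order."""
    # Unknown reason codes are treated as the most expensive tier (assumption).
    return sorted(cases, key=lambda case: STEP_COST.get(case["reason_code"], 8))


queue = [
    {"case_id": "C1", "reason_code": "fraud_cnp"},
    {"case_id": "C2", "reason_code": "credit_not_processed"},
    {"case_id": "C3", "reason_code": "goods_not_received"},
    {"case_id": "C4", "reason_code": "duplicate_processing"},
]
ordered_ids = [case["case_id"] for case in triage_order(queue)]
print(ordered_ids)
```

Handling the 3-step codes first leaves the expensive, possibly concedable cases for whatever budget remains.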
 ---
 ### Harmful Evidence Detection
+The agent maintains a set of 15 negative-signal keywords derived from real chargeback dispute patterns:
 ```
+mismatch, failed, declined, suspicious, flagged, fraud risk,
+unauthorized, rejected, invalid, expired, violation,
+non-compliant, discrepancy, inconsistent, unverified
 ```
 Every evidence item's title and summary are scanned. If any harmful keyword is found, the evidence is:
 | 1 | duplicate, delivery, prior, account, authenticated | "Prior good order linkage" |
 | 2 | return policy, refund, cancel, confirmation, cancellation | "Return policy documentation" |
 | 4 (default) | anything else | "Internal memo" |
+| 999 (excluded) | mismatch, failed, declined, suspicious, flagged, fraud risk, unauthorized, rejected, invalid, expired, violation, non-compliant, discrepancy, inconsistent, unverified | "AVS mismatch report" |
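The keyword scan and the ranking rows shown above could look roughly like the sketch below. It assumes case-insensitive substring matching over title plus summary, and it uses only the rows visible in this hunk (any rows above priority 1 are not shown here). The names `HARMFUL_KEYWORDS`, `PRIORITY_KEYWORDS`, and `rank_evidence` are hypothetical.

```python
# The 15 negative-signal keywords listed above.
HARMFUL_KEYWORDS = (
    "mismatch", "failed", "declined", "suspicious", "flagged", "fraud risk",
    "unauthorized", "rejected", "invalid", "expired", "violation",
    "non-compliant", "discrepancy", "inconsistent", "unverified",
)

# Only the ranking rows visible in this diff hunk.
PRIORITY_KEYWORDS = {
    1: ("duplicate", "delivery", "prior", "account", "authenticated"),
    2: ("return policy", "refund", "cancel", "confirmation", "cancellation"),
}
DEFAULT_PRIORITY = 4
EXCLUDED_PRIORITY = 999


def rank_evidence(title, summary=""):
    """Return the priority for one evidence item; 999 means do not attach."""
    text = f"{title} {summary}".lower()
    # Harmful keywords win over everything else, including helpful titles.
    if any(keyword in text for keyword in HARMFUL_KEYWORDS):
        return EXCLUDED_PRIORITY
    for priority in sorted(PRIORITY_KEYWORDS):
        if any(keyword in text for keyword in PRIORITY_KEYWORDS[priority]):
            return priority
    return DEFAULT_PRIORITY


print(rank_evidence("Prior good order linkage"))
print(rank_evidence("Delivery verification report", "GPS discrepancy at drop-off"))
```

Scanning the summary as well as the title matters for items with a reassuring title and harmful content. One limitation of plain substring matching: a plural such as "discrepancies" does not contain "discrepancy", so a real implementation might need stemming or extra keyword forms.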
 ### Attachment Strategy
 | Outcome | Score |
 |---|---|
 | Chose the optimal strategy | 1.0 |
+| Chose an acceptable fallback | 0.35 |
 | Chose the wrong strategy | 0.0 |
 "Optimal" and "acceptable" strategies are defined per case by the task scenario. For example, `goods_not_received` optimal is always "contest" with no acceptable fallback. `fraud_cnp` optimal might be "contest" with "accept_chargeback" as acceptable, or vice versa.
 - 0.7 if evidence was attached (unnecessary work)
 For **non-contest** cases where optimal was contest:
+- 0.15 (the agent abandoned evidence gathering for a contestable case)
 ### Packet Validity (15%)
 ### Efficiency (10%)
 ```
+efficiency = 1.0 - min(0.9, (duplicate_queries + invalid_actions) * 0.1 + submit_attempts * 0.05)
 ```
+The agent loses 0.1 per duplicate system query or invalid action, and 0.05 per submit attempt. Minimum efficiency is 0.0.
+Additional penalties for shallow operational behaviour:
+- **Over-querying a concedable case**: -0.15 per system queried beyond the 2nd when the agent concedes a case whose optimal strategy is also non-contest. Querying 4+ systems before conceding is wasteful.
+- **Late policy retrieval**: -0.08 when policy is retrieved but the case is resolved with a concession that matches the optimal non-contest strategy. The policy step was wasted.
+- **Early correct concession bonus**: +0.10 when the agent correctly concedes a case (matching optimal) within 3 steps. Rewards recognising a bad case quickly.
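As a rough sketch, the base formula and the three adjustments above might combine as follows. The source does not say how the adjustments are applied, so adding them to the base score and clamping to [0.0, 1.0] is an assumption, as are the function and parameter names. Note that the base formula alone bottoms out at 0.1; the stated 0.0 minimum is only reachable here once the extra penalties are applied.

```python
def base_efficiency(duplicate_queries, invalid_actions, submit_attempts):
    """Base formula as quoted above; the penalty term is capped at 0.9."""
    penalty = (duplicate_queries + invalid_actions) * 0.1 + submit_attempts * 0.05
    return 1.0 - min(0.9, penalty)


def adjusted_efficiency(base, systems_queried=0, late_policy=False,
                        early_correct_concession=False):
    """Apply the extra penalties/bonus for a correctly conceded case.

    Assumption: adjustments are additive and the result is clamped to [0, 1].
    """
    score = base
    # -0.15 per system queried beyond the 2nd.
    score -= 0.15 * max(0, systems_queried - 2)
    # -0.08 when policy was retrieved but the concession made it unnecessary.
    if late_policy:
        score -= 0.08
    # +0.10 when the correct concession happened within 3 steps.
    if early_correct_concession:
        score += 0.10
    return max(0.0, min(1.0, score))


base = base_efficiency(duplicate_queries=2, invalid_actions=1, submit_attempts=1)
print(round(base, 2))
print(round(adjusted_efficiency(base, systems_queried=4, late_policy=True), 2))
```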
 ### Outcome Quality (10%)
 | Outcome | Score |
 |---|---|
 | Final resolution matches optimal strategy | 1.0 |
+| Final resolution is an acceptable fallback | 0.4 |
 | Final resolution is wrong | 0.0 |
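Both the strategy-correctness table and the outcome-quality table follow the same three-tier pattern and differ only in the fallback value (0.35 and 0.4 respectively). A minimal sketch of that lookup, with a hypothetical `tiered_score` helper and illustrative strategy names:

```python
STRATEGY_FALLBACK = 0.35  # "Chose an acceptable fallback"
OUTCOME_FALLBACK = 0.4    # "Final resolution is an acceptable fallback"


def tiered_score(chosen, optimal, acceptable, fallback_score):
    """1.0 for the optimal choice, fallback_score for an acceptable one, else 0.0."""
    if chosen == optimal:
        return 1.0
    if chosen in acceptable:
        return fallback_score
    return 0.0


# goods_not_received: optimal is "contest" with no acceptable fallback.
print(tiered_score("issue_refund", "contest", (), STRATEGY_FALLBACK))
# fraud_cnp variant: "contest" optimal, "accept_chargeback" acceptable.
print(tiered_score("accept_chargeback", "contest", ("accept_chargeback",),
                   STRATEGY_FALLBACK))
```

The per-case `optimal` and `acceptable` values come from the task scenario; the ones used here are only examples taken from the prose above.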
 ### Note Quality (5%)
 Representment notes are generated with direct references to policy requirement keywords and evidence IDs, maximizing the note_quality score (policy claims coverage 50% + evidence coherence 15%).
+### 6. Adversarial Evidence (Hard/Nightmare)
+At hard and nightmare difficulty, the case generator injects **adversarial evidence** — items whose titles sound helpful ("Delivery verification report", "Account verification summary") but whose content is harmful (GPS discrepancies, prior non-receipt claims, failed 3D Secure challenges). This tests whether the agent reads beyond titles and inspects evidence content before attaching.
+### 7. Nightmare Difficulty
+Nightmare tasks push the step budget to its limit: 5-6 cases with ~2.4 steps per case. The agent must triage aggressively — fast-conceding weak cases, handling deterministic codes first, and accepting that some cases will go unresolved. This tier specifically tests prioritisation under extreme resource pressure.
 ---
 ## File Map
 | `scenarios/iso_adapter.py` | Converts ISO 20022 CASR.003 records to environment cases | ~160 |
 | `connectors/stripe_sandbox.py` | Maps Stripe test-mode disputes to environment cases | ~280 |
 | `evaluation/agent_brutal_audit.py` | 126-episode evaluation across all data sources | ~300 |
+| `server/app.py` | FastAPI routes: /reset, /step, /state, /tasks, /baseline, /grader, /results, /demo | ~200 |
+| `server/demo_ui.py` | Gradio live demo UI with step-by-step episode playback | ~150 |
 | `core/episode_store.py` | Thread-safe storage with JSONL file persistence | ~60 |
 | `core/client.py` | OpenEnv WebSocket client | ~100 |
 ## Performance
+Tested across the 10-task benchmark (3 showcase + 7 seeded holdout):
+| Difficulty | Tasks | Avg Score | Key Observations |
 |---|---|---|---|
+| Easy | 2 | 0.963 | Near-perfect on straightforward cases |
+| Medium | 3 | 0.518 | Agent struggles with ambiguous fraud signals |
+| Hard | 3 | 0.686 | Wrong strategies on adversarial evidence traps |
+| Nightmare | 2 | 0.474 | Step budget exhaustion, 2-3 cases left unresolved |
+| **Overall** | **10** | **0.648** | **Clear difficulty curve from 0.96 to 0.47** |
+The difficulty curve demonstrates the environment discriminates effectively: easy tasks are near-trivial, nightmare tasks push even the heuristic+LLM agent below 50%. The medium-tier drop (0.518) is driven by `fraud_signal_ambiguity` where the agent picks the wrong strategy on genuinely ambiguous evidence.

LICENSE ADDED

@@ -0,0 +1,21 @@

+MIT License
+Copyright (c) 2026 Mitudru Dutta
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md CHANGED

@@ -246,36 +246,37 @@ flowchart LR
     style H fill:#8b4513,color:#fff
 ```
-## Agent Performance (63 Episodes)
-Results from the heuristic agent across built-in and parametric tasks:
-| Source | Avg | >= 0.90 | < 0.50 | Min |
-|---|---|---|---|---|
-| Built-in tasks (3) | 0.933 | 2/3 | 0/3 | 0.865 |
-| Parametric easy (20) | 0.980 | 20/20 | 0/20 | 0.958 |
-| Parametric medium (20) | 0.868 | 12/20 | 0/20 | 0.624 |
-| Parametric hard (20) | 0.722 | 0/20 | 0/20 | 0.559 |
-**Overall: 0.861 avg | 54.0% score >= 0.90 | 0.0% score < 0.50**
 ### Heuristic vs Naive Agent Comparison
-The grading system reliably distinguishes competent behavior from naive strategies:
 | Task | Heuristic Score | Naive Score | Gap |
 |---|---|---|---|
-| `goods_not_received_easy` | 0.9225 | 0.3500 | +0.5725 |
-| `fraud_signal_ambiguity` | 0.7355 | 0.1750 | +0.5605 |
-| `queue_optimization_hard` | 0.7475 | 0.2000 | +0.5475 |
-| `generated_easy_s42` | 0.9725 | 0.2500 | +0.7225 |
-| `generated_medium_s17` | 0.9125 | 0.1750 | +0.7375 |
-| `generated_medium_s99` | 0.8500 | 0.1750 | +0.6750 |
-| `generated_hard_s7` | 0.7600 | 0.2250 | +0.5350 |
-| `generated_hard_s53` | 0.8275 | 0.3550 | +0.4725 |
-| **Average** | **0.8410** | **0.2381** | **+0.6029** |
-The +0.60 gap across all difficulties confirms the environment produces meaningful signal for agent evaluation.
 ## Task Sources

     style H fill:#8b4513,color:#fff
 ```
+## Agent Performance (10-Task Benchmark)
+Results from the heuristic+LLM agent across the full benchmark (3 showcase + 7 seeded holdout):
+| Difficulty | Tasks | Avg Score | Key Observations |
+|---|---|---|---|
+| Easy | 2 | 0.963 | Near-perfect on straightforward cases |
+| Medium | 3 | 0.518 | Struggles with ambiguous fraud signals |
+| Hard | 3 | 0.686 | Wrong strategies on adversarial evidence traps |
+| Nightmare | 2 | 0.474 | Step budget exhaustion, 2-3 cases unresolved |
+| **Overall** | **10** | **0.648** | **Difficulty curve: 0.96 → 0.47** |
 ### Heuristic vs Naive Agent Comparison
+The grading system reliably distinguishes competent behavior from naive strategies. The naive agent blindly selects each case and resolves with `issue_refund`:
 | Task | Heuristic Score | Naive Score | Gap |
 |---|---|---|---|
+| `goods_not_received_easy` | 0.9675 | 0.2800 | +0.6875 |
+| `fraud_signal_ambiguity` | 0.9675 | 0.2800 | +0.6875 |
+| `queue_optimization_hard` | 0.8015 | 0.5454 | +0.2561 |
+| `generated_easy_s42` | 0.9575 | 0.2800 | +0.6775 |
+| `generated_medium_s17` | 0.7276 | 0.7276 | +0.0000 |
+| `generated_medium_s99` | 0.6919 | 0.5049 | +0.1870 |
+| `generated_hard_s7` | 0.6817 | 0.6817 | +0.0000 |
+| `generated_hard_s53` | 0.5238 | 0.5238 | +0.0000 |
+| `generated_nightmare_s31` | 0.5534 | 0.4689 | +0.0845 |
+| `generated_nightmare_s77` | 0.5180 | 0.5009 | +0.0171 |
+| **Average** | **0.7390** | **0.4793** | **+0.2597** |
+The +0.26 gap confirms the environment produces meaningful signal. Tasks where the gap is zero are cases where the optimal strategy is non-contest (accept/refund) — the naive agent accidentally gets the right answer but for the wrong reasons, while the heuristic agent recognises the correct strategy deliberately. The environment scores both the same because the grader rewards outcomes, not reasoning. On contestable cases (easy, fraud_signal_ambiguity), the gap exceeds +0.68.
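The averages in the comparison table can be rechecked directly from its rows. The snippet below only restates the table's numbers; the variable names are illustrative.

```python
# (heuristic, naive) scores copied from the comparison table above.
scores = {
    "goods_not_received_easy": (0.9675, 0.2800),
    "fraud_signal_ambiguity": (0.9675, 0.2800),
    "queue_optimization_hard": (0.8015, 0.5454),
    "generated_easy_s42": (0.9575, 0.2800),
    "generated_medium_s17": (0.7276, 0.7276),
    "generated_medium_s99": (0.6919, 0.5049),
    "generated_hard_s7": (0.6817, 0.6817),
    "generated_hard_s53": (0.5238, 0.5238),
    "generated_nightmare_s31": (0.5534, 0.4689),
    "generated_nightmare_s77": (0.5180, 0.5009),
}

heuristic_avg = round(sum(h for h, _ in scores.values()) / len(scores), 4)
naive_avg = round(sum(n for _, n in scores.values()) / len(scores), 4)
gap_avg = round(heuristic_avg - naive_avg, 4)

# Tasks where both agents receive the same score.
zero_gap_tasks = [task for task, (h, n) in scores.items() if h == n]

print(heuristic_avg, naive_avg, gap_avg)
print(zero_gap_tasks)
```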
 ## Task Sources

pyproject.toml CHANGED

@@ -7,6 +7,7 @@ name = "openenv-chargeback_ops"
 version = "0.1.0"
 description = "ChargebackOps: a real-world OpenEnv environment for merchant dispute handling."
 readme = "README.md"
 requires-python = ">=3.10"
 dependencies = [
     "anthropic>=0.51.0",

 version = "0.1.0"
 description = "ChargebackOps: a real-world OpenEnv environment for merchant dispute handling."
 readme = "README.md"
+license = {text = "MIT"}
 requires-python = ">=3.10"
 dependencies = [
     "anthropic>=0.51.0",