mitudrudutta committed on
Commit
5edab41
·
1 Parent(s): 3af94fa

docs: sync AGENT.md and README with current grading values, add LICENSE

Update stale documentation: strategy fallback 0.55→0.35, outcome
acceptable 0.6→0.4, harmful keywords 6→15, performance tables now
reflect the current 10-task benchmark. Add adversarial evidence and
nightmare difficulty sections to AGENT.md. Add MIT license.

Files changed (4)
  1. AGENT.md +36 -20
  2. LICENSE +21 -0
  3. README.md +23 -22
  4. pyproject.toml +1 -0
AGENT.md CHANGED
@@ -45,8 +45,8 @@ A human dispute analyst handles 50-200 cases per day. They must triage by urgenc
45
  ChargebackOps is built for the [OpenEnv](https://meta-pytorch.org/OpenEnv/index.html) evaluation framework. It is a **simulated merchant dispute resolution environment** where an AI agent acts as the dispute analyst.
46
 
47
  **What the agent receives:**
48
- - A queue of 1-4 open dispute cases
49
- - A step budget (10-20 actions total)
50
  - Per-case deadlines (must resolve before step N)
51
 
52
  **What the agent must do:**
@@ -311,7 +311,7 @@ When the agent has multiple open cases and the total estimated step cost exceeds
311
 
312
  - **credit_not_processed/duplicate_processing** cost 3 steps and always get optimal score. Handle them first to free budget.
313
  - **goods_not_received** costs 6 steps and always contests. Handle next.
314
- - **fraud_cnp/product_not_as_described/service_not_provided** cost 7-8 steps and may need to concede. Handle last -- if budget runs out, conceding these with `issue_refund` (an acceptable fallback) still earns 55% strategy correctness.
315
 
316
  ---
317
 
@@ -319,10 +319,12 @@ When the agent has multiple open cases and the total estimated step cost exceeds
319
 
320
  ### Harmful Evidence Detection
321
 
322
- The agent maintains a set of 6 harmful keywords derived from the grading system:
323
 
324
  ```
325
- mismatch, failed, declined, suspicious, flagged, fraud risk
 
 
326
  ```
327
 
328
  Every evidence item's title and summary are scanned. If any harmful keyword is found, the evidence is:
@@ -339,7 +341,7 @@ Non-harmful evidence is ranked by keyword relevance:
339
  | 1 | duplicate, delivery, prior, account, authenticated | "Prior good order linkage" |
340
  | 2 | return policy, refund, cancel, confirmation, cancellation | "Return policy documentation" |
341
  | 4 (default) | anything else | "Internal memo" |
342
- | 999 (excluded) | mismatch, failed, declined, suspicious, flagged, fraud risk | "AVS mismatch report" |
343
 
344
  ### Attachment Strategy
345
 
@@ -385,7 +387,7 @@ After all cases are resolved (or the step budget is exhausted), the deterministi
385
  | Outcome | Score |
386
  |---|---|
387
  | Chose the optimal strategy | 1.0 |
388
- | Chose an acceptable fallback | 0.55 |
389
  | Chose the wrong strategy | 0.0 |
390
 
391
  "Optimal" and "acceptable" strategies are defined per case by the task scenario. For example, `goods_not_received` optimal is always "contest" with no acceptable fallback. `fraud_cnp` optimal might be "contest" with "accept_chargeback" as acceptable, or vice versa.
@@ -404,7 +406,7 @@ For **non-contest** cases where optimal strategy is also non-contest:
404
  - 0.7 if evidence was attached (unnecessary work)
405
 
406
  For **non-contest** cases where optimal was contest:
407
- - 0.3 (the agent abandoned evidence gathering for a contestable case)
408
 
409
  ### Packet Validity (15%)
410
 
@@ -423,17 +425,22 @@ Binary:
423
  ### Efficiency (10%)
424
 
425
  ```
426
- efficiency = 1.0 - min(0.9, duplicate_queries * 0.1 + submit_attempts * 0.05)
427
  ```
428
 
429
- The agent loses 0.1 per duplicate system query and 0.05 per submit attempt. Minimum efficiency is 0.1.
430
 
431
  ### Outcome Quality (10%)
432
 
433
  | Outcome | Score |
434
  |---|---|
435
  | Final resolution matches optimal strategy | 1.0 |
436
- | Final resolution is an acceptable fallback | 0.6 |
437
  | Final resolution is wrong | 0.0 |
438
 
439
  ### Note Quality (5%)
@@ -543,6 +550,14 @@ Before any submit, the agent checks for harmful evidence in the attached set. If
543
 
544
  Representment notes are generated with direct references to policy requirement keywords and evidence IDs, maximizing the note_quality score (policy claims coverage 50% + evidence coherence 15%).
545
 
546
  ---
547
 
548
  ## File Map
@@ -560,7 +575,8 @@ Representment notes are generated with direct references to policy requirement k
560
  | `scenarios/iso_adapter.py` | Converts ISO 20022 CASR.003 records to environment cases | ~160 |
561
  | `connectors/stripe_sandbox.py` | Maps Stripe test-mode disputes to environment cases | ~280 |
562
  | `evaluation/agent_brutal_audit.py` | 126-episode evaluation across all data sources | ~300 |
563
- | `server/app.py` | FastAPI routes: /reset, /step, /state, /tasks, /baseline, /grader, /results | ~200 |
 
564
  | `core/episode_store.py` | Thread-safe storage with JSONL file persistence | ~60 |
565
  | `core/client.py` | OpenEnv WebSocket client | ~100 |
566
 
@@ -568,14 +584,14 @@ Representment notes are generated with direct references to policy requirement k
568
 
569
  ## Performance
570
 
571
- Tested across 63 episodes (3 built-in + 60 parametric at 20 seeds per difficulty):
572
 
573
- | Source | Avg Score | Perfect (>= 0.90) | Failed (< 0.50) |
574
  |---|---|---|---|
575
- | Built-in (3) | 0.933 | 67% | 0% |
576
- | Parametric easy (20) | 0.980 | 100% | 0% |
577
- | Parametric medium (20) | 0.868 | 60% | 0% |
578
- | Parametric hard (20) | 0.722 | 0% | 0% |
579
- | **Overall (63)** | **0.861** | **54%** | **0%** |
580
 
581
- The agent scores **zero failures** (no episode below 0.50) and over half of all episodes above 0.90.
 
45
  ChargebackOps is built for the [OpenEnv](https://meta-pytorch.org/OpenEnv/index.html) evaluation framework. It is a **simulated merchant dispute resolution environment** where an AI agent acts as the dispute analyst.
46
 
47
  **What the agent receives:**
48
+ - A queue of 1-6 open dispute cases (5-6 at nightmare difficulty)
49
+ - A step budget (10-20 actions total, ~2.4 steps/case at nightmare)
50
  - Per-case deadlines (must resolve before step N)
51
 
52
  **What the agent must do:**
 
311
 
312
  - **credit_not_processed/duplicate_processing** cost 3 steps and always get optimal score. Handle them first to free budget.
313
  - **goods_not_received** costs 6 steps and always contests. Handle next.
314
+ - **fraud_cnp/product_not_as_described/service_not_provided** cost 7-8 steps and may need to concede. Handle last -- if budget runs out, conceding these with `issue_refund` (an acceptable fallback) still earns 35% strategy correctness.
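
The ordering rule in the list above can be sketched as a cheapest-first plan. This is a minimal illustration, not the agent's actual code: `STEP_COST` and `plan_queue` are hypothetical names, and the 7-8 step codes are assumed to cost 8 so the estimate stays conservative.

```python
# Hypothetical step-cost table built from the figures quoted above.
# The 7-8 step reason codes use the upper bound (8) as a conservative estimate.
STEP_COST = {
    "credit_not_processed": 3,
    "duplicate_processing": 3,
    "goods_not_received": 6,
    "fraud_cnp": 8,
    "product_not_as_described": 8,
    "service_not_provided": 8,
}


def plan_queue(reason_codes, step_budget):
    """Order cases cheapest-first, then split them into those that fit
    within the step budget and those that overflow it."""
    ordered = sorted(reason_codes, key=lambda code: STEP_COST[code])  # stable sort
    planned, overflow, used = [], [], 0
    for code in ordered:
        cost = STEP_COST[code]
        if used + cost <= step_budget:
            planned.append(code)
            used += cost
        else:
            overflow.append(code)  # candidates for a fast concession
    return planned, overflow


planned, overflow = plan_queue(
    ["fraud_cnp", "goods_not_received", "credit_not_processed"], step_budget=10
)
```

With a budget of 10 steps, the two cheaper cases fit (3 + 6 = 9) and `fraud_cnp` overflows, which is the situation where the text suggests conceding with `issue_refund`.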
315
 
316
  ---
317
 
 
319
 
320
  ### Harmful Evidence Detection
321
 
322
+ The agent maintains a set of 15 negative-signal keywords derived from real chargeback dispute patterns:
323
 
324
  ```
325
+ mismatch, failed, declined, suspicious, flagged, fraud risk,
326
+ unauthorized, rejected, invalid, expired, violation,
327
+ non-compliant, discrepancy, inconsistent, unverified
328
  ```
329
 
330
  Every evidence item's title and summary are scanned. If any harmful keyword is found, the evidence is:
 
341
  | 1 | duplicate, delivery, prior, account, authenticated | "Prior good order linkage" |
342
  | 2 | return policy, refund, cancel, confirmation, cancellation | "Return policy documentation" |
343
  | 4 (default) | anything else | "Internal memo" |
344
+ | 999 (excluded) | mismatch, failed, declined, suspicious, flagged, fraud risk, unauthorized, rejected, invalid, expired, violation, non-compliant, discrepancy, inconsistent, unverified | "AVS mismatch report" |
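
A minimal sketch of the harmful-keyword scan described above. The keyword list is taken from the document; the function name `is_harmful` and the use of case-insensitive substring matching are assumptions, since the source does not say how the match is performed.

```python
# The 15 negative-signal keywords listed in AGENT.md.
HARMFUL_KEYWORDS = (
    "mismatch", "failed", "declined", "suspicious", "flagged", "fraud risk",
    "unauthorized", "rejected", "invalid", "expired", "violation",
    "non-compliant", "discrepancy", "inconsistent", "unverified",
)


def is_harmful(title, summary):
    """Return True if the title or summary contains any harmful keyword.

    Assumption: matching is a case-insensitive substring test.
    """
    text = f"{title} {summary}".lower()
    return any(keyword in text for keyword in HARMFUL_KEYWORDS)


# A helpful-sounding title does not hide a harmful summary,
# because both fields are scanned.
adversarial = is_harmful("Delivery verification report", "GPS discrepancy at drop-off")
benign = is_harmful("Prior good order linkage", "Two earlier orders delivered")
```

Here `adversarial` is `True` and `benign` is `False`.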
345
 
346
  ### Attachment Strategy
347
 
 
387
  | Outcome | Score |
388
  |---|---|
389
  | Chose the optimal strategy | 1.0 |
390
+ | Chose an acceptable fallback | 0.35 |
391
  | Chose the wrong strategy | 0.0 |
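
The table above amounts to a three-way lookup. The sketch below is illustrative only: `strategy_score` and its parameters are hypothetical names, and strategies are assumed to be plain strings such as `"contest"` or `"accept_chargeback"`.

```python
def strategy_score(chosen, optimal, acceptable=()):
    """Score a chosen strategy against the per-case optimal and fallback set."""
    if chosen == optimal:
        return 1.0
    if chosen in acceptable:
        return 0.35  # acceptable fallback value from the table above
    return 0.0


optimal_pick = strategy_score("contest", "contest")
fallback_pick = strategy_score("accept_chargeback", "contest", ("accept_chargeback",))
wrong_pick = strategy_score("issue_refund", "contest")
```

A case with no acceptable fallback is modelled here by leaving `acceptable` empty, so any non-optimal choice scores 0.0.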
392
 
393
  "Optimal" and "acceptable" strategies are defined per case by the task scenario. For example, `goods_not_received` optimal is always "contest" with no acceptable fallback. `fraud_cnp` optimal might be "contest" with "accept_chargeback" as acceptable, or vice versa.
 
406
  - 0.7 if evidence was attached (unnecessary work)
407
 
408
  For **non-contest** cases where optimal was contest:
409
+ - 0.15 (the agent abandoned evidence gathering for a contestable case)
410
 
411
  ### Packet Validity (15%)
412
 
 
425
  ### Efficiency (10%)
426
 
427
  ```
428
+ efficiency = 1.0 - min(0.9, (duplicate_queries + invalid_actions) * 0.1 + submit_attempts * 0.05)
429
  ```
430
 
431
+ The agent loses 0.1 per duplicate system query or invalid action, and 0.05 per submit attempt. Minimum efficiency is 0.0.
432
+
433
+ Additional penalties for shallow operational behaviour:
434
+ - **Over-querying a concedable case**: -0.15 per system queried beyond the 2nd when the agent concedes a case whose optimal strategy is also non-contest. Querying 4+ systems before conceding is wasteful.
435
+ - **Late policy retrieval**: -0.08 when policy is retrieved but the case is resolved with a concession that matches the optimal non-contest strategy. The policy step was wasted.
436
+ - **Early correct concession bonus**: +0.10 when the agent correctly concedes a case (matching optimal) within 3 steps. Rewards recognising a bad case quickly.
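
For reference, the base formula above can be written as a small function. This sketch covers only the base term; the three adjustments in the list are left out because the source does not state how they combine with the cap. The name `base_efficiency` is hypothetical.

```python
def base_efficiency(duplicate_queries, invalid_actions, submit_attempts):
    """Base efficiency term, before the extra penalties and bonus listed above."""
    penalty = (duplicate_queries + invalid_actions) * 0.1 + submit_attempts * 0.05
    return 1.0 - min(0.9, penalty)  # the base penalty is capped at 0.9


clean_run = base_efficiency(0, 0, 1)   # a single submit attempt
sloppy_run = base_efficiency(2, 1, 1)  # 2 duplicates, 1 invalid action, 1 submit
```

A clean run with one submit evaluates to about 0.95, and the sloppier run to about 0.65.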
437
 
438
  ### Outcome Quality (10%)
439
 
440
  | Outcome | Score |
441
  |---|---|
442
  | Final resolution matches optimal strategy | 1.0 |
443
+ | Final resolution is an acceptable fallback | 0.4 |
444
  | Final resolution is wrong | 0.0 |
445
 
446
  ### Note Quality (5%)
 
550
 
551
  Representment notes are generated with direct references to policy requirement keywords and evidence IDs, maximizing the note_quality score (policy claims coverage 50% + evidence coherence 15%).
552
 
553
+ ### 6. Adversarial Evidence (Hard/Nightmare)
554
+
555
+ At hard and nightmare difficulty, the case generator injects **adversarial evidence** — items whose titles sound helpful ("Delivery verification report", "Account verification summary") but whose content is harmful (GPS discrepancies, prior non-receipt claims, failed 3D Secure challenges). This tests whether the agent reads beyond titles and inspects evidence content before attaching.
556
+
557
+ ### 7. Nightmare Difficulty
558
+
559
+ Nightmare tasks push the step budget to its limit: 5-6 cases with ~2.4 steps per case. The agent must triage aggressively — fast-conceding weak cases, handling deterministic codes first, and accepting that some cases will go unresolved. This tier specifically tests prioritisation under extreme resource pressure.
560
+
561
  ---
562
 
563
  ## File Map
 
575
  | `scenarios/iso_adapter.py` | Converts ISO 20022 CASR.003 records to environment cases | ~160 |
576
  | `connectors/stripe_sandbox.py` | Maps Stripe test-mode disputes to environment cases | ~280 |
577
  | `evaluation/agent_brutal_audit.py` | 126-episode evaluation across all data sources | ~300 |
578
+ | `server/app.py` | FastAPI routes: /reset, /step, /state, /tasks, /baseline, /grader, /results, /demo | ~200 |
579
+ | `server/demo_ui.py` | Gradio live demo UI with step-by-step episode playback | ~150 |
580
  | `core/episode_store.py` | Thread-safe storage with JSONL file persistence | ~60 |
581
  | `core/client.py` | OpenEnv WebSocket client | ~100 |
582
 
 
584
 
585
  ## Performance
586
 
587
+ Tested across the 10-task benchmark (3 showcase + 7 seeded holdout):
588
 
589
+ | Difficulty | Tasks | Avg Score | Key Observations |
590
  |---|---|---|---|
591
+ | Easy | 2 | 0.963 | Near-perfect on straightforward cases |
592
+ | Medium | 3 | 0.518 | Agent struggles with ambiguous fraud signals |
593
+ | Hard | 3 | 0.686 | Wrong strategies on adversarial evidence traps |
594
+ | Nightmare | 2 | 0.474 | Step budget exhaustion, 2-3 cases left unresolved |
595
+ | **Overall** | **10** | **0.648** | **Clear difficulty curve from 0.96 to 0.47** |
596
 
597
+ The difficulty curve demonstrates the environment discriminates effectively: easy tasks are near-trivial, nightmare tasks push even the heuristic+LLM agent below 50%. The medium-tier drop (0.518) is driven by `fraud_signal_ambiguity` where the agent picks the wrong strategy on genuinely ambiguous evidence.
LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Mitudru Dutta
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
README.md CHANGED
@@ -246,36 +246,37 @@ flowchart LR
246
  style H fill:#8b4513,color:#fff
247
  ```
248
 
249
- ## Agent Performance (63 Episodes)
250
 
251
- Results from the heuristic agent across built-in and parametric tasks:
252
 
253
- | Source | Avg | >= 0.90 | < 0.50 | Min |
254
- |---|---|---|---|---|
255
- | Built-in tasks (3) | 0.933 | 2/3 | 0/3 | 0.865 |
256
- | Parametric easy (20) | 0.980 | 20/20 | 0/20 | 0.958 |
257
- | Parametric medium (20) | 0.868 | 12/20 | 0/20 | 0.624 |
258
- | Parametric hard (20) | 0.722 | 0/20 | 0/20 | 0.559 |
259
-
260
- **Overall: 0.861 avg | 54.0% score >= 0.90 | 0.0% score < 0.50**
261
 
262
  ### Heuristic vs Naive Agent Comparison
263
 
264
- The grading system reliably distinguishes competent behavior from naive strategies:
265
 
266
  | Task | Heuristic Score | Naive Score | Gap |
267
  |---|---|---|---|
268
- | `goods_not_received_easy` | 0.9225 | 0.3500 | +0.5725 |
269
- | `fraud_signal_ambiguity` | 0.7355 | 0.1750 | +0.5605 |
270
- | `queue_optimization_hard` | 0.7475 | 0.2000 | +0.5475 |
271
- | `generated_easy_s42` | 0.9725 | 0.2500 | +0.7225 |
272
- | `generated_medium_s17` | 0.9125 | 0.1750 | +0.7375 |
273
- | `generated_medium_s99` | 0.8500 | 0.1750 | +0.6750 |
274
- | `generated_hard_s7` | 0.7600 | 0.2250 | +0.5350 |
275
- | `generated_hard_s53` | 0.8275 | 0.3550 | +0.4725 |
276
- | **Average** | **0.8410** | **0.2381** | **+0.6029** |
277
-
278
- The +0.60 gap across all difficulties confirms the environment produces meaningful signal for agent evaluation.
 
 
279
 
280
  ## Task Sources
281
 
 
246
  style H fill:#8b4513,color:#fff
247
  ```
248
 
249
+ ## Agent Performance (10-Task Benchmark)
250
 
251
+ Results from the heuristic+LLM agent across the full benchmark (3 showcase + 7 seeded holdout):
252
 
253
+ | Difficulty | Tasks | Avg Score | Key Observations |
254
+ |---|---|---|---|
255
+ | Easy | 2 | 0.963 | Near-perfect on straightforward cases |
256
+ | Medium | 3 | 0.518 | Struggles with ambiguous fraud signals |
257
+ | Hard | 3 | 0.686 | Wrong strategies on adversarial evidence traps |
258
+ | Nightmare | 2 | 0.474 | Step budget exhaustion, 2-3 cases unresolved |
259
+ | **Overall** | **10** | **0.648** | **Difficulty curve: 0.96 → 0.47** |
 
260
 
261
  ### Heuristic vs Naive Agent Comparison
262
 
263
+ The grading system reliably distinguishes competent behavior from naive strategies. The naive agent blindly selects each case and resolves with `issue_refund`:
264
 
265
  | Task | Heuristic Score | Naive Score | Gap |
266
  |---|---|---|---|
267
+ | `goods_not_received_easy` | 0.9675 | 0.2800 | +0.6875 |
268
+ | `fraud_signal_ambiguity` | 0.9675 | 0.2800 | +0.6875 |
269
+ | `queue_optimization_hard` | 0.8015 | 0.5454 | +0.2561 |
270
+ | `generated_easy_s42` | 0.9575 | 0.2800 | +0.6775 |
271
+ | `generated_medium_s17` | 0.7276 | 0.7276 | +0.0000 |
272
+ | `generated_medium_s99` | 0.6919 | 0.5049 | +0.1870 |
273
+ | `generated_hard_s7` | 0.6817 | 0.6817 | +0.0000 |
274
+ | `generated_hard_s53` | 0.5238 | 0.5238 | +0.0000 |
275
+ | `generated_nightmare_s31` | 0.5534 | 0.4689 | +0.0845 |
276
+ | `generated_nightmare_s77` | 0.5180 | 0.5009 | +0.0171 |
277
+ | **Average** | **0.7390** | **0.4793** | **+0.2597** |
278
+
279
+ The +0.26 average gap confirms the environment produces meaningful signal. Tasks where the gap is zero are cases where the optimal strategy is non-contest (accept/refund): the naive agent accidentally gets the right answer for the wrong reasons, while the heuristic agent recognises the correct strategy deliberately. The environment scores both the same because the grader rewards outcomes, not reasoning. On contestable cases (easy, fraud_signal_ambiguity), the gap is about +0.68 to +0.69.
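
The averages and the zero-gap tasks can be recomputed from the table rows. The snippet below simply copies the published scores into a dictionary (the name `SCORES` is illustrative) and is meant as an arithmetic check, not as part of the benchmark code.

```python
from statistics import mean

# (heuristic score, naive score) per task, copied from the table above.
SCORES = {
    "goods_not_received_easy": (0.9675, 0.2800),
    "fraud_signal_ambiguity": (0.9675, 0.2800),
    "queue_optimization_hard": (0.8015, 0.5454),
    "generated_easy_s42": (0.9575, 0.2800),
    "generated_medium_s17": (0.7276, 0.7276),
    "generated_medium_s99": (0.6919, 0.5049),
    "generated_hard_s7": (0.6817, 0.6817),
    "generated_hard_s53": (0.5238, 0.5238),
    "generated_nightmare_s31": (0.5534, 0.4689),
    "generated_nightmare_s77": (0.5180, 0.5009),
}

heuristic_avg = mean(h for h, _ in SCORES.values())
naive_avg = mean(n for _, n in SCORES.values())
gap = heuristic_avg - naive_avg

# Tasks where both agents receive the same score.
zero_gap_tasks = [task for task, (h, n) in SCORES.items() if h == n]
```

The recomputed means agree with the table's **Average** row to four decimal places, and `zero_gap_tasks` lists the three tasks with a +0.0000 gap.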
280
 
281
  ## Task Sources
282
 
pyproject.toml CHANGED
@@ -7,6 +7,7 @@ name = "openenv-chargeback_ops"
7
  version = "0.1.0"
8
  description = "ChargebackOps: a real-world OpenEnv environment for merchant dispute handling."
9
  readme = "README.md"
 
10
  requires-python = ">=3.10"
11
  dependencies = [
12
  "anthropic>=0.51.0",
 
7
  version = "0.1.0"
8
  description = "ChargebackOps: a real-world OpenEnv environment for merchant dispute handling."
9
  readme = "README.md"
10
+ license = {text = "MIT"}
11
  requires-python = ">=3.10"
12
  dependencies = [
13
  "anthropic>=0.51.0",