mitudrudutta committed on
Commit
83eb290
·
1 Parent(s): a6b0c55

feat(connectors): add Stripe sandbox connector for dispute processing

- Introduced `stripe_sandbox.py` to map Stripe test-mode dispute objects into `InternalCase` and `TaskScenario`.
- Implemented functions to fetch disputes, build evidence, and infer strategies based on dispute reasons and statuses.
- Added synthetic dispute generation for testing when Stripe API is unavailable.

fix(episode_store): limit stored reports to a maximum count

- Added a maximum report limit of 100 in the episode store to prevent excessive memory usage.
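A cap like this is commonly enforced by evicting the oldest report once the limit is reached. The sketch below is illustrative only: `BoundedReportStore`, `MAX_REPORTS`, and the oldest-first eviction policy are assumptions, since the changed `episode_store.py` code is not shown in this commit view and may work differently.

```python
from collections import OrderedDict
from threading import Lock
from typing import Any, Optional

MAX_REPORTS = 100  # the limit named in the commit message


class BoundedReportStore:
    """Hypothetical store that keeps at most `max_reports` episode reports."""

    def __init__(self, max_reports: int = MAX_REPORTS) -> None:
        self._max_reports = max_reports
        self._reports: OrderedDict = OrderedDict()
        self._lock = Lock()

    def put(self, episode_id: str, report: dict[str, Any]) -> None:
        with self._lock:
            # Re-inserting an existing id moves it to the newest position.
            self._reports.pop(episode_id, None)
            self._reports[episode_id] = report
            # Evict the oldest entries until the store is back under the cap.
            while len(self._reports) > self._max_reports:
                self._reports.popitem(last=False)

    def get(self, episode_id: str) -> Optional[dict[str, Any]]:
        with self._lock:
            return self._reports.get(episode_id)

    def __len__(self) -> int:
        with self._lock:
            return len(self._reports)
```

With this shape, memory use is bounded by the cap regardless of how many episodes a long-running server processes.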

feat(grading): enhance representment note scoring

- Added `grade_representment_note` function to evaluate the quality of representment notes based on required claims, harmful evidence, and substance.
- Updated `score_case` to incorporate representment note quality into overall case scoring.
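To make the three inputs named above concrete (required claims, harmful evidence, substance), here is a minimal sketch of how such a note score could be assembled. It is not the repository's `grade_representment_note`: the function name `sketch_note_score`, the substring matching, the 0.7/0.3 split, and the 0.25 penalty are all assumptions chosen for illustration.

```python
def sketch_note_score(
    note: str,
    required_claims: list[str],
    harmful_phrases: list[str],
    min_words: int = 20,
) -> float:
    """Hypothetical note score in [0.0, 1.0]; not the repository's grader."""
    text = note.lower()
    words = text.split()
    if not words:
        return 0.0

    # Coverage: fraction of required claims that the note mentions.
    if required_claims:
        hits = sum(1 for claim in required_claims if claim.lower() in text)
        coverage = hits / len(required_claims)
    else:
        coverage = 1.0

    # Substance: notes shorter than `min_words` earn proportionally less.
    substance = min(len(words) / min_words, 1.0)

    # Penalty: each harmful phrase cited in the note removes a fixed amount.
    penalty = 0.25 * sum(1 for phrase in harmful_phrases if phrase.lower() in text)

    raw = 0.7 * coverage + 0.3 * substance - penalty
    return max(0.0, min(1.0, raw))
```

A score built this way stays deterministic, which matters if it feeds into a reproducible case score.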

feat(iso_adapter): implement ISO 20022 chargeback CSV processing

- Created `iso_adapter.py` to convert ISO 20022 chargeback CSV rows into `InternalCase` and `TaskScenario` objects.
- Mapped chargeback reasons to internal codes and built evidence based on the CSV data.

refactor(models): remove unnecessary fields and improve validation

- Removed `recommended_strategy` from `PolicyView`.
- Added max length constraints to `case_id`, `evidence_ids`, and `note` fields in `ChargebackOpsAction`.

fix(server): update representment submission to include notes

- Modified `_submit_representment` to accept an optional note parameter for better tracking of representment rationale.

chore(simulation): add resolved step tracking to case progress

- Introduced `resolved_at_step` to `CaseProgress` to track the step at which a case was resolved.

README.md CHANGED
@@ -8,376 +8,271 @@ tags:
8
 
9
  # ChargebackOps
10
 
11
- ChargebackOps is a real-world OpenEnv environment for merchant-side chargeback operations. An agent acts as a dispute analyst, works a queue of payment disputes, investigates evidence across synthetic internal systems, chooses whether to contest or concede, and is graded on recovery quality, deadline handling, and operational discipline.
12
 
13
- The environment is designed for the Round 1 OpenEnv problem statement:
14
-
15
- - Real-world task, not a game or toy
16
- - Typed OpenEnv models and `reset()` / `step()` / `state()` support
17
- - Three graded tasks with easy, medium, and hard difficulty
18
- - Dense reward shaping with partial progress and negative signals
19
- - Root-level `inference.py` that uses the OpenAI client contract
20
- - Docker and Hugging Face Spaces deployment path
21
 
22
  ## Why This Environment Matters
23
 
24
- Merchant dispute handling is a real operations workflow. Analysts do not just classify a ticket or answer a question. They must:
25
 
26
- - inspect the dispute reason code and the response deadline
27
- - gather evidence from the right internal systems
28
- - avoid attaching evidence that weakens the case
29
- - choose whether to contest, accept, or refund
30
- - maximize recovery across a queue under limited time
31
 
32
- That makes ChargebackOps a strong benchmark for tool-using agents. It tests retrieval, decision-making, prioritization, and operational restraint in a controlled environment with deterministic scoring.
33
 
34
- ## System Architecture
35
 
36
  ```mermaid
37
- flowchart LR
38
- A["Agent or inference.py"] --> B["OpenAI-compatible client<br/>API_BASE_URL + MODEL_NAME + HF_TOKEN"]
39
- A --> C["ChargebackOps HTTP API"]
40
- C --> D["OpenEnv server<br/>server.app"]
41
- D --> E["ChargebackOpsEnvironment<br/>step / reset / state"]
42
- E --> F["Task simulator<br/>simulation.py"]
43
- E --> G["Dense reward shaping<br/>server/chargeback_ops_environment.py"]
44
- E --> H["Deterministic grader<br/>grading.py"]
45
- H --> I["Episode report store<br/>episode_store.py"]
46
- D --> J["Utility routes<br/>/tasks /grader /baseline /health"]
47
  ```
48
 
49
  ## Episode Workflow
50
 
51
  ```mermaid
52
  flowchart TD
53
- A["reset(task_id)"] --> B["Select the next case from the queue"]
54
- B --> C["Inspect case metadata"]
55
- C --> D["Retrieve policy guidance"]
56
- D --> E["Query merchant systems<br/>orders, payment, shipping, support, refunds, risk"]
57
- E --> F["Attach or remove evidence"]
58
- F --> G["Set strategy"]
59
- G --> H{"contest?"}
60
- H -->|yes| I["submit_representment"]
61
- H -->|no| J["resolve_case<br/>accept_chargeback or issue_refund"]
62
- I --> K{"all cases resolved or max steps reached?"}
63
- J --> K
64
- K -->|no| B
65
- K -->|yes| L["grader computes final score 0.0 to 1.0"]
66
  ```
67
 
68
- ## Environment Design
69
-
70
- ### Internal systems
71
-
72
- The environment exposes evidence gradually from six synthetic merchant systems:
73
-
74
- - `orders`
75
- - `payment`
76
- - `shipping`
77
- - `support`
78
- - `refunds`
79
- - `risk`
80
-
81
- Each task contains hidden ground truth about:
82
-
83
- - optimal strategy per case
84
- - acceptable fallback strategies
85
- - required evidence
86
- - helpful evidence
87
- - harmful evidence
88
- - deadline pressure
89
- - case weight in the final score
90
-
91
- ### OpenEnv contract
92
-
93
- | Method | Behavior |
94
- | --- | --- |
95
- | `reset(task_id=...)` | starts a fresh episode and returns the initial typed observation |
96
- | `step(action)` | applies one typed action and returns the next observation with reward and done |
97
- | `state()` | returns the current typed internal state |
98
-
99
- Core runtime files:
100
-
101
- - [`models.py`](/home/btwitsvoid/Documents/Agents/ChargeBackOps/models.py)
102
- - [`server/chargeback_ops_environment.py`](/home/btwitsvoid/Documents/Agents/ChargeBackOps/server/chargeback_ops_environment.py)
103
- - [`server/app.py`](/home/btwitsvoid/Documents/Agents/ChargeBackOps/server/app.py)
104
- - [`openenv.yaml`](/home/btwitsvoid/Documents/Agents/ChargeBackOps/openenv.yaml)
105
-
106
- ## Typed Spaces
107
-
108
- ### Action space
109
-
110
- | Action | Purpose |
111
- | --- | --- |
112
- | `select_case` | focus a case from the queue |
113
- | `inspect_case` | reveal analyst notes for the selected case |
114
- | `query_system` | pull evidence from one merchant system |
115
- | `retrieve_policy` | reveal reason-code guidance and required evidence |
116
- | `add_evidence` | attach retrieved evidence to the current package |
117
- | `remove_evidence` | remove evidence, including harmful attachments |
118
- | `set_strategy` | choose `contest`, `accept_chargeback`, or `issue_refund` |
119
- | `submit_representment` | submit a contest package for a contested case |
120
- | `resolve_case` | close a non-contest case with acceptance or refund |
121
-
122
- ### Observation space
123
-
124
- Each observation includes:
125
-
126
- - task metadata: id, title, difficulty, objective
127
- - current queue with deadlines and case summaries
128
- - currently selected case
129
- - visible evidence and policy data
130
- - available actions
131
- - `steps_remaining`
132
- - `progress_score`
133
- - `last_action_result`
134
- - optional terminal `grader_report`
135
-
136
- ### State space
137
-
138
- The environment state exposes:
139
-
140
- - current episode id and step count
141
- - public queue resolution state
142
- - action history
143
- - latest grade estimate
144
- - final grader report once complete
145
-
146
- ## Task Suite
147
-
148
- | Task ID | Title | Difficulty | Objective |
149
- | --- | --- | --- | --- |
150
- | `goods_not_received_easy` | Delivered But Disputed | easy | contest a straightforward goods-not-received case with delivery proof |
151
- | `fraud_signal_ambiguity` | Fraud Signal Ambiguity | medium | handle a card-not-present fraud dispute with mixed evidence and harmful artifacts |
152
- | `queue_optimization_hard` | Dispute Queue Optimization | hard | maximize recovery across a multi-case queue under tight step and deadline pressure |
153
-
154
- Difficulty progression is deliberate:
155
-
156
- - Easy teaches the standard representment loop.
157
- - Medium introduces ambiguity and evidence curation.
158
- - Hard adds queue prioritization, step-budget pressure, and opportunity cost.
159
-
160
- ## Reward Design
161
-
162
- ChargebackOps provides dense per-step feedback and a terminal bonus. The environment rewards progress and penalizes obviously bad operations behavior.
163
-
164
- Positive signals include:
165
-
166
- - selecting and inspecting the right case
167
- - retrieving policy guidance
168
- - querying systems that expose useful evidence
169
- - attaching helpful or required evidence
170
- - setting the optimal strategy
171
- - submitting a complete representment on time
172
- - resolving a case with the optimal non-contest strategy
173
-
174
- Negative signals include:
175
-
176
- - invalid actions
177
- - duplicate system queries
178
- - attaching harmful evidence
179
- - removing helpful evidence
180
- - weak strategy choices
181
- - submitting incomplete or late representments
182
- - missing deadlines on still-open cases
183
-
184
- At episode end, the environment adds a terminal bonus proportional to the deterministic grader score.
185
-
186
- ## Grading
187
-
188
- Each finished episode is scored in `[0.0, 1.0]` by the deterministic grader in [`grading.py`](/home/btwitsvoid/Documents/Agents/ChargeBackOps/grading.py).
189
 
190
- Per-case weighting:
191
-
192
- | Component | Weight |
193
- | --- | --- |
194
- | strategy correctness | 0.25 |
195
- | evidence quality | 0.25 |
196
- | packet validity | 0.15 |
197
- | deadline compliance | 0.15 |
198
- | efficiency | 0.10 |
199
- | outcome quality | 0.10 |
200
-
201
- The hard task aggregates multiple case scores by case weight and normalizes the final result to `0.0` to `1.0`.
202
-
203
- ## Inference and Model Providers
204
-
205
- The required root inference entry point is [`inference.py`](/home/btwitsvoid/Documents/Agents/ChargeBackOps/inference.py). It uses the OpenAI Python client with the challenge-compatible environment variables:
206
-
207
- - `API_BASE_URL`
208
- - `MODEL_NAME`
209
- - `HF_TOKEN`
210
-
211
- Default configuration:
212
-
213
- - provider path: OpenRouter
214
- - model: `openai/gpt-oss-120b`
215
 
216
- Also supported through the same OpenAI-compatible client pattern:
217
 
218
- - OpenAI
219
- - Anthropic-compatible gateways
220
- - Groq
221
- - OpenRouter
222
 
223
- The repository also keeps optional direct keys for convenience in [`.env.example`](/home/btwitsvoid/Documents/Agents/ChargeBackOps/.env.example):
224
 
225
- - `OPENAI_API_KEY`
226
- - `ANTHROPIC_API_KEY`
227
- - `GROQ_API_KEY`
228
- - `OPENROUTER_API_KEY`
229
 
230
- ### OpenRouter referer
231
 
232
- Leave `OPENROUTER_HTTP_REFERER` empty during local development. Once the app is deployed, set it to the public app URL, for example:
233
 
234
- ```bash
235
- OPENROUTER_HTTP_REFERER=https://your-space-name.hf.space
236
- OPENROUTER_APP_TITLE=ChargebackOps
237
- ```
238
 
239
- ## Baseline Results
240
 
241
- The repository includes two baseline entry points:
242
 
243
- - [`inference.py`](/home/btwitsvoid/Documents/Agents/ChargeBackOps/inference.py) for the challenge contract
244
- - [`baseline_runner.py`](/home/btwitsvoid/Documents/Agents/ChargeBackOps/baseline_runner.py) for direct local runs and the `/baseline` endpoint
245
 
246
- Verified local heuristic-fallback baseline scores are documented below after the latest validation pass:
247
 
248
- | Task | Score |
249
- | --- | --- |
250
- | Delivered But Disputed | `0.7075` |
251
- | Fraud Signal Ambiguity | `0.7075` |
252
- | Dispute Queue Optimization | `0.7271` |
253
- | Average | `0.7140` |
254
 
255
- These values are replaced after each validation run so the README reflects real, reproducible output from the current codebase.
256
 
257
- ## API Surface
258
 
259
- The FastAPI app exposes:
260
 
261
- - `GET /` basic service ping
262
- - `GET /health` health check
263
- - `GET /docs` interactive OpenAPI docs
264
- - `POST /reset` start a new episode
265
- - `POST /step` advance the environment
266
- - `GET /state` inspect the current state
267
- - `GET /tasks` enumerate tasks and the action schema
268
- - `GET /grader` or `POST /grader` fetch the last completed episode grade
269
- - `GET /baseline` or `POST /baseline` run the bundled baseline
270
 
271
- ## Local Setup
272
 
273
- ### 1. Install dependencies
274
 
275
- Using `uv`:
276
 
277
  ```bash
278
  uv sync --extra dev
279
  ```
280
 
281
- Using `pip`:
282
-
283
- ```bash
284
- python -m pip install -e ".[dev]"
285
- ```
286
-
287
- ### 2. Configure environment variables
288
 
289
  ```bash
290
  cp .env.example .env
291
  ```
292
 
293
- At minimum, configure:
294
-
295
- ```bash
296
- API_BASE_URL=https://openrouter.ai/api/v1
297
- MODEL_NAME=openai/gpt-oss-120b
298
- HF_TOKEN=your_provider_key
299
- ```
300
-
301
- ### 3. Run the test and validation suite
302
 
303
  ```bash
304
  pytest -q tests
305
  openenv validate .
306
- python inference.py
307
  ```
308
 
309
- ### 4. Start the server locally
310
 
311
  ```bash
312
  uvicorn server.app:app --host 0.0.0.0 --port 8000
313
  ```
314
 
315
- ## Docker
316
-
317
- Build and run the root Docker image:
318
 
319
  ```bash
320
  docker build -t chargebackops .
321
  docker run --rm -p 8000:8000 --env-file .env chargebackops
322
  ```
323
 
324
- Once the container is running:
325
 
326
- ```bash
327
- curl http://localhost:8000/
328
- curl http://localhost:8000/tasks
329
- curl http://localhost:8000/health
330
- ```
331
 
332
- ## Hugging Face Spaces Deployment
333
 
334
- ChargebackOps is configured as a Docker Space through the YAML frontmatter in this README.
335
 
336
- Recommended deployment steps:
337
 
338
- 1. Create a new Hugging Face Space with `Docker` as the SDK.
339
- 2. Push this repository to the Space.
340
- 3. Add the runtime variables in Space Settings:
341
- - `API_BASE_URL`
342
- - `MODEL_NAME`
343
- - `HF_TOKEN`
344
- 4. If using OpenRouter, add:
345
- - `OPENROUTER_HTTP_REFERER=https://your-space-name.hf.space`
346
- - `OPENROUTER_APP_TITLE=ChargebackOps`
347
- 5. Verify:
348
- - `/`
349
- - `/health`
350
- - `/tasks`
351
- - `/docs`
352
- - `/baseline`
353
 
354
- ## Validation Checklist
355
 
356
- - `pytest -q tests`
357
- - `openenv validate .`
358
- - `python inference.py`
359
- - `docker build -t chargebackops .`
360
- - `docker run --rm -p 8000:8000 --env-file .env chargebackops`
361
 
362
  ## Project Layout
363
 
364
- ```text
365
  .
366
- ├── baseline_runner.py
367
- ├── client.py
368
- ├── grading.py
369
- ├── inference.py
370
- ├── models.py
371
- ├── openenv.yaml
372
  ├── server/
373
- │ ├── app.py
374
- │ └── chargeback_ops_environment.py
375
- ├── simulation.py
376
- └── tests/
377
  ```
378
-
379
- ## Notes
380
-
381
- - This is a synthetic benchmark environment, not a live payments integration.
382
- - The world state is deterministic by design so graders remain reproducible.
383
- - Live model quality still depends on the quota and reliability of the configured provider.
8
 
9
  # ChargebackOps
10
 
11
+ A production-grade OpenEnv environment for merchant-side chargeback dispute operations. An AI agent acts as a dispute analyst investigating evidence across internal systems, choosing whether to contest or concede, and maximizing financial recovery under deadline and step-budget pressure.
12
 
13
+ Built for the [OpenEnv Hackathon](https://openenv.org/) Round 1 challenge.
14
 
15
  ## Why This Environment Matters
16
 
17
+ Chargeback dispute handling is a real operations workflow that costs merchants **$125 billion annually**. Analysts must:
18
 
19
+ - Parse reason codes and assess representment deadlines
20
+ - Gather evidence from the right merchant systems while avoiding harmful artifacts
21
+ - Decide whether to contest, accept, or refund — under time pressure
22
+ - Prioritize cases in a multi-dispute queue by deadline urgency and financial impact
23
 
24
+ This makes ChargebackOps a strong benchmark for tool-using agents. It tests retrieval, decision-making, prioritization, and operational restraint in a controlled environment with deterministic scoring.
25
 
26
+ ## Architecture
27
 
28
  ```mermaid
29
+ graph TB
30
+ subgraph Agent Layer
31
+ INF[inference.py<br/>OpenAI-compatible client]
32
+ BL[baseline_runner.py<br/>Heuristic policy]
33
+ end
34
+
35
+ subgraph API Layer
36
+ APP[FastAPI server<br/>server/app.py]
37
+ WS[OpenEnv WebSocket<br/>client.py]
38
+ end
39
+
40
+ subgraph Environment Core
41
+ ENV[ChargebackOpsEnvironment<br/>step / reset / state]
42
+ SIM[Simulation Engine<br/>simulation.py]
43
+ GRD[Deterministic Grader<br/>grading.py]
44
+ STORE[Episode Store<br/>episode_store.py]
45
+ end
46
+
47
+ subgraph Task Sources
48
+ FIXED[Built-in Tasks<br/>3 handcrafted scenarios]
49
+ GEN[Parametric Generator<br/>case_generator.py]
50
+ ISO[ISO 20022 Adapter<br/>iso_adapter.py]
51
+ STRIPE[Stripe Connector<br/>connectors/stripe_sandbox.py]
52
+ end
53
+
54
+ subgraph Merchant Systems
55
+ ORD[Orders]
56
+ PAY[Payment]
57
+ SHIP[Shipping]
58
+ SUP[Support]
59
+ REF[Refunds]
60
+ RISK[Risk]
61
+ end
62
+
63
+ INF --> APP
64
+ BL --> ENV
65
+ APP --> ENV
66
+ WS --> APP
67
+ ENV --> SIM
68
+ ENV --> GRD
69
+ GRD --> STORE
70
+ SIM --> FIXED
71
+ SIM --> GEN
72
+ SIM --> ISO
73
+ SIM --> STRIPE
74
+ ENV --> ORD
75
+ ENV --> PAY
76
+ ENV --> SHIP
77
+ ENV --> SUP
78
+ ENV --> REF
79
+ ENV --> RISK
80
  ```
81
 
82
  ## Episode Workflow
83
 
84
  ```mermaid
85
  flowchart TD
86
+ A[reset&#40;task_id&#41;] --> B[Select case from queue]
87
+ B --> C{Reason code<br/>deterministic?}
88
+ C -->|Yes| D[Skip policy retrieval<br/>Infer strategy directly]
89
+ C -->|No| E[Retrieve policy guidance]
90
+ D --> F[Query merchant systems<br/>for evidence]
91
+ E --> F
92
+ F --> G[Attach relevant evidence<br/>Avoid harmful artifacts]
93
+ G --> H[Set strategy]
94
+ H --> I{Strategy?}
95
+ I -->|contest| J[Generate representment note<br/>Submit package]
96
+ I -->|accept / refund| K[Resolve case]
97
+ J --> L{More open cases?}
98
+ K --> L
99
+ L -->|Yes| M{Deadline urgency?}
100
+ M -->|Urgent| N[Switch to urgent case<br/>Fast-resolve]
101
+ M -->|Normal| B
102
+ N --> L
103
+ L -->|No / Max steps| O[Grader computes<br/>final score 0.0 - 1.0]
104
+
105
+ style A fill:#2d5016,color:#fff
106
+ style O fill:#1a3a5c,color:#fff
107
+ style N fill:#8b0000,color:#fff
108
  ```
109
 
110
+ ## Grading Dimensions
111
 
112
+ ```mermaid
113
+ pie title Case Score Weights
114
+ "Strategy Correctness" : 25
115
+ "Evidence Quality" : 20
116
+ "Packet Validity" : 15
117
+ "Deadline Compliance" : 15
118
+ "Efficiency" : 10
119
+ "Outcome Quality" : 10
120
+ "Note Quality" : 5
121
+ ```
122
 
123
+ Each case is scored across seven dimensions and weighted by financial impact. The episode score normalizes across all cases to `[0.0, 1.0]`.
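The aggregation described here can be sketched as a weighted sum per case followed by an impact-weighted average per episode. The weights come from the pie chart above. The dictionary keys, the function names, and the use of the disputed amount as the impact weight are assumptions for illustration, not the contents of `grading.py`.

```python
# Weights taken from the "Case Score Weights" pie chart; they sum to 1.0.
CASE_WEIGHTS = {
    "strategy_correctness": 0.25,
    "evidence_quality": 0.20,
    "packet_validity": 0.15,
    "deadline_compliance": 0.15,
    "efficiency": 0.10,
    "outcome_quality": 0.10,
    "note_quality": 0.05,
}


def case_score(dimensions: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each expected in [0.0, 1.0]."""
    return sum(w * dimensions.get(name, 0.0) for name, w in CASE_WEIGHTS.items())


def episode_score(cases: list[tuple[float, float]]) -> float:
    """Average case scores weighted by impact (assumed: the disputed amount).

    Each item is a hypothetical (case_score, impact_weight) pair.
    """
    total = sum(weight for _, weight in cases)
    if total <= 0:
        return 0.0
    return sum(score * weight for score, weight in cases) / total
```

Because every dimension and every weight is bounded, the episode score produced this way stays within `[0.0, 1.0]` without extra clamping.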
124
 
125
+ ## Agent Performance (126 Episodes)
126
 
127
+ Results from the heuristic agent tested across all data sources:
128
 
129
+ | Source | Easy | Medium | Hard |
130
+ |---|---|---|---|
131
+ | Built-in tasks | 0.968 | 0.960 | 0.778 |
132
+ | Parametric (20 seeds) | 0.957 | 0.844 | 0.706 |
133
+ | ISO 20022 real data (20 each) | 0.977 | 0.812 | 0.605 |
134
+ | Stripe live API | 0.980 | 0.887 | 0.577 |
135
 
136
+ **Overall: 0.819 avg across 126 episodes | 43.7% score >= 0.90 | 5.6% score < 0.50**
137
 
138
+ Heuristic vs bad-control gap: **+0.503** (threshold for "strong": 0.15)
139
 
140
+ ## Task Sources
141
 
142
+ ### Built-in Scenarios (3 tasks)
143
 
144
+ | Task ID | Difficulty | Objective |
145
+ |---|---|---|
146
+ | `goods_not_received_easy` | Easy | Contest a goods-not-received case with delivery proof |
147
+ | `fraud_signal_ambiguity` | Medium | Handle CNP fraud with mixed evidence and harmful artifacts |
148
+ | `queue_optimization_hard` | Hard | Maximize recovery across a multi-case queue under deadline pressure |
149
 
150
+ ### Parametric Generator (`case_generator.py`)
151
 
152
+ Generates an unbounded number of reproducible tasks from a seeded RNG across 6 reason-code families. Task IDs follow the pattern `generated_{difficulty}_s{seed}` (e.g., `generated_hard_s42`).
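As a rough illustration of the ID format, the sketch below parses a generated task ID back into its difficulty and seed. The helper name `parse_generated_task_id` and the assumption that difficulty is one of `easy`, `medium`, or `hard` are mine; `case_generator.py` may validate IDs differently.

```python
import re
from typing import NamedTuple


class GeneratedTaskId(NamedTuple):
    difficulty: str
    seed: int


# Assumed: three difficulty levels and a non-negative integer seed.
_TASK_ID_RE = re.compile(r"^generated_(easy|medium|hard)_s(\d+)$")


def parse_generated_task_id(task_id: str) -> GeneratedTaskId:
    """Split a generated task ID into (difficulty, seed) or raise ValueError."""
    match = _TASK_ID_RE.match(task_id)
    if match is None:
        raise ValueError(f"not a generated task id: {task_id!r}")
    return GeneratedTaskId(difficulty=match.group(1), seed=int(match.group(2)))
```

Since the seed is carried in the ID itself, the same ID can be replayed later to reproduce an identical task.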
153
 
154
+ ### ISO 20022 Real Data (`iso_adapter.py`)
155
 
156
+ Converts 300 real chargeback records from ISO 20022 CASR.003 format into environment cases. Covers fraud, goods-not-received, duplicate processing, credit-not-processed, product-not-as-described, and service-not-provided disputes.
157
 
158
+ ### Stripe Sandbox (`connectors/stripe_sandbox.py`)
159
 
160
+ Maps Stripe test-mode dispute objects into environment cases. Supports live API access with `STRIPE_API_KEY` or falls back to synthetic Stripe-format disputes.
161
 
162
+ ## Action Space
163
 
164
+ | Action | Purpose |
165
+ |---|---|
166
+ | `select_case` | Focus a case from the dispute queue |
167
+ | `inspect_case` | Reveal analyst inspection notes |
168
+ | `query_system` | Pull evidence from a merchant system |
169
+ | `retrieve_policy` | Get reason-code guidance and required evidence |
170
+ | `add_evidence` | Attach retrieved evidence to the representment package |
171
+ | `remove_evidence` | Remove evidence (including harmful attachments) |
172
+ | `set_strategy` | Choose `contest`, `accept_chargeback`, or `issue_refund` |
173
+ | `submit_representment` | Submit a contest package with an optional rationale note |
174
+ | `resolve_case` | Close a non-contest case |
175
 
176
+ ## Quick Start
177
 
178
+ ### Install
179
 
180
  ```bash
181
  uv sync --extra dev
182
+ # or
183
+ pip install -e ".[dev]"
184
  ```
185
 
186
+ ### Configure
187
 
188
  ```bash
189
  cp .env.example .env
190
+ # Edit .env with your provider keys
191
  ```
192
 
193
+ ### Validate
194
 
195
  ```bash
196
  pytest -q tests
197
  openenv validate .
198
+ python baseline_runner.py
199
+ python agent_brutal_audit.py
200
  ```
201
 
202
+ ### Run Server
203
 
204
  ```bash
205
  uvicorn server.app:app --host 0.0.0.0 --port 8000
206
  ```
207
 
208
+ ### Docker
209
 
210
  ```bash
211
  docker build -t chargebackops .
212
  docker run --rm -p 8000:8000 --env-file .env chargebackops
213
  ```
214
 
215
+ ## API Endpoints
216
 
217
+ | Method | Path | Description |
218
+ |---|---|---|
219
+ | `GET` | `/` | Service info |
220
+ | `GET` | `/health` | Health check |
221
+ | `GET` | `/docs` | Interactive OpenAPI docs |
222
+ | `POST` | `/reset` | Start a new episode |
223
+ | `POST` | `/step` | Apply an action |
224
+ | `GET` | `/state` | Current environment state |
225
+ | `GET` | `/tasks` | List available tasks |
226
+ | `GET` | `/generate` | Generate parametric tasks |
227
+ | `GET/POST` | `/grader` | Fetch latest episode grade |
228
+ | `GET/POST` | `/baseline` | Run the heuristic baseline |
229
 
230
+ ## Inference Contract
231
 
232
+ The required entry point [`inference.py`](inference.py) uses the OpenAI-compatible client with:
233
 
234
+ ```bash
235
+ API_BASE_URL=https://openrouter.ai/api/v1
236
+ MODEL_NAME=openai/gpt-oss-120b
237
+ HF_TOKEN=your_key
238
+ ```
239
 
240
+ Supported providers: OpenRouter, OpenAI, Groq, Anthropic-compatible gateways.
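One way to read these three variables is sketched below, using only the standard library. `InferenceConfig` and `load_inference_config` are hypothetical names, and falling back to the example values when a variable is unset is an assumption; `inference.py` may handle missing values differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class InferenceConfig:
    api_base_url: str
    model_name: str
    api_key: str


def load_inference_config(env: Optional[Mapping[str, str]] = None) -> InferenceConfig:
    """Read the three contract variables; defaults mirror the example values."""
    env = os.environ if env is None else env
    api_key = env.get("HF_TOKEN", "")
    if not api_key:
        # Fail early instead of sending unauthenticated requests.
        raise RuntimeError("HF_TOKEN is not set")
    return InferenceConfig(
        api_base_url=env.get("API_BASE_URL", "https://openrouter.ai/api/v1"),
        model_name=env.get("MODEL_NAME", "openai/gpt-oss-120b"),
        api_key=api_key,
    )
```

The resulting values would typically be passed to the OpenAI Python client, for example `OpenAI(base_url=cfg.api_base_url, api_key=cfg.api_key)`, with `cfg.model_name` supplied per request.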
241
 
242
+ ## Hugging Face Deployment
243
 
244
+ 1. Create a new HF Space with **Docker** SDK
245
+ 2. Push this repository
246
+ 3. Set secrets in Space Settings: `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`
247
+ 4. Verify: `/health`, `/tasks`, `/baseline`
248
 
249
  ## Project Layout
250
 
251
+ ```
252
  .
253
+ ├── openenv.yaml # OpenEnv spec
254
+ ├── models.py # Pydantic action/observation/state models
255
+ ├── simulation.py # Task definitions and case progress
256
+ ├── grading.py # Deterministic 7-dimension grader
257
+ ├── baseline_runner.py # Heuristic agent with LLM fallback
258
+ ├── inference.py # Challenge-compatible inference entry
259
+ ├── case_generator.py # Parametric seeded task generator
260
+ ├── iso_adapter.py # ISO 20022 real data adapter
261
+ ├── agent_brutal_audit.py # Comprehensive agent evaluation
262
+ ├── client.py # OpenEnv WebSocket client
263
+ ├── episode_store.py # Thread-safe episode report store
264
+ ├── connectors/
265
+ │ └── stripe_sandbox.py # Stripe test-mode connector
266
  ├── server/
267
+ │ ├── app.py # FastAPI application
268
+ │ └── chargeback_ops_environment.py # Core environment
269
+ ├── tests/
270
+ │ ├── test_env.py # Environment + generator tests
271
+ │ ├── test_grader.py # Grading logic tests
272
+ │ ├── test_api.py # API endpoint tests
273
+ │ ├── test_requirements.py # Problem statement compliance
274
+ │ └── test_agent_audit.py # Audit validation tests
275
+ ├── Dockerfile # Production container
276
+ ├── pyproject.toml # Package config
277
+ └── .env.example # Environment variable template
278
  ```
connectors/__init__.py ADDED
File without changes
connectors/stripe_sandbox.py ADDED
@@ -0,0 +1,300 @@
+ """Stripe sandbox connector for ChargebackOps.
+
+ Maps Stripe test-mode dispute objects into ``InternalCase`` / ``TaskScenario``
+ so real Stripe dispute flows can be processed through the environment.
+
+ Usage::
+
+     export STRIPE_API_KEY=sk_test_...
+     from connectors.stripe_sandbox import fetch_disputes, build_stripe_task
+
+     disputes = fetch_disputes(limit=10)
+     task = build_stripe_task(disputes, difficulty="medium")
+ """
+
+ from __future__ import annotations
+
+ import hashlib
+ import os
+ import random
+ from typing import Any
+
+ try:
+     from ..simulation import (
+         InternalCase,
+         InternalEvidence,
+         TaskScenario,
+         SystemName,
+         StrategyName,
+     )
+ except ImportError:  # pragma: no cover
+     from simulation import (
+         InternalCase,
+         InternalEvidence,
+         TaskScenario,
+         SystemName,
+         StrategyName,
+     )
+
+ _STRIPE_REASON_MAP: dict[str, str] = {
+     "fraudulent": "fraud_cnp",
+     "unrecognized": "fraud_cnp",
+     "product_not_received": "goods_not_received",
+     "product_unacceptable": "product_not_as_described",
+     "duplicate": "duplicate_processing",
+     "subscription_canceled": "credit_not_processed",
+     "credit_not_processed": "credit_not_processed",
+     "general": "goods_not_received",
+     "service_not_as_described": "service_not_provided",
+ }
+
+ _STRIPE_STATUS_WON = {"won"}
+ _STRIPE_STATUS_LOST = {"lost"}
+ _STRIPE_STATUS_OPEN = {
+     "needs_response",
+     "under_review",
+     "warning_needs_response",
+     "warning_under_review",
+     "warning_closed",
+     "charge_refunded",
+ }
+
+ _POLICY_GUIDANCE: dict[str, str] = {
+     "goods_not_received": "Prove fulfillment with order confirmation and carrier delivery evidence.",
+     "fraud_cnp": "Contest only with prior account linkage and device history. Do not attach mismatch artifacts.",
+     "product_not_as_described": "Contest when listing accurately represents the product and customer bypassed returns.",
+     "service_not_provided": "Contest when provider records confirm service delivery.",
+     "credit_not_processed": "Refund immediately or concede. Contesting is not supportable.",
+     "duplicate_processing": "Refund the duplicate charge immediately. Do not contest.",
+ }
+
+ _POLICY_REQS: dict[str, tuple[str, ...]] = {
+     "goods_not_received": ("order confirmation", "carrier delivery confirmation"),
+     "fraud_cnp": ("prior good order linkage", "customer account confirmation"),
+     "product_not_as_described": ("product listing verification", "return policy documentation"),
+     "service_not_provided": ("service completion record", "customer acknowledgment"),
+     "credit_not_processed": ("proof of cancellation request", "refund status check"),
+     "duplicate_processing": ("payment transaction log", "duplicate confirmation"),
+ }
+
+
+ def _ev(eid: str, system: SystemName, title: str, summary: str,
+         *, helpful: bool = False, harmful: bool = False, required: bool = False) -> InternalEvidence:
+     return InternalEvidence(
+         evidence_id=eid, source_system=system, title=title,
+         summary=summary, helpful=helpful, harmful=harmful, required=required,
+     )
+
+
+ def _infer_strategy(reason_code: str, stripe_status: str) -> tuple[str, tuple[str, ...]]:
+     """Infer optimal strategy from Stripe dispute status."""
+     # These reason codes should always refund — contesting is never supportable.
+     if reason_code in ("credit_not_processed", "duplicate_processing"):
+         return "issue_refund", ("accept_chargeback",)
+     if stripe_status in _STRIPE_STATUS_WON:
+         return "contest", ()
+     if stripe_status in _STRIPE_STATUS_LOST:
+         return "accept_chargeback", ("issue_refund",)
+     return "contest", ()
+
+
101
+ def _build_evidence(
102
+ prefix: str,
103
+ reason_code: str,
104
+ amount: float,
105
+ currency: str,
106
+ metadata: dict[str, Any],
107
+ optimal: str,
108
+ rng: random.Random,
109
+ ) -> tuple[dict[SystemName, tuple[InternalEvidence, ...]], tuple[str, ...], tuple[str, ...], tuple[str, ...]]:
110
+ by_sys: dict[SystemName, list[InternalEvidence]] = {
111
+ s: [] for s in ("orders", "payment", "shipping", "support", "refunds", "risk")
112
+ }
113
+ req: list[str] = []
114
+ hlp: list[str] = []
115
+ hrm: list[str] = []
116
+
117
+ desc = metadata.get("description", f"Stripe dispute for {amount} {currency}")
118
+
119
+ if reason_code == "goods_not_received":
120
+ e = _ev(f"{prefix}-ORDER", "orders", "Order confirmation", f"Order for {amount} {currency}.", helpful=True, required=True)
121
+ by_sys["orders"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
122
+ by_sys["payment"].append(_ev(f"{prefix}-AUTH", "payment", "Payment capture", "Stripe charge captured."))
123
+ if optimal == "contest":
124
+ e = _ev(f"{prefix}-DELIVERY", "shipping", "Delivery confirmation", "Carrier confirms delivery.", helpful=True, required=True)
125
+ by_sys["shipping"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
126
+ else:
127
+ by_sys["shipping"].append(_ev(f"{prefix}-NOTRACK", "shipping", "Tracking", "No delivery confirmation."))
128
+ by_sys["refunds"].append(_ev(f"{prefix}-REFUND", "refunds", "Refund ledger", "No refund issued."))
129
+
130
+ elif reason_code == "fraud_cnp":
131
+ by_sys["orders"].append(_ev(f"{prefix}-ORDER", "orders", "Order receipt", f"Order for {amount} {currency}.", helpful=True))
132
+ hlp.append(f"{prefix}-ORDER")
133
+ e_avs = _ev(f"{prefix}-AVS", "payment", "AVS check", "AVS mismatch at authorization.", harmful=True)
134
+ by_sys["payment"].append(e_avs); hrm.append(e_avs.evidence_id)
135
+ by_sys["payment"].append(_ev(f"{prefix}-AUTH", "payment", "Payment capture", "Stripe charge captured."))
136
+ if optimal == "contest":
137
+ e = _ev(f"{prefix}-PRIOR", "risk", "Prior account activity", "Same account with prior fulfilled orders.", helpful=True, required=True)
138
+ by_sys["risk"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
139
+ e = _ev(f"{prefix}-CHAT", "support", "Customer verification", "Customer confirmed order via support.", helpful=True, required=True)
140
+ by_sys["support"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
141
+ else:
142
+ by_sys["risk"].append(_ev(f"{prefix}-RISK", "risk", "Risk summary", "No positive account history."))
143
+ by_sys["refunds"].append(_ev(f"{prefix}-REFUND", "refunds", "Refund ledger", "No refund issued."))
144
+
145
+ elif reason_code == "product_not_as_described":
146
+ e = _ev(f"{prefix}-ORDER", "orders", "Order details", f"Order for {amount} {currency} — SKU matches.", helpful=True, required=True)
147
+ by_sys["orders"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
148
+ e = _ev(f"{prefix}-LISTING", "orders", "Product listing", "Listing matches manufacturer specs.", helpful=True, required=True)
149
+ by_sys["orders"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
150
+ by_sys["payment"].append(_ev(f"{prefix}-AUTH", "payment", "Payment capture", "Settled at listed price."))
151
+ by_sys["shipping"].append(_ev(f"{prefix}-DELIVERY", "shipping", "Delivery confirmation", "Delivered.", helpful=True))
152
+ hlp.append(f"{prefix}-DELIVERY")
153
+ by_sys["refunds"].append(_ev(f"{prefix}-REFUND", "refunds", "Refund ledger", "No refund processed."))
154
+
155
+ elif reason_code == "service_not_provided":
156
+ e = _ev(f"{prefix}-BOOKING", "orders", "Service booking", f"Booking for {amount} {currency}.", helpful=True, required=True)
157
+ by_sys["orders"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
158
+ by_sys["payment"].append(_ev(f"{prefix}-AUTH", "payment", "Payment record", "Stripe charge captured."))
159
+ if optimal == "contest":
160
+ e = _ev(f"{prefix}-COMPLETION", "support", "Service completion", "Service marked completed.", helpful=True, required=True)
161
+ by_sys["support"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
162
+ by_sys["refunds"].append(_ev(f"{prefix}-REFUND", "refunds", "Refund ledger", "No refund issued."))
163
+
164
+ elif reason_code in ("credit_not_processed", "duplicate_processing"):
165
+ by_sys["orders"].append(_ev(f"{prefix}-ORDER", "orders", "Invoice", f"Charge of {amount} {currency}."))
166
+ by_sys["payment"].append(_ev(f"{prefix}-PAYMENT", "payment", "Payment", "Stripe charge settled."))
167
+ by_sys["support"].append(_ev(f"{prefix}-REQ", "support", "Customer request", desc[:100], helpful=True))
168
+ hlp.append(f"{prefix}-REQ")
169
+ by_sys["refunds"].append(_ev(f"{prefix}-NOREFUND", "refunds", "Refund ledger", "No refund processed.", helpful=True))
170
+ hlp.append(f"{prefix}-NOREFUND")
171
+
172
+ frozen = {k: tuple(v) for k, v in by_sys.items()}
173
+ return frozen, tuple(req), tuple(hlp), tuple(hrm)
174
+
175
+
176
+ def dispute_to_case(dispute: dict[str, Any], case_index: int, *, deadline_step: int = 8) -> InternalCase | None:
177
+ """Convert a Stripe dispute object to an InternalCase."""
178
+ stripe_reason = dispute.get("reason", "general")
179
+ reason_code = _STRIPE_REASON_MAP.get(stripe_reason)
180
+ if reason_code is None:
181
+ return None
182
+
183
+ amount = dispute.get("amount", 0) / 100.0 # Stripe amounts are in the smallest currency unit (cents for USD)
184
+ currency = dispute.get("currency", "usd").upper()
185
+ status = dispute.get("status", "needs_response")
186
+ metadata = dispute.get("metadata", {})
187
+ dispute_id = dispute.get("id", f"dp_{case_index}")
188
+
189
+ optimal, acceptable = _infer_strategy(reason_code, status)
190
+ rng = random.Random(int(hashlib.sha256(dispute_id.encode()).hexdigest()[:8], 16))
191
+ prefix = f"STRIPE{case_index}"
192
+
193
+ evidence, req_ids, hlp_ids, hrm_ids = _build_evidence(
194
+ prefix, reason_code, amount, currency, metadata, optimal, rng,
195
+ )
196
+
197
+ guidance = _POLICY_GUIDANCE.get(reason_code, "")
198
+ if optimal in ("accept_chargeback", "issue_refund") and reason_code not in ("credit_not_processed", "duplicate_processing"):
199
+ guidance = f"Do not contest this {reason_code.replace('_', ' ')} dispute. Concede to avoid wasting resources."
200
+
201
+ return InternalCase(
202
+ case_id=f"CB-STRIPE{case_index}",
203
+ order_id=dispute.get("charge", f"ch_stripe{case_index}"),
204
+ customer_id=f"CUST-STRIPE{case_index}",
205
+ amount=amount,
206
+ currency=currency,
207
+ reason_code=reason_code,
208
+ summary=dispute.get("evidence_details", {}).get("due_by_reason", f"Stripe dispute: {stripe_reason}"),
209
+ inspection_notes=f"Stripe dispute {dispute_id} — {stripe_reason}. Status: {status}.",
210
+ deadline_step=deadline_step,
211
+ optimal_strategy=optimal,
212
+ acceptable_strategies=acceptable,
213
+ policy_guidance=guidance,
214
+ policy_requirements=_POLICY_REQS.get(reason_code, ()),
215
+ recommended_strategy=optimal,
216
+ resolution_summary=f"Stripe dispute status: {status}.",
217
+ weight=round(1.0 + (amount / 5000.0), 2),
218
+ required_evidence_ids=req_ids,
219
+ helpful_evidence_ids=hlp_ids,
220
+ harmful_evidence_ids=hrm_ids,
221
+ evidence_by_system=evidence,
222
+ )
223
+
224
+
225
+ def build_stripe_task(
226
+ disputes: list[dict[str, Any]],
227
+ *,
228
+ difficulty: str = "medium",
229
+ task_index: int = 0,
230
+ ) -> TaskScenario | None:
231
+ """Build a TaskScenario from a list of Stripe dispute objects."""
232
+ case_count = {"easy": 1, "medium": 2, "hard": 3}.get(difficulty, 2)
233
+ max_steps = {"easy": 10, "medium": 12, "hard": max(12, case_count * 5)}.get(difficulty, 12)
234
+ deadline = {"easy": 8, "medium": 7, "hard": 5}.get(difficulty, 7)
235
+
236
+ cases: list[InternalCase] = []
237
+ for i, dispute in enumerate(disputes):
238
+ if len(cases) >= case_count:
239
+ break
240
+ case = dispute_to_case(dispute, i + 1, deadline_step=deadline)
241
+ if case is not None:
242
+ cases.append(case)
243
+
244
+ if not cases:
245
+ return None
246
+
247
+ codes = ", ".join(list({c.reason_code for c in cases})[:3])
248
+ return TaskScenario(
249
+ task_id=f"stripe_{difficulty}_{task_index}",
250
+ title=f"Stripe Dispute {'Queue' if len(cases) > 1 else 'Case'} ({difficulty.title()})",
251
+ difficulty=difficulty,
252
+ objective=f"Handle {len(cases)} Stripe dispute(s) ({codes}).",
253
+ description=f"Real Stripe sandbox dispute scenario with {len(cases)} case(s). Codes: {codes}.",
254
+ max_steps=max_steps,
255
+ cases=tuple(cases),
256
+ )
257
+
258
+
259
+ def fetch_disputes(*, limit: int = 10, api_key: str | None = None) -> list[dict[str, Any]]:
260
+ """Fetch disputes from Stripe test mode.
261
+
262
+ Requires the ``stripe`` package and a test-mode API key (``sk_test_`` prefix).
263
+ Falls back to synthetic test disputes when the key is missing or not a test-mode key, or when the Stripe call fails.
264
+ """
265
+ key = api_key or os.environ.get("STRIPE_API_KEY", "")
266
+ if not key or not key.startswith("sk_test_"):
267
+ return _synthetic_test_disputes(limit)
268
+
269
+ try:
270
+ import stripe
271
+ stripe.api_key = key
272
+ result = stripe.Dispute.list(limit=limit)
273
+ return [d.to_dict() if hasattr(d, "to_dict") else dict(d) for d in result.data]
274
+ except Exception:
275
+ return _synthetic_test_disputes(limit)
276
+
277
+
278
+ def _synthetic_test_disputes(count: int) -> list[dict[str, Any]]:
279
+ """Generate synthetic Stripe-format dispute objects for testing without API access."""
280
+ rng = random.Random(42)
281
+ reasons = list(_STRIPE_REASON_MAP.keys())
282
+ statuses = ["needs_response", "won", "lost", "under_review"]
283
+ disputes = []
284
+
285
+ for i in range(count):
286
+ reason = rng.choice(reasons)
287
+ status = rng.choice(statuses)
288
+ amount = rng.randint(500, 50000) # cents
289
+ disputes.append({
290
+ "id": f"dp_test_{i:04d}",
291
+ "amount": amount,
292
+ "currency": "usd",
293
+ "reason": reason,
294
+ "status": status,
295
+ "charge": f"ch_test_{i:04d}",
296
+ "metadata": {"description": f"Test dispute {i} — {reason}"},
297
+ "evidence_details": {"due_by_reason": f"Dispute for {reason}"},
298
+ })
299
+
300
+ return disputes
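The per-dispute seeding in `dispute_to_case` can be looked at in isolation. The sketch below is a minimal standalone illustration, not the connector's code: it hashes a dispute id with SHA-256 and uses the first eight hex digits as a 32-bit seed, so the same id should always yield the same random stream across runs. The helper names `seed_for` and `rng_for` are hypothetical and exist only for this example.

```python
import hashlib
import random


def seed_for(dispute_id: str) -> int:
    """Derive a stable 32-bit seed from a dispute id (hypothetical helper)."""
    digest = hashlib.sha256(dispute_id.encode()).hexdigest()
    return int(digest[:8], 16)  # first 8 hex digits -> value below 2**32


def rng_for(dispute_id: str) -> random.Random:
    """Build a private RNG so one dispute's draws never affect another's."""
    return random.Random(seed_for(dispute_id))


a = rng_for("dp_test_0001")
b = rng_for("dp_test_0001")

# Two RNGs built from the same id produce identical sequences.
same_stream = [a.random() for _ in range(3)] == [b.random() for _ in range(3)]
print(same_stream)
```

Seeding from a content hash rather than Python's built-in `hash()` matters here, because string hashing is randomized per process unless `PYTHONHASHSEED` is fixed.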
episode_store.py CHANGED
@@ -12,6 +12,7 @@ except ImportError: # pragma: no cover
12
  _LOCK = Lock()
13
  _REPORTS: dict[str, GraderReport] = {}
14
  _LATEST_EPISODE_ID: str | None = None
 
15
 
16
 
17
  def record_report(report: GraderReport) -> None:
@@ -19,6 +20,9 @@ def record_report(report: GraderReport) -> None:
19
 
20
  global _LATEST_EPISODE_ID
21
  with _LOCK:
 
 
 
22
  _REPORTS[report.episode_id] = report
23
  _LATEST_EPISODE_ID = report.episode_id
24
 
 
12
  _LOCK = Lock()
13
  _REPORTS: dict[str, GraderReport] = {}
14
  _LATEST_EPISODE_ID: str | None = None
15
+ _MAX_REPORTS = 100
16
 
17
 
18
  def record_report(report: GraderReport) -> None:
 
20
 
21
  global _LATEST_EPISODE_ID
22
  with _LOCK:
23
+ if len(_REPORTS) >= _MAX_REPORTS:
24
+ oldest = next(iter(_REPORTS))
25
+ del _REPORTS[oldest]
26
  _REPORTS[report.episode_id] = report
27
  _LATEST_EPISODE_ID = report.episode_id
28
 
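The eviction added to `record_report` relies on dictionaries preserving insertion order, so `next(iter(...))` returns the oldest key. A minimal sketch of that pattern, using a cap of 3 instead of the commit's 100 and plain strings instead of `GraderReport` objects (both are assumptions made for readability):

```python
MAX_REPORTS = 3  # the commit uses 100; a small cap keeps the demo readable

reports: dict[str, str] = {}


def record(episode_id: str, report: str) -> None:
    """Store a report, dropping the oldest insertion once the cap is reached."""
    if len(reports) >= MAX_REPORTS:
        oldest = next(iter(reports))  # dicts iterate in insertion order
        del reports[oldest]
    reports[episode_id] = report


for i in range(5):
    record(f"ep-{i}", f"report {i}")

print(list(reports))  # ['ep-2', 'ep-3', 'ep-4']
```

One property of this shape may be worth checking against the intended behavior: when the store is full, recording an `episode_id` that is already present still evicts the oldest entry, and overwriting an existing key does not move it to the end of the order.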
grading.py CHANGED
@@ -16,6 +16,58 @@ def _ratio(numerator: int, denominator: int) -> float:
16
  return max(0.0, min(1.0, numerator / denominator))
17
 
18
 
19
  def score_case(
20
  case: InternalCase,
21
  progress: CaseProgress,
@@ -24,15 +76,10 @@ def score_case(
24
  """Score one case deterministically."""
25
 
26
  final_resolution = progress.final_resolution or "unresolved"
27
- required_attached = len(
28
- set(progress.attached_evidence_ids).intersection(case.required_evidence_ids)
29
- )
30
- helpful_attached = len(
31
- set(progress.attached_evidence_ids).intersection(case.helpful_evidence_ids)
32
- )
33
- harmful_attached = len(
34
- set(progress.attached_evidence_ids).intersection(case.harmful_evidence_ids)
35
- )
36
 
37
  if final_resolution == case.optimal_strategy:
38
  strategy_correctness = 1.0
@@ -53,16 +100,22 @@ def score_case(
53
  )
54
  else:
55
  if final_resolution in {"accept_chargeback", "issue_refund"}:
56
- evidence_quality = 1.0 if helpful_attached == 0 and harmful_attached == 0 else 0.7
57
- packet_validity = 1.0
 
 
 
 
 
58
  else:
59
  evidence_quality = 0.0
60
  packet_validity = 0.0
61
 
 
62
  deadline_compliance = 1.0
63
  if final_resolution == "unresolved":
64
  deadline_compliance = 0.0
65
- elif step_count > case.deadline_step:
66
  deadline_compliance = 0.0
67
 
68
  wasted_actions = progress.duplicate_queries + progress.invalid_actions
@@ -75,13 +128,20 @@ def score_case(
75
  else:
76
  outcome_quality = 0.0
77
 
 
 
 
 
 
 
78
  weighted_score = (
79
  0.25 * strategy_correctness
80
- + 0.25 * evidence_quality
81
  + 0.15 * packet_validity
82
  + 0.15 * deadline_compliance
83
  + 0.10 * efficiency
84
  + 0.10 * outcome_quality
 
85
  )
86
 
87
  note_parts = [case.resolution_summary]
@@ -100,6 +160,7 @@ def score_case(
100
  deadline_compliance=round(deadline_compliance, 4),
101
  efficiency=round(efficiency, 4),
102
  outcome_quality=round(outcome_quality, 4),
 
103
  weighted_score=round(weighted_score * case.weight, 4),
104
  final_resolution=final_resolution,
105
  notes=" ".join(note_parts),
 
16
  return max(0.0, min(1.0, numerator / denominator))
17
 
18
 
19
+ def grade_representment_note(
20
+ note: str | None,
21
+ case: "InternalCase",
22
+ attached_ids: set[str],
23
+ ) -> float:
24
+ """Score a representment note from 0.0 to 1.0.
25
+
26
+ Evaluates whether the note:
27
+ - References required claims from the policy requirements
28
+ - Avoids mentioning harmful evidence
29
+ - Has sufficient substance (length and specificity)
30
+ """
31
+ if not note or not note.strip():
32
+ return 0.0
33
+
34
+ text = note.lower()
35
+ score = 0.0
36
+
37
+ # Substance: minimum length for a coherent note
38
+ word_count = len(text.split())
39
+ if word_count >= 5:
40
+ score += 0.2
41
+ elif word_count >= 2:
42
+ score += 0.1
43
+
44
+ # Required claims coverage: does the note mention policy requirements?
45
+ if case.policy_requirements:
46
+ claims_hit = 0
47
+ for req in case.policy_requirements:
48
+ req_keywords = req.lower().split()
49
+ if any(kw in text for kw in req_keywords if len(kw) > 3):
50
+ claims_hit += 1
51
+ score += 0.5 * _ratio(claims_hit, len(case.policy_requirements))
52
+ else:
53
+ score += 0.3 # No requirements to check
54
+
55
+ # Evidence coherence: does the note reference attached evidence?
56
+ evidence_refs = sum(1 for eid in attached_ids if eid.lower() in text or any(
57
+ part in text for part in eid.lower().replace("-", " ").split() if len(part) > 3
58
+ ))
59
+ if evidence_refs > 0:
60
+ score += 0.15
61
+
62
+ # Harmful mention penalty: does the note mention harmful evidence concepts?
63
+ harmful_keywords = {"mismatch", "failed", "declined", "suspicious", "flagged", "fraud risk"}
64
+ harmful_hits = sum(1 for kw in harmful_keywords if kw in text)
65
+ if harmful_hits > 0:
66
+ score -= 0.15 * min(harmful_hits, 2)
67
+
68
+ return max(0.0, min(1.0, score))
69
+
70
+
71
  def score_case(
72
  case: InternalCase,
73
  progress: CaseProgress,
 
76
  """Score one case deterministically."""
77
 
78
  final_resolution = progress.final_resolution or "unresolved"
79
+ attached_set = set(progress.attached_evidence_ids)
80
+ required_attached = len(attached_set.intersection(case.required_evidence_ids))
81
+ helpful_attached = len(attached_set.intersection(case.helpful_evidence_ids))
82
+ harmful_attached = len(attached_set.intersection(case.harmful_evidence_ids))
 
 
 
 
 
83
 
84
  if final_resolution == case.optimal_strategy:
85
  strategy_correctness = 1.0
 
100
  )
101
  else:
102
  if final_resolution in {"accept_chargeback", "issue_refund"}:
103
+ if case.optimal_strategy == "contest":
104
+ # Conceded a contestable case — evidence gathering was abandoned
105
+ evidence_quality = 0.3
106
+ packet_validity = 0.0
107
+ else:
108
+ evidence_quality = 1.0 if helpful_attached == 0 and harmful_attached == 0 else 0.7
109
+ packet_validity = 1.0
110
  else:
111
  evidence_quality = 0.0
112
  packet_validity = 0.0
113
 
114
+ resolution_step = progress.resolved_at_step if progress.resolved_at_step is not None else step_count
115
  deadline_compliance = 1.0
116
  if final_resolution == "unresolved":
117
  deadline_compliance = 0.0
118
+ elif resolution_step > case.deadline_step:
119
  deadline_compliance = 0.0
120
 
121
  wasted_actions = progress.duplicate_queries + progress.invalid_actions
 
128
  else:
129
  outcome_quality = 0.0
130
 
131
+ # Representment note quality (only relevant for contested cases)
132
+ if final_resolution == "contest" and progress.representment_note:
133
+ note_quality = grade_representment_note(progress.representment_note, case, attached_set)
134
+ else:
135
+ note_quality = 0.0
136
+
137
  weighted_score = (
138
  0.25 * strategy_correctness
139
+ + 0.20 * evidence_quality
140
  + 0.15 * packet_validity
141
  + 0.15 * deadline_compliance
142
  + 0.10 * efficiency
143
  + 0.10 * outcome_quality
144
+ + 0.05 * note_quality
145
  )
146
 
147
  note_parts = [case.resolution_summary]
 
160
  deadline_compliance=round(deadline_compliance, 4),
161
  efficiency=round(efficiency, 4),
162
  outcome_quality=round(outcome_quality, 4),
163
+ note_quality=round(note_quality, 4),
164
  weighted_score=round(weighted_score * case.weight, 4),
165
  final_resolution=final_resolution,
166
  notes=" ".join(note_parts),
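The reweighting in `score_case` moves 0.05 from `evidence_quality` to the new `note_quality` component, so the weights still sum to 1.0. The sketch below reproduces only the weighted sum, with the component values supplied by hand; `weighted_score` is a hypothetical standalone helper, not the function from `grading.py`.

```python
import math

# Component weights as they appear in the updated score_case.
WEIGHTS = {
    "strategy_correctness": 0.25,
    "evidence_quality": 0.20,
    "packet_validity": 0.15,
    "deadline_compliance": 0.15,
    "efficiency": 0.10,
    "outcome_quality": 0.10,
    "note_quality": 0.05,
}


def weighted_score(components: dict[str, float], case_weight: float = 1.0) -> float:
    """Combine component scores in [0, 1]; missing components count as 0.0."""
    raw = sum(w * components.get(name, 0.0) for name, w in WEIGHTS.items())
    return round(raw * case_weight, 4)


perfect_contest = {name: 1.0 for name in WEIGHTS}
# note_quality is only computed for contested cases, so a concession scores 0.0 there.
perfect_concede = {**perfect_contest, "note_quality": 0.0}

print(math.isclose(sum(WEIGHTS.values()), 1.0))
print(weighted_score(perfect_contest))
print(weighted_score(perfect_concede))
```

Because `note_quality` is set to 0.0 unless the final resolution is `contest` and a note was supplied, a correctly conceded case appears to top out at 0.95 times its case weight under these weights.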
iso_adapter.py ADDED
@@ -0,0 +1,268 @@
1
+ """Adapter that converts real ISO 20022 chargeback CSV rows into environment cases.
2
+
3
+ Reads ``data/iso20022-card-chargeback-casr-003.csv`` and produces
4
+ ``InternalCase`` / ``TaskScenario`` objects so real dispute data flows
5
+ through the benchmark.
6
+ """
7
+
8
+ from __future__ import annotations
9
+
10
+ import csv
11
+ import hashlib
12
+ import random
13
+ from pathlib import Path
14
+ from typing import Literal
15
+
16
+ try:
17
+ from .simulation import InternalCase, InternalEvidence, TaskScenario, SystemName, StrategyName
18
+ except ImportError: # pragma: no cover
19
+ from simulation import InternalCase, InternalEvidence, TaskScenario, SystemName, StrategyName
20
+
21
+ ISO_CSV_PATH = Path("data/iso20022-card-chargeback-casr-003.csv")
22
+
23
+ _REASON_MAP: dict[str, str] = {
24
+ "goods_not_received": "goods_not_received",
25
+ "GOODS_NOT_RECEIVED": "goods_not_received",
26
+ "NR02": "goods_not_received",
27
+ "FRAUD": "fraud_cnp",
28
+ "fraud": "fraud_cnp",
29
+ "fraudulent_transaction": "fraud_cnp",
30
+ "FR01": "fraud_cnp",
31
+ "FR02": "fraud_cnp",
32
+ "goods_not_as_described": "product_not_as_described",
33
+ "GOODS_NOT_AS_DESCRIBED": "product_not_as_described",
34
+ "not_as_described": "product_not_as_described",
35
+ "NR04": "product_not_as_described",
36
+ "SERVICE_NOT_RENDERED": "service_not_provided",
37
+ "services_not_rendered": "service_not_provided",
38
+ "NR03": "credit_not_processed",
39
+ "duplicate": "duplicate_processing",
40
+ "DUPLICATE_PROCESSING": "duplicate_processing",
41
+ "duplicate_processing": "duplicate_processing",
42
+ }
43
+
44
+ _MERCHANT_WON = {"merchant_won", "chargeback_reversed", "chargeback_declined"}
45
+ _CONCEDED = {"chargeback_accepted"}
46
+
47
+ _POLICY_GUIDANCE: dict[str, str] = {
48
+ "goods_not_received": "For goods-not-received disputes, prove fulfillment with order confirmation and carrier delivery evidence.",
49
+ "fraud_cnp": "For CNP fraud disputes, contest only when you can link the cardholder to the account or device history. Do not attach mismatch artifacts.",
50
+ "product_not_as_described": "Contest product-not-as-described disputes when the listing accurately represents the product and the customer bypassed the return process.",
51
+ "service_not_provided": "Contest service-not-provided disputes when provider records confirm the service was delivered.",
52
+ "credit_not_processed": "If the merchant failed to process a promised credit, refund immediately or concede. Contesting is not supportable.",
53
+ "duplicate_processing": "When a duplicate charge is confirmed, refund the extra amount immediately. Do not contest.",
54
+ }
55
+
56
+ _POLICY_REQS: dict[str, tuple[str, ...]] = {
57
+ "goods_not_received": ("order confirmation", "carrier delivery confirmation"),
58
+ "fraud_cnp": ("prior good order linkage", "customer account confirmation"),
59
+ "product_not_as_described": ("product listing verification", "return policy documentation"),
60
+ "service_not_provided": ("service completion record", "customer acknowledgment"),
61
+ "credit_not_processed": ("proof of cancellation request", "refund status check"),
62
+ "duplicate_processing": ("payment transaction log", "duplicate confirmation"),
63
+ }
64
+
65
+
66
+ def _ev(eid, system, title, summary, *, helpful=False, harmful=False, required=False):
67
+ return InternalEvidence(evidence_id=eid, source_system=system, title=title,
68
+ summary=summary, helpful=helpful, harmful=harmful, required=required)
69
+
70
+
71
+ def _infer_strategy(reason_code, final_decision, notes):
72
+ nl = notes.lower()
73
+ if final_decision in _MERCHANT_WON:
74
+ return "contest", ()
75
+ if final_decision in _CONCEDED:
76
+ if reason_code in ("credit_not_processed", "duplicate_processing"):
77
+ return "issue_refund", ("accept_chargeback",)
78
+ return "accept_chargeback", ("issue_refund",)
79
+ if reason_code in ("credit_not_processed", "duplicate_processing"):
80
+ return "issue_refund", ("accept_chargeback",)
81
+ if reason_code == "fraud_cnp" and ("stolen" in nl or "no evidence" in nl or "unable" in nl):
82
+ return "accept_chargeback", ("issue_refund",)
83
+ return "contest", ()
84
+
85
+
86
+ def _build_evidence(prefix, reason_code, merchant, amount, notes, optimal, rng):
87
+ by_sys: dict[SystemName, list[InternalEvidence]] = {s: [] for s in ("orders","payment","shipping","support","refunds","risk")}
88
+ req, hlp, hrm = [], [], []
89
+
90
+ if reason_code == "goods_not_received":
91
+ e = _ev(f"{prefix}-ORDER","orders","Order confirmation",f"Order with {merchant} for ${amount:.2f}.",helpful=True,required=True)
92
+ by_sys["orders"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
93
+ by_sys["payment"].append(_ev(f"{prefix}-AUTH","payment","Authorization","Payment authorized and captured."))
94
+ if optimal == "contest":
95
+ e = _ev(f"{prefix}-DELIVERY","shipping","Carrier delivery confirmation","Carrier confirms delivery to customer address.",helpful=True,required=True)
96
+ by_sys["shipping"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
97
+ if rng.random()>0.4:
98
+ e2=_ev(f"{prefix}-SIG","shipping","Delivery signature","Recipient signature on file.",helpful=True)
99
+ by_sys["shipping"].append(e2); hlp.append(e2.evidence_id)
100
+ else:
101
+ by_sys["shipping"].append(_ev(f"{prefix}-NOTRACK","shipping","Tracking status","No confirmed delivery scan."))
102
+ by_sys["support"].append(_ev(f"{prefix}-SUPPORT","support","Support notes",notes[:120] if notes else "No support interactions."))
103
+ by_sys["refunds"].append(_ev(f"{prefix}-REFUND","refunds","Refund ledger","No refund issued before dispute."))
104
+
105
+ elif reason_code == "fraud_cnp":
106
+ by_sys["orders"].append(_ev(f"{prefix}-ORDER","orders","Order receipt",f"Order with {merchant} for ${amount:.2f}.",helpful=True))
107
+ hlp.append(f"{prefix}-ORDER")
108
+ e_avs=_ev(f"{prefix}-AVS","payment","AVS mismatch","Street mismatch at authorization.",harmful=True)
109
+ by_sys["payment"].append(e_avs); hrm.append(e_avs.evidence_id)
110
+ if rng.random()>0.5:
111
+ e_cvv=_ev(f"{prefix}-CVV","payment","CVV mismatch","CVV verification failed.",harmful=True)
112
+ by_sys["payment"].append(e_cvv); hrm.append(e_cvv.evidence_id)
113
+ by_sys["payment"].append(_ev(f"{prefix}-AUTH","payment","Authorization","Payment captured."))
114
+ if optimal=="contest":
115
+ e=_ev(f"{prefix}-PRIOR","risk","Prior account activity","Same account/device with prior fulfilled orders.",helpful=True,required=True)
116
+ by_sys["risk"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
117
+ e=_ev(f"{prefix}-CHAT","support","Authenticated chat","Customer logged in and confirmed order.",helpful=True,required=True)
118
+ by_sys["support"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
119
+ else:
120
+ by_sys["risk"].append(_ev(f"{prefix}-RISK","risk","Risk summary","Elevated risk. No positive account history."))
121
+ by_sys["support"].append(_ev(f"{prefix}-SUPPORT","support","Support log","No authenticated interactions."))
122
+ by_sys["shipping"].append(_ev(f"{prefix}-DELIVERY","shipping","Delivery confirmation","Delivered to address on file.",helpful=True))
123
+ hlp.append(f"{prefix}-DELIVERY")
124
+ by_sys["refunds"].append(_ev(f"{prefix}-REFUND","refunds","Refund ledger","No refund issued."))
125
+
126
+ elif reason_code == "product_not_as_described":
127
+ e=_ev(f"{prefix}-ORDER","orders","Order details",f"Order with {merchant} — SKU matches listing.",helpful=True,required=True)
128
+ by_sys["orders"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
129
+ e=_ev(f"{prefix}-LISTING","orders","Product listing","Listing matches manufacturer specs.",helpful=True,required=True)
130
+ by_sys["orders"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
131
+ by_sys["payment"].append(_ev(f"{prefix}-AUTH","payment","Payment capture","Settled for listed price."))
132
+ by_sys["shipping"].append(_ev(f"{prefix}-DELIVERY","shipping","Delivery confirmation","Delivered within window.",helpful=True))
133
+ hlp.append(f"{prefix}-DELIVERY")
134
+ by_sys["support"].append(_ev(f"{prefix}-RETURN","support","Return policy","No return initiated before dispute.",helpful=True))
135
+ hlp.append(f"{prefix}-RETURN")
136
+ by_sys["refunds"].append(_ev(f"{prefix}-REFUND","refunds","Refund ledger","No refund processed."))
137
+
138
+ elif reason_code == "service_not_provided":
139
+ e=_ev(f"{prefix}-BOOKING","orders","Service booking",f"Booking with {merchant} for ${amount:.2f}.",helpful=True,required=True)
140
+ by_sys["orders"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
141
+ by_sys["payment"].append(_ev(f"{prefix}-AUTH","payment","Payment record","Payment captured."))
142
+ if optimal=="contest":
143
+ e=_ev(f"{prefix}-COMPLETION","support","Service completion","Provider marked service completed.",helpful=True,required=True)
144
+ by_sys["support"].append(e); req.append(e.evidence_id); hlp.append(e.evidence_id)
145
+ else:
146
+ by_sys["support"].append(_ev(f"{prefix}-CANCEL","support","Cancellation",notes[:100] if notes else "Service cancelled.",helpful=True))
147
+ hlp.append(f"{prefix}-CANCEL")
148
+ by_sys["refunds"].append(_ev(f"{prefix}-REFUND","refunds","Refund ledger","No refund issued."))
149
+
150
+ elif reason_code in ("credit_not_processed","duplicate_processing"):
151
+ by_sys["orders"].append(_ev(f"{prefix}-ORDER","orders","Invoice",f"Charge of ${amount:.2f} from {merchant}."))
152
+ by_sys["payment"].append(_ev(f"{prefix}-PAYMENT","payment","Payment","Payment settled."))
153
+ by_sys["support"].append(_ev(f"{prefix}-REQ","support","Customer request",notes[:100] if notes else "Customer requested credit.",helpful=True))
154
+ hlp.append(f"{prefix}-REQ")
155
+ by_sys["refunds"].append(_ev(f"{prefix}-NOREFUND","refunds","Refund ledger","No refund processed.",helpful=True))
156
+ hlp.append(f"{prefix}-NOREFUND")
157
+
158
+ frozen = {k: tuple(v) for k, v in by_sys.items()}
159
+ return frozen, tuple(req), tuple(hlp), tuple(hrm)
160
+
161
+
162
+ def _concedable_guidance(reason_code: str, optimal: str) -> str:
163
+ """Return guidance that signals concede when the optimal strategy isn't contest."""
164
+ if optimal in ("accept_chargeback", "issue_refund") and reason_code not in (
165
+ "credit_not_processed", "duplicate_processing",
166
+ ):
167
+ if optimal == "accept_chargeback":
168
+ return (
169
+ f"Do not contest this {reason_code.replace('_', ' ')} dispute. "
170
+ "The merchant's position is not supportable. Concede to avoid wasting resources."
171
+ )
172
+ return (
173
+ f"Refund immediately for this {reason_code.replace('_', ' ')} dispute. "
174
+ "Contesting is not supportable."
175
+ )
176
+ return _POLICY_GUIDANCE.get(reason_code, "")
177
+
178
+
179
+ def row_to_case(row, case_index, *, deadline_step=8):
180
+ raw_code = row.get("chargeback_reason_code", "")
181
+ reason_code = _REASON_MAP.get(raw_code)
182
+ if reason_code is None:
183
+ return None
184
+
185
+ amount = float(row.get("transaction_amount", "0") or "0")
186
+ merchant = row.get("merchant_name", "Unknown")
187
+ notes = row.get("notes", "")
188
+ final_decision = row.get("final_decision", "")
189
+
190
+ optimal, acceptable = _infer_strategy(reason_code, final_decision, notes)
191
+ rng = random.Random(int(hashlib.sha256(row["chargeback_id"].encode()).hexdigest()[:8], 16))
192
+ prefix = f"ISO{case_index}"
193
+
194
+ evidence, req_ids, hlp_ids, hrm_ids = _build_evidence(prefix, reason_code, merchant, amount, notes, optimal, rng)
195
+
196
+ return InternalCase(
197
+ case_id=f"CB-ISO{case_index}",
198
+ order_id=row.get("original_transaction_id", f"TX-ISO{case_index}"),
199
+ customer_id=f"CUST-ISO{case_index}",
200
+ amount=amount, currency=row.get("transaction_currency", "USD"),
201
+ reason_code=reason_code,
202
+ summary=row.get("chargeback_reason_description", "Chargeback filed."),
203
+ inspection_notes=notes or f"Chargeback against {merchant} for ${amount:.2f}.",
204
+ deadline_step=deadline_step,
205
+ optimal_strategy=optimal, acceptable_strategies=acceptable,
206
+ policy_guidance=_concedable_guidance(reason_code, optimal),
207
+ policy_requirements=_POLICY_REQS.get(reason_code, ()),
208
+ recommended_strategy=optimal,
209
+ resolution_summary=f"Real case outcome: {final_decision or 'pending'}.",
210
+ weight=round(1.0 + (amount / 5000.0), 2),
211
+ required_evidence_ids=req_ids, helpful_evidence_ids=hlp_ids, harmful_evidence_ids=hrm_ids,
212
+ evidence_by_system=evidence,
213
+ )
214
+
215
+
216
+ def load_iso_rows(csv_path=None):
217
+ path = Path(csv_path) if csv_path else ISO_CSV_PATH # accept str or Path
218
+ if not path.exists():
219
+ return []
220
+ with path.open(newline="", encoding="utf-8") as f:
221
+ return list(csv.DictReader(f))
222
+
223
+
224
+ def build_iso_task(rows, *, difficulty="medium", start_index=0, case_count=None, task_index=0):
225
+ if case_count is None:
226
+ case_count = {"easy": 1, "medium": 2, "hard": 3}[difficulty]
227
+ max_steps = {"easy": 10, "medium": 12, "hard": max(12, case_count * 5)}[difficulty]
228
+
229
+ cases = []
230
+ idx = start_index
231
+ while len(cases) < case_count and idx < len(rows):
232
+ deadline = {"easy": 8, "medium": 7, "hard": max(4, 8 - len(cases))}[difficulty]
233
+ case = row_to_case(rows[idx], idx + 1, deadline_step=deadline)
234
+ idx += 1
235
+ if case is not None:
236
+ cases.append(case)
237
+
238
+ if not cases:
239
+ return None
240
+
241
+ codes = ", ".join(list({c.reason_code for c in cases})[:3])
242
+ return TaskScenario(
243
+ task_id=f"iso_{difficulty}_{task_index}",
244
+ title=f"ISO Dispute {'Queue' if len(cases) > 1 else 'Case'} ({difficulty.title()})",
245
+ difficulty=difficulty,
246
+ objective=f"Handle {len(cases)} real dispute(s) ({codes}) from ISO 20022 chargeback data.",
247
+ description=f"Real-world-derived scenario with {len(cases)} case(s). Reason codes: {codes}.",
248
+ max_steps=max_steps,
249
+ cases=tuple(cases),
250
+ )
251
+
252
+
253
+ def generate_iso_suite(csv_path=None, *, easy_count=3, medium_count=3, hard_count=3):
254
+ rows = load_iso_rows(csv_path)
255
+ if not rows:
256
+ return []
257
+ rng = random.Random(42)
258
+ shuffled = list(rows)
259
+ rng.shuffle(shuffled)
260
+ tasks, offset, idx = [], 0, 0
261
+ for diff, count in [("easy", easy_count), ("medium", medium_count), ("hard", hard_count)]:
262
+ for _ in range(count):
263
+ task = build_iso_task(shuffled, difficulty=diff, start_index=offset, task_index=idx)
264
+ if task is not None:
265
+ tasks.append(task)
266
+ offset += len(task.cases) + 1
267
+ idx += 1
268
+ return tasks
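The row handling in `row_to_case` comes down to three steps: read each CSV row as a dict, map the raw reason code to an internal one, and skip rows whose code is unmapped. A minimal sketch of that flow follows. The column names match those read by the adapter, but the sample rows, the trimmed `REASON_MAP`, and the `parse_rows` helper are invented for illustration and do not come from the real data file.

```python
import csv
import io

# Trimmed copy of a few _REASON_MAP entries from the adapter, for illustration.
REASON_MAP = {
    "NR02": "goods_not_received",
    "FR01": "fraud_cnp",
    "NR03": "credit_not_processed",
}

# Invented sample data; the real CSV has more columns.
SAMPLE_CSV = """chargeback_id,chargeback_reason_code,transaction_amount
CB-1,NR02,120.50
CB-2,ZZ99,40.00
CB-3,FR01,
"""


def parse_rows(text: str) -> list[tuple[str, str, float]]:
    """Return (id, internal reason, amount) for rows with a known reason code."""
    parsed = []
    for row in csv.DictReader(io.StringIO(text)):
        reason = REASON_MAP.get(row.get("chargeback_reason_code", ""))
        if reason is None:
            continue  # unmapped codes are skipped, as row_to_case returns None
        # An empty amount cell falls back to "0", mirroring the adapter.
        amount = float(row.get("transaction_amount", "0") or "0")
        parsed.append((row["chargeback_id"], reason, amount))
    return parsed


print(parse_rows(SAMPLE_CSV))
```

Skipping unmapped rows keeps the suite builder simple, at the cost that the number of usable cases can be smaller than the number of rows in the file.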
models.py CHANGED
@@ -51,7 +51,6 @@ class PolicyView(BaseModel):
      reason_code: str
      guidance: str
      required_evidence: list[str] = Field(default_factory=list)
-     recommended_strategy: StrategyName


  class VisibleCase(BaseModel):
@@ -116,6 +115,7 @@ class CaseScoreBreakdown(BaseModel):
      deadline_compliance: float
      efficiency: float
      outcome_quality: float
+     note_quality: float = 0.0
      weighted_score: float
      final_resolution: str
      notes: str
@@ -167,13 +167,14 @@ class ChargebackOpsAction(Action):
      """Action schema for ChargebackOps."""

      action_type: ActionType
-     case_id: str | None = Field(default=None, description="Target case id when applicable")
+     case_id: str | None = Field(default=None, max_length=64, description="Target case id when applicable")
      system_name: SystemName | None = Field(
          default=None,
          description="System to query when action_type is query_system",
      )
      evidence_ids: list[str] = Field(
          default_factory=list,
+         max_length=20,
          description="Evidence ids to attach or remove",
      )
      strategy: StrategyName | None = Field(
@@ -182,6 +183,7 @@ class ChargebackOpsAction(Action):
      )
      note: str | None = Field(
          default=None,
+         max_length=500,
          description="Optional short rationale for the action",
      )

server/chargeback_ops_environment.py CHANGED
@@ -175,7 +175,7 @@ class ChargebackOpsEnvironment(
          if action.action_type == "set_strategy":
              return self._set_strategy(case, action.strategy)
          if action.action_type == "submit_representment":
-             return self._submit_representment(case)
+             return self._submit_representment(case, note=action.note)
          if action.action_type == "resolve_case":
              return self._resolve_case(case, action.strategy)
          raise ValueError(f"Unsupported action_type '{action.action_type}'.")
@@ -306,9 +306,11 @@ class ChargebackOpsEnvironment(
              return 0.03, f"Set an acceptable fallback strategy '{strategy}' for case {case.case_id}."
          return -0.08, f"Set a weak strategy '{strategy}' for case {case.case_id}."

-     def _submit_representment(self, case: InternalCase) -> tuple[float, str]:
+     def _submit_representment(self, case: InternalCase, *, note: str | None = None) -> tuple[float, str]:
          progress = self._progress_by_case[case.case_id]
          progress.submit_attempts += 1
+         if note:
+             progress.representment_note = note
          if progress.current_strategy != "contest":
              raise ValueError("submit_representment requires current strategy to be 'contest'.")
          if progress.resolution_status != "open":
@@ -320,21 +322,25 @@ class ChargebackOpsEnvironment(
          if self._state.step_count > case.deadline_step:
              progress.final_resolution = "contest"
              progress.resolution_status = "lost_late"
+             progress.resolved_at_step = self._state.step_count
              return -0.2, f"Representment for case {case.case_id} was submitted after the deadline."
          if missing:
              progress.final_resolution = "contest"
              progress.resolution_status = "lost_incomplete"
+             progress.resolved_at_step = self._state.step_count
              return -0.18, (
                  f"Representment for case {case.case_id} is incomplete; missing {', '.join(sorted(missing))}."
              )
          if harmful:
              progress.final_resolution = "contest"
              progress.resolution_status = "lost_harmful_evidence"
+             progress.resolved_at_step = self._state.step_count
              return -0.15, (
                  f"Representment for case {case.case_id} included harmful evidence {', '.join(sorted(harmful))}."
              )

          progress.final_resolution = "contest"
+         progress.resolved_at_step = self._state.step_count
          if case.optimal_strategy == "contest":
              progress.resolution_status = "won"
              return 0.2, f"Submitted a strong representment package for case {case.case_id}."
@@ -354,6 +360,7 @@ class ChargebackOpsEnvironment(
              return -0.04, f"Case {case.case_id} is already resolved."
          progress.final_resolution = resolution
          progress.current_strategy = resolution
+         progress.resolved_at_step = self._state.step_count
          progress.resolution_status = (
              "refunded" if resolution == "issue_refund" else "accepted_chargeback"
          )
@@ -457,7 +464,6 @@ class ChargebackOpsEnvironment(
              reason_code=case.reason_code,
              guidance=case.policy_guidance,
              required_evidence=list(case.policy_requirements),
-             recommended_strategy=case.recommended_strategy,
          )
          return VisibleCase(
              case_id=case.case_id,
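The `_submit_representment` hunks above check failure conditions in a fixed order (late, then incomplete, then harmful evidence) and now stamp `resolved_at_step` on every terminal branch. A minimal sketch of that ordering follows, under stated assumptions: `Progress` is a trimmed-down stand-in for `CaseProgress`, `missing` and `harmful` are passed in as sets because the lines that compute them fall outside the hunks, and the success branch assumes the case's optimal strategy is `"contest"`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Progress:
    """Trimmed-down stand-in for CaseProgress (illustrative only)."""
    resolution_status: str = "open"
    resolved_at_step: Optional[int] = None
    representment_note: Optional[str] = None


def submit(progress, *, step, deadline_step, missing, harmful, note=None):
    """Return (reward, status), checking conditions in the same order as the diff."""
    if note:
        progress.representment_note = note  # the note is stored before any check runs
    if step > deadline_step:
        reward, status = -0.2, "lost_late"
    elif missing:
        reward, status = -0.18, "lost_incomplete"
    elif harmful:
        reward, status = -0.15, "lost_harmful_evidence"
    else:
        reward, status = 0.2, "won"  # assumes optimal_strategy == "contest"
    progress.resolution_status = status
    progress.resolved_at_step = step  # every terminal branch records the step
    return reward, status


late = Progress()
# "delivery_proof" is a made-up evidence id used only for this example.
print(submit(late, step=9, deadline_step=7, missing={"delivery_proof"}, harmful=set()))
```

Because the deadline check comes first, a late submission is reported as `lost_late` even when evidence is also missing, which is why the call above reports the late status rather than the incomplete one.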
simulation.py CHANGED
@@ -73,11 +73,13 @@ class CaseProgress:
      current_strategy: StrategyName | None = None
      final_resolution: str | None = None
      resolution_status: str = "open"
+     resolved_at_step: int | None = None
      duplicate_queries: int = 0
      invalid_actions: int = 0
      submit_attempts: int = 0
      deadline_penalized: bool = False
      notes: list[str] = field(default_factory=list)
+     representment_note: str | None = None


  @dataclass
@@ -224,7 +226,7 @@ TASKS: dict[str, TaskScenario] = {
              "A card-not-present fraud dispute with mixed signals. Strong account-linkage evidence exists, "
              "but payment mismatch artifacts will hurt the case if attached."
          ),
-         max_steps=12,
+         max_steps=10,
          cases=(
              InternalCase(
                  case_id="CB-M1",
@@ -238,7 +240,7 @@ TASKS: dict[str, TaskScenario] = {
                      "The order used a known account and device, but AVS/CVV mismatches were present. "
                      "Winning requires emphasizing customer-account linkage and avoiding mismatch artifacts."
                  ),
-                 deadline_step=9,
+                 deadline_step=7,
                  optimal_strategy="contest",
                  acceptable_strategies=("accept_chargeback",),
                  policy_guidance=(
@@ -341,7 +343,7 @@ TASKS: dict[str, TaskScenario] = {
              "A real operations queue with three disputes. Two should be actioned quickly, and one should be conceded. "
              "The step budget leaves little room for waste."
          ),
-         max_steps=18,
+         max_steps=15,
          cases=(
              InternalCase(
                  case_id="CB-H1",
@@ -582,13 +584,47 @@ TASKS: dict[str, TaskScenario] = {


  def get_task(task_id: str) -> TaskScenario:
-     """Look up a task or raise KeyError."""
+     """Look up a built-in task or generate one from a ``generated_*`` id."""
+
+     if task_id in TASKS:
+         return TASKS[task_id]
+
+     # Support generated task ids: generated_{difficulty}_s{seed}
+     import re
+
+     m = re.match(r"^generated_(easy|medium|hard)_s(\d+)$", task_id)
+     if m:
+         try:
+             from .case_generator import generate_task
+         except ImportError:  # pragma: no cover
+             from case_generator import generate_task
+         difficulty = m.group(1)
+         seed = int(m.group(2))
+         return generate_task(seed, difficulty=difficulty)  # type: ignore[arg-type]
+
+     # Support ISO-derived task ids: iso_{difficulty}_{index}
+     m_iso = re.match(r"^iso_(easy|medium|hard)_(\d+)$", task_id)
+     if m_iso:
+         try:
+             from .iso_adapter import build_iso_task, load_iso_rows
+         except ImportError:  # pragma: no cover
+             from iso_adapter import build_iso_task, load_iso_rows
+         difficulty = m_iso.group(1)
+         task_index = int(m_iso.group(2))
+         rows = load_iso_rows()
+         if rows:
+             import random as _rng_mod
+             shuffled = list(rows)
+             _rng_mod.Random(42).shuffle(shuffled)
+             task = build_iso_task(shuffled, difficulty=difficulty, start_index=task_index * 4, task_index=task_index)
+             if task is not None:
+                 return task

-     return TASKS[task_id]
+     raise ValueError(f"Unknown task_id '{task_id}'. Available: {', '.join(TASKS)}")


  def list_tasks() -> list[TaskScenario]:
-     """Return tasks in a stable order."""
+     """Return built-in tasks in a stable order."""

      ordered_ids = [
          "goods_not_received_easy",
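The new `get_task` dispatches on the shape of the id: built-in ids first, then `generated_{difficulty}_s{seed}`, then `iso_{difficulty}_{index}`. The sketch below isolates just that classification step, assuming a hypothetical `parse_task_id` helper that returns a tuple instead of building a task. The two regular expressions are copied from the diff; everything else is illustrative.

```python
import re

# Patterns taken from the get_task diff.
GENERATED_ID = re.compile(r"^generated_(easy|medium|hard)_s(\d+)$")
ISO_ID = re.compile(r"^iso_(easy|medium|hard)_(\d+)$")


def parse_task_id(task_id, builtin_ids):
    """Classify a task id without building the task (hypothetical helper)."""
    if task_id in builtin_ids:
        return ("builtin", task_id)
    m = GENERATED_ID.match(task_id)
    if m:
        return ("generated", m.group(1), int(m.group(2)))
    m = ISO_ID.match(task_id)
    if m:
        return ("iso", m.group(1), int(m.group(2)))
    # Same failure mode as the diff: ValueError listing the built-in ids.
    raise ValueError(f"Unknown task_id '{task_id}'. Available: {', '.join(builtin_ids)}")


builtin = ["goods_not_received_easy"]
print(parse_task_id("generated_hard_s17", builtin))
print(parse_task_id("iso_easy_3", builtin))
```

Note that a generated id needs the literal `s` before the seed, so `generated_hard_17` falls through to the `ValueError`. Callers that previously caught `KeyError` from `get_task` would need to catch `ValueError` after this change.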