mitudrudutta committed on
Commit
2f6f026
·
1 Parent(s): a92af86

Add Hugging Face blog writeup

  1. BLOG.md +528 -0
BLOG.md ADDED
@@ -0,0 +1,528 @@
1
+ # ChargebackOps: Teaching LLM Agents to Fight Credit Card Disputes
2
+
3
+ **Hugging Face Space:** https://huggingface.co/spaces/mitudrudutta/ChargeBackOps
4
+ **GitHub Repository:** https://github.com/MitudruDutta/chargebackops
5
+
6
+ A customer disputes a card payment.
7
+
8
+ The merchant has a deadline. Evidence is scattered across order logs, payment records, shipping scans, support chats, refund ledgers, and fraud-risk systems.
9
+
10
+ Some evidence helps. Some evidence hurts. The agent can fight, refund, concede, or escalate. But arbitration costs a fixed **$250**.
11
+
12
+ So the real question is not simply:
13
+
14
+ > Can an LLM answer questions about chargebacks?
15
+
16
+ The better question is:
17
+
18
+ > Can an LLM agent make evidence-backed, cost-aware decisions in a real operational workflow?
19
+
20
+ That is the idea behind **ChargebackOps**: an OpenEnv environment for training and evaluating LLM agents on merchant-side chargeback operations.
21
+
22
+ ---
23
+
24
+ ## The Problem: Chargebacks Are Not One-Step Decisions
25
+
26
+ When a customer disputes a transaction, the merchant can lose the transaction amount immediately. To recover the money, the merchant must submit a **representment packet** before a strict deadline.
27
+
28
+ A human analyst needs to answer several questions:
29
+
30
+ - Is this case worth contesting?
31
+ - Which internal system should be checked first?
32
+ - Is the shipping proof enough?
33
+ - Does the fraud signal help or hurt?
34
+ - Should we issue a refund instead?
35
+ - If the issuer rejects us, should we escalate to arbitration?
36
+
37
+ This is why chargebacks are a useful environment for agent training. They combine:
38
+
39
+ - partial observability,
40
+ - tool use,
41
+ - evidence selection,
42
+ - deadlines,
43
+ - multi-round review,
44
+ - and economic tradeoffs.
45
+
46
+ Many benchmarks test whether an LLM can produce the right answer. ChargebackOps tests whether an LLM can make the right sequence of decisions.
47
+
48
+ ---
49
+
50
+ ## What ChargebackOps Is
51
+
52
+ ChargebackOps is a simulated merchant dispute operations environment built on **OpenEnv**.
53
+
54
+ The agent receives a structured observation, takes a typed action, and receives the next observation after the environment updates.
55
+
56
+ The loop is simple:
57
+
58
+ $$
59
+ \text{Observation} \rightarrow \text{Action} \rightarrow \text{State Update} \rightarrow \text{Reward / Score}
60
+ $$
61
+
62
+ The workflow is not simple.
63
+
64
+ The agent must triage cases, query internal systems, collect evidence, avoid harmful artifacts, submit representment packets, handle issuer responses, and decide whether arbitration is worth the fee.
65
+
66
+ ---
67
+
68
+ ## What the Agent Can Do
69
+
70
+ The agent cannot act with free-form text. It must choose from a typed action space.
71
+
72
+ ### Round 1: Representment
73
+
74
+ - `select_case`
75
+ - `inspect_case`
76
+ - `query_system`
77
+ - `retrieve_policy`
78
+ - `add_evidence`
79
+ - `remove_evidence`
80
+ - `set_strategy`
81
+ - `submit_representment`
82
+ - `resolve_case`
83
+
84
+ ### Rounds 2 and 3: Pre-Arbitration and Arbitration
85
+
86
+ - `respond_to_pre_arb`
87
+ - `escalate_to_arbitration`
88
+ - `accept_arbitration_loss`
89
+
90
+ ### Long-Horizon Backlog Management
91
+
92
+ - `wait_for_updates`
93
+
94
+ The six merchant systems are:
95
+
96
+ - `orders`
97
+ - `payment`
98
+ - `shipping`
99
+ - `support`
100
+ - `refunds`
101
+ - `risk`
102
+
103
+ This makes the environment closer to a real back-office workflow than a static prompt benchmark.
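To make the idea of a typed action space concrete, here is a minimal sketch of how an action could be checked before it reaches the environment. The action and system names are copied from the lists above; the `Action` dataclass, its `system` field, and the `validate` helper are hypothetical and are not taken from the ChargebackOps code, which defines its own Pydantic/OpenEnv models.

```python
from dataclasses import dataclass
from typing import Optional

# Names copied from the lists in the post; the grouping is illustrative only.
ROUND_1_ACTIONS = {
    "select_case", "inspect_case", "query_system", "retrieve_policy",
    "add_evidence", "remove_evidence", "set_strategy",
    "submit_representment", "resolve_case",
}
LATER_ROUND_ACTIONS = {
    "respond_to_pre_arb", "escalate_to_arbitration", "accept_arbitration_loss",
}
BACKLOG_ACTIONS = {"wait_for_updates"}
VALID_ACTIONS = ROUND_1_ACTIONS | LATER_ROUND_ACTIONS | BACKLOG_ACTIONS

MERCHANT_SYSTEMS = {"orders", "payment", "shipping", "support", "refunds", "risk"}


@dataclass(frozen=True)
class Action:
    """Hypothetical action container; the real models may differ."""
    action_type: str
    case_id: Optional[str] = None
    system: Optional[str] = None  # assumed to matter only for query_system


def validate(action: Action) -> list[str]:
    """Return a list of human-readable problems; empty means the action is well-formed."""
    errors = []
    if action.action_type not in VALID_ACTIONS:
        errors.append(f"unknown action_type: {action.action_type}")
    if action.action_type == "query_system" and action.system not in MERCHANT_SYSTEMS:
        errors.append(f"unknown system: {action.system}")
    return errors


print(validate(Action("query_system", "CB-E1", "shipping")))
print(validate(Action("query_system", "CB-E1", "warehouse")))
```

Rejecting unknown names up front is what separates a typed interface from free-form text: a malformed action becomes an explicit error rather than a silent no-op.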
104
+
105
+ ---
106
+
107
+ ## A Simple Example
108
+
109
+ Imagine a `goods_not_received` dispute.
110
+
111
+ The customer says:
112
+
113
+ > I never received the product.
114
+
115
+ A weak agent might submit immediately with no evidence.
116
+
117
+ A better agent does this:
118
+
119
+ 1. Select the case.
120
+ 2. Inspect the dispute.
121
+ 3. Query `orders`.
122
+ 4. Query `shipping`.
123
+ 5. Attach order confirmation and delivery scan.
124
+ 6. Set strategy to `contest`.
125
+ 7. Submit representment.
126
+ 8. Let the issuer review the packet.
127
+
128
+ That is a complete evidence-backed operational path.
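The path above can be written down as a sequence of typed actions. This is only a sketch: the `action_type` and `case_id` keys follow the JSON shape shown later in this post, while the `system`, `evidence_id`, and `strategy` keys and the evidence identifiers are assumptions made for illustration.

```python
import json

CASE_ID = "CB-E1"  # illustrative id, same format as the JSON example in this post

# Steps 1-7 of the path; step 8 (issuer review) is performed by the environment.
plan = [
    {"action_type": "select_case", "case_id": CASE_ID},
    {"action_type": "inspect_case", "case_id": CASE_ID},
    {"action_type": "query_system", "case_id": CASE_ID, "system": "orders"},
    {"action_type": "query_system", "case_id": CASE_ID, "system": "shipping"},
    # Hypothetical evidence identifiers: the real artifact names may differ.
    {"action_type": "add_evidence", "case_id": CASE_ID, "evidence_id": "order_confirmation"},
    {"action_type": "add_evidence", "case_id": CASE_ID, "evidence_id": "delivery_scan"},
    {"action_type": "set_strategy", "case_id": CASE_ID, "strategy": "contest"},
    {"action_type": "submit_representment", "case_id": CASE_ID},
]

for step, action in enumerate(plan, start=1):
    print(step, json.dumps(action))
```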
129
+
130
+ ---
131
+
132
+ ## Architecture
133
+
134
+ ```mermaid
135
+ flowchart LR
136
+ A[Agent] --> B[Typed Action]
137
+ B --> C[ChargebackOpsEnvironment]
138
+ C --> D[Scenario Data]
139
+ C --> E[Issuer Review]
140
+ C --> F[Arbitration]
141
+ C --> G[Rubric Grader]
142
+ G --> H[Observation + Score]
143
+ ```
144
+
145
+ The environment has five main layers:
146
+
147
+ 1. **Interface layer** — Pydantic/OpenEnv models define actions, observations, state, and reports.
148
+ 2. **Environment core** — `ChargebackOpsEnvironment` runs `reset`, `step`, delayed events, deadlines, issuer review, arbitration, and final grading.
149
+ 3. **Scenario layer** — cases, evidence, tasks, generated tasks, and runtime `CaseProgress`.
150
+ 4. **Issuer/arbitration layer** — scripted issuer review and deterministic arbitration economics.
151
+ 5. **Evaluation layer** — an OpenEnv rubric tree that produces transparent case and episode scores.
152
+
153
+ ---
154
+
155
+ ## Multi-Round Dispute Lifecycle
156
+
157
+ ```mermaid
158
+ flowchart LR
159
+ A[Round 1: Representment] --> B{Issuer Review}
160
+ B -->|Accept| C[Merchant Wins]
161
+ B -->|Request More Evidence| D[Round 2: Pre-Arbitration]
162
+ D --> E{Issuer Re-Review}
163
+ E -->|Accept| C
164
+ E -->|Escalate| F[Round 3: Arbitration]
165
+ F -->|Merchant Wins| G[Amount - $250 Fee]
166
+ F -->|Issuer Wins| H[-Amount - $250 Fee]
167
+ ```
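One way to read the flowchart is as a small state machine. The sketch below encodes its edges as a transition table. The state names mirror the diagram; the `submit` and `respond` event labels are assumptions, since the diagram leaves those two edges unlabeled, and none of this is taken from the environment's code.

```python
START = "round_1_representment"

# (current state, event) -> next state, following the edges of the flowchart.
TRANSITIONS = {
    (START, "submit"): "issuer_review",                      # assumed label
    ("issuer_review", "accept"): "merchant_wins",
    ("issuer_review", "request_more_evidence"): "round_2_pre_arbitration",
    ("round_2_pre_arbitration", "respond"): "issuer_re_review",  # assumed label
    ("issuer_re_review", "accept"): "merchant_wins",
    ("issuer_re_review", "escalate"): "round_3_arbitration",
    ("round_3_arbitration", "merchant_wins"): "won_minus_fee",
    ("round_3_arbitration", "issuer_wins"): "lost_plus_fee",
}


def walk(events, state=START):
    """Follow a list of events through the lifecycle and return the final state."""
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"event {event!r} is not allowed in state {state!r}")
        state = TRANSITIONS[key]
    return state


print(walk(["submit", "accept"]))
print(walk(["submit", "request_more_evidence", "respond", "escalate", "issuer_wins"]))
```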
168
+
169
+ Arbitration is where the environment becomes especially interesting.
170
+
171
+ Both sides pay a fixed **$250** fee.
172
+
173
+ If the merchant wins arbitration:
174
+
175
+ $$
176
+ \text{merchant\_net\_pnl} = \text{amount} - 250
177
+ $$
178
+
179
+ If the merchant loses arbitration:
180
+
181
+ $$
182
+ \text{merchant\_net\_pnl} = -\text{amount} - 250
183
+ $$
184
+
185
+ So escalation is rational only when:
186
+
187
+ $$
188
+ P(\text{win}) \times \text{amount} > 250
189
+ $$
190
+
191
+ This single inequality changes the behavior we want from the agent. The best policy is not “always fight.” The best policy is to fight when the expected value is positive.
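As a worked example, the two P&L formulas and the escalation rule can be sketched in a few lines. The function names are hypothetical, and `p_win` is treated as a given input because this post does not say how the win probability is estimated. The rule is applied exactly as the inequality is written.

```python
ARBITRATION_FEE = 250.0  # fixed fee stated in the post


def arbitration_pnl(amount: float, merchant_wins: bool) -> float:
    """Merchant net P&L after arbitration, per the two formulas above."""
    return (amount if merchant_wins else -amount) - ARBITRATION_FEE


def should_escalate(p_win: float, amount: float) -> bool:
    """Apply the stated rule: escalate only when P(win) * amount > 250."""
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be a probability in [0, 1]")
    return p_win * amount > ARBITRATION_FEE


print(arbitration_pnl(1000.0, merchant_wins=True))   # 1000 - 250
print(arbitration_pnl(1000.0, merchant_wins=False))  # -1000 - 250
print(should_escalate(0.6, 1000.0))  # 600 > 250
print(should_escalate(0.6, 300.0))   # 180 > 250 is false
```

With the same 60% win probability, a $1,000 dispute clears the bar while a $300 dispute does not, which is why "always fight" is not the best policy.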
192
+
193
+ ---
194
+
195
+ ## The Hardest Task: Monthly Backlog Marathon
196
+
197
+ The flagship task is:
198
+
199
+ ```text
200
+ monthly_dispute_backlog_marathon
201
+ ```
202
+
203
+ It includes:
204
+
205
+ - 12 cases,
206
+ - 60 steps,
207
+ - wave-based case arrivals,
208
+ - delayed evidence,
209
+ - delayed issuer reviews,
210
+ - multiple open disputes,
211
+ - and arbitration tradeoffs.
212
+
213
+ This task tests memory, prioritization, and portfolio-level reasoning.
214
+
215
+ The agent has to decide not only what to do, but what to do **now**.
216
+
217
+ ---
218
+
219
+ ## Scoring: An 8-Dimensional Rubric
220
+
221
+ ChargebackOps uses a composable OpenEnv rubric instead of one monolithic reward.
222
+
223
+ ![8-dimensional rubric weights](https://huggingface.co/spaces/mitudrudutta/ChargeBackOps/resolve/main/docs/figures/rubric_weights.png)
224
+
225
+ | Dimension | Weight | What it measures |
226
+ |---|---:|---|
227
+ | Strategy correctness | 0.20 | Did the agent choose the right strategy? |
228
+ | Evidence quality | 0.15 | Did it attach useful evidence and avoid harmful evidence? |
229
+ | Packet validity | 0.10 | Was the packet complete and clean? |
230
+ | Deadline compliance | 0.10 | Was the case handled on time? |
231
+ | Efficiency | 0.10 | Did the agent avoid wasted actions? |
232
+ | Outcome quality | 0.10 | Did the final result match the correct resolution? |
233
+ | Note quality | 0.05 | Was the representment note coherent? |
234
+ | Escalation ROI | 0.20 | Was arbitration economically rational? |
235
+
236
+ The case score can be written as:
237
+
238
+ $$
239
+ \begin{aligned}
240
+ S_{case} ={}& 0.20S_{strategy}
241
+ + 0.15S_{evidence}
242
+ + 0.10S_{packet}
243
+ + 0.10S_{deadline} \\
244
+ &+ 0.10S_{efficiency}
245
+ + 0.10S_{outcome}
246
+ + 0.05S_{note}
247
+ + 0.20S_{roi}
248
+ \end{aligned}
249
+ $$
250
+
251
+ The environment also includes a deadline gate. If a case is truly abandoned past deadline, it can be hard-zeroed:
252
+
253
+ $$
254
+ S_{case}^{final} =
255
+ \begin{cases}
256
+ 0, & \text{if abandoned past deadline} \\
257
+ S_{case}, & \text{otherwise}
258
+ \end{cases}
259
+ $$
260
+
261
+ The final episode score is a weighted average across cases:
262
+
263
+ $$
264
+ S_{episode} = \frac{\sum_i w_i S_i}{\sum_i w_i}
265
+ $$
266
+
267
+ This gives the environment a useful property: results are not just scores; they can be diagnosed dimension by dimension.
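The three formulas above compose into a short calculation. The sketch below uses the weights from the table; the function names, the dictionary keys, and the assumption that each dimension score lies in [0, 1] are illustrative and not taken from the rubric implementation.

```python
# Weights copied from the rubric table.
WEIGHTS = {
    "strategy": 0.20, "evidence": 0.15, "packet": 0.10, "deadline": 0.10,
    "efficiency": 0.10, "outcome": 0.10, "note": 0.05, "roi": 0.20,
}


def case_score(dims: dict, abandoned_past_deadline: bool = False) -> float:
    """Weighted sum of the eight dimensions, with the deadline gate applied."""
    if abandoned_past_deadline:
        return 0.0  # hard zero for a case abandoned past its deadline
    return sum(WEIGHTS[name] * dims[name] for name in WEIGHTS)


def episode_score(case_scores, case_weights) -> float:
    """Weighted average of case scores across the episode."""
    total = sum(case_weights)
    if total <= 0:
        raise ValueError("case weights must sum to a positive value")
    return sum(w * s for w, s in zip(case_weights, case_scores)) / total


# Example: strong on everything except escalation ROI.
dims = {name: 1.0 for name in WEIGHTS}
dims["roi"] = 0.0
print(round(case_score(dims), 3))
print(episode_score([1.0, 0.5], [3, 1]))
```

Because each dimension contributes a known share, a low score can be traced to the dimension that caused it. In the example, the missing 0.20 comes entirely from escalation ROI.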
268
+
269
+ ---
270
+
271
+ ## Does the Environment Separate Good and Bad Policies?
272
+
273
+ Yes.
274
+
275
+ I tested four scripted policies across the headline catalog and multi-seed grid.
276
+
277
+ ![Policy discrimination benchmark](https://huggingface.co/spaces/mitudrudutta/ChargeBackOps/resolve/main/docs/figures/discrimination_gradient.png)
278
+
279
+ | Policy | Headline avg | Multi-seed avg | Behavior |
280
+ |---|---:|---:|---|
281
+ | `naive` | 0.000 | 0.000 | Submit empty packets immediately |
282
+ | `concede_all` | 0.444 | 0.445 | Always accept the chargeback |
283
+ | `escalate_all` | 0.767 | 0.768 | Always contest and always escalate |
284
+ | `heuristic` | **0.813** | 0.763 | EV-rational rule-based policy |
285
+
286
+ The headline discrimination delta is:
287
+
288
+ $$
289
+ \Delta = S_{heuristic} - S_{naive} = 0.813 - 0.000 = 0.813
290
+ $$
291
+
292
+ That means the benchmark clearly separates weak shortcut behavior from stronger operational behavior.
293
+
294
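+ The naive policy collapses. The concede-all policy gets partial credit but misses positive-EV contests. The escalate-all policy looks strong but pays unnecessary arbitration costs. The heuristic scores highest on the headline catalog (0.813 vs. 0.767) because it balances evidence, deadline, and ROI, although on the multi-seed grid it lands just below escalate-all (0.763 vs. 0.768).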
295
+
296
+ ---
297
+
298
+ ## Training Pipeline
299
+
300
+ The training setup uses:
301
+
302
+ ```text
303
+ Qwen/Qwen2.5-3B-Instruct + fp16 LoRA
304
+ ```
305
+
306
+ on a single T4.
307
+
308
+ The training pipeline has two phases.
309
+
310
+ ### Phase A: Supervised Fine-Tuning
311
+
312
+ The SFT phase teaches the model the typed-action interface and basic dispute workflow behavior.
313
+
314
+ - 4,000 heuristic-generated prompt/action pairs.
315
+ - LoRA rank 16.
316
+ - 150 SFT steps.
317
+
318
+ ### Phase B: GRPO
319
+
320
+ The GRPO phase uses outcome reward and format reward.
321
+
322
+ The outcome reward is based on terminal merchant P&L:
323
+
324
+ $$
325
+ R_{outcome} = \text{normalize}(\text{terminal merchant P\&L})
326
+ $$
327
+
328
+ The format reward provides a small signal for structured output:
329
+
330
+ $$
331
+ R_{format} =
332
+ \begin{cases}
333
+ +0.05, & \text{if output is parseable JSON} \\
334
+ -0.10, & \text{if output is not parseable JSON}
335
+ \end{cases}
336
+ $$
337
+
338
+ The combined reward is:
339
+
340
+ $$
341
+ R = R_{outcome} + R_{format}
342
+ $$
343
+
344
+ The reason for using outcome reward is simple: the goal is not just to imitate a heuristic. The goal is to optimize business outcome.
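The reward shaping above can be sketched as follows. The format reward values come from the formula in this section. The `normalize_pnl` function is a placeholder, because the post does not specify the normalization; clipping `pnl / scale` to [-1, 1] with an arbitrary `scale` is an assumption made only so the example runs.

```python
import json


def format_reward(text: str) -> float:
    """+0.05 if the output parses as JSON, -0.10 otherwise."""
    try:
        json.loads(text)
    except (TypeError, ValueError):  # json.JSONDecodeError subclasses ValueError
        return -0.10
    return 0.05


def normalize_pnl(pnl: float, scale: float) -> float:
    """Hypothetical normalization: scale the P&L and clip it to [-1, 1]."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return max(-1.0, min(1.0, pnl / scale))


def combined_reward(text: str, terminal_pnl: float, scale: float = 1000.0) -> float:
    """R = R_outcome + R_format."""
    return normalize_pnl(terminal_pnl, scale) + format_reward(text)


print(format_reward('{"action_type": "select_case", "case_id": "CB-E1"}'))
print(format_reward("I think we should contest this one"))
print(combined_reward('{"action_type": "select_case"}', terminal_pnl=500.0))
```

Note that a parseability check only tests JSON syntax. An output can earn the format reward and still name an action the environment does not accept.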
345
+
346
+ ---
347
+
348
+ ## Training Results
349
+
350
+ ![Training curve](https://huggingface.co/spaces/mitudrudutta/ChargeBackOps/resolve/main/docs/figures/training_curve.png)
351
+
352
+ The clearest legitimate learning signal is the SFT checkpoint.
353
+
354
+ | Checkpoint | Overall score |
355
+ |---|---:|
356
+ | Untrained Qwen2.5-3B base | 0.456 |
357
+ | SFT checkpoint | **0.536** |
358
+
359
+ The absolute improvement is:
360
+
361
+ $$
362
+ \Delta_{SFT} = 0.536 - 0.456 = 0.080
363
+ $$
364
+
365
+ The relative improvement is:
366
+
367
+ $$
368
+ \frac{0.536 - 0.456}{0.456} \times 100 \approx 17.54\% \approx 18\%
369
+ $$
370
+
371
+ The SFT model learned the interface and improved over the base model.
372
+
373
+ ---
374
+
375
+ ## Per-Difficulty Behavior
376
+
377
+ ![Training curve by family](https://huggingface.co/spaces/mitudrudutta/ChargeBackOps/resolve/main/docs/figures/training_curve_by_family.png)
378
+
379
+ The easy and medium cases improve most clearly after SFT.
380
+
381
+ The hard and nightmare tasks remain difficult because they require:
382
+
383
+ - multi-case triage,
384
+ - delayed evidence tracking,
385
+ - deadline-aware prioritization,
386
+ - harmful evidence filtering,
387
+ - and portfolio-level planning.
388
+
389
+ That is exactly what makes the environment useful. The easy cases teach the interface. The harder cases test whether an agent can manage operational complexity.
390
+
391
+ ---
392
+
393
+ ## A Useful Training Surprise
394
+
395
+ The GRPO phase produced an important lesson.
396
+
397
+ Later GRPO checkpoints appeared to match the heuristic baseline exactly. At first, that looked like success.
398
+
399
+ But diagnostic rollouts showed the model was emitting:
400
+
401
+ ```json
402
+ {"action_type": "accept_case", "case_id": "CB-E1"}
403
+ ```
404
+
405
+ `accept_case` is not a valid environment action.
406
+
407
+ The closest valid actions are:
408
+
409
+ - `accept_chargeback`
410
+ - `accept_arbitration_loss`
411
+ - `select_case`
412
+
413
+ The invalid action parsed as JSON but failed action validation. Because the evaluation helper fell back to the heuristic on invalid model output, the final score reflected heuristic behavior rather than trained-model behavior.
414
+
415
+ ![Gaming attribution](https://huggingface.co/spaces/mitudrudutta/ChargeBackOps/resolve/main/docs/figures/gaming_attribution.png)
416
+
417
+ This produced a clear rule for typed-action RL environments:
418
+
419
+ > A model should not get credit for a fallback policy’s work.
420
+
421
+ A corrected evaluation can penalize invalid actions:
422
+
423
+ $$
424
+ S_{eval}^{corrected} = S_{rubric} - \lambda \cdot N_{invalid}
425
+ $$
426
+
427
+ where:
428
+
429
+ - $S_{rubric}$ is the original rubric score,
430
+ - $N_{invalid}$ is the number of invalid actions,
431
+ - and $\lambda$ is the penalty per invalid action.
432
+
433
+ This is a side lesson, not the core product. But it is an important one: typed-action training needs strict attribution.
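A minimal sketch of the corrected evaluation is shown below. The valid-action set combines the action names listed earlier in this post with `accept_chargeback`, which this section names as valid, and may not be exhaustive. The helper names and the example penalty value are assumptions; the post does not fix a value for $\lambda$.

```python
# Action names taken from this post; the real environment may accept others.
VALID_ACTIONS = {
    "select_case", "inspect_case", "query_system", "retrieve_policy",
    "add_evidence", "remove_evidence", "set_strategy",
    "submit_representment", "resolve_case",
    "respond_to_pre_arb", "escalate_to_arbitration", "accept_arbitration_loss",
    "wait_for_updates", "accept_chargeback",
}


def is_valid(action) -> bool:
    """An action counts as valid only if it is a dict with a known action_type."""
    return isinstance(action, dict) and action.get("action_type") in VALID_ACTIONS


def corrected_score(rubric_score: float, model_actions, penalty: float):
    """Return (S_rubric - lambda * N_invalid, N_invalid) for the model's own outputs."""
    n_invalid = sum(1 for action in model_actions if not is_valid(action))
    return rubric_score - penalty * n_invalid, n_invalid


model_actions = [
    {"action_type": "accept_case", "case_id": "CB-E1"},  # parses as JSON, but invalid
    {"action_type": "select_case", "case_id": "CB-E1"},
]
score, n_invalid = corrected_score(0.8, model_actions, penalty=0.1)
print(n_invalid, round(score, 3))
```

Counting invalid actions on the model's raw outputs, before any fallback policy runs, is what keeps the attribution strict.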
434
+
435
+ ---
436
+
437
+ ## Why This Matters Beyond Chargebacks
438
+
439
+ ChargebackOps is about more than disputes.
440
+
441
+ The same structure appears in many real-world workflows:
442
+
443
+ - insurance claims,
444
+ - tax audits,
445
+ - content-moderation appeals,
446
+ - procurement disputes,
447
+ - patent disputes,
448
+ - compliance reviews.
449
+
450
+ These workflows share the same pattern:
451
+
452
+ 1. Evidence is scattered.
453
+ 2. Deadlines matter.
454
+ 3. Escalation has a cost.
455
+ 4. Bad evidence can hurt.
456
+ 5. The correct action depends on both probability and value.
457
+
458
+ ChargebackOps turns that pattern into a benchmark.
459
+
460
+ ---
461
+
462
+ ## How to Try It
463
+
464
+ Open the Hugging Face Space:
465
+
466
+ https://huggingface.co/spaces/mitudrudutta/ChargeBackOps
467
+
468
+ Or run locally:
469
+
470
+ ```bash
471
+ git clone https://github.com/MitudruDutta/chargebackops.git
472
+ cd chargebackops
473
+ pip install -e ".[dev]"
474
+ pytest -q tests
475
+ openenv validate .
476
+ uvicorn server.app:app --host 0.0.0.0 --port 8000
477
+ ```
478
+
479
+ Then open:
480
+
481
+ ```text
482
+ http://localhost:8000/docs
483
+ http://localhost:8000/demo
484
+ ```
485
+
486
+ A simple demo path is:
487
+
488
+ 1. Start `goods_not_received_easy`.
489
+ 2. Select the dispute case.
490
+ 3. Query `orders`.
491
+ 4. Query `shipping`.
492
+ 5. Attach order and delivery evidence.
493
+ 6. Set strategy to `contest`.
494
+ 7. Submit representment.
495
+ 8. Show issuer acceptance.
496
+ 9. Show the final grader report.
497
+
498
+ ---
499
+
500
+ ## What Comes Next
501
+
502
+ The next improvements are clear:
503
+
504
+ - stricter invalid-action penalties,
505
+ - fallback-free trained-policy evaluation,
506
+ - deeper Visa/Mastercard rule modeling,
507
+ - more stochastic merchant-system behavior,
508
+ - adaptive or learned issuer opponents,
509
+ - richer ISO and Stripe data adapters,
510
+ - and a cleaner product dashboard for dispute workflows.
511
+
512
+ ---
513
+
514
+ ## Conclusion
515
+
516
+ ChargebackOps is a reproducible OpenEnv benchmark for long-horizon, cost-sensitive, evidence-driven agent behavior.
517
+
518
+ It does not ask an LLM to merely summarize documents. It asks the agent to act.
519
+
520
+ The agent must gather evidence, avoid harmful signals, handle deadlines, respond to issuer pushback, and decide whether arbitration is worth the fee.
521
+
522
+ In short:
523
+
524
+ > ChargebackOps is not about teaching an agent to click buttons. It is about teaching an agent to make evidence-backed decisions when every step has a cost.
525
+
526
+ Try it here:
527
+
528
+ https://huggingface.co/spaces/mitudrudutta/ChargeBackOps