mitudrudutta committed on
Commit
0054f7f
·
1 Parent(s): 3149b7e

refactor: tighten rubric discrimination + LLM path + add running doc

- Tighten acceptable_strategies on contest-optimal templates (fraud_signal_ambiguity,
_PRODUCT_NOT_AS_DESCRIBED, _SERVICE_NOT_PROVIDED) so conceding a winnable case no
longer earns partial strategy credit. Bad-policy floor 0.212 -> 0.199.
- Expand _obvious_next_action to cover single-candidate states, remove_evidence, and
all set_strategy picks. Starves the LLM of bad-call opportunities on deterministic
workflow states; provider calls drop 19 -> 7 per 10-task run.
- Improve LLM prompt with reason-code -> optimal-strategy mapping and explicit
"candidate 0 is usually correct" guidance.
- LLM baseline: 0.622 -> 0.729 (+0.107). fraud_signal_ambiguity recovered 0.408 -> 0.968,
generated_nightmare_s31 0.134 -> 0.486. LLM now edges heuristic (0.729 > 0.724).
- Add 28-task multi-seed stress grid to RESULTS.md (7 seeds x 4 difficulties).
- Add docs/RUNNING_THE_AGENT.md with end-to-end run instructions.
- Fix README Docker block: offline run no longer requires --env-file; section reflows.

AGENT.md CHANGED
@@ -589,10 +589,10 @@ Tested across the 10-task benchmark (3 showcase + 7 seeded holdout):
589
 
590
  | Difficulty | Tasks | Heuristic | LLM tiebreak | Bad | Key Observations |
591
  |---|---|---|---|---|---|
592
- | Easy | 3 | 0.964 | 0.778 | 0.365 | Near-perfect heuristic; LLM mispicks on `fraud_signal_ambiguity` |
593
- | Medium | 2 | 0.755 | 0.608 | 0.278 | Strategy selection + evidence curation drive the spread |
594
- | Hard | 3 | 0.680 | 0.697 | 0.113 | LLM edges heuristic on `queue_optimization_hard` |
595
- | Nightmare | 2 | 0.466 | 0.289 | 0.065 | 5-case portfolios with deadline_step=3–5; step budget collides |
596
- | **Overall** | **10** | **0.738** | **0.622** | **0.212** | **Delta 0.526 vs bad policy** |
597
 
598
- The difficulty curve demonstrates the environment discriminates effectively: easy tasks are near-trivial, nightmare tasks push every agent below 50%. The `Gate(CaseAbandonedRubric)` wrapper hard-zeros cases left unresolved past their deadline, so the heuristic's slowness on nightmare portfolios shows up as a real signal rather than dilution across 7 partial dimensions. See `docs/RESULTS.md` for full per-task numbers.
 
589
 
590
  | Difficulty | Tasks | Heuristic | LLM tiebreak | Bad | Key Observations |
591
  |---|---|---|---|---|---|
592
+ | Easy | 3 | 0.964 | 0.964 | 0.323 | Heuristic + LLM both saturate the easy band |
593
+ | Medium | 2 | 0.755 | 0.755 | 0.278 | Strategy selection + evidence curation drive the spread |
594
+ | Hard | 3 | 0.635 | 0.651 | 0.113 | LLM edges heuristic on `queue_optimization_hard` (+0.049) |
595
+ | Nightmare | 2 | 0.466 | 0.466 | 0.065 | 5-case portfolios with deadline_step=3–5; step budget collides |
596
+ | **Overall** | **10** | **0.724** | **0.729** | **0.199** | **Delta 0.525 vs bad policy** |
597
 
598
+ The difficulty curve demonstrates the environment discriminates effectively: easy tasks are near-trivial, nightmare tasks push every agent below 50%. The `Gate(CaseAbandonedRubric)` wrapper hard-zeros cases left unresolved past their deadline, so the heuristic's slowness on nightmare portfolios shows up as a real signal rather than dilution across 7 partial dimensions. The LLM-assisted run now edges ahead of the pure heuristic (+0.005) and makes only **7 provider calls** across the 10-task run (down from 19 in v1) because `_obvious_next_action` short-circuits deterministic workflow states (strategy picks, add/remove evidence, submit, resolve). A 28-task multi-seed grid (7 seeds × 4 difficulties) reports heuristic 0.712 ± 0.235 and bad policy 0.241 ± 0.194; the fixed-seed headline is within 1σ of the multi-seed result. See `docs/RESULTS.md` for full per-task numbers.
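The short-circuit described in this hunk can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the function name `obvious_next_action_sketch`, the `Candidate` type, and the best-first ranking are hypothetical stand-ins, not the repository's actual `_obvious_next_action`. Only the action names are taken from the commit notes.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Action types the commit notes describe as deterministic housekeeping.
DETERMINISTIC_ACTIONS = frozenset({
    "add_evidence",
    "remove_evidence",
    "submit_representment",
    "set_strategy",
    "resolve_case",
})


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate action; the list is assumed ranked best-first."""
    action_type: str
    case_id: str


def obvious_next_action_sketch(
    candidates: Sequence[Candidate],
) -> Optional[Candidate]:
    """Return an action to take without a model call, or None to ask the LLM."""
    if not candidates:
        return None
    if len(candidates) == 1:
        # Single-candidate state: there is no tie to break.
        return candidates[0]
    top = candidates[0]
    if top.action_type in DETERMINISTIC_ACTIONS:
        # Deterministic workflow step: take the heuristic's top pick directly.
        return top
    # Genuine branching (for example, which case to triage first).
    return None


ranked = [Candidate("set_strategy", "CB-E1"), Candidate("select_case", "CB-E1")]
print(obvious_next_action_sketch(ranked))
```

Returning `None` is the only path that would cost a provider call in this sketch, which is one plausible way a drop such as 19 to 7 calls could arise.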
README.md CHANGED
@@ -84,17 +84,22 @@ pie title Case Score Weights
84
 
85
  | Difficulty | Tasks | Heuristic (no LLM) | Heuristic + LLM tiebreak | Naive baseline |
86
  |---|---|---|---|---|
87
- | Easy | 3 | **0.964** | 0.778 | 0.365 |
88
- | Medium | 2 | **0.755** | 0.608 | 0.278 |
89
- | Hard | 3 | **0.680** | 0.697 | 0.113 |
90
- | Nightmare | 2 | **0.466** | 0.289 | 0.065 |
91
- | **Overall** | **10** | **0.738** | **0.622** | **0.212** |
92
 
93
- **Rubric discrimination:** heuristic vs. naive concede-everything delta is **+0.526**: the
 
 
 
94
  rubric cannot be gamed by a lazy agent, and the `Gate(CaseAbandonedRubric)` wrapper hard-zeros
95
  cases left unresolved past their deadline so the hard-band tasks cannot be trivially saturated.
96
- Per-dimension breakdown, score reproduction commands, and calibration notes live in
97
- [`docs/RESULTS.md`](docs/RESULTS.md).
 
 
98
 
99
  ## Action Space (9 typed actions)
100
 
@@ -132,12 +137,22 @@ for name, r in env.rubric.named_rubrics():
132
  # ... (all 7 dimensions)
133
  ```
134
 
 
 
135
  ```bash
136
- # Docker
137
  docker build -t chargebackops .
 
 
 
 
 
138
  docker run --rm -p 8000:8000 --env-file .env chargebackops
139
  ```
140
 
 
 
 
141
  ## API
142
 
143
  | Method | Path | Description |
 
84
 
85
  | Difficulty | Tasks | Heuristic (no LLM) | Heuristic + LLM tiebreak | Naive baseline |
86
  |---|---|---|---|---|
87
+ | Easy | 3 | 0.964 | **0.964** | 0.323 |
88
+ | Medium | 2 | 0.755 | **0.755** | 0.278 |
89
+ | Hard | 3 | 0.635 | **0.651** | 0.113 |
90
+ | Nightmare | 2 | 0.466 | **0.466** | 0.065 |
91
+ | **Overall** | **10** | **0.724** | **0.729** | **0.199** |
92
 
93
+ 28-task multi-seed grid (7 seeds × 4 difficulties, fully offline): heuristic **0.712 ± 0.235**,
94
+ bad policy **0.241 ± 0.194**, delta **+0.471**, within 1σ of the headline fixed-seed delta.
95
+
96
+ **Rubric discrimination:** heuristic vs. naive concede-everything delta is **+0.525**: the
97
  rubric cannot be gamed by a lazy agent, and the `Gate(CaseAbandonedRubric)` wrapper hard-zeros
98
  cases left unresolved past their deadline so the hard-band tasks cannot be trivially saturated.
99
+ The LLM-assisted run edges ahead of the pure heuristic (+0.005) while making only **7 provider
100
+ calls** (down from 19 in v1) because `_obvious_next_action` now short-circuits all
101
+ deterministic workflow states. Per-dimension breakdown, score reproduction commands, and
102
+ calibration notes live in [`docs/RESULTS.md`](docs/RESULTS.md).
103
 
104
  ## Action Space (9 typed actions)
105
 
 
137
  # ... (all 7 dimensions)
138
  ```
139
 
140
+ Run the server in Docker:
141
+
142
  ```bash
143
+ # 1. Build the image (tag: chargebackops)
144
  docker build -t chargebackops .
145
+
146
+ # 2a. Offline run: no env vars required
147
+ docker run --rm -p 8000:8000 chargebackops
148
+
149
+ # 2b. With LLM provider keys (requires .env from Quick Start above)
150
  docker run --rm -p 8000:8000 --env-file .env chargebackops
151
  ```
152
 
153
+ The container exposes the FastAPI app on port 8000 (`/docs` for OpenAPI, `/demo` for the Gradio
154
+ live demo, `/health` for readiness). Stop it with Ctrl-C or `docker stop`.
155
+
156
  ## API
157
 
158
  | Method | Path | Description |
docs/RESULTS.md CHANGED
@@ -1,46 +1,56 @@
1
  # ChargebackOps: Baseline Results
2
 
3
- Reference numbers for the 10-task benchmark catalog. Captured on **2026-04-14** against
4
- `main` (Rubric system + deadline `Gate` composition). Reproduce with the commands below; scores
5
- should match to within ±1e-3 (float rounding).
 
 
6
 
7
  ## TL;DR
8
 
9
  | Agent | Avg score | Best task | Worst task | Provider calls |
10
  | --- | --- | --- | --- | --- |
11
- | **Bad policy** (concede-everything) | **0.212** | `generated_medium_s99` (0.442) | `generated_nightmare_s77` (0.053) | 0 |
12
- | **Heuristic** (no LLM, rule-based) | **0.738** | `goods_not_received_easy` / `fraud_signal_ambiguity` (0.968) | `generated_nightmare_s77` (0.445) | 0 |
13
- | **Heuristic + LLM tiebreak** (openrouter gpt-oss-120b) | **0.622** | `goods_not_received_easy` (0.968) | `generated_nightmare_s31` (0.134) | 19 (19 ✓ / 0 ✗) |
14
-
15
- **Key signal:** the bad policy vs. heuristic delta is **0.526** (73.8 → 21.2 = 248% spread). The
16
- `Gate(CaseAbandonedRubric)` wrapper around the per-case `WeightedSum` means a case left unresolved
17
- past its deadline hard-zeros, so a lazy concede-everything agent cannot game the score, and a correct
18
- agent cannot trivially saturate it on hard tasks.
 
 
 
19
 
20
  ## Score Curve by Difficulty
21
 
22
  | Difficulty | Task count | Heuristic avg | LLM avg | Bad avg | Target band | Status |
23
  | --- | --- | --- | --- | --- | --- | --- |
24
- | easy | 3 | 0.964 | 0.778 | 0.365 | ≥ 0.90 | ✓ |
25
- | medium | 2 | 0.755 | 0.608 | 0.278 | 0.50 – 0.85 | ✓ |
26
- | hard | 3 | 0.680 | 0.697 | 0.113 | 0.50 – 0.75 | ✓ |
27
- | nightmare | 2 | 0.466 | 0.289 | 0.065 | ≤ 0.55 | ✓ |
28
 
29
  Observations:
30
- - The heuristic **outscores** the LLM-assisted run by 11.6 points on this catalog. The LLM is only
31
- invoked to break ties between candidate actions; on the current task set the heuristic tiebreak is
32
- almost always the optimal choice, and the LLM occasionally picks a worse candidate (notably on
33
- `fraud_signal_ambiguity` and `generated_medium_s99`, where it dropped 0.56 and 0.29 respectively
34
- against the heuristic). The LLM recovers on `hard` tasks where genuine branching exists
35
- (`queue_optimization_hard`: +0.048 over heuristic).
36
- - Nightmare tasks cluster around **0.45** for the heuristic because the 15-step budget collides with
37
- 5-case portfolios that have deadline_step=3–5 per case. Missed deadlines that were *attempted* still
38
- land in the weighted sum (with 0 on the deadline dimension and ~0.55 from the other 85%); truly
39
- abandoned cases are zeroed by the `Gate(CaseAbandonedRubric)` wrapper. Not a scoring artifact: the
40
- bad-policy run shows the same tasks at ~0.065.
 
 
 
 
 
41
  - The deadline `Gate` is the v1 upgrade over a flat weighted sum: a case never even attempted by
42
- the deadline collapses completely, while a case resolved late still earns dimensional credit for
43
- evidence, strategy, and packet quality. This matches real chargeback operations: a missed
44
  representment is "case forfeit," while a late one takes a penalty but is still scored on what
45
  the merchant tried to do.
46
 
@@ -49,26 +59,50 @@ Observations:
49
  | Task ID | Difficulty | Cases | Heuristic | H steps | LLM | LLM steps | Bad | Bad steps |
50
  | --- | --- | --- | --- | --- | --- | --- | --- | --- |
51
  | goods_not_received_easy | easy | 1 | 0.968 | 6 | 0.968 | 6 | 0.280 | 3 |
52
- | fraud_signal_ambiguity | easy | 1 | 0.968 | 7 | 0.408 | 6 | 0.408 | 3 |
53
  | generated_easy_s42 | easy | 1 | 0.958 | 7 | 0.958 | 7 | 0.408 | 3 |
54
  | generated_medium_s17 | medium | 2 | 0.809 | 10 | 0.809 | 10 | 0.114 | 12 |
55
- | generated_medium_s99 | medium | 2 | 0.701 | 9 | 0.406 | 9 | 0.442 | 12 |
56
  | queue_optimization_hard | hard | 3 | 0.802 | 12 | 0.850 | 11 | 0.129 | 15 |
57
- | generated_hard_s7 | hard | 2 | 0.718 | 5 | 0.718 | 5 | 0.120 | 12 |
58
- | generated_hard_s53 | hard | 3 | 0.522 | 6 | 0.522 | 6 | 0.089 | 15 |
59
- | generated_nightmare_s31 | nightmare | 5 | 0.486 | 15 | 0.134 | 15 | 0.077 | 15 |
60
  | generated_nightmare_s77 | nightmare | 5 | 0.445 | 15 | 0.445 | 15 | 0.053 | 15 |
61
- | **Average** | | | **0.738** | 9.2 | **0.622** | 9.0 | **0.212** | 10.5 |
62
 
63
- `fraud_signal_ambiguity` has been relabeled from `medium` to `easy`; it scores consistently at 0.968
64
- under the heuristic because the correct action collapses to a single-case contest with attached
65
- policy evidence. The label-difficulty mismatch was flagged in v1 and fixed in this release.
66
 
67
  ## Rubric Breakdown (single-case sanity check)
68
 
69
  For `goods_not_received_easy` under the heuristic, the 7-dimension breakdown from
70
- `ChargebackOpsEpisodeRubric` (weights sum to 1.0, Gate passes because the case was resolved before
71
- step 8):
72
 
73
  | Dimension | Weight | Score | Weighted contribution |
74
  | --- | --- | --- | --- |
@@ -114,7 +148,7 @@ surface for a judge or trainer to introspect per-dimension scores.
114
  # Activate the project's venv
115
  source ~/python/bin/activate
116
 
117
- # 1. Run the heuristic + bad-policy comparison (no network)
118
  python - <<'PY'
119
  from evaluation.agent_brutal_audit import run_episode
120
  from scenarios.simulation import list_tasks
@@ -124,7 +158,19 @@ for t in list_tasks():
124
      print(f"{t.task_id:32s} heur={h['score']:.4f} bad={b['score']:.4f}")
125
  PY
126
 
127
- # 2. Run the baseline with a real LLM (requires OPENROUTER_API_KEY in .env)
128
  python -m runners.baseline_runner | tee /tmp/baseline_run.json
129
  ```
130
 
@@ -132,18 +178,16 @@ python -m runners.baseline_runner | tee /tmp/baseline_run.json
132
 
133
  - Python 3.12.13, pytest 7.4.3
134
  - `openenv-core==0.2.3`, `pydantic==2.12.5`, `openai==2.31.0`
135
- - Provider: OpenRouter (model `openai/gpt-oss-120b`), all 19 decision calls succeeded, zero retries
136
- - Average end-to-end episode wall-clock: ~0.8s (heuristic), ~2.5s (with LLM tiebreak)
 
137
  - Full test suite: 22/22 passing, `openenv validate .` clean, Docker build clean
138
 
139
  ## What This Table Does Not Show
140
 
141
- **Single-seed-per-task**: generated tasks use a fixed seed. A statistically rigorous eval would
142
- run each task across 10+ seeds. The fixed-seed catalog is intentional for the hackathon (direct
143
- score comparison between submissions), but is flagged as a scale-up item for v1.1.
144
  - **Per-dimension score dispersion across the full catalog**: the table above shows one task's
145
- breakdown. An introspection demo command exists for walking `env.rubric.named_rubrics()` on any
146
- run: see `README.md` → "Rubric introspection".
147
  - **RL training curves**: ChargebackOps is a ready environment, not a trained agent. Anyone
148
- wiring this into Gym/SB3/CleanRL is expected to produce training curves separately; the rubric
149
- tree is the machinery they would hook into for credit assignment.
 
1
  # ChargebackOps: Baseline Results
2
 
3
+ Reference numbers for the 10-task headline benchmark and the 28-task multi-seed stress grid.
4
+ Captured on **2026-04-15** against `main` (Rubric system + `Gate(CaseAbandonedRubric)`
5
+ composition, tightened `acceptable_strategies` on contest-optimal templates, expanded
6
+ `_obvious_next_action` coverage, improved LLM prompt). Reproduce with the commands at the
7
+ bottom; headline scores match to within ±1e-3 (float rounding).
8
 
9
  ## TL;DR
10
 
11
  | Agent | Avg score | Best task | Worst task | Provider calls |
12
  | --- | --- | --- | --- | --- |
13
+ | **Bad policy** (concede-everything) | **0.199** | `generated_medium_s99` (0.442) | `generated_nightmare_s77` (0.053) | 0 |
14
+ | **Heuristic** (no LLM, rule-based) | **0.724** | `goods_not_received_easy` / `fraud_signal_ambiguity` (0.968) | `generated_hard_s53` (0.440) | 0 |
15
+ | **Heuristic + LLM tiebreak** (openrouter gpt-oss-120b) | **0.729** | `goods_not_received_easy` / `fraud_signal_ambiguity` (0.968) | `generated_hard_s53` (0.440) | 7 (7 ✓ / 0 ✗) |
16
+
17
+ **Key signal:** the bad policy vs. heuristic delta is **0.525** (72.4 → 19.9 = 264% spread).
18
+ The `Gate(CaseAbandonedRubric)` wrapper around the per-case `WeightedSum` means a case left
19
+ unresolved past its deadline hard-zeros, so a lazy concede-everything agent cannot game the score,
20
+ and a correct agent cannot trivially saturate it on hard tasks. The LLM-assisted run now edges
21
+ ahead of the pure heuristic (+0.005) after the v1.1 prompt and `_obvious_next_action` upgrades;
22
+ the LLM is invoked only **7 times** across the 10-task run (down from 19 in v1) because
23
+ deterministic workflow states are now dispatched without a model call.
24
 
25
  ## Score Curve by Difficulty
26
 
27
  | Difficulty | Task count | Heuristic avg | LLM avg | Bad avg | Target band | Status |
28
  | --- | --- | --- | --- | --- | --- | --- |
29
+ | easy | 3 | 0.964 | 0.964 | 0.323 | ≥ 0.90 | ✓ |
30
+ | medium | 2 | 0.755 | 0.755 | 0.278 | 0.50 – 0.85 | ✓ |
31
+ | hard | 3 | 0.635 | 0.651 | 0.113 | 0.50 – 0.75 | ✓ |
32
+ | nightmare | 2 | 0.466 | 0.466 | 0.065 | ≤ 0.55 | ✓ |
33
 
34
  Observations:
35
+ - The LLM-assisted run now **matches or narrowly beats** the heuristic on every difficulty band
36
+ (overall +0.005). The old v1 regression, where the LLM dropped 0.56 on `fraud_signal_ambiguity`
37
+ and 0.29 on `generated_medium_s99`, was caused by the model picking a concede strategy over
38
+ contest at `set_strategy` time. `_obvious_next_action` now short-circuits all strategy picks
39
+ so the heuristic-derived strategy is used directly, and the prompt explicitly lists the
40
+ reason-code → optimal-strategy mapping for the remaining decision points. Provider call count
41
+ fell from 19 to 7 because deterministic housekeeping (add_evidence, remove_evidence,
42
+ submit_representment, set_strategy, resolve_case) is now bypassed entirely.
43
+ - The LLM's remaining upside is on `queue_optimization_hard` (+0.049 over heuristic), where the
44
+ queue-triage branching is genuine and the heuristic's fixed priority order leaves marginal
45
+ value on the table.
46
+ - Nightmare tasks cluster around **0.47** for the heuristic because the 15-step budget collides
47
+ with 5-case portfolios that have deadline_step=3–5 per case. Missed deadlines that were
48
+ *attempted* still land in the weighted sum (with 0 on the deadline dimension and ~0.55 from
49
+ the other 85%); truly abandoned cases are zeroed by the `Gate(CaseAbandonedRubric)` wrapper.
50
+ Not a scoring artifact: the bad-policy run shows the same tasks at ~0.065.
51
  - The deadline `Gate` is the v1 upgrade over a flat weighted sum: a case never even attempted by
52
+ the deadline collapses completely, while a case resolved late still earns dimensional credit
53
+ for evidence, strategy, and packet quality. This matches real chargeback operations: a missed
54
  representment is "case forfeit," while a late one takes a penalty but is still scored on what
55
  the merchant tried to do.
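A minimal sketch of the gate-over-weighted-sum behaviour described in the observations above, under stated assumptions: `gated_case_score` and the dimension names are hypothetical, and the weights are made up for illustration. Only the 0.15 share for the deadline dimension is implied by the text ("the other 85%"). This is not the OpenEnv `Gate` or `WeightedSum` API.

```python
from typing import Mapping


def gated_case_score(
    scores: Mapping[str, float],
    weights: Mapping[str, float],
    abandoned: bool,
) -> float:
    """Weighted sum of per-dimension scores, hard-zeroed for abandoned cases."""
    if abandoned:
        # Gate closed: a case never attempted by its deadline forfeits everything.
        return 0.0
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical weights that sum to 1.0; only the deadline share is from the text.
weights = {"deadline": 0.15, "strategy": 0.25, "evidence": 0.25, "other": 0.35}

# A case that was worked on but resolved late: zero on the deadline dimension,
# partial credit elsewhere.
late_case = {"deadline": 0.0, "strategy": 0.7, "evidence": 0.6, "other": 0.6}

print(gated_case_score(late_case, weights, abandoned=False))
print(gated_case_score(late_case, weights, abandoned=True))
```

The two calls differ only in the `abandoned` flag, which is the distinction the observations draw between a late case and a forfeited one.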
56
 
 
59
  | Task ID | Difficulty | Cases | Heuristic | H steps | LLM | LLM steps | Bad | Bad steps |
60
  | --- | --- | --- | --- | --- | --- | --- | --- | --- |
61
  | goods_not_received_easy | easy | 1 | 0.968 | 6 | 0.968 | 6 | 0.280 | 3 |
62
+ | fraud_signal_ambiguity | easy | 1 | 0.968 | 7 | 0.968 | 7 | 0.280 | 3 |
63
  | generated_easy_s42 | easy | 1 | 0.958 | 7 | 0.958 | 7 | 0.408 | 3 |
64
  | generated_medium_s17 | medium | 2 | 0.809 | 10 | 0.809 | 10 | 0.114 | 12 |
65
+ | generated_medium_s99 | medium | 2 | 0.701 | 9 | 0.701 | 9 | 0.442 | 12 |
66
  | queue_optimization_hard | hard | 3 | 0.802 | 12 | 0.850 | 11 | 0.129 | 15 |
67
+ | generated_hard_s7 | hard | 2 | 0.663 | 5 | 0.663 | 5 | 0.120 | 12 |
68
+ | generated_hard_s53 | hard | 3 | 0.440 | 6 | 0.440 | 6 | 0.089 | 15 |
69
+ | generated_nightmare_s31 | nightmare | 5 | 0.486 | 15 | 0.486 | 15 | 0.077 | 15 |
70
  | generated_nightmare_s77 | nightmare | 5 | 0.445 | 15 | 0.445 | 15 | 0.053 | 15 |
71
+ | **Average** | | | **0.724** | 9.2 | **0.729** | 9.0 | **0.199** | 10.5 |
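The averages, the delta, and the spread quoted in the TL;DR can be recomputed from the per-task rows above. The sketch below copies the three score columns from the new side of the table; the variable names are illustrative only.

```python
from statistics import mean

# (heuristic, llm, bad) per task, copied from the per-task table above.
scores = {
    "goods_not_received_easy": (0.968, 0.968, 0.280),
    "fraud_signal_ambiguity": (0.968, 0.968, 0.280),
    "generated_easy_s42": (0.958, 0.958, 0.408),
    "generated_medium_s17": (0.809, 0.809, 0.114),
    "generated_medium_s99": (0.701, 0.701, 0.442),
    "queue_optimization_hard": (0.802, 0.850, 0.129),
    "generated_hard_s7": (0.663, 0.663, 0.120),
    "generated_hard_s53": (0.440, 0.440, 0.089),
    "generated_nightmare_s31": (0.486, 0.486, 0.077),
    "generated_nightmare_s77": (0.445, 0.445, 0.053),
}

# Column-wise means, rounded to the three decimals the table reports.
heur, llm, bad = (round(mean(col), 3) for col in zip(*scores.values()))

delta = round(heur - bad, 3)          # heuristic minus bad policy
spread_pct = round(100 * delta / bad)  # delta relative to the bad-policy score

print(f"heuristic={heur} llm={llm} bad={bad} delta={delta} spread={spread_pct}%")
```

With the table's rounded values this gives 0.724, 0.729, and 0.199 for the three averages, a delta of 0.525, and a 264% spread.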
72
+
73
+ ## Multi-seed Stress Grid (7 seeds × 4 difficulties)
74
+
75
+ Running the heuristic and bad-policy agents across seven generator seeds per difficulty (seeds
76
+ 7, 17, 31, 42, 53, 77, 99) gives the statistically defensible version of the headline numbers.
77
+ All runs are fully offline; no provider calls are involved.
78
+
79
+ | Difficulty | n | Heuristic mean ± std | Bad mean ± std |
80
+ | --- | --- | --- | --- |
81
+ | easy | 7 | 0.9696 ± 0.014 | 0.3346 ± 0.068 |
82
+ | medium | 7 | 0.8411 ± 0.089 | 0.4369 ± 0.238 |
83
+ | hard | 7 | 0.6245 ± 0.151 | 0.1299 ± 0.047 |
84
+ | nightmare | 7 | 0.4121 ± 0.079 | 0.0635 ± 0.010 |
85
+ | **OVERALL** | **28** | **0.7118 ± 0.235** | **0.2412 ± 0.194** |
86
 
87
+ Observations:
88
+ - Heuristic score decreases cleanly and monotonically with difficulty: 0.97 → 0.84 → 0.62 →
89
+ 0.41. The difficulty gradient is real, not a labeling artifact.
90
+ - Nightmare std is the tightest (0.079) because every nightmare task is constrained by the
91
+ same step budget vs. case count collision. Hard is the widest (0.151) because case counts
92
+ vary from 2 to 3 across seeds.
93
+ - Bad policy shows wide variance on medium (±0.238) because some medium seeds generate
94
+ concede-optimal templates (credit_not_processed, duplicate_processing) where
95
+ concede-everything is trivially correct, which is exactly the expected behavior of a discriminating
96
+ rubric on a mixed task distribution.
97
+ - Overall delta (heuristic − bad) across 28 runs: **0.4706**. The headline 10-task catalog
98
+ delta (0.525) is within 1Οƒ of the multi-seed delta, so the fixed-seed headline is not a
99
+ cherry-picked result.
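The OVERALL row can be cross-checked from the four per-difficulty rows without rerunning anything. The sketch below assumes the ± values are sample standard deviations (consistent with the `statistics.stdev` call in the reproduction script) and that OVERALL is computed over the pooled 28 scores; the `pooled` helper is illustrative, not part of the repository.

```python
from math import sqrt
from typing import Sequence, Tuple

Group = Tuple[int, float, float]  # (n, mean, sample standard deviation)


def pooled(groups: Sequence[Group]) -> Tuple[float, float]:
    """Combine per-group (n, mean, std) into the mean and sample std of all points."""
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / total_n
    # Sum of squares splits into a within-group and a between-group part.
    within = sum((n - 1) * s ** 2 for n, _, s in groups)
    between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    return grand_mean, sqrt((within + between) / (total_n - 1))


# Rows copied from the multi-seed table: easy, medium, hard, nightmare.
heuristic = [(7, 0.9696, 0.014), (7, 0.8411, 0.089), (7, 0.6245, 0.151), (7, 0.4121, 0.079)]
bad_policy = [(7, 0.3346, 0.068), (7, 0.4369, 0.238), (7, 0.1299, 0.047), (7, 0.0635, 0.010)]

h_mean, h_std = pooled(heuristic)
b_mean, b_std = pooled(bad_policy)
print(f"heuristic {h_mean:.4f} ± {h_std:.3f}")
print(f"bad       {b_mean:.4f} ± {b_std:.3f}")
print(f"delta     {h_mean - b_mean:.4f}")
```

Most of the overall spread comes from the between-difficulty term, which is why the pooled standard deviation is much larger than any single row's.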
100
 
101
  ## Rubric Breakdown (single-case sanity check)
102
 
103
  For `goods_not_received_easy` under the heuristic, the 7-dimension breakdown from
104
+ `ChargebackOpsEpisodeRubric` (weights sum to 1.0, Gate passes because the case was resolved
105
+ before step 8):
106
 
107
  | Dimension | Weight | Score | Weighted contribution |
108
  | --- | --- | --- | --- |
 
148
  # Activate the project's venv
149
  source ~/python/bin/activate
150
 
151
+ # 1. Headline 10-task run (heuristic + bad policy, no network)
152
  python - <<'PY'
153
  from evaluation.agent_brutal_audit import run_episode
154
  from scenarios.simulation import list_tasks
 
158
      print(f"{t.task_id:32s} heur={h['score']:.4f} bad={b['score']:.4f}")
159
  PY
160
 
161
+ # 2. Multi-seed stress grid (28 runs across 7 seeds × 4 difficulties, no network)
162
+ python - <<'PY'
163
+ from statistics import mean, stdev
164
+ from evaluation.agent_brutal_audit import run_episode
165
+ for d in ("easy","medium","hard","nightmare"):
166
+     hs, bs = [], []
167
+     for s in (7, 17, 31, 42, 53, 77, 99):
168
+         hs.append(run_episode(f"generated_{d}_s{s}", policy='heuristic')['score'])
169
+         bs.append(run_episode(f"generated_{d}_s{s}", policy='bad')['score'])
170
+     print(f"{d:10s} heur={mean(hs):.4f}±{stdev(hs):.4f} bad={mean(bs):.4f}±{stdev(bs):.4f}")
171
+ PY
172
+
173
+ # 3. LLM tiebreak run (requires OPENROUTER_API_KEY in .env)
174
  python -m runners.baseline_runner | tee /tmp/baseline_run.json
175
  ```
176
 
 
178
 
179
  - Python 3.12.13, pytest 7.4.3
180
  - `openenv-core==0.2.3`, `pydantic==2.12.5`, `openai==2.31.0`
181
+ - Provider: OpenRouter (model `openai/gpt-oss-120b`), all 7 decision calls succeeded, zero retries
182
+ - Average end-to-end episode wall-clock: ~0.8s (heuristic), ~1.8s (with LLM tiebreak, down from
183
+ ~2.5s in v1 because `_obvious_next_action` bypasses most model calls)
184
  - Full test suite: 22/22 passing, `openenv validate .` clean, Docker build clean
185
 
186
  ## What This Table Does Not Show
187
 
 
 
 
188
  - **Per-dimension score dispersion across the full catalog**: the table above shows one task's
189
+ breakdown. An introspection demo command exists for walking `env.rubric.named_rubrics()` on
190
+ any run: see `README.md` → "Rubric introspection".
191
  - **RL training curves**: ChargebackOps is a ready environment, not a trained agent. Anyone
192
+ wiring this into Gym/SB3/CleanRL is expected to produce training curves separately; the
193
+ rubric tree is the machinery they would hook into for credit assignment.
docs/RUNNING_THE_AGENT.md ADDED
@@ -0,0 +1,310 @@
1
+ # Running the ChargebackOps Agent
2
+
3
+ End-to-end instructions for running the ChargebackOps environment and its baseline agent:
4
+ offline (heuristic only), with an LLM tiebreak, against a single task, across the whole
5
+ benchmark, or as a live server. If you just want the numbers, see
6
+ [`docs/RESULTS.md`](RESULTS.md). If you want to understand the agent internals, see
7
+ [`AGENT.md`](../AGENT.md).
8
+
9
+ ---
10
+
11
+ ## 1. Prerequisites
12
+
13
+ - **Python 3.12** (required: `tomllib` and `Rubric` type hints assume 3.12+)
14
+ - **git** (for cloning)
15
+ - **Docker** (optional, only if you want the containerized server)
16
+
17
+ Clone the repo if you haven't already:
18
+
19
+ ```bash
20
+ git clone https://github.com/MitudruDutta/chargebackops.git
21
+ cd chargebackops
22
+ ```
23
+
24
+ Create or reuse a virtual environment, install the project in editable mode with dev extras:
25
+
26
+ ```bash
27
+ source ~/python/bin/activate # or: python3.12 -m venv .venv && source .venv/bin/activate
28
+ pip install -e ".[dev]"
29
+ ```
30
+
31
+ This installs `openenv-core`, `pydantic`, `openai`, `fastapi`, `uvicorn`, `gradio`, and the test
32
+ harness. Nothing else is required for offline runs.
33
+
34
+ Verify the install:
35
+
36
+ ```bash
37
+ pytest -q tests # expect: 22 passed
38
+ openenv validate . # expect: Ready for multi-mode deployment
39
+ ```
40
+
41
+ ---
42
+
43
+ ## 2. Configure the environment
44
+
45
+ Copy the template and edit it:
46
+
47
+ ```bash
48
+ cp .env.example .env
49
+ ```
50
+
51
+ ### 2a. Offline mode (no API keys)
52
+
53
+ You can run the heuristic and bad-policy agents with **no keys at all**. The runner
54
+ automatically falls back to the heuristic when no provider is configured. Skip to section 3.
55
+
56
+ ### 2b. LLM tiebreak mode
57
+
58
+ Fill in **one** of the provider blocks in `.env`. The runner auto-detects which provider to
59
+ use based on `BASELINE_PROVIDER`:
60
+
61
+ **OpenRouter (recommended; free tier available, used in reference results):**
62
+
63
+ ```env
64
+ BASELINE_PROVIDER=openrouter
65
+ BASELINE_MODEL=openai/gpt-oss-120b
66
+ OPENROUTER_API_KEY=sk-or-v1-...
67
+ OPENROUTER_APP_TITLE=ChargebackOps
68
+ ```
69
+
70
+ **Groq (fastest, free tier):**
71
+
72
+ ```env
73
+ BASELINE_PROVIDER=groq
74
+ BASELINE_MODEL=llama-3.3-70b-versatile
75
+ GROQ_API_KEY=gsk_...
76
+ ```
77
+
78
+ **OpenAI:**
79
+
80
+ ```env
81
+ BASELINE_PROVIDER=openai
82
+ BASELINE_MODEL=gpt-4o-mini
83
+ OPENAI_API_KEY=sk-...
84
+ ```
85
+
86
+ **Google Gemini:**
87
+
88
+ ```env
89
+ BASELINE_PROVIDER=google
90
+ BASELINE_MODEL=gemini-2.0-flash-exp
91
+ GOOGLE_API_KEY=AI...
92
+ ```
93
+
94
+ The fallback chain is: **primary → OpenRouter → Google → Groq → Heuristic**. If the primary
95
+ provider times out or 429s, the runner automatically walks the chain. Set `STRICT_LLM_MODE=1`
96
+ if you want failures to surface instead of silently falling back to the heuristic.
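The fallback behaviour can be sketched as a simple walk over an ordered provider list. This is an illustration under assumptions, not the runner's code: `decide_with_fallback`, `ProviderError`, and the stub callables are hypothetical, retries are not modelled, and the `strict` flag only mimics what the text says `STRICT_LLM_MODE=1` does.

```python
from typing import Callable, List, Sequence, Tuple

Decision = str
Provider = Tuple[str, Callable[[], Decision]]


class ProviderError(RuntimeError):
    """Stand-in for a timeout or HTTP 429 from a provider."""


def decide_with_fallback(
    providers: Sequence[Provider],
    heuristic: Callable[[], Decision],
    strict: bool = False,
) -> Tuple[str, Decision]:
    """Try each provider in order; fall back to the heuristic unless strict."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call()
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    if strict:
        # Surface the failures instead of silently using the heuristic.
        raise ProviderError("; ".join(errors))
    return "heuristic", heuristic()


def rate_limited() -> Decision:
    raise ProviderError("429 Too Many Requests")


# Order mirrors the documented chain; every call here is a stub.
chain: List[Provider] = [
    ("primary", rate_limited),
    ("openrouter", rate_limited),
    ("google", lambda: "contest"),
    ("groq", lambda: "contest"),
]

print(decide_with_fallback(chain, heuristic=lambda: "contest"))
```

In this sketch the first provider that answers wins, so later entries are never called once one succeeds.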
97
+
98
+ ---
99
+
100
+ ## 3. Run the agent
101
+
102
+ ### 3a. Against a single task
103
+
104
+ ```bash
105
+ source ~/python/bin/activate
106
+ python - <<'PY'
107
+ from evaluation.agent_brutal_audit import run_episode
108
+ result = run_episode("goods_not_received_easy", policy="heuristic")
109
+ print(f"score = {result['score']:.4f}")
110
+ print(f"steps = {result['steps']}")
111
+ print(f"summary: {result['summary']}")
112
+ PY
113
+ ```
114
+
115
+ Available policies: `"heuristic"` (rule-based, no LLM), `"bad"` (concede-everything baseline).
116
+ Any built-in or generated task id works, e.g. `"generated_nightmare_s31"`,
117
+ `"fraud_signal_ambiguity"`, `"queue_optimization_hard"`.
118
+
119
+ ### 3b. Across the full 10-task headline benchmark (offline)
120
+
121
+ ```bash
122
+ python - <<'PY'
123
+ from evaluation.agent_brutal_audit import run_episode
124
+ from scenarios.simulation import list_tasks
125
+ for t in list_tasks():
126
+     h = run_episode(t.task_id, policy='heuristic')
127
+     b = run_episode(t.task_id, policy='bad')
128
+     print(f"{t.task_id:32s} heur={h['score']:.4f} bad={b['score']:.4f}")
129
+ PY
130
+ ```
131
+
132
+ Expect the heuristic to average **0.724** and the bad policy to average **0.199** (±1e-3 for
133
+ float rounding). Total wall-clock: ~8 seconds, zero provider calls.
134
+
135
+ ### 3c. Across the full benchmark with an LLM tiebreak
136
+
137
+ Make sure a provider is configured (section 2b), then:
138
+
139
+ ```bash
140
+ python -m runners.baseline_runner | tee /tmp/baseline_run.json
141
+ ```
142
+
143
+ This writes a JSON report with `task_results`, `average_score`, `provider_calls_attempted`,
144
+ and `provider_calls_succeeded`. Expect **0.729** average and **7 provider calls** on the
145
+ reference setup (OpenRouter + `openai/gpt-oss-120b`).
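A small sketch for summarising the saved report. It assumes only the four top-level keys named above, that the score and call counts are numbers, and that `task_results` is a sized collection; the shape of each task entry is not documented here, so the sketch does not look inside it. `summarize_report` is a hypothetical helper.

```python
import json
from typing import Any, Dict


def summarize_report(report: Dict[str, Any]) -> str:
    """One-line summary built from the report's documented top-level keys."""
    attempted = report["provider_calls_attempted"]
    succeeded = report["provider_calls_succeeded"]
    return (
        f"avg={report['average_score']:.3f} "
        f"calls={succeeded}/{attempted} "
        f"tasks={len(report['task_results'])}"
    )


# Stand-in payload with the documented keys; real runs write the report to
# /tmp/baseline_run.json via the `tee` in the command above.
sample = json.loads(
    '{"task_results": [], "average_score": 0.729, '
    '"provider_calls_attempted": 7, "provider_calls_succeeded": 7}'
)
print(summarize_report(sample))
```

To use it on a real run, replace `sample` with `json.load(open("/tmp/baseline_run.json"))`, provided the runner prints nothing but JSON to stdout.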
146
+
147
+ ### 3d. Multi-seed stress grid (28 runs, fully offline)
148
+
149
+ ```bash
150
+ python - <<'PY'
151
+ from statistics import mean, stdev
152
+ from evaluation.agent_brutal_audit import run_episode
153
+ for d in ("easy","medium","hard","nightmare"):
154
+     hs, bs = [], []
155
+     for s in (7, 17, 31, 42, 53, 77, 99):
156
+         hs.append(run_episode(f"generated_{d}_s{s}", policy='heuristic')['score'])
157
+         bs.append(run_episode(f"generated_{d}_s{s}", policy='bad')['score'])
158
+     print(f"{d:10s} heur={mean(hs):.4f}±{stdev(hs):.4f} bad={mean(bs):.4f}±{stdev(bs):.4f}")
159
+ PY
160
+ ```
161
+
162
+ Expected: heuristic **0.712 ± 0.235**, bad policy **0.241 ± 0.194** pooled across all 28 runs (the script prints the per-difficulty rows).
163
+
164
+ ### 3e. Custom inference contract (challenge submission)
165
+
166
+ `inference.py` is the submission entry point used by the hackathon harness. It reads
167
+ `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN` from the environment and returns decisions via an
168
+ OpenAI-compatible client.
169
+
170
+ ```bash
171
+ API_BASE_URL=https://openrouter.ai/api/v1 \
172
+ MODEL_NAME=openai/gpt-oss-120b \
173
+ HF_TOKEN=sk-or-v1-... \
174
+ python -m runners.inference
175
+ ```
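For readers wiring their own harness, here is a minimal sketch of reading the three variables named above. `InferenceConfig` and `load_inference_config` are hypothetical names; how `inference.py` actually parses its environment is not shown in this document.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class InferenceConfig:
    api_base_url: str
    model_name: str
    token: str


def load_inference_config(env: Mapping[str, str] = os.environ) -> InferenceConfig:
    """Read API_BASE_URL, MODEL_NAME, and HF_TOKEN, failing loudly if any is unset."""
    required = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return InferenceConfig(env["API_BASE_URL"], env["MODEL_NAME"], env["HF_TOKEN"])


# Demo with an explicit mapping so the sketch runs without touching the real env.
demo_env = {
    "API_BASE_URL": "https://openrouter.ai/api/v1",
    "MODEL_NAME": "openai/gpt-oss-120b",
    "HF_TOKEN": "placeholder-token",
}
print(load_inference_config(demo_env).model_name)
```

Failing on a missing variable up front avoids a confusing authentication error on the first model call.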
176
+
177
+ ---
178
+
179
+ ## 4. Run the server (FastAPI + Gradio demo)
180
+
181
+ The server exposes the environment via HTTP for OpenEnv-compatible clients and a live demo
182
+ at `/demo`.
183
+
184
+ ### 4a. Local
185
+
186
+ ```bash
187
+ source ~/python/bin/activate
188
+ uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
189
+ ```
190
+
191
+ Endpoints:
192
+
193
+ | Path | Method | Purpose |
194
+ |---|---|---|
195
+ | `/reset` | POST | Start an episode (pass `task_id` in JSON body) |
196
+ | `/step` | POST | Take one action |
197
+ | `/state` | GET | Current observation + progress |
198
+ | `/tasks` | GET | Task catalog |
199
+ | `/demo` | GET | Gradio live demo with click-through playback |
200
+ | `/baseline` | GET/POST | Run the heuristic agent headlessly |
201
+ | `/grader` | GET/POST | Score a completed episode |
202
+ | `/health` | GET | Health check |
203
+ | `/docs` | GET | OpenAPI / Swagger UI |
204
+
205
+ Example `curl` flow:
206
+
207
+ ```bash
208
+ curl -s -X POST localhost:8000/reset -H 'content-type: application/json' \
209
+ -d '{"task_id":"goods_not_received_easy"}' | jq '.task_id, .steps_remaining'
210
+
211
+ curl -s -X POST localhost:8000/step -H 'content-type: application/json' \
212
+ -d '{"action":{"action_type":"select_case","case_id":"CB-E1"}}' | jq '.reward'
213
+ ```
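The same two calls can be made from Python with only the standard library. This is a sketch under assumptions: the payloads and the `task_id`, `steps_remaining`, and `reward` response fields are taken from the `curl` example above, while `build_post`, `post_json`, and `run_demo` are hypothetical helpers, and the server is assumed to be listening on `localhost:8000`.

```python
import json
from typing import Any, Dict
from urllib import request

BASE_URL = "http://localhost:8000"


def build_post(path: str, payload: Dict[str, Any]) -> request.Request:
    """Build a JSON POST request for one of the environment endpoints."""
    return request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Send the request and decode the JSON response body."""
    with request.urlopen(build_post(path, payload), timeout=10) as resp:
        return json.load(resp)


def run_demo() -> None:
    """Reset one task and take a single step; requires a running server."""
    obs = post_json("/reset", {"task_id": "goods_not_received_easy"})
    print(obs.get("task_id"), obs.get("steps_remaining"))
    step = post_json(
        "/step",
        {"action": {"action_type": "select_case", "case_id": "CB-E1"}},
    )
    print(step.get("reward"))
```

Call `run_demo()` once the server from section 4a is up; nothing is sent over the network until then.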
214
+
215
+ ### 4b. Docker
216
+
217
+ ```bash
218
+ docker build -t chargebackops .
219
+ docker run --rm -p 8000:8000 --env-file .env chargebackops
220
+ ```
221
+
222
+ The Dockerfile is layered so source edits don't re-run `pip install`: the first build takes ~40s,
223
+ edits after that rebuild in ~6s.
224
+
225
+ ### 4c. Hugging Face Space
226
+
227
+ The repo doubles as a Hugging Face Space (see the frontmatter at the top of `README.md`). Push
228
+ to the `hf` remote and the space rebuilds automatically:
229
+
230
+ ```bash
231
+ git push hf main
232
+ ```
233
+
234
+ ---
235
+
236
+ ## 5. Inspect the rubric tree
237
+
238
+ Every scoring dimension is an OpenEnv `Rubric` subclass. Walk the composition tree on any
239
+ live environment:
240
+
241
+ ```bash
242
+ python - <<'PY'
243
+ from server.chargeback_ops_environment import ChargebackOpsEnvironment
244
+ env = ChargebackOpsEnvironment()
245
+ for name, r in env.rubric.named_rubrics():
246
+ print(f"{name}: {type(r).__name__}")
247
+ PY
248
+ ```
249
+
250
+ Expected output (11 named children):
251
+
252
+ ```
253
+ case_rubric: CaseRubric
254
+ case_rubric.aggregator: WeightedSum
255
+ case_rubric.aggregator.rubric_0: StrategyCorrectnessRubric
256
+ case_rubric.aggregator.rubric_1: EvidenceQualityRubric
257
+ case_rubric.aggregator.rubric_2: PacketValidityRubric
258
+ case_rubric.aggregator.rubric_3: DeadlineComplianceRubric
259
+ case_rubric.aggregator.rubric_4: EfficiencyRubric
260
+ case_rubric.aggregator.rubric_5: OutcomeQualityRubric
261
+ case_rubric.aggregator.rubric_6: NoteQualityRubric
262
+ case_rubric.deadline_gate: Gate
263
+ case_rubric.deadline_gate.rubric: CaseAbandonedRubric
264
+ ```
265
+
266
+ After a forward pass, each child exposes `last_score`; this is the introspection path an RL
267
+ trainer hooks into for credit assignment.
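A trainer-side hook could gather those per-dimension scores into a flat dictionary. This is a sketch that relies only on the two facts stated above: `named_rubrics()` yields `(name, rubric)` pairs, and each child exposes `last_score`. The helper name `collect_last_scores` is hypothetical, and falling back to `None` when a child has no score yet is an assumption.

```python
from typing import Any, Dict, Optional


def collect_last_scores(rubric: Any) -> Dict[str, Optional[float]]:
    """Map each named child rubric to its most recent score.

    Children without a `last_score` attribute are recorded as None
    instead of raising, so the helper is safe before any scoring pass.
    """
    scores: Dict[str, Optional[float]] = {}
    for name, child in rubric.named_rubrics():
        scores[name] = getattr(child, "last_score", None)
    return scores
```

With a live environment this would be called as `collect_last_scores(env.rubric)` after an episode has been scored.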
268
+
269
+ ---
270
+
271
+ ## 6. Troubleshooting
272
+
273
+ **`openenv validate .` fails.** Check that `openenv.yaml` is present at the repo root and that your venv has
274
+ `openenv-core>=0.2.3` installed.
275
+
276
+ **Provider calls all fail / score drops to heuristic.** Run `python -m runners.baseline_runner`
277
+ and inspect `provider_errors`. Common causes: expired API key, wrong `BASELINE_MODEL` slug for
278
+ the provider, or rate limits (the runner retries twice, then falls back). Set
279
+ `BASELINE_REQUEST_TIMEOUT_SECONDS=30` if the provider is slow.
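The retry-then-fall-back behaviour described here can be pictured with the sketch below. It is not the runner's actual code: `call_with_fallback` is a hypothetical helper, and it assumes "retries twice" means one initial attempt plus two retries before the heuristic pick is used.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def call_with_fallback(
    provider_call: Callable[[], T],
    fallback: Callable[[], T],
    retries: int = 2,
) -> Tuple[T, List[str]]:
    """Try the provider, retry on failure, then use the fallback.

    Returns the chosen result plus the list of error messages seen,
    similar in spirit to the `provider_errors` field mentioned above.
    """
    errors: List[str] = []
    for _ in range(1 + retries):  # one initial attempt plus `retries` retries
        try:
            return provider_call(), errors
        except Exception as exc:  # record the failure and try again
            errors.append(f"{type(exc).__name__}: {exc}")
    return fallback(), errors
```

Keeping the collected errors next to the result is what lets a run still finish while leaving a trace of why the score matches the heuristic.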
280
+
281
+ **`ImportError: attempted relative import`.** Always run commands from the repo root with the
282
+ venv activated. Use `python -m runners.baseline_runner`, not `python runners/baseline_runner.py`.
283
+
284
+ **Docker build is slow on every edit.** You probably edited `pyproject.toml`; the deps layer
285
+ only caches when that file is unchanged. If you edit source only, rebuilds should be ~6s.
286
+
287
+ **Scores differ from `docs/RESULTS.md`.** If you pass different seeds or LLM providers you
288
+ will get different numbers. The reference numbers are captured on the fixed 10-task catalog
289
+ defined by `scenarios.simulation.list_tasks()` plus OpenRouter `openai/gpt-oss-120b`. Anything
290
+ else is not directly comparable.
291
+
292
+ ---
293
+
294
+ ## 7. Minimal "does it work?" smoke test
295
+
296
+ One command to verify everything is wired up correctly:
297
+
298
+ ```bash
299
+ source ~/python/bin/activate && \
300
+ pytest -q tests && \
301
+ openenv validate . && \
302
+ python -c "
303
+ from evaluation.agent_brutal_audit import run_episode
304
+ r = run_episode('goods_not_received_easy', policy='heuristic')
305
+ assert abs(r['score'] - 0.9675) < 1e-3, r['score']
306
+ print('smoke OK, score =', r['score'])
307
+ "
308
+ ```
309
+
310
+ If that prints `smoke OK` with a score within 0.001 of 0.9675, the agent runs cleanly and the rubric math is stable.
runners/baseline_runner.py CHANGED
@@ -1043,6 +1043,10 @@ def _obvious_next_action(
1043
  if not candidates:
1044
  return None
1045
 
 
 
 
 
1046
  first = candidates[0]
1047
  visible_case = observation.get("visible_case")
1048
  queue = observation["queue"]
@@ -1065,9 +1069,18 @@ def _obvious_next_action(
1065
  if visible_case["status"] != "open":
1066
  return first if first.action.action_type == "select_case" else None
1067
 
 
 
 
 
 
 
 
 
1068
  if first.action.action_type in {
1069
  "retrieve_policy",
1070
  "add_evidence",
 
1071
  "submit_representment",
1072
  "resolve_case",
1073
  }:
@@ -1078,19 +1091,6 @@ def _obvious_next_action(
1078
  if visible_case.get("policy") is None or current_strategy in {None, "contest"}:
1079
  return first
1080
 
1081
- if first.action.action_type == "set_strategy":
1082
- strategy = first.action.strategy
1083
- competing_strategies = {
1084
- candidate.action.strategy
1085
- for candidate in candidates[1:]
1086
- if candidate.action.action_type == "set_strategy"
1087
- }
1088
- if (
1089
- strategy in {"accept_chargeback", "issue_refund"}
1090
- and "contest" not in competing_strategies
1091
- ):
1092
- return first
1093
-
1094
  if first.action.action_type == "select_case":
1095
  current_case_id = visible_case["case_id"]
1096
  current_deadline = next(
@@ -1278,9 +1278,22 @@ def _provider_pick(
1278
  {
1279
  "role": "system",
1280
  "content": (
1281
- "You are a merchant chargeback dispute analyst. Pick the single best next action from the candidates. "
1282
- "Prioritize: 1) deadline-urgent cases, 2) evidence-backed contests, 3) fast concedes for weak cases. "
1283
- 'Avoid attaching harmful evidence. Return JSON: {"candidate_index": N, "rationale": "brief reason"}'
 
 
 
 
 
 
 
 
 
 
 
 
 
1284
  ),
1285
  },
1286
  {"role": "user", "content": payload},
 
1043
  if not candidates:
1044
  return None
1045
 
1046
+ # Single candidate = no decision to make.
1047
+ if len(candidates) == 1:
1048
+ return candidates[0]
1049
+
1050
  first = candidates[0]
1051
  visible_case = observation.get("visible_case")
1052
  queue = observation["queue"]
 
1069
  if visible_case["status"] != "open":
1070
  return first if first.action.action_type == "select_case" else None
1071
 
1072
+ # Strategy selection: the heuristic already derives the optimal strategy
1073
+ # from policy + retrieved evidence. The LLM has no additional signal that
1074
+ # improves this specific call; invoking it here has only caused regressions
1075
+ # on fraud_signal_ambiguity and generated_medium_s99 where the model picks
1076
+ # a concede-style strategy over the correct contest.
1077
+ if first.action.action_type == "set_strategy":
1078
+ return first
1079
+
1080
  if first.action.action_type in {
1081
  "retrieve_policy",
1082
  "add_evidence",
1083
+ "remove_evidence",
1084
  "submit_representment",
1085
  "resolve_case",
1086
  }:
 
1091
  if visible_case.get("policy") is None or current_strategy in {None, "contest"}:
1092
  return first
1093
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1094
  if first.action.action_type == "select_case":
1095
  current_case_id = visible_case["case_id"]
1096
  current_deadline = next(
 
1278
  {
1279
  "role": "system",
1280
  "content": (
1281
+ "You are a merchant chargeback dispute analyst. Pick the single best next action from the ordered candidate list. "
1282
+ "The candidates are pre-sorted by a deterministic heuristic; candidate 0 is usually correct. Deviate only when you spot a concrete reason. "
1283
+ "\n"
1284
+ "Reason-code → optimal strategy (follow unless evidence clearly contradicts):\n"
1285
+ " goods_not_received → contest (with order + delivery proof)\n"
1286
+ " fraud_cnp → contest when account linkage exists, otherwise concede\n"
1287
+ " product_not_as_described → contest (with listing + return policy proof)\n"
1288
+ " service_not_provided → contest (with completion log)\n"
1289
+ " credit_not_processed → issue_refund immediately\n"
1290
+ " duplicate_processing → issue_refund immediately\n"
1291
+ "\n"
1292
+ "Priorities: (1) resolve cases whose deadline is 1 step away before anything else, "
1293
+ "(2) prefer the highest-$ open case when budget is tight, "
1294
+ "(3) never attach harmful evidence (AVS/CVV mismatch on fraud_cnp, GPS anomalies on goods_not_received), "
1295
+ "(4) when multiple candidates look equivalent, take candidate 0.\n"
1296
+ 'Return only JSON: {"candidate_index": N, "rationale": "brief reason"}'
1297
  ),
1298
  },
1299
  {"role": "user", "content": payload},
scenarios/case_generator.py CHANGED
@@ -580,7 +580,7 @@ _PRODUCT_NOT_AS_DESCRIBED = _CaseTemplate(
580
  ),
581
  policy_requirements=("product listing verification", "return policy documentation"),
582
  optimal_strategy="contest",
583
- acceptable_strategies=("issue_refund",),
584
  resolution_summary="Contest with listing accuracy proof and return policy documentation.",
585
  base_weight=1.0,
586
  evidence_blueprints=(
@@ -890,7 +890,7 @@ _SERVICE_NOT_PROVIDED = _CaseTemplate(
890
  "customer acknowledgment or scheduling proof",
891
  ),
892
  optimal_strategy="contest",
893
- acceptable_strategies=("issue_refund",),
894
  resolution_summary="Contest with service completion proof. The service was delivered as booked.",
895
  base_weight=1.0,
896
  evidence_blueprints=(
 
580
  ),
581
  policy_requirements=("product listing verification", "return policy documentation"),
582
  optimal_strategy="contest",
583
+ acceptable_strategies=(),
584
  resolution_summary="Contest with listing accuracy proof and return policy documentation.",
585
  base_weight=1.0,
586
  evidence_blueprints=(
 
890
  "customer acknowledgment or scheduling proof",
891
  ),
892
  optimal_strategy="contest",
893
+ acceptable_strategies=(),
894
  resolution_summary="Contest with service completion proof. The service was delivered as booked.",
895
  base_weight=1.0,
896
  evidence_blueprints=(
scenarios/simulation.py CHANGED
@@ -258,7 +258,7 @@ TASKS: dict[str, TaskScenario] = {
258
  ),
259
  deadline_step=7,
260
  optimal_strategy="contest",
261
- acceptable_strategies=("accept_chargeback",),
262
  policy_guidance=(
263
  "For CNP fraud disputes, contest only when you can link the cardholder to the account or device history. "
264
  "Do not attach evidence that strengthens the issuer's fraud narrative."
@@ -268,7 +268,7 @@ TASKS: dict[str, TaskScenario] = {
268
  "customer account confirmation",
269
  ),
270
  recommended_strategy="contest",
271
- resolution_summary="Contest only with strong account-linkage evidence. Conceding is acceptable but suboptimal.",
272
  weight=1.1,
273
  required_evidence_ids=("M1-PRIOR-ORDERS", "M1-ACCOUNT-CHAT"),
274
  helpful_evidence_ids=(
 
258
  ),
259
  deadline_step=7,
260
  optimal_strategy="contest",
261
+ acceptable_strategies=(),
262
  policy_guidance=(
263
  "For CNP fraud disputes, contest only when you can link the cardholder to the account or device history. "
264
  "Do not attach evidence that strengthens the issuer's fraud narrative."
 
268
  "customer account confirmation",
269
  ),
270
  recommended_strategy="contest",
271
+ resolution_summary="Contest with strong account-linkage evidence. Conceding this case forfeits defensible revenue.",
272
  weight=1.1,
273
  required_evidence_ids=("M1-PRIOR-ORDERS", "M1-ACCOUNT-CHAT"),
274
  helpful_evidence_ids=(