mitudrudutta committed on Apr 15

Commit

0054f7f

1 Parent(s): 3149b7e

refactor: tighten rubric discrimination + LLM path + add running doc

- Tighten acceptable_strategies on contest-optimal templates (fraud_signal_ambiguity,
_PRODUCT_NOT_AS_DESCRIBED, _SERVICE_NOT_PROVIDED) so conceding a winnable case no
longer earns partial strategy credit. Bad-policy floor 0.212 -> 0.199.
- Expand _obvious_next_action to cover single-candidate states, remove_evidence, and
all set_strategy picks. Starves the LLM of bad-call opportunities on deterministic
workflow states; provider calls drop 19 -> 7 per 10-task run.
- Improve LLM prompt with reason-code -> optimal-strategy mapping and explicit
"candidate 0 is usually correct" guidance.
- LLM baseline: 0.622 -> 0.729 (+0.107). fraud_signal_ambiguity recovered 0.408 -> 0.968,
generated_nightmare_s31 0.134 -> 0.486. LLM now edges heuristic (0.729 > 0.724).
- Add 28-task multi-seed stress grid to RESULTS.md (7 seeds x 4 difficulties).
- Add docs/RUNNING_THE_AGENT.md with end-to-end run instructions.
- Fix README Docker block: offline run no longer requires --env-file; section reflows.

Files changed (7)

AGENT.md +6 -6
README.md +24 -9
docs/RESULTS.md +94 -50
docs/RUNNING_THE_AGENT.md +310 -0
runners/baseline_runner.py +29 -16
scenarios/case_generator.py +2 -2
scenarios/simulation.py +2 -2

AGENT.md CHANGED

@@ -589,10 +589,10 @@ Tested across the 10-task benchmark (3 showcase + 7 seeded holdout):
 | Difficulty | Tasks | Heuristic | LLM tiebreak | Bad | Key Observations |
 |---|---|---|---|---|---|
-| Easy | 3 | 0.964 | 0.778 | 0.365 | Near-perfect heuristic; LLM mispicks on `fraud_signal_ambiguity` |
-| Medium | 2 | 0.755 | 0.608 | 0.278 | Strategy selection + evidence curation drive the spread |
-| Hard | 3 | 0.680 | 0.697 | 0.113 | LLM edges heuristic on `queue_optimization_hard` |
-| Nightmare | 2 | 0.466 | 0.289 | 0.065 | 5-case portfolios with deadline_step=3–5; step budget collides |
-| **Overall** | **10** | **0.738** | **0.622** | **0.212** | **Delta 0.526 vs bad policy** |
-The difficulty curve demonstrates the environment discriminates effectively: easy tasks are near-trivial, nightmare tasks push every agent below 50%. The `Gate(CaseAbandonedRubric)` wrapper hard-zeros cases left unresolved past their deadline, so the heuristic's slowness on nightmare portfolios shows up as a real signal rather than dilution across 7 partial dimensions. See `docs/RESULTS.md` for full per-task numbers.

 | Difficulty | Tasks | Heuristic | LLM tiebreak | Bad | Key Observations |
 |---|---|---|---|---|---|
+| Easy | 3 | 0.964 | 0.964 | 0.323 | Heuristic + LLM both saturate the easy band |
+| Medium | 2 | 0.755 | 0.755 | 0.278 | Strategy selection + evidence curation drive the spread |
+| Hard | 3 | 0.635 | 0.651 | 0.113 | LLM edges heuristic on `queue_optimization_hard` (+0.049) |
+| Nightmare | 2 | 0.466 | 0.466 | 0.065 | 5-case portfolios with deadline_step=3–5; step budget collides |
+| **Overall** | **10** | **0.724** | **0.729** | **0.199** | **Delta 0.525 vs bad policy** |
+The difficulty curve demonstrates the environment discriminates effectively: easy tasks are near-trivial, nightmare tasks push every agent below 50%. The `Gate(CaseAbandonedRubric)` wrapper hard-zeros cases left unresolved past their deadline, so the heuristic's slowness on nightmare portfolios shows up as a real signal rather than dilution across 7 partial dimensions. The LLM-assisted run now edges ahead of the pure heuristic (+0.005) and makes only **7 provider calls** across the 10-task run (down from 19 in v1) because `_obvious_next_action` short-circuits deterministic workflow states — strategy picks, add/remove evidence, submit, resolve. A 28-task multi-seed grid (7 seeds × 4 difficulties) reports heuristic 0.712 ± 0.235 and bad policy 0.241 ± 0.194 — the fixed-seed headline is within 1σ of the multi-seed result. See `docs/RESULTS.md` for full per-task numbers.
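As a rough illustration of the gating behaviour described above, the sketch below hard-zeros an abandoned case and otherwise returns a weighted sum of dimension scores. It is not the OpenEnv `Gate` or `WeightedSum` implementation: the function name, the dimension keys, and every weight except the 15% deadline share implied by `docs/RESULTS.md` are assumptions made up for the example.

```python
from typing import Mapping

# Hypothetical weights: only the 0.15 deadline share comes from the docs;
# the rest of the split is invented so that the weights sum to 1.0.
WEIGHTS = {
    "strategy": 0.25,
    "evidence": 0.20,
    "packet": 0.10,
    "deadline": 0.15,
    "efficiency": 0.10,
    "outcome": 0.15,
    "notes": 0.05,
}


def gated_case_score(scores: Mapping[str, float], abandoned: bool) -> float:
    """Return 0.0 for an abandoned case, else the weighted sum of dimension scores."""
    if abandoned:
        return 0.0  # the gate: nothing else about the case is counted
    return sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)


# A case resolved late: zero on the deadline dimension, partial credit elsewhere.
late_but_attempted = {name: 0.6 for name in WEIGHTS} | {"deadline": 0.0}
print(gated_case_score(late_but_attempted, abandoned=False))
print(gated_case_score(late_but_attempted, abandoned=True))
```

The point of the sketch is the asymmetry: a late but attempted case keeps credit from the other dimensions, while an abandoned case contributes nothing.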

README.md CHANGED

@@ -84,17 +84,22 @@ pie title Case Score Weights
 | Difficulty | Tasks | Heuristic (no LLM) | Heuristic + LLM tiebreak | Naive baseline |
 |---|---|---|---|---|
-| Easy | 3 | **0.964** | 0.778 | 0.365 |
-| Medium | 2 | **0.755** | 0.608 | 0.278 |
-| Hard | 3 | **0.680** | 0.697 | 0.113 |
-| Nightmare | 2 | **0.466** | 0.289 | 0.065 |
-| **Overall** | **10** | **0.738** | **0.622** | **0.212** |
-**Rubric discrimination:** heuristic vs. naive concede-everything delta is **+0.526** — the
 rubric cannot be gamed by a lazy agent, and the `Gate(CaseAbandonedRubric)` wrapper hard-zeros
 cases left unresolved past their deadline so the hard-band tasks cannot be trivially saturated.
-Per-dimension breakdown, score reproduction commands, and calibration notes live in
-[`docs/RESULTS.md`](docs/RESULTS.md).
 ## Action Space (9 typed actions)
@@ -132,12 +137,22 @@ for name, r in env.rubric.named_rubrics():
 # ... (all 7 dimensions)
 ```
 ```bash
-# Docker
 docker build -t chargebackops .
 docker run --rm -p 8000:8000 --env-file .env chargebackops
 ```
 ## API
 | Method | Path | Description |

 | Difficulty | Tasks | Heuristic (no LLM) | Heuristic + LLM tiebreak | Naive baseline |
 |---|---|---|---|---|
+| Easy | 3 | 0.964 | **0.964** | 0.323 |
+| Medium | 2 | 0.755 | **0.755** | 0.278 |
+| Hard | 3 | 0.635 | **0.651** | 0.113 |
+| Nightmare | 2 | 0.466 | **0.466** | 0.065 |
+| **Overall** | **10** | **0.724** | **0.729** | **0.199** |
+28-task multi-seed grid (7 seeds × 4 difficulties, fully offline): heuristic **0.712 ± 0.235**,
+bad policy **0.241 ± 0.194**, delta **+0.471** — within 1σ of the headline fixed-seed delta.
+**Rubric discrimination:** heuristic vs. naive concede-everything delta is **+0.525** — the
 rubric cannot be gamed by a lazy agent, and the `Gate(CaseAbandonedRubric)` wrapper hard-zeros
 cases left unresolved past their deadline so the hard-band tasks cannot be trivially saturated.
+The LLM-assisted run edges ahead of the pure heuristic (+0.005) while making only **7 provider
+calls** (down from 19 in v1) because `_obvious_next_action` now short-circuits all
+deterministic workflow states. Per-dimension breakdown, score reproduction commands, and
+calibration notes live in [`docs/RESULTS.md`](docs/RESULTS.md).
 ## Action Space (9 typed actions)
 # ... (all 7 dimensions)
 ```
+Run the server in Docker:
 ```bash
+# 1. Build the image (tag: chargebackops)
 docker build -t chargebackops .
+# 2a. Offline run — no env vars required
+docker run --rm -p 8000:8000 chargebackops
+# 2b. With LLM provider keys (requires .env from Quick Start above)
 docker run --rm -p 8000:8000 --env-file .env chargebackops
 ```
+The container exposes the FastAPI app on port 8000 (`/docs` for OpenAPI, `/demo` for the Gradio
+live demo, `/health` for readiness). Stop it with Ctrl-C or `docker stop`.
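A readiness poll against the `/health` endpoint mentioned above could look like the following minimal sketch. It assumes only that `/health` answers with HTTP 200 once the app is up; the response body is not inspected, and the helper name `wait_for_health` and its defaults are hypothetical.

```python
import time
import urllib.error
import urllib.request


def wait_for_health(
    url: str = "http://localhost:8000/health",
    attempts: int = 10,
    delay: float = 1.0,
) -> bool:
    """Poll `url` until it returns HTTP 200 or the attempts run out."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=2) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        if attempt < attempts - 1:
            time.sleep(delay)
    return False

# Usage (after `docker run`): print("ready" if wait_for_health() else "not ready")
```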
 ## API
 | Method | Path | Description |

docs/RESULTS.md CHANGED

@@ -1,46 +1,56 @@
 # ChargebackOps — Baseline Results
-Reference numbers for the 10-task benchmark catalog. Captured on **2026-04-14** against
-`main` (Rubric system + deadline `Gate` composition). Reproduce with the commands below; scores
-should match to within ±1e-3 (float rounding).
 ## TL;DR
 | Agent | Avg score | Best task | Worst task | Provider calls |
 | --- | --- | --- | --- | --- |
-| **Bad policy** (concede-everything) | **0.212** | `generated_medium_s99` (0.442) | `generated_nightmare_s77` (0.053) | 0 |
-| **Heuristic** (no LLM, rule-based) | **0.738** | `goods_not_received_easy` / `fraud_signal_ambiguity` (0.968) | `generated_nightmare_s77` (0.445) | 0 |
-| **Heuristic + LLM tiebreak** (openrouter gpt-oss-120b) | **0.622** | `goods_not_received_easy` (0.968) | `generated_nightmare_s31` (0.134) | 19 (19 ✓ / 0 ✗) |
-**Key signal:** the bad policy vs. heuristic delta is **0.526** (73.8 → 21.2 = 248% spread). The
-`Gate(CaseAbandonedRubric)` wrapper around the per-case `WeightedSum` means a case left unresolved
-past its deadline hard-zeros — a lazy concede-everything agent cannot game the score, and a correct
-agent cannot trivially saturate it on hard tasks.
 ## Score Curve by Difficulty
 | Difficulty | Task count | Heuristic avg | LLM avg | Bad avg | Target band | Status |
 | --- | --- | --- | --- | --- | --- | --- |
-| easy | 3 | 0.964 | 0.778 | 0.365 | ≥ 0.90 | ✓ |
-| medium | 2 | 0.755 | 0.608 | 0.278 | 0.50 – 0.85 | ✓ |
-| hard | 3 | 0.680 | 0.697 | 0.113 | 0.50 – 0.75 | ✓ |
-| nightmare | 2 | 0.466 | 0.289 | 0.065 | ≤ 0.55 | ✓ |
 Observations:
-- The heuristic **outscores** the LLM-assisted run by 11.6 points on this catalog. The LLM is only
-  invoked to break ties between candidate actions; on the current task set the heuristic tiebreak is
-  almost always the optimal choice, and the LLM occasionally picks a worse candidate (notably on
-  `fraud_signal_ambiguity` and `generated_medium_s99`, where it dropped 0.56 and 0.29 respectively
-  against the heuristic). The LLM recovers on `hard` tasks where genuine branching exists
-  (`queue_optimization_hard`: +0.048 over heuristic).
-- Nightmare tasks cluster around **0.45** for the heuristic because the 15-step budget collides with
-  5-case portfolios that have deadline_step=3–5 per case. Missed deadlines that were *attempted* still
-  land in the weighted sum (with 0 on the deadline dimension and ~0.55 from the other 85%); truly
-  abandoned cases are zeroed by the `Gate(CaseAbandonedRubric)` wrapper. Not a scoring artifact: the
-  bad-policy run shows the same tasks at ~0.065.
 - The deadline `Gate` is the v1 upgrade over a flat weighted sum: a case never even attempted by
-  the deadline collapses completely, while a case resolved late still earns dimensional credit for
-  evidence, strategy, and packet quality. This matches real chargeback operations — a missed
   representment is "case forfeit," while a late one takes a penalty but is still scored on what
   the merchant tried to do.
@@ -49,26 +59,50 @@ Observations:
 | Task ID | Difficulty | Cases | Heuristic | H steps | LLM | LLM steps | Bad | Bad steps |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | goods_not_received_easy | easy | 1 | 0.968 | 6 | 0.968 | 6 | 0.280 | 3 |
-| fraud_signal_ambiguity | easy | 1 | 0.968 | 7 | 0.408 | 6 | 0.408 | 3 |
 | generated_easy_s42 | easy | 1 | 0.958 | 7 | 0.958 | 7 | 0.408 | 3 |
 | generated_medium_s17 | medium | 2 | 0.809 | 10 | 0.809 | 10 | 0.114 | 12 |
-| generated_medium_s99 | medium | 2 | 0.701 | 9 | 0.406 | 9 | 0.442 | 12 |
 | queue_optimization_hard | hard | 3 | 0.802 | 12 | 0.850 | 11 | 0.129 | 15 |
-| generated_hard_s7 | hard | 2 | 0.718 | 5 | 0.718 | 5 | 0.120 | 12 |
-| generated_hard_s53 | hard | 3 | 0.522 | 6 | 0.522 | 6 | 0.089 | 15 |
-| generated_nightmare_s31 | nightmare | 5 | 0.486 | 15 | 0.134 | 15 | 0.077 | 15 |
 | generated_nightmare_s77 | nightmare | 5 | 0.445 | 15 | 0.445 | 15 | 0.053 | 15 |
-| **Average** | | | **0.738** | 9.2 | **0.622** | 9.0 | **0.212** | 10.5 |
-`fraud_signal_ambiguity` has been relabeled from `medium` to `easy` — it scores consistently at 0.968
-under the heuristic because the correct action collapses to a single-case contest with attached
-policy evidence. The label-difficulty mismatch was flagged in v1 and fixed in this release.
 ## Rubric Breakdown (single-case sanity check)
 For `goods_not_received_easy` under the heuristic, the 7-dimension breakdown from
-`ChargebackOpsEpisodeRubric` (weights sum to 1.0, Gate passes because the case was resolved before
-step 8):
 | Dimension | Weight | Score | Weighted contribution |
 | --- | --- | --- | --- |
@@ -114,7 +148,7 @@ surface for a judge or trainer to introspect per-dimension scores.
 # Activate the project's venv
 source ~/python/bin/activate
-# 1. Run the heuristic + bad-policy comparison (no network)
 python - <<'PY'
 from evaluation.agent_brutal_audit import run_episode
 from scenarios.simulation import list_tasks
@@ -124,7 +158,19 @@ for t in list_tasks():
     print(f"{t.task_id:32s}  heur={h['score']:.4f}  bad={b['score']:.4f}")
 PY
-# 2. Run the baseline with a real LLM (requires OPENROUTER_API_KEY in .env)
 python -m runners.baseline_runner | tee /tmp/baseline_run.json
 ```
@@ -132,18 +178,16 @@ python -m runners.baseline_runner | tee /tmp/baseline_run.json
 - Python 3.12.13, pytest 7.4.3
 - `openenv-core==0.2.3`, `pydantic==2.12.5`, `openai==2.31.0`
-- Provider: OpenRouter (model `openai/gpt-oss-120b`), all 19 decision calls succeeded, zero retries
-- Average end-to-end episode wall-clock: ~0.8s (heuristic), ~2.5s (with LLM tiebreak)
 - Full test suite: 22/22 passing, `openenv validate .` clean, Docker build clean
 ## What This Table Does Not Show
-- **Single-seed-per-task** — generated tasks use a fixed seed. A statistically rigorous eval would
-  run each task across 10+ seeds. The fixed-seed catalog is intentional for the hackathon (direct
-  score comparison between submissions), but is flagged as a scale-up item for v1.1.
 - **Per-dimension score dispersion across the full catalog** — the table above shows one task's
-  breakdown. An introspection demo command exists for walking `env.rubric.named_rubrics()` on any
-  run: see `README.md` → "Rubric introspection".
 - **RL training curves** — ChargebackOps is a ready environment, not a trained agent. Anyone
-  wiring this into Gym/SB3/CleanRL is expected to produce training curves separately; the rubric
-  tree is the machinery they would hook into for credit assignment.

 # ChargebackOps — Baseline Results
+Reference numbers for the 10-task headline benchmark and the 28-task multi-seed stress grid.
+Captured on **2026-04-15** against `main` (Rubric system + `Gate(CaseAbandonedRubric)`
+composition, tightened `acceptable_strategies` on contest-optimal templates, expanded
+`_obvious_next_action` coverage, improved LLM prompt). Reproduce with the commands at the
+bottom; headline scores match to within ±1e-3 (float rounding).
 ## TL;DR
 | Agent | Avg score | Best task | Worst task | Provider calls |
 | --- | --- | --- | --- | --- |
+| **Bad policy** (concede-everything) | **0.199** | `generated_medium_s99` (0.442) | `generated_nightmare_s77` (0.053) | 0 |
+| **Heuristic** (no LLM, rule-based) | **0.724** | `goods_not_received_easy` / `fraud_signal_ambiguity` (0.968) | `generated_hard_s53` (0.440) | 0 |
+| **Heuristic + LLM tiebreak** (openrouter gpt-oss-120b) | **0.729** | `goods_not_received_easy` / `fraud_signal_ambiguity` (0.968) | `generated_hard_s53` (0.440) | 7 (7 ✓ / 0 ✗) |
+**Key signal:** the bad policy vs. heuristic delta is **0.525** (72.4 → 19.9 = 264% spread).
+The `Gate(CaseAbandonedRubric)` wrapper around the per-case `WeightedSum` means a case left
+unresolved past its deadline hard-zeros — a lazy concede-everything agent cannot game the score,
+and a correct agent cannot trivially saturate it on hard tasks. The LLM-assisted run now edges
+ahead of the pure heuristic (+0.005) after the v1.1 prompt and `_obvious_next_action` upgrades;
+the LLM is invoked only **7 times** across the 10-task run (down from 19 in v1) because
+deterministic workflow states are now dispatched without a model call.
 ## Score Curve by Difficulty
 | Difficulty | Task count | Heuristic avg | LLM avg | Bad avg | Target band | Status |
 | --- | --- | --- | --- | --- | --- | --- |
+| easy | 3 | 0.964 | 0.964 | 0.323 | ≥ 0.90 | ✓ |
+| medium | 2 | 0.755 | 0.755 | 0.278 | 0.50 – 0.85 | ✓ |
+| hard | 3 | 0.635 | 0.651 | 0.113 | 0.50 – 0.75 | ✓ |
+| nightmare | 2 | 0.466 | 0.466 | 0.065 | ≤ 0.55 | ✓ |
 Observations:
+- The LLM-assisted run now **matches or narrowly beats** the heuristic on every difficulty band
+  (overall +0.005). The old v1 regression — where the LLM dropped 0.56 on `fraud_signal_ambiguity`
+  and 0.29 on `generated_medium_s99` — was caused by the model picking a concede strategy over
+  contest at `set_strategy` time. `_obvious_next_action` now short-circuits all strategy picks
+  so the heuristic-derived strategy is used directly, and the prompt explicitly lists the
+  reason-code → optimal-strategy mapping for the remaining decision points. Provider call count
+  fell from 19 to 7 because deterministic housekeeping (add_evidence, remove_evidence,
+  submit_representment, set_strategy, resolve_case) is now bypassed entirely.
+- The LLM's remaining upside is on `queue_optimization_hard` (+0.049 over heuristic), where the
+  queue-triage branching is genuine and the heuristic's fixed priority order leaves marginal
+  value on the table.
+- Nightmare tasks cluster around **0.47** for the heuristic because the 15-step budget collides
+  with 5-case portfolios that have deadline_step=3–5 per case. Missed deadlines that were
+  *attempted* still land in the weighted sum (with 0 on the deadline dimension and ~0.55 from
+  the other 85%); truly abandoned cases are zeroed by the `Gate(CaseAbandonedRubric)` wrapper.
+  Not a scoring artifact: the bad-policy run shows the same tasks at ~0.065.
 - The deadline `Gate` is the v1 upgrade over a flat weighted sum: a case never even attempted by
+  the deadline collapses completely, while a case resolved late still earns dimensional credit
+  for evidence, strategy, and packet quality. This matches real chargeback operations — a missed
   representment is "case forfeit," while a late one takes a penalty but is still scored on what
   the merchant tried to do.
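The short-circuit idea from the first observation can be sketched roughly as follows. This is a simplified stand-in, not the runner's `_obvious_next_action`: the real function also inspects the visible case, queue, and deadlines, and the `Candidate` class and `short_circuit` name here are hypothetical. Only the list of action types is taken from the text above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Action types the observations list as deterministic housekeeping.
DETERMINISTIC_ACTIONS = {
    "add_evidence",
    "remove_evidence",
    "submit_representment",
    "set_strategy",
    "resolve_case",
}


@dataclass
class Candidate:
    """Hypothetical stand-in for the runner's ranked candidate objects."""

    action_type: str


def short_circuit(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Return a candidate to take without a model call, or None to defer to the LLM."""
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]  # single candidate: no decision to make
    first = candidates[0]
    if first.action_type in DETERMINISTIC_ACTIONS:
        return first  # the heuristic's top-ranked pick is used directly
    return None  # genuine branching, e.g. queue triage: ask the provider
```

Under this sketch a `select_case` choice among several open cases still reaches the model, which matches where the remaining LLM upside is reported.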
 | Task ID | Difficulty | Cases | Heuristic | H steps | LLM | LLM steps | Bad | Bad steps |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | goods_not_received_easy | easy | 1 | 0.968 | 6 | 0.968 | 6 | 0.280 | 3 |
+| fraud_signal_ambiguity | easy | 1 | 0.968 | 7 | 0.968 | 7 | 0.280 | 3 |
 | generated_easy_s42 | easy | 1 | 0.958 | 7 | 0.958 | 7 | 0.408 | 3 |
 | generated_medium_s17 | medium | 2 | 0.809 | 10 | 0.809 | 10 | 0.114 | 12 |
+| generated_medium_s99 | medium | 2 | 0.701 | 9 | 0.701 | 9 | 0.442 | 12 |
 | queue_optimization_hard | hard | 3 | 0.802 | 12 | 0.850 | 11 | 0.129 | 15 |
+| generated_hard_s7 | hard | 2 | 0.663 | 5 | 0.663 | 5 | 0.120 | 12 |
+| generated_hard_s53 | hard | 3 | 0.440 | 6 | 0.440 | 6 | 0.089 | 15 |
+| generated_nightmare_s31 | nightmare | 5 | 0.486 | 15 | 0.486 | 15 | 0.077 | 15 |
 | generated_nightmare_s77 | nightmare | 5 | 0.445 | 15 | 0.445 | 15 | 0.053 | 15 |
+| **Average** | | | **0.724** | 9.2 | **0.729** | 9.0 | **0.199** | 10.5 |
+## Multi-seed Stress Grid (7 seeds × 4 difficulties)
+Running the heuristic and bad-policy agents across seven generator seeds per difficulty (seeds
+7, 17, 31, 42, 53, 77, 99) gives the statistically defensible version of the headline numbers.
+All runs are fully offline — no provider calls involved.
+| Difficulty | n | Heuristic mean ± std | Bad mean ± std |
+| --- | --- | --- | --- |
+| easy | 7 | 0.9696 ± 0.014 | 0.3346 ± 0.068 |
+| medium | 7 | 0.8411 ± 0.089 | 0.4369 ± 0.238 |
+| hard | 7 | 0.6245 ± 0.151 | 0.1299 ± 0.047 |
+| nightmare | 7 | 0.4121 ± 0.079 | 0.0635 ± 0.010 |
+| **OVERALL** | **28** | **0.7118 ± 0.235** | **0.2412 ± 0.194** |
+Observations:
+- Heuristic score decreases cleanly and monotonically with difficulty: 0.97 → 0.84 → 0.62 →
+  0.41. The difficulty gradient is real — not a labeling artifact.
+- Nightmare std is tight (0.079) because every nightmare task is constrained by the same
+  step budget vs. case count collision; only easy is tighter (0.014). Hard is the widest
+  (0.151) because case counts vary from 2 to 3 across seeds.
+- Bad policy shows wide variance on medium (±0.238) because some medium seeds generate
+  concede-optimal templates (credit_not_processed, duplicate_processing) where
+  concede-everything is trivially correct — exactly the expected behavior of a discriminating
+  rubric on a mixed task distribution.
+- Overall delta (heuristic − bad) across 28 runs: **0.4706**. The headline 10-task catalog
+  delta (0.525) is within 1σ of the multi-seed delta, so the fixed-seed headline is not a
+  cherry-picked result.
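The overall row and the delta in the last observation can be re-derived from the per-difficulty means in the table. The snippet below is only that arithmetic, using the table values as literals; it does not rerun any episodes.

```python
# Per-difficulty means copied from the multi-seed table.
heuristic_means = {"easy": 0.9696, "medium": 0.8411, "hard": 0.6245, "nightmare": 0.4121}
bad_means = {"easy": 0.3346, "medium": 0.4369, "hard": 0.1299, "nightmare": 0.0635}

# Every difficulty has n = 7, so the overall mean is the plain average of the four means.
heuristic_overall = sum(heuristic_means.values()) / len(heuristic_means)
bad_overall = sum(bad_means.values()) / len(bad_means)
multi_seed_delta = heuristic_overall - bad_overall

# Compare against the headline 10-task delta quoted in the TL;DR.
headline_delta = 0.525
gap = abs(headline_delta - multi_seed_delta)

print(f"heuristic={heuristic_overall:.4f} bad={bad_overall:.4f}")
print(f"multi-seed delta={multi_seed_delta:.4f} gap to headline={gap:.4f}")
```

The gap comes out at roughly 0.054, which is smaller than either overall standard deviation in the table (0.235 and 0.194), so the "within 1σ" statement holds whichever of the two is used as the yardstick.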
 ## Rubric Breakdown (single-case sanity check)
 For `goods_not_received_easy` under the heuristic, the 7-dimension breakdown from
+`ChargebackOpsEpisodeRubric` (weights sum to 1.0, Gate passes because the case was resolved
+before step 8):
 | Dimension | Weight | Score | Weighted contribution |
 | --- | --- | --- | --- |
 # Activate the project's venv
 source ~/python/bin/activate
+# 1. Headline 10-task run (heuristic + bad policy, no network)
 python - <<'PY'
 from evaluation.agent_brutal_audit import run_episode
 from scenarios.simulation import list_tasks
     print(f"{t.task_id:32s}  heur={h['score']:.4f}  bad={b['score']:.4f}")
 PY
+# 2. Multi-seed stress grid (28 runs across 7 seeds × 4 difficulties, no network)
+python - <<'PY'
+from statistics import mean, stdev
+from evaluation.agent_brutal_audit import run_episode
+for d in ("easy","medium","hard","nightmare"):
+    hs, bs = [], []
+    for s in (7, 17, 31, 42, 53, 77, 99):
+        hs.append(run_episode(f"generated_{d}_s{s}", policy='heuristic')['score'])
+        bs.append(run_episode(f"generated_{d}_s{s}", policy='bad')['score'])
+    print(f"{d:10s} heur={mean(hs):.4f}±{stdev(hs):.4f}  bad={mean(bs):.4f}±{stdev(bs):.4f}")
+PY
+# 3. LLM tiebreak run (requires OPENROUTER_API_KEY in .env)
 python -m runners.baseline_runner | tee /tmp/baseline_run.json
 ```
 - Python 3.12.13, pytest 7.4.3
 - `openenv-core==0.2.3`, `pydantic==2.12.5`, `openai==2.31.0`
+- Provider: OpenRouter (model `openai/gpt-oss-120b`), all 7 decision calls succeeded, zero retries
+- Average end-to-end episode wall-clock: ~0.8s (heuristic), ~1.8s (with LLM tiebreak — down from
+  ~2.5s in v1 because `_obvious_next_action` bypasses most model calls)
 - Full test suite: 22/22 passing, `openenv validate .` clean, Docker build clean
 ## What This Table Does Not Show
 - **Per-dimension score dispersion across the full catalog** — the table above shows one task's
+  breakdown. An introspection demo command exists for walking `env.rubric.named_rubrics()` on
+  any run: see `README.md` → "Rubric introspection".
 - **RL training curves** — ChargebackOps is a ready environment, not a trained agent. Anyone
+  wiring this into Gym/SB3/CleanRL is expected to produce training curves separately; the
+  rubric tree is the machinery they would hook into for credit assignment.

docs/RUNNING_THE_AGENT.md ADDED

	@@ -0,0 +1,310 @@

+# Running the ChargebackOps Agent
+End-to-end instructions for running the ChargebackOps environment and its baseline agent —
+offline (heuristic only), with an LLM tiebreak, against a single task, across the whole
+benchmark, or as a live server. If you just want the numbers, see
+[`docs/RESULTS.md`](RESULTS.md). If you want to understand the agent internals, see
+[`AGENT.md`](../AGENT.md).
+---
+## 1. Prerequisites
+- **Python 3.12** (required — `tomllib` and `Rubric` type hints assume 3.12+)
+- **git** (for cloning)
+- **Docker** (optional — only if you want the containerized server)
+Clone the repo if you haven't already:
+```bash
+git clone https://github.com/MitudruDutta/chargebackops.git
+cd chargebackops
+```
+Create or reuse a virtual environment, install the project in editable mode with dev extras:
+```bash
+source ~/python/bin/activate     # or: python3.12 -m venv .venv && source .venv/bin/activate
+pip install -e ".[dev]"
+```
+This installs `openenv-core`, `pydantic`, `openai`, `fastapi`, `uvicorn`, `gradio`, and the test
+harness. Nothing else is required for offline runs.
+Verify the install:
+```bash
+pytest -q tests           # expect: 22 passed
+openenv validate .        # expect: Ready for multi-mode deployment
+```
+---
+## 2. Configure the environment
+Copy the template and edit it:
+```bash
+cp .env.example .env
+```
+### 2a. Offline mode (no API keys)
+You can run the heuristic and bad-policy agents with **no keys at all**. The runner
+automatically falls back to the heuristic when no provider is configured. Skip to section 3.
+### 2b. LLM tiebreak mode
+Fill in **one** of the provider blocks in `.env`. The runner auto-detects which provider to
+use based on `BASELINE_PROVIDER`:
+**OpenRouter (recommended — free tier available, used in reference results):**
+```env
+BASELINE_PROVIDER=openrouter
+BASELINE_MODEL=openai/gpt-oss-120b
+OPENROUTER_API_KEY=sk-or-v1-...
+OPENROUTER_APP_TITLE=ChargebackOps
+```
+**Groq (fastest, free tier):**
+```env
+BASELINE_PROVIDER=groq
+BASELINE_MODEL=llama-3.3-70b-versatile
+GROQ_API_KEY=gsk_...
+```
+**OpenAI:**
+```env
+BASELINE_PROVIDER=openai
+BASELINE_MODEL=gpt-4o-mini
+OPENAI_API_KEY=sk-...
+```
+**Google Gemini:**
+```env
+BASELINE_PROVIDER=google
+BASELINE_MODEL=gemini-2.0-flash-exp
+GOOGLE_API_KEY=AI...
+```
+The fallback chain is: **primary → OpenRouter → Google → Groq → Heuristic**. If the primary
+provider times out or 429s, the runner automatically walks the chain. Set `STRICT_LLM_MODE=1`
+if you want failures to surface instead of silently falling back to the heuristic.
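The fallback behaviour described above can be pictured with the sketch below. It is an illustration under stated assumptions, not the runner's code: `pick_with_fallback`, the `(name, callable)` provider shape, and the choice to raise only after the whole chain has failed in strict mode are all hypothetical. Only the chain order and the `STRICT_LLM_MODE=1` switch come from the text.

```python
import os
from typing import Callable, Optional, Sequence, Tuple

# (provider name, function that returns a candidate index for a prompt)
Provider = Tuple[str, Callable[[str], int]]


def pick_with_fallback(
    prompt: str,
    providers: Sequence[Provider],
    heuristic_index: int = 0,
    strict: Optional[bool] = None,
) -> Tuple[str, int]:
    """Try each provider in order; use the heuristic pick if all of them fail."""
    if strict is None:
        strict = os.environ.get("STRICT_LLM_MODE") == "1"
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # e.g. a timeout or HTTP 429 from a real client
            errors.append(f"{name}: {exc}")
    if strict and errors:
        # Assumed strict behaviour: surface the failures instead of hiding them.
        raise RuntimeError("all providers failed: " + "; ".join(errors))
    return "heuristic", heuristic_index
```

With no providers configured the function returns the heuristic pick directly, mirroring the offline mode from section 2a.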
+---
+## 3. Run the agent
+### 3a. Against a single task
+```bash
+source ~/python/bin/activate
+python - <<'PY'
+from evaluation.agent_brutal_audit import run_episode
+result = run_episode("goods_not_received_easy", policy="heuristic")
+print(f"score = {result['score']:.4f}")
+print(f"steps = {result['steps']}")
+print(f"summary: {result['summary']}")
+PY
+```
+Available policies: `"heuristic"` (rule-based, no LLM), `"bad"` (concede-everything baseline).
+Any built-in or generated task id works — e.g. `"generated_nightmare_s31"`,
+`"fraud_signal_ambiguity"`, `"queue_optimization_hard"`.
+### 3b. Across the full 10-task headline benchmark (offline)
+```bash
+python - <<'PY'
+from evaluation.agent_brutal_audit import run_episode
+from scenarios.simulation import list_tasks
+for t in list_tasks():
+    h = run_episode(t.task_id, policy='heuristic')
+    b = run_episode(t.task_id, policy='bad')
+    print(f"{t.task_id:32s}  heur={h['score']:.4f}  bad={b['score']:.4f}")
+PY
+```
+Expect the heuristic to average **0.724** and the bad policy to average **0.199** (±1e-3 for
+float rounding). Total wall-clock: ~8 seconds, zero provider calls.
+### 3c. Across the full benchmark with an LLM tiebreak
+Make sure a provider is configured (section 2b), then:
+```bash
+python -m runners.baseline_runner | tee /tmp/baseline_run.json
+```
+This writes a JSON report with `task_results`, `average_score`, `provider_calls_attempted`,
+and `provider_calls_succeeded`. Expect **0.729** average and **7 provider calls** on the
+reference setup (OpenRouter + `openai/gpt-oss-120b`).
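To check a run against those expectations, the report can be summarised with a few lines of Python. This sketch assumes the runner's stdout is a single JSON document (so the `tee` file parses as-is) and that `task_results` is a collection with one entry per task; the helper name `summarize_report` is hypothetical. The four field names are the ones listed above.

```python
import json
from pathlib import Path


def summarize_report(report: dict) -> str:
    """Format the headline fields of a baseline report as one line."""
    attempted = report["provider_calls_attempted"]
    succeeded = report["provider_calls_succeeded"]
    return (
        f"avg={report['average_score']:.3f} "
        f"tasks={len(report['task_results'])} "
        f"provider_calls={succeeded}/{attempted}"
    )


# Path matches the `tee` target in the command above; skipped if no run exists yet.
report_path = Path("/tmp/baseline_run.json")
if report_path.exists():
    print(summarize_report(json.loads(report_path.read_text())))
```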
+### 3d. Multi-seed stress grid (28 runs, fully offline)
+```bash
+python - <<'PY'
+from statistics import mean, stdev
+from evaluation.agent_brutal_audit import run_episode
+for d in ("easy","medium","hard","nightmare"):
+    hs, bs = [], []
+    for s in (7, 17, 31, 42, 53, 77, 99):
+        hs.append(run_episode(f"generated_{d}_s{s}", policy='heuristic')['score'])
+        bs.append(run_episode(f"generated_{d}_s{s}", policy='bad')['score'])
+    print(f"{d:10s} heur={mean(hs):.4f}±{stdev(hs):.4f}  bad={mean(bs):.4f}±{stdev(bs):.4f}")
+PY
+```
+Expected: heuristic **0.712 ± 0.235**, bad policy **0.241 ± 0.194** across 28 runs.
+### 3e. Custom inference contract (challenge submission)
+`inference.py` is the submission entry point used by the hackathon harness. It reads
+`API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN` from the environment and returns decisions via an
+OpenAI-compatible client.
+```bash
+API_BASE_URL=https://openrouter.ai/api/v1 \
+MODEL_NAME=openai/gpt-oss-120b \
+HF_TOKEN=sk-or-v1-... \
+python -m runners.inference
+```
+---
+## 4. Run the server (FastAPI + Gradio demo)
+The server exposes the environment via HTTP for OpenEnv-compatible clients and a live demo
+at `/demo`.
+### 4a. Local
+```bash
+source ~/python/bin/activate
+uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
+```
+Endpoints:
+| Path | Method | Purpose |
+|---|---|---|
+| `/reset` | POST | Start an episode (pass `task_id` in JSON body) |
+| `/step` | POST | Take one action |
+| `/state` | GET | Current observation + progress |
+| `/tasks` | GET | Task catalog |
+| `/demo` | GET | Gradio live demo — click-through playback |
+| `/baseline` | GET/POST | Run the heuristic agent headlessly |
+| `/grader` | GET/POST | Score a completed episode |
+| `/health` | GET | Health check |
+| `/docs` | GET | OpenAPI / Swagger UI |
+Example `curl` flow:
+```bash
+curl -s -X POST localhost:8000/reset -H 'content-type: application/json' \
+  -d '{"task_id":"goods_not_received_easy"}' | jq '.task_id, .steps_remaining'
+curl -s -X POST localhost:8000/step -H 'content-type: application/json' \
+  -d '{"action":{"action_type":"select_case","case_id":"CB-E1"}}' | jq '.reward'
+```
+### 4b. Docker
+```bash
+docker build -t chargebackops .
+docker run --rm -p 8000:8000 --env-file .env chargebackops
+```
+The Dockerfile is layered so source edits don't re-run `pip install` — first build takes ~40s,
+edits after that rebuild in ~6s.
+### 4c. Hugging Face Space
+The repo doubles as a Hugging Face Space (see the frontmatter at the top of `README.md`). Push
+to the `hf` remote and the space rebuilds automatically:
+```bash
+git push hf main
+```
+---
+## 5. Inspect the rubric tree
+Every scoring dimension is an OpenEnv `Rubric` subclass. Walk the composition tree on any
+live environment:
+```bash
+python - <<'PY'
+from server.chargeback_ops_environment import ChargebackOpsEnvironment
+env = ChargebackOpsEnvironment()
+for name, r in env.rubric.named_rubrics():
+    print(f"{name}: {type(r).__name__}")
+PY
+```
+Expected output (11 named children):
+```
+case_rubric: CaseRubric
+case_rubric.aggregator: WeightedSum
+case_rubric.aggregator.rubric_0: StrategyCorrectnessRubric
+case_rubric.aggregator.rubric_1: EvidenceQualityRubric
+case_rubric.aggregator.rubric_2: PacketValidityRubric
+case_rubric.aggregator.rubric_3: DeadlineComplianceRubric
+case_rubric.aggregator.rubric_4: EfficiencyRubric
+case_rubric.aggregator.rubric_5: OutcomeQualityRubric
+case_rubric.aggregator.rubric_6: NoteQualityRubric
+case_rubric.deadline_gate: Gate
+case_rubric.deadline_gate.rubric: CaseAbandonedRubric
+```
+After a forward pass, each child exposes `last_score` — this is the introspection path an RL
+trainer hooks into for credit assignment.
+---
+## 6. Troubleshooting
+**`openenv validate .` fails.** Check `openenv.yaml` is present at repo root and your venv has
+`openenv-core>=0.2.3` installed.
+**Provider calls all fail / score drops to heuristic.** Run `python -m runners.baseline_runner`
+and inspect `provider_errors`. Common causes: expired API key, wrong `BASELINE_MODEL` slug for
+the provider, or rate limits (the runner retries twice, then falls back). Set
+`BASELINE_REQUEST_TIMEOUT_SECONDS=30` if the provider is slow.
+**`ImportError: attempted relative import`.** Always run commands from the repo root with the
+venv activated. Use `python -m runners.baseline_runner`, not `python runners/baseline_runner.py`.
+**Docker build is slow on every edit.** You probably edited `pyproject.toml` — the deps layer
+only caches when that file is unchanged. If you edit source only, rebuilds should be ~6s.
+**Scores differ from `docs/RESULTS.md`.** If you pass different seeds or LLM providers you
+will get different numbers. The reference numbers are captured on the fixed 10-task catalog
+defined by `scenarios.simulation.list_tasks()` plus OpenRouter `openai/gpt-oss-120b`. Anything
+else is not directly comparable.
+---
+## 7. Minimal "does it work?" smoke test
+One command to verify everything is wired up correctly:
+```bash
+source ~/python/bin/activate && \
+  pytest -q tests && \
+  openenv validate . && \
+  python -c "
+from evaluation.agent_brutal_audit import run_episode
+r = run_episode('goods_not_received_easy', policy='heuristic')
+assert abs(r['score'] - 0.9675) < 1e-3, r['score']
+print('smoke OK, score =', r['score'])
+"
+```
+If that prints `smoke OK, score = 0.9675`, the agent runs cleanly and the rubric math is stable.

runners/baseline_runner.py CHANGED

@@ -1043,6 +1043,10 @@ def _obvious_next_action(
     if not candidates:
         return None
     first = candidates[0]
     visible_case = observation.get("visible_case")
     queue = observation["queue"]
@@ -1065,9 +1069,18 @@ def _obvious_next_action(
     if visible_case["status"] != "open":
         return first if first.action.action_type == "select_case" else None
     if first.action.action_type in {
         "retrieve_policy",
         "add_evidence",
         "submit_representment",
         "resolve_case",
     }:
@@ -1078,19 +1091,6 @@ def _obvious_next_action(
         if visible_case.get("policy") is None or current_strategy in {None, "contest"}:
             return first
-    if first.action.action_type == "set_strategy":
-        strategy = first.action.strategy
-        competing_strategies = {
-            candidate.action.strategy
-            for candidate in candidates[1:]
-            if candidate.action.action_type == "set_strategy"
-        }
-        if (
-            strategy in {"accept_chargeback", "issue_refund"}
-            and "contest" not in competing_strategies
-        ):
-            return first
     if first.action.action_type == "select_case":
         current_case_id = visible_case["case_id"]
         current_deadline = next(
@@ -1278,9 +1278,22 @@ def _provider_pick(
                     {
                         "role": "system",
                         "content": (
-                            "You are a merchant chargeback dispute analyst. Pick the single best next action from the candidates. "
-                            "Prioritize: 1) deadline-urgent cases, 2) evidence-backed contests, 3) fast concedes for weak cases. "
-                            'Avoid attaching harmful evidence. Return JSON: {"candidate_index": N, "rationale": "brief reason"}'
                         ),
                     },
                     {"role": "user", "content": payload},

     if not candidates:
         return None
+    # Single candidate = no decision to make.
+    if len(candidates) == 1:
+        return candidates[0]
     first = candidates[0]
     visible_case = observation.get("visible_case")
     queue = observation["queue"]
     if visible_case["status"] != "open":
         return first if first.action.action_type == "select_case" else None
+    # Strategy selection: the heuristic already derives the optimal strategy
+    # from policy + retrieved evidence. The LLM has no additional signal that
+    # improves this specific call — invoking it here has only caused regressions
+    # on fraud_signal_ambiguity and generated_medium_s99 where the model picks
+    # a concede-style strategy over the correct contest.
+    if first.action.action_type == "set_strategy":
+        return first
     if first.action.action_type in {
         "retrieve_policy",
         "add_evidence",
+        "remove_evidence",
         "submit_representment",
         "resolve_case",
     }:
         if visible_case.get("policy") is None or current_strategy in {None, "contest"}:
             return first
     if first.action.action_type == "select_case":
         current_case_id = visible_case["case_id"]
         current_deadline = next(
                     {
                         "role": "system",
                         "content": (
+                            "You are a merchant chargeback dispute analyst. Pick the single best next action from the ordered candidate list. "
+                            "The candidates are pre-sorted by a deterministic heuristic — candidate 0 is usually correct. Deviate only when you spot a concrete reason. "
+                            "\n"
+                            "Reason-code → optimal strategy (follow unless evidence clearly contradicts):\n"
+                            "  goods_not_received → contest (with order + delivery proof)\n"
+                            "  fraud_cnp → contest when account linkage exists, otherwise concede\n"
+                            "  product_not_as_described → contest (with listing + return policy proof)\n"
+                            "  service_not_provided → contest (with completion log)\n"
+                            "  credit_not_processed → issue_refund immediately\n"
+                            "  duplicate_processing → issue_refund immediately\n"
+                            "\n"
+                            "Priorities: (1) resolve cases whose deadline is 1 step away before anything else, "
+                            "(2) prefer the highest-$ open case when budget is tight, "
+                            "(3) never attach harmful evidence (AVS/CVV mismatch on fraud_cnp, GPS anomalies on goods_not_received), "
+                            "(4) when multiple candidates look equivalent, take candidate 0.\n"
+                            'Return only JSON: {"candidate_index": N, "rationale": "brief reason"}'
                         ),
                     },
                     {"role": "user", "content": payload},
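
The new prompt asks the model to return only JSON of the form `{"candidate_index": N, "rationale": "..."}`. A defensive way to consume such a reply is sketched below. `parse_candidate_index` is a hypothetical helper, not a function from `baseline_runner.py`, and falling back to index 0 is an assumption that follows the prompt's "candidate 0 is usually correct" guidance; the actual runner may handle bad replies differently.

```python
import json


def parse_candidate_index(reply: str, n_candidates: int) -> int:
    """Return a valid candidate index from a model reply, defaulting to 0.

    Hypothetical helper: any malformed, non-integer, or out-of-range reply
    falls back to candidate 0, the heuristic's top pick.
    """
    try:
        data = json.loads(reply)
    except (TypeError, ValueError):
        # Covers non-string input and invalid JSON (JSONDecodeError is a ValueError).
        return 0
    if not isinstance(data, dict):
        return 0
    index = data.get("candidate_index")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(index, bool) or not isinstance(index, int):
        return 0
    if not 0 <= index < n_candidates:
        return 0
    return index
```

For example, `parse_candidate_index('{"candidate_index": 2, "rationale": "deadline"}', 3)` yields `2`, while an out-of-range index or free-text reply yields `0`.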

scenarios/case_generator.py CHANGED

@@ -580,7 +580,7 @@ _PRODUCT_NOT_AS_DESCRIBED = _CaseTemplate(
     ),
     policy_requirements=("product listing verification", "return policy documentation"),
     optimal_strategy="contest",
-    acceptable_strategies=("issue_refund",),
     resolution_summary="Contest with listing accuracy proof and return policy documentation.",
     base_weight=1.0,
     evidence_blueprints=(
@@ -890,7 +890,7 @@ _SERVICE_NOT_PROVIDED = _CaseTemplate(
         "customer acknowledgment or scheduling proof",
     ),
     optimal_strategy="contest",
-    acceptable_strategies=("issue_refund",),
     resolution_summary="Contest with service completion proof. The service was delivered as booked.",
     base_weight=1.0,
     evidence_blueprints=(

     ),
     policy_requirements=("product listing verification", "return policy documentation"),
     optimal_strategy="contest",
+    acceptable_strategies=(),
     resolution_summary="Contest with listing accuracy proof and return policy documentation.",
     base_weight=1.0,
     evidence_blueprints=(
         "customer acknowledgment or scheduling proof",
     ),
     optimal_strategy="contest",
+    acceptable_strategies=(),
     resolution_summary="Contest with service completion proof. The service was delivered as booked.",
     base_weight=1.0,
     evidence_blueprints=(

scenarios/simulation.py CHANGED

@@ -258,7 +258,7 @@ TASKS: dict[str, TaskScenario] = {
                 ),
                 deadline_step=7,
                 optimal_strategy="contest",
-                acceptable_strategies=("accept_chargeback",),
                 policy_guidance=(
                     "For CNP fraud disputes, contest only when you can link the cardholder to the account or device history. "
                     "Do not attach evidence that strengthens the issuer's fraud narrative."
@@ -268,7 +268,7 @@ TASKS: dict[str, TaskScenario] = {
                     "customer account confirmation",
                 ),
                 recommended_strategy="contest",
-                resolution_summary="Contest only with strong account-linkage evidence. Conceding is acceptable but suboptimal.",
                 weight=1.1,
                 required_evidence_ids=("M1-PRIOR-ORDERS", "M1-ACCOUNT-CHAT"),
                 helpful_evidence_ids=(

                 ),
                 deadline_step=7,
                 optimal_strategy="contest",
+                acceptable_strategies=(),
                 policy_guidance=(
                     "For CNP fraud disputes, contest only when you can link the cardholder to the account or device history. "
                     "Do not attach evidence that strengthens the issuer's fraud narrative."
                     "customer account confirmation",
                 ),
                 recommended_strategy="contest",
+                resolution_summary="Contest with strong account-linkage evidence. Conceding this case forfeits defensible revenue.",
                 weight=1.1,
                 required_evidence_ids=("M1-PRIOR-ORDERS", "M1-ACCOUNT-CHAT"),
                 helpful_evidence_ids=(
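
The effect of emptying `acceptable_strategies` can be illustrated with a toy scoring function. This is a sketch under stated assumptions, not the repository's rubric: `strategy_credit` and its `partial` value of 0.5 are hypothetical, chosen only to show how a concede-style pick moves from partial credit to none once the tuple is empty.

```python
from typing import Tuple


def strategy_credit(
    chosen: str,
    optimal: str,
    acceptable: Tuple[str, ...],
    partial: float = 0.5,  # hypothetical partial-credit value
) -> float:
    """Full credit for the optimal strategy, partial for an acceptable one, else zero."""
    if chosen == optimal:
        return 1.0
    if chosen in acceptable:
        return partial
    return 0.0


# Before this commit: refunding a winnable contest still earned partial credit.
before = strategy_credit("issue_refund", "contest", ("issue_refund",))
# After this commit: acceptable_strategies=() leaves no partial-credit path.
after = strategy_credit("issue_refund", "contest", ())
```

Under these assumptions `before` is `0.5` and `after` is `0.0`, which matches the intent stated in the commit message that conceding a winnable case no longer earns partial strategy credit.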