refactor: tighten rubric discrimination + LLM path + add running doc
- Tighten acceptable_strategies on contest-optimal templates (fraud_signal_ambiguity,
_PRODUCT_NOT_AS_DESCRIBED, _SERVICE_NOT_PROVIDED) so conceding a winnable case no
longer earns partial strategy credit. Bad-policy floor 0.212 -> 0.199.
- Expand _obvious_next_action to cover single-candidate states, remove_evidence, and
all set_strategy picks. Starves the LLM of bad-call opportunities on deterministic
workflow states; provider calls drop 19 -> 7 per 10-task run.
- Improve LLM prompt with reason-code -> optimal-strategy mapping and explicit
"candidate 0 is usually correct" guidance.
- LLM baseline: 0.622 -> 0.729 (+0.107). fraud_signal_ambiguity recovered 0.408 -> 0.968,
generated_nightmare_s31 0.134 -> 0.486. LLM now edges heuristic (0.729 > 0.724).
- Add 28-task multi-seed stress grid to RESULTS.md (7 seeds x 4 difficulties).
- Add docs/RUNNING_THE_AGENT.md with end-to-end run instructions.
- Fix README Docker block: offline run no longer requires --env-file; section reflows.
- AGENT.md +6 -6
- README.md +24 -9
- docs/RESULTS.md +94 -50
- docs/RUNNING_THE_AGENT.md +310 -0
- runners/baseline_runner.py +29 -16
- scenarios/case_generator.py +2 -2
- scenarios/simulation.py +2 -2
|
@@ -589,10 +589,10 @@ Tested across the 10-task benchmark (3 showcase + 7 seeded holdout):
|
|
| 589 |
|
| 590 |
| Difficulty | Tasks | Heuristic | LLM tiebreak | Bad | Key Observations |
|
| 591 |
|---|---|---|---|---|---|
|
| 592 |
-
| Easy | 3 | 0.964 | 0.
|
| 593 |
-
| Medium | 2 | 0.755 | 0.
|
| 594 |
-
| Hard | 3 | 0.
|
| 595 |
-
| Nightmare | 2 | 0.466 | 0.
|
| 596 |
-
| **Overall** | **10** | **0.
|
| 597 |
|
| 598 |
-
The difficulty curve demonstrates the environment discriminates effectively: easy tasks are near-trivial, nightmare tasks push every agent below 50%. The `Gate(CaseAbandonedRubric)` wrapper hard-zeros cases left unresolved past their deadline, so the heuristic's slowness on nightmare portfolios shows up as a real signal rather than dilution across 7 partial dimensions. See `docs/RESULTS.md` for full per-task numbers.
|
|
|
|
| 589 |
|
| 590 |
| Difficulty | Tasks | Heuristic | LLM tiebreak | Bad | Key Observations |
|
| 591 |
|---|---|---|---|---|---|
|
| 592 |
+
| Easy | 3 | 0.964 | 0.964 | 0.323 | Heuristic + LLM both saturate the easy band |
|
| 593 |
+
| Medium | 2 | 0.755 | 0.755 | 0.278 | Strategy selection + evidence curation drive the spread |
|
| 594 |
+
| Hard | 3 | 0.635 | 0.651 | 0.113 | LLM edges heuristic on `queue_optimization_hard` (+0.049) |
|
| 595 |
+
| Nightmare | 2 | 0.466 | 0.466 | 0.065 | 5-case portfolios with deadline_step=3–5; step budget collides |
|
| 596 |
+
| **Overall** | **10** | **0.724** | **0.729** | **0.199** | **Delta 0.525 vs bad policy** |
|
| 597 |
|
| 598 |
+
The difficulty curve demonstrates the environment discriminates effectively: easy tasks are near-trivial, nightmare tasks push every agent below 50%. The `Gate(CaseAbandonedRubric)` wrapper hard-zeros cases left unresolved past their deadline, so the heuristic's slowness on nightmare portfolios shows up as a real signal rather than dilution across 7 partial dimensions. The LLM-assisted run now edges ahead of the pure heuristic (+0.005) and makes only **7 provider calls** across the 10-task run (down from 19 in v1) because `_obvious_next_action` short-circuits deterministic workflow states: strategy picks, add/remove evidence, submit, resolve. A 28-task multi-seed grid (7 seeds × 4 difficulties) reports heuristic 0.712 ± 0.235 and bad policy 0.241 ± 0.194; the fixed-seed headline is within 1σ of the multi-seed result. See `docs/RESULTS.md` for full per-task numbers.
|
|
@@ -84,17 +84,22 @@ pie title Case Score Weights
|
|
| 84 |
|
| 85 |
| Difficulty | Tasks | Heuristic (no LLM) | Heuristic + LLM tiebreak | Naive baseline |
|
| 86 |
|---|---|---|---|---|
|
| 87 |
-
| Easy | 3 |
|
| 88 |
-
| Medium | 2 |
|
| 89 |
-
| Hard | 3 |
|
| 90 |
-
| Nightmare | 2 |
|
| 91 |
-
| **Overall** | **10** | **0.
|
| 92 |
|
| 93 |
-
|
|
|
|
|
|
|
|
|
|
| 94 |
rubric cannot be gamed by a lazy agent, and the `Gate(CaseAbandonedRubric)` wrapper hard-zeros
|
| 95 |
cases left unresolved past their deadline so the hard-band tasks cannot be trivially saturated.
|
| 96 |
-
|
| 97 |
-
|
|
|
|
|
|
|
| 98 |
|
| 99 |
## Action Space (9 typed actions)
|
| 100 |
|
|
@@ -132,12 +137,22 @@ for name, r in env.rubric.named_rubrics():
|
|
| 132 |
# ... (all 7 dimensions)
|
| 133 |
```
|
| 134 |
|
|
|
|
|
|
|
| 135 |
```bash
|
| 136 |
-
#
|
| 137 |
docker build -t chargebackops .
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 138 |
docker run --rm -p 8000:8000 --env-file .env chargebackops
|
| 139 |
```
|
| 140 |
|
|
|
|
|
|
|
|
|
|
| 141 |
## API
|
| 142 |
|
| 143 |
| Method | Path | Description |
|
|
|
|
| 84 |
|
| 85 |
| Difficulty | Tasks | Heuristic (no LLM) | Heuristic + LLM tiebreak | Naive baseline |
|
| 86 |
|---|---|---|---|---|
|
| 87 |
+
| Easy | 3 | 0.964 | **0.964** | 0.323 |
|
| 88 |
+
| Medium | 2 | 0.755 | **0.755** | 0.278 |
|
| 89 |
+
| Hard | 3 | 0.635 | **0.651** | 0.113 |
|
| 90 |
+
| Nightmare | 2 | 0.466 | **0.466** | 0.065 |
|
| 91 |
+
| **Overall** | **10** | **0.724** | **0.729** | **0.199** |
|
| 92 |
|
| 93 |
+
28-task multi-seed grid (7 seeds × 4 difficulties, fully offline): heuristic **0.712 ± 0.235**,
|
| 94 |
+
bad policy **0.241 ± 0.194**, delta **+0.471**, within 1σ of the headline fixed-seed delta.
|
| 95 |
+
|
| 96 |
+
**Rubric discrimination:** heuristic vs. naive concede-everything delta is **+0.525**: the
|
| 97 |
rubric cannot be gamed by a lazy agent, and the `Gate(CaseAbandonedRubric)` wrapper hard-zeros
|
| 98 |
cases left unresolved past their deadline so the hard-band tasks cannot be trivially saturated.
|
| 99 |
+
The LLM-assisted run edges ahead of the pure heuristic (+0.005) while making only **7 provider
|
| 100 |
+
calls** (down from 19 in v1) because `_obvious_next_action` now short-circuits all
|
| 101 |
+
deterministic workflow states. Per-dimension breakdown, score reproduction commands, and
|
| 102 |
+
calibration notes live in [`docs/RESULTS.md`](docs/RESULTS.md).
|
| 103 |
|
| 104 |
## Action Space (9 typed actions)
|
| 105 |
|
|
|
|
| 137 |
# ... (all 7 dimensions)
|
| 138 |
```
|
| 139 |
|
| 140 |
+
Run the server in Docker:
|
| 141 |
+
|
| 142 |
```bash
|
| 143 |
+
# 1. Build the image (tag: chargebackops)
|
| 144 |
docker build -t chargebackops .
|
| 145 |
+
|
| 146 |
+
# 2a. Offline run: no env vars required
|
| 147 |
+
docker run --rm -p 8000:8000 chargebackops
|
| 148 |
+
|
| 149 |
+
# 2b. With LLM provider keys (requires .env from Quick Start above)
|
| 150 |
docker run --rm -p 8000:8000 --env-file .env chargebackops
|
| 151 |
```
|
| 152 |
|
| 153 |
+
The container exposes the FastAPI app on port 8000 (`/docs` for OpenAPI, `/demo` for the Gradio
|
| 154 |
+
live demo, `/health` for readiness). Stop it with Ctrl-C or `docker stop`.
|
| 155 |
+
|
| 156 |
## API
|
| 157 |
|
| 158 |
| Method | Path | Description |
|
|
@@ -1,46 +1,56 @@
|
|
| 1 |
# ChargebackOps: Baseline Results
|
| 2 |
|
| 3 |
-
Reference numbers for the 10-task benchmark
|
| 4 |
-
`main` (Rubric system +
|
| 5 |
-
|
|
|
|
|
|
|
| 6 |
|
| 7 |
## TL;DR
|
| 8 |
|
| 9 |
| Agent | Avg score | Best task | Worst task | Provider calls |
|
| 10 |
| --- | --- | --- | --- | --- |
|
| 11 |
-
| **Bad policy** (concede-everything) | **0.
|
| 12 |
-
| **Heuristic** (no LLM, rule-based) | **0.
|
| 13 |
-
| **Heuristic + LLM tiebreak** (openrouter gpt-oss-120b) | **0.
|
| 14 |
-
|
| 15 |
-
**Key signal:** the bad policy vs. heuristic delta is **0.
|
| 16 |
-
`Gate(CaseAbandonedRubric)` wrapper around the per-case `WeightedSum` means a case left
|
| 17 |
-
past its deadline hard-zeros β a lazy concede-everything agent cannot game the score,
|
| 18 |
-
agent cannot trivially saturate it on hard tasks.
|
|
|
|
|
|
|
|
|
|
| 19 |
|
| 20 |
## Score Curve by Difficulty
|
| 21 |
|
| 22 |
| Difficulty | Task count | Heuristic avg | LLM avg | Bad avg | Target band | Status |
|
| 23 |
| --- | --- | --- | --- | --- | --- | --- |
|
| 24 |
-
| easy | 3 | 0.964 | 0.
|
| 25 |
-
| medium | 2 | 0.755 | 0.
|
| 26 |
-
| hard | 3 | 0.
|
| 27 |
-
| nightmare | 2 | 0.466 | 0.
|
| 28 |
|
| 29 |
Observations:
|
| 30 |
-
- The
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
`
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 41 |
- The deadline `Gate` is the v1 upgrade over a flat weighted sum: a case never even attempted by
|
| 42 |
-
the deadline collapses completely, while a case resolved late still earns dimensional credit
|
| 43 |
-
evidence, strategy, and packet quality. This matches real chargeback operations: a missed
|
| 44 |
representment is "case forfeit," while a late one takes a penalty but is still scored on what
|
| 45 |
the merchant tried to do.
|
| 46 |
|
|
@@ -49,26 +59,50 @@ Observations:
|
|
| 49 |
| Task ID | Difficulty | Cases | Heuristic | H steps | LLM | LLM steps | Bad | Bad steps |
|
| 50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|
| 51 |
| goods_not_received_easy | easy | 1 | 0.968 | 6 | 0.968 | 6 | 0.280 | 3 |
|
| 52 |
-
| fraud_signal_ambiguity | easy | 1 | 0.968 | 7 | 0.
|
| 53 |
| generated_easy_s42 | easy | 1 | 0.958 | 7 | 0.958 | 7 | 0.408 | 3 |
|
| 54 |
| generated_medium_s17 | medium | 2 | 0.809 | 10 | 0.809 | 10 | 0.114 | 12 |
|
| 55 |
-
| generated_medium_s99 | medium | 2 | 0.701 | 9 | 0.
|
| 56 |
| queue_optimization_hard | hard | 3 | 0.802 | 12 | 0.850 | 11 | 0.129 | 15 |
|
| 57 |
-
| generated_hard_s7 | hard | 2 | 0.
|
| 58 |
-
| generated_hard_s53 | hard | 3 | 0.
|
| 59 |
-
| generated_nightmare_s31 | nightmare | 5 | 0.486 | 15 | 0.
|
| 60 |
| generated_nightmare_s77 | nightmare | 5 | 0.445 | 15 | 0.445 | 15 | 0.053 | 15 |
|
| 61 |
-
| **Average** | | | **0.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 62 |
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 66 |
|
| 67 |
## Rubric Breakdown (single-case sanity check)
|
| 68 |
|
| 69 |
For `goods_not_received_easy` under the heuristic, the 7-dimension breakdown from
|
| 70 |
-
`ChargebackOpsEpisodeRubric` (weights sum to 1.0, Gate passes because the case was resolved
|
| 71 |
-
step 8):
|
| 72 |
|
| 73 |
| Dimension | Weight | Score | Weighted contribution |
|
| 74 |
| --- | --- | --- | --- |
|
|
@@ -114,7 +148,7 @@ surface for a judge or trainer to introspect per-dimension scores.
|
|
| 114 |
# Activate the project's venv
|
| 115 |
source ~/python/bin/activate
|
| 116 |
|
| 117 |
-
# 1.
|
| 118 |
python - <<'PY'
|
| 119 |
from evaluation.agent_brutal_audit import run_episode
|
| 120 |
from scenarios.simulation import list_tasks
|
|
@@ -124,7 +158,19 @@ for t in list_tasks():
|
|
| 124 |
    print(f"{t.task_id:32s} heur={h['score']:.4f} bad={b['score']:.4f}")
|
| 125 |
PY
|
| 126 |
|
| 127 |
-
# 2.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 128 |
python -m runners.baseline_runner | tee /tmp/baseline_run.json
|
| 129 |
```
|
| 130 |
|
|
@@ -132,18 +178,16 @@ python -m runners.baseline_runner | tee /tmp/baseline_run.json
|
|
| 132 |
|
| 133 |
- Python 3.12.13, pytest 7.4.3
|
| 134 |
- `openenv-core==0.2.3`, `pydantic==2.12.5`, `openai==2.31.0`
|
| 135 |
-
- Provider: OpenRouter (model `openai/gpt-oss-120b`), all
|
| 136 |
-
- Average end-to-end episode wall-clock: ~0.8s (heuristic), ~
|
|
|
|
| 137 |
- Full test suite: 22/22 passing, `openenv validate .` clean, Docker build clean
|
| 138 |
|
| 139 |
## What This Table Does Not Show
|
| 140 |
|
| 141 |
-
- **Single-seed-per-task**: generated tasks use a fixed seed. A statistically rigorous eval would
|
| 142 |
-
run each task across 10+ seeds. The fixed-seed catalog is intentional for the hackathon (direct
|
| 143 |
-
score comparison between submissions), but is flagged as a scale-up item for v1.1.
|
| 144 |
- **Per-dimension score dispersion across the full catalog**: the table above shows one task's
|
| 145 |
-
breakdown. An introspection demo command exists for walking `env.rubric.named_rubrics()` on
|
| 146 |
-
run: see `README.md` → "Rubric introspection".
|
| 147 |
- **RL training curves**: ChargebackOps is a ready environment, not a trained agent. Anyone
|
| 148 |
-
wiring this into Gym/SB3/CleanRL is expected to produce training curves separately; the
|
| 149 |
-
tree is the machinery they would hook into for credit assignment.
|
|
|
|
| 1 |
# ChargebackOps: Baseline Results
|
| 2 |
|
| 3 |
+
Reference numbers for the 10-task headline benchmark and the 28-task multi-seed stress grid.
|
| 4 |
+
Captured on **2026-04-15** against `main` (Rubric system + `Gate(CaseAbandonedRubric)`
|
| 5 |
+
composition, tightened `acceptable_strategies` on contest-optimal templates, expanded
|
| 6 |
+
`_obvious_next_action` coverage, improved LLM prompt). Reproduce with the commands at the
|
| 7 |
+
bottom; headline scores match to within ±1e-3 (float rounding).
|
| 8 |
|
| 9 |
## TL;DR
|
| 10 |
|
| 11 |
| Agent | Avg score | Best task | Worst task | Provider calls |
|
| 12 |
| --- | --- | --- | --- | --- |
|
| 13 |
+
| **Bad policy** (concede-everything) | **0.199** | `generated_medium_s99` (0.442) | `generated_nightmare_s77` (0.053) | 0 |
|
| 14 |
+
| **Heuristic** (no LLM, rule-based) | **0.724** | `goods_not_received_easy` / `fraud_signal_ambiguity` (0.968) | `generated_hard_s53` (0.440) | 0 |
|
| 15 |
+
| **Heuristic + LLM tiebreak** (openrouter gpt-oss-120b) | **0.729** | `goods_not_received_easy` / `fraud_signal_ambiguity` (0.968) | `generated_hard_s53` (0.440) | 7 (7 ✓ / 0 ✗) |
|
| 16 |
+
|
| 17 |
+
**Key signal:** the bad policy vs. heuristic delta is **0.525** (0.724 vs. 0.199, about 264% higher in relative terms).
|
| 18 |
+
The `Gate(CaseAbandonedRubric)` wrapper around the per-case `WeightedSum` means a case left
|
| 19 |
+
unresolved past its deadline hard-zeros: a lazy concede-everything agent cannot game the score,
|
| 20 |
+
and a correct agent cannot trivially saturate it on hard tasks. The LLM-assisted run now edges
|
| 21 |
+
ahead of the pure heuristic (+0.005) after the v1.1 prompt and `_obvious_next_action` upgrades;
|
| 22 |
+
the LLM is invoked only **7 times** across the 10-task run (down from 19 in v1) because
|
| 23 |
+
deterministic workflow states are now dispatched without a model call.
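
The short-circuit idea can be sketched roughly as follows. This is a minimal illustration, not the repository's `_obvious_next_action`: the `State` class, `obvious_next_action`, and `choose_action` are hypothetical names, and the only behavior taken from the text is "return the heuristic's pick without a model call when the state has a single candidate or the step is deterministic housekeeping".

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Action types the text lists as deterministic housekeeping.
DETERMINISTIC = {
    "add_evidence", "remove_evidence", "submit_representment",
    "set_strategy", "resolve_case",
}


@dataclass
class State:
    # Hypothetical: candidate actions ranked by the heuristic, best first.
    candidates: list = field(default_factory=list)


def obvious_next_action(state: State) -> Optional[dict]:
    """Return an action when no model call is needed, else None."""
    if not state.candidates:
        return None
    top = state.candidates[0]
    if len(state.candidates) == 1 or top["action_type"] in DETERMINISTIC:
        return top
    return None


def choose_action(state: State, ask_llm: Callable[[State], dict]):
    """Return (action, used_llm); the model is consulted only on real branches."""
    action = obvious_next_action(state)
    if action is not None:
        return action, False
    return ask_llm(state), True
```

Counting how often `used_llm` is true over a run would give a figure comparable to the provider-call counts reported here, assuming the real runner counts calls the same way.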
|
| 24 |
|
| 25 |
## Score Curve by Difficulty
|
| 26 |
|
| 27 |
| Difficulty | Task count | Heuristic avg | LLM avg | Bad avg | Target band | Status |
|
| 28 |
| --- | --- | --- | --- | --- | --- | --- |
|
| 29 |
+
| easy | 3 | 0.964 | 0.964 | 0.323 | ≥ 0.90 | ✓ |
|
| 30 |
+
| medium | 2 | 0.755 | 0.755 | 0.278 | 0.50 – 0.85 | ✓ |
|
| 31 |
+
| hard | 3 | 0.635 | 0.651 | 0.113 | 0.50 – 0.75 | ✓ |
|
| 32 |
+
| nightmare | 2 | 0.466 | 0.466 | 0.065 | ≤ 0.55 | ✓ |
|
| 33 |
|
| 34 |
Observations:
|
| 35 |
+
- The LLM-assisted run now **matches or narrowly beats** the heuristic on every difficulty band
|
| 36 |
+
(overall +0.005). The old v1 regression, where the LLM dropped 0.56 on `fraud_signal_ambiguity`
|
| 37 |
+
and 0.29 on `generated_medium_s99`, was caused by the model picking a concede strategy over
|
| 38 |
+
contest at `set_strategy` time. `_obvious_next_action` now short-circuits all strategy picks
|
| 39 |
+
so the heuristic-derived strategy is used directly, and the prompt explicitly lists the
|
| 40 |
+
reason-code → optimal-strategy mapping for the remaining decision points. Provider call count
|
| 41 |
+
fell from 19 to 7 because deterministic housekeeping (add_evidence, remove_evidence,
|
| 42 |
+
submit_representment, set_strategy, resolve_case) is now bypassed entirely.
|
| 43 |
+
- The LLM's remaining upside is on `queue_optimization_hard` (+0.049 over heuristic), where the
|
| 44 |
+
queue-triage branching is genuine and the heuristic's fixed priority order leaves marginal
|
| 45 |
+
value on the table.
|
| 46 |
+
- Nightmare tasks cluster around **0.47** for the heuristic because the 15-step budget collides
|
| 47 |
+
with 5-case portfolios that have deadline_step=3–5 per case. Missed deadlines that were
|
| 48 |
+
*attempted* still land in the weighted sum (with 0 on the deadline dimension and ~0.55 from
|
| 49 |
+
the other 85%); truly abandoned cases are zeroed by the `Gate(CaseAbandonedRubric)` wrapper.
|
| 50 |
+
Not a scoring artifact: the bad-policy run shows the same tasks at ~0.065.
|
| 51 |
- The deadline `Gate` is the v1 upgrade over a flat weighted sum: a case never even attempted by
|
| 52 |
+
the deadline collapses completely, while a case resolved late still earns dimensional credit
|
| 53 |
+
for evidence, strategy, and packet quality. This matches real chargeback operations: a missed
|
| 54 |
representment is "case forfeit," while a late one takes a penalty but is still scored on what
|
| 55 |
the merchant tried to do.
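
The difference between a late case and an abandoned one can be sketched as below. This is a hypothetical stand-in for the `Gate(CaseAbandonedRubric)` over `WeightedSum` composition, not the rubric classes themselves: the function names are invented for illustration, and the two-entry weight table is an assumption that only mirrors the 15% / 85% split implied by the observations above.

```python
def weighted_sum(scores: dict, weights: dict) -> float:
    """Plain weighted sum over rubric dimensions (weights assumed to sum to 1.0)."""
    return sum(weights[name] * scores.get(name, 0.0) for name in weights)


def gated_case_score(scores: dict, weights: dict, *, attempted: bool) -> float:
    """Hard-zero a case that was never attempted; otherwise score it normally."""
    if not attempted:
        return 0.0
    return weighted_sum(scores, weights)


# Illustrative weights only: 15% on the deadline dimension, 85% on everything else.
weights = {"deadline": 0.15, "other": 0.85}
scores = {"deadline": 0.0, "other": 0.65}  # deadline missed, remaining work decent

late = gated_case_score(scores, weights, attempted=True)
abandoned = gated_case_score(scores, weights, attempted=False)
print(f"late={late:.4f} abandoned={abandoned:.4f}")
```

With these made-up inputs the late case keeps the credit from its non-deadline dimensions while the abandoned case collapses to zero, which is the distinction the gate is meant to enforce.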
|
| 56 |
|
|
|
|
| 59 |
| Task ID | Difficulty | Cases | Heuristic | H steps | LLM | LLM steps | Bad | Bad steps |
|
| 60 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|
| 61 |
| goods_not_received_easy | easy | 1 | 0.968 | 6 | 0.968 | 6 | 0.280 | 3 |
|
| 62 |
+
| fraud_signal_ambiguity | easy | 1 | 0.968 | 7 | 0.968 | 7 | 0.280 | 3 |
|
| 63 |
| generated_easy_s42 | easy | 1 | 0.958 | 7 | 0.958 | 7 | 0.408 | 3 |
|
| 64 |
| generated_medium_s17 | medium | 2 | 0.809 | 10 | 0.809 | 10 | 0.114 | 12 |
|
| 65 |
+
| generated_medium_s99 | medium | 2 | 0.701 | 9 | 0.701 | 9 | 0.442 | 12 |
|
| 66 |
| queue_optimization_hard | hard | 3 | 0.802 | 12 | 0.850 | 11 | 0.129 | 15 |
|
| 67 |
+
| generated_hard_s7 | hard | 2 | 0.663 | 5 | 0.663 | 5 | 0.120 | 12 |
|
| 68 |
+
| generated_hard_s53 | hard | 3 | 0.440 | 6 | 0.440 | 6 | 0.089 | 15 |
|
| 69 |
+
| generated_nightmare_s31 | nightmare | 5 | 0.486 | 15 | 0.486 | 15 | 0.077 | 15 |
|
| 70 |
| generated_nightmare_s77 | nightmare | 5 | 0.445 | 15 | 0.445 | 15 | 0.053 | 15 |
|
| 71 |
+
| **Average** | | | **0.724** | 9.2 | **0.729** | 9.1 | **0.199** | 10.5 |
|
| 72 |
+
|
| 73 |
+
## Multi-seed Stress Grid (7 seeds × 4 difficulties)
|
| 74 |
+
|
| 75 |
+
Running the heuristic and bad-policy agents across seven generator seeds per difficulty (seeds
|
| 76 |
+
7, 17, 31, 42, 53, 77, 99) gives the statistically defensible version of the headline numbers.
|
| 77 |
+
All runs are fully offline; no provider calls are involved.
|
| 78 |
+
|
| 79 |
+
| Difficulty | n | Heuristic mean ± std | Bad mean ± std |
|
| 80 |
+
| --- | --- | --- | --- |
|
| 81 |
+
| easy | 7 | 0.9696 ± 0.014 | 0.3346 ± 0.068 |
|
| 82 |
+
| medium | 7 | 0.8411 ± 0.089 | 0.4369 ± 0.238 |
|
| 83 |
+
| hard | 7 | 0.6245 ± 0.151 | 0.1299 ± 0.047 |
|
| 84 |
+
| nightmare | 7 | 0.4121 ± 0.079 | 0.0635 ± 0.010 |
|
| 85 |
+
| **OVERALL** | **28** | **0.7118 ± 0.235** | **0.2412 ± 0.194** |
|
| 86 |
|
| 87 |
+
Observations:
|
| 88 |
+
- Heuristic score decreases cleanly and monotonically with difficulty: 0.97 → 0.84 → 0.62 →
|
| 89 |
+
0.41. The difficulty gradient is real, not a labeling artifact.
|
| 90 |
+
- Nightmare std is tight (0.079, second only to easy at 0.014) because every nightmare task is constrained by the
|
| 91 |
+
same step budget vs. case count collision. Hard is the widest (0.151) because case counts
|
| 92 |
+
vary from 2 to 3 across seeds.
|
| 93 |
+
- Bad policy shows wide variance on medium (±0.238) because some medium seeds generate
|
| 94 |
+
concede-optimal templates (credit_not_processed, duplicate_processing) where
|
| 95 |
+
concede-everything is trivially correct, which is exactly the expected behavior of a discriminating
|
| 96 |
+
rubric on a mixed task distribution.
|
| 97 |
+
- Overall delta (heuristic − bad) across 28 runs: **0.4706**. The headline 10-task catalog
|
| 98 |
+
delta (0.525) is within 1σ of the multi-seed delta, so the fixed-seed headline is not a
|
| 99 |
+
cherry-picked result.
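
The overall delta can be re-derived from the per-difficulty means in the table above. The sketch below only repeats that arithmetic; which standard deviation the "within 1σ" claim refers to is not stated, so comparing the gap against the smaller of the two reported overall stds (0.194) is an assumption made here.

```python
# Per-difficulty means copied from the multi-seed table (7 seeds each).
heuristic = {"easy": 0.9696, "medium": 0.8411, "hard": 0.6245, "nightmare": 0.4121}
bad = {"easy": 0.3346, "medium": 0.4369, "hard": 0.1299, "nightmare": 0.0635}


def overall(means: dict) -> float:
    # Every difficulty has the same n, so the mean of means equals the pooled mean.
    return sum(means.values()) / len(means)


multi_seed_delta = overall(heuristic) - overall(bad)
headline_delta = 0.724 - 0.199
gap = abs(headline_delta - multi_seed_delta)

print(f"multi-seed delta = {multi_seed_delta:.4f}")
print(f"headline delta   = {headline_delta:.3f}")
print(f"gap              = {gap:.4f} (compare with the smaller overall std, 0.194)")
```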
|
| 100 |
|
| 101 |
## Rubric Breakdown (single-case sanity check)
|
| 102 |
|
| 103 |
For `goods_not_received_easy` under the heuristic, the 7-dimension breakdown from
|
| 104 |
+
`ChargebackOpsEpisodeRubric` (weights sum to 1.0, Gate passes because the case was resolved
|
| 105 |
+
before step 8):
|
| 106 |
|
| 107 |
| Dimension | Weight | Score | Weighted contribution |
|
| 108 |
| --- | --- | --- | --- |
|
|
|
|
| 148 |
# Activate the project's venv
|
| 149 |
source ~/python/bin/activate
|
| 150 |
|
| 151 |
+
# 1. Headline 10-task run (heuristic + bad policy, no network)
|
| 152 |
python - <<'PY'
|
| 153 |
from evaluation.agent_brutal_audit import run_episode
|
| 154 |
from scenarios.simulation import list_tasks
|
|
|
|
| 158 |
    print(f"{t.task_id:32s} heur={h['score']:.4f} bad={b['score']:.4f}")
|
| 159 |
PY
|
| 160 |
|
| 161 |
+
# 2. Multi-seed stress grid (28 runs across 7 seeds × 4 difficulties, no network)
|
| 162 |
+
python - <<'PY'
|
| 163 |
+
from statistics import mean, stdev
|
| 164 |
+
from evaluation.agent_brutal_audit import run_episode
|
| 165 |
+
for d in ("easy","medium","hard","nightmare"):
|
| 166 |
+
    hs, bs = [], []
|
| 167 |
+
    for s in (7, 17, 31, 42, 53, 77, 99):
|
| 168 |
+
        hs.append(run_episode(f"generated_{d}_s{s}", policy='heuristic')['score'])
|
| 169 |
+
        bs.append(run_episode(f"generated_{d}_s{s}", policy='bad')['score'])
|
| 170 |
+
    print(f"{d:10s} heur={mean(hs):.4f}±{stdev(hs):.4f} bad={mean(bs):.4f}±{stdev(bs):.4f}")
|
| 171 |
+
PY
|
| 172 |
+
|
| 173 |
+
# 3. LLM tiebreak run (requires OPENROUTER_API_KEY in .env)
|
| 174 |
python -m runners.baseline_runner | tee /tmp/baseline_run.json
|
| 175 |
```
|
| 176 |
|
|
|
|
| 178 |
|
| 179 |
- Python 3.12.13, pytest 7.4.3
|
| 180 |
- `openenv-core==0.2.3`, `pydantic==2.12.5`, `openai==2.31.0`
|
| 181 |
+
- Provider: OpenRouter (model `openai/gpt-oss-120b`), all 7 decision calls succeeded, zero retries
|
| 182 |
+
- Average end-to-end episode wall-clock: ~0.8s (heuristic), ~1.8s (with LLM tiebreak; down from
|
| 183 |
+
~2.5s in v1 because `_obvious_next_action` bypasses most model calls)
|
| 184 |
- Full test suite: 22/22 passing, `openenv validate .` clean, Docker build clean
|
| 185 |
|
| 186 |
## What This Table Does Not Show
|
| 187 |
|
|
|
|
|
|
|
|
|
|
| 188 |
- **Per-dimension score dispersion across the full catalog**: the table above shows one task's
|
| 189 |
+
breakdown. An introspection demo command exists for walking `env.rubric.named_rubrics()` on
|
| 190 |
+
any run: see `README.md` → "Rubric introspection".
|
| 191 |
- **RL training curves**: ChargebackOps is a ready environment, not a trained agent. Anyone
|
| 192 |
+
wiring this into Gym/SB3/CleanRL is expected to produce training curves separately; the
|
| 193 |
+
rubric tree is the machinery they would hook into for credit assignment.
|
|
@@ -0,0 +1,310 @@
| 1 |
+
# Running the ChargebackOps Agent
|
| 2 |
+
|
| 3 |
+
End-to-end instructions for running the ChargebackOps environment and its baseline agent:
|
| 4 |
+
offline (heuristic only), with an LLM tiebreak, against a single task, across the whole
|
| 5 |
+
benchmark, or as a live server. If you just want the numbers, see
|
| 6 |
+
[`docs/RESULTS.md`](RESULTS.md). If you want to understand the agent internals, see
|
| 7 |
+
[`AGENT.md`](../AGENT.md).
|
| 8 |
+
|
| 9 |
+
---
|
| 10 |
+
|
| 11 |
+
## 1. Prerequisites
|
| 12 |
+
|
| 13 |
+
- **Python 3.12** (required: `tomllib` and `Rubric` type hints assume 3.12+)
|
| 14 |
+
- **git** (for cloning)
|
| 15 |
+
- **Docker** (optional; only if you want the containerized server)
|
| 16 |
+
|
| 17 |
+
Clone the repo if you haven't already:
|
| 18 |
+
|
| 19 |
+
```bash
|
| 20 |
+
git clone https://github.com/MitudruDutta/chargebackops.git
|
| 21 |
+
cd chargebackops
|
| 22 |
+
```
|
| 23 |
+
|
| 24 |
+
Create or reuse a virtual environment, install the project in editable mode with dev extras:
|
| 25 |
+
|
| 26 |
+
```bash
|
| 27 |
+
source ~/python/bin/activate # or: python3.12 -m venv .venv && source .venv/bin/activate
|
| 28 |
+
pip install -e ".[dev]"
|
| 29 |
+
```
|
| 30 |
+
|
| 31 |
+
This installs `openenv-core`, `pydantic`, `openai`, `fastapi`, `uvicorn`, `gradio`, and the test
|
| 32 |
+
harness. Nothing else is required for offline runs.
|
| 33 |
+
|
| 34 |
+
Verify the install:
|
| 35 |
+
|
| 36 |
+
```bash
|
| 37 |
+
pytest -q tests # expect: 22 passed
|
| 38 |
+
openenv validate . # expect: Ready for multi-mode deployment
|
| 39 |
+
```
|
| 40 |
+
|
| 41 |
+
---
|
| 42 |
+
|
| 43 |
+
## 2. Configure the environment
|
| 44 |
+
|
| 45 |
+
Copy the template and edit it:
|
| 46 |
+
|
| 47 |
+
```bash
|
| 48 |
+
cp .env.example .env
|
| 49 |
+
```
|
| 50 |
+
|
| 51 |
+
### 2a. Offline mode (no API keys)
|
| 52 |
+
|
| 53 |
+
You can run the heuristic and bad-policy agents with **no keys at all**. The runner
|
| 54 |
+
automatically falls back to the heuristic when no provider is configured. Skip to section 3.
|
| 55 |
+
|
| 56 |
+
### 2b. LLM tiebreak mode
|
| 57 |
+
|
| 58 |
+
Fill in **one** of the provider blocks in `.env`. The runner auto-detects which provider to
|
| 59 |
+
use based on `BASELINE_PROVIDER`:
|
| 60 |
+
|
| 61 |
+
**OpenRouter (recommended; free tier available, used in reference results):**
|
| 62 |
+
|
| 63 |
+
```env
|
| 64 |
+
BASELINE_PROVIDER=openrouter
|
| 65 |
+
BASELINE_MODEL=openai/gpt-oss-120b
|
| 66 |
+
OPENROUTER_API_KEY=sk-or-v1-...
|
| 67 |
+
OPENROUTER_APP_TITLE=ChargebackOps
|
| 68 |
+
```
|
| 69 |
+
|
| 70 |
+
**Groq (fastest, free tier):**
|
| 71 |
+
|
| 72 |
+
```env
|
| 73 |
+
BASELINE_PROVIDER=groq
|
| 74 |
+
BASELINE_MODEL=llama-3.3-70b-versatile
|
| 75 |
+
GROQ_API_KEY=gsk_...
|
| 76 |
+
```
|
| 77 |
+
|
| 78 |
+
**OpenAI:**
|
| 79 |
+
|
| 80 |
+
```env
|
| 81 |
+
BASELINE_PROVIDER=openai
|
| 82 |
+
BASELINE_MODEL=gpt-4o-mini
|
| 83 |
+
OPENAI_API_KEY=sk-...
|
| 84 |
+
```
|
| 85 |
+
|
| 86 |
+
**Google Gemini:**
|
| 87 |
+
|
| 88 |
+
```env
|
| 89 |
+
BASELINE_PROVIDER=google
|
| 90 |
+
BASELINE_MODEL=gemini-2.0-flash-exp
|
| 91 |
+
GOOGLE_API_KEY=AI...
|
| 92 |
+
```
|
| 93 |
+
|
| 94 |
+
The fallback chain is: **primary → OpenRouter → Google → Groq → Heuristic**. If the primary
|
| 95 |
+
provider times out or 429s, the runner automatically walks the chain. Set `STRICT_LLM_MODE=1`
|
| 96 |
+
if you want failures to surface instead of silently falling back to the heuristic.
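
A rough sketch of how such a chain could be walked is shown below. It is not the runner's code: `build_chain`, `decide`, and the callables are hypothetical, and the handling of `STRICT_LLM_MODE` (raise the last provider error once the whole chain is exhausted) is an assumption, since the text does not say at which point strict mode surfaces the failure.

```python
import os
from typing import Callable, Dict, List, Optional

# Order taken from the text above; the provider callables are placeholders.
FALLBACK_ORDER = ["openrouter", "google", "groq"]


def build_chain(primary: str) -> List[str]:
    """Primary first, then the remaining providers in the documented order."""
    return [primary] + [p for p in FALLBACK_ORDER if p != primary]


def decide(
    primary: str,
    providers: Dict[str, Callable[[], str]],
    heuristic: Callable[[], str],
    strict: Optional[bool] = None,
) -> str:
    if strict is None:
        strict = os.environ.get("STRICT_LLM_MODE") == "1"
    last_error: Optional[Exception] = None
    for name in build_chain(primary):
        call = providers.get(name)
        if call is None:  # provider not configured, skip it
            continue
        try:
            return call()
        except Exception as exc:  # e.g. a timeout or an HTTP 429
            last_error = exc
    if strict and last_error is not None:
        raise last_error  # assumption: strict mode surfaces the last failure
    return heuristic()
```

Passing `strict` explicitly keeps the function testable; leaving it as `None` falls back to reading the environment variable.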
|
| 97 |
+
|
| 98 |
+
---
|
| 99 |
+
|
| 100 |
+
## 3. Run the agent
|
| 101 |
+
|
| 102 |
+
### 3a. Against a single task
|
| 103 |
+
|
| 104 |
+
```bash
|
| 105 |
+
source ~/python/bin/activate
|
| 106 |
+
python - <<'PY'
|
| 107 |
+
from evaluation.agent_brutal_audit import run_episode
|
| 108 |
+
result = run_episode("goods_not_received_easy", policy="heuristic")
|
| 109 |
+
print(f"score = {result['score']:.4f}")
|
| 110 |
+
print(f"steps = {result['steps']}")
|
| 111 |
+
print(f"summary: {result['summary']}")
|
| 112 |
+
PY
|
| 113 |
+
```
|
| 114 |
+
|
| 115 |
+
Available policies: `"heuristic"` (rule-based, no LLM), `"bad"` (concede-everything baseline).
|
| 116 |
+
Any built-in or generated task id works, e.g. `"generated_nightmare_s31"`,
|
| 117 |
+
`"fraud_signal_ambiguity"`, `"queue_optimization_hard"`.
|
| 118 |
+
|
| 119 |
+
### 3b. Across the full 10-task headline benchmark (offline)
|
| 120 |
+
|
| 121 |
+
```bash
|
| 122 |
+
python - <<'PY'
|
| 123 |
+
from evaluation.agent_brutal_audit import run_episode
|
| 124 |
+
from scenarios.simulation import list_tasks
|
| 125 |
+
for t in list_tasks():
|
| 126 |
+
    h = run_episode(t.task_id, policy='heuristic')
|
| 127 |
+
    b = run_episode(t.task_id, policy='bad')
|
| 128 |
+
    print(f"{t.task_id:32s} heur={h['score']:.4f} bad={b['score']:.4f}")
|
| 129 |
+
PY
|
| 130 |
+
```
|
| 131 |
+
|
| 132 |
+
Expect the heuristic to average **0.724** and the bad policy to average **0.199** (±1e-3 for
|
| 133 |
+
float rounding). Total wall-clock: ~8 seconds, zero provider calls.
|
| 134 |
+
|
| 135 |
+
### 3c. Across the full benchmark with an LLM tiebreak
|
| 136 |
+
|
| 137 |
+
Make sure a provider is configured (section 2b), then:
|
| 138 |
+
|
| 139 |
+
```bash
|
| 140 |
+
python -m runners.baseline_runner | tee /tmp/baseline_run.json
|
| 141 |
+
```
|
| 142 |
+
|
| 143 |
+
This writes a JSON report with `task_results`, `average_score`, `provider_calls_attempted`,
|
| 144 |
+
and `provider_calls_succeeded`. Expect **0.729** average and **7 provider calls** on the
|
| 145 |
+
reference setup (OpenRouter + `openai/gpt-oss-120b`).
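
The saved report can be summarized with a few lines of Python. This sketch assumes the file captured by `tee` contains a single JSON object with the four top-level keys named above and nothing else on stdout; the helper names are hypothetical, and `task_results` is only assumed to be a sized collection because its inner layout is not described here.

```python
import json


def load_report(path: str = "/tmp/baseline_run.json") -> dict:
    """Load the JSON report written by the baseline runner."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def summarize(report: dict) -> str:
    """One-line summary built from the documented top-level keys."""
    attempted = report.get("provider_calls_attempted", 0)
    succeeded = report.get("provider_calls_succeeded", 0)
    n_tasks = len(report.get("task_results", []))
    return (
        f"average_score={report['average_score']:.3f} "
        f"provider_calls={succeeded}/{attempted} "
        f"tasks={n_tasks}"
    )
```

Calling `print(summarize(load_report()))` after a run gives a quick comparison point against the reference figures.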
|
| 146 |
+
|
| 147 |
+
### 3d. Multi-seed stress grid (28 runs, fully offline)
|
| 148 |
+
|
| 149 |
+
```bash
|
| 150 |
+
python - <<'PY'
|
| 151 |
+
from statistics import mean, stdev
|
| 152 |
+
from evaluation.agent_brutal_audit import run_episode
|
| 153 |
+
for d in ("easy","medium","hard","nightmare"):
|
| 154 |
+
    hs, bs = [], []
|
| 155 |
+
    for s in (7, 17, 31, 42, 53, 77, 99):
|
| 156 |
+
        hs.append(run_episode(f"generated_{d}_s{s}", policy='heuristic')['score'])
|
| 157 |
+
        bs.append(run_episode(f"generated_{d}_s{s}", policy='bad')['score'])
|
| 158 |
+
    print(f"{d:10s} heur={mean(hs):.4f}±{stdev(hs):.4f} bad={mean(bs):.4f}±{stdev(bs):.4f}")
|
| 159 |
+
PY
|
| 160 |
+
```
|
| 161 |
+
|
| 162 |
+
Expected: heuristic **0.712 ± 0.235**, bad policy **0.241 ± 0.194** across 28 runs.
|
| 163 |
+
|
| 164 |
+
### 3e. Custom inference contract (challenge submission)
|
| 165 |
+
|
| 166 |
+
`inference.py` is the submission entry point used by the hackathon harness. It reads
|
| 167 |
+
`API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN` from the environment and returns decisions via an
|
| 168 |
+
OpenAI-compatible client.
|
| 169 |
+
|
| 170 |
+
```bash
|
| 171 |
+
API_BASE_URL=https://openrouter.ai/api/v1 \
|
| 172 |
+
MODEL_NAME=openai/gpt-oss-120b \
|
| 173 |
+
HF_TOKEN=sk-or-v1-... \
|
| 174 |
+
python -m runners.inference
|
| 175 |
+
```
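
For orientation, reading those three variables might look like the sketch below. It is not the contents of `inference.py`: `InferenceConfig` and `load_config` are hypothetical names, and the only facts taken from the text are the variable names and the use of an OpenAI-compatible client.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceConfig:
    api_base_url: str
    model_name: str
    api_key: str  # supplied through HF_TOKEN, as in the command above


def load_config(env=os.environ) -> InferenceConfig:
    """Read the three documented variables and fail loudly if one is missing."""
    required = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return InferenceConfig(env["API_BASE_URL"], env["MODEL_NAME"], env["HF_TOKEN"])


# With the `openai` package installed, a client could then be created as:
#   from openai import OpenAI
#   cfg = load_config()
#   client = OpenAI(base_url=cfg.api_base_url, api_key=cfg.api_key)
```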
|
| 176 |
+
|
| 177 |
+
---
|
| 178 |
+
|
| 179 |
+
## 4. Run the server (FastAPI + Gradio demo)
|
| 180 |
+
|
| 181 |
+
The server exposes the environment via HTTP for OpenEnv-compatible clients and a live demo
|
| 182 |
+
at `/demo`.
|
| 183 |
+
|
| 184 |
+
### 4a. Local
|
| 185 |
+
|
| 186 |
+
```bash
|
| 187 |
+
source ~/python/bin/activate
|
| 188 |
+
uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
|
| 189 |
+
```
|
| 190 |
+
|
| 191 |
+
Endpoints:
|
| 192 |
+
|
| 193 |
+
| Path | Method | Purpose |
|---|---|---|
| `/reset` | POST | Start an episode (pass `task_id` in JSON body) |
| `/step` | POST | Take one action |
| `/state` | GET | Current observation + progress |
| `/tasks` | GET | Task catalog |
| `/demo` | GET | Gradio live demo: click-through playback |
| `/baseline` | GET/POST | Run the heuristic agent headlessly |
| `/grader` | GET/POST | Score a completed episode |
| `/health` | GET | Health check |
| `/docs` | GET | OpenAPI / Swagger UI |
|
| 204 |
+
|
| 205 |
+
Example `curl` flow:
|
| 206 |
+
|
| 207 |
+
```bash
|
| 208 |
+
curl -s -X POST localhost:8000/reset -H 'content-type: application/json' \
|
| 209 |
+
-d '{"task_id":"goods_not_received_easy"}' | jq '.task_id, .steps_remaining'
|
| 210 |
+
|
| 211 |
+
curl -s -X POST localhost:8000/step -H 'content-type: application/json' \
|
| 212 |
+
-d '{"action":{"action_type":"select_case","case_id":"CB-E1"}}' | jq '.reward'
|
| 213 |
+
```
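
The same two calls can be made from Python with only the standard library. This is a sketch that mirrors the `curl` payloads above. It assumes the local server from 4a is listening on port 8000, and `run_flow` is a hypothetical helper name.

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000"  # assumption: local server started as in 4a


def build_post(path: str, body: dict) -> request.Request:
    """Build a JSON POST request without sending it."""
    return request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )


def send(req: request.Request) -> dict:
    """Send a request and decode the JSON response body."""
    with request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


reset_req = build_post("/reset", {"task_id": "goods_not_received_easy"})
step_req = build_post(
    "/step", {"action": {"action_type": "select_case", "case_id": "CB-E1"}}
)


def run_flow() -> None:
    """Reset, take one step, and print the same fields the curl example extracts."""
    obs = send(reset_req)
    print(obs.get("task_id"), obs.get("steps_remaining"))
    print(send(step_req).get("reward"))
```

Call `run_flow()` only while the server is up. Building the requests separately from sending them keeps the payloads easy to inspect offline.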
|
| 214 |
+
|
| 215 |
+
### 4b. Docker
|
| 216 |
+
|
| 217 |
+
```bash
|
| 218 |
+
docker build -t chargebackops .
|
| 219 |
+
docker run --rm -p 8000:8000 --env-file .env chargebackops
|
| 220 |
+
```
|
| 221 |
+
|
| 222 |
+
The Dockerfile is layered so source edits don't re-run `pip install`: first build takes ~40s,
|
| 223 |
+
edits after that rebuild in ~6s.
|
| 224 |
+
|
| 225 |
+
### 4c. Hugging Face Space
|
| 226 |
+
|
| 227 |
+
The repo doubles as a Hugging Face Space (see the frontmatter at the top of `README.md`). Push
|
| 228 |
+
to the `hf` remote and the space rebuilds automatically:
|
| 229 |
+
|
| 230 |
+
```bash
|
| 231 |
+
git push hf main
|
| 232 |
+
```
|
| 233 |
+
|
| 234 |
+
---
|
| 235 |
+
|
| 236 |
+
## 5. Inspect the rubric tree
|
| 237 |
+
|
| 238 |
+
Every scoring dimension is an OpenEnv `Rubric` subclass. Walk the composition tree on any
|
| 239 |
+
live environment:
|
| 240 |
+
|
| 241 |
+
```bash
python - <<'PY'
from server.chargeback_ops_environment import ChargebackOpsEnvironment
env = ChargebackOpsEnvironment()
for name, r in env.rubric.named_rubrics():
    print(f"{name}: {type(r).__name__}")
PY
```
|
| 249 |
+
|
| 250 |
+
Expected output (11 named children):
|
| 251 |
+
|
| 252 |
+
```
|
| 253 |
+
case_rubric: CaseRubric
|
| 254 |
+
case_rubric.aggregator: WeightedSum
|
| 255 |
+
case_rubric.aggregator.rubric_0: StrategyCorrectnessRubric
|
| 256 |
+
case_rubric.aggregator.rubric_1: EvidenceQualityRubric
|
| 257 |
+
case_rubric.aggregator.rubric_2: PacketValidityRubric
|
| 258 |
+
case_rubric.aggregator.rubric_3: DeadlineComplianceRubric
|
| 259 |
+
case_rubric.aggregator.rubric_4: EfficiencyRubric
|
| 260 |
+
case_rubric.aggregator.rubric_5: OutcomeQualityRubric
|
| 261 |
+
case_rubric.aggregator.rubric_6: NoteQualityRubric
|
| 262 |
+
case_rubric.deadline_gate: Gate
|
| 263 |
+
case_rubric.deadline_gate.rubric: CaseAbandonedRubric
|
| 264 |
+
```
|
| 265 |
+
|
| 266 |
+
After a forward pass, each child exposes `last_score`; this is the introspection path an RL
|
| 267 |
+
trainer hooks into for credit assignment.
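
A small helper can gather those per-child scores into one mapping. The sketch below uses only `named_rubrics()` and `last_score`, both mentioned above. `StubRubric` is a hypothetical stand-in so the example runs without the environment installed, and its scores are placeholders.

```python
from typing import Any, Optional


def collect_last_scores(root: Any) -> dict[str, Optional[float]]:
    """Map each named child rubric to its latest score (None if it has none)."""
    return {
        name: getattr(child, "last_score", None)
        for name, child in root.named_rubrics()
    }


class StubRubric:
    """Hypothetical stand-in for a Rubric; not part of OpenEnv."""

    def __init__(self, last_score=None, children=None):
        self.last_score = last_score
        self.children = children or {}

    def named_rubrics(self, prefix=""):
        # Yield (dotted_path, rubric) pairs depth-first, like the listing above.
        for name, child in self.children.items():
            path = f"{prefix}.{name}" if prefix else name
            yield path, child
            yield from child.named_rubrics(path)


tree = StubRubric(children={
    "case_rubric": StubRubric(last_score=0.5, children={
        "aggregator": StubRubric(last_score=0.5),
    }),
})
scores = collect_last_scores(tree)
print(scores)
```

With a real environment, passing `env.rubric` in place of `tree` after a forward pass should give one entry per line of the expected output above.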
|
| 268 |
+
|
| 269 |
+
---
|
| 270 |
+
|
| 271 |
+
## 6. Troubleshooting
|
| 272 |
+
|
| 273 |
+
**`openenv validate .` fails.** Check `openenv.yaml` is present at repo root and your venv has
|
| 274 |
+
`openenv-core>=0.2.3` installed.
|
| 275 |
+
|
| 276 |
+
**Provider calls all fail / score drops to heuristic.** Run `python -m runners.baseline_runner`
|
| 277 |
+
and inspect `provider_errors`. Common causes: expired API key, wrong `BASELINE_MODEL` slug for
|
| 278 |
+
the provider, or rate limits (the runner retries twice, then falls back). Set
|
| 279 |
+
`BASELINE_REQUEST_TIMEOUT_SECONDS=30` if the provider is slow.
|
| 280 |
+
|
| 281 |
+
**`ImportError: attempted relative import`.** Always run commands from the repo root with the
|
| 282 |
+
venv activated. Use `python -m runners.baseline_runner`, not `python runners/baseline_runner.py`.
|
| 283 |
+
|
| 284 |
+
**Docker build is slow on every edit.** You probably edited `pyproject.toml`; the deps layer
|
| 285 |
+
only caches when that file is unchanged. If you edit source only, rebuilds should be ~6s.
|
| 286 |
+
|
| 287 |
+
**Scores differ from `docs/RESULTS.md`.** If you pass different seeds or LLM providers you
|
| 288 |
+
will get different numbers. The reference numbers are captured on the fixed 10-task catalog
|
| 289 |
+
defined by `scenarios.simulation.list_tasks()` plus OpenRouter `openai/gpt-oss-120b`. Anything
|
| 290 |
+
else is not directly comparable.
|
| 291 |
+
|
| 292 |
+
---
|
| 293 |
+
|
| 294 |
+
## 7. Minimal "does it work?" smoke test
|
| 295 |
+
|
| 296 |
+
One command to verify everything is wired up correctly:
|
| 297 |
+
|
| 298 |
+
```bash
|
| 299 |
+
source ~/python/bin/activate && \
|
| 300 |
+
pytest -q tests && \
|
| 301 |
+
openenv validate . && \
|
| 302 |
+
python -c "
|
| 303 |
+
from evaluation.agent_brutal_audit import run_episode
|
| 304 |
+
r = run_episode('goods_not_received_easy', policy='heuristic')
|
| 305 |
+
assert abs(r['score'] - 0.9675) < 1e-3, r['score']
|
| 306 |
+
print('smoke OK, score =', r['score'])
|
| 307 |
+
"
|
| 308 |
+
```
|
| 309 |
+
|
| 310 |
+
If that prints `smoke OK, score = 0.9675`, the agent runs cleanly and the rubric math is stable.
|
|
@@ -1043,6 +1043,10 @@ def _obvious_next_action(
|
|
| 1043 |
if not candidates:
|
| 1044 |
return None
|
| 1045 |
|
| 1046 |
first = candidates[0]
|
| 1047 |
visible_case = observation.get("visible_case")
|
| 1048 |
queue = observation["queue"]
|
|
@@ -1065,9 +1069,18 @@ def _obvious_next_action(
|
|
| 1065 |
if visible_case["status"] != "open":
|
| 1066 |
return first if first.action.action_type == "select_case" else None
|
| 1067 |
|
| 1068 |
if first.action.action_type in {
|
| 1069 |
"retrieve_policy",
|
| 1070 |
"add_evidence",
|
| 1071 |
"submit_representment",
|
| 1072 |
"resolve_case",
|
| 1073 |
}:
|
|
@@ -1078,19 +1091,6 @@ def _obvious_next_action(
|
|
| 1078 |
if visible_case.get("policy") is None or current_strategy in {None, "contest"}:
|
| 1079 |
return first
|
| 1080 |
|
| 1081 |
-
if first.action.action_type == "set_strategy":
|
| 1082 |
-
strategy = first.action.strategy
|
| 1083 |
-
competing_strategies = {
|
| 1084 |
-
candidate.action.strategy
|
| 1085 |
-
for candidate in candidates[1:]
|
| 1086 |
-
if candidate.action.action_type == "set_strategy"
|
| 1087 |
-
}
|
| 1088 |
-
if (
|
| 1089 |
-
strategy in {"accept_chargeback", "issue_refund"}
|
| 1090 |
-
and "contest" not in competing_strategies
|
| 1091 |
-
):
|
| 1092 |
-
return first
|
| 1093 |
-
|
| 1094 |
if first.action.action_type == "select_case":
|
| 1095 |
current_case_id = visible_case["case_id"]
|
| 1096 |
current_deadline = next(
|
|
@@ -1278,9 +1278,22 @@ def _provider_pick(
|
|
| 1278 |
{
|
| 1279 |
"role": "system",
|
| 1280 |
"content": (
|
| 1281 |
-
"You are a merchant chargeback dispute analyst. Pick the single best next action from the
|
| 1282 |
-
"
|
| 1283 |
-
|
| 1284 |
),
|
| 1285 |
},
|
| 1286 |
{"role": "user", "content": payload},
|
|
|
|
| 1043 |
if not candidates:
|
| 1044 |
return None
|
| 1045 |
|
| 1046 |
+
# Single candidate = no decision to make.
|
| 1047 |
+
if len(candidates) == 1:
|
| 1048 |
+
return candidates[0]
|
| 1049 |
+
|
| 1050 |
first = candidates[0]
|
| 1051 |
visible_case = observation.get("visible_case")
|
| 1052 |
queue = observation["queue"]
|
|
|
|
| 1069 |
if visible_case["status"] != "open":
|
| 1070 |
return first if first.action.action_type == "select_case" else None
|
| 1071 |
|
| 1072 |
+
# Strategy selection: the heuristic already derives the optimal strategy
|
| 1073 |
+
# from policy + retrieved evidence. The LLM has no additional signal that
|
| 1074 |
+
# improves this specific call; invoking it here has only caused regressions
|
| 1075 |
+
# on fraud_signal_ambiguity and generated_medium_s99 where the model picks
|
| 1076 |
+
# a concede-style strategy over the correct contest.
|
| 1077 |
+
if first.action.action_type == "set_strategy":
|
| 1078 |
+
return first
|
| 1079 |
+
|
| 1080 |
if first.action.action_type in {
|
| 1081 |
"retrieve_policy",
|
| 1082 |
"add_evidence",
|
| 1083 |
+
"remove_evidence",
|
| 1084 |
"submit_representment",
|
| 1085 |
"resolve_case",
|
| 1086 |
}:
|
|
|
|
| 1091 |
if visible_case.get("policy") is None or current_strategy in {None, "contest"}:
|
| 1092 |
return first
|
| 1093 |
|
| 1094 |
if first.action.action_type == "select_case":
|
| 1095 |
current_case_id = visible_case["case_id"]
|
| 1096 |
current_deadline = next(
|
|
|
|
| 1278 |
{
|
| 1279 |
"role": "system",
|
| 1280 |
"content": (
|
| 1281 |
+
"You are a merchant chargeback dispute analyst. Pick the single best next action from the ordered candidate list. "
|
| 1282 |
+
"The candidates are pre-sorted by a deterministic heuristic; candidate 0 is usually correct. Deviate only when you spot a concrete reason. "
|
| 1283 |
+
"\n"
|
| 1284 |
+
"Reason-code → optimal strategy (follow unless evidence clearly contradicts):\n"
|
| 1285 |
+
" goods_not_received → contest (with order + delivery proof)\n"
|
| 1286 |
+
" fraud_cnp → contest when account linkage exists, otherwise concede\n"
|
| 1287 |
+
" product_not_as_described → contest (with listing + return policy proof)\n"
|
| 1288 |
+
" service_not_provided → contest (with completion log)\n"
|
| 1289 |
+
" credit_not_processed → issue_refund immediately\n"
|
| 1290 |
+
" duplicate_processing → issue_refund immediately\n"
|
| 1291 |
+
"\n"
|
| 1292 |
+
"Priorities: (1) resolve cases whose deadline is 1 step away before anything else, "
|
| 1293 |
+
"(2) prefer the highest-$ open case when budget is tight, "
|
| 1294 |
+
"(3) never attach harmful evidence (AVS/CVV mismatch on fraud_cnp, GPS anomalies on goods_not_received), "
|
| 1295 |
+
"(4) when multiple candidates look equivalent, take candidate 0.\n"
|
| 1296 |
+
'Return only JSON: {"candidate_index": N, "rationale": "brief reason"}'
|
| 1297 |
),
|
| 1298 |
},
|
| 1299 |
{"role": "user", "content": payload},
|
|
@@ -580,7 +580,7 @@ _PRODUCT_NOT_AS_DESCRIBED = _CaseTemplate(
|
|
| 580 |
),
|
| 581 |
policy_requirements=("product listing verification", "return policy documentation"),
|
| 582 |
optimal_strategy="contest",
|
| 583 |
-
acceptable_strategies=(
|
| 584 |
resolution_summary="Contest with listing accuracy proof and return policy documentation.",
|
| 585 |
base_weight=1.0,
|
| 586 |
evidence_blueprints=(
|
|
@@ -890,7 +890,7 @@ _SERVICE_NOT_PROVIDED = _CaseTemplate(
|
|
| 890 |
"customer acknowledgment or scheduling proof",
|
| 891 |
),
|
| 892 |
optimal_strategy="contest",
|
| 893 |
-
acceptable_strategies=(
|
| 894 |
resolution_summary="Contest with service completion proof. The service was delivered as booked.",
|
| 895 |
base_weight=1.0,
|
| 896 |
evidence_blueprints=(
|
|
|
|
| 580 |
),
|
| 581 |
policy_requirements=("product listing verification", "return policy documentation"),
|
| 582 |
optimal_strategy="contest",
|
| 583 |
+
acceptable_strategies=(),
|
| 584 |
resolution_summary="Contest with listing accuracy proof and return policy documentation.",
|
| 585 |
base_weight=1.0,
|
| 586 |
evidence_blueprints=(
|
|
|
|
| 890 |
"customer acknowledgment or scheduling proof",
|
| 891 |
),
|
| 892 |
optimal_strategy="contest",
|
| 893 |
+
acceptable_strategies=(),
|
| 894 |
resolution_summary="Contest with service completion proof. The service was delivered as booked.",
|
| 895 |
base_weight=1.0,
|
| 896 |
evidence_blueprints=(
|
|
@@ -258,7 +258,7 @@ TASKS: dict[str, TaskScenario] = {
|
|
| 258 |
),
|
| 259 |
deadline_step=7,
|
| 260 |
optimal_strategy="contest",
|
| 261 |
-
acceptable_strategies=(
|
| 262 |
policy_guidance=(
|
| 263 |
"For CNP fraud disputes, contest only when you can link the cardholder to the account or device history. "
|
| 264 |
"Do not attach evidence that strengthens the issuer's fraud narrative."
|
|
@@ -268,7 +268,7 @@ TASKS: dict[str, TaskScenario] = {
|
|
| 268 |
"customer account confirmation",
|
| 269 |
),
|
| 270 |
recommended_strategy="contest",
|
| 271 |
-
resolution_summary="Contest
|
| 272 |
weight=1.1,
|
| 273 |
required_evidence_ids=("M1-PRIOR-ORDERS", "M1-ACCOUNT-CHAT"),
|
| 274 |
helpful_evidence_ids=(
|
|
|
|
| 258 |
),
|
| 259 |
deadline_step=7,
|
| 260 |
optimal_strategy="contest",
|
| 261 |
+
acceptable_strategies=(),
|
| 262 |
policy_guidance=(
|
| 263 |
"For CNP fraud disputes, contest only when you can link the cardholder to the account or device history. "
|
| 264 |
"Do not attach evidence that strengthens the issuer's fraud narrative."
|
|
|
|
| 268 |
"customer account confirmation",
|
| 269 |
),
|
| 270 |
recommended_strategy="contest",
|
| 271 |
+
resolution_summary="Contest with strong account-linkage evidence. Conceding this case forfeits defensible revenue.",
|
| 272 |
weight=1.1,
|
| 273 |
required_evidence_ids=("M1-PRIOR-ORDERS", "M1-ACCOUNT-CHAT"),
|
| 274 |
helpful_evidence_ids=(
|