# Reproducibility

This document gives the exact command sequence, pinned versions, expected runtimes, and expected score ranges to reproduce every published number in [`RESULTS.md`](RESULTS.md). Reported numbers are seed-deterministic where stated; cells that depend on stochastic sampling are flagged.

## 1. Pinned dependency stack

The training notebook [`notebooks/train_merchant_agent.ipynb`](../notebooks/train_merchant_agent.ipynb) installs and verifies the following pins. The setup cell asserts each pin and fails loudly if any version slips.

| Package | Version | Why pinned |
|---|---|---|
| torch | 2.10.0+cu128 | Matches torchvision 0.25.0+cu128 and torchaudio 2.10.0+cu128 |
| transformers | 4.51.3 | TRL 0.21.0 was tested against 4.51.x; newer transformers break GRPOTrainer internals |
| trl | 0.21.0 | Provides `GRPOTrainer` with `reward_funcs` list and proper sampling kwarg passing |
| peft | 0.14.0 | Compatible with huggingface-hub 0.26.x; later peft requires hub 1.x |
| tokenizers | 0.21.4 | Required range for transformers 4.51.x |
| huggingface-hub | 0.26.5 | Last 0.x release; peft 0.14 imports paths that moved in hub 1.0 |
| accelerate | 1.0.1 | Compatible with hub 0.26; later accelerate hard-requires hub 1.x |
| openenv-core | ≥0.2.2 | Source of `Environment`, `Rubric`, `WeightedSum`, `Gate` |
| pydantic | ≥2.10 | Used by core models |
| datasets | ≥2.20,<4.0 | Compatible with the pinned transformers + tokenizers |
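The pin assertion described above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual setup cell: `find_pin_mismatches` and `EXACT_PINS` are hypothetical names, and only the exact-version pins from the table are checked (torch and the range pins are left out for brevity).

```python
from importlib import metadata

# Exact pins copied from the table above (torch and range pins omitted).
EXACT_PINS = {
    "transformers": "4.51.3",
    "trl": "0.21.0",
    "peft": "0.14.0",
    "tokenizers": "0.21.4",
    "huggingface-hub": "0.26.5",
    "accelerate": "1.0.1",
}


def find_pin_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that does not match.

    `found` is None when the package is not installed at all.
    """
    mismatches = {}
    for package, expected in pins.items():
        try:
            found = get_version(package)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches
```

Calling `find_pin_mismatches(EXACT_PINS)` right after the install step and raising on a non-empty result would give a comparable fail-loud check outside the notebook.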

## 2. End-to-end reproduction (Colab T4)

### 2.1 Open the notebook

```
https://colab.research.google.com/github/MitudruDutta/ChargeBackOps/blob/main/notebooks/train_merchant_agent.ipynb
```

Connect to a T4 runtime (free tier suffices).

### 2.2 Run cells in order

Each cell is idempotent. The per-cell estimates below sum to about 88 minutes on a free T4, of which roughly 75 minutes are the two training cells and the eval cell.

| Cell | Purpose | Wallclock | VRAM peak |
|---|---|---|---|
| `setup-code` | Pin install + repo clone + asserts | 4 min | 0 GB |
| `patch-code` | sys.path + module-cache flush | 5 sec | 0 GB |
| `model-code` | Load Qwen2.5-3B-Instruct fp16 + LoRA | 3 min | 6.3 GB |
| `sft-data-code` | Generate 4,000 SFT rows + chat-template wrap | 1 min | 6.3 GB |
| `sft-train-code` | Phase A SFT (150 steps) | 17 min | 8.4 GB |
| `merge-code` | Reload SFT, merge into base, attach Phase B LoRA | 2 min | 6.3 GB |
| `grpo-data-code` | Build GRPO state-action dataset with curriculum bias | 1 min | 6.3 GB |
| `grpo-train-code` | Phase B GRPO (200 steps) | 40 min | 11.5 GB |
| `eval-code` | Per-checkpoint eval + plot generation | 18 min | 12.5 GB |
| `diag-code` | Three-task diagnostic rollout | 2 min | 6.3 GB |
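To budget a full run, the per-cell wallclock estimates from the table can be tallied as in the sketch below. The dictionary simply transcribes the table; `CELL_MINUTES` and `HEAVY_CELLS` are illustrative names, not notebook variables.

```python
# Wallclock estimates in minutes, transcribed from the table above.
CELL_MINUTES = {
    "setup-code": 4,
    "patch-code": 5 / 60,  # 5 seconds
    "model-code": 3,
    "sft-data-code": 1,
    "sft-train-code": 17,
    "merge-code": 2,
    "grpo-data-code": 1,
    "grpo-train-code": 40,
    "eval-code": 18,
    "diag-code": 2,
}

# The three cells that dominate the run: both training phases and the eval.
HEAVY_CELLS = ("sft-train-code", "grpo-train-code", "eval-code")

total_minutes = sum(CELL_MINUTES.values())
heavy_minutes = sum(CELL_MINUTES[name] for name in HEAVY_CELLS)
slowest_cell = max(CELL_MINUTES, key=CELL_MINUTES.get)

print(f"all cells: {total_minutes:.0f} min, heavy cells: {heavy_minutes} min")
print(f"slowest cell: {slowest_cell}")
```

The tally comes to about 88 minutes for all cells, with 75 minutes spent in `sft-train-code`, `grpo-train-code`, and `eval-code`.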

### 2.3 Override knobs

Set environment variables before running the relevant cell:

```python
import os
os.environ['MODEL_ID'] = 'Qwen/Qwen2.5-3B-Instruct'   # default
os.environ['SFT_TARGET_ROWS'] = '4000'                 # default
os.environ['SFT_MAX_STEPS'] = '150'                    # default
os.environ['SFT_LR'] = '1e-4'                          # default
os.environ['GRPO_MAX_STEPS'] = '200'                   # default
os.environ['GRPO_LR'] = '3e-5'                         # default
os.environ['PHASE_B_MAX_STATES_PER_TASK'] = '10'       # default
os.environ['GRPO_DIFFICULTIES'] = 'easy,medium,hard,nightmare'  # default
os.environ['RUN_SFT_TRAIN'] = 'auto'                   # auto-skip if adapter exists
os.environ['RUN_GRPO'] = '1'                           # set '0' to skip Phase B
```
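For orientation, a cell might read these knobs along the following lines. This is a sketch under assumptions: `read_knobs` is a hypothetical helper, and the notebook's real parsing (in particular how it treats `RUN_SFT_TRAIN` values other than `auto`) may differ.

```python
import os


def read_knobs(env=None):
    """Parse the override knobs, falling back to the documented defaults."""
    env = os.environ if env is None else env
    difficulties = env.get("GRPO_DIFFICULTIES", "easy,medium,hard,nightmare")
    return {
        "model_id": env.get("MODEL_ID", "Qwen/Qwen2.5-3B-Instruct"),
        "sft_target_rows": int(env.get("SFT_TARGET_ROWS", "4000")),
        "sft_max_steps": int(env.get("SFT_MAX_STEPS", "150")),
        "sft_lr": float(env.get("SFT_LR", "1e-4")),
        "grpo_max_steps": int(env.get("GRPO_MAX_STEPS", "200")),
        "grpo_lr": float(env.get("GRPO_LR", "3e-5")),
        "max_states_per_task": int(env.get("PHASE_B_MAX_STATES_PER_TASK", "10")),
        # Comma-separated list; surrounding whitespace and empty items are dropped.
        "difficulties": [d.strip() for d in difficulties.split(",") if d.strip()],
        # Kept as a raw string because 'auto' is a third state besides on/off.
        "run_sft_train": env.get("RUN_SFT_TRAIN", "auto"),
        # Assumption: any value other than '0' leaves Phase B enabled.
        "run_grpo": env.get("RUN_GRPO", "1") != "0",
    }
```

Because environment variables are always strings, converting them in one place keeps type errors (for example a malformed learning rate) at the top of the run rather than deep inside a training cell.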

## 3. End-to-end reproduction (local, ≥12 GB VRAM)

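If you have a local machine with at least 12 GB of CUDA VRAM, the same notebook runs unchanged. Adjust `WORK_DIR` at the top of the setup cell to a local writable path. Note that §2.2 lists a 12.5 GB peak for `eval-code` on a T4, so a card with exactly 12 GB leaves no headroom for that cell.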

```bash
git clone https://github.com/MitudruDutta/ChargeBackOps
cd ChargeBackOps
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
jupyter notebook notebooks/train_merchant_agent.ipynb
```

For laptops with less VRAM, set `MODEL_ID=Qwen/Qwen2.5-1.5B-Instruct` to fit in 8 GB. Expect lower absolute scores (the model is half the size) but the same qualitative training story.

## 4. Reproducing only the scripted-policy baseline sweep

No GPU required. Runs on CPU in under a minute.

```bash
pip install -e ".[dev]"
pytest -q tests/                           # 113 tests, all green
python -m runners.benchmark_runner         # prints headline + multi-seed sweep
```

Expected output (deterministic):

```
Headline catalog (12 tasks):
  naive          : 0.0000
  concede_all    : 0.4435
  escalate_all   : 0.7668
  heuristic      : 0.8132

Multi-seed grid (28 tasks):
  naive          : 0.0000
  concede_all    : 0.4454
  escalate_all   : 0.7675
  heuristic      : 0.7628

Marathon (long-horizon):
  naive          : 0.0000
  concede_all    : 0.4004
  escalate_all   : 0.6168
  heuristic      : 0.6793
```

These numbers are exact; the heuristic policy and arbitration adjudicator are deterministic given (case, packet).
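Since the baseline sweep is deterministic, its output can be checked mechanically. The sketch below assumes the runner prints exactly the layout shown above (a section header ending in `:` followed by indented `policy : score` rows); `parse_scores` and `diff_scores` are hypothetical helpers, not part of the repo.

```python
import re

# Expected scores transcribed from the output block above.
EXPECTED = {
    "Headline catalog": {
        "naive": 0.0, "concede_all": 0.4435, "escalate_all": 0.7668, "heuristic": 0.8132,
    },
    "Multi-seed grid": {
        "naive": 0.0, "concede_all": 0.4454, "escalate_all": 0.7675, "heuristic": 0.7628,
    },
    "Marathon": {
        "naive": 0.0, "concede_all": 0.4004, "escalate_all": 0.6168, "heuristic": 0.6793,
    },
}

ROW = re.compile(r"^\s+(\w+)\s*:\s*([0-9.]+)\s*$")


def parse_scores(text):
    """Turn the printed report into {section: {policy: score}}."""
    scores, section = {}, None
    for line in text.splitlines():
        row = ROW.match(line)
        if row and section is not None:
            scores[section][row.group(1)] = float(row.group(2))
        elif line.strip().endswith(":"):
            # 'Headline catalog (12 tasks):' -> 'Headline catalog'
            section = line.split("(")[0].strip().rstrip(":")
            scores[section] = {}
    return scores


def diff_scores(actual, expected, tol=5e-5):
    """List (section, policy, expected, found) for missing or mismatching scores."""
    problems = []
    for section, policies in expected.items():
        for policy, want in policies.items():
            got = actual.get(section, {}).get(policy)
            if got is None or abs(got - want) > tol:
                problems.append((section, policy, want, got))
    return problems
```

One way to use it is to capture the runner's stdout (for example with `subprocess.run(..., capture_output=True, text=True)`) and assert that `diff_scores(parse_scores(stdout), EXPECTED)` is empty. The tolerance of `5e-5` reflects that scores are printed to four decimals.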

## 5. Expected training-curve numbers (with seed variance)

The published training curve was produced with seeds:

```
SFT_SEED_START = 1000
HOLDOUT_SEEDS_BY_DIFF = {
    'easy':      {42},
    'medium':    {17, 99},
    'hard':      {7, 53},
    'nightmare': {31, 77},
}
```

Holdout seeds are excluded from training and used as the eval set.
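A quick leak check for this split might look like the sketch below. It assumes, hypothetically, that SFT seeds are drawn contiguously upward from `SFT_SEED_START` (one per SFT row), and it conservatively treats a holdout seed as excluded for every difficulty; `holdout_leaks` is an illustrative helper, not a repo function.

```python
SFT_SEED_START = 1000
HOLDOUT_SEEDS_BY_DIFF = {
    "easy": {42},
    "medium": {17, 99},
    "hard": {7, 53},
    "nightmare": {31, 77},
}


def holdout_leaks(train_seeds, holdout_by_diff):
    """Return {difficulty: [seeds]} for holdout seeds that also appear in training."""
    train = set(train_seeds)
    return {
        diff: sorted(seeds & train)
        for diff, seeds in holdout_by_diff.items()
        if seeds & train
    }


# Hypothetical training seed range: 4,000 contiguous seeds starting at 1000.
train_seeds = range(SFT_SEED_START, SFT_SEED_START + 4000)
leaks = holdout_leaks(train_seeds, HOLDOUT_SEEDS_BY_DIFF)
assert not leaks, f"holdout seeds leaked into training: {leaks}"
```

Under that assumption the check passes trivially, since every holdout seed is below 100 and training seeds start at 1000; it becomes useful if `SFT_SEED_START` or the seed-generation scheme is changed.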

Expected per-checkpoint scores (the ± values are approximate absolute ranges from sampling stochasticity, roughly ±0.02 to ±0.06 depending on the cell):

| Checkpoint | overall | easy | medium | hard | nightmare | Status |
|---|---|---|---|---|---|---|
| Untrained Qwen2.5-3B base (step 0) | 0.46 ± 0.02 | 0.29 ± 0.05 | 0.44 ± 0.04 | 0.76 ± 0.03 | 0.34 ± 0.05 | Real |
| SFT (step 1, 150 steps) | 0.54 ± 0.03 | 0.78 ± 0.05 | 0.67 ± 0.04 | 0.46 ± 0.05 | 0.24 ± 0.06 | **Real, headline trained checkpoint** |
| GRPO step 80 | 0.80 ± 0.04 | 0.93 ± 0.04 | 0.79 ± 0.04 | 0.83 ± 0.05 | 0.65 ± 0.06 | Mixed: partial real + early gaming attractor |
| GRPO step 160+ | 0.8132 ± 0.0001 | 0.92 | 0.86 | 0.83 | 0.64 | Gaming-dominated (matches heuristic bit-exactly) |

The `0.8132 ± 0.0001` precision on GRPO step 160+ is not reproducibility precision — it is the eval rollout helper falling back to the deterministic heuristic on every invalid action. The heuristic produces `0.8132` exactly because both the heuristic and the arbitration adjudicator are deterministic given (case, packet). See [`SPECIFICATION_GAMING.md`](SPECIFICATION_GAMING.md) for the full diagnostic.

Scores in the earlier rows (step 0 / step 1 / step 80) have wider variance because sampling is stochastic and, during GRPO, only 30–60% of trainer steps see a non-zero gradient (see [`METHOD.md`](METHOD.md) §3 for why).
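When comparing a fresh run against the table, a small classifier such as the sketch below can flag both out-of-range scores and the heuristic-fallback signature. The checkpoint keys and the `classify_overall` helper are hypothetical; the means and tolerances are the `overall` column from the table.

```python
# (mean, tolerance) for the `overall` column, transcribed from the table above.
EXPECTED_OVERALL = {
    "base": (0.46, 0.02),
    "sft": (0.54, 0.03),
    "grpo_80": (0.80, 0.04),
}
HEURISTIC_OVERALL = 0.8132  # deterministic heuristic score on the headline catalog


def classify_overall(checkpoint, score):
    """Label a reproduced overall score for one of the stochastic checkpoints."""
    # Agreement with the heuristic to four decimals is checked first, because
    # 0.8132 also falls inside the 0.80 +/- 0.04 band for GRPO step 80.
    if abs(score - HEURISTIC_OVERALL) < 5e-5:
        return "heuristic-fallback?"
    mean, tol = EXPECTED_OVERALL[checkpoint]
    return "in-range" if abs(score - mean) <= tol else "out-of-range"
```

A `heuristic-fallback?` label is only a hint that the rollout helper substituted the heuristic on every step; confirming it requires inspecting the rollouts as described in `SPECIFICATION_GAMING.md`.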

## 6. Reproducing the figures

After the eval cell completes, two PNGs are written to `docs/figures/`:

- `training_curve.png` — overall mean score vs GRPO step, with heuristic and naive baselines as dashed lines.
- `training_curve_by_family.png` — per-difficulty curves on the same axes.

Both are committed to the repo so judges who do not run the notebook can still see the results.

## 7. Test suite

```bash
pytest -q tests/
```

Expected output:

```
113 passed in ~7s
```

Failures here indicate environment, grader, or training-pipeline regressions. See `tests/conftest.py` for fixture details.

## 8. Running the trained agent

After the notebook completes, the SFT and GRPO adapters are saved under:

- `/content/sft-merchant-agent/final/` (or `WORK_DIR/sft-merchant-agent/final/` locally)
- `/content/grpo-merchant-agent/final/` (or `WORK_DIR/grpo-merchant-agent/final/` locally)

To use the trained model in the inference path:

```python
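import os

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

WORK_DIR = '/content'  # Colab path from above; set to your local WORK_DIR otherwise
MODEL_ID = 'Qwen/Qwen2.5-3B-Instruct'

# Load the fp16 base model, then fold the Phase A (SFT) adapter into its weights.
base = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map='cuda',
)
sft = PeftModel.from_pretrained(base, os.path.join(WORK_DIR, 'sft-merchant-agent/final'))
merged = sft.merge_and_unload()

# Attach the Phase B (GRPO) adapter on top of the merged model.
trained = PeftModel.from_pretrained(merged, os.path.join(WORK_DIR, 'grpo-merchant-agent/final'))
trained.eval()

tok = AutoTokenizer.from_pretrained(MODEL_ID)
# ... use trained as the policy in run_episode_with_text_policy()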
```

See [`RUNNING_THE_AGENT.md`](RUNNING_THE_AGENT.md) for the full inference path.