# Reproducibility
This document gives the exact command sequence, pinned versions, expected runtimes, and expected score ranges to reproduce every published number in [`RESULTS.md`](RESULTS.md). Reported numbers are seed-deterministic where stated; cells that depend on stochastic sampling are flagged.
## 1. Pinned dependency stack
The training notebook [`notebooks/train_merchant_agent.ipynb`](../notebooks/train_merchant_agent.ipynb) installs and verifies the following pins. The setup cell asserts each pin and fails loudly if any version drifts.
| Package | Version | Why pinned |
|---|---|---|
| torch | 2.10.0+cu128 | Matches torchvision 0.25.0+cu128 and torchaudio 2.10.0+cu128 |
| transformers | 4.51.3 | TRL 0.21.0 was tested against 4.51.x; newer transformers break GRPOTrainer internals |
| trl | 0.21.0 | Provides `GRPOTrainer` with `reward_funcs` list and proper sampling kwarg passing |
| peft | 0.14.0 | Compatible with huggingface-hub 0.26.x; later peft requires hub 1.x |
| tokenizers | 0.21.4 | Required range for transformers 4.51.x |
| huggingface-hub | 0.26.5 | Last 0.x release; peft 0.14 imports paths that moved in hub 1.0 |
| accelerate | 1.0.1 | Compatible with hub 0.26; later accelerate hard-requires hub 1.x |
| openenv-core | ≥0.2.2 | Source of `Environment`, `Rubric`, `WeightedSum`, `Gate` |
| pydantic | ≥2.10 | Used by core models |
| datasets | ≥2.20,<4.0 | Compatible with the pinned transformers + tokenizers |
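A pin check of this kind can be sketched with the standard library alone. This is an illustrative sketch, not the notebook's actual setup cell: the helper names `find_pin_mismatches` and `assert_pins` are hypothetical, and only the exact pins from the table are covered (the range pins for openenv-core, pydantic, and datasets are left out).

```python
from importlib.metadata import PackageNotFoundError, version

# Exact pins copied from the table above. Range pins are intentionally omitted.
EXACT_PINS = {
    "torch": "2.10.0+cu128",
    "transformers": "4.51.3",
    "trl": "0.21.0",
    "peft": "0.14.0",
    "tokenizers": "0.21.4",
    "huggingface-hub": "0.26.5",
    "accelerate": "1.0.1",
}


def find_pin_mismatches(pins, lookup=version):
    """Return {package: (expected, installed_or_None)} for every violated pin."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


def assert_pins(pins, lookup=version):
    """Raise with the full list of mismatches instead of stopping at the first."""
    mismatches = find_pin_mismatches(pins, lookup)
    if mismatches:
        raise RuntimeError(f"Version pins violated: {mismatches}")
```

Calling `assert_pins(EXACT_PINS)` right after the install step would surface every drifted package in one error message.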
## 2. End-to-end reproduction (Colab T4)
### 2.1 Open the notebook
```
https://colab.research.google.com/github/MitudruDutta/ChargeBackOps/blob/main/notebooks/train_merchant_agent.ipynb
```
Connect to a T4 runtime (free tier suffices).
### 2.2 Run cells in order
Each cell is idempotent. Total wallclock ≈ 88 minutes on a free T4 (the sum of the per-cell figures below).
| Cell | Purpose | Wallclock | VRAM peak |
|---|---|---|---|
| `setup-code` | Pin install + repo clone + asserts | 4 min | 0 GB |
| `patch-code` | sys.path + module-cache flush | 5 sec | 0 GB |
| `model-code` | Load Qwen2.5-3B-Instruct fp16 + LoRA | 3 min | 6.3 GB |
| `sft-data-code` | Generate 4,000 SFT rows + chat-template wrap | 1 min | 6.3 GB |
| `sft-train-code` | Phase A SFT (150 steps) | 17 min | 8.4 GB |
| `merge-code` | Reload SFT, merge into base, attach Phase B LoRA | 2 min | 6.3 GB |
| `grpo-data-code` | Build GRPO state-action dataset with curriculum bias | 1 min | 6.3 GB |
| `grpo-train-code` | Phase B GRPO (200 steps) | 40 min | 11.5 GB |
| `eval-code` | Per-checkpoint eval + plot generation | 18 min | 12.5 GB |
| `diag-code` | Three-task diagnostic rollout | 2 min | 6.3 GB |
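To budget a session on a different GPU, the table's figures can be totalled and compared against the available VRAM. The sketch below copies the numbers from the table; the helper names are hypothetical, and measured peaks on other hardware may differ.

```python
# (cell, wallclock in minutes, peak VRAM in GB), copied from the table above.
CELLS = [
    ("setup-code", 4.0, 0.0),
    ("patch-code", 5 / 60, 0.0),
    ("model-code", 3.0, 6.3),
    ("sft-data-code", 1.0, 6.3),
    ("sft-train-code", 17.0, 8.4),
    ("merge-code", 2.0, 6.3),
    ("grpo-data-code", 1.0, 6.3),
    ("grpo-train-code", 40.0, 11.5),
    ("eval-code", 18.0, 12.5),
    ("diag-code", 2.0, 6.3),
]


def total_minutes(cells=CELLS):
    """Sum of the per-cell wallclock figures."""
    return sum(minutes for _, minutes, _ in cells)


def cells_over_budget(budget_gb, cells=CELLS):
    """Names of the cells whose listed peak VRAM exceeds the given budget."""
    return [name for name, _, peak in cells if peak > budget_gb]


if __name__ == "__main__":
    print(f"total: {total_minutes():.0f} min")
    print(f"cells over 8 GB: {cells_over_budget(8.0)}")
```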
### 2.3 Override knobs
Set environment variables before running the relevant cell:
```python
import os
os.environ['MODEL_ID'] = 'Qwen/Qwen2.5-3B-Instruct' # default
os.environ['SFT_TARGET_ROWS'] = '4000' # default
os.environ['SFT_MAX_STEPS'] = '150' # default
os.environ['SFT_LR'] = '1e-4' # default
os.environ['GRPO_MAX_STEPS'] = '200' # default
os.environ['GRPO_LR'] = '3e-5' # default
os.environ['PHASE_B_MAX_STATES_PER_TASK'] = '10' # default
os.environ['GRPO_DIFFICULTIES'] = 'easy,medium,hard,nightmare' # default
os.environ['RUN_SFT_TRAIN'] = 'auto' # auto-skip if adapter exists
os.environ['RUN_GRPO'] = '1' # set '0' to skip Phase B
```
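On the consuming side, a cell could read these knobs as follows. This is a hedged sketch rather than the notebook's code: `read_knobs` and `should_run_sft` are hypothetical helpers, the treatment of `RUN_SFT_TRAIN` values other than `'auto'` is an assumption, and the adapter check relies on PEFT writing `adapter_config.json` when an adapter is saved.

```python
import os
from pathlib import Path

# Defaults copied from the listing above.
DEFAULTS = {
    "MODEL_ID": "Qwen/Qwen2.5-3B-Instruct",
    "SFT_TARGET_ROWS": "4000",
    "SFT_MAX_STEPS": "150",
    "SFT_LR": "1e-4",
    "GRPO_MAX_STEPS": "200",
    "GRPO_LR": "3e-5",
    "PHASE_B_MAX_STATES_PER_TASK": "10",
    "GRPO_DIFFICULTIES": "easy,medium,hard,nightmare",
    "RUN_SFT_TRAIN": "auto",
    "RUN_GRPO": "1",
}


def read_knobs(env=None):
    """Read every knob from `env` (default: os.environ) and convert it to a typed value."""
    env = os.environ if env is None else env
    raw = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    return {
        "model_id": raw["MODEL_ID"],
        "sft_target_rows": int(raw["SFT_TARGET_ROWS"]),
        "sft_max_steps": int(raw["SFT_MAX_STEPS"]),
        "sft_lr": float(raw["SFT_LR"]),
        "grpo_max_steps": int(raw["GRPO_MAX_STEPS"]),
        "grpo_lr": float(raw["GRPO_LR"]),
        "max_states_per_task": int(raw["PHASE_B_MAX_STATES_PER_TASK"]),
        "grpo_difficulties": [d.strip() for d in raw["GRPO_DIFFICULTIES"].split(",") if d.strip()],
        "run_sft_train": raw["RUN_SFT_TRAIN"],
        "run_grpo": raw["RUN_GRPO"] != "0",
    }


def should_run_sft(mode, adapter_dir):
    """'auto' skips Phase A when a saved PEFT adapter already exists in adapter_dir."""
    if mode == "auto":
        return not (Path(adapter_dir) / "adapter_config.json").exists()
    # Assumption: any explicit value other than '0' forces a run.
    return mode != "0"
```

Because the values are read when the cell runs, an override only takes effect if it is set before that cell executes.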
## 3. End-to-end reproduction (local, ≥12 GB VRAM)
If you have a local machine with at least 12 GB CUDA VRAM, the same notebook runs unchanged. Adjust `WORK_DIR` at the top of the setup cell to a local writable path.
```bash
git clone https://github.com/MitudruDutta/ChargeBackOps
cd ChargeBackOps
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
jupyter notebook notebooks/train_merchant_agent.ipynb
```
For laptops with less VRAM, set `MODEL_ID=Qwen/Qwen2.5-1.5B-Instruct` to fit in 8 GB. Expect lower absolute scores (the model is half the size) but the same qualitative training story.
## 4. Reproducing only the scripted-policy baseline sweep
No GPU required. Runs on CPU in under a minute.
```bash
pip install -e ".[dev]"
pytest -q tests/ # 113 tests, all green
python -m runners.benchmark_runner # prints headline + multi-seed sweep
```
Expected output (deterministic):
```
Headline catalog (12 tasks):
naive : 0.0000
concede_all : 0.4435
escalate_all : 0.7668
heuristic : 0.8132
Multi-seed grid (28 tasks):
naive : 0.0000
concede_all : 0.4454
escalate_all : 0.7675
heuristic : 0.7628
Marathon (long-horizon):
naive : 0.0000
concede_all : 0.4004
escalate_all : 0.6168
heuristic : 0.6793
```
These numbers are exact; the heuristic policy and arbitration adjudicator are deterministic given (case, packet).
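Since the sweep is deterministic, the comparison can be automated. The sketch below assumes the runner prints exactly the layout shown above (a section title ending in a colon, then `policy : score` lines); the parser and `diff_against_expected` are hypothetical helpers, not part of the repo.

```python
import re

# Expected values copied from the listing above.
EXPECTED = {
    "Headline catalog (12 tasks)": {
        "naive": 0.0, "concede_all": 0.4435, "escalate_all": 0.7668, "heuristic": 0.8132,
    },
    "Multi-seed grid (28 tasks)": {
        "naive": 0.0, "concede_all": 0.4454, "escalate_all": 0.7675, "heuristic": 0.7628,
    },
    "Marathon (long-horizon)": {
        "naive": 0.0, "concede_all": 0.4004, "escalate_all": 0.6168, "heuristic": 0.6793,
    },
}

POLICY_LINE = re.compile(r"(\w+)\s*:\s*([0-9.]+)")


def parse_benchmark_output(text):
    """Parse 'Section title:' headers followed by 'policy : score' lines."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        match = POLICY_LINE.fullmatch(line)
        if match and current is not None:
            sections[current][match.group(1)] = float(match.group(2))
        elif line.endswith(":"):
            current = line[:-1]
            sections[current] = {}
    return sections


def diff_against_expected(parsed, expected=EXPECTED):
    """Return (section, policy, expected, observed) for every mismatch."""
    diffs = []
    for section, policies in expected.items():
        for policy, score in policies.items():
            observed = parsed.get(section, {}).get(policy)
            if observed != score:
                diffs.append((section, policy, score, observed))
    return diffs
```

The text to parse can be captured with `subprocess.run([sys.executable, "-m", "runners.benchmark_runner"], capture_output=True, text=True, check=True).stdout`; an empty list from `diff_against_expected` means every number matched.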
## 5. Expected training-curve numbers (with seed variance)
The published training curve was produced with seeds:
```
SFT_SEED_START = 1000
HOLDOUT_SEEDS_BY_DIFF = {
    'easy': {42},
    'medium': {17, 99},
    'hard': {7, 53},
    'nightmare': {31, 77},
}
```
Holdout seeds are excluded from training and used as the eval set.
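The exclusion rule can be sketched as a seed generator that skips the holdout set. How the notebook actually enumerates training seeds is not stated here, so counting upward from `SFT_SEED_START` is an assumption, and `training_seeds` is a hypothetical helper.

```python
SFT_SEED_START = 1000
HOLDOUT_SEEDS_BY_DIFF = {
    "easy": {42},
    "medium": {17, 99},
    "hard": {7, 53},
    "nightmare": {31, 77},
}


def training_seeds(difficulty, count, start=SFT_SEED_START):
    """Return `count` consecutive seeds from `start`, skipping holdout seeds."""
    holdout = HOLDOUT_SEEDS_BY_DIFF.get(difficulty, set())
    seeds = []
    seed = start
    while len(seeds) < count:
        if seed not in holdout:  # never train on an eval seed
            seeds.append(seed)
        seed += 1
    return seeds


def is_leak_free(difficulty, seeds):
    """True when no training seed overlaps the holdout set for this difficulty."""
    return HOLDOUT_SEEDS_BY_DIFF.get(difficulty, set()).isdisjoint(seeds)
```

With the published values every holdout seed is below 1000, so starting at `SFT_SEED_START` avoids them without any skipping; the explicit check guards against a changed start value.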
Expected per-checkpoint scores (the ± ranges, between 0.02 and 0.06 absolute depending on the cell, reflect sampling stochasticity):
| Checkpoint | overall | easy | medium | hard | nightmare | Status |
|---|---|---|---|---|---|---|
| Untrained Qwen2.5-3B base (step 0) | 0.46 ± 0.02 | 0.29 ± 0.05 | 0.44 ± 0.04 | 0.76 ± 0.03 | 0.34 ± 0.05 | Real |
| SFT (step 1, 150 steps) | 0.54 ± 0.03 | 0.78 ± 0.05 | 0.67 ± 0.04 | 0.46 ± 0.05 | 0.24 ± 0.06 | **Real, headline trained checkpoint** |
| GRPO step 80 | 0.80 ± 0.04 | 0.93 ± 0.04 | 0.79 ± 0.04 | 0.83 ± 0.05 | 0.65 ± 0.06 | Mixed: partial real + early gaming attractor |
| GRPO step 160+ | 0.8132 ± 0.0001 | 0.92 | 0.86 | 0.83 | 0.64 | Gaming-dominated (matches heuristic bit-exactly) |
The `0.8132 ± 0.0001` precision on GRPO step 160+ is not reproducibility precision — it is the eval rollout helper falling back to the deterministic heuristic on every invalid action. The heuristic produces `0.8132` exactly because both the heuristic and the arbitration adjudicator are deterministic given (case, packet). See [`SPECIFICATION_GAMING.md`](SPECIFICATION_GAMING.md) for the full diagnostic.
Numbers in the earlier rows (step 0 / step 1 / step 80) have wider variance because sampling is stochastic; for the GRPO row, only 30–60% of trainer steps see a non-zero gradient (see [`METHOD.md`](METHOD.md) §3 for why).
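A reproduced run can be checked against these ranges with a few lines. The sketch uses the overall column from the table; the checkpoint keys and both helper names are hypothetical labels chosen for illustration.

```python
# (expected overall score, tolerance) per checkpoint, copied from the table above.
EXPECTED_OVERALL = {
    "base": (0.46, 0.02),
    "sft": (0.54, 0.03),
    "grpo_step_80": (0.80, 0.04),
}
HEURISTIC_SCORE = 0.8132  # headline heuristic baseline from section 4


def within_tolerance(checkpoint, observed, eps=1e-9):
    """True when the observed overall score falls inside the published range."""
    expected, tol = EXPECTED_OVERALL[checkpoint]
    return abs(observed - expected) <= tol + eps  # eps absorbs float rounding


def matches_heuristic(observed, tol=1e-4):
    """True when a score equals the heuristic baseline to four decimals,
    the signature of the eval helper falling back on every invalid action."""
    return abs(observed - HEURISTIC_SCORE) <= tol
```

A late GRPO checkpoint for which `matches_heuristic` returns `True` should be read as the fallback at work, not as a trained-policy score.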
## 6. Reproducing the figures
After the eval cell completes, two PNGs are written to `docs/figures/`:
- `training_curve.png` — overall mean score vs GRPO step, with heuristic and naive baselines as dashed lines.
- `training_curve_by_family.png` — per-difficulty curves on the same axes.
Both are committed to the repo so judges who do not run the notebook can still see the results.
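A quick sanity check that the eval cell wrote both files can be sketched as follows. The helper name is hypothetical; the check only confirms that each file exists and starts with the standard 8-byte PNG signature, not that the plot content is correct.

```python
from pathlib import Path

FIGURES = ("training_curve.png", "training_curve_by_family.png")
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # standard 8-byte PNG file signature


def missing_or_invalid_figures(fig_dir="docs/figures"):
    """Return the figure names that are absent or do not start with the PNG signature."""
    problems = []
    for name in FIGURES:
        path = Path(fig_dir) / name
        if not path.is_file():
            problems.append(name)
            continue
        with path.open("rb") as handle:
            if handle.read(len(PNG_MAGIC)) != PNG_MAGIC:
                problems.append(name)
    return problems
```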
## 7. Test suite
```bash
pytest -q tests/
```
Expected output (the timing is approximate):
```
113 passed in ~7s
```
Failures here indicate environment, grader, or training-pipeline regressions. See `tests/conftest.py` for fixture details.
## 8. Running the trained agent
After the notebook completes, the SFT and GRPO adapters are saved under:
- `/content/sft-merchant-agent/final/` (or `WORK_DIR/sft-merchant-agent/final/` locally)
- `/content/grpo-merchant-agent/final/`
To use the trained model in the inference path:
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Adapter paths below are relative to WORK_DIR (/content on Colab).
base = AutoModelForCausalLM.from_pretrained(
    'Qwen/Qwen2.5-3B-Instruct', torch_dtype=torch.float16, device_map='cuda',
)
# Fold the Phase A (SFT) adapter into the base weights, mirroring the merge cell.
sft = PeftModel.from_pretrained(base, 'sft-merchant-agent/final')
merged = sft.merge_and_unload()
# Attach the Phase B (GRPO) adapter on top of the merged model.
trained = PeftModel.from_pretrained(merged, 'grpo-merchant-agent/final')
trained.eval()
tok = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-3B-Instruct')
# ... use trained as the policy in run_episode_with_text_policy()
```
See [`RUNNING_THE_AGENT.md`](RUNNING_THE_AGENT.md) for the full inference path.