Reproducibility
This document gives the exact command sequence, pinned versions, expected runtimes, and expected score ranges to reproduce every published number in RESULTS.md. Reported numbers are seed-deterministic where stated; cells that depend on stochastic sampling are flagged.
1. Pinned dependency stack
The training notebook notebooks/train_merchant_agent.ipynb installs and verifies the following pins. The setup cell asserts each pin and fails loudly if any version slips.
| Package | Version | Why pinned |
|---|---|---|
| torch | 2.10.0+cu128 | Matches torchvision 0.25.0+cu128 and torchaudio 2.10.0+cu128 |
| transformers | 4.51.3 | TRL 0.21.0 was tested against 4.51.x; newer transformers break GRPOTrainer internals |
| trl | 0.21.0 | Provides GRPOTrainer with reward_funcs list and proper sampling kwarg passing |
| peft | 0.14.0 | Compatible with huggingface-hub 0.26.x; later peft requires hub 1.x |
| tokenizers | 0.21.4 | Required range for transformers 4.51.x |
| huggingface-hub | 0.26.5 | Pre-1.0 release; peft 0.14 imports paths that moved in hub 1.0 |
| accelerate | 1.0.1 | Compatible with hub 0.26; later accelerate hard-requires hub 1.x |
| openenv-core | ≥0.2.2 | Source of Environment, Rubric, WeightedSum, Gate |
| pydantic | ≥2.10 | Used by core models |
| datasets | ≥2.20,<4.0 | Compatible with the pinned transformers + tokenizers |
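The pin check described above could look roughly like the following. This is a minimal sketch, not the notebook's actual setup cell: the helper names `find_pin_mismatches` and `assert_pins` are hypothetical, and only the exact-version pins from the table are included (torch and the range-style pins are left out for brevity).

```python
from importlib.metadata import PackageNotFoundError, version

# Exact pins copied from the table above (assumption: torch and the
# range-style pins are checked separately and are omitted here).
EXACT_PINS = {
    "transformers": "4.51.3",
    "trl": "0.21.0",
    "peft": "0.14.0",
    "tokenizers": "0.21.4",
    "huggingface-hub": "0.26.5",
    "accelerate": "1.0.1",
}


def find_pin_mismatches(pins, get_version=version):
    """Return {package: (expected, installed_or_None)} for every pin that slipped."""
    mismatches = {}
    for package, expected in pins.items():
        try:
            installed = get_version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches


def assert_pins(pins, get_version=version):
    """Raise AssertionError listing every mismatch so a slipped pin fails loudly."""
    mismatches = find_pin_mismatches(pins, get_version)
    assert not mismatches, f"Pinned versions slipped: {mismatches}"
```

Calling `assert_pins(EXACT_PINS)` right after the install step would stop the run before any training cell executes against an unexpected stack.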
2. End-to-end reproduction (Colab T4)
2.1 Open the notebook
https://colab.research.google.com/github/MitudruDutta/ChargeBackOps/blob/main/notebooks/train_merchant_agent.ipynb
Connect to a T4 runtime (free tier suffices).
2.2 Run cells in order
Each cell is idempotent. Total wallclock is roughly 75 to 90 minutes on a free T4; the per-cell estimates below sum to about 88 minutes.
| Cell | Purpose | Wallclock | VRAM peak |
|---|---|---|---|
| setup-code | Pin install + repo clone + asserts | 4 min | 0 GB |
| patch-code | sys.path + module-cache flush | 5 sec | 0 GB |
| model-code | Load Qwen2.5-3B-Instruct fp16 + LoRA | 3 min | 6.3 GB |
| sft-data-code | Generate 4,000 SFT rows + chat-template wrap | 1 min | 6.3 GB |
| sft-train-code | Phase A SFT (150 steps) | 17 min | 8.4 GB |
| merge-code | Reload SFT, merge into base, attach Phase B LoRA | 2 min | 6.3 GB |
| grpo-data-code | Build GRPO state-action dataset with curriculum bias | 1 min | 6.3 GB |
| grpo-train-code | Phase B GRPO (200 steps) | 40 min | 11.5 GB |
| eval-code | Per-checkpoint eval + plot generation | 18 min | 12.5 GB |
| diag-code | Three-task diagnostic rollout | 2 min | 6.3 GB |
2.3 Override knobs
Set environment variables before running the relevant cell:
import os
os.environ['MODEL_ID'] = 'Qwen/Qwen2.5-3B-Instruct' # default
os.environ['SFT_TARGET_ROWS'] = '4000' # default
os.environ['SFT_MAX_STEPS'] = '150' # default
os.environ['SFT_LR'] = '1e-4' # default
os.environ['GRPO_MAX_STEPS'] = '200' # default
os.environ['GRPO_LR'] = '3e-5' # default
os.environ['PHASE_B_MAX_STATES_PER_TASK'] = '10' # default
os.environ['GRPO_DIFFICULTIES'] = 'easy,medium,hard,nightmare' # default
os.environ['RUN_SFT_TRAIN'] = 'auto' # auto-skip if adapter exists
os.environ['RUN_GRPO'] = '1' # set '0' to skip Phase B
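For readers wondering how such string-valued knobs might be turned into typed settings, here is a hedged sketch. The `read_knobs` helper and its dictionary keys are hypothetical and are not taken from the notebook; the variable names and defaults are the ones listed above, and the `'auto'` handling of RUN_SFT_TRAIN is deliberately left out because its logic is not documented here.

```python
import os


def read_knobs(environ=os.environ):
    """Collect training knobs from environment variables, using the documented defaults."""
    difficulties = environ.get("GRPO_DIFFICULTIES", "easy,medium,hard,nightmare")
    return {
        "model_id": environ.get("MODEL_ID", "Qwen/Qwen2.5-3B-Instruct"),
        "sft_target_rows": int(environ.get("SFT_TARGET_ROWS", "4000")),
        "sft_max_steps": int(environ.get("SFT_MAX_STEPS", "150")),
        "sft_lr": float(environ.get("SFT_LR", "1e-4")),
        "grpo_max_steps": int(environ.get("GRPO_MAX_STEPS", "200")),
        "grpo_lr": float(environ.get("GRPO_LR", "3e-5")),
        # Comma-separated list; surrounding whitespace and empty entries are dropped.
        "grpo_difficulties": [d.strip() for d in difficulties.split(",") if d.strip()],
        # Assumption: any value other than '0' keeps Phase B enabled.
        "run_grpo": environ.get("RUN_GRPO", "1") != "0",
    }
```

Because environment values are always strings, the numeric knobs are quoted in the assignments above and converted only at the point of use.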
3. End-to-end reproduction (local, ≥12 GB VRAM)
If you have a local machine with at least 12 GB CUDA VRAM, the same notebook runs unchanged. Adjust WORK_DIR at the top of the setup cell to a local writable path.
git clone https://github.com/MitudruDutta/ChargeBackOps
cd ChargeBackOps
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
jupyter notebook notebooks/train_merchant_agent.ipynb
For laptops with less VRAM, set MODEL_ID=Qwen/Qwen2.5-1.5B-Instruct to fit in 8 GB. Expect lower absolute scores (the model is half the size) but the same qualitative training story.
4. Reproducing only the scripted-policy baseline sweep
No GPU required. Runs on CPU in under a minute.
pip install -e ".[dev]"
pytest -q tests/ # 113 tests, all green
python -m runners.benchmark_runner # prints headline + multi-seed sweep
Expected output (deterministic):
Headline catalog (12 tasks):
naive : 0.0000
concede_all : 0.4435
escalate_all : 0.7668
heuristic : 0.8132
Multi-seed grid (28 tasks):
naive : 0.0000
concede_all : 0.4454
escalate_all : 0.7675
heuristic : 0.7628
Marathon (long-horizon):
naive : 0.0000
concede_all : 0.4004
escalate_all : 0.6168
heuristic : 0.6793
These numbers are exact; the heuristic policy and arbitration adjudicator are deterministic given (case, packet).
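Since the baseline numbers are stated to be exact, a captured run can be compared against them mechanically. The sketch below assumes the runner prints exactly the `policy : score` layout shown above; `parse_scores` and `EXPECTED_HEADLINE` are hypothetical names and are not part of the repository.

```python
import re

# Headline values copied from the expected output above.
EXPECTED_HEADLINE = {
    "naive": 0.0,
    "concede_all": 0.4435,
    "escalate_all": 0.7668,
    "heuristic": 0.8132,
}

# Matches lines such as "heuristic : 0.8132".
_SCORE_LINE = re.compile(r"^\s*(\w+)\s*:\s*([0-9]+\.[0-9]+)\s*$")


def parse_scores(text):
    """Parse 'policy : score' lines into {section_title: {policy: score}}."""
    sections = {}
    current = None
    for line in text.splitlines():
        match = _SCORE_LINE.match(line)
        if match and current is not None:
            sections[current][match.group(1)] = float(match.group(2))
        elif line.strip().endswith(":"):
            # A line ending in ':' with no score starts a new section.
            current = line.strip().rstrip(":")
            sections[current] = {}
    return sections
```

With the runner's stdout captured in a string, `parse_scores(output)["Headline catalog (12 tasks)"] == EXPECTED_HEADLINE` should hold if the run reproduces the published headline numbers.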
5. Expected training-curve numbers (with seed variance)
The published training curve was produced with seeds:
SFT_SEED_START = 1000
HOLDOUT_SEEDS_BY_DIFF = {
'easy': {42},
'medium': {17, 99},
'hard': {7, 53},
'nightmare': {31, 77},
}
Holdout seeds are excluded from training and used as the eval set.
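The exclusion rule can be illustrated with a small sketch. The seed table is copied from above, while `is_holdout` and `split_seeds` are hypothetical helpers and not the notebook's own functions.

```python
# Holdout seeds per difficulty, copied from the listing above.
HOLDOUT_SEEDS_BY_DIFF = {
    "easy": {42},
    "medium": {17, 99},
    "hard": {7, 53},
    "nightmare": {31, 77},
}


def is_holdout(difficulty, seed):
    """True if (difficulty, seed) belongs to the eval set and must not be trained on."""
    return seed in HOLDOUT_SEEDS_BY_DIFF.get(difficulty, set())


def split_seeds(difficulty, candidate_seeds):
    """Partition candidate seeds into (train, eval) lists, preserving input order."""
    train = [s for s in candidate_seeds if not is_holdout(difficulty, s)]
    evals = [s for s in candidate_seeds if is_holdout(difficulty, s)]
    return train, evals
```

Note that in this sketch membership is checked per difficulty, so a seed held out for one difficulty remains usable for training on another.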
Expected per-checkpoint scores. Sampling stochasticity gives roughly ±0.03 absolute variance on the overall score; the range for each cell is listed in the table:
| Checkpoint | overall | easy | medium | hard | nightmare | Status |
|---|---|---|---|---|---|---|
| Untrained Qwen2.5-3B base (step 0) | 0.46 ± 0.02 | 0.29 ± 0.05 | 0.44 ± 0.04 | 0.76 ± 0.03 | 0.34 ± 0.05 | Real |
| SFT (step 1, 150 steps) | 0.54 ± 0.03 | 0.78 ± 0.05 | 0.67 ± 0.04 | 0.46 ± 0.05 | 0.24 ± 0.06 | Real, headline trained checkpoint |
| GRPO step 80 | 0.80 ± 0.04 | 0.93 ± 0.04 | 0.79 ± 0.04 | 0.83 ± 0.05 | 0.65 ± 0.06 | Mixed: partial real + early gaming attractor |
| GRPO step 160+ | 0.8132 ± 0.0001 | 0.92 | 0.86 | 0.83 | 0.64 | Gaming-dominated (matches heuristic bit-exactly) |
The 0.8132 ± 0.0001 precision on GRPO step 160+ does not reflect reproducibility. It arises because the eval rollout helper falls back to the deterministic heuristic on every invalid action. The heuristic produces 0.8132 exactly because both the heuristic and the arbitration adjudicator are deterministic given (case, packet). See SPECIFICATION_GAMING.md for the full diagnostic.
Numbers in the earlier rows (step 0, step 1, and GRPO step 80) have wider variance because the trainer's sampling is stochastic and only 30–60% of steps see a non-zero gradient (see METHOD.md §3 for why).
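When checking a reproduced run against the table, a simple tolerance comparison is usually enough. The sketch below uses the overall-score column for the three stochastic checkpoints; the dictionary keys and the `within_expected` helper are hypothetical labels chosen for illustration.

```python
# (mean, tolerance) for the overall score, copied from the table above.
EXPECTED_OVERALL = {
    "base": (0.46, 0.02),
    "sft": (0.54, 0.03),
    "grpo_step_80": (0.80, 0.04),
}


def within_expected(checkpoint, observed, slack=0.0):
    """True if the observed overall score lies inside mean ± (tolerance + slack)."""
    mean, tolerance = EXPECTED_OVERALL[checkpoint]
    return abs(observed - mean) <= tolerance + slack
```

For example, `within_expected("sft", 0.55)` is true, while an SFT score of 0.60 falls outside the published range and would warrant a closer look at the run.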
6. Reproducing the figures
After the eval cell completes, two PNGs are written to docs/figures/:
- training_curve.png: overall mean score vs GRPO step, with heuristic and naive baselines as dashed lines.
- training_curve_by_family.png: per-difficulty curves on the same axes.
Both are committed to the repo so judges who do not run the notebook can still see the results.
7. Test suite
pytest -q tests/
Should output:
113 passed in ~7s
Failures here indicate environment, grader, or training-pipeline regressions. See tests/conftest.py for fixture details.
8. Running the trained agent
After the notebook completes, the SFT and GRPO adapters are saved under:
- /content/sft-merchant-agent/final/ (or WORK_DIR/sft-merchant-agent/final/ locally)
- /content/grpo-merchant-agent/final/
To use the trained model in the inference path:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
base = AutoModelForCausalLM.from_pretrained(
'Qwen/Qwen2.5-3B-Instruct', torch_dtype=torch.float16, device_map='cuda',
)
sft = PeftModel.from_pretrained(base, 'sft-merchant-agent/final')
merged = sft.merge_and_unload()
trained = PeftModel.from_pretrained(merged, 'grpo-merchant-agent/final')
trained.eval()
tok = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-3B-Instruct')
# ... use trained as the policy in run_episode_with_text_policy()
See RUNNING_THE_AGENT.md for the full inference path.