# Reproducibility

This document gives the exact command sequence, pinned versions, expected runtimes, and expected score ranges to reproduce every published number in [`RESULTS.md`](RESULTS.md). Reported numbers are seed-deterministic where stated; cells that depend on stochastic sampling are flagged.

## 1. Pinned dependency stack

The training notebook [`notebooks/train_merchant_agent.ipynb`](../notebooks/train_merchant_agent.ipynb) installs and verifies the following pins. The setup cell asserts each pin and fails loudly if any version slips.

| Package | Version | Why pinned |
|---|---|---|
| torch | 2.10.0+cu128 | Matches torchvision 0.25.0+cu128 and torchaudio 2.10.0+cu128 |
| transformers | 4.51.3 | TRL 0.21.0 was tested against 4.51.x; newer transformers break GRPOTrainer internals |
| trl | 0.21.0 | Provides `GRPOTrainer` with `reward_funcs` list and proper sampling kwarg passing |
| peft | 0.14.0 | Compatible with huggingface-hub 0.26.x; later peft requires hub 1.x |
| tokenizers | 0.21.4 | Required range for transformers 4.51.x |
| huggingface-hub | 0.26.5 | Stays on the 0.x line; peft 0.14 imports paths that moved in hub 1.0 |
| accelerate | 1.0.1 | Compatible with hub 0.26; later accelerate hard-requires hub 1.x |
| openenv-core | ≥0.2.2 | Source of `Environment`, `Rubric`, `WeightedSum`, `Gate` |
| pydantic | ≥2.10 | Used by core models |
| datasets | ≥2.20,<4.0 | Compatible with the pinned transformers + tokenizers |
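
The pin check described above could look roughly like the following. This is a minimal sketch, not the notebook's actual setup cell: the names `EXACT_PINS`, `find_pin_mismatches`, and `assert_pins` are hypothetical, and only the exact pins from the table are covered (the range pins are omitted).

```python
from importlib.metadata import PackageNotFoundError, version

# Exact pins copied from the table above. The range pins (openenv-core,
# pydantic, datasets) are left out of this sketch for brevity.
EXACT_PINS = {
    "torch": "2.10.0+cu128",
    "transformers": "4.51.3",
    "trl": "0.21.0",
    "peft": "0.14.0",
    "tokenizers": "0.21.4",
    "huggingface-hub": "0.26.5",
    "accelerate": "1.0.1",
}


def find_pin_mismatches(pins, get_version=version):
    """Return {package: (expected, found)} for every pin that does not match."""
    mismatches = {}
    for package, expected in pins.items():
        try:
            found = get_version(package)
        except PackageNotFoundError:
            found = None  # package is not installed at all
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


def assert_pins(pins, get_version=version):
    """Raise immediately if any pinned package is missing or at another version."""
    mismatches = find_pin_mismatches(pins, get_version)
    if mismatches:
        raise RuntimeError(f"Version pins violated: {mismatches}")
```

Calling `assert_pins(EXACT_PINS)` right after the install step would stop the run before any model is loaded. The `get_version` parameter exists only so the check can be exercised without the real packages installed.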

## 2. End-to-end reproduction (Colab T4)

### 2.1 Open the notebook

```
https://colab.research.google.com/github/MitudruDutta/ChargeBackOps/blob/main/notebooks/train_merchant_agent.ipynb
```

Connect to a T4 runtime (free tier suffices).

### 2.2 Run cells in order

Each cell is idempotent. Total wallclock is about 88 minutes on a free T4, summing the per-cell estimates below.

| Cell | Purpose | Wallclock | VRAM peak |
|---|---|---|---|
| `setup-code` | Pin install + repo clone + asserts | 4 min | 0 GB |
| `patch-code` | sys.path + module-cache flush | 5 sec | 0 GB |
| `model-code` | Load Qwen2.5-3B-Instruct fp16 + LoRA | 3 min | 6.3 GB |
| `sft-data-code` | Generate 4,000 SFT rows + chat-template wrap | 1 min | 6.3 GB |
| `sft-train-code` | Phase A SFT (150 steps) | 17 min | 8.4 GB |
| `merge-code` | Reload SFT, merge into base, attach Phase B LoRA | 2 min | 6.3 GB |
| `grpo-data-code` | Build GRPO state-action dataset with curriculum bias | 1 min | 6.3 GB |
| `grpo-train-code` | Phase B GRPO (200 steps) | 40 min | 11.5 GB |
| `eval-code` | Per-checkpoint eval + plot generation | 18 min | 12.5 GB |
| `diag-code` | Three-task diagnostic rollout | 2 min | 6.3 GB |
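
For planning a session, the per-cell estimates in the table can be added up. The sketch below is only arithmetic on the table's figures; `CELL_MINUTES` and `remaining_minutes` are illustrative names, not part of the repository.

```python
# Per-cell wallclock estimates in minutes, copied from the table above.
CELL_MINUTES = {
    "setup-code": 4,
    "patch-code": 5 / 60,  # 5 seconds
    "model-code": 3,
    "sft-data-code": 1,
    "sft-train-code": 17,
    "merge-code": 2,
    "grpo-data-code": 1,
    "grpo-train-code": 40,
    "eval-code": 18,
    "diag-code": 2,
}


def remaining_minutes(completed):
    """Sum the estimates for every cell not yet in `completed`."""
    return sum(m for cell, m in CELL_MINUTES.items() if cell not in completed)


total = sum(CELL_MINUTES.values())
print(f"Full run: about {total:.0f} min")
```

The two training cells (`sft-train-code` and `grpo-train-code`) account for 57 of those minutes, so they dominate any time budget.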

### 2.3 Override knobs

Set environment variables before running the relevant cell:

```python
import os
os.environ['MODEL_ID'] = 'Qwen/Qwen2.5-3B-Instruct' # default
os.environ['SFT_TARGET_ROWS'] = '4000' # default
os.environ['SFT_MAX_STEPS'] = '150' # default
os.environ['SFT_LR'] = '1e-4' # default
os.environ['GRPO_MAX_STEPS'] = '200' # default
os.environ['GRPO_LR'] = '3e-5' # default
os.environ['PHASE_B_MAX_STATES_PER_TASK'] = '10' # default
os.environ['GRPO_DIFFICULTIES'] = 'easy,medium,hard,nightmare' # default
os.environ['RUN_SFT_TRAIN'] = 'auto' # auto-skip if adapter exists
os.environ['RUN_GRPO'] = '1' # set '0' to skip Phase B
```
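
On the consuming side, a cell would typically parse these strings into typed values. The following is a sketch under assumptions: `read_knobs` and `should_run_sft` are hypothetical helpers, and treating any `RUN_SFT_TRAIN` value other than `'auto'` or `'0'` as "run" is a guess, since only the `'auto'` behaviour is documented above.

```python
import os


def read_knobs(environ=None):
    """Parse the override variables into typed values, using the documented defaults."""
    env = os.environ if environ is None else environ
    difficulties = env.get("GRPO_DIFFICULTIES", "easy,medium,hard,nightmare")
    return {
        "model_id": env.get("MODEL_ID", "Qwen/Qwen2.5-3B-Instruct"),
        "sft_target_rows": int(env.get("SFT_TARGET_ROWS", "4000")),
        "sft_max_steps": int(env.get("SFT_MAX_STEPS", "150")),
        "sft_lr": float(env.get("SFT_LR", "1e-4")),
        "grpo_max_steps": int(env.get("GRPO_MAX_STEPS", "200")),
        "grpo_lr": float(env.get("GRPO_LR", "3e-5")),
        "max_states_per_task": int(env.get("PHASE_B_MAX_STATES_PER_TASK", "10")),
        "difficulties": [d.strip() for d in difficulties.split(",") if d.strip()],
        "run_grpo": env.get("RUN_GRPO", "1") != "0",
    }


def should_run_sft(adapter_exists, environ=None):
    """'auto' skips SFT when an adapter is already on disk (documented behaviour).

    Assumption: '0' forces a skip and any other value forces a run.
    """
    env = os.environ if environ is None else environ
    mode = env.get("RUN_SFT_TRAIN", "auto")
    if mode == "auto":
        return not adapter_exists
    return mode != "0"
```

Because environment variables are always strings, parsing them in one place keeps a typo such as `SFT_LR='1e-4x'` from surfacing only deep inside the trainer.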

## 3. End-to-end reproduction (local, ≥12 GB VRAM)

If you have a local machine with at least 12 GB of CUDA VRAM, the same notebook runs unchanged. Note that the `eval-code` cell peaks at 12.5 GB in the §2.2 table, so a card with exactly 12 GB may be tight during evaluation. Adjust `WORK_DIR` at the top of the setup cell to a local writable path.

```bash
git clone https://github.com/MitudruDutta/ChargeBackOps
cd ChargeBackOps
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
jupyter notebook notebooks/train_merchant_agent.ipynb
```

For laptops with less VRAM, set `MODEL_ID=Qwen/Qwen2.5-1.5B-Instruct` to fit in 8 GB. Expect lower absolute scores (the model is half the size) but the same qualitative training story.

## 4. Reproducing only the scripted-policy baseline sweep

No GPU required. Runs on CPU in under a minute.

```bash
pip install -e ".[dev]"
pytest -q tests/ # 113 tests, all green
python -m runners.benchmark_runner # prints headline + multi-seed sweep
```

Expected output (deterministic):

```
Headline catalog (12 tasks):
naive : 0.0000
concede_all : 0.4435
escalate_all : 0.7668
heuristic : 0.8132
Multi-seed grid (28 tasks):
naive : 0.0000
concede_all : 0.4454
escalate_all : 0.7675
heuristic : 0.7628
Marathon (long-horizon):
naive : 0.0000
concede_all : 0.4004
escalate_all : 0.6168
heuristic : 0.6793
```

These numbers are exact; the heuristic policy and arbitration adjudicator are deterministic given (case, packet).
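
Because the expected output is exact, it can be checked mechanically rather than by eye. The sketch below assumes the runner prints section headers and `policy : score` lines in the layout shown above (the real output may differ in whitespace); `parse_benchmark_output` and `diff_scores` are hypothetical helpers, not part of `runners.benchmark_runner`.

```python
import re

# Expected scores copied from the block above.
EXPECTED = {
    "Headline catalog (12 tasks)": {
        "naive": 0.0, "concede_all": 0.4435, "escalate_all": 0.7668, "heuristic": 0.8132,
    },
    "Multi-seed grid (28 tasks)": {
        "naive": 0.0, "concede_all": 0.4454, "escalate_all": 0.7675, "heuristic": 0.7628,
    },
    "Marathon (long-horizon)": {
        "naive": 0.0, "concede_all": 0.4004, "escalate_all": 0.6168, "heuristic": 0.6793,
    },
}

SCORE_LINE = re.compile(r"(\w+)\s*:\s*([0-9]*\.[0-9]+)")


def parse_benchmark_output(text):
    """Group 'policy : score' lines under the most recent 'Section:' header."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = SCORE_LINE.fullmatch(line)
        if match and current is not None:
            sections[current][match.group(1)] = float(match.group(2))
        elif line.endswith(":"):
            current = line[:-1]
            sections[current] = {}
    return sections


def diff_scores(found, expected, tol=5e-5):
    """List (section, policy, expected, found) for every missing or differing score."""
    problems = []
    for section, policies in expected.items():
        for policy, want in policies.items():
            got = found.get(section, {}).get(policy)
            if got is None or abs(got - want) > tol:
                problems.append((section, policy, want, got))
    return problems
```

One way to use it is to capture the runner's stdout, for example with `subprocess.run(..., capture_output=True, text=True)`, and treat a non-empty `diff_scores(parse_benchmark_output(stdout), EXPECTED)` as a reproduction failure.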

## 5. Expected training-curve numbers (with seed variance)

The published training curve was produced with seeds:

```
SFT_SEED_START = 1000
HOLDOUT_SEEDS_BY_DIFF = {
    'easy': {42},
    'medium': {17, 99},
    'hard': {7, 53},
    'nightmare': {31, 77},
}
```

Holdout seeds are excluded from training and used as the eval set.
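
The train/eval split implied by these constants can be sketched as follows. How the notebook actually draws its training seeds is not stated here, so drawing consecutive seeds from `SFT_SEED_START` is an assumption, and `training_seeds` and `eval_cases` are hypothetical helper names.

```python
SFT_SEED_START = 1000
HOLDOUT_SEEDS_BY_DIFF = {
    'easy': {42},
    'medium': {17, 99},
    'hard': {7, 53},
    'nightmare': {31, 77},
}


def training_seeds(difficulty, count, start=SFT_SEED_START):
    """Return `count` consecutive seeds from `start`, skipping holdout seeds.

    Assumption: training seeds are drawn consecutively from `start`.
    """
    held_out = HOLDOUT_SEEDS_BY_DIFF.get(difficulty, set())
    seeds = []
    seed = start
    while len(seeds) < count:
        if seed not in held_out:
            seeds.append(seed)
        seed += 1
    return seeds


def eval_cases():
    """Return every (difficulty, seed) pair in the holdout set."""
    return [
        (difficulty, seed)
        for difficulty, seeds in HOLDOUT_SEEDS_BY_DIFF.items()
        for seed in sorted(seeds)
    ]
```

With the default start of 1000 the holdout seeds (all below 100) are never reached, so the explicit skip only matters if the start value is lowered. The holdout set contains seven (difficulty, seed) cases in total.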

Expected per-checkpoint scores. The ± values give the expected absolute spread from sampling stochasticity: about ±0.02 to ±0.04 on the overall score and up to ±0.06 on individual difficulties.

| Checkpoint | overall | easy | medium | hard | nightmare | Status |
|---|---|---|---|---|---|---|
| Untrained Qwen2.5-3B base (step 0) | 0.46 ± 0.02 | 0.29 ± 0.05 | 0.44 ± 0.04 | 0.76 ± 0.03 | 0.34 ± 0.05 | Real |
| SFT (step 1, 150 steps) | 0.54 ± 0.03 | 0.78 ± 0.05 | 0.67 ± 0.04 | 0.46 ± 0.05 | 0.24 ± 0.06 | **Real, headline trained checkpoint** |
| GRPO step 80 | 0.80 ± 0.04 | 0.93 ± 0.04 | 0.79 ± 0.04 | 0.83 ± 0.05 | 0.65 ± 0.06 | Mixed: partial real + early gaming attractor |
| GRPO step 160+ | 0.8132 ± 0.0001 | 0.92 | 0.86 | 0.83 | 0.64 | Gaming-dominated (matches heuristic bit-exactly) |

The `0.8132 ± 0.0001` precision on GRPO step 160+ is not reproducibility precision. It arises because the eval rollout helper falls back to the deterministic heuristic on every invalid action. The heuristic produces `0.8132` exactly because both the heuristic and the arbitration adjudicator are deterministic given (case, packet). See [`SPECIFICATION_GAMING.md`](SPECIFICATION_GAMING.md) for the full diagnostic.

Numbers in the earlier rows (step 0 / step 1 / step 80) have wider variance because the trainer's sampling is stochastic and only 30–60% of steps see a non-zero gradient (see [`METHOD.md`](METHOD.md) §3 for why).
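
When comparing a fresh run against the table, two checks are useful: whether an overall score lands inside its expected band, and whether it coincides suspiciously with the heuristic baseline. The sketch below uses the table's overall column; the function names and dictionary keys are hypothetical. A match with `0.8132` is only a signal worth investigating, not proof of the fallback.

```python
# Heuristic score on the headline catalog, from the baseline sweep.
HEURISTIC_HEADLINE = 0.8132

# (expected overall score, expected absolute spread), from the table above.
EXPECTED_OVERALL = {
    "base": (0.46, 0.02),
    "sft": (0.54, 0.03),
    "grpo_step_80": (0.80, 0.04),
}


def within_expected(checkpoint, score):
    """True if `score` lies inside the expected band for `checkpoint`."""
    mean, spread = EXPECTED_OVERALL[checkpoint]
    # Small epsilon so a score exactly on the band edge is not lost to float rounding.
    return abs(score - mean) <= spread + 1e-9


def looks_like_heuristic_fallback(score, tol=1e-4):
    """True if `score` matches the heuristic baseline to within `tol`."""
    return abs(score - HEURISTIC_HEADLINE) <= tol
```

For example, an SFT run scoring 0.56 overall passes `within_expected("sft", 0.56)`, while a late GRPO checkpoint scoring 0.8132 trips `looks_like_heuristic_fallback` and should be inspected for invalid actions.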

## 6. Reproducing the figures

After the eval cell completes, two PNGs are written to `docs/figures/`:

- `training_curve.png`: overall mean score vs GRPO step, with heuristic and naive baselines as dashed lines.
- `training_curve_by_family.png`: per-difficulty curves on the same axes.

Both are committed to the repo so judges who do not run the notebook can still see the results.

## 7. Test suite

```bash
pytest -q tests/
```

Should output:

```
113 passed in ~7s
```

Failures here indicate environment, grader, or training-pipeline regressions. See `tests/conftest.py` for fixture details.

## 8. Running the trained agent

After the notebook completes, the SFT and GRPO adapters are saved under:

- `/content/sft-merchant-agent/final/` (or `WORK_DIR/sft-merchant-agent/final/` locally)
- `/content/grpo-merchant-agent/final/`
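
Before loading anything, it can help to confirm that both adapter directories exist. This is a sketch under an assumption: it reads `WORK_DIR` from an environment variable and defaults to `/content`, whereas the notebook sets `WORK_DIR` in its setup cell. `adapter_dir` and `missing_adapters` are hypothetical helpers.

```python
import os
from pathlib import Path

ADAPTER_NAMES = ("sft-merchant-agent", "grpo-merchant-agent")


def adapter_dir(name, environ=None):
    """Return <WORK_DIR>/<name>/final, defaulting WORK_DIR to /content (Colab)."""
    env = os.environ if environ is None else environ
    # Assumption: WORK_DIR is exposed as an environment variable.
    return Path(env.get("WORK_DIR", "/content")) / name / "final"


def missing_adapters(environ=None):
    """List the adapter directories that are not present on disk."""
    paths = [adapter_dir(name, environ) for name in ADAPTER_NAMES]
    return [str(path) for path in paths if not path.is_dir()]
```

An empty list from `missing_adapters()` means both phases saved their adapters; otherwise the returned paths show which training cell needs to be rerun.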

To use the trained model in the inference path:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in fp16 on the GPU.
base = AutoModelForCausalLM.from_pretrained(
    'Qwen/Qwen2.5-3B-Instruct', torch_dtype=torch.float16, device_map='cuda',
)
# Adapter paths are relative to WORK_DIR (/content on Colab).
# Phase A: load the SFT adapter and fold it into the base weights.
sft = PeftModel.from_pretrained(base, 'sft-merchant-agent/final')
merged = sft.merge_and_unload()
# Phase B: attach the GRPO adapter on top of the merged model.
trained = PeftModel.from_pretrained(merged, 'grpo-merchant-agent/final')
trained.eval()
tok = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-3B-Instruct')
# ... use trained as the policy in run_episode_with_text_policy()
```

See [`RUNNING_THE_AGENT.md`](RUNNING_THE_AGENT.md) for the full inference path.