ChargeBackOps / docs /REPRODUCIBILITY.md

mitudrudutta

Enhance documentation and address specification gaming in ChargebackOps

a92af86 about 1 month ago

	# Reproducibility

	This document gives the exact command sequence, pinned versions, expected runtimes, and expected score ranges to reproduce every published number in [`RESULTS.md`](RESULTS.md). Reported numbers are seed-deterministic where stated; cells that depend on stochastic sampling are flagged.

	## 1. Pinned dependency stack

The training notebook [`notebooks/train_merchant_agent.ipynb`](../notebooks/train_merchant_agent.ipynb) installs and verifies the following pins. The setup cell asserts each pin and fails loudly if any version slips.

| Package | Version | Why pinned |
|---|---|---|
| torch | 2.10.0+cu128 | Matches torchvision 0.25.0+cu128 and torchaudio 2.10.0+cu128 |
| transformers | 4.51.3 | TRL 0.21.0 was tested against 4.51.x; newer transformers break GRPOTrainer internals |
| trl | 0.21.0 | Provides `GRPOTrainer` with `reward_funcs` list and proper sampling kwarg passing |
| peft | 0.14.0 | Compatible with huggingface-hub 0.26.x; later peft requires hub 1.x |
| tokenizers | 0.21.4 | Required range for transformers 4.51.x |
| huggingface-hub | 0.26.5 | A 0.x release; peft 0.14 imports paths that moved in hub 1.0 |
| accelerate | 1.0.1 | Compatible with hub 0.26; later accelerate hard-requires hub 1.x |
| openenv-core | ≥0.2.2 | Source of `Environment`, `Rubric`, `WeightedSum`, `Gate` |
| pydantic | ≥2.10 | Used by core models |
| datasets | ≥2.20,<4.0 | Compatible with the pinned transformers + tokenizers |
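The setup cell's assertion code is not reproduced in this document. A minimal sketch of such a check, assuming exact `==` pins and using the standard-library `importlib.metadata`, could look like the following. The names `EXACT_PINS` and `find_pin_mismatches` are illustrative and are not taken from the notebook.

```python
from importlib import metadata

# Exact pins copied from the table above. Range pins (openenv-core, pydantic,
# datasets) are left out of this sketch.
EXACT_PINS = {
    "torch": "2.10.0+cu128",
    "transformers": "4.51.3",
    "trl": "0.21.0",
    "peft": "0.14.0",
    "tokenizers": "0.21.4",
    "huggingface-hub": "0.26.5",
    "accelerate": "1.0.1",
}


def find_pin_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, installed)} for every pin that does not match."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches
```

Calling `find_pin_mismatches(EXACT_PINS)` in the Colab runtime and asserting that the result is empty would stop the cell as soon as any version has drifted.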

	## 2. End-to-end reproduction (Colab T4)

	### 2.1 Open the notebook

	```
	https://colab.research.google.com/github/MitudruDutta/ChargeBackOps/blob/main/notebooks/train_merchant_agent.ipynb
	```

	Connect to a T4 runtime (free tier suffices).

	### 2.2 Run cells in order

Each cell is idempotent. The per-cell estimates below sum to roughly 88 minutes of wallclock on a free T4.

| Cell | Purpose | Wallclock | VRAM peak |
|---|---|---|---|
| `setup-code` | Pin install + repo clone + asserts | 4 min | 0 GB |
| `patch-code` | sys.path + module-cache flush | 5 sec | 0 GB |
| `model-code` | Load Qwen2.5-3B-Instruct fp16 + LoRA | 3 min | 6.3 GB |
| `sft-data-code` | Generate 4,000 SFT rows + chat-template wrap | 1 min | 6.3 GB |
| `sft-train-code` | Phase A SFT (150 steps) | 17 min | 8.4 GB |
| `merge-code` | Reload SFT, merge into base, attach Phase B LoRA | 2 min | 6.3 GB |
| `grpo-data-code` | Build GRPO state-action dataset with curriculum bias | 1 min | 6.3 GB |
| `grpo-train-code` | Phase B GRPO (200 steps) | 40 min | 11.5 GB |
| `eval-code` | Per-checkpoint eval + plot generation | 18 min | 12.5 GB |
| `diag-code` | Three-task diagnostic rollout | 2 min | 6.3 GB |
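To see which cells would not fit on a given card, the VRAM column can be checked against a memory budget. This is a small sketch built only from the figures in the table. The helper name `cells_over_budget` is hypothetical, and the 15 GB figure is an assumed stand-in for the usable memory on a T4, not a number from this document.

```python
# VRAM peaks (GB) copied from the table above.
VRAM_PEAK_GB = {
    "setup-code": 0.0,
    "patch-code": 0.0,
    "model-code": 6.3,
    "sft-data-code": 6.3,
    "sft-train-code": 8.4,
    "merge-code": 6.3,
    "grpo-data-code": 6.3,
    "grpo-train-code": 11.5,
    "eval-code": 12.5,
    "diag-code": 6.3,
}


def cells_over_budget(budget_gb, peaks=VRAM_PEAK_GB):
    """Return the cells whose listed peak exceeds the given VRAM budget."""
    return [cell for cell, peak in peaks.items() if peak > budget_gb]


print(cells_over_budget(15.0))  # [] under the assumed 15 GB budget
print(cells_over_budget(12.0))  # ['eval-code']
```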

	### 2.3 Override knobs

	Set environment variables before running the relevant cell:

	```python
	import os
	os.environ['MODEL_ID'] = 'Qwen/Qwen2.5-3B-Instruct' # default
	os.environ['SFT_TARGET_ROWS'] = '4000' # default
	os.environ['SFT_MAX_STEPS'] = '150' # default
	os.environ['SFT_LR'] = '1e-4' # default
	os.environ['GRPO_MAX_STEPS'] = '200' # default
	os.environ['GRPO_LR'] = '3e-5' # default
	os.environ['PHASE_B_MAX_STATES_PER_TASK'] = '10' # default
	os.environ['GRPO_DIFFICULTIES'] = 'easy,medium,hard,nightmare' # default
	os.environ['RUN_SFT_TRAIN'] = 'auto' # auto-skip if adapter exists
	os.environ['RUN_GRPO'] = '1' # set '0' to skip Phase B
	```
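Environment variables are always strings, so the consuming cell has to convert them to typed values. How the notebook does this is not shown here. The sketch below is one plausible way to read the knobs with the defaults listed above, and `DEFAULTS` and `read_knobs` are hypothetical names.

```python
import os

# Defaults copied from the override block above.
DEFAULTS = {
    "MODEL_ID": "Qwen/Qwen2.5-3B-Instruct",
    "SFT_TARGET_ROWS": "4000",
    "SFT_MAX_STEPS": "150",
    "SFT_LR": "1e-4",
    "GRPO_MAX_STEPS": "200",
    "GRPO_LR": "3e-5",
    "PHASE_B_MAX_STATES_PER_TASK": "10",
    "GRPO_DIFFICULTIES": "easy,medium,hard,nightmare",
    "RUN_SFT_TRAIN": "auto",
    "RUN_GRPO": "1",
}


def read_knobs(env=None):
    """Read the override knobs from a mapping (os.environ by default) as typed values."""
    env = os.environ if env is None else env
    raw = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    return {
        "model_id": raw["MODEL_ID"],
        "sft_target_rows": int(raw["SFT_TARGET_ROWS"]),
        "sft_max_steps": int(raw["SFT_MAX_STEPS"]),
        "sft_lr": float(raw["SFT_LR"]),
        "grpo_max_steps": int(raw["GRPO_MAX_STEPS"]),
        "grpo_lr": float(raw["GRPO_LR"]),
        "phase_b_max_states_per_task": int(raw["PHASE_B_MAX_STATES_PER_TASK"]),
        # Comma-separated list; blank entries are dropped.
        "grpo_difficulties": [d.strip() for d in raw["GRPO_DIFFICULTIES"].split(",") if d.strip()],
        "run_sft_train": raw["RUN_SFT_TRAIN"],  # 'auto' or an explicit setting
        "run_grpo": raw["RUN_GRPO"] != "0",     # '0' skips Phase B
    }
```

Passing a plain dict, for example `read_knobs({"GRPO_MAX_STEPS": "50"})`, makes it easy to try a shorter run without touching the real environment.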

	## 3. End-to-end reproduction (local, ≥12 GB VRAM)

On a local machine with at least 12 GB of CUDA VRAM, the same notebook runs unchanged, though the 12.5 GB peak listed for `eval-code` in §2.2 leaves no headroom on a card with exactly 12 GB. Set `WORK_DIR` at the top of the setup cell to a local writable path.

	```bash
	git clone https://github.com/MitudruDutta/ChargeBackOps
	cd ChargeBackOps
	python -m venv .venv && source .venv/bin/activate
	pip install -e ".[dev]"
	jupyter notebook notebooks/train_merchant_agent.ipynb
	```

	For laptops with less VRAM, set `MODEL_ID=Qwen/Qwen2.5-1.5B-Instruct` to fit in 8 GB. Expect lower absolute scores (the model is half the size) but the same qualitative training story.

	## 4. Reproducing only the scripted-policy baseline sweep

	No GPU required. Runs on CPU in under a minute.

	```bash
	pip install -e ".[dev]"
	pytest -q tests/ # 113 tests, all green
	python -m runners.benchmark_runner # prints headline + multi-seed sweep
	```

	Expected output (deterministic):

	```
	Headline catalog (12 tasks):
	naive : 0.0000
	concede_all : 0.4435
	escalate_all : 0.7668
	heuristic : 0.8132

	Multi-seed grid (28 tasks):
	naive : 0.0000
	concede_all : 0.4454
	escalate_all : 0.7675
	heuristic : 0.7628

	Marathon (long-horizon):
	naive : 0.0000
	concede_all : 0.4004
	escalate_all : 0.6168
	heuristic : 0.6793
	```

	These numbers are exact; the heuristic policy and arbitration adjudicator are deterministic given (case, packet).
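Because the baseline numbers are exact, a local run can be compared against them mechanically. The sketch below assumes the runner prints exactly the layout shown above: a `Section:` header followed by `policy : score` lines. `parse_benchmark_output` and `diff_scores` are hypothetical helpers and are not part of the repo.

```python
# Expected scores copied from the output block above.
EXPECTED = {
    "Headline catalog (12 tasks)": {
        "naive": 0.0, "concede_all": 0.4435, "escalate_all": 0.7668, "heuristic": 0.8132,
    },
    "Multi-seed grid (28 tasks)": {
        "naive": 0.0, "concede_all": 0.4454, "escalate_all": 0.7675, "heuristic": 0.7628,
    },
    "Marathon (long-horizon)": {
        "naive": 0.0, "concede_all": 0.4004, "escalate_all": 0.6168, "heuristic": 0.6793,
    },
}


def parse_benchmark_output(text):
    """Parse 'Section:' headers followed by 'policy : score' lines into nested dicts."""
    sections, current = {}, None
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        if not sep:
            continue  # blank line or anything without a colon
        if not value.strip():
            current = sections.setdefault(name.strip(), {})  # header ends with ':'
        elif current is not None:
            try:
                current[name.strip()] = float(value)
            except ValueError:
                pass  # ignore lines that are not 'policy : score'
    return sections


def diff_scores(observed, expected=EXPECTED, tol=5e-5):
    """Return (section, policy, expected, observed) for every missing or mismatched score."""
    problems = []
    for section, policies in expected.items():
        for policy, want in policies.items():
            got = observed.get(section, {}).get(policy)
            if got is None or abs(got - want) > tol:
                problems.append((section, policy, want, got))
    return problems
```

The tolerance of `5e-5` reflects the four printed decimals. An empty list from `diff_scores` means the captured output matches the published baseline.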

	## 5. Expected training-curve numbers (with seed variance)

	The published training curve was produced with seeds:

```
SFT_SEED_START = 1000
HOLDOUT_SEEDS_BY_DIFF = {
    'easy': {42},
    'medium': {17, 99},
    'hard': {7, 53},
    'nightmare': {31, 77},
}
```

	Holdout seeds are excluded from training and used as the eval set.
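The notebook's seed-generation code is not shown in this document. As a minimal sketch of the exclusion rule, assuming training seeds are drawn consecutively from `SFT_SEED_START`, the hypothetical helper `training_seeds` below skips any seed held out for the given difficulty.

```python
from itertools import count, islice

SFT_SEED_START = 1000
HOLDOUT_SEEDS_BY_DIFF = {
    "easy": {42},
    "medium": {17, 99},
    "hard": {7, 53},
    "nightmare": {31, 77},
}


def training_seeds(difficulty, n, start=SFT_SEED_START):
    """Return the first n seeds >= start that are not held out for this difficulty."""
    holdout = HOLDOUT_SEEDS_BY_DIFF.get(difficulty, set())
    candidates = (seed for seed in count(start) if seed not in holdout)
    return list(islice(candidates, n))


# With the published values the two sets cannot collide: every holdout seed
# is below 100 and training seeds start at 1000.
for diff, holdout in HOLDOUT_SEEDS_BY_DIFF.items():
    assert holdout.isdisjoint(training_seeds(diff, 4000))
```

The filter only matters if someone lowers the start seed, in which case it keeps the eval set clean instead of silently leaking holdout cases into training.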

Expected per-checkpoint scores. Sampling stochasticity gives roughly ±0.02 to ±0.06 absolute variation, listed per cell:

| Checkpoint | overall | easy | medium | hard | nightmare | Status |
|---|---|---|---|---|---|---|
| Untrained Qwen2.5-3B base (step 0) | 0.46 ± 0.02 | 0.29 ± 0.05 | 0.44 ± 0.04 | 0.76 ± 0.03 | 0.34 ± 0.05 | Real |
| SFT (step 1, 150 steps) | 0.54 ± 0.03 | 0.78 ± 0.05 | 0.67 ± 0.04 | 0.46 ± 0.05 | 0.24 ± 0.06 | Real, headline trained checkpoint |
| GRPO step 80 | 0.80 ± 0.04 | 0.93 ± 0.04 | 0.79 ± 0.04 | 0.83 ± 0.05 | 0.65 ± 0.06 | Mixed: partial real + early gaming attractor |
| GRPO step 160+ | 0.8132 ± 0.0001 | 0.92 | 0.86 | 0.83 | 0.64 | Gaming-dominated (matches heuristic bit-exactly) |
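A reproduced run can be checked against the first three rows with a simple range test. The sketch below copies the means and tolerances from the table. The checkpoint keys (`base`, `sft`, `grpo_80`) and the helper `out_of_range` are hypothetical labels chosen for this example. The step 160+ row is left out because the table lists no per-difficulty tolerance for it.

```python
# (expected mean, tolerance) per column, copied from the table above.
EXPECTED_SCORES = {
    "base": {
        "overall": (0.46, 0.02), "easy": (0.29, 0.05), "medium": (0.44, 0.04),
        "hard": (0.76, 0.03), "nightmare": (0.34, 0.05),
    },
    "sft": {
        "overall": (0.54, 0.03), "easy": (0.78, 0.05), "medium": (0.67, 0.04),
        "hard": (0.46, 0.05), "nightmare": (0.24, 0.06),
    },
    "grpo_80": {
        "overall": (0.80, 0.04), "easy": (0.93, 0.04), "medium": (0.79, 0.04),
        "hard": (0.83, 0.05), "nightmare": (0.65, 0.06),
    },
}


def out_of_range(checkpoint, observed, expected=EXPECTED_SCORES):
    """Return {column: observed} for scores outside the expected mean ± tolerance."""
    bad = {}
    for column, (mean, tol) in expected[checkpoint].items():
        value = observed[column]
        if abs(value - mean) > tol + 1e-9:  # small epsilon for float rounding
            bad[column] = value
    return bad


observed_sft = {"overall": 0.55, "easy": 0.80, "medium": 0.66, "hard": 0.45, "nightmare": 0.35}
print(out_of_range("sft", observed_sft))  # {'nightmare': 0.35}
```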

The `± 0.0001` on GRPO step 160+ does not reflect tight reproducibility. It arises because the eval rollout helper falls back to the deterministic heuristic on every invalid action, so the reported score is the heuristic's own. The heuristic produces `0.8132` exactly because both the heuristic and the arbitration adjudicator are deterministic given (case, packet). See [`SPECIFICATION_GAMING.md`](SPECIFICATION_GAMING.md) for the full diagnostic.
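The fallback effect can be illustrated with a toy rollout. This is a sketch of the general mechanism, not the repo's eval helper. The action set, the state shape, and every function name below are hypothetical.

```python
VALID_ACTIONS = {"concede", "escalate", "submit_evidence"}  # hypothetical action set


def heuristic_policy(state):
    """Stand-in deterministic heuristic: the same state always yields the same action."""
    return "escalate" if state["amount"] >= 100 else "concede"


def act_with_fallback(model_policy, state):
    """Use the model's action if it is valid; otherwise fall back to the heuristic."""
    action = model_policy(state)
    if action in VALID_ACTIONS:
        return action, False
    return heuristic_policy(state), True


def always_invalid_policy(state):
    """A policy whose output never parses into a valid action."""
    return "<unparseable output>"


states = [{"amount": 50}, {"amount": 250}]
rollout = [act_with_fallback(always_invalid_policy, s) for s in states]
heuristic_only = [heuristic_policy(s) for s in states]

# Every action came from the fallback, so the rollout equals the heuristic's.
assert [action for action, _ in rollout] == heuristic_only
assert all(fell_back for _, fell_back in rollout)
```

Under these assumptions a policy that only emits invalid actions is scored exactly as the heuristic. Reporting the fallback count next to the score is one way to make that visible.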

Numbers in the earlier rows (step 0 / step 1 / step 80) have wider variance because sampling is stochastic and, in Phase B, only 30–60% of GRPO steps see a non-zero gradient (see [`METHOD.md`](METHOD.md) §3 for why). Only the step 80 row is a GRPO checkpoint; steps 0 and 1 are the untrained base and the SFT model.

	## 6. Reproducing the figures

	After the eval cell completes, two PNGs are written to `docs/figures/`:

	- `training_curve.png` — overall mean score vs GRPO step, with heuristic and naive baselines as dashed lines.
	- `training_curve_by_family.png` — per-difficulty curves on the same axes.

	Both are committed to the repo so judges who do not run the notebook can still see the results.

	## 7. Test suite

	```bash
	pytest -q tests/
	```

	Should output:

	```
	113 passed in ~7s
	```

	Failures here indicate environment, grader, or training-pipeline regressions. See `tests/conftest.py` for fixture details.

	## 8. Running the trained agent

	After the notebook completes, the SFT and GRPO adapters are saved under:

- `/content/sft-merchant-agent/final/` (or `WORK_DIR/sft-merchant-agent/final/` locally)
- `/content/grpo-merchant-agent/final/` (or `WORK_DIR/grpo-merchant-agent/final/` locally)

	To use the trained model in the inference path:

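```python
import os

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Directory the notebook saved the adapters under: '/content' on Colab,
# or the WORK_DIR you set in the setup cell when running locally.
WORK_DIR = '/content'

base = AutoModelForCausalLM.from_pretrained(
    'Qwen/Qwen2.5-3B-Instruct', torch_dtype=torch.float16, device_map='cuda',
)
# Apply the Phase A (SFT) adapter and merge it into the base weights.
sft = PeftModel.from_pretrained(base, os.path.join(WORK_DIR, 'sft-merchant-agent/final'))
merged = sft.merge_and_unload()
# Attach the Phase B (GRPO) adapter on top of the merged model.
trained = PeftModel.from_pretrained(merged, os.path.join(WORK_DIR, 'grpo-merchant-agent/final'))
trained.eval()

tok = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-3B-Instruct')
# ... use trained as the policy in run_episode_with_text_policy()
```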

	See [`RUNNING_THE_AGENT.md`](RUNNING_THE_AGENT.md) for the full inference path.