# BioRLHF Training on Cayuga HPC
**Cluster:** Cornell Cayuga HPC
**Target:** GPU training with Mistral-7B + LoRA (SFT, DPO, GRPO)
---
## Quick Start
```bash
# 1. SSH to Cayuga
ssh jak4013@cayuga-login1
# 2. Submit a GRPO training job
bash -l -c 'sbatch scripts/run_grpo_full.sh'
# 3. Monitor
squeue -u $USER
tail -f logs/grpo_full_*.log
```
---
## Step 1: Transfer Files to HPC
From your local Mac:
```bash
rsync -avz --progress \
/Users/jak4013/Dropbox/Bioinformatics/Claude/BioRLHF/biorlhf/ \
jak4013@cayuga-login1:/athena/cayuga_0003/scratch/users/jak4013/otsuka/training/BioRLHF/
```
---
## Step 2: Set Up Conda Environment (First Time Only)
```bash
# SSH to Cayuga
ssh jak4013@cayuga-login1
# Source conda (non-interactive shell requires explicit sourcing)
. /home/fs01/jak4013/miniconda3/miniconda3/etc/profile.d/conda.sh
# Create environment
conda create -n biorlhf python=3.10 -y
conda activate biorlhf
# Install PyTorch with CUDA support
conda install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia -y
# Install training dependencies
pip install "transformers>=4.36.0" "peft>=0.6.0" "trl>=0.14.0"
pip install "bitsandbytes>=0.41.0" "accelerate>=0.24.0" "datasets>=2.14.0"
pip install wandb scipy scikit-learn sentencepiece jsonlines
# Verify GPU access (on a GPU node)
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
```
---
## Step 3: Training Options
### Option A: GRPO Training (Recommended)
GRPO with verifier-based multi-reward training from an SFT checkpoint:
```bash
# Submit via SLURM (use login shell for correct sbatch version)
bash -l -c 'sbatch scripts/run_grpo_full.sh'
```
**Key config** (`configs/grpo_full_v2.json`):
- G=16 generations per prompt
- V1-V4 verifiers with weights [0.35, 0.30, 0.15, 0.20]
- beta=0.02, 2 iterations per batch
- ~48h on A40
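The values above map onto the config JSON roughly as follows. This is a sketch: the numbers come from the list above, but the field names are a guess at this repo's schema and should be checked against the actual `configs/grpo_full_v2.json`:

```json
{
  "num_generations": 16,
  "reward_weights": [0.35, 0.30, 0.15, 0.20],
  "beta": 0.02,
  "num_iterations": 2
}
```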
### Option B: SFT Training
```bash
# Interactive session
srun -p scu-gpu --gres=gpu:1 --mem=48G -c 8 --time=4:00:00 --account=cayuga_0003 --pty bash
# Activate environment
. /home/fs01/jak4013/miniconda3/miniconda3/etc/profile.d/conda.sh
conda activate biorlhf
# Run SFT
cd /athena/cayuga_0003/scratch/users/jak4013/otsuka/training/BioRLHF
biorlhf-train --model mistralai/Mistral-7B-v0.3 --dataset data/kmp_sft_final.json --output ./my_sft_model
```
### Option C: Interactive GPU Session
```bash
# Request GPU
srun -p scu-gpu --gres=gpu:1 --mem=48G -c 8 --time=4:00:00 --account=cayuga_0003 --pty bash
# Activate environment
. /home/fs01/jak4013/miniconda3/miniconda3/etc/profile.d/conda.sh
conda activate biorlhf
# Navigate and run
cd /athena/cayuga_0003/scratch/users/jak4013/otsuka/training/BioRLHF
biorlhf-grpo --config configs/grpo_full_v2.json
```
---
## Step 4: Monitor Training
```bash
# Check job status
squeue -u $USER
# Tail logs
tail -f logs/grpo_full_*.log
# GPU usage (on compute node)
nvidia-smi
# WandB dashboard
# https://wandb.ai/jangkeun-weill-cornell-medicine/biogrpo
```
---
## Environment Details
| Component | Version |
|-----------|---------|
| Python | 3.10 |
| PyTorch | 2.5.1+cu121 |
| Transformers | 4.57.3 |
| TRL | 0.26.2 |
| PEFT | 0.18.0 |
---
## GPU Options on Cayuga
| GPU | VRAM | Best For | SLURM Flag |
|-----|------|----------|------------|
| A40 | 48GB | Standard GRPO/SFT with QLoRA | `--gres=gpu:1` |
| A100 | 80GB | Larger batches, faster training | `--gres=gpu:a100:1` |
---
## Important Notes
### SLURM Version
The default `sbatch` at `/usr/bin/sbatch` is outdated (v22.05.2). Use `bash -l -c 'sbatch ...'` so the module-loaded version (slurm/25.05.0) is picked up instead.
### Conda in Non-Interactive Shells
`source ~/.bashrc` has no effect in non-interactive SSH sessions. Always source the conda init script directly:
```bash
. /home/fs01/jak4013/miniconda3/miniconda3/etc/profile.d/conda.sh
conda activate biorlhf
```
### SFT Checkpoint Symlink
The SFT model adapter is stored at:
```
/athena/cayuga_0003/scratch/users/jak4013/otsuka/training/biorlhf/kmp_sft_model_final
```
GRPO scripts auto-symlink this into the working directory.
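The manual equivalent of that auto-symlink looks like the following, sketched with toy local paths rather than the real Athena paths:

```shell
# Toy stand-ins for the checkpoint and working directories
mkdir -p /tmp/sft_demo/checkpoint /tmp/sft_demo/workdir
# -s: symbolic link, -f: replace an existing link, -n: treat an existing link as a file
ln -sfn /tmp/sft_demo/checkpoint /tmp/sft_demo/workdir/kmp_sft_model_final
ls -l /tmp/sft_demo/workdir/
```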
### Batch Size with G=16
Both `per_device_eval_batch_size` and `generation_batch_size` must be divisible by `num_generations`. The TRL parameter is `generation_batch_size`, NOT `per_device_generation_batch_size`.
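A quick pre-submit sanity check for this constraint (a hypothetical helper, not part of the repo):

```python
def check_grpo_batch_sizes(num_generations: int,
                           per_device_eval_batch_size: int,
                           generation_batch_size: int) -> None:
    """Raise if either batch size is not a multiple of num_generations (G)."""
    for name, value in [
        ("per_device_eval_batch_size", per_device_eval_batch_size),
        ("generation_batch_size", generation_batch_size),
    ]:
        if value % num_generations != 0:
            raise ValueError(
                f"{name}={value} must be divisible by num_generations={num_generations}"
            )

# G=16: eval batch 16 and generation batch 32 both pass; 24 would raise
check_grpo_batch_sizes(16, 16, 32)
```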
### Eval Performance
GRPOTrainer's eval loop generates completions sequentially (~3 min/sample). With 107 eval samples, each eval pass takes ~5.3h. Set `eval_steps=9999` to skip in-training eval; run post-hoc evaluation instead.
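The arithmetic behind that estimate:

```python
# Sequential generation at ~3 min per eval sample
minutes_per_sample = 3
num_eval_samples = 107
eval_hours = num_eval_samples * minutes_per_sample / 60
print(f"~{eval_hours:.1f} h per eval pass")
```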
---
## Troubleshooting
### "CUDA out of memory"
Reduce batch size or gradient accumulation in the config JSON:
```json
{
"batch_size": 1,
"gradient_accumulation_steps": 16
}
```
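Lowering `batch_size` while raising `gradient_accumulation_steps` keeps the effective batch size constant at lower peak memory, at the cost of more forward/backward passes per optimizer step. A quick check of the numbers in the config above:

```python
# Effective batch = per-device batch x accumulation steps (single GPU)
per_device_batch = 1
grad_accum_steps = 16
effective_batch = per_device_batch * grad_accum_steps
print(effective_batch)  # 16 sequences still contribute to each optimizer step
```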
### "No GPU available"
```bash
nvidia-smi # Check GPU allocation
squeue -u $USER # Verify you're on a GPU node
```
### LoRA adapter loading fails
The SFT checkpoint is a LoRA adapter, not a full model. Load base model first:
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.3")
model = PeftModel.from_pretrained(base, "path/to/kmp_sft_model_final")
model = model.merge_and_unload() # Merge for GRPO training
```
---
## Key Paths
| Path | Description |
|------|-------------|
| `/athena/cayuga_0003/scratch/users/jak4013/otsuka/training/BioRLHF/` | Working directory |
| `/athena/cayuga_0003/scratch/users/jak4013/otsuka/training/biorlhf/kmp_sft_model_final` | SFT checkpoint |
| `/athena/cayuga_0003/scratch/users/jak4013/otsuka/data/` | Data directory |
| `/home/fs01/jak4013/miniconda3/miniconda3/etc/profile.d/conda.sh` | Conda init script |