# BioRLHF Training on Cayuga HPC
**Cluster:** Cornell Cayuga HPC
**Target:** GPU training with Mistral-7B + LoRA (SFT, DPO, GRPO)
---
## Quick Start
```bash
# 1. SSH to Cayuga
ssh jak4013@cayuga-login1
# 2. Submit a GRPO training job
bash -l -c 'sbatch scripts/run_grpo_full.sh'
# 3. Monitor
squeue -u $USER
tail -f logs/grpo_full_*.log
```
---
## Step 1: Transfer Files to HPC
From your local Mac:
```bash
rsync -avz --progress \
/Users/jak4013/Dropbox/Bioinformatics/Claude/BioRLHF/biorlhf/ \
jak4013@cayuga-login1:/athena/cayuga_0003/scratch/users/jak4013/otsuka/training/BioRLHF/
```
---
## Step 2: Set Up Conda Environment (First Time Only)
```bash
# SSH to Cayuga
ssh jak4013@cayuga-login1
# Source conda (non-interactive shell requires explicit sourcing)
. /home/fs01/jak4013/miniconda3/miniconda3/etc/profile.d/conda.sh
# Create environment
conda create -n biorlhf python=3.10 -y
conda activate biorlhf
# Install PyTorch with CUDA support
conda install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia -y
# Install training dependencies
pip install "transformers>=4.36.0" "peft>=0.6.0" "trl>=0.14.0"
pip install "bitsandbytes>=0.41.0" "accelerate>=0.24.0" "datasets>=2.14.0"
pip install wandb scipy scikit-learn sentencepiece jsonlines
# Verify GPU access (on a GPU node)
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
```
---
## Step 3: Training Options
### Option A: GRPO Training (Recommended)
GRPO with verifier-based multi-reward training from an SFT checkpoint:
```bash
# Submit via SLURM (use login shell for correct sbatch version)
bash -l -c 'sbatch scripts/run_grpo_full.sh'
```
**Key config** (`configs/grpo_full_v2.json`):
- G=16 generations per prompt
- V1-V4 verifiers with weights [0.35, 0.30, 0.15, 0.20]
- beta=0.02, 2 iterations per batch
- ~48h on A40
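The values above map onto the config JSON roughly as follows. This is a sketch: the numbers come from the list above, but the field names are a guess at this repo's schema and should be checked against the actual `configs/grpo_full_v2.json`:

```json
{
  "num_generations": 16,
  "reward_weights": [0.35, 0.30, 0.15, 0.20],
  "beta": 0.02,
  "num_iterations": 2
}
```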
### Option B: SFT Training
```bash
# Interactive session
srun -p scu-gpu --gres=gpu:1 --mem=48G -c 8 --time=4:00:00 --account=cayuga_0003 --pty bash
# Activate environment
. /home/fs01/jak4013/miniconda3/miniconda3/etc/profile.d/conda.sh
conda activate biorlhf
# Run SFT
cd /athena/cayuga_0003/scratch/users/jak4013/otsuka/training/BioRLHF
biorlhf-train --model mistralai/Mistral-7B-v0.3 --dataset data/kmp_sft_final.json --output ./my_sft_model
```
### Option C: Interactive GPU Session
```bash
# Request GPU
srun -p scu-gpu --gres=gpu:1 --mem=48G -c 8 --time=4:00:00 --account=cayuga_0003 --pty bash
# Activate environment
. /home/fs01/jak4013/miniconda3/miniconda3/etc/profile.d/conda.sh
conda activate biorlhf
# Navigate and run
cd /athena/cayuga_0003/scratch/users/jak4013/otsuka/training/BioRLHF
biorlhf-grpo --config configs/grpo_full_v2.json
```
---
## Step 4: Monitor Training
```bash
# Check job status
squeue -u $USER
# Tail logs
tail -f logs/grpo_full_*.log
# GPU usage (on compute node)
nvidia-smi
# WandB dashboard
# https://wandb.ai/jangkeun-weill-cornell-medicine/biogrpo
```
---
## Environment Details
| Component | Version |
|-----------|---------|
| Python | 3.10 |
| PyTorch | 2.5.1+cu121 |
| Transformers | 4.57.3 |
| TRL | 0.26.2 |
| PEFT | 0.18.0 |
---
## GPU Options on Cayuga
| GPU | VRAM | Best For | SLURM Flag |
|-----|------|----------|------------|
| A40 | 48GB | Standard GRPO/SFT with QLoRA | `--gres=gpu:1` |
| A100 | 80GB | Larger batches, faster training | `--gres=gpu:a100:1` |
---
## Important Notes
### SLURM Version
The default `sbatch` at `/usr/bin/sbatch` is outdated (v22.05.2). Use `bash -l -c 'sbatch ...'` so the module-loaded version (slurm/25.05.0) is picked up instead.
### Conda in Non-Interactive Shells
`source ~/.bashrc` has no effect in non-interactive SSH sessions. Always source the conda init script directly:
```bash
. /home/fs01/jak4013/miniconda3/miniconda3/etc/profile.d/conda.sh
conda activate biorlhf
```
### SFT Checkpoint Symlink
The SFT model adapter is stored at:
```
/athena/cayuga_0003/scratch/users/jak4013/otsuka/training/biorlhf/kmp_sft_model_final
```
GRPO scripts auto-symlink this into the working directory.
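The manual equivalent of that auto-symlink looks like the following, sketched with toy local paths rather than the real Athena paths:

```shell
# Toy stand-ins for the checkpoint and working directories
mkdir -p /tmp/sft_demo/checkpoint /tmp/sft_demo/workdir
# -s: symbolic link, -f: replace an existing link, -n: treat an existing link as a file
ln -sfn /tmp/sft_demo/checkpoint /tmp/sft_demo/workdir/kmp_sft_model_final
ls -l /tmp/sft_demo/workdir/
```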
### Batch Size with G=16
Both `per_device_eval_batch_size` and `generation_batch_size` must be divisible by `num_generations`. The TRL parameter is `generation_batch_size`, NOT `per_device_generation_batch_size`.
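A quick pre-submit sanity check for this constraint (a hypothetical helper, not part of the repo):

```python
def check_grpo_batch_sizes(num_generations: int,
                           per_device_eval_batch_size: int,
                           generation_batch_size: int) -> None:
    """Raise if either batch size is not a multiple of num_generations (G)."""
    for name, value in [
        ("per_device_eval_batch_size", per_device_eval_batch_size),
        ("generation_batch_size", generation_batch_size),
    ]:
        if value % num_generations != 0:
            raise ValueError(
                f"{name}={value} must be divisible by num_generations={num_generations}"
            )

# G=16: eval batch 16 and generation batch 32 both pass; 24 would raise
check_grpo_batch_sizes(16, 16, 32)
```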
### Eval Performance
GRPOTrainer's eval loop generates completions sequentially (~3 min/sample). With 107 eval samples, each eval pass takes ~5.3h. Set `eval_steps=9999` to skip in-training eval; run post-hoc evaluation instead.
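The arithmetic behind that estimate:

```python
# Sequential generation at ~3 min per eval sample
minutes_per_sample = 3
num_eval_samples = 107
eval_hours = num_eval_samples * minutes_per_sample / 60
print(f"~{eval_hours:.1f} h per eval pass")
```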
---
## Troubleshooting
### "CUDA out of memory"
Reduce batch size or gradient accumulation in the config JSON:
```json
{
"batch_size": 1,
"gradient_accumulation_steps": 16
}
```
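Lowering `batch_size` while raising `gradient_accumulation_steps` keeps the effective batch size constant at lower peak memory, at the cost of more forward/backward passes per optimizer step. A quick check of the numbers in the config above:

```python
# Effective batch = per-device batch x accumulation steps (single GPU)
per_device_batch = 1
grad_accum_steps = 16
effective_batch = per_device_batch * grad_accum_steps
print(effective_batch)  # 16 sequences still contribute to each optimizer step
```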
### "No GPU available"
```bash
nvidia-smi # Check GPU allocation
squeue -u $USER # Verify you're on a GPU node
```
### LoRA adapter loading fails
The SFT checkpoint is a LoRA adapter, not a full model. Load base model first:
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.3")
model = PeftModel.from_pretrained(base, "path/to/kmp_sft_model_final")
model = model.merge_and_unload() # Merge for GRPO training
```
---
## Key Paths
| Path | Description |
|------|-------------|
| `/athena/cayuga_0003/scratch/users/jak4013/otsuka/training/BioRLHF/` | Working directory |
| `/athena/cayuga_0003/scratch/users/jak4013/otsuka/training/biorlhf/kmp_sft_model_final` | SFT checkpoint |
| `/athena/cayuga_0003/scratch/users/jak4013/otsuka/data/` | Data directory |
| `/home/fs01/jak4013/miniconda3/miniconda3/etc/profile.d/conda.sh` | Conda init script |