Omini3D / optimize_speed.md

maxmo2009

Sync from local: code + epoch-110 checkpoint, clean README

2af0e94 verified 16 days ago

17.5 kB

	# Speed Optimization: `OM_train_3modes_opt.py`

	Optimized 3-mode training pipeline (diffusion + contrastive + registration) that is mathematically equivalent to the original `OM_train_3modes.py`. All optimizations preserve identical loss values, gradients, and weight updates.

	## Files Changed

	\| File \| Type \| Description \|
	\|------\|------\|-------------\|
	\| `Diffusion/diffuser_opt.py` \| New \| Optimized `DeformDDPM` subclass (split loop, on-device tensors, OptSTN, inference_mode, recover() fix) \|
	\| `Diffusion/networks_opt.py` \| New \| `OptSTN` (register_buffer) + `OptRecMulModMutAttnNet` (cached tensors) \|
	\| `Diffusion/losses_opt.py` \| New \| Optimized `LNCC`/`MSLNCC` with `register_buffer` \|
	\| `OM_train_3modes_opt.py` \| New \| Optimized training script (uses all opt modules) \|
	\| `tests/test_3modes_opt_equivalence.py` \| New \| 7-suite equivalence test \|
	\| `tests/compare_3modes_speed.py` \| New \| XPU speed comparison script \|
	\| `bash_compare_orig.sh` \| New \| SLURM job script for original pipeline benchmark \|
	\| `bash_compare_opt.sh` \| New \| SLURM job script for optimized pipeline benchmark \|

	## Optimizations

	### 1. Hoist `clone().detach()` outside recovery loop (`diffuser_opt.py`)

	Location: `DeformDDPM.diff_recover()` — the registration recovery loop iterates 8-16 times.

	Before (in `diffuser.py`):
	```python
	for i in time_steps:
	...
	img_rec = self.img_stn(img_org.clone().detach(), ddf_comp) # clone every iteration
	msk_rec = self.msk_stn(msk_org.clone().detach(), ddf_comp) # clone every iteration
	```

	After (in `diffuser_opt.py`):
	```python
	# OPT: hoist clone().detach() outside the loop — grid_sample is read-only
	img_org_ref = img_org.clone().detach()
	msk_org_ref = msk_org.clone().detach() if msk_org is not None else None

	for i in time_steps:
	...
	img_rec = self.img_stn(img_org_ref, ddf_comp) # reuse pre-cloned ref
	msk_rec = self.msk_stn(msk_org_ref, ddf_comp) # reuse pre-cloned ref
	```

	Savings: N-1 fewer `clone().detach()` calls per registration step (N=8-16). Safe because `grid_sample` is read-only on its input.

	### 2. Remove redundant `* 0` on zero tensors (`diffuser_opt.py`)

	Location: `DeformDDPM._random_ddf_generate()`

	Before:
	```python
	ddf = torch.zeros(ctl_ddf_sz) * 0 # multiplying zeros by zero
	dddf = torch.zeros(ctl_ddf_sz) * 0
	```

	After:
	```python
	ddf = torch.zeros(ctl_ddf_sz) # already zero
	dddf = torch.zeros(ctl_ddf_sz)
	```

	### 3. Skip `clone()` for unconditional path (`diffuser_opt.py`)

	Location: `DeformDDPM.proc_cond_img()` — handles conditioning type selection. The `'uncon'` path (weight 3/8) replaces the image entirely with noise, so cloning the input is unnecessary.

	Before: Always clones `img` before checking proc_type.

	After: Checks for `'uncon'` first and returns noise map directly without cloning.

	### 4. `register_buffer` for loss kernels (`losses_opt.py`)

	Location: `LNCC.__init__()` and `MSLNCC`

	Before (in `losses.py`):
	```python
	self.sum_filt = torch.ones(...) # plain attribute
	# Then in lncc():
	self.sum_filt = self.sum_filt.to(I.device) # manual .to() every call
	```

	After (in `losses_opt.py`):
	```python
	self.register_buffer('kernels', kernels) # auto device transfer
	self.register_buffer('sum_filt', self._build_kernel(std=0.0))
	# No .to() needed in lncc() — buffers follow module.to()
	```

	Savings: Eliminates per-call `.to(device)` overhead for convolution kernels.

	### 5. Fixed registration timestep count (`OM_train_3modes_opt.py`)

	Location: Registration training block, `FIXED_T_REGIST_LEN = 16`

	Before (in `OM_train_3modes.py`):
	```python
	select_timestep = np.random.randint(8, 17) # variable 8-16 steps
	```

	After:
	```python
	FIXED_T_REGIST_LEN = 16
	select_timestep = min(FIXED_T_REGIST_LEN, len(t_pool)) # always 16 steps
	```

	Why: Variable loop lengths cause XPU graph recompilation on every change. This was previously measured to cause ~8x slowdown on Intel XPU. Fixed length avoids recompilation entirely.

	Impact: Only visible with real training (variable `scale_regist` values). Not testable with dummy data benchmarks.

	### 6. DataLoader I/O overlap (`OM_train_3modes_opt.py`)

	```python
	NUM_WORKERS = 4
	PIN_MEMORY = True

	train_loader = DataLoader(
	dataset,
	num_workers=num_workers, # OPT: parallel data loading
	pin_memory=use_pin_memory, # OPT: faster host→device transfer
	persistent_workers=num_workers > 0, # OPT: keep workers alive between epochs
	)
	```

	Impact: Overlaps CPU data loading with GPU computation. Only visible with real NIfTI data (disk I/O), not with in-memory dummy data.

	### 7. `optimizer.zero_grad(set_to_none=True)` (`OM_train_3modes_opt.py`)

	```python
	optimizer.zero_grad(set_to_none=True) # OPT: avoids memset to zero
	```

	Sets gradients to `None` instead of zeroing them, avoiding a memory write pass. Microsecond-level savings per step.

	### 8. Remove redundant `.to(device)` (`OM_train_3modes_opt.py`)

	Removed a duplicate `x0 = x0.to(hyp_parameters["device"])` that occurred after the tensor was already on device.

	---

	### Deep Optimizations (Phase 2)

	The following optimizations target the core compute path — the network and STN operations that run ~300 times per training step.

	### 9. Split inner DDF composition loop (`diffuser_opt.py`)

	Location: `DeformDDPM._random_ddf_generate()` — the inner loop that self-composes deformation fields.

	Problem: The original loop runs `max(mul_num_ddf, mul_num_dvf)` iterations for both `ddf` and `dddf`. When `j >= mul_num_dvf`, `flag[1]=0` so the dddf update becomes `dddf = 0 + STN(dddf, 0)` — an identity warp through `F.grid_sample` that wastes a full 3D resampling call.

	For timestep t=40: `mul_num_ddf=80`, `mul_num_dvf=9`, so 71 out of 80 dddf iterations are wasted no-ops.

	Before (in `diffuser.py`):
	```python
	for j in range(int(torch.max(mul_num[0]).numpy())):
	flag = [(n > j).int().to(self.device) for n in mul_num]
	ddf = dvf0 * flag[0] + self.ddf_stn_rec(ddf, dvf0 * flag[0]) # always runs
	dddf = dvf * flag[1] + self.ddf_stn_rec(dddf, dvf * flag[1]) # no-op when flag[1]=0
	```

	After (in `diffuser_opt.py`):
	```python
	mul_num_ddf_val = int(torch.max(mul_num[0]).item())
	mul_num_dvf_val = int(torch.max(mul_num[1]).item())
	joint_iters = min(mul_num_ddf_val, mul_num_dvf_val)

	# Phase 1: both active
	for j in range(joint_iters):
	ddf = dvf0 + self.ddf_stn_rec(ddf, dvf0)
	dddf = dvf + self.ddf_stn_rec(dddf, dvf)
	# Phase 2: only ddf active (skips wasted dddf grid_sample)
	for j in range(joint_iters, mul_num_ddf_val):
	ddf = dvf0 + self.ddf_stn_rec(ddf, dvf0)
	# Phase 3: only dddf active (rare case)
	for j in range(joint_iters, mul_num_dvf_val):
	dddf = dvf + self.ddf_stn_rec(dddf, dvf)
	```

	Savings: Eliminates `mul_num_ddf - mul_num_dvf` wasted `grid_sample` calls per composition. For t=40: saves 71 × 3D grid_sample operations.

	Note: The skipped identity warps introduce ~1e-10 floating-point drift per iteration in the original. The split loop avoids this drift, producing less error than the original (difference well below test tolerance of 1e-5).

	### 10. Create DDF/DVF tensors on device (`diffuser_opt.py`)

	Location: `DeformDDPM._random_ddf_generate()` and `_multiscale_dvf_generate()`

	Before:
	```python
	ddf = torch.zeros(ctl_ddf_sz) * 0 # CPU
	dvf_comp = torch.randn(...) # CPU
	dvf = dvf.to(self.device) # explicit transfer
	```

	After:
	```python
	ddf = torch.zeros(ctl_ddf_sz, device=self.device) # directly on GPU
	dvf_comp = torch.randn(..., device=self.device) * _v_scale # directly on GPU
	# No .to(self.device) needed
	```

	Savings: Eliminates CPU→GPU transfer of 3D volumes (e.g., [2, 3, 32, 32, 32]) and avoids the CPU memory allocation entirely.

	### 11. `OptSTN` — `register_buffer` for spatial transformer (`networks_opt.py`)

	> (Phase 2 deep optimization)

	Location: `STN.__init__()`, `STN.resample()`, `STN.forward()`

	Problem: Original `STN` stores `ref_grid`, `max_sz` as plain Python attributes. Every call to `resample()` and `forward()` does `.to(device)` on these tensors — 3× per call. With 16 recovery iterations and 3 STN instances (ddf_stn_rec, img_stn, msk_stn), this amounts to ~150 unnecessary CPU→GPU transfers per registration step.

	Before (in `networks.py`):
	```python
	class STN(nn.Module):
	def __init__(self, ...):
	self.max_sz = torch.Tensor(...) # plain attribute
	self.ref_grid = torch.reshape(...) # plain attribute

	def resample(self, vol, ddf, ...):
	ref = self.ref_grid.to(vol.device) # .to() every call
	max_sz = self.max_sz.to(vol.device) # .to() every call
	```

	After (in `networks_opt.py`):
	```python
	class OptSTN(STN):
	def __init__(self, ...):
	nn.Module.__init__(self)
	self.register_buffer('max_sz', ...) # auto device transfer
	self.register_buffer('ref_grid', ...) # auto device transfer
	self.register_buffer('_img_sz_for_resample', ...) # pre-computed

	def resample(self, vol, ddf, ...):
	ref = self.ref_grid # already on correct device
	max_sz = self.max_sz # already on correct device
	```

	Savings: ~150 `.to(device)` calls eliminated per registration step. Buffers auto-transfer when `module.to(device)` is called.

	### 12. `OptRecMulModMutAttnNet` — cached tensors for network (`networks_opt.py`)

	Location: `RecMulModMutAttnNet.resample()`, `RecMulModMutAttnNet.forward()`

	Problem: The production network's `resample()` method (called 2× per forward pass, with rec_num=2) creates a NumPy array, converts to PyTorch tensor, then transfers to GPU every call:
	```python
	def resample(self, vol, ddf, ...):
	max_sz = torch.Tensor(np.reshape(np.array(self.max_sz), ...)).to(self.device)
	```

	Similarly, `forward()` recreates `self.img_sz` tensor and checks/transfers `self.ref_grid` every call.

	With 16 recovery iterations × 2 resample calls × rec_num=2 = ~80 NumPy→GPU transfers per registration step.

	After (in `networks_opt.py`):
	```python
	class OptRecMulModMutAttnNet(RecMulModMutAttnNet):
	def _ensure_cache(self, img_sz, device):
	key = (tuple(img_sz), device)
	if key == self._cached_input_key:
	return # cache hit — skip all allocation
	self._cached_max_sz_tensor = ... # create ONCE
	self._cached_img_sz_tensor = ... # create ONCE

	def resample(self, vol, ddf, ...):
	img_sz = self._cached_img_sz_tensor # reuse cached tensor
	```

	Savings: ~80 NumPy→Torch→GPU chains reduced to ~1 per input size change (typically once per epoch).

	### 13. Fix `recover()` t tensor CPU bug (`diffuser_opt.py`)

	Location: `DeformDDPM.recover()` in `diffuser.py` line 332

	Bug: Original code creates a tensor then discards the `.to()` result:
	```python
	t = torch.tensor(t) # created on CPU
	t.to(x.device) # result DISCARDED — t stays on CPU!
	```

	This means every timestep tensor in the recovery loop stays on CPU, requiring implicit CPU→GPU transfer during network forward pass.

	Fix (in `diffuser_opt.py`):
	```python
	def recover(self, x, y, t, rec_num=2, text=None):
	if isinstance(t, torch.Tensor):
	if t.device != x.device:
	t = t.to(x.device) # actually assign the result
	else:
	t = torch.tensor(t, device=x.device) # create directly on device
	```

	### 14. `torch.no_grad()` for frozen iterations (`diffuser_opt.py`)

	Location: `DeformDDPM.diff_recover()` recovery loop

	Before (in `diffuser.py`): No gradient context management — all 16 iterations compute full autograd metadata (the original code has `no_grad` but the trainable check uses list membership on unhashable types, so it's fragile).

	After: Only the last 2 iterations (trainable) run with full autograd. The other 14 use `torch.no_grad()` with a robust index-based trainability check:
	```python
	trainable_start_idx = num_time_steps - len(trainable_iterations)
	for step_idx, i in enumerate(time_steps):
	t = torch.tensor([i], device=self.device) # OPT: direct device creation
	if step_idx >= trainable_start_idx:
	pre_dvf_I = self.recover(...) # full autograd
	else:
	with torch.no_grad(): # skip autograd for frozen steps
	pre_dvf_I = self.recover(...)
	```

	Note: `torch.inference_mode()` would be faster but produces inference tensors that can't participate in backward when composed with trainable iterations via `ddf_comp`.

	### 15. Pre-compute timestep tensors on device (`diffuser_opt.py`)

	Location: `DeformDDPM.diff_recover()` recovery loop

	Before:
	```python
	t = torch.tensor(np.array([i])).to(self.device) # NumPy → CPU tensor → GPU
	```

	After:
	```python
	t = torch.tensor([i], device=self.device) # create directly on GPU
	```

	Savings: Eliminates 16 NumPy allocations + CPU→GPU transfers per registration step.

	## Equivalence Verification

	### Unit tests (`tests/test_3modes_opt_equivalence.py`)

	7 test suites, all passing on CPU:

	\| Suite \| What it tests \|
	\|-------\|---------------\|
	\| Loss Equivalence \| LNCC, MSLNCC forward + backward identical \|
	\| DeformDDPM Methods \| `proc_cond_img` (all 7 types), `_random_ddf_generate` \|
	\| Mode 1: Diffusion \| Single diffusion step: loss, gradients, weights \|
	\| Mode 2: Contrastive \| Contrastive step: `img_embd`, cosine loss, clipped gradients \|
	\| Mode 3: Registration \| `diff_recover` loop: DDF, reconstructed image, all sub-losses \|
	\| Full Sequence \| All 3 modes sequentially, final weight comparison \|
	\| Checkpoint Compat \| Original checkpoint loads into optimized and vice versa \|

	Run with:
	```bash
	python -m pytest tests/test_3modes_opt_equivalence.py -v
	```

	### XPU comparison (`tests/compare_3modes_speed.py`)

	Head-to-head comparison on Intel Data Center GPU Max 1550, BATCHSIZE=1, IMG_SIZE=128, 10 steps (SLURM jobs 24541332/24541333):

	\| Step \| ORIG (s) \| OPT (s) \| Delta \|
	\|------\|----------\|---------\|-------\|
	\| 0 \| 8.44 \| 8.42 \| -0.02 \|
	\| 1 \| 7.60 \| 7.58 \| -0.02 \|
	\| 2 \| 12.36 \| 12.33 \| -0.03 \|
	\| 3 \| 8.69 \| 8.67 \| -0.02 \|
	\| 4 \| 12.15 \| 12.12 \| -0.03 \|
	\| 5 \| 9.56 \| 9.53 \| -0.03 \|
	\| 6 \| 10.71 \| 10.66 \| -0.05 \|
	\| 7 \| 7.39 \| 7.40 \| +0.01 \|
	\| 8 \| 10.18 \| 10.12 \| -0.06 \|
	\| 9 \| 9.19 \| 9.18 \| -0.01 \|
	\| Total \| 96.27 \| 96.01 \| -0.26 \|

	- Speedup: ~0.3% — within noise margin
	- Registration losses: Very close (within expected drift from separate runs)
	- BATCHSIZE=2: Both pipelines OOM at step 9 during `loss_regist.backward()` (registration backward graph = 16 recovery × rec_num=2 = 32 UNet passes)

	## Why Timing Is Similar

	All optimizations (Phase 1 + Phase 2, optimizations 1-15) collectively showed negligible per-step speedup (~0.3%). Analysis:

	1. Phase 1 (opts 1-8): Targeted script-level overhead (clone, .to(), zero_grad). These were already no-ops internally — PyTorch's `.to(device)` on a same-device tensor is a fast pointer return.

	2. Phase 2 (opts 9-15): Targeted deeper compute path:
	- ~150 STN `.to(device)` calls eliminated → but same-device `.to()` was already cheap
	- ~80 NumPy→GPU tensor chains eliminated → marginal savings, tensors were small
	- 71+ no-op grid_sample calls eliminated (split loop) → but at control-point resolution (32³), these were fast
	- 14/16 recovery iterations with no_grad → saves autograd metadata but not the forward pass itself

	3. The real bottleneck: UNet forward/backward passes dominate. Each registration step runs 32 full `RecMulModMutAttnNet` forward passes (16 timesteps × rec_num=2) through a multi-head attention UNet with channels [1,16,32,64,128,256] at 128³ resolution. This GPU kernel execution (~95% of step time) is unaffected by Python-level optimizations.

	Remaining speedup opportunities (require deeper changes):
	- `torch.compile()` — JIT-compile UNet forward pass, fuse kernels
	- Mixed precision (`bf16`) — Intel Max 1550 has strong bf16 support, ~2x throughput
	- Gradient checkpointing — reduce memory (fixes BATCHSIZE=2 OOM), enables larger batches
	- Reduce `rec_num` or recovery timesteps — algorithmic change, affects quality

	## Usage

	```bash
	# Optimized training (drop-in replacement for OM_train_3modes.py)
	python OM_train_3modes_opt.py -C Config/config_om.yaml

	# With dummy data for testing
	python OM_train_3modes_opt.py -C Config/config_om.yaml --dummy-samples 20

	# Override workers
	python OM_train_3modes_opt.py -C Config/config_om.yaml --num-workers 8

	# Run equivalence tests
	python -m pytest tests/test_3modes_opt_equivalence.py -v

	# Run XPU speed comparison (via SLURM)
	sbatch bash_compare_opt.sh
	```

	## Checkpoint Compatibility

	Optimized and original checkpoints are fully cross-compatible:
	- Original `.pth` loads into `diffuser_opt.DeformDDPM` (via inheritance)
	- Optimized `.pth` loads into `diffuser.DeformDDPM` (with `strict=False`, ignoring `register_buffer` keys)