Omini3D / optimize_speed.md

Sync from local: code + epoch-110 checkpoint, clean README

2af0e94 verified 16 days ago

17.5 kB

Speed Optimization: `OM_train_3modes_opt.py`

Optimized 3-mode training pipeline (diffusion + contrastive + registration) that is mathematically equivalent to the original OM_train_3modes.py. All optimizations preserve identical loss values, gradients, and weight updates.

Files Changed

File	Type	Description
`Diffusion/diffuser_opt.py`	New	Optimized `DeformDDPM` subclass (split loop, on-device tensors, OptSTN, inference_mode, recover() fix)
`Diffusion/networks_opt.py`	New	`OptSTN` (register_buffer) + `OptRecMulModMutAttnNet` (cached tensors)
`Diffusion/losses_opt.py`	New	Optimized `LNCC`/`MSLNCC` with `register_buffer`
`OM_train_3modes_opt.py`	New	Optimized training script (uses all opt modules)
`tests/test_3modes_opt_equivalence.py`	New	7-suite equivalence test
`tests/compare_3modes_speed.py`	New	XPU speed comparison script
`bash_compare_orig.sh`	New	SLURM job script for original pipeline benchmark
`bash_compare_opt.sh`	New	SLURM job script for optimized pipeline benchmark

Optimizations

1. Hoist `clone().detach()` outside recovery loop (`diffuser_opt.py`)

Location: DeformDDPM.diff_recover() — the registration recovery loop iterates 8-16 times.

Before (in diffuser.py):

for i in time_steps:
    ...
    img_rec = self.img_stn(img_org.clone().detach(), ddf_comp)  # clone every iteration
    msk_rec = self.msk_stn(msk_org.clone().detach(), ddf_comp)  # clone every iteration

After (in diffuser_opt.py):

# OPT: hoist clone().detach() outside the loop — grid_sample is read-only
img_org_ref = img_org.clone().detach()
msk_org_ref = msk_org.clone().detach() if msk_org is not None else None

for i in time_steps:
    ...
    img_rec = self.img_stn(img_org_ref, ddf_comp)   # reuse pre-cloned ref
    msk_rec = self.msk_stn(msk_org_ref, ddf_comp)   # reuse pre-cloned ref

Savings: N-1 fewer clone().detach() calls per registration step (N=8-16). Safe because grid_sample is read-only on its input.

2. Remove redundant `* 0` on zero tensors (`diffuser_opt.py`)

Location: DeformDDPM._random_ddf_generate()

Before:

ddf = torch.zeros(ctl_ddf_sz) * 0    # multiplying zeros by zero
dddf = torch.zeros(ctl_ddf_sz) * 0

After:

ddf = torch.zeros(ctl_ddf_sz)         # already zero
dddf = torch.zeros(ctl_ddf_sz)

3. Skip `clone()` for unconditional path (`diffuser_opt.py`)

Location: DeformDDPM.proc_cond_img() — handles conditioning type selection. The 'uncon' path (weight 3/8) replaces the image entirely with noise, so cloning the input is unnecessary.

Before: Always clones img before checking proc_type.

After: Checks for 'uncon' first and returns noise map directly without cloning.

4. `register_buffer` for loss kernels (`losses_opt.py`)

Location: LNCC.__init__() and MSLNCC

Before (in losses.py):

self.sum_filt = torch.ones(...)           # plain attribute
# Then in lncc():
self.sum_filt = self.sum_filt.to(I.device)  # manual .to() every call

After (in losses_opt.py):

self.register_buffer('kernels', kernels)     # auto device transfer
self.register_buffer('sum_filt', self._build_kernel(std=0.0))
# No .to() needed in lncc() — buffers follow module.to()

Savings: Eliminates per-call .to(device) overhead for convolution kernels.

5. Fixed registration timestep count (`OM_train_3modes_opt.py`)

Location: Registration training block, FIXED_T_REGIST_LEN = 16

Before (in OM_train_3modes.py):

select_timestep = np.random.randint(8, 17)  # variable 8-16 steps

After:

FIXED_T_REGIST_LEN = 16
select_timestep = min(FIXED_T_REGIST_LEN, len(t_pool))  # always 16 steps

Why: Variable loop lengths cause XPU graph recompilation on every change. This was previously measured to cause ~8x slowdown on Intel XPU. Fixed length avoids recompilation entirely.

Impact: Only visible with real training (variable scale_regist values). Not testable with dummy data benchmarks.

6. DataLoader I/O overlap (`OM_train_3modes_opt.py`)

NUM_WORKERS = 4
PIN_MEMORY = True

train_loader = DataLoader(
    dataset,
    num_workers=num_workers,           # OPT: parallel data loading
    pin_memory=use_pin_memory,         # OPT: faster host→device transfer
    persistent_workers=num_workers > 0, # OPT: keep workers alive between epochs
)

Impact: Overlaps CPU data loading with GPU computation. Only visible with real NIfTI data (disk I/O), not with in-memory dummy data.

7. `optimizer.zero_grad(set_to_none=True)` (`OM_train_3modes_opt.py`)

optimizer.zero_grad(set_to_none=True)   # OPT: avoids memset to zero

Sets gradients to None instead of zeroing them, avoiding a memory write pass. Microsecond-level savings per step.

8. Remove redundant `.to(device)` (`OM_train_3modes_opt.py`)

Removed a duplicate x0 = x0.to(hyp_parameters["device"]) that occurred after the tensor was already on device.

Deep Optimizations (Phase 2)

The following optimizations target the core compute path — the network and STN operations that run ~300 times per training step.

9. Split inner DDF composition loop (`diffuser_opt.py`)

Location: DeformDDPM._random_ddf_generate() — the inner loop that self-composes deformation fields.

Problem: The original loop runs max(mul_num_ddf, mul_num_dvf) iterations for both ddf and dddf. When j >= mul_num_dvf, flag[1]=0 so the dddf update becomes dddf = 0 + STN(dddf, 0) — an identity warp through F.grid_sample that wastes a full 3D resampling call.

For timestep t=40: mul_num_ddf=80, mul_num_dvf=9, so 71 out of 80 dddf iterations are wasted no-ops.

Before (in diffuser.py):

for j in range(int(torch.max(mul_num[0]).numpy())):
    flag = [(n > j).int().to(self.device) for n in mul_num]
    ddf = dvf0 * flag[0] + self.ddf_stn_rec(ddf, dvf0 * flag[0])   # always runs
    dddf = dvf * flag[1] + self.ddf_stn_rec(dddf, dvf * flag[1])   # no-op when flag[1]=0

After (in diffuser_opt.py):

mul_num_ddf_val = int(torch.max(mul_num[0]).item())
mul_num_dvf_val = int(torch.max(mul_num[1]).item())
joint_iters = min(mul_num_ddf_val, mul_num_dvf_val)

# Phase 1: both active
for j in range(joint_iters):
    ddf = dvf0 + self.ddf_stn_rec(ddf, dvf0)
    dddf = dvf + self.ddf_stn_rec(dddf, dvf)
# Phase 2: only ddf active (skips wasted dddf grid_sample)
for j in range(joint_iters, mul_num_ddf_val):
    ddf = dvf0 + self.ddf_stn_rec(ddf, dvf0)
# Phase 3: only dddf active (rare case)
for j in range(joint_iters, mul_num_dvf_val):
    dddf = dvf + self.ddf_stn_rec(dddf, dvf)

Savings: Eliminates mul_num_ddf - mul_num_dvf wasted grid_sample calls per composition. For t=40: saves 71 × 3D grid_sample operations.

Note: The skipped identity warps introduce ~1e-10 floating-point drift per iteration in the original. The split loop avoids this drift, producing less error than the original (difference well below test tolerance of 1e-5).

10. Create DDF/DVF tensors on device (`diffuser_opt.py`)

Location: DeformDDPM._random_ddf_generate() and _multiscale_dvf_generate()

Before:

ddf = torch.zeros(ctl_ddf_sz) * 0   # CPU
dvf_comp = torch.randn(...)          # CPU
dvf = dvf.to(self.device)            # explicit transfer

After:

ddf = torch.zeros(ctl_ddf_sz, device=self.device)         # directly on GPU
dvf_comp = torch.randn(..., device=self.device) * _v_scale # directly on GPU
# No .to(self.device) needed

Savings: Eliminates CPU→GPU transfer of 3D volumes (e.g., [2, 3, 32, 32, 32]) and avoids the CPU memory allocation entirely.

11. `OptSTN` — `register_buffer` for spatial transformer (`networks_opt.py`)

(Phase 2 deep optimization)

Location: STN.__init__(), STN.resample(), STN.forward()

Problem: Original STN stores ref_grid, max_sz as plain Python attributes. Every call to resample() and forward() does .to(device) on these tensors — 3× per call. With 16 recovery iterations and 3 STN instances (ddf_stn_rec, img_stn, msk_stn), this amounts to ~150 unnecessary CPU→GPU transfers per registration step.

Before (in networks.py):

class STN(nn.Module):
    def __init__(self, ...):
        self.max_sz = torch.Tensor(...)      # plain attribute
        self.ref_grid = torch.reshape(...)    # plain attribute

    def resample(self, vol, ddf, ...):
        ref = self.ref_grid.to(vol.device)   # .to() every call
        max_sz = self.max_sz.to(vol.device)  # .to() every call

After (in networks_opt.py):

class OptSTN(STN):
    def __init__(self, ...):
        nn.Module.__init__(self)
        self.register_buffer('max_sz', ...)           # auto device transfer
        self.register_buffer('ref_grid', ...)          # auto device transfer
        self.register_buffer('_img_sz_for_resample', ...)  # pre-computed

    def resample(self, vol, ddf, ...):
        ref = self.ref_grid       # already on correct device
        max_sz = self.max_sz      # already on correct device

Savings: ~150 .to(device) calls eliminated per registration step. Buffers auto-transfer when module.to(device) is called.

12. `OptRecMulModMutAttnNet` — cached tensors for network (`networks_opt.py`)

Location: RecMulModMutAttnNet.resample(), RecMulModMutAttnNet.forward()

Problem: The production network's resample() method (called 2× per forward pass, with rec_num=2) creates a NumPy array, converts to PyTorch tensor, then transfers to GPU every call:

def resample(self, vol, ddf, ...):
    max_sz = torch.Tensor(np.reshape(np.array(self.max_sz), ...)).to(self.device)

Similarly, forward() recreates self.img_sz tensor and checks/transfers self.ref_grid every call.

With 16 recovery iterations × 2 resample calls × rec_num=2 = ~80 NumPy→GPU transfers per registration step.

After (in networks_opt.py):

class OptRecMulModMutAttnNet(RecMulModMutAttnNet):
    def _ensure_cache(self, img_sz, device):
        key = (tuple(img_sz), device)
        if key == self._cached_input_key:
            return                           # cache hit — skip all allocation
        self._cached_max_sz_tensor = ...     # create ONCE
        self._cached_img_sz_tensor = ...     # create ONCE

    def resample(self, vol, ddf, ...):
        img_sz = self._cached_img_sz_tensor  # reuse cached tensor

Savings: ~80 NumPy→Torch→GPU chains reduced to ~1 per input size change (typically once per epoch).

13. Fix `recover()` t tensor CPU bug (`diffuser_opt.py`)

Location: DeformDDPM.recover() in diffuser.py line 332

Bug: Original code creates a tensor then discards the .to() result:

t = torch.tensor(t)     # created on CPU
t.to(x.device)          # result DISCARDED — t stays on CPU!

This means every timestep tensor in the recovery loop stays on CPU, requiring implicit CPU→GPU transfer during network forward pass.

Fix (in diffuser_opt.py):

def recover(self, x, y, t, rec_num=2, text=None):
    if isinstance(t, torch.Tensor):
        if t.device != x.device:
            t = t.to(x.device)      # actually assign the result
    else:
        t = torch.tensor(t, device=x.device)  # create directly on device

14. `torch.no_grad()` for frozen iterations (`diffuser_opt.py`)

Location: DeformDDPM.diff_recover() recovery loop

Before (in diffuser.py): No gradient context management — all 16 iterations compute full autograd metadata (the original code has no_grad but the trainable check uses list membership on unhashable types, so it's fragile).

After: Only the last 2 iterations (trainable) run with full autograd. The other 14 use torch.no_grad() with a robust index-based trainability check:

trainable_start_idx = num_time_steps - len(trainable_iterations)
for step_idx, i in enumerate(time_steps):
    t = torch.tensor([i], device=self.device)  # OPT: direct device creation
    if step_idx >= trainable_start_idx:
        pre_dvf_I = self.recover(...)          # full autograd
    else:
        with torch.no_grad():                  # skip autograd for frozen steps
            pre_dvf_I = self.recover(...)

Note: torch.inference_mode() would be faster but produces inference tensors that can't participate in backward when composed with trainable iterations via ddf_comp.

15. Pre-compute timestep tensors on device (`diffuser_opt.py`)

Location: DeformDDPM.diff_recover() recovery loop

Before:

t = torch.tensor(np.array([i])).to(self.device)  # NumPy → CPU tensor → GPU

After:

t = torch.tensor([i], device=self.device)  # create directly on GPU

Savings: Eliminates 16 NumPy allocations + CPU→GPU transfers per registration step.

Equivalence Verification

Unit tests (`tests/test_3modes_opt_equivalence.py`)

7 test suites, all passing on CPU:

Suite	What it tests
Loss Equivalence	LNCC, MSLNCC forward + backward identical
DeformDDPM Methods	`proc_cond_img` (all 7 types), `_random_ddf_generate`
Mode 1: Diffusion	Single diffusion step: loss, gradients, weights
Mode 2: Contrastive	Contrastive step: `img_embd`, cosine loss, clipped gradients
Mode 3: Registration	`diff_recover` loop: DDF, reconstructed image, all sub-losses
Full Sequence	All 3 modes sequentially, final weight comparison
Checkpoint Compat	Original checkpoint loads into optimized and vice versa

Run with:

python -m pytest tests/test_3modes_opt_equivalence.py -v

XPU comparison (`tests/compare_3modes_speed.py`)

Head-to-head comparison on Intel Data Center GPU Max 1550, BATCHSIZE=1, IMG_SIZE=128, 10 steps (SLURM jobs 24541332/24541333):

Step	ORIG (s)	OPT (s)	Delta
0	8.44	8.42	-0.02
1	7.60	7.58	-0.02
2	12.36	12.33	-0.03
3	8.69	8.67	-0.02
4	12.15	12.12	-0.03
5	9.56	9.53	-0.03
6	10.71	10.66	-0.05
7	7.39	7.40	+0.01
8	10.18	10.12	-0.06
9	9.19	9.18	-0.01
Total	96.27	96.01	-0.26

Speedup: ~0.3% — within noise margin
Registration losses: Very close (within expected drift from separate runs)
BATCHSIZE=2: Both pipelines OOM at step 9 during loss_regist.backward() (registration backward graph = 16 recovery × rec_num=2 = 32 UNet passes)

Why Timing Is Similar

All optimizations (Phase 1 + Phase 2, optimizations 1-15) collectively showed negligible per-step speedup (~0.3%). Analysis:

Phase 1 (opts 1-8): Targeted script-level overhead (clone, .to(), zero_grad). These were already no-ops internally — PyTorch's .to(device) on a same-device tensor is a fast pointer return.
Phase 2 (opts 9-15): Targeted deeper compute path:
- ~150 STN .to(device) calls eliminated → but same-device .to() was already cheap
- ~80 NumPy→GPU tensor chains eliminated → marginal savings, tensors were small
- 71+ no-op grid_sample calls eliminated (split loop) → but at control-point resolution (32³), these were fast
- 14/16 recovery iterations with no_grad → saves autograd metadata but not the forward pass itself
The real bottleneck: UNet forward/backward passes dominate. Each registration step runs 32 full RecMulModMutAttnNet forward passes (16 timesteps × rec_num=2) through a multi-head attention UNet with channels [1,16,32,64,128,256] at 128³ resolution. This GPU kernel execution (~95% of step time) is unaffected by Python-level optimizations.

Remaining speedup opportunities (require deeper changes):

torch.compile() — JIT-compile UNet forward pass, fuse kernels
Mixed precision (bf16) — Intel Max 1550 has strong bf16 support, ~2x throughput
Gradient checkpointing — reduce memory (fixes BATCHSIZE=2 OOM), enables larger batches
Reduce rec_num or recovery timesteps — algorithmic change, affects quality

Usage

# Optimized training (drop-in replacement for OM_train_3modes.py)
python OM_train_3modes_opt.py -C Config/config_om.yaml

# With dummy data for testing
python OM_train_3modes_opt.py -C Config/config_om.yaml --dummy-samples 20

# Override workers
python OM_train_3modes_opt.py -C Config/config_om.yaml --num-workers 8

# Run equivalence tests
python -m pytest tests/test_3modes_opt_equivalence.py -v

# Run XPU speed comparison (via SLURM)
sbatch bash_compare_opt.sh

Checkpoint Compatibility

Optimized and original checkpoints are fully cross-compatible:

Original .pth loads into diffuser_opt.DeformDDPM (via inheritance)
Optimized .pth loads into diffuser.DeformDDPM (with strict=False, ignoring register_buffer keys)

Speed Optimization: OM_train_3modes_opt.py

Files Changed

Optimizations

1. Hoist clone().detach() outside recovery loop (diffuser_opt.py)

2. Remove redundant * 0 on zero tensors (diffuser_opt.py)

3. Skip clone() for unconditional path (diffuser_opt.py)

4. register_buffer for loss kernels (losses_opt.py)

5. Fixed registration timestep count (OM_train_3modes_opt.py)

6. DataLoader I/O overlap (OM_train_3modes_opt.py)

7. optimizer.zero_grad(set_to_none=True) (OM_train_3modes_opt.py)

8. Remove redundant .to(device) (OM_train_3modes_opt.py)

Deep Optimizations (Phase 2)

9. Split inner DDF composition loop (diffuser_opt.py)

10. Create DDF/DVF tensors on device (diffuser_opt.py)

11. OptSTN — register_buffer for spatial transformer (networks_opt.py)

12. OptRecMulModMutAttnNet — cached tensors for network (networks_opt.py)

13. Fix recover() t tensor CPU bug (diffuser_opt.py)

14. torch.no_grad() for frozen iterations (diffuser_opt.py)

15. Pre-compute timestep tensors on device (diffuser_opt.py)

Equivalence Verification

Unit tests (tests/test_3modes_opt_equivalence.py)

XPU comparison (tests/compare_3modes_speed.py)

Why Timing Is Similar

Usage

Checkpoint Compatibility

Speed Optimization: `OM_train_3modes_opt.py`

1. Hoist `clone().detach()` outside recovery loop (`diffuser_opt.py`)

2. Remove redundant `* 0` on zero tensors (`diffuser_opt.py`)

3. Skip `clone()` for unconditional path (`diffuser_opt.py`)

4. `register_buffer` for loss kernels (`losses_opt.py`)

5. Fixed registration timestep count (`OM_train_3modes_opt.py`)

6. DataLoader I/O overlap (`OM_train_3modes_opt.py`)

7. `optimizer.zero_grad(set_to_none=True)` (`OM_train_3modes_opt.py`)

8. Remove redundant `.to(device)` (`OM_train_3modes_opt.py`)

9. Split inner DDF composition loop (`diffuser_opt.py`)

10. Create DDF/DVF tensors on device (`diffuser_opt.py`)

11. `OptSTN` — `register_buffer` for spatial transformer (`networks_opt.py`)

12. `OptRecMulModMutAttnNet` — cached tensors for network (`networks_opt.py`)

13. Fix `recover()` t tensor CPU bug (`diffuser_opt.py`)

14. `torch.no_grad()` for frozen iterations (`diffuser_opt.py`)

15. Pre-compute timestep tensors on device (`diffuser_opt.py`)

Unit tests (`tests/test_3modes_opt_equivalence.py`)

XPU comparison (`tests/compare_3modes_speed.py`)