FengheTan9 committed on commit 6da2a44 (verified · 1 parent: f94d70c)

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+img/TOKI.png filter=lfs diff=lfs merge=lfs -text
+img/masking_consistency.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,202 @@
----
-license: apache-2.0
----
## [MIA'25] MambaMIM: Pre-training Mamba with State Space Token Interpolation and its Application to Medical Image Segmentation

![MambaMIM](img/TOKI.png)

<div align="center">
<span class="author-block">
<a href="https://scholar.google.com/citations?user=x1pODsMAAAAJ&hl=en" target="_blank">Fenghe Tang</a><sup>1,2</sup>,</span>
<span class="author-block">
<a target="_blank">Bingkun Nian</a><sup>3</sup>,</span>
<span class="author-block">
<a href="https://scholar.google.com/citations?user=ocAtNkkAAAAJ&hl=en" target="_blank">Yingtai Li</a><sup>1,2</sup>,</span>
<span class="author-block">
<a href="https://scholar.google.com/citations?user=Wo8tMSMAAAAJ&hl=en" target="_blank">Zihang Jiang</a><sup>1,2</sup>,</span>
<span class="author-block">
<a href="https://scholar.google.com/citations?user=tmx7tu8AAAAJ&hl=en" target="_blank">Jie Yang</a><sup>3</sup>,</span>
<span class="author-block">
<a href="https://scholar.google.com/citations?user=Vbb5EGIAAAAJ&hl=en" target="_blank">Liu Wei</a><sup>3</sup>,</span>
<span class="author-block">
<a href="https://scholar.google.com/citations?user=8eNm2GMAAAAJ&hl=en" target="_blank">S. Kevin Zhou</a><sup>1,2</sup>
</span>
</div>

<br>

<div align="center">
<sup>1</sup> <a href='https://en.ustc.edu.cn/' target='_blank'>School of Biomedical Engineering, University of Science and Technology of China</a>&emsp;
<br>
<sup>2</sup> <a href='http://english.ict.cas.cn/' target='_blank'>Suzhou Institute for Advanced Research, University of Science and Technology of China</a>&emsp;
<br>
<sup>3</sup> <a href='http://www.pami.sjtu.edu.cn/En/Home' target='_blank'>Department of Automation, Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University</a>
<br>
</div>

<br>

[![arXiv](https://img.shields.io/badge/arxiv-2408.08070-b31b1b)](https://arxiv.org/pdf/2408.08070.pdf) [![github](https://img.shields.io/badge/github-MambaMIM-purple)](https://github.com/FengheTan9/MambaMIM) <a href="#LICENSE--citation"><img alt="License: Apache2.0" src="https://img.shields.io/badge/LICENSE-Apache%202.0-blue.svg"/></a>

## News

- **MambaMIM accepted by Medical Image Analysis (MIA'25)! 🥰**
- **Weights released! 😎**
- **Code released!** 😘
- **Code and weights will be released soon!** 😘
- **[2024/08/16] Paper released!**

## TODOs

- [x] Paper released
- [x] Code released
- [x] Weights released

## Getting Started

### Download weights

| Name | Resolution | Intensities | Spacing | Weights |
| :------: | :----------: | :-----------: | :----------------: | :----------------------------------------------------------: |
| MambaMIM | 96 x 96 x 96 | [-175, 250] | 1.5 x 1.5 x 1.5 mm | [Google Drive (87MB)](https://drive.google.com/file/d/1B3j5aRPxkDJqf8UPGKDiAjg2X85a3Kwx/view?usp=sharing) |

### Prepare Environments

```sh
# NOTE: the prebuilt causal_conv1d / mamba_ssm wheels below are cp38 builds (CPython 3.8);
# use python=3.8 here if you install those wheels as-is
conda create -n mambamim python=3.9
conda activate mambamim
pip install torch==1.13.0 torchvision==0.14.0 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117
pip install packaging timm==0.5.4
pip install transformers==4.34.1 typed-argument-parser
pip install numpy==1.21.2 opencv-python==4.5.5.64 opencv-python-headless==4.5.5.64
pip install 'monai[all]'
pip install monai==1.2.0
pip install causal_conv1d-1.2.0.post2+cu118torch1.13cxx11abiTRUE-cp38-cp38-linux_x86_64.whl
pip install mamba_ssm-1.2.0.post1+cu118torch1.13cxx11abiFALSE-cp38-cp38-linux_x86_64.whl
```

### Prepare Datasets

We recommend that you convert your datasets into the [nnUNet](https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/dataset_format.md) format:

```
└── MambaMIM
    ├── data
        ├── Dataset060_TotalSegmentator
            └── imagesTr
                ├── xxx_0000.nii.gz
                ├── ...
        ├── Dataset006_FLARE2022
            └── imagesTr
                ├── xxx_0000.nii.gz
                ├── ...
        └── Other_dataset
            └── imagesTr
                ├── xxx_0000.nii.gz
                ├── ...
```

An example `dataset.json` will be generated in `./data`. Its content should look like this:

```json
{
    "training": [
        {
            "image": "./Dataset060_TotalSegmentator/imagesTr/xxx_0000.nii.gz"
        },
        {
            "image": "./Dataset006_FLARE2022/imagesTr/xxx_0000.nii.gz"
        }
    ]
}
```
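To assemble a `dataset.json` like the one above from the folder layout, a small script can glob each dataset's `imagesTr` directory. This is a sketch under the stated layout, not the repository's own generator; the `build_dataset_json` helper name is ours.

```python
import json
import os
from glob import glob


def build_dataset_json(data_root):
    """Collect every imagesTr/*.nii.gz under data_root into an nnUNet-style training list."""
    entries = []
    for dataset_dir in sorted(os.listdir(data_root)):
        images_tr = os.path.join(data_root, dataset_dir, "imagesTr")
        if not os.path.isdir(images_tr):
            continue
        for img in sorted(glob(os.path.join(images_tr, "*.nii.gz"))):
            # paths are stored relative to data_root, matching the example above
            entries.append({"image": "./" + os.path.relpath(img, data_root)})
    return {"training": entries}


if __name__ == "__main__":
    with open("./data/dataset.json", "w") as f:
        json.dump(build_dataset_json("./data"), f, indent=2)
```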
134
+
135
+
136
+
137
+ ## Start Training
138
+
139
+ ![MambaMIM](img/masking_consistency.png)
140
+
141
+
142
+
143
+ Run training on multi-GPU :
144
+
145
+ ```sh
146
+ # An example of training on 4 GPUs with DDP
147
+ torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 --master_addr=localhost --master_port=12351 main.py --exp_name=debug --data_path=./data --model=mambamim --bs=16 --exp_dir=debug_mambamim_ddp_4
148
+ ```
149
+
150
+ Run training on the single-GPU :
151
+
152
+ ```sh
153
+ # An example of training on the single GPU
154
+ python main.py --exp_name=debug --data_path=./data --model=mambamim --bs=4 --exp_dir=debug_mambamim
155
+ ```
156
+
157
+
158
+
159
+ ## Fine-tuning
160
+
161
+ Load pre-training weights :
162
+
163
+ ```python
164
+ # An example of Fine-tuning on BTCV (num_classes=14)
165
+ from models.network.hymamba import build_hybird
166
+
167
+ model = build_hybird(in_channel=1, n_classes=14, img_size=96).cuda()
168
+
169
+ model_dict = torch.load("mambamim_mask75.pth")
170
+
171
+ if model.load_state_dict(model_dict, strict=False):
172
+ print("MambaMIM use pretrained weights successfully !")
173
+ ```
The downstream fine-tuning pipeline can follow [UNETR](https://github.com/Project-MONAI/research-contributions/tree/main/UNETR/BTCV).

## Acknowledgements

This code uses helper functions from [SparK](https://github.com/keyu-tian/SparK) and [HySparK](https://github.com/FengheTan9/HySparK).

## Citation

If the code, paper, or weights help your research, please cite:

```bibtex
@article{tang2024mambamim,
  title={MambaMIM: Pre-training Mamba with State Space Token-interpolation},
  author={Tang, Fenghe and Nian, Bingkun and Li, Yingtai and Yang, Jie and Wei, Liu and Zhou, S Kevin},
  journal={arXiv preprint arXiv:2408.08070},
  year={2024}
}
```

## License

This project is released under the Apache 2.0 license. Please see the [LICENSE](LICENSE) file for more information.
dist.py ADDED
@@ -0,0 +1,112 @@
import os
from typing import List
from typing import Union

import sys
import torch
import torch.distributed as tdist
import torch.multiprocessing as mp

__rank, __local_rank, __world_size, __device = 0, 0, 1, 'cpu'
__initialized = False


def initialized():
    return __initialized


def initialize(backend='nccl'):
    global __device
    if not torch.cuda.is_available():
        print('[dist initialize] cuda is not available, use cpu instead', file=sys.stderr)
        return
    elif 'RANK' not in os.environ:
        __device = torch.empty(1).cuda().device
        print('[dist initialize] RANK is not set, use 1 GPU instead', file=sys.stderr)
        return

    # ref: https://github.com/open-mmlab/mmcv/blob/master/mmcv/runner/dist_utils.py#L29
    if mp.get_start_method(allow_none=True) is None:
        mp.set_start_method('spawn')
    global_rank, num_gpus = int(os.environ['RANK']), torch.cuda.device_count()
    local_rank = global_rank % num_gpus
    torch.cuda.set_device(local_rank)
    tdist.init_process_group(backend=backend)

    global __rank, __local_rank, __world_size, __initialized
    __local_rank = local_rank
    __rank, __world_size = tdist.get_rank(), tdist.get_world_size()
    __device = torch.empty(1).cuda().device
    __initialized = True

    assert tdist.is_initialized(), 'torch.distributed is not initialized!'


def get_rank():
    return __rank


def get_local_rank():
    return __local_rank


def get_world_size():
    return __world_size


def get_device():
    return __device


def is_master():
    return __rank == 0


def is_local_master():
    return __local_rank == 0


def barrier():
    if __initialized:
        tdist.barrier()


def parallelize(net, syncbn=False):
    if syncbn:
        net = torch.nn.SyncBatchNorm.convert_sync_batchnorm(net)
    net = net.cuda()
    net = torch.nn.parallel.DistributedDataParallel(net, device_ids=[get_local_rank()], find_unused_parameters=False, broadcast_buffers=False)
    return net


def allreduce(t: torch.Tensor) -> None:
    if __initialized:
        if not t.is_cuda:
            cu = t.detach().cuda()
            tdist.all_reduce(cu)
            t.copy_(cu.cpu())
        else:
            tdist.all_reduce(t)


def allgather(t: torch.Tensor, cat=True) -> Union[List[torch.Tensor], torch.Tensor]:
    if __initialized:
        if not t.is_cuda:
            t = t.cuda()
        ls = [torch.empty_like(t) for _ in range(__world_size)]
        tdist.all_gather(ls, t)
    else:
        ls = [t]
    if cat:
        ls = torch.cat(ls, dim=0)
    return ls


def broadcast(t: torch.Tensor, src_rank) -> None:
    if __initialized:
        if not t.is_cuda:
            cu = t.detach().cuda()
            tdist.broadcast(cu, src=src_rank)
            t.copy_(cu.cpu())
        else:
            tdist.broadcast(t, src=src_rank)
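`dist.py` degrades gracefully: without CUDA or a `RANK` environment variable it stays single-process (rank 0, world size 1). The guard pattern can be sketched torch-free; `env_rank_world` is our illustrative name, not part of the repository:

```python
import os


def env_rank_world():
    """Mirror dist.initialize's fallback: single process unless a launcher sets RANK/WORLD_SIZE."""
    if "RANK" not in os.environ:
        # dist.py returns early here and keeps rank 0 / world size 1
        return 0, 1
    return int(os.environ["RANK"]), int(os.environ.get("WORLD_SIZE", "1"))
```

`torchrun` exports `RANK` and `WORLD_SIZE` into each worker's environment, which is what makes the multi-process branch fire.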
img/TOKI.png ADDED

Git LFS Details

  • SHA256: 4a3efce3120f63be89e2e17a1dbc80b794e34bb7f5277e0983961356c2fe91e1
  • Pointer size: 132 Bytes
  • Size of remote file: 1.69 MB
img/masking_consistency.png ADDED

Git LFS Details

  • SHA256: a4abcc4d8218afdab7cec2d353b3f1caba6e6ab85aa899b3f5a79ff668edf0e6
  • Pointer size: 132 Bytes
  • Size of remote file: 1.62 MB
main.py ADDED
@@ -0,0 +1,192 @@
import datetime
import math
import sys
import time
import logging
import os
import torch
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader

import dist
from models.encoder import SparseEncoder
from models.decoder import LightDecoder
from models.MambaMIM import MambaMIM
from models import build_sparse_encoder
from utils.sampler import DistInfiniteBatchSampler, worker_init_fn
from utils import arg_util, misc
from utils.med_dataset import get_loader
from utils.lr_control import lr_wd_annealing


cpu_num = 1
os.environ['OMP_NUM_THREADS'] = str(cpu_num)
os.environ['OPENBLAS_NUM_THREADS'] = str(cpu_num)
os.environ['MKL_NUM_THREADS'] = str(cpu_num)
os.environ['VECLIB_MAXIMUM_THREADS'] = str(cpu_num)
os.environ['NUMEXPR_NUM_THREADS'] = str(cpu_num)
torch.set_num_threads(cpu_num)
torch.multiprocessing.set_sharing_strategy('file_system')


class LocalDDP(torch.nn.Module):
    def __init__(self, module):
        super(LocalDDP, self).__init__()
        self.module = module

    def forward(self, *args, **kwargs):
        return self.module(*args, **kwargs)


def main_pt():
    args: arg_util.Args = arg_util.init_dist_and_get_args()
    print(f'initial args:\n{str(args)}')
    args.log_epoch()

    # build data
    print('[build data for pre-training] ...\n')
    dataset_train = get_loader(args.data_path, args.input_size)
    data_loader_train = DataLoader(
        dataset=dataset_train, num_workers=args.dataloader_workers, pin_memory=True,
        batch_sampler=DistInfiniteBatchSampler(
            dataset_len=len(dataset_train), glb_batch_size=args.glb_batch_size,
            shuffle=True, filling=True, rank=dist.get_rank(), world_size=dist.get_world_size(),
        ), worker_init_fn=worker_init_fn
    )

    itrt_train, iters_train = iter(data_loader_train), len(data_loader_train)
    print(f'[dataloader] gbs={args.glb_batch_size}, lbs={args.batch_size_per_gpu}, iters_train={iters_train}')

    # build encoder and decoder
    enc: SparseEncoder = build_sparse_encoder(args.model, input_size=args.input_size, sbn=args.sbn, drop_path_rate=args.dp, verbose=False)
    dec = LightDecoder(enc.downsample_raito, sbn=args.sbn)
    model_without_ddp = MambaMIM(
        sparse_encoder=enc, dense_decoder=dec, mask_ratio=args.mask,
        densify_norm=args.densify_norm, sbn=args.sbn,
    ).to(args.device)
    print(f'[PT model] model = {model_without_ddp}\n')

    # the model has been randomly initialized at construction time;
    # a checkpoint could be loaded here as weight initialization (this would ONLY load the model weights)

    model = LocalDDP(model_without_ddp)

    # build optimizer and lr_scheduler
    optimizer = torch.optim.AdamW(params=model_without_ddp.parameters(), lr=args.lr, weight_decay=1e-5)

    # try to resume the experiment from some checkpoint.pth; this will load model weights, optimizer states, and last epoch (ep_start)
    # if loaded, ep_start will be greater than 0
    ep_start, performance_desc = misc.load_checkpoint(args.resume_from, model_without_ddp, optimizer)
    if ep_start >= args.ep:  # loaded from a complete checkpoint file
        print(f' [*] [PT already done] Min/Last Recon Loss: {performance_desc}')
    else:  # perform pre-training
        tb_lg = misc.TensorboardLogger(args.tb_lg_dir, is_master=dist.is_master(), prefix='pt')
        min_loss = 1e9
        print(f'[PT start] from ep{ep_start}')

        pt_start_time = time.time()
        for ep in range(ep_start, args.ep):
            ep_start_time = time.time()
            tb_lg.set_step(ep * iters_train)
            if hasattr(itrt_train, 'set_epoch'):
                itrt_train.set_epoch(ep)

            stats = pre_train_one_ep(ep, args, tb_lg, itrt_train, iters_train, model, optimizer)
            last_loss = stats['last_loss']
            min_loss = min(min_loss, last_loss)
            performance_desc = f'{min_loss:.4f} {last_loss:.4f}'
            misc.save_checkpoint_with_meta_info_and_opt_state(f'{args.model}_withdecoder_ct_pretrained.pth', args, ep, performance_desc, model_without_ddp.state_dict(with_config=True), optimizer.state_dict())
            misc.save_checkpoint_model_weights_only(f'{args.model}_ct_pretrained_mambamim_timm_style.pth', args, model_without_ddp.sparse_encoder.state_dict())

            ep_cost = round(time.time() - ep_start_time, 2) + 1  # +1s: approximate the following logging cost
            remain_secs = (args.ep - 1 - ep) * ep_cost
            remain_time = datetime.timedelta(seconds=round(remain_secs))
            finish_time = time.strftime("%m-%d %H:%M", time.localtime(time.time() + remain_secs))
            print(f' [*] [ep{ep}/{args.ep}] Min/Last Recon Loss: {performance_desc}, Cost: {ep_cost}s, Remain: {remain_time}, Finish @ {finish_time}')

            args.cur_ep = f'{ep + 1}/{args.ep}'
            args.remain_time, args.finish_time = str(remain_time), str(finish_time)
            args.last_loss = last_loss
            args.log_epoch()

            tb_lg.update(min_loss=min_loss, head='train', step=ep)
            tb_lg.update(rest_hours=round(remain_secs / 60 / 60, 2), head='z_burnout', step=ep)
            tb_lg.flush()

        # finish pre-training
        tb_lg.update(min_loss=min_loss, head='result', step=ep_start)
        tb_lg.update(min_loss=min_loss, head='result', step=args.ep)
        tb_lg.flush()
        print(f'final args:\n{str(args)}')
        print('\n\n')
        print(f' [*] [PT finished] Min/Last Recon Loss: {performance_desc}, Total Cost: {(time.time() - pt_start_time) / 60 / 60:.1f}h\n')
        print('\n\n')
        tb_lg.close()
        time.sleep(10)

    args.remain_time, args.finish_time = '-', time.strftime("%m-%d %H:%M", time.localtime(time.time()))
    args.log_epoch()


def pre_train_one_ep(ep, args: arg_util.Args, tb_lg: misc.TensorboardLogger, itrt_train, iters_train, model: DistributedDataParallel, optimizer):
    model.train()
    me = misc.MetricLogger(delimiter=' ')
    me.add_meter('max_lr', misc.SmoothedValue(window_size=1, fmt='{value:.5f}'))
    header = f'[PT] Epoch {ep}:'

    optimizer.zero_grad()
    early_clipping = args.clip > 0 and not hasattr(optimizer, 'global_grad_norm')
    late_clipping = hasattr(optimizer, 'global_grad_norm')
    if early_clipping:
        params_req_grad = [p for p in model.parameters() if p.requires_grad]

    for it, inp in enumerate(me.log_every(iters_train, itrt_train, 3, header)):
        # adjust lr and wd
        min_lr, max_lr, min_wd, max_wd = lr_wd_annealing(optimizer, args.lr, args.wd, args.wde, it + ep * iters_train, args.wp_ep * iters_train, args.ep * iters_train)

        # concatenate the crops of each batch element, then forward and backward
        temp = []
        for crop_per_batch in inp:
            temp.append(crop_per_batch["image"])
        inp = torch.cat(temp, dim=0)

        inp = inp.to(args.device, non_blocking=True)
        loss = model(inp, active_b1fff=None, vis=False)
        optimizer.zero_grad()
        loss.backward()
        loss = loss.item()
        if not math.isfinite(loss):
            print(f'[rk{dist.get_rank():02d}] Loss is {loss}, stopping training!', force=True, flush=True)
            sys.exit(-1)

        # optimize
        grad_norm = None
        if early_clipping:
            grad_norm = torch.nn.utils.clip_grad_norm_(params_req_grad, args.clip).item()
        optimizer.step()
        if late_clipping:
            grad_norm = optimizer.global_grad_norm
        torch.cuda.synchronize()

        # log
        me.update(last_loss=loss)
        me.update(max_lr=max_lr)
        tb_lg.update(loss=me.meters['last_loss'].global_avg, head='train_loss')
        tb_lg.update(sche_lr=max_lr, head='train_hp/lr_max')
        tb_lg.update(sche_lr=min_lr, head='train_hp/lr_min')
        tb_lg.update(sche_wd=max_wd, head='train_hp/wd_max')
        tb_lg.update(sche_wd=min_wd, head='train_hp/wd_min')

        if grad_norm is not None:
            me.update(orig_norm=grad_norm)
            tb_lg.update(orig_norm=grad_norm, head='train_hp')
        tb_lg.set_step()

    me.synchronize_between_processes()
    return {k: meter.global_avg for k, meter in me.meters.items()}


if __name__ == '__main__':
    main_pt()
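The training loop above estimates remaining wall time from the last epoch's cost. The formatting logic is stdlib-only and can be checked in isolation; the `eta` helper name is ours, extracted from `main_pt`'s logging:

```python
import datetime


def eta(ep, total_ep, ep_cost_secs):
    """Remaining time after finishing epoch `ep` (0-based), as computed in main_pt's logging."""
    remain_secs = (total_ep - 1 - ep) * ep_cost_secs
    return datetime.timedelta(seconds=round(remain_secs))
```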
mambamim_mask75.pth ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fd92f6cfdd2aff93f8942536f333bca7eb612b4238153c9b5accbacd9e4e1989
size 90976893
models/MambaMIM.py ADDED
@@ -0,0 +1,240 @@
from pprint import pformat
from typing import List

import sys
import torch
import torch.nn as nn
from timm.models.layers import trunc_normal_

import models.encoder as encoder
from models.decoder import LightDecoder
from itertools import accumulate


class MambaMIM(nn.Module):
    def __init__(
            self, sparse_encoder: encoder.SparseEncoder, dense_decoder: LightDecoder,
            mask_ratio=0.6, densify_norm='ln', sbn=True,
    ):
        super().__init__()
        input_size, downsample_raito = sparse_encoder.input_size, sparse_encoder.downsample_raito
        self.downsample_raito = downsample_raito
        self.fmap_h, self.fmap_w, self.fmap_d = input_size // downsample_raito, input_size // downsample_raito, input_size // downsample_raito
        self.mask_ratio = mask_ratio
        self.len_keep = round(self.fmap_h * self.fmap_w * self.fmap_d * (1 - mask_ratio))

        self.sparse_encoder = sparse_encoder
        self.dense_decoder = dense_decoder

        self.sbn = sbn
        self.hierarchy = len(sparse_encoder.enc_feat_map_chs)
        self.densify_norm_str = densify_norm.lower()
        self.densify_norms = nn.ModuleList()
        self.densify_projs = nn.ModuleList()
        self.mask_tokens = nn.ParameterList()

        # build the `densify` layers
        e_widths, d_width = self.sparse_encoder.enc_feat_map_chs, self.dense_decoder.width
        e_widths: List[int]
        self.A_interpolation = nn.Parameter(torch.zeros(1, self.sparse_encoder.enc_feat_map_chs[-1], self.sparse_encoder.enc_feat_map_chs[-1]))
        print("self.A_interpolation: ", self.A_interpolation.shape)
        for i in range(self.hierarchy):  # from the smallest feat map to the largest; i=0: the last feat map; i=1: the second last feat map ...
            e_width = e_widths.pop()
            # create mask token
            p = nn.Parameter(torch.zeros(1, e_width, 1, 1, 1))
            trunc_normal_(p, mean=0, std=.02, a=-.02, b=.02)
            self.mask_tokens.append(p)

            # create densify norm
            densify_norm = nn.Identity()
            self.densify_norms.append(densify_norm)

            # create densify proj
            if i == 0 and e_width == d_width:
                densify_proj = nn.Identity()  # NOTE: ConvNeXt-S would use this, because its width of 768 equals the decoder's width 768
                print(f'[MambaMIM.__init__, densify {i + 1}/{self.hierarchy}]: use nn.Identity() as densify_proj')
            else:
                kernel_size = 1 if i <= 0 else 3
                densify_proj = nn.Conv3d(e_width, d_width, kernel_size=kernel_size, stride=1, padding=kernel_size // 2, bias=True)
                print(f'[MambaMIM.__init__, densify {i + 1}/{self.hierarchy}]: densify_proj(ksz={kernel_size}, #para={sum(x.numel() for x in densify_proj.parameters()) / 1e6:.2f}M)')
            self.densify_projs.append(densify_proj)

            # the decoder's width follows a simple halving rule; it can be changed to any other rule
            d_width //= 2

        print(f'[MambaMIM.__init__] dims of mask_tokens={tuple(p.numel() for p in self.mask_tokens)}')

    def mask(self, B: int, device, generator=None):
        h, w, d = self.fmap_h, self.fmap_w, self.fmap_d
        idx = torch.rand(B, h * w * d, generator=generator).argsort(dim=1)
        idx = idx[:, :self.len_keep].to(device)  # (B, len_keep)
        return torch.zeros(B, h * w * d, dtype=torch.bool, device=device) \
            .scatter_(dim=1, index=idx, value=True).view(B, 1, h, w, d)

    def mask_token_every_batch(self, bcfff, cur_active):
        # force the first and last positions to be active so every masked run has visible endpoints
        flag = cur_active.flatten(2).clone()
        flag[0][0][0] = True
        flag[0][0][-1] = True

        indices = torch.nonzero(flag.squeeze()).squeeze()
        B, N, H, L, W = bcfff.shape

        # A_token: powers of the learnable state matrix between consecutive visible tokens
        A_token = []
        for i in range(0, len(indices) - 1):
            A_power = [torch.linalg.matrix_power(self.A_interpolation, k) for k in range(indices[i + 1] - indices[i])]
            max_power = indices[i + 1] - indices[i] - 1
            for j in range(0, indices[i + 1] - indices[i]):
                A_token.append(A_power[max_power - j])
        A_token.append(self.A_interpolation)
        A_token = torch.cat(A_token, dim=0)

        # X_token: linear interpolation between neighboring visible tokens
        X_token = []
        X_unmask = bcfff.flatten(2).transpose(1, 2).squeeze().unsqueeze(-1)
        for i in range(0, len(indices) - 1):
            alpha = torch.linspace(0, 1, indices[i + 1] - indices[i], dtype=X_unmask.dtype, device=X_unmask.device)
            alpha = alpha.view(-1, 1)
            X_interpolation = (1 - alpha) * X_unmask[indices[i]].transpose(0, 1) + alpha * X_unmask[indices[i + 1]].transpose(0, 1)
            X_token.append(X_interpolation.unsqueeze(-1))
        X_last_token = X_unmask[indices[-1]].unsqueeze(0)
        X_token.append(X_last_token)
        X_token = torch.cat(X_token, dim=0)

        AX = A_token.cuda() @ X_token

        mask_token = AX
        for i in range(0, len(indices) - 1):
            current_sum = list(accumulate(AX[indices[i]:indices[i + 1]]))
            mask_token[indices[i]:indices[i + 1]] = torch.stack(current_sum, dim=0)
        mask_token = AX.reshape(B, N, H, L, W)

        return mask_token

    def mamba_mask(self, bcfff, cur_active):
        """
        S6T: state space token interpolation, applied batch element by batch element
        """
        B, N, H, W, L = cur_active.shape
        cur_active_list = torch.chunk(cur_active, B, dim=0)
        bcfff_list = torch.chunk(bcfff, B, dim=0)
        mask_token_list = []
        for i in range(B):
            mask_token_list.append(self.mask_token_every_batch(bcfff_list[i], cur_active_list[i]))
        mask_token = torch.cat(mask_token_list, dim=0)

        return mask_token

    def forward(self, inp_bchwd: torch.Tensor, active_b1fff=None, vis=False):
        # step1. Mask
        if active_b1fff is None:  # rand mask
            active_b1fff: torch.BoolTensor = self.mask(inp_bchwd.shape[0], inp_bchwd.device)  # (B, 1, f, f, f)
        encoder._cur_active = active_b1fff  # (B, 1, f, f, f)
        active_b1hwd = active_b1fff.repeat_interleave(self.downsample_raito, 2).repeat_interleave(self.downsample_raito, 3).repeat_interleave(self.downsample_raito, 4)  # (B, 1, H, W, D)
        masked_bchwd = inp_bchwd * active_b1hwd

        # step2. Encode: get hierarchical encoded sparse features (a list containing 4 feature maps at 4 scales)
        fea_bcfffs: List[torch.Tensor] = self.sparse_encoder(masked_bchwd, active_b1fff)
        fea_bcfffs.reverse()  # after reversion: from the smallest feature map to the largest

        # step3. Densify: get hierarchical dense features for decoding (may need to be modified)
        cur_active = active_b1fff  # (B, 1, f, f, f)
        to_dec = []
        for i, bcfff in enumerate(fea_bcfffs):  # from the smallest feature map to the largest
            if bcfff is not None:
                bcfff = self.densify_norms[i](bcfff)
                mask_tokens = self.mamba_mask(bcfff, cur_active) if i == 0 else self.mask_tokens[i].expand_as(bcfff)
                bcfff = torch.where(cur_active.expand_as(bcfff), bcfff, mask_tokens)  # fill in empty (non-active) positions with [mask] tokens
                bcfff: torch.Tensor = self.densify_projs[i](bcfff)
            to_dec.append(bcfff)
            cur_active = cur_active.repeat_interleave(2, dim=2).repeat_interleave(2, dim=3).repeat_interleave(2, dim=4)  # dilate the mask map, from (B, 1, f, f, f) to (B, 1, 2f, 2f, 2f)

        # step4. Decode and reconstruct
        rec_bchwd = self.dense_decoder(to_dec)
        inp, rec = self.patchify(inp_bchwd), self.patchify(rec_bchwd)  # inp and rec: (B, L = f*f*f, N = C*downsample_raito**3)
        mean = inp.mean(dim=-1, keepdim=True)
        var = (inp.var(dim=-1, keepdim=True) + 1e-6) ** .5
        inp = (inp - mean) / var
        l2_loss = ((rec - inp) ** 2).mean(dim=2, keepdim=False)  # (B, L, C) ==mean==> (B, L)

        non_active = active_b1fff.logical_not().int().view(active_b1fff.shape[0], -1)  # (B, 1, f, f, f) => (B, L)
        recon_loss = l2_loss.mul_(non_active).sum() / (non_active.sum() + 1e-8)  # loss only on masked (non-active) patches

        if vis:
            masked_bchwd = inp_bchwd * active_b1hwd
            rec_bchwd = self.unpatchify(rec * var + mean)
            rec_or_inp = torch.where(active_b1hwd, inp_bchwd, rec_bchwd)
            return inp_bchwd, masked_bchwd, rec_or_inp
        else:
            return recon_loss

    def patchify(self, bchwd):
        p = self.downsample_raito
        h, w, d = self.fmap_h, self.fmap_w, self.fmap_d
        B, C = bchwd.shape[:2]
        bchwd = bchwd.reshape(shape=(B, C, h, p, w, p, d, p))
        bchwd = torch.einsum('bchpwqds->bhwdpqsc', bchwd)
        bln = bchwd.reshape(shape=(B, h * w * d, C * p ** 3))  # (B, f*f*f, C*downsample_raito**3)
        return bln

    def unpatchify(self, bln):
        p = self.downsample_raito
        h, w, d = self.fmap_h, self.fmap_w, self.fmap_d
        B, C = bln.shape[0], bln.shape[-1] // p ** 3
        bln = bln.reshape(shape=(B, h, w, d, p, p, p, C))
        bln = torch.einsum('bhwdpqsc->bchpwqds', bln)
        bchwd = bln.reshape(shape=(B, C, h * p, w * p, d * p))
        return bchwd

    def __repr__(self):
        return (
            f'\n'
            f'[MambaMIM.config]: {pformat(self.get_config(), indent=2, width=250)}\n'
            f'[MambaMIM.structure]: {super(MambaMIM, self).__repr__().replace(MambaMIM.__name__, "")}'
        )

    def get_config(self):
        return {
            # self
            'mask_ratio': self.mask_ratio,
            'densify_norm_str': self.densify_norm_str,
            'sbn': self.sbn, 'hierarchy': self.hierarchy,

            # enc
            'sparse_encoder.input_size': self.sparse_encoder.input_size,
            # dec
            'dense_decoder.width': self.dense_decoder.width,
        }

    def state_dict(self, destination=None, prefix='', keep_vars=False, with_config=False):
        state = super(MambaMIM, self).state_dict(destination=destination, prefix=prefix, keep_vars=keep_vars)
        if with_config:
            state['config'] = self.get_config()
        return state

    def load_state_dict(self, state_dict, strict=True):
        config: dict = state_dict.pop('config', None)
        incompatible_keys = super(MambaMIM, self).load_state_dict(state_dict, strict=strict)
        if config is not None:
            for k, v in self.get_config().items():
                ckpt_v = config.get(k, None)
                if ckpt_v != v:
                    err = f'[MambaMIM.load_state_dict] config mismatch: this.{k}={v} (ckpt.{k}={ckpt_v})'
                    if strict:
                        raise AttributeError(err)
                    else:
                        print(err, file=sys.stderr)
        return incompatible_keys
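`mask` above draws a per-sample random keep-set by arg-sorting uniform noise, and `patchify`/`unpatchify` are exact inverses of each other. Both properties can be checked with a NumPy re-implementation of the same logic (a torch-free sketch for portability, not the repository code; the function names are ours):

```python
import numpy as np


def random_mask(B, f, mask_ratio, rng):
    """Boolean (B, f**3) keep-mask: True = visible, exactly len_keep kept per sample."""
    len_keep = round(f ** 3 * (1 - mask_ratio))
    # argsort of uniform noise gives a random permutation; keep the first len_keep indices
    idx = np.argsort(rng.random((B, f ** 3)), axis=1)[:, :len_keep]
    keep = np.zeros((B, f ** 3), dtype=bool)
    np.put_along_axis(keep, idx, True, axis=1)
    return keep


def patchify(x, p):
    """(B, C, H, W, D) -> (B, h*w*d, C*p**3) with h = H // p (cube patches)."""
    B, C, H, W, D = x.shape
    h, w, d = H // p, W // p, D // p
    x = x.reshape(B, C, h, p, w, p, d, p)
    x = np.einsum('bchpwqds->bhwdpqsc', x)
    return x.reshape(B, h * w * d, C * p ** 3)


def unpatchify(y, p, hwd):
    """Inverse of patchify, given the (h, w, d) patch grid."""
    h, w, d = hwd
    B, C = y.shape[0], y.shape[-1] // p ** 3
    y = y.reshape(B, h, w, d, p, p, p, C)
    y = np.einsum('bhwdpqsc->bchpwqds', y)
    return y.reshape(B, C, h * p, w * p, d * p)
```

With the defaults above (96³ input, downsample ratio 16, mask ratio 0.75), each sample keeps exactly `round(6**3 * 0.25) = 54` of 216 patch positions.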
models/__init__.py ADDED
@@ -0,0 +1,51 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+import torch
+from timm.loss import SoftTargetCrossEntropy
+from timm.models.layers import drop
+
+from models.network.hymamba import Encoder
+
+
+# log more
+def _ex_repr(self):
+    return ', '.join(
+        f'{k}=' + (f'{v:g}' if isinstance(v, float) else str(v))
+        for k, v in vars(self).items()
+        if not k.startswith('_') and k != 'training'
+        and not isinstance(v, (torch.nn.Module, torch.Tensor))
+    )
+for clz in (torch.nn.CrossEntropyLoss, SoftTargetCrossEntropy, drop.DropPath):
+    if hasattr(clz, 'extra_repr'):
+        clz.extra_repr = _ex_repr
+    else:
+        clz.__repr__ = lambda self: f'{type(self).__name__}({_ex_repr(self)})'
+
+
+pretrain_default_model_kwargs = {
+    'mambamim': dict(sparse=True, drop_path_rate=0.1),
+}
+for kw in pretrain_default_model_kwargs.values():
+    kw['pretrained'] = False
+    kw['num_classes'] = 0
+    kw['global_pool'] = ''
+
+
+def build_sparse_encoder(name: str, input_size: int, sbn=False, drop_path_rate=0.0, verbose=False):
+    from models.encoder import SparseEncoder
+    kwargs = pretrain_default_model_kwargs[name]
+    if drop_path_rate != 0:
+        kwargs['drop_path_rate'] = drop_path_rate
+    print(f'[build_sparse_encoder] model kwargs={kwargs}')
+    encoder = Encoder(
+        in_channel=1,
+        channels=(32, 64, 128, 192, 384),
+        depths=(1, 2, 2, 2, 1),
+        kernels=(3, 3, 3, 3, 3),
+        exp_r=(2, 2, 4, 4, 4),
+        img_size=96,
+        depth=4,
+        sparse=True)
+    return SparseEncoder(encoder=encoder, input_size=input_size, sbn=sbn, verbose=verbose)
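The `_ex_repr` patch above swaps a richer `extra_repr` into timm/torch classes so that losses and `DropPath` print their hyper-parameters in logs. A dependency-free sketch of the same monkeypatching idea (the `Widget` class is hypothetical):

```python
class Widget:
    """Hypothetical module-like class; stands in for nn.CrossEntropyLoss etc."""

    def __init__(self):
        self.p = 0.1        # a float hyper-parameter
        self.name = 'w'     # a non-float hyper-parameter
        self._hidden = 1    # private attrs are skipped by the patch

    def extra_repr(self):
        return ''  # default: prints nothing useful

    def __repr__(self):
        return f'{type(self).__name__}({self.extra_repr()})'


def _ex_repr(self):
    # same filtering rule as the patch above: public attributes only,
    # floats rendered with %g formatting
    return ', '.join(
        f'{k}=' + (f'{v:g}' if isinstance(v, float) else str(v))
        for k, v in vars(self).items()
        if not k.startswith('_')
    )


# monkeypatch: every Widget now reports its hyper-parameters in repr()
Widget.extra_repr = _ex_repr
```

After the patch, `repr(Widget())` shows the configured values instead of an empty parenthesis, which is exactly what the repo does for loss and drop-path modules.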
models/decoder.py ADDED
@@ -0,0 +1,87 @@
+import math
+from typing import List
+
+import torch
+import torch.nn as nn
+from timm.models.layers import trunc_normal_
+
+
+class UNetBlock(nn.Module):
+    def __init__(self, cin, cout, bn3d):
+        """
+        a UNet block with 2x up sampling
+        """
+        super().__init__()
+        self.up_sample = nn.ConvTranspose3d(cin, cin, kernel_size=2, stride=2, padding=0, bias=True)
+        self.conv = nn.Sequential(
+            nn.Conv3d(cin, cout, kernel_size=3, stride=1, padding=1, bias=True), bn3d(cout), nn.ReLU(inplace=True),
+            nn.Conv3d(cout, cout, kernel_size=3, stride=1, padding=1, bias=True), bn3d(cout), nn.ReLU(inplace=True),
+        )
+
+    def forward(self, x):
+        x = self.up_sample(x)
+        return self.conv(x)
+
+
+class FusionBlock(nn.Module):
+    def __init__(self, cin, cout, bn3d):
+        """
+        a fusion block that merges concatenated skip features (no up sampling)
+        """
+        super().__init__()
+        self.conv = nn.Sequential(
+            nn.Conv3d(cin, cout, kernel_size=3, stride=1, padding=1, bias=True), bn3d(cout), nn.ReLU(inplace=True),
+            nn.Conv3d(cout, cout, kernel_size=3, stride=1, padding=1, bias=True), bn3d(cout), nn.ReLU(inplace=True),
+        )
+
+    def forward(self, x):
+        return self.conv(x)
+
+
+class LightDecoder(nn.Module):
+    def __init__(self, up_sample_ratio, width=768, sbn=True):
+        # todo: the decoder's width follows a simple halving rule; you can change it to any other rule
+        super().__init__()
+        self.width = width
+        n = round(math.log2(up_sample_ratio))
+        channels = [self.width // 2 ** i for i in range(n + 1)]
+        bn3d = nn.BatchNorm3d
+        self.dec = nn.ModuleList([UNetBlock(cin, cout, bn3d) for (cin, cout) in zip(channels[:-1], channels[1:])])
+        self.fuse = nn.ModuleList([FusionBlock(cin * 2, cin, bn3d) for (cin, cout) in zip(channels[:-1], channels[1:])])
+        self.proj = nn.Conv3d(channels[-1], 1, kernel_size=1, stride=1, bias=True)
+
+        self.initialize()
+
+    def forward(self, to_dec: List[torch.Tensor]):
+        x = 0
+        for i, d in enumerate(self.dec):
+            if i < len(to_dec) and to_dec[i] is not None:
+                if isinstance(x, int):
+                    x = x + to_dec[i]
+                else:
+                    x = torch.cat((x, to_dec[i]), dim=1)
+                    x = self.fuse[i](x)
+            x = self.dec[i](x)
+        return self.proj(x)
+
+    def extra_repr(self) -> str:
+        return f'width={self.width}'
+
+    def initialize(self):
+        for m in self.modules():
+            if isinstance(m, nn.Linear):
+                trunc_normal_(m.weight, std=.02)
+                if m.bias is not None:
+                    nn.init.constant_(m.bias, 0)
+            elif isinstance(m, nn.Conv3d):
+                trunc_normal_(m.weight, std=.02)
+                if m.bias is not None:
+                    nn.init.constant_(m.bias, 0)
+            elif isinstance(m, nn.ConvTranspose3d):  # nn.Conv3d is already handled by the branch above
+                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
+                if m.bias is not None:
+                    nn.init.constant_(m.bias, 0.)
+            elif isinstance(m, (nn.LayerNorm, nn.BatchNorm2d, nn.BatchNorm3d, nn.SyncBatchNorm)):
+                nn.init.constant_(m.bias, 0)
+                nn.init.constant_(m.weight, 1.0)
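`LightDecoder` derives its per-stage widths from a simple halving rule: one `UNetBlock` per 2x upsample, each halving the channel count. The rule in isolation (function name is illustrative):

```python
import math


def decoder_channels(width, up_sample_ratio):
    """Channel widths under LightDecoder's halving rule.

    One decoder stage per 2x of upsampling; each stage halves the width,
    so a ratio of 16 (= 2**4) yields 4 stages and 5 widths.
    """
    n = round(math.log2(up_sample_ratio))
    return [width // 2 ** i for i in range(n + 1)]
```

For the default `width=768` and a 16x upsampling ratio this gives `[768, 384, 192, 96, 48]`; consecutive pairs feed the `UNetBlock`/`FusionBlock` lists, and the final width (48 here) is projected to 1 channel by `proj`.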
models/encoder.py ADDED
@@ -0,0 +1,258 @@
+import torch
+import torch.nn as nn
+from timm.models.layers import DropPath
+
+_cur_active: torch.Tensor = None  # B1fff
+
+
+# todo: try to use `gather` for speed?
+def _get_active_ex_or_ii(H, W, D, returning_active_ex=True):
+    h_repeat, w_repeat, d_repeat = H // _cur_active.shape[-3], W // _cur_active.shape[-2], D // _cur_active.shape[-1]
+    active_ex = _cur_active.repeat_interleave(h_repeat, dim=2).repeat_interleave(w_repeat, dim=3).repeat_interleave(d_repeat, dim=4)
+    return active_ex if returning_active_ex else active_ex.squeeze(1).nonzero(as_tuple=True)  # ii: bi, hi, wi, di
+
+
+def sp_conv_forward(self, x: torch.Tensor):
+    x = super(type(self), self).forward(x)
+    x *= _get_active_ex_or_ii(H=x.shape[2], W=x.shape[3], D=x.shape[4], returning_active_ex=True)  # (BCHWD) *= (B1HWD), mask the output of conv
+    return x
+
+
+def sp_bn_forward(self, x: torch.Tensor):
+    ii = _get_active_ex_or_ii(H=x.shape[2], W=x.shape[3], D=x.shape[4], returning_active_ex=False)
+
+    bhwdc = x.permute(0, 2, 3, 4, 1)
+    nc = bhwdc[ii]  # select the features on non-masked positions to form a flattened feature `nc`
+    nc = super(type(self), self).forward(nc)  # use BN1d to normalize this flattened feature `nc`
+
+    bchwd = torch.zeros_like(bhwdc)
+    bchwd[ii] = nc
+    bchwd = bchwd.permute(0, 4, 1, 2, 3)
+    return bchwd
+
+
+def sp_in_forward(self, x: torch.Tensor):
+    ii = _get_active_ex_or_ii(H=x.shape[2], W=x.shape[3], D=x.shape[4], returning_active_ex=False)
+    bhwdc = x.permute(0, 2, 3, 4, 1)
+    cn = bhwdc[ii].permute(1, 0)  # select the features on non-masked positions to form a flattened feature, e.g. [17787, 3]
+    C, N = cn.shape
+    bcl = cn.reshape(C, -1, x.shape[0]).permute(2, 0, 1)
+    bcl = super(type(self), self).forward(bcl)  # use IN1d to normalize this flattened feature
+    nc = bcl.permute(1, 2, 0).reshape(C, -1).permute(1, 0)
+    bchwd = torch.zeros_like(bhwdc)
+    bchwd[ii] = nc
+    bchwd = bchwd.permute(0, 4, 1, 2, 3)
+    return bchwd
+
+
+class SparseConv3d(nn.Conv3d):
+    forward = sp_conv_forward  # hack: override the forward function; see `sp_conv_forward` above for more details
+
+
+class SparseMaxPooling(nn.MaxPool3d):
+    forward = sp_conv_forward  # hack: override the forward function; see `sp_conv_forward` above for more details
+
+
+class SparseAvgPooling(nn.AvgPool3d):
+    forward = sp_conv_forward  # hack: override the forward function; see `sp_conv_forward` above for more details
+
+
+class SparseBatchNorm3d(nn.BatchNorm1d):
+    forward = sp_bn_forward  # hack: override the forward function; see `sp_bn_forward` above for more details
+
+
+class SparseSyncBatchNorm3d(nn.SyncBatchNorm):
+    forward = sp_bn_forward  # hack: override the forward function; see `sp_bn_forward` above for more details
+
+
+class SparseInstanceNorm3d(nn.InstanceNorm1d):
+    forward = sp_in_forward  # hack: override the forward function; see `sp_in_forward` above for more details
+
+
+class SparseConvNeXtLayerNorm(nn.LayerNorm):
+    r""" LayerNorm that supports two data formats: channels_last (default) or channels_first.
+    The ordering of the dimensions in the inputs. channels_last corresponds to inputs with
+    shape (batch_size, height, width, depth, channels) while channels_first corresponds to
+    inputs with shape (batch_size, channels, height, width, depth).
+    """
+
+    def __init__(self, normalized_shape, eps=1e-6, data_format="channels_last", sparse=True):
+        if data_format not in ["channels_last", "channels_first"]:
+            raise NotImplementedError
+        super().__init__(normalized_shape, eps, elementwise_affine=True)
+        self.data_format = data_format
+        self.sparse = sparse
+
+    def forward(self, x):
+        if x.ndim == 5:  # BHWDC or BCHWD
+            if self.data_format == "channels_last":  # BHWDC
+                if self.sparse:
+                    ii = _get_active_ex_or_ii(H=x.shape[1], W=x.shape[2], D=x.shape[3], returning_active_ex=False)
+                    nc = x[ii]
+                    nc = super(SparseConvNeXtLayerNorm, self).forward(nc)
+
+                    x = torch.zeros_like(x)
+                    x[ii] = nc
+                    return x
+                else:
+                    return super(SparseConvNeXtLayerNorm, self).forward(x)
+            else:  # channels_first, BCHWD
+                if self.sparse:
+                    ii = _get_active_ex_or_ii(H=x.shape[2], W=x.shape[3], D=x.shape[4], returning_active_ex=False)
+                    bhwc = x.permute(0, 2, 3, 4, 1)
+                    nc = bhwc[ii]
+                    nc = super(SparseConvNeXtLayerNorm, self).forward(nc)
+
+                    x = torch.zeros_like(bhwc)
+                    x[ii] = nc
+                    return x.permute(0, 4, 1, 2, 3)
+                else:
+                    u = x.mean(1, keepdim=True)
+                    s = (x - u).pow(2).mean(1, keepdim=True)
+                    x = (x - u) / torch.sqrt(s + self.eps)
+                    x = self.weight[:, None, None, None] * x + self.bias[:, None, None, None]
+                    return x
+        else:  # BLC or BC
+            if self.sparse:
+                raise NotImplementedError
+            else:
+                return super(SparseConvNeXtLayerNorm, self).forward(x)
+
+    def __repr__(self):
+        return super(SparseConvNeXtLayerNorm, self).__repr__()[:-1] + f', ch={self.data_format.split("_")[-1]}, sp={self.sparse})'
+
+
+class SparseConvNeXtBlock(nn.Module):
+    r""" ConvNeXt Block. There are two equivalent implementations:
+    (1) DwConv -> LayerNorm (channels_first) -> 1x1 Conv -> GELU -> 1x1 Conv; all in (N, C, H, W, D)
+    (2) DwConv -> Permute to (N, H, W, D, C); LayerNorm (channels_last) -> Linear -> GELU -> Linear; Permute back
+    We use (2) as we find it slightly faster in PyTorch
+
+    Args:
+        in_channels (int): Number of input channels.
+        drop_path (float): Stochastic depth rate. Default: 0.0
+        layer_scale_init_value (float): Init value for Layer Scale. Default: 1e-6.
+    """
+
+    def __init__(self, in_channels, out_channels, kernel_size=7, exp_r=4, do_res=False, drop_path=0.,
+                 layer_scale_init_value=1e-6, sparse=True):
+        super().__init__()
+
+        self.do_res = do_res
+        self.dwconv = nn.Conv3d(in_channels, in_channels, kernel_size=kernel_size, padding=kernel_size // 2,
+                                groups=in_channels)  # depthwise conv
+        self.norm = SparseConvNeXtLayerNorm(in_channels, eps=1e-6, sparse=sparse)
+        self.pwconv1 = nn.Linear(in_channels, exp_r * in_channels)  # pointwise/1x1 convs, implemented with linear layers
+        self.act = nn.GELU()
+        self.pwconv2 = nn.Linear(exp_r * in_channels, out_channels)
+        self.gamma = nn.Parameter(layer_scale_init_value * torch.ones((out_channels)),
+                                  requires_grad=True) if layer_scale_init_value > 0 else None
+        self.drop_path: nn.Module = DropPath(drop_path) if drop_path > 0. else nn.Identity()
+        self.sparse = sparse
+
+    def forward(self, x):
+        input = x
+        x = self.dwconv(x)
+        x = x.permute(0, 2, 3, 4, 1)  # (N, C, H, W, D) -> (N, H, W, D, C)
+        x = self.norm(x)
+        x = self.pwconv1(x)
+        x = self.act(x)  # GELU(0) == 0, so there is no need to mask x (no need to `x *= _get_active_ex_or_ii`)
+        x = self.pwconv2(x)
+        if self.gamma is not None:
+            x = self.gamma * x
+        x = x.permute(0, 4, 1, 2, 3)  # (N, H, W, D, C) -> (N, C, H, W, D)
+
+        if self.sparse:
+            x *= _get_active_ex_or_ii(H=x.shape[2], W=x.shape[3], D=x.shape[4], returning_active_ex=True)
+        if self.do_res:
+            x = input + self.drop_path(x)
+        return x
+
+    def __repr__(self):
+        return super(SparseConvNeXtBlock, self).__repr__()[:-1] + f', sp={self.sparse})'
+
+
+class SparseEncoder(nn.Module):
+    def __init__(self, encoder, input_size, sbn=False, verbose=False):
+        super(SparseEncoder, self).__init__()
+        self.embeddings = SparseEncoder.dense_model_to_sparse(m=encoder.embeddings, verbose=verbose, sbn=sbn)
+        self.mae = encoder.mae
+
+        # self.encoder = SparseEncoder.dense_model_to_sparse(m=encoder, verbose=verbose, sbn=sbn)
+        self.input_size, self.downsample_raito, self.enc_feat_map_chs = input_size, encoder.get_downsample_ratio(), encoder.get_feature_map_channels()
+
+    @staticmethod
+    def dense_model_to_sparse(m: nn.Module, verbose=False, sbn=False):
+        oup = m
+        if isinstance(m, nn.Conv3d):
+            m: nn.Conv3d
+            bias = m.bias is not None
+            oup = SparseConv3d(
+                m.in_channels, m.out_channels,
+                kernel_size=m.kernel_size, stride=m.stride, padding=m.padding,
+                dilation=m.dilation, groups=m.groups, bias=bias, padding_mode=m.padding_mode,
+            )
+            oup.weight.data.copy_(m.weight.data)
+            if bias:
+                oup.bias.data.copy_(m.bias.data)
+        elif isinstance(m, nn.MaxPool3d):
+            m: nn.MaxPool3d
+            oup = SparseMaxPooling(m.kernel_size, stride=m.stride, padding=m.padding, dilation=m.dilation,
+                                   return_indices=m.return_indices, ceil_mode=m.ceil_mode)
+        elif isinstance(m, nn.AvgPool3d):
+            m: nn.AvgPool3d
+            oup = SparseAvgPooling(m.kernel_size, m.stride, m.padding, ceil_mode=m.ceil_mode,
+                                   count_include_pad=m.count_include_pad, divisor_override=m.divisor_override)
+        elif isinstance(m, (nn.BatchNorm3d, nn.SyncBatchNorm)):
+            m: nn.BatchNorm3d
+            oup = (SparseSyncBatchNorm3d if sbn else SparseBatchNorm3d)(m.weight.shape[0], eps=m.eps,
+                                                                       momentum=m.momentum, affine=m.affine,
+                                                                       track_running_stats=m.track_running_stats)
+            oup.weight.data.copy_(m.weight.data)
+            oup.bias.data.copy_(m.bias.data)
+            oup.running_mean.data.copy_(m.running_mean.data)
+            oup.running_var.data.copy_(m.running_var.data)
+            oup.num_batches_tracked.data.copy_(m.num_batches_tracked.data)
+            if hasattr(m, "qconfig"):
+                oup.qconfig = m.qconfig
+        elif isinstance(m, nn.InstanceNorm3d):
+            m: nn.InstanceNorm3d
+            oup = SparseInstanceNorm3d(m.num_features, eps=m.eps, momentum=m.momentum, affine=m.affine,
+                                       track_running_stats=m.track_running_stats)
+            if hasattr(m, "qconfig"):
+                oup.qconfig = m.qconfig
+        elif isinstance(m, nn.LayerNorm) and not isinstance(m, SparseConvNeXtLayerNorm):
+            m: nn.LayerNorm
+            oup = SparseConvNeXtLayerNorm(m.weight.shape[0], eps=m.eps)
+            oup.weight.data.copy_(m.weight.data)
+            oup.bias.data.copy_(m.bias.data)
+        elif isinstance(m, (nn.Conv1d,)):
+            m: nn.Conv1d
+            bias = m.bias is not None
+            oup = nn.Conv1d(
+                m.in_channels, m.out_channels,
+                kernel_size=m.kernel_size, stride=m.stride, padding=m.padding,
+                dilation=m.dilation, groups=m.groups, bias=bias, padding_mode=m.padding_mode)
+            oup.weight.data.copy_(m.weight.data)
+            if bias:
+                oup.bias.data.copy_(m.bias.data)
+        for name, child in m.named_children():
+            oup.add_module(name, SparseEncoder.dense_model_to_sparse(child, verbose=verbose, sbn=sbn))
+        del m
+        return oup
+
+    def forward(self, x, active_b1fff):
+        x1, x2, x3, x4, x5 = self.embeddings(x)
+        _x5 = self.mae(x5, active_b1fff)
+        return [x1, x2, x3, x4, _x5]
+
+
+if __name__ == '__main__':
+    x = torch.randn([1, 96, 24, 24, 24])
+    _cur_active = torch.randn([1, 1, 96 // 16, 96 // 16, 96 // 16])
+    print(x.shape)
+    print(_get_active_ex_or_ii(H=x.shape[2], W=x.shape[3], D=x.shape[4], returning_active_ex=True).shape)
+    print(x.shape)
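The sparse machinery above hinges on `_get_active_ex_or_ii`: a low-resolution `B1fff` patch mask is nearest-neighbour-expanded with `repeat_interleave` to each feature map's resolution, and every conv/pool output is multiplied by it. A 1-D, pure-Python analogue of that expand-and-mask step (function names here are illustrative, not repo API):

```python
def upsample_mask_1d(mask, length):
    """1-D analogue of _get_active_ex_or_ii: nearest-neighbour-expand a patch mask
    to feature-map resolution (torch does this with repeat_interleave)."""
    repeat = length // len(mask)  # assumes length is a multiple of the mask size
    return [m for m in mask for _ in range(repeat)]


def sparse_mask(features, mask):
    """Zero out features at masked positions, as sp_conv_forward does after each conv."""
    active = upsample_mask_1d(mask, len(features))
    return [f * a for f, a in zip(features, active)]
```

Because the expansion is nearest-neighbour, one masked patch zeros a whole contiguous block of the higher-resolution feature map, which is what keeps the masking consistent across encoder stages with different downsampling ratios.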
models/mamba/bi_vision_mamba.py ADDED
@@ -0,0 +1,416 @@
1
+ # Copyright (c) 2023, Tri Dao, Albert Gu.
2
+
3
+ import math
4
+ from typing import Optional
5
+
6
+ import torch
7
+ import torch.nn as nn
8
+ import torch.nn.functional as F
9
+ from torch import Tensor
10
+
11
+ from einops import rearrange, repeat
12
+
13
+ from mamba_ssm.ops.selective_scan_interface import selective_scan_fn, mamba_inner_fn
14
+
15
+ try:
16
+ from causal_conv1d import causal_conv1d_fn, causal_conv1d_update
17
+ except ImportError:
18
+ causal_conv1d_fn, causal_conv1d_update = None, None
19
+
20
+ try:
21
+ from mamba_ssm.ops.triton.selective_state_update import selective_state_update
22
+ except ImportError:
23
+ selective_state_update = None
24
+
25
+ try:
26
+ from mamba_ssm.ops.triton.layernorm import RMSNorm, layer_norm_fn, rms_norm_fn
27
+ except ImportError:
28
+ RMSNorm, layer_norm_fn, rms_norm_fn = None, None, None
29
+
30
+
31
+ class Mamba(nn.Module):
32
+ def __init__(
33
+ self,
34
+ d_model,
35
+ d_state=16,
36
+ d_conv=4,
37
+ expand=2,
38
+ dt_rank="auto",
39
+ dt_min=0.001,
40
+ dt_max=0.1,
41
+ dt_init="random",
42
+ dt_scale=1.0,
43
+ dt_init_floor=1e-4,
44
+ conv_bias=True,
45
+ bias=False,
46
+ use_fast_path=True, # Fused kernel options
47
+ layer_idx=None,
48
+ device=None,
49
+ dtype=None,
50
+ bimamba_type="none"
51
+ ):
52
+ factory_kwargs = {"device": device, "dtype": dtype}
53
+ super().__init__()
54
+ self.d_model = d_model
55
+ self.d_state = d_state
56
+ self.d_conv = d_conv
57
+ self.expand = expand
58
+ self.d_inner = int(self.expand * self.d_model)
59
+ self.dt_rank = math.ceil(self.d_model / 16) if dt_rank == "auto" else dt_rank
60
+ self.use_fast_path = use_fast_path
61
+ self.layer_idx = layer_idx
62
+ self.bimamba_type = bimamba_type
63
+
64
+ self.in_proj = nn.Linear(self.d_model, self.d_inner * 2, bias=bias, **factory_kwargs)
65
+
66
+ self.conv1d = nn.Conv1d(
67
+ in_channels=self.d_inner,
68
+ out_channels=self.d_inner,
69
+ bias=conv_bias,
70
+ kernel_size=d_conv,
71
+ groups=self.d_inner,
72
+ padding=d_conv - 1,
73
+ **factory_kwargs,
74
+ )
75
+
76
+ self.activation = "silu"
77
+ self.act = nn.SiLU()
78
+
79
+ self.x_proj = nn.Linear(
80
+ self.d_inner, self.dt_rank + self.d_state * 2, bias=False, **factory_kwargs
81
+ )
82
+ self.dt_proj = nn.Linear(self.dt_rank, self.d_inner, bias=True, **factory_kwargs)
83
+
84
+ # Initialize special dt projection to preserve variance at initialization
85
+ dt_init_std = self.dt_rank ** -0.5 * dt_scale
86
+ if dt_init == "constant":
87
+ nn.init.constant_(self.dt_proj.weight, dt_init_std)
88
+ elif dt_init == "random":
89
+ nn.init.uniform_(self.dt_proj.weight, -dt_init_std, dt_init_std)
90
+ else:
91
+ raise NotImplementedError
92
+
93
+ # Initialize dt bias so that F.softplus(dt_bias) is between dt_min and dt_max
94
+ dt = torch.exp(
95
+ torch.rand(self.d_inner, **factory_kwargs) * (math.log(dt_max) - math.log(dt_min))
96
+ + math.log(dt_min)
97
+ ).clamp(min=dt_init_floor)
98
+ # Inverse of softplus: https://github.com/pytorch/pytorch/issues/72759
99
+ inv_dt = dt + torch.log(-torch.expm1(-dt))
100
+ with torch.no_grad():
101
+ self.dt_proj.bias.copy_(inv_dt)
102
+ # Our initialization would set all Linear.bias to zero, need to mark this one as _no_reinit
103
+ self.dt_proj.bias._no_reinit = True
104
+
105
+ # S4D real initialization
106
+ A = repeat(
107
+ torch.arange(1, self.d_state + 1, dtype=torch.float32, device=device),
108
+ "n -> d n",
109
+ d=self.d_inner,
110
+ ).contiguous()
111
+ A_log = torch.log(A) # Keep A_log in fp32
112
+ self.A_log = nn.Parameter(A_log)
113
+ self.A_log._no_weight_decay = True
114
+
115
+ # D "skip" parameter
116
+ self.D = nn.Parameter(torch.ones(self.d_inner, device=device)) # Keep in fp32
117
+ self.D._no_weight_decay = True
118
+
119
+ # assert bimamba_type == "v2"
120
+
121
+ A_b = repeat(
122
+ torch.arange(1, self.d_state + 1, dtype=torch.float32, device=device),
123
+ "n -> d n",
124
+ d=self.d_inner,
125
+ ).contiguous()
126
+ A_b_log = torch.log(A_b) # Keep A_b_log in fp32
127
+ self.A_b_log = nn.Parameter(A_b_log)
128
+ self.A_b_log._no_weight_decay = True
129
+
130
+ self.conv1d_b = nn.Conv1d(
131
+ in_channels=self.d_inner,
132
+ out_channels=self.d_inner,
133
+ bias=conv_bias,
134
+ kernel_size=d_conv,
135
+ groups=self.d_inner,
136
+ padding=d_conv - 1,
137
+ **factory_kwargs,
138
+ )
139
+
140
+ self.x_proj_b = nn.Linear(
141
+ self.d_inner, self.dt_rank + self.d_state * 2, bias=False, **factory_kwargs
142
+ )
143
+ self.dt_proj_b = nn.Linear(self.dt_rank, self.d_inner, bias=True, **factory_kwargs)
144
+
145
+ self.D_b = nn.Parameter(torch.ones(self.d_inner, device=device)) # Keep in fp32
146
+ self.D_b._no_weight_decay = True
147
+
148
+ self.out_proj = nn.Linear(self.d_inner, self.d_model, bias=bias, **factory_kwargs)
149
+
150
+ def forward(self, hidden_states, inference_params=None):
151
+ """
152
+ hidden_states: (B, L, D)
153
+ Returns: same shape as hidden_states
154
+ """
155
+ batch, seqlen, dim = hidden_states.shape
156
+
157
+ conv_state, ssm_state = None, None
158
+ if inference_params is not None:
159
+ conv_state, ssm_state = self._get_states_from_cache(inference_params, batch)
160
+ if inference_params.seqlen_offset > 0:
161
+ # The states are updated inplace
162
+ out, _, _ = self.step(hidden_states, conv_state, ssm_state)
163
+ return out
164
+
165
+ # We do matmul and transpose BLH -> HBL at the same time
166
+ xz = rearrange(
167
+ self.in_proj.weight @ rearrange(hidden_states, "b l d -> d (b l)"),
168
+ "d (b l) -> b d l",
169
+ l=seqlen,
170
+ )
171
+ if self.in_proj.bias is not None:
172
+ xz = xz + rearrange(self.in_proj.bias.to(dtype=xz.dtype), "d -> d 1")
173
+
174
+ A = -torch.exp(self.A_log.float()) # (d_inner, d_state)
175
+ # In the backward pass we write dx and dz next to each other to avoid torch.cat
176
+ if self.use_fast_path and causal_conv1d_fn is not None and inference_params is None: # Doesn't support outputting the states
177
+ if self.bimamba_type == "v2":
178
+ A_b = -torch.exp(self.A_b_log.float())
179
+ out = mamba_inner_fn_no_out_proj(
180
+ xz,
181
+ self.conv1d.weight,
182
+ self.conv1d.bias,
183
+ self.x_proj.weight,
184
+ self.dt_proj.weight,
185
+ A,
186
+ None, # input-dependent B
187
+ None, # input-dependent C
188
+ self.D.float(),
189
+ delta_bias=self.dt_proj.bias.float(),
190
+ delta_softplus=True,
191
+ )
192
+ out_b = mamba_inner_fn_no_out_proj(
193
+ xz.flip([-1]),
194
+ self.conv1d_b.weight,
195
+ self.conv1d_b.bias,
196
+ self.x_proj_b.weight,
197
+ self.dt_proj_b.weight,
198
+ A_b,
199
+ None,
200
+ None,
201
+ self.D_b.float(),
202
+ delta_bias=self.dt_proj_b.bias.float(),
203
+ delta_softplus=True,
204
+ )
205
+ # F.linear(rearrange(out_z, "b d l -> b l d"), out_proj_weight, out_proj_bias)
206
+ out = F.linear(rearrange(out + out_b.flip([-1]), "b d l -> b l d"), self.out_proj.weight,
207
+ self.out_proj.bias)
208
+ else:
209
+ out = mamba_inner_fn(
210
+ xz,
211
+ self.conv1d.weight,
212
+ self.conv1d.bias,
213
+ self.x_proj.weight,
214
+ self.dt_proj.weight,
215
+ self.out_proj.weight,
216
+ self.out_proj.bias,
217
+ A,
218
+ None, # input-dependent B
219
+ None, # input-dependent C
220
+ self.D.float(),
221
+ delta_bias=self.dt_proj.bias.float(),
222
+ delta_softplus=True,
223
+ )
224
+ else:
225
+ x, z = xz.chunk(2, dim=1)
226
+ # Compute short convolution
227
+ if conv_state is not None:
228
+ # If we just take x[:, :, -self.d_conv :], it will error if seqlen < self.d_conv
229
+ # Instead F.pad will pad with zeros if seqlen < self.d_conv, and truncate otherwise.
230
+ conv_state.copy_(F.pad(x, (self.d_conv - x.shape[-1], 0))) # Update state (B D W)
231
+ if causal_conv1d_fn is None:
232
+ x = self.act(self.conv1d(x)[..., :seqlen])
233
+ else:
234
+ assert self.activation in ["silu", "swish"]
235
+ x = causal_conv1d_fn(
236
+ x=x,
237
+ weight=rearrange(self.conv1d.weight, "d 1 w -> d w"),
238
+ bias=self.conv1d.bias,
239
+ activation=self.activation,
240
+ )
241
+
242
+ # We're careful here about the layout, to avoid extra transposes.
243
+ # We want dt to have d as the slowest moving dimension
244
+ # and L as the fastest moving dimension, since those are what the ssm_scan kernel expects.
245
+ x_dbl = self.x_proj(rearrange(x, "b d l -> (b l) d")) # (bl d)
246
+ dt, B, C = torch.split(x_dbl, [self.dt_rank, self.d_state, self.d_state], dim=-1)
247
+ dt = self.dt_proj.weight @ dt.t()
248
+ dt = rearrange(dt, "d (b l) -> b d l", l=seqlen)
249
+ B = rearrange(B, "(b l) dstate -> b dstate l", l=seqlen).contiguous()
250
+ C = rearrange(C, "(b l) dstate -> b dstate l", l=seqlen).contiguous()
251
+ assert self.activation in ["silu", "swish"]
252
+ y = selective_scan_fn(
253
+ x,
254
+ dt,
255
+ A,
256
+ B,
257
+ C,
258
+ self.D.float(),
259
+ z=z,
260
+ delta_bias=self.dt_proj.bias.float(),
261
+ delta_softplus=True,
262
+ return_last_state=ssm_state is not None,
263
+ )
264
+ if ssm_state is not None:
265
+ y, last_state = y
266
+ ssm_state.copy_(last_state)
267
+ y = rearrange(y, "b d l -> b l d")
268
+ out = self.out_proj(y)
269
+ return out
270
+
271
+ def step(self, hidden_states, conv_state, ssm_state):
272
+ dtype = hidden_states.dtype
273
+ assert hidden_states.shape[1] == 1, "Only support decoding with 1 token at a time for now"
274
+ xz = self.in_proj(hidden_states.squeeze(1)) # (B 2D)
275
+ x, z = xz.chunk(2, dim=-1) # (B D)
276
+
277
+ # Conv step
278
+ if causal_conv1d_update is None:
279
+ conv_state.copy_(torch.roll(conv_state, shifts=-1, dims=-1)) # Update state (B D W)
280
+ conv_state[:, :, -1] = x
281
+ x = torch.sum(conv_state * rearrange(self.conv1d.weight, "d 1 w -> d w"), dim=-1) # (B D)
282
+ if self.conv1d.bias is not None:
283
+ x = x + self.conv1d.bias
284
+ x = self.act(x).to(dtype=dtype)
285
+ else:
286
+ x = causal_conv1d_update(
287
+ x,
288
+ conv_state,
289
+ rearrange(self.conv1d.weight, "d 1 w -> d w"),
290
+ self.conv1d.bias,
291
+ self.activation,
292
+ )
293
+
294
+ x_db = self.x_proj(x) # (B dt_rank+2*d_state)
295
+ dt, B, C = torch.split(x_db, [self.dt_rank, self.d_state, self.d_state], dim=-1)
296
+ # Don't add dt_bias here
297
+ dt = F.linear(dt, self.dt_proj.weight) # (B d_inner)
298
+ A = -torch.exp(self.A_log.float()) # (d_inner, d_state)
299
+
300
+ # SSM step
301
+ if selective_state_update is None:
302
+ # Discretize A and B
303
+ dt = F.softplus(dt + self.dt_proj.bias.to(dtype=dt.dtype))
304
+ dA = torch.exp(torch.einsum("bd,dn->bdn", dt, A))
305
+ dB = torch.einsum("bd,bn->bdn", dt, B)
306
+ ssm_state.copy_(ssm_state * dA + rearrange(x, "b d -> b d 1") * dB)
307
+ y = torch.einsum("bdn,bn->bd", ssm_state.to(dtype), C)
308
+ y = y + self.D.to(dtype) * x
309
+ y = y * self.act(z) # (B D)
310
+ else:
311
+ y = selective_state_update(
312
+ ssm_state, x, dt, A, B, C, self.D, z=z, dt_bias=self.dt_proj.bias, dt_softplus=True
313
+ )
314
+
315
+ out = self.out_proj(y)
316
+ return out.unsqueeze(1), conv_state, ssm_state
317
+
318
+ def allocate_inference_cache(self, batch_size, max_seqlen, dtype=None, **kwargs):
319
+ device = self.out_proj.weight.device
320
+ conv_dtype = self.conv1d.weight.dtype if dtype is None else dtype
321
+ conv_state = torch.zeros(
322
+ batch_size, self.d_model * self.expand, self.d_conv, device=device, dtype=conv_dtype
323
+ )
324
+ ssm_dtype = self.dt_proj.weight.dtype if dtype is None else dtype
325
+ # ssm_dtype = torch.float32
326
+ ssm_state = torch.zeros(
327
+ batch_size, self.d_model * self.expand, self.d_state, device=device, dtype=ssm_dtype
328
+ )
329
+ return conv_state, ssm_state
330
+
331
+ def _get_states_from_cache(self, inference_params, batch_size, initialize_states=False):
332
+ assert self.layer_idx is not None
333
+ if self.layer_idx not in inference_params.key_value_memory_dict:
334
+ batch_shape = (batch_size,)
335
+ conv_state = torch.zeros(
336
+ batch_size,
337
+ self.d_model * self.expand,
338
+ self.d_conv,
339
+ device=self.conv1d.weight.device,
340
+ dtype=self.conv1d.weight.dtype,
341
+ )
342
+ ssm_state = torch.zeros(
343
+ batch_size,
344
+ self.d_model * self.expand,
345
+ self.d_state,
346
+ device=self.dt_proj.weight.device,
347
+ dtype=self.dt_proj.weight.dtype,
348
+ # dtype=torch.float32,
349
+ )
350
+ inference_params.key_value_memory_dict[self.layer_idx] = (conv_state, ssm_state)
351
+ else:
352
+ conv_state, ssm_state = inference_params.key_value_memory_dict[self.layer_idx]
353
+ # TODO: What if batch size changes between generation, and we reuse the same states?
354
+ if initialize_states:
355
+ conv_state.zero_()
356
+ ssm_state.zero_()
357
+ return conv_state, ssm_state
358
+
359
+
360
+ class Block(nn.Module):
361
+ def __init__(
362
+ self, dim, mixer_cls, norm_cls=nn.LayerNorm, fused_add_norm=False, residual_in_fp32=False
363
+ ):
364
+ """
365
+ Simple block wrapping a mixer class with LayerNorm/RMSNorm and residual connection"
366
+
367
+        This Block has a slightly different structure compared to a regular
+        prenorm Transformer block.
+        The standard block is: LN -> MHA/MLP -> Add.
+        [Ref: https://arxiv.org/abs/2002.04745]
+        Here we have: Add -> LN -> Mixer, returning both
+        the hidden_states (output of the mixer) and the residual.
+        This is purely for performance reasons, as we can fuse add and LayerNorm.
+        The residual needs to be provided (except for the very first block).
+        """
+        super().__init__()
+        self.residual_in_fp32 = residual_in_fp32
+        self.fused_add_norm = fused_add_norm
+        self.mixer = mixer_cls(dim)
+        self.norm = norm_cls(dim)
+        if self.fused_add_norm:
+            assert RMSNorm is not None, "RMSNorm import fails"
+            assert isinstance(
+                self.norm, (nn.LayerNorm, RMSNorm)
+            ), "Only LayerNorm and RMSNorm are supported for fused_add_norm"
+
+    def forward(
+        self, hidden_states: Tensor, residual: Optional[Tensor] = None, inference_params=None
+    ):
+        r"""Pass the input through the encoder layer.
+
+        Args:
+            hidden_states: the sequence to the encoder layer (required).
+            residual: hidden_states = Mixer(LN(residual))
+        """
+        if not self.fused_add_norm:
+            residual = (hidden_states + residual) if residual is not None else hidden_states
+            hidden_states = self.norm(residual.to(dtype=self.norm.weight.dtype))
+            if self.residual_in_fp32:
+                residual = residual.to(torch.float32)
+        else:
+            fused_add_norm_fn = rms_norm_fn if isinstance(self.norm, RMSNorm) else layer_norm_fn
+            hidden_states, residual = fused_add_norm_fn(
+                hidden_states,
+                self.norm.weight,
+                self.norm.bias,
+                residual=residual,
+                prenorm=True,
+                residual_in_fp32=self.residual_in_fp32,
+                eps=self.norm.eps,
+            )
+        hidden_states = self.mixer(hidden_states, inference_params=inference_params)
+        return hidden_states, residual
+
+    def allocate_inference_cache(self, batch_size, max_seqlen, dtype=None, **kwargs):
+        return self.mixer.allocate_inference_cache(batch_size, max_seqlen, dtype=dtype, **kwargs)
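The Add -> LN -> Mixer ordering described in the docstring can be sketched in plain Python. This is a toy illustration only (lists stand in for tensors, and any callable stands in for the mixer — both are assumptions of mine, not part of this repo):

```python
# Toy sketch of the "Add -> LN -> Mixer" block ordering used above
# (assumption: plain Python lists stand in for tensors).
def layer_norm(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / (var + eps) ** 0.5 for x in v]

def block_step(hidden_states, residual, mixer):
    # Add: fold the previous block's output into the residual stream first...
    residual = hidden_states if residual is None else [h + r for h, r in zip(hidden_states, residual)]
    # ...then LN -> Mixer: the mixer only ever sees the normalized residual.
    hidden_states = mixer(layer_norm(residual))
    return hidden_states, residual
```

Keeping the un-normalized `residual` alongside `hidden_states` is what allows the add and the LayerNorm to be fused into a single kernel in the real implementation.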
models/network/hymamba.py ADDED
@@ -0,0 +1,681 @@
+ import torch
+ import torch.nn as nn
+ from models.encoder import SparseConvNeXtLayerNorm, _get_active_ex_or_ii
+ from typing import Optional, Sequence, Tuple, Union, List
+ import numpy as np
+ from models.mamba.bi_vision_mamba import Mamba
+ from monai.networks.blocks.unetr_block import UnetrUpBlock
+
+
+ def build_3d_sincos_position_embedding(grid_size, embed_dim, num_tokens=0, temperature=10000.):
+     grid_size = (grid_size, grid_size, grid_size)
+     h, w, d = grid_size
+     grid_h = torch.arange(h, dtype=torch.float32)
+     grid_w = torch.arange(w, dtype=torch.float32)
+     grid_d = torch.arange(d, dtype=torch.float32)
+
+     grid_h, grid_w, grid_d = torch.meshgrid(grid_h, grid_w, grid_d)
+     assert embed_dim % 6 == 0, 'Embed dimension must be divisible by 6 for 3D sin-cos position embedding'
+     pos_dim = embed_dim // 6
+     omega = torch.arange(pos_dim, dtype=torch.float32) / pos_dim
+     omega = 1. / (temperature ** omega)
+     out_h = torch.einsum('m,d->md', [grid_h.flatten(), omega])
+     out_w = torch.einsum('m,d->md', [grid_w.flatten(), omega])
+     out_d = torch.einsum('m,d->md', [grid_d.flatten(), omega])
+     pos_emb = torch.cat(
+         [torch.sin(out_h), torch.cos(out_h), torch.sin(out_w), torch.cos(out_w), torch.sin(out_d), torch.cos(out_d)],
+         dim=1)[None, :, :]
+
+     assert num_tokens == 1 or num_tokens == 0, "Number of tokens must be 0 or 1"
+     if num_tokens == 1:
+         pe_token = torch.zeros([1, 1, embed_dim], dtype=torch.float32)
+         pos_embed = nn.Parameter(torch.cat([pe_token, pos_emb], dim=1))
+     else:
+         pos_embed = nn.Parameter(pos_emb)
+     pos_embed.requires_grad = False
+     return pos_embed
+
+
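A quick sanity check of the divisibility requirement in `build_3d_sincos_position_embedding`: each of the three axes contributes a sin block and a cos block of width `embed_dim // 6`, so the six concatenated blocks recover `embed_dim` exactly. A dependency-free sketch of the bookkeeping (the helper names are my own, not from this repo):

```python
# Width bookkeeping for the 3D sin-cos embedding: 3 axes x (sin + cos),
# each of width pos_dim = embed_dim // 6.
def sincos_widths(embed_dim):
    assert embed_dim % 6 == 0, "embed_dim must be divisible by 6"
    pos_dim = embed_dim // 6
    return [pos_dim] * 6  # sin_h, cos_h, sin_w, cos_w, sin_d, cos_d

def frequency_schedule(pos_dim, temperature=10000.0):
    # same geometric frequency ladder as `omega` in the function above
    return [1.0 / (temperature ** (i / pos_dim)) for i in range(pos_dim)]
```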
+ class MlpChannel(nn.Module):
+     def __init__(self, hidden_size, mlp_dim):
+         super().__init__()
+         self.fc1 = nn.Linear(hidden_size, mlp_dim)
+         self.act = nn.GELU()
+         self.fc2 = nn.Linear(mlp_dim, hidden_size)
+
+     def forward(self, x):
+         x = self.fc1(x)
+         x = self.act(x)
+         x = self.fc2(x)
+         return x
+
+
+ class MambaLayer(nn.Module):
+     def __init__(self, dim, d_state=16, d_conv=4, expand=2):
+         super().__init__()
+         self.dim = dim
+         self.norm1 = nn.LayerNorm(dim)
+         self.mamba = Mamba(
+             d_model=dim,      # Model dimension d_model
+             d_state=d_state,  # SSM state expansion factor
+             d_conv=d_conv,    # Local convolution width
+             expand=expand,    # Block expansion factor
+             bimamba_type="v1",
+         )
+         self.mlp = MlpChannel(hidden_size=dim, mlp_dim=2 * dim)
+         self.norm2 = nn.LayerNorm(dim)
+
+     def forward(self, x):
+         x = self.mamba(self.norm1(x)) + x
+         x = self.mlp(self.norm2(x)) + x
+         return x
+
+
+ class MaskedAutoencoderMamba(nn.Module):
+     """Masked autoencoder with a Mamba backbone."""
+
+     def __init__(self, img_size=96, downsample_rato=16, embed_dim=384, depth=8, norm_layer=nn.LayerNorm, sparse=True):
+         super().__init__()
+         print("mamba sparse: ", sparse)
+         # --------------------------------------------------------------------------
+         # MAE encoder specifics
+         self.grid_size = img_size // downsample_rato
+         self.num_patches = self.grid_size ** 3
+         self.embed_dim = embed_dim
+         self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim),
+                                       requires_grad=False)  # fixed sin-cos embedding
+
+         self.blocks = nn.ModuleList([
+             MambaLayer(dim=embed_dim)
+             for i in range(depth)])
+         # self.gsc = GSC(in_channels=embed_dim, sparse=sparse)
+
+         self.sparse = sparse
+         if self.sparse:
+             self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
+         # --------------------------------------------------------------------------
+         self.initialize_weights()
+
+     def initialize_weights(self):
+         # initialize (and freeze) pos_embed with a sin-cos embedding
+         pos_embed = build_3d_sincos_position_embedding(self.grid_size, self.embed_dim)
+         self.pos_embed.data.copy_(pos_embed)
+         if self.sparse:
+             torch.nn.init.normal_(self.mask_token, std=.02)
+         # initialize nn.Linear and nn.LayerNorm
+         self.apply(self._init_weights)
+
+     def _init_weights(self, m):
+         if isinstance(m, nn.Linear):
+             # we use xavier_uniform following the official JAX ViT:
+             torch.nn.init.xavier_uniform_(m.weight)
+             if isinstance(m, nn.Linear) and m.bias is not None:
+                 nn.init.constant_(m.bias, 0)
+         elif isinstance(m, nn.LayerNorm):
+             nn.init.constant_(m.bias, 0)
+             nn.init.constant_(m.weight, 1.0)
+
+     def random_masking(self, enc, active_b1fff):
+         """
+         Perform per-sample random masking by per-sample shuffling.
+         Per-sample shuffling is done by argsort of the noise.
+         enc: [N, L, D], sequence
+         """
+         N, L, D = enc.shape  # batch, length, dim
+         mask = torch.tensor(active_b1fff, dtype=torch.int).flatten(2).transpose(1, 2)
+         # sort noise for each sample
+         noise = 1 - mask
+         len_keep = torch.sum(mask)
+         ids_shuffle = torch.argsort(noise, dim=1)  # ascend: small is keep, large is remove
+         ids_restore = torch.argsort(ids_shuffle, dim=1)
+
+         # keep the first subset
+         ids_keep = ids_shuffle[:, :len_keep]
+         x_masked = torch.gather(enc, dim=1, index=ids_keep.repeat(1, 1, D))
+
+         # the binary mask: 1 is keep (active), 0 is remove
+         return x_masked, mask, ids_restore
+
+     def unmasking(self, x, ids_restore):
+         mask_tokens = self.mask_token.repeat(x.shape[0], ids_restore.shape[1] - x.shape[1], 1)
+         x_ = torch.cat([x, mask_tokens], dim=1)  # no cls token
+         x = torch.gather(x_, dim=1, index=ids_restore.repeat(1, 1, x.shape[2]))  # unshuffle
+         return x
+
+     def forward_encoder(self, enc, active_b1fff=None):
+         # enc = self.gsc(enc)
+         B, C, H, W, D = enc.shape
+         x = enc.flatten(2).transpose(1, 2)
+         # add pos embed w/o cls token
+         x = x + self.pos_embed
+         if self.sparse:
+             # masking: length -> length * mask_ratio
+             x, mask, ids_restore = self.random_masking(x, active_b1fff)
+             # apply Mamba blocks
+             for blk in self.blocks:
+                 x = blk(x)
+             x = self.unmasking(x, ids_restore)
+         else:
+             for blk in self.blocks:
+                 x = blk(x)
+         x = x.transpose(1, 2).reshape(B, C, H, W, D)
+         return x
+
+     def forward(self, imgs, active_b1fff=None):
+         return self.forward_encoder(imgs, active_b1fff)
+
+
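The argsort trick used by `random_masking` / `unmasking` above can be illustrated without tensors: sorting `noise = 1 - mask` moves the kept (active) tokens to the front, and the argsort of that permutation gives the indices that undo the shuffle. A toy, pure-Python round trip (the mask values and the `'<M>'` placeholder are made up for illustration):

```python
def argsort(seq):
    return sorted(range(len(seq)), key=seq.__getitem__)

mask = [1, 0, 1, 1, 0]            # 1 = keep (active voxel), 0 = masked out
tokens = ['a', 'b', 'c', 'd', 'e']
noise = [1 - m for m in mask]
ids_shuffle = argsort(noise)      # kept positions come first (stable sort)
ids_restore = argsort(ids_shuffle)
len_keep = sum(mask)

kept = [tokens[i] for i in ids_shuffle[:len_keep]]   # the encoder only sees these
filled = kept + ['<M>'] * (len(tokens) - len_keep)   # append mask tokens
restored = [filled[i] for i in ids_restore]          # unshuffle
```

`restored` puts every kept token back at its original index with mask tokens filling the holes, which is what `unmasking` does with `torch.gather`.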
+ class MedNeXtBlock(nn.Module):
+     def __init__(self,
+                  in_channels: int,
+                  out_channels: int,
+                  exp_r: int = 4,
+                  kernel_size: int = 7,
+                  do_res: bool = True,
+                  n_groups: Optional[int] = None,
+                  sparse=False):
+
+         super().__init__()
+
+         self.do_res = do_res
+         self.sparse = sparse
+         conv = nn.Conv3d
+
+         # First convolution layer with depth-wise convolutions
+         self.conv1 = conv(
+             in_channels=in_channels,
+             out_channels=in_channels,
+             kernel_size=kernel_size,
+             stride=1,
+             padding=kernel_size // 2,
+             groups=in_channels if n_groups is None else n_groups,
+         )
+
+         # Normalization layer (sparse-aware ConvNeXt LayerNorm)
+         self.norm = SparseConvNeXtLayerNorm(normalized_shape=in_channels, data_format='channels_first', sparse=sparse)
+
+         # Second convolution (expansion) layer with Conv3D 1x1x1
+         self.conv2 = conv(
+             in_channels=in_channels,
+             out_channels=exp_r * in_channels,
+             kernel_size=1,
+             stride=1,
+             padding=0
+         )
+
+         # GELU activation
+         self.act = nn.GELU()
+
+         # Third convolution (compression) layer with Conv3D 1x1x1
+         self.conv3 = conv(
+             in_channels=exp_r * in_channels,
+             out_channels=out_channels,
+             kernel_size=1,
+             stride=1,
+             padding=0
+         )
+
+     def forward(self, x, dummy_tensor=None):
+         x1 = x
+         x1 = self.conv1(x1)
+         x1 = self.act(self.conv2(self.norm(x1)))
+         x1 = self.conv3(x1)
+         if self.sparse:
+             x1 *= _get_active_ex_or_ii(H=x1.shape[2], W=x1.shape[3], D=x1.shape[4], returning_active_ex=True)
+         if self.do_res:
+             x1 = x + x1
+         return x1
+
+
+ class MedNeXtDownBlock(MedNeXtBlock):
+
+     def __init__(self, in_channels, out_channels, exp_r=4, kernel_size=7,
+                  do_res=False, sparse=False):
+
+         super().__init__(in_channels, out_channels, exp_r, kernel_size,
+                          do_res=False, sparse=sparse)
+
+         self.resample_do_res = do_res
+         if do_res:
+             self.res_conv = nn.Conv3d(
+                 in_channels=in_channels,
+                 out_channels=out_channels,
+                 kernel_size=1,
+                 stride=2
+             )
+
+         self.conv1 = nn.Conv3d(
+             in_channels=in_channels,
+             out_channels=in_channels,
+             kernel_size=kernel_size,
+             stride=2,
+             padding=kernel_size // 2,
+             groups=in_channels,
+         )
+
+     def forward(self, x, dummy_tensor=None):
+         x1 = super().forward(x)
+         if self.resample_do_res:
+             res = self.res_conv(x)
+             x1 = x1 + res
+
+         return x1
+
+
+ class UnetResBlock(nn.Module):
+     """
+     A skip-connection based module that can be used for DynUNet, based on:
+     `Automated Design of Deep Learning Methods for Biomedical Image Segmentation <https://arxiv.org/abs/1904.08128>`_.
+     `nnU-Net: Self-adapting Framework for U-Net-Based Medical Image Segmentation <https://arxiv.org/abs/1809.10486>`_.
+
+     Args:
+         sparse: whether to use the sparse-aware normalization layers.
+         in_channels: number of input channels.
+         out_channels: number of output channels.
+         kernel_size: convolution kernel size.
+         stride: convolution stride.
+     """
+
+     def __init__(
+         self,
+         sparse: bool,
+         in_channels: int,
+         out_channels: int,
+         kernel_size: Union[Sequence[int], int],
+         stride: Union[Sequence[int], int],
+     ):
+         super().__init__()
+         self.conv1 = nn.Conv3d(
+             in_channels,
+             out_channels,
+             kernel_size=kernel_size,
+             stride=stride,
+             padding=kernel_size // 2)
+         self.conv2 = nn.Conv3d(
+             out_channels,
+             out_channels,
+             kernel_size=kernel_size,
+             stride=1,
+             padding=kernel_size // 2,
+         )
+         self.lrelu = nn.LeakyReLU(inplace=True, negative_slope=0.01)
+         self.norm1 = SparseConvNeXtLayerNorm(normalized_shape=out_channels, data_format='channels_first', sparse=sparse)
+         self.norm2 = SparseConvNeXtLayerNorm(normalized_shape=out_channels, data_format='channels_first', sparse=sparse)
+         self.downsample = in_channels != out_channels
+         stride_np = np.atleast_1d(stride)
+         if not np.all(stride_np == 1):
+             self.downsample = True
+         if self.downsample:
+             self.conv3 = nn.Conv3d(
+                 in_channels,
+                 out_channels,
+                 kernel_size=1,
+                 stride=stride)
+             self.norm3 = SparseConvNeXtLayerNorm(normalized_shape=out_channels, data_format='channels_first', sparse=sparse)
+
+     def forward(self, inp):
+         residual = inp
+         out = self.conv1(inp)
+         out = self.norm1(out)
+         out = self.lrelu(out)
+         out = self.conv2(out)
+         out = self.norm2(out)
+         if hasattr(self, "conv3"):
+             residual = self.conv3(residual)
+         if hasattr(self, "norm3"):
+             residual = self.norm3(residual)
+         out += residual
+         out = self.lrelu(out)
+         return out
+
+
+ class MedNeXtUpBlock(MedNeXtBlock):
+
+     def __init__(self, in_channels, out_channels, exp_r=4, kernel_size=3,
+                  do_res=True, sparse=False):
+         super().__init__(in_channels, out_channels, exp_r, kernel_size,
+                          do_res=False, sparse=sparse)
+
+         self.resample_do_res = do_res
+
+         conv = nn.ConvTranspose3d
+         if do_res:
+             self.res_conv = conv(
+                 in_channels=in_channels,
+                 out_channels=out_channels,
+                 kernel_size=1,
+                 stride=2
+             )
+
+         self.conv1 = conv(
+             in_channels=in_channels,
+             out_channels=in_channels,
+             kernel_size=kernel_size,
+             stride=2,
+             padding=kernel_size // 2,
+             groups=in_channels,
+         )
+
+     def forward(self, x, dummy_tensor=None):
+         x1 = super().forward(x)
+         # Asymmetric padding, but necessary to match shapes
+         x1 = torch.nn.functional.pad(x1, (1, 0, 1, 0, 1, 0))
+
+         if self.resample_do_res:
+             res = self.res_conv(x)
+             res = torch.nn.functional.pad(res, (1, 0, 1, 0, 1, 0))
+             x1 = x1 + res
+         return x1
+
+
+ class UnetOutBlock(nn.Module):
+     def __init__(self, in_channels: int, n_classes: int):
+         super().__init__()
+         self.conv = nn.Conv3d(
+             in_channels,
+             n_classes,
+             kernel_size=1,
+             stride=1,
+             bias=True,
+         )
+
+     def forward(self, inp):
+         return self.conv(inp)
+
+
+ class Embeddings(nn.Module):
+     def __init__(self,
+                  in_channel: int = 3,
+                  channels: Tuple = (32, 64, 96, 128, 192),
+                  depths: Tuple = (1, 1, 3, 1, 1),
+                  kernels: Tuple = (3, 3, 3, 3, 3),
+                  exp_r: Tuple = (2, 4, 4, 4, 2),
+                  sparse=True):
+         super(Embeddings, self).__init__()
+         self.dim = [channels[1], channels[2], channels[3], channels[4], channels[4]]
+         self.stem = nn.Conv3d(in_channels=in_channel, out_channels=channels[0], kernel_size=3, stride=1, padding=1)
+
+         self.layer2 = nn.Sequential(*[
+             MedNeXtBlock(
+                 in_channels=channels[1],
+                 out_channels=channels[1],
+                 exp_r=exp_r[1],
+                 kernel_size=kernels[1],
+                 do_res=True,
+                 sparse=sparse
+             )
+             for i in range(depths[1])])
+
+         self.layer3 = nn.Sequential(*[
+             MedNeXtBlock(
+                 in_channels=channels[2],
+                 out_channels=channels[2],
+                 exp_r=exp_r[2],
+                 kernel_size=kernels[2],
+                 do_res=True,
+                 sparse=sparse
+             )
+             for i in range(depths[2])])
+
+         self.layer4 = nn.Sequential(*[
+             MedNeXtBlock(
+                 in_channels=channels[3],
+                 out_channels=channels[3],
+                 exp_r=exp_r[3],
+                 kernel_size=kernels[3],
+                 do_res=True,
+                 sparse=sparse
+             )
+             for i in range(depths[3])])
+
+         self.layer5 = nn.Sequential(*[
+             MedNeXtBlock(
+                 in_channels=channels[4],
+                 out_channels=channels[4],
+                 exp_r=exp_r[4],
+                 kernel_size=kernels[4],
+                 do_res=True,
+                 sparse=sparse
+             )
+             for i in range(depths[4])])
+
+         self.down = nn.MaxPool3d((2, 2, 2))
+         self.expend1 = nn.Conv3d(in_channels=channels[0], out_channels=channels[1], kernel_size=3, stride=1, padding=1)
+         self.expend2 = nn.Conv3d(in_channels=channels[1], out_channels=channels[2], kernel_size=3, stride=1, padding=1)
+         self.expend3 = nn.Conv3d(in_channels=channels[2], out_channels=channels[3], kernel_size=3, stride=1, padding=1)
+         self.expend4 = nn.Conv3d(in_channels=channels[3], out_channels=channels[4], kernel_size=3, stride=1, padding=1)
+
+         self.encoder1 = UnetResBlock(
+             in_channels=channels[1],
+             out_channels=channels[1],
+             kernel_size=3,
+             stride=1,
+             sparse=sparse
+         )
+         self.encoder2 = UnetResBlock(
+             in_channels=channels[2],
+             out_channels=channels[2],
+             kernel_size=3,
+             stride=1,
+             sparse=sparse
+         )
+         self.encoder3 = UnetResBlock(
+             in_channels=channels[3],
+             out_channels=channels[3],
+             kernel_size=3,
+             stride=1,
+             sparse=sparse
+         )
+         self.encoder4 = UnetResBlock(
+             in_channels=channels[4],
+             out_channels=channels[4],
+             kernel_size=3,
+             stride=1,
+             sparse=sparse
+         )
+
+     def forward(self, x):
+         x = self.stem(x)
+
+         x1 = self.expend1(x)
+
+         x = self.down(x1)
+         x = self.layer2(x)
+         x2 = self.expend2(x)
+
+         x = self.down(x2)
+         x = self.layer3(x)
+         x3 = self.expend3(x)
+
+         x = self.down(x3)
+         x = self.layer4(x)
+         x4 = self.expend4(x)
+
+         x = self.down(x4)
+         x5 = self.layer5(x)
+
+         return self.encoder1(x1), self.encoder2(x2), self.encoder3(x3), self.encoder4(x4), x5
+
+
+ class Encoder(nn.Module):
+
+     def __init__(self,
+                  in_channel: int = 1,
+                  channels=(32, 64, 128, 192, 384),
+                  depths=(1, 2, 2, 2, 1),
+                  kernels=(3, 3, 3, 3, 3),
+                  exp_r=(2, 2, 4, 4, 4),
+                  img_size=96,
+                  depth=4,
+                  norm_layer=nn.LayerNorm,
+                  sparse=False):
+         super(Encoder, self).__init__()
+         self.dim = [channels[1], channels[2], channels[3], channels[4], channels[4]]
+
+         self.embeddings = Embeddings(in_channel=in_channel,
+                                      channels=channels,
+                                      depths=depths,
+                                      kernels=kernels,
+                                      exp_r=exp_r,
+                                      sparse=sparse)
+
+         self.mae = MaskedAutoencoderMamba(
+             img_size=img_size,
+             downsample_rato=self.get_downsample_ratio(),
+             embed_dim=channels[-1],
+             depth=depth,
+             norm_layer=norm_layer,
+             sparse=sparse)
+
+     def get_downsample_ratio(self) -> int:
+         """
+         This func would ONLY be used in `SparseEncoder`'s `__init__` (see `pretrain/encoder.py`).
+
+         :return: the TOTAL downsample ratio of the ConvNet.
+             E.g., for a ResNet-50, this should return 32.
+         """
+         return 16
+
+     def get_feature_map_channels(self) -> List[int]:
+         """
+         This func would ONLY be used in `SparseEncoder`'s `__init__` (see `pretrain/encoder.py`).
+
+         :return: a list of the number of channels of each feature map.
+             E.g., for a ResNet-50, this should return [256, 512, 1024, 2048].
+         """
+         return self.dim
+
+     def forward(self, x, active_b1fff=None):
+         x1, x2, x3, x4, x5 = self.embeddings(x)
+         _x5 = self.mae(x5, active_b1fff)
+         return x1, x2, x3, x4, _x5
+
+
+ class Decoder(nn.Module):
+     def __init__(self,
+                  n_classes: int = 3,
+                  channels: Tuple = (32, 64, 128, 196, 384),
+                  norm_name="instance",
+                  res_block: bool = True):
+         super(Decoder, self).__init__()
+
+         self.decoder5 = UnetrUpBlock(
+             spatial_dims=3,
+             in_channels=channels[4],
+             out_channels=channels[4],
+             kernel_size=3,
+             upsample_kernel_size=2,
+             norm_name=norm_name,
+             res_block=res_block,
+         )
+         self.decoder4 = UnetrUpBlock(
+             spatial_dims=3,
+             in_channels=channels[4],
+             out_channels=channels[3],
+             kernel_size=3,
+             upsample_kernel_size=2,
+             norm_name=norm_name,
+             res_block=res_block,
+         )
+         self.decoder3 = UnetrUpBlock(
+             spatial_dims=3,
+             in_channels=channels[3],
+             out_channels=channels[2],
+             kernel_size=3,
+             upsample_kernel_size=2,
+             norm_name=norm_name,
+             res_block=res_block,
+         )
+         self.decoder2 = UnetrUpBlock(
+             spatial_dims=3,
+             in_channels=channels[2],
+             out_channels=channels[1],
+             kernel_size=3,
+             upsample_kernel_size=2,
+             norm_name=norm_name,
+             res_block=res_block,
+         )
+         self.decoder1 = UnetResBlock(
+             in_channels=channels[1],
+             out_channels=channels[0],
+             kernel_size=3,
+             stride=1,
+             sparse=False
+         )
+         self.out = UnetOutBlock(in_channels=channels[0], n_classes=n_classes)
+
+     def forward(self, x1, x2, x3, x4, x5):
+         d4 = self.decoder5(x5, x4)
+         d3 = self.decoder4(d4, x3)
+         d2 = self.decoder3(d3, x2)
+         d1 = self.decoder2(d2, x1)
+         d0 = self.decoder1(d1)
+         return self.out(d0)
+
+
+ class Hybird(nn.Module):
+     def __init__(self,
+                  in_channel: int = 3,
+                  n_classes: int = 3,
+                  channels: Tuple = (32, 64, 96, 128, 192),
+                  depths: Tuple = (1, 1, 3, 3, 1),
+                  kernels: Tuple = (3, 3, 3, 3, 3),
+                  exp_r: Tuple = (2, 4, 4, 4, 2),
+                  img_size=96,
+                  depth=3,
+                  norm_layer=nn.LayerNorm):
+         super().__init__()
+         self.embeddings = Embeddings(in_channel=in_channel,
+                                      channels=channels,
+                                      depths=depths,
+                                      kernels=kernels,
+                                      exp_r=exp_r,
+                                      sparse=False)
+
+         self.mae = MaskedAutoencoderMamba(
+             img_size=img_size,
+             downsample_rato=16,
+             embed_dim=channels[-1],
+             depth=depth,
+             norm_layer=norm_layer,
+             sparse=False)
+
+         self.decoder = Decoder(
+             n_classes=n_classes,
+             channels=channels,
+         )
+
+     def forward(self, x):
+         x1, x2, x3, x4, x5 = self.embeddings(x)
+         x5 = self.mae(x5, None)
+         return self.decoder(x1, x2, x3, x4, x5)
+
+
+ def build_hybird(in_channel=1, n_classes=14, img_size=96):
+     return Hybird(in_channel=in_channel,
+                   n_classes=n_classes,
+                   channels=(32, 64, 128, 192, 384),
+                   depths=(1, 2, 2, 2, 1),
+                   kernels=(3, 3, 3, 3, 3),
+                   exp_r=(2, 2, 4, 4, 4),
+                   img_size=img_size,
+                   depth=4)
+
+
+ if __name__ == '__main__':
+     x = torch.rand((1, 1, 96, 96, 96))
+     network = build_hybird()
+     print(network(x).shape)
+
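A quick arithmetic check of the token-grid bookkeeping in `MaskedAutoencoderMamba`: with the default `img_size=96` and the encoder's total downsample ratio of 16, the bottleneck grid is 6 per axis, i.e. 216 tokens, and `embed_dim=384` satisfies the divisible-by-6 constraint of the 3D sin-cos position embedding:

```python
# Sanity check of the default configuration (values taken from the code above).
img_size, downsample_ratio, embed_dim = 96, 16, 384
grid = img_size // downsample_ratio
num_patches = grid ** 3
assert (grid, num_patches) == (6, 216)
assert embed_dim % 6 == 0  # required by build_3d_sincos_position_embedding
```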
utils/arg_util.py ADDED
@@ -0,0 +1,125 @@
+ import json
+ import os
+ import sys
+
+ from tap import Tap
+
+ import dist
+
+
+ class Args(Tap):
+     # environment
+     exp_name: str = 'mamba'
+     exp_dir: str = ''  # will be created if it does not exist
+     data_path: str = ''
+     init_weight: str = ''  # use some checkpoint as model weight initialization; ONLY load model weights
+     resume_from: str = ''  # resume the experiment from some checkpoint.pth; load model weights, optimizer states, and last epoch
+
+     # MambaMIM hyperparameters
+     mask: float = 0.75  # mask ratio, should be in (0, 1)
+
+     # encoder hyperparameters
+     model: str = 'mambamim'
+     input_size: int = 96
+     sbn: bool = True
+
+     # data hyperparameters
+     bs: int = 1
+     dataloader_workers: int = 8
+
+     # pre-training hyperparameters
+     dp: float = 0.0
+     base_lr: float = 1e-4
+     wd: float = 0.04
+     wde: float = 0.2
+     ep: int = 100
+     wp_ep: int = 40
+     clip: float = 5.
+     opt: str = 'adamw'
+     ada: float = 0.
+
+     # NO NEED TO SPECIFY; each of these args is updated automatically at runtime
+     lr: float = 1e-4
+     batch_size_per_gpu: int = 0
+     glb_batch_size: int = 0
+     densify_norm: str = ''
+     device: str = 'gpu'
+     local_rank: int = 0
+     cmd: str = ' '.join(sys.argv[1:])
+     commit_id: str = os.popen('git rev-parse HEAD').read().strip() or '[unknown]'
+     commit_msg: str = (os.popen('git log -1').read().strip().splitlines() or ['[unknown]'])[-1].strip()
+     last_loss: float = 0.
+     cur_ep: str = ''
+     remain_time: str = ''
+     finish_time: str = ''
+     first_logging: bool = True
+     log_txt_name: str = '{args.exp_dir}/pretrain_log.txt'
+     tb_lg_dir: str = ''  # tensorboard log directory
+
+     @property
+     def is_convnext(self):
+         return 'convnext' in self.model or 'cnx' in self.model
+
+     @property
+     def is_resnet(self):
+         return 'resnet' in self.model
+
+     def log_epoch(self):
+         if not dist.is_local_master():
+             return
+
+         if self.first_logging:
+             self.first_logging = False
+             with open(self.log_txt_name, 'w') as fp:
+                 json.dump({
+                     'name': self.exp_name, 'cmd': self.cmd, 'git_commit_id': self.commit_id, 'git_commit_msg': self.commit_msg,
+                     'model': self.model,
+                 }, fp)
+                 fp.write('\n\n')
+
+         with open(self.log_txt_name, 'a') as fp:
+             json.dump({
+                 'cur_ep': self.cur_ep,
+                 'last_L': self.last_loss,
+                 'rema': self.remain_time, 'fini': self.finish_time,
+             }, fp)
+             fp.write('\n')
+
+
+ def init_dist_and_get_args():
+     from utils import misc
+
+     # initialize
+     args = Args(explicit_bool=True).parse_args()
+     e = os.path.abspath(args.exp_dir)
+     d, e = os.path.dirname(e), os.path.basename(e)
+     e = ''.join(ch if (ch.isalnum() or ch == '-') else '_' for ch in e)
+     args.exp_dir = os.path.join(d, e)
+
+     os.makedirs(args.exp_dir, exist_ok=True)
+     args.log_txt_name = os.path.join(args.exp_dir, 'pretrain_log.txt')
+     args.tb_lg_dir = args.tb_lg_dir or os.path.join(args.exp_dir, 'tensorboard_log')
+     try:
+         os.makedirs(args.tb_lg_dir, exist_ok=True)
+     except OSError:
+         pass
+
+     misc.init_distributed_environ(exp_dir=args.exp_dir)
+
+     # update args
+     if not dist.initialized():
+         args.sbn = False
+     args.first_logging = True
+     args.device = dist.get_device()
+     args.batch_size_per_gpu = args.bs // dist.get_world_size()
+     args.glb_batch_size = args.batch_size_per_gpu * dist.get_world_size()
+
+     args.ada = args.ada or 0.999
+     args.densify_norm = 'ln'
+
+     args.opt = args.opt.lower()
+     args.lr = args.base_lr
+     args.wde = args.wde or args.wd
+
+     return args
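The directory-name sanitization inside `init_dist_and_get_args` keeps only alphanumerics and `-`, mapping every other character to `_`. A standalone sketch of that one expression (the helper name and example strings are my own):

```python
def sanitize_exp_name(name: str) -> str:
    # same character filter as the expression in init_dist_and_get_args above
    return ''.join(ch if (ch.isalnum() or ch == '-') else '_' for ch in name)
```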
utils/lamb.py ADDED
@@ -0,0 +1,151 @@
1
+ """ PyTorch Lamb optimizer w/ behaviour similar to NVIDIA FusedLamb
2
+ This optimizer code was adapted from the following (starting with latest)
3
+ * https://github.com/HabanaAI/Model-References/blob/2b435114fe8e31f159b1d3063b8280ae37af7423/PyTorch/nlp/bert/pretraining/lamb.py
4
+ * https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/Transformer-XL/pytorch/lamb.py
5
+ * https://github.com/cybertronai/pytorch-lamb
6
+ Use FusedLamb if you can (GPU). The reason for including this variant of Lamb is to have a version that is
7
+ similar in behaviour to APEX FusedLamb if you aren't using NVIDIA GPUs or cannot install/use APEX.
8
+ In addition to some cleanup, this Lamb impl has been modified to support PyTorch XLA and has been tested on TPU.
9
+ Original copyrights for above sources are below.
10
+ Modifications Copyright 2021 Ross Wightman
11
+ """
12
+ import math
13
+
14
+ import torch
15
+ from torch.optim.optimizer import Optimizer
16
+
17
+
18
+ class TheSameAsTimmLAMB(Optimizer):
19
+ """Implements a pure pytorch variant of FuseLAMB (NvLamb variant) optimizer from apex.optimizers.FusedLAMB
20
+ reference: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/Transformer-XL/pytorch/lamb.py
21
+
22
+ LAMB was proposed in `Large Batch Optimization for Deep Learning: Training BERT in 76 minutes`_.
23
+
24
+ Arguments:
25
+ params (iterable): iterable of parameters to optimize or dicts defining parameter groups.
26
+ lr (float, optional): learning rate. (default: 1e-3)
27
+ betas (Tuple[float, float], optional): coefficients used for computing
28
+ running averages of gradient and its norm. (default: (0.9, 0.999))
29
+ eps (float, optional): term added to the denominator to improve
30
+ numerical stability. (default: 1e-8)
31
+ weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
32
+ grad_averaging (bool, optional): whether apply (1-beta2) to grad when
33
+ calculating running averages of gradient. (default: True)
34
+ max_grad_norm (float, optional): value used to clip global grad norm (default: 1.0)
35
+ trust_clip (bool): enable LAMBC trust ratio clipping (default: False)
36
+ always_adapt (boolean, optional): Apply adaptive learning rate to 0.0
37
+ weight decay parameter (default: False)
38
+
39
+ .. _Large Batch Optimization for Deep Learning - Training BERT in 76 minutes:
40
+ https://arxiv.org/abs/1904.00962
41
+ .. _On the Convergence of Adam and Beyond:
42
+ https://openreview.net/forum?id=ryQu7f-RZ
43
+ """
44
+
45
+ def __init__(
46
+ self, params, lr=1e-3, bias_correction=True, betas=(0.9, 0.999), eps=1e-6,
47
+ weight_decay=0.01, grad_averaging=True, max_grad_norm=2.0, trust_clip=False, always_adapt=False):
48
+ defaults = dict(
49
+ lr=lr, bias_correction=bias_correction, betas=betas, eps=eps, weight_decay=weight_decay,
50
+ grad_averaging=grad_averaging, max_grad_norm=max_grad_norm,
51
+ trust_clip=trust_clip, always_adapt=always_adapt)
52
+ super().__init__(params, defaults)
53
+ print(f'[lamb1] max_grad_norm={max_grad_norm}')
54
+ self.global_grad_norm = 0
55
+
56
+ @torch.no_grad()
57
+ def step(self, closure=None):
58
+ """Performs a single optimization step.
59
+ Arguments:
60
+            closure (callable, optional): A closure that reevaluates the model
+                and returns the loss.
+        """
+        loss = None
+        if closure is not None:
+            with torch.enable_grad():
+                loss = closure()
+
+        device = self.param_groups[0]['params'][0].device
+        one_tensor = torch.tensor(1.0, device=device)  # because torch.where doesn't handle scalars correctly
+        global_grad_norm = torch.zeros(1, device=device)
+        for group in self.param_groups:
+            for p in group['params']:
+                if p.grad is None:
+                    continue
+                grad = p.grad
+                if grad.is_sparse:
+                    raise RuntimeError('Lamb does not support sparse gradients, consider SparseAdam instead.')
+                global_grad_norm.add_(grad.pow(2).sum())
+
+        global_grad_norm = torch.sqrt(global_grad_norm)
+        self.global_grad_norm = global_grad_norm.item()
+        max_grad_norm = torch.tensor(self.defaults['max_grad_norm'], device=device)
+        clip_global_grad_norm = 1 / torch.where(
+            global_grad_norm > max_grad_norm,
+            global_grad_norm / max_grad_norm,
+            one_tensor)
+
+        for group in self.param_groups:
+            bias_correction = 1 if group['bias_correction'] else 0
+            beta1, beta2 = group['betas']
+            grad_averaging = 1 if group['grad_averaging'] else 0
+            beta3 = 1 - beta1 if grad_averaging else 1.0
+
+            # assume the same step across the group for now, to simplify things
+            # a per-parameter step could easily be supported by making it a tensor, or passing a list into the kernel
+            if 'step' in group:
+                group['step'] += 1
+            else:
+                group['step'] = 1
+
+            if bias_correction:
+                bias_correction1 = 1 - beta1 ** group['step']
+                bias_correction2 = 1 - beta2 ** group['step']
+            else:
+                bias_correction1, bias_correction2 = 1.0, 1.0
+
+            for p in group['params']:
+                if p.grad is None:
+                    continue
+                grad = p.grad.mul_(clip_global_grad_norm)
+                state = self.state[p]
+
+                # State initialization
+                if len(state) == 0:
+                    # Exponential moving average of gradient values
+                    state['exp_avg'] = torch.zeros_like(p)
+                    # Exponential moving average of squared gradient values
+                    state['exp_avg_sq'] = torch.zeros_like(p)
+
+                exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
+
+                # Decay the first and second moment running average coefficient
+                exp_avg.mul_(beta1).add_(grad, alpha=beta3)  # m_t
+                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # v_t
+
+                denom = (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(group['eps'])
+                update = (exp_avg / bias_correction1).div_(denom)
+
+                weight_decay = group['weight_decay']
+                if weight_decay != 0:
+                    update.add_(p, alpha=weight_decay)
+
+                if weight_decay != 0 or group['always_adapt']:
+                    # Layer-wise LR adaptation. By default, skip adaptation on parameters that are
+                    # excluded from weight decay; unless always_adapt == True, then it is always enabled.
+                    w_norm = p.norm(2.0)
+                    g_norm = update.norm(2.0)
+                    # FIXME nested where required since logical and/or not working in PT XLA
+                    trust_ratio = torch.where(
+                        w_norm > 0,
+                        torch.where(g_norm > 0, w_norm / g_norm, one_tensor),
+                        one_tensor,
+                    )
+                    if group['trust_clip']:
+                        # LAMBC trust clipping, upper bound fixed at one
+                        trust_ratio = torch.minimum(trust_ratio, one_tensor)
+                    update.mul_(trust_ratio)
+
+                p.add_(update, alpha=-group['lr'])
+
+        return loss
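The layer-wise adaptation above boils down to a small scalar rule. Here is a minimal pure-Python sketch of it (the `trust_ratio` helper name is hypothetical, standing in for the nested `torch.where` version in the diff):

```python
def trust_ratio(w_norm, g_norm, clip=False):
    # Mirrors the LAMB layer-wise adaptation: scale the update by
    # ||w|| / ||update||, falling back to 1 when either norm is zero.
    if w_norm > 0 and g_norm > 0:
        r = w_norm / g_norm
    else:
        r = 1.0
    # LAMBC variant: clip the trust ratio at 1
    return min(r, 1.0) if clip else r

print(trust_ratio(4.0, 2.0))             # ratio of norms
print(trust_ratio(0.0, 2.0))             # zero weight norm -> no adaptation
print(trust_ratio(4.0, 2.0, clip=True))  # LAMBC clipping
```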
utils/lr_control.py ADDED
@@ -0,0 +1,47 @@
+ import math
+ from pprint import pformat
+
+
+ def lr_wd_annealing(optimizer, peak_lr, wd, wd_end, cur_it, wp_it, max_it):
+     wp_it = round(wp_it)
+     if cur_it < wp_it:
+         cur_lr = 0.005 * peak_lr + 0.995 * peak_lr * cur_it / wp_it
+     else:
+         ratio = (cur_it - wp_it) / (max_it - 1 - wp_it)
+         cur_lr = 0.001 * peak_lr + 0.999 * peak_lr * (0.5 + 0.5 * math.cos(math.pi * ratio))
+
+     ratio = cur_it / (max_it - 1)
+     cur_wd = wd_end + (wd - wd_end) * (0.5 + 0.5 * math.cos(math.pi * ratio))
+
+     min_lr, max_lr = cur_lr, cur_lr
+     min_wd, max_wd = cur_wd, cur_wd
+     for param_group in optimizer.param_groups:
+         scaled_lr = param_group['lr'] = cur_lr * param_group.get('lr_scale', 1)  # 'lr_scale' could be assigned
+         min_lr, max_lr = min(min_lr, scaled_lr), max(max_lr, scaled_lr)
+         scaled_wd = param_group['weight_decay'] = cur_wd * param_group.get('weight_decay_scale', 1)  # 'weight_decay_scale' could be assigned
+         min_wd, max_wd = min(min_wd, scaled_wd), max(max_wd, scaled_wd)
+     return min_lr, max_lr, min_wd, max_wd
+
+
+ def get_param_groups(model, nowd_keys=()):
+     para_groups, para_groups_dbg = {}, {}
+
+     for name, para in model.named_parameters():
+         if not para.requires_grad:
+             continue  # frozen weights
+         if len(para.shape) == 1 or name.endswith('.bias') or any(k in name for k in nowd_keys):
+             wd_scale, group_name = 0., 'no_decay'
+         else:
+             wd_scale, group_name = 1., 'decay'
+
+         if group_name not in para_groups:
+             para_groups[group_name] = {'params': [], 'weight_decay_scale': wd_scale, 'lr_scale': 1.}
+             para_groups_dbg[group_name] = {'params': [], 'weight_decay_scale': wd_scale, 'lr_scale': 1.}
+         para_groups[group_name]['params'].append(para)
+         para_groups_dbg[group_name]['params'].append(name)
+
+     for g in para_groups_dbg.values():
+         g['params'] = pformat(', '.join(g['params']), width=200)
+
+     print(f'[get_ft_param_groups] param groups = \n{pformat(para_groups_dbg, indent=2, width=250)}\n')
+     return list(para_groups.values())
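The schedule in `lr_wd_annealing` is a linear warmup followed by cosine decay. A minimal stand-alone sketch of the learning-rate part (the `lr_at` helper name is hypothetical, and no optimizer is involved):

```python
import math

def lr_at(cur_it, peak_lr, wp_it, max_it):
    # Mirrors lr_wd_annealing: linear warmup from 0.005*peak_lr up to
    # peak_lr over wp_it iterations, then cosine decay down to
    # 0.001*peak_lr at the final iteration.
    if cur_it < wp_it:
        return 0.005 * peak_lr + 0.995 * peak_lr * cur_it / wp_it
    ratio = (cur_it - wp_it) / (max_it - 1 - wp_it)
    return 0.001 * peak_lr + 0.999 * peak_lr * (0.5 + 0.5 * math.cos(math.pi * ratio))

print(lr_at(0, 1.0, 10, 100))   # warmup floor: 0.005
print(lr_at(10, 1.0, 10, 100))  # peak: 1.0
print(lr_at(99, 1.0, 10, 100))  # final floor: ~0.001
```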
utils/med_dataset.py ADDED
@@ -0,0 +1,275 @@
+ import math
+ import os
+ from typing import Any, Callable, Optional, Tuple
+ from monai import data, transforms as med
+ from monai.data import load_decathlon_datalist
+ import PIL.Image as PImage
+ from timm.data import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD
+ from torchvision.datasets.folder import DatasetFolder, IMG_EXTENSIONS
+ from torchvision.transforms import transforms
+ from torch.utils.data import Dataset
+ import torch
+ import numpy as np
+ import cv2
+ try:
+     from torchvision.transforms import InterpolationMode
+     interpolation = InterpolationMode.BICUBIC
+ except ImportError:
+     import PIL
+     interpolation = PIL.Image.BICUBIC
+ from monai.transforms.transform import LazyTransform, MapTransform, RandomizableTransform
+ import random
+
+
+ def pil_loader(path):
+     # open path as file to avoid ResourceWarning (https://github.com/python-pillow/Pillow/issues/835)
+     with open(path, 'rb') as f:
+         img: PImage.Image = PImage.open(f).convert('RGB')
+     return img
+
+
+ class ImageNetDataset(DatasetFolder):
+     def __init__(
+             self,
+             imagenet_folder: str,
+             train: bool,
+             transform: Callable,
+             is_valid_file: Optional[Callable[[str], bool]] = None,
+     ):
+         imagenet_folder = os.path.join(imagenet_folder, 'train' if train else 'val')
+         super(ImageNetDataset, self).__init__(
+             imagenet_folder,
+             loader=pil_loader,
+             extensions=IMG_EXTENSIONS if is_valid_file is None else None,
+             transform=transform,
+             target_transform=None, is_valid_file=is_valid_file
+         )
+
+         self.samples = tuple(img for (img, label) in self.samples)
+         self.targets = None  # this is self-supervised learning, so we don't need labels
+
+     def __getitem__(self, index: int) -> Any:
+         img_file_path = self.samples[index]
+         return self.transform(self.loader(img_file_path))
+
+
+ def build_dataset_to_pretrain(dataset_path, input_size) -> Dataset:
+     """
+     You may need to modify this function to return your own dataset.
+     Define a new class, a subclass of `Dataset`, to replace our ImageNetDataset.
+     Use dataset_path to build your image file path list.
+     Use input_size to create the transformation function for your images; you can refer to `trans_train` below.
+
+     :param dataset_path: the folder of the dataset
+     :param input_size: the input size (image resolution)
+     :return: the dataset used for pretraining
+     """
+     trans_train = transforms.Compose([
+         transforms.RandomResizedCrop(input_size, scale=(0.67, 1.0), interpolation=interpolation),
+         transforms.RandomHorizontalFlip(),
+         transforms.ToTensor(),
+         transforms.Normalize(mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD),
+     ])
+
+     dataset_path = os.path.abspath(dataset_path)
+     for postfix in ('train', 'val'):
+         if dataset_path.endswith(postfix):
+             dataset_path = dataset_path[:-len(postfix)]
+
+     dataset_train = ImageNetDataset(imagenet_folder=dataset_path, transform=trans_train, train=True)
+     print_transform(trans_train, '[pre-train]')
+     return dataset_train
+
+
+ def build_meddataset_to_pretrain(dataset_path, input_size) -> Dataset:
+     """
+     You may need to modify this function to return your own dataset.
+     Define a new class, a subclass of `Dataset`, to replace our MedicalDataSets.
+     Use dataset_path to build your image file path list.
+     Use input_size to create the transformation function for your images; you can refer to `trans_train` below.
+
+     :param dataset_path: the folder of the dataset
+     :param input_size: the input size (image resolution)
+     :return: the dataset used for pretraining
+     """
+     trans_train = transforms.Compose([
+         transforms.RandomResizedCrop(input_size, scale=(0.67, 1.0), interpolation=interpolation),
+         transforms.RandomHorizontalFlip(),
+         transforms.ToTensor(),
+         transforms.Normalize(mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD),
+     ])
+
+     dataset_path = os.path.abspath(dataset_path)
+
+     dataset_train = MedicalDataSets(base_dir=dataset_path, transform=trans_train)
+     print_transform(trans_train, '[pre-train]')
+     return dataset_train
+
+
+ class MedicalDataSets(Dataset):
+     def __init__(
+             self,
+             base_dir=None,
+             transform=None,
+     ):
+         self._base_dir = base_dir
+         self.sample_list = os.listdir(self._base_dir)
+         self.transform = transform
+         print("total {}".format(len(self.sample_list)))
+
+     def __len__(self):
+         return len(self.sample_list)
+
+     def __getitem__(self, idx):
+         case = self.sample_list[idx]
+         img = PImage.open(os.path.join(self._base_dir, case)).convert('RGB')
+         aug = self.transform(img)
+         return aug
+
+
+ def print_transform(transform, s):
+     print(f'Transform {s} = ')
+     for t in transform.transforms:
+         print(t)
+     print('---------------------------\n')
+
+
+ class Sampler(torch.utils.data.Sampler):
+     def __init__(self, dataset, num_replicas=None, rank=None, shuffle=True, make_even=True):
+         if num_replicas is None:
+             if not torch.distributed.is_available():
+                 raise RuntimeError("Requires distributed package to be available")
+             num_replicas = torch.distributed.get_world_size()
+         if rank is None:
+             if not torch.distributed.is_available():
+                 raise RuntimeError("Requires distributed package to be available")
+             rank = torch.distributed.get_rank()
+         self.shuffle = shuffle
+         self.make_even = make_even
+         self.dataset = dataset
+         self.num_replicas = num_replicas
+         self.rank = rank
+         self.epoch = 0
+         self.num_samples = int(math.ceil(len(self.dataset) * 1.0 / self.num_replicas))
+         self.total_size = self.num_samples * self.num_replicas
+         indices = list(range(len(self.dataset)))
+         self.valid_length = len(indices[self.rank : self.total_size : self.num_replicas])
+
+     def __iter__(self):
+         if self.shuffle:
+             g = torch.Generator()
+             g.manual_seed(self.epoch)
+             indices = torch.randperm(len(self.dataset), generator=g).tolist()
+         else:
+             indices = list(range(len(self.dataset)))
+         if self.make_even:
+             if len(indices) < self.total_size:
+                 if self.total_size - len(indices) < len(indices):
+                     indices += indices[: (self.total_size - len(indices))]
+                 else:
+                     extra_ids = np.random.randint(low=0, high=len(indices), size=self.total_size - len(indices))
+                     indices += [indices[ids] for ids in extra_ids]
+             assert len(indices) == self.total_size
+         indices = indices[self.rank : self.total_size : self.num_replicas]
+         self.num_samples = len(indices)
+         return iter(indices)
+
+     def __len__(self):
+         return self.num_samples
+
+     def set_epoch(self, epoch):
+         self.epoch = epoch
+
+
+ class RandScaleCropdPlusScaleByMidDimSampled(MapTransform):
+     def __init__(self, keys, mode='area', max_size=128, allow_missing_keys=False, num_samples=4, max_radio=0.8, min_radio=0.5):
+         self.keys = keys
+         self.mode = mode
+         self.allow_missing_keys = allow_missing_keys
+         self.max_size = max_size
+         self.num_samples = num_samples
+         self.max_radio = max_radio
+         self.min_radio = min_radio
+
+     def __call__(self, data):
+         outputs = []
+         for i in range(self.num_samples):
+             random_number = round(random.uniform(self.min_radio, self.max_radio), 2)
+             _data = dict(data)
+             for key in self.keys:
+                 cropper = med.RandScaleCropd(keys=[key], roi_scale=random_number)
+                 _data[key] = cropper(_data)[key]
+                 ct_tensor = _data[key]
+                 sorted_numbers = sorted(ct_tensor.shape[1:])
+                 scale_factor = self.max_size / sorted_numbers[1]
+                 new_size = [int(d * scale_factor) for d in ct_tensor.shape[1:]]
+
+                 resizer = med.Resized(keys=[key],
+                                       spatial_size=new_size,
+                                       mode=self.mode,
+                                       allow_missing_keys=self.allow_missing_keys)
+                 _data[key] = resizer(_data)[key]
+
+             outputs.append(_data)
+
+         return outputs
+
+
+ def get_loader(data_dir, size):
+     datalist_json = os.path.join(data_dir, "dataset.json")
+     train_transform = med.Compose(
+         [
+             med.LoadImaged(keys=["image"], allow_missing_keys=True),
+             med.AddChanneld(keys=["image"], allow_missing_keys=True),
+             med.Orientationd(keys=["image"], axcodes="RAS", allow_missing_keys=True),
+             med.Spacingd(keys=["image"], pixdim=(1.5, 1.5, 1.5), mode="bilinear", allow_missing_keys=True),
+             med.ScaleIntensityRanged(keys=["image"], a_min=-175, a_max=250, b_min=0.0, b_max=1.0, clip=True),
+             med.CropForegroundd(keys=["image"], source_key="image", allow_missing_keys=True),
+             med.SpatialPadd(keys=["image"], spatial_size=(size, size, size), mode='constant'),
+             med.RandCropByPosNegLabeld(
+                 spatial_size=(size, size, size),
+                 keys=["image"],
+                 label_key="image",
+                 pos=1,
+                 neg=0,
+                 num_samples=4,
+             ),
+             med.RandFlipd(keys=["image"], prob=0.2, spatial_axis=0),
+             med.RandFlipd(keys=["image"], prob=0.2, spatial_axis=1),
+             med.RandFlipd(keys=["image"], prob=0.1, spatial_axis=2),
+             med.ToTensord(keys=["image"]),
+         ])
+     # val_transform = transforms.Compose(
+     #     [
+     #         transforms.LoadImaged(keys=["image", "label"]),
+     #         transforms.AddChanneld(keys=["image", "label"]),
+     #         transforms.Orientationd(keys=["image", "label"], axcodes="RAS"),
+     #         transforms.Spacingd(
+     #             keys=["image", "label"], pixdim=(1, 1, 1), mode=("bilinear", "nearest")
+     #         ),
+     #         transforms.ScaleIntensityRanged(
+     #             keys=["image"], a_min=-175.0, a_max=250.0, b_min=0.0, b_max=1.0, clip=True
+     #         ),
+     #         transforms.CropForegroundd(keys=["image", "label"], source_key="image"),
+     #         transforms.ToTensord(keys=["image", "label"]),
+     #     ]
+     # )
+
+     datalist = load_decathlon_datalist(datalist_json, True, "training", base_dir=data_dir)
+     # train_ds = data.Dataset(data=datalist, transform=train_transform)
+     # train_ds = data.CacheDataset(data=datalist, transform=train_transform)
+     # train_ds = data.SmartCacheDataset(data=datalist, transform=train_transform, replace_rate=0.7, cache_num=256, num_init_workers=4, num_replace_workers=4)
+     train_ds = data.CacheNTransDataset(data=datalist, transform=train_transform, cache_n_trans=6, cache_dir="/fenghetang/3d/pretrain/MM/cache_dataset")
+     return train_ds
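The distributed `Sampler` above assigns each rank a strided slice of the (optionally padded) index list. A minimal pure-Python sketch of that sharding logic, with a hypothetical `shard_for_rank` helper and no torch dependency:

```python
import math

def shard_for_rank(indices, num_replicas, rank, make_even=True):
    # Mirrors the Sampler: pad the index list to a multiple of num_replicas
    # (repeating from the front), then take a strided slice so each rank
    # sees a disjoint, equal-sized share of the dataset.
    num_samples = math.ceil(len(indices) / num_replicas)
    total_size = num_samples * num_replicas
    if make_even and len(indices) < total_size:
        indices = indices + indices[: total_size - len(indices)]
    return indices[rank:total_size:num_replicas]

# 10 samples across 4 ranks: padded to 12, each rank gets 3
print(shard_for_rank(list(range(10)), 4, 0))
print(shard_for_rank(list(range(10)), 4, 3))
```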
utils/misc.py ADDED
@@ -0,0 +1,338 @@
+ import datetime
+ import functools
+ import os
+ import subprocess
+ import sys
+ import time
+ from collections import defaultdict, deque
+ from typing import Iterator
+
+ import numpy as np
+ import pytz
+ import torch
+ from torch.utils.tensorboard import SummaryWriter
+
+ import dist
+
+ os_system = functools.partial(subprocess.call, shell=True)
+ os_system_get_stdout = lambda cmd: subprocess.run(cmd, shell=True, stdout=subprocess.PIPE).stdout.decode('utf-8')
+
+
+ def os_system_get_stdout_stderr(cmd):
+     sp = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
+     return sp.stdout.decode('utf-8'), sp.stderr.decode('utf-8')
+
+
+ def is_pow2n(x):
+     return x > 0 and ((x - 1) & x == 0)
+
+
+ def time_str(for_dirname=False):
+     return datetime.datetime.now(tz=pytz.timezone('Asia/Shanghai')).strftime(
+         '%m-%d_%H-%M-%S' if for_dirname else '[%m-%d %H:%M:%S]')
+
+
+ def init_distributed_environ(exp_dir):
+     dist.initialize()
+     dist.barrier()
+
+     import torch.backends.cudnn as cudnn
+     cudnn.benchmark = True
+     cudnn.deterministic = False
+
+     _set_print_only_on_master_proc(is_master=dist.is_local_master())
+     if dist.is_local_master() and len(exp_dir):
+         sys.stdout, sys.stderr = _SyncPrintToFile(exp_dir, stdout=True), _SyncPrintToFile(exp_dir, stdout=False)
+
+
+ def _set_print_only_on_master_proc(is_master):
+     import builtins as __builtin__
+
+     builtin_print = __builtin__.print
+
+     def prt(msg, *args, **kwargs):
+         force = kwargs.pop('force', False)
+         clean = kwargs.pop('clean', False)
+         deeper = kwargs.pop('deeper', False)
+         if is_master or force:
+             if not clean:
+                 f_back = sys._getframe().f_back
+                 if deeper and f_back.f_back is not None:
+                     f_back = f_back.f_back
+                 file_desc = f'{f_back.f_code.co_filename:24s}'[-24:]
+                 msg = f'{time_str()} ({file_desc}, line{f_back.f_lineno:-4d})=> {msg}'
+             builtin_print(msg, *args, **kwargs)
+
+     __builtin__.print = prt
+
+
+ class _SyncPrintToFile(object):
+     def __init__(self, exp_dir, stdout=True):
+         self.terminal = sys.stdout if stdout else sys.stderr
+         fname = os.path.join(exp_dir, 'stdout_backup.txt' if stdout else 'stderr_backup.txt')
+         self.log = open(fname, 'w')
+         self.log.flush()
+
+     def write(self, message):
+         self.terminal.write(message)
+         self.log.write(message)
+         self.log.flush()
+
+     def flush(self):
+         self.terminal.flush()
+         self.log.flush()
+
+
+ class TensorboardLogger(object):
+     def __init__(self, log_dir, is_master, prefix='pt'):
+         self.is_master = is_master
+         self.writer = SummaryWriter(log_dir=log_dir) if self.is_master else None
+         self.step = 0
+         self.prefix = prefix
+         self.log_freq = 300
+
+     def set_step(self, step=None):
+         if step is not None:
+             self.step = step
+         else:
+             self.step += 1
+
+     def get_loggable(self, step=None):
+         if step is None:  # iter-wise
+             step = self.step
+             loggable = step % self.log_freq == 0
+         else:  # epoch-wise
+             loggable = True
+         return step, (loggable and self.is_master)
+
+     def update(self, head='scalar', step=None, **kwargs):
+         step, loggable = self.get_loggable(step)
+         if loggable:
+             head = f'{self.prefix}_{head}'
+             for k, v in kwargs.items():
+                 if v is None:
+                     continue
+                 if isinstance(v, torch.Tensor):
+                     v = v.item()
+                 assert isinstance(v, (float, int))
+                 self.writer.add_scalar(head + "/" + k, v, step)
+
+     def log_distribution(self, tag, values, step=None):
+         step, loggable = self.get_loggable(step)
+         if loggable:
+             if not isinstance(values, torch.Tensor):
+                 values = torch.tensor(values)
+             self.writer.add_histogram(tag=tag, values=values, global_step=step)
+
+     def log_image(self, tag, img, step=None, dataformats='NCHW'):
+         step, loggable = self.get_loggable(step)
+         if loggable:
+             # img = img.cpu().numpy()
+             self.writer.add_image(tag, img, step, dataformats=dataformats)
+
+     def flush(self):
+         if self.is_master: self.writer.flush()
+
+     def close(self):
+         if self.is_master: self.writer.close()
+
+
+ def save_checkpoint_with_meta_info_and_opt_state(save_to, args, epoch, performance_desc, model_without_ddp_state,
+                                                  optimizer_state):
+     checkpoint_path = os.path.join(args.exp_dir, save_to)
+     if dist.is_local_master():
+         to_save = {
+             'args': str(args),
+             'input_size': args.input_size,
+             'arch': args.model,
+             'epoch': epoch,
+             'performance_desc': performance_desc,
+             'module': model_without_ddp_state,
+             'optimizer': optimizer_state,
+             'is_pretrain': True,
+         }
+         torch.save(to_save, checkpoint_path)
+
+
+ def save_checkpoint_model_weights_only(save_to, args, sp_cnn_state):
+     checkpoint_path = os.path.join(args.exp_dir, save_to)
+     if dist.is_local_master():
+         torch.save(sp_cnn_state, checkpoint_path)
+
+
+ def initialize_weight(init_weight: str, model_without_ddp):
+     # use some checkpoint as model weight initialization; ONLY load model weights
+     if len(init_weight):
+         checkpoint = torch.load(init_weight, 'cpu')
+         missing, unexpected = model_without_ddp.load_state_dict(checkpoint.get('module', checkpoint), strict=False)
+         print(f'[initialize_weight] missing_keys={missing}')
+         print(f'[initialize_weight] unexpected_keys={unexpected}')
+
+
+ def load_checkpoint(resume_from: str, model_without_ddp, optimizer):
+     # resume the experiment from some checkpoint.pth; load model weights, optimizer states, and last epoch
+     if len(resume_from) == 0:
+         return 0, '[no performance_desc]'
+     print(f'[try to resume from file `{resume_from}`]')
+     checkpoint = torch.load(resume_from, map_location='cpu')
+
+     ep_start, performance_desc = checkpoint.get('epoch', -1) + 1, checkpoint.get('performance_desc',
+                                                                                 '[no performance_desc]')
+     missing, unexpected = model_without_ddp.load_state_dict(checkpoint.get('module', checkpoint), strict=False)
+     print(f'[load_checkpoint] missing_keys={missing}')
+     print(f'[load_checkpoint] unexpected_keys={unexpected}')
+     print(f'[load_checkpoint] ep_start={ep_start}, performance_desc={performance_desc}')
+
+     if 'optimizer' in checkpoint:
+         optimizer.load_state_dict(checkpoint['optimizer'])
+     return ep_start, performance_desc
+
+
+ class SmoothedValue(object):
+     """Track a series of values and provide access to smoothed values over a
+     window or the global series average.
+     """
+
+     def __init__(self, window_size=20, fmt=None):
+         if fmt is None:
+             fmt = "{median:.4f} ({global_avg:.4f})"
+         self.deque = deque(maxlen=window_size)
+         self.total = 0.0
+         self.count = 0
+         self.fmt = fmt
+
+     def update(self, value, n=1):
+         self.deque.append(value)
+         self.count += n
+         self.total += value * n
+
+     def synchronize_between_processes(self):
+         """
+         Warning: does not synchronize the deque!
+         """
+         t = torch.tensor([self.count, self.total], dtype=torch.float64, device='cuda')
+         dist.barrier()
+         dist.allreduce(t)
+         t = t.tolist()
+         self.count = int(t[0])
+         self.total = t[1]
+
+     @property
+     def median(self):
+         d = torch.tensor(list(self.deque))
+         return d.median().item()
+
+     @property
+     def avg(self):
+         d = torch.tensor(list(self.deque), dtype=torch.float32)
+         return d.mean().item()
+
+     @property
+     def global_avg(self):
+         return self.total / self.count
+
+     @property
+     def max(self):
+         return max(self.deque)
+
+     @property
+     def value(self):
+         return self.deque[-1]
+
+     def __str__(self):
+         return self.fmt.format(
+             median=self.median,
+             avg=self.avg,
+             global_avg=self.global_avg,
+             max=self.max,
+             value=self.value)
+
+
+ class MetricLogger(object):
+     def __init__(self, delimiter="\t"):
+         self.meters = defaultdict(SmoothedValue)
+         self.delimiter = delimiter
+
+     def update(self, **kwargs):
+         for k, v in kwargs.items():
+             if v is None:
+                 continue
+             if isinstance(v, torch.Tensor):
+                 v = v.item()
+             assert isinstance(v, (float, int))
+             self.meters[k].update(v)
+
+     def __getattr__(self, attr):
+         if attr in self.meters:
+             return self.meters[attr]
+         if attr in self.__dict__:
+             return self.__dict__[attr]
+         raise AttributeError("'{}' object has no attribute '{}'".format(
+             type(self).__name__, attr))
+
+     def __str__(self):
+         loss_str = []
+         for name, meter in self.meters.items():
+             loss_str.append(
+                 "{}: {}".format(name, str(meter))
+             )
+         return self.delimiter.join(loss_str)
+
+     def synchronize_between_processes(self):
+         for meter in self.meters.values():
+             meter.synchronize_between_processes()
+
+     def add_meter(self, name, meter):
+         self.meters[name] = meter
+
+     def log_every(self, max_iters, itrt, print_freq, header=None):
+         print_iters = set(np.linspace(0, max_iters - 1, print_freq, dtype=int).tolist())
+         if not header:
+             header = ''
+         start_time = time.time()
+         end = time.time()
+         self.iter_time = SmoothedValue(fmt='{avg:.4f}')
+         self.data_time = SmoothedValue(fmt='{avg:.4f}')
+         space_fmt = ':' + str(len(str(max_iters))) + 'd'
+         log_msg = [
+             header,
+             '[{0' + space_fmt + '}/{1}]',
+             'eta: {eta}',
+             '{meters}',
+             'iter: {time}s',
+             'data: {data}s'
+         ]
+         log_msg = self.delimiter.join(log_msg)
+
+         if isinstance(itrt, Iterator) and not hasattr(itrt, 'preload') and not hasattr(itrt, 'set_epoch'):
+             for i in range(max_iters):
+                 obj = next(itrt)
+                 self.data_time.update(time.time() - end)
+                 yield obj
+                 self.iter_time.update(time.time() - end)
+                 if i in print_iters:
+                     eta_seconds = self.iter_time.global_avg * (max_iters - i)
+                     eta_string = str(datetime.timedelta(seconds=int(eta_seconds)))
+                     print(log_msg.format(
+                         i, max_iters, eta=eta_string,
+                         meters=str(self),
+                         time=str(self.iter_time), data=str(self.data_time)))
+                 end = time.time()
+         else:
+             for i, obj in enumerate(itrt):
+                 self.data_time.update(time.time() - end)
+                 yield obj
+                 self.iter_time.update(time.time() - end)
+                 if i in print_iters:
+                     eta_seconds = self.iter_time.global_avg * (max_iters - i)
+                     eta_string = str(datetime.timedelta(seconds=int(eta_seconds)))
+                     print(log_msg.format(
+                         i, max_iters, eta=eta_string,
+                         meters=str(self),
+                         time=str(self.iter_time), data=str(self.data_time)))
+                 end = time.time()
+
+         total_time = time.time() - start_time
+         total_time_str = str(datetime.timedelta(seconds=int(total_time)))
+         print('{} Total time: {} ({:.3f} s / it)'.format(
+             header, total_time_str, total_time / max_iters))
utils/sampler.py ADDED
@@ -0,0 +1,68 @@
+ import random
+
+ import numpy as np
+ import torch
+ from torch.utils.data.sampler import Sampler
+
+
+ def worker_init_fn(worker_id):
+     # https://pytorch.org/docs/stable/notes/randomness.html#dataloader
+     worker_seed = torch.initial_seed() % 2 ** 32
+     np.random.seed(worker_seed)
+     random.seed(worker_seed)
+
+
+ class DistInfiniteBatchSampler(Sampler):
+     def __init__(self, world_size, rank, dataset_len, glb_batch_size, seed=1, filling=False, shuffle=True):
+         assert glb_batch_size % world_size == 0
+         self.world_size, self.rank = world_size, rank
+         self.dataset_len = dataset_len
+         self.glb_batch_size = glb_batch_size
+         self.batch_size = glb_batch_size // world_size
+
+         self.iters_per_ep = (dataset_len + glb_batch_size - 1) // glb_batch_size
+         self.filling = filling
+         self.shuffle = shuffle
+         self.epoch = 0
+         self.seed = seed
+         self.indices = self.gener_indices()
+
+     def gener_indices(self):
+         global_max_p = self.iters_per_ep * self.glb_batch_size  # global_max_p % world_size must be 0 because glb_batch_size % world_size == 0
+         if self.shuffle:
+             g = torch.Generator()
+             g.manual_seed(self.epoch + self.seed)
+             global_indices = torch.randperm(self.dataset_len, generator=g)
+         else:
+             global_indices = torch.arange(self.dataset_len)
+         filling = global_max_p - global_indices.shape[0]
+         if filling > 0 and self.filling:
+             global_indices = torch.cat((global_indices, global_indices[:filling]))
+         global_indices = tuple(global_indices.numpy().tolist())
+
+         seps = torch.linspace(0, len(global_indices), self.world_size + 1, dtype=torch.int)
+         local_indices = global_indices[seps[self.rank]:seps[self.rank + 1]]
+         self.max_p = len(local_indices)
+         return local_indices
+
+     def __iter__(self):
+         self.epoch = 0
+         while True:
+             self.epoch += 1
+             p, q = 0, 0
+             while p < self.max_p:
+                 q = p + self.batch_size
+                 yield self.indices[p:q]
+                 p = q
+             if self.shuffle:
+                 self.indices = self.gener_indices()
+
+     def __len__(self):
+         return self.iters_per_ep
+
+
+ if __name__ == '__main__':
+     W = 16
+     for rk in range(W):
+         ind = DistInfiniteBatchSampler(W, rk, 5024, 5024).gener_indices()
+         print(rk, len(ind))
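`DistInfiniteBatchSampler.__iter__` never terminates: it walks its local shard in `batch_size` chunks and reshuffles for a new "epoch" when the shard is exhausted. A minimal pure-Python sketch of that chunking-and-wraparound behavior (the `infinite_batches` helper is hypothetical, with reshuffling omitted):

```python
def infinite_batches(indices, batch_size):
    # Mirrors the inner loop of DistInfiniteBatchSampler.__iter__: yield
    # batch_size-sized slices of the local shard, restarting from the
    # beginning (a new "epoch") once the shard is exhausted.
    while True:
        p = 0
        while p < len(indices):
            yield indices[p:p + batch_size]
            p += batch_size

it = infinite_batches([0, 1, 2, 3, 4], 2)
batches = [next(it) for _ in range(4)]
print(batches)  # the last (short) batch is followed by a wraparound
```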