+---
+license: apache-2.0
+tags:
+  - vision
+  - self-supervised-learning
+  - masked-image-modeling
+  - knowledge-distillation
+  - vit
+datasets:
+  - imagenet-1k
+  - ade20k
+  - coco
+metrics:
+  - accuracy
+  - mIoU
+  - mAP
+pipeline_tag: image-classification
+---
+# MaskDistill ViT-Base/16
+**The first open-source PyTorch implementation of MaskDistill with pre-trained weights.**
+This model was trained using the [MaskDistill-PyTorch](https://github.com/drkostas/MaskDistill-PyTorch) codebase, reproducing the method from ["A Unified View of Masked Image Modeling"](https://arxiv.org/abs/2210.10615).
+## Model Description
+MaskDistill learns visual representations by distilling knowledge from a frozen CLIP ViT-B/16 teacher into a ViT-Base student through masked image modeling. The student receives an image with some patches masked and is trained to predict the teacher's features at the masked positions with a Smooth L1 loss.
+- **Architecture**: ViT-Base/16 (86M params)
+- **Teacher**: CLIP ViT-B/16 (frozen)
+- **Pretraining**: 300 epochs on ImageNet-1K
+- **Masking**: Block masking at 40%, dense encoding with shared relative position bias
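The objective described above can be made concrete with a small, dependency-free sketch. It assumes per-patch feature vectors stored as plain Python lists, a boolean mask marking the patches hidden from the student, and a Smooth L1 threshold of `beta = 1.0` (the PyTorch default). The function names are hypothetical and are not taken from the MaskDistill-PyTorch codebase, which operates on tensors.

```python
def smooth_l1(diff, beta=1.0):
    """Smooth L1 penalty for a single scalar difference."""
    a = abs(diff)
    if a < beta:
        return 0.5 * a * a / beta  # quadratic near zero
    return a - 0.5 * beta          # linear for large errors


def masked_distill_loss(student_feats, teacher_feats, mask, beta=1.0):
    """Mean Smooth L1 between student and teacher features on masked patches.

    student_feats, teacher_feats: lists of per-patch feature vectors.
    mask: list of bools, True where the patch was masked for the student.
    """
    total, count = 0.0, 0
    for s_vec, t_vec, is_masked in zip(student_feats, teacher_feats, mask):
        if not is_masked:
            continue  # visible patches do not contribute in this sketch
        for s, t in zip(s_vec, t_vec):
            total += smooth_l1(s - t, beta)
            count += 1
    if count == 0:
        raise ValueError("mask selects no patches")
    return total / count


# Toy example: 3 patches with 2-dim features; patches 0 and 2 are masked.
teacher = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
student = [[0.5, 0.0], [9.0, 9.0], [0.0, 0.0]]
mask = [True, False, True]
loss = masked_distill_loss(student, teacher, mask)
print(loss)  # 0.40625
```

The large error on patch 1 has no effect on the result because that patch is visible, which is the point of restricting the loss to masked positions.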
+## Results
+| Evaluation | Result |
+|-----------|--------|
+| k-NN (k=10) | **75.6%** top-1 |
+| Sem. Seg. (ADE20K, UPerNet) | **52.6** mIoU |
+| Obj. Det. (COCO, Mask R-CNN) | **44.4** bbox mAP |
+| Inst. Seg. (COCO, Mask R-CNN) | **40.1** segm mAP |
+## Available Checkpoints
+| File | Description |
+|------|------------|
+| `pretrain_vit_base_ep290.pth` | Pretrained ViT-Base (300 epochs) |
+| `semseg_upernet_ade20k_160k.pth` | UPerNet on ADE20K (52.6 mIoU) |
+| `detection_maskrcnn_coco_12ep.pth` | Mask R-CNN on COCO (44.4 mAP) |
+## Usage
+```python
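+import torch
+# Requires the MaskDistill-PyTorch repo on the Python path so that `src` is importable
+from src.models.vision_transformer import VisionTransformerMIM
+# Build the ViT-Base/16 student with the options used during pretraining
+model = VisionTransformerMIM(
+    img_size=224, patch_size=16, embed_dim=768, depth=12, num_heads=12,
+    use_shared_rel_pos_bias=True, use_mask_tokens=True,
+)
+# Load the checkpoint on CPU, keep only the student weights, and strip the wrapper prefix
+ckpt = torch.load("pretrain_vit_base_ep290.pth", map_location="cpu")
+state = {k.replace("module.student.", ""): v for k, v in ckpt["model"].items() if "student" in k}
+# strict=False hides key mismatches, so print them to confirm the weights actually loaded
+result = model.load_state_dict(state, strict=False)
+print("missing keys:", result.missing_keys)
+print("unexpected keys:", result.unexpected_keys)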
+```
+See the [GitHub repo](https://github.com/drkostas/MaskDistill-PyTorch) for full training and evaluation code.
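The key filtering used in the loading snippet can be checked in isolation before touching the real checkpoint. The sketch below applies the same rule to a toy dictionary. The helper name and the toy key names are hypothetical, and the `module.student.` prefix is copied from the snippet rather than verified against the released file.

```python
def extract_student_state(checkpoint_state, prefix="module.student."):
    """Keep entries whose key mentions 'student' and strip the wrapper prefix."""
    return {
        key.replace(prefix, ""): value
        for key, value in checkpoint_state.items()
        if "student" in key
    }


# Hypothetical key names; the strings stand in for weight tensors.
toy_state = {
    "module.student.patch_embed.proj.weight": "w0",
    "module.student.blocks.0.attn.qkv.weight": "w1",
    "module.teacher.visual.conv1.weight": "w2",
}
student_state = extract_student_state(toy_state)
print(sorted(student_state))
# ['blocks.0.attn.qkv.weight', 'patch_embed.proj.weight']
```

Teacher entries are dropped and the remaining keys match the names a bare student model would expect, assuming the checkpoint really uses that prefix.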
+## Citation
+```bibtex
+@article{peng2022unified,
+  title={A Unified View of Masked Image Modeling},
+  author={Peng, Zhiliang and Dong, Li and Bao, Hangbo and Ye, Qixiang and Wei, Furu},
+  journal={arXiv preprint arXiv:2210.10615},
+  year={2022}
+}
+```