---
license: apache-2.0
tags:
- vision
- self-supervised-learning
- masked-image-modeling
- knowledge-distillation
- vit
datasets:
- imagenet-1k
- ade20k
- coco
metrics:
- accuracy
- mIoU
- mAP
pipeline_tag: image-classification
---
+
20
+ # MaskDistill ViT-Base/16
21
+
22
+ **The first open-source PyTorch implementation of MaskDistill with pre-trained weights.**
23
+
24
+ This model was trained using the [MaskDistill-PyTorch](https://github.com/drkostas/MaskDistill-PyTorch) codebase, reproducing the method from ["A Unified View of Masked Image Modeling"](https://arxiv.org/abs/2210.10615).
25
+
## Model Description

MaskDistill learns visual representations by distilling knowledge from a frozen CLIP ViT-B/16 teacher into a ViT-Base student through masked image modeling: the student learns to predict the teacher's features for masked patches, trained with a Smooth L1 loss.

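The objective can be sketched as follows. This is an illustrative re-implementation of the idea, not the repo's actual API; `maskdistill_loss` and the tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def maskdistill_loss(student_feats, teacher_feats, mask):
    # student_feats, teacher_feats: (B, N, D) per-patch features;
    # mask: (B, N) with 1.0 where a patch was masked, 0.0 elsewhere.
    # Smooth L1 between student predictions and the frozen teacher's
    # features, averaged over the masked patches only.
    per_patch = F.smooth_l1_loss(
        student_feats, teacher_feats, reduction="none"
    ).mean(dim=-1)                                   # (B, N)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)

# Toy example: a 14x14 = 196-patch grid with ~40% of patches masked
student = torch.randn(2, 196, 512)
teacher = torch.randn(2, 196, 512)
mask = (torch.rand(2, 196) < 0.4).float()
loss = maskdistill_loss(student, teacher, mask)
```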
- **Architecture**: ViT-Base/16 (86M params)
- **Teacher**: CLIP ViT-B/16 (frozen)
- **Pretraining**: 300 epochs on ImageNet-1K
- **Masking**: Block masking at 40%, dense encoding with shared relative position bias

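For intuition, the 40% block masking can be approximated by the following simplified sketch of BEiT-style blockwise masking (the block-size and aspect-ratio bounds here are illustrative, not the repo's exact generator):

```python
import random

def block_mask(grid=14, ratio=0.4, min_area=4, max_area=36, max_tries=10_000):
    """Mask random rectangular blocks of patches until ~`ratio` of the
    grid is covered (ViT-Base/16 at 224px gives a 14x14 patch grid)."""
    target = int(grid * grid * ratio)
    mask = [[False] * grid for _ in range(grid)]
    masked = 0
    for _ in range(max_tries):
        if masked >= target:
            break
        # Sample a block area and aspect ratio, derive height/width
        area = random.randint(min_area, max_area)
        aspect = random.uniform(0.3, 1 / 0.3)
        h = min(grid, max(1, round((area * aspect) ** 0.5)))
        w = min(grid, max(1, round((area / aspect) ** 0.5)))
        top = random.randint(0, grid - h)
        left = random.randint(0, grid - w)
        # Mark any not-yet-masked patches inside the block
        for i in range(top, top + h):
            for j in range(left, left + w):
                if not mask[i][j]:
                    mask[i][j] = True
                    masked += 1
    return mask

m = block_mask()
```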
## Results

| Evaluation | Result |
|------------|--------|
| k-NN (k=10) | **75.6%** top-1 |
| Sem. Seg. (ADE20K, UPerNet) | **52.6** mIoU |
| Obj. Det. (COCO, Mask R-CNN) | **44.4** bbox mAP |
| Inst. Seg. (COCO, Mask R-CNN) | **40.1** segm mAP |

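The k-NN number follows the standard protocol of classifying each validation image by a similarity-weighted vote over its k nearest training features. A minimal sketch of that protocol (function name and shapes are illustrative, not the repo's evaluation code):

```python
import torch
import torch.nn.functional as F

def knn_predict(train_feats, train_labels, test_feats, k=10, num_classes=1000):
    # L2-normalize so the dot product is cosine similarity
    train = F.normalize(train_feats, dim=-1)
    test = F.normalize(test_feats, dim=-1)
    sims = test @ train.t()                          # (n_test, n_train)
    topk_sims, topk_idx = sims.topk(k, dim=-1)
    topk_labels = train_labels[topk_idx]             # (n_test, k)
    # Similarity-weighted vote over the k nearest neighbours
    votes = torch.zeros(test.size(0), num_classes)
    votes.scatter_add_(1, topk_labels, topk_sims)
    return votes.argmax(dim=-1)

# Toy check: two well-separated classes
torch.manual_seed(0)
train = torch.cat([torch.randn(20, 8) + 5.0, torch.randn(20, 8) - 5.0])
labels = torch.tensor([0] * 20 + [1] * 20)
test = torch.cat([torch.randn(4, 8) + 5.0, torch.randn(4, 8) - 5.0])
preds = knn_predict(train, labels, test, k=10, num_classes=2)
```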
## Available Checkpoints

| File | Description |
|------|-------------|
| `pretrain_vit_base_ep290.pth` | Pretrained ViT-Base (300 epochs) |
| `semseg_upernet_ade20k_160k.pth` | UPerNet on ADE20K (52.6 mIoU) |
| `detection_maskrcnn_coco_12ep.pth` | Mask R-CNN on COCO (44.4 mAP) |

## Usage

```python
import torch
from src.models.vision_transformer import VisionTransformerMIM

# Build a ViT-Base/16 student matching the pretraining configuration
model = VisionTransformerMIM(
    img_size=224, patch_size=16, embed_dim=768, depth=12, num_heads=12,
    use_shared_rel_pos_bias=True, use_mask_tokens=True,
)

# Load the checkpoint and keep only the student weights,
# stripping the DDP "module.student." prefix from each key
ckpt = torch.load("pretrain_vit_base_ep290.pth", map_location="cpu")
state = {k.replace("module.student.", ""): v
         for k, v in ckpt["model"].items() if "student" in k}
model.load_state_dict(state, strict=False)
```

See the [GitHub repo](https://github.com/drkostas/MaskDistill-PyTorch) for full training and evaluation code.

## Citation

```bibtex
@article{peng2022unified,
  title={A Unified View of Masked Image Modeling},
  author={Peng, Zhiliang and Dong, Li and Bao, Hangbo and Ye, Qixiang and Wei, Furu},
  journal={arXiv preprint arXiv:2210.10615},
  year={2022}
}
```