---
license: apache-2.0
tags:
- vision
- self-supervised-learning
- masked-image-modeling
- knowledge-distillation
- vit
datasets:
- ILSVRC/imagenet-1k
- 1aurent/ADE20K
- detection-datasets/coco
metrics:
- accuracy
- mIoU
- mAP
pipeline_tag: image-classification
---
| |
| # MaskDistill ViT-Base/16 |
|
|
| **The first open-source PyTorch implementation of MaskDistill with pre-trained weights.** |
|
|
| This model was trained using the [MaskDistill-PyTorch](https://github.com/drkostas/MaskDistill-PyTorch) codebase, reproducing the method from ["A Unified View of Masked Image Modeling"](https://arxiv.org/abs/2210.10615). |
|
|
| ## Model Description |
|
|
| MaskDistill learns visual representations by distilling knowledge from a frozen CLIP ViT-B/16 teacher into a ViT-Base student through masked image modeling. The student learns to predict the teacher's features for masked patches using Smooth L1 loss. |
|
|
| - **Architecture**: ViT-Base/16 (86M params) |
| - **Teacher**: CLIP ViT-B/16 (frozen) |
| - **Pretraining**: 300 epochs on ImageNet-1K |
| - **Masking**: Block masking at 40%, dense encoding with shared relative position bias |
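The objective above can be sketched in a few lines: the student predicts the frozen teacher's patch features, and the Smooth L1 penalty is applied only at masked positions. A minimal illustration with random stand-in tensors (shapes assume 224×224 inputs with 16×16 patches, i.e. a 14×14 grid of 196 tokens; the real loss uses encoder outputs, and the real mask is block-wise rather than the toy mask used here):

```python
import torch
import torch.nn.functional as F

# Toy stand-ins: batch of 2 images, 196 patches (14x14 grid), 768-dim features.
student_feats = torch.randn(2, 196, 768)   # student predictions for all patches
teacher_feats = torch.randn(2, 196, 768)   # frozen CLIP teacher targets

# Toy mask covering ~40% of patches (the actual recipe samples block-wise masks).
mask = torch.zeros(2, 196, dtype=torch.bool)
mask[:, :78] = True

# Smooth L1 loss on masked positions only -- unmasked patches are ignored.
loss = F.smooth_l1_loss(student_feats[mask], teacher_feats[mask])
```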
|
|
| ## Results |
|
|
| | Evaluation | Result | |
| |-----------|--------| |
| | Finetuning (ImageNet-1K) | **84.8%** top-1 | |
| | k-NN (k=10) | **75.6%** top-1 | |
| | Linear Probe | **76.3%** top-1 | |
| | Sem. Seg. (ADE20K, UPerNet) | **52.6** mIoU | |
| | Obj. Det. (COCO, Mask R-CNN) | **44.4** bbox mAP | |
| | Inst. Seg. (COCO, Mask R-CNN) | **40.1** segm mAP | |
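For reference, the k-NN protocol (k=10) amounts to a nearest-neighbor vote over frozen training features; a self-contained sketch of the idea using a simple cosine-similarity majority vote with random stand-in features (the real evaluation extracts features from the pretrained backbone over ImageNet-1K):

```python
import torch
import torch.nn.functional as F

# Random stand-ins for encoder features; the real evaluation uses
# L2-normalized features from the frozen pretrained backbone.
train_feats = F.normalize(torch.randn(1000, 768), dim=1)
train_labels = torch.randint(0, 10, (1000,))
val_feats = F.normalize(torch.randn(100, 768), dim=1)

# Cosine similarity to every training sample, then a k=10 majority vote.
sims = val_feats @ train_feats.T               # (100, 1000) similarity matrix
topk = sims.topk(k=10, dim=1).indices          # indices of 10 nearest neighbors
preds = torch.mode(train_labels[topk], dim=1).values
```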
|
|
| ## Available Checkpoints |
|
|
| | File | Description | |
| |------|------------| |
| | `pretrain_vit_base_ep290.pth` | Pretrained ViT-Base (300 epochs) | |
| | `finetune_vit_base_ep100.pth` | Finetuned on ImageNet-1K (84.8% top-1) | |
| | `linprobe_vit_base_ep90.pth.tar` | Linear probe (90 epochs, 76.3% top-1) | |
| | `semseg_upernet_ade20k_160k.pth` | UPerNet on ADE20K (52.6 mIoU) | |
| | `detection_maskrcnn_coco_12ep.pth` | Mask R-CNN on COCO (44.4 mAP) | |
|
|
| ## Usage |
|
|
```python
import torch
from src.models.vision_transformer import VisionTransformerMIM

# Build a ViT-Base/16 encoder matching the pretraining configuration
model = VisionTransformerMIM(
    img_size=224, patch_size=16, embed_dim=768, depth=12, num_heads=12,
    use_shared_rel_pos_bias=True, use_mask_tokens=True,
)

# The checkpoint stores both student and teacher; keep only the student
# weights and strip the DDP "module.student." prefix.
ckpt = torch.load("pretrain_vit_base_ep290.pth", map_location="cpu")
state = {k.replace("module.student.", ""): v for k, v in ckpt["model"].items() if "student" in k}
model.load_state_dict(state, strict=False)  # non-strict: pretraining-only keys may not match
```
|
|
| See the [GitHub repo](https://github.com/drkostas/MaskDistill-PyTorch) for full training and evaluation code. |
|
|
| ## Citation |
|
|
| ```bibtex |
@article{peng2022unified,
  title={A Unified View of Masked Image Modeling},
  author={Peng, Zhiliang and Dong, Li and Bao, Hangbo and Ye, Qixiang and Wei, Furu},
  journal={arXiv preprint arXiv:2210.10615},
  year={2022}
}
| ``` |
|
|