---
license: apache-2.0
tags:
- vision
- self-supervised-learning
- masked-image-modeling
- knowledge-distillation
- multi-objective-learning
- vit
datasets:
- ILSVRC/imagenet-1k
- 1aurent/ADE20K
metrics:
- accuracy
- mIoU
pipeline_tag: image-classification
---
# MEDiC ViT-Base/16

**Multi-objective Exploration of Distillation from CLIP**

This model was trained with the MEDiC codebase, implementing the method from *MEDiC: Multi-objective Exploration of Distillation from CLIP*.
## Model Description

MEDiC extends MaskDistill by combining three complementary training objectives:
- Token Distillation: Smooth L1 loss between student and CLIP teacher patch features
- CLS Token Alignment: Cosine similarity between student and teacher CLS tokens
- Pixel Reconstruction: MAE-style decoder reconstructing normalized pixel patches
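The three objectives above can be sketched as a single combined loss. This is a minimal illustration, not the repository's implementation: the loss weights (`w_cls`, `w_pix`) and the restriction of each term to masked positions are assumptions.

```python
import torch
import torch.nn.functional as F

def medic_loss(student_patches, teacher_patches, student_cls, teacher_cls,
               pred_pixels, target_pixels, mask, w_cls=1.0, w_pix=1.0):
    """Combined MEDiC-style objective (illustrative sketch).

    student_patches / teacher_patches: (B, N, D) patch features
    student_cls / teacher_cls:         (B, D) CLS tokens
    pred_pixels / target_pixels:       (B, N, P) per-patch pixel values
    mask:                              (B, N) bool, True = masked patch
    """
    # 1) Token distillation: Smooth L1 between student and CLIP teacher patch features
    l_tok = F.smooth_l1_loss(student_patches[mask], teacher_patches[mask])
    # 2) CLS alignment: 1 - cosine similarity between student and teacher CLS tokens
    l_cls = (1.0 - F.cosine_similarity(student_cls, teacher_cls, dim=-1)).mean()
    # 3) Pixel reconstruction: MAE-style regression on (normalized) pixel patches
    l_pix = F.mse_loss(pred_pixels[mask], target_pixels[mask])
    return l_tok + w_cls * l_cls + w_pix * l_pix
```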
**Architecture:**
- Student: ViT-Base/16 (86M params)
- Teacher: CLIP ViT-B/16 (frozen)
- Decoder: 8-block transformer (512 dim, for pixel reconstruction)
- Pretraining: 300 epochs on ImageNet-1K with sparse encoding and 40% block masking
## Results
| Evaluation | Result |
|---|---|
| k-NN (k=20) | 73.9% top-1 |
| Linear Probe | 60.5% top-1 |
| Finetuning (ImageNet-1K) | 85.1% top-1 |
| Sem. Seg. (ADE20K, UPerNet) | 52.5 mIoU |
### Loss Ablation

| Configuration | k-NN top-1 |
|---|---|
| Token only | 68.6% |
| + Pixel | 71.4% |
| + CLS | 72.3% |
| + Pixel + CLS (MEDiC) | 73.9% |
## Usage

```python
import torch
from src.models.vision_transformer import VisionTransformerMIM

# Load the pretrained student backbone
model = VisionTransformerMIM(
    img_size=224, patch_size=16, embed_dim=768, depth=12, num_heads=12,
    use_abs_pos_emb=True, use_mask_tokens=False,
)
ckpt = torch.load("medic_vit_base_ep299.pth", map_location="cpu")
# Keep only student weights and strip the DDP/student prefix
state = {k.replace("module.student.", ""): v
         for k, v in ckpt["model"].items() if k.startswith("module.student.")}
model.load_state_dict(state, strict=False)
```
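The k-NN result in the table above comes from classifying each validation feature by a majority vote over its 20 nearest training features. A self-contained sketch of that protocol (cosine similarity and majority voting are assumptions about the exact evaluation details):

```python
import torch

def knn_classify(train_feats, train_labels, test_feats, k=20):
    """Cosine-similarity k-NN classification (illustrative sketch)."""
    # L2-normalize so that a dot product equals cosine similarity
    train_feats = torch.nn.functional.normalize(train_feats, dim=1)
    test_feats = torch.nn.functional.normalize(test_feats, dim=1)
    sims = test_feats @ train_feats.T          # (n_test, n_train)
    _, idx = sims.topk(k, dim=1)               # indices of the k nearest neighbors
    neighbor_labels = train_labels[idx]        # (n_test, k)
    return neighbor_labels.mode(dim=1).values  # majority vote per test sample
```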
See the GitHub repository for the full training and evaluation code.
## Citation

```bibtex
@article{georgiou2025medic,
  title={MEDiC: Multi-objective Exploration of Distillation from CLIP},
  author={Georgiou, Kostas},
  journal={arXiv preprint arXiv:2603.29009},
  year={2025}
}
```