---
license: apache-2.0
tags:
- vision
- image-classification
- clip
- knowledge-distillation
- semi-supervised-learning
- imagenet
datasets:
- imagenet-1k
library_name: pytorch
pipeline_tag: image-classification
---
|
|
|
|
|
# DHO: Simple yet Effective Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization |
|
|
|
|
|
[arXiv:2505.07675](https://arxiv.org/abs/2505.07675)
|
|
|
|
|
This repository contains pretrained checkpoints for **DHO (Dual-Head Optimization)**, a simple yet effective approach for semi-supervised knowledge distillation from Vision-Language Models. |
|
|
|
|
|
## Model Description |
|
|
|
|
|
DHO introduces a dual-head optimization strategy that enables efficient knowledge transfer from large Vision-Language Models (e.g., CLIP) to smaller student models using minimal labeled data. The student carries two parallel classification heads: a CE head trained with cross-entropy on the labeled subset, and a KD head distilled from the teacher's predictions. The method achieves state-of-the-art performance on ImageNet semi-supervised learning benchmarks using only 1% or 10% of the labels.
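The core objective can be summarized in a few lines. Below is a minimal, illustrative single-training-step sketch, assuming the dual-head `StudentModel` defined in the Usage section; the equal loss weighting, the temperature `T`, and the `teacher_logits_fn` interface are simplifying assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def dho_step(student, teacher_logits_fn, x_labeled, y, x_unlabeled, T=2.0):
    # CE head: supervised cross-entropy on the small labeled subset
    ce_logits, _ = student(x_labeled)
    loss_ce = F.cross_entropy(ce_logits, y)

    # KD head: match the teacher's soft predictions on unlabeled images
    _, kd_logits = student(x_unlabeled)
    with torch.no_grad():
        teacher_logits = teacher_logits_fn(x_unlabeled)
    loss_kd = F.kl_div(
        F.log_softmax(kd_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction='batchmean',
    ) * T * T  # standard distillation temperature scaling

    # Equal weighting here is a placeholder; see the paper for the exact objective
    return loss_ce + loss_kd
```

The design intent is that each head gets its own loss, so the supervised and distillation signals do not compete for a single output layer.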
|
|
|
|
|
**Paper:** [Simple yet Effective Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization](https://arxiv.org/abs/2505.07675) |
|
|
|
|
|
**Authors:** Seongjae Kang, Dong Bok Lee, Hyungjoon Jang, Sung Ju Hwang |
|
|
|
|
|
## Key Features |
|
|
|
|
|
- ✨ **Dual-head optimization** strategy for semi-supervised distillation |
|
|
- 🏆 **State-of-the-art** performance on ImageNet with 1% and 10% labeled data |
|
|
- 🔄 Efficient transfer from VLMs (e.g., CLIP) to smaller student models |
|
|
- 🧩 Simple, scalable, and easy to integrate into existing pipelines |
|
|
|
|
|
## Available Checkpoints |
|
|
|
|
|
| Checkpoint Name | Student Model | Teacher Model | Labeled Data | Top-1 Acc. | Parameters |
|:----------------|:--------------|:--------------|:-------------|:-----------|:-----------|
| `vit_b_1.pt` | ViT-B/16 | ViT-H/14 (DFN5B) | 1% | 81.6% | 86M |
| `vit_b_10.pt` | ViT-B/16 | ViT-H/14 (DFN5B) | 10% | 82.8% | 86M |
| `vit_l_1.pt` | ViT-L/14 | ViT-H/14 (DFN5B) | 1% | 84.6% | 304M |
| `vit_l_10.pt` | ViT-L/14 | ViT-H/14 (DFN5B) | 10% | 85.9% | 304M |
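Any checkpoint in the table can be fetched by filename (a minimal sketch; the repo id matches the loading example below):

```python
from huggingface_hub import hf_hub_download

# Swap the filename for any checkpoint name in the table above
ckpt_path = hf_hub_download(repo_id="erjui/dho", filename="vit_l_10.pt")
```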
|
|
|
|
|
## Usage |
|
|
|
|
|
### Loading a Checkpoint |
|
|
|
|
|
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip
from huggingface_hub import hf_hub_download

# DHO student: CLIP visual backbone with two parallel classification heads
class StudentModel(nn.Module):
    def __init__(self, num_classes=1000, model_name='ViT-B-16'):
        super().__init__()
        # Map checkpoint model names to the identifiers expected by clip.load
        clip_names = {
            'RN50': 'RN50',
            'ViT-B-16': 'ViT-B/16',
            'ViT-L-14': 'ViT-L/14',
            'ViT-L-14-336px': 'ViT-L/14@336px',
        }
        clip_model, _ = clip.load(clip_names[model_name], device='cpu')
        self.backbone = clip_model.float().visual

        # Visual feature dimensions per architecture
        in_features = {
            'RN50': 1024,
            'ViT-B-16': 512,
            'ViT-L-14': 768,
            'ViT-L-14-336px': 768,
        }[model_name]

        # Dual-head architecture
        self.ce_head = nn.Linear(in_features, num_classes)  # CE branch
        self.kd_head = nn.Linear(in_features, num_classes)  # KD branch

    def forward(self, x):
        features = self.backbone(x)
        ce_out = self.ce_head(features)
        # KD head operates on L2-normalized features, scaled to logit range
        kd_out = self.kd_head(F.normalize(features, dim=1)) * 100
        return ce_out, kd_out

# Download and load checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"
checkpoint_path = hf_hub_download(repo_id="erjui/dho", filename="vit_b_10.pt")
checkpoint = torch.load(checkpoint_path, map_location=device)

# Initialize model
model = StudentModel(num_classes=1000, model_name='ViT-B-16').to(device)

# Strip the 'module.' prefix left by DistributedDataParallel training
state_dict = checkpoint['model_state_dict']
state_dict = {k.replace('module.', ''): v for k, v in state_dict.items()}
model.load_state_dict(state_dict)

# Saved inference parameters
alpha = checkpoint['alpha']  # interpolation weight for the CE head
beta = checkpoint['beta']    # temperature for the KD head
model.eval()

# Inference example
from PIL import Image
import torchvision.transforms as transforms

# CLIP preprocessing (224px input; use 336 for the ViT-L-14-336px backbone)
preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
])

image = preprocess(Image.open("path/to/image.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    ce_logits, kd_logits = model(image)

# Combine the two heads' predictions with the saved parameters
probs_ce = F.softmax(ce_logits, dim=1)
probs_kd = F.softmax(kd_logits / beta, dim=1)
probs = alpha * probs_ce + (1 - alpha) * probs_kd

predicted_class = probs.argmax(dim=1)
print(f"Predicted class: {predicted_class.item()}")
```
|
|
|
|
|
**Important Notes:** |
|
|
- DHO checkpoints contain: `model_state_dict`, `epoch`, `acc`, `alpha`, `beta` |
|
|
- The model has a **dual-head architecture** (CE head + KD head) |
|
|
- Use the saved `alpha` and `beta` parameters for optimal inference |
|
|
- For ViT-L checkpoints, change `model_name='ViT-L-14'` and use image size 224 (or 336 for ViT-L-14-336px) |
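As a quick sanity check that a download matches the layout described above (a sketch, assuming `checkpoint_path` from the loading example):

```python
import torch

ckpt = torch.load(checkpoint_path, map_location="cpu")
print(sorted(ckpt.keys()))  # expected: ['acc', 'alpha', 'beta', 'epoch', 'model_state_dict']
print(f"epoch={ckpt['epoch']}, acc={ckpt['acc']}, alpha={ckpt['alpha']}, beta={ckpt['beta']}")
```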
|
|
|
|
|
### Training Your Own Model |
|
|
|
|
|
To train your own DHO model, please visit the [official GitHub repository](https://github.com/erjui/DHO) for detailed instructions and training scripts.
|
|
|
|
|
**Example training command:** |
|
|
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 --master_port=29500 train_imgnet_semi.py \
    --teacher_model "apple/DFN5B-CLIP-ViT-H-14-378" \
    --student_model "ViT-B-16" \
    --lr 5e-5 \
    --train_epoch 32 \
    --batch_size 256 \
    --percent 10.0 \
    | tee ./logs/imagenet/imgnet_lowshot.log
```
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
The DHO student model consists of: |
|
|
- **Backbone:** CLIP Vision Transformer (ViT-B/16 or ViT-L/14) |
|
|
- **Two parallel heads:** |
|
|
- **CE Head:** Optimized with cross-entropy loss on labeled data |
|
|
- **KD Head:** Optimized with knowledge distillation loss from teacher predictions |
|
|
|
|
|
During inference, predictions from the two heads are combined using the saved interpolation weight `alpha` and KD temperature `beta`, as shown below.
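Concretely, with CE logits \\(z_{\text{CE}}\\) and KD logits \\(z_{\text{KD}}\\), the final class distribution is the convex combination implemented in the loading example:

$$
p = \alpha \cdot \mathrm{softmax}(z_{\text{CE}}) + (1 - \alpha) \cdot \mathrm{softmax}(z_{\text{KD}} / \beta)
$$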
|
|
|
|
|
## Performance |
|
|
|
|
|
### ImageNet Semi-supervised Learning |
|
|
|
|
|
| Student | Teacher | Labeled Data | Top-1 Accuracy |
|:--------|:--------|:-------------|:---------------|
| ViT-B/16 | ViT-H/14 | 1% | **81.6%** |
| ViT-B/16 | ViT-H/14 | 10% | **82.8%** |
| ViT-L/14 | ViT-H/14 | 1% | **84.6%** |
| ViT-L/14 | ViT-H/14 | 10% | **85.9%** |
|
|
|
|
|
These results set a new state of the art for semi-supervised learning on ImageNet-1K.
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use these models in your research, please cite: |
|
|
|
|
|
```bibtex
@article{kang2025simple,
  title={Simple yet Effective Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization},
  author={Kang, Seongjae and Lee, Dong Bok and Jang, Hyungjoon and Hwang, Sung Ju},
  journal={arXiv preprint arXiv:2505.07675},
  year={2025}
}
```
|
|
|
|
|
## License |
|
|
|
|
|
This project is licensed under the Apache License 2.0 - see the LICENSE file for details. |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
We appreciate the open-source implementations from: |
|
|
- [Tip-Adapter](https://github.com/gaopengcuhk/Tip-Adapter) |
|
|
- [CLIP](https://github.com/openai/CLIP) |
|
|
- [OpenCLIP](https://github.com/mlfoundations/open_clip) |
|
|
|
|
|
## Contact |
|
|
|
|
|
For questions or issues, please open an issue on the [GitHub repository](https://github.com/erjui/DHO) or contact the authors. |
|
|
|
|
|
|