---
license: apache-2.0
tags:
- vision
- image-classification
- clip
- knowledge-distillation
- semi-supervised-learning
- imagenet
datasets:
- imagenet-1k
library_name: pytorch
pipeline_tag: image-classification
---
|
|
|
|
|
# DHO: Simple yet Effective Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization |
|
|
|
|
|
[arXiv:2505.07675](https://arxiv.org/abs/2505.07675)
|
|
|
|
|
This repository contains pretrained checkpoints for **DHO (Dual-Head Optimization)**, a simple yet effective approach for semi-supervised knowledge distillation from Vision-Language Models. |
|
|
|
|
|
## Model Description |
|
|
|
|
|
DHO introduces a dual-head optimization strategy that enables efficient knowledge transfer from large Vision-Language Models (e.g., CLIP) to smaller student models using minimal labeled data. The student carries two parallel classification heads: a CE head trained with cross-entropy on the labeled subset, and a KD head distilled from the teacher's predictions. The method achieves state-of-the-art performance on ImageNet semi-supervised learning benchmarks using only 1% or 10% of the labels.
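The core objective can be summarized in a few lines. Below is a minimal, illustrative single-training-step sketch, assuming the dual-head `StudentModel` defined in the Usage section; the equal loss weighting, the temperature `T`, and the `teacher_logits_fn` interface are simplifying assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def dho_step(student, teacher_logits_fn, x_labeled, y, x_unlabeled, T=2.0):
    # CE head: supervised cross-entropy on the small labeled subset
    ce_logits, _ = student(x_labeled)
    loss_ce = F.cross_entropy(ce_logits, y)

    # KD head: match the teacher's soft predictions on unlabeled images
    _, kd_logits = student(x_unlabeled)
    with torch.no_grad():
        teacher_logits = teacher_logits_fn(x_unlabeled)
    loss_kd = F.kl_div(
        F.log_softmax(kd_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction='batchmean',
    ) * T * T  # standard distillation temperature scaling

    # Equal weighting here is a placeholder; see the paper for the exact objective
    return loss_ce + loss_kd
```

The design intent is that each head gets its own loss, so the supervised and distillation signals do not compete for a single output layer.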
|
|
|
|
|
**Paper:** [Simple yet Effective Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization](https://arxiv.org/abs/2505.07675) |
|
|
|
|
|
**Authors:** Seongjae Kang, Dong Bok Lee, Hyungjoon Jang, Sung Ju Hwang |
|
|
|
|
|
## Key Features |
|
|
|
|
|
- ✨ **Dual-head optimization** strategy for semi-supervised distillation |
|
|
- 🏆 **State-of-the-art** performance on ImageNet with 1% and 10% labeled data |
|
|
- 🔄 Efficient transfer from VLMs (e.g., CLIP) to smaller student models |
|
|
- 🧩 Simple, scalable, and easy to integrate into existing pipelines |
|
|
|
|
|
## Available Checkpoints |
|
|
|
|
|
| Checkpoint Name | Student Model | Teacher Model | Labeled Data | Top-1 Acc. | Parameters |
|:----------------|:--------------|:--------------|:-------------|:-----------|:-----------|
| `vit_b_1.pt` | ViT-B/16 | ViT-H/14 (DFN5B) | 1% | 81.6% | 86M |
| `vit_b_10.pt` | ViT-B/16 | ViT-H/14 (DFN5B) | 10% | 82.8% | 86M |
| `vit_l_1.pt` | ViT-L/14 | ViT-H/14 (DFN5B) | 1% | 84.6% | 304M |
| `vit_l_10.pt` | ViT-L/14 | ViT-H/14 (DFN5B) | 10% | 85.9% | 304M |
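Any checkpoint in the table can be fetched by filename (a minimal sketch; the repo id matches the loading example below):

```python
from huggingface_hub import hf_hub_download

# Swap the filename for any checkpoint name in the table above
ckpt_path = hf_hub_download(repo_id="erjui/dho", filename="vit_l_10.pt")
```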
|
|
|
|
|
## Usage |
|
|
|
|
|
### Loading a Checkpoint |
|
|
|
|
|
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip
from huggingface_hub import hf_hub_download

# DHO student: CLIP visual backbone with two parallel classification heads
class StudentModel(nn.Module):
    def __init__(self, num_classes=1000, model_name='ViT-B-16'):
        super().__init__()
        # Map checkpoint model names to the identifiers expected by clip.load
        clip_names = {
            'RN50': 'RN50',
            'ViT-B-16': 'ViT-B/16',
            'ViT-L-14': 'ViT-L/14',
            'ViT-L-14-336px': 'ViT-L/14@336px',
        }
        clip_model, _ = clip.load(clip_names[model_name], device='cpu')
        self.backbone = clip_model.float().visual

        # Visual feature dimensions per architecture
        in_features = {
            'RN50': 1024,
            'ViT-B-16': 512,
            'ViT-L-14': 768,
            'ViT-L-14-336px': 768,
        }[model_name]

        # Dual-head architecture
        self.ce_head = nn.Linear(in_features, num_classes)  # CE branch
        self.kd_head = nn.Linear(in_features, num_classes)  # KD branch

    def forward(self, x):
        features = self.backbone(x)
        ce_out = self.ce_head(features)
        # KD head operates on L2-normalized features, scaled to logit range
        kd_out = self.kd_head(F.normalize(features, dim=1)) * 100
        return ce_out, kd_out

# Download and load checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"
checkpoint_path = hf_hub_download(repo_id="erjui/dho", filename="vit_b_10.pt")
checkpoint = torch.load(checkpoint_path, map_location=device)

# Initialize model
model = StudentModel(num_classes=1000, model_name='ViT-B-16').to(device)

# Strip the 'module.' prefix left by DistributedDataParallel training
state_dict = checkpoint['model_state_dict']
state_dict = {k.replace('module.', ''): v for k, v in state_dict.items()}
model.load_state_dict(state_dict)

# Saved inference parameters
alpha = checkpoint['alpha']  # interpolation weight for the CE head
beta = checkpoint['beta']    # temperature for the KD head
model.eval()

# Inference example
from PIL import Image
import torchvision.transforms as transforms

# CLIP preprocessing (224px input; use 336 for the ViT-L-14-336px backbone)
preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
])

image = preprocess(Image.open("path/to/image.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    ce_logits, kd_logits = model(image)

# Combine the two heads' predictions with the saved parameters
probs_ce = F.softmax(ce_logits, dim=1)
probs_kd = F.softmax(kd_logits / beta, dim=1)
probs = alpha * probs_ce + (1 - alpha) * probs_kd

predicted_class = probs.argmax(dim=1)
print(f"Predicted class: {predicted_class.item()}")
```
|
|
|
|
|
**Important Notes:** |
|
|
- DHO checkpoints contain: `model_state_dict`, `epoch`, `acc`, `alpha`, `beta` |
|
|
- The model has a **dual-head architecture** (CE head + KD head) |
|
|
- Use the saved `alpha` and `beta` parameters for optimal inference |
|
|
- For ViT-L checkpoints, change `model_name='ViT-L-14'` and use image size 224 (or 336 for ViT-L-14-336px) |
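As a quick sanity check that a download matches the layout described above (a sketch, assuming `checkpoint_path` from the loading example):

```python
import torch

ckpt = torch.load(checkpoint_path, map_location="cpu")
print(sorted(ckpt.keys()))  # expected: ['acc', 'alpha', 'beta', 'epoch', 'model_state_dict']
print(f"epoch={ckpt['epoch']}, acc={ckpt['acc']}, alpha={ckpt['alpha']}, beta={ckpt['beta']}")
```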
|
|
|
|
|
### Training Your Own Model |
|
|
|
|
|
To train your own DHO model, please visit the [official GitHub repository](https://github.com/erjui/DHO) for detailed instructions and training scripts.
|
|
|
|
|
**Example training command:** |
|
|
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 --master_port=29500 train_imgnet_semi.py \
    --teacher_model "apple/DFN5B-CLIP-ViT-H-14-378" \
    --student_model "ViT-B-16" \
    --lr 5e-5 \
    --train_epoch 32 \
    --batch_size 256 \
    --percent 10.0 \
    | tee ./logs/imagenet/imgnet_lowshot.log
```
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
The DHO student model consists of: |
|
|
- **Backbone:** CLIP Vision Transformer (ViT-B/16 or ViT-L/14) |
|
|
- **Two parallel heads:** |
|
|
- **CE Head:** Optimized with cross-entropy loss on labeled data |
|
|
- **KD Head:** Optimized with knowledge distillation loss from teacher predictions |
|
|
|
|
|
During inference, predictions from the two heads are combined using the saved interpolation weight `alpha` and KD temperature `beta`, as shown below.
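Concretely, with CE logits \\(z_{\text{CE}}\\) and KD logits \\(z_{\text{KD}}\\), the final class distribution is the convex combination implemented in the loading example:

$$
p = \alpha \cdot \mathrm{softmax}(z_{\text{CE}}) + (1 - \alpha) \cdot \mathrm{softmax}(z_{\text{KD}} / \beta)
$$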
|
|
|
|
|
## Performance |
|
|
|
|
|
### ImageNet Semi-supervised Learning |
|
|
|
|
|
| Student | Teacher | Labeled Data | Top-1 Accuracy |
|:--------|:--------|:-------------|:---------------|
| ViT-B/16 | ViT-H/14 | 1% | **81.6%** |
| ViT-B/16 | ViT-H/14 | 10% | **82.8%** |
| ViT-L/14 | ViT-H/14 | 1% | **84.6%** |
| ViT-L/14 | ViT-H/14 | 10% | **85.9%** |
|
|
|
|
|
These results set a new state of the art for semi-supervised learning on ImageNet-1K.
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use these models in your research, please cite: |
|
|
|
|
|
```bibtex
@article{kang2025simple,
  title={Simple yet Effective Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization},
  author={Kang, Seongjae and Lee, Dong Bok and Jang, Hyungjoon and Hwang, Sung Ju},
  journal={arXiv preprint arXiv:2505.07675},
  year={2025}
}
```
|
|
|
|
|
## License |
|
|
|
|
|
This project is licensed under the Apache License 2.0 - see the LICENSE file for details. |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
We appreciate the open-source implementations from: |
|
|
- [Tip-Adapter](https://github.com/gaopengcuhk/Tip-Adapter) |
|
|
- [CLIP](https://github.com/openai/CLIP) |
|
|
- [OpenCLIP](https://github.com/mlfoundations/open_clip) |
|
|
|
|
|
## Contact |
|
|
|
|
|
For questions or issues, please open an issue on the [GitHub repository](https://github.com/erjui/DHO) or contact the authors. |
|
|
|
|
|
|