EVA-CLIP
EVA-CLIP: Improved Training Techniques for CLIP at Scale
Quan Sun1, Yuxin Fang2,1, Ledell Wu1, Xinlong Wang1, Yue Cao1
We launch EVA-CLIP, a series of models that significantly improve the efficiency and effectiveness of CLIP training. Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters but significantly smaller training costs.
Notably, using exclusively publicly accessible training data, our large-sized EVA-02 CLIP-L/14 can reach up to 80.4 zero-shot top-1 on ImageNet-1K, outperforming the previous largest and best open-sourced CLIP with only ~1/6 parameters and ~1/6 image-text training data. Our largest 5.0B-parameter EVA-02 CLIP-E/14 with only 9 billion seen samples achieves 82.0 zero-shot top-1 accuracy on ImageNet-1K.
Usage
import torch
from modeling_evaclip import EvaCLIPVisionModelWithProjection
model = EvaCLIPVisionModelWithProjection.from_pretrained("townwish/EVACLIP-ViT-L-14-336px")
img_size = model.config.image_size
fake_image = torch.randn(1, 3, img_size, img_size)
with torch.no_grad():
    outputs = model(fake_image).image_embeds
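The zero-shot top-1 numbers quoted above come from comparing image embeddings against one text embedding per class. The following is a minimal sketch of that comparison step only, not the authors' evaluation code. It uses random tensors as stand-ins for both embedding sets: the text embeddings would have to come from a matching text encoder, which this card does not show. The helper name `zero_shot_predict`, the 768-dimensional embedding size, and the `logit_scale` value of 100 are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def zero_shot_predict(image_embeds, text_embeds, logit_scale=100.0):
    """Return per-class probabilities for each image (hypothetical helper)."""
    # L2-normalize so the dot product below equals cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # (num_images, num_classes) similarity matrix, sharpened by a temperature.
    # logit_scale=100.0 is an assumed value, not read from this checkpoint.
    logits = logit_scale * image_embeds @ text_embeds.T
    return logits.softmax(dim=-1)


# Stand-in data: 2 images, 3 classes, 768-dim embeddings (dimension assumed).
torch.manual_seed(0)
image_embeds = torch.randn(2, 768)
text_embeds = torch.randn(3, 768)

probs = zero_shot_predict(image_embeds, text_embeds)
predicted = probs.argmax(dim=-1)  # top-1 class index per image
```

In practice `image_embeds` would be the `outputs` tensor from the usage snippet, and each row of `text_embeds` would encode a class prompt such as a class name embedded in a sentence.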