EVA-CLIP

We launch EVA-CLIP, a series of models that significantly improve the efficiency and effectiveness of CLIP training. Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters, at a significantly smaller training cost.

Notably, using exclusively publicly accessible training data, our large-sized EVA-02 CLIP-L/14 reaches 80.4% zero-shot top-1 accuracy on ImageNet-1K, outperforming the previous largest & best open-sourced CLIP with only ~1/6 of the parameters and ~1/6 of the image-text training data. Our largest 5.0B-parameter EVA-02 CLIP-E/14, trained on only 9 billion seen samples, achieves 82.0% zero-shot top-1 accuracy on ImageNet-1K.

Usage

import torch
from modeling_evaclip import EvaCLIPVisionModelWithProjection

# Load the vision tower with its projection head from the Hub checkpoint.
model = EvaCLIPVisionModelWithProjection.from_pretrained("townwish/EVACLIP-ViT-L-14-336px")

# Build a dummy batch matching the model's expected input resolution (336x336).
img_size = model.config.image_size
fake_image = torch.randn(1, 3, img_size, img_size)

with torch.no_grad():
    image_embeds = model(fake_image).image_embeds
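
For real inputs, the checkpoint can be paired with a standard CLIP-style image processor. The sketch below is a minimal example, assuming the repository ships a preprocessor_config.json compatible with transformers' CLIPImageProcessor; the image path is a placeholder.

import torch
from PIL import Image
from transformers import CLIPImageProcessor
from modeling_evaclip import EvaCLIPVisionModelWithProjection

model_id = "townwish/EVACLIP-ViT-L-14-336px"
processor = CLIPImageProcessor.from_pretrained(model_id)  # assumes a CLIP-style preprocessor config in the repo
model = EvaCLIPVisionModelWithProjection.from_pretrained(model_id)

image = Image.open("example.jpg").convert("RGB")  # placeholder path
inputs = processor(images=image, return_tensors="pt")  # resize, crop, and normalize to the model's resolution

with torch.no_grad():
    image_embeds = model(inputs.pixel_values).image_embeds

# L2-normalize so that dot products between embeddings are cosine similarities,
# as in standard CLIP retrieval.
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)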

Paper: EVA-CLIP: Improved Training Techniques for CLIP at Scale (arXiv:2303.15389)