EVA-CLIP

We launch EVA-CLIP, a series of models that significantly improve the efficiency and effectiveness of CLIP training. Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters, at a significantly smaller training cost.

Notably, using exclusively publicly accessible training data, our large-sized EVA-02 CLIP-L/14 reaches 80.4% zero-shot top-1 accuracy on ImageNet-1K, outperforming the previous largest & best open-sourced CLIP with only ~1/6 of the parameters and ~1/6 of the image-text training data. Our largest 5.0B-parameter EVA-02 CLIP-E/14, trained on only 9 billion seen samples, achieves 82.0% zero-shot top-1 accuracy on ImageNet-1K.

Usage

import torch
from modeling_evaclip import EvaCLIPVisionModelWithProjection

# Load the vision tower with its projection head from the Hub checkpoint.
model = EvaCLIPVisionModelWithProjection.from_pretrained("townwish/EVACLIP-ViT-L-14-336px")

# Build a dummy batch matching the model's expected input resolution (336x336).
img_size = model.config.image_size
fake_image = torch.randn(1, 3, img_size, img_size)

with torch.no_grad():
    image_embeds = model(fake_image).image_embeds
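
For real inputs, the checkpoint can be paired with a standard CLIP-style image processor. The sketch below is a minimal example, assuming the repository ships a preprocessor_config.json compatible with transformers' CLIPImageProcessor; the image path is a placeholder.

import torch
from PIL import Image
from transformers import CLIPImageProcessor
from modeling_evaclip import EvaCLIPVisionModelWithProjection

model_id = "townwish/EVACLIP-ViT-L-14-336px"
processor = CLIPImageProcessor.from_pretrained(model_id)  # assumes a CLIP-style preprocessor config in the repo
model = EvaCLIPVisionModelWithProjection.from_pretrained(model_id)

image = Image.open("example.jpg").convert("RGB")  # placeholder path
inputs = processor(images=image, return_tensors="pt")  # resize, crop, and normalize to the model's resolution

with torch.no_grad():
    image_embeds = model(inputs.pixel_values).image_embeds

# L2-normalize so that dot products between embeddings are cosine similarities,
# as in standard CLIP retrieval.
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)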

Paper: EVA-CLIP: Improved Training Techniques for CLIP at Scale (arXiv:2303.15389)