---
license: apache-2.0
datasets:
- laion/laion2B-en
- kakaobrain/coyo-700m
---

<div align="center">

<h2><a href="http://arxiv.org/abs/">EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters</a></h2>

[Quan Sun](https://github.com/Quan-Sun)<sup>1*</sup>, [Jinsheng Wang](https://github.com/Wolfwjs/)<sup>1*</sup>, [Qiying Yu](https://yqy2001.github.io)<sup>1,2*</sup>, [Yufeng Cui](https://scholar.google.com/citations?hl=en&user=5Ydha2EAAAAJ)<sup>1</sup>, [Fan Zhang](https://scholar.google.com/citations?user=VsJ39HMAAAAJ)<sup>1</sup>, [Xiaosong Zhang](https://zhangxiaosong18.github.io)<sup>1</sup>, [Xinlong Wang](https://www.xloong.wang/)<sup>1</sup>

<sup>1</sup> [BAAI](https://www.baai.ac.cn/english.html), <sup>2</sup> [THU](https://air.tsinghua.edu.cn) <br><sup>*</sup> equal contribution

[Paper](http://arxiv.org/abs/) | [Github](https://github.com/baaivision/EVA/tree/master/EVA-CLIP-18B)

</div>

Scaling up contrastive language-image pretraining (CLIP) is critical for empowering both vision and multimodal models. We present EVA-CLIP-18B, the largest and most powerful open-source CLIP model to date, with 18 billion parameters. Having seen only 6 billion training samples, EVA-CLIP-18B achieves an exceptional **80.7%** zero-shot top-1 accuracy averaged across 27 widely recognized image classification benchmarks, outperforming its forerunner EVA-CLIP (5 billion parameters) and other open-source CLIP models by a large margin. Remarkably, performance improves consistently as EVA-CLIP scales in model size, even though the training dataset is kept fixed at 2 billion image-text pairs from LAION-2B and COYO-700M. This dataset is openly available and much smaller than the in-house datasets (e.g., DFN-5B, WebLI-10B) used by other state-of-the-art CLIP models. EVA-CLIP-18B demonstrates the potential of EVA-style weak-to-strong visual model scaling. By releasing our model weights publicly, we hope to facilitate future research in vision and multimodal foundation models.

**Table of Contents**

- [Summary of EVA-CLIP performance](#summary-of-eva-clip-performance)
- [Model Card](#model-card)
  - [EVA-CLIP-8B](#eva-clip-8b)
  - [EVA-CLIP-18B](#eva-clip-18b)
- [Zero-Shot Evaluation](#zero-shot-evaluation)
- [Usage](#usage)
- [BibTeX \& Citation](#bibtex--citation)

## Summary of EVA-CLIP performance

![summary_tab](assets/teaser.png)

Scaling behavior of EVA-CLIP: zero-shot classification performance averaged across 27 image classification benchmarks, compared with the current state-of-the-art and largest CLIP models (224px). The diameter of each circle is proportional to forward GFLOPs × the number of training samples seen. The performance of EVA-CLIP improves consistently as the model scales up.

## Model Card

### EVA-8B

<div align="center">

| model name | total #params | seen samples | pytorch weight |
|:-----------|:------:|:------:|:------:|
| `EVA_8B_psz14` | 7.5B | 6B | [PT](https://huggingface.co/BAAI/EVA-CLIP-8B/resolve/main/EVA_8B_psz14.bin) (`29.0GB`) |

</div>

### EVA-CLIP-8B

> Image encoder MIM teacher: [EVA02_CLIP_E_psz14_plus_s9B](https://huggingface.co/QuanSun/EVA-CLIP/blob/main/EVA02_CLIP_E_psz14_s4B.pt).

<div align="center">

| model name | image enc. init. ckpt | text enc. init. ckpt | total #params | training data | training batch size | gpus for training | img. cls. avg. acc. | video cls. avg. acc. | retrieval MR | hf weight | pytorch weight |
|:-----|:-----|:-----------|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
| `EVA-CLIP-8B` | `EVA_8B_psz14` | `EVA02_CLIP_E_psz14_plus_s9B` | 8.1B | Merged-2B | 178K | 384 A100(40GB) | **79.4** | **73.6** | **86.2** | [🤗 HF](https://huggingface.co/BAAI/EVA-CLIP-8B) | [PT](https://huggingface.co/BAAI/EVA-CLIP-8B/resolve/main/EVA_CLIP_8B_psz14_s9B.pt) (`32.9GB`) |
| `EVA-CLIP-8B-448` | `EVA-CLIP-8B` | `EVA-CLIP-8B` | 8.1B | Merged-2B | 24K | 384 A100(40GB) | **80.0** | **73.7** | **86.4** | [🤗 HF](https://huggingface.co/BAAI/EVA-CLIP-8B-448) | [PT](https://huggingface.co/BAAI/EVA-CLIP-8B-448/resolve/main/EVA_CLIP_8B_psz14_plus_s0.6B.pt) (`32.9GB`) |

</div>
### EVA-CLIP-18B

> Image encoder MIM teacher: [EVA02_CLIP_E_psz14_plus_s9B](https://huggingface.co/QuanSun/EVA-CLIP/blob/main/EVA02_CLIP_E_psz14_s4B.pt).

<div align="center">

| model name | image enc. init. ckpt | text enc. init. ckpt | total #params | training data | training batch size | gpus for training | img. cls. avg. acc. | video cls. avg. acc. | retrieval MR | hf weight | pytorch weight |
|:-----|:-----|:-----------|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
| `EVA-CLIP-18B` | `EVA_18B_psz14` | `EVA02_CLIP_E_psz14_plus_s9B` | 18.1B | Merged-2B+ | 108K | 360 A100(40GB) | **80.7** | **75.0** | **87.8** | stay tuned | stay tuned |

</div>

- To construct Merged-2B, we merged 1.6 billion samples from the [LAION-2B](https://laion.ai/blog/laion-5b/) dataset with 0.4 billion samples from [COYO-700M](https://github.com/kakaobrain/coyo-dataset).
- Merged-2B+ consists of all samples from Merged-2B, plus 20 million samples from [LAION-COCO](https://laion.ai/blog/laion-coco/) and 23 million samples from Merged-video, which includes [VideoCC3M](https://github.com/google-research-datasets/videoCC-data), [InternVid](https://huggingface.co/datasets/OpenGVLab/InternVid) and [WebVid-10M](https://maxbain.com/webvid-dataset/). Merged-video was added at the end of the training process.

**Note that all results presented in the paper are evaluated using the PyTorch weights. Performance may differ when using the Hugging Face (hf) models.**

## Zero-Shot Evaluation

We use [CLIP-Benchmark](https://github.com/LAION-AI/CLIP_benchmark) to evaluate the zero-shot performance of EVA-CLIP models. Following [vissl](https://github.com/facebookresearch/vissl/blob/main/extra_scripts/datasets/create_k700_data_files.py), we evaluate zero-shot video classification using a single middle frame per video. Further details on the evaluation datasets can be found in our paper, particularly Table 11.
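
As a rough illustration only (not the CLIP-Benchmark pipeline itself), the sketch below shows how zero-shot top-1 accuracy can be computed with the Hugging Face checkpoint from the Usage section below. CIFAR-10 and the single prompt template are illustrative stand-ins for the 27 benchmarks and the prompt ensembles used in the paper; a GPU is assumed.

```python
# A minimal zero-shot classification sketch, assuming a CUDA device is available.
import torch
from torchvision.datasets import CIFAR10
from transformers import AutoModel, CLIPImageProcessor, CLIPTokenizer

device = "cuda"
model = AutoModel.from_pretrained("BAAI/EVA-CLIP-8B", torch_dtype=torch.float16,
                                  trust_remote_code=True).to(device).eval()
tokenizer = CLIPTokenizer.from_pretrained("BAAI/EVA-CLIP-8B")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Illustrative dataset; the paper's 27 benchmarks are evaluated with CLIP-Benchmark instead.
dataset = CIFAR10(root="./data", train=False, download=True)

# Build the zero-shot classifier: one normalized text embedding per class name.
prompts = [f"a photo of a {c}" for c in dataset.classes]
text_inputs = tokenizer(prompts, return_tensors="pt", padding=True).input_ids.to(device)
with torch.no_grad(), torch.cuda.amp.autocast():
    text_features = model.encode_text(text_inputs)
    text_features /= text_features.norm(dim=-1, keepdim=True)

correct = total = 0
for image, label in dataset:  # images come back as PIL, labels as ints
    pixels = processor(images=image, return_tensors="pt").pixel_values.to(device)
    with torch.no_grad(), torch.cuda.amp.autocast():
        image_features = model.encode_image(pixels)
        image_features /= image_features.norm(dim=-1, keepdim=True)
    pred = (image_features @ text_features.T).argmax(dim=-1).item()
    correct += int(pred == label)
    total += 1

print(f"zero-shot top-1 accuracy: {correct / total:.4f}")
```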

## Usage

### Hugging Face Version

```python
from PIL import Image
from transformers import AutoModel, AutoConfig
from transformers import CLIPImageProcessor, pipeline, CLIPTokenizer
import torch
import torchvision.transforms as T
from torchvision.transforms import InterpolationMode

image_path = "CLIP.png"
model_name_or_path = "BAAI/EVA-CLIP-8B"  # or /path/to/local/EVA-CLIP-8B
image_size = 224

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

# use the image processor with an explicit config
# processor = CLIPImageProcessor(size={"shortest_edge": image_size}, do_center_crop=True, crop_size=image_size)

## you can also build the image preprocessing directly with torchvision
## squash: resize to a (image_size, image_size) square
# processor = T.Compose(
#     [
#         T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
#         T.Resize((image_size, image_size), interpolation=InterpolationMode.BICUBIC),
#         T.ToTensor(),
#         T.Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
#     ]
# )
## shortest: resize the shortest edge, then center-crop
# processor = T.Compose(
#     [
#         T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
#         T.Resize(image_size, interpolation=InterpolationMode.BICUBIC),
#         T.CenterCrop(image_size),
#         T.ToTensor(),
#         T.Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
#     ]
# )

model = AutoModel.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.float16,
    trust_remote_code=True).to('cuda').eval()

image = Image.open(image_path)
captions = ["a diagram", "a dog", "a cat"]
tokenizer = CLIPTokenizer.from_pretrained(model_name_or_path)
input_ids = tokenizer(captions, return_tensors="pt", padding=True).input_ids.to('cuda')
input_pixels = processor(images=image, return_tensors="pt").pixel_values.to('cuda')

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(input_pixels)
    text_features = model.encode_text(input_ids)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

label_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(f"Label probs: {label_probs}")
```
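
For the higher-resolution `EVA-CLIP-8B-448` checkpoint listed above, the same snippet should work with only the checkpoint name and input resolution changed; the values below are assumptions inferred from the model name, so verify them against the [EVA-CLIP-8B-448](https://huggingface.co/BAAI/EVA-CLIP-8B-448) model card.

```python
# assumed settings for the 448px variant (verify against the BAAI/EVA-CLIP-8B-448 card)
model_name_or_path = "BAAI/EVA-CLIP-8B-448"
image_size = 448
processor = CLIPImageProcessor(size={"shortest_edge": image_size},
                               do_center_crop=True, crop_size=image_size)
```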

### PyTorch Version

Go to the [GitHub repository](https://github.com/baaivision/EVA/tree/master/EVA-CLIP-18B) for the `eva_clip` codebase.

```python
import torch
from eva_clip import create_model_and_transforms, get_tokenizer
from PIL import Image

model_name = "EVA-CLIP-8B"
pretrained = "eva_clip"  # or "/path/to/EVA_CLIP_8B_psz14_s9B.pt"

image_path = "CLIP.png"
caption = ["a diagram", "a dog", "a cat"]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, processor = create_model_and_transforms(model_name, pretrained, force_custom_clip=True)
tokenizer = get_tokenizer(model_name)
model = model.to(device)

image = processor(Image.open(image_path)).unsqueeze(0).to(device)
text = tokenizer(caption).to(device)

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```

You can leverage [deepspeed.zero.Init()](https://deepspeed.readthedocs.io/en/stable/zero3.html#constructing-massive-models) with DeepSpeed ZeRO stage 3 if you have limited CPU memory. For loading a pretrained checkpoint in the context of deepspeed.zero.Init(), it is advised to use the `load_zero_partitions()` function in `eva_clip/factory.py`.
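
The sketch below illustrates that pattern rather than a drop-in recipe: it assumes a standard ZeRO stage-3 DeepSpeed config file (the `ds_config.json` path is a placeholder) and the `eva_clip` codebase from the GitHub repository above.

```python
# Sketch only: construct the model under deepspeed.zero.Init() so parameters are
# partitioned across ranks at creation time instead of fully materialized per process.
import deepspeed
from eva_clip import create_model_and_transforms

with deepspeed.zero.Init(config_dict_or_path="ds_config.json"):  # ZeRO stage-3 config (placeholder path)
    # build the architecture without loading a checkpoint at this point
    model, _, processor = create_model_and_transforms("EVA-CLIP-8B", "", force_custom_clip=True)

# Then load the pretrained weights with load_zero_partitions() from eva_clip/factory.py,
# as recommended above, so each rank only receives its own parameter partition.
```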

## BibTeX & Citation

```
@article{EVA-CLIP-18B,
  title={},
  author={},
  journal={arXiv preprint arXiv:},
  year={2024}
}
```