---
license: mit
---
### Regression CLIP - with strong typographic robustness!
- Fine-tuned using CLS-Patch Linear Regression teachers
- This model: strong robustness to typographic attacks, good generalization
- Check the benchmarks below - or read the [Latent Crossroads paper](https://github.com/zer0int/CLIP-fine-tune/blob/main/docs_regression_clip/Latent-Crossroads-Regression-CLIP-paper-final.pdf)

New full-auto CLIP fine-tune suite, (almost) config-free & super fast:
- Get the code: [github.com/zer0int/CLIP-fine-tune](https://github.com/zer0int/CLIP-fine-tune)
- Dataset heuristics (infers the dataset automatically, from local files or HuggingFace)
- Loads HuggingFace models, pickles, state dicts, local safetensors, ...
- Geometry analysis tools: human-language answers to 'what went wrong', if something did
-------
Love ❤️ this CLIP?

[Buy me a coffee](https://ko-fi.com/zer0int) on Ko-Fi ☕
<details>
<summary>Or click here for an address to send 🪙₿ BTC</summary>

```
3PscBrWYvrutXedLmvpcnQbE12Py8qLqMK
```

</details>
-------

### 📊 Standard Benchmark Evaluation
⭐ = This Model
#### Zero-Shot (Typographic Attack)
| Task / Dataset | Metric | pretrained | ⭐ regr-norm | regr-brut |
|---|---|---:|---:|---:|
| SCAM::NoSCAM | acc | 0.9905 | 0.9897 | 0.9897 |
| SCAM::SCAM | acc | 0.4191 | 0.8046 | 0.8830 |
| SCAM::SynthSCAM | acc | 0.3227 | 0.8029 | 0.8804 |
| RTA100 | acc | 0.4330 | 0.7880 | 0.8930 |
<details>
<summary>CLICK to reproduce: expand the SCAM typographic attack benchmark code ⚡💻</summary>

```python
from datasets import load_dataset
from transformers import CLIPModel, CLIPProcessor
import torch
from tqdm import tqdm

device = "cuda" if torch.cuda.is_available() else "cpu"

# BLISS / SCAM typographic attack dataset
# https://huggingface.co/datasets/BLISS-e-V/SCAM
ds = load_dataset("BLISS-e-V/SCAM", split="train")

# Benchmark the pre-trained model against the fine-tunes
model_variants = [
    ("OpenAI", "openai/clip-vit-large-patch14-336", "openai/clip-vit-large-patch14-336"),
    ("regr-norm", "zer0int/CLIP-Regression-ViT-L-14", "zer0int/CLIP-Regression-ViT-L-14"),
    ("regr-brut", "zer0int/CLIP-Regression-BRUT-ViT-L-14", "zer0int/CLIP-Regression-BRUT-ViT-L-14"),
]

models = {}
for name, model_path, processor_path in model_variants:
    model = CLIPModel.from_pretrained(model_path).to(device).float()
    processor = CLIPProcessor.from_pretrained(processor_path)
    models[name] = (model, processor)

for variant in ["NoSCAM", "SCAM", "SynthSCAM"]:
    print(f"\n=== Evaluating variant: {variant} ===")
    idxs = [i for i, v in enumerate(ds['id']) if v.startswith(variant)]
    if not idxs:
        print(f"  No samples for {variant}")
        continue
    subset = [ds[i] for i in idxs]
    for model_name, (model, processor) in models.items():
        results = []
        for entry in tqdm(subset, desc=f"{model_name}", ncols=30, bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} |"):
            img = entry['image']
            object_label = entry['object_label']
            attack_word = entry['attack_word']
            # Binary zero-shot choice: the true object label vs. the attack word
            texts = [f"a photo of a {object_label}", f"a photo of a {attack_word}"]
            inputs = processor(text=texts, images=img, return_tensors="pt", padding=True)
            inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
            with torch.no_grad():
                outputs = model(**inputs)
            logits = outputs.image_embeds @ outputs.text_embeds.T
            probs = logits.softmax(dim=-1).cpu().numpy().flatten()
            pred_label = [object_label, attack_word][probs.argmax()]
            results.append({
                "id": entry['id'],
                "object_label": object_label,
                "attack_word": attack_word,
                "pred_label": pred_label,
                "is_correct": pred_label == object_label,
                "type": entry['type'],
                "model": model_name,
            })
        n_total = len(results)
        n_correct = sum(r['is_correct'] for r in results)
        acc = n_correct / n_total if n_total else float('nan')
        print(f"Zero-shot accuracy for {variant}, {model_name}: {n_correct}/{n_total} = {acc:.4f}")
```
</details>
#### Zero-Shot (CLIP Benchmark)
| Task / Dataset | Metric | pretrained | ⭐ regr-norm | regr-brut |
|---|---|---:|---:|---:|
| VOC-2007 multilabel | Zero-Shot acc | 0.7615 | 0.8523 | 0.8350 |
| ImageNet-1k (train) | Zero-Shot acc@1 | 0.3270 | 0.4566 | 0.4100 |
| ImageNet-1k (train) | Zero-Shot acc@5 | 0.5300 | 0.6817 | 0.6513 |
| ImageNet-1k (train) | Zero-Shot mean per-class recall | 0.3261 | 0.4547 | 0.4078 |
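The zero-shot acc@1 / acc@5 numbers above count a sample as correct when the true class ranks first (or within the top five) among the similarities to all class prompts. A minimal, self-contained sketch of that metric, with hypothetical similarity scores standing in for real CLIP embeddings:

```python
import torch

def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int) -> float:
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    topk = logits.topk(k, dim=-1).indices               # (N, k) best class indices
    hits = (topk == labels.unsqueeze(-1)).any(dim=-1)   # (N,) true class in top-k?
    return hits.float().mean().item()

# Hypothetical image-vs-class-prompt similarity scores: 4 images, 6 classes
logits = torch.tensor([
    [0.9, 0.1, 0.0, 0.2, 0.1, 0.0],   # true class 0: rank 1
    [0.2, 0.8, 0.1, 0.0, 0.0, 0.0],   # true class 1: rank 1
    [0.5, 0.4, 0.3, 0.2, 0.1, 0.05],  # true class 5: rank 6, missed even @5
    [0.6, 0.5, 0.4, 0.1, 0.05, 0.0],  # true class 2: rank 3, hit @5 but not @1
])
labels = torch.tensor([0, 1, 5, 2])
print(topk_accuracy(logits, labels, 1))  # 0.5
print(topk_accuracy(logits, labels, 5))  # 0.75
```

By construction acc@5 can never be lower than acc@1, which the table rows above also reflect.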
#### Retrieval (CLIP Benchmark)
| Dataset | Metric | pretrained | ⭐ regr-norm | regr-brut |
|---|---|---:|---:|---:|
| MSCOCO Captions (COCO 2014 val) | image retrieval R@5 | 0.2196 | 0.3510 | 0.3308 |
| MSCOCO Captions (COCO 2014 val) | text retrieval R@5 | 0.3032 | 0.5042 | 0.4758 |
| XM3600 | image retrieval R@5 | 0.3059 | 0.4254 | 0.4138 |
| XM3600 | text retrieval R@5 | 0.2429 | 0.4091 | 0.3874 |
#### Retrieval (MSCOCO Captions, COCO 2014 val) - own scripts
| Task | Metric | pretrained | ⭐ regr-norm | regr-brut |
|---|---|---:|---:|---:|
| Image-to-Text (I2T) | R@1 | 0.3366 | 0.3748 | 0.3508 |
| Image-to-Text (I2T) | R@5 | 0.7882 | 0.8706 | 0.8502 |
| Text-to-Image (T2I) | R@1 | 0.2153 | 0.3264 | 0.3184 |
| Text-to-Image (T2I) | R@5 | 0.5902 | 0.7851 | 0.7821 |
| Text-to-Text (T2T) | R@1 | 0.2064 | 0.2423 | 0.2359 |
| Text-to-Text (T2T) | R@5 | 0.5516 | 0.6175 | 0.6130 |
| Text-to-Text (T2T_IMG) | R@1 | 0.3120 | 0.3506 | 0.3275 |
| Text-to-Text (T2T_IMG) | R@5 | 0.7466 | 0.8386 | 0.8179 |
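In all of the retrieval tables, R@K is the fraction of queries whose ground-truth match lands in the top K results ranked by cosine similarity. A minimal sketch of that computation (the `recall_at_k` helper and the random paired features are illustrative, not one of the 'own scripts'):

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_feats, gallery_feats, gt_index, k):
    """R@K: fraction of queries whose ground-truth gallery item appears in the
    top-k results by cosine similarity (features assumed L2-normalized)."""
    sims = query_feats @ gallery_feats.T                 # (Q, G) cosine similarities
    topk = sims.topk(k, dim=-1).indices                  # (Q, k) best gallery indices
    hits = (topk == gt_index.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()

# Dummy paired features standing in for CLIP image/text embeddings:
# gallery item i is a noisy copy of query i, so query i should retrieve it.
torch.manual_seed(0)
q = F.normalize(torch.randn(8, 16), dim=-1)
g = F.normalize(q + 0.1 * torch.randn(8, 16), dim=-1)
gt = torch.arange(8)
print(recall_at_k(q, g, gt, 1), recall_at_k(q, g, gt, 5))
```

For I2T, queries are image features and the gallery is caption features; for T2I the roles are swapped.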
#### Retrieval (SugarCrepe, COCO 2017 val) - own scripts
| Split | Metric | pretrained | ⭐ regr-norm | regr-brut |
|---|---|---:|---:|---:|
| add_obj | acc | 0.7842 | 0.9627 | 0.9515 |
| add_att | acc | 0.7168 | 0.9205 | 0.8743 |
| replace_obj | acc | 0.9407 | 0.9752 | 0.9740 |
| replace_att | acc | 0.7919 | 0.8579 | 0.8388 |
| replace_rel | acc | 0.6529 | 0.7752 | 0.7696 |
| swap_obj | acc | 0.6041 | 0.7224 | 0.6898 |
| swap_att | acc | 0.6261 | 0.7282 | 0.7102 |
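SugarCrepe pairs each image with a positive caption and a minimally edited hard negative (an added, replaced, or swapped object/attribute/relation, matching the split names above); accuracy is how often the model scores the positive caption higher. A minimal sketch of that decision rule, with hypothetical precomputed similarity scores standing in for real CLIP outputs:

```python
def sugarcrepe_accuracy(pos_scores, neg_scores):
    """Fraction of (image, positive, hard-negative) triplets where the
    positive caption out-scores the hard-negative caption."""
    assert len(pos_scores) == len(neg_scores)
    wins = sum(p > n for p, n in zip(pos_scores, neg_scores))
    return wins / len(pos_scores)

# Hypothetical image-text similarity scores for 5 examples
pos = [0.31, 0.28, 0.35, 0.22, 0.30]   # image vs. correct caption
neg = [0.27, 0.29, 0.30, 0.21, 0.33]   # image vs. hard-negative caption
print(sugarcrepe_accuracy(pos, neg))   # 0.6: the positive wins in 3 of 5 cases
```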
#### Linear Probe (ImageNet-1k) - own scripts
| Metric | pretrained | ⭐ regr-norm | regr-brut |
|---|---:|---:|---:|
| Linear Probe Top-1 (%) | 72.35 | 70.94 | 65.09 |
| Linear Probe Top-5 (%) | 93.42 | 93.29 | 89.60 |
Note: the 'own scripts' are available at [github.com/zer0int/CLIP-fine-tune](https://github.com/zer0int/CLIP-fine-tune)
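The linear probe above fits a linear classifier on frozen image embeddings, so it measures feature quality with the backbone untouched. A minimal sketch of the protocol, with synthetic clustered features standing in for CLIP embeddings and scikit-learn's LogisticRegression assumed as the probe (the actual scripts may differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_classes, dim = 4, 32

# Synthetic "frozen embeddings": one Gaussian cluster per class
centers = rng.normal(size=(n_classes, dim))
labels = rng.integers(0, n_classes, size=400)
feats = centers[labels] + 0.5 * rng.normal(size=(400, dim))

train_x, test_x = feats[:300], feats[300:]
train_y, test_y = labels[:300], labels[300:]

# The probe: a linear classifier trained on frozen features (backbone never updated)
probe = LogisticRegression(max_iter=1000).fit(train_x, train_y)
top1 = probe.score(test_x, test_y)
print(f"probe top-1: {top1:.4f}")
```

On real CLIP features, `feats` would come from the image encoder's (projected) embeddings over ImageNet-1k.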
-------
### 🎯 Special Evaluation
Please see the paper for more information!
#### Zero-Shot Accuracy
| Dataset (n) | Method | pretrained | ⭐ regr-norm | regr-brut |
|---|---|---:|---:|---:|
| NoSCAM (1162) | CLS | 0.9905 | 0.9897 | 0.9897 |
| NoSCAM (1162) | CLS-PATCHSUB | 0.9544 | 0.9845 | 0.9811 |
| NoSCAM (1162) | CLS-PATCHREG-I | 0.9466 | 0.9888 | 0.9888 |
| NoSCAM (1162) | CLS-PATCHREG-N | 0.9871 | 0.9897 | 0.9888 |
| NoSCAM (1162) | REG-L23-NOPC | 0.9380 | 0.9613 | 0.9570 |
| NoSCAM (1162) | REG-L23-1PC | 0.9630 | 0.9802 | 0.9802 |
| NoSCAM (1162) | REG-L23-8PC | 0.9509 | 0.9664 | 0.9604 |
| NoSCAM (1162) | PATCH-L23 | 0.7349 | 0.9725 | 0.9716 |
| NoSCAM (1162) | PATCHΔ | 0.9690 | 0.9905 | 0.9888 |
| SCAM (1162) | CLS | 0.4182 | 0.8038 | 0.8830 |
| SCAM (1162) | CLS-PATCHSUB | 0.4957 | 0.8632 | 0.9002 |
| SCAM (1162) | CLS-PATCHREG-I | 0.8761 | 0.8537 | 0.9174 |
| SCAM (1162) | CLS-PATCHREG-N | 0.9286 | 0.8537 | 0.9165 |
| SCAM (1162) | REG-L23-NOPC | 0.7410 | 0.8244 | 0.7719 |
| SCAM (1162) | REG-L23-1PC | 0.7539 | 0.8726 | 0.7943 |
| SCAM (1162) | REG-L23-8PC | 0.7057 | 0.8038 | 0.7143 |
| SCAM (1162) | PATCH-L23 | 0.6024 | 0.7470 | 0.8623 |
| SCAM (1162) | PATCHΔ | 0.8778 | 0.8451 | 0.8744 |
| SynthSCAM (1162) | CLS | 0.3219 | 0.8021 | 0.8804 |
| SynthSCAM (1162) | CLS-PATCHSUB | 0.4406 | 0.8580 | 0.9071 |
| SynthSCAM (1162) | CLS-PATCHREG-I | 0.8890 | 0.8460 | 0.9200 |
| SynthSCAM (1162) | CLS-PATCHREG-N | 0.9449 | 0.8494 | 0.9200 |
| SynthSCAM (1162) | REG-L23-NOPC | 0.7823 | 0.8382 | 0.7771 |
| SynthSCAM (1162) | REG-L23-1PC | 0.8055 | 0.8812 | 0.8072 |
| SynthSCAM (1162) | REG-L23-8PC | 0.7289 | 0.8167 | 0.7126 |
| SynthSCAM (1162) | PATCH-L23 | 0.6317 | 0.7470 | 0.8632 |
| SynthSCAM (1162) | PATCHΔ | 0.9217 | 0.8614 | 0.8769 |
| MVT (200382) | CLS | 0.8830 | 0.8730 | 0.8573 |
| MVT (200382) | CLS-PATCHSUB | 0.4720 | 0.8246 | 0.8057 |
| MVT (200382) | CLS-PATCHREG-I | 0.7166 | 0.8703 | 0.8518 |
| MVT (200382) | CLS-PATCHREG-N | 0.5695 | 0.8675 | 0.8478 |
| MVT (200382) | REG-L23-NOPC | 0.7640 | 0.7935 | 0.7680 |
| MVT (200382) | REG-L23-1PC | 0.7921 | 0.8193 | 0.8032 |
| MVT (200382) | REG-L23-8PC | 0.7724 | 0.8057 | 0.7812 |
| MVT (200382) | PATCH-L23 | 0.3414 | 0.8652 | 0.8191 |
| MVT (200382) | PATCHΔ | 0.6881 | 0.8667 | 0.8510 |