|
|
--- |
|
|
license: mit |
|
|
--- |
|
|
### Regression CLIP - with strong typographic robustness! |
|
|
- Fine-tuned using CLS-Patch Linear Regression teachers |
|
|
- This model: Strong robustness to typographic attacks, good generalization |
|
|
- Check the benchmarks below, or read the [Latent Crossroads paper](https://github.com/zer0int/CLIP-fine-tune/blob/main/docs_regression_clip/Latent-Crossroads-Regression-CLIP-paper-final.pdf)
|
|
|
|
- New full-auto CLIP-fine-tune suite, (almost) config-free & super fast: |
|
|
- Get the code: [github.com/zer0int/CLIP-fine-tune](https://github.com/zer0int/CLIP-fine-tune)
|
|
- Dataset heuristics (will infer dataset from local or HuggingFace automatically) |
|
|
- Loads HuggingFace models, pickles, state dicts / local safetensors, ... |
|
|
- Geometry analysis tools: get human-language answers to 'what went wrong', if it did |
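
Quick start: a minimal zero-shot sketch for this model (the repo id is the one used in the benchmark code further below; the image path and label texts are placeholders):

```
from transformers import CLIPModel, CLIPProcessor
from PIL import Image
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("zer0int/CLIP-Regression-ViT-L-14").to(device).eval()
processor = CLIPProcessor.from_pretrained("zer0int/CLIP-Regression-ViT-L-14")

image = Image.open("example.jpg")  # placeholder: any local image
texts = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True).to(device)

with torch.no_grad():
    outputs = model(**inputs)
# logits_per_image applies CLIP's learned temperature before softmax
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```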
|
|
------- |
|
|
Love ❤️ this CLIP?
|
|
|
|
|
[Buy me a coffee](https://ko-fi.com/zer0int) on Ko-Fi ☕
|
|
<details> |
|
|
<summary>Or click here for address to send 🪙₿ BTC</summary>
|
|
|
|
|
```
3PscBrWYvrutXedLmvpcnQbE12Py8qLqMK
```
|
|
</details> |
|
|
|
|
|
------- |
|
|
|
|
|
 |
|
|
|
|
|
### Standard Benchmark Evaluation
|
|
**regr-norm** = This Model
|
|
|
|
|
#### Zero-Shot (Typographic Attack) |
|
|
|
|
|
| Task / Dataset | Metric | pretrained | **regr-norm** | regr-brut |
|
|
|---|---|---:|---:|---:| |
|
|
| SCAM::NoSCAM | acc | 0.9905 | 0.9897 | 0.9897 | |
|
|
| SCAM::SCAM | acc | 0.4191 | 0.8046 | 0.8830 | |
|
|
| SCAM::SynthSCAM | acc | 0.3227 | 0.8029 | 0.8804 | |
|
|
| RTA100 | acc | 0.4330 | 0.7880 | 0.8930 | |
|
|
|
|
|
|
|
|
<details> |
|
|
<summary>CLICK to reproduce: Expand SCAM typographic attack benchmark code ⚡💻</summary>
|
|
|
|
|
```
from datasets import load_dataset
from transformers import CLIPModel, CLIPProcessor
import torch
from tqdm import tqdm

device = "cuda" if torch.cuda.is_available() else "cpu"

# BLISS / SCAM Typographic Attack Dataset
# https://huggingface.co/datasets/BLISS-e-V/SCAM
ds = load_dataset("BLISS-e-V/SCAM", split="train")

# Benchmark the pre-trained model against my fine-tunes
model_variants = [
    ("OpenAI", "openai/clip-vit-large-patch14-336", "openai/clip-vit-large-patch14-336"),
    ("regr-norm", "zer0int/CLIP-Regression-ViT-L-14", "zer0int/CLIP-Regression-ViT-L-14"),
    ("regr-brut", "zer0int/CLIP-Regression-BRUT-ViT-L-14", "zer0int/CLIP-Regression-BRUT-ViT-L-14"),
]

models = {}
for name, model_path, processor_path in model_variants:
    model = CLIPModel.from_pretrained(model_path).to(device).float().eval()
    processor = CLIPProcessor.from_pretrained(processor_path)
    models[name] = (model, processor)

for variant in ["NoSCAM", "SCAM", "SynthSCAM"]:
    print(f"\n=== Evaluating variant: {variant} ===")
    idxs = [i for i, v in enumerate(ds['id']) if v.startswith(variant)]
    if not idxs:
        print(f"  No samples for {variant}")
        continue
    subset = [ds[i] for i in idxs]

    for model_name, (model, processor) in models.items():
        results = []
        for entry in tqdm(subset, desc=model_name, ncols=30, bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} |"):
            img = entry['image']
            object_label = entry['object_label']
            attack_word = entry['attack_word']

            # Zero-shot choice: true object label vs. the typographic attack word
            texts = [f"a photo of a {object_label}", f"a photo of a {attack_word}"]
            inputs = processor(text=texts, images=img, return_tensors="pt", padding=True)
            for k in inputs:
                if isinstance(inputs[k], torch.Tensor):
                    inputs[k] = inputs[k].to(device)

            with torch.no_grad():
                outputs = model(**inputs)
                image_features = outputs.image_embeds
                text_features = outputs.text_embeds

            # image_embeds / text_embeds are L2-normalized, so this is cosine similarity
            logits = image_features @ text_features.T
            probs = logits.softmax(dim=-1).cpu().numpy().flatten()
            pred_idx = probs.argmax()
            pred_label = [object_label, attack_word][pred_idx]
            is_correct = (pred_label == object_label)

            results.append({
                "id": entry['id'],
                "object_label": object_label,
                "attack_word": attack_word,
                "pred_label": pred_label,
                "is_correct": is_correct,
                "type": entry['type'],
                "model": model_name,
            })

        n_total = len(results)
        n_correct = sum(r['is_correct'] for r in results)
        acc = n_correct / n_total if n_total else float('nan')
        print(f"| > > > > Zero-shot accuracy for {variant}, {model_name}: {n_correct}/{n_total} = {acc:.4f}")
```
|
|
</details> |
|
|
|
|
|
|
|
|
#### Zero-Shot (CLIP Benchmark) |
|
|
|
|
|
| Task / Dataset | Metric | pretrained | **regr-norm** | regr-brut |
|
|
|---|---|---:|---:|---:| |
|
|
| VOC-2007 multilabel | Zero-Shot acc | 0.7615 | 0.8523 | 0.8350 | |
|
|
| ImageNet-1k (train) | Zero-Shot acc@1 | 0.3270 | 0.4566 | 0.4100 | |
|
|
| ImageNet-1k (train) | Zero-Shot acc@5 | 0.5300 | 0.6817 | 0.6513 | |
|
|
| ImageNet-1k (train) | Zero-Shot mean per-class recall | 0.3261 | 0.4547 | 0.4078 | |
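
The acc@1 / acc@5 numbers above come from the CLIP Benchmark suite; as a sketch of what the metric measures, here is an illustrative zero-shot top-k evaluation (the single prompt template, the `image`/`label` field names, and the `zeroshot_acc` helper are assumptions, not the benchmark's actual code):

```
import torch

@torch.no_grad()
def zeroshot_acc(model, processor, dataset, class_names, device="cuda", topk=(1, 5)):
    # One text embedding per class, from a single illustrative prompt template
    prompts = [f"a photo of a {c}" for c in class_names]
    t = processor(text=prompts, return_tensors="pt", padding=True).to(device)
    text_emb = torch.nn.functional.normalize(model.get_text_features(**t), dim=-1)

    hits = {k: 0 for k in topk}
    for sample in dataset:  # assumed fields: 'image' (PIL), 'label' (class index)
        i = processor(images=sample["image"], return_tensors="pt").to(device)
        img_emb = torch.nn.functional.normalize(model.get_image_features(**i), dim=-1)
        ranked = (img_emb @ text_emb.T).squeeze(0).argsort(descending=True)
        for k in topk:
            hits[k] += int(sample["label"] in ranked[:k])
    return {f"acc@{k}": hits[k] / len(dataset) for k in topk}
```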
|
|
|
|
|
#### Retrieval (CLIP Benchmark) |
|
|
|
|
|
| Dataset | Metric | pretrained | **regr-norm** | regr-brut |
|
|
|---|---|---:|---:|---:| |
|
|
| MSCOCO Captions (COCO 2014 val) | image retrieval R@5 | 0.2196 | 0.3510 | 0.3308 | |
|
|
| MSCOCO Captions (COCO 2014 val) | text retrieval R@5 | 0.3032 | 0.5042 | 0.4758 | |
|
|
| XM3600 | image retrieval R@5 | 0.3059 | 0.4254 | 0.4138 | |
|
|
| XM3600 | text retrieval R@5 | 0.2429 | 0.4091 | 0.3874 | |
|
|
|
|
|
#### Retrieval (MSCOCO Captions, COCO 2014 val) β own scripts |
|
|
|
|
|
| Task | Metric | pretrained | **regr-norm** | regr-brut |
|
|
|---|---|---:|---:|---:| |
|
|
| Image-to-Text (I2T) | R@1 | 0.3366 | 0.3748 | 0.3508 | |
|
|
| Image-to-Text (I2T) | R@5 | 0.7882 | 0.8706 | 0.8502 | |
|
|
| Text-to-Image (T2I) | R@1 | 0.2153 | 0.3264 | 0.3184 | |
|
|
| Text-to-Image (T2I) | R@5 | 0.5902 | 0.7851 | 0.7821 | |
|
|
| Text-to-Text (T2T) | R@1 | 0.2064 | 0.2423 | 0.2359 | |
|
|
| Text-to-Text (T2T) | R@5 | 0.5516 | 0.6175 | 0.6130 | |
|
|
| Text-to-Text (T2T_IMG) | R@1 | 0.3120 | 0.3506 | 0.3275 | |
|
|
| Text-to-Text (T2T_IMG) | R@5 | 0.7466 | 0.8386 | 0.8179 | |
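
For reference, R@k counts a query as a hit if a ground-truth match lands in the top-k results ranked by cosine similarity. A minimal sketch, assuming L2-normalized embeddings and one ground-truth gallery item per query (COCO actually has 5 captions per image, which the real scripts handle):

```
import torch

def recall_at_k(query_emb, gallery_emb, gt_index, k=5):
    # query_emb: [Q, D], gallery_emb: [G, D], both L2-normalized
    # gt_index: [Q], ground-truth gallery index per query
    sims = query_emb @ gallery_emb.T                 # cosine similarity [Q, G]
    topk = sims.topk(k, dim=-1).indices              # top-k gallery items per query
    hits = (topk == gt_index.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()

# Toy usage with random embeddings:
q = torch.nn.functional.normalize(torch.randn(8, 768), dim=-1)
g = torch.nn.functional.normalize(torch.randn(100, 768), dim=-1)
gt = torch.randint(0, 100, (8,))
print(recall_at_k(q, g, gt, k=5))
```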
|
|
|
|
|
#### Retrieval (SugarCrepe, COCO 2017 val) β own scripts |
|
|
|
|
|
| Split | Metric | pretrained | **regr-norm** | regr-brut |
|
|
|---|---|---:|---:|---:| |
|
|
| add_obj | acc | 0.7842 | 0.9627 | 0.9515 | |
|
|
| add_att | acc | 0.7168 | 0.9205 | 0.8743 | |
|
|
| replace_obj | acc | 0.9407 | 0.9752 | 0.9740 | |
|
|
| replace_att | acc | 0.7919 | 0.8579 | 0.8388 | |
|
|
| replace_rel | acc | 0.6529 | 0.7752 | 0.7696 | |
|
|
| swap_obj | acc | 0.6041 | 0.7224 | 0.6898 | |
|
|
| swap_att | acc | 0.6261 | 0.7282 | 0.7102 | |
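
SugarCrepe pairs each COCO image with a positive caption and a compositionally perturbed hard negative; accuracy is the fraction of samples where the positive caption gets the higher CLIP score. A minimal sketch (the `pos`/`neg`/`image` field names are illustrative, not the dataset's actual schema):

```
import torch

@torch.no_grad()
def sugarcrepe_acc(model, processor, samples, device="cuda"):
    correct = 0
    for s in samples:  # each sample: an image, a positive and a hard-negative caption
        inputs = processor(text=[s["pos"], s["neg"]], images=s["image"],
                           return_tensors="pt", padding=True).to(device)
        logits = model(**inputs).logits_per_image[0]  # similarity to [pos, neg]
        correct += int(logits[0] > logits[1])         # positive must outscore the negative
    return correct / len(samples)
```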
|
|
|
|
|
#### Linear Probe (ImageNet-1k) β own scripts |
|
|
|
|
|
| Metric | pretrained | **regr-norm** | regr-brut |
|
|
|---|---:|---:|---:| |
|
|
| Linear Probe Top-1 (%) | 72.35 | 70.94 | 65.09 | |
|
|
| Linear Probe Top-5 (%) | 93.42 | 93.29 | 89.60 | |
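
The linear probe follows the standard recipe: freeze CLIP, extract image embeddings, and fit a linear classifier on top. A rough sketch with scikit-learn (hyperparameters and the normalization choice are illustrative; the actual scripts are in the repo linked below):

```
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def extract_features(model, processor, images, device="cuda", batch_size=64):
    # Frozen CLIP image embeddings, L2-normalized, returned as a numpy array
    feats = []
    for i in range(0, len(images), batch_size):
        inputs = processor(images=images[i:i + batch_size], return_tensors="pt").to(device)
        f = model.get_image_features(**inputs)
        feats.append(torch.nn.functional.normalize(f, dim=-1).cpu())
    return torch.cat(feats).numpy()

def linear_probe(train_feats, train_labels, test_feats, test_labels):
    clf = LogisticRegression(max_iter=1000)  # illustrative hyperparameters
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)  # top-1 accuracy
```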
|
|
|
|
|
Note: 'own scripts' are available at [github.com/zer0int/CLIP-fine-tune](https://github.com/zer0int/CLIP-fine-tune)
|
|
|
|
|
------- |
|
|
|
|
|
### 🎯 Special Evaluation
|
|
|
|
|
Please see the paper for more information! |
|
|
|
|
|
#### Zero-Shot Accuracy
|
|
|
|
|
| Dataset (n) | Method | pretrained | **regr-norm** | regr-brut |
|
|
|---|---|---:|---:|---:| |
|
|
| NoSCAM (1162) | CLS | 0.9905 | 0.9897 | 0.9897 | |
|
|
| NoSCAM (1162) | CLS-PATCHSUB | 0.9544 | 0.9845 | 0.9811 | |
|
|
| NoSCAM (1162) | CLS-PATCHREG-I | 0.9466 | 0.9888 | 0.9888 | |
|
|
| NoSCAM (1162) | CLS-PATCHREG-N | 0.9871 | 0.9897 | 0.9888 | |
|
|
| NoSCAM (1162) | REG-L23-NOPC | 0.9380 | 0.9613 | 0.9570 | |
|
|
| NoSCAM (1162) | REG-L23-1PC | 0.9630 | 0.9802 | 0.9802 | |
|
|
| NoSCAM (1162) | REG-L23-8PC | 0.9509 | 0.9664 | 0.9604 | |
|
|
| NoSCAM (1162) | PATCH-L23 | 0.7349 | 0.9725 | 0.9716 | |
|
|
| NoSCAM (1162) | PATCHΔ | 0.9690 | 0.9905 | 0.9888 |
|
|
| SCAM (1162) | CLS | 0.4182 | 0.8038 | 0.8830 | |
|
|
| SCAM (1162) | CLS-PATCHSUB | 0.4957 | 0.8632 | 0.9002 | |
|
|
| SCAM (1162) | CLS-PATCHREG-I | 0.8761 | 0.8537 | 0.9174 | |
|
|
| SCAM (1162) | CLS-PATCHREG-N | 0.9286 | 0.8537 | 0.9165 | |
|
|
| SCAM (1162) | REG-L23-NOPC | 0.7410 | 0.8244 | 0.7719 | |
|
|
| SCAM (1162) | REG-L23-1PC | 0.7539 | 0.8726 | 0.7943 | |
|
|
| SCAM (1162) | REG-L23-8PC | 0.7057 | 0.8038 | 0.7143 | |
|
|
| SCAM (1162) | PATCH-L23 | 0.6024 | 0.7470 | 0.8623 | |
|
|
| SCAM (1162) | PATCHΔ | 0.8778 | 0.8451 | 0.8744 |
|
|
| SynthSCAM (1162) | CLS | 0.3219 | 0.8021 | 0.8804 | |
|
|
| SynthSCAM (1162) | CLS-PATCHSUB | 0.4406 | 0.8580 | 0.9071 | |
|
|
| SynthSCAM (1162) | CLS-PATCHREG-I | 0.8890 | 0.8460 | 0.9200 | |
|
|
| SynthSCAM (1162) | CLS-PATCHREG-N | 0.9449 | 0.8494 | 0.9200 | |
|
|
| SynthSCAM (1162) | REG-L23-NOPC | 0.7823 | 0.8382 | 0.7771 | |
|
|
| SynthSCAM (1162) | REG-L23-1PC | 0.8055 | 0.8812 | 0.8072 | |
|
|
| SynthSCAM (1162) | REG-L23-8PC | 0.7289 | 0.8167 | 0.7126 | |
|
|
| SynthSCAM (1162) | PATCH-L23 | 0.6317 | 0.7470 | 0.8632 | |
|
|
| SynthSCAM (1162) | PATCHΔ | 0.9217 | 0.8614 | 0.8769 |
|
|
| MVT (200382) | CLS | 0.8830 | 0.8730 | 0.8573 | |
|
|
| MVT (200382) | CLS-PATCHSUB | 0.4720 | 0.8246 | 0.8057 | |
|
|
| MVT (200382) | CLS-PATCHREG-I | 0.7166 | 0.8703 | 0.8518 | |
|
|
| MVT (200382) | CLS-PATCHREG-N | 0.5695 | 0.8675 | 0.8478 | |
|
|
| MVT (200382) | REG-L23-NOPC | 0.7640 | 0.7935 | 0.7680 | |
|
|
| MVT (200382) | REG-L23-1PC | 0.7921 | 0.8193 | 0.8032 | |
|
|
| MVT (200382) | REG-L23-8PC | 0.7724 | 0.8057 | 0.7812 | |
|
|
| MVT (200382) | PATCH-L23 | 0.3414 | 0.8652 | 0.8191 | |
|
|
| MVT (200382) | PATCHΔ | 0.6881 | 0.8667 | 0.8510 |