Regression CLIP - with strong typographic robustness!

  • Fine-tuned using CLS-Patch Linear Regression teachers
  • This model (BRUT): even more robust to typographic attacks, at the cost of some generalization (minimal usage sketch below)
  • Check the benchmarks below, or read the 📄 Latent Crossroads paper
  • ➕ New full-auto CLIP fine-tune suite: (almost) config-free and super fast
  • Get the code: 👉 github.com/zer0int/CLIP-fine-tune
  • Dataset heuristics: automatically infers the dataset format, local or on HuggingFace
  • Loads HuggingFace models, pickles, state dicts, local safetensors, ...
  • Geometry analysis tools: get human-language answers to 'what went wrong', if it did
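
A minimal zero-shot usage sketch, assuming the standard transformers CLIP API (the repo ID matches the benchmark code further down; the image path is a placeholder):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("zer0int/CLIP-Regression-BRUT-ViT-L-14").to(device).eval()
processor = CLIPProcessor.from_pretrained("zer0int/CLIP-Regression-BRUT-ViT-L-14")

image = Image.open("example.jpg")  # placeholder: any local image
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    # logits_per_image applies CLIP's learned temperature before softmax
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs.squeeze(0).tolist())))
```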

Love ❤️ this CLIP?

ᐅ Buy me a coffee on Ko-Fi ☕

Or send 🪙₿ BTC:
3PscBrWYvrutXedLmvpcnQbE12Py8qLqMK


📊 Standard Benchmark Evaluation

🌟 = This Model

Zero-Shot (Typographic Attack)

| Task / Dataset | Metric | pretrained | regr-norm | 🌟 regr-brut |
|---|---|---|---|---|
| SCAM::NoSCAM | acc | 0.9905 | 0.9897 | 0.9897 |
| SCAM::SCAM | acc | 0.4191 | 0.8046 | 0.8830 |
| SCAM::SynthSCAM | acc | 0.3227 | 0.8029 | 0.8804 |
| RTA100 | acc | 0.4330 | 0.7880 | 0.8930 |

👉 To reproduce, run the SCAM typographic attack benchmark code below ⚡💻. Each image is scored as a binary choice between "a photo of a {object_label}" and "a photo of a {attack_word}"; accuracy is the fraction of samples where the model still picks the true object.

```python
from datasets import load_dataset
from transformers import CLIPModel, CLIPProcessor
import torch
from tqdm import tqdm

device = "cuda" if torch.cuda.is_available() else "cpu"

# BLISS / SCAM Typographic Attack Dataset
# https://huggingface.co/datasets/BLISS-e-V/SCAM
ds = load_dataset("BLISS-e-V/SCAM", split="train")

# Benchmark the pre-trained model against my fine-tunes
model_variants = [
    ("OpenAI", "openai/clip-vit-large-patch14-336", "openai/clip-vit-large-patch14-336"),
    ("regr-norm", "zer0int/CLIP-Regression-ViT-L-14", "zer0int/CLIP-Regression-ViT-L-14"),
    ("regr-brut", "zer0int/CLIP-Regression-BRUT-ViT-L-14", "zer0int/CLIP-Regression-BRUT-ViT-L-14"),
]

models = {}
for name, model_path, processor_path in model_variants:
    model = CLIPModel.from_pretrained(model_path).to(device).float().eval()
    processor = CLIPProcessor.from_pretrained(processor_path)
    models[name] = (model, processor)

for variant in ["NoSCAM", "SCAM", "SynthSCAM"]:
    print(f"\n=== Evaluating var.: {variant} ===")
    idxs = [i for i, v in enumerate(ds['id']) if v.startswith(variant)]
    if not idxs:
        print(f"  No samples for {variant}")
        continue
    subset = [ds[i] for i in idxs]

    for model_name, (model, processor) in models.items():
        results = []
        for entry in tqdm(subset, desc=f"{model_name}", ncols=30, bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} |"):
            img = entry['image']
            object_label = entry['object_label']
            attack_word = entry['attack_word']

            texts = [f"a photo of a {object_label}", f"a photo of a {attack_word}"]
            inputs = processor(
                text=texts,
                images=img,
                return_tensors="pt",
                padding=True
            )
            for k in inputs:
                if isinstance(inputs[k], torch.Tensor):
                    inputs[k] = inputs[k].to(device)

            with torch.no_grad():
                outputs = model(**inputs)
                image_features = outputs.image_embeds
                text_features = outputs.text_embeds

                # image_embeds / text_embeds are already L2-normalized, so this is cosine similarity
                logits = image_features @ text_features.T
                probs = logits.softmax(dim=-1).cpu().numpy().flatten()
                pred_idx = probs.argmax()
                pred_label = [object_label, attack_word][pred_idx]
                is_correct = (pred_label == object_label)

            results.append({
                "id": entry['id'],
                "object_label": object_label,
                "attack_word": attack_word,
                "pred_label": pred_label,
                "is_correct": is_correct,
                "type": entry['type'],
                "model": model_name
            })

        n_total = len(results)
        n_correct = sum(r['is_correct'] for r in results)
        acc = n_correct / n_total if n_total else float('nan')
        print(f"| > > > > Zero-shot accuracy for {variant}, {model_name}: {n_correct}/{n_total} = {acc:.4f}")

Zero-Shot (CLIP Benchmark)

| Task / Dataset | Metric | pretrained | regr-norm | 🌟 regr-brut |
|---|---|---|---|---|
| VOC-2007 multilabel | Zero-Shot acc | 0.7615 | 0.8523 | 0.8350 |
| ImageNet-1k (train) | Zero-Shot acc@1 | 0.3270 | 0.4566 | 0.4100 |
| ImageNet-1k (train) | Zero-Shot acc@5 | 0.5300 | 0.6817 | 0.6513 |
| ImageNet-1k (train) | Zero-Shot mean per-class recall | 0.3261 | 0.4547 | 0.4078 |
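
For reference, acc@1/acc@5 are top-k zero-shot classification accuracy. A minimal sketch of the metric, assuming a single "a photo of a {class}" prompt per class (hypothetical helper; CLIP_benchmark uses prompt ensembles, so exact numbers will differ):

```python
import torch

@torch.no_grad()
def zero_shot_acc_at_k(model, processor, images, labels, classnames, device, k=5):
    # Encode one prompt per class; a prompt ensemble would average several templates.
    text_in = processor(text=[f"a photo of a {c}" for c in classnames],
                        return_tensors="pt", padding=True).to(device)
    t = model.get_text_features(**text_in)
    t = t / t.norm(dim=-1, keepdim=True)

    img_in = processor(images=images, return_tensors="pt").to(device)
    v = model.get_image_features(**img_in)
    v = v / v.norm(dim=-1, keepdim=True)

    topk = (v @ t.T).topk(k, dim=-1).indices  # [n_images, k] predicted class indices
    labels = torch.as_tensor(labels, device=device)
    return (topk == labels.unsqueeze(-1)).any(-1).float().mean().item()
```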

Retrieval (CLIP Benchmark)

| Dataset | Metric | pretrained | regr-norm | 🌟 regr-brut |
|---|---|---|---|---|
| MSCOCO Captions (COCO 2014 val) | image retrieval R@5 | 0.2196 | 0.3510 | 0.3308 |
| MSCOCO Captions (COCO 2014 val) | text retrieval R@5 | 0.3032 | 0.5042 | 0.4758 |
| XM3600 | image retrieval R@5 | 0.3059 | 0.4254 | 0.4138 |
| XM3600 | text retrieval R@5 | 0.2429 | 0.4091 | 0.3874 |

Retrieval (MSCOCO Captions, COCO 2014 val) – own scripts

| Task | Metric | pretrained | regr-norm | 🌟 regr-brut |
|---|---|---|---|---|
| Image-to-Text (I2T) | R@1 | 0.3366 | 0.3748 | 0.3508 |
| Image-to-Text (I2T) | R@5 | 0.7882 | 0.8706 | 0.8502 |
| Text-to-Image (T2I) | R@1 | 0.2153 | 0.3264 | 0.3184 |
| Text-to-Image (T2I) | R@5 | 0.5902 | 0.7851 | 0.7821 |
| Text-to-Text (T2T) | R@1 | 0.2064 | 0.2423 | 0.2359 |
| Text-to-Text (T2T) | R@5 | 0.5516 | 0.6175 | 0.6130 |
| Text-to-Text (T2T_IMG) | R@1 | 0.3120 | 0.3506 | 0.3275 |
| Text-to-Text (T2T_IMG) | R@5 | 0.7466 | 0.8386 | 0.8179 |
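
The 'own scripts' are in the repo linked below; for the metric itself, here is a minimal Recall@K sketch over a query-candidate similarity matrix (toy data; for COCO's five captions per image, a hit is typically counted if any ground-truth caption lands in the top-k):

```python
import torch

def recall_at_k(sim: torch.Tensor, gt: torch.Tensor, k: int) -> float:
    """sim: [n_queries, n_candidates] similarities (e.g. image_embeds @ text_embeds.T);
    gt: [n_queries] index of the correct candidate for each query."""
    topk = sim.topk(k, dim=-1).indices             # [n_queries, k]
    hits = (topk == gt.unsqueeze(-1)).any(dim=-1)  # correct candidate retrieved?
    return hits.float().mean().item()

# Toy example: 100 queries, 100 candidates, ground truth on the diagonal.
sim = torch.randn(100, 100)
gt = torch.arange(100)
print(recall_at_k(sim, gt, k=1), recall_at_k(sim, gt, k=5))
```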

Retrieval (SugarCrepe, COCO 2017 val) – own scripts

| Split | Metric | pretrained | regr-norm | 🌟 regr-brut |
|---|---|---|---|---|
| add_obj | acc | 0.7842 | 0.9627 | 0.9515 |
| add_att | acc | 0.7168 | 0.9205 | 0.8743 |
| replace_obj | acc | 0.9407 | 0.9752 | 0.9740 |
| replace_att | acc | 0.7919 | 0.8579 | 0.8388 |
| replace_rel | acc | 0.6529 | 0.7752 | 0.7696 |
| swap_obj | acc | 0.6041 | 0.7224 | 0.6898 |
| swap_att | acc | 0.6261 | 0.7282 | 0.7102 |
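
SugarCrepe pairs each image with a correct caption and a hard-negative caption; a sample counts as correct when the model scores the positive caption higher. Not the actual 'own scripts', just a minimal sketch of that standard protocol:

```python
import torch

@torch.no_grad()
def sugarcrepe_accuracy(model, processor, samples, device):
    """samples: iterable of (PIL image, positive_caption, negative_caption) tuples."""
    correct = total = 0
    for image, pos, neg in samples:
        inputs = processor(text=[pos, neg], images=image,
                           return_tensors="pt", padding=True).to(device)
        logits = model(**inputs).logits_per_image.squeeze(0)  # [2] similarities
        correct += int(logits[0] > logits[1])                 # positive caption wins?
        total += 1
    return correct / total
```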

Linear Probe (ImageNet-1k) – own scripts

| Metric | pretrained | regr-norm | 🌟 regr-brut |
|---|---|---|---|
| Linear Probe Top-1 (%) | 72.35 | 70.94 | 65.09 |
| Linear Probe Top-5 (%) | 93.42 | 93.29 | 89.60 |
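
The linear probe fits a logistic-regression classifier on frozen image embeddings. A minimal sketch, assuming pre-loaded PIL images and labels (the C value follows OpenAI's CLIP linear-probe example; the actual 'own scripts' may differ):

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("zer0int/CLIP-Regression-BRUT-ViT-L-14").to(device).eval()
processor = CLIPProcessor.from_pretrained("zer0int/CLIP-Regression-BRUT-ViT-L-14")

@torch.no_grad()
def extract_features(images):
    """Frozen, L2-normalized image embeddings for a list of PIL images."""
    inputs = processor(images=images, return_tensors="pt").to(device)
    feats = model.get_image_features(**inputs)
    return (feats / feats.norm(dim=-1, keepdim=True)).cpu().numpy()

clf = LogisticRegression(C=0.316, max_iter=1000)
# clf.fit(extract_features(train_images), train_labels)      # train_images/labels: your ImageNet split
# print(clf.score(extract_features(val_images), val_labels)) # top-1 accuracy
```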

🔗 Note: 'own scripts' are available at github.com/zer0int/CLIP-fine-tune


🎯 Special Evaluation

Please see the paper for more information!

Zero-Shot Accuracy

| Dataset (n) | Method | pretrained | regr-norm | 🌟 regr-brut |
|---|---|---|---|---|
| NoSCAM (1162) | CLS | 0.9905 | 0.9897 | 0.9897 |
| NoSCAM (1162) | CLS-PATCHSUB | 0.9544 | 0.9845 | 0.9811 |
| NoSCAM (1162) | CLS-PATCHREG-I | 0.9466 | 0.9888 | 0.9888 |
| NoSCAM (1162) | CLS-PATCHREG-N | 0.9871 | 0.9897 | 0.9888 |
| NoSCAM (1162) | REG-L23-NOPC | 0.9380 | 0.9613 | 0.9570 |
| NoSCAM (1162) | REG-L23-1PC | 0.9630 | 0.9802 | 0.9802 |
| NoSCAM (1162) | REG-L23-8PC | 0.9509 | 0.9664 | 0.9604 |
| NoSCAM (1162) | PATCH-L23 | 0.7349 | 0.9725 | 0.9716 |
| NoSCAM (1162) | PATCHΔ | 0.9690 | 0.9905 | 0.9888 |
| SCAM (1162) | CLS | 0.4182 | 0.8038 | 0.8830 |
| SCAM (1162) | CLS-PATCHSUB | 0.4957 | 0.8632 | 0.9002 |
| SCAM (1162) | CLS-PATCHREG-I | 0.8761 | 0.8537 | 0.9174 |
| SCAM (1162) | CLS-PATCHREG-N | 0.9286 | 0.8537 | 0.9165 |
| SCAM (1162) | REG-L23-NOPC | 0.7410 | 0.8244 | 0.7719 |
| SCAM (1162) | REG-L23-1PC | 0.7539 | 0.8726 | 0.7943 |
| SCAM (1162) | REG-L23-8PC | 0.7057 | 0.8038 | 0.7143 |
| SCAM (1162) | PATCH-L23 | 0.6024 | 0.7470 | 0.8623 |
| SCAM (1162) | PATCHΔ | 0.8778 | 0.8451 | 0.8744 |
| SynthSCAM (1162) | CLS | 0.3219 | 0.8021 | 0.8804 |
| SynthSCAM (1162) | CLS-PATCHSUB | 0.4406 | 0.8580 | 0.9071 |
| SynthSCAM (1162) | CLS-PATCHREG-I | 0.8890 | 0.8460 | 0.9200 |
| SynthSCAM (1162) | CLS-PATCHREG-N | 0.9449 | 0.8494 | 0.9200 |
| SynthSCAM (1162) | REG-L23-NOPC | 0.7823 | 0.8382 | 0.7771 |
| SynthSCAM (1162) | REG-L23-1PC | 0.8055 | 0.8812 | 0.8072 |
| SynthSCAM (1162) | REG-L23-8PC | 0.7289 | 0.8167 | 0.7126 |
| SynthSCAM (1162) | PATCH-L23 | 0.6317 | 0.7470 | 0.8632 |
| SynthSCAM (1162) | PATCHΔ | 0.9217 | 0.8614 | 0.8769 |
| MVT (200382) | CLS | 0.8830 | 0.8730 | 0.8573 |
| MVT (200382) | CLS-PATCHSUB | 0.4720 | 0.8246 | 0.8057 |
| MVT (200382) | CLS-PATCHREG-I | 0.7166 | 0.8703 | 0.8518 |
| MVT (200382) | CLS-PATCHREG-N | 0.5695 | 0.8675 | 0.8478 |
| MVT (200382) | REG-L23-NOPC | 0.7640 | 0.7935 | 0.7680 |
| MVT (200382) | REG-L23-1PC | 0.7921 | 0.8193 | 0.8032 |
| MVT (200382) | REG-L23-8PC | 0.7724 | 0.8057 | 0.7812 |
| MVT (200382) | PATCH-L23 | 0.3414 | 0.8652 | 0.8191 |
| MVT (200382) | PATCHΔ | 0.6881 | 0.8667 | 0.8510 |