Commit 2a7826b by dikdimon (verified) · 1 Parent(s): bee34e8

Upload README.md using SD-Hub

Files changed (1)
  1. README.md +168 -0
README.md ADDED
@@ -0,0 +1,168 @@
+ ---
+ license: mit
+ datasets:
+ - zer0int/CLIP-KO-Adversarial-Train-Typo-Attack
+ - SPRIGHT-T2I/spright_coco
+ base_model:
+ - openai/clip-vit-large-patch14
+ ---
+ # CLIP-KO: Knocking Out Typographic Attacks in CLIP 💪🤖
+ ### Finally, a CLIP without a 'text obsession'! 🤗
+ ❤️ this CLIP? [Donate](https://ko-fi.com/zer0int) if you can / want. TY!
+
+ # 🌱 CLIP-KO-LITE is slightly less robust, but the Text Encoder won't produce out-of-distribution (OOD) embeddings.
+ - 📝 Read the [paper](https://github.com/zer0int/CLIP-fine-tune/blob/CLIP-vision/KO-CLIP-teaser/KO-CLIP-paper-final.pdf) (PDF) here.
+ - If you're looking for a Text Encoder, you'll probably want these:
+   - 🖼️ Download [The Text Encoder for generative AI](https://huggingface.co/zer0int/CLIP-KO-LITE-TypoAttack-Attn-Dropout-ViT-L-14/resolve/main/ViT-L-14-KO-LITE-HuggingFace-TE-only.safetensors?download=true)
+   - 🖼️ Download an [alternative Text Encoder without Adversarial Training](https://huggingface.co/zer0int/CLIP-KO-LITE-TypoAttack-Attn-Dropout-ViT-L-14/resolve/main/ViT-L-14-KO___NO-ADV___HF-TE-only.safetensors?download=true)
+ - 🤓 Wanna fine-tune yourself? Get the [code](https://github.com/zer0int/CLIP-fine-tune) on my GitHub.
+   - Included: Code for fine-tuning and all benchmarks / claims (as per the paper)
+
+ ## 👉 Check out the [KO variant](https://huggingface.co/zer0int/CLIP-KO-TypoAttack-Attn-Dropout-ViT-L-14) of this model (strict)
+
+ ----
+ <details>
+ <summary>👉 CLICK ME to expand example benchmark code ⚡💻</summary>
+
+ ```python
+ from datasets import load_dataset
+ from transformers import CLIPModel, CLIPProcessor
+ import torch
+ from PIL import Image
+ from tqdm import tqdm
+ import pandas as pd
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ # BLISS / SCAM Typographic Attack Dataset
+ # https://huggingface.co/datasets/BLISS-e-V/SCAM
+ ds = load_dataset("BLISS-e-V/SCAM", split="train")
+
+ # Benchmark pre-trained model against my fine-tune
+ # Each entry: (display name, model repo, processor repo)
+ model_variants = [
+     ("OpenAI ", "openai/clip-vit-large-patch14", "openai/clip-vit-large-patch14"),
+     ("KO-CLIP", "zer0int/CLIP-KO-LITE-TypoAttack-Attn-Dropout-ViT-L-14", "zer0int/CLIP-KO-LITE-TypoAttack-Attn-Dropout-ViT-L-14"),
+ ]
+
+ # Load every model once, up front
+ models = {}
+ for name, model_path, processor_path in model_variants:
+     model = CLIPModel.from_pretrained(model_path).to(device).float()
+     processor = CLIPProcessor.from_pretrained(processor_path)
+     models[name] = (model, processor)
+
+ for variant in ["NoSCAM", "SCAM", "SynthSCAM"]:
+     print(f"\n=== Evaluating var.: {variant} ===")
+     # Sample ids are prefixed with the variant name
+     idxs = [i for i, v in enumerate(ds['id']) if v.startswith(variant)]
+     if not idxs:
+         print(f"  No samples for {variant}")
+         continue
+     subset = [ds[i] for i in idxs]
+
+     for model_name, (model, processor) in models.items():
+         results = []
+         for entry in tqdm(subset, desc=f"{model_name}", ncols=30, bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} |"):
+             img = entry['image']
+             object_label = entry['object_label']
+             attack_word = entry['attack_word']
+
+             # Two-way choice: the true object vs. the word written in the image
+             texts = [f"a photo of a {object_label}", f"a photo of a {attack_word}"]
+             inputs = processor(
+                 text=texts,
+                 images=img,
+                 return_tensors="pt",
+                 padding=True
+             )
+             for k in inputs:
+                 if isinstance(inputs[k], torch.Tensor):
+                     inputs[k] = inputs[k].to(device)
+
+             with torch.no_grad():
+                 outputs = model(**inputs)
+                 image_features = outputs.image_embeds
+                 text_features = outputs.text_embeds
+
+                 # The embeds are L2-normalized, so this is cosine similarity
+                 logits = image_features @ text_features.T
+                 probs = logits.softmax(dim=-1).cpu().numpy().flatten()
+                 pred_idx = probs.argmax()
+                 pred_label = [object_label, attack_word][pred_idx]
+                 is_correct = (pred_label == object_label)
+
+             results.append({
+                 "id": entry['id'],
+                 "object_label": object_label,
+                 "attack_word": attack_word,
+                 "pred_label": pred_label,
+                 "is_correct": is_correct,
+                 "type": entry['type'],
+                 "model": model_name
+             })
+
+         # Accuracy for this (variant, model) pair
+         n_total = len(results)
+         n_correct = sum(r['is_correct'] for r in results)
+         acc = n_correct / n_total if n_total else float('nan')
+         print(f"| > > > > Zero-shot accuracy for {variant}, {model_name}: {n_correct}/{n_total} = {acc:.4f}")
+ ```
+ </details>
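The decision rule inside the benchmark loop above can be tried without downloading a model. What follows is a minimal sketch in plain Python, not the author's code: the helper names (`l2_normalize`, `softmax`, `two_way_zero_shot`), the three-dimensional toy vectors, and the labels are all hypothetical stand-ins for real CLIP embeddings. It only illustrates the idea of picking whichever prompt has the higher cosine similarity to the image.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def two_way_zero_shot(image_vec, text_vecs, labels):
    """Return the label whose text vector is most similar to the image vector."""
    img = l2_normalize(image_vec)
    sims = [sum(a * b for a, b in zip(img, l2_normalize(t))) for t in text_vecs]
    probs = softmax(sims)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs

# Hypothetical toy embeddings: the image lies close to the first prompt.
image_vec = [0.9, 0.1, 0.2]
text_vecs = [[1.0, 0.0, 0.1],   # stands in for "a photo of a {object_label}"
             [0.1, 1.0, 0.0]]   # stands in for "a photo of a {attack_word}"
labels = ["clock", "taxi"]

pred, probs = two_way_zero_shot(image_vec, text_vecs, labels)
print(pred, [round(p, 3) for p in probs])
```

A typographic attack succeeds when the written word pulls the image embedding toward the second prompt, which flips the argmax. Because softmax is monotonic, the prediction depends only on which similarity is larger, so the missing logit scale does not change the accuracy numbers.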
+
+ ----
+ # Typographic Attack / adversarial robustness:
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6490359a877fc29cb1b09451/GXaTlisvGQ5Gifxlkw9Es.png)
+ ---------
+ ## Attention Heatmaps without artifacts:
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6490359a877fc29cb1b09451/VDhxpJFE1MqHtSt7d5qUI.png)
+
+ ---------
+ ## ALL: Flux.1-dev, NO T5 - CLIP only! CFG=5, Heun, fixed seed. Prompts, in order:
+
+ 1. "bumblewordoooooooo bumblefeelmbles blbeinbumbleghue" (weird CLIP words / text obsession / prompt injection)
+ 2. "a photo of a disintegrimpressionism rag hermit" (one weird CLIP word only)
+ 3. "a photo of a breakfast table with a highly detailed iridescent mandelbrot sitting on a plate that says 'maths for life!'" (note: "mandelbrot" literally means "almond bread" in German)
+ 4. "mathematflake tessswirl psychedsphere zanziflake aluminmathematdeeply mathematzanzirender methylmathematrender detailed mandelmicroscopy mathematfluctucarved iridescent mandelsurface mandeltrippy mandelhallucinpossessed pbr" (Complete CLIP gibberish math rant)
+ 5. "spiderman in the moshpit, berlin fashion, wearing punk clothing, they are fighting very angry" (CLIP Interrogator / BLIP)
+ 6. "epstein mattypixelart crying epilepsy pixelart dannypixelart mattyteeth trippy talladepixelart retarphotomedit hallucincollage gopro destroyed mathematzanzirender mathematgopro" (CLIP rant)
+
+ ![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/6490359a877fc29cb1b09451/_a7CWmgpMXHUsUJmZwgZj.jpeg)
+ ------
+ # Evaluation Results
+ | Section | Measurement / Task | Pre-Trained | KO-CLIP | KO-LITE |
+ |-----------------------------|-----------------------------------|-------------|----------|----------|
+ | **RTA 100 Typographic** | Zero-Shot Acc | 0.4330 | **0.7210**🎖️ | 0.6260 |
+ | | | | | |
+ | **BLISS / SCAM** | NoSCAM | 0.9905 | **0.9897** | **0.9897** |
+ | | SCAM | 0.4165 | **0.7823**🎖️ | 0.7367 |
+ | | SynthSCAM | 0.3219 | **0.7358**🎖️ | 0.6790 |
+ | | | | | |
+ | **ILSVRC2012 Linear Probe** | Top-1 | 69.86% | 70.58% | **72.65%** |
+ | | Top-5 | 92.70% | 93.79% | **94.08%** |
+ | | | | | |
+ | **ObjectNet (ZS)** | Accuracy | 0.846 | 0.898 | **0.9029**🎖️ |
+ | | | | | |
+ | **ImageNet 1k (ZS)** | acc1 | 0.32696 | 0.43440 | **0.46882** |
+ | | acc5 | 0.52997 | 0.65297 | **0.68845**🎖️ |
+ | | mean_per_class_recall | 0.32609 | 0.43252 | **0.46695** |
+ | | | | | |
+ | **VoC-2007 (ZS)** | mAP | 0.7615 | 0.8579 | **0.8626**🎖️ |
+ | | | | | |
+ | **mscoco ZS Retrieval** | image_retrieval_recall@5 | 0.2196 | 0.3296 | **0.3385** |
+ | | text_retrieval_recall@5 | 0.3032 | 0.4396 | **0.4745** |
+ | | | | | |
+ | **xm3600 ZS Retrieval** | image_retrieval_recall@5 | 0.30593 | 0.43338 | **0.43700** |
+ | | text_retrieval_recall@5 | 0.24293 | 0.38884 | **0.42324** |
+ | | | | | |
+ | **Sugar_Crepe (PT)** | Add ATT: acc | 0.77745 | 0.84537 | **0.87427** |
+ | | Add OBJ: acc | 0.80358 | 0.84093 | **0.84772** |
+ | | Replace ATT: acc | 0.76903 | 0.81091 | **0.82106** |
+ | | Replace OBJ: acc | 0.87832 | 0.90617 | **0.91162** |
+ | | Replace REL: acc | 0.71550 | 0.73470 | **0.74253** |
+ | | Swap ATT: acc | 0.58558 | 0.62912 | **0.63363** |
+ | | Swap OBJ: acc | 0.57959 | 0.60816 | **0.62040** |
+ | | | | | |
+ | **Flickr-8k Cross-modal** | Euclidean Gap ↓ | 0.8276 | **0.8657** | 0.8182 |
+ | | JSD ↓ | 0.5200 | 0.2863 | **0.1455** |
+ | | Wasserstein Distance ↓ | 0.4084 | 0.4166 | **0.3889** |
+ | | Img-Text Cos Sim (mean) ↑ | 0.2723 | 0.3077 | **0.3300** |
+ | | Img-Text Cos Sim (std) | 0.0362 | 0.0645 | **0.0690** |
+ | | Text-Text Cos Sim (mean) | 0.6807 | **0.7243** | 0.7189 |
+ | | Text-Text Cos Sim (std) | 0.1344 | 0.1377 | **0.1387** |
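
To make the typographic-attack rows easier to compare, the sketch below computes each fine-tune's absolute accuracy gain over the pre-trained model. The numbers are copied from the table above; the `scores` dictionary and the `gains` helper are hypothetical names introduced here for illustration, not part of the author's benchmark code.

```python
# Zero-shot accuracies copied from the Evaluation Results table.
scores = {
    "RTA 100 Typographic": {"Pre-Trained": 0.4330, "KO-CLIP": 0.7210, "KO-LITE": 0.6260},
    "SCAM":                {"Pre-Trained": 0.4165, "KO-CLIP": 0.7823, "KO-LITE": 0.7367},
    "SynthSCAM":           {"Pre-Trained": 0.3219, "KO-CLIP": 0.7358, "KO-LITE": 0.6790},
}

def gains(row, baseline="Pre-Trained"):
    """Absolute accuracy gain of every non-baseline model over the baseline."""
    base = row[baseline]
    return {name: round(value - base, 4)
            for name, value in row.items() if name != baseline}

for task, row in scores.items():
    print(f"{task}: {gains(row)}")
```

On these three rows KO-LITE trails KO-CLIP by a few points, which matches the "slightly less robust" description at the top of the README, while KO-LITE leads on most of the general-purpose benchmarks in the rest of the table.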