Reza-paji committed
Commit be0484f · verified · 1 Parent(s): b89e759

Upload README.md with huggingface_hub

Files changed (1): README.md (+173 −3)
README.md CHANGED
@@ -1,3 +1,173 @@
The previous card contained only a minimal front matter block declaring `license: mit`; it is replaced by the full model card below.

---
license: apache-2.0
language:
- en
- multilingual
library_name: peft
tags:
- clip
- lora
- vision-language
- contrastive
- multilingual
- glot
datasets:
- fictional-glot-5m-dataset
base_model: openai/clip-vit-large-patch14
---

# Glot-CLIP: Multilingual and Culturally Aware CLIP LoRA Adapters

This repository contains a collection of **LoRA (Low-Rank Adaptation)** adapters for the `openai/clip-vit-large-patch14` model. These adapters were fine-tuned on the **Glot-5M dataset**, a large-scale, multilingual, and culturally diverse collection of image-text pairs, to improve the model's performance on non-English and culturally specific content.

The repository offers adapters with different LoRA ranks and training checkpoints, allowing users to choose the best trade-off between performance and adapter size for their specific application.

## Model Variants

The adapters are organized by their training configuration. The naming convention is `clip_lora_adapters_{epochs}e{rank}r`, with subdirectories for different training checkpoints.

* **`r` (Rank)**: The rank of the LoRA decomposition. Higher ranks can capture more complex patterns but increase the number of trainable parameters. We provide adapters with ranks **16** and **32**.
* **`e` (Epochs)**: The total number of training epochs. All primary models were trained for **80 epochs**.
* **`Cut`**: Checkpoints saved at intermediate epochs (e.g., `30eCut`, `50eCut`). These can be useful if the model starts to overfit in later epochs.
* **`ES` (Early Stopping)**: The final adapter saved based on the best validation score using an early stopping mechanism.

### Adapter Directory Structure

* `clip_lora_adapters_80e16r_ES`: Final LoRA adapter with **rank 16**, trained for 80 epochs with early stopping.
* `clip_lora_adapters_80e16r_30eCut`: Checkpoint from the same run at 30 epochs.
* `clip_lora_adapters_80e16r_50eCut`: Checkpoint at 50 epochs.
* `clip_lora_adapters_80e16r_70eCut`: Checkpoint at 70 epochs.
* `clip_lora_adapters_80e32r_ES`: Final LoRA adapter with **rank 32**, trained for 80 epochs with early stopping.
* `clip_lora_adapters_80e32r_30eCut`: Checkpoint at 30 epochs.
* `clip_lora_adapters_80e32r_50eCut`: Checkpoint at 50 epochs.
* `clip_lora_adapters_80e32r_70eCut`: Checkpoint at 70 epochs.
* `glot-contrastive-final-lora`: A curated final version, recommended for general use (symbolic link to the best-performing adapter, e.g., `clip_lora_adapters_80e32r_ES`).
* `glot-mlm-adapted`: An experimental version of the adapter further fine-tuned with a Masked Language Modeling (MLM) objective on the text encoder.

***

## How to Use

To use these LoRA adapters, you need to install the `transformers`, `peft`, and `torch` libraries. First, load the base CLIP model, and then attach the desired LoRA adapter from this repository.
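
A minimal sketch of that flow is shown below. It assumes the adapter weights target the standard Hugging Face `CLIPModel` modules; `REPO_ID` is a placeholder for this repository's id, and the `subfolder` can be any of the adapter directories listed above (here `glot-contrastive-final-lora`).

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from peft import PeftModel

# Load the base CLIP model and its processor.
base_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Attach a LoRA adapter from this repository.
# "REPO_ID" is a placeholder; pick any adapter directory as the subfolder.
model = PeftModel.from_pretrained(
    base_model, "REPO_ID", subfolder="glot-contrastive-final-lora"
)
model.eval()

# Score an image against a few candidate captions.
image = Image.open("example.jpg")
texts = ["a traditional festival", "a city street at night"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.logits_per_image.softmax(dim=-1))
```

The classes in the next two sections show how the adapters are wired up for the CLIP-fa and Glot500 variants.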

## CLIPFaLORA

The wrapper below pairs the `SajjadAyoubi/clip-fa-vision` image encoder with the `SajjadAyoubi/clip-fa-text` text encoder through the repository's `CombinedContrastive` module, attaches a LoRA adapter from a local path, and exposes helpers for text and image embeddings.

```python
import torch
from torchvision import transforms
from PIL import Image
from transformers import CLIPVisionModel, RobertaModel, AutoTokenizer

from peft import PeftModel
from .CombinedContrastive import CombinedContrastive  # custom contrastive wrapper defined in this repository

import requests
from io import BytesIO

from typing import List


class CLIPFaLORA:
    def __init__(self, name: str, path: str):
        self.name = name
        self.path = path

        # Fall back to CPU when no GPU is available.
        self.device = "cuda:0" if torch.cuda.is_available() else "cpu"

        # Wrap the CLIP-fa vision and text encoders in the contrastive model,
        # then attach the LoRA adapter stored at `path`.
        self.model = PeftModel.from_pretrained(
            CombinedContrastive(
                CLIPVisionModel.from_pretrained("SajjadAyoubi/clip-fa-vision"),
                RobertaModel.from_pretrained("SajjadAyoubi/clip-fa-text"),
            ),
            self.path,
        )
        self.model = self.model.to(self.device)
        self.model.eval()

        self.text_transform = AutoTokenizer.from_pretrained("SajjadAyoubi/clip-fa-text")
        self.image_transform = transforms.Compose(
            [
                transforms.Resize((224, 224)),
                transforms.ToTensor(),
                transforms.Normalize(
                    mean=[0.8544, 0.8390, 0.8298], std=[0.2618, 0.2729, 0.2855]
                ),
            ]
        )

    def get_text_embedding(self, contents: List[str]) -> List[List[float]]:
        # Tokenize a batch of strings and return pooled text embeddings.
        inputs = self.text_transform(
            contents, return_tensors="pt", padding=True, truncation=True
        ).to(self.device)

        with torch.no_grad():
            embeddings = self.model.text_encoder(**inputs).pooler_output

        return embeddings.cpu().numpy().tolist()

    def get_image_embedding(self, images: List[str]) -> List[List[float]]:
        # Load images from local paths and return pooled vision embeddings.
        images = [
            self.image_transform(Image.open(image).convert("RGB")) for image in images
        ]
        images = torch.stack(images).to(self.device)

        with torch.no_grad():
            embeddings = self.model.vision_encoder(images).pooler_output

        return embeddings.cpu().numpy().tolist()

    def get_image_embedding_url(self, images: List[str]) -> List[List[float]]:
        # Same as get_image_embedding, but the inputs are image URLs.
        contents = [requests.get(image).content for image in images]
        images = [BytesIO(content) for content in contents]

        images = [
            self.image_transform(Image.open(image).convert("RGB")) for image in images
        ]
        images = torch.stack(images).to(self.device)

        with torch.no_grad():
            embeddings = self.model.vision_encoder(images).pooler_output

        return embeddings.cpu().numpy().tolist()
```
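
A short usage sketch for the class above; the name, adapter path, and image path below are placeholders for illustration only.

```python
# Hypothetical names and paths, for illustration only.
encoder = CLIPFaLORA(name="glot-clip-fa", path="clip_lora_adapters_80e32r_ES")

text_vectors = encoder.get_text_embedding(["یک جشنواره سنتی", "a traditional festival"])
image_vectors = encoder.get_image_embedding(["photos/festival.jpg"])

print(len(text_vectors), len(text_vectors[0]))    # batch size, text embedding dimension
print(len(image_vectors), len(image_vectors[0]))  # batch size, image embedding dimension
```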

## GLOT500LORA

The wrapper below attaches a text-only LoRA adapter to a Glot500-style multilingual encoder loaded with `AutoModel` and returns mean-pooled sentence embeddings computed over the non-padding tokens.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
from typing import List


class GLOT500LORA:
    def __init__(self, name: str, base: str, adapters: str):
        self.name = name
        self.base = base
        self.adapters = adapters

        # Fall back to CPU when no GPU is available.
        self.device = "cuda:0" if torch.cuda.is_available() else "cpu"

        # Load the base multilingual encoder and attach the LoRA adapter.
        self.model = PeftModel.from_pretrained(
            AutoModel.from_pretrained(base), adapters
        )
        self.model.to(self.device)
        self.model.eval()

        self.text_transform = AutoTokenizer.from_pretrained(base, use_fast=False)

    def get_text_embedding(self, contents: List[str]) -> List[List[float]]:
        inputs = self.text_transform(
            contents, return_tensors="pt", padding=True, truncation=True
        ).to(self.device)

        with torch.no_grad():
            outputs = self.model(**inputs)
            embeddings = outputs.last_hidden_state
            # Mean-pool token embeddings, ignoring padding positions.
            mask = (
                inputs["attention_mask"].unsqueeze(-1).expand(embeddings.size()).float()
            )
            embeddings = torch.sum(embeddings * mask, 1) / torch.clamp(
                mask.sum(1), min=1e-9
            )

        return embeddings.cpu().numpy().tolist()
```
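
A matching usage sketch for `GLOT500LORA`; the base checkpoint and adapter path below are placeholders (`cis-lmu/glot500-base` is one plausible Glot500 base encoder).

```python
# Hypothetical base checkpoint and adapter path, for illustration only.
text_encoder = GLOT500LORA(
    name="glot500-lora",
    base="cis-lmu/glot500-base",
    adapters="path/to/text_lora_adapter",
)

vectors = text_encoder.get_text_embedding(["بازار سنتی", "a traditional bazaar"])
print(len(vectors), len(vectors[0]))  # batch size, embedding dimension
```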