---
license: apache-2.0
language:
- en
- multilingual
library_name: peft
tags:
- clip
- lora
- vision-language
- contrastive
- multilingual
- glot
datasets:
- fictional-glot-5m-dataset
base_model: openai/clip-vit-large-patch14
---

# Glot-CLIP: Multilingual and Culturally Aware CLIP LoRA Adapters

This repository contains a collection of **LoRA (Low-Rank Adaptation)** adapters for the `openai/clip-vit-large-patch14` model. The adapters were fine-tuned on the **Glot-5M dataset**, a large-scale, multilingual, and culturally diverse collection of image-text pairs, to improve the model's performance on non-English and culturally specific content.

The repository offers adapters with different LoRA ranks and training checkpoints, letting users pick the trade-off between performance and adapter size that best fits their application.

## Model Variants

The adapters are organized by training configuration. The naming convention is `clip_lora_adapters_{epochs}e{rank}r`, with subdirectories for different training checkpoints; a sketch for downloading a single adapter directory follows the listing below.

* **`r` (Rank)**: The rank of the LoRA decomposition. Higher ranks can capture more complex patterns but increase the number of trainable parameters. We provide adapters with ranks **16** and **32**.
* **`e` (Epochs)**: The total number of training epochs. All primary models were trained for **80 epochs**.
* **`Cut`**: Checkpoints saved at intermediate epochs (e.g., `30eCut`, `50eCut`). These can be useful if the model starts to overfit in later epochs.
* **`ES` (Early Stopping)**: The final adapter saved at the best validation score using an early-stopping mechanism.

### Adapter Directory Structure

* `clip_lora_adapters_80e16r_ES`: Final LoRA adapter with **rank 16**, trained for 80 epochs with early stopping.
* `clip_lora_adapters_80e16r_30eCut`: Checkpoint from the same run at 30 epochs.
* `clip_lora_adapters_80e16r_50eCut`: Checkpoint at 50 epochs.
* `clip_lora_adapters_80e16r_70eCut`: Checkpoint at 70 epochs.
* `clip_lora_adapters_80e32r_ES`: Final LoRA adapter with **rank 32**, trained for 80 epochs with early stopping.
* `clip_lora_adapters_80e32r_30eCut`: Checkpoint at 30 epochs.
* `clip_lora_adapters_80e32r_50eCut`: Checkpoint at 50 epochs.
* `clip_lora_adapters_80e32r_70eCut`: Checkpoint at 70 epochs.
* `glot-contrastive-final-lora`: A curated final version, recommended for general use (a symbolic link to the best-performing adapter, e.g., `clip_lora_adapters_80e32r_ES`).
* `glot-mlm-adapted`: An experimental variant further fine-tuned with a Masked Language Modeling (MLM) objective on the text encoder.
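To pull down just one adapter directory from the Hub rather than the whole repository, `huggingface_hub`'s `snapshot_download` with an `allow_patterns` filter should work; the repository id below is a placeholder for this repo's actual id:

```python
from huggingface_hub import snapshot_download

# Placeholder repo id -- replace with this repository's actual id on the Hub.
local_dir = snapshot_download(
    repo_id="your-org/glot-clip-lora",
    allow_patterns=["clip_lora_adapters_80e32r_ES/*"],  # fetch only this adapter variant
)
print(local_dir)  # local cache path containing the selected adapter subfolder
```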
***

## How to Use

To use these LoRA adapters, install the `transformers`, `peft`, and `torch` libraries (the image examples below additionally rely on `torchvision`, `Pillow`, and `requests`). Load the base model first, then attach the desired LoRA adapter from this repository, as in the sketch below.
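A minimal loading sketch, assuming the adapter weights load directly onto the plain `openai/clip-vit-large-patch14` `CLIPModel` (the wrapper classes below show the exact dual-encoder setup used during training). The repository id is a placeholder, and `subfolder` selects one of the adapter directories listed above:

```python
import torch
from transformers import CLIPModel, CLIPProcessor
from peft import PeftModel

# Placeholder repo id -- substitute this repository's actual id.
ADAPTER_REPO = "your-org/glot-clip-lora"

# Load the frozen base model, then attach one adapter variant from this repo.
base = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
model = PeftModel.from_pretrained(base, ADAPTER_REPO, subfolder="clip_lora_adapters_80e32r_ES")
model.eval()

# Encode a caption with the adapted text tower.
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
inputs = processor(text=["a photo of a street market"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_features = model.get_text_features(**inputs)
print(text_features.shape)  # projected text embeddings, one row per caption
```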
## CLIPFaLORA

`CLIPFaLORA` is a convenience wrapper that pairs the CLIP-fa vision and text encoders (`SajjadAyoubi/clip-fa-vision`, `SajjadAyoubi/clip-fa-text`), wraps them in the repository's `CombinedContrastive` module, and attaches a LoRA adapter with `peft`:
```python
import torch
from torchvision import transforms
from PIL import Image
from transformers import CLIPVisionModel, RobertaModel, AutoTokenizer

from peft import PeftModel
from .CombinedContrastive import CombinedContrastive  # dual-encoder wrapper defined alongside this class

import requests
from io import BytesIO

from typing import List


class CLIPFaLORA:
    def __init__(self, name: str, path: str):
        self.name = name
        self.path = path

        self.device = "cuda:0"
        # Wrap the CLIP-fa vision and text encoders, then attach the LoRA adapter from `path`.
        self.model = PeftModel.from_pretrained(
            CombinedContrastive(
                CLIPVisionModel.from_pretrained("SajjadAyoubi/clip-fa-vision"),
                RobertaModel.from_pretrained("SajjadAyoubi/clip-fa-text"),
            ),
            self.path,
        )
        self.model = self.model.to(self.device)
        self.model.eval()

        self.text_transform = AutoTokenizer.from_pretrained("SajjadAyoubi/clip-fa-text")
        self.image_transform = transforms.Compose(
            [
                transforms.Resize((224, 224)),
                transforms.ToTensor(),
                transforms.Normalize(
                    mean=[0.8544, 0.8390, 0.8298], std=[0.2618, 0.2729, 0.2855]
                ),
            ]
        )

    def get_text_embedding(self, contents: List[str]) -> List[List[float]]:
        inputs = self.text_transform(
            contents, return_tensors="pt", padding=True, truncation=True
        ).to(self.device)

        with torch.no_grad():
            embeddings = self.model.text_encoder(**inputs).pooler_output

        return embeddings.cpu().numpy().tolist()

    def get_image_embedding(self, images: List[str]) -> List[List[float]]:
        # `images` is a list of local file paths.
        images = [
            self.image_transform(Image.open(image).convert("RGB")) for image in images
        ]
        images = torch.stack(images).to(self.device)

        with torch.no_grad():
            embeddings = self.model.vision_encoder(images).pooler_output

        return embeddings.cpu().numpy().tolist()

    def get_image_embedding_url(self, images: List[str]) -> List[List[float]]:
        # `images` is a list of URLs; download each one before encoding.
        contents = [requests.get(image).content for image in images]
        images = [BytesIO(content) for content in contents]

        images = [
            self.image_transform(Image.open(image).convert("RGB")) for image in images
        ]
        images = torch.stack(images).to(self.device)

        with torch.no_grad():
            embeddings = self.model.vision_encoder(images).pooler_output

        return embeddings.cpu().numpy().tolist()
```
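A hypothetical usage sketch; the adapter path, image file, and URL are placeholders:

```python
# Hypothetical paths -- point `path` at a locally downloaded adapter directory from this repo.
encoder = CLIPFaLORA(name="clip-fa-lora", path="./clip_lora_adapters_80e32r_ES")

text_vecs = encoder.get_text_embedding(["یک بازار سنتی", "a traditional bazaar"])
image_vecs = encoder.get_image_embedding(["./bazaar.jpg"])
url_vecs = encoder.get_image_embedding_url(["https://example.com/bazaar.jpg"])
```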

## GLOT500LORA

`GLOT500LORA` wraps a text-only encoder (loaded via `AutoModel`) with a LoRA adapter and returns mean-pooled, padding-masked sentence embeddings:
```python
import torch
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
from typing import List


class GLOT500LORA:
    def __init__(self, name: str, base: str, adapters: str):
        self.name = name
        self.base = base
        self.adapters = adapters

        self.device = "cuda:0"
        # Attach the LoRA adapter on top of the base text encoder.
        self.model = PeftModel.from_pretrained(
            AutoModel.from_pretrained(base), adapters
        )
        self.model.to(self.device)
        self.model.eval()

        self.text_transform = AutoTokenizer.from_pretrained(base, use_fast=False)

    def get_text_embedding(self, contents: List[str]) -> List[List[float]]:
        inputs = self.text_transform(
            contents, return_tensors="pt", padding=True, truncation=True
        ).to(self.device)

        with torch.no_grad():
            outputs = self.model(**inputs)
            embeddings = outputs.last_hidden_state
            # Mean-pool token embeddings, ignoring padding positions via the attention mask.
            mask = (
                inputs["attention_mask"].unsqueeze(-1).expand(embeddings.size()).float()
            )
            embeddings = torch.sum(embeddings * mask, 1) / torch.clamp(
                mask.sum(1), min=1e-9
            )

        return embeddings.cpu().numpy().tolist()
```
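A hypothetical usage sketch; the base checkpoint and adapter path are placeholders (the class name suggests a Glot500-style base such as `cis-lmu/glot500-base`, but the code accepts any compatible encoder):

```python
# Hypothetical identifiers -- substitute the base checkpoint and adapter directory actually used.
encoder = GLOT500LORA(
    name="glot500-lora",
    base="cis-lmu/glot500-base",
    adapters="./glot-contrastive-final-lora",
)

vecs = encoder.get_text_embedding(["Hello, world!", "Hola, mundo"])
print(len(vecs), len(vecs[0]))  # number of sentences, embedding dimension
```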