|
|
--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
- multilingual |
|
|
library_name: peft |
|
|
tags: |
|
|
- clip |
|
|
- lora |
|
|
- vision-language |
|
|
- contrastive |
|
|
- multilingual |
|
|
- glot |
|
|
datasets: |
|
|
- fictional-glot-5m-dataset |
|
|
base_model: openai/clip-vit-large-patch14 |
|
|
--- |
|
|
|
|
|
# Prompt-Kala: A Multimodal Conversational Agent for E-Commerce Built on Dual-Retrieval RAG Architecture |
|
|
|
|
|
## Abstract |
|
|
Effectively harnessing the vast, unstructured data in customer comments is a critical challenge in modern e-commerce. An intelligent system that can accurately interpret and respond to nuanced, multimodal user queries is essential for enhancing the customer experience and providing scalable support. We propose a novel, dual-phase Retrieval-Augmented Generation (RAG) system that integrates both textual and visual information to power a conversational chatbot. Our empirical results demonstrate a significant performance uplift, with question-answering accuracy increasing by up to 20 percentage points when visual context is provided alongside text. This work establishes a robust framework for transforming raw customer feedback into a dynamic, interactive, and reliable knowledge base for e-commerce applications. The code for this project is available at https://github.com/NLP-Final-Projects/digikala_rag.

**Index Terms:** Retrieval-Augmented Generation (RAG), Natural Language Processing (NLP), Knowledge base/External knowledge, Vector database, Prompt engineering
|
|
|
|
|
## Model Variants |
|
|
|
|
|
The adapters are organized by their training configuration. The naming convention is `clip_lora_adapters_{epochs}e{rank}r`, with subdirectories for different training checkpoints. The components are explained below, with a small parsing sketch after the list.
|
|
|
|
|
* **`r` (Rank)**: The rank of the LoRA decomposition. Higher ranks can capture more complex patterns but increase the number of trainable parameters. We provide adapters with ranks **16** and **32**. |
|
|
* **`e` (Epochs)**: The total number of training epochs. All primary models were trained for **80 epochs**. |
|
|
* **`Cut`**: Checkpoints saved at intermediate epochs (e.g., `30eCut`, `50eCut`). These can be useful if the model starts to overfit in later epochs. |
|
|
* **`ES` (Early Stopping)**: The final adapter saved based on the best validation score using an early stopping mechanism. |
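
As a concrete illustration, the hypothetical helper below decodes an adapter directory name into its training configuration; the regular expression simply mirrors the convention described above and is not part of this repository:

```python
import re

# Matches e.g. "clip_lora_adapters_80e16r_30eCut" or "clip_lora_adapters_80e32r_ES".
NAME_PATTERN = re.compile(
    r"clip_lora_adapters_(?P<epochs>\d+)e(?P<rank>\d+)r(?:_(?P<variant>\d+eCut|ES))?$"
)

def parse_adapter_name(name: str) -> dict:
    """Decode an adapter directory name into its training configuration."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized adapter name: {name!r}")
    return {
        "epochs": int(match.group("epochs")),
        "rank": int(match.group("rank")),
        "variant": match.group("variant"),  # "ES", "<N>eCut", or None
    }

print(parse_adapter_name("clip_lora_adapters_80e32r_ES"))
# -> {'epochs': 80, 'rank': 32, 'variant': 'ES'}
```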
|
|
|
|
|
### Adapter Directory Structure
|
|
|
|
|
* `clip_lora_adapters_80e16r_ES`: Final LoRA adapter with **rank 16**, trained for 80 epochs with early stopping. |
|
|
* `clip_lora_adapters_80e16r_30eCut`: Checkpoint from the same run at 30 epochs. |
|
|
* `clip_lora_adapters_80e16r_50eCut`: Checkpoint at 50 epochs. |
|
|
* `clip_lora_adapters_80e16r_70eCut`: Checkpoint at 70 epochs. |
|
|
* `clip_lora_adapters_80e32r_ES`: Final LoRA adapter with **rank 32**, trained for 80 epochs with early stopping. |
|
|
* `clip_lora_adapters_80e32r_30eCut`: Checkpoint at 30 epochs. |
|
|
* `clip_lora_adapters_80e32r_50eCut`: Checkpoint at 50 epochs. |
|
|
* `clip_lora_adapters_80e32r_70eCut`: Checkpoint at 70 epochs. |
|
|
* `glot-contrastive-final-lora`: A curated final version, recommended for general use (symbolic link to the best-performing adapter, e.g., `clip_lora_adapters_80e32r_ES`). |
|
|
* `glot-mlm-adapted`: An experimental version of the adapter further fine-tuned with a Masked Language Modeling (MLM) objective on the text encoder. |
|
|
|
|
|
*** |
|
|
|
|
|
## How to Use |
|
|
|
|
|
To use these LoRA adapters, install the `transformers`, `peft`, and `torch` libraries (the snippets below additionally use `torchvision`, `Pillow`, and `requests`). First load the base CLIP model, then attach the desired LoRA adapter from this repository.
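
A minimal sketch of that flow, assuming the base checkpoint from this card's metadata (`openai/clip-vit-large-patch14`) and an illustrative local adapter directory (`clip_lora_adapters_80e32r_ES`; substitute any variant listed above):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from peft import PeftModel

# Load the base CLIP model and its paired processor.
base = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Attach a LoRA adapter; the path here is illustrative.
model = PeftModel.from_pretrained(base, "clip_lora_adapters_80e32r_ES")
model.eval()

# Score an image against candidate captions.
image = Image.open("example.jpg")
inputs = processor(
    text=["a red sneaker", "a wooden chair"],
    images=image,
    return_tensors="pt",
    padding=True,
)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits_per_image.softmax(dim=-1))
```

If the adapters live on the Hub rather than on disk, pass the repository ID (optionally with a `subfolder` argument) instead of a local path.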
|
|
|
|
|
## CLIPFaLORA |
|
|
|
|
|
```python
import torch
from torchvision import transforms
from PIL import Image
from transformers import CLIPVisionModel, RobertaModel, AutoTokenizer

from peft import PeftModel

# Project-local module that pairs the vision and text encoders for
# contrastive training; it must be importable from your package.
from .CombinedContrastive import CombinedContrastive

import requests
from io import BytesIO

from typing import List


class CLIPFaLORA:
    def __init__(self, name: str, path: str):
        self.name = name
        self.path = path

        # Fall back to CPU when no GPU is available.
        self.device = "cuda:0" if torch.cuda.is_available() else "cpu"
        self.model = PeftModel.from_pretrained(
            CombinedContrastive(
                CLIPVisionModel.from_pretrained("SajjadAyoubi/clip-fa-vision"),
                RobertaModel.from_pretrained("SajjadAyoubi/clip-fa-text"),
            ),
            self.path,
        )
        self.model = self.model.to(self.device)
        self.model.eval()

        self.text_transform = AutoTokenizer.from_pretrained("SajjadAyoubi/clip-fa-text")
        self.image_transform = transforms.Compose(
            [
                transforms.Resize((224, 224)),
                transforms.ToTensor(),
                # Dataset-specific channel statistics (not the standard CLIP values).
                transforms.Normalize(
                    mean=[0.8544, 0.8390, 0.8298], std=[0.2618, 0.2729, 0.2855]
                ),
            ]
        )

    def get_text_embedding(self, contents: List[str]) -> List[List[float]]:
        """Tokenize texts and return pooled embeddings from the LoRA-adapted text encoder."""
        inputs = self.text_transform(
            contents, return_tensors="pt", padding=True, truncation=True
        ).to(self.device)

        with torch.no_grad():
            embeddings = self.model.text_encoder(**inputs).pooler_output

        return embeddings.cpu().numpy().tolist()

    def get_image_embedding(self, images: List[str]) -> List[List[float]]:
        """Load local image files and return pooled embeddings from the vision encoder."""
        images = [
            self.image_transform(Image.open(image).convert("RGB")) for image in images
        ]
        images = torch.stack(images).to(self.device)

        with torch.no_grad():
            embeddings = self.model.vision_encoder(images).pooler_output

        return embeddings.cpu().numpy().tolist()

    def get_image_embedding_url(self, images: List[str]) -> List[List[float]]:
        """Download images from URLs, then embed them like local files."""
        contents = [requests.get(image).content for image in images]
        images = [BytesIO(content) for content in contents]

        images = [
            self.image_transform(Image.open(image).convert("RGB")) for image in images
        ]
        images = torch.stack(images).to(self.device)

        with torch.no_grad():
            embeddings = self.model.vision_encoder(images).pooler_output

        return embeddings.cpu().numpy().tolist()
```
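
A hypothetical usage sketch, assuming the project-local `CombinedContrastive` module is importable; the adapter path and file names are illustrative:

```python
encoder = CLIPFaLORA(name="clip-fa-lora", path="clip_lora_adapters_80e16r_ES")

text_vecs = encoder.get_text_embedding(["a red sneaker", "a wooden chair"])
image_vecs = encoder.get_image_embedding(["sneaker.jpg"])
print(len(text_vecs), len(image_vecs[0]))  # batch size, embedding dimension
```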
|
|
|
|
|
|
|
|
## GLOT500LORA |
|
|
|
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
from typing import List


class GLOT500LORA:
    def __init__(self, name: str, base: str, adapters: str):
        self.name = name
        self.base = base
        self.adapters = adapters

        # Fall back to CPU when no GPU is available.
        self.device = "cuda:0" if torch.cuda.is_available() else "cpu"
        self.model = PeftModel.from_pretrained(
            AutoModel.from_pretrained(base), adapters
        )
        self.model.to(self.device)
        self.model.eval()  # disable dropout for deterministic inference

        self.text_transform = AutoTokenizer.from_pretrained(base, use_fast=False)

    def get_text_embedding(self, contents: List[str]) -> List[List[float]]:
        """Mean-pool token embeddings over non-padding positions to get sentence vectors."""
        inputs = self.text_transform(
            contents, return_tensors="pt", padding=True, truncation=True
        ).to(self.device)

        with torch.no_grad():
            outputs = self.model(**inputs)
            embeddings = outputs.last_hidden_state
            # Mask out padding tokens before averaging.
            mask = (
                inputs["attention_mask"].unsqueeze(-1).expand(embeddings.size()).float()
            )
            embeddings = torch.sum(embeddings * mask, 1) / torch.clamp(
                mask.sum(1), min=1e-9
            )

        return embeddings.cpu().numpy().tolist()
```
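
Usage mirrors the CLIP class above. A sketch, assuming the Glot500 base checkpoint `cis-lmu/glot500-base` (an assumption; use whichever base the adapter was trained from) and the curated adapter directory from this repository:

```python
encoder = GLOT500LORA(
    name="glot500-lora",
    base="cis-lmu/glot500-base",  # assumed base checkpoint
    adapters="glot-contrastive-final-lora",
)

vecs = encoder.get_text_embedding(["Hello world", "Bonjour le monde"])
print(len(vecs), len(vecs[0]))  # batch size, embedding dimension
```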