|
|
--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
- multilingual |
|
|
library_name: peft |
|
|
tags: |
|
|
- clip |
|
|
- lora |
|
|
- vision-language |
|
|
- contrastive |
|
|
- multilingual |
|
|
- glot |
|
|
datasets: |
|
|
- fictional-glot-5m-dataset |
|
|
base_model: openai/clip-vit-large-patch14 |
|
|
--- |
|
|
|
|
|
# Prompt-Kala: A Multimodal Conversational Agent for E-Commerce Built on Dual-Retrieval RAG Architecture |
|
|
|
|
|
## Abstract |
|
|
Effectively harnessing the vast, unstructured data in customer comments is a critical challenge in modern e-commerce. An intelligent system that can accurately interpret and respond to nuanced, multimodal user queries is essential for enhancing the customer experience and providing scalable support. We propose a novel, dual-phase Retrieval-Augmented Generation (RAG) system that integrates both textual and visual information to power a conversational chatbot. Our empirical results demonstrate a significant performance uplift, with question-answering accuracy increasing by up to 20 percentage points when visual context is provided alongside text. This work establishes a robust framework for transforming raw customer feedback into a dynamic, interactive, and reliable knowledge base for e-commerce applications. The code for this project is available at https://github.com/NLP-Final-Projects/digikala_rag.

**Index Terms:** Retrieval-Augmented Generation (RAG), Natural Language Processing (NLP), Knowledge base/External knowledge, Vector database, Prompt engineering
|
|
|
|
|
## Model Variants |
|
|
|
|
|
The adapters are organized by their training configuration. The naming convention is `clip_lora_adapters_{epochs}e{rank}r`, with subdirectories for different training checkpoints. The components are explained below, with a small parsing sketch after the list.
|
|
|
|
|
* **`r` (Rank)**: The rank of the LoRA decomposition. Higher ranks can capture more complex patterns but increase the number of trainable parameters. We provide adapters with ranks **16** and **32**. |
|
|
* **`e` (Epochs)**: The total number of training epochs. All primary models were trained for **80 epochs**. |
|
|
* **`Cut`**: Checkpoints saved at intermediate epochs (e.g., `30eCut`, `50eCut`). These can be useful if the model starts to overfit in later epochs. |
|
|
* **`ES` (Early Stopping)**: The final adapter saved based on the best validation score using an early stopping mechanism. |
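
As a concrete illustration, the hypothetical helper below decodes an adapter directory name into its training configuration; the regular expression simply mirrors the convention described above and is not part of this repository:

```python
import re

# Matches e.g. "clip_lora_adapters_80e16r_30eCut" or "clip_lora_adapters_80e32r_ES".
NAME_PATTERN = re.compile(
    r"clip_lora_adapters_(?P<epochs>\d+)e(?P<rank>\d+)r(?:_(?P<variant>\d+eCut|ES))?$"
)

def parse_adapter_name(name: str) -> dict:
    """Decode an adapter directory name into its training configuration."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized adapter name: {name!r}")
    return {
        "epochs": int(match.group("epochs")),
        "rank": int(match.group("rank")),
        "variant": match.group("variant"),  # "ES", "<N>eCut", or None
    }

print(parse_adapter_name("clip_lora_adapters_80e32r_ES"))
# -> {'epochs': 80, 'rank': 32, 'variant': 'ES'}
```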
|
|
|
|
|
### Adapter Directory Structure
|
|
|
|
|
* `clip_lora_adapters_80e16r_ES`: Final LoRA adapter with **rank 16**, trained for 80 epochs with early stopping. |
|
|
* `clip_lora_adapters_80e16r_30eCut`: Checkpoint from the same run at 30 epochs. |
|
|
* `clip_lora_adapters_80e16r_50eCut`: Checkpoint at 50 epochs. |
|
|
* `clip_lora_adapters_80e16r_70eCut`: Checkpoint at 70 epochs. |
|
|
* `clip_lora_adapters_80e32r_ES`: Final LoRA adapter with **rank 32**, trained for 80 epochs with early stopping. |
|
|
* `clip_lora_adapters_80e32r_30eCut`: Checkpoint at 30 epochs. |
|
|
* `clip_lora_adapters_80e32r_50eCut`: Checkpoint at 50 epochs. |
|
|
* `clip_lora_adapters_80e32r_70eCut`: Checkpoint at 70 epochs. |
|
|
* `glot-contrastive-final-lora`: A curated final version, recommended for general use (symbolic link to the best-performing adapter, e.g., `clip_lora_adapters_80e32r_ES`). |
|
|
* `glot-mlm-adapted`: An experimental version of the adapter further fine-tuned with a Masked Language Modeling (MLM) objective on the text encoder. |
|
|
|
|
|
*** |
|
|
|
|
|
## How to Use |
|
|
|
|
|
To use these LoRA adapters, install the `transformers`, `peft`, and `torch` libraries (the snippets below additionally use `torchvision`, `Pillow`, and `requests`). First load the base CLIP model, then attach the desired LoRA adapter from this repository.
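
A minimal sketch of that flow, assuming the base checkpoint from this card's metadata (`openai/clip-vit-large-patch14`) and an illustrative local adapter directory (`clip_lora_adapters_80e32r_ES`; substitute any variant listed above):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from peft import PeftModel

# Load the base CLIP model and its paired processor.
base = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Attach a LoRA adapter; the path here is illustrative.
model = PeftModel.from_pretrained(base, "clip_lora_adapters_80e32r_ES")
model.eval()

# Score an image against candidate captions.
image = Image.open("example.jpg")
inputs = processor(
    text=["a red sneaker", "a wooden chair"],
    images=image,
    return_tensors="pt",
    padding=True,
)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits_per_image.softmax(dim=-1))
```

If the adapters live on the Hub rather than on disk, pass the repository ID (optionally with a `subfolder` argument) instead of a local path.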
|
|
|
|
|
## CLIPFaLORA |
|
|
|
|
|
```python
import torch
from torchvision import transforms
from PIL import Image
from transformers import CLIPVisionModel, RobertaModel, AutoTokenizer

from peft import PeftModel

# Project-local module that pairs the vision and text encoders for
# contrastive training; it must be importable from your package.
from .CombinedContrastive import CombinedContrastive

import requests
from io import BytesIO

from typing import List


class CLIPFaLORA:
    def __init__(self, name: str, path: str):
        self.name = name
        self.path = path

        # Fall back to CPU when no GPU is available.
        self.device = "cuda:0" if torch.cuda.is_available() else "cpu"
        self.model = PeftModel.from_pretrained(
            CombinedContrastive(
                CLIPVisionModel.from_pretrained("SajjadAyoubi/clip-fa-vision"),
                RobertaModel.from_pretrained("SajjadAyoubi/clip-fa-text"),
            ),
            self.path,
        )
        self.model = self.model.to(self.device)
        self.model.eval()

        self.text_transform = AutoTokenizer.from_pretrained("SajjadAyoubi/clip-fa-text")
        self.image_transform = transforms.Compose(
            [
                transforms.Resize((224, 224)),
                transforms.ToTensor(),
                # Dataset-specific channel statistics (not the standard CLIP values).
                transforms.Normalize(
                    mean=[0.8544, 0.8390, 0.8298], std=[0.2618, 0.2729, 0.2855]
                ),
            ]
        )

    def get_text_embedding(self, contents: List[str]) -> List[List[float]]:
        """Tokenize texts and return pooled embeddings from the LoRA-adapted text encoder."""
        inputs = self.text_transform(
            contents, return_tensors="pt", padding=True, truncation=True
        ).to(self.device)

        with torch.no_grad():
            embeddings = self.model.text_encoder(**inputs).pooler_output

        return embeddings.cpu().numpy().tolist()

    def get_image_embedding(self, images: List[str]) -> List[List[float]]:
        """Load local image files and return pooled embeddings from the vision encoder."""
        images = [
            self.image_transform(Image.open(image).convert("RGB")) for image in images
        ]
        images = torch.stack(images).to(self.device)

        with torch.no_grad():
            embeddings = self.model.vision_encoder(images).pooler_output

        return embeddings.cpu().numpy().tolist()

    def get_image_embedding_url(self, images: List[str]) -> List[List[float]]:
        """Download images from URLs, then embed them like local files."""
        contents = [requests.get(image).content for image in images]
        images = [BytesIO(content) for content in contents]

        images = [
            self.image_transform(Image.open(image).convert("RGB")) for image in images
        ]
        images = torch.stack(images).to(self.device)

        with torch.no_grad():
            embeddings = self.model.vision_encoder(images).pooler_output

        return embeddings.cpu().numpy().tolist()
```
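
A hypothetical usage sketch, assuming the project-local `CombinedContrastive` module is importable; the adapter path and file names are illustrative:

```python
encoder = CLIPFaLORA(name="clip-fa-lora", path="clip_lora_adapters_80e16r_ES")

text_vecs = encoder.get_text_embedding(["a red sneaker", "a wooden chair"])
image_vecs = encoder.get_image_embedding(["sneaker.jpg"])
print(len(text_vecs), len(image_vecs[0]))  # batch size, embedding dimension
```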
|
|
|
|
|
|
|
|
## GLOT500LORA |
|
|
|
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
from typing import List


class GLOT500LORA:
    def __init__(self, name: str, base: str, adapters: str):
        self.name = name
        self.base = base
        self.adapters = adapters

        # Fall back to CPU when no GPU is available.
        self.device = "cuda:0" if torch.cuda.is_available() else "cpu"
        self.model = PeftModel.from_pretrained(
            AutoModel.from_pretrained(base), adapters
        )
        self.model.to(self.device)
        self.model.eval()  # disable dropout for deterministic inference

        self.text_transform = AutoTokenizer.from_pretrained(base, use_fast=False)

    def get_text_embedding(self, contents: List[str]) -> List[List[float]]:
        """Mean-pool token embeddings over non-padding positions to get sentence vectors."""
        inputs = self.text_transform(
            contents, return_tensors="pt", padding=True, truncation=True
        ).to(self.device)

        with torch.no_grad():
            outputs = self.model(**inputs)
            embeddings = outputs.last_hidden_state
            # Mask out padding tokens before averaging.
            mask = (
                inputs["attention_mask"].unsqueeze(-1).expand(embeddings.size()).float()
            )
            embeddings = torch.sum(embeddings * mask, 1) / torch.clamp(
                mask.sum(1), min=1e-9
            )

        return embeddings.cpu().numpy().tolist()
```
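
Usage mirrors the CLIP class above. A sketch, assuming the Glot500 base checkpoint `cis-lmu/glot500-base` (an assumption; use whichever base the adapter was trained from) and the curated adapter directory from this repository:

```python
encoder = GLOT500LORA(
    name="glot500-lora",
    base="cis-lmu/glot500-base",  # assumed base checkpoint
    adapters="glot-contrastive-final-lora",
)

vecs = encoder.get_text_embedding(["Hello world", "Bonjour le monde"])
print(len(vecs), len(vecs[0]))  # batch size, embedding dimension
```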