---
license: apache-2.0
language:
- en
- multilingual
library_name: peft
tags:
- clip
- lora
- vision-language
- contrastive
- multilingual
- glot
datasets:
- fictional-glot-5m-dataset
base_model: openai/clip-vit-large-patch14
---

# Prompt-Kala: A Multimodal Conversational Agent for E-Commerce Built on Dual-Retrieval RAG Architecture

## Abstract
Effectively harnessing the vast and unstructured data from customer comments is a critical challenge in modern e-commerce. An intelligent system that can accurately interpret and respond to nuanced, multimodal user queries is essential for enhancing customer experience and providing scalable support. We propose a novel, dual-phase Retrieval-Augmented Generation (RAG) system that integrates both textual and visual information to power a conversational chatbot. Our empirical results demonstrate a significant performance uplift, with question-answering accuracy increasing by up to 20 percentage points when visual context is provided alongside text. This work establishes a robust framework for transforming raw customer feedback into a dynamic, interactive, and reliable knowledge base for e-commerce applications. The code for this project is available at https://github.com/NLP-Final-Projects/digikala rag.

**Index Terms**: Retrieval-Augmented Generation (RAG), Natural Language Processing (NLP), Knowledge base/External knowledge, Vector database, Prompt engineering

## Model Variants

The adapters are organized by their training configuration. The naming convention is `clip_lora_adapters_{epochs}e{rank}r`, with subdirectories for different training checkpoints.

* **`r` (Rank)**: The rank of the LoRA decomposition. Higher ranks can capture more complex patterns but increase the number of trainable parameters. We provide adapters with ranks **16** and **32**.
* **`e` (Epochs)**: The total number of training epochs. All primary models were trained for **80 epochs**.
* **`Cut`**: Checkpoints saved at intermediate epochs (e.g., `30eCut`, `50eCut`). These can be useful if the model starts to overfit in later epochs.
* **`ES` (Early Stopping)**: The final adapter saved based on the best validation score using an early stopping mechanism.

### Adapter Directory Structure:

* `clip_lora_adapters_80e16r_ES`: Final LoRA adapter with **rank 16**, trained for 80 epochs with early stopping.
    * `clip_lora_adapters_80e16r_30eCut`: Checkpoint from the same run at 30 epochs.
    * `clip_lora_adapters_80e16r_50eCut`: Checkpoint at 50 epochs.
    * `clip_lora_adapters_80e16r_70eCut`: Checkpoint at 70 epochs.
* `clip_lora_adapters_80e32r_ES`: Final LoRA adapter with **rank 32**, trained for 80 epochs with early stopping.
    * `clip_lora_adapters_80e32r_30eCut`: Checkpoint at 30 epochs.
    * `clip_lora_adapters_80e32r_50eCut`: Checkpoint at 50 epochs.
    * `clip_lora_adapters_80e32r_70eCut`: Checkpoint at 70 epochs.
* `glot-contrastive-final-lora`: A curated final version, recommended for general use (symbolic link to the best-performing adapter, e.g., `clip_lora_adapters_80e32r_ES`).
* `glot-mlm-adapted`: An experimental version of the adapter further fine-tuned with a Masked Language Modeling (MLM) objective on the text encoder.

***

## How to Use

To use these LoRA adapters, you need to install the `transformers`, `peft`, and `torch` libraries. First, load the base CLIP model, and then attach the desired LoRA adapter from this repository.
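A minimal loading sketch of this general pattern is shown below. The base checkpoint is taken from the card metadata, and the adapter directory is a placeholder for one of the variants listed above, assumed to be available locally; the project-specific classes used in this work follow in the next sections.

```python
# Generic PEFT loading pattern (paths are placeholders):
#   1) load a base model, 2) attach a LoRA adapter, 3) switch to eval mode.
# pip install torch transformers peft
import torch
from transformers import AutoModel
from peft import PeftModel

base = AutoModel.from_pretrained("openai/clip-vit-large-patch14")         # base model from the card metadata
model = PeftModel.from_pretrained(base, "clip_lora_adapters_80e32r_ES")   # local adapter directory (placeholder)
model.eval()
```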

## CLIPFaLORA

```python
import torch
from torchvision import transforms
from PIL import Image
from transformers import CLIPVisionModel, RobertaModel, AutoTokenizer

from peft import PeftModel
# CombinedContrastive (defined in this repository) pairs the CLIP vision encoder with the RoBERTa text encoder.
from .CombinedContrastive import CombinedContrastive

import requests
from io import BytesIO

from typing import List


class CLIPFaLORA:
    def __init__(self, name: str, path: str):
        self.name = name
        self.path = path

        self.device = "cuda:0"
        self.model = PeftModel.from_pretrained(
            CombinedContrastive(
                CLIPVisionModel.from_pretrained("SajjadAyoubi/clip-fa-vision"),
                RobertaModel.from_pretrained("SajjadAyoubi/clip-fa-text"),
            ),
            self.path,
        )
        self.model = self.model.to(self.device)
        self.model.eval()

        self.text_transform = AutoTokenizer.from_pretrained("SajjadAyoubi/clip-fa-text")
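        # Image preprocessing: resize to the 224x224 CLIP input size and normalize with the channel statistics below.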
        self.image_transform = transforms.Compose(
            [
                transforms.Resize((224, 224)),
                transforms.ToTensor(),
                transforms.Normalize(
                    mean=[0.8544, 0.8390, 0.8298], std=[0.2618, 0.2729, 0.2855]
                ),
            ]
        )

    def get_text_embedding(self, contents: List[str]) -> List[List[float]]:
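        # Tokenize the batch of texts; padding and truncation let sentences of different lengths share one batch.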
        inputs = self.text_transform(
            contents, return_tensors="pt", padding=True, truncation=True
        ).to(self.device)

        with torch.no_grad():
            embeddings = self.model.text_encoder(**inputs).pooler_output

        return embeddings.cpu().numpy().tolist()

    def get_image_embedding(self, images: List[str]) -> List[List[float]]:
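        # Load images from local paths, preprocess them, and stack them into a batch.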
        images = [
            self.image_transform(Image.open(image).convert("RGB")) for image in images
        ]
        images = torch.stack(images).to(self.device)

        with torch.no_grad():
            embeddings = self.model.vision_encoder(images).pooler_output

        return embeddings.cpu().numpy().tolist()

    def get_image_embedding_url(self, images: List[str]) -> List[List[float]]:
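        # Download the images over HTTP, then embed them exactly as local files.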
        contents = [requests.get(image).content for image in images]
        images = [BytesIO(content) for content in contents]

        images = [
            self.image_transform(Image.open(image).convert("RGB")) for image in images
        ]
        images = torch.stack(images).to(self.device)

        with torch.no_grad():
            embeddings = self.model.vision_encoder(images).pooler_output

        return embeddings.cpu().numpy().tolist()
```
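A usage sketch for the class above; the adapter directory and image path are hypothetical placeholders.

```python
# Hypothetical adapter directory and inputs; substitute real paths.
encoder = CLIPFaLORA(name="clip-fa-lora", path="clip_lora_adapters_80e32r_ES")

text_vectors = encoder.get_text_embedding(["گوشی با کیفیت دوربین عالی"])   # Persian: "a phone with excellent camera quality"
image_vectors = encoder.get_image_embedding(["samples/product.jpg"])

print(len(text_vectors[0]), len(image_vectors[0]))  # embedding dimensionalities
```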


## GLOT500LORA

```python
import torch
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
from typing import List


class GLOT500LORA:
    def __init__(self, name: str, base: str, adapters: str):
        self.name = name
        self.base = base
        self.adapters = adapters

        self.device = "cuda:0"
        self.model = PeftModel.from_pretrained(
            AutoModel.from_pretrained(base), adapters
        )
        self.model.to(self.device)

        self.text_transform = AutoTokenizer.from_pretrained(base, use_fast=False)

    def get_text_embedding(self, contents: List[str]) -> List[List[float]]:
        inputs = self.text_transform(
            contents, return_tensors="pt", padding=True, truncation=True
        ).to(self.device)

        with torch.no_grad():
            outputs = self.model(**inputs)
            embeddings = outputs.last_hidden_state
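            # Mean-pool over non-padding tokens: weight each token by the attention mask and average.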
            mask = (
                inputs["attention_mask"].unsqueeze(-1).expand(embeddings.size()).float()
            )
            embeddings = torch.sum(embeddings * mask, 1) / torch.clamp(
                mask.sum(1), min=1e-9
            )

        return embeddings.cpu().numpy().tolist()
```
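A usage sketch for the class above; `cis-lmu/glot500-base` is assumed here as the GLOT500 base checkpoint, and the adapter path is a placeholder for one of the text adapters listed earlier.

```python
# Assumed base checkpoint and placeholder adapter path; substitute as needed.
encoder = GLOT500LORA(
    name="glot500-lora",
    base="cis-lmu/glot500-base",
    adapters="glot-contrastive-final-lora",
)

vectors = encoder.get_text_embedding(["excellent camera quality", "کیفیت دوربین عالی"])
print(len(vectors), len(vectors[0]))  # number of sentences, embedding dimensionality
```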