DolArs committed
Commit 5edd45c · 1 Parent(s): b0e42f8

updated READme

Files changed (1):
  1. README.md +7 -193
README.md CHANGED
@@ -1,194 +1,8 @@
- # Multimodal RAG System
-
- ## Project Description
-
- This project implements a **Multimodal Retrieval-Augmented Generation (RAG)** system that combines text and image data to retrieve relevant articles from **The Batch**. The system allows you to:
- - Retrieve data based on user queries using text **and** visual embeddings.
- - Perform **Classical RAG** (text-based search) and **Multimodal RAG** (combined text + image search).
- - Generate AI-powered answers to queries using **Large Language Models (LLMs)**.
- - Provide users with an interactive interface for exploring results.
-
- **Key Feature**: By combining textual and visual content, the system improves the relevance of search results and the overall user experience.
-
13
  ---
- ## How to Run the Project
-
- Follow these steps to set up and run the system locally on your machine.
-
- ### **1. Clone the Repository**
- Start by cloning the project repository to your local machine:
- ```bash
- git clone https://github.com/DolAr1610/Multimodal_RAG.git
- cd Multimodal_RAG
- ```
-
- ### **2. Install Dependencies**
- Install the required Python libraries listed in `requirements.txt`:
- ```bash
- pip install -r requirements.txt
- ```
-
- ### **3. Prepare the Data**
- Ensure the parsed articles are saved as a JSON file (`data/articles_export.json`) before running the system.
-
- #### **Option 1: Generate Data**
- If the articles are not yet parsed, run the parser:
- ```bash
- python -m data.parser
- ```
- #### **Option 2: Use Pre-Generated Data**
- Alternatively, use the pre-generated `articles_export.json` located in the `data/` directory.
-
- ### **4. Generate Vector Databases (First-Time Setup)**
- If this is your first run, you need to create the vector databases for text and images:
- ```bash
- python -m ingestion.ingest_run  # Create vector databases for text and image embeddings
- ```
- This step ensures the Chroma vector databases are properly initialized and indexed.
-
- ### **5. Launch the Application**
- Run the Streamlit application to access the interactive user interface:
- ```bash
- streamlit run main.py
- ```
-
-
- ## Key Features
-
- ### **1. Parsing Articles and Metadata**
-
- The system collects articles, including text, metadata, and associated images, from **The Batch** using web-scraping techniques.
-
- - **Objective:** Extract text content (title, description, publication date), metadata, and associated images.
- - **How it Works:**
-   - **Selenium:** Handles dynamic website elements like pagination ("Load More", "Older Posts").
-   - **BeautifulSoup:** Extracts article text, metadata, and image URLs from the HTML.
- - **Output:** Articles are stored in a structured JSON format:
- ```json
- {
-   "title": "Article Title",
-   "description": "Short Description",
-   "image_url": "https://example.com/image.jpg",
-   "date": "2024-10-11",
-   "content": "The main content of the article...",
-   "source_url": "https://thebatch.org/example-article"
- }
- ```
- **Scripts**:
- - `initialize_driver()`: Configures the Selenium WebDriver for site interaction.
- - `parse_article(url)`: Extracts the title, description, metadata, and images of an article.
- - `run_parser_and_save_to_json()`: Runs the full parsing process and saves the results to JSON.
-
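As an illustration of the export step, the save logic could be sketched as follows (an editor's sketch, not the project's actual code; the helper names here are assumptions):

```python
import json

# Fields of the JSON schema shown above.
REQUIRED_FIELDS = ("title", "description", "image_url", "date", "content", "source_url")

def normalize_article(raw):
    """Keep only the schema fields, filling any missing one with an empty string."""
    return {field: raw.get(field, "") for field in REQUIRED_FIELDS}

def save_articles(articles, path="data/articles_export.json"):
    """Write the normalized articles to the JSON export file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump([normalize_article(a) for a in articles], f, ensure_ascii=False, indent=2)
```

Normalizing before saving keeps the export schema stable even if the scraper misses a field on some pages.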
- ---
-
- ### **2. Building Vector Databases**
-
- To enable efficient multimodal retrieval, the system creates **two separate vector databases**: one for text and one for images.
-
- #### **Text Index**
- - **Model:** The text index uses **SentenceTransformer (E5)** to generate embeddings.
- - **Process:**
-   - Articles are preprocessed with `chunk_text()`, which splits longer texts into smaller chunks (400 words with a 50-word overlap).
-   - Chunks and their embeddings are stored in a **Chroma** database.
-
- #### **Image Index**
- - **Model:** Image embeddings are generated with **OpenAI CLIP** (`clip-vit-large-patch14-336`).
- - **Process:**
-   - Images are fetched from their URLs and converted into embeddings.
-   - Embeddings and metadata are stored in a **Chroma** database.
-
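The chunking step described above can be sketched like this (a word-based illustration by the editor, assuming `chunk_text()` splits on whitespace; the actual implementation may differ):

```python
def chunk_text(text, chunk_size=400, overlap=50):
    """Split text into overlapping chunks of at most chunk_size words."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final chunk has been emitted
        start += chunk_size - overlap  # step forward, keeping `overlap` words of context
    return chunks
```

The overlap ensures that a sentence cut at a chunk boundary still appears intact in the neighboring chunk, which helps retrieval.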
- ---
-
- ### **3. Embedding Integration**
-
- Text and image embeddings are created independently to improve retrieval performance.
-
- - **Text Integration:** Articles are preprocessed, converted into embeddings with **E5**, and indexed.
- - **Image Integration:** Image URLs are retrieved, processed, and added to the image index using **CLIP** embeddings.
-
- **Why Separate Databases?** Keeping the modalities in separate indexes lets each one use the model best suited to it, and each database can be tuned and queried independently.
-
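Both indexes ultimately rely on nearest-neighbor search over embedding vectors. The underlying similarity measure can be illustrated with plain cosine similarity (an editor's sketch for intuition; Chroma computes this internally):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

A query embedding is compared against stored embeddings this way; the closest vectors (highest similarity) are returned as hits.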
- ---
-
- ### **4. Search System**
-
- The system supports two kinds of search: text-only and multimodal.
-
- #### **1. Classical Search (Text-Based RAG)**
- - Searches the text database only.
- - Finds articles that are highly relevant to the user query.
- - Always provides accompanying images from the relevant articles.
- - **Implementation:** `classical_search()`.
-
- #### **2. Multimodal Search (Text + Image RAG)**
- - Leverages both the text and image databases.
- - Finds the best-matching text and image independently:
-   - Searches the text index for embeddings relevant to the query.
-   - Simultaneously searches the image index for matching image embeddings.
-   - Combines the results into multimodal pairs.
- - **Implementation:** `best_pair_search()`.
-
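The pairing step could be sketched as follows (editor's illustration with hypothetical hit dictionaries; `best_pair_search()` itself also handles embedding the query and querying the two indexes):

```python
def best_pair(text_hits, image_hits):
    """Pick the top text hit and top image hit; lower distance means a closer match."""
    best_text = min(text_hits, key=lambda hit: hit["distance"])
    best_image = min(image_hits, key=lambda hit: hit["distance"])
    return {"text": best_text, "image": best_image}
```

Because the two indexes are searched independently, the returned image need not come from the same article as the returned text.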
- **Output Example**:
- ```json
- {
-   "title": "AI in Healthcare",
-   "description": "How AI is revolutionizing medicine.",
-   "image_url": "https://thebatch.org/healthcare-ai.jpg",
-   "date": "2024-10-11",
-   "source_url": "https://thebatch.org/ai-healthcare",
-   "content": "Artificial intelligence is transforming healthcare with personalized approaches..."
- }
- ```
- ---
-
- ### **5. Answer Generation Using LLM**
-
- The system integrates a **Large Language Model (LLM)** to generate responses based on the content of the retrieved articles.
-
- #### **Model**
- - The system uses **meta-llama/llama-3-8b-instruct**, accessed via the **OpenRouter API**.
-
- #### **Process**
- 1. The retrieved article context (text fragments) is passed to the LLM.
- 2. The model generates detailed answers while adhering strictly to the provided context.
- 3. If the query cannot be answered from the available context, the system returns a fallback response:
- > **"Sorry, I could not find the answer in the provided context."**
-
- #### **Implementation**
- - The function `generate_response()` is responsible for:
-   - Extracting text from the articles as context.
-   - Sending the context to the LLM.
-   - Generating the user-facing response.
-
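The context-assembly part of this step might look like the following (an editor's sketch; the function and field names are assumptions, and the actual OpenRouter API call is omitted):

```python
FALLBACK = "Sorry, I could not find the answer in the provided context."

def build_prompt(query, articles):
    """Combine retrieved article text into a context-grounded prompt for the LLM."""
    context = "\n\n".join(article["content"] for article in articles)
    return (
        "Answer the question using ONLY the context below. "
        f'If the answer is not in the context, reply: "{FALLBACK}"\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```

Embedding the fallback instruction in the prompt is what keeps the model's answers constrained to the retrieved articles.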
- ---
-
- ### **6. Interactive User Interface**
-
- The system includes an interactive **Streamlit-based UI**, designed for a smooth experience when exploring the data.
-
- #### **Features**
- 1. **Query Input:**
-    - Users can enter text queries.
-    - They can choose between **Classical RAG** (text-only search) and **Multimodal RAG** (text + image search).
- 2. **Result Display:**
-    - Lists the retrieved articles with:
-      - Metadata (title, description, publication date).
-      - Accompanying images.
-      - Key fragments of the text content.
-    - Includes a button to generate detailed responses from the LLM.
-
- ## **Summary**
-
- This project explores the integration of multimodal content (text + images) and retrieval-augmented generation (RAG), combining modern NLP and computer vision models to provide users with:
-
- - **Contextual Search Results:** Retrieve precise matches using text and visual embeddings.
- - **LLM Responses:** Generate detailed answers with an LLM served via OpenRouter.
- - **Interactive UI:** Streamlined user interaction through Streamlit.
-
- ---
- ## **Demo Video**
-
- Below is a quick demonstration of how the system works:
-
- Watch the demo video on [Google Drive](https://drive.google.com/file/d/1wd8QJfZYaPdwYy7qyCH4NeuQ0ZFNbW-K/view?usp=sharing).
  ---
+ title: Multimodal RAG System
+ emoji: 🤖
+ colorFrom: blue
+ colorTo: purple
+ sdk: docker
+ pinned: false
+ ---