Arsen Dolichnyi committed on
Commit 42b2b3c · 0 Parent(s):

basic version with classic and multimodal rag
.gitignore ADDED

# created by virtualenv automatically
.env
/venv
.idea
__pycache__/
chroma_db_text/
chroma_db_images/
README.md ADDED

# Multimodal RAG System

## Project Description

This project implements a **Multimodal Retrieval-Augmented Generation (RAG)** system that combines text and image data to retrieve relevant articles from **The Batch**. The system allows you to:
- Retrieve data based on user queries using text **and** visual embeddings.
- Perform **Classical RAG** (text-based search) and **Multimodal RAG** (combined text + image search).
- Generate AI-powered answers to queries using **Large Language Models (LLMs)**.
- Provide users with an interactive interface for exploring results.

**Key Feature**: By combining textual and visual content, this system improves the relevance of search results and the overall user experience.

---
## How to Run the Project

Follow these steps to set up and run the system locally on your machine.

### **1. Clone the Repository**
Start by cloning the project repository onto your local machine:
```bash
git clone https://github.com/DolAr1610/Multimodal_RAG.git
cd Multimodal_RAG
```

### **2. Install Dependencies**
Install all the required Python libraries specified in `requirements.txt`:
```bash
pip install -r requirements.txt
```

### **3. Prepare the Data**
Ensure that the parsed articles are saved as a JSON file (`data/articles_export.json`) before running the system.

#### **Option 1: Generate Data**
If the articles are not yet parsed, you can run the parser:
```bash
python parser.py
```
#### **Option 2: Use Pre-Generated Data**
Alternatively, use the pre-generated `articles_export.json` located in the `data/` directory.

### **4. Generate Vector Databases (First-Time Setup)**
If this is your first run, you need to create the vector databases for text and images:
```bash
python ingest_run.py  # Create vector databases for text and image embeddings
```
This step ensures the Chroma vector databases are properly initialized and indexed.

### **5. Launch the Application**
Run the Streamlit application to access the interactive user interface:
```bash
streamlit run main.py
```

## Key Features

### **1. Parsing Articles and Metadata**

The system collects articles, including text, metadata, and associated images, from **The Batch** using web-scraping techniques.

- **Objective:** Extract text content (title, description, publication date), metadata, and associated images.
- **How it Works**:
  - **Selenium:** Handles dynamic website elements like pagination ("Load More", "Older Posts").
  - **BeautifulSoup:** Extracts article text, metadata, and image URLs from HTML.
- **Output:** Articles stored in a structured JSON format as follows:
```json
{
  "title": "Article Title",
  "description": "Short Description",
  "image_url": "https://example.com/image.jpg",
  "date": "2024-10-11",
  "content": "The main content of the article...",
  "source_url": "https://thebatch.org/example-article"
}
```
**Scripts**:
- `initialize_driver()`: Configures the Selenium WebDriver for site interaction.
- `parse_article(url)`: Extracts the title, description, metadata, and images of an article.
- `run_parser_and_save_to_json()`: Runs the entire parsing and filtering process.
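For illustration, records with this schema can be loaded and sanity-checked as follows. This is a hypothetical helper, not part of the repository; `validate_article` and `REQUIRED_KEYS` are illustrative names, and the field list matches the JSON example above.

```python
import json

# Fields the downstream indexers read from each record (per the schema above).
REQUIRED_KEYS = {"title", "description", "image_url", "date", "content", "source_url"}


def validate_article(article: dict) -> bool:
    """Return True if the record carries every field the indexers rely on."""
    return REQUIRED_KEYS.issubset(article)


def load_articles(path: str) -> list:
    """Load the export file and drop any malformed records."""
    with open(path, "r", encoding="utf-8") as f:
        articles = json.load(f)
    return [a for a in articles if validate_article(a)]
```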
---

### **2. Building Vector Databases**

To enable efficient multimodal retrieval, the system creates **two separate vector databases**: one for text and one for images.

#### **Text Index**
- **Model:** The text index leverages **SentenceTransformer (E5)** for generating embeddings.
- **Process:**
  - Articles are preprocessed using `chunk_text()` to split larger texts into smaller chunks (400 words with a 50-word overlap).
  - Chunks and embeddings are stored in a **Chroma** database.
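The chunking step can be sketched as a sliding word window, using the 400/50 defaults described above:

```python
def chunk_text(text, chunk_size=400, overlap=50):
    # Split on whitespace and emit fixed-size word windows; consecutive
    # windows share `overlap` words, so a sentence cut at one boundary
    # still appears intact in the neighboring chunk.
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunks.append(" ".join(words[i:i + chunk_size]))
        i += chunk_size - overlap
    return chunks
```

With 1,000 words and the defaults, this yields three chunks whose boundaries overlap by 50 words.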
#### **Image Index**
- **Model:** Image embeddings are generated using **OpenAI CLIP** (`clip-vit-large-patch14-336`).
- **Process:**
  - Images are accessed via URLs and transformed into embeddings.
  - Embeddings and metadata are stored in a **Chroma** database.

---

### **3. Embedding Integration**

The text and image embeddings are created independently to enhance retrieval performance.

- **Text Integration:** Articles are preprocessed, converted into embeddings using **E5**, and indexed.
- **Image Integration:** Image URLs are retrieved, processed, and added to the image index using embeddings from **CLIP**.

**Why Separate Databases?**: Keeping one database per modality lets each use the strongest model for that modality, and each index can be optimized and queried independently.
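Because both embedders L2-normalize their vectors before storage, relevance within each index reduces to a dot product (cosine similarity). A minimal, dependency-free illustration:

```python
import math


def normalize(v):
    # Scale a vector to unit length (L2 norm of 1).
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]


def cosine(u, v):
    # For unit-length vectors, the dot product *is* the cosine similarity:
    # 1.0 means identical direction, 0.0 means orthogonal (unrelated).
    return sum(a * b for a, b in zip(u, v))


query_vec = normalize([1.0, 2.0, 2.0])
doc_vec = normalize([0.0, 1.0, 0.0])
score = cosine(query_vec, doc_vec)  # always in [-1, 1]; higher = more similar
```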
---

### **4. Search System**

The system provides two types of search: text-only and multimodal.

#### **1. Classical Search (Text-Based RAG)**
- Focuses exclusively on the text database.
- Finds articles that are highly relevant to the user query.
- Always provides accompanying images from relevant articles.
- **Implementation:** `classical_search()`.

#### **2. Multimodal Search (Text + Image RAG)**
- Leverages both text and image databases.
- Locates the best-matching text and an independent image:
  - Finds relevant text embeddings for the query in the text index.
  - Simultaneously searches for image embeddings matching the query in the image index.
  - Combines the results into multimodal pairs.
- **Implementation:** `best_pair_search()`.

**Output Example**:
```json
{
  "title": "AI in Healthcare",
  "description": "How AI is revolutionizing medicine.",
  "image_url": "https://thebatch.org/healthcare-ai.jpg",
  "date": "2024-10-11",
  "source_url": "https://thebatch.org/ai-healthcare",
  "content": "Artificial intelligence is transforming healthcare with personalized approaches..."
}
```
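The pairing step can be sketched as zipping the two ranked result lists rank by rank. This is a simplified stand-in for `best_pair_search()`; here `text_hits` and `image_hits` stand for the metadata dicts each index returns.

```python
def pair_results(text_hits, image_hits, k=3):
    # Rank i of the text search is paired with rank i of the image search;
    # when one list is shorter than k, the missing fields stay None.
    pairs = []
    for i in range(k):
        text_meta = text_hits[i] if i < len(text_hits) else {}
        image_meta = image_hits[i] if i < len(image_hits) else {}
        pairs.append({
            "title": text_meta.get("title"),
            "content": text_meta.get("content"),
            "image_url": image_meta.get("image_url"),
        })
    return pairs
```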
---

### **5. Answer Generation Using LLM**

The system integrates a **Large Language Model (LLM)** to generate responses based on the content of retrieved articles.

#### **Model**
- The system uses **meta-llama/llama-3-8b-instruct**, integrated via the **OpenRouter API**.

#### **Process**
1. Retrieved article context (text fragments) is passed to the LLM.
2. The model generates detailed answers while adhering strictly to the provided context.
3. If the query cannot be addressed due to insufficient context, the system returns a fallback response:
> **"Sorry, I could not find the answer in the provided context."**

#### **Implementation**
- The function `generate_response()` is responsible for:
  - Extracting text from articles as context.
  - Sending the context to the LLM.
  - Generating user-facing responses.
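The context-assembly step can be sketched as follows. This is a simplified version of what `generate_response()` builds before calling the OpenRouter API; the field names match the article schema.

```python
def build_context(retrieved_docs):
    # Concatenate the retrieved articles into one context block that is
    # prepended to the system prompt; missing fields fall back to "N/A".
    return "\n\n".join(
        f"Title: {doc.get('title', 'N/A')}\n"
        f"Description: {doc.get('description', 'N/A')}\n"
        f"Content: {doc.get('content', 'N/A')}"
        for doc in retrieved_docs
    )
```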
---

### **6. Interactive User Interface**

The system includes an interactive **Streamlit-based UI**, designed for a smooth user experience when exploring data.

#### **Features**
1. **Query Input:**
   - Users can input text queries.
   - They can choose between **Classical RAG** (text-only search) or **Multimodal RAG** (text + image search).
2. **Result Display:**
   - Lists retrieved articles with:
     - Metadata (title, description, publication date).
     - Accompanying images.
     - Key fragments of text content.
   - Includes a button to generate detailed responses from the LLM.

## **Summary**

This project explores the integration of multimodal content (text + images) and retrieval-augmented generation (RAG), incorporating cutting-edge NLP and computer vision models to provide users with:

**Contextual Search Results:** Retrieve precise matches using text and visual embeddings seamlessly.

**LLM Responses:** Generate detailed answers with an open LLM (Llama 3) served via OpenRouter.

**Interactive UI:** Streamlined user interaction through Streamlit.

---
## **Demo Video**

Below is a quick demonstration of how the system works:

Watch the demo video on [Google Drive](https://drive.google.com/file/d/1wd8QJfZYaPdwYy7qyCH4NeuQ0ZFNbW-K/view?usp=sharing).
data/__init__.py ADDED
File without changes
data/articles_export.json ADDED
The diff for this file is too large to render. See raw diff
 
data/parser.py ADDED

import time
from datetime import datetime
from bs4 import BeautifulSoup
import json
import requests

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from webdriver_manager.chrome import ChromeDriverManager

BASE_TAG_URL = "https://www.deeplearning.ai/the-batch/tag/"
VALID_CATEGORIES = [
    "letters",
    "data-points",
    "research",
    "business",
    "science",
    "culture",
    "hardware",
    "ai-careers"
]


def initialize_driver():
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    options.add_argument('--no-sandbox')
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    return driver


def load_all_articles(driver, url):
    wait = WebDriverWait(driver, 20)
    driver.get(url)
    time.sleep(3)

    category = url.split('/')[-2]
    all_articles_links = set()

    if category == "letters":
        last_url = ""
        while True:
            current_links = get_article_links_from_page(driver)
            all_articles_links.update(current_links)
            print(f"Collected {len(current_links)} articles on the current page in '{category}'")

            try:
                older_button = wait.until(
                    EC.element_to_be_clickable((By.CLASS_NAME, "justify-self-end"))
                )
                driver.execute_script("arguments[0].scrollIntoView({block: 'end'});", older_button)
                time.sleep(1)
                older_button.click()
                print(f"Clicked 'Older Posts' in '{category}'...")
                time.sleep(2)

                current_url = driver.current_url
                if current_url == last_url:
                    print("The URL did not change after the click; stopping the 'Older Posts' pagination.")
                    break
                last_url = current_url

            except (TimeoutException, NoSuchElementException):
                print("There is no 'Older Posts' button. Moving on to the next category.")
                break

    else:
        while True:
            current_links = get_article_links_from_page(driver)
            all_articles_links.update(current_links)
            print(f"Collected {len(current_links)} articles on the current page in '{category}'")

            try:
                load_more_button = wait.until(
                    EC.element_to_be_clickable((By.CLASS_NAME, "buttons_secondary__8o9u6"))
                )
                driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", load_more_button)
                time.sleep(1)
                load_more_button.click()
                print(f"Clicked 'Load More' in '{category}'...")
                time.sleep(2)
            except (TimeoutException, NoSuchElementException):
                print(f"The 'Load More' button is unavailable or missing in '{category}'. Moving to the next category.")
                break

    return list(all_articles_links)


def get_article_links_from_page(driver):
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    all_links = set()
    for a in soup.find_all("a", href=True):
        href = a['href']
        if href.startswith("/the-batch/") and not href.startswith("/the-batch/tag/"):
            full_url = "https://www.deeplearning.ai" + href
            if "issue" not in full_url:
                all_links.add(full_url)
    return list(all_links)


def get_article_links():
    driver = initialize_driver()
    all_links = set()

    for category in VALID_CATEGORIES:
        url = f"{BASE_TAG_URL}{category}/"
        print(f"Loading the category: {url}")
        category_links = load_all_articles(driver, url)
        print(f"Found {len(category_links)} articles in category '{category}'")
        all_links.update(category_links)

    driver.quit()
    return list(all_links)


def parse_article(url, max_retries=3, delay=2):
    attempts = 0
    while attempts < max_retries:
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()

            soup = BeautifulSoup(response.text, "html.parser")

            h1 = soup.find("h1")
            title = h1.get_text(strip=True) if h1 else ""
            description = ""
            if h1:
                span = h1.find("span")
                if span:
                    description = span.get_text(strip=True)
                    span.extract()
                    title = h1.get_text(strip=True)

            image_tag = soup.find("meta", attrs={"property": "og:image"})
            image_url = image_tag["content"] if image_tag else None

            date_meta = soup.find("meta", attrs={"property": "article:published_time"})
            date_str = ""
            if date_meta:
                try:
                    date_raw = date_meta["content"]
                    date_str = datetime.fromisoformat(date_raw.split("T")[0]).strftime("%Y-%m-%d")
                except Exception:
                    date_str = date_meta["content"]

            content = ""
            main_content = soup.find("div", class_="prose--styled")

            if main_content:
                paragraphs = main_content.find_all(["p", "li"])
                content_lines = [p.get_text(strip=True) for p in paragraphs]
                content = "\n".join(content_lines)

            time.sleep(delay)

            return {
                "title": title.strip(),
                "description": description.strip(),
                "image_url": image_url,
                "date": date_str,
                "content": content.strip(),
                "source_url": url,
            }

        except Exception as e:  # covers requests.RequestException and parsing errors alike
            attempts += 1
            print(f"Error parsing URL {url} (Attempt {attempts}/{max_retries}): {e}")
            time.sleep(delay * attempts)

    print(f"Article skipped due to repeated errors: {url}")
    return None


def run_parser_and_save_to_json(output_filename="data/articles_export.json"):
    print("Starting to parse article links...")
    all_article_urls = get_article_links()
    print(f"{len(all_article_urls)} unique links to articles collected.")

    parsed_articles = []
    print("\nStarting to parse article content...")
    for i, url in enumerate(all_article_urls):
        print(f"Parsing article {i + 1}/{len(all_article_urls)}: {url}")
        article_data = parse_article(url)
        if article_data:
            parsed_articles.append(article_data)

    print(f"\nParsing completed. {len(parsed_articles)} articles collected.")

    with open(output_filename, "w", encoding="utf-8") as f:
        json.dump(parsed_articles, f, ensure_ascii=False, indent=4)
    print(f"All articles are saved in '{output_filename}'")

    print("\nStarting to filter articles...")
    try:
        with open(output_filename, "r", encoding="utf-8") as f:
            articles_to_filter = json.load(f)
    except FileNotFoundError:
        print(f"File '{output_filename}' not found for filtering.")
        articles_to_filter = []

    initial_count = len(articles_to_filter)
    filtered_articles = [a for a in articles_to_filter if a.get("content") != "[image]"]
    filtered_count = len(filtered_articles)

    print(f"Articles before filtering: {initial_count}")
    print(f"Articles after filtering: {filtered_count}")

    with open(output_filename, "w", encoding="utf-8") as f:
        json.dump(filtered_articles, f, ensure_ascii=False, indent=4)
    print(f"Filtered articles saved in '{output_filename}'")


if __name__ == "__main__":
    import os

    os.makedirs("data", exist_ok=True)
    run_parser_and_save_to_json()
data/the-batch-logo.webp ADDED
db/__init__.py ADDED
File without changes
db/image_db.py ADDED

from langchain_community.vectorstores import Chroma


def init_chroma_image():
    # No embedding_function: image embeddings are always supplied precomputed.
    vectordb = Chroma(
        collection_name="rag_collection_images",
        persist_directory="chroma_db_images"
    )
    return vectordb


def add_document_image(vectordb, doc_id, embedding, metadata):
    # Write directly to the underlying collection so a precomputed CLIP
    # embedding can be stored instead of a document to embed.
    vectordb._collection.add(
        embeddings=[embedding],
        documents=["[image]"],
        metadatas=[metadata],
        ids=[doc_id]
    )
db/text_db.py ADDED

from langchain_community.vectorstores import Chroma
from embeddings.text_embedder import TextEmbeddings


def init_chroma():
    return Chroma(
        collection_name="text_collection",
        embedding_function=TextEmbeddings(),
        persist_directory="chroma_db_text"
    )


def add_document_text(vectordb, doc_id, embedding, document_text, metadata):
    # Write directly to the underlying collection so the precomputed
    # embedding is stored alongside the chunk text and metadata.
    vectordb._collection.add(
        embeddings=[embedding],
        documents=[document_text],
        metadatas=[metadata],
        ids=[doc_id]
    )
embeddings/__init__.py ADDED
File without changes
embeddings/image_embedder.py ADDED

from transformers import AutoProcessor, AutoModel
from PIL import Image
from io import BytesIO
import torch
import requests

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained("openai/clip-vit-large-patch14-336").to(device)
processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14-336")


def get_image_embedding(image_url):
    try:
        response = requests.get(image_url, timeout=10)
        img = Image.open(BytesIO(response.content)).convert('RGB')
        inputs = processor(images=img, return_tensors="pt").to(device)
        with torch.no_grad():
            emb = model.get_image_features(**inputs)
            emb = emb / emb.norm(p=2, dim=-1, keepdim=True)
        return emb[0].cpu().numpy().tolist()
    except Exception as e:
        print(f"Image loading failed: {e}")
        return None


def get_text_embedding_clip(text_query):
    inputs = processor(text=[text_query], return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
        emb = emb / emb.norm(p=2, dim=-1, keepdim=True)
    return emb[0].cpu().numpy().tolist()
embeddings/text_embedder.py ADDED

from sentence_transformers import SentenceTransformer
from langchain_core.embeddings import Embeddings
import torch
import re
import emoji
import unicodedata
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

device = "cuda" if torch.cuda.is_available() else "cpu"
text_model = SentenceTransformer("intfloat/e5-base", device=device)


def preprocess_text(text):
    text = emoji.replace_emoji(text, replace='')
    text = ''.join(c for c in text if unicodedata.category(c)[0] != 'C')
    text = text.lower()
    words = re.findall(r'\b[a-z]+\b', text)
    filtered = [w for w in words if w not in stop_words]

    return " ".join(filtered)


def get_text_embedding(text):
    clean_text = preprocess_text(text)
    emb = text_model.encode(clean_text, convert_to_numpy=True, normalize_embeddings=True)
    return emb.tolist()


class TextEmbeddings(Embeddings):
    def embed_documents(self, texts):
        return [get_text_embedding(text) for text in texts]

    def embed_query(self, text):
        return get_text_embedding(text)
ingestion/__init__.py ADDED
File without changes
ingestion/config.py ADDED

JSON_PATH = "data/articles_export.json"
ingestion/ingest_image.py ADDED

import json
from tqdm import tqdm
from db.image_db import init_chroma_image, add_document_image
from embeddings.image_embedder import get_image_embedding
from config import JSON_PATH


def ingest_images():
    vectordb = init_chroma_image()

    with open(JSON_PATH, "r", encoding="utf-8") as f:
        articles = json.load(f)
    print(f"Found {len(articles)} articles.")

    for article in tqdm(articles, desc="Indexing images"):
        metadata = {
            "title": article.get("title", ""),
            "description": article.get("description", ""),
            "image_url": article.get("image_url", ""),
            "date": article.get("date", ""),
            "content": article.get("content", ""),
            "source_url": article.get("source_url", "")
        }

        image_url = metadata["image_url"]
        if not image_url:
            print(f"No image in: {metadata['title']}")
            continue

        emb = get_image_embedding(image_url)
        if emb:
            doc_id = f"{metadata['source_url']}#image" if metadata["source_url"] else f"image#{metadata['title']}"
            add_document_image(vectordb, doc_id, emb, {**metadata, "modality": "image"})
        else:
            print(f"Failed to embed image for: {metadata['title']}")

    print("Done indexing images.")
ingestion/ingest_run.py ADDED

import time

from ingest_text import ingest_texts
from ingest_image import ingest_images


def main():
    try:
        print("1. Ingesting text embeddings...")
        start_time = time.time()
        ingest_texts()
        print(f"Text embeddings processed successfully in {time.time() - start_time:.2f} seconds.")

        print("2. Ingesting image embeddings...")
        start_time = time.time()
        ingest_images()
        print(f"Image embeddings processed successfully in {time.time() - start_time:.2f} seconds.")

    except Exception as e:
        print(f"An error occurred during ingestion: {e}")
    else:
        print("All embeddings successfully ingested.")
    finally:
        print("Done.")


if __name__ == "__main__":
    main()
ingestion/ingest_text.py ADDED

import json
from tqdm import tqdm
from db.text_db import init_chroma, add_document_text
from embeddings.text_embedder import get_text_embedding
from config import JSON_PATH


def chunk_text(text, chunk_size=400, overlap=50):
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunk = words[i:i + chunk_size]
        chunks.append(" ".join(chunk))
        i += chunk_size - overlap
    return chunks


def ingest_texts():
    vectordb = init_chroma()

    with open(JSON_PATH, "r", encoding="utf-8") as f:
        articles = json.load(f)
    print(f"Found {len(articles)} articles.")

    for article in tqdm(articles, desc="Indexing texts"):
        full_text = f"{article.get('title', '')}\n{article.get('description', '')}\n{article.get('content', '')}"

        metadata = {
            "title": article.get("title", ""),
            "description": article.get("description", ""),
            "image_url": article.get("image_url", ""),
            "date": article.get("date", ""),
            "content": article.get("content", ""),
            "source_url": article.get("source_url", "")
        }

        chunks = chunk_text(full_text, chunk_size=400, overlap=50)

        for i, chunk in enumerate(chunks):
            emb = get_text_embedding(chunk)
            if emb:
                doc_id = f"{metadata['source_url']}#chunk{i}" if metadata["source_url"] else f"{metadata['title']}#chunk{i}"
                add_document_text(vectordb, doc_id, emb, chunk, metadata)
            else:
                print(f"Failed to embed chunk {i} of {metadata['title']}")

    print("Done indexing texts.")
llm.py ADDED

import os
import requests
from dotenv import load_dotenv

load_dotenv()
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")


def generate_response(question, retrieved_docs, model="meta-llama/llama-3-8b-instruct"):
    context = "\n\n".join(
        f"Title: {doc.get('title', 'N/A')}\n"
        f"Description: {doc.get('description', 'N/A')}\n"
        f"Content: {doc.get('content', 'N/A')}\n"
        for doc in retrieved_docs
    )

    prompt = (
        "You are a polite assistant who provides clear and detailed answers based solely on the information from The Batch articles.\n\n"
        "Rules:\n"
        "- Answer only using the knowledge from The Batch articles.\n"
        "- Do not mention other sources or questions; provide only accurate, detailed, and understandable answers.\n"
        "- If the information is present in the context, give a clear answer.\n"
        "- If the information is missing, respond with: 'Sorry, I could not find the answer in the provided context.'\n"
        "- Do not guess, fabricate information, or go beyond the given context.\n\n"
        f"Context for the answer:\n{context}"
    )

    headers = {
        "Authorization": f"Bearer {OPENROUTER_API_KEY}",
        "Content-Type": "application/json"
    }

    data = {
        "model": model,
        "messages": [
            {"role": "system", "content": prompt},
            {"role": "user", "content": question}
        ],
        "temperature": 0.3
    }

    response = requests.post("https://openrouter.ai/api/v1/chat/completions", headers=headers, json=data)

    if response.status_code == 200:
        return response.json()['choices'][0]['message']['content'].strip()
    else:
        return f"Error: {response.status_code} - {response.text}"
main.py ADDED

import streamlit as st
from search.search_classical import classical_search
from search.search_best_pair import best_pair_search
from llm import generate_response

st.set_page_config(page_title="🔍 Multimodal Search The Batch")
st.image("data/the-batch-logo.webp", width=300)
st.title("Multimodal Assistant")

mode = st.selectbox("🔎 Select the search mode:", ["Classical RAG", "Multimodal RAG"])
query = st.text_input("📝 Enter the text query:")

if query:
    if mode == "Classical RAG":
        results = classical_search(query, k=3)
    else:
        results = best_pair_search(query, k=3)

    st.markdown(f"### 📄 Results found: {len(results)}")

    for i, meta in enumerate(results):
        st.markdown(f"### 🔹 Result {i + 1}")
        if meta.get("title"):
            st.markdown(f"**📖 Title:** {meta['title']}")
        if meta.get("date"):
            st.markdown(f"**📅 Date of publication:** {meta['date']}")
        if meta.get("description"):
            st.markdown(f"**📝 Description:** {meta['description']}")
        if meta.get("image_url"):
            st.image(meta["image_url"], use_container_width=True)
        if meta.get("content"):
            st.markdown("**📚 Part of the article:**")
            st.write(meta["content"][:500] + "...")
        if meta.get("source_url"):
            st.markdown(f"[🔗 Read the full article →]({meta['source_url']})")
        st.markdown("---")

    if st.button("🧠 Generate a response to a query"):
        docs = [
            {
                "title": meta.get("title", ""),
                "description": meta.get("description", ""),
                "content": meta.get("content", "")
            }
            for meta in results
        ]

        response = generate_response(query, docs)
        st.markdown("### 🤖 Generated Response:")
        st.success(response)
requirements.txt ADDED

streamlit
langchain
sentence-transformers
transformers
torch
chromadb
nltk
requests
tqdm
python-dotenv
beautifulsoup4
selenium
webdriver-manager
langchain_community
emoji
search/__init__.py ADDED
File without changes
search/search_best_pair.py ADDED

from db.text_db import init_chroma
from db.image_db import init_chroma_image
from embeddings.text_embedder import get_text_embedding
from embeddings.image_embedder import get_text_embedding_clip


def best_pair_search(query, k=3):
    text_db = init_chroma()
    image_db = init_chroma_image()

    text_emb = get_text_embedding(query)
    image_emb = get_text_embedding_clip(query)

    t_res = text_db.similarity_search_by_vector(text_emb, k=k)
    i_res = image_db.similarity_search_by_vector(image_emb, k=k)

    results = []
    for i in range(k):
        text_meta = t_res[i].metadata if i < len(t_res) else {}
        image_meta = i_res[i].metadata if i < len(i_res) else {}

        results.append({
            "title": text_meta.get("title"),
            "description": text_meta.get("description"),
            "date": text_meta.get("date"),
            "source_url": text_meta.get("source_url"),
            "content": text_meta.get("content"),
            "image_url": image_meta.get("image_url") if image_meta else None
        })

    return results
search/search_classical.py ADDED

from db.text_db import init_chroma
from embeddings.text_embedder import get_text_embedding


def classical_search(query, k=5):
    db = init_chroma()
    emb = get_text_embedding(query)
    results = db.similarity_search_by_vector(emb, k=k)
    articles = []
    seen = set()

    for r in results:
        meta = r.metadata
        aid = meta.get("source_url") or meta.get("title")
        if aid not in seen:
            seen.add(aid)
            articles.append({
                "title": meta.get("title"),
                "image_url": meta.get("image_url"),
                "date": meta.get("date"),
                "description": meta.get("description"),
                "content": meta.get("content"),
                "source_url": meta.get("source_url"),
            })

    return articles