Arsen Dolichnyi committed on
Commit · 42b2b3c
Parent(s):
basic version with classic and multimodal rag
Browse files
- .gitignore +7 -0
- README.md +194 -0
- data/__init__.py +0 -0
- data/articles_export.json +0 -0
- data/parser.py +224 -0
- data/the-batch-logo.webp +0 -0
- db/__init__.py +0 -0
- db/image_db.py +18 -0
- db/text_db.py +19 -0
- embeddings/__init__.py +0 -0
- embeddings/image_embedder.py +31 -0
- embeddings/text_embedder.py +38 -0
- ingestion/__init__.py +0 -0
- ingestion/config.py +1 -0
- ingestion/ingest_image.py +37 -0
- ingestion/ingest_run.py +25 -0
- ingestion/ingest_text.py +49 -0
- llm.py +47 -0
- main.py +50 -0
- requirements.txt +15 -0
- search/__init__.py +0 -0
- search/search_best_pair.py +31 -0
- search/search_classical.py +26 -0
.gitignore
ADDED
@@ -0,0 +1,7 @@
# created by virtualenv automatically
.env
/venv
.idea
__pycache__/
chroma_db_text/
chroma_db_images/
README.md
ADDED
@@ -0,0 +1,194 @@
# Multimodal RAG System

## Project Description

This project implements a **Multimodal Retrieval-Augmented Generation (RAG)** system that combines text and image data to retrieve relevant articles from **The Batch**. The system allows you to:
- Retrieve data based on user queries using text **and** visual embeddings.
- Perform **Classical RAG** (text-based search) and **Multimodal RAG** (combined text + image search).
- Generate AI-powered answers to queries using **Large Language Models (LLMs)**.
- Explore the results through an interactive interface.

**Key Feature**: By combining textual and visual content, the system improves the relevance of search results and the overall user experience.

---
## How to Run the Project

Follow these steps to set up and run the system locally on your machine.

### **1. Clone the Repository**
Start by cloning the project repository to your local machine:
```bash
git clone https://github.com/DolAr1610/Multimodal_RAG.git
cd Multimodal_RAG
```

### **2. Install Dependencies**
Install all the required Python libraries listed in requirements.txt:
```bash
pip install -r requirements.txt
```

### **3. Prepare the Data**
Ensure that the parsed articles are saved as a JSON file (data/articles_export.json) before running the system.

#### **Option 1: Generate Data**
If the articles are not yet parsed, run the parser from the repository root:
```bash
python data/parser.py
```
#### **Option 2: Use Pre-Generated Data**
Alternatively, use the pre-generated articles_export.json located in the data/ directory.

### **4. Generate Vector Databases (First-Time Setup)**
On your first run, create the vector databases for text and images:
```bash
python -m ingestion.ingest_run  # create vector databases for text and image embeddings
```
This step ensures the Chroma vector databases are properly initialized and indexed.

### **5. Launch the Application**
Run the Streamlit application to access the interactive user interface:
```bash
streamlit run main.py
```
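Answer generation (`llm.py`) reads an `OPENROUTER_API_KEY` from a `.env` file via `python-dotenv`, so create one in the project root before launching. A minimal `.env` (the value shown is a placeholder, not a real key):

```bash
OPENROUTER_API_KEY=your-openrouter-api-key
```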

## Key Features

### **1. Parsing Articles and Metadata**

The system collects articles, including text, metadata, and associated images, from **The Batch** using web-scraping techniques.

- **Objective:** Extract text content (title, description, publication date), metadata, and associated images.
- **How it Works**:
  - **Selenium:** Handles dynamic website elements such as pagination ("Load More", "Older Posts").
  - **BeautifulSoup:** Extracts article text, metadata, and image URLs from the HTML.
- **Output:** Articles are stored in a structured JSON format:
```json
{
  "title": "Article Title",
  "description": "Short Description",
  "image_url": "https://example.com/image.jpg",
  "date": "2024-10-11",
  "content": "The main content of the article...",
  "source_url": "https://thebatch.org/example-article"
}
```
**Scripts**:
- `initialize_driver()`: Configures the Selenium WebDriver for site interaction.
- `parse_article(url)`: Extracts the title, description, metadata, and image of an article.
- `run_parser_and_save_to_json()`: Runs the entire parsing and filtering process.

---

### **2. Building Vector Databases**

To enable efficient multimodal retrieval, the system creates **two separate vector databases**: one for text and one for images.

#### **Text Index**
- **Model:** The text index uses **SentenceTransformer (E5)** to generate embeddings.
- **Process:**
  - Articles are preprocessed with `chunk_text()`, which splits longer texts into smaller chunks (400 words with a 50-word overlap).
  - Chunks and embeddings are stored in a **Chroma** database.
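The chunking step is small enough to show in full; this mirrors `chunk_text()` from `ingestion/ingest_text.py`, which slides a word window forward by `chunk_size - overlap` words each step:

```python
def chunk_text(text, chunk_size=400, overlap=50):
    """Split text into word windows of chunk_size words, overlapping by `overlap` words."""
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunks.append(" ".join(words[i:i + chunk_size]))
        i += chunk_size - overlap  # advance the window, keeping `overlap` words of context
    return chunks
```

For example, a 1000-word article produces windows starting at words 0, 350, and 700, i.e. three chunks.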

#### **Image Index**
- **Model:** Image embeddings are generated with **OpenAI CLIP** (`clip-vit-large-patch14-336`).
- **Process:**
  - Images are fetched from their URLs and converted into embeddings.
  - Embeddings and metadata are stored in a **Chroma** database.

---

### **3. Embedding Integration**

Text and image embeddings are created independently to improve retrieval performance.

- **Text Integration:** Articles are preprocessed, converted into embeddings with **E5**, and indexed.
- **Image Integration:** Image URLs are retrieved, processed, and added to the image index using **CLIP** embeddings.

**Why Separate Databases?** This lets the system use strong modality-specific models for text and images without sacrificing independence or performance; each database is optimized for its modality.
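Both embedders L2-normalize their vectors before storing them (`normalize_embeddings=True` in `embeddings/text_embedder.py`, an explicit division by the norm in `embeddings/image_embedder.py`), so cosine similarity between any two stored embeddings reduces to a plain dot product. A minimal pure-Python illustration of that property:

```python
import math

def normalize(v):
    """L2-normalize a vector, as both embedders do before storing."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])
cos = dot(a, b)  # 24/25 = 0.96 for these two vectors
```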

---

### **4. Search System**

The system supports two search modes: text-only and multimodal.

#### **1. Classical Search (Text-Based RAG)**
- Searches only the text database.
- Finds articles that are highly relevant to the user query.
- Always returns the accompanying images of the relevant articles.
- **Implementation:** `classical_search()`.

#### **2. Multimodal Search (Text + Image RAG)**
- Uses both the text and image databases.
- Locates the best-matching text and an independently retrieved image:
  - Retrieves the most relevant text embeddings for the query from the text index.
  - Simultaneously searches the image index for image embeddings matching the query.
  - Combines the results into multimodal pairs.
- **Implementation:** `best_pair_search()`.

**Output Example**:
```json
{
  "title": "AI in Healthcare",
  "description": "How AI is revolutionizing medicine.",
  "image_url": "https://thebatch.org/healthcare-ai.jpg",
  "date": "2024-10-11",
  "source_url": "https://thebatch.org/ai-healthcare",
  "content": "Artificial intelligence is transforming healthcare with personalized approaches..."
}
```
---

### **5. Answer Generation Using LLM**

The system integrates a **Large Language Model (LLM)** to generate responses based on the content of the retrieved articles.

#### **Model**
- The system uses **meta-llama/llama-3-8b-instruct**, integrated via the **OpenRouter API**.

#### **Process**
1. The retrieved article context (text fragments) is passed to the LLM.
2. The model generates detailed answers while adhering strictly to the provided context.
3. If the query cannot be answered from the available context, the system returns a fallback response:
> **"Sorry, I could not find the answer in the provided context."**

#### **Implementation**
- The function `generate_response()` is responsible for:
  - Extracting text from the retrieved articles as context.
  - Sending the context to the LLM.
  - Returning the user-facing response.
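The context-assembly step inside `generate_response()` (see `llm.py` below) is a simple join over the retrieved documents; here it is as a standalone sketch (the helper name `build_context` is ours, not part of the codebase):

```python
def build_context(retrieved_docs):
    """Join retrieved articles into one context string, mirroring llm.py."""
    return "\n\n".join(
        f"Title: {doc.get('title', 'N/A')}\n"
        f"Description: {doc.get('description', 'N/A')}\n"
        f"Content: {doc.get('content', 'N/A')}\n"
        for doc in retrieved_docs
    )

docs = [{"title": "AI in Healthcare",
         "description": "How AI is revolutionizing medicine.",
         "content": "Artificial intelligence is transforming healthcare..."}]
context = build_context(docs)
```

Missing fields fall back to `"N/A"`, so the prompt layout stays stable regardless of which metadata a retrieved document carries.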

---

### **6. Interactive User Interface**

The system includes an interactive **Streamlit-based UI**, designed for a smooth user experience when exploring the data.

#### **Features**
1. **Query Input:**
   - Users can enter text queries.
   - They can choose between **Classical RAG** (text-only search) and **Multimodal RAG** (text + image search).
2. **Result Display:**
   - Lists the retrieved articles with:
     - Metadata (title, description, publication date).
     - Accompanying images.
     - Key fragments of the text content.
   - Includes a button to generate a detailed response from the LLM.

## **Summary**

This project combines multimodal content (text + images) with retrieval-augmented generation (RAG), using modern NLP and computer-vision models to provide:

**Contextual Search Results:** Precise matches retrieved with text and visual embeddings.

**LLM Responses:** Detailed answers generated by an LLM via OpenRouter.

**Interactive UI:** Streamlined user interaction through Streamlit.

---
## **Demo Video**

Below is a quick demonstration of how the system works:

Watch the demo video on [Google Drive](https://drive.google.com/file/d/1wd8QJfZYaPdwYy7qyCH4NeuQ0ZFNbW-K/view?usp=sharing).
data/__init__.py
ADDED
File without changes

data/articles_export.json
ADDED
The diff for this file is too large to render. See raw diff.
data/parser.py
ADDED
@@ -0,0 +1,224 @@
import time
from datetime import datetime
from bs4 import BeautifulSoup
import json
import requests

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from webdriver_manager.chrome import ChromeDriverManager

BASE_TAG_URL = "https://www.deeplearning.ai/the-batch/tag/"
VALID_CATEGORIES = [
    "letters",
    "data-points",
    "research",
    "business",
    "science",
    "culture",
    "hardware",
    "ai-careers",
]


def initialize_driver():
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    options.add_argument('--no-sandbox')
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    return driver


def load_all_articles(driver, url):
    wait = WebDriverWait(driver, 20)
    driver.get(url)
    time.sleep(3)

    category = url.split('/')[-2]
    all_articles_links = set()

    if category == "letters":
        last_url = ""
        while True:
            current_links = get_article_links_from_page(driver)
            all_articles_links.update(current_links)
            print(f"Collected {len(current_links)} articles on the current page in '{category}'")

            try:
                older_button = wait.until(
                    EC.element_to_be_clickable((By.CLASS_NAME, "justify-self-end"))
                )
                driver.execute_script("arguments[0].scrollIntoView({block: 'end'});", older_button)
                time.sleep(1)
                older_button.click()
                print(f"Clicked 'Older Posts' in '{category}'...")
                time.sleep(2)

                current_url = driver.current_url
                if current_url == last_url:
                    print("The URL did not change after the click; stopping the 'Older Posts' pagination.")
                    break
                last_url = current_url

            except (TimeoutException, NoSuchElementException):
                print("There is no 'Older Posts' button. Moving on to the next category.")
                break

    else:
        while True:
            current_links = get_article_links_from_page(driver)
            all_articles_links.update(current_links)
            print(f"Collected {len(current_links)} articles on the current page in '{category}'")

            try:
                load_more_button = wait.until(
                    EC.element_to_be_clickable((By.CLASS_NAME, "buttons_secondary__8o9u6"))
                )
                driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", load_more_button)
                time.sleep(1)
                load_more_button.click()
                print(f"Clicked 'Load More' in '{category}'...")
                time.sleep(2)
            except (TimeoutException, NoSuchElementException):
                print(f"The 'Load More' button is unavailable or missing in '{category}'. Moving to the next category.")
                break

    return list(all_articles_links)


def get_article_links_from_page(driver):
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    all_links = set()
    for a in soup.find_all("a", href=True):
        href = a['href']
        if href.startswith("/the-batch/") and not href.startswith("/the-batch/tag/"):
            full_url = "https://www.deeplearning.ai" + href
            if "issue" not in full_url:
                all_links.add(full_url)
    return list(all_links)


def get_article_links():
    driver = initialize_driver()
    all_links = set()

    for category in VALID_CATEGORIES:
        url = f"{BASE_TAG_URL}{category}/"
        print(f"Loading the category: {url}")
        category_links = load_all_articles(driver, url)
        print(f"Found {len(category_links)} articles in category '{category}'")
        all_links.update(category_links)

    driver.quit()
    return list(all_links)


def parse_article(url, max_retries=3, delay=2):
    attempts = 0
    while attempts < max_retries:
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()

            soup = BeautifulSoup(response.text, "html.parser")

            h1 = soup.find("h1")
            title = h1.get_text(strip=True) if h1 else ""
            description = ""
            if h1:
                span = h1.find("span")
                if span:
                    description = span.get_text(strip=True)
                    span.extract()
                    title = h1.get_text(strip=True)

            image_tag = soup.find("meta", attrs={"property": "og:image"})
            image_url = image_tag["content"] if image_tag else None

            date_meta = soup.find("meta", attrs={"property": "article:published_time"})
            date_str = ""
            if date_meta:
                try:
                    date_raw = date_meta["content"]
                    date_str = datetime.fromisoformat(date_raw.split("T")[0]).strftime("%Y-%m-%d")
                except Exception:
                    date_str = date_meta["content"]

            content = ""
            main_content = soup.find("div", class_="prose--styled")

            if main_content:
                paragraphs = main_content.find_all(["p", "li"])
                content_lines = [p.get_text(strip=True) for p in paragraphs]
                content = "\n".join(content_lines)

            time.sleep(delay)

            return {
                "title": title.strip(),
                "description": description.strip(),
                "image_url": image_url,
                "date": date_str,
                "content": content.strip(),
                "source_url": url,
            }

        except Exception as e:
            attempts += 1
            print(f"Error parsing URL {url} (Attempt {attempts}/{max_retries}): {e}")
            time.sleep(delay * attempts)

    print(f"Article skipped due to repeated errors: {url}")
    return None


def run_parser_and_save_to_json(output_filename="data/articles_export.json"):
    print("Starting to collect article links...")
    all_article_urls = get_article_links()
    print(f"{len(all_article_urls)} unique article links collected.")

    parsed_articles = []
    print("\nStarting to parse article content...")
    for i, url in enumerate(all_article_urls):
        print(f"Parsing article {i + 1}/{len(all_article_urls)}: {url}")
        article_data = parse_article(url)
        if article_data:
            parsed_articles.append(article_data)

    print(f"\nParsing completed. {len(parsed_articles)} articles collected.")

    with open(output_filename, "w", encoding="utf-8") as f:
        json.dump(parsed_articles, f, ensure_ascii=False, indent=4)
    print(f"All articles are saved in '{output_filename}'")

    print("\nFiltering out image-only articles...")
    try:
        with open(output_filename, "r", encoding="utf-8") as f:
            articles_to_filter = json.load(f)
    except FileNotFoundError:
        print(f"File '{output_filename}' not found for filtering.")
        articles_to_filter = []

    initial_count = len(articles_to_filter)
    filtered_articles = [a for a in articles_to_filter if a.get("content") != "[image]"]
    filtered_count = len(filtered_articles)

    print(f"Articles before filtering: {initial_count}")
    print(f"Articles after filtering: {filtered_count}")

    with open(output_filename, "w", encoding="utf-8") as f:
        json.dump(filtered_articles, f, ensure_ascii=False, indent=4)
    print(f"Filtered articles saved in '{output_filename}'")


if __name__ == "__main__":
    import os

    os.makedirs("data", exist_ok=True)
    run_parser_and_save_to_json()
data/the-batch-logo.webp
ADDED
db/__init__.py
ADDED
File without changes

db/image_db.py
ADDED
@@ -0,0 +1,18 @@
from langchain_community.vectorstores import Chroma


def init_chroma_image():
    vectordb = Chroma(
        collection_name="rag_collection_images",
        persist_directory="chroma_db_images"
    )
    return vectordb


def add_document_image(vectordb, doc_id, embedding, metadata):
    vectordb._collection.add(
        embeddings=[embedding],
        documents=["[image]"],
        metadatas=[metadata],
        ids=[doc_id]
    )
db/text_db.py
ADDED
@@ -0,0 +1,19 @@
from langchain_community.vectorstores import Chroma
from embeddings.text_embedder import TextEmbeddings


def init_chroma():
    return Chroma(
        collection_name="text_collection",
        embedding_function=TextEmbeddings(),
        persist_directory="chroma_db_text"
    )


def add_document_text(vectordb, doc_id, embedding, document_text, metadata):
    vectordb._collection.add(
        embeddings=[embedding],
        documents=[document_text],
        metadatas=[metadata],
        ids=[doc_id]
    )
embeddings/__init__.py
ADDED
File without changes

embeddings/image_embedder.py
ADDED
@@ -0,0 +1,31 @@
from transformers import AutoProcessor, AutoModel
from PIL import Image
from io import BytesIO
import torch
import requests

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained("openai/clip-vit-large-patch14-336").to(device)
processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14-336")


def get_image_embedding(image_url):
    try:
        response = requests.get(image_url, timeout=10)
        img = Image.open(BytesIO(response.content)).convert('RGB')
        inputs = processor(images=img, return_tensors="pt").to(device)
        with torch.no_grad():
            emb = model.get_image_features(**inputs)
        emb = emb / emb.norm(p=2, dim=-1, keepdim=True)
        return emb[0].cpu().numpy().tolist()
    except Exception as e:
        print(f"Image loading failed: {e}")
        return None


def get_text_embedding_clip(text_query):
    inputs = processor(text=[text_query], return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    emb = emb / emb.norm(p=2, dim=-1, keepdim=True)
    return emb[0].cpu().numpy().tolist()
embeddings/text_embedder.py
ADDED
@@ -0,0 +1,38 @@
from sentence_transformers import SentenceTransformer
from langchain_core.embeddings import Embeddings
import torch
import re
import emoji
import unicodedata
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

device = "cuda" if torch.cuda.is_available() else "cpu"
text_model = SentenceTransformer("intfloat/e5-base", device=device)


def preprocess_text(text):
    text = emoji.replace_emoji(text, replace='')
    text = ''.join(c for c in text if unicodedata.category(c)[0] != 'C')
    text = text.lower()
    words = re.findall(r'\b[a-z]+\b', text)
    filtered = [w for w in words if w not in stop_words]

    return " ".join(filtered)


def get_text_embedding(text):
    clean_text = preprocess_text(text)
    emb = text_model.encode(clean_text, convert_to_numpy=True, normalize_embeddings=True)
    return emb.tolist()


class TextEmbeddings(Embeddings):
    def embed_documents(self, texts):
        return [get_text_embedding(text) for text in texts]

    def embed_query(self, text):
        return get_text_embedding(text)
ingestion/__init__.py
ADDED
File without changes

ingestion/config.py
ADDED
@@ -0,0 +1 @@
JSON_PATH = "data/articles_export.json"
ingestion/ingest_image.py
ADDED
@@ -0,0 +1,37 @@
import json
from tqdm import tqdm
from db.image_db import init_chroma_image, add_document_image
from embeddings.image_embedder import get_image_embedding
from ingestion.config import JSON_PATH


def ingest_images():
    vectordb = init_chroma_image()

    with open(JSON_PATH, "r", encoding="utf-8") as f:
        articles = json.load(f)
    print(f"Found {len(articles)} articles.")

    for article in tqdm(articles, desc="Indexing images"):
        metadata = {
            "title": article.get("title", ""),
            "description": article.get("description", ""),
            "image_url": article.get("image_url", ""),
            "date": article.get("date", ""),
            "content": article.get("content", ""),
            "source_url": article.get("source_url", "")
        }

        image_url = metadata["image_url"]
        if not image_url:
            print(f"No image in: {metadata['title']}")
            continue

        emb = get_image_embedding(image_url)
        if emb:
            doc_id = f"{metadata['source_url']}#image" if metadata["source_url"] else f"image#{metadata['title']}"
            add_document_image(vectordb, doc_id, emb, {**metadata, "modality": "image"})
        else:
            print(f"Failed to embed image for: {metadata['title']}")

    print("Done indexing images.")
ingestion/ingest_run.py
ADDED
@@ -0,0 +1,25 @@
import time
from ingestion.ingest_text import ingest_texts
from ingestion.ingest_image import ingest_images


def main():
    try:
        print("1. Ingesting text embeddings...")
        start_time = time.time()
        ingest_texts()
        print(f"Text embeddings processed successfully in {time.time() - start_time:.2f} seconds.")

        print("2. Ingesting image embeddings...")
        start_time = time.time()
        ingest_images()
        print(f"Image embeddings processed successfully in {time.time() - start_time:.2f} seconds.")

    except Exception as e:
        print(f"An error occurred during ingestion: {e}")
    else:
        print("All embeddings successfully ingested.")
    finally:
        print("Done.")


if __name__ == "__main__":
    main()
ingestion/ingest_text.py
ADDED
|
@@ -0,0 +1,49 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
import json
from tqdm import tqdm
from db.text_db import init_chroma, add_document_text
from embeddings.text_embedder import get_text_embedding
from config import JSON_PATH


def chunk_text(text, chunk_size=400, overlap=50):
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunk = words[i:i + chunk_size]
        chunks.append(" ".join(chunk))
        i += chunk_size - overlap
    return chunks


def ingest_texts():
    vectordb = init_chroma()

    with open(JSON_PATH, "r", encoding="utf-8") as f:
        articles = json.load(f)
    print(f"Found {len(articles)} articles.")

    for article in tqdm(articles, desc="Indexing texts"):
        full_text = f"{article.get('title', '')}\n{article.get('description', '')}\n{article.get('content', '')}"

        metadata = {
            "title": article.get("title", ""),
            "description": article.get("description", ""),
            "image_url": article.get("image_url", ""),
            "date": article.get("date", ""),
            "content": article.get("content", ""),
            "source_url": article.get("source_url", "")
        }

        chunks = chunk_text(full_text, chunk_size=400, overlap=50)

        for i, chunk in enumerate(chunks):
            emb = get_text_embedding(chunk)
            if emb:
                doc_id = f"{metadata['source_url']}#chunk{i}" if metadata["source_url"] else f"{metadata['title']}#chunk{i}"
                add_document_text(vectordb, doc_id, emb, chunk, metadata)
            else:
                print(f"Failed to embed chunk {i} of {metadata['title']}")

    print("Done indexing texts.")
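The sliding-window chunker can be checked in isolation: each window is `chunk_size` words long and advances by `chunk_size - overlap` words, so consecutive chunks share `overlap` words. A small sketch with toy defaults (the `w0 … w11` input is made up for illustration):

```python
def chunk_text(text, chunk_size=400, overlap=50):
    # Sliding window over whitespace-split words; step = chunk_size - overlap.
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunks.append(" ".join(words[i:i + chunk_size]))
        i += chunk_size - overlap
    return chunks

text = " ".join(f"w{n}" for n in range(12))
chunks = chunk_text(text, chunk_size=5, overlap=2)
print(len(chunks))           # 4 windows over 12 words
print(chunks[0])             # w0 w1 w2 w3 w4
print(chunks[1])             # w3 w4 w5 w6 w7 -- starts 3 words later, shares 2
```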
llm.py
ADDED
|
@@ -0,0 +1,47 @@
import os
import requests
from dotenv import load_dotenv

load_dotenv()
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")


def generate_response(question, retrieved_docs, model="meta-llama/llama-3-8b-instruct"):
    context = "\n\n".join(
        f"Title: {doc.get('title', 'N/A')}\n"
        f"Description: {doc.get('description', 'N/A')}\n"
        f"Content: {doc.get('content', 'N/A')}\n"
        for doc in retrieved_docs
    )

    prompt = (
        "You are a polite assistant who provides clear and detailed answers based solely on the information from The Batch articles.\n\n"
        "Rules:\n"
        "- Answer only using the knowledge from The Batch articles.\n"
        "- Do not mention other sources or questions; provide only accurate, detailed, and understandable answers.\n"
        "- If the information is present in the context, give a clear answer.\n"
        "- If the information is missing, respond with: 'Sorry, I could not find the answer in the provided context.'\n"
        "- Do not guess, fabricate information, or go beyond the given context.\n\n"
        f"Context for the answer:\n{context}"
    )

    headers = {
        "Authorization": f"Bearer {OPENROUTER_API_KEY}",
        "Content-Type": "application/json"
    }

    data = {
        "model": model,
        "messages": [
            {"role": "system", "content": prompt},
            {"role": "user", "content": question}
        ],
        "temperature": 0.3
    }

    response = requests.post("https://openrouter.ai/api/v1/chat/completions", headers=headers, json=data)

    if response.status_code == 200:
        return response.json()['choices'][0]['message']['content'].strip()
    else:
        return f"Error: {response.status_code} - {response.text}"
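The `context` string fed to the system prompt is just the retrieved docs joined with blank lines, with `'N/A'` filled in for any missing field; the same join expression in isolation (the sample docs are invented):

```python
retrieved_docs = [
    {"title": "T1", "description": "D1", "content": "C1"},
    {"title": "T2"},  # missing keys fall back to "N/A"
]

# Same join expression as in generate_response()
context = "\n\n".join(
    f"Title: {doc.get('title', 'N/A')}\n"
    f"Description: {doc.get('description', 'N/A')}\n"
    f"Content: {doc.get('content', 'N/A')}\n"
    for doc in retrieved_docs
)

print(context)
```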
main.py
ADDED
|
@@ -0,0 +1,50 @@
import streamlit as st
from search.search_classical import classical_search
from search.search_best_pair import best_pair_search
from llm import generate_response

st.set_page_config(page_title="Multimodal Search The Batch")
st.image("data/the-batch-logo.webp", width=300)
st.title("Multimodal Assistant")

mode = st.selectbox("Select the search mode:", ["Classical RAG", "Multimodal RAG"])
query = st.text_input("Enter the text query:")

if query:
    if mode == "Classical RAG":
        results = classical_search(query, k=3)
    else:
        results = best_pair_search(query, k=3)

    st.markdown(f"### Results found: {len(results)}")

    for i, meta in enumerate(results):
        st.markdown(f"### Result {i + 1}")
        if meta.get("title"):
            st.markdown(f"**Name:** {meta['title']}")
        if meta.get("date"):
            st.markdown(f"**Date of publication:** {meta['date']}")
        if meta.get("description"):
            st.markdown(f"**Description:** {meta['description']}")
        if meta.get("image_url"):
            st.image(meta["image_url"], use_container_width=True)
        if meta.get("content"):
            st.markdown("**Part of the article:**")
            st.write(meta["content"][:500] + "...")
        if meta.get("source_url"):
            st.markdown(f"[Read the full article]({meta['source_url']})")
        st.markdown("---")

    if st.button("Generate a response to a query"):
        docs = [
            {
                "title": meta.get("title", ""),
                "description": meta.get("description", ""),
                "content": meta.get("content", "")
            }
            for meta in results
        ]

        response = generate_response(query, docs)
        st.markdown("### Generated Response:")
        st.success(response)
requirements.txt
ADDED
|
@@ -0,0 +1,15 @@
streamlit
langchain
sentence-transformers
transformers
torch
chromadb
nltk
requests
tqdm
python-dotenv
beautifulsoup4
selenium
webdriver-manager
langchain_community
emoji
search/__init__.py
ADDED
|
File without changes
|
search/search_best_pair.py
ADDED
|
@@ -0,0 +1,31 @@
from db.text_db import init_chroma
from db.image_db import init_chroma_image
from embeddings.text_embedder import get_text_embedding
from embeddings.image_embedder import get_text_embedding_clip


def best_pair_search(query, k=3):
    text_db = init_chroma()
    image_db = init_chroma_image()

    text_emb = get_text_embedding(query)
    image_emb = get_text_embedding_clip(query)

    t_res = text_db.similarity_search_by_vector(text_emb, k=k)
    i_res = image_db.similarity_search_by_vector(image_emb, k=k)

    results = []
    for i in range(k):
        text_meta = t_res[i].metadata if i < len(t_res) else {}
        image_meta = i_res[i].metadata if i < len(i_res) else {}

        results.append({
            "title": text_meta.get("title"),
            "description": text_meta.get("description"),
            "date": text_meta.get("date"),
            "source_url": text_meta.get("source_url"),
            "content": text_meta.get("content"),
            "image_url": image_meta.get("image_url") if image_meta else None
        })

    return results
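`best_pair_search` pairs the i-th text hit with the i-th image hit and pads with empty metadata when either result list comes back shorter than `k`; the padding behaviour in isolation (plain dicts stand in for the Chroma `Document.metadata` objects):

```python
def pair_results(t_res, i_res, k):
    # Pair the i-th text hit with the i-th image hit; pad with {} when a list is short.
    results = []
    for i in range(k):
        text_meta = t_res[i] if i < len(t_res) else {}
        image_meta = i_res[i] if i < len(i_res) else {}
        results.append({
            "title": text_meta.get("title"),
            "image_url": image_meta.get("image_url") if image_meta else None,
        })
    return results

pairs = pair_results(
    [{"title": "A"}, {"title": "B"}],   # two text hits
    [{"image_url": "a.jpg"}],           # only one image hit
    k=3,
)
print(pairs)
# [{'title': 'A', 'image_url': 'a.jpg'},
#  {'title': 'B', 'image_url': None},
#  {'title': None, 'image_url': None}]
```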
search/search_classical.py
ADDED
|
@@ -0,0 +1,26 @@
from db.text_db import init_chroma
from embeddings.text_embedder import get_text_embedding


def classical_search(query, k=5):
    db = init_chroma()
    emb = get_text_embedding(query)
    results = db.similarity_search_by_vector(emb, k=k)
    articles = []
    seen = set()

    for r in results:
        meta = r.metadata
        aid = meta.get("source_url") or meta.get("title")
        if aid not in seen:
            seen.add(aid)
            articles.append({
                "title": meta.get("title"),
                "image_url": meta.get("image_url"),
                "date": meta.get("date"),
                "description": meta.get("description"),
                "content": meta.get("content"),
                "source_url": meta.get("source_url"),
            })

    return articles
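Because several chunks of one article can land in the top-k, `classical_search` deduplicates by `source_url`, falling back to `title`; the same pattern on stand-in data (`FakeDoc` and the sample metadata are invented for illustration):

```python
class FakeDoc:
    # Minimal stand-in for a Chroma search result with a .metadata dict.
    def __init__(self, metadata):
        self.metadata = metadata

results = [
    FakeDoc({"source_url": "https://example.com/a", "title": "A"}),
    FakeDoc({"source_url": "https://example.com/a", "title": "A"}),  # second chunk, same article
    FakeDoc({"source_url": "https://example.com/b", "title": "B"}),
]

articles, seen = [], set()
for r in results:
    meta = r.metadata
    aid = meta.get("source_url") or meta.get("title")
    if aid not in seen:
        seen.add(aid)
        articles.append(meta["title"])

print(articles)  # ['A', 'B'] -- the duplicate chunk is collapsed
```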