Arsen Dolichnyi committed on
Commit 42b2b3c · 0 Parent(s):

basic version with classic and multimodal rag
.gitignore ADDED

# created by virtualenv automatically
.env
/venv
.idea
__pycache__/
chroma_db_text/
chroma_db_images/
README.md ADDED

# Multimodal RAG System

## Project Description

This project implements a **Multimodal Retrieval-Augmented Generation (RAG)** system that combines text and image data to retrieve relevant articles from **The Batch**. The system allows you to:
- Retrieve data based on user queries using text **and** visual embeddings.
- Perform **Classical RAG** (text-based search) and **Multimodal RAG** (combined text + image search).
- Generate AI-powered answers to queries using **Large Language Models (LLMs)**.
- Provide users with an interactive interface for exploring results.

**Key Feature**: By combining textual and visual content, this system improves the relevance of search results and the overall user experience.

---
## How to Run the Project

Follow these steps to set up and run the system locally on your machine.

### **1. Clone the Repository**
Start by cloning the project repository onto your local machine:
```bash
git clone https://github.com/DolAr1610/Multimodal_RAG.git
cd Multimodal_RAG
```

### **2. Install Dependencies**
Install all the required Python libraries specified in `requirements.txt`:
```bash
pip install -r requirements.txt
```

### **3. Prepare the Data**
Ensure that the parsed articles are saved as a JSON file (`data/articles_export.json`) before running the system.

#### **Option 1: Generate Data**
If the articles are not yet parsed, you can run the parser:
```bash
python parser.py
```
#### **Option 2: Use Pre-Generated Data**
Alternatively, use the pre-generated `articles_export.json` located in the `data/` directory.

### **4. Generate Vector Databases (First-Time Setup)**
If this is your first run, you need to create the vector databases for text and images:
```bash
python ingest_run.py  # Create vector databases for text and image embeddings
```
This step ensures the Chroma vector databases are properly initialized and indexed.

### **5. Launch the Application**
Run the Streamlit application to access the interactive user interface:
```bash
streamlit run main.py
```

## Key Features

### **1. Parsing Articles and Metadata**

The system collects articles, including text, metadata, and associated images, from **The Batch** using web-scraping techniques.

- **Objective:** Extract text content (title, description, publication date), metadata, and associated images.
- **How it Works**:
  - **Selenium:** Handles dynamic website elements like pagination ("Load More", "Older Posts").
  - **BeautifulSoup:** Extracts article text, metadata, and image URLs from HTML.
- **Output:** Articles stored in a structured JSON format as follows:
```json
{
  "title": "Article Title",
  "description": "Short Description",
  "image_url": "https://example.com/image.jpg",
  "date": "2024-10-11",
  "content": "The main content of the article...",
  "source_url": "https://thebatch.org/example-article"
}
```
**Scripts**:
- `initialize_driver()`: Configures the Selenium WebDriver for site interaction.
- `parse_article(url)`: Extracts the title, description, metadata, and images of an article.
- `run_parser_and_save_to_json()`: Runs the entire parsing and filtering process.
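For illustration, records with this schema can be loaded and sanity-checked as follows. This is a hypothetical helper, not part of the repository; `validate_article` and `REQUIRED_KEYS` are illustrative names, and the field list matches the JSON example above.

```python
import json

# Fields the downstream indexers read from each record (per the schema above).
REQUIRED_KEYS = {"title", "description", "image_url", "date", "content", "source_url"}


def validate_article(article: dict) -> bool:
    """Return True if the record carries every field the indexers rely on."""
    return REQUIRED_KEYS.issubset(article)


def load_articles(path: str) -> list:
    """Load the export file and drop any malformed records."""
    with open(path, "r", encoding="utf-8") as f:
        articles = json.load(f)
    return [a for a in articles if validate_article(a)]
```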
---

### **2. Building Vector Databases**

To enable efficient multimodal retrieval, the system creates **two separate vector databases**: one for text and one for images.

#### **Text Index**
- **Model:** The text index leverages **SentenceTransformer (E5)** for generating embeddings.
- **Process:**
  - Articles are preprocessed using `chunk_text()` to split larger texts into smaller chunks (400 words with a 50-word overlap).
  - Chunks and embeddings are stored in a **Chroma** database.
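The chunking step can be sketched as a sliding word window, using the 400/50 defaults described above:

```python
def chunk_text(text, chunk_size=400, overlap=50):
    # Split on whitespace and emit fixed-size word windows; consecutive
    # windows share `overlap` words, so a sentence cut at one boundary
    # still appears intact in the neighboring chunk.
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunks.append(" ".join(words[i:i + chunk_size]))
        i += chunk_size - overlap
    return chunks
```

With 1,000 words and the defaults, this yields three chunks whose boundaries overlap by 50 words.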
#### **Image Index**
- **Model:** Image embeddings are generated using **OpenAI CLIP** (`clip-vit-large-patch14-336`).
- **Process:**
  - Images are accessed via URLs and transformed into embeddings.
  - Embeddings and metadata are stored in a **Chroma** database.

---

### **3. Embedding Integration**

The text and image embeddings are created independently to enhance retrieval performance.

- **Text Integration:** Articles are preprocessed, converted into embeddings using **E5**, and indexed.
- **Image Integration:** Image URLs are retrieved, processed, and added to the image index using embeddings from **CLIP**.

**Why Separate Databases?**: Keeping one database per modality lets each use the strongest model for that modality, and each index can be optimized and queried independently.
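Because both embedders L2-normalize their vectors before storage, relevance within each index reduces to a dot product (cosine similarity). A minimal, dependency-free illustration:

```python
import math


def normalize(v):
    # Scale a vector to unit length (L2 norm of 1).
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]


def cosine(u, v):
    # For unit-length vectors, the dot product *is* the cosine similarity:
    # 1.0 means identical direction, 0.0 means orthogonal (unrelated).
    return sum(a * b for a, b in zip(u, v))


query_vec = normalize([1.0, 2.0, 2.0])
doc_vec = normalize([0.0, 1.0, 0.0])
score = cosine(query_vec, doc_vec)  # always in [-1, 1]; higher = more similar
```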
---

### **4. Search System**

The system provides two types of search: text-only and multimodal.

#### **1. Classical Search (Text-Based RAG)**
- Focuses exclusively on the text database.
- Finds articles that are highly relevant to the user query.
- Always provides accompanying images from relevant articles.
- **Implementation:** `classical_search()`.

#### **2. Multimodal Search (Text + Image RAG)**
- Leverages both text and image databases.
- Locates the best-matching text and an independent image:
  - Finds relevant text embeddings for the query in the text index.
  - Simultaneously searches for image embeddings matching the query in the image index.
  - Combines the results into multimodal pairs.
- **Implementation:** `best_pair_search()`.

**Output Example**:
```json
{
  "title": "AI in Healthcare",
  "description": "How AI is revolutionizing medicine.",
  "image_url": "https://thebatch.org/healthcare-ai.jpg",
  "date": "2024-10-11",
  "source_url": "https://thebatch.org/ai-healthcare",
  "content": "Artificial intelligence is transforming healthcare with personalized approaches..."
}
```
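The pairing step can be sketched as zipping the two ranked result lists rank by rank. This is a simplified stand-in for `best_pair_search()`; here `text_hits` and `image_hits` stand for the metadata dicts each index returns.

```python
def pair_results(text_hits, image_hits, k=3):
    # Rank i of the text search is paired with rank i of the image search;
    # when one list is shorter than k, the missing fields stay None.
    pairs = []
    for i in range(k):
        text_meta = text_hits[i] if i < len(text_hits) else {}
        image_meta = image_hits[i] if i < len(image_hits) else {}
        pairs.append({
            "title": text_meta.get("title"),
            "content": text_meta.get("content"),
            "image_url": image_meta.get("image_url"),
        })
    return pairs
```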
---

### **5. Answer Generation Using LLM**

The system integrates a **Large Language Model (LLM)** to generate responses based on the content of retrieved articles.

#### **Model**
- The system uses **meta-llama/llama-3-8b-instruct**, integrated via the **OpenRouter API**.

#### **Process**
1. Retrieved article context (text fragments) is passed to the LLM.
2. The model generates detailed answers while adhering strictly to the provided context.
3. If the query cannot be addressed due to insufficient context, the system returns a fallback response:
> **"Sorry, I could not find the answer in the provided context."**

#### **Implementation**
- The function `generate_response()` is responsible for:
  - Extracting text from articles as context.
  - Sending the context to the LLM.
  - Generating user-facing responses.
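The context-assembly step can be sketched as follows. This is a simplified version of what `generate_response()` builds before calling the OpenRouter API; the field names match the article schema.

```python
def build_context(retrieved_docs):
    # Concatenate the retrieved articles into one context block that is
    # prepended to the system prompt; missing fields fall back to "N/A".
    return "\n\n".join(
        f"Title: {doc.get('title', 'N/A')}\n"
        f"Description: {doc.get('description', 'N/A')}\n"
        f"Content: {doc.get('content', 'N/A')}"
        for doc in retrieved_docs
    )
```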
---

### **6. Interactive User Interface**

The system includes an interactive **Streamlit-based UI**, designed for a smooth user experience when exploring data.

#### **Features**
1. **Query Input:**
   - Users can input text queries.
   - They can choose between **Classical RAG** (text-only search) or **Multimodal RAG** (text + image search).
2. **Result Display:**
   - Lists retrieved articles with:
     - Metadata (title, description, publication date).
     - Accompanying images.
     - Key fragments of text content.
   - Includes a button to generate detailed responses from the LLM.

## **Summary**

This project explores the integration of multimodal content (text + images) and retrieval-augmented generation (RAG), incorporating cutting-edge NLP and computer vision models to provide users with:

**Contextual Search Results:** Retrieve precise matches using text and visual embeddings seamlessly.

**LLM Responses:** Generate detailed answers with an open LLM (Llama 3) served via OpenRouter.

**Interactive UI:** Streamlined user interaction through Streamlit.

---
## **Demo Video**

Below is a quick demonstration of how the system works:

Watch the demo video on [Google Drive](https://drive.google.com/file/d/1wd8QJfZYaPdwYy7qyCH4NeuQ0ZFNbW-K/view?usp=sharing).
data/__init__.py ADDED
File without changes
data/articles_export.json ADDED
The diff for this file is too large to render. See raw diff
 
data/parser.py ADDED

import time
from datetime import datetime
from bs4 import BeautifulSoup
import json
import requests

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from webdriver_manager.chrome import ChromeDriverManager

BASE_TAG_URL = "https://www.deeplearning.ai/the-batch/tag/"
VALID_CATEGORIES = [
    "letters",
    "data-points",
    "research",
    "business",
    "science",
    "culture",
    "hardware",
    "ai-careers"
]


def initialize_driver():
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    options.add_argument('--no-sandbox')
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    return driver


def load_all_articles(driver, url):
    wait = WebDriverWait(driver, 20)
    driver.get(url)
    time.sleep(3)

    category = url.split('/')[-2]
    all_articles_links = set()

    if category == "letters":
        last_url = ""
        while True:
            current_links = get_article_links_from_page(driver)
            all_articles_links.update(current_links)
            print(f"Collected {len(current_links)} articles on the current page in '{category}'")

            try:
                older_button = wait.until(
                    EC.element_to_be_clickable((By.CLASS_NAME, "justify-self-end"))
                )
                driver.execute_script("arguments[0].scrollIntoView({block: 'end'});", older_button)
                time.sleep(1)
                older_button.click()
                print(f"Clicked 'Older Posts' in '{category}'...")
                time.sleep(2)

                current_url = driver.current_url
                if current_url == last_url:
                    print("The URL did not change after the click; stopping the 'Older Posts' pagination.")
                    break
                last_url = current_url

            except (TimeoutException, NoSuchElementException):
                print("There is no 'Older Posts' button. Moving on to the next category.")
                break

    else:
        while True:
            current_links = get_article_links_from_page(driver)
            all_articles_links.update(current_links)
            print(f"Collected {len(current_links)} articles on the current page in '{category}'")

            try:
                load_more_button = wait.until(
                    EC.element_to_be_clickable((By.CLASS_NAME, "buttons_secondary__8o9u6"))
                )
                driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", load_more_button)
                time.sleep(1)
                load_more_button.click()
                print(f"Clicked 'Load More' in '{category}'...")
                time.sleep(2)
            except (TimeoutException, NoSuchElementException):
                print(f"The 'Load More' button is unavailable or missing in '{category}'. Moving to the next category.")
                break

    return list(all_articles_links)


def get_article_links_from_page(driver):
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    all_links = set()
    for a in soup.find_all("a", href=True):
        href = a['href']
        if href.startswith("/the-batch/") and not href.startswith("/the-batch/tag/"):
            full_url = "https://www.deeplearning.ai" + href
            if "issue" not in full_url:
                all_links.add(full_url)
    return list(all_links)


def get_article_links():
    driver = initialize_driver()
    all_links = set()

    for category in VALID_CATEGORIES:
        url = f"{BASE_TAG_URL}{category}/"
        print(f"Loading the category: {url}")
        category_links = load_all_articles(driver, url)
        print(f"Found {len(category_links)} articles in category '{category}'")
        all_links.update(category_links)

    driver.quit()
    return list(all_links)


def parse_article(url, max_retries=3, delay=2):
    attempts = 0
    while attempts < max_retries:
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()

            soup = BeautifulSoup(response.text, "html.parser")

            h1 = soup.find("h1")
            title = h1.get_text(strip=True) if h1 else ""
            description = ""
            if h1:
                span = h1.find("span")
                if span:
                    description = span.get_text(strip=True)
                    span.extract()
                    title = h1.get_text(strip=True)

            image_tag = soup.find("meta", attrs={"property": "og:image"})
            image_url = image_tag["content"] if image_tag else None

            date_meta = soup.find("meta", attrs={"property": "article:published_time"})
            date_str = ""
            if date_meta:
                try:
                    date_raw = date_meta["content"]
                    date_str = datetime.fromisoformat(date_raw.split("T")[0]).strftime("%Y-%m-%d")
                except Exception:
                    date_str = date_meta["content"]

            content = ""
            main_content = soup.find("div", class_="prose--styled")

            if main_content:
                paragraphs = main_content.find_all(["p", "li"])
                content_lines = [p.get_text(strip=True) for p in paragraphs]
                content = "\n".join(content_lines)

            time.sleep(delay)

            return {
                "title": title.strip(),
                "description": description.strip(),
                "image_url": image_url,
                "date": date_str,
                "content": content.strip(),
                "source_url": url,
            }

        except Exception as e:  # covers requests.RequestException and parsing errors alike
            attempts += 1
            print(f"Error parsing URL {url} (Attempt {attempts}/{max_retries}): {e}")
            time.sleep(delay * attempts)

    print(f"Article skipped due to repeated errors: {url}")
    return None


def run_parser_and_save_to_json(output_filename="data/articles_export.json"):
    print("Starting to parse article links...")
    all_article_urls = get_article_links()
    print(f"{len(all_article_urls)} unique links to articles collected.")

    parsed_articles = []
    print("\nStarting to parse article content...")
    for i, url in enumerate(all_article_urls):
        print(f"Parsing article {i + 1}/{len(all_article_urls)}: {url}")
        article_data = parse_article(url)
        if article_data:
            parsed_articles.append(article_data)

    print(f"\nParsing completed. {len(parsed_articles)} articles collected.")

    with open(output_filename, "w", encoding="utf-8") as f:
        json.dump(parsed_articles, f, ensure_ascii=False, indent=4)
    print(f"All articles are saved in '{output_filename}'")

    print("\nStarting to filter articles...")
    try:
        with open(output_filename, "r", encoding="utf-8") as f:
            articles_to_filter = json.load(f)
    except FileNotFoundError:
        print(f"File '{output_filename}' not found for filtering.")
        articles_to_filter = []

    initial_count = len(articles_to_filter)
    filtered_articles = [a for a in articles_to_filter if a.get("content") != "[image]"]
    filtered_count = len(filtered_articles)

    print(f"Articles before filtering: {initial_count}")
    print(f"Articles after filtering: {filtered_count}")

    with open(output_filename, "w", encoding="utf-8") as f:
        json.dump(filtered_articles, f, ensure_ascii=False, indent=4)
    print(f"Filtered articles saved in '{output_filename}'")


if __name__ == "__main__":
    import os

    os.makedirs("data", exist_ok=True)
    run_parser_and_save_to_json()
data/the-batch-logo.webp ADDED
db/__init__.py ADDED
File without changes
db/image_db.py ADDED

from langchain_community.vectorstores import Chroma


def init_chroma_image():
    # No embedding_function: image embeddings are always supplied precomputed.
    vectordb = Chroma(
        collection_name="rag_collection_images",
        persist_directory="chroma_db_images"
    )
    return vectordb


def add_document_image(vectordb, doc_id, embedding, metadata):
    # Write directly to the underlying collection so a precomputed CLIP
    # embedding can be stored instead of a document to embed.
    vectordb._collection.add(
        embeddings=[embedding],
        documents=["[image]"],
        metadatas=[metadata],
        ids=[doc_id]
    )
db/text_db.py ADDED

from langchain_community.vectorstores import Chroma
from embeddings.text_embedder import TextEmbeddings


def init_chroma():
    return Chroma(
        collection_name="text_collection",
        embedding_function=TextEmbeddings(),
        persist_directory="chroma_db_text"
    )


def add_document_text(vectordb, doc_id, embedding, document_text, metadata):
    # Write directly to the underlying collection so the precomputed
    # embedding is stored alongside the chunk text and metadata.
    vectordb._collection.add(
        embeddings=[embedding],
        documents=[document_text],
        metadatas=[metadata],
        ids=[doc_id]
    )
embeddings/__init__.py ADDED
File without changes
embeddings/image_embedder.py ADDED

from transformers import AutoProcessor, AutoModel
from PIL import Image
from io import BytesIO
import torch
import requests

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained("openai/clip-vit-large-patch14-336").to(device)
processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14-336")


def get_image_embedding(image_url):
    try:
        response = requests.get(image_url, timeout=10)
        img = Image.open(BytesIO(response.content)).convert('RGB')
        inputs = processor(images=img, return_tensors="pt").to(device)
        with torch.no_grad():
            emb = model.get_image_features(**inputs)
            emb = emb / emb.norm(p=2, dim=-1, keepdim=True)
        return emb[0].cpu().numpy().tolist()
    except Exception as e:
        print(f"Image loading failed: {e}")
        return None


def get_text_embedding_clip(text_query):
    inputs = processor(text=[text_query], return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
        emb = emb / emb.norm(p=2, dim=-1, keepdim=True)
    return emb[0].cpu().numpy().tolist()
embeddings/text_embedder.py ADDED

from sentence_transformers import SentenceTransformer
from langchain_core.embeddings import Embeddings
import torch
import re
import emoji
import unicodedata
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

device = "cuda" if torch.cuda.is_available() else "cpu"
text_model = SentenceTransformer("intfloat/e5-base", device=device)


def preprocess_text(text):
    text = emoji.replace_emoji(text, replace='')
    text = ''.join(c for c in text if unicodedata.category(c)[0] != 'C')
    text = text.lower()
    words = re.findall(r'\b[a-z]+\b', text)
    filtered = [w for w in words if w not in stop_words]

    return " ".join(filtered)


def get_text_embedding(text):
    clean_text = preprocess_text(text)
    emb = text_model.encode(clean_text, convert_to_numpy=True, normalize_embeddings=True)
    return emb.tolist()


class TextEmbeddings(Embeddings):
    def embed_documents(self, texts):
        return [get_text_embedding(text) for text in texts]

    def embed_query(self, text):
        return get_text_embedding(text)
ingestion/__init__.py ADDED
File without changes
ingestion/config.py ADDED

JSON_PATH = "data/articles_export.json"
ingestion/ingest_image.py ADDED

import json
from tqdm import tqdm
from db.image_db import init_chroma_image, add_document_image
from embeddings.image_embedder import get_image_embedding
from config import JSON_PATH


def ingest_images():
    vectordb = init_chroma_image()

    with open(JSON_PATH, "r", encoding="utf-8") as f:
        articles = json.load(f)
    print(f"Found {len(articles)} articles.")

    for article in tqdm(articles, desc="Indexing images"):
        metadata = {
            "title": article.get("title", ""),
            "description": article.get("description", ""),
            "image_url": article.get("image_url", ""),
            "date": article.get("date", ""),
            "content": article.get("content", ""),
            "source_url": article.get("source_url", "")
        }

        image_url = metadata["image_url"]
        if not image_url:
            print(f"No image in: {metadata['title']}")
            continue

        emb = get_image_embedding(image_url)
        if emb:
            doc_id = f"{metadata['source_url']}#image" if metadata["source_url"] else f"image#{metadata['title']}"
            add_document_image(vectordb, doc_id, emb, {**metadata, "modality": "image"})
        else:
            print(f"Failed to embed image for: {metadata['title']}")

    print("Done indexing images.")
ingestion/ingest_run.py ADDED

import time

from ingest_text import ingest_texts
from ingest_image import ingest_images


def main():
    try:
        print("1. Ingesting text embeddings...")
        start_time = time.time()
        ingest_texts()
        print(f"Text embeddings processed successfully in {time.time() - start_time:.2f} seconds.")

        print("2. Ingesting image embeddings...")
        start_time = time.time()
        ingest_images()
        print(f"Image embeddings processed successfully in {time.time() - start_time:.2f} seconds.")

    except Exception as e:
        print(f"An error occurred during ingestion: {e}")
    else:
        print("All embeddings successfully ingested.")
    finally:
        print("Done.")


if __name__ == "__main__":
    main()
ingestion/ingest_text.py ADDED

import json
from tqdm import tqdm
from db.text_db import init_chroma, add_document_text
from embeddings.text_embedder import get_text_embedding
from config import JSON_PATH


def chunk_text(text, chunk_size=400, overlap=50):
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunk = words[i:i + chunk_size]
        chunks.append(" ".join(chunk))
        i += chunk_size - overlap
    return chunks


def ingest_texts():
    vectordb = init_chroma()

    with open(JSON_PATH, "r", encoding="utf-8") as f:
        articles = json.load(f)
    print(f"Found {len(articles)} articles.")

    for article in tqdm(articles, desc="Indexing texts"):
        full_text = f"{article.get('title', '')}\n{article.get('description', '')}\n{article.get('content', '')}"

        metadata = {
            "title": article.get("title", ""),
            "description": article.get("description", ""),
            "image_url": article.get("image_url", ""),
            "date": article.get("date", ""),
            "content": article.get("content", ""),
            "source_url": article.get("source_url", "")
        }

        chunks = chunk_text(full_text, chunk_size=400, overlap=50)

        for i, chunk in enumerate(chunks):
            emb = get_text_embedding(chunk)
            if emb:
                doc_id = f"{metadata['source_url']}#chunk{i}" if metadata["source_url"] else f"{metadata['title']}#chunk{i}"
                add_document_text(vectordb, doc_id, emb, chunk, metadata)
            else:
                print(f"Failed to embed chunk {i} of {metadata['title']}")

    print("Done indexing texts.")
llm.py ADDED

import os
import requests
from dotenv import load_dotenv

load_dotenv()
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")


def generate_response(question, retrieved_docs, model="meta-llama/llama-3-8b-instruct"):
    context = "\n\n".join(
        f"Title: {doc.get('title', 'N/A')}\n"
        f"Description: {doc.get('description', 'N/A')}\n"
        f"Content: {doc.get('content', 'N/A')}\n"
        for doc in retrieved_docs
    )

    prompt = (
        "You are a polite assistant who provides clear and detailed answers based solely on the information from The Batch articles.\n\n"
        "Rules:\n"
        "- Answer only using the knowledge from The Batch articles.\n"
        "- Do not mention other sources or questions; provide only accurate, detailed, and understandable answers.\n"
        "- If the information is present in the context, give a clear answer.\n"
        "- If the information is missing, respond with: 'Sorry, I could not find the answer in the provided context.'\n"
        "- Do not guess, fabricate information, or go beyond the given context.\n\n"
        f"Context for the answer:\n{context}"
    )

    headers = {
        "Authorization": f"Bearer {OPENROUTER_API_KEY}",
        "Content-Type": "application/json"
    }

    data = {
        "model": model,
        "messages": [
            {"role": "system", "content": prompt},
            {"role": "user", "content": question}
        ],
        "temperature": 0.3
    }

    response = requests.post("https://openrouter.ai/api/v1/chat/completions", headers=headers, json=data)

    if response.status_code == 200:
        return response.json()['choices'][0]['message']['content'].strip()
    else:
        return f"Error: {response.status_code} - {response.text}"
main.py ADDED

import streamlit as st
from search.search_classical import classical_search
from search.search_best_pair import best_pair_search
from llm import generate_response

st.set_page_config(page_title="🔍 Multimodal Search The Batch")
st.image("data/the-batch-logo.webp", width=300)
st.title("Multimodal Assistant")

mode = st.selectbox("🔎 Select the search mode:", ["Classical RAG", "Multimodal RAG"])
query = st.text_input("📝 Enter the text query:")

if query:
    if mode == "Classical RAG":
        results = classical_search(query, k=3)
    else:
        results = best_pair_search(query, k=3)

    st.markdown(f"### 📄 Results found: {len(results)}")

    for i, meta in enumerate(results):
        st.markdown(f"### 🔹 Result {i + 1}")
        if meta.get("title"):
            st.markdown(f"**📖 Title:** {meta['title']}")
        if meta.get("date"):
            st.markdown(f"**📅 Date of publication:** {meta['date']}")
        if meta.get("description"):
            st.markdown(f"**📝 Description:** {meta['description']}")
        if meta.get("image_url"):
            st.image(meta["image_url"], use_container_width=True)
        if meta.get("content"):
            st.markdown("**📚 Part of the article:**")
            st.write(meta["content"][:500] + "...")
        if meta.get("source_url"):
            st.markdown(f"[🔗 Read the full article →]({meta['source_url']})")
        st.markdown("---")

    if st.button("🧠 Generate a response to a query"):
        docs = [
            {
                "title": meta.get("title", ""),
                "description": meta.get("description", ""),
                "content": meta.get("content", "")
            }
            for meta in results
        ]

        response = generate_response(query, docs)
        st.markdown("### 🤖 Generated Response:")
        st.success(response)
requirements.txt ADDED

streamlit
langchain
sentence-transformers
transformers
torch
chromadb
nltk
requests
tqdm
python-dotenv
beautifulsoup4
selenium
webdriver-manager
langchain_community
emoji
search/__init__.py ADDED
File without changes
search/search_best_pair.py ADDED

from db.text_db import init_chroma
from db.image_db import init_chroma_image
from embeddings.text_embedder import get_text_embedding
from embeddings.image_embedder import get_text_embedding_clip


def best_pair_search(query, k=3):
    text_db = init_chroma()
    image_db = init_chroma_image()

    text_emb = get_text_embedding(query)
    image_emb = get_text_embedding_clip(query)

    t_res = text_db.similarity_search_by_vector(text_emb, k=k)
    i_res = image_db.similarity_search_by_vector(image_emb, k=k)

    results = []
    for i in range(k):
        text_meta = t_res[i].metadata if i < len(t_res) else {}
        image_meta = i_res[i].metadata if i < len(i_res) else {}

        results.append({
            "title": text_meta.get("title"),
            "description": text_meta.get("description"),
            "date": text_meta.get("date"),
            "source_url": text_meta.get("source_url"),
            "content": text_meta.get("content"),
            "image_url": image_meta.get("image_url") if image_meta else None
        })

    return results
search/search_classical.py ADDED

from db.text_db import init_chroma
from embeddings.text_embedder import get_text_embedding


def classical_search(query, k=5):
    db = init_chroma()
    emb = get_text_embedding(query)
    results = db.similarity_search_by_vector(emb, k=k)
    articles = []
    seen = set()

    for r in results:
        meta = r.metadata
        aid = meta.get("source_url") or meta.get("title")
        if aid not in seen:
            seen.add(aid)
            articles.append({
                "title": meta.get("title"),
                "image_url": meta.get("image_url"),
                "date": meta.get("date"),
                "description": meta.get("description"),
                "content": meta.get("content"),
                "source_url": meta.get("source_url"),
            })

    return articles