Spaces:

dev-models
/

MultiModel-Rag

dev-models committed on Jan 22

Commit

e97c8d1

1 Parent(s): 135acdb

Initial commit

Files changed (12)

.dockerignore +17 -0
.gitignore +47 -0
Dockerfile +28 -0
README.md +112 -0
app.py +290 -0
backend/__init__.py +1 -0
backend/database.py +15 -0
backend/models.py +21 -0
backend/parser.py +189 -0
backend/rag.py +315 -0
config.py +23 -0
requirements.txt +16 -0

.dockerignore ADDED

	@@ -0,0 +1,17 @@

+__pycache__/
+*.pyc
+*.pyo
+*.pyd
+.Python
+env/
+venv/
+build/
+dist/
+*.egg-info
+temp_uploads/
+rag_data/
+.env
+node_modules/
+.mypy_cache
+rag_data/
+temp_uploads/

.gitignore ADDED

	@@ -0,0 +1,47 @@

+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+env/
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+.venv
+venv
+ENV/
+env.bak
+venv.bak
+# Environment Variables
+.env
+# Project specific
+temp_uploads/
+rag_data/
+# IDEs
+.vscode/
+.idea/
+# Docker
+.docker/
+# OS specific
+.DS_Store
+Thumbs.db

Dockerfile ADDED

	@@ -0,0 +1,28 @@

+FROM python:3.10-slim
+ENV PYTHONDONTWRITEBYTECODE=1
+ENV PYTHONUNBUFFERED=1
+WORKDIR /app
+# Install system dependencies
+RUN apt-get update && apt-get install -y \
+    tesseract-ocr \
+    tesseract-ocr-eng \
+    poppler-utils \
+    libgl1 \
+    libglib2.0-0 \
+    build-essential \
+    && rm -rf /var/lib/apt/lists/*
+# Install Python dependencies
+COPY requirements.txt .
+RUN pip install --upgrade pip && \
+    pip install -r requirements.txt
+# Copy app code
+COPY . .
+EXPOSE 7860
+CMD ["streamlit", "run", "app.py", "--server.port=7860", "--server.address=0.0.0.0"]

README.md CHANGED

@@ -8,3 +8,115 @@ pinned: false
 ---
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# 🤖 Multimodal RAG Assistant (Docling-Powered)
+[![Python](https://img.shields.io/badge/Python-3.10%2B-blue.svg)](https://www.python.org/)
+[![Streamlit](https://img.shields.io/badge/Streamlit-1.32%2B-FF4B4B.svg)](https://streamlit.io/)
+[![Docling](https://img.shields.io/badge/Docling-IBM-orange.svg)](https://github.com/DS4SD/docling)
+[![MongoDB](https://img.shields.io/badge/MongoDB-Atlas-green.svg)](https://www.mongodb.com/products/platform/atlas-vector-search)
+[![Groq](https://img.shields.io/badge/Groq-Llama_3.3-black.svg)](https://groq.com/)
+A state-of-the-art **Multimodal Retrieval-Augmented Generation (RAG)** system built for the modern document era. This assistant doesn't just read text—it understands tables, charts, diagrams, and complex layouts using **IBM's Docling** and **Visual Language Models**.
+---
+## 🚀 The WOW Factor
+*   **🧠 Deep Document Intelligence:** Powered by **Docling**, the system extracts semantic structures (headers, tables, lists) with extreme precision.
+*   **👁️ Visual Understanding:** Every image in your PDF is "seen" by a **VLM (Llama-4-Scout via Groq)** to generate rich textual descriptions for vector indexing.
+*   **🔍 Hybrid Search Engine:** A retrieval pipeline combining **CLIP (Dense)** and **BM25 (Sparse)** so that both semantic matches and exact keyword matches are surfaced.
+*   **🖼️ Visual RAG Capabilities:** Directly query for charts or diagrams. The assistant "shows" you the relevant visuals alongside textual answers.
+*   **💡 Intelligent Query Guidance:** Automatically analyzes document structure to suggest the most relevant questions for the user.
+*   **⚡ Blazing Fast Generation:** Uses **Groq's Llama-3.3-70B** for near-instant, high-quality responses with full streaming support.
+---
+## 🛠️ Architecture Overview
+The system is built on a modular, production-ready foundation:
+```text
+rag-app/
+├── 🌐 app.py              # Streamlit Premium Interface
+├── ⚙️ config.py           # Centralized configuration
+├── 📦 backend/            # Domain-driven modules
+│   ├── 🛠️ parser.py       # Docling Engine + VLM Describer
+│   ├── 🧠 rag.py          # Hybrid Search + RAG Orchestrator
+│   ├── 💾 database.py     # MongoDB Atlas Vector Store integration
+│   └── 🤖 models.py       # CLIP, LLM, and VLM Connectors
+├── 📁 rag_data/           # Parsed JSON persistence
+├── 🐳 Dockerfile          # Single-stage python:3.10-slim build
+└── 📋 requirements.txt    # Optimized dependency tree
+```
+---
+## 🏗️ Core Technology Stack
+| Layer | Technology | Purpose |
+| :--- | :--- | :--- |
+| **Parsing** | **Docling** | High-fidelity PDF structural parsing & OCR |
+| **VLM** | **Groq (Llama-4-Scout)** | Image captioning for multimodal indexing |
+| **Embeddings** | **CLIP (ViT-L/14)** | Joint Text-Image vector space |
+| **Vector DB** | **MongoDB Atlas** | Scalable vector search & metadata storage |
+| **LLM** | **Llama-3.3-70B** | Final answer generation (via Groq) |
+| **UI** | **Streamlit** | Modern, responsive chat interface |
+---
+## 🚦 Getting Started
+### 1. Prerequisites
+- Python 3.10+
+- A [MongoDB Atlas](https://www.mongodb.com/cloud/atlas/register) Account (for Vector Search)
+- A [Groq API Key](https://console.groq.com/)
+### 2. Configure Environment
+Create a `.env` file in the root directory:
+```env
+# MongoDB Credentials
+MONGO_USER=your_username
+MONGO_PASSWORD=your_password
+MONGO_HOST=your_cluster_url.mongodb.net
+MONGO_DB=rag_assistant
+# API Keys
+GROQ_API_KEY=gsk_your_key_here
+# Optional: Full URI (overrides components above)
+# MONGO_URI=mongodb+srv://...
+```
+### 3. Quick Run (Docker)
+```bash
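+# No compose file ships with this commit; build and run the image directly
+# (the image name is illustrative)
+docker build -t multimodal-rag .
+docker run --env-file .env -p 7860:7860 multimodal-rag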
+```
+### 4. Local Setup
+```bash
+# Install dependencies
+pip install -r requirements.txt
+# Launch app
+streamlit run app.py
+```
+---
+## 📈 Search Optimization
+- **Dense Retrieval (CLIP):** Captures semantic meaning and visual similarity.
+- **Sparse Retrieval (BM25):** Favors exact keyword matches (names, technical terms) that dense embeddings can overlook.
+- **Hybrid Weighting:** An `alpha` parameter (default `0.5` in `hybrid_search`) balances the two search methods.
+---
+## 🛡️ Security & Scalability
+*   **Safe Parsing:** Docling runs inside the application's Docker container, with a per-document timeout (`document_timeout=300.0`).
+*   **Vector Search Indexing:** Uses a MongoDB Atlas Vector Search index (cosine similarity) with a filter on `metadata.type`.
+*   **Streaming Responses:** Streams generated tokens to the chat UI through Streamlit's `st.write_stream`.
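
The hybrid weighting described in the README's Search Optimization section can be sketched as follows. This is a minimal illustration, not the fusion logic of `backend/rag.py` (whose `hybrid_search` body is only partly visible in this commit view). The min-max normalization step and the helper names `min_max_normalize`, `fuse_scores`, and `rank_top_k` are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so dense and sparse scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as an equally strong match.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse_scores(
    dense: Dict[str, float], sparse: Dict[str, float], alpha: float = 0.5
) -> Dict[str, float]:
    """Weighted sum: alpha * dense + (1 - alpha) * sparse, per document id."""
    dense_n = min_max_normalize(dense)
    sparse_n = min_max_normalize(sparse)
    doc_ids = set(dense_n) | set(sparse_n)
    return {
        doc_id: alpha * dense_n.get(doc_id, 0.0)
        + (1.0 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


def rank_top_k(fused: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (doc_id, score) pairs."""
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical scores: cosine similarities (dense) and raw BM25 values (sparse).
dense_scores = {"chunk_0": 0.9, "chunk_1": 0.5}
sparse_scores = {"chunk_1": 4.0, "chunk_2": 2.0}
ranked = rank_top_k(fuse_scores(dense_scores, sparse_scores, alpha=0.7), k=2)
```

With `alpha` close to 1 the ranking follows the CLIP similarity; with `alpha` close to 0 it follows BM25. A document found by only one retriever contributes zero for the other, which is one possible convention among several.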

app.py ADDED

	@@ -0,0 +1,290 @@

+import streamlit as st
+import os
+import base64
+from io import BytesIO
+from PIL import Image
+import time
+# Import Modular components
+from backend.rag import RAGEngine
+from backend.parser import EnrichedRagParser
+# ==========================================
+# 1. Page Configuration & Professional CSS
+# ==========================================
+st.set_page_config(
+    page_title="Multimodal RAG Assistant",
+    page_icon="🤖",
+    layout="wide",
+    initial_sidebar_state="expanded"
+)
+# Production-ready CSS
+st.markdown("""
+<style>
+    .stChatMessage {
+        background-color: var(--secondary-background-color);
+        border: 1px solid rgba(128, 128, 128, 0.1);
+        border-radius: 12px;
+        padding: 1.5rem;
+        margin-bottom: 1rem;
+        box-shadow: 0 2px 4px rgba(0,0,0,0.05);
+    }
+    .stats-container {
+        background-color: var(--secondary-background-color);
+        border: 1px solid rgba(128, 128, 128, 0.2);
+        border-radius: 10px;
+        padding: 15px;
+        margin-top: 10px;
+    }
+    .stats-header {
+        font-weight: 600;
+        color: var(--text-color);
+        margin-bottom: 8px;
+        display: block;
+    }
+    .stats-item {
+        font-size: 0.9em;
+        color: var(--text-color);
+        opacity: 0.8;
+        margin-bottom: 4px;
+        display: flex;
+        justify-content: space-between;
+    }
+</style>
+""", unsafe_allow_html=True)
+# ==========================================
+# 2. Initialization & Helper Functions
+# ==========================================
+@st.cache_resource
+def initialize_rag_system(force_clean: bool = True):
+    """Initialize the RAG system with caching."""
+    return RAGEngine(use_hybrid=True, force_clean=force_clean)
+def display_image_from_base64(base64_str: str, caption: str = "", width: int = 300):
+    """Helper to decode and display base64 images."""
+    try:
+        img_data = base64.b64decode(base64_str)
+        img = Image.open(BytesIO(img_data))
+        st.image(img, caption=caption, width=width)
+    except Exception as e:
+        st.error(f"Failed to display image: {e}")
+# ==========================================
+# 3. Main Application
+# ==========================================
+def main():
+    # --- State Management ---
+    if "messages" not in st.session_state:
+        st.session_state.messages = []
+    if "suggested_questions" not in st.session_state:
+        st.session_state.suggested_questions = []
+    # Initialize Backend
+    if "rag" not in st.session_state:
+        with st.spinner("🚀 Booting up AI System..."):
+            st.session_state.rag = initialize_rag_system()
+    rag: RAGEngine = st.session_state.rag
+    # ==========================================
+    # SIDEBAR: Control Panel
+    # ==========================================
+    with st.sidebar:
+        st.header("🧠 RAG Control Panel")
+        # --- PDF Document Upload ---
+        with st.expander("📂 Knowledge Base", expanded=True):
+            uploaded_file = st.file_uploader(
+                "Upload Document (PDF)",
+                type=["pdf"],
+                label_visibility="collapsed"
+            )
+            if uploaded_file:
+                # Temporary save for parsing
+                temp_dir = "temp_uploads"
+                os.makedirs(temp_dir, exist_ok=True)
+                save_path = os.path.join(temp_dir, uploaded_file.name)
+                with open(save_path, "wb") as f:
+                    f.write(uploaded_file.getbuffer())
+                if st.button("🚀 Process PDF", type="primary", use_container_width=True):
+                    try:
+                        with st.spinner("Analyzing PDF with Docling..."):
+                            parser = EnrichedRagParser()
+                            parsed_data = parser.process_document(save_path)
+                        with st.spinner("Ingesting into MongoDB..."):
+                            rag.ingest_data(parsed_data)
+                        # Generate Suggestions
+                        suggestions = rag.generate_suggested_questions(num_questions=6)
+                        st.session_state.suggested_questions = suggestions
+                        st.success(f"Processed: {uploaded_file.name}")
+                        st.rerun()
+                    except Exception as e:
+                        # No rerun here, so the error stays visible to the user
+                        st.error(f"❌ Error: {e}")
+                    finally:
+                        # ✅ Always clean up the temp file
+                        if os.path.exists(save_path):
+                            os.remove(save_path)
+                            print("🧹 Temp file deleted")
+        st.markdown("---")
+        # --- Suggested Questions ---
+        if st.session_state.suggested_questions:
+            st.subheader("💡 Quick Questions")
+            for idx, q in enumerate(st.session_state.suggested_questions):
+                if st.button(q, key=f"sugg_{idx}", use_container_width=True):
+                    st.session_state.messages.append({"role": "user", "content": q})
+                    st.rerun()
+            st.markdown("---")
+        # --- Settings ---
+        with st.expander("⚙️ Search Settings"):
+            top_k = st.slider("Max Results", 1, 10, 5)
+            min_score = st.slider("Confidence Threshold", 0.0, 1.0, 0.6)
+            use_images = st.toggle("Enable Image Search", value=True)
+        # --- System Stats ---
+        count = rag.collection.count_documents({})
+        st.markdown(
+            f"""
+            <div class="stats-container">
+                <span class="stats-header">📊 Database Status</span>
+                <div class="stats-item"><span>Total Chunks:</span> <strong>{count}</strong></div>
+                <div class="stats-item"><span>Embedding:</span> <strong>CLIP ViT-L/14</strong></div>
+            </div>
+            """,
+            unsafe_allow_html=True,
+        )
+        # Reset
+        if st.button("🗑️ Clear Chat", type="secondary", use_container_width=True):
+            st.session_state.messages = []
+            st.rerun()
+        if st.button("⚠️ Delete Vector Collection", type="primary", use_container_width=True):
+            with st.spinner("Deleting collection..."):
+                rag.collection.delete_many({})
+                # Reset in-memory indices to match empty DB
+                rag.bm25_index = None
+                rag.bm25_doc_map = {}
+                st.success("Vector Collection Deleted!")
+                time.sleep(1) # Give user a moment to see the success message
+                st.rerun()
+    # ==========================================
+    # MAIN: Chat Interface
+    # ==========================================
+    st.title("🤖 Multimodal AI Assistant")
+    if not st.session_state.messages:
+        st.markdown(
+            """
+            <div style="text-align: center; margin-top: 50px; opacity: 0.7;">
+                <h3>👋 Ready to help!</h3>
+                <p>Upload a PDF in the sidebar to start.</p>
+            </div>
+            """,
+            unsafe_allow_html=True,
+        )
+    # Render History
+    for msg in st.session_state.messages:
+        with st.chat_message(msg["role"]):
+            st.markdown(msg["content"])
+            if "images" in msg and msg["images"]:
+                st.markdown("---")
+                cols = st.columns(3)
+                for i, img in enumerate(msg["images"]):
+                    with cols[i % 3]:
+                        display_image_from_base64(img["image_base64"], width=220)
+    # ==========================================
+    # LOGIC: Input Handling
+    # ==========================================
+    user_input = st.chat_input("Type your question here...")
+    if user_input:
+        st.session_state.messages.append({"role": "user", "content": user_input})
+        st.rerun()
+    # ==========================================
+    # ASSISTANT: Streaming Response Logic
+    # ==========================================
+    if st.session_state.messages and st.session_state.messages[-1]["role"] == "user":
+        last_query = st.session_state.messages[-1]["content"]
+        with st.chat_message("assistant"):
+            with st.spinner("🤔 Searching context..."):
+                try:
+                    img_keywords = ["show", "image", "diagram", "figure", "picture"]
+                    is_visual_request = any(
+                        k in last_query.lower() for k in img_keywords
+                    ) and use_images
+                    found_imgs = []
+                    answer_text = ""
+                    if is_visual_request:
+                        # 🔍 Image search branch (non-streaming)
+                        found_imgs = rag.search_images(
+                            last_query,
+                            top_k=3,
+                            min_score=min_score,
+                        )
+                        if found_imgs:
+                            answer_text = f"I found {len(found_imgs)} relevant visuals:"
+                        else:
+                            answer_text = "I couldn't find any relevant images."
+                        # Render once
+                        st.markdown(answer_text)
+                    else:
+                        # 🧠 Text answer branch (STREAMING)
+                        # Assume rag.answer_question returns a generator / stream.
+                        # st.write_stream will both display the chunks and return
+                        # the final concatenated string.
+                        stream = rag.answer_question(
+                            last_query,
+                            top_k=top_k
+                        )
+                        answer_text = st.write_stream(stream)
+                    # Render images if any
+                    if found_imgs:
+                        st.markdown("---")
+                        cols = st.columns(3)
+                        for idx, img in enumerate(found_imgs):
+                            with cols[idx % 3]:
+                                display_image_from_base64(
+                                    img["image_base64"], width=220
+                                )
+                    # Persist assistant message in history
+                    st.session_state.messages.append(
+                        {
+                            "role": "assistant",
+                            "content": answer_text,
+                            "images": found_imgs,
+                        }
+                    )
+                except Exception as e:
+                    st.error(f"Error: {e}")
+                    st.session_state.messages.append(
+                        {"role": "assistant", "content": f"❌ Error: {e}"}
+                    )
+if __name__ == "__main__":
+    main()
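
The keyword check in `app.py` that routes a query to image search uses plain substring matching, so a word such as "showcase" also triggers it. A whole-word variant could look like the sketch below. The keyword list is taken from `app.py`; the function name `is_visual_request`, the regular expression, and the optional plural `s` are assumptions for illustration, not part of the commit.

```python
import re

# Same keywords as the img_keywords list in app.py.
IMG_KEYWORDS = ("show", "image", "diagram", "figure", "picture")

# Match a keyword as a whole word, optionally followed by a plural "s".
_VISUAL_PATTERN = re.compile(
    r"\b(?:" + "|".join(IMG_KEYWORDS) + r")s?\b", re.IGNORECASE
)


def is_visual_request(query: str, use_images: bool = True) -> bool:
    """Return True when image search is enabled and the query names a visual."""
    return use_images and bool(_VISUAL_PATTERN.search(query))
```

For example, `is_visual_request("Show me the revenue diagram")` is true, while a query that only mentions "showcase" is not routed to image search.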

backend/__init__.py ADDED

	@@ -0,0 +1 @@


+# Modules package initialization

backend/database.py ADDED

	@@ -0,0 +1,15 @@

+from pymongo import MongoClient
+from pymongo.collection import Collection
+from config import MONGO_URI, DB_NAME, MONGO_COLLECTION
+def get_mongo_client(uri: str | None = None) -> MongoClient:
+    """Return a pymongo MongoClient."""
+    uri = uri or MONGO_URI
+    return MongoClient(uri)
+def get_mongo_collection(client: MongoClient | None = None, db_name: str | None = None, collection_name: str | None = None) -> Collection:
+    """Return a MongoDB collection instance."""
+    client = client or get_mongo_client()
+    db_name = db_name or DB_NAME
+    collection_name = collection_name or MONGO_COLLECTION
+    return client[db_name][collection_name]
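
`database.py` reads `MONGO_URI` from `config.py`, and the README states that a full `MONGO_URI` overrides the individual `MONGO_USER`, `MONGO_PASSWORD`, and `MONGO_HOST` values. One way that precedence could be resolved is sketched below. The function name `resolve_mongo_uri`, the query-string options, and the decision to return `None` when values are missing are assumptions; the actual `config.py` may differ.

```python
from typing import Mapping, Optional
from urllib.parse import quote_plus


def resolve_mongo_uri(env: Mapping[str, str]) -> Optional[str]:
    """Prefer a full MONGO_URI; otherwise assemble one from its components."""
    full_uri = env.get("MONGO_URI")
    if full_uri:
        return full_uri

    user = env.get("MONGO_USER")
    password = env.get("MONGO_PASSWORD")
    host = env.get("MONGO_HOST")
    if not (user and password and host):
        # Not enough information to build a connection string.
        return None

    # Credentials must be percent-encoded so characters such as "@" or ":"
    # do not break the URI structure.
    return (
        f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/?retryWrites=true&w=majority"
    )
```

In an application this would typically be called with `os.environ` after `load_dotenv()`, and the result passed to `get_mongo_client(uri=...)`.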

backend/models.py ADDED

	@@ -0,0 +1,21 @@

+import torch
+from sentence_transformers import SentenceTransformer
+from groq import Groq
+from config import CLIP_MODEL_NAME, GROQ_API_KEY, LLM_MODEL_NAME
+from langchain_groq import ChatGroq
+def get_clip_model(model_name: str = CLIP_MODEL_NAME):
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    try:
+        model = SentenceTransformer(model_name, trust_remote_code=True)
+        model.to(device)
+        return model
+    except Exception as e:
+        print(f"Fallback CLIP model due to: {e}")
+        return SentenceTransformer('clip-ViT-B-32')
+def get_llm(model_name: str = LLM_MODEL_NAME):
+    return ChatGroq(model=model_name, api_key=GROQ_API_KEY, temperature=0.1)
+def get_groq_client(api_key: str = GROQ_API_KEY):
+    return Groq(api_key=api_key)

backend/parser.py ADDED

	@@ -0,0 +1,189 @@

+import json
+import os
+import base64
+from io import BytesIO
+from typing import List, Dict, Any
+from docling.document_converter import DocumentConverter, PdfFormatOption
+from docling.datamodel.base_models import InputFormat
+from docling.datamodel.pipeline_options import (
+    OcrAutoOptions,
+    PdfPipelineOptions,
+    PictureDescriptionApiOptions,
+)
+from docling_core.types.doc.labels import DocItemLabel
+from docling_core.types.doc.document import SectionHeaderItem, TitleItem
+from config import GROQ_API_KEY
+from docling.chunking import HybridChunker
+from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
+from docling.datamodel.settings import settings
+class EnrichedRagParser:
+    """
+    Parser using Docling's HybridChunker for Multimodal RAG.
+    Modified from sonnet_export.py for modular use.
+    """
+    def __init__(self, groq_api_key: str = GROQ_API_KEY):
+        self.groq_api_key = groq_api_key
+        self.converter = self._setup_converter()
+        self.chunker = HybridChunker(merge_peers=True)
+    def _setup_converter(self) -> DocumentConverter:
+        # CPU Configuration
+        accelerator_options = AcceleratorOptions(
+            num_threads=min(12, os.cpu_count() or 1),  # cpu_count() may return None
+            device=AcceleratorDevice.CPU
+        )
+        # Smart OCR Configuration
+        # Only triggers when >50% of page is scanned/bitmap content
+        ocr_options = OcrAutoOptions(
+            lang=["en"],                        # ✅ Specify language
+            force_full_page_ocr=False,          # ⚡ Don't force OCR on all pages
+            bitmap_area_threshold=0.5           # ⚡ Smart: Only OCR if >50% scanned
+        )
+        # Pipeline Configuration
+        pipeline_options = PdfPipelineOptions(
+            # Features
+            do_ocr=True,                        # Enable OCR (but smart triggering)
+            do_table_structure=True,
+            generate_picture_images=True,
+            images_scale=1,
+            ocr_options=ocr_options,            # ⚡ Smart OCR config
+            # Disable unnecessary features
+            generate_page_images=False,
+            # Picture descriptions via a remote VLM API (requires remote services)
+            enable_remote_services=True,
+            do_picture_description=True,
+            # Resource management
+            queue_max_size=10,
+            document_timeout=300.0
+        )
+        pipeline_options.accelerator_options = accelerator_options
+        settings.debug.profile_pipeline_timings = True
+        pipeline_options.picture_description_options = PictureDescriptionApiOptions(
+            url="https://api.groq.com/openai/v1/chat/completions",
+            params={
+                "model": "meta-llama/llama-4-scout-17b-16e-instruct", # Double check this model string
+                "temperature": 0.2,
+                "max_tokens": 500,
+            },
+            prompt="Describe this image in detail for a RAG knowledge base. Include all visible text, numbers, and chart trends.",
+            headers={"Authorization": f"Bearer {self.groq_api_key}"}
+        )
+        return DocumentConverter(
+            format_options={
+                InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
+            }
+        )
+    def _determine_chunk_type(self, chunk) -> str:
+        chunk_type = "text"
+        if hasattr(chunk.meta, "doc_items") and chunk.meta.doc_items:
+            labels = [item.label for item in chunk.meta.doc_items]
+            if DocItemLabel.TABLE in labels:
+                chunk_type = "table"
+            elif DocItemLabel.LIST_ITEM in labels:
+                chunk_type = "list"
+            elif any(l in [DocItemLabel.TITLE, DocItemLabel.SECTION_HEADER] for l in labels):
+                chunk_type = "header"
+            elif DocItemLabel.CODE in labels:
+                chunk_type = "code"
+        return chunk_type
+    def _get_base64_image(self, pic) -> str:
+        try:
+            if hasattr(pic, "image") and pic.image and hasattr(pic.image, "pil_image"):
+                img = pic.image.pil_image
+                if img:
+                    buffered = BytesIO()
+                    if img.mode != "RGB":
+                        img = img.convert("RGB")
+                    img.save(buffered, format="PNG")
+                    return base64.b64encode(buffered.getvalue()).decode("utf-8")
+        except Exception as e:
+            print(f"Failed to convert image to base64: {e}")
+        return ""
+    def _find_image_heading(self, doc, pic_item) -> str:
+        current_heading = "Unknown"
+        for item, level in doc.iterate_items():
+            if isinstance(item, (SectionHeaderItem, TitleItem)):
+                if hasattr(item, 'text'):
+                    current_heading = item.text
+            if item == pic_item:
+                return current_heading
+        return current_heading
+    def process_document(self, file_path: str, save_json: bool = True, output_dir: str = "rag_data", max_page: int = 10) -> Dict[str, Any]:
+        """Converts document and returns structured data."""
+        print(f"Testing Docling Parser on: {file_path}...")
+        result = self.converter.convert(file_path)
+        doc = result.document
+        # `times` holds one entry per timed run; sum them for the total
+        doc_conversion_secs = sum(result.timings["pipeline_total"].times)
+        print(f"Doc conversion time: {doc_conversion_secs:.2f} seconds")
+        chunk_iter = self.chunker.chunk(dl_doc=doc)
+        structured_chunks = []
+        for i, chunk in enumerate(chunk_iter):
+            heading = chunk.meta.headings[0] if chunk.meta.headings else "Unknown"
+            page_num = 0
+            if hasattr(chunk.meta, "doc_items") and chunk.meta.doc_items:
+                for item in chunk.meta.doc_items:
+                    if hasattr(item, "prov") and item.prov:
+                        if len(item.prov) > 0 and hasattr(item.prov[0], "page_no"):
+                            page_num = item.prov[0].page_no
+                            break
+            structured_chunks.append({
+                "chunk_id": f"chunk_{i}",
+                "type": self._determine_chunk_type(chunk),
+                "text": chunk.text,
+                "metadata": {
+                    "source": os.path.basename(file_path),
+                    "page_number": page_num,
+                    "section_header": heading
+                }
+            })
+        images_data = []
+        for i, pic in enumerate(doc.pictures):
+            description = "No description"
+            if hasattr(pic, "meta") and pic.meta and hasattr(pic.meta, "description"):
+                desc_obj = pic.meta.description
+                description = desc_obj.text if hasattr(desc_obj, "text") else str(desc_obj)
+            images_data.append({
+                "image_id": f"img_{i}",
+                "description": description,
+                "page_number": pic.prov[0].page_no if pic.prov else 0,
+                "section_header": self._find_image_heading(doc, pic),
+                "image_base64": self._get_base64_image(pic)
+            })
+        final_output = {"chunks": structured_chunks, "images": images_data}
+        if save_json:
+            os.makedirs(output_dir, exist_ok=True)
+            with open(os.path.join(output_dir, "parsed_knowledge.json"), "w", encoding="utf-8") as f:
+                json.dump(final_output, f, indent=2, ensure_ascii=False)
+            print(f"Saved parsed knowledge to {output_dir}/parsed_knowledge.json")
+        return final_output

backend/rag.py ADDED

	@@ -0,0 +1,315 @@

+import json
+import os
+import numpy as np
+import torch
+from typing import List, Dict, Any, Optional
+from tqdm import tqdm
+from pymongo import ReplaceOne
+from rank_bm25 import BM25Okapi
+from langchain_core.prompts import ChatPromptTemplate
+from langchain_core.output_parsers import StrOutputParser
+from config import VECTOR_INDEX_NAME
+from .database import get_mongo_client, get_mongo_collection
+from .models import get_clip_model, get_llm, get_groq_client
+from dotenv import load_dotenv
+import time
+load_dotenv()
+class RAGEngine:
+    """
+    Unified RAG engine refactored from search.py.
+    """
+    def __init__(self, use_hybrid: bool = True, force_clean: bool = False):
+        self.use_hybrid = use_hybrid
+        self.clip_model = get_clip_model()
+        self.collection = get_mongo_collection()
+        self.llm = get_llm()
+        self.groq_client = get_groq_client()
+        if force_clean:
+            self.collection.delete_many({})
+        self._setup_vector_index()
+        self.bm25_index = None
+        self.bm25_doc_map = {}
+        if self.collection.count_documents({}) > 0:
+            self._rebuild_bm25_index()
+    def _setup_vector_index(self):
+        """
+        Attempts to create a vector search index if using MongoDB Atlas.
+        Includes robust dimension checking and error handling.
+        """
+        # 1. Determine Dimensions safely
+        try:
+            dims = self.clip_model.get_sentence_embedding_dimension()
+            if dims is None or not isinstance(dims, int):
+                raise ValueError("Model returned invalid dimensions")
+        except Exception:
+            print("Auto-dim failed, probing model...")
+            test_vec = self.clip_model.encode("test")
+            dims = len(test_vec)
+        print(f"Vector Dimensions: {dims}")
+        # 2. Define Index Model
+        index_model = {
+            "definition": {
+                "fields": [
+                    {
+                        "type": "vector",
+                        "path": "embedding",
+                        "numDimensions": int(dims),  # Ensure strict integer
+                        "similarity": "cosine"
+                    },
+                    {
+                        "type": "filter",
+                        "path": "metadata.type"
+                    }
+                ]
+            },
+            "name": VECTOR_INDEX_NAME,
+            "type": "vectorSearch"
+        }
+        # 3. Create Index
+        try:
+            # Check if index already exists
+            indexes = list(self.collection.list_search_indexes())
+            index_names = [idx.get("name") for idx in indexes]
+            if VECTOR_INDEX_NAME not in index_names:
+                print(f"Creating Atlas Vector Search Index '{VECTOR_INDEX_NAME}'...")
+                self.collection.create_search_index(model=index_model)
+                print("Index creation initiated. Please wait 1-2 minutes for Atlas to build it.")
+                print("You can check progress in Atlas UI -> Database -> Search -> Vector Search")
+            else:
+                print(f"Index '{VECTOR_INDEX_NAME}' already exists.")
+        except Exception as e:
+            print(f"\nAutomatic Index Creation Failed: {e}")
+            print("This is common on Free Tier (M0) or due to permissions.")
+            print("PLEASE CREATE MANUALLY IN ATLAS UI (See JSON below)\n")
+            print(json.dumps(index_model["definition"], indent=2))
+    def _rebuild_bm25_index(self):
+        cursor = self.collection.find(
+            {"metadata.type": {"$in": ["text", "table", "list", "header", "code"]}},
+            {"content": 1, "_id": 1}
+        )
+        text_docs = []
+        self.bm25_doc_map = {}
+        for idx, doc in enumerate(cursor):
+            content = doc.get("content", "")
+            if content:
+                text_docs.append(content.lower().split())
+                self.bm25_doc_map[idx] = str(doc["_id"])
+        if text_docs:
+            self.bm25_index = BM25Okapi(text_docs)
+    def _encode_content(self, content: Any, content_type: str) -> Optional[np.ndarray]:
+        """Embed text or a base64-encoded image with CLIP; return None on image failure."""
+        if content_type == "image":
+            # Assuming content is base64
+            from PIL import Image
+            from io import BytesIO
+            import base64
+            try:
+                img = Image.open(BytesIO(base64.b64decode(content))).convert("RGB")
+                return self.clip_model.encode(img, normalize_embeddings=True)
+            except Exception as e:
+                print(f"Failed to encode image: {e}")
+                return None
+        return self.clip_model.encode(content, normalize_embeddings=True)
+    def ingest_data(self, data: Dict[str, Any]):
+        """Ingests processed document data."""
+        operations = []
+        for chunk in data.get("chunks", []):
+            embedding = self._encode_content(chunk["text"], "text")
+            if embedding is None: continue
+            doc = {
+                "_id": chunk["chunk_id"],
+                "content": chunk["text"],
+                "embedding": embedding.tolist(),
+                "metadata": {
+                    **chunk["metadata"],
+                    "type": chunk.get("type", "text")
+                }
+            }
+            operations.append(ReplaceOne({"_id": doc["_id"]}, doc, upsert=True))
+        for img in data.get("images", []):
+            embedding = self._encode_content(img["image_base64"], "image")
+            if embedding is None: continue
+            doc = {
+                "_id": img["image_id"],
+                "content": img.get("description", ""),
+                "embedding": embedding.tolist(),
+                "metadata": {
+                    "page": str(img.get("page_number", 0)),
+                    "header": str(img.get("section_header", "")),
+                    "type": "image",
+                    "description": img.get("description", ""),
+                    "image_base64": img["image_base64"]
+                }
+            }
+            operations.append(ReplaceOne({"_id": doc["_id"]}, doc, upsert=True))
+        if operations:
+            for i in range(0, len(operations), 100):
+                self.collection.bulk_write(operations[i:i+100])
+            self._rebuild_bm25_index()
+    def hybrid_search(self, query: str, top_k: int = 5, alpha: float = 0.5) -> List[Dict]:
+        query_embedding = self._encode_content(query, "text")
+        dense_results = []
+        try:
+            pipeline = [
+                {"$vectorSearch": {
+                    "index": VECTOR_INDEX_NAME,
+                    "path": "embedding",
+                    "queryVector": query_embedding.tolist(),
+                    "numCandidates": top_k * 10,
+                    "limit": top_k * 2
+                }},
+                {"$project": {"content": 1, "metadata": 1, "score": {"$meta": "vectorSearchScore"}}}
+            ]
+            dense_results = list(self.collection.aggregate(pipeline))
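+        except Exception as e:
+            # Dense retrieval failed; continue with BM25 results only.
+            print(f"Vector search failed: {e}")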
+        dense_scores = {str(r["_id"]): {"score": r.get("score", 0), "doc": r} for r in dense_results}
+        sparse_scores = {}
+        if self.bm25_index:
+            scores = self.bm25_index.get_scores(query.lower().split())
+            max_s = max(scores) if len(scores) > 0 and max(scores) > 0 else 1.0
+            for i in np.argsort(scores)[::-1][:top_k*2]:
+                if scores[i] > 0:
+                    sparse_scores[self.bm25_doc_map[i]] = scores[i] / max_s
+        combined = []
+        all_ids = set(dense_scores.keys()) | set(sparse_scores.keys())
+        for did in all_ids:
+            d_s = dense_scores.get(did, {}).get("score", 0)
+            s_s = sparse_scores.get(did, 0)
+            score = (alpha * d_s) + ((1-alpha) * s_s)
+            doc = dense_scores.get(did, {}).get("doc") or self.collection.find_one({"_id": did})
+            if doc:
+                combined.append({**doc, "score": score})
+        combined.sort(key=lambda x: x["score"], reverse=True)
+        return combined[:top_k]
+    def answer_question(self, question: str, top_k: int = 5):
+        """Streams the answer as string chunks (this method is a generator)."""
+        results = self.hybrid_search(question, top_k=top_k)
+        if not results:
+            # A `return value` inside a generator never reaches the caller, so yield it.
+            yield "No relevant info found."
+            return
+        context = ""
+        for i, res in enumerate(results, 1):
+            m = res["metadata"]
+            # Text chunks may store "page_number"; image documents store "page".
+            page = m.get("page_number", m.get("page", "?"))
+            context += f"\n[Src {i} | Page {page}] {res['content']}"
+        prompt = f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer strictly based on context:"
+        try:
+            chain = ChatPromptTemplate.from_template("{p}") | self.llm | StrOutputParser()
+            for msg in chain.stream({"p": prompt}):
+                time.sleep(0.01)  # brief pause to pace the streamed output
+                yield msg.content if hasattr(msg, "content") else str(msg)
+        except Exception as e:
+            yield f"Error: {e}"
+    def search_images(self, query: str, top_k: int = 3, min_score: float = 0.5) -> List[Dict]:
+        query_embedding = self._encode_content(query, "text")
+        try:
+            pipeline = [
+                {"$vectorSearch": {
+                    "index": VECTOR_INDEX_NAME, "path": "embedding",
+                    "queryVector": query_embedding.tolist(), "numCandidates": top_k*10, "limit": top_k*2,
+                    "filter": {"metadata.type": "image"}
+                }},
+                {"$project": {"content": 1, "metadata": 1, "score": {"$meta": "vectorSearchScore"}}}
+            ]
+            results = list(self.collection.aggregate(pipeline))
+            return [{"description": r["content"], "image_base64": r["metadata"].get("image_base64"), "score": r["score"]}
+                    for r in results if r["score"] >= min_score][:top_k]
+        except Exception as e:
+            print(f"Image search failed: {e}")
+            return []
+    # def generate_suggested_questions(self, num_questions: int = 5) -> List[str]:
+    #     # Simple metadata-based generation or just a fixed list for now
+    #     return ["What is the main topic?", "Explain the diagrams.", "Summarize the results."]
+    def generate_suggested_questions(self, num_questions: int = 4) -> List[str]:
+        """Token-efficient question generation using metadata."""
+        print("\nGenerating suggested questions (Efficient Mode)...")
+        try:
+            # 1. Fetch metadata ONLY (projection excludes embedding and content)
+            cursor = self.collection.find(
+                {},
+                {"metadata": 1, "_id": 0}
+            ).limit(100)
+            metadatas = [doc.get('metadata', {}) for doc in cursor]
+            if not metadatas:
+                return ["What is this document about?"]
+            # 2. Extract High-Level Structure
+            headers = set()
+            image_descriptions = []
+            import random
+            random.shuffle(metadatas)
+            for meta in metadatas:
+                if 'header' in meta and len(headers) < 8:
+                    h = str(meta['header']).strip()
+                    if h and h.lower() != "unknown" and len(h) > 5:
+                        headers.add(h)
+                if meta.get('type') == 'image' and len(image_descriptions) < 2:
+                    desc = meta.get('description', '')
+                    if len(desc) > 20:
+                        image_descriptions.append(desc[:100] + "...")
+            # 3. Construct Prompt
+            context_str = "Document Sections:\n" + "\n".join([f"- {h}" for h in headers])
+            if image_descriptions:
+                context_str += "\n\nVisual Content involves:\n" + "\n".join([f"- {d}" for d in image_descriptions])
+            # 4. Prompt LLM
+            prompt = f"""Generate {num_questions} short, interesting questions about a document with these sections and visuals:
+            {context_str}
+            Output ONLY the {num_questions} questions, one per line. No numbering."""
+            prompt_tmpl = ChatPromptTemplate.from_messages([
+                ("system", "You are a helpful assistant."),
+                ("user", "{prompt}")
+            ])
+            chain = prompt_tmpl | self.llm | StrOutputParser()
+            response = chain.invoke({"prompt": prompt})
+            questions = [q.strip().lstrip('-1234567890. ') for q in response.split('\n') if q.strip()]
+            return questions[:num_questions]
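+        except Exception as e:
+            print(f"Error generating questions: {e}")
+            # Honor the declared List[str] return type instead of implicitly returning None.
+            return ["What is this document about?"]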
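
Note on the score fusion in hybrid_search: the weighting used there, alpha * dense + (1 - alpha) * sparse with BM25 scores scaled by their maximum, can be tried in isolation. The following is a minimal sketch under stated assumptions, not the code from this commit: `fuse_scores` and the sample IDs and scores are hypothetical, and it operates on plain dicts rather than on MongoDB results.

```python
from typing import Dict, List, Tuple


def fuse_scores(
    dense: Dict[str, float],
    sparse: Dict[str, float],
    alpha: float = 0.5,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Blend dense and sparse scores as alpha * dense + (1 - alpha) * sparse."""
    # Scale raw sparse (BM25-style) scores by their maximum so they fall in [0, 1].
    max_sparse = max(sparse.values(), default=0.0)
    if max_sparse <= 0:
        max_sparse = 1.0  # avoid division by zero when nothing matched

    fused = {}
    # A document missing from one result set contributes 0 for that side.
    for doc_id in set(dense) | set(sparse):
        d = dense.get(doc_id, 0.0)
        s = sparse.get(doc_id, 0.0) / max_sparse
        fused[doc_id] = alpha * d + (1 - alpha) * s

    # Highest combined score first, truncated to top_k entries.
    ranked = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    # Hypothetical scores: "b" appears in both result sets, "a" and "c" in one each.
    dense_hits = {"a": 0.9, "b": 0.4}
    sparse_hits = {"b": 6.0, "c": 3.0}
    for doc_id, score in fuse_scores(dense_hits, sparse_hits, alpha=0.5, top_k=3):
        print(doc_id, round(score, 3))
```

With alpha = 0.5 a document found by both retrievers can outrank one that scores highly on a single side, which is the main reason to union the two ID sets before ranking. Note that the dense side is used as returned, so the two score ranges are only roughly comparable.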

config.py ADDED Viewed

	@@ -0,0 +1,23 @@

+import os
+from dotenv import load_dotenv
+from urllib.parse import quote_plus
+load_dotenv()
+# --- MongoDB Configuration ---
+DB_NAME = os.getenv("MONGO_DB", "mongodb")
+DB_PASSWORD = os.getenv("MONGO_PASSWORD", "pass")
+DB_USER = os.getenv("MONGO_USER", "username")
+DB_HOST = os.getenv("MONGO_HOST", "localhost")
+VECTOR_INDEX_NAME = "vector_index"
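+# Percent-escape both credentials; reserved characters such as "@" or ":" would otherwise break the URI.
+MONGO_URI = f"mongodb+srv://{quote_plus(DB_USER)}:{quote_plus(DB_PASSWORD)}@{DB_HOST}/?appName={quote_plus(DB_NAME)}"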
+MONGO_COLLECTION = os.getenv("MONGO_COLLECTION", "documents")
+# --- API Keys ---
+GROQ_API_KEY = os.getenv("GROQ_API_KEY")
+# --- Model Configurations ---
+CLIP_MODEL_NAME = "clip-ViT-L-14"
+LLM_MODEL_NAME = "llama-3.3-70b-versatile" # Fallback/Check
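
Note on the connection string: credentials that contain characters such as `@`, `:`, or `/` need percent-escaping before they are placed in a MongoDB URI. The sketch below shows one way to assemble such a URI with `urllib.parse.quote_plus`. The helper `build_mongo_uri`, its `srv` flag, and the example host are hypothetical and not part of this commit. The `srv` switch is there on the assumption that `mongodb+srv://` relies on DNS SRV records, so a plain host like `localhost` would likely need the `mongodb://` scheme instead.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user: str, password: str, host: str, app_name: str, srv: bool = True) -> str:
    """Assemble a MongoDB connection URI with percent-escaped credentials."""
    # Assumption: SRV-style URIs suit hosted clusters; plain hosts use mongodb://.
    scheme = "mongodb+srv" if srv else "mongodb"
    return (
        f"{scheme}://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/?appName={quote_plus(app_name)}"
    )


if __name__ == "__main__":
    # Hypothetical values chosen to exercise the escaping.
    print(build_mongo_uri("app user", "p@ss:w/rd", "cluster0.example.mongodb.net", "my app"))
    print(build_mongo_uri("dev", "secret", "localhost:27017", "local", srv=False))
```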

requirements.txt ADDED Viewed

	@@ -0,0 +1,16 @@

+streamlit>=1.20.0
+pillow>=9.0.0
+python-dotenv>=1.0.0
+pymongo>=4.0.0
+sentence-transformers>=2.2.2
+torch>=2.0.0
+rank-bm25>=0.2.2
+tqdm>=4.0.0
+numpy>=1.24.0
+langchain-core>=0.0.200
+langchain-ollama>=0.0.1
+groq>=0.3.0
+docling>=0.1.0
+docling-core>=0.1.0
+langchain-groq
+docling[easyocr]