dev-models committed
Commit e97c8d1 · 1 Parent(s): 135acdb

Initial commit
Files changed (12)
  1. .dockerignore +17 -0
  2. .gitignore +47 -0
  3. Dockerfile +28 -0
  4. README.md +112 -0
  5. app.py +290 -0
  6. backend/__init__.py +1 -0
  7. backend/database.py +15 -0
  8. backend/models.py +21 -0
  9. backend/parser.py +189 -0
  10. backend/rag.py +315 -0
  11. config.py +23 -0
  12. requirements.txt +16 -0
.dockerignore ADDED
@@ -0,0 +1,15 @@
+ __pycache__/
+ *.pyc
+ *.pyo
+ *.pyd
+ .Python
+ env/
+ venv/
+ build/
+ dist/
+ *.egg-info
+ temp_uploads/
+ rag_data/
+ .env
+ node_modules/
+ .mypy_cache
.gitignore ADDED
@@ -0,0 +1,47 @@
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ env/
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ .venv
+ venv
+ ENV/
+ env.bak
+ venv.bak
+
+ # Environment Variables
+ .env
+
+ # Project specific
+ temp_uploads/
+ rag_data/
+
+ # IDEs
+ .vscode/
+ .idea/
+
+ # Docker
+ .docker/
+
+ # OS specific
+ .DS_Store
+ Thumbs.db
Dockerfile ADDED
@@ -0,0 +1,28 @@
+ FROM python:3.10-slim
+
+ ENV PYTHONDONTWRITEBYTECODE=1
+ ENV PYTHONUNBUFFERED=1
+
+ WORKDIR /app
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y \
+     tesseract-ocr \
+     tesseract-ocr-eng \
+     poppler-utils \
+     libgl1 \
+     libglib2.0-0 \
+     build-essential \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Install Python dependencies
+ COPY requirements.txt .
+ RUN pip install --upgrade pip && \
+     pip install -r requirements.txt
+
+ # Copy app code
+ COPY . .
+
+ EXPOSE 7860
+
+ CMD ["streamlit", "run", "app.py", "--server.port=7860", "--server.address=0.0.0.0"]
README.md CHANGED
@@ -8,3 +8,115 @@ pinned: false
  ---
 
  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+ # 🤖 Multimodal RAG Assistant (Docling-Powered)
+
+ [![Python](https://img.shields.io/badge/Python-3.10%2B-blue.svg)](https://www.python.org/)
+ [![Streamlit](https://img.shields.io/badge/Streamlit-1.32%2B-FF4B4B.svg)](https://streamlit.io/)
+ [![Docling](https://img.shields.io/badge/Docling-IBM-orange.svg)](https://github.com/DS4SD/docling)
+ [![MongoDB](https://img.shields.io/badge/MongoDB-Atlas-green.svg)](https://www.mongodb.com/products/platform/atlas-vector-search)
+ [![Groq](https://img.shields.io/badge/Groq-Llama_3.3-black.svg)](https://groq.com/)
+
+ A **Multimodal Retrieval-Augmented Generation (RAG)** system built for the modern document era. This assistant doesn't just read text: it understands tables, charts, diagrams, and complex layouts using **IBM's Docling** and **Visual Language Models**.
+
+ ---
+
+ ## 🚀 The WOW Factor
+
+ * **🧠 Deep Document Intelligence:** Powered by **Docling**, the system extracts semantic structures (headers, tables, lists) with high precision.
+ * **👁️ Visual Understanding:** Every image in your PDF is "seen" by a **VLM (Llama 4 Scout)** to generate rich textual descriptions for vector indexing.
+ * **🔍 Hybrid Search Engine:** A high-performance retrieval pipeline combining **CLIP (dense)** and **BM25 (sparse)** retrieval to minimize missed matches.
+ * **🖼️ Visual RAG Capabilities:** Directly query for charts or diagrams. The assistant "shows" you the relevant visuals alongside textual answers.
+ * **💡 Intelligent Query Guidance:** Automatically analyzes document structure to suggest the most relevant questions for the user.
+ * **⚡ Fast Generation:** Uses **Groq's Llama-3.3-70B** for near-instant, high-quality responses with full streaming support.
+
+ ---
+
+ ## 🛠️ Architecture Overview
+
+ The system is built on a modular, production-ready foundation:
+
+ ```text
+ rag-app/
+ ├── 🌐 app.py            # Streamlit interface
+ ├── ⚙️ config.py          # Centralized configuration
+ ├── 📦 backend/           # Domain-driven modules
+ │   ├── 🛠️ parser.py      # Docling engine + VLM describer
+ │   ├── 🧠 rag.py         # Hybrid search + RAG orchestrator
+ │   ├── 💾 database.py    # MongoDB Atlas vector store integration
+ │   └── 🤖 models.py      # CLIP, LLM, and VLM connectors
+ ├── 📁 rag_data/          # Parsed JSON persistence
+ ├── 🐳 Dockerfile         # Container build (Python 3.10 slim)
+ └── 📋 requirements.txt   # Dependency tree
+ ```
+
+ ---
+
+ ## 🏗️ Core Technology Stack
+
+ | Layer | Technology | Purpose |
+ | :--- | :--- | :--- |
+ | **Parsing** | **Docling** | High-fidelity PDF structural parsing & OCR |
+ | **VLM** | **Groq (Llama-4-Scout)** | Image captioning for multimodal indexing |
+ | **Embeddings** | **CLIP (ViT-L/14)** | Joint text-image vector space |
+ | **Vector DB** | **MongoDB Atlas** | Scalable vector search & metadata storage |
+ | **LLM** | **Llama-3.3-70B** | Final answer generation (via Groq) |
+ | **UI** | **Streamlit** | Modern, responsive chat interface |
+
+ ---
+
+ ## 🚦 Getting Started
+
+ ### 1. Prerequisites
+ - Python 3.10+
+ - A [MongoDB Atlas](https://www.mongodb.com/cloud/atlas/register) account (for Vector Search)
+ - A [Groq API key](https://console.groq.com/)
+
+ ### 2. Configure Environment
+ Create a `.env` file in the root directory:
+
+ ```env
+ # MongoDB Credentials
+ MONGO_USER=your_username
+ MONGO_PASSWORD=your_password
+ MONGO_HOST=your_cluster_url.mongodb.net
+ MONGO_DB=rag_assistant
+
+ # API Keys
+ GROQ_API_KEY=gsk_your_key_here
+
+ # Optional: Full URI (overrides components above)
+ # MONGO_URI=mongodb+srv://...
+ ```
+
+ ### 3. Quick Run (Docker)
+ ```bash
+ docker build -t rag-app .
+ docker run -p 7860:7860 --env-file .env rag-app
+ ```
+
+ ### 4. Local Setup
+ ```bash
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Launch app
+ streamlit run app.py
+ ```
+
+ ---
+
+ ## 📈 Search Optimization
+
+ - **Dense Retrieval (CLIP):** Captures semantic meaning and visual similarity.
+ - **Sparse Retrieval (BM25):** Ensures keyword matches (names, technical terms) are not missed.
+ - **Hybrid Weighting:** A tunable `alpha` parameter balances the two search methods for the desired precision-recall trade-off.
+
+ ---
+
+ ## 🛡️ Security & Scalability
+
+ * **Safe Parsing:** Docling runs in a resource-limited container environment.
+ * **Vector Search Indexing:** Optimized for MongoDB Atlas Search, enabling enterprise-grade scaling.
+ * **Streaming Responses:** Uses Server-Sent Events (SSE) logic for smooth user experience.
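The hybrid weighting described under Search Optimization can be sketched in a few lines. This is a minimal illustration with made-up scores; `hybrid_score` is a hypothetical helper for exposition, not a function from the repo (the real logic lives in `RAGEngine.hybrid_search`):

```python
def hybrid_score(dense, sparse, alpha=0.5):
    """Combine normalized dense (CLIP) and sparse (BM25) scores per doc id.

    `dense` and `sparse` map document ids to scores in [0, 1]; a missing id
    simply contributes 0 from that retriever.
    """
    ids = set(dense) | set(sparse)
    return {
        doc_id: alpha * dense.get(doc_id, 0.0) + (1 - alpha) * sparse.get(doc_id, 0.0)
        for doc_id in ids
    }

# Example: doc "b" scores moderately in dense search but tops BM25,
# so it wins the combined ranking at alpha = 0.5.
ranked = sorted(
    hybrid_score({"a": 0.9, "b": 0.4}, {"b": 1.0, "c": 0.7}, alpha=0.5).items(),
    key=lambda kv: kv[1],
    reverse=True,
)
```

Raising `alpha` toward 1.0 favors semantic (dense) matches; lowering it favors exact keyword (sparse) matches.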
app.py ADDED
@@ -0,0 +1,290 @@
+ import streamlit as st
+ import os
+ import base64
+ from io import BytesIO
+ from PIL import Image
+ import time
+
+ # Import modular components
+ from backend.rag import RAGEngine
+ from backend.parser import EnrichedRagParser
+
+ # ==========================================
+ # 1. Page Configuration & Professional CSS
+ # ==========================================
+ st.set_page_config(
+     page_title="Multimodal RAG Assistant",
+     page_icon="🤖",
+     layout="wide",
+     initial_sidebar_state="expanded"
+ )
+
+ # Production-ready CSS
+ st.markdown("""
+ <style>
+ .stChatMessage {
+     background-color: var(--secondary-background-color);
+     border: 1px solid rgba(128, 128, 128, 0.1);
+     border-radius: 12px;
+     padding: 1.5rem;
+     margin-bottom: 1rem;
+     box-shadow: 0 2px 4px rgba(0,0,0,0.05);
+ }
+ .stats-container {
+     background-color: var(--secondary-background-color);
+     border: 1px solid rgba(128, 128, 128, 0.2);
+     border-radius: 10px;
+     padding: 15px;
+     margin-top: 10px;
+ }
+ .stats-header {
+     font-weight: 600;
+     color: var(--text-color);
+     margin-bottom: 8px;
+     display: block;
+ }
+ .stats-item {
+     font-size: 0.9em;
+     color: var(--text-color);
+     opacity: 0.8;
+     margin-bottom: 4px;
+     display: flex;
+     justify-content: space-between;
+ }
+ </style>
+ """, unsafe_allow_html=True)
+
+ # ==========================================
+ # 2. Initialization & Helper Functions
+ # ==========================================
+
+ @st.cache_resource
+ def initialize_rag_system(force_clean: bool = True):
+     """Initialize the RAG system with caching."""
+     return RAGEngine(use_hybrid=True, force_clean=force_clean)
+
+ def display_image_from_base64(base64_str: str, caption: str = "", width: int = 300):
+     """Helper to decode and display base64 images."""
+     try:
+         img_data = base64.b64decode(base64_str)
+         img = Image.open(BytesIO(img_data))
+         st.image(img, caption=caption, width=width)
+     except Exception as e:
+         st.error(f"Failed to display image: {e}")
+
+ # ==========================================
+ # 3. Main Application
+ # ==========================================
+
+ def main():
+     # --- State Management ---
+     if "messages" not in st.session_state:
+         st.session_state.messages = []
+     if "suggested_questions" not in st.session_state:
+         st.session_state.suggested_questions = []
+
+     # Initialize backend
+     if "rag" not in st.session_state:
+         with st.spinner("🚀 Booting up AI System..."):
+             st.session_state.rag = initialize_rag_system()
+     rag: RAGEngine = st.session_state.rag
+
+     # ==========================================
+     # SIDEBAR: Control Panel
+     # ==========================================
+     with st.sidebar:
+         st.header("🧠 RAG Control Panel")
+
+         # --- PDF Document Upload ---
+         with st.expander("📂 Knowledge Base", expanded=True):
+             uploaded_file = st.file_uploader(
+                 "Upload Document (PDF)",
+                 type=["pdf"],
+                 label_visibility="collapsed"
+             )
+
+             if uploaded_file:
+                 # Temporary save for parsing
+                 temp_dir = "temp_uploads"
+                 os.makedirs(temp_dir, exist_ok=True)
+                 save_path = os.path.join(temp_dir, uploaded_file.name)
+
+                 with open(save_path, "wb") as f:
+                     f.write(uploaded_file.getbuffer())
+
+                 if st.button("🚀 Process PDF", type="primary", use_container_width=True):
+                     try:
+                         with st.spinner("Analyzing PDF with Docling..."):
+                             parser = EnrichedRagParser()
+                             parsed_data = parser.process_document(save_path)
+
+                         with st.spinner("Ingesting into MongoDB..."):
+                             rag.ingest_data(parsed_data)
+
+                         # Generate suggestions
+                         suggestions = rag.generate_suggested_questions(num_questions=6)
+                         st.session_state.suggested_questions = suggestions
+                         st.success(f"Processed: {uploaded_file.name}")
+
+                     except Exception as e:
+                         st.error(f"❌ Error: {str(e)}")
+
+                     finally:
+                         # ✅ Always clean up the temp file
+                         if os.path.exists(save_path):
+                             os.remove(save_path)
+                             print("🧹 Temp file deleted")
+
+                     # Rerun outside the try block so st.rerun()'s control-flow
+                     # exception is not swallowed by the except clause above.
+                     st.rerun()
+         st.markdown("---")
+
+         # --- Suggested Questions ---
+         if st.session_state.suggested_questions:
+             st.subheader("💡 Quick Questions")
+             for idx, q in enumerate(st.session_state.suggested_questions):
+                 if st.button(q, key=f"sugg_{idx}", use_container_width=True):
+                     st.session_state.messages.append({"role": "user", "content": q})
+                     st.rerun()
+             st.markdown("---")
+
+         # --- Settings ---
+         with st.expander("⚙️ Search Settings"):
+             top_k = st.slider("Max Results", 1, 10, 5)
+             min_score = st.slider("Confidence Threshold", 0.0, 1.0, 0.6)
+             use_images = st.toggle("Enable Image Search", value=True)
+
+         # --- System Stats ---
+         count = rag.collection.count_documents({})
+         st.markdown(
+             f"""
+             <div class="stats-container">
+                 <span class="stats-header">📊 Database Status</span>
+                 <div class="stats-item"><span>Total Chunks:</span> <strong>{count}</strong></div>
+                 <div class="stats-item"><span>Embedding:</span> <strong>CLIP ViT-L/14</strong></div>
+             </div>
+             """,
+             unsafe_allow_html=True,
+         )
+
+         # Reset
+         if st.button("🗑️ Clear Chat", type="secondary", use_container_width=True):
+             st.session_state.messages = []
+             st.rerun()
+
+         if st.button("⚠️ Delete Vector Collection", type="primary", use_container_width=True):
+             with st.spinner("Deleting collection..."):
+                 rag.collection.delete_many({})
+                 # Reset in-memory indices to match empty DB
+                 rag.bm25_index = None
+                 rag.bm25_doc_map = {}
+             st.success("Vector Collection Deleted!")
+             time.sleep(1)  # Give the user a moment to see the success message
+             st.rerun()
+
+     # ==========================================
+     # MAIN: Chat Interface
+     # ==========================================
+     st.title("🤖 Multimodal AI Assistant")
+
+     if not st.session_state.messages:
+         st.markdown(
+             """
+             <div style="text-align: center; margin-top: 50px; opacity: 0.7;">
+                 <h3>👋 Ready to help!</h3>
+                 <p>Upload a PDF in the sidebar to start.</p>
+             </div>
+             """,
+             unsafe_allow_html=True,
+         )
+
+     # Render history
+     for msg in st.session_state.messages:
+         with st.chat_message(msg["role"]):
+             st.markdown(msg["content"])
+             if "images" in msg and msg["images"]:
+                 st.markdown("---")
+                 cols = st.columns(3)
+                 for i, img in enumerate(msg["images"]):
+                     with cols[i % 3]:
+                         display_image_from_base64(img["image_base64"], width=220)
+
+     # ==========================================
+     # LOGIC: Input Handling
+     # ==========================================
+     user_input = st.chat_input("Type your question here...")
+
+     if user_input:
+         st.session_state.messages.append({"role": "user", "content": user_input})
+         st.rerun()
+
+     # ==========================================
+     # ASSISTANT: Streaming Response Logic
+     # ==========================================
+     if st.session_state.messages and st.session_state.messages[-1]["role"] == "user":
+         last_query = st.session_state.messages[-1]["content"]
+
+         with st.chat_message("assistant"):
+             with st.spinner("🤔 Searching context..."):
+                 try:
+                     img_keywords = ["show", "image", "diagram", "figure", "picture"]
+                     is_visual_request = any(
+                         k in last_query.lower() for k in img_keywords
+                     ) and use_images
+
+                     found_imgs = []
+                     answer_text = ""
+
+                     if is_visual_request:
+                         # 🔍 Image search branch (non-streaming)
+                         found_imgs = rag.search_images(
+                             last_query,
+                             top_k=3,
+                             min_score=min_score,
+                         )
+                         if found_imgs:
+                             answer_text = f"I found {len(found_imgs)} relevant visuals:"
+                         else:
+                             answer_text = "I couldn't find any relevant images."
+
+                         # Render once
+                         st.markdown(answer_text)
+
+                     else:
+                         # 🧠 Text answer branch (streaming).
+                         # rag.answer_question returns a generator / stream;
+                         # st.write_stream both displays the chunks and returns
+                         # the final concatenated string.
+                         stream = rag.answer_question(
+                             last_query,
+                             top_k=top_k
+                         )
+                         answer_text = st.write_stream(stream)
+
+                     # Render images if any
+                     if found_imgs:
+                         st.markdown("---")
+                         cols = st.columns(3)
+                         for idx, img in enumerate(found_imgs):
+                             with cols[idx % 3]:
+                                 display_image_from_base64(
+                                     img["image_base64"], width=220
+                                 )
+
+                     # Persist assistant message in history
+                     st.session_state.messages.append(
+                         {
+                             "role": "assistant",
+                             "content": answer_text,
+                             "images": found_imgs,
+                         }
+                     )
+
+                 except Exception as e:
+                     st.error(f"Error: {e}")
+                     st.session_state.messages.append(
+                         {"role": "assistant", "content": f"❌ Error: {e}"}
+                     )
+
+ if __name__ == "__main__":
+     main()
backend/__init__.py ADDED
@@ -0,0 +1 @@
+ # Modules package initialization
backend/database.py ADDED
@@ -0,0 +1,15 @@
+ from pymongo import MongoClient
+ from pymongo.collection import Collection
+ from config import MONGO_URI, DB_NAME, MONGO_COLLECTION
+
+ def get_mongo_client(uri: str | None = None) -> MongoClient:
+     """Return a pymongo MongoClient."""
+     uri = uri or MONGO_URI
+     return MongoClient(uri)
+
+ def get_mongo_collection(client: MongoClient | None = None, db_name: str | None = None, collection_name: str | None = None) -> Collection:
+     """Return a MongoDB collection instance."""
+     client = client or get_mongo_client()
+     db_name = db_name or DB_NAME
+     collection_name = collection_name or MONGO_COLLECTION
+     return client[db_name][collection_name]
backend/models.py ADDED
@@ -0,0 +1,21 @@
+ import torch
+ from sentence_transformers import SentenceTransformer
+ from groq import Groq
+ from langchain_groq import ChatGroq
+ from config import CLIP_MODEL_NAME, GROQ_API_KEY, LLM_MODEL_NAME
+
+ def get_clip_model(model_name: str = CLIP_MODEL_NAME):
+     device = "cuda" if torch.cuda.is_available() else "cpu"
+     try:
+         model = SentenceTransformer(model_name, trust_remote_code=True)
+         model.to(device)
+         return model
+     except Exception as e:
+         print(f"Falling back to default CLIP model due to: {e}")
+         return SentenceTransformer("clip-ViT-B-32")
+
+ def get_llm(model_name: str = LLM_MODEL_NAME):
+     return ChatGroq(model=model_name, api_key=GROQ_API_KEY, temperature=0.1)
+
+ def get_groq_client(api_key: str = GROQ_API_KEY):
+     return Groq(api_key=api_key)
backend/parser.py ADDED
@@ -0,0 +1,189 @@
+ import json
+ import os
+ import base64
+ from io import BytesIO
+ from typing import List, Dict, Any
+ from docling.document_converter import DocumentConverter, PdfFormatOption
+ from docling.datamodel.base_models import InputFormat
+ from docling.datamodel.pipeline_options import (
+     PdfPipelineOptions,
+     PictureDescriptionApiOptions,
+     OcrAutoOptions,
+ )
+ from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
+ from docling.datamodel.settings import settings
+ from docling.chunking import HybridChunker
+ from docling_core.types.doc.labels import DocItemLabel
+ from docling_core.types.doc.document import SectionHeaderItem, TitleItem
+ from config import GROQ_API_KEY
+
+ class EnrichedRagParser:
+     """
+     Parser using Docling's HybridChunker for multimodal RAG.
+     Modified from sonnet_export.py for modular use.
+     """
+
+     def __init__(self, groq_api_key: str = GROQ_API_KEY):
+         self.groq_api_key = groq_api_key
+         self.converter = self._setup_converter()
+         self.chunker = HybridChunker(merge_peers=True)
+
+     def _setup_converter(self) -> DocumentConverter:
+         # CPU configuration
+         accelerator_options = AcceleratorOptions(
+             num_threads=min(12, os.cpu_count()),
+             device=AcceleratorDevice.CPU
+         )
+
+         # Smart OCR configuration:
+         # only triggers when >50% of a page is scanned/bitmap content
+         ocr_options = OcrAutoOptions(
+             lang=["en"],                 # ✅ Specify language
+             force_full_page_ocr=False,   # ⚡ Don't force OCR on all pages
+             bitmap_area_threshold=0.5    # ⚡ Only OCR if >50% of the page is scanned
+         )
+
+         # Pipeline configuration
+         pipeline_options = PdfPipelineOptions(
+             # Features
+             do_ocr=True,                 # Enable OCR (with smart triggering)
+             do_table_structure=True,
+             generate_picture_images=True,
+             images_scale=1,
+             ocr_options=ocr_options,     # ⚡ Smart OCR config
+
+             # Disable unnecessary features
+             generate_page_images=False,
+             enable_remote_services=True,
+
+             # Picture descriptions via remote VLM API
+             do_picture_description=True,
+
+             # Resource management
+             queue_max_size=10,
+             document_timeout=300.0
+         )
+
+         pipeline_options.accelerator_options = accelerator_options
+         settings.debug.profile_pipeline_timings = True
+
+         pipeline_options.picture_description_options = PictureDescriptionApiOptions(
+             url="https://api.groq.com/openai/v1/chat/completions",
+             params={
+                 "model": "meta-llama/llama-4-scout-17b-16e-instruct",  # Double-check this model string
+                 "temperature": 0.2,
+                 "max_tokens": 500,
+             },
+             prompt="Describe this image in detail for a RAG knowledge base. Include all visible text, numbers, and chart trends.",
+             headers={"Authorization": f"Bearer {self.groq_api_key}"}
+         )
+
+         return DocumentConverter(
+             format_options={
+                 InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
+             }
+         )
+
+     def _determine_chunk_type(self, chunk) -> str:
+         chunk_type = "text"
+         if hasattr(chunk.meta, "doc_items") and chunk.meta.doc_items:
+             labels = [item.label for item in chunk.meta.doc_items]
+             if DocItemLabel.TABLE in labels:
+                 chunk_type = "table"
+             elif DocItemLabel.LIST_ITEM in labels:
+                 chunk_type = "list"
+             elif any(l in [DocItemLabel.TITLE, DocItemLabel.SECTION_HEADER] for l in labels):
+                 chunk_type = "header"
+             elif DocItemLabel.CODE in labels:
+                 chunk_type = "code"
+         return chunk_type
+
+     def _get_base64_image(self, pic) -> str:
+         try:
+             if hasattr(pic, "image") and pic.image and hasattr(pic.image, "pil_image"):
+                 img = pic.image.pil_image
+                 if img:
+                     buffered = BytesIO()
+                     if img.mode != "RGB":
+                         img = img.convert("RGB")
+                     img.save(buffered, format="PNG")
+                     return base64.b64encode(buffered.getvalue()).decode("utf-8")
+         except Exception as e:
+             print(f"Failed to convert image to base64: {e}")
+         return ""
+
+     def _find_image_heading(self, doc, pic_item) -> str:
+         current_heading = "Unknown"
+         for item, level in doc.iterate_items():
+             if isinstance(item, (SectionHeaderItem, TitleItem)):
+                 if hasattr(item, "text"):
+                     current_heading = item.text
+             if item == pic_item:
+                 return current_heading
+         return current_heading
+
+     def process_document(self, file_path: str, save_json: bool = True, output_dir: str = "rag_data", max_page: int = 10) -> Dict[str, Any]:
+         """Converts a document and returns structured data."""
+         print(f"Running Docling parser on: {file_path}...")
+
+         result = self.converter.convert(file_path)
+         doc = result.document
+         doc_conversion_secs = result.timings["pipeline_total"].times
+         print(f"Doc conversion time: {doc_conversion_secs} seconds")
+
+         chunk_iter = self.chunker.chunk(dl_doc=doc)
+
+         structured_chunks = []
+         for i, chunk in enumerate(chunk_iter):
+             heading = chunk.meta.headings[0] if chunk.meta.headings else "Unknown"
+
+             page_num = 0
+             if hasattr(chunk.meta, "doc_items") and chunk.meta.doc_items:
+                 for item in chunk.meta.doc_items:
+                     if hasattr(item, "prov") and item.prov:
+                         if len(item.prov) > 0 and hasattr(item.prov[0], "page_no"):
+                             page_num = item.prov[0].page_no
+                             break
+
+             structured_chunks.append({
+                 "chunk_id": f"chunk_{i}",
+                 "type": self._determine_chunk_type(chunk),
+                 "text": chunk.text,
+                 "metadata": {
+                     "source": os.path.basename(file_path),
+                     "page_number": page_num,
+                     "section_header": heading
+                 }
+             })
+
+         images_data = []
+         for i, pic in enumerate(doc.pictures):
+             description = "No description"
+             if hasattr(pic, "meta") and pic.meta and hasattr(pic.meta, "description"):
+                 desc_obj = pic.meta.description
+                 description = desc_obj.text if hasattr(desc_obj, "text") else str(desc_obj)
+
+             images_data.append({
+                 "image_id": f"img_{i}",
+                 "description": description,
+                 "page_number": pic.prov[0].page_no if pic.prov else 0,
+                 "section_header": self._find_image_heading(doc, pic),
+                 "image_base64": self._get_base64_image(pic)
+             })
+
+         final_output = {"chunks": structured_chunks, "images": images_data}
+
+         if save_json:
+             os.makedirs(output_dir, exist_ok=True)
+             with open(os.path.join(output_dir, "parsed_knowledge.json"), "w", encoding="utf-8") as f:
+                 json.dump(final_output, f, indent=2, ensure_ascii=False)
+             print(f"Saved parsed knowledge to {output_dir}/parsed_knowledge.json")
+
+         return final_output
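The heading-propagation idea in `_find_image_heading` (walk the document in reading order, remembering the last heading seen, and return it when the target item is reached) can be sketched independently of docling. Plain dicts stand in for docling items here; `find_heading` is illustrative only, not part of the repo:

```python
def find_heading(items, target):
    """Return the most recent heading text seen before `target` in reading order."""
    current = "Unknown"
    for item in items:
        if item.get("kind") == "heading":
            current = item["text"]
        if item is target:  # identity check, mirroring `item == pic_item`
            return current
    return current

# A picture that appears after the "Results" heading inherits that heading.
pic = {"kind": "picture"}
doc_items = [
    {"kind": "heading", "text": "Results"},
    {"kind": "paragraph", "text": "Some prose."},
    pic,
    {"kind": "heading", "text": "Discussion"},
]
```

One quirk worth noting: because the loop returns on identity, a target that is never found yields the *last* heading in the document rather than "Unknown".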
backend/rag.py ADDED
@@ -0,0 +1,315 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import json
2
+ import os
3
+ import numpy as np
4
+ import torch
5
+ from typing import List, Dict, Any, Optional
6
+ from tqdm import tqdm
7
+ from pymongo import ReplaceOne
8
+ from rank_bm25 import BM25Okapi
9
+ from langchain_core.prompts import ChatPromptTemplate
10
+ from langchain_core.output_parsers import StrOutputParser
11
+
12
+ from config import VECTOR_INDEX_NAME
13
+ from .database import get_mongo_client, get_mongo_collection
14
+ from .models import get_clip_model, get_llm, get_groq_client
15
+
16
+ from dotenv import load_dotenv
17
+ import time
18
+
19
+ load_dotenv()
20
+ import os
21
+ class RAGEngine:
22
+ """
23
+ Unified RAG engine refactored from search.py.
24
+ """
25
+ def __init__(self, use_hybrid: bool = True, force_clean: bool = False):
26
+ self.use_hybrid = use_hybrid
27
+ self.clip_model = get_clip_model()
28
+ self.collection = get_mongo_collection()
29
+ self.llm = get_llm()
30
+ self.groq_client = get_groq_client()
31
+
32
+ if force_clean:
33
+ self.collection.delete_many({})
34
+
35
+ self._setup_vector_index()
36
+
37
+ self.bm25_index = None
38
+ self.bm25_doc_map = {}
39
+
40
+ if self.collection.count_documents({}) > 0:
41
+ self._rebuild_bm25_index()
42
+
43
+
44
+
45
+ def _setup_vector_index(self):
46
+ """
47
+ Attempts to create a vector search index if using MongoDB Atlas.
48
+ Includes robust dimension checking and error handling.
49
+ """
50
+ # 1. Determine Dimensions safely
51
+ try:
52
+ dims = self.clip_model.get_sentence_embedding_dimension()
53
+ if dims is None or not isinstance(dims, int):
54
+ raise ValueError("Model returned invalid dimensions")
55
+ except Exception:
56
+ print("Auto-dim failed, probing model...")
57
+ test_vec = self.clip_model.encode("test")
58
+ dims = len(test_vec)
59
+
60
+
61
+ print(f"Vector Dimensions: {dims}")
62
+
63
+ # 2. Define Index Model
64
+ index_model = {
65
+ "definition": {
66
+ "fields": [
67
+ {
68
+ "type": "vector",
69
+ "path": "embedding",
70
+ "numDimensions": int(dims), # Ensure strict integer
71
+ "similarity": "cosine"
72
+ },
73
+ {
74
+ "type": "filter",
75
+ "path": "metadata.type"
76
+ }
77
+ ]
78
+ },
79
+ "name": VECTOR_INDEX_NAME,
80
+ "type": "vectorSearch"
81
+ }
82
+
83
+ # 3. Create Index
84
+ try:
85
+ # Check if index already exists
86
+ indexes = list(self.collection.list_search_indexes())
87
+ index_names = [idx.get("name") for idx in indexes]
88
+
89
+ if VECTOR_INDEX_NAME not in index_names:
90
+ print(f"Creating Atlas Vector Search Index '{VECTOR_INDEX_NAME}'...")
91
+ self.collection.create_search_index(model=index_model)
92
+ print("Index creation initiated. Please wait 1-2 minutes for Atlas to build it.")
93
+ print("You can check progress in Atlas UI -> Database -> Search -> Vector Search")
94
+ else:
95
+ print(f"Index '{VECTOR_INDEX_NAME}' already exists.")
96
+
97
+ except Exception as e:
98
+ print(f"\nAutomatic Index Creation Failed: {e}")
99
+ print("This is common on Free Tier (M0) or due to permissions.")
100
+ print("PLEASE CREATE MANUALLY IN ATLAS UI (See JSON below)\n")
101
+ print(json.dumps(index_model["definition"], indent=2))
102
+ except Exception as e:
103
+ print(f"Unexpected error checking/creating index: {e}")
104
+
105
+ def _rebuild_bm25_index(self):
106
+ cursor = self.collection.find(
107
+ {"metadata.type": {"$in": ["text", "table", "list", "header", "code"]}},
108
+ {"content": 1, "_id": 1}
109
+ )
110
+ text_docs = []
111
+ self.bm25_doc_map = {}
112
+ for idx, doc in enumerate(cursor):
113
+ content = doc.get("content", "")
114
+ if content:
115
+ text_docs.append(content.lower().split())
116
+ self.bm25_doc_map[idx] = str(doc["_id"])
117
+ if text_docs:
118
+ self.bm25_index = BM25Okapi(text_docs)
119
+
120
+ def _encode_content(self, content: Any, content_type: str) -> np.ndarray:
121
+ if content_type == "image":
122
+ # Assuming content is base64
123
+ from PIL import Image
124
+ from io import BytesIO
125
+ import base64
126
+ try:
127
+ img = Image.open(BytesIO(base64.b64decode(content))).convert("RGB")
128
+ return self.clip_model.encode(img, normalize_embeddings=True)
129
+ except: return None
130
+ return self.clip_model.encode(content, normalize_embeddings=True)
131
+
132
+     def ingest_data(self, data: Dict[str, Any]):
+         """Ingests processed document data (text chunks and images) into MongoDB."""
+         operations = []
+         for chunk in data.get("chunks", []):
+             embedding = self._encode_content(chunk["text"], "text")
+             if embedding is None:
+                 continue
+             doc = {
+                 "_id": chunk["chunk_id"],
+                 "content": chunk["text"],
+                 "embedding": embedding.tolist(),
+                 "metadata": {
+                     **chunk["metadata"],
+                     "type": chunk.get("type", "text")
+                 }
+             }
+             operations.append(ReplaceOne({"_id": doc["_id"]}, doc, upsert=True))
+
+         for img in data.get("images", []):
+             embedding = self._encode_content(img["image_base64"], "image")
+             if embedding is None:
+                 continue
+             doc = {
+                 "_id": img["image_id"],
+                 "content": img.get("description", ""),
+                 "embedding": embedding.tolist(),
+                 "metadata": {
+                     "page": str(img.get("page_number", 0)),
+                     "header": str(img.get("section_header", "")),
+                     "type": "image",
+                     "description": img.get("description", ""),
+                     "image_base64": img["image_base64"]
+                 }
+             }
+             operations.append(ReplaceOne({"_id": doc["_id"]}, doc, upsert=True))
+
+         if operations:
+             # Write in batches of 100 to keep individual bulk_write calls small
+             for i in range(0, len(operations), 100):
+                 self.collection.bulk_write(operations[i:i + 100])
+             self._rebuild_bm25_index()
+
+     def hybrid_search(self, query: str, top_k: int = 5, alpha: float = 0.5) -> List[Dict]:
+         query_embedding = self._encode_content(query, "text")
+         dense_results = []
+         try:
+             pipeline = [
+                 {"$vectorSearch": {
+                     "index": VECTOR_INDEX_NAME,
+                     "path": "embedding",
+                     "queryVector": query_embedding.tolist(),
+                     "numCandidates": top_k * 10,
+                     "limit": top_k * 2
+                 }},
+                 {"$project": {"content": 1, "metadata": 1, "score": {"$meta": "vectorSearchScore"}}}
+             ]
+             dense_results = list(self.collection.aggregate(pipeline))
+         except Exception:
+             pass  # Fall back to sparse-only retrieval if vector search fails
+
+         dense_scores = {str(r["_id"]): {"score": r.get("score", 0), "doc": r} for r in dense_results}
+         sparse_scores = {}
+         if self.bm25_index:
+             scores = self.bm25_index.get_scores(query.lower().split())
+             max_s = max(scores) if len(scores) > 0 and max(scores) > 0 else 1.0
+             for i in np.argsort(scores)[::-1][:top_k * 2]:
+                 if scores[i] > 0:
+                     sparse_scores[self.bm25_doc_map[i]] = scores[i] / max_s
+
+         combined = []
+         all_ids = set(dense_scores.keys()) | set(sparse_scores.keys())
+         for did in all_ids:
+             d_s = dense_scores.get(did, {}).get("score", 0)
+             s_s = sparse_scores.get(did, 0)
+             score = (alpha * d_s) + ((1 - alpha) * s_s)
+             doc = dense_scores.get(did, {}).get("doc") or self.collection.find_one({"_id": did})
+             if doc:
+                 combined.append({**doc, "score": score})
+
+         combined.sort(key=lambda x: x["score"], reverse=True)
+         return combined[:top_k]
+
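The weighted fusion inside `hybrid_search` can be exercised in isolation. A minimal sketch, assuming both score dictionaries are already normalized to [0, 1] (the function name `fuse_scores` is hypothetical):

```python
def fuse_scores(dense, sparse, alpha=0.5):
    """Combine per-document dense and sparse scores; alpha weights the dense side."""
    fused = {}
    for doc_id in set(dense) | set(sparse):
        fused[doc_id] = alpha * dense.get(doc_id, 0.0) + (1 - alpha) * sparse.get(doc_id, 0.0)
    # Highest fused score first, like combined.sort(...) above
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

A document found by only one retriever still participates, with its missing score treated as 0, which is exactly what the union over `all_ids` does above.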
+     def answer_question(self, question: str, top_k: int = 5):
+         """Streams an answer grounded in retrieved context. This is a
+         generator: callers must iterate over the yielded tokens."""
+         results = self.hybrid_search(question, top_k=top_k)
+         if not results:
+             yield "No relevant info found."
+             return
+
+         context = ""
+         for i, res in enumerate(results, 1):
+             m = res["metadata"]
+             context += f"\n[Src {i} | Page {m.get('page_number', '?')}] {res['content']}"
+
+         prompt = f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer strictly based on context:"
+         try:
+             chain = ChatPromptTemplate.from_template("{p}") | self.llm | StrOutputParser()
+             for msg in chain.stream({"p": prompt}):
+                 time.sleep(0.01)
+                 yield msg.content if hasattr(msg, "content") else str(msg)
+         except Exception as e:
+             yield f"Error: {e}"
+
+     def search_images(self, query: str, top_k: int = 3, min_score: float = 0.5) -> List[Dict]:
+         query_embedding = self._encode_content(query, "text")
+         try:
+             pipeline = [
+                 {"$vectorSearch": {
+                     "index": VECTOR_INDEX_NAME, "path": "embedding",
+                     "queryVector": query_embedding.tolist(),
+                     "numCandidates": top_k * 10, "limit": top_k * 2,
+                     "filter": {"metadata.type": "image"}
+                 }},
+                 {"$project": {"content": 1, "metadata": 1, "score": {"$meta": "vectorSearchScore"}}}
+             ]
+             results = list(self.collection.aggregate(pipeline))
+             return [{"description": r["content"], "image_base64": r["metadata"].get("image_base64"), "score": r["score"]}
+                     for r in results if r["score"] >= min_score][:top_k]
+         except Exception as e:
+             print(f"Image search failed: {e}")
+             return []
+
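The post-filter at the end of `search_images` keeps only hits at or above `min_score`, then truncates to `top_k`. As a standalone sketch (the helper name `filter_hits` is hypothetical):

```python
def filter_hits(results, min_score=0.5, top_k=3):
    """Score-threshold then truncate, mirroring the list comprehension above."""
    return [r for r in results if r["score"] >= min_score][:top_k]
```

Because the filter runs before the slice, a low-scoring hit never displaces a qualifying one further down the candidate list.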
+     def generate_suggested_questions(self, num_questions: int = 4) -> List[str]:
+         """Token-efficient question generation using metadata only."""
+         print("\nGenerating suggested questions (Efficient Mode)...")
+
+         try:
+             # 1. Fetch metadata only (projection excludes embedding and content)
+             cursor = self.collection.find(
+                 {},
+                 {"metadata": 1, "_id": 0}
+             ).limit(100)
+
+             metadatas = [doc.get('metadata', {}) for doc in cursor]
+             if not metadatas:
+                 return ["What is this document about?"]
+
+             # 2. Extract high-level structure
+             headers = set()
+             image_descriptions = []
+
+             import random
+             random.shuffle(metadatas)
+
+             for meta in metadatas:
+                 if 'header' in meta and len(headers) < 8:
+                     h = str(meta['header']).strip()
+                     if h and h.lower() != "unknown" and len(h) > 5:
+                         headers.add(h)
+
+                 if meta.get('type') == 'image' and len(image_descriptions) < 2:
+                     desc = meta.get('description', '')
+                     if len(desc) > 20:
+                         image_descriptions.append(desc[:100] + "...")
+
+             # 3. Construct prompt context
+             context_str = "Document Sections:\n" + "\n".join([f"- {h}" for h in headers])
+             if image_descriptions:
+                 context_str += "\n\nVisual Content involves:\n" + "\n".join([f"- {d}" for d in image_descriptions])
+
+             # 4. Prompt the LLM
+             prompt = f"""Generate {num_questions} short, interesting questions about a document with these sections and visuals:
+
+ {context_str}
+
+ Output ONLY the {num_questions} questions, one per line. No numbering."""
+
+             prompt_tmpl = ChatPromptTemplate.from_messages([
+                 ("system", "You are a helpful assistant."),
+                 ("user", "{prompt}")
+             ])
+
+             chain = prompt_tmpl | self.llm | StrOutputParser()
+             response = chain.invoke({"prompt": prompt})
+
+             questions = [q.strip().lstrip('-1234567890. ') for q in response.split('\n') if q.strip()]
+             return questions[:num_questions]
+
+         except Exception as e:
+             print(f"Error generating questions: {e}")
+             # Fall back to a generic question instead of returning None
+             return ["What is this document about?"]
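Since `answer_question` yields tokens rather than returning a string, callers (e.g. the Streamlit layer) must iterate the generator. A minimal consumption sketch, using a hypothetical stand-in for the RAG object:

```python
def collect_answer(rag, question):
    """Drain the streaming answer_question generator into a single string."""
    return "".join(str(token) for token in rag.answer_question(question))
```

In a UI you would typically render each token as it arrives instead of joining at the end; joining is shown here only to make the behavior testable.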
config.py ADDED
@@ -0,0 +1,23 @@
+ import os
+ from dotenv import load_dotenv
+ from urllib.parse import quote_plus
+
+ load_dotenv()
+
+ # --- MongoDB Configuration ---
+ DB_NAME = os.getenv("MONGO_DB", "mongodb")
+ DB_PASSWORD = os.getenv("MONGO_PASSWORD", "pass")
+ DB_USER = os.getenv("MONGO_USER", "username")
+ DB_HOST = os.getenv("MONGO_HOST", "localhost")
+
+ VECTOR_INDEX_NAME = "vector_index"
+ MONGO_URI = f"mongodb+srv://{DB_USER}:{quote_plus(DB_PASSWORD)}@{DB_HOST}/?appName={quote_plus(DB_NAME)}"
+ MONGO_COLLECTION = os.getenv("MONGO_COLLECTION", "documents")
+
+ # --- API Keys ---
+ GROQ_API_KEY = os.getenv("GROQ_API_KEY")
+
+ # --- Model Configurations ---
+ CLIP_MODEL_NAME = "clip-ViT-L-14"
+ LLM_MODEL_NAME = "llama-3.3-70b-versatile"  # Fallback/Check
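`quote_plus` matters here because credentials containing URL-reserved characters would otherwise break the connection string. A sketch with hypothetical values:

```python
from urllib.parse import quote_plus

user, password, host, app = "username", "p@ss/word", "cluster0.example.net", "mongodb"
# '@' and '/' in the password are percent-encoded so they cannot be
# mistaken for the userinfo/host separators in the URI
uri = f"mongodb+srv://{user}:{quote_plus(password)}@{host}/?appName={quote_plus(app)}"
```

Without the escaping, the `@` inside the password would be parsed as the end of the userinfo section.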
requirements.txt ADDED
@@ -0,0 +1,16 @@
+ streamlit>=1.20.0
+ pillow>=9.0.0
+ python-dotenv>=1.0.0
+ pymongo>=4.0.0
+ sentence-transformers>=2.2.2
+ torch>=2.0.0
+ rank-bm25>=0.2.2
+ tqdm>=4.0.0
+ numpy>=1.24.0
+ langchain-core>=0.0.200
+ langchain-ollama>=0.0.1
+ groq>=0.3.0
+ docling>=0.1.0
+ docling-core>=0.1.0
+ langchain-groq
+ docling[easyocr]