Ryanfafa committed on
Commit 477ca04 · verified · 1 parent: 1e4325c

Upload 7 files

Files changed (7)
  1. Dockerfile +45 -0
  2. README (1).md +113 -0
  3. app.py +339 -0
  4. data_downloader.py +293 -0
  5. packages.txt +3 -0
  6. rag_engine.py +200 -0
  7. requirements.txt +31 -0
Dockerfile ADDED
@@ -0,0 +1,45 @@
+ FROM python:3.10-slim
+
+ # System dependencies
+ RUN apt-get update && apt-get install -y \
+     build-essential \
+     curl \
+     git \
+     libgl1 \
+     libglib2.0-0 \
+     poppler-utils \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Create non-root user (required by HuggingFace Spaces)
+ RUN useradd -m -u 1000 appuser
+
+ WORKDIR /app
+
+ # Copy and install Python dependencies
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir --upgrade pip && \
+     pip install --no-cache-dir -r requirements.txt
+
+ # Copy app files
+ COPY --chown=appuser:appuser . .
+
+ # Create writable directories for ChromaDB and sample docs
+ RUN mkdir -p /app/chroma_db /app/sample_docs && \
+     chown -R appuser:appuser /app/chroma_db /app/sample_docs
+
+ USER appuser
+
+ # Expose Streamlit port
+ EXPOSE 7860
+
+ # Health check
+ HEALTHCHECK CMD curl --fail http://localhost:7860/_stcore/health || exit 1
+
+ # Run Streamlit on port 7860 (required by HuggingFace Spaces)
+ CMD ["streamlit", "run", "app.py", \
+     "--server.port=7860", \
+     "--server.address=0.0.0.0", \
+     "--server.headless=true", \
+     "--server.enableCORS=false", \
+     "--server.enableXsrfProtection=false", \
+     "--browser.gatherUsageStats=false"]
README (1).md ADDED
@@ -0,0 +1,113 @@
+ ---
+ title: DocMind AI – RAG Document Q&A
+ emoji: 🧠
+ colorFrom: purple
+ colorTo: indigo
+ sdk: docker
+ app_port: 7860
+ pinned: true
+ license: mit
+ short_description: Chat with any PDF using RAG + ChromaDB
+ ---
+
+ # 🧠 DocMind AI — RAG-Powered Document Q&A
+
+ > Upload any PDF or text document and ask questions — answers are grounded in your content using Retrieval-Augmented Generation.
+
+ ## 🚀 Live Demo
+
+ Upload a PDF or TXT, or click **"Load Sample: AI Report"** to instantly demo with a preloaded AI research document.
+
+ ---
+
+ ## 🏗️ Architecture
+
+ ```
+ User Query
+      │
+      ▼
+ ┌─────────────────────────────────────────┐
+ │           RETRIEVAL PIPELINE            │
+ │                                         │
+ │  Document → Chunking → Embedding        │
+ │  (RecursiveCharacterSplitter)           │
+ │  (all-MiniLM-L6-v2, 384 dims)           │
+ │                   │                     │
+ │                   ▼                     │
+ │                ChromaDB                 │
+ │     (local vector store, MMR)           │
+ │                   │                     │
+ │          Top-4 relevant chunks          │
+ └─────────────────────────────────────────┘
+                     │
+                     ▼
+ ┌─────────────────────────────────────────┐
+ │          GENERATION PIPELINE            │
+ │                                         │
+ │  Context + Question → Prompt Template   │
+ │                   │                     │
+ │      HuggingFace Inference API          │
+ │          (zephyr-7b-beta)               │
+ │                   │                     │
+ │              Final Answer               │
+ └─────────────────────────────────────────┘
+ ```
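The retrieval stage above uses MMR. As a rough illustration (plain Python with toy 2-D vectors, not the ChromaDB implementation), Maximal Marginal Relevance greedily picks chunks that are relevant to the query but dissimilar to chunks already selected:

```python
def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def mmr(query, docs, k=2, lam=0.7):
    """Greedy MMR: score = lam * relevance - (1 - lam) * redundancy."""
    selected, candidates = [], list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i):
            rel = cosine(query, docs[i])
            red = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates; with diversity weighted heavily,
# MMR skips the duplicate and picks the less relevant but distinct doc 2.
docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
print(mmr([1.0, 0.0], docs, k=2, lam=0.3))  # → [0, 2]
```

With `lam` closer to 1 the same call degenerates to plain top-k relevance, which is the trade-off the MMR retriever tunes.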
+
+ ## 🛠️ Tech Stack
+
+ | Component | Technology |
+ |-----------|-----------|
+ | **Framework** | LangChain 0.2 |
+ | **Vector DB** | ChromaDB |
+ | **Embeddings** | sentence-transformers/all-MiniLM-L6-v2 |
+ | **LLM** | HuggingFaceH4/zephyr-7b-beta |
+ | **UI** | Streamlit |
+ | **Deployment** | HuggingFace Spaces |
+
+ ## ⚙️ Key RAG Concepts Demonstrated
+
+ - **Recursive Character Splitting** — smart chunking with 800-character windows and 150-character overlap
+ - **Dense Embeddings** — semantic vector representations, not keyword matching
+ - **MMR Retrieval** — Maximal Marginal Relevance reduces redundancy among retrieved chunks
+ - **Prompt Engineering** — structured system/user/assistant prompt for grounded answers
+ - **Source Attribution** — every answer shows which document chunks were used
+
75
+ ## 🔧 Local Setup
76
+
77
+ ```bash
78
+ git clone https://huggingface.co/spaces/YOUR_USERNAME/docmind-ai
79
+ cd docmind-ai
80
+ pip install -r requirements.txt
81
+ streamlit run app.py
82
+ ```
83
+
84
+ Optional — add a HuggingFace token for higher API rate limits:
85
+ ```bash
86
+ export HF_TOKEN=hf_your_token_here
87
+ ```
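On the Python side the exported token would typically be read from the environment and turned into a Bearer header; a hedged sketch (the exact variable name the app reads is an assumption here):

```python
import os

def auth_headers(token):
    """Build HF Inference API Bearer headers; an empty dict means anonymous access."""
    return {"Authorization": f"Bearer {token}"} if token else {}

# HF_TOKEN is the variable exported in the snippet above (assumed name).
hf_token = os.environ.get("HF_TOKEN")
print(auth_headers(hf_token) if hf_token else "anonymous access")
```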
+
+ ## 📁 Project Structure
+
+ ```
+ docmind-ai/
+ ├── app.py              # Streamlit UI
+ ├── rag_engine.py       # Core RAG pipeline (embed, store, retrieve, generate)
+ ├── data_downloader.py  # Auto-downloads sample documents
+ ├── requirements.txt    # Dependencies
+ └── README.md           # This file
+ ```
+
+ ## 💡 How It Works
+
+ 1. **Upload** a PDF or TXT file (or use the sample)
+ 2. The app **splits** the document into overlapping chunks
+ 3. Each chunk is **embedded** into a 384-dimensional vector
+ 4. Vectors are **stored** in ChromaDB (local vector database)
+ 5. Your question is **embedded** and matched against stored vectors via MMR
+ 6. The top-4 relevant chunks are **retrieved**
+ 7. Chunks + question are sent to **Zephyr-7B** via HuggingFace Inference API
+ 8. A grounded **answer** is returned with source attribution
+
111
+ ---
112
+
113
+ *Built as a portfolio project demonstrating end-to-end RAG engineering.*
app.py ADDED
@@ -0,0 +1,339 @@
+ import streamlit as st
+ import os
+ import time
+ import hashlib
+ from pathlib import Path
+
+ # ─── Page Config ───────────────────────────────────────────────────────────────
+ st.set_page_config(
+     page_title="DocMind AI – RAG Document Q&A",
+     page_icon="🧠",
+     layout="wide",
+     initial_sidebar_state="expanded",
+ )
+
+ # ─── Custom CSS ────────────────────────────────────────────────────────────────
+ st.markdown("""
+ <style>
+ @import url('https://fonts.googleapis.com/css2?family=Syne:wght@400;600;700;800&family=DM+Sans:wght@300;400;500&display=swap');
+
+ html, body, [class*="css"] {
+     font-family: 'DM Sans', sans-serif;
+ }
+
+ .stApp {
+     background: #0f0f13;
+     color: #e8e8f0;
+ }
+
+ /* Sidebar */
+ [data-testid="stSidebar"] {
+     background: #16161d !important;
+     border-right: 1px solid #2a2a3a;
+ }
+
+ /* Hero header */
+ .hero-title {
+     font-family: 'Syne', sans-serif;
+     font-size: 2.8rem;
+     font-weight: 800;
+     background: linear-gradient(135deg, #7c6af7 0%, #a78bfa 40%, #38bdf8 100%);
+     -webkit-background-clip: text;
+     -webkit-text-fill-color: transparent;
+     background-clip: text;
+     line-height: 1.1;
+     margin-bottom: 0.2rem;
+ }
+
+ .hero-sub {
+     color: #6b6b8a;
+     font-size: 1rem;
+     font-weight: 300;
+     letter-spacing: 0.04em;
+     margin-bottom: 2rem;
+ }
+
+ /* Stat cards */
+ .stat-card {
+     background: #1c1c26;
+     border: 1px solid #2a2a3a;
+     border-radius: 12px;
+     padding: 1rem 1.2rem;
+     text-align: center;
+ }
+ .stat-number {
+     font-family: 'Syne', sans-serif;
+     font-size: 1.6rem;
+     font-weight: 700;
+     color: #a78bfa;
+ }
+ .stat-label {
+     font-size: 0.75rem;
+     color: #6b6b8a;
+     text-transform: uppercase;
+     letter-spacing: 0.08em;
+ }
+
+ /* Chat messages */
+ .chat-user {
+     background: #1e1e2e;
+     border: 1px solid #2a2a3a;
+     border-radius: 12px 12px 4px 12px;
+     padding: 0.9rem 1.1rem;
+     margin: 0.5rem 0;
+     color: #e8e8f0;
+ }
+ .chat-assistant {
+     background: linear-gradient(135deg, #1a1a2e 0%, #16213e 100%);
+     border: 1px solid #312e81;
+     border-radius: 12px 12px 12px 4px;
+     padding: 0.9rem 1.1rem;
+     margin: 0.5rem 0;
+     color: #e8e8f0;
+ }
+ .chat-label {
+     font-size: 0.7rem;
+     font-weight: 600;
+     text-transform: uppercase;
+     letter-spacing: 0.1em;
+     margin-bottom: 0.4rem;
+ }
+ .label-user { color: #38bdf8; }
+ .label-ai { color: #a78bfa; }
+
+ /* Source pills */
+ .source-pill {
+     display: inline-block;
+     background: #1f1f2e;
+     border: 1px solid #3730a3;
+     border-radius: 20px;
+     padding: 0.2rem 0.7rem;
+     font-size: 0.72rem;
+     color: #818cf8;
+     margin: 0.2rem 0.15rem;
+ }
+
+ /* Upload area */
+ [data-testid="stFileUploader"] {
+     background: #1c1c26 !important;
+     border: 2px dashed #2a2a3a !important;
+     border-radius: 12px !important;
+ }
+
+ /* Buttons */
+ .stButton > button {
+     background: linear-gradient(135deg, #7c3aed, #4f46e5) !important;
+     color: white !important;
+     border: none !important;
+     border-radius: 8px !important;
+     font-family: 'DM Sans', sans-serif !important;
+     font-weight: 500 !important;
+     transition: all 0.2s ease !important;
+ }
+ .stButton > button:hover {
+     transform: translateY(-1px) !important;
+     box-shadow: 0 4px 20px rgba(124, 58, 237, 0.4) !important;
+ }
+
+ /* Input */
+ .stTextInput > div > div > input,
+ [data-testid="stChatInputTextArea"] {
+     background: #1c1c26 !important;
+     border: 1px solid #2a2a3a !important;
+     color: #e8e8f0 !important;
+     border-radius: 10px !important;
+ }
+
+ /* Status badges */
+ .badge-ready { background:#14532d; color:#86efac; padding:3px 10px; border-radius:20px; font-size:0.75rem; }
+ .badge-empty { background:#1c1917; color:#a8a29e; padding:3px 10px; border-radius:20px; font-size:0.75rem; }
+ .badge-loading{ background:#1e3a5f; color:#7dd3fc; padding:3px 10px; border-radius:20px; font-size:0.75rem; }
+
+ /* Divider */
+ hr { border-color: #2a2a3a !important; }
+
+ /* Scrollbar */
+ ::-webkit-scrollbar { width: 6px; }
+ ::-webkit-scrollbar-track { background: #0f0f13; }
+ ::-webkit-scrollbar-thumb { background: #2a2a3a; border-radius: 3px; }
+ </style>
+ """, unsafe_allow_html=True)
+
+ # ─── Lazy imports (avoids reload cost) ────────────────────────────────────────
+ @st.cache_resource(show_spinner=False)
+ def load_rag_engine():
+     from rag_engine import RAGEngine
+     return RAGEngine()
+
+ # ─── Session state init ────────────────────────────────────────────────────────
+ if "messages" not in st.session_state: st.session_state.messages = []
+ if "doc_loaded" not in st.session_state: st.session_state.doc_loaded = False
+ if "doc_name" not in st.session_state: st.session_state.doc_name = ""
+ if "chunk_count" not in st.session_state: st.session_state.chunk_count = 0
+ if "processed_hash" not in st.session_state: st.session_state.processed_hash = ""
+
+ # ─── Sidebar ───────────────────────────────────────────────────────────────────
+ with st.sidebar:
+     st.markdown('<p style="font-family:Syne,sans-serif;font-size:1.3rem;font-weight:700;color:#a78bfa;">🧠 DocMind AI</p>', unsafe_allow_html=True)
+     st.markdown('<p style="color:#6b6b8a;font-size:0.8rem;">RAG-Powered Document Intelligence</p>', unsafe_allow_html=True)
+     st.markdown("---")
+
+     # Status
+     if st.session_state.doc_loaded:
+         st.markdown('<span class="badge-ready">✓ Ready</span>', unsafe_allow_html=True)
+         st.markdown(f'<p style="color:#e8e8f0;font-size:0.85rem;margin-top:0.5rem;">📄 <b>{st.session_state.doc_name}</b></p>', unsafe_allow_html=True)
+         st.markdown(f'<p style="color:#6b6b8a;font-size:0.78rem;">{st.session_state.chunk_count} chunks indexed</p>', unsafe_allow_html=True)
+     else:
+         st.markdown('<span class="badge-empty">○ No document loaded</span>', unsafe_allow_html=True)
+
+     st.markdown("---")
+     st.markdown('<p style="color:#6b6b8a;font-size:0.78rem;font-weight:600;text-transform:uppercase;letter-spacing:0.08em;">Upload Document</p>', unsafe_allow_html=True)
+
+     uploaded_file = st.file_uploader(
+         "PDF or TXT",
+         type=["pdf", "txt"],
+         label_visibility="collapsed"
+     )
+
+     if uploaded_file:
+         file_hash = hashlib.md5(uploaded_file.read()).hexdigest()
+         uploaded_file.seek(0)
+
+         if file_hash != st.session_state.processed_hash:
+             with st.spinner("🔍 Processing document..."):
+                 rag = load_rag_engine()
+                 chunks = rag.ingest_file(uploaded_file)
+                 st.session_state.doc_loaded = True
+                 st.session_state.doc_name = uploaded_file.name
+                 st.session_state.chunk_count = chunks
+                 st.session_state.processed_hash = file_hash
+                 st.session_state.messages = []
+             st.success(f"✓ Indexed {chunks} chunks!")
+             st.rerun()
+
+     st.markdown("---")
+
+     # Try sample doc
+     st.markdown('<p style="color:#6b6b8a;font-size:0.78rem;font-weight:600;text-transform:uppercase;letter-spacing:0.08em;">Or try a sample</p>', unsafe_allow_html=True)
+     if st.button("📥 Load Sample: AI Report", use_container_width=True):
+         with st.spinner("Downloading sample document..."):
+             from data_downloader import download_sample_doc
+             path, name = download_sample_doc()
+             rag = load_rag_engine()
+             chunks = rag.ingest_path(path, name)
+             st.session_state.doc_loaded = True
+             st.session_state.doc_name = name
+             st.session_state.chunk_count = chunks
+             st.session_state.processed_hash = "sample"
+             st.session_state.messages = []
+         st.success(f"✓ Sample loaded! {chunks} chunks")
+         st.rerun()
+
+     st.markdown("---")
+     if st.button("🗑️ Clear Chat", use_container_width=True):
+         st.session_state.messages = []
+         st.rerun()
+
+     st.markdown("---")
+     st.markdown("""
+     <p style="color:#6b6b8a;font-size:0.72rem;line-height:1.6;">
+     <b style="color:#a78bfa;">Stack</b><br>
+     🔗 LangChain · ChromaDB<br>
+     🤗 HuggingFace Embeddings<br>
+     🦙 Zephyr-7B (Inference API)<br>
+     🌊 Streamlit
+     </p>
+     """, unsafe_allow_html=True)
+
+ # ─── Main Area ─────────────────────────────────────────────────────────────────
+ st.markdown('<h1 class="hero-title">DocMind AI</h1>', unsafe_allow_html=True)
+ st.markdown('<p class="hero-sub">Upload any document · Ask anything · Get answers grounded in your content</p>', unsafe_allow_html=True)
+
+ # Stats row
+ col1, col2, col3, col4 = st.columns(4)
+ with col1:
+     st.markdown(f"""
+     <div class="stat-card">
+         <div class="stat-number">{st.session_state.chunk_count or "—"}</div>
+         <div class="stat-label">Chunks Indexed</div>
+     </div>""", unsafe_allow_html=True)
+ with col2:
+     st.markdown(f"""
+     <div class="stat-card">
+         <div class="stat-number">{len(st.session_state.messages) // 2}</div>
+         <div class="stat-label">Questions Asked</div>
+     </div>""", unsafe_allow_html=True)
+ with col3:
+     st.markdown("""
+     <div class="stat-card">
+         <div class="stat-number">384</div>
+         <div class="stat-label">Embedding Dims</div>
+     </div>""", unsafe_allow_html=True)
+ with col4:
+     st.markdown("""
+     <div class="stat-card">
+         <div class="stat-number">Top-4</div>
+         <div class="stat-label">Retrieval K</div>
+     </div>""", unsafe_allow_html=True)
+
+ st.markdown("<br>", unsafe_allow_html=True)
+
+ # ─── Chat History ──────────────────────────────────────────────────────────────
+ chat_container = st.container()
+ with chat_container:
+     if not st.session_state.messages:
+         if st.session_state.doc_loaded:
+             st.markdown(f"""
+             <div style="text-align:center;padding:3rem;color:#6b6b8a;">
+                 <div style="font-size:2.5rem;margin-bottom:1rem;">💬</div>
+                 <p style="font-size:1rem;color:#a78bfa;">Document ready!</p>
+                 <p style="font-size:0.85rem;">Ask anything about <b style="color:#e8e8f0;">{st.session_state.doc_name}</b></p>
+             </div>
+             """, unsafe_allow_html=True)
+         else:
+             st.markdown("""
+             <div style="text-align:center;padding:4rem 2rem;color:#6b6b8a;">
+                 <div style="font-size:3rem;margin-bottom:1rem;">📄</div>
+                 <p style="font-size:1.1rem;color:#a78bfa;font-family:'Syne',sans-serif;font-weight:600;">No document loaded yet</p>
+                 <p style="font-size:0.85rem;">Upload a PDF or TXT file in the sidebar,<br>or load the sample AI report to get started.</p>
+             </div>
+             """, unsafe_allow_html=True)
+     else:
+         for msg in st.session_state.messages:
+             if msg["role"] == "user":
+                 st.markdown(f"""
+                 <div class="chat-user">
+                     <div class="chat-label label-user">You</div>
+                     {msg["content"]}
+                 </div>""", unsafe_allow_html=True)
+             else:
+                 sources_html = ""
+                 if msg.get("sources"):
+                     pills = "".join(f'<span class="source-pill">📄 {s}</span>' for s in msg["sources"])
+                     sources_html = f'<div style="margin-top:0.7rem;">{pills}</div>'
+                 st.markdown(f"""
+                 <div class="chat-assistant">
+                     <div class="chat-label label-ai">DocMind AI</div>
+                     {msg["content"]}
+                     {sources_html}
+                 </div>""", unsafe_allow_html=True)
+
+ # ─── Chat Input ────────────────────────────────────────────────────────────────
+ st.markdown("<br>", unsafe_allow_html=True)
+
+ if not st.session_state.doc_loaded:
+     st.chat_input("Upload a document first...", disabled=True)
+ else:
+     if prompt := st.chat_input("Ask anything about your document..."):
+         st.session_state.messages.append({"role": "user", "content": prompt})
+
+         with st.spinner("🔍 Retrieving & generating answer..."):
+             rag = load_rag_engine()
+             answer, sources = rag.query(prompt)
+
+         st.session_state.messages.append({
+             "role": "assistant",
+             "content": answer,
+             "sources": sources
+         })
+         st.rerun()
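The upload handler in app.py avoids re-indexing the same file on every Streamlit rerun by hashing the uploaded bytes and comparing against the last processed hash. The pattern in isolation:

```python
import hashlib

def file_fingerprint(data: bytes) -> str:
    """Content fingerprint used to detect re-uploads of the same file.
    MD5 is fine here: this is deduplication, not security."""
    return hashlib.md5(data).hexdigest()

last_hash = ""
for upload in [b"report v1", b"report v1", b"report v2"]:
    h = file_fingerprint(upload)
    if h != last_hash:
        print("indexing new document")
        last_hash = h
    else:
        print("skipping duplicate upload")
```

Note that `uploaded_file.read()` consumes the stream, which is why the app calls `uploaded_file.seek(0)` before passing the file on to ingestion.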
data_downloader.py ADDED
@@ -0,0 +1,293 @@
1
+ """
2
+ data_downloader.py
3
+ ──────────────────
4
+ Downloads a free, publicly available AI research report to use as a
5
+ demo document — no manual steps needed.
6
+
7
+ Primary : Stanford AI Index Report 2024 (summary chapter, public PDF)
8
+ Fallback 1: Our World in Data – AI progress summary (txt)
9
+ Fallback 2: Generate a synthetic AI overview document locally
10
+ """
11
+
12
+ import os
13
+ import time
14
+ import textwrap
15
+ import urllib.request
16
+ from pathlib import Path
17
+
18
+ CACHE_DIR = Path("./sample_docs")
19
+ SAMPLE_PDF = CACHE_DIR / "ai_report_sample.pdf"
20
+ SAMPLE_TXT = CACHE_DIR / "ai_overview.txt"
21
+
22
+ # Public, stable, lightweight PDFs (< 5 MB each)
23
+ PDF_SOURCES = [
24
+ (
25
+ "https://arxiv.org/pdf/2310.07064", # "Levels of AGI" Google DeepMind paper
26
+ "Levels_of_AGI_DeepMind.pdf",
27
+ ),
28
+ (
29
+ "https://arxiv.org/pdf/2303.12528", # "Sparks of AGI" Microsoft Research
30
+ "Sparks_of_AGI_Microsoft.pdf",
31
+ ),
32
+ (
33
+ "https://arxiv.org/pdf/2304.15004", # "AutoGPT for Online Dec. Making"
34
+ "AutoGPT_Decision_Making.pdf",
35
+ ),
36
+ ]
37
+
38
+
39
+ def download_sample_doc() -> tuple[str, str]:
40
+ """
41
+ Returns (local_path, display_name).
42
+ Tries PDF sources first; falls back to a generated TXT file.
43
+ """
44
+ CACHE_DIR.mkdir(exist_ok=True)
45
+
46
+ # ── Try each PDF source ────────────────────────────────────────────────────
47
+ for url, fname in PDF_SOURCES:
48
+ dest = CACHE_DIR / fname
49
+ if dest.exists():
50
+ return str(dest), fname # already cached
51
+
52
+ try:
53
+ print(f"Attempting download: {url}")
54
+ req = urllib.request.Request(
55
+ url,
56
+ headers={
57
+ "User-Agent": (
58
+ "Mozilla/5.0 (X11; Linux x86_64) "
59
+ "AppleWebKit/537.36 (KHTML, like Gecko) "
60
+ "Chrome/120.0 Safari/537.36"
61
+ )
62
+ },
63
+ )
64
+ with urllib.request.urlopen(req, timeout=20) as resp:
65
+ data = resp.read()
66
+
67
+ # Sanity-check: must look like a PDF
68
+ if data[:4] == b"%PDF" and len(data) > 10_000:
69
+ dest.write_bytes(data)
70
+ print(f"✓ Downloaded {fname} ({len(data)//1024} KB)")
71
+ return str(dest), fname
72
+
73
+ except Exception as ex:
74
+ print(f" ✗ Failed: {ex}")
75
+ time.sleep(1)
76
+
77
+ # ── Fallback: generate a rich synthetic TXT document ──────────────────────
78
+ print("All PDF downloads failed – generating synthetic document.")
79
+ return _generate_synthetic_doc()
80
+
81
+
82
+ def _generate_synthetic_doc() -> tuple[str, str]:
83
+ """Creates a comprehensive synthetic AI overview document locally."""
84
+ fname = "AI_Technology_Overview_2024.txt"
85
+ dest = CACHE_DIR / fname
86
+
87
+ content = textwrap.dedent("""
88
+ ═══════════════════════════════════════════════════════════════
89
+ ARTIFICIAL INTELLIGENCE: STATE OF THE FIELD — 2024 OVERVIEW
90
+ A Comprehensive Technical Reference Document
91
+ ═══════════════════════════════════════════════════════════════
92
+
93
+ ── SECTION 1: LARGE LANGUAGE MODELS ──────────────────────────
94
+
95
+ Large Language Models (LLMs) are neural networks trained on vast corpora
96
+ of text data using the Transformer architecture introduced by Vaswani et
97
+ al. in 2017. Modern LLMs such as GPT-4, Claude 3, Gemini Ultra, and
98
+ LLaMA-3 contain hundreds of billions of parameters.
99
+
100
+ Training involves two primary phases:
101
+ 1. Pre-training: Self-supervised learning on internet-scale text data
102
+ (Common Crawl, Wikipedia, Books, GitHub code). The model learns to
103
+ predict the next token in a sequence.
104
+ 2. Fine-tuning / RLHF: Reinforcement Learning from Human Feedback aligns
105
+ the model with human preferences, improving helpfulness, harmlessness,
106
+ and honesty.
107
+
108
+ Key capabilities: text generation, translation, summarization, question
109
+ answering, code generation, reasoning, and multimodal understanding.
110
+
111
+ Limitations: hallucinations (generating plausible but false information),
112
+ knowledge cutoff dates, context-window constraints, and sensitivity to
113
+ prompt phrasing (prompt brittleness).
114
+
115
+ ── SECTION 2: RETRIEVAL-AUGMENTED GENERATION (RAG) ──────────
116
+
117
+ RAG is an architectural pattern that enhances LLM accuracy by grounding
118
+ generation in retrieved factual documents. It was introduced in a 2020
119
+ paper by Lewis et al. at Facebook AI Research.
120
+
121
+ RAG Pipeline Architecture:
122
+ 1. Document Ingestion: PDFs, text files, or web pages are loaded.
123
+ 2. Chunking: Documents are split into smaller overlapping segments
124
+ (typically 256–1024 tokens) to fit the model's context window.
125
+ 3. Embedding: Each chunk is converted to a dense vector using a sentence
126
+ transformer model (e.g., all-MiniLM-L6-v2, text-embedding-ada-002).
127
+ 4. Vector Storage: Embeddings are stored in a vector database such as
128
+ ChromaDB, Pinecone, Weaviate, or Qdrant for fast similarity search.
129
+ 5. Query Processing: A user query is embedded and compared against stored
130
+ vectors using cosine similarity or ANN algorithms (HNSW, IVF).
131
+ 6. Context Injection: The top-k most relevant chunks are retrieved and
132
+ injected into the LLM prompt as grounding context.
133
+ 7. Generation: The LLM generates an answer informed by retrieved context.
134
+
135
+ Advantages over pure LLMs:
136
+ - Up-to-date information (no knowledge cutoff)
137
+ - Reduced hallucination (grounded in real documents)
138
+ - Source attribution and transparency
139
+ - Domain-specific knowledge without expensive fine-tuning
140
+
141
+ ── SECTION 3: VECTOR DATABASES ───────────────────────────────
142
+
143
+ Vector databases are specialized systems optimized for storing and
144
+ querying high-dimensional embedding vectors.
145
+
146
+ ChromaDB: Open-source, runs locally in Python. Ideal for development
147
+ and small-to-medium scale projects. Supports persistent and in-memory
148
+ storage. Integrates seamlessly with LangChain.
149
+
150
+ Pinecone: Managed cloud vector database. Scales to billions of vectors.
151
+ Supports metadata filtering, sparse-dense hybrid search.
152
+
153
+ Qdrant: Open-source with cloud option. Supports payload filtering,
154
+ multi-vector collections, and quantization for memory efficiency.
155
+
156
+ Weaviate: GraphQL-native vector search with modular ML integrations.
157
+
158
+ FAISS (Facebook AI Similarity Search): Library (not a database) for
159
+ efficient similarity search. Excellent for research and batch processing.
160
+
161
+ Approximate Nearest Neighbor (ANN) algorithms used by these systems
162
+ include HNSW (Hierarchical Navigable Small World graphs), which provides
163
+ O(log n) search complexity with high recall.
164
+
165
+ ── SECTION 4: EMBEDDING MODELS ───────────────────────────────
166
+
167
+ Embedding models convert text into dense numerical vectors that capture
168
+ semantic meaning. Similar texts produce vectors that are close in the
169
+ embedding space (measured by cosine similarity or dot product).
170
+
171
+ Popular models:
172
+ - all-MiniLM-L6-v2: 22M parameters, 384 dimensions, very fast, good
173
+ quality. Best for real-time applications.
174
+ - all-mpnet-base-v2: 110M parameters, 768 dimensions, higher quality.
175
+ - text-embedding-3-small (OpenAI): 1536 dims, strong general performance.
176
+ - text-embedding-3-large (OpenAI): 3072 dims, state-of-the-art quality.
177
+ - UAE-Large-V1 (WhereIsAI): Top performer on MTEB benchmark as of 2024.
178
+
179
+ The MTEB (Massive Text Embedding Benchmark) is the standard evaluation
180
+ suite for embedding models, covering retrieval, clustering, classification,
181
+ and semantic similarity tasks across 56 datasets.
182
+
183
+ ── SECTION 5: AI AGENTS & AGENTIC SYSTEMS ────────────────────
184
+
185
+ AI agents are LLM-powered systems that can take actions in the world—
186
+ browsing the web, executing code, calling APIs, and managing files—in
187
+ pursuit of a goal.
188
+
189
+ ReAct (Reason + Act) Framework: The model alternates between reasoning
190
+ steps (Thought) and actions (Act), observing results after each action.
191
+
192
+ LangGraph: A framework for building stateful, graph-based agent workflows.
193
+ Supports cycles, branching, parallel execution, and human-in-the-loop
194
+ interrupts.
195
+
196
+ CrewAI: Multi-agent framework where specialized agents collaborate on
197
+ complex tasks. Agents have roles, goals, tools, and can delegate to peers.
198
+
199
+ AutoGen (Microsoft): Framework for multi-agent conversation and code
200
+ execution. Supports human-agent collaboration workflows.
201
+
202
+ Key challenges in agent development:
203
+ - Long-horizon planning and task decomposition
204
+ - Reliable tool use and API integration
205
+ - Memory management (short-term, long-term, episodic)
206
+ - Error recovery and graceful degradation
207
+ - Safety and sandboxing of code execution
208
+
209
+ ── SECTION 6: FINE-TUNING & PEFT METHODS ─────────────────────
210
+
211
+ Full fine-tuning of LLMs is computationally expensive. Parameter-Efficient
212
+ Fine-Tuning (PEFT) methods adapt pre-trained models with minimal resources.
213
+
214
+ LoRA (Low-Rank Adaptation): Adds small trainable rank-decomposition matrices
215
+ to attention layers while freezing the base model. Reduces trainable
216
+ parameters by 10,000x while achieving near-full fine-tune quality.
217
+
218
+ QLoRA: Quantizes the base model to 4-bit precision (NF4), then applies
219
+ LoRA adapters. Enables fine-tuning of 70B models on a single consumer GPU.
220
+
221
+ Instruction tuning: Fine-tuning on (instruction, response) pairs to
222
+ improve the model's ability to follow natural language directions.
223
+
224
+ Popular open-source base models for fine-tuning:
225
+ - LLaMA-3 (Meta AI): 8B and 70B versions, strong multilingual support.
226
+ - Mistral-7B: Efficient 7B model with sliding window attention.
227
+ - Phi-3 (Microsoft): Small but surprisingly capable models (3.8B–14B).
228
+ - Gemma-2 (Google): 2B and 9B versions, optimized for efficiency.
229
+
230
+ ── SECTION 7: MLOPS AND MODEL DEPLOYMENT ─────────────────────
231
+
232
+ MLOps (Machine Learning Operations) covers the practices of deploying,
233
+ monitoring, and maintaining ML models in production.
234
+
235
+ Key components:
236
+ - Experiment Tracking: MLflow, Weights & Biases (W&B) track metrics,
237
+ hyperparameters, and model artifacts across training runs.
238
+ - Model Registry: Central repository for versioned model artifacts.
239
+ - Serving Infrastructure: FastAPI, TorchServe, Triton Inference Server,
240
+ or vLLM for high-throughput LLM serving.
241
+ - Containerization: Docker packages models with all dependencies.
242
+ Kubernetes orchestrates containers at scale.
243
+ - CI/CD: GitHub Actions or GitLab CI automates testing, building,
244
+ and deployment pipelines.
245
+ - Monitoring: Track data drift, concept drift, latency, and error rates
+ in production. Tools: Evidently AI, Arize, WhyLabs.
+
+ Deployment platforms:
+ - HuggingFace Spaces: Free hosting for Gradio/Streamlit ML demos.
+ - AWS SageMaker: Enterprise ML deployment on AWS infrastructure.
+ - Google Vertex AI: Managed ML platform on Google Cloud.
+ - Replicate: API-first model deployment, pay-per-prediction.
+ - Modal: Serverless GPU compute for ML inference.
+
+ ── SECTION 8: RESPONSIBLE AI & SAFETY ────────────────────────
+
+ As AI systems become more capable, ensuring they are safe, fair, and
+ aligned with human values is a critical research and engineering challenge.
+
+ Key principles:
+ - Helpfulness: The system should assist users effectively.
+ - Harmlessness: Avoid generating content that could cause real-world harm.
+ - Honesty: Acknowledge uncertainty; do not hallucinate or deceive.
+
+ Techniques:
+ - RLHF (Reinforcement Learning from Human Feedback): Trains reward models
+ from human preferences to guide LLM behavior.
+ - Constitutional AI (Anthropic): Models self-critique and revise outputs
+ against a set of principles.
+ - Red Teaming: Adversarial testing to discover model failure modes.
+ - Interpretability Research: Understanding internal model representations
+ (mechanistic interpretability, probing classifiers, attention analysis).
+
+ Regulatory landscape (2024):
+ - EU AI Act: First comprehensive AI regulation, risk-based tiered approach.
+ - US Executive Order on AI (Oct. 2023): Safety testing requirements for
+ large AI models.
+ - China AI Regulations: Content moderation and algorithmic transparency
+ requirements for generative AI services.
+
+ ═══════════════════════════════════════════════════════════════
+ END OF DOCUMENT
+ ═══════════════════════════════════════════════════════════════
+ """).strip()
+
+     dest.write_text(content, encoding="utf-8")
+     print(f"✓ Generated synthetic document ({len(content)} chars)")
+     return str(dest), fname
+
+
+ if __name__ == "__main__":
+     path, name = download_sample_doc()
+     print(f"\nReady: {path} ({name})")
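The data-drift monitoring mentioned in the generated document's MLOps section can be made concrete with a toy check. This sketch is illustrative only: the function name and the 1.0-standard-deviation threshold are assumptions for the example, and production tools such as Evidently or Arize use far richer statistics than a mean shift.

```python
from statistics import mean, stdev

def drift_score(reference: list, live: list) -> float:
    """Standardized shift of the live window's mean vs. a reference window."""
    sd = stdev(reference)
    if sd == 0:
        return 0.0
    return abs(mean(live) - mean(reference)) / sd

ref = [0.50, 0.52, 0.48, 0.51, 0.49]                # feature values at training time
print(drift_score(ref, [0.50, 0.51, 0.49]) > 1.0)   # stable traffic → False
print(drift_score(ref, [0.80, 0.82, 0.79]) > 1.0)   # shifted traffic → True
```

A real monitor would compute this per feature on sliding windows and alert when the score stays above the threshold.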
packages.txt ADDED
@@ -0,0 +1,3 @@
+ libgl1
+ libglib2.0-0
+ poppler-utils
rag_engine.py ADDED
@@ -0,0 +1,200 @@
+ """
+ RAG Engine
+ ──────────
+ - Embeddings : sentence-transformers/all-MiniLM-L6-v2 (HuggingFace, free)
+ - Vector DB  : ChromaDB (local, in-memory / persistent)
+ - LLM        : HuggingFace Inference API (zephyr-7b-beta, free tier)
+ - Chunking   : Recursive character splitter with overlap
+ """
+
+ import os
+ import re
+ import tempfile
+ from typing import Tuple, List
+
+ from chromadb.config import Settings
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
+ from langchain_community.embeddings import HuggingFaceEmbeddings
+ from langchain_community.vectorstores import Chroma
+ from langchain_community.document_loaders import PyPDFLoader, TextLoader
+
+ # ─── Configuration ─────────────────────────────────────────────────────────────
+ EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
+ HF_MODEL_ID = "HuggingFaceH4/zephyr-7b-beta"  # free inference API
+ CHUNK_SIZE = 800
+ CHUNK_OVERLAP = 150
+ TOP_K = 4
+ COLLECTION_NAME = "docmind_collection"
+ CHROMA_DIR = "./chroma_db"
+
+
+ class RAGEngine:
+     """Full RAG pipeline: ingest → embed → store → retrieve → generate."""
+
+     def __init__(self):
+         self._embeddings = None
+         self._vectorstore = None
+         self._splitter = RecursiveCharacterTextSplitter(
+             chunk_size=CHUNK_SIZE,
+             chunk_overlap=CHUNK_OVERLAP,
+             separators=["\n\n", "\n", ". ", " ", ""],
+         )
+
+     # ── Lazy-load embeddings ───────────────────────────────────────────────────
+     @property
+     def embeddings(self):
+         if self._embeddings is None:
+             self._embeddings = HuggingFaceEmbeddings(
+                 model_name=EMBED_MODEL,
+                 model_kwargs={"device": "cpu"},
+                 encode_kwargs={"normalize_embeddings": True},
+             )
+         return self._embeddings
+
+     # ── Ingest an uploaded Streamlit file object ───────────────────────────────
+     def ingest_file(self, uploaded_file) -> int:
+         suffix = _path_suffix(uploaded_file.name)
+         with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
+             tmp.write(uploaded_file.read())
+             tmp_path = tmp.name
+         try:
+             return self.ingest_path(tmp_path, uploaded_file.name)
+         finally:
+             os.unlink(tmp_path)  # remove the temp copy once ingested
+
+     # ── Ingest from a file path ────────────────────────────────────────────────
+     def ingest_path(self, path: str, name: str = "") -> int:
+         suffix = _path_suffix(name or path)
+
+         if suffix == ".pdf":
+             loader = PyPDFLoader(path)
+         else:
+             loader = TextLoader(path, encoding="utf-8")
+
+         raw_docs = loader.load()
+
+         # Add source metadata
+         for doc in raw_docs:
+             doc.metadata["source"] = name or os.path.basename(path)
+
+         chunks = self._splitter.split_documents(raw_docs)
+
+         # Reset & recreate vectorstore for the new document
+         self._vectorstore = Chroma.from_documents(
+             documents=chunks,
+             embedding=self.embeddings,
+             collection_name=COLLECTION_NAME,
+             persist_directory=CHROMA_DIR,
+             client_settings=Settings(anonymized_telemetry=False),
+         )
+
+         return len(chunks)
+
+     # ── Query: retrieve + generate ─────────────────────────────────────────────
+     def query(self, question: str) -> Tuple[str, List[str]]:
+         if self._vectorstore is None:
+             return "⚠️ Please upload a document first.", []
+
+         # 1. Retrieve top-k relevant chunks
+         retriever = self._vectorstore.as_retriever(
+             search_type="mmr",  # Maximal Marginal Relevance
+             search_kwargs={"k": TOP_K, "fetch_k": TOP_K * 3},
+         )
+         docs = retriever.invoke(question)
+
+         # 2. Build context
+         context = "\n\n---\n\n".join(
+             f"[Chunk {i+1}]\n{d.page_content}" for i, d in enumerate(docs)
+         )
+
+         # 3. Unique source names for display
+         sources = list({d.metadata.get("source", "Document") for d in docs})
+
+         # 4. Generate answer
+         answer = self._generate(question, context)
+
+         return answer, sources
+
+     # ── LLM call via HuggingFace Inference API ─────────────────────────────────
+     def _generate(self, question: str, context: str) -> str:
+         try:
+             from huggingface_hub import InferenceClient
+
+             prompt = _build_prompt(question, context)
+
+             hf_token = os.environ.get("HF_TOKEN", "")  # optional but unlocks higher rate limits
+             client = InferenceClient(model=HF_MODEL_ID, token=hf_token or None)
+
+             response = client.text_generation(
+                 prompt,
+                 max_new_tokens=512,
+                 temperature=0.2,
+                 repetition_penalty=1.15,
+                 do_sample=True,
+                 stop_sequences=["</s>", "[INST]", "Human:", "User:"],
+             )
+
+             # Strip any echoed prompt
+             return _clean_response(response, question)
+
+         except Exception as e:
+             # Fallback: context-extraction mode (no LLM needed)
+             return _fallback_answer(question, context, str(e))
+
+
+ # ─── Prompt Builder ────────────────────────────────────────────────────────────
+ def _build_prompt(question: str, context: str) -> str:
+     system = (
+         "You are DocMind, an expert document analyst. "
+         "Answer the user's question using ONLY the provided document context. "
+         "Be concise, accurate, and cite specific details from the context. "
+         "If the answer is not in the context, say so clearly."
+     )
+     return (
+         f"<|system|>\n{system}</s>\n"
+         f"<|user|>\n"
+         f"Document context:\n{context}\n\n"
+         f"Question: {question}</s>\n"
+         f"<|assistant|>\n"
+     )
+
+
+ # ─── Response Cleaner ──────────────────────────────────────────────────────────
+ def _clean_response(text: str, question: str) -> str:
+     # Remove any re-echoed prompt fragments
+     for marker in ["<|assistant|>", "<|user|>", "<|system|>", "</s>"]:
+         text = text.replace(marker, "")
+     text = text.strip()
+
+     # Drop a leading echo of the question (slicing by the question's length
+     # is only safe when the whole question was actually repeated)
+     if text.lower().startswith(question.lower()):
+         text = text[len(question):].strip()
+
+     return text or "I could not generate a response. Please try rephrasing your question."
+
+
+ # ─── Fallback (no LLM) ─────────────────────────────────────────────────────────
+ def _fallback_answer(question: str, context: str, error: str) -> str:
+     """Simple extractive answer when the LLM is unavailable."""
+     keywords = set(re.findall(r'\b\w{4,}\b', question.lower()))
+     best_chunk, best_score = "", 0
+
+     for chunk in context.split("---"):
+         words = set(re.findall(r'\b\w{4,}\b', chunk.lower()))
+         score = len(keywords & words)
+         if score > best_score:
+             best_score = score
+             best_chunk = chunk.strip()
+
+     if best_chunk:
+         excerpt = best_chunk[:600] + ("..." if len(best_chunk) > 600 else "")
+         return (
+             f"*(LLM unavailable – showing most relevant excerpt)*\n\n{excerpt}\n\n"
+             f"<small>Error: {error}</small>"
+         )
+     return f"⚠️ Could not generate answer. Error: {error}"
+
+
+ # ─── Helper ────────────────────────────────────────────────────────────────────
+ def _path_suffix(name: str) -> str:
+     """File extension (lowercased); defaults to .txt for extension-less names."""
+     return os.path.splitext(name)[-1].lower() or ".txt"
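The extractive fallback above can be exercised in isolation. This standalone sketch reproduces `_fallback_answer`'s keyword-overlap scoring; the function name and sample chunks here are illustrative, not part of the module.

```python
import re

def best_chunk_for(question: str, chunks: list) -> str:
    """Pick the chunk sharing the most 4+-letter words with the question
    (the same heuristic as _fallback_answer's scoring loop)."""
    keywords = set(re.findall(r"\b\w{4,}\b", question.lower()))
    best, best_score = "", 0
    for chunk in chunks:
        words = set(re.findall(r"\b\w{4,}\b", chunk.lower()))
        score = len(keywords & words)
        if score > best_score:
            best, best_score = chunk.strip(), score
    return best

chunks = [
    "Transformers use self-attention over token embeddings.",
    "ChromaDB stores vector embeddings in a local database.",
]
print(best_chunk_for("Which database stores vector embeddings?", chunks))
# → "ChromaDB stores vector embeddings in a local database."
```

Because it only counts exact word overlap, it misses inflected matches ("stores" vs. "stored"), which is why the module treats it strictly as a last resort.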
requirements.txt ADDED
@@ -0,0 +1,31 @@
+ # ── Core RAG Stack ─────────────────────────────────────────────────────────────
+ langchain==0.2.16
+ langchain-community==0.2.16
+ langchain-core==0.2.38
+
+ # ── Vector DB ──────────────────────────────────────────────────────────────────
+ chromadb==0.5.5
+
+ # ── Embeddings ─────────────────────────────────────────────────────────────────
+ sentence-transformers==3.0.1
+ huggingface-hub==0.24.6
+ transformers==4.44.2
+ tokenizers==0.19.1
+
+ # ── PDF Loading ────────────────────────────────────────────────────────────────
+ pypdf==4.3.1
+ pymupdf==1.24.9
+
+ # ── UI ─────────────────────────────────────────────────────────────────────────
+ streamlit==1.38.0
+
+ # ── ML Dependencies ────────────────────────────────────────────────────────────
+ torch==2.4.0
+ numpy==1.26.4
+ scipy==1.13.1
+ scikit-learn==1.5.1
+
+ # ── Utilities ──────────────────────────────────────────────────────────────────
+ python-dotenv==1.0.1
+ requests==2.32.3
+ tqdm==4.66.5