GovindKurapati committed on
Commit 1cfcd72 · 1 Parent(s): 1bcf42f
Files changed (6)
  1. .gitignore +81 -0
  2. README.md +152 -11
  3. app.py +432 -0
  4. ingestion.py +87 -0
  5. qa_pipeline.py +69 -0
  6. requirements.txt +17 -0
.gitignore ADDED
@@ -0,0 +1,81 @@
+ # Environment variables
+ .env
+ .env.local
+ .env.*.local
+
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+
+ # Virtual environments
+ venv/
+ env/
+ ENV/
+ env.bak/
+ venv.bak/
+ .venv/
+
+ # IDE
+ .vscode/
+ .idea/
+ *.swp
+ *.swo
+ *~
+ .project
+ .pydevproject
+
+ # OS
+ .DS_Store
+ .DS_Store?
+ ._*
+ .Spotlight-V100
+ .Trashes
+ ehthumbs.db
+ Thumbs.db
+ desktop.ini
+
+ # Project specific
+ chroma_db/
+ uploads/
+ ingested_urls.txt
+
+ # Logs
+ *.log
+ logs/
+
+ # Temporary files
+ *.tmp
+ *.temp
+ .cache/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # pytest
+ .pytest_cache/
+ .coverage
+ htmlcov/
+
+ # mypy
+ .mypy_cache/
+ .dmypy.json
+ dmypy.json
README.md CHANGED
@@ -1,13 +1,154 @@
- ---
- title: Dev Docs Chat
- emoji: 🏃
- colorFrom: indigo
- colorTo: purple
- sdk: gradio
- sdk_version: 5.35.0
- app_file: app.py
- pinned: false
- license: mit
+ # 📘 Dev Docs Chat
+
+ A Retrieval-Augmented Generation (RAG) system that lets you upload documents, ingest content from URLs, and ask questions about your knowledge base, with answers generated by an LLM.
+
+ ## 🚀 Features
+
+ ### 📁 **Document Support**
+
+ - **PDF Files**: Extract and process PDF documents
+ - **Text Files**: Plain text document processing
+ - **Markdown Files**: Structured markdown with proper parsing
+ - **URL Ingestion**: Fetch and process content from web URLs
+
+ ### 🎯 **Core Functionality**
+
+ - **Smart Search**: Vector-based semantic search across your documents
+ - **AI-Powered Q&A**: Get answers grounded in your content
+ - **Conversational Memory**: Maintains context across multiple questions
+
+ ### 🗂️ **Data Management**
+
+ - **File Upload**: Drag-and-drop interface for document ingestion
+ - **URL Ingestion**: Process web content with progress indicators
+ - **Delete Operations**: Remove files, URLs, and their embeddings
+ - **Bulk Clear**: Reset the entire knowledge base with one click
+
+ ## 🛠️ Installation
+
+ ### Prerequisites
+
+ - Python 3.10+
+ - pip package manager
+
+ ### Setup Instructions
+
+ 1. **Clone the repository**
+
+ ```bash
+ git clone <repository-url>
+ cd dev_docs_chat
+ ```
+
+ 2. **Create a virtual environment**
+
+ ```bash
+ python -m venv venv
+ source venv/bin/activate  # On Windows: venv\Scripts\activate
+ ```
+
+ 3. **Install dependencies**
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 4. **Set up environment variables**
+ Create a `.env` file in the project root:
+
+ ```env
+ GROQ_API_KEY=your_groq_api_key_here
+ GROQ_API_BASE=https://api.groq.com/openai/v1
+ ```
+
+ 5. **Get an API key**
+ - Sign up at [Groq](https://console.groq.com/)
+ - Generate an API key
+ - Add it to your `.env` file
+
+ ## 🚀 Usage
+
+ ### Starting the Application
+
+ ```bash
+ python app.py
+ ```
+
+ The application will be available at `http://127.0.0.1:7860`.
+
+ ## 📁 Project Structure
+
+ ```
+ dev-docs-chat/
+ ├── app.py # Main Gradio application
+ ├── qa_pipeline.py # Question-answering logic
+ ├── ingestion.py # Document ingestion logic
+ ├── requirements.txt # Python dependencies
+ ├── .env # Environment variables (create this)
+ ├── chroma_db/ # Vector database storage
+ ├── uploads/ # Uploaded file storage
+ ├── ingested_urls.txt # List of ingested URLs
+ └── README.md # This file
+ ```
+
+ ## 🔧 Technical Details
+
+ ### **Architecture**
+
+ - **Vector Database**: ChromaDB for similarity search
+ - **Embeddings**: HuggingFace sentence-transformers (`all-mpnet-base-v2`)
+ - **LLM**: Groq-hosted LLM for low-latency responses
+ - **Framework**: Gradio for the web interface
+
+ ## 🎯 Use Cases
+
+ ### **📚 Documentation Assistant**
+
+ - Upload project documentation and README files
+ - Ask questions about implementation details
+ - Get instant answers about your codebase
+
+ ### **🔍 Research Tool**
+
+ - Ingest research papers and technical articles
+ - Ask questions about new technologies
+ - Stay updated with industry trends
+
+ ### **📖 Learning Platform**
+
+ - Upload tutorials and educational content
+ - Ask questions about complex topics
+ - Get personalized explanations
+
+ ## 📈 Future Enhancements
+
+ - [ ] **Streaming Responses**: Real-time answer generation
+ - [ ] **File Type Support**: Excel, Word, PowerPoint documents
+ - [ ] **Advanced Search**: Filters and date-based search
+ - [ ] **Export Features**: Save conversations and answers
+ - [ ] **User Authentication**: Multi-user support
+ - [ ] **API Endpoints**: REST API for integration
+
+ ## 🤝 Contributing
+
+ 1. Fork the repository
+ 2. Create a feature branch
+ 3. Make your changes
+ 4. Add tests if applicable
+ 5. Submit a pull request
+
+ ## 📄 License
+
+ This project is licensed under the MIT License. See the LICENSE file for details.
+
+ ## 🙏 Acknowledgments
+
+ - **LangChain**: For the RAG framework
+ - **ChromaDB**: For vector storage
+ - **Gradio**: For the web interface
+ - **Groq**: For fast LLM inference
+ - **HuggingFace**: For embedding models
+
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ **Made with ❤️ by Govind Kurapati**
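The README setup step asks for two variables in `.env`. A minimal sketch of a startup check for them follows; the helper name `missing_env_vars` and the `REQUIRED_ENV_VARS` tuple are illustrative assumptions, not part of this commit. It only inspects a mapping, so a loader such as `load_dotenv()` would still need to run first to populate `os.environ`.

```python
import os

# Names taken from the README's .env example.
REQUIRED_ENV_VARS = ("GROQ_API_KEY", "GROQ_API_BASE")


def missing_env_vars(required=REQUIRED_ENV_VARS, env=None):
    """Return the names in `required` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("Environment looks complete.")
```

Reporting every missing name at once avoids the fix-one-rerun-fix-the-next loop that a chain of separate `if not ...: raise` checks tends to cause.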
app.py ADDED
@@ -0,0 +1,432 @@
+ import os
+ import shutil
+ import gradio as gr
+ from ingestion import load_and_ingest_file, load_and_ingest_url, clear_database, delete_embeddings_by_source
+ from qa_pipeline import answer_question
+
+ INGESTED_URLS_FILE = "./ingested_urls.txt"
+
+
+ def handle_file_upload(file):
+     filename = os.path.basename(file.name)
+     file_path = f"./uploads/{filename}"
+     upload_dir = "uploads"
+     os.makedirs("./uploads", exist_ok=True)
+     destination = os.path.join(upload_dir, filename)
+     shutil.copy2(file.name, destination)
+     load_and_ingest_file(file_path)
+     return "File processed and embedded successfully."
+
+
+ def handle_url_ingestion(url):
+     load_and_ingest_url(url)
+     save_url(url)
+     return "URL content processed and embedded successfully."
+
+
+ def handle_file_upload_with_progress(file):
+     """File upload with progress indicator"""
+     if not file:
+         return "No file selected.", gr.update(visible=False)
+
+     try:
+         # Copy file
+         filename = os.path.basename(file.name)
+         file_path = f"./uploads/{filename}"
+         upload_dir = "uploads"
+         os.makedirs("./uploads", exist_ok=True)
+         destination = os.path.join(upload_dir, filename)
+         shutil.copy2(file.name, destination)
+
+         # Process and embed
+         load_and_ingest_file(file_path)
+
+         return f"File '{filename}' processed and embedded successfully!", gr.update(visible=True)
+     except Exception as e:
+         return f"Error processing file: {str(e)}", gr.update(visible=True)
+
+
+ def handle_url_ingestion_with_progress(url):
+     """URL ingestion with progress indicator"""
+     if not url or not url.strip():
+         return "No URL provided.", gr.update(visible=False)
+
+     try:
+         # Ingest URL content
+         load_and_ingest_url(url.strip())
+
+         # Save URL to file
+         save_url(url.strip())
+
+         return f"URL '{url.strip()}' processed and embedded successfully!", gr.update(visible=True)
+     except Exception as e:
+         return f"Error processing URL: {str(e)}", gr.update(visible=True)
+
+
+ def handle_question(question):
+     return answer_question(question)
+
+
+ UPLOAD_DIR = "./uploads"
+
+
+ def list_uploaded_files():
+     files = []
+     # The uploads directory only exists after the first upload.
+     if not os.path.isdir(UPLOAD_DIR):
+         return files
+     for filename in os.listdir(UPLOAD_DIR):
+         full_path = os.path.join(UPLOAD_DIR, filename)
+         if os.path.isfile(full_path):
+             files.append(full_path)
+     return files
+
+
+ def save_url(url: str):
+     with open(INGESTED_URLS_FILE, "a") as f:
+         f.write(url.strip() + "\n")
+
+
+ def get_saved_urls() -> str:
+     if not os.path.exists(INGESTED_URLS_FILE):
+         return "<i>No URLs ingested yet.</i>"
+
+     links_html = ""
+     with open(INGESTED_URLS_FILE, "r") as f:
+         for line in f:
+             url = line.strip()
+             if not url:
+                 # Skip blank lines so they do not render as empty links
+                 continue
+             links_html += f'<div style="margin: 2px 0; padding: 8px; border: 1px solid #ddd; border-radius: 5px; background-color: #f9f9f9;"><a href="{url}" target="_blank">{url}</a></div>'
+     return links_html
+
+
+ def get_saved_urls_list():
+     """Get list of ingested URLs for dropdown"""
+     urls = []
+     if os.path.exists(INGESTED_URLS_FILE):
+         with open(INGESTED_URLS_FILE, "r") as f:
+             for line in f:
+                 url = line.strip()
+                 if url:
+                     urls.append(url)
+     return urls
+
+
+ def delete_url_by_url(url_to_delete: str):
+     """Delete URL by its actual URL string and its embeddings"""
+     if not os.path.exists(INGESTED_URLS_FILE):
+         return "No URLs to delete."
+
+     try:
+         with open(INGESTED_URLS_FILE, "r") as f:
+             urls = f.readlines()
+
+         # Find and remove the first matching URL
+         found = False
+         for i, url in enumerate(urls):
+             if url.strip() == url_to_delete:
+                 urls.pop(i)
+                 found = True
+                 break
+
+         if found:
+             with open(INGESTED_URLS_FILE, "w") as f:
+                 f.writelines(urls)
+
+             # Delete embeddings for this URL
+             embeddings_result = delete_embeddings_by_source(url_to_delete)
+
+             return f"Deleted URL: {url_to_delete}\n{embeddings_result}"
+         else:
+             return f"URL not found: {url_to_delete}"
+     except Exception as e:
+         return f"Error deleting URL: {str(e)}"
+
+
+
+ def delete_uploaded_file(filename: str):
+     """Delete an uploaded file and its embeddings"""
+     try:
+         file_path = os.path.join(UPLOAD_DIR, filename)
+         if os.path.exists(file_path):
+             # Delete the file
+             os.remove(file_path)
+
+             # Delete embeddings for this file
+             embeddings_result = delete_embeddings_by_source(file_path)
+
+             return f"Deleted file: {filename}\n{embeddings_result}"
+         else:
+             return f"File not found: {filename}"
+     except Exception as e:
+         return f"Error deleting file: {str(e)}"
+
+
+ def get_uploaded_files_list():
+     """Get the names of uploaded files for the delete dropdown"""
+     files = []
+     if os.path.exists(UPLOAD_DIR):
+         for filename in os.listdir(UPLOAD_DIR):
+             full_path = os.path.join(UPLOAD_DIR, filename)
+             if os.path.isfile(full_path):
+                 files.append(filename)
+     return files
+
+
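`get_saved_urls` interpolates each stored URL directly into HTML. Since the URLs come from user input, one option is to escape them before rendering. The sketch below is an illustration under assumptions rather than the commit's code: `render_url_links` is a hypothetical helper, the inline styling is dropped for brevity, and skipping duplicates is an added behavior that the original does not have.

```python
from html import escape


def render_url_links(urls):
    """Render URLs as HTML links, escaping each one and skipping blanks and duplicates."""
    seen = set()
    rows = []
    for raw in urls:
        url = raw.strip()
        if not url or url in seen:
            continue
        seen.add(url)
        # quote=True also escapes double quotes, which matters inside href="...".
        safe = escape(url, quote=True)
        rows.append(f'<div><a href="{safe}" target="_blank" rel="noopener">{safe}</a></div>')
    return "".join(rows) or "<i>No URLs ingested yet.</i>"
```

Escaping with `quote=True` keeps a URL containing `"` from closing the `href` attribute early. It does not validate the URL scheme, so a stricter version might also reject anything other than `http` and `https`.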
+ with gr.Blocks() as demo:
+     gr.Markdown("# 📘 Developer Docs Assistant")
+
+     with gr.Tab("Upload Document"):
+         with gr.Row():
+             with gr.Column(scale=2):
+                 file = gr.File(label="Upload Document", file_types=[".pdf", ".txt", ".md", ".markdown"])
+                 upload_btn = gr.Button("📤 Ingest File", variant="primary")
+                 upload_output = gr.Textbox(label="Upload Result", visible=False)
+
+                 # Progress indicator
+                 upload_progress = gr.HTML(
+                     value="<div style='text-align: center; color: #666;'>Ready to upload</div>",
+                     label="Status"
+                 )
+
+             with gr.Column(scale=1):
+                 gr.Markdown("### 📋 Upload Instructions")
+                 gr.Markdown("""
+                 1. **Select File**: Choose a PDF, TXT, or Markdown file
+                 2. **Click Upload**: The file will be processed and embedded
+                 3. **Wait**: Processing may take a few moments
+                 4. **Check Status**: Monitor the progress indicator
+                 """)
+
+         def handle_upload_with_progress(file):
+             if not file:
+                 return (
+                     gr.update(value="⚠️ Please select a file first.", visible=True),
+                     gr.update(value="<div style='text-align: center; color: #ff6b6b;'>❌ No file selected</div>")
+                 )
+
+             # Spinner markup for a "processing" state. It is built but never displayed:
+             # the handler returns only once, after processing has finished.
+             progress_html = """
+             <div style='text-align: center; color: #4CAF50;'>
+             <div style='margin-bottom: 10px;'>🔄 Processing file...</div>
+             <div style='display: inline-block; width: 20px; height: 20px; border: 3px solid #f3f3f3; border-top: 3px solid #4CAF50; border-radius: 50%; animation: spin 1s linear infinite;'></div>
+             <style>
+             @keyframes spin {
+             0% { transform: rotate(0deg); }
+             100% { transform: rotate(360deg); }
+             }
+             </style>
+             </div>
+             """
+
+             try:
+                 result = handle_file_upload_with_progress(file)
+                 # The helper reports failures in its message instead of raising.
+                 failed = result[0].startswith("Error")
+                 color = "#ff6b6b" if failed else "#4CAF50"
+                 icon = "❌" if failed else "✅"
+                 status_html = f"""
+                 <div style='text-align: center; color: {color};'>
+                 {icon} {result[0]}
+                 </div>
+                 """
+                 return gr.update(value=result[0], visible=True), gr.update(value=status_html)
+             except Exception as e:
+                 error_html = f"""
+                 <div style='text-align: center; color: #ff6b6b;'>
+                 ❌ Error: {str(e)}
+                 </div>
+                 """
+                 return gr.update(value=f"❌ Error: {str(e)}", visible=True), gr.update(value=error_html)
+
+         # One update per output component: the textbox value and visibility travel together.
+         upload_btn.click(
+             handle_upload_with_progress,
+             inputs=file,
+             outputs=[upload_output, upload_progress]
+         )
+
+     with gr.Tab("Ingest from URL"):
+         with gr.Row():
+             with gr.Column(scale=2):
+                 url_input = gr.Textbox(label="Document URL", placeholder="https://example.com/document")
+                 url_btn = gr.Button("🌐 Ingest URL", variant="primary")
+                 url_output = gr.Textbox(label="URL Processing Result", visible=False)
+
+                 # Progress indicator
+                 url_progress = gr.HTML(
+                     value="<div style='text-align: center; color: #666;'>Ready to ingest URL</div>",
+                     label="Status"
+                 )
+
+             with gr.Column(scale=1):
+                 gr.Markdown("### 📋 URL Ingestion Instructions")
+                 gr.Markdown("""
+                 1. **Enter URL**: Paste a valid document URL
+                 2. **Click Ingest**: Content will be fetched and processed
+                 3. **Wait**: Processing may take a few moments
+                 4. **Check Status**: Monitor the progress indicator
+                 """)
+
+         def handle_url_ingestion_with_progress_ui(url):
+             if not url or not url.strip():
+                 return (
+                     gr.update(value="⚠️ Please enter a valid URL.", visible=True),
+                     gr.update(value="<div style='text-align: center; color: #ff6b6b;'>❌ No URL provided</div>")
+                 )
+
+             # Spinner markup for a "processing" state. It is built but never displayed:
+             # the handler returns only once, after processing has finished.
+             progress_html = """
+             <div style='text-align: center; color: #4CAF50;'>
+             <div style='margin-bottom: 10px;'>🔄 Fetching and processing URL...</div>
+             <div style='display: inline-block; width: 20px; height: 20px; border: 3px solid #f3f3f3; border-top: 3px solid #4CAF50; border-radius: 50%; animation: spin 1s linear infinite;'></div>
+             <style>
+             @keyframes spin {
+             0% { transform: rotate(0deg); }
+             100% { transform: rotate(360deg); }
+             }
+             </style>
+             </div>
+             """
+
+             try:
+                 result = handle_url_ingestion_with_progress(url.strip())
+                 # The helper reports failures in its message instead of raising.
+                 failed = result[0].startswith("Error")
+                 color = "#ff6b6b" if failed else "#4CAF50"
+                 icon = "❌" if failed else "✅"
+                 status_html = f"""
+                 <div style='text-align: center; color: {color};'>
+                 {icon} {result[0]}
+                 </div>
+                 """
+                 return gr.update(value=result[0], visible=True), gr.update(value=status_html)
+             except Exception as e:
+                 error_html = f"""
+                 <div style='text-align: center; color: #ff6b6b;'>
+                 ❌ Error: {str(e)}
+                 </div>
+                 """
+                 return gr.update(value=f"❌ Error: {str(e)}", visible=True), gr.update(value=error_html)
+
+         # One update per output component: the textbox value and visibility travel together.
+         url_btn.click(
+             handle_url_ingestion_with_progress_ui,
+             inputs=url_input,
+             outputs=[url_output, url_progress]
+         )
+
+     with gr.Tab("Manage Data"):
+         gr.Markdown("# 🗂️ Data Management")
+
+         with gr.Row():
+             with gr.Column(scale=1):
+                 gr.Markdown("### 📁 Uploaded Files")
+                 file_dropdown = gr.Dropdown(
+                     label="Select File to Delete",
+                     choices=get_uploaded_files_list(),
+                     interactive=True
+                 )
+                 delete_file_btn = gr.Button("🗑️ Delete Selected File", variant="stop")
+                 file_delete_output = gr.Textbox(label="File Delete Result", visible=False)
+
+                 def delete_selected_file(filename):
+                     if filename:
+                         result = delete_uploaded_file(filename)
+                         # Refresh the dropdown
+                         new_choices = get_uploaded_files_list()
+                         return gr.update(value=result, visible=True), gr.update(choices=new_choices)
+                     return gr.update(value="No file selected", visible=True), gr.update()
+
+                 delete_file_btn.click(
+                     delete_selected_file,
+                     inputs=file_dropdown,
+                     outputs=[file_delete_output, file_dropdown]
+                 )
+
+                 refresh_files_btn = gr.Button("🔄 Refresh File List")
+                 refresh_files_btn.click(
+                     lambda: gr.update(choices=get_uploaded_files_list()),
+                     outputs=file_dropdown
+                 )
+
+             with gr.Column(scale=1):
+                 gr.Markdown("### 🌐 Ingested URLs")
+                 # url_links_display = gr.HTML(value=get_saved_urls())
+                 url_dropdown = gr.Dropdown(
+                     label="Select URL to Delete",
+                     choices=get_saved_urls_list(),
+                     interactive=True
+                 )
+                 delete_url_btn = gr.Button("🗑️ Delete Selected URL", variant="stop")
+                 url_delete_output = gr.Textbox(label="URL Delete Result", visible=False)
+
+                 def delete_selected_url(url):
+                     # Return exactly one value per output component (result box, dropdown).
+                     if url:
+                         result = delete_url_by_url(url)
+                         # Refresh the dropdown
+                         new_choices = get_saved_urls_list()
+                         return gr.update(value=result, visible=True), gr.update(choices=new_choices)
+                     return gr.update(value="No URL selected", visible=True), gr.update()
+
+                 delete_url_btn.click(
+                     delete_selected_url,
+                     inputs=url_dropdown,
+                     outputs=[url_delete_output, url_dropdown]
+                 )
+
+                 refresh_urls_btn = gr.Button("🔄 Refresh URL List")
+                 refresh_urls_btn.click(
+                     lambda: gr.update(choices=get_saved_urls_list()),
+                     outputs=[url_dropdown]
+                 )
+
+         gr.Markdown("---")
+         gr.Markdown("### ⚠️ Nuclear Option - Clear All Data")
+         gr.Markdown("**Warning**: This will delete ALL uploaded files, ingested URLs, and clear the entire vector database. This action cannot be undone.")
+
+         with gr.Row():
+             clear_all_btn = gr.Button("💥 Clear All Data", variant="stop", size="lg")
+             clear_output = gr.Textbox(label="Clear All Result", visible=False)
+
+         def clear_all_data():
+             # Clear database
+             db_result = clear_database()
+
+             # Clear uploaded files
+             file_result = ""
+             if os.path.exists(UPLOAD_DIR):
+                 for filename in os.listdir(UPLOAD_DIR):
+                     file_path = os.path.join(UPLOAD_DIR, filename)
+                     if os.path.isfile(file_path):
+                         try:
+                             os.remove(file_path)
+                             file_result += f"Deleted file: {filename}\n"
+                         except Exception as e:
+                             file_result += f"Error deleting {filename}: {str(e)}\n"
+
+             # Clear ingested URLs
+             url_result = ""
+             if os.path.exists(INGESTED_URLS_FILE):
+                 try:
+                     os.remove(INGESTED_URLS_FILE)
+                     url_result = "Deleted ingested URLs file\n"
+                 except Exception as e:
+                     url_result = f"Error deleting URLs file: {str(e)}\n"
+
+             # The result textbox starts hidden, so make it visible along with the summary.
+             summary = f"Database: {db_result}\nFiles: {file_result}URLs: {url_result}"
+             return gr.update(value=summary, visible=True)
+
+         clear_all_btn.click(
+             clear_all_data,
+             outputs=clear_output
+         )
+
+         # Load initial data
+         demo.load(fn=lambda: gr.update(choices=get_uploaded_files_list()), outputs=file_dropdown)
+         demo.load(fn=lambda: gr.update(choices=get_saved_urls_list()), outputs=url_dropdown)
+         # demo.load(fn=get_saved_urls, outputs=url_links_display)
+
+     with gr.Tab("Ask a Question"):
+         with gr.Row():
+             with gr.Column(scale=2):
+                 question_input = gr.Textbox(label="Your Question", placeholder="Ask a question about your documents...")
+                 ask_btn = gr.Button("🤖 Get Answer", variant="primary")
+                 answer_output = gr.Textbox(label="Answer", lines=10, placeholder="Answer will appear here...")
+
+
+         def handle_question_with_sources(question):
+             return answer_question(question)
+
+         ask_btn.click(handle_question_with_sources, inputs=question_input, outputs=answer_output)
+
+
+ demo.launch()
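The upload and URL handlers in `app.py` build "processing" markup but return only once, so the interim state never reaches the page. Gradio treats a generator function used as an event handler as a stream and pushes each yielded value to the outputs, which is one way to show such a state. The sketch below shows only the control flow, in plain Python without Gradio; `ingest_with_status` and its `ingest` callback are hypothetical names, not functions from this commit.

```python
def ingest_with_status(path, ingest):
    """Yield a status message before and after calling `ingest(path)`."""
    # First yield: shown while the slow work is still running.
    yield f"Processing {path}..."
    try:
        ingest(path)
    except Exception as exc:
        # Report the failure as a status message instead of raising.
        yield f"Error: {exc}"
    else:
        yield f"Finished {path}"


if __name__ == "__main__":
    for status in ingest_with_status("notes.md", lambda p: None):
        print(status)
```

Wired to a button, each yielded string would replace the previous status, so the user sees the processing message until ingestion completes or fails.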
ingestion.py ADDED
@@ -0,0 +1,87 @@
+ import os
+ from langchain_community.document_loaders import (
+     WebBaseLoader,
+     PyPDFLoader,
+     TextLoader,
+     UnstructuredMarkdownLoader,
+ )
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
+ from langchain_chroma import Chroma
+ from langchain_huggingface import HuggingFaceEmbeddings
+
+ CHROMA_DB_DIR = "./chroma_db"
+
+ model_name = "sentence-transformers/all-mpnet-base-v2"
+ model_kwargs = {"device": "cpu"}
+ encode_kwargs = {"normalize_embeddings": False}
+ # Pass encode_kwargs as well, so this matches the embeddings built in qa_pipeline.py
+ embeddings = HuggingFaceEmbeddings(
+     model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
+ )
+
+
+ def load_and_ingest_file(file_path):
+     print(f"Loading file: {file_path}")
+     ext = os.path.splitext(file_path)[1].lower()
+     if ext == ".pdf":
+         loader = PyPDFLoader(file_path)
+     elif ext in [".md", ".markdown"]:
+         loader = UnstructuredMarkdownLoader(file_path)
+     else:
+         loader = TextLoader(file_path)
+     docs = loader.load()
+     store_embeddings(docs, source_type="file", source_path=file_path)
+
+
+ def load_and_ingest_url(url):
+     loader = WebBaseLoader(url)
+     docs = loader.load()
+     store_embeddings(docs, source_type="url", source_path=url)
+
+
+ def store_embeddings(docs, source_type="file", source_path=""):
+     text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
+     chunks = text_splitter.split_documents(docs)
+
+     # Add metadata to each chunk so it can be deleted by source later
+     for chunk in chunks:
+         chunk.metadata["source_type"] = source_type
+         chunk.metadata["source_path"] = source_path
+
+     vectordb = Chroma(
+         collection_name="docs_collection",
+         embedding_function=embeddings,
+         persist_directory=None,  # None keeps the collection in memory; pass CHROMA_DB_DIR to persist it to disk
+     )
+     vectordb.add_documents(chunks)
+     print(f"Stored {len(chunks)} chunks in VectorDB.")
+
+
+ def delete_embeddings_by_source(source_path):
+     """Delete embeddings for a specific source file or URL"""
+     try:
+         vectordb = Chroma(
+             collection_name="docs_collection",
+             embedding_function=embeddings,
+             persist_directory=None,
+         )
+         # Delete documents where source_path matches
+         vectordb._collection.delete(where={"source_path": source_path})
+         print(f"Deleted embeddings for source: {source_path}")
+         return f"Deleted embeddings for: {source_path}"
+     except Exception as e:
+         print(f"Error deleting embeddings: {str(e)}")
+         return f"Error deleting embeddings: {str(e)}"
+
+
+ def clear_database():
+     """Clear all documents from the vector database"""
+     try:
+         vectordb = Chroma(
+             collection_name="docs_collection",
+             embedding_function=embeddings,
+             persist_directory=None,
+         )
+         # Recent chromadb releases reject delete() with an empty filter, so drop the
+         # collection instead; it is recreated the next time Chroma(...) is constructed.
+         vectordb.delete_collection()
+         print("Database cleared successfully.")
+         return "Database cleared successfully."
+     except Exception as e:
+         print(f"Error clearing database: {str(e)}")
+         return f"Error clearing database: {str(e)}"
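`store_embeddings` splits documents with `chunk_size=500` and `chunk_overlap=50`. To illustrate what those two numbers mean, here is a deliberately simplified fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to break on separators such as paragraphs and sentences), and `chunk_text` is a hypothetical helper written only for this illustration.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1):
        print(piece)
```

With the commit's settings each window advances 450 characters, so the last 50 characters of one chunk reappear at the start of the next. This helps when a sentence that answers a question straddles a chunk boundary.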
qa_pipeline.py ADDED
@@ -0,0 +1,69 @@
+ import os
+ from langchain_chroma import Chroma
+ from langchain.chains import ConversationalRetrievalChain
+ from langchain_openai import ChatOpenAI
+ from langchain_core.callbacks import StdOutCallbackHandler
+ from langchain.memory import ConversationBufferMemory
+ from dotenv import load_dotenv
+ from langchain_huggingface import HuggingFaceEmbeddings
+
+ CHROMA_DB_DIR = "./chroma_db"
+
+ # Load environment variables from .env file
+ load_dotenv()
+
+ # Groq exposes an OpenAI-compatible endpoint, so its credentials are passed to ChatOpenAI
+ OPENAI_API_KEY = os.getenv("GROQ_API_KEY")
+ OPENAI_API_BASE = os.getenv("GROQ_API_BASE")
+
+ if not OPENAI_API_KEY:
+     raise ValueError(
+         "GROQ_API_KEY not found in environment variables. Please check your .env file."
+     )
+ if not OPENAI_API_BASE:
+     raise ValueError(
+         "GROQ_API_BASE not found in environment variables. Please check your .env file."
+     )
+
+ model_name = "sentence-transformers/all-mpnet-base-v2"
+ model_kwargs = {"device": "cpu"}
+ encode_kwargs = {"normalize_embeddings": False}
+ embeddings = HuggingFaceEmbeddings(
+     model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
+ )
+
+
+ def get_qa_chain():
+     vectordb = Chroma(
+         persist_directory=None,
+         embedding_function=embeddings,
+         collection_name="docs_collection",
+     )
+
+     print(f"Number of embedded documents: {vectordb._collection.count()}")
+     retriever = vectordb.as_retriever(search_kwargs={"k": 3})
+
+     llm = ChatOpenAI(
+         model_name="llama-3.1-8b-instant",
+         openai_api_key=OPENAI_API_KEY,
+         openai_api_base=OPENAI_API_BASE,
+         temperature=0.2,
+     )
+
+     memory = ConversationBufferMemory(
+         memory_key="chat_history", return_messages=True, output_key="answer"
+     )
+     conversation_chain = ConversationalRetrievalChain.from_llm(
+         llm=llm, retriever=retriever, memory=memory, callbacks=[StdOutCallbackHandler()]
+     )
+
+     return conversation_chain
+
+
+ def answer_question(question):
+     # A new chain, and therefore a new empty memory, is built for every question
+     qa_chain = get_qa_chain()
+     result = qa_chain.invoke({"question": question})
+
+     answer = result["answer"]
+
+     return answer
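`answer_question` calls `get_qa_chain()` on every request, so each question starts with a fresh `ConversationBufferMemory` and earlier turns are not carried over. One possible way to keep history is to build the chain once and reuse it. The sketch below shows that pattern with `functools.lru_cache`; `StubChain`, `get_cached_chain`, and `ask` are hypothetical stand-ins so the example runs without LangChain, not code from this commit.

```python
from functools import lru_cache


class StubChain:
    """Hypothetical stand-in for a conversational chain that keeps its own history."""

    def __init__(self):
        self.history = []

    def invoke(self, inputs):
        question = inputs["question"]
        answer = f"answer #{len(self.history) + 1}"
        self.history.append((question, answer))
        return {"answer": answer}


@lru_cache(maxsize=1)
def get_cached_chain():
    """Build the chain once and reuse it, so its history survives across calls."""
    return StubChain()


def ask(question):
    return get_cached_chain().invoke({"question": question})["answer"]


if __name__ == "__main__":
    print(ask("What is RAG?"))
    print(ask("And how does retrieval fit in?"))
    print(len(get_cached_chain().history))
```

A cached chain also keeps its retriever, so it would need to be rebuilt (for example with `get_cached_chain.cache_clear()`) whenever the underlying collection is cleared or recreated. A single shared chain also means every user of the app shares one conversation, which may or may not be what is wanted.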
requirements.txt ADDED
@@ -0,0 +1,17 @@
+ langchain
+ langchain_community
+ chromadb
+ gradio
+ beautifulsoup4
+ requests
+ pypdf
+ tiktoken
+ ollama
+ langchain_chroma
+ langchain-ollama
+ langchain-groq
+ python-dotenv
+ langchain_huggingface
+ langchain_openai
+ sentence-transformers
+ unstructured