---
title: Enterprise RAG Assistant
emoji: 🚀
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
  - streamlit
pinned: false
short_description: 'Enterprise AI Assistant with RAG - Complete Implementation'
license: mit
---

# 📄 Enterprise RAG Assistant with IBM Granite

A Retrieval-Augmented Generation (RAG) application that lets you upload PDF documents and ask intelligent questions about their content, answered by IBM's Granite language model.

## 🌟 Features

- **PDF Text Extraction**: Extract text from PDF documents with detailed progress tracking
- **Intelligent Chunking**: Split documents into overlapping chunks so context survives chunk boundaries
- **Semantic Search**: Find relevant content using sentence embeddings
- **AI-Powered Q&A**: Generate answers with the IBM Granite language model
- **Interactive UI**: User-friendly Streamlit interface with real-time status updates
- **GPU/CPU Support**: Automatically detects and uses available hardware
- **Memory Optimization**: Efficient processing for large documents

## 🚀 Quick Start

### Prerequisites

- Python 3.8 or higher
- pip package manager
- At least 4 GB RAM (8 GB+ recommended)
- Optional: CUDA-compatible GPU for faster processing

### Installation

1. **Clone the repository:**
   ```bash
   git clone https://huggingface.co/spaces/SimranShaikh/enterprise-rag-assistant.git
   cd enterprise-rag-assistant
   ```

2. **Create a virtual environment:**
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

4. **Run the application:**
   ```bash
   streamlit run app.py
   ```

5. **Open your browser** and navigate to `http://localhost:8501`

## 📦 Dependencies

Create a `requirements.txt` file with the following content:

```txt
streamlit>=1.28.0
PyPDF2>=3.0.1
sentence-transformers>=2.2.2
transformers>=4.30.0
torch>=2.0.0
numpy>=1.24.0
scikit-learn>=1.3.0
```

## 🔧 Usage

### Step 1: Load Models
1. Click the **"🤖 Load Models"** button
2. Wait for the models to download and load (this may take a few minutes on first run)
3. Models are cached locally for faster subsequent loads

### Step 2: Upload PDF
1. Click **"Browse files"** and select your PDF document
2. Supported format: PDF only
3. Maximum recommended size: 50 MB

### Step 3: Process PDF
1. Click **"📖 Process PDF"** after uploading
2. The system will:
   - Extract text from all pages
   - Split the text into overlapping chunks
   - Generate embeddings for semantic search
   - Display processing progress

### Step 4: Ask Questions
1. Type your question in the text input field
2. Click **"🔍 Get Answer"**
3. View the AI-generated answer and source references

### Example Questions
- "What is the main topic of this document?"
- "Summarize the key findings"
- "What are the recommendations mentioned?"
- "Who are the main authors or contributors?"
- "What methodology was used?"

## 🏗️ Architecture

```
┌─────────────┐   ┌─────────────────┐   ┌───────────────┐
│ PDF Upload  │──▶│ Text Extraction │──▶│ Text Chunking │
└─────────────┘   └─────────────────┘   └───────┬───────┘
                                                │
┌─────────────┐   ┌─────────────────┐   ┌───────▼───────┐
│ User Query  │──▶│ Semantic Search │◀──│  Embeddings   │
└──────┬──────┘   └────────┬────────┘   └───────────────┘
       │                   │
       │          ┌────────▼────────┐
       └─────────▶│   Answer Gen.   │
                  │  (IBM Granite)  │
                  └─────────────────┘
```

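The answer-generation stage stitches the retrieved chunks into a grounding prompt for the model. The sketch below is a hypothetical template, not the app's actual code; `build_prompt` and its wording are illustrative only:

```python
def build_prompt(question, chunks):
    """Assemble a grounding prompt from retrieved chunks.

    Each chunk is numbered so the model (and the user) can cite sources.
    """
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What model generates the answers?",
    ["Answers are generated by IBM Granite.", "PDFs are chunked with overlap."],
)
```

The numbered `[1]`, `[2]` markers make it easy to map a generated answer back to its source chunks.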
## 🔧 Configuration

### Model Configuration

You can change the models used in the `SimplePDFRAG` class:

```python
# Embedding model options
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')    # Default
# embedding_model = SentenceTransformer('all-mpnet-base-v2') # Better quality, slower

# Language model options
model_name = "ibm-granite/granite-3-2b-instruct"   # Default
# model_name = "ibm-granite/granite-3-8b-instruct" # Better quality, needs more memory
# model_name = "google/flan-t5-base"               # Lightweight alternative
```

### Chunking Parameters

Adjust text chunking settings:

```python
def chunk_text(self, text, chunk_size=400, overlap=50):
    # chunk_size: number of words per chunk
    # overlap: number of overlapping words between chunks
```

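For illustration, word-based splitting with overlap can be sketched in a few lines (a simplified stand-in, not the app's exact implementation):

```python
def chunk_text(text, chunk_size=400, overlap=50):
    """Split text into chunks of chunk_size words; consecutive chunks
    share `overlap` words so context survives the split."""
    words = text.split()
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

# With chunk_size=4 and overlap=2, each chunk repeats the last
# two words of the previous one.
chunks = chunk_text("one two three four five six seven eight",
                    chunk_size=4, overlap=2)
```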
### Search Parameters

Modify search behavior:

```python
def search_documents(self, query, top_k=3, min_threshold=0.1):
    # top_k: number of relevant chunks to retrieve
    # min_threshold: minimum similarity score for a chunk to count
```

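A minimal sketch of how `top_k` and `min_threshold` interact during retrieval. This is a simplified stand-in that takes precomputed embeddings as plain lists of floats; the app computes them with sentence-transformers:

```python
from math import sqrt

def cosine_sim(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search_documents(query_emb, doc_embs, documents, top_k=3, min_threshold=0.1):
    """Return up to top_k (score, document) pairs whose similarity to the
    query is at least min_threshold, best match first."""
    scored = [(cosine_sim(query_emb, emb), doc)
              for emb, doc in zip(doc_embs, documents)]
    scored = [(s, d) for s, d in scored if s >= min_threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Raising `min_threshold` trades recall for precision: chunks that barely match the query are dropped instead of being padded into the prompt.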
## 📊 Performance Tips

### For Better Performance
- Use a GPU-enabled environment
- Increase chunk overlap for better context
- Use larger language models (8B+ parameters)
- Process smaller PDF files (< 20 MB)

### Memory Management
- The app automatically manages GPU memory
- Use the "Reset Everything" button to clear memory
- Process one PDF at a time for optimal performance

## 🐛 Troubleshooting

### Common Issues

**1. Models not loading:**
```
Error: Model loading failed
```
- **Solution**: Check your internet connection and try again
- **Alternative**: Use smaller models or CPU-only mode

**2. PDF text extraction fails:**
```
Error: No text could be extracted
```
- **Solution**: Ensure the PDF contains selectable text (not just scanned images)
- **Alternative**: Run the PDF through an OCR tool first

**3. Out of memory errors:**
```
Error: CUDA out of memory
```
- **Solution**: Reduce the batch size or switch to CPU mode
- **Alternative**: Process smaller documents

**4. Slow processing:**
- **Solution**: Enable GPU acceleration
- **Alternative**: Use a smaller embedding model

### Debug Mode

Enable debug logging by setting:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

## 🚀 Deployment

### Local Development
```bash
streamlit run app.py
```

### Docker Deployment
```dockerfile
FROM python:3.9-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
EXPOSE 8501

CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
```

### Cloud Deployment

**Streamlit Cloud:**
1. Push the code to GitHub
2. Connect the repository to Streamlit Cloud
3. Deploy with one click

**Heroku:**
```bash
git init
heroku create your-app-name
git add .
git commit -m "Initial commit"
git push heroku main
```

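Note that Heroku also expects a `Procfile` telling it how to start the app. A typical one for Streamlit (assuming `app.py` is the entry point) looks like:

```
web: streamlit run app.py --server.port=$PORT --server.address=0.0.0.0
```

Heroku assigns the port at runtime via `$PORT`, so the hard-coded `8501` from local development should not be used here.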
## 📈 Advanced Features

### Custom Models

Add support for custom models:

```python
def load_custom_model(self, model_path):
    """Load a custom trained model"""
    self.granite_model = AutoModelForCausalLM.from_pretrained(model_path)
    self.tokenizer = AutoTokenizer.from_pretrained(model_path)
```

### Batch Processing

Process multiple PDFs:

```python
def process_multiple_pdfs(self, pdf_files):
    """Process multiple PDFs in one pass"""
    all_documents = []
    all_embeddings = []

    for pdf_file in pdf_files:
        # Process each PDF and pool the results into one index
        documents, embeddings = self.process_single_pdf(pdf_file)
        all_documents.extend(documents)
        all_embeddings.extend(embeddings)

    return all_documents, all_embeddings
```

### Export Results

Save Q&A sessions:

```python
def export_qa_session(self, qa_pairs, filename):
    """Export a Q&A session to a JSON file"""
    import json
    with open(filename, 'w') as f:
        json.dump(qa_pairs, f, indent=2)
```

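A quick round-trip of the export format, shown with a standalone copy of the function for illustration (the file name is arbitrary):

```python
import json
import os
import tempfile

def export_qa_session(qa_pairs, filename):
    """Export a Q&A session to a JSON file."""
    with open(filename, "w") as f:
        json.dump(qa_pairs, f, indent=2)

qa_pairs = [{"question": "What is the main topic?", "answer": "Enterprise RAG."}]
path = os.path.join(tempfile.gettempdir(), "qa_session.json")
export_qa_session(qa_pairs, path)

# The saved file loads back into the identical structure.
with open(path) as f:
    restored = json.load(f)
```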
## 🤝 Contributing

We welcome contributions! Please follow these steps:

1. **Fork the repository**
2. **Create a feature branch:**
   ```bash
   git checkout -b feature/amazing-feature
   ```
3. **Make your changes and commit:**
   ```bash
   git commit -m "Add amazing feature"
   ```
4. **Push to your branch:**
   ```bash
   git push origin feature/amazing-feature
   ```
5. **Create a Pull Request**

### Development Guidelines

- Follow PEP 8 style guidelines
- Add docstrings to all functions
- Include unit tests for new features
- Update documentation as needed

## 📝 License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

```
MIT License

Copyright (c) 2024 Your Name

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```

## 🙏 Acknowledgments

- **IBM** for the Granite language models
- **Hugging Face** for the transformers library
- **Sentence Transformers** for the embedding models
- **Streamlit** for the web framework
- **PyPDF2** for PDF processing

---

**⭐ If you find this project helpful, please consider giving it a star!**