sinhapiyush86 committed on
Commit 192b2d2 · verified · 1 Parent(s): bd75c88

Upload 18 files
.gitattributes CHANGED
@@ -33,3 +33,11 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ RIL-Q1-FY2024-25.pdf filter=lfs diff=lfs merge=lfs -text
+ RIL-Q1-FY2025-26.pdf filter=lfs diff=lfs merge=lfs -text
+ RIL-Q2-FY2023-24.pdf filter=lfs diff=lfs merge=lfs -text
+ RIL-Q2-FY2024-25.pdf filter=lfs diff=lfs merge=lfs -text
+ RIL-Q3-FY2023-24.pdf filter=lfs diff=lfs merge=lfs -text
+ RIL-Q3-FY2024-25.pdf filter=lfs diff=lfs merge=lfs -text
+ RIL-Q4-FY2023-24.pdf filter=lfs diff=lfs merge=lfs -text
+ RIL-Q4-FY2024-25.pdf filter=lfs diff=lfs merge=lfs -text
DEPLOYMENT_GUIDE.md ADDED
@@ -0,0 +1,283 @@
# 🚀 Hugging Face Spaces Deployment Guide (Docker + Streamlit)

This guide will walk you through deploying your RAG system to Hugging Face Spaces using **Docker with Streamlit**.

## 📋 Prerequisites

- A Hugging Face account
- All files from the `huggingface_deploy/` folder
- Basic understanding of Docker (optional)

## 🎯 Step-by-Step Deployment

### Step 1: Create a New Space

1. **Go to Hugging Face Spaces:**
   - Visit [https://huggingface.co/spaces](https://huggingface.co/spaces)
   - Click "Create new Space"

2. **Configure your Space:**
   - **Owner**: Choose your username or organization
   - **Space name**: Choose a unique name (e.g., `my-rag-system`)
   - **License**: Choose an appropriate license (e.g., MIT)
   - **SDK**: Select **Docker**
   - **Visibility**: Choose Public or Private
   - **Hardware**: Select appropriate hardware (CPU is sufficient for basic usage)

3. **Click "Create Space"**

### Step 2: Upload Files

#### Option A: Using Git (Recommended)

1. **Clone your Space repository:**
   ```bash
   git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
   cd YOUR_SPACE_NAME
   ```

2. **Copy files from the deployment folder:**
   ```bash
   cp -r ../huggingface_deploy/* .
   ```

3. **Commit and push:**
   ```bash
   git add .
   git commit -m "Initial RAG system deployment with Docker"
   git push
   ```

#### Option B: Using the Web Interface

1. **Upload files manually:**
   - Go to your Space's "Files" tab
   - Click "Add file" → "Upload files"
   - Upload all files from the `huggingface_deploy/` folder:
     - `app.py`
     - `rag_system.py`
     - `pdf_processor.py`
     - `requirements.txt`
     - `Dockerfile`
     - `.dockerignore`
     - `README.md`

### Step 3: Configure the Space

1. **Set up environment variables (optional):**
   - Go to your Space's "Settings" tab
   - Add environment variables if needed:
     ```
     EMBEDDING_MODEL=all-MiniLM-L6-v2
     GENERATIVE_MODEL=Qwen/Qwen2.5-1.5B-Instruct
     ```

2. **Configure hardware (if needed):**
   - Go to "Settings" → "Hardware"
   - Select appropriate hardware based on your needs

### Step 4: Deploy and Test

1. **Wait for deployment:**
   - Hugging Face will automatically build and deploy your Docker container
   - This may take 10-15 minutes for the first deployment (model downloads)

2. **Test your application:**
   - Visit your Space URL: `https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME`
   - Upload a PDF document
   - Ask questions to test the RAG system

## 🔧 Docker Configuration

### Dockerfile Features

- **Base Image**: Python 3.10 slim
- **System Dependencies**: build-essential, curl
- **Health Check**: Monitors the Streamlit health endpoint
- **Environment Variables**: Configured for Streamlit
- **Port**: Exposes port 8501

### Local Docker Testing

You can test the Docker build locally:

```bash
# Build the Docker image
docker build -t rag-system .

# Run the container
docker run -p 8501:8501 rag-system

# Or use docker-compose
docker-compose up --build
```

## 🔧 Configuration Options

### Environment Variables

You can customize your deployment by setting these environment variables in your Space settings:

```bash
# Model configuration
EMBEDDING_MODEL=all-MiniLM-L6-v2
GENERATIVE_MODEL=Qwen/Qwen2.5-1.5B-Instruct

# Chunk sizes
CHUNK_SIZES=100,400

# Vector store path
VECTOR_STORE_PATH=./vector_store

# Streamlit configuration
STREAMLIT_SERVER_PORT=8501
STREAMLIT_SERVER_ADDRESS=0.0.0.0
STREAMLIT_SERVER_HEADLESS=true
```
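
How the app consumes these variables is not shown in this excerpt; a minimal sketch, assuming standard `os.environ` lookups with the documented values as defaults (the lookup code itself is an assumption, not taken from the repo):

```python
import os

# Read the Space's environment variables, falling back to the documented defaults.
embedding_model = os.environ.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2")
generative_model = os.environ.get("GENERATIVE_MODEL", "Qwen/Qwen2.5-1.5B-Instruct")

# CHUNK_SIZES is a comma-separated list, e.g. "100,400".
chunk_sizes = [int(s) for s in os.environ.get("CHUNK_SIZES", "100,400").split(",")]

print(embedding_model, chunk_sizes)
```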

### Hardware Options

- **CPU**: Sufficient for basic usage, slower inference
- **T4**: Good for faster inference, limited memory
- **A10G**: High performance, more memory
- **A100**: Maximum performance, highest cost

## 🐛 Troubleshooting

### Common Issues

1. **Build Fails**
   - Check that all required files are uploaded
   - Verify `requirements.txt` and `Dockerfile` are correct
   - Check the build logs for specific errors

2. **Model Loading Errors**
   - Ensure internet connectivity for model downloads
   - Check that model names are correct
   - Verify sufficient disk space

3. **Memory Issues**
   - Use smaller models
   - Reduce chunk sizes
   - Upgrade to higher-tier hardware

4. **Slow Performance**
   - Upgrade the hardware tier
   - Use smaller embedding models
   - Optimize chunk sizes

5. **Docker Build Issues**
   - Check that `.dockerignore` excludes unnecessary files
   - Verify the Dockerfile syntax
   - Check for missing dependencies

### Debug Mode

To enable debug logging, add this to your `app.py`:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

## 📊 Monitoring

### Space Metrics

- **Build Status**: Check whether the Docker build succeeded
- **Runtime Logs**: Monitor application logs
- **Resource Usage**: Track CPU and memory usage
- **Error Logs**: Identify and fix issues

### Docker Logs

Check Docker logs in your Space:
- Go to "Settings" → "Logs"
- Monitor build and runtime logs
- Look for error messages

## 🔒 Security Considerations

1. **File Upload:**
   - Validate PDF files before processing
   - Implement file size limits
   - Check file types
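
A minimal sketch of such upload checks (the 20 MB limit and the magic-byte check are illustrative assumptions, not part of the deployed app):

```python
MAX_SIZE = 20 * 1024 * 1024  # illustrative 20 MB limit

def validate_pdf(filename: str, data: bytes) -> bool:
    """Basic upload checks: extension, size limit, and PDF magic bytes."""
    if not filename.lower().endswith(".pdf"):
        return False
    if len(data) > MAX_SIZE:
        return False
    # Every well-formed PDF starts with the "%PDF-" header.
    return data.startswith(b"%PDF-")

print(validate_pdf("report.pdf", b"%PDF-1.7 ..."))  # True
```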
205
+
206
+ 2. **Model Access:**
207
+ - Use appropriate model access tokens
208
+ - Consider private models for sensitive data
209
+
210
+ 3. **Data Privacy:**
211
+ - Be aware that uploaded documents are processed
212
+ - Consider data retention policies
213
+
214
+ 4. **Docker Security:**
215
+ - Use non-root user in Dockerfile
216
+ - Minimize attack surface
217
+ - Keep base images updated
218
+
219
+ ## πŸ“ˆ Scaling
220
+
221
+ ### For Production Use
222
+
223
+ 1. **Multiple Spaces:**
224
+ - Create separate Spaces for different use cases
225
+ - Use different hardware tiers as needed
226
+
227
+ 2. **Custom Domains:**
228
+ - Set up custom domains for your Spaces
229
+ - Use proper SSL certificates
230
+
231
+ 3. **Load Balancing:**
232
+ - Consider multiple Space instances
233
+ - Implement proper caching strategies
234
+
235
+ ## πŸŽ‰ Success Checklist
236
+
237
+ - [ ] Space created successfully with Docker SDK
238
+ - [ ] All files uploaded (including Dockerfile)
239
+ - [ ] Docker build completed without errors
240
+ - [ ] Application loads correctly
241
+ - [ ] PDF upload works
242
+ - [ ] Question answering works
243
+ - [ ] Search results display correctly
244
+ - [ ] Performance is acceptable
245
+
246
+ ## πŸ“ž Support
247
+
248
+ If you encounter issues:
249
+
250
+ 1. **Check the logs** in your Space's "Logs" tab
251
+ 2. **Review this guide** for common solutions
252
+ 3. **Search Hugging Face documentation**
253
+ 4. **Create an issue** in the project repository
254
+ 5. **Contact Hugging Face support** for Space-specific issues
255
+
256
+ ## πŸš€ Next Steps
257
+
258
+ After successful deployment:
259
+
260
+ 1. **Test thoroughly** with different document types
261
+ 2. **Optimize performance** based on usage patterns
262
+ 3. **Add custom features** as needed
263
+ 4. **Share your Space** with others
264
+ 5. **Monitor usage** and gather feedback
265
+
266
+ ## πŸ”„ Updates and Maintenance
267
+
268
+ ### Updating Your Space
269
+
270
+ 1. **Make changes locally**
271
+ 2. **Test with Docker locally**
272
+ 3. **Push changes to your Space repository**
273
+ 4. **Monitor the rebuild process**
274
+
275
+ ### Version Management
276
+
277
+ - Use specific versions in `requirements.txt`
278
+ - Tag your Docker images
279
+ - Keep track of model versions
280
+
281
+ ---
282
+
283
+ **Happy deploying with Docker! πŸ³πŸš€**
Dockerfile CHANGED
@@ -1,20 +1,45 @@
- FROM python:3.13.5-slim
+ # Use Python 3.10 slim image
+ FROM python:3.10-slim

+ # Set working directory
  WORKDIR /app

+ # Install system dependencies
  RUN apt-get update && apt-get install -y \
      build-essential \
      curl \
-     git \
      && rm -rf /var/lib/apt/lists/*

- COPY requirements.txt ./
- COPY src/ ./src/
-
- RUN pip3 install -r requirements.txt
+ # Copy requirements first for better caching
+ COPY requirements.txt .
+
+ # Install Python dependencies
+ RUN pip install --no-cache-dir --upgrade pip && \
+     pip install --no-cache-dir -r requirements.txt
+
+ # Copy application files
+ COPY . .
+
+ # Create vector store directory
+ RUN mkdir -p vector_store
+
+ # Copy all PDF documents for testing
+ COPY *.pdf /app/
+
+ # Set environment variables
+ ENV PYTHONPATH=/app
+ ENV STREAMLIT_SERVER_PORT=8501
+ ENV STREAMLIT_SERVER_ADDRESS=0.0.0.0
+ ENV STREAMLIT_SERVER_HEADLESS=true
+ ENV STREAMLIT_SERVER_ENABLE_CORS=false
+ ENV STREAMLIT_SERVER_ENABLE_XSRF_PROTECTION=false
+ ENV STREAMLIT_LOGGER_LEVEL=debug
+
+ # Expose port
  EXPOSE 8501

+ # Health check
  HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health

- ENTRYPOINT ["streamlit", "run", "src/streamlit_app.py", "--server.port=8501", "--server.address=0.0.0.0"]
+ # Run the application
+ CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
README.md CHANGED
@@ -1,19 +1,245 @@
- ---
- title: ConvAI
- emoji: 🚀
- colorFrom: red
- colorTo: red
- sdk: docker
- app_port: 8501
- tags:
- - streamlit
- pinned: false
- short_description: Streamlit template space
- ---
-
- # Welcome to Streamlit!
-
- Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:
-
- If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
- forums](https://discuss.streamlit.io).
# RAG System for Hugging Face Spaces

A simplified Retrieval-Augmented Generation (RAG) system optimized for deployment on Hugging Face Spaces.

## 🚀 Features

- **FAISS Vector Search**: Fast similarity search using FAISS
- **BM25 Keyword Search**: Traditional keyword-based retrieval
- **Hybrid Search**: Combines both dense and sparse retrieval
- **Qwen 2.5 1.5B**: Advanced language model for answer generation
- **Streamlit UI**: Clean, interactive web interface
- **PDF Processing**: Extract and process PDF documents
- **Persistent Storage**: Saves embeddings and metadata locally

## 📁 Project Structure

```
huggingface_deploy/
├── app.py              # Main Streamlit application
├── rag_system.py       # Simplified RAG system
├── pdf_processor.py    # PDF processing utilities
├── requirements.txt    # Python dependencies
├── README.md           # This file
└── vector_store/       # FAISS index and metadata (created automatically)
```

## 🛠️ Technologies Used

- **Streamlit**: Web interface
- **FAISS**: Vector similarity search
- **BM25**: Keyword-based retrieval
- **Sentence Transformers**: Text embeddings
- **Transformers**: Qwen 2.5 1.5B model
- **PyPDF**: PDF text extraction
- **PyTorch**: Deep learning framework

## 🚀 Quick Start

### Local Development

1. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

2. **Run the application:**
   ```bash
   streamlit run app.py
   ```

3. **Open in browser:**
   Navigate to `http://localhost:8501`

### Hugging Face Spaces Deployment

1. **Create a new Space:**
   - Go to [Hugging Face Spaces](https://huggingface.co/spaces)
   - Click "Create new Space"
   - Choose "Docker" as the SDK
   - Set visibility (public or private)

2. **Upload files:**
   - Upload all files from this directory to your Space
   - The Space will automatically install dependencies and run the app

3. **Access your app:**
   - Your RAG system will be available at your Space URL

## 📖 How to Use

### 1. Upload Documents
- Use the sidebar to upload PDF documents
- The system will automatically process and index the content
- Multiple documents can be uploaded

### 2. Ask Questions
- Type your question in the chat interface
- Choose your preferred retrieval method:
  - **Hybrid**: Combines FAISS and BM25 (recommended)
  - **Dense**: Uses only FAISS vector similarity
  - **Sparse**: Uses only BM25 keyword matching
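
How the hybrid option fuses the two signals is internal to `rag_system.py`; a common approach, shown here as a sketch (the min-max normalization and the 50/50 weighting are assumptions, not the app's confirmed implementation), is to normalize each score list and take a weighted sum:

```python
def hybrid_scores(dense, sparse, alpha=0.5):
    """Fuse dense (FAISS) and sparse (BM25) scores for the same candidate chunks.

    Each list holds one score per candidate; alpha weights the dense side.
    """
    def minmax(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

    d, s = minmax(dense), minmax(sparse)
    return [alpha * di + (1 - alpha) * si for di, si in zip(d, s)]

print(hybrid_scores([0.9, 0.2, 0.5], [1.0, 3.0, 2.0]))
```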

### 3. View Results
- See the generated answer
- View search results with confidence scores
- Check the response time and method used

## ⚙️ Configuration

### Environment Variables

You can customize the system by setting these environment variables:

```bash
# Model configuration
EMBEDDING_MODEL=all-MiniLM-L6-v2
GENERATIVE_MODEL=Qwen/Qwen2.5-1.5B-Instruct

# Chunk sizes for document processing
CHUNK_SIZES=100,400

# Vector store path
VECTOR_STORE_PATH=./vector_store
```

### Model Options

**Embedding Models:**
- `all-MiniLM-L6-v2` (default, 384 dimensions)
- `all-mpnet-base-v2` (768 dimensions)
- `multi-qa-MiniLM-L6-cos-v1` (384 dimensions)

**Generative Models:**
- `Qwen/Qwen2.5-1.5B-Instruct` (default)
- `distilgpt2` (fallback)
- `microsoft/DialoGPT-medium`

## 🔧 Customization

### Adding New Models

To use different models, modify the `SimpleRAGSystem` initialization in `app.py`:

```python
st.session_state.rag_system = SimpleRAGSystem(
    embedding_model="your-embedding-model",
    generative_model="your-generative-model",
)
```

### Custom Chunk Sizes

Modify the chunk sizes for different document types:

```python
chunk_sizes = [50, 200, 800]  # Smaller chunks for technical docs
```
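
Each configured size produces its own pass over the document. A minimal word-based chunker along these lines (the whitespace tokenization and absence of overlap are assumptions about `pdf_processor.py`, not its confirmed behavior):

```python
def chunk_text(text, chunk_sizes=(100, 400)):
    """Split text into word-count chunks, once per configured size."""
    words = text.split()
    chunks = []
    for size in chunk_sizes:
        for i in range(0, len(words), size):
            chunks.append(" ".join(words[i:i + size]))
    return chunks

# 250 words: three chunks of up to 100 words, plus one chunk of up to 400.
doc = "word " * 250
print(len(chunk_text(doc)))  # prints 4
```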

### Custom Search Methods

Add new search methods in `rag_system.py`:

```python
def custom_search(self, query: str, top_k: int = 5):
    # Your custom search implementation
    pass
```

## 📊 Performance Optimization

### Memory Usage
- Use smaller embedding models for limited memory
- Reduce chunk sizes for large documents
- Enable model quantization

### Speed Optimization
- Use GPU acceleration when available
- Optimize FAISS index parameters
- Cache embeddings for repeated queries

### Storage
- The FAISS index and metadata are saved locally
- Consider cloud storage for production deployments

## 🐛 Troubleshooting

### Common Issues

1. **Model Loading Errors**
   - Check the internet connection for model downloads
   - Verify model names are correct
   - Ensure sufficient disk space

2. **Memory Issues**
   - Reduce batch sizes
   - Use smaller models
   - Enable gradient checkpointing

3. **PDF Processing Errors**
   - Verify PDF files are not corrupted
   - Check file permissions
   - Ensure PyPDF is properly installed

### Debug Mode

Enable debug logging by adding this to `app.py`:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

## 🔒 Security Considerations

- **File Upload**: Validate PDF files before processing
- **Model Access**: Use appropriate model access tokens
- **Data Privacy**: Consider data retention policies
- **Rate Limiting**: Implement query rate limiting for production

## 📈 Monitoring

### System Metrics
- Document count and chunk count
- Response times
- Search result quality
- Model performance

### Logs
- Application logs in Streamlit
- Model loading and inference logs
- Error tracking and debugging

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🆘 Support

For issues and questions:
1. Check the troubleshooting section
2. Review the logs for error messages
3. Create an issue on GitHub
4. Contact the maintainers

## 🎯 Roadmap

- [ ] Add support for more document formats
- [ ] Implement advanced search algorithms
- [ ] Add model fine-tuning capabilities
- [ ] Improve UI/UX design
- [ ] Add export/import functionality
- [ ] Implement user authentication
- [ ] Add analytics dashboard

---

**Happy RAG-ing! 🚀**
RIL-Q1-FY2024-25.pdf ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e29390caae95cc8f28606d9f08317cda424bf544fd86383c7f9ac7d25ca8e808
size 1253337
RIL-Q1-FY2025-26.pdf ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ce3f74a4a4012cdb85afaf7795aa2cc118f94af0f2b4d290f92248d042eb0976
size 719459
RIL-Q2-FY2023-24.pdf ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0e07142e623cd116f6c18a6e17e803b06bff53eeaa149c4151022579ef305cbd
size 1570743
RIL-Q2-FY2024-25.pdf ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f78f4ade6ab7640fb74560b76505754fe5751c3602d61925c764c177875d1097
size 1664783
RIL-Q3-FY2023-24.pdf ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f2e4afa7e303df86a156c02fbdb07866238891a408cd79398c98b100693cafcc
size 1446439
RIL-Q3-FY2024-25.pdf ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e9d0afa42b8fb75efcf2d1c1aea5b104c77dd63fd69fa0fcc059af8b350e8567
size 1855556
RIL-Q4-FY2023-24.pdf ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:645d6658976d1f958703b951fd7c89b22738ed2c865f31077fa725ec27781115
size 1662456
RIL-Q4-FY2024-25.pdf ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0ec375dcbc69b69a95cd13f37fe090d61071d6e6a66707f2c73b26b77c6bd0d0
size 1719021
app.py ADDED
@@ -0,0 +1,351 @@
#!/usr/bin/env python3
"""
RAG System for Hugging Face Spaces

A simplified RAG system using:
- FAISS for vector search
- BM25 for hybrid retrieval
- Streamlit for UI
- Qwen 2.5 1.5B for generation
"""

import streamlit as st
import os
import tempfile
from pathlib import Path
import time
from typing import List, Dict, Optional
import json
import glob
from concurrent.futures import ThreadPoolExecutor, as_completed
from loguru import logger

# Import our simplified components
from rag_system import SimpleRAGSystem
from pdf_processor import SimplePDFProcessor

# Page configuration
st.set_page_config(
    page_title="RAG System - Hugging Face",
    page_icon="🤖",
    layout="wide",
    initial_sidebar_state="expanded",
)

# Initialize session state
if "rag_system" not in st.session_state:
    st.session_state.rag_system = None
if "documents_loaded" not in st.session_state:
    st.session_state.documents_loaded = False
if "chat_history" not in st.session_state:
    st.session_state.chat_history = []
if "initializing" not in st.session_state:
    st.session_state.initializing = False


def load_single_document(rag_system, pdf_path):
    """Load a single document into the RAG system"""
    try:
        filename = os.path.basename(pdf_path)
        success = rag_system.add_document(pdf_path, filename)
        return filename, success, None
    except Exception as e:
        return os.path.basename(pdf_path), False, str(e)


def initialize_rag_system():
    """Initialize the RAG system"""
    if st.session_state.rag_system is None and not st.session_state.initializing:
        st.session_state.initializing = True
        st.write("🚀 Starting RAG system initialization...")
        with st.spinner("Initializing RAG system..."):
            try:
                st.session_state.rag_system = SimpleRAGSystem()
                st.write("✅ RAG system created successfully")

                # Auto-load all available PDF documents in parallel
                pdf_files = glob.glob("/app/*.pdf")
                st.write(f"📁 Found {len(pdf_files)} PDF files")

                if pdf_files:
                    loaded_count = 0
                    failed_count = 0

                    with st.spinner(
                        f"Loading {len(pdf_files)} PDF documents in parallel..."
                    ):
                        # Use ThreadPoolExecutor for parallel loading
                        with ThreadPoolExecutor(max_workers=4) as executor:
                            # Submit all tasks
                            future_to_pdf = {
                                executor.submit(
                                    load_single_document,
                                    st.session_state.rag_system,
                                    pdf_path,
                                ): pdf_path
                                for pdf_path in pdf_files
                            }

                            # Process completed tasks
                            for future in as_completed(future_to_pdf):
                                filename, success, error = future.result()
                                if success:
                                    loaded_count += 1
                                    st.write(f"✅ Loaded: {filename}")
                                    logger.info(f"✅ Loaded: {filename}")
                                else:
                                    failed_count += 1
                                    st.write(f"⚠️ Failed: {filename} - {error}")
                                    logger.warning(
                                        f"⚠️ Failed to load {filename}: {error}"
                                    )

                    if loaded_count > 0:
                        st.session_state.documents_loaded = True
                        st.success(
                            f"✅ Successfully loaded {loaded_count} PDF documents!"
                        )
                        if failed_count > 0:
                            st.warning(f"⚠️ Failed to load {failed_count} documents")
                    else:
                        st.warning("⚠️ No documents could be loaded")
                        # Still allow querying even if no documents loaded
                        st.session_state.documents_loaded = True
                else:
                    st.info("📚 No PDF documents found in the container")
                    # Still allow querying even if no documents found
                    st.session_state.documents_loaded = True

                st.success("✅ RAG system initialized!")

            except Exception as e:
                st.error(f"❌ Failed to initialize RAG system: {e}")
                logger.error(f"RAG system initialization failed: {e}")
                # Reset initialization flag on error
                st.session_state.initializing = False
                raise
            finally:
                # Always reset initialization flag
                st.session_state.initializing = False


def upload_document(uploaded_file):
    """Upload and process a document"""
    if uploaded_file is not None:
        try:
            # Create temporary file
            with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp_file:
                tmp_file.write(uploaded_file.getvalue())
                tmp_path = tmp_file.name

            # Process the document
            with st.spinner(f"Processing {uploaded_file.name}..."):
                success = st.session_state.rag_system.add_document(
                    tmp_path, uploaded_file.name
                )

            if success:
                st.success(f"✅ {uploaded_file.name} processed successfully!")
                st.session_state.documents_loaded = True
                # Clean up temporary file
                os.unlink(tmp_path)
            else:
                st.error(f"❌ Failed to process {uploaded_file.name}")
                os.unlink(tmp_path)

        except Exception as e:
            st.error(f"❌ Error processing document: {str(e)}")


def query_rag(query: str, method: str = "hybrid", top_k: int = 5):
    """Query the RAG system"""
    try:
        st.write(f"🔍 Starting query: {query}")
        st.write(f"🔍 Method: {method}, top_k: {top_k}")

        if st.session_state.rag_system is None:
            st.error("❌ RAG system is not initialized")
            return None, "RAG system not initialized"

        st.write("✅ RAG system is available")
        start_time = time.time()

        st.write("🔍 Calling rag_system.query...")
        response = st.session_state.rag_system.query(query, method, top_k)
        response_time = time.time() - start_time

        st.write(f"✅ Response received in {response_time:.2f}s")
        st.write(f"✅ Response type: {type(response)}")

        if response:
            st.write(f"✅ Response answer: {response.answer[:100]}...")

        return response, response_time

    except Exception as e:
        st.error(f"❌ Error during query: {str(e)}")
        logger.error(f"Query error: {e}")
        import traceback

        st.error(f"❌ Full error: {traceback.format_exc()}")
        return None, f"Error: {str(e)}"


def display_search_results(results: List[Dict]):
    """Display search results"""
    if not results:
        st.info("No search results found.")
        return

    for i, result in enumerate(results, 1):
        st.markdown("---")
        st.markdown(f"**Result {i}** - Score: {result.score:.3f}")
        st.write(f"**Source:** {result.filename}")
        st.write(f"**Method:** {result.search_method}")
        st.write(f"**Text:** {result.text[:500]}...")

        if result.dense_score and result.sparse_score:
            col1, col2 = st.columns(2)
            with col1:
                st.metric("Dense Score", f"{result.dense_score:.3f}")
            with col2:
                st.metric("Sparse Score", f"{result.sparse_score:.3f}")


def main():
    """Main application"""
    st.write("🚀 App starting...")
    st.title("🤖 RAG System - Hugging Face Spaces")
    st.markdown("A simplified RAG system using FAISS + BM25 + Qwen 2.5 1.5B")

    # Initialize RAG system
    initialize_rag_system()

    # Sidebar
    with st.sidebar:
        st.header("📁 Document Upload")

        uploaded_file = st.file_uploader(
            "Upload PDF Document",
            type=["pdf"],
            help="Upload a PDF document to add to the knowledge base",
        )

        if uploaded_file:
            upload_document(uploaded_file)

        st.divider()

        st.header("⚙️ Settings")

        method = st.selectbox(
            "Retrieval Method",
            ["hybrid", "dense", "sparse"],
            help="Choose the retrieval method",
        )

        top_k = st.slider(
            "Number of Results",
            min_value=1,
            max_value=10,
            value=5,
            help="Number of top results to retrieve",
        )

        st.divider()

        # System info
        if st.session_state.rag_system:
            stats = st.session_state.rag_system.get_stats()
            st.header("📊 System Info")
            st.write(f"**Documents:** {stats['total_documents']}")
            st.write(f"**Chunks:** {stats['total_chunks']}")
            st.write(f"**Vector Size:** {stats['vector_size']}")
            st.write(f"**Model:** {stats['model_name']}")

    # Initialize RAG system if not already done
    if not st.session_state.rag_system:
        if st.session_state.initializing:
            st.info("🔄 RAG system is initializing... Please wait.")
            return
        else:
            initialize_rag_system()
            return

    # Show system info and allow querying immediately after initialization
    stats = st.session_state.rag_system.get_stats()
    documents_available = stats["total_documents"] > 0

    if not documents_available:
        st.info(
            "📚 No documents loaded yet, but you can still ask questions. The system will respond based on its general knowledge."
        )

    # Chat interface
    st.header("💬 Ask Questions About Your Documents")

    # Chat input
    query = st.chat_input("Ask a question about the loaded documents...")

    if query:
        st.write(f"📝 Processing query: {query}")
        # Add user message to chat history
        st.session_state.chat_history.append({"role": "user", "content": query})

        # Get response
        response, response_time = query_rag(query, method, top_k)

        st.write(f"📊 Response type: {type(response)}")
        st.write(f"📊 Response time: {response_time}")

        if response:
            st.write("✅ Got valid response, adding to chat history")
            # Add assistant response to chat history
            st.session_state.chat_history.append(
                {
                    "role": "assistant",
                    "content": response.answer,
                    "search_results": response.search_results,
                    "method_used": response.method_used,
                    "confidence": response.confidence,
                    "response_time": response_time,
                }
            )
        else:
            st.write("❌ No valid response received")
            st.session_state.chat_history.append(
                {"role": "assistant", "content": f"Error: {response_time}"}
            )

    # Display chat history
    for message in st.session_state.chat_history:
        if message["role"] == "user":
            with st.chat_message("user"):
                st.write(message["content"])
        else:
            with st.chat_message("assistant"):
                st.write(message["content"])

                # Show additional info for assistant messages
                if "search_results" in message:
                    st.markdown("**🔍 Search Results:**")
                    display_search_results(message["search_results"])

                    # Show metrics
                    col1, col2, col3 = st.columns(3)
                    with col1:
                        st.metric("Method", message["method_used"])
                    with col2:
                        st.metric("Confidence", f"{message['confidence']:.3f}")
                    with col3:
                        st.metric("Response Time", f"{message['response_time']:.2f}s")

    # Clear chat button
    if st.session_state.chat_history:
        if st.button("🗑️ Clear Chat History"):
            st.session_state.chat_history = []
            st.rerun()


if __name__ == "__main__":
    main()
docker-compose.yml ADDED
@@ -0,0 +1,20 @@
+version: '3.8'
+
+services:
+  rag-system:
+    build: .
+    ports:
+      - "8501:8501"
+    environment:
+      - PYTHONPATH=/app
+      - STREAMLIT_SERVER_PORT=8501
+      - STREAMLIT_SERVER_ADDRESS=0.0.0.0
+      - STREAMLIT_SERVER_HEADLESS=true
+    volumes:
+      - ./vector_store:/app/vector_store
+    restart: unless-stopped
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:8501/_stcore/health"]
+      interval: 30s
+      timeout: 10s
+      retries: 3
pdf_processor.py ADDED
@@ -0,0 +1,268 @@
+#!/usr/bin/env python3
+"""
+Simplified PDF Processor for Hugging Face Spaces
+
+This module provides PDF processing functionality for the simplified RAG system.
+"""
+
+import os
+import re
+import uuid
+from typing import List, Dict, Optional
+from dataclasses import dataclass
+from pathlib import Path
+import pypdf
+from loguru import logger
+
+
+@dataclass
+class DocumentChunk:
+    """Represents a document chunk"""
+
+    text: str
+    doc_id: str
+    filename: str
+    chunk_id: str
+    chunk_size: int
+
+
+@dataclass
+class ProcessedDocument:
+    """Represents a processed document"""
+
+    filename: str
+    title: str
+    author: str
+    chunks: List[DocumentChunk]
+
+
+class SimplePDFProcessor:
+    """Simplified PDF processor for Hugging Face Spaces"""
+
+    def __init__(self):
+        """Initialize the PDF processor"""
+        self.stop_words = {
+            "the", "a", "an", "and", "or", "but", "in", "on", "at", "to",
+            "for", "of", "with", "by", "is", "are", "was", "were", "be",
+            "been", "being", "have", "has", "had", "do", "does", "did",
+            "will", "would", "could", "should", "may", "might", "can",
+            "this", "that", "these", "those",
+        }
+
+    def process_document(
+        self, file_path: str, chunk_sizes: Optional[List[int]] = None
+    ) -> ProcessedDocument:
+        """
+        Process a PDF document
+
+        Args:
+            file_path: Path to the PDF file
+            chunk_sizes: List of chunk sizes to use
+
+        Returns:
+            Processed document
+        """
+        if chunk_sizes is None:
+            chunk_sizes = [100, 400]
+
+        try:
+            # Extract text from PDF
+            text = self._extract_text(file_path)
+
+            # Clean text
+            cleaned_text = self._clean_text(text)
+
+            # Extract metadata
+            metadata = self._extract_metadata(file_path)
+
+            # Create chunks at every configured granularity
+            chunks = []
+            doc_id = str(uuid.uuid4())
+
+            for chunk_size in chunk_sizes:
+                chunk_list = self._create_chunks(
+                    cleaned_text, chunk_size, doc_id, metadata["filename"]
+                )
+                chunks.extend(chunk_list)
+
+            return ProcessedDocument(
+                filename=metadata["filename"],
+                title=metadata["title"],
+                author=metadata["author"],
+                chunks=chunks,
+            )
+
+        except Exception as e:
+            logger.error(f"Error processing document {file_path}: {e}")
+            raise
+
+    def _extract_text(self, file_path: str) -> str:
+        """Extract text from PDF file"""
+        try:
+            with open(file_path, "rb") as file:
+                pdf_reader = pypdf.PdfReader(file)
+                text = ""
+
+                for page in pdf_reader.pages:
+                    page_text = page.extract_text()
+                    if page_text:
+                        text += page_text + "\n"
+
+                return text
+
+        except Exception as e:
+            logger.error(f"Error extracting text from {file_path}: {e}")
+            raise
+
+    def _clean_text(self, text: str) -> str:
+        """Clean and preprocess text"""
+        # Collapse runs of whitespace
+        text = re.sub(r"\s+", " ", text)
+
+        # Remove special characters but keep punctuation
+        text = re.sub(r"[^\w\s\.\,\!\?\;\:\-\(\)\[\]\{\}]", "", text)
+
+        # Remove standalone page numbers at line ends (headers/footers)
+        text = re.sub(r"\b\d+\b(?=\s*\n)", "", text)
+
+        # Remove excessive newlines
+        text = re.sub(r"\n\s*\n\s*\n+", "\n\n", text)
+
+        return text.strip()
+
+    def _extract_metadata(self, file_path: str) -> Dict[str, str]:
+        """Extract metadata from PDF file"""
+        try:
+            with open(file_path, "rb") as file:
+                pdf_reader = pypdf.PdfReader(file)
+                info = pdf_reader.metadata
+
+                return {
+                    "filename": Path(file_path).name,
+                    "title": (
+                        info.get("/Title", Path(file_path).stem)
+                        if info
+                        else Path(file_path).stem
+                    ),
+                    "author": info.get("/Author", "Unknown") if info else "Unknown",
+                }
+
+        except Exception as e:
+            logger.warning(f"Error extracting metadata from {file_path}: {e}")
+            return {
+                "filename": Path(file_path).name,
+                "title": Path(file_path).stem,
+                "author": "Unknown",
+            }
+
+    def _create_chunks(
+        self, text: str, chunk_size: int, doc_id: str, filename: str
+    ) -> List[DocumentChunk]:
+        """Create text chunks of the specified size"""
+        chunks = []
+
+        # Split text into sentences
+        sentences = self._split_into_sentences(text)
+
+        current_chunk = ""
+        chunk_id = 0
+
+        for sentence in sentences:
+            # Estimate token count (rough word-based approximation)
+            estimated_tokens = len(sentence.split())
+
+            if len(current_chunk.split()) + estimated_tokens <= chunk_size:
+                current_chunk += sentence + " "
+            else:
+                # Save current chunk if not empty
+                if current_chunk.strip():
+                    chunks.append(
+                        DocumentChunk(
+                            text=current_chunk.strip(),
+                            doc_id=doc_id,
+                            filename=filename,
+                            chunk_id=f"{doc_id}_{chunk_id}",
+                            chunk_size=chunk_size,
+                        )
+                    )
+                    chunk_id += 1
+
+                # Start new chunk
+                current_chunk = sentence + " "
+
+        # Add the last chunk if not empty
+        if current_chunk.strip():
+            chunks.append(
+                DocumentChunk(
+                    text=current_chunk.strip(),
+                    doc_id=doc_id,
+                    filename=filename,
+                    chunk_id=f"{doc_id}_{chunk_id}",
+                    chunk_size=chunk_size,
+                )
+            )
+
+        return chunks
+
+    def _split_into_sentences(self, text: str) -> List[str]:
+        """Split text into sentences"""
+        # Simple sentence splitting on terminal punctuation
+        sentences = re.split(r"[.!?]+", text)
+
+        # Clean and filter sentences
+        cleaned_sentences = []
+        for sentence in sentences:
+            sentence = sentence.strip()
+            if sentence and len(sentence.split()) > 3:  # Keep sentences longer than 3 words
+                cleaned_sentences.append(sentence)
+
+        return cleaned_sentences
+
+    def preprocess_query(self, query: str) -> str:
+        """Preprocess query text"""
+        # Convert to lowercase
+        query = query.lower()
+
+        # Remove punctuation
+        query = re.sub(r"[^\w\s]", "", query)
+
+        # Remove stop words
+        words = query.split()
+        filtered_words = [word for word in words if word not in self.stop_words]
+
+        return " ".join(filtered_words)
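The processor above packs whole sentences into word-budgeted chunks. As a quick sanity check, that logic can be exercised standalone; the sketch below re-implements the split-and-pack heuristic without the `pypdf` dependency, so `chunk_sentences` and the sample text are illustrative, not part of the module's API:

```python
import re

def chunk_sentences(text: str, chunk_size: int) -> list[str]:
    # Split on sentence-ending punctuation and drop very short fragments,
    # mirroring SimplePDFProcessor._split_into_sentences
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if len(s.split()) > 3]
    chunks, current = [], ""
    for sentence in sentences:
        # Word count stands in for a token count, as in _create_chunks
        if len(current.split()) + len(sentence.split()) <= chunk_size:
            current += sentence + " "
        else:
            if current.strip():
                chunks.append(current.strip())
            current = sentence + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks

text = (
    "Revenue grew strongly across all segments this quarter. "
    "Operating margin improved on lower input costs overall. "
    "The board declared an interim dividend for shareholders today."
)
print(chunk_sentences(text, 16))
```

With a 16-word budget, the first two 8-word sentences share a chunk and the third starts a new one, which is the behavior the processor relies on when building both the 100- and 400-token granularities.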
rag_system.py ADDED
@@ -0,0 +1,547 @@
+#!/usr/bin/env python3
+"""
+Simplified RAG System for Hugging Face Spaces
+
+This module provides a simplified RAG system using:
+- FAISS for vector storage
+- BM25 for sparse retrieval
+- Hybrid search combining both
+- Qwen 2.5 1.5B for generation
+"""
+
+import os
+import pickle
+import time
+import threading
+from typing import List, Dict, Optional
+from dataclasses import dataclass
+
+import numpy as np
+import torch
+from loguru import logger
+from sentence_transformers import SentenceTransformer
+from rank_bm25 import BM25Okapi
+import faiss
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+
+@dataclass
+class DocumentChunk:
+    """Represents a document chunk"""
+
+    text: str
+    doc_id: str
+    filename: str
+    chunk_id: str
+    chunk_size: int
+
+
+@dataclass
+class SearchResult:
+    """Represents a search result"""
+
+    text: str
+    score: float
+    doc_id: str
+    filename: str
+    search_method: str
+    dense_score: Optional[float] = None
+    sparse_score: Optional[float] = None
+
+
+@dataclass
+class RAGResponse:
+    """Represents a RAG response"""
+
+    answer: str
+    confidence: float
+    search_results: List[SearchResult]
+    method_used: str
+    response_time: float
+    query: str
+
+
+class SimpleRAGSystem:
+    """Simplified RAG system for Hugging Face Spaces"""
+
+    def __init__(
+        self,
+        embedding_model: str = "all-MiniLM-L6-v2",
+        generative_model: str = "Qwen/Qwen2.5-1.5B-Instruct",
+        chunk_sizes: Optional[List[int]] = None,
+        vector_store_path: str = "./vector_store",
+    ):
+        """
+        Initialize the RAG system
+
+        Args:
+            embedding_model: Sentence transformer model for embeddings
+            generative_model: Language model for generation
+            chunk_sizes: List of chunk sizes to use
+            vector_store_path: Path to store FAISS index and metadata
+        """
+        self.embedding_model = embedding_model
+        self.generative_model = generative_model
+        self.chunk_sizes = chunk_sizes or [100, 400]
+        self.vector_store_path = vector_store_path
+
+        # Initialize components
+        self.embedder = None
+        self.tokenizer = None
+        self.model = None
+        self.faiss_index = None
+        self.bm25 = None
+        self.documents = []
+        self.chunks = []
+        self._lock = threading.Lock()  # Thread safety for concurrent loading
+
+        # Create vector store directory
+        os.makedirs(vector_store_path, exist_ok=True)
+
+        # Load or initialize components
+        self._load_models()
+        self._load_or_create_index()
+
+        logger.info("Simple RAG system initialized successfully!")
+
+    def _load_models(self):
+        """Load embedding and generative models"""
+        try:
+            # Load embedding model
+            self.embedder = SentenceTransformer(self.embedding_model)
+            self.vector_size = self.embedder.get_sentence_embedding_dimension()
+
+            # Load generative model with fallback
+            model_loaded = False
+
+            # Try Qwen model first
+            try:
+                self.tokenizer = AutoTokenizer.from_pretrained(
+                    self.generative_model,
+                    trust_remote_code=True,
+                    padding_side="left",
+                )
+
+                # Load model with explicit CPU configuration
+                self.model = AutoModelForCausalLM.from_pretrained(
+                    self.generative_model,
+                    trust_remote_code=True,
+                    torch_dtype=torch.float32,
+                    device_map=None,
+                    low_cpu_mem_usage=False,
+                )
+
+                # Move to CPU explicitly
+                self.model = self.model.to("cpu")
+                model_loaded = True
+
+            except Exception as e:
+                logger.warning(f"Failed to load Qwen model: {e}")
+
+            # Fall back to distilgpt2 if Qwen fails
+            if not model_loaded:
+                logger.info("Falling back to distilgpt2...")
+                self.generative_model = "distilgpt2"
+                try:
+                    self.tokenizer = AutoTokenizer.from_pretrained(
+                        self.generative_model,
+                        trust_remote_code=True,
+                        padding_side="left",
+                    )
+                    self.model = AutoModelForCausalLM.from_pretrained(
+                        self.generative_model,
+                        trust_remote_code=True,
+                    )
+                    # Ensure fallback model is also on CPU
+                    self.model = self.model.to("cpu")
+                    model_loaded = True
+                except Exception as e:
+                    logger.error(f"Failed to load distilgpt2: {e}")
+                    raise Exception("Could not load any generative model")
+
+            # Set pad token for tokenizer
+            if self.tokenizer.pad_token is None:
+                self.tokenizer.pad_token = self.tokenizer.eos_token
+                self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
+
+            logger.info("βœ… Models loaded successfully")
+            logger.info(f" - Embedding: {self.embedding_model}")
+            logger.info(f" - Generative: {self.generative_model}")
+
+        except Exception as e:
+            logger.error(f"❌ Failed to load models: {e}")
+            raise
+
+    def _load_or_create_index(self):
+        """Load existing FAISS index or create a new one"""
+        faiss_path = os.path.join(self.vector_store_path, "faiss_index.bin")
+        metadata_path = os.path.join(self.vector_store_path, "metadata.pkl")
+
+        if os.path.exists(faiss_path) and os.path.exists(metadata_path):
+            # Load existing index
+            try:
+                self.faiss_index = faiss.read_index(faiss_path)
+                with open(metadata_path, "rb") as f:
+                    metadata = pickle.load(f)
+                    self.documents = metadata.get("documents", [])
+                    self.chunks = metadata.get("chunks", [])
+
+                # Rebuild BM25
+                if self.chunks:
+                    texts = [chunk.text for chunk in self.chunks]
+                    tokenized_texts = [text.lower().split() for text in texts]
+                    self.bm25 = BM25Okapi(tokenized_texts)
+
+                logger.info(f"βœ… Loaded existing index with {len(self.chunks)} chunks")
+            except Exception as e:
+                logger.warning(f"Failed to load existing index: {e}")
+                self._create_new_index()
+        else:
+            self._create_new_index()
+
+    def _create_new_index(self):
+        """Create a new FAISS index"""
+        vector_size = self.embedder.get_sentence_embedding_dimension()
+        # Inner-product index (equivalent to cosine similarity for normalized vectors)
+        self.faiss_index = faiss.IndexFlatIP(vector_size)
+        self.bm25 = None
+        logger.info(f"βœ… Created new FAISS index with dimension {vector_size}")
+
+    def _save_index(self):
+        """Save FAISS index and metadata"""
+        try:
+            # Save FAISS index
+            faiss_path = os.path.join(self.vector_store_path, "faiss_index.bin")
+            faiss.write_index(self.faiss_index, faiss_path)
+
+            # Save metadata
+            metadata_path = os.path.join(self.vector_store_path, "metadata.pkl")
+            metadata = {"documents": self.documents, "chunks": self.chunks}
+            with open(metadata_path, "wb") as f:
+                pickle.dump(metadata, f)
+
+            logger.info("βœ… Index saved successfully")
+        except Exception as e:
+            logger.error(f"❌ Failed to save index: {e}")
+
+    def add_document(self, file_path: str, filename: str) -> bool:
+        """
+        Add a document to the RAG system
+
+        Args:
+            file_path: Path to the PDF file
+            filename: Name of the file
+
+        Returns:
+            True if successful, False otherwise
+        """
+        try:
+            from pdf_processor import SimplePDFProcessor
+
+            # Process the document
+            processor = SimplePDFProcessor()
+            processed_doc = processor.process_document(file_path, self.chunk_sizes)
+
+            # Thread-safe document addition
+            with self._lock:
+                # Add document to list
+                self.documents.append(
+                    {
+                        "filename": filename,
+                        "title": processed_doc.title,
+                        "author": processed_doc.author,
+                        "file_path": file_path,
+                    }
+                )
+
+                # Add chunks
+                for chunk in processed_doc.chunks:
+                    self.chunks.append(chunk)
+
+                # Update embeddings and BM25
+                self._update_embeddings()
+                self._update_bm25()
+
+                # Save index
+                self._save_index()
+
+            logger.info(
+                f"βœ… Added document: {filename} ({len(processed_doc.chunks)} chunks)"
+            )
+            return True
+
+        except Exception as e:
+            logger.error(f"❌ Failed to add document {filename}: {e}")
+            return False
+
+    def _update_embeddings(self):
+        """Rebuild the FAISS index from all current chunks"""
+        if not self.chunks:
+            return
+
+        # Re-encode every chunk and rebuild the index from scratch so that
+        # repeated add_document() calls do not insert duplicate vectors
+        texts = [chunk.text for chunk in self.chunks]
+        embeddings = self.embedder.encode(texts, show_progress_bar=False)
+        self.faiss_index.reset()
+        self.faiss_index.add(embeddings.astype("float32"))
+
+    def _update_bm25(self):
+        """Update BM25 index with new chunks"""
+        if not self.chunks:
+            return
+
+        # Rebuild BM25 with all chunks
+        texts = [chunk.text for chunk in self.chunks]
+        tokenized_texts = [text.lower().split() for text in texts]
+        self.bm25 = BM25Okapi(tokenized_texts)
+
+    def search(
+        self, query: str, method: str = "hybrid", top_k: int = 5
+    ) -> List[SearchResult]:
+        """
+        Search for relevant documents
+
+        Args:
+            query: Search query
+            method: Search method (hybrid, dense, sparse)
+            top_k: Number of results to return
+
+        Returns:
+            List of search results
+        """
+        if not self.chunks:
+            return []
+
+        results = []
+
+        if method in ("dense", "hybrid"):
+            # Dense search using FAISS
+            query_embedding = self.embedder.encode([query])
+            scores, indices = self.faiss_index.search(
+                query_embedding.astype("float32"), min(top_k, len(self.chunks))
+            )
+
+            for score, idx in zip(scores[0], indices[0]):
+                if idx < len(self.chunks):
+                    chunk = self.chunks[idx]
+                    results.append(
+                        SearchResult(
+                            text=chunk.text,
+                            score=float(score),
+                            doc_id=chunk.doc_id,
+                            filename=chunk.filename,
+                            search_method="dense",
+                            dense_score=float(score),
+                        )
+                    )
+
+        if method in ("sparse", "hybrid"):
+            # Sparse search using BM25
+            if self.bm25:
+                tokenized_query = query.lower().split()
+                bm25_scores = self.bm25.get_scores(tokenized_query)
+
+                # Get top BM25 results
+                top_indices = np.argsort(bm25_scores)[::-1][:top_k]
+
+                for idx in top_indices:
+                    if idx < len(self.chunks):
+                        chunk = self.chunks[idx]
+                        score = float(bm25_scores[idx])
+
+                        # Check if the chunk was already returned by dense search
+                        existing_result = next(
+                            (
+                                r
+                                for r in results
+                                if r.doc_id == chunk.doc_id and r.text == chunk.text
+                            ),
+                            None,
+                        )
+
+                        if existing_result:
+                            # Update existing result with sparse score
+                            existing_result.sparse_score = score
+                            if method == "hybrid":
+                                # Combine scores for hybrid
+                                existing_result.score = (
+                                    existing_result.dense_score + score
+                                ) / 2
+                        else:
+                            results.append(
+                                SearchResult(
+                                    text=chunk.text,
+                                    score=score,
+                                    doc_id=chunk.doc_id,
+                                    filename=chunk.filename,
+                                    search_method="sparse",
+                                    sparse_score=score,
+                                )
+                            )
+
+        # Sort by score and return top_k
+        results.sort(key=lambda x: x.score, reverse=True)
+        return results[:top_k]
+
+    def generate_response(self, query: str, context: str) -> str:
+        """
+        Generate a response using the language model
+
+        Args:
+            query: User query
+            context: Retrieved context
+
+        Returns:
+            Generated response
+        """
+        try:
+            # Prepare prompt
+            if hasattr(self.tokenizer, "apply_chat_template"):
+                # Use chat template for Qwen
+                messages = [
+                    {
+                        "role": "system",
+                        "content": "You are a helpful AI assistant. Use the provided context to answer the user's question accurately and concisely. If the context doesn't contain enough information to answer the question, say so.",
+                    },
+                    {
+                        "role": "user",
+                        "content": f"Context: {context}\n\nQuestion: {query}",
+                    },
+                ]
+                prompt = self.tokenizer.apply_chat_template(
+                    messages, tokenize=False, add_generation_prompt=True
+                )
+            else:
+                # Fallback for non-chat models
+                prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
+
+            # Tokenize
+            tokenized = self.tokenizer(
+                prompt,
+                return_tensors="pt",
+                truncation=True,
+                max_length=1024,
+                padding=True,
+                return_attention_mask=True,
+            )
+
+            # Generate response
+            with torch.no_grad():
+                try:
+                    outputs = self.model.generate(
+                        tokenized.input_ids,
+                        attention_mask=tokenized.attention_mask,
+                        max_new_tokens=512,
+                        num_return_sequences=1,
+                        temperature=0.7,
+                        do_sample=True,
+                        pad_token_id=self.tokenizer.pad_token_id,
+                        eos_token_id=self.tokenizer.eos_token_id,
+                    )
+                except RuntimeError as e:
+                    if "Half" in str(e):
+                        logger.warning(
+                            "Half precision not supported on CPU, converting to float32"
+                        )
+                        # Convert model to float32 and retry
+                        self.model = self.model.float()
+                        outputs = self.model.generate(
+                            tokenized.input_ids,
+                            attention_mask=tokenized.attention_mask,
+                            max_new_tokens=512,
+                            num_return_sequences=1,
+                            temperature=0.7,
+                            do_sample=True,
+                            pad_token_id=self.tokenizer.pad_token_id,
+                            eos_token_id=self.tokenizer.eos_token_id,
+                        )
+                    else:
+                        raise
+
+            # Decode response
+            response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+            # Extract only the generated part
+            if hasattr(self.tokenizer, "apply_chat_template"):
+                if "<|im_start|>assistant" in response:
+                    response = response.split("<|im_start|>assistant")[-1]
+                if "<|im_end|>" in response:
+                    response = response.split("<|im_end|>")[0]
+            else:
+                response = response[len(prompt):]
+
+            return response.strip()
+
+        except Exception as e:
+            logger.error(f"Error generating response: {e}")
+            return f"Error generating response: {str(e)}"
+
+    def query(self, query: str, method: str = "hybrid", top_k: int = 5) -> RAGResponse:
+        """
+        Query the RAG system
+
+        Args:
+            query: User query
+            method: Search method
+            top_k: Number of results
+
+        Returns:
+            RAG response
+        """
+        start_time = time.time()
+
+        # Search for relevant documents
+        search_results = self.search(query, method, top_k)
+
+        if not search_results:
+            return RAGResponse(
+                answer="I couldn't find any relevant information to answer your question.",
+                confidence=0.0,
+                search_results=[],
+                method_used=method,
+                response_time=time.time() - start_time,
+                query=query,
+            )
+
+        # Combine context from search results
+        context = "\n\n".join([result.text for result in search_results])
+
+        # Generate response
+        answer = self.generate_response(query, context)
+
+        # Calculate confidence (simple heuristic: mean retrieval score)
+        confidence = float(np.mean([result.score for result in search_results]))
+
+        return RAGResponse(
+            answer=answer,
+            confidence=confidence,
+            search_results=search_results,
+            method_used=method,
+            response_time=time.time() - start_time,
+            query=query,
+        )
+
+    def get_stats(self) -> Dict:
+        """Get system statistics"""
+        return {
+            "total_documents": len(self.documents),
+            "total_chunks": len(self.chunks),
+            "vector_size": (
+                self.embedder.get_sentence_embedding_dimension() if self.embedder else 0
+            ),
+            "model_name": self.generative_model,
+            "embedding_model": self.embedding_model,
+            "chunk_sizes": self.chunk_sizes,
+        }
+
+    def clear(self):
+        """Clear all documents and reset the system"""
+        self.documents = []
+        self.chunks = []
+        self._create_new_index()
+        self._save_index()
+        logger.info("βœ… System cleared successfully")
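In hybrid mode, `search()` averages a chunk's dense (FAISS inner-product) and sparse (BM25) scores when both retrievers return it, and otherwise keeps the single method's score. That merge rule, in isolation, looks like the sketch below; `merge_results` and the sample scores are illustrative, not part of the class API:

```python
def merge_results(dense: dict, sparse: dict) -> dict:
    """Combine per-chunk scores the way the hybrid branch of search() does:
    average when a chunk appears in both result lists, otherwise keep the
    score from whichever retriever found it."""
    merged = {}
    for chunk_id in dense.keys() | sparse.keys():
        if chunk_id in dense and chunk_id in sparse:
            merged[chunk_id] = (dense[chunk_id] + sparse[chunk_id]) / 2
        else:
            merged[chunk_id] = dense.get(chunk_id, sparse.get(chunk_id, 0.0))
    return merged

dense_scores = {"c1": 0.9, "c2": 0.4}   # cosine-style similarities
sparse_scores = {"c1": 7.1, "c3": 5.0}  # raw BM25 scores
print(merge_results(dense_scores, sparse_scores))
```

Note one consequence visible in the toy numbers: BM25 scores are unbounded while inner-product similarities are roughly in [-1, 1], so averaging raw values tends to let the sparse score dominate the final ranking; normalizing both score distributions before combining would be a natural refinement.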
requirements.txt CHANGED
@@ -1,3 +1,15 @@
-altair
-pandas
-streamlit
+# Core dependencies for Docker deployment
+streamlit==1.28.1
+torch==2.1.0
+transformers>=4.36.0
+sentence-transformers==2.2.2
+faiss-cpu==1.7.4
+scikit-learn==1.3.2
+rank-bm25==0.2.2
+pypdf==3.17.1
+pandas==2.1.3
+numpy==1.24.3
+loguru==0.7.2
+tqdm==4.66.1
+accelerate==0.24.1
+huggingface-hub==0.19.4
test_deployment.py ADDED
@@ -0,0 +1,293 @@
+#!/usr/bin/env python3
+"""
+Test script for Hugging Face deployment
+
+This script tests if all components are working correctly for deployment.
+"""
+
+import os
+import sys
+import tempfile
+from pathlib import Path
+
+
+def test_imports():
+    """Test if all required packages can be imported"""
+    print("πŸ” Testing imports...")
+
+    try:
+        import streamlit
+        print(f"βœ… Streamlit: {streamlit.__version__}")
+    except ImportError as e:
+        print(f"❌ Streamlit import failed: {e}")
+        return False
+
+    try:
+        import torch
+        print(f"βœ… PyTorch: {torch.__version__}")
+    except ImportError as e:
+        print(f"❌ PyTorch import failed: {e}")
+        return False
+
+    try:
+        import transformers
+        print(f"βœ… Transformers: {transformers.__version__}")
+    except ImportError as e:
+        print(f"❌ Transformers import failed: {e}")
+        return False
+
+    try:
+        import sentence_transformers
+        print(f"βœ… Sentence Transformers: {sentence_transformers.__version__}")
+    except ImportError as e:
+        print(f"❌ Sentence Transformers import failed: {e}")
+        return False
+
+    try:
+        import faiss
+        print(f"βœ… FAISS: {faiss.__version__}")
+    except ImportError as e:
+        print(f"❌ FAISS import failed: {e}")
+        return False
+
+    try:
+        import rank_bm25
+        print("βœ… Rank BM25")
+    except ImportError as e:
+        print(f"❌ Rank BM25 import failed: {e}")
+        return False
+
+    try:
+        import pypdf
+        print(f"βœ… PyPDF: {pypdf.__version__}")
+    except ImportError as e:
+        print(f"❌ PyPDF import failed: {e}")
+        return False
+
+    return True
+
+
+def test_rag_system():
+    """Test the RAG system"""
+    print("\nπŸ” Testing RAG system...")
+
+    try:
+        from rag_system import SimpleRAGSystem
+
+        # Test initialization
+        rag = SimpleRAGSystem()
+        print("βœ… RAG system initialized")
+
+        # Test stats
+        stats = rag.get_stats()
+        print(f"βœ… Stats retrieved: {stats}")
+
+        return True
+
+    except Exception as e:
+        print(f"❌ RAG system test failed: {e}")
+        return False
+
+
+def test_pdf_processor():
+    """Test the PDF processor"""
+    print("\nπŸ” Testing PDF processor...")
+
+    try:
+        from pdf_processor import SimplePDFProcessor
+
+        # Test initialization
+        processor = SimplePDFProcessor()
+        print("βœ… PDF processor initialized")
+
+        # Test query preprocessing
+        processed_query = processor.preprocess_query("What is the revenue?")
+        print(f"βœ… Query preprocessing: '{processed_query}'")
+
+        return True
+
+    except Exception as e:
+        print(f"❌ PDF processor test failed: {e}")
+        return False
+
+
+def test_model_loading():
+    """Test if models can be loaded"""
+    print("\nπŸ” Testing model loading...")
+
+    try:
+        from sentence_transformers import SentenceTransformer
+        from transformers import AutoTokenizer, AutoModelForCausalLM
+
+        # Test embedding model
+        embedder = SentenceTransformer("all-MiniLM-L6-v2")
+        print("βœ… Embedding model loaded")
+
+        # Test tokenizer
+        tokenizer = AutoTokenizer.from_pretrained(
+            "Qwen/Qwen2.5-1.5B-Instruct", trust_remote_code=True
+        )
+        print("βœ… Tokenizer loaded")
+
+        # Test model (CPU only for testing)
+        model = AutoModelForCausalLM.from_pretrained(
+            "Qwen/Qwen2.5-1.5B-Instruct",
+            trust_remote_code=True,
+            torch_dtype="auto",
+            device_map="cpu",
+        )
+        print("βœ… Generative model loaded")
+
+        return True
+
+    except Exception as e:
+        print(f"❌ Model loading failed: {e}")
+        return False
+
+
+def test_streamlit_app():
+    """Test if the Streamlit app can be imported"""
+    print("\nπŸ” Testing Streamlit app...")
+
+    try:
+        # Test if app.py can be imported
+        import app
+        print("βœ… Streamlit app imported successfully")
+
+        return True
+
+    except Exception as e:
+        print(f"❌ Streamlit app test failed: {e}")
+        return False
+
+
+def test_file_structure():
+    """Test if all required files exist"""
+    print("\nπŸ” Testing file structure...")
+
+    required_files = [
+        "app.py",
+        "rag_system.py",
+        "pdf_processor.py",
+        "requirements.txt",
+        "README.md",
+    ]
+
+    missing_files = []
+    for file in required_files:
+        if os.path.exists(file):
+            print(f"βœ… {file}")
+        else:
+            print(f"❌ {file} (missing)")
+            missing_files.append(file)
+
+    if missing_files:
+        print(f"❌ Missing files: {missing_files}")
+        return False
+
+    return True
+
+
+def test_requirements():
+    """Test if requirements.txt is valid"""
+    print("\nπŸ” Testing requirements.txt...")
+
+    try:
+        with open("requirements.txt", "r") as f:
+            requirements = f.read()
+
+        # Check for essential packages
+        essential_packages = [
+            "streamlit",
+            "torch",
+            "transformers",
+            "sentence-transformers",
+            "faiss-cpu",
+            "rank-bm25",
+            "pypdf",
+        ]
+
+        missing_packages = []
+        for package in essential_packages:
+            if package in requirements:
+                print(f"βœ… {package}")
+            else:
+                print(f"❌ {package} (missing)")
+                missing_packages.append(package)
+
+        if missing_packages:
+            print(f"❌ Missing packages: {missing_packages}")
+            return False
+
+        return True
+
+    except Exception as e:
+        print(f"❌ Requirements test failed: {e}")
+        return False
+
+
+def main():
+    """Run all tests"""
+    print("πŸš€ Hugging Face Deployment Test\n")
+
+    tests = [
+        ("File Structure", test_file_structure),
+        ("Requirements", test_requirements),
+        ("Imports", test_imports),
+        ("Model Loading", test_model_loading),
+        ("PDF Processor", test_pdf_processor),
+        ("RAG System", test_rag_system),
+        ("Streamlit App", test_streamlit_app),
+    ]
+
+    results = []
+    for test_name, test_func in tests:
+        try:
+            result = test_func()
+            results.append((test_name, result))
+        except Exception as e:
+            print(f"❌ {test_name} test failed with exception: {e}")
+            results.append((test_name, False))
+
+    # Summary
+    print("\n" + "=" * 50)
262
+ print("πŸ“Š Test Results Summary")
263
+ print("=" * 50)
264
+
265
+ passed = 0
266
+ total = len(results)
267
+
268
+ for test_name, result in results:
269
+ status = "βœ… PASS" if result else "❌ FAIL"
270
+ print(f"{test_name:20} {status}")
271
+ if result:
272
+ passed += 1
273
+
274
+ print(f"\nOverall: {passed}/{total} tests passed")
275
+
276
+ if passed == total:
277
+ print("πŸŽ‰ All tests passed! Ready for Hugging Face deployment.")
278
+ print("\nNext steps:")
279
+ print("1. Create a new Hugging Face Space")
280
+ print("2. Upload all files from this directory")
281
+ print("3. Set the SDK to 'Streamlit'")
282
+ print("4. Deploy and test your RAG system!")
283
+ else:
284
+ print("⚠️ Some tests failed. Please fix the issues before deployment.")
285
+ print("\nTroubleshooting:")
286
+ print("1. Install missing dependencies: pip install -r requirements.txt")
287
+ print("2. Check file permissions and paths")
288
+ print("3. Verify model download permissions")
289
+ print("4. Test locally first: streamlit run app.py")
290
+
291
+
292
+ if __name__ == "__main__":
293
+ main()
test_docker.py ADDED
@@ -0,0 +1,290 @@
+ #!/usr/bin/env python3
+ """
+ Test script for Docker deployment
+
+ This script tests if all components are working correctly for Docker deployment.
+ """
+
+ import os
+ import sys
+ import subprocess
+ from pathlib import Path
+
+
+ def test_dockerfile():
+     """Test if Dockerfile exists and is valid"""
+     print("πŸ” Testing Dockerfile...")
+
+     dockerfile_path = Path("Dockerfile")
+     if not dockerfile_path.exists():
+         print("❌ Dockerfile not found")
+         return False
+
+     try:
+         with open(dockerfile_path, "r") as f:
+             content = f.read()
+
+         # Check for essential Dockerfile components
+         required_components = [
+             "FROM python:",
+             "WORKDIR /app",
+             "COPY requirements.txt",
+             "RUN pip install",
+             "COPY .",
+             "EXPOSE 8501",
+             'CMD ["streamlit"',
+         ]
+
+         missing_components = []
+         for component in required_components:
+             if component in content:
+                 print(f"βœ… {component}")
+             else:
+                 print(f"❌ {component} (missing)")
+                 missing_components.append(component)
+
+         if missing_components:
+             print(f"❌ Missing Dockerfile components: {missing_components}")
+             return False
+
+         return True
+
+     except Exception as e:
+         print(f"❌ Dockerfile test failed: {e}")
+         return False
+
+
+ def test_dockerignore():
+     """Test if .dockerignore exists"""
+     print("\nπŸ” Testing .dockerignore...")
+
+     dockerignore_path = Path(".dockerignore")
+     if dockerignore_path.exists():
+         print("βœ… .dockerignore exists")
+         return True
+     else:
+         print("⚠️ .dockerignore not found (optional but recommended)")
+         return True
+
+
+ def test_docker_compose():
+     """Test if docker-compose.yml exists"""
+     print("\nπŸ” Testing docker-compose.yml...")
+
+     compose_path = Path("docker-compose.yml")
+     if compose_path.exists():
+         print("βœ… docker-compose.yml exists")
+         return True
+     else:
+         print("⚠️ docker-compose.yml not found (optional)")
+         return True
+
+
+ def test_docker_build():
+     """Test Docker build locally"""
+     print("\nπŸ” Testing Docker build...")
+
+     try:
+         # Test Docker build
+         result = subprocess.run(
+             ["docker", "build", "-t", "rag-system-test", "."],
+             capture_output=True,
+             text=True,
+             timeout=300,  # 5 minutes timeout
+         )
+
+         if result.returncode == 0:
+             print("βœ… Docker build successful")
+             return True
+         else:
+             print(f"❌ Docker build failed: {result.stderr}")
+             return False
+
+     except subprocess.TimeoutExpired:
+         print("❌ Docker build timed out")
+         return False
+     except FileNotFoundError:
+         print("⚠️ Docker not installed or not in PATH")
+         return False
+     except Exception as e:
+         print(f"❌ Docker build test failed: {e}")
+         return False
+
+
+ def test_docker_run():
+     """Test Docker run locally"""
+     print("\nπŸ” Testing Docker run...")
+
+     try:
+         # Test Docker run (brief test)
+         result = subprocess.run(
+             [
+                 "docker",
+                 "run",
+                 "--rm",
+                 "-d",
+                 "-p",
+                 "8501:8501",
+                 "--name",
+                 "rag-test",
+                 "rag-system-test",
+             ],
+             capture_output=True,
+             text=True,
+             timeout=30,
+         )
+
+         if result.returncode == 0:
+             print("βœ… Docker run successful")
+
+             # Clean up
+             subprocess.run(["docker", "stop", "rag-test"], capture_output=True)
+             return True
+         else:
+             print(f"❌ Docker run failed: {result.stderr}")
+             return False
+
+     except subprocess.TimeoutExpired:
+         print("❌ Docker run timed out")
+         return False
+     except FileNotFoundError:
+         print("⚠️ Docker not installed or not in PATH")
+         return False
+     except Exception as e:
+         print(f"❌ Docker run test failed: {e}")
+         return False
+
+
+ def test_file_structure():
+     """Test if all required files exist"""
+     print("\nπŸ” Testing file structure...")
+
+     required_files = [
+         "app.py",
+         "rag_system.py",
+         "pdf_processor.py",
+         "requirements.txt",
+         "Dockerfile",
+     ]
+
+     optional_files = [".dockerignore", "docker-compose.yml", "README.md"]
+
+     missing_required = []
+     missing_optional = []
+
+     for file in required_files:
+         if os.path.exists(file):
+             print(f"βœ… {file}")
+         else:
+             print(f"❌ {file} (missing)")
+             missing_required.append(file)
+
+     for file in optional_files:
+         if os.path.exists(file):
+             print(f"βœ… {file}")
+         else:
+             print(f"⚠️ {file} (optional)")
+             missing_optional.append(file)
+
+     if missing_required:
+         print(f"❌ Missing required files: {missing_required}")
+         return False
+
+     return True
+
+
+ def test_requirements():
+     """Test if requirements.txt is valid"""
+     print("\nπŸ” Testing requirements.txt...")
+
+     try:
+         with open("requirements.txt", "r") as f:
+             requirements = f.read()
+
+         # Check for essential packages
+         essential_packages = [
+             "streamlit",
+             "torch",
+             "transformers",
+             "sentence-transformers",
+             "faiss-cpu",
+             "rank-bm25",
+             "pypdf",
+         ]
+
+         missing_packages = []
+         for package in essential_packages:
+             if package in requirements:
+                 print(f"βœ… {package}")
+             else:
+                 print(f"❌ {package} (missing)")
+                 missing_packages.append(package)
+
+         if missing_packages:
+             print(f"❌ Missing packages: {missing_packages}")
+             return False
+
+         return True
+
+     except Exception as e:
+         print(f"❌ Requirements test failed: {e}")
+         return False
+
+
+ def main():
+     """Run all tests"""
+     print("🐳 Docker Deployment Test\n")
+
+     tests = [
+         ("File Structure", test_file_structure),
+         ("Requirements", test_requirements),
+         ("Dockerfile", test_dockerfile),
+         (".dockerignore", test_dockerignore),
+         ("docker-compose.yml", test_docker_compose),
+         ("Docker Build", test_docker_build),
+         ("Docker Run", test_docker_run),
+     ]
+
+     results = []
+     for test_name, test_func in tests:
+         try:
+             result = test_func()
+             results.append((test_name, result))
+         except Exception as e:
+             print(f"❌ {test_name} test failed with exception: {e}")
+             results.append((test_name, False))
+
+     # Summary
+     print("\n" + "=" * 50)
+     print("πŸ“Š Test Results Summary")
+     print("=" * 50)
+
+     passed = 0
+     total = len(results)
+
+     for test_name, result in results:
+         status = "βœ… PASS" if result else "❌ FAIL"
+         print(f"{test_name:20} {status}")
+         if result:
+             passed += 1
+
+     print(f"\nOverall: {passed}/{total} tests passed")
+
+     if passed == total:
+         print("πŸŽ‰ All tests passed! Ready for Hugging Face Docker deployment.")
+         print("\nNext steps:")
+         print("1. Create a new Hugging Face Space with Docker SDK")
+         print("2. Upload all files from this directory")
+         print("3. Wait for Docker build to complete")
+         print("4. Test your RAG system!")
+     else:
+         print("⚠️ Some tests failed. Please fix the issues before deployment.")
+         print("\nTroubleshooting:")
+         print("1. Install Docker if not available")
+         print("2. Check file permissions and paths")
+         print("3. Verify Dockerfile syntax")
+         print("4. Test Docker build locally: docker build -t rag-system .")
+
+
+ if __name__ == "__main__":
+     main()
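
Note that both test scripts verify `requirements.txt` with a plain substring match (`if package in requirements`), which can produce false positives: the check for `pypdf` would also pass if only `pypdf2` were pinned, or if a package name appeared only in a comment. A stricter sketch, assuming the usual `name[extras]==version` requirement syntax (the function name here is illustrative, not part of the uploaded scripts):

```python
import re

def parse_requirement_names(text: str) -> set:
    """Extract bare, lowercased package names from requirements.txt-style text."""
    names = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        # The package name is everything before extras/version specifiers.
        match = re.match(r"^[A-Za-z0-9._-]+", line)
        if match:
            names.add(match.group(0).lower())
    return names

requirements = "streamlit>=1.28\npypdf2==3.0  # note: not pypdf\n"
names = parse_requirement_names(requirements)
print("pypdf" in names)   # False: a substring check would wrongly pass here
print("pypdf2" in names)  # True
```

Replacing the substring check with `package in parse_requirement_names(requirements)` keeps the rest of `test_requirements()` unchanged.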