Upload 18 files

- .gitattributes +8 -0
- DEPLOYMENT_GUIDE.md +283 -0
- Dockerfile +31 -6
- README.md +242 -16
- RIL-Q1-FY2024-25.pdf +3 -0
- RIL-Q1-FY2025-26.pdf +3 -0
- RIL-Q2-FY2023-24.pdf +3 -0
- RIL-Q2-FY2024-25.pdf +3 -0
- RIL-Q3-FY2023-24.pdf +3 -0
- RIL-Q3-FY2024-25.pdf +3 -0
- RIL-Q4-FY2023-24.pdf +3 -0
- RIL-Q4-FY2024-25.pdf +3 -0
- app.py +351 -0
- docker-compose.yml +20 -0
- pdf_processor.py +268 -0
- rag_system.py +547 -0
- requirements.txt +15 -3
- test_deployment.py +293 -0
- test_docker.py +290 -0
.gitattributes
CHANGED
@@ -33,3 +33,11 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+RIL-Q1-FY2024-25.pdf filter=lfs diff=lfs merge=lfs -text
+RIL-Q1-FY2025-26.pdf filter=lfs diff=lfs merge=lfs -text
+RIL-Q2-FY2023-24.pdf filter=lfs diff=lfs merge=lfs -text
+RIL-Q2-FY2024-25.pdf filter=lfs diff=lfs merge=lfs -text
+RIL-Q3-FY2023-24.pdf filter=lfs diff=lfs merge=lfs -text
+RIL-Q3-FY2024-25.pdf filter=lfs diff=lfs merge=lfs -text
+RIL-Q4-FY2023-24.pdf filter=lfs diff=lfs merge=lfs -text
+RIL-Q4-FY2024-25.pdf filter=lfs diff=lfs merge=lfs -text
DEPLOYMENT_GUIDE.md
ADDED
@@ -0,0 +1,283 @@
# Hugging Face Spaces Deployment Guide (Docker + Streamlit)

This guide walks you through deploying your RAG system to Hugging Face Spaces using **Docker with Streamlit**.

## Prerequisites

- A Hugging Face account
- All files from the `huggingface_deploy/` folder
- Basic understanding of Docker (optional)

## Step-by-Step Deployment

### Step 1: Create a New Space

1. **Go to Hugging Face Spaces:**
   - Visit [https://huggingface.co/spaces](https://huggingface.co/spaces)
   - Click "Create new Space"

2. **Configure your Space:**
   - **Owner**: Choose your username or organization
   - **Space name**: Choose a unique name (e.g., `my-rag-system`)
   - **License**: Choose an appropriate license (e.g., MIT)
   - **SDK**: Select **Docker**
   - **Visibility**: Choose Public or Private
   - **Hardware**: Select appropriate hardware (CPU is sufficient for basic usage)

3. **Click "Create Space"**

### Step 2: Upload Files

#### Option A: Using Git (Recommended)

1. **Clone your Space repository:**
   ```bash
   git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
   cd YOUR_SPACE_NAME
   ```

2. **Copy files from the deployment folder:**
   ```bash
   cp -r ../huggingface_deploy/* .
   ```

3. **Commit and push:**
   ```bash
   git add .
   git commit -m "Initial RAG system deployment with Docker"
   git push
   ```

#### Option B: Using the Web Interface

1. **Upload files manually:**
   - Go to your Space's "Files" tab
   - Click "Add file" → "Upload files"
   - Upload all files from the `huggingface_deploy/` folder:
     - `app.py`
     - `rag_system.py`
     - `pdf_processor.py`
     - `requirements.txt`
     - `Dockerfile`
     - `.dockerignore`
     - `README.md`

### Step 3: Configure the Space

1. **Set up environment variables (optional):**
   - Go to your Space's "Settings" tab
   - Add environment variables if needed:
     ```
     EMBEDDING_MODEL=all-MiniLM-L6-v2
     GENERATIVE_MODEL=Qwen/Qwen2.5-1.5B-Instruct
     ```

2. **Configure hardware (if needed):**
   - Go to "Settings" → "Hardware"
   - Select appropriate hardware based on your needs

### Step 4: Deploy and Test

1. **Wait for deployment:**
   - Hugging Face will automatically build and deploy your Docker container
   - This may take 10-15 minutes for the first deployment (model downloads)

2. **Test your application:**
   - Visit your Space URL: `https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME`
   - Upload a PDF document
   - Ask questions to test the RAG system

## Docker Configuration

### Dockerfile Features

- **Base Image**: Python 3.10 slim
- **System Dependencies**: build-essential, curl
- **Health Check**: Monitors the Streamlit health endpoint
- **Environment Variables**: Configured for Streamlit
- **Port**: Exposes port 8501

### Local Docker Testing

You can test the Docker build locally:

```bash
# Build the Docker image
docker build -t rag-system .

# Run the container
docker run -p 8501:8501 rag-system

# Or use docker-compose
docker-compose up --build
```

## Configuration Options

### Environment Variables

You can customize your deployment by setting these environment variables in your Space settings:

```bash
# Model configuration
EMBEDDING_MODEL=all-MiniLM-L6-v2
GENERATIVE_MODEL=Qwen/Qwen2.5-1.5B-Instruct

# Chunk sizes
CHUNK_SIZES=100,400

# Vector store path
VECTOR_STORE_PATH=./vector_store

# Streamlit configuration
STREAMLIT_SERVER_PORT=8501
STREAMLIT_SERVER_ADDRESS=0.0.0.0
STREAMLIT_SERVER_HEADLESS=true
```
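These variables only take effect if the application reads them at startup. A minimal sketch of how the code could pick them up (the variable names match the list above; the snippet itself is illustrative and not taken from the uploaded files):

```python
import os

# Read Space settings with safe defaults (hypothetical helper code).
EMBEDDING_MODEL = os.environ.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2")
GENERATIVE_MODEL = os.environ.get("GENERATIVE_MODEL", "Qwen/Qwen2.5-1.5B-Instruct")
VECTOR_STORE_PATH = os.environ.get("VECTOR_STORE_PATH", "./vector_store")

# CHUNK_SIZES is a comma-separated list, e.g. "100,400".
CHUNK_SIZES = [int(s) for s in os.environ.get("CHUNK_SIZES", "100,400").split(",")]
```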

### Hardware Options

- **CPU**: Sufficient for basic usage, slower inference
- **T4**: Good for faster inference, limited memory
- **A10G**: High performance, more memory
- **A100**: Maximum performance, highest cost

## Troubleshooting

### Common Issues

1. **Build Fails**
   - Check that all required files are uploaded
   - Verify that `requirements.txt` and the `Dockerfile` are correct
   - Check the build logs for specific errors

2. **Model Loading Errors**
   - Ensure internet connectivity for model downloads
   - Check that the model names are correct
   - Verify there is sufficient disk space

3. **Memory Issues**
   - Use smaller models
   - Reduce chunk sizes
   - Upgrade to higher-tier hardware

4. **Slow Performance**
   - Upgrade the hardware tier
   - Use smaller embedding models
   - Optimize chunk sizes

5. **Docker Build Issues**
   - Check that `.dockerignore` excludes unnecessary files
   - Verify the Dockerfile syntax
   - Check for missing dependencies

### Debug Mode

To enable debug logging, add this to your `app.py`:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

## Monitoring

### Space Metrics

- **Build Status**: Check whether the Docker build was successful
- **Runtime Logs**: Monitor application logs
- **Resource Usage**: Track CPU and memory usage
- **Error Logs**: Identify and fix issues

### Docker Logs

Check the Docker logs in your Space:
- Go to "Settings" → "Logs"
- Monitor build and runtime logs
- Look for error messages

## Security Considerations

1. **File Upload:**
   - Validate PDF files before processing (see the sketch after this list)
   - Implement file size limits
   - Check file types

2. **Model Access:**
   - Use appropriate model access tokens
   - Consider private models for sensitive data

3. **Data Privacy:**
   - Be aware that uploaded documents are processed
   - Consider data retention policies

4. **Docker Security:**
   - Use a non-root user in the Dockerfile
   - Minimize the attack surface
   - Keep base images updated

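The "File Upload" checks above can be enforced before a file ever reaches the RAG system. A minimal sketch for a Streamlit `UploadedFile` (the 10 MB limit and the helper name are illustrative assumptions, not values from `app.py`):

```python
import pypdf

MAX_PDF_BYTES = 10 * 1024 * 1024  # assumed limit: 10 MB


def is_acceptable_pdf(uploaded_file) -> bool:
    """Basic size and type validation for an uploaded file-like object."""
    if uploaded_file.size > MAX_PDF_BYTES:
        return False
    if not uploaded_file.name.lower().endswith(".pdf"):
        return False
    try:
        # Parsing the header is a cheap way to reject non-PDF payloads.
        pypdf.PdfReader(uploaded_file)
        return True
    except Exception:
        return False
    finally:
        uploaded_file.seek(0)  # rewind so later processing can re-read the file
```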
+
## π Scaling
|
| 220 |
+
|
| 221 |
+
### For Production Use
|
| 222 |
+
|
| 223 |
+
1. **Multiple Spaces:**
|
| 224 |
+
- Create separate Spaces for different use cases
|
| 225 |
+
- Use different hardware tiers as needed
|
| 226 |
+
|
| 227 |
+
2. **Custom Domains:**
|
| 228 |
+
- Set up custom domains for your Spaces
|
| 229 |
+
- Use proper SSL certificates
|
| 230 |
+
|
| 231 |
+
3. **Load Balancing:**
|
| 232 |
+
- Consider multiple Space instances
|
| 233 |
+
- Implement proper caching strategies
|
| 234 |
+
|
| 235 |
+
## π Success Checklist
|
| 236 |
+
|
| 237 |
+
- [ ] Space created successfully with Docker SDK
|
| 238 |
+
- [ ] All files uploaded (including Dockerfile)
|
| 239 |
+
- [ ] Docker build completed without errors
|
| 240 |
+
- [ ] Application loads correctly
|
| 241 |
+
- [ ] PDF upload works
|
| 242 |
+
- [ ] Question answering works
|
| 243 |
+
- [ ] Search results display correctly
|
| 244 |
+
- [ ] Performance is acceptable
|
| 245 |
+
|
| 246 |
+
## π Support
|
| 247 |
+
|
| 248 |
+
If you encounter issues:
|
| 249 |
+
|
| 250 |
+
1. **Check the logs** in your Space's "Logs" tab
|
| 251 |
+
2. **Review this guide** for common solutions
|
| 252 |
+
3. **Search Hugging Face documentation**
|
| 253 |
+
4. **Create an issue** in the project repository
|
| 254 |
+
5. **Contact Hugging Face support** for Space-specific issues
|
| 255 |
+
|
| 256 |
+
## π Next Steps
|
| 257 |
+
|
| 258 |
+
After successful deployment:
|
| 259 |
+
|
| 260 |
+
1. **Test thoroughly** with different document types
|
| 261 |
+
2. **Optimize performance** based on usage patterns
|
| 262 |
+
3. **Add custom features** as needed
|
| 263 |
+
4. **Share your Space** with others
|
| 264 |
+
5. **Monitor usage** and gather feedback
|
| 265 |
+
|
| 266 |
+
## π Updates and Maintenance
|
| 267 |
+
|
| 268 |
+
### Updating Your Space
|
| 269 |
+
|
| 270 |
+
1. **Make changes locally**
|
| 271 |
+
2. **Test with Docker locally**
|
| 272 |
+
3. **Push changes to your Space repository**
|
| 273 |
+
4. **Monitor the rebuild process**
|
| 274 |
+
|
| 275 |
+
### Version Management
|
| 276 |
+
|
| 277 |
+
- Use specific versions in `requirements.txt`
|
| 278 |
+
- Tag your Docker images
|
| 279 |
+
- Keep track of model versions
|
| 280 |
+
|
| 281 |
+
---
|
| 282 |
+
|
| 283 |
+
**Happy deploying with Docker! π³π**
|
Dockerfile
CHANGED
@@ -1,20 +1,45 @@
# Use Python 3.10 slim image
FROM python:3.10-slim

# Set working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements first for better caching
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt

# Copy application files
COPY . .

# Create vector store directory
RUN mkdir -p vector_store

# Copy all PDF documents for testing
COPY *.pdf /app/

# Set environment variables
ENV PYTHONPATH=/app
ENV STREAMLIT_SERVER_PORT=8501
ENV STREAMLIT_SERVER_ADDRESS=0.0.0.0
ENV STREAMLIT_SERVER_HEADLESS=true
ENV STREAMLIT_SERVER_ENABLE_CORS=false
ENV STREAMLIT_SERVER_ENABLE_XSRF_PROTECTION=false
ENV STREAMLIT_LOGGER_LEVEL=debug

# Expose port
EXPOSE 8501

# Health check
HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health

# Run the application
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]

README.md
CHANGED
@@ -1,19 +1,245 @@
# RAG System for Hugging Face Spaces

A simplified Retrieval-Augmented Generation (RAG) system optimized for deployment on Hugging Face Spaces.

## Features

- **FAISS Vector Search**: Fast similarity search using FAISS
- **BM25 Keyword Search**: Traditional keyword-based retrieval
- **Hybrid Search**: Combines both dense and sparse retrieval
- **Qwen 2.5 1.5B**: Advanced language model for answer generation
- **Streamlit UI**: Clean, interactive web interface
- **PDF Processing**: Extract and process PDF documents
- **Persistent Storage**: Saves embeddings and metadata locally

## Project Structure

```
huggingface_deploy/
├── app.py              # Main Streamlit application
├── rag_system.py       # Simplified RAG system
├── pdf_processor.py    # PDF processing utilities
├── requirements.txt    # Python dependencies
├── README.md           # This file
└── vector_store/       # FAISS index and metadata (created automatically)
```

## Technologies Used

- **Streamlit**: Web interface
- **FAISS**: Vector similarity search
- **BM25**: Keyword-based retrieval
- **Sentence Transformers**: Text embeddings
- **Transformers**: Qwen 2.5 1.5B model
- **PyPDF**: PDF text extraction
- **PyTorch**: Deep learning framework

## Quick Start

### Local Development

1. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

2. **Run the application:**
   ```bash
   streamlit run app.py
   ```

3. **Open in browser:**
   Navigate to `http://localhost:8501`

### Hugging Face Spaces Deployment

1. **Create a new Space:**
   - Go to [Hugging Face Spaces](https://huggingface.co/spaces)
   - Click "Create new Space"
   - Choose "Streamlit" as the SDK
   - Set visibility (public or private)

2. **Upload files:**
   - Upload all files from this directory to your Space
   - The Space will automatically install dependencies and run the app

3. **Access your app:**
   - Your RAG system will be available at your Space URL

## How to Use

### 1. Upload Documents
- Use the sidebar to upload PDF documents
- The system will automatically process and index the content
- Multiple documents can be uploaded

### 2. Ask Questions
- Type your question in the chat interface
- Choose your preferred retrieval method (a score-fusion sketch follows this list):
  - **Hybrid**: Combines FAISS and BM25 (recommended)
  - **Dense**: Uses only FAISS vector similarity
  - **Sparse**: Uses only BM25 keyword matching

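The hybrid option can be pictured as a fusion of the two retrievers' score lists. A minimal sketch of one common approach, min-max normalization followed by a weighted sum (the 0.5 weight and the function name are illustrative assumptions, not the logic used in `rag_system.py`):

```python
def fuse_scores(dense: dict, sparse: dict, alpha: float = 0.5) -> dict:
    """Combine FAISS (dense) and BM25 (sparse) scores keyed by chunk id."""

    def normalize(scores: dict) -> dict:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {k: (v - lo) / span for k, v in scores.items()}

    dense_n, sparse_n = normalize(dense), normalize(sparse)
    keys = set(dense_n) | set(sparse_n)
    return {
        k: alpha * dense_n.get(k, 0.0) + (1 - alpha) * sparse_n.get(k, 0.0)
        for k in keys
    }


# Example: chunk "c2" ranks highest once both signals are combined.
print(fuse_scores({"c1": 0.2, "c2": 0.9}, {"c2": 3.1, "c3": 1.0}))
```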
+
### 3. View Results
|
| 84 |
+
- See the generated answer
|
| 85 |
+
- View search results with confidence scores
|
| 86 |
+
- Check response time and method used
|
| 87 |
+
|
| 88 |
+
## βοΈ Configuration
|
| 89 |
+
|
| 90 |
+
### Environment Variables
|
| 91 |
+
|
| 92 |
+
You can customize the system by setting these environment variables:
|
| 93 |
+
|
| 94 |
+
```bash
|
| 95 |
+
# Model configuration
|
| 96 |
+
EMBEDDING_MODEL=all-MiniLM-L6-v2
|
| 97 |
+
GENERATIVE_MODEL=Qwen/Qwen2.5-1.5B-Instruct
|
| 98 |
+
|
| 99 |
+
# Chunk sizes for document processing
|
| 100 |
+
CHUNK_SIZES=100,400
|
| 101 |
+
|
| 102 |
+
# Vector store path
|
| 103 |
+
VECTOR_STORE_PATH=./vector_store
|
| 104 |
+
```
|
| 105 |
+
|
| 106 |
+
### Model Options
|
| 107 |
+
|
| 108 |
+
**Embedding Models:**
|
| 109 |
+
- `all-MiniLM-L6-v2` (default, 384 dimensions)
|
| 110 |
+
- `all-mpnet-base-v2` (768 dimensions)
|
| 111 |
+
- `multi-qa-MiniLM-L6-cos-v1` (384 dimensions)
|
| 112 |
+
|
| 113 |
+
**Generative Models:**
|
| 114 |
+
- `Qwen/Qwen2.5-1.5B-Instruct` (default)
|
| 115 |
+
- `distilgpt2` (fallback)
|
| 116 |
+
- `microsoft/DialoGPT-medium`
|
| 117 |
+
|
| 118 |
+
## π§ Customization
|
| 119 |
+
|
| 120 |
+
### Adding New Models
|
| 121 |
|
| 122 |
+
To use different models, modify the `SimpleRAGSystem` initialization in `app.py`:
|
| 123 |
|
| 124 |
+
```python
|
| 125 |
+
st.session_state.rag_system = SimpleRAGSystem(
|
| 126 |
+
embedding_model="your-embedding-model",
|
| 127 |
+
generative_model="your-generative-model"
|
| 128 |
+
)
|
| 129 |
+
```
|
| 130 |
+
|
| 131 |
+
### Custom Chunk Sizes
|
| 132 |
+
|
| 133 |
+
Modify the chunk sizes for different document types:
|
| 134 |
+
|
| 135 |
+
```python
|
| 136 |
+
chunk_sizes = [50, 200, 800] # Smaller chunks for technical docs
|
| 137 |
+
```
|
| 138 |
+
|
| 139 |
+
### Custom Search Methods
|
| 140 |
+
|
| 141 |
+
Add new search methods in `rag_system.py`:
|
| 142 |
+
|
| 143 |
+
```python
|
| 144 |
+
def custom_search(self, query: str, top_k: int = 5):
|
| 145 |
+
# Your custom search implementation
|
| 146 |
+
pass
|
| 147 |
+
```
|
| 148 |
+
|
| 149 |
+
## π Performance Optimization
|
| 150 |
+
|
| 151 |
+
### Memory Usage
|
| 152 |
+
- Use smaller embedding models for limited memory
|
| 153 |
+
- Reduce chunk sizes for large documents
|
| 154 |
+
- Enable model quantization
|
| 155 |
+
|
| 156 |
+
### Speed Optimization
|
| 157 |
+
- Use GPU acceleration when available
|
| 158 |
+
- Optimize FAISS index parameters
|
| 159 |
+
- Cache embeddings for repeated queries
|
| 160 |
+
|
| 161 |
+
### Storage
|
| 162 |
+
- FAISS index and metadata are saved locally
|
| 163 |
+
- Consider cloud storage for production deployments
|
| 164 |
+
|
| 165 |
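For the memory tips above, one concrete option is loading the generator in half precision. A minimal sketch with Transformers (reading "model quantization" as reduced-precision loading is an assumption; the repository's code is not confirmed to do this):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# float16 roughly halves weight memory compared with float32;
# device_map="auto" (requires accelerate) places layers on GPU when available.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
```

On CPU-only hardware, float32 or bfloat16 is usually more robust; 8-bit loading via bitsandbytes is another option but needs a GPU.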
+
## π Troubleshooting
|
| 166 |
+
|
| 167 |
+
### Common Issues
|
| 168 |
+
|
| 169 |
+
1. **Model Loading Errors**
|
| 170 |
+
- Check internet connection for model downloads
|
| 171 |
+
- Verify model names are correct
|
| 172 |
+
- Ensure sufficient disk space
|
| 173 |
+
|
| 174 |
+
2. **Memory Issues**
|
| 175 |
+
- Reduce batch sizes
|
| 176 |
+
- Use smaller models
|
| 177 |
+
- Enable gradient checkpointing
|
| 178 |
+
|
| 179 |
+
3. **PDF Processing Errors**
|
| 180 |
+
- Verify PDF files are not corrupted
|
| 181 |
+
- Check file permissions
|
| 182 |
+
- Ensure PyPDF is properly installed
|
| 183 |
+
|
| 184 |
+
### Debug Mode
|
| 185 |
+
|
| 186 |
+
Enable debug logging by adding to `app.py`:
|
| 187 |
+
|
| 188 |
+
```python
|
| 189 |
+
import logging
|
| 190 |
+
logging.basicConfig(level=logging.DEBUG)
|
| 191 |
+
```
|
| 192 |
+
|
| 193 |
+
## π Security Considerations
|
| 194 |
+
|
| 195 |
+
- **File Upload**: Validate PDF files before processing
|
| 196 |
+
- **Model Access**: Use appropriate model access tokens
|
| 197 |
+
- **Data Privacy**: Consider data retention policies
|
| 198 |
+
- **Rate Limiting**: Implement query rate limiting for production
|
| 199 |
+
|
| 200 |
+
## π Monitoring
|
| 201 |
+
|
| 202 |
+
### System Metrics
|
| 203 |
+
- Document count and chunk count
|
| 204 |
+
- Response times
|
| 205 |
+
- Search result quality
|
| 206 |
+
- Model performance
|
| 207 |
+
|
| 208 |
+
### Logs
|
| 209 |
+
- Application logs in Streamlit
|
| 210 |
+
- Model loading and inference logs
|
| 211 |
+
- Error tracking and debugging
|
| 212 |
+
|
| 213 |
+
## π€ Contributing
|
| 214 |
+
|
| 215 |
+
1. Fork the repository
|
| 216 |
+
2. Create a feature branch
|
| 217 |
+
3. Make your changes
|
| 218 |
+
4. Test thoroughly
|
| 219 |
+
5. Submit a pull request
|
| 220 |
+
|
| 221 |
+
## π License
|
| 222 |
+
|
| 223 |
+
This project is licensed under the MIT License - see the LICENSE file for details.
|
| 224 |
+
|
| 225 |
+
## π Support
|
| 226 |
+
|
| 227 |
+
For issues and questions:
|
| 228 |
+
1. Check the troubleshooting section
|
| 229 |
+
2. Review the logs for error messages
|
| 230 |
+
3. Create an issue on GitHub
|
| 231 |
+
4. Contact the maintainers
|
| 232 |
+
|
| 233 |
+
## π― Roadmap
|
| 234 |
+
|
| 235 |
+
- [ ] Add support for more document formats
|
| 236 |
+
- [ ] Implement advanced search algorithms
|
| 237 |
+
- [ ] Add model fine-tuning capabilities
|
| 238 |
+
- [ ] Improve UI/UX design
|
| 239 |
+
- [ ] Add export/import functionality
|
| 240 |
+
- [ ] Implement user authentication
|
| 241 |
+
- [ ] Add analytics dashboard
|
| 242 |
+
|
| 243 |
+
---
|
| 244 |
|
| 245 |
+
**Happy RAG-ing! π**
|
|
|
RIL-Q1-FY2024-25.pdf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e29390caae95cc8f28606d9f08317cda424bf544fd86383c7f9ac7d25ca8e808
size 1253337

RIL-Q1-FY2025-26.pdf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ce3f74a4a4012cdb85afaf7795aa2cc118f94af0f2b4d290f92248d042eb0976
size 719459

RIL-Q2-FY2023-24.pdf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0e07142e623cd116f6c18a6e17e803b06bff53eeaa149c4151022579ef305cbd
size 1570743

RIL-Q2-FY2024-25.pdf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f78f4ade6ab7640fb74560b76505754fe5751c3602d61925c764c177875d1097
size 1664783

RIL-Q3-FY2023-24.pdf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f2e4afa7e303df86a156c02fbdb07866238891a408cd79398c98b100693cafcc
size 1446439

RIL-Q3-FY2024-25.pdf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e9d0afa42b8fb75efcf2d1c1aea5b104c77dd63fd69fa0fcc059af8b350e8567
size 1855556

RIL-Q4-FY2023-24.pdf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:645d6658976d1f958703b951fd7c89b22738ed2c865f31077fa725ec27781115
size 1662456

RIL-Q4-FY2024-25.pdf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0ec375dcbc69b69a95cd13f37fe090d61071d6e6a66707f2c73b26b77c6bd0d0
size 1719021

app.py
ADDED
@@ -0,0 +1,351 @@
#!/usr/bin/env python3
"""
RAG System for Hugging Face Spaces

A simplified RAG system using:
- FAISS for vector search
- BM25 for hybrid retrieval
- Streamlit for UI
- Qwen 2.5 1.5B for generation
"""

import streamlit as st
import os
import tempfile
from pathlib import Path
import time
from typing import List, Dict, Optional
import json
import glob
from concurrent.futures import ThreadPoolExecutor, as_completed
from loguru import logger

# Import our simplified components
from rag_system import SimpleRAGSystem
from pdf_processor import SimplePDFProcessor

# Page configuration
st.set_page_config(
    page_title="RAG System - Hugging Face",
    page_icon="🤖",
    layout="wide",
    initial_sidebar_state="expanded",
)

# Initialize session state
if "rag_system" not in st.session_state:
    st.session_state.rag_system = None
if "documents_loaded" not in st.session_state:
    st.session_state.documents_loaded = False
if "chat_history" not in st.session_state:
    st.session_state.chat_history = []
if "initializing" not in st.session_state:
    st.session_state.initializing = False


def load_single_document(rag_system, pdf_path):
    """Load a single document into the RAG system"""
    try:
        filename = os.path.basename(pdf_path)
        success = rag_system.add_document(pdf_path, filename)
        return filename, success, None
    except Exception as e:
        return os.path.basename(pdf_path), False, str(e)


def initialize_rag_system():
    """Initialize the RAG system"""
    if st.session_state.rag_system is None and not st.session_state.initializing:
        st.session_state.initializing = True
        st.write("Starting RAG system initialization...")
        with st.spinner("Initializing RAG system..."):
            try:
                st.session_state.rag_system = SimpleRAGSystem()
                st.write("RAG system created successfully")

                # Auto-load all available PDF documents in parallel
                pdf_files = glob.glob("/app/*.pdf")
                st.write(f"Found {len(pdf_files)} PDF files")

                if pdf_files:
                    loaded_count = 0
                    failed_count = 0

                    with st.spinner(
                        f"Loading {len(pdf_files)} PDF documents in parallel..."
                    ):
                        # Use ThreadPoolExecutor for parallel loading
                        with ThreadPoolExecutor(max_workers=4) as executor:
                            # Submit all tasks
                            future_to_pdf = {
                                executor.submit(
                                    load_single_document,
                                    st.session_state.rag_system,
                                    pdf_path,
                                ): pdf_path
                                for pdf_path in pdf_files
                            }

                            # Process completed tasks
                            for future in as_completed(future_to_pdf):
                                filename, success, error = future.result()
                                if success:
                                    loaded_count += 1
                                    st.write(f"Loaded: {filename}")
                                    logger.info(f"Loaded: {filename}")
                                else:
                                    failed_count += 1
                                    st.write(f"Failed: {filename} - {error}")
                                    logger.warning(
                                        f"Failed to load {filename}: {error}"
                                    )

                    if loaded_count > 0:
                        st.session_state.documents_loaded = True
                        st.success(
                            f"Successfully loaded {loaded_count} PDF documents!"
                        )
                        if failed_count > 0:
                            st.warning(f"Failed to load {failed_count} documents")
                    else:
                        st.warning("No documents could be loaded")
                        # Still allow querying even if no documents loaded
                        st.session_state.documents_loaded = True
                else:
                    st.info("No PDF documents found in the container")
                    # Still allow querying even if no documents found
                    st.session_state.documents_loaded = True

                st.success("RAG system initialized!")

            except Exception as e:
                st.error(f"Failed to initialize RAG system: {e}")
                logger.error(f"RAG system initialization failed: {e}")
                # Reset initialization flag on error
                st.session_state.initializing = False
                raise
            finally:
                # Always reset initialization flag
                st.session_state.initializing = False


def upload_document(uploaded_file):
    """Upload and process a document"""
    if uploaded_file is not None:
        try:
            # Create temporary file
            with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp_file:
                tmp_file.write(uploaded_file.getvalue())
                tmp_path = tmp_file.name

            # Process the document
            with st.spinner(f"Processing {uploaded_file.name}..."):
                success = st.session_state.rag_system.add_document(
                    tmp_path, uploaded_file.name
                )

                if success:
                    st.success(f"{uploaded_file.name} processed successfully!")
                    st.session_state.documents_loaded = True
                    # Clean up temporary file
                    os.unlink(tmp_path)
                else:
                    st.error(f"Failed to process {uploaded_file.name}")
                    os.unlink(tmp_path)

        except Exception as e:
            st.error(f"Error processing document: {str(e)}")


def query_rag(query: str, method: str = "hybrid", top_k: int = 5):
    """Query the RAG system"""
    try:
        st.write(f"Starting query: {query}")
        st.write(f"Method: {method}, top_k: {top_k}")

        if st.session_state.rag_system is None:
            st.error("RAG system is not initialized")
            return None, "RAG system not initialized"

        st.write("RAG system is available")
        start_time = time.time()

        st.write("Calling rag_system.query...")
        response = st.session_state.rag_system.query(query, method, top_k)
        response_time = time.time() - start_time

        st.write(f"Response received in {response_time:.2f}s")
        st.write(f"Response type: {type(response)}")

        if response:
            st.write(f"Response answer: {response.answer[:100]}...")

        return response, response_time

    except Exception as e:
        st.error(f"Error during query: {str(e)}")
        logger.error(f"Query error: {e}")
        import traceback

        st.error(f"Full error: {traceback.format_exc()}")
        return None, f"Error: {str(e)}"


def display_search_results(results: List[Dict]):
    """Display search results"""
    if not results:
        st.info("No search results found.")
        return

    for i, result in enumerate(results, 1):
        st.markdown("---")
        st.markdown(f"**Result {i}** - Score: {result.score:.3f}")
        st.write(f"**Source:** {result.filename}")
        st.write(f"**Method:** {result.search_method}")
        st.write(f"**Text:** {result.text[:500]}...")

        if result.dense_score and result.sparse_score:
            col1, col2 = st.columns(2)
            with col1:
                st.metric("Dense Score", f"{result.dense_score:.3f}")
            with col2:
                st.metric("Sparse Score", f"{result.sparse_score:.3f}")


def main():
    """Main application"""
    st.write("App starting...")
    st.title("RAG System - Hugging Face Spaces")
    st.markdown("A simplified RAG system using FAISS + BM25 + Qwen 2.5 1.5B")

    # Initialize RAG system
    initialize_rag_system()

    # Sidebar
    with st.sidebar:
        st.header("Document Upload")

        uploaded_file = st.file_uploader(
            "Upload PDF Document",
            type=["pdf"],
            help="Upload a PDF document to add to the knowledge base",
        )

        if uploaded_file:
            upload_document(uploaded_file)

        st.divider()

        st.header("Settings")

        method = st.selectbox(
            "Retrieval Method",
            ["hybrid", "dense", "sparse"],
            help="Choose the retrieval method",
        )

        top_k = st.slider(
            "Number of Results",
            min_value=1,
            max_value=10,
            value=5,
            help="Number of top results to retrieve",
        )

        st.divider()

        # System info
        if st.session_state.rag_system:
            stats = st.session_state.rag_system.get_stats()
            st.header("System Info")
            st.write(f"**Documents:** {stats['total_documents']}")
            st.write(f"**Chunks:** {stats['total_chunks']}")
            st.write(f"**Vector Size:** {stats['vector_size']}")
            st.write(f"**Model:** {stats['model_name']}")

    # Initialize RAG system if not already done
    if not st.session_state.rag_system:
        if st.session_state.initializing:
            st.info("RAG system is initializing... Please wait.")
            return
        else:
            initialize_rag_system()
            return

    # Show system info and allow querying immediately after initialization
    stats = st.session_state.rag_system.get_stats()
    documents_available = stats["total_documents"] > 0

    if not documents_available:
        st.info(
            "No documents loaded yet, but you can still ask questions. The system will respond based on its general knowledge."
        )

    # Chat interface
    st.header("Ask Questions About Your Documents")

    # Chat input
    query = st.chat_input("Ask a question about the loaded documents...")

    if query:
        st.write(f"Processing query: {query}")
        # Add user message to chat history
        st.session_state.chat_history.append({"role": "user", "content": query})

        # Get response
        response, response_time = query_rag(query, method, top_k)

        st.write(f"Response type: {type(response)}")
        st.write(f"Response time: {response_time}")

        if response:
            st.write("Got valid response, adding to chat history")
            # Add assistant response to chat history
            st.session_state.chat_history.append(
                {
                    "role": "assistant",
                    "content": response.answer,
                    "search_results": response.search_results,
                    "method_used": response.method_used,
                    "confidence": response.confidence,
                    "response_time": response_time,
                }
            )
        else:
            st.write("No valid response received")
            st.session_state.chat_history.append(
                {"role": "assistant", "content": f"Error: {response_time}"}
            )

    # Display chat history
    for message in st.session_state.chat_history:
        if message["role"] == "user":
            with st.chat_message("user"):
                st.write(message["content"])
        else:
            with st.chat_message("assistant"):
                st.write(message["content"])

                # Show additional info for assistant messages
                if "search_results" in message:
                    st.markdown("**Search Results:**")
                    display_search_results(message["search_results"])

                    # Show metrics
                    col1, col2, col3 = st.columns(3)
                    with col1:
                        st.metric("Method", message["method_used"])
                    with col2:
                        st.metric("Confidence", f"{message['confidence']:.3f}")
                    with col3:
                        st.metric("Response Time", f"{message['response_time']:.2f}s")

    # Clear chat button
    if st.session_state.chat_history:
        if st.button("Clear Chat History"):
            st.session_state.chat_history = []
            st.rerun()


if __name__ == "__main__":
    main()

docker-compose.yml
ADDED
@@ -0,0 +1,20 @@
version: '3.8'

services:
  rag-system:
    build: .
    ports:
      - "8501:8501"
    environment:
      - PYTHONPATH=/app
      - STREAMLIT_SERVER_PORT=8501
      - STREAMLIT_SERVER_ADDRESS=0.0.0.0
      - STREAMLIT_SERVER_HEADLESS=true
    volumes:
      - ./vector_store:/app/vector_store
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8501/_stcore/health"]
      interval: 30s
      timeout: 10s
      retries: 3

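Both the Dockerfile and this compose file probe Streamlit's built-in `/_stcore/health` endpoint. A minimal Python equivalent of that check, handy when curl is not available (using `requests` here is an assumption; it is not listed among the project dependencies):

```python
import requests


def streamlit_is_healthy(base_url: str = "http://localhost:8501") -> bool:
    """Return True if the Streamlit health endpoint answers with HTTP 200."""
    try:
        return requests.get(f"{base_url}/_stcore/health", timeout=5).status_code == 200
    except requests.RequestException:
        return False


if __name__ == "__main__":
    print("healthy" if streamlit_is_healthy() else "unhealthy")
```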
pdf_processor.py
ADDED
@@ -0,0 +1,268 @@
#!/usr/bin/env python3
"""
Simplified PDF Processor for Hugging Face Spaces

This module provides PDF processing functionality for the simplified RAG system.
"""

import os
import re
import uuid
from typing import List, Dict, Optional
from dataclasses import dataclass
from pathlib import Path
import pypdf
from loguru import logger


@dataclass
class DocumentChunk:
    """Represents a document chunk"""

    text: str
    doc_id: str
    filename: str
    chunk_id: str
    chunk_size: int


@dataclass
class ProcessedDocument:
    """Represents a processed document"""

    filename: str
    title: str
    author: str
    chunks: List[DocumentChunk]


class SimplePDFProcessor:
    """Simplified PDF processor for Hugging Face Spaces"""

    def __init__(self):
        """Initialize the PDF processor"""
        self.stop_words = {
            "the", "a", "an", "and", "or", "but", "in", "on", "at", "to",
            "for", "of", "with", "by", "is", "are", "was", "were", "be",
            "been", "being", "have", "has", "had", "do", "does", "did",
            "will", "would", "could", "should", "may", "might", "can",
            "this", "that", "these", "those",
        }

    def process_document(
        self, file_path: str, chunk_sizes: List[int] = None
    ) -> ProcessedDocument:
        """
        Process a PDF document

        Args:
            file_path: Path to the PDF file
            chunk_sizes: List of chunk sizes to use

        Returns:
            Processed document
        """
        if chunk_sizes is None:
            chunk_sizes = [100, 400]

        try:
            # Extract text from PDF
            text = self._extract_text(file_path)

            # Clean text
            cleaned_text = self._clean_text(text)

            # Extract metadata
            metadata = self._extract_metadata(file_path)

            # Create chunks
            chunks = []
            doc_id = str(uuid.uuid4())

            for chunk_size in chunk_sizes:
                chunk_list = self._create_chunks(
                    cleaned_text, chunk_size, doc_id, metadata["filename"]
                )
                chunks.extend(chunk_list)

            return ProcessedDocument(
                filename=metadata["filename"],
                title=metadata["title"],
                author=metadata["author"],
                chunks=chunks,
            )

        except Exception as e:
            logger.error(f"Error processing document {file_path}: {e}")
            raise

    def _extract_text(self, file_path: str) -> str:
        """Extract text from PDF file"""
        try:
            with open(file_path, "rb") as file:
                pdf_reader = pypdf.PdfReader(file)
                text = ""

                for page in pdf_reader.pages:
                    page_text = page.extract_text()
                    if page_text:
                        text += page_text + "\n"

                return text

        except Exception as e:
            logger.error(f"Error extracting text from {file_path}: {e}")
            raise

    def _clean_text(self, text: str) -> str:
        """Clean and preprocess text"""
        # Remove extra whitespace
        text = re.sub(r"\s+", " ", text)

        # Remove special characters but keep punctuation
        text = re.sub(r"[^\w\s\.\,\!\?\;\:\-\(\)\[\]\{\}]", "", text)

        # Remove page numbers and headers/footers
        text = re.sub(
            r"\b\d+\b(?=\s*\n)", "", text
        )  # Remove standalone numbers at line ends

        # Remove excessive newlines
        text = re.sub(r"\n\s*\n\s*\n+", "\n\n", text)

        return text.strip()

    def _extract_metadata(self, file_path: str) -> Dict[str, str]:
        """Extract metadata from PDF file"""
        try:
            with open(file_path, "rb") as file:
                pdf_reader = pypdf.PdfReader(file)
                info = pdf_reader.metadata

                return {
                    "filename": Path(file_path).name,
                    "title": (
                        info.get("/Title", Path(file_path).stem)
                        if info
                        else Path(file_path).stem
                    ),
                    "author": info.get("/Author", "Unknown") if info else "Unknown",
                }

        except Exception as e:
            logger.warning(f"Error extracting metadata from {file_path}: {e}")
            return {
                "filename": Path(file_path).name,
                "title": Path(file_path).stem,
                "author": "Unknown",
            }

    def _create_chunks(
        self, text: str, chunk_size: int, doc_id: str, filename: str
    ) -> List[DocumentChunk]:
        """Create text chunks of specified size"""
        chunks = []

        # Split text into sentences
        sentences = self._split_into_sentences(text)

        current_chunk = ""
        chunk_id = 0

        for sentence in sentences:
            # Estimate token count (rough approximation)
            estimated_tokens = len(sentence.split())

            if len(current_chunk.split()) + estimated_tokens <= chunk_size:
                current_chunk += sentence + " "
            else:
                # Save current chunk if not empty
                if current_chunk.strip():
                    chunks.append(
                        DocumentChunk(
                            text=current_chunk.strip(),
                            doc_id=doc_id,
                            filename=filename,
                            chunk_id=f"{doc_id}_{chunk_id}",
                            chunk_size=chunk_size,
                        )
                    )
                    chunk_id += 1

                # Start new chunk
                current_chunk = sentence + " "

        # Add the last chunk if not empty
        if current_chunk.strip():
            chunks.append(
                DocumentChunk(
                    text=current_chunk.strip(),
                    doc_id=doc_id,
                    filename=filename,
                    chunk_id=f"{doc_id}_{chunk_id}",
                    chunk_size=chunk_size,
                )
            )

        return chunks

    def _split_into_sentences(self, text: str) -> List[str]:
        """Split text into sentences"""
        # Simple sentence splitting
        sentences = re.split(r"[.!?]+", text)

        # Clean and filter sentences
        cleaned_sentences = []
        for sentence in sentences:
            sentence = sentence.strip()
            if sentence and len(sentence.split()) > 3:  # Minimum 3 words
                cleaned_sentences.append(sentence)

        return cleaned_sentences

    def preprocess_query(self, query: str) -> str:
        """Preprocess query text"""
        # Convert to lowercase
        query = query.lower()

        # Remove punctuation
        query = re.sub(r"[^\w\s]", "", query)

        # Remove stop words
        words = query.split()
        filtered_words = [word for word in words if word not in self.stop_words]

        return " ".join(filtered_words)

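A short usage sketch for the processor above; the call signatures follow the class as defined in this file, while the document path is a placeholder standing in for any of the bundled quarterly reports:

```python
# Hypothetical usage of SimplePDFProcessor.
from pdf_processor import SimplePDFProcessor

processor = SimplePDFProcessor()
doc = processor.process_document("RIL-Q1-FY2024-25.pdf", chunk_sizes=[100, 400])

print(doc.title, doc.author)
print(f"{len(doc.chunks)} chunks created")
print(processor.preprocess_query("What is the revenue for this quarter?"))
```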
rag_system.py
ADDED
@@ -0,0 +1,547 @@
#!/usr/bin/env python3
"""
Simplified RAG System for Hugging Face Spaces

This module provides a simplified RAG system using:
- FAISS for vector storage
- BM25 for sparse retrieval
- Hybrid search combining both
- Qwen 2.5 1.5B for generation
"""

import os
import pickle
import time
from typing import List, Dict, Optional
from dataclasses import dataclass
import numpy as np
import torch
from loguru import logger
import threading

# Import required libraries
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi
import faiss
from transformers import AutoTokenizer, AutoModelForCausalLM


@dataclass
class DocumentChunk:
    """Represents a document chunk"""

    text: str
    doc_id: str
    filename: str
    chunk_id: str
    chunk_size: int


@dataclass
class SearchResult:
    """Represents a search result"""

    text: str
    score: float
    doc_id: str
    filename: str
    search_method: str
    dense_score: Optional[float] = None
    sparse_score: Optional[float] = None


@dataclass
class RAGResponse:
    """Represents a RAG response"""

    answer: str
    confidence: float
    search_results: List[SearchResult]
    method_used: str
    response_time: float
    query: str


class SimpleRAGSystem:
    """Simplified RAG system for Hugging Face Spaces"""

    def __init__(
        self,
        embedding_model: str = "all-MiniLM-L6-v2",
        generative_model: str = "Qwen/Qwen2.5-1.5B-Instruct",
        chunk_sizes: List[int] = None,
        vector_store_path: str = "./vector_store",
    ):
        """
        Initialize the RAG system

        Args:
            embedding_model: Sentence transformer model for embeddings
            generative_model: Language model for generation
            chunk_sizes: List of chunk sizes to use
            vector_store_path: Path to store FAISS index and metadata
        """
        self.embedding_model = embedding_model
        self.generative_model = generative_model
        self.chunk_sizes = chunk_sizes or [100, 400]
        self.vector_store_path = vector_store_path

        # Initialize components
        self.embedder = None
        self.tokenizer = None
        self.model = None
        self.faiss_index = None
        self.bm25 = None
        self.documents = []
        self.chunks = []
        self._lock = threading.Lock()  # Thread safety for concurrent loading

        # Create vector store directory
        os.makedirs(vector_store_path, exist_ok=True)

        # Load or initialize components
        self._load_models()
        self._load_or_create_index()

        logger.info("Simple RAG system initialized successfully!")

    def _load_models(self):
        """Load embedding and generative models"""
        try:
            # Load embedding model
            self.embedder = SentenceTransformer(self.embedding_model)
            self.vector_size = self.embedder.get_sentence_embedding_dimension()

            # Load generative model with fallback
            model_loaded = False

            # Try Qwen model first
            try:
                self.tokenizer = AutoTokenizer.from_pretrained(
                    self.generative_model,
                    trust_remote_code=True,
                    padding_side="left",
                )

                # Load model with explicit CPU configuration
                self.model = AutoModelForCausalLM.from_pretrained(
                    self.generative_model,
                    trust_remote_code=True,
                    torch_dtype=torch.float32,
                    device_map=None,
                    low_cpu_mem_usage=False,
                )

                # Move to CPU explicitly
                self.model = self.model.to("cpu")
                model_loaded = True

            except Exception as e:
                logger.warning(f"Failed to load Qwen model: {e}")

            # Fall back to distilgpt2 if Qwen fails
            if not model_loaded:
                logger.info("Falling back to distilgpt2...")
                self.generative_model = "distilgpt2"
                try:
                    self.tokenizer = AutoTokenizer.from_pretrained(
                        self.generative_model,
                        trust_remote_code=True,
                        padding_side="left",
                    )
                    self.model = AutoModelForCausalLM.from_pretrained(
                        self.generative_model,
                        trust_remote_code=True,
                    )
                    # Ensure fallback model is also on CPU
                    self.model = self.model.to("cpu")
                    model_loaded = True
                except Exception as e:
                    logger.error(f"Failed to load distilgpt2: {e}")
                    raise Exception("Could not load any generative model")

            # Set pad token for tokenizer
            if self.tokenizer.pad_token is None:
                self.tokenizer.pad_token = self.tokenizer.eos_token
                self.tokenizer.pad_token_id = self.tokenizer.eos_token_id

            logger.info("✅ Models loaded successfully")
            logger.info(f"  - Embedding: {self.embedding_model}")
            logger.info(f"  - Generative: {self.generative_model}")

        except Exception as e:
            logger.error(f"❌ Failed to load models: {e}")
            raise

    def _load_or_create_index(self):
        """Load existing FAISS index or create new one"""
        faiss_path = os.path.join(self.vector_store_path, "faiss_index.bin")
        metadata_path = os.path.join(self.vector_store_path, "metadata.pkl")

        if os.path.exists(faiss_path) and os.path.exists(metadata_path):
            # Load existing index
            try:
                self.faiss_index = faiss.read_index(faiss_path)
                with open(metadata_path, "rb") as f:
                    metadata = pickle.load(f)
                    self.documents = metadata.get("documents", [])
                    self.chunks = metadata.get("chunks", [])

                # Rebuild BM25
                if self.chunks:
                    texts = [chunk.text for chunk in self.chunks]
                    tokenized_texts = [text.lower().split() for text in texts]
                    self.bm25 = BM25Okapi(tokenized_texts)

                logger.info(f"✅ Loaded existing index with {len(self.chunks)} chunks")
            except Exception as e:
                logger.warning(f"Failed to load existing index: {e}")
                self._create_new_index()
        else:
            self._create_new_index()

    def _create_new_index(self):
        """Create new FAISS index"""
        vector_size = self.embedder.get_sentence_embedding_dimension()
        # Inner product over L2-normalized embeddings equals cosine similarity
        self.faiss_index = faiss.IndexFlatIP(vector_size)
        self.bm25 = None
        logger.info(f"✅ Created new FAISS index with dimension {vector_size}")

    def _save_index(self):
        """Save FAISS index and metadata"""
        try:
            # Save FAISS index
            faiss_path = os.path.join(self.vector_store_path, "faiss_index.bin")
            faiss.write_index(self.faiss_index, faiss_path)

            # Save metadata
            metadata_path = os.path.join(self.vector_store_path, "metadata.pkl")
            metadata = {"documents": self.documents, "chunks": self.chunks}
            with open(metadata_path, "wb") as f:
                pickle.dump(metadata, f)

            logger.info("✅ Index saved successfully")
        except Exception as e:
            logger.error(f"❌ Failed to save index: {e}")

    def add_document(self, file_path: str, filename: str) -> bool:
        """
        Add a document to the RAG system

        Args:
            file_path: Path to the PDF file
            filename: Name of the file

        Returns:
            True if successful, False otherwise
        """
        try:
            from pdf_processor import SimplePDFProcessor

            # Process the document
            processor = SimplePDFProcessor()
            processed_doc = processor.process_document(file_path, self.chunk_sizes)

            # Thread-safe document addition
            with self._lock:
                # Add document to list
                self.documents.append(
                    {
                        "filename": filename,
                        "title": processed_doc.title,
                        "author": processed_doc.author,
                        "file_path": file_path,
                    }
                )

                # Add chunks
                for chunk in processed_doc.chunks:
                    self.chunks.append(chunk)

                # Update embeddings and BM25
                self._update_embeddings()
                self._update_bm25()

                # Save index
                self._save_index()

            logger.info(
                f"✅ Added document: {filename} ({len(processed_doc.chunks)} chunks)"
            )
            return True

        except Exception as e:
            logger.error(f"❌ Failed to add document {filename}: {e}")
            return False

    def _update_embeddings(self):
        """Rebuild the FAISS index from the current chunk list"""
        if not self.chunks:
            return

        # Rebuild from scratch so repeated add_document calls do not insert
        # duplicate vectors for chunks that are already indexed
        self.faiss_index.reset()
        texts = [chunk.text for chunk in self.chunks]
        # Normalize embeddings so the inner-product index scores by cosine
        embeddings = self.embedder.encode(
            texts, show_progress_bar=False, normalize_embeddings=True
        )

        # Add to FAISS index
        self.faiss_index.add(embeddings.astype("float32"))

    def _update_bm25(self):
        """Update BM25 index with new chunks"""
        if not self.chunks:
            return

        # Rebuild BM25 with all chunks
        texts = [chunk.text for chunk in self.chunks]
        tokenized_texts = [text.lower().split() for text in texts]
        self.bm25 = BM25Okapi(tokenized_texts)

    def search(
        self, query: str, method: str = "hybrid", top_k: int = 5
    ) -> List[SearchResult]:
        """
        Search for relevant documents

        Args:
            query: Search query
            method: Search method (hybrid, dense, sparse)
            top_k: Number of results to return

        Returns:
            List of search results
        """
        if not self.chunks:
            return []

        results = []

        if method == "dense" or method == "hybrid":
            # Dense search using FAISS (query normalized to match the index)
            query_embedding = self.embedder.encode([query], normalize_embeddings=True)
            scores, indices = self.faiss_index.search(
                query_embedding.astype("float32"), min(top_k, len(self.chunks))
            )

            for score, idx in zip(scores[0], indices[0]):
                if idx < len(self.chunks):
                    chunk = self.chunks[idx]
                    results.append(
                        SearchResult(
                            text=chunk.text,
                            score=float(score),
                            doc_id=chunk.doc_id,
                            filename=chunk.filename,
                            search_method="dense",
                            dense_score=float(score),
                        )
                    )

        if method == "sparse" or method == "hybrid":
            # Sparse search using BM25
            if self.bm25:
                tokenized_query = query.lower().split()
                bm25_scores = self.bm25.get_scores(tokenized_query)

                # Get top BM25 results
                top_indices = np.argsort(bm25_scores)[::-1][:top_k]

                for idx in top_indices:
                    if idx < len(self.chunks):
                        chunk = self.chunks[idx]
                        score = float(bm25_scores[idx])

                        # Check if result already exists
                        existing_result = next(
                            (
                                r
                                for r in results
                                if r.doc_id == chunk.doc_id and r.text == chunk.text
                            ),
                            None,
                        )

                        if existing_result:
                            # Update existing result with sparse score
                            existing_result.sparse_score = score
                            if method == "hybrid":
                                # Combine scores for hybrid
                                existing_result.score = (
                                    existing_result.dense_score + score
                                ) / 2
                        else:
                            results.append(
                                SearchResult(
                                    text=chunk.text,
                                    score=score,
                                    doc_id=chunk.doc_id,
                                    filename=chunk.filename,
                                    search_method="sparse",
                                    sparse_score=score,
                                )
                            )

        # Sort by score and return top_k
        results.sort(key=lambda x: x.score, reverse=True)
        return results[:top_k]

    def generate_response(self, query: str, context: str) -> str:
        """
        Generate response using the language model

        Args:
            query: User query
            context: Retrieved context

        Returns:
            Generated response
        """
        try:
            # Prepare prompt
            if hasattr(self.tokenizer, "apply_chat_template"):
                # Use chat template for Qwen
                messages = [
                    {
                        "role": "system",
                        "content": "You are a helpful AI assistant. Use the provided context to answer the user's question accurately and concisely. If the context doesn't contain enough information to answer the question, say so.",
                    },
                    {
                        "role": "user",
                        "content": f"Context: {context}\n\nQuestion: {query}",
                    },
                ]
                prompt = self.tokenizer.apply_chat_template(
                    messages, tokenize=False, add_generation_prompt=True
                )
            else:
                # Fallback for non-chat models
                prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"

            # Tokenize
            tokenized = self.tokenizer(
                prompt,
                return_tensors="pt",
                truncation=True,
                max_length=1024,
                padding=True,
                return_attention_mask=True,
            )

            # Generate response
            with torch.no_grad():
                try:
                    outputs = self.model.generate(
                        tokenized.input_ids,
                        attention_mask=tokenized.attention_mask,
                        max_new_tokens=512,
                        num_return_sequences=1,
                        temperature=0.7,
                        do_sample=True,
                        pad_token_id=self.tokenizer.pad_token_id,
                        eos_token_id=self.tokenizer.eos_token_id,
                    )
                except RuntimeError as e:
                    if "Half" in str(e):
                        logger.warning(
                            "Half precision not supported on CPU, converting to float32"
                        )
                        # Convert model to float32 and retry
                        self.model = self.model.float()
                        outputs = self.model.generate(
                            tokenized.input_ids,
                            attention_mask=tokenized.attention_mask,
                            max_new_tokens=512,
                            num_return_sequences=1,
                            temperature=0.7,
                            do_sample=True,
                            pad_token_id=self.tokenizer.pad_token_id,
                            eos_token_id=self.tokenizer.eos_token_id,
                        )
                    else:
                        raise

            # Decode only the newly generated tokens; generate() returns the
            # prompt plus the continuation, and with skip_special_tokens=True
            # the chat-template markers are stripped before they could be used
            # to split off the answer, so slicing by token position is the
            # reliable way to keep the prompt out of the response
            generated_tokens = outputs[0][tokenized.input_ids.shape[1]:]
            response = self.tokenizer.decode(generated_tokens, skip_special_tokens=True)

            return response.strip()

        except Exception as e:
            logger.error(f"Error generating response: {e}")
            return f"Error generating response: {str(e)}"

    def query(self, query: str, method: str = "hybrid", top_k: int = 5) -> RAGResponse:
        """
        Query the RAG system

        Args:
            query: User query
            method: Search method
            top_k: Number of results

        Returns:
            RAG response
        """
        start_time = time.time()

        # Search for relevant documents
        search_results = self.search(query, method, top_k)

        if not search_results:
            return RAGResponse(
                answer="I couldn't find any relevant information to answer your question.",
                confidence=0.0,
                search_results=[],
                method_used=method,
                response_time=time.time() - start_time,
                query=query,
            )

        # Combine context from search results
        context = "\n\n".join([result.text for result in search_results])

        # Generate response
        answer = self.generate_response(query, context)

        # Calculate confidence (simple heuristic: mean retrieval score)
        confidence = float(np.mean([result.score for result in search_results]))

        return RAGResponse(
            answer=answer,
            confidence=confidence,
            search_results=search_results,
            method_used=method,
            response_time=time.time() - start_time,
            query=query,
        )

    def get_stats(self) -> Dict:
        """Get system statistics"""
        return {
            "total_documents": len(self.documents),
            "total_chunks": len(self.chunks),
            "vector_size": (
                self.embedder.get_sentence_embedding_dimension() if self.embedder else 0
            ),
            "model_name": self.generative_model,
            "embedding_model": self.embedding_model,
            "chunk_sizes": self.chunk_sizes,
        }

    def clear(self):
        """Clear all documents and reset the system"""
        self.documents = []
        self.chunks = []
        self._create_new_index()
        self._save_index()
        logger.info("✅ System cleared successfully")
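
Taken together, a minimal end-to-end sketch of how this class is meant to be driven. It assumes one of the bundled RIL quarterly PDFs sits next to the script (the filename below is one of the uploaded files) and that the first run can download the embedding and Qwen models from the Hub:

```python
# Minimal usage sketch for SimpleRAGSystem; the PDF filename is an example
# taken from the files uploaded alongside this code.
from rag_system import SimpleRAGSystem

rag = SimpleRAGSystem()
rag.add_document("RIL-Q1-FY2025-26.pdf", "RIL-Q1-FY2025-26.pdf")

response = rag.query(
    "What was the consolidated revenue this quarter?", method="hybrid", top_k=5
)
print(f"[{response.method_used}, {response.response_time:.1f}s, "
      f"confidence {response.confidence:.2f}]")
print(response.answer)
for r in response.search_results:
    print(f"- {r.filename} ({r.search_method}): {r.score:.3f}")
```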
requirements.txt
CHANGED
@@ -1,3 +1,15 @@
# Core dependencies for Docker deployment
streamlit==1.28.1
torch==2.1.0
transformers>=4.36.0
sentence-transformers==2.2.2
faiss-cpu==1.7.4
scikit-learn==1.3.2
rank-bm25==0.2.2
pypdf==3.17.1
pandas==2.1.3
numpy==1.24.3
loguru==0.7.2
tqdm==4.66.1
accelerate==0.24.1
huggingface-hub==0.19.4
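
Since every dependency above is pinned exactly (only `transformers` uses a floor), a quick sanity check that the environment actually matches the pins can be run before building the image. This is a minimal sketch using only the standard library:

```python
# Sketch: compare installed versions against the == pins in requirements.txt
from importlib.metadata import version, PackageNotFoundError

with open("requirements.txt") as f:
    for line in f:
        line = line.strip()
        if "==" not in line or line.startswith("#"):
            continue  # skip comments and non-exact specifiers like >=
        name, pinned = line.split("==")
        try:
            installed = version(name)
            mark = "OK  " if installed == pinned else "DIFF"
            print(f"{mark} {name}: pinned {pinned}, installed {installed}")
        except PackageNotFoundError:
            print(f"MISSING {name}")
```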
test_deployment.py
ADDED
@@ -0,0 +1,293 @@
#!/usr/bin/env python3
"""
Test script for Hugging Face deployment

This script tests if all components are working correctly for deployment.
"""

import os


def test_imports():
    """Test if all required packages can be imported"""
    print("🔍 Testing imports...")

    try:
        import streamlit

        print(f"✅ Streamlit: {streamlit.__version__}")
    except ImportError as e:
        print(f"❌ Streamlit import failed: {e}")
        return False

    try:
        import torch

        print(f"✅ PyTorch: {torch.__version__}")
    except ImportError as e:
        print(f"❌ PyTorch import failed: {e}")
        return False

    try:
        import transformers

        print(f"✅ Transformers: {transformers.__version__}")
    except ImportError as e:
        print(f"❌ Transformers import failed: {e}")
        return False

    try:
        import sentence_transformers

        print(f"✅ Sentence Transformers: {sentence_transformers.__version__}")
    except ImportError as e:
        print(f"❌ Sentence Transformers import failed: {e}")
        return False

    try:
        import faiss

        print(f"✅ FAISS: {faiss.__version__}")
    except ImportError as e:
        print(f"❌ FAISS import failed: {e}")
        return False

    try:
        import rank_bm25

        print("✅ Rank BM25")
    except ImportError as e:
        print(f"❌ Rank BM25 import failed: {e}")
        return False

    try:
        import pypdf

        print(f"✅ PyPDF: {pypdf.__version__}")
    except ImportError as e:
        print(f"❌ PyPDF import failed: {e}")
        return False

    return True


def test_rag_system():
    """Test the RAG system"""
    print("\n🔍 Testing RAG system...")

    try:
        from rag_system import SimpleRAGSystem

        # Test initialization
        rag = SimpleRAGSystem()
        print("✅ RAG system initialized")

        # Test stats
        stats = rag.get_stats()
        print(f"✅ Stats retrieved: {stats}")

        return True

    except Exception as e:
        print(f"❌ RAG system test failed: {e}")
        return False


def test_pdf_processor():
    """Test the PDF processor"""
    print("\n🔍 Testing PDF processor...")

    try:
        from pdf_processor import SimplePDFProcessor

        # Test initialization
        processor = SimplePDFProcessor()
        print("✅ PDF processor initialized")

        # Test query preprocessing
        processed_query = processor.preprocess_query("What is the revenue?")
        print(f"✅ Query preprocessing: '{processed_query}'")

        return True

    except Exception as e:
        print(f"❌ PDF processor test failed: {e}")
        return False


def test_model_loading():
    """Test if models can be loaded"""
    print("\n🔍 Testing model loading...")

    try:
        from sentence_transformers import SentenceTransformer
        from transformers import AutoTokenizer, AutoModelForCausalLM

        # Test embedding model
        embedder = SentenceTransformer("all-MiniLM-L6-v2")
        print("✅ Embedding model loaded")

        # Test tokenizer
        tokenizer = AutoTokenizer.from_pretrained(
            "Qwen/Qwen2.5-1.5B-Instruct", trust_remote_code=True
        )
        print("✅ Tokenizer loaded")

        # Test model (CPU only for testing)
        model = AutoModelForCausalLM.from_pretrained(
            "Qwen/Qwen2.5-1.5B-Instruct",
            trust_remote_code=True,
            torch_dtype="auto",
            device_map="cpu",
        )
        print("✅ Generative model loaded")

        return True

    except Exception as e:
        print(f"❌ Model loading failed: {e}")
        return False


def test_streamlit_app():
    """Test if Streamlit app can be imported"""
    print("\n🔍 Testing Streamlit app...")

    try:
        # Test if app.py can be imported
        import app

        print("✅ Streamlit app imported successfully")

        return True

    except Exception as e:
        print(f"❌ Streamlit app test failed: {e}")
        return False


def test_file_structure():
    """Test if all required files exist"""
    print("\n🔍 Testing file structure...")

    required_files = [
        "app.py",
        "rag_system.py",
        "pdf_processor.py",
        "requirements.txt",
        "README.md",
    ]

    missing_files = []
    for file in required_files:
        if os.path.exists(file):
            print(f"✅ {file}")
        else:
            print(f"❌ {file} (missing)")
            missing_files.append(file)

    if missing_files:
        print(f"❌ Missing files: {missing_files}")
        return False

    return True


def test_requirements():
    """Test if requirements.txt is valid"""
    print("\n🔍 Testing requirements.txt...")

    try:
        with open("requirements.txt", "r") as f:
            requirements = f.read()

        # Check for essential packages
        essential_packages = [
            "streamlit",
            "torch",
            "transformers",
            "sentence-transformers",
            "faiss-cpu",
            "rank-bm25",
            "pypdf",
        ]

        missing_packages = []
        for package in essential_packages:
            if package in requirements:
                print(f"✅ {package}")
            else:
                print(f"❌ {package} (missing)")
                missing_packages.append(package)

        if missing_packages:
            print(f"❌ Missing packages: {missing_packages}")
            return False

        return True

    except Exception as e:
        print(f"❌ Requirements test failed: {e}")
        return False


def main():
    """Run all tests"""
    print("🚀 Hugging Face Deployment Test\n")

    tests = [
        ("File Structure", test_file_structure),
        ("Requirements", test_requirements),
        ("Imports", test_imports),
        ("Model Loading", test_model_loading),
        ("PDF Processor", test_pdf_processor),
        ("RAG System", test_rag_system),
        ("Streamlit App", test_streamlit_app),
    ]

    results = []
    for test_name, test_func in tests:
        try:
            result = test_func()
            results.append((test_name, result))
        except Exception as e:
            print(f"❌ {test_name} test failed with exception: {e}")
            results.append((test_name, False))

    # Summary
    print("\n" + "=" * 50)
    print("📊 Test Results Summary")
    print("=" * 50)

    passed = 0
    total = len(results)

    for test_name, result in results:
        status = "✅ PASS" if result else "❌ FAIL"
        print(f"{test_name:20} {status}")
        if result:
            passed += 1

    print(f"\nOverall: {passed}/{total} tests passed")

    if passed == total:
        print("🎉 All tests passed! Ready for Hugging Face deployment.")
        print("\nNext steps:")
        print("1. Create a new Hugging Face Space")
        print("2. Upload all files from this directory")
        print("3. Set the SDK to 'Streamlit'")
        print("4. Deploy and test your RAG system!")
    else:
        print("⚠️ Some tests failed. Please fix the issues before deployment.")
        print("\nTroubleshooting:")
        print("1. Install missing dependencies: pip install -r requirements.txt")
        print("2. Check file permissions and paths")
        print("3. Verify model download permissions")
        print("4. Test locally first: streamlit run app.py")


if __name__ == "__main__":
    main()
test_docker.py
ADDED
@@ -0,0 +1,290 @@
#!/usr/bin/env python3
"""
Test script for Docker deployment

This script tests if all components are working correctly for Docker deployment.
"""

import os
import subprocess
from pathlib import Path


def test_dockerfile():
    """Test if Dockerfile exists and is valid"""
    print("🔍 Testing Dockerfile...")

    dockerfile_path = Path("Dockerfile")
    if not dockerfile_path.exists():
        print("❌ Dockerfile not found")
        return False

    try:
        with open(dockerfile_path, "r") as f:
            content = f.read()

        # Check for essential Dockerfile components
        required_components = [
            "FROM python:",
            "WORKDIR /app",
            "COPY requirements.txt",
            "RUN pip install",
            "COPY .",
            "EXPOSE 8501",
            'CMD ["streamlit"',
        ]

        missing_components = []
        for component in required_components:
            if component in content:
                print(f"✅ {component}")
            else:
                print(f"❌ {component} (missing)")
                missing_components.append(component)

        if missing_components:
            print(f"❌ Missing Dockerfile components: {missing_components}")
            return False

        return True

    except Exception as e:
        print(f"❌ Dockerfile test failed: {e}")
        return False


def test_dockerignore():
    """Test if .dockerignore exists"""
    print("\n🔍 Testing .dockerignore...")

    dockerignore_path = Path(".dockerignore")
    if dockerignore_path.exists():
        print("✅ .dockerignore exists")
        return True
    else:
        print("⚠️ .dockerignore not found (optional but recommended)")
        return True


def test_docker_compose():
    """Test if docker-compose.yml exists"""
    print("\n🔍 Testing docker-compose.yml...")

    compose_path = Path("docker-compose.yml")
    if compose_path.exists():
        print("✅ docker-compose.yml exists")
        return True
    else:
        print("⚠️ docker-compose.yml not found (optional)")
        return True


def test_docker_build():
    """Test Docker build locally"""
    print("\n🔍 Testing Docker build...")

    try:
        # Test Docker build
        result = subprocess.run(
            ["docker", "build", "-t", "rag-system-test", "."],
            capture_output=True,
            text=True,
            timeout=300,  # 5 minutes timeout
        )

        if result.returncode == 0:
            print("✅ Docker build successful")
            return True
        else:
            print(f"❌ Docker build failed: {result.stderr}")
            return False

    except subprocess.TimeoutExpired:
        print("❌ Docker build timed out")
        return False
    except FileNotFoundError:
        print("⚠️ Docker not installed or not in PATH")
        return False
    except Exception as e:
        print(f"❌ Docker build test failed: {e}")
        return False


def test_docker_run():
    """Test Docker run locally"""
    print("\n🔍 Testing Docker run...")

    try:
        # Start the image detached (brief smoke test)
        result = subprocess.run(
            [
                "docker",
                "run",
                "--rm",
                "-d",
                "-p",
                "8501:8501",
                "--name",
                "rag-test",
                "rag-system-test",
            ],
            capture_output=True,
            text=True,
            timeout=30,
        )

        if result.returncode == 0:
            print("✅ Docker run successful")

            # Clean up (--rm removes the container once stopped)
            subprocess.run(["docker", "stop", "rag-test"], capture_output=True)
            return True
        else:
            print(f"❌ Docker run failed: {result.stderr}")
            return False

    except subprocess.TimeoutExpired:
        print("❌ Docker run timed out")
        return False
    except FileNotFoundError:
        print("⚠️ Docker not installed or not in PATH")
        return False
    except Exception as e:
        print(f"❌ Docker run test failed: {e}")
        return False


def test_file_structure():
    """Test if all required files exist"""
    print("\n🔍 Testing file structure...")

    required_files = [
        "app.py",
        "rag_system.py",
        "pdf_processor.py",
        "requirements.txt",
        "Dockerfile",
    ]

    optional_files = [".dockerignore", "docker-compose.yml", "README.md"]

    missing_required = []
    missing_optional = []

    for file in required_files:
        if os.path.exists(file):
            print(f"✅ {file}")
        else:
            print(f"❌ {file} (missing)")
            missing_required.append(file)

    for file in optional_files:
        if os.path.exists(file):
            print(f"✅ {file}")
        else:
            print(f"⚠️ {file} (optional)")
            missing_optional.append(file)

    if missing_required:
        print(f"❌ Missing required files: {missing_required}")
        return False

    return True


def test_requirements():
    """Test if requirements.txt is valid"""
    print("\n🔍 Testing requirements.txt...")

    try:
        with open("requirements.txt", "r") as f:
            requirements = f.read()

        # Check for essential packages
        essential_packages = [
            "streamlit",
            "torch",
            "transformers",
            "sentence-transformers",
            "faiss-cpu",
            "rank-bm25",
            "pypdf",
        ]

        missing_packages = []
        for package in essential_packages:
            if package in requirements:
                print(f"✅ {package}")
            else:
                print(f"❌ {package} (missing)")
                missing_packages.append(package)

        if missing_packages:
            print(f"❌ Missing packages: {missing_packages}")
            return False

        return True

    except Exception as e:
        print(f"❌ Requirements test failed: {e}")
        return False


def main():
    """Run all tests"""
    print("🐳 Docker Deployment Test\n")

    tests = [
        ("File Structure", test_file_structure),
        ("Requirements", test_requirements),
        ("Dockerfile", test_dockerfile),
        (".dockerignore", test_dockerignore),
        ("docker-compose.yml", test_docker_compose),
        ("Docker Build", test_docker_build),
        ("Docker Run", test_docker_run),
    ]

    results = []
    for test_name, test_func in tests:
        try:
            result = test_func()
            results.append((test_name, result))
        except Exception as e:
            print(f"❌ {test_name} test failed with exception: {e}")
            results.append((test_name, False))

    # Summary
    print("\n" + "=" * 50)
    print("📊 Test Results Summary")
    print("=" * 50)

    passed = 0
    total = len(results)

    for test_name, result in results:
        status = "✅ PASS" if result else "❌ FAIL"
        print(f"{test_name:20} {status}")
        if result:
            passed += 1

    print(f"\nOverall: {passed}/{total} tests passed")

    if passed == total:
        print("🎉 All tests passed! Ready for Hugging Face Docker deployment.")
        print("\nNext steps:")
        print("1. Create a new Hugging Face Space with Docker SDK")
        print("2. Upload all files from this directory")
        print("3. Wait for Docker build to complete")
        print("4. Test your RAG system!")
    else:
        print("⚠️ Some tests failed. Please fix the issues before deployment.")
        print("\nTroubleshooting:")
        print("1. Install Docker if not available")
        print("2. Check file permissions and paths")
        print("3. Verify Dockerfile syntax")
        print("4. Test Docker build locally: docker build -t rag-system .")


if __name__ == "__main__":
    main()
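
The Docker run test above only confirms that the container starts; it does not verify that the Streamlit server inside ever answers requests. A follow-up probe could poll Streamlit's health endpoint, which is `/_stcore/health` on recent Streamlit releases (including the 1.28.x pin used here). A minimal sketch using only the standard library:

```python
# Sketch: poll the Streamlit health endpoint after `docker run -p 8501:8501 ...`
# Assumes Streamlit serves /_stcore/health, true for current versions.
import time
import urllib.request
import urllib.error


def wait_for_streamlit(url="http://localhost:8501/_stcore/health", timeout=120):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    print("✅ Streamlit is up")
                    return True
        except (urllib.error.URLError, OSError):
            pass  # container may still be booting or loading models
        time.sleep(3)
    print("❌ Streamlit did not become healthy in time")
    return False


if __name__ == "__main__":
    wait_for_streamlit()
```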