sdk_version: latest
app_file: app.py
pinned: false
app_port: 8501
---

# RAG System - Hugging Face Spaces

A comprehensive **Retrieval-Augmented Generation (RAG)** system that processes PDF documents and answers questions about them. It combines vector search, keyword matching, and large language models to provide intelligent document question answering.

## Features

### Core Functionality
- **PDF Processing**: Automatically loads and processes PDF documents with intelligent text extraction
- **Hybrid Search**: Combines FAISS vector search with BM25 keyword search for optimal retrieval
- **Multiple Retrieval Methods**: Choose from hybrid, dense, or sparse retrieval options
- **Advanced AI Models**: Uses Qwen 2.5 1.5B for intelligent response generation
- **Real-time Chat Interface**: Interactive Streamlit-based UI with conversation history
- **Parallel Document Loading**: Fast document processing with concurrent loading

### Technical Features
- **Thread Safety**: Safe concurrent document loading with proper locking
- **Persistent Storage**: Automatic index saving and loading across sessions
- **Smart Fallbacks**: Graceful model loading with alternative options
- **Performance Metrics**: Response times, confidence scores, and search result analysis
- **Error Handling**: Robust error handling and user feedback
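
The parallel loading and locking described above can be sketched roughly as follows. This is a minimal stand-in, not the app's actual implementation: `load_pdf` and the in-memory `index` are hypothetical placeholders for the real PDF loader and vector store.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

index = []                     # stand-in for the real vector store
index_lock = threading.Lock()  # guards concurrent writes to the index

def load_pdf(path: str) -> list[str]:
    # Hypothetical stand-in for real PDF text extraction and chunking.
    return [f"chunk from {path}"]

def load_documents(paths: list[str]) -> int:
    # Extract chunks from several PDFs concurrently,
    # but serialize updates to the shared index.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for chunks in pool.map(load_pdf, paths):
            with index_lock:
                index.extend(chunks)
    return len(index)
```

Because extraction is I/O-bound, threads overlap the slow file reads while the lock keeps index updates consistent.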

## Architecture

The RAG system follows a modular, scalable architecture:

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  PDF Documents  │    │ User Interface  │    │  Search Engine  │
│                 │    │   (Streamlit)   │    │                 │
└────────┬────────┘    └────────┬────────┘    └────────┬────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  PDF Processor  │    │   RAG System    │    │  Vector Store   │
│ - Text Extract  │    │ - Orchestration │    │     (FAISS)     │
│ - Cleaning      │    │ - Response Gen  │    │                 │
│ - Chunking      │    │ - Thread Safety │    └─────────────────┘
└─────────────────┘    └────────┬────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │ Language Model  │
                       │ (Qwen 2.5 1.5B) │
                       └─────────────────┘
```

## Technology Stack

### Core Technologies
- **Vector Database**: FAISS for efficient similarity search
- **Sparse Retrieval**: BM25 for keyword-based search
- **Embedding Model**: all-MiniLM-L6-v2 for document embeddings
- **Generative Model**: Qwen 2.5 1.5B for answer generation
- **UI Framework**: Streamlit for interactive interface
- **Containerization**: Docker for deployment
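
As a rough illustration of the sparse side of the stack, BM25 scores a query term by term against each document, saturating term frequency and normalizing by document length. This is a minimal pure-Python sketch of the classic Okapi formula, not the library the app actually uses:

```python
import math

def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    # Okapi BM25: term frequency saturated by k1, document length
    # normalized by b relative to the average document length.
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    scores = [0.0] * n
    for term in query:
        df = sum(term in d for d in docs)  # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        for i, d in enumerate(docs):
            tf = d.count(term)
            if tf:
                denom = tf + k1 * (1 - b + b * len(d) / avgdl)
                scores[i] += idf * tf * (k1 + 1) / denom
    return scores

docs = [["vector", "search", "faiss"], ["bm25", "keyword", "search"]]
print(bm25_scores(["bm25"], docs))  # only the second doc matches
```

Rare terms get a higher IDF weight, which is why BM25 shines on keyword-heavy queries where exact terminology matters.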

### Supporting Libraries
- **Data Processing**: Pandas, NumPy for data manipulation
- **PDF Handling**: PyPDF for document processing
- **ML Utilities**: Scikit-learn for preprocessing
- **Logging**: Loguru for structured logging
- **Optimization**: Accelerate for model optimization

## Quick Start

### 1. Using the Web Interface

1. **Wait for Initialization**: The system automatically loads pre-configured PDF documents
2. **Ask Questions**: Use the chat interface to ask questions about the documents
3. **Choose Method**: Select from hybrid, dense, or sparse retrieval methods
4. **View Results**: See answers with confidence scores and search results

### 2. Local Development

```bash
# Clone the repository
git clone <repository-url>
cd convAI

# Install dependencies
pip install -r requirements.txt

# Run the application
streamlit run app.py
```

### 3. Docker Deployment

```bash
# Build and run with Docker Compose
docker-compose up --build

# Or build and run manually
docker build -t rag-system .
docker run -p 8501:8501 rag-system
```

## Usage Guide

### Document Upload
- **Automatic Loading**: PDF documents bundled in the container are loaded automatically
- **Manual Upload**: Use the sidebar to upload additional PDF documents
- **Supported Formats**: PDF files with text content
| 116 |
+
|
| 117 |
+
### Search Methods
|
| 118 |
+
- **π Hybrid**: Combines vector similarity and keyword matching (recommended)
|
| 119 |
+
- **π― Dense**: Uses only vector similarity search
|
| 120 |
+
- **π Sparse**: Uses only keyword-based BM25 search
|
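
One common way to implement a hybrid method like the one above is a weighted fusion of normalized dense and sparse scores. This sketch assumes both retrievers return `{doc_id: score}` maps; the fusion the app actually performs may differ:

```python
def hybrid_scores(dense: dict[str, float], sparse: dict[str, float],
                  alpha: float = 0.5) -> dict[str, float]:
    # Min-max normalize each score map to [0, 1] so the two scales
    # are comparable, then blend them with weight alpha.
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {k: (v - lo) / span for k, v in scores.items()}

    d, s = normalize(dense), normalize(sparse)
    return {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
            for doc in set(d) | set(s)}
```

Setting `alpha=1.0` reduces this to pure dense retrieval and `alpha=0.0` to pure sparse, mirroring the three methods listed above.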
| 121 |
+
|
| 122 |
+
### Query Interface
|
| 123 |
+
- **Natural Language**: Ask questions in plain English
|
| 124 |
+
- **Context Awareness**: System uses retrieved documents for context
|
| 125 |
+
- **Confidence Scores**: See how confident the system is in its answers
|
| 126 |
+
- **Source Citations**: View which documents were used for the answer
|
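
A confidence score like the one surfaced in the interface can be derived from retrieval scores in several ways; a softmax over the top results is one simple possibility (a sketch, not necessarily what the app computes):

```python
import math

def confidence(scores: list[float]) -> float:
    # Softmax weight of the best-scoring result: near 1.0 when one
    # document clearly dominates, near 1/n when scores are flat.
    if not scores:
        return 0.0
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return max(exps) / sum(exps)
```

For example, two equally-scored results give 0.5 (ambiguous), while one dominant result pushes the value toward 1.0.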
| 127 |
+
|
| 128 |
+
## βοΈ Configuration
|
| 129 |
+
|
| 130 |
+
### Environment Variables
|
| 131 |
+
```bash
|
| 132 |
+
# Model Configuration
|
| 133 |
+
EMBEDDING_MODEL=all-MiniLM-L6-v2
|
| 134 |
+
GENERATIVE_MODEL=Qwen/Qwen2.5-1.5B-Instruct
|
| 135 |
+
|
| 136 |
+
# Chunk Sizes
|
| 137 |
+
CHUNK_SIZES=100,400
|
| 138 |
|
| 139 |
+
# Vector Store Path
|
| 140 |
+
VECTOR_STORE_PATH=./vector_store
|
|
|
|
|
|
|
|
|
|
|
|
|
| 141 |
|
| 142 |
+
# Streamlit Configuration
|
| 143 |
+
STREAMLIT_SERVER_PORT=8501
|
| 144 |
+
STREAMLIT_SERVER_ADDRESS=0.0.0.0
|
| 145 |
+
```
|
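
In app code, these variables would typically be read with defaults matching the values above; a minimal sketch (the real app's configuration handling may differ):

```python
import os

def load_config() -> dict:
    # Fall back to the documented defaults when a variable is unset.
    return {
        "embedding_model": os.getenv("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        "generative_model": os.getenv("GENERATIVE_MODEL",
                                      "Qwen/Qwen2.5-1.5B-Instruct"),
        "chunk_sizes": [int(s) for s in
                        os.getenv("CHUNK_SIZES", "100,400").split(",")],
        "vector_store_path": os.getenv("VECTOR_STORE_PATH", "./vector_store"),
    }
```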

### Performance Tuning
- **Chunk Sizes**: Adjust for different document types (smaller for technical docs, larger for narratives)
- **Top-k Results**: Increase for more comprehensive answers, decrease for faster responses
- **Model Selection**: Choose between Qwen 2.5 1.5B and distilgpt2 based on performance needs
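
The effect of the chunk-size knob can be seen in a minimal word-based chunker; the app's actual chunking strategy (e.g. overlap handling) may differ:

```python
def chunk_words(text: str, chunk_size: int) -> list[str]:
    # Split text into fixed-size word windows; the last chunk may be short.
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]
```

With `CHUNK_SIZES=100,400`, each document would be indexed at both granularities: small chunks for precise technical lookups, large ones for narrative context.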

## Performance

### Optimization Features
- **Parallel Processing**: Documents are loaded concurrently for faster initialization
- **Optimized Search**: Hybrid retrieval combines the best of vector and keyword search
- **Memory Efficient**: Uses CPU-optimized models for deployment compatibility
- **Caching**: The FAISS index and metadata are cached for faster subsequent queries

### Expected Performance
- **Document Loading**: ~2-5 seconds per PDF (depending on size)
- **Query Response**: ~1-3 seconds for typical questions
- **Memory Usage**: ~2-4 GB RAM for typical document collections
- **Storage**: ~100 MB per 1,000 document chunks
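
The caching behavior above can be sketched with a stdlib-only stand-in. The real system persists a FAISS index; here a plain list of chunks plays that role, and the file names are illustrative:

```python
import pickle
from pathlib import Path

STORE = Path("./vector_store")  # illustrative cache directory

def save_index(chunks: list[str]) -> None:
    # Persist chunk metadata so later sessions skip re-processing.
    STORE.mkdir(parents=True, exist_ok=True)
    with open(STORE / "chunks.pkl", "wb") as f:
        pickle.dump(chunks, f)

def load_index() -> list[str]:
    # Return the cached chunks, or an empty list on a cold start.
    path = STORE / "chunks.pkl"
    if not path.exists():
        return []
    with open(path, "rb") as f:
        return pickle.load(f)
```

On startup the app can call `load_index()` first and only rebuild from the PDFs when the cache is empty.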

## Development

### Project Structure

```
convAI/
├── app.py                 # Main Streamlit application
├── rag_system.py          # Core RAG system implementation
├── pdf_processor.py       # PDF processing utilities
├── requirements.txt       # Python dependencies
├── Dockerfile             # Container configuration
├── docker-compose.yml     # Multi-container setup
├── README.md              # This file
├── DEPLOYMENT_GUIDE.md    # Detailed deployment instructions
├── test_deployment.py     # Deployment testing script
├── test_docker.py         # Docker testing script
└── src/
    └── streamlit_app.py   # Sample Streamlit app
```

### Testing

```bash
# Test deployment readiness
python test_deployment.py

# Test Docker configuration
python test_docker.py

# Run the app locally
streamlit run app.py
```

## Troubleshooting

### Common Issues

1. **Model Loading Errors**
   - Check internet connectivity for model downloads
   - Verify sufficient disk space
   - Try the fallback model (distilgpt2)

2. **Memory Issues**
   - Reduce chunk sizes
   - Use smaller embedding models
   - Limit the number of documents

3. **Performance Issues**
   - Adjust the top-k parameter
   - Use sparse search for keyword-heavy queries
   - Consider hardware upgrades

4. **Docker Issues**
   - Check the Docker installation
   - Verify port availability
   - Check the container logs

### Getting Help
- Check the logs in your Space's "Logs" tab
- Review the deployment guide for common solutions
- Create an issue in the project repository

## Contributing

We welcome contributions! Please see our contributing guidelines for:
- Code style and standards
- Testing requirements
- Documentation updates
- Feature requests and bug reports

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

- **Hugging Face** for providing the platform and models
- **FAISS** team for the efficient vector search library
- **Streamlit** team for the excellent web framework
- **OpenAI** for inspiring the RAG architecture

---

*Built with ❤️ for efficient document question-answering*

**Ready to explore your documents? Start asking questions!**