Rahul-Samedavar's picture
added readmeeeee
ade6079
# ShastraDocs Preprocessing Package
An advanced document preprocessing pipeline for RAG (Retrieval-Augmented Generation) systems. This modular package handles document ingestion, text extraction, chunking, embedding generation, and vector storage for multiple document formats.
## πŸš€ Features
### Document Format Support
- **PDF**: Advanced text extraction with table handling and CID font support (Malayalam, complex scripts)
- **DOCX**: Complete Word document processing with tables and text boxes
- **PPTX**: PowerPoint extraction with OCR for images using OCR Space API
- **XLSX**: Excel spreadsheet processing with image OCR support
- **Images**: PNG, JPEG, JPG with table detection and OCR
- **Plain Text**: TXT and CSV file support
- **URLs**: Direct URL processing and Google Docs conversion
### Advanced Processing Capabilities
- **Smart Text Chunking**: Sentence-boundary aware chunking with configurable overlap
- **Embedding Generation**: Sentence transformer-based embeddings with batch processing
- **Vector Storage**: Qdrant integration for efficient similarity search
- **Table Extraction**: Automated table detection and formatting
- **OCR Integration**: OCR Space API for image text extraction
- **Metadata Management**: Comprehensive document metadata tracking
- **Parallel Processing**: Multi-threaded document processing
- **Caching**: Intelligent caching to avoid reprocessing
## πŸ“ Package Structure
```
preprocessing/
β”œβ”€β”€ __init__.py # Package initialization
β”œβ”€β”€ preprocessing.py # Main entry point and CLI
└── preprocessing_modules/
β”œβ”€β”€ __init__.py
β”œβ”€β”€ modular_preprocessor.py # Main orchestrator class
β”œβ”€β”€ file_downloader.py # Universal file downloading
β”œβ”€β”€ pdf_extractor.py # PDF text extraction
β”œβ”€β”€ docx_extractor.py # DOCX processing
β”œβ”€β”€ pptx_extractor.py # PowerPoint processing
β”œβ”€β”€ xlsx_extractor.py # Excel processing
β”œβ”€β”€ image_extractor.py # Image and table extraction
β”œβ”€β”€ text_chunker.py # Smart text chunking
β”œβ”€β”€ embedding_manager.py # Embedding generation
β”œβ”€β”€ vector_storage.py # Qdrant vector database
└── metadata_manager.py # Document metadata management
```
## πŸ› οΈ Installation
### Dependencies
Note: these packages are already included in requirements.txt of the project
```bash
# Core dependencies
pip install aiohttp asyncio numpy pandas pathlib
pip install sentence-transformers qdrant-client
pip install pdfplumber pymupdf python-docx python-pptx openpyxl
pip install opencv-python pytesseract pillow lxml
# For image processing
pip install opencv-python pytesseract pillow
# For document parsing
pip install pdfplumber pymupdf python-docx python-pptx openpyxl lxml
```
### Environment Variables
Create a `.env` file with the following:
```env
# Required for PowerPoint OCR
OCR_SPACE_API_KEY=your_ocr_space_api_key
# Optional: Custom paths
OUTPUT_DIR=./vector_db
EMBEDDING_MODEL=Bge-large-en #or any model
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
BATCH_SIZE=32
```
## πŸ”§ Configuration
The package uses `config/config.py` for configuration:
```python
# Embedding configuration
EMBEDDING_MODEL = "Bge-large-en" # Sentence transformer model
BATCH_SIZE = 32 # Embedding batch size
# Chunking configuration
CHUNK_SIZE = 1600 # Characters per chunk
CHUNK_OVERLAP = 500 # Overlap between chunks
# Storage configuration
OUTPUT_DIR = "./vector_db" # Vector database directory
# OCR configuration (for PPTX images)
OCR_SPACE_API_KEY = "your_api_key" # OCR Space API key
```
## πŸ“– Usage
### Basic Usage
```python
from preprocessing import ModularDocumentPreprocessor
# Initialize preprocessor
preprocessor = ModularDocumentPreprocessor()
# Process a single document
doc_id = await preprocessor.process_document("https://example.com/document.pdf")
# Process multiple documents
urls = [
"https://example.com/doc1.pdf",
"https://example.com/doc2.docx",
"https://example.com/presentation.pptx"
]
results = await preprocessor.process_multiple_documents(urls)
# Check processing status
info = preprocessor.get_document_info("https://example.com/document.pdf")
print(f"Document processed: {info}")
```
### Document Types and Return Values
```python
# Different document types return different formats
result = await preprocessor.process_document(url)
# Regular documents (PDF, DOCX, TXT)
if isinstance(result, str):
doc_id = result # Normal processing, returns document ID
# Special cases
elif isinstance(result, list):
content, doc_type = result[0], result[1]
if doc_type == 'oneshot':
# Small documents processed as single chunk
# Use content directly with LLM
elif doc_type == 'tabular':
# Excel/CSV with structured data
# Use content for data analysis
elif doc_type == 'image':
# Image file - content is file path
# Process with image_extractor
elif doc_type == 'unsupported':
# File format not supported
print(f"Unsupported format: {content}")
```
### Advanced Usage
```python
# Force reprocessing
doc_id = await preprocessor.process_document(url, force_reprocess=True)
# Custom timeout for large files
doc_id = await preprocessor.process_document(url, timeout=600) # 10 minutes
# Get system information
system_info = preprocessor.get_system_info()
print(f"Embedding model: {system_info['embedding_model']}")
# Get collection statistics
stats = preprocessor.get_collection_stats()
print(f"Total documents: {stats['total_documents']}")
print(f"Total chunks: {stats['total_chunks']}")
# List all processed documents
docs = preprocessor.list_processed_documents()
for doc_id, info in docs.items():
print(f"{doc_id}: {info['document_url']} ({info['chunk_count']} chunks)")
# Cleanup document
success = preprocessor.cleanup_document(url)
```
### Image Processing
```python
from preprocessing_modules.image_extractor import extract_image
# Extract text and tables from images
text_content = extract_image("path/to/image.png")
print(text_content)
# Output format:
# ### Non-Table Text:
# Regular text content from the image
#
# ### Table 1 (Markdown):
# | Column 1 | Column 2 | Column 3 |
# |----------|----------|----------|
# | Data 1 | Data 2 | Data 3 |
```
## 🎯 Command Line Interface
```bash
# Process a single document
python -m preprocessing --url "https://example.com/document.pdf"
# Process multiple documents from file
python -m preprocessing --urls-file urls.txt
# Force reprocessing
python -m preprocessing --url "https://example.com/document.pdf" --force
# List processed documents
python -m preprocessing --list
# Show collection statistics
python -m preprocessing --stats
```
### URLs File Format
```
https://example.com/doc1.pdf
https://example.com/doc2.docx
https://example.com/presentation.pptx
https://docs.google.com/document/d/abc123/edit?usp=sharing
```
## πŸ—οΈ Architecture
### Modular Design
The package follows a modular architecture with clear separation of concerns:
1. **File Downloader**: Handles downloading from various sources with retry logic
2. **Text Extractors**: Specialized extractors for each document format
3. **Text Chunker**: Smart chunking with sentence boundary detection
4. **Embedding Manager**: Generates embeddings using sentence transformers
5. **Vector Storage**: Manages Qdrant vector database operations
6. **Metadata Manager**: Tracks document processing metadata
### Processing Pipeline
```
URL/File β†’ Download β†’ Extract Text β†’ Chunk β†’ Generate Embeddings β†’ Store in Qdrant
↓
Save Metadata
```
### Document Processing Flow
1. **Download**: Securely download document to temporary location
2. **Format Detection**: Identify document type and select appropriate extractor
3. **Text Extraction**: Extract text content with format-specific handling
4. **Chunking**: Split text into overlapping chunks with smart boundaries
5. **Embedding**: Generate embeddings using sentence transformers
6. **Storage**: Store embeddings and metadata in Qdrant vector database
7. **Cleanup**: Remove temporary files and update registries
## πŸ“Š Supported Formats
| Format | Extension | Features | Special Handling |
|--------|-----------|----------|------------------|
| PDF | .pdf | Text, tables, complex scripts | CID font mapping, parallel processing |
| Word | .docx | Text, tables, text boxes | XML parsing, gridSpan handling |
| PowerPoint | .pptx | Text, images, tables, notes | OCR Space API for images |
| Excel | .xlsx | Cells, images | OpenPyXL, OCR for embedded images |
| Images | .png, .jpg, .jpeg | Text, tables | OpenCV table detection, OCR |
| Text | .txt, .csv | Plain text | Direct processing |
| URLs | http/https | Web content | Google Docs conversion |
## πŸ” Advanced Features
### Table Processing
- Automatic table detection in PDFs and images
- GridSpan handling for complex table structures
- Markdown formatting for structured output
- Cell content extraction with proper spacing
### CID Font Support
- Advanced handling of Malayalam and complex scripts
- Character mapping resolution
- Proper spacing and conjunct handling
- Fallback extraction methods
### OCR Integration
- OCR Space API for PowerPoint images
- Tesseract OCR for Excel images
- Batch processing for efficiency
- Error handling and fallback options
### Caching System
- Document-level caching to avoid reprocessing
- Chunk caching for repeated operations
- Temporary file management
- Automatic cleanup on exit
## πŸ›‘οΈ Error Handling
The package includes comprehensive error handling:
- **Network Issues**: Retry logic with exponential backoff
- **Corrupted Files**: Fallback extraction methods
- **Memory Issues**: Batch processing and streaming
- **Format Issues**: Multiple parser fallbacks
- **OCR Failures**: Graceful degradation with error messages
## πŸ“ˆ Performance
### Optimization Features
- **Parallel Processing**: Multi-threaded document processing
- **Batch Operations**: Efficient embedding generation
- **Streaming**: Memory-efficient large file handling
- **Caching**: Avoid redundant processing
- **Connection Pooling**: Efficient HTTP operations
### Benchmarks
- **PDF Processing**: ~2-5 pages/second (depends on complexity)
- **Embedding Generation**: ~100-500 chunks/second (depends on model)
- **Vector Storage**: ~1000+ vectors/second insertion rate
## πŸ”§ Troubleshooting
### Common Issues
1. **OCR Space API Errors**
```python
# Ensure API key is set
export OCR_SPACE_API_KEY="your_key_here"
```
2. **Tesseract Not Found**
```bash
# Install tesseract
apt-get install tesseract-ocr
# or
brew install tesseract
```
3. **Memory Issues with Large Files**
```python
# Reduce batch size in config
BATCH_SIZE = 16
```
4. **Vector Database Issues**
```python
# Check permissions on OUTPUT_DIR
# Ensure sufficient disk space
```
### Debug Mode
```python
import logging
logging.basicConfig(level=logging.DEBUG)
# Enable detailed logging for troubleshooting
```
## πŸ“„ License
This package is part of the ShastraDocs project. See the main project license for details.
*This preprocessing package is designed to handle the complex requirements of document processing in RAG systems, with a focus on reliability, performance, and format diversity.*