Spaces:
Sleeping
Sleeping
| # ShastraDocs Preprocessing Package | |
| An advanced document preprocessing pipeline for RAG (Retrieval-Augmented Generation) systems. This modular package handles document ingestion, text extraction, chunking, embedding generation, and vector storage for multiple document formats. | |
| ## π Features | |
| ### Document Format Support | |
| - **PDF**: Advanced text extraction with table handling and CID font support (Malayalam, complex scripts) | |
| - **DOCX**: Complete Word document processing with tables and text boxes | |
| - **PPTX**: PowerPoint extraction with OCR for images using OCR Space API | |
| - **XLSX**: Excel spreadsheet processing with image OCR support | |
| - **Images**: PNG, JPEG, JPG with table detection and OCR | |
| - **Plain Text**: TXT and CSV file support | |
| - **URLs**: Direct URL processing and Google Docs conversion | |
| ### Advanced Processing Capabilities | |
| - **Smart Text Chunking**: Sentence-boundary aware chunking with configurable overlap | |
| - **Embedding Generation**: Sentence transformer-based embeddings with batch processing | |
| - **Vector Storage**: Qdrant integration for efficient similarity search | |
| - **Table Extraction**: Automated table detection and formatting | |
| - **OCR Integration**: OCR Space API for image text extraction | |
| - **Metadata Management**: Comprehensive document metadata tracking | |
| - **Parallel Processing**: Multi-threaded document processing | |
| - **Caching**: Intelligent caching to avoid reprocessing | |
| ## π Package Structure | |
| ``` | |
| preprocessing/ | |
| βββ __init__.py # Package initialization | |
| βββ preprocessing.py # Main entry point and CLI | |
| βββ preprocessing_modules/ | |
| βββ __init__.py | |
| βββ modular_preprocessor.py # Main orchestrator class | |
| βββ file_downloader.py # Universal file downloading | |
| βββ pdf_extractor.py # PDF text extraction | |
| βββ docx_extractor.py # DOCX processing | |
| βββ pptx_extractor.py # PowerPoint processing | |
| βββ xlsx_extractor.py # Excel processing | |
| βββ image_extractor.py # Image and table extraction | |
| βββ text_chunker.py # Smart text chunking | |
| βββ embedding_manager.py # Embedding generation | |
| βββ vector_storage.py # Qdrant vector database | |
| βββ metadata_manager.py # Document metadata management | |
| ``` | |
| ## π οΈ Installation | |
| ### Dependencies | |
| Note: these packages are already included in requirements.txt of the project | |
| ```bash | |
| # Core dependencies | |
| pip install aiohttp asyncio numpy pandas pathlib | |
| pip install sentence-transformers qdrant-client | |
| pip install pdfplumber pymupdf python-docx python-pptx openpyxl | |
| pip install opencv-python pytesseract pillow lxml | |
| # For image processing | |
| pip install opencv-python pytesseract pillow | |
| # For document parsing | |
| pip install pdfplumber pymupdf python-docx python-pptx openpyxl lxml | |
| ``` | |
| ### Environment Variables | |
| Create a `.env` file with the following: | |
| ```env | |
| # Required for PowerPoint OCR | |
| OCR_SPACE_API_KEY=your_ocr_space_api_key | |
| # Optional: Custom paths | |
| OUTPUT_DIR=./vector_db | |
| EMBEDDING_MODEL=Bge-large-en #or any model | |
| CHUNK_SIZE=1000 | |
| CHUNK_OVERLAP=200 | |
| BATCH_SIZE=32 | |
| ``` | |
| ## π§ Configuration | |
| The package uses `config/config.py` for configuration: | |
| ```python | |
| # Embedding configuration | |
| EMBEDDING_MODEL = "Bge-large-en" # Sentence transformer model | |
| BATCH_SIZE = 32 # Embedding batch size | |
| # Chunking configuration | |
| CHUNK_SIZE = 1600 # Characters per chunk | |
| CHUNK_OVERLAP = 500 # Overlap between chunks | |
| # Storage configuration | |
| OUTPUT_DIR = "./vector_db" # Vector database directory | |
| # OCR configuration (for PPTX images) | |
| OCR_SPACE_API_KEY = "your_api_key" # OCR Space API key | |
| ``` | |
| ## π Usage | |
| ### Basic Usage | |
| ```python | |
| from preprocessing import ModularDocumentPreprocessor | |
| # Initialize preprocessor | |
| preprocessor = ModularDocumentPreprocessor() | |
| # Process a single document | |
| doc_id = await preprocessor.process_document("https://example.com/document.pdf") | |
| # Process multiple documents | |
| urls = [ | |
| "https://example.com/doc1.pdf", | |
| "https://example.com/doc2.docx", | |
| "https://example.com/presentation.pptx" | |
| ] | |
| results = await preprocessor.process_multiple_documents(urls) | |
| # Check processing status | |
| info = preprocessor.get_document_info("https://example.com/document.pdf") | |
| print(f"Document processed: {info}") | |
| ``` | |
| ### Document Types and Return Values | |
| ```python | |
| # Different document types return different formats | |
| result = await preprocessor.process_document(url) | |
| # Regular documents (PDF, DOCX, TXT) | |
| if isinstance(result, str): | |
| doc_id = result # Normal processing, returns document ID | |
| # Special cases | |
| elif isinstance(result, list): | |
| content, doc_type = result[0], result[1] | |
| if doc_type == 'oneshot': | |
| # Small documents processed as single chunk | |
| # Use content directly with LLM | |
| elif doc_type == 'tabular': | |
| # Excel/CSV with structured data | |
| # Use content for data analysis | |
| elif doc_type == 'image': | |
| # Image file - content is file path | |
| # Process with image_extractor | |
| elif doc_type == 'unsupported': | |
| # File format not supported | |
| print(f"Unsupported format: {content}") | |
| ``` | |
| ### Advanced Usage | |
| ```python | |
| # Force reprocessing | |
| doc_id = await preprocessor.process_document(url, force_reprocess=True) | |
| # Custom timeout for large files | |
| doc_id = await preprocessor.process_document(url, timeout=600) # 10 minutes | |
| # Get system information | |
| system_info = preprocessor.get_system_info() | |
| print(f"Embedding model: {system_info['embedding_model']}") | |
| # Get collection statistics | |
| stats = preprocessor.get_collection_stats() | |
| print(f"Total documents: {stats['total_documents']}") | |
| print(f"Total chunks: {stats['total_chunks']}") | |
| # List all processed documents | |
| docs = preprocessor.list_processed_documents() | |
| for doc_id, info in docs.items(): | |
| print(f"{doc_id}: {info['document_url']} ({info['chunk_count']} chunks)") | |
| # Cleanup document | |
| success = preprocessor.cleanup_document(url) | |
| ``` | |
| ### Image Processing | |
| ```python | |
| from preprocessing_modules.image_extractor import extract_image | |
| # Extract text and tables from images | |
| text_content = extract_image("path/to/image.png") | |
| print(text_content) | |
| # Output format: | |
| # ### Non-Table Text: | |
| # Regular text content from the image | |
| # | |
| # ### Table 1 (Markdown): | |
| # | Column 1 | Column 2 | Column 3 | | |
| # |----------|----------|----------| | |
| # | Data 1 | Data 2 | Data 3 | | |
| ``` | |
| ## π― Command Line Interface | |
| ```bash | |
| # Process a single document | |
| python -m preprocessing --url "https://example.com/document.pdf" | |
| # Process multiple documents from file | |
| python -m preprocessing --urls-file urls.txt | |
| # Force reprocessing | |
| python -m preprocessing --url "https://example.com/document.pdf" --force | |
| # List processed documents | |
| python -m preprocessing --list | |
| # Show collection statistics | |
| python -m preprocessing --stats | |
| ``` | |
| ### URLs File Format | |
| ``` | |
| https://example.com/doc1.pdf | |
| https://example.com/doc2.docx | |
| https://example.com/presentation.pptx | |
| https://docs.google.com/document/d/abc123/edit?usp=sharing | |
| ``` | |
| ## ποΈ Architecture | |
| ### Modular Design | |
| The package follows a modular architecture with clear separation of concerns: | |
| 1. **File Downloader**: Handles downloading from various sources with retry logic | |
| 2. **Text Extractors**: Specialized extractors for each document format | |
| 3. **Text Chunker**: Smart chunking with sentence boundary detection | |
| 4. **Embedding Manager**: Generates embeddings using sentence transformers | |
| 5. **Vector Storage**: Manages Qdrant vector database operations | |
| 6. **Metadata Manager**: Tracks document processing metadata | |
| ### Processing Pipeline | |
| ``` | |
| URL/File β Download β Extract Text β Chunk β Generate Embeddings β Store in Qdrant | |
| β | |
| Save Metadata | |
| ``` | |
| ### Document Processing Flow | |
| 1. **Download**: Securely download document to temporary location | |
| 2. **Format Detection**: Identify document type and select appropriate extractor | |
| 3. **Text Extraction**: Extract text content with format-specific handling | |
| 4. **Chunking**: Split text into overlapping chunks with smart boundaries | |
| 5. **Embedding**: Generate embeddings using sentence transformers | |
| 6. **Storage**: Store embeddings and metadata in Qdrant vector database | |
| 7. **Cleanup**: Remove temporary files and update registries | |
| ## π Supported Formats | |
| | Format | Extension | Features | Special Handling | | |
| |--------|-----------|----------|------------------| | |
| | PDF | .pdf | Text, tables, complex scripts | CID font mapping, parallel processing | | |
| | Word | .docx | Text, tables, text boxes | XML parsing, gridSpan handling | | |
| | PowerPoint | .pptx | Text, images, tables, notes | OCR Space API for images | | |
| | Excel | .xlsx | Cells, images | OpenPyXL, OCR for embedded images | | |
| | Images | .png, .jpg, .jpeg | Text, tables | OpenCV table detection, OCR | | |
| | Text | .txt, .csv | Plain text | Direct processing | | |
| | URLs | http/https | Web content | Google Docs conversion | | |
| ## π Advanced Features | |
| ### Table Processing | |
| - Automatic table detection in PDFs and images | |
| - GridSpan handling for complex table structures | |
| - Markdown formatting for structured output | |
| - Cell content extraction with proper spacing | |
| ### CID Font Support | |
| - Advanced handling of Malayalam and complex scripts | |
| - Character mapping resolution | |
| - Proper spacing and conjunct handling | |
| - Fallback extraction methods | |
| ### OCR Integration | |
| - OCR Space API for PowerPoint images | |
| - Tesseract OCR for Excel images | |
| - Batch processing for efficiency | |
| - Error handling and fallback options | |
| ### Caching System | |
| - Document-level caching to avoid reprocessing | |
| - Chunk caching for repeated operations | |
| - Temporary file management | |
| - Automatic cleanup on exit | |
| ## π‘οΈ Error Handling | |
| The package includes comprehensive error handling: | |
| - **Network Issues**: Retry logic with exponential backoff | |
| - **Corrupted Files**: Fallback extraction methods | |
| - **Memory Issues**: Batch processing and streaming | |
| - **Format Issues**: Multiple parser fallbacks | |
| - **OCR Failures**: Graceful degradation with error messages | |
| ## π Performance | |
| ### Optimization Features | |
| - **Parallel Processing**: Multi-threaded document processing | |
| - **Batch Operations**: Efficient embedding generation | |
| - **Streaming**: Memory-efficient large file handling | |
| - **Caching**: Avoid redundant processing | |
| - **Connection Pooling**: Efficient HTTP operations | |
| ### Benchmarks | |
| - **PDF Processing**: ~2-5 pages/second (depends on complexity) | |
| - **Embedding Generation**: ~100-500 chunks/second (depends on model) | |
| - **Vector Storage**: ~1000+ vectors/second insertion rate | |
| ## π§ Troubleshooting | |
| ### Common Issues | |
| 1. **OCR Space API Errors** | |
| ```python | |
| # Ensure API key is set | |
| export OCR_SPACE_API_KEY="your_key_here" | |
| ``` | |
| 2. **Tesseract Not Found** | |
| ```bash | |
| # Install tesseract | |
| apt-get install tesseract-ocr | |
| # or | |
| brew install tesseract | |
| ``` | |
| 3. **Memory Issues with Large Files** | |
| ```python | |
| # Reduce batch size in config | |
| BATCH_SIZE = 16 | |
| ``` | |
| 4. **Vector Database Issues** | |
| ```python | |
| # Check permissions on OUTPUT_DIR | |
| # Ensure sufficient disk space | |
| ``` | |
| ### Debug Mode | |
| ```python | |
| import logging | |
| logging.basicConfig(level=logging.DEBUG) | |
| # Enable detailed logging for troubleshooting | |
| ``` | |
| ## π License | |
| This package is part of the ShastraDocs project. See the main project license for details. | |
| *This preprocessing package is designed to handle the complex requirements of document processing in RAG systems, with a focus on reliability, performance, and format diversity.* |