# ShastraDocs Preprocessing Package

An advanced document preprocessing pipeline for RAG (Retrieval-Augmented Generation) systems. This modular package handles document ingestion, text extraction, chunking, embedding generation, and vector storage for multiple document formats.

## 🚀 Features

### Document Format Support
- **PDF**: Advanced text extraction with table handling and CID font support (Malayalam, complex scripts)
- **DOCX**: Complete Word document processing with tables and text boxes
- **PPTX**: PowerPoint extraction with OCR for images using OCR Space API
- **XLSX**: Excel spreadsheet processing with image OCR support
- **Images**: PNG, JPEG, JPG with table detection and OCR
- **Plain Text**: TXT and CSV file support
- **URLs**: Direct URL processing and Google Docs conversion

### Advanced Processing Capabilities
- **Smart Text Chunking**: Sentence-boundary aware chunking with configurable overlap
- **Embedding Generation**: Sentence transformer-based embeddings with batch processing
- **Vector Storage**: Qdrant integration for efficient similarity search
- **Table Extraction**: Automated table detection and formatting
- **OCR Integration**: OCR Space API for image text extraction
- **Metadata Management**: Comprehensive document metadata tracking
- **Parallel Processing**: Multi-threaded document processing
- **Caching**: Intelligent caching to avoid reprocessing

## 📁 Package Structure

```
preprocessing/
├── __init__.py                    # Package initialization
├── preprocessing.py               # Main entry point and CLI
└── preprocessing_modules/
    ├── __init__.py
    ├── modular_preprocessor.py    # Main orchestrator class
    ├── file_downloader.py         # Universal file downloading
    ├── pdf_extractor.py           # PDF text extraction
    ├── docx_extractor.py          # DOCX processing
    ├── pptx_extractor.py          # PowerPoint processing
    ├── xlsx_extractor.py          # Excel processing
    ├── image_extractor.py         # Image and table extraction
    ├── text_chunker.py            # Smart text chunking
    ├── embedding_manager.py       # Embedding generation
    ├── vector_storage.py          # Qdrant vector database
    └── metadata_manager.py        # Document metadata management
```

## 🛠️ Installation

### Dependencies
Note: these packages are already listed in the project's `requirements.txt`.
```bash
# Core dependencies (asyncio and pathlib ship with Python and need no install)
pip install aiohttp numpy pandas
pip install sentence-transformers qdrant-client

# Image processing and OCR
pip install opencv-python pytesseract pillow

# Document parsing
pip install pdfplumber pymupdf python-docx python-pptx openpyxl lxml
```

### Environment Variables
Create a `.env` file with the following:
```env
# Required for PowerPoint OCR
OCR_SPACE_API_KEY=your_ocr_space_api_key

# Optional: custom settings
OUTPUT_DIR=./vector_db
# Any sentence-transformers model can be used
EMBEDDING_MODEL=Bge-large-en
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
BATCH_SIZE=32
```

## 🔧 Configuration

The package uses `config/config.py` for configuration:

```python
# Embedding configuration
EMBEDDING_MODEL = "Bge-large-en"  # Sentence transformer model
BATCH_SIZE = 32                       # Embedding batch size

# Chunking configuration
CHUNK_SIZE = 1600                     # Characters per chunk
CHUNK_OVERLAP = 500                   # Overlap between chunks

# Storage configuration
OUTPUT_DIR = "./vector_db"            # Vector database directory

# OCR configuration (for PPTX images)
OCR_SPACE_API_KEY = "your_api_key"    # OCR Space API key
```
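
The README does not say how the `.env` values relate to the defaults in `config/config.py`. One common pattern, shown here as a minimal sketch, reads each variable from the environment and falls back to the default listed above. `Settings` and `load_settings` are hypothetical names and are not part of the package.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container (not part of the package)."""
    embedding_model: str
    batch_size: int
    chunk_size: int
    chunk_overlap: int
    output_dir: str


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (default: os.environ).

    Assumption: environment values take precedence over the defaults
    shown in the configuration block above.
    """
    env = os.environ if env is None else env
    settings = Settings(
        embedding_model=env.get("EMBEDDING_MODEL", "Bge-large-en"),
        batch_size=int(env.get("BATCH_SIZE", "32")),
        chunk_size=int(env.get("CHUNK_SIZE", "1600")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "500")),
        output_dir=env.get("OUTPUT_DIR", "./vector_db"),
    )
    # An overlap as large as the chunk itself would never advance through the text.
    if settings.chunk_overlap >= settings.chunk_size:
        raise ValueError("CHUNK_OVERLAP must be smaller than CHUNK_SIZE")
    return settings
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.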

## 📖 Usage

### Basic Usage

```python
import asyncio

from preprocessing import ModularDocumentPreprocessor


async def main():
    # Initialize preprocessor
    preprocessor = ModularDocumentPreprocessor()

    # Process a single document
    doc_id = await preprocessor.process_document("https://example.com/document.pdf")

    # Process multiple documents
    urls = [
        "https://example.com/doc1.pdf",
        "https://example.com/doc2.docx",
        "https://example.com/presentation.pptx"
    ]
    results = await preprocessor.process_multiple_documents(urls)

    # Check processing status
    info = preprocessor.get_document_info("https://example.com/document.pdf")
    print(f"Document processed: {info}")


# process_document is awaited, so it has to run inside an event loop
asyncio.run(main())
```

### Document Types and Return Values

```python
# Different document types return different formats
# (`await` must be used inside an async function)
result = await preprocessor.process_document(url)

# Regular documents (PDF, DOCX, TXT)
if isinstance(result, str):
    doc_id = result  # Normal processing, returns document ID

# Special cases
elif isinstance(result, list):
    content, doc_type = result[0], result[1]

    if doc_type == 'oneshot':
        # Small documents processed as single chunk
        # Use content directly with LLM
        pass

    elif doc_type == 'tabular':
        # Excel/CSV with structured data
        # Use content for data analysis
        pass

    elif doc_type == 'image':
        # Image file - content is file path
        # Process with image_extractor
        pass

    elif doc_type == 'unsupported':
        # File format not supported
        print(f"Unsupported format: {content}")
```

### Advanced Usage

```python
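# These snippets assume an existing `preprocessor` and `url`;
# `await` calls must run inside an async function.

# Force reprocessing
doc_id = await preprocessor.process_document(url, force_reprocess=True)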

# Custom timeout for large files
doc_id = await preprocessor.process_document(url, timeout=600)  # 10 minutes

# Get system information
system_info = preprocessor.get_system_info()
print(f"Embedding model: {system_info['embedding_model']}")

# Get collection statistics
stats = preprocessor.get_collection_stats()
print(f"Total documents: {stats['total_documents']}")
print(f"Total chunks: {stats['total_chunks']}")

# List all processed documents
docs = preprocessor.list_processed_documents()
for doc_id, info in docs.items():
    print(f"{doc_id}: {info['document_url']} ({info['chunk_count']} chunks)")

# Cleanup document
success = preprocessor.cleanup_document(url)
```

### Image Processing

```python
from preprocessing_modules.image_extractor import extract_image

# Extract text and tables from images
text_content = extract_image("path/to/image.png")
print(text_content)

# Output format:
# ### Non-Table Text:
# Regular text content from the image
# 
# ### Table 1 (Markdown):
# | Column 1 | Column 2 | Column 3 |
# |----------|----------|----------|
# | Data 1   | Data 2   | Data 3   |
```

## 🎯 Command Line Interface

```bash
# Process a single document
python -m preprocessing --url "https://example.com/document.pdf"

# Process multiple documents from file
python -m preprocessing --urls-file urls.txt

# Force reprocessing
python -m preprocessing --url "https://example.com/document.pdf" --force

# List processed documents
python -m preprocessing --list

# Show collection statistics
python -m preprocessing --stats
```

### URLs File Format
```
https://example.com/doc1.pdf
https://example.com/doc2.docx
https://example.com/presentation.pptx
https://docs.google.com/document/d/abc123/edit?usp=sharing
```
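
The last entry above is a Google Docs sharing link, which has to be turned into a downloadable file before extraction. The package's own conversion logic is not shown in this README. The sketch below assumes the link is rewritten to the Google Docs export endpoint, that PDF is the export format, and that blank lines in the URLs file are ignored. `read_urls` and `to_download_url` are hypothetical helper names.

```python
import re

# Matches Google Docs document links and captures the document ID.
_GDOC_PATTERN = re.compile(r"^https://docs\.google\.com/document/d/([A-Za-z0-9_-]+)")


def to_download_url(url: str, export_format: str = "pdf") -> str:
    """Rewrite a Google Docs sharing link to an export link; pass other URLs through."""
    match = _GDOC_PATTERN.match(url)
    if match is None:
        return url
    doc_id = match.group(1)
    return f"https://docs.google.com/document/d/{doc_id}/export?format={export_format}"


def read_urls(text: str) -> list[str]:
    """Parse the content of a URLs file: one URL per line, blank lines skipped."""
    return [line.strip() for line in text.splitlines() if line.strip()]
```

With the file content above, `to_download_url` leaves the three `example.com` links unchanged and rewrites only the Google Docs entry.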

## 🏗️ Architecture

### Modular Design
The package follows a modular architecture with clear separation of concerns:

1. **File Downloader**: Handles downloading from various sources with retry logic
2. **Text Extractors**: Specialized extractors for each document format
3. **Text Chunker**: Smart chunking with sentence boundary detection
4. **Embedding Manager**: Generates embeddings using sentence transformers
5. **Vector Storage**: Manages Qdrant vector database operations
6. **Metadata Manager**: Tracks document processing metadata

### Processing Pipeline
```
URL/File → Download → Extract Text → Chunk → Generate Embeddings → Store in Qdrant
                                     ↓
                               Save Metadata
```

### Document Processing Flow

1. **Download**: Securely download document to temporary location
2. **Format Detection**: Identify document type and select appropriate extractor
3. **Text Extraction**: Extract text content with format-specific handling
4. **Chunking**: Split text into overlapping chunks with smart boundaries
5. **Embedding**: Generate embeddings using sentence transformers
6. **Storage**: Store embeddings and metadata in Qdrant vector database
7. **Cleanup**: Remove temporary files and update registries

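The chunking step (step 4) is where sentence boundaries and overlap interact. The following is a minimal sketch of one way to do it, not the code in `text_chunker.py`: sentences are packed greedily up to `chunk_size` characters, and trailing sentences totalling at most `chunk_overlap` characters are repeated at the start of the next chunk. The regex-based sentence splitter and the function names are assumptions.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_text(text: str, chunk_size: int = 1600, chunk_overlap: int = 500) -> list[str]:
    """Split text into overlapping chunks without cutting sentences in half.

    chunk_size is a soft limit: a single sentence longer than chunk_size
    is kept whole in its own chunk.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    chunks: list[str] = []
    current: list[str] = []  # sentences of the chunk being built
    for sentence in split_sentences(text):
        if current and len(" ".join(current + [sentence])) > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing sentences (at most chunk_overlap characters) forward.
            carried: list[str] = []
            for prev in reversed(current):
                if len(" ".join([prev] + carried)) > chunk_overlap:
                    break
                carried.insert(0, prev)
            # Drop carried sentences from the front if the new sentence would not fit.
            while carried and len(" ".join(carried + [sentence])) > chunk_size:
                carried.pop(0)
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Overlapping on whole sentences rather than raw characters keeps every chunk readable on its own, at the cost of an overlap that is usually somewhat shorter than `chunk_overlap`.
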
## 📊 Supported Formats

| Format | Extension | Features | Special Handling |
|--------|-----------|----------|------------------|
| PDF | .pdf | Text, tables, complex scripts | CID font mapping, parallel processing |
| Word | .docx | Text, tables, text boxes | XML parsing, gridSpan handling |
| PowerPoint | .pptx | Text, images, tables, notes | OCR Space API for images |
| Excel | .xlsx | Cells, images | OpenPyXL, OCR for embedded images |
| Images | .png, .jpg, .jpeg | Text, tables | OpenCV table detection, OCR |
| Text | .txt, .csv | Plain text | Direct processing |
| URLs | http/https | Web content | Google Docs conversion |

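The table above can be read as a lookup from file extension to extractor. A minimal sketch of such a dispatch follows, assuming detection is based on the extension in the URL path alone; the real package may also inspect response headers or file content. `FORMATS` and `detect_format` are hypothetical names, and the `'unsupported'` label mirrors the `doc_type` value shown in the usage section.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extension -> format label, following the supported-formats table.
FORMATS = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".pptx": "pptx",
    ".xlsx": "xlsx",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".txt": "text",
    ".csv": "text",
}


def detect_format(url_or_path: str) -> str:
    """Return the format label for a URL or file path, or 'unsupported'."""
    # urlparse separates the path from any query string such as ?download=1.
    path = urlparse(url_or_path).path
    suffix = PurePosixPath(path).suffix.lower()
    return FORMATS.get(suffix, "unsupported")
```

Lower-casing the suffix means `Report.PDF` and `report.pdf` are handled the same way.
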
## 🔍 Advanced Features

### Table Processing
- Automatic table detection in PDFs and images
- GridSpan handling for complex table structures
- Markdown formatting for structured output
- Cell content extraction with proper spacing

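To illustrate the Markdown formatting step, here is a small sketch that renders extracted rows in the same shape as the table output shown in the Image Processing example. It assumes the first row is the header and that cells arrive as strings or `None`; `table_to_markdown` is a hypothetical helper, not the package's function.

```python
def table_to_markdown(rows: list[list]) -> str:
    """Render rows (first row is the header) as a Markdown table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell) -> str:
        # Collapse internal whitespace/newlines and escape pipes so cells stay on one line.
        text = "" if cell is None else str(cell)
        return " ".join(text.split()).replace("|", "\\|")

    # Pad short rows so every row has the same number of columns.
    padded = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    header, body = padded[0], padded[1:]

    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```

Padding ragged rows matters for OCR output in particular, where a missed cell would otherwise shift every later column.
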
### CID Font Support
- Advanced handling of Malayalam and complex scripts
- Character mapping resolution
- Proper spacing and conjunct handling
- Fallback extraction methods

### OCR Integration
- OCR Space API for PowerPoint images
- Tesseract OCR for Excel images
- Batch processing for efficiency
- Error handling and fallback options

### Caching System
- Document-level caching to avoid reprocessing
- Chunk caching for repeated operations
- Temporary file management
- Automatic cleanup on exit

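Document-level caching needs a stable key per document and a persistent record of what has been processed. The sketch below assumes the key is a hash of the source URL and that the record is a JSON file. `document_id` and `DocumentRegistry` are hypothetical, and the actual `metadata_manager.py` may work differently. The `document_url` and `chunk_count` fields follow the keys used in the Advanced Usage example.

```python
import hashlib
import json
from pathlib import Path


def document_id(url: str) -> str:
    """Derive a stable, filesystem-safe ID from the document URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]


class DocumentRegistry:
    """Hypothetical JSON-backed registry of processed documents."""

    def __init__(self, path):
        self.path = Path(path)
        # Load previous entries so the cache survives restarts.
        if self.path.exists():
            self.entries = json.loads(self.path.read_text(encoding="utf-8"))
        else:
            self.entries = {}

    def is_processed(self, url: str) -> bool:
        return document_id(url) in self.entries

    def record(self, url: str, chunk_count: int) -> str:
        """Store metadata for a processed document and persist the registry."""
        doc_id = document_id(url)
        self.entries[doc_id] = {"document_url": url, "chunk_count": chunk_count}
        self.path.write_text(json.dumps(self.entries, indent=2), encoding="utf-8")
        return doc_id
```

A `force_reprocess` option then amounts to skipping the `is_processed` check before running the pipeline.
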
## 🛡️ Error Handling

The package includes comprehensive error handling:

- **Network Issues**: Retry logic with exponential backoff
- **Corrupted Files**: Fallback extraction methods
- **Memory Issues**: Batch processing and streaming
- **Format Issues**: Multiple parser fallbacks
- **OCR Failures**: Graceful degradation with error messages

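Retry with exponential backoff can be sketched as a small async helper. This is an illustration of the technique under assumed parameters (`max_attempts`, `base_delay`, `max_delay` and the jitter amount are made up here), not the downloader's actual code. When wrapping `aiohttp` requests, `retry_on=(aiohttp.ClientError, asyncio.TimeoutError)` would be a reasonable choice.

```python
import asyncio
import random


async def retry_with_backoff(
    operation,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    retry_on: tuple = (OSError, asyncio.TimeoutError),
    sleep=asyncio.sleep,
):
    """Await operation() and retry on failure, doubling the wait each time.

    `operation` is a zero-argument callable returning an awaitable.
    `sleep` is injectable so tests do not have to wait in real time.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay * 1, 2, 4, ... and are capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% random jitter avoids synchronized retries from many clients.
            await sleep(delay + random.uniform(0, delay / 10))
```

Exceptions outside `retry_on`, such as a `ValueError` from a malformed URL, propagate immediately because retrying them would not help.
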
## 📈 Performance

### Optimization Features
- **Parallel Processing**: Multi-threaded document processing
- **Batch Operations**: Efficient embedding generation
- **Streaming**: Memory-efficient large file handling
- **Caching**: Avoid redundant processing
- **Connection Pooling**: Efficient HTTP operations

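Batch operations bound peak memory by embedding a fixed number of chunks at a time, which is also why lowering `BATCH_SIZE` helps with large files. A minimal sketch follows, with the embedding model replaced by an injected `embed_batch` callable so the example has no external dependencies. Both function names are hypothetical.

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    chunks: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list],
    batch_size: int = 32,
) -> list:
    """Embed chunks batch by batch and return one vector per chunk, in order."""
    vectors: list = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors
```

With sentence-transformers, `SentenceTransformer.encode` accepts a `batch_size` argument and batches internally, so an explicit loop like this is mainly useful when results should be written to storage between batches.
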
### Benchmarks
- **PDF Processing**: ~2-5 pages/second (depends on complexity)
- **Embedding Generation**: ~100-500 chunks/second (depends on model)
- **Vector Storage**: 1000+ vectors/second insertion rate

## 🔧 Troubleshooting

### Common Issues

1. **OCR Space API Errors**
   ```bash
   # Ensure the API key is set in your shell (or in the .env file)
   export OCR_SPACE_API_KEY="your_key_here"
   ```

2. **Tesseract Not Found**
   ```bash
   # Install tesseract
   apt-get install tesseract-ocr
   # or
   brew install tesseract
   ```

3. **Memory Issues with Large Files**
   ```python
   # Reduce batch size in config
   BATCH_SIZE = 16
   ```

4. **Vector Database Issues**
   ```bash
   # Check permissions on OUTPUT_DIR (./vector_db by default)
   ls -ld ./vector_db
   # Ensure sufficient disk space
   df -h ./vector_db
   ```

### Debug Mode
```python
import logging

# Enable detailed logging for troubleshooting
logging.basicConfig(level=logging.DEBUG)
```

## 📄 License

This package is part of the ShastraDocs project. See the main project license for details.


*This preprocessing package is designed to handle the complex requirements of document processing in RAG systems, with a focus on reliability, performance, and format diversity.*