# OCR Utilities

This directory contains utility modules for the Historical OCR project.

## PDF OCR Processing

The `pdf_ocr.py` module provides specialized functionality for processing PDF documents with OCR.

### Features

- **Robust PDF-to-Image Conversion**: Converts PDF documents to images using optimized settings before OCR processing
- **Multi-Page Support**: Intelligently handles multi-page documents, allowing processing of specific pages or page ranges
- **Memory-Efficient Processing**: Processes PDFs in batches to prevent memory issues with large documents
- **Fallback Mechanism**: Falls back to structured_ocr's internal processing if direct conversion fails
- **Cleanup Management**: Automatically cleans up temporary files after processing

### Key Components

- **PDFOCR**: Main class for processing PDF files with OCR
- **PDFConversionResult**: Helper class that holds PDF conversion results and manages cleanup

### Basic Usage

```python
from utils.pdf_ocr import PDFOCR

# Initialize the processor
processor = PDFOCR()

# Process a PDF file (all pages, with vision model)
result = processor.process_pdf('document.pdf')

# Process a PDF file (specific pages, with vision model)
result = processor.process_pdf('document.pdf', custom_pages=[1, 3, 5])

# Process a PDF file (first N pages, without vision model)
result = processor.process_pdf('document.pdf', max_pages=3, use_vision=False)

# Process a PDF file with custom prompt
result = processor.process_pdf(
    'document.pdf',
    custom_prompt="This is a historical newspaper with multiple columns."
)

# Save results to JSON
output_path = processor.save_json_output('document.pdf', 'results.json')
```

### Command Line Usage

The module can also be used directly from the command line:

```bash
python utils/pdf_ocr.py document.pdf --output results.json
python utils/pdf_ocr.py document.pdf --max-pages 3
python utils/pdf_ocr.py document.pdf --pages 1,3,5
python utils/pdf_ocr.py document.pdf --prompt "This is a historical newspaper with multiple columns."
python utils/pdf_ocr.py document.pdf --no-vision
```

### How It Works

1. The module first attempts to convert the PDF to images using `pdf2image`
2. It processes the first page with the vision model (if requested) for detailed analysis
3. Additional pages are processed with the text model for efficiency
4. All text is combined into a single result with appropriate metadata
5. If direct conversion fails, it falls back to using `structured_ocr.py` for PDF processing

### Parameters

- **pdf_path**: Path to the PDF file to process
- **use_vision**: Whether to use vision model for improved analysis (default: True)
- **max_pages**: Maximum number of pages to process (default: all pages)
- **custom_pages**: Specific page numbers to process, 1-based indexing (e.g., [1, 3, 5])
- **custom_prompt**: Custom instructions for OCR processing
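
### Page Selection Sketch

To make the interaction between `max_pages`, `custom_pages`, and batched processing more concrete, here is a minimal standalone sketch. It is not taken from `pdf_ocr.py`: the helper names `select_pages` and `batches` are hypothetical, and the behavior shown (explicit `custom_pages` taking precedence over `max_pages`, out-of-range and duplicate page numbers being dropped, a caller-chosen batch size) is an assumption about one reasonable design, not a description of what the module actually does.

```python
from typing import Iterator, List, Optional, Sequence


def select_pages(
    total_pages: int,
    max_pages: Optional[int] = None,
    custom_pages: Optional[Sequence[int]] = None,
) -> List[int]:
    """Return the 1-based page numbers to process, in ascending order."""
    if custom_pages:
        # Assumption: explicit page numbers win over max_pages, and
        # out-of-range or duplicate entries are silently dropped.
        return sorted({p for p in custom_pages if 1 <= p <= total_pages})
    if max_pages is not None:
        # Take the first max_pages pages, capped at the document length.
        return list(range(1, min(max_pages, total_pages) + 1))
    # Default: every page in the document.
    return list(range(1, total_pages + 1))


def batches(pages: Sequence[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive chunks of at most batch_size page numbers."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(pages), batch_size):
        yield list(pages[start:start + batch_size])


if __name__ == "__main__":
    # Page 12 is outside a 10-page document, so it is dropped.
    print(select_pages(total_pages=10, custom_pages=[5, 1, 3, 12]))
    # First 7 pages, handled 3 at a time to bound memory use.
    print(list(batches(select_pages(total_pages=10, max_pages=7), 3)))
```

Working with 1-based page numbers throughout matches the `custom_pages` convention above; each batch could then be handed to an image converter one chunk at a time, so only a few page images are held in memory at once.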