# Lightweight Agentic OCR Document Extraction

A lightweight, agentic OCR pipeline to extract text and structured fields from document images using Tesseract OCR.

## Features

- **Multiple Preprocessing Variants**: Automatically generates and tests multiple image preprocessing variants (grayscale, thresholding, sharpening, denoise, resize, CLAHE, morphological operations)
- **Multiple PSM Modes**: Tests various Tesseract page segmentation modes to find optimal results
- **Intelligent Candidate Scoring**: Ranks OCR results by average confidence, word count, and text length
- **Structured Field Extraction**: Extracts common document fields including:
  - DOI, ISSN, ISBN, PMID, arXiv ID
  - Volume, Issue, Pages, Year
  - Received/Accepted/Published dates
  - Title, Authors, Abstract, Keywords
  - Email addresses
- **Parallel Processing**: Uses thread pools for faster processing of multiple variants

## Installation

### Prerequisites

1. Install Tesseract OCR:

   **Ubuntu/Debian:**

   ```bash
   sudo apt-get update
   sudo apt-get install -y tesseract-ocr
   ```

   **macOS:**

   ```bash
   brew install tesseract
   ```

   **Windows:**

   Download and install from: https://github.com/UB-Mannheim/tesseract/wiki

2. Install Python dependencies:

   ```bash
   pip install pytesseract opencv-python-headless pillow numpy
   ```

## Usage

### Command Line

```bash
# Basic usage
python agentic_ocr_extractor.py document.jpg

# Save outputs to files
python agentic_ocr_extractor.py document.png -o output.txt -j fields.json

# Custom scale factor and PSM modes
python agentic_ocr_extractor.py scan.jpg --scale 2.0 --psm 3 6 11

# Quiet mode (suppress progress output)
python agentic_ocr_extractor.py document.jpg -q
```

### Python API

```python
from agentic_ocr_extractor import process_image, run_agent, extract_fields
import cv2

# Full processing pipeline
cleaned_text, fields, best_candidate = process_image(
    'document.jpg',
    output_text_path='extracted_text.txt',
    output_json_path='extracted_fields.json',
    scale_factor=1.5,
    verbose=True
)

# Access extracted fields
print(f"Title: {fields.title}")
print(f"Authors: {fields.authors}")
print(f"DOI: {fields.doi}")

# Or use individual components
bgr = cv2.imread('document.jpg')
rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)

# Run agentic OCR
best = run_agent(rgb, psms=[3, 4, 6, 11], scale_factor=1.5)
print(f"Best variant: {best.variant}, PSM: {best.psm}, Confidence: {best.avg_conf}")

# Extract fields from text
fields = extract_fields(best.text)
print(fields.to_dict())
```

## How It Works

1. **Image Loading**: Reads the input image and converts to RGB format
2. **Preprocessing**: Generates multiple variants of the image:
   - Raw (original)
   - Upscaled (1.5x by default)
   - Grayscale
   - Otsu threshold
   - Adaptive threshold
   - Denoised
   - Sharpened
   - CLAHE (Contrast Limited Adaptive Histogram Equalization)
   - Morphological closing
3. **OCR Execution**: Runs Tesseract OCR on each variant with multiple page segmentation modes (PSM 3, 4, 6, 11 by default)
4. **Candidate Scoring**: Scores each OCR result based on:
   - Average word confidence
   - Text length (penalizes very short outputs)
   - Word count (bonus for reasonable counts)
5. **Best Selection**: Selects the candidate with the highest combined score
6. **Text Cleaning**: Cleans the OCR output by:
   - Normalizing Unicode characters
   - Fixing common OCR artifacts
   - Cleaning whitespace
7. **Field Extraction**: Uses rule-based regex patterns to extract structured fields

## Tips for Better Results

- **Image Quality**: Higher resolution images generally produce better results
- **Scale Factor**: Try increasing the scale factor (e.g., 2.0) for images with small text
- **PSM Modes**: Different document layouts may benefit from different PSM modes:
  - PSM 3: Fully automatic page segmentation
  - PSM 4: Assume a single column of text
  - PSM 6: Assume a single uniform block of text
  - PSM 11: Sparse text, find as much text as possible
- **Cropping**: For PDFs rendered to images, cropping margins often improves results

## License

MIT License - see [LICENSE](LICENSE) for details.

## Acknowledgments

- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) - Open source OCR engine
- [pytesseract](https://github.com/madmaze/pytesseract) - Python wrapper for Tesseract
- [OpenCV](https://opencv.org/) - Computer vision library
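
## Appendix: Candidate Scoring Sketch

The candidate scoring and best-selection steps described under "How It Works" can be illustrated with a small, self-contained sketch. This is not the project's actual implementation: the `Candidate` class, the `score_candidate` and `select_best` functions, and all weights and thresholds are hypothetical and chosen only to show how average confidence, text length, and word count might be combined into one score.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """Hypothetical container for one OCR attempt (not the project's class)."""
    variant: str      # preprocessing variant name, e.g. "otsu"
    psm: int          # Tesseract page segmentation mode used
    text: str         # raw OCR output
    avg_conf: float   # mean word confidence, assumed to be on a 0-100 scale


def score_candidate(c: Candidate) -> float:
    """Combine confidence, text length, and word count into a single score.

    The penalty, bonus, and thresholds below are illustrative assumptions.
    """
    words = c.text.split()
    score = c.avg_conf
    # Penalize very short outputs, which are often noise read with high confidence.
    if len(c.text.strip()) < 20:
        score -= 30.0
    # Give a small bonus when the word count looks plausible for a document page.
    if 10 <= len(words) <= 5000:
        score += 5.0
    return score


def select_best(candidates: List[Candidate]) -> Candidate:
    """Return the candidate with the highest combined score."""
    return max(candidates, key=score_candidate)


# Usage: a confident but nearly empty result loses to a longer, plausible one.
candidates = [
    Candidate("raw", 3, "x1", 95.0),
    Candidate(
        "otsu", 6,
        "A Study of Lightweight OCR Pipelines for Scanned Journal Articles and Reports",
        82.0,
    ),
]
best = select_best(candidates)
print(best.variant, best.psm)
```

Scoring on confidence alone would pick the first candidate here; the length penalty is what keeps a near-empty output from winning. In practice the thresholds would need tuning against real scans.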