# Lightweight Agentic OCR Document Extraction

A lightweight, agentic OCR pipeline that extracts text and structured fields from document images using Tesseract OCR.

## Features

- **Multiple Preprocessing Variants**: Automatically generates and tests multiple image preprocessing variants (grayscale, thresholding, sharpening, denoising, resizing, CLAHE, morphological operations)
- **Multiple PSM Modes**: Tests several Tesseract page segmentation modes to find the best result
- **Intelligent Candidate Scoring**: Ranks OCR results by average confidence, word count, and text length
- **Structured Field Extraction**: Extracts common document fields, including:
  - DOI, ISSN, ISBN, PMID, arXiv ID
  - Volume, Issue, Pages, Year
  - Received/Accepted/Published dates
  - Title, Authors, Abstract, Keywords
  - Email addresses
- **Parallel Processing**: Uses thread pools to process multiple variants faster
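
The candidate scoring and parallel processing features can be illustrated with a minimal, self-contained sketch. It is not the project's actual implementation: the `Candidate` fields mirror the attributes used in the Python API example (`variant`, `psm`, `text`, `avg_conf`), but `score`, `fake_ocr`, `best_candidate`, and every weight and threshold are assumptions chosen for illustration. `fake_ocr` stands in for a real Tesseract call so the example runs without any OCR dependencies.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product


@dataclass
class Candidate:
    variant: str      # name of the preprocessing variant
    psm: int          # Tesseract page segmentation mode
    text: str         # recognized text
    avg_conf: float   # mean word confidence, 0-100


def score(c: Candidate) -> float:
    """Combine confidence, text length, and word count (illustrative weights)."""
    words = len(c.text.split())
    short_penalty = 20.0 if len(c.text.strip()) < 20 else 0.0  # penalize very short output
    word_bonus = min(words, 200) * 0.05                        # bonus capped at +10
    return c.avg_conf - short_penalty + word_bonus


def fake_ocr(variant: str, psm: int) -> Candidate:
    """Hypothetical stand-in for an OCR call; returns canned results."""
    canned = {
        ("otsu", 6): ("Lightweight OCR pipelines for scanned journal articles", 88.0),
        ("raw", 3): ("L1ghtwe1ght 0CR", 91.0),
    }
    text, conf = canned.get((variant, psm), ("", 0.0))
    return Candidate(variant, psm, text, conf)


def best_candidate(variants, psms, ocr=fake_ocr, max_workers=4) -> Candidate:
    """Run every (variant, PSM) pair in a thread pool and keep the top scorer."""
    jobs = list(product(variants, psms))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda job: ocr(*job), jobs))
    return max(results, key=score)


best = best_candidate(["raw", "otsu"], [3, 6])
print(best.variant, best.psm, round(score(best), 2))
```

In this toy data the `raw` result has the higher raw confidence, but its output is very short, so the length penalty lets the longer `otsu` result win. That is the kind of trade-off a combined score is meant to capture.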

## Installation

### Prerequisites

1. Install Tesseract OCR:

**Ubuntu/Debian:**

```bash
sudo apt-get update
sudo apt-get install -y tesseract-ocr
```

**macOS:**

```bash
brew install tesseract
```

**Windows:**

Download and install from: https://github.com/UB-Mannheim/tesseract/wiki

2. Install Python dependencies:

```bash
pip install pytesseract opencv-python-headless pillow numpy
```
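
After installing, it can help to confirm that the `tesseract` executable is reachable before running the pipeline. The following sketch uses only the Python standard library; the helper name `tesseract_version` is hypothetical and not part of this project.

```python
import shutil
import subprocess


def tesseract_version():
    """Return the first line of `tesseract --version`, or None if not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Some Tesseract builds print the version banner to stderr rather than stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


version = tesseract_version()
print(version or "Tesseract not found on PATH")
```

If the executable is installed but not on `PATH` (which can happen on Windows), pytesseract can be pointed at it by setting `pytesseract.pytesseract.tesseract_cmd` to the full path of the binary.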

## Usage

### Command Line

```bash
# Basic usage
python agentic_ocr_extractor.py document.jpg

# Save outputs to files
python agentic_ocr_extractor.py document.png -o output.txt -j fields.json

# Custom scale factor and PSM modes
python agentic_ocr_extractor.py scan.jpg --scale 2.0 --psm 3 6 11

# Quiet mode (suppress progress output)
python agentic_ocr_extractor.py document.jpg -q
```
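
For readers who want to see how these flags fit together, here is a hypothetical `argparse` reconstruction of the command-line interface. Only the flag names and the documented defaults (scale factor 1.5, PSM modes 3, 4, 6, 11) come from this README; the `dest` names and help strings are assumptions, and the real script may define its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(prog="agentic_ocr_extractor.py")
    parser.add_argument("image", help="path to the input image")
    parser.add_argument("-o", dest="output_text", help="write cleaned text to this file")
    parser.add_argument("-j", dest="output_json", help="write extracted fields to this JSON file")
    parser.add_argument("--scale", type=float, default=1.5, help="upscaling factor")
    parser.add_argument("--psm", type=int, nargs="+", default=[3, 4, 6, 11],
                        help="one or more Tesseract page segmentation modes")
    parser.add_argument("-q", dest="quiet", action="store_true",
                        help="suppress progress output")
    return parser


# Parse the "custom scale factor and PSM modes" example from above.
args = build_parser().parse_args(["scan.jpg", "--scale", "2.0", "--psm", "3", "6", "11"])
print(args.image, args.scale, args.psm, args.quiet)
```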

### Python API

```python
from agentic_ocr_extractor import process_image, run_agent, extract_fields
import cv2

# Full processing pipeline
cleaned_text, fields, best_candidate = process_image(
    'document.jpg',
    output_text_path='extracted_text.txt',
    output_json_path='extracted_fields.json',
    scale_factor=1.5,
    verbose=True
)

# Access extracted fields
print(f"Title: {fields.title}")
print(f"Authors: {fields.authors}")
print(f"DOI: {fields.doi}")

# Or use individual components
bgr = cv2.imread('document.jpg')
rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)

# Run agentic OCR
best = run_agent(rgb, psms=[3, 4, 6, 11], scale_factor=1.5)
print(f"Best variant: {best.variant}, PSM: {best.psm}, Confidence: {best.avg_conf}")

# Extract fields from text
fields = extract_fields(best.text)
print(fields.to_dict())
```

## How It Works

1. **Image Loading**: Reads the input image and converts to RGB format
2. **Preprocessing**: Generates multiple variants of the image:
   - Raw (original)
   - Upscaled (1.5x by default)
   - Grayscale
   - Otsu threshold
   - Adaptive threshold
   - Denoised
   - Sharpened
   - CLAHE (Contrast Limited Adaptive Histogram Equalization)
   - Morphological closing
3. **OCR Execution**: Runs Tesseract OCR on each variant with multiple page segmentation modes (PSM 3, 4, 6, 11 by default)
4. **Candidate Scoring**: Scores each OCR result based on:
   - Average word confidence
   - Text length (penalizes very short outputs)
   - Word count (bonus for reasonable counts)
5. **Best Selection**: Selects the candidate with the highest combined score
6. **Text Cleaning**: Cleans the OCR output by:
   - Normalizing Unicode characters
   - Fixing common OCR artifacts
   - Cleaning whitespace
7. **Field Extraction**: Uses rule-based regex patterns to extract structured fields
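
Steps 6 and 7 can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not the project's code: `clean_text`, `extract_fields_sketch`, and the patterns in `PATTERNS` are hypothetical, cover only a few of the fields listed under Features, and are deliberately simpler than what robust extraction would need.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize Unicode, undo line-break hyphenation, and tidy whitespace."""
    text = unicodedata.normalize("NFKC", raw)     # e.g. the "fi" ligature becomes "fi"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # rejoin words hyphenated across lines
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)        # keep at most one blank line in a row
    return text.strip()


# Illustrative patterns only; each returns the first match in the text.
PATTERNS = {
    "doi": re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+"),
    "issn": re.compile(r"\bISSN[:\s]*(\d{4}-\d{3}[\dXx])\b"),
    "arxiv": re.compile(r"\barXiv:\s*(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "year": re.compile(r"\b(?:19|20)\d{2}\b"),
}


def extract_fields_sketch(text: str) -> dict:
    """Apply each pattern once; missing fields are reported as None."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            fields[name] = None
            continue
        # Use the capture group when the pattern has one, else the whole match.
        value = match.group(1) if pattern.groups else match.group(0)
        fields[name] = value.rstrip(".,;")  # drop trailing sentence punctuation
    return fields


raw = (
    "Classi\ufb01cation of Docu-\nments   by OCR\n"
    "DOI: 10.1234/abcd.5678.\n"
    "ISSN 1234-567X\n"
    "Contact: jane.doe@example.org\n"
    "Received 12 March 2021"
)
cleaned = clean_text(raw)
fields = extract_fields_sketch(cleaned)
print(cleaned.splitlines()[0])
print(fields)
```

Cleaning before extraction matters here: without rejoining the hyphenated word and normalizing the ligature, title and keyword patterns would see broken tokens. Taking only the first match per pattern is a simplification; a document with several years or email addresses would need a rule for choosing among them.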

## Tips for Better Results

- **Image Quality**: Higher resolution images generally produce better results
- **Scale Factor**: Try increasing the scale factor (e.g., 2.0) for images with small text
- **PSM Modes**: Different document layouts may benefit from different PSM modes:
  - PSM 3: Fully automatic page segmentation
  - PSM 4: Assume a single column of text
  - PSM 6: Assume a single uniform block of text
  - PSM 11: Sparse text, find as much text as possible
- **Cropping**: For PDFs rendered to images, cropping margins often improves results
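
When experimenting with PSM modes directly, Tesseract takes the mode through its `--psm` option. The sketch below builds that option string for the four modes described above; `PSM_HINTS` and `tesseract_config` are hypothetical names used for illustration only.

```python
# Descriptions copied from the PSM list above.
PSM_HINTS = {
    3: "Fully automatic page segmentation",
    4: "Assume a single column of text",
    6: "Assume a single uniform block of text",
    11: "Sparse text, find as much text as possible",
}


def tesseract_config(psm: int) -> str:
    """Build the option string that selects a page segmentation mode."""
    if not 0 <= psm <= 13:  # Tesseract defines PSM values 0 through 13
        raise ValueError(f"unsupported PSM: {psm}")
    return f"--psm {psm}"


for psm, hint in PSM_HINTS.items():
    print(f"{tesseract_config(psm):<9} {hint}")
```

The resulting string can be passed as the `config` argument of `pytesseract.image_to_string` to try a single mode on one image.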

## License

MIT License - see [LICENSE](LICENSE) for details.

## Acknowledgments

- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) - Open source OCR engine
- [pytesseract](https://github.com/madmaze/pytesseract) - Python wrapper for Tesseract
- [OpenCV](https://opencv.org/) - Computer vision library