# Lightweight Agentic OCR Document Extraction
A lightweight, agentic OCR pipeline to extract text and structured fields from document images using Tesseract OCR.
## Features
- **Multiple Preprocessing Variants**: Automatically generates and tests multiple image preprocessing variants (grayscale, thresholding, sharpening, denoise, resize, CLAHE, morphological operations)
- **Multiple PSM Modes**: Tests various Tesseract page segmentation modes to find optimal results
- **Intelligent Candidate Scoring**: Ranks OCR results by average confidence, word count, and text length
- **Structured Field Extraction**: Extracts common document fields including:
  - DOI, ISSN, ISBN, PMID, arXiv ID
  - Volume, Issue, Pages, Year
  - Received/Accepted/Published dates
  - Title, Authors, Abstract, Keywords
  - Email addresses
- **Parallel Processing**: Uses thread pools for faster processing of multiple variants
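
To make the structured field extraction listed above more concrete, here is a minimal sketch of rule-based matching for a few of the simpler fields. The patterns and the `extract_simple_fields` helper are illustrative assumptions, not the patterns used by `agentic_ocr_extractor.py`; the real `extract_fields` function covers more fields and may behave differently.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns for illustration only; real-world identifiers
# have more edge cases than these expressions handle.
PATTERNS = {
    "doi": re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+"),
    "issn": re.compile(r"\bISSN[:\s]*(\d{4}-\d{3}[\dXx])\b"),
    "arxiv": re.compile(r"\barXiv:\s*(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "year": re.compile(r"\b(?:19|20)\d{2}\b"),
}


def extract_simple_fields(text: str) -> Dict[str, Optional[str]]:
    """Return the first match for each pattern, or None when nothing matches."""
    result: Dict[str, Optional[str]] = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            result[name] = None
            continue
        # Prefer the capture group when the pattern defines one.
        value = match.group(1) if pattern.groups else match.group(0)
        # Trim punctuation that often trails an identifier in running text.
        result[name] = value.rstrip(".,;")
    return result


sample = (
    "Journal of Tests, ISSN: 1234-567X. DOI: 10.1234/abc.2021.001. "
    "arXiv:2101.12345v2 Contact: jane.doe@example.org (2021)"
)
fields = extract_simple_fields(sample)
print(fields)
```

Taking only the first match keeps the sketch short; a fuller implementation would likely rank multiple matches or use context such as a preceding "DOI:" label.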
## Installation
### Prerequisites
1. Install Tesseract OCR:
   **Ubuntu/Debian:**
   ```bash
   sudo apt-get update
   sudo apt-get install -y tesseract-ocr
   ```
   **macOS:**
   ```bash
   brew install tesseract
   ```
   **Windows:**
   Download and install from: https://github.com/UB-Mannheim/tesseract/wiki
2. Install Python dependencies:
   ```bash
   pip install pytesseract opencv-python-headless pillow numpy
   ```
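
After installing, it can help to confirm that the `tesseract` executable is reachable before running the extractor. The following is a small sketch using only the Python standard library; the `find_tesseract` helper is a hypothetical name and is not part of this project.

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


path = find_tesseract()
if path is None:
    print("tesseract not found on PATH; install it or add it to PATH")
else:
    print(f"tesseract found at {path}")
```

If the executable is installed somewhere outside `PATH` (common on Windows), pytesseract lets you point at it explicitly by setting `pytesseract.pytesseract.tesseract_cmd` to the full path.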
## Usage
### Command Line
```bash
# Basic usage
python agentic_ocr_extractor.py document.jpg
# Save outputs to files
python agentic_ocr_extractor.py document.png -o output.txt -j fields.json
# Custom scale factor and PSM modes
python agentic_ocr_extractor.py scan.jpg --scale 2.0 --psm 3 6 11
# Quiet mode (suppress progress output)
python agentic_ocr_extractor.py document.jpg -q
```
### Python API
```python
from agentic_ocr_extractor import process_image, run_agent, extract_fields
import cv2
# Full processing pipeline
cleaned_text, fields, best_candidate = process_image(
    'document.jpg',
    output_text_path='extracted_text.txt',
    output_json_path='extracted_fields.json',
    scale_factor=1.5,
    verbose=True
)
# Access extracted fields
print(f"Title: {fields.title}")
print(f"Authors: {fields.authors}")
print(f"DOI: {fields.doi}")
# Or use individual components
bgr = cv2.imread('document.jpg')
rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
# Run agentic OCR
best = run_agent(rgb, psms=[3, 4, 6, 11], scale_factor=1.5)
print(f"Best variant: {best.variant}, PSM: {best.psm}, Confidence: {best.avg_conf}")
# Extract fields from text
fields = extract_fields(best.text)
print(fields.to_dict())
```
## How It Works
1. **Image Loading**: Reads the input image and converts to RGB format
2. **Preprocessing**: Generates multiple variants of the image:
   - Raw (original)
   - Upscaled (1.5x by default)
   - Grayscale
   - Otsu threshold
   - Adaptive threshold
   - Denoised
   - Sharpened
   - CLAHE (Contrast Limited Adaptive Histogram Equalization)
   - Morphological closing
3. **OCR Execution**: Runs Tesseract OCR on each variant with multiple page segmentation modes (PSM 3, 4, 6, 11 by default)
4. **Candidate Scoring**: Scores each OCR result based on:
   - Average word confidence
   - Text length (penalizes very short outputs)
   - Word count (bonus for reasonable counts)
5. **Best Selection**: Selects the candidate with the highest combined score
6. **Text Cleaning**: Cleans the OCR output by:
   - Normalizing Unicode characters
   - Fixing common OCR artifacts
   - Cleaning whitespace
7. **Field Extraction**: Uses rule-based regex patterns to extract structured fields
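
The scoring and selection steps (4 and 5) can be sketched as follows. The field names `variant`, `psm`, `text`, and `avg_conf` mirror the attributes used in the Python API example, but the `Candidate` class, the weights, and the thresholds here are assumptions chosen for illustration; the actual scoring formula in the project may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """Hypothetical container for one OCR result."""
    variant: str     # name of the preprocessing variant
    psm: int         # Tesseract page segmentation mode used
    text: str        # recognized text
    avg_conf: float  # mean word confidence, 0-100


def score(candidate: Candidate) -> float:
    """Combine confidence, text length, and word count into one number."""
    words = candidate.text.split()
    # Assumed rule: halve the confidence of very short outputs.
    length_factor = 0.5 if len(candidate.text.strip()) < 20 else 1.0
    # Assumed rule: up to 10 bonus points, saturating at 200 words.
    word_bonus = min(len(words), 200) / 200 * 10
    return candidate.avg_conf * length_factor + word_bonus


def select_best(candidates: List[Candidate]) -> Candidate:
    """Return the candidate with the highest combined score."""
    return max(candidates, key=score)


candidates = [
    Candidate("otsu", 6, "x", 95.0),
    Candidate("gray", 3, "This is a reasonably long line of recognized text", 80.0),
]
best = select_best(candidates)
print(best.variant, best.psm, round(score(best), 2))
```

The point of the length factor is that a high-confidence result containing a single character should not beat a slightly less confident result that recovered a full line of text.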
## Tips for Better Results
- **Image Quality**: Higher-resolution images generally produce better results
- **Scale Factor**: Try increasing the scale factor (e.g., 2.0) for images with small text
- **PSM Modes**: Different document layouts may benefit from different PSM modes:
  - PSM 3: Fully automatic page segmentation
  - PSM 4: Assume a single column of text
  - PSM 6: Assume a single uniform block of text
  - PSM 11: Sparse text, find as much text as possible
- **Cropping**: For PDFs rendered to images, cropping margins often improves results
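
When calling Tesseract directly through pytesseract, a page segmentation mode is selected with the `--psm` option in the `config` string. The sketch below builds such strings for the modes described above; `PSM_DESCRIPTIONS` and `build_tesseract_config` are hypothetical helpers for illustration and are not part of this project's API.

```python
from typing import Dict

# Descriptions of the modes discussed in the tips above.
PSM_DESCRIPTIONS: Dict[int, str] = {
    3: "Fully automatic page segmentation",
    4: "Assume a single column of text",
    6: "Assume a single uniform block of text",
    11: "Sparse text, find as much text as possible",
}


def build_tesseract_config(psm: int) -> str:
    """Build a Tesseract config string for the given page segmentation mode."""
    # Tesseract defines page segmentation modes 0 through 13.
    if not 0 <= psm <= 13:
        raise ValueError(f"PSM must be between 0 and 13, got {psm}")
    return f"--psm {psm}"


configs = [build_tesseract_config(psm) for psm in (3, 6, 11)]
for psm, config in zip((3, 6, 11), configs):
    print(f"{config}: {PSM_DESCRIPTIONS[psm]}")

# With pytesseract installed, a config string is passed like this:
#   text = pytesseract.image_to_string(image, config=configs[0])
```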
## License
MIT License - see [LICENSE](LICENSE) for details.
## Acknowledgments
- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) - Open source OCR engine
- [pytesseract](https://github.com/madmaze/pytesseract) - Python wrapper for Tesseract
- [OpenCV](https://opencv.org/) - Computer vision library