# Lightweight Agentic OCR Document Extraction

A lightweight, agentic OCR pipeline to extract text and structured fields from document images using Tesseract OCR.

## Features

- **Multiple Preprocessing Variants**: Automatically generates and tests multiple preprocessed versions of the image (grayscale, thresholding, sharpening, denoising, resizing, CLAHE, morphological operations)
- **Multiple PSM Modes**: Tests various Tesseract page segmentation modes to find optimal results
- **Intelligent Candidate Scoring**: Ranks OCR results by average confidence, word count, and text length
- **Structured Field Extraction**: Extracts common document fields including:
  - DOI, ISSN, ISBN, PMID, arXiv ID
  - Volume, Issue, Pages, Year
  - Received/Accepted/Published dates
  - Title, Authors, Abstract, Keywords
  - Email addresses
- **Parallel Processing**: Uses thread pools for faster processing of multiple variants

## Installation

### Prerequisites

1. Install Tesseract OCR:

**Ubuntu/Debian:**
```bash
sudo apt-get update
sudo apt-get install -y tesseract-ocr
```

**macOS:**
```bash
brew install tesseract
```

**Windows:**
Download and install from: https://github.com/UB-Mannheim/tesseract/wiki

2. Install Python dependencies:
```bash
pip install pytesseract opencv-python-headless pillow numpy
```
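
Before running the pipeline, it can help to confirm that the Tesseract binary is actually visible to Python. The snippet below is a minimal sketch using only the standard library; `find_tesseract` is a hypothetical helper name and is not part of this project.

```python
import shutil
from typing import Optional


def find_tesseract(binary: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(binary)


path = find_tesseract()
print(f"Tesseract found at {path}" if path else "Tesseract not found on PATH")
```

If the binary is installed but not on `PATH` (common on Windows), pytesseract can be pointed at it explicitly by setting `pytesseract.pytesseract.tesseract_cmd` to the full path of the executable.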

## Usage

### Command Line

```bash
# Basic usage
python agentic_ocr_extractor.py document.jpg

# Save outputs to files
python agentic_ocr_extractor.py document.png -o output.txt -j fields.json

# Custom scale factor and PSM modes
python agentic_ocr_extractor.py scan.jpg --scale 2.0 --psm 3 6 11

# Quiet mode (suppress progress output)
python agentic_ocr_extractor.py document.jpg -q
```

### Python API

```python
from agentic_ocr_extractor import process_image, run_agent, extract_fields
import cv2

# Full processing pipeline
cleaned_text, fields, best_candidate = process_image(
    'document.jpg',
    output_text_path='extracted_text.txt',
    output_json_path='extracted_fields.json',
    scale_factor=1.5,
    verbose=True
)

# Access extracted fields
print(f"Title: {fields.title}")
print(f"Authors: {fields.authors}")
print(f"DOI: {fields.doi}")

# Or use individual components
bgr = cv2.imread('document.jpg')
rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)

# Run agentic OCR
best = run_agent(rgb, psms=[3, 4, 6, 11], scale_factor=1.5)
print(f"Best variant: {best.variant}, PSM: {best.psm}, Confidence: {best.avg_conf}")

# Extract fields from text
fields = extract_fields(best.text)
print(fields.to_dict())
```

## How It Works

1. **Image Loading**: Reads the input image and converts it to RGB

2. **Preprocessing**: Generates multiple variants of the image:
   - Raw (original)
   - Upscaled (1.5x by default)
   - Grayscale
   - Otsu threshold
   - Adaptive threshold
   - Denoised
   - Sharpened
   - CLAHE (Contrast Limited Adaptive Histogram Equalization)
   - Morphological closing

3. **OCR Execution**: Runs Tesseract OCR on each variant with multiple page segmentation modes (PSM 3, 4, 6, 11 by default)

4. **Candidate Scoring**: Scores each OCR result based on:
   - Average word confidence
   - Text length (penalizes very short outputs)
   - Word count (bonus for reasonable counts)

5. **Best Selection**: Selects the candidate with the highest combined score

6. **Text Cleaning**: Cleans the OCR output by:
   - Normalizing Unicode characters
   - Fixing common OCR artifacts
   - Cleaning whitespace

7. **Field Extraction**: Uses rule-based regex patterns to extract structured fields
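
Steps 6 and 7 above (text cleaning and rule-based field extraction) can be illustrated with a short standard-library sketch. The patterns, the cleaning rules, and the names `clean_text`, `extract_basic_fields`, and `PATTERNS` are assumptions made for illustration; they are not the project's actual implementation, which covers more fields and is likely more careful.

```python
import re
import unicodedata
from typing import Dict, Optional

# Hypothetical patterns for illustration only; real documents need more variants.
PATTERNS = {
    "doi": re.compile(r"\b(10\.\d{4,9}/[^\s\"<>]+)", re.IGNORECASE),
    "issn": re.compile(r"\bISSN[:\s]*(\d{4}-\d{3}[\dXx])\b"),
    # Crude assumption: the first 19xx/20xx number in the text is the year.
    "year": re.compile(r"\b((?:19|20)\d{2})\b"),
    "email": re.compile(r"\b([\w.+-]+@[\w-]+(?:\.[\w-]+)+)\b"),
}


def clean_text(raw: str) -> str:
    """Normalize Unicode, undo line-break hyphenation, and tidy whitespace."""
    # NFKC folds ligatures such as U+FB03 into plain "ffi"
    text = unicodedata.normalize("NFKC", raw)
    # Re-join words split across a line break (this also joins true hyphenated compounds)
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces/tabs, trim spaces around newlines, limit blank lines
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def extract_basic_fields(text: str) -> Dict[str, Optional[str]]:
    """Return the first match of each pattern, or None when nothing matches."""
    fields: Dict[str, Optional[str]] = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        # Strip trailing punctuation that OCR text often leaves attached
        fields[name] = match.group(1).rstrip(".,;") if match else None
    return fields


sample = (
    "E\ufb03cient  OCR pipe-\nlines\n"
    "ISSN: 1234-5678\n"
    "DOI: 10.1234/abcd.5678.\n"
    "Contact: a.author@example.org, 2021"
)
cleaned = clean_text(sample)
fields = extract_basic_fields(cleaned)
print(cleaned)
print(fields)
```

Cleaning before extraction matters here: the ligature and the split word in the first line would otherwise be carried into any title or keyword field derived from that line.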

## Tips for Better Results

- **Image Quality**: Higher-resolution images generally produce better results
- **Scale Factor**: Try increasing the scale factor (e.g., 2.0) for images with small text
- **PSM Modes**: Different document layouts may benefit from different PSM modes:
  - PSM 3: Fully automatic page segmentation
  - PSM 4: Assume a single column of text
  - PSM 6: Assume a single uniform block of text
  - PSM 11: Sparse text, find as much text as possible
- **Cropping**: For PDFs rendered to images, cropping margins often improves results
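
To compare PSM modes on a particular layout, one option is to run every (variant, PSM) pair in a thread pool and keep the highest-scoring candidate. The sketch below shows that idea with an injected `ocr_fn`, so it runs without Tesseract installed. The `Candidate` class, the `score` weights, and `best_candidate` are hypothetical and only loosely mirror the scoring criteria described earlier (average confidence, text length, word count). In a real run, `ocr_fn` might wrap `pytesseract.image_to_data(image, config=f"--psm {psm}", output_type=pytesseract.Output.DICT)` and average the word confidences.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product
from typing import Any, Callable, Dict, Sequence, Tuple


@dataclass
class Candidate:
    variant: str
    psm: int
    text: str
    avg_conf: float  # mean word confidence on a 0-100 scale


def score(c: Candidate) -> float:
    """Combine confidence, text length, and word count (weights are assumptions)."""
    words = len(c.text.split())
    length_penalty = 20.0 if len(c.text.strip()) < 20 else 0.0  # very short output
    word_bonus = 10.0 if 5 <= words <= 5000 else 0.0  # "reasonable" word count
    return c.avg_conf - length_penalty + word_bonus


def best_candidate(
    variants: Dict[str, Any],
    psms: Sequence[int],
    ocr_fn: Callable[[Any, int], Tuple[str, float]],
    max_workers: int = 4,
) -> Candidate:
    """Run ocr_fn on every (variant, PSM) pair in parallel and return the top scorer."""
    jobs = list(product(variants.items(), psms))

    def work(job):
        (name, image), psm = job
        text, avg_conf = ocr_fn(image, psm)
        return Candidate(name, psm, text, avg_conf)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in the same order as `jobs`
        candidates = list(pool.map(work, jobs))
    return max(candidates, key=score)


def fake_ocr(image: Any, psm: int) -> Tuple[str, float]:
    # Stand-in for Tesseract: only PSM 6 "reads" the full sentence
    if psm == 6:
        return "A single uniform block of text was recognized here", 80.0
    return "A sin", 85.0


best = best_candidate({"raw": "img-a", "gray": "img-b"}, [3, 6, 11], fake_ocr)
print(best.variant, best.psm, score(best))
```

Note that the fragment returned for PSM 3 and 11 has the higher raw confidence, yet the longer PSM 6 output wins once the length penalty and word-count bonus are applied. That is the reason for scoring on more than confidence alone.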

## License

MIT License - see [LICENSE](LICENSE) for details.

## Acknowledgments

- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) - Open source OCR engine
- [pytesseract](https://github.com/madmaze/pytesseract) - Python wrapper for Tesseract
- [OpenCV](https://opencv.org/) - Computer vision library