# Lightweight Agentic OCR Document Extraction

A lightweight, agentic OCR pipeline to extract text and structured fields from document images using Tesseract OCR.

## Features

- **Multiple Preprocessing Variants**: Automatically generates and tests multiple image preprocessing variants (grayscale conversion, thresholding, sharpening, denoising, resizing, CLAHE, morphological operations)
- **Multiple PSM Modes**: Tests various Tesseract page segmentation modes to find optimal results
- **Intelligent Candidate Scoring**: Ranks OCR results by average confidence, word count, and text length
- **Structured Field Extraction**: Extracts common document fields including:
  - DOI, ISSN, ISBN, PMID, arXiv ID
  - Volume, Issue, Pages, Year
  - Received/Accepted/Published dates
  - Title, Authors, Abstract, Keywords
  - Email addresses
- **Parallel Processing**: Uses thread pools for faster processing of multiple variants

## Installation

### Prerequisites

1. Install Tesseract OCR:

**Ubuntu/Debian:**
```bash
sudo apt-get update
sudo apt-get install -y tesseract-ocr
```

**macOS:**
```bash
brew install tesseract
```

**Windows:**
Download and install from: https://github.com/UB-Mannheim/tesseract/wiki

2. Install Python dependencies:
```bash
pip install pytesseract opencv-python-headless pillow numpy
```

## Usage

### Command Line

```bash
# Basic usage
python agentic_ocr_extractor.py document.jpg

# Save outputs to files
python agentic_ocr_extractor.py document.png -o output.txt -j fields.json

# Custom scale factor and PSM modes
python agentic_ocr_extractor.py scan.jpg --scale 2.0 --psm 3 6 11

# Quiet mode (suppress progress output)
python agentic_ocr_extractor.py document.jpg -q
```
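
For readers who want to see how flags like these are typically wired up, here is a minimal `argparse` sketch that accepts the same invocations. It is not the script's actual parser: the `dest` names and help strings are assumptions, and the defaults simply mirror the values documented in this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the flags shown above (hypothetical layout)."""
    parser = argparse.ArgumentParser(description="Agentic OCR extraction (illustrative)")
    parser.add_argument("image", help="path to the input image")
    parser.add_argument("-o", dest="output_text", help="write cleaned text to this file")
    parser.add_argument("-j", dest="output_json", help="write extracted fields to this JSON file")
    parser.add_argument("--scale", type=float, default=1.5, help="upscaling factor")
    # nargs="+" lets several PSM values follow a single --psm flag.
    parser.add_argument("--psm", type=int, nargs="+", default=[3, 4, 6, 11],
                        help="Tesseract page segmentation modes to try")
    parser.add_argument("-q", dest="quiet", action="store_true", help="suppress progress output")
    return parser


args = build_parser().parse_args(["scan.jpg", "--scale", "2.0", "--psm", "3", "6", "11"])
print(args.scale, args.psm, args.quiet)  # prints: 2.0 [3, 6, 11] False
```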

### Python API

```python
from agentic_ocr_extractor import process_image, run_agent, extract_fields
import cv2

# Full processing pipeline
cleaned_text, fields, best_candidate = process_image(
    'document.jpg',
    output_text_path='extracted_text.txt',
    output_json_path='extracted_fields.json',
    scale_factor=1.5,
    verbose=True
)

# Access extracted fields
print(f"Title: {fields.title}")
print(f"Authors: {fields.authors}")
print(f"DOI: {fields.doi}")

# Or use individual components
bgr = cv2.imread('document.jpg')
rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)

# Run agentic OCR
best = run_agent(rgb, psms=[3, 4, 6, 11], scale_factor=1.5)
print(f"Best variant: {best.variant}, PSM: {best.psm}, Confidence: {best.avg_conf}")

# Extract fields from text
fields = extract_fields(best.text)
print(fields.to_dict())
```
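
Conceptually, `run_agent` tries every preprocessing variant with every PSM and keeps the best-scoring candidate. The sketch below shows one way such a loop could look, using a thread pool. The names `Candidate`, `score`, and `pick_best`, the injected `ocr` callable, and all scoring weights are assumptions made for illustration; the real scoring formula lives in `agentic_ocr_extractor.py`.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable, Mapping, Tuple


@dataclass
class Candidate:
    variant: str
    psm: int
    text: str
    avg_conf: float  # mean word confidence, 0-100


def score(c: Candidate) -> float:
    """Hypothetical score: confidence, a short-text penalty, and a word-count bonus."""
    s = c.avg_conf
    if len(c.text.strip()) < 20:               # penalize near-empty outputs
        s -= 30.0
    s += min(len(c.text.split()), 200) * 0.05  # small bonus, capped at +10
    return s


def pick_best(variants: Mapping[str, object], psms: Iterable[int],
              ocr: Callable[[object, int], Tuple[str, float]]) -> Candidate:
    """Run `ocr(image, psm)` for every (variant, PSM) pair and return the top scorer."""
    jobs = list(product(variants.items(), psms))

    def run(job) -> Candidate:
        (name, image), psm = job
        text, conf = ocr(image, psm)
        return Candidate(name, psm, text, conf)

    # Threads are enough here when `ocr` spends its time in an external process.
    with ThreadPoolExecutor(max_workers=4) as pool:
        candidates = list(pool.map(run, jobs))
    return max(candidates, key=score)
```

Threads suit this workload because pytesseract runs the `tesseract` binary as a subprocess, so the Python interpreter is mostly waiting rather than computing.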

## How It Works

1. **Image Loading**: Reads the input image and converts it from OpenCV's BGR channel order to RGB

2. **Preprocessing**: Generates multiple variants of the image:
   - Raw (original)
   - Upscaled (1.5x by default)
   - Grayscale
   - Otsu threshold
   - Adaptive threshold
   - Denoised
   - Sharpened
   - CLAHE (Contrast Limited Adaptive Histogram Equalization)
   - Morphological closing

3. **OCR Execution**: Runs Tesseract OCR on each variant with multiple page segmentation modes (PSM 3, 4, 6, 11 by default)

4. **Candidate Scoring**: Scores each OCR result based on:
   - Average word confidence
   - Text length (penalizes very short outputs)
   - Word count (bonus for reasonable counts)

5. **Best Selection**: Selects the candidate with the highest combined score

6. **Text Cleaning**: Cleans the OCR output by:
   - Normalizing Unicode characters
   - Fixing common OCR artifacts
   - Cleaning whitespace

7. **Field Extraction**: Uses rule-based regex patterns to extract structured fields
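
The last two steps can be sketched with the standard library alone. The example below is a simplified stand-in, not the project's `extract_fields`: the names `clean_text`, `BasicFields`, and `extract_basic_fields` are hypothetical, the regexes are deliberately loose, and only three of the documented fields are covered.

```python
import re
import unicodedata
from dataclasses import dataclass, field
from typing import List, Optional


def clean_text(raw: str) -> str:
    """Normalize Unicode, undo common OCR artifacts, and tidy whitespace."""
    text = unicodedata.normalize("NFKC", raw)      # folds ligatures such as U+FB01 into "fi"
    text = text.replace("\u00ad", "")              # drop soft hyphens
    # Heuristic: rejoin words split across lines (may also merge real hyphenated compounds).
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    text = re.sub(r"[ \t]+", " ", text)            # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)         # keep at most one blank line
    return "\n".join(line.strip() for line in text.splitlines()).strip()


@dataclass
class BasicFields:
    doi: Optional[str] = None
    year: Optional[str] = None
    emails: List[str] = field(default_factory=list)


DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_basic_fields(text: str) -> BasicFields:
    """Pull a DOI, the first plausible year, and all email addresses from cleaned text."""
    doi = DOI_RE.search(text)
    year = YEAR_RE.search(text)
    return BasicFields(
        doi=doi.group(0).rstrip(".,;") if doi else None,  # trim sentence punctuation
        year=year.group(0) if year else None,
        emails=EMAIL_RE.findall(text),
    )
```

Cleaning before extraction matters: a ligature or a stray line-break hyphen inside a DOI or email address is enough to make an otherwise correct pattern miss.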

## Tips for Better Results

- **Image Quality**: Higher-resolution images generally produce better results
- **Scale Factor**: Try increasing the scale factor (e.g., 2.0) for images with small text
- **PSM Modes**: Different document layouts may benefit from different PSM modes:
  - PSM 3: Fully automatic page segmentation
  - PSM 4: Assume a single column of text
  - PSM 6: Assume a single uniform block of text
  - PSM 11: Sparse text, find as much text as possible
- **Cropping**: For PDFs rendered to images, cropping margins often improves results
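
To experiment with a single PSM by hand, pytesseract accepts Tesseract's `--psm` option through its `config` argument. The sketch below assumes that approach; the helper names `psm_config`, `mean_confidence`, and `ocr_with_psm` are hypothetical and not part of this project's API.

```python
from statistics import mean

# Descriptions copied from the list above.
PSM_MODES = {
    3: "Fully automatic page segmentation",
    4: "Assume a single column of text",
    6: "Assume a single uniform block of text",
    11: "Sparse text, find as much text as possible",
}


def psm_config(psm: int) -> str:
    """Build the Tesseract config string for one page segmentation mode."""
    return f"--psm {psm}"


def mean_confidence(data: dict) -> float:
    """Average word confidence from an image_to_data-style dict.

    Entries with confidence -1 mark layout boxes rather than words, so they are skipped.
    """
    confs = [float(c) for c, t in zip(data["conf"], data["text"])
             if float(c) >= 0 and str(t).strip()]
    return mean(confs) if confs else 0.0


def ocr_with_psm(image, psm: int):
    """Run Tesseract with one PSM and return (text, average confidence)."""
    import pytesseract  # imported lazily; needs the Tesseract binary installed

    data = pytesseract.image_to_data(image, config=psm_config(psm),
                                     output_type=pytesseract.Output.DICT)
    text = " ".join(w for w in data["text"] if w.strip())
    return text, mean_confidence(data)
```

Comparing the average confidence across a few modes on a representative page is a quick way to decide which values to pass to `--psm`.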

## License

MIT License - see [LICENSE](LICENSE) for details.

## Acknowledgments

- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) - Open source OCR engine
- [pytesseract](https://github.com/madmaze/pytesseract) - Python wrapper for Tesseract
- [OpenCV](https://opencv.org/) - Computer vision library