ahczhg/agentic-ocr-extractor


ahczhg committed on Jan 19

Commit 3522f86 · verified · 1 Parent(s): 3354205

Upload README.md with huggingface_hub

Files changed (1)

README.md +144 -0

README.md ADDED

	@@ -0,0 +1,144 @@

+# Lightweight Agentic OCR Document Extraction
+A lightweight, agentic OCR pipeline that extracts text and structured fields from document images using Tesseract OCR.
+## Features
+- **Multiple Preprocessing Variants**: Automatically generates and tests several preprocessed versions of each image (upscaling, grayscale, Otsu and adaptive thresholding, denoising, sharpening, CLAHE, morphological closing)
+- **Multiple PSM Modes**: Tests various Tesseract page segmentation modes to find optimal results
+- **Intelligent Candidate Scoring**: Ranks OCR results by average confidence, word count, and text length
+- **Structured Field Extraction**: Extracts common document fields including:
+  - DOI, ISSN, ISBN, PMID, arXiv ID
+  - Volume, Issue, Pages, Year
+  - Received/Accepted/Published dates
+  - Title, Authors, Abstract, Keywords
+  - Email addresses
+- **Parallel Processing**: Uses thread pools for faster processing of multiple variants
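The structured field extraction listed above is rule-based, so a few of the identifiers can be illustrated with plain regular expressions. The sketch below is not the project's code: `FIELD_PATTERNS` and `extract_fields_sketch` are hypothetical names, and the patterns are simplified assumptions that cover only DOI, ISSN, arXiv ID, email, and year.

```python
import re

# Hypothetical, simplified patterns; the project's own regexes may differ.
FIELD_PATTERNS = {
    "doi": re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+"),
    "issn": re.compile(r"\bISSN[:\s]*(\d{4}-\d{3}[\dXx])\b"),
    "arxiv_id": re.compile(r"\barXiv:\s*(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "year": re.compile(r"\b(?:19|20)\d{2}\b"),
}


def extract_fields_sketch(text):
    """Return the first match for each pattern, or None when it is absent."""
    found = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            found[name] = None
            continue
        # Use the capture group when the pattern defines one, else the whole match.
        value = match.group(1) if pattern.groups else match.group(0)
        # OCR text often leaves sentence punctuation glued to the identifier.
        found[name] = value.rstrip(".,;")
    return found


sample = (
    "Journal of Tests, ISSN: 1234-567X\n"
    "Received: 3 March 2021\n"
    "doi: 10.1234/abcd.5678.\n"
    "Contact: jane.doe@example.org\n"
)
print(extract_fields_sketch(sample))
```

Taking only the first match keeps the sketch short; a real extractor would likely need context (for example, the label preceding a date) to tell a publication year from other four-digit numbers.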
+## Installation
+### Prerequisites
+1. Install Tesseract OCR:
+**Ubuntu/Debian:**
+```bash
+sudo apt-get update
+sudo apt-get install -y tesseract-ocr
+```
+**macOS:**
+```bash
+brew install tesseract
+```
+**Windows:**
+Download and install from: https://github.com/UB-Mannheim/tesseract/wiki
+2. Install Python dependencies:
+```bash
+pip install pytesseract opencv-python-headless pillow numpy
+```
+## Usage
+### Command Line
+```bash
+# Basic usage
+python agentic_ocr_extractor.py document.jpg
+# Save outputs to files
+python agentic_ocr_extractor.py document.png -o output.txt -j fields.json
+# Custom scale factor and PSM modes
+python agentic_ocr_extractor.py scan.jpg --scale 2.0 --psm 3 6 11
+# Quiet mode (suppress progress output)
+python agentic_ocr_extractor.py document.jpg -q
+```
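The flags in the examples above can be modelled with `argparse`. This is a hypothetical reconstruction for illustration only: the `dest` names and help strings are assumptions, and the defaults are taken from the values this README states elsewhere (scale 1.5, PSM 3, 4, 6, 11).

```python
import argparse


def build_parser():
    """Hypothetical parser that accepts the flags shown in the examples."""
    parser = argparse.ArgumentParser(prog="agentic_ocr_extractor.py")
    parser.add_argument("image", help="path to the document image")
    parser.add_argument("-o", dest="output_text", help="file for the cleaned text")
    parser.add_argument("-j", dest="output_json", help="file for the extracted fields")
    parser.add_argument("--scale", type=float, default=1.5, help="upscaling factor")
    # nargs="+" lets --psm take several space-separated modes, e.g. --psm 3 6 11
    parser.add_argument("--psm", type=int, nargs="+", default=[3, 4, 6, 11])
    parser.add_argument("-q", dest="quiet", action="store_true")
    return parser


args = build_parser().parse_args(["scan.jpg", "--scale", "2.0", "--psm", "3", "6", "11"])
print(args.image, args.scale, args.psm)
```

If the script does parse `--psm` this way, the image path should come before `--psm`; otherwise the multi-value option would try to consume the path as another mode.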
+### Python API
+```python
+from agentic_ocr_extractor import process_image, run_agent, extract_fields
+import cv2
+# Full processing pipeline
+cleaned_text, fields, best_candidate = process_image(
+    'document.jpg',
+    output_text_path='extracted_text.txt',
+    output_json_path='extracted_fields.json',
+    scale_factor=1.5,
+    verbose=True
+)
+# Access extracted fields
+print(f"Title: {fields.title}")
+print(f"Authors: {fields.authors}")
+print(f"DOI: {fields.doi}")
+# Or use individual components
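+bgr = cv2.imread('document.jpg')
+if bgr is None:  # cv2.imread returns None instead of raising when the file cannot be read
+    raise FileNotFoundError('document.jpg')
+rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)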
+# Run agentic OCR
+best = run_agent(rgb, psms=[3, 4, 6, 11], scale_factor=1.5)
+print(f"Best variant: {best.variant}, PSM: {best.psm}, Confidence: {best.avg_conf}")
+# Extract fields from text
+fields = extract_fields(best.text)
+print(fields.to_dict())
+```
+## How It Works
+1. **Image Loading**: Reads the input image and converts it to RGB
+2. **Preprocessing**: Generates multiple variants of the image:
+   - Raw (original)
+   - Upscaled (1.5x by default)
+   - Grayscale
+   - Otsu threshold
+   - Adaptive threshold
+   - Denoised
+   - Sharpened
+   - CLAHE (Contrast Limited Adaptive Histogram Equalization)
+   - Morphological closing
+3. **OCR Execution**: Runs Tesseract OCR on each variant with multiple page segmentation modes (PSM 3, 4, 6, 11 by default)
+4. **Candidate Scoring**: Scores each OCR result based on:
+   - Average word confidence
+   - Text length (penalizes very short outputs)
+   - Word count (bonus for reasonable counts)
+5. **Best Selection**: Selects the candidate with the highest combined score
+6. **Text Cleaning**: Cleans the OCR output by:
+   - Normalizing Unicode characters
+   - Fixing common OCR artifacts
+   - Cleaning whitespace
+7. **Field Extraction**: Uses rule-based regex patterns to extract structured fields
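Steps 3 to 6 above can be sketched without any OCR dependency by substituting a stub for the Tesseract call. Everything here is an assumption made for illustration: `Candidate`, `score`, `run_grid`, `clean_text`, and `fake_ocr` are hypothetical names, the scoring weights are invented, and the hyphen rejoining is just one example of an OCR artifact fix.

```python
import re
import unicodedata
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product


@dataclass
class Candidate:
    variant: str
    psm: int
    text: str
    avg_conf: float  # mean word confidence on a 0-100 scale


def score(c):
    """Invented weights: confidence, minus a short-text penalty, plus a word bonus."""
    words = len(c.text.split())
    length_penalty = 20.0 if len(c.text.strip()) < 20 else 0.0
    word_bonus = min(words, 200) * 0.05  # capped so long noisy output cannot dominate
    return c.avg_conf - length_penalty + word_bonus


def run_grid(ocr, variants, psms, max_workers=4):
    """Call ocr(variant, psm) for every combination and return the top-scoring result."""
    jobs = list(product(variants, psms))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        candidates = list(pool.map(lambda job: ocr(*job), jobs))
    return max(candidates, key=score)


def clean_text(text):
    """Normalize Unicode, rejoin line-break hyphenation, and tidy whitespace."""
    text = unicodedata.normalize("NFKC", text)  # e.g. the "fi" ligature becomes "fi"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # heuristic: also removes real hyphens
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def fake_ocr(variant, psm):
    # Stand-in for a real Tesseract call so the sketch runs on its own.
    table = {("otsu", 6): ("Deep learning for OCR\nA. Author", 91.0)}
    text, conf = table.get((variant, psm), ("x", 40.0))
    return Candidate(variant, psm, text, conf)


best = run_grid(fake_ocr, ["raw", "otsu"], [3, 6])
print(best.variant, best.psm, repr(clean_text(best.text)))
```

Because the OCR function is passed in as an argument, the stub can be replaced by a function that preprocesses the image and calls Tesseract, while the scoring and selection logic stays unchanged.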
+## Tips for Better Results
+- **Image Quality**: Higher-resolution images generally produce better results
+- **Scale Factor**: Try increasing the scale factor (e.g., 2.0) for images with small text
+- **PSM Modes**: Different document layouts may benefit from different PSM modes:
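+  - PSM 3: Fully automatic page segmentation without orientation and script detection (Tesseract's default)
+  - PSM 4: Assume a single column of text of variable sizes
+  - PSM 6: Assume a single uniform block of text
+  - PSM 11: Sparse text; find as much text as possible in no particular order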
+- **Cropping**: For PDFs rendered to images, cropping margins often improves results
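When experimenting with the PSM modes listed above outside this tool, the mode is selected through Tesseract's `--psm` option, which pytesseract accepts via its `config` argument. The helper below is a hypothetical convenience (`PSM_MODES` and `tesseract_config` are not part of this project); it only builds and validates the option string, so it runs without Tesseract installed.

```python
# Modes tested by default in this README, with Tesseract's own short descriptions.
PSM_MODES = {
    3: "fully automatic page segmentation, no OSD",
    4: "single column of text of variable sizes",
    6: "single uniform block of text",
    11: "sparse text, in no particular order",
}


def tesseract_config(psm):
    """Build the config string that selects a page segmentation mode."""
    if not 0 <= psm <= 13:
        raise ValueError(f"Tesseract PSM must be between 0 and 13, got {psm}")
    return f"--psm {psm}"


# With pytesseract installed, the string is passed through the `config` argument:
#   import pytesseract
#   text = pytesseract.image_to_string(image, config=tesseract_config(6))
for psm, description in PSM_MODES.items():
    print(tesseract_config(psm), "->", description)
```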
+## License
+MIT License - see [LICENSE](LICENSE) for details.
+## Acknowledgments
+- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) - Open source OCR engine
+- [pytesseract](https://github.com/madmaze/pytesseract) - Python wrapper for Tesseract
+- [OpenCV](https://opencv.org/) - Computer vision library