	# Lightweight Agentic OCR Document Extraction

	A lightweight, agentic OCR pipeline that extracts text and structured fields from document images using Tesseract OCR.

	## Features

	- Multiple Preprocessing Variants: Automatically generates and tests multiple image preprocessing variants (grayscale, thresholding, sharpening, denoising, resizing, CLAHE, morphological operations)
	- Multiple PSM Modes: Tests various Tesseract page segmentation modes to find optimal results
	- Intelligent Candidate Scoring: Ranks OCR results by average confidence, word count, and text length
	- Structured Field Extraction: Extracts common document fields including:
	  - DOI, ISSN, ISBN, PMID, arXiv ID
	  - Volume, Issue, Pages, Year
	  - Received/Accepted/Published dates
	  - Title, Authors, Abstract, Keywords
	  - Email addresses
	- Parallel Processing: Uses thread pools for faster processing of multiple variants
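
The scoring and parallel-processing features above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Candidate` class, the weights in `score`, and the `best_candidate` helper are hypothetical, and `ocr_fn` stands in for any function that runs OCR on one image with one PSM and returns `(text, avg_conf)`.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product
from typing import Any, Callable, Mapping, Sequence, Tuple


@dataclass
class Candidate:
    """One OCR attempt (hypothetical structure, for illustration only)."""
    variant: str     # name of the preprocessing variant
    psm: int         # Tesseract page segmentation mode used
    text: str        # recognized text
    avg_conf: float  # mean word confidence, 0-100


def score(c: Candidate) -> float:
    """Assumed scoring: confidence plus small, capped bonuses."""
    words = len(c.text.split())
    length_bonus = min(len(c.text), 500) / 500 * 10  # up to +10 for text length
    word_bonus = min(words, 100) / 100 * 10          # up to +10 for word count
    return c.avg_conf + length_bonus + word_bonus


def best_candidate(
    variants: Mapping[str, Any],
    psms: Sequence[int],
    ocr_fn: Callable[[Any, int], Tuple[str, float]],
    max_workers: int = 4,
) -> Candidate:
    """Run ocr_fn on every (variant, PSM) pair in a thread pool, keep the best."""
    jobs = list(product(variants.items(), psms))

    def run(job):
        (name, image), psm = job
        text, conf = ocr_fn(image, psm)
        return Candidate(name, psm, text, conf)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        candidates = list(pool.map(run, jobs))
    return max(candidates, key=score)
```

Threads are a reasonable fit for this kind of workload because each OCR call spends most of its time outside the Python interpreter, in the external Tesseract process.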

	## Installation

	### Prerequisites

	1. Install Tesseract OCR:

	Ubuntu/Debian:
	```bash
	sudo apt-get update
	sudo apt-get install -y tesseract-ocr
	```

	macOS:
	```bash
	brew install tesseract
	```

	Windows:
	Download and install from: https://github.com/UB-Mannheim/tesseract/wiki

	2. Install Python dependencies:
	```bash
	pip install pytesseract opencv-python-headless pillow numpy
	```

	## Usage

	### Command Line

	```bash
	# Basic usage
	python agentic_ocr_extractor.py document.jpg

	# Save outputs to files
	python agentic_ocr_extractor.py document.png -o output.txt -j fields.json

	# Custom scale factor and PSM modes
	python agentic_ocr_extractor.py scan.jpg --scale 2.0 --psm 3 6 11

	# Quiet mode (suppress progress output)
	python agentic_ocr_extractor.py document.jpg -q
	```

	### Python API

	```python
	from agentic_ocr_extractor import process_image, run_agent, extract_fields
	import cv2

	# Full processing pipeline
	cleaned_text, fields, best_candidate = process_image(
	    'document.jpg',
	    output_text_path='extracted_text.txt',
	    output_json_path='extracted_fields.json',
	    scale_factor=1.5,
	    verbose=True,
	)

	# Access extracted fields
	print(f"Title: {fields.title}")
	print(f"Authors: {fields.authors}")
	print(f"DOI: {fields.doi}")

	# Or use individual components
	bgr = cv2.imread('document.jpg')
	rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)

	# Run agentic OCR
	best = run_agent(rgb, psms=[3, 4, 6, 11], scale_factor=1.5)
	print(f"Best variant: {best.variant}, PSM: {best.psm}, Confidence: {best.avg_conf}")

	# Extract fields from text
	fields = extract_fields(best.text)
	print(fields.to_dict())
	```

	## How It Works

	1. Image Loading: Reads the input image and converts it to RGB format

	2. Preprocessing: Generates multiple variants of the image:
	   - Raw (original)
	   - Upscaled (1.5x by default)
	   - Grayscale
	   - Otsu threshold
	   - Adaptive threshold
	   - Denoised
	   - Sharpened
	   - CLAHE (Contrast Limited Adaptive Histogram Equalization)
	   - Morphological closing

	3. OCR Execution: Runs Tesseract OCR on each variant with multiple page segmentation modes (PSM 3, 4, 6, 11 by default)

	4. Candidate Scoring: Scores each OCR result based on:
	   - Average word confidence
	   - Text length (penalizes very short outputs)
	   - Word count (bonus for reasonable counts)

	5. Best Selection: Selects the candidate with the highest combined score

	6. Text Cleaning: Cleans the OCR output by:
	   - Normalizing Unicode characters
	   - Fixing common OCR artifacts
	   - Cleaning whitespace

	7. Field Extraction: Uses rule-based regex patterns to extract structured fields
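
Steps 6 and 7 can be sketched with the standard library alone. The patterns below are illustrative assumptions, not the regexes this project ships, and `clean_text` and `extract_fields_sketch` are hypothetical names chosen to avoid confusion with the real `extract_fields`.

```python
import re
import unicodedata

# Illustrative patterns only; the project's actual regexes may differ.
PATTERNS = {
    "doi": re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+"),
    "issn": re.compile(r"\bISSN[:\s]*(\d{4}-\d{3}[\dXx])\b"),
    "arxiv": re.compile(r"\barXiv:\s*(\d{4}\.\d{4,5}(?:v\d+)?)", re.I),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "year": re.compile(r"\b(?:19|20)\d{2}\b"),
}


def clean_text(text: str) -> str:
    """Normalize Unicode, undo simple OCR artifacts, tidy whitespace."""
    text = unicodedata.normalize("NFKC", text)    # fold ligatures and full-width forms
    text = text.replace("\u00ad", "")             # drop soft hyphens
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # rejoin words split across lines
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)        # at most one blank line in a row
    return text.strip()


def extract_fields_sketch(text: str) -> dict:
    """Return the first match of each pattern, keyed by field name."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            # Use the capture group when the pattern defines one, else the whole match.
            value = match.group(match.lastindex or 0)
            fields[name] = value.rstrip(".,;")    # trim sentence punctuation
    return fields
```

Rejoining hyphenated line breaks will also merge genuinely hyphenated compounds that happen to wrap, so a production pipeline would likely want a dictionary check at that step.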

	## Tips for Better Results

	- Image Quality: Higher-resolution images generally produce better results
	- Scale Factor: Try increasing the scale factor (e.g., 2.0) for images with small text
	- PSM Modes: Different document layouts may benefit from different PSM modes:
	  - PSM 3: Fully automatic page segmentation, without orientation and script detection (Tesseract's default)
	  - PSM 4: Assume a single column of text of variable sizes
	  - PSM 6: Assume a single uniform block of text
	  - PSM 11: Sparse text; find as much text as possible, in no particular order
	- Cropping: For PDFs rendered to images, cropping margins often improves results
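
To compare PSM modes by hand, the mode is passed to Tesseract through pytesseract's `config` string. The sketch below is an assumption about how one might do this, not the project's own code; `psm_config`, `average_confidence`, and `ocr_with_psm` are hypothetical helper names. Only `ocr_with_psm` needs pytesseract and the Tesseract binary, so it imports the library lazily.

```python
from statistics import mean


def psm_config(psm: int, oem: int = 3) -> str:
    """Build the Tesseract flags for one page segmentation mode (0-13)."""
    if not 0 <= psm <= 13:
        raise ValueError(f"PSM must be between 0 and 13, got {psm}")
    return f"--oem {oem} --psm {psm}"


def average_confidence(data: dict) -> float:
    """Mean confidence over real words in image_to_data's dict output.

    Tesseract reports a confidence of -1 for boxes that are not words, and
    pytesseract may return confidences as ints, floats, or strings.
    """
    confs = []
    for text, conf in zip(data["text"], data["conf"]):
        value = float(conf)
        if text.strip() and value >= 0:
            confs.append(value)
    return mean(confs) if confs else 0.0


def ocr_with_psm(image, psm: int):
    """Run Tesseract once and return (text, average word confidence)."""
    import pytesseract  # lazy import: requires the Tesseract binary at call time

    data = pytesseract.image_to_data(
        image,
        config=psm_config(psm),
        output_type=pytesseract.Output.DICT,
    )
    text = " ".join(word for word in data["text"] if word.strip())
    return text, average_confidence(data)
```

Calling `ocr_with_psm(image, psm)` for each of 3, 4, 6, and 11 on the same image and comparing the returned confidences gives a quick sense of which layout assumption suits a document.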

	## License

	MIT License - see [LICENSE](LICENSE) for details.

	## Acknowledgments

	- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) - Open source OCR engine
	- [pytesseract](https://github.com/madmaze/pytesseract) - Python wrapper for Tesseract
	- [OpenCV](https://opencv.org/) - Computer vision library