ahczhg committed · Commit 3522f86 · verified · 1 Parent(s): 3354205

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +144 -0
README.md ADDED
@@ -0,0 +1,144 @@
+ # Lightweight Agentic OCR Document Extraction
+
+ A lightweight, agentic OCR pipeline that extracts text and structured fields from document images using Tesseract OCR.
+
+ ## Features
+
+ - **Multiple Preprocessing Variants**: Automatically generates and tests multiple image preprocessing variants (grayscale, thresholding, sharpening, denoising, resizing, CLAHE, morphological operations)
+ - **Multiple PSM Modes**: Tests several Tesseract page segmentation modes (PSM) to find the best result
+ - **Intelligent Candidate Scoring**: Ranks OCR results by average confidence, word count, and text length
+ - **Structured Field Extraction**: Extracts common document fields, including:
+   - DOI, ISSN, ISBN, PMID, arXiv ID
+   - Volume, Issue, Pages, Year
+   - Received/Accepted/Published dates
+   - Title, Authors, Abstract, Keywords
+   - Email addresses
+ - **Parallel Processing**: Uses thread pools to process multiple variants faster
+
+ ## Installation
+
+ ### Prerequisites
+
+ 1. Install Tesseract OCR:
+
+ **Ubuntu/Debian:**
+ ```bash
+ sudo apt-get update
+ sudo apt-get install -y tesseract-ocr
+ ```
+
+ **macOS:**
+ ```bash
+ brew install tesseract
+ ```
+
+ **Windows:**
+ Download and install from: https://github.com/UB-Mannheim/tesseract/wiki
+
+ 2. Install Python dependencies:
+ ```bash
+ pip install pytesseract opencv-python-headless pillow numpy
+ ```
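
Before running the pipeline, it can help to confirm that the `tesseract` executable from step 1 is actually reachable. The following is a minimal sketch using only the Python standard library; the helper names `find_tesseract` and `tesseract_version` are hypothetical and are not part of this repository.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the path of the `tesseract` executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(executable):
    """Return the first line of `<executable> --version`, or None if the call fails."""
    try:
        result = subprocess.run(
            [executable, "--version"],
            capture_output=True, text=True, check=True, timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    # Fall back to stderr in case the version banner is printed there.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found on PATH; see the installation steps above")
    else:
        print(f"found {path}: {tesseract_version(path)}")
```

If the executable is installed but not on `PATH` (common on Windows), pytesseract can be pointed at it through `pytesseract.pytesseract.tesseract_cmd`.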
+
+ ## Usage
+
+ ### Command Line
+
+ ```bash
+ # Basic usage
+ python agentic_ocr_extractor.py document.jpg
+
+ # Save outputs to files
+ python agentic_ocr_extractor.py document.png -o output.txt -j fields.json
+
+ # Custom scale factor and PSM modes
+ python agentic_ocr_extractor.py scan.jpg --scale 2.0 --psm 3 6 11
+
+ # Quiet mode (suppress progress output)
+ python agentic_ocr_extractor.py document.jpg -q
+ ```
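
For readers who want to see how these flags fit together, here is a hedged `argparse` sketch that accepts the same invocations. Only the flag spellings (`-o`, `-j`, `--scale`, `--psm`, `-q`) and the defaults (scale 1.5, PSM 3, 4, 6, 11) come from this README; the `dest` names and help texts are assumptions, and the script's real parser may differ.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the command lines shown above (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="agentic_ocr_extractor.py",
        description="Extract text and structured fields from a document image.",
    )
    parser.add_argument("image", help="path to the input document image")
    # dest names below are hypothetical; only the short flags appear in the README.
    parser.add_argument("-o", dest="output_text", default=None,
                        help="write the cleaned text to this file")
    parser.add_argument("-j", dest="output_json", default=None,
                        help="write the extracted fields to this JSON file")
    parser.add_argument("--scale", type=float, default=1.5,
                        help="upscaling factor for the resized variant")
    parser.add_argument("--psm", type=int, nargs="+", default=[3, 4, 6, 11],
                        help="Tesseract page segmentation modes to try")
    parser.add_argument("-q", dest="quiet", action="store_true",
                        help="suppress progress output")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["scan.jpg", "--scale", "2.0", "--psm", "3", "6", "11"])
    print(args.image, args.scale, args.psm, args.quiet)
```

Because `--psm` takes one or more values, the image path should come before it on the command line, as in the examples above.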
+
+ ### Python API
+
+ ```python
+ from agentic_ocr_extractor import process_image, run_agent, extract_fields
+ import cv2
+
+ # Full processing pipeline
+ cleaned_text, fields, best_candidate = process_image(
+     'document.jpg',
+     output_text_path='extracted_text.txt',
+     output_json_path='extracted_fields.json',
+     scale_factor=1.5,
+     verbose=True
+ )
+
+ # Access extracted fields
+ print(f"Title: {fields.title}")
+ print(f"Authors: {fields.authors}")
+ print(f"DOI: {fields.doi}")
+
+ # Or use individual components
+ bgr = cv2.imread('document.jpg')
+ rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
+
+ # Run agentic OCR
+ best = run_agent(rgb, psms=[3, 4, 6, 11], scale_factor=1.5)
+ print(f"Best variant: {best.variant}, PSM: {best.psm}, Confidence: {best.avg_conf}")
+
+ # Extract fields from text
+ fields = extract_fields(best.text)
+ print(fields.to_dict())
+ ```
+
+ ## How It Works
+
+ 1. **Image Loading**: Reads the input image and converts it to RGB
+
+ 2. **Preprocessing**: Generates multiple variants of the image:
+    - Raw (original)
+    - Upscaled (1.5x by default)
+    - Grayscale
+    - Otsu threshold
+    - Adaptive threshold
+    - Denoised
+    - Sharpened
+    - CLAHE (Contrast Limited Adaptive Histogram Equalization)
+    - Morphological closing
+
+ 3. **OCR Execution**: Runs Tesseract OCR on each variant with multiple page segmentation modes (PSM 3, 4, 6, 11 by default)
+
+ 4. **Candidate Scoring**: Scores each OCR result based on:
+    - Average word confidence
+    - Text length (penalizes very short outputs)
+    - Word count (bonus for reasonable counts)
+
+ 5. **Best Selection**: Selects the candidate with the highest combined score
+
+ 6. **Text Cleaning**: Cleans the OCR output by:
+    - Normalizing Unicode characters
+    - Fixing common OCR artifacts
+    - Cleaning whitespace
+
+ 7. **Field Extraction**: Uses rule-based regex patterns to extract structured fields
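
Steps 3 to 5 can be sketched as a small grid search over variant/PSM pairs. This is an illustration under stated assumptions, not the repository's code: the `Candidate` attribute names follow the Python API example above, while `score`, `run_grid`, `select_best`, the weights, and the thresholds are hypothetical. `fake_ocr` stands in for a real Tesseract call so the sketch runs without any OCR engine installed.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product


@dataclass
class Candidate:
    variant: str
    psm: int
    text: str
    avg_conf: float  # mean word confidence on a 0-100 scale


def score(candidate, min_chars=20, good_words=(5, 2000)):
    """Combine confidence, text length, and word count (weights are assumptions)."""
    total = candidate.avg_conf
    text = candidate.text.strip()
    if len(text) < min_chars:
        total -= 30.0  # penalize very short outputs
    if good_words[0] <= len(text.split()) <= good_words[1]:
        total += 10.0  # bonus for a reasonable word count
    return total


def run_grid(variants, psms, ocr_fn, max_workers=4):
    """Call ocr_fn(image, psm) -> (text, avg_conf) for every variant/PSM pair in a thread pool."""
    jobs = list(product(variants.items(), psms))

    def work(job):
        (name, image), psm = job
        text, avg_conf = ocr_fn(image, psm)
        return Candidate(name, psm, text, avg_conf)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(work, jobs))


def select_best(candidates):
    """Return the candidate with the highest combined score."""
    return max(candidates, key=score)


def fake_ocr(image, psm):
    # Stand-in for a real call such as
    # pytesseract.image_to_data(image, config=f"--psm {psm}", output_type=pytesseract.Output.DICT)
    canned = {
        ("raw", 3): ("Journal of Examples Vol. 12 doi 10.1234/abc", 71.0),
        ("raw", 6): ("J", 95.0),
        ("otsu", 3): ("Journal of Examples Volume 12 DOI 10.1234/abc", 88.0),
        ("otsu", 6): ("", 0.0),
    }
    return canned[(image, psm)]


if __name__ == "__main__":
    candidates = run_grid({"raw": "raw", "otsu": "otsu"}, [3, 6], fake_ocr)
    best = select_best(candidates)
    print(f"Best variant: {best.variant}, PSM: {best.psm}, score: {score(best)}")
```

Note how the single-letter result with confidence 95 loses to a longer result with confidence 88: confidence alone would pick a near-empty output, which is why length and word count enter the score.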
+
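Steps 6 and 7 are rule-based, so they are easy to illustrate with the standard library. The sketch below is an assumption about what such rules might look like, not the patterns this project ships: `clean_text` and `extract_fields_sketch` are hypothetical names, the regexes are deliberately simple, and only a few of the supported fields are covered.

```python
import re
import unicodedata

# Characters that NFKC leaves untouched but that are easier to match as ASCII.
PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
})

PATTERNS = {
    "doi": re.compile(r"\b(10\.\d{4,9}/[-._;()/:A-Za-z0-9]+)"),
    "issn": re.compile(r"\b(\d{4}-\d{3}[\dXx])\b"),
    "arxiv_id": re.compile(r"arXiv:\s*(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE),
    "email": re.compile(r"([\w.+-]+@[\w-]+(?:\.[\w-]+)+)"),
    "year": re.compile(r"\b((?:19|20)\d{2})\b"),
}


def clean_text(raw):
    """Normalize Unicode, undo line-break hyphenation, and tidy whitespace."""
    text = unicodedata.normalize("NFKC", raw)  # also expands ligatures such as U+FB01
    text = text.translate(PUNCT_MAP)
    # Heuristic: rejoin words split by a hyphen at a line break ("extrac-\ntion").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)     # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)   # at most one blank line in a row
    return text.strip()


def extract_fields_sketch(text):
    """Return the first match for each pattern, or None when nothing matches."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        value = match.group(1) if match else None
        if name == "doi" and value:
            value = value.rstrip(".,;")  # drop sentence punctuation glued to the DOI
        fields[name] = value
    return fields


if __name__ == "__main__":
    raw = (
        "Journal of Exam-\nples   Vol. 12\nISSN 1234-567X\n"
        "DOI: 10.1234/abc.001.\nContact: jane.doe@example.org\n"
        "arXiv:2101.00001v2\nPublished 2021"
    )
    print(extract_fields_sketch(clean_text(raw)))
```

Cleaning runs before extraction on purpose: identifiers split or decorated by OCR (ligatures, curly quotes, stray spaces) are much easier to match once the text is normalized.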
+ ## Tips for Better Results
+
+ - **Image Quality**: Higher-resolution images generally produce better results
+ - **Scale Factor**: Try increasing the scale factor (e.g., 2.0) for images with small text
+ - **PSM Modes**: Different document layouts may benefit from different PSM modes:
+   - PSM 3: Fully automatic page segmentation
+   - PSM 4: Assume a single column of text
+   - PSM 6: Assume a single uniform block of text
+   - PSM 11: Sparse text, find as much text as possible
+ - **Cropping**: For PDFs rendered to images, cropping margins often improves results
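
The cropping tip amounts to finding the bounding box of the non-background pixels. As a rough, dependency-free illustration, the sketch below treats a grayscale page as a list of rows of 0-255 integers; `crop_margins` and its parameters are hypothetical, and in practice the same idea would more likely be applied to a NumPy array loaded with OpenCV or Pillow.

```python
def crop_margins(gray, background=255, tolerance=10, pad=1):
    """Crop near-white margins from a grayscale image given as rows of 0-255 ints.

    A pixel counts as ink when it is darker than `background - tolerance`.
    `pad` keeps that many background pixels around the detected content.
    """
    height, width = len(gray), len(gray[0])

    def is_ink(value):
        return value < background - tolerance

    rows = [y for y in range(height) if any(is_ink(v) for v in gray[y])]
    cols = [x for x in range(width) if any(is_ink(gray[y][x]) for y in range(height))]
    if not rows:
        return gray  # blank page: nothing to crop

    top, bottom = max(rows[0] - pad, 0), min(rows[-1] + pad, height - 1)
    left, right = max(cols[0] - pad, 0), min(cols[-1] + pad, width - 1)
    return [row[left:right + 1] for row in gray[top:bottom + 1]]


if __name__ == "__main__":
    page = [[255] * 7 for _ in range(6)]
    page[2][3] = 0
    page[3][4] = 0
    cropped = crop_margins(page)
    print(len(cropped), "rows x", len(cropped[0]), "columns")
```

Keeping a small `pad` is a deliberate choice: Tesseract tends to do better when text does not touch the image border.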
+
+ ## License
+
+ MIT License - see [LICENSE](LICENSE) for details.
+
+ ## Acknowledgments
+
+ - [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) - Open source OCR engine
+ - [pytesseract](https://github.com/madmaze/pytesseract) - Python wrapper for Tesseract
+ - [OpenCV](https://opencv.org/) - Computer vision library