Rolling out modular v2
- .DS_Store +0 -0
- .clinerules/apiDocumentation.md +29 -0
- .clinerules/projectBrief.md +21 -0
- .clinerules/systemPatterns.md +31 -0
- README.md +5 -1
- app.py +44 -30
- config.py +5 -8
- constants.py +47 -8
- image_segmentation.py +21 -2
- language_detection.py +0 -1
- ocr_processing.py +11 -1
- ocr_utils.py +33 -1771
- preprocessing.py +521 -66
- process_file.py +2 -4
- requirements.txt +1 -0
- structured_ocr.py +130 -110
- test_magician.py → testing/test_magician.py +0 -0
- ui_components.py +114 -582
- utils/content_utils.py +189 -0
- utils/file_utils.py +100 -0
- utils/general_utils.py +163 -0
- utils/image_utils.py +886 -0
- utils/text_utils.py +151 -0
- utils/ui_utils.py +413 -0
.DS_Store
CHANGED
Binary files a/.DS_Store and b/.DS_Store differ
.clinerules/apiDocumentation.md
ADDED
@@ -0,0 +1,29 @@
+apiDocumentation.md
+API Interaction Documentation
+Mistral OCR API
+
+Endpoint: /v1/ocr
+
+Payload:
+
+image (binary)
+
+prompt (optional contextual instructions)
+
+Response:
+
+structured_data: Hierarchical text + metadata output
+
+raw_text: Plain extracted text
+
+Error Handling:
+
+Timeout retries (up to 3 attempts)
+
+Local fallback to Tesseract if Mistral service unavailable
+
+Tesseract Fallback
+
+Only invoked if Mistral API fails after retries.
+
+No structured output; raw text only.
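The retry-then-fallback policy described in apiDocumentation.md can be sketched as follows. This is a minimal illustration, not code from the repository: `call_mistral_ocr` and `call_tesseract` are hypothetical stand-ins for the real API client and the local Tesseract wrapper.

```python
import time

MAX_RETRIES = 3  # matches the "up to 3 attempts" policy above

def ocr_with_fallback(image_bytes, call_mistral_ocr, call_tesseract):
    """Try the Mistral /v1/ocr call up to MAX_RETRIES times on timeout,
    then fall back to local Tesseract (raw text only, no structured output)."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return {"source": "mistral", **call_mistral_ocr(image_bytes)}
        except TimeoutError:
            if attempt < MAX_RETRIES:
                time.sleep(0.1 * attempt)  # brief backoff between retries
    # Tesseract fallback: raw text only, no structured output
    return {"source": "tesseract",
            "raw_text": call_tesseract(image_bytes),
            "structured_data": None}
```

The fallback mirrors the documented contract: a Mistral result carries `structured_data`, while the Tesseract path returns plain text only.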
.clinerules/projectBrief.md
ADDED
@@ -0,0 +1,21 @@
+# Foundation
+
+Historical OCR is an advanced optical character recognition (OCR) application designed to support historical research. It leverages Mistral AI's OCR models alongside image preprocessing pipelines optimized for archival material.
+
+High-Level Overview
+
+Building a Streamlit-based web application to process historical documents (images or PDFs), optimize them for OCR using advanced preprocessing techniques, and extract structured text and metadata through Mistral's large language models.
+
+Core Requirements and Goals
+
+Upload and preprocess historical documents
+
+Automatically detect document types (e.g., handwritten letters, scientific papers)
+
+Apply tailored OCR prompting and structured output based on document type
+
+Support user-defined contextual instructions to refine output
+
+Provide downloadable structured transcripts and analysis
+
+Example: "Building a Streamlit web app for OCR transcription and structured extraction from historical documents using Mistral AI."
.clinerules/systemPatterns.md
ADDED
@@ -0,0 +1,31 @@
+# System Architecture
+
+Frontend: Streamlit app (app.py) for user interface and interactions.
+
+Core Processing: ocr_processing.py orchestrates preprocessing, document type detection, and OCR operations.
+
+Image Preprocessing: preprocessing.py and image_segmentation.py handle deskewing, thresholding, and cleaning.
+
+OCR and Structuring: structured_ocr.py and ocr_utils.py manage API communication and formatting structured outputs.
+
+Utilities and Detection: language_detection.py, utils.py, and constants.py provide language detection, helpers, and prompt templates.
+
+Key Technical Decisions
+
+Streamlit cache management for upload processing efficiency.
+
+Modular design of preprocessing paths based on document type.
+
+Mistral AI as the primary OCR processor, with Tesseract fallback for redundancy.
+
+Design Patterns in Use
+
+Delegation: Frontend delegates all processing to backend orchestrators.
+
+Modularity: Preprocessing and OCR tasks divided into clean, testable modules.
+
+State-driven Processing: Output dynamically reflects session state and user input.
+
+Component Relationships
+
+app.py ⇨ ocr_processing.py ⇨ preprocessing.py, structured_ocr.py, language_detection.py, etc.
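The delegation pattern named in systemPatterns.md can be illustrated with a tiny sketch: the UI layer hands raw bytes to one orchestrator and renders whatever comes back. The function and parameter names here are illustrative, not the repository's actual signatures.

```python
def orchestrate_ocr(file_bytes, detect_type, preprocess, run_ocr):
    """Backend orchestrator: detect the document type, pick a preprocessing
    path for that type, then run OCR. The frontend only calls this function
    and displays the returned dict."""
    doc_type = detect_type(file_bytes)
    prepared = preprocess(file_bytes, doc_type)
    return {"type": doc_type, "result": run_ocr(prepared)}
```

Because each stage is injected as a callable, the stages stay independently testable, which is the "clean, testable modules" goal stated above.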
README.md
CHANGED
@@ -21,7 +21,11 @@ An advanced OCR application for historical document analysis using Mistral AI.
 - **OCR with Context:** AI-enhanced OCR optimized for historical documents
 - **Document Type Detection:** Automatically identifies handwritten letters, recipes, scientific texts, and more
-- **Image Preprocessing:**
+- **Advanced Image Preprocessing:**
+  - Automatic deskewing to correct document orientation
+  - Smart thresholding with Otsu and adaptive methods
+  - Morphological operations to clean up text
+  - Document-type specific optimization
 - **Custom Prompting:** Tailor the AI analysis with document-specific instructions
 - **Structured Output:** Returns organized, structured information based on document type
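The Otsu thresholding mentioned in the README chooses one global threshold by maximizing between-class variance of the grayscale histogram. A minimal NumPy sketch of the idea (the project itself would more likely call an optimized library implementation such as OpenCV's):

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Return the Otsu threshold of an 8-bit grayscale image.

    Scans all 256 candidate thresholds and keeps the one that maximizes
    the between-class variance w0 * w1 * (m0 - m1)^2."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    sum_all = float(np.dot(np.arange(256), hist))
    best_t, best_var = 0, -1.0
    w0 = 0.0   # weight (pixel count) of the dark class
    sum0 = 0.0 # intensity sum of the dark class
    for t in range(256):
        w0 += hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        sum0 += t * hist[t]
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

Binarizing with `gray > otsu_threshold(gray)` then separates ink from paper; the adaptive methods listed above instead compute a local threshold per neighborhood, which handles uneven lighting on archival scans.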
app.py
CHANGED
@@ -41,7 +41,7 @@ from constants import (
 )
 from structured_ocr import StructuredOCR
 from config import MISTRAL_API_KEY
-from
+from utils.image_utils import create_results_zip
 
 # Set favicon path
 favicon_path = os.path.join(os.path.dirname(__file__), "static/favicon.png")
@@ -74,20 +74,47 @@ st.set_page_config(
 # Consult https://docs.streamlit.io/library/advanced-features/session-state for details.
 # ========================================================================================
 
+def reset_document_state():
+    """Reset only document-specific state variables
+
+    This function explicitly resets all document-related variables to ensure
+    clean state between document processing, preventing cached data issues.
+    """
+    st.session_state.sample_document = None
+    st.session_state.original_sample_bytes = None
+    st.session_state.original_sample_name = None
+    st.session_state.original_sample_mime_type = None
+    st.session_state.is_sample_document = False
+    st.session_state.processed_document_active = False
+    st.session_state.sample_document_processed = False
+    st.session_state.sample_just_loaded = False
+    st.session_state.last_processed_file = None
+    st.session_state.selected_previous_result = None
+    # Keep temp_file_paths but ensure it's empty after cleanup
+    if 'temp_file_paths' in st.session_state:
+        st.session_state.temp_file_paths = []
+
 def init_session_state():
     """Initialize session state variables if they don't already exist
 
     This function follows Streamlit's recommended patterns for state initialization.
     It only creates variables if they don't exist yet and doesn't modify existing values.
     """
+    # Initialize persistent app state variables
     if 'previous_results' not in st.session_state:
         st.session_state.previous_results = []
     if 'temp_file_paths' not in st.session_state:
         st.session_state.temp_file_paths = []
-    if 'last_processed_file' not in st.session_state:
-        st.session_state.last_processed_file = None
     if 'auto_process_sample' not in st.session_state:
         st.session_state.auto_process_sample = False
+    if 'close_clicked' not in st.session_state:
+        st.session_state.close_clicked = False
+    if 'active_tab' not in st.session_state:
+        st.session_state.active_tab = 0
+
+    # Initialize document-specific state variables
+    if 'last_processed_file' not in st.session_state:
+        st.session_state.last_processed_file = None
     if 'sample_just_loaded' not in st.session_state:
         st.session_state.sample_just_loaded = False
     if 'processed_document_active' not in st.session_state:
@@ -104,10 +131,6 @@ def init_session_state():
         st.session_state.is_sample_document = False
     if 'selected_previous_result' not in st.session_state:
         st.session_state.selected_previous_result = None
-    if 'close_clicked' not in st.session_state:
-        st.session_state.close_clicked = False
-    if 'active_tab' not in st.session_state:
-        st.session_state.active_tab = 0
 
 def close_document():
     """Called when the Close Document button is clicked
@@ -120,24 +143,17 @@ def close_document():
     That approach breaks Streamlit's execution flow and causes UI artifacts.
     """
     logger.info("Close document button clicked")
-    # Save the previous results
-    previous_results = st.session_state.previous_results if 'previous_results' in st.session_state else []
 
-    # Clean up temp files
+    # Clean up temp files first
     if 'temp_file_paths' in st.session_state and st.session_state.temp_file_paths:
         logger.info(f"Cleaning up {len(st.session_state.temp_file_paths)} temporary files")
         handle_temp_files(st.session_state.temp_file_paths)
 
-    #
-
-        if key != 'previous_results' and key != 'close_clicked':
-            st.session_state.pop(key, None)
+    # Reset all document-specific state variables to prevent caching issues
+    reset_document_state()
 
-    # Set flag for having cleaned up
+    # Set flag for having cleaned up - this will trigger a rerun in main()
     st.session_state.close_clicked = True
-
-    # Restore the previous results
-    st.session_state.previous_results = previous_results
 
 def show_example_documents():
     """Show example documents section"""
@@ -251,14 +267,12 @@ def show_example_documents():
 
     # Reset any document state before loading a new sample
     if st.session_state.processed_document_active:
-        # Clear previous document state
-        st.session_state.processed_document_active = False
-        st.session_state.last_processed_file = None
-
         # Clean up any temporary files from previous processing
         if st.session_state.temp_file_paths:
             handle_temp_files(st.session_state.temp_file_paths)
-
+
+        # Reset all document-specific state variables
+        reset_document_state()
 
     # Save download info in session state
     st.session_state.sample_document = SampleDocument(
@@ -350,6 +364,7 @@ def process_document(uploaded_file, left_col, right_col, sidebar_options):
     progress_placeholder = st.empty()
 
     # Image preprocessing preview - show if image file and preprocessing options are set
+    # Remove the document active check to show preview immediately after selection
     if (any(sidebar_options["preprocessing_options"].values()) and
             uploaded_file.type.startswith('image/')):
 
@@ -530,13 +545,14 @@ def main():
     sidebar_options = create_sidebar_options()
 
     # Create main layout with tabs - simpler, more compact approach
-    tab_names = ["Document Processing", "Sample Documents", "
-    main_tab1, main_tab2, main_tab3
+    tab_names = ["Document Processing", "Sample Documents", "Learn More"]
+    main_tab1, main_tab2, main_tab3 = st.tabs(tab_names)
 
     with main_tab1:
         # Create a two-column layout for file upload and results with minimal padding
         st.markdown('<style>.block-container{padding-top: 1rem; padding-bottom: 0;}</style>', unsafe_allow_html=True)
-
+        # Using a 2:3 column ratio gives more space to the results column
+        left_col, right_col = st.columns([2, 3])
 
         with left_col:
             # Create file uploader
@@ -575,11 +591,9 @@ def main():
 
         show_example_documents()
 
-
-        # Previous results tab
-        display_previous_results()
+    # Previous results tab temporarily removed
 
-    with
+    with main_tab3:
         # About tab
         display_about_tab()
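The split this commit makes between "initialize once, never clobber" (`init_session_state`) and "reset explicitly" (`reset_document_state`) can be shown outside Streamlit with a plain dict standing in for `st.session_state`. The key names below echo the diff, but this is an illustrative sketch, not the app's code.

```python
import copy

# Hypothetical stand-in defaults, modeled on the keys in the diff above.
PERSISTENT_DEFAULTS = {"previous_results": [], "close_clicked": False, "active_tab": 0}
DOCUMENT_DEFAULTS = {"last_processed_file": None,
                     "processed_document_active": False,
                     "temp_file_paths": []}

def init_session_state(state):
    """Create keys only if missing; never clobber existing values
    (Streamlit's recommended initialization pattern)."""
    for key, default in {**PERSISTENT_DEFAULTS, **DOCUMENT_DEFAULTS}.items():
        state.setdefault(key, copy.deepcopy(default))

def reset_document_state(state):
    """Explicitly reset document-specific keys; persistent app state
    (e.g. previous_results) is left untouched."""
    for key, default in DOCUMENT_DEFAULTS.items():
        state[key] = copy.deepcopy(default)
```

Separating the two operations is what lets `close_document()` drop stale document data without the old save/wipe/restore dance around `previous_results` that this commit removes.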
config.py
CHANGED
@@ -40,22 +40,19 @@ VISION_MODEL = os.environ.get("MISTRAL_VISION_MODEL", "mistral-small-latest")
 # Image preprocessing settings optimized for historical documents
 # These can be customized from environment variables
 IMAGE_PREPROCESSING = {
-    "enhance_contrast": float(os.environ.get("ENHANCE_CONTRAST", "1.
+    "enhance_contrast": float(os.environ.get("ENHANCE_CONTRAST", "1.8")),  # Increased contrast for better text recognition
     "sharpen": os.environ.get("SHARPEN", "True").lower() in ("true", "1", "yes"),
     "denoise": os.environ.get("DENOISE", "True").lower() in ("true", "1", "yes"),
     "max_size_mb": float(os.environ.get("MAX_IMAGE_SIZE_MB", "12.0")),  # Increased size limit for better quality
     "target_dpi": int(os.environ.get("TARGET_DPI", "300")),  # Target DPI for scaling
-    "compression_quality": int(os.environ.get("COMPRESSION_QUALITY", "
+    "compression_quality": int(os.environ.get("COMPRESSION_QUALITY", "100")),  # Higher quality for better OCR results
     # Enhanced settings for handwritten documents
     "handwritten": {
-        "contrast": float(os.environ.get("HANDWRITTEN_CONTRAST", "1.2")),  # Lower contrast for handwritten text
         "block_size": int(os.environ.get("HANDWRITTEN_BLOCK_SIZE", "21")),  # Larger block size for adaptive thresholding
         "constant": int(os.environ.get("HANDWRITTEN_CONSTANT", "5")),  # Lower constant for adaptive thresholding
         "use_dilation": os.environ.get("HANDWRITTEN_DILATION", "True").lower() in ("true", "1", "yes"),  # Connect broken strokes
-        "
-        "
-        "bilateral_sigma1": int(os.environ.get("HANDWRITTEN_BILATERAL_SIGMA1", "25")),  # Color sigma
-        "bilateral_sigma2": int(os.environ.get("HANDWRITTEN_BILATERAL_SIGMA2", "45"))  # Space sigma
+        "dilation_iterations": int(os.environ.get("HANDWRITTEN_DILATION_ITERATIONS", "2")),  # More iterations for better stroke connection
+        "dilation_kernel_size": int(os.environ.get("HANDWRITTEN_DILATION_KERNEL_SIZE", "3"))  # Larger kernel for dilation
     }
 }
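Every entry in `IMAGE_PREPROCESSING` follows the same environment-override pattern: read the variable if set, fall back to a string default, then coerce. Two small helpers capture the idiom (these helpers are an illustration; config.py inlines the expressions instead):

```python
import os

def env_float(name, default):
    """Read a float setting from the environment, falling back to a default,
    e.g. env_float("ENHANCE_CONTRAST", "1.8")."""
    return float(os.environ.get(name, default))

def env_bool(name, default="True"):
    """Read a boolean flag; any of "true", "1", "yes" (case-insensitive)
    counts as enabled, matching config.py's truthiness check."""
    return os.environ.get(name, default).lower() in ("true", "1", "yes")
```

Keeping defaults as strings means every value goes through the same coercion path whether it came from the environment or the fallback.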
constants.py
CHANGED
@@ -138,17 +138,56 @@ CONTENT_THEMES = {
 }
 
 # Period tags based on year ranges
+# These ranges are used to assign historical period tags to documents based on their year.
 PERIOD_TAGS = {
-    (0,
-    (
-    (
-    (
-    (
+    (0, 499): "Ancient Era (to 500 CE)",
+    (500, 999): "Early Medieval (500–1000)",
+    (1000, 1299): "High Medieval (1000–1300)",
+    (1300, 1499): "Late Medieval (1300–1500)",
+    (1500, 1599): "Renaissance (1500–1600)",
+    (1600, 1699): "Early Modern (1600–1700)",
+    (1700, 1775): "Enlightenment (1700–1775)",
+    (1776, 1799): "Age of Revolutions (1776–1800)",
+    (1800, 1849): "Early 19th Century (1800–1850)",
+    (1850, 1899): "Late 19th Century (1850–1900)",
+    (1900, 1918): "Early 20th Century & WWI (1900–1918)",
+    (1919, 1938): "Interwar Period (1919–1938)",
+    (1939, 1945): "World War II (1939–1945)",
+    (1946, 1968): "Postwar & Mid-20th Century (1946–1968)",
+    (1969, 1989): "Late 20th Century (1969–1989)",
+    (1990, 2000): "Turn of the 21st Century (1990–2000)",
+    (2001, 2099): "Contemporary (21st Century)"
 }
 
-# Default fallback tags
-DEFAULT_TAGS = [
-
+# Default fallback tags for documents when no specific tags are detected.
+DEFAULT_TAGS = [
+    "Document",
+    "Historical",
+    "Text",
+    "Primary Source",
+    "Archival Material",
+    "Record",
+    "Manuscript",
+    "Printed Material",
+    "Correspondence",
+    "Publication"
+]
+
+# Generic tags that can be used for broad categorization or as supplemental tags.
+GENERIC_TAGS = [
+    "Archive",
+    "Content",
+    "Record",
+    "Source",
+    "Material",
+    "Page",
+    "Scan",
+    "Image",
+    "Transcription",
+    "Uncategorized",
+    "General",
+    "Miscellaneous"
+]
 
 # UI constants
 PROGRESS_DELAY = 0.8  # Seconds to show completion message
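Because `PERIOD_TAGS` keys are `(start, end)` year tuples, assigning a tag is a linear scan over the ranges. A sketch of such a lookup, using an abbreviated copy of the table above (the repository's actual lookup helper is not shown in this diff):

```python
# Abbreviated copy of the PERIOD_TAGS ranges above, for illustration only.
PERIOD_TAGS = {
    (0, 499): "Ancient Era (to 500 CE)",
    (1939, 1945): "World War II (1939–1945)",
    (2001, 2099): "Contemporary (21st Century)",
}

DEFAULT_TAG = "Document"  # first entry of DEFAULT_TAGS above

def period_tag_for_year(year):
    """Map a document's year onto its historical period tag,
    falling back to a generic default when no range matches."""
    for (start, end), tag in PERIOD_TAGS.items():
        if start <= year <= end:
            return tag
    return DEFAULT_TAG
```

The ranges are inclusive on both ends and non-overlapping, so the first match is the only match.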
image_segmentation.py
CHANGED
@@ -18,12 +18,13 @@ logging.basicConfig(level=logging.INFO,
                     format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
 logger = logging.getLogger(__name__)
 
-def segment_image_for_ocr(image_path: Union[str, Path]) -> Dict[str, Union[Image.Image, str]]:
+def segment_image_for_ocr(image_path: Union[str, Path], vision_enabled: bool = True) -> Dict[str, Union[Image.Image, str]]:
     """
     Segment an image into text and image regions for improved OCR processing.
 
     Args:
         image_path: Path to the image file
+        vision_enabled: Whether the vision model is enabled
 
     Returns:
         Dict containing:
@@ -41,6 +42,23 @@ def segment_image_for_ocr(image_path: Union[str, Path], vision_enabled: bool = True) -> Dict[str, Union[Image.Image, str]]:
     try:
         # Open original image with PIL for compatibility
         with Image.open(image_file) as pil_img:
+            # --- 2 · Stop "text page detected as image" when vision model is off ---
+            if not vision_enabled:
+                # Import the entropy calculator from utils.image_utils
+                from utils.image_utils import calculate_image_entropy
+
+                # Calculate entropy to determine if this is line art or blank
+                ent = calculate_image_entropy(pil_img)
+                if ent < 3.5:  # Heuristically low → line-art or blank page
+                    logger.info(f"Low entropy image detected ({ent:.2f}), classifying as illustration")
+                    # Return minimal result for illustration
+                    return {
+                        'text_regions': None,
+                        'image_regions': pil_img,
+                        'text_mask_base64': None,
+                        'combined_result': None,
+                        'text_regions_coordinates': []
+                    }
             # Convert to RGB if not already
             if pil_img.mode != 'RGB':
                 pil_img = pil_img.convert('RGB')
@@ -89,7 +107,8 @@ def segment_image_for_ocr(image_path: Union[str, Path], vision_enabled: bool = True) -> Dict[str, Union[Image.Image, str]]:
 
             # Additional check for text-like characteristics
             # Text typically has aspect ratio > 1 (wider than tall) and reasonable density
-
+            # Relaxed aspect ratio constraints and lowered density threshold for better detection
+            if (aspect_ratio > 1.2 or aspect_ratio < 0.7) and dark_pixel_density > 0.15:
                 # Add to text regions list
                 text_regions.append((x, y, w, h))
                 # Add to text mask
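The diff calls `calculate_image_entropy` from `utils.image_utils`, but that module's body is not shown in this chunk. A plausible sketch of the heuristic, operating on a flat sequence of 8-bit pixel values (the real helper presumably accepts a PIL image and flattens it first):

```python
import math

def calculate_image_entropy(pixels):
    """Shannon entropy (in bits) of a sequence of 8-bit pixel values.

    A blank page has one dominant value and near-zero entropy; a text page
    mixes ink and paper and scores higher. Values below the ~3.5 cutoff used
    in the diff above suggest line art or a blank page."""
    counts = {}
    for p in pixels:
        counts[p] = counts.get(p, 0) + 1
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

The maximum for 8-bit data is 8 bits (a uniform histogram), which puts the 3.5 cutoff a bit below the midpoint of the possible range.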
language_detection.py
CHANGED
@@ -64,7 +64,6 @@ class LanguageDetector:
         "patterns": ['oi[ts]$', 'oi[re]$', 'f[^aeiou]', 'ff', 'ſ', 'auoit', 'eſtoit',
                      'ſi', 'ſur', 'ſa', 'cy', 'ayant', 'oy', 'uſ', 'auſ']
     },
-    "exclusivity": 2.0  # French indicators have higher weight in historical text detection
 },
 "German": {
     "chars": ['ä', 'ö', 'ü', 'ß'],
ocr_processing.py
CHANGED
@@ -17,6 +17,9 @@ import streamlit as st
 
 # Local application imports
 from structured_ocr import StructuredOCR
+# Import from updated utils directory
+from utils.image_utils import clean_ocr_result
+# Temporarily retain old utils imports until they are fully migrated
 from utils import generate_cache_key, timing, format_timestamp, create_descriptive_filename, extract_subject_tags
 from preprocessing import apply_preprocessing_to_file
 from error_handler import handle_ocr_error, check_file_size
@@ -239,7 +242,7 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
 
     try:
         # Perform image segmentation
-        segmentation_results = segment_image_for_ocr(temp_path)
+        segmentation_results = segment_image_for_ocr(temp_path, vision_enabled=use_vision)
 
         if segmentation_results['combined_result'] is not None:
             # Save the segmented result to a new temporary file
@@ -357,6 +360,13 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
     # Add additional metadata to result
     result = process_result(result, uploaded_file, preprocessing_options)
 
+    # 🔧 ALWAYS normalize result before returning
+    result = clean_ocr_result(
+        result,
+        use_segmentation=use_segmentation,
+        vision_enabled=use_vision
+    )
+
     # Complete progress
     progress_reporter.complete()
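`clean_ocr_result` is imported from `utils.image_utils`, whose body is not part of this chunk. As a sketch of what such a normalizer might guarantee, assuming hypothetical key names (the real result shape is not shown here):

```python
def clean_ocr_result(result, use_segmentation=False, vision_enabled=True):
    """Normalize an OCR result dict so downstream UI code can rely on a
    consistent shape regardless of which processing path produced it."""
    result = dict(result or {})
    # Guarantee the keys the UI reads, even for partial or failed results
    result.setdefault("raw_text", "")
    result.setdefault("structured_data", None)
    # Record which processing options were active when the result was produced
    result["used_segmentation"] = use_segmentation
    result["vision_enabled"] = vision_enabled
    return result
```

Normalizing unconditionally at the end of `process_file` means every caller sees the same keys whether the result came from Mistral, the segmentation path, or a fallback.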
ocr_utils.py
CHANGED
@@ -1,110 +1,38 @@
 """
-
-
 """
 
-
-import json
 import base64
-import io
-import zipfile
 import logging
-import time
-from datetime import datetime
 from pathlib import Path
-from typing import
-from functools import lru_cache
 
 # Configure logging
 logging.basicConfig(level=logging.INFO,
-
 logger = logging.getLogger(__name__)
 
-#
-
 
-# Check for image processing libraries
 try:
-    from PIL import Image
     PILLOW_AVAILABLE = True
 except ImportError:
     logger.warning("PIL not available - image preprocessing will be limited")
     PILLOW_AVAILABLE = False
 
-try:
-    import cv2
-    CV2_AVAILABLE = True
-except ImportError:
-    logger.warning("OpenCV (cv2) not available - advanced image processing will be limited")
-    CV2_AVAILABLE = False
-
-# Mistral AI imports
-from mistralai import DocumentURLChunk, ImageURLChunk, TextChunk
-from mistralai.models import OCRImageObject
-
-# Import configuration
-try:
-    from config import IMAGE_PREPROCESSING
-except ImportError:
-    # Fallback defaults if config not available
-    IMAGE_PREPROCESSING = {
-        "enhance_contrast": 1.5,
-        "sharpen": True,
-        "denoise": True,
-        "max_size_mb": 8.0,
-        "target_dpi": 300,
-        "compression_quality": 92
-    }
-
-def replace_images_in_markdown(markdown_str: str, images_dict: dict) -> str:
-    """
-    Replace image placeholders in markdown with base64-encoded images.
-
-    Args:
-        markdown_str: Markdown text containing image placeholders
-        images_dict: Dictionary mapping image IDs to base64 strings
-
-    Returns:
-        Markdown text with images replaced by base64 data
-    """
-    for img_name, base64_str in images_dict.items():
-        markdown_str = markdown_str.replace(
-            f"![{img_name}]({img_name})", f"![{img_name}]({base64_str})"
-        )
-    return markdown_str
-
-def get_combined_markdown(ocr_response) -> str:
-    """
-    Combine OCR text and images into a single markdown document.
-
-    Args:
-        ocr_response: OCR response object from Mistral AI
-
-    Returns:
-        Combined markdown string with embedded images
-    """
-    markdowns = []
-
-    # Process each page of the OCR response
-    for page in ocr_response.pages:
-        # Extract image data if available
-        image_data = {}
-        if hasattr(page, "images"):
-            for img in page.images:
-                if hasattr(img, "id") and hasattr(img, "image_base64"):
-                    image_data[img.id] = img.image_base64
-
-        # Replace image placeholders with base64 data
-        page_markdown = page.markdown if hasattr(page, "markdown") else ""
-        processed_markdown = replace_images_in_markdown(page_markdown, image_data)
-        markdowns.append(processed_markdown)
-
-    # Join all pages' markdown with double newlines
-    return "\n\n".join(markdowns)
 
 def encode_image_for_api(image_path: Union[str, Path]) -> str:
     """
-    Encode an image as base64 data URL for API submission.
 
     Args:
         image_path: Path to the image file
@@ -135,1703 +63,37 @@ def encode_image_for_api(image_path: Union[str, Path]) -> str:
|
|
| 135 |
encoded = base64.b64encode(image_file.read_bytes()).decode()
|
| 136 |
return f"data:{mime_type};base64,{encoded}"
|
| 137 |
|
| 138 |
-
def encode_bytes_for_api(file_bytes: bytes, mime_type: str) -> str:
|
| 139 |
-
"""
|
| 140 |
-
Encode binary data as base64 data URL for API submission.
|
| 141 |
-
|
| 142 |
-
Args:
|
| 143 |
-
file_bytes: Binary file data
|
| 144 |
-
mime_type: MIME type of the file (e.g., 'image/jpeg', 'application/pdf')
|
| 145 |
-
|
| 146 |
-
Returns:
|
| 147 |
-
Base64 data URL for the data
|
| 148 |
-
"""
|
| 149 |
-
# Encode data as base64
|
| 150 |
-
encoded = base64.b64encode(file_bytes).decode()
|
| 151 |
-
return f"data:{mime_type};base64,{encoded}"
|
| 152 |
-
|
| 153 |
-
def process_image_with_ocr(client, image_path: Union[str, Path], model: str = "mistral-ocr-latest"):
|
| 154 |
-
"""
|
| 155 |
-
Process an image with OCR and return the response.
|
| 156 |
-
|
| 157 |
-
Args:
|
| 158 |
-
client: Mistral AI client
|
| 159 |
-
image_path: Path to the image file
|
| 160 |
-
model: OCR model to use
|
| 161 |
-
|
| 162 |
-
Returns:
|
| 163 |
-
OCR response object
|
| 164 |
-
"""
|
| 165 |
-
# Encode image as base64
|
| 166 |
-
base64_data_url = encode_image_for_api(image_path)
|
| 167 |
-
|
| 168 |
-
# Process image with OCR
|
| 169 |
-
image_response = client.ocr.process(
|
| 170 |
-
document=ImageURLChunk(image_url=base64_data_url),
|
| 171 |
-
model=model
|
| 172 |
-
)
|
| 173 |
-
|
| 174 |
-
return image_response
|
| 175 |
-
|
| 176 |
-
def ocr_response_to_json(ocr_response, indent: int = 4) -> str:
|
| 177 |
-
"""
|
| 178 |
-
Convert OCR response to a formatted JSON string.
|
| 179 |
-
|
| 180 |
-
Args:
|
| 181 |
-
ocr_response: OCR response object
|
| 182 |
-
indent: Indentation level for JSON formatting
|
| 183 |
-
|
| 184 |
-
Returns:
|
| 185 |
-
Formatted JSON string
|
| 186 |
-
"""
|
| 187 |
-
# Convert OCR response to a dictionary
|
| 188 |
-
response_dict = {
|
| 189 |
-
"text": ocr_response.text if hasattr(ocr_response, "text") else "",
|
| 190 |
-
"pages": []
|
| 191 |
-
}
|
| 192 |
-
|
| 193 |
-
# Process pages if available
|
| 194 |
-
if hasattr(ocr_response, "pages"):
|
| 195 |
-
for page in ocr_response.pages:
|
| 196 |
-
page_dict = {
|
| 197 |
-
"text": page.text if hasattr(page, "text") else "",
|
| 198 |
-
"markdown": page.markdown if hasattr(page, "markdown") else "",
|
| 199 |
-
"images": []
|
| 200 |
-
}
|
| 201 |
-
|
| 202 |
-
# Process images if available
|
| 203 |
-
if hasattr(page, "images"):
|
| 204 |
-
for img in page.images:
|
| 205 |
-
img_dict = {
|
| 206 |
-
"id": img.id if hasattr(img, "id") else "",
|
| 207 |
-
"base64": img.image_base64 if hasattr(img, "image_base64") else ""
|
| 208 |
-
}
|
| 209 |
-
page_dict["images"].append(img_dict)
|
| 210 |
-
|
| 211 |
-
response_dict["pages"].append(page_dict)
|
| 212 |
-
|
| 213 |
-
# Convert dictionary to JSON
|
| 214 |
-
return json.dumps(response_dict, indent=indent)
|
| 215 |
-
|
| 216 |
-
def create_results_zip_in_memory(results):
|
| 217 |
-
"""
|
| 218 |
-
Create a zip file containing OCR results in memory.
|
| 219 |
-
|
| 220 |
-
Args:
|
| 221 |
-
results: Dictionary or list of OCR results
|
| 222 |
-
|
| 223 |
-
Returns:
|
| 224 |
-
Binary zip file data
|
| 225 |
-
"""
|
| 226 |
-
# Create a BytesIO object
|
| 227 |
-
zip_buffer = io.BytesIO()
|
| 228 |
-
|
| 229 |
-
# Check if results is a list or a dictionary
|
| 230 |
-
is_list = isinstance(results, list)
|
| 231 |
-
|
| 232 |
-
# Create zip file in memory
|
| 233 |
-
with zipfile.ZipFile(zip_buffer, 'w', zipfile.ZIP_DEFLATED) as zipf:
|
| 234 |
-
if is_list:
|
| 235 |
-
# Handle list of results
|
| 236 |
-
for i, result in enumerate(results):
|
| 237 |
-
try:
|
| 238 |
-
# Create a descriptive base filename for this result
|
| 239 |
-
base_filename = result.get('file_name', f'document_{i+1}').split('.')[0]
|
| 240 |
-
|
| 241 |
-
# Add document type if available
|
| 242 |
-
if 'topics' in result and result['topics']:
|
| 243 |
-
topic = result['topics'][0].lower().replace(' ', '_')
|
| 244 |
-
base_filename = f"{base_filename}_{topic}"
|
| 245 |
-
|
| 246 |
-
# Add language if available
|
| 247 |
-
if 'languages' in result and result['languages']:
|
| 248 |
-
lang = result['languages'][0].lower()
|
| 249 |
-
# Only add if it's not already in the filename
|
| 250 |
-
if lang not in base_filename.lower():
|
| 251 |
-
base_filename = f"{base_filename}_{lang}"
|
| 252 |
-
|
| 253 |
-
# For PDFs, add page information
|
| 254 |
-
if 'total_pages' in result and 'processed_pages' in result:
|
| 255 |
-
base_filename = f"{base_filename}_p{result['processed_pages']}of{result['total_pages']}"
|
| 256 |
-
|
| 257 |
-
# Add timestamp if available
|
| 258 |
-
if 'timestamp' in result:
|
| 259 |
-
try:
|
| 260 |
-
# Try to parse the timestamp and reformat it
|
| 261 |
-
dt = datetime.strptime(result['timestamp'], "%Y-%m-%d %H:%M")
|
| 262 |
-
timestamp = dt.strftime("%Y%m%d_%H%M%S")
|
| 263 |
-
base_filename = f"{base_filename}_{timestamp}"
|
| 264 |
-
except:
|
| 265 |
-
pass
|
| 266 |
-
|
| 267 |
-
# Add JSON results for each file with descriptive name
|
| 268 |
-
result_json = json.dumps(result, indent=2)
|
| 269 |
-
zipf.writestr(f"{base_filename}.json", result_json)
|
| 270 |
-
|
| 271 |
-
# Add HTML content (generated from the result)
|
| 272 |
-
html_content = create_html_with_images(result)
|
| 273 |
-
zipf.writestr(f"{base_filename}_with_images.html", html_content)
|
| 274 |
-
|
| 275 |
-
# Add raw OCR text if available
|
| 276 |
-
if "ocr_contents" in result and "raw_text" in result["ocr_contents"]:
|
| 277 |
-
zipf.writestr(f"{base_filename}.txt", result["ocr_contents"]["raw_text"])
|
| 278 |
-
|
| 279 |
-
# Add HTML visualization if available
|
| 280 |
-
if "html_visualization" in result:
|
| 281 |
-
zipf.writestr(f"visualization_{i+1}.html", result["html_visualization"])
|
| 282 |
-
|
| 283 |
-
# Add images if available (limit to conserve memory)
|
| 284 |
-
if "pages_data" in result:
|
| 285 |
-
for page_idx, page in enumerate(result["pages_data"]):
|
| 286 |
-
for img_idx, img in enumerate(page.get("images", [])[:3]): # Limit to first 3 images per page
|
| 287 |
-
img_base64 = img.get("image_base64", "")
|
| 288 |
-
if img_base64:
|
| 289 |
-
# Strip data URL prefix if present
|
| 290 |
-
if img_base64.startswith("data:image"):
|
| 291 |
-
img_base64 = img_base64.split(",", 1)[1]
|
| 292 |
-
|
| 293 |
-
# Decode base64 and add to zip
|
| 294 |
-
try:
|
| 295 |
-
img_data = base64.b64decode(img_base64)
|
| 296 |
-
zipf.writestr(f"images/result_{i+1}_page_{page_idx+1}_img_{img_idx+1}.jpg", img_data)
|
| 297 |
-
except:
|
| 298 |
-
pass
|
| 299 |
-
except Exception:
|
| 300 |
-
# If any result fails, skip it and continue
|
| 301 |
-
continue
|
| 302 |
-
else:
|
| 303 |
-
# Handle single result
|
| 304 |
-
try:
|
| 305 |
-
# Create a descriptive base filename for this result
|
| 306 |
-
base_filename = results.get('file_name', 'document').split('.')[0]
|
| 307 |
-
|
| 308 |
-
# Add document type if available
|
| 309 |
-
if 'topics' in results and results['topics']:
|
| 310 |
-
topic = results['topics'][0].lower().replace(' ', '_')
|
| 311 |
-
base_filename = f"{base_filename}_{topic}"
|
| 312 |
-
|
| 313 |
-
# Add language if available
|
| 314 |
-
if 'languages' in results and results['languages']:
|
| 315 |
-
lang = results['languages'][0].lower()
|
| 316 |
-
# Only add if it's not already in the filename
|
| 317 |
-
if lang not in base_filename.lower():
|
| 318 |
-
base_filename = f"{base_filename}_{lang}"
|
| 319 |
-
|
| 320 |
-
# For PDFs, add page information
|
| 321 |
-
if 'total_pages' in results and 'processed_pages' in results:
|
| 322 |
-
base_filename = f"{base_filename}_p{results['processed_pages']}of{results['total_pages']}"
|
| 323 |
-
|
| 324 |
-
# Add timestamp if available
|
| 325 |
-
if 'timestamp' in results:
|
| 326 |
-
try:
|
| 327 |
-
# Try to parse the timestamp and reformat it
|
| 328 |
-
dt = datetime.strptime(results['timestamp'], "%Y-%m-%d %H:%M")
|
| 329 |
-
timestamp = dt.strftime("%Y%m%d_%H%M%S")
|
| 330 |
-
base_filename = f"{base_filename}_{timestamp}"
|
| 331 |
-
except:
|
| 332 |
-
# If parsing fails, create a new timestamp
|
| 333 |
-
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
| 334 |
-
base_filename = f"{base_filename}_{timestamp}"
|
| 335 |
-
else:
|
| 336 |
-
# No timestamp in the result, create a new one
|
| 337 |
-
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
| 338 |
-
base_filename = f"{base_filename}_{timestamp}"
|
| 339 |
-
|
| 340 |
-
# Add JSON results with descriptive name
|
| 341 |
-
results_json = json.dumps(results, indent=2)
|
| 342 |
-
zipf.writestr(f"{base_filename}.json", results_json)
|
| 343 |
-
|
| 344 |
-
# Add HTML content with descriptive name
|
| 345 |
-
html_content = create_html_with_images(results)
|
| 346 |
-
zipf.writestr(f"{base_filename}_with_images.html", html_content)
|
| 347 |
-
|
| 348 |
-
# Add raw OCR text if available
|
| 349 |
-
if "ocr_contents" in results and "raw_text" in results["ocr_contents"]:
|
| 350 |
-
zipf.writestr(f"{base_filename}.txt", results["ocr_contents"]["raw_text"])
|
| 351 |
-
|
| 352 |
-
# Add HTML visualization if available
|
| 353 |
-
if "html_visualization" in results:
|
| 354 |
-
zipf.writestr("visualization.html", results["html_visualization"])
|
| 355 |
-
|
| 356 |
-
# Add images if available
|
| 357 |
-
if "pages_data" in results:
|
| 358 |
-
for page_idx, page in enumerate(results["pages_data"]):
|
| 359 |
-
for img_idx, img in enumerate(page.get("images", [])):
|
| 360 |
-
img_base64 = img.get("image_base64", "")
|
| 361 |
-
if img_base64:
|
| 362 |
-
# Strip data URL prefix if present
|
| 363 |
-
if img_base64.startswith("data:image"):
|
| 364 |
-
img_base64 = img_base64.split(",", 1)[1]
|
| 365 |
-
|
| 366 |
-
# Decode base64 and add to zip
|
| 367 |
-
try:
|
| 368 |
-
img_data = base64.b64decode(img_base64)
|
| 369 |
-
zipf.writestr(f"images/page_{page_idx+1}_img_{img_idx+1}.jpg", img_data)
|
| 370 |
-
except:
|
| 371 |
-
pass
|
| 372 |
-
except Exception:
|
| 373 |
-
# If processing fails, return empty zip
|
| 374 |
-
pass
|
| 375 |
-
|
| 376 |
-
# Seek to the beginning of the BytesIO object
|
| 377 |
-
zip_buffer.seek(0)
|
| 378 |
-
|
| 379 |
-
# Return the zip file bytes
|
| 380 |
-
return zip_buffer.getvalue()
|
| 381 |
-
|
| 382 |
-
def create_results_zip(results, output_dir=None, zip_name=None):
|
| 383 |
-
"""
|
| 384 |
-
Create a zip file containing OCR results.
|
| 385 |
-
|
| 386 |
-
Args:
|
| 387 |
-
results: Dictionary or list of OCR results
|
| 388 |
-
output_dir: Optional output directory
|
| 389 |
-
zip_name: Optional zip file name
|
| 390 |
-
|
| 391 |
-
Returns:
|
| 392 |
-
Path to the created zip file
|
| 393 |
-
"""
|
| 394 |
-
# Create temporary output directory if not provided
|
| 395 |
-
if output_dir is None:
|
| 396 |
-
output_dir = Path.cwd() / "output"
|
| 397 |
-
output_dir.mkdir(exist_ok=True)
|
| 398 |
-
else:
|
| 399 |
-
output_dir = Path(output_dir)
|
| 400 |
-
output_dir.mkdir(exist_ok=True)
|
| 401 |
-
|
| 402 |
-
# Check if results is a list or a dictionary
|
| 403 |
-
is_list = isinstance(results, list)
|
| 404 |
-
|
| 405 |
-
# Generate zip name if not provided
|
| 406 |
-
if zip_name is None:
|
| 407 |
-
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
| 408 |
-
|
| 409 |
-
if is_list:
|
| 410 |
-
# For a list of results, create a more descriptive name based on the content
|
| 411 |
-
file_count = len(results)
|
| 412 |
-
|
| 413 |
-
# Count document types
|
| 414 |
-
pdf_count = sum(1 for r in results if r.get('file_name', '').lower().endswith('.pdf'))
|
| 415 |
-
img_count = sum(1 for r in results if r.get('file_name', '').lower().endswith(('.jpg', '.jpeg', '.png')))
|
| 416 |
-
|
| 417 |
-
# Create descriptive name based on contents
|
| 418 |
-
if pdf_count > 0 and img_count > 0:
|
| 419 |
-
zip_name = f"historical_ocr_mixed_{pdf_count}pdf_{img_count}img_{timestamp}.zip"
|
| 420 |
-
elif pdf_count > 0:
|
| 421 |
-
zip_name = f"historical_ocr_pdf_documents_{pdf_count}_{timestamp}.zip"
|
| 422 |
-
elif img_count > 0:
|
| 423 |
-
zip_name = f"historical_ocr_images_{img_count}_{timestamp}.zip"
|
| 424 |
-
else:
|
| 425 |
-
zip_name = f"historical_ocr_results_{file_count}_{timestamp}.zip"
|
| 426 |
-
else:
|
| 427 |
-
# For single result, create descriptive filename
|
| 428 |
-
base_name = results.get("file_name", "document").split('.')[0]
|
| 429 |
-
|
| 430 |
-
# Add document type if available
|
| 431 |
-
if 'topics' in results and results['topics']:
|
| 432 |
-
topic = results['topics'][0].lower().replace(' ', '_')
|
| 433 |
-
base_name = f"{base_name}_{topic}"
|
| 434 |
-
|
| 435 |
-
# Add language if available
|
| 436 |
-
if 'languages' in results and results['languages']:
|
| 437 |
-
lang = results['languages'][0].lower()
|
| 438 |
-
# Only add if it's not already in the filename
|
| 439 |
-
if lang not in base_name.lower():
|
| 440 |
-
base_name = f"{base_name}_{lang}"
|
| 441 |
-
|
| 442 |
-
# For PDFs, add page information
|
| 443 |
-
if 'total_pages' in results and 'processed_pages' in results:
|
| 444 |
-
base_name = f"{base_name}_p{results['processed_pages']}of{results['total_pages']}"
|
| 445 |
-
|
| 446 |
-
# Add timestamp
|
| 447 |
-
zip_name = f"{base_name}_{timestamp}.zip"
|
| 448 |
-
|
| 449 |
-
try:
|
| 450 |
-
# Get zip data in memory first
|
| 451 |
-
zip_data = create_results_zip_in_memory(results)
|
| 452 |
-
|
| 453 |
-
# Save to file
|
| 454 |
-
zip_path = output_dir / zip_name
|
| 455 |
-
with open(zip_path, 'wb') as f:
|
| 456 |
-
f.write(zip_data)
|
| 457 |
-
|
| 458 |
-
return zip_path
|
| 459 |
-
except Exception as e:
|
| 460 |
-
# Create an empty zip file as fallback
|
| 461 |
-
zip_path = output_dir / zip_name
|
| 462 |
-
with zipfile.ZipFile(zip_path, 'w') as zipf:
|
| 463 |
-
zipf.writestr("info.txt", "Could not create complete archive")
|
| 464 |
-
|
| 465 |
-
return zip_path
|
| 466 |
-
|
| 467 |
-
|
| 468 |
-
# Advanced image preprocessing functions
|
| 469 |
-
|
| 470 |
-
def preprocess_image_for_ocr(image_path: Union[str, Path]) -> Tuple[Image.Image, str]:
|
| 471 |
-
"""
|
| 472 |
-
Preprocess an image for optimal OCR performance with enhanced speed and memory optimization.
|
| 473 |
-
Enhanced to handle large newspaper and document images.
|
| 474 |
-
|
| 475 |
-
Args:
|
| 476 |
-
image_path: Path to the image file
|
| 477 |
-
|
| 478 |
-
Returns:
|
| 479 |
-
Tuple of (processed PIL Image, base64 string)
|
| 480 |
-
"""
|
| 481 |
-
# Fast path: Skip all processing if PIL not available
|
| 482 |
-
if not PILLOW_AVAILABLE:
|
| 483 |
-
logger.info("PIL not available, skipping image preprocessing")
|
| 484 |
-
return None, encode_image_for_api(image_path)
|
| 485 |
-
|
| 486 |
-
# Convert to Path object if string
|
| 487 |
-
image_file = Path(image_path) if isinstance(image_path, str) else image_path
|
| 488 |
-
|
| 489 |
-
# Thread-safe caching with early exit for already processed images
|
| 490 |
-
try:
|
| 491 |
-
# Fast stat calls for file metadata - consolidate to reduce I/O
|
| 492 |
-
file_stat = image_file.stat()
|
| 493 |
-
file_size = file_stat.st_size
|
| 494 |
-
file_size_mb = file_size / (1024 * 1024)
|
| 495 |
-
mod_time = file_stat.st_mtime
|
| 496 |
-
|
| 497 |
-
# Create a cache key based on essential file properties
|
| 498 |
-
cache_key = f"{image_file.name}_{file_size}_{mod_time}"
|
| 499 |
-
|
| 500 |
-
# Fast path: Return cached result if available
|
| 501 |
-
if hasattr(preprocess_image_for_ocr, "_cache") and cache_key in preprocess_image_for_ocr._cache:
|
| 502 |
-
logger.debug(f"Using cached preprocessing result for {image_file.name}")
|
| 503 |
-
return preprocess_image_for_ocr._cache[cache_key]
|
| 504 |
-
|
| 505 |
-
# Optimization: Skip heavy processing for very small files
|
| 506 |
-
# Small images (less than 100KB) likely don't need preprocessing
|
| 507 |
-
if file_size < 100000: # 100KB
|
| 508 |
-
logger.info(f"Image {image_file.name} is small ({file_size/1024:.1f}KB), using minimal processing")
|
| 509 |
-
with Image.open(image_file) as img:
|
| 510 |
-
# Normalize mode only
|
| 511 |
-
if img.mode not in ('RGB', 'L'):
|
| 512 |
-
img = img.convert('RGB')
|
| 513 |
-
|
| 514 |
-
# Save with light optimization
|
| 515 |
-
buffer = io.BytesIO()
|
| 516 |
-
img.save(buffer, format="JPEG", quality=95, optimize=True)
|
| 517 |
-
buffer.seek(0)
|
| 518 |
-
|
| 519 |
-
# Get base64
|
| 520 |
-
encoded_image = base64.b64encode(buffer.getvalue()).decode()
|
| 521 |
-
base64_data_url = f"data:image/jpeg;base64,{encoded_image}"
|
| 522 |
-
|
| 523 |
-
# Cache and return
|
| 524 |
-
result = (img, base64_data_url)
|
| 525 |
-
if not hasattr(preprocess_image_for_ocr, "_cache"):
|
| 526 |
-
preprocess_image_for_ocr._cache = {}
|
| 527 |
-
|
| 528 |
-
# Clean cache if needed
|
| 529 |
-
if len(preprocess_image_for_ocr._cache) > 20: # Increased cache size for better performance
|
| 530 |
-
# Remove oldest 5 entries for better batch processing
|
| 531 |
-
for _ in range(5):
|
| 532 |
-
if preprocess_image_for_ocr._cache:
|
| 533 |
-
preprocess_image_for_ocr._cache.pop(next(iter(preprocess_image_for_ocr._cache)))
|
| 534 |
-
|
| 535 |
-
preprocess_image_for_ocr._cache[cache_key] = result
|
| 536 |
-
return result
|
| 537 |
-
|
| 538 |
-
# Special handling for large newspaper-style documents
|
| 539 |
-
if file_size_mb > 5 and image_file.name.lower().endswith(('.jpg', '.jpeg', '.png')):
|
| 540 |
-
logger.info(f"Large image detected ({file_size_mb:.2f}MB), checking for newspaper format")
|
| 541 |
-
try:
|
| 542 |
-
# Quickly check dimensions without loading full image
|
| 543 |
-
with Image.open(image_file) as img:
|
| 544 |
-
width, height = img.size
|
| 545 |
-
aspect_ratio = width / height
|
| 546 |
-
|
| 547 |
-
# Newspaper-style documents typically have width > height or are very large
|
| 548 |
-
is_newspaper_format = (aspect_ratio > 1.15 and width > 2000) or (width > 3000 or height > 3000)
|
| 549 |
-
|
| 550 |
-
if is_newspaper_format:
|
| 551 |
-
logger.info(f"Newspaper format detected: {width}x{height}, applying specialized processing")
|
| 552 |
-
|
| 553 |
-
except Exception as dim_err:
|
| 554 |
-
logger.debug(f"Error checking dimensions: {str(dim_err)}")
|
| 555 |
-
is_newspaper_format = False
|
| 556 |
-
else:
|
| 557 |
-
is_newspaper_format = False
|
| 558 |
-
|
| 559 |
-
except Exception as e:
|
| 560 |
-
# If stat or cache handling fails, log and continue with processing
|
| 561 |
-
logger.debug(f"Cache handling failed for {image_path}: {str(e)}")
|
| 562 |
-
# Ensure we have a valid file_size_mb for later decisions
|
| 563 |
-
try:
|
| 564 |
-
file_size_mb = image_file.stat().st_size / (1024 * 1024)
|
| 565 |
-
except:
|
| 566 |
-
file_size_mb = 0 # Default if we can't determine size
|
| 567 |
-
|
| 568 |
-
# Default to not newspaper format on error
|
| 569 |
-
is_newspaper_format = False
|
| 570 |
-
|
| 571 |
-
try:
|
| 572 |
-
# Process start time for performance logging
|
| 573 |
-
start_time = time.time()
|
| 574 |
-
|
| 575 |
-
# Open and process the image with minimal memory footprint
|
| 576 |
-
with Image.open(image_file) as img:
|
| 577 |
-
# Normalize image mode
|
| 578 |
-
if img.mode not in ('RGB', 'L'):
|
| 579 |
-
img = img.convert('RGB')
|
| 580 |
-
|
| 581 |
-
# Fast path: Quick check of image properties to determine appropriate processing
|
| 582 |
-
width, height = img.size
|
| 583 |
-
image_area = width * height
|
| 584 |
-
|
| 585 |
-
# Detect document type only for medium to large images to save processing time
|
| 586 |
-
is_document = False
|
| 587 |
-
is_newspaper = False
|
| 588 |
-
|
| 589 |
-
# More aggressive document type detection for larger images
|
| 590 |
-
if image_area > 500000: # Approx 700x700 or larger
|
| 591 |
-
# Store image for document detection
|
| 592 |
-
_detect_document_type_impl._current_img = img
|
| 593 |
-
is_document = _detect_document_type_impl(None)
|
| 594 |
-
|
| 595 |
-
# Additional check for newspaper format
|
| 596 |
-
if is_document:
|
| 597 |
-
# Newspapers typically have wide formats or very large dimensions
|
| 598 |
-
aspect_ratio = width / height
|
| 599 |
-
is_newspaper = (aspect_ratio > 1.15 and width > 2000) or (width > 3000 or height > 3000)
|
| 600 |
-
|
| 601 |
-
logger.debug(f"Document type detection for {image_file.name}: " +
|
| 602 |
-
f"{'newspaper' if is_newspaper else 'document' if is_document else 'photo'}")
|
| 603 |
-
|
| 604 |
-
# Check for handwritten document characteristics
|
| 605 |
-
is_handwritten = False
|
| 606 |
-
if CV2_AVAILABLE and not is_newspaper:
|
| 607 |
-
# Use more advanced detection for handwritten content
|
| 608 |
-
try:
|
| 609 |
-
gray_np = np.array(img.convert('L'))
|
| 610 |
-
# Higher variance in edge strengths can indicate handwriting
|
| 611 |
-
edges = cv2.Canny(gray_np, 30, 100)
|
| 612 |
-
if np.count_nonzero(edges) / edges.size > 0.02: # Low edge threshold for handwriting
|
| 613 |
-
# Additional check with gradient magnitudes
|
| 614 |
-
sobelx = cv2.Sobel(gray_np, cv2.CV_64F, 1, 0, ksize=3)
|
| 615 |
-
sobely = cv2.Sobel(gray_np, cv2.CV_64F, 0, 1, ksize=3)
|
| 616 |
-
magnitude = np.sqrt(sobelx**2 + sobely**2)
|
| 617 |
-
# Handwriting typically has more variation in gradient magnitudes
|
| 618 |
-
if np.std(magnitude) > 20:
|
| 619 |
-
is_handwritten = True
|
| 620 |
-
logger.info(f"Handwritten document detected: {image_file.name}")
|
| 621 |
-
except Exception as e:
|
| 622 |
-
logger.debug(f"Handwriting detection error: {str(e)}")
|
| 623 |
-
|
| 624 |
-
# Special processing for very large images (newspapers and large documents)
|
| 625 |
-
if is_newspaper:
|
| 626 |
-
# For newspaper format, we need more specialized processing
|
| 627 |
-
logger.info(f"Processing newspaper format image: {width}x{height}")
|
| 628 |
-
|
| 629 |
-
# For newspapers, we prioritize text clarity over file size
|
| 630 |
-
# Use higher target resolution to preserve small text common in newspapers
|
| 631 |
-
# But still need to resize if extremely large to avoid API limits
|
| 632 |
-
max_dimension = max(width, height)
|
| 633 |
-
|
| 634 |
-
if max_dimension > 6000: # Extremely large
|
| 635 |
-
scale_factor = 0.4 # Preserve more resolution for newspapers (increased from 0.35)
|
| 636 |
-
elif max_dimension > 4000:
|
| 637 |
-
scale_factor = 0.6 # Higher resolution for better text extraction (increased from 0.5)
|
| 638 |
-
else:
|
| 639 |
-
scale_factor = 0.8 # Minimal reduction for moderate newspaper size (increased from 0.7)
|
| 640 |
-
|
| 641 |
-
# Calculate new dimensions - maintain higher resolution
|
| 642 |
-
new_width = int(width * scale_factor)
|
| 643 |
-
new_height = int(height * scale_factor)
|
| 644 |
-
|
| 645 |
-
# Use high-quality resampling to preserve text clarity in newspapers
|
| 646 |
-
processed_img = img.resize((new_width, new_height), Image.LANCZOS)
|
| 647 |
-
logger.debug(f"Resized newspaper image from {width}x{height} to {new_width}x{new_height}")
|
| 648 |
-
|
| 649 |
-
# For newspapers, we also want to enhance the contrast and sharpen the image
|
| 650 |
-
# before the main OCR processing for better text extraction
|
| 651 |
-
if img.mode in ('RGB', 'RGBA'):
|
| 652 |
-
# For color newspapers, enhance both the overall image and then convert to grayscale
|
| 653 |
-
# This helps with mixed content newspapers that have both text and images
|
| 654 |
-
enhancer = ImageEnhance.Contrast(processed_img)
|
| 655 |
-
processed_img = enhancer.enhance(1.3) # Boost contrast but not too aggressively
|
| 656 |
-
|
| 657 |
-
# Also enhance saturation to make colored text more visible
|
| 658 |
-
enhancer_sat = ImageEnhance.Color(processed_img)
|
| 659 |
-
processed_img = enhancer_sat.enhance(1.2)
|
| 660 |
-
# Special processing for handwritten documents
|
| 661 |
-
elif is_handwritten:
|
| 662 |
-
logger.info(f"Processing handwritten document: {width}x{height}")
|
| 663 |
-
|
| 664 |
-
# For handwritten text, we need to preserve stroke details
|
| 665 |
-
# Use gentle scaling to maintain handwriting characteristics
|
| 666 |
-
max_dimension = max(width, height)
|
| 667 |
-
|
| 668 |
-
if max_dimension > 4000: # Large handwritten document
|
| 669 |
-
scale_factor = 0.6 # Less aggressive reduction for handwriting
|
| 670 |
-
else:
|
| 671 |
-
scale_factor = 0.8 # Minimal reduction for moderate size
|
| 672 |
-
|
| 673 |
-
# Calculate new dimensions
|
| 674 |
-
new_width = int(width * scale_factor)
|
| 675 |
-
new_height = int(height * scale_factor)
|
| 676 |
-
|
| 677 |
-
# Use high-quality resampling to preserve handwriting details
|
| 678 |
-
processed_img = img.resize((new_width, new_height), Image.LANCZOS)
|
| 679 |
-
|
| 680 |
-
# Lower contrast enhancement for handwriting to preserve stroke details
|
| 681 |
-
if img.mode in ('RGB', 'RGBA'):
|
| 682 |
-
# Convert to grayscale for better text processing
|
| 683 |
-
processed_img = processed_img.convert('L')
|
| 684 |
-
|
| 685 |
-
# Use reduced contrast enhancement to preserve subtle strokes
|
| 686 |
-
enhancer = ImageEnhance.Contrast(processed_img)
|
| 687 |
-
processed_img = enhancer.enhance(1.2) # Lower contrast value for handwriting
|
| 688 |
-
|
| 689 |
-
# Standard processing for other large images
|
| 690 |
-
elif file_size_mb > IMAGE_PREPROCESSING["max_size_mb"] or max(width, height) > 3000:
|
| 691 |
-
# Calculate target dimensions directly instead of using the heavier resize function
|
| 692 |
-
target_width, target_height = width, height
|
| 693 |
-
max_dimension = max(width, height)
|
| 694 |
-
|
| 695 |
-
# Use a sliding scale for reduction based on image size
|
| 696 |
-
if max_dimension > 5000:
|
| 697 |
-
scale_factor = 0.3 # Slightly less aggressive reduction (was 0.25)
|
| 698 |
-
elif max_dimension > 3000:
|
| 699 |
-
scale_factor = 0.45 # Slightly less aggressive reduction (was 0.4)
|
| 700 |
-
else:
|
| 701 |
-
scale_factor = 0.65 # Slightly less aggressive reduction (was 0.6)
|
| 702 |
-
|
| 703 |
-
# Calculate new dimensions
|
| 704 |
-
new_width = int(width * scale_factor)
|
| 705 |
-
new_height = int(height * scale_factor)
|
| 706 |
-
|
| 707 |
-
# Use direct resize with optimized resampling filter based on image size
|
| 708 |
-
if image_area > 3000000: # Very large, use faster but lower quality
|
| 709 |
-
processed_img = img.resize((new_width, new_height), Image.BILINEAR)
|
| 710 |
-
else: # Medium size, use better quality
|
| 711 |
-
processed_img = img.resize((new_width, new_height), Image.LANCZOS)
|
| 712 |
-
|
| 713 |
-
logger.debug(f"Resized image from {width}x{height} to {new_width}x{new_height}")
|
| 714 |
-
else:
|
| 715 |
-
# Skip resizing for smaller images
|
| 716 |
-
processed_img = img
|
| 717 |
-
|
| 718 |
-
# Apply appropriate processing based on document type and size
|
| 719 |
-
if is_document:
|
| 720 |
-
# Process as document with optimized path based on size
|
| 721 |
-
if image_area > 1000000: # Full processing for larger documents
|
| 722 |
-
preprocess_document_image._current_img = processed_img
|
| 723 |
-
processed = _preprocess_document_image_impl()
|
| 724 |
-
else: # Lightweight processing for smaller documents
|
| 725 |
-
# Just enhance contrast for small documents to save time
|
| 726 |
-
enhancer = ImageEnhance.Contrast(processed_img)
|
| 727 |
-
processed = enhancer.enhance(1.3)
|
| 728 |
-
else:
|
| 729 |
-
# Process as photo with optimized path based on size
|
| 730 |
-
if image_area > 1000000: # Full processing for larger photos
|
| 731 |
-
preprocess_general_image._current_img = processed_img
|
| 732 |
-
processed = _preprocess_general_image_impl()
|
| 733 |
-
else: # Skip processing for smaller photos
|
| 734 |
-
processed = processed_img
|
| 735 |
-
|
| 736 |
-
# Optimize memory handling during encoding
|
| 737 |
-
buffer = io.BytesIO()
|
| 738 |
-
|
| 739 |
-
# Adjust quality based on image size to optimize API payload
|
| 740 |
-
if file_size_mb > 5:
|
| 741 |
-
quality = 85 # Lower quality for large files
|
| 742 |
-
else:
|
| 743 |
-
quality = IMAGE_PREPROCESSING["compression_quality"]
|
| 744 |
-
|
| 745 |
-
# Save with optimized parameters
|
| 746 |
-
processed.save(buffer, format="JPEG", quality=quality, optimize=True)
|
| 747 |
-
buffer.seek(0)
|
| 748 |
-
|
| 749 |
-
# Get base64 with minimal memory footprint
|
| 750 |
-
encoded_image = base64.b64encode(buffer.getvalue()).decode()
|
| 751 |
-
# Always use image/jpeg MIME type since we explicitly save as JPEG above
|
| 752 |
-
base64_data_url = f"data:image/jpeg;base64,{encoded_image}"
|
| 753 |
-
|
| 754 |
-
# Update cache thread-safely
|
| 755 |
-
result = (processed, base64_data_url)
|
| 756 |
-
if not hasattr(preprocess_image_for_ocr, "_cache"):
|
| 757 |
-
preprocess_image_for_ocr._cache = {}
|
| 758 |
-
|
| 759 |
-
# LRU-like cache management with improved clearing
|
| 760 |
-
if len(preprocess_image_for_ocr._cache) > 20:
|
| 761 |
-
try:
|
| 762 |
-
# Remove several entries to avoid frequent cache clearing
|
| 763 |
-
for _ in range(5):
|
| 764 |
-
if preprocess_image_for_ocr._cache:
|
| 765 |
-
preprocess_image_for_ocr._cache.pop(next(iter(preprocess_image_for_ocr._cache)))
|
| 766 |
-
except:
|
| 767 |
-
# If removal fails, just continue
|
| 768 |
-
pass
|
| 769 |
-
|
| 770 |
-
# Add to cache
|
| 771 |
-
try:
|
| 772 |
-
preprocess_image_for_ocr._cache[cache_key] = result
|
| 773 |
-
except Exception:
|
| 774 |
-
# If caching fails, just proceed
|
| 775 |
-
pass
|
| 776 |
-
|
| 777 |
-
# Log performance metrics
|
| 778 |
-
processing_time = time.time() - start_time
|
| 779 |
-
logger.debug(f"Image preprocessing completed in {processing_time:.3f}s for {image_file.name}")
|
| 780 |
-
|
| 781 |
-
# Return both processed image and base64 string
|
| 782 |
-
return result
|
| 783 |
-
|
| 784 |
-
except Exception as e:
|
| 785 |
-
# If preprocessing fails, log error and use original image
|
| 786 |
-
logger.warning(f"Image preprocessing failed: {str(e)}. Using original image.")
|
| 787 |
-
return None, encode_image_for_api(image_path)
|
| 788 |
-
|
| 789 |
-
# Removed caching decorator to fix unhashable type error
def detect_document_type(img: Image.Image) -> bool:
    """
    Detect if an image is likely a document (text-heavy) vs. a photo.

    Args:
        img: PIL Image object

    Returns:
        True if likely a document, False otherwise
    """
    # Direct implementation without caching
    return _detect_document_type_impl(None)

def _detect_document_type_impl(img_hash=None) -> bool:
    """
    Optimized implementation of document type detection for faster processing.
    The img_hash parameter is unused but kept for backward compatibility.

    Enhanced to better detect handwritten documents and newspaper formats.
    """
    # Fast path: Get the image from thread-local storage
    if not hasattr(_detect_document_type_impl, "_current_img"):
        return False  # Fail safe in case image is not set

    img = _detect_document_type_impl._current_img

    # Skip processing for tiny images - just classify as non-documents
    width, height = img.size
    if width * height < 100000:  # Approx 300x300 or smaller
        return False

    # Convert to grayscale for analysis (using faster conversion)
    gray_img = img.convert('L')

    # PIL-only path for systems without OpenCV
    if not CV2_AVAILABLE:
        # Faster method: Sample a subset of the image for edge detection
        # Downscale image for faster processing
        sample_size = min(width, height, 1000)
        scale_factor = sample_size / max(width, height)

        if scale_factor < 0.9:  # Only resize if significant reduction
            sample_img = gray_img.resize(
                (int(width * scale_factor), int(height * scale_factor)),
                Image.NEAREST  # Fastest resampling method
            )
        else:
            sample_img = gray_img

        # Fast edge detection on sample
        edges = sample_img.filter(ImageFilter.FIND_EDGES)

        # Count edge pixels using threshold (faster than summing individual pixels)
        edge_data = edges.getdata()
        edge_threshold = 40  # Lowered threshold to better detect handwritten texts

        # Use list comprehension for better performance
        edge_count = sum(1 for p in edge_data if p > edge_threshold)
        total_pixels = len(edge_data)
        edge_ratio = edge_count / total_pixels

        # Check if bright areas exist - simple approximation of text/background contrast
        bright_count = sum(1 for p in gray_img.getdata() if p > 200)
        bright_ratio = bright_count / (width * height)

        # Documents typically have more edges (text boundaries) and bright areas (background)
        # Lowered edge threshold to better detect handwritten documents
        return edge_ratio > 0.035 or bright_ratio > 0.4

    # OpenCV path - optimized for speed and enhanced for handwritten documents
    img_np = np.array(gray_img)

    # 1. Fast check: Variance of pixel values
    # Documents typically have high variance (text on background)
    # Handwritten documents may have less contrast than printed text
    std_dev = np.std(img_np)
    if std_dev > 40:  # Further lowered threshold to better detect handwritten documents with low contrast
        return True

    # 2. Quick check using downsampled image for edges
    # Downscale for faster processing on large images
    if max(img_np.shape) > 1000:
        scale = 1000 / max(img_np.shape)
        small_img = cv2.resize(img_np, None, fx=scale, fy=scale, interpolation=cv2.INTER_NEAREST)
    else:
        small_img = img_np

    # Enhanced edge detection for handwritten documents
    # Use multiple Canny thresholds to better capture both faint and bold strokes
    edges_low = cv2.Canny(small_img, 20, 110, L2gradient=False)   # For faint handwriting
    edges_high = cv2.Canny(small_img, 30, 150, L2gradient=False)  # For standard text

    # Combine edge detection results
    edges = cv2.bitwise_or(edges_low, edges_high)
    edge_ratio = np.count_nonzero(edges) / edges.size

    # Special handling for potential handwritten content - more sensitive detection
    handwritten_indicator = False
    if edge_ratio > 0.015:  # Lower threshold specifically for handwritten content
        try:
            # Look for handwriting stroke characteristics using gradient analysis
            # Compute gradient magnitudes and directions
            sobelx = cv2.Sobel(small_img, cv2.CV_64F, 1, 0, ksize=3)
            sobely = cv2.Sobel(small_img, cv2.CV_64F, 0, 1, ksize=3)
            magnitude = np.sqrt(sobelx**2 + sobely**2)

            # Handwriting typically has higher variation in gradient magnitudes
            if np.std(magnitude) > 18:  # Lower threshold for more sensitivity
                # Handwriting is indicated if we also have some line structure
                # Try to find line segments that could indicate text lines
                lines = cv2.HoughLinesP(edges, 1, np.pi/180,
                                        threshold=45,      # Lower threshold for handwriting
                                        minLineLength=25,  # Shorter minimum line length
                                        maxLineGap=25)     # Larger gap for disconnected handwriting

                if lines is not None and len(lines) > 8:  # Fewer line segments needed
                    handwritten_indicator = True
        except Exception:
            # If analysis fails, continue with other checks
            pass

    # 3. Enhanced histogram analysis for handwritten content
    # Use more granular bins for better detection of varying stroke densities
    dark_mask = img_np < 65  # Increased threshold to capture lighter handwritten text
    medium_mask = (img_np >= 65) & (img_np < 170)  # Medium gray range for handwriting
    light_mask = img_np > 175  # Slightly adjusted for aged paper

    dark_ratio = np.count_nonzero(dark_mask) / img_np.size
    medium_ratio = np.count_nonzero(medium_mask) / img_np.size
    light_ratio = np.count_nonzero(light_mask) / img_np.size

    # Handwritten documents often have more medium-gray content than printed text
    # This helps detect pencil or faded ink handwriting
    if medium_ratio > 0.3 and edge_ratio > 0.015:
        return True

    # Special analysis for handwritten documents
    # Return true immediately if handwriting characteristics detected
    if handwritten_indicator:
        return True

    # Combine heuristics for final decision with improved sensitivity
    # Lower thresholds for handwritten documents
    return (dark_ratio > 0.025 and light_ratio > 0.2) or edge_ratio > 0.025
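The PIL-only branch above reduces to two ratios over the pixel data. The decision rule can be sketched without any imaging library (`looks_like_document` is a hypothetical helper operating on flat pixel lists, used here only to illustrate the thresholds):

```python
def looks_like_document(edge_pixels, gray_pixels, edge_threshold=40):
    """Mirror the PIL-only heuristic: many edge pixels, or a bright (paper) background."""
    edge_ratio = sum(1 for p in edge_pixels if p > edge_threshold) / len(edge_pixels)
    bright_ratio = sum(1 for p in gray_pixels if p > 200) / len(gray_pixels)
    return edge_ratio > 0.035 or bright_ratio > 0.4
```

A mostly white page trips the brightness test even with no edges, while a dark, flat photo trips neither.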
# Removed caching to fix unhashable type error
def preprocess_document_image(img: Image.Image) -> Image.Image:
    """
    Preprocess a document image for optimal OCR.

    Args:
        img: PIL Image object

    Returns:
        Processed PIL Image
    """
    # Store the image for the implementation function
    preprocess_document_image._current_img = img
    # The actual implementation is separated for cleaner code organization
    return _preprocess_document_image_impl()

def _preprocess_document_image_impl() -> Image.Image:
    """
    Optimized implementation of document preprocessing with adaptive processing based on image size.
    Enhanced for better handwritten document processing and newspaper format.
    """
    # Fast path: Get image from thread-local storage
    if not hasattr(preprocess_document_image, "_current_img"):
        raise ValueError("No image set for document preprocessing")

    img = preprocess_document_image._current_img

    # Analyze image size to determine processing strategy
    width, height = img.size
    img_size = width * height

    # Detect special document types
    is_handwritten = False
    is_newspaper = False

    # Check for newspaper format first (takes precedence)
    aspect_ratio = width / height
    if (aspect_ratio > 1.15 and width > 2000) or (width > 3000 or height > 3000):
        is_newspaper = True
        logger.debug(f"Newspaper format detected: {width}x{height}, aspect ratio: {aspect_ratio:.2f}")
    else:
        # If not newspaper, check if handwritten
        try:
            # Simple check for handwritten document characteristics
            # Handwritten documents often have more varied strokes and less stark contrast
            if CV2_AVAILABLE:
                # Convert to grayscale and calculate local variance
                gray_np = np.array(img.convert('L'))
                # Higher variance in edge strengths can indicate handwriting
                edges = cv2.Canny(gray_np, 30, 100)
                if np.count_nonzero(edges) / edges.size > 0.02:  # Low edge threshold for handwriting
                    # Additional check with gradient magnitudes
                    sobelx = cv2.Sobel(gray_np, cv2.CV_64F, 1, 0, ksize=3)
                    sobely = cv2.Sobel(gray_np, cv2.CV_64F, 0, 1, ksize=3)
                    magnitude = np.sqrt(sobelx**2 + sobely**2)
                    # Handwriting typically has more variation in gradient magnitudes
                    if np.std(magnitude) > 20:
                        is_handwritten = True
        except:
            # If detection fails, assume it's not handwritten
            pass

    # Special processing for newspaper format
    if is_newspaper:
        # Convert to grayscale for better text extraction
        gray = img.convert('L')

        # For newspapers, we need aggressive text enhancement to make small print readable
        # First enhance contrast more aggressively for newspaper small text
        enhancer = ImageEnhance.Contrast(gray)
        enhanced = enhancer.enhance(2.0)  # More aggressive contrast for newspaper text

        # Apply stronger sharpening to make small text more defined
        if IMAGE_PREPROCESSING["sharpen"]:
            # Apply multiple passes of sharpening for newspaper text
            enhanced = enhanced.filter(ImageFilter.SHARPEN)
            enhanced = enhanced.filter(ImageFilter.EDGE_ENHANCE_MORE)  # Stronger edge enhancement

        # Enhanced processing for newspapers with OpenCV when available
        if CV2_AVAILABLE:
            try:
                # Convert to numpy array
                img_np = np.array(enhanced)

                # For newspaper text extraction, CLAHE (Contrast Limited Adaptive Histogram Equalization)
                # works much better than simple contrast enhancement
                clahe = cv2.createCLAHE(clipLimit=3.0, tileGridSize=(8, 8))
                img_np = clahe.apply(img_np)

                # Apply different adaptive thresholding approaches and choose the best one

                # 1. Standard adaptive threshold with larger block size for newspaper columns
                binary1 = cv2.adaptiveThreshold(img_np, 255,
                                                cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                                cv2.THRESH_BINARY, 15, 4)

                # 2. Otsu's method for global thresholding - works well for clean newspaper print
                _, binary2 = cv2.threshold(img_np, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

                # Try to determine which method preserves text better
                # Count white pixels and edges in each binary version
                white_pixels1 = np.count_nonzero(binary1 > 200)
                white_pixels2 = np.count_nonzero(binary2 > 200)

                # Calculate edge density to help determine which preserves text features better
                edges1 = cv2.Canny(binary1, 100, 200)
                edges2 = cv2.Canny(binary2, 100, 200)
                edge_count1 = np.count_nonzero(edges1)
                edge_count2 = np.count_nonzero(edges2)

                # For newspaper text, we want to preserve more edges while maintaining reasonable
                # white space (typical of printed text on paper background)
                if (edge_count1 > edge_count2 * 1.2 and white_pixels1 > white_pixels2 * 0.7) or \
                   (white_pixels1 < white_pixels2 * 0.5):  # If Otsu removed too much content
                    # Adaptive thresholding usually better preserves small text in newspapers
                    logger.debug("Using adaptive thresholding for newspaper text")

                    # Apply optional denoising to clean up small speckles
                    result = cv2.fastNlMeansDenoising(binary1, None, 7, 7, 21)
                    return Image.fromarray(result)
                else:
                    # Otsu method was better
                    logger.debug("Using Otsu thresholding for newspaper text")
                    result = cv2.fastNlMeansDenoising(binary2, None, 7, 7, 21)
                    return Image.fromarray(result)

            except Exception as e:
                logger.debug(f"Advanced newspaper processing failed: {str(e)}")
                # Fall back to PIL processing
                pass

        # If OpenCV not available or fails, apply additional PIL enhancements
        # Create a more aggressive binary version to better separate text
        binary_threshold = enhanced.point(lambda x: 0 if x < 150 else 255, '1')

        # Return enhanced binary image
        return binary_threshold

    # Ultra-fast path for tiny images - just convert to grayscale with contrast enhancement
    if img_size < 300000:  # ~500x600 or smaller
        gray = img.convert('L')
        # Lower contrast enhancement for handwritten documents
        contrast_level = 1.4 if is_handwritten else IMAGE_PREPROCESSING["enhance_contrast"]
        enhancer = ImageEnhance.Contrast(gray)
        return enhancer.enhance(contrast_level)

    # Fast path for small images - minimal processing
    if img_size < 1000000:  # ~1000x1000 or smaller
        gray = img.convert('L')
        # Use gentler contrast enhancement for handwritten documents
        contrast_level = 1.4 if is_handwritten else IMAGE_PREPROCESSING["enhance_contrast"]
        enhancer = ImageEnhance.Contrast(gray)
        enhanced = enhancer.enhance(contrast_level)

        # Light sharpening only if sharpen is enabled
        # Use milder sharpening for handwritten documents to preserve stroke detail
        if IMAGE_PREPROCESSING["sharpen"]:
            if is_handwritten:
                # Use edge enhancement which is gentler than SHARPEN for handwriting
                enhanced = enhanced.filter(ImageFilter.EDGE_ENHANCE)
            else:
                enhanced = enhanced.filter(ImageFilter.SHARPEN)
        return enhanced

    # Standard path for medium images
    # Convert to grayscale (faster processing)
    gray = img.convert('L')

    # Adaptive contrast enhancement based on document type
    contrast_level = 1.4 if is_handwritten else IMAGE_PREPROCESSING["enhance_contrast"]
    enhancer = ImageEnhance.Contrast(gray)
    enhanced = enhancer.enhance(contrast_level)

    # Apply light sharpening for text clarity - adapt based on document type
    if IMAGE_PREPROCESSING["sharpen"]:
        if is_handwritten:
            # Use edge enhancement which is gentler than SHARPEN for handwriting
            enhanced = enhanced.filter(ImageFilter.EDGE_ENHANCE)
        else:
            enhanced = enhanced.filter(ImageFilter.SHARPEN)

    # Advanced processing with OpenCV if available
    if CV2_AVAILABLE and IMAGE_PREPROCESSING["denoise"]:
        try:
            # Convert to numpy array for OpenCV processing
            img_np = np.array(enhanced)

            if is_handwritten:
                # Enhanced processing for handwritten documents
                # Optimized for better stroke preservation and readability
                if img_size > 3000000:  # Large images - downsample first
                    scale_factor = 0.5
                    small_img = cv2.resize(img_np, None, fx=scale_factor, fy=scale_factor,
                                           interpolation=cv2.INTER_AREA)

                    # Apply CLAHE for better local contrast in handwriting
                    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
                    enhanced_img = clahe.apply(small_img)

                    # Apply bilateral filter with parameters optimized for handwriting
                    # Lower sigma values to preserve more detail
                    filtered = cv2.bilateralFilter(enhanced_img, 7, 30, 50)

                    # Resize back
                    filtered = cv2.resize(filtered, (width, height), interpolation=cv2.INTER_LINEAR)
                else:
                    # For smaller handwritten images
                    # Apply CLAHE for better local contrast
                    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
                    enhanced_img = clahe.apply(img_np)

                    # Apply bilateral filter with parameters optimized for handwriting
                    filtered = cv2.bilateralFilter(enhanced_img, 5, 25, 45)

                # Adaptive thresholding specific to handwriting
                try:
                    # Use larger block size and lower constant for better stroke preservation
                    binary = cv2.adaptiveThreshold(
                        filtered, 255,
                        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                        cv2.THRESH_BINARY,
                        21,  # Larger block size for handwriting
                        5    # Lower constant for better stroke preservation
                    )

                    # Apply slight dilation to connect broken strokes
                    kernel = np.ones((2, 2), np.uint8)
                    binary = cv2.dilate(binary, kernel, iterations=1)

                    # Convert back to PIL Image
                    return Image.fromarray(binary)
                except Exception as e:
                    logger.debug(f"Adaptive threshold for handwriting failed: {str(e)}")
                    # Convert filtered image to PIL and return as fallback
                    return Image.fromarray(filtered)

            else:
                # Standard document processing - optimized for printed text
                # Optimize denoising parameters based on image size
                if img_size > 4000000:  # Very large images
                    # More aggressive downsampling for very large images
                    scale_factor = 0.5
                    downsample = cv2.resize(img_np, None, fx=scale_factor, fy=scale_factor,
                                            interpolation=cv2.INTER_AREA)

                    # Lighter denoising for downsampled image
                    h_value = 7  # Strength parameter
                    template_window = 5
                    search_window = 13

                    # Apply denoising on smaller image
                    denoised_np = cv2.fastNlMeansDenoising(downsample, None, h_value, template_window, search_window)

                    # Resize back to original size
                    denoised_np = cv2.resize(denoised_np, (width, height), interpolation=cv2.INTER_LINEAR)
                else:
                    # Direct denoising for medium-large images
                    h_value = 8  # Balanced for speed and quality
                    template_window = 5
                    search_window = 15

                    # Apply denoising
                    denoised_np = cv2.fastNlMeansDenoising(img_np, None, h_value, template_window, search_window)

                # Convert back to PIL Image
                enhanced = Image.fromarray(denoised_np)

                # Apply adaptive thresholding only if it improves text visibility
                # Create a binarized version of the image
                if img_size < 8000000:  # Skip for extremely large images to save processing time
                    binary = cv2.adaptiveThreshold(denoised_np, 255,
                                                   cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                                   cv2.THRESH_BINARY, 11, 2)

                    # Quick verification that binarization preserves text information
                    # Use simplified check that works well for document images
                    white_pixels_binary = np.count_nonzero(binary > 200)
                    white_pixels_orig = np.count_nonzero(denoised_np > 200)

                    # Check if binary preserves reasonable amount of white pixels (background)
                    if white_pixels_binary > white_pixels_orig * 0.8:
                        # Binarization looks good, use it
                        return Image.fromarray(binary)

                return enhanced

        except Exception as e:
            # If OpenCV processing fails, continue with PIL-enhanced image
            pass

    elif IMAGE_PREPROCESSING["denoise"]:
        # Fallback PIL denoising for systems without OpenCV
        if is_handwritten:
            # Lighter filtering for handwritten text to preserve details
            # Use a smaller median filter for handwritten documents
            enhanced = enhanced.filter(ImageFilter.MedianFilter(1))
        else:
            # Standard filtering for printed documents
            enhanced = enhanced.filter(ImageFilter.MedianFilter(3))

    # Return enhanced grayscale image
    return enhanced
# Removed caching to fix unhashable type error
def preprocess_general_image(img: Image.Image) -> Image.Image:
    """
    Preprocess a general image for OCR.

    Args:
        img: PIL Image object

    Returns:
        Processed PIL Image
    """
    # Store the image for implementation function
    preprocess_general_image._current_img = img
    return _preprocess_general_image_impl()

def _preprocess_general_image_impl() -> Image.Image:
    """
    Optimized implementation of general image preprocessing with size-based processing paths
    """
    # Fast path: Get the image from thread-local storage
    if not hasattr(preprocess_general_image, "_current_img"):
        raise ValueError("No image set for general preprocessing")

    img = preprocess_general_image._current_img

    # Ultra-fast path: Skip processing completely for small images to improve performance
    width, height = img.size
    img_size = width * height
    if img_size < 300000:  # Skip for tiny images under ~0.3 megapixel
        # Just ensure correct color mode
        if img.mode != 'RGB':
            return img.convert('RGB')
        return img

    # Fast path: Minimal processing for smaller images
    if img_size < 600000:  # ~800x750 or smaller
        # Ensure RGB mode
        if img.mode != 'RGB':
            img = img.convert('RGB')

        # Very light contrast enhancement only
        enhancer = ImageEnhance.Contrast(img)
        return enhancer.enhance(1.15)  # Lighter enhancement for small images

    # Standard path: Apply moderate enhancements for medium images
    # Convert to RGB to ensure compatibility
    if img.mode != 'RGB':
        img = img.convert('RGB')

    # Moderate enhancement only
    enhancer = ImageEnhance.Contrast(img)
    enhanced = enhancer.enhance(1.2)  # Less aggressive than document enhancement

    # Skip additional processing for medium-sized images
    if img_size < 1000000:  # Skip for images under ~1 megapixel
        return enhanced

    # Enhanced path: Additional processing for larger images
    try:
        # Apply optimized enhancement pipeline for large non-document images

        # 1. Improve color saturation slightly for better feature extraction
        saturation = ImageEnhance.Color(enhanced)
        enhanced = saturation.enhance(1.1)

        # 2. Apply adaptive sharpening based on image size
        if img_size > 2500000:  # Very large images (~1600x1600 or larger)
            # Use EDGE_ENHANCE instead of SHARPEN for more subtle enhancement on large images
            enhanced = enhanced.filter(ImageFilter.EDGE_ENHANCE)
        else:
            # Standard sharpening for regular large images
            enhanced = enhanced.filter(ImageFilter.SHARPEN)

        # 3. Apply additional processing with OpenCV if available (for largest images)
        if CV2_AVAILABLE and img_size > 3000000:
            # Convert to numpy array
            img_np = np.array(enhanced)

            # Apply subtle enhancement of details (CLAHE)
            try:
                # Convert to LAB color space for better processing
                lab = cv2.cvtColor(img_np, cv2.COLOR_RGB2LAB)

                # Only enhance the L channel (luminance)
                l, a, b = cv2.split(lab)

                # Create CLAHE object with optimal parameters for photos
                clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))

                # Apply CLAHE to L channel
                l = clahe.apply(l)

                # Merge channels back and convert to RGB
                lab = cv2.merge((l, a, b))
                enhanced_np = cv2.cvtColor(lab, cv2.COLOR_LAB2RGB)

                # Convert back to PIL
                enhanced = Image.fromarray(enhanced_np)
            except:
                # If CLAHE fails, continue with PIL-enhanced image
                pass

    except Exception:
        # If any enhancement fails, fall back to basic contrast enhancement
        if img.mode != 'RGB':
            img = img.convert('RGB')
        enhancer = ImageEnhance.Contrast(img)
        enhanced = enhancer.enhance(1.2)

    return enhanced
# Removed caching decorator to fix unhashable type error
def resize_image(img: Image.Image, target_dpi: int = 300) -> Image.Image:
    """
    Resize an image to an optimal size for OCR while preserving quality.

    Args:
        img: PIL Image object
        target_dpi: Target DPI (dots per inch)

    Returns:
        Resized PIL Image
    """
    # Store the image for implementation function
    resize_image._current_img = img
    return resize_image_impl(target_dpi)

def resize_image_impl(target_dpi: int = 300) -> Image.Image:
    """
    Implementation of image resizing, separated for cleaner code organization.

    Args:
        target_dpi: Target DPI (dots per inch)

    Returns:
        Resized PIL Image
    """
    # Get the image from thread-local storage (set by the caller)
    if not hasattr(resize_image, "_current_img"):
        raise ValueError("No image set for resizing")

    img = resize_image._current_img

    # Calculate current dimensions
    width, height = img.size

    # Fixed target dimensions based on DPI
    # Using larger dimensions to support newspapers and large documents
    max_width = int(14 * target_dpi)   # Increased from 8.5 to 14 inches
    max_height = int(22 * target_dpi)  # Increased from 11 to 22 inches

    # Check if resizing is needed - quick early return
    if width <= max_width and height <= max_height:
        return img  # No resizing needed

    # Calculate scaling factor once
    scale_factor = min(max_width / width, max_height / height)

    # Calculate new dimensions
    new_width = int(width * scale_factor)
    new_height = int(height * scale_factor)

    # Use BICUBIC for better balance of speed and quality
    return img.resize((new_width, new_height), Image.BICUBIC)
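The resize logic above caps pages at 14x22 inches at the target DPI. The arithmetic can be checked in isolation (`target_size` is a hypothetical helper mirroring the scaling computation, not part of the commit):

```python
def target_size(width, height, target_dpi=300, max_inches=(14, 22)):
    """Compute the output dimensions the resize step would produce."""
    max_w = int(max_inches[0] * target_dpi)
    max_h = int(max_inches[1] * target_dpi)
    if width <= max_w and height <= max_h:
        return width, height  # no resizing needed
    # Single scale factor preserves aspect ratio against both limits
    scale = min(max_w / width, max_h / height)
    return int(width * scale), int(height * scale)
```

At 300 DPI the limits are 4200x6600 pixels, so a 8400x6000 scan is halved while a 2000x3000 scan passes through untouched.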
def calculate_image_entropy(img: Image.Image) -> float:
    """
    Calculate the entropy (information content) of an image.

    Args:
        img: PIL Image object

    Returns:
        Entropy value
    """
    # Convert to grayscale
    if img.mode != 'L':
        img = img.convert('L')

    # Calculate histogram
    histogram = img.histogram()
    total_pixels = img.width * img.height

    # Calculate entropy
    entropy = 0
    for h in histogram:
        if h > 0:
            probability = h / total_pixels
            entropy -= probability * np.log2(probability)

    return entropy
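calculate_image_entropy above is the standard Shannon entropy of the grayscale histogram. The same computation on a raw count histogram, stdlib only (`histogram_entropy` is a hypothetical standalone equivalent):

```python
import math

def histogram_entropy(histogram):
    """Shannon entropy in bits of a pixel-count histogram, as in calculate_image_entropy."""
    total = sum(histogram)
    entropy = 0.0
    for count in histogram:
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy
```

A single-valued image has zero entropy; a 50/50 split over two gray levels gives exactly one bit.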
def create_html_with_images(result):
    """
    Create an HTML document with embedded images from OCR results.
    Handles serialization of complex OCR objects automatically.

    Args:
        result: OCR result dictionary containing pages_data

    Returns:
        HTML content as string
    """
    # Ensure result is fully serializable first
    result = serialize_ocr_object(result)
    # Create HTML document structure
    html_content = """
    <!DOCTYPE html>
    <html>
    <head>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>OCR Document with Images</title>
        <style>
            body {
                font-family: Georgia, serif;
                line-height: 1.7;
                margin: 0 auto;
                max-width: 800px;
                padding: 20px;
            }
            img {
                max-width: 90%;
                max-height: 500px;
                object-fit: contain;
                margin: 20px auto;
                display: block;
                border: 1px solid #ddd;
                border-radius: 4px;
            }
            .image-container {
                margin: 20px 0;
                text-align: center;
            }
            .page-break {
                border-top: 1px solid #ddd;
                margin: 40px 0;
                padding-top: 40px;
            }
            h3 {
                color: #333;
                border-bottom: 1px solid #eee;
                padding-bottom: 10px;
            }
            p {
                margin: 12px 0;
            }
            .page-text-content {
                margin-bottom: 20px;
            }
            .text-block {
                background-color: #f9f9f9;
                padding: 15px;
                border-radius: 4px;
                border-left: 3px solid #546e7a;
                margin-bottom: 15px;
                color: #333;
            }
            .text-block p {
                margin: 8px 0;
                color: #333;
            }
            .metadata {
                background-color: #f5f5f5;
                padding: 10px 15px;
                border-radius: 4px;
                margin-bottom: 20px;
                font-size: 14px;
            }
            .metadata p {
                margin: 5px 0;
            }
        </style>
    </head>
    <body>
    """

    # Add document metadata
    html_content += f"""
    <div class="metadata">
        <h2>{result.get('file_name', 'Document')}</h2>
        <p><strong>Processed at:</strong> {result.get('timestamp', '')}</p>
        <p><strong>Languages:</strong> {', '.join(result.get('languages', ['Unknown']))}</p>
        <p><strong>Topics:</strong> {', '.join(result.get('topics', ['Unknown']))}</p>
    </div>
    """

    # Check if we have pages_data
    if 'pages_data' in result and result['pages_data']:
        pages_data = result['pages_data']

        # Process each page
        for i, page in enumerate(pages_data):
            page_markdown = page.get('markdown', '')
            images = page.get('images', [])

            # Add page header if multi-page
            if len(pages_data) > 1:
                html_content += f"<h3>Page {i+1}</h3>"

            # Create image dictionary
            image_dict = {}
            for img in images:
                if 'id' in img and 'image_base64' in img:
                    image_dict[img['id']] = img['image_base64']

            # Process the markdown content
            if page_markdown:
                # Extract text content (lines without images)
                text_content = []
                image_lines = []

                for line in page_markdown.split('\n'):
-
if '
|
| 1553 |
-
elif line.strip():
|
| 1554 |
-
text_content.append(line)
|
| 1555 |
-
|
| 1556 |
-
# Add text content
|
| 1557 |
-
if text_content:
|
| 1558 |
-
html_content += '<div class="text-block">'
|
| 1559 |
-
for line in text_content:
|
| 1560 |
-
html_content += f"<p>{line}</p>"
|
| 1561 |
-
html_content += '</div>'
|
| 1562 |
-
|
| 1563 |
-
# Add images
|
| 1564 |
-
for line in image_lines:
|
| 1565 |
-
# Extract image ID and alt text using simple parsing
|
| 1566 |
-
try:
|
| 1567 |
-
alt_start = line.find('![') + 2
|
| 1568 |
-
alt_end = line.find(']', alt_start)
|
| 1569 |
-
alt_text = line[alt_start:alt_end]
|
| 1570 |
-
|
| 1571 |
-
img_start = line.find('(', alt_end) + 1
|
| 1572 |
-
img_end = line.find(')', img_start)
|
| 1573 |
-
img_id = line[img_start:img_end]
|
| 1574 |
-
|
| 1575 |
-
if img_id in image_dict:
|
| 1576 |
-
html_content += f'<div class="image-container">'
|
| 1577 |
-
html_content += f'<img src="{image_dict[img_id]}" alt="{alt_text}">'
|
| 1578 |
-
html_content += f'</div>'
|
| 1579 |
-
except:
|
| 1580 |
-
# If parsing fails, just skip this image
|
| 1581 |
-
continue
|
| 1582 |
-
|
| 1583 |
-
# Add page separator if not the last page
|
| 1584 |
-
if i < len(pages_data) - 1:
|
| 1585 |
-
html_content += '<div class="page-break"></div>'
|
| 1586 |
-
|
| 1587 |
-
# Add structured content if available
|
| 1588 |
-
if 'ocr_contents' in result and isinstance(result['ocr_contents'], dict):
|
| 1589 |
-
html_content += '<h3>Structured Content</h3>'
|
| 1590 |
-
|
| 1591 |
-
for section, content in result['ocr_contents'].items():
|
| 1592 |
-
if content and section not in ['error', 'raw_text', 'partial_text']:
|
| 1593 |
-
html_content += f'<h4>{section.replace("_", " ").title()}</h4>'
|
| 1594 |
-
|
| 1595 |
-
if isinstance(content, str):
|
| 1596 |
-
html_content += f'<p>{content}</p>'
|
| 1597 |
-
elif isinstance(content, list):
|
| 1598 |
-
html_content += '<ul>'
|
| 1599 |
-
for item in content:
|
| 1600 |
-
html_content += f'<li>{str(item)}</li>'
|
| 1601 |
-
html_content += '</ul>'
|
| 1602 |
-
elif isinstance(content, dict):
|
| 1603 |
-
html_content += '<dl>'
|
| 1604 |
-
for k, v in content.items():
|
| 1605 |
-
html_content += f'<dt>{k}</dt><dd>{v}</dd>'
|
| 1606 |
-
html_content += '</dl>'
|
| 1607 |
-
|
| 1608 |
-
# Close HTML document
|
| 1609 |
-
html_content += """
|
| 1610 |
-
</body>
|
| 1611 |
-
</html>
|
| 1612 |
-
"""
|
| 1613 |
-
|
| 1614 |
-
return html_content
|
| 1615 |
-
|
| 1616 |
-
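The removed function parses markdown image references with plain `find()` calls rather than a regex. A standalone sketch of that parsing step (the helper name `parse_image_ref` is hypothetical, not part of the module):

```python
def parse_image_ref(line):
    """Extract (alt_text, image_id) from a markdown image line like '![alt](id)'."""
    alt_start = line.find('![') + 2
    alt_end = line.find(']', alt_start)
    if '![' not in line or alt_end == -1:
        return None
    img_start = line.find('(', alt_end) + 1
    img_end = line.find(')', img_start)
    if img_start == 0 or img_end == -1:  # find() returned -1
        return None
    return line[alt_start:alt_end], line[img_start:img_end]
```

The returned `image_id` is then looked up in the page's `image_dict` to inline the base64 payload.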
-def generate_document_thumbnail(image_path: Union[str, Path], max_size: int = 300) -> str:
-    """
-    Generate a thumbnail for document preview.
-
-    Args:
-        image_path: Path to the image file
-        max_size: Maximum dimension for thumbnail
-
-    Returns:
-        Base64 encoded thumbnail
-    """
-    if not PILLOW_AVAILABLE:
-        return None
-
-    try:
-        # Open the image
-        with Image.open(image_path) as img:
-            # Calculate thumbnail size preserving aspect ratio
-            width, height = img.size
-            if width > height:
-                new_width = max_size
-                new_height = int(height * (max_size / width))
-            else:
-                new_height = max_size
-                new_width = int(width * (max_size / height))
-
-            # Create thumbnail
-            thumbnail = img.resize((new_width, new_height), Image.LANCZOS)
-
-            # Save to buffer
-            buffer = io.BytesIO()
-            thumbnail.save(buffer, format="JPEG", quality=85)
-            buffer.seek(0)
-
-            # Encode as base64
-            encoded = base64.b64encode(buffer.getvalue()).decode()
-            return f"data:image/jpeg;base64,{encoded}"
-    except Exception:
-        # Return None if thumbnail generation fails
-        return None
-
-def serialize_ocr_object(obj):
-    """
-    Serialize OCR response objects to JSON serializable format.
-    Handles OCRImageObject specifically to prevent serialization errors.
-
-    Args:
-        obj: The object to serialize
-
-    Returns:
-        JSON serializable representation of the object
-    """
-    # Fast path: Handle primitive types directly
-    if obj is None or isinstance(obj, (str, int, float, bool)):
-        return obj
-
-    # Handle collections
-    if isinstance(obj, list):
-        return [serialize_ocr_object(item) for item in obj]
-    elif isinstance(obj, dict):
-        return {k: serialize_ocr_object(v) for k, v in obj.items()}
-    elif isinstance(obj, OCRImageObject):
-        # Special handling for OCRImageObject
-        return {
-            'id': obj.id if hasattr(obj, 'id') else None,
-            'image_base64': obj.image_base64 if hasattr(obj, 'image_base64') else None
-        }
-    elif hasattr(obj, '__dict__'):
-        # For objects with __dict__ attribute
-        return {k: serialize_ocr_object(v) for k, v in obj.__dict__.items()
-                if not k.startswith('_')}  # Skip private attributes
-    else:
-        # Try to convert to string as last resort
-        try:
-            return str(obj)
-        except:
-            return None
-
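The serializer above recurses through lists, dicts, and arbitrary objects, dropping private attributes. A minimal dependency-free sketch of the same pattern (without the `OCRImageObject` special case, which needs the Mistral SDK types):

```python
def serialize(obj):
    """Recursively convert an object graph into JSON-serializable primitives."""
    if obj is None or isinstance(obj, (str, int, float, bool)):
        return obj                      # primitives pass through unchanged
    if isinstance(obj, list):
        return [serialize(v) for v in obj]
    if isinstance(obj, dict):
        return {k: serialize(v) for k, v in obj.items()}
    if hasattr(obj, '__dict__'):
        # Fall back to the instance dict, skipping private attributes
        return {k: serialize(v) for k, v in obj.__dict__.items()
                if not k.startswith('_')}
    return str(obj)                     # last resort: stringify

class Img:
    """Hypothetical stand-in for an SDK response object."""
    def __init__(self):
        self.id = 'img-0'
        self._cache = 'not serialized'
```

Note the base case must come first, or strings would be iterated as sequences.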
-def try_local_ocr_fallback(image_path: Union[str, Path], base64_data_url: str = None) -> str:
-    """
-    Attempt to use local pytesseract OCR as a fallback when the API fails,
-    with enhanced processing optimized for handwritten content.
-
-    Args:
-        image_path: Path to the image file
         base64_data_url: Optional base64 data URL if already available
 
     Returns:
-        Extracted text or None if extraction failed
    """
-    try:
-        import pytesseract
-        from PIL import Image
-
-        # Load image - either from path or from base64
-        if base64_data_url and base64_data_url.startswith('data:image'):
-            # Extract image from base64
-            image_data = base64_data_url.split(',', 1)[1]
-            image_bytes = base64.b64decode(image_data)
-            image = Image.open(io.BytesIO(image_bytes))
-        else:
-            # Load from file path
-            image_path = Path(image_path) if isinstance(image_path, str) else image_path
-            image = Image.open(image_path)
-
-        # Auto-detect if this appears to be handwritten
-        is_handwritten = False
-
-        # Use OpenCV-based analysis when available
-        try:
-            # Convert image to numpy array
-            img_np = np.array(image.convert('L'))
-
-            # Check for handwritten characteristics
-            edges = cv2.Canny(img_np, 30, 100)
-            edge_ratio = np.count_nonzero(edges) / edges.size
-
-            # Typical handwritten documents have more varied edge patterns
-            if edge_ratio > 0.02:
-                # Additional check with gradient magnitudes
-                sobelx = cv2.Sobel(img_np, cv2.CV_64F, 1, 0, ksize=3)
-                sobely = cv2.Sobel(img_np, cv2.CV_64F, 0, 1, ksize=3)
-                magnitude = np.sqrt(sobelx**2 + sobely**2)
-                # Handwriting typically has more variation in gradient magnitudes
-                if np.std(magnitude) > 20:
-                    is_handwritten = True
-                    logger.info("Detected handwritten content for local OCR")
-
-            # Enhanced preprocessing based on document type
-            if is_handwritten:
-                # Process for handwritten content
-                # Apply CLAHE for better local contrast
-                clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
-                img_np = clahe.apply(img_np)
-
-                # Apply adaptive thresholding with optimized parameters for handwriting
-                binary = cv2.adaptiveThreshold(
-                    img_np, 255,
-                    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
-                    cv2.THRESH_BINARY,
-                    21,  # Larger block size for handwriting
-                    5    # Lower constant for better stroke preservation
-                )
-
-                # Optional: apply dilation to thicken strokes slightly
-                kernel = np.ones((2, 2), np.uint8)
-                binary = cv2.dilate(binary, kernel, iterations=1)
-
-                # Convert back to PIL Image for tesseract
-                image = Image.fromarray(binary)
-
-                # Set tesseract options for handwritten content
-                custom_config = r'--oem 1 --psm 6 -l eng'
-            else:
-                # Process for printed content
-                # Apply CLAHE for better contrast
-                clahe = cv2.createCLAHE(clipLimit=2.5, tileGridSize=(8, 8))
-                img_np = clahe.apply(img_np)
-
-                # Apply bilateral filter to reduce noise while preserving edges
-                img_np = cv2.bilateralFilter(img_np, 9, 75, 75)
-
-                # Apply Otsu's thresholding for printed text
-                _, binary = cv2.threshold(img_np, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
-
-                # Convert back to PIL Image for tesseract
-                image = Image.fromarray(binary)
-
-                # Set tesseract options for printed content
-                custom_config = r'--oem 3 --psm 6 -l eng'
-        except Exception as e:
-            logger.warning(f"OpenCV preprocessing failed: {str(e)}. Using PIL fallback.")
-
-            # Convert to RGB if not already (pytesseract works best with RGB)
-            if image.mode != 'RGB':
-                image = image.convert('RGB')
-
-            # Apply basic image enhancements
-            image = image.convert('L')
-            enhancer = ImageEnhance.Contrast(image)
-            image = enhancer.enhance(2.0)
-            custom_config = r'--oem 3 --psm 6 -l eng'
-        else:
-            # PIL-only path without OpenCV
-            # Convert to RGB if not already (pytesseract works best with RGB)
-            if image.mode != 'RGB':
-                image = image.convert('RGB')
-
-            # Apply basic image enhancements
-            image = image.convert('L')
-            enhancer = ImageEnhance.Contrast(image)
-            image = enhancer.enhance(2.0)
-            custom_config = r'--oem 3 --psm 6 -l eng'
 
-        # Run OCR with the selected configuration
-        ocr_text = pytesseract.image_to_string(image, config=custom_config)
 
-        if ocr_text and len(ocr_text.strip()) > 50:
-            logger.info(f"Local OCR fallback successful: extracted {len(ocr_text)} characters")
-            return ocr_text
        else:
-            # Retry with a different page segmentation mode
-
-            # Try PSM mode 4 (assume single column of text)
-            fallback_config = r'--oem 3 --psm 4 -l eng'
-            ocr_text = pytesseract.image_to_string(image, config=fallback_config)
-
-            if ocr_text and len(ocr_text.strip()) > 50:
-                logger.info(f"Local OCR fallback successful: extracted {len(ocr_text)} characters")
-                return ocr_text
-            else:
-                logger.warning("Local OCR produced minimal or no text")
-                return None
-    except ImportError:
-        logger.warning("Pytesseract not installed - local OCR not available")
-        return None
    except Exception as e:
-        logger.error(f"Local OCR fallback failed: {str(e)}")
-        return None
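The removed fallback classifies a page as handwritten from edge density (Canny edge ratio above 0.02) plus gradient-magnitude variance. A dependency-free sketch of the edge-ratio half of that heuristic, with the Canny step approximated by a simple neighbor-difference threshold (the helper name and the threshold of 50 are illustrative; only the 0.02 cutoff comes from the code above):

```python
def edge_ratio(gray, threshold=50):
    """Fraction of pixels whose horizontal or vertical intensity jump exceeds threshold.

    gray: 2D list of ints in 0..255 (a grayscale image).
    """
    rows, cols = len(gray), len(gray[0])
    edges = 0
    for y in range(rows - 1):
        for x in range(cols - 1):
            dx = abs(gray[y][x + 1] - gray[y][x])   # horizontal jump
            dy = abs(gray[y + 1][x] - gray[y][x])   # vertical jump
            if max(dx, dy) > threshold:
                edges += 1
    return edges / (rows * cols)

flat = [[128] * 8 for _ in range(8)]                                   # uniform page
stripes = [[0 if x % 2 else 255 for x in range(8)] for _ in range(8)]  # busy strokes
```

A uniform page scores 0, while a page full of strokes lands well above the 0.02 cutoff that triggers the handwriting branch.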
 """
+OCR utility functions for image processing and OCR operations.
+This module provides helper functions used across the Historical OCR application.
 """
 
+import os
 import base64
 import logging
 from pathlib import Path
+from typing import Union, Optional
 
 # Configure logging
 logging.basicConfig(level=logging.INFO,
+                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
 logger = logging.getLogger(__name__)
 
+# Try to import optional dependencies
+try:
+    import pytesseract
+    TESSERACT_AVAILABLE = True
+except ImportError:
+    logger.warning("pytesseract not available - local OCR fallback will not work")
+    TESSERACT_AVAILABLE = False
 
 try:
+    from PIL import Image
     PILLOW_AVAILABLE = True
 except ImportError:
     logger.warning("PIL not available - image preprocessing will be limited")
     PILLOW_AVAILABLE = False
 
 def encode_image_for_api(image_path: Union[str, Path]) -> str:
     """
+    Encode an image as base64 data URL for API submission with proper MIME type.
 
     Args:
         image_path: Path to the image file
 
     encoded = base64.b64encode(image_file.read_bytes()).decode()
     return f"data:{mime_type};base64,{encoded}"
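`encode_image_for_api` builds a `data:` URL with the file's MIME type before base64-encoding it. A self-contained sketch of the same idea using the stdlib `mimetypes` module (the function name `to_data_url` is hypothetical; the real helper may detect the type differently):

```python
import base64
import mimetypes
from pathlib import Path

def to_data_url(path):
    """Encode a file as a data: URL, guessing the MIME type from the extension."""
    mime, _ = mimetypes.guess_type(str(path))
    mime = mime or 'application/octet-stream'   # safe default for unknown extensions
    encoded = base64.b64encode(Path(path).read_bytes()).decode()
    return f"data:{mime};base64,{encoded}"
```

Supplying the correct MIME prefix matters because the OCR API dispatches on it when decoding the payload.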
 
+def try_local_ocr_fallback(file_path: Union[str, Path], base64_data_url: Optional[str] = None) -> Optional[str]:
     """
+    Try to perform OCR using local Tesseract as a fallback when the API is unavailable.
 
     Args:
+        file_path: Path to the image file
         base64_data_url: Optional base64 data URL if already available
 
     Returns:
+        Extracted text or None if extraction failed
     """
+    if not TESSERACT_AVAILABLE or not PILLOW_AVAILABLE:
+        logger.warning("Local OCR fallback is not available (missing dependencies)")
+        return None
 
     try:
+        logger.info("Using local Tesseract OCR as fallback")
 
+        # Use PIL to open the image
+        img = Image.open(file_path)
 
+        # Use Tesseract to extract text
+        text = pytesseract.image_to_string(img)
 
+        if text:
+            logger.info("Successfully extracted text using local Tesseract OCR")
+            return text
         else:
+            logger.warning("Tesseract extracted no text")
+            return None
     except Exception as e:
+        logger.error(f"Error using local OCR fallback: {str(e)}")
+        return None
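The project's `.clinerules/apiDocumentation.md` above specifies the surrounding control flow: up to 3 timeout retries against the Mistral `/v1/ocr` endpoint before dropping to this Tesseract fallback. A minimal sketch of that retry-then-fallback pattern (the wrapper name and both callables are hypothetical stand-ins for the real API call and `try_local_ocr_fallback`):

```python
def ocr_with_fallback(call_api, local_fallback, attempts=3):
    """Try the remote OCR up to `attempts` times, then fall back to local OCR."""
    last_error = None
    for _ in range(attempts):
        try:
            return call_api()
        except Exception as e:     # e.g. a timeout from the remote service
            last_error = e
    result = local_fallback()      # raw text only, no structured output
    if result is None:
        raise last_error
    return result

calls = {'n': 0}
def flaky():
    """Stub API call that always times out, to exercise the fallback path."""
    calls['n'] += 1
    raise TimeoutError('remote OCR unavailable')
```

When the fallback also returns nothing, re-raising the last API error preserves the original failure cause for the caller.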
preprocessing.py
CHANGED

@@ -3,15 +3,398 @@ import io
 import cv2
 import numpy as np
 import tempfile
 from PIL import Image, ImageEnhance, ImageFilter
 from pdf2image import convert_from_bytes
 import streamlit as st
 import logging
 
 # Configure logging
 logger = logging.getLogger("preprocessing")
 logger.setLevel(logging.INFO)
 
 @st.cache_data(ttl=24*3600, show_spinner=False)  # Cache for 24 hours
 def convert_pdf_to_images(pdf_bytes, dpi=150, rotation=0):
     """Convert PDF bytes to a list of images with caching"""
@@ -34,94 +417,134 @@ def convert_pdf_to_images(pdf_bytes, dpi=150, rotation=0):
|
|
| 34 |
|
| 35 |
@st.cache_data(ttl=24*3600, show_spinner=False, hash_funcs={dict: lambda x: str(sorted(x.items()))})
|
| 36 |
def preprocess_image(image_bytes, preprocessing_options):
|
| 37 |
-
"""
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
     # Setup basic console logging
     logger = logging.getLogger("image_preprocessor")
     logger.setLevel(logging.INFO)

     # Log which preprocessing options are being applied
-    logger.info(f"

     # Convert bytes to PIL Image
     image = Image.open(io.BytesIO(image_bytes))

-    # Check for
     if image.mode == 'RGBA':
-        # Convert RGBA to RGB by compositing
         background = Image.new('RGB', image.size, (255, 255, 255))
         background.paste(image, mask=image.split()[3])  # 3 is the alpha channel
         image = background
-
     elif image.mode not in ('RGB', 'L'):
-        # Convert other modes to RGB
         image = image.convert('RGB')
-
-    # Apply rotation if specified
-    if preprocessing_options.get("rotation", 0) != 0:
-        rotation_degrees = preprocessing_options.get("rotation")
-        image = image.rotate(rotation_degrees, expand=True, resample=Image.BICUBIC)
-
-    # Resize large images while preserving details important for OCR
-    width, height = image.size
-    max_dimension = max(width, height)
-
-    # Less aggressive resizing to preserve document details
-    if max_dimension > 2500:
-        scale_factor = 2500 / max_dimension
-        new_width = int(width * scale_factor)
-        new_height = int(height * scale_factor)
-        # Use LANCZOS for better quality preservation
-        image = image.resize((new_width, new_height), Image.LANCZOS)

     img_array = np.array(image)

-    # Apply
-    document_type = preprocessing_options.get("document_type", "standard")
-
-    # Process grayscale option first as it's a common foundation
     if preprocessing_options.get("grayscale", False):
         if len(img_array.shape) == 3:  # Only convert if it's not already grayscale
-
             img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
-            clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
             img_array = clahe.apply(img_array)
         else:
             # Standard grayscale for printed documents
             img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
-
-        img_array = cv2.cvtColor(img_array, cv2.COLOR_GRAY2RGB)
-
-    if preprocessing_options.get("contrast", 0) != 0:
-        contrast_factor = 1 + (preprocessing_options.get("contrast", 0) / 150)  # Reduced from /100 for a gentler effect
-        image = Image.fromarray(img_array)
-        enhancer = ImageEnhance.Contrast(image)
-        image = enhancer.enhance(contrast_factor)
-        img_array = np.array(image)

     if preprocessing_options.get("denoise", False):
         try:
-            # Apply
-
-            else:  # Grayscale image
-                img_array = cv2.fastNlMeansDenoising(img_array, None, 2, 5, 15)  # Reduced from 3,7,21
         else:
-            #
-
-            img_array = cv2.fastNlMeansDenoising(img_array, None, 3, 5, 15)  # Reduced from 5,7,21
         except Exception as e:
-            logger.error(f"Denoising error: {str(e)}

     # Convert back to PIL Image
-

     # Higher quality for OCR processing
     byte_io = io.BytesIO()
@@ -135,16 +558,14 @@ def preprocess_image(image_bytes, preprocessing_options):

     logger.info(f"Preprocessing complete. Original image mode: {image.mode}, processed mode: {processed_image.mode}")
     logger.info(f"Original size: {len(image_bytes)/1024:.1f}KB, processed size: {len(byte_io.getvalue())/1024:.1f}KB")

     return byte_io.getvalue()
 except Exception as e:
     logger.error(f"Error saving processed image: {str(e)}")
     # Fallback to original image
     logger.info("Using original image as fallback")
-    image_io = io.BytesIO()
-    image.save(image_io, format='JPEG', quality=92)
-    image_io.seek(0)
-    return image_io.getvalue()

 def create_temp_file(content, suffix, temp_file_paths):
     """Create a temporary file and track it for cleanup"""
@@ -157,19 +578,53 @@ def create_temp_file(content, suffix, temp_file_paths):

     return temp_path

 def apply_preprocessing_to_file(file_bytes, file_ext, preprocessing_options, temp_file_paths):
-    """
-
-
     has_preprocessing = (
         preprocessing_options.get("grayscale", False) or
         preprocessing_options.get("denoise", False) or
-        preprocessing_options.get("contrast", 0) != 0
-        preprocessing_options.get("rotation", 0) != 0
     )

-
     # Apply preprocessing
     logger.info(f"Applying preprocessing with options: {preprocessing_options}")
     processed_bytes = preprocess_image(file_bytes, preprocessing_options)

     # Save processed image to temp file

@@ ... @@

 import cv2
 import numpy as np
 import tempfile
+import time
+import math
+import json
 from PIL import Image, ImageEnhance, ImageFilter
 from pdf2image import convert_from_bytes
 import streamlit as st
 import logging
+import concurrent.futures
+from pathlib import Path

 # Configure logging
 logger = logging.getLogger("preprocessing")
 logger.setLevel(logging.INFO)

+# Ensure logs directory exists
+def ensure_log_directory(config):
+    """Create logs directory if it doesn't exist"""
+    if config.get("logging", {}).get("enabled", False):
+        log_path = config.get("logging", {}).get("output_path", "logs/preprocessing_metrics.json")
+        log_dir = os.path.dirname(log_path)
+        if log_dir:
+            Path(log_dir).mkdir(parents=True, exist_ok=True)
+
+def log_preprocessing_metrics(metrics, config):
+    """Log preprocessing metrics to JSON file"""
+    if not config.get("enabled", False):
+        return
+
+    log_path = config.get("output_path", "logs/preprocessing_metrics.json")
+    ensure_log_directory({"logging": {"enabled": True, "output_path": log_path}})
+
+    # Add timestamp
+    metrics["timestamp"] = time.strftime("%Y-%m-%d %H:%M:%S")
+
+    # Append to log file
+    try:
+        existing_data = []
+        if os.path.exists(log_path):
+            with open(log_path, 'r') as f:
+                existing_data = json.load(f)
+            if not isinstance(existing_data, list):
+                existing_data = [existing_data]
+
+        existing_data.append(metrics)
+
+        with open(log_path, 'w') as f:
+            json.dump(existing_data, f, indent=2)
+
+        logger.info(f"Logged preprocessing metrics to {log_path}")
+    except Exception as e:
+        logger.error(f"Error logging preprocessing metrics: {str(e)}")
+
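The new `log_preprocessing_metrics` helper appends each run's record to a single JSON array on disk, normalizing a lone object into a list before appending. A minimal standalone sketch of that append-or-create pattern (the `append_metrics` name and the temporary path are illustrative, not part of the patch):

```python
import json
import os
import tempfile
import time

def append_metrics(metrics, log_path):
    """Append one metrics record to a JSON array file, creating it if needed."""
    records = []
    if os.path.exists(log_path):
        with open(log_path) as f:
            data = json.load(f)
        # Normalize a single-object file into a list, as the patch does
        records = data if isinstance(data, list) else [data]
    metrics["timestamp"] = time.strftime("%Y-%m-%d %H:%M:%S")
    records.append(metrics)
    os.makedirs(os.path.dirname(log_path) or ".", exist_ok=True)
    with open(log_path, "w") as f:
        json.dump(records, f, indent=2)
    return records

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "logs", "metrics.json")
    append_metrics({"file": "a.png", "preprocessing_applied": ["grayscale"]}, path)
    out = append_metrics({"file": "b.png", "preprocessing_applied": []}, path)
```

Note that read-modify-write of one growing JSON file is simple but not safe under concurrent Streamlit sessions; an append-only JSON-lines file would avoid that.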
+def get_document_config(document_type, global_config):
+    """
+    Get document-specific preprocessing configuration by merging with global settings.
+
+    Args:
+        document_type: The type of document (e.g., 'standard', 'newspaper', 'handwritten')
+        global_config: The global preprocessing configuration
+
+    Returns:
+        A merged configuration dictionary with document-specific overrides
+    """
+    # Start with a copy of the global config
+    config = {
+        "deskew": global_config.get("deskew", {}),
+        "thresholding": global_config.get("thresholding", {}),
+        "morphology": global_config.get("morphology", {}),
+        "performance": global_config.get("performance", {}),
+        "logging": global_config.get("logging", {})
+    }
+
+    # Apply document-specific overrides if they exist
+    doc_types = global_config.get("document_types", {})
+    if document_type in doc_types:
+        doc_config = doc_types[document_type]
+
+        # Merge document-specific settings into the config
+        for section in doc_config:
+            if section in config:
+                config[section].update(doc_config[section])
+
+    return config
+
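`get_document_config` layers per-document-type overrides on top of the global sections with a shallow `dict.update`. A sketch of that merge behavior (the sample config values are illustrative; unlike the patch, this sketch copies each section first so the global config is left untouched, which is a safer variant of the same idea):

```python
global_config = {
    "thresholding": {"method": "adaptive", "adaptive_block_size": 11},
    "deskew": {"enabled": True, "max_angle": 45.0},
    "document_types": {
        "newspaper": {"thresholding": {"adaptive_block_size": 25}},
    },
}

def merge_for(document_type, global_config):
    # Copy the global sections, then shallow-update with type-specific overrides
    config = {k: dict(v) for k, v in global_config.items() if k != "document_types"}
    overrides = global_config.get("document_types", {}).get(document_type, {})
    for section, values in overrides.items():
        if section in config:
            config[section].update(values)
    return config

merged = merge_for("newspaper", global_config)
```

Because the update is shallow, an override replaces individual keys inside a section rather than the whole section.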
+def deskew_image(img_array, config):
+    """
+    Detect and correct skew in document images.
+
+    Uses a combination of methods (minAreaRect and/or Hough transform)
+    to estimate the skew angle more robustly.
+
+    Args:
+        img_array: Input image as numpy array
+        config: Deskew configuration dict
+
+    Returns:
+        Deskewed image as numpy array, estimated angle, success flag
+    """
+    if not config.get("enabled", False):
+        return img_array, 0.0, True
+
+    # Convert to grayscale if needed
+    gray = img_array if len(img_array.shape) == 2 else cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
+
+    # Start with a threshold to get binary image for angle detection
+    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
+
+    angles = []
+    angle_threshold = config.get("angle_threshold", 0.1)
+    max_angle = config.get("max_angle", 45.0)
+
+    # Method 1: minAreaRect approach
+    try:
+        # Find all contours
+        contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
+
+        # Filter contours by area to avoid noise
+        min_area = binary.shape[0] * binary.shape[1] * 0.0001  # 0.01% of image area
+        filtered_contours = [cnt for cnt in contours if cv2.contourArea(cnt) > min_area]
+
+        # Get angles from rotated rectangles around contours
+        for contour in filtered_contours:
+            rect = cv2.minAreaRect(contour)
+            width, height = rect[1]
+
+            # Calculate the angle based on the longer side
+            # (This is important for getting the orientation right)
+            angle = rect[2]
+            if width < height:
+                angle += 90
+
+            # Normalize angle to -45 to 45 range
+            if angle > 45:
+                angle -= 90
+            if angle < -45:
+                angle += 90
+
+            # Clamp angle to max limit
+            angle = max(min(angle, max_angle), -max_angle)
+            angles.append(angle)
+    except Exception as e:
+        logger.error(f"Error in minAreaRect skew detection: {str(e)}")
+
+    # Method 2: Hough Transform approach (if enabled)
+    if config.get("use_hough", True):
+        try:
+            # Apply Canny edge detection
+            edges = cv2.Canny(gray, 50, 150, apertureSize=3)
+
+            # Apply Hough lines
+            lines = cv2.HoughLinesP(edges, 1, np.pi/180,
+                                    threshold=100, minLineLength=100, maxLineGap=10)
+
+            if lines is not None:
+                for line in lines:
+                    x1, y1, x2, y2 = line[0]
+                    if x2 - x1 != 0:  # Avoid division by zero
+                        # Calculate line angle in degrees
+                        angle = math.atan2(y2 - y1, x2 - x1) * 180.0 / np.pi
+
+                        # Normalize angle to -45 to 45 range
+                        if angle > 45:
+                            angle -= 90
+                        if angle < -45:
+                            angle += 90
+
+                        # Clamp angle to max limit
+                        angle = max(min(angle, max_angle), -max_angle)
+                        angles.append(angle)
+        except Exception as e:
+            logger.error(f"Error in Hough transform skew detection: {str(e)}")
+
+    # If no angles were detected, return original image
+    if not angles:
+        logger.warning("No skew angles detected, using original image")
+        return img_array, 0.0, False
+
+    # Combine angles using the specified consensus method
+    consensus_method = config.get("consensus_method", "average")
+    if consensus_method == "average":
+        final_angle = sum(angles) / len(angles)
+    elif consensus_method == "median":
+        final_angle = sorted(angles)[len(angles) // 2]
+    elif consensus_method == "min":
+        final_angle = min(angles, key=abs)
+    elif consensus_method == "max":
+        final_angle = max(angles, key=abs)
+    else:
+        final_angle = sum(angles) / len(angles)  # Default to average
+
+    # If angle is below threshold, don't rotate
+    if abs(final_angle) < angle_threshold:
+        logger.info(f"Detected angle ({final_angle:.2f}°) is below threshold, skipping deskew")
+        return img_array, final_angle, True
+
+    # Log the detected angle
+    logger.info(f"Deskewing image with angle: {final_angle:.2f}°")
+
+    # Get image dimensions
+    h, w = img_array.shape[:2]
+    center = (w // 2, h // 2)
+
+    # Get rotation matrix
+    rotation_matrix = cv2.getRotationMatrix2D(center, final_angle, 1.0)
+
+    # Calculate new image dimensions
+    abs_cos = abs(rotation_matrix[0, 0])
+    abs_sin = abs(rotation_matrix[0, 1])
+    new_w = int(h * abs_sin + w * abs_cos)
+    new_h = int(h * abs_cos + w * abs_sin)
+
+    # Adjust the rotation matrix to account for new dimensions
+    rotation_matrix[0, 2] += (new_w / 2) - center[0]
+    rotation_matrix[1, 2] += (new_h / 2) - center[1]
+
+    # Perform the rotation
+    try:
+        # Determine the number of channels to create the correct output array
+        if len(img_array.shape) == 3:
+            rotated = cv2.warpAffine(img_array, rotation_matrix, (new_w, new_h),
+                                     flags=cv2.INTER_LINEAR, borderMode=cv2.BORDER_CONSTANT,
+                                     borderValue=(255, 255, 255))
+        else:
+            rotated = cv2.warpAffine(img_array, rotation_matrix, (new_w, new_h),
+                                     flags=cv2.INTER_LINEAR, borderMode=cv2.BORDER_CONSTANT,
+                                     borderValue=255)
+        return rotated, final_angle, True
+    except Exception as e:
+        logger.error(f"Error rotating image: {str(e)}")
+        if config.get("fallback", {}).get("enabled", True):
+            logger.info("Using original image as fallback after rotation failure")
+            return img_array, final_angle, False
+        return img_array, final_angle, False
+
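Both detection methods in `deskew_image` fold raw angles into the [-45°, 45°] band before clamping, so a near-vertical rectangle edge and a near-horizontal one vote for the same correction, and the votes are then combined by a consensus rule. The folding and the median consensus, isolated from OpenCV:

```python
def normalize_angle(angle, max_angle=45.0):
    # Fold into the [-45, 45] band, then clamp to the configured maximum
    if angle > 45:
        angle -= 90
    if angle < -45:
        angle += 90
    return max(min(angle, max_angle), -max_angle)

# Raw votes from rectangles/lines at various orientations
angles = [normalize_angle(a) for a in (88.0, -60.0, 3.5)]

# "median" consensus, exactly as the patch computes it
median = sorted(angles)[len(angles) // 2]
```

The median is robust to a few wild votes from noisy contours, which is why it can be preferable to the default average for cluttered pages.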
+def preblur(img_array, config):
+    """
+    Apply pre-filtering blur to stabilize thresholding results.
+
+    Args:
+        img_array: Input image as numpy array
+        config: Pre-blur configuration dict
+
+    Returns:
+        Blurred image as numpy array
+    """
+    if not config.get("enabled", False):
+        return img_array
+
+    method = config.get("method", "gaussian")
+    kernel_size = config.get("kernel_size", 3)
+
+    # Ensure kernel size is odd
+    if kernel_size % 2 == 0:
+        kernel_size += 1
+
+    try:
+        if method == "gaussian":
+            return cv2.GaussianBlur(img_array, (kernel_size, kernel_size), 0)
+        elif method == "median":
+            return cv2.medianBlur(img_array, kernel_size)
+        else:
+            logger.warning(f"Unknown blur method: {method}, using gaussian")
+            return cv2.GaussianBlur(img_array, (kernel_size, kernel_size), 0)
+    except Exception as e:
+        logger.error(f"Error applying {method} blur: {str(e)}")
+        return img_array
+
+def apply_threshold(img_array, config):
+    """
+    Apply thresholding to create binary image.
+
+    Supports Otsu's method and adaptive thresholding.
+    Includes pre-filtering and fallback mechanisms.
+
+    Args:
+        img_array: Input image as numpy array
+        config: Thresholding configuration dict
+
+    Returns:
+        Binary image as numpy array, success flag
+    """
+    method = config.get("method", "adaptive")
+    if method == "none":
+        return img_array, True
+
+    # Convert to grayscale if needed
+    gray = img_array if len(img_array.shape) == 2 else cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
+
+    # Apply pre-blur if configured
+    preblur_config = config.get("preblur", {})
+    if preblur_config.get("enabled", False):
+        gray = preblur(gray, preblur_config)
+
+    binary = None
+    try:
+        if method == "otsu":
+            # Apply Otsu's thresholding
+            _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
+        elif method == "adaptive":
+            # Apply adaptive thresholding
+            block_size = config.get("adaptive_block_size", 11)
+            constant = config.get("adaptive_constant", 2)
+
+            # Ensure block size is odd
+            if block_size % 2 == 0:
+                block_size += 1
+
+            binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
+                                           cv2.THRESH_BINARY, block_size, constant)
+        else:
+            logger.warning(f"Unknown thresholding method: {method}, using adaptive")
+            block_size = config.get("adaptive_block_size", 11)
+            constant = config.get("adaptive_constant", 2)
+
+            # Ensure block size is odd
+            if block_size % 2 == 0:
+                block_size += 1
+
+            binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
+                                           cv2.THRESH_BINARY, block_size, constant)
+    except Exception as e:
+        logger.error(f"Error applying {method} thresholding: {str(e)}")
+        if config.get("fallback", {}).get("enabled", True):
+            logger.info("Using original grayscale image as fallback after thresholding failure")
+            return gray, False
+        return gray, False
+
+    # Calculate percentage of non-zero pixels for logging
+    nonzero_pct = np.count_nonzero(binary) / binary.size * 100
+    logger.info(f"Binary image has {nonzero_pct:.2f}% non-zero pixels")
+
+    # Check if thresholding was successful (crude check)
+    if nonzero_pct < 1 or nonzero_pct > 99:
+        logger.warning(f"Thresholding produced extreme result ({nonzero_pct:.2f}% non-zero)")
+        if config.get("fallback", {}).get("enabled", True):
+            logger.info("Using original grayscale image as fallback after poor thresholding")
+            return gray, False
+
+    return binary, True
+
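`apply_threshold` guards against degenerate binarization: if fewer than 1% or more than 99% of pixels end up non-zero, it falls back to the grayscale input. It also nudges even block sizes up by one, since adaptive thresholding requires an odd neighborhood. Both checks, sketched without OpenCV on plain nested lists:

```python
def threshold_sanity_ok(binary, low=1.0, high=99.0):
    # Percentage of non-zero pixels; extremes mean the threshold wiped the page
    flat = [p for row in binary for p in row]
    nonzero_pct = sum(1 for p in flat if p != 0) / len(flat) * 100
    return low <= nonzero_pct <= high

def odd_block_size(n):
    # Adaptive thresholding needs an odd block size, as the patch enforces
    return n + 1 if n % 2 == 0 else n

ok = threshold_sanity_ok([[0, 255], [255, 0]])                   # 50% non-zero
degenerate = threshold_sanity_ok([[0] * 10 for _ in range(10)])  # 0% non-zero
```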
+def apply_morphology(binary_img, config):
+    """
+    Apply morphological operations to clean up binary image.
+
+    Supports opening, closing, or both operations.
+
+    Args:
+        binary_img: Binary image as numpy array
+        config: Morphology configuration dict
+
+    Returns:
+        Processed binary image as numpy array
+    """
+    if not config.get("enabled", False):
+        return binary_img
+
+    operation = config.get("operation", "close")
+    kernel_size = config.get("kernel_size", 1)
+    kernel_shape = config.get("kernel_shape", "rect")
+
+    # Create appropriate kernel
+    if kernel_shape == "rect":
+        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_size*2+1, kernel_size*2+1))
+    elif kernel_shape == "ellipse":
+        kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size*2+1, kernel_size*2+1))
+    elif kernel_shape == "cross":
+        kernel = cv2.getStructuringElement(cv2.MORPH_CROSS, (kernel_size*2+1, kernel_size*2+1))
+    else:
+        logger.warning(f"Unknown kernel shape: {kernel_shape}, using rect")
+        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_size*2+1, kernel_size*2+1))
+
+    result = binary_img
+    try:
+        if operation == "open":
+            # Opening: Erosion followed by dilation - removes small noise
+            result = cv2.morphologyEx(binary_img, cv2.MORPH_OPEN, kernel)
+        elif operation == "close":
+            # Closing: Dilation followed by erosion - fills small holes
+            result = cv2.morphologyEx(binary_img, cv2.MORPH_CLOSE, kernel)
+        elif operation == "both":
+            # Both operations in sequence
+            result = cv2.morphologyEx(binary_img, cv2.MORPH_OPEN, kernel)
+            result = cv2.morphologyEx(result, cv2.MORPH_CLOSE, kernel)
+        else:
+            logger.warning(f"Unknown morphological operation: {operation}, using close")
+            result = cv2.morphologyEx(binary_img, cv2.MORPH_CLOSE, kernel)
+    except Exception as e:
+        logger.error(f"Error applying morphological operation: {str(e)}")
+        return binary_img
+
+    return result
+
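`apply_morphology` cleans the binary page with opening (removes speckle) and/or closing (fills pinholes), using a structuring element of side `2*kernel_size + 1`, so `kernel_size=1` means a 3x3 kernel. A pure-Python closing on a tiny bitmap shows the hole-filling effect; border handling here clips the window at the edges, which differs slightly from OpenCV's border modes:

```python
def dilate(img, k=1):
    # A pixel becomes 1 if any pixel in its (2k+1)-side neighborhood is 1
    h, w = len(img), len(img[0])
    return [[1 if any(img[ny][nx]
                      for ny in range(max(0, y - k), min(h, y + k + 1))
                      for nx in range(max(0, x - k), min(w, x + k + 1))) else 0
             for x in range(w)] for y in range(h)]

def erode(img, k=1):
    # A pixel stays 1 only if every pixel in its neighborhood is 1
    h, w = len(img), len(img[0])
    return [[1 if all(img[ny][nx]
                      for ny in range(max(0, y - k), min(h, y + k + 1))
                      for nx in range(max(0, x - k), min(w, x + k + 1))) else 0
             for x in range(w)] for y in range(h)]

def closing(img, k=1):
    # Closing = dilation followed by erosion
    return erode(dilate(img, k), k)

# A 1-pixel hole inside a solid 5x5 block gets filled by closing
img = [[1] * 5 for _ in range(5)]
img[2][2] = 0
closed = closing(img)
```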
 @st.cache_data(ttl=24*3600, show_spinner=False)  # Cache for 24 hours
 def convert_pdf_to_images(pdf_bytes, dpi=150, rotation=0):
     """Convert PDF bytes to a list of images with caching"""

@@ ... @@

 @st.cache_data(ttl=24*3600, show_spinner=False, hash_funcs={dict: lambda x: str(sorted(x.items()))})
 def preprocess_image(image_bytes, preprocessing_options):
+    """
+    Conservative preprocessing function for handwritten documents with early exit for clean scans.
+    Implements light processing: grayscale → denoise (gently) → contrast (conservative)
+
+    Args:
+        image_bytes: Image content as bytes
+        preprocessing_options: Dictionary with document_type, grayscale, denoise, contrast options
+
+    Returns:
+        Processed image bytes or original image bytes if no processing needed
+    """
     # Setup basic console logging
     logger = logging.getLogger("image_preprocessor")
     logger.setLevel(logging.INFO)

     # Log which preprocessing options are being applied
+    logger.info(f"Document type: {preprocessing_options.get('document_type', 'standard')}")
+
+    # Check if any preprocessing is actually requested
+    has_preprocessing = (
+        preprocessing_options.get("grayscale", False) or
+        preprocessing_options.get("denoise", False) or
+        preprocessing_options.get("contrast", 0) != 0
+    )

     # Convert bytes to PIL Image
     image = Image.open(io.BytesIO(image_bytes))

+    # Check for minimal skew and exit early if document is already straight
+    # This avoids unnecessary processing for clean scans
+    try:
+        from utils.image_utils import detect_skew
+        skew_angle = detect_skew(image)
+        if abs(skew_angle) < 0.5:
+            logger.info(f"Document has minimal skew ({skew_angle:.2f}°), skipping preprocessing")
+            # Return original image bytes as is for perfectly straight documents
+            if not has_preprocessing:
+                return image_bytes
+    except Exception as e:
+        logger.warning(f"Error in skew detection: {str(e)}, continuing with preprocessing")
+
+    # If no preprocessing options are selected, return the original image
+    if not has_preprocessing:
+        logger.info("No preprocessing options selected, skipping preprocessing")
+        return image_bytes
+
+    # Initialize metrics for logging
+    metrics = {
+        "file": preprocessing_options.get("filename", "unknown"),
+        "document_type": preprocessing_options.get("document_type", "standard"),
+        "preprocessing_applied": []
+    }
+    start_time = time.time()
+
+    # Handle RGBA images (transparency) by converting to RGB
     if image.mode == 'RGBA':
+        # Convert RGBA to RGB by compositing onto white background
+        logger.info("Converting RGBA image to RGB")
         background = Image.new('RGB', image.size, (255, 255, 255))
         background.paste(image, mask=image.split()[3])  # 3 is the alpha channel
         image = background
+        metrics["preprocessing_applied"].append("alpha_conversion")
     elif image.mode not in ('RGB', 'L'):
+        # Convert other modes to RGB
+        logger.info(f"Converting {image.mode} image to RGB")
         image = image.convert('RGB')
+        metrics["preprocessing_applied"].append("format_conversion")

+    # Convert to NumPy array for OpenCV processing
     img_array = np.array(image)

+    # Apply grayscale if requested (useful for handwritten text)
     if preprocessing_options.get("grayscale", False):
         if len(img_array.shape) == 3:  # Only convert if it's not already grayscale
+            # For handwritten documents, apply gentle CLAHE to enhance contrast locally
+            if preprocessing_options.get("document_type") == "handwritten":
                 img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
+                clahe = cv2.createCLAHE(clipLimit=1.5, tileGridSize=(8,8))  # Conservative clip limit
                 img_array = clahe.apply(img_array)
         else:
             # Standard grayscale for printed documents
             img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
+
+        metrics["preprocessing_applied"].append("grayscale")

+    # Apply light denoising if requested
     if preprocessing_options.get("denoise", False):
         try:
+            # Apply very gentle denoising
+            is_color = len(img_array.shape) == 3 and img_array.shape[2] == 3
+            if is_color:
+                # Very light color denoising with conservative parameters
+                img_array = cv2.fastNlMeansDenoisingColored(img_array, None, 2, 2, 3, 7)
             else:
+                # Very light grayscale denoising
+                img_array = cv2.fastNlMeansDenoising(img_array, None, 2, 3, 7)
+
+            metrics["preprocessing_applied"].append("light_denoise")
         except Exception as e:
+            logger.error(f"Denoising error: {str(e)}")
+
+    # Apply contrast adjustment if requested (conservative range)
+    contrast_value = preprocessing_options.get("contrast", 0)
+    if contrast_value != 0:
+        # Use a gentler contrast adjustment factor
+        contrast_factor = 1 + (contrast_value / 200)  # Conservative scaling factor

+        # Convert NumPy array back to PIL Image for contrast adjustment
+        if len(img_array.shape) == 2:  # If grayscale, convert to RGB for PIL
+            image = Image.fromarray(cv2.cvtColor(img_array, cv2.COLOR_GRAY2RGB))
+        else:
+            image = Image.fromarray(img_array)
+
+        enhancer = ImageEnhance.Contrast(image)
+        image = enhancer.enhance(contrast_factor)
+
+        # Convert back to NumPy array
+        img_array = np.array(image)
+        metrics["preprocessing_applied"].append(f"contrast_{contrast_value}")
+
     # Convert back to PIL Image
+    if len(img_array.shape) == 2:  # If grayscale, convert to RGB for saving
+        processed_image = Image.fromarray(cv2.cvtColor(img_array, cv2.COLOR_GRAY2RGB))
+    else:
+        processed_image = Image.fromarray(img_array)
+
+    # Record total processing time
+    metrics["processing_time"] = (time.time() - start_time) * 1000  # ms

     # Higher quality for OCR processing
     byte_io = io.BytesIO()

@@ ... @@

     logger.info(f"Preprocessing complete. Original image mode: {image.mode}, processed mode: {processed_image.mode}")
     logger.info(f"Original size: {len(image_bytes)/1024:.1f}KB, processed size: {len(byte_io.getvalue())/1024:.1f}KB")
+    logger.info(f"Applied preprocessing steps: {', '.join(metrics['preprocessing_applied'])}")

     return byte_io.getvalue()
 except Exception as e:
     logger.error(f"Error saving processed image: {str(e)}")
     # Fallback to original image
     logger.info("Using original image as fallback")
+    return image_bytes
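The contrast step also got gentler: the slider value now maps to a PIL enhancement factor via `1 + value/200` instead of the old `1 + value/150` (which had itself been softened from `1 + value/100`), so the same slider position nudges contrast less. The mapping, assuming a slider range of roughly -100 to 100:

```python
def contrast_factor(value, divisor=200):
    # Slider value -> PIL ImageEnhance.Contrast factor (1.0 = unchanged)
    return 1 + (value / divisor)

new = [contrast_factor(v) for v in (-100, 0, 50, 100)]
old = [contrast_factor(v, divisor=150) for v in (-100, 0, 50, 100)]
```

At full slider deflection the new mapping caps the factor at 1.5x instead of roughly 1.67x, keeping faint handwriting from being blown out.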

 def create_temp_file(content, suffix, temp_file_paths):
     """Create a temporary file and track it for cleanup"""

@@ ... @@

     return temp_path

 def apply_preprocessing_to_file(file_bytes, file_ext, preprocessing_options, temp_file_paths):
+    """
+    Apply conservative preprocessing to file and return path to the temporary file.
+    Handles format conversion and user-selected preprocessing options.
+
+    Args:
+        file_bytes: File content as bytes
+        file_ext: File extension (e.g., '.jpg', '.pdf')
+        preprocessing_options: Dictionary with document_type and preprocessing options
+        temp_file_paths: List to track temporary files for cleanup
+
+    Returns:
+        Tuple of (temp_file_path, was_processed_flag)
+    """
+    document_type = preprocessing_options.get("document_type", "standard")
+
+    # Check for user-selected preprocessing
     has_preprocessing = (
         preprocessing_options.get("grayscale", False) or
         preprocessing_options.get("denoise", False) or
+        preprocessing_options.get("contrast", 0) != 0
     )

+    # Check for RGBA/transparency that needs conversion
+    format_needs_conversion = False
+
+    # Only check formats that might have transparency
+    if file_ext.lower() in ['.png', '.tif', '.tiff']:
+        try:
+            # Check if image has transparency
+            image = Image.open(io.BytesIO(file_bytes))
+            if image.mode == 'RGBA' or image.mode not in ('RGB', 'L'):
+                format_needs_conversion = True
+        except Exception as e:
+            logger.warning(f"Error checking image format: {str(e)}")
+
+    # Process if user requested preprocessing OR format needs conversion
+    needs_processing = has_preprocessing or format_needs_conversion
+
+    if needs_processing:
         # Apply preprocessing
         logger.info(f"Applying preprocessing with options: {preprocessing_options}")
+        logger.info(f"Using document type '{document_type}' with advanced preprocessing options")
+
+        # Add filename to preprocessing options for logging if available
+        if hasattr(file_bytes, 'name'):
+            preprocessing_options["filename"] = file_bytes.name
+
         processed_bytes = preprocess_image(file_bytes, preprocessing_options)

         # Save processed image to temp file
process_file.py
CHANGED

@@ -53,9 +53,7 @@ def process_file(uploaded_file, use_vision=True, processor=None, custom_prompt=N
         "file_size_mb": round(file_size_mb, 2),
         "use_vision": use_vision
     })
-
-    # No longer needed - removing confidence score
-
+
     return result
 except Exception as e:
     return {
@@ -65,4 +63,4 @@ def process_file(uploaded_file, use_vision=True, processor=None, custom_prompt=N
 finally:
     # Clean up the temporary file
     if os.path.exists(temp_path):
-        os.unlink(temp_path)
+        os.unlink(temp_path)
requirements.txt
CHANGED

@@ -10,6 +10,7 @@ Pillow>=10.0.0
 opencv-python-headless>=4.8.0.74
 pdf2image>=1.16.0
 pytesseract>=0.3.10  # For local OCR fallback
+matplotlib>=3.7.0  # For visualization in preprocessing tests

 # Data handling and utilities
 numpy>=1.24.0
structured_ocr.py
CHANGED

@@ -47,28 +47,38 @@ except ImportError:
 
 # Import utilities for OCR processing
 try:
-    from …
 except ImportError:
-    # Define fallback functions if module not found
     def replace_images_in_markdown(markdown_str, images_dict):
-        …
         return markdown_str
 
     def get_combined_markdown(ocr_response):
         markdowns = []
         for page in ocr_response.pages:
             image_data = {}
-            …
         return "\n\n".join(markdowns)
 
 # Import config directly (now local to historical-ocr)
 try:
-    from config import MISTRAL_API_KEY, OCR_MODEL, TEXT_MODEL, VISION_MODEL, TEST_MODE
 except ImportError:
     # Fallback defaults if config is not available
     import os

@@ -77,6 +87,14 @@ except ImportError:
     TEXT_MODEL = "mistral-large-latest"
     VISION_MODEL = "mistral-large-latest"
     TEST_MODE = True
     logging.warning("Config module not found. Using environment variables and defaults.")
 
 # Helper function to make OCR objects JSON serializable

@@ -127,6 +145,13 @@ def serialize_ocr_response(obj):
             is_valid_image = False
             logging.warning("Markdown image reference detected")
 
         # Case 3: Needs detailed text content detection
         else:
             # Use the same proven approach as in our tests

@@ -185,9 +210,27 @@ def serialize_ocr_response(obj):
                     'image_base64': image_base64
                 }
             else:
-                # Process as text if validation fails
                 if image_base64 and isinstance(image_base64, str):
-                    …
             else:
                 result[key] = str(value)
         # Handle collections

@@ -382,13 +425,47 @@ class StructuredOCR:
         result = serialize_ocr_response(result)
 
         # Make a final pass to check for any remaining non-serializable objects
-        # …
-        …
     except TypeError as e:
-        # If there's a serialization error, run the whole result through our serializer
         logger = logging.getLogger("serializer")
         logger.warning(f"JSON serialization error in result: {str(e)}. Applying full serialization.")
-        …
 
     return result

@@ -1104,9 +1181,10 @@ class StructuredOCR:
 
         # Use enhanced preprocessing functions from ocr_utils
         try:
-            from …
 
-            logger.info(f"Applying …
 
             # Get preprocessing settings from config
             max_size_mb = IMAGE_PREPROCESSING.get("max_size_mb", 8.0)

@@ -1114,8 +1192,14 @@ class StructuredOCR:
             if file_size_mb > max_size_mb:
                 logger.info(f"Image is large ({file_size_mb:.2f} MB), optimizing for API submission")
 
-                # …
-                …
 
                 logger.info(f"Image preprocessing completed successfully")

@@ -1169,7 +1253,7 @@ class StructuredOCR:
         except ImportError:
             logger.warning("PIL not available for resizing. Using original image.")
             # Use enhanced encoder with proper MIME type detection
-            from …
             base64_data_url = encode_image_for_api(file_path)
     except Exception as e:
         logger.warning(f"Image resize failed: {str(e)}. Using original image.")

@@ -1178,7 +1262,7 @@ class StructuredOCR:
             base64_data_url = encode_image_for_api(file_path)
         else:
             # For smaller images, use as-is with proper MIME type
-            from …
             base64_data_url = encode_image_for_api(file_path)
     except Exception as e:
         # Fallback to original image if any preprocessing fails

@@ -1243,7 +1327,7 @@ class StructuredOCR:
         logger.error("Maximum retries reached, rate limit error persists.")
         try:
             # Try to import the local OCR fallback function
-            from …
 
             # Attempt local OCR fallback
             ocr_text = try_local_ocr_fallback(file_path, base64_data_url)

@@ -1455,7 +1539,14 @@ class StructuredOCR:
         logger.info("Sufficient OCR text detected, analyzing language before using OCR text directly")
 
         # Perform language detection on the OCR text before returning
-        …
 
         return {
             "file_name": filename,

@@ -1629,7 +1720,12 @@ class StructuredOCR:
 
         # If OCR text has clear French patterns but language is English or missing, fix it
         if ocr_markdown and 'languages' in result:
-            …
 
     except Exception as e:
         # Fall back to text-only model if vision model fails

@@ -1639,22 +1735,25 @@ class StructuredOCR:
         return result
 
     # We've removed document type detection entirely for simplicity
 
     # Create a prompt with enhanced language detection instructions
     generic_section = (
         f"You are an OCR specialist processing historical documents. "
-        f"Focus on accurately extracting text content while preserving structure and formatting. "
         f"Pay attention to any historical features and document characteristics.\n\n"
-        f"IMPORTANT: Accurately identify the document's language(s). Look for language-specific characters, words, and phrases. "
-        f"Specifically check for French (accents like é, è, ç, words like 'le', 'la', 'et', 'est'), German (umlauts, words like 'und', 'der', 'das'), "
-        f"Latin, and other non-English languages. Carefully analyze the text before determining language.\n\n"
         f"Create a structured JSON response with the following fields:\n"
         f"- file_name: The document's name\n"
         f"- topics: An array of topics covered in the document\n"
         f"- languages: An array of languages used in the document (be precise and specific about language detection)\n"
         f"- ocr_contents: A comprehensive dictionary with the document's contents including:\n"
-        f"  * title: The …
-        f"  * …
         f"  * raw_text: The complete OCR text\n"
     )

@@ -1665,86 +1764,7 @@ class StructuredOCR:
 
     # Return the enhanced prompt
     return generic_section + custom_section
-
-    def _detect_text_language(self, text, current_languages=None):
-        """
-        Detect language from text content using the external language detector
-        or falling back to internal detection if needed
-
-        Args:
-            text: The text to analyze
-            current_languages: Optional list of languages already detected
-
-        Returns:
-            List of detected languages
-        """
-        logger = logging.getLogger("language_detector")
-
-        # If no text provided, return current languages or default
-        if not text or len(text.strip()) < 10:
-            return current_languages if current_languages else ["English"]
-
-        # Use the external language detector if available
-        if LANG_DETECTOR_AVAILABLE and self.language_detector:
-            logger.info("Using external language detector")
-            return self.language_detector.detect_languages(text,
-                                                           filename=getattr(self, 'current_filename', None),
-                                                           current_languages=current_languages)
-
-        # Fallback for when the external module is not available
-        logger.info("Language detector not available, using simple detection")
-
-        # Get all words from text (lowercase for comparison)
-        text_lower = text.lower()
-        words = text_lower.split()
-
-        # Basic language markers - equal treatment of all languages
-        language_indicators = {
-            "French": {
-                "chars": ['é', 'è', 'ê', 'à', 'ç', 'ù', 'â', 'î', 'ô', 'û'],
-                "words": ['le', 'la', 'les', 'et', 'en', 'de', 'du', 'des', 'dans', 'ce', 'cette']
-            },
-            "Spanish": {
-                "chars": ['ñ', 'á', 'é', 'í', 'ó', 'ú', '¿', '¡'],
-                "words": ['el', 'la', 'los', 'las', 'y', 'en', 'por', 'que', 'con', 'del']
-            },
-            "German": {
-                "chars": ['ä', 'ö', 'ü', 'ß'],
-                "words": ['der', 'die', 'das', 'und', 'ist', 'von', 'mit', 'für', 'sich']
-            },
-            "Latin": {
-                "chars": [],
-                "words": ['et', 'in', 'ad', 'est', 'sunt', 'non', 'cum', 'sed', 'qui', 'quod']
-            }
-        }
-
-        detected_languages = []
-
-        # Simple detection logic - check for language markers
-        for language, indicators in language_indicators.items():
-            has_chars = any(char in text_lower for char in indicators["chars"])
-            has_words = any(word in words for word in indicators["words"])
-
-            if has_chars and has_words:
-                detected_languages.append(language)
-
-        # Check for English
-        english_words = ['the', 'and', 'of', 'to', 'in', 'a', 'is', 'that', 'for', 'it']
-        if sum(1 for word in words if word in english_words) >= 2:
-            detected_languages.append("English")
-
-        # If no languages detected, default to English
-        if not detected_languages:
-            detected_languages = ["English"]
-
-        # Limit to top 2 languages
-        detected_languages = detected_languages[:2]
-
-        # Log what we found
-        logger.info(f"Simple fallback language detection results: {detected_languages}")
-
-        return detected_languages
-
     def _extract_structured_data_text_only(self, ocr_markdown, filename, custom_prompt=None):
         """
         Extract structured data using text-only model with detailed historical context prompting
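The removed `_detect_text_language` fallback above is self-contained enough to exercise on its own. A trimmed, runnable sketch with abbreviated indicator lists (the full version also covered Spanish and Latin):

```python
def detect_languages_simple(text):
    """Trimmed version of the removed heuristic: a language is reported when
    both a characteristic character and a common stopword are present."""
    text_lower = text.lower()
    words = text_lower.split()
    indicators = {
        "French": (['é', 'è', 'ç', 'à'], ['le', 'la', 'les', 'et', 'des']),
        "German": (['ä', 'ö', 'ü', 'ß'], ['der', 'die', 'das', 'und']),
    }
    detected = [
        lang for lang, (chars, stopwords) in indicators.items()
        if any(c in text_lower for c in chars) and any(w in words for w in stopwords)
    ]
    # English has no marker characters, so count stopword hits instead
    english_words = ['the', 'and', 'of', 'to', 'in', 'is']
    if sum(1 for w in words if w in english_words) >= 2:
        detected.append("English")
    # Limit to top 2 languages, defaulting to English
    return detected[:2] or ["English"]
```

The commit replaces this heuristic with calls to the external `language_detector` module.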
 
 # Import utilities for OCR processing
 try:
+    from utils.image_utils import replace_images_in_markdown, get_combined_markdown
 except ImportError:
+    # Define minimal fallback functions if module not found
+    logger.warning("Could not import utils.image_utils - using minimal fallback functions")
+
     def replace_images_in_markdown(markdown_str, images_dict):
+        """Minimal fallback implementation of replace_images_in_markdown"""
+        import re
+        for img_id, base64_str in images_dict.items():
+            # Match alt text OR link part, ignore extension
+            base_id = img_id.split('.')[0]
+            pattern = re.compile(rf"!\[[^\]]*{base_id}[^\]]*\]\([^\)]+\)")
+            markdown_str = pattern.sub(f"", markdown_str)
         return markdown_str
 
     def get_combined_markdown(ocr_response):
+        """Minimal fallback implementation of get_combined_markdown"""
         markdowns = []
         for page in ocr_response.pages:
             image_data = {}
+            if hasattr(page, "images"):
+                for img in page.images:
+                    if hasattr(img, "id") and hasattr(img, "image_base64"):
+                        image_data[img.id] = img.image_base64
+            page_markdown = page.markdown if hasattr(page, "markdown") else ""
+            processed_markdown = replace_images_in_markdown(page_markdown, image_data)
+            markdowns.append(processed_markdown)
         return "\n\n".join(markdowns)
 
 # Import config directly (now local to historical-ocr)
 try:
+    from config import MISTRAL_API_KEY, OCR_MODEL, TEXT_MODEL, VISION_MODEL, TEST_MODE, IMAGE_PREPROCESSING
 except ImportError:
     # Fallback defaults if config is not available
     import os

     TEXT_MODEL = "mistral-large-latest"
     VISION_MODEL = "mistral-large-latest"
     TEST_MODE = True
+    # Default image preprocessing settings if config not available
+    IMAGE_PREPROCESSING = {
+        "max_size_mb": 8.0,
+        # Add basic defaults for preprocessing
+        "enhance_contrast": 1.2,
+        "denoise": True,
+        "compression_quality": 95
+    }
     logging.warning("Config module not found. Using environment variables and defaults.")
 
 # Helper function to make OCR objects JSON serializable

             is_valid_image = False
             logging.warning("Markdown image reference detected")
 
+            # Extract the image ID for logging
+            try:
+                img_id = image_base64.split('![')[1].split('](')[0]
+                logging.debug(f"Markdown reference for image: {img_id}")
+            except:
+                img_id = "unknown"
+
         # Case 3: Needs detailed text content detection
         else:
             # Use the same proven approach as in our tests

                     'image_base64': image_base64
                 }
             else:
+                # Process as text if validation fails, but properly handle markdown references
                 if image_base64 and isinstance(image_base64, str):
+                    # Special handling for markdown image references
+                    if image_base64.startswith('!['):
+                        # Extract the image description (alt text) if available
+                        try:
+                            # Parse the alt text from the markdown reference
+                            alt_text = image_base64.split('![')[1].split('](')[0]
+                            # Use the alt text or a placeholder if it's just the image name
+                            if alt_text and not alt_text.endswith('.jpeg') and not alt_text.endswith('.jpg'):
+                                result[key] = f"[Image: {alt_text}]"
+                            else:
+                                # Just note that there's an image without the reference
+                                result[key] = "[Image]"
+                            logging.info(f"Converted markdown reference to text placeholder: {result[key]}")
+                        except:
+                            # Fallback for parsing errors
+                            result[key] = "[Image]"
+                    else:
+                        # Regular text content
+                        result[key] = image_base64
             else:
                 result[key] = str(value)
         # Handle collections

         result = serialize_ocr_response(result)
 
         # Make a final pass to check for any remaining non-serializable objects
+        # Proactively check for OCRImageObject instances to avoid serialization warnings
+        def has_ocr_image_objects(obj):
+            """Check if object contains any OCRImageObject instances recursively"""
+            if isinstance(obj, dict):
+                return any(has_ocr_image_objects(v) for v in obj.values())
+            elif isinstance(obj, list):
+                return any(has_ocr_image_objects(item) for item in obj)
+            else:
+                return 'OCRImageObject' in str(type(obj))
+
+        # Apply serialization preemptively if OCRImageObjects are detected
+        if has_ocr_image_objects(result):
+            # Quietly apply full serialization before any errors occur
+            result = serialize_ocr_response(result)
+        else:
+            # Test JSON serialization to catch any other issues
+            json.dumps(result)
     except TypeError as e:
+        # If there's still a serialization error, run the whole result through our serializer
         logger = logging.getLogger("serializer")
         logger.warning(f"JSON serialization error in result: {str(e)}. Applying full serialization.")
+        # Use a more robust approach to ensure complete serialization
+        try:
+            # First attempt with our custom serializer
+            result = serialize_ocr_response(result)
+            # Test if it's fully serializable now
+            json.dumps(result)
+        except Exception as inner_e:
+            # If still not serializable, convert to a simpler format
+            logger.warning(f"Secondary serialization error: {str(inner_e)}. Converting to basic format.")
+            # Create a simplified result with just the essential information
+            simplified_result = {
+                "file_name": result.get("file_name", "unknown"),
+                "topics": result.get("topics", ["Document"]),
+                "languages": [str(lang) for lang in result.get("languages", ["English"]) if lang is not None],
+                "ocr_contents": {
+                    "raw_text": result.get("ocr_contents", {}).get("raw_text", "Text extraction failed due to serialization error")
+                },
+                "serialization_error": f"Original result could not be fully serialized: {str(e)}"
+            }
+            result = simplified_result
 
     return result

 
         # Use enhanced preprocessing functions from ocr_utils
         try:
+            from preprocessing import preprocess_image
+            from utils.file_utils import get_base64_from_bytes
 
+            logger.info(f"Applying image preprocessing for OCR")
 
             # Get preprocessing settings from config
             max_size_mb = IMAGE_PREPROCESSING.get("max_size_mb", 8.0)

             if file_size_mb > max_size_mb:
                 logger.info(f"Image is large ({file_size_mb:.2f} MB), optimizing for API submission")
 
+                # Handwritten docs default to the conservative pipeline
+                base64_data_url = get_base64_from_bytes(
+                    preprocess_image(file_path.read_bytes(),
+                                     {"document_type": "handwritten",
+                                      "grayscale": True,
+                                      "denoise": True,
+                                      "contrast": 0})
+                )
 
                 logger.info(f"Image preprocessing completed successfully")

         except ImportError:
             logger.warning("PIL not available for resizing. Using original image.")
             # Use enhanced encoder with proper MIME type detection
+            from utils.image_utils import encode_image_for_api
             base64_data_url = encode_image_for_api(file_path)
     except Exception as e:
         logger.warning(f"Image resize failed: {str(e)}. Using original image.")

             base64_data_url = encode_image_for_api(file_path)
         else:
             # For smaller images, use as-is with proper MIME type
+            from utils.image_utils import encode_image_for_api
             base64_data_url = encode_image_for_api(file_path)
     except Exception as e:
         # Fallback to original image if any preprocessing fails

         logger.error("Maximum retries reached, rate limit error persists.")
         try:
             # Try to import the local OCR fallback function
+            from utils.image_utils import try_local_ocr_fallback
 
             # Attempt local OCR fallback
             ocr_text = try_local_ocr_fallback(file_path, base64_data_url)

             logger.info("Sufficient OCR text detected, analyzing language before using OCR text directly")
 
             # Perform language detection on the OCR text before returning
+            if LANG_DETECTOR_AVAILABLE and self.language_detector:
+                detected_languages = self.language_detector.detect_languages(
+                    ocr_markdown,
+                    filename=getattr(self, 'current_filename', None)
+                )
+            else:
+                # If language detector is not available, use default English
+                detected_languages = ["English"]
 
             return {
                 "file_name": filename,

 
         # If OCR text has clear French patterns but language is English or missing, fix it
         if ocr_markdown and 'languages' in result:
+            if LANG_DETECTOR_AVAILABLE and self.language_detector:
+                result['languages'] = self.language_detector.detect_languages(
+                    ocr_markdown,
+                    filename=getattr(self, 'current_filename', None),
+                    current_languages=result['languages']
+                )
 
     except Exception as e:
         # Fall back to text-only model if vision model fails

         return result
 
     # We've removed document type detection entirely for simplicity
+
 
     # Create a prompt with enhanced language detection instructions
     generic_section = (
         f"You are an OCR specialist processing historical documents. "
+        f"Focus on accurately extracting text content and image chunks while preserving structure and formatting. "
         f"Pay attention to any historical features and document characteristics.\n\n"
         f"Create a structured JSON response with the following fields:\n"
         f"- file_name: The document's name\n"
         f"- topics: An array of topics covered in the document\n"
         f"- languages: An array of languages used in the document (be precise and specific about language detection)\n"
         f"- ocr_contents: A comprehensive dictionary with the document's contents including:\n"
+        f"  * title: The title or heading (if present)\n"
+        f"  * transcript: The full text of the document\n"
+        f"  * text: The main text content (if different from transcript)\n"
+        f"  * content: The body content (if different than transcript)\n"
+        f"  * images: An array of image objects with their base64 data\n"
+        f"  * alt_text: The alt text or description of the images\n"
+        f"  * caption: The caption or title of the images\n"
         f"  * raw_text: The complete OCR text\n"
     )

 
     # Return the enhanced prompt
     return generic_section + custom_section
 
     def _extract_structured_data_text_only(self, ocr_markdown, filename, custom_prompt=None):
         """
         Extract structured data using text-only model with detailed historical context prompting
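The markdown-reference branch added to `serialize_ocr_response` can be isolated as a pure function. A minimal sketch of that placeholder conversion (the standalone function name is illustrative; in the commit the logic is inline):

```python
def markdown_ref_to_placeholder(value):
    """Collapse a markdown image reference like '![alt](uri)' into a short
    text placeholder, mirroring the serializer branch above."""
    if not (isinstance(value, str) and value.startswith('![')):
        return value  # regular text content passes through unchanged
    try:
        alt_text = value.split('![')[1].split('](')[0]
        # Use the alt text unless it is just the image file name
        if alt_text and not alt_text.endswith(('.jpeg', '.jpg')):
            return f"[Image: {alt_text}]"
        return "[Image]"
    except IndexError:
        # Fallback for parsing errors
        return "[Image]"
```

This keeps the serialized JSON free of raw markdown image syntax while preserving any human-readable alt text.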
test_magician.py → testing/test_magician.py
RENAMED

File without changes
ui_components.py
CHANGED

@@ -3,9 +3,21 @@ import os
 import io
 import base64
 import logging
 from datetime import datetime
 from pathlib import Path
 import json
+…
 from constants import (
     DOCUMENT_TYPES,
     DOCUMENT_LAYOUTS,

@@ -19,7 +31,16 @@ from constants import (
     PREPROCESSING_DOC_TYPES,
     ROTATION_OPTIONS
 )
-from utils import …
+…
 
 class ProgressReporter:
     """Class to handle progress reporting in the UI"""

@@ -69,12 +90,10 @@ def create_sidebar_options():
 
     # Create a container for the sidebar options
     with st.container():
-        # …
-
         use_vision = st.toggle("Use Vision Model", value=True, help="Use vision model for better understanding of document structure")
 
         # Document type selection
-        st.markdown("### Document Type")
         doc_type = st.selectbox("Document Type", DOCUMENT_TYPES,
                                 help="Select the type of document you're processing for better results")

@@ -91,8 +110,8 @@ def create_sidebar_options():
 
         # Custom prompt
         custom_prompt = ""
-
-
+…
         prompt_template = CUSTOM_PROMPT_TEMPLATES.get(doc_type, "")
 
         # Add layout information if not standard

@@ -103,53 +122,37 @@ def create_sidebar_options():
 
         # Set the custom prompt
         custom_prompt = prompt_template
-
-        # Allow user to edit the prompt
-        st.markdown("**Custom Processing Instructions**")
-        custom_prompt = st.text_area("", value=custom_prompt,
-                                     help="Customize the instructions for processing this document",
-                                     height=80)
 
-        …
-
-        # Add image segmentation option
-        st.markdown("### Advanced Options")
-        use_segmentation = st.toggle("Enable Image Segmentation",
-                                     value=False,
-                                     help="Segment the image into text and image regions for better OCR results on complex documents")
-
-        # Show explanation if segmentation is enabled
-        if use_segmentation:
-            st.info("Image segmentation identifies distinct text regions in complex documents, improving OCR accuracy. This is especially helpful for documents with mixed content like the Magician illustration.")
 
         # Create preprocessing options dictionary
         # Set document_type based on selection in UI

@@ -169,17 +172,17 @@ def create_sidebar_options():
         "rotation": rotation
     }
 
-    # PDF-specific options
-    …
 
     # Create options dictionary
     options = {

@@ -219,471 +222,6 @@ def create_file_uploader():
     )
     return uploaded_file
 
-# Function removed - now using inline implementation in app.py
-def _unused_display_preprocessing_preview(uploaded_file, preprocessing_options):
-    """Display a preview of image with preprocessing options applied"""
-    if (any(preprocessing_options.values()) and
-        uploaded_file.type.startswith('image/')):
-
-        st.markdown("**Preprocessed Preview**")
-        try:
-            # Create a container for the preview
-            with st.container():
-                processed_bytes = preprocess_image(uploaded_file.getvalue(), preprocessing_options)
-                # Convert image to base64 and display as HTML to avoid fullscreen button
-                img_data = base64.b64encode(processed_bytes).decode()
-                img_html = f'<img src="data:image/jpeg;base64,{img_data}" style="width:100%; border-radius:4px;">'
-                st.markdown(img_html, unsafe_allow_html=True)
-
-                # Show preprocessing metadata in a well-formatted caption
-                meta_items = []
-                if preprocessing_options.get("document_type", "standard") != "standard":
-                    meta_items.append(f"Document type ({preprocessing_options['document_type']})")
-                if preprocessing_options.get("grayscale", False):
-                    meta_items.append("Grayscale")
-                if preprocessing_options.get("denoise", False):
-                    meta_items.append("Denoise")
-                if preprocessing_options.get("contrast", 0) != 0:
-                    meta_items.append(f"Contrast ({preprocessing_options['contrast']})")
-                if preprocessing_options.get("rotation", 0) != 0:
-                    meta_items.append(f"Rotation ({preprocessing_options['rotation']}°)")
-
-                # Only show "Applied:" if there are actual preprocessing steps
-                if meta_items:
-                    meta_text = "Applied: " + ", ".join(meta_items)
-                    st.caption(meta_text)
-        except Exception as e:
-            st.error(f"Error in preprocessing: {str(e)}")
-            st.info("Try using grayscale preprocessing for PNG images with transparency")
-
-def display_results(result, container, custom_prompt=""):
-    """Display OCR results in the provided container"""
-    with container:
-        # Add heading for document metadata
-        st.markdown("### Document Metadata")
-
-        # Create a compact metadata section
-        meta_html = '<div style="display: flex; flex-wrap: wrap; gap: 0.3rem; margin-bottom: 0.3rem;">'
-
-        # Document type
-        if 'detected_document_type' in result:
-            meta_html += f'<div><strong>Type:</strong> {result["detected_document_type"]}</div>'
-
-        # Processing time
-        if 'processing_time' in result:
-            meta_html += f'<div><strong>Time:</strong> {result["processing_time"]:.1f}s</div>'
-
-        # Page information
-        if 'limited_pages' in result:
-            meta_html += f'<div><strong>Pages:</strong> {result["limited_pages"]["processed"]}/{result["limited_pages"]["total"]}</div>'
-
-        meta_html += '</div>'
-        st.markdown(meta_html, unsafe_allow_html=True)
-
-        # Language metadata on a separate line, Subject Tags below
-
-        # First show languages if available
-        if 'languages' in result and result['languages']:
-            languages = [lang for lang in result['languages'] if lang is not None]
-            if languages:
-                # Create a dedicated line for Languages
-                lang_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
-                lang_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Language:</div>'
-
-                # Add language tags
-                for lang in languages:
-                    # Clean language name if needed
-                    clean_lang = str(lang).strip()
-                    if clean_lang:  # Only add if not empty
-                        lang_html += f'<span class="subject-tag tag-language">{clean_lang}</span>'
-
-                lang_html += '</div>'
-                st.markdown(lang_html, unsafe_allow_html=True)
-
-        # Create a separate line for Time if we have time-related tags
-        if 'topics' in result and result['topics']:
-            time_tags = [topic for topic in result['topics']
-                         if any(term in topic.lower() for term in ["century", "pre-", "era", "historical"])]
-            if time_tags:
-                time_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
-                time_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Time:</div>'
-                for tag in time_tags:
-                    time_html += f'<span class="subject-tag tag-time-period">{tag}</span>'
-                time_html += '</div>'
-                st.markdown(time_html, unsafe_allow_html=True)
-
-        # Then display remaining subject tags if available
-        if 'topics' in result and result['topics']:
-            # Filter out time-related tags which are already displayed
-            subject_tags = [topic for topic in result['topics']
-                            if not any(term in topic.lower() for term in ["century", "pre-", "era", "historical"])]
-
-            if subject_tags:
-                # Create a separate line for Subject Tags
-                tags_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
-                tags_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Subject Tags:</div>'
-                tags_html += '<div style="display: flex; flex-wrap: wrap; gap: 2px; align-items: center;">'
-
-                # Generate a badge for each remaining tag
-                for topic in subject_tags:
-                    # Determine tag category class
-                    tag_class = "subject-tag"  # Default class
-
-                    # Add specialized class based on category
-                    if any(term in topic.lower() for term in ["language", "english", "french", "german", "latin"]):
-                        tag_class += " tag-language"  # Languages
-                    elif any(term in topic.lower() for term in ["letter", "newspaper", "book", "form", "document", "recipe"]):
-                        tag_class += " tag-document-type"  # Document types
-                    elif any(term in topic.lower() for term in ["travel", "military", "science", "medicine", "education", "art", "literature"]):
-                        tag_class += " tag-subject"  # Subject domains
-
-                    # Add each tag as an inline span
-                    tags_html += f'<span class="{tag_class}">{topic}</span>'
-
-                # Close the containers
-                tags_html += '</div></div>'
-
-                # Render the subject tags section
-                st.markdown(tags_html, unsafe_allow_html=True)
-
-        # No OCR content heading - start directly with tabs
|
| 351 |
-
# Check if we have OCR content
|
| 352 |
-
if 'ocr_contents' in result:
|
| 353 |
-
# Create a single view instead of tabs
|
| 354 |
-
content_tab1 = st.container()
|
| 355 |
-
|
| 356 |
-
# Check for images in the result to use later
|
| 357 |
-
has_images = result.get('has_images', False)
|
| 358 |
-
has_image_data = ('pages_data' in result and any(page.get('images', []) for page in result.get('pages_data', [])))
|
| 359 |
-
has_raw_images = ('raw_response_data' in result and 'pages' in result['raw_response_data'] and
|
| 360 |
-
any('images' in page for page in result['raw_response_data']['pages']
|
| 361 |
-
if isinstance(page, dict)))
|
| 362 |
-
|
| 363 |
-
# Display structured content
|
| 364 |
-
with content_tab1:
|
| 365 |
-
# Display structured content with markdown formatting
|
| 366 |
-
if isinstance(result['ocr_contents'], dict):
|
| 367 |
-
# CSS is now handled in the main layout.py file
|
| 368 |
-
|
| 369 |
-
# Function to process text with markdown support
|
| 370 |
-
def format_markdown_text(text):
|
| 371 |
-
"""Format text with markdown and handle special patterns"""
|
| 372 |
-
if not text:
|
| 373 |
-
return ""
|
| 374 |
-
|
| 375 |
-
import re
|
| 376 |
-
|
| 377 |
-
# First, ensure we're working with a string
|
| 378 |
-
if not isinstance(text, str):
|
| 379 |
-
text = str(text)
|
| 380 |
-
|
| 381 |
-
# Ensure newlines are preserved for proper spacing
|
| 382 |
-
# Convert any Windows line endings to Unix
|
| 383 |
-
text = text.replace('\r\n', '\n')
|
| 384 |
-
|
| 385 |
-
# Format dates (MM/DD/YYYY or similar patterns)
|
| 386 |
-
date_pattern = r'\b(0?[1-9]|1[0-2])[\/\-\.](0?[1-9]|[12][0-9]|3[01])[\/\-\.](\d{4}|\d{2})\b'
|
| 387 |
-
text = re.sub(date_pattern, r'**\g<0>**', text)
|
| 388 |
-
|
| 389 |
-
# Detect markdown tables and preserve them
|
| 390 |
-
table_sections = []
|
| 391 |
-
non_table_lines = []
|
| 392 |
-
in_table = False
|
| 393 |
-
table_buffer = []
|
| 394 |
-
|
| 395 |
-
# Process text line by line, preserving tables
|
| 396 |
-
lines = text.split('\n')
|
| 397 |
-
for i, line in enumerate(lines):
|
| 398 |
-
line_stripped = line.strip()
|
| 399 |
-
|
| 400 |
-
# Detect table rows by pipe character
|
| 401 |
-
if '|' in line_stripped and (line_stripped.startswith('|') or line_stripped.endswith('|')):
|
| 402 |
-
if not in_table:
|
| 403 |
-
in_table = True
|
| 404 |
-
if table_buffer:
|
| 405 |
-
table_buffer = []
|
| 406 |
-
table_buffer.append(line)
|
| 407 |
-
|
| 408 |
-
# Check if the next line is a table separator
|
| 409 |
-
if i < len(lines) - 1 and '---' in lines[i+1] and '|' in lines[i+1]:
|
| 410 |
-
table_buffer.append(lines[i+1])
|
| 411 |
-
|
| 412 |
-
# Detect table separators (---|---|---)
|
| 413 |
-
elif in_table and '---' in line_stripped and '|' in line_stripped:
|
| 414 |
-
table_buffer.append(line)
|
| 415 |
-
|
| 416 |
-
# End of table detection
|
| 417 |
-
elif in_table:
|
| 418 |
-
# Check if this is still part of the table
|
| 419 |
-
next_line_is_table = False
|
| 420 |
-
if i < len(lines) - 1:
|
| 421 |
-
next_line = lines[i+1].strip()
|
| 422 |
-
if '|' in next_line and (next_line.startswith('|') or next_line.endswith('|')):
|
| 423 |
-
next_line_is_table = True
|
| 424 |
-
|
| 425 |
-
if not next_line_is_table:
|
| 426 |
-
in_table = False
|
| 427 |
-
# Save the complete table
|
| 428 |
-
if table_buffer:
|
| 429 |
-
table_sections.append('\n'.join(table_buffer))
|
| 430 |
-
table_buffer = []
|
| 431 |
-
# Add current line to non-table lines
|
| 432 |
-
non_table_lines.append(line)
|
| 433 |
-
else:
|
| 434 |
-
# Still part of the table
|
| 435 |
-
table_buffer.append(line)
|
| 436 |
-
else:
|
| 437 |
-
# Not in a table
|
| 438 |
-
non_table_lines.append(line)
|
| 439 |
-
|
| 440 |
-
# Handle any remaining table buffer
|
| 441 |
-
if in_table and table_buffer:
|
| 442 |
-
table_sections.append('\n'.join(table_buffer))
|
| 443 |
-
|
| 444 |
-
# Process non-table lines
|
| 445 |
-
processed_lines = []
|
| 446 |
-
for line in non_table_lines:
|
| 447 |
-
line_stripped = line.strip()
|
| 448 |
-
|
| 449 |
-
# Check if line is in ALL CAPS (and not just a short acronym)
|
| 450 |
-
if line_stripped and line_stripped.isupper() and len(line_stripped) > 3:
|
| 451 |
-
# ALL CAPS line - make bold instead of heading to prevent large display
|
| 452 |
-
processed_lines.append(f"**{line_stripped}**")
|
| 453 |
-
# Process potential headers (lines ending with colon)
|
| 454 |
-
elif line_stripped and line_stripped.endswith(':') and len(line_stripped) < 40:
|
| 455 |
-
# Likely a header - make it bold
|
| 456 |
-
processed_lines.append(f"**{line_stripped}**")
|
| 457 |
-
else:
|
| 458 |
-
# Keep original line with its spacing
|
| 459 |
-
processed_lines.append(line)
|
| 460 |
-
|
| 461 |
-
# Join non-table lines
|
| 462 |
-
processed_text = '\n'.join(processed_lines)
|
| 463 |
-
|
| 464 |
-
# Reinsert tables in the right positions
|
| 465 |
-
for table in table_sections:
|
| 466 |
-
# Generate a unique marker for this table
|
| 467 |
-
marker = f"__TABLE_MARKER_{hash(table) % 10000}__"
|
| 468 |
-
# Find a good position to insert this table
|
| 469 |
-
# For now, just append all tables at the end
|
| 470 |
-
processed_text += f"\n\n{table}\n\n"
|
| 471 |
-
|
| 472 |
-
# Make sure paragraphs have proper spacing but not excessive
|
| 473 |
-
processed_text = re.sub(r'\n{3,}', '\n\n', processed_text)
|
| 474 |
-
|
| 475 |
-
# Ensure two newlines between paragraphs for proper markdown rendering
|
| 476 |
-
processed_text = re.sub(r'([^\n])\n([^\n])', r'\1\n\n\2', processed_text)
|
| 477 |
-
|
| 478 |
-
return processed_text
|
| 479 |
-
|
| 480 |
-
# Collect all available images from the result
|
| 481 |
-
available_images = []
|
| 482 |
-
if has_images and 'pages_data' in result:
|
| 483 |
-
for page_idx, page in enumerate(result['pages_data']):
|
| 484 |
-
if 'images' in page and len(page['images']) > 0:
|
| 485 |
-
for img_idx, img in enumerate(page['images']):
|
| 486 |
-
if 'image_base64' in img:
|
| 487 |
-
available_images.append({
|
| 488 |
-
'source': 'pages_data',
|
| 489 |
-
'page': page_idx,
|
| 490 |
-
'index': img_idx,
|
| 491 |
-
'data': img['image_base64']
|
| 492 |
-
})
|
| 493 |
-
|
| 494 |
-
# Get images from raw response as well
|
| 495 |
-
if 'raw_response_data' in result:
|
| 496 |
-
raw_data = result['raw_response_data']
|
| 497 |
-
if isinstance(raw_data, dict) and 'pages' in raw_data:
|
| 498 |
-
for page_idx, page in enumerate(raw_data['pages']):
|
| 499 |
-
if isinstance(page, dict) and 'images' in page:
|
| 500 |
-
for img_idx, img in enumerate(page['images']):
|
| 501 |
-
if isinstance(img, dict) and 'base64' in img:
|
| 502 |
-
available_images.append({
|
| 503 |
-
'source': 'raw_response',
|
| 504 |
-
'page': page_idx,
|
| 505 |
-
'index': img_idx,
|
| 506 |
-
'data': img['base64']
|
| 507 |
-
})
|
| 508 |
-
|
| 509 |
-
# Extract images for display at the top
|
| 510 |
-
images_to_display = []
|
| 511 |
-
|
| 512 |
-
# First, collect all available images
|
| 513 |
-
for img_idx, img in enumerate(available_images):
|
| 514 |
-
if 'data' in img:
|
| 515 |
-
images_to_display.append({
|
| 516 |
-
'data': img['data'],
|
| 517 |
-
'id': img.get('id', f"img_{img_idx}"),
|
| 518 |
-
'index': img_idx
|
| 519 |
-
})
|
| 520 |
-
|
| 521 |
-
# Simple display of image without dropdown or Document Image tab
|
| 522 |
-
if images_to_display and len(images_to_display) > 0:
|
| 523 |
-
# Just display the first image directly
|
| 524 |
-
st.image(images_to_display[0]['data'], use_container_width=True)
|
| 525 |
-
|
| 526 |
-
# Organize sections in a logical order
|
| 527 |
-
section_order = ["title", "author", "date", "summary", "content", "transcript", "metadata"]
|
| 528 |
-
ordered_sections = []
|
| 529 |
-
|
| 530 |
-
# Add known sections first in preferred order
|
| 531 |
-
for section_name in section_order:
|
| 532 |
-
if section_name in result['ocr_contents'] and result['ocr_contents'][section_name]:
|
| 533 |
-
ordered_sections.append(section_name)
|
| 534 |
-
|
| 535 |
-
# Add any remaining sections
|
| 536 |
-
for section in result['ocr_contents'].keys():
|
| 537 |
-
if (section not in ordered_sections and
|
| 538 |
-
section not in ['error', 'partial_text'] and
|
| 539 |
-
result['ocr_contents'][section]):
|
| 540 |
-
ordered_sections.append(section)
|
| 541 |
-
|
| 542 |
-
# If only raw_text is available and no other content, add it last
|
| 543 |
-
if ('raw_text' in result['ocr_contents'] and
|
| 544 |
-
result['ocr_contents']['raw_text'] and
|
| 545 |
-
len(ordered_sections) == 0):
|
| 546 |
-
ordered_sections.append('raw_text')
|
| 547 |
-
|
| 548 |
-
# Add minimal spacing before OCR results
|
| 549 |
-
st.markdown("<div style='margin: 8px 0 4px 0;'></div>", unsafe_allow_html=True)
|
| 550 |
-
st.markdown("### Document Content")
|
| 551 |
-
|
| 552 |
-
# Process each section using expanders
|
| 553 |
-
for i, section in enumerate(ordered_sections):
|
| 554 |
-
content = result['ocr_contents'][section]
|
| 555 |
-
|
| 556 |
-
# Skip empty content
|
| 557 |
-
if not content:
|
| 558 |
-
continue
|
| 559 |
-
|
| 560 |
-
# Create an expander for each section
|
| 561 |
-
# First section is expanded by default
|
| 562 |
-
with st.expander(f"{section.replace('_', ' ').title()}", expanded=(i == 0)):
|
| 563 |
-
if isinstance(content, str):
|
| 564 |
-
# Handle image markdown
|
| 565 |
-
if content.startswith("![") and content.endswith(")"):
|
| 566 |
-
try:
|
| 567 |
-
alt_text = content[2:content.index(']')]
|
| 568 |
-
st.info(f"Image description: {alt_text if len(alt_text) > 5 else 'Image'}")
|
| 569 |
-
except:
|
| 570 |
-
st.info("Contains image reference")
|
| 571 |
-
else:
|
| 572 |
-
# Process text content
|
| 573 |
-
formatted_content = format_markdown_text(content).strip()
|
| 574 |
-
|
| 575 |
-
# Check if content contains markdown tables or complex text
|
| 576 |
-
has_tables = '|' in formatted_content and '---' in formatted_content
|
| 577 |
-
has_complex_structure = formatted_content.count('\n') > 5 or formatted_content.count('**') > 2
|
| 578 |
-
|
| 579 |
-
# Use a container with minimal margins
|
| 580 |
-
with st.container():
|
| 581 |
-
# For text-only extractions or content with tables, ensure proper rendering
|
| 582 |
-
if has_tables or has_complex_structure:
|
| 583 |
-
# For text with tables or multiple paragraphs, use special handling
|
| 584 |
-
# First ensure proper markdown spacing
|
| 585 |
-
formatted_content = formatted_content.replace('\n\n\n', '\n\n')
|
| 586 |
-
|
| 587 |
-
# Look for any all caps headers that might be misinterpreted
|
| 588 |
-
import re
|
| 589 |
-
formatted_content = re.sub(
|
| 590 |
-
r'^([A-Z][A-Z\s]+)$',
|
| 591 |
-
r'**\1**',
|
| 592 |
-
formatted_content,
|
| 593 |
-
flags=re.MULTILINE
|
| 594 |
-
)
|
| 595 |
-
|
| 596 |
-
# Preserve table formatting by adding proper spacing
|
| 597 |
-
if has_tables:
|
| 598 |
-
formatted_content = formatted_content.replace('\n|', '\n\n|')
|
| 599 |
-
|
| 600 |
-
# Add proper paragraph spacing
|
| 601 |
-
formatted_content = re.sub(r'([^\n])\n([^\n])', r'\1\n\n\2', formatted_content)
|
| 602 |
-
|
| 603 |
-
# Use standard markdown with custom styling
|
| 604 |
-
st.markdown(formatted_content, unsafe_allow_html=False)
|
| 605 |
-
else:
|
| 606 |
-
# For simpler content, use standard markdown
|
| 607 |
-
st.markdown(formatted_content)
|
| 608 |
-
|
| 609 |
-
elif isinstance(content, list):
|
| 610 |
-
# Create markdown list
|
| 611 |
-
list_items = []
|
| 612 |
-
for item in content:
|
| 613 |
-
if isinstance(item, str):
|
| 614 |
-
item_text = format_markdown_text(item).strip()
|
| 615 |
-
# Handle potential HTML special characters for proper rendering
|
| 616 |
-
item_text = item_text.replace('<', '<').replace('>', '>')
|
| 617 |
-
list_items.append(f"- {item_text}")
|
| 618 |
-
else:
|
| 619 |
-
list_items.append(f"- {str(item)}")
|
| 620 |
-
|
| 621 |
-
list_content = "\n".join(list_items)
|
| 622 |
-
|
| 623 |
-
# Use a container with minimal margins
|
| 624 |
-
with st.container():
|
| 625 |
-
# Use standard markdown for better rendering
|
| 626 |
-
st.markdown(list_content)
|
| 627 |
-
|
| 628 |
-
elif isinstance(content, dict):
|
| 629 |
-
# Format dictionary content
|
| 630 |
-
dict_items = []
|
| 631 |
-
for k, v in content.items():
|
| 632 |
-
key_formatted = k.replace('_', ' ').title()
|
| 633 |
-
|
| 634 |
-
if isinstance(v, str):
|
| 635 |
-
value_formatted = format_markdown_text(v).strip()
|
| 636 |
-
dict_items.append(f"**{key_formatted}:** {value_formatted}")
|
| 637 |
-
else:
|
| 638 |
-
dict_items.append(f"**{key_formatted}:** {str(v)}")
|
| 639 |
-
|
| 640 |
-
dict_content = "\n".join(dict_items)
|
| 641 |
-
|
| 642 |
-
# Use a container with minimal margins
|
| 643 |
-
with st.container():
|
| 644 |
-
# Use standard markdown for better rendering
|
| 645 |
-
st.markdown(dict_content)
|
| 646 |
-
|
| 647 |
-
# Display custom prompt if provided
|
| 648 |
-
if custom_prompt:
|
| 649 |
-
with st.expander("Custom Processing Instructions"):
|
| 650 |
-
st.write(custom_prompt)
|
| 651 |
-
|
| 652 |
-
# No download heading - start directly with buttons
|
| 653 |
-
|
| 654 |
-
# JSON download - use full width for buttons
|
| 655 |
-
try:
|
| 656 |
-
json_str = json.dumps(result, indent=2)
|
| 657 |
-
st.download_button(
|
| 658 |
-
label="Download JSON",
|
| 659 |
-
data=json_str,
|
| 660 |
-
file_name=f"{result.get('file_name', 'document').split('.')[0]}_ocr.json",
|
| 661 |
-
mime="application/json"
|
| 662 |
-
)
|
| 663 |
-
except Exception as e:
|
| 664 |
-
st.error(f"Error creating JSON download: {str(e)}")
|
| 665 |
-
|
| 666 |
-
# Text download
|
| 667 |
-
try:
|
| 668 |
-
if 'ocr_contents' in result:
|
| 669 |
-
if 'raw_text' in result['ocr_contents']:
|
| 670 |
-
text_content = result['ocr_contents']['raw_text']
|
| 671 |
-
elif 'content' in result['ocr_contents']:
|
| 672 |
-
text_content = result['ocr_contents']['content']
|
| 673 |
-
else:
|
| 674 |
-
text_content = str(result['ocr_contents'])
|
| 675 |
-
else:
|
| 676 |
-
text_content = "No text content available."
|
| 677 |
-
|
| 678 |
-
st.download_button(
|
| 679 |
-
label="Download Text",
|
| 680 |
-
data=text_content,
|
| 681 |
-
file_name=f"{result.get('file_name', 'document').split('.')[0]}_ocr.txt",
|
| 682 |
-
mime="text/plain"
|
| 683 |
-
)
|
| 684 |
-
except Exception as e:
|
| 685 |
-
st.error(f"Error creating text download: {str(e)}")
|
| 686 |
-
|
| 687 |
def display_document_with_images(result):
|
| 688 |
"""Display document with images"""
|
| 689 |
# Check for pages_data first
|
|
@@ -759,7 +297,7 @@ def display_document_with_images(result):
| 759 |           if isinstance(raw_page, dict) and 'images' in raw_page:
| 760 |               for img in raw_page['images']:
| 761 |                   if isinstance(img, dict) and 'base64' in img:
| 762 | -                     st.image(img['base64'])
| 763 |                       st.caption("Image from OCR response")
| 764 |                       image_displayed = True
| 765 |                       break
@@ -797,7 +335,7 @@ def display_previous_results():
| 797 |       st.markdown("""
| 798 |       <div style="text-align: center; padding: 30px 20px; background-color: #f8f9fa; border-radius: 6px; margin-top: 10px;">
| 799 |           <div style="font-size: 36px; margin-bottom: 15px;">📄</div>
| 800 | -         <
| 801 |           <p style="font-size: 14px; color: #666;">Process a document to see your results history.</p>
| 802 |       </div>
| 803 |       """, unsafe_allow_html=True)
@@ -806,7 +344,7 @@ def display_previous_results():
| 806 |       with col2:
| 807 |           try:
| 808 |               # Create download button for all results
| 809 | -             from
| 810 |               zip_data = create_results_zip_in_memory(st.session_state.previous_results)
| 811 |               timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
| 812 |
@@ -908,37 +446,22 @@ def display_previous_results():
| 908 |       meta_html += '</div>'
| 909 |       st.markdown(meta_html, unsafe_allow_html=True)
| 910 |
| 911 | -     # Simplified tabs -
| 912 |       has_images = selected_result.get('has_images', False)
| 913 |       if has_images:
| 914 | -         view_tabs = st.tabs(["Document Content", "Raw
| 915 |           view_tab1, view_tab2, view_tab3 = view_tabs
| 916 |       else:
| 917 | -         view_tabs = st.tabs(["Document Content", "Raw
| 918 |           view_tab1, view_tab2 = view_tabs
| 919 | -
| 920 | -     # Define helper function for formatting text
| 921 | -     def format_text_display(text):
| 922 | -         if not isinstance(text, str):
| 923 | -             return text
| 924 | -
| 925 | -         lines = text.split('\n')
| 926 | -         processed_lines = []
| 927 | -         for line in lines:
| 928 | -             line_stripped = line.strip()
| 929 | -             if line_stripped and line_stripped.isupper() and len(line_stripped) > 3:
| 930 | -                 processed_lines.append(f"**{line_stripped}**")
| 931 | -             else:
| 932 | -                 processed_lines.append(line)
| 933 | -
| 934 | -         return '\n'.join(processed_lines)
| 935 |
| 936 |       # First tab - Document Content (simplified structured view)
| 937 |       with view_tab1:
| 938 |           # Display content in a cleaner, more streamlined format
| 939 |           if 'ocr_contents' in selected_result and isinstance(selected_result['ocr_contents'], dict):
| 940 |               # Create a more focused list of important sections
| 941 | -             priority_sections = ["title", "content", "transcript", "summary"
| 942 |               displayed_sections = set()
| 943 |
| 944 |               # First display priority sections
@@ -951,7 +474,7 @@ def display_previous_results():
| 951 |                       st.markdown(f"##### {section.replace('_', ' ').title()}")
| 952 |
| 953 |                       # Format and display content
| 954 | -                     formatted_content =
| 955 |                       st.markdown(formatted_content)
| 956 |                       displayed_sections.add(section)
| 957 |
@@ -963,7 +486,7 @@ def display_previous_results():
| 963 |                   st.markdown(f"##### {section.replace('_', ' ').title()}")
| 964 |
| 965 |                   if isinstance(content, str):
| 966 | -                     st.markdown(
| 967 |                   elif isinstance(content, list):
| 968 |                       for item in content:
| 969 |                           st.markdown(f"- {item}")
@@ -971,34 +494,42 @@ def display_previous_results():
| 971 |                       for k, v in content.items():
| 972 |                           st.markdown(f"**{k}:** {v}")
| 973 |
| 974 | -     # Second tab - Raw
| 975 |       with view_tab2:
| 976 | -         # Extract
| 977 | -
| 978 |           if 'ocr_contents' in selected_result:
| 979 | -
| 980 | -
| 981 | -
| 982 | -
| 983 |
| 984 | -         #
| 985 | -
| 986 |
| 987 | -         #
| 988 | -
| 989 | -         with col1:
| 990 | -             st.button("Copy Text", key="selected_copy_btn")
| 991 | -         with col2:
| 992 | -             st.download_button(
| 993 | -                 label="Download Text",
| 994 | -                 data=edited_text,
| 995 | -                 file_name=f"{file_name.split('.')[0]}_text.txt",
| 996 | -                 mime="text/plain",
| 997 | -                 key="selected_download_btn"
| 998 | -             )
| 999 |
| 1000 | -     # Third tab -
| 1001 | -     if has_images and
| 1002 |          with view_tab3:
| 1003 |              # Simplified image display
| 1004 |              if 'pages_data' in selected_result:
@@ -1007,7 +538,7 @@ def display_previous_results():
| 1007 |                  if 'images' in page_data and len(page_data['images']) > 0:
| 1008 |                      for img in page_data['images']:
| 1009 |                          if 'image_base64' in img:
| 1010 | -                            st.image(img['image_base64'],
| 1011 |
| 1012 |                  # Get page text if available
| 1013 |                  page_text = ""
@@ -1018,21 +549,22 @@ def display_previous_results():
| 1018 |                  if page_text:
| 1019 |                      with st.expander(f"Page {i+1} Text", expanded=False):
| 1020 |                          st.text(page_text)
| 1021 |
| 1022 |  def display_about_tab():
| 1023 | -    """Display
| 1024 | -    st.header("
| 1025 |
| 1026 |      # Add app description
| 1027 |      st.markdown("""
| 1028 | -    **Historical OCR** is a
| 1029 |      """)
| 1030 |
| 1031 |      # Purpose section with consistent formatting
| 1032 |      st.markdown("### Purpose")
| 1033 |      st.markdown("""
| 1034 |      This tool is designed to assist scholars in historical research by extracting text from challenging documents.
| 1035 | -    While it may not achieve
| 1036 |      historical documents, particularly:
| 1037 |      """)
| 3 |   import io
| 4 |   import base64
| 5 |   import logging
| 6 | + import re
| 7 |   from datetime import datetime
| 8 |   from pathlib import Path
| 9 |   import json
| 10 | +
| 11 | + # Define exports
| 12 | + __all__ = [
| 13 | +     'ProgressReporter',
| 14 | +     'create_sidebar_options',
| 15 | +     'create_file_uploader',
| 16 | +     'display_document_with_images',
| 17 | +     'display_previous_results',
| 18 | +     'display_about_tab',
| 19 | +     'display_results'  # Re-export from utils.ui_utils
| 20 | + ]
| 21 |   from constants import (
| 22 |       DOCUMENT_TYPES,
| 23 |       DOCUMENT_LAYOUTS,
| 31 |       PREPROCESSING_DOC_TYPES,
| 32 |       ROTATION_OPTIONS
| 33 |   )
| 34 | + from utils.image_utils import format_ocr_text
| 35 | + from utils.content_utils import (
| 36 | +     classify_document_content,
| 37 | +     extract_document_text,
| 38 | +     extract_image_description,
| 39 | +     clean_raw_text,
| 40 | +     format_markdown_text
| 41 | + )
| 42 | + from utils.ui_utils import display_results
| 43 | + from preprocessing import preprocess_image
| 44 |
| 45 |   class ProgressReporter:
| 46 |       """Class to handle progress reporting in the UI"""
| 90 |
| 91 |       # Create a container for the sidebar options
| 92 |       with st.container():
| 93 | +         # Default to using vision model (removed selection from UI)
| 94 | +         use_vision = True
| 95 |
| 96 |           # Document type selection
| 97 |           doc_type = st.selectbox("Document Type", DOCUMENT_TYPES,
| 98 |                                   help="Select the type of document you're processing for better results")
| 99 |
|
| 110 |
|
| 111 |
# Custom prompt
|
| 112 |
custom_prompt = ""
|
| 113 |
+
# Get the template for the selected document type if not auto-detect
|
| 114 |
+
if doc_type != DOCUMENT_TYPES[0]:
|
| 115 |
prompt_template = CUSTOM_PROMPT_TEMPLATES.get(doc_type, "")
|
| 116 |
|
| 117 |
# Add layout information if not standard
|
|
|
|
| 122 |
|
| 123 |
# Set the custom prompt
|
| 124 |
custom_prompt = prompt_template
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 125 |
| 126 | +         # Allow user to edit the prompt (always visible)
| 127 | +         custom_prompt = st.text_area("Custom Processing Instructions", value=custom_prompt,
| 128 | +                                      help="Customize the instructions for processing this document",
| 129 | +                                      height=80)
| 130 | +
| 131 | +         # Image preprocessing options (always visible)
| 132 | +         st.markdown("### Image Preprocessing")
| 133 | +
| 134 | +         # Grayscale conversion
| 135 | +         grayscale = st.checkbox("Convert to Grayscale",
| 136 | +                                 value=False,
| 137 | +                                 help="Convert color images to grayscale for better text recognition")
| 138 | +
| 139 | +         # Light denoising option
| 140 | +         denoise = st.checkbox("Light Denoising",
| 141 | +                               value=False,
| 142 | +                               help="Apply gentle denoising to improve text clarity")
| 143 | +
| 144 | +         # Contrast adjustment
| 145 | +         contrast = st.slider("Contrast Adjustment",
| 146 | +                              min_value=-20,
| 147 | +                              max_value=20,
| 148 | +                              value=0,
| 149 | +                              step=5,
| 150 | +                              help="Adjust image contrast (limited range)")
| 151 | +
| 152 | +
| 153 | +         # Initialize rotation (keeping it set to 0)
| 154 | +         rotation = 0
| 155 | +         use_segmentation = False
| 156 |
| 157 |           # Create preprocessing options dictionary
| 158 |           # Set document_type based on selection in UI
| 172 |               "rotation": rotation
| 173 |           }
| 174 |
| 175 | +         # PDF-specific options
| 176 | +         st.markdown("### PDF Options")
| 177 | +         max_pages = st.number_input("Maximum Pages to Process",
| 178 | +                                     min_value=1,
| 179 | +                                     max_value=20,
| 180 | +                                     value=DEFAULT_MAX_PAGES,
| 181 | +                                     help="Limit the number of pages to process (for multi-page PDFs)")
| 182 | +
| 183 | +         # Set default values for removed options
| 184 | +         pdf_dpi = DEFAULT_PDF_DPI
| 185 | +         pdf_rotation = 0
| 186 |
| 187 |           # Create options dictionary
| 188 |           options = {
| 222 |       )
| 223 |       return uploaded_file
| 224 |
| 225 |   def display_document_with_images(result):
| 226 |       """Display document with images"""
| 227 |       # Check for pages_data first
| 297 |           if isinstance(raw_page, dict) and 'images' in raw_page:
| 298 |               for img in raw_page['images']:
| 299 |                   if isinstance(img, dict) and 'base64' in img:
| 300 | +                     st.image(img['base64'], use_container_width=True)
| 301 |                       st.caption("Image from OCR response")
| 302 |                       image_displayed = True
| 303 |                       break
| 335 |       st.markdown("""
| 336 |       <div style="text-align: center; padding: 30px 20px; background-color: #f8f9fa; border-radius: 6px; margin-top: 10px;">
| 337 |           <div style="font-size: 36px; margin-bottom: 15px;">📄</div>
| 338 | +         <h3 style="margin-bottom: 16px; font-weight: 500;">No Previous Results</h3>
| 339 |           <p style="font-size: 14px; color: #666;">Process a document to see your results history.</p>
| 340 |       </div>
| 341 |       """, unsafe_allow_html=True)
| 344 |       with col2:
| 345 |           try:
| 346 |               # Create download button for all results
| 347 | +             from utils.image_utils import create_results_zip_in_memory
| 348 |               zip_data = create_results_zip_in_memory(st.session_state.previous_results)
| 349 |               timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
| 350 |
| 446 |       meta_html += '</div>'
| 447 |       st.markdown(meta_html, unsafe_allow_html=True)
| 448 |
| 449 | +     # Simplified tabs - using the same format as main view
| 450 |       has_images = selected_result.get('has_images', False)
| 451 |       if has_images:
| 452 | +         view_tabs = st.tabs(["Document Content", "Raw JSON", "Images"])
| 453 |           view_tab1, view_tab2, view_tab3 = view_tabs
| 454 |       else:
| 455 | +         view_tabs = st.tabs(["Document Content", "Raw JSON"])
| 456 |           view_tab1, view_tab2 = view_tabs
| 457 | +         view_tab3 = None
| 458 |
| 459 |       # First tab - Document Content (simplified structured view)
| 460 |       with view_tab1:
| 461 |           # Display content in a cleaner, more streamlined format
| 462 |           if 'ocr_contents' in selected_result and isinstance(selected_result['ocr_contents'], dict):
| 463 |               # Create a more focused list of important sections
| 464 | +             priority_sections = ["title", "content", "transcript", "summary"]
| 465 |               displayed_sections = set()
| 466 |
| 467 |               # First display priority sections
| 474 |                       st.markdown(f"##### {section.replace('_', ' ').title()}")
| 475 |
| 476 |                       # Format and display content
| 477 | +                     formatted_content = format_ocr_text(content)
| 478 |                       st.markdown(formatted_content)
| 479 |                       displayed_sections.add(section)
| 480 |
| 486 |                   st.markdown(f"##### {section.replace('_', ' ').title()}")
| 487 |
| 488 |                   if isinstance(content, str):
| 489 | +                     st.markdown(format_ocr_text(content))
| 490 |                   elif isinstance(content, list):
| 491 |                       for item in content:
| 492 |                           st.markdown(f"- {item}")
| 494 |                       for k, v in content.items():
| 495 |                           st.markdown(f"**{k}:** {v}")
| 496 |
| 497 | +     # Second tab - Raw JSON (simplified)
| 498 |       with view_tab2:
| 499 | +         # Extract the relevant JSON data
| 500 | +         json_data = {}
| 501 | +
| 502 | +         # Include important metadata
| 503 | +         for field in ['file_name', 'timestamp', 'processing_time', 'languages', 'topics', 'subjects', 'detected_document_type', 'text']:
| 504 | +             if field in selected_result:
| 505 | +                 json_data[field] = selected_result[field]
| 506 | +
| 507 | +         # Include OCR contents
| 508 |           if 'ocr_contents' in selected_result:
| 509 | +             json_data['ocr_contents'] = selected_result['ocr_contents']
| 510 | +
| 511 | +         # Exclude large binary data like base64 images to keep JSON clean
| 512 | +         if 'pages_data' in selected_result:
| 513 | +             # Create simplified pages_data without large binary content
| 514 | +             simplified_pages = []
| 515 | +             for page in selected_result['pages_data']:
| 516 | +                 simplified_page = {
| 517 | +                     'page_number': page.get('page_number', 0),
| 518 | +                     'has_text': bool(page.get('markdown', '')),
| 519 | +                     'has_images': bool(page.get('images', [])),
| 520 | +                     'image_count': len(page.get('images', []))
| 521 | +                 }
| 522 | +                 simplified_pages.append(simplified_page)
| 523 | +             json_data['pages_summary'] = simplified_pages
| 524 |
| 525 | +         # Format the JSON prettily
| 526 | +         json_str = json.dumps(json_data, indent=2)
| 527 |
| 528 | +         # Display in a monospace font with syntax highlighting
| 529 | +         st.code(json_str, language="json")
| 530 |
| 531 | +     # Third tab - Images (simplified)
| 532 | +     if has_images and view_tab3 is not None:
| 533 |           with view_tab3:
| 534 |               # Simplified image display
| 535 |               if 'pages_data' in selected_result:
| 538 |                      if 'images' in page_data and len(page_data['images']) > 0:
| 539 |                          for img in page_data['images']:
| 540 |                              if 'image_base64' in img:
| 541 | +                                st.image(img['image_base64'], use_container_width=True)
| 542 |
| 543 |                      # Get page text if available
| 544 |                      page_text = ""
|
|
| 549 |
if page_text:
|
| 550 |
with st.expander(f"Page {i+1} Text", expanded=False):
|
| 551 |
st.text(page_text)
|
| 552 |
+
|
| 553 |
|
| 554 |
def display_about_tab():
|
| 555 |
+
"""Display learn more tab content"""
|
| 556 |
+
st.header("Learn More")
|
| 557 |
|
| 558 |
# Add app description
|
| 559 |
st.markdown("""
|
| 560 |
+
**Historical OCR** is a tailored academic tool for extracting text from historical documents, manuscripts, and printed materials.
|
| 561 |
""")
|
| 562 |
|
| 563 |
# Purpose section with consistent formatting
|
| 564 |
st.markdown("### Purpose")
|
| 565 |
st.markdown("""
|
| 566 |
This tool is designed to assist scholars in historical research by extracting text from challenging documents.
|
| 567 |
+
While it may not achieve full accuracy for all materials, it serves as a tailored research aid for navigating
|
| 568 |
historical documents, particularly:
|
| 569 |
""")
|
| 570 |
|
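The `pages_summary` construction in the Raw JSON tab can be exercised outside Streamlit. `make_pages_summary` below is a hypothetical standalone extraction of that loop, not a function in the codebase:

```python
def make_pages_summary(pages_data):
    # Mirror of the summarisation loop in the Raw JSON tab:
    # keep per-page metadata, drop the large base64 payloads.
    summary = []
    for page in pages_data:
        summary.append({
            'page_number': page.get('page_number', 0),
            'has_text': bool(page.get('markdown', '')),
            'has_images': bool(page.get('images', [])),
            'image_count': len(page.get('images', [])),
        })
    return summary

pages = [
    {'page_number': 1, 'markdown': 'Some text', 'images': []},
    {'page_number': 2, 'markdown': '', 'images': [{'image_base64': '...'}]},
]
print(make_pages_summary(pages))
```

This keeps the displayed JSON small even when a page carries several embedded images.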
utils/content_utils.py ADDED
@@ -0,0 +1,189 @@
import re
import ast
from .text_utils import clean_raw_text, format_markdown_text

def classify_document_content(result):
    """Classify document content based on structure and content"""
    classification = {
        'has_title': False,
        'has_content': False,
        'has_sections': False,
        'is_structured': False
    }

    if 'ocr_contents' not in result or not isinstance(result['ocr_contents'], dict):
        return classification

    # Check for title
    if 'title' in result['ocr_contents'] and result['ocr_contents']['title']:
        classification['has_title'] = True

    # Check for content
    content_fields = ['content', 'transcript', 'text']
    for field in content_fields:
        if field in result['ocr_contents'] and result['ocr_contents'][field]:
            classification['has_content'] = True
            break

    # Check for sections
    section_count = 0
    for key in result['ocr_contents'].keys():
        if key not in ['raw_text', 'error'] and result['ocr_contents'][key]:
            section_count += 1

    classification['has_sections'] = section_count > 2

    # Check if structured
    classification['is_structured'] = (
        classification['has_title'] and
        classification['has_content'] and
        classification['has_sections']
    )

    return classification

def extract_document_text(result):
    """Extract the main document text content"""
    if 'ocr_contents' not in result or not isinstance(result['ocr_contents'], dict):
        return ""

    # Try content fields in preferred order - prioritize main_text
    for field in ['main_text', 'content', 'transcript', 'text', 'raw_text']:
        if field in result['ocr_contents'] and result['ocr_contents'][field]:
            content = result['ocr_contents'][field]
            if isinstance(content, str):
                return content

    return ""

def extract_image_description(image_data):
    """Extract an image description from image data"""
    if not image_data or not isinstance(image_data, dict):
        return ""

    # Try the different fields that might contain descriptions
    for field in ['alt_text', 'caption', 'description']:
        if field in image_data and image_data[field]:
            return image_data[field]

    return ""

def format_structured_data(content):
    """Format structured data like lists and dictionaries into readable markdown

    Args:
        content: The content to format (str, list, dict)

    Returns:
        Formatted markdown text
    """
    if not content:
        return ""

    # If it's already a string, look for patterns that appear to be Python/JSON representations
    if isinstance(content, str):
        # Look for lists like ['item1', 'item2', 'item3']
        list_pattern = r"(\[([^\[\]]*)\])"
        dict_pattern = r"(\{([^\{\}]*)\})"

        # First handle lists - ['item1', 'item2']
        def replace_list(match):
            try:
                # Try to parse the match as a Python list
                list_str = match.group(1)

                # Quick check for empty list
                if list_str == "[]":
                    return ""

                # Safe evaluation of the list-like string
                try:
                    items = ast.literal_eval(list_str)
                    if isinstance(items, list):
                        # Convert to markdown bullet points
                        return "\n" + "\n".join([f"- {item}" for item in items])
                    else:
                        return list_str  # Not a list, return unchanged
                except (SyntaxError, ValueError):
                    # Fall back to a simpler regex-based approach for common formats:
                    # handle simple comma-separated quoted lists
                    items = re.findall(r"'([^']*)'|\"([^\"]*)\"", list_str)
                    if items:
                        # Extract the matched groups, handling both single and double quotes
                        clean_items = [item[0] if item[0] else item[1] for item in items]
                        return "\n" + "\n".join([f"- {item}" for item in clean_items])
                    return list_str  # Couldn't parse, return unchanged
            except Exception:
                return match.group(0)  # Return the original text on any error

        # Handle dictionaries or structured fields like {key: value, key2: value2}
        def replace_dict(match):
            try:
                dict_str = match.group(1)

                # Quick check for empty dict
                if dict_str == "{}":
                    return ""

                # First try to parse as a Python dict
                try:
                    data_dict = ast.literal_eval(dict_str)
                    if isinstance(data_dict, dict):
                        return "\n" + "\n".join([f"**{k}**: {v}" for k, v in data_dict.items()])
                except (SyntaxError, ValueError):
                    # If that fails, use regex to extract key-value pairs
                    pairs = re.findall(r"'([^']*)':\s*'([^']*)'|\"([^\"]*)\":\s*\"([^\"]*)\"", dict_str)
                    if pairs:
                        formatted_pairs = []
                        for pair in pairs:
                            if pair[0] and pair[1]:  # Single quotes
                                formatted_pairs.append(f"**{pair[0]}**: {pair[1]}")
                            elif pair[2] and pair[3]:  # Double quotes
                                formatted_pairs.append(f"**{pair[2]}**: {pair[3]}")
                        return "\n" + "\n".join(formatted_pairs)
                return dict_str  # Return original if it couldn't be parsed
            except Exception:
                return match.group(0)  # Return the original text on any error

        # Check for keys with array values (common in OCR output)
        key_array_pattern = r"([a-zA-Z_]+):\s*(\[.*?\])"

        def replace_key_array(match):
            try:
                key = match.group(1)
                array_str = match.group(2)

                # Process the array part with the list replacer
                formatted_array = replace_list(re.match(list_pattern, array_str))

                # If it was successfully formatted, return it with the key as a header
                if formatted_array != array_str:
                    return f"**{key}**:{formatted_array}"
                else:
                    return match.group(0)  # Return original if no change
            except Exception:
                return match.group(0)  # Return the original on error

        # Apply all replacements
        content = re.sub(key_array_pattern, replace_key_array, content)
        content = re.sub(list_pattern, replace_list, content)
        content = re.sub(dict_pattern, replace_dict, content)

        return content

    # Handle native Python lists
    elif isinstance(content, list):
        if not content:
            return ""
        # Convert to markdown bullet points
        return "\n".join([f"- {item}" for item in content])

    # Handle native Python dictionaries
    elif isinstance(content, dict):
        if not content:
            return ""
        # Convert to markdown key-value pairs
        return "\n".join([f"**{k}**: {v}" for k, v in content.items()])

    # Fall back to string conversion for other types
    return str(content)
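The list-handling branch of `format_structured_data` can be sketched in isolation. `bulletize_lists` below is a hypothetical, stripped-down version of the `replace_list` logic (`ast.literal_eval` plus a regex substitution), not the full function:

```python
import ast
import re

def bulletize_lists(text):
    # Turn embedded Python-style lists into markdown bullets,
    # leaving anything unparseable untouched.
    def repl(match):
        try:
            items = ast.literal_eval(match.group(1))
        except (SyntaxError, ValueError):
            return match.group(0)
        if isinstance(items, list):
            return "\n" + "\n".join(f"- {item}" for item in items)
        return match.group(0)
    return re.sub(r"(\[[^\[\]]*\])", repl, text)

print(bulletize_lists("Topics: ['Letter', 'Newspaper']"))
# → Topics:
#   - Letter
#   - Newspaper
```

The real function layers dictionary and `key: [array]` handling on top of this, but the parse-then-fall-back shape is the same.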
utils/file_utils.py ADDED
@@ -0,0 +1,100 @@
"""
File utility functions for historical OCR processing.
"""
import base64
import logging
from pathlib import Path

# Configure logging
logger = logging.getLogger("utils")
logger.setLevel(logging.INFO)

def get_base64_from_image(image_path):
    """
    Get a base64 data URL from an image file with the proper MIME type.

    Args:
        image_path: Path to the image file

    Returns:
        Base64 data URL with the appropriate MIME type prefix
    """
    try:
        # Convert to a Path object for better handling
        path_obj = Path(image_path)

        # Determine the MIME type from the file extension
        mime_type = 'image/jpeg'  # Default MIME type
        suffix = path_obj.suffix.lower()
        if suffix == '.png':
            mime_type = 'image/png'
        elif suffix == '.gif':
            mime_type = 'image/gif'
        elif suffix in ['.jpg', '.jpeg']:
            mime_type = 'image/jpeg'
        elif suffix == '.pdf':
            mime_type = 'application/pdf'

        # Read and encode the file
        with open(path_obj, "rb") as file:
            encoded = base64.b64encode(file.read()).decode('utf-8')
            return f"data:{mime_type};base64,{encoded}"
    except Exception as e:
        logger.error(f"Error encoding file to base64: {str(e)}")
        return ""

def get_base64_from_bytes(file_bytes, mime_type=None, file_name=None):
    """
    Get a base64 data URL from file bytes with the proper MIME type.

    Args:
        file_bytes: Binary file data
        mime_type: MIME type of the file (optional)
        file_name: Original file name for MIME type detection (optional)

    Returns:
        Base64 data URL with the appropriate MIME type prefix
    """
    try:
        # Determine the MIME type if not provided
        if mime_type is None and file_name is not None:
            # Get the file extension
            suffix = Path(file_name).suffix.lower()
            if suffix == '.png':
                mime_type = 'image/png'
            elif suffix == '.gif':
                mime_type = 'image/gif'
            elif suffix in ['.jpg', '.jpeg']:
                mime_type = 'image/jpeg'
            elif suffix == '.pdf':
                mime_type = 'application/pdf'
            else:
                # Default to image/jpeg for unknown types when processing images
                mime_type = 'image/jpeg'
        elif mime_type is None:
            # Default MIME type if it can't be determined - use image/jpeg instead of
            # application/octet-stream to ensure compatibility with the Mistral AI OCR API
            mime_type = 'image/jpeg'

        # Encode and create the data URL
        encoded = base64.b64encode(file_bytes).decode('utf-8')
        return f"data:{mime_type};base64,{encoded}"
    except Exception as e:
        logger.error(f"Error encoding bytes to base64: {str(e)}")
        return ""

def handle_temp_files(temp_file_paths):
    """
    Clean up temporary files.

    Args:
        temp_file_paths: List of temporary file paths to clean up
    """
    import os
    for temp_path in temp_file_paths:
        try:
            if os.path.exists(temp_path):
                os.unlink(temp_path)
                logger.info(f"Removed temporary file: {temp_path}")
        except Exception as e:
            logger.warning(f"Failed to remove temporary file {temp_path}: {str(e)}")
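The MIME-by-extension logic shared by both helpers condenses to a small lookup. `data_url_from_bytes` is a hypothetical standalone sketch, not part of the module:

```python
import base64
from pathlib import Path

def data_url_from_bytes(file_bytes, file_name):
    # Same extension-to-MIME mapping as get_base64_from_bytes,
    # with image/jpeg as the catch-all default.
    mime = {
        '.png': 'image/png', '.gif': 'image/gif',
        '.jpg': 'image/jpeg', '.jpeg': 'image/jpeg',
        '.pdf': 'application/pdf',
    }.get(Path(file_name).suffix.lower(), 'image/jpeg')
    encoded = base64.b64encode(file_bytes).decode('utf-8')
    return f"data:{mime};base64,{encoded}"

print(data_url_from_bytes(b'hello', 'scan.png'))
# → data:image/png;base64,aGVsbG8=
```

Defaulting unknown extensions to `image/jpeg` rather than `application/octet-stream` is a deliberate choice here, matching the compatibility note in the module above.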
utils/general_utils.py ADDED
@@ -0,0 +1,163 @@
"""
General utility functions for historical OCR processing.
"""
import os
import base64
import hashlib
import time
import logging
from datetime import datetime
from pathlib import Path
from functools import wraps

# Configure logging
logger = logging.getLogger("utils")
logger.setLevel(logging.INFO)

def generate_cache_key(file_bytes, file_type, use_vision, preprocessing_options=None, pdf_rotation=0, custom_prompt=None):
    """
    Generate a cache key for OCR processing.

    Args:
        file_bytes: File content as bytes
        file_type: Type of file (pdf or image)
        use_vision: Whether to use the vision model
        preprocessing_options: Dictionary of preprocessing options
        pdf_rotation: PDF rotation value
        custom_prompt: Custom prompt for OCR

    Returns:
        str: Cache key
    """
    # Generate the file hash
    file_hash = hashlib.md5(file_bytes).hexdigest()

    # Include preprocessing options in the cache key
    preprocessing_options_hash = ""
    if preprocessing_options:
        # Add pdf_rotation to the preprocessing options so it is part of the cache key
        if pdf_rotation != 0:
            preprocessing_options_with_rotation = preprocessing_options.copy()
            preprocessing_options_with_rotation['pdf_rotation'] = pdf_rotation
            preprocessing_str = str(sorted(preprocessing_options_with_rotation.items()))
        else:
            preprocessing_str = str(sorted(preprocessing_options.items()))
        preprocessing_options_hash = hashlib.md5(preprocessing_str.encode()).hexdigest()
    elif pdf_rotation != 0:
        # If there are no preprocessing options but we have rotation, include it in the hash
        preprocessing_options_hash = hashlib.md5(f"pdf_rotation_{pdf_rotation}".encode()).hexdigest()

    # Create the base cache key
    cache_key = f"{file_hash}_{file_type}_{use_vision}_{preprocessing_options_hash}"

    # Include the custom prompt in the cache key if provided
    if custom_prompt:
        custom_prompt_hash = hashlib.md5(str(custom_prompt).encode()).hexdigest()
        cache_key = f"{cache_key}_{custom_prompt_hash}"

    return cache_key

def timing(description):
    """Context manager for timing code execution"""
    class TimingContext:
        def __init__(self, description):
            self.description = description

        def __enter__(self):
            self.start_time = time.time()
            return self

        def __exit__(self, exc_type, exc_val, exc_tb):
            end_time = time.time()
            execution_time = end_time - self.start_time
            logger.info(f"{self.description} took {execution_time:.2f} seconds")
            return False

    return TimingContext(description)

def format_timestamp(timestamp=None):
    """Format a timestamp for display"""
    if timestamp is None:
        timestamp = datetime.now()
    elif isinstance(timestamp, str):
        try:
            timestamp = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        except ValueError:
            timestamp = datetime.now()

    return timestamp.strftime("%Y-%m-%d %H:%M")

def create_descriptive_filename(original_filename, result, file_ext, preprocessing_options=None):
    """
    Create a descriptive filename for the result.

    Args:
        original_filename: Original filename
        result: OCR result dictionary
        file_ext: File extension
        preprocessing_options: Dictionary of preprocessing options

    Returns:
        str: Descriptive filename
    """
    # Get the base name without the extension
    original_name = Path(original_filename).stem

    # Add the document type to the filename if detected
    doc_type_tag = ""
    if 'detected_document_type' in result:
        doc_type = result['detected_document_type'].lower()
        doc_type_tag = f"_{doc_type.replace(' ', '_')}"
    elif 'topics' in result and result['topics']:
        # Use the first tag as the document type if not explicitly detected
        doc_type_tag = f"_{result['topics'][0].lower().replace(' ', '_')}"

    # Add a period tag for historical context if available
    period_tag = ""
    if 'topics' in result and result['topics']:
        for tag in result['topics']:
            if "century" in tag.lower() or "pre-" in tag.lower() or "era" in tag.lower():
                period_tag = f"_{tag.lower().replace(' ', '_')}"
                break

    # Generate the final descriptive filename
    descriptive_name = f"{original_name}{doc_type_tag}{period_tag}{file_ext}"
    return descriptive_name

def extract_subject_tags(result, raw_text, preprocessing_options=None):
    """
    Extract subject tags from an OCR result.

    Args:
        result: OCR result dictionary
        raw_text: Raw text from OCR
        preprocessing_options: Dictionary of preprocessing options

    Returns:
        list: Subject tags
    """
    subject_tags = []

    # Use existing topics as a starting point if available
    if 'topics' in result and result['topics']:
        subject_tags = list(result['topics'])

    # Add the document type if detected
    if 'detected_document_type' in result:
        doc_type = result['detected_document_type'].capitalize()
        if doc_type not in subject_tags:
            subject_tags.append(doc_type)

    # If no tags were found, add some defaults
    if not subject_tags:
        subject_tags = ["Document", "Historical Document"]

    # Try to infer the content type
    if "letter" in raw_text.lower()[:1000] or "dear" in raw_text.lower()[:200]:
        subject_tags.append("Letter")

    # Check whether it might be a newspaper
    if "newspaper" in raw_text.lower()[:1000] or "editor" in raw_text.lower()[:500]:
        subject_tags.append("Newspaper")

    return subject_tags
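The key property of `generate_cache_key` — identical inputs always map to the same key, while a changed prompt or file never collides — can be checked with a reduced sketch. `cache_key` here is a hypothetical condensed version that omits the preprocessing-options and rotation hashing:

```python
import hashlib

def cache_key(file_bytes, file_type, use_vision, custom_prompt=None):
    # Same shape as generate_cache_key: file hash + type + vision flag,
    # with the prompt hash appended only when a prompt is given.
    key = f"{hashlib.md5(file_bytes).hexdigest()}_{file_type}_{use_vision}"
    if custom_prompt:
        key += "_" + hashlib.md5(str(custom_prompt).encode()).hexdigest()
    return key

a = cache_key(b'page', 'image', True)
print(a == cache_key(b'page', 'image', True))   # deterministic
print(a == cache_key(b'page', 'image', True,
                     custom_prompt='transcribe faithfully'))  # prompt changes the key
```

Hashing each input separately (rather than concatenating raw values) keeps keys fixed-length regardless of file size or prompt length.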
utils/image_utils.py ADDED
@@ -0,0 +1,886 @@
| 1 |
+
"""
|
| 2 |
+
Utility functions for OCR image processing with Mistral AI.
|
| 3 |
+
Contains helper functions for working with OCR responses and image handling.
|
| 4 |
+
"""
|
| 5 |
+
|
| 6 |
+
# Standard library imports
|
| 7 |
+
import json
|
| 8 |
+
import base64
|
| 9 |
+
import io
|
| 10 |
+
import zipfile
|
| 11 |
+
import logging
|
| 12 |
+
import re
|
| 13 |
+
import time
|
| 14 |
+
import math
|
| 15 |
+
from datetime import datetime
|
| 16 |
+
from pathlib import Path
|
| 17 |
+
from typing import Dict, List, Optional, Union, Any, Tuple
|
| 18 |
+
from functools import lru_cache
|
| 19 |
+
|
| 20 |
+
# Configure logging
|
| 21 |
+
logging.basicConfig(level=logging.INFO,
|
| 22 |
+
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
|
| 23 |
+
logger = logging.getLogger(__name__)
|
| 24 |
+
|
| 25 |
+
# Third-party imports
|
| 26 |
+
import numpy as np
|
| 27 |
+
|
| 28 |
+
# Mistral AI imports
|
| 29 |
+
from mistralai import DocumentURLChunk, ImageURLChunk, TextChunk
|
| 30 |
+
from mistralai.models import OCRImageObject
|
| 31 |
+
|
| 32 |
+
# Check for image processing libraries
|
| 33 |
+
try:
|
| 34 |
+
from PIL import Image, ImageEnhance, ImageFilter, ImageOps
|
| 35 |
+
PILLOW_AVAILABLE = True
|
| 36 |
+
except ImportError:
|
| 37 |
+
logger.warning("PIL not available - image preprocessing will be limited")
|
| 38 |
+
PILLOW_AVAILABLE = False
|
| 39 |
+
|
| 40 |
+
try:
|
| 41 |
+
import cv2
|
| 42 |
+
CV2_AVAILABLE = True
|
| 43 |
+
except ImportError:
|
| 44 |
+
logger.warning("OpenCV (cv2) not available - advanced image processing will be limited")
|
| 45 |
+
CV2_AVAILABLE = False
|
| 46 |
+
|
| 47 |
+
# Import configuration
|
| 48 |
+
try:
|
| 49 |
+
from config import IMAGE_PREPROCESSING
|
| 50 |
+
except ImportError:
|
| 51 |
+
# Fallback defaults if config not available
|
| 52 |
+
IMAGE_PREPROCESSING = {
|
| 53 |
+
"enhance_contrast": 1.5,
|
| 54 |
+
"sharpen": True,
|
| 55 |
+
"denoise": True,
|
| 56 |
+
"max_size_mb": 8.0,
|
| 57 |
+
"target_dpi": 300,
|
| 58 |
+
"compression_quality": 92
|
| 59 |
+
}
|
| 60 |
+
|
| 61 |
+
def detect_skew(image: Union[Image.Image, np.ndarray]) -> float:
    """
    Quick skew detection that returns angle in degrees.
    Uses a computationally efficient approach by analyzing at 1% resolution.

    Args:
        image: PIL Image or numpy array

    Returns:
        Estimated skew angle in degrees (positive or negative)
    """
    # Convert PIL Image to numpy array if needed
    if isinstance(image, Image.Image):
        # Convert to grayscale for processing
        if image.mode != 'L':
            img_np = np.array(image.convert('L'))
        else:
            img_np = np.array(image)
    else:
        # If already numpy array, ensure it's grayscale
        if len(image.shape) == 3:
            if CV2_AVAILABLE:
                img_np = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
            else:
                # Fallback grayscale conversion
                img_np = np.mean(image, axis=2).astype(np.uint8)
        else:
            img_np = image

    # Downsample to 1% resolution for faster processing
    height, width = img_np.shape
    target_size = int(min(width, height) * 0.01)

    # Use a sane minimum size and ensure we have enough pixels to detect lines
    target_size = max(target_size, 100)

    if CV2_AVAILABLE:
        # OpenCV-based implementation (faster)
        # Resize the image to the target size
        scale_factor = target_size / max(width, height)
        small_img = cv2.resize(img_np, None, fx=scale_factor, fy=scale_factor, interpolation=cv2.INTER_AREA)

        # Apply binary thresholding to get cleaner edges
        _, binary = cv2.threshold(small_img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

        # Use Hough Line Transform to detect lines
        lines = cv2.HoughLinesP(binary, 1, np.pi/180, threshold=target_size//10,
                                minLineLength=target_size//5, maxLineGap=target_size//10)

        if lines is None or len(lines) < 3:
            # Not enough lines detected, assume no significant skew
            return 0.0

        # Calculate angles of lines
        angles = []
        for line in lines:
            x1, y1, x2, y2 = line[0]
            if x2 - x1 == 0:  # Avoid division by zero
                continue
            angle = math.atan2(y2 - y1, x2 - x1) * 180.0 / np.pi

            # Normalize angle to -45 to 45 range
            angle = angle % 180
            if angle > 90:
                angle -= 180
            if angle > 45:
                angle -= 90
            if angle < -45:
                angle += 90

            angles.append(angle)

        if not angles:
            return 0.0

        # Use median to reduce impact of outliers
        angles.sort()
        median_angle = angles[len(angles) // 2]

        return median_angle
    else:
        # PIL-only fallback implementation
        # Resize using PIL
        small_img = Image.fromarray(img_np).resize(
            (int(width * target_size / max(width, height)),
             int(height * target_size / max(width, height))),
            Image.NEAREST
        )

        # Find edges
        edges = small_img.filter(ImageFilter.FIND_EDGES)
        edges_data = np.array(edges)

        # Simple edge orientation analysis (less precise than OpenCV)
        # Count horizontal vs vertical edges
        h_edges = np.sum(np.abs(np.diff(edges_data, axis=1)))
        v_edges = np.sum(np.abs(np.diff(edges_data, axis=0)))

        # If horizontal edges dominate, no significant skew
        if h_edges > v_edges * 1.2:
            return 0.0

        # Simple angle estimation based on edge distribution
        # This is a simplified approach that works for slight skews
        rows, cols = edges_data.shape
        xs, ys = [], []

        # Sample strong edge points
        for r in range(0, rows, 2):
            for c in range(0, cols, 2):
                if edges_data[r, c] > 128:
                    xs.append(c)
                    ys.append(r)

        if len(xs) < 10:  # Not enough edge points
            return 0.0

        # Use simple linear regression to estimate the slope
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n

        # Calculate slope
        numerator = sum((xs[i] - mean_x) * (ys[i] - mean_y) for i in range(n))
        denominator = sum((xs[i] - mean_x) ** 2 for i in range(n))

        if abs(denominator) < 1e-6:  # Avoid division by zero
            return 0.0

        slope = numerator / denominator
        angle = math.atan(slope) * 180.0 / math.pi

        # Normalize to -45 to 45 degrees
        if angle > 45:
            angle -= 90
        elif angle < -45:
            angle += 90

        return angle

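The angle-folding step in `detect_skew` is the part most easily gotten wrong, so here it is in isolation as a standalone sketch (`normalize_angle` is a hypothetical helper, not part of this diff): any line angle in degrees is folded into [-45, 45] so near-vertical and near-horizontal lines vote for the same small skew estimate.

```python
# Standalone sketch of the angle normalization used in detect_skew.
def normalize_angle(angle: float) -> float:
    angle = angle % 180          # collapse to [0, 180)
    if angle > 90:
        angle -= 180             # now in (-90, 90]
    if angle > 45:
        angle -= 90              # fold near-vertical lines down
    if angle < -45:
        angle += 90
    return angle

print(normalize_angle(92))   # 2  (near-vertical line -> small positive skew)
print(normalize_angle(178))  # -2 (near-horizontal line -> small negative skew)
```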
def replace_images_in_markdown(md: str, images: dict[str, str]) -> str:
    """
    Replace image placeholders in markdown with base64-encoded images.
    Uses regex-based matching to handle variations in image IDs and formats.

    Args:
        md: Markdown text containing image placeholders
        images: Dictionary mapping image IDs to base64 strings

    Returns:
        Markdown text with images replaced by base64 data
    """
    # Process each image ID in the dictionary
    for img_id, base64_str in images.items():
        # Extract the base ID without extension for more flexible matching
        base_id = img_id.split('.')[0]

        # Match markdown image pattern where URL contains the base ID
        # Using a single regex with groups to capture the full pattern
        pattern = re.compile(rf'!\[([^\]]*)\]\(([^\)]*{base_id}[^\)]*)\)')

        # Process all matches
        matches = list(pattern.finditer(md))
        for match in reversed(matches):  # Process in reverse to avoid offset issues
            # Replace the entire match with a properly formatted base64 image
            md = md[:match.start()] + f"![{match.group(1)}]({base64_str})" + md[match.end():]

    return md

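A standalone sketch of the same replacement strategy (the sample `md` and `images` values are made up; `re.sub` with a callback is an equivalent shortcut for the reversed-`finditer` loop above):

```python
import re

# Any markdown image whose URL mentions the image's base ID is swapped for
# the stored base64 data URL.
md = "Intro\n![img-0.jpeg](img-0.jpeg)\nOutro"
images = {"img-0.jpeg": "data:image/jpeg;base64,AAAA"}

for img_id, data_url in images.items():
    base_id = img_id.split('.')[0]
    pattern = re.compile(rf'!\[([^\]]*)\]\(([^\)]*{base_id}[^\)]*)\)')
    # Keep the alt text, replace only the URL part
    md = pattern.sub(lambda m: f"![{m.group(1)}]({data_url})", md)

print(md)
```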
def get_combined_markdown(ocr_response) -> str:
    """
    Combine OCR text and images into a single markdown document.

    Args:
        ocr_response: OCR response object from Mistral AI

    Returns:
        Combined markdown string with embedded images
    """
    markdowns = []

    # Process each page of the OCR response
    for page in ocr_response.pages:
        # Extract image data if available
        image_data = {}
        if hasattr(page, "images"):
            for img in page.images:
                if hasattr(img, "id") and hasattr(img, "image_base64"):
                    image_data[img.id] = img.image_base64

        # Replace image placeholders with base64 data
        page_markdown = page.markdown if hasattr(page, "markdown") else ""
        processed_markdown = replace_images_in_markdown(page_markdown, image_data)
        markdowns.append(processed_markdown)

    # Join all pages' markdown with double newlines
    return "\n\n".join(markdowns)

def encode_image_for_api(image_path: Union[str, Path]) -> str:
    """
    Encode an image as base64 data URL for API submission.

    Args:
        image_path: Path to the image file

    Returns:
        Base64 data URL for the image
    """
    # Convert to Path object if string
    image_file = Path(image_path) if isinstance(image_path, str) else image_path

    # Verify image exists
    if not image_file.is_file():
        raise FileNotFoundError(f"Image file not found: {image_file}")

    # Determine mime type based on file extension
    mime_type = 'image/jpeg'  # Default mime type
    suffix = image_file.suffix.lower()
    if suffix == '.png':
        mime_type = 'image/png'
    elif suffix == '.gif':
        mime_type = 'image/gif'
    elif suffix in ['.jpg', '.jpeg']:
        mime_type = 'image/jpeg'
    elif suffix == '.pdf':
        mime_type = 'application/pdf'

    # Encode image as base64
    encoded = base64.b64encode(image_file.read_bytes()).decode()
    return f"data:{mime_type};base64,{encoded}"

def encode_bytes_for_api(file_bytes: bytes, mime_type: str) -> str:
    """
    Encode binary data as base64 data URL for API submission.

    Args:
        file_bytes: Binary file data
        mime_type: MIME type of the file (e.g., 'image/jpeg', 'application/pdf')

    Returns:
        Base64 data URL for the data
    """
    # Encode data as base64
    encoded = base64.b64encode(file_bytes).decode()
    return f"data:{mime_type};base64,{encoded}"

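The data-URL shape both encoders produce can be checked in isolation with a tiny payload:

```python
import base64

# Same encoding step as encode_bytes_for_api, on a known payload.
payload = b"hello"
encoded = base64.b64encode(payload).decode()
data_url = f"data:application/pdf;base64,{encoded}"
print(data_url)  # data:application/pdf;base64,aGVsbG8=
```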
def calculate_image_entropy(pil_img: Image.Image) -> float:
    """
    Calculate the entropy of a PIL image.
    Entropy is a measure of randomness; low entropy indicates a blank or simple image,
    high entropy indicates more complex content (e.g., text or detailed images).

    Args:
        pil_img: PIL Image object

    Returns:
        float: Entropy value
    """
    # Convert to grayscale for entropy calculation
    gray_img = pil_img.convert("L")
    arr = np.array(gray_img)
    # Compute histogram
    hist, _ = np.histogram(arr, bins=256, range=(0, 255), density=True)
    # Remove zero entries to avoid log(0)
    hist = hist[hist > 0]
    # Calculate entropy
    entropy = -np.sum(hist * np.log2(hist))
    return float(entropy)

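The same histogram entropy can be exercised on raw numpy arrays without PIL (a standalone sketch; note that with `density=True` the histogram holds bin densities rather than probabilities, so a flat image lands near zero rather than at exactly zero):

```python
import numpy as np

# A flat image scores near-zero entropy; uniform noise scores much higher.
def entropy(arr: np.ndarray) -> float:
    hist, _ = np.histogram(arr, bins=256, range=(0, 255), density=True)
    hist = hist[hist > 0]
    return float(-np.sum(hist * np.log2(hist)))

blank = np.full((64, 64), 255, dtype=np.uint8)
noisy = np.random.default_rng(0).integers(0, 256, size=(64, 64), dtype=np.uint8)
print(entropy(blank) < 0.1, entropy(noisy) > 6)
```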
def serialize_ocr_object(obj):
    """
    Serialize OCR response objects to JSON serializable format.
    Handles OCRImageObject specifically to prevent serialization errors.

    Args:
        obj: The object to serialize

    Returns:
        JSON serializable representation of the object
    """
    # Fast path: Handle primitive types directly
    if obj is None or isinstance(obj, (str, int, float, bool)):
        return obj

    # Handle collections
    if isinstance(obj, list):
        return [serialize_ocr_object(item) for item in obj]
    elif isinstance(obj, dict):
        return {k: serialize_ocr_object(v) for k, v in obj.items()}
    elif isinstance(obj, OCRImageObject):
        # Special handling for OCRImageObject
        return {
            'id': obj.id if hasattr(obj, 'id') else None,
            'image_base64': obj.image_base64 if hasattr(obj, 'image_base64') else None
        }
    elif hasattr(obj, '__dict__'):
        # For objects with __dict__ attribute
        return {k: serialize_ocr_object(v) for k, v in obj.__dict__.items()
                if not k.startswith('_')}  # Skip private attributes
    else:
        # Try to convert to string as last resort
        try:
            return str(obj)
        except Exception:
            return None

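The recursion above, demonstrated on a plain stand-in class instead of the Mistral `OCRImageObject` (`Obj` and `serialize` are hypothetical names for this sketch): private attributes are dropped, containers are walked recursively.

```python
# Stand-in object with public, private, and nested attributes.
class Obj:
    def __init__(self):
        self.id = "img-0"
        self._private = "hidden"          # skipped: leading underscore
        self.nested = [1, {"k": "v"}]     # walked recursively

def serialize(obj):
    if obj is None or isinstance(obj, (str, int, float, bool)):
        return obj
    if isinstance(obj, list):
        return [serialize(x) for x in obj]
    if isinstance(obj, dict):
        return {k: serialize(v) for k, v in obj.items()}
    if hasattr(obj, '__dict__'):
        return {k: serialize(v) for k, v in obj.__dict__.items()
                if not k.startswith('_')}
    return str(obj)

print(serialize(Obj()))  # {'id': 'img-0', 'nested': [1, {'k': 'v'}]}
```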
def format_ocr_text(text):
    """
    Format OCR text with simple, predictable rules that ensure consistency.
    This formats ALL CAPS lines as bold markdown and preserves the rest.

    Args:
        text: Text content to format

    Returns:
        Formatted text with consistent styling
    """
    if not isinstance(text, str):
        return text

    lines = text.split('\n')
    processed_lines = []
    for line in lines:
        line_stripped = line.strip()
        if line_stripped and line_stripped.isupper() and len(line_stripped) > 3:
            processed_lines.append(f"**{line_stripped}**")
        else:
            processed_lines.append(line)

    return '\n'.join(processed_lines)

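The heading rule in isolation (`bold_caps` is a hypothetical standalone version of the loop above): ALL-CAPS lines longer than three characters become bold markdown, everything else passes through untouched.

```python
def bold_caps(text: str) -> str:
    out = []
    for line in text.split('\n'):
        s = line.strip()
        # Length check keeps short acronyms/numerals (e.g. "IV") unbolded.
        out.append(f"**{s}**" if s and s.isupper() and len(s) > 3 else line)
    return '\n'.join(out)

print(bold_caps("CHAPTER ONE\nIt was a dark night.\nIV"))
```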
def create_results_zip(results, output_dir=None, zip_name=None):
    """
    Create a zip file containing OCR results.

    Args:
        results: Dictionary or list of OCR results
        output_dir: Optional output directory
        zip_name: Optional zip file name

    Returns:
        Path to the created zip file
    """
    # Create temporary output directory if not provided
    if output_dir is None:
        output_dir = Path.cwd() / "output"
        output_dir.mkdir(exist_ok=True)
    else:
        output_dir = Path(output_dir)
        output_dir.mkdir(exist_ok=True)

    # Generate zip name if not provided
    if zip_name is None:
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

        if isinstance(results, list):
            # For a list of results, create a descriptive name
            file_count = len(results)
            zip_name = f"ocr_results_{file_count}_{timestamp}.zip"
        else:
            # For single result, create descriptive filename
            base_name = results.get('file_name', 'document').split('.')[0]
            zip_name = f"{base_name}_{timestamp}.zip"

    try:
        # Get zip data in memory first
        zip_data = create_results_zip_in_memory(results)

        # Save to file
        zip_path = output_dir / zip_name
        with open(zip_path, 'wb') as f:
            f.write(zip_data)

        return zip_path
    except Exception as e:
        # Create an empty zip file as fallback
        logger.error(f"Error creating zip file: {str(e)}")
        zip_path = output_dir / zip_name
        with zipfile.ZipFile(zip_path, 'w') as zipf:
            zipf.writestr("info.txt", "Could not create complete archive")

        return zip_path

def create_results_zip_in_memory(results):
    """
    Create a zip file containing OCR results in memory.

    Args:
        results: Dictionary or list of OCR results

    Returns:
        Binary zip file data
    """
    # Create a BytesIO object
    zip_buffer = io.BytesIO()

    # Check if results is a list or a dictionary
    is_list = isinstance(results, list)

    # Create zip file in memory
    with zipfile.ZipFile(zip_buffer, 'w', zipfile.ZIP_DEFLATED) as zipf:
        if is_list:
            # Handle list of results
            for i, result in enumerate(results):
                try:
                    # Create a descriptive base filename for this result
                    base_filename = result.get('file_name', f'document_{i+1}').split('.')[0]

                    # Add document type if available
                    if 'topics' in result and result['topics']:
                        topic = result['topics'][0].lower().replace(' ', '_')
                        base_filename = f"{base_filename}_{topic}"

                    # Add language if available
                    if 'languages' in result and result['languages']:
                        lang = result['languages'][0].lower()
                        # Only add if it's not already in the filename
                        if lang not in base_filename.lower():
                            base_filename = f"{base_filename}_{lang}"

                    # For PDFs, add page information
                    if 'limited_pages' in result:
                        base_filename = f"{base_filename}_p{result['limited_pages']['processed']}of{result['limited_pages']['total']}"

                    # Add timestamp if available
                    if 'timestamp' in result:
                        try:
                            # Try to parse the timestamp and reformat it
                            dt = datetime.strptime(result['timestamp'], "%Y-%m-%d %H:%M")
                            timestamp = dt.strftime("%Y%m%d_%H%M%S")
                            base_filename = f"{base_filename}_{timestamp}"
                        except Exception:
                            pass

                    # Add JSON results for each file with descriptive name
                    result_json = json.dumps(result, indent=2)
                    zipf.writestr(f"{base_filename}.json", result_json)

                    # Add HTML content (generated from the result)
                    html_content = create_html_with_images(result)
                    zipf.writestr(f"{base_filename}.html", html_content)

                    # Add raw OCR text if available
                    if "ocr_contents" in result and "raw_text" in result["ocr_contents"]:
                        zipf.writestr(f"{base_filename}.txt", result["ocr_contents"]["raw_text"])

                except Exception as e:
                    # If any result fails, skip it and continue
                    logger.warning(f"Failed to process result for zip: {str(e)}")
                    continue
        else:
            # Handle single result
            try:
                # Create a descriptive base filename for this result
                base_filename = results.get('file_name', 'document').split('.')[0]

                # Add document type if available
                if 'topics' in results and results['topics']:
                    topic = results['topics'][0].lower().replace(' ', '_')
                    base_filename = f"{base_filename}_{topic}"

                # Add language if available
                if 'languages' in results and results['languages']:
                    lang = results['languages'][0].lower()
                    # Only add if it's not already in the filename
                    if lang not in base_filename.lower():
                        base_filename = f"{base_filename}_{lang}"

                # For PDFs, add page information
                if 'limited_pages' in results:
                    base_filename = f"{base_filename}_p{results['limited_pages']['processed']}of{results['limited_pages']['total']}"

                # Add timestamp if available
                if 'timestamp' in results:
                    try:
                        # Try to parse the timestamp and reformat it
                        dt = datetime.strptime(results['timestamp'], "%Y-%m-%d %H:%M")
                        timestamp = dt.strftime("%Y%m%d_%H%M%S")
                        base_filename = f"{base_filename}_{timestamp}"
                    except Exception:
                        # If parsing fails, create a new timestamp
                        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
                        base_filename = f"{base_filename}_{timestamp}"
                else:
                    # No timestamp in the result, create a new one
                    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
                    base_filename = f"{base_filename}_{timestamp}"

                # Add JSON results with descriptive name
                results_json = json.dumps(results, indent=2)
                zipf.writestr(f"{base_filename}.json", results_json)

                # Add HTML content with descriptive name
                html_content = create_html_with_images(results)
                zipf.writestr(f"{base_filename}.html", html_content)

                # Add raw OCR text if available
                if "ocr_contents" in results and "raw_text" in results["ocr_contents"]:
                    zipf.writestr(f"{base_filename}.txt", results["ocr_contents"]["raw_text"])

            except Exception as e:
                # If processing fails, log the error
                logger.error(f"Failed to create zip file: {str(e)}")

    # Seek to the beginning of the BytesIO object
    zip_buffer.seek(0)

    # Return the zip file bytes
    return zip_buffer.getvalue()

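The in-memory zip pattern above, reduced to its core: write entries into a `BytesIO` buffer, then hand back the raw bytes (file names here are made up).

```python
import io
import zipfile

# Build the archive entirely in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("doc.json", '{"ok": true}')
    zf.writestr("doc.txt", "raw text")
data = buf.getvalue()

# Reopening from the returned bytes shows both entries survived.
with zipfile.ZipFile(io.BytesIO(data)) as zf:
    print(sorted(zf.namelist()))  # ['doc.json', 'doc.txt']
```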
def create_html_with_images(result):
    """
    Create a clean HTML document from OCR results that properly preserves page references
    and text structure, without any document-specific special cases.

    Args:
        result: OCR result dictionary

    Returns:
        HTML content as string
    """
    # Import content utils to use classification functions
    try:
        from utils.content_utils import classify_document_content, extract_document_text, extract_image_description
        content_utils_available = True
    except ImportError:
        content_utils_available = False

    # Get content classification
    has_text = True
    has_images = False
    has_page_refs = False

    if content_utils_available:
        classification = classify_document_content(result)
        has_text = classification['has_content']
        has_images = result.get('has_images', False)
        has_page_refs = False
    else:
        # Minimal fallback detection
        if 'has_images' in result:
            has_images = result['has_images']

        # Check for image data more thoroughly
        if 'pages_data' in result and isinstance(result['pages_data'], list):
            for page in result['pages_data']:
                if isinstance(page, dict) and 'images' in page and page['images']:
                    has_images = True
                    break

    # Start building the HTML document
    html = [
        '<!DOCTYPE html>',
        '<html lang="en">',
        '<head>',
        '    <meta charset="UTF-8">',
        '    <meta name="viewport" content="width=device-width, initial-scale=1.0">',
        f'    <title>{result.get("file_name", "Document")}</title>',
        '    <style>',
        '        body {',
        '            font-family: Georgia, serif;',
        '            line-height: 1.6;',
        '            color: #333;',
        '            max-width: 800px;',
        '            margin: 0 auto;',
        '            padding: 20px;',
        '        }',
        '        h1, h2, h3, h4 {',
        '            color: #222;',
        '            margin-top: 1.5em;',
        '            margin-bottom: 0.5em;',
        '        }',
        '        h1 { font-size: 24px; }',
        '        h2 { font-size: 22px; }',
        '        h3 { font-size: 20px; }',
        '        h4 { font-size: 18px; }',
        '        p { margin: 1em 0; }',
        '        .metadata {',
        '            background-color: #f8f9fa;',
        '            border: 1px solid #eaecef;',
        '            border-radius: 6px;',
        '            padding: 15px;',
        '            margin-bottom: 20px;',
        '        }',
        '        .metadata p { margin: 5px 0; }',
        '        img {',
        '            max-width: 100%;',
        '            height: auto;',
        '            display: block;',
        '            margin: 20px auto;',
        '            border: 1px solid #ddd;',
        '            border-radius: 4px;',
        '        }',
        '        .image-container {',
        '            margin: 20px 0;',
        '            text-align: center;',
        '        }',
        '        .image-caption {',
        '            font-size: 0.9em;',
        '            text-align: center;',
        '            color: #666;',
        '            margin-top: 5px;',
        '        }',
        '        .text-block {',
        '            margin: 10px 0;',
        '        }',
        '        .page-ref {',
        '            font-weight: bold;',
        '            color: #555;',
        '        }',
        '        .separator {',
        '            border-top: 1px solid #eaecef;',
        '            margin: 30px 0;',
        '        }',
        '    </style>',
        '</head>',
        '<body>'
    ]

    # Add document metadata
    html.append('<div class="metadata">')
    html.append(f'<h1>{result.get("file_name", "Document")}</h1>')

    # Add timestamp
    if 'timestamp' in result:
        html.append(f'<p><strong>Processed:</strong> {result["timestamp"]}</p>')

    # Add languages if available
    if 'languages' in result and result['languages']:
        languages = [lang for lang in result['languages'] if lang]
        if languages:
            html.append(f'<p><strong>Languages:</strong> {", ".join(languages)}</p>')

    # Add document type and topics
    if 'detected_document_type' in result:
        html.append(f'<p><strong>Document Type:</strong> {result["detected_document_type"]}</p>')

    if 'topics' in result and result['topics']:
        html.append(f'<p><strong>Topics:</strong> {", ".join(result["topics"])}</p>')

    html.append('</div>')  # Close metadata div

    # Document title - extract from result if available
    if 'ocr_contents' in result and 'title' in result['ocr_contents'] and result['ocr_contents']['title']:
        title_content = result['ocr_contents']['title']
        # No special handling for any specific document types
        html.append(f'<h2>{title_content}</h2>')

    # Add images if present
    if has_images and 'pages_data' in result:
        html.append('<h3>Images</h3>')

        # Extract and display all images
        for page_idx, page in enumerate(result['pages_data']):
            if 'images' in page and isinstance(page['images'], list):
                for img_idx, img in enumerate(page['images']):
                    if 'image_base64' in img and img['image_base64']:
                        # Image container
                        html.append('<div class="image-container">')
                        html.append(f'<img src="{img["image_base64"]}" alt="Image {page_idx+1}-{img_idx+1}">')

                        # Generic caption based on index
                        html.append(f'<div class="image-caption">img-{img_idx}.jpeg</div>')
                        html.append('</div>')

        # Add image description if available through utils
        if content_utils_available:
            description = extract_image_description(result)
            if description:
                html.append('<div class="text-block">')
                html.append(f'<p>{description}</p>')
                html.append('</div>')

        html.append('<hr class="separator">')

    # Add document text section
    html.append('<h3>Text</h3>')

    # Extract text content systematically
    text_content = ""

    if content_utils_available:
        # Use the systematic utility function
        text_content = extract_document_text(result)
    else:
        # Fallback extraction logic
        if 'ocr_contents' in result:
            for field in ["main_text", "content", "text", "transcript", "raw_text"]:
                if field in result['ocr_contents'] and result['ocr_contents'][field]:
                    content = result['ocr_contents'][field]
                    if isinstance(content, str) and content.strip():
                        text_content = content
                        break
                    elif isinstance(content, dict):
                        # Try to convert complex objects to string
                        try:
                            text_content = json.dumps(content, indent=2)
                            break
                        except Exception:
                            pass

    # Process text content for HTML display
    if text_content:
        # Clean the text but preserve page references
        text_content = text_content.replace('\r\n', '\n')

        # Preserve page references by wrapping them in HTML tags
        if has_page_refs:
            # Highlight common page reference patterns
            page_patterns = [
                (r'(page\s+\d+)', r'<span class="page-ref">\1</span>'),
                (r'(p\.\s*\d+)', r'<span class="page-ref">\1</span>'),
                (r'(p\s+\d+)', r'<span class="page-ref">\1</span>'),
                (r'(\[\s*\d+\s*\])', r'<span class="page-ref">\1</span>'),
                (r'(\(\s*\d+\s*\))', r'<span class="page-ref">\1</span>'),
                (r'(folio\s+\d+)', r'<span class="page-ref">\1</span>'),
                (r'(f\.\s*\d+)', r'<span class="page-ref">\1</span>'),
                (r'(pg\.\s*\d+)', r'<span class="page-ref">\1</span>')
            ]

            for pattern, replacement in page_patterns:
                text_content = re.sub(pattern, replacement, text_content, flags=re.IGNORECASE)

        # Convert newlines to paragraphs
        paragraphs = text_content.split('\n\n')
        paragraphs = [p for p in paragraphs if p.strip()]

        html.append('<div class="text-block">')
        for paragraph in paragraphs:
            # Check if paragraph contains multiple lines
            if '\n' in paragraph:
                lines = paragraph.split('\n')
                lines = [line for line in lines if line.strip()]

                # Convert each line to a paragraph
                for line in lines:
                    html.append(f'<p>{line}</p>')
            else:
                html.append(f'<p>{paragraph}</p>')
        html.append('</div>')
    else:
        html.append('<p>No text content available.</p>')

    # Close the HTML document
    html.append('</body>')
    html.append('</html>')

    return '\n'.join(html)

def clean_ocr_result(result: dict,
                     use_segmentation: bool = False,
                     vision_enabled: bool = True) -> dict:
    """
    1. Replace or strip markdown image refs (e.g. ![](img-0.jpeg))
    2. Collapse pages that are *only* an illustration into a single
       `illustrations` bucket when vision is off
    3. Normalise `ocr_contents` keys to always have at least `raw_text`
    """
    if 'pages_data' in result:
        # Build a dict {id: base64} for quick look-ups
        image_dict = {
            img['id']: img['image_base64']
            for page in result['pages_data']
            for img in page.get('images', [])
        }

        # --- 1 · replace or drop image placeholders ---
        def _scrub(markdown: str) -> str:
            if vision_enabled and image_dict:
                return replace_images_in_markdown(markdown, image_dict)
            # no vision / no images → drop the line
            return re.sub(r'!\[[^\]]*\]\(img-\d+\.\w+\)', '', markdown)

        for page in result['pages_data']:
            page['markdown'] = _scrub(page.get('markdown', ''))

    # --- 2 · group illustration-only pages when vision is off ---
    if not vision_enabled and 'pages_data' in result:
        text_pages, art_pages = [], []
        for p in result['pages_data']:
            has_text = p.get('markdown', '').strip()
            (text_pages if has_text else art_pages).append(p)
        result['pages_data'] = text_pages
        if art_pages:
            # keep one thumbnail under metadata
            result.setdefault('illustrations', []).extend(art_pages)

    # --- 3 · ensure raw_text key ---
    if 'ocr_contents' in result and 'raw_text' not in result['ocr_contents']:
        # First, try to extract any embedded text from image references
        raw_text_parts = []

        for page in result.get('pages_data', []):
            markdown = page.get('markdown', '')
            # Check if the markdown contains image references
            img_refs = re.findall(r'!\[([^\]]*)\]\(([^\)]*)\)', markdown)

            # Process each image reference to extract text content
            if img_refs:
                for alt_text, img_url in img_refs:
                    # If alt text contains actual text content (not just image ID), add it
                    if alt_text and not alt_text.endswith(('.jpeg', '.jpg', '.png')):
                        # Clean up the alt text and add it as text content
                        alt_text = alt_text.strip()
                        if alt_text and len(alt_text) > 3:  # Only add if meaningful
                            raw_text_parts.append(alt_text)

            # Remove image references from markdown
            cleaned_markdown = re.sub(r'!\[[^\]]*\]\([^\)]*\)', '', markdown)

            # Add any remaining text content
            if cleaned_markdown.strip():
                raw_text_parts.append(cleaned_markdown.strip())

        # Join all extracted text content
        if raw_text_parts:
            result['ocr_contents']['raw_text'] = "\n\n".join(raw_text_parts)
        else:
            # Fallback: use original method if no text was extracted
            joined = "\n".join(p.get('markdown', '') for p in result.get('pages_data', []))
            # Final cleanup of image references
            joined = re.sub(r'!\[[^\]]*\]\([^\)]*\)', '', joined)
            result['ocr_contents']['raw_text'] = joined

    return result
|
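For reference, the raw_text normalisation (step 3 above) can be exercised on its own. The sketch below re-implements just the alt-text harvesting and placeholder stripping; `extract_raw_text` is an illustrative helper name, not an export of this module:

```python
import re

def extract_raw_text(pages):
    """Mimic step 3: harvest meaningful alt text, strip image refs, keep the rest."""
    parts = []
    for markdown in pages:
        # Alt text that is not just an image filename counts as text content
        for alt_text, _url in re.findall(r'!\[([^\]]*)\]\(([^\)]*)\)', markdown):
            alt_text = alt_text.strip()
            if alt_text and not alt_text.endswith(('.jpeg', '.jpg', '.png')) and len(alt_text) > 3:
                parts.append(alt_text)
        # Drop the image references themselves, keep surrounding prose
        cleaned = re.sub(r'!\[[^\]]*\]\([^\)]*\)', '', markdown).strip()
        if cleaned:
            parts.append(cleaned)
    return "\n\n".join(parts)

pages = ["A receipt dated 1872.\n![img-0.jpeg](img-0.jpeg)",
         "![Handwritten marginal note](img-1.jpeg)"]
print(extract_raw_text(pages))
```

Note how the filename-style alt text on the first page is discarded while the descriptive alt text on the second survives as recovered content.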
utils/text_utils.py
ADDED
@@ -0,0 +1,151 @@
"""Text utility functions for OCR processing"""

import re

def clean_raw_text(text):
    """Clean raw text by removing image references and serialized data.

    Args:
        text (str): The text to clean

    Returns:
        str: The cleaned text
    """
    if not text or not isinstance(text, str):
        return ""

    # # Remove image references like ![image](data:image/...)
    # text = re.sub(r'!\[.*?\]\(data:image/[^)]+\)', '', text)

    # # Remove basic markdown image references like ![alt](url)
    # text = re.sub(r'!\[[^\]]*\]\([^)]+\)', '', text)

    # # Remove base64 encoded image data
    # text = re.sub(r'data:image/[^;]+;base64,[a-zA-Z0-9+/=]+', '', text)

    # # Remove image object references like [[OCRImageObject:...]]
    # text = re.sub(r'\[\[OCRImageObject:[^\]]+\]\]', '', text)

    # # Clean up any JSON-like image object references
    # text = re.sub(r'{"image(_data)?":("[^"]*"|null|true|false|\{[^}]*\}|\[[^\]]*\])}', '', text)

    # # Clean up excessive whitespace and line breaks created by removals
    # text = re.sub(r'\n{3,}', '\n\n', text)
    # text = re.sub(r'\s{3,}', ' ', text)

    return text.strip()

def format_markdown_text(text):
    """Format text with markdown and handle special patterns

    Args:
        text (str): The text to format

    Returns:
        str: The formatted markdown text
    """
    if not text:
        return ""

    # First, ensure we're working with a string
    if not isinstance(text, str):
        text = str(text)

    # Ensure newlines are preserved for proper spacing
    # Convert any Windows line endings to Unix
    text = text.replace('\r\n', '\n')

    # Format dates (MM/DD/YYYY or similar patterns)
    date_pattern = r'\b(0?[1-9]|1[0-2])[\/\-\.](0?[1-9]|[12][0-9]|3[01])[\/\-\.](\d{4}|\d{2})\b'
    text = re.sub(date_pattern, r'**\g<0>**', text)

    # Detect markdown tables and preserve them
    table_sections = []
    non_table_lines = []
    in_table = False
    table_buffer = []

    # Process text line by line, preserving tables
    lines = text.split('\n')
    for i, line in enumerate(lines):
        line_stripped = line.strip()

        # Detect table rows by pipe character
        if '|' in line_stripped and (line_stripped.startswith('|') or line_stripped.endswith('|')):
            if not in_table:
                in_table = True
                if table_buffer:
                    table_buffer = []
            table_buffer.append(line)

            # Check if the next line is a table separator
            if i < len(lines) - 1 and '---' in lines[i+1] and '|' in lines[i+1]:
                table_buffer.append(lines[i+1])

        # Detect table separators (---|---|---)
        elif in_table and '---' in line_stripped and '|' in line_stripped:
            table_buffer.append(line)

        # End of table detection
        elif in_table:
            # Check if this is still part of the table
            next_line_is_table = False
            if i < len(lines) - 1:
                next_line = lines[i+1].strip()
                if '|' in next_line and (next_line.startswith('|') or next_line.endswith('|')):
                    next_line_is_table = True

            if not next_line_is_table:
                in_table = False
                # Save the complete table
                if table_buffer:
                    table_sections.append('\n'.join(table_buffer))
                    table_buffer = []
                # Add current line to non-table lines
                non_table_lines.append(line)
            else:
                # Still part of the table
                table_buffer.append(line)
        else:
            # Not in a table
            non_table_lines.append(line)

    # Handle any remaining table buffer
    if in_table and table_buffer:
        table_sections.append('\n'.join(table_buffer))

    # Process non-table lines
    processed_lines = []
    for line in non_table_lines:
        line_stripped = line.strip()

        # Check if line is in ALL CAPS (and not just a short acronym)
        if line_stripped and line_stripped.isupper() and len(line_stripped) > 3:
            # ALL CAPS line - make bold instead of heading to prevent large display
            processed_lines.append(f"**{line_stripped}**")
        # Process potential headers (lines ending with colon)
        elif line_stripped and line_stripped.endswith(':') and len(line_stripped) < 40:
            # Likely a header - make it bold
            processed_lines.append(f"**{line_stripped}**")
        else:
            # Keep original line with its spacing
            processed_lines.append(line)

    # Join non-table lines
    processed_text = '\n'.join(processed_lines)

    # Reinsert tables in the right positions
    for table in table_sections:
        # Generate a unique marker for this table
        marker = f"__TABLE_MARKER_{hash(table) % 10000}__"
        # Find a good position to insert this table
        # For now, just append all tables at the end
        processed_text += f"\n\n{table}\n\n"

    # Make sure paragraphs have proper spacing but not excessive
    processed_text = re.sub(r'\n{3,}', '\n\n', processed_text)

    # Ensure two newlines between paragraphs for proper markdown rendering
    processed_text = re.sub(r'([^\n])\n([^\n])', r'\1\n\n\2', processed_text)

    return processed_text
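The two per-line heuristics in `format_markdown_text` (date bolding, ALL-CAPS and short colon-terminated headers) can be sanity-checked standalone. The `emphasize` helper below is a hypothetical re-implementation of just those rules, not an export of this module:

```python
import re

# Same date pattern as format_markdown_text: MM/DD/YYYY with /, -, or . separators
DATE = r'\b(0?[1-9]|1[0-2])[\/\-\.](0?[1-9]|[12][0-9]|3[01])[\/\-\.](\d{4}|\d{2})\b'

def emphasize(line):
    # Bold dates in place, then promote ALL-CAPS lines and short
    # trailing-colon headers to bold, mirroring the per-line rules.
    line = re.sub(DATE, r'**\g<0>**', line)
    stripped = line.strip()
    if stripped and stripped.isupper() and len(stripped) > 3:
        return f"**{stripped}**"
    if stripped.endswith(':') and len(stripped) < 40:
        return f"**{stripped}**"
    return line

print(emphasize("Issued 12/31/1899 in Boston"))  # Issued **12/31/1899** in Boston
print(emphasize("SHIPPING MANIFEST"))            # **SHIPPING MANIFEST**
```

Bolding rather than `#`-style headings keeps promoted lines from rendering oversized in Streamlit, which is the rationale stated in the module's comments.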
utils/ui_utils.py
ADDED
@@ -0,0 +1,413 @@
"""
UI utilities for OCR results display.
"""
import streamlit as st
import json
import base64
import io
from datetime import datetime

from utils.image_utils import format_ocr_text, create_html_with_images
from utils.content_utils import classify_document_content, format_structured_data

def display_results(result, container, custom_prompt=""):
    """Display OCR results in the provided container"""
    with container:
        # Add heading for document metadata
        st.markdown("### Document Metadata")

        # Filter out large data structures from metadata display
        meta = {k: v for k, v in result.items()
                if k not in ['pages_data', 'illustrations', 'ocr_contents', 'raw_response_data']}

        # Create a compact metadata section
        meta_html = '<div style="display: flex; flex-wrap: wrap; gap: 0.3rem; margin-bottom: 0.3rem;">'

        # Document type
        if 'detected_document_type' in meta:
            meta_html += f'<div><strong>Type:</strong> {meta["detected_document_type"]}</div>'

        # Processing time
        if 'processing_time' in meta:
            meta_html += f'<div><strong>Time:</strong> {meta["processing_time"]:.1f}s</div>'

        # Page information
        if 'limited_pages' in meta:
            meta_html += f'<div><strong>Pages:</strong> {meta["limited_pages"]["processed"]}/{meta["limited_pages"]["total"]}</div>'

        meta_html += '</div>'
        st.markdown(meta_html, unsafe_allow_html=True)

        # Language metadata on a separate line, Subject Tags below

        # First show languages if available
        if 'languages' in result and result['languages']:
            languages = [lang for lang in result['languages'] if lang is not None]
            if languages:
                # Create a dedicated line for Languages
                lang_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
                lang_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Language:</div>'

                # Add language tags
                for lang in languages:
                    # Clean language name if needed
                    clean_lang = str(lang).strip()
                    if clean_lang:  # Only add if not empty
                        lang_html += f'<span class="subject-tag tag-language">{clean_lang}</span>'

                lang_html += '</div>'
                st.markdown(lang_html, unsafe_allow_html=True)

        # Create a separate line for Time if we have time-related tags
        if 'topics' in result and result['topics']:
            time_tags = [topic for topic in result['topics']
                         if any(term in topic.lower() for term in ["century", "pre-", "era", "historical"])]
            if time_tags:
                time_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
                time_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Time:</div>'
                for tag in time_tags:
                    time_html += f'<span class="subject-tag tag-time-period">{tag}</span>'
                time_html += '</div>'
                st.markdown(time_html, unsafe_allow_html=True)

        # Then display remaining subject tags if available
        if 'topics' in result and result['topics']:
            # Filter out time-related tags which are already displayed
            subject_tags = [topic for topic in result['topics']
                            if not any(term in topic.lower() for term in ["century", "pre-", "era", "historical"])]

            if subject_tags:
                # Create a separate line for Subject Tags
                tags_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
                tags_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Subject Tags:</div>'
                tags_html += '<div style="display: flex; flex-wrap: wrap; gap: 2px; align-items: center;">'

                # Generate a badge for each remaining tag
                for topic in subject_tags:
                    # Determine tag category class
                    tag_class = "subject-tag"  # Default class

                    # Add specialized class based on category
                    if any(term in topic.lower() for term in ["language", "english", "french", "german", "latin"]):
                        tag_class += " tag-language"  # Languages
                    elif any(term in topic.lower() for term in ["letter", "newspaper", "book", "form", "document", "recipe"]):
                        tag_class += " tag-document-type"  # Document types
                    elif any(term in topic.lower() for term in ["travel", "military", "science", "medicine", "education", "art", "literature"]):
                        tag_class += " tag-subject"  # Subject domains

                    # Add each tag as an inline span
                    tags_html += f'<span class="{tag_class}">{topic}</span>'

                # Close the containers
                tags_html += '</div></div>'

                # Render the subject tags section
                st.markdown(tags_html, unsafe_allow_html=True)

        # Check if we have OCR content
        if 'ocr_contents' in result:
            # Create a single view instead of tabs
            content_tab1 = st.container()

            # Check for images in the result to use later
            has_images = result.get('has_images', False)
            has_image_data = ('pages_data' in result and any(page.get('images', []) for page in result.get('pages_data', [])))
            has_raw_images = ('raw_response_data' in result and 'pages' in result['raw_response_data'] and
                              any('images' in page for page in result['raw_response_data']['pages']
                                  if isinstance(page, dict)))

            # Display structured content
            with content_tab1:
                # Display structured content with markdown formatting
                if isinstance(result['ocr_contents'], dict):
                    # CSS is now handled in the main layout.py file

                    # Collect all available images from the result
                    available_images = []
                    if has_images and 'pages_data' in result:
                        for page_idx, page in enumerate(result['pages_data']):
                            if 'images' in page and len(page['images']) > 0:
                                for img_idx, img in enumerate(page['images']):
                                    if 'image_base64' in img:
                                        available_images.append({
                                            'source': 'pages_data',
                                            'page': page_idx,
                                            'index': img_idx,
                                            'data': img['image_base64']
                                        })

                    # Get images from raw response as well
                    if 'raw_response_data' in result:
                        raw_data = result['raw_response_data']
                        if isinstance(raw_data, dict) and 'pages' in raw_data:
                            for page_idx, page in enumerate(raw_data['pages']):
                                if isinstance(page, dict) and 'images' in page:
                                    for img_idx, img in enumerate(page['images']):
                                        if isinstance(img, dict) and 'base64' in img:
                                            available_images.append({
                                                'source': 'raw_response',
                                                'page': page_idx,
                                                'index': img_idx,
                                                'data': img['base64']
                                            })

                    # Extract images for display at the top
                    images_to_display = []

                    # First, collect all available images
                    for img_idx, img in enumerate(available_images):
                        if 'data' in img:
                            images_to_display.append({
                                'data': img['data'],
                                'id': img.get('id', f"img_{img_idx}"),
                                'index': img_idx
                            })

                    # Image display now only happens in the Images tab

                    # Organize sections in a logical order - prioritize main_text
                    section_order = ["title", "author", "date", "summary", "main_text", "content", "transcript", "metadata"]
                    ordered_sections = []

                    # Add known sections first in preferred order
                    for section_name in section_order:
                        if section_name in result['ocr_contents'] and result['ocr_contents'][section_name]:
                            ordered_sections.append(section_name)

                    # Add any remaining sections
                    for section in result['ocr_contents'].keys():
                        if (section not in ordered_sections and
                            section not in ['error', 'partial_text'] and
                            result['ocr_contents'][section]):
                            ordered_sections.append(section)

                    # If only raw_text is available and no other content, add it last
                    if ('raw_text' in result['ocr_contents'] and
                        result['ocr_contents']['raw_text'] and
                        len(ordered_sections) == 0):
                        ordered_sections.append('raw_text')

                    # Add minimal spacing before OCR results
                    st.markdown("<div style='margin: 8px 0 4px 0;'></div>", unsafe_allow_html=True)

                    # Create tabs for different views
                    if has_images:
                        tabs = st.tabs(["Document Content", "Raw JSON", "Images"])
                        doc_tab, json_tab, img_tab = tabs
                    else:
                        tabs = st.tabs(["Document Content", "Raw JSON"])
                        doc_tab, json_tab = tabs
                        img_tab = None

                    # Document Content tab with simplified and systematic content handling
                    with doc_tab:
                        # Classify document content using our utility function
                        content_classification = classify_document_content(result)

                        # Track what content has been displayed to avoid redundancy
                        displayed_content = set()

                        # Create a single unified content section
                        st.markdown("#### Document Content")
                        st.markdown("##### Title")

                        # Extract main structured content fields without redundancy
                        text_fields = {}

                        # Use the exact same approach as in Previous Results tab for consistency
                        # Create a more focused list of important sections - prioritize main_text
                        priority_sections = ["title", "main_text", "content", "transcript", "summary"]
                        displayed_sections = set()

                        # First display priority sections
                        for section in priority_sections:
                            if section in result['ocr_contents'] and result['ocr_contents'][section]:
                                content = result['ocr_contents'][section]
                                if isinstance(content, str) and content.strip():
                                    # Only add a subheader for meaningful section names, not raw_text
                                    if section != "raw_text" and section != "title":
                                        st.markdown(f"##### {section.replace('_', ' ').title()}")

                                    # Format and display content
                                    # First format any structured data (lists, dicts)
                                    structured_content = format_structured_data(content)
                                    # Then apply regular OCR text formatting
                                    formatted_content = format_ocr_text(structured_content)
                                    st.markdown(formatted_content)
                                    displayed_sections.add(section)
                                    break
                                elif isinstance(content, dict):
                                    # Display dictionary content as key-value pairs
                                    for k, v in content.items():
                                        if k not in ['error', 'partial_text'] and v:
                                            st.markdown(f"**{k.replace('_', ' ').title()}**")
                                            if isinstance(v, str):
                                                # Format any structured data in the string
                                                formatted_v = format_structured_data(v)
                                                st.markdown(format_ocr_text(formatted_v))
                                            else:
                                                # Format non-string values (lists, dicts)
                                                formatted_v = format_structured_data(v)
                                                st.markdown(formatted_v)
                                    displayed_sections.add(section)
                                    break
                                elif isinstance(content, list):
                                    # Format and display list items using our structured formatter
                                    formatted_list = format_structured_data(content)
                                    st.markdown(formatted_list)
                                    displayed_sections.add(section)
                                    break

                        # Then display any remaining sections not already shown
                        for section, content in result['ocr_contents'].items():
                            if (section not in displayed_sections and
                                section not in ['error', 'partial_text'] and
                                content):
                                st.markdown(f"##### {section.replace('_', ' ').title()}")

                                if isinstance(content, str):
                                    # Format any structured data in the string before display
                                    structured_content = format_structured_data(content)
                                    st.markdown(format_ocr_text(structured_content))
                                elif isinstance(content, list):
                                    # Format list using our structured formatter
                                    formatted_list = format_structured_data(content)
                                    st.markdown(formatted_list)
                                elif isinstance(content, dict):
                                    # Format dictionary using our structured formatter
                                    formatted_dict = format_structured_data(content)
                                    st.markdown(formatted_dict)

                    # Raw JSON tab - for viewing the raw OCR response data
                    with json_tab:
                        # Extract the relevant JSON data
                        json_data = {}

                        # Include important metadata
                        for field in ['file_name', 'timestamp', 'processing_time', 'detected_document_type', 'languages', 'topics']:
                            if field in result:
                                json_data[field] = result[field]

                        # Include OCR contents
                        if 'ocr_contents' in result:
                            json_data['ocr_contents'] = result['ocr_contents']

                        # Exclude large binary data like base64 images to keep JSON clean
                        if 'pages_data' in result:
                            # Create simplified pages_data without large binary content
                            simplified_pages = []
                            for page in result['pages_data']:
                                simplified_page = {
                                    'page_number': page.get('page_number', 0),
                                    'has_text': bool(page.get('markdown', '')),
                                    'has_images': bool(page.get('images', [])),
                                    'image_count': len(page.get('images', []))
                                }
                                simplified_pages.append(simplified_page)
                            json_data['pages_summary'] = simplified_pages

                        # Format the JSON prettily
                        json_str = json.dumps(json_data, indent=2)

                        # Display in a monospace font with syntax highlighting
                        st.code(json_str, language="json")

                    # Images tab - for viewing document images
                    if has_images and img_tab:
                        with img_tab:
                            # Display each available image
                            for i, img in enumerate(images_to_display):
                                st.image(img['data'], caption=f"Image {i+1}", use_container_width=True)

        # Display custom prompt if provided
        if custom_prompt:
            with st.expander("Custom Processing Instructions"):
                st.write(custom_prompt)

        # No download heading - start directly with buttons

        # Create export section with a simple download menu
        st.markdown("<div style='margin-top: 15px;'></div>", unsafe_allow_html=True)

        # Prepare all download files at once to avoid rerun resets
        try:
            # 1. JSON download
            json_str = json.dumps(result, indent=2)
            json_filename = f"{result.get('file_name', 'document').split('.')[0]}_ocr.json"

            # 2. Text download with improved structure
            text_parts = []
            filename = result.get('file_name', 'document')
            text_parts.append(f"DOCUMENT: {filename}\n")

            if 'timestamp' in result:
                text_parts.append(f"Processed: {result['timestamp']}\n")

            if 'languages' in result and result['languages']:
                languages = [lang for lang in result['languages'] if lang is not None]
                if languages:
                    text_parts.append(f"Languages: {', '.join(languages)}\n")

            if 'topics' in result and result['topics']:
                text_parts.append(f"Topics: {', '.join(result['topics'])}\n")

            text_parts.append("\n" + "="*50 + "\n\n")

            if 'ocr_contents' in result and 'title' in result['ocr_contents'] and result['ocr_contents']['title']:
                text_parts.append(f"TITLE: {result['ocr_contents']['title']}\n\n")

            content_added = False

            if 'ocr_contents' in result:
                for field in ["main_text", "content", "text", "transcript", "raw_text"]:
                    if field in result['ocr_contents'] and result['ocr_contents'][field]:
                        text_parts.append(f"CONTENT:\n\n{result['ocr_contents'][field]}\n")
                        content_added = True
                        break

            text_content = "\n".join(text_parts)
            text_filename = f"{result.get('file_name', 'document').split('.')[0]}_ocr.txt"

            # 3. HTML download
            from utils.image_utils import create_html_with_images
            html_content = create_html_with_images(result)
            html_filename = f"{result.get('file_name', 'document').split('.')[0]}_ocr.html"

            # Hide download options in an expander
            with st.expander("Download Options"):
                # Use a vertical layout, with spacing between buttons for readability
                st.download_button(
                    label="JSON",
                    data=json_str,
                    file_name=json_filename,
                    mime="application/json",
                    key="download_json_btn",
                    use_container_width=True
                )

                st.markdown("<div style='margin-top: 8px;'></div>", unsafe_allow_html=True)

                st.download_button(
                    label="Text",
                    data=text_content,
                    file_name=text_filename,
                    mime="text/plain",
                    key="download_text_btn",
                    use_container_width=True
                )

                st.markdown("<div style='margin-top: 8px;'></div>", unsafe_allow_html=True)

                st.download_button(
                    label="HTML",
                    data=html_content,
                    file_name=html_filename,
                    mime="text/html",
                    key="download_html_btn",
                    use_container_width=True
                )

        except Exception as e:
            st.error(f"Error preparing download files: {str(e)}")
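The Raw JSON tab's `pages_summary` construction (keeping per-page counts while excluding base64 payloads) can be verified without Streamlit. `summarize_pages` below is an illustrative standalone version of that loop, using the same `pages_data` field names:

```python
def summarize_pages(pages_data):
    """Keep only lightweight facts about each page; never copy base64 image data."""
    return [{
        'page_number': page.get('page_number', 0),
        'has_text': bool(page.get('markdown', '')),
        'has_images': bool(page.get('images', [])),
        'image_count': len(page.get('images', [])),
    } for page in pages_data]

pages = [
    {'page_number': 1, 'markdown': '# Title', 'images': [{'image_base64': 'AAAA'}]},
    {'page_number': 2, 'markdown': '', 'images': []},
]
print(summarize_pages(pages))
```

Because the summary holds only booleans and counts, the JSON shown in the tab stays small even for image-heavy documents.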