Spaces: milwright/historical-ocr
milwright committed on Apr 4, 2025

Commit 622c90f (1 parent: 75ead00)

Fix sample document loading and processing pipeline

- Fixed sample document loading to automatically process after selection
- Enhanced SampleDocument class with better file emulation
- Added session state management for reliable sample processing
- Improved user feedback during sample document processing
- Updated CLAUDE.md with improved documentation
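
The session-state handoff described in the list above can be sketched without Streamlit. This is a minimal illustration, not the code from `app.py`: a plain dict stands in for `st.session_state`, and the helper names `resolve_upload` and `should_process` are hypothetical. It shows how the one-shot `sample_just_loaded` flag becomes an `auto_process_sample` request that fires exactly once.

```python
def resolve_upload(state, uploaded_file=None):
    """Pick the file to process, preferring a freshly stored sample.

    `state` is a plain dict standing in for st.session_state (assumption).
    """
    state.setdefault("auto_process_sample", False)
    state.setdefault("sample_just_loaded", False)

    sample = state.get("sample_document")
    if sample is not None:
        uploaded_file = sample
        # Promote the one-shot "just loaded" flag to an auto-process request.
        if state["sample_just_loaded"]:
            state["auto_process_sample"] = True
            state["sample_just_loaded"] = False
        # Clear the sample so it does not shadow later uploads.
        state["sample_document"] = None
    return uploaded_file


def should_process(state, process_button):
    """Return True when the button was clicked or auto-processing is pending."""
    process_now = process_button or state.get("auto_process_sample", False)
    if process_now:
        # Reset the flag so the next rerun does not process again.
        state["auto_process_sample"] = False
    return process_now


# Simulate the rerun that follows "Load Sample Document".
state = {"sample_document": "a-la-carte.pdf", "sample_just_loaded": True}
doc = resolve_upload(state)
first = should_process(state, process_button=False)
second = should_process(state, process_button=False)
print(doc, first, second)
```

The second call returns `False` because the flag is reset as soon as processing starts, which is what keeps a later rerun from reprocessing the same sample.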

Files changed (2)

CLAUDE.md +4 -1
app.py +246 -14

CLAUDE.md CHANGED

@@ -8,6 +8,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 - Process PDF files: `python pdf_ocr.py <file_path>`
 - Process single file with logging: `python process_file.py <file_path>`
 - Run newspaper test: `python test_newspaper.py <file_path>`
 - Run typechecking: `mypy .`
 - Lint code: `ruff check .` or `flake8`
@@ -23,6 +24,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 - **Naming**: snake_case for variables/functions, PascalCase for classes
 - **Documentation**: Google-style docstrings for all functions/classes
 - **Logging**: Use module-level loggers with appropriate log levels
 - **Line length**: ≤100 characters
 ## Architecture
@@ -30,4 +32,5 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 - Utils: `ocr_utils.py` - OCR text and image processing utilities
 - PDF handling: `pdf_ocr.py` - PDF-specific processing functionality
 - Config: `config.py` - Configuration settings and API keys
-- Web: `app.py` - Streamlit interface with UI components in `/ui` directory

 - Process PDF files: `python pdf_ocr.py <file_path>`
 - Process single file with logging: `python process_file.py <file_path>`
 - Run newspaper test: `python test_newspaper.py <file_path>`
+- Run notebook demo: `jupyter notebook notebook_demo.ipynb`
 - Run typechecking: `mypy .`
 - Lint code: `ruff check .` or `flake8`
 - **Naming**: snake_case for variables/functions, PascalCase for classes
 - **Documentation**: Google-style docstrings for all functions/classes
 - **Logging**: Use module-level loggers with appropriate log levels
+- **Exception handling**: Implement graceful fallbacks for API errors
 - **Line length**: ≤100 characters
 ## Architecture
 - Utils: `ocr_utils.py` - OCR text and image processing utilities
 - PDF handling: `pdf_ocr.py` - PDF-specific processing functionality
 - Config: `config.py` - Configuration settings and API keys
+- Web: `app.py` - Streamlit interface with UI components in `/ui` directory
+- Demo: `notebook_demo.ipynb` - Interactive notebook with educational examples

app.py CHANGED

@@ -511,12 +511,12 @@ with main_tab1:
         # Add heading for the file uploader (just text, no container)
         st.markdown('### Upload Document')
-        # Model info below the heading
-        st.markdown("Using the latest `mistral-ocr-latest` model for advanced document understanding.")
         # Enhanced file uploader with better help text
         uploaded_file = st.file_uploader("Drag and drop PDFs or images here", type=["pdf", "png", "jpg", "jpeg"],
-                                        help="Supports PDFs, JPGs, PNGs and other image formats")
         # Removed seed prompt instructions from here, moving to sidebar
@@ -917,6 +917,8 @@ with main_tab2:
                             badge_color = "#6a1b9a"  # Purple for document types
                         elif any(term in topic.lower() for term in ["travel", "military", "science", "medicine", "education", "art", "literature"]):
                             badge_color = "#2e7d32"  # Green for subject domains
                         st.markdown(
                             f'<span style="background-color: {badge_color}; color: white; padding: 3px 8px; '
@@ -1193,6 +1195,27 @@ with main_tab3:
     """)
 with main_tab1:
     if uploaded_file is not None:
         # Check file size (cap at 50MB)
         file_size_mb = len(uploaded_file.getvalue()) / (1024 * 1024)
@@ -1247,8 +1270,21 @@ with main_tab1:
             # No extra spacing needed as it will be managed programmatically
             metadata_placeholder = st.empty()
-        # Results section
-        if process_button:
             # Move the progress indicator reference to just below the button
             progress_container = progress_placeholder
             try:
@@ -1477,8 +1513,8 @@ with main_tab1:
                     # Only show when custom_prompt exists in the session AND has content, or when the result explicitly states it was applied
                     has_instructions = ('custom_prompt' in locals() and custom_prompt and len(str(custom_prompt).strip()) > 0)
                     if has_instructions or 'custom_prompt_applied' in result:
-                        # Use a simpler message that just shows custom instructions were applied
-                        metadata_html += f'<p style="margin-top:10px; padding:5px 8px; background-color:#f0f8ff; border-left:3px solid #4ba3e3; border-radius:3px; color:#333;"><strong>Advanced Analysis:</strong> Custom instructions applied</p>'
                     # Close the metadata card
                     metadata_html += '</div>'
@@ -1936,6 +1972,63 @@ with main_tab1:
                     if 'ocr_contents' not in result:
                         st.error("No OCR content was extracted from the document.")
                     # Close document content div
                     st.markdown('</div>', unsafe_allow_html=True)
@@ -2038,6 +2131,41 @@ with main_tab1:
                                     lang_tag = f"{lang} Language"
                                     subject_tags.append(lang_tag)
                     except Exception as e:
                         logger.warning(f"Error generating subject tags: {str(e)}")
                         # Fallback tags if extraction fails
@@ -2094,9 +2222,7 @@ with main_tab1:
             except Exception as e:
                 st.error(f"Error processing document: {str(e)}")
     else:
-        # Empty placeholder - we've moved the upload instruction to the file_uploader
-        # Show example images in a simpler layout
         st.subheader("Example Documents")
         # Add a simplified info message about examples
@@ -2106,9 +2232,115 @@ with main_tab1:
         - Handwritten letters and documents
         - Printed books and articles
         - Multi-page PDFs
-        Upload your own document to get started or explore the 'About' tab for more information.
         """)
-        # Display a direct message about sample documents
-        st.info("Sample documents are available in the input directory. Upload a document to begin analysis.")# Minor update

         # Add heading for the file uploader (just text, no container)
         st.markdown('### Upload Document')
+        # Model info with clearer instructions
+        st.markdown("Using the latest `mistral-ocr-latest` model for advanced document understanding. To get started upload your own document, use an example document, or explore the 'About' tab for more info.")
         # Enhanced file uploader with better help text
         uploaded_file = st.file_uploader("Drag and drop PDFs or images here", type=["pdf", "png", "jpg", "jpeg"],
+                                        help="Limit 200MB per file • PDF, PNG, JPG, JPEG")
         # Removed seed prompt instructions from here, moving to sidebar
                             badge_color = "#6a1b9a"  # Purple for document types
                         elif any(term in topic.lower() for term in ["travel", "military", "science", "medicine", "education", "art", "literature"]):
                             badge_color = "#2e7d32"  # Green for subject domains
+                        elif any(term in topic.lower() for term in ["preprocessed", "enhanced", "grayscale", "denoised", "contrast", "rotated"]):
+                            badge_color = "#e65100"  # Orange for preprocessing-related tags
                         st.markdown(
                             f'<span style="background-color: {badge_color}; color: white; padding: 3px 8px; '
     """)
 with main_tab1:
+    # Initialize session states if needed
+    if 'auto_process_sample' not in st.session_state:
+        st.session_state.auto_process_sample = False
+    if 'sample_just_loaded' not in st.session_state:
+        st.session_state.sample_just_loaded = False
+    # Use uploaded_file or sample_document if available
+    if 'sample_document' in st.session_state and st.session_state.sample_document is not None:
+        # Use the sample document
+        uploaded_file = st.session_state.sample_document
+        # Add a notice about using sample document
+        st.success(f"Using sample document: {uploaded_file.name}")
+        # Set auto-process flag in session state if this is a newly loaded sample
+        if st.session_state.sample_just_loaded:
+            st.session_state.auto_process_sample = True
+            st.session_state.sample_just_loaded = False
+        # Clear sample document after use to avoid interference with future uploads
+        st.session_state.sample_document = None
     if uploaded_file is not None:
         # Check file size (cap at 50MB)
         file_size_mb = len(uploaded_file.getvalue()) / (1024 * 1024)
             # No extra spacing needed as it will be managed programmatically
             metadata_placeholder = st.empty()
+        # Check if we need to auto-process a sample document
+        if 'auto_process_sample' not in st.session_state:
+            st.session_state.auto_process_sample = False
+        # Results section - process if button clicked or auto-process flag is set
+        process_now = process_button or st.session_state.auto_process_sample
+        # Show a message if auto-processing
+        if st.session_state.auto_process_sample:
+            st.info("Automatically processing sample document...")
+        if process_now:
+            # Reset auto-process flag to avoid processing on next rerun
+            if st.session_state.auto_process_sample:
+                st.session_state.auto_process_sample = False
             # Move the progress indicator reference to just below the button
             progress_container = progress_placeholder
             try:
                     # Only show when custom_prompt exists in the session AND has content, or when the result explicitly states it was applied
                     has_instructions = ('custom_prompt' in locals() and custom_prompt and len(str(custom_prompt).strip()) > 0)
                     if has_instructions or 'custom_prompt_applied' in result:
+                        # Use consistent styling with other metadata fields
+                        metadata_html += f'<p><strong>Advanced Analysis:</strong> Custom instructions applied</p>'
                     # Close the metadata card
                     metadata_html += '</div>'
                     if 'ocr_contents' not in result:
                         st.error("No OCR content was extracted from the document.")
+                    else:
+                        # Check for minimal text content in OCR results
+                        has_minimal_text = False
+                        total_text_length = 0
+                        # Check if the document is an image (not a PDF)
+                        is_image = result.get('file_name', '').lower().endswith(('.jpg', '.jpeg', '.png', '.gif'))
+                        # If image file with raw_text only
+                        if is_image and 'ocr_contents' in result:
+                            ocr_contents = result['ocr_contents']
+                            # Check if only raw_text exists with minimal content
+                            has_raw_text_only = False
+                            if 'raw_text' in ocr_contents:
+                                raw_text = ocr_contents['raw_text']
+                                total_text_length += len(raw_text.strip())
+                                # Check if raw_text is the only significant field
+                                other_content_fields = [k for k in ocr_contents.keys()
+                                                       if k not in ['raw_text', 'error', 'partial_text']
+                                                       and isinstance(ocr_contents[k], (str, list))
+                                                       and ocr_contents[k]]
+                                if len(other_content_fields) <= 1:  # Only raw_text or one other field
+                                    has_raw_text_only = True
+                            # Check if minimal text was extracted (less than 50 characters)
+                            if total_text_length < 50 and has_raw_text_only:
+                                has_minimal_text = True
+                        # Check if any meaningful preprocessing options were used
+                        preprocessing_used = False
+                        if preprocessing_options.get("document_type", "standard") != "standard":
+                            preprocessing_used = True
+                        if preprocessing_options.get("grayscale", False):
+                            preprocessing_used = True
+                        if preprocessing_options.get("denoise", False):
+                            preprocessing_used = True
+                        if preprocessing_options.get("contrast", 0) != 0:
+                            preprocessing_used = True
+                        if preprocessing_options.get("rotation", 0) != 0:
+                            preprocessing_used = True
+                        # If minimal text was found and preprocessing options weren't used
+                        if has_minimal_text and not preprocessing_used and uploaded_file.type.startswith('image/'):
+                            st.warning("""
+                            **Limited text extracted from this image.**
+                            Try using preprocessing options in the sidebar to improve results:
+                            - Convert to grayscale for clearer text
+                            - Use denoising for aged or degraded documents
+                            - Adjust contrast for faded text
+                            - Try different rotation if text orientation is unclear
+                            Click the "Preprocessing Options" section in the sidebar under "Image Processing".
+                            """)
                     # Close document content div
                     st.markdown('</div>', unsafe_allow_html=True)
                                     lang_tag = f"{lang} Language"
                                     subject_tags.append(lang_tag)
+                        # Add preprocessing information as tags if preprocessing was applied
+                        if uploaded_file.type.startswith('image/'):
+                            # Check if meaningful preprocessing options were used
+                            if preprocessing_options.get("document_type", "standard") != "standard":
+                                doc_type = preprocessing_options["document_type"].capitalize()
+                                preprocessing_tag = f"Enhanced ({doc_type})"
+                                if preprocessing_tag not in subject_tags:
+                                    subject_tags.append(preprocessing_tag)
+                            preprocessing_methods = []
+                            if preprocessing_options.get("grayscale", False):
+                                preprocessing_methods.append("Grayscale")
+                            if preprocessing_options.get("denoise", False):
+                                preprocessing_methods.append("Denoised")
+                            if preprocessing_options.get("contrast", 0) != 0:
+                                contrast_val = preprocessing_options.get("contrast", 0)
+                                if contrast_val > 0:
+                                    preprocessing_methods.append("Contrast Enhanced")
+                                else:
+                                    preprocessing_methods.append("Contrast Reduced")
+                            if preprocessing_options.get("rotation", 0) != 0:
+                                preprocessing_methods.append("Rotated")
+                            # Add a combined preprocessing tag if methods were applied
+                            if preprocessing_methods:
+                                prep_tag = "Preprocessed"
+                                if prep_tag not in subject_tags:
+                                    subject_tags.append(prep_tag)
+                                # Add the specific method as a tag if only one was used
+                                if len(preprocessing_methods) == 1:
+                                    method_tag = preprocessing_methods[0]
+                                    if method_tag not in subject_tags:
+                                        subject_tags.append(method_tag)
                     except Exception as e:
                         logger.warning(f"Error generating subject tags: {str(e)}")
                         # Fallback tags if extraction fails
             except Exception as e:
                 st.error(f"Error processing document: {str(e)}")
     else:
+        # Example Documents section after file uploader
         st.subheader("Example Documents")
         # Add a simplified info message about examples
         - Handwritten letters and documents
         - Printed books and articles
         - Multi-page PDFs
         """)
+        # Add CSS to make the dropdown match the column width
+        st.markdown("""
+        <style>
+        /* Make the selectbox container match the full column width */
+        .main .block-container .element-container:has([data-testid="stSelectbox"]) {
+            width: 100% !important;
+            max-width: 100% !important;
+        }
+        /* Make the actual selectbox control take the full width */
+        .stSelectbox > div > div {
+            width: 100% !important;
+            max-width: 100% !important;
+        }
+        </style>
+        """, unsafe_allow_html=True)
+        # Sample document URLs dropdown with clearer label
+        sample_urls = [
+            "Select a sample document",
+            "https://huggingface.co/spaces/milwright/historical-ocr/resolve/main/input/a-la-carte.pdf",
+            "https://huggingface.co/spaces/milwright/historical-ocr/resolve/main/input/magician-or-bottle-cungerer.jpg",
+            "https://huggingface.co/spaces/milwright/historical-ocr/resolve/main/input/handwritten-letter.jpg",
+            "https://huggingface.co/spaces/milwright/historical-ocr/resolve/main/input/magellan-travels.jpg",
+            "https://huggingface.co/spaces/milwright/historical-ocr/resolve/main/input/milgram-flier.png",
+            "https://huggingface.co/spaces/milwright/historical-ocr/resolve/main/input/baldwin-15st-north.jpg"
+        ]
+        sample_names = [
+            "Select a sample document",
+            "Restaurant Menu (PDF)",
+            "The Magician (Image)",
+            "Handwritten Letter (Image)",
+            "Magellan Travels (Image)",
+            "Milgram Flier (Image)",
+            "Baldwin Street (Image)"
+        ]
+        # Initialize sample_document in session state if it doesn't exist
+        if 'sample_document' not in st.session_state:
+            st.session_state.sample_document = None
+        selected_sample = st.selectbox("Select a sample document from `~/input`", options=range(len(sample_urls)), format_func=lambda i: sample_names[i])
+        if selected_sample > 0:
+            selected_url = sample_urls[selected_sample]
+            # Add process button for the sample document
+            if st.button("Load Sample Document"):
+                try:
+                    import requests
+                    from io import BytesIO
+                    with st.spinner(f"Downloading {sample_names[selected_sample]}..."):
+                        response = requests.get(selected_url)
+                        response.raise_for_status()
+                        # Extract filename from URL
+                        file_name = selected_url.split("/")[-1]
+                        # Create a BytesIO object from the downloaded content
+                        file_content = BytesIO(response.content)
+                        # Store as a UploadedFile-like object in session state
+                        class SampleDocument:
+                            def __init__(self, name, content, content_type):
+                                self.name = name
+                                self._content = content
+                                self.type = content_type
+                                self.size = len(content)
+                            def getvalue(self):
+                                return self._content
+                            def read(self):
+                                return self._content
+                            def seek(self, position):
+                                # Implement seek for compatibility with some file operations
+                                return
+                            def tell(self):
+                                # Implement tell for compatibility
+                                return 0
+                        # Determine content type based on file extension
+                        if file_name.lower().endswith('.pdf'):
+                            content_type = 'application/pdf'
+                        elif file_name.lower().endswith(('.jpg', '.jpeg')):
+                            content_type = 'image/jpeg'
+                        elif file_name.lower().endswith('.png'):
+                            content_type = 'image/png'
+                        else:
+                            content_type = 'application/octet-stream'
+                        # Save download info in session state for more reliable handling
+                        st.session_state.sample_document = SampleDocument(
+                            name=file_name,
+                            content=response.content,
+                            content_type=content_type
+                        )
+                        # Set a flag to indicate this is a newly loaded sample
+                        st.session_state.sample_just_loaded = True
+                        # Force rerun to load the document
+                        st.rerun()
+                except Exception as e:
+                    st.error(f"Error downloading sample document: {str(e)}")
+                    st.info("Please try uploading your own document instead.")
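
The `SampleDocument` class in this diff stubs out `seek` and `tell`, so code that reads the file in chunks or rewinds it would not behave like a real upload. One possible alternative, sketched here under assumptions and not part of the commit, is to subclass `io.BytesIO` so that `read`, `seek`, `tell`, and `getvalue` come from the standard library. The use of `mimetypes.guess_type` in place of the extension `if`/`elif` chain is also an assumption of this sketch.

```python
import mimetypes
from io import BytesIO


class SampleDocument(BytesIO):
    """In-memory stand-in for an uploaded file (illustrative sketch)."""

    def __init__(self, name, content, content_type=None):
        super().__init__(content)
        self.name = name
        self.size = len(content)
        # Fall back to a guess from the file extension, then to a generic type.
        guessed, _ = mimetypes.guess_type(name)
        self.type = content_type or guessed or "application/octet-stream"


# Placeholder bytes; a real call would pass response.content from the download.
doc = SampleDocument("milgram-flier.png", b"\x89PNG fake bytes")
head = doc.read(4)       # advances the read position
position = doc.tell()    # reflects the bytes consumed so far
doc.seek(0)              # rewinds for the next consumer
print(doc.name, doc.type, doc.size, position)
```

Because `getvalue()` returns the whole buffer regardless of the read position, the size check `len(uploaded_file.getvalue())` used earlier in `app.py` would keep working with this variant.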