Space: Ayaan-Sharif/ocr-layout-detection-poc

Ayaan Sharif committed on Oct 15

Commit

933ba3b

1 Parent(s): 513efc5

Add AI-powered document layout detection app with examples

Files changed (7)

README.md +104 -6
app.py +318 -0
requirements.txt +9 -0
sample/Screenshot 2025-10-13 114010.png +3 -0
sample/Screenshot 2025-10-13 114606.png +3 -0
sample/Screenshot 2025-10-15 111602.png +3 -0
sample/Screenshot 2025-10-15 175735.png +3 -0

README.md CHANGED

@@ -1,12 +1,110 @@
 ---
-title: Ocr Layout Detection Poc
-emoji: 😻
-colorFrom: indigo
-colorTo: pink
 sdk: gradio
-sdk_version: 5.49.1
 app_file: app.py
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: Document Layout Detection
+emoji: 📄
+colorFrom: blue
+colorTo: green
 sdk: gradio
+sdk_version: 5.49.0
 app_file: app.py
 pinned: false
+license: mit
 ---
+# 📄 Document Layout & Table Structure Detection
+An AI-powered tool that automatically detects document layout and table structure.
+## 🎯 What Does This Do?
+This Space automatically analyzes your documents (PDFs, images, scanned documents) to:
+- 🏷️ **Detect Layout Elements**: Identifies titles, headers, paragraphs, lists, tables, figures, captions, formulas, and more
+- 📊 **Extract Tables**: Recognizes table structures and extracts data
+- 🖼️ **Visual Output**: Shows bounding boxes around detected elements with color-coded labels
+- 📝 **Export Formats**: Provides Markdown, JSON, and visual outputs
+- 🔍 **OCR Support**: Automatically processes scanned documents and images
+## 🚀 How to Use
+1. **Upload** your document (PDF, JPG/JPEG, PNG, TIFF, or BMP)
+2. **Choose** processing mode:
+   - **Fast**: Quick processing for simple documents
+   - **Accurate**: Better quality for complex tables (slower)
+3. **Configure** options:
+   - Enable/disable OCR
+   - Enable/disable table detection
+4. **Process** and view results!
+## 📚 Use Cases
+Perfect for analyzing:
+- 🆔 **ID Documents**: Aadhaar cards, passports, driver's licenses
+- 📄 **Forms & Applications**: Government forms, surveys, questionnaires
+- 🧾 **Invoices & Receipts**: Business documents with tables
+- 📖 **Research Papers**: Academic documents with complex layouts
+- 📊 **Reports**: Annual reports, financial statements
+- 📰 **Articles & Documents**: Any structured document
+## 🛠️ Technology
+This Space is built on the Docling document conversion pipeline:
+- **Layout Model**: Docling's layout analysis model, which predicts labeled bounding boxes for each page
+- **Table Structure Model**: TableFormer, run in fast or accurate mode, for table structure recognition
+- **OCR Engine**: Docling's integrated OCR for text recognition in scanned documents
+- **Framework**: Docling for conversion, PyMuPDF for PDF page rendering, and Gradio for the interface
+## 🎨 Output Formats
+### 1. Visualization
+- Bounding boxes drawn on the first page of the document
+- Color-coded by element type
+- Labels showing detected elements
+### 2. Markdown Export
+- Clean, structured text output
+- Preserves document hierarchy
+- Ready for further processing
+### 3. JSON Data
+- Complete layout predictions
+- Bounding box coordinates
+- Element types and confidence scores
+- Machine-readable format
+## 🌟 Features
+This tool offers:
+- Advanced AI models for layout detection
+- Supports multiple input formats (PDF, images)
+- Accurate table structure extraction
+- Handles both digital and scanned documents
+- Exports to various formats (Markdown, JSON)
+- Fast and accurate processing modes
+## 🧪 Local Testing
+Want to test locally? Check out `test_local.py` in this repository.
+```bash
+# Install dependencies
+pip install -r requirements.txt
+# Run the app locally
+python app.py
+# Or test on a specific file
+python test_local.py path/to/your/document.pdf
+```
+## 🤝 Contributing
+Found a bug or have a suggestion? Feel free to open an issue or contribute!
+## 📝 License
+MIT License - Feel free to use and modify for your projects.
+---
+**Made with ❤️ for better document understanding**
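
The JSON output described in the README is a flat list of records with `page`, `label`, `bbox`, and `confidence` keys, as built by `process_document` in `app.py`. A minimal sketch of consuming that output downstream, assuming that schema; the sample records, the `summarize_layout` helper, and the `min_confidence` threshold are illustrative and not part of the commit:

```python
import json
from collections import Counter

# Illustrative records in the shape the app emits; the values are made up.
sample_json = """
[
  {"page": 1, "label": "title", "bbox": [72.0, 60.0, 540.0, 90.0], "confidence": 0.98},
  {"page": 1, "label": "text", "bbox": [72.0, 110.0, 540.0, 300.0], "confidence": 0.91},
  {"page": 1, "label": "table", "bbox": [72.0, 320.0, 540.0, 600.0], "confidence": 0.55},
  {"page": 2, "label": "text", "bbox": [72.0, 60.0, 540.0, 700.0], "confidence": null}
]
"""


def summarize_layout(records, min_confidence=0.6):
    """Count elements per label, skipping records below the threshold.

    Records whose confidence is None are kept, because the app writes None
    when a cluster has no confidence attribute.
    """
    kept = [
        r for r in records
        if r.get("confidence") is None or r["confidence"] >= min_confidence
    ]
    return Counter(r["label"] for r in kept)


records = json.loads(sample_json)
counts = summarize_layout(records)
print(dict(counts))  # {'title': 1, 'text': 2}
```

The low-confidence table record is dropped here; lowering `min_confidence` would include it.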

app.py ADDED

	@@ -0,0 +1,318 @@

+import gradio as gr
+from docling.document_converter import DocumentConverter
+from docling.datamodel.base_models import InputFormat
+from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
+from docling.document_converter import PdfFormatOption
+from PIL import Image, ImageDraw, ImageFont
+import json
+import fitz  # PyMuPDF
+# Color mapping for different layout elements
+COLORS = {
+    "title": "#FF6B6B",
+    "text": "#4ECDC4",
+    "section_header": "#95E1D3",
+    "table": "#F38181",
+    "list": "#AA96DA",
+    "figure": "#FCBAD3",
+    "caption": "#A8D8EA",
+    "formula": "#FFD93D",
+    "footnote": "#6BCB77",
+    "page_header": "#4D96FF",
+    "page_footer": "#9D84B7",
+    "picture": "#FF8C42",
+}
+def draw_layout_boxes(image_path, layout_data, scale_x=1.0, scale_y=1.0):
+    """Draw bounding boxes on the image based on layout predictions"""
+    # Open the image
+    if isinstance(image_path, str):
+        img = Image.open(image_path).convert("RGB")
+    else:
+        img = image_path.convert("RGB")
+    draw = ImageDraw.Draw(img)
+    # Try to load a font, fallback to default if not available
+    try:
+        font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 20)
+        small_font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", 14)
+    except OSError:
+        font = ImageFont.load_default()
+        small_font = ImageFont.load_default()
+    # Draw each cluster
+    for cluster in layout_data:
+        label = cluster.get("label", "unknown")
+        bbox = cluster.get("bbox")
+        if bbox:
+            # bbox format: [x0, y0, x1, y1] from PDF coordinates
+            # Scale to match rendered image dimensions
+            x0, y0, x1, y1 = bbox
+            x0 = x0 * scale_x
+            y0 = y0 * scale_y
+            x1 = x1 * scale_x
+            y1 = y1 * scale_y
+            # Get color for this label
+            color = COLORS.get(label, "#999999")
+            # Draw rectangle
+            draw.rectangle([x0, y0, x1, y1], outline=color, width=3)
+            # Draw label background
+            label_text = label.replace("_", " ").title()
+            bbox_text = draw.textbbox((x0, y0 - 25), label_text, font=small_font)
+            draw.rectangle([bbox_text[0] - 2, bbox_text[1] - 2, bbox_text[2] + 2, bbox_text[3] + 2],
+                         fill=color)
+            # Draw label text
+            draw.text((x0, y0 - 25), label_text, fill="white", font=small_font)
+    return img
+def process_document(file_path, mode, enable_ocr, enable_tables):
+    """Process document with Docling and return results"""
+    try:
+        # Configure pipeline options
+        pipeline_options = PdfPipelineOptions()
+        pipeline_options.do_table_structure = enable_tables
+        if enable_tables:
+            if mode == "Accurate":
+                pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
+            else:
+                pipeline_options.table_structure_options.mode = TableFormerMode.FAST
+        pipeline_options.do_ocr = enable_ocr
+        pipeline_options.generate_page_images = True
+        pipeline_options.generate_picture_images = True
+        # Create converter
+        converter = DocumentConverter(
+            format_options={
+                InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
+                InputFormat.IMAGE: PdfFormatOption(pipeline_options=pipeline_options),
+            }
+        )
+        # Convert document
+        result = converter.convert(file_path)
+        # Extract layout information
+        layout_info = []
+        total_clusters = 0
+        table_count = 0
+        for page_no, page in enumerate(result.pages, 1):
+            if page.predictions.layout:
+                clusters = page.predictions.layout.clusters
+                total_clusters += len(clusters)
+                for cluster in clusters:
+                    layout_info.append({
+                        "page": page_no,
+                        "label": cluster.label,
+                        "bbox": [cluster.bbox.l, cluster.bbox.t, cluster.bbox.r, cluster.bbox.b],
+                        "confidence": getattr(cluster, "confidence", None)
+                    })
+            # Count tables
+            if page.predictions.tablestructure and page.predictions.tablestructure.table_map:
+                table_count += len(page.predictions.tablestructure.table_map)
+        # Get markdown output
+        markdown_output = result.document.export_to_markdown()
+        # Create visualization for first page
+        visualization = None
+        if result.pages and layout_info:
+            # Draw boxes on first page only
+            first_page_layout = [item for item in layout_info if item["page"] == 1]
+            try:
+                # Check if input is an image or PDF
+                file_ext = file_path.lower().split('.')[-1]
+                if file_ext in ['jpg', 'jpeg', 'png', 'tiff', 'bmp']:
+                    # For images: Open directly, coordinates should match 1:1
+                    first_page_image = Image.open(file_path).convert("RGB")
+                    # No scaling needed for images - coordinates are already in pixels
+                    visualization = draw_layout_boxes(first_page_image, first_page_layout,
+                                                     scale_x=1.0, scale_y=1.0)
+                else:
+                    # For PDFs: Render and calculate scale
+                    doc = fitz.open(file_path)
+                    page = doc[0]
+                    # Get page dimensions in PDF points
+                    page_rect = page.rect
+                    pdf_width = page_rect.width
+                    pdf_height = page_rect.height
+                    # Render at 2x for better quality
+                    zoom = 2.0
+                    mat = fitz.Matrix(zoom, zoom)
+                    pix = page.get_pixmap(matrix=mat)
+                    first_page_image = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
+                    # Calculate scale: rendered_pixels / pdf_points
+                    scale_x = pix.width / pdf_width
+                    scale_y = pix.height / pdf_height
+                    doc.close()
+                    # Draw boxes with calculated scale
+                    visualization = draw_layout_boxes(first_page_image, first_page_layout,
+                                                     scale_x=scale_x, scale_y=scale_y)
+            except Exception as e:
+                print(f"Could not create visualization: {e}")
+                import traceback
+                traceback.print_exc()
+        # Create summary
+        summary = f"""## Document Analysis Summary
+📄 **Total Pages:** {len(result.document.pages)}
+🏷️ **Layout Elements Detected:** {total_clusters}
+📊 **Tables Found:** {table_count}
+### Layout Elements by Type:
+"""
+        # Count elements by type
+        element_counts = {}
+        for item in layout_info:
+            label = item["label"]
+            element_counts[label] = element_counts.get(label, 0) + 1
+        for label, count in sorted(element_counts.items()):
+            summary += f"- **{label.replace('_', ' ').title()}**: {count}\n"
+        # JSON output
+        json_output = json.dumps(layout_info, indent=2)
+        return visualization, summary, markdown_output, json_output
+    except Exception as e:
+        error_msg = f"Error processing document: {str(e)}"
+        return None, error_msg, error_msg, error_msg
+def gradio_interface(file, mode, enable_ocr, enable_tables):
+    """Gradio interface function"""
+    if file is None:
+        return None, "Please upload a document", "", ""
+    return process_document(file.name, mode, enable_ocr, enable_tables)
+# Create Gradio interface
+with gr.Blocks(title="Document Layout Detection", theme=gr.themes.Soft()) as demo:
+    gr.Markdown("""
+    # 📄 Document Layout & Structure Detection
+    Upload a document (PDF, image, etc.) to automatically detect its layout structure including text, tables, figures, and more!
+    **Features:**
+    - **AI-Powered Layout Detection**: Automatically identifies document elements
+    - **Table Structure Extraction**: Recognizes and extracts table data
+    - **OCR Support**: Reads text from scanned documents and images
+    """)
+    with gr.Row():
+        with gr.Column(scale=1):
+            file_input = gr.File(
+                label="Upload Document",
+                file_types=[".pdf", ".jpg", ".jpeg", ".png", ".tiff", ".bmp"]
+            )
+            mode_dropdown = gr.Dropdown(
+                choices=["Fast", "Accurate"],
+                value="Fast",
+                label="Processing Mode",
+                info="Accurate mode is slower but better for complex tables"
+            )
+            ocr_checkbox = gr.Checkbox(
+                label="Enable OCR",
+                value=True,
+                info="Use OCR for scanned documents and images"
+            )
+            tables_checkbox = gr.Checkbox(
+                label="Enable Table Detection",
+                value=True,
+                info="Detect and extract table structures"
+            )
+            process_btn = gr.Button("🚀 Process Document", variant="primary", size="lg")
+        with gr.Column(scale=2):
+            visualization_output = gr.Image(label="Layout Visualization (First Page)")
+            summary_output = gr.Markdown(label="Summary")
+    with gr.Tabs():
+        with gr.Tab("📝 Markdown Output"):
+            markdown_output = gr.Textbox(
+                label="Extracted Content (Markdown)",
+                lines=20,
+                max_lines=30
+            )
+        with gr.Tab("🔧 JSON Layout Data"):
+            json_output = gr.Code(
+                label="Layout Predictions (JSON)",
+                language="json",
+                lines=20
+            )
+    gr.Markdown("""
+    ### Legend
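+    Each element type is outlined in its own color:
+    - 🔴 Title
+    - 🔵 Text (teal)
+    - 🟢 Section Header
+    - 🟠 Table (salmon)
+    - 🟣 List
+    - 🟡 Formula
+    - Other types (figure, caption, footnote, page header/footer, picture) use the colors in `COLORS`; unknown labels are drawn in gray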
+    ### How to Use
+    1. Upload your document (PDF or image of ID card, invoice, report, etc.)
+    2. Choose processing options (Fast mode recommended for quick results)
+    3. Click "Process Document"
+    4. View the visualization with bounding boxes and explore the outputs
+    ### 💡 Try Examples Below!
+    Click on any example to see instant results on different document types.
+    """)
+    # Add examples
+    gr.Examples(
+        examples=[
+            ["sample/Screenshot 2025-10-13 114010.png", "Fast", True, True],
+            ["sample/Screenshot 2025-10-13 114606.png", "Fast", True, True],
+            ["sample/Screenshot 2025-10-15 111602.png", "Fast", True, True],
+            ["sample/Screenshot 2025-10-15 175735.png", "Fast", True, True],
+        ],
+        inputs=[file_input, mode_dropdown, ocr_checkbox, tables_checkbox],
+        outputs=[visualization_output, summary_output, markdown_output, json_output],
+        fn=gradio_interface,
+        cache_examples=False,
+        label="📚 Example Documents"
+    )
+    # Connect the button
+    process_btn.click(
+        fn=gradio_interface,
+        inputs=[file_input, mode_dropdown, ocr_checkbox, tables_checkbox],
+        outputs=[visualization_output, summary_output, markdown_output, json_output]
+    )
+    # Auto-process on file upload (optional)
+    file_input.change(
+        fn=gradio_interface,
+        inputs=[file_input, mode_dropdown, ocr_checkbox, tables_checkbox],
+        outputs=[visualization_output, summary_output, markdown_output, json_output]
+    )
+# Launch the app
+if __name__ == "__main__":
+    demo.launch()
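
The PDF branch of `process_document` maps layout boxes from PDF points to rendered pixels by multiplying each coordinate by `rendered_pixels / pdf_points`. A small standalone sketch of that arithmetic follows; `scale_bbox` is a hypothetical helper that does not appear in the commit, and it assumes, as the comments in `app.py` do, that boxes are `[x0, y0, x1, y1]` with a top-left origin:

```python
def scale_bbox(bbox, page_size_pt, image_size_px):
    """Map [x0, y0, x1, y1] from PDF points to pixel coordinates."""
    page_w, page_h = page_size_pt
    img_w, img_h = image_size_px
    # Same ratio the app computes: rendered pixels divided by PDF points.
    scale_x = img_w / page_w
    scale_y = img_h / page_h
    x0, y0, x1, y1 = bbox
    return [x0 * scale_x, y0 * scale_y, x1 * scale_x, y1 * scale_y]


# A US Letter page (612 x 792 pt) rendered at 2x zoom gives 1224 x 1584 px.
scaled = scale_bbox([72, 100, 540, 130], (612, 792), (1224, 1584))
print(scaled)  # [144.0, 200.0, 1080.0, 260.0]
```

If the layout predictions were reported with a bottom-left origin instead, the y values would need to be flipped against the page height before scaling; the commit does not do this, so the sketch does not either.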

requirements.txt ADDED

	@@ -0,0 +1,9 @@

+# Install torch first with CPU support
+--extra-index-url https://download.pytorch.org/whl/cpu
+torch
+torchvision
+# Main dependencies
+docling>=2.0
+gradio>=5.0
+pymupdf>=1.24

sample/Screenshot 2025-10-13 114010.png ADDED

Git LFS Details

SHA256: de7049326db14e68944db3b142d79c1725a3399c4cc52970420be48ce73e9cd4
Pointer size: 131 Bytes
Size of remote file: 236 kB

sample/Screenshot 2025-10-13 114606.png ADDED

Git LFS Details

SHA256: c164aa77c45cefd007f28b73671d9b834611c93996affbbd56c49d11966b94b1
Pointer size: 131 Bytes
Size of remote file: 169 kB

sample/Screenshot 2025-10-15 111602.png ADDED

Git LFS Details

SHA256: 56992f493b30c9c763bb36cae7a71c80fdb99b34209c114c2a63fee4fe3ae835
Pointer size: 131 Bytes
Size of remote file: 454 kB

sample/Screenshot 2025-10-15 175735.png ADDED

Git LFS Details

SHA256: 54bb9f82b8f08629bd61af031bcdfd451b9ba603264d664abb128302e8793289
Pointer size: 131 Bytes
Size of remote file: 503 kB
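
The Git LFS details above list a SHA256 digest for each sample image. A minimal sketch, using only the standard library, of checking a downloaded copy against such a digest; `sha256_of_file` and `matches_digest` are illustrative names, and the demo hashes a throwaway temporary file rather than one of the actual screenshots:

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare a file's digest with an expected hex string, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Demo on a temporary file with known contents.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    tmp_path = tmp.name

expected = hashlib.sha256(b"hello").hexdigest()
ok = matches_digest(tmp_path, expected)
os.remove(tmp_path)
print(ok)  # True
```

For a real check, `expected_hex` would be the digest shown on the commit page and `path` the locally downloaded PNG.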