Spaces: flozi00/structured-docling

flozi00 committed on Nov 5, 2025

Commit 8e3d376 (1 parent: 62cc451)

first try

Files changed (5)

.gitignore +46 -0
README.md +97 -5
app.py +148 -0
app_hf_spaces.py +169 -0
requirements.txt +3 -0

.gitignore ADDED

	@@ -0,0 +1,46 @@

+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+# Virtual Environment
+venv/
+env/
+ENV/
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+# Gradio
+gradio_cached_examples/
+flagged/
+# OS
+.DS_Store
+Thumbs.db
+# Temporary files
+*.tmp
+temp/
+tmp/

README.md CHANGED

@@ -1,13 +1,105 @@
 ---
 title: Structured Docling
-emoji: 💻
-colorFrom: red
-colorTo: red
+emoji: 📄
+colorFrom: blue
+colorTo: indigo
 sdk: gradio
 sdk_version: 5.49.1
-app_file: app.py
+app_file: app_hf_spaces.py
 pinned: false
 license: gpl-3.0
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# Docling Structured Extraction Demo
+A Gradio-based demo application for extracting structured data from documents using Docling's beta structured extraction feature.
+## Features
+- 📄 Support for PDF and image files (PNG, JPG, JPEG, TIFF, BMP)
+- 🌐 URL input for remote documents
+- 🎯 Customizable JSON templates for extraction
+- 🚀 Optimized for Hugging Face Spaces with Zero GPU support
+- 📊 Clean JSON output with extracted data
+## Files
+- `app.py` - Standard Gradio application
+- `app_hf_spaces.py` - Version optimized for Hugging Face Spaces with Zero GPU decorator
+- `requirements.txt` - Python dependencies
+## Installation
+```bash
+pip install -r requirements.txt
+```
+## Usage
+### Local Development
+Run the standard version:
+```bash
+python app.py
+```
+### Hugging Face Spaces
+The `app_hf_spaces.py` file is specifically designed for deployment on Hugging Face Spaces with Zero GPU support.
+To deploy:
+1. Create a new Space on Hugging Face
+2. Upload `app_hf_spaces.py` (either rename it to `app.py` or set `app_file: app_hf_spaces.py` in the README front matter)
+3. Upload `requirements.txt`
+4. Enable Zero GPU in Space settings
+## How to Use the Demo
+1. **Input Source**: Either upload a document file or provide a URL to a document
+2. **Define Template**: Create a JSON template specifying the fields you want to extract
+   - Use `"string"` for text fields
+   - Use `"float"` for decimal numbers
+   - Use `"int"` for whole numbers
+3. **Extract**: Click the "Extract" button to process the document
+4. **View Results**: The extracted data will appear in JSON format in the output box
+## Template Examples
+### Simple Invoice Extraction
+```json
+{
+  "bill_no": "string",
+  "total": "float",
+  "date": "string"
+}
+```
+### Detailed Invoice Extraction
+```json
+{
+  "bill_no": "string",
+  "total": "float",
+  "sender_name": "string",
+  "receiver_name": "string",
+  "postal_code": "string",
+  "city": "string"
+}
+```
+## Notes
+- The structured extraction API is currently in **beta** and may change
+- Only PDF and image formats are supported
+- The extraction uses Vision Language Models (VLM) for understanding document content
+- Processing time depends on document complexity and size
+## Requirements
+- Python 3.9+
+- gradio >= 4.0.0
+- docling[vlm] >= 2.0.0
+- spaces >= 0.19.0 (for Hugging Face Spaces deployment)
+## License
+This demo is provided as-is for demonstration purposes.
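
Reviewer note on the README's template format: the README documents three type names (`"string"`, `"float"`, `"int"`), but neither app file checks the template beyond `json.loads`. A minimal sketch of a pre-flight check follows. `validate_template` and `ALLOWED_TYPES` are hypothetical names, not part of Docling or this commit, and the check only covers the flat form the README describes; Docling itself may accept richer templates.

```python
import json

# Type names taken from the README's "How to Use the Demo" section.
ALLOWED_TYPES = {"string", "float", "int"}


def validate_template(template_json: str) -> list[str]:
    """Return a list of problems found in a template; an empty list means none."""
    try:
        template = json.loads(template_json)
    except json.JSONDecodeError as exc:
        return [f"Invalid JSON: {exc}"]
    if not isinstance(template, dict):
        return ["Template must be a JSON object mapping field names to types"]
    problems = []
    for field, type_name in template.items():
        # Non-string values (nested objects, lists) are flagged rather than inspected.
        if not isinstance(type_name, str) or type_name not in ALLOWED_TYPES:
            problems.append(f"Field {field!r} has unsupported type {type_name!r}")
    return problems
```

Calling this before `extractor.extract(...)` would let the UI report a misspelled type name without spending GPU time on the request.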

app.py ADDED

	@@ -0,0 +1,148 @@

+import json
+import gradio as gr
+from docling.datamodel.base_models import InputFormat
+from docling.document_extractor import DocumentExtractor
+# Initialize the extractor
+extractor = DocumentExtractor(allowed_formats=[InputFormat.IMAGE, InputFormat.PDF])
+def process_extraction(file_input, url_input, template_json):
+    """
+    Process document extraction with the provided template.
+    Args:
+        file_input: Uploaded file (PDF or image)
+        url_input: URL to a document
+        template_json: JSON string defining the extraction template
+    Returns:
+        JSON string with extracted data
+    """
+    try:
+        # Determine the source
+        source = None
+        if file_input is not None:
+            source = file_input.name
+        elif url_input and url_input.strip():
+            source = url_input.strip()
+        else:
+            return json.dumps(
+                {"error": "Please provide either a file or a URL"}, indent=2
+            )
+        # Parse the template JSON
+        try:
+            template = json.loads(template_json)
+        except json.JSONDecodeError as e:
+            return json.dumps({"error": f"Invalid JSON template: {str(e)}"}, indent=2)
+        # Perform extraction
+        result = extractor.extract(
+            source=source,
+            template=template,
+        )
+        # Format the output
+        output = {"pages": []}
+        for page in result.pages:
+            page_data = {
+                "page_no": page.page_no,
+                "extracted_data": page.extracted_data,
+                "raw_text": page.raw_text,
+                "errors": page.errors if page.errors else [],
+            }
+            output["pages"].append(page_data)
+        return json.dumps(output, indent=2)
+    except Exception as e:
+        return json.dumps({"error": f"Extraction failed: {str(e)}"}, indent=2)
+# Default template example
+default_template = json.dumps(
+    {"bill_no": "string", "total": "float", "date": "string"}, indent=2
+)
+# Create Gradio interface
+with gr.Blocks(title="Docling Structured Extraction") as demo:
+    gr.Markdown(
+        """
+    # 📄 Docling Structured Extraction Demo
+    Extract structured data from documents (PDF/Images) using AI-powered extraction.
+    **Note:** This feature is currently in beta.
+    ### How to use:
+    1. Upload a file OR provide a URL to a document
+    2. Define your extraction template in JSON format
+    3. Click "Extract" to get structured data
+    """
+    )
+    with gr.Row():
+        with gr.Column():
+            gr.Markdown("### Input Source")
+            file_input = gr.File(
+                label="Upload File (PDF or Image)",
+                file_types=[".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"],
+            )
+            url_input = gr.Textbox(
+                label="Or Enter Document URL",
+                placeholder="https://example.com/document.pdf",
+                lines=1,
+            )
+            gr.Markdown("### Extraction Template")
+            template_input = gr.Code(
+                label="JSON Template", value=default_template, language="json", lines=15
+            )
+            extract_btn = gr.Button("Extract", variant="primary", size="lg")
+        with gr.Column():
+            gr.Markdown("### Extracted Data")
+            output_json = gr.Code(label="Result (JSON)", language="json", lines=25)
+    # Examples section
+    gr.Markdown("### Examples")
+    gr.Examples(
+        examples=[
+            [
+                None,
+                "https://upload.wikimedia.org/wikipedia/commons/9/9f/Swiss_QR-Bill_example.jpg",
+                json.dumps({"bill_no": "string", "total": "float"}, indent=2),
+            ],
+            [
+                None,
+                "https://upload.wikimedia.org/wikipedia/commons/9/9f/Swiss_QR-Bill_example.jpg",
+                json.dumps(
+                    {
+                        "bill_no": "string",
+                        "total": "float",
+                        "sender_name": "string",
+                        "receiver_name": "string",
+                    },
+                    indent=2,
+                ),
+            ],
+        ],
+        inputs=[file_input, url_input, template_input],
+        label="Try these examples",
+    )
+    # Connect the extraction function
+    extract_btn.click(
+        fn=process_extraction,
+        inputs=[file_input, url_input, template_input],
+        outputs=output_json,
+    )
+# Launch the app
+if __name__ == "__main__":
+    demo.launch()
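
Reviewer note on `process_extraction`: source selection and output formatting do not depend on Docling, so they could be pulled out and tested without loading a model. The sketch below is one way to do that. `resolve_source`, `format_pages`, and `FakePage` are hypothetical names introduced here. `FakePage` only mirrors the four attributes the commit reads from each page (`page_no`, `extracted_data`, `raw_text`, `errors`) and is not a Docling class.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Iterable, Optional


def resolve_source(file_path: Optional[str], url: Optional[str]) -> Optional[str]:
    """Pick the document source: an uploaded file takes precedence over a URL."""
    if file_path:
        return file_path
    if url and url.strip():
        return url.strip()
    return None  # caller should report "Please provide either a file or a URL"


@dataclass
class FakePage:
    """Hypothetical stand-in exposing the attributes app.py reads from a page."""
    page_no: int
    extracted_data: Any = None
    raw_text: Optional[str] = None
    errors: list = field(default_factory=list)


def format_pages(pages: Iterable[Any]) -> str:
    """Build the same {"pages": [...]} JSON structure that app.py returns."""
    output = {
        "pages": [
            {
                "page_no": page.page_no,
                "extracted_data": page.extracted_data,
                "raw_text": page.raw_text,
                "errors": page.errors if page.errors else [],
            }
            for page in pages
        ]
    }
    return json.dumps(output, indent=2)
```

With these helpers, the body of `process_extraction` would reduce to resolving the source, parsing the template, calling `extractor.extract(...)`, and returning `format_pages(result.pages)`.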

app_hf_spaces.py ADDED

	@@ -0,0 +1,169 @@

+import json
+import gradio as gr
+import spaces  # Hugging Face Spaces Zero GPU support
+from docling.datamodel.base_models import InputFormat
+from docling.document_extractor import DocumentExtractor
+# Build the extractor lazily so it is created inside the GPU-decorated call
+def get_extractor():
+    """Initialize extractor - called within GPU context"""
+    return DocumentExtractor(allowed_formats=[InputFormat.IMAGE, InputFormat.PDF])
+@spaces.GPU(duration=60)  # Allocate GPU for up to 60 seconds
+def process_extraction(file_input, url_input, template_json):
+    """
+    Process document extraction with the provided template.
+    Uses Hugging Face Spaces Zero GPU feature.
+    Args:
+        file_input: Uploaded file (PDF or image)
+        url_input: URL to a document
+        template_json: JSON string defining the extraction template
+    Returns:
+        JSON string with extracted data
+    """
+    try:
+        # Initialize extractor in GPU context
+        extractor = get_extractor()
+        # Determine the source
+        source = None
+        if file_input is not None:
+            source = file_input.name
+        elif url_input and url_input.strip():
+            source = url_input.strip()
+        else:
+            return json.dumps(
+                {"error": "Please provide either a file or a URL"}, indent=2
+            )
+        # Parse the template JSON
+        try:
+            template = json.loads(template_json)
+        except json.JSONDecodeError as e:
+            return json.dumps({"error": f"Invalid JSON template: {str(e)}"}, indent=2)
+        # Perform extraction
+        result = extractor.extract(
+            source=source,
+            template=template,
+        )
+        # Format the output
+        output = {"pages": []}
+        for page in result.pages:
+            page_data = {
+                "page_no": page.page_no,
+                "extracted_data": page.extracted_data,
+                "raw_text": page.raw_text,
+                "errors": page.errors if page.errors else [],
+            }
+            output["pages"].append(page_data)
+        return json.dumps(output, indent=2)
+    except Exception as e:
+        return json.dumps({"error": f"Extraction failed: {str(e)}"}, indent=2)
+# Default template example
+default_template = json.dumps(
+    {"bill_no": "string", "total": "float", "date": "string"}, indent=2
+)
+# Create Gradio interface
+with gr.Blocks(title="Docling Structured Extraction") as demo:
+    gr.Markdown(
+        """
+    # 📄 Docling Structured Extraction Demo
+    Extract structured data from documents (PDF/Images) using AI-powered extraction.
+    **Note:** This feature is currently in beta.
+    ### How to use:
+    1. Upload a file OR provide a URL to a document
+    2. Define your extraction template in JSON format
+    3. Click "Extract" to get structured data
+    🚀 **Powered by Hugging Face Spaces Zero GPU**
+    """
+    )
+    with gr.Row():
+        with gr.Column():
+            gr.Markdown("### Input Source")
+            file_input = gr.File(
+                label="Upload File (PDF or Image)",
+                file_types=[".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"],
+            )
+            url_input = gr.Textbox(
+                label="Or Enter Document URL",
+                placeholder="https://example.com/document.pdf",
+                lines=1,
+            )
+            gr.Markdown("### Extraction Template")
+            gr.Markdown(
+                """
+            Define the structure of data you want to extract. Use JSON format with field names and types:
+            - `"string"` for text fields
+            - `"float"` for numbers with decimals
+            - `"int"` for whole numbers
+            """
+            )
+            template_input = gr.Code(
+                label="JSON Template", value=default_template, language="json", lines=15
+            )
+            extract_btn = gr.Button("Extract", variant="primary", size="lg")
+        with gr.Column():
+            gr.Markdown("### Extracted Data")
+            output_json = gr.Code(label="Result (JSON)", language="json", lines=25)
+    # Examples section
+    gr.Markdown("### Examples")
+    gr.Examples(
+        examples=[
+            [
+                None,
+                "https://upload.wikimedia.org/wikipedia/commons/9/9f/Swiss_QR-Bill_example.jpg",
+                json.dumps({"bill_no": "string", "total": "float"}, indent=2),
+            ],
+            [
+                None,
+                "https://upload.wikimedia.org/wikipedia/commons/9/9f/Swiss_QR-Bill_example.jpg",
+                json.dumps(
+                    {
+                        "bill_no": "string",
+                        "total": "float",
+                        "sender_name": "string",
+                        "receiver_name": "string",
+                        "postal_code": "string",
+                    },
+                    indent=2,
+                ),
+            ],
+        ],
+        inputs=[file_input, url_input, template_input],
+        label="Try these examples",
+    )
+    # Connect the extraction function
+    extract_btn.click(
+        fn=process_extraction,
+        inputs=[file_input, url_input, template_input],
+        outputs=output_json,
+    )
+# Launch the app
+if __name__ == "__main__":
+    demo.launch()
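
Reviewer note on the two app files: `app.py` and `app_hf_spaces.py` differ mainly in the `@spaces.GPU(duration=60)` decorator and the lazy extractor. One possible way to keep a single file is to fall back to a no-op decorator when the `spaces` package is not installed. This is a sketch under that assumption, not what the commit does. The name `gpu` and the toy `process` function are hypothetical.

```python
try:
    import spaces  # available on Hugging Face Spaces; optional locally

    gpu = spaces.GPU
except ImportError:

    def gpu(*args, **kwargs):
        """No-op fallback that accepts both @gpu and @gpu(duration=...)."""
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]  # used as a bare decorator
        return lambda fn: fn  # used with arguments


@gpu(duration=60)
def process(value):
    # Placeholder for the real extraction call.
    return value * 2
```

The same module could then serve as the `app_file` on Spaces and run unchanged with `python app.py` on a machine without the `spaces` package.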

requirements.txt ADDED

	@@ -0,0 +1,3 @@

+gradio>=4.0.0
+docling[vlm]>=2.0.0
+spaces>=0.19.0