
metadata
title: Table Extraction
emoji: 📈
colorFrom: pink
colorTo: green
sdk: gradio
sdk_version: 5.44.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Extract tables from Documents

Table Extraction


This is a Proof of Concept (PoC) for extracting tables from text-based PDF documents using the GMFT library and serving results via a Gradio app.


What this Space does

  • Accepts an uploaded text-based PDF.
  • Detects tables automatically and extracts them into structured JSON.
  • Each extracted table includes:
    • Page number + Table ID
    • Table columns and data
    • Bounding box + captions (if detected)
  • Outputs are saved as:
    • tables.json → structured table data
    • metrics.json → per-file stats (tables found, processing time, etc.)
  • Tables are rendered back in the UI for quick inspection.
  • JSON + metrics are available for download.
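
The output step described above can be sketched in plain Python. This is a minimal illustration, not the Space's actual code: it assumes the tables have already been detected (for example by GMFT) and are available as column names plus row values. The helper names `build_table_record` and `write_outputs` are hypothetical, and the keys in `metrics.json` are assumptions, since this README only says that file holds per-file stats such as tables found and processing time. The `tables.json` fields follow the example structure shown in this README.

```python
import json
import time
from datetime import datetime, timezone
from pathlib import Path


def build_table_record(page, table_id, columns, rows):
    """Shape one extracted table like an entry of the "tables" array."""
    return {
        "page": page,
        "table_id": table_id,
        "columns": list(columns),
        "n_rows": len(rows),
        "n_cols": len(columns),
        # One dict per row, keyed by column name.
        "data": [dict(zip(columns, row)) for row in rows],
    }


def write_outputs(pdf_name, n_pages, records, started_at, out_dir):
    """Write tables.json and metrics.json into out_dir (hypothetical helper)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    payload = {
        "file": pdf_name,
        "extracted_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "n_pages": n_pages,
        "tables": records,
    }
    # Assumed metric names; the real metrics.json may use different keys.
    metrics = {
        "file": pdf_name,
        "tables_found": len(records),
        "processing_seconds": round(time.perf_counter() - started_at, 3),
    }
    (out / "tables.json").write_text(json.dumps(payload, indent=2), encoding="utf-8")
    (out / "metrics.json").write_text(json.dumps(metrics, indent=2), encoding="utf-8")
    return payload, metrics
```

A caller would record `started_at = time.perf_counter()` before extraction begins, build one record per detected table, and then call `write_outputs` once per PDF.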

⚠️ Limitations

  • This PoC currently works best with text-based PDFs.
  • Scanned PDFs may not be parsed correctly (OCR support is not yet included).
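
Because OCR is not included, a caller may want to warn early when a PDF appears to be scanned. The sketch below is one possible heuristic and is not part of this Space: it assumes the per-page text has already been pulled out by whatever PDF text extractor is in use, and it flags the document when most pages carry almost no text. The function name `looks_scanned` and both thresholds are illustrative assumptions.

```python
def looks_scanned(page_texts, min_chars=20, max_empty_ratio=0.5):
    """Return True when most pages carry almost no extractable text.

    page_texts: list of strings, one per page, from any PDF text extractor.
    min_chars and max_empty_ratio are arbitrary starting values, not tuned.
    """
    if not page_texts:
        # No pages or no text layer at all: treat as not text-based.
        return True
    # Count pages whose non-whitespace character count falls below min_chars.
    empty = sum(1 for text in page_texts if len("".join(text.split())) < min_chars)
    return empty / len(page_texts) > max_empty_ratio
```

A check like this only looks at the text layer, so a scanned PDF that already carries an embedded OCR layer would pass it.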

Example JSON structure

{
  "file": "example.pdf",
  "extracted_at": "2025-09-04T13:22:58+00:00",
  "n_pages": 12,
  "tables": [
    {
      "page": 3,
      "table_id": 5,
      "columns": ["Name", "Age", "Score"],
      "n_rows": 2,
      "n_cols": 3,
      "data": [
        {"Name": "Alice", "Age": 23, "Score": 89},
        {"Name": "Bob", "Age": 25, "Score": 92}
      ]
    }
  ]
}
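
As a usage illustration, a downloaded `tables.json` with this shape can be turned into CSV using only the standard library. This is a sketch under the assumption that every row in `data` is keyed by the names listed in `columns`, as in the example; `table_to_csv` is a hypothetical helper, and the inline sample stands in for reading the real file.

```python
import csv
import io
import json


def table_to_csv(table):
    """Render one entry of the "tables" array as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=table["columns"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(table["data"])
    return buf.getvalue()


# Inline sample mirroring the structure above; in practice, load tables.json instead.
sample = json.loads("""
{
  "file": "example.pdf",
  "tables": [
    {
      "page": 3,
      "table_id": 5,
      "columns": ["Name", "Age", "Score"],
      "data": [
        {"Name": "Alice", "Age": 23, "Score": 89},
        {"Name": "Bob", "Age": 25, "Score": 92}
      ]
    }
  ]
}
""")

for table in sample["tables"]:
    print(f"page {table['page']}, table {table['table_id']}")
    print(table_to_csv(table))
```

`csv.DictWriter` raises a `ValueError` if a row contains a key that is not in `columns`, which doubles as a basic consistency check on the file.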