| title: Table Extraction | |
| emoji: π | |
| colorFrom: pink | |
| colorTo: green | |
| sdk: gradio | |
| sdk_version: 5.44.1 | |
| app_file: app.py | |
| pinned: false | |
| license: apache-2.0 | |
| short_description: Extract tables from Documents | |
| # Table Extraction | |
| [Hugging Face Space](https://huggingface.co/spaces/jobian/table-extraction) | |
| [Gradio](https://gradio.app/) | |
| [License: Apache 2.0](https://opensource.org/licenses/Apache-2.0) | |
| This is a **Proof of Concept** (PoC) for extracting **tables from text-based PDF documents** using the [GMFT library](https://github.com/microsoft/gmft) and serving results via a Gradio app. | |
| --- | |
| ## What this Space does | |
| - Accepts a **text-based PDF** upload. | |
| - Automatically **detects tables** and extracts them into structured JSON. | |
| - Each extracted table includes: | |
|   - **Page number** + **Table ID** | |
|   - Table **columns** and **data** | |
|   - **Bounding box** + captions (if detected) | |
| - Outputs are saved as: | |
|   - `tables.json`: structured table data | |
|   - `metrics.json`: per-file stats (tables found, processing time, etc.) | |
| - Tables are **rendered back in the UI** for quick inspection. | |
| - JSON + metrics are available for **download**. | |
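
The output shape described above can be sketched in plain Python. This is a minimal illustration, not the Space's actual `app.py`: the helper names `table_to_record` and `build_outputs` are hypothetical, the `bbox` and `caption` keys are assumed names for the bounding box and caption fields, and the metric keys are guesses at what "per-file stats" might contain. The table detection step itself (done by GMFT in the Space) is left out; the sketch starts from already-extracted columns and rows.

```python
import json
from datetime import datetime, timezone


def table_to_record(page, table_id, columns, rows, bbox=None, caption=None):
    """Turn one detected table into a JSON-serialisable dict.

    `rows` is a list of row tuples/lists aligned with `columns`.
    The "bbox" and "caption" key names are assumptions for illustration.
    """
    data = [dict(zip(columns, row)) for row in rows]
    record = {
        "page": page,
        "table_id": table_id,
        "columns": list(columns),
        "n_rows": len(data),
        "n_cols": len(columns),
        "data": data,
    }
    if bbox is not None:
        record["bbox"] = list(bbox)
    if caption:
        record["caption"] = caption
    return record


def build_outputs(file_name, n_pages, records, elapsed_s):
    """Assemble the tables payload and a small per-file metrics dict."""
    tables = {
        "file": file_name,
        "extracted_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "n_pages": n_pages,
        "tables": records,
    }
    metrics = {  # hypothetical metric names
        "file": file_name,
        "n_tables": len(records),
        "processing_seconds": round(elapsed_s, 3),
    }
    return tables, metrics


record = table_to_record(
    page=3,
    table_id=5,
    columns=["Name", "Age", "Score"],
    rows=[("Alice", 23, 89), ("Bob", 25, 92)],
    bbox=(72.0, 144.0, 540.0, 300.0),
)
tables, metrics = build_outputs("example.pdf", 12, [record], elapsed_s=0.42)

# Serialised strings that could be written to tables.json / metrics.json.
tables_json = json.dumps(tables, indent=2)
metrics_json = json.dumps(metrics, indent=2)
```

Deriving `n_rows` and `n_cols` from the data rather than passing them in keeps the counts consistent with the `data` array by construction.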
| --- | |
| ## ⚠️ Limitations | |
| - This PoC currently works best with **text-based PDFs**. | |
| - **Scanned PDFs** may not be parsed correctly (OCR support is not yet included). | |
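
One way a caller could guard against the scanned-PDF case is a quick pre-check on the text layer before running extraction. The sketch below is not part of the Space: `looks_scanned` is a hypothetical helper, it assumes the per-page text has already been pulled out with whatever PDF library is in use, and both thresholds are arbitrary starting points.

```python
def looks_scanned(page_texts, min_chars=20, min_text_ratio=0.5):
    """Heuristic: return True when too few pages carry extractable text.

    page_texts      list with one string (or None) per page
    min_chars       characters a page needs to count as having a text layer
    min_text_ratio  fraction of pages that must have text
    """
    if not page_texts:
        return True
    pages_with_text = sum(
        1 for text in page_texts if len((text or "").strip()) >= min_chars
    )
    return pages_with_text / len(page_texts) < min_text_ratio


text_pdf_pages = ["Name Age Score Alice 23 89 Bob 25 92"] * 3
scanned_pdf_pages = ["", "   ", None]

is_text_based = not looks_scanned(text_pdf_pages)
is_scanned = looks_scanned(scanned_pdf_pages)
```

A check like this only detects a missing text layer; it says nothing about whether the tables in a text-based PDF will be detected correctly.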
| --- | |
| ## Example JSON structure | |
| ```json | |
| { | |
| "file": "example.pdf", | |
| "extracted_at": "2025-09-04T13:22:58+00:00", | |
| "n_pages": 12, | |
| "tables": [ | |
| { | |
| "page": 3, | |
| "table_id": 5, | |
| "columns": ["Name", "Age", "Score"], | |
| "n_rows": 4, | |
| "n_cols": 3, | |
| "data": [ | |
| {"Name": "Alice", "Age": 23, "Score": 89}, | |
| {"Name": "Bob", "Age": 25, "Score": 92} | |
| ] | |
| } | |
| ] | |
| } | |