---
title: Table Extraction
emoji: 📈
colorFrom: pink
colorTo: green
sdk: gradio
sdk_version: 5.44.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Extract tables from Documents
---

# Table Extraction

[![Hugging Face Spaces](https://img.shields.io/badge/🤗%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/jobian/table-extraction)
[![Made with Gradio](https://img.shields.io/badge/Made%20with-Gradio-orange)](https://gradio.app/)
[![License: Apache-2.0](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)

This is a **Proof of Concept** (PoC) for extracting **tables from text-based PDF documents** using the [GMFT library](https://github.com/microsoft/gmft) and serving results via a Gradio app.

---

## What this Space does

- Upload a **text-based PDF**.
- Automatically **detects tables** and extracts them into structured JSON.
- Each extracted table includes:
  - **Page number** + **Table ID**
  - Table **columns** and **data**
  - **Bounding box** + captions (if detected)
- Outputs are saved as:
  - `tables.json` → structured table data
  - `metrics.json` → per-file stats (tables found, processing time, etc.)
- Tables are **rendered back in the UI** for quick inspection.
- JSON + metrics are available for **download**.

---

## ⚠️ Limitations

- This PoC currently works best with **text-based PDFs**.
- **Scanned PDFs** may not be parsed correctly (OCR support is not yet included).

---

## Example JSON structure

```json
{
  "file": "example.pdf",
  "extracted_at": "2025-09-04T13:22:58+00:00",
  "n_pages": 12,
  "tables": [
    {
      "page": 3,
      "table_id": 5,
      "columns": ["Name", "Age", "Score"],
      "n_rows": 4,
      "n_cols": 3,
      "data": [
        {"Name": "Alice", "Age": 23, "Score": 89},
        {"Name": "Bob", "Age": 25, "Score": 92}
      ]
    }
  ]
}
```
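
A downloaded `tables.json` can be read back with the Python standard library alone. The sketch below is illustrative, not part of the Space's code: it assumes the file follows the example structure above exactly, and the names `table_to_csv`, `load_tables`, and `EXAMPLE` are hypothetical helpers introduced here. It turns each table entry into CSV text keyed by `(page, table_id)`.

```python
import csv
import io
import json

# Sample payload mirroring the example structure above (assumed layout).
EXAMPLE = """
{
  "file": "example.pdf",
  "n_pages": 12,
  "tables": [
    {
      "page": 3,
      "table_id": 5,
      "columns": ["Name", "Age", "Score"],
      "data": [
        {"Name": "Alice", "Age": 23, "Score": 89},
        {"Name": "Bob", "Age": 25, "Score": 92}
      ]
    }
  ]
}
"""


def table_to_csv(table: dict) -> str:
    """Render one table entry as CSV text, using its "columns" as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=table["columns"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(table["data"])  # each row is a dict keyed by column name
    return buffer.getvalue()


def load_tables(json_text: str) -> dict:
    """Map (page, table_id) to CSV text for every table in the payload."""
    payload = json.loads(json_text)
    return {
        (table["page"], table["table_id"]): table_to_csv(table)
        for table in payload["tables"]
    }


if __name__ == "__main__":
    for (page, table_id), text in load_tables(EXAMPLE).items():
        print(f"page {page}, table {table_id}")
        print(text)
```

To use it on a real download, replace `EXAMPLE` with the contents of the file, for instance `load_tables(open("tables.json", encoding="utf-8").read())`. Keying on both `page` and `table_id` is a cautious choice, since the README does not state whether table IDs are unique across pages.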