---
title: Table Extraction
emoji: 📈
colorFrom: pink
colorTo: green
sdk: gradio
sdk_version: 5.44.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Extract tables from Documents
---
# Table Extraction
[![Hugging Face Spaces](https://img.shields.io/badge/🤗%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/jobian/table-extraction)
[![Made with Gradio](https://img.shields.io/badge/Made%20with-Gradio-orange)](https://gradio.app/)
[![License: Apache-2.0](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)

This is a **Proof of Concept** (PoC) for extracting **tables from text-based PDF documents** using the [GMFT library](https://github.com/microsoft/gmft) and serving results via a Gradio app.

---
## What this Space does
- Upload a **text-based PDF**.
- Automatically **detects tables** and extracts them into structured JSON.
- Each extracted table includes:
  - **Page number** + **Table ID**
  - Table **columns** and **data**
  - **Bounding box** + captions (if detected)
- Outputs are saved as:
  - `tables.json` → structured table data
  - `metrics.json` → per-file stats (tables found, processing time, etc.)
- Tables are **rendered back in the UI** for quick inspection.
- JSON + metrics are available for **download**.
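
A downloaded `tables.json` can be read back with the Python standard library alone. The sketch below is a minimal illustration, not part of the app: the helper names (`load_tables`, `table_to_rows`, `summarize`) are hypothetical, and the key names (`tables`, `page`, `table_id`, `columns`, `data`) are assumed to match the example JSON structure shown later in this README.

```python
import json
from pathlib import Path


def load_tables(path):
    """Read a downloaded tables.json and return its list of table records."""
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    # Assumption: tables live under the top-level "tables" key.
    return payload.get("tables", [])


def table_to_rows(table):
    """Turn one table record into a header row followed by value rows."""
    columns = table["columns"]
    # Cells missing from a record become None so every row has the same width.
    rows = [[record.get(col) for col in columns] for record in table["data"]]
    return [columns] + rows


def summarize(tables):
    """Map (page, table_id) to the number of data rows present in the file."""
    return {(t["page"], t["table_id"]): len(t["data"]) for t in tables}
```

For example, `table_to_rows(load_tables("tables.json")[0])` would give a list of lists that can be passed to `csv.writer(...).writerows(...)`. The `metrics.json` file is left out here because its exact keys are not documented in this README.
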
---
## ⚠️ Limitations
- This PoC currently works best with **text-based PDFs**.
- **Scanned PDFs** may not be parsed correctly (OCR support is not yet included).
---
## Example JSON structure
```json
{
  "file": "example.pdf",
  "extracted_at": "2025-09-04T13:22:58+00:00",
  "n_pages": 12,
  "tables": [
    {
      "page": 3,
      "table_id": 5,
      "columns": ["Name", "Age", "Score"],
      "n_rows": 2,
      "n_cols": 3,
      "data": [
        {"Name": "Alice", "Age": 23, "Score": 89},
        {"Name": "Bob", "Age": 25, "Score": 92}
      ]
    }
  ]
}