metadata
title: Table Extraction
emoji: 📈
colorFrom: pink
colorTo: green
sdk: gradio
sdk_version: 5.44.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Extract tables from Documents
Table Extraction
This is a Proof of Concept (PoC) for extracting tables from text-based PDF documents using the GMFT library and serving results via a Gradio app.
What this Space does
- Upload a text-based PDF.
- Tables are detected automatically and extracted into structured JSON.
- Each extracted table includes:
  - Page number + Table ID
  - Table columns and data
  - Bounding box + captions (if detected)
- Outputs are saved as:
  - tables.json → structured table data
  - metrics.json → per-file stats (tables found, processing time, etc.)
- Tables are rendered back in the UI for quick inspection.
- JSON + metrics are available for download.
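The output step described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the Space's actual code: the function name `save_outputs` and the metric keys `tables_found` and `processing_seconds` are assumptions chosen to mirror the description, and the real `metrics.json` may use different fields.

```python
import json
import time
from pathlib import Path


def save_outputs(pdf_name, tables, started_at, out_dir):
    """Write tables.json and metrics.json into out_dir and return both paths.

    `tables` is a list of already-extracted table records (plain dicts).
    `started_at` is a time.perf_counter() reading taken before extraction.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Hypothetical metric names; the Space's real schema is not documented here.
    metrics = {
        "file": pdf_name,
        "tables_found": len(tables),
        "processing_seconds": round(time.perf_counter() - started_at, 3),
    }

    tables_path = out / "tables.json"
    metrics_path = out / "metrics.json"

    tables_path.write_text(
        json.dumps({"file": pdf_name, "tables": tables}, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    metrics_path.write_text(json.dumps(metrics, indent=2), encoding="utf-8")
    return tables_path, metrics_path
```

Both returned paths could then be handed to the UI's download components, so the same files serve inspection and download.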
⚠️ Limitations
- This PoC currently works best with text-based PDFs.
- Scanned PDFs may not be parsed correctly (OCR support is not yet included).
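One way to warn users before extraction is to check whether a PDF has a usable text layer. The sketch below is an assumption, not something this Space is known to do: it expects the per-page text to have been pulled out already with whichever PDF library is in use, and the helper name `has_text_layer` and the threshold of 20 characters per page are illustrative.

```python
def has_text_layer(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is text-based from its extracted per-page text.

    `page_texts` is a list with one string per page. Scanned PDFs usually
    yield empty or near-empty strings, so a low average suggests OCR is needed.
    """
    if not page_texts:
        return False
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars / len(page_texts) >= min_chars_per_page
```

A heuristic like this will misjudge some files (for example, scanned PDFs with a hidden OCR layer), so it is best used to show a warning rather than to reject uploads.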
Example JSON structure
{
  "file": "example.pdf",
  "extracted_at": "2025-09-04T13:22:58+00:00",
  "n_pages": 12,
  "tables": [
    {
      "page": 3,
      "table_id": 5,
      "columns": ["Name", "Age", "Score"],
      "n_rows": 4,
      "n_cols": 3,
      "data": [
        {"Name": "Alice", "Age": 23, "Score": 89},
        {"Name": "Bob", "Age": 25, "Score": 92}
      ]
    }
  ]
}
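For downstream use, a table record shaped like the one above can be turned into CSV with the standard library alone. This is a sketch under the assumption that each entry in `data` is a dict keyed by the names in `columns`, as in the example; `table_to_csv` is a hypothetical helper, not part of the Space.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table record from tables.json to a CSV string.

    Column order follows table["columns"]; keys missing from a row are
    written as empty cells.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=table["columns"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(table["data"])
    return buffer.getvalue()


example_table = {
    "columns": ["Name", "Age", "Score"],
    "data": [
        {"Name": "Alice", "Age": 23, "Score": 89},
        {"Name": "Bob", "Age": 25, "Score": 92},
    ],
}
print(table_to_csv(example_table))
```

Taking the header from `columns` rather than from the first row keeps the column order stable even if a row omits a key.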