	---
	title: Table Extraction
	emoji: 📈
	colorFrom: pink
	colorTo: green
	sdk: gradio
	sdk_version: 5.44.1
	app_file: app.py
	pinned: false
	license: apache-2.0
	short_description: Extract tables from Documents
	---

	# Table Extraction

	[![Hugging Face Spaces](https://img.shields.io/badge/🤗%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/jobian/table-extraction)
	[![Made with Gradio](https://img.shields.io/badge/Made%20with-Gradio-orange)](https://gradio.app/)
	[![License: Apache-2.0](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)

	This is a proof of concept (PoC) for extracting tables from text-based PDF documents with the [GMFT library](https://github.com/conjuncts/gmft) and serving the results through a Gradio app.

	---

	## What this Space does

	- Accepts an uploaded text-based PDF.
	- Detects tables automatically and extracts them into structured JSON.
	- Each extracted table includes:
	  - Page number and table ID
	  - Table columns and data
	  - Bounding box and caption (if detected)
	- Saves the outputs as:
	  - `tables.json` → structured table data
	  - `metrics.json` → per-file stats (tables found, processing time, etc.)
	- Renders the tables back in the UI for quick inspection.
	- Makes the JSON and metrics files available for download.
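
As a rough illustration of the two output files, the sketch below assembles a `tables.json` payload in the shape of the example in this README, plus a small `metrics.json` payload. It is not the code in `app.py`: the function names, the input format (a list of dicts with `page`, `columns`, and `rows`), and the metric keys `n_tables` and `processing_time_s` are assumptions made for this example.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def build_outputs(pdf_name, n_pages, tables, elapsed_s):
    """Return (tables_payload, metrics_payload) as plain dicts.

    `tables` is a hypothetical intermediate format: one dict per table
    with "page", "columns" (list of names) and "rows" (list of lists).
    """
    records = []
    for table_id, table in enumerate(tables):
        columns = list(table["columns"])
        # Turn each row into a {column: value} mapping, as in the README example.
        rows = [dict(zip(columns, row)) for row in table["rows"]]
        records.append({
            "page": table["page"],
            "table_id": table_id,
            "columns": columns,
            "n_rows": len(rows),
            "n_cols": len(columns),
            "data": rows,
        })

    tables_payload = {
        "file": pdf_name,
        # UTC timestamp in ISO 8601 form, e.g. 2025-09-04T13:22:58+00:00
        "extracted_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "n_pages": n_pages,
        "tables": records,
    }
    # Metric key names are assumptions, not the app's actual schema.
    metrics_payload = {
        "file": pdf_name,
        "n_tables": len(records),
        "processing_time_s": round(elapsed_s, 3),
    }
    return tables_payload, metrics_payload


def save_outputs(out_dir, tables_payload, metrics_payload):
    """Write tables.json and metrics.json into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "tables.json").write_text(
        json.dumps(tables_payload, indent=2), encoding="utf-8"
    )
    (out_dir / "metrics.json").write_text(
        json.dumps(metrics_payload, indent=2), encoding="utf-8"
    )
```

Deriving `n_rows` and `n_cols` from the data rather than storing them separately keeps the counts consistent with what is actually written to the file.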

	---

	## ⚠️ Limitations

	- This PoC currently works best with text-based PDFs.
	- Scanned PDFs may not be parsed correctly (OCR support is not yet included).
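
One way to flag likely scanned PDFs before extraction is to check how many pages carry a text layer. The helper below is a hypothetical heuristic, not part of this Space: it assumes the per-page text has already been extracted with some PDF library, and both thresholds are arbitrary defaults chosen for illustration.

```python
def looks_text_based(page_texts, min_chars_per_page=20, min_fraction=0.5):
    """Guess whether a PDF is text-based from its per-page extracted text.

    page_texts: list with one string (or None) per page.
    Returns True when at least `min_fraction` of the pages contain
    `min_chars_per_page` or more non-whitespace-trimmed characters.
    """
    if not page_texts:
        return False
    pages_with_text = sum(
        1 for text in page_texts
        if len((text or "").strip()) >= min_chars_per_page
    )
    return pages_with_text / len(page_texts) >= min_fraction
```

A file that fails this check is probably a scan, so the app could warn the user instead of returning an empty result.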

	---

	## Example JSON structure

	```json
	{
	  "file": "example.pdf",
	  "extracted_at": "2025-09-04T13:22:58+00:00",
	  "n_pages": 12,
	  "tables": [
	    {
	      "page": 3,
	      "table_id": 5,
	      "columns": ["Name", "Age", "Score"],
	      "n_rows": 2,
	      "n_cols": 3,
	      "data": [
	        {"Name": "Alice", "Age": 23, "Score": 89},
	        {"Name": "Bob", "Age": 25, "Score": 92}
	      ]
	    }
	  ]
	}