metadata

title: Table Extraction
emoji: 📈
colorFrom: pink
colorTo: green
sdk: gradio
sdk_version: 5.44.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Extract tables from Documents

Table Extraction

This is a Proof of Concept (PoC) for extracting tables from text-based PDF documents using the GMFT library and serving results via a Gradio app.

What this Space does

Upload a text-based PDF.
The app automatically detects tables and extracts them into structured JSON.
Each extracted table includes:
- Page number + Table ID
- Table columns and data
- Bounding box + captions (if detected)
Outputs are saved as:
- tables.json → structured table data
- metrics.json → per-file stats (tables found, processing time, etc.)
Tables are rendered back in the UI for quick inspection.
JSON + metrics are available for download.
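
The output files described above could be assembled roughly as follows. This is a minimal sketch using only the Python standard library. It assumes each table has already been extracted into column names and row lists. The helper names `build_table_record` and `write_outputs`, and the metrics keys `n_tables` and `processing_seconds`, are hypothetical and not taken from the Space's `app.py`.

```python
import json
import time
from datetime import datetime, timezone
from pathlib import Path


def build_table_record(page, table_id, columns, rows):
    """Turn one extracted table into a dict shaped like the example JSON."""
    return {
        "page": page,
        "table_id": table_id,
        "columns": list(columns),
        "n_rows": len(rows),
        "n_cols": len(columns),
        # Each row becomes a {column: value} mapping.
        "data": [dict(zip(columns, row)) for row in rows],
    }


def write_outputs(file_name, n_pages, tables, started_at, out_dir="."):
    """Write tables.json and metrics.json (metrics keys are assumed names)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    payload = {
        "file": file_name,
        "extracted_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "n_pages": n_pages,
        "tables": tables,
    }
    metrics = {
        "file": file_name,
        "n_tables": len(tables),
        # started_at is a time.perf_counter() reading taken before extraction.
        "processing_seconds": round(time.perf_counter() - started_at, 3),
    }
    (out / "tables.json").write_text(json.dumps(payload, indent=2), encoding="utf-8")
    (out / "metrics.json").write_text(json.dumps(metrics, indent=2), encoding="utf-8")
    return payload, metrics
```

A caller would record `started_at = time.perf_counter()` before extraction, build one record per detected table, and then call `write_outputs` once per uploaded file.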

⚠️ Limitations

This PoC currently works best with text-based PDFs.
Scanned PDFs may not be parsed correctly (OCR support is not yet included).
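
Because scanned PDFs carry little or no extractable text, a simple pre-check can flag them before extraction. The sketch below is a heuristic, not part of this Space. It assumes the per-page text has already been obtained with a PDF library of your choice, and the function name `looks_text_based` and both thresholds are illustrative assumptions.

```python
def looks_text_based(page_texts, min_chars_per_page=20, min_fraction=0.5):
    """Heuristic: treat a PDF as text-based if enough pages carry extractable text.

    page_texts is a list with one string (or None) per page.
    """
    if not page_texts:
        return False
    pages_with_text = sum(
        1 for text in page_texts if len((text or "").strip()) >= min_chars_per_page
    )
    return pages_with_text / len(page_texts) >= min_fraction
```

A document that fails this check is probably scanned and would need OCR, which this PoC does not include.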

Example JSON structure

{
  "file": "example.pdf",
  "extracted_at": "2025-09-04T13:22:58+00:00",
  "n_pages": 12,
  "tables": [
    {
      "page": 3,
      "table_id": 5,
      "columns": ["Name", "Age", "Score"],
      "n_rows": 2,
      "n_cols": 3,
      "data": [
        {"Name": "Alice", "Age": 23, "Score": 89},
        {"Name": "Bob", "Age": 25, "Score": 92}
      ]
    }
  ]
}
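
A downloaded `tables.json` with the structure above can be consumed with the standard library alone. The following is a minimal sketch that loads the table records and renders one of them as CSV. The helper names `load_tables` and `table_to_csv` are hypothetical, and the code assumes every row in `data` uses exactly the keys listed in `columns`.

```python
import csv
import io
import json


def load_tables(json_text):
    """Parse a tables.json payload and return its list of table records."""
    return json.loads(json_text)["tables"]


def table_to_csv(table):
    """Render one table record to CSV text, using "columns" as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=table["columns"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(table["data"])
    return buffer.getvalue()
```

For example, `table_to_csv(load_tables(open("tables.json", encoding="utf-8").read())[0])` would return the first table as CSV text, ready to write to disk or load into a spreadsheet.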