---
title: Table Extraction
emoji: 📈
colorFrom: pink
colorTo: green
sdk: gradio
sdk_version: 5.44.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Extract tables from Documents
---

# Table Extraction

[![Hugging Face Spaces](https://img.shields.io/badge/🤗%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/jobian/table-extraction)
[![Made with Gradio](https://img.shields.io/badge/Made%20with-Gradio-orange)](https://gradio.app/)
[![License: Apache-2.0](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)

This is a **Proof of Concept** (PoC) that extracts **tables from text-based PDF documents** using the [GMFT library](https://github.com/conjuncts/gmft) and serves the results through a Gradio app.

---

## What this Space does

- Accepts an uploaded **text-based PDF**.
- Automatically **detects tables** and extracts them into structured JSON.
- Each extracted table includes:
  - **Page number** + **Table ID**
  - Table **columns** and **data**
  - **Bounding box** + captions (if detected)
- Outputs are saved as:
  - `tables.json` → structured table data
  - `metrics.json` → per-file stats (tables found, processing time, etc.)
- Tables are **rendered back in the UI** for quick inspection.
- JSON + metrics are available for **download**.
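
As a rough illustration of how a downloaded `tables.json` might be consumed, the sketch below summarizes each table and converts it to CSV using only the Python standard library. It assumes the field names shown in this README's example JSON structure (`tables`, `page`, `table_id`, `columns`, `data`). The helper names `table_to_csv` and `summarize` are hypothetical and not part of the Space.

```python
import csv
import io
import json


def table_to_csv(table: dict) -> str:
    """Render one extracted table as CSV text, using `columns` for header order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=table["columns"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(table["data"])  # each row is a dict keyed by column name
    return buffer.getvalue()


def summarize(payload: dict) -> list[str]:
    """One line per table: page, table ID, and shape computed from the data itself."""
    return [
        f"page {t['page']}, table {t['table_id']}: "
        f"{len(t['data'])} rows x {len(t['columns'])} cols"
        for t in payload["tables"]
    ]


# Inline sample (assumed shape) so the sketch runs without a file.
# With a real download, load the file instead:
#     with open("tables.json", encoding="utf-8") as fh:
#         payload = json.load(fh)
sample = json.loads("""
{
  "file": "example.pdf",
  "tables": [
    {
      "page": 3,
      "table_id": 5,
      "columns": ["Name", "Age", "Score"],
      "data": [
        {"Name": "Alice", "Age": 23, "Score": 89},
        {"Name": "Bob", "Age": 25, "Score": 92}
      ]
    }
  ]
}
""")

for line in summarize(sample):
    print(line)
print(table_to_csv(sample["tables"][0]))
```

The shape is computed from `data` and `columns` rather than read from stored counts, so the summary stays correct even if a file's metadata and rows disagree.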

---

## ⚠️ Limitations

- This PoC currently works best with **text-based PDFs**, meaning PDFs that contain an embedded text layer.
- **Scanned PDFs** (image-only pages) may not be parsed correctly because OCR support is not yet included.

---

## Example JSON structure

```json
{
  "file": "example.pdf",
  "extracted_at": "2025-09-04T13:22:58+00:00",
  "n_pages": 12,
  "tables": [
    {
      "page": 3,
      "table_id": 5,
      "columns": ["Name", "Age", "Score"],
      "n_rows": 2,
      "n_cols": 3,
      "data": [
        {"Name": "Alice", "Age": 23, "Score": 89},
        {"Name": "Bob", "Age": 25, "Score": 92}
      ]
    }
  ]
}