title: Document Understanding OCR
emoji: 📄
colorFrom: yellow
colorTo: blue
sdk: docker
pinned: false
license: mit

Document Understanding OCR

Question

After OCR has produced text, how do we map that text onto a structured document schema?

System Boundary

This Streamlit Space demonstrates the post-OCR layer for invoices: field extraction, confidence scoring, line-item parsing, validation, and JSON export.

Method

The app applies transparent extraction patterns to OCR text, computes a confidence score for each field, parses line items, and compares each field's confidence against a review threshold.
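
The line-item parsing and validation steps could be sketched as follows. This is a minimal illustration, not the app's actual code: the line layout (`description qty unit_price amount`), the `LINE_ITEM` pattern, and the helper names `parse_line_items` and `validate_line_items` are all assumptions made for the example.

```python
import re
from decimal import Decimal

# Assumed layout for one line item: "<description> <qty> <unit price> <amount>".
# Real invoices vary; this pattern is hypothetical.
LINE_ITEM = re.compile(
    r"^(?P<description>.+?)\s+(?P<qty>\d+)\s+"
    r"(?P<unit_price>\d+\.\d{2})\s+(?P<amount>\d+\.\d{2})$"
)


def parse_line_items(ocr_text):
    """Return one dict per OCR line that matches the assumed layout."""
    items = []
    for line in ocr_text.splitlines():
        match = LINE_ITEM.match(line.strip())
        if match:
            items.append({
                "description": match["description"],
                "qty": int(match["qty"]),
                "unit_price": Decimal(match["unit_price"]),
                "amount": Decimal(match["amount"]),
            })
    return items


def validate_line_items(items, stated_total):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    for item in items:
        if item["qty"] * item["unit_price"] != item["amount"]:
            problems.append(f"amount mismatch: {item['description']}")
    if sum((item["amount"] for item in items), Decimal("0")) != stated_total:
        problems.append("line items do not sum to total")
    return problems


# Made-up OCR text for illustration only.
SAMPLE = """Widget A 2 10.00 20.00
Service fee 1 5.50 5.50
Total 25.50"""

items = parse_line_items(SAMPLE)
problems = validate_line_items(items, Decimal("25.50"))
```

`Decimal` is used instead of `float` so that currency comparisons are exact. The "Total" line is skipped because it does not match the four-column layout.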

Technique

This is schema extraction after OCR. Raw text is mapped into named fields, and each field gets a confidence signal.

The method is intentionally transparent: field patterns are visible and the review threshold controls which fields require human attention.
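
A rough sketch of this idea, under stated assumptions: the patterns in `FIELD_PATTERNS`, the confidence heuristic (0.9 for a single consistent match, 0.5 for conflicting matches, 0.0 for no match), and the default threshold of 0.8 are hypothetical and chosen for illustration. The app's real patterns and scoring are not shown in this README.

```python
import re

# Hypothetical field patterns; each captures the field value in group 1.
FIELD_PATTERNS = {
    "invoice_number": r"\bInvoice\s*(?:No\.?|#)\s*:?\s*([A-Z0-9-]+)",
    "due_date": r"\bDue\s*Date\s*:?\s*(\d{4}-\d{2}-\d{2})",
    "total": r"\bTotal\s*:?\s*\$?\s*(\d+\.\d{2})",
}


def extract_fields(ocr_text):
    """Map raw OCR text to named fields, each with a confidence value."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        matches = re.findall(pattern, ocr_text, flags=re.IGNORECASE)
        if not matches:
            # Field not found at all.
            fields[name] = {"value": None, "confidence": 0.0}
        elif len(set(matches)) == 1:
            # Exactly one distinct candidate value.
            fields[name] = {"value": matches[0], "confidence": 0.9}
        else:
            # Several conflicting candidates: keep the first, lower confidence.
            fields[name] = {"value": matches[0], "confidence": 0.5}
    return fields


def review_queue(fields, threshold=0.8):
    """Names of fields whose confidence falls below the review threshold."""
    return [name for name, f in fields.items() if f["confidence"] < threshold]


# Made-up OCR text: no due date, and two conflicting totals.
OCR_TEXT = """Invoice No: INV-1042
Total: $25.50
Total: $52.50"""

fields = extract_fields(OCR_TEXT)
queue = review_queue(fields)
```

Because the patterns live in one dictionary, a reviewer can read exactly what each field matches, and raising or lowering `threshold` changes how many fields land in the queue.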

Output

The app returns a field table, line-item table, confidence chart, review queue, and JSON payload.
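
One possible shape for the JSON payload is sketched below. The key names (`fields`, `line_items`, `review`) and the helper `build_payload` are assumptions for illustration; the README does not specify the app's actual export schema.

```python
import json


def build_payload(fields, line_items, threshold=0.8):
    """Bundle extracted fields, line items, and review info into one dict."""
    needs_review = sorted(
        name for name, f in fields.items() if f["confidence"] < threshold
    )
    return {
        "fields": fields,
        "line_items": line_items,
        "review": {"threshold": threshold, "needs_review": needs_review},
    }


# Made-up extraction results for illustration only.
fields = {
    "invoice_number": {"value": "INV-1042", "confidence": 0.9},
    "total": {"value": "25.50", "confidence": 0.5},
}
line_items = [
    {"description": "Widget A", "qty": 2, "unit_price": "10.00", "amount": "20.00"},
]

payload = build_payload(fields, line_items)
payload_json = json.dumps(payload, indent=2)
print(payload_json)
```

Amounts are kept as strings here so that the exported values match the OCR text exactly and do not pick up floating-point rounding.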

Why It Matters

Document AI becomes useful when extraction is inspectable. A human reviewer should know which fields were found, which were uncertain, and what JSON would be sent downstream.

What To Notice

Field-level confidence is more actionable than a single document score. A document can be mostly correct while one critical field, such as total or due date, is wrong.
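
A small worked example makes the point concrete. The confidence values and the 0.8 threshold below are made up for illustration and are not output from the app.

```python
from statistics import mean

# Illustrative per-field confidences: three strong fields, one weak one.
confidences = {
    "invoice_number": 0.95,
    "vendor": 0.95,
    "invoice_date": 0.95,
    "total": 0.40,
}
THRESHOLD = 0.8  # assumed review threshold

# A single document score averages the weak field away.
document_score = mean(confidences.values())
document_passes = document_score >= THRESHOLD

# Field-level checks still surface it.
fields_to_review = [
    name for name, conf in confidences.items() if conf < THRESHOLD
]

print(f"document score: {document_score:.2f}, passes: {document_passes}")
print(f"fields to review: {fields_to_review}")
```

The average is about 0.81, so a document-level check would accept the invoice, while the field-level check sends `total` to review.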

Effect In Practice

This pattern supports invoice processing, procurement workflows, form extraction, and human review queues.

Hugging Face Extension

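The Space could be extended to accept document images as input, using TrOCR for text recognition, LayoutLM for layout-aware extraction on top of OCR output, or Donut or another vision-language model for OCR-free parsing. It could also evaluate field-level extraction accuracy against labeled documents.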
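
Field-level accuracy could be measured along these lines. This is a sketch that assumes exact string match as the correctness criterion and a hypothetical helper `field_accuracy`; the predictions and labels are made-up examples, not evaluation results.

```python
from collections import defaultdict


def field_accuracy(predictions, labels):
    """Per-field exact-match accuracy over paired documents.

    predictions and labels are parallel lists of {field_name: value} dicts.
    A field missing from a prediction counts as wrong.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, gold in zip(predictions, labels):
        for name, gold_value in gold.items():
            total[name] += 1
            if pred.get(name) == gold_value:
                correct[name] += 1
    return {name: correct[name] / total[name] for name in total}


# Made-up labeled data: two documents, two fields each.
labels = [
    {"invoice_number": "INV-1", "total": "25.50"},
    {"invoice_number": "INV-2", "total": "10.00"},
]
predictions = [
    {"invoice_number": "INV-1", "total": "25.50"},
    {"invoice_number": "INV-2", "total": "70.00"},  # simulated misread digit
]

accuracy = field_accuracy(predictions, labels)
```

Reporting accuracy per field, rather than per document, shows which fields a new OCR or vision-language backend actually improves.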

Limitations

This version starts from OCR text. A full system should add image-to-text OCR or document VLM inference, table recognition, multilingual support, and labeled evaluation.

Run Locally

pip install -r requirements.txt
streamlit run app.py