---
title: Document Understanding OCR
emoji: 📄
colorFrom: yellow
colorTo: blue
sdk: docker
pinned: false
license: mit
---
# Document Understanding OCR
## Question
After OCR has produced text, how do we recover a structured document schema?
## System Boundary
This Streamlit Space demonstrates the post-OCR layer for invoices: field extraction, confidence scoring, line-item parsing, validation, and JSON export.
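One plausible validation rule, shown here only as a sketch, is to check that the parsed line-item amounts add up to the stated total. The function name `validate_total` and the one-cent tolerance are assumptions for illustration, not taken from the app.

```python
from decimal import Decimal

def validate_total(line_item_amounts, stated_total, tolerance=Decimal("0.01")):
    """Check that line-item amounts sum to the stated total (hypothetical rule)."""
    # Decimal avoids the rounding drift that float arithmetic introduces for money.
    computed = sum((Decimal(a) for a in line_item_amounts), Decimal("0"))
    difference = abs(computed - Decimal(stated_total))
    return {"computed_total": str(computed), "is_consistent": difference <= tolerance}

result = validate_total(["20.00", "5.50"], "25.50")
```

A failed check does not say which value is wrong, so it is better treated as a reason to send the document to review than as an automatic correction.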
## Method
The app applies transparent extraction patterns to OCR text, computes a confidence score for each field, parses line items, and compares each field's confidence against a review threshold.
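A minimal sketch of pattern-based extraction with a simple confidence heuristic is shown below. The patterns, field names, and confidence values (0.9, 0.5, 0.0) are assumptions chosen for illustration; the app's own patterns and scoring may differ.

```python
import re

# Hypothetical field patterns; the app's real patterns may differ.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    "due_date": re.compile(r"Due\s*Date\s*[:\-]?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    # \b keeps "Subtotal" from matching as "Total".
    "total": re.compile(r"\bTotal\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}

def extract_fields(ocr_text):
    """Return {field: {"value": str or None, "confidence": float}}."""
    results = {}
    for name, pattern in FIELD_PATTERNS.items():
        matches = pattern.findall(ocr_text)
        if not matches:
            # Nothing found: no value, zero confidence.
            results[name] = {"value": None, "confidence": 0.0}
        elif len(set(matches)) == 1:
            # Every match agrees, so the value is unambiguous.
            results[name] = {"value": matches[0], "confidence": 0.9}
        else:
            # Conflicting candidates: keep the first and lower the score.
            results[name] = {"value": matches[0], "confidence": 0.5}
    return results

sample = "Invoice No: INV-1042\nDue Date: 2024-07-31\nSubtotal: $100.00\nTotal: $110.00"
fields = extract_fields(sample)
```

Because the patterns are plain regular expressions held in one dictionary, a reviewer can read exactly why a field was or was not matched.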
## Technique
This is schema extraction after OCR. Raw text is mapped into named fields, and each field gets a confidence signal.
The method is intentionally transparent: field patterns are visible and the review threshold controls which fields require human attention.
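The role of the review threshold can be sketched as a simple filter. The function name `build_review_queue`, the default threshold of 0.8, and the sample values are hypothetical.

```python
def build_review_queue(fields, threshold=0.8):
    """Return names of fields below the threshold, least confident first."""
    flagged = [
        (name, info["confidence"])
        for name, info in fields.items()
        if info["confidence"] < threshold
    ]
    flagged.sort(key=lambda item: item[1])
    return [name for name, _ in flagged]

# Illustrative values, not output from the app.
fields = {
    "invoice_number": {"value": "INV-1042", "confidence": 0.95},
    "due_date": {"value": None, "confidence": 0.0},
    "total": {"value": "110.00", "confidence": 0.55},
}
queue = build_review_queue(fields, threshold=0.8)
```

Raising the threshold sends more fields to a human; lowering it trusts the patterns more.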
## Output
The app returns a field table, line-item table, confidence chart, review queue, and JSON payload.
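As an illustration of the line-item table and JSON payload, the sketch below parses rows from OCR text and serializes them. It assumes each row has the layout `<description> <quantity> <unit price> <amount>`; the payload keys are likewise assumptions, not the app's actual schema.

```python
import json
import re

# Assumed row layout: "<description> <quantity> <unit price> <amount>".
LINE_ITEM = re.compile(
    r"^(?P<description>.+?)\s+(?P<quantity>\d+)\s+"
    r"(?P<unit_price>\d+\.\d{2})\s+(?P<amount>\d+\.\d{2})$"
)

def parse_line_items(ocr_text):
    """Return one dict per line that matches the assumed row layout."""
    items = []
    for line in ocr_text.splitlines():
        match = LINE_ITEM.match(line.strip())
        if match:
            items.append({
                "description": match["description"],
                "quantity": int(match["quantity"]),
                "unit_price": float(match["unit_price"]),
                "amount": float(match["amount"]),
            })
    return items

text = "Widget A 2 10.00 20.00\nService fee 1 5.50 5.50\nTotal: 25.50"
items = parse_line_items(text)

# Hypothetical payload shape for downstream systems.
payload = json.dumps({"fields": {"total": "25.50"}, "line_items": items}, indent=2)
```

Lines that do not fit the layout, such as the total row, are skipped rather than guessed at.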
## Why It Matters
Document AI becomes useful when extraction is inspectable. A human reviewer should know which fields were found, which were uncertain, and what JSON would be sent downstream.
## What To Notice
Field-level confidence is more actionable than a single document score. A document can be mostly correct while one critical field, such as total or due date, is wrong.
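The point can be made concrete with a small calculation. The confidence values below are invented for illustration: the document-level average looks acceptable even though the total is unreliable.

```python
from statistics import mean

# Invented per-field confidences for a single invoice.
confidences = {"vendor": 0.97, "invoice_number": 0.95, "due_date": 0.93, "total": 0.40}

# A single document score hides the weak field behind three strong ones.
document_score = mean(confidences.values())

# The field-level view points directly at what a reviewer should check.
weakest_field = min(confidences, key=confidences.get)
```

Here the average is above 0.8, yet `weakest_field` is `total`, the field most likely to cause a payment error.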
## Effect In Practice
This pattern supports invoice processing, procurement workflows, form extraction, and human review queues.
## Hugging Face Extension
The Space could be extended to accept document images, using TrOCR for text recognition, Donut or LayoutLM for document understanding, or a vision-language model, and to evaluate field-level extraction accuracy.
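Field-level extraction accuracy could be measured with per-field exact match against labeled documents. The sketch below assumes predictions and labels are parallel lists of dictionaries; the function name and sample data are hypothetical.

```python
def field_accuracy(predictions, labels):
    """Per-field exact-match accuracy over paired documents."""
    field_names = {name for gold_doc in labels for name in gold_doc}
    scores = {}
    for name in sorted(field_names):
        # Only score documents where the label actually contains this field.
        pairs = [
            (pred_doc.get(name), gold_doc[name])
            for pred_doc, gold_doc in zip(predictions, labels)
            if name in gold_doc
        ]
        correct = sum(1 for pred, gold in pairs if pred == gold)
        scores[name] = correct / len(pairs)
    return scores

# Invented labels and predictions for two invoices.
labels = [
    {"total": "110.00", "due_date": "2024-07-31"},
    {"total": "25.50", "due_date": "2024-08-15"},
]
predictions = [
    {"total": "110.00", "due_date": "2024-07-31"},
    {"total": "25.50", "due_date": None},
]
scores = field_accuracy(predictions, labels)
```

Exact match is strict; dates and amounts would usually be normalized before comparison so that formatting differences are not counted as errors.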
## Limitations
This version starts from OCR text. A full system should add image-to-text OCR or document VLM inference, table recognition, multilingual support, and labeled evaluation.
## Run Locally
```bash
pip install -r requirements.txt
streamlit run app.py
```