title: Document Understanding OCR
emoji: 📄
colorFrom: yellow
colorTo: blue
sdk: docker
pinned: false
license: mit
Document Understanding OCR
Question
After OCR has produced text, how do we recover a structured document schema?
System Boundary
This Streamlit Space demonstrates the post-OCR layer for invoices: field extraction, confidence scoring, line-item parsing, validation, and JSON export.
Method
The app applies transparent extraction patterns to OCR text, computes a confidence score for each field, parses line items, and compares each field's confidence against a review threshold.
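As a rough illustration of pattern-based extraction with a per-field confidence, the sketch below uses regular expressions. The pattern set, the field names, and the confidence values (0.9 for a single consistent match, 0.5 for conflicting matches, 0.0 for no match) are assumptions chosen for the example, not the Space's actual rules.

```python
import re

# Hypothetical patterns for illustration; the Space's own patterns may differ.
FIELD_PATTERNS = {
    "invoice_number": r"Invoice\s*(?:Number|No\.?|#)\s*:?\s*([A-Z0-9-]+)",
    "due_date": r"Due\s*Date\s*:?\s*(\d{4}-\d{2}-\d{2})",
    "total": r"\bTotal\s*:?\s*\$?\s*([\d,]+\.\d{2})",
}


def extract_fields(ocr_text):
    """Map OCR text to {field: {"value": ..., "confidence": ...}}."""
    results = {}
    for name, pattern in FIELD_PATTERNS.items():
        matches = re.findall(pattern, ocr_text, flags=re.IGNORECASE)
        if not matches:
            # Nothing found: no value and zero confidence.
            results[name] = {"value": None, "confidence": 0.0}
        elif len(set(matches)) == 1:
            # One consistent candidate: assumed high confidence.
            results[name] = {"value": matches[0], "confidence": 0.9}
        else:
            # Conflicting candidates: keep the first, lower the confidence.
            results[name] = {"value": matches[0], "confidence": 0.5}
    return results


sample = "Invoice No: INV-1042\nDue Date: 2024-07-01\nSubtotal: $100.00\nTotal: $108.00"
fields = extract_fields(sample)
print(fields["total"])  # {'value': '108.00', 'confidence': 0.9}
```

The word boundary in the `total` pattern keeps `Subtotal` from being read as the total, which is the kind of rule that stays easy to audit when patterns are visible.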
Technique
This is schema extraction after OCR. Raw text is mapped into named fields, and each field gets a confidence signal.
The method is intentionally transparent: field patterns are visible and the review threshold controls which fields require human attention.
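One way the review threshold could drive the review queue is sketched below. The function name `build_review_queue`, the default threshold of 0.8, and the sample field values are assumptions for illustration.

```python
def build_review_queue(fields, threshold=0.8):
    """Return names of fields below the threshold, least confident first."""
    flagged = [
        name for name, field in fields.items()
        if field["confidence"] < threshold
    ]
    return sorted(flagged, key=lambda name: fields[name]["confidence"])


# Hypothetical extraction result for one invoice.
fields = {
    "invoice_number": {"value": "INV-1042", "confidence": 0.9},
    "due_date": {"value": "2024-07-01", "confidence": 0.5},
    "total": {"value": None, "confidence": 0.0},
}

print(build_review_queue(fields))  # ['total', 'due_date']
```

Raising the threshold sends more fields to a human; lowering it trusts the patterns more.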
Output
The app returns a field table, line-item table, confidence chart, review queue, and JSON payload.
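The line-item table and JSON payload might be produced along the lines of the sketch below. It assumes each line item appears on one OCR line as description, quantity, unit price, and amount; the payload keys and the `valid` arithmetic check are illustrative choices, not the app's confirmed schema.

```python
import json
import re
from decimal import Decimal

# Assumed layout: "<description> <qty> <unit price> <amount>" on one line.
LINE_ITEM = re.compile(
    r"^(?P<description>.+?)\s+(?P<qty>\d+)\s+"
    r"(?P<unit_price>\d+\.\d{2})\s+(?P<amount>\d+\.\d{2})$"
)


def parse_line_items(ocr_text):
    """Parse line items and check that quantity * unit price equals amount."""
    items = []
    for line in ocr_text.splitlines():
        match = LINE_ITEM.match(line.strip())
        if match is None:
            continue  # Not a line item (header, total, free text).
        quantity = int(match["qty"])
        unit_price = Decimal(match["unit_price"])
        amount = Decimal(match["amount"])
        items.append({
            "description": match["description"],
            "quantity": quantity,
            "unit_price": str(unit_price),
            "amount": str(amount),
            "valid": quantity * unit_price == amount,
        })
    return items


def to_payload(fields, line_items):
    """Serialize extracted fields and line items as a JSON string."""
    return json.dumps({"fields": fields, "line_items": line_items}, indent=2)


text = "Widget A 2 10.00 20.00\nCable 3 4.50 13.00\nTotal: 33.00"
items = parse_line_items(text)
payload = to_payload({"total": {"value": "33.00", "confidence": 0.9}}, items)
print(payload)
```

`Decimal` is used instead of `float` so that the arithmetic check does not fail on binary rounding. In this sample the second item is marked invalid because 3 × 4.50 is 13.50, not 13.00.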
Why It Matters
Document AI becomes useful when extraction is inspectable. A human reviewer should know which fields were found, which were uncertain, and what JSON would be sent downstream.
What To Notice
Field-level confidence is more actionable than a single document score. A document can be mostly correct while one critical field, such as total or due date, is wrong.
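A small numeric example makes the point concrete. The confidence values, the `vendor` field, and the 0.8 threshold below are made up for illustration.

```python
from statistics import mean

# Hypothetical per-field confidences for one invoice.
confidences = {
    "invoice_number": 0.95,
    "vendor": 0.97,
    "due_date": 0.93,
    "total": 0.40,
}
threshold = 0.8

# A single document score averages the weak field away.
document_score = mean(confidences.values())
document_passes = document_score >= threshold

# Field-level checks surface the one field that needs attention.
fields_to_review = [name for name, c in confidences.items() if c < threshold]

print(document_passes)   # True
print(fields_to_review)  # ['total']
```

The average is roughly 0.81, so a document-level check would pass this invoice even though its total is the least reliable field.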
Effect In Practice
This pattern supports invoice processing, procurement workflows, form extraction, and human review queues.
Hugging Face Extension
The Space could be extended to start from document images, using TrOCR for text recognition, or Donut, LayoutLM, or a vision-language model for document understanding, and to evaluate field-level extraction accuracy against labeled data.
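Field-level accuracy could be measured with per-field exact match against labeled documents, as in the sketch below. The function name, the exact-match criterion, and the sample data are assumptions; a real evaluation might normalize dates and amounts before comparing.

```python
from collections import defaultdict


def field_accuracy(predictions, labels):
    """Per-field exact-match accuracy over paired prediction and label dicts."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for predicted, gold in zip(predictions, labels):
        for field, gold_value in gold.items():
            total[field] += 1
            if predicted.get(field) == gold_value:
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}


# Hypothetical predictions and labels for two invoices.
predictions = [
    {"invoice_number": "INV-1042", "total": "108.00", "due_date": "2024-07-01"},
    {"invoice_number": "INV-1043", "total": "19.99", "due_date": None},
]
labels = [
    {"invoice_number": "INV-1042", "total": "108.00", "due_date": "2024-07-01"},
    {"invoice_number": "INV-1043", "total": "19.90", "due_date": "2024-08-15"},
]

print(field_accuracy(predictions, labels))
# {'invoice_number': 1.0, 'total': 0.5, 'due_date': 0.5}
```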
Limitations
This version starts from OCR text. A full system should add image-to-text OCR or document VLM inference, table recognition, multilingual support, and labeled evaluation.
Run Locally
pip install -r requirements.txt
streamlit run app.py