---
title: Document Understanding OCR
emoji: 📄
colorFrom: yellow
colorTo: blue
sdk: docker
pinned: false
license: mit
---
# Document Understanding OCR
## Question
After OCR has produced text, how do we recover a structured document schema?
## System Boundary
This Streamlit Space demonstrates the post-OCR layer for invoices: field extraction, confidence scoring, line-item parsing, validation, and JSON export.
## Method
The app applies transparent extraction patterns to OCR text, computes a confidence score for each field, parses line items, and flags any field whose confidence falls below a review threshold.
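The following is a minimal sketch of pattern-based extraction with field-level confidence, not the app's actual code. The pattern set `FIELD_PATTERNS`, the function `extract_fields`, and the confidence heuristic (0.9 for a unique match, 0.6 for an ambiguous one, 0.0 for no match) are all assumptions made for illustration.

```python
import re

# Hypothetical patterns; the patterns used in app.py may differ.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:Number|No\.?|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "due_date": re.compile(r"Due\s*Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    # Anchored to line start so that "Subtotal" is not picked up as "Total".
    "total": re.compile(
        r"^Total\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE | re.MULTILINE
    ),
}


def extract_fields(ocr_text):
    """Map raw OCR text to named fields, each with a confidence score."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        matches = pattern.findall(ocr_text)
        if not matches:
            fields[name] = {"value": None, "confidence": 0.0}
        else:
            # Assumed heuristic: a unique match is trusted more than an ambiguous one.
            confidence = 0.9 if len(matches) == 1 else 0.6
            fields[name] = {"value": matches[0], "confidence": confidence}
    return fields


sample = """ACME Corp
Invoice No: INV-1042
Due Date: 2024-07-15
Subtotal: $180.00
Total: $198.00
"""
result = extract_fields(sample)
```

Because each pattern is a plain regular expression, a reviewer can read exactly why a field was or was not found.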
## Technique
This is schema extraction after OCR. Raw text is mapped into named fields, and each field gets a confidence signal.
The method is intentionally transparent: field patterns are visible and the review threshold controls which fields require human attention.
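As a rough illustration of how a review threshold might route fields to a human, the sketch below filters fields by confidence. The name `build_review_queue`, the default threshold of 0.75, and the field dictionary layout are assumptions, not values taken from the app.

```python
REVIEW_THRESHOLD = 0.75  # assumed default; the app exposes its own setting


def build_review_queue(fields, threshold=REVIEW_THRESHOLD):
    """Return fields whose confidence is below the threshold, least confident first."""
    queue = [
        {"field": name, "value": info["value"], "confidence": info["confidence"]}
        for name, info in fields.items()
        if info["confidence"] < threshold
    ]
    return sorted(queue, key=lambda item: item["confidence"])


# Illustrative input, shaped as field name -> value and confidence.
fields = {
    "invoice_number": {"value": "INV-1042", "confidence": 0.9},
    "due_date": {"value": None, "confidence": 0.0},
    "total": {"value": "198.00", "confidence": 0.6},
}
queue = build_review_queue(fields)
```

Raising the threshold sends more fields to review; lowering it trusts the patterns more.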
## Output
The app returns a field table, line-item table, confidence chart, review queue, and JSON payload.
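To make the line-item table and JSON payload concrete, here is a hedged sketch. It assumes each line item appears on one OCR line as description, quantity, unit price, and amount; the `LINE_ITEM` pattern, the `parse_line_items` function, the `consistent` check, and the payload keys are hypothetical and may not match the app's output.

```python
import json
import re

# Assumed layout: "<description> <quantity> <unit price> <amount>" on one line.
LINE_ITEM = re.compile(
    r"^(?P<description>.+?)\s+(?P<quantity>\d+)\s+"
    r"(?P<unit_price>\d+\.\d{2})\s+(?P<amount>\d+\.\d{2})$"
)


def parse_line_items(ocr_text):
    """Parse line items and check that quantity * unit price matches the amount."""
    items = []
    for line in ocr_text.splitlines():
        match = LINE_ITEM.match(line.strip())
        if not match:
            continue  # not a line-item row
        quantity = int(match["quantity"])
        unit_price = float(match["unit_price"])
        amount = float(match["amount"])
        items.append({
            "description": match["description"],
            "quantity": quantity,
            "unit_price": unit_price,
            "amount": amount,
            # Simple validation: does the row arithmetic hold (within half a cent)?
            "consistent": abs(quantity * unit_price - amount) < 0.005,
        })
    return items


ocr_text = """Description Qty Price Amount
Widget A 2 40.00 80.00
Service fee 1 100.00 100.00
"""
payload = json.dumps(
    {
        "fields": {"total": {"value": "198.00", "confidence": 0.9}},
        "line_items": parse_line_items(ocr_text),
    },
    indent=2,
)
```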
## Why It Matters
Document AI becomes useful when extraction is inspectable. A human reviewer should know which fields were found, which were uncertain, and what JSON would be sent downstream.
## What To Notice
Field-level confidence is more actionable than a single document score. A document can be mostly correct while one critical field, such as total or due date, is wrong.
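A small worked example shows the gap. The confidence values and the 0.75 threshold below are made up for illustration: the document-level average passes, yet the `total` field alone falls well below the threshold.

```python
from statistics import mean

# Illustrative confidences; not measured output from the app.
confidences = {
    "vendor": 0.95,
    "invoice_number": 0.95,
    "invoice_date": 0.95,
    "due_date": 0.95,
    "total": 0.40,
}
threshold = 0.75  # assumed review threshold

# A single document score averages away the one weak field (roughly 0.84 here).
document_score = mean(confidences.values())
document_passes = document_score >= threshold

# Field-level scoring surfaces exactly which field needs attention.
flagged = [name for name, conf in confidences.items() if conf < threshold]
```

Here `document_passes` is true while `flagged` contains only `total`, which is the field a payment system cares about most.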
## Effect In Practice
This pattern supports invoice processing, procurement workflows, form extraction, and human review queues.
## Hugging Face Extension
The Space can add document-image OCR with TrOCR, Donut, LayoutLM, or a vision-language model and evaluate field-level extraction accuracy.
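Field-level accuracy could be measured along these lines once labeled documents exist. This is a sketch under assumptions: `field_accuracy` is a hypothetical helper, it uses exact string match, and the two example documents are invented.

```python
from collections import defaultdict


def field_accuracy(predictions, labels):
    """Compute exact-match accuracy per field across paired documents."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for predicted, gold in zip(predictions, labels):
        for name, gold_value in gold.items():
            total[name] += 1
            # A missing prediction counts as wrong.
            if predicted.get(name) == gold_value:
                correct[name] += 1
    return {name: correct[name] / total[name] for name in total}


# Invented examples: two documents, one with a wrong total.
predictions = [
    {"invoice_number": "INV-1042", "total": "198.00"},
    {"invoice_number": "INV-1043", "total": "19.80"},
]
labels = [
    {"invoice_number": "INV-1042", "total": "198.00"},
    {"invoice_number": "INV-1043", "total": "198.00"},
]
accuracy = field_accuracy(predictions, labels)
```

Exact match is strict; a real evaluation might normalize dates and amounts before comparing.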
## Limitations
This version starts from OCR text. A full system should add image-to-text OCR or document VLM inference, table recognition, multilingual support, and labeled evaluation.
## Run Locally
```bash
pip install -r requirements.txt
streamlit run app.py
```