---
title: Document Understanding OCR
emoji: 📄
colorFrom: yellow
colorTo: blue
sdk: docker
pinned: false
license: mit
---

# Document Understanding OCR

## Question

After OCR has produced text, how do we recover a structured document schema?

## System Boundary

This Streamlit Space demonstrates the post-OCR layer for invoices: field extraction, confidence scoring, line-item parsing, validation, and JSON export.

## Method

The app applies transparent extraction patterns to OCR text, computes field-level confidence, parses line items, and compares extracted fields against a review threshold.

## Technique

This is schema extraction after OCR. Raw text is mapped into named fields, and each field gets a confidence signal. The method is intentionally transparent: field patterns are visible and the review threshold controls which fields require human attention.

## Output

The app returns a field table, line-item table, confidence chart, review queue, and JSON payload.

## Why It Matters

Document AI becomes useful when extraction is inspectable. A human reviewer should know which fields were found, which were uncertain, and what JSON would be sent downstream.

## What To Notice

Field-level confidence is more actionable than a single document score. A document can be mostly correct while one critical field, such as total or due date, is wrong.

## Effect In Practice

This pattern supports invoice processing, procurement workflows, form extraction, and human review queues.

## Hugging Face Extension

The Space can add document-image OCR with TrOCR, Donut, LayoutLM, or a vision-language model and evaluate field-level extraction accuracy.

## Limitations

This version starts from OCR text. A full system should add image-to-text OCR or document VLM inference, table recognition, multilingual support, and labeled evaluation.

## Run Locally

```bash
pip install -r requirements.txt
streamlit run app.py
```
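
## Illustrative Sketch

The pipeline described under Method and Technique (pattern-based field extraction, field-level confidence, line-item parsing, validation, a review threshold, and JSON export) can be sketched roughly as below. This is a minimal illustration, not the code in `app.py`: the regex patterns, the confidence values (0.9, 0.6, 0.4, 0.0), the line-item layout, the 0.8 threshold, and every function name are assumptions chosen for the example.

```python
import json
import re

# Hypothetical field patterns; the Space's real patterns may differ.
FIELD_PATTERNS = {
    "invoice_number": r"Invoice\s*(?:No\.?|#)\s*[:\-]?\s*([A-Z0-9\-]+)",
    "due_date": r"Due\s*Date\s*[:\-]?\s*(\d{4}-\d{2}-\d{2})",
    "total": r"^Total\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})",
}

# Assumed line-item layout: description, quantity, unit price, amount.
LINE_ITEM = re.compile(r"^(.+?)\s+(\d+)\s+([\d,]+\.\d{2})\s+([\d,]+\.\d{2})$")


def extract_fields(text):
    """Map raw OCR text to named fields, each with a heuristic confidence."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        matches = re.findall(pattern, text, flags=re.IGNORECASE | re.MULTILINE)
        if not matches:
            fields[name] = {"value": None, "confidence": 0.0}
        else:
            # Conflicting candidates lower the (assumed) confidence.
            confidence = 0.9 if len(set(matches)) == 1 else 0.6
            fields[name] = {"value": matches[0], "confidence": confidence}
    return fields


def parse_line_items(text):
    """Parse lines that look like 'description qty unit_price amount'."""
    items = []
    for line in text.splitlines():
        m = LINE_ITEM.match(line.strip())
        if m:
            desc, qty, unit, amount = m.groups()
            items.append({
                "description": desc,
                "quantity": int(qty),
                "unit_price": float(unit.replace(",", "")),
                "amount": float(amount.replace(",", "")),
            })
    return items


def validate_total(fields, items):
    """Lower the total's confidence if it disagrees with the line-item sum."""
    total = fields["total"]["value"]
    if total is None or not items:
        return
    expected = sum(item["amount"] for item in items)
    if abs(float(total.replace(",", "")) - expected) > 0.005:
        fields["total"]["confidence"] = min(fields["total"]["confidence"], 0.4)


def review_queue(fields, threshold=0.8):
    """Return the names of fields whose confidence is below the threshold."""
    return [name for name, f in fields.items() if f["confidence"] < threshold]


SAMPLE = """ACME Supplies
Invoice No: INV-1042
Widget A 2 10.00 20.00
Widget B 1 5.50 5.50
Total: $25.50
"""

fields = extract_fields(SAMPLE)
items = parse_line_items(SAMPLE)
validate_total(fields, items)
payload = {
    "fields": fields,
    "line_items": items,
    "needs_review": review_queue(fields),
}
print(json.dumps(payload, indent=2))
```

In this sample the due date is absent, so `due_date` is the only field routed to review, even though the rest of the document parses cleanly. That is the per-field behavior the "What To Notice" section describes.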