---
title: Document Understanding OCR
emoji: 📄
colorFrom: yellow
colorTo: blue
sdk: docker
pinned: false
license: mit
---

# Document Understanding OCR

## Question

After OCR has produced text, how do we recover a structured document schema?

## System Boundary

This Streamlit Space demonstrates the post-OCR layer for invoices: field extraction, confidence scoring, line-item parsing, validation, and JSON export.

## Method

The app applies transparent extraction patterns to OCR text, computes field-level confidence, parses line items, and compares each field's confidence against a review threshold.
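
As a rough illustration of pattern-based extraction with field-level confidence, here is a minimal sketch. The field names, the regular expressions, and the confidence heuristic (0.9 for a single consistent match, 0.5 for conflicting matches, 0.0 for no match) are assumptions made for this example and may not match what `app.py` does.

```python
import re

# Hypothetical patterns for illustration only; the app's real patterns may differ.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:Number|No\.?|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    "invoice_date": re.compile(r"\bDate\s*[:\-]?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "total": re.compile(r"\bTotal\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def extract_fields(ocr_text):
    """Map raw OCR text to {field: {"value": ..., "confidence": ...}}."""
    results = {}
    for name, pattern in FIELD_PATTERNS.items():
        matches = pattern.findall(ocr_text)
        if not matches:
            # Nothing found: report the field as missing with zero confidence.
            results[name] = {"value": None, "confidence": 0.0}
        elif len(set(matches)) == 1:
            # One distinct value found: assumed high confidence.
            results[name] = {"value": matches[0], "confidence": 0.9}
        else:
            # Conflicting values found: keep the first, lower the confidence.
            results[name] = {"value": matches[0], "confidence": 0.5}
    return results


sample = """ACME Supplies
Invoice No: INV-1042
Date: 2024-03-15
Total: $1,250.00
"""
fields = extract_fields(sample)
for name, info in fields.items():
    print(name, info)
```

Keeping the patterns in a plain dictionary is what makes this kind of approach inspectable: a reviewer can read exactly which expression produced each field.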

## Technique

This is schema extraction after OCR. Raw text is mapped into named fields, and each field gets a confidence signal.

The method is intentionally transparent: field patterns are visible and the review threshold controls which fields require human attention.
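
The role of the review threshold can be sketched as a simple filter. The function name `build_review_queue`, the default threshold of 0.8, and the sample values are hypothetical and chosen only for this example.

```python
def build_review_queue(fields, threshold=0.8):
    """Return names of fields below the threshold, lowest confidence first."""
    flagged = [
        (name, info["confidence"])
        for name, info in fields.items()
        if info["confidence"] < threshold
    ]
    flagged.sort(key=lambda item: item[1])
    return [name for name, _ in flagged]


# Illustrative extraction result, not real app output.
extracted = {
    "invoice_number": {"value": "INV-1042", "confidence": 0.95},
    "due_date": {"value": "2024-04-14", "confidence": 0.55},
    "total": {"value": None, "confidence": 0.0},
}
queue = build_review_queue(extracted, threshold=0.8)
print(queue)
```

Raising the threshold sends more fields to a human; lowering it trusts the patterns more.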

## Output

The app returns a field table, line-item table, confidence chart, review queue, and JSON payload.
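
To make the line-item table and JSON payload concrete, the sketch below parses line items and serializes the result. It assumes a hypothetical row layout of description, quantity, unit price, and amount separated by whitespace; the payload keys and the total check are likewise illustrative, not the app's confirmed schema.

```python
import json
import re

# Assumed row layout: "<description> <quantity> <unit price> <amount>".
LINE_ITEM = re.compile(
    r"^(?P<description>.+?)\s+(?P<quantity>\d+)\s+"
    r"(?P<unit_price>\d+\.\d{2})\s+(?P<amount>\d+\.\d{2})$"
)


def parse_line_items(ocr_text):
    """Return one dict per line that matches the assumed row layout."""
    items = []
    for line in ocr_text.splitlines():
        match = LINE_ITEM.match(line.strip())
        if match:
            items.append({
                "description": match["description"],
                "quantity": int(match["quantity"]),
                "unit_price": float(match["unit_price"]),
                "amount": float(match["amount"]),
            })
    return items


text = """Widget A 2 10.00 20.00
Service fee 1 5.50 5.50
Total 25.50
"""
line_items = parse_line_items(text)

# Simple validation: do the line amounts add up to the stated total?
stated_total = 25.50
amounts_match_total = abs(sum(i["amount"] for i in line_items) - stated_total) < 0.005

payload = json.dumps(
    {
        "fields": {"total": "25.50"},
        "line_items": line_items,
        "validation": {"amounts_match_total": amounts_match_total},
    },
    indent=2,
)
print(payload)
```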

## Why It Matters

Document AI becomes useful when extraction is inspectable. A human reviewer should know which fields were found, which were uncertain, and what JSON would be sent downstream.

## What To Notice

Field-level confidence is more actionable than a single document score. A document can be mostly correct while one critical field, such as total or due date, is wrong.
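
A small worked example, using made-up confidence values and an assumed threshold of 0.8, shows how an averaged document score can hide a weak critical field:

```python
from statistics import mean

# Made-up confidences for illustration.
confidences = {
    "vendor": 0.97,
    "invoice_number": 0.96,
    "invoice_date": 0.95,
    "total": 0.40,
}
threshold = 0.8  # assumed review threshold

document_score = mean(list(confidences.values()))
weakest_field = min(confidences, key=confidences.get)

print(f"document score: {document_score:.2f}")
print(f"weakest field: {weakest_field} ({confidences[weakest_field]:.2f})")
print("document passes on average:", document_score >= threshold)
print("weakest field passes:", confidences[weakest_field] >= threshold)
```

Here the average clears the threshold even though `total` falls well below it, which is the case a per-field review queue is meant to catch.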

## Effect In Practice

This pattern supports invoice processing, procurement workflows, form extraction, and human review queues.

## Hugging Face Extension

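The Space could be extended to start from document images, using TrOCR for text recognition, Donut or a vision-language model for OCR-free parsing, or LayoutLM for layout-aware extraction over OCR tokens. Any such extension should be evaluated on field-level extraction accuracy.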
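
One possible way to measure field-level extraction accuracy is per-field exact match against labeled documents. The sketch below assumes predictions and labels are parallel lists of `{field: value}` dictionaries; the function name and the sample data are hypothetical.

```python
def field_accuracy(predictions, labels):
    """Per-field exact-match accuracy over parallel lists of documents."""
    correct = {}
    total = {}
    for predicted, expected in zip(predictions, labels):
        for field, expected_value in expected.items():
            total[field] = total.get(field, 0) + 1
            # A missing or different predicted value counts as an error.
            if predicted.get(field) == expected_value:
                correct[field] = correct.get(field, 0) + 1
    return {field: correct.get(field, 0) / total[field] for field in total}


# Made-up labels and predictions for two documents.
labels = [
    {"invoice_number": "INV-1", "total": "10.00"},
    {"invoice_number": "INV-2", "total": "20.00"},
]
predictions = [
    {"invoice_number": "INV-1", "total": "10.00"},
    {"invoice_number": "INV-2", "total": "26.00"},
]
accuracy = field_accuracy(predictions, labels)
print(accuracy)
```

Reporting accuracy per field rather than per document keeps the evaluation aligned with the field-level confidence the app already exposes.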

## Limitations

This version starts from OCR text. A full system should add image-to-text OCR or document VLM inference, table recognition, multilingual support, and labeled evaluation.

## Run Locally

```bash
pip install -r requirements.txt
streamlit run app.py
```