---
title: Document Understanding OCR
emoji: 📄
colorFrom: yellow
colorTo: blue
sdk: docker
pinned: false
license: mit
---

# Document Understanding OCR

## Question

After OCR has produced text, how do we recover a structured document schema?

## System Boundary

This Streamlit Space demonstrates the post-OCR layer for invoices: field extraction, confidence scoring, line-item parsing, validation, and JSON export.

## Method

The app applies transparent extraction patterns to OCR text, computes a confidence value for each field, parses line items, and compares each field's confidence against a review threshold.
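
As an illustration of the line-item parsing and validation steps, here is a minimal sketch. The line layout (`description qty unit_price amount`), the `LINE_ITEM` pattern, and the `parse_line_items` helper are assumptions made for this example and are not taken from the Space's `app.py`.

```python
import re
from decimal import Decimal

# Assumed layout for one line item: "<description> <qty> <unit price> <amount>".
# Real invoices vary widely; this pattern is illustrative only.
LINE_ITEM = re.compile(
    r"^(?P<description>.+?)\s+(?P<qty>\d+)\s+"
    r"(?P<unit_price>\d+\.\d{2})\s+(?P<amount>\d+\.\d{2})$"
)


def parse_line_items(ocr_text):
    """Return one dict per OCR line that matches the assumed layout."""
    items = []
    for line in ocr_text.splitlines():
        match = LINE_ITEM.match(line.strip())
        if not match:
            continue  # headers, totals, and noise are skipped
        qty = int(match["qty"])
        unit_price = Decimal(match["unit_price"])
        amount = Decimal(match["amount"])
        items.append({
            "description": match["description"],
            "qty": qty,
            "unit_price": unit_price,
            "amount": amount,
            # Validation: the printed amount should equal qty * unit price.
            "consistent": qty * unit_price == amount,
        })
    return items


sample = """Invoice No: INV-1042
Widget A 2 10.00 20.00
Widget B 1 5.50 6.50
Total: 26.50"""

items = parse_line_items(sample)
for item in items:
    print(item["description"], item["amount"], item["consistent"])
```

Using `Decimal` rather than `float` keeps the arithmetic check exact, so a mismatch points to an OCR or invoice error and not to rounding.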

## Technique

This is schema extraction after OCR. Raw text is mapped into named fields, and each field gets a confidence signal.

The method is intentionally transparent: field patterns are visible and the review threshold controls which fields require human attention.
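
A minimal sketch of what pattern-based extraction with a review threshold might look like is shown here. The field names, the regular expressions, the confidence heuristic (one consistent match scores high, conflicting matches score low, no match scores zero), and the `REVIEW_THRESHOLD` value are all assumptions for illustration; the Space's actual patterns and scoring may differ.

```python
import re

# Hypothetical field patterns, kept in one visible table so a reviewer
# can see exactly what each field is matched against.
FIELD_PATTERNS = {
    "invoice_number": r"Invoice\s*(?:No|Number|#)\s*[:.]?\s*([A-Z0-9-]+)",
    "due_date": r"Due\s*Date\s*:?\s*(\d{4}-\d{2}-\d{2})",
    "total": r"\bTotal\s*:?\s*\$?\s*(\d+(?:\.\d{2})?)",
}

REVIEW_THRESHOLD = 0.8  # assumed value; fields below it go to a human


def extract_fields(ocr_text):
    """Map raw OCR text to named fields, each with a confidence signal."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        matches = re.findall(pattern, ocr_text, flags=re.IGNORECASE)
        if not matches:
            value, confidence = None, 0.0       # field not found
        elif len(set(matches)) == 1:
            value, confidence = matches[0], 0.95  # one consistent value
        else:
            value, confidence = matches[0], 0.5   # conflicting values
        fields[name] = {
            "value": value,
            "confidence": confidence,
            "needs_review": confidence < REVIEW_THRESHOLD,
        }
    return fields


sample = (
    "Invoice No: INV-1042\n"
    "Subtotal: 24.00\n"
    "Total: 26.50\n"
    "Amount due Total: 25.50\n"
)

fields = extract_fields(sample)
for name, field in fields.items():
    print(name, field)
```

In this sample the invoice number is found once and passes, the two conflicting totals push `total` below the threshold, and the missing due date is flagged with zero confidence.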

## Output

The app returns a field table, line-item table, confidence chart, review queue, and JSON payload.
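
To make the JSON payload and review queue concrete, the following sketch assembles them from already-extracted fields. The key names (`fields`, `line_items`, `review_queue`, `review_threshold`), the `build_payload` helper, and the sample values are hypothetical; the schema the Space actually exports is not specified here.

```python
import json


def build_payload(fields, line_items, threshold=0.8):
    """Bundle extracted data and list the fields that need human review."""
    review_queue = [
        name for name, field in fields.items()
        if field["confidence"] < threshold
    ]
    return {
        "fields": fields,
        "line_items": line_items,
        "review_queue": review_queue,
        "review_threshold": threshold,
    }


# Illustrative inputs; amounts are kept as strings so they serialize exactly.
fields = {
    "invoice_number": {"value": "INV-1042", "confidence": 0.95},
    "total": {"value": "26.50", "confidence": 0.95},
    "due_date": {"value": None, "confidence": 0.0},
}
line_items = [
    {"description": "Widget A", "qty": 2, "unit_price": "10.00", "amount": "20.00"},
]

payload = build_payload(fields, line_items)
print(json.dumps(payload, indent=2))
```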

## Why It Matters

Document AI becomes useful when extraction is inspectable. A human reviewer should know which fields were found, which were uncertain, and what JSON would be sent downstream.

## What To Notice

Field-level confidence is more actionable than a single document score. A document can be mostly correct while one critical field, such as total or due date, is wrong.
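
A small worked example, using made-up confidence values and an assumed threshold of 0.8, shows how a document-level average can hide a weak critical field:

```python
from statistics import mean

# Made-up per-field confidences for one invoice (illustrative only).
confidences = {
    "invoice_number": 0.95,
    "vendor": 0.95,
    "invoice_date": 0.95,
    "total": 0.40,
}
threshold = 0.8  # assumed review threshold

# A single document score averages the weak field away.
document_score = mean(confidences.values())
document_passes = document_score >= threshold

# Field-level checks surface the field that needs attention.
flagged = [name for name, c in confidences.items() if c < threshold]

print(f"document score: {document_score:.4f}, passes: {document_passes}")
print(f"fields flagged for review: {flagged}")
```

The averaged score clears the threshold, yet `total` is the one field a reviewer most needs to see.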

## Effect In Practice

This pattern supports invoice processing, procurement workflows, form extraction, and human review queues.

## Hugging Face Extension

A natural extension is to add document-image OCR with TrOCR, Donut, LayoutLM, or a vision-language model, and to evaluate field-level extraction accuracy.

## Limitations

This version starts from OCR text. A full system should add image-to-text OCR or document VLM inference, table recognition, multilingual support, and labeled evaluation.

## Run Locally

```bash
pip install -r requirements.txt
streamlit run app.py
```