Invoice LayoutLMv3 Multi-Domain Field Extraction

Five fine-tuned LayoutLMv3-base models for extracting structured fields from invoice/document images. Each domain has its own token-classification head with domain-specific BIO labels.

Trained on custom synthetic invoice data.

Domains

Domain	Fields	Description
`general`	13 scalar + line items	Standard business invoices
`receipt`	7 scalar + line items	POS / thermal receipts
`medical`	16 scalar + procedures	Hospital bills
`insurance`	22 scalar + items	Insurance EOB / claims
`logistics`	22 scalar + charges	Freight / shipping invoices

Quick Start

The easiest way to use this model is via the included inference_example.py:

pip install transformers torch easyocr huggingface_hub Pillow

# Download inference_example.py from this repo, then:
python inference_example.py invoice.png                    # auto-detect domain
python inference_example.py invoice.png --domain general   # force domain

The script handles the full pipeline: OCR, subword-to-word alignment, BIO span merging, and label-prefix stripping. The first run downloads about 2.5 GB of model weights.

Manual Usage

import json

import torch
from huggingface_hub import snapshot_download
from transformers import AutoModelForTokenClassification, LayoutLMv3Processor

# Download all domains
snapshot_download("rhlprj/invoice-layoutlmv3-multidomain", local_dir="models/")

# Load one domain
domain = "general"
model = AutoModelForTokenClassification.from_pretrained(f"models/{domain}")
processor = LayoutLMv3Processor.from_pretrained(f"models/{domain}", apply_ocr=False)

with open(f"models/{domain}/label_maps.json") as f:
    label_maps = json.load(f)
id2label = {int(k): v for k, v in label_maps["id2label"].items()}

# Encode. pil_image (a PIL.Image in RGB mode), ocr_words and boxes_0_1000 are not
# defined here: supply them from your own OCR, with boxes normalised to 0-1000.
encoding = processor(
    images=pil_image,
    text=ocr_words,          # List[str]
    boxes=boxes_0_1000,      # List[List[int]], each [x0, y0, x1, y1] in 0-1000
    truncation=True,
    padding="max_length",
    max_length=512,
    return_tensors="pt",
)

# Run model
with torch.no_grad():
    outputs = model(**{k: v.to(model.device) for k, v in encoding.items()})
token_logits = outputs.logits[0].cpu()

# CRITICAL: map subword predictions back to word level using word_ids()
# Do NOT use preds[1:len(words)+1] — that assumes 1 token per word and WILL break.
word_ids = encoding.word_ids(0)
first_subword = {}
for tok_idx, w_id in enumerate(word_ids):
    if w_id is not None and w_id not in first_subword:
        first_subword[w_id] = tok_idx

for w_idx in range(len(ocr_words)):
    tok_idx = first_subword.get(w_idx)
    if tok_idx is not None:
        label = id2label[int(token_logits[tok_idx].argmax())]
        print(f"  {ocr_words[w_idx]:30s} -> {label}")
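
The snippet above expects word boxes already scaled to 0-1000. As a minimal sketch of that step, the helper below converts OCR results shaped like EasyOCR's `readtext` output (a list of `(quad, text, confidence)` entries, where `quad` is four corner points in pixels) into `ocr_words` and `boxes_0_1000`. The name `to_layoutlm_inputs` is hypothetical and not part of this repo, and treating each OCR segment as one word is an assumption; inference_example.py may do this differently.

```python
from typing import List, Sequence, Tuple

Quad = Sequence[Sequence[float]]  # four (x, y) corner points in pixels


def to_layoutlm_inputs(
    ocr_results: Sequence[Tuple[Quad, str, float]], width: int, height: int
) -> Tuple[List[str], List[List[int]]]:
    """Hypothetical helper: OCR segments -> words and [x0, y0, x1, y1] boxes in 0-1000."""

    def scale(value: float, size: int) -> int:
        # Scale a pixel coordinate to the 0-1000 range and clamp it.
        return max(0, min(1000, int(1000 * value / size)))

    words, boxes = [], []
    for quad, text, _confidence in ocr_results:
        if not text.strip():
            continue  # skip empty OCR segments
        xs = [point[0] for point in quad]
        ys = [point[1] for point in quad]
        # Axis-aligned bounding box around the (possibly rotated) quad.
        boxes.append([
            scale(min(xs), width), scale(min(ys), height),
            scale(max(xs), width), scale(max(ys), height),
        ])
        words.append(text)
    return words, boxes


# Made-up sample for a 200x100 px image.
sample = [([[10, 20], [110, 20], [110, 40], [10, 40]], "INV-2025-00782", 0.98)]
ocr_words, boxes_0_1000 = to_layoutlm_inputs(sample, width=200, height=100)
print(ocr_words, boxes_0_1000)
```

With real input, `width` and `height` would come from `pil_image.size`.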

Important: Subword Alignment

LayoutLMv3 uses a RoBERTa-style byte-level BPE tokenizer that splits words into subword tokens. For example, INV-2025-00782 becomes 6+ subword tokens. The model outputs one BIO label per subword token, so you must use encoding.word_ids(0) to map predictions back to word level. Slicing predictions[1:len(words)+1] assumes one token per word; as soon as any word splits, every later label lands on the wrong word.

See inference_example.py for the complete, tested implementation.
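
To make the failure mode concrete, here is a small sketch that runs without the model. The `word_ids`, predictions, and the `invoice_number` label are made up for illustration; only the first-subword mapping mirrors the Manual Usage code.

```python
from typing import Dict, List, Optional


def word_level_labels(
    word_ids: List[Optional[int]], token_preds: List[int], id2label: Dict[int, str]
) -> Dict[int, str]:
    """Label each word with the prediction at its first subword token."""
    first_subword: Dict[int, int] = {}
    for tok_idx, w_id in enumerate(word_ids):
        if w_id is not None and w_id not in first_subword:
            first_subword[w_id] = tok_idx
    return {w_id: id2label[token_preds[tok_idx]] for w_id, tok_idx in first_subword.items()}


# Toy example: three words, the middle one split into three subword tokens.
# None marks special tokens, as in the output of encoding.word_ids(0).
words = ["Invoice", "INV-2025-00782", "Total"]
word_ids = [None, 0, 1, 1, 1, 2, None]
token_preds = [0, 0, 1, 2, 2, 0, 0]
id2label = {0: "O", 1: "B-invoice_number", 2: "I-invoice_number"}  # hypothetical labels

aligned = word_level_labels(word_ids, token_preds, id2label)
naive = [id2label[p] for p in token_preds[1:len(words) + 1]]

print(aligned)  # word index -> label, via the first subword of each word
print(naive)    # the slice reads subword labels as if they were word labels
```

In this toy case the naive slice assigns "Total" the label of a continuation subword of the invoice number, while the `word_ids` mapping labels it "O".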

Training

Base model: microsoft/layoutlmv3-base (133M params)
Method: LoRA (rank=16, alpha=32, target=query+value) - 0.44% trainable params
Data: Synthetic invoices with auto-aligned BIO labels via EasyOCR + rapidfuzz
Hardware: NVIDIA RTX 2000 Ada (8 GB VRAM)
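
As a rough sanity check on the 0.44% figure, the arithmetic below counts only the LoRA adapter matrices. It assumes the standard layoutlmv3-base shape (12 encoder layers, hidden size 768), which is not stated in this card, and it leaves out any classification-head parameters, so treat it as an estimate rather than the exact count from training.

```python
hidden_size = 768       # assumed: layoutlmv3-base hidden size
num_layers = 12         # assumed: layoutlmv3-base encoder layers
rank = 16               # LoRA rank from this card
targets_per_layer = 2   # query and value projections

# Each adapted projection gains A (rank x d_in) and B (d_out x rank).
params_per_matrix = rank * (hidden_size + hidden_size)
lora_params = params_per_matrix * targets_per_layer * num_layers

base_params = 133_000_000  # figure quoted in this card
pct = 100 * lora_params / base_params
print(f"{lora_params:,} LoRA params = {pct:.2f}% of base")
# 589,824 LoRA params = 0.44% of base
```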

Label Maps

Each domain folder contains label_maps.json with the full BIO label set. Labels follow the format: O, B-<field>, I-<field>.
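
Given word-level labels in this format, consecutive B-/I- words can be merged into field values. The function below is a minimal sketch of that merging, not the code from inference_example.py. The name `merge_bio` and the field names in the example are hypothetical, and letting a stray I-<field> open a new span is a design choice made here for leniency.

```python
from typing import List, Optional, Tuple


def merge_bio(words: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge word-level BIO labels into (field, text) spans, stripping the B-/I- prefix."""
    spans: List[Tuple[str, str]] = []
    field: Optional[str] = None
    buf: List[str] = []
    for word, label in zip(words, labels):
        prefix, _, name = label.partition("-")
        continues = prefix == "I" and name == field
        if not continues:
            if field is not None:
                spans.append((field, " ".join(buf)))  # close the open span
            # B-<field> or a stray I-<field> opens a new span; O opens none.
            field, buf = (name, []) if prefix in ("B", "I") else (None, [])
        if field is not None:
            buf.append(word)
    if field is not None:
        spans.append((field, " ".join(buf)))
    return spans


words = ["Acme", "Corp", "Total", "42.00"]
labels = ["B-vendor_name", "I-vendor_name", "O", "B-total"]
print(merge_bio(words, labels))
```

Spans are returned as a list rather than a dict so that repeated fields, such as line items, are not overwritten.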
