# OCR LayoutLMv3 → T5 Reordering Model
This repository contains a custom sequence-to-sequence model that uses Microsoft's LayoutLMv3 as the encoder and T5-small as the decoder, connected via a learned projection layer. The model has been fine-tuned to perform OCR text reordering: it takes raw OCR outputs (words and bounding boxes) and produces a correctly ordered textual document.
## Task Description
The goal of this project is to reorder OCR-extracted tokens from document images into coherent, human-readable text. Given:
- A document image (RGB)
- A list of words detected by an OCR engine
- Corresponding bounding boxes for each word
The model learns to output the tokens in the correct reading order, effectively reconstructing the underlying document text.
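As a toy illustration of this input/output contract (the values below are made up for illustration, not taken from the dataset):

```python
# Hypothetical example: two words detected out of reading order.
words = ["world!", "Hello,"]            # OCR tokens, in detection order
boxes = [[200, 10, 270, 30],            # [x0, y0, x1, y1] per word
         [100, 10, 170, 30]]            # "Hello," sits to the left of "world!"

# The model is trained to emit the tokens in reading order:
expected_output = "Hello, world!"
print(expected_output)
```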
## Model Architecture
### Encoder: LayoutLMv3 (Base)
- Processes both visual (image) and textual (input tokens + boxes) information.
- Returns a mixed embedding sequence of visual patch embeddings and token embeddings.
### Projection Layer
- A custom `Linear(768 → d_model) → LayerNorm → GELU` block.
- Takes only the first `seq_len` token embeddings (the text part) from LayoutLMv3's last hidden state.
- Projects them into the T5 embedding space (`d_model = 512`).
### Decoder: T5-small
- Consumes the projected embeddings via `inputs_embeds`.
- Generates the reordered text sequence autoregressively.
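A minimal sketch of the projection block described above (the dimensions come from this card; the class and variable names are my own, not necessarily those in the repository):

```python
import torch
import torch.nn as nn

class Projection(nn.Module):
    """Maps LayoutLMv3 hidden states (768-d) into T5-small's embedding space (512-d)."""

    def __init__(self, enc_dim: int = 768, d_model: int = 512):
        super().__init__()
        # Linear(768 -> 512) -> LayerNorm -> GELU, as described above.
        self.block = nn.Sequential(
            nn.Linear(enc_dim, d_model),
            nn.LayerNorm(d_model),
            nn.GELU(),
        )

    def forward(self, last_hidden_state: torch.Tensor, seq_len: int) -> torch.Tensor:
        # Keep only the first seq_len positions (the text tokens);
        # the visual patch embeddings that follow are dropped.
        return self.block(last_hidden_state[:, :seq_len, :])
```

The projected tensor is then handed to T5 as `inputs_embeds` for autoregressive generation.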
## Dataset & Preprocessing
**Data Format:** NDJSON files, where each line is a JSON object with fields:
- `img_name`: image filename
- `src_word_list`: list of OCR-extracted tokens
- `src_wordbox_list`: list of bounding boxes (`[x0, y0, x1, y1]`)
- `ordered_src_doc` (optional): ground-truth token order for supervised training
**Chunked Reading:** To handle large NDJSON files efficiently, lines are wrapped in brackets and stream-parsed with `ijson`, yielding chunks of configurable size (default: 1,000 samples).

**Custom Dataset Class:**
- Loads images from disk (`PIL.Image`).
- Returns the raw image, words, boxes, and the joined target string.
**Collator:** Uses `AutoProcessor` (LayoutLMv3) to tokenize and encode image + OCR inputs, and `T5Tokenizer` to prepare decoder labels.
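The chunked reading described above can also be sketched with the standard library alone, since NDJSON is one JSON object per line. This is an illustrative stand-in for the repository's `ijson` stream parser, not its actual code:

```python
import json
from itertools import islice
from typing import Iterator

def iter_ndjson_chunks(path: str, chunk_size: int = 1000) -> Iterator[list]:
    """Yield lists of up to chunk_size parsed records from an NDJSON file."""
    with open(path, encoding="utf-8") as f:
        # Lazily parse one JSON object per non-empty line.
        records = (json.loads(line) for line in f if line.strip())
        while True:
            chunk = list(islice(records, chunk_size))
            if not chunk:
                break
            yield chunk
```

Because the generator is consumed lazily, memory use stays bounded by `chunk_size` regardless of file size, which is the same property the `ijson` approach provides.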
## Training
- Max Samples: 6,000 (for faster prototyping)
- Batch Size: 8
- Epochs: 30, saving checkpoints every 5 epochs
- Optimizer: AdamW with weight decay (0.01)
- Mixed Precision: `torch.cuda.amp.GradScaler`
- Scheduler: linear warmup (10% of total steps) followed by linear decay
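The warmup-then-decay schedule can be sketched with a plain `LambdaLR`. This is a hedged stand-in for whatever the repository actually uses (e.g. `transformers.get_linear_schedule_with_warmup` behaves the same way):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def linear_warmup_decay(optimizer, warmup_steps: int, total_steps: int) -> LambdaLR:
    """LR rises linearly for warmup_steps, then decays linearly to zero."""
    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)          # warmup ramp
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return LambdaLR(optimizer, lr_lambda)
```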
Checkpoint Contents:
- LayoutLMv3 encoder weights
- T5โsmall decoder weights
- Projection layer weights
- Optimizer & scheduler states
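A hedged sketch of how a checkpoint with the contents listed above might be written and restored (the dictionary keys are my own naming, not necessarily the repository's):

```python
import torch

def save_checkpoint(path, encoder, decoder, projection, optimizer, scheduler, epoch):
    """Bundle every state listed above into a single file."""
    torch.save({
        "epoch": epoch,
        "encoder": encoder.state_dict(),
        "decoder": decoder.state_dict(),
        "projection": projection.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
    }, path)

def load_checkpoint(path, encoder, decoder, projection, optimizer=None, scheduler=None):
    """Restore model weights, and optionally optimizer/scheduler state, for resuming."""
    ckpt = torch.load(path, map_location="cpu")
    encoder.load_state_dict(ckpt["encoder"])
    decoder.load_state_dict(ckpt["decoder"])
    projection.load_state_dict(ckpt["projection"])
    if optimizer is not None:
        optimizer.load_state_dict(ckpt["optimizer"])
    if scheduler is not None:
        scheduler.load_state_dict(ckpt["scheduler"])
    return ckpt["epoch"]
```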
## Inference
- Load LayoutLMv3 + T5 + the custom projection from the saved checkpoint.
- Prepare input with the same `AutoProcessor` and OCR outputs.
- Forward through encoder → projection → T5 `.generate()`.
- Decode tokens to a natural-language string.
## Usage
- Push your model & processor to the Hugging Face Hub.
- (Optional) Add a custom `inference.py` pipeline for hosted inference.
- Call via `huggingface_hub.InferenceClient`:
```python
from huggingface_hub import InferenceClient
import base64

client = InferenceClient(token="hf_your_token")

def encode_image(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

image_b64 = encode_image("doc.png")
words = [...]  # OCR tokens list
boxes = [...]  # OCR bounding boxes list

result = client.image_to_text(
    model="your-username/ocr-layoutlmv3-base-t5-small",
    inputs={"image": image_b64, "words": words, "boxes": boxes},
)
print(result)
```