# OCR LayoutLMv3 → T5 Reordering Model
This repository contains a custom sequence-to-sequence model that uses Microsoft's LayoutLMv3 as the encoder and T5-small as the decoder, connected via a learned projection layer. The model has been fine-tuned to perform OCR text reordering: it takes raw OCR outputs (words and bounding boxes) and produces a correctly ordered textual document.
## Task Description
The goal of this project is to reorder OCR-extracted tokens from document images into coherent, human-readable text. Given:
- A document image (RGB)
- A list of words detected by an OCR engine
- Corresponding bounding boxes for each word
The model learns to output the tokens in the correct reading order, effectively reconstructing the underlying document text.
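As a toy illustration of this input/output contract (the values below are made up for illustration, not taken from the dataset):

```python
# Hypothetical example: two words detected out of reading order.
words = ["world!", "Hello,"]            # OCR tokens, in detection order
boxes = [[200, 10, 270, 30],            # [x0, y0, x1, y1] per word
         [100, 10, 170, 30]]            # "Hello," sits to the left of "world!"

# The model is trained to emit the tokens in reading order:
expected_output = "Hello, world!"
print(expected_output)
```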
## Model Architecture
### Encoder: LayoutLMv3 (Base)
- Processes both visual (image) and textual (input tokens + boxes) information.
- Returns a mixed embedding sequence of visual patch embeddings and token embeddings.
### Projection Layer
- A custom `Linear(768 → d_model) → LayerNorm → GELU` block.
- Takes only the first `seq_len` token embeddings (the text part) from LayoutLMv3's last hidden state.
- Projects them into the T5 embedding space (`d_model = 512`).
### Decoder: T5-small
- Consumes the projected embeddings via `inputs_embeds`.
- Generates the reordered text sequence autoregressively.
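A minimal sketch of the projection block described above (the dimensions come from this card; the class and variable names are my own, not necessarily those in the repository):

```python
import torch
import torch.nn as nn

class Projection(nn.Module):
    """Maps LayoutLMv3 hidden states (768-d) into T5-small's embedding space (512-d)."""

    def __init__(self, enc_dim: int = 768, d_model: int = 512):
        super().__init__()
        # Linear(768 -> 512) -> LayerNorm -> GELU, as described above.
        self.block = nn.Sequential(
            nn.Linear(enc_dim, d_model),
            nn.LayerNorm(d_model),
            nn.GELU(),
        )

    def forward(self, last_hidden_state: torch.Tensor, seq_len: int) -> torch.Tensor:
        # Keep only the first seq_len positions (the text tokens);
        # the visual patch embeddings that follow are dropped.
        return self.block(last_hidden_state[:, :seq_len, :])
```

The projected tensor is then handed to T5 as `inputs_embeds` for autoregressive generation.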
## Dataset & Preprocessing
**Data Format:** NDJSON files, where each line is a JSON object with fields:
- `img_name`: image filename
- `src_word_list`: list of OCR-extracted tokens
- `src_wordbox_list`: list of bounding boxes (`[x0, y0, x1, y1]`)
- `ordered_src_doc` (optional): ground-truth token order for supervised training
**Chunked Reading:** To handle large NDJSON files efficiently, lines are wrapped in brackets and stream-parsed with `ijson`, yielding chunks of configurable size (default: 1,000 samples).

**Custom Dataset Class:**
- Loads images from disk (`PIL.Image`).
- Returns the raw image, words, boxes, and the joined target string.
**Collator:** Uses `AutoProcessor` (LayoutLMv3) to tokenize and encode image + OCR inputs, and `T5Tokenizer` to prepare decoder labels.
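The chunked reading described above can also be sketched with the standard library alone, since NDJSON is one JSON object per line. This is an illustrative stand-in for the repository's `ijson` stream parser, not its actual code:

```python
import json
from itertools import islice
from typing import Iterator

def iter_ndjson_chunks(path: str, chunk_size: int = 1000) -> Iterator[list]:
    """Yield lists of up to chunk_size parsed records from an NDJSON file."""
    with open(path, encoding="utf-8") as f:
        # Lazily parse one JSON object per non-empty line.
        records = (json.loads(line) for line in f if line.strip())
        while True:
            chunk = list(islice(records, chunk_size))
            if not chunk:
                break
            yield chunk
```

Because the generator is consumed lazily, memory use stays bounded by `chunk_size` regardless of file size, which is the same property the `ijson` approach provides.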
## Training
- Max Samples: 6,000 (for faster prototyping)
- Batch Size: 8
- Epochs: 30, saving checkpoints every 5 epochs
- Optimizer: AdamW with weight decay (0.01)
- Mixed Precision: `torch.cuda.amp.GradScaler`
- Scheduler: linear warmup (10% of total steps) followed by linear decay
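The warmup-then-decay schedule can be sketched with a plain `LambdaLR`. This is a hedged stand-in for whatever the repository actually uses (e.g. `transformers.get_linear_schedule_with_warmup` behaves the same way):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def linear_warmup_decay(optimizer, warmup_steps: int, total_steps: int) -> LambdaLR:
    """LR rises linearly for warmup_steps, then decays linearly to zero."""
    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)          # warmup ramp
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return LambdaLR(optimizer, lr_lambda)
```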
Checkpoint Contents:
- LayoutLMv3 encoder weights
- T5โsmall decoder weights
- Projection layer weights
- Optimizer & scheduler states
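A hedged sketch of how a checkpoint with the contents listed above might be written and restored (the dictionary keys are my own naming, not necessarily the repository's):

```python
import torch

def save_checkpoint(path, encoder, decoder, projection, optimizer, scheduler, epoch):
    """Bundle every state listed above into a single file."""
    torch.save({
        "epoch": epoch,
        "encoder": encoder.state_dict(),
        "decoder": decoder.state_dict(),
        "projection": projection.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
    }, path)

def load_checkpoint(path, encoder, decoder, projection, optimizer=None, scheduler=None):
    """Restore model weights, and optionally optimizer/scheduler state, for resuming."""
    ckpt = torch.load(path, map_location="cpu")
    encoder.load_state_dict(ckpt["encoder"])
    decoder.load_state_dict(ckpt["decoder"])
    projection.load_state_dict(ckpt["projection"])
    if optimizer is not None:
        optimizer.load_state_dict(ckpt["optimizer"])
    if scheduler is not None:
        scheduler.load_state_dict(ckpt["scheduler"])
    return ckpt["epoch"]
```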
## Inference
- Load LayoutLMv3 + T5 + the custom projection from the saved checkpoint.
- Prepare input with the same `AutoProcessor` and OCR outputs.
- Forward through encoder → projection → T5 `.generate()`.
- Decode tokens to a natural-language string.
## Usage
- Push your model & processor to the Hugging Face Hub.
- (Optional) Add a custom `inference.py` pipeline for hosted inference.
- Call via `huggingface_hub.InferenceClient`:
```python
from huggingface_hub import InferenceClient
import base64

client = InferenceClient(token="hf_your_token")

def encode_image(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

image_b64 = encode_image("doc.png")
words = [...]  # OCR tokens list
boxes = [...]  # OCR bounding boxes list

result = client.image_to_text(
    model="your-username/ocr-layoutlmv3-base-t5-small",
    inputs={"image": image_b64, "words": words, "boxes": boxes},
)
print(result)
```