OCR LayoutLMv3-T5 Reordering Model

This repository contains a custom sequence-to-sequence model that leverages Microsoft's LayoutLMv3 as an encoder and T5-small as a decoder, connected via a learned projection layer. The model has been fine-tuned to perform OCR text reordering, taking raw OCR outputs (words and bounding boxes) and producing a correctly ordered textual document.


📋 Task Description

The goal of this project is to reorder OCR-extracted tokens from document images into coherent, human-readable text. Given:

  • A document image (RGB)
  • A list of words detected by an OCR engine
  • Corresponding bounding boxes for each word

The model learns to output the tokens in the correct reading order, effectively reconstructing the underlying document text.


๐Ÿ—๏ธ Model Architecture

  1. Encoder: LayoutLMv3 (Base)

    • Processes both visual (image) and textual (input tokens + boxes) information.
    • Returns a mixed embedding sequence of visual patch embeddings and token embeddings.
  2. Projection Layer

    • A custom Linear(768 → d_model) → LayerNorm → GELU block.
    • Takes only the first seq_len token embeddings (text part) from LayoutLMv3's last hidden state.
    • Projects them into the T5 embedding space (d_model = 512).
  3. Decoder: T5-small

    • Consumes the projected embeddings via inputs_embeds.
    • Generates the reordered text sequence autoregressively.
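As a sketch, the projection block described above might look like the following; the class name and the explicit seq_len argument are illustrative assumptions, not the repository's actual code:

```python
# A sketch of the Linear(768 -> d_model) -> LayerNorm -> GELU projection
# that maps LayoutLMv3 hidden states into the T5-small embedding space.
import torch
import torch.nn as nn

class Projection(nn.Module):
    """Projects LayoutLMv3 text-token embeddings into T5's d_model space."""

    def __init__(self, in_dim=768, d_model=512):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(in_dim, d_model),
            nn.LayerNorm(d_model),
            nn.GELU(),
        )

    def forward(self, hidden_states, seq_len):
        # Keep only the first seq_len (text) positions; the visual patch
        # embeddings LayoutLMv3 appends after them are discarded.
        return self.block(hidden_states[:, :seq_len, :])
```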

🗂️ Dataset & Preprocessing

  • Data Format: NDJSON files, where each line is a JSON object with fields:

    • img_name: image filename
    • src_word_list: list of OCR-extracted tokens
    • src_wordbox_list: list of bounding boxes ([x0, y0, x1, y1])
    • ordered_src_doc (optional): ground-truth token order for supervised training
  • Chunked Reading: To efficiently handle large NDJSON files, we wrap lines in brackets and stream-parse using ijson, yielding chunks of configurable size (default 1000 samples).

  • Custom Dataset Class:

    • Loads images from disk (PIL.Image).
    • Returns raw image, words, boxes, and joined target string.
  • Collator: Uses AutoProcessor (LayoutLMv3) to tokenize and encode image+OCR inputs, and T5Tokenizer to prepare decoder labels.


⚙️ Training

  • Max Samples: 6,000 (for faster prototyping)
  • Batch Size: 8
  • Epochs: 30, saving checkpoints every 5 epochs
  • Optimizer: AdamW with weight decay (0.01)
  • Mixed Precision: torch.cuda.amp.GradScaler
  • Scheduler: Linear warmup (10% of total steps) → decay
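The optimizer and schedule above can be wired together roughly as follows; a minimal sketch assuming plain PyTorch, with the warmup → linear-decay schedule hand-rolled via LambdaLR rather than taken from the repo:

```python
# AdamW with weight decay 0.01, plus linear warmup over the first 10%
# of steps followed by linear decay to zero.
import torch
from torch.optim.lr_scheduler import LambdaLR

def make_optimizer_and_scheduler(model, total_steps, lr=5e-5,
                                 weight_decay=0.01, warmup_frac=0.1):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  weight_decay=weight_decay)
    warmup_steps = int(total_steps * warmup_frac)

    def lr_lambda(step):
        if step < warmup_steps:                       # linear warmup
            return step / max(1, warmup_steps)
        # linear decay to zero over the remaining steps
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    return optimizer, LambdaLR(optimizer, lr_lambda)

# Mixed precision: GradScaler is a no-op when CUDA is unavailable.
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())
```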

Checkpoint Contents:

  • LayoutLMv3 encoder weights
  • T5-small decoder weights
  • Projection layer weights
  • Optimizer & scheduler states

🚀 Inference

  1. Load LayoutLMv3 + T5 + custom projection from saved checkpoint.
  2. Prepare input with the same AutoProcessor and OCR outputs.
  3. Forward through encoder → projection → T5 .generate().
  4. Decode tokens to natural language string.
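The four steps above can be sketched as a single function; all module and argument names here are placeholders for the loaded components, not the repository's actual API:

```python
# Hedged sketch of the inference path: processor -> LayoutLMv3 ->
# projection -> T5.generate() -> tokenizer.decode().
import torch

@torch.no_grad()
def reorder(image, words, boxes, encoder, projection, decoder,
            processor, tokenizer, max_new_tokens=512):
    # 2. Encode image + OCR words/boxes with the same processor as training.
    enc = processor(image, words, boxes=boxes, return_tensors="pt")
    # 3a. LayoutLMv3 forward pass; keep only the text-token positions.
    hidden = encoder(**enc).last_hidden_state
    seq_len = enc["input_ids"].shape[1]
    embeds = projection(hidden[:, :seq_len, :])
    # 3b. T5 generates the reordered sequence from the projected embeddings.
    out_ids = decoder.generate(inputs_embeds=embeds,
                               attention_mask=enc["attention_mask"],
                               max_new_tokens=max_new_tokens)
    # 4. Decode token ids back to a natural-language string.
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)
```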

🛠️ Usage

  1. Push your model & processor to Hugging Face Hub.
  2. (Optional) Add a custom inference.py pipeline for hosted inference.
  3. Call via huggingface_hub.InferenceClient:
from huggingface_hub import InferenceClient
import base64
import json

client = InferenceClient(token="hf_your_token")

def encode_image(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

image_b64 = encode_image("doc.png")
words = [...]  # OCR tokens list
boxes = [...]  # OCR bounding boxes list

# The built-in image_to_text() helper accepts only a single image, so a
# custom inference.py payload goes through the generic post() call:
result = client.post(
    model="your-username/ocr-layoutlmv3-base-t5-small",
    json={"inputs": {"image": image_b64, "words": words, "boxes": boxes}},
)
print(json.loads(result))