LiLT FUNSD - GGUF

GGUF conversion of philschmid/lilt-en-funsd for use with CrispEmbed.

LiLT (Language-independent Layout Transformer) is a dual-stream encoder. A RoBERTa text stream (768-d) runs in parallel with a layout transformer stream (192-d), and BiACM (bi-directional attention complementation mechanism) couples the two. The model takes OCR text plus bounding boxes as input and performs token classification for document understanding.
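To show roughly how BiACM couples the two streams, here is a minimal single-head sketch. It follows our reading of the LiLT self-attention in Hugging Face transformers (`LiltSelfAttention`). Each stream computes its own scaled dot-product scores, scaled by its own head dimension (64 = 768/12 for text, 16 = 192/12 for layout in this model). Both streams then apply softmax to the sum of the two score matrices, and each stream aggregates its own values. The function names and toy inputs are hypothetical. This sketch is not CrispEmbed's implementation.

```python
import math


def scaled_scores(q, k):
    """Scaled dot-product scores S[i][j] = q_i . k_j / sqrt(head_dim)."""
    scale = math.sqrt(len(q[0]))
    return [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]


def softmax_rows(scores):
    """Row-wise softmax, subtracting each row's max for numerical stability."""
    probs = []
    for row in scores:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs


def attend(probs, values):
    """Weighted sum of value vectors: out[i] = sum_j probs[i][j] * values[j]."""
    dim = len(values[0])
    return [[sum(p * v[d] for p, v in zip(row, values)) for d in range(dim)] for row in probs]


def biacm_head(q_t, k_t, v_t, q_l, k_l, v_l):
    """Hypothetical single-head BiACM sketch at inference time.

    Each stream scores with its own queries/keys, both streams use the SUM of
    the two score matrices, and each stream then aggregates its own values.
    """
    s_text = scaled_scores(q_t, k_t)
    s_layout = scaled_scores(q_l, k_l)
    combined = [[a + b for a, b in zip(rt, rl)] for rt, rl in zip(s_text, s_layout)]
    # The LiLT paper detaches the text scores on the layout side during
    # pre-training (a gradient stop); the forward values are the same.
    probs = softmax_rows(combined)
    return attend(probs, v_t), attend(probs, v_l)
```

This is where the complementation happens: the layout scores can steer which tokens the text stream attends to, and the text scores can steer the layout stream in the same way.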

This variant is fine-tuned on FUNSD (Form Understanding in Noisy Scanned Documents) with 7 IOB labels: O, B-HEADER, I-HEADER, B-QUESTION, I-QUESTION, B-ANSWER, I-ANSWER.
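The model emits one label per token, and these labels have to be grouped into entities. The sketch below shows one way to turn per-token IOB predictions into entity spans. It assumes that label id i is the i-th label in the list above. Check `id2label` in the model config before relying on that order. The function name is hypothetical. The indices are token positions (subword tokens, if you label every subword), not word positions.

```python
# Assumption: label id i corresponds to the i-th label listed above.
# Verify against id2label in the model's config before relying on it.
FUNSD_LABELS = ["O", "B-HEADER", "I-HEADER", "B-QUESTION", "I-QUESTION", "B-ANSWER", "I-ANSWER"]


def iob_to_spans(label_ids, labels=FUNSD_LABELS):
    """Group per-token IOB label ids into (entity_type, start, end) spans, end exclusive."""
    spans = []
    current = None  # (entity_type, start_index) of the open span, if any
    for i, label_id in enumerate(label_ids):
        tag = labels[label_id]
        if tag == "O":
            if current is not None:
                spans.append((current[0], current[1], i))
                current = None
            continue
        prefix, entity = tag.split("-", 1)
        # B- always opens a new span; an I- that does not continue an open span
        # of the same type is treated leniently as the start of a new one.
        if prefix == "B" or current is None or current[0] != entity:
            if current is not None:
                spans.append((current[0], current[1], i))
            current = (entity, i)
    if current is not None:
        spans.append((current[0], current[1], len(label_ids)))
    return spans


print(iob_to_spans([3, 4, 4, 5, 6, 0, 1]))
# [('QUESTION', 0, 3), ('ANSWER', 3, 5), ('HEADER', 6, 7)]
```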

Model Details

Property        Value
Architecture    LiLT (RoBERTa + Layout Transformer + BiACM)
Parameters      130.7M
Hidden size     768 (text) / 192 (layout)
Layers          12
Heads           12
Vocab           50,265 (RoBERTa BPE)
Labels          7 (FUNSD IOB)
License         MIT
Base model      SCUT-DLVCLab/lilt-roberta-en-base

Available Formats

Format     Size
Float32    498 MB
Q8_0       134 MB
Q4_K       90 MB
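As a rough cross-check of the sizes above, the sketch below computes the file size each format would have if every one of the 130.7M parameters were stored in that type. It uses the ggml block layouts for each type. It ignores GGUF metadata, and it ignores any per-tensor fallback to other types. The constant and function names are illustrative.

```python
# Storage cost per weight for each GGUF type (ggml block layouts):
#   F32  : 4 bytes per weight
#   Q8_0 : blocks of 32 weights = 32 int8 values + one fp16 scale = 34 bytes
#   Q4_K : super-blocks of 256 weights = 144 bytes (4.5 bits per weight)
BYTES_PER_WEIGHT = {"F32": 4.0, "Q8_0": 34 / 32, "Q4_K": 144 / 256}
N_PARAMS = 130.7e6  # parameter count from the Model Details table
MIB = 1024 ** 2


def ideal_size_mib(fmt, n_params=N_PARAMS):
    """Size if *every* parameter were stored in `fmt` (no metadata, no fallbacks)."""
    return n_params * BYTES_PER_WEIGHT[fmt] / MIB


for fmt in ("F32", "Q8_0", "Q4_K"):
    print(f"{fmt}: {ideal_size_mib(fmt):.1f} MiB")
# F32: 498.6 MiB
# Q8_0: 132.4 MiB
# Q4_K: 70.1 MiB
```

The Float32 estimate matches the listed 498 MB if "MB" is read as MiB. The quantized files are larger than their ideal figures. That would be expected if some tensors are kept in higher-precision types. Likely candidates are norms and biases, the token embeddings, and the 192-d layout weights, whose rows are not a multiple of the 256-weight Q4_K block. The actual per-tensor types are not listed on this card, so this explanation is an assumption.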

Usage

Python
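LiLT needs one bounding box per token. Models in the LayoutLM family, LiLT included, are conventionally fed (x0, y0, x1, y1) boxes scaled to a 0-1000 grid. Each word's box is repeated for every subword token that word produces. The sketch below prepares boxes that way. We assume, without confirmation, that CrispEmbed expects the same convention; check its documentation for the exact input format. Both helper names and the token counts in the example are hypothetical.

```python
def normalize_box(box, page_width, page_height, scale=1000):
    """Map a pixel box (x0, y0, x1, y1) onto the 0..scale grid, clamping to the page."""
    x0, y0, x1, y1 = box

    def to_grid(value, extent):
        return max(0, min(scale, int(scale * value / extent)))

    return [to_grid(x0, page_width), to_grid(y0, page_height),
            to_grid(x1, page_width), to_grid(y1, page_height)]


def expand_to_tokens(word_boxes, tokens_per_word):
    """Hypothetical helper: repeat each word's box for every subword token it produced."""
    token_boxes = []
    for box, n_tokens in zip(word_boxes, tokens_per_word):
        token_boxes.extend([box] * n_tokens)
    return token_boxes


pixel_boxes = [(50, 100, 120, 130), (130, 100, 250, 130)]  # two OCR words on a 500x800 page
normalized = [normalize_box(b, page_width=500, page_height=800) for b in pixel_boxes]
print(normalized)  # [[100, 125, 240, 162], [260, 125, 500, 162]]
print(expand_to_tokens(normalized, tokens_per_word=[2, 5]))  # illustrative token counts
```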

CLI

Parity

Verified against HuggingFace transformers using the crispembed-diff harness:

  • 25/25 encoder stages: cos_min = 1.000000
  • 16/16 token labels match (100%)
  • max_abs < 1.6e-03 across all layers
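The card does not say exactly how crispembed-diff defines these metrics. The sketch below shows one plausible reading, with three assumptions. First, cos_min is the minimum per-token cosine similarity between the reference (HuggingFace) and GGUF outputs at one stage. Second, max_abs is the largest element-wise absolute difference. Third, label agreement compares the argmax of the per-token logits. All function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; two all-zero vectors count as identical."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def stage_parity(reference, candidate):
    """Compare one stage's per-token vectors: (min cosine over tokens, max |elementwise diff|)."""
    cos_min = min(cosine(r, c) for r, c in zip(reference, candidate))
    max_abs = max(abs(x - y) for r, c in zip(reference, candidate) for x, y in zip(r, c))
    return cos_min, max_abs


def argmax(row):
    return max(range(len(row)), key=row.__getitem__)


def label_agreement(reference_logits, candidate_logits):
    """Return (tokens whose argmax label agrees, total tokens)."""
    matches = sum(argmax(r) == argmax(c) for r, c in zip(reference_logits, candidate_logits))
    return matches, len(reference_logits)
```

Under this reading, a stage can report cos_min = 1.000000 at six decimal places and still have a small nonzero max_abs. The printed cosine only shows that the vectors point in the same direction to about one part in a million.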

Citation
