LiLT FUNSD — GGUF

GGUF conversion of philschmid/lilt-en-funsd for use with CrispEmbed.

LiLT (Language-independent Layout Transformer) is a dual-stream encoder that pairs a RoBERTa text stream (768-d) with a parallel layout transformer (192-d), coupled through BiACM (the bidirectional attention complementation mechanism). It takes OCR text plus bounding boxes as input and performs token classification for document understanding.
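
To illustrate the input side, here is a minimal sketch of preparing bounding boxes. It assumes the 0-1000 integer grid that the Hugging Face LiLT implementation expects; whether CrispEmbed wants boxes pre-normalized in the same way is an assumption, and `normalize_box` and the sample page size are hypothetical.

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 grid."""
    x0, y0, x1, y1 = box
    scaled = [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]
    # Clamp in case OCR reports coordinates slightly outside the page.
    return [min(1000, max(0, v)) for v in scaled]


# Hypothetical OCR output for a 1000 x 2000 pixel scan.
words = ["Name:", "Jane"]
pixel_boxes = [(50, 100, 150, 120), (160, 100, 230, 120)]
boxes = [normalize_box(b, page_width=1000, page_height=2000) for b in pixel_boxes]
```

Each word keeps one box, so `words` and `boxes` stay aligned by index before tokenization.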

This variant is fine-tuned on FUNSD (Form Understanding in Noisy Scanned Documents) with 7 IOB labels: O, B-HEADER, I-HEADER, B-QUESTION, I-QUESTION, B-ANSWER, I-ANSWER.
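
The IOB scheme above marks the first token of an entity with `B-` and continuation tokens with `I-`. As a rough sketch of how per-token predictions could be grouped back into form fields, the helper below (hypothetical, not part of CrispEmbed) merges consecutive tags of the same type; the sample tokens and labels are invented for illustration.

```python
def iob_to_spans(tokens, labels):
    """Group (token, IOB label) pairs into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, labels):
        prefix, _, entity = label.partition("-")
        if prefix == "I" and entity == current_type:
            # Continuation of the span that is currently open.
            current_tokens.append(token)
            continue
        # Any other label closes the open span, if there is one.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            # A stray I- tag without a matching B- starts a new span (lenient).
            current_type, current_tokens = entity, [token]
        else:
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["Date", "of", "birth", ":", "12", "May", "page"]
labels = ["B-QUESTION", "I-QUESTION", "I-QUESTION", "O", "B-ANSWER", "I-ANSWER", "O"]
spans = iob_to_spans(tokens, labels)
```

Treating a stray `I-` tag as the start of a span is a design choice that tolerates noisy predictions; a stricter decoder could discard such tokens instead.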

Model Details

Property	Value
Architecture	LiLT (RoBERTa + Layout Transformer + BiACM)
Parameters	130.7M
Hidden size	768 (text) / 192 (layout)
Layers	12
Heads	12
Vocab	50,265 (RoBERTa BPE)
Labels	7 (FUNSD IOB)
License	MIT
Base model	SCUT-DLVCLab/lilt-roberta-en-base

Available Formats

File	Format	Size
	Float32	498 MB
	Q8_0	134 MB
	Q4_K	90 MB
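
As a sanity check on these sizes, the sketch below estimates the weight payload from the parameter count alone. It assumes the standard ggml block layouts (Q8_0: 34 bytes per 32 weights; Q4_K: 144 bytes per 256 weights) and ignores GGUF metadata, so it is a back-of-the-envelope estimate rather than a description of how these files were produced.

```python
PARAMS = 130.7e6  # parameter count from the Model Details table
MIB = 2 ** 20

# Bytes per weight; the quantized values assume standard ggml block layouts.
BYTES_PER_WEIGHT = {
    "Float32": 4.0,
    "Q8_0": 34 / 32,    # 32 int8 weights plus one fp16 scale per block
    "Q4_K": 144 / 256,  # 256 4-bit weights plus scales per super-block
}


def estimate_mib(fmt):
    """Estimated size in MiB if every weight were stored in the given format."""
    return PARAMS * BYTES_PER_WEIGHT[fmt] / MIB


estimates = {fmt: estimate_mib(fmt) for fmt in BYTES_PER_WEIGHT}
```

The Float32 estimate lands within about 1 MiB of the listed 498 MB, which suggests the table reports MiB. The quantized estimates come out below the listed sizes, which would be consistent with some tensors being stored at higher precision; that is an inference from the arithmetic, not something the card states.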

Usage

Python

CLI

Parity

Verified against HuggingFace transformers using the crispembed-diff harness:

25/25 encoder stages: cos_min = 1.000000
16/16 token labels match (100%)
max_abs < 1.6e-03 across all layers
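
For readers unfamiliar with these metrics, the sketch below shows one plausible way to compute them: `cos_min` as the minimum cosine similarity over token vectors and `max_abs` as the largest element-wise absolute difference. This is an assumption about what the harness reports, not its actual code; the function names and sample vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def parity(reference, candidate):
    """Compare two lists of per-token vectors; return (cos_min, max_abs)."""
    cos_min = min(cosine(r, c) for r, c in zip(reference, candidate))
    max_abs = max(
        abs(x - y)
        for r, c in zip(reference, candidate)
        for x, y in zip(r, c)
    )
    return cos_min, max_abs


# Invented activations standing in for one encoder stage.
reference = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
candidate = [[1.001, 2.0, 2.999], [0.5, -1.0005, 2.0]]
cos_min, max_abs = parity(reference, candidate)
```

A stage would pass under the thresholds reported above when `cos_min` rounds to 1.000000 and `max_abs` stays below 1.6e-03.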

Citation

Model tree for cstr/lilt-funsd-GGUF

Base model: philschmid/lilt-en-funsd (this model is a quantized derivative)