---
license: apache-2.0
pipeline_tag: image-to-text
library_name: transformers
tags:
- falcon
- ocr
- vision-language
- document-understanding
---
# Falcon OCR

Dense early-fusion vision-language model for document OCR. Given a document image, it extracts text, tables, formulas, and other elements as plain text.
## Highlights

Despite its compact 300M-parameter architecture, FalconOCR achieves near state-of-the-art (SOTA) performance across major benchmarks.
- Strong Performance: FalconOCR achieves near-SOTA results on both olmOCR and OmniDocBench, delivering competitive accuracy for text, tables, and formula recognition against models many times its size.
- Two-Stage Layout Pipeline: FalconOCR pairs with PP-DocLayoutV3 for layout detection, enabling accurate region-level parsing of complex documents with mixed content types while preserving reading order.
- Simple and Lightweight Architecture: Built on a compact 300M-parameter vision-language model, FalconOCR offers a streamlined alternative to bulky multi-model pipelines. Task switching is handled simply by changing the input prompt.
- Efficient and Fast Inference: FalconOCR's small footprint enables fast inference out of the box, with an optional vLLM backend for high-throughput production deployments.
## Benchmark Results

### olmOCR Benchmark

Category-wise performance comparison of FalconOCR against state-of-the-art OCR models. We report accuracy (%) across all category splits.
| Model | Average | ArXiv Math | Base | Hdr/Ftr | TinyTxt | MultCol | OldScan | OldMath | Tables |
|---|---|---|---|---|---|---|---|---|---|
| Mistral OCR 3 | 81.7 | 85.4 | 99.9 | 93.8 | 88.9 | 82.1 | 48.8 | 68.3 | 86.1 |
| Chandra | 82.0 | 81.4 | 99.8 | 88.8 | 91.9 | 82.9 | 49.2 | 73.6 | 88.2 |
| Gemini 3 Pro | 80.2 | 70.6 | 99.8 | 84.0 | 90.3 | 79.2 | 47.5 | 84.9 | 84.9 |
| PaddleOCR VL 1.5 | 79.3 | 85.4 | 98.8 | 96.9 | 80.8 | 82.6 | 39.2 | 66.4 | 84.1 |
| PaddleOCR VL | 79.2 | 85.4 | 98.6 | 96.9 | 80.8 | 82.5 | 38.8 | 66.4 | 83.9 |
| DeepSeek OCR v2 | 78.8 | 81.9 | 99.8 | 95.6 | 88.7 | 83.6 | 33.7 | 68.8 | 78.1 |
| Gemini 3 Flash | 77.5 | 66.5 | 99.8 | 83.8 | 88.2 | 73.7 | 46.0 | 85.8 | 75.9 |
| GPT 5.2 | 69.8 | 61.0 | 99.8 | 75.6 | 62.2 | 70.2 | 34.6 | 75.8 | 79.0 |
| FalconOCR | 80.3 | 80.9 | 99.5 | 94.2 | 78.3 | 87.3 | 43.5 | 70.1 | 90.1 |
### OmniDocBench

Performance comparison on full-page document parsing. Overall↑ aggregates the three sub-metrics. Edit↓ measures text edit distance (lower is better). CDM↑ evaluates formula recognition accuracy. TEDS↑ measures table structure similarity.
| Model | Overall↑ | Edit↓ | CDM↑ | TEDS↑ |
|---|---|---|---|---|
| PaddleOCR VL 1.5 | 94.37 | 0.075 | 94.4 | 91.1 |
| PaddleOCR VL | 91.76 | 0.024 | 91.7 | 85.9 |
| Chandra | 88.97 | 0.046 | 88.1 | 89.5 |
| DeepSeek OCR v2 | 87.66 | 0.037 | 89.2 | 77.5 |
| GPT 5.2 | 86.56 | 0.061 | 88.0 | 77.7 |
| Mistral OCR 3 | 85.20 | 0.053 | 84.3 | 76.1 |
| FalconOCR | 88.64 | 0.055 | 86.8 | 84.6 |
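For intuition on the Edit↓ column: it is a text edit distance normalized so that lower is better. A minimal sketch of such a score, using the classic Levenshtein distance (the benchmark's exact normalization and tokenization may differ):

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_edit(pred: str, ref: str) -> float:
    # Normalize by reference length: 0.0 = exact match.
    return edit_distance(pred, ref) / max(len(ref), 1)
```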
## Installation

```bash
pip install transformers torch einops
```

Requires PyTorch 2.5+ (for FlexAttention).
## Quick Start

```python
import torch
from transformers import AutoModelForCausalLM
from PIL import Image

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/Falcon-OCR",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="cuda",
)

image = Image.open("document.png")
texts = model.generate(image)
print(texts[0])
```
The first `generate()` call is slower because `torch.compile` builds optimized kernels. Subsequent calls are much faster.
## Categories

By default, `category` is `"plain"` (general text extraction). You can specify a category to use a task-specific prompt:

```python
texts = model.generate(image, category="table")
texts = model.generate(image, category="formula")
```

Returns: `list[str]` — one extracted text string per image.

Available categories: `plain`, `text`, `table`, `formula`, `caption`, `footnote`, `list-item`, `page-footer`, `page-header`, `section-header`, `title`.
## Layout OCR

For most documents, `generate()` works well out of the box — the model handles mixed content (text, tables, formulas) in a single pass. For very dense or complex documents with many heterogeneous regions, you can use the two-stage layout detection + per-region OCR pipeline:

```python
results = model.generate_with_layout(image)
for det in results[0]:
    print(f"[{det['category']}] {det['text'][:100]}...")
```
This runs PP-DocLayoutV3 to detect regions (text blocks, tables, formulas, etc.), then OCRs each crop with the appropriate category-specific prompt. Nested boxes (e.g. inline formulas inside text) are automatically filtered.
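The nested-box filtering can be illustrated with a simple containment test. This is a simplified sketch, not FalconOCR's actual implementation — the real pipeline may use IoU thresholds or category-aware rules:

```python
def contains(outer, inner, tol=2):
    # True if `inner` lies inside `outer`, within a small pixel tolerance.
    ox1, oy1, ox2, oy2 = outer
    ix1, iy1, ix2, iy2 = inner
    return (ox1 - tol <= ix1 and oy1 - tol <= iy1 and
            ix2 <= ox2 + tol and iy2 <= oy2 + tol)

def filter_nested(detections):
    # Drop any detection whose bbox lies inside another detection's bbox,
    # e.g. an inline formula box sitting inside a larger text block.
    kept = []
    for i, det in enumerate(detections):
        nested = any(
            j != i and contains(other["bbox"], det["bbox"])
            for j, other in enumerate(detections)
        )
        if not nested:
            kept.append(det)
    return kept
```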
```python
# Batch of pages
results = model.generate_with_layout(
    [Image.open("page1.png"), Image.open("page2.png")],
    ocr_batch_size=32,
)
# results[0] = list of dicts for page 1
# results[1] = list of dicts for page 2
for det in results[0]:
    print(det["category"], det["bbox"], det["score"])
    print(det["text"])
```
The layout model (~100 MB) is loaded lazily on the first `generate_with_layout()` call. It runs on the same GPU as the OCR model.

Returns: `list[list[dict]]` — per image, a list of detections in reading order:
```python
{
    "category": "text",         # layout category
    "bbox": [x1, y1, x2, y2],   # in original image pixels
    "score": 0.93,              # detection confidence
    "text": "..."               # extracted text
}
```