---
license: apache-2.0
pipeline_tag: image-to-text
library_name: transformers
tags:
  - falcon
  - ocr
  - vision-language
  - document-understanding
---

# Falcon OCR

Dense early-fusion vision-language model for document OCR. Given a document image, it extracts text, tables, formulas, and other elements as plain text.

## Highlights

Despite our model's compact 300M-parameter architecture, it achieves near state-of-the-art (SOTA) performance across major benchmarks.

1. **Strong Performance:** FalconOCR achieves near-SOTA results on both olmOCR and OmniDocBench, delivering competitive accuracy for text, table, and formula recognition against models many times its size.
2. **Two-Stage Layout Pipeline:** FalconOCR pairs with PP-DocLayoutV3 for layout detection, enabling accurate region-level parsing of complex documents with mixed content types while preserving reading order.
3. **Simple and Lightweight Architecture:** Built on a compact 300M-parameter vision-language model, FalconOCR offers a streamlined alternative to bulky multi-model pipelines. Task switching is handled simply by changing the input prompt.
4. **Efficient and Fast Inference:** FalconOCR's small footprint enables fast inference out of the box, with an optional vLLM backend for high-throughput production deployments.

## Benchmark Results

### olmOCR Benchmark

Category-wise performance comparison of FalconOCR against state-of-the-art OCR models. We report accuracy (%) across all category splits.

| Model | Average | ArXiv Math | Base | Hdr/Ftr | TinyTxt | MultCol | OldScan | OldMath | Tables |
|---|---|---|---|---|---|---|---|---|---|
| Mistral OCR 3 | 81.7 | 85.4 | 99.9 | 93.8 | 88.9 | 82.1 | 48.8 | 68.3 | 86.1 |
| Chandra | 82.0 | 81.4 | 99.8 | 88.8 | 91.9 | 82.9 | 49.2 | 73.6 | 88.2 |
| Gemini 3 Pro | 80.2 | 70.6 | 99.8 | 84.0 | 90.3 | 79.2 | 47.5 | 84.9 | 84.9 |
| PaddleOCR VL 1.5 | 79.3 | 85.4 | 98.8 | 96.9 | 80.8 | 82.6 | 39.2 | 66.4 | 84.1 |
| PaddleOCR VL | 79.2 | 85.4 | 98.6 | 96.9 | 80.8 | 82.5 | 38.8 | 66.4 | 83.9 |
| DeepSeek OCR v2 | 78.8 | 81.9 | 99.8 | 95.6 | 88.7 | 83.6 | 33.7 | 68.8 | 78.1 |
| Gemini 3 Flash | 77.5 | 66.5 | 99.8 | 83.8 | 88.2 | 73.7 | 46.0 | 85.8 | 75.9 |
| GPT 5.2 | 69.8 | 61.0 | 99.8 | 75.6 | 62.2 | 70.2 | 34.6 | 75.8 | 79.0 |
| **FalconOCR** | 80.3 | 80.9 | 99.5 | 94.2 | 78.3 | 87.3 | 43.5 | 70.1 | 90.1 |

### OmniDocBench

Performance comparison on full-page document parsing. Overall↑ aggregates the three sub-metrics. Edit↓ measures text edit distance (lower is better). CDM↑ evaluates formula recognition accuracy. TEDS↑ measures table structure similarity.

| Model | Overall↑ | Edit↓ | CDM↑ | TEDS↑ |
|---|---|---|---|---|
| PaddleOCR VL 1.5 | 94.37 | 0.075 | 94.4 | 91.1 |
| PaddleOCR VL | 91.76 | 0.024 | 91.7 | 85.9 |
| Chandra | 88.97 | 0.046 | 88.1 | 89.5 |
| DeepSeek OCR v2 | 87.66 | 0.037 | 89.2 | 77.5 |
| GPT 5.2 | 86.56 | 0.061 | 88.0 | 77.7 |
| Mistral OCR 3 | 85.20 | 0.053 | 84.3 | 76.1 |
| **FalconOCR** | 88.64 | 0.055 | 86.8 | 84.6 |
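The Edit↓ metric is a normalized text edit distance. As a rough illustration of how such a metric is typically computed (the benchmark's exact normalization may differ), it is a Levenshtein distance scaled by string length:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein (edit) distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))
```

A perfect transcription scores 0.0; completely disjoint strings score 1.0.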

## Installation

```bash
pip install transformers torch einops
```

Requires PyTorch 2.5+ (FlexAttention).

## Quick Start

```python
import torch
from transformers import AutoModelForCausalLM
from PIL import Image

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/Falcon-OCR",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="cuda",
)

image = Image.open("document.png")
texts = model.generate(image)
print(texts[0])
```

The first `generate()` call is slower due to `torch.compile` building optimized kernels. Subsequent calls are much faster.

## Categories

By default, `category` is `"plain"` (general text extraction). You can specify a category to use a task-specific prompt:

```python
texts = model.generate(image, category="table")
texts = model.generate(image, category="formula")
```

Returns: `list[str]` — one extracted text string per image.

Available categories: `plain`, `text`, `table`, `formula`, `caption`, `footnote`, `list-item`, `page-footer`, `page-header`, `section-header`, `title`.

## Layout OCR

For most documents, `generate()` works well out of the box — the model handles mixed content (text, tables, formulas) in a single pass. For very dense or complex documents with many heterogeneous regions, you can use the two-stage layout detection + per-region OCR pipeline:

```python
results = model.generate_with_layout(image)

for det in results[0]:
    print(f"[{det['category']}] {det['text'][:100]}...")
```

This runs PP-DocLayoutV3 to detect regions (text blocks, tables, formulas, etc.), then OCRs each crop with the appropriate category-specific prompt. Nested boxes (e.g. inline formulas inside text) are automatically filtered.
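The nested-box filtering can be pictured with a simple containment check. This is an illustrative sketch only, not the pipeline's actual implementation; `filter_nested` and its 0.9 containment threshold are assumptions:

```python
def box_area(box):
    """Area of an [x1, y1, x2, y2] box."""
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])

def coverage(outer, inner):
    """Fraction of `inner`'s area covered by `outer`."""
    ix1, iy1 = max(outer[0], inner[0]), max(outer[1], inner[1])
    ix2, iy2 = min(outer[2], inner[2]), min(outer[3], inner[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = box_area(inner)
    return inter / area if area else 0.0

def filter_nested(dets, thresh=0.9):
    """Drop detections almost fully contained inside a larger detection's box."""
    kept = []
    for i, det in enumerate(dets):
        nested = any(
            j != i
            and box_area(other["bbox"]) > box_area(det["bbox"])
            and coverage(other["bbox"], det["bbox"]) >= thresh
            for j, other in enumerate(dets)
        )
        if not nested:
            kept.append(det)
    return kept
```

An inline formula box sitting inside a text block is covered ~100% by the text box and gets dropped, while side-by-side regions survive.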

```python
# Batch of pages
results = model.generate_with_layout(
    [Image.open("page1.png"), Image.open("page2.png")],
    ocr_batch_size=32,
)

# results[0] = list of dicts for page 1
# results[1] = list of dicts for page 2
for det in results[0]:
    print(det["category"], det["bbox"], det["score"])
    print(det["text"])
```

The layout model is loaded lazily on the first `generate_with_layout()` call (~100 MB). It runs on the same GPU as the OCR model.

Returns: `list[list[dict]]` — per image, a list of detections in reading order:

```python
{
    "category": "text",       # layout category
    "bbox": [x1, y1, x2, y2], # in original image pixels
    "score": 0.93,            # detection confidence
    "text": "..."             # extracted text
}
```
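If you need a single plain-text transcript per page, the per-region results can be stitched back together in the returned reading order. `stitch_page` below is a hypothetical helper, not part of the model's API; which categories to skip is an assumption:

```python
def stitch_page(detections, skip=("page-header", "page-footer")):
    """Join region texts in reading order, skipping boilerplate categories."""
    parts = [
        det["text"]
        for det in detections
        if det["category"] not in skip and det["text"].strip()
    ]
    return "\n\n".join(parts)
```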

## Citation