---
license: apache-2.0
pipeline_tag: image-to-text
library_name: transformers
tags:
- falcon
- ocr
- vision-language
- document-understanding
---
# Falcon OCR

Dense early-fusion vision-language model for document OCR. Given a document image, it extracts text, tables, formulas, and other elements as plain text.
## Highlights

Despite its compact 300M-parameter architecture, FalconOCR achieves near state-of-the-art (SOTA) performance across major benchmarks.
- Strong Performance: FalconOCR achieves near-SOTA results on both olmOCR and OmniDocBench, delivering competitive accuracy for text, tables, and formula recognition against models many times its size.
- Two-Stage Layout Pipeline: FalconOCR pairs with PP-DocLayoutV3 for layout detection, enabling accurate region-level parsing of complex documents with mixed content types while preserving reading order.
- Simple and Lightweight Architecture: Built on a compact 300M-parameter vision-language model, FalconOCR offers a streamlined alternative to bulky multi-model pipelines. Task switching is handled simply by changing the input prompt.
- Efficient and Fast Inference: FalconOCR's small footprint enables fast inference out of the box, with an optional vLLM backend for high-throughput production deployments.
## Benchmark Results

### olmOCR Benchmark

Category-wise performance comparison of FalconOCR against state-of-the-art OCR models. We report accuracy (%) across all category splits.
| Model | Average | ArXiv Math | Base | Hdr/Ftr | TinyTxt | MultCol | OldScan | OldMath | Tables |
|---|---|---|---|---|---|---|---|---|---|
| Mistral OCR 3 | 81.7 | 85.4 | 99.9 | 93.8 | 88.9 | 82.1 | 48.8 | 68.3 | 86.1 |
| Chandra | 82.0 | 81.4 | 99.8 | 88.8 | 91.9 | 82.9 | 49.2 | 73.6 | 88.2 |
| Gemini 3 Pro | 80.2 | 70.6 | 99.8 | 84.0 | 90.3 | 79.2 | 47.5 | 84.9 | 84.9 |
| PaddleOCR VL 1.5 | 79.3 | 85.4 | 98.8 | 96.9 | 80.8 | 82.6 | 39.2 | 66.4 | 84.1 |
| PaddleOCR VL | 79.2 | 85.4 | 98.6 | 96.9 | 80.8 | 82.5 | 38.8 | 66.4 | 83.9 |
| DeepSeek OCR v2 | 78.8 | 81.9 | 99.8 | 95.6 | 88.7 | 83.6 | 33.7 | 68.8 | 78.1 |
| Gemini 3 Flash | 77.5 | 66.5 | 99.8 | 83.8 | 88.2 | 73.7 | 46.0 | 85.8 | 75.9 |
| GPT 5.2 | 69.8 | 61.0 | 99.8 | 75.6 | 62.2 | 70.2 | 34.6 | 75.8 | 79.0 |
| FalconOCR | 80.3 | 80.9 | 99.5 | 94.2 | 78.3 | 87.3 | 43.5 | 70.1 | 90.1 |
### OmniDocBench

Performance comparison on full-page document parsing. Overall↑ aggregates the three sub-metrics. Edit↓ measures text edit distance (lower is better). CDM↑ evaluates formula recognition accuracy. TEDS↑ measures table structure similarity.
| Model | Overall↑ | Edit↓ | CDM↑ | TEDS↑ |
|---|---|---|---|---|
| PaddleOCR VL 1.5 | 94.37 | 0.075 | 94.4 | 91.1 |
| PaddleOCR VL | 91.76 | 0.024 | 91.7 | 85.9 |
| Chandra | 88.97 | 0.046 | 88.1 | 89.5 |
| DeepSeek OCR v2 | 87.66 | 0.037 | 89.2 | 77.5 |
| GPT 5.2 | 86.56 | 0.061 | 88.0 | 77.7 |
| Mistral OCR 3 | 85.20 | 0.053 | 84.3 | 76.1 |
| FalconOCR | 88.64 | 0.055 | 86.8 | 84.6 |
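For intuition on the Edit↓ column: it is a text edit distance normalized so that lower is better. A minimal sketch of such a score, using the classic Levenshtein distance (the benchmark's exact normalization and tokenization may differ):

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_edit(pred: str, ref: str) -> float:
    # Normalize by reference length: 0.0 = exact match.
    return edit_distance(pred, ref) / max(len(ref), 1)
```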
## Installation

```bash
pip install transformers torch einops
```

Requires PyTorch 2.5+ (for FlexAttention).
## Quick Start

```python
import torch
from transformers import AutoModelForCausalLM
from PIL import Image

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/Falcon-OCR",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="cuda",
)

image = Image.open("document.png")
texts = model.generate(image)
print(texts[0])
```
The first `generate()` call is slower because `torch.compile` builds optimized kernels. Subsequent calls are much faster.
## Categories

By default, `category` is `"plain"` (general text extraction). You can specify a category to use a task-specific prompt:

```python
texts = model.generate(image, category="table")
texts = model.generate(image, category="formula")
```

Returns: `list[str]` — one extracted text string per image.

Available categories: `plain`, `text`, `table`, `formula`, `caption`, `footnote`, `list-item`, `page-footer`, `page-header`, `section-header`, `title`.
## Layout OCR

For most documents, `generate()` works well out of the box — the model handles mixed content (text, tables, formulas) in a single pass. For very dense or complex documents with many heterogeneous regions, you can use the two-stage layout detection + per-region OCR pipeline:

```python
results = model.generate_with_layout(image)
for det in results[0]:
    print(f"[{det['category']}] {det['text'][:100]}...")
```
This runs PP-DocLayoutV3 to detect regions (text blocks, tables, formulas, etc.), then OCRs each crop with the appropriate category-specific prompt. Nested boxes (e.g. inline formulas inside text) are automatically filtered.
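The nested-box filtering can be illustrated with a simple containment test. This is a simplified sketch, not FalconOCR's actual implementation — the real pipeline may use IoU thresholds or category-aware rules:

```python
def contains(outer, inner, tol=2):
    # True if `inner` lies inside `outer`, within a small pixel tolerance.
    ox1, oy1, ox2, oy2 = outer
    ix1, iy1, ix2, iy2 = inner
    return (ox1 - tol <= ix1 and oy1 - tol <= iy1 and
            ix2 <= ox2 + tol and iy2 <= oy2 + tol)

def filter_nested(detections):
    # Drop any detection whose bbox lies inside another detection's bbox,
    # e.g. an inline formula box sitting inside a larger text block.
    kept = []
    for i, det in enumerate(detections):
        nested = any(
            j != i and contains(other["bbox"], det["bbox"])
            for j, other in enumerate(detections)
        )
        if not nested:
            kept.append(det)
    return kept
```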
```python
# Batch of pages
results = model.generate_with_layout(
    [Image.open("page1.png"), Image.open("page2.png")],
    ocr_batch_size=32,
)
# results[0] = list of dicts for page 1
# results[1] = list of dicts for page 2
for det in results[0]:
    print(det["category"], det["bbox"], det["score"])
    print(det["text"])
```
The layout model (~100 MB) is loaded lazily on the first `generate_with_layout()` call. It runs on the same GPU as the OCR model.

Returns: `list[list[dict]]` — per image, a list of detections in reading order:
```python
{
    "category": "text",         # layout category
    "bbox": [x1, y1, x2, y2],   # in original image pixels
    "score": 0.93,              # detection confidence
    "text": "..."               # extracted text
}
```