Files changed (1)

README.md CHANGED (+29 -25)
@@ -7,7 +7,6 @@ tags:
 - vision-language
 - document-understanding
 ---
-
 # Falcon OCR
 
 Falcon OCR is a 300M-parameter early-fusion vision-language model for document OCR. Given an image, it can produce plain text, LaTeX for formulas, or HTML for tables, depending on the requested output format.
@@ -38,14 +37,12 @@ Falcon OCR requires PyTorch 2.5 or newer for FlexAttention. The first call can b
 import torch
 from PIL import Image
 from transformers import AutoModelForCausalLM
-
 model = AutoModelForCausalLM.from_pretrained(
     "tiiuae/Falcon-OCR",
     trust_remote_code=True,
     torch_dtype=torch.bfloat16,
     device_map="auto",
 )
-
 image = Image.open("document.png")
 texts = model.generate(image)  # default category is "plain"
 print(texts[0])
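
`generate` also accepts a list of images and returns one string per image; a minimal batch sketch, reusing the `model` loaded above and only the documented arguments:

```python
# Batch usage sketch: pass a list of PIL images; `category` selects the
# output format (see the API section below). One string comes back per image.
from PIL import Image

pages = [Image.open("page1.png"), Image.open("page2.png")]
texts = model.generate(pages, category="text")
for t in texts:
    print(t[:200])
```
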
@@ -62,41 +59,29 @@ texts = model.generate(image, category="table")  # HTML table
 ## API
 
 ### `model.generate(images, category="plain", **kwargs)`
-
 - **Inputs**:
   - `images`: a `PIL.Image.Image` or a list of images
   - `category`: one of `plain`, `text`, `table`, `formula`, `caption`, `footnote`, `list-item`, `page-footer`, `page-header`, `section-header`, `title`
 - **Returns**: `list[str]`, one extracted string per image
-
 ## Layout OCR (two-stage pipeline)
-
 For sparse documents, running OCR on the whole image can work well. For dense documents with heterogeneous regions (multi-column layouts, interleaved tables and formulas, small captions), we provide an optional two-stage pipeline:
-
 1. A layout detector finds regions on the page.
 2. Falcon OCR runs independently on each crop with a category-specific prompt.
-
 We use [PP-DocLayoutV3](https://huggingface.co/PaddlePaddle/PP-DocLayoutV3_safetensors) for the layout detector.
-
 ```python
 results = model.generate_with_layout(image)
-
 for det in results[0]:
     print(f"[{det['category']}] {det['text'][:100]}...")
 ```
-
 Batch mode:
-
 ```python
 results = model.generate_with_layout(
     [Image.open("page1.png"), Image.open("page2.png")],
     ocr_batch_size=32,
 )
 ```
-
 The layout model is loaded lazily on the first `generate_with_layout()` call and runs on the same GPU as the OCR model.
-
 **Returns**: `list[list[dict]]`, one list per image, in reading order:
-
 ```python
 {
     "category": "text",    # layout category
@@ -105,13 +90,16 @@ The layout model is loaded lazily on the first `generate_with_layout()` call and
     "text": "..."          # extracted text
 }
 ```
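
Because regions arrive in reading order, flattening a page back into a single string is a short loop. A minimal sketch that relies only on the documented `category` and `text` keys (the markdown promotion of section headers is our own choice):

```python
# Sketch: flatten `generate_with_layout` output into one string per page.
# `results` is list[list[dict]], one inner list per image, in reading order.
results = model.generate_with_layout(image)

pages = []
for regions in results:
    parts = []
    for det in regions:
        if det["category"] == "section-header":
            parts.append(f"## {det['text']}")  # promote headers to markdown
        else:
            parts.append(det["text"])
    pages.append("\n\n".join(parts))
print(pages[0])
```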
 
-## Benchmark Results
+## When to Use What
+
+| Mode | Best for | How |
+|------|----------|-----|
+| **Plain OCR** | Simple documents, real-world photos, slides, receipts, screenshots | `model.generate(image)` |
+| **Layout + OCR** | Complex multi-column documents, academic papers, reports, dense pages like newspapers | `model.generate_with_layout(image)` |
+
+## Benchmark Results
 ### olmOCR Benchmark
-
 Category-wise performance comparison of Falcon OCR against state-of-the-art OCR models. We report accuracy (%) across all category splits.
-
 | Model | Average | ArXiv Math | Base | Headers/Footers | Tiny Text | Multi-Column | Old Scans | Old Scans Math | Tables |
 |---|---|---|---|---|---|---|---|---|---|
 | Mistral OCR 3 | 81.7 | **85.4** | **99.9** | 93.8 | 88.9 | 82.1 | 48.8 | 68.3 | 86.1 |
@@ -145,6 +133,24 @@ First, a compact model can be competitive if the interface is simple and the tra
 
 More broadly, these results suggest that an early-fusion single-stack Transformer can be a viable alternative to the common "vision encoder plus text decoder" recipe for OCR. We do not view this as a finished answer, but as a promising direction: one early-fusion backbone, a shared parameter space for text and images, one decoding interface, and better data and training signals, rather than increasingly complex pipelines. To our knowledge, this is among the first demonstrations that this early-fusion recipe can reach competitive document OCR accuracy at this scale, and we hope it encourages more work in this direction.
 
+## Serving Throughput
+
+Measured on a single A100-80GB GPU with vLLM, processing document images from olmOCR-Bench at very high concurrency to keep vLLM fully utilised.
+
+We benchmark two modes to isolate different parts of the pipeline:
+
+- **Cropped regions**: a layout detector is run offline first to extract all regions from every page, and only the resulting crops are sent to the model. This measures pure OCR throughput, with no layout overhead.
+- **Layout + OCR**: the full end-to-end pipeline, in which layout detection finds regions on each page, crops them, and the model runs on every crop. This is the real-world serving number, including both layout and OCR time.
+
+| Mode | tok/s | img/s | Description |
+|------|------:|------:|-------------|
+| **Layout + OCR** | 5,825 | 2.9 | Full pipeline: layout detection → crop → per-region OCR |
+| **Cropped regions** | 6,076 | 43.7 | Per-region OCR on pre-extracted crops, no layout step |
+
+The similar tok/s alongside very different img/s reflects the unit of work: a full page decodes to roughly 2,000 tokens (5,825 / 2.9), while a single crop decodes to roughly 140 (6,076 / 43.7).
+
+At 0.3B parameters, Falcon OCR is roughly 3x smaller than 0.9B-class OCR VLMs (e.g. PaddleOCR VL), which translates directly into higher serving throughput at competitive accuracy.
+
 ## Limitations
 
 - **Old scans and tiny text**: heavily degraded scans and very small glyphs remain challenging. These cases often need higher effective resolution and better coverage in the training mixture.
@@ -208,7 +213,7 @@ docker run -d --name falcon-ocr \
   -e VLLM_GPU_MEM_UTIL=0.90 \
   -p 8000:8000 \
   -p 5002:5002 \
-  https://ghcr.io/v2/tiiuae/falcon-ocr/manifests/latest
+  ghcr.io/tiiuae/falcon-ocr:latest
 ```
 
 ### API
@@ -228,7 +233,6 @@ The easiest way to send files. Supports images and multi-page PDFs:
 # Single image
 curl -X POST http://localhost:5002/falconocr/upload \
   -F "files=@photo.jpg;type=image/jpeg"
-
 # PDF document
 curl -X POST http://localhost:5002/falconocr/upload \
   -F "files=@document.pdf;type=application/pdf"
@@ -253,10 +257,9 @@ Response:
 {
   "json_result": [[{
     "index": 0,
-    "label": "text",
     "mapped_label": "text",
     "content": "The Manuscript",
-    "bbox_2d": [273, 273, 937, 380],
+    "bbox": [273, 273, 937, 380],
     "score": 0.3145
   }]],
   "markdown_result": "The Manuscript",
@@ -351,7 +354,7 @@ docker run -d --name falcon-ocr \
   -e VLLM_GPU_MEM_UTIL=0.90 \
   -p 8000:8000 \
   -p 5002:5002 \
-  [griffintaur/falcon-ocr:latest](https://ghcr.io/v2/tiiuae/falcon-ocr/manifests/latest)
+  ghcr.io/tiiuae/falcon-ocr:latest
 ```
 
 #### Single GPU (memory sharing)
@@ -369,7 +372,7 @@ docker run -d --name falcon-ocr \
   -e MAX_NUM_SEQS=512 \
   -p 8000:8000 \
   -p 5002:5002 \
-  https://ghcr.io/v2/tiiuae/falcon-ocr/manifests/latest
+  ghcr.io/tiiuae/falcon-ocr:latest
 ```
 
 #### Custom Ports
@@ -384,7 +387,7 @@ docker run -d --name falcon-ocr \
   -e PIPELINE_PORT=15002 \
   -p 18000:18000 \
   -p 15002:15002 \
-  https://ghcr.io/v2/tiiuae/falcon-ocr/manifests/latest
+  ghcr.io/tiiuae/falcon-ocr:latest
 ```
 
 Docker `--gpus "device=3,4"` makes the container see GPUs as local indices `0,1`.
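
To verify the mapping from inside the container, a quick sketch (assuming PyTorch is available in the image):

```python
# With --gpus "device=3,4", the container sees exactly two CUDA devices,
# re-indexed from 0: local 0 is host GPU 3, local 1 is host GPU 4.
import torch

print(torch.cuda.device_count())  # 2
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```
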
@@ -405,3 +408,4 @@ If you use Falcon OCR, please cite:
   note = {Code: https://github.com/tiiuae/Falcon-Perception},
 }
 ```
+
 
 