Files changed (1)

README.md CHANGED (+29 -25)
@@ -7,7 +7,6 @@ tags:
 - vision-language
 - document-understanding
 ---
-
 # Falcon OCR
 
 Falcon OCR is a 300M-parameter early-fusion vision-language model for document OCR. Given an image, it can produce plain text, LaTeX for formulas, or HTML for tables, depending on the requested output format.
@@ -38,14 +37,12 @@ Falcon OCR requires PyTorch 2.5 or newer for FlexAttention. The first call can b
 import torch
 from PIL import Image
 from transformers import AutoModelForCausalLM
-
 model = AutoModelForCausalLM.from_pretrained(
     "tiiuae/Falcon-OCR",
     trust_remote_code=True,
     torch_dtype=torch.bfloat16,
     device_map="auto",
 )
-
 image = Image.open("document.png")
 texts = model.generate(image)  # default category is "plain"
 print(texts[0])
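
`generate` also accepts a list of images and returns one string per image; a minimal batch sketch, reusing the `model` loaded above and only the documented arguments:

```python
# Batch usage sketch: pass a list of PIL images; `category` selects the
# output format (see the API section below). One string comes back per image.
from PIL import Image

pages = [Image.open("page1.png"), Image.open("page2.png")]
texts = model.generate(pages, category="text")
for t in texts:
    print(t[:200])
```
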
@@ -62,41 +59,29 @@ texts = model.generate(image, category="table")  # HTML table
 ## API
 
 ### `model.generate(images, category="plain", **kwargs)`
-
 - **Inputs**:
   - `images`: a `PIL.Image.Image` or a list of images
   - `category`: one of `plain`, `text`, `table`, `formula`, `caption`, `footnote`, `list-item`, `page-footer`, `page-header`, `section-header`, `title`
 - **Returns**: `list[str]`, one extracted string per image
-
 ## Layout OCR (two-stage pipeline)
-
 For sparse documents, running OCR on the whole image can work well. For dense documents with heterogeneous regions (multi-column layouts, interleaved tables and formulas, small captions), we provide an optional two-stage pipeline:
-
 1. A layout detector finds regions on the page.
 2. Falcon OCR runs independently on each crop with a category-specific prompt.
-
 We use [PP-DocLayoutV3](https://huggingface.co/PaddlePaddle/PP-DocLayoutV3_safetensors) for the layout detector.
-
 ```python
 results = model.generate_with_layout(image)
-
 for det in results[0]:
     print(f"[{det['category']}] {det['text'][:100]}...")
 ```
-
 Batch mode:
-
 ```python
 results = model.generate_with_layout(
     [Image.open("page1.png"), Image.open("page2.png")],
     ocr_batch_size=32,
 )
 ```
-
 The layout model is loaded lazily on the first `generate_with_layout()` call and runs on the same GPU as the OCR model.
-
 **Returns**: `list[list[dict]]`, one list per image, in reading order:
-
 ```python
 {
     "category": "text",    # layout category
@@ -105,13 +90,16 @@ The layout model is loaded lazily on the first `generate_with_layout()` call and
     "text": "..."          # extracted text
 }
 ```
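
Because regions arrive in reading order, flattening a page back into a single string is a short loop. A minimal sketch that relies only on the documented `category` and `text` keys (the markdown promotion of section headers is our own choice):

```python
# Sketch: flatten `generate_with_layout` output into one string per page.
# `results` is list[list[dict]], one inner list per image, in reading order.
results = model.generate_with_layout(image)

pages = []
for regions in results:
    parts = []
    for det in regions:
        if det["category"] == "section-header":
            parts.append(f"## {det['text']}")  # promote headers to markdown
        else:
            parts.append(det["text"])
    pages.append("\n\n".join(parts))
print(pages[0])
```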
 
-## Benchmark Results
+## When to Use What
+
+| Mode | Best for | How |
+|------|----------|-----|
+| **Plain OCR** | Simple documents, real-world photos, slides, receipts, screenshots | `model.generate(image)` |
+| **Layout + OCR** | Complex multi-column documents, academic papers, reports, dense pages like newspapers | `model.generate_with_layout(image)` |
+
+## Benchmark Results
 ### olmOCR Benchmark
-
 Category-wise performance comparison of Falcon OCR against state-of-the-art OCR models. We report accuracy (%) across all category splits.
-
 | Model | Average | ArXiv Math | Base | Headers/Footers | Tiny Text | Multi-Column | Old Scans | Old Scans Math | Tables |
 |---|---|---|---|---|---|---|---|---|---|
 | Mistral OCR 3 | 81.7 | **85.4** | **99.9** | 93.8 | 88.9 | 82.1 | 48.8 | 68.3 | 86.1 |
@@ -145,6 +133,24 @@ First, a compact model can be competitive if the interface is simple and the tra
 
 More broadly, these results suggest that an early-fusion single-stack Transformer can be a viable alternative to the common "vision encoder plus text decoder" recipe for OCR. We do not view this as a finished answer, but as a promising direction: one early-fusion backbone, a shared parameter space for text and images, one decoding interface, and better data and training signals, rather than increasingly complex pipelines. To our knowledge, this is among the first demonstrations that this early-fusion recipe can reach competitive document OCR accuracy at this scale, and we hope it encourages more work in this direction.
 
+## Serving Throughput
+
+Measured on a single A100-80GB GPU with vLLM, processing document images from olmOCR-Bench at very high concurrency to keep vLLM fully utilised.
+
+We benchmark two modes to isolate different parts of the pipeline:
+
+- **Cropped regions**: a layout detector is run offline first to extract all regions from every page, and only the resulting crops are sent to the model. This measures pure OCR throughput, with no layout overhead.
+- **Layout + OCR**: the full end-to-end pipeline, in which layout detection finds regions on each page, crops them, and the model runs on every crop. This is the real-world serving number, including both layout and OCR time.
+
+| Mode | tok/s | img/s | Description |
+|------|------:|------:|-------------|
+| **Layout + OCR** | 5,825 | 2.9 | Full pipeline: layout detection → crop → per-region OCR |
+| **Cropped regions** | 6,076 | 43.7 | Per-region OCR on pre-extracted crops, no layout step |
+
+The similar tok/s alongside very different img/s reflects the unit of work: a full page decodes to roughly 2,000 tokens (5,825 / 2.9), while a single crop decodes to roughly 140 (6,076 / 43.7).
+
+At 0.3B parameters, Falcon OCR is roughly 3x smaller than 0.9B-class OCR VLMs (e.g. PaddleOCR VL), which translates directly into higher serving throughput at competitive accuracy.
+
 ## Limitations
 
 - **Old scans and tiny text**: heavily degraded scans and very small glyphs remain challenging. These cases often need higher effective resolution and better coverage in the training mixture.
@@ -208,7 +213,7 @@ docker run -d --name falcon-ocr \
   -e VLLM_GPU_MEM_UTIL=0.90 \
   -p 8000:8000 \
   -p 5002:5002 \
-  https://ghcr.io/v2/tiiuae/falcon-ocr/manifests/latest
+  ghcr.io/tiiuae/falcon-ocr:latest
 ```
 
 ### API
@@ -228,7 +233,6 @@ The easiest way to send files. Supports images and multi-page PDFs:
 # Single image
 curl -X POST http://localhost:5002/falconocr/upload \
   -F "files=@photo.jpg;type=image/jpeg"
-
 # PDF document
 curl -X POST http://localhost:5002/falconocr/upload \
   -F "files=@document.pdf;type=application/pdf"
@@ -253,10 +257,9 @@ Response:
 {
   "json_result": [[{
     "index": 0,
-    "label": "text",
     "mapped_label": "text",
     "content": "The Manuscript",
-    "bbox_2d": [273, 273, 937, 380],
+    "bbox": [273, 273, 937, 380],
     "score": 0.3145
   }]],
   "markdown_result": "The Manuscript",
@@ -351,7 +354,7 @@ docker run -d --name falcon-ocr \
   -e VLLM_GPU_MEM_UTIL=0.90 \
   -p 8000:8000 \
   -p 5002:5002 \
-  [griffintaur/falcon-ocr:latest](https://ghcr.io/v2/tiiuae/falcon-ocr/manifests/latest)
+  ghcr.io/tiiuae/falcon-ocr:latest
 ```
 
 #### Single GPU (memory sharing)
@@ -369,7 +372,7 @@ docker run -d --name falcon-ocr \
   -e MAX_NUM_SEQS=512 \
   -p 8000:8000 \
   -p 5002:5002 \
-  https://ghcr.io/v2/tiiuae/falcon-ocr/manifests/latest
+  ghcr.io/tiiuae/falcon-ocr:latest
 ```
 
 #### Custom Ports
@@ -384,7 +387,7 @@ docker run -d --name falcon-ocr \
   -e PIPELINE_PORT=15002 \
   -p 18000:18000 \
   -p 15002:15002 \
-  https://ghcr.io/v2/tiiuae/falcon-ocr/manifests/latest
+  ghcr.io/tiiuae/falcon-ocr:latest
 ```
 
 Docker `--gpus "device=3,4"` makes the container see GPUs as local indices `0,1`.
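
To verify the mapping from inside the container, a quick sketch (assuming PyTorch is available in the image):

```python
# With --gpus "device=3,4", the container sees exactly two CUDA devices,
# re-indexed from 0: local 0 is host GPU 3, local 1 is host GPU 4.
import torch

print(torch.cuda.device_count())  # 2
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```
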
@@ -405,3 +408,4 @@ If you use Falcon OCR, please cite:
   note = {Code: https://github.com/tiiuae/Falcon-Perception},
 }
 ```
+
 
 