| ---
|
| pipeline_tag: image-to-text
|
| library_name: onnxruntime
|
| tags:
|
| - falcon
|
| - ocr
|
| - vision-language
|
| - document-understanding
|
| license: apache-2.0
|
| ---
|
|
|
|
|
| # ONNX model for [Falcon-OCR](https://huggingface.co/tiiuae/Falcon-OCR)
|
|
|
| ## Try with [ningpp/flux](https://github.com/ningpp/flux)
|
|
|
| Flux is a Java-based OCR project.
|
|
|
|
|
|
|
| # Falcon OCR
|
|
|
| Falcon OCR is a 300M-parameter early-fusion vision-language model for document OCR. Given an image, it produces plain text, LaTeX for formulas, or HTML for tables, depending on the requested output format.
|
|
|
| Most OCR VLM systems are built as a pipeline: a vision encoder feeds a separate text decoder, with additional task-specific glue around them. Falcon OCR takes a different approach. A single Transformer processes image patches and text tokens in a shared parameter space from the first layer, using a hybrid attention mask in which image tokens attend bidirectionally and text tokens decode causally, conditioned on the image.
|
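The hybrid attention mask described above can be illustrated with a small sketch. This is not the model's actual implementation. It assumes a hypothetical sequence layout in which all image tokens come first and the text tokens follow, and the function name `hybrid_attention_mask` is made up for this example. Entry `mask[q][k]` is `True` when query position `q` may attend to key position `k`.

```python
def hybrid_attention_mask(num_image_tokens: int, num_text_tokens: int) -> list[list[bool]]:
    """Build a boolean attention mask for an [image tokens | text tokens] sequence.

    Assumption (not confirmed by the model card): image tokens occupy
    positions [0, num_image_tokens) and text tokens come after them.
    """
    total = num_image_tokens + num_text_tokens
    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            # Image tokens see every other image token, in both directions.
            both_image = q < num_image_tokens and k < num_image_tokens
            # Everything else follows the usual causal rule: no looking ahead.
            causal = k <= q
            row.append(both_image or causal)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # 3 image tokens followed by 2 text tokens.
    for row in hybrid_attention_mask(3, 2):
        print("".join("1" if allowed else "0" for allowed in row))
```

Under this layout, each text row can see the full image block plus earlier text, while image rows never see text, which is one way to read "text tokens decode causally, conditioned on the image".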
|
|
| We built it this way for two practical reasons. First, it keeps the interface simple: one backbone, one decoding path, and task switching through prompts rather than a growing set of modules. Second, a 0.3B model has lower latency and cost than 0.9B-class OCR VLMs, and in our vLLM-based serving setup this translates into often 2–3× higher throughput, depending on sequence lengths and batch configuration. To our knowledge, this is one of the first attempts to apply this early-fusion single-stack recipe directly to competitive document OCR at this scale.
|
|
|
|
|
|
|