
	---
	pipeline_tag: image-to-text
	library_name: onnxruntime
	tags:
	- falcon
	- ocr
	- vision-language
	- document-understanding
	license: apache-2.0
	---


	# ONNX model for [Falcon-OCR](https://huggingface.co/tiiuae/Falcon-OCR)

	## Try with [ningpp/flux](https://github.com/ningpp/flux)

	Flux is a Java-based OCR project.



	# Falcon OCR

	Falcon OCR is a 300M-parameter early-fusion vision-language model for document OCR. Given an image, it can produce plain text, LaTeX for formulas, or HTML for tables, depending on the requested output format.

	Most OCR VLM systems are built as a pipeline with a vision encoder feeding a separate text decoder, plus additional task-specific glue. Falcon OCR takes a different approach: a single Transformer processes image patches and text tokens in a shared parameter space from the first layer, using a hybrid attention mask where image tokens attend bidirectionally and text tokens decode causally conditioned on the image.
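
The hybrid attention mask described above can be illustrated with a small sketch. This is not the model's actual implementation: the function names `hybrid_attention_mask` and `render` are hypothetical, and the layout (all image tokens first, followed by all text tokens) is an assumption made to keep the example simple. Image positions attend to every image position in both directions, while each text position attends to all image positions and to text positions up to and including itself.

```python
def hybrid_attention_mask(num_image_tokens, num_text_tokens):
    """Return mask[q][k] == True where query position q may attend to key position k.

    Assumed layout (not confirmed by the model card): image tokens come first,
    followed by text tokens.
    """
    total = num_image_tokens + num_text_tokens
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < num_image_tokens:
                # Image query: bidirectional attention, restricted to image tokens.
                mask[q][k] = k < num_image_tokens
            else:
                # Text query: causal attention. Because image tokens precede
                # the text, k <= q covers every image token plus earlier text.
                mask[q][k] = k <= q
    return mask


def render(mask):
    """Render the mask as rows of 'x' (may attend) and '.' (masked out)."""
    return "\n".join(
        "".join("x" if allowed else "." for allowed in row) for row in mask
    )


if __name__ == "__main__":
    # 3 image tokens followed by 3 text tokens.
    print(render(hybrid_attention_mask(3, 3)))
```

With 3 image tokens and 3 text tokens, the first three rows show a full bidirectional block over the image columns, and the last three rows show the image columns fully visible with a lower-triangular (causal) pattern over the text columns.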

	We built it this way for two practical reasons. First, it keeps the interface simple: one backbone, one decoding path, and task switching through prompts rather than a growing set of modules. Second, a 0.3B model has a lower latency and cost footprint than 0.9B-class OCR VLMs, and in our vLLM-based serving setup this translates into higher throughput, often 2–3× that of the larger models, depending on sequence lengths and batch configuration. To our knowledge, this is one of the first attempts to apply this early-fusion single-stack recipe directly to competitive document OCR at this scale.