Model Selection Notes

Date checked: 2026-04-04

Task:

  • OCR from photos of passports, insurance documents, and similar IDs
  • Russian, Uzbek, Tatar, and mixed-language documents
  • Reliable field extraction into JSON
  • Target size around 2B; slightly larger acceptable
  • Must be usable on Hugging Face Spaces with CPU and ZeroGPU
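
To make the "field extraction into JSON" requirement concrete, here is a minimal sketch of the kind of post-processing the app could do. The field names (`surname`, `given_names`, etc.) are hypothetical illustrations, not a schema the Space actually ships:

```python
import json

# Hypothetical target schema for passport-style extraction; the exact
# field names are an assumption, not taken from the Space's code.
REQUIRED_FIELDS = ["surname", "given_names", "document_number", "date_of_birth"]

def validate_extraction(raw: str) -> dict:
    """Parse model output as JSON and make every required field explicit."""
    record = json.loads(raw)
    for field in REQUIRED_FIELDS:
        record.setdefault(field, None)  # absent fields become explicit nulls
    return record

example = '{"surname": "IVANOV", "given_names": "ALISHER", "document_number": "AB1234567"}'
record = validate_extraction(example)
print(record["date_of_birth"])  # -> None (field was missing in the model output)
```

Normalizing missing fields to explicit nulls keeps the downstream JSON shape stable regardless of which document type or model produced the output.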

Recommendation

Best overall model for this task:

  • Qwen/Qwen2.5-VL-3B-Instruct

Best CPU-oriented fallback:

  • Qwen/Qwen3-VL-2B-Instruct

Additional experimental option:

  • Qwen/Qwen3.5-4B

Why Qwen2.5-VL-3B-Instruct

The official model card explicitly says it supports generating structured outputs for scans of invoices, forms, and tables. That is much closer to passport and insurance extraction than generic OCR.

Source:

This makes it the strongest fit for key information extraction (KIE) in JSON, not just raw text transcription.
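
Even with a model advertised for structured outputs, the reply still needs defensive parsing before it becomes JSON. A minimal sketch, assuming the model sometimes wraps its JSON in a markdown fence (common in practice, but not guaranteed by the model card):

```python
import json
import re

def extract_json(model_output: str) -> dict:
    """Pull the first JSON object out of a VLM reply, tolerating ```json fences."""
    # Strip an optional markdown code fence around the payload.
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", model_output, re.DOTALL)
    payload = fenced.group(1) if fenced else model_output
    return json.loads(payload)

reply = 'Here are the fields:\n```json\n{"surname": "IVANOV", "number": "AB1234567"}\n```'
print(extract_json(reply)["surname"])  # -> IVANOV
```

If parsing fails entirely, the app can fall back to returning the raw transcription instead of a broken JSON payload.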

Why keep Qwen3-VL-2B-Instruct in the Space

The official card says it expands OCR support to 32 languages and improves long-document structure parsing. That is directly relevant for Russian / Uzbek / Tatar documents and mixed-script records.

Source:

It is also the safest model in this range for CPU-only fallback because it stays close to the requested size.

Why not choose Qwen2-VL-2B-Instruct as primary

The official card still reports strong results, including on DocVQA, but the model is an older base than both Qwen2.5-VL and Qwen3-VL.

Source:

For this document task, the newer families are better aligned with structured extraction and multilingual OCR improvements.

Why not choose JackChew/Qwen2-VL-2B-OCR

Its card emphasizes complete text extraction from documents, payslips, invoices, and tables. That is useful for raw OCR, but the positioning is still closer to "extract all text" than "extract normalized KIE fields into strict JSON".

Source:

I included it in the Space for comparison, but not as the main choice.

Why not choose prithivMLmods/Qwen2-VL-OCR-2B-Instruct

Its card explicitly says the fine-tune is tailored for OCR, image-to-text, and math / LaTeX formatting. The listed training datasets are LaTeX OCR datasets, which is a weak fit for passport and insurance-document extraction.

Source:

That makes it less convincing for multilingual KIE on civil documents.

About Qwen3.5

I checked Qwen3.5 as requested. As of 2026-04-04, the practical official option I found is Qwen/Qwen3.5-4B. I added it to the Space as an experimental choice. I still do not recommend it as the default because it is above the requested size and is a weaker fit for lightweight OCR/KIE serving than Qwen2.5-VL-3B-Instruct or Qwen3-VL-2B-Instruct.

Source:

Final deployment choice

  • Default best-quality model in the app: Qwen/Qwen2.5-VL-3B-Instruct
  • Default CPU-safe fallback in the app: Qwen/Qwen3-VL-2B-Instruct
  • Additional experimental option in the app: Qwen/Qwen3.5-4B
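
The three choices above could be wired into the app as a simple registry; this is an illustrative sketch with hypothetical role names, not the Space's actual code (only the repo IDs come from the notes above):

```python
# Role -> repo ID mapping matching the deployment choices in these notes.
# The role keys ("default", "cpu_fallback", "experimental") are assumptions.
MODEL_REGISTRY = {
    "default": "Qwen/Qwen2.5-VL-3B-Instruct",
    "cpu_fallback": "Qwen/Qwen3-VL-2B-Instruct",
    "experimental": "Qwen/Qwen3.5-4B",
}

def resolve(role: str) -> str:
    """Look up a repo ID by role, falling back to the default model."""
    return MODEL_REGISTRY.get(role, MODEL_REGISTRY["default"])

print(resolve("cpu_fallback"))  # -> Qwen/Qwen3-VL-2B-Instruct
```

Centralizing the repo IDs in one place makes it easy to swap the CPU fallback or retire the experimental option without touching the inference code.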