Model Selection Notes

Date checked: 2026-04-04

Task:

  • OCR from photos of passports, insurance documents, and similar IDs
  • Russian, Uzbek, Tatar, and mixed-language documents
  • Reliable field extraction into JSON
  • Target size around 2B; slightly larger acceptable
  • Must be usable on Hugging Face Spaces with CPU and ZeroGPU
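
To make the "field extraction into JSON" requirement concrete, here is a minimal sketch of the kind of post-processing the app could do. The field names (`surname`, `given_names`, etc.) are hypothetical illustrations, not a schema the Space actually ships:

```python
import json

# Hypothetical target schema for passport-style extraction; the exact
# field names are an assumption, not taken from the Space's code.
REQUIRED_FIELDS = ["surname", "given_names", "document_number", "date_of_birth"]

def validate_extraction(raw: str) -> dict:
    """Parse model output as JSON and make every required field explicit."""
    record = json.loads(raw)
    for field in REQUIRED_FIELDS:
        record.setdefault(field, None)  # absent fields become explicit nulls
    return record

example = '{"surname": "IVANOV", "given_names": "ALISHER", "document_number": "AB1234567"}'
record = validate_extraction(example)
print(record["date_of_birth"])  # -> None (field was missing in the model output)
```

Normalizing missing fields to explicit nulls keeps the downstream JSON shape stable regardless of which document type or model produced the output.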

Recommendation

Best overall model for this task:

  • Qwen/Qwen2.5-VL-3B-Instruct

Best CPU-oriented fallback:

  • Qwen/Qwen3-VL-2B-Instruct

Additional experimental option:

  • Qwen/Qwen3.5-4B

Why Qwen2.5-VL-3B-Instruct

The official model card explicitly says it supports generating structured outputs for scans of invoices, forms, and tables. That is much closer to passport and insurance extraction than generic OCR.

Source:

This makes it the strongest fit for key information extraction (KIE) in JSON, not just raw text transcription.
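
Even with a model advertised for structured outputs, the reply still needs defensive parsing before it becomes JSON. A minimal sketch, assuming the model sometimes wraps its JSON in a markdown fence (common in practice, but not guaranteed by the model card):

```python
import json
import re

def extract_json(model_output: str) -> dict:
    """Pull the first JSON object out of a VLM reply, tolerating ```json fences."""
    # Strip an optional markdown code fence around the payload.
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", model_output, re.DOTALL)
    payload = fenced.group(1) if fenced else model_output
    return json.loads(payload)

reply = 'Here are the fields:\n```json\n{"surname": "IVANOV", "number": "AB1234567"}\n```'
print(extract_json(reply)["surname"])  # -> IVANOV
```

If parsing fails entirely, the app can fall back to returning the raw transcription instead of a broken JSON payload.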

Why keep Qwen3-VL-2B-Instruct in the Space

The official card says it expands OCR support to 32 languages and improves long-document structure parsing. That is directly relevant for Russian / Uzbek / Tatar documents and mixed-script records.

Source:

It is also the safest model in this range for CPU-only fallback because it stays close to the requested size.

Why not choose Qwen2-VL-2B-Instruct as primary

The official card still reports strong results, including on DocVQA, but the model is an older base than both Qwen2.5-VL and Qwen3-VL.

Source:

For this document task, the newer families are better aligned with structured extraction and multilingual OCR improvements.

Why not choose JackChew/Qwen2-VL-2B-OCR

Its card emphasizes complete text extraction from documents, payslips, invoices, and tables. That is useful for raw OCR, but the positioning is still closer to "extract all text" than "extract normalized KIE fields into strict JSON".

Source:

I included it in the Space for comparison, but not as the main choice.

Why not choose prithivMLmods/Qwen2-VL-OCR-2B-Instruct

Its card explicitly says the fine-tune is tailored for OCR, image-to-text, and math / LaTeX formatting. The listed training datasets are LaTeX OCR datasets, which is a weak fit for passport and insurance-document extraction.

Source:

That makes it less convincing for multilingual KIE on civil documents.

About Qwen3.5

I checked Qwen3.5 as requested. As of 2026-04-04, the practical official option I found is Qwen/Qwen3.5-4B. I added it to the Space as an experimental choice. I still do not recommend it as the default because it is above the requested size and is a weaker fit for lightweight OCR/KIE serving than Qwen2.5-VL-3B-Instruct or Qwen3-VL-2B-Instruct.

Source:

Final deployment choice

  • Default best-quality model in the app: Qwen/Qwen2.5-VL-3B-Instruct
  • Default CPU-safe fallback in the app: Qwen/Qwen3-VL-2B-Instruct
  • Additional experimental option in the app: Qwen/Qwen3.5-4B
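
The three choices above could be wired into the app as a simple registry; this is an illustrative sketch with hypothetical role names, not the Space's actual code (only the repo IDs come from the notes above):

```python
# Role -> repo ID mapping matching the deployment choices in these notes.
# The role keys ("default", "cpu_fallback", "experimental") are assumptions.
MODEL_REGISTRY = {
    "default": "Qwen/Qwen2.5-VL-3B-Instruct",
    "cpu_fallback": "Qwen/Qwen3-VL-2B-Instruct",
    "experimental": "Qwen/Qwen3.5-4B",
}

def resolve(role: str) -> str:
    """Look up a repo ID by role, falling back to the default model."""
    return MODEL_REGISTRY.get(role, MODEL_REGISTRY["default"])

print(resolve("cpu_fallback"))  # -> Qwen/Qwen3-VL-2B-Instruct
```

Centralizing the repo IDs in one place makes it easy to swap the CPU fallback or retire the experimental option without touching the inference code.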