---
license: apache-2.0
language:
- ja
- en
base_model:
- Qwen/Qwen3.5-2B
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- ocr
- document-ai
- vision-language
- qwen3_5
- multimodal
- japanese
---

# Qwen3.5-OCR-JP-2B

**Qwen3.5-OCR-JP-2B** is a Japanese/English vision-language OCR model built on top of Qwen3.5-2B. Its output schema is compatible with [Chandra OCR 2 (datalab-to/chandra)](https://github.com/datalab-to/chandra): HTML layout blocks with bounding boxes and labels.

## Focus

Training data emphasizes the following Japanese document features:

- Ruby annotations, emitted as HTML5 ruby markup, e.g. `<ruby>漢字<rt>かんじ</rt></ruby>`
- Japanese handwriting, vertical writing

## Quickstart

### vLLM (recommended)

```python
import base64, io
from PIL import Image
from vllm import LLM, SamplingParams

PROMPT = "OCR this image as HTML layout blocks with bbox and label."

llm = LLM(
    model="ebinan92/Qwen3.5-ocr-jp-2b",
    dtype="bfloat16",
    max_model_len=12288,
    limit_mm_per_prompt={"image": 1},
    trust_remote_code=True,
)
# temperature=0.0 selects greedy decoding
sampling = SamplingParams(temperature=0.0, top_p=0.1, max_tokens=8000)

# Encode the page image as a base64 PNG data URL
image = Image.open("page.png").convert("RGB")
buf = io.BytesIO()
image.save(buf, format="PNG")
b64 = base64.b64encode(buf.getvalue()).decode()

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        {"type": "text", "text": PROMPT},
    ],
}]
print(llm.chat(messages, sampling_params=sampling)[0].outputs[0].text)
```

Requires `vllm>=0.19.1` and `transformers>=5.5.1`.

### transformers

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

PROMPT = "OCR this image as HTML layout blocks with bbox and label."

ckpt = "ebinan92/Qwen3.5-ocr-jp-2b"
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModelForImageTextToText.from_pretrained(
    ckpt, dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("page.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": PROMPT},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=8000, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt tokens
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```

## Benchmarks

| Benchmark | Metric | chandra-ocr-2 | Qwen3.5-ocr-jp-2b | sarashina2.2-ocr |
|---|---|---|---|---|
| [olmOCR-bench](https://huggingface.co/datasets/allenai/olmOCR-bench) | Accuracy ↑ | **85.9** | 82.8 | n/a |
| [VJRODa](https://gitlab.llm-jp.nii.ac.jp/datasets/vjroda) | CER % ↓ | **7.2** | 7.3 | 12.0 |
| [VJRODa](https://gitlab.llm-jp.nii.ac.jp/datasets/vjroda) | BLEU ↑ | 94.2 | **94.6** | 91.4 |
| [JaWildText](https://huggingface.co/datasets/llm-jp/jawildtext) | CER % ↓ | 7.68 | **6.33** | 47.78 |

The olmOCR-bench overall score for sarashina2.2-ocr is omitted because its [HF card](https://huggingface.co/sbintuitions/sarashina2.2-ocr) does not report the `baseline` row. VJRODa is evaluated on 92 / 100 samples (8 PDFs are NDL WARP-restricted and unavailable). The olmOCR-bench score for chandra-ocr-2 is taken from the official [HF card](https://huggingface.co/datalab-to/chandra-ocr-2).
### olmOCR-bench JSONL breakdown

| JSONL | chandra-ocr-2 | Qwen3.5-ocr-jp-2b |
|---|---|---|
| arxiv_math | **90.2** | 85.7 |
| table_tests | **89.9** | 88.1 |
| baseline | **99.6** | 99.1 |
| headers_footers | **92.5** | 90.3 |
| old_scans_math | **89.3** | 81.9 |
| long_tiny_text | 92.1 | **92.3** |
| multi_column | **83.5** | 79.6 |
| old_scans | **49.8** | 45.4 |
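
For readers unfamiliar with the CER metric reported for VJRODa and JaWildText, the following is a minimal sketch of the common definition: character-level Levenshtein distance divided by the reference length, expressed as a percentage. This card does not describe the exact evaluation script, so the function names `edit_distance` and `cer_percent` are hypothetical, and any text normalization (whitespace, HTML stripping, Unicode forms) used for the reported numbers is not reproduced here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer_percent(ref: str, hyp: str) -> float:
    """Character error rate in percent (assumed definition, not the card's script)."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of seven reference characters
    print(round(cer_percent("吾輩は猫である", "吾輩は描である"), 2))
```

Because the denominator is the reference length, CER can exceed 100% when the hypothesis contains many spurious characters.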
## Limitations

- Works only with the single fixed prompt above. It is not tuned for other tasks or free-form instructions.
- Trained primarily on Japanese and English. Coverage of other languages (Chinese, Korean, etc.) is incidental.

## License

Apache 2.0. This model is derived from Qwen3.5-2B, trained on independently constructed datasets. No outputs or weights from `datalab-to/chandra-ocr-2` (or any other Chandra release) were used.

## Acknowledgements

- [Qwen3.5-2B](https://huggingface.co/Qwen/Qwen3.5-2B): base model (Apache 2.0)
- [Chandra](https://github.com/datalab-to/chandra): format reference