license: apache-2.0
language:
  - ja
  - en
base_model:
  - Qwen/Qwen3.5-2B
library_name: transformers
pipeline_tag: image-text-to-text
tags:
  - ocr
  - document-ai
  - vision-language
  - qwen3_5
  - multimodal
  - japanese

Qwen3.5-OCR-JP-2B

Qwen3.5-OCR-JP-2B is a Japanese/English vision-language OCR model built on Qwen3.5-2B. Its output schema is compatible with Chandra OCR 2 (datalab-to/chandra): HTML layout blocks with bounding boxes and labels.

Focus

Training data emphasizes the following Japanese document features:

  • Ruby annotations, emitted as HTML5 ruby markup, e.g. <ruby>漢字<rt>かんじ</rt></ruby>
  • Japanese handwriting, vertical writing
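
Downstream code often needs either the plain base text or the (base, reading) pairs from the ruby markup. The following is a minimal sketch of both, assuming the simple `<ruby>base<rt>reading</rt></ruby>` form shown above with no nested tags and no `<rp>` fallback elements; `extract_ruby` and `strip_ruby` are hypothetical helper names, not part of this model's tooling.

```python
import re

# Matches simple <ruby>base<rt>reading</rt></ruby> spans.
# Assumption: no nested tags and no <rp> elements inside the ruby span.
RUBY_RE = re.compile(r"<ruby>(.*?)<rt>(.*?)</rt></ruby>", re.DOTALL)


def extract_ruby(html: str) -> list[tuple[str, str]]:
    """Return (base text, reading) pairs found in the HTML string."""
    return RUBY_RE.findall(html)


def strip_ruby(html: str) -> str:
    """Replace each ruby span with its base text, dropping the reading."""
    return RUBY_RE.sub(r"\1", html)


sample = "<p><ruby>漢字<rt>かんじ</rt></ruby>を読む</p>"
print(extract_ruby(sample))  # [('漢字', 'かんじ')]
print(strip_ruby(sample))    # <p>漢字を読む</p>
```

A regular expression is enough for this flat form; output with nested markup inside a ruby span would need a real HTML parser instead.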

Quickstart

vLLM (recommended)

import base64, io
from PIL import Image
from vllm import LLM, SamplingParams

PROMPT = "OCR this image as HTML layout blocks with bbox and label."

llm = LLM(
    model="ebinan92/Qwen3.5-ocr-jp-2b",
    dtype="bfloat16",
    max_model_len=12288,
    limit_mm_per_prompt={"image": 1},
    trust_remote_code=True,
)
sampling = SamplingParams(temperature=0.0, top_p=0.1, max_tokens=8000)

image = Image.open("page.png").convert("RGB")
buf = io.BytesIO()
image.save(buf, format="PNG")
b64 = base64.b64encode(buf.getvalue()).decode()

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        {"type": "text", "text": PROMPT},
    ],
}]
print(llm.chat(messages, sampling_params=sampling)[0].outputs[0].text)

Requires vllm>=0.19.1 and transformers>=5.5.1.

transformers

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

PROMPT = "OCR this image as HTML layout blocks with bbox and label."

ckpt = "ebinan92/Qwen3.5-ocr-jp-2b"
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModelForImageTextToText.from_pretrained(
    ckpt, dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("page.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": PROMPT},
    ],
}]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=8000, do_sample=False)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
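
Both snippets print an HTML string made of layout blocks. The sketch below shows one way to turn that string into a list of dictionaries using only the standard library. It assumes each block is an element carrying a `data-bbox` attribute with four space-separated integers and a `data-label` attribute. Those attribute names and the coordinate format are assumptions and are not confirmed by this card, so check them against real output and adjust the `bbox_attr` / `label_attr` arguments if needed. `BlockParser` and `parse_blocks` are hypothetical names.

```python
from html.parser import HTMLParser


class BlockParser(HTMLParser):
    """Collect elements that carry both a bbox and a label attribute."""

    def __init__(self, bbox_attr="data-bbox", label_attr="data-label"):
        super().__init__()
        self.bbox_attr = bbox_attr    # assumed attribute name
        self.label_attr = label_attr  # assumed attribute name
        self.blocks = []
        self._tag = None   # tag name of the block currently open, if any
        self._depth = 0    # nesting depth of same-named tags inside the block

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if self._tag is None:
            bbox_raw = attr.get(self.bbox_attr)
            label = attr.get(self.label_attr)
            if bbox_raw and label:
                # Assumption: bbox is "x0 y0 x1 y1" as integers.
                bbox = [int(v) for v in bbox_raw.split()]
                self.blocks.append({"label": label, "bbox": bbox, "text": ""})
                self._tag, self._depth = tag, 1
        elif tag == self._tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self._tag = None

    def handle_data(self, data):
        # Accumulate visible text only while inside a block.
        if self._tag is not None:
            self.blocks[-1]["text"] += data


def parse_blocks(html: str) -> list[dict]:
    parser = BlockParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


# Hand-written sample in the assumed format, not actual model output.
sample_html = (
    '<div data-bbox="10 20 300 60" data-label="Section-Header"><h1>見出し</h1></div>'
    '<div data-bbox="10 80 300 200" data-label="Text"><p>本文です。</p></div>'
)
for block in parse_blocks(sample_html):
    print(block["label"], block["bbox"], block["text"])
```

Tracking depth only for the block's own tag name keeps void elements such as `<br>` from unbalancing the counter.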

Benchmarks

| Benchmark | Metric | chandra-ocr-2 | Qwen3.5-ocr-jp-2b | sarashina2.2-ocr |
|---|---|---|---|---|
| olmOCR-bench | Accuracy ↑ | 85.9† | 82.8 | - |
| VJRODa※ | CER % ↓ | 7.2 | 7.3 | 12.0 |
| VJRODa※ | BLEU ↑ | 94.2 | 94.6 | 91.4 |
| JaWildText | CER % ↓ | 7.68 | 6.33 | 47.78 |

The olmOCR-bench overall score for sarashina2.2-ocr is omitted because its HF card does not report the baseline row.

※ VJRODa is evaluated on 92 / 100 samples (8 PDFs are NDL WARP-restricted and unavailable).
† olmOCR-bench score for chandra-ocr-2 is taken from the official HF card.
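
For readers unfamiliar with the CER metric in the table, here is a minimal sketch of the standard definition: character-level Levenshtein distance divided by the reference length, expressed as a percentage. This is not the evaluation script behind the numbers above; benchmark harnesses commonly apply text normalization (whitespace, Unicode forms) before scoring, and this card does not state which steps were used. `edit_distance` and `cer` are hypothetical helper names.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# One substituted character out of six reference characters.
print(round(cer("日本語の文書", "日本語の文章"), 2))  # 16.67
```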

olmOCR-bench JSONL breakdown
| JSONL | chandra-ocr-2† | Qwen3.5-ocr-jp-2b |
|---|---|---|
| arxiv_math | 90.2 | 85.7 |
| table_tests | 89.9 | 88.1 |
| baseline | 99.6 | 99.1 |
| headers_footers | 92.5 | 90.3 |
| old_scans_math | 89.3 | 81.9 |
| long_tiny_text | 92.1 | 92.3 |
| multi_column | 83.5 | 79.6 |
| old_scans | 49.8 | 45.4 |

Limitations

  • Works only with the single fixed prompt above. It is not tuned for other tasks or free-form instructions.
  • Trained primarily on Japanese and English. Coverage of other languages (Chinese, Korean, etc.) is incidental.

License

Apache 2.0.

This model is derived from Qwen3.5-2B and was trained on independently constructed datasets. No outputs or weights from datalab-to/chandra-ocr-2 (or any other Chandra release) were used.

Acknowledgements