Qwen3.5-OCR-JP-2B

Qwen3.5-OCR-JP-2B is a Japanese/English vision-language OCR model built on top of Qwen3.5-2B. Its output schema is compatible with Chandra OCR 2 (datalab-to/chandra): HTML layout blocks with bounding boxes and labels.
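
For illustration only, a recognized region comes back roughly like the block below. The tag, class, and attribute names here are placeholders, not Chandra's exact schema; see datalab-to/chandra for the authoritative format.

<!-- hypothetical example; attribute names are placeholders -->
<div class="text-block" data-label="Text" data-bbox="102,340,880,412">
  吾輩は猫である。名前はまだ無い。
</div>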

Focus

Training data emphasizes the following Japanese document features:

  • Ruby annotations: emitted as HTML5 ruby markup, e.g. <ruby>漢字<rt>かんじ</rt></ruby> (a post-processing sketch follows this list)
  • Japanese handwriting and vertical (tategaki) text
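
If you need plain text or (base, reading) pairs downstream, the ruby markup is easy to post-process. A minimal sketch; the sample string is illustrative, and a real HTML parser is more robust on full pages:

import re

# Illustrative fragment in the model's ruby style (not verbatim model output)
html = "<ruby>漢字<rt>かんじ</rt></ruby>の<ruby>読<rt>よ</rt></ruby>み"

# Extract (base text, reading) pairs
pairs = re.findall(r"<ruby>(.*?)<rt>(.*?)</rt></ruby>", html)
print(pairs)  # [('漢字', 'かんじ'), ('読', 'よ')]

# Plain text with the readings stripped
plain = re.sub(r"<rt>.*?</rt>|</?ruby>", "", html)
print(plain)  # 漢字の読み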

Quickstart

vLLM (recommended)

import base64, io
from PIL import Image
from vllm import LLM, SamplingParams

PROMPT = "OCR this image as HTML layout blocks with bbox and label."

llm = LLM(
    model="ebinan92/Qwen3.5-ocr-jp-2b",
    dtype="bfloat16",
    max_model_len=12288,
    limit_mm_per_prompt={"image": 1},
    trust_remote_code=True,
)
# Greedy decoding; OCR output should be deterministic
sampling = SamplingParams(temperature=0.0, top_p=0.1, max_tokens=8000)

# Encode the page as a base64 PNG data URL (OpenAI-style multimodal input)
image = Image.open("page.png").convert("RGB")
buf = io.BytesIO()
image.save(buf, format="PNG")
b64 = base64.b64encode(buf.getvalue()).decode()

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        {"type": "text", "text": PROMPT},
    ],
}]
print(llm.chat(messages, sampling_params=sampling)[0].outputs[0].text)

Requires vllm>=0.19.1 and transformers>=5.5.1.
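
For multi-page documents, llm.chat also accepts a list of conversations, so all pages can be OCR'd in one batched call. A minimal sketch reusing PROMPT, llm, and sampling from above; the page_*.png paths are an assumption (render your PDF to page images first):

import base64, glob

def page_conversation(path):
    # Build one OpenAI-style conversation per page image
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            {"type": "text", "text": PROMPT},
        ],
    }]

conversations = [page_conversation(p) for p in sorted(glob.glob("page_*.png"))]
for i, out in enumerate(llm.chat(conversations, sampling_params=sampling), 1):
    print(f"<!-- page {i} -->")
    print(out.outputs[0].text)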

transformers

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

PROMPT = "OCR this image as HTML layout blocks with bbox and label."

ckpt = "ebinan92/Qwen3.5-ocr-jp-2b"
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModelForImageTextToText.from_pretrained(
    ckpt, dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("page.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": PROMPT},
    ],
}]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Greedy decoding, mirroring the vLLM settings above
out = model.generate(**inputs, max_new_tokens=8000, do_sample=False)
# Decode only the newly generated tokens, dropping the prompt
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
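
For long pages it can help to watch the result stream during generation. A minimal sketch with transformers' TextStreamer, assuming the processor exposes its tokenizer as processor.tokenizer (Qwen-style processors do):

from transformers import TextStreamer

# Prints decoded text as tokens arrive; skip_prompt hides the input side
streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, max_new_tokens=8000, do_sample=False, streamer=streamer)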

Benchmarks

| Benchmark    | Metric     | chandra-ocr-2 | Qwen3.5-ocr-jp-2b | sarashina2.2-ocr |
|--------------|------------|---------------|-------------------|------------------|
| olmOCR-bench | Accuracy ↑ | 85.9†         | 82.8              | n/a              |
| VJRODa※      | CER % ↓    | 7.2           | 7.3               | 12.0             |
| VJRODa※      | BLEU ↑     | 94.2          | 94.6              | 91.4             |
| JaWildText   | CER % ↓    | 7.68          | 6.33              | 47.78            |

sarashina2.2-ocr's olmOCR-bench overall score is omitted because its HF card does not report the baseline row.

※ VJRODa is evaluated on 92 / 100 samples (8 PDFs are NDL WARP-restricted and unavailable).
† The olmOCR-bench score for chandra-ocr-2 is taken from its official HF card.
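
CER above is the character-level Levenshtein distance divided by the reference length (lower is better). A minimal sketch of the metric itself, not the benchmarks' exact evaluation harness:

def cer(ref: str, hyp: str) -> float:
    # Single-row Levenshtein DP over characters
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)

print(cer("漢字の読み", "漢字の詠み"))  # 0.2 (one substitution over five chars)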

olmOCR-bench JSONL breakdown

| JSONL           | chandra-ocr-2† | Qwen3.5-ocr-jp-2b |
|-----------------|----------------|-------------------|
| arxiv_math      | 90.2           | 85.7              |
| table_tests     | 89.9           | 88.1              |
| baseline        | 99.6           | 99.1              |
| headers_footers | 92.5           | 90.3              |
| old_scans_math  | 89.3           | 81.9              |
| long_tiny_text  | 92.1           | 92.3              |
| multi_column    | 83.5           | 79.6              |
| old_scans       | 49.8           | 45.4              |

Limitations

  • Works only with the single fixed prompt above. It is not tuned for other tasks or free-form instructions.
  • Trained primarily on Japanese and English. Coverage of other languages (Chinese, Korean, etc.) is incidental.

License

Apache 2.0.

This model is derived from Qwen3.5-2B and trained on independently constructed datasets. No outputs or weights from datalab-to/chandra-ocr-2 (or any other Chandra release) were used.
