---
license: apache-2.0
language:
- ja
- en
base_model:
- Qwen/Qwen3.5-2B
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- ocr
- document-ai
- vision-language
- qwen3_5
- multimodal
- japanese
---
# Qwen3.5-OCR-JP-2B
**Qwen3.5-OCR-JP-2B** is a Japanese/English vision-language OCR model built on top of Qwen3.5-2B. Its output schema is compatible with [Chandra OCR 2 (datalab-to/chandra)](https://github.com/datalab-to/chandra): HTML layout blocks with bounding boxes and labels.
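Downstream code usually needs the layout blocks as structured records. The sketch below uses only the standard library and assumes that each block is a top-level `<div>` carrying `data-bbox` (four numbers) and `data-label` attributes. That attribute layout is an assumption modeled on Chandra-style output and is not specified in this card, so check it against real model output before relying on it. `BlockParser` and `parse_blocks` are hypothetical helper names.

```python
import re
from html.parser import HTMLParser


class BlockParser(HTMLParser):
    """Collect top-level <div data-bbox=... data-label=...> blocks.

    ASSUMPTION: the attribute names and the <div> wrapper are illustrative;
    verify them against actual model output.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._current = None  # block being filled, or None outside a block
        self._depth = 0       # <div> nesting depth inside the current block

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is None:
            bbox, label = attrs.get("data-bbox"), attrs.get("data-label")
            if tag == "div" and bbox and label is not None:
                numbers = re.findall(r"-?\d+(?:\.\d+)?", bbox)
                self._current = {
                    "label": label,
                    "bbox": [float(n) for n in numbers],
                    "text": "",
                }
                self._depth = 1
        elif tag == "div":
            self._depth += 1

    def handle_endtag(self, tag):
        if self._current is not None and tag == "div":
            self._depth -= 1
            if self._depth == 0:
                self._current["text"] = self._current["text"].strip()
                self.blocks.append(self._current)
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data


def parse_blocks(html):
    """Return a list of {"label", "bbox", "text"} dicts, in document order."""
    parser = BlockParser()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Calling `parse_blocks(text)` on the generated string returns the blocks in reading order. Inline markup inside a block is flattened to plain text, so ruby readings end up concatenated after their base text.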
## Focus
Training data emphasizes the following Japanese document features:
- Ruby annotations, emitted as HTML5 ruby markup, e.g. `<ruby>漢字<rt>かんじ</rt></ruby>`
- Japanese handwriting, vertical writing
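Because ruby is emitted as markup, consumers that want plain text (for search or for CER scoring, say) need to decide what to do with the readings. The following is a minimal sketch using the standard-library `html.parser`. It drops `<rt>` and `<rp>` content from the plain text and returns the (base, reading) pairs separately. `RubySplitter` and `split_ruby` are hypothetical names, not part of this model's tooling.

```python
from html.parser import HTMLParser


class RubySplitter(HTMLParser):
    """Separate base text from ruby readings in HTML5 ruby markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.plain = []       # text fragments with readings removed
        self.pairs = []       # one (base, reading) tuple per <ruby> element
        self._in_ruby = False
        self._skip = None     # "rt" or "rp" while inside that element
        self._base = []
        self._reading = []

    def handle_starttag(self, tag, attrs):
        if tag == "ruby":
            self._in_ruby, self._base, self._reading = True, [], []
        elif tag in ("rt", "rp"):
            self._skip = tag

    def handle_endtag(self, tag):
        if tag in ("rt", "rp"):
            self._skip = None
        elif tag == "ruby":
            self.pairs.append(("".join(self._base), "".join(self._reading)))
            self._in_ruby = False
            self._skip = None  # tolerate an omitted </rt>

    def handle_data(self, data):
        if self._skip == "rt":
            self._reading.append(data)
        elif self._skip is None:
            self.plain.append(data)
            if self._in_ruby:
                self._base.append(data)


def split_ruby(html):
    """Return (plain_text, [(base, reading), ...]) for an HTML fragment."""
    parser = RubySplitter()
    parser.feed(html)
    parser.close()
    return "".join(parser.plain), parser.pairs
```

For example, `split_ruby("<ruby>漢字<rt>かんじ</rt></ruby>を読む")` yields the plain text `漢字を読む` and the pair `("漢字", "かんじ")`.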
## Quickstart
### vLLM (recommended)
```python
import base64
import io

from PIL import Image
from vllm import LLM, SamplingParams

# The model is tuned for this single fixed prompt.
PROMPT = "OCR this image as HTML layout blocks with bbox and label."

llm = LLM(
    model="ebinan92/Qwen3.5-ocr-jp-2b",
    dtype="bfloat16",
    max_model_len=12288,
    limit_mm_per_prompt={"image": 1},  # one page image per request
    trust_remote_code=True,
)
# temperature=0.0 selects greedy decoding.
sampling = SamplingParams(temperature=0.0, top_p=0.1, max_tokens=8000)

# Encode the page as a base64 PNG data URL for the chat API.
image = Image.open("page.png").convert("RGB")
buf = io.BytesIO()
image.save(buf, format="PNG")
b64 = base64.b64encode(buf.getvalue()).decode()

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        {"type": "text", "text": PROMPT},
    ],
}]

# One request in, one completion out: print the generated HTML.
print(llm.chat(messages, sampling_params=sampling)[0].outputs[0].text)
```
Requires `vllm>=0.19.1` and `transformers>=5.5.1`.
### transformers
```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

# The model is tuned for this single fixed prompt.
PROMPT = "OCR this image as HTML layout blocks with bbox and label."

ckpt = "ebinan92/Qwen3.5-ocr-jp-2b"
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModelForImageTextToText.from_pretrained(
    ckpt, dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("page.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": PROMPT},
    ],
}]

# Apply the chat template and tokenize text and image in one call.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Greedy decoding; slice off the prompt tokens before decoding the output.
out = model.generate(**inputs, max_new_tokens=8000, do_sample=False)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```
## Benchmarks
| Benchmark | Metric | chandra-ocr-2 | Qwen3.5-ocr-jp-2b | sarashina2.2-ocr |
|---|---|---|---|---|
| [olmOCR-bench](https://huggingface.co/datasets/allenai/olmOCR-bench) | Accuracy ↑ | **85.9**<sup>†</sup> | 82.8 | n/a |
| [VJRODa](https://gitlab.llm-jp.nii.ac.jp/datasets/vjroda)<sup>※</sup> | CER % ↓ | **7.2** | 7.3 | 12.0 |
| [VJRODa](https://gitlab.llm-jp.nii.ac.jp/datasets/vjroda)<sup>※</sup> | BLEU ↑ | 94.2 | **94.6** | 91.4 |
| [JaWildText](https://huggingface.co/datasets/llm-jp/jawildtext) | CER % ↓ | 7.68 | **6.33** | 47.78 |
The olmOCR-bench overall score for sarashina2.2-ocr is omitted because its [HF card](https://huggingface.co/sbintuitions/sarashina2.2-ocr) does not report the `baseline` row.
<sup>※</sup> VJRODa is evaluated on 92 of 100 samples (8 PDFs are NDL WARP-restricted and unavailable).
<sup>†</sup> olmOCR-bench score for chandra-ocr-2 is taken from the official [HF card](https://huggingface.co/datalab-to/chandra-ocr-2).
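For readers unfamiliar with the CER column, here is a sketch of the usual definition: character-level edit distance divided by the reference length, expressed as a percentage. This is the textbook formula only. The card does not describe the text normalization (whitespace, Unicode forms, markup stripping) used by the benchmark harnesses, so this function is not guaranteed to reproduce the reported numbers. `levenshtein` and `cer_percent` are illustrative names.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute = 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def cer_percent(reference, hypothesis):
    """Character error rate in percent; lower is better."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return 100.0 * levenshtein(reference, hypothesis) / len(reference)
```

Dropping one character from a six-character reference, as in `cer_percent("東京都渋谷区", "東京都渋谷")`, gives one edit over six characters, roughly 16.7%.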
<details>
<summary>olmOCR-bench JSONL breakdown</summary>
| JSONL | chandra-ocr-2<sup>†</sup> | Qwen3.5-ocr-jp-2b |
|---|---|---|
| arxiv_math | **90.2** | 85.7 |
| table_tests | **89.9** | 88.1 |
| baseline | **99.6** | 99.1 |
| headers_footers | **92.5** | 90.3 |
| old_scans_math | **89.3** | 81.9 |
| long_tiny_text | 92.1 | **92.3** |
| multi_column | **83.5** | 79.6 |
| old_scans | **49.8** | 45.4 |
</details>
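The card does not state how the overall olmOCR-bench score is aggregated, but the reported figures are consistent with an unweighted mean of the eight per-JSONL scores in the breakdown. The check below recomputes both overall numbers from those rows. Treat the aggregation rule as an inference from the numbers rather than a documented fact.

```python
# Per-JSONL scores copied from the breakdown table.
scores = {
    "chandra-ocr-2": [90.2, 89.9, 99.6, 92.5, 89.3, 92.1, 83.5, 49.8],
    "Qwen3.5-ocr-jp-2b": [85.7, 88.1, 99.1, 90.3, 81.9, 92.3, 79.6, 45.4],
}


def overall(values):
    """Unweighted mean of the per-JSONL scores, rounded to one decimal."""
    return round(sum(values) / len(values), 1)


for name, values in scores.items():
    print(name, overall(values))
# chandra-ocr-2 85.9
# Qwen3.5-ocr-jp-2b 82.8
```

Under this reading, a model that lacks any one row has no comparable overall score.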
## Limitations
- Works only with the single fixed prompt above. It is not tuned for other tasks or free-form instructions.
- Trained primarily on Japanese and English. Coverage of other languages (Chinese, Korean, etc.) is incidental.
## License
Apache 2.0.
This model is derived from Qwen3.5-2B and was trained on independently constructed datasets. No outputs or weights from `datalab-to/chandra-ocr-2` (or any other Chandra release) were used.
## Acknowledgements
- [Qwen3.5-2B](https://huggingface.co/Qwen/Qwen3.5-2B): base model (Apache 2.0)
- [Chandra](https://github.com/datalab-to/chandra): format reference