---
license: apache-2.0
language:
- ja
- en
base_model:
- Qwen/Qwen3.5-2B
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- ocr
- document-ai
- vision-language
- qwen3_5
- multimodal
- japanese
---

# Qwen3.5-OCR-JP-2B

**Qwen3.5-OCR-JP-2B** is a Japanese/English vision-language OCR model built on Qwen3.5-2B. Its output schema is compatible with [Chandra OCR 2 (datalab-to/chandra)](https://github.com/datalab-to/chandra): HTML layout blocks with bounding boxes and labels.

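
As a rough illustration of consuming this kind of output, the sketch below walks an HTML string with Python's standard `html.parser` and collects each block's label, bounding box, and text. The attribute names `data-bbox` and `data-label`, the space-separated four-integer box format, and the sample string are assumptions made for illustration; check real model output (or the Chandra repository) for the exact schema before relying on them.

```python
from html.parser import HTMLParser


class LayoutBlockParser(HTMLParser):
    """Collect outermost <div> blocks that carry bbox/label attributes.

    Assumes (hypothetically) that blocks look like:
    <div data-bbox="x0 y0 x1 y1" data-label="Text">...</div>
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []      # finished blocks
        self._current = None  # block currently being read, if any
        self._depth = 0       # <div> nesting depth inside the current block

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is None:
            if tag == "div" and "data-bbox" in attrs:
                bbox = [int(v) for v in attrs["data-bbox"].split()]
                self._current = {"label": attrs.get("data-label"), "bbox": bbox, "text": ""}
                self._depth = 1
        elif tag == "div":
            self._depth += 1

    def handle_endtag(self, tag):
        if self._current is not None and tag == "div":
            self._depth -= 1
            if self._depth == 0:
                self._current["text"] = self._current["text"].strip()
                self.blocks.append(self._current)
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data


def parse_layout_blocks(html):
    """Return a list of {"label", "bbox", "text"} dicts in document order."""
    parser = LayoutBlockParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


# Made-up sample in the assumed format, not real model output.
sample = (
    '<div data-bbox="10 20 300 60" data-label="Section-Header"><h1>Report</h1></div>'
    '<div data-bbox="10 80 300 200" data-label="Text"><p>Hello <b>world</b></p></div>'
)
for block in parse_layout_blocks(sample):
    print(block["label"], block["bbox"], block["text"])
```

Keeping the parser tolerant of nested tags matters because a block's inner HTML may contain headings, tables, or ruby markup.
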
## Focus

Training data emphasizes the following Japanese document features:

- Ruby annotations, emitted as HTML5 ruby markup, e.g. `<ruby>漢字<rt>かんじ</rt></ruby>`
- Japanese handwriting and vertical writing

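
For downstream text processing it is often useful to separate base text from its readings. The following is a minimal sketch using regular expressions; it assumes simple, non-nested `<ruby>base<rt>reading</rt></ruby>` markup like the example above, and the helper names are hypothetical. Markup with `<rp>` fallbacks or several `<rt>` segments would need a real HTML parser.

```python
import re

# Matches simple, non-nested ruby markup: <ruby>base<rt>reading</rt></ruby>
RUBY_RE = re.compile(r"<ruby>(.*?)<rt>(.*?)</rt></ruby>", re.DOTALL)


def extract_ruby_pairs(html):
    """Return (base, reading) tuples in document order."""
    return RUBY_RE.findall(html)


def strip_ruby(html):
    """Drop the readings and keep only the base text."""
    return RUBY_RE.sub(r"\1", html)


text = "<ruby>漢字<rt>かんじ</rt></ruby>を<ruby>読<rt>よ</rt></ruby>む"
print(extract_ruby_pairs(text))  # [('漢字', 'かんじ'), ('読', 'よ')]
print(strip_ruby(text))          # 漢字を読む
```
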
## Quickstart

### vLLM (recommended)

```python
import base64
import io

from PIL import Image
from vllm import LLM, SamplingParams

PROMPT = "OCR this image as HTML layout blocks with bbox and label."

llm = LLM(
    model="ebinan92/Qwen3.5-ocr-jp-2b",
    dtype="bfloat16",
    max_model_len=12288,
    limit_mm_per_prompt={"image": 1},
    trust_remote_code=True,
)
sampling = SamplingParams(temperature=0.0, top_p=0.1, max_tokens=8000)

# Encode the page as a base64 PNG so it can be passed as a data URL.
image = Image.open("page.png").convert("RGB")
buf = io.BytesIO()
image.save(buf, format="PNG")
b64 = base64.b64encode(buf.getvalue()).decode()

# One user turn: the image followed by the fixed OCR prompt.
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        {"type": "text", "text": PROMPT},
    ],
}]

print(llm.chat(messages, sampling_params=sampling)[0].outputs[0].text)
```

Requires `vllm>=0.19.1` and `transformers>=5.5.1`.

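
To catch a version mismatch before loading the model, a small check along these lines can help. This is only a sketch: `release_tuple` and `check_requirements` are hypothetical helper names, and the comparison looks only at the numeric release part of a version string, so a pre-release such as `5.5.1rc1` is treated as `5.5.1`.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions as stated in this model card.
REQUIRED = {"vllm": "0.19.1", "transformers": "5.5.1"}


def release_tuple(ver):
    """Turn '5.5.1.dev0' into (5, 5, 1); any pre/post-release suffix is ignored."""
    match = re.match(r"\d+(?:\.\d+)*", ver)
    return tuple(int(p) for p in match.group().split(".")) if match else ()


def check_requirements(required=REQUIRED):
    """Return {package: problem} for packages that are missing or too old."""
    problems = {}
    for name, minimum in required.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if release_tuple(installed) < release_tuple(minimum):
            problems[name] = f"found {installed}, need >={minimum}"
    return problems


print(check_requirements() or "requirements satisfied")
```
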
### transformers

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

PROMPT = "OCR this image as HTML layout blocks with bbox and label."
ckpt = "ebinan92/Qwen3.5-ocr-jp-2b"

processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModelForImageTextToText.from_pretrained(
    ckpt, dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("page.png").convert("RGB")

# One user turn: the image followed by the fixed OCR prompt.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": PROMPT},
    ],
}]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Greedy decoding; strip the prompt tokens before decoding the output.
out = model.generate(**inputs, max_new_tokens=8000, do_sample=False)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```

## Benchmarks

| Benchmark | Metric | chandra-ocr-2 | Qwen3.5-ocr-jp-2b | sarashina2.2-ocr |
|---|---|---|---|---|
| [olmOCR-bench](https://huggingface.co/datasets/allenai/olmOCR-bench) | Accuracy ↑ | **85.9**<sup>†</sup> | 82.8 | n/a |
| [VJRODa](https://gitlab.llm-jp.nii.ac.jp/datasets/vjroda)<sup>※</sup> | CER % ↓ | **7.2** | 7.3 | 12.0 |
| [VJRODa](https://gitlab.llm-jp.nii.ac.jp/datasets/vjroda)<sup>※</sup> | BLEU ↑ | 94.2 | **94.6** | 91.4 |
| [JaWildText](https://huggingface.co/datasets/llm-jp/jawildtext) | CER % ↓ | 7.68 | **6.33** | 47.78 |

sarashina2.2-ocr's overall olmOCR-bench score is omitted because its [HF card](https://huggingface.co/sbintuitions/sarashina2.2-ocr) does not report the `baseline` row.

<sup>※</sup> VJRODa is evaluated on 92 of 100 samples (8 PDFs are NDL WARP-restricted and unavailable).

<sup>†</sup> The olmOCR-bench score for chandra-ocr-2 is taken from the official [HF card](https://huggingface.co/datalab-to/chandra-ocr-2).

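
For readers unfamiliar with the metric, character error rate (CER) is commonly defined as the character-level edit distance between a prediction and its reference, divided by the reference length. The sketch below implements that common definition in plain Python. The text normalization and aggregation actually used for the numbers above are not stated here, so treat this as illustrative and not as the evaluation script.

```python
def levenshtein(a, b):
    """Edit distance (insert/delete/substitute, each cost 1) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def cer_percent(prediction, reference):
    """CER in percent: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return 100.0 * levenshtein(prediction, reference) / len(reference)


# One substitution plus one missing character against a 6-character reference.
print(cer_percent("東京都港区", "東京都渋谷区"))
```

Because the denominator is the reference length, CER can exceed 100% when a prediction contains many spurious characters.
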
<details>
<summary>olmOCR-bench JSONL breakdown</summary>

| JSONL | chandra-ocr-2<sup>†</sup> | Qwen3.5-ocr-jp-2b |
|---|---|---|
| arxiv_math | **90.2** | 85.7 |
| table_tests | **89.9** | 88.1 |
| baseline | **99.6** | 99.1 |
| headers_footers | **92.5** | 90.3 |
| old_scans_math | **89.3** | 81.9 |
| long_tiny_text | 92.1 | **92.3** |
| multi_column | **83.5** | 79.6 |
| old_scans | **49.8** | 45.4 |

</details>

## Limitations

- Works only with the single fixed prompt shown in the Quickstart. It is not tuned for other tasks or free-form instructions.
- Trained primarily on Japanese and English. Coverage of other languages (Chinese, Korean, etc.) is incidental.

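
Because the model expects exactly this prompt, one defensive pattern is to keep it in a single constant and build every request through one helper. A minimal sketch follows; `build_ocr_messages` is a hypothetical helper name, and the message layout simply mirrors the vLLM quickstart.

```python
import base64

# The fixed prompt from the Quickstart; do not paraphrase it.
PROMPT = "OCR this image as HTML layout blocks with bbox and label."


def build_ocr_messages(image_bytes, mime="image/png"):
    """Build a single-turn chat request: one image plus the fixed OCR prompt."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
            {"type": "text", "text": PROMPT},
        ],
    }]


# Placeholder bytes for illustration; pass real PNG/JPEG file contents in practice.
messages = build_ocr_messages(b"not-a-real-image")
print(messages[0]["content"][1]["text"])
```
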
## License

Apache 2.0.

This model is derived from Qwen3.5-2B and trained on independently constructed datasets. No outputs or weights from `datalab-to/chandra-ocr-2` (or any other Chandra release) were used.

## Acknowledgements

- [Qwen3.5-2B](https://huggingface.co/Qwen/Qwen3.5-2B): base model (Apache 2.0)
- [Chandra](https://github.com/datalab-to/chandra): format reference