Instructions to use ebinan92/Qwen3.5-ocr-jp-2b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use ebinan92/Qwen3.5-ocr-jp-2b with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="ebinan92/Qwen3.5-ocr-jp-2b")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("ebinan92/Qwen3.5-ocr-jp-2b")
model = AutoModelForImageTextToText.from_pretrained("ebinan92/Qwen3.5-ocr-jp-2b")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use ebinan92/Qwen3.5-ocr-jp-2b with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "ebinan92/Qwen3.5-ocr-jp-2b"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ebinan92/Qwen3.5-ocr-jp-2b",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'

Use Docker
docker model run hf.co/ebinan92/Qwen3.5-ocr-jp-2b
- SGLang
How to use ebinan92/Qwen3.5-ocr-jp-2b with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "ebinan92/Qwen3.5-ocr-jp-2b" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ebinan92/Qwen3.5-ocr-jp-2b",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'

Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "ebinan92/Qwen3.5-ocr-jp-2b" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ebinan92/Qwen3.5-ocr-jp-2b",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'

- Docker Model Runner
How to use ebinan92/Qwen3.5-ocr-jp-2b with Docker Model Runner:
docker model run hf.co/ebinan92/Qwen3.5-ocr-jp-2b
Qwen3.5-OCR-JP-2B
Qwen3.5-OCR-JP-2B is a Japanese/English vision-language OCR model built on top of Qwen3.5-2B. The output schema is compatible with Chandra OCR 2 (datalab-to/chandra): HTML layout blocks with bounding boxes and labels.
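Downstream code usually wants those blocks as structured data. The sketch below shows one way to pull label/bbox pairs out of the generated HTML with only the standard library; the attribute names (`data-label`, `data-bbox`) and the `parse_blocks` helper are illustrative assumptions, not the documented Chandra schema, so check the format reference linked under Acknowledgements before relying on them.

```python
# Minimal sketch: collect layout blocks from the generated HTML as plain dicts.
# ASSUMPTION: each block element carries its label and bounding box in
# `data-label` / `data-bbox` attributes. These names are illustrative only.
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "data-label" in attrs or "data-bbox" in attrs:
            self.blocks.append(
                {"tag": tag, "label": attrs.get("data-label"), "bbox": attrs.get("data-bbox")}
            )


def parse_blocks(html_text: str) -> list[dict]:
    """Return one dict per layout block found in the OCR output."""
    collector = BlockCollector()
    collector.feed(html_text)
    return collector.blocks
```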
Focus
Training data emphasizes the following Japanese document features:
- Ruby annotations: emitted as HTML5 ruby markup, e.g. <ruby>漢字<rt>かんじ</rt></ruby> (a stripping sketch follows after this list)
- Japanese handwriting and vertical writing
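If downstream processing only needs the base text, the readings can be removed from the generated markup. A minimal sketch assuming nothing beyond the HTML5 ruby structure shown above; the `strip_ruby` helper is illustrative, not part of the model's tooling:

```python
import re


def strip_ruby(html_text: str) -> str:
    """Remove <rt>/<rp> readings and unwrap <ruby> tags, keeping the base text.

    Example: <ruby>漢字<rt>かんじ</rt></ruby> -> 漢字
    """
    text = re.sub(r"<rt>.*?</rt>", "", html_text, flags=re.DOTALL)   # reading text
    text = re.sub(r"<rp>.*?</rp>", "", text, flags=re.DOTALL)        # fallback parentheses
    return re.sub(r"</?ruby>", "", text)                             # unwrap the wrapper


print(strip_ruby("<ruby>漢字<rt>かんじ</rt></ruby>を読む"))  # -> 漢字を読む
```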
Quickstart
vLLM (recommended)
import base64, io
from PIL import Image
from vllm import LLM, SamplingParams
PROMPT = "OCR this image as HTML layout blocks with bbox and label."
llm = LLM(
    model="ebinan92/Qwen3.5-ocr-jp-2b",
    dtype="bfloat16",
    max_model_len=12288,
    limit_mm_per_prompt={"image": 1},
    trust_remote_code=True,
)
sampling = SamplingParams(temperature=0.0, top_p=0.1, max_tokens=8000)
image = Image.open("page.png").convert("RGB")
buf = io.BytesIO()
image.save(buf, format="PNG")
b64 = base64.b64encode(buf.getvalue()).decode()
messages = [{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
{"type": "text", "text": PROMPT},
],
}]
print(llm.chat(messages, sampling_params=sampling)[0].outputs[0].text)
Requires vllm>=0.19.1 and transformers>=5.5.1.
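For serving instead of offline batch inference, the checkpoint can also sit behind vLLM's OpenAI-compatible server and be queried with the `openai` client. A minimal sketch, assuming the server was started with `vllm serve "ebinan92/Qwen3.5-ocr-jp-2b"` on the default port 8000 and that `page.png` is the page to OCR; the `api_key` value is a placeholder:

```python
import base64
from openai import OpenAI

# Assumes: vllm serve "ebinan92/Qwen3.5-ocr-jp-2b"   (default port 8000)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("page.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="ebinan92/Qwen3.5-ocr-jp-2b",
    temperature=0.0,
    max_tokens=8000,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            {"type": "text", "text": "OCR this image as HTML layout blocks with bbox and label."},
        ],
    }],
)
print(resp.choices[0].message.content)
```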
transformers
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText
PROMPT = "OCR this image as HTML layout blocks with bbox and label."
ckpt = "ebinan92/Qwen3.5-ocr-jp-2b"
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModelForImageTextToText.from_pretrained(
    ckpt, dtype=torch.bfloat16, device_map="auto"
)
image = Image.open("page.png").convert("RGB")
messages = [{
"role": "user",
"content": [
{"type": "image", "image": image},
{"type": "text", "text": PROMPT},
],
}]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=8000, do_sample=False)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
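Multi-page documents can reuse the objects above in a small loop. A minimal sketch, assuming `processor`, `model`, and `PROMPT` from the snippet above; the page file names and the `ocr_page` helper are illustrative:

```python
# Assumes `processor`, `model`, and `PROMPT` from the snippet above.
from PIL import Image


def ocr_page(image: Image.Image) -> str:
    """Run the fixed OCR prompt on one page image and return the HTML blocks."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": PROMPT},
        ],
    }]
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=8000, do_sample=False)
    return processor.batch_decode(
        out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0]


# Illustrative file names; render your PDF to page images first.
pages = [Image.open(p).convert("RGB") for p in ["page_001.png", "page_002.png"]]
html_pages = [ocr_page(page) for page in pages]
print("\n\n".join(html_pages))
```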
Benchmarks
| Benchmark | Metric | chandra-ocr-2 | Qwen3.5-ocr-jp-2b | sarashina2.2-ocr |
|---|---|---|---|---|
| olmOCR-bench | Accuracy ↑ | 85.9† | 82.8 | – |
| VJRODa※ | CER % ↓ | 7.2 | 7.3 | 12.0 |
| VJRODa※ | BLEU ↑ | 94.2 | 94.6 | 91.4 |
| JaWildText | CER % ↓ | 7.68 | 6.33 | 47.78 |
sarashina2.2-ocr's olmOCR-bench overall is omitted because its HF card does not report the baseline row.
※ VJRODa is evaluated on 92 / 100 samples (8 PDFs are NDL WARP-restricted and unavailable).
† olmOCR-bench score for chandra-ocr-2 is taken from the official HF card.
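For readers unfamiliar with the metric, CER is the character error rate: character-level edit distance between prediction and reference, divided by the reference length (lower is better). A minimal sketch of the standard definition; the actual benchmark scripts may apply extra normalization:

```python
def cer(pred: str, ref: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    # Standard dynamic-programming edit distance over characters.
    prev = list(range(len(ref) + 1))
    for i, p in enumerate(pred, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (p != r)))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)


print(f"{cer('漢宇を読む', '漢字を読む'):.2%}")  # one substitution over 5 chars -> 20.00%
```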
olmOCR-bench JSONL breakdown
| JSONL | chandra-ocr-2† | Qwen3.5-ocr-jp-2b |
|---|---|---|
| arxiv_math | 90.2 | 85.7 |
| table_tests | 89.9 | 88.1 |
| baseline | 99.6 | 99.1 |
| headers_footers | 92.5 | 90.3 |
| old_scans_math | 89.3 | 81.9 |
| long_tiny_text | 92.1 | 92.3 |
| multi_column | 83.5 | 79.6 |
| old_scans | 49.8 | 45.4 |
Limitations
- Works only with the single fixed prompt above. It is not tuned for other tasks or free-form instructions.
- Trained primarily on Japanese and English. Coverage of other languages (Chinese, Korean, etc.) is incidental.
License
Apache 2.0.
This model is derived from Qwen3.5-2B and trained on independently constructed datasets. No outputs or weights from datalab-to/chandra-ocr-2 (or any other Chandra release) were used.
Acknowledgements
- Qwen3.5-2B: base model (Apache 2.0)
- Chandra: output-format reference