Libraries

How to use EthannW/HunyuanOCR-1-5 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="EthannW/HunyuanOCR-1-5")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("EthannW/HunyuanOCR-1-5")
model = AutoModelForMultimodalLM.from_pretrained("EthannW/HunyuanOCR-1-5")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use EthannW/HunyuanOCR-1-5 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "EthannW/HunyuanOCR-1-5"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "EthannW/HunyuanOCR-1-5",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/EthannW/HunyuanOCR-1-5

SGLang

How to use EthannW/HunyuanOCR-1-5 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "EthannW/HunyuanOCR-1-5" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "EthannW/HunyuanOCR-1-5",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "EthannW/HunyuanOCR-1-5" \
        --host 0.0.0.0 \
        --port 30000

HunyuanOCR-1.5 · Preview

Towards Efficient and Effective E2E OCR

📝 Note. This is a preview release of HunyuanOCR-1.5 weights. The technical report and official weights are coming very soon; the checkpoint, file layout and interface here may still evolve before the final release. Training / inference toolkit and full documentation live in the GitHub repo (branch develop): https://github.com/Tencent-Hunyuan/HunyuanOCR.

📖 Introduction

HunyuanOCR-1.5 is a lightweight, end-to-end OCR-specialized vision-language model. It targets a broad range of text-centric visual tasks and unifies document parsing, text spotting, information extraction, and text-image translation within a single end-to-end VLM.

Building upon the validated lightweight architecture of HunyuanOCR-1.0, HunyuanOCR-1.5 does not redesign the backbone. Instead, it performs a systematic upgrade around two goals — making the model faster and better:

⚡ Faster — DFlash inference acceleration. A lightweight block-diffusion draft model drafts multiple candidate tokens in parallel, verified by the target model in a single pass, significantly reducing decoding latency of long structured OCR outputs (dense documents, tables, formulas) while preserving the target model's output distribution. Draft weights: EthannW/HunyuanOCR-1-5-DFlash.
💻 PC-side deployment via llama.cpp. Beyond server-grade vLLM, HunyuanOCR-1.5 also supports CPU / consumer-GPU / laptop deployment via llama.cpp with an OpenAI-compatible llama-server. A DFlash-adapted llama.cpp fork is also provided so the same speculative-decoding acceleration is available on PC.
🧠 Better — Agentic Data Flow + upgraded training recipe. An agent-driven data-construction system (Agentic Data Flow) translates model weaknesses into executable data requirements, targeting long-tail capabilities such as low-resource OCR, ancient-script OCR, and multi-image text-centric QA. Pretraining Stage-3 is re-planned with 4K resolution and a 128K context window; post-training refines SFT data and further explores RL across different OCR tasks.

Together, these upgrades give HunyuanOCR-1.5 both faster inference and broader OCR capability coverage while retaining the deployment advantages of a lightweight end-to-end model.
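The draft-then-verify loop behind speculative decoding can be illustrated with a toy sketch. This is not the DFlash implementation: the two "models" are plain Python functions invented for illustration, the draft runs token by token rather than as a parallel block-diffusion step, and only greedy decoding is covered (sampling needs an additional rejection-sampling rule). It shows why, under greedy decoding, the accepted output is identical to what the target model alone would produce.

```python
def target_next(seq):
    """Hypothetical target model: deterministic next token from the last token."""
    return (seq[-1] * 3 + 1) % 7


def draft_next(seq):
    """Hypothetical draft model: agrees with the target except at every third position."""
    tok = target_next(seq)
    return (tok + 1) % 7 if len(seq) % 3 == 0 else tok


def greedy_decode(prompt, max_new_tokens):
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(max_new_tokens):
        seq.append(target_next(seq))
    return seq


def speculative_greedy_decode(prompt, max_new_tokens, block_size=4):
    """Draft a block, verify it against the target, keep the agreeing prefix."""
    seq = list(prompt)
    verify_passes = 0
    while len(seq) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes block_size candidate tokens.
        draft = []
        for _ in range(block_size):
            draft.append(draft_next(seq + draft))

        # 2. The target checks every drafted position. A real engine scores all
        #    positions in one batched forward pass; this loop stands in for it.
        verify_passes += 1
        accepted = []
        for tok in draft:
            expected = target_next(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch and stop
                break
        else:
            # Every drafted token matched, so the target adds one bonus token.
            accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    return seq[: len(prompt) + max_new_tokens], verify_passes


baseline = greedy_decode([1], 12)
fast, passes = speculative_greedy_decode([1], 12)
print(fast == baseline, passes)
```

Each verification pass emits at least one token, so the worst case matches plain decoding in target passes, and a draft that agrees often lets each pass emit several tokens.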

⚙️ Environment

Python 3.10+ (3.12 tested)
PyTorch 2.1+ (CUDA 12.1+; a cu130 build has been tested end-to-end)
transformers ≥ 4.57 (ships HunYuanVLForConditionalGeneration + AutoProcessor for the HunyuanOCR-1.5 series)
vLLM nightly (0.23.x, cu130 build tested) — for OpenAI-compatible serving and (in the DFlash draft repo) speculative decoding
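A quick way to compare an existing environment against the minimum versions listed above is a small standard-library check. This is a sketch, not part of the HunyuanOCR toolkit: the helper names are made up here, the version parser is deliberately simple (it ignores pre-release and local suffixes such as `+cu130`), and vLLM is left out because the list names a nightly build rather than a minimum.

```python
import sys
from importlib import metadata

# Minimums taken from the list above; keys are PyPI distribution names.
MINIMUMS = {"torch": (2, 1), "transformers": (4, 57)}


def parse_version(text):
    """Turn '2.5.1+cu124' or '4.57.0.dev0' into a tuple of leading integers."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_environment(minimums=MINIMUMS):
    """Return {name: 'ok' | 'too old' | 'missing'} for Python and each package."""
    report = {"python": "ok" if sys.version_info >= (3, 10) else "too old"}
    for name, minimum in minimums.items():
        try:
            installed = parse_version(metadata.version(name))
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if installed >= minimum else "too old"
    return report


for name, status in check_environment().items():
    print(f"{name}: {status}")
```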

transformers-only (single-image debug)

pip install "transformers>=4.57" torch pillow accelerate
# for FlashAttention:
pip install flash-attn --no-build-isolation

vLLM serving (tested recipe)

We use a dedicated venv for inference to keep vLLM nightly isolated:

uv pip install -U vllm \
    --torch-backend=cu130 \
    --extra-index-url https://wheels.vllm.ai/nightly
uv pip install runai-model-streamer

💡 On CUDA 12.x, replace --torch-backend=cu130 with the matching tag (e.g. cu121, cu124).

🚀 Quick start

A. HuggingFace transformers (single-image debug)

import torch
from transformers import AutoProcessor, HunYuanVLForConditionalGeneration

MODEL_ID = "EthannW/HunyuanOCR-1-5"

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = HunYuanVLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto",
    trust_remote_code=True,
).eval()

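# Default OCR prompt, kept in Chinese as shipped. English gloss: "Extract all
# information from the body of the document image as markdown, ignoring headers
# and footers; express tables as HTML and formulas as LaTeX; parse in reading order."
prompt = (
    "提取文档图片中正文的所有信息用markdown格式表示，其中页眉、页脚部分忽略，"
    "表格用html格式表达，文档中公式用latex格式表示，按照阅读顺序组织进行解析。"
)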

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "/path/to/document.png"},
        {"type": "text",  "text":  prompt},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=8000, do_sample=False)

gen = out[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(gen, skip_special_tokens=True)[0])

Or use the ready-made single-image script from the repo:

git clone -b develop https://github.com/Tencent-Hunyuan/HunyuanOCR.git
cd HunyuanOCR

python inference/infer_base.py \
    --model EthannW/HunyuanOCR-1-5 \
    --image /path/to/document.png \
    --max-new-tokens 8000

B. vLLM (OpenAI-compatible serving)

# Autoregressive baseline (run from the root of the cloned HunyuanOCR repo)
MODEL_PATH=EthannW/HunyuanOCR-1-5 \
GPU=0 PORT=8000 GPU_MEM_UTIL=0.9 \
bash inference/serve_ar.sh

# DFlash speculative decoding (needs the DFlash draft repo)
MODEL_PATH=EthannW/HunyuanOCR-1-5 \
DFLASH_PATH=EthannW/HunyuanOCR-1-5-DFlash \
GPU=0 PORT=8001 GPU_MEM_UTIL=0.9 NUM_SPEC_TOKENS=15 \
bash inference/serve_dflash.sh
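Once one of the servers above is running, it can also be called without the repo client through the OpenAI-compatible `/v1/chat/completions` endpoint. The sketch below uses only the standard library and sends a local image as a base64 data URL. The function names, the `max_tokens` value (mirroring the 8000 used in the transformers example), and `temperature=0.0` are assumptions for illustration, not the sampling parameters of the shipped client.

```python
import base64
import json
import urllib.request


def build_ocr_request(model, image_bytes, prompt, mime_type="image/png", max_tokens=8000):
    """Build a chat-completions payload with one inline image and one text prompt."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,  # must match the name the server registered
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{encoded}"}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # assumption: greedy decoding, as in the transformers example
    }


def send_ocr_request(payload, host="127.0.0.1", port=8000, timeout=600):
    """POST the payload to the server and return the generated text."""
    request = urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    with open("/path/to/document.png", "rb") as handle:
        payload = build_ocr_request(
            "EthannW/HunyuanOCR-1-5",
            handle.read(),
            "提取文档图片中正文的所有信息用markdown格式表示，其中页眉、页脚部分忽略，"
            "表格用html格式表达，文档中公式用latex格式表示，按照阅读顺序组织进行解析。",
        )
    print(send_ocr_request(payload))
```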

Send one image with the bundled client, which streams the response, stops early when the tail of the output starts repeating, and uses the same sampling parameters as the internal benchmark:

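# --model must match the name the server registered; by default vLLM uses the
# served model path, so adjust this if the serve script sets a different name.
python inference/infer_vllm_client.py \
    --host 127.0.0.1 --port 8000 \
    --model EthannW/HunyuanOCR-1-5 \
    --image /path/to/document.png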
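The idea of a tail-repetition early stop can be sketched as follows. This is an illustrative guess at the technique, not the logic in `inference/infer_vllm_client.py`: the function names and the thresholds (`min_unit`, `max_unit`, `min_repeats`) are assumptions and would need tuning, since legitimate output such as long table rules can also repeat.

```python
def has_tail_repetition(text, min_unit=4, max_unit=64, min_repeats=8):
    """Return True if the text ends with some unit repeated min_repeats times in a row."""
    for unit_len in range(min_unit, max_unit + 1):
        if unit_len * min_repeats > len(text):
            break  # not enough text left for this or any longer unit
        unit = text[-unit_len:]
        repeats = 1
        pos = len(text) - 2 * unit_len
        # Walk backwards one unit at a time while the same unit keeps appearing.
        while pos >= 0 and text[pos:pos + unit_len] == unit:
            repeats += 1
            if repeats >= min_repeats:
                return True
            pos -= unit_len
    return False


def collect_stream(chunks, **thresholds):
    """Accumulate streamed text chunks and stop as soon as the tail degenerates."""
    text = ""
    for chunk in chunks:
        text += chunk
        if has_tail_repetition(text, **thresholds):
            break
    return text


# A stream that gets stuck emitting the same empty cell is cut off early.
stuck = ["<tr>"] + ["<td></td>"] * 50
print(len(collect_stream(stuck)))
```

In a streaming client, `chunks` would be the text deltas from the server, and breaking out of the loop would be paired with closing the HTTP connection so the server stops generating.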

C. PC-side deployment via llama.cpp

See docs/llama_cpp.md in the GitHub repo for GGUF conversion, community llama-server launch, and the DFlash-adapted fork.

🎯 Default OCR prompt

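提取文档图片中正文的所有信息用markdown格式表示，其中页眉、页脚部分忽略，
表格用html格式表达，文档中公式用latex格式表示，按照阅读顺序组织进行解析。

(English gloss, for reference only; send the Chinese prompt above: "Extract all information from the body of the document image as markdown, ignoring headers and footers; express tables as HTML and formulas as LaTeX; parse in reading order.")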

The model also handles text spotting, information extraction, and text-image translation — pass a task-specific instruction as the text prompt.
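Because the default prompt asks for markdown with tables expressed as HTML, downstream code often needs to separate the two. The following is a minimal post-processing sketch using only the standard library. The sample string is a made-up stand-in for model output, the helper names are hypothetical, and `rowspan`/`colspan` attributes are ignored.

```python
import re
from html.parser import HTMLParser

TABLE_RE = re.compile(r"<table\b.*?</table>", re.DOTALL | re.IGNORECASE)


def split_tables(markdown_text):
    """Return (text with tables removed, list of raw HTML table strings)."""
    tables = TABLE_RE.findall(markdown_text)
    prose = TABLE_RE.sub("", markdown_text)
    return prose, tables


class _RowCollector(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def table_to_rows(table_html):
    """Convert one HTML table string into a list of rows of cell text."""
    collector = _RowCollector()
    collector.feed(table_html)
    collector.close()
    return collector.rows


# Hypothetical output, shaped like what the default prompt requests.
sample = (
    "# Invoice\n\n"
    "<table><tr><th>Item</th><th>Qty</th></tr>"
    "<tr><td>Pen</td><td>2</td></tr></table>\n\n"
    "Total due: $4"
)
prose, tables = split_tables(sample)
print(table_to_rows(tables[0]))
```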

🔗 Related repositories

GitHub — training & inference toolkit (branch develop): https://github.com/Tencent-Hunyuan/HunyuanOCR
DFlash draft weights (required for speculative-decoding acceleration): EthannW/HunyuanOCR-1-5-DFlash
HunyuanOCR-1.0 (previous generation): tencent/HunyuanOCR

📜 License

HunyuanOCR-1.5 is released under the same license as HunyuanOCR-1.0: the Tencent Hunyuan Community License Agreement.

⚠️ Preview notice. This checkpoint is a preview snapshot. The technical report and official model release will follow shortly; interfaces and weights may be updated before the final release.

Safetensors · Model size: 1B params · Tensor type: BF16
