HunyuanOCR-1.5 · Preview
Towards Efficient and Effective E2E OCR
📝 Note. This is a preview release of the HunyuanOCR-1.5 weights. The technical report and official weights are coming soon; the checkpoint, file layout, and interface here may still change before the final release. The training / inference toolkit and full documentation live in the GitHub repo (branch develop): https://github.com/Tencent-Hunyuan/HunyuanOCR.
📖 Introduction
HunyuanOCR-1.5 is a lightweight, end-to-end OCR-specialized vision-language model. It targets a broad range of text-centric visual tasks and unifies document parsing, text spotting, information extraction, and text-image translation within a single end-to-end VLM.
Building upon the validated lightweight architecture of HunyuanOCR-1.0, HunyuanOCR-1.5 does not redesign the backbone. Instead, it performs a systematic upgrade around two goals — making the model faster and better:
⚡ Faster: DFlash inference acceleration. A lightweight block-diffusion draft model drafts multiple candidate tokens in parallel, and the target model verifies them in a single pass. This significantly reduces decoding latency for long structured OCR outputs (dense documents, tables, formulas) while preserving the target model's output distribution. Draft weights: EthannW/HunyuanOCR-1-5-DFlash.
💻 PC-side deployment via llama.cpp. Beyond server-grade vLLM, HunyuanOCR-1.5 also supports CPU / consumer-GPU / laptop deployment via llama.cpp with an OpenAI-compatible llama-server. A DFlash-adapted llama.cpp fork is also provided so the same speculative-decoding acceleration is available on PC.
🧠 Better: Agentic Data Flow + upgraded training recipe. An agent-driven data-construction system (Agentic Data Flow) translates model weaknesses into executable data requirements, targeting long-tail capabilities such as low-resource OCR, ancient-script OCR, and multi-image text-centric QA. Pretraining Stage-3 is re-planned with 4K resolution and a 128K context window; post-training refines the SFT data and further explores RL across different OCR tasks.
Together, HunyuanOCR-1.5 achieves both faster inference and broader OCR capability coverage while retaining the deployment advantages of a lightweight end-to-end model.
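The draft-then-verify loop described above can be illustrated with a toy, greedy-decoding sketch. This is not the DFlash implementation: `target_next`, `draft_block`, and the arithmetic "models" are hypothetical stand-ins, the draft here proposes tokens sequentially rather than with block diffusion, and a real target model would score the whole proposal in one forward pass. The point is only that accepting the longest matching prefix, then substituting the target's own token at the first mismatch, reproduces the plain autoregressive output with fewer target passes.

```python
from typing import List, Tuple

VOCAB = 7  # toy vocabulary size (assumption for illustration)


def target_next(prefix: List[int]) -> int:
    """Toy deterministic stand-in for the target model's greedy next token."""
    return (sum(prefix) * 31 + len(prefix)) % VOCAB


def draft_block(prefix: List[int], k: int) -> List[int]:
    """Toy draft: proposes k tokens, deliberately wrong when len(context) % 4 == 3."""
    ctx, out = list(prefix), []
    for _ in range(k):
        tok = target_next(ctx)
        if len(ctx) % 4 == 3:
            tok = (tok + 1) % VOCAB  # inject a draft error
        out.append(tok)
        ctx.append(tok)
    return out


def autoregressive_decode(prefix: List[int], n: int) -> List[int]:
    """Baseline: one target call per generated token."""
    seq = list(prefix)
    for _ in range(n):
        seq.append(target_next(seq))
    return seq[len(prefix):]


def speculative_decode(prefix: List[int], n: int, k: int) -> Tuple[List[int], int]:
    """Greedy draft-and-verify; returns (tokens, number of verification passes)."""
    seq, passes = list(prefix), 0
    while len(seq) - len(prefix) < n:
        proposal = draft_block(seq, k)
        passes += 1  # counted as one pass: a real target scores all positions at once
        ctx = list(seq)
        for tok in proposal:
            expected = target_next(ctx)
            if tok != expected:
                ctx.append(expected)  # first mismatch: keep the target's token, drop the rest
                break
            ctx.append(tok)
        else:
            ctx.append(target_next(ctx))  # whole block accepted: take one bonus token
        seq = ctx
    return seq[len(prefix):len(prefix) + n], passes
```

Calling `speculative_decode([1, 2], 20, k=4)` yields the same 20 tokens as `autoregressive_decode([1, 2], 20)` while using fewer verification passes than tokens. With sampling instead of greedy decoding, preserving the output distribution requires a probabilistic acceptance rule, which this sketch does not cover.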
⚙️ Environment
- Python 3.10+ (3.12 tested)
- PyTorch 2.1+ (CUDA 12.1+; a cu130 build has been tested end-to-end)
- transformers ≥ 4.57 (ships HunYuanVLForConditionalGeneration + AutoProcessor for the HunyuanOCR-1.5 series)
- vLLM nightly (0.23.x, cu130 build tested), for OpenAI-compatible serving and (in the DFlash draft repo) speculative decoding
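A quick way to confirm the minimum versions listed above before loading the model is sketched below. It uses only the standard library; the helper names (`version_tuple`, `meets_minimum`, `check_environment`) are illustrative and not part of the repo, and the simple numeric comparison ignores pre-release ordering. The vLLM nightly is left out because its version scheme is not pinned here.

```python
import re
import sys
from importlib import metadata

# Minimums taken from the environment list; package names are the PyPI names.
MINIMUMS = {"torch": "2.1", "transformers": "4.57"}


def version_tuple(version: str, width: int = 3) -> tuple:
    """Turn '2.9.0+cu130' or '4.57' into a fixed-width numeric tuple."""
    core = version.split("+")[0]  # drop local build tags such as '+cu130'
    nums = [int(n) for n in re.findall(r"\d+", core)[:width]]
    return tuple(nums + [0] * (width - len(nums)))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Numeric comparison only; pre-release suffixes are not ordered specially."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_environment() -> dict:
    """Map each requirement name to True (satisfied) or False (missing or too old)."""
    report = {"python": sys.version_info >= (3, 10)}
    for package, minimum in MINIMUMS.items():
        try:
            report[package] = meets_minimum(metadata.version(package), minimum)
        except metadata.PackageNotFoundError:
            report[package] = False
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```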
transformers-only (single-image debug)
pip install "transformers>=4.57" torch pillow accelerate
# for FlashAttention:
pip install flash-attn --no-build-isolation
vLLM serving (tested recipe)
We use a dedicated venv for inference to keep vLLM nightly isolated:
uv pip install -U vllm \
--torch-backend=cu130 \
--extra-index-url https://wheels.vllm.ai/nightly
uv pip install runai-model-streamer
💡 On CUDA 12.x, replace --torch-backend=cu130 with the matching tag (e.g. cu121, cu124).
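The tag follows the usual `cu<major><minor>` pattern, so it can be derived from the CUDA version string. The helper below is a small sketch with hypothetical names (`torch_backend_tag`, `vllm_install_command`); it only formats the arguments from the recipe above and does not check that a wheel actually exists for the resulting tag.

```python
from typing import List


def torch_backend_tag(cuda_version: str) -> str:
    """Map a CUDA version such as '12.4' or '12.4.1' to a tag such as 'cu124'."""
    major, minor = cuda_version.strip().split(".")[:2]
    return f"cu{int(major)}{int(minor)}"


def vllm_install_command(cuda_version: str) -> List[str]:
    """Rebuild the install command from the recipe with the matching backend tag."""
    return [
        "uv", "pip", "install", "-U", "vllm",
        f"--torch-backend={torch_backend_tag(cuda_version)}",
        "--extra-index-url", "https://wheels.vllm.ai/nightly",
    ]


if __name__ == "__main__":
    print(" ".join(vllm_install_command("12.4")))
```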
🚀 Quick start
A. HuggingFace transformers (single-image debug)
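import torch
from transformers import AutoProcessor, HunYuanVLForConditionalGeneration

MODEL_ID = "EthannW/HunyuanOCR-1-5"

# Load the processor and the model (bf16 weights, placed automatically on the available device).
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = HunYuanVLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto",
    trust_remote_code=True,
).eval()

# Default document-parsing prompt, kept in Chinese as shipped. English gloss:
# "Extract all information from the main body of the document image in markdown
# format, ignoring headers and footers; express tables in HTML and formulas in
# LaTeX, and organize the output in reading order."
prompt = (
    "提取文档图片中正文的所有信息用markdown格式表示,其中页眉、页脚部分忽略,"
    "表格用html格式表达,文档中公式用latex格式表示,按照阅读顺序组织进行解析。"
)

# One user turn containing the image and the instruction.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "/path/to/document.png"},
        {"type": "text", "text": prompt},
    ],
}]

# Apply the chat template and tokenize in one step.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

# Greedy decoding; drop the prompt tokens before decoding the answer.
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=8000, do_sample=False)
gen = out[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(gen, skip_special_tokens=True)[0])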
Or use the ready-made single-image script from the repo:
git clone -b develop https://github.com/Tencent-Hunyuan/HunyuanOCR.git
cd HunyuanOCR
python inference/infer_base.py \
--model EthannW/HunyuanOCR-1-5 \
--image /path/to/document.png \
--max-new-tokens 8000
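Because the default prompt asks for tables as HTML embedded in the Markdown output, downstream code often needs to pull those tables back out. The following is a minimal post-processing sketch using only the standard library. `TableExtractor`, `extract_tables`, and the sample string are illustrative assumptions (the sample is hand-written, not real model output), and nested tables or `rowspan` / `colspan` are not handled.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableExtractor(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings."""

    def __init__(self) -> None:
        super().__init__()
        self.tables: List[List[List[str]]] = []
        self._in_table = False
        self._row: Optional[List[str]] = None
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._in_table = True
            self.tables.append([])
        elif tag == "tr" and self._in_table:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:  # text outside table cells is ignored
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
        elif tag == "table":
            self._in_table = False


def extract_tables(markdown_text: str) -> List[List[List[str]]]:
    """Return every HTML table found in a Markdown string."""
    parser = TableExtractor()
    parser.feed(markdown_text)
    parser.close()
    return parser.tables


# Hand-written sample in the shape the default prompt requests (not real model output).
sample = (
    "# Invoice\n\n"
    "<table><tr><th>Item</th><th>Qty</th></tr>"
    "<tr><td>Pen</td><td>2</td></tr></table>\n\n"
    "Total: $E = mc^2$"
)

if __name__ == "__main__":
    print(extract_tables(sample))
```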
B. vLLM (OpenAI-compatible serving)
# Autoregressive baseline
MODEL_PATH=EthannW/HunyuanOCR-1-5 \
GPU=0 PORT=8000 GPU_MEM_UTIL=0.9 \
bash inference/serve_ar.sh
# DFlash speculative decoding (needs the DFlash draft repo)
MODEL_PATH=EthannW/HunyuanOCR-1-5 \
DFLASH_PATH=EthannW/HunyuanOCR-1-5-DFlash \
GPU=0 PORT=8001 GPU_MEM_UTIL=0.9 NUM_SPEC_TOKENS=15 \
bash inference/serve_dflash.sh
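Send one image with the client shipped in the repo. It streams the response, stops early when the tail of the output starts repeating, and uses the same sampling parameters as the internal benchmark. The --model value must match the model name the server exposes: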
python inference/infer_vllm_client.py \
--host 127.0.0.1 --port 8000 \
--model tencent/HunyuanOCR-1-5 \
--image /path/to/document.png
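The idea behind a tail-repetition early stop can be sketched as follows. This is an assumption about the general technique, not the rule used by `infer_vllm_client.py`: the function names and the thresholds (`max_unit`, `min_repeats`, `check_every`) are hypothetical, and legitimate content such as long runs of dots or dashes would also trigger it, so the thresholds need tuning for real documents.

```python
from typing import Iterable, Tuple


def has_repeating_tail(text: str, max_unit: int = 64, min_repeats: int = 8) -> bool:
    """True if text ends with a unit of 1..max_unit chars repeated at least min_repeats times."""
    for unit_len in range(1, max_unit + 1):
        span = unit_len * min_repeats
        if span > len(text):
            break  # longer units would need even more text
        tail = text[-span:]
        if tail == tail[-unit_len:] * min_repeats:
            return True
    return False


def consume_stream(chunks: Iterable[str], check_every: int = 16,
                   max_unit: int = 64, min_repeats: int = 8) -> Tuple[str, bool]:
    """Accumulate streamed chunks; return (text, stopped_early)."""
    text = ""
    for i, chunk in enumerate(chunks, start=1):
        text += chunk
        # Checking every few chunks keeps the overhead low on long outputs.
        if i % check_every == 0 and has_repeating_tail(text, max_unit, min_repeats):
            return text, True
    return text, False
```

In a streaming client, `chunks` would be the text deltas received from the server, and an early stop would also close the HTTP connection so the server stops generating.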
C. PC-side deployment via llama.cpp
See docs/llama_cpp.md in the GitHub repo for GGUF conversion, launching the community llama-server, and the DFlash-adapted fork.
🎯 Default OCR prompt
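提取文档图片中正文的所有信息用markdown格式表示,其中页眉、页脚部分忽略,
表格用html格式表达,文档中公式用latex格式表示,按照阅读顺序组织进行解析。
English gloss (for reference only): "Extract all information from the main body of the document image in markdown format, ignoring headers and footers; express tables in HTML and formulas in LaTeX, and organize the output in reading order."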
The model also handles text spotting, information extraction, and text-image translation — pass a task-specific instruction as the text prompt.
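A sketch of how a task-specific instruction could be sent to an OpenAI-compatible server (such as the vLLM server from the quick start) is shown below, using only the standard library. `build_payload` and `send` are hypothetical helpers, `temperature=0.0` is an assumption mirroring the greedy setting used in the transformers example, and the translation instruction in the usage note is illustrative wording rather than an official prompt.

```python
import base64
import json
import urllib.request

DEFAULT_PROMPT = (
    "提取文档图片中正文的所有信息用markdown格式表示,其中页眉、页脚部分忽略,"
    "表格用html格式表达,文档中公式用latex格式表示,按照阅读顺序组织进行解析。"
)


def build_payload(image_bytes: bytes, model: str, instruction: str = DEFAULT_PROMPT,
                  mime: str = "image/png", max_tokens: int = 8000) -> dict:
    """Build a chat-completions request with one inline image and one instruction."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,  # must match the name the server exposes
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
                {"type": "text", "text": instruction},
            ],
        }],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # assumption: deterministic decoding
    }


def send(payload: dict, base_url: str = "http://127.0.0.1:8000") -> str:
    """POST the payload to the server and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

For example, `build_payload(open("/path/to/document.png", "rb").read(), "EthannW/HunyuanOCR-1-5", instruction="Translate the text in this image into English.")` swaps the default parsing prompt for a translation request; passing the result to `send` requires a running server.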
🔗 Related repositories
- GitHub (training & inference toolkit, branch develop): https://github.com/Tencent-Hunyuan/HunyuanOCR
- DFlash draft weights (required for speculative-decoding acceleration): EthannW/HunyuanOCR-1-5-DFlash
- HunyuanOCR-1.0 (previous generation): tencent/HunyuanOCR
📜 License
HunyuanOCR-1.5 is released under the same license as HunyuanOCR-1.0: the Tencent Hunyuan Community License Agreement.
⚠️ Preview notice. This checkpoint is a preview snapshot. The technical report and official model release will follow shortly; interfaces and weights may be updated before the final release.