HunyuanOCR-1.5  ·  Preview

Towards Efficient and Effective E2E OCR

📝 Note. This is a preview release of the HunyuanOCR-1.5 weights. The technical report and official weights are coming very soon; the checkpoint, file layout and interface here may still change before the final release. The training / inference toolkit and full documentation live on the develop branch of the GitHub repo: https://github.com/Tencent-Hunyuan/HunyuanOCR.


📖 Introduction

HunyuanOCR-1.5 is a lightweight, end-to-end OCR-specialized vision-language model. It targets a broad range of text-centric visual tasks and unifies document parsing, text spotting, information extraction, and text-image translation within a single end-to-end VLM.

Building upon the validated lightweight architecture of HunyuanOCR-1.0, HunyuanOCR-1.5 does not redesign the backbone. Instead, it performs a systematic upgrade around two goals — making the model faster and better:

  • Faster: DFlash inference acceleration. A lightweight block-diffusion draft model proposes multiple candidate tokens in parallel, and the target model verifies them in a single pass. This significantly reduces decoding latency for long structured OCR outputs (dense documents, tables, formulas) while preserving the target model's output distribution. Draft weights: EthannW/HunyuanOCR-1-5-DFlash.

  • 💻 PC-side deployment via llama.cpp. Beyond server-grade vLLM, HunyuanOCR-1.5 also supports CPU / consumer-GPU / laptop deployment via llama.cpp with an OpenAI-compatible llama-server. A DFlash-adapted llama.cpp fork is also provided so the same speculative-decoding acceleration is available on PC.

  • 🧠 Better: Agentic Data Flow + upgraded training recipe. An agent-driven data-construction system (Agentic Data Flow) translates model weaknesses into executable data requirements, targeting long-tail capabilities such as low-resource OCR, ancient-script OCR, and multi-image text-centric QA. Stage 3 of pretraining is re-planned around 4K resolution and a 128K context window; post-training refines the SFT data and further explores RL across different OCR tasks.

Together, HunyuanOCR-1.5 achieves both faster inference and broader OCR capability coverage while retaining the deployment advantages of a lightweight end-to-end model.
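
To make the draft-and-verify idea concrete, here is a minimal toy sketch of greedy speculative decoding. It is not DFlash itself: the block-diffusion drafter is replaced by a hypothetical toy_draft_block function, the target model by a hypothetical toy_target function, and the single batched verification pass is emulated with per-position calls. It only illustrates why accepted tokens match what the target would have produced on its own under greedy decoding, while the number of verification passes stays below the number of generated tokens.

```python
from typing import Callable, List, Tuple


def greedy_decode(target: Callable[[List[int]], int], prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target, draft_block, prompt: List[int], max_new: int, k: int) -> Tuple[List[int], int]:
    """Greedy draft-and-verify loop. Returns (tokens, number of verification passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new:
        proposal = draft_block(out, k)  # k candidate tokens proposed at once
        passes += 1                     # counts as one verification pass of the target
        accepted: List[int] = []
        # A real system scores all k positions in one forward pass;
        # this toy emulates that by querying the target position by position.
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: keep the target's token, drop the rest
                break
            accepted.append(tok)
        else:
            accepted.append(target(out + accepted))  # all k accepted: the pass yields one extra token
        out.extend(accepted)
    return out[: len(prompt) + max_new], passes


def toy_target(prefix: List[int]) -> int:
    """Hypothetical deterministic 'target model' over a vocabulary of 11 tokens."""
    return (prefix[-1] * 3 + 1) % 11


def toy_draft_block(prefix: List[int], k: int) -> List[int]:
    """Hypothetical drafter: agrees with the target except at every third position."""
    block, cur = [], list(prefix)
    for _ in range(k):
        tok = toy_target(cur)
        if len(cur) % 3 == 0:
            tok = (tok + 1) % 11  # deliberate draft error
        block.append(tok)
        cur.append(tok)
    return block


if __name__ == "__main__":
    baseline = greedy_decode(toy_target, [1], 20)
    fast, passes = speculative_decode(toy_target, toy_draft_block, [1], 20, k=4)
    print("identical output:", fast == baseline, "| verification passes:", passes)
```

The better the drafter agrees with the target, the more tokens each pass accepts, which is where the latency gain on long structured outputs would come from.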


⚙️ Environment

  • Python 3.10+ (3.12 tested)
  • PyTorch 2.1+ (CUDA 12.1+; a cu130 build has been tested end-to-end)
  • transformers ≥ 4.57 (ships HunYuanVLForConditionalGeneration + AutoProcessor for the HunyuanOCR-1.5 series)
  • vLLM nightly (0.23.x, cu130 build tested) — for OpenAI-compatible serving and (in the DFlash draft repo) speculative decoding

transformers-only (single-image debug)

pip install "transformers>=4.57" torch pillow accelerate
# for FlashAttention:
pip install flash-attn --no-build-isolation

vLLM serving (tested recipe)

We use a dedicated venv for inference to keep vLLM nightly isolated:

uv pip install -U vllm \
    --torch-backend=cu130 \
    --extra-index-url https://wheels.vllm.ai/nightly
uv pip install runai-model-streamer

💡 On CUDA 12.x, replace --torch-backend=cu130 with the matching tag (e.g. cu121, cu124).
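
A quick way to confirm that an environment meets the minimums listed above is to compare installed package versions against them. The sketch below uses only the standard library; the helper names (parse_version, meets_minimum, check_environment) are illustrative and not part of the HunyuanOCR toolkit, and the version parsing is deliberately simplistic (leading numeric components only).

```python
from importlib import metadata

# Minimums taken from the list above; adjust if the final release changes them.
MINIMUMS = {"torch": "2.1", "transformers": "4.57"}


def parse_version(version: str) -> tuple:
    """Keep leading numeric components only, e.g. '2.1.0+cu121' -> (2, 1, 0)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    return parse_version(installed) >= parse_version(minimum)


def check_environment(minimums=MINIMUMS) -> dict:
    """Map each package name to 'ok', 'missing', or a 'too old' message."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if meets_minimum(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed} < {minimum})"
    return report


if __name__ == "__main__":
    for name, status in check_environment().items():
        print(f"{name}: {status}")
```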


🚀 Quick start

A. HuggingFace transformers (single-image debug)

import torch
from transformers import AutoProcessor, HunYuanVLForConditionalGeneration

MODEL_ID = "EthannW/HunyuanOCR-1-5"

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = HunYuanVLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto",
    trust_remote_code=True,
).eval()

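# Default document-parsing prompt. English gloss: "Extract all information from the
# main body of the document image in markdown, ignoring headers and footers; render
# tables as HTML and formulas as LaTeX, and follow the reading order."
prompt = (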
    "提取文档图片中正文的所有信息用markdown格式表示,其中页眉、页脚部分忽略,"
    "表格用html格式表达,文档中公式用latex格式表示,按照阅读顺序组织进行解析。"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "/path/to/document.png"},
        {"type": "text",  "text":  prompt},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=8000, do_sample=False)

gen = out[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(gen, skip_special_tokens=True)[0])

Or use the ready-made single-image script from the repo:

git clone -b develop https://github.com/Tencent-Hunyuan/HunyuanOCR.git
cd HunyuanOCR

python inference/infer_base.py \
    --model EthannW/HunyuanOCR-1-5 \
    --image /path/to/document.png \
    --max-new-tokens 8000

B. vLLM (OpenAI-compatible serving)

# Autoregressive baseline
MODEL_PATH=EthannW/HunyuanOCR-1-5 \
GPU=0 PORT=8000 GPU_MEM_UTIL=0.9 \
bash inference/serve_ar.sh

# DFlash speculative decoding (needs the DFlash draft repo)
MODEL_PATH=EthannW/HunyuanOCR-1-5 \
DFLASH_PATH=EthannW/HunyuanOCR-1-5-DFlash \
GPU=0 PORT=8001 GPU_MEM_UTIL=0.9 NUM_SPEC_TOKENS=15 \
bash inference/serve_dflash.sh
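
Since the servers above are described as OpenAI-compatible, a raw request can be sketched with the standard library alone. This is a minimal sketch, not the shipped client: it assumes the standard /v1/chat/completions route, assumes the served model name equals the MODEL_PATH value, and uses temperature 0.0 to mirror do_sample=False from section A rather than the internal benchmark settings. build_payload and send are illustrative helper names.

```python
import base64
import json
import urllib.request


def build_payload(image_bytes: bytes, prompt: str, model: str,
                  mime: str = "image/png", max_tokens: int = 8000) -> dict:
    """Build an OpenAI-style chat request with one inline (base64) image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,  # assumed to match the name the server is serving
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # assumption: greedy decoding, as in the transformers example
    }


def send(payload: dict, host: str = "127.0.0.1", port: int = 8000, timeout: float = 300.0) -> str:
    """POST the payload to a running server and return the generated text."""
    request = urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    with open("/path/to/document.png", "rb") as fh:
        payload = build_payload(fh.read(), "<your OCR prompt>", "EthannW/HunyuanOCR-1-5")
    print(send(payload, port=8000))
```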

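Send one image with the shipped client, which streams the response, stops early on tail repetition, and uses the same sampling parameters as the internal benchmark. The --model value must be a name the server actually serves; vLLM defaults to the path passed at launch, so if tencent/HunyuanOCR-1-5 below is rejected, try the MODEL_PATH value instead: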

python inference/infer_vllm_client.py \
    --host 127.0.0.1 --port 8000 \
    --model tencent/HunyuanOCR-1-5 \
    --image /path/to/document.png
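
The tail-repetition early stop mentioned above guards against a common failure mode of long OCR generations, where the model loops on the same fragment until the token budget runs out. The shipped client's actual rule is not reproduced here; the sketch below is one possible heuristic, with hypothetical thresholds (min_unit, max_unit, min_repeats, check_every) that would need tuning so that legitimate repetitive content such as long table rulers is not cut off.

```python
from typing import Iterable


def has_repeating_tail(text: str, min_unit: int = 4, max_unit: int = 200, min_repeats: int = 8) -> bool:
    """True if text ends with a unit of min_unit..max_unit chars repeated at least min_repeats times."""
    for unit_len in range(min_unit, max_unit + 1):
        span = unit_len * min_repeats
        if span > len(text):
            break  # longer units cannot fit either
        tail = text[-span:]
        if tail == tail[:unit_len] * min_repeats:
            return True
    return False


def consume_stream(chunks: Iterable[str], check_every: int = 16) -> str:
    """Accumulate streamed text chunks and stop reading once the tail starts looping."""
    text = ""
    for i, chunk in enumerate(chunks, 1):
        text += chunk
        # Checking every few chunks keeps the overhead of the scan low.
        if i % check_every == 0 and has_repeating_tail(text):
            break
    return text


if __name__ == "__main__":
    looping = ["Intro "] + ["loop "] * 1000
    print(len(consume_stream(looping)), "characters kept out of", sum(map(len, looping)))
```

In a real client the break would also close the HTTP stream so the server stops generating.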

C. PC-side deployment via llama.cpp

See docs/llama_cpp.md in the GitHub repo for GGUF conversion, community llama-server launch, and the DFlash-adapted fork.


🎯 Default OCR prompt

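提取文档图片中正文的所有信息用markdown格式表示,其中页眉、页脚部分忽略,
表格用html格式表达,文档中公式用latex格式表示,按照阅读顺序组织进行解析。

English gloss (the prompt itself is passed to the model in Chinese, as shown): "Extract all information from the main body of the document image and represent it in markdown, ignoring headers and footers; express tables in HTML and formulas in LaTeX, and organize the output in reading order."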
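
Because the default prompt asks for tables as HTML embedded in otherwise markdown output, downstream code often needs to pull those tables back out. The following is a minimal standard-library sketch under simplifying assumptions: tables are not nested, and rowspan / colspan attributes are ignored. extract_tables is an illustrative helper name, not a function from the HunyuanOCR repo.

```python
import re
from html.parser import HTMLParser

TABLE_RE = re.compile(r"<table\b.*?</table>", re.DOTALL | re.IGNORECASE)


class _TableRows(HTMLParser):
    """Collect the cell text of one table as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row, self._cell = None, None


def extract_tables(markdown: str) -> list:
    """Return every HTML table found in the model output as a list of rows of cell strings."""
    tables = []
    for html in TABLE_RE.findall(markdown):
        parser = _TableRows()
        parser.feed(html)
        parser.close()
        tables.append(parser.rows)
    return tables


if __name__ == "__main__":
    sample = "# Report\n\n<table><tr><th>Item</th><th>Qty</th></tr><tr><td>Pen</td><td>2</td></tr></table>\n"
    for row in extract_tables(sample)[0]:
        print(row)
```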

The model also handles text spotting, information extraction, and text-image translation — pass a task-specific instruction as the text prompt.
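
Switching tasks only changes the text part of the message, so a small helper can keep the message structure from section A in one place. This is a sketch: build_messages is an illustrative name, and the English instruction in the usage example is made-up wording rather than an official task prompt (see the GitHub repo for recommended prompts).

```python
# Default document-parsing prompt, copied verbatim from this model card.
DEFAULT_PROMPT = (
    "提取文档图片中正文的所有信息用markdown格式表示,其中页眉、页脚部分忽略,"
    "表格用html格式表达,文档中公式用latex格式表示,按照阅读顺序组织进行解析。"
)


def build_messages(image_path: str, instruction: str = DEFAULT_PROMPT) -> list:
    """Build a single-turn chat message in the format used by processor.apply_chat_template."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": instruction},
        ],
    }]


if __name__ == "__main__":
    # Hypothetical instruction for an information-extraction task.
    messages = build_messages("/path/to/receipt.png", "Extract the merchant name and total amount as JSON.")
    print(messages[0]["content"][1]["text"])
```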


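🔗 Related repositories

  • Code, toolkit and documentation (branch develop): https://github.com/Tencent-Hunyuan/HunyuanOCR
  • DFlash draft weights: EthannW/HunyuanOCR-1-5-DFlash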

📜 License

HunyuanOCR-1.5 is released under the same license as HunyuanOCR-1.0: the Tencent Hunyuan Community License Agreement.

⚠️ Preview notice. This checkpoint is a preview snapshot. The technical report and official model release will follow shortly; interfaces and weights may be updated before the final release.

Safetensors · Model size: 1B params · Tensor type: BF16