HunyuanOCR-1.5 · DFlash Draft · Preview

Speculative-decoding draft for EthannW/HunyuanOCR-1-5

📝 Note. This is a preview release of the DFlash draft weights. The technical report and official weights are coming very soon; checkpoint, interface and file layout may still evolve before the final release. Full toolkit / docs live in the GitHub repo (branch develop): https://github.com/Tencent-Hunyuan/HunyuanOCR.

⚠️ This model is not usable standalone. It is a draft model used only for speculative decoding together with the target model EthannW/HunyuanOCR-1-5.

📖 What is DFlash?

End-to-end OCR often requires long autoregressive decoding, which is the main bottleneck for dense documents, tables, formulas, and other long structured outputs.

HunyuanOCR-1.5 adopts a speculative-decoding framework based on DFlash:

A lightweight block-diffusion draft model (this repo) proposes multiple candidate tokens in parallel.
The target model (EthannW/HunyuanOCR-1-5) verifies them in a single forward pass.
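Draft tokens that the target accepts are committed as-is; at the first rejection the target's own prediction is used instead. The output distribution of the target model is therefore preserved, which makes DFlash a lossless acceleration.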

The result is significantly reduced decoding latency for long structured OCR outputs, without sacrificing accuracy.
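The draft-then-verify loop described above can be illustrated with a toy greedy example. This is only a sketch under stated assumptions: `speculative_decode`, `toy_target_next`, and `toy_draft_block` are hypothetical stand-ins, not part of DFlash or vLLM, and the "models" are simple arithmetic functions. A real target scores the whole block in one forward pass; here each position is checked with a separate call for readability.

```python
from typing import Callable, List

Token = int


def autoregressive_decode(target_next: Callable[[List[Token]], Token],
                          prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Plain greedy decoding: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


def speculative_decode(target_next: Callable[[List[Token]], Token],
                       draft_block: Callable[[List[Token], int], List[Token]],
                       prompt: List[Token], max_new_tokens: int,
                       num_spec_tokens: int = 15) -> List[Token]:
    """Greedy draft-then-verify loop (toy version)."""
    out = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(out) < limit:
        proposal = draft_block(out, num_spec_tokens)
        accepted: List[Token] = []
        for tok in proposal:
            expected = target_next(out + accepted)
            if tok != expected:
                # First mismatch: keep the target's token and drop the rest.
                accepted.append(expected)
                break
            accepted.append(tok)
        else:
            # Whole block accepted: the target still contributes one more token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[:limit]


def toy_target_next(context: List[Token]) -> Token:
    # Stand-in "target model": next token is the previous one plus 1 (mod 50).
    return (context[-1] + 1) % 50


def toy_draft_block(context: List[Token], k: int) -> List[Token]:
    # Stand-in "draft model": right for the first 3 positions, wrong afterwards.
    block, last = [], context[-1]
    for i in range(k):
        last = (last + 1) % 50 if i < 3 else (last + 2) % 50
        block.append(last)
    return block
```

Because every committed token is either confirmed or supplied by the target, `speculative_decode` returns the same sequence as `autoregressive_decode` no matter how good or bad the draft is; draft quality only changes how many tokens are committed per verification step.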

Architecture: 5-layer Qwen3-style block-diffusion draft (~360 M params in bfloat16), predicting 16 masked tokens in a single block. The draft is bound to target-layer indices [1, 8, 15, 22] of the 24-layer HunyuanOCR-1.5 base.

⚙️ Environment

Python 3.10+ (3.12 tested)
PyTorch 2.1+ (CUDA 12.1+; a cu130 build has been tested end-to-end)
transformers ≥ 4.57
vLLM nightly (0.23.x, cu130 build tested) — required for real speculative-decoding speedup at deployment time. DFlash support is included in the nightly wheel; no separate patch is needed.

uv pip install -U vllm \
    --torch-backend=cu130 \
    --extra-index-url https://wheels.vllm.ai/nightly
uv pip install runai-model-streamer

💡 On CUDA 12.x, replace --torch-backend=cu130 with the matching tag (e.g. cu121, cu124).
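A small helper can derive the backend tag from a CUDA version string and assemble the install command shown above. This is a minimal sketch: `torch_backend_tag` and `install_command` are hypothetical names, and the helper only formats the tag; it does not check that a wheel actually exists for that tag.

```python
import re
from typing import List


def torch_backend_tag(cuda_version: str) -> str:
    """Turn a CUDA version such as '12.4' or '12.4.1' into a tag such as 'cu124'."""
    match = re.fullmatch(r"(\d+)\.(\d+)(?:\.\d+)?", cuda_version.strip())
    if match is None:
        raise ValueError(f"unrecognised CUDA version: {cuda_version!r}")
    major, minor = match.groups()
    return f"cu{major}{minor}"


def install_command(cuda_version: str) -> List[str]:
    """Build the nightly vLLM install command for the given CUDA version."""
    return [
        "uv", "pip", "install", "-U", "vllm",
        f"--torch-backend={torch_backend_tag(cuda_version)}",
        "--extra-index-url", "https://wheels.vllm.ai/nightly",
    ]
```

For example, `" ".join(install_command("12.4"))` yields the same command as above with `--torch-backend=cu124`.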

🚀 How to use

A. transformers — single-image correctness / draft-load check

Use the shipped script from the GitHub repo. It loads the draft, runs it alongside the target on one image, and verifies that the speculative output matches the autoregressive (AR) reference:

git clone -b develop https://github.com/Tencent-Hunyuan/HunyuanOCR.git
cd HunyuanOCR

python inference/infer_dflash.py \
    --model        EthannW/HunyuanOCR-1-5 \
    --dflash-model EthannW/HunyuanOCR-1-5-DFlash \
    --image        /path/to/document.png \
    --num-spec-tokens 15

ℹ️ infer_dflash.py only verifies that the DFlash draft loads and that its output matches the AR reference on a single image. Real speculative-decoding speedup is only realized under vLLM, see below.
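The kind of check described here, comparing a speculative output against an AR reference, can be sketched as a token-by-token comparison. The helper below is hypothetical and is not taken from infer_dflash.py; it only illustrates what "matching" means for two token-id sequences.

```python
from typing import Optional, Sequence


def first_mismatch(reference: Sequence[int], candidate: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if the sequences match."""
    for i, (ref_tok, cand_tok) in enumerate(zip(reference, candidate)):
        if ref_tok != cand_tok:
            return i
    if len(reference) != len(candidate):
        # One sequence is a strict prefix of the other.
        return min(len(reference), len(candidate))
    return None
```

A result of `None` means the speculative run reproduced the reference exactly; any integer points at the first position worth inspecting.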

B. vLLM speculative decoding (recommended for real speedup)

MODEL_PATH=EthannW/HunyuanOCR-1-5 \
DFLASH_PATH=EthannW/HunyuanOCR-1-5-DFlash \
GPU=0 PORT=8001 GPU_MEM_UTIL=0.9 \
NUM_SPEC_TOKENS=15 \
bash inference/serve_dflash.sh

Under the hood the launch script passes:

--speculative-config '{"method":"dflash","model":"EthannW/HunyuanOCR-1-5-DFlash","num_speculative_tokens":15}'

to the vLLM entrypoint. Send an OpenAI-compatible request with the shipped single-image client:

python inference/infer_vllm_client.py \
    --host 127.0.0.1 --port 8001 \
    --model EthannW/HunyuanOCR-1-5 \
    --image /path/to/document.png
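For readers who prefer to call the server without the shipped client, the request can be sketched with the Python standard library. This is an illustration, not the implementation of infer_vllm_client.py: `build_chat_request` and `send_chat_request` are hypothetical helpers, the prompt text is a placeholder, and `model` must be whatever name the running server registered.

```python
import base64
import json
import urllib.request


def build_chat_request(model: str, image_bytes: bytes, prompt: str,
                       mime: str = "image/png") -> dict:
    """Build an OpenAI-style chat payload with the image inlined as a data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url",
                     "image_url": {"url": f"data:{mime};base64,{encoded}"}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }


def send_chat_request(host: str, port: int, payload: dict,
                      timeout: float = 120.0) -> str:
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Typical use is to read the image with `pathlib.Path("/path/to/document.png").read_bytes()`, pass it to `build_chat_request` together with the served model name and a prompt, and hand the result to `send_chat_request("127.0.0.1", 8001, payload)`. Speculative decoding is configured on the server, so the request itself is an ordinary chat completion.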

C. llama.cpp (PC-side)

A DFlash-adapted llama.cpp fork is provided for CPU / consumer-GPU / laptop speculative decoding. See docs/llama_cpp.md in the GitHub repo for the full guide (GGUF conversion of both target + draft, llama-server launch, and a smoke-test client).

📦 Files in this repo

| File | Purpose |
| --- | --- |
| `model.safetensors` | draft weights (bfloat16) |
| `config.json` | draft config; sets `auto_map` to `dflash.DFlashDraftModel` |
| `dflash.py` | `DFlashDraftModel` implementation (loaded via `trust_remote_code=True`) |
| `chat_template.jinja`, `tokenizer.json`, `tokenizer_config.json`, `processor_config.json` | tokenizer / processor, kept in sync with the target model |
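After downloading the repo, a quick local check can confirm that the files listed above are present and that `config.json` points at the custom class. This is a minimal sketch: `missing_files` and `auto_map_targets` are hypothetical helpers, and because the key names inside `auto_map` are not documented here, the second function simply returns all mapped values.

```python
import json
from pathlib import Path
from typing import List

EXPECTED_FILES = [
    "model.safetensors",
    "config.json",
    "dflash.py",
    "chat_template.jinja",
    "tokenizer.json",
    "tokenizer_config.json",
    "processor_config.json",
]


def missing_files(repo_dir: str) -> List[str]:
    """Return the expected files that are absent from a local copy of the repo."""
    root = Path(repo_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def auto_map_targets(repo_dir: str) -> List[str]:
    """Return the class paths listed under `auto_map` in config.json (may be empty)."""
    config = json.loads((Path(repo_dir) / "config.json").read_text(encoding="utf-8"))
    return list(config.get("auto_map", {}).values())
```

An empty list from `missing_files` and a `dflash.DFlashDraftModel` entry from `auto_map_targets` suggest the local copy matches the layout described in the table.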

🔗 Related repositories

Target model (required): EthannW/HunyuanOCR-1-5
GitHub — training & inference toolkit (branch develop): https://github.com/Tencent-Hunyuan/HunyuanOCR
HunyuanOCR-1.0 (previous generation): tencent/HunyuanOCR

📜 License

HunyuanOCR-1.5 (including the DFlash draft) is released under the same license as HunyuanOCR 1.0 — the Tencent Hunyuan Community License Agreement.

⚠️ Preview notice. This draft checkpoint is a preview snapshot. The technical report and official model release will follow shortly; interfaces and weights may be updated before the final release.

Model size: 90.7M params (Safetensors, tensor type F32)
Base model: EthannW/HunyuanOCR-1-5