HunyuanOCR-1.5 · DFlash Draft  ·  Preview

Speculative-decoding draft for EthannW/HunyuanOCR-1-5

📝 Note. This is a preview release of the DFlash draft weights. The technical report and official weights are coming very soon; checkpoint, interface and file layout may still evolve before the final release. Full toolkit / docs live in the GitHub repo (branch develop): https://github.com/Tencent-Hunyuan/HunyuanOCR.

⚠️ This model is not usable standalone. It is a draft model used only for speculative decoding together with the target model EthannW/HunyuanOCR-1-5.


📖 What is DFlash?

End-to-end OCR is often accompanied by long autoregressive decoding — the major bottleneck for dense documents, tables, formulas, and other long structured outputs.

HunyuanOCR-1.5 adopts a speculative-decoding framework based on DFlash:

  • A lightweight block-diffusion draft model (this repo) proposes multiple candidate tokens in parallel.
  • The target model (EthannW/HunyuanOCR-1-5) verifies them in a single forward pass.
  • Accepted tokens are committed as-is, so the output distribution of the target model is preserved — DFlash is a lossless acceleration.

The result is significantly reduced decoding latency for long structured OCR outputs, without sacrificing accuracy.

Architecture: 5-layer Qwen3-style block-diffusion draft (~360 M params in bfloat16), predicting 16 masked tokens in a single block. The draft is bound to target-layer indices [1, 8, 15, 22] of the 24-layer HunyuanOCR-1.5 base.


⚙️ Environment

  • Python 3.10+ (3.12 tested)
  • PyTorch 2.1+ (CUDA 12.1+; a cu130 build has been tested end-to-end)
  • transformers ≥ 4.57
  • vLLM nightly (0.23.x, cu130 build tested) — required for real speculative-decoding speedup at deployment time. DFlash support is included in the nightly wheel; no separate patch is needed.
uv pip install -U vllm \
    --torch-backend=cu130 \
    --extra-index-url https://wheels.vllm.ai/nightly
uv pip install runai-model-streamer

💡 On CUDA 12.x, replace --torch-backend=cu130 with the matching tag (e.g. cu121, cu124).


🚀 How to use

A. transformers — single-image correctness / draft-load check

Use the shipped script from the GitHub repo. It loads the draft, runs it alongside the target for one image, and verifies that the AR reference matches:

git clone -b develop https://github.com/Tencent-Hunyuan/HunyuanOCR.git
cd HunyuanOCR

python inference/infer_dflash.py \
    --model        EthannW/HunyuanOCR-1-5 \
    --dflash-model EthannW/HunyuanOCR-1-5-DFlash \
    --image        /path/to/document.png \
    --num-spec-tokens 15

ℹ️ infer_dflash.py only verifies that the DFlash draft loads and produces a matching AR reference on a single image. Real speculative-decoding acceleration is only realized under vLLM, see below.

B. vLLM speculative decoding (recommended for real speedup)

MODEL_PATH=EthannW/HunyuanOCR-1-5 \
DFLASH_PATH=EthannW/HunyuanOCR-1-5-DFlash \
GPU=0 PORT=8001 GPU_MEM_UTIL=0.9 \
NUM_SPEC_TOKENS=15 \
bash inference/serve_dflash.sh

Under the hood the launch script passes:

--speculative-config '{"method":"dflash","model":"EthannW/HunyuanOCR-1-5-DFlash","num_speculative_tokens":15}'

to the vLLM entrypoint. Send an OpenAI-compatible request with the shipped single-image client:

python inference/infer_vllm_client.py \
    --host 127.0.0.1 --port 8001 \
    --model tencent/HunyuanOCR-1-5 \
    --image /path/to/document.png

C. llama.cpp (PC-side)

A DFlash-adapted llama.cpp fork is provided for CPU / consumer-GPU / laptop speculative decoding. See docs/llama_cpp.md in the GitHub repo for the full guide (GGUF conversion of both target + draft, llama-server launch, and a smoke-test client).


📦 Files in this repo

file purpose
model.safetensors draft weights (bfloat16)
config.json draft config; sets auto_map to dflash.DFlashDraftModel
dflash.py DFlashDraftModel implementation (loaded via trust_remote_code=True)
chat_template.jinja, tokenizer.json, tokenizer_config.json, processor_config.json tokenizer / processor, kept in sync with the target model

🔗 Related repositories


📜 License

HunyuanOCR-1.5 (including the DFlash draft) is released under the same license as HunyuanOCR 1.0 — the Tencent Hunyuan Community License Agreement.

⚠️ Preview notice. This draft checkpoint is a preview snapshot. The technical report and official model release will follow shortly; interfaces and weights may be updated before the final release.

Downloads last month
52
Safetensors
Model size
90.7M params
Tensor type
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for EthannW/HunyuanOCR-1-5-DFlash

Finetuned
(1)
this model