---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
tags:
- diffusion
- vlm
- block-diffusion
- parallel-decoding
---

# Fast-dVLM (3B) — Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM

[[Paper](https://arxiv.org/abs/2604.06832)] [[Project Page](https://nvlabs.github.io/Fast-dLLM/fast_dvlm/)] [[Code](https://github.com/NVlabs/Fast-dLLM)] [[Fast-dLLM v2](https://huggingface.co/Efficient-Large-Model/Fast_dLLM_1.5B)]

## Introduction

Vision-language models (VLMs) predominantly rely on autoregressive decoding, which generates tokens one at a time and fundamentally limits inference throughput. This limitation is especially acute in **physical AI scenarios** such as robotics and autonomous driving, where VLMs are deployed on edge devices at batch size one.

**Fast-dVLM** is a block-diffusion-based VLM that enables **KV-cache-compatible parallel decoding** and **speculative block decoding** for inference acceleration. Built on **Qwen2.5-VL-3B-Instruct**, Fast-dVLM converts the pretrained AR VLM directly into a block-diffusion model in a single stage, leveraging the base model's existing multimodal alignment.
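To build intuition for why parallel decoding raises throughput: within a block, each forward pass scores every still-masked position, and all positions whose top-token confidence clears a threshold are committed at once. The sketch below is a minimal, illustrative reconstruction of this idea in NumPy; the function names, the toy scorer, and the threshold value are assumptions, not Fast-dVLM's actual API.

```python
import numpy as np

def decode_block_parallel(score_fn, block, mask_id, threshold=0.9):
    """Fill every masked position in `block`, committing in parallel all
    positions whose top-token probability is >= threshold; if none clears
    the bar, commit the single most confident one so decoding terminates.
    Returns the completed block and the number of forward passes (NFEs)."""
    block = block.copy()
    nfes = 0
    while (block == mask_id).any():
        probs = score_fn(block)              # (block_len, vocab_size) probabilities
        nfes += 1
        masked = block == mask_id
        top_tok = probs.argmax(axis=-1)
        top_p = probs.max(axis=-1)
        commit = masked & (top_p >= threshold)
        if not commit.any():
            best = np.flatnonzero(masked)[np.argmax(top_p[masked])]
            commit = np.zeros_like(masked)
            commit[best] = True
        block[commit] = top_tok[commit]
    return block, nfes

# Toy scorer: a fixed distribution per position (a real model would
# condition on the clean prefix and the partially filled block).
probs_table = np.array([
    [0.01, 0.95, 0.02, 0.02],   # position 0: confident -> token 1
    [0.20, 0.10, 0.60, 0.10],   # position 1: uncertain -> token 2
    [0.02, 0.02, 0.04, 0.92],   # position 2: confident -> token 3
    [0.50, 0.20, 0.20, 0.10],   # position 3: uncertain -> token 0
])
out, nfes = decode_block_parallel(lambda b: probs_table, np.full(4, -1), mask_id=-1)
print(out, nfes)   # 4 tokens committed in 3 forward passes instead of 4
```

The two confident positions are committed together in the first pass, which is where the tokens-per-NFE gain over one-token-at-a-time AR decoding comes from.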

### Key Highlights

- **Lossless Quality**: Matches the AR baseline (Qwen2.5-VL-3B) across **11 multimodal benchmarks** (74.0 avg).
- **Up to 6.18x Speedup**: With SGLang integration and FP8 quantization.
- **2.63x Tokens/NFE**: With self-speculative block decoding.
- **Direct Conversion**: The single-stage AR-to-diffusion conversion outperforms a two-stage approach (73.3 vs. 60.2 avg).

### Key Techniques

- **Block-Size Annealing**: A curriculum that progressively increases the block size during training.
- **Causal Context Attention**: Noisy tokens attend bidirectionally within their block (N2N) and to clean tokens from preceding blocks (N2C), while clean tokens use standard causal attention (C2C).
- **Auto-Truncation Masking**: Prevents cross-turn leakage in multi-turn dialogue.
- **Vision-Efficient Concatenation**: Vision embeddings are included only in the clean stream, reducing peak memory by 15% and training time by 14.2%.
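The three attention patterns above can be pictured as a boolean mask over a two-stream (clean + noisy) token layout. This is an illustrative reconstruction from the bullet point, not the model's actual implementation; all names and the layout convention are assumptions.

```python
import numpy as np

def block_diffusion_mask(n_blocks: int, block_size: int) -> np.ndarray:
    """Toy boolean attention mask over a clean + noisy two-stream layout.

    Rows/cols 0..L-1 are the clean stream, rows L..2L-1 the noisy stream,
    with L = n_blocks * block_size.  True means "may attend".
    """
    L = n_blocks * block_size
    blk = np.arange(L) // block_size          # block index of each position
    mask = np.zeros((2 * L, 2 * L), dtype=bool)
    # C2C: clean tokens use ordinary causal attention
    mask[:L, :L] = np.arange(L)[:, None] >= np.arange(L)[None, :]
    # N2C: a noisy token in block b attends to clean tokens in blocks < b
    mask[L:, :L] = blk[:, None] > blk[None, :]
    # N2N: noisy tokens attend bidirectionally within their own block
    mask[L:, L:] = blk[:, None] == blk[None, :]
    return mask

m = block_diffusion_mask(n_blocks=2, block_size=2)
print(m.astype(int))
```

Because noisy tokens only read clean tokens from *earlier* blocks, the clean stream's KV cache from completed blocks can be reused across denoising steps, which is what makes the scheme KV-cache compatible.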

---

## Model Overview

| Property | Value |
|---|---|
| **Type** | Block Diffusion Vision-Language Model |
| **Base Model** | `Qwen/Qwen2.5-VL-3B-Instruct` |
| **Architecture** | Transformer w/ M-RoPE, SwiGLU, RMSNorm, GQA |
| **Text Layers** | 36 |
| **Vision Depth** | 32 |
| **Text Hidden Size** | 2048 |
| **Attention Heads** | 16 (Q), 2 (KV, GQA) |
| **Block Diffusion Size** | 32 |

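In the grouped-query attention (GQA) layout above, the 2 KV heads are each shared by a group of 8 query heads, shrinking the KV cache 8x versus full multi-head attention. A shapes-only sketch of the broadcast (illustrative; the head dimension of 128 is derived as 2048 hidden / 16 heads, and all variable names are assumptions):

```python
import numpy as np

n_q_heads, n_kv_heads, head_dim, seq_len = 16, 2, 128, 4
group = n_q_heads // n_kv_heads          # 8 query heads share each KV head

k = np.random.randn(n_kv_heads, seq_len, head_dim)
# Repeat each KV head across its query-head group before computing attention
k_for_q = np.repeat(k, group, axis=0)    # shape (16, seq_len, head_dim)

# Query heads 0..7 all read KV head 0; heads 8..15 read KV head 1
assert np.array_equal(k_for_q[0], k[0]) and np.array_equal(k_for_q[7], k[0])
assert np.array_equal(k_for_q[8], k[1])
```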

---

## Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

model_name = "Efficient-Large-Model/Fast_dVLM_3B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name, use_fast=False)
processor.tokenizer = tokenizer

prompt = "Describe this image in detail."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": prompt},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Token id of the diffusion [MASK] placeholder used for block decoding
mask_id = tokenizer.encode("|<MASK>|")[0]

generated_ids = model.generate(
    input_ids=inputs.input_ids,
    tokenizer=tokenizer,
    pixel_values=inputs.pixel_values,
    image_grid_thw=inputs.image_grid_thw,
    mask_id=mask_id,
    max_tokens=512,
)

# Strip the prompt tokens, keeping only the newly generated completion
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

---

## Benchmark Results

Fast-dVLM matches the AR baseline on 11 multimodal benchmarks while achieving 2.63x Tokens/NFE with speculative decoding.
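Tokens/NFE is the number of generated tokens divided by the number of model forward passes (network function evaluations); an AR decoder scores 1.00 by definition since it commits exactly one token per pass. A quick illustration of the metric (the 512/195 split below is an example chosen to land near 2.63, not a figure from the paper):

```python
def tokens_per_nfe(n_tokens: int, n_forward_passes: int) -> float:
    """Decoding efficiency: tokens committed per model forward pass (NFE)."""
    return n_tokens / n_forward_passes

# AR decoding: one token per forward pass
print(tokens_per_nfe(512, 512))              # 1.0
# Parallel/speculative decoding: the same 512 tokens in ~195 passes
print(round(tokens_per_nfe(512, 195), 2))    # 2.63
```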

| Model | AI2D | ChartQA | DocVQA | GQA | MMBench | MMMU | POPE | RWQA | SEED2+ | TextVQA | Avg | MMMU-Pro-V | Tok/NFE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-3B | 80.8 | 84.0 | 93.1 | 59.0 | 76.9 | 47.3 | 86.2 | 65.1 | 68.6 | 79.1 | 74.0 | 26.3 | 1.00 |
| **Fast-dVLM (MDM)** | 79.7 | 82.8 | 92.1 | 63.0 | 74.2 | 44.6 | 88.6 | 65.1 | 67.2 | 76.1 | 73.3 | 21.4 | 1.95 |
| **Fast-dVLM (spec.)** | 79.7 | 83.1 | 92.9 | 63.3 | 74.3 | 46.6 | 88.6 | 65.1 | 67.2 | 79.3 | **74.0** | 24.6 | **2.63** |

### Inference Acceleration

| Setting | MMMU-Pro-V | TPS | SpeedUp |
|---|---|---|---|
| AR baseline | 26.3 | 56.7 | 1.00x |
| Fast-dVLM (MDM, t=0.9) | 21.4 | 82.2 | 1.45x |
| + Spec. decoding (linear) | 24.6 | 112.7 | 1.98x |
| + SGLang serving | 24.1 | 319.0 | 5.63x |
| + SmoothQuant-W8A8 (FP8) | 23.8 | **350.3** | **6.18x** |

---

## Citation

If you use Fast-dVLM in your research, please cite:

```bibtex
@misc{wu2026fastdvlmefficientblockdiffusionvlm,
      title={Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM},
      author={Chengyue Wu and Shiyi Lan and Yonggan Fu and Sensen Gao and Jin Wang and Jincheng Yu and Jose M. Alvarez and Pavlo Molchanov and Ping Luo and Song Han and Ligeng Zhu and Enze Xie},
      year={2026},
      eprint={2604.06832},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.06832},
}
```

---

## License

Released under **Apache 2.0**, following the base Qwen2.5-VL license.