---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
tags:
- diffusion
- vlm
- block-diffusion
- parallel-decoding
---

# Fast-dVLM (3B) — Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM

[[Paper](https://arxiv.org/abs/2604.06832)] [[Project Page](https://nvlabs.github.io/Fast-dLLM/fast_dvlm/)] [[Code](https://github.com/NVlabs/Fast-dLLM)] [[Fast-dLLM v2](https://huggingface.co/Efficient-Large-Model/Fast_dLLM_1.5B)]

## Introduction

Vision-language models (VLMs) predominantly rely on autoregressive decoding, which generates tokens one at a time and fundamentally limits inference throughput. This limitation is especially acute in **physical AI scenarios** such as robotics and autonomous driving, where VLMs are deployed on edge devices at batch size one.

**Fast-dVLM** is a block-diffusion-based VLM that enables **KV-cache-compatible parallel decoding** and **speculative block decoding** for inference acceleration. Built on **Qwen2.5-VL-3B-Instruct**, Fast-dVLM converts the pretrained AR VLM directly into a block-diffusion model in a single training stage, leveraging the VLM's existing multimodal alignment.

### Key Highlights

- **Lossless Quality**: Matches the AR baseline (Qwen2.5-VL-3B) across **11 multimodal benchmarks** (74.0 avg).
- **Up to 6.18x Speedup**: With SGLang integration and FP8 quantization.
- **2.63x Tokens/NFE**: With self-speculative block decoding.
- **Direct Conversion**: Single-stage AR-to-diffusion conversion outperforms the two-stage approach (73.3 vs. 60.2 avg).

### Key Techniques

- **Block-Size Annealing**: Curriculum that progressively increases the block size during training.
- **Causal Context Attention**: Noisy tokens attend bidirectionally within their block (N2N) and to clean tokens from preceding blocks (N2C), while clean tokens use standard causal attention (C2C).
- **Auto-Truncation Masking**: Prevents cross-turn leakage in multi-turn dialogue.
- **Vision-Efficient Concatenation**: Vision embeddings are included only in the clean stream, reducing peak memory by 15% and training time by 14.2%.
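
The causal-context attention pattern above can be sketched as a boolean mask. This is an illustrative sketch only (the function name, the two-stream layout, and the strictly-earlier-block rule for N2C are assumptions drawn from the bullet description, not the released implementation):

```python
import numpy as np

def causal_context_mask(num_blocks: int, block: int) -> np.ndarray:
    """Illustrative causal-context attention mask for block diffusion.

    The sequence interleaves a clean stream (positions 0..L-1) and a
    noisy stream (positions L..2L-1), each of length num_blocks * block.
    Entry [i, j] is True if query position i may attend to key position j.
    """
    L = num_blocks * block
    mask = np.zeros((2 * L, 2 * L), dtype=bool)
    blk = np.arange(L) // block  # block index of each position

    # C2C: clean tokens use standard causal attention.
    mask[:L, :L] = np.arange(L)[:, None] >= np.arange(L)[None, :]

    # N2N: noisy tokens attend bidirectionally within their own block.
    mask[L:, L:] = blk[:, None] == blk[None, :]

    # N2C: noisy tokens attend to clean tokens of strictly earlier blocks.
    mask[L:, :L] = blk[:, None] > blk[None, :]

    return mask

m = causal_context_mask(num_blocks=3, block=4)
```

With 3 blocks of 4 tokens, for example, noisy position 0 sees its own block's noisy tokens but no clean tokens, while noisy positions in block 1 additionally see clean block 0.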

---

## Model Overview

| Property | Value |
|---|---|
| **Type** | Block Diffusion Vision-Language Model |
| **Base Model** | `Qwen/Qwen2.5-VL-3B-Instruct` |
| **Architecture** | Transformer w/ M-RoPE, SwiGLU, RMSNorm, GQA |
| **Text Layers** | 36 |
| **Vision Depth** | 32 |
| **Text Hidden Size** | 2048 |
| **Attention Heads** | 16 (Q), 2 (KV, GQA) |
| **Block Diffusion Size** | 32 |
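
As a quick sanity check on the GQA configuration above, the implied per-head dimension and KV-cache savings can be computed directly (this assumes the standard `head_dim = hidden_size / num_q_heads` convention, which may differ from the actual config):

```python
hidden_size, n_q_heads, n_kv_heads = 2048, 16, 2

# Per-head dimension under the usual convention.
head_dim = hidden_size // n_q_heads

# GQA shrinks the KV cache by the query-to-KV head ratio vs. full MHA.
kv_cache_reduction = n_q_heads // n_kv_heads

# Bytes cached per token per layer (K and V, fp16 = 2 bytes/element).
kv_bytes_per_token_per_layer = 2 * n_kv_heads * head_dim * 2
```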

---

## Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

model_name = "Efficient-Large-Model/Fast_dVLM_3B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name, use_fast=False)
processor.tokenizer = tokenizer

prompt = "Describe this image in detail."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": prompt},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Token id of the diffusion mask placeholder used during denoising
mask_id = tokenizer.encode("|<MASK>|")[0]

generated_ids = model.generate(
    input_ids=inputs.input_ids,
    tokenizer=tokenizer,
    pixel_values=inputs.pixel_values,
    image_grid_thw=inputs.image_grid_thw,
    mask_id=mask_id,
    max_tokens=512,
)

# Strip the prompt tokens, keeping only the newly generated continuation
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

---

## Benchmark Results

Fast-dVLM matches the AR baseline on 11 multimodal benchmarks while achieving 2.63x Tokens/NFE with speculative decoding.

| Model | AI2D | ChartQA | DocVQA | GQA | MMBench | MMMU | POPE | RWQA | SEED2+ | TextVQA | Avg | MMMU-Pro-V | Tok/NFE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-3B | 80.8 | 84.0 | 93.1 | 59.0 | 76.9 | 47.3 | 86.2 | 65.1 | 68.6 | 79.1 | 74.0 | 26.3 | 1.00 |
| **Fast-dVLM (MDM)** | 79.7 | 82.8 | 92.1 | 63.0 | 74.2 | 44.6 | 88.6 | 65.1 | 67.2 | 76.1 | 73.3 | 21.4 | 1.95 |
| **Fast-dVLM (spec.)** | 79.7 | 83.1 | 92.9 | 63.3 | 74.3 | 46.6 | 88.6 | 65.1 | 67.2 | 79.3 | **74.0** | 24.6 | **2.63** |
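
Tokens/NFE counts how many output tokens are committed per network function evaluation (one model forward pass); values above 1.00 come from accepting multiple confident tokens per pass. A minimal toy sketch of such a confidence-thresholded parallel-decoding loop (the `decode_block` function and the random stand-in model are illustrative, not the released decoder):

```python
import numpy as np

rng = np.random.default_rng(0)

def decode_block(get_probs, block: int = 32, threshold: float = 0.9):
    """Toy parallel decoder for one block of masked positions.

    Each loop iteration is one NFE: the model scores all still-masked
    positions, and every position whose confidence clears the threshold
    is committed in parallel (at least one per step, so the loop always
    terminates within `block` iterations).
    """
    tokens = [None] * block
    nfes = 0
    while any(t is None for t in tokens):
        nfes += 1
        masked = [i for i, t in enumerate(tokens) if t is None]
        probs, ids = get_probs(masked)  # one forward pass over masked slots
        confident = [j for j, p in enumerate(probs) if p >= threshold]
        if not confident:               # fall back: commit the single best token
            confident = [int(np.argmax(probs))]
        for j in confident:
            tokens[masked[j]] = int(ids[j])
    return tokens, block / nfes         # tokens committed per NFE

# Toy "model": random confidences and token ids.
fake = lambda masked: (rng.random(len(masked)), rng.integers(0, 1000, len(masked)))
out, tok_per_nfe = decode_block(fake, block=32, threshold=0.9)
```

With a real model, a higher threshold `t` trades fewer accepted tokens per pass for quality, which is the knob shown as `t=0.9` in the acceleration table below.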

### Inference Acceleration

| Setting | MMMU-Pro-V | TPS | SpeedUp |
|---|---|---|---|
| AR baseline | 26.3 | 56.7 | 1.00x |
| Fast-dVLM (MDM, t=0.9) | 21.4 | 82.2 | 1.45x |
| + Spec. decoding (linear) | 24.6 | 112.7 | 1.98x |
| + SGLang serving | 24.1 | 319.0 | 5.63x |
| + SmoothQuant-W8A8 (FP8) | 23.8 | **350.3** | **6.18x** |

---

## Citation

If you use Fast-dVLM in your research, please cite:

```bibtex
@misc{wu2026fastdvlmefficientblockdiffusionvlm,
      title={Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM},
      author={Chengyue Wu and Shiyi Lan and Yonggan Fu and Sensen Gao and Jin Wang and Jincheng Yu and Jose M. Alvarez and Pavlo Molchanov and Ping Luo and Song Han and Ligeng Zhu and Enze Xie},
      year={2026},
      eprint={2604.06832},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.06832},
}
```

---

## License

Released under **Apache 2.0**, following the base Qwen2.5-VL license.