---
base_model:
- Dream-org/Dream-v0-Instruct-7B
datasets:
- liuhaotian/LLaVA-CC3M-Pretrain-595K
- lmms-lab/LLaVA-NeXT-Data
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
pipeline_tag: image-text-to-text
tags:
- Diffusion_Multimodal_Large_Language_Model
- MLLM
- Discrete_Diffusion
---
<img src="https://cdn-uploads.huggingface.co/production/uploads/635364b3c41f548fe39db945/T6ffjtAkFkI76QjXmN6iR.png" alt="Dimple" style="width:100%;"/>
<p align="center">
🤗 <a href="https://huggingface.co/rp-yu/Dimple-7B">Model</a> &nbsp;|&nbsp; 💬 <a href="https://huggingface.co/spaces/rp-yu/Dimple-7B">Demo: Chat with Dimple</a> &nbsp;|&nbsp; 📄 <a href="https://huggingface.co/papers/2505.16990">Paper</a> &nbsp;|&nbsp; ✨ <a href="https://github.com/yu-rp/Dimple">Code</a>
</p>
# 🧠 Dimple-7B
**Dimple** is the first Discrete Diffusion Multimodal Large Language Model (DMLLM) that leverages a hybrid training paradigm combining autoregressive and diffusion-based instruction tuning. The model architecture is similar to Qwen and LLaVA, while introducing an **autoregressive-then-diffusion** training strategy:
* **Stage 1**: Autoregressive fine-tuning for alignment and initial instruction tuning.
* **Stage 2**: Diffusion-based fine-tuning for enhanced instruction-following capabilities.
Trained on the same dataset as LLaVA-NeXT, **Dimple-7B surpasses LLaVA-NeXT-7B by 3.9%**, demonstrating that diffusion-based multimodal language models can match their autoregressive counterparts under a comparable training budget.
---
## 🚀 Highlights
* **Hybrid Training**: Combines autoregressive and diffusion training.
* **Diffusion Decoding**: Supports confident decoding, random decoding, MaskGIT-style decoding, and entropy-based decoding.
* **Controllable Generation**: Enables fine-grained control over format, structure, and length via structure priors.
* **Autoregressive-like Prefilling**: Enhances inference speed using prefilling techniques.
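As a rough intuition for the parallel-decoding budget (this helper is illustrative only, not part of the Dimple API): with `max_new_tokens` tokens to fill in `steps` denoising steps, the average number of tokens committed per step is their ratio, so running fewer steps than tokens forces several tokens to be committed in parallel.

```python
# Illustrative sketch (not part of the Dimple API): average number of
# tokens committed per diffusion denoising step for a generation budget.
def avg_tokens_per_step(max_new_tokens: int, steps: int) -> float:
    """steps == max_new_tokens paces like one-token-at-a-time decoding;
    fewer steps means several tokens are committed per step on average."""
    return max_new_tokens / steps

print(avg_tokens_per_step(64, 64))  # 1.0 -> autoregressive-like pace
print(avg_tokens_per_step(64, 16))  # 4.0 -> four tokens per step on average
```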
---
## 📊 Evaluation Results
| Benchmark | Dimple-7B (ours) | LLaVA-1.5-7B | LLaVA-NEXT-7B | Eagle-7B | Eagle2-9B | Qwen-VL-7B | Qwen2.5-VL-7B |
| --------------------- | ---------------- | ------------ | ------------- | -------- | --------- | ---------- | ------------- |
| **Training Samples** | 1.3M | 1.2M | 1.3M | 2.4M | 27.8M | 1.5B | - |
| **Training Tokens** | 0.8B | - | - | - | - | - | 2.6T |
| **Base LLM** | Dream (Qwen2.5) | Vicuna | Vicuna-1.5 | Vicuna | Qwen2.5 | Qwen | Qwen2.5 |
| **GQA** | 59.2 | 62.0 | 64.8 | 64.9 | - | 59.3 | - |
| **MMBench (en test)** | 74.6 | 64.3 | 68.7 | 68.4 | - | - | 83.5 |
| **MME (Perception)** | 1514 | 1510 | 1519 | 1528 | - | - | - |
| **MME (Cognition)** | 432 | - | 332 | - | - | - | - |
| **MME (Total)** | 1946 | - | 1851 | - | - | - | 2347 |
| **POPE** | 86.2 | 85.8 | 86.7 | 88.8 | - | - | - |
| **MMMU (val)** | 45.2 | - | 35.8 | 36.3 | 56.1 | - | 58.6 |
| **SQA (img)** | 77.1 | 66.8 | 72.8 | 70.0 | - | - | - |
| **AI2D** | 74.4 | - | 65.4 | - | 83.9 | 62.3 | 83.9 |
| **ChartQA** | 63.4 | - | 54.9 | 67.7 | 86.4 | 65.7 | 87.3 |
| **TextVQA** | 61.6 | - | 64.8 | - | 83.0 | - | - |
| **OCRBench** | 565 | - | 490 | 529 | - | - | - |
| **MathVista (mini)** | 42.3 | - | 33.0 | - | 63.8 | 37.0 | 68.2 |
| **MMVet** | 41.2 | 31.1 | 47.3 | - | 62.2 | - | 67.1 |
---
## 🛠️ Environment
Make sure your environment includes the following versions:
```bash
transformers==4.46.2
torch==2.5.1
accelerate==1.6.0
```
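Assuming a fresh virtual environment, the pins above can be installed in one step:

```shell
# Install the pinned versions listed above into the active environment.
pip install transformers==4.46.2 torch==2.5.1 accelerate==1.6.0
```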
---
## ⚡ Inference Example
```python
import torch
from transformers import AutoProcessor, AutoModel
import requests
from PIL import Image

model_name = "rp-yu/Dimple-7B"

# Load the processor and model. trust_remote_code is required because
# Dimple ships custom modeling code alongside the checkpoint.
processor = AutoProcessor.from_pretrained(
    model_name,
    trust_remote_code=True,
)
model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

image_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
messages = [
    [{"role": "user", "content": [
        {"type": "image", "image": image_url},
        {"type": "text", "text": "Describe this image."}
    ]}],
]

# Build the chat-formatted prompt and preprocess the image.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, add_vision_id=False
)
images = [
    Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
]
inputs = processor(
    text=text,
    images=images,
    videos=None,
    padding="longest",
    return_tensors="pt",
)
input_ids = inputs.pop("input_ids")

# Diffusion decoding: 64 new tokens are filled in over 64 denoising steps.
output = model.diffusion_generate(
    input_ids,
    max_new_tokens=64,
    output_history=True,
    return_dict_in_generate=True,
    steps=64,
    temperature=0.2,
    top_p=0.95,
    alg="origin",
    use_cache=True,
    alg_p_threshold=0.95,
    use_original_confidence=True,
    decoding_pipeline="dim",
    **inputs,
)

# Strip the prompt tokens from each sequence, then decode only the response.
generations = [
    processor.tokenizer.decode(g[len(p):].cpu().tolist())
    for p, g in zip(input_ids, output.sequences)
]
for j in range(len(messages)):
    print("output:", j, generations[j].split(processor.tokenizer.eos_token)[0])
# output: 0 In the image, a woman wearing a shirt with a plaid and a dog are sitting together on a beach. The sun appears to be setting in the background, creating a warm and serene atmosphere.
```
---
## 📖 Citation
```
@misc{dimple,
title={Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding},
author={Runpeng Yu and Xinyin Ma and Xinchao Wang},
year={2025},
eprint={2505.16990},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.16990},
}
``` |