---
base_model:
- Dream-org/Dream-v0-Instruct-7B
datasets:
- liuhaotian/LLaVA-CC3M-Pretrain-595K
- lmms-lab/LLaVA-NeXT-Data
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
pipeline_tag: image-text-to-text
tags:
- Diffusion_Multimodal_Large_Language_Model
- MLLM
- Discrete_Diffusion
---
|
|
|
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/635364b3c41f548fe39db945/T6ffjtAkFkI76QjXmN6iR.png" alt="Dimple" style="width:100%;"/> |
|
|
|
|
|
|
|
|
<p align="center">
🤗 <a href="https://huggingface.co/rp-yu/Dimple-7B">Model</a>   |   💬 <a href="https://huggingface.co/spaces/rp-yu/Dimple-7B">Demo: Chat with Dimple</a>   |   📄 <a href="https://huggingface.co/papers/2505.16990">Paper</a>   |   ✨ <a href="https://github.com/yu-rp/Dimple">Code</a>
</p>
|
|
|
|
|
# 🧠 Dimple-7B
|
|
|
|
|
**Dimple** is the first Discrete Diffusion Multimodal Large Language Model (DMLLM), trained with a hybrid paradigm that combines autoregressive and diffusion-based instruction tuning. Its architecture is similar to Qwen and LLaVA, but it introduces an **autoregressive-then-diffusion** training strategy:
|
|
|
|
|
* **Stage 1**: Autoregressive fine-tuning for alignment and initial instruction tuning.
* **Stage 2**: Diffusion-based fine-tuning for enhanced instruction-following capabilities.
|
|
|
|
|
Trained on the same dataset as LLaVA-NeXT, **Dimple-7B surpasses LLaVA-NeXT-7B by 3.9%**, demonstrating that diffusion-based multimodal language models can match their autoregressive counterparts under a similar training budget.
|
|
|
|
|
--- |
|
|
|
|
|
## 🚀 Highlights
|
|
|
|
|
* **Hybrid Training**: Combines autoregressive and diffusion training.
* **Diffusion Decoding**: Supports confident decoding, random decoding, maskgit-style decoding, and entropy-based decoding (see the sketch after this list).
* **Controllable Generation**: Enables fine-grained control over format, structure, and length via structure priors.
* **Autoregressive-like Prefilling**: Improves inference speed via prefilling techniques.
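As a quick illustration, the sketch below varies the `alg` argument of `diffusion_generate` to switch between decoding strategies, reusing the generation kwargs from the inference example further down this card. The alternative `alg` names (`"maskgit_plus"`, `"entropy"`) are assumptions carried over from Dream-style generation utilities, not a confirmed Dimple API; check the repository for the definitive option list.

```python
# Hedged sketch: choosing a decoding strategy. "origin" is the value used in
# the inference example below; "maskgit_plus" and "entropy" are ASSUMED names
# borrowed from Dream-style generation utilities and may differ in Dimple.
gen_kwargs = dict(
    max_new_tokens=64,
    steps=64,                  # number of parallel diffusion refinement steps
    temperature=0.2,
    top_p=0.95,
    use_cache=True,            # enables the autoregressive-like prefilling
    decoding_pipeline="dim",
    return_dict_in_generate=True,
)

for alg in ("origin", "maskgit_plus", "entropy"):
    out = model.diffusion_generate(
        input_ids,                     # prompt tokens, prepared as in the example below
        alg=alg,
        alg_p_threshold=0.95,          # confidence threshold (assumption: used by confident decoding)
        use_original_confidence=True,
        **gen_kwargs,
        **inputs,
    )
```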
|
|
|
|
|
--- |
|
|
|
|
|
## 📊 Evaluation Results
|
|
|
|
|
| Benchmark | Dimple-7B (ours) | LLaVA-1.5-7B | LLaVA-NeXT-7B | Eagle-7B | Eagle2-9B | Qwen-VL-7B | Qwen2.5-VL-7B |
| --------------------- | ---------------- | ------------ | ------------- | -------- | --------- | ---------- | ------------- |
| **Training Samples** | 1.3M | 1.2M | 1.3M | 2.4M | 27.8M | 1.5B | - |
| **Training Tokens** | 0.8B | - | - | - | - | - | 2.6T |
| **Base LLM** | Dream (Qwen2.5) | Vicuna | Vicuna-1.5 | Vicuna | Qwen2.5 | Qwen | Qwen2.5 |
| **GQA** | 59.2 | 62.0 | 64.8 | 64.9 | - | 59.3 | - |
| **MMBench (en test)** | 74.6 | 64.3 | 68.7 | 68.4 | - | - | 83.5 |
| **MME (Perception)** | 1514 | 1510 | 1519 | 1528 | - | - | - |
| **MME (Cognition)** | 432 | - | 332 | - | - | - | - |
| **MME (Total)** | 1946 | - | 1851 | - | - | - | 2347 |
| **POPE** | 86.2 | 85.8 | 86.7 | 88.8 | - | - | - |
| **MMMU (val)** | 45.2 | - | 35.8 | 36.3 | 56.1 | - | 58.6 |
| **SQA (img)** | 77.1 | 66.8 | 72.8 | 70.0 | - | - | - |
| **AI2D** | 74.4 | - | 65.4 | - | 83.9 | 62.3 | 83.9 |
| **ChartQA** | 63.4 | - | 54.9 | 67.7 | 86.4 | 65.7 | 87.3 |
| **TextVQA** | 61.6 | - | 64.8 | - | 83.0 | - | - |
| **OCRBench** | 565 | - | 490 | 529 | - | - | - |
| **MathVista (mini)** | 42.3 | - | 33.0 | - | 63.8 | 37.0 | 68.2 |
| **MMVet** | 41.2 | 31.1 | 47.3 | - | 62.2 | - | 67.1 |
|
|
|
|
|
--- |
|
|
|
|
|
## 🛠️ Environment
|
|
|
|
|
Make sure your environment includes the following versions: |
|
|
|
|
|
```bash
transformers==4.46.2
torch==2.5.1
accelerate==1.6.0
```
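If you are starting from a fresh environment, the pinned versions above can be installed in one step (plain `pip` usage, nothing Dimple-specific):

```bash
pip install transformers==4.46.2 torch==2.5.1 accelerate==1.6.0
```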
|
|
|
|
|
--- |
|
|
|
|
|
## 🚀 Inference Example
|
|
|
|
|
```python
import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_name = "rp-yu/Dimple-7B"

# Load the processor and model; trust_remote_code is required because
# Dimple ships custom modeling/processing code with the checkpoint.
processor = AutoProcessor.from_pretrained(
    model_name,
    trust_remote_code=True,
)
model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Build a one-element batch of chat messages containing an image and a prompt.
image_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
messages = [
    [{"role": "user", "content": [
        {"type": "image", "image": image_url},
        {"type": "text", "text": "Describe this image."}
    ]}],
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, add_vision_id=False
)
images = [
    Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
]

inputs = processor(
    text=text,
    images=images,
    videos=None,
    padding="longest",
    return_tensors="pt",
)

# Diffusion-based parallel decoding: the response is refined over `steps`
# iterations rather than generated token by token; use_cache=True enables
# the autoregressive-like prefilling described in the highlights.
input_ids = inputs.pop("input_ids")
output = model.diffusion_generate(
    input_ids,
    max_new_tokens=64,
    output_history=True,
    return_dict_in_generate=True,
    steps=64,
    temperature=0.2,
    top_p=0.95,
    alg="origin",
    use_cache=True,
    alg_p_threshold=0.95,
    use_original_confidence=True,
    decoding_pipeline="dim",
    **inputs,
)

# Strip the prompt tokens and decode only the newly generated text.
generations = [
    processor.tokenizer.decode(g[len(p):].cpu().tolist())
    for p, g in zip(input_ids, output.sequences)
]

for j in range(len(messages)):
    print("output:", j, generations[j].split(processor.tokenizer.eos_token)[0])

# output: 0 In the image, a woman wearing a shirt with a plaid and a dog are sitting together on a beach. The sun appears to be setting in the background, creating a warm and serene atmosphere.
```
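The example above runs on CPU by default. For GPU inference, standard `transformers` device handling applies; a minimal sketch, assuming a single CUDA device, is to move the model and the processor outputs before popping `input_ids`:

```python
# Move model weights and all tensor inputs onto the GPU
# (BatchFeature.to(device) moves every tensor it contains).
model = model.to("cuda").eval()
inputs = inputs.to("cuda")
input_ids = inputs.pop("input_ids")
```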
|
|
|
|
|
--- |
|
|
|
|
|
## 📜 Citation
|
|
|
|
|
```bibtex
@misc{dimple,
  title={Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding},
  author={Runpeng Yu and Xinyin Ma and Xinchao Wang},
  year={2025},
  eprint={2505.16990},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.16990},
}
```