xiazhi committed · Commit 39fee1d · verified · 1 Parent(s): 04ad6f5

Upload folder using huggingface_hub

README.md CHANGED
@@ -3,93 +3,31 @@ license: apache-2.0
  tags:
  - diffusion
  - vision-language
- - diffusivonvl
  - qwen2.5-vl
- library_name: transformers
- pipeline_tag: image-text-to-text
  ---
 
- <div align="center">
 
- <h1>DiffusionVL: Translating Any Autoregressive Models into <br> Diffusion Vision Language Models</h1>
 
- **_SOTA dVLM Performance with <5% Data & 2.0× Inference Speedup!_**
-
- [Lunbin Zeng](https://github.com/xiazhi1)<sup>1,\*</sup>, [Jingfeng Yao](https://github.com/JingfengYao)<sup>1,\*</sup>, [Bencheng Liao](https://github.com/LegendBC)<sup>1</sup>, [Hongyuan Tao](https://github.com/Hongyuan-Tao)<sup>1</sup>, [Wenyu Liu](https://scholar.google.com/citations?user=D7jDk7gAAAAJ&hl=en)<sup>1</sup>, [Xinggang Wang](https://xwcv.github.io)<sup>1,✉️</sup>
-
- <sup>1</sup>Huazhong University of Science and Technology
-
- <sup>*</sup>equal contribution, <sup>✉️</sup>corresponding author, xgwang@hust.edu.cn
-
- [![arXiv](https://img.shields.io/badge/arXiv-DiffusionVL-b31b1b.svg)](https://arxiv.org/abs/2512.15713) <a href="https://github.com/hustvl/DiffusionVL"><img src="https://img.shields.io/badge/GitHub-Repository-black?logo=github" alt="GitHub"></a> <a href="https://huggingface.co/collections/hustvl/diffusionvl"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue" alt="Hugging Face"></a>
-
- </div>
-
- ## 📰 News
-
- - **[2025.12.18]** 🎉 Our paper **DiffusionVL** is released on arXiv! We also release the DiffusionVL models translated from Qwen2.5-VL on Hugging Face. The training code and more models are coming soon!
-
- ## 📄 Introduction
-
- The diffusion paradigm has emerged as a promising alternative to autoregressive (AR) models, offering the potential for efficient parallel decoding. However, existing diffusion vision language models (dVLMs) largely lag behind mainstream autoregressive vision language models in performance, primarily due to the capability limitations of their base diffusion language models.
-
- DiffusionVL bridges this gap by answering a fundamental question: ***Can we directly translate any existing autoregressive model into a powerful diffusion vision language model?*** We propose a diffusion finetuning framework that "translates" any pretrained AR model into a diffusion vision language model through a simple paradigm shift and modality shift. Unlike prior dVLMs restricted to fixed generation lengths, DiffusionVL introduces a novel block decoding strategy that allows arbitrary-length generation and KV-cache reuse. With this integrated design, despite training on less than 5% of the data required by previous methods, DiffusionVL translated from AR-VLMs achieves state-of-the-art performance among existing dVLMs and delivers a 2.0× inference speedup.
-
- ## ✨ Highlights
-
- - **Universal Translation Framework:** Translates any AR model into a dVLM with a simple yet effective approach.
-
- - **Superior Performance:** Achieves SOTA dVLM performance using <5% of the training data (738K vs. 16.5M samples).
-
- - **2.0× Faster Inference:** The block decoding strategy enables KV-cache reuse and a 2.0× speedup over previous dVLMs.
-
- <div align="center">
- <img src="https://github.com/hustvl/DiffusionVL/raw/main/assets/benchmark.png" alt="Benchmark Image" width="800">
- <img src="https://github.com/hustvl/DiffusionVL/raw/main/assets/framework.png" alt="Framework" width="800">
- </div>
-
- ### 🎯 Inference with Pre-trained Models
-
- - **Download Pre-trained Models:**
-
- | Model | Base Model | Download |
- | :--- | :--- | :--- |
- | **DiffusionVL-Qwen2.5VL-3B** | Qwen2.5-VL-3B | [HuggingFace](https://huggingface.co/hustvl/DiffusionVL-Qwen2.5VL-3B) |
- | **DiffusionVL-Qwen2.5VL-7B** | Qwen2.5-VL-7B | [HuggingFace](https://huggingface.co/hustvl/DiffusionVL-Qwen2.5VL-7B) |
-
- - **Environment Setup:**
-
- The core environment packages are listed as follows:
- ```
- torch==2.6.0
- torchvision==0.21.0
- torchaudio==2.6.0
- transformers==4.55.0
- accelerate==1.10.1
- pillow==10.4.0
- requests==2.32.5
- ```
-
- - **Quick Start:**
 
  ```python
- from transformers import AutoModelForCausalLM, AutoProcessor
  import torch
 
  # Load model with trust_remote_code
  model = AutoModelForCausalLM.from_pretrained(
- "hustvl/DiffusionVL-Qwen2.5VL-7B",
  torch_dtype=torch.bfloat16,
  device_map="auto",
  trust_remote_code=True
  )
 
  # Load processor (includes tokenizer)
- processor = AutoProcessor.from_pretrained("hustvl/DiffusionVL-Qwen2.5VL-7B", trust_remote_code=True)
 
  from PIL import Image
  import requests
 
@@ -110,7 +48,7 @@ output_ids = model.generate(
  inputs=inputs["input_ids"],
  images=inputs.get("pixel_values"),
  image_grid_thws=inputs.get("image_grid_thw"),
- gen_length=128,
  steps=8,
  temperature=0.0,
  remasking_strategy="low_confidence_static",
@@ -119,23 +57,27 @@ output_ids = model.generate(
  # Decode output
  output_text = processor.decode(output_ids[0], skip_special_tokens=True)
  print(output_text)
-
  ```
 
- ## ❤️ Acknowledgements
 
- This repo is mainly built on [Qwen2.5-VL](https://github.com/QwenLM/Qwen3-VL), [LLaDA-V](https://github.com/ML-GSAI/LLaDA-V), [BD3LMs](https://github.com/kuleshov-group/bd3lms) and [SDAR](https://github.com/JetAstra/SDAR). We thank the authors for their open-source contributions.
 
- ## 📝 Citation
- If you find our work useful, please cite our paper:
- ```
- @misc{zeng2025diffusionvltranslatingautoregressivemodels,
- title={DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models},
- author={Lunbin Zeng and Jingfeng Yao and Bencheng Liao and Hongyuan Tao and Wenyu Liu and Xinggang Wang},
- year={2025},
- eprint={2512.15713},
- archivePrefix={arXiv},
- primaryClass={cs.CV},
- url={https://arxiv.org/abs/2512.15713},
- }
- ```
  tags:
  - diffusion
  - vision-language
  - qwen2.5-vl
  ---
 
+ # DiffusionVL
 
+ DiffusionVL is a vision-language model based on the Qwen2.5-VL architecture with BD3LM diffusion-based generation.
 
+ ## Usage
 
  ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
  import torch
 
  # Load model with trust_remote_code
  model = AutoModelForCausalLM.from_pretrained(
+ "path/to/model",
  torch_dtype=torch.bfloat16,
  device_map="auto",
  trust_remote_code=True
  )
 
  # Load processor (includes tokenizer)
+ processor = AutoProcessor.from_pretrained("path/to/model", trust_remote_code=True)
 
+ # Image + text generation
  from PIL import Image
  import requests
 
  inputs=inputs["input_ids"],
  images=inputs.get("pixel_values"),
  image_grid_thws=inputs.get("image_grid_thw"),
+ gen_length=256,
  steps=8,
  temperature=0.0,
  remasking_strategy="low_confidence_static",
 
  # Decode output
  output_text = processor.decode(output_ids[0], skip_special_tokens=True)
  print(output_text)
  ```
 
+ ## Generation Parameters
 
+ - `gen_length`: Number of tokens to generate (default: 256)
+ - `steps`: Number of diffusion steps per block (default: 8)
+ - `temperature`: Sampling temperature, 0 for greedy (default: 0.0)
+ - `top_k`: Top-k sampling parameter (default: 0, disabled)
+ - `top_p`: Top-p (nucleus) sampling parameter (default: 1.0)
+ - `remasking_strategy`: 'low_confidence_static', 'low_confidence_dynamic', or 'sequential' (default: 'low_confidence_static')
 
+ ## Model Configuration
+
+ - **Architecture**: DiffusionVL_Qwen2_5_VL_ForConditionalGeneration
+ - **BD3LM Enabled**: True
+ - **Block Size**: 8
+ - **Hidden Size**: 2048
+ - **Num Layers**: 36
+
+ ## Notes
+
+ - The model uses `trust_remote_code=True` because it includes custom modeling code
+ - Both the model and the processor can be loaded from the same directory
+ - Image preprocessing uses Qwen2VLImageProcessor internally (identical to Qwen2.5-VL)
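
The README section above lists the generation parameters together with a block size of 8. As a rough, illustrative sketch of how these defaults interact under block decoding (the arithmetic below is an assumption for orientation, not code from the repository):

```python
# Illustrative only: relate the documented defaults under block decoding.
gen_length = 256        # tokens to generate (README default)
block_size = 8          # BD3LM block size (README "Model Configuration")
steps_per_block = 8     # diffusion steps per block (README default)

num_blocks = gen_length // block_size             # 32 blocks, generated left to right
denoising_passes = num_blocks * steps_per_block   # up to 256 forward passes in total,
# each over the current block only, with earlier blocks reused from the KV cache
print(num_blocks, denoising_passes)
```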
config.json CHANGED
@@ -95,9 +95,10 @@
  "mm_use_im_start_end": false,
  "mm_vision_select_feature": "patch",
  "mm_vision_select_layer": -2,
  "mm_vision_tower_lr": 2e-06,
  "model_max_length": 8192,
- "model_type": "diffusionvl_qwen2_5_vl",
  "num_attention_heads": 16,
  "num_hidden_layers": 36,
  "num_key_value_heads": 2,
@@ -114,6 +115,84 @@
  },
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": true,
  "tokenizer_model_max_length": 8192,
  "tokenizer_padding_side": "right",
 
  "mm_use_im_start_end": false,
  "mm_vision_select_feature": "patch",
  "mm_vision_select_layer": -2,
+ "mm_vision_tower": "/data/minimax-dialogue/users/qingke/results/hf_models/Qwen2.5-VL-3B-Instruct-Reformat",
  "mm_vision_tower_lr": 2e-06,
  "model_max_length": 8192,
+ "model_type": "diffusionvl_qwenvl",
  "num_attention_heads": 16,
  "num_hidden_layers": 36,
  "num_key_value_heads": 2,
 
  },
  "rope_theta": 1000000.0,
  "sliding_window": null,
+ "text_config": {
+ "architectures": [
+ "Qwen2_5_VLForConditionalGeneration"
+ ],
+ "attention_dropout": 0.0,
+ "bos_token_id": 151643,
+ "eos_token_id": 151645,
+ "hidden_act": "silu",
+ "hidden_size": 2048,
+ "image_token_id": null,
+ "initializer_range": 0.02,
+ "intermediate_size": 11008,
+ "layer_types": [
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention",
+ "full_attention"
+ ],
+ "max_position_embeddings": 128000,
+ "max_window_layers": 70,
+ "model_type": "qwen2_5_vl_text",
+ "num_attention_heads": 16,
+ "num_hidden_layers": 36,
+ "num_key_value_heads": 2,
+ "rms_norm_eps": 1e-06,
+ "rope_scaling": {
+ "mrope_section": [
+ 16,
+ 24,
+ 24
+ ],
+ "rope_type": "default",
+ "type": "default"
+ },
+ "rope_theta": 1000000.0,
+ "sliding_window": null,
+ "tie_word_embeddings": true,
+ "torch_dtype": "float32",
+ "use_cache": true,
+ "use_sliding_window": false,
+ "video_token_id": null,
+ "vision_end_token_id": 151653,
+ "vision_start_token_id": 151652,
+ "vision_token_id": 151654,
+ "vocab_size": 151936
+ },
  "tie_word_embeddings": true,
  "tokenizer_model_max_length": 8192,
  "tokenizer_padding_side": "right",
  "tokenizer_padding_side": "right",
configuration_diffusionvl_qwen2_5_vl.py CHANGED
@@ -190,7 +190,7 @@ class DiffusionVL_Qwen2_5_VL_Config(PretrainedConfig):
  ```
  """
 
- model_type = "diffusionvl_qwen2_5_vl"
  sub_configs = {"vision_config": DiffusionVL_Qwen2_5_VL_VisionConfig}
  keys_to_ignore_at_inference = ["past_key_values"]
 
@@ -229,6 +229,10 @@ class DiffusionVL_Qwen2_5_VL_Config(PretrainedConfig):
  rope_scaling: Optional[dict] = None,
  **kwargs,
  ):
  # Text model configuration
  self.vocab_size = vocab_size
  self.hidden_size = hidden_size
 
  ```
  """
 
+ model_type = "diffusionvl_qwenvl"
  sub_configs = {"vision_config": DiffusionVL_Qwen2_5_VL_VisionConfig}
  keys_to_ignore_at_inference = ["past_key_values"]
 
  rope_scaling: Optional[dict] = None,
  **kwargs,
  ):
+ # Remove text_config from kwargs to avoid GenerationConfig issues
+ # (text_config is only needed for train code, HF config uses flattened params)
+ kwargs.pop("text_config", None)
+
  # Text model configuration
  self.vocab_size = vocab_size
  self.hidden_size = hidden_size
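
Assuming the bundled remote code registers this config class (see modeling_diffusionvl_qwen2_5_vl.py below), a minimal sketch of what the `kwargs.pop("text_config", None)` change means when loading a checkpoint whose config.json now carries the nested `text_config` block; the path is a placeholder:

```python
from transformers import AutoConfig

# Minimal sketch, assuming trust_remote_code loads the bundled config class.
config = AutoConfig.from_pretrained("path/to/model", trust_remote_code=True)

print(config.model_type)               # "diffusionvl_qwenvl"
print(config.hidden_size)              # 2048 per the model card, read from the flattened params
print(hasattr(config, "text_config"))  # False: the nested block is popped in __init__
```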
modeling_diffusionvl_qwen2_5_vl.py CHANGED
@@ -33,6 +33,7 @@ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutpu
  from transformers.utils import logging
  from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS
  from transformers.modeling_layers import GradientCheckpointingLayer
 
  from .configuration_diffusionvl_qwen2_5_vl import DiffusionVL_Qwen2_5_VL_Config, DiffusionVL_Qwen2_5_VL_VisionConfig
 
@@ -120,20 +121,24 @@ def apply_multimodal_rotary_pos_emb(
  k_embed = (k * cos) + (rotate_half(k) * sin)
  return q_embed, k_embed
 
-
  class DiffusionVL_Qwen2_5_VL_RMSNorm(nn.Module):
  def __init__(self, hidden_size, eps=1e-6):
  super().__init__()
  self.weight = nn.Parameter(torch.ones(hidden_size))
  self.variance_epsilon = eps
 
- def forward(self, hidden_states):
  input_dtype = hidden_states.dtype
  hidden_states = hidden_states.to(torch.float32)
  variance = hidden_states.pow(2).mean(-1, keepdim=True)
  hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
  return self.weight * hidden_states.to(input_dtype)
 
  def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
  """
@@ -590,17 +595,17 @@ class DiffusionVL_Qwen2_5_VL_RotaryEmbedding(nn.Module):
 
 
  class DiffusionVL_Qwen2_5_VL_MLP(nn.Module):
- def __init__(self, config):
  super().__init__()
  self.hidden_size = config.hidden_size
  self.intermediate_size = config.intermediate_size
- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
- self.act_fn = nn.SiLU()
 
- def forward(self, x):
- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
 
 
  class DiffusionVL_Qwen2_5_VL_Attention(nn.Module):
@@ -759,18 +764,15 @@ class DiffusionVL_Qwen2_5_VL_PreTrainedModel(PreTrainedModel):
 
  config_class = DiffusionVL_Qwen2_5_VL_Config
  base_model_prefix = "model"
  supports_gradient_checkpointing = True
  _no_split_modules = ["DiffusionVL_Qwen2_5_VL_DecoderLayer", "DiffusionVL_Qwen2_5_VL_VisionBlock"]
 
- def _init_weights(self, module: nn.Module) -> None:
- """Initialize the weights."""
- std = self.config.initializer_range
- if isinstance(module, nn.Linear):
- module.weight.data.normal_(mean=0.0, std=std)
- if module.bias is not None:
- module.bias.data.zero_()
- elif isinstance(module, nn.Embedding):
- module.weight.data.normal_(mean=0.0, std=std)
 
 
  class DiffusionVL_Qwen2_5_VL_Model(DiffusionVL_Qwen2_5_VL_PreTrainedModel):
@@ -1233,7 +1235,6 @@ class DiffusionVL_Qwen2_5_VL_ForConditionalGeneration(DiffusionVL_Qwen2_5_VL_Pre
  top_k: int = 0,
  top_p: float = 1.0,
  remasking_strategy: str = 'low_confidence_static',
- use_kv_cache: bool = True,
  confidence_threshold: float = 0.85,
  **kwargs,
  ):
@@ -1248,7 +1249,6 @@ class DiffusionVL_Qwen2_5_VL_ForConditionalGeneration(DiffusionVL_Qwen2_5_VL_Pre
  top_k: Top-k sampling parameter
  top_p: Top-p (nucleus) sampling parameter
  remasking_strategy: 'low_confidence_static', 'low_confidence_dynamic', or 'sequential'
- use_kv_cache: Whether to use KV cache (default True)
  confidence_threshold: Threshold for low_confidence_dynamic strategy
 
  Returns:
@@ -1291,8 +1291,8 @@ class DiffusionVL_Qwen2_5_VL_ForConditionalGeneration(DiffusionVL_Qwen2_5_VL_Pre
  prefill_blocks = prompt_len // block_size
  prefill_length = prefill_blocks * block_size
 
- past_key_values = DynamicCache() if use_kv_cache else None
- if use_kv_cache and prefill_length > 0:
  prefill_embeds = x_embeds[:, :prefill_length]
  prefill_mask = block_diffusion_mask[:, :, :prefill_length, :prefill_length]
  prefill_pos_ids = position_ids[:, :prefill_length]
@@ -1336,45 +1336,26 @@ class DiffusionVL_Qwen2_5_VL_ForConditionalGeneration(DiffusionVL_Qwen2_5_VL_Pre
  is_mask = torch.all(torch.abs(cur_block_embeds - mask_embed.to(cur_block_embeds.device)) < 1e-5, dim=-1)
  if not is_mask.any():
  # Store KV for fully unmasked block
- if use_kv_cache:
- _ = self.model(
- inputs_embeds=cur_block_embeds,
- attention_mask=model_mask,
- position_ids=cur_pos_ids,
- past_key_values=past_key_values,
- use_cache=True,
- store_kv=True
- )
- break
-
- # Forward pass
- if use_kv_cache:
- outputs = self.model(
  inputs_embeds=cur_block_embeds,
  attention_mask=model_mask,
  position_ids=cur_pos_ids,
  past_key_values=past_key_values,
  use_cache=True,
- store_kv=False
  )
- logits = self.lm_head(outputs.last_hidden_state).float()
- else:
- # No KV-cache: recompute full context
- context_embeds = x_embeds[:, :block_end].clone()
- context_embeds[:, block_start:block_end] = cur_block_embeds
- context_mask = block_diffusion_mask[:, :, :block_end, :block_end]
- context_pos_ids = position_ids[:, :block_end]
- context_model_mask = {"full_attention": context_mask, "sliding_attention": context_mask}
-
- outputs = self.model(
- inputs_embeds=context_embeds,
- attention_mask=context_model_mask,
- position_ids=context_pos_ids,
- past_key_values=None,
- use_cache=False,
- store_kv=False
- )
- logits = self.lm_head(outputs.last_hidden_state[:, block_start:block_end]).float()
 
  # Sample tokens
  x0, x0_p = self._sample_tokens(logits, temperature, top_k, top_p)
@@ -1500,7 +1481,7 @@ class DiffusionVL_Qwen2_5_VL_ForConditionalGeneration(DiffusionVL_Qwen2_5_VL_Pre
 
  from transformers import AutoConfig, AutoModelForCausalLM
 
- AutoConfig.register("diffusionvl_qwen2_5_vl", DiffusionVL_Qwen2_5_VL_Config)
  AutoModelForCausalLM.register(DiffusionVL_Qwen2_5_VL_Config, DiffusionVL_Qwen2_5_VL_ForConditionalGeneration)
 
  from transformers.utils import logging
  from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS
  from transformers.modeling_layers import GradientCheckpointingLayer
+ from transformers.integrations import use_kernel_forward_from_hub
 
  from .configuration_diffusionvl_qwen2_5_vl import DiffusionVL_Qwen2_5_VL_Config, DiffusionVL_Qwen2_5_VL_VisionConfig
 
  k_embed = (k * cos) + (rotate_half(k) * sin)
  return q_embed, k_embed
 
+ @use_kernel_forward_from_hub("RMSNorm")
  class DiffusionVL_Qwen2_5_VL_RMSNorm(nn.Module):
+ """RMSNorm implementation matching Qwen2RMSNorm from modeling_qwen2.py"""
  def __init__(self, hidden_size, eps=1e-6):
  super().__init__()
  self.weight = nn.Parameter(torch.ones(hidden_size))
  self.variance_epsilon = eps
 
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
  input_dtype = hidden_states.dtype
  hidden_states = hidden_states.to(torch.float32)
  variance = hidden_states.pow(2).mean(-1, keepdim=True)
  hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
  return self.weight * hidden_states.to(input_dtype)
 
+ def extra_repr(self):
+ return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
+
 
  def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
  """
 
 
  class DiffusionVL_Qwen2_5_VL_MLP(nn.Module):
+ def __init__(self, config, bias: bool = False):
  super().__init__()
  self.hidden_size = config.hidden_size
  self.intermediate_size = config.intermediate_size
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=bias)
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=bias)
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=bias)
+ self.act_fn = ACT2FN[config.hidden_act]
 
+ def forward(self, hidden_state):
+ return self.down_proj(self.act_fn(self.gate_proj(hidden_state)) * self.up_proj(hidden_state))
 
 
  class DiffusionVL_Qwen2_5_VL_Attention(nn.Module):
 
  config_class = DiffusionVL_Qwen2_5_VL_Config
  base_model_prefix = "model"
+ input_modalities = ["image", "video", "text"]
  supports_gradient_checkpointing = True
  _no_split_modules = ["DiffusionVL_Qwen2_5_VL_DecoderLayer", "DiffusionVL_Qwen2_5_VL_VisionBlock"]
+ _skip_keys_device_placement = "past_key_values"
+ _supports_flash_attn = True
+ _supports_sdpa = True
 
+ _can_compile_fullgraph = True
+ _supports_attention_backend = True
 
 
  class DiffusionVL_Qwen2_5_VL_Model(DiffusionVL_Qwen2_5_VL_PreTrainedModel):
 
  top_k: int = 0,
  top_p: float = 1.0,
  remasking_strategy: str = 'low_confidence_static',
  confidence_threshold: float = 0.85,
  **kwargs,
  ):
 
  top_k: Top-k sampling parameter
  top_p: Top-p (nucleus) sampling parameter
  remasking_strategy: 'low_confidence_static', 'low_confidence_dynamic', or 'sequential'
  confidence_threshold: Threshold for low_confidence_dynamic strategy
 
  Returns:
 
  prefill_blocks = prompt_len // block_size
  prefill_length = prefill_blocks * block_size
 
+ past_key_values = DynamicCache()
+ if prefill_length > 0:
  prefill_embeds = x_embeds[:, :prefill_length]
  prefill_mask = block_diffusion_mask[:, :, :prefill_length, :prefill_length]
  prefill_pos_ids = position_ids[:, :prefill_length]
 
  is_mask = torch.all(torch.abs(cur_block_embeds - mask_embed.to(cur_block_embeds.device)) < 1e-5, dim=-1)
  if not is_mask.any():
  # Store KV for fully unmasked block
+ _ = self.model(
  inputs_embeds=cur_block_embeds,
  attention_mask=model_mask,
  position_ids=cur_pos_ids,
  past_key_values=past_key_values,
  use_cache=True,
+ store_kv=True
  )
+ break
+
+ # Forward pass
+ outputs = self.model(
+ inputs_embeds=cur_block_embeds,
+ attention_mask=model_mask,
+ position_ids=cur_pos_ids,
+ past_key_values=past_key_values,
+ use_cache=True,
+ store_kv=False
+ )
+ logits = self.lm_head(outputs.last_hidden_state).float()
 
  # Sample tokens
  x0, x0_p = self._sample_tokens(logits, temperature, top_k, top_p)
 
  from transformers import AutoConfig, AutoModelForCausalLM
 
+ AutoConfig.register("diffusionvl_qwenvl", DiffusionVL_Qwen2_5_VL_Config)
  AutoModelForCausalLM.register(DiffusionVL_Qwen2_5_VL_Config, DiffusionVL_Qwen2_5_VL_ForConditionalGeneration)
 
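To make the prefill arithmetic in the block-decoding change above concrete, a small sketch with made-up numbers (block_size and the always-on DynamicCache follow the code in this diff; the prompt length is hypothetical):

```python
# Illustrative numbers only; mirrors the prefill logic shown above.
block_size = 8
prompt_len = 45                                # hypothetical prompt length in tokens

prefill_blocks = prompt_len // block_size      # 5 fully filled blocks
prefill_length = prefill_blocks * block_size   # 40 tokens written into the KV cache up front

# The remaining prompt_len - prefill_length = 5 tokens fall into the first
# partially filled block and are handled by the block-by-block loop, which
# now always runs with a DynamicCache (the use_kv_cache switch was removed).
print(prefill_blocks, prefill_length)
```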
processing_diffusionvl_qwen2_5_vl.py CHANGED
@@ -54,6 +54,8 @@ def tokenizer_image_token(
  """
  Tokenize text with image placeholders, replacing <image> with IMAGE_TOKEN_INDEX.
 
  Args:
  prompt: Input text containing <image> placeholders.
  tokenizer: The tokenizer to use for encoding text.
@@ -63,26 +65,27 @@
  Returns:
  List of token IDs or a PyTorch tensor.
  """
- prompt_chunks = prompt.split(DEFAULT_IMAGE_TOKEN)
 
  input_ids = []
  offset = 0
 
- if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0:
- # First chunk has text
- input_ids = tokenizer(prompt_chunks[0], add_special_tokens=False).input_ids
  offset = 1
 
- for chunk_idx in range(offset, len(prompt_chunks)):
- chunk = prompt_chunks[chunk_idx]
- # Add image token
- input_ids.append(image_token_index)
- # Add text after image
- if len(chunk) > 0:
- input_ids.extend(tokenizer(chunk, add_special_tokens=False).input_ids)
 
- if return_tensors == "pt":
- return torch.tensor(input_ids, dtype=torch.long)
  return input_ids
 
 
  """
  Tokenize text with image placeholders, replacing <image> with IMAGE_TOKEN_INDEX.
 
+ This implementation matches the training code (llava/mm_utils.py::tokenizer_image_token).
+
  Args:
  prompt: Input text containing <image> placeholders.
  tokenizer: The tokenizer to use for encoding text.
 
  Returns:
  List of token IDs or a PyTorch tensor.
  """
+ # Tokenize each chunk (matching training code behavior)
+ prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split(DEFAULT_IMAGE_TOKEN)]
+
+ def insert_separator(X, sep):
+ return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1]
 
  input_ids = []
  offset = 0
 
+ # Handle BOS token if present (matching training code)
+ if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0 and prompt_chunks[0][0] == tokenizer.bos_token_id:
  offset = 1
+ input_ids.append(prompt_chunks[0][0])
 
+ for x in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
+ input_ids.extend(x[offset:])
 
+ if return_tensors is not None:
+ if return_tensors == "pt":
+ return torch.tensor(input_ids, dtype=torch.long)
+ raise ValueError(f"Unsupported tensor type: {return_tensors}")
  return input_ids
 
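
A toy walk-through of the splice logic in the new `tokenizer_image_token` (the token ids are stand-ins: a BOS id of 1 and a LLaVA-style image placeholder index of -200 are assumptions for illustration, not values from this repository):

```python
IMAGE_TOKEN_INDEX = -200   # assumed placeholder index
BOS = 1                    # assumed BOS id

# Chunks as if tokenizer("...") prepended BOS to each side of "<image>".
prompt_chunks = [[BOS, 10, 11], [BOS, 12, 13]]

def insert_separator(X, sep):
    return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1]

input_ids = []
offset = 0
if prompt_chunks and prompt_chunks[0] and prompt_chunks[0][0] == BOS:
    offset = 1
    input_ids.append(prompt_chunks[0][0])

for x in insert_separator(prompt_chunks, [IMAGE_TOKEN_INDEX] * (offset + 1)):
    input_ids.extend(x[offset:])

print(input_ids)  # [1, 10, 11, -200, 12, 13]: one image token spliced in, duplicate BOS dropped
```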