Upload folder using huggingface_hub
- README.md +28 -86
- config.json +80 -1
- configuration_diffusionvl_qwen2_5_vl.py +5 -1
- modeling_diffusionvl_qwen2_5_vl.py +37 -56
- processing_diffusionvl_qwen2_5_vl.py +16 -13
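
For context, a commit titled like this one is what the `huggingface_hub` client produces when a local model directory is pushed in a single call. A minimal sketch of that workflow, with a placeholder repo id and local path (neither is taken from this repository):

```python
from huggingface_hub import upload_folder

# Assumes you are already authenticated (e.g. via `huggingface-cli login`).
# Every file in the local directory is uploaded as one commit.
upload_folder(
    repo_id="your-org/your-model",            # placeholder repo id
    folder_path="./DiffusionVL-checkpoint",   # placeholder local directory
    commit_message="Upload folder using huggingface_hub",
)
```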
README.md
CHANGED
@@ -3,93 +3,31 @@ license: apache-2.0
 tags:
 - diffusion
 - vision-language
-- diffusionvl
 - qwen2.5-vl
-library_name: transformers
-pipeline_tag: image-text-to-text
 ---

-<div align="center">
-
-[Lunbin Zeng](https://github.com/xiazhi1)<sup>1,\*</sup>, [Jingfeng Yao](https://github.com/JingfengYao)<sup>1,\*</sup>, [Bencheng Liao](https://github.com/LegendBC)<sup>1</sup>, [Hongyuan Tao](https://github.com/Hongyuan-Tao)<sup>1</sup>, [Wenyu Liu](https://scholar.google.com/citations?user=D7jDk7gAAAAJ&hl=en)<sup>1</sup>, [Xinggang Wang](https://xwcv.github.io)<sup>1,✉️</sup>
-
-<sup>1</sup>Huazhong University of Science and Technology
-
-<sup>*</sup>equal contribution, <sup>✉️</sup>corresponding author, xgwang@hust.edu.cn
-
-[](https://arxiv.org/abs/2512.15713) <a href="https://github.com/hustvl/DiffusionVL"><img src="https://img.shields.io/badge/GitHub-Repository-black?logo=github" alt="GitHub"></a> <a href="https://huggingface.co/collections/hustvl/diffusionvl"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue" alt="Hugging Face"></a>
-
-</div>
-
-## 📰 News
-
-- **[2025.12.18]** 🎉 Our paper **DiffusionVL** is released on arXiv! We also release the DiffusionVL models translated from Qwen2.5-VL on Hugging Face. The training code and more models are coming soon!
-
-## 📄 Introduction
-
-The diffusion paradigm has emerged as a promising alternative to autoregressive (AR) models, offering the potential for efficient parallel decoding. However, existing diffusion vision language models (dVLMs) largely lag behind mainstream autoregressive vision language models in performance, primarily due to the capability limitations of their base diffusion language models.
-
-DiffusionVL bridges this gap by answering a fundamental question: ***Can we directly translate any existing autoregressive model into a powerful diffusion vision language model?*** We propose a diffusion finetuning framework that "translates" any pretrained AR model into a diffusion vision language model through a simple paradigm shift and modality shift. Unlike prior dVLMs restricted to fixed generation lengths, DiffusionVL introduces a novel block decoding strategy that allows arbitrary-length generation and KV-cache reuse. With this integrated design, despite training on less than 5% of the data required by previous methods, DiffusionVL translated from AR-VLMs achieves state-of-the-art performance among existing dVLMs and delivers a 2.0× inference speedup.
-
-## ✨ Highlights
-
-- **Universal Translation Framework:** Translate any AR model into a dVLM with a simple yet effective approach.
-
-- **Superior Performance:** Achieve SOTA dVLM performance using <5% of the training data (738K vs. 16.5M samples).
-
-- **2.0× Faster Inference:** The block decoding strategy enables KV-cache reuse and a 2.0× speedup over previous dVLMs.
-
-<div align="center">
-<img src="https://github.com/hustvl/DiffusionVL/raw/main/assets/benchmark.png" alt="Benchmark Image" width="800">
-<img src="https://github.com/hustvl/DiffusionVL/raw/main/assets/framework.png" alt="Framework" width="800">
-</div>
-
-### 🎯 Inference with Pre-trained Models
-
-- **Download Pre-trained Models:**
-
-| Model | Base Model | Download |
-| :--- | :--- | :--- |
-| **DiffusionVL-Qwen2.5VL-3B** | Qwen2.5-VL-3B | [HuggingFace](https://huggingface.co/hustvl/DiffusionVL-Qwen2.5VL-3B) |
-| **DiffusionVL-Qwen2.5VL-7B** | Qwen2.5-VL-7B | [HuggingFace](https://huggingface.co/hustvl/DiffusionVL-Qwen2.5VL-7B) |
-
-- **Environment Setup:**
-
-The core dependencies are listed below:
-```
-torch==2.6.0
-torchvision==0.21.0
-torchaudio==2.6.0
-transformers==4.55.0
-accelerate==1.10.1
-pillow==10.4.0
-requests==2.32.5
-```
-
-- **Quick Start:**
+# DiffusionVL
+
+DiffusionVL is a vision-language model based on the Qwen2.5-VL architecture with BD3LM diffusion-based generation.
+
+## Usage

 ```python
-from transformers import AutoModelForCausalLM, AutoProcessor
+from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
 import torch

 # Load model with trust_remote_code
 model = AutoModelForCausalLM.from_pretrained(
-    "
+    "path/to/model",
     torch_dtype=torch.bfloat16,
     device_map="auto",
     trust_remote_code=True
 )

 # Load processor (includes tokenizer)
-processor = AutoProcessor.from_pretrained("
+processor = AutoProcessor.from_pretrained("path/to/model", trust_remote_code=True)

+# Image + text generation
 from PIL import Image
 import requests

@@ -110,7 +48,7 @@ output_ids = model.generate(
     inputs=inputs["input_ids"],
     images=inputs.get("pixel_values"),
     image_grid_thws=inputs.get("image_grid_thw"),
-    gen_length=
+    gen_length=256,
     steps=8,
     temperature=0.0,
     remasking_strategy="low_confidence_static",
@@ -119,23 +57,27 @@ output_ids = model.generate(
 # Decode output
 output_text = processor.decode(output_ids[0], skip_special_tokens=True)
 print(output_text)
-
 ```

-##
+## Generation Parameters

+- `gen_length`: Number of tokens to generate (default: 256)
+- `steps`: Number of diffusion steps per block (default: 8)
+- `temperature`: Sampling temperature, 0 for greedy (default: 0.0)
+- `top_k`: Top-k sampling parameter (default: 0, disabled)
+- `top_p`: Top-p (nucleus) sampling parameter (default: 1.0)
+- `remasking_strategy`: 'low_confidence' or 'sequential' (default: 'low_confidence')

-##
+## Model Configuration

+- **Architecture**: DiffusionVL_Qwen2_5_VL_ForConditionalGeneration
+- **BD3LM Enabled**: True
+- **Block Size**: 8
+- **Hidden Size**: 2048
+- **Num Layers**: 36

+## Notes

+- The model uses `trust_remote_code=True` because it includes custom modeling code
+- Both model and processor can be loaded from the same directory
+- Image preprocessing uses Qwen2VLImageProcessor internally (identical to Qwen2.5-VL)
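
The `remasking_strategy` options listed in the updated README are easiest to see in code. The sketch below shows one illustrative "low confidence" denoising step over a block of masked tokens; it is a toy standalone example, not the model's actual implementation, and names such as `mask_id` and `tokens_per_step` are assumptions:

```python
import torch

def low_confidence_unmask_step(logits, tokens, mask_id, tokens_per_step):
    """One illustrative step: commit the masked positions whose argmax
    prediction is most confident; keep the rest masked for later steps."""
    probs = torch.softmax(logits, dim=-1)        # (seq_len, vocab_size)
    confidence, candidates = probs.max(dim=-1)   # best probability / token id per position
    still_masked = tokens == mask_id
    # Only masked positions compete; already-decoded positions are frozen.
    confidence = torch.where(still_masked, confidence, torch.full_like(confidence, -1.0))
    k = min(tokens_per_step, int(still_masked.sum()))
    if k > 0:
        unmask_idx = torch.topk(confidence, k).indices
        tokens = tokens.clone()
        tokens[unmask_idx] = candidates[unmask_idx]
    return tokens

# Toy demo: a block of 8 positions over a 10-token vocabulary, all masked (mask id 9).
logits = torch.randn(8, 10)
logits[:, 9] = -1e9                # never predict the mask token itself
tokens = torch.full((8,), 9)
tokens = low_confidence_unmask_step(logits, tokens, mask_id=9, tokens_per_step=2)
print(tokens)                      # two positions are now real token ids, six remain masked
```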
config.json
CHANGED
@@ -95,9 +95,10 @@
   "mm_use_im_start_end": false,
   "mm_vision_select_feature": "patch",
   "mm_vision_select_layer": -2,
+  "mm_vision_tower": "/data/minimax-dialogue/users/qingke/results/hf_models/Qwen2.5-VL-3B-Instruct-Reformat",
   "mm_vision_tower_lr": 2e-06,
   "model_max_length": 8192,
-  "model_type": "
+  "model_type": "diffusionvl_qwenvl",
   "num_attention_heads": 16,
   "num_hidden_layers": 36,
   "num_key_value_heads": 2,
@@ -114,6 +115,84 @@
   },
   "rope_theta": 1000000.0,
   "sliding_window": null,
+  "text_config": {
+    "architectures": [
+      "Qwen2_5_VLForConditionalGeneration"
+    ],
+    "attention_dropout": 0.0,
+    "bos_token_id": 151643,
+    "eos_token_id": 151645,
+    "hidden_act": "silu",
+    "hidden_size": 2048,
+    "image_token_id": null,
+    "initializer_range": 0.02,
+    "intermediate_size": 11008,
+    "layer_types": [
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention"
+    ],
+    "max_position_embeddings": 128000,
+    "max_window_layers": 70,
+    "model_type": "qwen2_5_vl_text",
+    "num_attention_heads": 16,
+    "num_hidden_layers": 36,
+    "num_key_value_heads": 2,
+    "rms_norm_eps": 1e-06,
+    "rope_scaling": {
+      "mrope_section": [
+        16,
+        24,
+        24
+      ],
+      "rope_type": "default",
+      "type": "default"
+    },
+    "rope_theta": 1000000.0,
+    "sliding_window": null,
+    "tie_word_embeddings": true,
+    "torch_dtype": "float32",
+    "use_cache": true,
+    "use_sliding_window": false,
+    "video_token_id": null,
+    "vision_end_token_id": 151653,
+    "vision_start_token_id": 151652,
+    "vision_token_id": 151654,
+    "vocab_size": 151936
+  },
   "tie_word_embeddings": true,
   "tokenizer_model_max_length": 8192,
   "tokenizer_padding_side": "right",
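
Because `model_type` is now the custom `diffusionvl_qwenvl`, the configuration above only resolves through the remote code bundled with the repository. A small sketch of loading it, with the path as a placeholder for a local clone or the Hub repo id:

```python
from transformers import AutoConfig

# Placeholder path; a local clone of this repository or its Hub id works the same way.
config = AutoConfig.from_pretrained("path/to/model", trust_remote_code=True)

print(config.model_type)         # "diffusionvl_qwenvl"
print(config.num_hidden_layers)  # 36: the text-model fields stay flattened at the top level
# Note: the nested "text_config" block is popped in the config's __init__
# (see the configuration change below), so it does not become an attribute.
```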
configuration_diffusionvl_qwen2_5_vl.py
CHANGED
@@ -190,7 +190,7 @@ class DiffusionVL_Qwen2_5_VL_Config(PretrainedConfig):
     ```
     """

-    model_type = "
+    model_type = "diffusionvl_qwenvl"
     sub_configs = {"vision_config": DiffusionVL_Qwen2_5_VL_VisionConfig}
     keys_to_ignore_at_inference = ["past_key_values"]

@@ -229,6 +229,10 @@
         rope_scaling: Optional[dict] = None,
         **kwargs,
     ):
+        # Remove text_config from kwargs to avoid GenerationConfig issues
+        # (text_config is only needed for train code, HF config uses flattened params)
+        kwargs.pop("text_config", None)
+
         # Text model configuration
         self.vocab_size = vocab_size
         self.hidden_size = hidden_size
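
The `kwargs.pop("text_config", None)` call above is the usual way to keep a training-only field in `config.json` from turning into a config attribute. A standalone illustration of the pattern; the class below is hypothetical and not part of this repository:

```python
from transformers import PretrainedConfig

class TinyConfig(PretrainedConfig):
    model_type = "tiny_example"  # hypothetical model type, for illustration only

    def __init__(self, hidden_size=64, **kwargs):
        # Drop the training-only field before it can be stored on the config
        # (mirrors kwargs.pop("text_config", None) in the change above).
        kwargs.pop("text_config", None)
        self.hidden_size = hidden_size
        super().__init__(**kwargs)

cfg = TinyConfig(hidden_size=128, text_config={"num_hidden_layers": 4})
print(hasattr(cfg, "text_config"))  # False: the nested dict was discarded
print(cfg.hidden_size)              # 128
```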
modeling_diffusionvl_qwen2_5_vl.py
CHANGED
@@ -33,6 +33,7 @@ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
 from transformers.utils import logging
 from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS
 from transformers.modeling_layers import GradientCheckpointingLayer
+from transformers.integrations import use_kernel_forward_from_hub

 from .configuration_diffusionvl_qwen2_5_vl import DiffusionVL_Qwen2_5_VL_Config, DiffusionVL_Qwen2_5_VL_VisionConfig

@@ -120,20 +121,24 @@ def apply_multimodal_rotary_pos_emb(
     k_embed = (k * cos) + (rotate_half(k) * sin)
     return q_embed, k_embed

-
+@use_kernel_forward_from_hub("RMSNorm")
 class DiffusionVL_Qwen2_5_VL_RMSNorm(nn.Module):
+    """RMSNorm implementation matching Qwen2RMSNorm from modeling_qwen2.py"""
     def __init__(self, hidden_size, eps=1e-6):
         super().__init__()
         self.weight = nn.Parameter(torch.ones(hidden_size))
         self.variance_epsilon = eps

-    def forward(self, hidden_states):
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
         input_dtype = hidden_states.dtype
         hidden_states = hidden_states.to(torch.float32)
         variance = hidden_states.pow(2).mean(-1, keepdim=True)
         hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
         return self.weight * hidden_states.to(input_dtype)

+    def extra_repr(self):
+        return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
+

 def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
     """
@@ -590,17 +595,17 @@ class DiffusionVL_Qwen2_5_VL_RotaryEmbedding(nn.Module):


 class DiffusionVL_Qwen2_5_VL_MLP(nn.Module):
-    def __init__(self, config):
+    def __init__(self, config, bias: bool = False):
         super().__init__()
         self.hidden_size = config.hidden_size
         self.intermediate_size = config.intermediate_size
-        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=
-        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=
-        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=
-        self.act_fn =
+        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=bias)
+        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=bias)
+        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=bias)
+        self.act_fn = ACT2FN[config.hidden_act]

-    def forward(self,
-        return self.down_proj(self.act_fn(self.gate_proj(
+    def forward(self, hidden_state):
+        return self.down_proj(self.act_fn(self.gate_proj(hidden_state)) * self.up_proj(hidden_state))


 class DiffusionVL_Qwen2_5_VL_Attention(nn.Module):
@@ -759,18 +764,15 @@ class DiffusionVL_Qwen2_5_VL_PreTrainedModel(PreTrainedModel):

     config_class = DiffusionVL_Qwen2_5_VL_Config
     base_model_prefix = "model"
+    input_modalities = ["image", "video", "text"]
     supports_gradient_checkpointing = True
     _no_split_modules = ["DiffusionVL_Qwen2_5_VL_DecoderLayer", "DiffusionVL_Qwen2_5_VL_VisionBlock"]
+    _skip_keys_device_placement = "past_key_values"
+    _supports_flash_attn = True
+    _supports_sdpa = True

+    _can_compile_fullgraph = True
+    _supports_attention_backend = True
-
-    def _init_weights(self, module):
-        std = self.config.initializer_range
-        if isinstance(module, nn.Linear):
-            module.weight.data.normal_(mean=0.0, std=std)
-            if module.bias is not None:
-                module.bias.data.zero_()
-        elif isinstance(module, nn.Embedding):
-            module.weight.data.normal_(mean=0.0, std=std)


 class DiffusionVL_Qwen2_5_VL_Model(DiffusionVL_Qwen2_5_VL_PreTrainedModel):
@@ -1233,7 +1235,6 @@ class DiffusionVL_Qwen2_5_VL_ForConditionalGeneration(DiffusionVL_Qwen2_5_VL_PreTrainedModel):
         top_k: int = 0,
         top_p: float = 1.0,
         remasking_strategy: str = 'low_confidence_static',
-        use_kv_cache: bool = True,
         confidence_threshold: float = 0.85,
         **kwargs,
     ):
@@ -1248,7 +1249,6 @@ class DiffusionVL_Qwen2_5_VL_ForConditionalGeneration(DiffusionVL_Qwen2_5_VL_PreTrainedModel):
             top_k: Top-k sampling parameter
             top_p: Top-p (nucleus) sampling parameter
             remasking_strategy: 'low_confidence_static', 'low_confidence_dynamic', or 'sequential'
-            use_kv_cache: Whether to use KV cache (default True)
             confidence_threshold: Threshold for low_confidence_dynamic strategy

         Returns:
@@ -1291,8 +1291,8 @@ class DiffusionVL_Qwen2_5_VL_ForConditionalGeneration(DiffusionVL_Qwen2_5_VL_PreTrainedModel):
         prefill_blocks = prompt_len // block_size
         prefill_length = prefill_blocks * block_size

-        past_key_values = DynamicCache()
-        if
+        past_key_values = DynamicCache()
+        if prefill_length > 0:
             prefill_embeds = x_embeds[:, :prefill_length]
             prefill_mask = block_diffusion_mask[:, :, :prefill_length, :prefill_length]
             prefill_pos_ids = position_ids[:, :prefill_length]
@@ -1336,45 +1336,26 @@ class DiffusionVL_Qwen2_5_VL_ForConditionalGeneration(DiffusionVL_Qwen2_5_VL_PreTrainedModel):
                 is_mask = torch.all(torch.abs(cur_block_embeds - mask_embed.to(cur_block_embeds.device)) < 1e-5, dim=-1)
                 if not is_mask.any():
                     # Store KV for fully unmasked block
-                    _ = self.model(
-                        inputs_embeds=cur_block_embeds,
-                        attention_mask=model_mask,
-                        position_ids=cur_pos_ids,
-                        past_key_values=past_key_values,
-                        use_cache=True,
-                        store_kv=True
-                    )
-                    break
-
-                # Forward pass
-                if use_kv_cache:
-                    outputs = self.model(
-                        inputs_embeds=cur_block_embeds,
-                        attention_mask=model_mask,
-                        position_ids=cur_pos_ids,
-                        past_key_values=past_key_values,
-                        use_cache=True,
-                        store_kv=
-                    )
-                        position_ids=context_pos_ids,
-                        past_key_values=None,
-                        use_cache=False,
-                        store_kv=False
-                    )
-                logits = self.lm_head(outputs.last_hidden_state[:, block_start:block_end]).float()
+                    _ = self.model(
+                        inputs_embeds=cur_block_embeds,
+                        attention_mask=model_mask,
+                        position_ids=cur_pos_ids,
+                        past_key_values=past_key_values,
+                        use_cache=True,
+                        store_kv=True
+                    )
+                    break
+
+                # Forward pass
+                outputs = self.model(
+                    inputs_embeds=cur_block_embeds,
+                    attention_mask=model_mask,
+                    position_ids=cur_pos_ids,
+                    past_key_values=past_key_values,
+                    use_cache=True,
+                    store_kv=False
+                )
+                logits = self.lm_head(outputs.last_hidden_state).float()

                 # Sample tokens
                 x0, x0_p = self._sample_tokens(logits, temperature, top_k, top_p)
@@ -1500,7 +1481,7 @@ class DiffusionVL_Qwen2_5_VL_ForConditionalGeneration(DiffusionVL_Qwen2_5_VL_PreTrainedModel):

 from transformers import AutoConfig, AutoModelForCausalLM

-AutoConfig.register("
+AutoConfig.register("diffusionvl_qwenvl", DiffusionVL_Qwen2_5_VL_Config)
 AutoModelForCausalLM.register(DiffusionVL_Qwen2_5_VL_Config, DiffusionVL_Qwen2_5_VL_ForConditionalGeneration)
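
The RMSNorm touched in this file is the standard Qwen2-style implementation. A quick standalone check of what its forward pass computes, re-defined locally so it runs without importing the repository module:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Same computation as DiffusionVL_Qwen2_5_VL_RMSNorm in the diff above."""
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)

x = torch.randn(2, 5, 16)
y = RMSNorm(16)(x)
# Each feature vector is rescaled to (approximately) unit root mean square.
print(y.pow(2).mean(-1).sqrt())  # values close to 1.0
```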
processing_diffusionvl_qwen2_5_vl.py
CHANGED
@@ -54,6 +54,8 @@ def tokenizer_image_token(
     """
     Tokenize text with image placeholders, replacing <image> with IMAGE_TOKEN_INDEX.

+    This implementation matches the training code (llava/mm_utils.py::tokenizer_image_token).
+
     Args:
         prompt: Input text containing <image> placeholders.
         tokenizer: The tokenizer to use for encoding text.
@@ -63,26 +65,27 @@
     Returns:
         List of token IDs or a PyTorch tensor.
     """
-
+    # Tokenize each chunk (matching training code behavior)
+    prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split(DEFAULT_IMAGE_TOKEN)]
+
+    def insert_separator(X, sep):
+        return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1]

     input_ids = []
     offset = 0

-    input_ids = tokenizer(prompt_chunks[0], add_special_tokens=False).input_ids
+    # Handle BOS token if present (matching training code)
+    if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0 and prompt_chunks[0][0] == tokenizer.bos_token_id:
         offset = 1
+        input_ids.append(prompt_chunks[0][0])

-    for
-        # Add image token
-        input_ids.append(image_token_index)
-        # Add text after image
-        if len(chunk) > 0:
-            input_ids.extend(tokenizer(chunk, add_special_tokens=False).input_ids)
+    for x in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
+        input_ids.extend(x[offset:])

-    if return_tensors
+    if return_tensors is not None:
+        if return_tensors == "pt":
+            return torch.tensor(input_ids, dtype=torch.long)
+        raise ValueError(f"Unsupported tensor type: {return_tensors}")
     return input_ids
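
Finally, the rewritten `tokenizer_image_token` is easy to sanity-check with a toy tokenizer. The stub below is not the real Qwen tokenizer, and `IMAGE_TOKEN_INDEX = -200` is an assumption (the usual LLaVA-style sentinel), but the chunk-splitting and separator-insertion logic mirrors the added code:

```python
DEFAULT_IMAGE_TOKEN = "<image>"
IMAGE_TOKEN_INDEX = -200  # assumption: LLaVA-style sentinel, not taken from this repo

class ToyTokenizer:
    """Stand-in tokenizer: BOS id 1, then one id per character."""
    bos_token_id = 1
    def __call__(self, text):
        out = type("Out", (), {})()
        out.input_ids = [self.bos_token_id] + [ord(c) for c in text]
        return out

def insert_separator(X, sep):
    return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1]

def tokenizer_image_token(prompt, tokenizer, image_token_index=IMAGE_TOKEN_INDEX):
    # Mirrors the logic added in the diff above (minus the tensor conversion).
    prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split(DEFAULT_IMAGE_TOKEN)]
    input_ids, offset = [], 0
    if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0 and prompt_chunks[0][0] == tokenizer.bos_token_id:
        offset = 1
        input_ids.append(prompt_chunks[0][0])
    for x in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
        input_ids.extend(x[offset:])
    return input_ids

print(tokenizer_image_token("hi<image>ok", ToyTokenizer()))
# -> [1, 104, 105, -200, 111, 107]: BOS, "hi", the image sentinel, then "ok"
```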