Upload folder using huggingface_hub
- README.md +28 -86
- config.json +80 -1
- configuration_diffusionvl_qwen2_5_vl.py +5 -1
- modeling_diffusionvl_qwen2_5_vl.py +37 -56
- processing_diffusionvl_qwen2_5_vl.py +16 -13
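
For context, a commit titled like this one is what the `huggingface_hub` client produces when a local model directory is pushed in a single call. A minimal sketch of that workflow, with a placeholder repo id and local path (neither is taken from this repository):

```python
from huggingface_hub import upload_folder

# Assumes you are already authenticated (e.g. via `huggingface-cli login`).
# Every file in the local directory is uploaded as one commit.
upload_folder(
    repo_id="your-org/your-model",            # placeholder repo id
    folder_path="./DiffusionVL-checkpoint",   # placeholder local directory
    commit_message="Upload folder using huggingface_hub",
)
```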
README.md
CHANGED
@@ -3,93 +3,31 @@ license: apache-2.0
 tags:
 - diffusion
 - vision-language
-- diffusionvl
 - qwen2.5-vl
-library_name: transformers
-pipeline_tag: image-text-to-text
 ---

-<div align="center">
-
-[Lunbin Zeng](https://github.com/xiazhi1)<sup>1,\*</sup>, [Jingfeng Yao](https://github.com/JingfengYao)<sup>1,\*</sup>, [Bencheng Liao](https://github.com/LegendBC)<sup>1</sup>, [Hongyuan Tao](https://github.com/Hongyuan-Tao)<sup>1</sup>, [Wenyu Liu](https://scholar.google.com/citations?user=D7jDk7gAAAAJ&hl=en)<sup>1</sup>, [Xinggang Wang](https://xwcv.github.io)<sup>1,✉️</sup>
-
-<sup>1</sup>Huazhong University of Science and Technology
-
-<sup>*</sup>equal contribution, <sup>✉️</sup>corresponding author, xgwang@hust.edu.cn
-
-[](https://arxiv.org/abs/2512.15713) <a href="https://github.com/hustvl/DiffusionVL"><img src="https://img.shields.io/badge/GitHub-Repository-black?logo=github" alt="GitHub"></a> <a href="https://huggingface.co/collections/hustvl/diffusionvl"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue" alt="Hugging Face"></a>
-
-</div>
-
-## 📰 News
-
-- **[2025.12.18]** 🎉 Our paper **DiffusionVL** is released on arXiv! We also release the DiffusionVL models translated from Qwen2.5-VL on Hugging Face. The training code and more models are coming soon!
-
-## 📄 Introduction
-
-The diffusion paradigm has emerged as a promising alternative to autoregressive (AR) models, offering the potential for efficient parallel decoding. However, existing diffusion vision language models (dVLMs) largely lag behind mainstream autoregressive vision language models in performance, primarily due to the capability limitations of their base diffusion language models.
-
-DiffusionVL bridges this gap by answering a fundamental question: ***Can we directly translate any existing autoregressive model into a powerful diffusion vision language model?*** We propose a diffusion finetuning framework that "translates" any pretrained AR model into a diffusion vision language model through a simple paradigm shift and modality shift. Unlike prior dVLMs restricted to fixed generation lengths, DiffusionVL introduces a novel block decoding strategy that allows arbitrary-length generation and KV-cache reuse. With this integrated design, despite training on less than 5% of the data required by previous methods, DiffusionVL translated from AR-VLMs achieves state-of-the-art performance among existing dVLMs and delivers a 2.0× inference speedup.
-
-## ✨ Highlights
-
-- **Universal Translation Framework:** Translate any AR model into a dVLM with a simple yet effective approach.
-
-- **Superior Performance:** Achieve SOTA dVLM performance using <5% of the training data (738K vs. 16.5M samples).
-
-- **2.0× Faster Inference:** The block decoding strategy enables KV-cache reuse and a 2.0× speedup over previous dVLMs.
-
-<div align="center">
-<img src="https://github.com/hustvl/DiffusionVL/raw/main/assets/benchmark.png" alt="Benchmark Image" width="800">
-<img src="https://github.com/hustvl/DiffusionVL/raw/main/assets/framework.png" alt="Framework" width="800">
-</div>
-
-### 🎯 Inference with Pre-trained Models
-
-- **Download Pre-trained Models:**
-
-| Model | Base Model | Download |
-| :--- | :--- | :--- |
-| **DiffusionVL-Qwen2.5VL-3B** | Qwen2.5-VL-3B | [HuggingFace](https://huggingface.co/hustvl/DiffusionVL-Qwen2.5VL-3B) |
-| **DiffusionVL-Qwen2.5VL-7B** | Qwen2.5-VL-7B | [HuggingFace](https://huggingface.co/hustvl/DiffusionVL-Qwen2.5VL-7B) |
-
-- **Environment Setup:**
-
-The core dependencies are listed below:
-```
-torch==2.6.0
-torchvision==0.21.0
-torchaudio==2.6.0
-transformers==4.55.0
-accelerate==1.10.1
-pillow==10.4.0
-requests==2.32.5
-```
-
-- **Quick Start:**
+# DiffusionVL
+
+DiffusionVL is a vision-language model based on the Qwen2.5-VL architecture with BD3LM diffusion-based generation.
+
+## Usage

 ```python
-from transformers import AutoModelForCausalLM, AutoProcessor
+from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
 import torch

 # Load model with trust_remote_code
 model = AutoModelForCausalLM.from_pretrained(
-    "
+    "path/to/model",
     torch_dtype=torch.bfloat16,
     device_map="auto",
     trust_remote_code=True
 )

 # Load processor (includes tokenizer)
-processor = AutoProcessor.from_pretrained("
+processor = AutoProcessor.from_pretrained("path/to/model", trust_remote_code=True)

+# Image + text generation
 from PIL import Image
 import requests

@@ -110,7 +48,7 @@ output_ids = model.generate(
     inputs=inputs["input_ids"],
     images=inputs.get("pixel_values"),
     image_grid_thws=inputs.get("image_grid_thw"),
-    gen_length=
+    gen_length=256,
     steps=8,
     temperature=0.0,
     remasking_strategy="low_confidence_static",
@@ -119,23 +57,27 @@ output_ids = model.generate(
 # Decode output
 output_text = processor.decode(output_ids[0], skip_special_tokens=True)
 print(output_text)
-
 ```

-##
+## Generation Parameters

+- `gen_length`: Number of tokens to generate (default: 256)
+- `steps`: Number of diffusion steps per block (default: 8)
+- `temperature`: Sampling temperature, 0 for greedy (default: 0.0)
+- `top_k`: Top-k sampling parameter (default: 0, disabled)
+- `top_p`: Top-p (nucleus) sampling parameter (default: 1.0)
+- `remasking_strategy`: 'low_confidence' or 'sequential' (default: 'low_confidence')

-##
+## Model Configuration

+- **Architecture**: DiffusionVL_Qwen2_5_VL_ForConditionalGeneration
+- **BD3LM Enabled**: True
+- **Block Size**: 8
+- **Hidden Size**: 2048
+- **Num Layers**: 36

+## Notes

+- The model uses `trust_remote_code=True` because it includes custom modeling code
+- Both model and processor can be loaded from the same directory
+- Image preprocessing uses Qwen2VLImageProcessor internally (identical to Qwen2.5-VL)
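
The `remasking_strategy` options listed in the updated README are easiest to see in code. The sketch below shows one illustrative "low confidence" denoising step over a block of masked tokens; it is a toy standalone example, not the model's actual implementation, and names such as `mask_id` and `tokens_per_step` are assumptions:

```python
import torch

def low_confidence_unmask_step(logits, tokens, mask_id, tokens_per_step):
    """One illustrative step: commit the masked positions whose argmax
    prediction is most confident; keep the rest masked for later steps."""
    probs = torch.softmax(logits, dim=-1)        # (seq_len, vocab_size)
    confidence, candidates = probs.max(dim=-1)   # best probability / token id per position
    still_masked = tokens == mask_id
    # Only masked positions compete; already-decoded positions are frozen.
    confidence = torch.where(still_masked, confidence, torch.full_like(confidence, -1.0))
    k = min(tokens_per_step, int(still_masked.sum()))
    if k > 0:
        unmask_idx = torch.topk(confidence, k).indices
        tokens = tokens.clone()
        tokens[unmask_idx] = candidates[unmask_idx]
    return tokens

# Toy demo: a block of 8 positions over a 10-token vocabulary, all masked (mask id 9).
logits = torch.randn(8, 10)
logits[:, 9] = -1e9                # never predict the mask token itself
tokens = torch.full((8,), 9)
tokens = low_confidence_unmask_step(logits, tokens, mask_id=9, tokens_per_step=2)
print(tokens)                      # two positions are now real token ids, six remain masked
```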
config.json
CHANGED
@@ -95,9 +95,10 @@
   "mm_use_im_start_end": false,
   "mm_vision_select_feature": "patch",
   "mm_vision_select_layer": -2,
+  "mm_vision_tower": "/data/minimax-dialogue/users/qingke/results/hf_models/Qwen2.5-VL-3B-Instruct-Reformat",
   "mm_vision_tower_lr": 2e-06,
   "model_max_length": 8192,
-  "model_type": "
+  "model_type": "diffusionvl_qwenvl",
   "num_attention_heads": 16,
   "num_hidden_layers": 36,
   "num_key_value_heads": 2,
@@ -114,6 +115,84 @@
   },
   "rope_theta": 1000000.0,
   "sliding_window": null,
+  "text_config": {
+    "architectures": [
+      "Qwen2_5_VLForConditionalGeneration"
+    ],
+    "attention_dropout": 0.0,
+    "bos_token_id": 151643,
+    "eos_token_id": 151645,
+    "hidden_act": "silu",
+    "hidden_size": 2048,
+    "image_token_id": null,
+    "initializer_range": 0.02,
+    "intermediate_size": 11008,
+    "layer_types": [
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention"
+    ],
+    "max_position_embeddings": 128000,
+    "max_window_layers": 70,
+    "model_type": "qwen2_5_vl_text",
+    "num_attention_heads": 16,
+    "num_hidden_layers": 36,
+    "num_key_value_heads": 2,
+    "rms_norm_eps": 1e-06,
+    "rope_scaling": {
+      "mrope_section": [
+        16,
+        24,
+        24
+      ],
+      "rope_type": "default",
+      "type": "default"
+    },
+    "rope_theta": 1000000.0,
+    "sliding_window": null,
+    "tie_word_embeddings": true,
+    "torch_dtype": "float32",
+    "use_cache": true,
+    "use_sliding_window": false,
+    "video_token_id": null,
+    "vision_end_token_id": 151653,
+    "vision_start_token_id": 151652,
+    "vision_token_id": 151654,
+    "vocab_size": 151936
+  },
   "tie_word_embeddings": true,
   "tokenizer_model_max_length": 8192,
   "tokenizer_padding_side": "right",
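
Because `model_type` is now the custom `diffusionvl_qwenvl`, the configuration above only resolves through the remote code bundled with the repository. A small sketch of loading it, with the path as a placeholder for a local clone or the Hub repo id:

```python
from transformers import AutoConfig

# Placeholder path; a local clone of this repository or its Hub id works the same way.
config = AutoConfig.from_pretrained("path/to/model", trust_remote_code=True)

print(config.model_type)         # "diffusionvl_qwenvl"
print(config.num_hidden_layers)  # 36: the text-model fields stay flattened at the top level
# Note: the nested "text_config" block is popped in the config's __init__
# (see the configuration change below), so it does not become an attribute.
```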
configuration_diffusionvl_qwen2_5_vl.py
CHANGED
@@ -190,7 +190,7 @@ class DiffusionVL_Qwen2_5_VL_Config(PretrainedConfig):
     ```
     """

-    model_type = "
+    model_type = "diffusionvl_qwenvl"
     sub_configs = {"vision_config": DiffusionVL_Qwen2_5_VL_VisionConfig}
     keys_to_ignore_at_inference = ["past_key_values"]

@@ -229,6 +229,10 @@
         rope_scaling: Optional[dict] = None,
         **kwargs,
     ):
+        # Remove text_config from kwargs to avoid GenerationConfig issues
+        # (text_config is only needed for train code, HF config uses flattened params)
+        kwargs.pop("text_config", None)
+
         # Text model configuration
         self.vocab_size = vocab_size
         self.hidden_size = hidden_size
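
The `kwargs.pop("text_config", None)` call above is the usual way to keep a training-only field in `config.json` from turning into a config attribute. A standalone illustration of the pattern; the class below is hypothetical and not part of this repository:

```python
from transformers import PretrainedConfig

class TinyConfig(PretrainedConfig):
    model_type = "tiny_example"  # hypothetical model type, for illustration only

    def __init__(self, hidden_size=64, **kwargs):
        # Drop the training-only field before it can be stored on the config
        # (mirrors kwargs.pop("text_config", None) in the change above).
        kwargs.pop("text_config", None)
        self.hidden_size = hidden_size
        super().__init__(**kwargs)

cfg = TinyConfig(hidden_size=128, text_config={"num_hidden_layers": 4})
print(hasattr(cfg, "text_config"))  # False: the nested dict was discarded
print(cfg.hidden_size)              # 128
```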
modeling_diffusionvl_qwen2_5_vl.py
CHANGED
@@ -33,6 +33,7 @@ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
 from transformers.utils import logging
 from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS
 from transformers.modeling_layers import GradientCheckpointingLayer
+from transformers.integrations import use_kernel_forward_from_hub

 from .configuration_diffusionvl_qwen2_5_vl import DiffusionVL_Qwen2_5_VL_Config, DiffusionVL_Qwen2_5_VL_VisionConfig

@@ -120,20 +121,24 @@ def apply_multimodal_rotary_pos_emb(
     k_embed = (k * cos) + (rotate_half(k) * sin)
     return q_embed, k_embed

-
+@use_kernel_forward_from_hub("RMSNorm")
 class DiffusionVL_Qwen2_5_VL_RMSNorm(nn.Module):
+    """RMSNorm implementation matching Qwen2RMSNorm from modeling_qwen2.py"""
     def __init__(self, hidden_size, eps=1e-6):
         super().__init__()
         self.weight = nn.Parameter(torch.ones(hidden_size))
         self.variance_epsilon = eps

-    def forward(self, hidden_states):
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
         input_dtype = hidden_states.dtype
         hidden_states = hidden_states.to(torch.float32)
         variance = hidden_states.pow(2).mean(-1, keepdim=True)
         hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
         return self.weight * hidden_states.to(input_dtype)

+    def extra_repr(self):
+        return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
+

 def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
     """
@@ -590,17 +595,17 @@ class DiffusionVL_Qwen2_5_VL_RotaryEmbedding(nn.Module):


 class DiffusionVL_Qwen2_5_VL_MLP(nn.Module):
-    def __init__(self, config):
+    def __init__(self, config, bias: bool = False):
         super().__init__()
         self.hidden_size = config.hidden_size
         self.intermediate_size = config.intermediate_size
-        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=
-        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=
-        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=
-        self.act_fn =
+        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=bias)
+        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=bias)
+        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=bias)
+        self.act_fn = ACT2FN[config.hidden_act]

-    def forward(self,
-        return self.down_proj(self.act_fn(self.gate_proj(
+    def forward(self, hidden_state):
+        return self.down_proj(self.act_fn(self.gate_proj(hidden_state)) * self.up_proj(hidden_state))


 class DiffusionVL_Qwen2_5_VL_Attention(nn.Module):
@@ -759,18 +764,15 @@ class DiffusionVL_Qwen2_5_VL_PreTrainedModel(PreTrainedModel):

     config_class = DiffusionVL_Qwen2_5_VL_Config
     base_model_prefix = "model"
+    input_modalities = ["image", "video", "text"]
     supports_gradient_checkpointing = True
     _no_split_modules = ["DiffusionVL_Qwen2_5_VL_DecoderLayer", "DiffusionVL_Qwen2_5_VL_VisionBlock"]
+    _skip_keys_device_placement = "past_key_values"
+    _supports_flash_attn = True
+    _supports_sdpa = True

+    _can_compile_fullgraph = True
+    _supports_attention_backend = True
-
-    def _init_weights(self, module):
-        std = self.config.initializer_range
-        if isinstance(module, nn.Linear):
-            module.weight.data.normal_(mean=0.0, std=std)
-            if module.bias is not None:
-                module.bias.data.zero_()
-        elif isinstance(module, nn.Embedding):
-            module.weight.data.normal_(mean=0.0, std=std)


 class DiffusionVL_Qwen2_5_VL_Model(DiffusionVL_Qwen2_5_VL_PreTrainedModel):
@@ -1233,7 +1235,6 @@ class DiffusionVL_Qwen2_5_VL_ForConditionalGeneration(DiffusionVL_Qwen2_5_VL_PreTrainedModel):
         top_k: int = 0,
         top_p: float = 1.0,
         remasking_strategy: str = 'low_confidence_static',
-        use_kv_cache: bool = True,
         confidence_threshold: float = 0.85,
         **kwargs,
     ):
@@ -1248,7 +1249,6 @@ class DiffusionVL_Qwen2_5_VL_ForConditionalGeneration(DiffusionVL_Qwen2_5_VL_PreTrainedModel):
             top_k: Top-k sampling parameter
             top_p: Top-p (nucleus) sampling parameter
             remasking_strategy: 'low_confidence_static', 'low_confidence_dynamic', or 'sequential'
-            use_kv_cache: Whether to use KV cache (default True)
             confidence_threshold: Threshold for low_confidence_dynamic strategy

         Returns:
@@ -1291,8 +1291,8 @@ class DiffusionVL_Qwen2_5_VL_ForConditionalGeneration(DiffusionVL_Qwen2_5_VL_PreTrainedModel):
         prefill_blocks = prompt_len // block_size
         prefill_length = prefill_blocks * block_size

-        past_key_values = DynamicCache()
-        if
+        past_key_values = DynamicCache()
+        if prefill_length > 0:
             prefill_embeds = x_embeds[:, :prefill_length]
             prefill_mask = block_diffusion_mask[:, :, :prefill_length, :prefill_length]
             prefill_pos_ids = position_ids[:, :prefill_length]
@@ -1336,45 +1336,26 @@ class DiffusionVL_Qwen2_5_VL_ForConditionalGeneration(DiffusionVL_Qwen2_5_VL_PreTrainedModel):
                 is_mask = torch.all(torch.abs(cur_block_embeds - mask_embed.to(cur_block_embeds.device)) < 1e-5, dim=-1)
                 if not is_mask.any():
                     # Store KV for fully unmasked block
-                    _ = self.model(
-                        inputs_embeds=cur_block_embeds,
-                        attention_mask=model_mask,
-                        position_ids=cur_pos_ids,
-                        past_key_values=past_key_values,
-                        use_cache=True,
-                        store_kv=True
-                    )
-                    break
-
-                # Forward pass
-                if use_kv_cache:
-                    outputs = self.model(
-                        inputs_embeds=cur_block_embeds,
-                        attention_mask=model_mask,
-                        position_ids=cur_pos_ids,
-                        past_key_values=past_key_values,
-                        use_cache=True,
-                        store_kv=
-                    )
-                        position_ids=context_pos_ids,
-                        past_key_values=None,
-                        use_cache=False,
-                        store_kv=False
-                    )
-                logits = self.lm_head(outputs.last_hidden_state[:, block_start:block_end]).float()
+                    _ = self.model(
+                        inputs_embeds=cur_block_embeds,
+                        attention_mask=model_mask,
+                        position_ids=cur_pos_ids,
+                        past_key_values=past_key_values,
+                        use_cache=True,
+                        store_kv=True
+                    )
+                    break
+
+                # Forward pass
+                outputs = self.model(
+                    inputs_embeds=cur_block_embeds,
+                    attention_mask=model_mask,
+                    position_ids=cur_pos_ids,
+                    past_key_values=past_key_values,
+                    use_cache=True,
+                    store_kv=False
+                )
+                logits = self.lm_head(outputs.last_hidden_state).float()

                 # Sample tokens
                 x0, x0_p = self._sample_tokens(logits, temperature, top_k, top_p)
@@ -1500,7 +1481,7 @@ class DiffusionVL_Qwen2_5_VL_ForConditionalGeneration(DiffusionVL_Qwen2_5_VL_PreTrainedModel):

 from transformers import AutoConfig, AutoModelForCausalLM

-AutoConfig.register("
+AutoConfig.register("diffusionvl_qwenvl", DiffusionVL_Qwen2_5_VL_Config)
 AutoModelForCausalLM.register(DiffusionVL_Qwen2_5_VL_Config, DiffusionVL_Qwen2_5_VL_ForConditionalGeneration)
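
The RMSNorm touched in this file is the standard Qwen2-style implementation. A quick standalone check of what its forward pass computes, re-defined locally so it runs without importing the repository module:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Same computation as DiffusionVL_Qwen2_5_VL_RMSNorm in the diff above."""
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)

x = torch.randn(2, 5, 16)
y = RMSNorm(16)(x)
# Each feature vector is rescaled to (approximately) unit root mean square.
print(y.pow(2).mean(-1).sqrt())  # values close to 1.0
```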
processing_diffusionvl_qwen2_5_vl.py
CHANGED
@@ -54,6 +54,8 @@ def tokenizer_image_token(
     """
     Tokenize text with image placeholders, replacing <image> with IMAGE_TOKEN_INDEX.

+    This implementation matches the training code (llava/mm_utils.py::tokenizer_image_token).
+
     Args:
         prompt: Input text containing <image> placeholders.
         tokenizer: The tokenizer to use for encoding text.
@@ -63,26 +65,27 @@
     Returns:
         List of token IDs or a PyTorch tensor.
     """
-
+    # Tokenize each chunk (matching training code behavior)
+    prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split(DEFAULT_IMAGE_TOKEN)]
+
+    def insert_separator(X, sep):
+        return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1]

     input_ids = []
     offset = 0

-    input_ids = tokenizer(prompt_chunks[0], add_special_tokens=False).input_ids
+    # Handle BOS token if present (matching training code)
+    if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0 and prompt_chunks[0][0] == tokenizer.bos_token_id:
         offset = 1
+        input_ids.append(prompt_chunks[0][0])

-    for
-        # Add image token
-        input_ids.append(image_token_index)
-        # Add text after image
-        if len(chunk) > 0:
-            input_ids.extend(tokenizer(chunk, add_special_tokens=False).input_ids)
+    for x in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
+        input_ids.extend(x[offset:])

-    if return_tensors
+    if return_tensors is not None:
+        if return_tensors == "pt":
+            return torch.tensor(input_ids, dtype=torch.long)
+        raise ValueError(f"Unsupported tensor type: {return_tensors}")
     return input_ids
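
Finally, the rewritten `tokenizer_image_token` is easy to sanity-check with a toy tokenizer. The stub below is not the real Qwen tokenizer, and `IMAGE_TOKEN_INDEX = -200` is an assumption (the usual LLaVA-style sentinel), but the chunk-splitting and separator-insertion logic mirrors the added code:

```python
DEFAULT_IMAGE_TOKEN = "<image>"
IMAGE_TOKEN_INDEX = -200  # assumption: LLaVA-style sentinel, not taken from this repo

class ToyTokenizer:
    """Stand-in tokenizer: BOS id 1, then one id per character."""
    bos_token_id = 1
    def __call__(self, text):
        out = type("Out", (), {})()
        out.input_ids = [self.bos_token_id] + [ord(c) for c in text]
        return out

def insert_separator(X, sep):
    return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1]

def tokenizer_image_token(prompt, tokenizer, image_token_index=IMAGE_TOKEN_INDEX):
    # Mirrors the logic added in the diff above (minus the tensor conversion).
    prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split(DEFAULT_IMAGE_TOKEN)]
    input_ids, offset = [], 0
    if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0 and prompt_chunks[0][0] == tokenizer.bos_token_id:
        offset = 1
        input_ids.append(prompt_chunks[0][0])
    for x in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
        input_ids.extend(x[offset:])
    return input_ids

print(tokenizer_image_token("hi<image>ok", ToyTokenizer()))
# -> [1, 104, 105, -200, 111, 107]: BOS, "hi", the image sentinel, then "ok"
```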