DiffusionVL

DiffusionVL is a vision-language model built on the Qwen2.5-VL architecture that generates text with BD3LM block diffusion instead of autoregressive decoding.

Usage

from transformers import AutoModelForCausalLM, AutoProcessor
import torch

# Load model with trust_remote_code
model = AutoModelForCausalLM.from_pretrained(
    "path/to/model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Load processor (includes tokenizer)
processor = AutoProcessor.from_pretrained("path/to/model", trust_remote_code=True)

# Image + text generation
from PIL import Image
import requests

url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."}
    ]}
]
text = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) if hasattr(v, 'to') else v for k, v in inputs.items()}

# Generate with diffusion
output_ids = model.generate(
    inputs=inputs["input_ids"],
    images=inputs.get("pixel_values"),
    image_grid_thws=inputs.get("image_grid_thw"),
    gen_length=256,
    steps=8,
    temperature=0.0,
    remasking_strategy="low_confidence_static",
)

# Decode output
output_text = processor.decode(output_ids[0], skip_special_tokens=True)
print(output_text)

Generation Parameters

  • gen_length: Number of tokens to generate (default: 256)
  • steps: Number of diffusion steps per block (default: 8)
  • temperature: Sampling temperature, 0 for greedy (default: 0.0)
  • top_k: Top-k sampling parameter (default: 0, disabled)
  • top_p: Top-p (nucleus) sampling parameter (default: 1.0)
  • remasking_strategy: 'low_confidence' or 'sequential' (default: 'low_confidence')
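
The defaults listed above can be collected in one place so that greedy and sampled runs differ only in the fields that are overridden. The sketch below is a minimal illustration: GenerationSettings is a hypothetical helper, not part of the model's code, and the range checks are assumptions based on the usual meaning of these parameters rather than checks the model is known to perform.

```python
from dataclasses import asdict, dataclass


@dataclass
class GenerationSettings:
    """Hypothetical container mirroring the documented defaults."""

    gen_length: int = 256          # number of tokens to generate
    steps: int = 8                 # diffusion steps per block
    temperature: float = 0.0       # 0 means greedy decoding
    top_k: int = 0                 # 0 disables top-k filtering
    top_p: float = 1.0             # 1.0 disables nucleus filtering
    remasking_strategy: str = "low_confidence"

    def __post_init__(self):
        # Assumed sanity checks; the model may validate differently.
        if self.gen_length <= 0 or self.steps <= 0:
            raise ValueError("gen_length and steps must be positive")
        if self.temperature < 0:
            raise ValueError("temperature must be non-negative")
        if self.top_k < 0:
            raise ValueError("top_k must be non-negative (0 disables it)")
        if not 0.0 < self.top_p <= 1.0:
            raise ValueError("top_p must be in (0, 1]")

    def to_kwargs(self):
        """Return the settings as keyword arguments for generate()."""
        return asdict(self)


greedy = GenerationSettings()
sampled = GenerationSettings(temperature=0.7, top_p=0.9)
print(sampled.to_kwargs())
```

With the objects from the Usage section, the settings would be passed as model.generate(inputs=inputs["input_ids"], images=inputs.get("pixel_values"), image_grid_thws=inputs.get("image_grid_thw"), **sampled.to_kwargs()).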

Model Configuration

  • Architecture: DiffusionVL_Qwen2_5_VL_ForConditionalGeneration
  • BD3LM Enabled: True
  • Block Size: 8
  • Hidden Size: 3584
  • Num Layers: 28
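
Because steps counts diffusion steps per block and the block size is 8, the cost of a generation call scales with the number of blocks. The sketch below works through that arithmetic under two assumptions that the configuration above does not confirm: the response is split into consecutive blocks (with the last block rounded up), and every block runs exactly steps denoising passes with no early stopping. denoising_schedule is a hypothetical helper name.

```python
import math


def denoising_schedule(gen_length, block_size, steps_per_block):
    """Estimate (num_blocks, total_denoising_steps) for one generation call.

    Assumes block-by-block generation with a fixed number of
    denoising steps per block and no early stopping.
    """
    num_blocks = math.ceil(gen_length / block_size)
    total_steps = num_blocks * steps_per_block
    return num_blocks, total_steps


# Documented defaults: gen_length=256, block size 8, steps=8.
num_blocks, total_steps = denoising_schedule(256, 8, 8)
print(num_blocks, total_steps)  # 32 256
```

Under these assumptions the default settings give 32 blocks and up to 256 denoising passes, so lowering steps trades output quality for speed roughly linearly.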

Notes

  • Loading requires trust_remote_code=True because the repository ships custom modeling code
  • Both model and processor can be loaded from the same directory
  • Image preprocessing uses Qwen2VLImageProcessor internally (identical to Qwen2.5-VL)