DiffusionVL
DiffusionVL is a vision-language model based on the Qwen2.5-VL architecture that uses BD3LM-style block diffusion, rather than standard autoregressive decoding, to generate text.
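To give an intuition for the generation parameters used below (diffusion steps per block, low-confidence remasking), here is a minimal, illustrative sketch of one low-confidence unmasking step as used in masked-diffusion decoders generally. It is not DiffusionVL's actual implementation, and every name in it is hypothetical.

import torch

def unmask_step(logits, tokens, mask, frac_to_commit):
    # One illustrative diffusion step over a block: predict every masked
    # position, commit the most confident predictions, and leave the rest
    # masked for the next step ("low confidence" remasking).
    probs = torch.softmax(logits, dim=-1)        # (block_len, vocab)
    confidence, predictions = probs.max(dim=-1)  # per-position confidence
    confidence = confidence.masked_fill(~mask, float("-inf"))  # committed positions don't compete
    num_to_commit = max(1, int(frac_to_commit * int(mask.sum())))
    commit = confidence.topk(num_to_commit).indices
    tokens[commit] = predictions[commit]
    mask[commit] = False                         # these positions are now fixed
    return tokens, mask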
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
import torch
# Load the model; trust_remote_code=True is required because the repository
# ships custom modeling code for the diffusion-based generation
model = AutoModelForCausalLM.from_pretrained(
    "path/to/model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Load the processor (bundles the tokenizer and image preprocessor)
processor = AutoProcessor.from_pretrained("path/to/model", trust_remote_code=True)
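Because generation relies on custom code shipped with the checkpoint, a quick sanity check that the custom class was actually loaded (rather than the stock transformers Qwen2.5-VL class) can save debugging time; the exact class name printed depends on the repository:

# The class should come from the repo's remote modeling file, not from
# transformers' built-in Qwen2.5-VL implementation
print(type(model).__name__)
print(model.dtype, model.device)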
# Image + text generation
from PIL import Image
import requests
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]}
]

# Build the chat-formatted prompt, then preprocess text and image together
text = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)

# Move all tensors to the model's device
inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()}
# Generate with block diffusion instead of token-by-token decoding
output_ids = model.generate(
    inputs=inputs["input_ids"],
    images=inputs.get("pixel_values"),
    image_grid_thws=inputs.get("image_grid_thw"),
    gen_length=256,                              # number of tokens to generate
    steps=8,                                     # diffusion steps per block
    temperature=0.0,                             # 0.0 = greedy decoding
    remasking_strategy="low_confidence_static",
)
# Decode output
output_text = processor.decode(output_ids[0], skip_special_tokens=True)
print(output_text)
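The same call should also work for text-only prompts. The sketch below assumes the custom generate() accepts the identical keyword arguments with the image inputs simply absent (the processor output then has no pixel_values, so the .get() calls return None):

# Text-only prompt: no image passed to the processor
messages = [{"role": "user", "content": [{"type": "text", "text": "What is block diffusion decoding?"}]}]
text = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()}

output_ids = model.generate(
    inputs=inputs["input_ids"],
    images=inputs.get("pixel_values"),           # None for a text-only prompt
    image_grid_thws=inputs.get("image_grid_thw"),
    gen_length=256,
    steps=8,
    temperature=0.0,
    remasking_strategy="low_confidence_static",
)
print(processor.decode(output_ids[0], skip_special_tokens=True))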
Generation parameters:

- gen_length: number of tokens to generate (default: 256)
- steps: number of diffusion steps per block (default: 8)
- temperature: sampling temperature, 0 for greedy (default: 0.0)
- top_k: top-k sampling parameter (default: 0, disabled)
- top_p: top-p (nucleus) sampling parameter (default: 1.0)
- remasking_strategy: 'low_confidence' or 'sequential' (default: 'low_confidence')

trust_remote_code=True is required because the checkpoint includes custom modeling code.
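For non-greedy output, the sampling parameters above can be combined in a single call; a minimal sketch (the specific values are illustrative, not recommended settings):

# Sampled decoding: non-zero temperature plus top-k / nucleus filtering
output_ids = model.generate(
    inputs=inputs["input_ids"],
    images=inputs.get("pixel_values"),
    image_grid_thws=inputs.get("image_grid_thw"),
    gen_length=256,
    steps=8,
    temperature=0.7,
    top_k=50,
    top_p=0.9,
    remasking_strategy="low_confidence",
)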