Pure transformers inference code is needed
This is kind of working for me, but I'm not sure it's exactly the right way. Any update would be appreciated.
flash_attention_2 is not working; it raises a CUDA error during inference. Does transformers 5 use flash attention in some other way? (My workaround sketch is after the script below.)
transformers 5.2 (or maybe other 5.x?)
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration

model_path = ""

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    model_path,
    dtype=torch.bfloat16,
    device_map="cuda:0",
    # attn_implementation="flash_attention_2",  # errors with transformers 5.2; see the fallback sketch below
    local_files_only=True,
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, local_files_only=True, trust_remote_code=True)

image_file = ""
prompt = ""
image = Image.open(image_file).convert("RGB")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    # enable_thinking=False,  # non-thinking is not supported this way; NOTE: how does it work?
)
inputs = inputs.to(model.device)

generated_ids = model.generate(
    **inputs,
    # max_new_tokens=32768, temperature=1.0, top_p=0.95, top_k=20, repetition_penalty=1.0, do_sample=True  # thinking
    max_new_tokens=128, temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1.0, do_sample=True,  # non-thinking; NOTE: how does it work?
)

# Drop the prompt tokens so only the newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
r = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(r[0])
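About the flash attention question: I still don't have an official answer. For now I just probe each attn_implementation with a one-token smoke test and fall back to "sdpa". This is only a sketch of my workaround, not documented behavior; load_with_attn_fallback is just my own helper name, and the "sdpa" fallback is my assumption:

def load_with_attn_fallback(model_path, processor):
    # Try flash_attention_2 first, then fall back to PyTorch SDPA.
    for attn in ("flash_attention_2", "sdpa"):
        try:
            model = Qwen3_5ForConditionalGeneration.from_pretrained(
                model_path,
                dtype=torch.bfloat16,
                device_map="cuda:0",
                attn_implementation=attn,
                local_files_only=True,
                trust_remote_code=True,
            )
            # The CUDA error only shows up at inference time for me,
            # so run a tiny text-only generate as a smoke test.
            test = processor(text="hi", return_tensors="pt").to(model.device)
            model.generate(**test, max_new_tokens=1)
            return model
        except Exception as e:
            print(f"{attn} failed: {e}")
    raise RuntimeError("no usable attention implementation")

Note that a hard CUDA error can poison the process, so treat the fallback as best-effort; in the worst case you have to restart and load with "sdpa" directly.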
Yeah, this seems to work.
And I think the processor is still Qwen3VLProcessor, so you can actually use enable_thinking (it also works for me); it's just not officially guaranteed to function perfectly.
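For example, passing it straight to apply_chat_template. This mirrors my own usage; the kwarg is forwarded to the chat template, but again, not officially guaranteed:

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    enable_thinking=False,  # works for me, but unofficial for this model
)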
I was having difficulty getting the model to read video input; it gave me the following error:
File "/workspace/chatbots01/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/workspace/chatbots01/lib/python3.10/site-packages/transformers/generation/utils.py", line 2434, in generate
model_kwargs["position_ids"] = self._prepare_position_ids_for_generation(inputs_tensor, model_kwargs)
File "/workspace/chatbots01/lib/python3.10/site-packages/transformers/models/qwen3_5/modeling_qwen3_5.py", line 2122, in _prepare_position_ids_for_generation
vision_positions, rope_deltas = self.model.get_rope_index(inputs_tensor, **model_kwargs)
File "/workspace/chatbots01/lib/python3.10/site-packages/transformers/models/qwen3_5/modeling_qwen3_5.py", line 1572, in get_rope_index
grid_thw = next(grid_iters[modality_type])
StopIteration
However, after combing through the Transformers documentation for Qwen3VLProcessor, I think I found why I was getting the error: with frame sampling disabled, the processor never emits a video grid, so the grid_thw iterator in get_rope_index is empty and next() raises StopIteration. I then added the following line:
mm_processor_kwargs = {"fps": 2, "do_sample_frames": True}
Then, I unpacked the kwargs into the processor's from_pretrained call (note the **; passing kwargs=mm_processor_kwargs as a literal keyword argument won't apply them):
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True, **mm_processor_kwargs)
Now, the model seems to be able to read video input! I hope this benefits anyone having a similar issue, as it took me a few days to figure this out!
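For reference, here is roughly the shape of my working call. The paths and prompt are placeholders, and I'm using the "path" key for video content as in the transformers chat-template docs; double-check it against the Qwen3VLProcessor documentation for your version:

mm_processor_kwargs = {"fps": 2, "do_sample_frames": True}
processor = AutoProcessor.from_pretrained(
    model_id, trust_remote_code=True, **mm_processor_kwargs
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "clip.mp4"},  # placeholder path
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=128)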
