pure transformers inference code is needed

#24
by maltoseflower - opened

please give a pure transformers demo code, like Qwen3

This is kinda working for me,
but I'm not sure if it is exactly the right way. Any update would be appreciated.

flash_attn_2 is not working; it raises a CUDA error during inference. Does transformers 5 use flash attention in a different way?

transformers 5.2 (or maybe other 5.x?)
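As a hedged workaround for the flash-attn error, one option is to fall back to PyTorch's SDPA attention when the flash_attn package isn't usable. A minimal sketch (`pick_attn_implementation` is my own helper name, not a transformers API; the option strings `"flash_attention_2"` and `"sdpa"` are the standard `attn_implementation` values):

```python
import importlib.util

def pick_attn_implementation() -> str:
    """Return an attn_implementation string for from_pretrained(),
    falling back to PyTorch SDPA when flash_attn is not importable."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"
```

You could then pass `attn_implementation=pick_attn_implementation()` to `from_pretrained`, though this only handles the missing-package case, not a flash-attn build that imports but crashes at inference time.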

import os
import torch
from PIL import Image

from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration

model_path = ""

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    model_path,
    dtype=torch.bfloat16,
    device_map="cuda:0",
    # attn_implementation="flash_attention_2",  # with transformers 5.2, error
    local_files_only=True,
    trust_remote_code=True,
)

processor = AutoProcessor.from_pretrained(model_path, local_files_only=True, trust_remote_code=True)

image_file =""
prompt = ""

image = Image.open(image_file).convert('RGB')

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image,
            },
            {"type": "text", "text": prompt},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    # enable_thinking=False  # non-thinking is not supported via the API. NOTE: how does it work?
)
inputs = inputs.to(model.device)

generated_ids = model.generate(
    **inputs,
    # max_new_tokens=32768, temperature=1.0, top_p=0.95, top_k=20, repetition_penalty=1.0, do_sample=True  # thinking
    max_new_tokens=128, temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1.0, do_sample=True,  # non-thinking, NOTE: how does it work?
)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
r = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(r[0])

Yeah, this seems to work.
And I think the processor is still Qwen3VLProcessor, so you can actually use enable_thinking (it also works for me); it's just not officially guaranteed to function perfectly.
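Since the thread switches between thinking and non-thinking settings, here is a small sketch that just collects the two sampling presets from the commented-out lines in the snippet above into named dicts (`generation_kwargs` is my own helper, not a transformers API), so the chosen preset can be matched to the `enable_thinking` flag:

```python
# Sampling presets mirroring the commented values in the snippet above:
# one for thinking mode, one for non-thinking mode.
GEN_PRESETS = {
    "thinking": dict(max_new_tokens=32768, temperature=1.0, top_p=0.95,
                     top_k=20, repetition_penalty=1.0, do_sample=True),
    "non_thinking": dict(max_new_tokens=128, temperature=0.7, top_p=0.8,
                         top_k=20, repetition_penalty=1.0, do_sample=True),
}

def generation_kwargs(thinking: bool) -> dict:
    """Pick the preset matching the enable_thinking flag you passed
    to apply_chat_template, so the two stay in sync."""
    return GEN_PRESETS["thinking" if thinking else "non_thinking"]
```

Usage would be `model.generate(**inputs, **generation_kwargs(thinking=False))`; the specific values are just what this thread uses, not official recommendations.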

I was having difficulty getting the model to read video input; it gave me the following error:

File "/workspace/chatbots01/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/workspace/chatbots01/lib/python3.10/site-packages/transformers/generation/utils.py", line 2434, in generate
model_kwargs["position_ids"] = self._prepare_position_ids_for_generation(inputs_tensor, model_kwargs)
File "/workspace/chatbots01/lib/python3.10/site-packages/transformers/models/qwen3_5/modeling_qwen3_5.py", line 2122, in _prepare_position_ids_for_generation
vision_positions, rope_deltas = self.model.get_rope_index(inputs_tensor, **model_kwargs)
File "/workspace/chatbots01/lib/python3.10/site-packages/transformers/models/qwen3_5/modeling_qwen3_5.py", line 1572, in get_rope_index
grid_thw = next(grid_iters[modality_type])
StopIteration

However, after combing through the Transformers documentation for Qwen3VLProcessor, I think I found why I was getting the error. I then added the following line:

mm_processor_kwargs = {"fps": 2, "do_sample_frames": True}

Then, I added the kwargs to the processor method:

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True, **mm_processor_kwargs)

Now, the model seems to be able to read video input! I hope this benefits anyone having a similar issue, as it took me a few days to figure this out!
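To tie the video fix together, here is a hedged sketch of how the pieces above might be assembled: a video entry in the chat message plus the frame-sampling kwargs reported to fix the StopIteration. `build_video_inputs` is my own illustrative helper (not a transformers API), and the path and fps values are placeholders:

```python
# Sketch (not verified against the real model): build a video chat message
# and the frame-sampling kwargs from the post above.
def build_video_inputs(video_path: str, prompt: str, fps: int = 2):
    messages = [{
        "role": "user",
        "content": [
            {"type": "video", "video": video_path},  # video instead of image
            {"type": "text", "text": prompt},
        ],
    }]
    # The kwargs reported above to enable frame sampling for video input.
    mm_processor_kwargs = {"fps": fps, "do_sample_frames": True}
    return messages, mm_processor_kwargs
```

The messages would then go through `processor.apply_chat_template` as in the image example, with the kwargs supplied to the processor as described above.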
