Gibberish response when using the model with the transformers library with an image

#12
by jluixjurado - opened

Using the example code from the 'Use this model' button, any message that includes an image (including the one in the example) always gets the response '!!!!!!!!!!!!!!!!!!!!!!!!!!!...'

With text-only messages, everything works as expected.

Could you help me with that?

Thanks in advance.

P.S.: NVIDIA RTX 6000 Pro (96 GB), Transformers 5.2.

Example code:

from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-4B")
model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen3.5-4B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Can you please explain? I don't really understand.

Hey @jluixjurado

I'm no specialist, but I think the most likely reason is that generation_config.json is not included with the Qwen3.5 series.
For a quick try-out, I'd also recommend using transformers' pipeline, but that's only my preference.

Try adding the generation configuration manually (since generation_config.json is missing); I believe that should solve the issue.
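A minimal sketch of what adding the generation configuration manually could look like; the sampling values below are assumptions taken from the pipeline call further down, not the model's confirmed defaults:

```python
from transformers import GenerationConfig

# Hypothetical sampling settings -- adjust to whatever the model card
# recommends; these values are assumptions, not confirmed defaults.
gen_config = GenerationConfig(
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
    top_k=20,
    repetition_penalty=1.0,
)

# Option 1: pass it explicitly on each call:
#   outputs = model.generate(**inputs, generation_config=gen_config, max_new_tokens=40)
# Option 2: write it next to the downloaded weights as generation_config.json:
#   gen_config.save_pretrained("/path/to/local/model/dir")
```

Either route should override whatever fallback defaults the library would otherwise use.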

Since the example involves image processing, the following script will fail if you don't also have pillow and torchvision installed. Install them with pip install pillow torchvision (or your accelerator's equivalent).

from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Qwen/Qwen3.5-0.8B") # I'm on my laptop, so pick a parameter count to your liking
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
result = pipe(
    text=messages,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
    top_k=20,
    min_p=0.0,
    repetition_penalty=1.0,
)
print("\n--- --- ---\n", result[0]["generated_text"][-1]["content"], sep="")

I'm currently on my laptop, where processing the image would take ages, so I'll leave running the script above to you. I didn't receive any errors or warnings, so I think it should work for you too.

Have a great day.

I need to use AutoModelForImageTextToText.from_pretrained instead of pipeline.

I've tried including the temperature, repetition_penalty, top_k, top_p, and do_sample parameters in the generate call, but with no luck.

I've also tried creating my own generation_config.json in the ../hub/model... directory, but the result is the same.

I'll keep trying.

Thanks so much for your help.

Hello again @jluixjurado

Running the code on my desktop (Windows 10, RTX4060) using a Python virtual environment seems to work fine.
However, I got a CUDA OOM trying to process the original image (https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG), which is 3024×4032 pixels, so I first had to resize it to 605×807 pixels (20% of the original width and height), which let me get a response.

For setting up the environment, I used:

python -m venv .venv
.\.venv\Scripts\activate
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu130
pip install transformers

Then, I saved the following resized image as lr-candy.JPG (lr → low resolution):

[resized image: lr-candy.JPG]

Finally, running the following modified script (modified lines are marked with # MOD):

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-0.8B")
model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen3.5-0.8B").to("cuda") # MOD (.to("cuda"))
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "./lr-candy.JPG"}, # MOD ("./lr-candy.JPG")
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1024)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

gets me the following output:

[...] (Some warnings because I don't have flash-attn installed)

The animal on the candy is a **cat**.

Looking closely at the black symbol on the green and teal candies, it appears to be a stylized (...lots of text)

I suspect the model can't handle the original high resolution, so resizing the image to sub-1024-pixel dimensions might work.

Have a great day.
