AttributeError: 'str' object has no attribute 'shape' when calling .generate() with CogAgent Chat model

by TheRealLLMExplorer - opened May 31, 2025

I'm trying to generate captions using the cogagent-chat-hf model by passing a single video frame (converted to a PIL image) into the build_conversation_input_ids() method. However, I encounter the following error during generation:

AttributeError: 'str' object has no attribute 'shape'

The traceback points to this line inside the model’s forward call:

past_key_values_length = past_key_values[0][0].shape[2]
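For reference, the same message can be reproduced in plain Python without the model. This is only a sketch of one hypothetical shape past_key_values might take (a (name, cache) pair instead of a tuple of per-layer tuples). I have not confirmed that this is what actually happens inside generate(), and the helper name first_entry_shape_error is made up for illustration.

```python
def first_entry_shape_error(past_key_values):
    """Return the error raised by past_key_values[0][0].shape[2], or None if it succeeds."""
    try:
        past_key_values[0][0].shape[2]
    except (AttributeError, TypeError, IndexError) as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


# Hypothetical malformed value (an assumption, not taken from the traceback):
# a (name, cache) pair rather than a tuple of per-layer (key, value) tuples.
suspect = ("past_key_values", None)

# suspect[0] is the string "past_key_values", so suspect[0][0] is the
# one-character string "p", which has no .shape attribute.
message = first_entry_shape_error(suspect)
print(message)
```

If the value reaching that line really has this form, the indexing lands on a character of the name string, which would explain why a str shows up where a tensor is expected.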

What I Tried

Using both template_version="chat" and template_version="base"
Setting use_cache=False in .generate()
Ensuring that all inputs (e.g., input_ids, images, cross_images) are tensors on the correct device and dtype
Verifying that the prompt decodes correctly and contains my input text

Code

import torch

# model, tokenizer, prompt, image, device, and dtype are defined earlier (not shown)
input_by_model = model.build_conversation_input_ids(
    tokenizer=tokenizer,
    query=prompt,
    history=[],
    images=[image],  # PIL Image
    template_version="chat"
)

inputs = {
    "input_ids": input_by_model["input_ids"].unsqueeze(0).to(device),
    "token_type_ids": input_by_model["token_type_ids"].unsqueeze(0).to(device),
    "attention_mask": input_by_model["attention_mask"].unsqueeze(0).to(device),
    "images": [[input_by_model["images"][0].to(device).to(dtype)]],
}

if "cross_images" in input_by_model:
    inputs["cross_images"] = [[input_by_model["cross_images"][0].to(device).to(dtype)]]

with torch.no_grad():
    output = model.generate(**inputs, max_length=512, use_cache=False)
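To double-check the claim that every input is a tensor, a small helper along these lines could print the structure of the inputs dict right before calling generate(). This is a rough sketch: summarize and FakeTensor are hypothetical names introduced here, and the helper only relies on duck typing (anything with a .shape attribute is treated as tensor-like), so it runs without torch.

```python
def summarize(value, name="inputs", depth=0):
    """Return indented lines describing type, shape, dtype and device of nested inputs."""
    pad = "  " * depth
    shape = getattr(value, "shape", None)
    if shape is not None:
        # Tensor-like object: report its metadata instead of its contents.
        dtype = getattr(value, "dtype", "?")
        device = getattr(value, "device", "?")
        kind = type(value).__name__
        return [f"{pad}{name}: {kind} shape={tuple(shape)} dtype={dtype} device={device}"]
    if isinstance(value, dict):
        lines = [f"{pad}{name}: dict"]
        for key, item in value.items():
            lines += summarize(item, str(key), depth + 1)
        return lines
    if isinstance(value, (list, tuple)):
        lines = [f"{pad}{name}: {type(value).__name__} (len={len(value)})"]
        for i, item in enumerate(value):
            lines += summarize(item, f"[{i}]", depth + 1)
        return lines
    # Anything else (str, None, int, ...) is shown verbatim so stray values stand out.
    return [f"{pad}{name}: {type(value).__name__} = {value!r}"]


class FakeTensor:
    """Stand-in for a tensor, used only to demonstrate the helper."""

    def __init__(self, shape, dtype="float16", device="cpu"):
        self.shape = shape
        self.dtype = dtype
        self.device = device


demo = {"input_ids": FakeTensor((1, 5)), "images": [["oops"]]}
lines = summarize(demo)
print("\n".join(lines))
```

With the real inputs, print("\n".join(summarize(inputs))) just before model.generate(...) would show whether any entry is a plain string or None instead of a tensor.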

Questions

Is there something wrong with the way I'm preparing the inputs?
Why is past_key_values being interpreted as a string during generation?

Any suggestions or fixes would be much appreciated!
