Dimension Mismatch Between Vision Encoder (2048) and LLM (1024) without a Projector

#1
by cuimou - opened

First of all, thank you for your interesting work on merging the Qwen3 language model with the Qwen2.5-VL vision encoder. I was very excited to try it out.

I've been attempting to run inference with the ViFortune-AI/Qwen3-VL-1B-Merged model, but I'm consistently encountering a structural issue related to mismatched hidden dimensions between the vision and language components.
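For anyone reproducing this, the mismatch can be checked from the model's config before loading any weights. Below is a minimal sketch, not the repository's own tooling. The key names (`vision_config`, `out_hidden_size`, `text_config`, `hidden_size`) are assumptions about how the merged config.json is laid out, and the example values are simply the 2048 and 1024 from the title.

```python
def check_vision_text_dims(config: dict) -> tuple[int, int, bool]:
    """Return (vision_out_dim, text_hidden_dim, dims_match) for a config dict."""
    vision_cfg = config.get("vision_config", {})
    # Assumption: the vision tower exposes its output width as `out_hidden_size`;
    # fall back to `hidden_size` if that key is absent.
    vision_dim = vision_cfg.get("out_hidden_size", vision_cfg.get("hidden_size"))
    # Assumption: the LLM width is under `text_config` or at the top level.
    text_dim = config.get("text_config", {}).get(
        "hidden_size", config.get("hidden_size")
    )
    if vision_dim is None or text_dim is None:
        raise KeyError("could not find vision and/or text hidden sizes in config")
    return vision_dim, text_dim, vision_dim == text_dim


# Hypothetical config fragment using the dimensions from the title.
example_config = {
    "hidden_size": 1024,
    "vision_config": {"out_hidden_size": 2048},
}

vision_dim, text_dim, ok = check_vision_text_dims(example_config)
if not ok:
    print(
        f"vision output dim {vision_dim} != LLM hidden dim {text_dim}: "
        "image embeddings cannot be merged into the token embeddings "
        "without a projection layer"
    )
```

In practice the dict would come from `json.load` on the downloaded config.json. If the two numbers differ, the image features need a projection layer mapping the vision width to the LLM width before they can be placed alongside the text embeddings.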

ViFortune-AI org
edited Sep 30, 2025

Maybe the model should be loaded with AutoModelForVision2Seq or Qwen2_5_VLForConditionalGeneration instead of AutoModelForCausalLM. Please try again.
However, the result is not guaranteed, since this is just a merged version.
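A minimal sketch of that suggestion, assuming a transformers release that includes `Qwen2_5_VLForConditionalGeneration` and assuming the repository ships a processor config (neither is confirmed in this thread). The helper name `load_merged_model` is hypothetical.

```python
MODEL_ID = "ViFortune-AI/Qwen3-VL-1B-Merged"


def load_merged_model(model_id: str = MODEL_ID):
    """Load the merged checkpoint with the vision-language class.

    This replaces AutoModelForCausalLM, which only builds the text model.
    """
    # Imported lazily so this sketch can be read without transformers installed.
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id)
    # Assumption: the repo provides a processor (tokenizer + image preprocessor).
    processor = AutoProcessor.from_pretrained(model_id)
    return model, processor


# Usage (downloads the weights, so it is left commented out here):
# model, processor = load_merged_model()
```

If the vision and language hidden sizes in the checkpoint really differ, `from_pretrained` may still fail with a size-mismatch error, so this only rules out the wrong-class explanation.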

Tnt3o5 changed discussion status to closed
