Dimension Mismatch Between Vision Encoder (2048) and LLM (1024) without a Projector
#1 · opened by cuimou
First of all, thank you for your interesting work on merging the Qwen3 language model with the Qwen2.5-VL vision encoder. I was very excited to try it out.
I've been attempting to run inference with the ViFortune-AI/Qwen3-VL-1B-Merged model, but I'm consistently encountering a structural issue related to mismatched hidden dimensions between the vision and language components.
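To make the mismatch concrete, here is a toy numeric sketch. The shapes (2048 for the vision encoder, 1024 for the LLM) come from the title; the projector weights below are random stand-ins, not the real module:

```python
# Toy illustration of the dimension mismatch (hypothetical shapes from the
# thread title, not read from the actual checkpoints): vision features are
# 2048-dim, but the language model expects 1024-dim inputs, so a projector
# (here a single linear map) must sit in between.
import numpy as np

rng = np.random.default_rng(0)

vision_features = rng.standard_normal((16, 2048))     # 16 image patches, 2048-dim
projector = rng.standard_normal((2048, 1024)) * 0.01  # random stand-in for a projector

# Without this mapping, feeding 2048-dim features into a 1024-wide LLM
# fails with a shape error; with it, the widths line up.
projected = vision_features @ projector  # (16, 1024), matches the LLM width
print(projected.shape)
```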
Maybe it should be `AutoModelForVision2Seq` or `Qwen2_5_VLForConditionalGeneration` instead of `AutoModelForCausalLM`. Try it again.
That said, the result isn't guaranteed, since this is just a merged checkpoint.
Tnt3o5 changed discussion status to closed