# vqwen3-4b
A ready-to-use vision-language model built by swapping Vicuna for Qwen3-4B
in the LLaVA-1.5 recipe. Everything is pre-wired: drop in a `LlavaForConditionalGeneration`
loader, pass an image + a prompt, get text out. No rigging.
- Vision tower: `openai/clip-vit-large-patch14-336` (frozen)
- Language model: `Qwen/Qwen3-4B` with LoRA merged back into the weights
- Projector: 2-layer MLP (Linear 1024 → GELU → Linear 2560), trained stage-1 then stage-2
- Image tokens: 576 (CLS stripped) per 336×336 image, spliced into the chat sequence
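The projector bullet above maps each of the 576 CLIP patch embeddings into Qwen3's hidden space. A minimal PyTorch sketch of that 2-layer MLP (module names are illustrative, not the checkpoint's):

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """2-layer MLP: CLIP ViT-L/14 features (1024-d) -> Qwen3-4B hidden size (2560-d)."""
    def __init__(self, vision_dim: int = 1024, text_dim: int = 2560):
        super().__init__()
        self.fc1 = nn.Linear(vision_dim, text_dim)  # 1024 -> 2560
        self.act = nn.GELU()
        self.fc2 = nn.Linear(text_dim, text_dim)    # 2560 -> 2560

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: [batch, 576, 1024] -- CLS token already stripped
        return self.fc2(self.act(self.fc1(patches)))

proj = MLPProjector()
out = proj(torch.randn(1, 576, 1024))
print(out.shape)  # torch.Size([1, 576, 2560])
```

Counting parameters of this sketch (1024×2560 + 2560 plus 2560×2560 + 2560 ≈ 9.18 M) lands on the ~9.2 M figure quoted in the training recipe below.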
## Quick start
```python
import torch
from transformers import LlavaForConditionalGeneration, AutoProcessor
from PIL import Image

model_id = "alpharomercoma/vqwen3-4b"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("my_image.jpg").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in detail."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
reply = processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(reply)
```
## Training recipe
Two-stage reproduction of the LLaVA-1.5 recipe, both stages on a single H200 141 GB.
### Stage 1 — feature alignment (projector only)
- Data: `liuhaotian/LLaVA-Pretrain` (558,128 BLIP caption pairs)
- Trainable: MLP projector only (~9.2 M params). CLIP + Qwen3-4B frozen.
- Batch 256 (32 × grad_accum 8), LR 1e-3 cosine, warmup 0.03, bf16, 1 epoch
- Loss masked on the image sentinel; caption + EOS tokens contribute
- 6.1 h, final train loss 2.87
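The loss-masking bullet can be sketched like this, using the −100 ignore index that HF cross-entropy skips (function and argument names here are illustrative, not the companion repo's API):

```python
import torch

IGNORE_INDEX = -100  # labels with this value are skipped by cross-entropy

def mask_stage1_labels(input_ids: torch.Tensor, image_token_id: int,
                       caption_start: int) -> torch.Tensor:
    """Stage-1 labels: everything before the caption (prompt + the expanded
    image tokens) is masked; caption tokens and EOS keep their ids."""
    labels = input_ids.clone()
    labels[:caption_start] = IGNORE_INDEX                # mask the prompt prefix
    labels[input_ids == image_token_id] = IGNORE_INDEX   # mask the image sentinel
    return labels

# Toy sequence: two image tokens, then a 4-token caption ending in EOS.
ids = torch.tensor([151669, 151669, 5, 6, 7, 2])
print(mask_stage1_labels(ids, image_token_id=151669, caption_start=2).tolist())
```

Only the caption positions carry real label ids, so gradients flow solely from predicting the caption and its EOS.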
### Stage 2 — visual instruction tuning (projector + LoRA)
- Data: `liuhaotian/LLaVA-Instruct-150K` (LLaVA-1.5 mix665k: COCO, GQA, OCR-VQA, TextVQA, Visual Genome + ShareGPT text-only; 665,286 records after filtering 12 dead image refs)
- Trainable: projector (continues at LR 2e-5) + LoRA on Qwen3 (LR 2e-4)
- LoRA: r=128, α=256, dropout 0.05, targets `[q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]`
- Batch 128 (16 × grad_accum 8), cosine + warmup 0.03, bf16, 1 epoch
- Liger-Kernel (FusedLinearCrossEntropy + Triton RoPE/RMSNorm/SwiGLU) — math-identical to the HF baseline
- `LengthGroupedSampler` over mix665k for padding efficiency
- Gradient checkpointing ON, SDPA attention
- Conversation format: Qwen3 chat template, loss masked on non-assistant turns
- 17.1 h, final train loss 0.858
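The LoRA hyperparameters above, written out as peft-style keyword arguments (a sketch only; the companion repo's trainer wiring may differ, and `peft` itself is not imported here):

```python
# Stage-2 LoRA hyperparameters as kwargs you could pass to peft.LoraConfig.
lora_kwargs = dict(
    r=128,
    lora_alpha=256,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Standard LoRA scaling: the merged update folded into each targeted
# weight is W + (alpha / r) * (B @ A).
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(scaling)  # 2.0
```

Because the merge folds that update into the base weights, the released checkpoint ships no adapter files and needs no `peft` at inference time.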
The stage-2 LoRA has been merged back into Qwen3's weights in this release,
so loading is a single .from_pretrained() call.
## Architecture details
The model is the transformers-standard `LlavaForConditionalGeneration` with:
- `vision_config`: CLIP ViT-L/14-336 (fixed)
- `text_config`: Qwen3-4B (with LoRA merged)
- `image_seq_length`: 576
- `vision_feature_layer`: −2 (penultimate hidden state)
- `vision_feature_select_strategy`: `"default"` (strips CLS)
- `image_token_index`: 151669 (the added `<image>` special token)
- `projector_hidden_act`: `"gelu"`
Because these choices match the upstream LLaVA class, no custom code or
`trust_remote_code=True` is required.
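The 576 in `image_seq_length` is not arbitrary; it falls out of the ViT-L/14 patch grid on a 336×336 input:

```python
patch_size = 14    # ViT-L/14 patch edge
image_size = 336   # input resolution of the -336 CLIP variant

grid = image_size // patch_size   # 24 patches per side
num_image_tokens = grid ** 2      # 24 * 24 = 576
# vision_feature_select_strategy="default" strips the CLS token, so exactly
# these 576 patch embeddings are projected and spliced into the text sequence.
print(num_image_tokens)  # 576
```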
## Limitations
- Trained on `LLaVA-Instruct-150K` — inherits its distribution: English-heavy, mostly natural-image QA, OCR-light. Don't expect SOTA on GUI / document / chart tasks.
- Single image per conversation (splices one set of 576 tokens).
- Not aligned for safety/refusals beyond Qwen3-4B's base behavior + whatever mix665k contributes.
- Hallucinations are possible, especially for fine-grained object counts and spatial relations.
## License
Apache 2.0 for the projector weights and LoRA-merged Qwen3 delta. Base models retain their original licenses: OpenAI CLIP (MIT), Qwen3-4B (Apache 2.0).
## Citation / acknowledgements
- LLaVA and LLaVA-1.5 for the recipe
- `Qwen/Qwen3-4B` as the language backbone
- `openai/clip-vit-large-patch14-336` as the vision tower
- Full training + inference source: github:vqwen (companion repo)