---
license: mit
datasets:
- phiyodr/coco2017
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen2-0.5B-Instruct
- google/siglip-base-patch16-224
library_name: transformers
pipeline_tag: image-text-to-text
---

# Qwiglip VLM (Qwen2 + SigLIP)

A custom vision-language model built from scratch. It is inspired by the LLaVA architecture, but uses a custom MLP projector and LoRA fine-tuning for efficient training.

Training data: https://huggingface.co/datasets/phiyodr/coco2017

Full repository: https://github.com/teohyc/qwiglip_vlm

## Components

- Base LLM: Qwen/Qwen2-0.5B-Instruct
- Vision Encoder: SigLIP (google/siglip-base-patch16-224)
- LoRA fine-tuning
- Custom MLP projector

## Usage

**See `inference.py` for a detailed inference example.**

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModel, Qwen2ForCausalLM
from peft import PeftModel

# Local module from the repository (not part of transformers)
from vlm_model import MLPProjector, SiglipQwenVLM

# Configuration
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
LLM_NAME = "Qwen/Qwen2-0.5B-Instruct"
VISION_NAME = "google/siglip-base-patch16-224"
LORA_PATH = "lora_adapter"        # directory containing the LoRA adapter weights
PROJECTOR_PATH = "projector.pt"   # saved MLP projector weights
NUM_IMAGE_TOKENS = 196            # 14 x 14 patches for a 224px image with 16px patches

# Refer to inference.py for the full code
```
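
The value `NUM_IMAGE_TOKENS = 196` matches the patch grid implied by the vision checkpoint name `google/siglip-base-patch16-224`: a 224 x 224 image cut into 16 x 16 patches. The sketch below shows that arithmetic. The helper `num_patch_tokens` is a hypothetical name used only for illustration and is not part of this repository, and the sketch assumes the projector consumes one token per patch with no extra class token.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Count the non-overlapping square patches in a square image.

    Hypothetical helper for illustration; not taken from the repository.
    """
    if image_size <= 0 or patch_size <= 0:
        raise ValueError("image_size and patch_size must be positive")
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


# Values read off the checkpoint name google/siglip-base-patch16-224.
NUM_IMAGE_TOKENS = num_patch_tokens(image_size=224, patch_size=16)
print(NUM_IMAGE_TOKENS)  # 196
```

If a different SigLIP checkpoint is swapped in (a different resolution or patch size), this count changes, so `NUM_IMAGE_TOKENS` would presumably need to be updated to match.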