teohyc/QwigLip-VLM

	---
	license: mit
	datasets:
	- phiyodr/coco2017
	language:
	- en
	metrics:
	- accuracy
	base_model:
	- Qwen/Qwen2-0.5B-Instruct
	- google/siglip-base-patch16-224
	library_name: transformers
	pipeline_tag: image-text-to-text
	---

	# Qwiglip VLM (Qwen2 + SigLIP)

	A custom vision-language model (VLM) built from scratch on top of a pretrained Qwen2 language model and a SigLIP vision encoder. The architecture is inspired by LLaVA, but uses a custom MLP projector and LoRA fine-tuning for efficient training.

	- Training data: https://huggingface.co/datasets/phiyodr/coco2017
	- Full repository: https://github.com/teohyc/qwiglip_vlm
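
To illustrate why LoRA keeps training cheap, the sketch below counts the trainable parameters LoRA adds to a single square weight matrix at Qwen2-0.5B's hidden size (896) and compares that with training the full matrix. The rank `RANK = 8` and the helper name `lora_param_count` are assumptions for illustration only; the rank used for this model is not stated in this card.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters LoRA adds to one (d_out x d_in) weight.

    LoRA learns two low-rank factors: A (rank x d_in) and B (d_out x rank).
    """
    return rank * d_in + d_out * rank


HIDDEN = 896  # hidden size of Qwen2-0.5B
RANK = 8      # assumed rank, for illustration only

full = HIDDEN * HIDDEN                         # training the whole matrix
lora = lora_param_count(HIDDEN, HIDDEN, RANK)  # training only the LoRA factors

print(f"full matrix: {full} params, LoRA factors: {lora} params")
```

With these assumed values the LoRA factors hold 14,336 parameters against 802,816 for the full matrix, which is under 2% of the original count.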

	## Components
	- Base LLM: Qwen/Qwen2-0.5B-Instruct
	- Vision Encoder: SigLIP (google/siglip-base-patch16-224)
	- LoRA fine-tuning
	- Custom MLP projector
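
The projector's role in the list above can be sketched as follows: it maps SigLIP patch features (hidden size 768 for `siglip-base-patch16-224`) into the Qwen2-0.5B embedding space (hidden size 896). This is a minimal LLaVA-style two-layer MLP. The class name `SketchProjector`, the number of layers, and the GELU activation are assumptions, so this is not the repository's `MLPProjector`.

```python
import torch
import torch.nn as nn


class SketchProjector(nn.Module):
    """Hypothetical two-layer MLP mapping vision features to LLM embeddings."""

    def __init__(self, vision_dim: int = 768, llm_dim: int = 896):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.net(patch_features)


projector = SketchProjector()
dummy_patches = torch.randn(1, 196, 768)  # stand-in for SigLIP patch features
image_embeds = projector(dummy_patches)   # (1, 196, 896), ready for the LLM
print(image_embeds.shape)
```

The projected tensor has the same last dimension as the LLM's token embeddings, so it can be combined with embedded text tokens along the sequence axis.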

	## Usage
	See `inference.py` for a detailed inference example.

	```python
	import torch
	from PIL import Image
	from transformers import AutoTokenizer, AutoProcessor, AutoModel, Qwen2ForCausalLM
	from peft import PeftModel

	from vlm_model import MLPProjector, SiglipQwenVLM

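	# Configuration
	DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

	# Base models, loaded from the Hugging Face Hub
	LLM_NAME = "Qwen/Qwen2-0.5B-Instruct"
	VISION_NAME = "google/siglip-base-patch16-224"

	# Fine-tuned weights: LoRA adapter directory and projector checkpoint
	LORA_PATH = "lora_adapter"
	PROJECTOR_PATH = "projector.pt"

	# 224 x 224 input with 16 x 16 patches gives a 14 x 14 grid = 196 tokens
	NUM_IMAGE_TOKENS = 196

	# Refer to inference.py for the full code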
	```
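
The value `NUM_IMAGE_TOKENS = 196` follows from the vision encoder's geometry: a 224 x 224 input split into 16 x 16 patches forms a 14 x 14 grid. A quick check, using a hypothetical helper `num_patch_tokens` that is not part of the repository:

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# siglip-base-patch16-224: 224-pixel input, 16-pixel patches
tokens = num_patch_tokens(224, 16)
print(tokens)
```

Assuming the image embeddings are concatenated with the text embeddings in the LLaVA style, a prompt with `T` text tokens would occupy `196 + T` positions in the LLM's context.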