---
license: mit
datasets:
- phiyodr/coco2017
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen2-0.5B-Instruct
- google/siglip-base-patch16-224
library_name: transformers
pipeline_tag: image-text-to-text
---
# Qwiglip VLM (Qwen2 + SigLIP)
Qwiglip is a custom vision-language model (VLM) built from scratch. Its architecture is inspired by LLaVA, but it uses a custom MLP projector and LoRA fine-tuning for efficient training.

Training data: https://huggingface.co/datasets/phiyodr/coco2017

Full repository: https://github.com/teohyc/qwiglip_vlm
## Components
- Base LLM: Qwen/Qwen2-0.5B-Instruct
- Vision Encoder: google/siglip-base-patch16-224 (SigLIP)
- LoRA fine-tuning
- Custom MLP projector
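
As a rough illustration of how these components could fit together, the sketch below projects vision-encoder patch features into the LLM embedding space and prepends them to text embeddings, in the LLaVA style. Everything here is an assumption for illustration: `SketchProjector` is a hypothetical name, the two-layer GELU layout is a guess, and the hidden sizes (768 for SigLIP base, 896 for Qwen2-0.5B) are assumed from the base models' published configurations. The actual `MLPProjector` in `vlm_model.py` may differ.

```python
import torch
import torch.nn as nn

VISION_DIM = 768  # assumed hidden size of google/siglip-base-patch16-224
LLM_DIM = 896     # assumed hidden size of Qwen/Qwen2-0.5B-Instruct


class SketchProjector(nn.Module):
    """Hypothetical two-layer MLP; not the repository's MLPProjector."""

    def __init__(self, vision_dim: int = VISION_DIM, llm_dim: int = LLM_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.net(patch_features)


projector = SketchProjector()

# Dummy tensors standing in for real encoder outputs and token embeddings.
patch_features = torch.randn(1, 196, VISION_DIM)
text_embeds = torch.randn(1, 12, LLM_DIM)

image_embeds = projector(patch_features)

# Assumed LLaVA-style layout: image tokens first, then text tokens.
inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
print(inputs_embeds.shape)
```

In a setup like this, only the projector and the LoRA adapter weights would need training, while the vision encoder and the base LLM weights stay frozen. Whether this repository freezes them is not stated here.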
## Usage
See `inference.py` for a detailed inference example.
```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModel, Qwen2ForCausalLM
from peft import PeftModel
from vlm_model import MLPProjector, SiglipQwenVLM

# Configuration
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
LLM_NAME = "Qwen/Qwen2-0.5B-Instruct"           # base language model
VISION_NAME = "google/siglip-base-patch16-224"  # vision encoder
LORA_PATH = "lora_adapter"                      # LoRA adapter weights
PROJECTOR_PATH = "projector.pt"                 # MLP projector weights
NUM_IMAGE_TOKENS = 196                          # image tokens per image

# Refer to inference.py for the full code.
```
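
The value `NUM_IMAGE_TOKENS = 196` is consistent with the vision encoder's name: a 224x224 input split into 16x16 patches gives a 14x14 grid. The helper below is a hypothetical sketch of that arithmetic, assuming one token per patch and no extra class token. It is not part of the repository.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


# siglip-base-patch16-224: 224 // 16 = 14 patches per side, 14 * 14 = 196
print(num_patch_tokens(224, 16))
```

If the vision encoder is swapped for one with a different resolution or patch size, `NUM_IMAGE_TOKENS` would presumably need to change to match.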