# CLIP2MT5 CrossAttention VQA Model

This is a Vision-Language model combining **CLIP-ViT** and **mT5** through a custom cross-attention bridge. It supports Visual Question Answering (VQA) in Turkish.

## Usage

```python
from PIL import Image
from hf_clip2mt5 import load_for_inference, predict

repo_id = "MUERIS/TurkishVLMTAMGA"
model, tokenizer, device = load_for_inference(repo_id)

image = Image.open("example.jpg")
question = "Görselde kaç kişi var?"  # "How many people are in the image?"
answer = predict(model, tokenizer, device, image, question)
print(answer)
```
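## Architecture note

The bridge module itself is not shown in this card. The sketch below is a minimal illustration of what such a cross-attention bridge could look like: CLIP-ViT patch features are projected into the mT5 embedding space and attended to by the question tokens. The class name `CrossAttentionBridge` and the dimensions (768 for a CLIP ViT-B vision tower, 512 for mT5-small) are assumptions for illustration, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionBridge(nn.Module):
    """Hypothetical bridge: fuses CLIP-ViT patch features into mT5 token
    embeddings via multi-head cross-attention (dimensions are assumed)."""

    def __init__(self, clip_dim=768, t5_dim=512, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(clip_dim, t5_dim)  # align CLIP features to mT5 width
        self.cross_attn = nn.MultiheadAttention(t5_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(t5_dim)

    def forward(self, text_embeds, image_feats):
        # text_embeds: (batch, text_len, t5_dim)      -- mT5 token embeddings (queries)
        # image_feats: (batch, num_patches, clip_dim) -- CLIP-ViT patch features (keys/values)
        vis = self.proj(image_feats)
        attended, _ = self.cross_attn(query=text_embeds, key=vis, value=vis)
        return self.norm(text_embeds + attended)  # residual connection + layer norm

# Quick shape check with dummy tensors
bridge = CrossAttentionBridge()
text = torch.randn(1, 16, 512)   # 16 question tokens
img = torch.randn(1, 50, 768)    # e.g. 49 patches + CLS token for ViT-B/32 at 224px
print(bridge(text, img).shape)   # torch.Size([1, 16, 512])
```

The fused text embeddings would then be fed to the mT5 encoder-decoder to generate the Turkish answer.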