# CLIP2MT5 CrossAttention VQA Model

This is a Vision-Language model combining **CLIP-ViT** and **mT5** using a custom cross-attention bridge. It supports Visual Question Answering (VQA) in Turkish.
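For illustration, here is a minimal sketch of what such a cross-attention bridge might look like in PyTorch: text hidden states attend over projected image patch features. The class name, dimensions, and wiring below are assumptions for exposition, not the layers actually used in this repository.

```python
import torch
import torch.nn as nn

class CrossAttentionBridge(nn.Module):
    """Hypothetical bridge: text tokens (queries) attend over image patches."""

    def __init__(self, text_dim=512, vision_dim=768, num_heads=8):
        super().__init__()
        # Project vision features (e.g., CLIP-ViT patches) into the text hidden size
        self.vision_proj = nn.Linear(vision_dim, text_dim)
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_states, vision_states):
        vis = self.vision_proj(vision_states)            # (B, patches, text_dim)
        attended, _ = self.cross_attn(text_states, vis, vis)
        return self.norm(text_states + attended)         # residual + layer norm

# Example shapes: batch of 2, 20 text tokens, 50 image patches
bridge = CrossAttentionBridge()
fused = bridge(torch.randn(2, 20, 512), torch.randn(2, 50, 768))
print(fused.shape)  # torch.Size([2, 20, 512])
```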
## Usage
```python
from PIL import Image

from hf_clip2mt5 import load_for_inference, predict

# Load the checkpoint from the Hugging Face Hub and prepare it for inference
repo_id = "MUERIS/TurkishVLMTAMGA"
model, tokenizer, device = load_for_inference(repo_id)

# Ask a question about an image ("How many people are in the image?")
image = Image.open("example.jpg")
question = "Görselde kaç kişi var?"

answer = predict(model, tokenizer, device, image, question)
print(answer)
```
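Note that `hf_clip2mt5` is assumed to be the inference helper module distributed with the repository files; make sure it is on your Python path (for example, by downloading the repository files alongside the checkpoint) before running the snippet.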