# CLIP2MT5 CrossAttention VQA Model

This is a Vision-Language model combining **CLIP-ViT** and **mT5** through a custom cross-attention bridge. It supports Visual Question Answering (VQA) in Turkish.

## Usage

```python
from PIL import Image
from hf_clip2mt5 import load_for_inference, predict

repo_id = "MUERIS/TurkishVLMTAMGA"
model, tokenizer, device = load_for_inference(repo_id)

image = Image.open("example.jpg")
question = "Görselde kaç kişi var?"  # "How many people are in the image?"
answer = predict(model, tokenizer, device, image, question)
print(answer)
```
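## Architecture note

The bridge module itself is not shown in this card. The sketch below is a minimal illustration of what such a cross-attention bridge could look like: CLIP-ViT patch features are projected into the mT5 embedding space and attended to by the question tokens. The class name `CrossAttentionBridge` and the dimensions (768 for a CLIP ViT-B vision tower, 512 for mT5-small) are assumptions for illustration, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionBridge(nn.Module):
    """Hypothetical bridge: fuses CLIP-ViT patch features into mT5 token
    embeddings via multi-head cross-attention (dimensions are assumed)."""

    def __init__(self, clip_dim=768, t5_dim=512, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(clip_dim, t5_dim)  # align CLIP features to mT5 width
        self.cross_attn = nn.MultiheadAttention(t5_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(t5_dim)

    def forward(self, text_embeds, image_feats):
        # text_embeds: (batch, text_len, t5_dim)      -- mT5 token embeddings (queries)
        # image_feats: (batch, num_patches, clip_dim) -- CLIP-ViT patch features (keys/values)
        vis = self.proj(image_feats)
        attended, _ = self.cross_attn(query=text_embeds, key=vis, value=vis)
        return self.norm(text_embeds + attended)  # residual connection + layer norm

# Quick shape check with dummy tensors
bridge = CrossAttentionBridge()
text = torch.randn(1, 16, 512)   # 16 question tokens
img = torch.randn(1, 50, 768)    # e.g. 49 patches + CLS token for ViT-B/32 at 224px
print(bridge(text, img).shape)   # torch.Size([1, 16, 512])
```

The fused text embeddings would then be fed to the mT5 encoder-decoder to generate the Turkish answer.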