---
license: mit
tags:
- multimodal
- embeddings
datasets:
- ituperceptron/image-captioning-turkish
- dogukanvzr/ml-paraphrase-tr
library_name: pytorch
language:
- tr
base_model:
- newmindai/modernbert-base-tr-uncased-allnli-stsb
- facebook/dinov2-base
---

# Turkish Multimodal Embedding Model

This repository contains a **contrastively trained Turkish multimodal embedding model** that combines a text encoder and a vision encoder through projection heads. The model is trained entirely on **Turkish datasets** (image–caption and paraphrase), making it specifically tailored to Turkish multimodal applications.

## Model Summary

- **Text encoder**: `newmindai/modernbert-base-tr-uncased-allnli-stsb`
- **Vision encoder**: `facebook/dinov2-base`
- **Dimensions**: `text_dim=768`, `image_dim=768`, `embed_dim=768`
- **Projection dropout**: fixed at `0.4` (inside `ProjectionHead`)
- **Pooling**: mean pooling over tokens (`use_mean_pooling_for_text=True`)
- **Normalize outputs**: `{normalize}`
- **Encoders frozen during training?**: `{frozen}` (this release was trained with encoders **not** frozen)
- **Language focus**: Turkish (both the text and the image–caption pairs are fully in Turkish)

## Training Strategy (inspired by JINA-CLIP-v2)

- The model was trained jointly on **image–text** and **text–text** pairs with a **bidirectional contrastive loss** (InfoNCE/CLIP-style).
- For **image–text**, standard CLIP-style training with **in-batch negatives** was applied.
- For **text–text**, only **positive paraphrase pairs (label=1)** were used, with in-batch negatives coming from the other samples in the batch.
- This follows the general training philosophy of Jina's multimodal work, but in a **simplified single-stage setup** (without the three-stage curriculum).
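The objective above can be sketched as follows. This is an illustrative reconstruction, **not** the code in `model.py`: the `ProjectionHead` layer layout and the temperature value are assumptions; only the dimensions, the `0.4` dropout, and the symmetric in-batch-negative InfoNCE objective come from this card.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Maps encoder outputs into the shared 768-d embedding space.
    Illustrative layout: only in/out dims and the 0.4 dropout are from the card."""
    def __init__(self, in_dim: int = 768, embed_dim: int = 768, dropout: float = 0.4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, embed_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def bidirectional_info_nce(a: torch.Tensor, b: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style symmetric InfoNCE with in-batch negatives.
    a, b: (batch, embed_dim) paired embeddings (image-text or text-text);
    the matching pairs sit on the diagonal of the similarity matrix."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature           # (batch, batch) similarities
    targets = torch.arange(a.size(0))          # row i matches column i
    loss_a2b = F.cross_entropy(logits, targets)
    loss_b2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2b + loss_b2a) / 2
```

The same loss function serves both pair types: for text–text batches, `a` and `b` are the two paraphrase embeddings, and the other rows in the batch act as negatives.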
## Datasets

- **Image–Text**: [`ituperceptron/image-captioning-turkish`](https://huggingface.co/datasets/ituperceptron/image-captioning-turkish)
- **Text–Text (paraphrase)**: [`dogukanvzr/ml-paraphrase-tr`](https://huggingface.co/datasets/dogukanvzr/ml-paraphrase-tr)

> Both datasets are in Turkish, aligning the model's embedding space around Turkish multimodal signals.
> Please check each dataset's license and terms before downstream use.

## Files

- `pytorch_model.bin` — PyTorch `state_dict`
- `config.json` — metadata (encoder IDs, dimensions, flags)
- `model.py` — custom model classes (required to load)
- (This README is the model card.)

## Evaluation Results

**Dataset:** test split created from [`ituperceptron/image-captioning-turkish`](https://huggingface.co/datasets/ituperceptron/image-captioning-turkish)

### Image–Text

**Average cosine similarity:** 0.7934

**Recall@K**
| Direction    | R@1    | R@5    | R@10   |
|--------------|--------|--------|--------|
| Text → Image | 0.9365 | 0.9913 | 0.9971 |
| Image → Text | 0.9356 | 0.9927 | 0.9958 |
<details>
<summary>Raw metrics (JSON)</summary>

```json
{
  "avg_cosine_sim": 0.7934404611587524,
  "recall_text_to_image": {
    "R@1": 0.936458564763386,
    "R@5": 0.9913352588313709,
    "R@10": 0.9971117529437903
  },
  "recall_image_to_text": {
    "R@1": 0.9355698733614752,
    "R@5": 0.9926682959342369,
    "R@10": 0.9957787158409243
  }
}
```

</details>
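Recall@K, as reported here, is the fraction of queries whose true match ranks among the top-k nearest neighbours by cosine similarity. A minimal sketch (illustrative, not the evaluation script used for this card), assuming row `i` of the query and gallery matrices come from the same pair:

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_embeds: torch.Tensor, gallery_embeds: torch.Tensor,
                ks=(1, 5, 10)) -> dict:
    """Fraction of queries whose true match (same row index in the
    gallery) appears among the top-k cosine-similarity neighbours."""
    q = F.normalize(query_embeds, dim=-1)
    g = F.normalize(gallery_embeds, dim=-1)
    sims = q @ g.t()                                # (n_queries, n_gallery)
    ranks = sims.argsort(dim=-1, descending=True)   # neighbour indices per query
    targets = torch.arange(q.size(0)).unsqueeze(1)  # true match per query
    hits = ranks == targets                         # True where the match ranks
    return {f"R@{k}": hits[:, :k].any(dim=-1).float().mean().item() for k in ks}
```

Running it in both directions (captions as queries against images, then images as queries against captions) yields the two rows of the table above.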
### Text–Text

**Average cosine similarity:** 0.7599

**Recall@K**
| Direction   | R@1    | R@5    | R@10   |
|-------------|--------|--------|--------|
| Text → Text | 0.7198 | 0.9453 | 0.9824 |
<details>
<summary>Raw metrics (JSON)</summary>

```json
{
  "avg_cosine_sim": 0.7599335312843323,
  "recall_text_to_text": {
    "R@1": 0.719875500222321,
    "R@5": 0.9453090262338817,
    "R@10": 0.9824366385060027
  }
}
```

</details>
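The average-cosine-similarity figures in both evaluation sections are, presumably, the mean cosine similarity over the positive pairs only (not the full similarity matrix); a minimal sketch under that assumption:

```python
import torch
import torch.nn.functional as F

def avg_cosine_similarity(a: torch.Tensor, b: torch.Tensor) -> float:
    """Mean cosine similarity over paired rows (positive pairs only);
    a and b are (n_pairs, embed_dim) with row i forming one pair."""
    return F.cosine_similarity(a, b, dim=-1).mean().item()
```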
## Loading & Usage

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "utkubascakir/MultiEmbedTR"

# trust_remote_code=True is required: the custom model classes live in model.py
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)
model.eval()

# Text embedding
# ("a cat on a green background", "a dog on the beach")
texts = ["yeşil arka planlı bir kedi", "kumsalda bir köpek"]
text_inputs = tokenizer(
    texts, padding=True, truncation=True, return_tensors="pt"
).to(device)
with torch.no_grad():
    text_embeds = model.encode_text(
        input_ids=text_inputs["input_ids"],
        attention_mask=text_inputs["attention_mask"],
    )
print("Text embeddings shape:", text_embeds.shape)

# Image embedding
img = Image.open("kedi.jpg").convert("RGB")
image_inputs = image_processor(images=img, return_tensors="pt").to(device)
with torch.no_grad():
    image_embeds = model.encode_image(
        pixel_values=image_inputs["pixel_values"]
    )
print("Image embeddings shape:", image_embeds.shape)

similarity = F.cosine_similarity(text_embeds, image_embeds)
print("Cosine similarity:", similarity)
```

## Limitations & Intended Use

This release provides a **Turkish multimodal embedding model** trained to produce aligned vector representations for text and images. It has not been evaluated on specific downstream tasks (e.g., retrieval, classification), and no bias or toxicity analysis has been performed; please evaluate the model on your own target domain before production use.

## Citation

If you use this model, please cite this repository.