Turkish Multimodal Embedding Model

This repository contains a contrastively trained Turkish multimodal embedding model, combining a text encoder and a vision encoder with projection heads.
The model is trained entirely on Turkish datasets (image–caption and paraphrase), making it specifically tailored for Turkish multimodal applications.

Model Summary

  • Text encoder: newmindai/modernbert-base-tr-uncased-allnli-stsb
  • Vision encoder: facebook/dinov2-base
  • Dimensions: text_dim=768, image_dim=768, embed_dim=768
  • Projection dropout: fixed at 0.4 (inside ProjectionHead)
  • Pooling: mean pooling over tokens (use_mean_pooling_for_text=True)
  • Normalize outputs: {normalize}
  • Encoders frozen during training?: No (this release was trained with encoders NOT frozen)
  • Language focus: Turkish (both text and image–caption pairs are fully in Turkish)
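
The mean pooling mentioned above (use_mean_pooling_for_text=True) can be sketched as follows. This is an illustrative implementation of masked mean pooling, not necessarily the repository's exact code:

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over the sequence, ignoring padding positions."""
    # Expand the mask to the hidden dimension: (batch, seq_len, 1)
    mask = attention_mask.unsqueeze(-1).float()
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero
    return summed / counts
```

Padding tokens contribute nothing to the sum and are excluded from the count, so short and long sequences are averaged fairly.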

Training Strategy (inspired by the JINA-CLIP-v2 recipe)

  • The model was trained jointly with image–text and text–text pairs using a bidirectional contrastive loss (InfoNCE/CLIP-style).
  • For image–text, standard CLIP-style training with in-batch negatives was applied.
  • For text–text, only positive paraphrase pairs (label=1) were used, with in-batch negatives coming from other samples.
  • This follows the general training philosophy often seen in Jina’s multimodal work, but in a simplified single-stage setup (without the 3-stage curriculum).
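
The bidirectional contrastive objective described above can be sketched as a symmetric InfoNCE loss with in-batch negatives. This is a minimal illustration (the temperature value is an assumption, not taken from the training config):

```python
import torch
import torch.nn.functional as F

def bidirectional_infonce(emb_a: torch.Tensor, emb_b: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric CLIP-style loss: row i of emb_a is the positive for row i of emb_b;
    all other rows in the batch serve as negatives."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature                 # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    loss_a2b = F.cross_entropy(logits, targets)      # e.g. text -> image direction
    loss_b2a = F.cross_entropy(logits.t(), targets)  # e.g. image -> text direction
    return (loss_a2b + loss_b2a) / 2
```

The same function covers both pair types: image-text batches pass image and caption embeddings, while text-text batches pass the two sides of a paraphrase pair, with in-batch negatives arising automatically from the off-diagonal entries.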

Datasets

Both datasets are in Turkish, aligning the model’s embedding space around Turkish multimodal signals.
Please check each dataset’s license and terms before downstream use.

Files

  • pytorch_model.bin — PyTorch state_dict
  • config.json — metadata (encoder IDs, dimensions, flags)
  • model.py — custom model classes (required to load)
  • (This README is the model card.)

Evaluation Results

Dataset: Test split created from ituperceptron/image-captioning-turkish

Image-Text

Average cosine similarity: 0.7934

Recall@K

Direction       R@1      R@5      R@10
Text → Image    0.9365   0.9913   0.9971
Image → Text    0.9356   0.9927   0.9958
Raw metrics (JSON)
{
    "avg_cosine_sim": 0.7934404611587524,
    "recall_text_to_image": {
        "R@1": 0.936458564763386,
        "R@5": 0.9913352588313709,
        "R@10": 0.9971117529437903
    },
    "recall_image_to_text": {
        "R@1": 0.9355698733614752,
        "R@5": 0.9926682959342369,
        "R@10": 0.9957787158409243
    }
}
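
The Recall@K figures above can be reproduced with a simple ranking check over cosine similarities. A sketch, assuming embeddings are paired by row index (query i matches gallery item i):

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_embeds: torch.Tensor, gallery_embeds: torch.Tensor,
                ks=(1, 5, 10)) -> dict:
    """Fraction of queries whose true match (same row index) appears in the top k."""
    q = F.normalize(query_embeds, dim=-1)
    g = F.normalize(gallery_embeds, dim=-1)
    sims = q @ g.t()                                  # (n_queries, n_gallery)
    ranks = sims.argsort(dim=-1, descending=True)     # gallery indices, best first
    targets = torch.arange(q.size(0)).unsqueeze(1)
    hits = ranks == targets                           # True where the match appears
    return {f"R@{k}": hits[:, :k].any(dim=1).float().mean().item() for k in ks}
```

Running it in both directions (text queries against image gallery, and vice versa) yields the two rows of the table.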

Text-Text

Average cosine similarity: 0.7599

Recall@K

Direction      R@1      R@5      R@10
Text → Text    0.7198   0.9453   0.9824
Raw metrics (JSON)
{
    "avg_cosine_sim": 0.7599335312843323,
    "recall_text_to_text": {
        "R@1": 0.719875500222321,
        "R@5": 0.9453090262338817,
        "R@10": 0.9824366385060027
    }
}

Loading & Usage

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "utkubascakir/MultiEmbedTR"

model = AutoModel.from_pretrained(model_name, trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)

model.eval()

# Text Embedding
texts = ["yeşil arka planlı bir kedi", "kumsalda bir köpek"]
text_inputs = tokenizer(
    texts,
    padding=True,
    truncation=True,
    return_tensors="pt"
).to(device)

with torch.no_grad():
    text_embeds = model.encode_text(
        input_ids=text_inputs["input_ids"],
        attention_mask=text_inputs["attention_mask"]
    )  

print("Text embeddings shape:", text_embeds.shape)

# Image Embedding
img = Image.open("kedi.jpg").convert("RGB")
image_inputs = image_processor(
    images=img,
    return_tensors="pt"
).to(device)

with torch.no_grad():
    image_embeds = model.encode_image(
        pixel_values=image_inputs["pixel_values"]
    ) 

print("Image embeddings shape:", image_embeds.shape)

similarity = F.cosine_similarity(text_embeds, image_embeds)
print("Cosine similarity:", similarity)

Limitations & Intended Use

This release provides a Turkish multimodal embedding model, trained to produce aligned vector representations for text and images.
Beyond the Recall@K evaluation reported above, it has not been validated for specific downstream tasks (e.g., production-scale retrieval, classification).
No guarantees are made regarding bias or toxicity; please evaluate on your own target domain before deployment.

Citation

If you use this model, please cite this repository.

Model size: 0.2B parameters (F32)