---
license: mit
tags:
- multimodal
- embeddings
datasets:
- ituperceptron/image-captioning-turkish
- dogukanvzr/ml-paraphrase-tr
library_name: pytorch
language:
- tr
base_model:
- newmindai/modernbert-base-tr-uncased-allnli-stsb
- facebook/dinov2-base
---

# Turkish Multimodal Embedding Model

This repository contains a **contrastively trained Turkish multimodal embedding model**, combining a text encoder and a vision encoder with projection heads.  
The model is trained entirely on **Turkish datasets** (image–caption and paraphrase), making it specifically tailored for Turkish multimodal applications.

## Model Summary
- **Text encoder**: `newmindai/modernbert-base-tr-uncased-allnli-stsb`
- **Vision encoder**: `facebook/dinov2-base`
- **Dimensions**: `text_dim=768`, `image_dim=768`, `embed_dim=768`
- **Projection dropout**: fixed at `0.4` (inside `ProjectionHead`)
- **Pooling**: mean pooling over tokens (`use_mean_pooling_for_text=True`); see the sketch after this list
- **Normalize outputs**: see the corresponding flag in `config.json`
- **Encoders frozen during training?**: No; in this release the encoders were **not frozen** and were fine-tuned together with the projection heads
- **Language focus**: Turkish (both the paraphrase pairs and the image–caption pairs are fully in Turkish)
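
The exact module definitions ship in `model.py`, but the summary above pins down the projection and pooling behaviour. Below is a minimal, hypothetical sketch of a projection head with dropout `0.4` and mask-aware mean pooling; it is illustrative only, not the released implementation:

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Maps an encoder output (768-d) into the shared embedding space (768-d).
    Hypothetical sketch; the actual class is defined in model.py."""
    def __init__(self, in_dim: int = 768, embed_dim: int = 768, dropout: float = 0.4):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)  # fixed at 0.4 in this release

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dropout(self.proj(x))

def mean_pool(token_embeds: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean pooling over tokens, ignoring padding (use_mean_pooling_for_text=True)."""
    mask = attention_mask.unsqueeze(-1).float()   # (B, T, 1)
    summed = (token_embeds * mask).sum(dim=1)     # (B, H)
    counts = mask.sum(dim=1).clamp(min=1e-9)      # (B, 1), avoid division by zero
    return summed / counts
```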

## Training Strategy (inspired by JINA-CLIP-v2)
- The model was trained jointly on **image–text** and **text–text** pairs using a **bidirectional contrastive loss** (InfoNCE/CLIP-style); a sketch of this loss is given after this list.
- For **image–text**, standard CLIP-style training with **in-batch negatives** was applied.
- For **text–text**, only **positive paraphrase pairs (label=1)** were used, with in-batch negatives coming from other samples.
- This follows the general training philosophy often seen in Jina’s multimodal work, but in a **simplified single-stage setup** (without the 3-stage curriculum).
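
As a reference for the loss described above, a bidirectional in-batch InfoNCE objective can be written as a symmetric cross-entropy over the pairwise similarity matrix. This is a minimal sketch assuming L2-normalized embeddings and a learnable temperature (`logit_scale`); it mirrors the description above rather than reproducing the actual training script:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(emb_a: torch.Tensor, emb_b: torch.Tensor, logit_scale: torch.Tensor) -> torch.Tensor:
    """Bidirectional InfoNCE with in-batch negatives.

    emb_a / emb_b: (B, D) paired embeddings, e.g. image/caption pairs or
    positive paraphrase pairs (label=1); the other rows in the batch act as negatives.
    """
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = logit_scale.exp() * emb_a @ emb_b.t()   # (B, B) similarity matrix
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    loss_a2b = F.cross_entropy(logits, targets)      # a -> b direction
    loss_b2a = F.cross_entropy(logits.t(), targets)  # b -> a direction
    return 0.5 * (loss_a2b + loss_b2a)
```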

## Datasets
- **Image–Text**: [`ituperceptron/image-captioning-turkish`](https://huggingface.co/datasets/ituperceptron/image-captioning-turkish)  
- **Text–Text (Paraphrase)**: [`dogukanvzr/ml-paraphrase-tr`](https://huggingface.co/datasets/dogukanvzr/ml-paraphrase-tr)  

> Both datasets are in Turkish, aligning the model’s embedding space around Turkish multimodal signals.  
> Please check each dataset’s license and terms before downstream use.

## Files
- `pytorch_model.bin` — PyTorch `state_dict`
- `config.json` — metadata (encoder IDs, dimensions, flags)
- `model.py` — custom model classes (required to load)
- (This README is the model card.)

## Evaluation Results
**Dataset:** Test split created from [`ituperceptron/image-captioning-turkish`](https://huggingface.co/datasets/ituperceptron/image-captioning-turkish)

### Image-Text
**Average cosine similarity:** 0.7934

**Recall@K**
<table>
<tr><th>Direction</th><th>R@1</th><th>R@5</th><th>R@10</th></tr>
<tr><td>Text → Image</td><td>0.9365</td><td>0.9913</td><td>0.9971</td></tr>
<tr><td>Image → Text</td><td>0.9356</td><td>0.9927</td><td>0.9958</td></tr>
</table>

<details>
<summary>Raw metrics (JSON)</summary>

```json
{
    "avg_cosine_sim": 0.7934404611587524,
    "recall_text_to_image": {
        "R@1": 0.936458564763386,
        "R@5": 0.9913352588313709,
        "R@10": 0.9971117529437903
    },
    "recall_image_to_text": {
        "R@1": 0.9355698733614752,
        "R@5": 0.9926682959342369,
        "R@10": 0.9957787158409243
    }
}
```
</details>

### Text-Text
**Average cosine similarity:** 0.7599

**Recall@K**
<table>
<tr><th>Direction</th><th>R@1</th><th>R@5</th><th>R@10</th></tr>
<tr><td>Text → Text</td><td>0.7198</td><td>0.9453</td><td>0.9824</td></tr>
</table>

<details>
<summary>Raw metrics (JSON)</summary>
  
```json
{
    "avg_cosine_sim": 0.7599335312843323,
    "recall_text_to_text": {
        "R@1": 0.719875500222321,
        "R@5": 0.9453090262338817,
        "R@10": 0.9824366385060027
    }
}
```
</details>
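
The Recall@K figures above follow the standard paired-retrieval definition: a query counts as a hit if its ground-truth partner appears among its top-K most similar items. A minimal sketch of that computation (illustrative; not the original evaluation script):

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_embeds: torch.Tensor, gallery_embeds: torch.Tensor, ks=(1, 5, 10)) -> dict:
    """Recall@K for paired data: query i's correct match is gallery item i."""
    q = F.normalize(query_embeds, dim=-1)
    g = F.normalize(gallery_embeds, dim=-1)
    sims = q @ g.t()                                        # (N, N) cosine similarities
    ranks = sims.argsort(dim=-1, descending=True)           # gallery indices, best first
    targets = torch.arange(q.size(0), device=q.device).unsqueeze(-1)  # (N, 1)
    return {
        f"R@{k}": (ranks[:, :k] == targets).any(dim=-1).float().mean().item()
        for k in ks
    }

# e.g. recall_at_k(text_embeds, image_embeds) corresponds to the Text -> Image direction
```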

## Loading & Usage
```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "utkubascakir/MultiEmbedTR"

model = AutoModel.from_pretrained(model_name, trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)

model.eval()

# Text Embedding
texts = ["yeşil arka planlı bir kedi", "kumsalda bir köpek"]
text_inputs = tokenizer(
    texts,
    padding=True,
    truncation=True,
    return_tensors="pt"
).to(device)

with torch.no_grad():
    text_embeds = model.encode_text(
        input_ids=text_inputs["input_ids"],
        attention_mask=text_inputs["attention_mask"]
    )  

print("Text embeddings shape:", text_embeds.shape)

# Image Embedding
img = Image.open("kedi.jpg").convert("RGB")
image_inputs = image_processor(
    images=img,
    return_tensors="pt"
).to(device)

with torch.no_grad():
    image_embeds = model.encode_image(
        pixel_values=image_inputs["pixel_values"]
    ) 

print("Image embeddings shape:", image_embeds.shape)

similarity = F.cosine_similarity(text_embeds, image_embeds)
print("Cosine similarity:", similarity)
```
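
Continuing from the snippet above, retrieval over several candidate images reduces to a cosine-similarity matrix between all text and image embeddings. A small illustrative extension (the file names are placeholders; `encode_image` is called exactly as in the single-image example):

```python
# Rank a batch of candidate images for each text query.
image_embeds_batch = []
for path in ["kedi.jpg", "kopek.jpg"]:  # illustrative file names
    inputs = image_processor(images=Image.open(path).convert("RGB"), return_tensors="pt").to(device)
    with torch.no_grad():
        image_embeds_batch.append(model.encode_image(pixel_values=inputs["pixel_values"]))
image_embeds_batch = torch.cat(image_embeds_batch, dim=0)  # (num_images, 768)

sims = F.cosine_similarity(
    text_embeds.unsqueeze(1), image_embeds_batch.unsqueeze(0), dim=-1
)  # (num_texts, num_images)
print("Best image index per text query:", sims.argmax(dim=-1).tolist())
```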
  
## Limitations & Intended Use
This release provides a **Turkish multimodal embedding model**, trained to produce aligned vector representations for text and images.  
Apart from the retrieval metrics reported above, it has not been evaluated on specific downstream tasks (e.g., classification).  
No bias or toxicity analysis has been performed; please evaluate the model on your own target domain before use.

## Citation
If you use this model, please cite this repository.