---
license: mit
tags:
- multimodal
- embeddings
datasets:
- ituperceptron/image-captioning-turkish
- dogukanvzr/ml-paraphrase-tr
library_name: pytorch
language:
- tr
base_model:
- newmindai/modernbert-base-tr-uncased-allnli-stsb
- facebook/dinov2-base
---

# Turkish Multimodal Embedding Model

This repository contains a **contrastively trained Turkish multimodal embedding model** that combines a text encoder and a vision encoder with projection heads.
The model is trained entirely on **Turkish datasets** (image–caption and paraphrase pairs), so it is tailored to Turkish multimodal applications.

## Model Summary
- **Text encoder**: `newmindai/modernbert-base-tr-uncased-allnli-stsb`
- **Vision encoder**: `facebook/dinov2-base`
- **Dimensions**: `text_dim=768`, `image_dim=768`, `embed_dim=768`
- **Projection dropout**: fixed at `0.4` (inside `ProjectionHead`)
- **Pooling**: mean pooling over tokens (`use_mean_pooling_for_text=True`)
- **Normalize outputs**: `{normalize}`
- **Encoders frozen during training**: no (this release was trained with the encoders **not frozen**)
- **Language focus**: Turkish (both the text pairs and the image–caption pairs are fully in Turkish)
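
To illustrate the mean-pooling entry above, the sketch below averages token embeddings while skipping padding positions. The function name `mean_pool` and the toy tensors are hypothetical; the actual pooling code lives in `model.py` and may differ in detail.

```python
import torch


def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over non-padding positions (hypothetical helper)."""
    # (batch, tokens) -> (batch, tokens, 1) so the mask broadcasts over the embedding dim
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(dim=1)   # (batch, dim)
    counts = mask.sum(dim=1).clamp(min=1e-9)        # (batch, 1), avoids division by zero
    return summed / counts


# Toy batch: 2 sequences, 3 tokens each, 4-dim embeddings.
# The second sequence has one padding token at the end.
tokens = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4)
mask = torch.tensor([[1, 1, 1], [1, 1, 0]])
pooled = mean_pool(tokens, mask)
print(pooled.shape)
```

Padding tokens contribute nothing to the sum or the count, so sequences of different lengths in the same batch are pooled consistently.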

## Training Strategy (inspired by JINA-CLIP-v2)
- The model was trained jointly on **image–text** and **text–text** pairs using a **bidirectional contrastive loss** (InfoNCE/CLIP-style).
- For **image–text** pairs, standard CLIP-style training with **in-batch negatives** was applied.
- For **text–text** pairs, only **positive paraphrase pairs (label=1)** were used, with the other samples in the batch serving as negatives.
- This follows the general training approach of Jina’s multimodal work, but in a **simplified single-stage setup** (without the 3-stage curriculum).
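
The bidirectional contrastive objective described above can be sketched as follows. This is a minimal illustration and not the training code of this release: the function name `bidirectional_info_nce`, the temperature of `0.07`, and the unweighted sum of the two losses are all assumptions, and random tensors stand in for encoder outputs.

```python
import torch
import torch.nn.functional as F


def bidirectional_info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: row i of `a` and row i of `b` form the positive pair."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                  # (batch, batch) scaled cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    loss_a_to_b = F.cross_entropy(logits, targets)    # each a_i must pick b_i among the batch
    loss_b_to_a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a_to_b + loss_b_to_a)


# Random stand-ins for projected embeddings (embed_dim=768 as in the summary).
torch.manual_seed(0)
image_emb = torch.randn(8, 768)
text_emb = torch.randn(8, 768)
paraphrase_emb = torch.randn(8, 768)

loss_image_text = bidirectional_info_nce(image_emb, text_emb)
loss_text_text = bidirectional_info_nce(text_emb, paraphrase_emb)
total_loss = loss_image_text + loss_text_text  # equal weighting is an assumption
print(total_loss.item())
```

Every other item in the batch acts as a negative for a given pair, which is why the text–text branch only needs positive paraphrase pairs.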

## Datasets
- **Image–Text**: [`ituperceptron/image-captioning-turkish`](https://huggingface.co/datasets/ituperceptron/image-captioning-turkish)
- **Text–Text (Paraphrase)**: [`dogukanvzr/ml-paraphrase-tr`](https://huggingface.co/datasets/dogukanvzr/ml-paraphrase-tr)

> Both datasets are in Turkish, aligning the model’s embedding space around Turkish multimodal signals.
> Please check each dataset’s license and terms before downstream use.

## Files
- `pytorch_model.bin`: PyTorch `state_dict`
- `config.json`: metadata (encoder IDs, dimensions, flags)
- `model.py`: custom model classes (required to load)
- (This README is the model card.)

## Evaluation Results
**Dataset:** Test split created from [`ituperceptron/image-captioning-turkish`](https://huggingface.co/datasets/ituperceptron/image-captioning-turkish)

### Image-Text
**Average cosine similarity:** 0.7934

**Recall@K**
<table>
<tr><th>Direction</th><th>R@1</th><th>R@5</th><th>R@10</th></tr>
<tr><td>Text → Image</td><td>0.9365</td><td>0.9913</td><td>0.9971</td></tr>
<tr><td>Image → Text</td><td>0.9356</td><td>0.9927</td><td>0.9958</td></tr>
</table>

<details>
<summary>Raw metrics (JSON)</summary>

```json
{
  "avg_cosine_sim": 0.7934404611587524,
  "recall_text_to_image": {
    "R@1": 0.936458564763386,
    "R@5": 0.9913352588313709,
    "R@10": 0.9971117529437903
  },
  "recall_image_to_text": {
    "R@1": 0.9355698733614752,
    "R@5": 0.9926682959342369,
    "R@10": 0.9957787158409243
  }
}
```
</details>

### Text-Text
**Average cosine similarity:** 0.7599

**Recall@K**
<table>
<tr><th>Direction</th><th>R@1</th><th>R@5</th><th>R@10</th></tr>
<tr><td>Text → Text</td><td>0.7198</td><td>0.9453</td><td>0.9824</td></tr>
</table>

<details>
<summary>Raw metrics (JSON)</summary>

```json
{
  "avg_cosine_sim": 0.7599335312843323,
  "recall_text_to_text": {
    "R@1": 0.719875500222321,
    "R@5": 0.9453090262338817,
    "R@10": 0.9824366385060027
  }
}
```
</details>
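
For readers who want to compute comparable numbers on their own data, the sketch below shows one common way to obtain Recall@K and the average matched-pair cosine similarity from two aligned embedding matrices. The evaluation script for this release is not included here, so the function name `recall_at_k`, the one-correct-match-per-query assumption, and the reading of "average cosine similarity" as the mean over matched pairs are all assumptions.

```python
import torch
import torch.nn.functional as F


def recall_at_k(query_emb: torch.Tensor, gallery_emb: torch.Tensor, ks=(1, 5, 10)) -> dict:
    """Recall@K, assuming the only correct match for query i is gallery item i."""
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_emb, dim=-1)
    sims = q @ g.t()                               # (N, N) cosine similarities
    correct = sims.diag().unsqueeze(1)             # similarity of each matched pair
    # Rank of the correct item = number of gallery items scored strictly higher.
    ranks = (sims > correct).sum(dim=1)            # 0 means ranked first
    metrics = {f"R@{k}": (ranks < k).float().mean().item() for k in ks}
    metrics["avg_cosine_sim"] = sims.diag().mean().item()
    return metrics


# Toy check: identical queries and gallery give perfect retrieval.
torch.manual_seed(0)
gallery = torch.randn(20, 16)
metrics = recall_at_k(gallery, gallery)
print(metrics)
```

Passing text embeddings as queries and image embeddings as the gallery gives the text-to-image direction; swapping the arguments gives image-to-text.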

## Loading & Usage
```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "utkubascakir/MultiEmbedTR"

# trust_remote_code=True is needed because the custom model classes live in model.py
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)

model.eval()  # switch off dropout for inference

# Text embedding
# ("a cat with a green background", "a dog on the beach")
texts = ["yeşil arka planlı bir kedi", "kumsalda bir köpek"]
text_inputs = tokenizer(
    texts,
    padding=True,
    truncation=True,
    return_tensors="pt"
).to(device)

with torch.no_grad():
    text_embeds = model.encode_text(
        input_ids=text_inputs["input_ids"],
        attention_mask=text_inputs["attention_mask"]
    )

print("Text embeddings shape:", text_embeds.shape)

# Image embedding
img = Image.open("kedi.jpg").convert("RGB")
image_inputs = image_processor(
    images=img,
    return_tensors="pt"
).to(device)

with torch.no_grad():
    image_embeds = model.encode_image(
        pixel_values=image_inputs["pixel_values"]
    )

print("Image embeddings shape:", image_embeds.shape)

# The single image embedding is broadcast against both text embeddings,
# so this yields one similarity score per text.
similarity = F.cosine_similarity(text_embeds, image_embeds)
print("Cosine similarity:", similarity)
```
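
The `F.cosine_similarity` call above compares rows pairwise, which works here only because a single image is broadcast against the texts. When there are several texts and several images, a full similarity matrix is usually more convenient. The sketch below shows one way to build it; random tensors stand in for the outputs of `model.encode_text` and `model.encode_image`, and the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

# Stand-ins for model outputs: 3 text embeddings and 2 image embeddings.
torch.manual_seed(0)
text_embeds = torch.randn(3, 768)
image_embeds = torch.randn(2, 768)

# L2-normalize so that a dot product equals cosine similarity.
text_norm = F.normalize(text_embeds, dim=-1)
image_norm = F.normalize(image_embeds, dim=-1)
sim_matrix = text_norm @ image_norm.t()           # (num_texts, num_images)

best_image_per_text = sim_matrix.argmax(dim=1)    # index of the closest image for each text
best_text_per_image = sim_matrix.argmax(dim=0)    # index of the closest text for each image
print(sim_matrix.shape, best_image_per_text, best_text_per_image)
```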

## Limitations & Intended Use
This release provides a **Turkish multimodal embedding model** trained to produce aligned vector representations for text and images.
Apart from the Recall@K evaluation reported in this card, it has not been tested on specific downstream tasks (e.g., classification).
No guarantees are made regarding bias or toxicity; please evaluate the model on your own target domain.

## Citation
If you use this model, please cite this repository.