	---
	license: mit
	tags:
	- multimodal
	- embeddings
	datasets:
	- ituperceptron/image-captioning-turkish
	- dogukanvzr/ml-paraphrase-tr
	library_name: pytorch
	language:
	- tr
	base_model:
	- newmindai/modernbert-base-tr-uncased-allnli-stsb
	- facebook/dinov2-base
	---

	# Turkish Multimodal Embedding Model

	This repository contains a contrastively trained Turkish multimodal embedding model, combining a text encoder and a vision encoder with projection heads.
	The model is trained entirely on Turkish datasets (image–caption and paraphrase), making it specifically tailored for Turkish multimodal applications.

	## Model Summary
	- Text encoder: `newmindai/modernbert-base-tr-uncased-allnli-stsb`
	- Vision encoder: `facebook/dinov2-base`
	- Dimensions: `text_dim=768`, `image_dim=768`, `embed_dim=768`
	- Projection dropout: fixed at `0.4` (inside `ProjectionHead`)
	- Pooling: mean pooling over tokens (`use_mean_pooling_for_text=True`)
	- Normalize outputs: `{normalize}`
	- Encoders frozen during training: no (both encoders were fine-tuned in this release)
	- Language focus: Turkish (both text and image–caption pairs are fully in Turkish)
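
Masked mean pooling, as named in the pooling bullet above, averages only the non-padding token vectors. The sketch below is framework-free so the arithmetic stays visible. The function names are illustrative and are not taken from `model.py`, and the L2 normalization step is shown only as an option, since this card does not state the normalization setting.

```python
import math


def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 and ignore padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    count = max(count, 1)  # guard against a sequence that is all padding
    return [value / count for value in total]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length (optional step, assumed here)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


# Toy sequence: two real tokens followed by one padding token (dim = 2).
tokens = [[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)  # the padding vector is excluded
unit = l2_normalize(pooled)
```

In PyTorch the same idea is usually written as a masked sum over the token axis divided by the mask sum.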

	## Training Strategy (inspired by jina-clip-v2)
	- The model was trained jointly with image–text and text–text pairs using a bidirectional contrastive loss (InfoNCE/CLIP-style).
	- For image–text, standard CLIP-style training with in-batch negatives was applied.
	- For text–text, only positive paraphrase pairs (label=1) were used, with in-batch negatives coming from other samples.
	- This follows the general training philosophy often seen in Jina’s multimodal work, but in a simplified single-stage setup (without the 3-stage curriculum).
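
The bidirectional contrastive objective described above can be sketched as a symmetric InfoNCE loss over a batch of positive pairs, where every other item in the batch serves as a negative. This is a minimal illustration and not the repository's training code: the function names are hypothetical, and the temperature of `0.07` is an assumed value that this card does not specify.

```python
import math


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    losses = []
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_sum_exp - row[i])
    return sum(losses) / len(losses)


def symmetric_info_nce(a, b, temperature=0.07):
    """a[i] and b[i] form a positive pair; vectors are assumed L2-normalized.

    The temperature default is an assumption, not a value from this card.
    """
    logits = [
        [sum(x * y for x, y in zip(u, v)) / temperature for v in b]
        for u in a
    ]
    logits_t = [list(col) for col in zip(*logits)]  # the reverse direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy batch of two pairs with unit-length 2-D embeddings.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each anchor matches its own partner
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each anchor matches the wrong partner

loss_aligned = symmetric_info_nce(anchors, aligned)
loss_swapped = symmetric_info_nce(anchors, swapped)
```

The same function covers both pair types: `a` holds image embeddings and `b` caption embeddings for image–text batches, while both hold text embeddings for paraphrase batches.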

	## Datasets
	- Image–Text: [`ituperceptron/image-captioning-turkish`](https://huggingface.co/datasets/ituperceptron/image-captioning-turkish)
	- Text–Text (Paraphrase): [`dogukanvzr/ml-paraphrase-tr`](https://huggingface.co/datasets/dogukanvzr/ml-paraphrase-tr)

	> Both datasets are in Turkish, aligning the model’s embedding space around Turkish multimodal signals.
	> Please check each dataset’s license and terms before downstream use.

	## Files
	- `pytorch_model.bin` — PyTorch `state_dict`
	- `config.json` — metadata (encoder IDs, dimensions, flags)
	- `model.py` — custom model classes (required to load)
	- (This README is the model card.)

	## Evaluation Results
	Dataset: Test split created from [`ituperceptron/image-captioning-turkish`](https://huggingface.co/datasets/ituperceptron/image-captioning-turkish)

	### Image-Text
	Average cosine similarity: 0.7934

	Recall@K
	<table>
	<tr><th>Direction</th><th>R@1</th><th>R@5</th><th>R@10</th></tr>
	<tr><td>Text → Image</td><td>0.9365</td><td>0.9913</td><td>0.9971</td></tr>
	<tr><td>Image → Text</td><td>0.9356</td><td>0.9927</td><td>0.9958</td></tr>
	</table>

	<details>
	<summary>Raw metrics (JSON)</summary>

	```json
	{
	  "avg_cosine_sim": 0.7934404611587524,
	  "recall_text_to_image": {
	    "R@1": 0.936458564763386,
	    "R@5": 0.9913352588313709,
	    "R@10": 0.9971117529437903
	  },
	  "recall_image_to_text": {
	    "R@1": 0.9355698733614752,
	    "R@5": 0.9926682959342369,
	    "R@10": 0.9957787158409243
	  }
	}
	```
	</details>

	### Text-Text
	Average cosine similarity: 0.7599

	Recall@K
	<table>
	<tr><th>Direction</th><th>R@1</th><th>R@5</th><th>R@10</th></tr>
	<tr><td>Text → Text</td><td>0.7198</td><td>0.9453</td><td>0.9824</td></tr>
	</table>

	<details>
	<summary>Raw metrics (JSON)</summary>

	```json
	{
	  "avg_cosine_sim": 0.7599335312843323,
	  "recall_text_to_text": {
	    "R@1": 0.719875500222321,
	    "R@5": 0.9453090262338817,
	    "R@10": 0.9824366385060027
	  }
	}
	```
	</details>
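
For readers who want to reproduce this kind of metric on their own data, Recall@K can be computed from a query-by-candidate similarity matrix. The sketch below assumes each query has exactly one correct candidate stored at the same index; the evaluation script behind the numbers above is not included in this card, so its exact protocol (gallery size, tie handling) may differ.

```python
def recall_at_k(similarity, ks=(1, 5, 10)):
    """similarity[i][j] scores query i against candidate j.

    Candidate i is assumed to be the single correct match for query i.
    """
    hits = {k: 0 for k in ks}
    for i, row in enumerate(similarity):
        # Zero-based rank of the true match: how many candidates score
        # strictly higher. Ties are resolved in favor of the true match.
        rank = sum(1 for score in row if score > row[i])
        for k in ks:
            if rank < k:
                hits[k] += 1
    n = len(similarity)
    return {f"R@{k}": hits[k] / n for k in ks}


# Toy 3x3 similarity matrix; query 1 ranks its true match second.
sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]
scores = recall_at_k(sim, ks=(1, 2))
```

Transposing the matrix gives the opposite retrieval direction, which is how a Text → Image score and an Image → Text score can both come from one set of embeddings.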

	## Loading & Usage
	```python
	import torch
	import torch.nn.functional as F
	from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
	from PIL import Image

	device = "cuda" if torch.cuda.is_available() else "cpu"

	model_name = "utkubascakir/MultiEmbedTR"

	# trust_remote_code=True is needed because the model classes live in model.py
	model = AutoModel.from_pretrained(model_name, trust_remote_code=True).to(device)
	tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
	image_processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)

	model.eval()  # disable dropout in the projection heads

	# Text Embedding
	texts = ["yeşil arka planlı bir kedi", "kumsalda bir köpek"]
	text_inputs = tokenizer(
	    texts,
	    padding=True,
	    truncation=True,
	    return_tensors="pt"
	).to(device)

	with torch.no_grad():
	    text_embeds = model.encode_text(
	        input_ids=text_inputs["input_ids"],
	        attention_mask=text_inputs["attention_mask"]
	    )

	print("Text embeddings shape:", text_embeds.shape)

	# Image Embedding
	img = Image.open("kedi.jpg").convert("RGB")
	image_inputs = image_processor(
	    images=img,
	    return_tensors="pt"
	).to(device)

	with torch.no_grad():
	    image_embeds = model.encode_image(
	        pixel_values=image_inputs["pixel_values"]
	    )

	print("Image embeddings shape:", image_embeds.shape)

	# The single image embedding is broadcast against both text embeddings,
	# giving one cosine similarity per text.
	similarity = F.cosine_similarity(text_embeds, image_embeds)
	print("Cosine similarity:", similarity)
	```

	## Limitations & Intended Use
	This release provides a Turkish multimodal embedding model trained to produce aligned vector representations for text and images.
	Beyond the in-distribution retrieval metrics reported above, it has not been evaluated on specific downstream tasks (e.g., retrieval on other datasets, classification).
	No guarantees are made regarding bias or toxicity; please evaluate the model on your own target domain.

	## Citation
	If you use this model, please cite this repository.