Update README.md

---
license: mit
tags:
- multimodal
- embeddings
datasets:
- ituperceptron/image-captioning-turkish
- dogukanvzr/ml-paraphrase-tr
library_name: pytorch
language:
- tr
base_model:
- newmindai/modernbert-base-tr-uncased-allnli-stsb
- facebook/dinov2-base
---

# Turkish Multimodal Embedding Model

This repository contains a **contrastively trained Turkish multimodal embedding model** that combines a text encoder and a vision encoder through projection heads.
The model is trained entirely on **Turkish datasets** (image–caption and paraphrase), making it specifically tailored for Turkish multimodal applications.

## Model Summary
- **Text encoder**: `newmindai/modernbert-base-tr-uncased-allnli-stsb`
- **Vision encoder**: `facebook/dinov2-base`
- **Dimensions**: `text_dim=768`, `image_dim=768`, `embed_dim=768`
- **Projection dropout**: fixed at `0.4` (inside `ProjectionHead`)
- **Pooling**: mean pooling over tokens (`use_mean_pooling_for_text=True`; see the sketch after this list)
- **Normalize outputs**: `{normalize}`
- **Encoders frozen during training?**: No (this release was trained with the encoders **not frozen**)
- **Language focus**: Turkish (both text and image–caption pairs are fully in Turkish)

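A minimal sketch of what masked mean pooling over tokens looks like, assuming the text encoder returns a standard Hugging Face `last_hidden_state`; `mean_pool` is an illustrative helper, not part of this repository's API:

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the token embeddings of each sequence, ignoring padding."""
    mask = attention_mask.unsqueeze(-1).float()        # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)     # sum of real-token vectors
    counts = mask.sum(dim=1).clamp(min=1e-9)           # number of real tokens
    return summed / counts
```
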
## Training Strategy (inspired by JINA-CLIP-v2)
- The model was trained jointly on **image–text** and **text–text** pairs with a **bidirectional contrastive loss** (InfoNCE/CLIP-style; see the sketch after this list).
- For **image–text**, standard CLIP-style training with **in-batch negatives** was applied.
- For **text–text**, only **positive paraphrase pairs (label=1)** were used, with in-batch negatives coming from the other samples in the batch.
- This follows the general training philosophy of Jina’s multimodal work, but in a **simplified single-stage setup** (without the 3-stage curriculum).

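A minimal sketch of the bidirectional InfoNCE objective described above, assuming batches of already-projected paired embeddings; `contrastive_loss` and the temperature value are illustrative, not taken from this repository's training code:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Bidirectional InfoNCE: row i of emb_a is paired with row i of emb_b,
    and every other row in the batch acts as an in-batch negative."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / temperature                 # (batch, batch)
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    # Symmetric loss: a -> b retrieval plus b -> a retrieval
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```
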
## Datasets
- **Image–Text**: [`ituperceptron/image-captioning-turkish`](https://huggingface.co/datasets/ituperceptron/image-captioning-turkish)
- **Text–Text (Paraphrase)**: [`dogukanvzr/ml-paraphrase-tr`](https://huggingface.co/datasets/dogukanvzr/ml-paraphrase-tr)

> Both datasets are in Turkish, aligning the model’s embedding space around Turkish multimodal signals.
> Please check each dataset’s license and terms before downstream use.

## Files
- `pytorch_model.bin` — PyTorch `state_dict`
- `config.json` — metadata (encoder IDs, dimensions, flags)
- `model.py` — custom model classes (required to load)
- (This README is the model card.)

## Evaluation Results
**Dataset:** Test split created from [`ituperceptron/image-captioning-turkish`](https://huggingface.co/datasets/ituperceptron/image-captioning-turkish)

### Image–Text
**Average cosine similarity:** 0.7934

**Recall@K** (a sketch of how these numbers are computed follows the Text–Text results below)
<table>
<tr><th>Direction</th><th>R@1</th><th>R@5</th><th>R@10</th></tr>
<tr><td>Text → Image</td><td>0.9365</td><td>0.9913</td><td>0.9971</td></tr>
<tr><td>Image → Text</td><td>0.9356</td><td>0.9927</td><td>0.9958</td></tr>
</table>

<details>
<summary>Raw metrics (JSON)</summary>

```json
{
  "avg_cosine_sim": 0.7934404611587524,
  "recall_text_to_image": {
    "R@1": 0.936458564763386,
    "R@5": 0.9913352588313709,
    "R@10": 0.9971117529437903
  },
  "recall_image_to_text": {
    "R@1": 0.9355698733614752,
    "R@5": 0.9926682959342369,
    "R@10": 0.9957787158409243
  }
}
```

</details>

### Text–Text
**Average cosine similarity:** 0.7599

**Recall@K**
<table>
<tr><th>Direction</th><th>R@1</th><th>R@5</th><th>R@10</th></tr>
<tr><td>Text → Text</td><td>0.7198</td><td>0.9453</td><td>0.9824</td></tr>
</table>

<details>
<summary>Raw metrics (JSON)</summary>

```json
{
  "avg_cosine_sim": 0.7599335312843323,
  "recall_text_to_text": {
    "R@1": 0.719875500222321,
    "R@5": 0.9453090262338817,
    "R@10": 0.9824366385060027
  }
}
```

</details>

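A minimal sketch of how Recall@K can be computed from the full similarity matrix over the test split, assuming row-aligned, L2-normalized query and gallery embedding matrices; this is illustrative, not the exact evaluation script:

```python
import torch

def recall_at_k(query_embeds: torch.Tensor, gallery_embeds: torch.Tensor,
                ks=(1, 5, 10)) -> dict:
    """Fraction of queries whose true match (same row index) appears in the top K."""
    sims = query_embeds @ gallery_embeds.t()               # (N, N) cosine similarities
    ranks = sims.argsort(dim=1, descending=True)           # gallery indices, best first
    targets = torch.arange(sims.size(0)).unsqueeze(1)      # query i pairs with gallery i
    positions = (ranks == targets).float().argmax(dim=1)   # rank of the true match
    return {f"R@{k}": (positions < k).float().mean().item() for k in ks}
```
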
## Loading & Usage
```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "utkubascakir/MultiEmbedTR"

model = AutoModel.from_pretrained(model_name, trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)

model.eval()

# Text embedding
texts = ["yeşil arka planlı bir kedi", "kumsalda bir köpek"]  # "a cat on a green background", "a dog on the beach"
text_inputs = tokenizer(
    texts,
    padding=True,
    truncation=True,
    return_tensors="pt"
).to(device)

with torch.no_grad():
    text_embeds = model.encode_text(
        input_ids=text_inputs["input_ids"],
        attention_mask=text_inputs["attention_mask"]
    )

print("Text embeddings shape:", text_embeds.shape)

# Image embedding
img = Image.open("kedi.jpg").convert("RGB")
image_inputs = image_processor(
    images=img,
    return_tensors="pt"
).to(device)

with torch.no_grad():
    image_embeds = model.encode_image(
        pixel_values=image_inputs["pixel_values"]
    )

print("Image embeddings shape:", image_embeds.shape)

# Cosine similarity between each caption embedding and the image embedding
# (F.cosine_similarity broadcasts the single image across the text batch)
similarity = F.cosine_similarity(text_embeds, image_embeds)
print("Cosine similarity:", similarity)

# (Optional) Training usage: call the forward_* methods without torch.no_grad()
# t_train = model.forward_text(text_inputs["input_ids"], text_inputs["attention_mask"])
# v_train = model.forward_image(image_inputs["pixel_values"])
# ...loss calculation goes here...
```

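For retrieval over more than one image, a full similarity matrix is often more convenient than pairwise `cosine_similarity`. A minimal sketch, reusing `text_embeds` and assuming a multi-image `image_embeds` batch from the snippet above:

```python
# Rank every image for each caption (illustrative continuation of the snippet above)
text_norm = F.normalize(text_embeds, dim=-1)
image_norm = F.normalize(image_embeds, dim=-1)
scores = text_norm @ image_norm.t()   # (num_texts, num_images) cosine similarities
best = scores.argmax(dim=1)           # index of the top-ranked image per caption
print("Best image per caption:", best.tolist())
```
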
## Limitations & Intended Use
This release provides a **Turkish multimodal embedding model** trained to produce aligned vector representations for text and images.
It has not been tested on specific downstream tasks (e.g., retrieval, classification).
No guarantees are made regarding bias or toxicity; please evaluate the model on your own target domain.

## Citation
If you use this model, please cite this repository.