Upload folder using huggingface_hub

Files changed (5)

README.md +183 -0
config.json +21 -0
configuration_multimodal.py +24 -0
model.safetensors +3 -0
modeling_multimodal.py +100 -0

README.md CHANGED

@@ -1,3 +1,186 @@
 ---
 license: mit
+tags:
+- multimodal
+- embeddings
+datasets:
+- ituperceptron/image-captioning-turkish
+- dogukanvzr/ml-paraphrase-tr
+library_name: pytorch
+language:
+- tr
+base_model:
+- newmindai/modernbert-base-tr-uncased-allnli-stsb
+- facebook/dinov2-base
 ---
+# Turkish Multimodal Embedding Model
+This repository contains a **contrastively trained Turkish multimodal embedding model**, combining a text encoder and a vision encoder with projection heads.
+The model is trained entirely on **Turkish datasets** (image–caption and paraphrase), making it specifically tailored for Turkish multimodal applications.
+## Model Summary
+- **Text encoder**: `newmindai/modernbert-base-tr-uncased-allnli-stsb`
+- **Vision encoder**: `facebook/dinov2-base`
+- **Dimensions**: `text_dim=768`, `image_dim=768`, `embed_dim=768`
+- **Projection dropout**: fixed at `0.4` (inside `ProjectionHead`)
+- **Pooling**: mean pooling over tokens (`use_mean_pooling_for_text=True`)
+- **Normalize outputs**: yes, embeddings are L2-normalized (`F.normalize` in `encode_text` / `encode_image`)
+- **Encoders frozen during training?**: no (this release was trained with encoders **not frozen**)
+- **Language focus**: Turkish (both text and image–caption pairs are fully in Turkish)
+## Training Strategy (inspired by Jina CLIP v2)
+- The model was trained jointly with **image–text** and **text–text** pairs using a **bidirectional contrastive loss** (InfoNCE/CLIP-style).
+- For **image–text**, standard CLIP-style training with **in-batch negatives** was applied.
+- For **text–text**, only **positive paraphrase pairs (label=1)** were used, with in-batch negatives coming from other samples.
+- This follows the general training philosophy often seen in Jina’s multimodal work, but in a **simplified single-stage setup** (without the 3-stage curriculum).
+## Datasets
+- **Image–Text**: [`ituperceptron/image-captioning-turkish`](https://huggingface.co/datasets/ituperceptron/image-captioning-turkish)
+- **Text–Text (Paraphrase)**: [`dogukanvzr/ml-paraphrase-tr`](https://huggingface.co/datasets/dogukanvzr/ml-paraphrase-tr)
+> Both datasets are in Turkish, aligning the model’s embedding space around Turkish multimodal signals.
+> Please check each dataset’s license and terms before downstream use.
+## Files
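+- `model.safetensors`: model weights (stored via Git LFS)
+- `config.json`: metadata (encoder IDs, dimensions, flags, `auto_map`)
+- `configuration_multimodal.py`: custom config class `MultimodalConfig`
+- `modeling_multimodal.py`: custom model class `MultiEmbedTR` (required to load)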
+- (This README is the model card.)
+## Evaluation Results
+**Dataset:** Test split created from [`ituperceptron/image-captioning-turkish`](https://huggingface.co/datasets/ituperceptron/image-captioning-turkish)
+### Image-Text
+**Average cosine similarity:** 0.7934
+**Recall@K**
+<table>
+<tr><th>Direction</th><th>R@1</th><th>R@5</th><th>R@10</th></tr>
+<tr><td>Text → Image</td><td>0.9365</td><td>0.9913</td><td>0.9971</td></tr>
+<tr><td>Image → Text</td><td>0.9356</td><td>0.9927</td><td>0.9958</td></tr>
+</table>
+<details>
+<summary>Raw metrics (JSON)</summary>
+```json
+{
+    "avg_cosine_sim": 0.7934404611587524,
+    "recall_text_to_image": {
+        "R@1": 0.936458564763386,
+        "R@5": 0.9913352588313709,
+        "R@10": 0.9971117529437903
+    },
+    "recall_image_to_text": {
+        "R@1": 0.9355698733614752,
+        "R@5": 0.9926682959342369,
+        "R@10": 0.9957787158409243
+    }
+}
+```
+</details>
+### Text-Text
+**Average cosine similarity:** 0.7599
+**Recall@K**
+<table>
+<tr><th>Direction</th><th>R@1</th><th>R@5</th><th>R@10</th></tr>
+<tr><td>Text → Text</td><td>0.7198</td><td>0.9453</td><td>0.9824</td></tr>
+</table>
+<details>
+<summary>Raw metrics (JSON)</summary>
+```json
+{
+    "avg_cosine_sim": 0.7599335312843323,
+    "recall_text_to_text": {
+        "R@1": 0.719875500222321,
+        "R@5": 0.9453090262338817,
+        "R@10": 0.9824366385060027
+    }
+}
+```
+</details>
+## Loading & Usage
+```python
+import torch
+import torch.nn.functional as F
+from PIL import Image
+from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
+# --- Settings
+repo_id = "utkubascakir/turkish-multimodal-embedding"
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+# --- 1) Load the model; the custom classes are resolved through `auto_map` in config.json
+model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).to(device)
+model.eval()  # disables dropout in the projection heads
+# --- 2) Load the tokenizer and image processor of the base encoders
+tok = AutoTokenizer.from_pretrained(model.config.text_model_name)
+img_proc = AutoImageProcessor.from_pretrained(model.config.vision_model_name)
+# --- 3) INFERENCE: encode_* return L2-normalized embeddings
+texts = ["cat"]
+text_inputs = tok(texts, padding=True, truncation=True, return_tensors="pt").to(device)
+img = Image.open("cat.jpeg").convert("RGB")
+img_inputs = img_proc(img, return_tensors="pt").to(device)
+with torch.no_grad():  # encode_* do not disable gradients themselves
+    t_emb = model.encode_text(text_inputs["input_ids"], text_inputs["attention_mask"])  # (B, embed_dim)
+    v_emb = model.encode_image(img_inputs["pixel_values"])  # (1, embed_dim)
+print("Text embeddings:", t_emb.shape)
+print("Image embeddings:", v_emb.shape)
+# Cosine similarity
+sim = F.cosine_similarity(t_emb, v_emb).item()
+print(f"Cosine similarity: {sim:.4f}")
+# --- 4) (Optional) TRAINING example: call forward() with gradients enabled
+# DO NOT wrap this in torch.no_grad() during training
+# model.train()
+# out = model(
+#     input_ids=text_inputs["input_ids"],
+#     attention_mask=text_inputs["attention_mask"],
+#     pixel_values=img_inputs["pixel_values"],
+# )
+# out["text_embeds"], out["image_embeds"] and out["logit_scale"] feed the loss calculation
+```
+## Limitations & Intended Use
+This release provides a **Turkish multimodal embedding model**, trained to produce aligned vector representations for text and images.
+Beyond the retrieval metrics reported above, it has not been evaluated on specific downstream tasks (e.g., classification).
+No guarantees for bias/toxicity; please evaluate on your own target domain.
+## Citation
+If you use this model, please cite this repository.
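
The Recall@K tables in the README's Evaluation Results can be read with a sketch like the one below. The README does not say how the metrics were computed, so this is only an assumed convention: the correct candidate for query `i` is candidate `i`, and a query counts as a hit at K when fewer than K candidates score strictly higher than its true match. The function name `recall_at_k` and the toy similarity matrix are hypothetical.

```python
def recall_at_k(sim, ks=(1, 5, 10)):
    """Recall@K for a square similarity matrix.

    sim[i][j] is the similarity between query i and candidate j.
    Assumption: the true match for query i is candidate i.
    """
    n = len(sim)
    hits = {k: 0 for k in ks}
    for i, row in enumerate(sim):
        # Rank of the true match = number of candidates scoring strictly higher.
        rank = sum(1 for score in row if score > row[i])
        for k in ks:
            if rank < k:
                hits[k] += 1
    return {f"R@{k}": hits[k] / n for k in ks}


# Toy example: 3 queries against 3 candidates.
toy_sim = [
    [0.9, 0.1, 0.2],  # true match ranked first
    [0.3, 0.5, 0.8],  # true match ranked second
    [0.1, 0.4, 0.7],  # true match ranked first
]
scores = recall_at_k(toy_sim, ks=(1, 2))
print(scores)
```

For text-to-image retrieval the rows would be text embeddings scored against all image embeddings; transposing the matrix gives the image-to-text direction.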

config.json ADDED

	@@ -0,0 +1,21 @@

+{
+  "architectures": ["MultiEmbedTR"],
+  "model_type": "multimodal_embedder",
+  "text_model_name": "newmindai/modernbert-base-tr-uncased-allnli-stsb",
+  "vision_model_name": "facebook/dinov2-base",
+  "text_dim": 768,
+  "image_dim": 768,
+  "embed_dim": 768,
+  "temperature_init": 14.285714285714285,
+  "use_mean_pooling_for_text": true,
+  "auto_map": {
+    "AutoConfig": "configuration_multimodal.MultimodalConfig",
+    "AutoModel": "modeling_multimodal.MultiEmbedTR"
+  },
+  "torch_dtype": "float32",
+  "transformers_version": "4.53.0"
+}
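
The `temperature_init` value in this config is `1/0.07`, the default in `configuration_multimodal.py`. In `modeling_multimodal.py` from this commit, the model stores its logarithm as the learnable `logit_scale` parameter and returns the exponential from `forward`. A small sketch of that round trip, using plain `math` instead of tensors (the variable names here are illustrative only):

```python
import math

# Value written to config.json: 1 / 0.07
temperature_init = 1 / 0.07

# The model keeps log(temperature_init) as the learnable parameter...
logit_scale_param = math.log(temperature_init)

# ...and exposes exp(parameter), which equals temperature_init before any training.
logit_scale = math.exp(logit_scale_param)

print(f"temperature_init  = {temperature_init}")
print(f"logit_scale_param = {logit_scale_param:.4f}")
print(f"logit_scale       = {logit_scale:.4f}")
```

Learning the scale in log space keeps the effective multiplier positive whatever value the optimizer assigns to the parameter.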

configuration_multimodal.py ADDED

	@@ -0,0 +1,24 @@

+from transformers import PretrainedConfig
+class MultimodalConfig(PretrainedConfig):
+    model_type = "multimodal_embedder"
+    def __init__(
+        self,
+        text_model_name="newmindai/modernbert-base-tr-uncased-allnli-stsb",
+        vision_model_name="facebook/dinov2-base",
+        text_dim=768,
+        image_dim=768,
+        embed_dim=384,
+        temperature_init=1/0.07,
+        use_mean_pooling_for_text=True,
+        **kwargs
+    ):
+        super().__init__(**kwargs)
+        self.text_model_name = text_model_name
+        self.vision_model_name = vision_model_name
+        self.text_dim = text_dim
+        self.image_dim = image_dim
+        self.embed_dim = embed_dim
+        self.temperature_init = temperature_init
+        self.use_mean_pooling_for_text = use_mean_pooling_for_text

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:749481fee92fbfa3d799db5432a0548bce80a1019a3e95f4bbbda09d2f86bf3e
+size 904901012

modeling_multimodal.py ADDED

	@@ -0,0 +1,100 @@

+import math
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from transformers import PreTrainedModel, AutoModel
+from .configuration_multimodal import MultimodalConfig
+class ProjectionHead(nn.Module):
+    def __init__(self, in_dim, out_dim, hidden_mult=2, p_drop=0.4):
+        super().__init__()
+        h = int(hidden_mult * out_dim)
+        self.net = nn.Sequential(
+            nn.Linear(in_dim, h),
+            nn.GELU(),
+            nn.Dropout(p_drop),
+            nn.Linear(h, out_dim),
+        )
+        self.ln = nn.LayerNorm(out_dim)
+        self.use_residual = (in_dim == out_dim)
+    def forward(self, x):
+        y = self.net(x)
+        if self.use_residual:
+            y = y + x
+        return self.ln(y)
+def masked_mean_pool(last_hidden_state, attention_mask):
+    """Average token embeddings over non-padding positions only."""
+    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)  # (B, T, 1)
+    summed = (last_hidden_state * mask).sum(dim=1)
+    lengths = mask.sum(dim=1).clamp(min=1e-6)  # guard against all-padding rows
+    return summed / lengths
+class MultiEmbedTR(PreTrainedModel):
+    config_class = MultimodalConfig
+    def __init__(self, config: MultimodalConfig):
+        super().__init__(config)
+        self.text_encoder = AutoModel.from_pretrained(
+            config.text_model_name,
+            trust_remote_code=True
+        )
+        self.vision_encoder = AutoModel.from_pretrained(
+            config.vision_model_name
+        )
+        self.text_proj = ProjectionHead(config.text_dim, config.embed_dim)
+        self.image_proj = ProjectionHead(config.image_dim, config.embed_dim)
+        self.logit_scale = nn.Parameter(
+            torch.tensor(math.log(config.temperature_init), dtype=torch.float)
+        )
+        self.post_init()
+    def encode_text(self, input_ids, attention_mask):
+        out = self.text_encoder(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            return_dict=True
+        )
+        if self.config.use_mean_pooling_for_text:
+            pooled = masked_mean_pool(out.last_hidden_state, attention_mask)
+        else:
+            pooled = out.last_hidden_state[:, 0, :]
+        return F.normalize(self.text_proj(pooled), dim=-1)
+    def encode_image(self, pixel_values):
+        out = self.vision_encoder(
+            pixel_values=pixel_values,
+            return_dict=True
+        )
+        cls = out.last_hidden_state[:, 0, :]
+        return F.normalize(self.image_proj(cls), dim=-1)
+    def forward(
+        self,
+        input_ids=None,
+        attention_mask=None,
+        pixel_values=None,
+        return_dict=True,
+        **kwargs
+    ):
+        text_embeds = None
+        image_embeds = None
+        if input_ids is not None:
+            text_embeds = self.encode_text(input_ids, attention_mask)
+        if pixel_values is not None:
+            image_embeds = self.encode_image(pixel_values)
+        if not return_dict:
+            return text_embeds, image_embeds
+        return {
+            "text_embeds": text_embeds,
+            "image_embeds": image_embeds,
+            "logit_scale": self.logit_scale.exp(),
+        }
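
The `forward` output above (`text_embeds`, `image_embeds`, `logit_scale`) has the pieces that the bidirectional contrastive loss described in the README would need. The training code is not part of this commit, so the following is only a minimal sketch of a CLIP-style InfoNCE loss with in-batch negatives. It assumes both inputs are L2-normalized and that row `i` of each tensor forms a positive pair. The function name `clip_style_loss` is hypothetical.

```python
import torch
import torch.nn.functional as F


def clip_style_loss(a, b, logit_scale):
    """Symmetric InfoNCE loss over a batch of paired, L2-normalized embeddings.

    a, b: tensors of shape (B, D); row i of `a` pairs with row i of `b`.
    logit_scale: positive scalar multiplier (inverse temperature).
    """
    logits = logit_scale * (a @ b.t())  # (B, B) scaled cosine similarities
    targets = torch.arange(a.size(0), device=a.device)  # positives on the diagonal
    loss_a_to_b = F.cross_entropy(logits, targets)
    loss_b_to_a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a_to_b + loss_b_to_a)


# Toy check with orthonormal "embeddings": aligned pairs vs. shifted pairs.
scale = torch.tensor(1 / 0.07)
emb = torch.eye(4)  # each row is already unit-length
aligned_loss = clip_style_loss(emb, emb, scale)
shifted_loss = clip_style_loss(emb, torch.roll(emb, shifts=1, dims=0), scale)
print(f"aligned: {aligned_loss.item():.6f}  shifted: {shifted_loss.item():.4f}")
```

The same function would cover both pair types mentioned in the README: image-text batches pass text and image embeddings, and paraphrase batches pass the two sides of each text pair.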