utkubascakir committed on
Commit
ff7fe7e
·
verified ·
1 Parent(s): 4901a5e

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,3 +1,186 @@
1
  ---
2
  license: mit
3
  ---
1
  ---
2
  license: mit
3
+ tags:
4
+ - multimodal
5
+ - embeddings
6
+ datasets:
7
+ - ituperceptron/image-captioning-turkish
8
+ - dogukanvzr/ml-paraphrase-tr
9
+ library_name: pytorch
10
+ language:
11
+ - tr
12
+ base_model:
13
+ - newmindai/modernbert-base-tr-uncased-allnli-stsb
14
+ - facebook/dinov2-base
15
  ---
16
+
17
+ # Turkish Multimodal Embedding Model
18
+
19
+ This repository contains a **contrastively trained Turkish multimodal embedding model**, combining a text encoder and a vision encoder with projection heads.
20
+ The model is trained entirely on **Turkish datasets** (image–caption and paraphrase), making it specifically tailored for Turkish multimodal applications.
21
+
22
+ ## Model Summary
23
+ - **Text encoder**: `newmindai/modernbert-base-tr-uncased-allnli-stsb`
24
+ - **Vision encoder**: `facebook/dinov2-base`
25
+ - **Dimensions**: `text_dim=768`, `image_dim=768`, `embed_dim=768`
26
+ - **Projection dropout**: fixed at `0.4` (inside `ProjectionHead`)
27
+ - **Pooling**: masked mean pooling over text tokens (`use_mean_pooling_for_text=True`); the vision encoder's CLS token is used for images
28
+ - **Normalize outputs**: yes (embeddings are L2-normalized in `encode_text` and `encode_image`)
29
+ - **Encoders frozen during training?**: No (this release was trained with encoders **not frozen**)
30
+ - **Language focus**: Turkish (both text and image–caption pairs are fully in Turkish)
31
+
32
+ ## Training Strategy (inspired by jina-clip-v2)
33
+ - The model was trained jointly with **image–text** and **text–text** pairs using a **bidirectional contrastive loss** (InfoNCE/CLIP-style).
34
+ - For **image–text**, standard CLIP-style training with **in-batch negatives** was applied.
35
+ - For **text–text**, only **positive paraphrase pairs (label=1)** were used; the other samples in the batch serve as in-batch negatives.
36
+ - This follows the general training philosophy often seen in Jina’s multimodal work, but in a **simplified single-stage setup** (without the 3-stage curriculum).
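The bidirectional contrastive objective described in the bullets above can be sketched as follows. This is a minimal, dependency-free illustration and not the training code used for this release: the function names `log_softmax` and `symmetric_info_nce` are hypothetical, the inputs are assumed to be L2-normalized vectors where `a[i]` is paired with `b[i]`, and every other item in the batch is treated as a negative.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]

def symmetric_info_nce(a, b, logit_scale):
    """Symmetric InfoNCE loss over paired, L2-normalized vectors (hypothetical helper).

    a[i] is the positive for b[i]; all other b[j] in the batch act as negatives.
    """
    n = len(a)
    # Scaled cosine similarities (dot products of unit vectors).
    logits = [
        [logit_scale * sum(x * y for x, y in zip(a[i], b[j])) for j in range(n)]
        for i in range(n)
    ]
    # Direction a -> b: each row is a classification over candidates in b.
    loss_a2b = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Direction b -> a: each column is a classification over candidates in a.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_b2a = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_a2b + loss_b2a)

# Toy batch of two pairs: correctly aligned vs. deliberately swapped.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], 1 / 0.07)
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]], 1 / 0.07)
print(aligned < swapped)
```

For image–text batches `a` would hold text embeddings and `b` image embeddings; for text–text batches both sides would hold the two halves of each paraphrase pair.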
37
+
38
+ ## Datasets
39
+ - **Image–Text**: [`ituperceptron/image-captioning-turkish`](https://huggingface.co/datasets/ituperceptron/image-captioning-turkish)
40
+ - **Text–Text (Paraphrase)**: [`dogukanvzr/ml-paraphrase-tr`](https://huggingface.co/datasets/dogukanvzr/ml-paraphrase-tr)
41
+
42
+ > Both datasets are in Turkish, aligning the model’s embedding space around Turkish multimodal signals.
43
+ > Please check each dataset’s license and terms before downstream use.
44
+
45
+ ## Files
46
+ - `pytorch_model.bin` — PyTorch `state_dict`
47
+ - `config.json` — metadata (encoder IDs, dimensions, flags)
48
+ - `model.py` — custom model classes (required to load)
49
+ - (This README is the model card.)
50
+
51
+ ## Evaluation Results
52
+ **Dataset:** Test split created from [`ituperceptron/image-captioning-turkish`](https://huggingface.co/datasets/ituperceptron/image-captioning-turkish)
53
+
54
+ ### Image-Text
55
+ **Average cosine similarity:** 0.7934
56
+
57
+ **Recall@K**
58
+ <table>
59
+ <tr><th>Direction</th><th>R@1</th><th>R@5</th><th>R@10</th></tr>
60
+ <tr><td>Text → Image</td><td>0.9365</td><td>0.9913</td><td>0.9971</td></tr>
61
+ <tr><td>Image → Text</td><td>0.9356</td><td>0.9927</td><td>0.9958</td></tr>
62
+ </table>
63
+
64
+ <details>
65
+ <summary>Raw metrics (JSON)</summary>
66
+
67
+ ```json
68
+ {
69
+ "avg_cosine_sim": 0.7934404611587524,
70
+ "recall_text_to_image": {
71
+ "R@1": 0.936458564763386,
72
+ "R@5": 0.9913352588313709,
73
+ "R@10": 0.9971117529437903
74
+ },
75
+ "recall_image_to_text": {
76
+ "R@1": 0.9355698733614752,
77
+ "R@5": 0.9926682959342369,
78
+ "R@10": 0.9957787158409243
79
+ }
80
+ }
81
+ ```
82
+ </details>
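For readers unfamiliar with Recall@K, the sketch below shows one common way to compute it from a similarity matrix. It is an illustration under stated assumptions, not the evaluation script behind the numbers above: `recall_at_k` is a hypothetical helper, each query `i` is assumed to have exactly one correct candidate at index `i`, and ties are resolved in favor of the correct candidate.

```python
def recall_at_k(sim, ks=(1, 5, 10)):
    """Fraction of queries whose true match ranks in the top k (hypothetical helper).

    sim[i][j] is the similarity of query i to candidate j;
    candidate i is assumed to be the single correct match for query i.
    """
    hits = {k: 0 for k in ks}
    for i, row in enumerate(sim):
        # Zero-based rank of the true match: candidates scoring strictly higher.
        rank = sum(1 for s in row if s > row[i])
        for k in ks:
            if rank < k:
                hits[k] += 1
    n = len(sim)
    return {f"R@{k}": hits[k] / n for k in ks}

# Toy example with three queries; query 1 ranks its match second.
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.1, 0.7],
]
print(recall_at_k(toy_sim, ks=(1, 2)))
```

Text → Image recall would use rows indexed by captions and columns by images; transposing the matrix gives the Image → Text direction.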
83
+
84
+ ### Text-Text
85
+ **Average cosine similarity:** 0.7599
86
+
87
+ **Recall@K**
88
+ <table>
89
+ <tr><th>Direction</th><th>R@1</th><th>R@5</th><th>R@10</th></tr>
90
+ <tr><td>Text → Text</td><td>0.7198</td><td>0.9453</td><td>0.9824</td></tr>
91
+ </table>
92
+
93
+ <details>
94
+ <summary>Raw metrics (JSON)</summary>
95
+
96
+ ```json
97
+ {
98
+ "avg_cosine_sim": 0.7599335312843323,
99
+ "recall_text_to_text": {
100
+ "R@1": 0.719875500222321,
101
+ "R@5": 0.9453090262338817,
102
+ "R@10": 0.9824366385060027
103
+ }
104
+ }
105
+ ```
106
+ </details>
107
+
108
+ ## Loading & Usage
+ ```python
+ import os, json, torch, importlib.util
+ from huggingface_hub import snapshot_download
+ from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
+ from PIL import Image
+ import torch.nn.functional as F
+
+ # --- Settings
+ repo_id = "utkubascakir/turkish-multimodal-embedding"
+ local_dir = snapshot_download(repo_id)
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+ # --- 1) Load config
+ with open(os.path.join(local_dir, "config.json"), "r", encoding="utf-8") as f:
+     cfg = json.load(f)
+
+ # --- 2) Load base encoders & processor
+ # config.json stores the encoder IDs under "text_model_name" / "vision_model_name"
+ tok = AutoTokenizer.from_pretrained(cfg["text_model_name"])
+ txt_enc = AutoModel.from_pretrained(cfg["text_model_name"])
+ img_proc = AutoImageProcessor.from_pretrained(cfg["vision_model_name"])
+ vis_enc = AutoModel.from_pretrained(cfg["vision_model_name"])
+
+ # --- 3) Import the custom model class
+ spec = importlib.util.spec_from_file_location("model", os.path.join(local_dir, "model.py"))
+ mod = importlib.util.module_from_spec(spec)
+ spec.loader.exec_module(mod)  # exposes mod.MultiModalEmbedder
+
+ # --- 4) Build the model and load weights
+ model = mod.MultiModalEmbedder(
+     text_encoder=txt_enc,
+     vision_encoder=vis_enc,
+     text_dim=cfg.get("text_dim", 768),
+     image_dim=cfg.get("image_dim", 768),
+     embed_dim=cfg.get("embed_dim", 768),  # must match training
+     temperature_init=cfg.get("temperature_init", 1/0.07),
+     use_mean_pooling_for_text=cfg.get("use_mean_pooling_for_text", True),
+     freeze_encoders=cfg.get("freeze_encoders", False),
+ ).to(device)
+
+ state = torch.load(os.path.join(local_dir, "pytorch_model.bin"), map_location=device)
+ # If you accidentally uploaded a checkpoint dict with a "model" key:
+ # if isinstance(state, dict) and "model" in state:
+ #     state = state["model"]
+ missing, unexpected = model.load_state_dict(state, strict=False)
+ print("load_state_dict -> missing:", missing, " unexpected:", unexpected)
+
+ model.eval()
+
+ # --- 5) INFERENCE (recommended): encode_* methods (@no_grad inside)
+ texts = ["cat"]
+ text_inputs = tok(texts, padding=True, truncation=True, return_tensors="pt").to(device)
+ t_emb = model.encode_text(text_inputs)  # (B, embed_dim)
+
+ img = Image.open("cat.jpeg").convert("RGB")
+ img_inputs = img_proc(img, return_tensors="pt").to(device)
+ v_emb = model.encode_image(img_inputs)  # (1, embed_dim)
+
+ print("Text embeddings:", t_emb.shape)
+ print("Image embeddings:", v_emb.shape)
+
+ # Cosine similarity
+ sim = F.cosine_similarity(t_emb, v_emb).item()
+ print(f"Cosine similarity: {sim:.4f}")
+
+ # --- 6) (Optional) TRAINING example: forward_* (grad-enabled usage)
+ # DO NOT use torch.no_grad() here during training
+ # t_train = model.forward_text(text_inputs["input_ids"], text_inputs["attention_mask"])
+ # v_train = model.forward_image(img_inputs["pixel_values"])
+ # loss calculations go here...
+ ```
179
+
180
+ ## Limitations & Intended Use
181
+ This release provides a **Turkish multimodal embedding model**, trained to produce aligned vector representations for text and images.
182
+ Beyond the retrieval metrics reported above, it has not been evaluated on specific downstream tasks (e.g., classification).
183
+ No guarantees are made regarding bias or toxicity; please evaluate the model on your own target domain.
184
+
185
+ ## Citation
186
+ If you use this model, please cite this repository.
config.json ADDED
@@ -0,0 +1,21 @@
1
+ {
2
+ "architectures": ["MultiEmbedTR"],
3
+ "model_type": "multimodal_embedder",
4
+
5
+ "text_model_name": "newmindai/modernbert-base-tr-uncased-allnli-stsb",
6
+ "vision_model_name": "facebook/dinov2-base",
7
+
8
+ "text_dim": 768,
9
+ "image_dim": 768,
10
+ "embed_dim": 768,
11
+ "temperature_init": 14.285714285714285,
12
+ "use_mean_pooling_for_text": true,
13
+
14
+ "auto_map": {
15
+ "AutoConfig": "configuration_multimodal.MultimodalConfig",
16
+ "AutoModel": "modeling_multimodal.MultiEmbedTR"
17
+ },
18
+
19
+ "torch_dtype": "float32",
20
+ "transformers_version": "4.53.0"
21
+ }
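The `temperature_init` value in this config is `1/0.07`. The modeling code in this commit stores its logarithm as the learnable `logit_scale` parameter and returns the exponentiated value from `forward`. The arithmetic can be checked with a small sketch; the variable names and the `scaled_logit` helper below are illustrative only and do not appear in the repository.

```python
import math

# Value written to config.json as "temperature_init".
temperature_init = 1 / 0.07

# The model keeps log(temperature_init) as the trainable parameter...
logit_scale_param = math.log(temperature_init)

# ...and exposes exp(parameter) as the multiplier applied to cosine similarities.
effective_scale = math.exp(logit_scale_param)

def scaled_logit(cos_sim, scale=effective_scale):
    """Map a cosine similarity in [-1, 1] to a contrastive logit (illustrative helper)."""
    return scale * cos_sim

print(round(effective_scale, 4))
print(round(scaled_logit(0.5), 4))
```

Parameterizing the scale in log space keeps it positive during training regardless of the gradient updates applied to the parameter.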
configuration_multimodal.py ADDED
@@ -0,0 +1,24 @@
+ from transformers import PretrainedConfig
+
+ class MultimodalConfig(PretrainedConfig):
+     model_type = "multimodal_embedder"
+
+     def __init__(
+         self,
+         text_model_name="newmindai/modernbert-base-tr-uncased-allnli-stsb",
+         vision_model_name="facebook/dinov2-base",
+         text_dim=768,
+         image_dim=768,
+         embed_dim=384,
+         temperature_init=1/0.07,
+         use_mean_pooling_for_text=True,
+         **kwargs
+     ):
+         super().__init__(**kwargs)
+         self.text_model_name = text_model_name
+         self.vision_model_name = vision_model_name
+         self.text_dim = text_dim
+         self.image_dim = image_dim
+         self.embed_dim = embed_dim
+         self.temperature_init = temperature_init
+         self.use_mean_pooling_for_text = use_mean_pooling_for_text
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:749481fee92fbfa3d799db5432a0548bce80a1019a3e95f4bbbda09d2f86bf3e
3
+ size 904901012
modeling_multimodal.py ADDED
@@ -0,0 +1,100 @@
+ import math
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from transformers import PreTrainedModel, AutoModel
+ # Relative import so the sibling config module resolves when loaded from the Hub
+ from .configuration_multimodal import MultimodalConfig
+
+ class ProjectionHead(nn.Module):
+     def __init__(self, in_dim, out_dim, hidden_mult=2, p_drop=0.4):
+         super().__init__()
+         h = int(hidden_mult * out_dim)
+         self.net = nn.Sequential(
+             nn.Linear(in_dim, h),
+             nn.GELU(),
+             nn.Dropout(p_drop),
+             nn.Linear(h, out_dim),
+         )
+         self.ln = nn.LayerNorm(out_dim)
+         # A residual connection is only possible when input and output sizes match
+         self.use_residual = (in_dim == out_dim)
+
+     def forward(self, x):
+         y = self.net(x)
+         if self.use_residual:
+             y = y + x
+         return self.ln(y)
+
+ def masked_mean_pool(last_hidden_state, attention_mask):
+     # Average token embeddings, ignoring padded positions
+     mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)
+     summed = (last_hidden_state * mask).sum(dim=1)
+     lengths = mask.sum(dim=1).clamp(min=1e-6)
+     return summed / lengths
+
+ class MultiEmbedTR(PreTrainedModel):
+     config_class = MultimodalConfig
+
+     def __init__(self, config: MultimodalConfig):
+         super().__init__(config)
+
+         self.text_encoder = AutoModel.from_pretrained(
+             config.text_model_name,
+             trust_remote_code=True
+         )
+         self.vision_encoder = AutoModel.from_pretrained(
+             config.vision_model_name
+         )
+
+         self.text_proj = ProjectionHead(config.text_dim, config.embed_dim)
+         self.image_proj = ProjectionHead(config.image_dim, config.embed_dim)
+
+         # Stored in log space; exponentiated in forward()
+         self.logit_scale = nn.Parameter(
+             torch.tensor(math.log(config.temperature_init), dtype=torch.float)
+         )
+
+         self.post_init()
+
+     def encode_text(self, input_ids, attention_mask):
+         out = self.text_encoder(
+             input_ids=input_ids,
+             attention_mask=attention_mask,
+             return_dict=True
+         )
+         if self.config.use_mean_pooling_for_text:
+             pooled = masked_mean_pool(out.last_hidden_state, attention_mask)
+         else:
+             pooled = out.last_hidden_state[:, 0, :]
+         return F.normalize(self.text_proj(pooled), dim=-1)
+
+     def encode_image(self, pixel_values):
+         out = self.vision_encoder(
+             pixel_values=pixel_values,
+             return_dict=True
+         )
+         cls = out.last_hidden_state[:, 0, :]
+         return F.normalize(self.image_proj(cls), dim=-1)
+
+     def forward(
+         self,
+         input_ids=None,
+         attention_mask=None,
+         pixel_values=None,
+         return_dict=True,
+         **kwargs
+     ):
+         text_embeds = None
+         image_embeds = None
+
+         if input_ids is not None:
+             text_embeds = self.encode_text(input_ids, attention_mask)
+
+         if pixel_values is not None:
+             image_embeds = self.encode_image(pixel_values)
+
+         if not return_dict:
+             return text_embeds, image_embeds
+
+         return {
+             "text_embeds": text_embeds,
+             "image_embeds": image_embeds,
+             "logit_scale": self.logit_scale.exp(),
+         }
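To make the `masked_mean_pool` helper in this file more concrete, here is a dependency-free sketch of the same idea for a single sequence. It is an illustration only: `masked_mean_pool_single` is a hypothetical name, it works on plain Python lists rather than tensors, and it mirrors the file's choice of clamping the token count to at least `1e-6` to avoid division by zero.

```python
def masked_mean_pool_single(hidden, mask):
    """Average the token vectors whose mask entry is 1 (hypothetical helper).

    hidden: list of token vectors, shape [seq_len][dim]
    mask:   list of 0/1 flags, shape [seq_len]; 0 marks padding
    """
    dim = len(hidden[0])
    summed = [0.0] * dim
    for vec, m in zip(hidden, mask):
        for d in range(dim):
            # Padded tokens contribute nothing to the sum.
            summed[d] += vec[d] * m
    # Guard against an all-padding sequence.
    length = max(sum(mask), 1e-6)
    return [s / length for s in summed]

# Two real tokens followed by one padding token with large values.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention = [1, 1, 0]
print(masked_mean_pool_single(tokens, attention))
```

The padding token's values have no effect on the result, which is the reason for using the attention mask instead of a plain mean over the sequence dimension.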