utkubascakir committed · Commit 05e2184 (verified) · 1 Parent(s): 2ed9a94

Update README.md

Files changed (1)
  1. README.md +165 -185
README.md CHANGED
@@ -1,186 +1,166 @@
1
- ---
2
- license: mit
3
- tags:
4
- - multimodal
5
- - embeddings
6
- datasets:
7
- - ituperceptron/image-captioning-turkish
8
- - dogukanvzr/ml-paraphrase-tr
9
- library_name: pytorch
10
- language:
11
- - tr
12
- base_model:
13
- - newmindai/modernbert-base-tr-uncased-allnli-stsb
14
- - facebook/dinov2-base
15
- ---
16
-
17
- # Turkish Multimodal Embedding Model
18
-
19
- This repository contains a **contrastively trained Turkish multimodal embedding model**, combining a text encoder and a vision encoder with projection heads.
20
- The model is trained entirely on **Turkish datasets** (image–caption and paraphrase), making it specifically tailored for Turkish multimodal applications.
21
-
22
- ## Model Summary
23
- - **Text encoder**: `newmindai/modernbert-base-tr-uncased-allnli-stsb`
24
- - **Vision encoder**: `facebook/dinov2-base`
25
- - **Dimensions**: `text_dim=768`, `image_dim=768`, `embed_dim=768`
26
- - **Projection dropout**: fixed at `0.4` (inside `ProjectionHead`)
27
- - **Pooling**: mean pooling over tokens (`use_mean_pooling_for_text=True`)
28
- - **Normalize outputs**: `{normalize}`
29
- - **Encoders frozen during training?**: `{frozen}` (this release was trained with encoders **NOT frozen**)
30
- - **Language focus**: Turkish (both text and image–caption pairs are fully in Turkish)
31
-
32
- ## Training Strategy (inspired by JINA-CLIP-v2 style)
33
- - The model was trained jointly with **image–text** and **text–text** pairs using a **bidirectional contrastive loss** (InfoNCE/CLIP-style).
34
- - For **image–text**, standard CLIP-style training with **in-batch negatives** was applied.
35
- - For **text–text**, only **positive paraphrase pairs (label=1)** were used, with in-batch negatives coming from other samples.
36
- - This follows the general training philosophy often seen in Jina’s multimodal work, but in a **simplified single-stage setup** (without the 3-stage curriculum).
37
-
38
- ## Datasets
39
- - **Image–Text**: [`ituperceptron/image-captioning-turkish`](https://huggingface.co/datasets/ituperceptron/image-captioning-turkish)
40
- - **Text–Text (Paraphrase)**: [`dogukanvzr/ml-paraphrase-tr`](https://huggingface.co/datasets/dogukanvzr/ml-paraphrase-tr)
41
-
42
- > Both datasets are in Turkish, aligning the model’s embedding space around Turkish multimodal signals.
43
- > Please check each dataset’s license and terms before downstream use.
44
-
45
- ## Files
46
- - `pytorch_model.bin` — PyTorch `state_dict`
47
- - `config.json` — metadata (encoder IDs, dimensions, flags)
48
- - `model.py` — custom model classes (required to load)
49
- - (This README is the model card.)
50
-
51
- ## Evaluation Results
52
- **Dataset:** Test split created from [`ituperceptron/image-captioning-turkish`](https://huggingface.co/datasets/ituperceptron/image-captioning-turkish)
53
-
54
- ### Image-Text
55
- **Average cosine similarity:** 0.7934
56
-
57
- **Recall@K**
58
- <table>
59
- <tr><th>Direction</th><th>R@1</th><th>R@5</th><th>R@10</th></tr>
60
- <tr><td>Text → Image</td><td>0.9365</td><td>0.9913</td><td>0.9971</td></tr>
61
- <tr><td>Image → Text</td><td>0.9356</td><td>0.9927</td><td>0.9958</td></tr>
62
- </table>
63
-
64
- <details>
65
- <summary>Raw metrics (JSON)</summary>
66
-
67
- ```json
68
- {
69
- "avg_cosine_sim": 0.7934404611587524,
70
- "recall_text_to_image": {
71
- "R@1": 0.936458564763386,
72
- "R@5": 0.9913352588313709,
73
- "R@10": 0.9971117529437903
74
- },
75
- "recall_image_to_text": {
76
- "R@1": 0.9355698733614752,
77
- "R@5": 0.9926682959342369,
78
- "R@10": 0.9957787158409243
79
- }
80
- }
81
- ```
82
- </details>
83
-
84
- ### Text-Text
85
- **Average cosine similarity:** 0.7599
86
-
87
- **Recall@K**
88
- <table>
89
- <tr><th>Direction</th><th>R@1</th><th>R@5</th><th>R@10</th></tr>
90
- <tr><td>Text → Text</td><td>0.7198</td><td>0.9453</td><td>0.9824</td></tr>
91
- </table>
92
-
93
- <details>
94
- <summary>Raw metrics (JSON)</summary>
95
-
96
- ```json
97
- {
98
- "avg_cosine_sim": 0.7599335312843323,
99
- "recall_text_to_text": {
100
- "R@1": 0.719875500222321,
101
- "R@5": 0.9453090262338817,
102
- "R@10": 0.9824366385060027
103
- }
104
- }
105
- ```
106
- </details>
107
-
108
- ## Loading & Usage
109
- ```python
110
- import os, json, torch, importlib.util
111
- from huggingface_hub import snapshot_download
112
- from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
113
- from PIL import Image
114
- import torch.nn.functional as F
115
-
116
- # --- Settings
117
- repo_id = "utkubascakir/turkish-multimodal-embedding"
118
- local_dir = snapshot_download(repo_id)
119
- device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
120
-
121
- # --- 1) Load config
122
- with open(os.path.join(local_dir, "config.json"), "r", encoding="utf-8") as f:
123
- cfg = json.load(f)
124
-
125
- # --- 2) Load base encoders & processor
126
- tok = AutoTokenizer.from_pretrained(cfg["text_encoder_id"])
127
- txt_enc = AutoModel.from_pretrained(cfg["text_encoder_id"])
128
- img_proc = AutoImageProcessor.from_pretrained(cfg["vision_encoder_id"])
129
- vis_enc = AutoModel.from_pretrained(cfg["vision_encoder_id"])
130
-
131
- # --- 3) Import the custom model class
132
- spec = importlib.util.spec_from_file_location("model", os.path.join(local_dir, "model.py"))
133
- mod = importlib.util.module_from_spec(spec)
134
- spec.loader.exec_module(mod) # exposes mod.MultiModalEmbedder
135
-
136
- # --- 4) Build the model and load weights
137
- model = mod.MultiModalEmbedder(
138
- text_encoder=txt_enc,
139
- vision_encoder=vis_enc,
140
- text_dim=cfg.get("text_dim", 768),
141
- image_dim=cfg.get("image_dim", 768),
142
- embed_dim=cfg.get("embed_dim", 768), # must match training
143
- temperature_init=cfg.get("temperature_init", 1/0.07),
144
- use_mean_pooling_for_text=cfg.get("use_mean_pooling_for_text", True),
145
- freeze_encoders=cfg.get("freeze_encoders", False),
146
- ).to(device)
147
-
148
- state = torch.load(os.path.join(local_dir, "pytorch_model.bin"), map_location=device)
149
- # If you accidentally uploaded a checkpoint dict with a "model" key:
150
- # if isinstance(state, dict) and "model" in state:
151
- # state = state["model"]
152
- missing, unexpected = model.load_state_dict(state, strict=False)
153
- print("load_state_dict -> missing:", missing, " unexpected:", unexpected)
154
-
155
- model.eval()
156
-
157
- # --- 5) INFERENCE (recommended): encode_* methods (@no_grad inside)
158
- texts = ["cat"]
159
- text_inputs = tok(texts, padding=True, truncation=True, return_tensors="pt").to(device)
160
- t_emb = model.encode_text(text_inputs) # (B, embed_dim)
161
-
162
- img = Image.open("cat.jpeg").convert("RGB")
163
- img_inputs = img_proc(img, return_tensors="pt").to(device)
164
- v_emb = model.encode_image(img_inputs) # (1, embed_dim)
165
-
166
- print("Text embeddings:", t_emb.shape)
167
- print("Image embeddings:", v_emb.shape)
168
-
169
- # Cosine similarity
170
- sim = F.cosine_similarity(t_emb, v_emb).item()
171
- print(f"Cosine similarity: {sim:.4f}")
172
-
173
- # --- 6) (Optional) TRAINING example: forward_* (grad-enabled usage)
174
- # DO NOT use torch.no_grad() here during training
175
- # t_train = model.forward_text(text_inputs["input_ids"], text_inputs["attention_mask"])
176
- # v_train = model.forward_image(img_inputs["pixel_values"])
177
- # loss calculations go here...
178
- ```
179
-
180
- ## Limitations & Intended Use
181
- This release provides a **Turkish multimodal embedding model**, trained to produce aligned vector representations for text and images.
182
- It has not been tested for specific downstream tasks (e.g., retrieval, classification).
183
- No guarantees for bias/toxicity; please evaluate on your own target domain.
184
-
185
- ## Citation
186
  If you use this model, please cite this repository.
 
1
+ ---
2
+ license: mit
3
+ tags:
4
+ - multimodal
5
+ - embeddings
6
+ datasets:
7
+ - ituperceptron/image-captioning-turkish
8
+ - dogukanvzr/ml-paraphrase-tr
9
+ library_name: pytorch
10
+ language:
11
+ - tr
12
+ base_model:
13
+ - newmindai/modernbert-base-tr-uncased-allnli-stsb
14
+ - facebook/dinov2-base
15
+ ---
16
+
17
+ # Turkish Multimodal Embedding Model
18
+
19
+ This repository contains a **contrastively trained Turkish multimodal embedding model**, combining a text encoder and a vision encoder with projection heads.
20
+ The model is trained entirely on **Turkish datasets** (image–caption and paraphrase), making it specifically tailored for Turkish multimodal applications.
21
+
22
+ ## Model Summary
23
+ - **Text encoder**: `newmindai/modernbert-base-tr-uncased-allnli-stsb`
24
+ - **Vision encoder**: `facebook/dinov2-base`
25
+ - **Dimensions**: `text_dim=768`, `image_dim=768`, `embed_dim=768`
26
+ - **Projection dropout**: fixed at `0.4` (inside `ProjectionHead`)
27
+ - **Pooling**: mean pooling over tokens (`use_mean_pooling_for_text=True`)
28
+ - **Normalize outputs**: `{normalize}`
29
+ - **Encoders frozen during training?**: `{frozen}` (this release was trained with encoders **NOT frozen**)
30
+ - **Language focus**: Turkish (both text and image–caption pairs are fully in Turkish)
31
+
32
+ ## Training Strategy (inspired by JINA-CLIP-v2 style)
33
+ - The model was trained jointly with **image–text** and **text–text** pairs using a **bidirectional contrastive loss** (InfoNCE/CLIP-style).
34
+ - For **image–text**, standard CLIP-style training with **in-batch negatives** was applied.
35
+ - For **text–text**, only **positive paraphrase pairs (label=1)** were used, with in-batch negatives coming from other samples.
36
+ - This follows the general training philosophy often seen in Jina’s multimodal work, but in a **simplified single-stage setup** (without the 3-stage curriculum).
37
+
38
+ ## Datasets
39
+ - **Image–Text**: [`ituperceptron/image-captioning-turkish`](https://huggingface.co/datasets/ituperceptron/image-captioning-turkish)
40
+ - **Text–Text (Paraphrase)**: [`dogukanvzr/ml-paraphrase-tr`](https://huggingface.co/datasets/dogukanvzr/ml-paraphrase-tr)
41
+
42
+ > Both datasets are in Turkish, aligning the model’s embedding space around Turkish multimodal signals.
43
+ > Please check each dataset’s license and terms before downstream use.
44
+
45
+ ## Files
46
+ - `pytorch_model.bin` — PyTorch `state_dict`
47
+ - `config.json` — metadata (encoder IDs, dimensions, flags)
48
+ - `model.py` — custom model classes (required to load)
49
+ - (This README is the model card.)
50
+
51
+ ## Evaluation Results
52
+ **Dataset:** Test split created from [`ituperceptron/image-captioning-turkish`](https://huggingface.co/datasets/ituperceptron/image-captioning-turkish)
53
+
54
+ ### Image-Text
55
+ **Average cosine similarity:** 0.7934
56
+
57
+ **Recall@K**
58
+ <table>
59
+ <tr><th>Direction</th><th>R@1</th><th>R@5</th><th>R@10</th></tr>
60
+ <tr><td>Text → Image</td><td>0.9365</td><td>0.9913</td><td>0.9971</td></tr>
61
+ <tr><td>Image → Text</td><td>0.9356</td><td>0.9927</td><td>0.9958</td></tr>
62
+ </table>
63
+
64
+ <details>
65
+ <summary>Raw metrics (JSON)</summary>
66
+
67
+ ```json
68
+ {
69
+ "avg_cosine_sim": 0.7934404611587524,
70
+ "recall_text_to_image": {
71
+ "R@1": 0.936458564763386,
72
+ "R@5": 0.9913352588313709,
73
+ "R@10": 0.9971117529437903
74
+ },
75
+ "recall_image_to_text": {
76
+ "R@1": 0.9355698733614752,
77
+ "R@5": 0.9926682959342369,
78
+ "R@10": 0.9957787158409243
79
+ }
80
+ }
81
+ ```
82
+ </details>
83
+
84
+ ### Text-Text
85
+ **Average cosine similarity:** 0.7599
86
+
87
+ **Recall@K**
88
+ <table>
89
+ <tr><th>Direction</th><th>R@1</th><th>R@5</th><th>R@10</th></tr>
90
+ <tr><td>Text → Text</td><td>0.7198</td><td>0.9453</td><td>0.9824</td></tr>
91
+ </table>
92
+
93
+ <details>
94
+ <summary>Raw metrics (JSON)</summary>
95
+
96
+ ```json
97
+ {
98
+ "avg_cosine_sim": 0.7599335312843323,
99
+ "recall_text_to_text": {
100
+ "R@1": 0.719875500222321,
101
+ "R@5": 0.9453090262338817,
102
+ "R@10": 0.9824366385060027
103
+ }
104
+ }
105
+ ```
106
+ </details>
107
+
108
+ ## Loading & Usage
109
+ ```python
110
+ import torch
111
+ import torch.nn.functional as F
112
+ from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
113
+ from PIL import Image
114
+
115
+ device = "cuda" if torch.cuda.is_available() else "cpu"
116
+
117
+ model_name = "utkubascakir/MultiEmbedTR"
118
+
119
+ model = AutoModel.from_pretrained(model_name, trust_remote_code=True).to(device)
120
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
121
+ image_processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)
122
+
123
+ model.eval()
124
+
125
+ # Text Embedding
126
+ texts = ["yeşil arka planlı bir kedi", "kumsalda bir köpek"]
127
+ text_inputs = tokenizer(
128
+ texts,
129
+ padding=True,
130
+ truncation=True,
131
+ return_tensors="pt"
132
+ ).to(device)
133
+
134
+ with torch.no_grad():
135
+ text_embeds = model.encode_text(
136
+ input_ids=text_inputs["input_ids"],
137
+ attention_mask=text_inputs["attention_mask"]
138
+ )
139
+
140
+ print("Text embeddings shape:", text_embeds.shape)
141
+
142
+ # Image Embedding
143
+ img = Image.open("kedi.jpg").convert("RGB")
144
+ image_inputs = image_processor(
145
+ images=img,
146
+ return_tensors="pt"
147
+ ).to(device)
148
+
149
+ with torch.no_grad():
150
+ image_embeds = model.encode_image(
151
+ pixel_values=image_inputs["pixel_values"]
152
+ )
153
+
154
+ print("Image embeddings shape:", image_embeds.shape)
155
+
156
+ similarity = F.cosine_similarity(text_embeds, image_embeds)
157
+ print("Cosine similarity:", similarity)
158
+ ```
159
+
160
+ ## Limitations & Intended Use
161
+ This release provides a **Turkish multimodal embedding model**, trained to produce aligned vector representations for text and images.
162
+ It has not been tested for specific downstream tasks (e.g., retrieval, classification).
163
+ No guarantees for bias/toxicity; please evaluate on your own target domain.
164
+
165
+ ## Citation
166
  If you use this model, please cite this repository.
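
Note on the model card in this diff: its "Training Strategy" section describes a bidirectional contrastive loss (InfoNCE/CLIP-style) with in-batch negatives, and its "Evaluation Results" section reports Recall@K. The sketch below illustrates both on a small similarity matrix. It is not the repository's training or evaluation code. The function names `symmetric_info_nce` and `recall_at_k`, the default `temperature=0.07`, and the toy numbers are assumptions made for illustration. Plain Python is used so the arithmetic is visible; an actual training loop would more likely apply `torch.nn.functional.cross_entropy` to the logits.

```python
import math


def symmetric_info_nce(sim, temperature=0.07):
    """Mean of row-wise and column-wise cross-entropy with diagonal targets.

    sim[i][j] is the cosine similarity between item i of modality A and
    item j of modality B. Pair (i, i) is the positive; every other entry
    in the same row or column acts as an in-batch negative.
    The temperature default is an assumption, not a value from the repo.
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum_exp - row[i]
        return total / n

    a_to_b = cross_entropy(logits)                              # e.g. text -> image
    b_to_a = cross_entropy([list(col) for col in zip(*logits)])  # e.g. image -> text
    return 0.5 * (a_to_b + b_to_a)


def recall_at_k(sim, k):
    """Fraction of queries (rows) whose matching item (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Toy 3x3 similarity matrix (made-up numbers, not model output).
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.1],
    [0.3, 0.7, 0.6],  # query 2 ranks item 1 above its true match
]

loss = symmetric_info_nce(toy_sim)
r1 = recall_at_k(toy_sim, 1)
r2 = recall_at_k(toy_sim, 2)
print(f"loss={loss:.4f}  R@1={r1:.4f}  R@2={r2:.4f}")
```

For the text-text case described in the card, the same loss would apply with `sim` built from the two sides of each positive paraphrase pair, so the other pairs in the batch supply the negatives.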