Image-Text-to-Text
Transformers
PyTorch
English
qwen2_vl
Embedding
text-generation-inference
SwyWang committed on
Commit
2ce76d7
·
1 Parent(s): d94b5f6

finalized

.gitattributes CHANGED
@@ -34,3 +34,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  tokenizer.json filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
+ *.jpg filter=lfs diff=lfs merge=lfs -text
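The two added rules route image files through Git LFS alongside the existing patterns. As a rough sanity check, simple basename globs like these can be approximated with `fnmatch` (an approximation only: full `.gitattributes` matching has richer semantics than shell globs, but it coincides for patterns this simple):

```python
import fnmatch
import os

# Patterns from the .gitattributes rules above (subset).
lfs_patterns = ["*.zst", "*tfevents*", "tokenizer.json", "*.png", "*.jpg"]

def tracked_by_lfs(path):
    """Return True if the file's basename matches any LFS glob pattern."""
    name = os.path.basename(path)
    return any(fnmatch.fnmatch(name, pat) for pat in lfs_patterns)

print(tracked_by_lfs("images/example.jpg"))  # True (new *.jpg rule)
print(tracked_by_lfs("README.md"))           # False
```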
README.md CHANGED
@@ -21,7 +21,7 @@ VIRTUE is a visual-interactive text-image universal embedder consisting of a VLM
  In addition, we introduce the SCaR benchmark ([train](https://huggingface.co/datasets/Sony/SCaR-Train), [eval](https://huggingface.co/datasets/Sony/SCaR-Eval)), composed of 1M samples for visual-interactive image-to-text retrieval, to evaluate visual-interactive embedding capabilities.
  SCaR enables evaluation of advanced reasoning and compositional tasks in multimodal, visual-interaction-aware embedding scenarios that remain unexplored.

- ## Model Details

  - [VIRTUE-2B-SCaR](https://huggingface.co/Sony/VIRTUE-2B-SCaR)
  - [VIRTUE-7B-SCaR](https://huggingface.co/Sony/VIRTUE-7B-SCaR)
@@ -42,10 +42,185 @@ SCaR enables evaluation of advanced reasoning and compositional tasks in multimo

  ## Resources
  - [Paper](https://arxiv.org/abs/2510.00523)
- - [Webpage]()
  - [Repository](https://github.com/sony/virtue)

  ## How to Use

  ## Citation
  ```
  In addition, we introduce the SCaR benchmark ([train](https://huggingface.co/datasets/Sony/SCaR-Train), [eval](https://huggingface.co/datasets/Sony/SCaR-Eval)), composed of 1M samples for visual-interactive image-to-text retrieval, to evaluate visual-interactive embedding capabilities.
  SCaR enables evaluation of advanced reasoning and compositional tasks in multimodal, visual-interaction-aware embedding scenarios that remain unexplored.

+ ## Model Checkpoints

  - [VIRTUE-2B-SCaR](https://huggingface.co/Sony/VIRTUE-2B-SCaR)
  - [VIRTUE-7B-SCaR](https://huggingface.co/Sony/VIRTUE-7B-SCaR)

  ## Resources
  - [Paper](https://arxiv.org/abs/2510.00523)
+ - [Webpage](https://sony.github.io/virtue/)
  - [Repository](https://github.com/sony/virtue)

  ## How to Use
+ ```python
+ import os
+ import sys
+ import torch
+ import numpy as np
+ import hydra
+ from hydra.core.global_hydra import GlobalHydra
+ from PIL import Image
+
+ # Add the parent directory to the path for src imports
+ sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+ from src.arguments import ModelArguments, DataArguments, TrainingArguments
+ from src.model.model import MMEBModel
+ from src.model.processor import load_processor, VLM_IMAGE_TOKENS, get_backbone_name, process_vlm_inputs_fns
+ from transformers import AutoConfig
+
+ # Initialize Hydra for SAM2 loading
+ if not GlobalHydra().is_initialized():
+     hydra.initialize(config_path="./configs", version_base=None)
+
+ # Determinism
+ torch.manual_seed(42)
+ torch.cuda.manual_seed_all(42)
+ torch.backends.cudnn.deterministic = True
+ torch.backends.cudnn.benchmark = False
+ np.random.seed(42)
+
+ model_dir = 'Sony/VIRTUE-2B-SCaR'
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
+
+ config = AutoConfig.from_pretrained(model_dir, trust_remote_code=True, token=True)
+
+ # Build arguments directly (no YAML required)
+ model_args = ModelArguments(
+     model_name=model_dir,
+     checkpoint_path=None,
+     pooling="last",
+     normalize=True,
+     lora=False,
+     model_backbone='qwen2_vl',
+ )
+ persisted_sam = config.virtue_sam
+
+ model_args.sam = True
+ model_args.sam_config = {
+     "config_path": persisted_sam.get('config_path') if persisted_sam else None,
+     "checkpoint": persisted_sam.get('checkpoint') if persisted_sam else None,
+     "points_per_side": (persisted_sam.get('points_per_side') if persisted_sam else 16),
+     "feature_levels": (persisted_sam.get('feature_levels') if persisted_sam else 3),
+ }
+
+ data_args = DataArguments()
+ training_args = TrainingArguments()
+
+ processor = load_processor(model_args, data_args)
+ model = MMEBModel.load(model_args, is_trainable=False, processor=processor)
+ model.eval()
+ model = model.to(device, dtype=torch.bfloat16)
+
+ # Get the model backbone and image token
+ model_backbone = get_backbone_name(hf_config=config)
+ image_token = VLM_IMAGE_TOKENS[model_backbone]
+
+ # Image + Text -> Text
+ image_path = '../assets/example.jpg'
+ image = Image.open(image_path).convert('RGB')
+
+ model_inputs = {
+     'text': [f"{image_token}\nRepresent the given image with the following question: What is in the image"],
+     'images': [image]
+ }
+
+ process_fn = process_vlm_inputs_fns[model_backbone]
+ inputs = process_fn(model_inputs, processor=processor, max_length=512)
+ device = next(model.parameters()).device
+ inputs = {k: v.to(device) if torch.is_tensor(v) else v for k, v in inputs.items()}
+
+ with torch.no_grad():
+     with torch.autocast(enabled=True, dtype=torch.bfloat16, device_type="cuda"):
+         qry_output = model(qry=inputs)["qry_reps"]
+
+ # Candidates for all scenarios
+ test_strings = ['A cat', 'A dog', 'A tiger']
+
+ # Scenario 1: No visual prompts (image only)
+ print("\n--- Similarities (no visual prompts) ---")
+ for string in test_strings:
+     cand_inputs = process_fn({'text': [string], 'images': [None]}, processor=processor)
+     cand_inputs = {k: v.to(device) if torch.is_tensor(v) else v for k, v in cand_inputs.items()}
+     with torch.no_grad():
+         with torch.autocast(enabled=True, dtype=torch.bfloat16, device_type="cuda"):
+             tgt_output = model(tgt=cand_inputs)["tgt_reps"]
+     sim = model.compute_similarity(qry_output, tgt_output)
+     print(f"no-prompt | {string} = {sim}")
+
+ '''
+ --- Similarities (no visual prompts) ---
+ no-prompt | A cat = tensor([[0.3030]], device='cuda:0')
+ no-prompt | A dog = tensor([[0.2453]], device='cuda:0')
+ no-prompt | A tiger = tensor([[0.1714]], device='cuda:0')
+ '''
+
+ # Scenario 2: Point prompts — two examples (left/right)
+ print("\n--- Similarities (point prompts) ---")
+ sam_size = 1024  # SAM2Transforms output size
+ point_examples = [(0.25, 0.5), (0.75, 0.5)]
+ for (px, py) in point_examples:
+     point_text = f"{image_token}\nFind the caption that best describes the segmented object, considering both local details and global context in the given image.\nReferring object point: ({int(px*image.size[0])}, {int(py*image.size[1])})"
+     q_inputs = process_fn({'text': [point_text], 'images': [image]}, processor=processor)
+     q_inputs['point'] = [px * sam_size, py * sam_size]
+     q_inputs = {k: v.to(device) if torch.is_tensor(v) else v for k, v in q_inputs.items()}
+     with torch.no_grad():
+         with torch.autocast(enabled=True, dtype=torch.bfloat16, device_type="cuda"):
+             point_qry = model(qry=q_inputs)["qry_reps"]
+     for string in test_strings:
+         cand_inputs = process_fn({'text': [string], 'images': [None]}, processor=processor)
+         cand_inputs = {k: v.to(device) if torch.is_tensor(v) else v for k, v in cand_inputs.items()}
+         with torch.no_grad():
+             with torch.autocast(enabled=True, dtype=torch.bfloat16, device_type="cuda"):
+                 tgt_output = model(tgt=cand_inputs)["tgt_reps"]
+         sim = model.compute_similarity(point_qry, tgt_output)
+         print(f"point ({px:.2f},{py:.2f}) | {string} = {sim}")
+
+ '''
+ --- Similarities (point prompts) ---
+ point (0.25,0.50) | A cat = tensor([[0.1793]], device='cuda:0')
+ point (0.25,0.50) | A dog = tensor([[0.1339]], device='cuda:0')
+ point (0.25,0.50) | A tiger = tensor([[0.1314]], device='cuda:0')
+ point (0.75,0.50) | A cat = tensor([[0.2232]], device='cuda:0')
+ point (0.75,0.50) | A dog = tensor([[0.1742]], device='cuda:0')
+ point (0.75,0.50) | A tiger = tensor([[0.1692]], device='cuda:0')
+ '''
+
+ # Scenario 3: BBox prompts — two examples (left/right)
+ print("\n--- Similarities (bbox prompts) ---")
+ bbox_examples = [
+     (0.05, 0.20, 0.45, 0.80),  # left
+     (0.55, 0.20, 0.95, 0.80),  # right
+ ]
+ for (x1, y1, x2, y2) in bbox_examples:
+     bbox_text = f"{image_token}\nFind the caption that best describes the object in the bounding box, considering both local details and global context in the given image.\nReferring object bbox: ({int(x1*image.size[0])}, {int(y1*image.size[1])}, {int(x2*image.size[0])}, {int(y2*image.size[1])})"
+     q_inputs = process_fn({'text': [bbox_text], 'images': [image]}, processor=processor)
+     q_inputs['bbox'] = [x1 * sam_size, y1 * sam_size, x2 * sam_size, y2 * sam_size]
+     q_inputs = {k: v.to(device) if torch.is_tensor(v) else v for k, v in q_inputs.items()}
+     with torch.no_grad():
+         with torch.autocast(enabled=True, dtype=torch.bfloat16, device_type="cuda"):
+             bbox_qry = model(qry=q_inputs)["qry_reps"]
+     for string in test_strings:
+         cand_inputs = process_fn({'text': [string], 'images': [None]}, processor=processor)
+         cand_inputs = {k: v.to(device) if torch.is_tensor(v) else v for k, v in cand_inputs.items()}
+         with torch.no_grad():
+             with torch.autocast(enabled=True, dtype=torch.bfloat16, device_type="cuda"):
+                 tgt_output = model(tgt=cand_inputs)["tgt_reps"]
+         sim = model.compute_similarity(bbox_qry, tgt_output)
+         print(f"bbox ({x1:.2f},{y1:.2f},{x2:.2f},{y2:.2f}) | {string} = {sim}")
+
+ '''
+ --- Similarities (bbox prompts) ---
+ bbox (0.05,0.20,0.45,0.80) | A cat = tensor([[0.2100]], device='cuda:0')
+ bbox (0.05,0.20,0.45,0.80) | A dog = tensor([[0.1512]], device='cuda:0')
+ bbox (0.05,0.20,0.45,0.80) | A tiger = tensor([[0.1719]], device='cuda:0')
+ bbox (0.55,0.20,0.95,0.80) | A cat = tensor([[0.1583]], device='cuda:0')
+ bbox (0.55,0.20,0.95,0.80) | A dog = tensor([[0.1953]], device='cuda:0')
+ bbox (0.55,0.20,0.95,0.80) | A tiger = tensor([[0.1225]], device='cuda:0')
+ '''
+ ```
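The point and bbox prompts above are specified as fractions of the image and rescaled by `sam_size` before being passed to the model. A minimal standalone sketch of that conversion (the `to_sam_coords` helper is hypothetical, not part of the repository; the 1024 input resolution matches the `sam_size` used above):

```python
# Hypothetical helper illustrating the coordinate convention used above:
# visual prompts are given in normalized [0, 1] coordinates and scaled
# to the assumed SAM2 input resolution of 1024.
SAM_SIZE = 1024

def to_sam_coords(norm_coords, sam_size=SAM_SIZE):
    """Scale a flat list of normalized coordinates to SAM pixel space."""
    return [v * sam_size for v in norm_coords]

print(to_sam_coords([0.25, 0.5]))               # point -> [256.0, 512.0]
print(to_sam_coords([0.05, 0.20, 0.45, 0.80]))  # bbox
```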
+
+ ## Ethical Considerations
+ _Note: This section is mainly taken from the [AKI](https://huggingface.co/Sony/AKI-4B-phi-3.5-mini) models_.
+
+ This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety.
+

  ## Citation
  ```
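Since the model is loaded with `normalize=True`, the similarity scores printed in the usage example are effectively cosine similarities between the query and candidate embeddings. A minimal NumPy sketch of that computation (an illustration only; the repository's `compute_similarity` may differ in details):

```python
import numpy as np

def cosine_similarity(qry, tgt):
    """Dot product of L2-normalized row vectors, i.e. cosine similarity."""
    qry = qry / np.linalg.norm(qry, axis=-1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=-1, keepdims=True)
    return qry @ tgt.T

qry = np.array([[1.0, 0.0, 1.0]])     # toy query embedding
tgts = np.array([[1.0, 0.0, 0.0],     # toy candidate embeddings
                 [0.0, 1.0, 0.0]])
sims = cosine_similarity(qry, tgts)
best = int(np.argmax(sims))           # index of the closest candidate
print(sims, best)
```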
images/MMEB-results-with-SCaR.png ADDED

Git LFS Details

  • SHA256: ad2f515d3b821ebc8d09c34947f9c741e3f29fc325d173ab6743b0dbc176b3ce
  • Pointer size: 131 Bytes
  • Size of remote file: 197 kB
images/MMEB-results.png ADDED

Git LFS Details

  • SHA256: 0f9cd132b0231aa01e1de6addccb09faac02eb963e88f41fd44b62bf22b6ae8d
  • Pointer size: 131 Bytes
  • Size of remote file: 389 kB
images/SCaR-results.png ADDED

Git LFS Details

  • SHA256: 2b0585af6febc998b96de95abcf88a1dd037cf3a4472c384d991973f5e9274f8
  • Pointer size: 131 Bytes
  • Size of remote file: 379 kB
images/VIRTUE-framework.jpg ADDED

Git LFS Details

  • SHA256: b0a9f73a7c4663cb4e5715628ef815f47c57f26b02cf40af03a2720c6a060c6e
  • Pointer size: 131 Bytes
  • Size of remote file: 154 kB
images/example.jpg ADDED

Git LFS Details

  • SHA256: 52d2be6748264d763776726aaf6feeb75c4820386fb97462b376b287ac51fbc9
  • Pointer size: 130 Bytes
  • Size of remote file: 56.7 kB