Integrate with Sentence Transformers v5.4 (#3)

- Integrate with Sentence Transformers v5.4 (c1b7bef7440a5b0b5a101263274c4683fe7db712)
- Use the correct ST version (b2543ad42881007d9ade76455e943c6ce6bca9c1)

Files changed (7)

1_Pooling/config.json +5 -0
README.md +46 -3
chat_template.jinja +53 -0
config_sentence_transformers.json +14 -0
modules.json +20 -0
preprocessor_config.json +2 -2
sentence_bert_config.json +28 -0

1_Pooling/config.json ADDED

	@@ -0,0 +1,5 @@

+{
+    "embedding_dimension": 2048,
+    "pooling_mode": "lasttoken",
+    "include_prompt": true
+}
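
Review note: `"pooling_mode": "lasttoken"` means each input's embedding is taken from the hidden state of its final attended token. The sketch below illustrates that idea in plain Python. It is not the Sentence Transformers implementation: the helper name `last_token_pool`, the toy hidden size of 2 (the config above uses 2048), and the mask-based lookup are assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def last_token_pool(token_embeddings: List[List[Vector]],
                    attention_mask: List[List[int]]) -> List[Vector]:
    """Return, for each sequence, the embedding of its last attended token."""
    pooled = []
    for embeddings, mask in zip(token_embeddings, attention_mask):
        # Positions the model actually attended to (mask value 1).
        attended = [i for i, m in enumerate(mask) if m == 1]
        if not attended:
            raise ValueError("sequence has no attended tokens")
        pooled.append(embeddings[attended[-1]])
    return pooled


# Toy batch: 2 sequences, 3 token positions, hidden size 2 (hypothetical values).
token_embeddings = [
    [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]],  # right-padded: last real token is index 1
    [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]],  # no padding: last real token is index 2
]
attention_mask = [
    [1, 1, 0],
    [1, 1, 1],
]

pooled = last_token_pool(token_embeddings, attention_mask)
print(pooled)
```

Using the attention mask rather than a fixed index `-1` keeps the sketch correct for right-padded batches; with left padding the two choices coincide.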

README.md CHANGED

@@ -10,7 +10,10 @@ metrics:
 - recall
 base_model:
 - Qwen/Qwen2.5-VL-3B-Instruct
-library_name: transformers == 4.51.3
 ---
 <h1 align="center">Vis-IR: Unifying Search With Visualized Information Retrieval</h1>
@@ -71,9 +74,49 @@ In this work, we formally define an emerging IR paradigm called Visualized Infor
 ## Model Usage
-> Our code works well on transformers==4.51.3, and we recommend using this version.
-### 1. UniSE-MLLM Models
 ```python
 import torch

 - recall
 base_model:
 - Qwen/Qwen2.5-VL-3B-Instruct
+library_name: sentence-transformers
+tags:
+- sentence-transformers
+pipeline_tag: sentence-similarity
 ---
 <h1 align="center">Vis-IR: Unifying Search With Visualized Information Retrieval</h1>
 ## Model Usage
+### Using Sentence Transformers
+Install Sentence Transformers:
+```bash
+pip install "sentence_transformers[image]"
+```
+```python
+from sentence_transformers import SentenceTransformer
+model = SentenceTransformer("BAAI/BGE-VL-Screenshot")
+# Queries: composed image + text inputs (prefix text with "Query:")
+query_inputs = [
+    {"text": "Query:After a 17% drop, what is Nvidia's closing stock price?", "image": "https://huggingface.co/BAAI/BGE-VL-Screenshot/resolve/main/assets/query_1.png"},
+    {"text": "Query:I would like to see a detailed and intuitive performance comparison between the two models.", "image": "https://huggingface.co/BAAI/BGE-VL-Screenshot/resolve/main/assets/query_2.png"},
+]
+query_embeddings = model.encode_query(query_inputs)
+print(query_embeddings.shape)
+# (2, 2048)
+# Candidates: screenshot images
+candidate_inputs = [
+    "https://huggingface.co/BAAI/BGE-VL-Screenshot/resolve/main/assets/positive_1.jpeg",
+    "https://huggingface.co/BAAI/BGE-VL-Screenshot/resolve/main/assets/neg_1.jpeg",
+    "https://huggingface.co/BAAI/BGE-VL-Screenshot/resolve/main/assets/positive_2.jpeg",
+    "https://huggingface.co/BAAI/BGE-VL-Screenshot/resolve/main/assets/neg_2.jpeg",
+]
+candidate_embeddings = model.encode_document(candidate_inputs)
+print(candidate_embeddings.shape)
+# (4, 2048)
+similarities = model.similarity(query_embeddings, candidate_embeddings)
+print(similarities)
+# tensor([[0.5725, 0.3449, 0.1913, 0.1497],
+#         [0.1457, 0.0795, 0.4243, 0.4177]])
+```
+The model provides two prompts: `"query"` for composed image+text queries and `"document"` (default) for screenshot candidates.
+### Using transformers
+> Our code works well on transformers==4.51.3, and we recommend using this version.
 ```python
 import torch

chat_template.jinja ADDED

	@@ -0,0 +1,53 @@

+{%- if messages[0].role == 'system' -%}
+    {{- '<|im_start|>system\n' }}
+    {%- if messages[0].content is string -%}
+        {{- messages[0].content }}
+    {%- else -%}
+        {%- for content in messages[0].content -%}
+            {%- if 'text' in content -%}
+                {{- content.text }}
+            {%- endif -%}
+        {%- endfor -%}
+    {%- endif -%}
+    {{- '<|im_end|>\n' }}
+{%- else -%}
+    {{- '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}
+{%- endif -%}
+{%- set image_count = namespace(value=0) -%}
+{%- set video_count = namespace(value=0) -%}
+{%- for message in messages -%}
+    {%- if message.role == "user" -%}
+        {{- '<|im_start|>' + message.role + '\n' }}
+        {%- if message.content is string -%}
+            {{- message.content }}
+        {%- else -%}
+            {%- for content in message.content -%}
+                {%- if content.type == 'image' or 'image' in content or 'image_url' in content -%}
+                    {%- set image_count.value = image_count.value + 1 -%}
+                    {{- '<|vision_start|><|image_pad|><|vision_end|>' }}
+                {%- elif content.type == 'video' or 'video' in content -%}
+                    {%- set video_count.value = video_count.value + 1 -%}
+                    {{- '<|vision_start|><|video_pad|><|vision_end|>' }}
+                {%- elif 'text' in content -%}
+                    {{- content.text }}
+                {%- endif -%}
+            {%- endfor -%}
+        {%- endif -%}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role != "system" -%}
+        {{- '<|im_start|>' + message.role + '\n' }}
+        {%- if message.content is string -%}
+            {{- message.content }}
+        {%- else -%}
+            {%- for content in message.content -%}
+                {%- if 'text' in content -%}
+                    {{- content.text }}
+                {%- endif -%}
+            {%- endfor -%}
+        {%- endif -%}
+        {{- '<|im_end|>\n' }}
+    {%- endif -%}
+{%- endfor -%}
+{%- if add_generation_prompt -%}
+    {{- '<|im_start|>assistant\n<|endoftext|>' }}
+{%- endif -%}
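
Review note: to make the template easier to check, here is a hand-written Python mirror of its branches. It is a sketch for reading along, not how `transformers` or Sentence Transformers renders the template. The function names `render_chat` and `_text_parts`, the example message, and the image-before-text ordering are assumptions. How the `"query"` and `"document"` prompts are placed into the messages is not shown in this diff and is left out here.

```python
from typing import Any, Dict, List, Union

Content = Union[str, List[Dict[str, Any]]]


def _text_parts(content: Content) -> str:
    """Concatenate text only, as the template does for system and non-user roles."""
    if isinstance(content, str):
        return content
    return "".join(part["text"] for part in content if "text" in part)


def render_chat(messages: List[Dict[str, Any]], add_generation_prompt: bool = True) -> str:
    out = []
    # A leading system message is rendered; otherwise a default one is emitted.
    if messages[0]["role"] == "system":
        out.append("<|im_start|>system\n" + _text_parts(messages[0]["content"]) + "<|im_end|>\n")
    else:
        out.append("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n")
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "system":
            continue  # the template's loop has no branch for system messages
        out.append("<|im_start|>" + role + "\n")
        if role != "user" or isinstance(content, str):
            out.append(_text_parts(content))
        else:
            for part in content:
                if part.get("type") == "image" or "image" in part or "image_url" in part:
                    out.append("<|vision_start|><|image_pad|><|vision_end|>")
                elif part.get("type") == "video" or "video" in part:
                    out.append("<|vision_start|><|video_pad|><|vision_end|>")
                elif "text" in part:
                    out.append(part["text"])
        out.append("<|im_end|>\n")
    if add_generation_prompt:
        out.append("<|im_start|>assistant\n<|endoftext|>")
    return "".join(out)


# Hypothetical composed query: one image part followed by one text part.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "query_1.png"},
            {"type": "text", "text": "Query:example question"},
        ],
    }
]
prompt = render_chat(messages, add_generation_prompt=True)
print(prompt)
```

With `add_generation_prompt` enabled, the rendered string ends in `<|endoftext|>`. Combined with `lasttoken` pooling, this suggests the embedding is read at that final position, though the diff itself does not state this.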

config_sentence_transformers.json ADDED

	@@ -0,0 +1,14 @@

+{
+  "__version__": {
+    "pytorch": "2.10.0+cu128",
+    "sentence_transformers": "5.4.0",
+    "transformers": "5.5.0"
+  },
+  "default_prompt_name": "document",
+  "model_type": "SentenceTransformer",
+  "prompts": {
+    "document": "Represent the given text-rich image, focusing on extracting and interpreting both its rich text content and visual features.",
+    "query": "Represent the given image with the given query."
+  },
+  "similarity_fn_name": "cosine"
+}
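
Review note: with `"similarity_fn_name": "cosine"`, `model.similarity(...)` returns a query-by-candidate score matrix like the one printed in the README example. The sketch below ranks candidates from that matrix in plain Python. The scores are copied from the README output in this PR. The shortened file names and the helper `rank` are illustrative assumptions.

```python
from typing import List

# Scores copied from the README example output (rows: 2 queries, columns: 4 candidates).
similarities = [
    [0.5725, 0.3449, 0.1913, 0.1497],
    [0.1457, 0.0795, 0.4243, 0.4177],
]
# Candidate file names, in the same order as the README's candidate_inputs.
candidates = ["positive_1.jpeg", "neg_1.jpeg", "positive_2.jpeg", "neg_2.jpeg"]


def rank(scores: List[float]) -> List[int]:
    """Return candidate indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


rankings = [rank(row) for row in similarities]
for query_index, order in enumerate(rankings):
    print(query_index, [candidates[i] for i in order])
```

Each query's matching positive ranks first, but for the second query the margin over `neg_2.jpeg` is small (0.4243 vs 0.4177).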

modules.json ADDED

	@@ -0,0 +1,20 @@

+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.base.modules.transformer.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.sentence_transformer.modules.pooling.Pooling"
+  },
+  {
+    "idx": 2,
+    "name": "2",
+    "path": "2_Normalize",
+    "type": "sentence_transformers.sentence_transformer.modules.normalize.Normalize"
+  }
+]
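
Review note: the three modules run in order: Transformer, then Pooling, then Normalize. The last step L2-normalizes each embedding, so a plain dot product between two outputs equals their cosine similarity. A minimal sketch of that property, using hypothetical 2-dimensional vectors and helper names that are not part of the library:

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Hypothetical un-normalized pooled embeddings.
a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# After normalization, the dot product matches the cosine of the originals.
print(dot(a_unit, b_unit), cosine(a, b))
```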

preprocessor_config.json CHANGED

@@ -1,6 +1,6 @@
 {
-  "min_pixels": 3136,
-  "max_pixels": 12845056,
+  "min_pixels": 50176,
+  "max_pixels": 1960000,
   "patch_size": 14,
   "temporal_patch_size": 2,
   "merge_size": 2,

sentence_bert_config.json ADDED

	@@ -0,0 +1,28 @@

+{
+    "transformer_task": "feature-extraction",
+    "modality_config": {
+        "text": {
+            "method": "forward",
+            "method_output_name": "last_hidden_state"
+        },
+        "image": {
+            "method": "forward",
+            "method_output_name": "last_hidden_state"
+        },
+        "image+text": {
+            "method": "forward",
+            "method_output_name": "last_hidden_state"
+        },
+        "message": {
+            "method": "forward",
+            "method_output_name": "last_hidden_state",
+            "format": "structured"
+        }
+    },
+    "module_output_name": "token_embeddings",
+    "processing_kwargs": {
+        "chat_template": {
+            "add_generation_prompt": true
+        }
+    }
+}