---
license: apache-2.0
base_model:
- OpenGVLab/InternVL2_5-2B
---
# IDMR-2B

**IDMR** is a universal multimodal embedding model, particularly well-suited for **Instance-Driven Multimodal Retrieval (IDMR)** tasks. It is designed to achieve fine-grained, instance-level visual correspondence across modalities.

---

### 🔍 Learn More About IDMR

- 📄 Paper: [IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval](https://arxiv.org/pdf/2504.00954)
- 🤗 Demo: [IDMR Demo on Hugging Face Spaces](https://huggingface.co/spaces/lbw18601752667/IDMR-demo)
- 💻 Code: [GitHub](https://github.com/BwLiu01/IDMR)

## 🚀 Usage

To get started, clone the GitHub repository and install the required dependencies:

```bash
git clone https://github.com/BwLiu01/IDMR.git
cd IDMR
pip install -r requirements.txt
```

```python
import torch
from PIL import Image
from src.model import IDMRModel
from src.vlm_backbone.intern_vl import InternVLProcessor
from src.arguments import ModelArguments
from transformers import AutoTokenizer, AutoImageProcessor

device = "cuda"
IMAGE_TOKEN = "<image>"

# Model configuration
model_args = ModelArguments(model_name="lbw18601752667/IDMR-2B", model_backbone="internvl_2_5")

# Initialize the processor from the model's tokenizer and image processor
tokenizer = AutoTokenizer.from_pretrained(model_args.model_name, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(model_args.model_name, trust_remote_code=True, use_fast=False)
processor = InternVLProcessor(image_processor=image_processor, tokenizer=tokenizer)

# Load the model in bfloat16 and switch to inference mode
model = IDMRModel.load(model_args).to(device, dtype=torch.bfloat16).eval()

def get_embedding(text, image=None, side="qry"):
    """Embed a text and/or image input as a query ("qry") or a target ("tgt")."""
    inputs = processor(
        text=f"{IMAGE_TOKEN}\n {text}" if text else f"{IMAGE_TOKEN}\n Represent the given image.",
        images=[image] if image else None,
        return_tensors="pt",
        max_length=1024,
        truncation=True,
    )
    inputs = {key: value.to(device) for key, value in inputs.items()}
    # Tell the backbone whether an image is actually present
    inputs["image_flags"] = torch.tensor([1 if image else 0], dtype=torch.long).to(device)

    with torch.no_grad(), torch.autocast(device_type=device, dtype=torch.bfloat16):
        if side == "qry":
            output = model(qry=inputs)["qry_reps"]
        else:
            output = model(tgt=inputs)["tgt_reps"]
    return output.float()


# Query: text plus image
query_text = "your query text"
query_image = Image.open("your query image path")
query_embedding = get_embedding(query_text, query_image, side="qry")

# Target: image only
target_image = Image.open("your target image path")
target_embedding = get_embedding(None, target_image, side="tgt")

print(model.compute_similarity(query_embedding, target_embedding))
```
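
`compute_similarity` above scores one query against one target; in a retrieval setting you typically embed a pool of candidates once and rank them against each query. A minimal pure-Python sketch of that ranking step, using toy vectors in place of real model embeddings and cosine similarity as one reasonable scoring choice (not necessarily identical to the model's own `compute_similarity`):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_targets(query_emb, target_embs):
    """Return target indices sorted by descending similarity to the query."""
    scores = [cosine_similarity(query_emb, t) for t in target_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy 3-dimensional embeddings standing in for model outputs
query = [1.0, 0.0, 0.0]
targets = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
print(rank_targets(query, targets))  # → [1, 2, 0], most similar target first
```

With real embeddings, you would flatten each `get_embedding(...)` output to a vector (e.g. `emb.squeeze(0).tolist()`) before ranking, or keep everything in torch and batch the similarity computation.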