---
license: apache-2.0
base_model:
- OpenGVLab/InternVL2_5-2B
---
# IDMR-2B

**IDMR** is a universal multimodal embedding model, particularly well-suited for **Instance-Driven Multimodal Retrieval (IDMR)** tasks. It is designed to achieve fine-grained, instance-level visual correspondence across modalities.

---

### 🔍 Learn More About IDMR

- 📄 Paper: [IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval](https://arxiv.org/pdf/2504.00954)
- 🤗 Demo: [IDMR Demo on Hugging Face Spaces](https://huggingface.co/spaces/lbw18601752667/IDMR-demo)
- 💻 Code: [GitHub](https://github.com/BwLiu01/IDMR)

## 🚀 Usage

To get started, clone the GitHub repository and install the required dependencies:

```bash
git clone https://github.com/BwLiu01/IDMR.git
cd IDMR
pip install -r requirements.txt
```

```python
import torch
from PIL import Image
from src.model import IDMRModel
from src.vlm_backbone.intern_vl import InternVLProcessor
from src.arguments import ModelArguments
from transformers import AutoTokenizer, AutoImageProcessor

device = "cuda"
IMAGE_TOKEN = "<image>"

# Model configuration
model_args = ModelArguments(model_name="lbw18601752667/IDMR-2B", model_backbone="internvl_2_5")

# Initialize the processor from the model's tokenizer and image processor
tokenizer = AutoTokenizer.from_pretrained(model_args.model_name, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(model_args.model_name, trust_remote_code=True, use_fast=False)
processor = InternVLProcessor(image_processor=image_processor, tokenizer=tokenizer)

# Load the model in bfloat16 and switch to inference mode
model = IDMRModel.load(model_args).to(device, dtype=torch.bfloat16).eval()

def get_embedding(text, image=None, side="qry"):
    """Embed a text and/or image input as a query ("qry") or a target ("tgt")."""
    inputs = processor(
        text=f"{IMAGE_TOKEN}\n {text}" if text else f"{IMAGE_TOKEN}\n Represent the given image.",
        images=[image] if image else None,
        return_tensors="pt",
        max_length=1024,
        truncation=True,
    )
    inputs = {key: value.to(device) for key, value in inputs.items()}
    # Tell the backbone whether an image is actually present
    inputs["image_flags"] = torch.tensor([1 if image else 0], dtype=torch.long).to(device)

    with torch.no_grad(), torch.autocast(device_type=device, dtype=torch.bfloat16):
        if side == "qry":
            output = model(qry=inputs)["qry_reps"]
        else:
            output = model(tgt=inputs)["tgt_reps"]
    return output.float()


# Query: text plus image
query_text = "your query text"
query_image = Image.open("your query image path")
query_embedding = get_embedding(query_text, query_image, side="qry")

# Target: image only
target_image = Image.open("your target image path")
target_embedding = get_embedding(None, target_image, side="tgt")

print(model.compute_similarity(query_embedding, target_embedding))
```
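
`compute_similarity` above scores one query against one target; in a retrieval setting you typically embed a pool of candidates once and rank them against each query. A minimal pure-Python sketch of that ranking step, using toy vectors in place of real model embeddings and cosine similarity as one reasonable scoring choice (not necessarily identical to the model's own `compute_similarity`):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_targets(query_emb, target_embs):
    """Return target indices sorted by descending similarity to the query."""
    scores = [cosine_similarity(query_emb, t) for t in target_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy 3-dimensional embeddings standing in for model outputs
query = [1.0, 0.0, 0.0]
targets = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
print(rank_targets(query, targets))  # → [1, 2, 0], most similar target first
```

With real embeddings, you would flatten each `get_embedding(...)` output to a vector (e.g. `emb.squeeze(0).tolist()`) before ranking, or keep everything in torch and batch the similarity computation.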