marsh123 committed on
Commit 882c8a3 · verified · 1 Parent(s): 5ee612c

Update README.md

Files changed (1)
  1. README.md +123 -3
README.md CHANGED
@@ -1,3 +1,123 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ language:
+ - en
+ metrics:
+ - recall
+ base_model:
+ - Qwen/Qwen2-VL-2B-Instruct
+ library_name: transformers == 4.45.2
+ ---
+
+ <h1 align="center">Vis-IR: Unifying Search With Visualized Information Retrieval</h1>
+
+ <p align="center">
+     <a href="https://arxiv.org/abs/2502.11431">
+         <img alt="Build" src="http://img.shields.io/badge/arXiv-2502.11431-B31B1B.svg">
+     </a>
+     <a href="https://github.com/VectorSpaceLab/Vis-IR">
+         <img alt="Build" src="https://img.shields.io/badge/Github-Code-blue">
+     </a>
+     <a href="">
+         <img alt="Build" src="https://img.shields.io/badge/🤗 Datasets-VIRA-yellow">
+     </a>
+     <a href="">
+         <img alt="Build" src="https://img.shields.io/badge/🤗 Datasets-MVRB-yellow">
+     </a>
+     <a href="">
+         <img alt="Build" src="https://img.shields.io/badge/🤗 Model-UniSE CLIP-yellow">
+     </a>
+     <a href="https://huggingface.co/marsh123/UniSE">
+         <img alt="Build" src="https://img.shields.io/badge/🤗 Model-UniSE MLLM-yellow">
+     </a>
+
+ </p>
+
+ <h4 align="center">
+     <p>
+         <a href="#news">News</a> |
+         <a href="#release-plan">Release Plan</a> |
+         <a href="#overview">Overview</a> |
+         <a href="#license">License</a> |
+         <a href="#citation">Citation</a>
+     </p>
+ </h4>
+
+ ## News
+
+ `2025-02-17` 🎉🎉 We released our paper: [Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval](https://arxiv.org/abs/2502.11431).
+
+ ## Release Plan
+ - [x] Paper
+ - [x] UniSE models
+ - [ ] MVRB benchmark
+ - [ ] VIRA Dataset
+ - [ ] Evaluation code
+ - [ ] Fine-tuning code
+
+ ## Overview
+
+ In this work, we formally define an emerging IR paradigm called Visualized Information Retrieval, or **Vis-IR**, in which multimodal information, such as text, images, tables, and charts, is jointly represented in a unified visual format called **Screenshots** for various retrieval applications. We further make three key contributions to Vis-IR. First, we create **VIRA** (Vis-IR Aggregation), a large-scale dataset comprising a vast collection of screenshots from diverse sources, carefully curated into captioned and question-answer formats. Second, we develop **UniSE** (Universal Screenshot Embeddings), a family of retrieval models that enable screenshots to query or be queried across arbitrary data modalities. Finally, we construct **MVRB** (Massive Visualized IR Benchmark), a comprehensive benchmark covering a variety of task forms and application scenarios. Through extensive evaluations on MVRB, we highlight the deficiencies of existing multimodal retrievers and the substantial improvements made by UniSE.
+
+ ## Model Usage
+
+ > Our code works well with `transformers==4.45.2`, and we recommend using this version.
+
+ ### 1. UniSE-MLLM Models
+
+ ```python
+ import torch
+ from transformers import AutoModel
+
+ MODEL_NAME = "marsh123/UniSE-MLLM"
+ model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)  # trust_remote_code=True is required: the model uses custom code
+ model.set_processor(MODEL_NAME)
+
+ with torch.no_grad():
+     device = torch.device("cuda:0")  # assumes a CUDA GPU is available
+     model = model.to(device)
+     model.eval()
+
+     query_inputs = model.data_process(  # queries: one screenshot plus one text request each
+         images=["./assets/query_1.png", "./assets/query_2.png"],
+         text=["After a 17% drop, what is Nvidia's closing stock price?", "I would like to see a detailed and intuitive performance comparison between the two models."],
+         q_or_c="query",
+         task_instruction="Represent the given image with the given query."
+     )
+
+     candidate_inputs = model.data_process(  # candidates: screenshots only
+         images=["./assets/positive_1.jpeg", "./assets/neg_1.jpeg",
+                 "./assets/positive_2.jpeg", "./assets/neg_2.jpeg"],
+         q_or_c="candidate"
+     )
+
+     query_embeddings = model(**query_inputs)
+     candidate_embeddings = model(**candidate_inputs)
+     scores = torch.matmul(query_embeddings, candidate_embeddings.T)  # scores[i][j]: similarity of query i and candidate j
+     print(scores)
+ ```
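
The `scores` matrix computed in the snippet has one row per query and one column per candidate. As a minimal sketch of how such a matrix could be turned into a ranking, the example below works on plain Python lists (with a real tensor, `scores.tolist()` would produce this shape). The helper name `rank_candidates` and the score values are hypothetical and only illustrate the idea; they are not part of the released code.

```python
from typing import List, Tuple


def rank_candidates(
    scores: List[List[float]],
    candidate_names: List[str],
    top_k: int = 1,
) -> List[List[Tuple[str, float]]]:
    """Return the top_k (candidate, score) pairs for every query row."""
    rankings = []
    for row in scores:
        # Sort candidate indices by descending similarity score.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        rankings.append([(candidate_names[j], row[j]) for j in order[:top_k]])
    return rankings


# Hypothetical scores for 2 queries x 4 candidates (illustrative values only).
example_scores = [
    [0.62, 0.21, 0.18, 0.05],
    [0.11, 0.09, 0.57, 0.30],
]
candidate_names = ["positive_1", "neg_1", "positive_2", "neg_2"]

top1 = rank_candidates(example_scores, candidate_names, top_k=1)
print(top1)
```

If retrieval works as intended, the highest-scoring candidate for each query should be its matching positive screenshot.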
+
+ ## Performance on MVRB
+
+ MVRB is a comprehensive benchmark for screenshot-centered retrieval. It includes four meta-tasks: Screenshot Retrieval (SR), Composed Screenshot Retrieval (CSR), Screenshot QA (SQA), and Open-Vocabulary Classification (OVC). We evaluate three main types of retrievers on MVRB: OCR+Text Retrievers, General Multimodal Retrievers, and Screenshot Document Retrievers. Our proposed UniSE-MLLM achieves state-of-the-art (SOTA) performance on this benchmark.
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/66164f6245336ca774679611/igMgX-BvQ55Dyxuw26sgs.png)
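
The model card metadata lists recall as the evaluation metric. The exact MVRB evaluation protocol is not described in this README, so the following is only a generic sketch of Recall@k for a retrieval task, assuming each query comes with a set of relevant candidate ids. The function name `recall_at_k` and the toy data are hypothetical.

```python
from typing import Dict, List, Set


def recall_at_k(
    ranked: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int,
) -> float:
    """Mean over queries of (relevant items found in the top k) / (all relevant items)."""
    per_query = []
    for query_id, relevant_ids in relevant.items():
        if not relevant_ids:
            continue  # Skip queries without any labeled positives.
        top_k = set(ranked.get(query_id, [])[:k])
        per_query.append(len(top_k & relevant_ids) / len(relevant_ids))
    return sum(per_query) / len(per_query) if per_query else 0.0


# Toy data (hypothetical): ranked candidate ids per query, best first.
ranked = {
    "q1": ["c3", "c1", "c7"],
    "q2": ["c2", "c5", "c4"],
}
relevant = {"q1": {"c1"}, "q2": {"c4"}}

for k in (1, 2, 3):
    print(k, recall_at_k(ranked, relevant, k))
```

When every query has exactly one relevant candidate, as in this toy data, Recall@k reduces to the fraction of queries whose positive appears in the top k.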
+
+
+
+ ## License
+ Vis-IR is licensed under the [MIT License](LICENSE).
+
+
+ ## Citation
+ If you find this repository useful, please consider giving it a star ⭐ and citing our paper:
+
+ ```
+ @article{liu2025any,
+   title={Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval},
+   author={Liu, Ze and Liang, Zhengyang and Zhou, Junjie and Liu, Zheng and Lian, Defu},
+   journal={arXiv preprint arXiv:2502.11431},
+   year={2025}
+ }
+ ```
+
+