---
license: apache-2.0
pipeline_tag: object-detection
library_name: transformers
---

# ObjEmbed: Towards Universal Multimodal Object Embeddings

[ObjEmbed](https://huggingface.co/papers/2602.01753) is an MLLM (multimodal large language model) embedding model that aligns individual objects in an image with their textual descriptions. Whereas many embedding models excel at global image-text alignment, ObjEmbed targets fine-grained alignment: it decomposes an input image into multiple regional embeddings, each corresponding to one object, alongside global embeddings. This supports a range of visual understanding tasks, including visual grounding, local image retrieval, and global image retrieval.

This is the official PyTorch implementation of ObjEmbed.

- **Paper:** [ObjEmbed: Towards Universal Multimodal Object Embeddings](https://huggingface.co/papers/2602.01753)
- **Code:** [WeChatCV/ObjEmbed](https://github.com/WeChatCV/ObjEmbed)

## Key Features

-   **Object-Oriented Representation**: Captures both semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality. The final object matching score combines semantic similarity with the predicted IoU, enabling more accurate retrieval.
-   **Versatility**: Seamlessly handles both region-level and image-level tasks.
-   **Efficient Encoding**: All objects in an image, along with the full image, are encoded in a single forward pass for high efficiency.
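
As a rough illustration of the object matching score described above, the sketch below combines a semantic similarity with a predicted IoU for each region. The card does not state how the two terms are combined, so the product used here is an assumption, as are the function names `cosine`, `match_scores`, and `top1` and the toy vectors; refer to the paper and repository for the actual formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cannot compare a zero-length embedding")
    return dot / (norm_u * norm_v)


def match_scores(text_emb, object_embs, predicted_ious):
    """Score each region as semantic similarity times predicted IoU.

    ASSUMPTION: multiplying the two terms is only one plausible way to
    combine them; the model card does not specify the exact rule.
    """
    return [
        cosine(text_emb, obj) * iou
        for obj, iou in zip(object_embs, predicted_ious)
    ]


def top1(scores):
    """Index of the highest-scoring region."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data (hypothetical): region 0 matches the text best semantically
# but is poorly localized; region 1 is slightly less similar but has a
# much higher predicted IoU.
text_emb = [1.0, 0.0]
object_embs = [[1.0, 0.0], [0.8, 0.6]]
predicted_ious = [0.3, 0.9]

scores = match_scores(text_emb, object_embs, predicted_ious)
best = top1(scores)  # selects region 1
```

In this toy case the well-localized region wins even though its semantic similarity is lower, which is the behavior the IoU term is meant to encourage.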

## Sample Usage

For detailed installation and environment setup, please refer to the [GitHub repository](https://github.com/WeChatCV/ObjEmbed).

### Referring Expression Comprehension (REC)
To output the top-1 predicted region for a text query on a single image:

```bash
# output the top1 prediction
python infer_objembed.py \
    --objembed_checkpoint /PATH/TO/OBJEMBED \
    --wedetect_uni_checkpoint /PATH/TO/WEDETECT_UNI \
    --image assets/demo.jpg \
    --query "The car's license plate in HAWAII" \
    --task rec \
    --visualize
```

### Image Retrieval
To perform image retrieval based on a query:

```bash
python infer_objembed.py \
    --objembed_checkpoint /PATH/TO/OBJEMBED \
    --wedetect_uni_checkpoint /PATH/TO/WEDETECT_UNI \
    --image image1.jpg image2.jpg image3.jpg \
    --query "YOUR_QUERY" \
    --task retrieval_by_image
```
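
When running many queries, it can be convenient to assemble these commands from Python. The sketch below is a hypothetical helper (`build_infer_command` is not part of the repository) that only uses the flags and task values shown in the commands above; it assumes `infer_objembed.py` is in the current working directory and that `python` resolves to the intended environment.

```python
import shlex
import subprocess


def build_infer_command(objembed_ckpt, wedetect_ckpt, images, query, task,
                        visualize=False):
    """Build the argument list for infer_objembed.py.

    Hypothetical convenience wrapper: it mirrors the CLI flags shown in
    this card and does not validate them against the script itself.
    """
    if not images:
        raise ValueError("at least one image path is required")
    cmd = [
        "python", "infer_objembed.py",
        "--objembed_checkpoint", objembed_ckpt,
        "--wedetect_uni_checkpoint", wedetect_ckpt,
        "--image", *images,          # one path for rec, several for retrieval
        "--query", query,
        "--task", task,              # e.g. "rec" or "retrieval_by_image"
    ]
    if visualize:
        # Only shown with the rec task in this card.
        cmd.append("--visualize")
    return cmd


if __name__ == "__main__":
    cmd = build_infer_command(
        "/PATH/TO/OBJEMBED",
        "/PATH/TO/WEDETECT_UNI",
        ["image1.jpg", "image2.jpg", "image3.jpg"],
        "YOUR_QUERY",
        "retrieval_by_image",
    )
    print(shlex.join(cmd))           # inspect the command before running it
    # subprocess.run(cmd, check=True)  # uncomment once the paths are real
```

Passing the arguments as a list avoids shell quoting problems with queries that contain apostrophes or spaces, such as the REC example above.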

## Citation

If you find our work helpful for your research, please consider citing it:

```bibtex
@article{fu2026objembed,
  title={ObjEmbed: Towards Universal Multimodal Object Embeddings},
  author={Fu, Shenghao and Su, Yukun and Rao, Fengyun and LYU, Jing and Xie, Xiaohua and Zheng, Wei-Shi},
  journal={arXiv preprint arXiv:2602.01753},
  year={2026}
}
```