---
license: apache-2.0
pipeline_tag: object-detection
library_name: transformers
---

# ObjEmbed: Towards Universal Multimodal Object Embeddings

[ObjEmbed](https://huggingface.co/papers/2602.01753) is a multimodal large language model (MLLM) embedding model that addresses a fundamental challenge in vision-language understanding: aligning objects with their corresponding textual descriptions. Unlike models that excel at global image-text alignment, ObjEmbed focuses on fine-grained alignment. It decomposes an input image into multiple regional embeddings, each corresponding to an individual object, alongside global embeddings. This supports a wide range of visual understanding tasks, such as visual grounding, local image retrieval, and global image retrieval.

This is the official PyTorch implementation of ObjEmbed.

- **Paper:** [ObjEmbed: Towards Universal Multimodal Object Embeddings](https://huggingface.co/papers/2602.01753)
- **Code:** [WeChatCV/ObjEmbed](https://github.com/WeChatCV/ObjEmbed)

## Key Features

- **Object-Oriented Representation**: Captures both semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality. The final object matching score combines semantic similarity with the predicted IoU, enabling more accurate retrieval.
- **Versatility**: Seamlessly handles both region-level and image-level tasks.
- **Efficient Encoding**: All objects in an image, along with the full image, are encoded in a single forward pass for high efficiency.
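
The object matching score described in the first feature can be illustrated with a minimal sketch. It assumes the semantic similarity is the cosine similarity between a query embedding and each object embedding, and that the final score is the product of that similarity and the predicted IoU. The text above only states that the two are combined, so the product rule, the function names, and the toy vectors are all assumptions for illustration, not the model's actual implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def object_matching_scores(query_emb, object_embs, predicted_ious):
    """Hypothetical combination rule: semantic similarity weighted by predicted IoU."""
    return [
        cosine_similarity(query_emb, obj_emb) * iou
        for obj_emb, iou in zip(object_embs, predicted_ious)
    ]


def top1(scores):
    """Index of the highest-scoring object."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data (made up): three candidate regions for one text query.
query = [1.0, 0.0]
objects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ious = [0.5, 0.9, 0.9]  # predicted localization quality per region

scores = object_matching_scores(query, objects, ious)
best = top1(scores)
print(scores, best)
```

In this toy case the first region matches the query perfectly in semantic terms but has a low predicted IoU, so the third region, which is well localized and still semantically close, wins under the assumed product rule.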

## Sample Usage

For detailed installation and environment setup, please refer to the [GitHub repository](https://github.com/WeChatCV/ObjEmbed).

### Referring Expression Comprehension (REC)

To output the top-1 prediction for a query:

```bash
# Output the top-1 prediction
python infer_objembed.py \
  --objembed_checkpoint /PATH/TO/OBJEMBED \
  --wedetect_uni_checkpoint /PATH/TO/WEDETECT_UNI \
  --image assets/demo.jpg \
  --query "The car's license plate in HAWAII" \
  --task rec \
  --visualize
```
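
To run the same REC command for several queries from Python, one option is a thin wrapper around the CLI. The sketch below only reuses the flags shown in the command above. The helper names `build_rec_command` and `run_rec` and the second example query are hypothetical, and the checkpoint paths are the same placeholders as in the command.

```python
import shlex
import subprocess


def build_rec_command(image, query, objembed_ckpt, wedetect_ckpt, visualize=True):
    """Build the argument list for one REC call (flags taken from the command above)."""
    cmd = [
        "python", "infer_objembed.py",
        "--objembed_checkpoint", objembed_ckpt,
        "--wedetect_uni_checkpoint", wedetect_ckpt,
        "--image", image,
        "--query", query,
        "--task", "rec",
    ]
    if visualize:
        cmd.append("--visualize")
    return cmd


def run_rec(queries, image, objembed_ckpt, wedetect_ckpt):
    """Run the CLI once per query; raises CalledProcessError if a call fails."""
    for query in queries:
        cmd = build_rec_command(image, query, objembed_ckpt, wedetect_ckpt)
        subprocess.run(cmd, check=True)


# Print the commands instead of running them, since the paths are placeholders.
example_queries = ["The car's license plate in HAWAII", "the front wheel"]
for q in example_queries:
    print(shlex.join(build_rec_command(
        "assets/demo.jpg", q, "/PATH/TO/OBJEMBED", "/PATH/TO/WEDETECT_UNI")))
```

Passing the arguments as a list avoids shell quoting problems with queries that contain apostrophes or spaces.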

### Image Retrieval

To perform image retrieval based on a query:

```bash
python infer_objembed.py \
  --objembed_checkpoint /PATH/TO/OBJEMBED \
  --wedetect_uni_checkpoint /PATH/TO/WEDETECT_UNI \
  --image image1.jpg image2.jpg image3.jpg \
  --query "YOUR_QUERY" \
  --task retrieval_by_image
```
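
One plausible way to think about retrieving images with an object-level query is to rank each image by its best-matching object. The sketch below assumes every image has already been reduced to a list of per-object matching scores for the query, and ranks images by their highest score. This max rule, the function name `rank_images`, and the toy scores are assumptions for illustration; the repository's `retrieval_by_image` task may aggregate scores differently.

```python
def rank_images(object_scores_per_image):
    """Rank image names by their highest per-object score (assumed aggregation)."""
    image_scores = {
        name: max(scores) if scores else float("-inf")  # no objects: rank last
        for name, scores in object_scores_per_image.items()
    }
    return sorted(image_scores, key=image_scores.get, reverse=True)


# Toy per-object scores for one query (made-up numbers).
toy_scores = {
    "image1.jpg": [0.12, 0.40],
    "image2.jpg": [0.85, 0.05, 0.30],
    "image3.jpg": [],
}

ranking = rank_images(toy_scores)
print(ranking)  # ['image2.jpg', 'image1.jpg', 'image3.jpg']
```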

## Citation

If you find our work helpful for your research, please consider citing it:

```bibtex
@article{fu2026objembed,
  title={ObjEmbed: Towards Universal Multimodal Object Embeddings},
  author={Fu, Shenghao and Su, Yukun and Rao, Fengyun and LYU, Jing and Xie, Xiaohua and Zheng, Wei-Shi},
  journal={arXiv preprint arXiv:2602.01753},
  year={2026}
}
```