nielsr HF Staff committed on
Commit 3064575 · verified · 1 Parent(s): 3b9d067

Update model card for ObjEmbed

This PR completely overhauls the model card for the **ObjEmbed** model, as presented in the paper [ObjEmbed: Towards Universal Multimodal Object Embeddings](https://huggingface.co/papers/2602.01753).

The previous model card incorrectly referenced "WeDetect" and contained outdated information.

Key updates include:
- **Metadata**:
  - Corrected `license` to `apache-2.0` based on the official GitHub repository.
  - Updated `pipeline_tag` to `object-detection` to better reflect the model's capabilities in visual grounding and object understanding.
  - Added `library_name: transformers` because the model depends on the `transformers` library, per its installation instructions.
- **Content**:
  - Replaced all "WeDetect" content with a summary of ObjEmbed and its key features (Object-Oriented Representation, Versatility, Efficient Encoding), drawn from the paper abstract and the GitHub README.
  - Added direct links to the paper and the official GitHub repository.
  - Included sample usage instructions for Referring Expression Comprehension (REC) and Image Retrieval, taken directly from the GitHub README.
  - Added the correct BibTeX citation.
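
The metadata changes listed above can be checked locally before merging. The following is a minimal sketch, assuming the front matter holds only flat `key: value` pairs, as it does in this PR. `parse_front_matter` is a hypothetical helper written for illustration and is not part of any Hugging Face library.

```python
def parse_front_matter(readme_text):
    """Return the flat key/value pairs found between the leading '---' delimiters.

    Assumption: the front matter has no nested YAML, only `key: value` lines.
    """
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    metadata = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return metadata
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    # No closing delimiter was found, so treat the front matter as missing.
    return {}


# The first lines of the new README, as they appear in this PR's diff.
NEW_README = """---
license: apache-2.0
pipeline_tag: object-detection
library_name: transformers
---

# ObjEmbed: Towards Universal Multimodal Object Embeddings
"""

# Values stated in the PR description.
expected = {
    "license": "apache-2.0",
    "pipeline_tag": "object-detection",
    "library_name": "transformers",
}

metadata = parse_front_matter(NEW_README)
mismatches = {k: metadata.get(k) for k, v in expected.items() if metadata.get(k) != v}
print(mismatches or "front matter matches the PR description")
```

For front matter with lists or nested keys, a real YAML parser would be the safer choice.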

Files changed (1)
  1. README.md +53 -10
README.md CHANGED
````diff
@@ -1,20 +1,63 @@
  ---
- license: gpl-3.0
- pipeline_tag: zero-shot-object-detection
+ license: apache-2.0
+ pipeline_tag: object-detection
+ library_name: transformers
  ---

+ # ObjEmbed: Towards Universal Multimodal Object Embeddings

- ## WeDetect: Fast Open-Vocabulary Object Detection as Retrieval
+ [ObjEmbed](https://huggingface.co/papers/2602.01753) is a novel MLLM embedding model that addresses the fundamental challenge of aligning objects with corresponding textual descriptions in vision-language understanding. Unlike models that excel at global image-text alignment, ObjEmbed focuses on fine-grained alignment by decomposing input images into multiple regional embeddings, each corresponding to an individual object, alongside global embeddings. This enables a wide range of visual understanding tasks such as visual grounding, local image retrieval, and global image retrieval.

- This is the official PyTorch implementation of [WeDetect](https://arxiv.org/abs/2512.12309). Please see our [GitHub](https://github.com/WeChatCV/WeDetect).
+ This is the official PyTorch implementation of ObjEmbed.

- If you find our work helpful for your research, please consider citing our paper.
+ - **Paper:** [ObjEmbed: Towards Universal Multimodal Object Embeddings](https://huggingface.co/papers/2602.01753)
+ - **Code:** [WeChatCV/ObjEmbed](https://github.com/WeChatCV/ObjEmbed)

+ ## Key Features
+
+ - **Object-Oriented Representation**: Captures both semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality. The final object matching score combines semantic similarity with the predicted IoU, enabling more accurate retrieval.
+ - **Versatility**: Seamlessly handles both region-level and image-level tasks.
+ - **Efficient Encoding**: All objects in an image, along with the full image, are encoded in a single forward pass for high efficiency.
+
+ ## Sample Usage
+
+ For detailed installation and environment setup, please refer to the [GitHub repository](https://github.com/WeChatCV/ObjEmbed).
+
+ ### Referring Expression Comprehension (REC)
+ To output the top-1 prediction for a query:
+
+ ```bash
+ # output the top1 prediction
+ python infer_objembed.py \
+ --objembed_checkpoint /PATH/TO/OBJEMBED \
+ --wedetect_uni_checkpoint /PATH/TO/WEDETECT_UNI \
+ --image assets/demo.jpg \
+ --query "The car's license plate in HAWAII" \
+ --task rec \
+ --visualize
+ ```
+
+ ### Image Retrieval
+ To perform image retrieval based on a query:
+
+ ```bash
+ python infer_objembed.py \
+ --objembed_checkpoint /PATH/TO/OBJEMBED \
+ --wedetect_uni_checkpoint /PATH/TO/WEDETECT_UNI \
+ --image image1.jpg image2.jpg image3.jpg \
+ --query "YOUR_QUERY" \
+ --task retrieval_by_image
  ```
- @article{fu2025wedetect,
- title={WeDetect: Fast Open-Vocabulary Object Detection as Retrieval},
+
+ ## Citation
+
+ If you find our work helpful for your research, please consider citing our work:
+
+ ```bibtex
+ @article{fu2026objembed,
+ title={ObjEmbed: Towards Universal Multimodal Object Embeddings},
  author={Fu, Shenghao and Su, Yukun and Rao, Fengyun and LYU, Jing and Xie, Xiaohua and Zheng, Wei-Shi},
- journal={arXiv preprint arXiv:2512.12309},
- year={2025}
+ journal={arXiv preprint arXiv:2602.01753},
+ year={2026}
  }
- ```
+ ```
````
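
The Key Features section of the new README says the final object matching score combines semantic similarity with the predicted IoU, but it does not state how. The sketch below illustrates one plausible reading, assuming the score is the product of cosine similarity and predicted IoU. That product rule, the function names, and the toy vectors are all assumptions for illustration and are not taken from the ObjEmbed code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_regions(query_embedding, regions):
    """Rank regions by (semantic similarity * predicted IoU), best first.

    `regions` is a list of (object_embedding, predicted_iou) pairs.
    ASSUMPTION: multiplying the two terms is an illustrative choice; the
    README only says the score "combines" them.
    """
    scores = [
        cosine_similarity(query_embedding, embedding) * predicted_iou
        for embedding, predicted_iou in regions
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy example: region 0 matches the query exactly but has a poorly localized box,
# region 1 matches slightly less well but has a much better predicted IoU.
query = [1.0, 0.0]
regions = [
    ([1.0, 0.0], 0.5),
    ([0.9, 0.1], 0.9),
    ([0.0, 1.0], 1.0),
]
order, scores = rank_regions(query, regions)
print(order)
```

Under this assumption, a region with a tight box and a near-match can outrank an exact semantic match whose box is poorly localized, which is the kind of trade-off the README's description suggests.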