Update model card for ObjEmbed

This PR overhauls the model card for **ObjEmbed**, the model presented in the paper [ObjEmbed: Towards Universal Multimodal Object Embeddings](https://huggingface.co/papers/2602.01753).

The previous model card incorrectly referenced "WeDetect" and contained outdated information.

Key updates include:
- **Metadata**:
  - Corrected `license` to `apache-2.0`, matching the official GitHub repository.
  - Updated `pipeline_tag` to `object-detection` to better reflect the model's capabilities in visual grounding and object understanding.
  - Added `library_name: transformers`, since the installation instructions show that the model depends on the `transformers` library.
- **Content**:
  - Replaced all "WeDetect" content with a summary of ObjEmbed and its key features (Object-Oriented Representation, Versatility, Efficient Encoding), drawn from the paper abstract and the GitHub README.
  - Added direct links to the paper and the official GitHub repository.
  - Included sample usage instructions for Referring Expression Comprehension (REC) and Image Retrieval, taken directly from the GitHub README.
  - Added the correct BibTeX citation.
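
For reviewers unfamiliar with the model: the new "Key Features" section says the final object matching score combines semantic similarity with the predicted IoU. The card does not state the exact combination rule, so the sketch below assumes a simple product of cosine similarity and predicted IoU. The function names (`cosine_similarity`, `rank_objects`) and the toy vectors are hypothetical and are not taken from the ObjEmbed code base.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_objects(query_emb, object_embs, predicted_ious):
    """Rank candidate objects for a text query, best first.

    ASSUMPTION: score = cosine similarity * predicted IoU.
    The actual rule used by ObjEmbed may differ.
    """
    scores = [
        cosine_similarity(query_emb, emb) * iou
        for emb, iou in zip(object_embs, predicted_ious)
    ]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# Toy data: three candidate regions with 2-D embeddings (illustrative only).
query = [1.0, 0.0]
objects = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
ious = [0.5, 0.9, 0.9]  # predicted localization quality per region

ranking = rank_objects(query, objects, ious)
for index, score in ranking:
    print(f"object {index}: score {score:.2f}")
```

In this toy case the region with the best semantic match (object 0) is ranked below object 1, because its predicted IoU is low. That is the behavior the card describes: a poorly localized box is penalized even when its semantics match the query.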

Files changed (1)

README.md +53 -10


@@ -1,20 +1,63 @@
 ---
-license: gpl-3.0
-pipeline_tag: zero-shot-object-detection
 ---
-## WeDetect: Fast Open-Vocabulary Object Detection as Retrieval
-This is the official PyTorch implementation of [WeDetect](https://arxiv.org/abs/2512.12309). Please see our [GitHub](https://github.com/WeChatCV/WeDetect).
-If you find our work helpful for your research, please consider citing our paper.
 ```
-@article{fu2025wedetect,
-  title={WeDetect: Fast Open-Vocabulary Object Detection as Retrieval},
   author={Fu, Shenghao and Su, Yukun and Rao, Fengyun and LYU, Jing and Xie, Xiaohua and Zheng, Wei-Shi},
-  journal={arXiv preprint arXiv:2512.12309},
-  year={2025}
 }
-```

 ---
+license: apache-2.0
+pipeline_tag: object-detection
+library_name: transformers
 ---
+# ObjEmbed: Towards Universal Multimodal Object Embeddings
+[ObjEmbed](https://huggingface.co/papers/2602.01753) is a novel MLLM embedding model that addresses the fundamental challenge of aligning objects with corresponding textual descriptions in vision-language understanding. Unlike models that excel at global image-text alignment, ObjEmbed focuses on fine-grained alignment by decomposing input images into multiple regional embeddings, each corresponding to an individual object, alongside global embeddings. This enables a wide range of visual understanding tasks such as visual grounding, local image retrieval, and global image retrieval.
+This is the official PyTorch implementation of ObjEmbed.
+- **Paper:** [ObjEmbed: Towards Universal Multimodal Object Embeddings](https://huggingface.co/papers/2602.01753)
+- **Code:** [WeChatCV/ObjEmbed](https://github.com/WeChatCV/ObjEmbed)
+## Key Features
+-   **Object-Oriented Representation**: Captures both semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality. The final object matching score combines semantic similarity with the predicted IoU, enabling more accurate retrieval.
+-   **Versatility**: Seamlessly handles both region-level and image-level tasks.
+-   **Efficient Encoding**: All objects in an image, along with the full image, are encoded in a single forward pass for high efficiency.
+## Sample Usage
+For detailed installation and environment setup, please refer to the [GitHub repository](https://github.com/WeChatCV/ObjEmbed).
+### Referring Expression Comprehension (REC)
+To output the top-1 prediction for a query:
+```bash
+# output the top1 prediction
+python infer_objembed.py \
+    --objembed_checkpoint /PATH/TO/OBJEMBED \
+    --wedetect_uni_checkpoint /PATH/TO/WEDETECT_UNI \
+    --image assets/demo.jpg \
+    --query "The car's license plate in HAWAII" \
+    --task rec \
+    --visualize
+```
+### Image Retrieval
+To perform image retrieval based on a query:
+```bash
+python infer_objembed.py \
+    --objembed_checkpoint /PATH/TO/OBJEMBED \
+    --wedetect_uni_checkpoint /PATH/TO/WEDETECT_UNI \
+    --image image1.jpg image2.jpg image3.jpg \
+    --query "YOUR_QUERY" \
+    --task retrieval_by_image
 ```
+## Citation
+If you find our work helpful for your research, please consider citing our work:
+```bibtex
+@article{fu2026objembed,
+  title={ObjEmbed: Towards Universal Multimodal Object Embeddings},
   author={Fu, Shenghao and Su, Yukun and Rao, Fengyun and LYU, Jing and Xie, Xiaohua and Zheng, Wei-Shi},
+  journal={arXiv preprint arXiv:2602.01753},
+  year={2026}
 }
+```