nielsr (HF Staff) committed
Commit a65f226 · verified · 1 Parent(s): b1d047a

Add model card metadata and description


Hi, I'm Niels from the Hugging Face community science team. I'm opening this PR to improve your model card with relevant metadata and a descriptive summary of your work.

This update:
- Adds the `image-feature-extraction` pipeline tag for better discoverability.
- Adds `library_name: transformers` based on the configuration files and requirements.
- Links the model card to the official paper and GitHub repository.
- Includes a brief overview of ObjEmbed's key properties.

Feel free to merge this if it looks good!

Files changed (1)
  1. README.md +26 -4
README.md CHANGED
@@ -1,13 +1,35 @@
 ---
 license: apache-2.0
+pipeline_tag: image-feature-extraction
+library_name: transformers
+tags:
+- multimodal
+- vision-language
+- embeddings
+- image-retrieval
+- visual-grounding
 ---
-## ObjEmbed: Towards Universal Multimodal Object Embeddings

-This is the official PyTorch implementation of [ObjEmbed](https://arxiv.org/abs/2602.01753). Please see our [GitHub](https://github.com/WeChatCV/ObjEmbed).
+# ObjEmbed: Towards Universal Multimodal Object Embeddings

-If you find our work helpful for your research, please consider citing our paper.
+[ObjEmbed](https://arxiv.org/abs/2602.01753) is a multimodal embedding model designed to align specific image regions (objects) with textual descriptions. Unlike global embedding models, ObjEmbed decomposes an image into multiple regional embeddings along with global embeddings, supporting tasks such as visual grounding, local image retrieval, and global image retrieval.

-```
+## Key Features
+
+- **Object-Oriented Representation**: It captures both semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality.
+- **Versatility**: It seamlessly handles both region-level and image-level tasks.
+- **Efficient Encoding**: All objects in an image, along with the full image, are encoded in a single forward pass for high efficiency.
+
+## Resources
+
+- **Paper**: [ObjEmbed: Towards Universal Multimodal Object Embeddings](https://arxiv.org/abs/2602.01753)
+- **Code**: [Official GitHub Repository](https://github.com/WeChatCV/ObjEmbed)
+
+## Citation
+
+If you find ObjEmbed helpful for your research, please consider citing:
+
+```bibtex
 @article{fu2026objembed,
   title={ObjEmbed: Towards Universal Multimodal Object Embeddings},
   author={Fu, Shenghao and Su, Yukun and Rao, Fengyun and LYU, Jing and Xie, Xiaohua and Zheng, Wei-Shi},
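
The Key Features section added above describes each region carrying an object embedding (for semantic matching against a text query) and an IoU embedding that predicts localization quality. The sketch below illustrates how such a pair of signals *could* be combined to rank grounding candidates; the array shapes, the multiplicative score fusion, and the function names are illustrative assumptions, not ObjEmbed's actual API or scoring rule.

```python
import numpy as np

def cosine_sim(query, candidates):
    # Cosine similarity between one query vector and a batch of vectors.
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=-1, keepdims=True)
    return candidates @ query

def rank_regions(text_emb, region_embs, iou_scores):
    # Hypothetical fusion: semantic match weighted by predicted
    # localization quality, so a perfect semantic match with a poorly
    # localized box can still rank below a well-localized near-match.
    scores = cosine_sim(text_emb, region_embs) * iou_scores
    return np.argsort(-scores)  # best candidate first

# Toy data: 3 candidate regions with 4-d embeddings (illustrative only).
text = np.array([1.0, 0.0, 0.0, 0.0])
regions = np.array([
    [0.9, 0.1, 0.0, 0.0],  # good semantic match, well localized
    [1.0, 0.0, 0.0, 0.0],  # perfect match, but low predicted IoU
    [0.0, 1.0, 0.0, 0.0],  # poor semantic match
])
ious = np.array([0.9, 0.2, 0.95])
order = rank_regions(text, regions, ious)
print(order)
```

Here the well-localized near-match outranks the perfectly matching but poorly localized box, which is the behavior the IoU embedding is described as enabling.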