Image Feature Extraction
Transformers
Safetensors
qwen3_vl
multimodal
vision-language
embeddings
image-retrieval
visual-grounding
Instructions to use fushh7/ObjEmbed-2B with libraries, inference providers, notebooks, and local apps.

How to use fushh7/ObjEmbed-2B with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-feature-extraction", model="fushh7/ObjEmbed-2B")

# Load model directly
from transformers import AutoProcessor, WeDetectEmbedding

processor = AutoProcessor.from_pretrained("fushh7/ObjEmbed-2B")
model = WeDetectEmbedding.from_pretrained("fushh7/ObjEmbed-2B")
```
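The pipeline accepts a local path, URL, or PIL image and returns the extracted features as nested lists of floats. A minimal sketch, assuming a local file `example.jpg` (the filename is a placeholder; depending on how the checkpoint is packaged, `trust_remote_code=True` may also be required):

```python
# Minimal usage sketch (placeholder image path, not from the model card).
from transformers import pipeline

pipe = pipeline("image-feature-extraction", model="fushh7/ObjEmbed-2B")

# Returns nested lists of floats; the exact shape depends on the model.
features = pipe("example.jpg")
print(len(features[0]))  # number of feature vectors extracted for the image
```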
Add model card metadata and description (#1)
Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>
README.md
CHANGED
---
license: apache-2.0
pipeline_tag: image-feature-extraction
library_name: transformers
tags:
- multimodal
- vision-language
- embeddings
- image-retrieval
- visual-grounding
---

# ObjEmbed: Towards Universal Multimodal Object Embeddings

[ObjEmbed](https://arxiv.org/abs/2602.01753) is a multimodal embedding model designed to align specific image regions (objects) with textual descriptions. Unlike global embedding models, ObjEmbed decomposes an image into multiple regional embeddings along with global embeddings, supporting tasks such as visual grounding, local image retrieval, and global image retrieval.
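Conceptually, retrieval and grounding reduce to comparing embeddings. The sketch below illustrates the matching step with cosine similarity over random stand-in vectors; `region_embs` and `text_emb` are hypothetical placeholders for model outputs, not part of a published API:

```python
# Illustrative sketch only: matching regional embeddings against a text query
# with cosine similarity. The arrays are random stand-ins for model outputs.
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

region_embs = np.random.randn(5, 1024)  # stand-in: 5 regional embeddings
text_emb = np.random.randn(1, 1024)     # stand-in: one text query embedding

scores = cosine_sim(region_embs, text_emb).squeeze(-1)
best_region = int(np.argmax(scores))    # region best matching the query
print(best_region, scores[best_region])
```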
## Key Features

- **Object-Oriented Representation**: It captures both semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality (a toy scoring sketch follows this list).
- **Versatility**: It seamlessly handles both region-level and image-level tasks.
- **Efficient Encoding**: All objects in an image, along with the full image, are encoded in a single forward pass for high efficiency.
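As noted above, a toy sketch of how the two per-region signals could be combined at scoring time. The multiplicative weighting is an assumption for illustration only; the paper and repository define the actual scoring:

```python
# Hypothetical sketch: combine semantic similarity (from the object embedding)
# with predicted localization quality (from the IoU embedding). The product
# rule below is an illustrative assumption, not the paper's exact formula.
import numpy as np

semantic_scores = np.array([0.82, 0.41, 0.77])  # text vs. object-embedding sims
iou_quality = np.array([0.90, 0.95, 0.30])      # predicted localization quality

combined = semantic_scores * iou_quality  # down-weight poorly localized boxes
print(int(np.argmax(combined)))           # 0: both semantically strong and well localized
```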
## Resources

- **Paper**: [ObjEmbed: Towards Universal Multimodal Object Embeddings](https://arxiv.org/abs/2602.01753)
- **Code**: [Official GitHub Repository](https://github.com/WeChatCV/ObjEmbed)
## Citation

If you find ObjEmbed helpful for your research, please consider citing:

```bibtex
@article{fu2026objembed,
  title={ObjEmbed: Towards Universal Multimodal Object Embeddings},
  author={Fu, Shenghao and Su, Yukun and Rao, Fengyun and LYU, Jing and Xie, Xiaohua and Zheng, Wei-Shi},
```