fushh7
/

WeDetect

Zero-Shot Object Detection

Model card Files Files and versions

WeDetect / README.md

nielsr's picture

nielsr HF Staff

Update model card for ObjEmbed

3064575 verified 11 days ago

|

2.79 kB

	---
	license: apache-2.0
	pipeline_tag: object-detection
	library_name: transformers
	---

	# ObjEmbed: Towards Universal Multimodal Object Embeddings

	[ObjEmbed](https://huggingface.co/papers/2602.01753) is a novel MLLM embedding model that addresses the fundamental challenge of aligning objects with corresponding textual descriptions in vision-language understanding. Unlike models that excel at global image-text alignment, ObjEmbed focuses on fine-grained alignment by decomposing input images into multiple regional embeddings, each corresponding to an individual object, alongside global embeddings. This enables a wide range of visual understanding tasks such as visual grounding, local image retrieval, and global image retrieval.

	This is the official PyTorch implementation of ObjEmbed.

	- Paper: [ObjEmbed: Towards Universal Multimodal Object Embeddings](https://huggingface.co/papers/2602.01753)
	- Code: [WeChatCV/ObjEmbed](https://github.com/WeChatCV/ObjEmbed)

	## Key Features

	- Object-Oriented Representation: Captures both semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality. The final object matching score combines semantic similarity with the predicted IoU, enabling more accurate retrieval.
	- Versatility: Seamlessly handles both region-level and image-level tasks.
	- Efficient Encoding: All objects in an image, along with the full image, are encoded in a single forward pass for high efficiency.

	## Sample Usage

	For detailed installation and environment setup, please refer to the [GitHub repository](https://github.com/WeChatCV/ObjEmbed).

	### Referring Expression Comprehension (REC)
	To output the top-1 prediction for a query:

	```bash
	# output the top1 prediction
	python infer_objembed.py \
	--objembed_checkpoint /PATH/TO/OBJEMBED \
	--wedetect_uni_checkpoint /PATH/TO/WEDETECT_UNI \
	--image assets/demo.jpg \
	--query "The car's license plate in HAWAII" \
	--task rec \
	--visualize
	```

	### Image Retrieval
	To perform image retrieval based on a query:

	```bash
	python infer_objembed.py \
	--objembed_checkpoint /PATH/TO/OBJEMBED \
	--wedetect_uni_checkpoint /PATH/TO/WEDETECT_UNI \
	--image image1.jpg image2.jpg image3.jpg \
	--query "YOUR_QUERY" \
	--task retrieval_by_image
	```

	## Citation

	If you find our work helpful for your research, please consider citing our work:

	```bibtex
	@article{fu2026objembed,
	title={ObjEmbed: Towards Universal Multimodal Object Embeddings},
	author={Fu, Shenghao and Su, Yukun and Rao, Fengyun and LYU, Jing and Xie, Xiaohua and Zheng, Wei-Shi},
	journal={arXiv preprint arXiv:2602.01753},
	year={2026}
	}
	```