---
library_name: transformers
tags:
- vision
license: apache-2.0
pipeline_tag: zero-shot-object-detection
---

# LLMDet (large variant)

The LLMDet model was proposed in [LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models](https://arxiv.org/abs/2501.18954) by Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, and Wei-Shi Zheng.

LLMDet improves upon [MM Grounding DINO](https://huggingface.co/docs/transformers/model_doc/mm-grounding-dino) and [Grounding DINO](https://huggingface.co/docs/transformers/model_doc/grounding-dino) by co-training the detector with a large language model.

You can find all the LLMDet checkpoints under the [LLMDet](https://huggingface.co/collections/rziga/llmdet-68398b294d9866c16046dcdd) collection. Note that these checkpoints are inference-only -- they do not include the LLM that was used during training. Inference is identical to that of [MM Grounding DINO](https://huggingface.co/docs/transformers/model_doc/mm-grounding-dino).

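If you want to enumerate the checkpoints in that collection programmatically, here is a minimal sketch using the `huggingface_hub` collection API (`huggingface_hub` ships as a dependency of `transformers`; the collection slug below is the one linked above):

```py
from huggingface_hub import get_collection

# Fetch the LLMDet collection and list the model repos it contains
collection = get_collection("rziga/llmdet-68398b294d9866c16046dcdd")
for item in collection.items:
    print(item.item_type, item.item_id)
```
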
## Intended uses

You can use the raw model for zero-shot object detection.

Here's how to use the model for zero-shot object detection:

```py
import torch
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor
from transformers.image_utils import load_image


# Prepare processor and model
model_id = "iSEE-Laboratory/llmdet_large"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

# Prepare inputs
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_image(image_url)
text_labels = [["a cat", "a remote control"]]
inputs = processor(images=image, text=text_labels, return_tensors="pt").to(device)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

# Postprocess outputs
results = processor.post_process_grounded_object_detection(
    outputs,
    threshold=0.4,
    target_sizes=[(image.height, image.width)]
)

# Retrieve the first image result
result = results[0]
for box, score, label in zip(result["boxes"], result["scores"], result["labels"]):
    box = [round(x, 2) for x in box.tolist()]
    print(f"Detected {label} with confidence {round(score.item(), 3)} at location {box}")
```
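
If you want to inspect the detections visually, here is a minimal sketch that continues from the snippet above and draws the predicted boxes with `PIL.ImageDraw` (Pillow is already required by the `load_image` helper used earlier):

```py
from PIL import ImageDraw

# Draw each detected box and its label on a copy of the input image
annotated = image.copy()
draw = ImageDraw.Draw(annotated)
for box, score, label in zip(result["boxes"], result["scores"], result["labels"]):
    x_min, y_min, x_max, y_max = box.tolist()
    draw.rectangle((x_min, y_min, x_max, y_max), outline="red", width=3)
    draw.text((x_min, max(y_min - 10, 0)), f"{label}: {score.item():.2f}", fill="red")

annotated.save("llmdet_detections.jpg")
```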

## Training Data

This model was trained on:
- [Objects365v1](https://www.objects365.org/overview.html)
- [Open Images v6](https://research.google/blog/open-images-v6-now-featuring-localized-narratives/)
- [GOLD-G](https://arxiv.org/abs/2104.12763)
- [GroundingCap-1M](https://arxiv.org/abs/2501.18954)


## Evaluation results

Here's a table of LLMDet models and their performance on LVIS (results from the [official repo](https://github.com/iSEE-Laboratory/LLMDet)):

| Model | Pre-Train Data | MiniVal AP | MiniVal APr | MiniVal APc | MiniVal APf | Val1.0 AP | Val1.0 APr | Val1.0 APc | Val1.0 APf |
| --------------------------------------------------------- | -------------------------------------------- | ---------- | ----------- | ----------- | ----------- | --------- | ---------- | ---------- | ---------- |
| [llmdet_tiny](https://huggingface.co/rziga/llmdet_tiny)   | (O365,GoldG,GRIT,V3Det) + GroundingCap-1M    | 44.7       | 37.3        | 39.5        | 50.7        | 34.9      | 26.0       | 30.1       | 44.3       |
| [llmdet_base](https://huggingface.co/rziga/llmdet_base)   | (O365,GoldG,V3Det) + GroundingCap-1M         | 48.3       | 40.8        | 43.1        | 54.3        | 38.5      | 28.2       | 34.3       | 47.8       |
| [llmdet_large](https://huggingface.co/rziga/llmdet_large) | (O365V2,OpenImageV6,GoldG) + GroundingCap-1M | 51.1       | 45.1        | 46.1        | 56.6        | 42.0      | 31.6       | 38.8       | 50.2       |


## BibTeX entry and citation info

```bib
@article{fu2025llmdet,
  title={LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models},
  author={Fu, Shenghao and Yang, Qize and Mo, Qijie and Yan, Junkai and Wei, Xihan and Meng, Jingke and Xie, Xiaohua and Zheng, Wei-Shi},
  journal={arXiv preprint arXiv:2501.18954},
  year={2025}
}
```