---
library_name: transformers
tags:
- vision
license: apache-2.0
pipeline_tag: zero-shot-object-detection
---

# LLMDet (large variant)

The LLMDet model was proposed in [LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models](https://arxiv.org/abs/2501.18954) by Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, and Wei-Shi Zheng.

LLMDet improves upon [MM Grounding DINO](https://huggingface.co/docs/transformers/model_doc/mm-grounding-dino) and [Grounding DINO](https://huggingface.co/docs/transformers/model_doc/grounding-dino) by co-training the detector with a large language model.

You can find all the LLMDet checkpoints under the [LLMDet](https://huggingface.co/collections/rziga/llmdet-68398b294d9866c16046dcdd) collection. Note that these checkpoints are inference-only -- they do not include the LLM that was used during training. Inference is identical to that of [MM Grounding DINO](https://huggingface.co/docs/transformers/model_doc/mm-grounding-dino).

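If you want to enumerate the checkpoints in that collection programmatically, here is a minimal sketch using the `huggingface_hub` collection API (`huggingface_hub` ships as a dependency of `transformers`; the collection slug below is the one linked above):

```py
from huggingface_hub import get_collection

# Fetch the LLMDet collection and list the model repos it contains
collection = get_collection("rziga/llmdet-68398b294d9866c16046dcdd")
for item in collection.items:
    print(item.item_type, item.item_id)
```
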
## Intended uses

You can use the raw model for zero-shot object detection.

Here's how to use the model for zero-shot object detection:

```py
import torch
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor
from transformers.image_utils import load_image


# Prepare processor and model
model_id = "iSEE-Laboratory/llmdet_large"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

# Prepare inputs
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_image(image_url)
text_labels = [["a cat", "a remote control"]]
inputs = processor(images=image, text=text_labels, return_tensors="pt").to(device)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

# Postprocess outputs
results = processor.post_process_grounded_object_detection(
    outputs,
    threshold=0.4,
    target_sizes=[(image.height, image.width)]
)

# Retrieve the first image result
result = results[0]
for box, score, label in zip(result["boxes"], result["scores"], result["labels"]):
    box = [round(x, 2) for x in box.tolist()]
    print(f"Detected {label} with confidence {round(score.item(), 3)} at location {box}")
```
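
If you want to inspect the detections visually, here is a minimal sketch that continues from the snippet above and draws the predicted boxes with `PIL.ImageDraw` (Pillow is already required by the `load_image` helper used earlier):

```py
from PIL import ImageDraw

# Draw each detected box and its label on a copy of the input image
annotated = image.copy()
draw = ImageDraw.Draw(annotated)
for box, score, label in zip(result["boxes"], result["scores"], result["labels"]):
    x_min, y_min, x_max, y_max = box.tolist()
    draw.rectangle((x_min, y_min, x_max, y_max), outline="red", width=3)
    draw.text((x_min, max(y_min - 10, 0)), f"{label}: {score.item():.2f}", fill="red")

annotated.save("llmdet_detections.jpg")
```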

## Training Data

This model was trained on:
- [Objects365v1](https://www.objects365.org/overview.html)
- [Open Images v6](https://research.google/blog/open-images-v6-now-featuring-localized-narratives/)
- [GOLD-G](https://arxiv.org/abs/2104.12763)
- [GroundingCap-1M](https://arxiv.org/abs/2501.18954)


## Evaluation results

Here's a table of LLMDet models and their performance on LVIS (results from the [official repo](https://github.com/iSEE-Laboratory/LLMDet)):

| Model | Pre-Train Data | MiniVal AP | MiniVal APr | MiniVal APc | MiniVal APf | Val1.0 AP | Val1.0 APr | Val1.0 APc | Val1.0 APf |
| --------------------------------------------------------- | -------------------------------------------- | ---------- | ----------- | ----------- | ----------- | --------- | ---------- | ---------- | ---------- |
| [llmdet_tiny](https://huggingface.co/rziga/llmdet_tiny)   | (O365,GoldG,GRIT,V3Det) + GroundingCap-1M    | 44.7       | 37.3        | 39.5        | 50.7        | 34.9      | 26.0       | 30.1       | 44.3       |
| [llmdet_base](https://huggingface.co/rziga/llmdet_base)   | (O365,GoldG,V3Det) + GroundingCap-1M         | 48.3       | 40.8        | 43.1        | 54.3        | 38.5      | 28.2       | 34.3       | 47.8       |
| [llmdet_large](https://huggingface.co/rziga/llmdet_large) | (O365V2,OpenImageV6,GoldG) + GroundingCap-1M | 51.1       | 45.1        | 46.1        | 56.6        | 42.0      | 31.6       | 38.8       | 50.2       |


## BibTeX entry and citation info

```bib
@article{fu2025llmdet,
  title={LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models},
  author={Fu, Shenghao and Yang, Qize and Mo, Qijie and Yan, Junkai and Wei, Xihan and Meng, Jingke and Xie, Xiaohua and Zheng, Wei-Shi},
  journal={arXiv preprint arXiv:2501.18954},
  year={2025}
}
```