
	---
	library_name: pytorch
	---

	![owlvit_logo](resource/OWLViT.png)

	OWL-ViT extends CLIP-based vision–language models to perform open-vocabulary object detection by aligning image regions with textual descriptions, enabling zero-shot detection without task-specific training.

	Original paper: [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230)

	# OWLv1 CLIP ViT-B/32

	This model is OWL-ViT v1 with a CLIP ViT-B/32 Transformer as the image encoder and a masked self-attention Transformer as the text encoder. It uses CLIP's vision–language alignment to detect objects specified by arbitrary text queries, which makes it well suited to open-vocabulary detection, image search, and real-time visual understanding across diverse domains.
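To make the idea of matching image regions to text queries concrete, here is a minimal sketch that scores toy region embeddings against toy query embeddings with cosine similarity. The embeddings, the `match_regions_to_queries` helper, and the 0.5 threshold are all made up for illustration; the model's real scoring and box regression are not reproduced here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def match_regions_to_queries(region_embeds, query_embeds, queries, threshold=0.5):
    """Return (region_index, query, score) for regions whose best query clears the threshold."""
    normalized_queries = [l2_normalize(q) for q in query_embeds]
    detections = []
    for idx, region in enumerate(region_embeds):
        r = l2_normalize(region)
        # Cosine similarity between this region and every text query.
        scores = [sum(a * b for a, b in zip(r, q)) for q in normalized_queries]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            detections.append((idx, queries[best], scores[best]))
    return detections


# Toy 3-dimensional embeddings (hypothetical values, not model outputs).
queries = ["a cat", "a dog"]
query_embeds = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
region_embeds = [
    [0.9, 0.1, 0.0],  # close to "a cat"
    [0.1, 0.8, 0.1],  # close to "a dog"
    [0.0, 0.0, 1.0],  # matches neither query
]

detections = match_regions_to_queries(region_embeds, query_embeds, queries)
for idx, label, score in detections:
    print(f"region {idx}: {label} ({score:.2f})")
```

Because the queries are plain text, changing what is detected only means changing the query list; no retraining step appears anywhere in the loop.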

	Model configuration:
	- Reference implementation: [OWLv1 CLIP ViT-B/32](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit)
	- Original weights: [owlvit-base-patch32](https://huggingface.co/google/owlvit-base-patch32/blob/main/pytorch_model.bin)
	- Resolution: 3x768x768 (channels x height x width)
	- Supported Cooper versions:
	  - Cooper SDK: [2.5.2]
	  - Cooper Foundry: [2.2]
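As a quick worked example of what the 768x768 resolution implies for a ViT-B/32 backbone, the sketch below computes the patch grid. The `patch_grid` helper is hypothetical, and the assumption that the compiled models keep one token per 32x32 patch is taken from the reference architecture rather than stated in this card.

```python
def patch_grid(image_size, patch_size):
    """Return (patches_per_side, total_patches) for a square image and square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side, side * side


# 768x768 input, 32x32 patches (the "/32" in ViT-B/32).
side, num_patches = patch_grid(768, 32)
print(f"{side} x {side} grid, {num_patches} patch tokens")
```

Under that assumption the image encoder produces a 24 x 24 grid, that is, 576 patch tokens per image.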

	| Model | Device | Model Link |
	| :-----: | :-----: | :-----: |
	| OWLv1 CLIP ViT-B/32 Image Encoder | N1-655 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/n1-655_owlvit_v1_base_patch32_image_encoder.bin) |
	| OWLv1 CLIP ViT-B/32 Text Encoder | N1-655 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/n1-655_owlvit_v1_base_patch32_text_encoder.bin) |
	| OWLv1 CLIP ViT-B/32 Predictor | N1-655 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/n1-655_owlvit_v1_base_patch32_predictor.bin) |
	| OWLv1 CLIP ViT-B/32 Image Encoder | CV72 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/cv72_owlvit_v1_base_patch32_image_encoder.bin) |
	| OWLv1 CLIP ViT-B/32 Text Encoder | CV72 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/cv72_owlvit_v1_base_patch32_text_encoder.bin) |
	| OWLv1 CLIP ViT-B/32 Predictor | CV72 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/cv72_owlvit_v1_base_patch32_predictor.bin) |
	| OWLv1 CLIP ViT-B/32 Image Encoder | CV75 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/cv75_owlvit_v1_base_patch32_image_encoder.bin) |
	| OWLv1 CLIP ViT-B/32 Text Encoder | CV75 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/cv75_owlvit_v1_base_patch32_text_encoder.bin) |
	| OWLv1 CLIP ViT-B/32 Predictor | CV75 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/cv75_owlvit_v1_base_patch32_predictor.bin) |
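
The file names in the table follow a regular `<device>_owlvit_v1_base_patch32_<component>.bin` pattern, so the links can be generated rather than copied. The sketch below is one way to do that; the helper names are hypothetical, and the `resolve` endpoint for direct downloads is an assumption based on the usual Hugging Face URL convention, not something this card states.

```python
REPO_ID = "Ambarella/OWLViT"
DEVICES = ("n1-655", "cv72", "cv75")
COMPONENTS = ("image_encoder", "text_encoder", "predictor")


def model_filename(device, component):
    """Build the file name used in the table for a device/component pair."""
    device = device.lower()
    if device not in DEVICES:
        raise ValueError(f"unknown device: {device}")
    if component not in COMPONENTS:
        raise ValueError(f"unknown component: {component}")
    return f"{device}_owlvit_v1_base_patch32_{component}.bin"


def file_url(device, component, endpoint="blob"):
    """Return the file URL; 'blob' matches the table, 'resolve' is assumed to give the raw file."""
    name = model_filename(device, component)
    return f"https://huggingface.co/{REPO_ID}/{endpoint}/main/{name}"


# List all three files needed to run the model on one device.
for component in COMPONENTS:
    print(file_url("CV72", component))
```

Each device needs all three files (image encoder, text encoder, predictor), so iterating over `COMPONENTS` for a single device gives the complete set.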