OWL-ViT extends CLIP-based vision–language models to perform open-vocabulary object detection by aligning image regions with textual descriptions, enabling zero-shot detection of object categories that are specified only at inference time, without retraining for new classes.
Original paper: Simple Open-Vocabulary Object Detection with Vision Transformers
OWLv1 CLIP ViT-B/32
This model uses OWL-ViT v1 with a CLIP ViT-B/32 Transformer as the image encoder and a masked self-attention Transformer as the text encoder, leveraging CLIP's vision–language alignment to detect objects specified by arbitrary text queries. It is well suited for applications such as open-vocabulary detection, image search, and real-time visual understanding across diverse domains.
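A minimal zero-shot detection sketch using the Hugging Face reference implementation (google/owlvit-base-patch32) is shown below; the image path, query strings, and score threshold are illustrative placeholders rather than part of this release.

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("example.jpg").convert("RGB")      # any RGB image (placeholder path)
texts = [["a photo of a cat", "a photo of a dog"]]    # arbitrary text queries

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale predicted boxes to the original image size and keep confident detections.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{texts[0][int(label)]}: {score:.2f} at {box.tolist()}")
```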
Model Configuration:
- Reference implementation: OWLv1 CLIP ViT-B/32
- Original weights: owlvit-base-patch32
- Input resolution: 3x768x768 (a preprocessing sketch follows this list)
- Supported Cooper versions:
  - Cooper SDK: [2.5.2]
  - Cooper Foundry: [2.2]
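Below is a minimal preprocessing sketch for the 3x768x768 input listed above. It assumes the CLIP normalization constants used by the reference OWL-ViT processor and bicubic resizing; verify both against the preprocessing expected by the deployed Cooper models.

```python
import numpy as np
from PIL import Image

# CLIP image normalization constants used by the reference OWL-ViT processor
# (assumed here; confirm against the deployed model's preprocessing config).
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def preprocess(path: str) -> np.ndarray:
    """Load an image and return a normalized (3, 768, 768) float32 array."""
    image = Image.open(path).convert("RGB").resize((768, 768), Image.BICUBIC)
    pixels = np.asarray(image, dtype=np.float32) / 255.0   # scale to [0, 1]
    pixels = (pixels - CLIP_MEAN) / CLIP_STD                # per-channel normalization
    return pixels.transpose(2, 0, 1)                        # HWC -> CHW
```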
| Model | Device | Model Link |
|---|---|---|
| OWLv1 CLIP ViT-B/32 Image Encoder | N1-655 | Model_Link |
| OWLv1 CLIP ViT-B/32 Text Encoder | N1-655 | Model_Link |
| OWLv1 CLIP ViT-B/32 Predictor | N1-655 | Model_Link |
| OWLv1 CLIP ViT-B/32 Image Encoder | CV72 | Model_Link |
| OWLv1 CLIP ViT-B/32 Text Encoder | CV72 | Model_Link |
| OWLv1 CLIP ViT-B/32 Predictor | CV72 | Model_Link |
| OWLv1 CLIP ViT-B/32 Image Encoder | CV75 | Model_Link |
| OWLv1 CLIP ViT-B/32 Text Encoder | CV75 | Model_Link |
| OWLv1 CLIP ViT-B/32 Predictor | CV75 | Model_Link |
