---
library_name: pytorch
---


OWL-ViT extends CLIP-based vision–language models to perform open-vocabulary object detection by aligning image regions with textual descriptions, enabling zero-shot detection without task-specific training.

Original paper: Simple Open-Vocabulary Object Detection with Vision Transformers

OWLv1 CLIP ViT-B/32

This model uses OWL-ViT v1 with a CLIP ViT-B/32 Vision Transformer as the image encoder and a masked self-attention Transformer as the text encoder, leveraging CLIP's vision–language alignment to detect objects specified by arbitrary text queries. It is well suited for applications such as open-vocabulary detection, image search, and real-time visual understanding across diverse domains.
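As a quick orientation, below is a minimal sketch of zero-shot detection with the reference OWL-ViT ViT-B/32 checkpoint from Hugging Face Transformers. The checkpoint name (`google/owlvit-base-patch32`), image path, and text queries are illustrative placeholders; the device-specific exports linked in this repo may use a different runtime and pre/post-processing pipeline.

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

# Reference checkpoint; the exported encoders/predictor in this repo are split per device.
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("example.jpg")                            # placeholder input image
text_queries = [["a photo of a cat", "a photo of a dog"]]    # arbitrary open-vocabulary queries

inputs = processor(text=text_queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and boxes to (score, label, box) triples in pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])              # (height, width)
results = processor.post_process_object_detection(
    outputs=outputs, target_sizes=target_sizes, threshold=0.1
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{text_queries[0][label]}: {score:.2f} at {box.tolist()}")
```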

Model Configuration:

| Model | Device | Model Link |
| --- | --- | --- |
| OWLv1 CLIP ViT-B/32 Image Encoder | N1-655 | Model_Link |
| OWLv1 CLIP ViT-B/32 Text Encoder | N1-655 | Model_Link |
| OWLv1 CLIP ViT-B/32 Predictor | N1-655 | Model_Link |
| OWLv1 CLIP ViT-B/32 Image Encoder | CV72 | Model_Link |
| OWLv1 CLIP ViT-B/32 Text Encoder | CV72 | Model_Link |
| OWLv1 CLIP ViT-B/32 Predictor | CV72 | Model_Link |
| OWLv1 CLIP ViT-B/32 Image Encoder | CV75 | Model_Link |
| OWLv1 CLIP ViT-B/32 Text Encoder | CV75 | Model_Link |
| OWLv1 CLIP ViT-B/32 Predictor | CV75 | Model_Link |