OWL-ViT extends CLIP-based vision–language models to perform open-vocabulary object detection by aligning image regions with textual descriptions, enabling zero-shot detection of object categories that are specified only at inference time, without retraining for new classes.
Original paper: Simple Open-Vocabulary Object Detection with Vision Transformers
OWLv1 CLIP ViT-B/32
This model uses OWL-ViT v1 with a CLIP ViT-B/32 Transformer as the image encoder and a masked self-attention Transformer as the text encoder, leveraging CLIP's vision–language alignment to detect objects specified by arbitrary text queries. It is well suited for applications such as open-vocabulary detection, image search, and real-time visual understanding across diverse domains.
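A minimal zero-shot detection sketch using the Hugging Face reference implementation (google/owlvit-base-patch32) is shown below; the image path, query strings, and score threshold are illustrative placeholders rather than part of this release.

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("example.jpg").convert("RGB")      # any RGB image (placeholder path)
texts = [["a photo of a cat", "a photo of a dog"]]    # arbitrary text queries

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale predicted boxes to the original image size and keep confident detections.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{texts[0][int(label)]}: {score:.2f} at {box.tolist()}")
```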
Model Configuration:
- Reference implementation: OWLv1 CLIP ViT-B/32
- Original weights: owlvit-base-patch32
- Input resolution: 3x768x768 (a preprocessing sketch follows this list)
- Supported Cooper versions:
  - Cooper SDK: [2.5.2]
  - Cooper Foundry: [2.2]
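Below is a minimal preprocessing sketch for the 3x768x768 input listed above. It assumes the CLIP normalization constants used by the reference OWL-ViT processor and bicubic resizing; verify both against the preprocessing expected by the deployed Cooper models.

```python
import numpy as np
from PIL import Image

# CLIP image normalization constants used by the reference OWL-ViT processor
# (assumed here; confirm against the deployed model's preprocessing config).
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def preprocess(path: str) -> np.ndarray:
    """Load an image and return a normalized (3, 768, 768) float32 array."""
    image = Image.open(path).convert("RGB").resize((768, 768), Image.BICUBIC)
    pixels = np.asarray(image, dtype=np.float32) / 255.0   # scale to [0, 1]
    pixels = (pixels - CLIP_MEAN) / CLIP_STD                # per-channel normalization
    return pixels.transpose(2, 0, 1)                        # HWC -> CHW
```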
| Model | Device | Model Link |
|---|---|---|
| OWLv1 CLIP ViT-B/32 Image Encoder | N1-655 | Model_Link |
| OWLv1 CLIP ViT-B/32 Text Encoder | N1-655 | Model_Link |
| OWLv1 CLIP ViT-B/32 Predictor | N1-655 | Model_Link |
| OWLv1 CLIP ViT-B/32 Image Encoder | CV72 | Model_Link |
| OWLv1 CLIP ViT-B/32 Text Encoder | CV72 | Model_Link |
| OWLv1 CLIP ViT-B/32 Predictor | CV72 | Model_Link |
| OWLv1 CLIP ViT-B/32 Image Encoder | CV75 | Model_Link |
| OWLv1 CLIP ViT-B/32 Text Encoder | CV75 | Model_Link |
| OWLv1 CLIP ViT-B/32 Predictor | CV75 | Model_Link |
