---
library_name: pytorch
---

![owlvit_logo](resource/OWLViT.png)

OWL-ViT extends CLIP-based vision–language models to open-vocabulary object detection by aligning image regions with textual descriptions, enabling zero-shot, text-conditioned detection of object categories without per-category training.

Original paper: [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230)

# OWLv1 CLIP ViT-B/32

This model uses OWL-ViT v1 with a CLIP ViT-B/32 Transformer as the image encoder and a masked self-attention Transformer as the text encoder, leveraging CLIP's vision–language alignment to detect objects specified by arbitrary text queries. It is well suited to applications such as open-vocabulary detection, image search, and real-time visual understanding across diverse domains.
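
A minimal zero-shot detection sketch with the upstream Hugging Face checkpoint (not the compiled Ambarella binaries listed below) is shown here; it assumes the `transformers` and `Pillow` packages and an example image path `demo.jpg`:

```python
# Zero-shot detection with the upstream google/owlvit-base-patch32 checkpoint.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("demo.jpg")                        # example image path (assumption)
texts = [["a photo of a cat", "a photo of a dog"]]    # arbitrary text queries

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale boxes to the original image size and keep confident detections.
target_sizes = torch.tensor([image.size[::-1]])       # (height, width)
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.1, target_sizes=target_sizes
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{texts[0][int(label)]}: {score.item():.2f} at {box.tolist()}")
```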

Model Configuration:
- Reference implementation: [OWLv1 CLIP ViT-B/32](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit)
- Original weights: [owlvit-base-patch32](https://huggingface.co/google/owlvit-base-patch32/blob/main/pytorch_model.bin)
- Input resolution: 3x768x768 (channels x height x width); a preprocessing sketch follows this list
- Supported Cooper versions:
    - Cooper SDK: [2.5.4]
    - Cooper Foundry: [2.2]
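
Below is a minimal preprocessing sketch for the 3x768x768 input listed above. The bicubic resize and the CLIP mean/std constants are assumptions taken from the upstream OWL-ViT image processor defaults; confirm them against the Cooper toolchain's preprocessing configuration before deployment.

```python
# Sketch: prepare a 1x3x768x768 input tensor for the image encoder.
# Mean/std are the upstream OWL-ViT (CLIP) defaults -- an assumption here.
import numpy as np
from PIL import Image

MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def preprocess(path: str) -> np.ndarray:
    img = Image.open(path).convert("RGB").resize((768, 768), Image.BICUBIC)
    x = np.asarray(img, dtype=np.float32) / 255.0     # HWC, scaled to [0, 1]
    x = (x - MEAN) / STD                              # CLIP-style normalization
    return x.transpose(2, 0, 1)[None]                 # NCHW: 1x3x768x768

batch = preprocess("demo.jpg")                        # example image path (assumption)
print(batch.shape)                                    # (1, 3, 768, 768)
```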

| Model | Device | Compression | Model Link |
| :-----: | :-----: | :-----: | ------- |
| OWLv1 CLIP ViT-B/32 Image Encoder| N1-655 | Activation_fp16 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/n1-655_owlvit_v1_base_patch32_image_encoder_act16.bin) |
| OWLv1 CLIP ViT-B/32 Text Encoder| N1-655 | Amba_optimized | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/n1-655_owlvit_v1_base_patch32_text_encoder_amba_optimized.bin) |
| OWLv1 CLIP ViT-B/32 Text Encoder| N1-655 | Activation_fp16 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/n1-655_owlvit_v1_base_patch32_text_encoder_act16.bin) |
| OWLv1 CLIP ViT-B/32 Predictor| N1-655 | Activation_fp16 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/n1-655_owlvit_v1_base_patch32_predictor_act16.bin) |
| OWLv1 CLIP ViT-B/32 Image Encoder| CV7 | Activation_fp16 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/cv7_owlvit_v1_base_patch32_image_encoder_act16.bin) |
| OWLv1 CLIP ViT-B/32 Text Encoder| CV7 | Amba_optimized | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/cv7_owlvit_v1_base_patch32_text_encoder_amba_optimized.bin) |
| OWLv1 CLIP ViT-B/32 Text Encoder| CV7 | Activation_fp16 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/cv7_owlvit_v1_base_patch32_text_encoder_act16.bin) |
| OWLv1 CLIP ViT-B/32 Predictor| CV7 | Activation_fp16 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/cv7_owlvit_v1_base_patch32_predictor_act16.bin) |
| OWLv1 CLIP ViT-B/32 Image Encoder| CV72 | Activation_fp16 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/cv72_owlvit_v1_base_patch32_image_encoder_act16.bin) |
| OWLv1 CLIP ViT-B/32 Text Encoder| CV72 | Amba_optimized | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/cv72_owlvit_v1_base_patch32_text_encoder_amba_optimized.bin) |
| OWLv1 CLIP ViT-B/32 Text Encoder| CV72 | Activation_fp16 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/cv72_owlvit_v1_base_patch32_text_encoder_act16.bin) |
| OWLv1 CLIP ViT-B/32 Predictor| CV72 | Activation_fp16 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/cv72_owlvit_v1_base_patch32_predictor_act16.bin) |
| OWLv1 CLIP ViT-B/32 Image Encoder| CV75 | Activation_fp16 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/cv75_owlvit_v1_base_patch32_image_encoder_act16.bin) |
| OWLv1 CLIP ViT-B/32 Text Encoder| CV75 | Amba_optimized | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/cv75_owlvit_v1_base_patch32_text_encoder_amba_optimized.bin) |
| OWLv1 CLIP ViT-B/32 Text Encoder| CV75 | Activation_fp16 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/cv75_owlvit_v1_base_patch32_text_encoder_act16.bin) |
| OWLv1 CLIP ViT-B/32 Predictor| CV75 | Activation_fp16 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/cv75_owlvit_v1_base_patch32_predictor_act16.bin) |
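
The table above splits the network into three deployable parts. The sketch below is purely conceptual: the placeholder functions, tensor shapes, and I/O conventions are assumptions standing in for the compiled binaries and the Cooper runtime API (which are not documented here); it only illustrates how per-patch image embeddings, text query embeddings, and the predictor's box/score outputs fit together.

```python
# Conceptual dataflow between the three deployed components (assumption: the
# compiled binaries follow the standard OWL-ViT head structure, where class
# scores come from a dot product between projected patch embeddings and text
# query embeddings, and boxes are regressed per patch token).
import numpy as np

def run_image_encoder(image_nchw):              # placeholder for the compiled image encoder
    return np.random.rand(1, 576, 512).astype(np.float32)          # 24x24 patch embeddings

def run_text_encoder(token_ids):                # placeholder for the compiled text encoder
    return np.random.rand(len(token_ids), 512).astype(np.float32)  # one embedding per query

def run_predictor(patch_embeds, query_embeds):  # placeholder for the compiled predictor
    boxes = np.random.rand(1, patch_embeds.shape[1], 4).astype(np.float32)  # cx, cy, w, h
    logits = patch_embeds @ query_embeds.T[None]                            # scores per patch/query
    return boxes, logits

patch_embeds = run_image_encoder(np.zeros((1, 3, 768, 768), np.float32))
query_embeds = run_text_encoder([[101, 102], [103, 104]])   # tokenized text queries (dummy ids)
boxes, logits = run_predictor(patch_embeds, query_embeds)
print(boxes.shape, logits.shape)                # (1, 576, 4) boxes, (1, 576, 2) scores
```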