---
library_name: pytorch
---

![OWLViT](resource/OWLViT.png)

OWL-ViT extends CLIP-based vision–language models to perform open-vocabulary object detection by aligning image regions with textual descriptions, enabling zero-shot detection without task-specific training.
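As a quick illustration of zero-shot detection, the sketch below runs the public `google/owlvit-base-patch32` checkpoint (the original weights referenced below) through the Hugging Face `transformers` API. The sample image URL and text queries are arbitrary placeholders; this is the reference PyTorch path, not the Cooper on-device pipeline that the binaries in the table further down target.

```python
# Minimal zero-shot detection sketch with the public checkpoint
# (reference PyTorch path, not the Cooper on-device binaries).
import requests
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # any RGB image works
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a dog"]]  # free-form text queries

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and boxes to thresholded detections in pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs=outputs, target_sizes=target_sizes, threshold=0.1
)
for score, label, box in zip(
    results[0]["scores"], results[0]["labels"], results[0]["boxes"]
):
    print(f"{texts[0][label]}: {score:.2f} at {box.tolist()}")
```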
Original paper: [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230)

# OWLv1 CLIP ViT-B/32
This model uses OWL-ViT v1 with a CLIP ViT-B/32 Transformer as the image encoder and a masked self-attention Transformer as the text encoder, leveraging the vision–language alignment of CLIP to detect objects specified by arbitrary text queries. It is well suited for applications such as open-vocabulary detection, image search, and real-time visual understanding across diverse domains.
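Because OWL-ViT is a dual-encoder model, text and image embeddings can be computed independently, which is presumably why the deployment in the table below ships the text encoder, image encoder, and predictor as separate binaries: embeddings for a fixed query set can be computed once and reused across frames. The hedged sketch below pulls both embeddings out of the same public checkpoint via `OwlViTModel`; `frame.jpg` is a placeholder path, and the predictor (detection heads) is not exercised here.

```python
# Sketch: compute text and image embeddings separately with the public
# checkpoint, mirroring the encoder split used for the device binaries.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTModel

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTModel.from_pretrained("google/owlvit-base-patch32")

# Text side: encode the query set once; the embeddings can be cached.
text_inputs = processor(
    text=["a photo of a cat", "a photo of a dog"], return_tensors="pt"
)
with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)

# Image side: runs per frame. "frame.jpg" is a hypothetical placeholder.
image = Image.open("frame.jpg")
image_inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    image_emb = model.get_image_features(**image_inputs)

# CLIP-style cosine similarity between the global embeddings.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)  # shape: (num_images, num_queries)
```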
Model Configuration:
- Reference implementation: [OWLv1 CLIP ViT-B/32](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit)
- Original weights: [owlvit-base-patch32](https://huggingface.co/google/owlvit-base-patch32/blob/main/pytorch_model.bin)
- Input resolution: 3x768x768 (CHW)
- Supported Cooper versions:
  - Cooper SDK: [2.5.2]
  - Cooper Foundry: [2.2]

| Model | Device | Model Link |
| :-----: | :-----: | :-----: |
| OWLv1 CLIP ViT-B/32 Image Encoder | N1-655 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/n1-655_owlvit_v1_base_patch32_image_encoder.bin) |
| OWLv1 CLIP ViT-B/32 Text Encoder | N1-655 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/n1-655_owlvit_v1_base_patch32_text_encoder.bin) |
| OWLv1 CLIP ViT-B/32 Predictor | N1-655 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/n1-655_owlvit_v1_base_patch32_predictor.bin) |
| OWLv1 CLIP ViT-B/32 Image Encoder | CV72 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/cv72_owlvit_v1_base_patch32_image_encoder.bin) |
| OWLv1 CLIP ViT-B/32 Text Encoder | CV72 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/cv72_owlvit_v1_base_patch32_text_encoder.bin) |
| OWLv1 CLIP ViT-B/32 Predictor | CV72 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/cv72_owlvit_v1_base_patch32_predictor.bin) |
| OWLv1 CLIP ViT-B/32 Image Encoder | CV75 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/cv75_owlvit_v1_base_patch32_image_encoder.bin) |
| OWLv1 CLIP ViT-B/32 Text Encoder | CV75 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/cv75_owlvit_v1_base_patch32_text_encoder.bin) |
| OWLv1 CLIP ViT-B/32 Predictor | CV75 | [Model_Link](https://huggingface.co/Ambarella/OWLViT/blob/main/cv75_owlvit_v1_base_patch32_predictor.bin) |
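The binaries above can be fetched programmatically, as in the sketch below using `huggingface_hub` with the repo and filenames from the table; loading and running them on-device requires the Cooper SDK/Foundry toolchain listed above, whose API is outside the scope of this card.

```python
# Sketch: download one device binary from this repo with huggingface_hub.
# Running it on-device requires the Cooper SDK, which is not shown here.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Ambarella/OWLViT",
    filename="cv72_owlvit_v1_base_patch32_image_encoder.bin",
)
print(path)  # local cache path of the downloaded binary
```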