Update README.md with new model card content
Weights are released under the [MIT License](https://opensource.org/license/mit).
## Links

* [CLIP Quickstart Notebook](https://www.kaggle.com/code/laxmareddypatlolla/clip-quickstart-notebook)
* [CLIP API Documentation](https://keras.io/keras_hub/api/models/clip/)
* [CLIP Model Card](https://huggingface.co/docs/transformers/en/model_doc/clip)
* [KerasHub Beginner Guide](https://keras.io/guides/keras_hub/getting_started/)
* [KerasHub Model Publishing Guide](https://keras.io/guides/keras_hub/upload/)

## Installation

Keras and KerasHub can be installed with:

```
pip install -U -q keras-hub
pip install -U -q keras
```

Jax, TensorFlow, and Torch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment see the [Keras Getting Started](https://keras.io/getting_started/) page.
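Keras selects its backend from the `KERAS_BACKEND` environment variable, which must be set before `keras` is first imported. A minimal sketch (choosing `jax` here is just an example; `tensorflow` and `torch` work the same way):

```python
import os

# Keras reads this variable once, at first import, so set it before
# any `import keras` statement runs.
os.environ["KERAS_BACKEND"] = "jax"
```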
The following model checkpoints are provided by the Keras team. Full code examples for each are available below.

| Preset name | Parameters | Description |
|---|---|---|
| clip-vit-base-patch32 | 151.28M | The model uses a ViT-B/32 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 32 and input images of size (224, 224). |
| clip-vit-large-patch14 | 427.62M | The model uses a ViT-L/14 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (224, 224). |
| clip-vit-large-patch14-336 | 427.94M | The model uses a ViT-L/14 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (336, 336). |
| clip_vit_b_32_laion2b_s34b_b79k | 151.28M | 151 million parameter model with a 12-layer vision encoder, a 12-layer text encoder, and a patch size of 32; Open CLIP model. |
| clip_vit_h_14_laion2b_s32b_b79k | 986.11M | 986 million parameter model with a 32-layer vision encoder, a 24-layer text encoder, and a patch size of 14; Open CLIP model. |
| clip_vit_g_14_laion2b_s12b_b42k | 1.37B | 1.4 billion parameter model with a 40-layer vision encoder, a 24-layer text encoder, and a patch size of 14; Open CLIP model. |
| clip_vit_bigg_14_laion2b_39b_b160k | 2.54B | 2.5 billion parameter model with a 48-layer vision encoder, a 32-layer text encoder, and a patch size of 14; Open CLIP model. |
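The contrastive loss mentioned in the descriptions above pairs each image in a batch with its own caption and applies a symmetric cross-entropy over the resulting similarity matrix. A minimal NumPy sketch of that objective (not the KerasHub implementation; the function name and temperature value are illustrative):

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over a batch of paired embeddings."""
    # l2-normalize so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # shape (batch, batch)
    labels = np.arange(logits.shape[0])  # image i matches caption i

    def cross_entropy(lg):
        lg = lg - lg.max(axis=-1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=-1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Training drives the diagonal (matched pairs) of the similarity matrix up and the off-diagonal entries down, which is what makes the zero-shot ranking in the examples below work.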
## Example Usage

```python
import keras
import numpy as np
from keras_hub.models import CLIPBackbone, CLIPTokenizer
from keras_hub.layers import CLIPImageConverter

# instantiate the model and preprocessing tools
clip = CLIPBackbone.from_preset("clip_vit_large_patch14")
tokenizer = CLIPTokenizer.from_preset(
    "clip_vit_large_patch14", sequence_length=5
)
image_converter = CLIPImageConverter.from_preset("clip_vit_large_patch14")

# obtain tokens for some input text
tokens = tokenizer.tokenize(["mountains", "cat on tortoise", "house"])

# load and preprocess an input image
image = keras.utils.load_img("cat.jpg")
image = image_converter(np.array([image]).astype(float))

# query the model for image/text similarities
clip({
    "images": image,
    "token_ids": tokens,
})
```
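The call above returns similarity logits between the image and each caption. To rank the captions, a softmax over the text axis turns those logits into probabilities; a minimal NumPy sketch (the logit values here are made up for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# hypothetical (num_images, num_texts) logits: one image vs. three captions
logits = np.array([[18.2, 24.7, 15.9]])
probs = softmax(logits)
best = probs.argmax(axis=-1)  # index of the best-matching caption
```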
## Example Usage with Hugging Face URI

```python
import keras
import numpy as np
from keras_hub.models import CLIPBackbone, CLIPTokenizer
from keras_hub.layers import CLIPImageConverter

# instantiate the model and preprocessing tools
clip = CLIPBackbone.from_preset("hf://keras/clip_vit_large_patch14")
tokenizer = CLIPTokenizer.from_preset(
    "hf://keras/clip_vit_large_patch14", sequence_length=5
)
image_converter = CLIPImageConverter.from_preset(
    "hf://keras/clip_vit_large_patch14"
)

# obtain tokens for some input text
tokens = tokenizer.tokenize(["mountains", "cat on tortoise", "house"])

# load and preprocess an input image
image = keras.utils.load_img("cat.jpg")
image = image_converter(np.array([image]).astype(float))

# query the model for image/text similarities
clip({
    "images": image,
    "token_ids": tokens,
})
```