Update README.md with new model card content
Weights are released under the [MIT License](https://opensource.org/license/mit).
## Links

* [CLIP Quickstart Notebook](https://www.kaggle.com/code/laxmareddypatlolla/clip-quickstart-notebook)
* [CLIP API Documentation](https://keras.io/keras_hub/api/models/clip/)
* [CLIP Model Card](https://huggingface.co/docs/transformers/en/model_doc/clip)
* [KerasHub Beginner Guide](https://keras.io/guides/keras_hub/getting_started/)
* [KerasHub Model Publishing Guide](https://keras.io/guides/keras_hub/upload/)

## Installation

Keras and KerasHub can be installed with:

```
pip install -U -q keras-hub
pip install -U -q keras
```

Jax, TensorFlow, and Torch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment see the [Keras Getting Started](https://keras.io/getting_started/) page.
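Keras selects its backend from the `KERAS_BACKEND` environment variable, which must be set before `keras` is first imported. A minimal sketch (choosing `jax` here is just an example; `tensorflow` and `torch` work the same way):

```python
import os

# Keras reads this variable once, at first import, so set it before
# any `import keras` statement runs.
os.environ["KERAS_BACKEND"] = "jax"
```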
The following model checkpoints are provided by the Keras team. Full code examples for each are available below.

| Preset name | Parameters | Description |
|---|---|---|
| clip-vit-base-patch32 | 151.28M | The model uses a ViT-B/32 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 32 and input images of size (224, 224). |
| clip-vit-large-patch14 | 427.62M | The model uses a ViT-L/14 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (224, 224). |
| clip-vit-large-patch14-336 | 427.94M | The model uses a ViT-L/14 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (336, 336). |
| clip_vit_b_32_laion2b_s34b_b79k | 151.28M | 151 million parameter model with a 12-layer vision encoder, a 12-layer text encoder, and a patch size of 32; Open CLIP model. |
| clip_vit_h_14_laion2b_s32b_b79k | 986.11M | 986 million parameter model with a 32-layer vision encoder, a 24-layer text encoder, and a patch size of 14; Open CLIP model. |
| clip_vit_g_14_laion2b_s12b_b42k | 1.37B | 1.4 billion parameter model with a 40-layer vision encoder, a 24-layer text encoder, and a patch size of 14; Open CLIP model. |
| clip_vit_bigg_14_laion2b_39b_b160k | 2.54B | 2.5 billion parameter model with a 48-layer vision encoder, a 32-layer text encoder, and a patch size of 14; Open CLIP model. |
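The contrastive loss mentioned in the descriptions above pairs each image in a batch with its own caption and applies a symmetric cross-entropy over the resulting similarity matrix. A minimal NumPy sketch of that objective (not the KerasHub implementation; the function name and temperature value are illustrative):

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over a batch of paired embeddings."""
    # l2-normalize so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # shape (batch, batch)
    labels = np.arange(logits.shape[0])  # image i matches caption i

    def cross_entropy(lg):
        lg = lg - lg.max(axis=-1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=-1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Training drives the diagonal (matched pairs) of the similarity matrix up and the off-diagonal entries down, which is what makes the zero-shot ranking in the examples below work.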
## Example Usage

```python
import keras
import numpy as np
from keras_hub.models import CLIPBackbone, CLIPTokenizer
from keras_hub.layers import CLIPImageConverter

# instantiate the model and preprocessing tools
clip = CLIPBackbone.from_preset("clip_vit_large_patch14")
tokenizer = CLIPTokenizer.from_preset(
    "clip_vit_large_patch14", sequence_length=5
)
image_converter = CLIPImageConverter.from_preset("clip_vit_large_patch14")

# obtain tokens for some input text
tokens = tokenizer.tokenize(["mountains", "cat on tortoise", "house"])

# load and preprocess an input image
image = keras.utils.load_img("cat.jpg")
image = image_converter(np.array([image]).astype(float))

# query the model for image/text similarities
clip({
    "images": image,
    "token_ids": tokens,
})
```
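The call above returns similarity logits between the image and each caption. To rank the captions, a softmax over the text axis turns those logits into probabilities; a minimal NumPy sketch (the logit values here are made up for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# hypothetical (num_images, num_texts) logits: one image vs. three captions
logits = np.array([[18.2, 24.7, 15.9]])
probs = softmax(logits)
best = probs.argmax(axis=-1)  # index of the best-matching caption
```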
## Example Usage with Hugging Face URI

```python
import keras
import numpy as np
from keras_hub.models import CLIPBackbone, CLIPTokenizer
from keras_hub.layers import CLIPImageConverter

# instantiate the model and preprocessing tools
clip = CLIPBackbone.from_preset("hf://keras/clip_vit_large_patch14")
tokenizer = CLIPTokenizer.from_preset(
    "hf://keras/clip_vit_large_patch14", sequence_length=5
)
image_converter = CLIPImageConverter.from_preset(
    "hf://keras/clip_vit_large_patch14"
)

# obtain tokens for some input text
tokens = tokenizer.tokenize(["mountains", "cat on tortoise", "house"])

# load and preprocess an input image
image = keras.utils.load_img("cat.jpg")
image = image_converter(np.array([image]).astype(float))

# query the model for image/text similarities
clip({
    "images": image,
    "token_ids": tokens,
})
```