| <!--Copyright 2021 The HuggingFace Team. All rights reserved. | |
| Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with | |
| the License. You may obtain a copy of the License at | |
| http://www.apache.org/licenses/LICENSE-2.0 | |
| Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on | |
| an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the | |
| specific language governing permissions and limitations under the License. | |
| ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be | |
| rendered properly in your Markdown viewer. | |
| --> | |
| *This model was released on 2021-02-26 and added to Hugging Face Transformers on 2021-05-12.* | |
| <div style="float: right;"> | |
| <div class="flex flex-wrap space-x-1"> | |
| <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white"> | |
| <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat"> | |
| <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white"> | |
| </div> | |
| </div> | |
| # CLIP | |
| [CLIP](https://huggingface.co/papers/2103.00020) is a multimodal vision and language model motivated by overcoming the fixed number of object categories a computer vision model can be trained on. CLIP learns about images directly from raw text by jointly training on 400M (image, text) pairs. Pretraining at this scale enables zero-shot transfer to downstream tasks. CLIP uses an image encoder and a text encoder to extract visual and text features. Both features are projected into a latent space with the same number of dimensions, and their dot product gives a similarity score. | |
| You can find all the original CLIP checkpoints under the [OpenAI](https://huggingface.co/openai?search_models=clip) organization. | |
| > [!TIP] | |
| > Click on the CLIP models in the right sidebar for more examples of how to apply CLIP to different image and language tasks. | |
| The example below demonstrates how to calculate similarity scores between multiple text descriptions and an image with [`Pipeline`] or the [`AutoModel`] class. | |
| <hfoptions id="usage"> | |
| <hfoption id="Pipeline"> | |
| ```py | |
| import torch | |
| from transformers import pipeline | |
| clip = pipeline( | |
| task="zero-shot-image-classification", | |
| model="openai/clip-vit-base-patch32", | |
| dtype=torch.bfloat16, | |
| device=0 | |
| ) | |
| labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"] | |
| clip("http://images.cocodataset.org/val2017/000000039769.jpg", candidate_labels=labels) | |
| ``` | |
| </hfoption> | |
| <hfoption id="AutoModel"> | |
| ```py | |
| import requests | |
| import torch | |
| from PIL import Image | |
| from transformers import AutoProcessor, AutoModel | |
| model = AutoModel.from_pretrained("openai/clip-vit-base-patch32", dtype=torch.bfloat16, attn_implementation="sdpa") | |
| processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32") | |
| url = "http://images.cocodataset.org/val2017/000000039769.jpg" | |
| image = Image.open(requests.get(url, stream=True).raw) | |
| labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"] | |
| inputs = processor(text=labels, images=image, return_tensors="pt", padding=True) | |
| outputs = model(**inputs) | |
| logits_per_image = outputs.logits_per_image | |
| probs = logits_per_image.softmax(dim=1) | |
| most_likely_idx = probs.argmax(dim=1).item() | |
| most_likely_label = labels[most_likely_idx] | |
| print(f"Most likely label: {most_likely_label} with probability: {probs[0][most_likely_idx].item():.3f}") | |
| ``` | |
| </hfoption> | |
| </hfoptions> | |
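The similarity score returned above is the dot product of projected embeddings described in the introduction. The minimal sketch below (assuming the same `openai/clip-vit-base-patch32` checkpoint and COCO image as the examples) makes this explicit by extracting the embeddings with [`~CLIPModel.get_text_features`] and [`~CLIPModel.get_image_features`], normalizing them, and comparing them directly.

```py
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("openai/clip-vit-base-patch32")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    # projected embeddings from the text and image encoders
    text_embeds = model.get_text_features(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
    image_embeds = model.get_image_features(pixel_values=inputs["pixel_values"])

# normalize so the dot product is a cosine similarity
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)

# similarity between the image and each text description
similarity = image_embeds @ text_embeds.T
print(similarity)
```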
| ## Notes | |
| - Use [`CLIPImageProcessor`] to resize (or rescale) and normalize images for the model, as shown in the sketch below. | |
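A minimal preprocessing sketch, assuming the `openai/clip-vit-base-patch32` checkpoint and the same COCO image as the examples above:

```py
import requests
from PIL import Image
from transformers import CLIPImageProcessor

image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# resizes, center-crops, rescales, and normalizes the image
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
print(pixel_values.shape)  # (batch, channels, height, width), e.g. torch.Size([1, 3, 224, 224])
```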
| ## CLIPConfig | |
| [[autodoc]] CLIPConfig | |
| ## CLIPTextConfig | |
| [[autodoc]] CLIPTextConfig | |
| ## CLIPVisionConfig | |
| [[autodoc]] CLIPVisionConfig | |
| ## CLIPTokenizer | |
| [[autodoc]] CLIPTokenizer | |
| - get_special_tokens_mask | |
| - save_vocabulary | |
| ## CLIPTokenizerFast | |
| [[autodoc]] CLIPTokenizerFast | |
| ## CLIPImageProcessor | |
| [[autodoc]] CLIPImageProcessor | |
| - preprocess | |
| ## CLIPImageProcessorFast | |
| [[autodoc]] CLIPImageProcessorFast | |
| - preprocess | |
| ## CLIPProcessor | |
| [[autodoc]] CLIPProcessor | |
| - __call__ | |
| ## CLIPModel | |
| [[autodoc]] CLIPModel | |
| - forward | |
| - get_text_features | |
| - get_image_features | |
| ## CLIPTextModel | |
| [[autodoc]] CLIPTextModel | |
| - forward | |
| ## CLIPTextModelWithProjection | |
| [[autodoc]] CLIPTextModelWithProjection | |
| - forward | |
| ## CLIPVisionModelWithProjection | |
| [[autodoc]] CLIPVisionModelWithProjection | |
| - forward | |
| ## CLIPVisionModel | |
| [[autodoc]] CLIPVisionModel | |
| - forward | |
| ## CLIPForImageClassification | |
| [[autodoc]] CLIPForImageClassification | |
| - forward | |