Commit db01457 · Parent(s): c834bfc
Update README.md

README.md CHANGED
@@ -1,145 +1,100 @@
Removed:

---
tags:
- vision
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png
  candidate_labels: playing music, playing sports
  example_title: Cat & Dog
---
# Model Card: CLIP

### Model Date

January 2021

The original implementation had two variants: one using a ResNet image encoder and the other using a Vision Transformer. This repository has the variant with the Vision Transformer.

### Documents

- [Blog Post](https://openai.com/blog/clip/)
- [CLIP Paper](https://arxiv.org/abs/2103.00020)
### Use with Transformers

```python3
from PIL import Image
import requests

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # softmax over the labels gives probabilities
```
## Model Use

**Any** deployed use case of the model - whether commercial or not - is currently out of scope. Non-deployed use cases, such as image search in a constrained environment, are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. Our safety assessment demonstrated a high need for task-specific testing, especially given the variability of CLIP's performance across class taxonomies, so untested and unconstrained deployment of the model in any use case is currently potentially harmful.

Certain use cases in the domain of surveillance and facial recognition are always out of scope regardless of the model's performance, because the use of artificial intelligence for such tasks is currently premature given the lack of testing norms and checks to ensure its fair use.
## Data

- Food101
- CIFAR10
- CIFAR100
- Birdsnap
- SUN397
- Stanford Cars
- FGVC Aircraft
- VOC2007
- DTD
- Oxford-IIIT Pet dataset
- Caltech101
- Flowers102
- MNIST
- SVHN
- IIIT5K
- Hateful Memes
- SST-2
- UCF101
- Kinetics700
- Country211
- CLEVR Counting
- KITTI Distance
- STL-10
- RareAct
- Flickr30
- MSCOCO
- ImageNet
- ImageNet-A
- ImageNet-R
- ImageNet Sketch
- ObjectNet (ImageNet Overlap)
- Youtube-BB
- ImageNet-Vid
## Limitations

### Bias and Fairness

We find that the performance of CLIP - and the specific biases it exhibits - can depend significantly on class design and on the choices one makes about which categories to include and exclude. We tested the risk of certain kinds of denigration with CLIP by classifying images of people from [Fairface](https://arxiv.org/abs/1908.04913) into crime-related and non-human animal categories. We found significant disparities with respect to race and gender, and these disparities could shift based on how the classes were constructed. (Details are captured in the Broader Impacts section of the paper.)

## Feedback

### Where to send questions or comments about the model
Added:

---
tags:
- vision
- coin
- coin-retrieval
- coin-recognition
widget:
- src: >-
    https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png
  candidate_labels: playing music, playing sports
  example_title: Cat & Dog
license: apache-2.0
library_name: transformers
---
# Model Card: CLIP

## Model Details

This model is fine-tuned from OpenAI's CLIP (ViT-B/32) on a coin dataset using **contrastive learning**. It aims to strengthen feature extraction for **coin** images and thereby enable more accurate image-based search. The model combines the Vision Transformer (ViT) image encoder with CLIP's multimodal learning capabilities, optimized specifically for coin imagery.
## Usage and Limitations

- **Usage**: This model is primarily used to extract representation vectors from coin images, enabling efficient and precise image-based search over a coin image database (a retrieval sketch follows the Model Use code below).
- **Limitations**: Because the model is trained specifically on coin images, it may not perform well on non-coin images.
## Documents

- Base Model: [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32)
## Model Use

```python3
from PIL import Image
import torch.nn.functional as F

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("breezedeus/coin-clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("breezedeus/coin-clip-vit-base-patch32")

image_fp = "path/to/coin_image.jpg"
image = Image.open(image_fp).convert("RGB")

inputs = processor(images=image, return_tensors="pt")
img_features = model.get_image_features(**inputs)
img_features = F.normalize(img_features, dim=1)  # L2-normalize so cosine similarity becomes a dot product
```
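As a usage sketch that is not part of the original card, the normalized features above can drive a simple nearest-neighbor search: embed every database image once, then rank by cosine similarity against the query embedding. The `embed` helper and the file paths are hypothetical.

```python3
import torch

def embed(image_fp: str) -> torch.Tensor:
    """Hypothetical helper: embed one image with the model/processor loaded above."""
    image = Image.open(image_fp).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return F.normalize(feats, dim=1)

# Build a small database of embeddings (paths are placeholders).
db_paths = ["coins/penny.jpg", "coins/nickel.jpg", "coins/dime.jpg"]
db = torch.cat([embed(fp) for fp in db_paths], dim=0)   # shape: (N, 512)

# On normalized vectors, cosine similarity reduces to a matrix product.
query = embed("path/to/coin_image.jpg")                 # shape: (1, 512)
scores = (query @ db.T).squeeze(0)                      # shape: (N,)
ranked = scores.argsort(descending=True)
print([db_paths[int(i)] for i in ranked])               # most similar coins first
```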
## Training Data

The model was trained on a specialized coin image dataset that includes coin images from a variety of currencies.
## Training Process

The model was fine-tuned from the OpenAI CLIP (ViT-B/32) pretrained checkpoint on the coin image dataset, using contrastive-learning fine-tuning techniques and the corresponding hyperparameter settings.
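The card does not specify the exact training objective, so the following is only a minimal sketch of the kind of symmetric InfoNCE loss commonly used for contrastive fine-tuning, assuming two augmented views of each coin image in a batch; the temperature value and the `augment` step are assumptions.

```python3
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: z1[i] and z2[i] embed two views of the same coin."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature         # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))       # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Hypothetical training step:
# view1, view2 = augment(batch), augment(batch)   # `augment` is an assumed helper
# loss = info_nce(model.get_image_features(**view1), model.get_image_features(**view2))
```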
## Performance

This model demonstrates excellent performance on coin image retrieval tasks.
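The card does not publish quantitative results. As an illustrative sketch only, retrieval quality for a model like this is often summarized as top-k recall: the fraction of queries whose correct match appears among the k nearest database embeddings. All tensor names below are hypothetical.

```python3
import torch

def recall_at_k(query_emb: torch.Tensor, db_emb: torch.Tensor,
                labels_q: torch.Tensor, labels_db: torch.Tensor, k: int = 5) -> float:
    """Fraction of queries with at least one same-label item among their top-k neighbors.

    Embeddings are assumed L2-normalized, so cosine similarity is a dot product.
    """
    sims = query_emb @ db_emb.T                # (Q, N) similarity matrix
    topk = sims.topk(k, dim=1).indices         # k nearest database items per query
    hits = (labels_db[topk] == labels_q.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```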
## Feedback

### Where to send questions or comments about the model

Questions and comments are welcome; please contact the author, [Breezedeus](https://www.breezedeus.com/join-group).