Upload 3 files
- README.md +40 -0
- app.py +42 -0
- requirements.txt +2 -0
README.md
ADDED
@@ -0,0 +1,40 @@
# CLIP Zero-Shot Classification on Oxford Pets Dataset

## Model Details
- **Model Name**: CLIP (Contrastive Language-Image Pre-training)
- **Model Version**: openai/clip-vit-large-patch14
- **Task**: Zero-shot Image Classification
- **Dataset**: Oxford-IIIT Pet Dataset

## Evaluation Results
The model was evaluated on the Oxford Pets dataset using zero-shot classification. The following metrics were obtained:

- **Accuracy**: 0.8800 (88.00%)
- **Precision**: 0.8768 (87.68%)
- **Recall**: 0.8800 (88.00%)

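
The README does not say how precision and recall were averaged over the 37 classes. Recall being identical to accuracy is consistent with support-weighted averaging, because weighted recall always equals accuracy. The following is a minimal sketch under that assumption; the function name `weighted_metrics` and the toy labels are hypothetical and are not the Oxford Pets results.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Return (accuracy, weighted precision, weighted recall)."""
    n = len(y_true)
    support = Counter(y_true)      # true examples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    accuracy = sum(correct.values()) / n
    precision = 0.0
    recall = 0.0
    for label, count in support.items():
        weight = count / n
        # A class that is never predicted contributes zero precision.
        class_precision = correct[label] / predicted[label] if predicted[label] else 0.0
        class_recall = correct[label] / count
        precision += weight * class_precision
        recall += weight * class_recall
    return accuracy, precision, recall

# Hypothetical toy data, not the Oxford Pets evaluation.
y_true = ["pug", "pug", "beagle", "beagle", "Bengal"]
y_pred = ["pug", "beagle", "beagle", "beagle", "Bengal"]
accuracy, precision, recall = weighted_metrics(y_true, y_pred)
```

On this toy data the weighted recall matches the accuracy exactly, while the weighted precision differs, mirroring the pattern in the reported numbers.
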
## Model Description
CLIP (Contrastive Language-Image Pre-training) is a neural network trained on a large variety of (image, text) pairs. Given an image and a set of candidate labels written in natural language, it predicts the most relevant label without being optimized directly for that classification task. This zero-shot capability is particularly useful when labeled data is scarce or expensive to obtain.

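
To make the zero-shot step concrete, here is a rough sketch of how an image embedding can be scored against text embeddings: cosine similarity, scaling, then softmax. The three-dimensional vectors, the prompt strings, and the scale of 100 are assumptions for illustration; real embeddings come from CLIP's image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_scores(image_embedding, text_embeddings, scale=100.0):
    """Softmax over scaled cosine similarities between one image and each label."""
    logits = [scale * cosine(image_embedding, t) for t in text_embeddings.values()]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(text_embeddings, exps)}

# Hypothetical embeddings, chosen only to make the example readable.
text_embeddings = {
    "a photo of a pug": [0.9, 0.1, 0.0],
    "a photo of a beagle": [0.1, 0.9, 0.0],
}
scores = zero_shot_scores([0.8, 0.2, 0.1], text_embeddings)
best = max(scores, key=scores.get)
```

No label-specific training happens here: changing the candidate labels only changes which text embeddings the image is compared against.
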
## Dataset
The Oxford-IIIT Pet Dataset is a 37-category pet dataset with roughly 200 images per class. The images vary widely in scale, pose, and lighting. Every image has an associated ground-truth breed annotation.

## Usage
```python
from transformers import pipeline

# Load the model
checkpoint = "openai/clip-vit-large-patch14"
detector = pipeline(model=checkpoint, task="zero-shot-image-classification")

# Define candidate labels
labels = ['Siamese', 'Birman', 'shiba inu', 'staffordshire bull terrier']  # ... remaining breeds as listed in app.py

# Placeholder input: replace with a local file path, URL, or PIL image
image = "path/to/pet.jpg"

# Run inference; returns a list of dicts with 'label' and 'score' keys
results = detector(image, candidate_labels=labels)
```

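
The pipeline output is a list of dicts with `label` and `score` keys, which is the shape app.py consumes. A small helper for picking the strongest predictions might look like the sketch below; `top_k` is a hypothetical name and the scores are made up.

```python
def top_k(results, k=3):
    """Return the k highest-scoring (label, score) pairs from pipeline-style output."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["label"], round(r["score"], 4)) for r in ranked[:k]]

# Hypothetical output in the shape the pipeline returns; not real model scores.
results = [
    {"label": "Birman", "score": 0.07},
    {"label": "Siamese", "score": 0.91},
    {"label": "shiba inu", "score": 0.02},
]
top = top_k(results, k=2)
```

Sorting explicitly keeps the helper independent of whatever order the pipeline happens to return.
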
## Limitations
- The model's performance may vary depending on the quality and characteristics of the input images.
- Zero-shot classification may not perform as well as fine-tuned models on specific tasks.
- The model's predictions are based on the provided candidate labels, so the quality of results depends on the relevance and completeness of these labels.
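
Since results depend on the candidate labels, one common mitigation is to wrap each label in a descriptive prompt before it reaches the text encoder. The sketch below only builds the prompt strings; the template wording and the name `build_prompts` are assumptions, not something this repository uses. Recent versions of the `transformers` zero-shot image classification pipeline accept a `hypothesis_template` argument for the same purpose, so check the documentation for the installed version before relying on it.

```python
def build_prompts(labels, template="a photo of a {}, a type of pet."):
    """Map each prompt back to its original label so predictions can be reported by label."""
    prompts = {}
    for label in labels:
        prompts[template.format(label)] = label
    return prompts

# A few labels taken from the list in app.py.
labels = ["Siamese", "shiba inu", "Maine Coon"]
prompts = build_prompts(labels)
```

Keeping the prompt-to-label mapping means the more descriptive text can be used for scoring while the output still reports the plain breed name.
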
app.py
ADDED
@@ -0,0 +1,42 @@
import gradio as gr
from transformers import pipeline

# Load models
vit_classifier = pipeline("image-classification", model="kuhs/vit-base-oxford-iiit-pets")
clip_detector = pipeline(model="openai/clip-vit-large-patch14", task="zero-shot-image-classification")

labels_oxford_pets = [
    'Siamese', 'Birman', 'shiba inu', 'staffordshire bull terrier', 'basset hound', 'Bombay', 'japanese chin',
    'chihuahua', 'german shorthaired', 'pomeranian', 'beagle', 'english cocker spaniel', 'american pit bull terrier',
    'Ragdoll', 'Persian', 'Egyptian Mau', 'miniature pinscher', 'Sphynx', 'Maine Coon', 'keeshond', 'yorkshire terrier',
    'havanese', 'leonberger', 'wheaten terrier', 'american bulldog', 'english setter', 'boxer', 'newfoundland', 'Bengal',
    'samoyed', 'British Shorthair', 'great pyrenees', 'Abyssinian', 'pug', 'saint bernard', 'Russian Blue', 'scottish terrier'
]

def classify_pet(image):
    vit_results = vit_classifier(image)
    vit_output = {result['label']: result['score'] for result in vit_results}

    clip_results = clip_detector(image, candidate_labels=labels_oxford_pets)
    clip_output = {result['label']: result['score'] for result in clip_results}

    return {"ViT Classification": vit_output, "CLIP Zero-Shot Classification": clip_output}

example_images = [
    ["example_images/dog1.jpeg"],
    ["example_images/dog2.jpeg"],
    ["example_images/leonberger.jpg"],
    ["example_images/snow_leopard.jpeg"],
    ["example_images/cat.jpg"]
]

iface = gr.Interface(
    fn=classify_pet,
    inputs=gr.Image(type="filepath"),
    outputs=gr.JSON(),
    title="Pet Classification Comparison",
    description="Upload an image of a pet, and compare results from a trained ViT model and a zero-shot CLIP model.",
    examples=example_images
)

iface.launch()
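
The comparison logic in `classify_pet` can be exercised without downloading either model by passing the classifiers in as arguments. This is only a sketch: `compare_classifiers`, `fake_vit`, and `fake_clip` are hypothetical names, and the stand-ins merely mimic the output shape the code above relies on.

```python
def compare_classifiers(image, vit_classifier, clip_detector, labels):
    """Same logic as classify_pet, with the pipelines passed in as arguments."""
    vit_output = {r["label"]: r["score"] for r in vit_classifier(image)}
    clip_output = {r["label"]: r["score"] for r in clip_detector(image, candidate_labels=labels)}
    return {"ViT Classification": vit_output, "CLIP Zero-Shot Classification": clip_output}

# Hypothetical stand-ins that return made-up scores without loading any model.
def fake_vit(image):
    return [{"label": "pug", "score": 0.9}, {"label": "boxer", "score": 0.1}]

def fake_clip(image, candidate_labels):
    score = 1.0 / len(candidate_labels)
    return [{"label": label, "score": score} for label in candidate_labels]

output = compare_classifiers("pet.jpg", fake_vit, fake_clip, ["pug", "boxer"])
```

Injecting the classifiers keeps the dictionary-building step testable in isolation; the real pipelines can be passed in unchanged.
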
requirements.txt
ADDED
@@ -0,0 +1,2 @@
transformers
torch