Update README.md
---
license: apache-2.0
tags:
- vision
- image-classification
- vit
datasets:
- imagenet-1k
- imagenet-21k
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg
  example_title: Tiger
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/teapot.jpg
  example_title: Teapot
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg
  example_title: Palace
language:
- en
library_name: transformers
pipeline_tag: image-classification
---

# Model Overview:

The Vision Transformer (ViT) is a transformer encoder model designed for image recognition tasks. It was pretrained on ImageNet-21k, a large dataset of 14 million images covering 21,843 classes, and fine-tuned on ImageNet 2012, which consists of 1 million images across 1,000 classes.

# How It Works:

Input Representation: Images are split into fixed-size patches (16x16 pixels) and linearly embedded. A special [CLS] token is prepended to the patch sequence; its representation is later used for classification.
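
To make the shapes concrete, here is a minimal sketch of this step. The random image, the `torch.nn.Linear` projection, and the zero-initialized [CLS] parameter are illustrative stand-ins, not this checkpoint's weights: a 224x224 image yields (224/16)^2 = 196 patches, and the [CLS] token brings the sequence to 197 embeddings.

```python
import torch

# Illustrative sketch of the patch-embedding step; shapes follow the text above,
# but the random image and Linear layer are stand-ins for the real weights.
image = torch.randn(1, 3, 224, 224)   # (batch, channels, height, width)
patch_size, hidden_size = 16, 768

# Cut the image into non-overlapping 16x16 patches and flatten each one:
# (224 / 16)^2 = 196 patches, each with 3 * 16 * 16 = 768 pixel values.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)

# Linearly embed each flattened patch into the hidden dimension.
embed = torch.nn.Linear(3 * patch_size * patch_size, hidden_size)
tokens = embed(patches)               # (1, 196, 768)

# Prepend a learnable [CLS] token, giving a sequence of 197 embeddings.
cls_token = torch.nn.Parameter(torch.zeros(1, 1, hidden_size))
sequence = torch.cat([cls_token, tokens], dim=1)
print(sequence.shape)                 # torch.Size([1, 197, 768])
```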

Transformer Encoder: The model uses a transformer encoder architecture, similar to BERT, applied to the sequence of patch embeddings.

Classification: After processing through the transformer layers, the output from the [CLS] token is used for image classification. This token's final hidden state represents the features of the entire image.
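
As a hedged sketch of how the [CLS] state feeds a classifier: the randomly initialized `torch.nn.Linear` head below is an assumption for illustration only; the fine-tuned checkpoint ships its own 1,000-way head inside `ViTForImageClassification`.

```python
import torch
from transformers import ViTModel

# Load only the encoder; transformers will warn that the checkpoint's
# classifier head is unused here, which is expected for this sketch.
model = ViTModel.from_pretrained('google/vit-base-patch16-224')
pixel_values = torch.randn(1, 3, 224, 224)       # stand-in for a preprocessed image

with torch.no_grad():
    outputs = model(pixel_values=pixel_values)

cls_state = outputs.last_hidden_state[:, 0]      # position 0 holds the [CLS] token
head = torch.nn.Linear(model.config.hidden_size, 1000)  # illustrative head only
logits = head(cls_state)                         # (1, 1000) class scores
```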

# Intended Uses:

Image Classification: ViT can be directly used for image classification tasks. By adding a linear layer on top of the [CLS] token, the model can classify images into one of the 1,000 ImageNet classes.
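
For quick experiments, the same task fits in a few lines with the transformers `pipeline` API; a minimal example, reusing the COCO image URL from the snippet further below:

```python
from transformers import pipeline

# One-line image classification with this checkpoint; the pipeline handles
# preprocessing, inference, and label mapping internally.
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
predictions = classifier("http://images.cocodataset.org/val2017/000000039769.jpg")
print(predictions[0])  # top-scoring ImageNet class with its confidence
```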

# Limitations:

# Training Details:

Preprocessing: Images are resized to 224x224 pixels and normalized across RGB channels.
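
These settings can be checked directly on the checkpoint's image processor; a small sketch, where the printed values are whatever the hosted preprocessing config contains:

```python
from transformers import ViTImageProcessor

# Inspect the preprocessing configuration shipped with the checkpoint:
# the resize target and the per-channel normalization statistics.
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
print(processor.size)                             # resize target, e.g. 224x224
print(processor.image_mean, processor.image_std)  # RGB normalization stats
```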

Training: Pretraining was conducted on TPUv3 hardware with a batch size of 4096 and learning rate warmup. Gradient clipping was applied during training to enhance stability.
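
The sketch below illustrates the two stabilization techniques just mentioned, learning-rate warmup and gradient clipping, in plain PyTorch; the toy model, synthetic batches, warmup length, and clipping threshold are assumptions for illustration, not the original TPUv3 recipe.

```python
import torch

# Toy stand-ins so the sketch runs end to end; not the real model or data.
model = torch.nn.Linear(768, 1000)
data = [(torch.randn(8, 768), torch.randint(0, 1000, (8,))) for _ in range(5)]

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
warmup_steps = 100  # assumed value for illustration

# Linear warmup: scale the learning rate from near zero up to its base value.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))

for inputs, labels in data:
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    optimizer.zero_grad()
    loss.backward()
    # Clip gradients at a global norm of 1.0 (assumed threshold) for stability.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
```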

Putting it together, the example below downloads an image and classifies it with this checkpoint:

```python
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import requests
import torch

def predict_image_from_url(url):
    # Load the image from a URL
    image = Image.open(requests.get(url, stream=True).raw)

    # Initialize the processor and model
    processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
    model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

    # Preprocess the image and run inference
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Map the highest-scoring logit to its ImageNet class label
    logits = outputs.logits
    predicted_class_idx = logits.argmax(-1).item()
    predicted_class = model.config.id2label[predicted_class_idx]

    return predicted_class

# Example usage
if __name__ == "__main__":
    url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
    predicted_class = predict_image_from_url(url)
    print(f"Predicted class: {predicted_class}")
```

For more code examples, we refer to the [documentation](https://huggingface.co/transformers/model_doc/vit.html#).

## Training data

The ViT model was pretrained on [ImageNet-21k](http://www.image-net.org/), a dataset consisting of 14 million images and 21k classes, and fine-tuned on [ImageNet](http://www.image-net.org/challenges/LSVRC/2012/), a dataset consisting of 1 million images and 1k classes.

# Evaluation Results:

Performance: Detailed evaluation results on various benchmarks can be found in the tables of the original paper. Fine-tuning the model at higher resolutions typically improves classification accuracy.