Paper: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (arXiv:2010.11929)
A fine-tuned Vision Transformer (ViT) model for NSFW image classification, based on the "google/vit-base-patch16-224-in21k" architecture.
This model adapts the transformer-encoder architecture to image classification and has been fine-tuned specifically to detect Not Safe For Work (NSFW) content in images with high accuracy.
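To make the patch geometry behind the "patch16-224" name concrete: a 224x224 input image is split into non-overlapping 16x16 patches, giving (224/16)^2 = 196 patch tokens for the encoder to attend over. The following is a minimal illustrative sketch in PyTorch, not code from this model card; the shapes are standard ViT-Base values.

import torch

# A dummy 3-channel 224x224 image tensor (batch of 1)
x = torch.randn(1, 3, 224, 224)

# Split into non-overlapping 16x16 patches: (224/16)^2 = 196 patches,
# each flattened to 3*16*16 = 768 values
patches = x.unfold(2, 16, 16).unfold(3, 16, 16)  # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 196, 768)
print(patches.shape)  # torch.Size([1, 196, 768])

In the actual model, each flattened patch is then projected by a learned linear layer to the hidden size (768 for ViT-Base) before being fed to the transformer encoder.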
The quickest way to run the model is with the Transformers pipeline API:

from PIL import Image
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
classifier = pipeline("image-classification", model="ashishupadhyay/NSFW_DETECTION")

# Open the image to classify and run it through the pipeline
img = Image.open("<path_to_image_file>")
classifier(img)
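The pipeline returns a list of {'label': ..., 'score': ...} dictionaries, where each score is a probability between 0 and 1. The exact label names depend on this checkpoint's configuration, so inspect model.config.id2label if you need them programmatically.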
For finer-grained control over preprocessing and inference, load the model and processor directly:

import torch
from PIL import Image
from transformers import AutoModelForImageClassification, ViTImageProcessor

model = AutoModelForImageClassification.from_pretrained("ashishupadhyay/NSFW_DETECTION")
processor = ViTImageProcessor.from_pretrained("ashishupadhyay/NSFW_DETECTION")

# Preprocess: resize/normalize the image and convert it to a tensor batch
img = Image.open("<path_to_image_file>")
inputs = processor(images=img, return_tensors="pt")

# Run inference without gradient tracking
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Map the highest-scoring logit to its human-readable label
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
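If you want per-class scores rather than just the top label, a small extension of the snippet above (my own sketch, not from the original card) applies a softmax over the logits; it reuses the model and logits variables defined there.

# Continues the previous snippet: turn raw logits into probabilities
probs = torch.softmax(logits, dim=-1)[0]
for idx, prob in enumerate(probs.tolist()):
    print(f"{model.config.id2label[idx]}: {prob:.4f}")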
The model's performance depends on the quality and representativeness of its training data. Users should evaluate its suitability on their own applications and datasets before deploying it.