Vision Transformer (ViT) Fine-tuned on Food101 Subset

Model Description

This model is a fine-tuned version of google/vit-base-patch16-224 for food image classification across 10 categories.

Classes

pizza
sushi
hamburger
ice_cream
steak
baklava
cheesecake
pancakes
tacos
ramen

Evaluation Results

Metric	Value
Accuracy	98.04%

Training Logs

Epoch	Training Loss	Validation Loss	Accuracy
1	0.3254	0.1076	97.20%
2	0.1216	0.0904	97.68%
3	0.0361	0.0770	97.88%
4	0.0118	0.0764	98.00%
5	0.0084	0.0767	98.04%
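
The log above shows validation loss flattening after epoch 3 while accuracy keeps inching up. As a minimal sketch (the variable names are illustrative and not part of the original training code), the table can be inspected like this to find which epoch minimizes validation loss and which maximizes accuracy:

```python
# Per-epoch metrics copied from the training log table above.
# Columns: (epoch, training_loss, validation_loss, accuracy_percent)
logs = [
    (1, 0.3254, 0.1076, 97.20),
    (2, 0.1216, 0.0904, 97.68),
    (3, 0.0361, 0.0770, 97.88),
    (4, 0.0118, 0.0764, 98.00),
    (5, 0.0084, 0.0767, 98.04),
]

best_by_val_loss = min(logs, key=lambda row: row[2])
best_by_accuracy = max(logs, key=lambda row: row[3])

print(f"Lowest validation loss: epoch {best_by_val_loss[0]} ({best_by_val_loss[2]:.4f})")
print(f"Highest accuracy: epoch {best_by_accuracy[0]} ({best_by_accuracy[3]:.2f}%)")

# A widening gap between validation and training loss can hint at overfitting.
for epoch, train_loss, val_loss, _ in logs:
    print(f"epoch {epoch}: val - train = {val_loss - train_loss:+.4f}")
```

The two criteria pick different epochs here, though the differences at that point are small.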

Training Summary:

Total steps: 1,175
Training loss (as reported in the run summary): 0.2446
Training runtime: 2,705 seconds (~45 minutes)
Throughput: 13.86 samples/second
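
The step count and throughput can be cross-checked against the other numbers in this card. The sketch below assumes exactly 7,500 training images (the card says "~7,500"), a single device, no gradient accumulation, and that the last partial batch of each epoch counts as a step; none of these are stated explicitly in the card.

```python
import math

# Values taken from this model card; train_samples is assumed to be exactly 7,500.
train_samples = 7_500
batch_size = 32
epochs = 5
runtime_seconds = 2_705

# The final, partially filled batch of each epoch still counts as one step.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Samples processed per second across the whole run.
throughput = train_samples * epochs / runtime_seconds

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total steps: {total_steps}")
print(f"Throughput: {throughput:.2f} samples/second")
print(f"Runtime: {runtime_seconds / 60:.1f} minutes")
```

Under these assumptions the computed values agree with the reported total steps and throughput.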

Reproduce Evaluation

from datasets import load_dataset
from transformers import pipeline
from tqdm import tqdm

# Load the fine-tuned model (device=0 selects the first GPU; use device=-1 to run on CPU)
classifier = pipeline("image-classification", model="Nav772/vit-food-classifier", device=0)

# Load the Food101 validation split, which serves as the held-out evaluation set
dataset = load_dataset("food101", split="validation")

# Keep only the 10 classes the model was fine-tuned on
selected_classes = ["pizza", "sushi", "hamburger", "ice_cream", "steak",
                    "baklava", "cheesecake", "pancakes", "tacos", "ramen"]
class_names = dataset.features['label'].names
# A set gives constant-time membership checks inside the filter
selected_indices = {class_names.index(c) for c in selected_classes}

filtered = dataset.filter(lambda x: x['label'] in selected_indices)

# Evaluate
correct = 0
total = 0

for example in tqdm(filtered):
    pred = classifier(example['image'])[0]['label']
    true_label = class_names[example['label']]
    if pred == true_label:
        correct += 1
    total += 1

print(f"Accuracy: {correct/total:.4f} ({correct}/{total})")
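
A single accuracy figure can hide weak classes. As a hedged sketch, the helper below (`per_class_accuracy` is a hypothetical name, not part of the original script) tallies accuracy per class from `(true_label, predicted_label)` pairs, which the loop above could collect. The pairs shown are toy data for illustration, not real model output.

```python
from collections import Counter

def per_class_accuracy(pairs):
    """Return {label: accuracy} from an iterable of (true_label, predicted_label)."""
    totals, hits = Counter(), Counter()
    for true_label, predicted in pairs:
        totals[true_label] += 1
        if predicted == true_label:
            hits[true_label] += 1
    return {label: hits[label] / totals[label] for label in totals}

# Toy pairs for illustration only; these are NOT results from the model.
example_pairs = [
    ("pizza", "pizza"),
    ("pizza", "tacos"),
    ("sushi", "sushi"),
    ("ramen", "ramen"),
]

report = per_class_accuracy(example_pairs)
for label, acc in sorted(report.items()):
    print(f"{label}: {acc:.2%}")
```

To use it with the evaluation loop, one could append `(true_label, pred)` to a list on each iteration and pass that list to the helper.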

Training Data

Dataset: Food101 (subset)
Train samples: ~7,500 images
Validation samples: ~2,500 images
Classes: 10 food categories

Training Procedure

Base model: google/vit-base-patch16-224
Epochs: 5
Batch size: 32
Learning rate: 3e-5
Image size: 224x224
Mixed precision: FP16
Warmup ratio: 0.1
Weight decay: 0.01
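
To make the warmup ratio concrete, the sketch below computes the learning rate at a given step. It assumes a linear warmup followed by linear decay to zero; the card does not state which scheduler was used, so the decay shape is an assumption. The total step count (1,175) comes from the training summary, and `lr_at_step` is a hypothetical helper name.

```python
import math

base_lr = 3e-5        # learning rate from this card
total_steps = 1_175   # total steps from the training summary
warmup_ratio = 0.1    # warmup ratio from this card

# Number of warmup steps, rounded up.
warmup_steps = math.ceil(total_steps * warmup_ratio)

def lr_at_step(step):
    """Learning rate under ASSUMED linear warmup then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

print(f"Warmup steps: {warmup_steps}")
for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>4}: lr = {lr_at_step(step):.2e}")
```

With these settings the learning rate peaks at the end of warmup, roughly halfway through the first epoch.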

Usage

from transformers import pipeline

classifier = pipeline("image-classification", model="Nav772/vit-food-classifier")
result = classifier("path/to/food/image.jpg")
print(result)

Limitations

Classifies only the 10 food categories listed above; every input is assigned one of these labels, including foods outside them
Accuracy may degrade on low-quality or partially obscured images
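
One simple mitigation for out-of-scope or poor-quality inputs is to reject low-confidence predictions. The sketch below is a heuristic, not a reliable out-of-distribution detector: `top_prediction` and the `min_score` threshold of 0.5 are hypothetical choices, and the input mimics the list of `{"label", "score"}` dictionaries that an image-classification pipeline returns. The scores shown are made-up illustrative values.

```python
def top_prediction(results, min_score=0.5):
    """Return the best label, or None if its score is below min_score.

    results: list of {"label": str, "score": float} dicts.
    min_score: hypothetical threshold; tune it on held-out data.
    """
    if not results:
        return None
    best = max(results, key=lambda r: r["score"])
    return best["label"] if best["score"] >= min_score else None

# Made-up scores for illustration; these are NOT real model outputs.
confident = [{"label": "pizza", "score": 0.97}, {"label": "tacos", "score": 0.02}]
uncertain = [{"label": "steak", "score": 0.31}, {"label": "hamburger", "score": 0.28}]

print(top_prediction(confident))
print(top_prediction(uncertain))
```

A model can still be confidently wrong on unfamiliar foods, so a threshold like this reduces but does not remove the risk.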

Safetensors

Model size: 85.8M params
Tensor type: F32

Evaluation results

Accuracy on food101 (validation set, self-reported): 0.980