---
license: mit
datasets:
- food101
metrics:
- accuracy
pipeline_tag: image-classification
tags:
- vision
- food-classification
- vit
model-index:
- name: vit-food-classifier
  results:
  - task:
      type: image-classification
    dataset:
      name: food101
      type: food101
      split: validation
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.9804
---

# Vision Transformer (ViT) Fine-tuned on Food101 Subset

## Model Description

This model is a fine-tuned version of `google/vit-base-patch16-224` for food image classification across 10 categories drawn from the Food101 dataset.

## Classes

- pizza
- sushi
- hamburger
- ice_cream
- steak
- baklava
- cheesecake
- pancakes
- tacos
- ramen

## Evaluation Results

| Metric | Value |
|--------|-------|
| **Accuracy** | 98.04% |

## Training Logs

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.3254        | 0.1076          | 97.20%   |
| 2     | 0.1216        | 0.0904          | 97.68%   |
| 3     | 0.0361        | 0.0770          | 97.88%   |
| 4     | 0.0118        | 0.0764          | 98.00%   |
| 5     | 0.0084        | 0.0767          | **98.04%** |

**Training Summary:**
- Total steps: 1,175
- Final training loss: 0.2446
- Training runtime: 2,705 seconds (~45 minutes)
- Throughput: 13.86 samples/second
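
The summary numbers are mutually consistent with the dataset and hyperparameters listed in the Training Data and Training Procedure sections (~7,500 training samples, batch size 32, 5 epochs). A quick back-of-the-envelope check:

```python
import math

# Figures from the model card
samples, batch_size, epochs = 7500, 32, 5
runtime_s = 2705

# Optimizer steps: ceil(samples / batch) per epoch, times epochs
steps = math.ceil(samples / batch_size) * epochs
print(steps)  # 1175

# Throughput: total samples processed divided by wall-clock runtime
throughput = samples * epochs / runtime_s
print(round(throughput, 2))  # 13.86
```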

### Reproduce Evaluation
```python
from datasets import load_dataset
from transformers import pipeline
from tqdm import tqdm

# Load the fine-tuned model (device=0 selects the first GPU; use device=-1 for CPU)
classifier = pipeline("image-classification", model="Nav772/vit-food-classifier", device=0)

# Load the validation split used during training
dataset = load_dataset("food101", split="validation")

# Filter to same 10 classes
selected_classes = ["pizza", "sushi", "hamburger", "ice_cream", "steak", 
                    "baklava", "cheesecake", "pancakes", "tacos", "ramen"]
class_names = dataset.features['label'].names
selected_indices = [class_names.index(c) for c in selected_classes]

filtered = dataset.filter(lambda x: x['label'] in selected_indices)

# Evaluate
correct = 0
total = 0

for example in tqdm(filtered):
    pred = classifier(example['image'])[0]['label']
    true_label = class_names[example['label']]
    if pred == true_label:
        correct += 1
    total += 1

print(f"Accuracy: {correct/total:.4f} ({correct}/{total})")
```

## Training Data

- **Dataset**: Food101 (subset)
- **Train samples**: ~7,500 images
- **Validation samples**: ~2,500 images
- **Classes**: 10 food categories

## Training Procedure

- **Base model**: google/vit-base-patch16-224
- **Epochs**: 5
- **Batch size**: 32
- **Learning rate**: 3e-5
- **Image size**: 224x224
- **Mixed precision**: FP16
- **Warmup ratio**: 0.1
- **Weight decay**: 0.01
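
The hyperparameters above could be expressed with the `transformers` Trainer API roughly as follows. This is a sketch reconstructed from the listed values, not the actual training script; the output directory name is illustrative:

```python
from transformers import TrainingArguments

# Sketch only: reconstructs the hyperparameters listed above.
# The exact training script for this model was not published.
training_args = TrainingArguments(
    output_dir="vit-food-classifier",  # illustrative path
    num_train_epochs=5,
    per_device_train_batch_size=32,
    learning_rate=3e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    fp16=True,  # mixed-precision training
)
```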

## Usage
```python
from transformers import pipeline

classifier = pipeline("image-classification", model="Nav772/vit-food-classifier")
result = classifier("path/to/food/image.jpg")
print(result)
```

## Limitations

- Only classifies 10 specific food categories
- May not generalize to food items outside these categories
- Performance may degrade on low-quality or obscured images