# ☕ Arabica Coffee Bean Quality Classification – ViT & MobileNetV2
Trained models from undergraduate thesis research on classifying Arabica coffee bean quality using Vision Transformer (ViT), benchmarked against MobileNetV2 (CNN baseline). Classification follows the Indonesian National Standard (SNI) defect value system across 6 quality grades.
- 📄 Full paper & poster → GitHub Releases
- 💻 Code & notebooks → GitHub Repo
- 📦 Dataset → Kaggle
## 📊 Model Files
| File | Architecture | Pretrain | Patch Size | Train:Test | Train Acc | Test Acc | Prediction Acc |
|------|--------------|----------|------------|------------|-----------|----------|----------------|
| `vit-imgt-16(7030).pth` ⭐ | ViT-B/16 | ImageNet | 16×16 px | 70:30 | 97.02% | 84.72% | 91.67% |
| `vit-imgt-16(8020).pth` | ViT-B/16 | ImageNet | 16×16 px | 80:20 | 96.98% | 86.25% | 88.34% |
| `vit-imgt-16(5050).pth` | ViT-B/16 | ImageNet | 16×16 px | 50:50 | 98.50% | 83.50% | 86.67% |
| `vit-imgt-16(4060).pth` | ViT-B/16 | ImageNet | 16×16 px | 40:60 | 99.80% | 79.72% | 83.34% |
| `vit-imgt-32(7030).pth` | ViT-B/32 | ImageNet | 32×32 px | 70:30 | 97.50% | 81.67% | 76.67% |
| `vit-imgt-32(8020).pth` | ViT-B/32 | ImageNet | 32×32 px | 80:20 | 97.29% | 81.67% | 76.67% |
| `vit-imgt-32(5050).pth` | ViT-B/32 | ImageNet | 32×32 px | 50:50 | 98.50% | 79.67% | 76.67% |
| `vit-imgt-32(4060).pth` | ViT-B/32 | ImageNet | 32×32 px | 40:60 | 99.58% | 77.22% | 61.67% |
| `vit-hgfc-16(8020).pth` | ViT-B/16 | HuggingFace | 16×16 px | 80:20 | 81.11% | 80.83% | 90.00% |
| `vit-hgfc-16(7030).pth` | ViT-B/16 | HuggingFace | 16×16 px | 70:30 | 81.31% | 78.05% | 90.00% |
| `vit-hgfc-16(5050).pth` | ViT-B/16 | HuggingFace | 16×16 px | 50:50 | 79.83% | 75.00% | 85.00% |
| `vit-hgfc-16(4060).pth` | ViT-B/16 | HuggingFace | 16×16 px | 40:60 | 79.58% | 71.53% | 83.34% |
| `vit-hgfc-32(8020).pth` | ViT-B/32 | HuggingFace | 32×32 px | 80:20 | 82.71% | 81.67% | 78.34% |
| `vit-hgfc-32(7030).pth` | ViT-B/32 | HuggingFace | 32×32 px | 70:30 | 82.26% | 79.17% | 78.34% |
| `vit-hgfc-32(5050).pth` | ViT-B/32 | HuggingFace | 32×32 px | 50:50 | 81.33% | 77.67% | 80.00% |
| `vit-hgfc-32(4060).pth` | ViT-B/32 | HuggingFace | 32×32 px | 40:60 | 79.58% | 76.94% | 78.34% |
| `mobilenetv2(7030)weight.pth` | MobileNetV2 | ImageNet | – | 70:30 | 79.29% | 78.33% | 86.67% |
| `mobilenetv2(8020)weight.pth` | MobileNetV2 | ImageNet | – | 80:20 | 79.79% | 78.33% | 75.00% |
| `mobilenetv2(5050)weight.pth` | MobileNetV2 | ImageNet | – | 50:50 | 80.00% | 76.17% | 81.67% |
| `mobilenetv2(4060)weight.pth` | MobileNetV2 | ImageNet | – | 40:60 | 79.79% | 74.17% | 86.67% |
| `coffeebean_vit_best.onnx` + `.onnx.data` | ViT-B/16 (best) | ImageNet | 16×16 px | 70:30 | – | – | 91.67% |
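For the exported `coffeebean_vit_best.onnx`, a minimal ONNX Runtime sketch might look as follows. The preprocessing mirrors the PyTorch transform in the Usage section; reading the input name from the session and taking the first output are assumptions that should be checked against the actual export with `session.get_inputs()` / `session.get_outputs()`.

```python
import numpy as np
from PIL import Image

# ImageNet normalization constants, matching the training transform
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image: Image.Image) -> np.ndarray:
    """Resize to 224x224, normalize with ImageNet stats, return NCHW float32."""
    arr = np.asarray(image.convert("RGB").resize((224, 224)), dtype=np.float32) / 255.0
    arr = (arr - MEAN) / STD
    return arr.transpose(2, 0, 1)[np.newaxis, :]  # HWC -> NCHW, add batch dim

def predict(image_path: str) -> str:
    import onnxruntime as ort  # pip install onnxruntime
    class_names = ['mutu1', 'mutu2', 'mutu3', 'mutu4', 'mutu5', 'mutu6']
    # The .onnx.data external-weights file must sit next to the .onnx file
    session = ort.InferenceSession("coffeebean_vit_best.onnx")
    input_name = session.get_inputs()[0].name
    logits = session.run(None, {input_name: preprocess(Image.open(image_path))})[0]
    return class_names[int(np.argmax(logits))]
```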
## 🏷️ Quality Classes (SNI)
| Label | Grade |
|-------|-------|
| `mutu1` | Specialty – 0 defects per 300 g |
| `mutu2` | Grade 1 – max 11 defect values |
| `mutu3` | Grade 2 – 12–25 defect values |
| `mutu4` | Grade 3 – 26–44 defect values |
| `mutu5` | Grade 4a – 45–60 defect values |
| `mutu6` | Grade 4b – 61–80 defect values |
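The SNI mapping above can be sketched as a small helper that turns a total defect value (per 300 g sample) into the label the models use. The boundaries follow the table directly; treating the gap between grades as belonging to the next grade up (e.g. 1–11 → `mutu2`) is an assumption here.

```python
def sni_grade(defect_value: int) -> str:
    """Map a total SNI defect value (per 300 g sample) to its class label."""
    if defect_value == 0:
        return "mutu1"  # Specialty
    if defect_value <= 11:
        return "mutu2"  # Grade 1
    if defect_value <= 25:
        return "mutu3"  # Grade 2
    if defect_value <= 44:
        return "mutu4"  # Grade 3
    if defect_value <= 60:
        return "mutu5"  # Grade 4a
    if defect_value <= 80:
        return "mutu6"  # Grade 4b
    raise ValueError("defect value exceeds the grades covered here (> 80)")
```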
## 🚀 Usage
### Load best model (ViT ImageNet B/16, 70:30)
```python
import torch
import torchvision.models as models
from torchvision import transforms
from PIL import Image

class_names = ['mutu1', 'mutu2', 'mutu3', 'mutu4', 'mutu5', 'mutu6']

# Rebuild the fine-tuned architecture: ViT-B/16 with a 6-class head
model = models.vit_b_16(weights=None)
model.heads = torch.nn.Sequential(
    torch.nn.Dropout(0.1),
    torch.nn.Linear(768, len(class_names))
)

checkpoint = torch.load('vit-imgt-16(7030).pth', map_location='cpu')
model.load_state_dict(checkpoint)
model.eval()

# Same preprocessing as training: 224x224 input, ImageNet normalization
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

# convert('RGB') guards against grayscale or RGBA inputs
image = Image.open('your_coffee_image.jpg').convert('RGB')
input_tensor = transform(image).unsqueeze(0)  # add batch dimension

with torch.no_grad():
    outputs = model(input_tensor)
    probabilities = torch.softmax(outputs, dim=1)
    predicted_idx = torch.argmax(probabilities).item()

print(f"Predicted: {class_names[predicted_idx]}")
print(f"Confidence: {probabilities[0, predicted_idx].item():.4f}")
```
### Load MobileNetV2 baseline
```python
import torch
import torchvision.models as models

class_names = ['mutu1', 'mutu2', 'mutu3', 'mutu4', 'mutu5', 'mutu6']

# MobileNetV2 with its classifier replaced by a 6-class linear layer
model = models.mobilenet_v2(weights=None)
model.classifier[1] = torch.nn.Linear(model.last_channel, len(class_names))

checkpoint = torch.load('mobilenetv2(7030)weight.pth', map_location='cpu')
model.load_state_dict(checkpoint)
model.eval()
```
## ⚙️ Training Details
| Parameter | Value |
|-----------|-------|
| Framework | PyTorch + torchvision |
| Optimizer | AdamW (lr=1e-3, weight_decay=3e-2) |
| Loss | CrossEntropyLoss |
| Epochs | 90 |
| Batch size | 10 |
| Input size | 224×224 px |
| Dropout | 0.1 (classifier head) |
| Normalization | ImageNet mean/std ([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]) |
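The configuration above wires together as follows; this is a minimal single-step sketch on a toy model and random data, not the thesis training loop (which fine-tunes ViT/MobileNetV2 on the actual bean images for 90 epochs).

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-in for the fine-tuned backbone: flatten a 224x224 RGB image
# into a 6-class linear classifier
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 6))

# Hyperparameters from the table above
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=3e-2)
criterion = nn.CrossEntropyLoss()

images = torch.randn(10, 3, 224, 224)   # batch size 10, as in the thesis
labels = torch.randint(0, 6, (10,))     # 6 quality classes

optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(f"loss after one step: {loss.item():.4f}")
```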
## 🔑 Key Findings
- The best ViT (ImageNet-pretrained B/16, 70:30 split) outperforms the MobileNetV2 baseline by ~5 percentage points in prediction accuracy (91.67% vs. 86.67%)
- ViT is also ~2 images/second faster at inference than MobileNetV2
- Tradeoff: the ViT model file is ~320 MB larger than MobileNetV2's
## 📚 Citation
```bibtex
@thesis{zaafirrahman2024vit,
  author = {Aulya Az Zaafirrahman},
  title  = {Klasifikasi Mutu Biji Kopi Arabika Berbasis Image Processing Menggunakan Metode Vision Transformer (ViT)},
  school = {Universitas Brawijaya},
  type   = {Teknik Industri Pertanian},
  year   = {2024}
}
```

(Title in English: "Image-Processing-Based Quality Classification of Arabica Coffee Beans Using the Vision Transformer (ViT) Method".)