# ☕ Arabica Coffee Bean Quality Classification — ViT & MobileNetV2

Trained models from undergraduate thesis research on classifying Arabica coffee bean quality using Vision Transformer (ViT), benchmarked against MobileNetV2 (CNN baseline). Classification follows the Indonesian National Standard (SNI) defect value system across 6 quality grades.

- 📄 Full paper & poster → GitHub Releases
- 💻 Code & notebooks → GitHub Repo
- 📦 Dataset → Kaggle


πŸ† Model Files

File Architecture Pretrain Patch Size Train:Test Train Acc Test Acc Prediction Acc
vit-imgt-16(7030).pth ⭐ ViT-B/16 ImageNet 16Γ—16 px 70:30 97.02% 84.72% 91.67%
vit-imgt-16(8020).pth ViT-B/16 ImageNet 16Γ—16 px 80:20 96.98% 86.25% 88.34%
vit-imgt-16(5050).pth ViT-B/16 ImageNet 16Γ—16 px 50:50 98.50% 83.50% 86.67%
vit-imgt-16(4060).pth ViT-B/16 ImageNet 16Γ—16 px 40:60 99.80% 79.72% 83.34%
vit-imgt-32(7030).pth ViT-B/32 ImageNet 32Γ—32 px 70:30 97.50% 81.67% 76.67%
vit-imgt-32(8020).pth ViT-B/32 ImageNet 32Γ—32 px 80:20 97.29% 81.67% 76.67%
vit-imgt-32(5050).pth ViT-B/32 ImageNet 32Γ—32 px 50:50 98.50% 79.67% 76.67%
vit-imgt-32(4060).pth ViT-B/32 ImageNet 32Γ—32 px 40:60 99.58% 77.22% 61.67%
vit-hgfc-16(8020).pth ViT-B/16 HuggingFace 16Γ—16 px 80:20 81.11% 80.83% 90.00%
vit-hgfc-16(7030).pth ViT-B/16 HuggingFace 16Γ—16 px 70:30 81.31% 78.05% 90.00%
vit-hgfc-16(5050).pth ViT-B/16 HuggingFace 16Γ—16 px 50:50 79.83% 75.00% 85.00%
vit-hgfc-16(4060).pth ViT-B/16 HuggingFace 16Γ—16 px 40:60 79.58% 71.53% 83.34%
vit-hgfc-32(8020).pth ViT-B/32 HuggingFace 32Γ—32 px 80:20 82.71% 81.67% 78.34%
vit-hgfc-32(7030).pth ViT-B/32 HuggingFace 32Γ—32 px 70:30 82.26% 79.17% 78.34%
vit-hgfc-32(5050).pth ViT-B/32 HuggingFace 32Γ—32 px 50:50 81.33% 77.67% 80.00%
vit-hgfc-32(4060).pth ViT-B/32 HuggingFace 32Γ—32 px 40:60 79.58% 76.94% 78.34%
mobilenetv2(7030)weight.pth MobileNetV2 ImageNet β€” 70:30 79.29% 78.33% 86.67%
mobilenetv2(8020)weight.pth MobileNetV2 ImageNet β€” 80:20 79.79% 78.33% 75.00%
mobilenetv2(5050)weight.pth MobileNetV2 ImageNet β€” 50:50 80.00% 76.17% 81.67%
mobilenetv2(4060)weight.pth MobileNetV2 ImageNet β€” 40:60 79.79% 74.17% 86.67%
coffeebean_vit_best.onnx + .onnx.data ViT-B/16 (best) ImageNet 16Γ—16 px 70:30 β€” β€” 91.67%

πŸ—‚οΈ Quality Classes (SNI)

Label Grade
mutu1 Specialty β€” 0 defects per 300g
mutu2 Grade 1 β€” max 11 defect values
mutu3 Grade 2 β€” 12–25 defect values
mutu4 Grade 3 β€” 26–44 defect values
mutu5 Grade 4a β€” 45–60 defect values
mutu6 Grade 4b β€” 61–80 defect values
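
The grade boundaries above can be expressed as a small lookup that turns a counted defect value into the label the models predict. A minimal sketch; the thresholds come from the table, and `classify_defect_value` is a hypothetical helper name, not part of the released code:

```python
def classify_defect_value(defects: int) -> str:
    """Map a defect value (per 300 g sample) to its SNI label, per the table above."""
    bounds = [(0, 'mutu1'),   # Specialty: 0 defects
              (11, 'mutu2'),  # Grade 1
              (25, 'mutu3'),  # Grade 2
              (44, 'mutu4'),  # Grade 3
              (60, 'mutu5'),  # Grade 4a
              (80, 'mutu6')]  # Grade 4b
    for upper, label in bounds:
        if defects <= upper:
            return label
    raise ValueError(f"defect value {defects} exceeds the SNI scale (max 80)")

print(classify_defect_value(0))   # mutu1 (Specialty)
print(classify_defect_value(30))  # mutu4 (Grade 3)
```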

## 🚀 Usage

### Load the best model (ViT-B/16, ImageNet, 70:30 split)

```python
import torch
import torchvision.models as models
from torchvision import transforms
from PIL import Image

# 1. Set up the model with the 6-class head used in training
class_names = ['mutu1', 'mutu2', 'mutu3', 'mutu4', 'mutu5', 'mutu6']
model = models.vit_b_16(weights=None)
model.heads = torch.nn.Sequential(
    torch.nn.Dropout(0.1),
    torch.nn.Linear(768, len(class_names))
)

# 2. Load weights (download from this repo first)
checkpoint = torch.load('vit-imgt-16(7030).pth', map_location='cpu')
model.load_state_dict(checkpoint)
model.eval()

# 3. Preprocessing (ImageNet standard)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

# 4. Predict (convert to RGB so grayscale/RGBA inputs match the 3-channel normalization)
image = Image.open('your_coffee_image.jpg').convert('RGB')
input_tensor = transform(image).unsqueeze(0)

with torch.no_grad():
    outputs = model(input_tensor)
    probabilities = torch.softmax(outputs, dim=1)
    predicted_idx = torch.argmax(probabilities).item()

print(f"Predicted: {class_names[predicted_idx]}")
print(f"Confidence: {probabilities[0, predicted_idx].item():.4f}")
```

### Load the MobileNetV2 baseline

```python
import torch
import torchvision.models as models

class_names = ['mutu1', 'mutu2', 'mutu3', 'mutu4', 'mutu5', 'mutu6']
model = models.mobilenet_v2(weights=None)
model.classifier[1] = torch.nn.Linear(model.last_channel, len(class_names))

checkpoint = torch.load('mobilenetv2(7030)weight.pth', map_location='cpu')
model.load_state_dict(checkpoint)
model.eval()
```
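
The exported ONNX model (`coffeebean_vit_best.onnx` + `.onnx.data`) can also be run without PyTorch via `onnxruntime`. A sketch under the assumption that the model takes a single `1×3×224×224` float32 input preprocessed like the PyTorch examples; the input name is read from the session rather than assumed, and a random tensor stands in for a real image:

```python
import os
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

class_names = ['mutu1', 'mutu2', 'mutu3', 'mutu4', 'mutu5', 'mutu6']

if os.path.exists('coffeebean_vit_best.onnx'):  # skip gracefully if not downloaded
    import onnxruntime as ort
    session = ort.InferenceSession('coffeebean_vit_best.onnx')
    input_name = session.get_inputs()[0].name
    # Replace with a real image resized to 224x224 and normalized with
    # ImageNet mean/std, laid out as NCHW float32.
    x = np.random.rand(1, 3, 224, 224).astype(np.float32)
    logits = session.run(None, {input_name: x})[0]
    probs = softmax(logits)
    print(f"Predicted: {class_names[int(probs.argmax())]}")
```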

βš™οΈ Training Details

Parameter Value
Framework PyTorch + torchvision
Optimizer AdamW (lr=1e-3, weight_decay=3e-2)
Loss CrossEntropyLoss
Epochs 90
Batch size 10
Input size 224Γ—224 px
Dropout 0.1 (classifier head)
Normalization ImageNet mean/std ([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
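
The hyperparameters above correspond to roughly this optimizer/loss setup. A minimal sketch with a stand-in linear model and random data so it runs without the dataset; substitute the ViT or MobileNetV2 model from the Usage section for real training:

```python
import torch

# Stand-in for the actual ViT/MobileNetV2 model (assumption, for illustration only)
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 6))

# Optimizer and loss exactly as listed in the table above
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=3e-2)
criterion = torch.nn.CrossEntropyLoss()

images = torch.randn(10, 3, 224, 224)   # batch size 10, 224x224 input
labels = torch.randint(0, 6, (10,))     # six SNI quality classes

# One training step
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(f"dummy step loss: {loss.item():.4f}")
```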

## 📊 Key Findings

- ViT-B/16 (ImageNet pretrain) outperforms the MobileNetV2 baseline by about 5 percentage points in accuracy
- ViT is also faster at inference, by roughly 2 images/second
- Tradeoff: the ViT checkpoint is ~320 MB larger than MobileNetV2's

## 📄 Citation

```bibtex
@thesis{zaafirrahman2024vit,
  author  = {Aulya Az Zaafirrahman},
  title   = {Klasifikasi Mutu Biji Kopi Arabika Berbasis Image Processing Menggunakan Metode Vision Transformer (ViT)},
  school  = {Universitas Brawijaya},
  type    = {Undergraduate thesis (Teknik Industri Pertanian)},
  year    = {2024}
}
```

The title translates as "Quality Classification of Arabica Coffee Beans Based on Image Processing Using the Vision Transformer (ViT) Method".