# ☕ Arabica Coffee Bean Quality Classification – ViT & MobileNetV2
Trained models from undergraduate thesis research on classifying Arabica coffee bean quality using Vision Transformer (ViT), benchmarked against MobileNetV2 (CNN baseline). Classification follows the Indonesian National Standard (SNI) defect value system across 6 quality grades.
- 📄 Full paper & poster → GitHub Releases
- 💻 Code & notebooks → GitHub Repo
- 📦 Dataset → Kaggle
## 📊 Model Files
| File | Architecture | Pretrain | Patch Size | Train:Test | Train Acc | Test Acc | Prediction Acc |
|------|--------------|----------|------------|------------|-----------|----------|----------------|
| `vit-imgt-16(7030).pth` ⭐ | ViT-B/16 | ImageNet | 16×16 px | 70:30 | 97.02% | 84.72% | 91.67% |
| `vit-imgt-16(8020).pth` | ViT-B/16 | ImageNet | 16×16 px | 80:20 | 96.98% | 86.25% | 88.34% |
| `vit-imgt-16(5050).pth` | ViT-B/16 | ImageNet | 16×16 px | 50:50 | 98.50% | 83.50% | 86.67% |
| `vit-imgt-16(4060).pth` | ViT-B/16 | ImageNet | 16×16 px | 40:60 | 99.80% | 79.72% | 83.34% |
| `vit-imgt-32(7030).pth` | ViT-B/32 | ImageNet | 32×32 px | 70:30 | 97.50% | 81.67% | 76.67% |
| `vit-imgt-32(8020).pth` | ViT-B/32 | ImageNet | 32×32 px | 80:20 | 97.29% | 81.67% | 76.67% |
| `vit-imgt-32(5050).pth` | ViT-B/32 | ImageNet | 32×32 px | 50:50 | 98.50% | 79.67% | 76.67% |
| `vit-imgt-32(4060).pth` | ViT-B/32 | ImageNet | 32×32 px | 40:60 | 99.58% | 77.22% | 61.67% |
| `vit-hgfc-16(8020).pth` | ViT-B/16 | HuggingFace | 16×16 px | 80:20 | 81.11% | 80.83% | 90.00% |
| `vit-hgfc-16(7030).pth` | ViT-B/16 | HuggingFace | 16×16 px | 70:30 | 81.31% | 78.05% | 90.00% |
| `vit-hgfc-16(5050).pth` | ViT-B/16 | HuggingFace | 16×16 px | 50:50 | 79.83% | 75.00% | 85.00% |
| `vit-hgfc-16(4060).pth` | ViT-B/16 | HuggingFace | 16×16 px | 40:60 | 79.58% | 71.53% | 83.34% |
| `vit-hgfc-32(8020).pth` | ViT-B/32 | HuggingFace | 32×32 px | 80:20 | 82.71% | 81.67% | 78.34% |
| `vit-hgfc-32(7030).pth` | ViT-B/32 | HuggingFace | 32×32 px | 70:30 | 82.26% | 79.17% | 78.34% |
| `vit-hgfc-32(5050).pth` | ViT-B/32 | HuggingFace | 32×32 px | 50:50 | 81.33% | 77.67% | 80.00% |
| `vit-hgfc-32(4060).pth` | ViT-B/32 | HuggingFace | 32×32 px | 40:60 | 79.58% | 76.94% | 78.34% |
| `mobilenetv2(7030)weight.pth` | MobileNetV2 | ImageNet | – | 70:30 | 79.29% | 78.33% | 86.67% |
| `mobilenetv2(8020)weight.pth` | MobileNetV2 | ImageNet | – | 80:20 | 79.79% | 78.33% | 75.00% |
| `mobilenetv2(5050)weight.pth` | MobileNetV2 | ImageNet | – | 50:50 | 80.00% | 76.17% | 81.67% |
| `mobilenetv2(4060)weight.pth` | MobileNetV2 | ImageNet | – | 40:60 | 79.79% | 74.17% | 86.67% |
| `coffeebean_vit_best.onnx` + `.onnx.data` | ViT-B/16 (best) | ImageNet | 16×16 px | 70:30 | – | – | 91.67% |
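For the exported `coffeebean_vit_best.onnx`, a minimal ONNX Runtime sketch might look as follows. The preprocessing mirrors the PyTorch transform in the Usage section; reading the input name from the session and taking the first output are assumptions that should be checked against the actual export with `session.get_inputs()` / `session.get_outputs()`.

```python
import numpy as np
from PIL import Image

# ImageNet normalization constants, matching the training transform
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image: Image.Image) -> np.ndarray:
    """Resize to 224x224, normalize with ImageNet stats, return NCHW float32."""
    arr = np.asarray(image.convert("RGB").resize((224, 224)), dtype=np.float32) / 255.0
    arr = (arr - MEAN) / STD
    return arr.transpose(2, 0, 1)[np.newaxis, :]  # HWC -> NCHW, add batch dim

def predict(image_path: str) -> str:
    import onnxruntime as ort  # pip install onnxruntime
    class_names = ['mutu1', 'mutu2', 'mutu3', 'mutu4', 'mutu5', 'mutu6']
    # The .onnx.data external-weights file must sit next to the .onnx file
    session = ort.InferenceSession("coffeebean_vit_best.onnx")
    input_name = session.get_inputs()[0].name
    logits = session.run(None, {input_name: preprocess(Image.open(image_path))})[0]
    return class_names[int(np.argmax(logits))]
```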
## 🏷️ Quality Classes (SNI)
| Label | Grade |
|-------|-------|
| `mutu1` | Specialty – 0 defects per 300 g |
| `mutu2` | Grade 1 – max 11 defect values |
| `mutu3` | Grade 2 – 12–25 defect values |
| `mutu4` | Grade 3 – 26–44 defect values |
| `mutu5` | Grade 4a – 45–60 defect values |
| `mutu6` | Grade 4b – 61–80 defect values |
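The SNI mapping above can be sketched as a small helper that turns a total defect value (per 300 g sample) into the label the models use. The boundaries follow the table directly; treating the gap between grades as belonging to the next grade up (e.g. 1–11 → `mutu2`) is an assumption here.

```python
def sni_grade(defect_value: int) -> str:
    """Map a total SNI defect value (per 300 g sample) to its class label."""
    if defect_value == 0:
        return "mutu1"  # Specialty
    if defect_value <= 11:
        return "mutu2"  # Grade 1
    if defect_value <= 25:
        return "mutu3"  # Grade 2
    if defect_value <= 44:
        return "mutu4"  # Grade 3
    if defect_value <= 60:
        return "mutu5"  # Grade 4a
    if defect_value <= 80:
        return "mutu6"  # Grade 4b
    raise ValueError("defect value exceeds the grades covered here (> 80)")
```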
## 🚀 Usage
### Load best model (ViT ImageNet B/16, 70:30)
```python
import torch
import torchvision.models as models
from torchvision import transforms
from PIL import Image

class_names = ['mutu1', 'mutu2', 'mutu3', 'mutu4', 'mutu5', 'mutu6']

# Rebuild the fine-tuned architecture: ViT-B/16 with a 6-class head
model = models.vit_b_16(weights=None)
model.heads = torch.nn.Sequential(
    torch.nn.Dropout(0.1),
    torch.nn.Linear(768, len(class_names))
)

checkpoint = torch.load('vit-imgt-16(7030).pth', map_location='cpu')
model.load_state_dict(checkpoint)
model.eval()

# Same preprocessing as training: 224x224 input, ImageNet normalization
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

# convert('RGB') guards against grayscale or RGBA inputs
image = Image.open('your_coffee_image.jpg').convert('RGB')
input_tensor = transform(image).unsqueeze(0)  # add batch dimension

with torch.no_grad():
    outputs = model(input_tensor)
    probabilities = torch.softmax(outputs, dim=1)
    predicted_idx = torch.argmax(probabilities).item()

print(f"Predicted: {class_names[predicted_idx]}")
print(f"Confidence: {probabilities[0, predicted_idx].item():.4f}")
```
### Load MobileNetV2 baseline
```python
import torch
import torchvision.models as models

class_names = ['mutu1', 'mutu2', 'mutu3', 'mutu4', 'mutu5', 'mutu6']

# MobileNetV2 with its classifier replaced by a 6-class linear layer
model = models.mobilenet_v2(weights=None)
model.classifier[1] = torch.nn.Linear(model.last_channel, len(class_names))

checkpoint = torch.load('mobilenetv2(7030)weight.pth', map_location='cpu')
model.load_state_dict(checkpoint)
model.eval()
```
## ⚙️ Training Details
| Parameter | Value |
|-----------|-------|
| Framework | PyTorch + torchvision |
| Optimizer | AdamW (lr=1e-3, weight_decay=3e-2) |
| Loss | CrossEntropyLoss |
| Epochs | 90 |
| Batch size | 10 |
| Input size | 224×224 px |
| Dropout | 0.1 (classifier head) |
| Normalization | ImageNet mean/std ([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]) |
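The configuration above wires together as follows; this is a minimal single-step sketch on a toy model and random data, not the thesis training loop (which fine-tunes ViT/MobileNetV2 on the actual bean images for 90 epochs).

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-in for the fine-tuned backbone: flatten a 224x224 RGB image
# into a 6-class linear classifier
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 6))

# Hyperparameters from the table above
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=3e-2)
criterion = nn.CrossEntropyLoss()

images = torch.randn(10, 3, 224, 224)   # batch size 10, as in the thesis
labels = torch.randint(0, 6, (10,))     # 6 quality classes

optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(f"loss after one step: {loss.item():.4f}")
```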
## 🔑 Key Findings
- The best ViT (ImageNet-pretrained B/16, 70:30 split) outperforms the MobileNetV2 baseline by ~5 percentage points in prediction accuracy (91.67% vs. 86.67%)
- ViT is also ~2 images/second faster at inference than MobileNetV2
- Tradeoff: the ViT model file is ~320 MB larger than MobileNetV2's
## 📚 Citation
```bibtex
@thesis{zaafirrahman2024vit,
  author = {Aulya Az Zaafirrahman},
  title  = {Klasifikasi Mutu Biji Kopi Arabika Berbasis Image Processing Menggunakan Metode Vision Transformer (ViT)},
  school = {Universitas Brawijaya},
  type   = {Teknik Industri Pertanian},
  year   = {2024}
}
```

(Title in English: "Image-Processing-Based Quality Classification of Arabica Coffee Beans Using the Vision Transformer (ViT) Method".)