# MobileViT-XS Garbage Classifier

Compact binary classification model for filtering objectively bad frames during anime video preprocessing. Optimized for small size with minimal accuracy loss.

## Model Details
- Architecture: MobileViT-XS
- Parameters: 1.93M
- Model Size: 7.6MB
- Input Size: 256×256
- Classes: [garbage, quality] (class 0 = garbage, class 1 = quality)
## Performance

At the default threshold (0.5):
- Accuracy: 93.14%
- Precision: 91.77%
- Recall: 95.37%
- F1-Score: 93.54%
At the optimal threshold (0.6315):
- Accuracy: 93.41%
- Precision: 92.91%
- Recall: 94.54%
- F1-Score: 93.72%
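For reference, the metrics above can be recomputed from raw predictions in plain Python. This is a minimal sketch (the function name is illustrative); it treats "garbage" as the positive class, which is an assumption about how precision/recall are reported here.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for a binary task,
    treating label 1 ('garbage', assumed positive class) as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```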
## Usage
```python
import torch
import timm
from torchvision import transforms
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model
model = timm.create_model('mobilevit_xs', num_classes=2, pretrained=False)
model.load_state_dict(torch.load('pytorch_model.bin', map_location=device))
model = model.to(device)
model.eval()

# Prepare image
transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
img = transform(Image.open('frame.webp').convert('RGB')).unsqueeze(0).to(device)

# Predict
with torch.no_grad():
    logits = model(img)
    probs = torch.softmax(logits, dim=1)
    garbage_prob = probs[0, 0].item()  # Class 0 = garbage

# Decision
is_garbage = garbage_prob > 0.6315  # Use optimal threshold
```
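When filtering a whole video, it is convenient to collect per-frame garbage probabilities and apply the threshold in one pass. A minimal sketch of that decision step (function and variable names are illustrative, not part of this repository):

```python
def select_quality_frames(frame_probs, threshold=0.6315):
    """Given (frame_id, garbage_probability) pairs, return the ids of
    frames that pass the filter, i.e. are NOT classified as garbage."""
    return [fid for fid, p in frame_probs if p <= threshold]
```

Probabilities exactly at the threshold are kept here; flip the comparison if you prefer to drop borderline frames.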
## Training Data
- Total frames: 12,440
- Training: 10,574 frames
- Validation: 1,866 frames (895 garbage, 971 quality)
- Labeling: Verified via reverse-engineered frame matching
## Garbage Detection
Filters frames with:
- Solid black/white/uniform color (33%)
- No edge patterns (33%)
- Low detail content (16%)
- Extreme outliers (15%)
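The solid/uniform-color category can also be caught by a cheap heuristic before the model runs at all, which saves inference on trivially bad frames. A rough sketch on raw grayscale values (the standard-deviation cutoff is an assumption, not a value from this model's training pipeline):

```python
from statistics import pstdev

def is_uniform_frame(pixels, std_threshold=2.0):
    """Flag a frame as near-uniform (solid black/white/flat color) when
    the pixel standard deviation is tiny. `pixels` is a flat list of
    grayscale values in [0, 255]; std_threshold is an assumed cutoff."""
    return pstdev(pixels) < std_threshold
```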
## Threshold Recommendations
- Default (0.5): Good starting point, higher recall
- Optimal (0.6315): Best F1-score, balanced precision/recall
- High precision (0.70-0.75): Reduce false positives
- High recall (0.55-0.60): Catch more garbage, accept more false positives
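An operating point like 0.6315 is typically found by sweeping thresholds on the validation set and keeping the one that maximizes F1; the card does not say how it was derived, so the following is a generic sketch of that search (names are illustrative):

```python
def best_f1_threshold(garbage_probs, labels, steps=200):
    """Sweep candidate thresholds and return the (threshold, f1) pair
    maximizing F1, treating 'garbage' (label 1) as the positive class."""
    best = (0.5, 0.0)
    for i in range(1, steps):
        t = i / steps
        preds = [1 if p > t else 0 for p in garbage_probs]
        tp = sum(1 for y, yh in zip(labels, preds) if y == 1 and yh == 1)
        fp = sum(1 for y, yh in zip(labels, preds) if y == 0 and yh == 1)
        fn = sum(1 for y, yh in zip(labels, preds) if y == 1 and yh == 0)
        if tp == 0:
            continue  # no true positives at this threshold
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        f1 = 2 * prec * rec / (prec + rec)
        if f1 > best[1]:
            best = (t, f1)
    return best
```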
## vs MobileViT-S

MobileViT-XS is about 62% smaller (7.6MB vs 20MB) with only a 0.15pp F1 loss (93.72% vs 93.87%). Use it for memory-constrained deployments.
## License
MIT