# Anime Frame Interesting Classifier (ViT v2.0)

## Model Details

- Architecture: MobileViT-Small (Transformer)
- Framework: Hugging Face Transformers
- Input Size: 224x224 RGB images
- Output: Binary classification (Boring/Interesting)
## Performance

Evaluated on the v2.0 test set (433 frames):
- F1 Score: 94.92%
- Accuracy: 94.92%
- Precision: 95.05%
- Recall: 94.92%
- Training data: 3,999 frames (433 of the 4,432 total held out for testing)
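The identical F1/accuracy/recall values are consistent with weighted averaging across both classes (as in scikit-learn's `average='weighted'`). For readers reproducing the evaluation, per-class binary metrics can be computed directly from predictions; a minimal sketch (the labels below are illustrative, not the actual v2.0 holdout):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy plus precision/recall/F1 for the positive class (Interesting = 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Illustrative predictions only
acc, prec, rec, f1 = binary_metrics([1, 0, 1, 1, 0], [1, 0, 1, 0, 0])
print(acc, prec, rec, f1)
```

Note that this sketch reports positive-class metrics only, not the weighted averages above.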
## Intended Use

What it does: Classifies anime frames as either "interesting" (depicting meaningful character/scene details) or "boring" (back-of-head shots, nondescript backgrounds, montages).
Strengths:
- Transformer-based semantic understanding
- Better generalization to style variations
- Good for ensemble voting with CNN model
- Complementary confidence to CNN predictions
When to use:
- Ensemble voting with CNN model for higher confidence
- Applications preferring transformer-based features
- Fine-tuning for downstream anime tasks
When NOT to use:
- Real-world photos or non-anime content
- Frames smaller than 224x224
- Speed-critical deployments (slower than CNN)
## Labels
- Class 0 (Boring): Frames lacking interesting visual details or character focus
- Class 1 (Interesting): Frames with clear character/scene details suitable for downstream tasks
## Model Size & Speed
- Model Size: 19 MB (SafeTensors format)
- Inference Speed: ~25ms per image on GPU
- VRAM Required: ~2 GB (including activations)
- Speed vs CNN: ~25% slower but more semantically aware
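The latency figures above depend heavily on hardware and batch size. A small timing harness like the following can reproduce them on your own setup (a generic sketch; `infer_fn` stands in for a call such as `lambda: model(**inputs)`):

```python
import time

def benchmark_ms(infer_fn, n_warmup=5, n_runs=50):
    """Mean per-call latency in milliseconds, measured after warm-up runs."""
    for _ in range(n_warmup):
        infer_fn()  # warm-up: excludes one-time costs (caching, CUDA init, etc.)
    start = time.perf_counter()
    for _ in range(n_runs):
        infer_fn()
    return (time.perf_counter() - start) / n_runs * 1000.0

# Stand-in workload; replace with the real model call for meaningful numbers.
mean_ms = benchmark_ms(lambda: sum(range(10_000)))
print(f"mean latency: {mean_ms:.3f} ms")
```

For GPU models, synchronize the device inside `infer_fn` (e.g. `torch.cuda.synchronize()`) so asynchronous kernels are included in the measurement.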
## Training Data Composition
- 900 manually curated frames (hand-labeled)
- 1,655 frames filtered via dual-agreement with garbage classifier
- 1,877 frames from curated anime site scraper
- Total: 4,432 frames (90% train, 10% test holdout)
All frames are 224x224 RGB anime screenshots.
## How to Use

### Recommended: Hugging Face Transformers (SafeTensors)
```python
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import torch

# Load model (automatically uses SafeTensors)
processor = AutoImageProcessor.from_pretrained(
    'hf_models/anime-frame-interesting-classifier-vit-v2'
)
model = AutoModelForImageClassification.from_pretrained(
    'hf_models/anime-frame-interesting-classifier-vit-v2',
    trust_remote_code=False  # Safe: SafeTensors prevents code execution
)
model.eval()

image = Image.open('frame.png').convert('RGB')
inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
prediction = logits.argmax(-1).item()
confidence = logits.softmax(-1)[0][prediction].item()
print(f"Prediction: {'Interesting' if prediction == 1 else 'Boring'}")
print(f"Confidence: {confidence:.2%}")
```
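The argmax/softmax step at the end is plain math; if you need it outside PyTorch (e.g. in a lightweight post-processing service), an equivalent pure-Python sketch (label names taken from this card, example logits are illustrative):

```python
import math

ID2LABEL = {0: "Boring", 1: "Interesting"}

def logits_to_prediction(logits):
    """Numerically stable softmax over raw logits; returns (label, confidence)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max to avoid overflow
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = probs.index(max(probs))
    return ID2LABEL[idx], probs[idx]

label, conf = logits_to_prediction([-1.2, 2.3])  # illustrative logits
print(label, f"{conf:.2%}")
```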
### Direct Load with SafeTensors

```python
from transformers import MobileViTForImageClassification, AutoImageProcessor
from safetensors.torch import load_file
from PIL import Image
import torch

# Build the base architecture, then load the fine-tuned weights (secure: no pickle)
model = MobileViTForImageClassification.from_pretrained(
    'apple/mobilevit-small',
    num_labels=2,
    ignore_mismatched_sizes=True  # base checkpoint ships a 1000-class head
)
state_dict = load_file('model.safetensors')
model.load_state_dict(state_dict)
model.eval()

processor = AutoImageProcessor.from_pretrained('apple/mobilevit-small')
image = Image.open('frame.png').convert('RGB')
inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

prediction = outputs.logits.argmax(-1).item()
print(f"Prediction: {'Interesting' if prediction == 1 else 'Boring'}")
```
**Security Note:** This model uses the SafeTensors format (not pickle). SafeTensors is a secure serialization format that cannot execute arbitrary code during loading, unlike pickle-based `.bin` files.
## Comparison with CNN Model

See `anime-frame-interesting-classifier-cnn-v2` for the CNN alternative:
| Metric | ViT (This Model) | CNN | Use Case |
|---|---|---|---|
| F1 | 94.92% | 95.15% | ViT: ensemble, CNN: general |
| Speed | Slower | Faster | CNN preferred for speed |
| Size | 20 MB | 20 MB | Similar footprint |
| Semantics | Better understanding | Good efficiency | ViT for understanding |
| Ensemble | Better recall with CNN voting | Better precision with ViT voting | Use together |
Recommended: Use ensemble voting for maximum confidence:
- Classify with both models
- Flag disagreements for manual review
- Trust when both models agree
## Limitations
- Anime-only: Trained exclusively on anime content
- Speed: Slower inference than CNN (1-2 fps vs 2-3 fps)
- Dataset bias: Training data skewed toward popular anime styles
- Resolution: Trained on 224x224; extreme aspect ratios need preprocessing
- Edge cases: Minimal training on hard-to-classify borderline frames
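For the aspect-ratio limitation above, letterboxing a frame to a square before handing it to the processor avoids distortion from the processor's resize. A minimal sketch using Pillow (the `fill` color is an arbitrary choice):

```python
from PIL import Image

def pad_to_square(img, fill=(0, 0, 0)):
    """Pad the shorter side so later resizing to 224x224 keeps the aspect ratio."""
    w, h = img.size
    side = max(w, h)
    canvas = Image.new("RGB", (side, side), fill)
    canvas.paste(img.convert("RGB"), ((side - w) // 2, (side - h) // 2))
    return canvas

# An ultrawide 640x140 frame becomes a 640x640 square with black bars.
square = pad_to_square(Image.new("RGB", (640, 140)))
print(square.size)  # (640, 640)
```

Cropping to the region of interest is an alternative when the bars would dominate the frame.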
## Training Details
- Dataset: v2.0 (4,432 frames)
- Base Model: apple/mobilevit-small (5.6M parameters)
- Train/Test Split: 90/10 (3,999 train, 433 test)
- Epochs: 20
- Batch Size: 64
- Optimizer: AdamW (lr=1e-4)
- Loss: CrossEntropyLoss
- Augmentation: None (data quality sufficient at this scale)
## Version History
- v2.0 (current): 94.92% F1, retrained on expanded 4,432-frame dataset
- v1.0: 86% F1, 900-frame dataset (legacy, deprecated)
## Citation

If you use this model, please reference:
- Dataset: Anime Frame Interesting v2.0 (4,432 curated frames)
- Architecture: MobileViT-Small (apple/mobilevit-small)
- Framework: Hugging Face Transformers
- Training: PyTorch, 2026
## Ensemble Strategy

For best results, use the CNN and ViT models together:
```python
def ensemble_classify(image_path, cnn_model, vit_model):
    """Classify with both models; flag disagreements for manual review.

    Assumes `classify_with_cnn` / `classify_with_vit` helpers that each
    return a dict of the form {'class': int, 'conf': float}.
    """
    cnn_pred = classify_with_cnn(image_path, cnn_model)
    vit_pred = classify_with_vit(image_path, vit_model)

    if cnn_pred['class'] == vit_pred['class']:
        # Agreement: high confidence
        confidence = (cnn_pred['conf'] + vit_pred['conf']) / 2
        return {
            'prediction': cnn_pred['class'],
            'confidence': confidence,
            'agreement': 'both'
        }
    else:
        # Disagreement: flag for manual review
        return {
            'cnn': cnn_pred,
            'vit': vit_pred,
            'agreement': 'none',
            'recommendation': 'manual_review'
        }
```
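The agreement rule inside `ensemble_classify` can be factored out and exercised without loading either model; the prediction dicts below are illustrative:

```python
def vote(cnn_pred, vit_pred):
    """Agreement rule: average confidence when the models agree, else flag for review."""
    if cnn_pred['class'] == vit_pred['class']:
        return {
            'prediction': cnn_pred['class'],
            'confidence': (cnn_pred['conf'] + vit_pred['conf']) / 2,
            'agreement': 'both',
        }
    return {
        'cnn': cnn_pred,
        'vit': vit_pred,
        'agreement': 'none',
        'recommendation': 'manual_review',
    }

agree = vote({'class': 1, 'conf': 0.97}, {'class': 1, 'conf': 0.93})
split = vote({'class': 1, 'conf': 0.60}, {'class': 0, 'conf': 0.55})
print(agree['agreement'], split['recommendation'])  # both manual_review
```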
## Future Improvements
- Collect 1,000+ edge-case frames for hard-negatives
- Experiment with larger ViT variants (if available)
- Fine-tune for specific anime styles
- Distill to smaller model for faster inference