# Anime Frame Interesting Classifier (ViT v2.0)

## Model Details

- Architecture: MobileViT-Small (Transformer)
- Framework: Hugging Face Transformers
- Input Size: 224x224 RGB images
- Output: Binary classification (Boring/Interesting)
## Performance

Evaluated on the v2.0 test set (433 frames):
- F1 Score: 94.92%
- Accuracy: 94.92%
- Precision: 95.05%
- Recall: 94.92%
- Training data: 3,999 frames (433 of the 4,432 total held out for testing)
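The identical F1/accuracy/recall values are consistent with weighted averaging across both classes (as in scikit-learn's `average='weighted'`). For readers reproducing the evaluation, per-class binary metrics can be computed directly from predictions; a minimal sketch (the labels below are illustrative, not the actual v2.0 holdout):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy plus precision/recall/F1 for the positive class (Interesting = 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Illustrative predictions only
acc, prec, rec, f1 = binary_metrics([1, 0, 1, 1, 0], [1, 0, 1, 0, 0])
print(acc, prec, rec, f1)
```

Note that this sketch reports positive-class metrics only, not the weighted averages above.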
## Intended Use

What it does: Classifies anime frames as either "interesting" (depicting meaningful character/scene details) or "boring" (back-of-head shots, nondescript backgrounds, montages).
Strengths:
- Transformer-based semantic understanding
- Better generalization to style variations
- Good for ensemble voting with CNN model
- Complementary confidence to CNN predictions
When to use:
- Ensemble voting with CNN model for higher confidence
- Applications preferring transformer-based features
- Fine-tuning for downstream anime tasks
When NOT to use:
- Real-world photos or non-anime content
- Frames smaller than 224x224
- Speed-critical deployments (slower than CNN)
## Labels
- Class 0 (Boring): Frames lacking interesting visual details or character focus
- Class 1 (Interesting): Frames with clear character/scene details suitable for downstream tasks
## Model Size & Speed
- Model Size: 19 MB (SafeTensors format)
- Inference Speed: ~25ms per image on GPU
- VRAM Required: ~2 GB (including activations)
- Speed vs CNN: ~25% slower but more semantically aware
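The latency figures above depend heavily on hardware and batch size. A small timing harness like the following can reproduce them on your own setup (a generic sketch; `infer_fn` stands in for a call such as `lambda: model(**inputs)`):

```python
import time

def benchmark_ms(infer_fn, n_warmup=5, n_runs=50):
    """Mean per-call latency in milliseconds, measured after warm-up runs."""
    for _ in range(n_warmup):
        infer_fn()  # warm-up: excludes one-time costs (caching, CUDA init, etc.)
    start = time.perf_counter()
    for _ in range(n_runs):
        infer_fn()
    return (time.perf_counter() - start) / n_runs * 1000.0

# Stand-in workload; replace with the real model call for meaningful numbers.
mean_ms = benchmark_ms(lambda: sum(range(10_000)))
print(f"mean latency: {mean_ms:.3f} ms")
```

For GPU models, synchronize the device inside `infer_fn` (e.g. `torch.cuda.synchronize()`) so asynchronous kernels are included in the measurement.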
## Training Data Composition
- 900 manually curated frames (hand-labeled)
- 1,655 frames filtered via dual-agreement with garbage classifier
- 1,877 frames from curated anime site scraper
- Total: 4,432 frames (90% train, 10% test holdout)
All frames are 224x224 RGB anime screenshots.
## How to Use

### Recommended: Hugging Face Transformers (SafeTensors)
```python
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import torch

# Load model (automatically uses SafeTensors)
processor = AutoImageProcessor.from_pretrained(
    'hf_models/anime-frame-interesting-classifier-vit-v2'
)
model = AutoModelForImageClassification.from_pretrained(
    'hf_models/anime-frame-interesting-classifier-vit-v2',
    trust_remote_code=False  # Safe: SafeTensors prevents code execution
)
model.eval()

image = Image.open('frame.png').convert('RGB')
inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
prediction = logits.argmax(-1).item()
confidence = logits.softmax(-1)[0][prediction].item()
print(f"Prediction: {'Interesting' if prediction == 1 else 'Boring'}")
print(f"Confidence: {confidence:.2%}")
```
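The argmax/softmax step at the end is plain math; if you need it outside PyTorch (e.g. in a lightweight post-processing service), an equivalent pure-Python sketch (label names taken from this card, example logits are illustrative):

```python
import math

ID2LABEL = {0: "Boring", 1: "Interesting"}

def logits_to_prediction(logits):
    """Numerically stable softmax over raw logits; returns (label, confidence)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max to avoid overflow
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = probs.index(max(probs))
    return ID2LABEL[idx], probs[idx]

label, conf = logits_to_prediction([-1.2, 2.3])  # illustrative logits
print(label, f"{conf:.2%}")
```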
### Direct Load with SafeTensors

```python
from transformers import MobileViTForImageClassification, AutoImageProcessor
from safetensors.torch import load_file
from PIL import Image
import torch

# Build the base architecture, then load the fine-tuned weights (secure: no pickle)
model = MobileViTForImageClassification.from_pretrained(
    'apple/mobilevit-small',
    num_labels=2,
    ignore_mismatched_sizes=True  # base checkpoint ships a 1000-class head
)
state_dict = load_file('model.safetensors')
model.load_state_dict(state_dict)
model.eval()

processor = AutoImageProcessor.from_pretrained('apple/mobilevit-small')
image = Image.open('frame.png').convert('RGB')
inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

prediction = outputs.logits.argmax(-1).item()
print(f"Prediction: {'Interesting' if prediction == 1 else 'Boring'}")
```
**Security Note:** This model uses the SafeTensors format (not pickle). SafeTensors is a secure serialization format that cannot execute arbitrary code during loading, unlike pickle-based `.bin` files.
## Comparison with CNN Model

See `anime-frame-interesting-classifier-cnn-v2` for the CNN alternative:
| Metric | ViT (This Model) | CNN | Use Case |
|---|---|---|---|
| F1 | 94.92% | 95.15% | ViT: ensemble, CNN: general |
| Speed | Slower | Faster | CNN preferred for speed |
| Size | 20 MB | 20 MB | Similar footprint |
| Semantics | Better understanding | Good efficiency | ViT for understanding |
| Ensemble | Better recall with CNN voting | Better precision with ViT voting | Use together |
Recommended: Use ensemble voting for maximum confidence:
- Classify with both models
- Flag disagreements for manual review
- Trust when both models agree
## Limitations
- Anime-only: Trained exclusively on anime content
- Speed: Slower inference than CNN (1-2 fps vs 2-3 fps)
- Dataset bias: Training data skewed toward popular anime styles
- Resolution: Trained on 224x224; extreme aspect ratios need preprocessing
- Edge cases: Minimal training on hard-to-classify borderline frames
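For the aspect-ratio limitation above, letterboxing a frame to a square before handing it to the processor avoids distortion from the processor's resize. A minimal sketch using Pillow (the `fill` color is an arbitrary choice):

```python
from PIL import Image

def pad_to_square(img, fill=(0, 0, 0)):
    """Pad the shorter side so later resizing to 224x224 keeps the aspect ratio."""
    w, h = img.size
    side = max(w, h)
    canvas = Image.new("RGB", (side, side), fill)
    canvas.paste(img.convert("RGB"), ((side - w) // 2, (side - h) // 2))
    return canvas

# An ultrawide 640x140 frame becomes a 640x640 square with black bars.
square = pad_to_square(Image.new("RGB", (640, 140)))
print(square.size)  # (640, 640)
```

Cropping to the region of interest is an alternative when the bars would dominate the frame.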
## Training Details
- Dataset: v2.0 (4,432 frames)
- Base Model: apple/mobilevit-small (5.6M parameters)
- Train/Test Split: 90/10 (3,999 train, 433 test)
- Epochs: 20
- Batch Size: 64
- Optimizer: AdamW (lr=1e-4)
- Loss: CrossEntropyLoss
- Augmentation: None (data quality sufficient at this scale)
## Version History
- v2.0 (current): 94.92% F1, retrained on expanded 4,432-frame dataset
- v1.0: 86% F1, 900-frame dataset (legacy, deprecated)
## Citation

If you use this model, please reference:
- Dataset: Anime Frame Interesting v2.0 (4,432 curated frames)
- Architecture: MobileViT-Small (apple/mobilevit-small)
- Framework: Hugging Face Transformers
- Training: PyTorch, 2026
## Ensemble Strategy

For best results, use the CNN and ViT models together:
```python
def ensemble_classify(image_path, cnn_model, vit_model):
    """Classify with both models; flag disagreements for manual review.

    Assumes `classify_with_cnn` / `classify_with_vit` helpers that each
    return a dict of the form {'class': int, 'conf': float}.
    """
    cnn_pred = classify_with_cnn(image_path, cnn_model)
    vit_pred = classify_with_vit(image_path, vit_model)

    if cnn_pred['class'] == vit_pred['class']:
        # Agreement: high confidence
        confidence = (cnn_pred['conf'] + vit_pred['conf']) / 2
        return {
            'prediction': cnn_pred['class'],
            'confidence': confidence,
            'agreement': 'both'
        }
    else:
        # Disagreement: flag for manual review
        return {
            'cnn': cnn_pred,
            'vit': vit_pred,
            'agreement': 'none',
            'recommendation': 'manual_review'
        }
```
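The agreement rule inside `ensemble_classify` can be factored out and exercised without loading either model; the prediction dicts below are illustrative:

```python
def vote(cnn_pred, vit_pred):
    """Agreement rule: average confidence when the models agree, else flag for review."""
    if cnn_pred['class'] == vit_pred['class']:
        return {
            'prediction': cnn_pred['class'],
            'confidence': (cnn_pred['conf'] + vit_pred['conf']) / 2,
            'agreement': 'both',
        }
    return {
        'cnn': cnn_pred,
        'vit': vit_pred,
        'agreement': 'none',
        'recommendation': 'manual_review',
    }

agree = vote({'class': 1, 'conf': 0.97}, {'class': 1, 'conf': 0.93})
split = vote({'class': 1, 'conf': 0.60}, {'class': 0, 'conf': 0.55})
print(agree['agreement'], split['recommendation'])  # both manual_review
```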
## Future Improvements
- Collect 1,000+ edge-case frames for hard-negatives
- Experiment with larger ViT variants (if available)
- Fine-tune for specific anime styles
- Distill to smaller model for faster inference