
Anime Frame Interesting Classifier (ViT v2.0)

Model Details

Architecture: MobileViT-Small (Transformer)
Framework: Hugging Face Transformers
Input Size: 224x224 RGB images
Output: Binary classification (Boring/Interesting)

Performance

Evaluated on v2.0 test set (433 frames):

  • F1 Score: 94.92%
  • Accuracy: 94.92%
  • Precision: 95.05%
  • Recall: 94.92%

Training data: 3,999 frames for training, plus a 433-frame test holdout (4,432 frames total)

Intended Use

What it does: Classifies anime frames as either "interesting" (depicting meaningful character/scene details) or "boring" (back-of-head shots, nondescript backgrounds, montages).

Strengths:

  • Transformer-based semantic understanding
  • Better generalization to style variations
  • Good for ensemble voting with CNN model
  • Complementary confidence to CNN predictions

When to use:

  • Ensemble voting with CNN model for higher confidence
  • Applications preferring transformer-based features
  • Fine-tuning for downstream anime tasks

When NOT to use:

  • Real-world photos or non-anime content
  • Frames smaller than 224x224
  • Speed-critical deployments (slower than CNN)

Labels

  • Class 0 (Boring): Frames lacking interesting visual details or character focus
  • Class 1 (Interesting): Frames with clear character/scene details suitable for downstream tasks
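
The class-index mapping above can be expressed as a plain dictionary when post-processing predictions; a minimal sketch (the `ID2LABEL` names are taken from the class list above, the helper name is illustrative):

```python
# Class-index mapping for this classifier (0 = Boring, 1 = Interesting).
ID2LABEL = {0: "Boring", 1: "Interesting"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}

def label_name(class_index: int) -> str:
    """Translate a predicted class index into its label string."""
    return ID2LABEL[class_index]

print(label_name(1))  # Interesting
```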

Model Size & Speed

  • Model Size: 19 MB (SafeTensors format)
  • Inference Speed: ~25ms per image on GPU
  • VRAM Required: ~2 GB (including activations)
  • Speed vs CNN: ~25% slower but more semantically aware

Training Data Composition

  • 900 manually curated frames (hand-labeled)
  • 1,655 frames filtered via dual-agreement with garbage classifier
  • 1,877 frames from curated anime site scraper
  • Total: 4,432 frames (90% train, 10% test holdout)

All frames are 224x224 RGB anime screenshots.

How to Use

Recommended: HuggingFace Transformers (SafeTensors)

from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import torch

# Load model (automatically uses SafeTensors)
processor = AutoImageProcessor.from_pretrained(
    'hf_models/anime-frame-interesting-classifier-vit-v2'
)
model = AutoModelForImageClassification.from_pretrained(
    'hf_models/anime-frame-interesting-classifier-vit-v2',
    trust_remote_code=False  # no custom modeling code needed; weights ship as SafeTensors
)
model.eval()

image = Image.open('frame.png').convert('RGB')
inputs = processor(image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    prediction = logits.argmax(-1).item()
    confidence = logits.softmax(-1)[0][prediction].item()

print(f"Prediction: {'Interesting' if prediction == 1 else 'Boring'}")
print(f"Confidence: {confidence:.2%}")

Direct Load with SafeTensors

from transformers import MobileViTForImageClassification, AutoImageProcessor
from safetensors.torch import load_file
from PIL import Image
import torch

# Load with SafeTensors (secure)
model = MobileViTForImageClassification.from_pretrained(
    'apple/mobilevit-small',
    num_labels=2,
    ignore_mismatched_sizes=True  # swap the 1000-class ImageNet head for a 2-class head
)
state_dict = load_file('model.safetensors')
model.load_state_dict(state_dict)
model.eval()

processor = AutoImageProcessor.from_pretrained('apple/mobilevit-small')
image = Image.open('frame.png').convert('RGB')
inputs = processor(image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    prediction = outputs.logits.argmax(-1).item()

print(f"Prediction: {'Interesting' if prediction == 1 else 'Boring'}")

Security Note: This model uses SafeTensors format (not pickle). SafeTensors is a secure serialization format that cannot execute arbitrary code during loading, unlike pickle-based .bin files.

Comparison with CNN Model

See anime-frame-interesting-classifier-cnn-v2 for the CNN alternative:

| Metric    | ViT (this model)              | CNN                            | Use case                    |
|-----------|-------------------------------|--------------------------------|-----------------------------|
| F1        | 94.92%                        | 95.15%                         | ViT: ensemble; CNN: general |
| Speed     | Slower                        | Faster                         | CNN preferred for speed     |
| Size      | 20 MB                         | 20 MB                          | Similar footprint           |
| Semantics | Better understanding          | Good efficiency                | ViT for understanding       |
| Ensemble  | Better recall with CNN voting | Better precision with ViT voting | Use together              |

Recommended: Use ensemble voting for maximum confidence:

  1. Classify with both models
  2. Flag disagreements for manual review
  3. Trust when both models agree

Limitations

  • Anime-only: Trained exclusively on anime content
  • Speed: Slower inference than CNN (1-2 fps vs 2-3 fps)
  • Dataset bias: Training data skewed toward popular anime styles
  • Resolution: Trained on 224x224; extreme aspect ratios need preprocessing
  • Edge cases: Minimal training on hard-to-classify borderline frames
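
Frames with extreme aspect ratios can be letterboxed to the expected 224x224 input before running the processor; a minimal sketch using Pillow (the function name and black-padding fill are illustrative choices, not part of this repo):

```python
from PIL import Image

def letterbox_224(img: Image.Image, fill=(0, 0, 0)) -> Image.Image:
    """Resize the longer side to 224 px, then pad the shorter side
    to a 224x224 square, preserving the original aspect ratio."""
    img = img.convert("RGB")
    scale = 224 / max(img.size)
    new_w = max(1, round(img.width * scale))
    new_h = max(1, round(img.height * scale))
    resized = img.resize((new_w, new_h), Image.BILINEAR)
    canvas = Image.new("RGB", (224, 224), fill)
    canvas.paste(resized, ((224 - new_w) // 2, (224 - new_h) // 2))
    return canvas
```

Center-cropping is an alternative when the subject is reliably centered; letterboxing keeps the whole frame at the cost of padded borders.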

Training Details

  • Dataset: v2.0 (4,432 frames)
  • Base Model: apple/mobilevit-small (5.6M parameters)
  • Train/Test Split: 90/10 (3,999 train, 433 test)
  • Epochs: 20
  • Batch Size: 64
  • Optimizer: AdamW (lr=1e-4)
  • Loss: CrossEntropyLoss
  • Augmentation: None (data quality sufficient at this scale)
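
The hyperparameters above correspond to a standard PyTorch fine-tuning loop; a minimal sketch, where `model` and `train_loader` are placeholders rather than objects from this repo:

```python
import torch
from torch import nn

def train(model, train_loader, epochs=20, lr=1e-4, device="cpu"):
    """Fine-tuning loop matching the listed hyperparameters:
    AdamW (lr=1e-4), CrossEntropyLoss, 20 epochs. The batch size
    of 64 is set on the DataLoader, not here."""
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            logits = model(images)  # for a transformers model: model(pixel_values=images).logits
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()
    return model
```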

Version History

  • v2.0 (current): 94.92% F1, retrained on expanded 4,432-frame dataset
  • v1.0: 86% F1, 900-frame dataset (legacy, deprecated)

Citation

If you use this model, please reference:

  • Dataset: Anime Frame Interesting v2.0 (4,432 curated frames)
  • Architecture: MobileViT-Small (apple/mobilevit-small)
  • Framework: Hugging Face Transformers
  • Training: PyTorch, 2026

Ensemble Strategy

For best results, use both CNN and ViT models together:

def ensemble_classify(image_path, cnn_model, vit_model):
    """Classify with both models and flag disagreements.

    classify_with_cnn / classify_with_vit are per-model wrappers (not shown)
    that return a dict of the form {'class': int, 'conf': float}.
    """
    cnn_pred = classify_with_cnn(image_path, cnn_model)
    vit_pred = classify_with_vit(image_path, vit_model)
    
    if cnn_pred['class'] == vit_pred['class']:
        # Agreement: high confidence
        confidence = (cnn_pred['conf'] + vit_pred['conf']) / 2
        return {
            'prediction': cnn_pred['class'],
            'confidence': confidence,
            'agreement': 'both'
        }
    else:
        # Disagreement: flag for manual review
        return {
            'cnn': cnn_pred,
            'vit': vit_pred,
            'agreement': 'none',
            'recommendation': 'manual_review'
        }
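
The agreement logic in `ensemble_classify` can be exercised on its own; a minimal runnable sketch with dummy prediction dicts standing in for real model output:

```python
def ensemble_vote(cnn_pred, vit_pred):
    """Agreement averages confidence; disagreement flags manual review.
    Predictions are dicts of the form {'class': int, 'conf': float}."""
    if cnn_pred["class"] == vit_pred["class"]:
        return {
            "prediction": cnn_pred["class"],
            "confidence": (cnn_pred["conf"] + vit_pred["conf"]) / 2,
            "agreement": "both",
        }
    return {"agreement": "none", "recommendation": "manual_review"}

# Dummy predictions illustrating both branches:
print(ensemble_vote({"class": 1, "conf": 0.96}, {"class": 1, "conf": 0.90}))
print(ensemble_vote({"class": 1, "conf": 0.80}, {"class": 0, "conf": 0.75}))
```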

Future Improvements

  • Collect 1,000+ edge-case frames for hard-negatives
  • Experiment with larger ViT variants (if available)
  • Fine-tune for specific anime styles
  • Distill to smaller model for faster inference