Video Anomaly Detection with TimeSformer

Model Description

This is an EnhancedTimeSformer model trained for video anomaly detection and deepfake detection using a one-class learning approach. The model was trained exclusively on real videos from WebVid-10M and learns to reconstruct normal video frames. Because it has only seen real footage, manipulated or otherwise out-of-distribution clips tend to reconstruct poorly, so anomalies (including deepfakes) are flagged when the reconstruction error (MSE) exceeds a threshold.

Key Features

  • ✅ Self-supervised learning - No labeled deepfake data required for training
  • ✅ Better generalization - More robust to novel deepfake methods than supervised approaches
  • ✅ Optical flow integration - Captures temporal dynamics
  • ✅ Transformer-based - Spatial-temporal attention mechanisms
  • ✅ 100% accuracy on ultra-extreme synthetic deepfakes

Model Architecture

  • Base: TimeSformer (Vision Transformer for Video)
  • Enhancements:
    • Factorized 3D convolutions for efficient spatiotemporal processing
    • Optical flow estimation and encoding
    • 3D patch embeddings
    • 12-layer transformer with 12 attention heads
    • Dual decoder heads (frame reconstruction + flow prediction)
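To illustrate why factorized 3D convolutions are described as efficient, the rough count below compares the weights in a full 3x3x3 convolution with a spatial 1x3x3 convolution followed by a temporal 3x1x1 convolution. The kernel sizes and the channel width of 64 are illustrative assumptions; this card does not state the values the model actually uses.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Number of weights (no bias) in a 3D convolution with kernel (kt, kh, kw)."""
    return c_in * c_out * kt * kh * kw


def factorized_params(c_in, c_mid, c_out, k_spatial=3, k_temporal=3):
    """Weights in a spatial (1 x k x k) conv followed by a temporal (k x 1 x 1) conv."""
    spatial = conv3d_params(c_in, c_mid, 1, k_spatial, k_spatial)
    temporal = conv3d_params(c_mid, c_out, k_temporal, 1, 1)
    return spatial + temporal


CHANNELS = 64  # assumed width, for illustration only

full = conv3d_params(CHANNELS, CHANNELS, 3, 3, 3)
factorized = factorized_params(CHANNELS, CHANNELS, CHANNELS)

print(f"full 3x3x3 conv:   {full} weights")
print(f"factorized conv:   {factorized} weights")
print(f"reduction factor:  {full / factorized:.2f}x")
```

Under these assumptions the factorized form needs fewer than half the weights of the full kernel, which is the usual motivation for the split.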

Training Details

  • Dataset: WebVid-10M (real videos only)
  • Training objective: Self-supervised frame reconstruction
  • Epochs: 15
  • Final validation loss: 0.1821
  • Input: 16 frames at 224x224 resolution
  • Approach: One-class classification via reconstruction error

Performance

On Ultra-Extreme Synthetic Deepfakes:

  • Accuracy: 100%
  • Precision: 100%
  • Recall: 100%
  • F1-Score: 100%
  • False Positive Rate: 0%
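For reference, the metrics listed above follow from confusion-matrix counts as sketched below, with "fake" treated as the positive class. The counts are hypothetical; this card does not report the size of the evaluation set.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute binary detection metrics, treating 'fake' as the positive class."""
    def safe_div(num, den):
        # Return 0.0 instead of raising when a denominator is zero
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "false_positive_rate": safe_div(fp, fp + tn),
    }


# Hypothetical counts, for illustration only
perfect = detection_metrics(tp=50, fp=0, tn=50, fn=0)
imperfect = detection_metrics(tp=40, fp=5, tn=45, fn=10)

for name, result in (("perfect", perfect), ("imperfect", imperfect)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

A result of 100% on every metric corresponds to the first case: no real video flagged and no fake video missed.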

Detection Metrics:

  • Optimal Threshold: 0.3137
  • Real Video MSE: 0.1445 ± 0.0846
  • Fake Video MSE: 0.5559 ± 0.0949
  • Separation Ratio: 3.85x
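The short check below shows how the numbers above relate to each other. It reproduces the separation ratio as fake mean divided by real mean, and expresses the distance from each class mean to the threshold in standard deviations. That the threshold lands about two standard deviations above the real-video mean is an observation from the reported figures, not a statement of how the threshold was chosen.

```python
# Values copied from the detection metrics reported in this card
REAL_MEAN, REAL_STD = 0.1445, 0.0846
FAKE_MEAN, FAKE_STD = 0.5559, 0.0949
THRESHOLD = 0.3137

# Ratio of mean reconstruction errors (fake / real)
separation_ratio = FAKE_MEAN / REAL_MEAN

# Distance from each class mean to the threshold, in standard deviations
real_margin = (THRESHOLD - REAL_MEAN) / REAL_STD
fake_margin = (FAKE_MEAN - THRESHOLD) / FAKE_STD

print(f"separation ratio: {separation_ratio:.2f}x")
print(f"threshold is {real_margin:.2f} std above the real mean")
print(f"threshold is {fake_margin:.2f} std below the fake mean")
```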

Important Notes:

  • ⚠️ Model tested on ultra-extreme synthetic fakes (with obvious artifacts)
  • ⚠️ Real deepfakes are more subtle - expect lower accuracy (estimated 70-85%)
  • ✅ Better cross-dataset generalization than supervised methods
  • ✅ No memorization of specific deepfake method signatures

Usage

import torch
import torch.nn.functional as F
from model import create_model

# Load the checkpoint on CPU first; the model is moved to the GPU below
model = create_model()
checkpoint = torch.load("pytorch_model.ckpt", map_location="cpu")

# Some checkpoints nest the weights under a 'state_dict' key
state_dict = checkpoint.get("state_dict", checkpoint)

# Strip only the leading wrapper prefix from each key
# (str.replace would also remove matches in the middle of a key)
new_state_dict = {}
for k, v in state_dict.items():
    for prefix in ("model.model.", "model."):
        if k.startswith(prefix):
            k = k[len(prefix):]
            break
    new_state_dict[k] = v

# strict=False skips mismatched keys silently, so report them explicitly
load_result = model.load_state_dict(new_state_dict, strict=False)
if load_result.missing_keys or load_result.unexpected_keys:
    print(f"Missing keys: {load_result.missing_keys}")
    print(f"Unexpected keys: {load_result.unexpected_keys}")

model.eval()
model = model.cuda()

# Prepare video: float tensor of shape (B, C, T, H, W) with T=16, H=W=224 and values in [-1, 1]
# preprocess_video and video_path are placeholders for your own loading code
video_tensor = preprocess_video(video_path)
video_tensor = video_tensor.cuda()

# Get prediction
with torch.no_grad():
    frame_pred, flow_pred = model(video_tensor)
    
    # Reconstruction error: MSE between the predicted frame and the clip's middle frame
    mid_frame = video_tensor.shape[2] // 2
    target = video_tensor[:, :, mid_frame]
    mse_error = F.mse_loss(frame_pred, target).item()
    
    # Detect deepfake
    THRESHOLD = 0.3137
    is_fake = mse_error > THRESHOLD
    
    print(f"MSE: {mse_error:.4f}")
    print(f"Prediction: {'FAKE' if is_fake else 'REAL'}")
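The usage snippet leaves `preprocess_video` to the reader. As a minimal sketch of the tensor layout it expects, the NumPy helper below takes already-decoded RGB frames, samples 16 of them evenly, scales pixel values to [-1, 1], and reorders the axes to (B, C, T, H, W). The function names are hypothetical, and the sketch assumes frames are already resized to 224x224. The exact preprocessing used during training is not documented here, so a mismatch may shift the MSE values.

```python
import numpy as np

NUM_FRAMES = 16  # clip length stated in this card


def sample_frame_indices(total_frames, num_frames=NUM_FRAMES):
    """Evenly spaced frame indices covering the whole video (repeats if it is short)."""
    if total_frames <= 0:
        raise ValueError("video has no frames")
    return np.linspace(0, total_frames - 1, num_frames).round().astype(int)


def frames_to_model_input(frames):
    """Convert uint8 frames of shape (T, H, W, 3) to float32 (1, 3, 16, H, W) in [-1, 1].

    Assumes the frames are RGB and already resized to 224x224.
    """
    frames = np.asarray(frames)
    if frames.ndim != 4 or frames.shape[-1] != 3:
        raise ValueError("expected frames with shape (T, H, W, 3)")
    indices = sample_frame_indices(frames.shape[0])
    clip = frames[indices].astype(np.float32) / 127.5 - 1.0  # [0, 255] -> [-1, 1]
    clip = clip.transpose(3, 0, 1, 2)  # (T, H, W, C) -> (C, T, H, W)
    return clip[np.newaxis]  # add the batch dimension


# Random data standing in for a decoded 40-frame video
dummy_frames = np.random.randint(0, 256, size=(40, 224, 224, 3), dtype=np.uint8)
model_input = frames_to_model_input(dummy_frames)
print(model_input.shape, model_input.dtype)
```

The resulting array can be wrapped with `torch.from_numpy(model_input)` before it is passed to the model.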

Limitations

  1. Tested primarily on extreme manipulations - Real deepfakes are more subtle
  2. Reconstruction-based detection - May struggle with high-quality deepfakes that maintain temporal consistency
  3. Threshold sensitivity - Optimal threshold may vary across different video sources
  4. One-class approach - Lower peak accuracy than supervised methods, but better generalization
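Because the optimal threshold may vary across video sources, one option is to recalibrate it on reconstruction errors measured on known-real clips from the target source. The sketch below shows two simple rules: mean plus k standard deviations, and an empirical cutoff for a target false positive rate. The function names and the sample errors are hypothetical, and neither rule is confirmed as the method used to obtain the threshold reported in this card.

```python
import statistics


def threshold_from_std(real_errors, k=2.0):
    """Threshold at mean + k * sample standard deviation of real-clip errors."""
    return statistics.mean(real_errors) + k * statistics.stdev(real_errors)


def threshold_from_fpr(real_errors, target_fpr=0.05):
    """Observed error value that at most target_fpr of the real clips exceed."""
    if not real_errors:
        raise ValueError("need at least one real-clip error")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    ordered = sorted(real_errors)
    allowed = int(target_fpr * len(ordered))  # real clips allowed above the threshold
    return ordered[len(ordered) - 1 - allowed]


# Hypothetical MSE values measured on known-real clips from a new source
real_errors = [0.10, 0.12, 0.13, 0.15, 0.16, 0.18, 0.19, 0.21, 0.24, 0.30]

std_threshold = threshold_from_std(real_errors, k=2.0)
fpr_threshold = threshold_from_fpr(real_errors, target_fpr=0.2)
flagged = sum(error > fpr_threshold for error in real_errors)

print(f"mean + 2 std threshold: {std_threshold:.4f}")
print(f"threshold for 20% FPR:  {fpr_threshold:.4f} ({flagged} of {len(real_errors)} real clips flagged)")
```

A clip is then flagged when its error is strictly greater than the chosen threshold, matching the comparison in the usage code.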

Recommended Use Cases

  • ✅ Initial screening of videos for obvious manipulations
  • ✅ Ensemble component with other detection methods
  • ✅ Research on generalization in deepfake detection
  • ✅ Detection of out-of-distribution videos

Not Recommended For

  • ❌ Sole detector for critical applications
  • ❌ Detection of subtle, professional-grade deepfakes without additional methods
  • ❌ Real-time video verification (model is compute-intensive)

Citation

If you use this model, please cite:

@misc{timesformer-deepfake-detector,
  author = {ash12321},
  title = {Video Anomaly Detection with TimeSformer},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ash12321/deepfake-detector-timesformer}}
}

License

MIT License - See repository for details

Contact

For questions or issues, please open an issue on the Hugging Face repository.


Note: This model represents a research approach to deepfake detection through one-class learning. For production deployments, consider using an ensemble of multiple detection methods including supervised classifiers, biological signal detectors, and temporal consistency checkers.
