Video Anomaly Detection with TimeSformer
Model Description
This is an EnhancedTimeSformer model trained for video anomaly detection and deepfake detection using a one-class learning approach. The model was trained exclusively on real videos from WebVid-10M and learns to reconstruct normal video frames. Anomalies (including deepfakes) are detected by measuring reconstruction error.
Key Features
- Self-supervised learning - No labeled deepfake data required for training
- Better generalization - More robust to novel deepfake methods than supervised approaches
- Optical flow integration - Captures temporal dynamics
- Transformer-based - Spatial-temporal attention mechanisms
- 100% accuracy on ultra-extreme synthetic deepfakes
Model Architecture
- Base: TimeSformer (Vision Transformer for Video)
- Enhancements:
  - Factorized 3D convolutions for efficient spatiotemporal processing
  - Optical flow estimation and encoding
  - 3D patch embeddings
  - 12-layer transformer with 12 attention heads
  - Dual decoder heads (frame reconstruction + flow prediction)
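The card does not say how the 3D convolutions are factorized. One common scheme, often called (2+1)D, splits a k x k x k kernel into a 1 x k x k spatial convolution followed by a k x 1 x 1 temporal one. Assuming that scheme (it may not be the one used here), the sketch below compares weight counts; the function names and channel sizes are illustrative only.

```python
def full_3d_params(c_in, c_out, k):
    """Weights in a dense k x k x k 3D convolution (bias ignored)."""
    return c_in * c_out * k ** 3

def factorized_params(c_in, c_mid, c_out, k):
    """Weights in a spatial 1 x k x k conv followed by a temporal k x 1 x 1 conv.

    ASSUMPTION: (2+1)D-style factorization; the model's actual layout
    is not documented in this card.
    """
    spatial = c_in * c_mid * k * k
    temporal = c_mid * c_out * k
    return spatial + temporal

# Illustrative channel sizes, not taken from the model
dense = full_3d_params(64, 64, 3)
split = factorized_params(64, 64, 64, 3)
print(f"dense: {dense}, factorized: {split}, ratio: {dense / split:.2f}")
```

With equal channel widths, the factorized pair uses fewer weights than the dense kernel, which is the usual motivation for this design.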
Training Details
- Dataset: WebVid-10M (real videos only)
- Training objective: Self-supervised frame reconstruction
- Epochs: 15
- Final validation loss: 0.1821
- Input: 16 frames at 224x224 resolution
- Approach: One-class classification via reconstruction error
Performance
On Ultra-Extreme Synthetic Deepfakes:
- Accuracy: 100%
- Precision: 100%
- Recall: 100%
- F1-Score: 100%
- False Positive Rate: 0%
Detection Metrics:
- Optimal Threshold: 0.3137
- Real Video MSE: 0.1445 ± 0.0846
- Fake Video MSE: 0.5559 ± 0.0949
- Separation Ratio: 3.85x
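The figures above can be cross-checked with a few lines of arithmetic. The sketch below recomputes the separation ratio and shows that the reported threshold coincides with the real-video mean plus two standard deviations. Whether the threshold was actually derived that way is an assumption, as is the Gaussian model used to estimate error rates; all variable names are illustrative.

```python
from statistics import NormalDist

# Values reported in this model card
REAL_MEAN, REAL_STD = 0.1445, 0.0846
FAKE_MEAN, FAKE_STD = 0.5559, 0.0949
THRESHOLD = 0.3137

# Ratio of mean fake MSE to mean real MSE
separation_ratio = FAKE_MEAN / REAL_MEAN

# Distance of the threshold from each class mean, in standard deviations
real_z = (THRESHOLD - REAL_MEAN) / REAL_STD
fake_z = (FAKE_MEAN - THRESHOLD) / FAKE_STD

# ASSUMPTION: MSE is roughly Gaussian within each class
est_false_positive_rate = 1.0 - NormalDist(REAL_MEAN, REAL_STD).cdf(THRESHOLD)
est_false_negative_rate = NormalDist(FAKE_MEAN, FAKE_STD).cdf(THRESHOLD)

print(f"Separation ratio: {separation_ratio:.2f}x")
print(f"Threshold is {real_z:.2f} std above the real mean")
print(f"Threshold is {fake_z:.2f} std below the fake mean")
print(f"Estimated FPR: {est_false_positive_rate:.3%}")
print(f"Estimated FNR: {est_false_negative_rate:.3%}")
```

Under the Gaussian assumption a small share of real clips would still exceed the threshold, so the 0% false positive rate measured on the test set may not hold on larger samples.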
Important Notes:
- Model tested on ultra-extreme synthetic fakes (with obvious artifacts)
- Real deepfakes are more subtle - expect lower accuracy (estimated 70-85%)
- Better cross-dataset generalization than supervised methods
- No memorization of specific deepfake method signatures
Usage
import torch
import torch.nn.functional as F
from model import create_model

# Pick a device; fall back to CPU when CUDA is unavailable
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model
model = create_model()
checkpoint = torch.load("pytorch_model.ckpt", map_location=device)

# Extract state dict (some checkpoints nest the weights under 'state_dict')
if 'state_dict' in checkpoint:
    state_dict = checkpoint['state_dict']
else:
    state_dict = checkpoint

# Clean state dict (strip leading wrapper prefixes only)
new_state_dict = {}
for k, v in state_dict.items():
    if k.startswith('model.model.'):
        new_key = k[len('model.model.'):]
    elif k.startswith('model.'):
        new_key = k[len('model.'):]
    else:
        new_key = k
    new_state_dict[new_key] = v

# strict=False tolerates key mismatches, so report what was skipped
missing, unexpected = model.load_state_dict(new_state_dict, strict=False)
print(f"Missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
model.eval()
model = model.to(device)

# Prepare video (B, C, T, H, W) with values in [-1, 1]
video_tensor = preprocess_video(video_path)  # Your preprocessing
video_tensor = video_tensor.to(device)

# Get prediction
with torch.no_grad():
    frame_pred, flow_pred = model(video_tensor)

# Calculate reconstruction error against the middle frame
mid_frame = video_tensor.shape[2] // 2
target = video_tensor[:, :, mid_frame]
mse_error = F.mse_loss(frame_pred, target).item()

# Detect deepfake
THRESHOLD = 0.3137
is_fake = mse_error > THRESHOLD
print(f"MSE: {mse_error:.4f}")
print(f"Prediction: {'FAKE' if is_fake else 'REAL'}")
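The preprocess_video step is left to the reader. As a minimal sketch of two pieces it would need, the helpers below pick 16 evenly spaced frame indices and map 8-bit pixel values into [-1, 1]. Both function names are hypothetical, and uniform sampling is an assumption: the sampling used during training is not documented here. Decoding, resizing to 224x224, and stacking into a (B, C, T, H, W) tensor are not shown.

```python
def sample_frame_indices(total_frames, clip_len=16):
    """Pick clip_len evenly spaced frame indices from a video.

    ASSUMPTION: uniform sampling; short videos repeat frames to fill
    the clip.
    """
    if total_frames < 1:
        raise ValueError("video has no frames")
    step = total_frames / clip_len
    # Take the centre of each of the clip_len equal segments
    return [int((i + 0.5) * step) for i in range(clip_len)]

def scale_pixel(value):
    """Map an 8-bit pixel value in [0, 255] to [-1, 1]."""
    return value / 127.5 - 1.0

indices = sample_frame_indices(total_frames=160)
print(indices)
print(scale_pixel(0), scale_pixel(255))
```

In practice the same scaling would be applied to whole frame arrays rather than single values.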
Limitations
- Tested primarily on extreme manipulations - Real deepfakes are more subtle
- Reconstruction-based detection - May struggle with high-quality deepfakes that maintain temporal consistency
- Threshold sensitivity - Optimal threshold may vary across different video sources
- One-class approach - Lower peak accuracy than supervised methods, but better generalization
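Because the optimal threshold may shift between video sources, it can help to recalibrate it on a handful of known-real clips from the target source. The sketch below is one simple way to do that, assuming a mean-plus-k-standard-deviations rule; the function name, the default k, and the sample values are all illustrative rather than part of this model's released code.

```python
from statistics import mean, stdev

def calibrate_threshold(real_mse_values, k=2.0):
    """Return mean + k * std of reconstruction errors from known-real clips.

    ASSUMPTION: k=2.0 mirrors the relationship between the reported
    threshold and the reported real-video statistics; tune it per source.
    """
    if len(real_mse_values) < 2:
        raise ValueError("need at least two real-video MSE values")
    return mean(real_mse_values) + k * stdev(real_mse_values)

# Hypothetical MSE values measured on real clips from a new source
example_real_mse = [0.10, 0.12, 0.14, 0.16, 0.18]
new_threshold = calibrate_threshold(example_real_mse)
print(f"Recalibrated threshold: {new_threshold:.4f}")
```

A larger k trades missed detections for fewer false alarms, so the value is best chosen against a validation set from the same source.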
Recommended Use Cases
- Initial screening of videos for obvious manipulations
- Ensemble component with other detection methods
- Research on generalization in deepfake detection
- Detection of out-of-distribution videos
Not Recommended For
- Sole detector for critical applications
- Detection of subtle, professional-grade deepfakes without additional methods
- Real-time video verification (model is compute-intensive)
Citation
If you use this model, please cite:
@misc{timesformer-deepfake-detector,
  author = {ash12321},
  title = {Video Anomaly Detection with TimeSformer},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ash12321/deepfake-detector-timesformer}}
}
License
MIT License - See repository for details
Contact
For questions or issues, please open an issue on the Hugging Face repository.
Note: This model represents a research approach to deepfake detection through one-class learning. For production deployments, consider using an ensemble of multiple detection methods including supervised classifiers, biological signal detectors, and temporal consistency checkers.