---
license: mit
base_model: MCG-NJU/videomae-base
tags:
- video-classification
- crime-detection
- violence-detection
- videomae
- computer-vision
- security
- surveillance
- generated_from_trainer
language:
- en
datasets:
- jinmang2/ucf_crime
metrics:
- accuracy
- precision
- recall
- f1
pipeline_tag: video-classification
model-index:
- name: test-upload-model
  results:
  - task:
      name: Violence Detection
      type: video-classification
    dataset:
      name: UCF Crime Dataset (Subset)
      type: jinmang2/ucf_crime
      args: violence_detection
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.5000
    - name: Precision
      type: precision
      value: 0.2500
    - name: Recall
      type: recall
      value: 0.5000
    - name: F1
      type: f1
      value: 0.3333
---

# Nikeytas/Test Upload Model

This model is a fine-tuned version of [MCG-NJU/videomae-base](https://huggingface.co/MCG-NJU/videomae-base) on the UCF Crime dataset with **event-based binary classification**. It achieves the following results on the evaluation set:

- **Loss**: 0.5847
- **Accuracy**: 0.5000
- **Precision**: 0.2500
- **Recall**: 0.5000
- **F1 Score**: 0.3333

## 🎯 Model Overview

This VideoMAE model has been fine-tuned for **binary violence detection** in video content. The model classifies videos into two categories:

- **Violent Crime** (1): Videos containing violent criminal activities
- **Non-Violent Incident** (0): Videos with non-violent or normal activities

The model is based on the **VideoMAE architecture** and has been specifically trained on a curated subset of the UCF Crime dataset with event-based categorization for realistic crime detection scenarios.
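The integer in parentheses is the class ID the model emits. As a quick sanity check, you can read the mapping straight from the Hub configuration without downloading any weights; this is a minimal sketch assuming the repository's `config.json` stores the standard `id2label` field:

```python
from transformers import AutoConfig

# Fetch only the configuration; no model weights are downloaded.
config = AutoConfig.from_pretrained("Nikeytas/test-upload-model")

# Expected to print a two-entry mapping such as
# {0: "Non-Violent Incident", 1: "Violent Crime"}; the exact label
# strings may differ, so verify against your local copy.
print(config.id2label)
print(config.num_labels)  # expected: 2
```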
## 📊 Dataset & Training

### Dataset Composition

**Total Videos**: 20

- **Violent Crime Videos**: 10
- **Non-Violent Incident Videos**: 10

**Class Balance**: 50.0% violent crimes

**Event Distribution**:

- **Arrest**: 10 videos
- **Arson**: 10 videos

**Data Splits**:

- **Training**: 12 videos
- **Validation**: 4 videos
- **Test**: 4 videos

## 🎯 Performance

### Performance Metrics

**Validation Performance**:

- **eval_loss**: 0.5847
- **eval_accuracy**: 0.5000
- **eval_precision**: 0.2500
- **eval_recall**: 0.5000
- **eval_f1**: 0.3333
- **eval_runtime**: 0.6636
- **eval_samples_per_second**: 6.0270
- **eval_steps_per_second**: 3.0140
- **epoch**: 1.0000

**Test Performance**:

- **eval_loss**: 0.6700
- **eval_accuracy**: 0.5000
- **eval_precision**: 0.2500
- **eval_recall**: 0.5000
- **eval_f1**: 0.3333
- **eval_runtime**: 0.4271
- **eval_samples_per_second**: 9.3660
- **eval_steps_per_second**: 4.6830
- **epoch**: 1.0000

**Training Information**:

- **Training Time**: 0.1 minutes
- **Best Accuracy Achieved**: 0.5000
- **Model Architecture**: VideoMAE Base (fine-tuned)
- **Fine-tuning Approach**: Event-based binary classification

## 🚀 Training Procedure

### Training Hyperparameters

The following hyperparameters were used during training:

- **Learning Rate**: 5e-05
- **Train Batch Size**: 2
- **Eval Batch Size**: 2
- **Optimizer**: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
- **LR Scheduler Type**: Linear
- **Training Epochs**: 1
- **Weight Decay**: 0.01

### Training Results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|---------------|-------|------|-----------------|----------|
| 0.5           | 1.00  | N/A  | 0.5847          | 0.5000   |

### Framework Versions

- **Transformers**: 4.30.2+
- **PyTorch**: 2.0.1+
- **Datasets**: Latest
- **Device**: Apple Silicon MPS / CUDA / CPU (auto-detected)

## 🚀 Quick Start

### Installation

```bash
pip install transformers torch torchvision opencv-python pillow
```

### Basic Usage

```python
import torch
from transformers import AutoModelForVideoClassification, AutoImageProcessor
import cv2
import numpy as np

# Load the fine-tuned model and its frame preprocessor
model = AutoModelForVideoClassification.from_pretrained("Nikeytas/test-upload-model")
processor = AutoImageProcessor.from_pretrained("Nikeytas/test-upload-model")
model.eval()

def classify_video(video_path, num_frames=16):
    # Sample num_frames frames evenly across the video
    cap = cv2.VideoCapture(video_path)
    frames = []
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)

    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ret, frame = cap.read()
        if ret:
            # OpenCV decodes to BGR; the processor expects RGB
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()

    if not frames:
        raise ValueError(f"Could not read any frames from {video_path}")
    # Pad with the last frame if some reads failed, so the clip length stays fixed
    while len(frames) < num_frames:
        frames.append(frames[-1])

    # Preprocess the clip and run inference
    inputs = processor(frames, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1).item()
    confidence = predictions[0][predicted_class].item()

    label = "Violent Crime" if predicted_class == 1 else "Non-Violent Incident"
    return label, confidence

# Example usage
video_path = "path/to/your/video.mp4"
prediction, confidence = classify_video(video_path)
print(f"Prediction: {prediction} (Confidence: {confidence:.3f})")
```

### Batch Processing

```python
from pathlib import Path

def process_video_directory(video_dir, output_file="results.txt"):
    results = []

    for video_file in sorted(Path(video_dir).glob("*.mp4")):
        try:
            prediction, confidence = classify_video(str(video_file))
            results.append({
                "file": video_file.name,
                "prediction": prediction,
                "confidence": confidence,
            })
            print(f"✅ {video_file.name}: {prediction} ({confidence:.3f})")
        except Exception as e:
            print(f"❌ Error processing {video_file.name}: {e}")

    # Save results to a plain-text report
    with open(output_file, "w") as f:
        for result in results:
            f.write(f"{result['file']}: {result['prediction']} ({result['confidence']:.3f})\n")

    return results

# Process all videos in a directory
results = process_video_directory("./videos/")
```

## 📈 Technical Specifications

- **Base Model**: MCG-NJU/videomae-base
- **Architecture**: Vision Transformer (ViT) adapted for video
- **Input Resolution**: 224x224 pixels per frame
- **Temporal Resolution**: 16 frames per video clip
- **Output Classes**: 2 (binary classification)
- **Training Framework**: Hugging Face Transformers
- **Optimization**: AdamW optimizer with learning rate 5e-5
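You can verify the 16-frame, 224x224 input contract with a dummy forward pass; the sketch below assumes VideoMAE's documented `(batch, frames, channels, height, width)` layout for `pixel_values`:

```python
import torch
from transformers import AutoModelForVideoClassification

# Loading the model downloads the fine-tuned weights from the Hub.
model = AutoModelForVideoClassification.from_pretrained("Nikeytas/test-upload-model")
model.eval()

# Dummy clip: 1 video, 16 frames, 3 channels, 224x224 pixels.
pixel_values = torch.randn(1, 16, 3, 224, 224)

with torch.no_grad():
    logits = model(pixel_values=pixel_values).logits

print(logits.shape)  # expected: torch.Size([1, 2]) for binary classification
```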
## ⚠️ Limitations

1. **Dataset Scope**: Trained on a small subset of the UCF Crime dataset, so it may not generalize to all types of violence
2. **Temporal Context**: Uses 16-frame clips, which may miss context in longer sequences
3. **Environmental Bias**: Performance may vary with different lighting, camera angles, and video quality
4. **False Positives**: May misclassify intense but non-violent activities (sports, action movies)
5. **Real-time Performance**: Processing time depends on hardware capabilities

## 🔒 Ethical Considerations

### Intended Use

- **Primary**: Research and development in video analysis
- **Secondary**: Security system enhancement with human oversight
- **Educational**: Computer vision and AI safety research

### Prohibited Uses

- **Surveillance without consent**: Do not use for unauthorized monitoring
- **Discriminatory profiling**: Avoid bias against specific groups or communities
- **Automated punishment**: Never use for automated legal or disciplinary actions
- **Privacy violation**: Respect privacy laws and individual rights

### Bias and Fairness

- The model was trained on a specific dataset that may not represent all populations
- Regular evaluation is needed for bias detection and mitigation
- Human oversight is required for critical applications
- Consider demographic representation in deployment scenarios

## 📝 Model Card Information

- **Developed by**: Research Team
- **Model Type**: Video Classification (Binary)
- **Training Data**: UCF Crime Dataset (Subset)
- **Training Date**: 2025-06-08 15:19:08 UTC
- **Evaluation Metrics**: Accuracy, Precision, Recall, F1-Score
- **Intended Users**: Researchers, Security Professionals, Developers

## 📚 Citation

If you use this model in your research, please cite:

```bibtex
@misc{Nikeytas_test_upload_model,
  title={VideoMAE Fine-tuned for Crime Detection},
  author={Research Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/Nikeytas/test-upload-model}
}
```

## 🤝 Contributing

We welcome contributions to improve the model! Please:

1. Report issues with specific examples
2. Suggest improvements for bias reduction
3. Share evaluation results on new datasets
4. Contribute to documentation and examples

## 📞 Contact

For questions, issues, or collaboration opportunities, please open an issue in the model repository or contact the development team.

---

*Last updated: 2025-06-08 15:19:08 UTC*
*Model version: 1.0*
*Framework: HuggingFace Transformers*