TSAM: Temporal Shift Audio-Visual Model for Viewer Emotion Recognition
Pre-trained model weights for the paper "Decoding Viewer Emotions in Video Ads" by Alexey Antonov, Shravan Sampath Kumar, Jiefei Wei, William Headley, Orlando Wood, and Giovanni Montana, published in Scientific Reports.
- Code: github.com/gmontana/DecodingViewerEmotions
- Dataset: dnamodel/adcumen-viewer-emotions
Model Description
TSAM (Temporal Shift Audio-Visual Model) is a deep learning model that predicts viewer emotional responses to video advertisements. It processes both the visual frames and the audio track of 5-second video clips and classifies the emotional reaction into one of seven categories.
Architecture
- Backbone: ResNet50 pre-trained on ImageNet-21K
- Temporal modeling: Temporal Shift Module (TSM) for efficient video understanding
- Audio-visual fusion: Multimodal fusion of visual and audio features
- Output: 7-class emotion classification
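The Temporal Shift Module referenced above works by shifting a fraction of the feature channels along the time axis, so each frame's features mix with those of adjacent frames at zero extra FLOPs. A minimal sketch of the standard TSM shift operation (this illustrates the published technique, not necessarily the exact implementation in the repo):

```python
import torch

def temporal_shift(x: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
    """Core TSM op on a (batch, time, channels, height, width) tensor.

    1/shift_div of the channels are shifted forward in time,
    another 1/shift_div backward, and the rest left untouched.
    """
    b, t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels unchanged
    return out

x = torch.randn(2, 8, 64, 7, 7)  # e.g. 8 frames of 64-channel 7x7 feature maps
y = temporal_shift(x)
print(y.shape)  # torch.Size([2, 8, 64, 7, 7])
```

Because the shift only moves existing activations, it adds temporal modeling to a 2D backbone such as ResNet50 without any additional parameters.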
Emotion Classes
| ID | Emotion |
|---|---|
| 0 | Anger |
| 1 | Contempt |
| 2 | Disgust |
| 3 | Fear |
| 4 | Happiness |
| 5 | Sadness |
| 6 | Surprise |
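When working with the model's 7-way output, the class IDs above map to labels as follows. A small hypothetical helper (the label list mirrors the table; `decode_prediction` is not part of the repo's API):

```python
import torch

# Class IDs 0-6 in the order given by the table above.
EMOTIONS = ["Anger", "Contempt", "Disgust", "Fear",
            "Happiness", "Sadness", "Surprise"]

def decode_prediction(logits: torch.Tensor) -> str:
    """Map a 7-way logit (or probability) vector to its emotion label."""
    return EMOTIONS[int(logits.argmax())]

logits = torch.tensor([0.1, 0.0, 0.2, 0.1, 2.3, 0.4, 0.5])
print(decode_prediction(logits))  # Happiness
```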
Files
| File | Description |
|---|---|
| `backbone_weights.tar` | ResNet50 backbone pre-trained on ImageNet-21K |
| `tsam_weights.tar` | Trained TSAM model checkpoint (best balanced accuracy) |
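Both files are standard `torch.load`-able checkpoints. The sketch below shows how to inspect one; the nested `"state_dict"` layout and the dummy checkpoint are assumptions for illustration (see the code repository for the actual checkpoint format):

```python
import torch
import torch.nn as nn

# Build a dummy checkpoint in the common {"state_dict": ...} layout
# so this sketch is self-contained; substitute the downloaded
# tsam_weights.tar in practice.
dummy = {"state_dict": nn.Linear(4, 7).state_dict(), "epoch": 42}
torch.save(dummy, "checkpoint_demo.tar")

ckpt = torch.load("checkpoint_demo.tar", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # unwrap if weights are nested
print(sorted(state_dict))  # ['bias', 'weight']
```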
Usage
Download weights
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="dnamodel/tsam-viewer-emotions",
    local_dir="./tsam-weights",
)
```
Inference
See the code repository for full training and inference instructions.
```bash
# 1. Clone the code repo
git clone https://github.com/gmontana/DecodingViewerEmotions.git
cd DecodingViewerEmotions

# 2. Install dependencies
pip install -r requirements.txt

# 3. Download the dataset and model weights (see links above)

# 4. Extract frames and audio
python setup_data.py

# 5. Run inference
python predict.py
```
Requirements
- Python 3.10+
- PyTorch 2.5+
- FFmpeg
- CUDA-capable GPU
Training Details
- Training data: 21,392 five-second video clips from video advertisements
- Validation data: 2,856 clips
- Test data: 2,387 clips
- Annotation: Each original advertisement annotated by ~75 viewers using System1's "Test Your Ad" tool
- Selection criterion: Best balanced accuracy on the validation set
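Balanced accuracy, the selection criterion above, is the mean of per-class recalls, so rare emotions weigh as much as frequent ones. A minimal sketch of the metric (not the repo's evaluation code):

```python
import torch

def balanced_accuracy(preds: torch.Tensor, labels: torch.Tensor,
                      num_classes: int = 7) -> float:
    """Mean of per-class recalls over the classes present in `labels`."""
    recalls = []
    for c in range(num_classes):
        mask = labels == c
        if mask.any():  # skip classes absent from this evaluation set
            recalls.append((preds[mask] == c).float().mean())
    return torch.stack(recalls).mean().item()

# Toy 2-class example: class 0 recall = 2/3, class 1 recall = 1/1
labels = torch.tensor([0, 0, 0, 1])
preds = torch.tensor([0, 0, 1, 1])
print(round(balanced_accuracy(preds, labels, num_classes=2), 4))  # 0.8333
```

On an imbalanced set like this one, plain accuracy (3/4 = 0.75) and balanced accuracy (5/6 ≈ 0.83) differ, which is why the latter is the better model-selection criterion when some emotions are rare.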
Citation
```bibtex
@article{antonov2024decoding,
  title     = {Decoding viewer emotions in video ads},
  author    = {Antonov, Alexey and Kumar, Shravan Sampath and Wei, Jiefei and Headley, William and Wood, Orlando and Montana, Giovanni},
  journal   = {Scientific Reports},
  volume    = {14},
  pages     = {25680},
  year      = {2024},
  publisher = {Nature Publishing Group},
  doi       = {10.1038/s41598-024-76968-9}
}
```
License
The TSAM software and associated weights are provided under a custom license from the University of Warwick. Use is permitted solely for academic research and non-commercial evaluation. See the LICENSE file for full terms.
Contact
- Questions or collaborations: Giovanni Montana (g.montana@warwick.ac.uk)
- Commercial licensing: Warwick Ventures (ventures@warwick.ac.uk)