TSAM: Temporal Shift Audio-Visual Model for Viewer Emotion Recognition
Pre-trained model weights for the paper "Decoding Viewer Emotions in Video Ads" by Alexey Antonov, Shravan Sampath Kumar, Jiefei Wei, William Headley, Orlando Wood, and Giovanni Montana, published in Scientific Reports.
- Code: github.com/gmontana/DecodingViewerEmotions
- Dataset: dnamodel/adcumen-viewer-emotions
Model Description
TSAM (Temporal Shift Audio-Visual Model) is a deep learning model that predicts viewer emotional responses to video advertisements. It processes both the visual frames and the audio track of 5-second video clips and classifies the emotional reaction into one of seven categories.
Architecture
- Backbone: ResNet50 pre-trained on ImageNet-21K
- Temporal modeling: Temporal Shift Module (TSM) for efficient video understanding
- Audio-visual fusion: Multimodal fusion of visual and audio features
- Output: 7-class emotion classification
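The Temporal Shift Module referenced above works by shifting a fraction of the feature channels along the time axis, so each frame's features mix with those of adjacent frames at zero extra FLOPs. A minimal sketch of the standard TSM shift operation (this illustrates the published technique, not necessarily the exact implementation in the repo):

```python
import torch

def temporal_shift(x: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
    """Core TSM op on a (batch, time, channels, height, width) tensor.

    1/shift_div of the channels are shifted forward in time,
    another 1/shift_div backward, and the rest left untouched.
    """
    b, t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels unchanged
    return out

x = torch.randn(2, 8, 64, 7, 7)  # e.g. 8 frames of 64-channel 7x7 feature maps
y = temporal_shift(x)
print(y.shape)  # torch.Size([2, 8, 64, 7, 7])
```

Because the shift only moves existing activations, it adds temporal modeling to a 2D backbone such as ResNet50 without any additional parameters.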
Emotion Classes
| ID | Emotion |
|---|---|
| 0 | Anger |
| 1 | Contempt |
| 2 | Disgust |
| 3 | Fear |
| 4 | Happiness |
| 5 | Sadness |
| 6 | Surprise |
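When working with the model's 7-way output, the class IDs above map to labels as follows. A small hypothetical helper (the label list mirrors the table; `decode_prediction` is not part of the repo's API):

```python
import torch

# Class IDs 0-6 in the order given by the table above.
EMOTIONS = ["Anger", "Contempt", "Disgust", "Fear",
            "Happiness", "Sadness", "Surprise"]

def decode_prediction(logits: torch.Tensor) -> str:
    """Map a 7-way logit (or probability) vector to its emotion label."""
    return EMOTIONS[int(logits.argmax())]

logits = torch.tensor([0.1, 0.0, 0.2, 0.1, 2.3, 0.4, 0.5])
print(decode_prediction(logits))  # Happiness
```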
Files
| File | Description |
|---|---|
| `backbone_weights.tar` | ResNet50 backbone pre-trained on ImageNet-21K |
| `tsam_weights.tar` | Trained TSAM model checkpoint (best balanced accuracy) |
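Both files are standard `torch.load`-able checkpoints. The sketch below shows how to inspect one; the nested `"state_dict"` layout and the dummy checkpoint are assumptions for illustration (see the code repository for the actual checkpoint format):

```python
import torch
import torch.nn as nn

# Build a dummy checkpoint in the common {"state_dict": ...} layout
# so this sketch is self-contained; substitute the downloaded
# tsam_weights.tar in practice.
dummy = {"state_dict": nn.Linear(4, 7).state_dict(), "epoch": 42}
torch.save(dummy, "checkpoint_demo.tar")

ckpt = torch.load("checkpoint_demo.tar", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # unwrap if weights are nested
print(sorted(state_dict))  # ['bias', 'weight']
```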
Usage
Download weights
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="dnamodel/tsam-viewer-emotions",
    local_dir="./tsam-weights",
)
```
Inference
See the code repository for full training and inference instructions.
```bash
# 1. Clone the code repo
git clone https://github.com/gmontana/DecodingViewerEmotions.git
cd DecodingViewerEmotions

# 2. Install dependencies
pip install -r requirements.txt

# 3. Download the dataset and model weights (see links above)

# 4. Extract frames and audio
python setup_data.py

# 5. Run inference
python predict.py
```
Requirements
- Python 3.10+
- PyTorch 2.5+
- FFmpeg
- CUDA-capable GPU
Training Details
- Training data: 21,392 five-second video clips from video advertisements
- Validation data: 2,856 clips
- Test data: 2,387 clips
- Annotation: Each original advertisement annotated by ~75 viewers using System1's "Test Your Ad" tool
- Selection criterion: Best balanced accuracy on the validation set
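Balanced accuracy, the selection criterion above, is the mean of per-class recalls, so rare emotions weigh as much as frequent ones. A minimal sketch of the metric (not the repo's evaluation code):

```python
import torch

def balanced_accuracy(preds: torch.Tensor, labels: torch.Tensor,
                      num_classes: int = 7) -> float:
    """Mean of per-class recalls over the classes present in `labels`."""
    recalls = []
    for c in range(num_classes):
        mask = labels == c
        if mask.any():  # skip classes absent from this evaluation set
            recalls.append((preds[mask] == c).float().mean())
    return torch.stack(recalls).mean().item()

# Toy 2-class example: class 0 recall = 2/3, class 1 recall = 1/1
labels = torch.tensor([0, 0, 0, 1])
preds = torch.tensor([0, 0, 1, 1])
print(round(balanced_accuracy(preds, labels, num_classes=2), 4))  # 0.8333
```

On an imbalanced set like this one, plain accuracy (3/4 = 0.75) and balanced accuracy (5/6 ≈ 0.83) differ, which is why the latter is the better model-selection criterion when some emotions are rare.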
Citation
```bibtex
@article{antonov2024decoding,
  title     = {Decoding viewer emotions in video ads},
  author    = {Antonov, Alexey and Kumar, Shravan Sampath and Wei, Jiefei and Headley, William and Wood, Orlando and Montana, Giovanni},
  journal   = {Scientific Reports},
  volume    = {14},
  pages     = {25680},
  year      = {2024},
  publisher = {Nature Publishing Group},
  doi       = {10.1038/s41598-024-76968-9}
}
```
License
The TSAM software and associated weights are provided under a custom license from the University of Warwick. Use is permitted solely for academic research and non-commercial evaluation. See the LICENSE file for full terms.
Contact
- Questions or collaborations: Giovanni Montana (g.montana@warwick.ac.uk)
- Commercial licensing: Warwick Ventures (ventures@warwick.ac.uk)