# Multimodal Sentiment Model with Augmentation

DeBERTa-v3-Large-based multimodal sentiment analysis model for CMU-MOSEI.
## Model Description

This model was trained for multimodal sentiment analysis on the CMU-MOSEI dataset. It was developed as part of the IITP "๋๋นํจ๊ณผ" research project.
## Architecture
- Text Encoder: DeBERTa-v3-Large (microsoft/deberta-v3-large)
- Audio Encoder: Transformer Encoder (2 layers)
- Video Encoder: Transformer Encoder (2 layers)
- Fusion: Cross-modal attention + Multi-head self-attention
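The fusion stage can be sketched with standard PyTorch attention modules. This is an illustrative reconstruction, not the released code: the class name `CrossModalFusion` and the exact residual wiring are assumptions, while `hidden_size=512` and `num_heads=8` follow the Usage section below.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch: text queries attend to audio/video, then self-attention
    mixes the fused sequence. Residual wiring is an assumption."""
    def __init__(self, hidden_size=512, num_heads=8, dropout=0.2):
        super().__init__()
        kw = dict(dropout=dropout, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(hidden_size, num_heads, **kw)
        self.text_to_video = nn.MultiheadAttention(hidden_size, num_heads, **kw)
        self.self_attn = nn.MultiheadAttention(hidden_size, num_heads, **kw)
        self.norm1 = nn.LayerNorm(hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)

    def forward(self, text, audio, video):
        a, _ = self.text_to_audio(text, audio, audio)   # text queries audio
        v, _ = self.text_to_video(text, video, video)   # text queries video
        fused = self.norm1(text + a + v)                # residual fusion
        s, _ = self.self_attn(fused, fused, fused)      # mix the fused sequence
        return self.norm2(fused + s)

fusion = CrossModalFusion().eval()
text = torch.randn(2, 64, 512)     # token embeddings from DeBERTa
audio = torch.randn(2, 500, 512)   # audio features projected to hidden_size
video = torch.randn(2, 500, 512)   # video features projected to hidden_size
out = fusion(text, audio, video)   # fused sequence follows the text length
```

Note that the output keeps the text sequence length: the text stream acts as the query in both cross-modal attentions.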
## Key Features
- Cross-modal attention between text, audio, and video
- Mixup augmentation for audio/video modalities
- Multi-task learning with auxiliary classifiers (T, A, V branches)
- Frozen first 20 layers of DeBERTa for efficient training
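The mixup step for the non-text modalities can be sketched as below. `alpha=0.4` and `prob=0.5` follow the Training Details section; the function name `mixup_av` and the choice to mix audio and video with a shared coefficient are illustrative assumptions, not the repo's exact implementation.

```python
import torch

def mixup_av(audio, video, labels, alpha=0.4, prob=0.5):
    """Mixup for the audio/video streams. Returns mixed features plus both
    label sets and the mixing weight, so the training loss can be computed
    as lam * loss(y_a) + (1 - lam) * loss(y_b)."""
    if torch.rand(1).item() > prob:       # apply mixup only with probability `prob`
        return audio, video, labels, labels, 1.0
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(audio.size(0))  # pair each example with another in the batch
    audio = lam * audio + (1 - lam) * audio[perm]
    video = lam * video + (1 - lam) * video[perm]
    return audio, video, labels, labels[perm], lam

a, v, y_a, y_b, lam = mixup_av(
    torch.randn(4, 500, 74),   # COVAREP-shaped audio batch
    torch.randn(4, 500, 35),   # OpenFace-shaped video batch
    torch.randn(4),            # sentiment targets
    prob=1.0,                  # force mixing for the demo
)
```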
## Performance
| Metric | Score |
|---|---|
| Mult_acc_7 | 56.17% |
| Mult_acc_5 | 57.83% |
| Has0_acc_2 | ~84% |
| MAE | - |
| Corr | - |
## Comparison with Baselines
| Model | Mult_acc_7 |
|---|---|
| MulT (2020) | 50.7% |
| MMML (2023) | 54.95% |
| Ours | 56.17% |
## Training Details
- Dataset: CMU-MOSEI (unaligned_50.pkl)
- Batch Size: 16
- Learning Rate: 2e-5 (other), 5e-6 (DeBERTa)
- Epochs: 50 (early-stopping patience: 15)
- Optimizer: AdamW
- Scheduler: Cosine with warmup
- Mixup: alpha=0.4, prob=0.5
- Loss weights: cls=0.7, aux=0.1
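The two-learning-rate setup (5e-6 for DeBERTa, 2e-5 elsewhere) combined with freezing the first 20 encoder layers can be sketched with AdamW parameter groups. The model here is a toy stand-in, not the real architecture; module names are illustrative.

```python
import torch.nn as nn
from torch.optim import AdamW

# Toy stand-in for the real model: a 24-layer "encoder" plus a head.
encoder = nn.ModuleList([nn.Linear(8, 8) for _ in range(24)])
head = nn.Linear(8, 1)

# Freeze the first 20 layers, as in the training setup.
for layer in encoder[:20]:
    for p in layer.parameters():
        p.requires_grad = False

# Two parameter groups: 5e-6 for the unfrozen encoder layers, 2e-5 for the rest.
optimizer = AdamW([
    {"params": [p for l in encoder[20:] for p in l.parameters()], "lr": 5e-6},
    {"params": head.parameters(), "lr": 2e-5},
])
```

Only the unfrozen parameters are handed to the optimizer, so the frozen layers incur neither gradient computation nor optimizer state.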
## Usage

```python
import torch
from transformers import AutoTokenizer

from train_deberta_multimodal import DeBERTaMultimodalModel

# Load the checkpoint (on CPU; move to GPU afterwards if needed)
checkpoint = torch.load('best_model.pt', map_location='cpu')
args = checkpoint['args']

# Initialize the model with the training hyperparameters
model = DeBERTaMultimodalModel(
    model_name='microsoft/deberta-v3-large',
    hidden_size=512,
    num_heads=8,
    dropout=0.2,
    freeze_deberta_layers=20,
)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-large')
```
## Input Format
- Text: Raw text string (tokenized by DeBERTa tokenizer)
- Audio: COVAREP features (74-dim, 500 timesteps)
- Video: OpenFace features (35-dim, 500 timesteps)
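Raw clips rarely have exactly 500 timesteps, so features must be fitted to the fixed length above. A minimal sketch (the helper name `pad_or_truncate` is an assumption; the repo's own data loader may handle this differently):

```python
import torch

def pad_or_truncate(feats, length=500):
    """Fit a (T, D) feature sequence to exactly `length` timesteps:
    zero-pad short clips, truncate long ones."""
    t, d = feats.shape
    if t >= length:
        return feats[:length]
    return torch.cat([feats, feats.new_zeros(length - t, d)], dim=0)

audio = pad_or_truncate(torch.randn(321, 74))   # COVAREP clip -> (500, 74)
video = pad_or_truncate(torch.randn(820, 35))   # OpenFace clip -> (500, 35)
```

Add a batch dimension (e.g. `audio.unsqueeze(0)`) before passing the tensors to the model.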