# Multimodal Sentiment Model with Augmentation

DeBERTa-v3-Large-based multimodal sentiment analysis model for CMU-MOSEI.
## Model Description

This model was trained for multimodal sentiment analysis on the CMU-MOSEI dataset. It was developed as part of the IITP "๋๋นํจ๊ณผ" research project.
## Architecture
- Text Encoder: DeBERTa-v3-Large (microsoft/deberta-v3-large)
- Audio Encoder: Transformer Encoder (2 layers)
- Video Encoder: Transformer Encoder (2 layers)
- Fusion: Cross-modal attention + Multi-head self-attention
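The fusion stage can be sketched with standard PyTorch attention modules. This is an illustrative reconstruction, not the released code: the class name `CrossModalFusion` and the exact residual wiring are assumptions, while `hidden_size=512` and `num_heads=8` follow the Usage section below.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch: text queries attend to audio/video, then self-attention
    mixes the fused sequence. Residual wiring is an assumption."""
    def __init__(self, hidden_size=512, num_heads=8, dropout=0.2):
        super().__init__()
        kw = dict(dropout=dropout, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(hidden_size, num_heads, **kw)
        self.text_to_video = nn.MultiheadAttention(hidden_size, num_heads, **kw)
        self.self_attn = nn.MultiheadAttention(hidden_size, num_heads, **kw)
        self.norm1 = nn.LayerNorm(hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)

    def forward(self, text, audio, video):
        a, _ = self.text_to_audio(text, audio, audio)   # text queries audio
        v, _ = self.text_to_video(text, video, video)   # text queries video
        fused = self.norm1(text + a + v)                # residual fusion
        s, _ = self.self_attn(fused, fused, fused)      # mix the fused sequence
        return self.norm2(fused + s)

fusion = CrossModalFusion().eval()
text = torch.randn(2, 64, 512)     # token embeddings from DeBERTa
audio = torch.randn(2, 500, 512)   # audio features projected to hidden_size
video = torch.randn(2, 500, 512)   # video features projected to hidden_size
out = fusion(text, audio, video)   # fused sequence follows the text length
```

Note that the output keeps the text sequence length: the text stream acts as the query in both cross-modal attentions.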
## Key Features
- Cross-modal attention between text, audio, and video
- Mixup augmentation for audio/video modalities
- Multi-task learning with auxiliary classifiers (T, A, V branches)
- Frozen first 20 layers of DeBERTa for efficient training
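The mixup step for the non-text modalities can be sketched as below. `alpha=0.4` and `prob=0.5` follow the Training Details section; the function name `mixup_av` and the choice to mix audio and video with a shared coefficient are illustrative assumptions, not the repo's exact implementation.

```python
import torch

def mixup_av(audio, video, labels, alpha=0.4, prob=0.5):
    """Mixup for the audio/video streams. Returns mixed features plus both
    label sets and the mixing weight, so the training loss can be computed
    as lam * loss(y_a) + (1 - lam) * loss(y_b)."""
    if torch.rand(1).item() > prob:       # apply mixup only with probability `prob`
        return audio, video, labels, labels, 1.0
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(audio.size(0))  # pair each example with another in the batch
    audio = lam * audio + (1 - lam) * audio[perm]
    video = lam * video + (1 - lam) * video[perm]
    return audio, video, labels, labels[perm], lam

a, v, y_a, y_b, lam = mixup_av(
    torch.randn(4, 500, 74),   # COVAREP-shaped audio batch
    torch.randn(4, 500, 35),   # OpenFace-shaped video batch
    torch.randn(4),            # sentiment targets
    prob=1.0,                  # force mixing for the demo
)
```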
## Performance
| Metric | Score |
|---|---|
| Mult_acc_7 | 56.17% |
| Mult_acc_5 | 57.83% |
| Has0_acc_2 | ~84% |
| MAE | - |
| Corr | - |
## Comparison with Baselines
| Model | Mult_acc_7 |
|---|---|
| MulT (2020) | 50.7% |
| MMML (2023) | 54.95% |
| Ours | 56.17% |
## Training Details
- Dataset: CMU-MOSEI (unaligned_50.pkl)
- Batch Size: 16
- Learning Rate: 2e-5 (other), 5e-6 (DeBERTa)
- Epochs: 50 (early-stopping patience: 15)
- Optimizer: AdamW
- Scheduler: Cosine with warmup
- Mixup: alpha=0.4, prob=0.5
- Loss weights: cls=0.7, aux=0.1
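The two-learning-rate setup (5e-6 for DeBERTa, 2e-5 elsewhere) combined with freezing the first 20 encoder layers can be sketched with AdamW parameter groups. The model here is a toy stand-in, not the real architecture; module names are illustrative.

```python
import torch.nn as nn
from torch.optim import AdamW

# Toy stand-in for the real model: a 24-layer "encoder" plus a head.
encoder = nn.ModuleList([nn.Linear(8, 8) for _ in range(24)])
head = nn.Linear(8, 1)

# Freeze the first 20 layers, as in the training setup.
for layer in encoder[:20]:
    for p in layer.parameters():
        p.requires_grad = False

# Two parameter groups: 5e-6 for the unfrozen encoder layers, 2e-5 for the rest.
optimizer = AdamW([
    {"params": [p for l in encoder[20:] for p in l.parameters()], "lr": 5e-6},
    {"params": head.parameters(), "lr": 2e-5},
])
```

Only the unfrozen parameters are handed to the optimizer, so the frozen layers incur neither gradient computation nor optimizer state.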
## Usage

```python
import torch
from transformers import AutoTokenizer

from train_deberta_multimodal import DeBERTaMultimodalModel

# Load the checkpoint (on CPU; move to GPU afterwards if needed)
checkpoint = torch.load('best_model.pt', map_location='cpu')
args = checkpoint['args']

# Initialize the model with the training hyperparameters
model = DeBERTaMultimodalModel(
    model_name='microsoft/deberta-v3-large',
    hidden_size=512,
    num_heads=8,
    dropout=0.2,
    freeze_deberta_layers=20,
)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-large')
```
## Input Format
- Text: Raw text string (tokenized by DeBERTa tokenizer)
- Audio: COVAREP features (74-dim, 500 timesteps)
- Video: OpenFace features (35-dim, 500 timesteps)
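Raw clips rarely have exactly 500 timesteps, so features must be fitted to the fixed length above. A minimal sketch (the helper name `pad_or_truncate` is an assumption; the repo's own data loader may handle this differently):

```python
import torch

def pad_or_truncate(feats, length=500):
    """Fit a (T, D) feature sequence to exactly `length` timesteps:
    zero-pad short clips, truncate long ones."""
    t, d = feats.shape
    if t >= length:
        return feats[:length]
    return torch.cat([feats, feats.new_zeros(length - t, d)], dim=0)

audio = pad_or_truncate(torch.randn(321, 74))   # COVAREP clip -> (500, 74)
video = pad_or_truncate(torch.randn(820, 35))   # OpenFace clip -> (500, 35)
```

Add a batch dimension (e.g. `audio.unsqueeze(0)`) before passing the tensors to the model.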