# Multimodal Sentiment Model with Augmentation
A DeBERTa-v3-Large-based multimodal sentiment analysis model (CMU-MOSEI)
## Model Description
This model was trained for multimodal sentiment analysis on the CMU-MOSEI dataset. It was developed as part of an IITP research project.
### Architecture
- **Text Encoder**: DeBERTa-v3-Large (microsoft/deberta-v3-large)
- **Audio Encoder**: Transformer Encoder (2 layers)
- **Video Encoder**: Transformer Encoder (2 layers)
- **Fusion**: Cross-modal attention + Multi-head self-attention
### Key Features
- Cross-modal attention between text, audio, and video
- Mixup augmentation for audio/video modalities
- Multi-task learning with auxiliary classifiers (T, A, V branches)
- Frozen first 20 layers of DeBERTa for efficient training
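The cross-modal attention listed above can be sketched as follows. This is a hypothetical illustration (the class name `CrossModalAttention` and the residual + LayerNorm arrangement are assumptions, not the repository's actual code): one modality's sequence serves as the query while another modality provides keys and values, using the card's stated `hidden_size=512`, `num_heads=8`, and `dropout=0.2`.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Hypothetical sketch: queries from one modality attend over another."""
    def __init__(self, hidden_size=512, num_heads=8, dropout=0.2):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            hidden_size, num_heads, dropout=dropout, batch_first=True
        )
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, query_seq, context_seq):
        # query_seq: (B, Lq, H), e.g. text; context_seq: (B, Lc, H), e.g. audio
        attended, _ = self.attn(query_seq, context_seq, context_seq)
        # Residual connection + layer norm keeps the query modality's shape
        return self.norm(query_seq + attended)

# Example: text tokens attend over an audio feature sequence
text = torch.randn(2, 64, 512)
audio = torch.randn(2, 500, 512)
fused = CrossModalAttention()(text, audio)
print(fused.shape)  # torch.Size([2, 64, 512])
```

The output keeps the query sequence's length, so text stays the "anchor" modality while absorbing audio/video context.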
## Performance
| Metric | Score |
|--------|-------|
| **Mult_acc_7** | **56.17%** |
| Mult_acc_5 | 57.83% |
| Has0_acc_2 | ~84% |
| MAE | - |
| Corr | - |
### Comparison with Baselines
| Model | Mult_acc_7 |
|-------|-----------|
| MulT (2020) | 50.7% |
| MMML (2023) | 54.95% |
| **Ours** | **56.17%** |
## Training Details
- **Dataset**: CMU-MOSEI (unaligned_50.pkl)
- **Batch Size**: 16
- **Learning Rate**: 2e-5 (non-DeBERTa parameters), 5e-6 (DeBERTa)
- **Epochs**: 50 max (early-stopping patience: 15)
- **Optimizer**: AdamW
- **Scheduler**: Cosine with warmup
- **Mixup**: alpha=0.4, prob=0.5
- **Loss weights**: cls=0.7, aux=0.1
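The mixup setting above (alpha=0.4, prob=0.5) can be sketched as below. This is a hedged reconstruction, not the training script itself; the function name `mixup_modalities` and the choice to mix only audio/video (per the Key Features section) are assumptions.

```python
import torch

def mixup_modalities(audio, video, labels, alpha=0.4, prob=0.5):
    """Apply mixup to audio/video features with probability `prob`.

    Sketch of the card's alpha=0.4, prob=0.5 setting; the actual
    implementation may differ. Returns the (possibly mixed) features,
    both label sets, and the mixing coefficient lam.
    """
    if torch.rand(1).item() > prob:
        # No mixing this step: lam=1.0 means the second labels are unused
        return audio, video, labels, labels, 1.0
    lam = float(torch.distributions.Beta(alpha, alpha).sample())
    perm = torch.randperm(audio.size(0))
    mixed_audio = lam * audio + (1.0 - lam) * audio[perm]
    mixed_video = lam * video + (1.0 - lam) * video[perm]
    return mixed_audio, mixed_video, labels, labels[perm], lam

# Example batch matching the documented feature shapes
audio = torch.randn(4, 500, 74)
video = torch.randn(4, 500, 35)
labels = torch.randn(4)
a, v, y_a, y_b, lam = mixup_modalities(audio, video, labels, prob=1.0)
```

With mixup, the loss is interpolated the same way: `loss = lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)`.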
## Usage
```python
import torch
from transformers import AutoTokenizer

from train_deberta_multimodal import DeBERTaMultimodalModel

# Load checkpoint (map to CPU so this also works without a GPU)
checkpoint = torch.load('best_model.pt', map_location='cpu')
args = checkpoint['args']  # training arguments saved with the checkpoint

# Initialize the model with the same hyperparameters used in training
model = DeBERTaMultimodalModel(
    model_name='microsoft/deberta-v3-large',
    hidden_size=512,
    num_heads=8,
    dropout=0.2,
    freeze_deberta_layers=20,
)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Load the matching tokenizer
tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-large')
```
## Input Format
- **Text**: Raw text string (tokenized by DeBERTa tokenizer)
- **Audio**: COVAREP features (74-dim, 500 timesteps)
- **Video**: OpenFace features (35-dim, 500 timesteps)
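A batch matching the format above would look like the following. This is an illustrative sketch only (the dict keys and the text sequence length of 64 are assumptions; the audio/video shapes follow the card):

```python
import torch

# Hypothetical single-sample batch matching the documented input format
batch = {
    "input_ids": torch.randint(0, 128000, (1, 64)),        # DeBERTa token ids (length illustrative)
    "attention_mask": torch.ones(1, 64, dtype=torch.long),  # 1 = real token, 0 = padding
    "audio": torch.randn(1, 500, 74),   # COVAREP: 500 timesteps x 74 dims
    "video": torch.randn(1, 500, 35),   # OpenFace: 500 timesteps x 35 dims
}
```

Audio and video are zero-padded (or truncated) to 500 timesteps before being fed to the modality encoders.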