# Multimodal Sentiment Model with Augmentation

A multimodal sentiment analysis model for CMU-MOSEI, built on DeBERTa-v3-Large.

## Model Description

์ด ๋ชจ๋ธ์€ CMU-MOSEI ๋ฐ์ดํ„ฐ์…‹์—์„œ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ๊ฐ์„ฑ ๋ถ„์„์„ ์œ„ํ•ด ํ•™์Šต๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
IITP "๋‚˜๋น„ํšจ๊ณผ" ์—ฐ๊ตฌ ํ”„๋กœ์ ํŠธ์˜ ์ผํ™˜์œผ๋กœ ๊ฐœ๋ฐœ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

### Architecture

- **Text Encoder**: DeBERTa-v3-Large (microsoft/deberta-v3-large)
- **Audio Encoder**: Transformer Encoder (2 layers)
- **Video Encoder**: Transformer Encoder (2 layers)
- **Fusion**: Cross-modal attention + multi-head self-attention (sketched below)
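
The exact fusion code lives in `train_deberta_multimodal.py`; the following is a minimal sketch of the cross-modal attention + self-attention pattern named above, in which the class name, residual connections, and LayerNorm placement are assumptions rather than the released implementation:

```python
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Text queries attend over audio/video, then self-attention
    runs over the fused sequence (illustrative sketch only)."""
    def __init__(self, hidden_size=512, num_heads=8, dropout=0.2):
        super().__init__()
        self.text_to_audio = nn.MultiheadAttention(
            hidden_size, num_heads, dropout=dropout, batch_first=True)
        self.text_to_video = nn.MultiheadAttention(
            hidden_size, num_heads, dropout=dropout, batch_first=True)
        self.self_attn = nn.MultiheadAttention(
            hidden_size, num_heads, dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, text, audio, video):
        # Text tokens query the audio and video sequences
        t2a, _ = self.text_to_audio(text, audio, audio)
        t2v, _ = self.text_to_video(text, video, video)
        fused = self.norm(text + t2a + t2v)
        # Self-attention over the fused representation
        out, _ = self.self_attn(fused, fused, fused)
        return self.norm(fused + out)
```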

### Key Features

- Cross-modal attention between text, audio, and video
- Mixup augmentation for audio/video modalities (see the sketch after this list)
- Multi-task learning with auxiliary classifiers (T, A, V branches)
- First 20 DeBERTa layers frozen for training efficiency
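
As a concrete illustration of the mixup step, the sketch below mixes a batch of audio/video features (and labels) with a shuffled copy of itself, using the reported alpha=0.4 and prob=0.5. The function name and the choice to share one lambda and permutation across both modalities are assumptions:

```python
import torch

def mixup_av(audio, video, labels, alpha=0.4, prob=0.5):
    """With probability `prob`, mix the batch with a shuffled copy of
    itself, sharing one lambda/permutation across audio and video.
    (Hypothetical helper; the training script may differ.)"""
    if torch.rand(1).item() > prob:
        return audio, video, labels
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(audio.size(0))
    audio = lam * audio + (1 - lam) * audio[perm]
    video = lam * video + (1 - lam) * video[perm]
    labels = lam * labels + (1 - lam) * labels[perm]
    return audio, video, labels
```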

## Performance

| Metric | Score |
|--------|-------|
| **Mult_acc_7** | **56.17%** |
| Mult_acc_5 | 57.83% |
| Has0_acc_2 | ~84% |
| MAE | - |
| Corr | - |

### Comparison with Baselines

| Model | Mult_acc_7 |
|-------|-----------|
| MulT (2020) | 50.7% |
| MMML (2023) | 54.95% |
| **Ours** | **56.17%** |

## Training Details

- **Dataset**: CMU-MOSEI (unaligned_50.pkl)
- **Batch Size**: 16
- **Learning Rate**: 2e-5 (other), 5e-6 (DeBERTa)
- **Epochs**: 50 (early stopping: 15)
- **Optimizer**: AdamW
- **Scheduler**: Cosine with warmup
- **Mixup**: alpha=0.4, prob=0.5
- **Loss weights**: cls=0.7, aux=0.1
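
A plausible reconstruction of the optimizer setup implied by the two learning rates, using `torch.optim.AdamW` and Hugging Face's `get_cosine_schedule_with_warmup`; the `deberta` parameter-name prefix and the warmup length are assumptions:

```python
from torch.optim import AdamW
from transformers import get_cosine_schedule_with_warmup

def build_optimizer(model, num_training_steps, warmup_steps=500):
    # Lower LR for the pretrained DeBERTa encoder, higher for new modules
    deberta_params = [p for n, p in model.named_parameters()
                      if n.startswith('deberta') and p.requires_grad]
    other_params = [p for n, p in model.named_parameters()
                    if not n.startswith('deberta') and p.requires_grad]
    optimizer = AdamW([
        {'params': deberta_params, 'lr': 5e-6},
        {'params': other_params, 'lr': 2e-5},
    ])
    scheduler = get_cosine_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps,
        num_training_steps=num_training_steps)
    return optimizer, scheduler
```

With the reported weights, the training objective presumably combines as `loss = 0.7 * cls_loss + 0.1 * (aux_t + aux_a + aux_v)` over the main head and the three unimodal auxiliary heads.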

## Usage

```python
import torch
from transformers import AutoTokenizer

from train_deberta_multimodal import DeBERTaMultimodalModel

# Load checkpoint on CPU; weights_only=False is needed on newer PyTorch
# because the checkpoint stores the training args, not just tensors
checkpoint = torch.load('best_model.pt', map_location='cpu',
                        weights_only=False)
args = checkpoint['args']  # training-time hyperparameters, kept for reference

# Initialize the model with the same configuration used for training
model = DeBERTaMultimodalModel(
    model_name='microsoft/deberta-v3-large',
    hidden_size=512,
    num_heads=8,
    dropout=0.2,
    freeze_deberta_layers=20
)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-large')
```

## Input Format

- **Text**: Raw text string (tokenized by DeBERTa tokenizer)
- **Audio**: COVAREP features (74-dim, 500 timesteps)
- **Video**: OpenFace features (35-dim, 500 timesteps)
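
Continuing from the Usage snippet, an end-to-end call might look like the sketch below. The forward signature is an assumption (check `train_deberta_multimodal.py` for the real argument names); the feature shapes follow the list above:

```python
import torch

text = "This movie was surprisingly good."
enc = tokenizer(text, return_tensors='pt', truncation=True, max_length=128)

# Placeholder COVAREP / OpenFace features with the shapes listed above
audio = torch.zeros(1, 500, 74)  # (batch, timesteps, audio_dim)
video = torch.zeros(1, 500, 35)  # (batch, timesteps, video_dim)

with torch.no_grad():
    # Hypothetical forward signature; adjust to the actual model code
    score = model(input_ids=enc['input_ids'],
                  attention_mask=enc['attention_mask'],
                  audio=audio, video=video)
print(score)  # CMU-MOSEI sentiment score in [-3, 3]
```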