# Multimodal Sentiment Model with Augmentation

A multimodal sentiment analysis model for CMU-MOSEI, built on DeBERTa-v3-Large.

## Model Description

์ด ๋ชจ๋ธ์€ CMU-MOSEI ๋ฐ์ดํ„ฐ์…‹์—์„œ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ๊ฐ์„ฑ ๋ถ„์„์„ ์œ„ํ•ด ํ•™์Šต๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
IITP "๋‚˜๋น„ํšจ๊ณผ" ์—ฐ๊ตฌ ํ”„๋กœ์ ํŠธ์˜ ์ผํ™˜์œผ๋กœ ๊ฐœ๋ฐœ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

### Architecture

- **Text Encoder**: DeBERTa-v3-Large (microsoft/deberta-v3-large)
- **Audio Encoder**: Transformer Encoder (2 layers)
- **Video Encoder**: Transformer Encoder (2 layers)
- **Fusion**: Cross-modal attention + multi-head self-attention (sketched below)
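
The exact fusion code lives in `train_deberta_multimodal.py`; the following is a minimal sketch of the cross-modal attention + self-attention pattern named above, in which the class name, residual connections, and LayerNorm placement are assumptions rather than the released implementation:

```python
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Text queries attend over audio/video, then self-attention
    runs over the fused sequence (illustrative sketch only)."""
    def __init__(self, hidden_size=512, num_heads=8, dropout=0.2):
        super().__init__()
        self.text_to_audio = nn.MultiheadAttention(
            hidden_size, num_heads, dropout=dropout, batch_first=True)
        self.text_to_video = nn.MultiheadAttention(
            hidden_size, num_heads, dropout=dropout, batch_first=True)
        self.self_attn = nn.MultiheadAttention(
            hidden_size, num_heads, dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, text, audio, video):
        # Text tokens query the audio and video sequences
        t2a, _ = self.text_to_audio(text, audio, audio)
        t2v, _ = self.text_to_video(text, video, video)
        fused = self.norm(text + t2a + t2v)
        # Self-attention over the fused representation
        out, _ = self.self_attn(fused, fused, fused)
        return self.norm(fused + out)
```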

### Key Features

- Cross-modal attention between text, audio, and video
- Mixup augmentation for audio/video modalities (see the sketch after this list)
- Multi-task learning with auxiliary classifiers (T, A, V branches)
- First 20 DeBERTa layers frozen for training efficiency
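
As a concrete illustration of the mixup step, the sketch below mixes a batch of audio/video features (and labels) with a shuffled copy of itself, using the reported alpha=0.4 and prob=0.5. The function name and the choice to share one lambda and permutation across both modalities are assumptions:

```python
import torch

def mixup_av(audio, video, labels, alpha=0.4, prob=0.5):
    """With probability `prob`, mix the batch with a shuffled copy of
    itself, sharing one lambda/permutation across audio and video.
    (Hypothetical helper; the training script may differ.)"""
    if torch.rand(1).item() > prob:
        return audio, video, labels
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(audio.size(0))
    audio = lam * audio + (1 - lam) * audio[perm]
    video = lam * video + (1 - lam) * video[perm]
    labels = lam * labels + (1 - lam) * labels[perm]
    return audio, video, labels
```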

## Performance

| Metric | Score |
|--------|-------|
| **Mult_acc_7** | **56.17%** |
| Mult_acc_5 | 57.83% |
| Has0_acc_2 | ~84% |
| MAE | - |
| Corr | - |

### Comparison with Baselines

| Model | Mult_acc_7 |
|-------|-----------|
| MulT (2020) | 50.7% |
| MMML (2023) | 54.95% |
| **Ours** | **56.17%** |

## Training Details

- **Dataset**: CMU-MOSEI (unaligned_50.pkl)
- **Batch Size**: 16
- **Learning Rate**: 2e-5 (other), 5e-6 (DeBERTa)
- **Epochs**: 50 (early stopping: 15)
- **Optimizer**: AdamW
- **Scheduler**: Cosine with warmup
- **Mixup**: alpha=0.4, prob=0.5
- **Loss weights**: cls=0.7, aux=0.1
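
A plausible reconstruction of the optimizer setup implied by the two learning rates, using `torch.optim.AdamW` and Hugging Face's `get_cosine_schedule_with_warmup`; the `deberta` parameter-name prefix and the warmup length are assumptions:

```python
from torch.optim import AdamW
from transformers import get_cosine_schedule_with_warmup

def build_optimizer(model, num_training_steps, warmup_steps=500):
    # Lower LR for the pretrained DeBERTa encoder, higher for new modules
    deberta_params = [p for n, p in model.named_parameters()
                      if n.startswith('deberta') and p.requires_grad]
    other_params = [p for n, p in model.named_parameters()
                    if not n.startswith('deberta') and p.requires_grad]
    optimizer = AdamW([
        {'params': deberta_params, 'lr': 5e-6},
        {'params': other_params, 'lr': 2e-5},
    ])
    scheduler = get_cosine_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps,
        num_training_steps=num_training_steps)
    return optimizer, scheduler
```

With the reported weights, the training objective presumably combines as `loss = 0.7 * cls_loss + 0.1 * (aux_t + aux_a + aux_v)` over the main head and the three unimodal auxiliary heads.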

## Usage

```python
import torch
from transformers import AutoTokenizer

from train_deberta_multimodal import DeBERTaMultimodalModel

# Load checkpoint on CPU; weights_only=False is needed on newer PyTorch
# because the checkpoint stores the training args, not just tensors
checkpoint = torch.load('best_model.pt', map_location='cpu',
                        weights_only=False)
args = checkpoint['args']  # training-time hyperparameters, kept for reference

# Initialize the model with the same configuration used for training
model = DeBERTaMultimodalModel(
    model_name='microsoft/deberta-v3-large',
    hidden_size=512,
    num_heads=8,
    dropout=0.2,
    freeze_deberta_layers=20
)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-large')
```

## Input Format

- **Text**: Raw text string (tokenized by DeBERTa tokenizer)
- **Audio**: COVAREP features (74-dim, 500 timesteps)
- **Video**: OpenFace features (35-dim, 500 timesteps)
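
Continuing from the Usage snippet, an end-to-end call might look like the sketch below. The forward signature is an assumption (check `train_deberta_multimodal.py` for the real argument names); the feature shapes follow the list above:

```python
import torch

text = "This movie was surprisingly good."
enc = tokenizer(text, return_tensors='pt', truncation=True, max_length=128)

# Placeholder COVAREP / OpenFace features with the shapes listed above
audio = torch.zeros(1, 500, 74)  # (batch, timesteps, audio_dim)
video = torch.zeros(1, 500, 35)  # (batch, timesteps, video_dim)

with torch.no_grad():
    # Hypothetical forward signature; adjust to the actual model code
    score = model(input_ids=enc['input_ids'],
                  attention_mask=enc['attention_mask'],
                  audio=audio, video=video)
print(score)  # CMU-MOSEI sentiment score in [-3, 3]
```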