# Multimodal Sentiment Model with Augmentation

A DeBERTa-v3-Large-based multimodal sentiment analysis model (CMU-MOSEI).
## Model Description

This model was trained for multimodal sentiment analysis on the CMU-MOSEI dataset.
It was developed as part of the IITP "Butterfly Effect" research project.
### Architecture

- **Text Encoder**: DeBERTa-v3-Large (microsoft/deberta-v3-large)
- **Audio Encoder**: Transformer Encoder (2 layers)
- **Video Encoder**: Transformer Encoder (2 layers)
- **Fusion**: Cross-modal attention + multi-head self-attention (a sketch follows this list)
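
As a rough illustration of the fusion stage, here is a minimal sketch of one cross-modal attention block in which states from one modality attend to another. It mirrors the description above, not the exact implementation in `train_deberta_multimodal.py`; the layer names and sizes are assumptions taken from the Usage section below.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Sketch: queries from one modality attend to keys/values from another."""

    def __init__(self, hidden_size: int = 512, num_heads: int = 8, dropout: float = 0.2):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_size, num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, query_seq: torch.Tensor, kv_seq: torch.Tensor) -> torch.Tensor:
        # e.g. query_seq = text states, kv_seq = audio or video states
        out, _ = self.attn(query_seq, kv_seq, kv_seq)
        return self.norm(query_seq + out)  # residual connection + layer norm
```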
### Key Features

- Cross-modal attention between text, audio, and video
- Mixup augmentation for the audio/video modalities (see the sketch after this list)
- Multi-task learning with auxiliary classifiers (T, A, V branches)
- First 20 layers of DeBERTa frozen for efficient training
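
A minimal sketch of the mixup step, assuming batch-level mixing of feature sequences and linear interpolation of the continuous sentiment labels (the function and tensor names are hypothetical; `alpha` and `prob` match the Training Details below):

```python
import numpy as np
import torch

def mixup(features: torch.Tensor, labels: torch.Tensor,
          alpha: float = 0.4, prob: float = 0.5):
    """Mix a batch of audio/video features: x' = lam * x + (1 - lam) * x[perm].

    features: (batch, timesteps, dim); labels: (batch,) continuous scores.
    """
    if torch.rand(1).item() > prob:  # apply mixup only with probability `prob`
        return features, labels
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(features.size(0))
    mixed = lam * features + (1.0 - lam) * features[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed, mixed_labels
```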
## Performance

| Metric | Score |
|--------|-------|
| **Mult_acc_7** | **56.17%** |
| Mult_acc_5 | 57.83% |
| Has0_acc_2 | ~84% |
| MAE | not reported |
| Corr | not reported |
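
For reference, Mult_acc_7 is conventionally computed on CMU-MOSEI by clipping the continuous sentiment scores to [-3, 3] and rounding to the nearest integer class; a sketch under that assumption:

```python
import numpy as np

def mult_acc_7(preds: np.ndarray, labels: np.ndarray) -> float:
    """7-class accuracy over integer sentiment classes in [-3, 3]."""
    p = np.clip(np.round(preds), -3, 3)
    t = np.clip(np.round(labels), -3, 3)
    return float((p == t).mean())
```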
### Comparison with Baselines

| Model | Mult_acc_7 |
|-------|-----------|
| MulT (2020) | 50.7% |
| MMML (2023) | 54.95% |
| **Ours** | **56.17%** |
## Training Details

- **Dataset**: CMU-MOSEI (unaligned_50.pkl)
- **Batch Size**: 16
- **Learning Rate**: 2e-5 (non-DeBERTa modules), 5e-6 (DeBERTa); see the optimizer sketch after this list
- **Epochs**: 50 (early stopping patience: 15)
- **Optimizer**: AdamW
- **Scheduler**: Cosine with warmup
- **Mixup**: alpha=0.4, prob=0.5
- **Loss weights**: cls=0.7, aux=0.1
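
A minimal sketch of how the two learning rates and the cosine schedule could be wired together with `transformers` utilities. It assumes `model` is the `DeBERTaMultimodalModel` instance from the Usage section below, that DeBERTa parameters are identifiable by name, and a 10% warmup ratio (the exact values live in the checkpoint's `args`):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Two parameter groups: a smaller LR for the pretrained DeBERTa encoder,
# a larger one for the audio/video encoders and fusion layers.
# ('deberta' in the parameter name is an assumption about module naming.)
deberta_params = [p for n, p in model.named_parameters()
                  if 'deberta' in n and p.requires_grad]
other_params = [p for n, p in model.named_parameters() if 'deberta' not in n]

optimizer = torch.optim.AdamW([
    {'params': deberta_params, 'lr': 5e-6},
    {'params': other_params, 'lr': 2e-5},
])

steps_per_epoch = 1000  # replace with len(train_loader)
num_training_steps = 50 * steps_per_epoch
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # assumed warmup ratio
    num_training_steps=num_training_steps,
)
```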
## Usage

```python
import torch
from transformers import AutoTokenizer

from train_deberta_multimodal import DeBERTaMultimodalModel

# Load checkpoint; weights_only=False because it also stores the training args
checkpoint = torch.load('best_model.pt', map_location='cpu', weights_only=False)
args = checkpoint['args']

# Initialize model
model = DeBERTaMultimodalModel(
    model_name='microsoft/deberta-v3-large',
    hidden_size=512,
    num_heads=8,
    dropout=0.2,
    freeze_deberta_layers=20
)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-large')
```
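
A hypothetical inference call may look like the sketch below. The real `forward()` signature is defined in `train_deberta_multimodal.py`; the keyword argument names here are assumptions, and the zero tensors are placeholders with the shapes listed under Input Format:

```python
text = "This movie was surprisingly good."
enc = tokenizer(text, return_tensors='pt', truncation=True)

audio = torch.zeros(1, 500, 74)  # COVAREP features (batch, timesteps, dim)
video = torch.zeros(1, 500, 35)  # OpenFace features (batch, timesteps, dim)

with torch.no_grad():
    score = model(
        input_ids=enc['input_ids'],
        attention_mask=enc['attention_mask'],
        audio=audio,
        video=video,
    )
print(score)  # CMU-MOSEI sentiment scores lie in [-3, 3]
```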
## Input Format

- **Text**: Raw text string (tokenized by the DeBERTa tokenizer)
- **Audio**: COVAREP features (74-dim, 500 timesteps)
- **Video**: OpenFace features (35-dim, 500 timesteps)
## Citation

If you use this model, please cite:

```bibtex
@misc{iknow2024mosei,
  title={Multimodal Sentiment Analysis with DeBERTa and Cross-Modal Attention},
  author={iKnow Lab},
  year={2024},
  publisher={Hugging Face}
}
```
## Acknowledgements

This work was supported by an IITP (Institute of Information & Communications Technology Planning & Evaluation) grant funded by the Korea government (MSIT).

- Project: Human-Centered Artificial Intelligence Core Source Technology Development
- Task: Abductive inference framework using omni-data for understanding complex causal relations
- Grant Number: RS-2022-II220680
## License

This model is released under the MIT License.