# Multimodal Sentiment Model with Augmentation

DeBERTa-v3-Large-based multimodal sentiment analysis model (CMU-MOSEI)

## Model Description

This model was trained for multimodal sentiment analysis on the CMU-MOSEI dataset. It was developed as part of an IITP research project.
### Architecture

- **Text Encoder**: DeBERTa-v3-Large (microsoft/deberta-v3-large)
- **Audio Encoder**: Transformer encoder (2 layers)
- **Video Encoder**: Transformer encoder (2 layers)
- **Fusion**: Cross-modal attention + multi-head self-attention (see the sketch below)
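
A minimal sketch of this fusion stage in PyTorch, assuming the three streams have already been projected to a shared `hidden_size` of 512 with 8 attention heads; the class name `CrossModalFusion` and its internal wiring are illustrative assumptions, not the repository's exact implementation:

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative fusion block: text queries attend to audio and video,
    then a multi-head self-attention layer mixes the fused sequence."""

    def __init__(self, hidden_size=512, num_heads=8, dropout=0.2):
        super().__init__()
        self.text_to_audio = nn.MultiheadAttention(hidden_size, num_heads,
                                                   dropout=dropout, batch_first=True)
        self.text_to_video = nn.MultiheadAttention(hidden_size, num_heads,
                                                   dropout=dropout, batch_first=True)
        self.self_attn = nn.MultiheadAttention(hidden_size, num_heads,
                                               dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, text, audio, video):
        # All inputs: (batch, seq_len, hidden_size), already projected.
        t_a, _ = self.text_to_audio(query=text, key=audio, value=audio)
        t_v, _ = self.text_to_video(query=text, key=video, value=video)
        fused = self.norm(text + t_a + t_v)           # residual combine of cross-modal context
        out, _ = self.self_attn(fused, fused, fused)  # self-attention over the fused sequence
        return self.norm(fused + out)
```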
### Key Features

- Cross-modal attention between text, audio, and video
- Mixup augmentation for the audio/video modalities (sketched after this list)
- Multi-task learning with auxiliary classifiers (T, A, V branches)
- First 20 DeBERTa layers frozen for efficient training
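
A minimal sketch of mixup applied only to the audio/video streams (text is left untouched), using the alpha=0.4 / prob=0.5 settings listed under Training Details; the function name and return convention are assumptions, not the training script's actual code:

```python
import numpy as np
import torch

def mixup_modalities(audio, video, labels, alpha=0.4, prob=0.5):
    """Mix audio/video features of random sample pairs within a batch.
    Returns the (possibly) mixed features, both label sets, and the
    mixing coefficient so the loss can be interpolated accordingly."""
    if np.random.rand() > prob:
        # Skip augmentation for this batch.
        return audio, video, labels, labels, 1.0
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(audio.size(0))
    audio = lam * audio + (1 - lam) * audio[perm]
    video = lam * video + (1 - lam) * video[perm]
    return audio, video, labels, labels[perm], lam
```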
## Performance

| Metric | Score |
|--------|-------|
| **Mult_acc_7** | **56.17%** |
| Mult_acc_5 | 57.83% |
| Has0_acc_2 | ~84% |
| MAE | - |
| Corr | - |
### Comparison with Baselines

| Model | Mult_acc_7 |
|-------|------------|
| MulT (2020) | 50.7% |
| MMML (2023) | 54.95% |
| **Ours** | **56.17%** |
## Training Details

- **Dataset**: CMU-MOSEI (unaligned_50.pkl)
- **Batch Size**: 16
- **Learning Rate**: 2e-5 (non-DeBERTa parameters), 5e-6 (DeBERTa) — see the optimizer sketch below
- **Epochs**: 50 (early stopping patience: 15)
- **Optimizer**: AdamW
- **Scheduler**: Cosine with warmup
- **Mixup**: alpha=0.4, prob=0.5
- **Loss weights**: cls=0.7, aux=0.1
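
The following sketch shows how the layer freezing, the two learning rates, and the cosine warmup schedule could be wired together. The attribute names (`model.deberta`, `encoder.layer`) are assumptions about the wrapper class; check `train_deberta_multimodal.py` for the actual structure:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

def build_optimizer(model, num_training_steps, warmup_steps=500):
    # Freeze the embeddings and the first 20 DeBERTa encoder layers
    # (attribute names below are assumptions about the wrapper model).
    deberta = model.deberta
    for p in deberta.embeddings.parameters():
        p.requires_grad = False
    for layer in deberta.encoder.layer[:20]:
        for p in layer.parameters():
            p.requires_grad = False

    # Two learning rates: 5e-6 for the remaining DeBERTa layers, 2e-5 for the rest.
    deberta_params = [p for p in deberta.parameters() if p.requires_grad]
    deberta_ids = {id(p) for p in deberta_params}
    other_params = [p for p in model.parameters()
                    if p.requires_grad and id(p) not in deberta_ids]
    optimizer = torch.optim.AdamW([
        {"params": deberta_params, "lr": 5e-6},
        {"params": other_params, "lr": 2e-5},
    ])
    scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, num_training_steps)
    return optimizer, scheduler
```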
## Usage

```python
import torch
from transformers import AutoTokenizer

from train_deberta_multimodal import DeBERTaMultimodalModel

# Load checkpoint (weights plus the training arguments)
checkpoint = torch.load('best_model.pt', map_location='cpu')
args = checkpoint['args']

# Initialize the model with the same hyperparameters used for training
model = DeBERTaMultimodalModel(
    model_name='microsoft/deberta-v3-large',
    hidden_size=512,
    num_heads=8,
    dropout=0.2,
    freeze_deberta_layers=20
)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-large')
```
## Input Format

- **Text**: Raw text string (tokenized with the DeBERTa tokenizer)
- **Audio**: COVAREP features (74-dim, 500 timesteps)
- **Video**: OpenFace features (35-dim, 500 timesteps)

See the end-to-end example below.
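
Putting the pieces together, a hedged inference sketch with `model` and `tokenizer` loaded as in the Usage section. Random tensors stand in for real COVAREP/OpenFace features, and the forward keyword names are assumptions about `DeBERTaMultimodalModel.forward`, so verify them against the training script:

```python
import torch

# Tokenize a raw utterance with the DeBERTa tokenizer.
encoded = tokenizer("This movie was surprisingly good.",
                    return_tensors="pt", padding="max_length",
                    truncation=True, max_length=64)

# Acoustic and visual streams: COVAREP (74-dim) and OpenFace (35-dim),
# each padded/truncated to 500 timesteps; random values used here for illustration.
audio = torch.randn(1, 500, 74)
video = torch.randn(1, 500, 35)

# Keyword names below are assumptions about the model's forward signature.
with torch.no_grad():
    prediction = model(input_ids=encoded["input_ids"],
                       attention_mask=encoded["attention_mask"],
                       audio=audio,
                       vision=video)
print(prediction)  # sentiment prediction (CMU-MOSEI labels range from -3 to 3)
```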