# Multimodal Sentiment Model with Augmentation

A DeBERTa-v3-Large-based multimodal sentiment analysis model (CMU-MOSEI).
## Model Description

This model was trained for multimodal sentiment analysis on the CMU-MOSEI dataset.
It was developed as part of the IITP "Butterfly Effect" research project.
### Architecture

- **Text Encoder**: DeBERTa-v3-Large (microsoft/deberta-v3-large)
- **Audio Encoder**: Transformer Encoder (2 layers)
- **Video Encoder**: Transformer Encoder (2 layers)
- **Fusion**: Cross-modal attention + multi-head self-attention (a sketch follows this list)
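
As a rough illustration of the fusion stage, here is a minimal sketch of one cross-modal attention block in which states from one modality attend to another. It mirrors the description above, not the exact implementation in `train_deberta_multimodal.py`; the layer names and sizes are assumptions taken from the Usage section below.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Sketch: queries from one modality attend to keys/values from another."""

    def __init__(self, hidden_size: int = 512, num_heads: int = 8, dropout: float = 0.2):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_size, num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, query_seq: torch.Tensor, kv_seq: torch.Tensor) -> torch.Tensor:
        # e.g. query_seq = text states, kv_seq = audio or video states
        out, _ = self.attn(query_seq, kv_seq, kv_seq)
        return self.norm(query_seq + out)  # residual connection + layer norm
```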
### Key Features

- Cross-modal attention between text, audio, and video
- Mixup augmentation for the audio/video modalities (see the sketch after this list)
- Multi-task learning with auxiliary classifiers (T, A, V branches)
- First 20 layers of DeBERTa frozen for efficient training
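
A minimal sketch of the mixup step, assuming batch-level mixing of feature sequences and linear interpolation of the continuous sentiment labels (the function and tensor names are hypothetical; `alpha` and `prob` match the Training Details below):

```python
import numpy as np
import torch

def mixup(features: torch.Tensor, labels: torch.Tensor,
          alpha: float = 0.4, prob: float = 0.5):
    """Mix a batch of audio/video features: x' = lam * x + (1 - lam) * x[perm].

    features: (batch, timesteps, dim); labels: (batch,) continuous scores.
    """
    if torch.rand(1).item() > prob:  # apply mixup only with probability `prob`
        return features, labels
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(features.size(0))
    mixed = lam * features + (1.0 - lam) * features[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed, mixed_labels
```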
## Performance

| Metric | Score |
|--------|-------|
| **Mult_acc_7** | **56.17%** |
| Mult_acc_5 | 57.83% |
| Has0_acc_2 | ~84% |
| MAE | not reported |
| Corr | not reported |
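
For reference, Mult_acc_7 is conventionally computed on CMU-MOSEI by clipping the continuous sentiment scores to [-3, 3] and rounding to the nearest integer class; a sketch under that assumption:

```python
import numpy as np

def mult_acc_7(preds: np.ndarray, labels: np.ndarray) -> float:
    """7-class accuracy over integer sentiment classes in [-3, 3]."""
    p = np.clip(np.round(preds), -3, 3)
    t = np.clip(np.round(labels), -3, 3)
    return float((p == t).mean())
```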
### Comparison with Baselines

| Model | Mult_acc_7 |
|-------|-----------|
| MulT (2020) | 50.7% |
| MMML (2023) | 54.95% |
| **Ours** | **56.17%** |
## Training Details

- **Dataset**: CMU-MOSEI (unaligned_50.pkl)
- **Batch Size**: 16
- **Learning Rate**: 2e-5 (non-DeBERTa modules), 5e-6 (DeBERTa); see the optimizer sketch after this list
- **Epochs**: 50 (early stopping patience: 15)
- **Optimizer**: AdamW
- **Scheduler**: Cosine with warmup
- **Mixup**: alpha=0.4, prob=0.5
- **Loss weights**: cls=0.7, aux=0.1
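
A minimal sketch of how the two learning rates and the cosine schedule could be wired together with `transformers` utilities. It assumes `model` is the `DeBERTaMultimodalModel` instance from the Usage section below, that DeBERTa parameters are identifiable by name, and a 10% warmup ratio (the exact values live in the checkpoint's `args`):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Two parameter groups: a smaller LR for the pretrained DeBERTa encoder,
# a larger one for the audio/video encoders and fusion layers.
# ('deberta' in the parameter name is an assumption about module naming.)
deberta_params = [p for n, p in model.named_parameters()
                  if 'deberta' in n and p.requires_grad]
other_params = [p for n, p in model.named_parameters() if 'deberta' not in n]

optimizer = torch.optim.AdamW([
    {'params': deberta_params, 'lr': 5e-6},
    {'params': other_params, 'lr': 2e-5},
])

steps_per_epoch = 1000  # replace with len(train_loader)
num_training_steps = 50 * steps_per_epoch
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # assumed warmup ratio
    num_training_steps=num_training_steps,
)
```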
## Usage

```python
import torch
from transformers import AutoTokenizer

from train_deberta_multimodal import DeBERTaMultimodalModel

# Load checkpoint; weights_only=False because it also stores the training args
checkpoint = torch.load('best_model.pt', map_location='cpu', weights_only=False)
args = checkpoint['args']

# Initialize model
model = DeBERTaMultimodalModel(
    model_name='microsoft/deberta-v3-large',
    hidden_size=512,
    num_heads=8,
    dropout=0.2,
    freeze_deberta_layers=20
)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-large')
```
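
A hypothetical inference call may look like the sketch below. The real `forward()` signature is defined in `train_deberta_multimodal.py`; the keyword argument names here are assumptions, and the zero tensors are placeholders with the shapes listed under Input Format:

```python
text = "This movie was surprisingly good."
enc = tokenizer(text, return_tensors='pt', truncation=True)

audio = torch.zeros(1, 500, 74)  # COVAREP features (batch, timesteps, dim)
video = torch.zeros(1, 500, 35)  # OpenFace features (batch, timesteps, dim)

with torch.no_grad():
    score = model(
        input_ids=enc['input_ids'],
        attention_mask=enc['attention_mask'],
        audio=audio,
        video=video,
    )
print(score)  # CMU-MOSEI sentiment scores lie in [-3, 3]
```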
## Input Format

- **Text**: Raw text string (tokenized by the DeBERTa tokenizer)
- **Audio**: COVAREP features (74-dim, 500 timesteps)
- **Video**: OpenFace features (35-dim, 500 timesteps)
## Citation

If you use this model, please cite:

```bibtex
@misc{iknow2024mosei,
  title={Multimodal Sentiment Analysis with DeBERTa and Cross-Modal Attention},
  author={iKnow Lab},
  year={2024},
  publisher={Hugging Face}
}
```
## Acknowledgements

This work was supported by an IITP (Institute of Information & Communications Technology Planning & Evaluation) grant funded by the Korea government (MSIT).

- Project: Human-Centered Artificial Intelligence Core Source Technology Development
- Task: Abductive inference framework using omni-data for understanding complex causal relations
- Grant Number: RS-2022-II220680
## License

This model is released under the MIT License.