# Multimodal Sentiment Model with Augmentation

DeBERTa-v3-Large-based multimodal sentiment analysis model (CMU-MOSEI)

## Model Description

This model was trained for multimodal sentiment analysis on the CMU-MOSEI dataset. It was developed as part of an IITP research project.
### Architecture

- **Text Encoder**: DeBERTa-v3-Large (microsoft/deberta-v3-large)
- **Audio Encoder**: Transformer encoder (2 layers)
- **Video Encoder**: Transformer encoder (2 layers)
- **Fusion**: Cross-modal attention + multi-head self-attention (see the sketch below)
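
A minimal sketch of this fusion stage in PyTorch, assuming the three streams have already been projected to a shared `hidden_size` of 512 with 8 attention heads; the class name `CrossModalFusion` and its internal wiring are illustrative assumptions, not the repository's exact implementation:

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative fusion block: text queries attend to audio and video,
    then a multi-head self-attention layer mixes the fused sequence."""

    def __init__(self, hidden_size=512, num_heads=8, dropout=0.2):
        super().__init__()
        self.text_to_audio = nn.MultiheadAttention(hidden_size, num_heads,
                                                   dropout=dropout, batch_first=True)
        self.text_to_video = nn.MultiheadAttention(hidden_size, num_heads,
                                                   dropout=dropout, batch_first=True)
        self.self_attn = nn.MultiheadAttention(hidden_size, num_heads,
                                               dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, text, audio, video):
        # All inputs: (batch, seq_len, hidden_size), already projected.
        t_a, _ = self.text_to_audio(query=text, key=audio, value=audio)
        t_v, _ = self.text_to_video(query=text, key=video, value=video)
        fused = self.norm(text + t_a + t_v)           # residual combine of cross-modal context
        out, _ = self.self_attn(fused, fused, fused)  # self-attention over the fused sequence
        return self.norm(fused + out)
```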
### Key Features

- Cross-modal attention between text, audio, and video
- Mixup augmentation for the audio/video modalities (sketched after this list)
- Multi-task learning with auxiliary classifiers (T, A, V branches)
- First 20 DeBERTa layers frozen for efficient training
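
A minimal sketch of mixup applied only to the audio/video streams (text is left untouched), using the alpha=0.4 / prob=0.5 settings listed under Training Details; the function name and return convention are assumptions, not the training script's actual code:

```python
import numpy as np
import torch

def mixup_modalities(audio, video, labels, alpha=0.4, prob=0.5):
    """Mix audio/video features of random sample pairs within a batch.
    Returns the (possibly) mixed features, both label sets, and the
    mixing coefficient so the loss can be interpolated accordingly."""
    if np.random.rand() > prob:
        # Skip augmentation for this batch.
        return audio, video, labels, labels, 1.0
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(audio.size(0))
    audio = lam * audio + (1 - lam) * audio[perm]
    video = lam * video + (1 - lam) * video[perm]
    return audio, video, labels, labels[perm], lam
```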
## Performance

| Metric | Score |
|--------|-------|
| **Mult_acc_7** | **56.17%** |
| Mult_acc_5 | 57.83% |
| Has0_acc_2 | ~84% |
| MAE | - |
| Corr | - |
### Comparison with Baselines

| Model | Mult_acc_7 |
|-------|------------|
| MulT (2020) | 50.7% |
| MMML (2023) | 54.95% |
| **Ours** | **56.17%** |
## Training Details

- **Dataset**: CMU-MOSEI (unaligned_50.pkl)
- **Batch Size**: 16
- **Learning Rate**: 2e-5 (non-DeBERTa parameters), 5e-6 (DeBERTa) — see the optimizer sketch below
- **Epochs**: 50 (early stopping patience: 15)
- **Optimizer**: AdamW
- **Scheduler**: Cosine with warmup
- **Mixup**: alpha=0.4, prob=0.5
- **Loss weights**: cls=0.7, aux=0.1
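
The following sketch shows how the layer freezing, the two learning rates, and the cosine warmup schedule could be wired together. The attribute names (`model.deberta`, `encoder.layer`) are assumptions about the wrapper class; check `train_deberta_multimodal.py` for the actual structure:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

def build_optimizer(model, num_training_steps, warmup_steps=500):
    # Freeze the embeddings and the first 20 DeBERTa encoder layers
    # (attribute names below are assumptions about the wrapper model).
    deberta = model.deberta
    for p in deberta.embeddings.parameters():
        p.requires_grad = False
    for layer in deberta.encoder.layer[:20]:
        for p in layer.parameters():
            p.requires_grad = False

    # Two learning rates: 5e-6 for the remaining DeBERTa layers, 2e-5 for the rest.
    deberta_params = [p for p in deberta.parameters() if p.requires_grad]
    deberta_ids = {id(p) for p in deberta_params}
    other_params = [p for p in model.parameters()
                    if p.requires_grad and id(p) not in deberta_ids]
    optimizer = torch.optim.AdamW([
        {"params": deberta_params, "lr": 5e-6},
        {"params": other_params, "lr": 2e-5},
    ])
    scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, num_training_steps)
    return optimizer, scheduler
```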
## Usage

```python
import torch
from transformers import AutoTokenizer

from train_deberta_multimodal import DeBERTaMultimodalModel

# Load checkpoint (weights plus the training arguments)
checkpoint = torch.load('best_model.pt', map_location='cpu')
args = checkpoint['args']

# Initialize the model with the same hyperparameters used for training
model = DeBERTaMultimodalModel(
    model_name='microsoft/deberta-v3-large',
    hidden_size=512,
    num_heads=8,
    dropout=0.2,
    freeze_deberta_layers=20
)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-large')
```
## Input Format

- **Text**: Raw text string (tokenized with the DeBERTa tokenizer)
- **Audio**: COVAREP features (74-dim, 500 timesteps)
- **Video**: OpenFace features (35-dim, 500 timesteps)

See the end-to-end example below.
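
Putting the pieces together, a hedged inference sketch with `model` and `tokenizer` loaded as in the Usage section. Random tensors stand in for real COVAREP/OpenFace features, and the forward keyword names are assumptions about `DeBERTaMultimodalModel.forward`, so verify them against the training script:

```python
import torch

# Tokenize a raw utterance with the DeBERTa tokenizer.
encoded = tokenizer("This movie was surprisingly good.",
                    return_tensors="pt", padding="max_length",
                    truncation=True, max_length=64)

# Acoustic and visual streams: COVAREP (74-dim) and OpenFace (35-dim),
# each padded/truncated to 500 timesteps; random values used here for illustration.
audio = torch.randn(1, 500, 74)
video = torch.randn(1, 500, 35)

# Keyword names below are assumptions about the model's forward signature.
with torch.no_grad():
    prediction = model(input_ids=encoded["input_ids"],
                       attention_mask=encoded["attention_mask"],
                       audio=audio,
                       vision=video)
print(prediction)  # sentiment prediction (CMU-MOSEI labels range from -3 to 3)
```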