HyunaZ/hubert_emotion

+---
+license: apache-2.0
+language:
+  - ko
+library_name: transformers
+pipeline_tag: audio-classification
+tags:
+  - speech
+  - audio
+---
+# hubert-base-korean
+## Model Details
+HuBERT (Hidden-Unit BERT) is a speech representation learning model proposed by Facebook.
+Unlike conventional speech recognition models, HuBERT uses self-supervised learning to learn directly from the raw waveform of the speech signal.
+This model uses https://huggingface.co/team-lucid/hubert-base-korean as its base model.
+## How to Get Started with the Model
+### PyTorch
+```py
+import librosa
+import pytorch_lightning as pl
+import torch
+from torch import nn
+from transformers import AutoConfig, AutoFeatureExtractor, HubertForSequenceClassification
+class MyLitModel(pl.LightningModule):
+    """HuBERT encoder with two heads: emotion classification and intensity regression."""
+
+    def __init__(self, audio_model_name, num_label2s, n_layers=1, projector=True, classifier=True, dropout=0.07, lr_decay=1):
+        super().__init__()
+        self.config = AutoConfig.from_pretrained(audio_model_name)
+        # Expose every layer's hidden states so forward() can read the last one
+        self.config.output_hidden_states = True
+        self.audio_model = HubertForSequenceClassification.from_pretrained(audio_model_name, config=self.config)
+        # Emotion head: one logit per emotion class
+        self.label2_classifier = nn.Linear(self.audio_model.config.hidden_size, num_label2s)
+        # Intensity head: a single scalar per utterance
+        self.intensity_regressor = nn.Linear(self.audio_model.config.hidden_size, 1)
+
+    def forward(self, audio_values, audio_attn_mask=None):
+        outputs = self.audio_model(input_values=audio_values, attention_mask=audio_attn_mask)
+        # Use the first frame of the last hidden layer as the utterance embedding
+        utterance_embedding = outputs.hidden_states[-1][:, 0, :]
+        label2_logits = self.label2_classifier(utterance_embedding)
+        intensity_preds = self.intensity_regressor(utterance_embedding).squeeze(-1)
+        return label2_logits, intensity_preds
+# Model settings
+audio_model_name = "team-lucid/hubert-base-korean"
+NUM_LABELS = 7
+SAMPLING_RATE = 16000
+# Load the fine-tuned HuBERT model
+pretrained_model_path = ""  # path to the model checkpoint
+hubert_model = MyLitModel.load_from_checkpoint(
+    pretrained_model_path,
+    audio_model_name=audio_model_name,
+    num_label2s=NUM_LABELS,
+)
+hubert_model.eval()
+hubert_model.to("cuda" if torch.cuda.is_available() else "cpu")
+# Load the feature extractor
+feature_extractor = AutoFeatureExtractor.from_pretrained(audio_model_name)
+# Load and preprocess the audio file
+audio_path = ""  # path to the audio file to process
+audio_np, _ = librosa.load(audio_path, sr=SAMPLING_RATE, mono=True)
+inputs = feature_extractor(raw_speech=audio_np, return_tensors="pt", sampling_rate=SAMPLING_RATE)
+audio_values = inputs["input_values"].to(hubert_model.device)
+audio_attn_mask = inputs.get("attention_mask", None)
+if audio_attn_mask is not None:
+    audio_attn_mask = audio_attn_mask.to(hubert_model.device)
+# Emotion analysis
+with torch.no_grad():
+    # forward() accepts audio_attn_mask=None, so no branching is needed
+    label2_logits, intensity_preds = hubert_model(audio_values, audio_attn_mask)
+emotion_label = torch.argmax(label2_logits, dim=-1).item()
+emotion_intensity = intensity_preds.item()
+print(f"Emotion Label: {emotion_label}, Emotion Intensity: {emotion_intensity}")
+```
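The snippet above prints only the raw class index. A minimal sketch of turning the outputs into a readable result follows. The index-to-name order in `EMOTION_NAMES` is an assumption: it follows the order in which the emotions are listed under Training Procedure, and the checkpoint's actual label encoding may differ. `decode_prediction` and the clamping of the intensity to the 0-2 range are hypothetical helpers, not part of the released code.

```python
import math

# ASSUMPTION: this index order follows the order in which the seven emotions
# are listed in the Training Procedure section. The checkpoint's real label
# encoding may differ and should be checked against the training code.
EMOTION_NAMES = ["happiness", "anger", "disgust", "fear", "neutral", "sadness", "surprise"]


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_prediction(logits, intensity):
    """Return (emotion name, probability, intensity clamped to [0, 2])."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    clamped = min(max(intensity, 0.0), 2.0)
    return EMOTION_NAMES[best], probs[best], clamped


# Made-up numbers for illustration only.
name, prob, strength = decode_prediction([0.1, 2.3, -0.5, 0.0, 1.1, -1.2, 0.4], 2.4)
print(name, round(prob, 3), strength)
```

With the inference code above, the inputs would be `label2_logits.squeeze(0).tolist()` and `intensity_preds.item()`.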
+## Training Details
+### Training Data
+This model was trained on the AI Hub conversational speech dataset for emotion classification (https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=263),
+using 1,000 samples per label for a total of 7,000 samples.
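One way to draw such a balanced subset is sketched below. This is not the author's preprocessing script: `sample_per_label`, the `(audio_path, label)` record layout, the file names, and the fixed seed are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_per_label(records, per_label, seed=0):
    """Pick `per_label` random records for every label.

    `records` is a list of (audio_path, label) pairs (hypothetical layout).
    """
    by_label = defaultdict(list)
    for path, label in records:
        by_label[label].append(path)

    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    subset = []
    for label in sorted(by_label):
        paths = by_label[label]
        if len(paths) < per_label:
            raise ValueError(f"label {label!r} has only {len(paths)} records")
        subset.extend((p, label) for p in rng.sample(paths, per_label))
    return subset


# Toy data: 7 labels with 1,500 hypothetical files each.
records = [(f"clip_{label}_{i}.wav", label) for label in range(7) for i in range(1500)]
subset = sample_per_label(records, per_label=1000)
print(len(subset))  # 7000
```

Sampling the same number of clips per label keeps the seven classes balanced, so no emotion dominates the classification loss.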
+### Training Procedure
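+The model is designed as a multi-task model that jointly learns seven emotions (happiness, anger, disgust, fear, neutral, sadness, surprise) and the intensity of each emotion (0-2).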
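A joint objective of this kind is commonly written as a classification loss plus a weighted regression loss. The sketch below shows that idea for a single example in plain Python. The card does not state the actual loss functions or their weighting, so the use of cross-entropy, squared error, and the `intensity_weight` parameter are assumptions. In PyTorch the corresponding building blocks would be `torch.nn.functional.cross_entropy` and `torch.nn.functional.mse_loss`.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


def multitask_loss(logits, intensity_pred, emotion_target, intensity_target,
                   intensity_weight=1.0):
    """Classification loss plus weighted squared error on the intensity.

    ASSUMPTION: equal weighting by default; the real training setup may differ.
    """
    cls_loss = cross_entropy(logits, emotion_target)
    reg_loss = (intensity_pred - intensity_target) ** 2
    return cls_loss + intensity_weight * reg_loss


# One example: 7 emotion logits, predicted intensity 1.4, true intensity 2.
loss = multitask_loss([0.2, 1.5, -0.3, 0.0, 0.7, -1.0, 0.1], 1.4,
                      emotion_target=1, intensity_target=2.0)
print(round(loss, 4))
```

Because both heads read the same utterance embedding, gradients from the two terms update the shared HuBERT encoder together.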
+#### Training Hyperparameters
+| Hyperparameter      | Value   |
+|:--------------------|---------|
+| Learning Rate       | 1e-5    |
+| Learning Rate Decay | 0.8     |
+| Batch Size          | 8       |
+| Weight Decay        | 0.01    |
+| Epochs              | 30      |