anggars committed · Commit 97d024d (verified) · Parent: 4079efc

Update README.md

Files changed: README.md (+18 −9)
language:
- en
metrics:
- accuracy
- f1
pipeline_tag: audio-classification
---
 
The system utilizes a custom multimodal PyTorch architecture combining NLP and state-of-the-art audio transformers:

### 1. Lyrical Stream (NLP)
- **Encoder:** `xlm-roberta-base`
- **Logic:** Extracts high-level semantic embeddings from song lyrics. The base model weights are frozen to maintain stable pre-trained representations.

### 2. Acoustic Stream (DSP)
- **Model:** Pre-trained audio transformer (`WavLMModel`)
- **Input:** Raw audio waveform processed via `AutoFeatureExtractor`.
- **Logic:** Captures complex guitar textures and erratic drum patterns natively from the waveform, replacing legacy 2D-CNN Mel-spectrogram approaches.

### 3. Fusion Layer
- **Method:** Concatenation of text and audio embeddings into a unified representation, processed through feed-forward layers with dropout regularization.
- **Heads:** Multi-task fully connected layers for joint classification of MBTI, Emotion, Vibe, Intensity, and Tempo.
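The three streams above can be sketched as a single `nn.Module`. This is a minimal illustration, not the exact thesis configuration: the embedding sizes, mean pooling, hidden width, dropout rate, and every per-task class count except MBTI's 16 types are assumptions, and the two encoders are injected as constructor arguments to keep the skeleton self-contained.

```python
import torch
import torch.nn as nn


class MultimodalMathRock(nn.Module):
    """Sketch of the fusion architecture. Head sizes other than
    MBTI's 16 classes are illustrative assumptions."""

    TASKS = {"mbti": 16, "emotion": 16, "vibe": 4, "intensity": 3, "tempo": 3}

    def __init__(self, text_encoder, audio_encoder,
                 text_dim=768, audio_dim=768, hidden_dim=512, dropout=0.3):
        super().__init__()
        # e.g. AutoModel.from_pretrained("xlm-roberta-base")
        self.text_encoder = text_encoder
        # e.g. WavLMModel.from_pretrained(...), fed by AutoFeatureExtractor
        self.audio_encoder = audio_encoder
        # The lyrical stream is frozen to keep pre-trained weights stable.
        for p in self.text_encoder.parameters():
            p.requires_grad = False
        # Fusion: concatenated embeddings -> feed-forward with dropout.
        self.fusion = nn.Sequential(
            nn.Linear(text_dim + audio_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
        )
        # One fully connected head per classification target.
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden_dim, n) for task, n in self.TASKS.items()}
        )

    def forward(self, input_ids, attention_mask, input_values):
        # Mean-pool each stream's hidden states into one vector per clip.
        text = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state.mean(dim=1)
        audio = self.audio_encoder(input_values).last_hidden_state.mean(dim=1)
        fused = self.fusion(torch.cat([text, audio], dim=-1))
        return {task: head(fused) for task, head in self.heads.items()}
```

With real encoders, `input_ids`/`attention_mask` come from the lyric tokenizer and `input_values` from `AutoFeatureExtractor`. The original class presumably constructs its encoders internally (the usage snippet in this card calls `MultimodalMathRock()` with no arguments), so adjust the constructor to match the training script.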
 
## Evaluation Metrics (Epoch 10)
The following results were obtained using the final checkpoint (`model.pt`, epoch 10) on a 400-sample evaluation subset:

| Task | Accuracy | Macro F1-Score | Key Performance Note |
| :--- | :--- | :--- | :--- |
| **Vibe** | 78.00% | 0.76 | Exceptional detection of 'Melancholic' tracks (0.83 F1). |
| **Intensity** | 73.00% | 0.69 | Highly stable predictions for 'Medium' intensity levels. |
| **Emotion** | 55.75% | 0.25 | Strong performance on 'Grief' (0.71 F1) and 'Amusement'. |
| **Tempo** | 54.50% | 0.30 | Consistent performance on 'Moderate' tempo classifications. |
| **MBTI** | 25.75% | 0.23 | Outperforms the random baseline (6.25%) by a factor of four. |
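The spread between accuracy and macro F1 (e.g. Emotion: 55.75% accuracy vs. 0.25 macro F1) is characteristic of class imbalance: macro F1 averages per-class F1 scores without frequency weighting, so minority classes that are rarely predicted pull the mean down even when overall accuracy looks fine. A self-contained sketch of the effect, using toy labels rather than project data:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 — every class counts equally."""
    f1_scores = []
    for c in set(y_true) | set(y_pred):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# A classifier that always predicts the majority class reaches 0.8
# accuracy here, but only ~0.30 macro F1, because the two minority
# classes are never predicted and contribute an F1 of zero each.
y_true = ["grief"] * 8 + ["awe", "amusement"]
y_pred = ["grief"] * 10
```

This is why the card reports both columns: accuracy tracks the majority classes, while macro F1 exposes how the 16-way Emotion and MBTI targets suffer from sparse minority classes.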
## Academic Context
This project is an undergraduate thesis developed at Sekolah Tinggi Teknologi Cipasung (STTC), Informatics Department.

**Thesis Title:** *RANCANG BANGUN SISTEM ANALISIS MULTIMODAL EMOSI DAN KEPRIBADIAN MBTI PADA LIRIK MUSIK MIDWEST EMO MENGGUNAKAN ARSITEKTUR TRANSFORMER DAN EKSTRAKSI FITUR AUDIO DALAM MUSIK MATH ROCK* (Design and Development of a Multimodal Analysis System for Emotion and MBTI Personality in Midwest Emo Music Lyrics Using a Transformer Architecture and Audio Feature Extraction in Math Rock Music)
## How to Use
Since this is a custom PyTorch architecture, the model class must be defined locally before loading the state dictionary.

```python
import torch
from transformers import WavLMModel, AutoFeatureExtractor

# 1. Define the MultimodalMathRock class matching the training script
# 2. Load the weights
model = MultimodalMathRock()
checkpoint = torch.load("model.pt", map_location="cpu")
model.load_state_dict(checkpoint["model_state"])
model.eval()

# The model is now ready for inference
```
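At inference time, each of the five task heads returns a logit vector. A minimal post-processing sketch for turning those logits into readable predictions — the label vocabularies below are illustrative placeholders, not the training script's actual class lists:

```python
import torch

# Hypothetical label sets; substitute the encodings used during training.
LABELS = {
    "vibe": ["Melancholic", "Energetic", "Dreamy", "Aggressive"],
    "intensity": ["Low", "Medium", "High"],
}


def decode_predictions(logits):
    """Map each task's logit vector to a (label, confidence) pair
    via softmax + argmax."""
    decoded = {}
    for task, scores in logits.items():
        probs = torch.softmax(scores, dim=-1)
        confidence, index = probs.max(dim=-1)
        decoded[task] = (LABELS[task][index.item()], confidence.item())
    return decoded
```

Given the dictionary of logits returned by the model's forward pass, `decode_predictions` yields one labelled prediction per task, with the softmax probability as a rough confidence score.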