anggars committed on
Commit cff7c34 · verified · 1 Parent(s): b88ea44

Update README.md

  1. README.md +21 -19
README.md CHANGED
@@ -5,6 +5,8 @@ tags:
  - music-classification
  - math-rock
  - pytorch
  datasets:
  - anggars/neural-mathrock
  language:
@@ -16,50 +18,50 @@ pipeline_tag: audio-classification

  # Neural Mathrock: Multimodal Emotion and Personality Analysis

- This repository hosts a multimodal deep learning framework specialized in the affective and psychological analysis of Math Rock and Midwest Emo music. By integrating lyrical semantics and acoustic patterns, the system extracts features to classify emotional and personality-based characteristics.

  ## Project Objectives
  The research is structured to prioritize emotional resonance and genre-specific complexities:
- - **Emotion & Vibe Recognition:** Identifying affective states and general vibes (e.g., Melancholic, Aggressive) through the synergy of lyrical themes and audio arrays.
  - **Personality (MBTI) Profiling:** Correlating complex musical arrangements and introspective lyrics with personality archetypes.
  - **Acoustic Feature Extraction:** Analyzing technical attributes like syncopation and odd time signatures using robust audio signal processing.

  ## Technical Architecture: Late Fusion Multimodal
- The system utilizes a custom Multimodal PyTorch architecture:

  ### 1. Lyrical Stream (NLP)
- - **Encoder:** `xlm-roberta-base`
- - **Logic:** Extracts high-level semantic embeddings from song lyrics (768-dim). The base model weights are frozen to maintain stable pre-trained representations.

  ### 2. Acoustic Stream (DSP)
- - **Model:** 2D-Convolutional Neural Network (CNN) followed by a Transformer Encoder.
- - **Input:** 3-channel stacked features (Mel-spectrogram, 13-coefficient MFCC, and Chroma).
- - **Logic:** Captures the complex guitar textures and erratic drum patterns common in the genre, pooling them into a 256-dim feature vector.

  ### 3. Fusion Layer
- - **Method:** Feature concatenation (768-dim Text + 256-dim Audio) into a 1024-dim representation, processed through a 512-dim feed-forward layer with dropout regularization.
  - **Heads:** Multi-task fully connected layers for joint classification of MBTI, Emotion, Vibe, Intensity, and Tempo.
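
The fusion step described in the removed lines above (768-dim text plus 256-dim audio, concatenated to 1024 and projected to 512 before the task heads) might look roughly like the following. This is only a sketch, not the repository's actual module: the class name `LateFusionHead`, the ReLU activation, the dropout rate, and the class counts for Vibe, Intensity, and Tempo are assumptions made for illustration. Only the 16-class size for MBTI and Emotion comes from the README text.

```python
import torch
import torch.nn as nn


class LateFusionHead(nn.Module):
    """Concatenate text and audio embeddings, then branch into task heads."""

    def __init__(self, text_dim=768, audio_dim=256, hidden_dim=512,
                 dropout=0.3, num_classes=None):
        super().__init__()
        # 16 classes for MBTI and Emotion follow the README; the other
        # three counts are placeholders chosen purely for illustration.
        num_classes = num_classes or {
            "mbti": 16, "emotion": 16, "vibe": 4, "intensity": 3, "tempo": 3,
        }
        # Shared feed-forward layer over the concatenated vector (1024 -> 512).
        self.fusion = nn.Sequential(
            nn.Linear(text_dim + audio_dim, hidden_dim),
            nn.ReLU(),            # assumed activation
            nn.Dropout(dropout),  # assumed dropout rate
        )
        # One linear classifier per task, all reading the shared features.
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden_dim, n) for task, n in num_classes.items()}
        )

    def forward(self, text_emb, audio_emb):
        fused = torch.cat([text_emb, audio_emb], dim=-1)
        shared = self.fusion(fused)
        return {task: head(shared) for task, head in self.heads.items()}


# Random tensors stand in for the encoder outputs of a batch of two songs.
model = LateFusionHead().eval()
with torch.no_grad():
    logits = model(torch.randn(2, 768), torch.randn(2, 256))
print({task: tuple(out.shape) for task, out in logits.items()})
```

Keeping the heads in an `nn.ModuleDict` means a per-task loss can be summed over the returned dictionary without hard-coding the task list.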

- ## Performance Summary and Academic Context
- This project is an undergraduate thesis developed at Sekolah Tinggi Teknologi Cipasung (STTC), Informatics Department, Class of 2022. It explores the extreme class imbalances naturally found in niche music genres.

  **Key Findings:**
- - **High-Performance Metrics:** The architecture successfully identifies broad acoustic targets, achieving an **F1-Score of 0.77 for High Intensity** and **0.72 for Melancholic Vibe**, proving the viability of the multimodal feature extraction pipeline.
- - **Class Imbalance in Niche Genres:** For 16-class targets like MBTI and Emotion, the model highlights the natural bias of the dataset. Classes with sufficient samples (e.g., ESTJ, Desire, Pride) perform adequately, while minority classes struggle due to lack of representation (Macro F1 ~0.07). This serves as a realistic baseline for future research regarding data sparsity in highly subjective Music Information Retrieval (MIR) tasks.
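
To see why a macro-averaged F1 can sit far below the per-class scores of well-represented classes, a toy calculation helps. The labels below are invented for illustration and are not drawn from the `anggars/neural-mathrock` dataset. The helper names `per_class_f1` and `macro_f1` are hypothetical and use only the standard library.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Made-up labels: one frequent class and three classes with a single sample.
y_true = ["ESTJ"] * 8 + ["INFP", "ENTP", "ISFJ"]
# A degenerate classifier that always predicts the majority class.
y_pred = ["ESTJ"] * 11

print(per_class_f1(y_true, y_pred))
print(round(macro_f1(y_true, y_pred), 3))
```

In this toy case the majority class scores about 0.84 while every rare class scores 0, so the macro average drops to about 0.21 even though 8 of 11 predictions are correct. With 16 classes and many sparse ones, the same effect is stronger.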

  ## How to Use
- Since this is a custom PyTorch architecture, it cannot be loaded via the standard `pipeline` API. You must define the model classes locally and load the state dictionary.
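
The two loading styles in this diff imply different checkpoint layouts: the removed lines read the weights from a `model_state` key, while the added line passes the loaded object straight to `load_state_dict`. A small helper could accept either layout, assuming those are the only two in use. The name `extract_state_dict` is hypothetical and not part of the repository.

```python
from collections.abc import Mapping


def extract_state_dict(checkpoint, key="model_state"):
    """Return the weight mapping from a wrapped or a bare checkpoint."""
    if not isinstance(checkpoint, Mapping):
        raise TypeError("expected a mapping, got " + type(checkpoint).__name__)
    # Wrapped layout: {"model_state": {...}, plus optional metadata}.
    if key in checkpoint and isinstance(checkpoint[key], Mapping):
        return checkpoint[key]
    # Bare layout: the checkpoint already is the state dict.
    return checkpoint


# Plain dicts stand in for what torch.load would return.
wrapped = {"model_state": {"fusion.weight": [0.1, 0.2]}, "epoch": 3}
bare = {"fusion.weight": [0.1, 0.2]}
print(extract_state_dict(wrapped) == extract_state_dict(bare))
```

With PyTorch, the result of `torch.load("model.pt", map_location="cpu")` would be passed through this helper before calling `model.load_state_dict(...)`.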

  ```python
  import torch

- # 1. Define the MultimodalMathRock architecture classes here first
- # (Refer to the training script for the AudioCNNTransformer, TextTransformer, and Fusion module)
-
  # 2. Load the weights
  model = MultimodalMathRock()
- state_dict = torch.load("model.pt", map_location="cpu")
- model.load_state_dict(state_dict['model_state'])
  model.eval()

  # Model is ready for inference
 
@@ -5,6 +5,8 @@ tags:
  - music-classification
  - math-rock
  - pytorch
+ - wavlm
+ - transformer
  datasets:
  - anggars/neural-mathrock
  language:
 
@@ -16,50 +18,50 @@ pipeline_tag: audio-classification

  # Neural Mathrock: Multimodal Emotion and Personality Analysis

+ This repository hosts a multimodal deep learning framework specialized in the affective and psychological analysis of Math Rock and Midwest Emo music. By integrating lyrical semantics and raw acoustic representations, the system extracts features to classify emotional and personality-based characteristics.

  ## Project Objectives
  The research is structured to prioritize emotional resonance and genre-specific complexities:
+ - **Emotion & Vibe Recognition:** Identifying affective states and general vibes (e.g., Melancholic, Aggressive) through the synergy of lyrical themes and audio.
  - **Personality (MBTI) Profiling:** Correlating complex musical arrangements and introspective lyrics with personality archetypes.
  - **Acoustic Feature Extraction:** Analyzing technical attributes like syncopation and odd time signatures using robust audio signal processing.

  ## Technical Architecture: Late Fusion Multimodal
+ The system utilizes a custom Multimodal PyTorch architecture combining NLP and state-of-the-art audio transformers:

  ### 1. Lyrical Stream (NLP)
+ - **Encoder:** `xlm-roberta-base` (or equivalent transformer)
+ - **Logic:** Extracts high-level semantic embeddings from song lyrics. The base model weights are frozen to maintain stable pre-trained representations.

  ### 2. Acoustic Stream (DSP)
+ - **Model:** Pre-trained Audio Transformer (`WavLMModel`)
+ - **Input:** Raw audio waveform processed via `AutoFeatureExtractor`.
+ - **Logic:** Captures the complex guitar textures and erratic drum patterns common in the genre natively from the waveform, replacing the legacy 2D-CNN Mel-spectrogram approach.

  ### 3. Fusion Layer
+ - **Method:** Feature concatenation (Text Embeddings + Audio Embeddings) into a unified representation, processed through feed-forward layers with dropout regularization.
  - **Heads:** Multi-task fully connected layers for joint classification of MBTI, Emotion, Vibe, Intensity, and Tempo.

+ ## Academic Context
+ This project is an undergraduate thesis developed at Sekolah Tinggi Teknologi Cipasung (STTC), Informatics Department.
+
+ **Thesis Title:** *RANCANG BANGUN SISTEM ANALISIS MULTIMODAL EMOSI DAN KEPRIBADIAN MBTI PADA LIRIK MUSIK MIDWEST EMO MENGGUNAKAN ARSITEKTUR TRANSFORMER DAN EKSTRAKSI FITUR AUDIO DALAM MUSIK MATH ROCK*

  **Key Findings:**
+ - **High-Performance Metrics:** The architecture successfully identifies broad acoustic targets, proving the viability of the multimodal feature extraction pipeline using WavLM.
+ - **Class Imbalance in Niche Genres:** For 16-class targets like MBTI and Emotion, the model highlights the natural bias of the dataset. Classes with sufficient samples perform adequately, while minority classes struggle due to lack of representation. This serves as a realistic baseline for future research regarding data sparsity in highly subjective Music Information Retrieval (MIR) tasks.

  ## How to Use
+ Since this is a custom PyTorch architecture incorporating Hugging Face transformers, it must be loaded by defining the model class locally before loading the state dictionary.

  ```python
  import torch
+ from transformers import WavLMModel, AutoFeatureExtractor

+ # 1. Define your custom multimodal architecture class matching the training script
  # 2. Load the weights
  model = MultimodalMathRock()
+ model.load_state_dict(torch.load("model.pt", map_location="cpu"))
  model.eval()

  # Model is ready for inference