---
library_name: pytorch
tags:
- multimodal
- music-classification
- math-rock
- pytorch
- wavlm
- transformer
datasets:
- anggars/neural-mathrock
language:
- en
metrics:
- accuracy
- f1
pipeline_tag: audio-classification
---

# Neural Mathrock: Multimodal Emotion and Personality Analysis

This repository hosts a multimodal deep learning framework for the affective and psychological analysis of Math Rock and Midwest Emo music. By integrating lyrical semantics with raw acoustic representations, the system learns joint features for classifying emotional and personality-related characteristics.

## Project Objectives
The research is structured to prioritize emotional resonance and genre-specific complexities:
- **Emotion & Vibe Recognition:** Identifying affective states and general vibes (e.g., Melancholic, Aggressive) through the synergy of lyrical themes and audio.
- **Personality (MBTI) Profiling:** Correlating complex musical arrangements and introspective lyrics with personality archetypes.
- **Acoustic Feature Extraction:** Analyzing technical attributes like syncopation and odd time signatures using robust audio signal processing.

## Technical Architecture: Late Fusion Multimodal
The system utilizes a custom Multimodal PyTorch architecture combining NLP and state-of-the-art audio transformers:

### 1. Lyrical Stream (NLP)
- **Encoder:** `xlm-roberta-base`
- **Logic:** Extracts high-level semantic embeddings from song lyrics. The base model weights are frozen to maintain stable pre-trained representations.
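The freezing strategy can be sketched as follows. A small stand-in module is used here so the snippet runs without downloading `xlm-roberta-base`; the same parameter loop applies unchanged to the real encoder:

```python
import torch.nn as nn

# Stand-in for the pre-trained lyrical encoder; the actual system loads
# xlm-roberta-base via Hugging Face transformers.
text_encoder = nn.Sequential(
    nn.Embedding(num_embeddings=1000, embedding_dim=64),
    nn.Linear(64, 64),
)

# Freeze the base weights so the pre-trained representations stay stable.
for param in text_encoder.parameters():
    param.requires_grad = False

# Only trainable layers (fusion and task heads) would be handed to the
# optimizer; the frozen encoder contributes no trainable parameters.
trainable = [p for p in text_encoder.parameters() if p.requires_grad]
print(len(trainable))  # 0
```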

### 2. Acoustic Stream (DSP)
- **Model:** Pre-trained Audio Transformer (`WavLMModel`)
- **Input:** Raw audio waveform processed via `AutoFeatureExtractor`.
- **Logic:** Captures complex guitar textures and erratic drum patterns natively from the waveform, replacing legacy 2D-CNN Mel-spectrogram approaches.
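A minimal sketch of the waveform preprocessing step, assuming 16 kHz mono input and the zero-mean/unit-variance normalization that the Hugging Face feature extractor typically applies for WavLM checkpoints:

```python
import torch

def normalize_waveform(waveform: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Zero-mean, unit-variance normalization of a raw mono waveform,
    mirroring the preprocessing applied before the WavLM encoder."""
    return (waveform - waveform.mean()) / (waveform.std() + eps)

# Three seconds of synthetic audio at 16 kHz (WavLM's expected sample rate).
waveform = torch.randn(3 * 16_000)
inputs = normalize_waveform(waveform)
print(inputs.shape)  # torch.Size([48000])
```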

### 3. Fusion Layer
- **Method:** Feature concatenation (Text Embeddings + Audio Embeddings) into a unified representation, processed through feed-forward layers with dropout regularization.
- **Heads:** Multi-task fully connected layers for joint classification of MBTI, Emotion, Vibe, Intensity, and Tempo.
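The late-fusion step above can be sketched as follows; the embedding sizes and per-task class counts here are illustrative assumptions, not the exact values used in training:

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Concatenates text and audio embeddings, then routes the fused
    representation through a shared feed-forward layer (with dropout)
    into one classification head per task."""
    def __init__(self, text_dim=768, audio_dim=768, hidden=512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + audio_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),  # dropout regularization on the fused features
        )
        # Class counts below are assumptions for illustration.
        self.heads = nn.ModuleDict({
            "mbti": nn.Linear(hidden, 16),
            "emotion": nn.Linear(hidden, 8),
            "vibe": nn.Linear(hidden, 6),
            "intensity": nn.Linear(hidden, 3),
            "tempo": nn.Linear(hidden, 3),
        })

    def forward(self, text_emb, audio_emb):
        fused = self.fuse(torch.cat([text_emb, audio_emb], dim=-1))
        return {task: head(fused) for task, head in self.heads.items()}

model = LateFusionHead()
out = model(torch.randn(2, 768), torch.randn(2, 768))
print(out["mbti"].shape)  # torch.Size([2, 16])
```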

## Evaluation Metrics (Epoch 10)
The following results were obtained using the `model.pt` (Final Epoch 10) on a 400-sample evaluation subset:

| Task | Accuracy | Macro F1-Score | Key Performance Note |
| :--- | :--- | :--- | :--- |
| **Vibe** | 78.00% | 0.76 | Exceptional detection of 'Melancholic' tracks (0.83 F1). |
| **Intensity** | 73.00% | 0.69 | Highly stable predictions for 'Medium' intensity levels. |
| **Emotion** | 55.75% | 0.25 | Strong performance on 'Grief' (0.71 F1) and 'Amusement'. |
| **Tempo** | 54.50% | 0.30 | Consistent performance on 'Moderate' tempo classifications. |
| **MBTI** | 25.75% | 0.23 | Outperforms random baseline (6.25%) by a factor of 4. |

## Academic Context
This project is an undergraduate thesis developed at Sekolah Tinggi Teknologi Cipasung (STTC), Informatics Department. 

**Thesis Title:** *RANCANG BANGUN SISTEM ANALISIS MULTIMODAL EMOSI DAN KEPRIBADIAN MBTI PADA LIRIK MUSIK MIDWEST EMO MENGGUNAKAN ARSITEKTUR TRANSFORMER DAN EKSTRAKSI FITUR AUDIO DALAM MUSIK MATH ROCK* (Design and Development of a Multimodal Emotion and MBTI Personality Analysis System for Midwest Emo Music Lyrics Using a Transformer Architecture and Audio Feature Extraction in Math Rock Music)

## How to Use
Since this is a custom PyTorch architecture, the model class must be defined locally before loading the state dictionary.

```python
import torch
from transformers import WavLMModel, AutoFeatureExtractor

# 1. Define the MultimodalMathRock class architecture locally,
#    matching the training code in this repository.
model = MultimodalMathRock()

# 2. Load the checkpoint weights.
checkpoint = torch.load("model.pt", map_location="cpu")
model.load_state_dict(checkpoint['model_state'])
model.eval()

# Model is ready for inference.
```