# Structural Derivation: A Model for Analyzing Musical Coherence

This repository contains a model trained to understand the internal structure of music. It can be used to generate two novel metrics for any given audio file: the **Structural Derivation Consistency (SDC)** score and the **Structural Diversity** score.

These metrics are particularly useful for quantifying the high-level coherence of a musical piece and can effectively distinguish between human-composed music and music generated by AI models that may exhibit "thematic drift."

---

## 🎵 Key Metrics

The model analyzes a song by first splitting it into segments (e.g., 20 seconds each) and generating a high-level embedding for each segment. These embeddings are then used to compute the following scores.

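For intuition, the segmentation step can be sketched as follows. This is a simplified illustration with a hypothetical `split_into_segments` helper and an assumed sample rate, not the package's actual preprocessing:

```python
import numpy as np

def split_into_segments(waveform: np.ndarray, sample_rate: int, seconds: float = 20.0):
    """Split a mono waveform into consecutive fixed-length segments."""
    seg_len = int(seconds * sample_rate)
    n = len(waveform) // seg_len  # drop the trailing partial segment
    return [waveform[i * seg_len:(i + 1) * seg_len] for i in range(n)]

# Three minutes of audio at 16 kHz yields nine 20-second segments.
audio = np.zeros(180 * 16000)
print(len(split_into_segments(audio, 16000)))  # 9
```

Each resulting segment is then passed through the encoder to obtain one embedding per segment.
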
### 1. Structural Derivation Consistency (SDC) Score

The SDC score quantifies a song's **structural and thematic integrity**. It serves as a proxy for **musical narrative consistency**, measuring how well a song "remembers" and develops the musical ideas presented in its opening segment.

* **How it works:** The score is the average cosine similarity between the embedding of the first segment (S1) and every subsequent segment (S2, S3, ..., SN).
* **Interpretation:**
  * A **high SDC score** suggests the piece is thematically coherent. Like a well-written story, it builds upon its initial premise. This is characteristic of human-composed music, which often uses repetition, variation, and development of a core motif.
  * A **low SDC score** indicates potential **"thematic drift."** The song's later sections diverge significantly from its beginning, lacking a clear connection to the initial musical statement. This can be a trait of AI-generated music that struggles with long-form compositional structure.

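Concretely, given one embedding per segment, the SDC computation described above can be sketched as follows. The `sdc_score` helper and the toy embeddings are hypothetical illustrations, not part of the package's API:

```python
import numpy as np

def sdc_score(segment_embeddings: np.ndarray) -> float:
    """Average cosine similarity between the first segment and each later one."""
    # Normalize every segment embedding to unit length.
    norms = np.linalg.norm(segment_embeddings, axis=1, keepdims=True)
    unit = segment_embeddings / norms
    # Cosine similarity of S1 with S2..SN reduces to dot products of unit vectors.
    sims = unit[1:] @ unit[0]
    return float(sims.mean())

# Toy example: three 4-dimensional segment embeddings.
emb = np.array([[1.0, 0.0, 0.0, 0.0],   # S1
                [1.0, 0.0, 0.0, 0.0],   # S2: identical to S1 (similarity 1)
                [0.0, 1.0, 0.0, 0.0]])  # S3: orthogonal to S1 (similarity 0)
print(sdc_score(emb))  # 0.5
```
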
### 2. Structural Diversity Score

The Structural Diversity score measures the **variety of distinct musical ideas** within a single piece of music. It complements the SDC score by providing insight into the song's complexity and repetitiveness.

* **How it works:** The score is calculated using the [Vendi Score](https://github.com/verga11/vendi-score) on the similarity matrix derived from all segment embeddings.
* **Interpretation:**
  * A **high Diversity score** indicates that the song contains a wide range of different-sounding segments. The piece is musically rich and varied.
  * A **low Diversity score** suggests the song is repetitive or monotonous, with very similar-sounding segments throughout.

A compelling musical piece might exhibit both high SDC (it's coherent) and high Diversity (it's interesting and not overly repetitive).

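For intuition, the Vendi Score of an n × n similarity matrix is the exponential of the Shannon entropy of the eigenvalues of the matrix divided by n, so it ranges from 1 (all segments identical) to n (all segments distinct). A minimal sketch of that idea, not the package's implementation:

```python
import numpy as np

def vendi_score(K: np.ndarray) -> float:
    """Vendi Score of an n x n similarity matrix K (with K[i, i] == 1)."""
    n = K.shape[0]
    # Eigenvalues of K / n sum to 1 and act like a probability distribution.
    eigvals = np.linalg.eigvalsh(K / n)
    eigvals = eigvals[eigvals > 1e-12]  # drop numerical zeros
    # Exponential of the Shannon entropy of that distribution.
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))

# n identical segments -> score 1; n mutually distinct segments -> score n.
print(round(vendi_score(np.ones((4, 4))), 6))  # 1.0
print(round(vendi_score(np.eye(4)), 6))        # 4.0
```
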
---

## 💻 Installation & Usage

### 1. Installation

Install the package and its dependencies directly from the GitHub repository:

```bash
pip install git+https://github.com/keshavbhandari/musical_structure_metrics.git
```

### 2. Example Usage

The following script loads the model from Hugging Face, processes a list of audio files, and prints the resulting scores. The `analyze_audio_files` function handles splitting the audio, batching, and computing both metrics.

```python
import torch

from structure_derivation.analyzer import analyze_audio_files
from structure_derivation.model.model import StructureDerivationModel

# 1. Set up the device and load the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The model will be automatically downloaded from Hugging Face
model = StructureDerivationModel.from_pretrained(
    "keshavbhandari/structure-derivation"
)
model.to(device)
model.eval()

# 2. Define the list of audio files you want to analyze
my_audio_files = [
    "path/to/your/song1.mp3",
    "path/to/your/song2.wav",
    "path/to/another/track.flac",
]

# 3. Analyze the files to get the scores
# This function returns a dictionary with file paths as keys
analysis_results = analyze_audio_files(my_audio_files, model, device)

# 4. Print the results
for path, scores in analysis_results.items():
    print(f"\n--- Results for: {path} ---")
    print(f"  Structural Derivation Consistency: {scores['Structural Derivation Consistency']:.4f}")
    print(f"  Structural Diversity: {scores['Structural Diversity']:.4f}")
```

### 3. Model Architecture & Training

#### Architecture

The core of this model is the Hierarchical Token-Semantic Audio Transformer (HTS-AT) encoder. This architecture, introduced by Chen et al., is a highly efficient and powerful audio transformer.

Key features of HTS-AT include:

- **Hierarchical Structure:** It uses Patch-Merge layers to gradually reduce the number of audio tokens as they pass through the network. This significantly reduces GPU memory consumption and training time compared to standard transformers.
- **Windowed Attention:** Instead of computing self-attention across the entire audio spectrogram, it uses the more efficient windowed attention mechanism from the Swin Transformer. This captures local relationships effectively while remaining computationally tractable.
- **High Performance:** HTS-AT achieves state-of-the-art results on major audio classification benchmarks such as AudioSet and ESC-50.

This architecture is well suited to our purpose, as it can efficiently process long audio clips and learn rich, meaningful representations of musical structure.

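The token-reduction idea behind Patch-Merge can be illustrated schematically. This toy version simply concatenates the features of adjacent token pairs; the real architecture merges 2D patch neighborhoods and follows the merge with a learned linear projection:

```python
import numpy as np

def patch_merge(tokens: np.ndarray) -> np.ndarray:
    """Halve the token count by concatenating features of each adjacent pair."""
    b, t, c = tokens.shape
    return tokens.reshape(b, t // 2, 2 * c)

x = np.random.randn(1, 256, 96)   # 256 audio tokens, 96 feature channels
print(patch_merge(x).shape)       # (1, 128, 192)
```

Halving the sequence length at each stage is what keeps attention costs manageable as the network deepens.
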
#### Training

The model was trained using a contrastive learning objective on a large-scale corpus of nearly 400,000 music files (each longer than one minute) for 20 epochs. The training data was sourced from a combination of the following publicly available datasets:

- GiantSteps Key Dataset
- FSL10K
- Emotify
- Emopia
- ACM MIRUM
- JamendoMaxCaps
- MTG-Jamendo
- SongEval
- Song Describer
- ISMIR04
- Hainsworth
- Saraga
- DEAM
- MAESTRO v3.0.0

### 4. Re-training on Custom Data

If you wish to re-train the model on your own dataset, you can use the provided training script. Run the following command from the root of the repository, adjusting the GPU devices as needed:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 torchrun --nproc_per_node=6 --master_port=12345 structure_derivation/train/train.py
```

This will start the training process using the default hyperparameters specified in the script. You can modify these parameters as needed to suit your dataset and training requirements.