Update README.md
---
base_model:
- stabilityai/stable-audio-open-1.0
---

<h1 align="center"> SAO finetuning for modern beat generation</h1>
<p align="center">
As a music and AI lover, I wanted to dive into music generation technologies. I started by exploring existing music generation models such as Suno and Stable Audio 2.0, but I couldn't find any that could generate trap/rap/R&B beats well. That gave me the idea of fine-tuning an open-source model on a large number of trap beats. I chose Stable Audio Open 1.0, as I found it to be the most suitable open-source foundation for this kind of task.
</p>

<p align="center">
<img src="./assets/preview.gif" alt="preview" width="400"/>
</p>

# Results

### Prompt 1
*A jazzy relaxed jazz rap beat at 95 BPM, featuring piano and nylon guitar, with lovely moods.*

| Stable Audio | DreamT 14 |
|:--|:--|
| <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/stable-audio-1/1830357556.wav"></audio> | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/dreamt_14/1830357556.wav"></audio> |

---

### Prompt 2
*A dark and melancholic cloud trap beat, with nostalgic piano, plucked bass and synth bells, at 110 BPM.*

| Stable Audio | DreamT 14 |
|:--|:--|
| <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/stable-audio-1/2306776750.wav"></audio> | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/dreamt_14/2306776750.wav"></audio> |

---

### Prompt 3
*A laid back lo-fi jazz rap at 85 BPM, featuring deep sub, plucked bass, and vocal chop, with chill and jazzy relaxed moods.*

| Stable Audio | DreamT 14 |
|:--|:--|
| <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/stable-audio-1/2505643137.wav"></audio> | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/dreamt_14/2505643137.wav"></audio> |

---

### Prompt 4
*Melancholic trap beat at 105 BPM with shimmering synth bells and deep sub bass, minor chord progressions on piano, and airy vocal pads, evoking a cinematic and emotional atmosphere.*

| Stable Audio | DreamT 14 |
|:--|:--|
| <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/stable-audio-1/1580039167.wav"></audio> | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/dreamt_14/1580039167.wav"></audio> |

---

### Prompt 5
*Relaxed chillhop beat with melodic piano, plucked guitar, vocal harmonization, and smooth sample loops. Light, airy mood with jazzy undertones at 85 BPM.*

| Stable Audio | DreamT 14 |
|:--|:--|
| <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/stable-audio-1/1984661836.wav"></audio> | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/dreamt_14/1984661836.wav"></audio> |

---

### Prompt 6
*Dark trap beat with detuned piano, sub drops, glitch effects, booming bass, vocal chops, and cinematic samples. Brooding and ominous mood at 90 BPM.*

| Stable Audio | DreamT 14 |
|:--|:--|
| <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/stable-audio-1/2756405298.wav"></audio> | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/dreamt_14/2756405298.wav"></audio> |

---

### Prompt 7
*Smooth and seductive at 115 BPM trap beat with electric guitar riffs, plucked bass, vocal adlibs, and warm synth pads. Relaxed, romantic, and sexy mood.*

| Stable Audio | DreamT 14 |
|:--|:--|
| <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/stable-audio-1/3278661061.wav"></audio> | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/dreamt_14/3278661061.wav"></audio> |

# Dataset

I used 20,000 trap/rap beats spanning various subgenres such as cloud, trap, R&B, EDM, industrial hip-hop, and jazzy chillhop. From each instrumental, I extracted two segments of 20 to 35 seconds while keeping track of their starting timestamps, which gave a dataset of 40k audio segments totaling about 277 hours. This allowed the model not only to learn the content of the beats but also to capture the temporal structure inherent to the musical phrases.
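
The segment extraction described above could be sketched as follows. This is a minimal illustration, not the script used to build the dataset: the helper name `pick_segments`, the uniform random choice of durations and offsets, and the absence of any overlap check are all assumptions.

```python
import random

MIN_LEN_S = 20  # shortest segment, in seconds (from the text above)
MAX_LEN_S = 35  # longest segment, in seconds (from the text above)


def pick_segments(track_len_s, n_segments=2, rng=None):
    """Return (start, duration) pairs, in whole seconds, for one track.

    Hypothetical helper: durations and start offsets are drawn uniformly
    and segments may overlap. The original pipeline may differ.
    """
    rng = rng or random.Random()
    if track_len_s < MIN_LEN_S:
        raise ValueError("track is shorter than the minimum segment length")
    segments = []
    for _ in range(n_segments):
        # Cap the duration so the segment always fits inside the track.
        duration = rng.randint(MIN_LEN_S, min(MAX_LEN_S, track_len_s))
        start = rng.randint(0, track_len_s - duration)
        segments.append((start, duration))
    return segments


if __name__ == "__main__":
    # A 180-second beat yields two (start, duration) pairs.
    print(pick_segments(180, rng=random.Random(0)))
```

Storing the start offset next to each clip is what makes fields such as `start` and `duration` available in the per-segment metadata.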

A key goal of this project was to enable the model to learn new instruments (synth bells, deep sub, plucked bass, snare, etc.), tempos, and rhythmic patterns that are strongly associated with trap and its subgenres. To achieve this, I tagged each segment by computing its similarity with curated lists of instruments, moods, and genres using a LAION CLAP model.
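
The tagging step can be pictured as a nearest-label search in CLAP's shared audio-text embedding space. The sketch below assumes the audio and label embeddings have already been computed elsewhere; the helper names `cosine` and `top_tags`, the value of `k`, and the similarity threshold are illustrative assumptions, not values from this project.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


def top_tags(audio_emb, label_embs, k=3, min_sim=0.0):
    """Return up to k labels most similar to the audio embedding.

    label_embs maps each candidate label to its text embedding.
    k and min_sim are hypothetical settings chosen for illustration.
    """
    scored = [(cosine(audio_emb, emb), label) for label, emb in label_embs.items()]
    scored.sort(reverse=True)  # most similar first
    return [label for sim, label in scored[:k] if sim >= min_sim]


if __name__ == "__main__":
    # Toy 3-dimensional vectors stand in for real CLAP embeddings.
    label_embs = {
        "synth bells": [1.0, 0.0, 0.0],
        "plucked bass": [0.0, 1.0, 0.0],
        "snare": [0.0, 0.0, 1.0],
    }
    print(top_tags([0.9, 0.4, -0.2], label_embs, k=2))
```

The same function can be run once per curated list (instruments, moods, genres) to fill the corresponding tag fields.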

Additionally, I used the Essentia library to extract the BPM (deeptemp-k16-3) and key/scale of each audio segment, keeping only predictions with a confidence above 70%.

```json
{
  "39118.wav": {
    "instruments_tags": [
      "plucked guitar",
      "synth bells",
      "movie sample"
    ],
    "genres_tags": [
      "rap with soul"
    ],
    "moods_tags": [
      "trap melancholic",
      "love"
    ],
    "key": "G",
    "scale": "minor",
    "tempo": 109.0,
    "start": 63,
    "duration": 26
  }
}
```
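
The confidence filter on tempo and key can be sketched as below. The inputs are assumed to be plain `(value, confidence)` pairs already produced by the analysis step; the helper name `filter_confident` and the choice to drop low-confidence fields entirely are assumptions, and no Essentia call is shown.

```python
CONFIDENCE_THRESHOLD = 0.70  # "above 70%" from the text


def filter_confident(predictions, threshold=CONFIDENCE_THRESHOLD):
    """Keep only fields whose confidence is strictly above the threshold.

    predictions maps a field name to a (value, confidence) pair, with
    confidence in [0, 1]. Hypothetical helper for illustration only.
    """
    return {
        name: value
        for name, (value, confidence) in predictions.items()
        if confidence > threshold
    }


if __name__ == "__main__":
    raw = {
        "tempo": (109.0, 0.93),
        "key": ("G minor", 0.55),  # below the threshold, so it is dropped
    }
    print(filter_confident(raw))
```

Dropping an uncertain field keeps it out of the prompt instead of teaching the model a wrong tempo or key.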

I also generated synonyms to improve the model's language variety. This combination of features (instrumentation, tempo, key, mood, and genre) provided a rich set of musical metadata.

<p align="center">
<img src="./assets/FreqMoods.png" alt="Mood frequencies" width="400"/>
<img src="./assets/FreqInstruments.png" alt="Instrument frequencies" width="400"/>
<img src="./assets/FreqTempo.png" alt="Tempo frequencies" width="400"/>
</p>

Using this metadata, I generated more human-readable prompts with Llama 3.1 3B running locally, allowing the fine-tuned model to produce beats that better reflect the stylistic and structural characteristics of trap music.

```json
{"filepath": "39118.wav", "start": 63, "duration": 26, "prompt": "A melancholic and love-inspired rap with soul beat at 109 BPM in G minor, using plucked guitar, synth bells, and movie sample."}
```
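
To make the mapping from metadata to training record concrete, here is a small sketch that turns one metadata entry into a JSONL line. It uses a fixed template in place of the language model, so the wording is less natural than the Llama-generated prompt shown above; `build_record` and the template itself are illustrative assumptions, and the entry is assumed to contain every field.

```python
import json


def build_record(filename, meta):
    """Build one JSONL training record from a metadata entry.

    Hypothetical template-based stand-in for the LLM rewriting step.
    """
    moods = " and ".join(meta["moods_tags"])
    genre = meta["genres_tags"][0]
    instruments = ", ".join(meta["instruments_tags"])
    prompt = (
        f"A {moods} {genre} beat at {meta['tempo']:.0f} BPM "
        f"in {meta['key']} {meta['scale']}, using {instruments}."
    )
    return {
        "filepath": filename,
        "start": meta["start"],
        "duration": meta["duration"],
        "prompt": prompt,
    }


if __name__ == "__main__":
    meta = {
        "instruments_tags": ["plucked guitar", "synth bells", "movie sample"],
        "genres_tags": ["rap with soul"],
        "moods_tags": ["trap melancholic", "love"],
        "key": "G",
        "scale": "minor",
        "tempo": 109.0,
        "start": 63,
        "duration": 26,
    }
    # One JSON object per line, as in the example above.
    print(json.dumps(build_record("39118.wav", meta)))
```

A language model adds value over such a template mainly by smoothing tag names (for example "trap melancholic" into "melancholic") and varying sentence structure.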

# Training

The model was trained on an NVIDIA A100 GPU in Google Colab for about 42 hours, on a total of 40k audio segments (~277 h) over 14 epochs. I set a batch size of 16, resulting in approximately 2.5k steps per epoch, or 35k steps in total.
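
The step counts follow directly from the figures above. The short check below assumes exactly 40,000 segments, one optimizer step per batch (no gradient accumulation), and treats the 42 h and 277 h figures as exact, so the derived averages are only rough estimates.

```python
import math

# Figures quoted in the text; all treated as exact for this estimate.
num_segments = 40_000
batch_size = 16
epochs = 14
total_audio_hours = 277
training_hours = 42

steps_per_epoch = math.ceil(num_segments / batch_size)
total_steps = steps_per_epoch * epochs
avg_segment_s = total_audio_hours * 3600 / num_segments
seconds_per_step = training_hours * 3600 / total_steps

print(steps_per_epoch)             # 2500
print(total_steps)                 # 35000
print(round(avg_segment_s, 1))     # 24.9
print(round(seconds_per_step, 2))  # 4.32
```

The average segment length of about 25 seconds sits inside the 20 to 35 second range used for extraction.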

# Results Analysis

The model performs particularly well on melodic beats with a smooth and floating atmosphere.
It captures harmonic structures effectively and keeps a strong sense of coherence between instruments, mood, and tempo, which makes the generated beats sound natural, balanced, and musically pleasing.
The model generates interesting beats that reflect the given prompt fairly well.

However, the model tends to underperform on styles that were underrepresented in the training dataset, such as boom bap or high-energy beats with dense percussive layers.
This limitation mainly stems from the uneven tag distribution within the dataset: certain instruments and genres are simply less present.
In addition, the tagging tool (CLAP), trained on general-purpose audio datasets like LAION-Audio-630K, is not specialized for specific genres such as trap or hip-hop, leading to imprecise tagging of elements like snares, hi-hats, or 808 bass.
As a result, these styles are harder for the model to reproduce accurately.

# Perspectives

I'd like to fine-tune for another 2-3 epochs on a smaller dataset that better represents the underrepresented styles.
It'd be interesting to start over with a CLAP model specialized in trap/rap genres.
I'm open to any feedback or suggestions on my work.

## Sources
- [**Stable Audio Open 1.0**](https://huggingface.co/stabilityai/stable-audio-open-1.0) - Model used.
- [**LoRAW**](https://github.com/NeuralNotW0rk/LoRAW) - Pipeline implementation for Stable Audio Open LoRA fine-tuning.
- [**Stable Audio Tools**](https://github.com/Stability-AI/stable-audio-tools) - Official Stability AI framework for using Stable Audio Open.
- [**Essentia**](https://essentia.upf.edu/models.html) - Library for music feature extraction.