---
base_model:
- stabilityai/stable-audio-open-1.0
tags:
- music-generation
- trap
- rap
- hip-hop
- beat-generation
- fine-tuning
- music-tagging
---

<h1 align="center">SAO fine-tuning for modern beat generation</h1>
<p align="center">
As a music and AI lover, I wanted to dive into music generation technologies.
</p>

<p align="center">
  <img src="./assets/preview.gif" alt="preview" width="400"/>
</p>

<p align="center">
I started by exploring existing music generation models such as Suno and Stable Audio 2.0, but I couldn't find one that generated modern trap/rap/R&B beats well. That gave me the idea of fine-tuning an open-source model on a large collection of trap beats. I chose Stable Audio Open 1.0, as I found it to be the most suitable open-source foundation for this kind of task.
</p>

# Results

[**Here**](https://github.com/Gab404/Stable-BeaT) is the GitHub repository for model inference.
<br>
All of the following results were generated with 200 steps, a CFG scale of 7, a start time of 0 s, and a duration of 47 s.
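
As a rough sketch, these settings can be bundled into a single request object. The `seconds_start` / `seconds_total` conditioning keys follow the usage example on the Stable Audio Open 1.0 model card; the helper name `build_generation_request` and the layout of the returned dictionary are hypothetical and not taken from the Stable-BeaT repository.

```python
def build_generation_request(prompt, steps=200, cfg_scale=7.0,
                             seconds_start=0, seconds_total=47):
    """Bundle a text prompt with the sampling settings listed above.

    Hypothetical helper: only the conditioning keys mirror the
    Stable Audio Open 1.0 usage example.
    """
    # Stable Audio Open 1.0 generates clips of up to about 47 seconds.
    if not 0 < seconds_total <= 47:
        raise ValueError("seconds_total must be in (0, 47]")
    conditioning = [{
        "prompt": prompt,
        "seconds_start": seconds_start,
        "seconds_total": seconds_total,
    }]
    return {"conditioning": conditioning, "steps": steps, "cfg_scale": cfg_scale}


request = build_generation_request(
    "A dark and melancholic cloud trap beat, with nostalgic piano, "
    "plucked bass and synth bells, at 110 BPM."
)
```

In stable-audio-tools, values like these would typically be passed to `generate_diffusion_cond` through its `conditioning`, `steps`, and `cfg_scale` arguments; how the fine-tuned weights are loaded is left out here.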

---

### Prompt 1
*A dark and melancholic cloud trap beat, with nostalgic piano, plucked bass and synth bells, at 110 BPM.*

| Stable Audio Open 1.0 | StableBeaT |
|:--|:--|
| <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/stable-audio-1/2306776750.wav"></audio> | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/dreamt_14/2306776750.wav"></audio> |

| BPM | Spectral Centroid | Spectral Flatness | Harmonic/Percussive Ratio | Transient Sharpness | CLAP Prompt Score |
|:--|:--|:--|:--|:--|:--|
| **106.13** | **1159.43** | **0.000091** | **0.460** | **0.000073** | **0.489** |

---

### Prompt 2  
*A laid back lo-fi jazz rap at 85 BPM, featuring deep sub, plucked bass, and vocal chop, with chill and jazzy relaxed moods.*

| Stable Audio Open 1.0 | StableBeaT |
|:--|:--|
| <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/stable-audio-1/2505643137.wav"></audio> | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/dreamt_14/2505643137.wav"></audio> |

| BPM | Spectral Centroid | Spectral Flatness | Harmonic/Percussive Ratio | Transient Sharpness | CLAP Prompt Score |
|:--|:--|:--|:--|:--|:--|
| **82.72** | **784.82** | **0.000030** | **0.457** | **0.000015** | **0.429** |

---

### Prompt 3  
*Melancholic trap beat at 105 BPM with shimmering synth bells and deep sub bass, minor chord progressions on piano, and airy vocal pads, evoking a cinematic and emotional atmosphere.*

| Stable Audio Open 1.0 | StableBeaT |
|:--|:--|
| <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/stable-audio-1/1580039167.wav"></audio> | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/dreamt_14/1580039167.wav"></audio> |

| BPM | Spectral Centroid | Spectral Flatness | Harmonic/Percussive Ratio | Transient Sharpness | CLAP Prompt Score |
|:--|:--|:--|:--|:--|:--|
| **100.45** | **2540.28** | **0.000284** | **1.412** | **0.0000585** | **0.523** |

---

### Prompt 4  
*A jazzy chillhop beat at 101 BPM featuring synth bells, vocal pad, and movie sample, evoking trap nostalgic and chill moods.*

| Stable Audio Open 1.0 | StableBeaT |
|:--|:--|
| <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/stable-audio-1/1784661836.wav"></audio> | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/dreamt_14/1784661836.wav"></audio> |

| BPM | Spectral Centroid | Spectral Flatness | Harmonic/Percussive Ratio | Transient Sharpness | CLAP Prompt Score |
|:--|:--|:--|:--|:--|:--|
| **148.02** | **4287.26** | **0.00179** | **2.963** | **0.000195** | **0.552** |

---

### Prompt 5  
*Smooth and seductive at 115 BPM trap beat with electric guitar riffs, plucked bass, vocal adlibs, and warm synth pads. Relaxed, romantic, and sexy mood.*

| Stable Audio Open 1.0 | StableBeaT |
|:--|:--|
| <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/stable-audio-1/3278661061.wav"></audio> | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/dreamt_14/3278661061.wav"></audio> |

| BPM | Spectral Centroid | Spectral Flatness | Harmonic/Percussive Ratio | Transient Sharpness | CLAP Prompt Score |
|:--|:--|:--|:--|:--|:--|
| **82.72** | **1056.42** | **0.000046** | **0.645** | **0.000089** | **0.478** |

---

### Prompt 6  
*A moody cloud trap beat, boomy bass, synth bells and melodic piano, evoking etherate mood at 100 BPM.*

| Stable Audio Open 1.0 | StableBeaT |
|:--|:--|
| <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/stable-audio-1/3576830411.wav"></audio> | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/dreamt_14/3576830411.wav"></audio> |

| BPM | Spectral Centroid | Spectral Flatness | Harmonic/Percussive Ratio | Transient Sharpness | CLAP Prompt Score |
|:--|:--|:--|:--|:--|:--|
| **144.2** | **2458.5** | **0.000356** | **0.738** | **0.00206** | **0.363** |

---

### Prompt 7  
*A smooth neo-soul R&B instrumental at 90 BPM in D major, featuring live bass, soft Rhodes keys, and warm analog drum grooves.*

| Stable Audio Open 1.0 | StableBeaT |
|:--|:--|
| <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/stable-audio-1/1121349264.wav"></audio> | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/dreamt_14/1121349264.wav"></audio> |

| BPM | Spectral Centroid | Spectral Flatness | Harmonic/Percussive Ratio | Transient Sharpness | CLAP Prompt Score |
|:--|:--|:--|:--|:--|:--|
| **130.81** | **1000.87** | **0.000166** | **0.679** | **0.000007288** | **0.250** |


---


# Dataset

I used 20,000 trap/rap beats spanning various subgenres such as cloud, trap, R&B, EDM, industrial hip-hop, and jazzy chillhop. For each instrumental, I extracted two segments of 20 to 35 seconds while keeping track of their starting timestamps, which gave a dataset of 40k audio segments totaling about 277 hours. Keeping the timestamps allowed the model not only to learn the content of the beats but also to capture the temporal structure inherent to the musical phrases.
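
The segment extraction can be sketched as follows. This is a minimal illustration, assuming segments are drawn at random positions with whole-second boundaries; the actual selection strategy is not described here, and the function name `pick_segments` and its parameters are hypothetical.

```python
import random


def pick_segments(track_seconds, n_segments=2, min_len=20, max_len=35, seed=None):
    """Return n_segments dicts with a start time and duration (in seconds).

    Assumption: segments are sampled independently, so they may overlap.
    """
    total = int(track_seconds)
    if total < min_len:
        raise ValueError("track is shorter than the minimum segment length")
    rng = random.Random(seed)
    segments = []
    for _ in range(n_segments):
        # Clamp the duration so the segment always fits inside the track.
        duration = rng.randint(min_len, min(max_len, total))
        start = rng.randint(0, total - duration)
        segments.append({"start": start, "duration": duration})
    return segments


# Two segments from a hypothetical 3-minute instrumental.
print(pick_segments(180, seed=0))
```

With 40k segments totaling about 277 hours, the average segment length works out to roughly 25 seconds (277 × 3600 / 40,000 ≈ 24.9).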

A key goal of this project was to enable the model to learn new instruments (synth bells, deep sub, plucked bass, snare, ...), tempos, and rhythmic patterns that are strongly associated with trap and its subgenres. To achieve this, I tagged each segment by computing its similarity with curated lists of instruments, moods, and genres using a LAION CLAP model.
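
The similarity-based tagging can be sketched as below. It assumes the audio segment and each candidate tag have already been embedded into the same space (for example by CLAP's audio and text encoders); the toy vectors, the function names, and the `k` / `min_sim` cut-offs are hypothetical, since the actual thresholds are not stated.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_tags(audio_emb, tag_embs, k=3, min_sim=0.0):
    """Return up to k tags whose embeddings are most similar to the audio."""
    scored = sorted(
        ((cosine(audio_emb, emb), tag) for tag, emb in tag_embs.items()),
        reverse=True,
    )
    return [tag for sim, tag in scored[:k] if sim >= min_sim]


# Toy 2-D embeddings standing in for real CLAP outputs.
audio = [1.0, 0.0]
instrument_tags = {
    "synth bells": [0.9, 0.1],
    "deep sub": [0.5, 0.5],
    "snare": [0.0, 1.0],
}
print(top_tags(audio, instrument_tags, k=2))
```

The same ranking would be run once per curated list (instruments, moods, genres) to fill the corresponding tag fields.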

Additionally, I used the Essentia library to extract the BPM (deeptemp-k16-3) and key/scale of each audio segment, considering only predictions with confidence above 70%.

```json
{
  "39118.wav": {
    "instruments_tags": [
      "plucked guitar",
      "synth bells",
      "movie sample"
    ],
    "genres_tags": [
      "rap with soul"
    ],
    "moods_tags": [
      "trap melancholic",
      "love"
    ],
    "key": "G",
    "scale": "minor",
    "tempo": 109.0,
    "start": 63,
    "duration": 26
  }
}
```
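
A record like the one above could be assembled as follows. This is a sketch under assumptions: the tempo and key estimators are taken to return a value together with a confidence in [0, 1], and low-confidence estimates are simply left out of the record. The function name `build_entry` and its argument layout are hypothetical.

```python
def build_entry(tags, tempo_pred, key_pred, start, duration, min_conf=0.7):
    """Assemble one metadata record, dropping low-confidence tempo/key estimates.

    tags:       dict with instruments_tags, genres_tags, moods_tags
    tempo_pred: (bpm, confidence)
    key_pred:   (key, scale, confidence)
    """
    entry = dict(tags)
    bpm, bpm_conf = tempo_pred
    if bpm_conf > min_conf:
        entry["tempo"] = float(bpm)
    key, scale, key_conf = key_pred
    if key_conf > min_conf:
        entry["key"] = key
        entry["scale"] = scale
    entry["start"] = start
    entry["duration"] = duration
    return entry


entry = build_entry(
    {"instruments_tags": ["plucked guitar", "synth bells", "movie sample"],
     "genres_tags": ["rap with soul"],
     "moods_tags": ["trap melancholic", "love"]},
    tempo_pred=(109.0, 0.91),       # confidence values are made up
    key_pred=("G", "minor", 0.84),
    start=63,
    duration=26,
)
```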

I also generated synonyms for the tags to improve the variety of the language the model sees. Together, these features (instrumentation, tempo, key, mood, and genre) provide a rich set of musical metadata.

<p align="center">
  <img src="./assets/cluster.png" alt="Frequence moods" width="500"/>
</p>
We can observe how T5-Base encodes all of my tags, resulting in distinct groups such as:

- Emotion (e.g., cheerful, joyful, dreamy)

- Groove (e.g., swing groove, nylon guitar, movie sample)

- Genre (e.g., g-funk, chill rap beat, jazzy chillhop)

- Sonority (e.g., trap vocal, trap guitar)

The clusters are very close to each other (Silhouette Score: 0.095), which is expected given that the model is fine-tuned on a specific musical subgenre. This proximity reflects the semantic density of the dataset: many tags are naturally related and differ only subtly.
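
For context, the silhouette coefficient of a point is (b - a) / max(a, b), where a is its mean distance to the other points of its own cluster and b is its mean distance to the nearest other cluster. Scores near 1 mean well-separated clusters, and scores near 0 mean overlapping ones. The pure-Python sketch below illustrates the computation on toy 2-D points with Euclidean distance; it is not the code used to obtain the 0.095 figure, and the function name `mean_silhouette` is hypothetical.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette coefficient (Euclidean distance, at least 2 clusters)."""
    def mean_dist(p, group):
        return sum(math.dist(p, q) for q in group) / len(group)

    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Other members of the same cluster (the point itself is excluded).
        own = [q for j, (q, l) in enumerate(zip(points, labels))
               if l == lab and j != i]
        if not own:  # singleton cluster: coefficient defined as 0
            scores.append(0.0)
            continue
        a = mean_dist(p, own)
        b = min(mean_dist(p, grp) for other, grp in clusters.items()
                if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(mean_silhouette(points, [0, 0, 1, 1]))  # well separated: close to 1
print(mean_silhouette(points, [0, 1, 0, 1]))  # poorly assigned: negative
```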

Using this metadata, I was able to generate more human-readable prompts for the model via Llama 3.1 3B running locally, allowing the fine-tuned model to produce beats that better reflect the stylistic and structural characteristics of trap music.

```json 
{"filepath": "39118.wav", "start": 63, "duration": 26, "prompt": "A melancholic and love-inspired rap with soul beat at 109 BPM in G minor, using plucked guitar, synth bells, and movie sample."}
```
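
To make the input and output shapes of this step concrete, here is a plain string template that turns a metadata record into a training line. It is only a stand-in: the prompts in this project were written by a locally running Llama model, which produces more natural wording than a template can. The function names `template_prompt` and `to_jsonl_line` are hypothetical.

```python
import json


def join_list(items):
    """Join items as 'a', 'a and b', or 'a, b, and c'."""
    if len(items) == 1:
        return items[0]
    if len(items) == 2:
        return f"{items[0]} and {items[1]}"
    return ", ".join(items[:-1]) + f", and {items[-1]}"


def template_prompt(meta):
    """Build a simple prompt from tags, tempo, and key (template stand-in)."""
    moods = " and ".join(meta["moods_tags"])
    genre = meta["genres_tags"][0]
    return (f"A {moods} {genre} beat at {meta['tempo']:.0f} BPM "
            f"in {meta['key']} {meta['scale']}, "
            f"using {join_list(meta['instruments_tags'])}.")


def to_jsonl_line(filepath, meta):
    """Serialize one training example in the same shape as the line above."""
    return json.dumps({
        "filepath": filepath,
        "start": meta["start"],
        "duration": meta["duration"],
        "prompt": template_prompt(meta),
    })


meta = {
    "instruments_tags": ["plucked guitar", "synth bells", "movie sample"],
    "genres_tags": ["rap with soul"],
    "moods_tags": ["trap melancholic", "love"],
    "key": "G", "scale": "minor", "tempo": 109.0,
    "start": 63, "duration": 26,
}
print(to_jsonl_line("39118.wav", meta))
```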

# Training

The model was trained on an NVIDIA A100 GPU on Google Colab for about 42 hours, using all 40k audio segments (~277 h) over 14 epochs. I set a batch size of 16, resulting in approximately 2.5k steps per epoch and 35k steps in total.
<br>
Inference takes ~0.37 s per step on an NVIDIA RTX 4050 Laptop GPU, so about 1 min 15 s for a good generation.
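
The figures above can be checked with a few lines of arithmetic. The only assumption is that "a good generation" means the 200 sampling steps used for the results in this README.

```python
segments = 40_000
batch_size = 16
epochs = 14

steps_per_epoch = segments // batch_size   # 2500
total_steps = steps_per_epoch * epochs     # 35000

# Average wall-clock time per training step on the A100 (about 4.3 s).
train_hours = 42
seconds_per_train_step = train_hours * 3600 / total_steps

# Inference on the RTX 4050 Laptop GPU, assuming 200 sampling steps.
sampling_steps = 200
seconds_per_sampling_step = 0.37
generation_seconds = sampling_steps * seconds_per_sampling_step  # about 74 s

print(steps_per_epoch, total_steps)
print(round(seconds_per_train_step, 2), round(generation_seconds))
```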



# Results Analysis

The model performs particularly well on melodic beats with a smooth and floating atmosphere.
It captures harmonic structures effectively and keeps a strong sense of coherence between instruments, mood, and tempo, which makes the generated beats sound natural, balanced, and musically pleasing.
The model generates interesting beats that reflect the given prompt fairly well.

However, the model tends to underperform on styles that were underrepresented in the training dataset, such as boom bap or high-energy beats with dense percussive layers.

<p align="center">
  <img src="./assets/FreqMoods.png" alt="Frequence moods" width="600"/>
</p>

This limitation mainly stems from the uneven tag distribution within the dataset: certain instruments and genres are simply less represented.
In addition, the tagging tool (CLAP), trained on general-purpose audio datasets such as LAION-Audio-630K, is not specialized for genres such as trap or hip-hop, which leads to imprecise tagging of elements like snares, hi-hats, or 808 bass.
As a result, these styles are harder for the model to reproduce accurately.
I also noticed that the generated melodic elements, such as piano or synths, often sound much quieter than the drums, probably because their frequency content is more subtle.

# Perspectives

I'd like to fine-tune for 2-3 more epochs on a smaller dataset that better represents the underrepresented styles.
It would be interesting to start over with a CLAP model specialized in trap/rap genres.
I'm also interested in noise input conditioning such as [**SpecGrad**](https://arxiv.org/pdf/2203.16749).

I’m open to any feedback or suggestions on my work.

## Sources
- [**Stable Audio Open 1.0**](https://huggingface.co/stabilityai/stable-audio-open-1.0) - Model used.
- [**LoRAW**](https://github.com/NeuralNotW0rk/LoRAW) - Pipeline implementation for Stable Audio Open LoRA fine-tuning.
- [**Stable Audio Tools**](https://github.com/Stability-AI/stable-audio-tools) - Official Stability AI framework for using Stable Audio Open.
- [**Essentia**](https://essentia.upf.edu/models.html) - Library for music feature extraction.

## Contact - Gabriel Guiet-Dupré
- [**LinkedIn**](https://www.linkedin.com/in/gabriel-guiet-dupre/)
- [**GitHub**](https://github.com/Gab404)