|
|
--- |
|
|
datasets: |
|
|
- DaveLoay/NSynth_Bass_Captions |
|
|
language: |
|
|
- en |
|
|
--- |
|
|
# Riffusion Fine-Tune
|
|
|
|
|
This is a fine-tuned version of **Riffusion**, trained on **bass** samples extracted from the **NSynth** dataset.
|
|
The purpose of this work is to evaluate how well the model can generate bass audio samples.
|
|
|
|
|
## Notes |
|
|
|
|
|
* This is the approach I found to achieve this goal. If you have a better idea for doing it, please share it with me.
|
|
|
|
|
## Quickstart Guide |
|
|
|
|
|
Clone the **Riffusion** repository and install the dependencies listed in its `requirements.txt`: [Riffusion GitHub](https://github.com/riffusion/riffusion)
|
|
```python |
|
|
import torch
from diffusers import DiffusionPipeline

# Load the fine-tuned checkpoint and move it to the GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained(
    "DaveLoay/Riffusion_FT_Bass_512_4000", torch_dtype=torch.float16
).to(device)

prompt = "Your desired prompt"
image = pipe(prompt).images[0]  # PIL image of the generated spectrogram
|
|
``` |
|
|
|
|
|
After that, the generated spectrogram is stored in `image` as a PIL image. To convert it into an audio file, you can use the **SpectrogramImageConverter** class (from the `spectrogram_image_converter` module) in the **Riffusion** repo.
|
|
|
|
|
```python |
|
|
from riffusion.spectrogram_image_converter import SpectrogramImageConverter
from riffusion.spectrogram_params import SpectrogramParams

# Reconstruct audio from the spectrogram image using the default parameters
params = SpectrogramParams()
converter = SpectrogramImageConverter(params)
audio = converter.audio_from_spectrogram_image(image)  # returns a pydub AudioSegment
|
|
``` |
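

The returned `audio` is a `pydub.AudioSegment`, so if you want to write it to disk you can export it directly (a minimal sketch; the filename is arbitrary):

```python
# Export the generated audio to a WAV file (any pydub-supported format works)
audio.export("generated_bass.wav", format="wav")
```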
|
|
|
|
|
## Fine-Tuning
|
|
|
|
|
For the fine-tuning process, I used the bass samples from the test split of the NSynth dataset, which you can check out here: [NSynth Dataset](https://magenta.tensorflow.org/nsynth)
|
|
|
|
|
You can find the pre-processed files in my repo, here: [DaveLoay/NSynth_Bass_Captions](https://huggingface.co/datasets/DaveLoay/NSynth_Bass_Captions)
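

If you want to take a quick look at the pre-processed data, it can be loaded with the `datasets` library (a minimal sketch; the `train` split name is assumed here):

```python
from datasets import load_dataset

# Load the pre-processed bass spectrogram/caption pairs (split name assumed)
ds = load_dataset("DaveLoay/NSynth_Bass_Captions", split="train")
print(ds)  # shows the available columns and number of examples
```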
|
|
|
|
|
As mentioned in the official **Riffusion** Hugging Face repo, I used the **train_text_to_image** script contained in the **Diffusers** repo, which you can check out here: [Diffusers Repo](https://github.com/huggingface/diffusers/tree/main/examples/text_to_image)
|
|
|
|
|
After configuring all dependencies, I used the following code to train the model: |
|
|
|
|
|
```bash |
|
|
accelerate launch --mixed_precision="fp16" train_text_to_image.py \
  --pretrained_model_name_or_path=riffusion/riffusion-model-v1 \
  --dataset_name=DaveLoay/NSynth_Bass_Captions \
  --resolution=512 \
  --use_ema \
  --train_batch_size=3 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=4000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="Riffusion_FT_Bass_512_4000" \
  --push_to_hub
|
|
``` |
|
|
|
|
|
## Hardware |
|
|
|
|
|
The hardware I used to fine-tune this model is: |
|
|
|
|
|
* NVIDIA A100 with 40 GB of VRAM, hosted on Google Colab Pro
|
|
|
|
|
Training took about 3 hours and used roughly 26 GB of VRAM.
|
|
|
|
|
## Credits |
|
|
|
|
|
You can check out the original projects here:
|
|
|
|
|
* [Riffusion](https://www.riffusion.com/)
* [NSynth Dataset](https://magenta.tensorflow.org/nsynth)
* [Diffusers](https://huggingface.co/docs/diffusers/index)
|
|
|
|
|
|
|
|
|