---
datasets:
- DaveLoay/NSynth_Bass_Captions
language:
- en
---
# Riffusion Fine-Tune
This is a fine-tuned version of **Riffusion**, trained on **bass** samples extracted from the **NSynth** dataset.
The purpose of this work is to evaluate how well the model can generate bass audio samples.
## Notes
* This is the approach I found to achieve this goal. If you have a better idea, please share it with me.
## Quickstart Guide
Clone the **Riffusion** repository and install the dependencies listed in its requirements.txt file: [Riffusion GitHub](https://github.com/riffusion/riffusion)
```python
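import torch
from diffusers import DiffusionPipeline

# Prefer the GPU; half precision is only used on CUDA.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
pipe = DiffusionPipeline.from_pretrained("DaveLoay/Riffusion_FT_Bass_512_4000", torch_dtype=dtype).to(device)

prompt = "Your desired prompt"
image = pipe(prompt).images[0]  # PIL image of the generated spectrogram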
```
After that, the generated spectrogram is stored in `image`. To convert this image into audio, you can use the **spectrogram_image_converter** module included in the **Riffusion** repo.
```python
from riffusion.spectrogram_image_converter import SpectrogramImageConverter
from riffusion.spectrogram_params import SpectrogramParams

# Build a converter with the default spectrogram parameters.
params = SpectrogramParams()
converter = SpectrogramImageConverter(params)
# Reconstruct audio from the spectrogram image generated above.
audio = converter.audio_from_spectrogram_image(image)
```
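If you want to keep the result, a minimal sketch for writing it to disk could look like the following. It assumes `audio` is a pydub `AudioSegment` (which is what I understand the converter returns) and relies on its `export` method; the `save_audio` helper and the default file name are hypothetical, not part of the Riffusion repo.

```python
from pathlib import Path


def save_audio(audio, path="bass_sample.wav"):
    """Write an audio segment to disk and return the output path.

    Assumption: `audio` exposes pydub's `export(out_f, format=...)` method.
    The format is inferred from the file extension, defaulting to WAV.
    """
    out = Path(path)
    fmt = out.suffix.lstrip(".") or "wav"
    audio.export(str(out), format=fmt)
    return out


# Usage with the `audio` object from the previous snippet:
# save_audio(audio, "bass_sample.wav")
```

Formats other than WAV may require ffmpeg to be installed for pydub to encode them.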
## Fine Tuning
For the fine-tuning process, I used the bass samples from the test split of the NSynth dataset, which you can check out here: [NSynth Dataset](https://magenta.tensorflow.org/nsynth)
You can find the pre-processed files in my repo, here: [DaveLoay/NSynth_Bass_Captions](DaveLoay/NSynth_Bass_Captions)
As mentioned in the official **Riffusion** HF repo, I used the **train_text_to_image** script contained in the **Diffusers** repo, which you can check out here: [Diffusers Repo](https://github.com/huggingface/diffusers/tree/main/examples/text_to_image)
After configuring all dependencies, I used the following command to train the model:
```bash
accelerate launch --mixed_precision="fp16" train_text_to_image.py \
--pretrained_model_name_or_path=riffusion/riffusion-model-v1 \
--dataset_name=DaveLoay/NSynth_Bass_Captions \
--resolution=512 \
--use_ema \
--train_batch_size=3 \
--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--max_train_steps=4000 \
--learning_rate=1e-05 \
--max_grad_norm=1 \
--lr_scheduler="constant" --lr_warmup_steps=0 \
--output_dir="Riffusion_FT_Bass_512_4000" \
--push_to_hub
```
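As a rough check on what these flags imply, the effective batch size and the number of training images processed can be worked out as below. This is only a sketch: it assumes a single GPU and that `--max_train_steps` counts optimizer updates (after gradient accumulation). The helper names are hypothetical.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of images contributing to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


def images_processed(max_train_steps, per_device_batch, grad_accum_steps, num_devices=1):
    """Total images seen, assuming each step is one optimizer update."""
    return max_train_steps * effective_batch_size(
        per_device_batch, grad_accum_steps, num_devices
    )


# Values taken from the command above, assuming one GPU.
print(effective_batch_size(3, 4))    # 12
print(images_processed(4000, 3, 4))  # 48000
```

Under those assumptions, each update averages gradients over 12 images, and the run processes 48,000 images in total.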
## Hardware
The hardware I used to fine-tune this model was:
* NVIDIA A100 with 40 GB of VRAM, hosted in Google Colab Pro

Training took about 3 hours and used about 26 GB of VRAM.
## Credits
You can check the original repositories here:
* [Riffusion](https://www.riffusion.com/)
* [NSynth Dataset](https://magenta.tensorflow.org/nsynth)
* [Diffusers](https://huggingface.co/docs/diffusers/index)