---
datasets:
- DaveLoay/NSynth_Bass_Captions
language:
- en
---
# Riffusion Fine-Tune

This is a fine-tuned version of **Riffusion**, trained on **bass** samples extracted from the **NSynth** dataset.
The purpose of this work is to evaluate how well the model can generate bass audio samples.

## Notes

This is the approach I found to achieve this goal; if you have a better idea for doing it, please share it with me.

## Quickstart Guide

Clone the **Riffusion** repository and install the dependencies from its `requirements.txt`: [Riffusion GitHub](https://github.com/riffusion/riffusion)

```python
import torch
from diffusers import DiffusionPipeline

# Load the fine-tuned checkpoint and move it to the GPU.
pipe = DiffusionPipeline.from_pretrained("DaveLoay/Riffusion_FT_Bass_512_4000", torch_dtype=torch.float16).to("cuda")
prompt = "Your desired prompt"
image = pipe(prompt).images[0]
```
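
If you want reproducible outputs, the pipeline call also accepts a `generator` argument. This is a minimal sketch; the seed value is arbitrary and it assumes the checkpoint loads as a standard Stable Diffusion pipeline:

```python
# Optional: fix the random seed so the same prompt yields the same spectrogram.
generator = torch.Generator("cuda").manual_seed(42)
image = pipe(prompt, generator=generator).images[0]
```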

After that, the generated spectrogram is stored in `image`. If you want to convert this image into an audio file, you can use the **spectrogram_image_converter** module contained in the **Riffusion** repo.

```python
from riffusion.spectrogram_image_converter import SpectrogramImageConverter
from riffusion.spectrogram_params import SpectrogramParams

# Rebuild the audio waveform from the generated spectrogram image.
params = SpectrogramParams()
converter = SpectrogramImageConverter(params)
audio = converter.audio_from_spectrogram_image(image)
```
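
To keep the results, you can write both the spectrogram and the audio to disk. This assumes, as in the Riffusion repo, that the converter returns a `pydub.AudioSegment`; the file names are just examples:

```python
# Save the spectrogram (a PIL image) and export the audio as a WAV file.
image.save("spectrogram.png")
audio.export("generated_bass.wav", format="wav")
```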

## Fine-Tuning

For the fine-tuning process, I used the bass samples from the test split of the NSynth dataset, which you can check out here: [NSynth Dataset](https://magenta.tensorflow.org/nsynth)

You can find the pre-processed files in my repo, here: [DaveLoay/NSynth_Bass_Captions](https://huggingface.co/datasets/DaveLoay/NSynth_Bass_Captions)
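
For reference, below is a rough sketch of how that preprocessing could look. It is illustrative rather than the exact script I used: the directory paths, the caption text, and the `metadata.jsonl` layout are assumptions, and it relies on `SpectrogramImageConverter.spectrogram_image_from_audio` from the Riffusion repo.

```python
import json
from pathlib import Path

import pydub
from riffusion.spectrogram_image_converter import SpectrogramImageConverter
from riffusion.spectrogram_params import SpectrogramParams

# Hypothetical locations for the raw NSynth bass WAVs and the output image folder.
wav_dir = Path("nsynth_test/bass_wavs")
out_dir = Path("nsynth_bass_spectrograms")
out_dir.mkdir(parents=True, exist_ok=True)

converter = SpectrogramImageConverter(SpectrogramParams())

with open(out_dir / "metadata.jsonl", "w") as meta:
    for wav_path in sorted(wav_dir.glob("*.wav")):
        # Convert each bass sample into a spectrogram image.
        segment = pydub.AudioSegment.from_file(str(wav_path))
        image = converter.spectrogram_image_from_audio(segment)
        image_name = wav_path.stem + ".png"
        image.save(out_dir / image_name)
        # Illustrative caption; the actual captions live in the dataset repo.
        caption = "a bass note from the NSynth dataset"
        meta.write(json.dumps({"file_name": image_name, "text": caption}) + "\n")
```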

As mentioned in the official **Riffusion** HF repo, I used the **train_text_to_image** script contained in the **Diffusers** repo, which you can check out here: [Diffusers Repo](https://github.com/huggingface/diffusers/tree/main/examples/text_to_image)

After configuring all the dependencies, I used the following command to train the model:

```bash
accelerate launch --mixed_precision="fp16" train_text_to_image.py \
  --pretrained_model_name_or_path=riffusion/riffusion-model-v1 \
  --dataset_name=DaveLoay/NSynth_Bass_Captions \
  --resolution=512 \
  --use_ema \
  --train_batch_size=3 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=4000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="Riffusion_FT_Bass_512_4000" \
  --push_to_hub
```
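
With `--train_batch_size=3` and `--gradient_accumulation_steps=4`, the effective batch size is 3 × 4 = 12 spectrograms per optimizer step, so the 4000 training steps correspond to roughly 4000 × 12 = 48,000 image-caption pairs seen during training.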

## Hardware

The hardware I used to fine-tune this model was:

* NVIDIA A100 with 40 GB of VRAM, hosted on Google Colab Pro

Training took about 3 hours and used roughly 26 GB of VRAM.

## Credits

You can check out the original projects here:

[Riffusion](https://www.riffusion.com/)

[NSynth Dataset](https://magenta.tensorflow.org/nsynth)

[Diffusers](https://huggingface.co/docs/diffusers/index)