---
datasets:
- DaveLoay/NSynth_Bass_Captions
language:
- en
---
# Riffusion Fine-Tune

This is a fine-tuned version of **Riffusion**, trained on **bass** samples extracted from the **NSynth** dataset.
The purpose of this work is to evaluate how well the model can generate bass audio samples.

## Notes

This is the approach I found to achieve this goal; if you have a better idea for doing this, please share it with me.

## Quickstart Guide

Clone the **Riffusion** repository and install the dependencies from its requirements.txt file: [Riffusion GitHub](https://github.com/riffusion/riffusion)
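
The clone-and-install step above can be sketched as follows (a minimal sketch, assuming `pip` and a Unix-like shell; adjust for your environment):

```shell
# Clone the Riffusion repo and install its dependencies
git clone https://github.com/riffusion/riffusion.git
cd riffusion
pip install -r requirements.txt
```
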
```python
import torch
from diffusers import DiffusionPipeline

# fp16 weights are intended for GPU inference
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("DaveLoay/Riffusion_FT_Bass_512_4000", torch_dtype=torch.float16).to(device)
prompt = "Your desired prompt"
image = pipe(prompt).images[0]  # a spectrogram image (PIL)
```

After that, you will have a generated spectrogram stored in `image`. To convert this image into an audio file, you can use the **SpectrogramImageConverter** class contained in the **Riffusion** repo.

```python
from riffusion.spectrogram_image_converter import SpectrogramImageConverter
from riffusion.spectrogram_params import SpectrogramParams

params = SpectrogramParams()
converter = SpectrogramImageConverter(params)
audio = converter.audio_from_spectrogram_image(image)  # a pydub AudioSegment
audio.export("output.wav", format="wav")  # save the result as a WAV file
```

## Fine-Tuning

For the fine-tuning process, I used the bass samples from the test split of the NSynth dataset, which you can check out here: [NSynth Dataset](https://magenta.tensorflow.org/nsynth)

You can find the pre-processed files in my repo, here: [DaveLoay/NSynth_Bass_Captions](https://huggingface.co/datasets/DaveLoay/NSynth_Bass_Captions)

As mentioned in the official **Riffusion** HF repo, I used the **train_text_to_image** script contained in the **Diffusers** repo, which you can check out here: [Diffusers Repo](https://github.com/huggingface/diffusers/tree/main/examples/text_to_image)

After configuring all dependencies, I used the following command to train the model:

```bash
accelerate launch --mixed_precision="fp16" train_text_to_image.py \
  --pretrained_model_name_or_path=riffusion/riffusion-model-v1 \
  --dataset_name=DaveLoay/NSynth_Bass_Captions \
  --resolution=512 \
  --use_ema \
  --train_batch_size=3 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=4000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="Riffusion_FT_Bass_512_4000" \
  --push_to_hub
```
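
Note that the small per-device batch size is compensated by gradient accumulation; the effective batch size implied by the flags above can be restated as (a hypothetical illustration, not part of the training script):

```python
# Effective batch size = per-device batch size x gradient accumulation steps
train_batch_size = 3
gradient_accumulation_steps = 4
effective_batch_size = train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 12
```

This keeps per-step memory low enough for a single 40 GB GPU while still averaging gradients over 12 samples per optimizer update.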

## Hardware

The hardware I used to fine-tune this model:

* NVIDIA A100 (40 GB VRAM), hosted on Google Colab Pro

Training took about 3 hours and used roughly 26 GB of VRAM.

## Credits

You can check out the original repositories here:

* [Riffusion](https://www.riffusion.com/)
* [NSynth Dataset](https://magenta.tensorflow.org/nsynth)
* [Diffusers](https://huggingface.co/docs/diffusers/index)