|
|
--- |
|
|
datasets: |
|
|
- DaveLoay/NSynth_Bass_Captions |
|
|
language: |
|
|
- en |
|
|
--- |
|
|
# Riffusion Fine-Tune
|
|
|
|
|
This is a fine-tuned version of **Riffusion**, trained on **bass** samples extracted from the **NSynth** dataset.
|
|
The purpose of this work is to evaluate how well the model can generate bass audio samples.
|
|
|
|
|
## Notes |
|
|
|
|
|
* This is the approach I found to achieve this goal. If you have a better idea for doing it, please share it with me.
|
|
|
|
|
## Quickstart Guide |
|
|
|
|
|
Clone the **Riffusion** repository and install the dependencies listed in its `requirements.txt`: [Riffusion GitHub](https://github.com/riffusion/riffusion)
|
|
```python |
|
|
import torch
from diffusers import DiffusionPipeline

# Load the fine-tuned checkpoint and move it to the GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained(
    "DaveLoay/Riffusion_FT_Bass_512_4000", torch_dtype=torch.float16
).to(device)

prompt = "Your desired prompt"
image = pipe(prompt).images[0]  # PIL image of the generated spectrogram
|
|
``` |
|
|
|
|
|
After that, the generated spectrogram is stored in `image` as a PIL image. To convert it into an audio file, you can use the **SpectrogramImageConverter** class (from the `spectrogram_image_converter` module) in the **Riffusion** repo.
|
|
|
|
|
```python |
|
|
from riffusion.spectrogram_image_converter import SpectrogramImageConverter
from riffusion.spectrogram_params import SpectrogramParams

# Reconstruct audio from the spectrogram image using the default parameters
params = SpectrogramParams()
converter = SpectrogramImageConverter(params)
audio = converter.audio_from_spectrogram_image(image)  # returns a pydub AudioSegment
|
|
``` |
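

The returned `audio` is a `pydub.AudioSegment`, so if you want to write it to disk you can export it directly (a minimal sketch; the filename is arbitrary):

```python
# Export the generated audio to a WAV file (any pydub-supported format works)
audio.export("generated_bass.wav", format="wav")
```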
|
|
|
|
|
## Fine-Tuning
|
|
|
|
|
For the fine-tuning process, I used the bass samples from the test split of the NSynth dataset, which you can check out here: [NSynth Dataset](https://magenta.tensorflow.org/nsynth)
|
|
|
|
|
You can find the pre-processed files in my repo, here: [DaveLoay/NSynth_Bass_Captions](https://huggingface.co/datasets/DaveLoay/NSynth_Bass_Captions)
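

If you want to take a quick look at the pre-processed data, it can be loaded with the `datasets` library (a minimal sketch; the `train` split name is assumed here):

```python
from datasets import load_dataset

# Load the pre-processed bass spectrogram/caption pairs (split name assumed)
ds = load_dataset("DaveLoay/NSynth_Bass_Captions", split="train")
print(ds)  # shows the available columns and number of examples
```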
|
|
|
|
|
As mentioned in the official **Riffusion** Hugging Face repo, I used the **train_text_to_image** script contained in the **Diffusers** repo, which you can check out here: [Diffusers Repo](https://github.com/huggingface/diffusers/tree/main/examples/text_to_image)
|
|
|
|
|
After configuring all dependencies, I used the following code to train the model: |
|
|
|
|
|
```bash |
|
|
accelerate launch --mixed_precision="fp16" train_text_to_image.py \
  --pretrained_model_name_or_path=riffusion/riffusion-model-v1 \
  --dataset_name=DaveLoay/NSynth_Bass_Captions \
  --resolution=512 \
  --use_ema \
  --train_batch_size=3 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=4000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="Riffusion_FT_Bass_512_4000" \
  --push_to_hub
|
|
``` |
|
|
|
|
|
## Hardware |
|
|
|
|
|
The hardware I used to fine-tune this model is: |
|
|
|
|
|
* NVIDIA A100 with 40 GB of VRAM, hosted on Google Colab Pro
|
|
|
|
|
Training took about 3 hours and used roughly 26 GB of VRAM.
|
|
|
|
|
## Credits |
|
|
|
|
|
You can check out the original projects here:
|
|
|
|
|
* [Riffusion](https://www.riffusion.com/)
* [NSynth Dataset](https://magenta.tensorflow.org/nsynth)
* [Diffusers](https://huggingface.co/docs/diffusers/index)
|
|
|
|
|
|
|
|
|