---
datasets:
- DaveLoay/NSynth_Bass_Captions
language:
- en
---
# Riffusion Fine-Tune

This is a fine-tuned version of **Riffusion**, trained on **bass** samples extracted from the **NSynth** dataset.
The purpose of this work is to evaluate how well the model can generate bass audio samples.

## Notes

This is the approach I found to achieve this goal; if you have a better idea for doing this, please share it with me.

## Quickstart Guide

Clone the **Riffusion** repository and install the dependencies from its requirements.txt file: [Riffusion GitHub](https://github.com/riffusion/riffusion)
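
The clone-and-install step above can be sketched as follows (a minimal sketch, assuming `pip` and a Unix-like shell; adjust for your environment):

```shell
# Clone the Riffusion repo and install its dependencies
git clone https://github.com/riffusion/riffusion.git
cd riffusion
pip install -r requirements.txt
```
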
```python
import torch
from diffusers import DiffusionPipeline

# fp16 weights are intended for GPU inference
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("DaveLoay/Riffusion_FT_Bass_512_4000", torch_dtype=torch.float16).to(device)
prompt = "Your desired prompt"
image = pipe(prompt).images[0]  # a spectrogram image (PIL)
```

After that, you will have a generated spectrogram stored in `image`. To convert this image into an audio file, you can use the **SpectrogramImageConverter** class contained in the **Riffusion** repo.

```python
from riffusion.spectrogram_image_converter import SpectrogramImageConverter
from riffusion.spectrogram_params import SpectrogramParams

params = SpectrogramParams()
converter = SpectrogramImageConverter(params)
audio = converter.audio_from_spectrogram_image(image)  # a pydub AudioSegment
audio.export("output.wav", format="wav")  # save the result as a WAV file
```

## Fine-Tuning

For the fine-tuning process, I used the bass samples from the test split of the NSynth dataset, which you can check out here: [NSynth Dataset](https://magenta.tensorflow.org/nsynth)

You can find the pre-processed files in my repo, here: [DaveLoay/NSynth_Bass_Captions](https://huggingface.co/datasets/DaveLoay/NSynth_Bass_Captions)

As mentioned in the official **Riffusion** HF repo, I used the **train_text_to_image** script contained in the **Diffusers** repo, which you can check out here: [Diffusers Repo](https://github.com/huggingface/diffusers/tree/main/examples/text_to_image)

After configuring all dependencies, I used the following command to train the model:

```bash
accelerate launch --mixed_precision="fp16" train_text_to_image.py \
  --pretrained_model_name_or_path=riffusion/riffusion-model-v1 \
  --dataset_name=DaveLoay/NSynth_Bass_Captions \
  --resolution=512 \
  --use_ema \
  --train_batch_size=3 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=4000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="Riffusion_FT_Bass_512_4000" \
  --push_to_hub
```
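
Note that the small per-device batch size is compensated by gradient accumulation; the effective batch size implied by the flags above can be restated as (a hypothetical illustration, not part of the training script):

```python
# Effective batch size = per-device batch size x gradient accumulation steps
train_batch_size = 3
gradient_accumulation_steps = 4
effective_batch_size = train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 12
```

This keeps per-step memory low enough for a single 40 GB GPU while still averaging gradients over 12 samples per optimizer update.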

## Hardware

The hardware I used to fine-tune this model:

* NVIDIA A100 (40 GB VRAM), hosted on Google Colab Pro

Training took about 3 hours and used roughly 26 GB of VRAM.

## Credits

You can check out the original repositories here:

* [Riffusion](https://www.riffusion.com/)
* [NSynth Dataset](https://magenta.tensorflow.org/nsynth)
* [Diffusers](https://huggingface.co/docs/diffusers/index)