tags:
- audio
- music-generation
- peft
---
### Exploring Adapter Design Tradeoffs for Low Resource Music Generation

[Code](https://github.com/atharva20038/ACMMM_Adapters) | [Models](https://huggingface.co/collections/athi180202/peft-adaptations-of-music-generation-models-684ba077a2a44999bb6cb175) | [Paper](https://arxiv.org/abs/2506.21298)

This repository contains our code for the paper "Exploring Adapter Design Tradeoffs for Low Resource Music Generation".

Fine-tuning large-scale music generation models, such as MusicGen and Mustango, is computationally expensive, often requiring updates to billions of parameters and therefore significant hardware resources.

Parameter-Efficient Fine-Tuning (PEFT) techniques, particularly adapter-based methods, have emerged as a promising alternative, enabling adaptation with minimal trainable parameters while preserving model performance.

However, adapters admit many design choices, including architecture, placement, and size, and it is unclear which combinations produce optimal adapters, and why, for a given low-resource music genre.

In this paper, we attempt to answer this question by studying various adapter configurations for two AI music models, MusicGen and Mustango, on two genres: Hindustani Classical and Turkish Makam music.
## Datasets

The [CompMusic - Turkish Makam](https://compmusic.upf.edu/datasets) dataset contains 405 hours of Turkish Makam data.

The [CompMusic - Hindustani Classical](https://compmusic.upf.edu/datasets) dataset contains 305 hours of annotated Hindustani Classical data.

The Hindustani Classical dataset includes 21 different instrument types, such as the Pakhavaj, Zither, Sarangi, Ghatam, Harmonium, and Santoor, along with vocals.

The Turkish Makam dataset features 42 makam-specific instruments, such as the Oud, Tanbur, Ney, Davul, Clarinet, Kös, Kudüm, Yaylı Tanbur, Tef, Kanun, Zurna, Bendir, Darbuka, Classical Kemençe, Rebab, and Çevgen, along with vocals. It encompasses 100 different makams and 62 distinct usuls.
## Adapter Positioning

<div align="center">
<img src="img/Architecture-1.png" width="900"/>
</div>

### Mustango

To adapt Mustango efficiently, a Bottleneck Residual Adapter with convolution layers is integrated into the up-sampling, middle, and down-sampling blocks of the UNet, positioned just after the cross-attention block. This design facilitates cultural adaptation while preserving computational efficiency. The adapters reduce channel dimensions by a factor of 8, using a kernel size of 1, and apply a GeLU activation after the down-projection layer to introduce non-linearity.
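The bottleneck described above can be sketched in PyTorch as follows. The class name, the example channel count, and the zero-initialized up-projection are illustrative assumptions, not the repository's actual implementation; only the reduction factor of 8, the kernel size of 1, the GeLU after the down-projection, and the residual connection come from the text.

```python
import torch
import torch.nn as nn

class ConvBottleneckAdapter(nn.Module):
    """Convolutional bottleneck residual adapter (sketch).

    Down-projects channels by `reduction`, applies GeLU, up-projects,
    and adds the result back to the input via a residual connection.
    """

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        hidden = channels // reduction
        self.down = nn.Conv2d(channels, hidden, kernel_size=1)
        self.act = nn.GELU()  # non-linearity after the down-projection
        self.up = nn.Conv2d(hidden, channels, kernel_size=1)
        # Zero-init so the adapter starts as an identity mapping and the
        # frozen UNet's behavior is preserved at the start of training.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# Shape check on a dummy cross-attention output (batch, channels, h, w).
adapter = ConvBottleneckAdapter(channels=320)
x = torch.randn(2, 320, 16, 16)
assert adapter(x).shape == x.shape
```

Because the module is residual and zero-initialized, inserting it after each cross-attention block leaves the pretrained model's outputs unchanged until training begins.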
### MusicGen

In MusicGen, we add roughly 2 million parameters by integrating a Linear Bottleneck Residual Adapter after the transformer decoder, a placement chosen after thorough experimentation with alternatives.
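A minimal sketch of such a linear adapter is shown below. The hidden size and bottleneck width are hypothetical values chosen only so the module lands near the ~2M-parameter budget mentioned in the text; the repository may use different dimensions.

```python
import torch
import torch.nn as nn

class LinearBottleneckAdapter(nn.Module):
    """Linear bottleneck residual adapter (sketch).

    Operates on decoder hidden states of shape (batch, seq_len, d_model):
    down-project, apply GeLU, up-project, and add back residually.
    """

    def __init__(self, d_model: int, bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))

# With these (assumed) dimensions the adapter has 2,099,712 parameters,
# i.e. roughly the 2M quoted in the text.
adapter = LinearBottleneckAdapter(d_model=2048, bottleneck=512)
n_params = sum(p.numel() for p in adapter.parameters())
print(n_params)  # 2099712
```

Against a ~2-billion-parameter base model, training only this block means updating about 0.1% of the total weights.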

The total parameter count of each model is ~2 billion, so the adapter accounts for only about 0.1% of the total size (2M parameters).

For both models, training ran on two RTX A6000 GPUs for around 10 hours. Only the adapter block was fine-tuned, using the AdamW optimizer with an MSE (reconstruction) loss.
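The adapter-only training setup can be sketched as follows: freeze the base model, pass only the adapter's parameters to AdamW, and optimize an MSE reconstruction loss. The tiny stand-in modules and the learning rate are illustrative assumptions; real code would load MusicGen or Mustango.

```python
import torch
import torch.nn as nn

# Stand-ins for the frozen base model and the trainable adapter block.
base = nn.Linear(16, 16)
adapter = nn.Sequential(nn.Linear(16, 4), nn.GELU(), nn.Linear(4, 16))
model = nn.Sequential(base, adapter)

# Freeze every base parameter; only adapter weights stay trainable.
for p in base.parameters():
    p.requires_grad = False

# AdamW sees only the parameters that still require gradients.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)

x, target = torch.randn(8, 16), torch.randn(8, 16)
for _ in range(5):  # a few reconstruction steps
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), target)  # MSE reconstruction loss
    loss.backward()
    optimizer.step()
```

Filtering on `requires_grad` keeps optimizer state (and memory) proportional to the adapter's 2M parameters rather than the full ~2B-parameter model.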
## Evaluations
### **Objective Evaluation Metrics for Music Models**

<div align="center">
<img src="img/fad_fd_image-1.png" width="900"/>
</div>

For Mustango, the objective evaluation results are also available in this Google Sheet: [Spreadsheet](https://docs.google.com/spreadsheets/d/11aHVjt8zeHyMqmIBIdV5b4pvlu8gc83510HD0nwBrjo/edit?gid=0#gid=0).
### **Human Evaluation**

Hindustani Classical - Subjective Evaluation Results

<div align="center">
<img src="img/hindustani_quality (1).png" width="900"/>
</div>

Turkish Makam - Subjective Evaluation Results

<div align="center">
<img src="img/makam (1).png" width="900"/>
</div>

## Citation

Please consider citing the following article if you found our work useful:

```
@misc{
}
```