---
license: apache-2.0
---

# LocalSong

LocalSong is a 700M-parameter audio generation model focused on melodic instrumental music, conditioned on text tags. It was trained from scratch in 3 days on a single H100, reusing the ACE-Step VAE.

## Installation

### Prerequisites

- Python 3.10 or higher
- CUDA-capable GPU with at least 8 GB of VRAM recommended

### Setup

```
hf download Localsong/LocalSong --local-dir LocalSong
cd LocalSong
python3 -m venv venv
source venv/bin/activate
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --extra-index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
```

### Run

```
python gradio_app.py
```

The interface will be available at `http://localhost:7860`.

### Generation Advice

Prompts should include one of the soundtrack, soundtrack1, or soundtrack2 tags, plus at least one other tag. Up to 8 tags can be used; try combining genres and instruments.

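For example, a prompt might pair the required tag with a genre and an instrument. The extra tags below are purely illustrative; check the tag list exposed in the Gradio interface for what the model was actually trained on:

```
soundtrack, orchestral, piano
```
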
The default settings (CFG 3.5, 200 steps) have been tested as optimal.

If generation is too slow on your system, try lowering the step count to 100.

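For context, CFG here is classifier-free guidance: the final prediction is pushed from the unconditional output toward the tag-conditioned one. A generic one-line sketch of the idea, not this repo's exact implementation:

```python
def cfg_combine(uncond, cond, scale=3.5):
    """Classifier-free guidance: amplify the direction from the unconditional
    prediction toward the conditional one. scale=1.0 recovers the plain
    conditional prediction; higher values follow the tags more strongly."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]
```

In practice `uncond` and `cond` are two forward passes of the same model, one with the tags dropped, so higher CFG roughly doubles compute per step.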
The first generation will be slower while torch.compile warms up; subsequent generations will be faster.

The model was trained on vocals but not lyrics, so vocals will not contain recognizable words.

## LoRA Training

- Prepare a folder of .mp3 files.
- Run `python train_lora_encode_latents.py --audio-dir=/path/to/your/mp3s --output-dir=latents` to save the latents.
- Run `python train_lora.py --latents_dir=latents` to train the LoRA. You may need to adjust the learning rate, step count, or batch size depending on your dataset.
- Run `python merge_lora.py --lora-checkpoint=lora_step1000.safetensors --output-checkpoint=merged.safetensors` to merge the LoRA checkpoint into the base model for inference.
- Run `python gradio_app.py --checkpoint=merged.safetensors` to run inference with the merged checkpoint.
- Test inference with the tag "soundtrack"; LoRA training uses this tag. Additional tags may also work.

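The merge step above folds the low-rank update into the base weights; conceptually, each adapted layer becomes W + scale * (B @ A). A pure-Python sketch of that computation (the actual merge_lora.py operates on safetensors checkpoints, not nested lists):

```python
def merge_lora(W, A, B, scale=1.0):
    """Fold a LoRA update into a base weight matrix: W + scale * (B @ A).
    Shapes: W is d x k, B is d x r, A is r x k, with rank r << min(d, k).
    Conceptual sketch only; real checkpoints use tensors, not lists."""
    r, k = len(A), len(A[0])
    d = len(B)
    return [[W[i][j] + scale * sum(B[i][p] * A[p][j] for p in range(r))
             for j in range(k)] for i in range(d)]
```

After merging, inference pays no extra cost for the adapter, since the low-rank matrices disappear into the base weights.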
## Credits

This project builds upon the following open-source projects:

- **Model Architecture**: Adapted from [DDT](https://github.com/MCG-NJU/DDT)
- **Flow Matching**: Adapted from [minRF](https://github.com/cloneofsimo/minRF)
- **Audio VAE**: [ACE-Step](https://github.com/ACE-Step/ACE-Step)

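For reference, the rectified-flow objective popularized by minRF trains the network to predict the velocity between data and noise along a straight-line interpolation. A minimal single-sample sketch, where `model` stands in for the actual diffusion transformer:

```python
import random

def rectified_flow_loss(x, model, rng=random):
    """One flow-matching training step on a single sample (a list of floats).
    Interpolate between data and Gaussian noise at a random timestep, then
    regress the model onto the velocity (noise - data). Sketch only; the
    exact formulation in this repo may differ."""
    t = rng.random()                                            # t in (0, 1)
    noise = [rng.gauss(0.0, 1.0) for _ in x]
    z_t = [(1 - t) * xi + t * ni for xi, ni in zip(x, noise)]   # noisy input
    target = [ni - xi for xi, ni in zip(x, noise)]              # velocity
    pred = model(z_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(x)  # MSE
```

At sampling time the learned velocity field is integrated from pure noise back to data, which is what the step count in the Gradio interface controls.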
## License

This project is licensed under the Apache License 2.0.