ACE-Step 1.5 LoRA Training Tutorial
Hardware Requirements
| VRAM | Description |
|---|---|
| 16 GB (minimum) | Generally sufficient, but longer songs may cause out-of-memory errors |
| 20 GB or more (recommended) | Handles full-length songs; VRAM usage typically stays around 17 GB during training |
Note: During the preprocessing stage before training, you may need to restart Gradio multiple times to free VRAM. The specific timing will be mentioned in the steps below.
Disclaimer
This tutorial uses the album ใใฆใฟใณๆใใใฎ็ฉไฝY by Nayutanๆไบบ (NayutalieN) (13 tracks) as a demonstration, trained for 500 epochs (batch size 1). This tutorial is intended solely for educational purposes to understand LoRA fine-tuning. Please use your own original works to train your LoRA.
As a developer, I personally enjoy NayutalieN's work and chose one of their albums as an example. If you are the rights holder and believe this tutorial infringes upon your legitimate rights, please contact us immediately. We will remove the relevant content upon receiving a valid notice.
Technology should be used reasonably and lawfully. Please respect artists' creations and refrain from any actions that harm or damage the reputation, rights, or interests of original artists.
Data Preparation
Tip: If you are unfamiliar with programming, you can provide this document to AI coding tools such as Claude Code / Codex CLI / Cursor / Copilot and let them handle the scripting tasks for you.
Overview
Training data for each song consists of the following:
- Audio file: Supported formats: `.mp3`, `.wav`, `.flac`, `.ogg`, `.opus`
- Lyrics: A `.lyrics.txt` file with the same name as the audio (`.txt` is also supported for backward compatibility)
- Annotation data: Metadata including `caption`, `bpm`, `keyscale`, `timesignature`, `language`, etc.
Annotation Data Format
If you already have complete annotation data, you can create JSON files and place them in the same directory as the audio and lyrics. The file structure is as follows:
dataset/
├── song1.mp3              # Audio
├── song1.lyrics.txt       # Lyrics
├── song1.json             # Annotations (optional)
├── song1.caption.txt      # Caption (optional, can also be included in JSON)
├── song2.mp3
├── song2.lyrics.txt
├── song2.json
└── ...
JSON file structure (all fields are optional):
{
  "caption": "A high-energy J-pop track with synthesizer leads and fast tempo",
  "bpm": 190,
  "keyscale": "D major",
  "timesignature": "4",
  "language": "ja"
}
If you don't have annotation data, you can obtain it using the methods described in later sections.
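If you already know the values for a song, a file in the format above can be written with a few lines of Python. This is only a sketch: the helper name `write_annotation` and the decision to omit fields set to `None` are illustrative assumptions, not part of the project. Only the five field names come from the format above.

```python
import json
from pathlib import Path

# Field names taken from the JSON structure above; all of them are optional.
ANNOTATION_FIELDS = ("caption", "bpm", "keyscale", "timesignature", "language")


def write_annotation(audio_path, **fields):
    """Write {stem}.json next to audio_path (hypothetical helper)."""
    unknown = set(fields) - set(ANNOTATION_FIELDS)
    if unknown:
        raise ValueError(f"Unknown annotation fields: {sorted(unknown)}")
    # Leave out fields that were not provided instead of writing nulls.
    data = {key: value for key, value in fields.items() if value is not None}
    out_path = Path(audio_path).with_suffix(".json")
    out_path.write_text(
        json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return out_path
```

For example, `write_annotation("dataset/song1.mp3", bpm=190, keyscale="D major", timesignature="4", language="ja")` would create `dataset/song1.json`, provided the `dataset` folder exists.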
Lyrics
Save lyrics as a .lyrics.txt file with the same name as the audio file, placed in the same directory. Please ensure the lyrics are accurate.
Lyrics file lookup priority during scanning:
1. `{filename}.lyrics.txt` (recommended)
2. `{filename}.txt` (backward compatible)
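The lookup order can be mirrored in a small helper when you want to confirm which lyrics file would be picked up for a given audio file. This is a sketch of the priority rule as described here, not the scanner's actual code, and `find_lyrics_file` is a hypothetical name.

```python
from pathlib import Path


def find_lyrics_file(audio_path):
    """Return the lyrics file for audio_path, or None if no candidate exists."""
    audio = Path(audio_path)
    # Candidates in the priority order described above.
    candidates = [
        audio.with_name(audio.stem + ".lyrics.txt"),  # recommended
        audio.with_name(audio.stem + ".txt"),         # backward compatible
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```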
Lyrics Transcription
If you don't have existing lyrics text, you can obtain transcribed lyrics using the following tools:
| Tool | Structural Tags | Accuracy | Ease of Use | Deployment |
|---|---|---|---|---|
| acestep-transcriber | No | May contain errors | High difficulty (requires model deployment) | Self-hosted |
| Gemini | Yes | May contain errors | Easy | Paid API |
| Whisper | No | May contain errors | Moderate | Self-hosted / Paid API |
| ElevenLabs | No | May contain errors | Moderate | Paid API (generous free tier) |
This project provides transcription scripts under scripts/lora_data_prepare/:
- `whisper_transcription.py`: Transcription via OpenAI Whisper API
- `elevenlabs_transcription.py`: Transcription via ElevenLabs Scribe API
Both scripts support the process_folder() method for batch processing entire folders.
Review and Cleanup (Required)
Transcribed lyrics may contain errors and must be manually reviewed and corrected.
If you are using LRC format lyrics, you need to remove the timestamps. Here is a simple cleanup example:
import re

def clean_lrc_content(lines):
    """Clean LRC file content by removing timestamps."""
    result = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Remove timestamps such as [mm:ss.x], [mm:ss.xx], [mm:ss.xxx]
        cleaned = re.sub(r"\[\d{2}:\d{2}\.\d{1,3}\]", "", line).strip()
        result.append(cleaned)
    # Remove trailing empty lines (lines that contained only a timestamp)
    while result and not result[-1]:
        result.pop()
    return result
Structural Tags (Optional)
Including structural tags in lyrics (such as [Verse], [Chorus], etc.) helps the model learn song structure more effectively. Training without structural tags is also possible.
Tip: You can use Gemini to add structural tags to existing lyrics.
Example:
[Intro]
La la la...
[Verse 1]
Walking down the empty street
Echoes dancing at my feet
[Chorus]
We are the stars tonight
Shining through the endless sky
[Bridge]
Close your eyes and feel the sound
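To check which structural tags a lyrics file actually contains, a short script can list every line that consists only of a bracketed label. This is a sketch under that one assumption about what counts as a tag, and the function name is hypothetical.

```python
import re

# A tag is a line consisting only of a bracketed label, e.g. [Verse 1].
TAG_LINE = re.compile(r"^\[([^\[\]]+)\]$")


def list_structural_tags(lyrics_text):
    """Return the structural tags found in lyrics_text, in order of appearance."""
    tags = []
    for line in lyrics_text.splitlines():
        match = TAG_LINE.match(line.strip())
        if match:
            tags.append(match.group(1))
    return tags
```

Because a leftover LRC timestamp on its own line (such as `[00:12.34]`) also matches this pattern, the output doubles as a quick check that timestamp cleanup was complete.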
Automatic Annotation
1. Obtaining BPM and Key
Use Key-BPM-Finder to obtain BPM and key annotations online:
Open the webpage and click Browse my files to select the audio files to process. Processing too many files at once may cause the page to freeze, so processing in batches and merging the resulting CSVs is recommended. Processing is done locally and files are not uploaded to a server.

After processing, click Export CSV to download the CSV file.

CSV file content example:
File,Artist,Title,BPM,Key,Camelot
song1.wav,,,190,D major,10B
song2.wav,,,128,A minor,8A
Place the CSV file in the dataset folder. To include caption data, add an extra column after `Camelot`.
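If you processed the audio in several batches, the exported CSVs can be merged into one file before placing it in the dataset folder. The sketch below assumes every export has the `File` column shown in the example and keeps the last row seen for each file name. The function name and the deduplication rule are illustrative choices.

```python
import csv


def merge_key_bpm_csvs(csv_paths, out_path):
    """Merge several exported CSVs into one, keeping the last row per File."""
    rows_by_file = {}
    fieldnames = []
    for path in csv_paths:
        # utf-8-sig tolerates a byte order mark at the start of the file.
        with open(path, newline="", encoding="utf-8-sig") as f:
            reader = csv.DictReader(f)
            # Preserve column order, adding any column not seen before.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            for row in reader:
                rows_by_file[row["File"]] = row
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=fieldnames, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(rows_by_file.values())
    return len(rows_by_file)
```

The return value is the number of distinct files in the merged CSV, which is convenient to compare against the number of audio files in the dataset.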
2. Obtaining Captions
Captions can be obtained in the following ways:
- Using acestep-5Hz-lm (0.6B / 1.7B / 4B): Via the Auto Label feature in the Gradio UI (see subsequent steps)
- Using Gemini API: Refer to the script `scripts/lora_data_prepare/gemini_caption.py`, which supports `process_folder()` for batch processing and generates the following for each audio file:
  - `{filename}.lyrics.txt`: Lyrics
  - `{filename}.caption.txt`: Caption description
Data Preprocessing
Once data is prepared, you can use the Gradio UI for data review and preprocessing.
Important: If using a startup script, you need to modify the launch parameters to disable service pre-initialization:
- Windows (`start_gradio_ui.bat`): Change `if not defined INIT_SERVICE set INIT_SERVICE=--init_service true` to `if not defined INIT_SERVICE set INIT_SERVICE=--init_service false`
- Linux/macOS (`start_gradio_ui.sh`): Change `: "${INIT_SERVICE:=--init_service true}"` to `: "${INIT_SERVICE:=--init_service false}"`
Launch the Gradio UI (via the startup script or by running acestep/acestep_v15_pipeline.py directly).
Step 1: Load Models
If you need to use LM for caption generation: Select the desired LM model during initialization (acestep-5Hz-lm-0.6B / 1.7B / 4B).

Step 2: Load Data
Switch to the LoRA Training tab, enter the dataset directory path, and click Scan.
The scanner automatically recognizes the following files:
| File | Description |
|---|---|
| `*.mp3` / `*.wav` / `*.flac` / ... | Audio files |
| `{filename}.lyrics.txt` (or `{filename}.txt`) | Lyrics |
| `{filename}.caption.txt` | Caption description |
| `{filename}.json` | Annotation metadata (caption / bpm / keyscale / timesignature / language) |
| `*.csv` | Batch BPM / Key annotations (exported from Key-BPM-Finder) |
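Before scanning, it can help to confirm that each audio file has the companion files you expect. The following sketch applies the naming conventions from the table to a folder and reports what it finds. It is an independent pre-flight check with a hypothetical function name, not the scanner used by the Gradio UI.

```python
from pathlib import Path

# Audio extensions listed in the Overview section.
AUDIO_EXTS = {".mp3", ".wav", ".flac", ".ogg", ".opus"}


def check_dataset(folder):
    """Report which companion files exist for each audio file in folder."""
    folder = Path(folder)
    report = {}
    for audio in sorted(folder.iterdir()):
        if not audio.is_file() or audio.suffix.lower() not in AUDIO_EXTS:
            continue
        stem = audio.stem
        report[audio.name] = {
            # Either lyrics naming convention counts.
            "lyrics": (folder / f"{stem}.lyrics.txt").is_file()
            or (folder / f"{stem}.txt").is_file(),
            "caption": (folder / f"{stem}.caption.txt").is_file(),
            "json": (folder / f"{stem}.json").is_file(),
        }
    return report
```

Printing the entries where `lyrics` is `False` gives a quick list of songs that still need a lyrics file before training.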
Step 3: Review and Adjust Dataset
- Duration: Automatically read from the audio file
- Lyrics: Requires a corresponding `.lyrics.txt` file (`.txt` is also supported)
- Labeled: Shows a check mark if a caption exists, a cross mark otherwise
- BPM / Key / Caption: Loaded from JSON or CSV files
- If the dataset is not entirely instrumental, uncheck All Instrumental
- Format Lyrics and Transcribe Lyrics are currently disabled (not yet integrated with acestep-transcriber; using LM directly tends to produce hallucinations)
- Enter a Custom Trigger Tag (currently has limited effect; any option other than `Replace Caption` is fine)
- Genre Ratio controls the proportion of samples using genre instead of caption. Since the current LM-generated genre descriptions are far less descriptive than captions, keep this at 0
Step 4: Auto Label Data
- If you already have captions, you can skip this step
- If your data lacks captions, use LM inference to generate them
- If BPM / Key values are missing, obtain them via Key-BPM-Finder first; generating them directly with LM will produce hallucinations
Step 5: Review and Edit Data
If needed, you can review and modify data entry by entry. Remember to click Save after editing each entry.
Step 6: Save Dataset
Enter a save path to export the dataset as a JSON file.
Step 7: Preprocess and Generate Tensor Files
Note: If you previously used LM to generate captions and VRAM is insufficient, restart Gradio to free VRAM first. When restarting, do not select the LM model. After restarting, enter the path to the saved JSON file and load it.
Enter the save path for tensor files, click to start preprocessing, and wait for it to complete.
Training
Note: After generating tensor files, it is also recommended to restart Gradio to free VRAM.
- Switch to the Train LoRA tab, enter the tensor file path, and load the dataset.
- If you are unfamiliar with training parameters, the default values are generally fine.
Parameter Reference
| Parameter | Description | Suggested Value |
|---|---|---|
| Max Epochs | Adjust based on dataset size | ~100 songs: 500 epochs; 10-20 songs: 800 epochs (for reference only) |
| Batch Size | Can be increased if VRAM is sufficient | 1 (default); try 2 or 4 if VRAM allows |
| Save Every N Epochs | Checkpoint save interval | Use a smaller interval when Max Epochs is low and a larger one when it is high |
The above values are for reference only. Please adjust based on your actual situation.
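To get a feel for how these parameters interact, you can estimate the number of optimizer steps a run will take. The sketch below assumes one training sample per song, one optimizer step per batch, no gradient accumulation, and that a final partial batch is kept. The actual trainer may differ on any of these points.

```python
import math


def total_optimizer_steps(num_samples, batch_size, max_epochs):
    """Estimate optimizer steps, assuming one step per batch and no accumulation."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * max_epochs


# The 13-track demonstration in this tutorial: batch size 1, 500 epochs.
print(total_optimizer_steps(13, 1, 500))  # 6500
```

Under the same assumptions, raising the batch size to 2 gives 7 steps per epoch, or 3500 steps over 500 epochs.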
- Click Start Training and wait for training to complete.
Using LoRA
- After training completes, restart Gradio and reload models (do not select the LM model).
- Once the model is initialized, load the trained LoRA weights.

- Start generating music.
Congratulations! You have completed the entire LoRA training workflow.
Advanced Training with Side-Step
For users who want more control over LoRA training (including corrected timestep sampling, LoKR adapters, CLI-based workflows, VRAM optimization, and gradient sensitivity analysis), the community-developed Side-Step toolkit provides an advanced alternative. Its documentation is bundled in this repository under docs/sidestep/.
| Topic | Description |
|---|---|
| Getting Started | Installation, prerequisites, and first-run setup |
| End-to-End Tutorial | Complete walkthrough from raw audio to generation |
| Dataset Preparation | JSON schema, audio formats, metadata fields, custom tags |
| Training Guide | LoRA vs LoKR, corrected vs vanilla mode, hyperparameter guide |
| Using Your Adapter | Output layout, loading in Gradio, LoKR limitations |
| VRAM Optimization Guide | GPU memory profiles and optimization strategies |
| Estimation Guide | Gradient sensitivity analysis for targeted training |
| Shift and Timestep Sampling | How training timesteps work and why Side-Step differs from the built-in trainer |
| Preset Management | Built-in presets, save/load/import/export |
| The Settings Wizard | Complete wizard settings reference |
| Model Management | Checkpoint structure and fine-tune support |
| Windows Notes | Windows-specific setup and workarounds |







