ACE-Step 1.5 LoRA Training Tutorial

Hardware Requirements

| VRAM | Description |
| --- | --- |
| 16 GB (minimum) | Generally sufficient, but longer songs may cause out-of-memory errors |
| 20 GB or more (recommended) | Handles full-length songs; VRAM usage typically stays around 17 GB during training |

Note: During the preprocessing stage before training, you may need to restart Gradio multiple times to free VRAM. The specific timing will be mentioned in the steps below.
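
To see which tier a machine falls into before starting, a minimal sketch like the following can help. The function names are hypothetical, and it assumes PyTorch is installed in the same environment as ACE-Step. Sizes are compared in decimal gigabytes, because drivers usually report slightly less than the nominal capacity in GiB.

```python
def vram_tier(total_gb: float) -> str:
    """Classify a GPU by the VRAM tiers listed in the table above."""
    if total_gb >= 20:
        return "recommended"
    if total_gb >= 16:
        return "minimum"
    return "below minimum"


def detect_vram_gb():
    """Return the total VRAM of GPU 0 in decimal GB, or None if unavailable."""
    try:
        import torch  # assumption: PyTorch is installed alongside ACE-Step
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1e9


if __name__ == "__main__":
    total = detect_vram_gb()
    if total is None:
        print("No CUDA GPU detected (or PyTorch is not installed).")
    else:
        print(f"{total:.1f} GB VRAM: {vram_tier(total)}")
```

This only reports total capacity; it does not account for VRAM already used by other processes.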

Disclaimer

This tutorial uses the album ใƒŠใƒฆใ‚ฟใƒณๆ˜Ÿใ‹ใ‚‰ใฎ็‰ฉไฝ“Y by Nayutanๆ˜Ÿไบบ (NayutalieN) (13 tracks) as a demonstration, trained for 500 epochs (batch size 1). This tutorial is intended solely for educational purposes to understand LoRA fine-tuning. Please use your own original works to train your LoRA.

As a developer, I personally enjoy NayutalieN's work and chose one of their albums as an example. If you are the rights holder and believe this tutorial infringes upon your legitimate rights, please contact us immediately. We will remove the relevant content upon receiving a valid notice.

Technology should be used reasonably and lawfully. Please respect artists' creations and refrain from any actions that harm or damage the reputation, rights, or interests of original artists.


Data Preparation

Tip: If you are unfamiliar with programming, you can provide this document to AI coding tools such as Claude Code / Codex CLI / Cursor / Copilot and let them handle the scripting tasks for you.

Overview

Training data for each song consists of the following:

  1. Audio file: Supported formats are .mp3, .wav, .flac, .ogg, .opus
  2. Lyrics: A .lyrics.txt file with the same name as the audio (.txt is also supported for backward compatibility)
  3. Annotation data: Metadata including caption, bpm, keyscale, timesignature, language, etc.

Annotation Data Format

If you already have complete annotation data, you can create JSON files and place them in the same directory as the audio and lyrics. The file structure is as follows:

dataset/
├── song1.mp3               # Audio
├── song1.lyrics.txt        # Lyrics
├── song1.json              # Annotations (optional)
├── song1.caption.txt       # Caption (optional, can also be included in JSON)
├── song2.mp3
├── song2.lyrics.txt
├── song2.json
└── ...

JSON file structure (all fields are optional):

{
    "caption": "A high-energy J-pop track with synthesizer leads and fast tempo",
    "bpm": 190,
    "keyscale": "D major",
    "timesignature": "4",
    "language": "ja"
}

If you don't have annotation data, you can obtain it using the methods described in later sections.
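
If you prefer to generate the per-song JSON files from a script, the sketch below writes one next to an audio file. The helper name `write_annotation` is hypothetical. It accepts only the five fields named above; since the field list ends with "etc.", extend `ANNOTATION_FIELDS` if your version supports more.

```python
import json
from pathlib import Path

# Fields named in this tutorial; other fields may exist (assumption: unknown here)
ANNOTATION_FIELDS = {"caption", "bpm", "keyscale", "timesignature", "language"}


def write_annotation(audio_path, **fields):
    """Write <audio stem>.json next to the audio file and return its path.

    Fields set to None are skipped, because every field is optional.
    """
    unknown = set(fields) - ANNOTATION_FIELDS
    if unknown:
        raise ValueError(f"Unknown annotation fields: {sorted(unknown)}")
    data = {key: value for key, value in fields.items() if value is not None}
    out_path = Path(audio_path).with_suffix(".json")
    out_path.write_text(
        json.dumps(data, ensure_ascii=False, indent=4), encoding="utf-8"
    )
    return out_path
```

For example, `write_annotation("dataset/song1.mp3", bpm=190, keyscale="D major", language="ja")` would create `dataset/song1.json` containing those three fields.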


Lyrics

Save lyrics as a .lyrics.txt file with the same name as the audio file, placed in the same directory. Please ensure the lyrics are accurate.

Lyrics file lookup priority during scanning:

  1. {filename}.lyrics.txt (recommended)
  2. {filename}.txt (backward compatible)
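
The lookup order above can be sketched as follows. This is an illustration of the stated priority, not the scanner's actual code, and `find_lyrics_file` is a hypothetical name.

```python
from pathlib import Path


def find_lyrics_file(audio_path):
    """Return the lyrics file for an audio file, or None if there is none."""
    audio = Path(audio_path)
    candidates = [
        audio.with_name(audio.stem + ".lyrics.txt"),  # 1. recommended
        audio.with_name(audio.stem + ".txt"),         # 2. backward compatible
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

If both files exist, the `.lyrics.txt` file is returned, matching the priority listed above.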

Lyrics Transcription

If you don't have existing lyrics text, you can obtain transcribed lyrics using the following tools:

| Tool | Structural Tags | Accuracy | Ease of Use | Deployment |
| --- | --- | --- | --- | --- |
| acestep-transcriber | No | May contain errors | High difficulty (requires model deployment) | Self-hosted |
| Gemini | Yes | May contain errors | Easy | Paid API |
| Whisper | No | May contain errors | Moderate | Self-hosted / Paid API |
| ElevenLabs | No | May contain errors | Moderate | Paid API (generous free tier) |

This project provides transcription scripts under scripts/lora_data_prepare/:

  • whisper_transcription.py: Transcription via OpenAI Whisper API
  • elevenlabs_transcription.py: Transcription via ElevenLabs Scribe API

Both scripts support the process_folder() method for batch processing entire folders.

Review and Cleanup (Required)

Transcribed lyrics may contain errors and must be manually reviewed and corrected.

If you are using LRC format lyrics, you need to remove the timestamps. Here is a simple cleanup example:

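import re

# Matches LRC timestamps: [mm:ss], [mm:ss.x], [mm:ss.xx], [mm:ss.xxx]
LRC_TIMESTAMP = re.compile(r"\[\d{2}:\d{2}(?:\.\d{1,3})?\]")

def clean_lrc_content(lines):
    """Clean LRC file content by removing timestamps"""
    result = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # A line may carry several timestamps; remove all of them,
        # then strip the whitespace they leave behind
        cleaned = LRC_TIMESTAMP.sub("", line).strip()
        # Timestamp-only lines become "" and are kept as stanza breaks
        result.append(cleaned)

    # Remove trailing empty lines
    while result and not result[-1]:
        result.pop()

    return result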

Structural Tags (Optional)

Including structural tags in lyrics (such as [Verse], [Chorus], etc.) helps the model learn song structure more effectively. Training without structural tags is also possible.

Tip: You can use Gemini to add structural tags to existing lyrics.

Example:

[Intro]
La la la...

[Verse 1]
Walking down the empty street
Echoes dancing at my feet

[Chorus]
We are the stars tonight
Shining through the endless sky

[Bridge]
Close your eyes and feel the sound
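
To check which structural tags a lyrics file contains after tagging, a small sketch such as the one below can be used. It assumes each tag sits alone on its own line, as in the example above. `list_section_tags` is a hypothetical helper, not part of the project.

```python
import re

# Assumption: a structural tag is a line consisting only of "[Something]"
SECTION_TAG = re.compile(r"^[ \t]*\[([^\[\]]+)\][ \t]*$", re.MULTILINE)


def list_section_tags(lyrics_text):
    """Return the structural tags found in the lyrics, in order of appearance."""
    return [tag.strip() for tag in SECTION_TAG.findall(lyrics_text)]
```

An empty result means the file has no structural tags, which is still valid for training.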

Automatic Annotation

1. Obtaining BPM and Key

Use Key-BPM-Finder to obtain BPM and key annotations online:

  1. Open the webpage and click Browse my files to select the audio files to process. Processing too many files at once may freeze the page, so process them in batches and merge the resulting CSVs. Processing is done locally; files are not uploaded to a server.

  2. After processing, click Export CSV to download the CSV file.

  3. CSV file content example:

    File,Artist,Title,BPM,Key,Camelot
    song1.wav,,,190,D major,10B
    song2.wav,,,128,A minor,8A
    
  4. Place the CSV file in the dataset folder. To include caption data, add an extra column after Camelot.
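
If you processed the audio in batches, the exported CSVs can be merged with a short script. The sketch below assumes every export has the header shown in the example above. The function name and the `Caption` column header are assumptions: the tutorial only says to add an extra column after Camelot and does not name it.

```python
import csv

COLUMNS = ["File", "Artist", "Title", "BPM", "Key", "Camelot"]


def merge_key_bpm_csvs(csv_paths, out_path, captions=None):
    """Merge several Key-BPM-Finder exports into one CSV; return the row count.

    Rows are keyed by the File column, so a file that appears in more than
    one export keeps the row from the last CSV read. `captions` is an
    optional {file name: caption} mapping written to an extra column.
    """
    captions = captions or {}
    rows = {}
    for path in csv_paths:
        # utf-8-sig tolerates a byte-order mark at the start of the file
        with open(path, newline="", encoding="utf-8-sig") as f:
            for row in csv.DictReader(f):
                rows[row["File"]] = row

    # "Caption" is an assumed header name for the extra column
    fieldnames = COLUMNS + (["Caption"] if captions else [])
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for name, row in rows.items():
            if captions:
                row = {**row, "Caption": captions.get(name, "")}
            writer.writerow(row)
    return len(rows)
```

Write the merged file into the dataset folder, and keep the individual exports outside it so the same songs are not annotated twice.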

2. Obtaining Captions

Captions can be obtained in the following ways:

  • Using acestep-5Hz-lm (0.6B / 1.7B / 4B): Via the Auto Label feature in the Gradio UI (see subsequent steps)
  • Using Gemini API: Refer to the script scripts/lora_data_prepare/gemini_caption.py, which supports process_folder() for batch processing and generates the following for each audio file:
    • {filename}.lyrics.txt: Lyrics
    • {filename}.caption.txt: Caption description

Data Preprocessing

Once data is prepared, you can use the Gradio UI for data review and preprocessing.

Important: If using a startup script, you need to modify the launch parameters to disable service pre-initialization:

  • Windows (start_gradio_ui.bat): Change if not defined INIT_SERVICE set INIT_SERVICE=--init_service true to if not defined INIT_SERVICE set INIT_SERVICE=--init_service false
  • Linux/macOS (start_gradio_ui.sh): Change : "${INIT_SERVICE:=--init_service true}" to : "${INIT_SERVICE:=--init_service false}"

Launch the Gradio UI (via the startup script or by running acestep/acestep_v15_pipeline.py directly).

Step 1: Load Models

  • If you need to use LM for caption generation: Select the desired LM model during initialization (acestep-5Hz-lm-0.6B / 1.7B / 4B).

  • If you don't need LM: Do not select any LM model.

Step 2: Load Data

Switch to the LoRA Training tab, enter the dataset directory path, and click Scan.

The scanner automatically recognizes the following files:

| File | Description |
| --- | --- |
| *.mp3 / *.wav / *.flac / ... | Audio files |
| {filename}.lyrics.txt (or {filename}.txt) | Lyrics |
| {filename}.caption.txt | Caption description |
| {filename}.json | Annotation metadata (caption / bpm / keyscale / timesignature / language) |
| *.csv | Batch BPM / Key annotations (exported from Key-BPM-Finder) |
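
Before clicking Scan, it can be useful to check which songs are missing companion files. The sketch below is a standalone pre-check based on the naming rules in this tutorial; it is not the scanner's implementation, and `summarize_dataset` is a hypothetical name. It only tests for file presence, so a caption stored inside the JSON or CSV will not show up under "caption".

```python
from pathlib import Path

# Audio formats listed in the Data Preparation overview
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac", ".ogg", ".opus"}


def summarize_dataset(dataset_dir):
    """Map each audio file name to the companion files found next to it."""
    folder = Path(dataset_dir)
    summary = {}
    for audio in sorted(folder.iterdir()):
        if not audio.is_file() or audio.suffix.lower() not in AUDIO_EXTENSIONS:
            continue
        stem = audio.stem
        summary[audio.name] = {
            "lyrics": (folder / f"{stem}.lyrics.txt").is_file()
            or (folder / f"{stem}.txt").is_file(),
            "caption": (folder / f"{stem}.caption.txt").is_file(),
            "json": (folder / f"{stem}.json").is_file(),
        }
    return summary


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        for name, found in summarize_dataset(sys.argv[1]).items():
            missing = [kind for kind, ok in found.items() if not ok]
            print(f"{name}: missing {', '.join(missing) if missing else 'nothing'}")
```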

Step 3: Review and Adjust Dataset

  • Duration: Automatically read from the audio file
  • Lyrics: Requires a corresponding .lyrics.txt file (.txt is also supported)
  • Labeled: Shows ✅ if a caption exists, ❌ otherwise
  • BPM / Key / Caption: Loaded from JSON or CSV files
  • If the dataset is not entirely instrumental, uncheck All Instrumental
  • Format Lyrics and Transcribe Lyrics are currently disabled (not yet integrated with acestep-transcriber; using LM directly tends to produce hallucinations)
  • Enter a Custom Trigger Tag (currently has limited effect; any option other than Replace Caption is fine)
  • Genre Ratio controls the proportion of samples using genre instead of caption. Since the current LM-generated genre descriptions are far less descriptive than captions, keep this at 0

Step 4: Auto Label Data

  • If you already have captions, you can skip this step
  • If your data lacks captions, use LM inference to generate them
  • If BPM / Key values are missing, obtain them via Key-BPM-Finder first โ€” generating them directly with LM will produce hallucinations

Step 5: Review and Edit Data

If needed, you can review and modify data entry by entry. Remember to click Save after editing each entry.

Step 6: Save Dataset

Enter a save path to export the dataset as a JSON file.

Step 7: Preprocess and Generate Tensor Files

Note: If you previously used LM to generate captions and VRAM is insufficient, restart Gradio to free VRAM first. When restarting, do not select the LM model. After restarting, enter the path to the saved JSON file and load it.

Enter the save path for tensor files, click to start preprocessing, and wait for it to complete.


Training

Note: After generating tensor files, it is also recommended to restart Gradio to free VRAM.

  1. Switch to the Train LoRA tab, enter the tensor file path, and load the dataset.
  2. If you are unfamiliar with training parameters, the default values are generally fine.

Parameter Reference

| Parameter | Description | Suggested Value |
| --- | --- | --- |
| Max Epochs | Adjust based on dataset size | ~100 songs → 500 epochs; 10-20 songs → 800 epochs (for reference only) |
| Batch Size | Can be increased if VRAM is sufficient | 1 (default); try 2 or 4 if VRAM allows |
| Save Every N Epochs | Checkpoint save interval | Set smaller for fewer Max Epochs, larger for more |

The above values are for reference only. Please adjust based on your actual situation.
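
To get a feel for how these parameters interact, the arithmetic can be sketched as below. It assumes one training sample per song, no gradient accumulation, and that a final partial batch is kept. The built-in trainer may count steps differently, so treat the numbers as rough estimates. `training_plan` is a hypothetical helper.

```python
import math


def training_plan(num_samples, max_epochs, batch_size=1, save_every=None):
    """Estimate steps per epoch, total steps, and number of checkpoints."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = steps_per_epoch * max_epochs
    # One checkpoint every `save_every` epochs; 0 if no interval is given
    checkpoints = max_epochs // save_every if save_every else 0
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "checkpoints": checkpoints,
    }
```

For the demonstration dataset in this tutorial (13 tracks, 500 epochs, batch size 1), this gives 13 steps per epoch and 6,500 steps in total. With a hypothetical save interval of 50 epochs, that would produce 10 checkpoints.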

  3. Click Start Training and wait for training to complete.


Using LoRA

  1. After training completes, restart Gradio and reload models (do not select the LM model).
  2. Once the model is initialized, load the trained LoRA weights.
  3. Start generating music.

Congratulations! You have completed the entire LoRA training workflow.


Advanced Training with Side-Step

For users who want more control over LoRA training (including corrected timestep sampling, LoKR adapters, CLI-based workflows, VRAM optimization, and gradient sensitivity analysis), the community-developed Side-Step toolkit provides an advanced alternative. Its documentation is bundled in this repository under docs/sidestep/.

| Topic | Description |
| --- | --- |
| Getting Started | Installation, prerequisites, and first-run setup |
| End-to-End Tutorial | Complete walkthrough from raw audio to generation |
| Dataset Preparation | JSON schema, audio formats, metadata fields, custom tags |
| Training Guide | LoRA vs LoKR, corrected vs vanilla mode, hyperparameter guide |
| Using Your Adapter | Output layout, loading in Gradio, LoKR limitations |
| VRAM Optimization Guide | GPU memory profiles and optimization strategies |
| Estimation Guide | Gradient sensitivity analysis for targeted training |
| Shift and Timestep Sampling | How training timesteps work and why Side-Step differs from the built-in trainer |
| Preset Management | Built-in presets, save/load/import/export |
| The Settings Wizard | Complete wizard settings reference |
| Model Management | Checkpoint structure and fine-tune support |
| Windows Notes | Windows-specific setup and workarounds |