Spaces:

OnyxMunk
/

Ace-Step-Munk

Running

App Files Files Community

Ace-Step-Munk / docs /en /LoRA_Training_Tutorial.md

OnyxMunk

Add LoRA training assets: scripts, docs (no binaries), ui, my_dataset

bc9c638 28 days ago

preview code

raw

history blame contribute delete

13.5 kB

A newer version of the Gradio SDK is available: 6.9.0

Upgrade

ACE-Step 1.5 LoRA Training Tutorial

Hardware Requirements

VRAM	Description
16 GB (minimum)	Generally sufficient, but longer songs may cause out-of-memory errors
20 GB or more (recommended)	Handles full-length songs; VRAM usage typically stays around 17 GB during training

Note: During the preprocessing stage before training, you may need to restart Gradio multiple times to free VRAM. The specific timing will be mentioned in the steps below.

Disclaimer

This tutorial uses the album ナユタン星からの物体Y by Nayutan星人 (NayutalieN) (13 tracks) as a demonstration, trained for 500 epochs (batch size 1). This tutorial is intended solely for educational purposes to understand LoRA fine-tuning. Please use your own original works to train your LoRA.

As a developer, I personally enjoy NayutalieN's work and chose one of their albums as an example. If you are the rights holder and believe this tutorial infringes upon your legitimate rights, please contact us immediately. We will remove the relevant content upon receiving a valid notice.

Technology should be used reasonably and lawfully. Please respect artists' creations and refrain from any actions that harm or damage the reputation, rights, or interests of original artists.

Data Preparation

Tip: If you are unfamiliar with programming, you can provide this document to AI coding tools such as Claude Code / Codex CLI / Cursor / Copilot and let them handle the scripting tasks for you.

Overview

Training data for each song consists of the following:

Audio file — Supported formats: .mp3, .wav, .flac, .ogg, .opus
Lyrics — A .lyrics.txt file with the same name as the audio (.txt is also supported for backward compatibility)
Annotation data — Metadata including caption, bpm, keyscale, timesignature, language, etc.

Annotation Data Format

If you already have complete annotation data, you can create JSON files and place them in the same directory as the audio and lyrics. The file structure is as follows:

dataset/
├── song1.mp3               # Audio
├── song1.lyrics.txt        # Lyrics
├── song1.json              # Annotations (optional)
├── song1.caption.txt       # Caption (optional, can also be included in JSON)
├── song2.mp3
├── song2.lyrics.txt
├── song2.json
└── ...

JSON file structure (all fields are optional):

{
    "caption": "A high-energy J-pop track with synthesizer leads and fast tempo",
    "bpm": 190,
    "keyscale": "D major",
    "timesignature": "4",
    "language": "ja"
}

If you don't have annotation data, you can obtain it using the methods described in later sections.

Lyrics

Save lyrics as a .lyrics.txt file with the same name as the audio file, placed in the same directory. Please ensure the lyrics are accurate.

Lyrics file lookup priority during scanning:

{filename}.lyrics.txt (recommended)
{filename}.txt (backward compatible)

Lyrics Transcription

If you don't have existing lyrics text, you can obtain transcribed lyrics using the following tools:

Tool	Structural Tags	Accuracy	Ease of Use	Deployment
acestep-transcriber	No	May contain errors	High difficulty (requires model deployment)	Self-hosted
Gemini	Yes	May contain errors	Easy	Paid API
Whisper	No	May contain errors	Moderate	Self-hosted / Paid API
ElevenLabs	No	May contain errors	Moderate	Paid API (generous free tier)

This project provides transcription scripts under scripts/lora_data_prepare/:

whisper_transcription.py — Transcription via OpenAI Whisper API
elevenlabs_transcription.py — Transcription via ElevenLabs Scribe API

Both scripts support the process_folder() method for batch processing entire folders.

Review and Cleanup (Required)

Transcribed lyrics may contain errors and must be manually reviewed and corrected.

If you are using LRC format lyrics, you need to remove the timestamps. Here is a simple cleanup example:

import re

def clean_lrc_content(lines):
    """Clean LRC file content by removing timestamps"""
    result = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Remove timestamps [mm:ss.x] [mm:ss.xx] [mm:ss.xxx]
        cleaned = re.sub(r"\[\d{2}:\d{2}\.\d{1,3}\]", "", line)
        result.append(cleaned)

    # Remove trailing empty lines
    while result and not result[-1]:
        result.pop()

    return result

Structural Tags (Optional)

Including structural tags in lyrics (such as [Verse], [Chorus], etc.) helps the model learn song structure more effectively. Training without structural tags is also possible.

Tip: You can use Gemini to add structural tags to existing lyrics.

Example:

[Intro]
La la la...

[Verse 1]
Walking down the empty street
Echoes dancing at my feet

[Chorus]
We are the stars tonight
Shining through the endless sky

[Bridge]
Close your eyes and feel the sound

Automatic Annotation

1. Obtaining BPM and Key

Use Key-BPM-Finder to obtain BPM and key annotations online:

Open the webpage and click Browse my files to select the audio files to process (processing too many at once may cause the page to freeze — batch processing and merging CSVs is recommended). Processing is done locally and files are not uploaded to a server.
After processing, click Export CSV to download the CSV file.

CSV file content example:

File,Artist,Title,BPM,Key,Camelot
song1.wav,,,190,D major,10B
song2.wav,,,128,A minor,8A

Place the CSV file in the dataset folder. To include caption data, add an extra column after Camelot.

2. Obtaining Captions

Captions can be obtained in the following ways:

Using acestep-5Hz-lm (0.6B / 1.7B / 4B) — Via the Auto Label feature in the Gradio UI (see subsequent steps)
Using Gemini API — Refer to the script scripts/lora_data_prepare/gemini_caption.py, which supports process_folder() for batch processing and generates the following for each audio file:
- {filename}.lyrics.txt — Lyrics
- {filename}.caption.txt — Caption description

Data Preprocessing

Once data is prepared, you can use the Gradio UI for data review and preprocessing.

Important: If using a startup script, you need to modify the launch parameters to disable service pre-initialization:

Windows (start_gradio_ui.bat): Change if not defined INIT_SERVICE set INIT_SERVICE=--init_service true to if not defined INIT_SERVICE set INIT_SERVICE=--init_service false

Linux/macOS (start_gradio_ui.sh): Change : "${INIT_SERVICE:=--init_service true}" to : "${INIT_SERVICE:=--init_service false}"

Launch the Gradio UI (via the startup script or by running acestep/acestep_v15_pipeline.py directly).

Step 1: Load Models

If you need to use LM for caption generation: Select the desired LM model during initialization (acestep-5Hz-lm-0.6B / 1.7B / 4B).
If you don't need LM: Do not select any LM model.

Step 2: Load Data

Switch to the LoRA Training tab, enter the dataset directory path, and click Scan.

The scanner automatically recognizes the following files:

File	Description
`.mp3` / `.wav` / `*.flac` / ...	Audio files
`{filename}.lyrics.txt` (or `{filename}.txt`)	Lyrics
`{filename}.caption.txt`	Caption description
`{filename}.json`	Annotation metadata (caption / bpm / keyscale / timesignature / language)
`*.csv`	Batch BPM / Key annotations (exported from Key-BPM-Finder)

Step 3: Review and Adjust Dataset

Duration — Automatically read from the audio file
Lyrics — Requires a corresponding .lyrics.txt file (.txt is also supported)
Labeled — Shows ✅ if caption exists, ❌ otherwise
BPM / Key / Caption — Loaded from JSON or CSV files
If the dataset is not entirely instrumental, uncheck All Instrumental
Format Lyrics and Transcribe Lyrics are currently disabled (not yet integrated with acestep-transcriber; using LM directly tends to produce hallucinations)
Enter a Custom Trigger Tag (currently has limited effect; any option other than Replace Caption is fine)
Genre Ratio controls the proportion of samples using genre instead of caption. Since the current LM-generated genre descriptions are far less descriptive than captions, keep this at 0

Step 4: Auto Label Data

If you already have captions, you can skip this step
If your data lacks captions, use LM inference to generate them
If BPM / Key values are missing, obtain them via Key-BPM-Finder first — generating them directly with LM will produce hallucinations

Step 5: Review and Edit Data

If needed, you can review and modify data entry by entry. Remember to click Save after editing each entry.

Step 6: Save Dataset

Enter a save path to export the dataset as a JSON file.

Step 7: Preprocess and Generate Tensor Files

Note: If you previously used LM to generate captions and VRAM is insufficient, restart Gradio to free VRAM first. When restarting, do not select the LM model. After restarting, enter the path to the saved JSON file and load it.

Enter the save path for tensor files, click to start preprocessing, and wait for it to complete.

Training

Note: After generating tensor files, it is also recommended to restart Gradio to free VRAM.

Switch to the Train LoRA tab, enter the tensor file path, and load the dataset.
If you are unfamiliar with training parameters, the default values are generally fine.

Parameter Reference

Parameter	Description	Suggested Value
Max Epochs	Adjust based on dataset size	~100 songs → 500 epochs; 10–20 songs → 800 epochs (for reference only)
Batch Size	Can be increased if VRAM is sufficient	1 (default); try 2 or 4 if VRAM allows
Save Every N Epochs	Checkpoint save interval	Set smaller for fewer Max Epochs, larger for more

The above values are for reference only. Please adjust based on your actual situation.

Click Start Training and wait for training to complete.

Using LoRA

After training completes, restart Gradio and reload models (do not select the LM model).
Once the model is initialized, load the trained LoRA weights.
Start generating music.

Congratulations! You have completed the entire LoRA training workflow.

Advanced Training with Side-Step

For users who want more control over LoRA training — including corrected timestep sampling, LoKR adapters, CLI-based workflows, VRAM optimization, and gradient sensitivity analysis — the community-developed Side-Step toolkit provides an advanced alternative. Its documentation is bundled in this repository under docs/sidestep/.

Topic	Description
Getting Started	Installation, prerequisites, and first-run setup
End-to-End Tutorial	Complete walkthrough from raw audio to generation
Dataset Preparation	JSON schema, audio formats, metadata fields, custom tags
Training Guide	LoRA vs LoKR, corrected vs vanilla mode, hyperparameter guide
Using Your Adapter	Output layout, loading in Gradio, LoKR limitations
VRAM Optimization Guide	GPU memory profiles and optimization strategies
Estimation Guide	Gradient sensitivity analysis for targeted training
Shift and Timestep Sampling	How training timesteps work and why Side-Step differs from the built-in trainer
Preset Management	Built-in presets, save/load/import/export
The Settings Wizard	Complete wizard settings reference
Model Management	Checkpoint structure and fine-tune support
Windows Notes	Windows-specific setup and workarounds