# ACE-Step Gradio Demo User Guide

**Language / 语言 / 言語:** [English](GRADIO_GUIDE.md) | [中文](../zh/GRADIO_GUIDE.md) | [日本語](../ja/GRADIO_GUIDE.md)

---

This guide documents the ACE-Step Gradio web interface for music generation, covering every feature and setting.

## Table of Contents

- [Getting Started](#getting-started)
- [Service Configuration](#service-configuration)
- [Generation Modes](#generation-modes)
- [Task Types](#task-types)
- [Input Parameters](#input-parameters)
- [Advanced Settings](#advanced-settings)
- [Results Section](#results-section)
- [LoRA Training](#lora-training)
- [Tips and Best Practices](#tips-and-best-practices)
- [Keyboard Shortcuts](#keyboard-shortcuts)
- [Language Support](#language-support)

---

## Getting Started

### Launching the Demo

```bash
# Basic launch
python app.py

# With pre-initialization
python app.py --config acestep-v15-turbo --init-llm

# With specific port
python app.py --port 7860
```

### Interface Overview

The Gradio interface consists of several main sections:

1. **Service Configuration** - Model loading and initialization
2. **Required Inputs** - Task type, audio uploads, and generation mode
3. **Music Caption & Lyrics** - Text inputs for generation
4. **Optional Parameters** - Metadata like BPM, key, duration
5. **Advanced Settings** - Fine-grained control over generation
6. **Results** - Generated audio playback and management

---

## Service Configuration

### Model Selection

| Setting | Description |
|---------|-------------|
| **Checkpoint File** | Select a trained model checkpoint (if available) |
| **Main Model Path** | Choose the DiT model configuration (e.g., `acestep-v15-turbo`, `acestep-v15-turbo-shift3`) |
| **Device** | Processing device: `auto` (recommended), `cuda`, or `cpu` |

### 5Hz LM Configuration

| Setting | Description |
|---------|-------------|
| **5Hz LM Model Path** | Select the language model (e.g., `acestep-5Hz-lm-0.6B`, `acestep-5Hz-lm-1.7B`) |
| **5Hz LM Backend** | `vllm` (faster, recommended) or `pt` (PyTorch, more compatible) |
| **Initialize 5Hz LM** | Check to load the LM during initialization (required for thinking mode) |

### Performance Options

| Setting | Description |
|---------|-------------|
| **Use Flash Attention** | Enable for faster inference (requires the `flash_attn` package) |
| **Offload to CPU** | Offload models to CPU when idle to save GPU memory |
| **Offload DiT to CPU** | Specifically offload the DiT model to CPU |

### LoRA Adapter

| Setting | Description |
|---------|-------------|
| **LoRA Path** | Path to trained LoRA adapter directory |
| **Load LoRA** | Load the specified LoRA adapter |
| **Unload** | Remove the currently loaded LoRA |
| **Use LoRA** | Enable/disable the loaded LoRA for inference |

### Initialization

Click **Initialize Service** to load the models. The status box will show progress and confirmation.

---

## Generation Modes

### Simple Mode

Simple mode is designed for quick, natural language-based music generation.

**How to use:**
1. Select "Simple" in the Generation Mode radio button
2. Enter a natural language description in the "Song Description" field
3. Optionally check "Instrumental" if you don't want vocals
4. Optionally select a preferred vocal language
5. Click **Create Sample** to generate caption, lyrics, and metadata
6. Review the generated content in the expanded sections
7. Click **Generate Music** to create the audio

**Example descriptions:**
- "a soft Bengali love song for a quiet evening"
- "upbeat electronic dance music with heavy bass drops"
- "melancholic indie folk with acoustic guitar"
- "jazz trio playing in a smoky bar"

**Random Sample:** Click the 🎲 button to load a random example description.

### Custom Mode

Custom mode provides full control over all generation parameters.

**How to use:**
1. Select "Custom" in the Generation Mode radio button
2. Manually fill in the Caption and Lyrics fields
3. Set optional metadata (BPM, Key, Duration, etc.)
4. Optionally click **Format** to enhance your input using the LM
5. Configure advanced settings as needed
6. Click **Generate Music** to create the audio

---

## Task Types

### text2music (Default)

Generate music from text descriptions and/or lyrics.

**Use case:** Creating new music from scratch based on prompts.

**Required inputs:** Caption or Lyrics (at least one)

### cover

Transform existing audio, keeping its structure while changing its style.

**Use case:** Creating cover versions in different styles.

**Required inputs:**
- Source Audio (upload in Audio Uploads section)
- Caption describing the target style

**Key parameter:** `Audio Cover Strength` (0.0-1.0)
- Higher values maintain more of the original structure
- Lower values allow more creative freedom

### repaint

Regenerate a specific time segment of audio.

**Use case:** Fixing or modifying specific sections of generated music.

**Required inputs:**
- Source Audio
- Repainting Start (seconds)
- Repainting End (seconds, -1 for end of file)
- Caption describing the desired content
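
The start/end convention above can be sketched as a small helper. This is an illustration only, not the demo's own code: the function name `resolve_repaint_window` and the decision to reject out-of-range values (rather than clamp them) are assumptions.

```python
def resolve_repaint_window(start_s, end_s, total_s):
    """Turn (Repainting Start, Repainting End) into a concrete window in seconds.

    Hypothetical helper: -1 for the end means "end of file", as described above.
    """
    if end_s == -1:
        end_s = total_s
    # Assumption: an empty or out-of-range window is treated as an error.
    if not 0 <= start_s < end_s <= total_s:
        raise ValueError(f"invalid repaint window: {start_s}..{end_s} of {total_s}s")
    return start_s, end_s


# Repaint from 0:10 to the end of a 60-second clip.
print(resolve_repaint_window(10, -1, 60))
```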

### lego (Base Model Only)

Generate a specific instrument track in the context of existing audio.

**Use case:** Adding instrument layers to backing tracks.

**Required inputs:**
- Source Audio
- Track Name (select from dropdown)
- Caption describing the track characteristics

**Available tracks:** vocals, backing_vocals, drums, bass, guitar, keyboard, percussion, strings, synth, fx, brass, woodwinds

### extract (Base Model Only)

Extract/isolate a specific instrument track from mixed audio.

**Use case:** Stem separation, isolating instruments.

**Required inputs:**
- Source Audio
- Track Name to extract

### complete (Base Model Only)

Complete partial tracks with specified instruments.

**Use case:** Auto-arranging incomplete compositions.

**Required inputs:**
- Source Audio
- Track Names (multiple selection)
- Caption describing the desired style
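
The required inputs listed for each task type can be summarized as a lookup table with a small checker. This is a sketch for orientation only: the key names (`source_audio`, `track_name`, and so on) and the function `missing_inputs` are hypothetical and do not come from the ACE-Step code base.

```python
# Required inputs per task, transcribed from the sections above.
REQUIRED_INPUTS = {
    "text2music": [],  # needs caption OR lyrics; handled separately below
    "cover": ["source_audio", "caption"],
    "repaint": ["source_audio", "repainting_start", "repainting_end", "caption"],
    "lego": ["source_audio", "track_name", "caption"],
    "extract": ["source_audio", "track_name"],
    "complete": ["source_audio", "track_names", "caption"],
}

# Tasks the guide marks as "Base Model Only".
BASE_MODEL_ONLY = {"lego", "extract", "complete"}


def missing_inputs(task, inputs):
    """Return the names of required inputs that are absent or empty."""
    if task not in REQUIRED_INPUTS:
        raise ValueError(f"unknown task type: {task}")

    def provided(name):
        value = inputs.get(name)
        # 0 and -1 are valid values (e.g. repaint start/end), so only
        # None, empty strings, and empty lists count as missing.
        return value is not None and value != "" and value != []

    missing = [name for name in REQUIRED_INPUTS[task] if not provided(name)]
    if task == "text2music" and not (provided("caption") or provided("lyrics")):
        missing.append("caption or lyrics")
    return missing


print(missing_inputs("cover", {"caption": "bossa nova arrangement"}))
```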

---

## Input Parameters

### Required Inputs

#### Task Type
Select the generation task from the dropdown. The instruction field updates automatically based on the selected task.

#### Audio Uploads

| Field | Description |
|-------|-------------|
| **Reference Audio** | Optional audio for style reference |
| **Source Audio** | Required for cover, repaint, lego, extract, complete tasks |
| **Convert to Codes** | Extract 5Hz semantic codes from source audio |

#### LM Codes Hints

Pre-computed audio semantic codes can be pasted here to guide generation. Use the **Transcribe** button to analyze codes and extract metadata.

### Music Caption

The text description of the desired music. Be specific about:
- Genre and style
- Instruments
- Mood and atmosphere
- Tempo feel (if not specifying BPM)

**Example:** "upbeat pop rock with electric guitars, driving drums, and catchy synth hooks"

Click 🎲 to load a random example caption.

### Lyrics

Enter lyrics with structure tags:

```
[Verse 1]
Walking down the street today
Thinking of the words you used to say

[Chorus]
I'm moving on, I'm staying strong
This is where I belong

[Verse 2]
...
```
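
To check that the structure tags in a lyrics draft are well formed before pasting it into the UI, the text can be split into tagged sections. The parser below is a minimal sketch; `split_lyrics` is a hypothetical helper, and how the model itself consumes the tags is not described in this guide.

```python
import re

# A structure tag is a line consisting only of "[...]", e.g. "[Verse 1]".
TAG_RE = re.compile(r"^\[(.+?)\]\s*$")


def split_lyrics(text):
    """Split lyrics into (tag, lines) pairs, in order of appearance."""
    sections = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = TAG_RE.match(line)
        if match:
            current = (match.group(1), [])
            sections.append(current)
        elif line:
            if current is None:
                # Lines before the first tag are grouped under a placeholder.
                current = ("untagged", [])
                sections.append(current)
            current[1].append(line)
    return sections


lyrics = """[Verse 1]
Walking down the street today

[Chorus]
I'm moving on, I'm staying strong
"""
for tag, lines in split_lyrics(lyrics):
    print(tag, len(lines))
```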

**Instrumental checkbox:** Check this to generate instrumental music regardless of lyrics content.

**Vocal Language:** Select the language for vocals. Use "unknown" for auto-detection or instrumental tracks.

**Format button:** Click to enhance caption and lyrics using the 5Hz LM.

### Optional Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| **BPM** | Auto | Tempo in beats per minute (30-300) |
| **Key Scale** | Auto | Musical key (e.g., "C Major", "Am", "F# minor") |
| **Time Signature** | Auto | Time signature: 2 (2/4), 3 (3/4), 4 (4/4), 6 (6/8) |
| **Audio Duration** | Auto/-1 | Target length in seconds (10-600). -1 for automatic |
| **Batch Size** | 2 | Number of audio variations to generate (1-8) |
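
The ranges in the table above can be expressed as a simple validation step. This is a sketch under the assumption that `None` stands for "Auto"; the function name and error messages are illustrative, and the UI may enforce these limits differently.

```python
VALID_TIME_SIGNATURES = {2, 3, 4, 6}  # 2/4, 3/4, 4/4, 6/8


def validate_optional_params(bpm=None, time_signature=None, duration=-1, batch_size=2):
    """Return a list of human-readable problems; an empty list means all values are in range."""
    errors = []
    if bpm is not None and not 30 <= bpm <= 300:
        errors.append("BPM must be between 30 and 300")
    if time_signature is not None and time_signature not in VALID_TIME_SIGNATURES:
        errors.append("time signature must be one of 2, 3, 4, 6")
    if duration != -1 and not 10 <= duration <= 600:
        errors.append("duration must be -1 (auto) or between 10 and 600 seconds")
    if not 1 <= batch_size <= 8:
        errors.append("batch size must be between 1 and 8")
    return errors


print(validate_optional_params(bpm=128, time_signature=4, duration=180, batch_size=4))
```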

---

## Advanced Settings

### DiT Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| **Inference Steps** | 8 | Denoising steps. Turbo: 1-20, Base: 1-200 |
| **Guidance Scale** | 7.0 | CFG strength (base model only). Higher values follow the prompt more closely |
| **Seed** | -1 | Random seed. Use comma-separated values for batches |
| **Random Seed** | ✓ | When checked, generates random seeds |
| **Audio Format** | mp3 | Output format: mp3, flac |
| **Shift** | 3.0 | Timestep shift factor (1.0-5.0). Recommended 3.0 for turbo |
| **Inference Method** | ode | ode (Euler, faster) or sde (stochastic) |
| **Custom Timesteps** | - | Override timesteps (e.g., "0.97,0.76,0.615,0.5,0.395,0.28,0.18,0.085,0") |
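
Both **Seed** and **Custom Timesteps** accept comma-separated text. The sketch below shows one way such strings could be parsed. It is not the demo's parser: treating `-1` or a missing entry as "draw a random seed", and requiring timesteps to lie in [0, 1] in strictly decreasing order, are assumptions inferred from the defaults and the example value in the table.

```python
import random


def parse_seeds(seed_text, batch_size, use_random):
    """Return one seed per batch item."""
    given = [] if use_random else [int(s) for s in seed_text.split(",") if s.strip()]
    seeds = []
    for i in range(batch_size):
        if i < len(given) and given[i] >= 0:
            seeds.append(given[i])
        else:
            # Assumption: -1 or a missing entry means "pick a random seed".
            seeds.append(random.randint(0, 2**32 - 1))
    return seeds


def parse_timesteps(text):
    """Parse a comma-separated timestep override into a list of floats."""
    values = [float(v) for v in text.split(",") if v.strip()]
    if not values:
        raise ValueError("no timesteps given")
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("timesteps must lie in [0, 1]")
    if any(a <= b for a, b in zip(values, values[1:])):
        raise ValueError("timesteps must be strictly decreasing")
    return values


print(parse_seeds("42, 7", batch_size=2, use_random=False))
print(parse_timesteps("0.97,0.76,0.615,0.5,0.395,0.28,0.18,0.085,0"))
```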

### Base Model Only Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| **Use ADG** | ✗ | Enable Adaptive Dual Guidance for better quality |
| **CFG Interval Start** | 0.0 | When to start applying CFG (0.0-1.0) |
| **CFG Interval End** | 1.0 | When to stop applying CFG (0.0-1.0) |
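
As a rough illustration of how the guidance scale and the CFG interval interact, the sketch below applies the standard classifier-free guidance formula only while the denoising progress is inside the interval. It assumes the interval is measured as normalized progress from 0.0 (first step) to 1.0 (last step); the guide does not say how ACE-Step measures it, and ADG is not modeled here.

```python
def apply_cfg(cond, uncond, scale, progress, start=0.0, end=1.0):
    """Combine conditional and unconditional predictions for one denoising step.

    cond, uncond: model outputs as flat lists of floats (stand-ins for tensors).
    progress: fraction of the denoising run completed, in [0, 1] (assumed convention).
    """
    if start <= progress <= end:
        # Standard classifier-free guidance: push the prediction away from
        # the unconditional output, toward the conditional one.
        return [u + scale * (c - u) for c, u in zip(cond, uncond)]
    # Outside the interval, guidance is skipped.
    return list(cond)


print(apply_cfg([1.0, 2.0], [0.0, 1.0], scale=7.0, progress=0.5))
print(apply_cfg([1.0, 2.0], [0.0, 1.0], scale=7.0, progress=0.9, start=0.0, end=0.5))
```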

### LM Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| **LM Temperature** | 0.85 | Sampling temperature (0.0-2.0). Higher = more creative |
| **LM CFG Scale** | 2.0 | LM guidance strength (1.0-3.0) |
| **LM Top-K** | 0 | Top-K sampling. 0 disables |
| **LM Top-P** | 0.9 | Nucleus sampling (0.0-1.0) |
| **LM Negative Prompt** | "NO USER INPUT" | Negative prompt for CFG |
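
For readers unfamiliar with the sampling terms, the following sketch shows how temperature, top-K, and top-P (nucleus) filtering are commonly combined when picking the next token. It is a generic illustration in plain Python, not the 5Hz LM's actual sampler, and it leaves out LM CFG and the negative prompt.

```python
import math
import random


def filter_probs(logits, temperature=0.85, top_k=0, top_p=0.9):
    """Return {token_index: probability} after temperature, top-K and top-P filtering."""
    if temperature <= 0:
        # Treat zero temperature as greedy decoding.
        best = max(range(len(logits)), key=logits.__getitem__)
        return {best: 1.0}
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(weights)
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)
    if top_k > 0:  # 0 disables top-K, matching the table above
        ranked = ranked[:top_k]
    kept, cumulative = [], 0.0
    for prob, index in ranked:
        kept.append((prob, index))
        cumulative += prob
        if cumulative >= top_p:  # smallest set whose mass reaches top_p
            break
    norm = sum(prob for prob, _ in kept)
    return {index: prob / norm for prob, index in kept}


def sample(logits, rng=random, **kwargs):
    """Draw one token index from the filtered distribution."""
    probs = filter_probs(logits, **kwargs)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens])[0]


print(filter_probs([2.0, 1.0, 0.5, -1.0]))
```

Lower temperatures and smaller top-P values concentrate probability on the most likely tokens, which generally makes output more predictable.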

### CoT (Chain-of-Thought) Options

| Option | Default | Description |
|--------|---------|-------------|
| **CoT Metas** | ✓ | Generate metadata via LM reasoning |
| **CoT Language** | ✓ | Detect vocal language via LM |
| **Constrained Decoding Debug** | ✗ | Enable debug logging |

### Generation Options

| Option | Default | Description |
|--------|---------|-------------|
| **LM Codes Strength** | 1.0 | How strongly LM codes influence generation (0.0-1.0) |
| **Auto Score** | ✗ | Automatically calculate quality scores |
| **Auto LRC** | ✗ | Automatically generate lyrics timestamps |
| **LM Batch Chunk Size** | 8 | Max items per LM batch; lower it to reduce GPU memory use |

### Main Generation Controls

| Control | Description |
|---------|-------------|
| **Think** | Enable 5Hz LM for code generation and metadata |
| **ParallelThinking** | Enable parallel LM batch processing |
| **CaptionRewrite** | Let LM enhance the input caption |
| **AutoGen** | Automatically start next batch after completion |

---

## Results Section

### Generated Audio

Up to 8 audio samples are displayed based on batch size. Each sample includes:

- **Audio Player** - Play, pause, and download the generated audio
- **Send To Src** - Send this audio to the Source Audio input for further processing
- **Save** - Save audio and metadata to a JSON file
- **Score** - Calculate perplexity-based quality score
- **LRC** - Generate lyrics timestamps (LRC format)
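
The **Score** button is described as perplexity-based. As background, perplexity is the exponential of the average negative log-likelihood per token; the sketch below computes it from a list of per-token log-probabilities. How the demo converts perplexity into the score it displays is not documented here, so this only illustrates the underlying quantity.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean(log p(token))). Lower means the sequence was more predictable."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Four tokens, each predicted with probability 0.5.
print(perplexity([math.log(0.5)] * 4))
```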

### Details Accordion

Click "Score & LRC & LM Codes" to expand and view:
- **LM Codes** - The 5Hz semantic codes for this sample
- **Quality Score** - Perplexity-based quality metric
- **Lyrics Timestamps** - LRC format timing data
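
LRC is a plain-text format in which each lyric line is prefixed with a `[mm:ss.xx]` timestamp. The sketch below formats (start time, text) pairs that way; the function name and the input shape are assumptions for illustration and do not reflect the demo's internal data structures.

```python
def to_lrc(lines):
    """Format (start_seconds, text) pairs as LRC lines, sorted by time."""
    out = []
    for start_s, text in sorted(lines):
        centis = round(start_s * 100)  # work in centiseconds to avoid "60.00" seconds
        minutes, rem = divmod(centis, 6000)
        seconds, hundredths = divmod(rem, 100)
        out.append(f"[{minutes:02d}:{seconds:02d}.{hundredths:02d}]{text}")
    return "\n".join(out)


print(to_lrc([(12.5, "Walking down the street today"), (75.0, "I'm moving on")]))
```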

### Batch Navigation

| Control | Description |
|---------|-------------|
| **◀ Previous** | View the previous batch |
| **Batch Indicator** | Shows current batch position (e.g., "Batch 1 / 3") |
| **Next Batch Status** | Shows background generation progress |
| **Next ▶** | View the next batch (triggers generation if AutoGen is on) |

### Restore Parameters

Click **Apply These Settings to UI** to restore all generation parameters from the current batch back to the input fields. This is useful for iterating on a good result.

### Batch Results

The "Batch Results & Generation Details" accordion contains:
- **All Generated Files** - Download all files from all batches
- **Generation Details** - Detailed information about the generation process

---

## LoRA Training

The LoRA Training tab provides tools for creating custom LoRA adapters.

### Dataset Builder Tab

#### Step 1: Load or Scan

**Option A: Load Existing Dataset**
1. Enter the path to a previously saved dataset JSON
2. Click **Load**

**Option B: Scan New Directory**
1. Enter the path to your audio folder
2. Click **Scan** to find audio files (wav, mp3, flac, ogg, opus)
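
To preview which files a scan would pick up, the extension filter can be reproduced outside the UI. This is a sketch only: `scan_audio` is a hypothetical helper, and whether the real **Scan** button descends into subfolders is not stated in this guide, so recursion is left as an option.

```python
from pathlib import Path

# Extensions listed for the Scan step.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg", ".opus"}


def scan_audio(folder, recursive=False):
    """Return audio files in `folder`, sorted by path (case-insensitive extension match)."""
    root = Path(folder)
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        p for p in candidates
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )


for path in scan_audio("."):
    print(path.name)
```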

#### Step 2: Configure Dataset

| Setting | Description |
|---------|-------------|
| **Dataset Name** | Name for your dataset |
| **All Instrumental** | Check if all tracks have no vocals |
| **Custom Activation Tag** | Unique tag to activate this LoRA's style |
| **Tag Position** | Where to place the tag: Prepend, Append, or Replace caption |
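
The three **Tag Position** choices can be pictured as simple string operations on each caption. The sketch below is illustrative: the function name, the lowercase option strings, and the `", "` separator are assumptions, and the dataset builder may join the tag and caption differently.

```python
def apply_activation_tag(caption, tag, position):
    """Combine a caption with a LoRA activation tag."""
    if position == "prepend":
        return f"{tag}, {caption}" if caption else tag
    if position == "append":
        return f"{caption}, {tag}" if caption else tag
    if position == "replace":
        # The tag stands in for the whole caption.
        return tag
    raise ValueError(f"unknown tag position: {position}")


print(apply_activation_tag("melancholic indie folk", "mystyle_v1", "prepend"))
```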

#### Step 3: Auto-Label

Click **Auto-Label All** to generate metadata for all audio files:
- Caption (music description)
- BPM
- Key
- Time Signature

The **Skip Metas** option skips LLM labeling and fills the metadata fields with N/A.

#### Step 4: Preview & Edit

Use the slider to select samples and manually edit:
- Caption
- Lyrics
- BPM, Key, Time Signature
- Language
- Instrumental flag

Click **Save Changes** to update the sample.

#### Step 5: Save Dataset

Enter a save path and click **Save Dataset** to export as JSON.

#### Step 6: Preprocess

Convert the dataset to pre-computed tensors for fast training:
1. Optionally load an existing dataset JSON
2. Set the tensor output directory
3. Click **Preprocess**

This encodes audio to VAE latents, text to embeddings, and runs the condition encoder.

### Train LoRA Tab

#### Dataset Selection

Enter the path to the preprocessed tensors directory and click **Load Dataset**.

#### LoRA Settings

| Setting | Default | Description |
|---------|---------|-------------|
| **LoRA Rank (r)** | 64 | Capacity of LoRA. Higher = more capacity, more memory |
| **LoRA Alpha** | 128 | Scaling factor (typically 2x rank) |
| **LoRA Dropout** | 0.1 | Dropout rate for regularization |
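
To make rank and alpha concrete: in the standard LoRA formulation, a frozen weight matrix `W` is augmented with a low-rank update `B @ A` scaled by `alpha / r`. The plain-Python sketch below follows that general formulation; it is not ACE-Step's training code, and dropout is omitted.

```python
def lora_scale(alpha, rank):
    """Scaling applied to the low-rank update (2.0 for the defaults 128 / 64)."""
    return alpha / rank


def lora_extra_params(d_in, d_out, rank):
    """Trainable parameters added to one d_out x d_in linear layer."""
    return rank * (d_in + d_out)


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha, rank):
    """y = W x + (alpha / rank) * B (A x), with W frozen and A, B trainable.

    Shapes: W is d_out x d_in, A is rank x d_in, B is d_out x rank.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = lora_scale(alpha, rank)
    return [b + scale * u for b, u in zip(base, update)]


print(lora_scale(128, 64), lora_extra_params(1024, 1024, 64))
```

Doubling the rank doubles the number of added parameters per layer, which is why higher ranks cost more memory.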

#### Training Parameters

| Setting | Default | Description |
|---------|---------|-------------|
| **Learning Rate** | 1e-4 | Optimization learning rate |
| **Max Epochs** | 500 | Maximum training epochs |
| **Batch Size** | 1 | Training batch size |
| **Gradient Accumulation** | 1 | Effective batch = batch_size × accumulation |
| **Save Every N Epochs** | 200 | Checkpoint save frequency |
| **Shift** | 3.0 | Timestep shift for turbo model |
| **Seed** | 42 | Random seed for reproducibility |
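
The arithmetic behind these settings can be sketched as follows. This is an estimate under stated assumptions: the last partial batch in an epoch is kept, a leftover accumulation group still triggers an optimizer step, and checkpoints fall on exact multiples of **Save Every N Epochs**. The trainer may differ on each point.

```python
import math


def training_plan(num_samples, batch_size=1, grad_accum=1, max_epochs=500, save_every=200):
    """Estimate effective batch size, optimizer steps, and checkpoint epochs."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return {
        "effective_batch": batch_size * grad_accum,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * max_epochs,
        # Epochs at which a periodic checkpoint would be written.
        "save_epochs": list(range(save_every, max_epochs + 1, save_every)),
    }


print(training_plan(num_samples=10, batch_size=2, grad_accum=4))
```

With the defaults (500 epochs, save every 200), periodic checkpoints would land at epochs 200 and 400 under these assumptions.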

#### Training Controls

- **Start Training** - Begin the training process
- **Stop Training** - Interrupt training
- **Training Progress** - Shows current epoch and loss
- **Training Log** - Detailed training output
- **Training Loss Plot** - Visual loss curve

#### Export LoRA

After training, export the final adapter:
1. Enter the export path
2. Click **Export LoRA**

---

## Tips and Best Practices

### For Best Quality

1. **Use thinking mode** - Keep "Think" checkbox enabled for LM-enhanced generation
2. **Be specific in captions** - Include genre, instruments, mood, and style details
3. **Let LM detect metadata** - Leave BPM/Key/Duration empty for auto-detection
4. **Use batch generation** - Generate 2-4 variations and pick the best

### For Faster Generation

1. **Use turbo model** - Select `acestep-v15-turbo` or `acestep-v15-turbo-shift3`
2. **Keep inference steps at 8** - Default is optimal for turbo
3. **Reduce batch size** - Lower batch size if you need quick results
4. **Disable AutoGen** - Manual control over batch generation

### For Consistent Results

1. **Set a specific seed** - Uncheck "Random Seed" and enter a seed value
2. **Save good results** - Use "Save" to export parameters for reproduction
3. **Use "Apply These Settings"** - Restore parameters from a good batch

### For Long-form Music

1. **Set explicit duration** - Specify duration in seconds
2. **Use repaint task** - Fix problematic sections after initial generation
3. **Chain generations** - Use "Send To Src" to build upon previous results

### For Style Consistency

1. **Train a LoRA** - Create a custom adapter for your style
2. **Use reference audio** - Upload style reference in Audio Uploads
3. **Use consistent captions** - Maintain similar descriptive language

### Troubleshooting

**No audio generated:**
- Check that the model is initialized (green status message)
- Ensure the 5Hz LM is initialized if using thinking mode
- Check the status output for error messages

**Poor quality results:**
- Increase inference steps (for base model)
- Adjust guidance scale
- Try different seeds
- Make caption more specific

**Out of memory:**
- Reduce batch size
- Enable CPU offloading
- Reduce LM batch chunk size

**LM not working:**
- Ensure "Initialize 5Hz LM" was checked during initialization
- Check that a valid LM model path is selected
- Verify that the selected backend (`vllm` or PyTorch) is installed and available

---

## Keyboard Shortcuts

The Gradio interface supports standard web shortcuts:
- **Tab** - Move between input fields
- **Enter** - Submit text inputs
- **Space** - Toggle checkboxes

---

## Language Support

The interface supports multiple UI languages:
- **English** (en)
- **Chinese** (zh)
- **Japanese** (ja)

Select your preferred language in the Service Configuration section.

---

For more information, see:
- Main README: [`../../README.md`](../../README.md)
- REST API Documentation: [`API.md`](API.md)
- Python Inference API: [`INFERENCE.md`](INFERENCE.md)