# ACE-Step Gradio Demo User Guide

**Language / 语言 / 言語:** [English](GRADIO_GUIDE.md) | [中文](../zh/GRADIO_GUIDE.md) | [日本語](../ja/GRADIO_GUIDE.md)

---

This guide provides comprehensive documentation for using the ACE-Step Gradio web interface for music generation, including all features and settings.

## Table of Contents

- [Getting Started](#getting-started)
- [Service Configuration](#service-configuration)
- [Generation Modes](#generation-modes)
- [Task Types](#task-types)
- [Input Parameters](#input-parameters)
- [Advanced Settings](#advanced-settings)
- [Results Section](#results-section)
- [LoRA Training](#lora-training)
- [Tips and Best Practices](#tips-and-best-practices)

---

## Getting Started

### Launching the Demo

```bash
# Basic launch
python app.py

# With pre-initialization
python app.py --config acestep-v15-turbo --init-llm

# With specific port
python app.py --port 7860
```

### Interface Overview

The Gradio interface consists of several main sections:

1. **Service Configuration** - Model loading and initialization
2. **Required Inputs** - Task type, audio uploads, and generation mode
3. **Music Caption & Lyrics** - Text inputs for generation
4. **Optional Parameters** - Metadata like BPM, key, duration
5. **Advanced Settings** - Fine-grained control over generation
6. **Results** - Generated audio playback and management

---

## Service Configuration

### Model Selection

| Setting | Description |
|---------|-------------|
| **Checkpoint File** | Select a trained model checkpoint (if available) |
| **Main Model Path** | Choose the DiT model configuration (e.g., `acestep-v15-turbo`, `acestep-v15-turbo-shift3`) |
| **Device** | Processing device: `auto` (recommended), `cuda`, or `cpu` |

### 5Hz LM Configuration

| Setting | Description |
|---------|-------------|
| **5Hz LM Model Path** | Select the language model (e.g., `acestep-5Hz-lm-0.6B`, `acestep-5Hz-lm-1.7B`) |
| **5Hz LM Backend** | `vllm` (faster, recommended) or `pt` (PyTorch, more compatible) |
| **Initialize 5Hz LM** | Check to load the LM during initialization (required for thinking mode) |

### Performance Options

| Setting | Description |
|---------|-------------|
| **Use Flash Attention** | Enable for faster inference (requires flash_attn package) |
| **Offload to CPU** | Offload models to CPU when idle to save GPU memory |
| **Offload DiT to CPU** | Specifically offload the DiT model to CPU |

### LoRA Adapter

| Setting | Description |
|---------|-------------|
| **LoRA Path** | Path to trained LoRA adapter directory |
| **Load LoRA** | Load the specified LoRA adapter |
| **Unload** | Remove the currently loaded LoRA |
| **Use LoRA** | Enable/disable the loaded LoRA for inference |

### Initialization

Click **Initialize Service** to load the models. The status box will show progress and confirmation.

---

## Generation Modes

### Simple Mode

Simple mode is designed for quick, natural language-based music generation.

**How to use:**

1. Select "Simple" in the Generation Mode radio button
2. Enter a natural language description in the "Song Description" field
3. Optionally check "Instrumental" if you don't want vocals
4. Optionally select a preferred vocal language
5. Click **Create Sample** to generate caption, lyrics, and metadata
6. Review the generated content in the expanded sections
7. Click **Generate Music** to create the audio

**Example descriptions:**

- "a soft Bengali love song for a quiet evening"
- "upbeat electronic dance music with heavy bass drops"
- "melancholic indie folk with acoustic guitar"
- "jazz trio playing in a smoky bar"

**Random Sample:** Click the 🎲 button to load a random example description.

### Custom Mode

Custom mode provides full control over all generation parameters.

**How to use:**

1. Select "Custom" in the Generation Mode radio button
2. Manually fill in the Caption and Lyrics fields
3. Set optional metadata (BPM, Key, Duration, etc.)
4. Optionally click **Format** to enhance your input using the LM
5. Configure advanced settings as needed
6. Click **Generate Music** to create the audio

---

## Task Types

### text2music (Default)

Generate music from text descriptions and/or lyrics.

**Use case:** Creating new music from scratch based on prompts.

**Required inputs:** Caption or Lyrics (at least one)

### cover

Transform existing audio while maintaining structure but changing style.

**Use case:** Creating cover versions in different styles.

**Required inputs:**

- Source Audio (upload in Audio Uploads section)
- Caption describing the target style

**Key parameter:** `Audio Cover Strength` (0.0-1.0)

- Higher values maintain more of the original structure
- Lower values allow more creative freedom

### repaint

Regenerate a specific time segment of audio.

**Use case:** Fixing or modifying specific sections of generated music.

**Required inputs:**

- Source Audio
- Repainting Start (seconds)
- Repainting End (seconds, -1 for end of file)
- Caption describing the desired content

### lego (Base Model Only)

Generate a specific instrument track in context of existing audio.

**Use case:** Adding instrument layers to backing tracks.


**Required inputs:**

- Source Audio
- Track Name (select from dropdown)
- Caption describing the track characteristics

**Available tracks:** vocals, backing_vocals, drums, bass, guitar, keyboard, percussion, strings, synth, fx, brass, woodwinds

### extract (Base Model Only)

Extract/isolate a specific instrument track from mixed audio.

**Use case:** Stem separation, isolating instruments.

**Required inputs:**

- Source Audio
- Track Name to extract

### complete (Base Model Only)

Complete partial tracks with specified instruments.

**Use case:** Auto-arranging incomplete compositions.

**Required inputs:**

- Source Audio
- Track Names (multiple selection)
- Caption describing the desired style

---

## Input Parameters

### Required Inputs

#### Task Type

Select the generation task from the dropdown. The instruction field updates automatically based on the selected task.

#### Audio Uploads

| Field | Description |
|-------|-------------|
| **Reference Audio** | Optional audio for style reference |
| **Source Audio** | Required for cover, repaint, lego, extract, complete tasks |
| **Convert to Codes** | Extract 5Hz semantic codes from source audio |

#### LM Codes Hints

Pre-computed audio semantic codes can be pasted here to guide generation. Use the **Transcribe** button to analyze codes and extract metadata.

### Music Caption

The text description of the desired music. Be specific about:

- Genre and style
- Instruments
- Mood and atmosphere
- Tempo feel (if not specifying BPM)

**Example:** "upbeat pop rock with electric guitars, driving drums, and catchy synth hooks"

Click 🎲 to load a random example caption.

### Lyrics

Enter lyrics with structure tags:

```
[Verse 1]
Walking down the street today
Thinking of the words you used to say

[Chorus]
I'm moving on, I'm staying strong
This is where I belong

[Verse 2]
...
```

**Instrumental checkbox:** Check this to generate instrumental music regardless of lyrics content.

**Vocal Language:** Select the language for vocals.
Use "unknown" for auto-detection or instrumental tracks.

**Format button:** Click to enhance caption and lyrics using the 5Hz LM.

### Optional Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| **BPM** | Auto | Tempo in beats per minute (30-300) |
| **Key Scale** | Auto | Musical key (e.g., "C Major", "Am", "F# minor") |
| **Time Signature** | Auto | Time signature: 2 (2/4), 3 (3/4), 4 (4/4), 6 (6/8) |
| **Audio Duration** | Auto/-1 | Target length in seconds (10-600). -1 for automatic |
| **Batch Size** | 2 | Number of audio variations to generate (1-8) |

---

## Advanced Settings

### DiT Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| **Inference Steps** | 8 | Denoising steps. Turbo: 1-20, Base: 1-200 |
| **Guidance Scale** | 7.0 | CFG strength (base model only). Higher = follows prompt more |
| **Seed** | -1 | Random seed. Use comma-separated values for batches |
| **Random Seed** | ✓ | When checked, generates random seeds |
| **Audio Format** | mp3 | Output format: mp3, flac |
| **Shift** | 3.0 | Timestep shift factor (1.0-5.0). Recommended 3.0 for turbo |
| **Inference Method** | ode | ode (Euler, faster) or sde (stochastic) |
| **Custom Timesteps** | - | Override timesteps (e.g., "0.97,0.76,0.615,0.5,0.395,0.28,0.18,0.085,0") |

### Base Model Only Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| **Use ADG** | ✗ | Enable Adaptive Dual Guidance for better quality |
| **CFG Interval Start** | 0.0 | When to start applying CFG (0.0-1.0) |
| **CFG Interval End** | 1.0 | When to stop applying CFG (0.0-1.0) |

### LM Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| **LM Temperature** | 0.85 | Sampling temperature (0.0-2.0). Higher = more creative |
| **LM CFG Scale** | 2.0 | LM guidance strength (1.0-3.0) |
| **LM Top-K** | 0 | Top-K sampling. 0 disables |
| **LM Top-P** | 0.9 | Nucleus sampling (0.0-1.0) |
| **LM Negative Prompt** | "NO USER INPUT" | Negative prompt for CFG |

### CoT (Chain-of-Thought) Options

| Option | Default | Description |
|--------|---------|-------------|
| **CoT Metas** | ✓ | Generate metadata via LM reasoning |
| **CoT Language** | ✓ | Detect vocal language via LM |
| **Constrained Decoding Debug** | ✗ | Enable debug logging |

### Generation Options

| Option | Default | Description |
|--------|---------|-------------|
| **LM Codes Strength** | 1.0 | How strongly LM codes influence generation (0.0-1.0) |
| **Auto Score** | ✗ | Automatically calculate quality scores |
| **Auto LRC** | ✗ | Automatically generate lyrics timestamps |
| **LM Batch Chunk Size** | 8 | Max items per LM batch (GPU memory) |

### Main Generation Controls

| Control | Description |
|---------|-------------|
| **Think** | Enable 5Hz LM for code generation and metadata |
| **ParallelThinking** | Enable parallel LM batch processing |
| **CaptionRewrite** | Let LM enhance the input caption |
| **AutoGen** | Automatically start next batch after completion |

---

## Results Section

### Generated Audio

Up to 8 audio samples are displayed based on batch size.
Each sample includes:

- **Audio Player** - Play, pause, and download the generated audio
- **Send To Src** - Send this audio to the Source Audio input for further processing
- **Save** - Save audio and metadata to a JSON file
- **Score** - Calculate perplexity-based quality score
- **LRC** - Generate lyrics timestamps (LRC format)

### Details Accordion

Click "Score & LRC & LM Codes" to expand and view:

- **LM Codes** - The 5Hz semantic codes for this sample
- **Quality Score** - Perplexity-based quality metric
- **Lyrics Timestamps** - LRC format timing data

### Batch Navigation

| Control | Description |
|---------|-------------|
| **◀ Previous** | View the previous batch |
| **Batch Indicator** | Shows current batch position (e.g., "Batch 1 / 3") |
| **Next Batch Status** | Shows background generation progress |
| **Next ▶** | View the next batch (triggers generation if AutoGen is on) |

### Restore Parameters

Click **Apply These Settings to UI** to restore all generation parameters from the current batch back to the input fields. Useful for iterating on a good result.

### Batch Results

The "Batch Results & Generation Details" accordion contains:

- **All Generated Files** - Download all files from all batches
- **Generation Details** - Detailed information about the generation process

---

## LoRA Training

The LoRA Training tab provides tools for creating custom LoRA adapters.

### Dataset Builder Tab

#### Step 1: Load or Scan

**Option A: Load Existing Dataset**

1. Enter the path to a previously saved dataset JSON
2. Click **Load**

**Option B: Scan New Directory**

1. Enter the path to your audio folder
2. Click **Scan** to find audio files (wav, mp3, flac, ogg, opus)

#### Step 2: Configure Dataset

| Setting | Description |
|---------|-------------|
| **Dataset Name** | Name for your dataset |
| **All Instrumental** | Check if all tracks have no vocals |
| **Custom Activation Tag** | Unique tag to activate this LoRA's style |
| **Tag Position** | Where to place the tag: Prepend, Append, or Replace caption |

#### Step 3: Auto-Label

Click **Auto-Label All** to generate metadata for all audio files:

- Caption (music description)
- BPM
- Key
- Time Signature

**Skip Metas** option will skip LLM labeling and use N/A values.

#### Step 4: Preview & Edit

Use the slider to select samples and manually edit:

- Caption
- Lyrics
- BPM, Key, Time Signature
- Language
- Instrumental flag

Click **Save Changes** to update the sample.

#### Step 5: Save Dataset

Enter a save path and click **Save Dataset** to export as JSON.

#### Step 6: Preprocess

Convert the dataset to pre-computed tensors for fast training:

1. Optionally load an existing dataset JSON
2. Set the tensor output directory
3. Click **Preprocess**

This encodes audio to VAE latents, text to embeddings, and runs the condition encoder.

### Train LoRA Tab

#### Dataset Selection

Enter the path to preprocessed tensors directory and click **Load Dataset**.

#### LoRA Settings

| Setting | Default | Description |
|---------|---------|-------------|
| **LoRA Rank (r)** | 64 | Capacity of LoRA. Higher = more capacity, more memory |
| **LoRA Alpha** | 128 | Scaling factor (typically 2x rank) |
| **LoRA Dropout** | 0.1 | Dropout rate for regularization |

#### Training Parameters

| Setting | Default | Description |
|---------|---------|-------------|
| **Learning Rate** | 1e-4 | Optimization learning rate |
| **Max Epochs** | 500 | Maximum training epochs |
| **Batch Size** | 1 | Training batch size |
| **Gradient Accumulation** | 1 | Effective batch = batch_size × accumulation |
| **Save Every N Epochs** | 200 | Checkpoint save frequency |
| **Shift** | 3.0 | Timestep shift for turbo model |
| **Seed** | 42 | Random seed for reproducibility |

#### Training Controls

- **Start Training** - Begin the training process
- **Stop Training** - Interrupt training
- **Training Progress** - Shows current epoch and loss
- **Training Log** - Detailed training output
- **Training Loss Plot** - Visual loss curve

#### Export LoRA

After training, export the final adapter:

1. Enter the export path
2. Click **Export LoRA**

---

## Tips and Best Practices

### For Best Quality

1. **Use thinking mode** - Keep "Think" checkbox enabled for LM-enhanced generation
2. **Be specific in captions** - Include genre, instruments, mood, and style details
3. **Let LM detect metadata** - Leave BPM/Key/Duration empty for auto-detection
4. **Use batch generation** - Generate 2-4 variations and pick the best

### For Faster Generation

1. **Use turbo model** - Select `acestep-v15-turbo` or `acestep-v15-turbo-shift3`
2. **Keep inference steps at 8** - Default is optimal for turbo
3. **Reduce batch size** - Lower batch size if you need quick results
4. **Disable AutoGen** - Manual control over batch generation

### For Consistent Results

1. **Set a specific seed** - Uncheck "Random Seed" and enter a seed value
2. **Save good results** - Use "Save" to export parameters for reproduction
3. **Use "Apply These Settings"** - Restore parameters from a good batch

### For Long-form Music

1. **Set explicit duration** - Specify duration in seconds
2. **Use repaint task** - Fix problematic sections after initial generation
3. **Chain generations** - Use "Send To Src" to build upon previous results

### For Style Consistency

1. **Train a LoRA** - Create a custom adapter for your style
2. **Use reference audio** - Upload style reference in Audio Uploads
3. **Use consistent captions** - Maintain similar descriptive language

### Troubleshooting

**No audio generated:**

- Check that the model is initialized (green status message)
- Ensure 5Hz LM is initialized if using thinking mode
- Check the status output for error messages

**Poor quality results:**

- Increase inference steps (for base model)
- Adjust guidance scale
- Try different seeds
- Make caption more specific

**Out of memory:**

- Reduce batch size
- Enable CPU offloading
- Reduce LM batch chunk size

**LM not working:**

- Ensure "Initialize 5Hz LM" was checked during initialization
- Check that a valid LM model path is selected
- Verify vllm or PyTorch backend is available

---

## Keyboard Shortcuts

The Gradio interface supports standard web shortcuts:

- **Tab** - Move between input fields
- **Enter** - Submit text inputs
- **Space** - Toggle checkboxes

---

## Language Support

The interface supports multiple UI languages:

- **English** (en)
- **Chinese** (zh)
- **Japanese** (ja)

Select your preferred language in the Service Configuration section.

---

For more information, see:

- Main README: [`../../README.md`](../../README.md)
- REST API Documentation: [`API.md`](API.md)
- Python Inference API: [`INFERENCE.md`](INFERENCE.md)
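
As a worked illustration of the Train LoRA Tab defaults described in this guide, the arithmetic behind "Effective batch = batch_size × accumulation" and "LoRA Alpha (typically 2x rank)" can be sketched in a few lines of Python. This is only a sketch: `LoraTrainingPlan` and its field names are hypothetical and are not part of ACE-Step. The checkpoint schedule assumes a save happens whenever the epoch number is a multiple of **Save Every N Epochs**, which this guide does not confirm.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LoraTrainingPlan:
    """Hypothetical helper mirroring the default values listed in the tables."""

    rank: int = 64               # LoRA Rank (r)
    alpha: int = 128             # LoRA Alpha
    batch_size: int = 1          # Batch Size
    grad_accumulation: int = 1   # Gradient Accumulation
    max_epochs: int = 500        # Max Epochs
    save_every: int = 200        # Save Every N Epochs

    @property
    def effective_batch(self) -> int:
        # Effective batch = batch_size x accumulation, as stated in the guide.
        return self.batch_size * self.grad_accumulation

    @property
    def alpha_to_rank(self) -> float:
        # The guide suggests alpha is typically 2x rank, i.e. a ratio of 2.0.
        return self.alpha / self.rank

    def periodic_checkpoint_epochs(self) -> List[int]:
        # Assumption: a checkpoint is written at every multiple of save_every
        # that does not exceed max_epochs.
        return list(range(self.save_every, self.max_epochs + 1, self.save_every))


if __name__ == "__main__":
    plan = LoraTrainingPlan()
    print(plan.effective_batch)               # 1
    print(plan.alpha_to_rank)                 # 2.0
    print(plan.periodic_checkpoint_epochs())  # [200, 400]
```

Under that assumption, the defaults give periodic saves at epochs 200 and 400 only, so the **Export LoRA** step is what writes the final adapter after epoch 500. Raising **Gradient Accumulation** rather than **Batch Size** increases the effective batch without holding more samples in GPU memory at once.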