Spaces:
Running
A newer version of the Gradio SDK is available: 6.13.0
ACE-Step 1.5 β A Musician's Guide
What Is This Thing?
ACE-Step is a music-making AI that runs on your own computer. You describe the music you want β the style, the instruments, the mood, the lyrics β and it generates a full song in seconds. Not a loop, not a beat β a complete song with vocals, instruments, and structure.
Unlike cloud services like Suno or Udio, ACE-Step runs locally. You own the software, you own the output, and you can use it offline with no subscription, no rate limits, and no terms of service restrictions.
It's open-source and free.
How Does It Actually Work?
ACE-Step has two "brains" that work together, like a songwriter and a studio engineer:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β YOU (the musician) β
β β
β "I want an upbeat pop song with electric guitars, β
β catchy chorus, female vocals, 120 BPM" β
ββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββ
β
Your description
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β BRAIN 1: The Songwriter (LM) β
β β
β Reads your description and thinks about it. β
β Fills in the gaps you didn't specify: β
β - What key fits this mood? β G Major β
β - What tempo feels right? β 122 BPM β
β - How should the song be structured? β
β - Where should energy peak? β
β β
β Creates a detailed blueprint of the song. β
β β
β (Optional β you can skip this brain for speed, β
β or if you already know exactly what you want.) β
ββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββ
β
Blueprint
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β BRAIN 2: The Studio Engineer (DiT) β
β β
β Takes the blueprint and builds the actual audio. β
β Starts with pure noise (like static on a TV) β
β and gradually shapes it into music β step by step. β
β β
β Each step removes a layer of noise and adds β
β detail: instruments come into focus, vocals β
β emerge, drums tighten up, mix clears. β
β β
β After 8 steps (fast mode) or 50 steps (quality β
β mode), you have a finished song. β
ββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββ
β
Finished audio
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β YOUR SONG βͺ β« β
β (WAV or MP3, ready to play) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
The key idea: Brain 1 (the Songwriter) is optional. You can give Brain 2 (the Studio Engineer) your own blueprint directly if you prefer full control. Or you can let Brain 1 handle the planning for you. It's your choice every time.
What Can It Do?
ACE-Step has six creative modes. Think of them like different tools in a studio:
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β YOUR CREATIVE TOOLKIT β
β β
β π΅ Text to Music Describe it β Get a song β
β π¨ Cover Restyle an existing song β
β ποΈ Repaint Fix one section of a song β
β π§± Lego Add layers to a backing track β
β π¬ Extract Pull out individual instruments β
β πΉ Complete Add accompaniment to vocals β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Text to Music β Start From Scratch
The simplest mode. Type a description, get a song.
You write: "melancholic indie folk with acoustic guitar and breathy female vocals" You get: A complete song matching that description.
Cover β Transform a Song's Style
Feed it an existing song and tell it what style you want instead. It keeps the structure (melody shape, rhythm, song form) but changes everything else.
You provide: A country ballad You write: "heavy metal rock with distorted guitars and screaming vocals" You get: The same song reimagined as heavy metal
Repaint β Fix Just One Part
Generated a song you love, except the intro is weak? Repaint lets you regenerate just that section while keeping the rest untouched.
You provide: A song where seconds 0-10 need work You write: "dramatic orchestral build-up" You get: Same song, but with a new intro
Lego β Stack Instrument Layers
Have a drum loop? Add bass. Have a guitar track? Add strings on top. Lego lets you build up a song one layer at a time.
Extract β Pull Apart a Mix
The opposite of Lego. Give it a full mix and ask it to isolate just the vocals, or just the drums, or just the guitar.
Complete β Add Accompaniment
Have a vocal recording with nothing else? Complete generates the backing instruments to match.
What Do I Need to Run It?
The Short Answer
A computer with a decent graphics card (GPU). The better the GPU, the faster and longer your songs can be.
Hardware Guide
YOUR GPU MEMORY WHAT YOU CAN DO
βββββββββββββββββββββββββββββββββββββββββββββββββββββ
4 GB (entry level) Songs up to 6 minutes
ββββββββββββββββββββ 1 song at a time
Basic mode only (no Songwriter brain)
6-8 GB (budget) Songs up to 10 minutes
ββββββββββββββββββββ 1-2 songs at a time
Optional lightweight Songwriter brain (0.6B)
8-12 GB (mainstream) Songs up to 10 minutes
ββββββββββββββββββββ 2-4 songs at a time
Songwriter brain available (0.6B)
12-16 GB (sweet spot) Songs up to 10 minutes
ββββββββββββββββββββ 2-4 songs at a time
Full Songwriter brain (1.7B)
16-20 GB (enthusiast) Songs up to 10 minutes
ββββββββββββββββββββ 1-4 songs at a time
Larger Songwriter brain (1.7B)
20-24 GB (high end) Songs up to 8 minutes
ββββββββββββββββββββ 2-8 songs at a time
All Songwriter brains (0.6B/1.7B/4B), no offload needed
24 GB+ (pro) Songs up to 10 minutes
ββββββββββββββββββββ Up to 8 songs at a time
All features unlocked, best quality (4B)
Common GPUs and where they land:
| GPU | Memory | Tier |
|---|---|---|
| GTX 1050 Ti | 4 GB | Entry (Tier 1) |
| GTX 1660 / RTX 2060 | 6 GB | Budget (Tier 2) |
| RTX 3060 / 4060 | 8 GB | Mainstream (Tier 4) |
| RTX 3070 / 4070 | 8-12 GB | Mainstream-Sweet spot (Tier 4-5) |
| RTX 3080 16GB / 4060 Ti 16GB | 16 GB | Enthusiast (Tier 6a) |
| RTX 3090 / 4090 | 24 GB | High end / Pro (Tier 6b-Unlimited) |
| Apple M1/M2/M3 (Mac) | Shared memory | Supported, varies |
Disk space: About 100 GB free. The AI models are large files (around 60 GB total) that download automatically the first time you run the software.
Operating system: Windows, Mac, or Linux all work.
Getting It Running
On Windows (Easiest Path)
- Download the portable package from the ACE-Step website (a single .7z file)
- Extract it (right-click β Extract with 7-Zip or WinRAR)
- Double-click start_gradio_ui.bat inside the extracted folder
- A browser window opens β that's your studio
- On first launch, models download automatically (30 min to 2 hours depending on internet speed)
That's it. No programming required.
On Mac or Linux
You'll need to type a few commands in the terminal, but it's straightforward:
Step 1: Install the "uv" package manager (a one-time setup)
Step 2: Download ACE-Step from GitHub
Step 3: Run "uv sync" to install everything
Step 4: Run "uv run acestep" to launch
Step 5: Open your browser to http://localhost:7860
The project's README on GitHub walks through each step with copy-paste commands.
The Interface: What You'll See
When ACE-Step opens in your browser, you get a web interface with three main areas:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β ACE-Step 1.5 β
βββββββββββββββ¬ββββββββββββββββββββ¬ββββββββββββββββββββββββββββ€
β Generate β LoRA Training β Dataset Explorer β
βββββββββββββββ΄ββββββββββββββββββββ΄ββββββββββββββββββββββββββββ€
β β
β The Generate tab is where you'll spend 95% of your time. β
β β
β LoRA Training is for teaching the AI your personal style. β
β β
β Dataset Explorer is for browsing example prompts. β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
The Generate Tab
This is your main workspace. It has two modes:
Simple Mode β For when you want quick results
- Type a natural description like "a soft love song for a quiet evening"
- Click "Create Sample" and the AI fills in all the details
- Click "Generate Music" β done
Custom Mode β For when you want precise control
- You write the exact description (caption)
- You write the lyrics with structure tags
- You set the tempo, key, and duration
- You adjust advanced settings if you want
Most people start with Simple Mode, then move to Custom Mode once they understand how the system responds.
Writing Prompts: Talking to the AI
The most important skill with ACE-Step is learning how to describe what you want. You communicate through two main inputs:
The Caption β Your Overall Vision
The caption is a short paragraph describing the entire song's vibe. Think of it as the answer to: "If you walked into a studio with session musicians, how would you describe what you want?"
Vague (the AI will guess a lot):
"a sad song"
Better (gives the AI real direction):
"melancholic piano ballad with soft female vocals, gentle string accompaniment, slow tempo, intimate and heartbreaking atmosphere"
Tips for good captions:
- Name the genre: pop, rock, jazz, electronic, folk, hip-hop, lo-fi, synthwave
- Name the instruments: acoustic guitar, piano, synth pads, 808 drums, strings
- Name the mood: melancholic, uplifting, energetic, dreamy, aggressive, intimate
- Name the production style: lo-fi, polished, live recording, bedroom pop, orchestral
The Lyrics β Your Song's Script
Lyrics do double duty in ACE-Step. They're not just words β they tell the AI how the song should be structured over time.
You use tags in square brackets to mark sections:
[Intro]
[Verse 1]
Walking through the empty streets
Thinking of your gentle touch
Summer nights and softer dreams
[Chorus]
We rise together
Into the light
This is our moment tonight
[Verse 2]
Stars are falling from the sky
Your hand fits perfectly in mine
[Bridge]
If tomorrow never comes
At least we had this
[Chorus]
We rise together
Into the light
This is our moment tonight
[Outro]
What the tags do:
[Intro] β Sets up atmosphere, usually instrumental
[Verse] β Main storytelling section, moderate energy
[Pre-Chorus] β Builds tension before the chorus
[Chorus] β Emotional peak, highest energy
[Bridge] β A shift β different melody, different feel
[Instrumental] β No vocals, just instruments
[Outro] β Winds down, often fades
Lyric tips:
- Keep lines around 6-10 syllables so the AI can fit them naturally
- Use UPPERCASE for words you want emphasized or shouted
- Use (parentheses) for background vocals or echoes
- Add descriptors to tags for extra guidance:
[Chorus - anthemic]or[Verse - whispered]
Optional: Metadata
You can also set specific musical parameters:
| Setting | What It Means | Typical Values |
|---|---|---|
| BPM | Speed of the song | 60-80 (slow), 90-120 (medium), 130-180 (fast) |
| Key | Musical key | C Major (bright), A minor (melancholic), etc. |
| Duration | Song length in seconds | 60 (1 min), 180 (3 min), 300 (5 min) |
| Language | Vocal language | English, Spanish, Japanese, Chinese, 50+ others |
If you don't set these, the AI will choose sensible defaults based on your caption and lyrics.
Working With Reference Audio
One of ACE-Step's most powerful features is using existing audio to guide generation. There are three ways to do this:
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β THREE WAYS TO USE AUDIO INPUT β
β β
β 1. REFERENCE AUDIO (style guide) β
β ββββββββββββ β
β β jazz.mp3 ββββ "Make something that SOUNDS β
β ββββββββββββ like this β same warmth, same β
β texture, same vibe" β
β β
β 2. SOURCE AUDIO + COVER (restyle a song) β
β ββββββββββββ β
β β song.mp3 ββββ "Keep the STRUCTURE of this song β
β ββββββββββββ but change the style completely" β
β β
β 3. SOURCE AUDIO + REPAINT (fix a section) β
β ββββββββββββ β
β β song.mp3 ββββ "Keep the whole song EXCEPT β
β ββββββββββββ regenerate seconds 10-20" β
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Cover Mode: The Style Transformer
This is the mode for turning one genre into another. The key control is Audio Cover Strength β a slider from 0 to 100%:
Audio Cover Strength
0% 50% 100%
ββββββββββββββββββββββββΌβββββββββββββββββββββββ€
β β β
Ignores the Balanced Follows the
original audio. blend. original closely.
Pure text-based Recognizable Same structure,
generation. but transformed. subtle changes only.
For dramatic genre changes (country β metal): use 30-50%
For moderate changes (pop β jazz): use 50-70%
For subtle changes (rock β indie rock): use 70-90%
Example: Country to Heavy Metal
- Upload your country song as source audio
- Select the "Cover" task
- Set Audio Cover Strength to about 40%
- Write a caption like: "heavy metal rock with heavily distorted electric guitars, aggressive double bass drumming, powerful screaming vocals, fast tempo, high energy, intense dark atmosphere"
- Generate a few variations (batch size 2-4)
- Pick your favorite
The Batch Generation Workflow
A critical concept: you should almost never generate just one version. AI music generation involves randomness. Think of it like rolling dice β sometimes you get exactly what you wanted, sometimes you don't. The solution is to roll several times and pick the best.
THE RECOMMENDED WORKFLOW
ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ
β Write your ββββββΆβ Generate a ββββββΆβ Listen to β
β description β β batch of 4 β β all four β
ββββββββββββββββ ββββββββββββββββ βββββββ¬βββββββββ
β
βββββββββββββββββββββΌβββββββββββ
β β β
βΌ βΌ βΌ
ββββββββββββ ββββββββββββ ββββββββββββ
β Love it? β β Close β β Not β
β Export! β β but not β β right? β
ββββββββββββ β quite? β β Tweak β
β Repaint β β prompt β
β the weak β β & retry β
β section β ββββββββββββ
ββββββββββββ
AutoGen: There's also an "auto-generate" feature that starts preparing the next batch while you're listening to the current one. This keeps your creative flow uninterrupted.
Training Your Own Style (LoRA)
LoRA is a way to teach ACE-Step your personal sound. If you have a collection of songs that represent a style you want the AI to learn β your own recordings, a specific genre, a particular mood β you can train a custom "style adapter."
What Is a LoRA?
Think of it as a small plug-in that sits on top of the base AI model:
ββββββββββββββββββββββββββββββββββββββββ
β BASE AI MODEL β
β (knows how to make all kinds β
β of music in general) β
β β
β ββββββββββββββββββββββββββββ β
β β YOUR LoRA ADAPTER β β
β β (teaches it YOUR style) β β
β β β β
β β Trained on 8-20 of β β
β β your reference songs β β
β ββββββββββββββββββββββββββββ β
β β
ββββββββββββββββββββββββββββββββββββββββ
Without LoRA: generic but versatile
With LoRA: sounds more like YOUR music
How to Train One
- Collect 8-20 songs that represent the style you want
- Go to the LoRA Training tab in the interface
- Point it at the folder with your audio files
- Click "Scan" β it analyzes each file automatically
- Review and edit the auto-generated labels if needed
- Click "Start Training" β takes about 1 hour on a good GPU
- When finished, you have a LoRA adapter file you can load any time
Using Your LoRA
- In the Generate tab, find the "LoRA Adapter" section
- Enter the path to your trained LoRA
- Click "Load LoRA"
- Adjust the LoRA Scale slider:
LoRA Scale
0% 50% 100%
ββββββββββββββββββββββββΌβββββββββββββββββββββββ€
β β β
No LoRA effect. Half strength. Full LoRA effect.
Pure base model. Blended style. Maximum influence
from your training.
- Generate music as usual β the output will now be influenced by your trained style
Current Limitation: One LoRA at a Time
Today, you can only use one LoRA adapter at a time. Loading a new one replaces the previous one. You cannot stack multiple styles simultaneously (for example, "jazz LoRA at 60% + vocal LoRA at 40%"). This is a known limitation that may be addressed in a future update.
The Speed Question
How long does generation take? It depends on your hardware and settings:
GENERATION SPEED (approximate)
GPU Tier 30-sec song 2-min song 5-min song
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Entry (4 GB) 10-15 sec 20-30 sec N/A
Mainstream (8 GB) 5-10 sec 10-18 sec 15-25 sec
Sweet spot (12 GB) 3-8 sec 8-12 sec 10-15 sec
High end (24 GB) 1-3 sec 3-7 sec 5-10 sec
Fast Mode vs. Quality Mode:
- Turbo (default): 8 processing steps, very fast, good quality
- SFT/Base: 50 processing steps, slower, more detail and nuance
Most people use Turbo for day-to-day work and switch to SFT/Base for final versions.
Languages
ACE-Step can generate vocals in 50+ languages, including:
English, Spanish, French, German, Italian, Portuguese, Chinese (Mandarin & Cantonese), Japanese, Korean, Hindi, Bengali, Arabic, Turkish, Thai, Vietnamese, Swedish, Dutch, Polish, Hebrew, and many more.
To use a different language:
- Select the vocal language in the interface
- Write your lyrics in that language
- The AI generates vocals with appropriate pronunciation and style
You can even mix languages within a single song.
Tips From Experience
Start Simple, Then Refine
Don't try to control everything on your first attempt. Start with a short caption and see what the AI gives you. Then add detail in areas where the output surprised you.
Generate in Batches
Always generate 2-4 versions at once. Picking the best from several options is faster and more satisfying than trying to get one perfect result.
Fix, Don't Redo
If 90% of a song is great but one section is off, use Repaint to regenerate just that section. Don't throw away the whole thing.
Be Specific About Instruments
"rock song" gives the AI too much freedom. "rock song with crunchy rhythm guitar, punchy snare, and gravelly male vocals" tells it exactly what you're hearing in your head.
Use Structure Tags in Lyrics
Even if you don't care about the actual words yet, writing [Intro] [Verse] [Chorus] [Verse] [Chorus] [Bridge] [Chorus] [Outro] gives the AI a roadmap for energy and dynamics.
Try Different Seeds
Every generation uses a random "seed" number. If you like your settings but want to hear different interpretations, just click generate again β each run uses a new seed automatically. You can also set a specific seed number to reproduce a result you liked.
The Songwriter Brain Is Optional
If you already know exactly what you want (tempo, key, structure, instruments), you can turn off "Thinking Mode" to skip Brain 1 entirely. This makes generation faster and gives you more direct control.
What ACE-Step Is Not
It's worth being clear about what this tool isn't:
- It's not a DAW. It doesn't replace Ableton, Logic, or FL Studio. It generates raw audio that you can import into your DAW for further editing.
- It's not perfect every time. Expect to generate several versions and pick the best. Think of it as a creative collaborator, not a jukebox.
- It's not a cloud service. It runs on your machine. If your GPU is small, results will be limited. There's no server doing the work for you.
- It's not one-click magic. The best results come from learning how to describe what you want. That's a skill that improves with practice.
What it is: a powerful, free, open instrument that puts AI music generation in your hands β literally on your own hardware β with full creative control and ownership.
Quick Reference Card
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β QUICK REFERENCE β
β β
β GENERATE A SONG β
β Caption: Describe style, instruments, mood β
β Lyrics: [Verse] [Chorus] [Bridge] with words β
β Click: Generate Music β
β β
β RESTYLE A SONG (Cover) β
β Upload: Source audio β
β Task: Cover β
β Caption: Describe the NEW style β
β Strength: 30-50% for big changes, 70-90% for subtle β
β β
β FIX A SECTION (Repaint) β
β Upload: Source audio β
β Task: Repaint β
β Time: Set start and end (in seconds) β
β Caption: Describe what the fixed section should be β
β β
β APPLY CUSTOM STYLE (LoRA) β
β Load: Your trained LoRA adapter file β
β Scale: 0-100% (how much style influence) β
β Then: Generate as usual β
β β
β KEYBOARD SHORTCUTS β
β Batch size 2-4 recommended for every generation β
β Use Turbo mode for speed, SFT/Base for quality β
β Turn off Thinking Mode if you know exactly what β
β you want β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ