---
library_name: transformers
license: mit
pipeline_tag: text-to-audio
tags:
- audio
- music
- text2music
- demodokos
---
<div style="background:#f3efe4;border:1px solid #161614;border-radius:18px;margin:0 0 28px 0;font-family:Inter,Segoe UI,Arial,sans-serif;color:#161614;overflow:hidden">
<div style="background:#161614;color:#f3efe4;padding:14px 22px;display:flex;flex-wrap:wrap;gap:10px;align-items:center;justify-content:space-between">
<div style="display:flex;align-items:center;gap:10px">
<div style="width:12px;height:12px;background:#9cff3b;border-radius:50%;box-shadow:0 0 0 4px rgba(156,255,59,.16)"></div>
<div style="font-size:13px;font-weight:900;letter-spacing:1.4px;text-transform:uppercase">Demodokos Foundry</div>
</div>
<div style="display:flex;flex-wrap:wrap;gap:8px">
<a href="https://demodokos.com/#listen" style="color:#f3efe4;text-decoration:none;border:1px solid rgba(243,239,228,.28);border-radius:999px;padding:5px 10px;font-size:11px;font-weight:800;letter-spacing:.6px;text-transform:uppercase">Demos</a>
<a href="https://demodokos.com/#create" style="color:#f3efe4;text-decoration:none;border:1px solid rgba(243,239,228,.28);border-radius:999px;padding:5px 10px;font-size:11px;font-weight:800;letter-spacing:.6px;text-transform:uppercase">Features</a>
<a href="https://demodokos.com/#pricing" style="color:#f3efe4;text-decoration:none;border:1px solid rgba(243,239,228,.28);border-radius:999px;padding:5px 10px;font-size:11px;font-weight:800;letter-spacing:.6px;text-transform:uppercase">Pricing</a>
<a href="https://demodokos.com/#begin" style="color:#f3efe4;text-decoration:none;border:1px solid rgba(243,239,228,.28);border-radius:999px;padding:5px 10px;font-size:11px;font-weight:800;letter-spacing:.6px;text-transform:uppercase">Download</a>
</div>
</div>
<div style="padding:36px 30px 28px 30px">
<div style="font-size:12px;font-weight:950;letter-spacing:2px;text-transform:uppercase;color:#59614d;margin-bottom:13px">Local AI music, speech, editing, mixing, and automation</div>
<h1 style="margin:0 0 16px 0;font-size:40px;line-height:1.04;font-weight:950;letter-spacing:-1.7px;color:#11110f">Your GPU is the studio.</h1>
<p style="margin:0 0 22px 0;max-width:900px;font-size:16.5px;line-height:1.65;color:#3f4039">Generate music, create lifelike speech, clone voices, separate stems, repair sections, record, mix, master, and automate full audio workflows. Foundry runs locally on Windows with an NVIDIA GPU, so your scripts, voices, songs, and client audio stay on your machine.</p>
<div style="display:flex;flex-wrap:wrap;gap:10px;margin:0 0 22px 0">
<a href="https://demodokos.com/#begin" style="background:#161614;color:#f3efe4;text-decoration:none;border-radius:10px;padding:11px 18px;font-size:13px;font-weight:950;letter-spacing:.2px;display:inline-block">Download for Windows</a>
<a href="https://demodokos.com/" style="background:#9cff3b;color:#11110f;text-decoration:none;border-radius:10px;padding:11px 18px;font-size:13px;font-weight:950;letter-spacing:.2px;display:inline-block">Start free trial</a>
<a href="https://demodokos.com/#listen" style="background:#fffaf0;color:#161614;text-decoration:none;border:1px solid #161614;border-radius:10px;padding:10px 17px;font-size:13px;font-weight:900;letter-spacing:.2px;display:inline-block">Hear examples</a>
</div>
<div style="display:flex;flex-wrap:wrap;gap:8px;color:#4d4f46;font-size:12px;font-weight:900;letter-spacing:.5px;text-transform:uppercase">
<span>No cloud generation</span>
<span style="color:#a1a093">/</span>
<span>No credit meters</span>
<span style="color:#a1a093">/</span>
<span>Voice cloning included</span>
<span style="color:#a1a093">/</span>
<span>Built for private production</span>
</div>
</div>
<div style="padding:0 30px 28px 30px">
<div style="display:flex;flex-wrap:wrap;border:1px solid #161614;border-radius:14px;overflow:hidden;background:#fffaf0">
<div style="flex:1 1 145px;min-width:135px;padding:18px;border-right:1px solid #161614">
<div style="font-size:29px;font-weight:950;letter-spacing:-1px;color:#11110f">50+</div>
<div style="font-size:12px;font-weight:900;text-transform:uppercase;letter-spacing:.8px;color:#5b5d52">Music languages</div>
</div>
<div style="flex:1 1 145px;min-width:135px;padding:18px;border-right:1px solid #161614">
<div style="font-size:29px;font-weight:950;letter-spacing:-1px;color:#11110f">10</div>
<div style="font-size:12px;font-weight:900;text-transform:uppercase;letter-spacing:.8px;color:#5b5d52">Speech languages</div>
</div>
<div style="flex:1 1 145px;min-width:135px;padding:18px;border-right:1px solid #161614">
<div style="font-size:29px;font-weight:950;letter-spacing:-1px;color:#11110f">50 x 5</div>
<div style="font-size:12px;font-weight:900;text-transform:uppercase;letter-spacing:.8px;color:#5b5d52">Emotion control</div>
</div>
<div style="flex:1 1 145px;min-width:135px;padding:18px;border-right:1px solid #161614">
<div style="font-size:29px;font-weight:950;letter-spacing:-1px;color:#11110f">200+</div>
<div style="font-size:12px;font-weight:900;text-transform:uppercase;letter-spacing:.8px;color:#5b5d52">DSP presets</div>
</div>
<div style="flex:1 1 145px;min-width:135px;padding:18px">
<div style="font-size:29px;font-weight:950;letter-spacing:-1px;color:#11110f">120+</div>
<div style="font-size:12px;font-weight:900;text-transform:uppercase;letter-spacing:.8px;color:#5b5d52">Commands</div>
</div>
</div>
</div>
<div style="padding:0 30px 30px 30px">
<div style="display:flex;flex-wrap:wrap;gap:14px">
<div style="flex:1 1 250px;background:#161614;color:#f3efe4;border-radius:14px;padding:22px;min-height:205px">
<div style="font-size:12px;font-weight:950;letter-spacing:1.5px;text-transform:uppercase;color:#9cff3b;margin-bottom:12px">Music</div>
<h3 style="margin:0 0 10px 0;font-size:22px;line-height:1.15;font-weight:950;color:#ffffff">Songs, vocals, structure, style, and language control.</h3>
<p style="margin:0 0 14px 0;color:#c8c3b8;font-size:14px;line-height:1.55">Create full tracks, extend ideas, transform references, patch weak sections, and generate multilingual songs locally.</p>
<a href="https://demodokos.com/blog/best-ai-music-generators-2026" style="color:#9cff3b;text-decoration:none;font-size:13px;font-weight:900">Compare music generators</a>
</div>
<div style="flex:1 1 250px;background:#fffaf0;border:1px solid #161614;border-radius:14px;padding:22px;min-height:205px">
<div style="font-size:12px;font-weight:950;letter-spacing:1.5px;text-transform:uppercase;color:#59614d;margin-bottom:12px">Speech</div>
<h3 style="margin:0 0 10px 0;font-size:22px;line-height:1.15;font-weight:950;color:#11110f">Expressive narration that does not sound like filler.</h3>
<p style="margin:0 0 14px 0;color:#4b4c45;font-size:14px;line-height:1.55">Generate speech in 10 languages, direct emotion line by line, clone voices from short samples, and build multi-speaker scenes.</p>
<a href="https://demodokos.com/local_ai_voice" style="color:#161614;text-decoration:none;font-size:13px;font-weight:950">Explore local AI voice</a>
</div>
<div style="flex:1 1 250px;background:#fffaf0;border:1px solid #161614;border-radius:14px;padding:22px;min-height:205px">
<div style="font-size:12px;font-weight:950;letter-spacing:1.5px;text-transform:uppercase;color:#59614d;margin-bottom:12px">Studio</div>
<h3 style="margin:0 0 10px 0;font-size:22px;line-height:1.15;font-weight:950;color:#11110f">A real production workspace, not just a prompt field.</h3>
<p style="margin:0 0 14px 0;color:#4b4c45;font-size:14px;line-height:1.55">Record, arrange, stem-split, patch, crossfade, process with DSP, mix on a DAW-style timeline, and export final audio.</p>
<a href="https://demodokos.com/#create" style="color:#161614;text-decoration:none;font-size:13px;font-weight:950">See what is inside</a>
</div>
</div>
</div>
<div style="background:#dfe8d2;border-top:1px solid #161614;border-bottom:1px solid #161614;padding:26px 30px">
<div style="display:flex;flex-wrap:wrap;gap:18px;align-items:flex-start">
<div style="flex:1 1 280px">
<div style="font-size:12px;font-weight:950;letter-spacing:1.7px;text-transform:uppercase;color:#35402e;margin-bottom:10px">Local creative agent</div>
<h3 style="margin:0 0 10px 0;font-size:24px;line-height:1.15;font-weight:950;color:#11110f">Give the boring preparation work to the assistant.</h3>
<p style="margin:0;color:#3f4639;font-size:14px;line-height:1.6">Use the built-in agent for lyrics, music briefs, script preparation, speaker extraction, emotion planning, literature summaries, narration segmentation, batch workflows, and repeatable production tasks.</p>
</div>
<div style="flex:1 1 280px;background:#f3efe4;border:1px solid #161614;border-radius:14px;padding:18px">
<div style="font-family:Consolas,Monaco,monospace;font-size:13px;line-height:1.75;color:#161614">
<div><span style="color:#59614d">foundry</span> analyze manuscript.pdf</div>
<div><span style="color:#59614d">foundry</span> extract speakers and emotions</div>
<div><span style="color:#59614d">foundry</span> generate narration</div>
<div><span style="color:#59614d">foundry</span> compose intro music</div>
<div><span style="color:#59614d">foundry</span> mix and export</div>
</div>
</div>
</div>
</div>
<div style="padding:28px 30px 12px 30px">
<div style="font-size:12px;font-weight:950;letter-spacing:1.7px;text-transform:uppercase;color:#59614d;margin-bottom:12px">Pick your workflow</div>
<div style="display:flex;flex-wrap:wrap;gap:12px">
<a href="https://demodokos.com/for_youtubers" style="flex:1 1 220px;background:#fffaf0;border:1px solid #161614;border-radius:14px;padding:18px;text-decoration:none;color:#161614;display:block">
<div style="font-size:18px;font-weight:950;margin-bottom:7px">YouTubers and faceless channels</div>
<div style="font-size:13px;line-height:1.5;color:#4b4c45">Voiceovers, hooks, background music, repeatable production, no per-video credit anxiety.</div>
</a>
<a href="https://demodokos.com/for_gamedevs" style="flex:1 1 220px;background:#fffaf0;border:1px solid #161614;border-radius:14px;padding:18px;text-decoration:none;color:#161614;display:block">
<div style="font-size:18px;font-weight:950;margin-bottom:7px">Game developers</div>
<div style="font-size:13px;line-height:1.5;color:#4b4c45">NPC dialogue, character voices, batch generation, ambience, score, and engine-ready export.</div>
</a>
<a href="https://demodokos.com/for_audiobooks" style="flex:1 1 220px;background:#fffaf0;border:1px solid #161614;border-radius:14px;padding:18px;text-decoration:none;color:#161614;display:block">
<div style="font-size:18px;font-weight:950;margin-bottom:7px">Audiobooks and podcasts</div>
<div style="font-size:13px;line-height:1.5;color:#4b4c45">Long-form narration, multi-speaker scenes, private manuscripts, music beds, and final export.</div>
</a>
</div>
</div>
<div style="padding:16px 30px 12px 30px">
<div style="font-size:12px;font-weight:950;letter-spacing:1.7px;text-transform:uppercase;color:#59614d;margin-bottom:12px">Speech studios by language</div>
<div style="display:flex;flex-wrap:wrap;gap:8px">
<a href="https://demodokos.com/ai_audio_studio/english" style="background:#161614;color:#f3efe4;border-radius:999px;padding:8px 12px;font-size:12px;font-weight:900;text-decoration:none">English</a>
<a href="https://demodokos.com/ai_audio_studio/deutsch" style="background:#161614;color:#f3efe4;border-radius:999px;padding:8px 12px;font-size:12px;font-weight:900;text-decoration:none">Deutsch</a>
<a href="https://demodokos.com/ai_audio_studio/francais" style="background:#161614;color:#f3efe4;border-radius:999px;padding:8px 12px;font-size:12px;font-weight:900;text-decoration:none">Français</a>
<a href="https://demodokos.com/ai_audio_studio/espanol" style="background:#161614;color:#f3efe4;border-radius:999px;padding:8px 12px;font-size:12px;font-weight:900;text-decoration:none">Español</a>
<a href="https://demodokos.com/ai_audio_studio/italiano" style="background:#161614;color:#f3efe4;border-radius:999px;padding:8px 12px;font-size:12px;font-weight:900;text-decoration:none">Italiano</a>
<a href="https://demodokos.com/ai_audio_studio/zhongwen" style="background:#161614;color:#f3efe4;border-radius:999px;padding:8px 12px;font-size:12px;font-weight:900;text-decoration:none">中文</a>
<a href="https://demodokos.com/ai_audio_studio/nihongo" style="background:#161614;color:#f3efe4;border-radius:999px;padding:8px 12px;font-size:12px;font-weight:900;text-decoration:none">日本語</a>
<a href="https://demodokos.com/ai_audio_studio/russkiy" style="background:#161614;color:#f3efe4;border-radius:999px;padding:8px 12px;font-size:12px;font-weight:900;text-decoration:none">Русский</a>
<a href="https://demodokos.com/ai_audio_studio/portugues" style="background:#161614;color:#f3efe4;border-radius:999px;padding:8px 12px;font-size:12px;font-weight:900;text-decoration:none">Português</a>
<a href="https://demodokos.com/ai_audio_studio/hangugeo" style="background:#161614;color:#f3efe4;border-radius:999px;padding:8px 12px;font-size:12px;font-weight:900;text-decoration:none">한국어</a>
</div>
</div>
<div style="padding:16px 30px 30px 30px">
<div style="border-top:1px solid #161614;padding-top:20px;display:flex;flex-wrap:wrap;gap:18px;align-items:flex-start;justify-content:space-between">
<div style="flex:1 1 260px">
<div style="font-size:12px;font-weight:950;letter-spacing:1.7px;text-transform:uppercase;color:#59614d;margin-bottom:10px">Learn fast</div>
<div style="display:flex;flex-wrap:wrap;gap:8px">
<a href="https://demodokos.com/tutorials/getting-started" style="background:#fffaf0;color:#161614;border:1px solid #161614;text-decoration:none;border-radius:10px;padding:8px 12px;font-size:12px;font-weight:900">Getting started</a>
<a href="https://demodokos.com/tutorials/emotions-style-manager" style="background:#fffaf0;color:#161614;border:1px solid #161614;text-decoration:none;border-radius:10px;padding:8px 12px;font-size:12px;font-weight:900">Emotion manager</a>
<a href="https://demodokos.com/tutorials/patching" style="background:#fffaf0;color:#161614;border:1px solid #161614;text-decoration:none;border-radius:10px;padding:8px 12px;font-size:12px;font-weight:900">Patching</a>
<a href="https://demodokos.com/tutorials/creative-ai-agent" style="background:#fffaf0;color:#161614;border:1px solid #161614;text-decoration:none;border-radius:10px;padding:8px 12px;font-size:12px;font-weight:900">Creative AI agent</a>
</div>
</div>
<div style="flex:1 1 260px">
<div style="font-size:12px;font-weight:950;letter-spacing:1.7px;text-transform:uppercase;color:#59614d;margin-bottom:10px">Deep dives</div>
<div style="display:flex;flex-wrap:wrap;gap:8px">
<a href="https://demodokos.com/voice_cloning_software" style="background:#fffaf0;color:#161614;border:1px solid #161614;text-decoration:none;border-radius:10px;padding:8px 12px;font-size:12px;font-weight:900">Voice cloning</a>
<a href="https://demodokos.com/elevenlabs_alternative" style="background:#fffaf0;color:#161614;border:1px solid #161614;text-decoration:none;border-radius:10px;padding:8px 12px;font-size:12px;font-weight:900">ElevenLabs alternative</a>
<a href="https://demodokos.com/blog/creative-ai-and-automation" style="background:#fffaf0;color:#161614;border:1px solid #161614;text-decoration:none;border-radius:10px;padding:8px 12px;font-size:12px;font-weight:900">Automation</a>
<a href="https://demodokos.com/privacy" style="background:#fffaf0;color:#161614;border:1px solid #161614;text-decoration:none;border-radius:10px;padding:8px 12px;font-size:12px;font-weight:900">Privacy</a>
</div>
</div>
</div>
</div>
</div>
## Model
This model is hosted for Demodokos Foundry, but it can be used for other purposes. The repository offers a stable download location and custom quantizations that are not available elsewhere.
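
As a minimal sketch of fetching the files, the snippet below uses `snapshot_download` from the `huggingface_hub` package. The repository id `cmp-nct/Ace-Step1.5` is assumed from this page's location, and the target directory and optional file filter are illustrative choices, not documented settings of this repository.

```python
from pathlib import Path
from typing import List, Optional

# Assumption: repository id inferred from where this model card is hosted.
REPO_ID = "cmp-nct/Ace-Step1.5"


def download_model(
    local_dir: str = "models/ace-step-1.5",
    patterns: Optional[List[str]] = None,
) -> Path:
    """Download the repository (or only files matching `patterns`) to `local_dir`."""
    # Imported lazily so this module can be imported without the package installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=REPO_ID,
        local_dir=local_dir,
        allow_patterns=patterns,  # None means "download everything"
    )
    return Path(path)
```

Calling `download_model()` needs network access and enough disk space for the full repository. Pass `patterns` to restrict the download to a subset of files once you know which ones you need.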
## Model Details
🚀 **ACE-Step v1.5** is a highly efficient open-source music foundation model designed to bring commercial-grade music generation to consumer hardware.
### Key Features
* **💰 Commercial-Ready:** Unlike many models trained on ambiguous datasets, ACE-Step v1.5 is designed for creators. You can use the generated music for **commercial purposes**.
* **📚 Safe & Robust Training Data:** The model is trained on a massive, legally compliant dataset consisting of:
    * **Licensed Data:** Professionally licensed music tracks.
    * **Royalty-Free / No-Copyright Data:** A vast collection of public domain and royalty-free music.
    * **Synthetic Data:** High-quality audio generated via advanced MIDI-to-Audio conversion.
* **⚡ Extreme Speed:** Generates a full song in under 2 seconds on an A100 and under 10 seconds on an RTX 3090.
* **🖥️ Consumer Hardware Friendly:** Runs locally with less than 4GB of VRAM.
### Technical Capabilities
🌉 At its core lies a novel hybrid architecture where the Language Model (LM) functions as an omni-capable planner: it transforms simple user queries into comprehensive song blueprints, scaling from short loops to 10-minute compositions, while synthesizing metadata, lyrics, and captions via Chain-of-Thought to guide the Diffusion Transformer (DiT). ⚡ Uniquely, this alignment is achieved through intrinsic reinforcement learning relying solely on the model's internal mechanisms, thereby eliminating the biases inherent in external reward models or human preferences. 🎚️

🔮 Beyond standard synthesis, ACE-Step v1.5 unifies precise stylistic control with versatile editing capabilities (such as cover generation, repainting, and vocal-to-BGM conversion) while maintaining strict adherence to prompts across 50+ languages. This paves the way for powerful tools that seamlessly integrate into the creative workflows of music artists, producers, and content creators. 🎸

- **Developed by:** ACE-STEP
- **Model type:** Text2Music
- **Language(s):** 50+ languages
- **License:** MIT
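
To make the planner-then-renderer split described under Technical Capabilities concrete, here is a toy sketch of the data flow only. `SongBlueprint`, `plan`, and `render` are hypothetical names invented for illustration. They are not part of the ACE-Step codebase, and no model is run. The 600-second cap mirrors the "10-minute compositions" mentioned above, and the default of 8 steps mirrors the turbo entry in the DiT table.

```python
from dataclasses import dataclass, field


@dataclass
class SongBlueprint:
    """Hypothetical container for what the LM planner hands to the DiT."""
    caption: str
    lyrics: str
    metadata: dict = field(default_factory=dict)


def plan(query: str, duration_s: float) -> SongBlueprint:
    """Stand-in for the LM planner: expand a short query into a blueprint."""
    if not 0 < duration_s <= 600:
        raise ValueError("duration_s must be in (0, 600]")
    return SongBlueprint(
        caption=f"A track matching: {query}",
        lyrics="[instrumental]",
        metadata={"duration_s": duration_s},
    )


def render(blueprint: SongBlueprint, steps: int = 8) -> dict:
    """Stand-in for the DiT: a real model would return audio, not a dict."""
    return {
        "caption": blueprint.caption,
        "lyrics": blueprint.lyrics,
        "steps": steps,
        **blueprint.metadata,
    }


job = render(plan("lofi beat for studying", duration_s=30))
print(job["steps"], job["duration_s"])
```

The point of the sketch is the interface: the planner emits structured text (caption, lyrics, metadata), and the renderer consumes only that structure.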
## Evaluation
![image](https://cdn-uploads.huggingface.co/production/uploads/62dfaf90c42558bcbd0a4f6f/n9aKi_NhSmlMOgmGzahZi.png)
## 🏗️ Architecture
![image](https://cdn-uploads.huggingface.co/production/uploads/62dfaf90c42558bcbd0a4f6f/V_d1rTdqkQyoSM8td7OWl.png)
## 🦁 Model Zoo
![image](https://cdn-uploads.huggingface.co/production/uploads/62dfaf90c42558bcbd0a4f6f/B49V0OTKse_FRefTmTPsQ.png)
### DiT Models
| DiT Model | Pre-Training | SFT | RL | CFG | Step | Refer audio | Text2Music | Cover | Repaint | Extract | Lego | Complete | Quality | Diversity | Fine-Tunability | Hugging Face |
|-----------|:------------:|:---:|:--:|:---:|:----:|:-----------:|:----------:|:-----:|:-------:|:-------:|:----:|:--------:|:-------:|:---------:|:---------------:|--------------|
| `acestep-v15-base` | ✅ | ❌ | ❌ | ✅ | 50 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Medium | High | Easy | [Link](https://huggingface.co/ACE-Step/acestep-v15-base) |
| `acestep-v15-sft` | ✅ | ✅ | ❌ | ✅ | 50 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | High | Medium | Easy | [Link](https://huggingface.co/ACE-Step/acestep-v15-sft) |
| `acestep-v15-turbo` | ✅ | ✅ | ❌ | ❌ | 8 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | Very High | Medium | Medium | [Link](https://huggingface.co/ACE-Step/Ace-Step1.5) |
| `acestep-v15-turbo-rl` | ✅ | ✅ | ✅ | ❌ | 8 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | Very High | Medium | Medium | To be released |
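
The table above can be read as a capability lookup. The sketch below transcribes the step counts, CFG flags, and task columns (Text2Music through Complete) into a dictionary and filters it. The dictionary layout, the key names, and the `models_for` helper are illustrative assumptions, not an ACE-Step API.

```python
from typing import List, Optional

BASIC_TASKS = {"text2music", "cover", "repaint"}

# Transcribed from the DiT table; "released" is False for the row marked "To be released".
DIT_MODELS = {
    "acestep-v15-base": {
        "steps": 50, "cfg": True, "released": True,
        "tasks": BASIC_TASKS | {"extract", "lego", "complete"},
    },
    "acestep-v15-sft": {
        "steps": 50, "cfg": True, "released": True, "tasks": set(BASIC_TASKS),
    },
    "acestep-v15-turbo": {
        "steps": 8, "cfg": False, "released": True, "tasks": set(BASIC_TASKS),
    },
    "acestep-v15-turbo-rl": {
        "steps": 8, "cfg": False, "released": False, "tasks": set(BASIC_TASKS),
    },
}


def models_for(task: str, max_steps: Optional[int] = None) -> List[str]:
    """Return released models that support `task`, optionally capped by step count."""
    return [
        name
        for name, spec in DIT_MODELS.items()
        if spec["released"]
        and task in spec["tasks"]
        and (max_steps is None or spec["steps"] <= max_steps)
    ]


print(models_for("extract"))             # ['acestep-v15-base']
print(models_for("cover", max_steps=8))  # ['acestep-v15-turbo']
```

In practice this says: use the base model for Extract, Lego, or Complete, and a turbo variant when step count matters most.
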
### LM Models
| LM Model | Pretrain from | Pre-Training | SFT | RL | CoT metas | Query rewrite | Audio Understanding | Composition Capability | Copy Melody | Hugging Face |
|----------|---------------|:------------:|:---:|:--:|:---------:|:-------------:|:-------------------:|:----------------------:|:-----------:|--------------|
| `acestep-5Hz-lm-0.6B` | Qwen3-0.6B | ✅ | ✅ | ✅ | ✅ | ✅ | Medium | Medium | Weak | ✅ |
| `acestep-5Hz-lm-1.7B` | Qwen3-1.7B | ✅ | ✅ | ✅ | ✅ | ✅ | Medium | Medium | Medium | ✅ |
| `acestep-5Hz-lm-4B` | Qwen3-4B | ✅ | ✅ | ✅ | ✅ | ✅ | Strong | Strong | Strong | ✅ |
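
One way to use the LM table is to pick the smallest planner that meets a minimum rating. The sketch below encodes the three rated columns and orders Weak < Medium < Strong. The `smallest_lm` helper and its keyword names are hypothetical, and the parameter counts are read off the model names.

```python
from typing import Optional

RATING = {"Weak": 0, "Medium": 1, "Strong": 2}

# (model name, parameters in billions, ratings transcribed from the LM table)
LM_MODELS = [
    ("acestep-5Hz-lm-0.6B", 0.6,
     {"audio_understanding": "Medium", "composition": "Medium", "copy_melody": "Weak"}),
    ("acestep-5Hz-lm-1.7B", 1.7,
     {"audio_understanding": "Medium", "composition": "Medium", "copy_melody": "Medium"}),
    ("acestep-5Hz-lm-4B", 4.0,
     {"audio_understanding": "Strong", "composition": "Strong", "copy_melody": "Strong"}),
]


def smallest_lm(**minimums: str) -> Optional[str]:
    """Return the smallest LM whose ratings meet every requested minimum."""
    for name, _size, ratings in sorted(LM_MODELS, key=lambda m: m[1]):
        if all(RATING[ratings[key]] >= RATING[level] for key, level in minimums.items()):
            return name
    return None


print(smallest_lm(copy_melody="Medium"))   # acestep-5Hz-lm-1.7B
print(smallest_lm(composition="Strong"))   # acestep-5Hz-lm-4B
```
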
## 🙏 Acknowledgements
This project is co-led by ACE Studio and StepFun.
## 📖 Citation
If you find this project useful for your research, please consider citing:
```BibTeX
@misc{gong2026acestep,
  title={ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation},
  author={Junmin Gong and Yulin Song and Wenxiao Zhao and Sen Wang and Shengyuan Xu and Jing Guo},
  howpublished={\url{https://github.com/ace-step/ACE-Step-1.5}},
  year={2026},
  note={GitHub repository}
}
```