Instructions to use AEmotionStudio/acestep-models with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Diffusers
How to use AEmotionStudio/acestep-models with Diffusers:
pip install -U diffusers transformers accelerate
import torch from diffusers import DiffusionPipeline # switch to "mps" for apple devices pipe = DiffusionPipeline.from_pretrained("AEmotionStudio/acestep-models", dtype=torch.bfloat16, device_map="cuda") prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" image = pipe(prompt).images[0] - Notebooks
- Google Colab
- Kaggle
Mirror README.md from ACE-Step/acestep-captioner
Browse files
checkpoints/acestep-captioner/README.md
ADDED
|
@@ -0,0 +1,106 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
license: mit
|
| 3 |
+
library_name: transformers
|
| 4 |
+
tags:
|
| 5 |
+
- music
|
| 6 |
+
- audio
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
<a href="https://arxiv.org/abs/2602.00744">Tech Report</a>
|
| 10 |
+
|
| 11 |
+
# ACE-Step Captioner
|
| 12 |
+
|
| 13 |
+
## Description
|
| 14 |
+
|
| 15 |
+
ACE-Step Captioner is the annotation model used by **ACE-Step v1.5** for training data labeling. It is a professional-grade music captioning model that generates detailed, structured descriptions of audio content.
|
| 16 |
+
|
| 17 |
+
### Performance
|
| 18 |
+
|
| 19 |
+
🏆 **Accuracy surpasses Gemini Pro 2.5** in music description tasks
|
| 20 |
+
|
| 21 |
+
### Key Features
|
| 22 |
+
|
| 23 |
+
- 🎼 **Musical Style Analysis** - Identifies genres, sub-genres, and stylistic influences
|
| 24 |
+
- 🎸 **Instrument Recognition** - Detects and describes 1000+ instrument types and combinations
|
| 25 |
+
- 🎭 **Structure & Progression** - Analyzes musical arrangement including intro, verse, chorus, bridge, climax, and outro
|
| 26 |
+
- 🔊 **Timbre Description** - Captures tonal qualities, textures, and sonic characteristics
|
| 27 |
+
- 📝 **Rich Vocabulary** - Supports 1000+ descriptive terms for comprehensive music annotation
|
| 28 |
+
|
| 29 |
+
## Usage
|
| 30 |
+
|
| 31 |
+
The usage is the same as [Qwen2.5 Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B).
|
| 32 |
+
|
| 33 |
+
### Prompt Format
|
| 34 |
+
|
| 35 |
+
Use the following prompt to caption audio:
|
| 36 |
+
|
| 37 |
+
```
|
| 38 |
+
*Task* Describe this audio in detail
|
| 39 |
+
<audio>
|
| 40 |
+
```
|
| 41 |
+
|
| 42 |
+
### Output Format
|
| 43 |
+
|
| 44 |
+
The model generates natural language descriptions covering multiple aspects of the music.
|
| 45 |
+
|
| 46 |
+
### Example Output
|
| 47 |
+
|
| 48 |
+
```
|
| 49 |
+
A melancholic indie folk track featuring fingerpicked acoustic guitar
|
| 50 |
+
as the primary instrument. The song opens with a sparse, contemplative
|
| 51 |
+
intro before the vocals enter with a breathy, intimate delivery.
|
| 52 |
+
The arrangement gradually builds through the verse, adding subtle
|
| 53 |
+
string pads and a gentle kick drum. The chorus lifts with layered
|
| 54 |
+
harmonies and a warmer, fuller texture. The bridge introduces a
|
| 55 |
+
key change and emotional climax before returning to the stripped-down
|
| 56 |
+
acoustic arrangement for the outro.
|
| 57 |
+
```
|
| 58 |
+
|
| 59 |
+
## Descriptive Capabilities
|
| 60 |
+
|
| 61 |
+
### Musical Styles (Examples)
|
| 62 |
+
|
| 63 |
+
| Category | Styles |
|
| 64 |
+
|----------|--------|
|
| 65 |
+
| **Electronic** | Ambient, Techno, House, Drum & Bass, Synthwave, IDM, Downtempo |
|
| 66 |
+
| **Rock** | Alternative, Indie, Post-Rock, Progressive, Psychedelic, Grunge |
|
| 67 |
+
| **Pop** | Synth-pop, Electropop, Dream Pop, Art Pop, Indie Pop |
|
| 68 |
+
| **Classical** | Orchestral, Chamber, Minimalist, Neo-Classical, Cinematic |
|
| 69 |
+
| **World** | Latin, African, Middle Eastern, Asian Traditional, Celtic |
|
| 70 |
+
| **Jazz** | Fusion, Smooth, Bebop, Modal, Free Jazz |
|
| 71 |
+
| **Hip-Hop** | Trap, Boom Bap, Lo-fi, Instrumental, Cloud Rap |
|
| 72 |
+
|
| 73 |
+
### Instruments (1000+ Supported)
|
| 74 |
+
|
| 75 |
+
| Category | Examples |
|
| 76 |
+
|----------|----------|
|
| 77 |
+
| **Strings** | Acoustic Guitar, Electric Guitar, Violin, Cello, Bass, Harp, Mandolin |
|
| 78 |
+
| **Keys** | Piano, Synthesizer, Organ, Rhodes, Wurlitzer, Mellotron |
|
| 79 |
+
| **Percussion** | Drums, Electronic Drums, Congas, Bongos, Timpani, Vibraphone |
|
| 80 |
+
| **Wind** | Saxophone, Trumpet, Flute, Clarinet, Oboe, French Horn |
|
| 81 |
+
| **Electronic** | Synth Bass, Pad, Lead, Arpeggiator, Sampler, 808, 303 |
|
| 82 |
+
|
| 83 |
+
### Structure Analysis
|
| 84 |
+
|
| 85 |
+
- **Intro / Outro** - Opening and closing sections
|
| 86 |
+
- **Verse / Pre-Chorus / Chorus** - Main song structure
|
| 87 |
+
- **Bridge / Break** - Transitional sections
|
| 88 |
+
- **Build-up / Drop / Climax** - Dynamic progression
|
| 89 |
+
- **Interlude / Solo** - Instrumental passages
|
| 90 |
+
|
| 91 |
+
### Timbre Descriptions
|
| 92 |
+
|
| 93 |
+
| Dimension | Descriptors |
|
| 94 |
+
|-----------|-------------|
|
| 95 |
+
| **Texture** | Warm, Bright, Dark, Crisp, Muddy, Clean, Distorted, Saturated |
|
| 96 |
+
| **Space** | Reverberant, Dry, Spacious, Intimate, Cavernous, Tight |
|
| 97 |
+
| **Dynamics** | Punchy, Soft, Aggressive, Gentle, Compressed, Dynamic |
|
| 98 |
+
| **Character** | Ethereal, Gritty, Smooth, Raw, Polished, Organic, Synthetic |
|
| 99 |
+
|
| 100 |
+
## Use Cases
|
| 101 |
+
|
| 102 |
+
- **Music AI Training** - Generate high-quality captions for music generation models
|
| 103 |
+
- **Music Information Retrieval** - Create searchable metadata for audio databases
|
| 104 |
+
- **Content Moderation** - Analyze and categorize music content
|
| 105 |
+
- **Music Education** - Provide detailed analysis for learning purposes
|
| 106 |
+
- **Audio Production** - Document and describe sound design elements
|