AEmotionStudio
/

acestep-models

+---
+license: mit
+library_name: transformers
+tags:
+- music
+- audio
+---
+<a href="https://arxiv.org/abs/2602.00744">Tech Report</a>
+# ACE-Step Captioner
+## Description
+ACE-Step Captioner is the annotation model used by **ACE-Step v1.5** for training data labeling. It is a professional-grade music captioning model that generates detailed, structured descriptions of audio content.
+### Performance
+🏆 **Accuracy surpasses Gemini Pro 2.5** in music description tasks
+### Key Features
+- 🎼 **Musical Style Analysis** - Identifies genres, sub-genres, and stylistic influences
+- 🎸 **Instrument Recognition** - Detects and describes 1000+ instrument types and combinations
+- 🎭 **Structure & Progression** - Analyzes musical arrangement including intro, verse, chorus, bridge, climax, and outro
+- 🔊 **Timbre Description** - Captures tonal qualities, textures, and sonic characteristics
+- 📝 **Rich Vocabulary** - Supports 1000+ descriptive terms for comprehensive music annotation
+## Usage
+The usage is the same as [Qwen2.5 Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B).
+### Prompt Format
+Use the following prompt to caption audio:
+```
+*Task* Describe this audio in detail
+<audio>
+```
+### Output Format
+The model generates natural language descriptions covering multiple aspects of the music.
+### Example Output
+```
+A melancholic indie folk track featuring fingerpicked acoustic guitar
+as the primary instrument. The song opens with a sparse, contemplative
+intro before the vocals enter with a breathy, intimate delivery.
+The arrangement gradually builds through the verse, adding subtle
+string pads and a gentle kick drum. The chorus lifts with layered
+harmonies and a warmer, fuller texture. The bridge introduces a
+key change and emotional climax before returning to the stripped-down
+acoustic arrangement for the outro.
+```
+## Descriptive Capabilities
+### Musical Styles (Examples)
+| Category | Styles |
+|----------|--------|
+| **Electronic** | Ambient, Techno, House, Drum & Bass, Synthwave, IDM, Downtempo |
+| **Rock** | Alternative, Indie, Post-Rock, Progressive, Psychedelic, Grunge |
+| **Pop** | Synth-pop, Electropop, Dream Pop, Art Pop, Indie Pop |
+| **Classical** | Orchestral, Chamber, Minimalist, Neo-Classical, Cinematic |
+| **World** | Latin, African, Middle Eastern, Asian Traditional, Celtic |
+| **Jazz** | Fusion, Smooth, Bebop, Modal, Free Jazz |
+| **Hip-Hop** | Trap, Boom Bap, Lo-fi, Instrumental, Cloud Rap |
+### Instruments (1000+ Supported)
+| Category | Examples |
+|----------|----------|
+| **Strings** | Acoustic Guitar, Electric Guitar, Violin, Cello, Bass, Harp, Mandolin |
+| **Keys** | Piano, Synthesizer, Organ, Rhodes, Wurlitzer, Mellotron |
+| **Percussion** | Drums, Electronic Drums, Congas, Bongos, Timpani, Vibraphone |
+| **Wind** | Saxophone, Trumpet, Flute, Clarinet, Oboe, French Horn |
+| **Electronic** | Synth Bass, Pad, Lead, Arpeggiator, Sampler, 808, 303 |
+### Structure Analysis
+- **Intro / Outro** - Opening and closing sections
+- **Verse / Pre-Chorus / Chorus** - Main song structure
+- **Bridge / Break** - Transitional sections
+- **Build-up / Drop / Climax** - Dynamic progression
+- **Interlude / Solo** - Instrumental passages
+### Timbre Descriptions
+| Dimension | Descriptors |
+|-----------|-------------|
+| **Texture** | Warm, Bright, Dark, Crisp, Muddy, Clean, Distorted, Saturated |
+| **Space** | Reverberant, Dry, Spacious, Intimate, Cavernous, Tight |
+| **Dynamics** | Punchy, Soft, Aggressive, Gentle, Compressed, Dynamic |
+| **Character** | Ethereal, Gritty, Smooth, Raw, Polished, Organic, Synthetic |
+## Use Cases
+- **Music AI Training** - Generate high-quality captions for music generation models
+- **Music Information Retrieval** - Create searchable metadata for audio databases
+- **Content Moderation** - Analyze and categorize music content
+- **Music Education** - Provide detailed analysis for learning purposes
+- **Audio Production** - Document and describe sound design elements