metadata
license: mit
library_name: transformers
tags:
- music
- audio
ACE-Step Captioner
Description
ACE-Step Captioner is the annotation model used by ACE-Step v1.5 for training data labeling. It is a professional-grade music captioning model that generates detailed, structured descriptions of audio content.
Performance
๐ Accuracy surpasses Gemini Pro 2.5 in music description tasks
Key Features
- ๐ผ Musical Style Analysis - Identifies genres, sub-genres, and stylistic influences
- ๐ธ Instrument Recognition - Detects and describes 1000+ instrument types and combinations
- ๐ญ Structure & Progression - Analyzes musical arrangement including intro, verse, chorus, bridge, climax, and outro
- ๐ Timbre Description - Captures tonal qualities, textures, and sonic characteristics
- ๐ Rich Vocabulary - Supports 1000+ descriptive terms for comprehensive music annotation
Usage
The usage is the same as Qwen2.5 Omni-7B.
Prompt Format
Use the following prompt to caption audio:
*Task* Describe this audio in detail
<audio>
Output Format
The model generates natural language descriptions covering multiple aspects of the music.
Example Output
A melancholic indie folk track featuring fingerpicked acoustic guitar
as the primary instrument. The song opens with a sparse, contemplative
intro before the vocals enter with a breathy, intimate delivery.
The arrangement gradually builds through the verse, adding subtle
string pads and a gentle kick drum. The chorus lifts with layered
harmonies and a warmer, fuller texture. The bridge introduces a
key change and emotional climax before returning to the stripped-down
acoustic arrangement for the outro.
Descriptive Capabilities
Musical Styles (Examples)
| Category | Styles |
|---|---|
| Electronic | Ambient, Techno, House, Drum & Bass, Synthwave, IDM, Downtempo |
| Rock | Alternative, Indie, Post-Rock, Progressive, Psychedelic, Grunge |
| Pop | Synth-pop, Electropop, Dream Pop, Art Pop, Indie Pop |
| Classical | Orchestral, Chamber, Minimalist, Neo-Classical, Cinematic |
| World | Latin, African, Middle Eastern, Asian Traditional, Celtic |
| Jazz | Fusion, Smooth, Bebop, Modal, Free Jazz |
| Hip-Hop | Trap, Boom Bap, Lo-fi, Instrumental, Cloud Rap |
Instruments (1000+ Supported)
| Category | Examples |
|---|---|
| Strings | Acoustic Guitar, Electric Guitar, Violin, Cello, Bass, Harp, Mandolin |
| Keys | Piano, Synthesizer, Organ, Rhodes, Wurlitzer, Mellotron |
| Percussion | Drums, Electronic Drums, Congas, Bongos, Timpani, Vibraphone |
| Wind | Saxophone, Trumpet, Flute, Clarinet, Oboe, French Horn |
| Electronic | Synth Bass, Pad, Lead, Arpeggiator, Sampler, 808, 303 |
Structure Analysis
- Intro / Outro - Opening and closing sections
- Verse / Pre-Chorus / Chorus - Main song structure
- Bridge / Break - Transitional sections
- Build-up / Drop / Climax - Dynamic progression
- Interlude / Solo - Instrumental passages
Timbre Descriptions
| Dimension | Descriptors |
|---|---|
| Texture | Warm, Bright, Dark, Crisp, Muddy, Clean, Distorted, Saturated |
| Space | Reverberant, Dry, Spacious, Intimate, Cavernous, Tight |
| Dynamics | Punchy, Soft, Aggressive, Gentle, Compressed, Dynamic |
| Character | Ethereal, Gritty, Smooth, Raw, Polished, Organic, Synthetic |
Use Cases
- Music AI Training - Generate high-quality captions for music generation models
- Music Information Retrieval - Create searchable metadata for audio databases
- Content Moderation - Analyze and categorize music content
- Music Education - Provide detailed analysis for learning purposes
- Audio Production - Document and describe sound design elements