|
|
--- |
|
|
license: mit |
|
|
library_name: transformers |
|
|
tags: |
|
|
- music |
|
|
- audio |
|
|
--- |
|
|
|
|
|
<a href="https://arxiv.org/abs/2602.00744">Tech Report</a> |
|
|
|
|
|
# ACE-Step Captioner |
|
|
|
|
|
## Description |
|
|
|
|
|
ACE-Step Captioner is the annotation model used by **ACE-Step v1.5** for training data labeling. It is a professional-grade music captioning model that generates detailed, structured descriptions of audio content. |
|
|
|
|
|
### Performance |
|
|
|
|
|
π **Accuracy surpasses Gemini Pro 2.5** in music description tasks |
|
|
|
|
|
### Key Features |
|
|
|
|
|
- πΌ **Musical Style Analysis** - Identifies genres, sub-genres, and stylistic influences |
|
|
- πΈ **Instrument Recognition** - Detects and describes 1000+ instrument types and combinations |
|
|
- π **Structure & Progression** - Analyzes musical arrangement including intro, verse, chorus, bridge, climax, and outro |
|
|
- π **Timbre Description** - Captures tonal qualities, textures, and sonic characteristics |
|
|
- π **Rich Vocabulary** - Supports 1000+ descriptive terms for comprehensive music annotation |
|
|
|
|
|
## Usage |
|
|
|
|
|
The usage is the same as [Qwen2.5 Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B). |
|
|
|
|
|
### Prompt Format |
|
|
|
|
|
Use the following prompt to caption audio: |
|
|
|
|
|
``` |
|
|
*Task* Describe this audio in detail |
|
|
<audio> |
|
|
``` |
|
|
|
|
|
### Output Format |
|
|
|
|
|
The model generates natural language descriptions covering multiple aspects of the music. |
|
|
|
|
|
### Example Output |
|
|
|
|
|
``` |
|
|
A melancholic indie folk track featuring fingerpicked acoustic guitar |
|
|
as the primary instrument. The song opens with a sparse, contemplative |
|
|
intro before the vocals enter with a breathy, intimate delivery. |
|
|
The arrangement gradually builds through the verse, adding subtle |
|
|
string pads and a gentle kick drum. The chorus lifts with layered |
|
|
harmonies and a warmer, fuller texture. The bridge introduces a |
|
|
key change and emotional climax before returning to the stripped-down |
|
|
acoustic arrangement for the outro. |
|
|
``` |
|
|
|
|
|
## Descriptive Capabilities |
|
|
|
|
|
### Musical Styles (Examples) |
|
|
|
|
|
| Category | Styles | |
|
|
|----------|--------| |
|
|
| **Electronic** | Ambient, Techno, House, Drum & Bass, Synthwave, IDM, Downtempo | |
|
|
| **Rock** | Alternative, Indie, Post-Rock, Progressive, Psychedelic, Grunge | |
|
|
| **Pop** | Synth-pop, Electropop, Dream Pop, Art Pop, Indie Pop | |
|
|
| **Classical** | Orchestral, Chamber, Minimalist, Neo-Classical, Cinematic | |
|
|
| **World** | Latin, African, Middle Eastern, Asian Traditional, Celtic | |
|
|
| **Jazz** | Fusion, Smooth, Bebop, Modal, Free Jazz | |
|
|
| **Hip-Hop** | Trap, Boom Bap, Lo-fi, Instrumental, Cloud Rap | |
|
|
|
|
|
### Instruments (1000+ Supported) |
|
|
|
|
|
| Category | Examples | |
|
|
|----------|----------| |
|
|
| **Strings** | Acoustic Guitar, Electric Guitar, Violin, Cello, Bass, Harp, Mandolin | |
|
|
| **Keys** | Piano, Synthesizer, Organ, Rhodes, Wurlitzer, Mellotron | |
|
|
| **Percussion** | Drums, Electronic Drums, Congas, Bongos, Timpani, Vibraphone | |
|
|
| **Wind** | Saxophone, Trumpet, Flute, Clarinet, Oboe, French Horn | |
|
|
| **Electronic** | Synth Bass, Pad, Lead, Arpeggiator, Sampler, 808, 303 | |
|
|
|
|
|
### Structure Analysis |
|
|
|
|
|
- **Intro / Outro** - Opening and closing sections |
|
|
- **Verse / Pre-Chorus / Chorus** - Main song structure |
|
|
- **Bridge / Break** - Transitional sections |
|
|
- **Build-up / Drop / Climax** - Dynamic progression |
|
|
- **Interlude / Solo** - Instrumental passages |
|
|
|
|
|
### Timbre Descriptions |
|
|
|
|
|
| Dimension | Descriptors | |
|
|
|-----------|-------------| |
|
|
| **Texture** | Warm, Bright, Dark, Crisp, Muddy, Clean, Distorted, Saturated | |
|
|
| **Space** | Reverberant, Dry, Spacious, Intimate, Cavernous, Tight | |
|
|
| **Dynamics** | Punchy, Soft, Aggressive, Gentle, Compressed, Dynamic | |
|
|
| **Character** | Ethereal, Gritty, Smooth, Raw, Polished, Organic, Synthetic | |
|
|
|
|
|
## Use Cases |
|
|
|
|
|
- **Music AI Training** - Generate high-quality captions for music generation models |
|
|
- **Music Information Retrieval** - Create searchable metadata for audio databases |
|
|
- **Content Moderation** - Analyze and categorize music content |
|
|
- **Music Education** - Provide detailed analysis for learning purposes |
|
|
- **Audio Production** - Document and describe sound design elements |