File size: 3,993 Bytes
ec9d023 7109e9a 8b13325 ec9d023 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 |
---
license: mit
library_name: transformers
tags:
- music
- audio
---
<a href="https://arxiv.org/abs/2602.00744">Tech Report</a>
# ACE-Step Captioner
## Description
ACE-Step Captioner is the annotation model used by **ACE-Step v1.5** for training data labeling. It is a professional-grade music captioning model that generates detailed, structured descriptions of audio content.
### Performance
๐ **Accuracy surpasses Gemini Pro 2.5** in music description tasks
### Key Features
- ๐ผ **Musical Style Analysis** - Identifies genres, sub-genres, and stylistic influences
- ๐ธ **Instrument Recognition** - Detects and describes 1000+ instrument types and combinations
- ๐ญ **Structure & Progression** - Analyzes musical arrangement including intro, verse, chorus, bridge, climax, and outro
- ๐ **Timbre Description** - Captures tonal qualities, textures, and sonic characteristics
- ๐ **Rich Vocabulary** - Supports 1000+ descriptive terms for comprehensive music annotation
## Usage
The usage is the same as [Qwen2.5 Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B).
### Prompt Format
Use the following prompt to caption audio:
```
*Task* Describe this audio in detail
<audio>
```
### Output Format
The model generates natural language descriptions covering multiple aspects of the music.
### Example Output
```
A melancholic indie folk track featuring fingerpicked acoustic guitar
as the primary instrument. The song opens with a sparse, contemplative
intro before the vocals enter with a breathy, intimate delivery.
The arrangement gradually builds through the verse, adding subtle
string pads and a gentle kick drum. The chorus lifts with layered
harmonies and a warmer, fuller texture. The bridge introduces a
key change and emotional climax before returning to the stripped-down
acoustic arrangement for the outro.
```
## Descriptive Capabilities
### Musical Styles (Examples)
| Category | Styles |
|----------|--------|
| **Electronic** | Ambient, Techno, House, Drum & Bass, Synthwave, IDM, Downtempo |
| **Rock** | Alternative, Indie, Post-Rock, Progressive, Psychedelic, Grunge |
| **Pop** | Synth-pop, Electropop, Dream Pop, Art Pop, Indie Pop |
| **Classical** | Orchestral, Chamber, Minimalist, Neo-Classical, Cinematic |
| **World** | Latin, African, Middle Eastern, Asian Traditional, Celtic |
| **Jazz** | Fusion, Smooth, Bebop, Modal, Free Jazz |
| **Hip-Hop** | Trap, Boom Bap, Lo-fi, Instrumental, Cloud Rap |
### Instruments (1000+ Supported)
| Category | Examples |
|----------|----------|
| **Strings** | Acoustic Guitar, Electric Guitar, Violin, Cello, Bass, Harp, Mandolin |
| **Keys** | Piano, Synthesizer, Organ, Rhodes, Wurlitzer, Mellotron |
| **Percussion** | Drums, Electronic Drums, Congas, Bongos, Timpani, Vibraphone |
| **Wind** | Saxophone, Trumpet, Flute, Clarinet, Oboe, French Horn |
| **Electronic** | Synth Bass, Pad, Lead, Arpeggiator, Sampler, 808, 303 |
### Structure Analysis
- **Intro / Outro** - Opening and closing sections
- **Verse / Pre-Chorus / Chorus** - Main song structure
- **Bridge / Break** - Transitional sections
- **Build-up / Drop / Climax** - Dynamic progression
- **Interlude / Solo** - Instrumental passages
### Timbre Descriptions
| Dimension | Descriptors |
|-----------|-------------|
| **Texture** | Warm, Bright, Dark, Crisp, Muddy, Clean, Distorted, Saturated |
| **Space** | Reverberant, Dry, Spacious, Intimate, Cavernous, Tight |
| **Dynamics** | Punchy, Soft, Aggressive, Gentle, Compressed, Dynamic |
| **Character** | Ethereal, Gritty, Smooth, Raw, Polished, Organic, Synthetic |
## Use Cases
- **Music AI Training** - Generate high-quality captions for music generation models
- **Music Information Retrieval** - Create searchable metadata for audio databases
- **Content Moderation** - Analyze and categorize music content
- **Music Education** - Provide detailed analysis for learning purposes
- **Audio Production** - Document and describe sound design elements |