acestep-captioner / README.md

Update README.md

7109e9a verified 3 days ago

3.99 kB

	---
	license: mit
	library_name: transformers
	tags:
	- music
	- audio
	---

	<a href="https://arxiv.org/abs/2602.00744">Tech Report</a>

	# ACE-Step Captioner

	## Description

	ACE-Step Captioner is the annotation model used by ACE-Step v1.5 for training data labeling. It is a professional-grade music captioning model that generates detailed, structured descriptions of audio content.

	### Performance

	🏆 Accuracy surpasses Gemini Pro 2.5 in music description tasks

	### Key Features

	- 🎼 Musical Style Analysis - Identifies genres, sub-genres, and stylistic influences
	- 🎸 Instrument Recognition - Detects and describes 1000+ instrument types and combinations
	- 🎭 Structure & Progression - Analyzes musical arrangement including intro, verse, chorus, bridge, climax, and outro
	- 🔊 Timbre Description - Captures tonal qualities, textures, and sonic characteristics
	- 📝 Rich Vocabulary - Supports 1000+ descriptive terms for comprehensive music annotation

	## Usage

	The usage is the same as [Qwen2.5 Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B).

	### Prompt Format

	Use the following prompt to caption audio:

	```
	Task Describe this audio in detail
	<audio>
	```

	### Output Format

	The model generates natural language descriptions covering multiple aspects of the music.

	### Example Output

	```
	A melancholic indie folk track featuring fingerpicked acoustic guitar
	as the primary instrument. The song opens with a sparse, contemplative
	intro before the vocals enter with a breathy, intimate delivery.
	The arrangement gradually builds through the verse, adding subtle
	string pads and a gentle kick drum. The chorus lifts with layered
	harmonies and a warmer, fuller texture. The bridge introduces a
	key change and emotional climax before returning to the stripped-down
	acoustic arrangement for the outro.
	```

	## Descriptive Capabilities

	### Musical Styles (Examples)

	\| Category \| Styles \|
	\|----------\|--------\|
	\| Electronic \| Ambient, Techno, House, Drum & Bass, Synthwave, IDM, Downtempo \|
	\| Rock \| Alternative, Indie, Post-Rock, Progressive, Psychedelic, Grunge \|
	\| Pop \| Synth-pop, Electropop, Dream Pop, Art Pop, Indie Pop \|
	\| Classical \| Orchestral, Chamber, Minimalist, Neo-Classical, Cinematic \|
	\| World \| Latin, African, Middle Eastern, Asian Traditional, Celtic \|
	\| Jazz \| Fusion, Smooth, Bebop, Modal, Free Jazz \|
	\| Hip-Hop \| Trap, Boom Bap, Lo-fi, Instrumental, Cloud Rap \|

	### Instruments (1000+ Supported)

	\| Category \| Examples \|
	\|----------\|----------\|
	\| Strings \| Acoustic Guitar, Electric Guitar, Violin, Cello, Bass, Harp, Mandolin \|
	\| Keys \| Piano, Synthesizer, Organ, Rhodes, Wurlitzer, Mellotron \|
	\| Percussion \| Drums, Electronic Drums, Congas, Bongos, Timpani, Vibraphone \|
	\| Wind \| Saxophone, Trumpet, Flute, Clarinet, Oboe, French Horn \|
	\| Electronic \| Synth Bass, Pad, Lead, Arpeggiator, Sampler, 808, 303 \|

	### Structure Analysis

	- Intro / Outro - Opening and closing sections
	- Verse / Pre-Chorus / Chorus - Main song structure
	- Bridge / Break - Transitional sections
	- Build-up / Drop / Climax - Dynamic progression
	- Interlude / Solo - Instrumental passages

	### Timbre Descriptions

	\| Dimension \| Descriptors \|
	\|-----------\|-------------\|
	\| Texture \| Warm, Bright, Dark, Crisp, Muddy, Clean, Distorted, Saturated \|
	\| Space \| Reverberant, Dry, Spacious, Intimate, Cavernous, Tight \|
	\| Dynamics \| Punchy, Soft, Aggressive, Gentle, Compressed, Dynamic \|
	\| Character \| Ethereal, Gritty, Smooth, Raw, Polished, Organic, Synthetic \|

	## Use Cases

	- Music AI Training - Generate high-quality captions for music generation models
	- Music Information Retrieval - Create searchable metadata for audio databases
	- Content Moderation - Analyze and categorize music content
	- Music Education - Provide detailed analysis for learning purposes
	- Audio Production - Document and describe sound design elements

	---
	license: mit
	library_name: transformers
	tags:
	- music
	- audio
	---

	<a href="https://arxiv.org/abs/2602.00744">Tech Report</a>

	# ACE-Step Captioner

	## Description

	ACE-Step Captioner is the annotation model used by ACE-Step v1.5 for training data labeling. It is a professional-grade music captioning model that generates detailed, structured descriptions of audio content.

	### Performance

	🏆 Accuracy surpasses Gemini Pro 2.5 in music description tasks

	### Key Features

	- 🎼 Musical Style Analysis - Identifies genres, sub-genres, and stylistic influences
	- 🎸 Instrument Recognition - Detects and describes 1000+ instrument types and combinations
	- 🎭 Structure & Progression - Analyzes musical arrangement including intro, verse, chorus, bridge, climax, and outro
	- 🔊 Timbre Description - Captures tonal qualities, textures, and sonic characteristics
	- 📝 Rich Vocabulary - Supports 1000+ descriptive terms for comprehensive music annotation

	## Usage

	The usage is the same as [Qwen2.5 Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B).

	### Prompt Format

	Use the following prompt to caption audio:

	```
	Task Describe this audio in detail
	<audio>
	```

	### Output Format

	The model generates natural language descriptions covering multiple aspects of the music.

	### Example Output

	```
	A melancholic indie folk track featuring fingerpicked acoustic guitar
	as the primary instrument. The song opens with a sparse, contemplative
	intro before the vocals enter with a breathy, intimate delivery.
	The arrangement gradually builds through the verse, adding subtle
	string pads and a gentle kick drum. The chorus lifts with layered
	harmonies and a warmer, fuller texture. The bridge introduces a
	key change and emotional climax before returning to the stripped-down
	acoustic arrangement for the outro.
	```

	## Descriptive Capabilities

	### Musical Styles (Examples)

	\| Category \| Styles \|
	\|----------\|--------\|
	\| Electronic \| Ambient, Techno, House, Drum & Bass, Synthwave, IDM, Downtempo \|
	\| Rock \| Alternative, Indie, Post-Rock, Progressive, Psychedelic, Grunge \|
	\| Pop \| Synth-pop, Electropop, Dream Pop, Art Pop, Indie Pop \|
	\| Classical \| Orchestral, Chamber, Minimalist, Neo-Classical, Cinematic \|
	\| World \| Latin, African, Middle Eastern, Asian Traditional, Celtic \|
	\| Jazz \| Fusion, Smooth, Bebop, Modal, Free Jazz \|
	\| Hip-Hop \| Trap, Boom Bap, Lo-fi, Instrumental, Cloud Rap \|

	### Instruments (1000+ Supported)

	\| Category \| Examples \|
	\|----------\|----------\|
	\| Strings \| Acoustic Guitar, Electric Guitar, Violin, Cello, Bass, Harp, Mandolin \|
	\| Keys \| Piano, Synthesizer, Organ, Rhodes, Wurlitzer, Mellotron \|
	\| Percussion \| Drums, Electronic Drums, Congas, Bongos, Timpani, Vibraphone \|
	\| Wind \| Saxophone, Trumpet, Flute, Clarinet, Oboe, French Horn \|
	\| Electronic \| Synth Bass, Pad, Lead, Arpeggiator, Sampler, 808, 303 \|

	### Structure Analysis

	- Intro / Outro - Opening and closing sections
	- Verse / Pre-Chorus / Chorus - Main song structure
	- Bridge / Break - Transitional sections
	- Build-up / Drop / Climax - Dynamic progression
	- Interlude / Solo - Instrumental passages

	### Timbre Descriptions

	\| Dimension \| Descriptors \|
	\|-----------\|-------------\|
	\| Texture \| Warm, Bright, Dark, Crisp, Muddy, Clean, Distorted, Saturated \|
	\| Space \| Reverberant, Dry, Spacious, Intimate, Cavernous, Tight \|
	\| Dynamics \| Punchy, Soft, Aggressive, Gentle, Compressed, Dynamic \|
	\| Character \| Ethereal, Gritty, Smooth, Raw, Polished, Organic, Synthetic \|

	## Use Cases

	- Music AI Training - Generate high-quality captions for music generation models
	- Music Information Retrieval - Create searchable metadata for audio databases
	- Content Moderation - Analyze and categorize music content
	- Music Education - Provide detailed analysis for learning purposes
	- Audio Production - Document and describe sound design elements