|
|
--- |
|
|
license: mit |
|
|
pipeline_tag: audio-text-to-text |
|
|
library_name: transformers |
|
|
tags: |
|
|
- music |
|
|
- audio |
|
|
--- |
|
|
|
|
|
<a href="https://arxiv.org/abs/2602.00744">Tech Report</a> |
|
|
|
|
|
# ACE-Step Transcriber |
|
|
|
|
|
## Description |
|
|
|
|
|
ACE-Step Transcriber is the annotation model used by **ACE-Step v1.5** for training data labeling. It is a powerful multilingual audio transcription model capable of transcribing both **speech** and **singing voice** with high accuracy. |
|
|
|
|
|
### Key Features |
|
|
|
|
|
- ๐ **50+ Languages Support** - Covers major world languages and regional dialects |
|
|
- ๐ค **Speech Transcription** - Accurately transcribes spoken content |
|
|
- ๐ต **Singing Voice Transcription** - Specialized in lyrics transcription with musical structure annotations |
|
|
- ๐ท๏ธ **Structure Annotation** - Automatically identifies song sections (verse, chorus, bridge, etc.) |
|
|
|
|
|
## Usage |
|
|
|
|
|
The usage is the same as [Qwen2.5 Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B). |
|
|
|
|
|
### Prompt Format |
|
|
|
|
|
Use the following prompt to transcribe audio: |
|
|
|
|
|
``` |
|
|
*Task* Transcribe this audio in detail |
|
|
<audio> |
|
|
``` |
|
|
|
|
|
### Output Format |
|
|
|
|
|
The model outputs structured content in the following format: |
|
|
|
|
|
``` |
|
|
# Languages |
|
|
<language_code> |
|
|
|
|
|
# Lyrics |
|
|
[Section Tag - Optional Instrument] |
|
|
|
|
|
<transcribed content> |
|
|
... |
|
|
``` |
|
|
|
|
|
### Example Output |
|
|
|
|
|
``` |
|
|
# Languages |
|
|
en |
|
|
|
|
|
# Lyrics |
|
|
[Intro - Acoustic Guitar] |
|
|
|
|
|
[Verse 1] |
|
|
Walking down the empty street tonight |
|
|
Stars are shining oh so bright |
|
|
... |
|
|
|
|
|
[Chorus] |
|
|
This is where we belong |
|
|
Singing our favorite song |
|
|
... |
|
|
``` |
|
|
|
|
|
### Supported Section Tags |
|
|
|
|
|
- `[Intro]`, `[Outro]` |
|
|
- `[Verse 1]`, `[Verse 2]`, etc. |
|
|
- `[Chorus]`, `[Pre-Chorus]`, `[Post-Chorus]` |
|
|
- `[Bridge]` |
|
|
- `[Guitar Interlude]`, `[Instrumental]` |
|
|
- `[Spoken]` |
|
|
|
|
|
### Supported Languages (50+) |
|
|
|
|
|
The model supports transcription in over 50 languages, including but not limited to: |
|
|
|
|
|
| Region | Languages | |
|
|
|--------|-----------| |
|
|
| **East Asia** | Chinese (zh), Japanese (ja), Korean (ko) | |
|
|
| **Southeast Asia** | Vietnamese (vi), Thai (th), Indonesian (id), Malay (ms), Filipino (tl) | |
|
|
| **South Asia** | Hindi (hi), Bengali (bn), Tamil (ta), Urdu (ur) | |
|
|
| **Europe** | English (en), German (de), French (fr), Spanish (es), Italian (it), Portuguese (pt), Russian (ru), Polish (pl), Dutch (nl), Greek (el), Turkish (tr) | |
|
|
| **Middle East** | Arabic (ar), Hebrew (he), Persian (fa) | |
|
|
| **Others** | And many more regional languages... | |
|
|
|
|
|
## Use Cases |
|
|
|
|
|
- **Music Production** - Transcribe reference tracks for lyrics extraction |
|
|
- **Dataset Creation** - Generate high-quality labeled data for music AI models |
|
|
- **Accessibility** - Create subtitles and captions for audio content |
|
|
- **Music Analysis** - Extract structural information from songs |