---
license: apache-2.0
library_name: funasr
pipeline_tag: feature-extraction
---
# Step-Audio-Tokenizer
This repository contains the tokenizer model described in the paper [Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction](https://arxiv.org/abs/2502.11946).
Step-Audio LLM is the industry’s first 130-billion-parameter humanlike unified end-to-end model that integrates multimodal speech understanding and generation capabilities, including singing voice synthesis, tool utilization, role-play, and multilingual/dialectal comprehension and synthesis.
This repository provides the speech tokenizer component of Step-Audio LLM. For linguistic tokenization, we utilize the output from the Paraformer encoder, which is quantized into discrete representations at a token rate of 16.7 Hz. For semantic tokenization, we employ CosyVoice’s tokenizer, specifically designed to efficiently encode features essential for generating natural and expressive speech outputs, operating at a token rate of 25 Hz.
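To make the two token rates concrete, here is a minimal sketch that estimates how many tokens each tokenizer would emit for a clip of a given length. It only does arithmetic on the rates quoted above (16.7 Hz and 25 Hz); the function name `estimate_token_counts` is hypothetical and is not part of this repository or of FunASR, and real token counts may differ slightly because of framing and padding.

```python
# Token rates taken from the description above.
LINGUISTIC_RATE_HZ = 16.7  # quantized Paraformer encoder output
SEMANTIC_RATE_HZ = 25.0    # CosyVoice tokenizer


def estimate_token_counts(duration_s: float) -> dict:
    """Roughly estimate token counts for an audio clip (hypothetical helper).

    This is plain rate * duration arithmetic, rounded to the nearest integer.
    It does not run either tokenizer.
    """
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    return {
        "linguistic": round(duration_s * LINGUISTIC_RATE_HZ),
        "semantic": round(duration_s * SEMANTIC_RATE_HZ),
    }


if __name__ == "__main__":
    counts = estimate_token_counts(10.0)
    print(counts)
    # Ratio of the two rates, useful when aligning the two token streams.
    print(LINGUISTIC_RATE_HZ / SEMANTIC_RATE_HZ)
```

Under these assumptions, the linguistic stream carries roughly two tokens for every three semantic tokens over the same span of audio.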
## More information
For more information, please refer to our repository: [Step-Audio](https://github.com/stepfun-ai/Step-Audio).