stepfun-ai/Step-Audio-TTS-3B

@@ -1,10 +1,12 @@
 ---
 license: apache-2.0
 pipeline_tag: text-to-speech
+library_name: transformers
 ---
 # Step-Audio-TTS-3B
-Step-Audio-TTS-3B represents the industry's first Text-to-Speech (TTS) model trained on a large-scale synthetic dataset utilizing the LLM-Chat paradigm. It has achieved SOTA Character Error Rate (CER) results on the SEED TTS Eval benchmark. The model supports multiple languages, a variety of emotional expressions, and diverse voice style controls. Notably, Step-Audio-TTS-3B is also the first TTS model in the industry capable of generating RAP and Humming, marking a significant advancement in the field of speech synthesis.
+[Step-Audio-TTS-3B](https://huggingface.co/papers/2502.11946) represents the industry's first Text-to-Speech (TTS) model trained on a large-scale synthetic dataset utilizing the LLM-Chat paradigm. It has achieved SOTA Character Error Rate (CER) results on the SEED TTS Eval benchmark. The model supports multiple languages, a variety of emotional expressions, and diverse voice style controls. Notably, Step-Audio-TTS-3B is also the first TTS model in the industry capable of generating RAP and Humming, marking a significant advancement in the field of speech synthesis.
 This repository provides the model weights for StepAudio-TTS-3B, which is a dual-codebook trained LLM (Large Language Model) for text-to-speech synthesis. Additionally, it includes a vocoder trained using the dual-codebook approach, as well as a specialized vocoder specifically optimized for humming generation. These resources collectively enable high-quality speech synthesis and humming capabilities, leveraging the advanced dual-codebook training methodology.
@@ -160,4 +162,4 @@ This repository provides the model weights for StepAudio-TTS-3B, which is a dual
 </table>
 # More information
-For more information, please refer to our repository: [Step-Audio](https://github.com/stepfun-ai/Step-Audio).
+For more information, please refer to our repository: [Step-Audio](https://github.com/stepfun-ai/Step-Audio) and the [paper](https://huggingface.co/papers/2502.11946).
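
The metadata part of this change adds `library_name: transformers` to the card's YAML front matter, next to the existing `license` and `pipeline_tag` keys. As a rough illustration of what that block carries, here is a minimal sketch that reads such a header using only the Python standard library. It assumes a flat `key: value` block between two `---` lines; the helper name `parse_front_matter` is hypothetical, and a real model card with nested or list-valued metadata would need a full YAML parser instead.

```python
def parse_front_matter(card_text):
    """Return flat key/value pairs from a leading '---' delimited block.

    Hypothetical helper for illustration only: it handles simple
    'key: value' lines and is not a general YAML parser.
    """
    lines = card_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # the text does not start with a front matter block

    metadata = {}
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            return metadata  # closing delimiter reached
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = stripped.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    return {}  # block was never closed, so treat it as absent


# The front matter as it reads after this change.
card = """---
license: apache-2.0
pipeline_tag: text-to-speech
library_name: transformers
---
# Step-Audio-TTS-3B
"""

metadata = parse_front_matter(card)
print(metadata)
```

Splitting on the first colon only keeps values that themselves contain a colon intact, which matters if a URL is ever added to the header.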