---
license: cc-by-nc-4.0
language:
- zh
- en
base_model:
- meta-llama/Llama-3.2-3B-Instruct
- HKUSTAudio/Llasa-3B
tags:
- Text-to-Speech
pipeline_tag: text-to-speech
---
<div>
<p style="margin-bottom: 0; margin-top: 0;">
<strong>See <a href="https://huggingface.co/collections/unsloth/text-to-speech-tts-models-68007ab12522e96be1e02155">our collection</a> for all our TTS model uploads.</strong>
</p>
<p style="margin-bottom: 0;">
<em>Learn to fine-tune TTS models - <a href="https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning">Read our Guide</a>.</em>
</p>
<p style="margin-top: 0;margin-bottom: 0;">
<em><a href="https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-gguf">Unsloth Dynamic 2.0</a> achieves superior accuracy & outperforms other leading quants.</em>
</p>
<div style="display: flex; gap: 5px; align-items: center; ">
<a href="https://github.com/unslothai/unsloth/">
<img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="133">
</a>
<a href="https://discord.gg/unsloth">
<img src="https://github.com/unslothai/unsloth/raw/main/images/Discord%20button.png" width="173">
</a>
<a href="https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning">
<img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="143">
</a>
</div>
<h1 style="margin-top: 0rem;">✨ Run & Fine-tune TTS models with Unsloth!</h1>
</div>

- Fine-tune TTS models for free using our [Google Colab notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks#text-to-speech-tts-notebooks)!
- Read our blog about TTS support: [unsloth.ai/blog/tts](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning)

| Unsloth supports | Free Notebooks | Performance | Memory use |
|------------------|----------------|-------------|------------|
| **Llasa-3B** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llasa_TTS_(3B).ipynb) | 1.5x faster | 58% less |
| **Whisper Large V3** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Whisper.ipynb) | 1.5x faster | 50% less |
| **Qwen3 (14B)** | [▶️ Start on Colab](https://docs.unsloth.ai/get-started/unsloth-notebooks) | 2x faster | 70% less |
| **Llama 3.2 Vision (11B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb) | 1.8x faster | 50% less |


[Paper on arXiv](https://arxiv.org/abs/2502.04128)

**Update (2025-05-10):** Sometimes I find that `top_p=0.95` and `temperature=0.9` produce more stable results.
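
For reference, the two sampling settings mentioned on this card can be collected as keyword-argument dicts for `model.generate` (the variable names below are illustrative, not part of the original code):

```python
# Illustrative names; the values are the two configurations mentioned on this card.
default_sampling = dict(do_sample=True, top_p=1.0, temperature=0.8)  # used in the examples below
stable_sampling = dict(do_sample=True, top_p=0.95, temperature=0.9)  # sometimes more stable

# Usage sketch (model, input_ids, and speech_end_id as defined in the examples below):
# outputs = model.generate(input_ids, max_length=2048,
#                          eos_token_id=speech_end_id, **stable_sampling)
```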

**Update (2025-02-13):** Added [Llasa fine-tuning instructions](https://github.com/zhenye234/LLaSA_training/tree/main/finetune).

**Update (2025-02-07):** Our paper has been released!

LLaSA: Scaling Train-Time and Inference-Time Compute for LLaMA-based Speech Synthesis

- **Train from Scratch**: If you want to train the model from scratch, use the [LLaSA Training Repository](https://github.com/zhenye234/LLaSA_training).

- **Scale for Test-Time Computation**: If you want to experiment with scaling for test-time computation, use the [LLaSA Testing Repository](https://github.com/zhenye234/LLaSA_inference).

## Model Information
Our model, Llasa, is a text-to-speech (TTS) system that extends the text-based LLaMA (1B, 3B, and 8B) language models by incorporating speech tokens from the XCodec2 codebook,
which contains 65,536 tokens. We trained Llasa on a dataset comprising 250,000 hours of Chinese and English speech data.
The model can generate speech **either solely from input text or by utilizing a given speech prompt.**

The method is seamlessly compatible with the Llama framework, making TTS training similar to LLM training: audio is converted into single-codebook tokens, which are simply treated as a special language. This opens up the possibility of applying existing LLM methods for compression, acceleration, and fine-tuning.

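The "special language" framing boils down to a reversible mapping between XCodec2 codec ids and speech-token strings in the LLM vocabulary. A minimal sketch, mirroring the helper functions used in the examples below (the ids here are illustrative; real ids come from XCodec2, in the range 0–65535):

```python
# Map codec ids to "speech words" in the LLM vocabulary, and back.
def ids_to_speech_tokens(speech_ids):
    # 123 -> "<|s_123|>"
    return [f"<|s_{i}|>" for i in speech_ids]

def extract_speech_ids(speech_tokens_str):
    # "<|s_123|>" -> 123; silently skips anything that is not a speech token
    return [int(t[4:-2]) for t in speech_tokens_str
            if t.startswith("<|s_") and t.endswith("|>")]

tokens = ids_to_speech_tokens([0, 123, 65535])
print(tokens)                       # ['<|s_0|>', '<|s_123|>', '<|s_65535|>']
print(extract_speech_ids(tokens))   # [0, 123, 65535]
```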
## How to use
Install [XCodec2](https://huggingface.co/HKUSTAudio/xcodec2).

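For a typical setup, XCodec2 can be installed from PyPI (the package name below is assumed; check the XCodec2 model card for the exact, pinned requirements):

```shell
# Install the XCodec2 codec used to tokenize/detokenize speech
# (package name assumed to be "xcodec2" on PyPI; see the XCodec2 model card)
pip install xcodec2
```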
**1. Speech synthesis solely from input text**
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import soundfile as sf

llasa_3b = 'HKUSTAudio/Llasa-3B'

tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
model = AutoModelForCausalLM.from_pretrained(llasa_3b)
model.eval()
model.to('cuda')

from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "HKUSTAudio/xcodec2"

Codec_model = XCodec2Model.from_pretrained(model_path)
Codec_model.eval().cuda()

input_text = 'Dealing with family secrets is never easy. Yet, sometimes, omission is a form of protection, intending to safeguard some from the harsh truths. One day, I hope you understand the reasons behind my actions. Until then, Anna, please, bear with me.'
# input_text = '突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道:"我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"'

def ids_to_speech_tokens(speech_ids):
    # Convert int ids like 23456 to token strings like <|s_23456|>
    speech_tokens_str = []
    for speech_id in speech_ids:
        speech_tokens_str.append(f"<|s_{speech_id}|>")
    return speech_tokens_str

def extract_speech_ids(speech_tokens_str):
    # Convert token strings like <|s_23456|> back to int ids like 23456
    speech_ids = []
    for token_str in speech_tokens_str:
        if token_str.startswith('<|s_') and token_str.endswith('|>'):
            num_str = token_str[4:-2]
            speech_ids.append(int(num_str))
        else:
            print(f"Unexpected token: {token_str}")
    return speech_ids

# TTS start!
with torch.no_grad():
    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"

    # Tokenize the text
    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}
    ]

    input_ids = tokenizer.apply_chat_template(
        chat,
        tokenize=True,
        return_tensors='pt',
        continue_final_message=True
    )
    input_ids = input_ids.to('cuda')
    speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')

    # Generate the speech autoregressively
    outputs = model.generate(
        input_ids,
        max_length=2048,  # We trained our model with a max length of 2048
        eos_token_id=speech_end_id,
        do_sample=True,
        top_p=1,          # Adjusts the diversity of generated content
        temperature=0.8,  # Controls randomness in output
    )
    # Extract the speech tokens (drop the text prompt and the final EOS token)
    generated_ids = outputs[0][input_ids.shape[1]:-1]

    speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

    # Convert tokens like <|s_23456|> to ints like 23456
    speech_tokens = extract_speech_ids(speech_tokens)

    speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)

    # Decode the speech tokens to a speech waveform
    gen_wav = Codec_model.decode_code(speech_tokens)

sf.write("gen.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
```

**2. Speech synthesis utilizing a given speech prompt**

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import soundfile as sf

llasa_3b = 'HKUSTAudio/Llasa-3B'

tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
model = AutoModelForCausalLM.from_pretrained(llasa_3b)
model.eval()
model.to('cuda')

from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "HKUSTAudio/xcodec2"

Codec_model = XCodec2Model.from_pretrained(model_path)
Codec_model.eval().cuda()

# Only 16 kHz speech is supported!
prompt_wav, sr = sf.read("太乙真人.wav")  # you can find this wav under Files in this repo
# prompt_wav, sr = sf.read("Anna.wav")  # English prompt
prompt_wav = torch.from_numpy(prompt_wav).float().unsqueeze(0)

prompt_text = "对,这就是我万人敬仰的太乙真人,虽然有点婴儿肥,但也掩不住我逼人的帅气。"
# prompt_text = "A chance to leave him alone, but... No. She just wanted to see him again. Anna, you don't know how it feels to lose a sister. Anna, I'm sorry, but your father asked me not to tell you anything."
target_text = '突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道:"我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"'
# target_text = "Dealing with family secrets is never easy. Yet, sometimes, omission is a form of protection, intending to safeguard some from the harsh truths. One day, I hope you understand the reasons behind my actions. Until then, Anna, please, bear with me."
input_text = prompt_text + target_text

def ids_to_speech_tokens(speech_ids):
    # Convert int ids like 12345 to token strings like <|s_12345|>
    speech_tokens_str = []
    for speech_id in speech_ids:
        speech_tokens_str.append(f"<|s_{speech_id}|>")
    return speech_tokens_str

def extract_speech_ids(speech_tokens_str):
    # Convert token strings like <|s_23456|> back to int ids like 23456
    speech_ids = []
    for token_str in speech_tokens_str:
        if token_str.startswith('<|s_') and token_str.endswith('|>'):
            num_str = token_str[4:-2]
            speech_ids.append(int(num_str))
        else:
            print(f"Unexpected token: {token_str}")
    return speech_ids

# TTS start!
with torch.no_grad():
    # Encode the prompt wav
    vq_code_prompt = Codec_model.encode_code(input_waveform=prompt_wav)
    print("Prompt VQ code shape:", vq_code_prompt.shape)

    vq_code_prompt = vq_code_prompt[0, 0, :]
    # Convert int ids like 12345 to tokens like <|s_12345|>
    speech_ids_prefix = ids_to_speech_tokens(vq_code_prompt)

    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"

    # Tokenize the text and the speech prefix
    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + ''.join(speech_ids_prefix)}
    ]

    input_ids = tokenizer.apply_chat_template(
        chat,
        tokenize=True,
        return_tensors='pt',
        continue_final_message=True
    )
    input_ids = input_ids.to('cuda')
    speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')

    # Generate the speech autoregressively
    outputs = model.generate(
        input_ids,
        max_length=2048,  # We trained our model with a max length of 2048
        eos_token_id=speech_end_id,
        do_sample=True,
        top_p=1,
        temperature=0.8,
    )
    # Extract the speech tokens (keep the prompt's speech prefix, drop the final EOS token)
    generated_ids = outputs[0][input_ids.shape[1] - len(speech_ids_prefix):-1]

    speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

    # Convert tokens like <|s_23456|> to ints like 23456
    speech_tokens = extract_speech_ids(speech_tokens)

    speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)

    # Decode the speech tokens to a speech waveform
    gen_wav = Codec_model.decode_code(speech_tokens)

    # If you only need the generated part:
    # gen_wav = gen_wav[:, :, prompt_wav.shape[1]:]

sf.write("gen.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
```

## Disclaimer

This model is licensed under CC BY-NC 4.0, which prohibits commercial use, due to ethics and privacy concerns; detected violations will result in legal consequences.

This codebase must not be used for any illegal purpose in any country or region. Please refer to your local laws regarding the DMCA and other related regulations.