---
library_name: transformers
datasets:
- malaysia-ai/Multilingual-TTS
- Scicom-intl/Emilia-YODAS-Voice-Conversion
- Scicom-intl/Malaysian-Emilia
base_model:
- Qwen/Qwen3-1.7B-Base
new_version: Scicom-intl/Multilingual-TTS-1.7B-Base
language:
- en
- ms
- zh
- ta
---

# Multilingual-TTS-1.7B-Base

Continued pretraining of [Qwen/Qwen3-1.7B-Base](https://huggingface.co/Qwen/Qwen3-1.7B-Base) on multilingual Voice Conversion and TTS.

1. Uses [neucodec](https://huggingface.co/neuphonic/neucodec) as the speech detokenizer: 50 tokens per second, output at a 24 kHz sample rate.
2. Multi-speaker multilingual Voice Cloning, **up to 35.88B tokens**.
3. Multi-speaker multilingual TTS covering more than 150 languages, **up to 14.64B tokens**.
4. Flash Attention 3, 10k context length, varlen multipacking.
5. Mixed precision FP32-BF16.
6. MuonAdamW optimizer.

## How to

First, load Neucodec:

```python
from neucodec import NeuCodec

codec = NeuCodec.from_pretrained("neuphonic/neucodec")
_ = codec.eval().to('cuda')
```

### TTS

You can use any speaker name available at https://huggingface.co/datasets/malaysia-ai/Multilingual-TTS

```python
import re

import torch
import soundfile as sf
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('Scicom-intl/Multilingual-TTS-1.7B-Base')
tokenizer = AutoTokenizer.from_pretrained('Scicom-intl/Multilingual-TTS-1.7B-Base')

speaker = 'husein'
text = "Hi nama saya Husein, I am so cute, 我喜欢吃鸡饭, boire du thé glacé, ולהירגע על החוף, وأحب أن أتعرض لبعض أشعة الشمس."
prompt = f"<|im_start|>{speaker}: {text}<|speech_start|>"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=True,
        temperature=0.8,
        repetition_penalty=1.15,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
audio_tokens = re.findall(r'<\|s_(\d+)\|>', generated_text.split('<|speech_start|>')[1])
audio_tokens = [int(token) for token in audio_tokens]
audio_codes = torch.tensor(audio_tokens)[None, None]

with torch.no_grad():
    audio_waveform = codec.decode_code(audio_codes.cuda())

sf.write('husein-ms-en-zh-fr-he-ar.mp3', audio_waveform[0, 0].cpu().numpy(), 24000)
```

You can check the audio at [husein-ms-en-zh-fr-he-ar.mp3](husein-ms-en-zh-fr-he-ar.mp3).

### Voice Cloning

Jenny from https://huggingface.co/datasets/reach-vb/jenny_tts_dataset

```python
import librosa

y, sr = librosa.load('jenny.wav', sr=16000)
with torch.no_grad():
    codes = codec.encode_code(torch.tensor(y)[None, None])

tokens = ''.join([f'<|s_{i}|>' for i in codes[0, 0]])
prompt = f"<|im_start|>I wonder if I shall ever be happy enough to have real lace on my clothes and bows on my caps.<|speech_start|>{tokens}<|im_end|><|im_start|>Ye encik, apa yang saya boleh tolong? வணக்கம், நான் உங்களுக்கு என்ன உதவ வேண்டும்? Quieres pedir algo de comida? それとも飲み物も欲しいですか?<|speech_start|>"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=True,
        temperature=0.8,
        repetition_penalty=1.15,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
audio_tokens = re.findall(r'<\|s_(\d+)\|>', generated_text.split('<|speech_start|>')[-1])
audio_tokens = [int(token) for token in audio_tokens]
audio_codes = torch.tensor(audio_tokens)[None, None]

with torch.no_grad():
    audio_waveform = codec.decode_code(audio_codes.cuda())

sf.write('vc-jenny-ms-ta-es-ja.mp3', audio_waveform[0, 0].cpu().numpy(), 24000)
```

You can check the audio at [vc-jenny-ms-ta-es-ja.mp3](vc-jenny-ms-ta-es-ja.mp3).

## Optimize Inference

For better concurrency, you can use https://github.com/Scicom-AI-Enterprise-Organization/TTS-API-Neucodec

## Source code

All ablations and steps to reproduce are at https://github.com/Scicom-AI-Enterprise-Organization/Multilingual-TTS

## Acknowledgement

Special thanks to https://www.scitix.ai/ for the H100 node!
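
Since the speech detokenizer runs at 50 tokens per second (as noted above), you can sanity-check a generation by estimating clip duration from the number of `<|s_...|>` tokens before decoding. A minimal sketch — `estimate_duration` is a hypothetical helper for illustration, not part of the released code:

```python
import re

# neucodec emits 50 speech tokens per second, so duration ≈ token count / 50.
TOKENS_PER_SECOND = 50

def estimate_duration(generated_text: str) -> float:
    """Estimate output audio length in seconds from the generated speech tokens.

    Hypothetical helper: counts <|s_N|> tokens after the last <|speech_start|>.
    """
    audio_tokens = re.findall(r'<\|s_(\d+)\|>', generated_text.split('<|speech_start|>')[-1])
    return len(audio_tokens) / TOKENS_PER_SECOND

# 100 speech tokens correspond to roughly 2 seconds of 24 kHz audio.
example = '<|speech_start|>' + '<|s_1|>' * 100
print(estimate_duration(example))  # 2.0
```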