| license: mit | |
| language: | |
| - ar | |
| - da | |
| - de | |
| - el | |
| - en | |
| - es | |
| - fi | |
| - fr | |
| - he | |
| - hi | |
| - it | |
| - ja | |
| - ko | |
| - ms | |
| - nl | |
| - 'no' | |
| - pl | |
| - pt | |
| - ru | |
| - sv | |
| - sw | |
| - tr | |
| - zh | |
| pipeline_tag: text-to-speech | |
| tags: | |
| - text-to-speech | |
| - speech | |
| - speech-generation | |
| - voice-cloning | |
| - multilingual-tts | |
| library_name: chatterbox | |
| base_model: | |
| - ResembleAI/chatterbox | |
| <img width="800" alt="cb-big2" src="https://github.com/user-attachments/assets/bd8c5f03-e91d-4ee5-b680-57355da204d1" /> | |
| <h1 style="font-size: 32px">Chatterbox TTS</h1> | |
| <div style="display: flex; align-items: center; gap: 12px"> | |
| <a href="https://resemble-ai.github.io/chatterbox_demopage/"> | |
| <img src="https://img.shields.io/badge/listen-demo_samples-blue" alt="Listen to Demo Samples" /> | |
| </a> | |
| <a href="https://huggingface.co/spaces/ResembleAI/Chatterbox"> | |
| <img src="https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm.svg" alt="Open in HF Spaces" /> | |
| </a> | |
| <a href="https://podonos.com/resembleai/chatterbox"> | |
| <img src="https://static-public.podonos.com/badges/insight-on-pdns-sm-dark.svg" alt="Insight on Podonos" /> | |
| </a> | |
| </div> | |
| <div style="display: flex; align-items: center; gap: 8px;"> | |
| <img width="100" alt="resemble-logo-horizontal" src="https://github.com/user-attachments/assets/35cf756b-3506-4943-9c72-c05ddfa4e525" /> | |
| </div> | |
| **Chatterbox Multilingual** is [Resemble AI's](https://resemble.ai) production-grade open source TTS model. It supports 23 languages out of the box: **Arabic**, **Danish**, **German**, **Greek**, **English**, **Spanish**, **Finnish**, **French**, **Hebrew**, **Hindi**, **Italian**, **Japanese**, **Korean**, **Malay**, **Dutch**, **Norwegian**, **Polish**, **Portuguese**, **Russian**, **Swedish**, **Swahili**, **Turkish**, and **Chinese**. Licensed under MIT, Chatterbox has been benchmarked against leading closed-source systems like ElevenLabs, and is consistently preferred in side-by-side evaluations. | |
| Whether you're working on memes, videos, games, or AI agents, Chatterbox brings your content to life. It's also the first open source TTS model to support **emotion exaggeration control**, a powerful feature that makes your voices stand out. | |
| This repository provides Chatterbox exported to ONNX, enabling fast and portable inference with ONNX Runtime across platforms. | |
| # Key Details | |
| - SoTA zero-shot English TTS | |
| - 0.5B Llama backbone | |
| - Unique exaggeration/intensity control | |
| - Ultra-stable with alignment-informed inference | |
| - Trained on 0.5M hours of cleaned data | |
| - Watermarked outputs (optional) | |
| - Easy voice conversion script using onnxruntime | |
| - [Outperforms ElevenLabs](https://podonos.com/resembleai/chatterbox) | |
| # Tips | |
| - **General Use (TTS and Voice Agents):** | |
| - The default settings (`exaggeration=0.5`, `cfg=0.5`) work well for most prompts. | |
| - **Expressive or Dramatic Speech:** | |
| - Try increasing `exaggeration` to around `0.7` or higher. | |
| - Higher `exaggeration` tends to speed up speech. | |
| # Usage | |
| [Link to GitHub ONNX Export and Inference script](https://github.com/VladOS95-cyber/onnx_conversion_scripts/tree/main/chatterbox) | |
| ```python | |
| # !pip install --upgrade onnxruntime==1.22.1 huggingface_hub==0.34.4 transformers==4.46.3 numpy==2.2.6 tqdm==4.67.1 librosa==0.11.0 soundfile==0.13.1 resemble-perth==1.0.1 | |
| # for Chinese, Japanese additionally pip install pkuseg==0.0.25 pykakasi==2.3.0 | |
| import onnxruntime | |
| from huggingface_hub import hf_hub_download | |
| from transformers import AutoTokenizer | |
| import numpy as np | |
| from tqdm import tqdm | |
| import librosa | |
| import soundfile as sf | |
| from unicodedata import category | |
| import json | |
| S3GEN_SR = 24000 | |
| START_SPEECH_TOKEN = 6561 | |
| STOP_SPEECH_TOKEN = 6562 | |
| SUPPORTED_LANGUAGES = { | |
| "ar": "Arabic", | |
| "da": "Danish", | |
| "de": "German", | |
| "el": "Greek", | |
| "en": "English", | |
| "es": "Spanish", | |
| "fi": "Finnish", | |
| "fr": "French", | |
| "he": "Hebrew", | |
| "hi": "Hindi", | |
| "it": "Italian", | |
| "ja": "Japanese", | |
| "ko": "Korean", | |
| "ms": "Malay", | |
| "nl": "Dutch", | |
| "no": "Norwegian", | |
| "pl": "Polish", | |
| "pt": "Portuguese", | |
| "ru": "Russian", | |
| "sv": "Swedish", | |
| "sw": "Swahili", | |
| "tr": "Turkish", | |
| "zh": "Chinese", | |
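| } | |
| # Lazily initialized helpers; they must exist at module level because the normalizers below read them via `global` | |
| _kakasi = None | |
| _dicta = None | |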
| class RepetitionPenaltyLogitsProcessor: | |
| def __init__(self, penalty: float): | |
| if not isinstance(penalty, float) or not (penalty > 0): | |
| raise ValueError(f"`penalty` must be a strictly positive float, but is {penalty}") | |
| self.penalty = penalty | |
| def __call__(self, input_ids: np.ndarray, scores: np.ndarray) -> np.ndarray: | |
| score = np.take_along_axis(scores, input_ids, axis=1) | |
| score = np.where(score < 0, score * self.penalty, score / self.penalty) | |
| scores_processed = scores.copy() | |
| np.put_along_axis(scores_processed, input_ids, score, axis=1) | |
| return scores_processed | |
| class ChineseCangjieConverter: | |
| """Converts Chinese characters to Cangjie codes for tokenization.""" | |
| def __init__(self): | |
| self.word2cj = {} | |
| self.cj2word = {} | |
| self.segmenter = None | |
| self._load_cangjie_mapping() | |
| self._init_segmenter() | |
| def _load_cangjie_mapping(self): | |
| """Load Cangjie mapping from HuggingFace model repository.""" | |
| try: | |
| cangjie_file = hf_hub_download( | |
| repo_id="onnx-community/chatterbox-multilingual-ONNX", | |
| filename="Cangjie5_TC.json", | |
| ) | |
| with open(cangjie_file, "r", encoding="utf-8") as fp: | |
| data = json.load(fp) | |
| for entry in data: | |
| word, code = entry.split("\t")[:2] | |
| self.word2cj[word] = code | |
| if code not in self.cj2word: | |
| self.cj2word[code] = [word] | |
| else: | |
| self.cj2word[code].append(word) | |
| except Exception as e: | |
| print(f"Could not load Cangjie mapping: {e}") | |
| def _init_segmenter(self): | |
| """Initialize pkuseg segmenter.""" | |
| try: | |
| from pkuseg import pkuseg | |
| self.segmenter = pkuseg() | |
| except ImportError: | |
| print("pkuseg not available - Chinese segmentation will be skipped") | |
| self.segmenter = None | |
| def _cangjie_encode(self, glyph: str): | |
| """Encode a single Chinese glyph to Cangjie code.""" | |
| normed_glyph = glyph | |
| code = self.word2cj.get(normed_glyph, None) | |
| if code is None: # e.g. Japanese hiragana | |
| return None | |
| index = self.cj2word[code].index(normed_glyph) | |
| index = str(index) if index > 0 else "" | |
| return code + str(index) | |
| def __call__(self, text): | |
| """Convert Chinese characters in text to Cangjie tokens.""" | |
| output = [] | |
| if self.segmenter is not None: | |
| segmented_words = self.segmenter.cut(text) | |
| full_text = " ".join(segmented_words) | |
| else: | |
| full_text = text | |
| for t in full_text: | |
| if category(t) == "Lo": | |
| cangjie = self._cangjie_encode(t) | |
| if cangjie is None: | |
| output.append(t) | |
| continue | |
| code = [] | |
| for c in cangjie: | |
| code.append(f"[cj_{c}]") | |
| code.append("[cj_.]") | |
| code = "".join(code) | |
| output.append(code) | |
| else: | |
| output.append(t) | |
| return "".join(output) | |
| def is_kanji(c: str) -> bool: | |
| """Check if character is kanji.""" | |
| return 19968 <= ord(c) <= 40959 | |
| def is_katakana(c: str) -> bool: | |
| """Check if character is katakana.""" | |
| return 12449 <= ord(c) <= 12538 | |
| def hiragana_normalize(text: str) -> str: | |
| """Japanese text normalization: converts kanji to hiragana; katakana remains the same.""" | |
| global _kakasi | |
| try: | |
| if _kakasi is None: | |
| import pykakasi | |
| _kakasi = pykakasi.kakasi() | |
| result = _kakasi.convert(text) | |
| out = [] | |
| for r in result: | |
| inp = r['orig'] | |
| hira = r["hira"] | |
| # Any kanji in the phrase | |
| if any([is_kanji(c) for c in inp]): | |
| if hira and hira[0] in ["は", "へ"]: # Safety check for empty hira | |
| hira = " " + hira | |
| out.append(hira) | |
| # All katakana | |
| elif all([is_katakana(c) for c in inp]) if inp else False: # Safety check for empty inp | |
| out.append(r['orig']) | |
| else: | |
| out.append(inp) | |
| normalized_text = "".join(out) | |
| # Decompose Japanese characters for tokenizer compatibility | |
| import unicodedata | |
| normalized_text = unicodedata.normalize('NFKD', normalized_text) | |
| return normalized_text | |
| except ImportError: | |
| print("pykakasi not available - Japanese text processing skipped") | |
| return text | |
| def add_hebrew_diacritics(text: str) -> str: | |
| """Hebrew text normalization: adds diacritics to Hebrew text.""" | |
| global _dicta | |
| try: | |
| if _dicta is None: | |
| from dicta_onnx import Dicta | |
| _dicta = Dicta() | |
| return _dicta.add_diacritics(text) | |
| except ImportError: | |
| print("dicta_onnx not available - Hebrew text processing skipped") | |
| return text | |
| except Exception as e: | |
| print(f"Hebrew diacritization failed: {e}") | |
| return text | |
| def korean_normalize(text: str) -> str: | |
| """Korean text normalization: decompose syllables into Jamo for tokenization.""" | |
| def decompose_hangul(char): | |
| """Decompose Korean syllable into Jamo components.""" | |
| if not ('\uac00' <= char <= '\ud7a3'): # precomposed Hangul syllables end at U+D7A3 | |
| return char | |
| # Hangul decomposition formula | |
| base = ord(char) - 0xAC00 | |
| initial = chr(0x1100 + base // (21 * 28)) | |
| medial = chr(0x1161 + (base % (21 * 28)) // 28) | |
| final = chr(0x11A7 + base % 28) if base % 28 > 0 else '' | |
| return initial + medial + final | |
| # Decompose syllables and strip surrounding whitespace | |
| result = ''.join(decompose_hangul(char) for char in text) | |
| return result.strip() | |
| def prepare_language(txt, language_id): | |
| # Normalize the code once so that e.g. "ZH" and "zh" take the same branch | |
| language_id = language_id.lower() if language_id else language_id | |
| # Language-specific text processing | |
| if language_id == 'zh': | |
| # Build the converter only for Chinese: it downloads the Cangjie table and loads pkuseg | |
| txt = ChineseCangjieConverter()(txt) | |
| elif language_id == 'ja': | |
| txt = hiragana_normalize(txt) | |
| elif language_id == 'he': | |
| txt = add_hebrew_diacritics(txt) | |
| elif language_id == 'ko': | |
| txt = korean_normalize(txt) | |
| # Prepend language token | |
| if language_id: | |
| txt = f"[{language_id}]{txt}" | |
| return txt | |
| def run_inference( | |
| text="The Lord of the Rings is the greatest work of literature.", | |
| language_id="en", | |
| target_voice_path=None, | |
| max_new_tokens=256, | |
| exaggeration=0.5, | |
| output_dir="converted", | |
| output_file_name="output.wav", | |
| apply_watermark=True, | |
| ): | |
| # Validate language_id | |
| if language_id and language_id.lower() not in SUPPORTED_LANGUAGES: | |
| supported_langs = ", ".join(SUPPORTED_LANGUAGES.keys()) | |
| raise ValueError( | |
| f"Unsupported language_id '{language_id}'. " | |
| f"Supported languages: {supported_langs}" | |
| ) | |
| model_id = "onnx-community/chatterbox-multilingual-ONNX" | |
| if not target_voice_path: | |
| target_voice_path = hf_hub_download(repo_id=model_id, filename="default_voice.wav", local_dir=output_dir) | |
| ## Load model | |
| speech_encoder_path = hf_hub_download(repo_id=model_id, filename="speech_encoder.onnx", local_dir=output_dir, subfolder='onnx') | |
| hf_hub_download(repo_id=model_id, filename="speech_encoder.onnx_data", local_dir=output_dir, subfolder='onnx') | |
| embed_tokens_path = hf_hub_download(repo_id=model_id, filename="embed_tokens.onnx", local_dir=output_dir, subfolder='onnx') | |
| hf_hub_download(repo_id=model_id, filename="embed_tokens.onnx_data", local_dir=output_dir, subfolder='onnx') | |
| conditional_decoder_path = hf_hub_download(repo_id=model_id, filename="conditional_decoder.onnx", local_dir=output_dir, subfolder='onnx') | |
| hf_hub_download(repo_id=model_id, filename="conditional_decoder.onnx_data", local_dir=output_dir, subfolder='onnx') | |
| language_model_path = hf_hub_download(repo_id=model_id, filename="language_model.onnx", local_dir=output_dir, subfolder='onnx') | |
| hf_hub_download(repo_id=model_id, filename="language_model.onnx_data", local_dir=output_dir, subfolder='onnx') | |
| ## Start inference sessions | |
| speech_encoder_session = onnxruntime.InferenceSession(speech_encoder_path) | |
| embed_tokens_session = onnxruntime.InferenceSession(embed_tokens_path) | |
| llama_with_past_session = onnxruntime.InferenceSession(language_model_path) | |
| cond_decoder_session = onnxruntime.InferenceSession(conditional_decoder_path) | |
| def execute_text_to_audio_inference(text): | |
| print("Start inference script...") | |
| audio_values, _ = librosa.load(target_voice_path, sr=S3GEN_SR) | |
| audio_values = audio_values[np.newaxis, :].astype(np.float32) | |
| ## Prepare input | |
| tokenizer = AutoTokenizer.from_pretrained(model_id) | |
| text = prepare_language(text, language_id) | |
| input_ids = tokenizer(text, return_tensors="np")["input_ids"].astype(np.int64) | |
| position_ids = np.where( | |
| input_ids >= START_SPEECH_TOKEN, | |
| 0, | |
| np.arange(input_ids.shape[1])[np.newaxis, :] - 1 | |
| ) | |
| ort_embed_tokens_inputs = { | |
| "input_ids": input_ids, | |
| "position_ids": position_ids.astype(np.int64), | |
| "exaggeration": np.array([exaggeration], dtype=np.float32) | |
| } | |
| ## Instantiate the logits processors. | |
| repetition_penalty = 1.2 | |
| repetition_penalty_processor = RepetitionPenaltyLogitsProcessor(penalty=repetition_penalty) | |
| num_hidden_layers = 30 | |
| num_key_value_heads = 16 | |
| head_dim = 64 | |
| generate_tokens = np.array([[START_SPEECH_TOKEN]]) | |
| # ---- Generation Loop using kv_cache ---- | |
| for i in tqdm(range(max_new_tokens), desc="Sampling", dynamic_ncols=True): | |
| inputs_embeds = embed_tokens_session.run(None, ort_embed_tokens_inputs)[0] | |
| if i == 0: | |
| ort_speech_encoder_input = { | |
| "audio_values": audio_values, | |
| } | |
| cond_emb, prompt_token, ref_x_vector, prompt_feat = speech_encoder_session.run(None, ort_speech_encoder_input) | |
| inputs_embeds = np.concatenate((cond_emb, inputs_embeds), axis=1) | |
| ## Prepare llm inputs | |
| batch_size, seq_len, _ = inputs_embeds.shape | |
| past_key_values = { | |
| f"past_key_values.{layer}.{kv}": np.zeros([batch_size, num_key_value_heads, 0, head_dim], dtype=np.float32) | |
| for layer in range(num_hidden_layers) | |
| for kv in ("key", "value") | |
| } | |
| attention_mask = np.ones((batch_size, seq_len), dtype=np.int64) | |
| logits, *present_key_values = llama_with_past_session.run(None, dict( | |
| inputs_embeds=inputs_embeds, | |
| attention_mask=attention_mask, | |
| **past_key_values, | |
| )) | |
| logits = logits[:, -1, :] | |
| next_token_logits = repetition_penalty_processor(generate_tokens, logits) | |
| next_token = np.argmax(next_token_logits, axis=-1, keepdims=True).astype(np.int64) | |
| generate_tokens = np.concatenate((generate_tokens, next_token), axis=-1) | |
| if (next_token.flatten() == STOP_SPEECH_TOKEN).all(): | |
| break | |
| # Get embedding for the new token. | |
| position_ids = np.full( | |
| (input_ids.shape[0], 1), | |
| i + 1, | |
| dtype=np.int64, | |
| ) | |
| ort_embed_tokens_inputs["input_ids"] = next_token | |
| ort_embed_tokens_inputs["position_ids"] = position_ids | |
| ## Update values for next generation loop | |
| attention_mask = np.concatenate([attention_mask, np.ones((batch_size, 1), dtype=np.int64)], axis=1) | |
| for j, key in enumerate(past_key_values): | |
| past_key_values[key] = present_key_values[j] | |
| speech_tokens = generate_tokens[:, 1:-1] | |
| speech_tokens = np.concatenate([prompt_token, speech_tokens], axis=1) | |
| return speech_tokens, ref_x_vector, prompt_feat | |
| speech_tokens, speaker_embeddings, speaker_features = execute_text_to_audio_inference(text) | |
| cond_decoder_input = { | |
| "speech_tokens": speech_tokens, | |
| "speaker_embeddings": speaker_embeddings, | |
| "speaker_features": speaker_features, | |
| } | |
| wav = cond_decoder_session.run(None, cond_decoder_input)[0] | |
| wav = np.squeeze(wav, axis=0) | |
| # Optional: Apply watermark | |
| if apply_watermark: | |
| import perth | |
| watermarker = perth.PerthImplicitWatermarker() | |
| wav = watermarker.apply_watermark(wav, sample_rate=S3GEN_SR) | |
| sf.write(output_file_name, wav, S3GEN_SR) | |
| print(f"{output_file_name} was successfully saved") | |
| if __name__ == "__main__": | |
| run_inference( | |
| text="Bonjour, comment ça va? Ceci est le modèle de synthèse vocale multilingue Chatterbox, il prend en charge 23 langues.", | |
| language_id="fr", | |
| exaggeration=0.5, | |
| output_file_name="output.wav", | |
| apply_watermark=False, | |
| ) | |
| ``` | |
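The Korean normalization step in the script relies on the standard arithmetic for splitting a precomposed Hangul syllable into its initial, medial, and optional final Jamo. The sketch below isolates that arithmetic so it can be checked on its own; the helper names `decompose_syllable` and `to_jamo` are illustrative and not part of the script, and the comparison against `unicodedata.normalize("NFD", ...)` is only a sanity check, not something the script itself performs.

```python
import unicodedata

HANGUL_BASE = 0xAC00   # first precomposed syllable
HANGUL_LAST = 0xD7A3   # last precomposed syllable
MEDIAL_COUNT = 21      # number of medial vowels
FINAL_COUNT = 28       # 27 final consonants plus "no final"


def decompose_syllable(char: str) -> str:
    """Split one precomposed Hangul syllable into conjoining Jamo."""
    code = ord(char)
    if not (HANGUL_BASE <= code <= HANGUL_LAST):
        return char  # non-Hangul characters pass through unchanged
    index = code - HANGUL_BASE
    initial = 0x1100 + index // (MEDIAL_COUNT * FINAL_COUNT)
    medial = 0x1161 + (index % (MEDIAL_COUNT * FINAL_COUNT)) // FINAL_COUNT
    final = index % FINAL_COUNT
    jamo = chr(initial) + chr(medial)
    if final:  # final == 0 means the syllable has no trailing consonant
        jamo += chr(0x11A7 + final)
    return jamo


def to_jamo(text: str) -> str:
    """Apply the decomposition to every character of a string."""
    return "".join(decompose_syllable(c) for c in text)


sample = "한국어 TTS"
jamo = to_jamo(sample)
# Code points of the first syllable after decomposition
print([f"U+{ord(c):04X}" for c in jamo[:3]])  # ['U+1112', 'U+1161', 'U+11AB']
# The arithmetic agrees with Unicode canonical decomposition for Hangul
print(jamo == unicodedata.normalize("NFD", sample))  # True
```

Because each syllable expands to two or three Jamo, the tokenizer sees a longer sequence for Korean text than the visible character count suggests.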
| # Acknowledgements | |
| - [Xenova](https://huggingface.co/Xenova) | |
| - [Vladislav Bronzov](https://github.com/VladOS95-cyber) | |
| - [Resemble AI](https://github.com/resemble-ai/chatterbox) | |
| # Built-in PerTh Watermarking for Responsible AI | |
| When `apply_watermark=True` (the default in the script above), generated audio is marked with [Resemble AI's PerTh (Perceptual Threshold) Watermarker](https://github.com/resemble-ai/perth): imperceptible neural watermarks that survive MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy. | |
| # Disclaimer | |
| Don't use this model to do bad things. Prompts are sourced from freely available data on the internet. |