---
license: other
library_name: pytorch
pipeline_tag: text-to-speech
tags:
- text-to-speech
- speech-synthesis
- voice
- audio
- sesame
- mimi
- llama
---

<div align="center">

<img src="repo_banner.png" alt="Miso TTS 8B" width="100%">
<p>
<a href="https://misolabs.ai"><img alt="Website" src="https://img.shields.io/badge/Website-misolabs.ai-black?style=for-the-badge"></a>
<a href="https://huggingface.co/MisoLabs/MisoTTS"><img alt="Hugging Face" src="https://img.shields.io/badge/Hugging%20Face-MisoTTS-yellow?style=for-the-badge"></a>
<a href="https://github.com/MisoLabsAI"><img alt="GitHub" src="https://img.shields.io/badge/GitHub-MisoLabsAI-181717?style=for-the-badge&logo=github&labelColor=555555"></a>
<a href="https://x.com/MisoLabsAI"><img alt="X" src="https://img.shields.io/badge/-MisoLabsAI-181717?style=for-the-badge&logo=x&labelColor=555555"></a>
</p>

<p>
<a href="#quickstart">Quickstart</a> |
<a href="#model-introduction">Model Introduction</a> |
<a href="#model-summary">Model Summary</a> |
<a href="#architecture">Architecture</a> |
<a href="#links">Links</a>
</p>

</div>
|
|
---
# Miso TTS 8B
## Model Introduction

Miso TTS 8B is a text-to-speech model based on the Sesame CSM architecture. It
generates Mimi audio codes from text and optional audio context, using a large
Llama 3.2-style backbone and a smaller autoregressive audio decoder.

The model is designed for high-quality conversational speech generation and
voice continuation from prompt audio. This repository contains the inference
code, model definition, and setup instructions for running Miso TTS locally.

> **Language support:** Miso TTS 8B currently supports **English only**.
|
|
---

## Quickstart

To run the model, use the inference code in our [public repository](https://github.com/MisoLabsAI/MisoTTS),
or try the demo at [misolabs.ai](https://misolabs.ai).
|
|
## Model Summary
|
|
| | Item | Value | |
| | ------------------- | ---------------- | |
| | Model | Miso TTS 8B | |
| | Organization | Miso Labs | |
| | Task | Text-to-speech | |
| | Architecture | Sesame-style CSM | |
| | Backbone | `llama-8B` | |
| | Audio decoder | `llama-300M` | |
| | Text vocabulary | `128,256` | |
| | Audio vocabulary | `2,051` | |
| | Audio codebooks | `32` | |
| | Audio tokenizer | Mimi | |
| | Max sequence length | `2,048` | |
| | Language | `English` | |
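
The vocabulary and codebook counts above fix the size of the audio token space. The sketch below collects them in a small config object and shows the arithmetic under one assumption: that Miso TTS follows the open-source Sesame CSM reference code, which keeps all codebooks in a single embedding table and offsets each code by `codebook * audio_vocab_size`. The class name `MisoTTSConfig` and its methods are hypothetical and are not taken from the Miso TTS repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MisoTTSConfig:  # hypothetical name, for illustration only
    # Values copied from the model summary table.
    text_vocab_size: int = 128_256
    audio_vocab_size: int = 2_051
    audio_num_codebooks: int = 32
    max_seq_len: int = 2_048

    def audio_embedding_rows(self) -> int:
        # Assumption: one shared table with a block of rows per codebook.
        return self.audio_vocab_size * self.audio_num_codebooks

    def flat_audio_index(self, codebook: int, code: int) -> int:
        # Map (codebook, code) to a row in the assumed shared table.
        if not 0 <= codebook < self.audio_num_codebooks:
            raise ValueError(f"codebook {codebook} out of range")
        if not 0 <= code < self.audio_vocab_size:
            raise ValueError(f"code {code} out of range")
        return codebook * self.audio_vocab_size + code


cfg = MisoTTSConfig()
print(cfg.audio_embedding_rows())   # 65632
print(cfg.flat_audio_index(1, 0))   # 2051
```

If the actual implementation uses separate tables per codebook, only the indexing changes; the per-frame token space of 32 codebooks with 2,051 entries each stays the same.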
|
|
### Architecture

Miso TTS 8B uses two transformer components:

- A large backbone transformer that consumes text and audio-frame embeddings.
- A smaller decoder transformer that autoregressively predicts higher-order
  audio codebooks within each frame.

Within each frame, codebook 0 is predicted from the backbone hidden state, while
codebooks 1 through 31 are predicted by the audio decoder, autoregressively
over codebook depth.
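
The two-stage prediction described above can be sketched as a per-frame loop. This is a minimal illustration and not the repository's inference code: `sample_frame`, `toy_backbone_head`, and `toy_decoder_step` are hypothetical names, and the toy functions draw random codes where the real model would sample from transformer logits.

```python
import random

NUM_CODEBOOKS = 32   # from the model summary table
AUDIO_VOCAB = 2051   # from the model summary table


def sample_frame(backbone_head, decoder_step, hidden, rng):
    """Return one audio frame as a list of NUM_CODEBOOKS code indices."""
    # Codebook 0 is predicted directly from the backbone hidden state.
    frame = [backbone_head(hidden, rng)]
    # Codebooks 1..31 are predicted one at a time; each step sees the
    # backbone hidden state and the codes already chosen for this frame.
    for depth in range(1, NUM_CODEBOOKS):
        frame.append(decoder_step(hidden, frame, depth, rng))
    return frame


# Hypothetical stand-ins for the two networks (random codes, no model).
def toy_backbone_head(hidden, rng):
    return rng.randrange(AUDIO_VOCAB)


def toy_decoder_step(hidden, previous_codes, depth, rng):
    # At depth d the decoder has exactly d earlier codes to condition on.
    assert len(previous_codes) == depth
    return rng.randrange(AUDIO_VOCAB)


frame = sample_frame(toy_backbone_head, toy_decoder_step,
                     hidden=None, rng=random.Random(0))
print(len(frame))  # 32
```

The backbone runs once per frame while the smaller decoder runs 31 times, which is presumably why the depth-wise decoder is kept much smaller than the backbone.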
|
|
---

## Links

- Website: [misolabs.ai](https://misolabs.ai)
- Hugging Face: [MisoLabs/MisoTTS](https://huggingface.co/MisoLabs/MisoTTS)
- GitHub: [MisoLabsAI](https://github.com/MisoLabsAI)
- X: [@MisoLabsAI](https://x.com/MisoLabsAI)
|
|