license: other
library_name: pytorch
pipeline_tag: text-to-speech
tags:
- text-to-speech
- speech-synthesis
- voice
- audio
- sesame
- mimi
- llama
Miso TTS 8B
Model Introduction
Miso TTS 8B is a text-to-speech model based on the Sesame CSM architecture. It generates Mimi audio codes from text and optional audio context, using a large Llama 3.2-style backbone and a smaller autoregressive audio decoder.
The model is designed for high-quality conversational speech generation and voice continuation from prompt audio. This repository contains the inference code, model definition, and setup instructions for running Miso TTS locally.
Language support: Miso TTS 8B currently supports English only.
Quickstart
To run the model, use the inference code in our public repository, or try the demo at misolabs.ai.
Model Summary
| Item | Value |
|---|---|
| Model | Miso TTS 8B |
| Organization | Miso Labs |
| Task | Text-to-speech |
| Architecture | Sesame-style CSM |
| Backbone | llama-8B |
| Audio decoder | llama-300M |
| Text vocabulary | 128,256 |
| Audio vocabulary | 2,051 |
| Audio codebooks | 32 |
| Audio tokenizer | Mimi |
| Max sequence length | 2,048 |
| Language | English |
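For readers who want these values in code, the table can be mirrored as a small configuration object. This is only a sketch: the class name `MisoTTSConfig` and its field names are hypothetical and are not taken from the repository. Only the values come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MisoTTSConfig:
    """Hypothetical container for the values in the model summary table."""

    backbone_flavor: str = "llama-8B"
    decoder_flavor: str = "llama-300M"
    text_vocab_size: int = 128_256
    audio_vocab_size: int = 2_051
    audio_num_codebooks: int = 32
    max_seq_len: int = 2_048

    @property
    def decoder_codebooks(self) -> int:
        # Codebook 0 is predicted from the backbone hidden state,
        # so the audio decoder is responsible for the remaining ones.
        return self.audio_num_codebooks - 1


cfg = MisoTTSConfig()
print(cfg.backbone_flavor, cfg.decoder_flavor, cfg.decoder_codebooks)
```

Keeping the values in one frozen object makes it harder for shape-dependent code (embedding tables, output heads) to drift out of sync with the published numbers.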
Architecture
Miso TTS 8B uses two transformer components:
- A large backbone transformer that consumes text/audio-frame embeddings.
- A smaller decoder transformer that autoregressively predicts higher-order audio codebooks within each frame.
Codebook 0 is predicted from the backbone hidden state. Codebooks 1 through 31 are then predicted by the audio decoder one at a time, in order of codebook depth, with each prediction conditioned on the codebooks already generated for that frame.
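The control flow for producing one audio frame can be sketched as follows. This is an illustration, not the repository's implementation: `backbone_logits` and `decoder_logits` are hypothetical stand-ins that return random logits in place of the real transformers, and greedy argmax selection is an assumption (the actual inference code may sample instead).

```python
import random

NUM_CODEBOOKS = 32        # from the model summary
AUDIO_VOCAB_SIZE = 2051   # from the model summary


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


def backbone_logits(context, rng):
    """Hypothetical stand-in for the backbone: logits for codebook 0."""
    return [rng.random() for _ in range(AUDIO_VOCAB_SIZE)]


def decoder_logits(context, previous_codes, rng):
    """Hypothetical stand-in for the audio decoder.

    Returns logits for codebook number len(previous_codes), given the
    codes already chosen for this frame.
    """
    return [rng.random() for _ in range(AUDIO_VOCAB_SIZE)]


def generate_frame(context, rng):
    # Step 1: codebook 0 comes from the backbone.
    codes = [argmax(backbone_logits(context, rng))]
    # Step 2: codebooks 1..31 come from the decoder, one at a time,
    # each conditioned on the codes generated so far in this frame.
    for _ in range(1, NUM_CODEBOOKS):
        codes.append(argmax(decoder_logits(context, codes, rng)))
    return codes


frame = generate_frame(context=[], rng=random.Random(0))
print(len(frame))
```

Each generated frame holds one code per codebook. In a full generation loop, the finished frame would be embedded and appended to the backbone's input before the next frame is predicted.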
Links
- Website: misolabs.ai
- Hugging Face: MisoLabs/MisoTTS
- GitHub: MisoLabsAI
- X: @MisoLabsAI