---
license: other
library_name: pytorch
pipeline_tag: text-to-speech
tags:
- text-to-speech
- speech-synthesis
- voice
- audio
- sesame
- mimi
- llama
---
<div align="center">
<img src="repo_banner.png" alt="Miso TTS 8B" width="100%">
<p>
<a href="https://misolabs.ai"><img alt="Website" src="https://img.shields.io/badge/Website-misolabs.ai-black?style=for-the-badge"></a>
<a href="https://huggingface.co/MisoLabs/MisoTTS"><img alt="Hugging Face" src="https://img.shields.io/badge/Hugging%20Face-MisoTTS-yellow?style=for-the-badge"></a>
<a href="https://github.com/MisoLabsAI"><img alt="GitHub" src="https://img.shields.io/badge/GitHub-MisoLabsAI-181717?style=for-the-badge&logo=github&labelColor=555555"></a>
<a href="https://x.com/MisoLabsAI"><img alt="X" src="https://img.shields.io/badge/-MisoLabsAI-181717?style=for-the-badge&logo=x&labelColor=555555"></a>
</p>
<p>
<a href="#quickstart">Quickstart</a> |
<a href="#model-introduction">Model Introduction</a> |
<a href="#model-summary">Model Summary</a> |
<a href="#architecture">Architecture</a> |
<a href="#links">Links</a>
</p>
</div>
---
# Miso TTS 8B
## Model Introduction
Miso TTS 8B is a text-to-speech model based on the Sesame CSM architecture. It
generates Mimi audio codes from text and optional audio context, using a large
Llama 3.2-style backbone and a smaller autoregressive audio decoder.
The model is designed for high-quality conversational speech generation and
voice continuation from prompt audio. This repository contains the inference
code, model definition, and setup instructions for running Miso TTS locally.
> **Language support:** Miso TTS 8B currently supports **English only**.
---
## Quickstart
To run the model, use the inference code in our [public repository](https://github.com/MisoLabsAI/MisoTTS),
or try the demo at [misolabs.ai](https://misolabs.ai).
## Model Summary
| Item | Value |
| ------------------- | ---------------- |
| Model | Miso TTS 8B |
| Organization | Miso Labs |
| Task | Text-to-speech |
| Architecture | Sesame-style CSM |
| Backbone | `llama-8B` |
| Audio decoder | `llama-300M` |
| Text vocabulary | `128,256` |
| Audio vocabulary | `2,051` |
| Audio codebooks | `32` |
| Audio tokenizer | Mimi |
| Max sequence length | `2,048` |
| Language | `English` |
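
The values in the table can be gathered into a small configuration object. The following is a minimal sketch, not the repository's actual config class: the name `MisoTTSConfig` and all field names are hypothetical, and only the numbers and model labels are taken from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MisoTTSConfig:
    """Hypothetical container for the values in the Model Summary table."""

    backbone: str = "llama-8B"
    audio_decoder: str = "llama-300M"
    text_vocab_size: int = 128_256
    audio_vocab_size: int = 2_051
    audio_num_codebooks: int = 32
    audio_tokenizer: str = "Mimi"
    max_seq_len: int = 2_048
    language: str = "English"


config = MisoTTSConfig()
print(config)
```

Freezing the dataclass keeps the values read-only, which suits constants that describe a released checkpoint.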
### Architecture
Miso TTS 8B uses two transformer components:
- A large backbone transformer that consumes text/audio-frame embeddings.
- A smaller decoder transformer that autoregressively predicts higher-order
audio codebooks within each frame.
Codebook 0 is predicted from the backbone hidden state, while codebooks 1
through 31 are predicted by the audio decoder, which runs autoregressively over
codebook depth within each frame.
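
The division of labor between the two components can be sketched as a per-frame loop. This is a toy illustration in plain Python, not the model's inference code: `backbone_step` and `decoder_predict` are hypothetical stand-ins that return deterministic integers instead of running transformers, and treating 2,051 as the vocabulary size of each codebook is an assumption.

```python
from typing import List, Tuple

NUM_CODEBOOKS = 32        # from the Model Summary table
AUDIO_VOCAB_SIZE = 2_051  # assumed to apply to each codebook


def backbone_step(context: List[int]) -> Tuple[int, int]:
    """Hypothetical stand-in for the backbone.

    Returns a toy "hidden state" (an int) and the codebook-0 code derived from it.
    """
    hidden = sum(context)
    code0 = hidden % AUDIO_VOCAB_SIZE
    return hidden, code0


def decoder_predict(hidden: int, previous_codes: List[int]) -> int:
    """Hypothetical stand-in for the audio decoder.

    Predicts the next codebook's code from the backbone state and the codes
    already produced for this frame.
    """
    depth = len(previous_codes)
    return (hidden + 31 * previous_codes[-1] + depth) % AUDIO_VOCAB_SIZE


def generate_frame(context: List[int]) -> List[int]:
    """Produce one audio frame of NUM_CODEBOOKS codes."""
    # Codebook 0 comes from the backbone hidden state.
    hidden, code0 = backbone_step(context)
    codes = [code0]
    # Codebooks 1..31 come from the decoder, one at a time, each conditioned
    # on the codes produced so far in this frame.
    for _ in range(1, NUM_CODEBOOKS):
        codes.append(decoder_predict(hidden, codes))
    return codes


frame = generate_frame(context=[101, 7, 42])
print(len(frame))
```

In a real model, each stand-in would be a transformer forward pass followed by sampling from logits; the sketch only shows the order in which the codes of a frame are produced.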
---
## Links
- Website: [misolabs.ai](https://misolabs.ai)
- Hugging Face: [MisoLabs/MisoTTS](https://huggingface.co/MisoLabs/MisoTTS)
- GitHub: [MisoLabsAI](https://github.com/MisoLabsAI)
- X: [@MisoLabsAI](https://x.com/MisoLabsAI)