	---
	license: other
	library_name: pytorch
	pipeline_tag: text-to-speech
	tags:
	- text-to-speech
	- speech-synthesis
	- voice
	- audio
	- sesame
	- mimi
	- llama
	---

	<div align="center">

	<img src="repo_banner.png" alt="Miso TTS 8B" width="100%">
	<p>
	<a href="https://misolabs.ai"><img alt="Website" src="https://img.shields.io/badge/Website-misolabs.ai-black?style=for-the-badge"></a>
	<a href="https://huggingface.co/MisoLabs/MisoTTS"><img alt="Hugging Face" src="https://img.shields.io/badge/Hugging%20Face-MisoTTS-yellow?style=for-the-badge"></a>
	<a href="https://github.com/MisoLabsAI"><img alt="GitHub" src="https://img.shields.io/badge/GitHub-MisoLabsAI-181717?style=for-the-badge&logo=github&labelColor=555555"></a>
	<a href="https://x.com/MisoLabsAI"><img alt="X" src="https://img.shields.io/badge/-MisoLabsAI-181717?style=for-the-badge&logo=x&labelColor=555555"></a>
	</p>

	<p>
	<a href="#quickstart">Quickstart</a> |
	<a href="#model-introduction">Model Introduction</a> |
	<a href="#model-summary">Model Summary</a> |
	<a href="#architecture">Architecture</a> |
	<a href="#links">Links</a>
	</p>

	</div>

	---
	# Miso TTS 8B
	## Model Introduction

	Miso TTS 8B is a text-to-speech model based on the Sesame CSM architecture. It
	generates Mimi audio codes from text and optional audio context, using a large
	Llama 3.2-style backbone and a smaller autoregressive audio decoder.

	The model is designed for high-quality conversational speech generation and
	voice continuation from prompt audio. This repository contains the inference
	code, model definition, and setup instructions for running Miso TTS locally.

	> Language support: Miso TTS 8B currently supports English only.

	---

	## Quickstart

	To run the model, use the inference code in our [public repository](https://github.com/MisoLabsAI/MisoTTS),
	or try the demo at [misolabs.ai](https://misolabs.ai).

	## Model Summary

	| Item                | Value            |
	| ------------------- | ---------------- |
	| Model               | Miso TTS 8B      |
	| Organization        | Miso Labs        |
	| Task                | Text-to-speech   |
	| Architecture        | Sesame-style CSM |
	| Backbone            | `llama-8B`       |
	| Audio decoder       | `llama-300M`     |
	| Text vocabulary     | `128,256`        |
	| Audio vocabulary    | `2,051`          |
	| Audio codebooks     | `32`             |
	| Audio tokenizer     | Mimi             |
	| Max sequence length | `2,048`          |
	| Language            | English          |
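
As a rough illustration, the values in the table can be collected into a small config object and used for back-of-the-envelope arithmetic. This is only a sketch: the class name `MisoTTSConfig`, its field names, and the helper functions are hypothetical and are not taken from the Miso TTS code. The 12.5 Hz figure is Mimi's published frame rate, and the duration bound additionally assumes that each sequence position holds at most one audio frame.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MisoTTSConfig:
    """Hypothetical container mirroring the Model Summary table."""
    backbone: str = "llama-8B"
    audio_decoder: str = "llama-300M"
    text_vocab_size: int = 128_256
    audio_vocab_size: int = 2_051
    audio_num_codebooks: int = 32
    max_seq_len: int = 2_048


# Assumption: Mimi's published frame rate (frames per second of audio).
MIMI_FRAME_RATE_HZ = 12.5


def max_audio_seconds(cfg, frame_rate_hz=MIMI_FRAME_RATE_HZ):
    """Upper bound on audio length if every sequence position were an audio frame."""
    return cfg.max_seq_len / frame_rate_hz


def codes_per_second(cfg, frame_rate_hz=MIMI_FRAME_RATE_HZ):
    """Number of audio codes produced per second across all codebooks."""
    return cfg.audio_num_codebooks * frame_rate_hz


cfg = MisoTTSConfig()
print(max_audio_seconds(cfg))  # 163.84
print(codes_per_second(cfg))   # 400.0
```

Under these assumptions the context window covers at most about 164 seconds of audio. In practice text tokens share the same window, so the usable audio length is shorter.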

	### Architecture

	Miso TTS 8B uses two transformer components:

	- A large backbone transformer that consumes text/audio-frame embeddings.
	- A smaller decoder transformer that autoregressively predicts higher-order
	audio codebooks within each frame.

	Codebook 0 is predicted from the backbone hidden state, while codebooks 1
	through 31 are predicted by the audio decoder, autoregressively in codebook
	depth.
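
The control flow described above can be sketched in plain Python. This is a toy under stated assumptions, not the Miso TTS implementation: `backbone_step` and `decoder_step` are hypothetical stand-ins that return random codes in place of real transformer outputs, and only the loop structure (codebook 0 from the backbone, then codebooks 1 through 31 one at a time from the decoder) reflects the description.

```python
import random

NUM_CODEBOOKS = 32       # from the Model Summary table
AUDIO_VOCAB_SIZE = 2051  # from the Model Summary table


def backbone_step(context, rng):
    """Hypothetical stand-in for the backbone.

    Returns a fake hidden state and a random code for codebook 0.
    """
    hidden = rng.random()
    code0 = rng.randrange(AUDIO_VOCAB_SIZE)
    return hidden, code0


def decoder_step(hidden, codes_so_far, rng):
    """Hypothetical stand-in for the audio decoder.

    A real decoder would condition on the backbone hidden state and on the
    codes already chosen for this frame; here we just sample at random.
    """
    return rng.randrange(AUDIO_VOCAB_SIZE)


def generate_frame(context, rng):
    """Produce one audio frame of NUM_CODEBOOKS codes."""
    hidden, code0 = backbone_step(context, rng)
    frame = [code0]
    # Codebooks 1..31 are predicted one after another within the frame.
    for _ in range(1, NUM_CODEBOOKS):
        frame.append(decoder_step(hidden, frame, rng))
    return frame


def generate(num_frames, seed=0):
    """Generate frames one at a time, feeding each back as context."""
    rng = random.Random(seed)
    context = []
    for _ in range(num_frames):
        context.append(generate_frame(context, rng))
    return context


frames = generate(num_frames=3)
print(len(frames), len(frames[0]))  # 3 32
```

The two nested loops show why the decoder can be much smaller than the backbone: the backbone runs once per frame, while the decoder runs once per remaining codebook within that frame.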

	---

	## Links

	- Website: [misolabs.ai](https://misolabs.ai)
	- Hugging Face: [MisoLabs/MisoTTS](https://huggingface.co/MisoLabs/MisoTTS)
	- GitHub: [MisoLabsAI](https://github.com/MisoLabsAI)
	- X: [@MisoLabsAI](https://x.com/MisoLabsAI)