metadata
license: other
library_name: pytorch
pipeline_tag: text-to-speech
tags:
  - text-to-speech
  - speech-synthesis
  - voice
  - audio
  - sesame
  - mimi
  - llama

Miso TTS 8B

Model Introduction

Miso TTS 8B is a text-to-speech model based on the Sesame CSM architecture. It generates Mimi audio codes from text and optional audio context, using a large Llama 3.2-style backbone and a smaller autoregressive audio decoder.

The model is designed for high-quality conversational speech generation and voice continuation from prompt audio. This repository contains the inference code, model definition, and setup instructions for running Miso TTS locally.

Language support: Miso TTS 8B currently supports English only.


Quickstart

To run the model, use the inference code at our public repository, or try our demo at misolabs.ai.

Model Summary

| Item | Value |
| --- | --- |
| Model | Miso TTS 8B |
| Organization | Miso Labs |
| Task | Text-to-speech |
| Architecture | Sesame-style CSM |
| Backbone | llama-8B |
| Audio decoder | llama-300M |
| Text vocabulary | 128,256 |
| Audio vocabulary | 2,051 |
| Audio codebooks | 32 |
| Audio tokenizer | Mimi |
| Max sequence length | 2,048 |
| Language | English |
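The figures above allow a rough estimate of how much audio fits in one context window. The sketch below is only back-of-the-envelope arithmetic, not part of the Miso TTS code. It rests on two assumptions that this card does not state: that each text token and each audio frame occupies one backbone position, and that Mimi runs at its published rate of 12.5 frames per second. `audio_budget` is a hypothetical helper name.

```python
MAX_SEQ_LEN = 2048          # "Max sequence length" from the model summary
NUM_CODEBOOKS = 32          # "Audio codebooks" from the model summary
MIMI_FRAME_RATE_HZ = 12.5   # assumption: Mimi's published frame rate, not stated in this card


def audio_budget(text_tokens: int, context_frames: int = 0) -> dict:
    """Estimate how many new audio frames fit after the text and prompt audio.

    Assumes one backbone position per text token and per audio frame.
    """
    used = text_tokens + context_frames
    if used > MAX_SEQ_LEN:
        raise ValueError("prompt alone exceeds the maximum sequence length")
    frames = MAX_SEQ_LEN - used
    return {
        "frames": frames,                         # audio frames still available
        "seconds": frames / MIMI_FRAME_RATE_HZ,   # duration those frames cover
        "codes": frames * NUM_CODEBOOKS,          # individual Mimi codes to predict
    }


budget = audio_budget(text_tokens=48)
print(budget)
```

Under these assumptions, a 48-token prompt with no audio context leaves 2,000 frames, or about 160 seconds of audio. Treat this as an upper bound: prompt audio used for voice continuation draws from the same budget.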

Architecture

Miso TTS 8B uses two transformer components:

  • A large backbone transformer that consumes text/audio-frame embeddings.
  • A smaller decoder transformer that autoregressively predicts higher-order audio codebooks within each frame.

Codebook 0 is predicted from the backbone hidden state. Codebooks 1 through 31 are then predicted by the audio decoder one at a time, each conditioned on the lower-order codebooks already generated for the same frame.
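The per-frame control flow described above can be sketched in plain Python. This is a minimal illustration of the two-stage prediction order, not the repository's implementation. The names `generate_frame`, `codebook0_head`, and `decoder_step` are hypothetical, the toy heads return random logits in place of real transformer outputs, and greedy selection stands in for whatever sampling strategy the inference code actually uses.

```python
import random
from typing import Callable, List, Sequence

NUM_CODEBOOKS = 32   # "Audio codebooks" in the model summary
AUDIO_VOCAB = 2051   # "Audio vocabulary" in the model summary


def greedy(logits: Sequence[float]) -> int:
    """Pick the index of the largest logit (stand-in for real sampling)."""
    return max(range(len(logits)), key=logits.__getitem__)


def generate_frame(
    backbone_hidden: Sequence[float],
    codebook0_head: Callable[[Sequence[float]], Sequence[float]],
    decoder_step: Callable[[Sequence[float], List[int]], Sequence[float]],
    pick: Callable[[Sequence[float]], int] = greedy,
) -> List[int]:
    """Produce one audio frame: 32 codes, one per codebook."""
    # Stage 1: codebook 0 comes directly from the backbone hidden state.
    codes = [pick(codebook0_head(backbone_hidden))]
    # Stage 2: codebooks 1..31 come from the smaller decoder, each step
    # conditioned on the codes already chosen for this frame.
    for _ in range(1, NUM_CODEBOOKS):
        codes.append(pick(decoder_step(backbone_hidden, codes)))
    return codes


# Toy stand-ins so the sketch runs without model weights (hypothetical).
def toy_codebook0_head(hidden: Sequence[float]) -> List[float]:
    rng = random.Random(int(sum(hidden) * 1000))
    return [rng.random() for _ in range(AUDIO_VOCAB)]


def toy_decoder_step(hidden: Sequence[float], codes_so_far: List[int]) -> List[float]:
    rng = random.Random(len(codes_so_far) * 10_000 + codes_so_far[-1])
    return [rng.random() for _ in range(AUDIO_VOCAB)]


frame = generate_frame([0.1, 0.2], toy_codebook0_head, toy_decoder_step)
print(len(frame))
```

The point of the split is that the large backbone runs once per frame, while the 31 remaining codebook steps run only through the smaller decoder.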


Links