# Kanade: Compact Linguistically Rich Speech Tokens for Spoken Language Models
Kanade is a speech tokenizer that encodes speech into compact content tokens and global embeddings and decodes them back to mel spectrograms.
## Updates
- 2026-01-09: Released `kanade-25hz-clean`, a model trained on LibriTTS-R with the HiFT vocoder for better audio quality. LibriTTS-R is a restored version of LibriTTS with noise removed, so a model trained on it produces cleaner synthesis. As a trade-off, however, this version can no longer faithfully reflect the recording environment, such as background noise and microphone characteristics. The vocoder is also changed to the HiFT model used in CosyVoice 2 for better quality. The content encoder remains the same as in the previous `kanade-25hz` model. We made a tiny code change (specifically `load_vocoder`) to support different vocoders during inference. Please refer to the updated usage section below.
## Models
| Model | Token Rate | Vocab Size | Bit Rate | Dataset | SSL Encoder | Vocoder | Parameters |
|---|---|---|---|---|---|---|---|
| `kanade-12.5hz` | 12.5 Hz | 12800 | 171 bps | LibriTTS | WavLM-base+ | Vocos 24kHz | 120M |
| `kanade-25hz` | 25 Hz | 12800 | 341 bps | LibriTTS | WavLM-base+ | Vocos 24kHz | 118M |
| `kanade-25hz-clean` | 25 Hz | 12800 | 341 bps | LibriTTS-R | WavLM-base+ | HiFT 24kHz | 142M |
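The bit rates above follow directly from the token rate and vocabulary size: each token carries log2(12800) ≈ 13.6 bits. A quick sanity check in plain Python (not part of the package):

```python
import math

def bit_rate(token_rate_hz: float, vocab_size: int) -> float:
    """Bits per second = tokens per second * bits per token."""
    return token_rate_hz * math.log2(vocab_size)

print(round(bit_rate(12.5, 12800)))  # 171, matches kanade-12.5hz
print(round(bit_rate(25, 12800)))    # 341, matches kanade-25hz
```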
## Installation
For simple inference, you can install the Kanade tokenizer into your virtual environment:
```sh
# In your own project's virtual environment
uv add git+https://github.com/frothywater/kanade-tokenizer
# or using pip
pip install git+https://github.com/frothywater/kanade-tokenizer
```
We use FlashAttention for efficient local window attention during training. We recommend installing it following the instructions in their repository to get the best performance and the closest match to our setup. The model will fall back to the regular PyTorch SDPA implementation if FlashAttention is not available; in this case, we cannot guarantee the same quality as reported in the paper.
If using uv, you can install FlashAttention with `uv pip install flash-attn --no-build-isolation`. (Ensure `ninja` is installed on your system, or the build will be very slow.)
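To check which attention path your environment will use, a simple import probe works (a sketch: the model selects its backend internally, and the exact mechanism may differ):

```python
# Probe whether FlashAttention is importable; if not, the model
# falls back to PyTorch's scaled_dot_product_attention (SDPA).
try:
    import flash_attn  # noqa: F401
    backend = "flash-attn"
except ImportError:
    backend = "sdpa"
print(f"Attention backend available: {backend}")
```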
## Usage
Example code to load the model from HuggingFace Hub and run inference:
```python
from kanade_tokenizer import KanadeModel, load_audio, load_vocoder, vocode

# Load Kanade model
model = KanadeModel.from_pretrained("frothywater/kanade-12.5hz")
model = model.eval().cuda()

# Load vocoder
vocoder = load_vocoder(model.config.vocoder_name).cuda()

# Load audio (samples,)
audio = load_audio("path/to/audio.wav", sample_rate=model.config.sample_rate).cuda()

# Extract features
features = model.encode(audio)

# Synthesize mel spectrogram from extracted features
mel_spectrogram = model.decode(
    content_token_indices=features.content_token_indices,  # (seq_len,)
    global_embedding=features.global_embedding,  # (dim,)
)  # (n_mels, T)

# Resynthesize waveform using vocoder
resynthesized_waveform = vocode(vocoder, mel_spectrogram.unsqueeze(0))  # (1, samples)
```
For details about voice conversion and how to train and fine-tune the model, please refer to our GitHub repository.