---
license: mit
language:
- en
tags:
- speech
- tokenizer
---

# Kanade: Compact Linguistically Rich Speech Tokens for Spoken Language Models

Kanade is a speech tokenizer that encodes speech into compact content tokens and global embeddings, and decodes them back to mel spectrograms.

## Updates

- **2026-01-09**: Released the `kanade-25hz-clean` model, trained on [LibriTTS-R](https://arxiv.org/abs/2305.18802) with the [HiFT vocoder](https://arxiv.org/abs/2309.09493) for better audio quality. LibriTTS-R is a restored version of LibriTTS with noise removed, so a model trained on it produces cleaner synthesis. As a trade-off, however, this version can no longer faithfully reflect the recording environment, such as background noise and microphone characteristics. The vocoder is also changed to the HiFT model used in [CosyVoice 2](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B) for better quality. The content encoder remains the same as in the previous `kanade-25hz` model. We made a tiny code change to support different vocoders during inference (specifically `load_vocoder`); please refer to the updated usage section below.
## Models

| Model | Token Rate | Vocab Size | Bit Rate | Dataset | SSL Encoder | Vocoder | Parameters |
| ----- | ---------- | ---------- | -------- | ------- | ----------- | ------- | ---------- |
| [`kanade-12.5hz`](https://huggingface.co/frothywater/kanade-12.5hz) | 12.5 Hz | 12800 | 171 bps | LibriTTS | WavLM-base+ | Vocos 24kHz | 120M |
| [`kanade-25hz`](https://huggingface.co/frothywater/kanade-25hz) | 25 Hz | 12800 | 341 bps | LibriTTS | WavLM-base+ | Vocos 24kHz | 118M |
| [`kanade-25hz-clean`](https://huggingface.co/frothywater/kanade-25hz-clean) | 25 Hz | 12800 | 341 bps | LibriTTS-R | WavLM-base+ | HiFT 24kHz | 142M |

## Installation

For simple inference, install the Kanade tokenizer into your virtual environment:

```bash
# In your own project's virtual environment
uv add git+https://github.com/frothywater/kanade-tokenizer
# or using pip
pip install git+https://github.com/frothywater/kanade-tokenizer
```

> [!IMPORTANT]
> We use [FlashAttention](https://github.com/Dao-AILab/flash-attention) for efficient local window attention in training. We recommend installing it following the instructions in their repository to get the best performance and the closest match to our setup. The model falls back to the regular PyTorch SDPA implementation if FlashAttention is not available; in that case, we cannot guarantee the same quality as reported in the paper.
> If using uv, you can install FlashAttention with `uv pip install flash-attn --no-build-isolation`. (Ensure `ninja` is installed on your system, or the build will be very slow.)
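Since FlashAttention is optional and the model silently falls back to PyTorch SDPA, it can be useful to verify which path you will get before loading the model. A minimal sketch (the helper name `flash_attn_available` is our own, not part of the package):

```python
import importlib.util

def flash_attn_available() -> bool:
    """Return True if the flash_attn package can be imported (i.e. FlashAttention will be used)."""
    return importlib.util.find_spec("flash_attn") is not None

# False means the model will fall back to the PyTorch SDPA implementation
print(flash_attn_available())
```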
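As a side note, the bit rates in the Models table above follow directly from the token rate and vocabulary size: bit rate = token rate × log2(vocab size). A quick sanity check:

```python
import math

def bit_rate(token_rate_hz: float, vocab_size: int) -> int:
    """Bits per second = tokens per second * bits per token (log2 of vocab size)."""
    return round(token_rate_hz * math.log2(vocab_size))

print(bit_rate(12.5, 12800))  # → 171
print(bit_rate(25, 12800))    # → 341
```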
## Usage

Example code to load the model from the HuggingFace Hub and run inference:

```python
from kanade_tokenizer import KanadeModel, load_audio, load_vocoder, vocode

# Load Kanade model
model = KanadeModel.from_pretrained("frothywater/kanade-25hz")
model = model.eval().cuda()

# Load vocoder
vocoder = load_vocoder(model.config.vocoder_name).cuda()

# Load audio (samples,)
audio = load_audio("path/to/audio.wav", sample_rate=model.config.sample_rate).cuda()

# Extract features
features = model.encode(audio)

# Synthesize mel spectrogram from extracted features
mel_spectrogram = model.decode(
    content_token_indices=features.content_token_indices,  # (seq_len,)
    global_embedding=features.global_embedding,  # (dim,)
)  # (n_mels, T)

# Resynthesize waveform using vocoder
resynthesized_waveform = vocode(vocoder, mel_spectrogram.unsqueeze(0))  # (1, samples)
```

For details about voice conversion and how to train and fine-tune the model, please refer to [our GitHub repository](https://github.com/frothywater/kanade-tokenizer).
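For budgeting context length in a downstream spoken language model, the number of content tokens is roughly the clip duration times the token rate. A back-of-the-envelope sketch (the exact count may differ slightly due to framing/padding; 24 kHz is assumed to match the vocoders above):

```python
def approx_num_tokens(num_samples: int, sample_rate: int, token_rate_hz: float) -> int:
    """Approximate content-token count: clip duration in seconds times token rate in Hz."""
    return round(num_samples / sample_rate * token_rate_hz)

# A 10-second clip at 24 kHz through the 25 Hz tokenizer yields about 250 tokens
print(approx_num_tokens(240_000, 24_000, 25.0))  # → 250
```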