MioCodec-25Hz-44.1kHz-v2 is an upsampled, high-fidelity version of the MioCodec-25Hz-24kHz model.
By integrating an UpsamplerBlock inspired by Inworld TTS-1 into the decoder, this model reconstructs 44.1 kHz audio from the standard 25 Hz token stream.
This model is a fine-tuned version of MioCodec-25Hz-24kHz; the main architectural change is the UpsamplerBlock added to the decoder. You can take any TTS model trained on the 24 kHz tokens and simply swap in this v2 codec at inference time to upgrade the output audio to 44.1 kHz.

| Model | Token Rate | Vocab Size | Bit Rate | Sample Rate | SSL Encoder | Vocoder | Parameters | Highlights |
|---|---|---|---|---|---|---|---|---|
| MioCodec-25Hz-44.1kHz-v2 | 25 Hz | 12,800 | 341 bps | 44.1 kHz | WavLM-base+ | - (iSTFTHead) | 133M | Fast inference, good quality |
| MioCodec-25Hz-24kHz | 25 Hz | 12,800 | 341 bps | 24 kHz | WavLM-base+ | - (iSTFTHead) | 132M | Lightweight, fast inference |
| MioCodec-25Hz-44.1kHz | 25 Hz | 12,800 | 341 bps | 44.1 kHz | WavLM-base+ | MioVocoder (Jointly Tuned) | 118M (w/o vocoder) | High-quality, high sample rate |
| kanade-25hz | 25 Hz | 12,800 | 341 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 118M (w/o vocoder) | Original 25Hz model |
| kanade-12.5hz | 12.5 Hz | 12,800 | 171 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 120M (w/o vocoder) | Original 12.5Hz model |
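The bit rates in the table follow directly from the token rate and vocabulary size: each token drawn from a 12,800-entry vocabulary carries log2(12,800) ≈ 13.64 bits. A quick sanity check in plain Python (not part of the MioCodec API):

```python
import math

# Bits carried by one token from a 12,800-entry vocabulary
bits_per_token = math.log2(12_800)  # ~13.64 bits

# Bit rate = token rate * bits per token
bitrate_25hz = 25 * bits_per_token      # ~341 bps, matching the 25 Hz rows
bitrate_12_5hz = 12.5 * bits_per_token  # ~171 bps, matching kanade-12.5hz

# At 44.1 kHz output the decoder must reconstruct 44100 / 25 = 1764
# audio samples per token, versus 960 samples at 24 kHz output.
samples_per_token_44k = 44_100 // 25  # 1764
samples_per_token_24k = 24_000 // 25  # 960

print(round(bitrate_25hz), round(bitrate_12_5hz))  # 341 171
```

This also makes concrete what the UpsamplerBlock buys: the token stream is unchanged, but each token now has to cover roughly 1.84x as many output samples.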
```bash
# Install via pip
pip install git+https://github.com/Aratako/MioCodec

# Or using uv
uv add git+https://github.com/Aratako/MioCodec
```
Basic usage for encoding and decoding audio:
```python
import soundfile as sf

from miocodec import MioCodecModel, load_audio

# 1. Load model
model = MioCodecModel.from_pretrained("Aratako/MioCodec-25Hz-44.1kHz-v2").eval().cuda()

# 2. Load audio at the model's expected sample rate
waveform = load_audio("input.wav", sample_rate=model.config.sample_rate).cuda()

# 3. Encode audio into content tokens and a global (speaker) embedding
features = model.encode(waveform)

# 4. Decode back to a waveform (directly, no vocoder needed)
resynth = model.decode(
    content_token_indices=features.content_token_indices,
    global_embedding=features.global_embedding,
)

# 5. Save
sf.write("output.wav", resynth.cpu().numpy(), model.config.sample_rate)
```
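To put the compactness of this representation in perspective, the resynthesized audio is reconstructed from a ~341 bps token stream. Comparing against raw mono 16-bit PCM at each output sample rate (plain arithmetic, independent of the library):

```python
# Raw mono 16-bit PCM bit rates
raw_44k = 44_100 * 16  # 705,600 bps
raw_24k = 24_000 * 16  # 384,000 bps

# MioCodec token stream: 25 Hz * log2(12,800) bits/token, ~341 bps
codec_bps = 341

print(raw_44k // codec_bps)  # ~2069x compression at 44.1 kHz
print(raw_24k // codec_bps)  # ~1126x compression at 24 kHz
```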
MioCodec allows you to swap speaker identities by combining the content tokens of a source with the global embedding of a reference.
```python
source = load_audio("source_content.wav", sample_rate=model.config.sample_rate).cuda()
reference = load_audio("target_speaker.wav", sample_rate=model.config.sample_rate).cuda()

# Perform conversion: source content + reference speaker identity
vc_wave = model.voice_conversion(source, reference)
sf.write("converted.wav", vc_wave.cpu().numpy(), model.config.sample_rate)
```
```bibtex
@misc{miocodec-25hz-44.1khz-v2,
  author       = {Chihiro Arata},
  title        = {MioCodec: High-Fidelity Neural Audio Codec for Efficient Spoken Language Modeling},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/MioCodec-25Hz-44.1kHz-v2}}
}
```