
MioCodec-25Hz-44.1kHz-v2: Lightweight Neural Audio Codec for Efficient Spoken Language Modeling

[GitHub](https://github.com/Aratako/MioCodec)

MioCodec-25Hz-44.1kHz-v2 is an upsampled, high-fidelity version of the MioCodec-25Hz-24kHz model.

By integrating an UpsamplerBlock inspired by Inworld TTS-1 into the decoder, this model reconstructs 44.1 kHz audio from the standard 25 Hz token stream.
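Concretely, at a 25 Hz token rate the decoder must synthesize 44100 / 25 = 1764 output samples per token for 44.1 kHz audio, versus 960 per token at 24 kHz. A quick sanity check (pure arithmetic; the internal layer sizes of the UpsamplerBlock are not documented here):

```python
def samples_per_token(sample_rate_hz: int, token_rate_hz: float = 25.0) -> float:
    """Output samples the decoder must generate per discrete token."""
    return sample_rate_hz / token_rate_hz

print(samples_per_token(44_100))  # v2 model: 1764.0 samples per token
print(samples_per_token(24_000))  # base 24 kHz model: 960.0 samples per token
```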

🌟 What's New in v2

This model is a fine-tuned version of MioCodec-25Hz-24kHz with the following architectural enhancements:

  • 44.1 kHz Output: Achieves higher audio fidelity compared to the base 24 kHz model.
  • UpsamplerBlock + SnakeBeta: We adopted the UpsamplerBlock architecture from Inworld TTS-1 and enhanced it by integrating SnakeBeta activations. This combination allows the decoder to effectively predict and generate high-frequency components, enabling clear 44.1 kHz reconstruction from the lower-resolution input.
  • Token Compatibility: During fine-tuning, the content branch was frozen. This means the discrete tokens generated by this model are identical to those from MioCodec-25Hz-24kHz. You can take any TTS model trained on the 24kHz tokens and simply swap the codec to this v2 model during inference to instantly upgrade the audio quality to 44.1 kHz.
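Because the token vocabulary is unchanged, the codec swap is a one-line change at inference time. The sketch below is hypothetical: `my_tts` and the shape of its outputs are placeholders for whatever TTS model you trained on the 24 kHz tokens, not a real API; only `MioCodecModel.from_pretrained` and `decode` come from this card.

```python
import torch
import soundfile as sf
from miocodec import MioCodecModel

# Swap in the v2 codec; the TTS model itself needs no retraining,
# since the content branch (and thus the token vocabulary) is frozen.
codec = MioCodecModel.from_pretrained("Aratako/MioCodec-25Hz-44.1kHz-v2").eval().cuda()

# Placeholder: any TTS trained against MioCodec-25Hz-24kHz tokens.
token_indices, speaker_embedding = my_tts.generate("Hello world!")

with torch.inference_mode():
    wave = codec.decode(
        content_token_indices=token_indices,
        global_embedding=speaker_embedding,
    )
sf.write("tts_44k.wav", wave.squeeze().cpu().numpy(), codec.config.sample_rate)
```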

πŸ“Š Model Comparison

| Model | Token Rate | Vocab Size | Bit Rate | Sample Rate | SSL Encoder | Vocoder | Parameters | Highlights |
|---|---|---|---|---|---|---|---|---|
| MioCodec-25Hz-44.1kHz-v2 | 25 Hz | 12,800 | 341 bps | 44.1 kHz | WavLM-base+ | - (iSTFTHead) | 133M | Fast inference, good quality |
| MioCodec-25Hz-24kHz | 25 Hz | 12,800 | 341 bps | 24 kHz | WavLM-base+ | - (iSTFTHead) | 132M | Lightweight, fast inference |
| MioCodec-25Hz-44.1kHz | 25 Hz | 12,800 | 341 bps | 44.1 kHz | WavLM-base+ | MioVocoder (jointly tuned) | 118M (w/o vocoder) | High quality, high sample rate |
| kanade-25hz | 25 Hz | 12,800 | 341 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 118M (w/o vocoder) | Original 25 Hz model |
| kanade-12.5hz | 12.5 Hz | 12,800 | 171 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 120M (w/o vocoder) | Original 12.5 Hz model |
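The bit rates above follow directly from token rate × bits per token (log2 of the vocabulary size); a quick check of the table's figures, ignoring the one-off cost of the global embedding:

```python
import math

def bitrate_bps(token_rate_hz: float, vocab_size: int) -> int:
    # Each token carries log2(vocab_size) bits of information.
    return round(token_rate_hz * math.log2(vocab_size))

print(bitrate_bps(25, 12_800))    # 341 bps -- all 25 Hz models
print(bitrate_bps(12.5, 12_800))  # 171 bps -- kanade-12.5hz
```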

πŸš€ Quick Start

Installation

```bash
# Install via pip
pip install git+https://github.com/Aratako/MioCodec

# Or using uv
uv add git+https://github.com/Aratako/MioCodec
```

Basic Inference

Basic usage for encoding and decoding audio:

```python
import soundfile as sf
import torch

from miocodec import MioCodecModel, load_audio

# 1. Load model
model = MioCodecModel.from_pretrained("Aratako/MioCodec-25Hz-44.1kHz-v2").eval().cuda()

# 2. Load audio (resampled to the model's expected input sample rate)
waveform = load_audio("input.wav", sample_rate=model.config.sample_rate).cuda()

with torch.inference_mode():
    # 3. Encode audio into discrete content tokens and a global embedding
    features = model.encode(waveform)

    # 4. Decode to waveform (directly, no external vocoder needed)
    resynth = model.decode(
        content_token_indices=features.content_token_indices,
        global_embedding=features.global_embedding,
    )

# 5. Save (squeeze drops any leading batch/channel dimensions)
sf.write("output.wav", resynth.squeeze().cpu().numpy(), model.config.sample_rate)
```

Voice Conversion (Zero-shot)

MioCodec allows you to swap speaker identities by combining the content tokens of a source utterance with the global embedding of a reference speaker.

```python
# Continues from the basic example above (model already loaded)
source = load_audio("source_content.wav", sample_rate=model.config.sample_rate).cuda()
reference = load_audio("target_speaker.wav", sample_rate=model.config.sample_rate).cuda()

# Perform conversion: source content, reference speaker identity
with torch.inference_mode():
    vc_wave = model.voice_conversion(source, reference)
sf.write("converted.wav", vc_wave.squeeze().cpu().numpy(), model.config.sample_rate)
```
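The same result can be reproduced manually with the `encode`/`decode` API from the basic example, which makes explicit what zero-shot VC does: decode the source's content tokens with the reference's global embedding. This is a sketch assuming `decode` accepts the encoded features exactly as shown earlier:

```python
import torch
import soundfile as sf
from miocodec import MioCodecModel, load_audio

model = MioCodecModel.from_pretrained("Aratako/MioCodec-25Hz-44.1kHz-v2").eval().cuda()
src = load_audio("source_content.wav", sample_rate=model.config.sample_rate).cuda()
ref = load_audio("target_speaker.wav", sample_rate=model.config.sample_rate).cuda()

with torch.inference_mode():
    src_feat = model.encode(src)  # keep the linguistic content
    ref_feat = model.encode(ref)  # borrow the speaker identity
    vc = model.decode(
        content_token_indices=src_feat.content_token_indices,
        global_embedding=ref_feat.global_embedding,
    )
sf.write("converted_manual.wav", vc.squeeze().cpu().numpy(), model.config.sample_rate)
```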

πŸ“œ Acknowledgements

πŸ–ŠοΈ Citation

```bibtex
@misc{miocodec-25hz-44.1khz-v2,
  author = {Chihiro Arata},
  title = {MioCodec: High-Fidelity Neural Audio Codec for Efficient Spoken Language Modeling},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/MioCodec-25Hz-44.1kHz-v2}}
}
```