
MioCodec-25Hz-44.1kHz-v2: Lightweight Neural Audio Codec for Efficient Spoken Language Modeling

[GitHub](https://github.com/Aratako/MioCodec)

MioCodec-25Hz-44.1kHz-v2 is an upsampled, high-fidelity version of the MioCodec-25Hz-24kHz model.

By integrating an UpsamplerBlock inspired by Inworld TTS-1 into the decoder, this model reconstructs 44.1 kHz audio from the standard 25 Hz token stream.
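Concretely, at a 25 Hz token rate the decoder must synthesize 44100 / 25 = 1764 output samples per token for 44.1 kHz audio, versus 960 per token at 24 kHz. A quick sanity check (pure arithmetic; the internal layer sizes of the UpsamplerBlock are not documented here):

```python
def samples_per_token(sample_rate_hz: int, token_rate_hz: float = 25.0) -> float:
    """Output samples the decoder must generate per discrete token."""
    return sample_rate_hz / token_rate_hz

print(samples_per_token(44_100))  # v2 model: 1764.0 samples per token
print(samples_per_token(24_000))  # base 24 kHz model: 960.0 samples per token
```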

🌟 What's New in v2

This model is a fine-tuned version of MioCodec-25Hz-24kHz with the following architectural enhancements:

  • 44.1 kHz Output: Achieves higher audio fidelity compared to the base 24 kHz model.
  • UpsamplerBlock + SnakeBeta: We adopted the UpsamplerBlock architecture from Inworld TTS-1 and enhanced it by integrating SnakeBeta activations. This combination allows the decoder to effectively predict and generate high-frequency components, enabling clear 44.1 kHz reconstruction from the lower-resolution input.
  • Token Compatibility: During fine-tuning, the content branch was frozen. This means the discrete tokens generated by this model are identical to those from MioCodec-25Hz-24kHz. You can take any TTS model trained on the 24kHz tokens and simply swap the codec to this v2 model during inference to instantly upgrade the audio quality to 44.1 kHz.
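Because the token vocabulary is unchanged, the codec swap is a one-line change at inference time. The sketch below is hypothetical: `my_tts` and the shape of its outputs are placeholders for whatever TTS model you trained on the 24 kHz tokens, not a real API; only `MioCodecModel.from_pretrained` and `decode` come from this card.

```python
import torch
import soundfile as sf
from miocodec import MioCodecModel

# Swap in the v2 codec; the TTS model itself needs no retraining,
# since the content branch (and thus the token vocabulary) is frozen.
codec = MioCodecModel.from_pretrained("Aratako/MioCodec-25Hz-44.1kHz-v2").eval().cuda()

# Placeholder: any TTS trained against MioCodec-25Hz-24kHz tokens.
token_indices, speaker_embedding = my_tts.generate("Hello world!")

with torch.inference_mode():
    wave = codec.decode(
        content_token_indices=token_indices,
        global_embedding=speaker_embedding,
    )
sf.write("tts_44k.wav", wave.squeeze().cpu().numpy(), codec.config.sample_rate)
```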

πŸ“Š Model Comparison

| Model | Token Rate | Vocab Size | Bit Rate | Sample Rate | SSL Encoder | Vocoder | Parameters | Highlights |
|---|---|---|---|---|---|---|---|---|
| MioCodec-25Hz-44.1kHz-v2 | 25 Hz | 12,800 | 341 bps | 44.1 kHz | WavLM-base+ | - (iSTFTHead) | 133M | Fast inference, good quality |
| MioCodec-25Hz-24kHz | 25 Hz | 12,800 | 341 bps | 24 kHz | WavLM-base+ | - (iSTFTHead) | 132M | Lightweight, fast inference |
| MioCodec-25Hz-44.1kHz | 25 Hz | 12,800 | 341 bps | 44.1 kHz | WavLM-base+ | MioVocoder (jointly tuned) | 118M (w/o vocoder) | High quality, high sample rate |
| kanade-25hz | 25 Hz | 12,800 | 341 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 118M (w/o vocoder) | Original 25 Hz model |
| kanade-12.5hz | 12.5 Hz | 12,800 | 171 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 120M (w/o vocoder) | Original 12.5 Hz model |
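The bit rates above follow directly from token rate × bits per token (log2 of the vocabulary size); a quick check of the table's figures, ignoring the one-off cost of the global embedding:

```python
import math

def bitrate_bps(token_rate_hz: float, vocab_size: int) -> int:
    # Each token carries log2(vocab_size) bits of information.
    return round(token_rate_hz * math.log2(vocab_size))

print(bitrate_bps(25, 12_800))    # 341 bps -- all 25 Hz models
print(bitrate_bps(12.5, 12_800))  # 171 bps -- kanade-12.5hz
```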

πŸš€ Quick Start

Installation

```bash
# Install via pip
pip install git+https://github.com/Aratako/MioCodec

# Or using uv
uv add git+https://github.com/Aratako/MioCodec
```

Basic Inference

Basic usage for encoding and decoding audio:

```python
import soundfile as sf
import torch

from miocodec import MioCodecModel, load_audio

# 1. Load model
model = MioCodecModel.from_pretrained("Aratako/MioCodec-25Hz-44.1kHz-v2").eval().cuda()

# 2. Load audio (resampled to the model's expected input sample rate)
waveform = load_audio("input.wav", sample_rate=model.config.sample_rate).cuda()

with torch.inference_mode():
    # 3. Encode audio into discrete content tokens and a global embedding
    features = model.encode(waveform)

    # 4. Decode to waveform (directly, no external vocoder needed)
    resynth = model.decode(
        content_token_indices=features.content_token_indices,
        global_embedding=features.global_embedding,
    )

# 5. Save (squeeze drops any leading batch/channel dimensions)
sf.write("output.wav", resynth.squeeze().cpu().numpy(), model.config.sample_rate)
```

Voice Conversion (Zero-shot)

MioCodec allows you to swap speaker identities by combining the content tokens of a source utterance with the global embedding of a reference speaker.

```python
# Continues from the basic example above (model already loaded)
source = load_audio("source_content.wav", sample_rate=model.config.sample_rate).cuda()
reference = load_audio("target_speaker.wav", sample_rate=model.config.sample_rate).cuda()

# Perform conversion: source content, reference speaker identity
with torch.inference_mode():
    vc_wave = model.voice_conversion(source, reference)
sf.write("converted.wav", vc_wave.squeeze().cpu().numpy(), model.config.sample_rate)
```
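The same result can be reproduced manually with the `encode`/`decode` API from the basic example, which makes explicit what zero-shot VC does: decode the source's content tokens with the reference's global embedding. This is a sketch assuming `decode` accepts the encoded features exactly as shown earlier:

```python
import torch
import soundfile as sf
from miocodec import MioCodecModel, load_audio

model = MioCodecModel.from_pretrained("Aratako/MioCodec-25Hz-44.1kHz-v2").eval().cuda()
src = load_audio("source_content.wav", sample_rate=model.config.sample_rate).cuda()
ref = load_audio("target_speaker.wav", sample_rate=model.config.sample_rate).cuda()

with torch.inference_mode():
    src_feat = model.encode(src)  # keep the linguistic content
    ref_feat = model.encode(ref)  # borrow the speaker identity
    vc = model.decode(
        content_token_indices=src_feat.content_token_indices,
        global_embedding=ref_feat.global_embedding,
    )
sf.write("converted_manual.wav", vc.squeeze().cpu().numpy(), model.config.sample_rate)
```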

πŸ“œ Acknowledgements

πŸ–ŠοΈ Citation

```bibtex
@misc{miocodec-25hz-44.1khz-v2,
  author = {Chihiro Arata},
  title = {MioCodec: High-Fidelity Neural Audio Codec for Efficient Spoken Language Modeling},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/MioCodec-25Hz-44.1kHz-v2}}
}
```