# MioCodec: High-Fidelity 44.1kHz Neural Audio Codec for Efficient Spoken Language Modeling
MioCodec-25Hz is a high-fidelity neural audio codec designed for efficient spoken language modeling. Built on the Kanade-Tokenizer implementation, MioCodec extends it to 44.1 kHz audio, delivering higher audio quality while maintaining a very low token rate.
## Overview
MioCodec decomposes speech into two distinct components:
- Content Tokens: Discrete representations that primarily capture linguistic information and phonetic content ("what" is being said) at a low frame rate (25 Hz).
- Global Embeddings: A continuous vector representing broad acoustic characteristics ("how" it is said), including speaker identity, recording environment, and microphone traits.
By disentangling these elements, MioCodec is ideal for Spoken Language Modeling.
### Key features
- High-Resolution: Supports 44.1 kHz audio (compared to the standard 24 kHz in Kanade).
- Ultra-Low Bitrate: Achieves high-fidelity reconstruction at only 341 bps.
- End-to-End Optimization: Unlike the original two-stage approach, the codec and vocoder are jointly fine-tuned to minimize waveform artifacts and jitter.
## Model Comparison
| Model | Token Rate | Vocab Size | Bit Rate | Sample Rate | SSL Encoder | Vocoder | Parameters |
|---|---|---|---|---|---|---|---|
| MioCodec-25Hz | 25 Hz | 12,800 | 341 bps | 44.1 kHz | WavLM-base+ | MioVocoder (Jointly Tuned) | 118M |
| kanade-25hz | 25 Hz | 12,800 | 341 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 118M |
| kanade-12.5hz | 12.5 Hz | 12,800 | 171 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 120M |
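The bit rates in the table follow directly from the token rate and vocabulary size: a single-codebook tokenizer carries log2(vocab_size) bits per token. A quick sanity check:

```python
import math

def bitrate(token_rate_hz: float, vocab_size: int) -> float:
    """Bits per second for a single-codebook tokenizer."""
    return token_rate_hz * math.log2(vocab_size)

print(round(bitrate(25, 12_800)))    # → 341 (MioCodec-25Hz / kanade-25hz)
print(round(bitrate(12.5, 12_800)))  # → 171 (kanade-12.5hz)
```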
## Quick Start

### Installation
```bash
# Install via pip
pip install git+https://github.com/Aratako/MioCodec

# Or using uv
uv add git+https://github.com/Aratako/MioCodec
```
### Basic Inference
Basic usage for encoding and decoding audio:
```python
from miocodec import MioCodec, load_audio
import soundfile as sf

# 1. Load the model
model = MioCodec.from_pretrained("Aratako/MioCodec-25Hz").eval().cuda()

# 2. Load audio at the model's sample rate
waveform = load_audio("input.wav", sample_rate=model.config.sample_rate).cuda()

# 3. Encode into content tokens + global embedding
features = model.encode(waveform)

# 4. Decode back to a waveform
resynth = model.decode(features=features)

# 5. Save (drop the batch dimension before writing)
sf.write("output.wav", resynth.squeeze().cpu().numpy(), samplerate=model.config.sample_rate)
```
### Voice Conversion (Zero-shot)
MioCodec allows you to swap speaker identities by combining the content tokens of a source with the global embedding of a reference.
```python
source = load_audio("source_content.wav", sample_rate=model.config.sample_rate).cuda()
reference = load_audio("target_speaker.wav", sample_rate=model.config.sample_rate).cuda()

# Combine the source's content tokens with the reference's global embedding
vc_wave = model.voice_conversion(source, reference)
sf.write("converted.wav", vc_wave.squeeze().cpu().numpy(), samplerate=model.config.sample_rate)
```
## Training Methodology
To achieve high-fidelity 44.1 kHz reconstruction, MioCodec was trained in three phases. Phases 1 and 2 strictly follow the original Kanade paper to establish feature alignment and spectral sharpness, while Phase 3 introduces a novel end-to-end waveform refinement stage.
### Phase 1: Feature Alignment
This phase corresponds to the "Main Training Phase" described in the original paper. The model is trained to minimize both Mel-spectrogram loss and SSL feature reconstruction loss (using WavLM-base+). The vocoder is not utilized; the loss is computed directly on the predicted mel-spectrograms.
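The combined Phase 1 objective can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the L1/MSE choices and the loss weights are assumptions, and `pred_ssl`/`target_ssl` stand in for predicted and ground-truth WavLM-base+ features.

```python
import numpy as np

def phase1_loss(pred_mel, target_mel, pred_ssl, target_ssl,
                mel_weight=1.0, ssl_weight=1.0):
    """Sketch of the Phase 1 objective: mel reconstruction plus SSL feature
    reconstruction. Distance functions and weights are assumptions."""
    mel_loss = np.mean(np.abs(pred_mel - target_mel))   # L1 over mel bins
    ssl_loss = np.mean((pred_ssl - target_ssl) ** 2)    # MSE over SSL features
    return mel_weight * mel_loss + ssl_weight * ssl_loss
```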
### Phase 2: Adversarial Alignment
Following the "GAN Post-Training" phase of the original paper, we introduce adversarial training to sharpen the spectrograms. In this stage, the content branch is frozen, and only the decoder and global branch are updated. The model is trained using Mel-spectrogram loss combined with GAN losses (Adversarial + Feature Matching) applied in the mel domain.
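The two GAN terms can be sketched as below. This assumes a least-squares (LSGAN) adversarial loss and L1 feature matching over discriminator intermediate activations; the paper's exact GAN formulation and discriminator architecture may differ.

```python
import numpy as np

def feature_matching_loss(real_feats, fake_feats):
    """L1 distance between discriminator intermediate features for real and
    generated mel-spectrograms, averaged over layers (layer list assumed)."""
    losses = [np.mean(np.abs(r - f)) for r, f in zip(real_feats, fake_feats)]
    return sum(losses) / len(losses)

def lsgan_generator_loss(fake_logits):
    """Least-squares adversarial loss for the generator: push the
    discriminator's outputs on generated samples toward 1 ("real")."""
    return np.mean((fake_logits - 1.0) ** 2)
```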
### Phase 3: End-to-End Waveform Refinement
To address residual artifacts such as jitter or tremor that often remain after mel-only training, we introduced a third phase that shifts the training domain to raw waveforms.
In this phase, the vocoder is unfrozen, allowing the codec decoder and vocoder to be fine-tuned jointly in an end-to-end manner. As in Phase 2, the content branch remains frozen. The training objective minimizes waveform artifacts using objectives adapted from XCodec2 and Inworld TTS-1, with specific parameters tuned for 44.1 kHz:
- Multi-Resolution Mel Spectrogram Loss: using window lengths of `[32, 64, 128, 256, 512, 1024, 2048, 4096]`.
- Multi-Period Discriminator (MPD): using periods of `[2, 3, 5, 7, 11, 17, 23, 37]`.
- Multi-Scale STFT Discriminator (MS-STFTD): using FFT sizes of `[216, 348, 568, 920, 1494, 2414, 3908, 6328]`.
- RMS Loss: adopted from Inworld TTS-1 to stabilize energy and volume.
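Two of the reconstruction terms above can be sketched with NumPy. This is an illustrative approximation: it uses plain STFT magnitudes rather than mel filterbanks, and the Hann window, hop size, and L1 distance are assumptions.

```python
import numpy as np

WINDOW_LENGTHS = [32, 64, 128, 256, 512, 1024, 2048, 4096]

def stft_mag(x, win_len, hop=None):
    """Magnitude STFT with a Hann window (framing details are assumptions)."""
    hop = hop or win_len // 4
    window = np.hanning(win_len)
    frames = [x[i:i + win_len] * window
              for i in range(0, len(x) - win_len + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multi_resolution_loss(pred, target):
    """L1 between magnitude spectrograms at each resolution, averaged."""
    losses = [np.mean(np.abs(stft_mag(pred, w) - stft_mag(target, w)))
              for w in WINDOW_LENGTHS]
    return sum(losses) / len(losses)

def rms_loss(pred, target):
    """Absolute difference of RMS energy, to stabilize loudness."""
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    return abs(rms(pred) - rms(target))
```

Averaging over multiple window lengths trades off time and frequency resolution, which is why the list spans short (32-sample) to long (4096-sample) windows.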
## Training Data
The training datasets are listed below:
| Language | Approx. Hours | Dataset | Used in Phases |
|---|---|---|---|
| Japanese | ~15,000h | Various public HF datasets | 1, 2, 3 |
| English | ~500h | Libriheavy-HQ | 1, 2, 3 |
| English | ~4,000h | MLS-Sidon | 1, 2 |
| English | ~9,000h | HiFiTTS-2 | 3 |
| German | ~1,950h | MLS-Sidon | 1, 2 |
| Dutch | ~1,550h | MLS-Sidon | 1, 2 |
| French | ~1,050h | MLS-Sidon | 1, 2 |
| Spanish | ~900h | MLS-Sidon | 1, 2 |
| Italian | ~240h | MLS-Sidon | 1, 2 |
| Portuguese | ~160h | MLS-Sidon | 1, 2 |
| Polish | ~100h | MLS-Sidon | 1, 2 |
## Acknowledgements
- Codec Architecture: Based on the brilliant work of kanade-tokenizer.
- Vocoder Base: Weights and codebase derived from AliasingFreeNeuralAudioSynthesis.
- Training Techniques: Phase 3 training objectives were heavily inspired by XCodec2 and Inworld TTS-1.
## Citation
```bibtex
@misc{miocodec-25hz,
  author       = {Chihiro Arata},
  title        = {MioCodec: High-Fidelity 44.1kHz Neural Audio Codec},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/MioCodec-25Hz}}
}
```