# MioCodec-25Hz-24kHz: Lightweight Neural Audio Codec for Efficient Spoken Language Modeling
MioCodec-25Hz-24kHz is a lightweight and fast neural audio codec designed for efficient spoken language modeling. Based on the Kanade-Tokenizer implementation, this model features an integrated wave decoder (iSTFTHead) that directly synthesizes waveforms without requiring an external vocoder.
For higher audio fidelity at 44.1 kHz, see MioCodec-25Hz-44.1kHz.
## Overview
MioCodec decomposes speech into two distinct components:
- Content Tokens: Discrete representations that primarily capture linguistic information and phonetic content ("what" is being said) at a low frame rate (25 Hz).
- Global Embeddings: A continuous vector representing broad acoustic characteristics ("how"), including speaker identity, recording environment, and microphone traits.
By disentangling these elements, MioCodec lets a language model operate on compact content tokens while the global embedding restores acoustic detail at decoding time, making it well suited for spoken language modeling.
### Key Features
- Lightweight & Fast: Integrated wave decoder (iSTFTHead) enables direct waveform synthesis without an external vocoder.
- Ultra-Low Bitrate: Achieves high-fidelity reconstruction at only 341 bps.
- End-to-End Design: Single model architecture from audio input to waveform output.
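The 341 bps figure follows directly from the token rate and vocabulary size: each 25 Hz content token carries log2(12,800) bits. A quick sanity check:

```python
import math

token_rate_hz = 25    # content tokens per second
vocab_size = 12_800   # codebook entries

bits_per_token = math.log2(vocab_size)         # ~13.64 bits per token
bitrate_bps = token_rate_hz * bits_per_token

print(f"{bitrate_bps:.0f} bps")  # → 341 bps
```

The same arithmetic gives ~171 bps for the 12.5 Hz variant in the comparison table below.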
## Model Comparison
| Model | Token Rate | Vocab Size | Bit Rate | Sample Rate | SSL Encoder | Vocoder | Parameters | Highlights |
|---|---|---|---|---|---|---|---|---|
| MioCodec-25Hz-24kHz | 25 Hz | 12,800 | 341 bps | 24 kHz | WavLM-base+ | - (iSTFTHead) | 132M | Lightweight, fast inference |
| MioCodec-25Hz-44.1kHz | 25 Hz | 12,800 | 341 bps | 44.1 kHz | WavLM-base+ | MioVocoder (Jointly Tuned) | 118M (w/o vocoder) | High-quality, high sample rate |
| kanade-25hz | 25 Hz | 12,800 | 341 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 118M (w/o vocoder) | Original 25Hz model |
| kanade-12.5hz | 12.5 Hz | 12,800 | 171 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 120M (w/o vocoder) | Original 12.5Hz model |
## Quick Start
### Installation
```bash
# Install via pip
pip install git+https://github.com/Aratako/MioCodec

# Or using uv
uv add git+https://github.com/Aratako/MioCodec
```
### Basic Inference
Basic usage for encoding and decoding audio:
```python
import soundfile as sf

from miocodec import MioCodecModel, load_audio

# 1. Load the model
model = MioCodecModel.from_pretrained("Aratako/MioCodec-25Hz-24kHz").eval().cuda()

# 2. Load audio at the model's sample rate
waveform = load_audio("input.wav", sample_rate=model.config.sample_rate).cuda()

# 3. Encode into content tokens and a global embedding
features = model.encode(waveform)

# 4. Decode back to a waveform (directly, no vocoder needed)
resynth = model.decode(
    content_token_indices=features.content_token_indices,
    global_embedding=features.global_embedding,
)

# 5. Save the reconstruction
sf.write("output.wav", resynth.cpu().numpy(), model.config.sample_rate)
```
### Voice Conversion (Zero-shot)
MioCodec allows you to swap speaker identities by combining the content tokens of a source with the global embedding of a reference.
```python
source = load_audio("source_content.wav", sample_rate=model.config.sample_rate).cuda()
reference = load_audio("target_speaker.wav", sample_rate=model.config.sample_rate).cuda()

# Content from `source`, voice characteristics from `reference`
vc_wave = model.voice_conversion(source, reference)
sf.write("converted.wav", vc_wave.cpu().numpy(), model.config.sample_rate)
```
## Training Methodology
MioCodec-25Hz-24kHz was trained in two phases with an integrated wave decoder that directly synthesizes waveforms via iSTFT.
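The iSTFT-style decoding idea can be illustrated as follows: the decoder predicts per-frame magnitude and phase, and an inverse STFT with windowed overlap-add turns them into a waveform, so no separate neural vocoder is needed. This is a NumPy sketch of the general mechanism, not MioCodec's actual iSTFTHead (the real head's window size, hop, and parameterization are not specified here):

```python
import numpy as np

def istft_head(mag, phase, win_len=2048, hop=512):
    """Overlap-add inverse STFT from per-frame magnitude/phase (simplified sketch)."""
    spec = mag * np.exp(1j * phase)                # complex spectrum per frame
    frames = np.fft.irfft(spec, n=win_len, axis=-1)
    win = np.hanning(win_len)
    out = np.zeros((len(frames) - 1) * hop + win_len)
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + win_len] += frame * win
        norm[i * hop:i * hop + win_len] += win ** 2
    return out / np.maximum(norm, 1e-8)            # normalize the window overlap

# Round trip: analyze a sine with a matching forward STFT, then resynthesize.
t = np.arange(24_000)
x = np.sin(2 * np.pi * 440 * t / 24_000)
win = np.hanning(2048)
frames = np.stack([x[i:i + 2048] * win for i in range(0, len(x) - 2048 + 1, 512)])
spec = np.fft.rfft(frames, axis=-1)
y = istft_head(np.abs(spec), np.angle(spec))
# Interior samples match the input almost exactly.
```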
### Phase 1: Feature Alignment
The model is trained to minimize both a multi-resolution mel-spectrogram loss and an SSL feature reconstruction loss (using WavLM-base+). The wave decoder generates waveforms directly, and losses are computed on the reconstructed audio.

- Multi-Resolution Mel-Spectrogram Loss: computed with window lengths of `[32, 64, 128, 256, 512, 1024, 2048]`.
- SSL Feature Reconstruction Loss: computed on WavLM-base+ features.
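The multi-resolution idea is that short windows capture transients and long windows capture harmonic structure, so comparing spectrograms at several resolutions penalizes both kinds of error. Below is an illustrative NumPy sketch using plain linear-magnitude spectrograms rather than mel (and a fixed 1/4 hop), so it shows the structure of the loss, not MioCodec's exact implementation:

```python
import numpy as np

def stft_mag(x, win_len):
    """Magnitude spectrogram with a Hann window and a 1/4-window hop (simplified)."""
    hop = max(win_len // 4, 1)
    win = np.hanning(win_len)
    frames = [x[i:i + win_len] * win
              for i in range(0, len(x) - win_len + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multi_resolution_loss(pred, target,
                          win_lens=(32, 64, 128, 256, 512, 1024, 2048)):
    """Average L1 distance between magnitude spectrograms at several resolutions."""
    losses = []
    for w in win_lens:
        if len(pred) < w:
            continue  # skip resolutions longer than the signal
        losses.append(np.mean(np.abs(stft_mag(pred, w) - stft_mag(target, w))))
    return float(np.mean(losses))

# Identical signals give zero loss; any spectral mismatch gives a positive loss.
t = np.linspace(0, 1, 24_000, endpoint=False)
x = np.sin(2 * np.pi * 440 * t).astype(np.float32)
print(multi_resolution_loss(x, x))  # → 0.0
```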
### Phase 2: Adversarial Refinement
Building upon Phase 1, adversarial training is introduced to improve perceptual quality. The training objectives include:
- Multi-Resolution Mel-Spectrogram Loss: window lengths of `[32, 64, 128, 256, 512, 1024, 2048]`.
- SSL Feature Reconstruction Loss: WavLM-base+ features.
- Multi-Period Discriminator (MPD): periods of `[2, 3, 5, 7, 11, 17, 23]`.
- Multi-Scale STFT Discriminator (MS-STFTD): FFT sizes of `[118, 190, 310, 502, 814, 1314, 2128, 3444]`.
- RMS Loss: stabilizes energy and overall volume.
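The RMS loss compares the energy envelope of the reconstruction against the reference so the model cannot drift in loudness while still matching spectra. One plausible formulation (frame-level RMS with an L1 penalty; the actual MioCodec frame length and norm are not specified here) looks like this:

```python
import numpy as np

def rms_loss(pred, target, frame_len=1024):
    """L1 distance between frame-level RMS envelopes (a plausible formulation)."""
    n = (min(len(pred), len(target)) // frame_len) * frame_len
    p = pred[:n].reshape(-1, frame_len)
    t = target[:n].reshape(-1, frame_len)
    rms = lambda f: np.sqrt(np.mean(f ** 2, axis=-1) + 1e-8)
    return float(np.mean(np.abs(rms(p) - rms(t))))

x = np.random.default_rng(0).standard_normal(24_000).astype(np.float32)
print(rms_loss(x, x))        # → 0.0
print(rms_loss(x, 0.5 * x))  # positive: penalizes the volume mismatch
```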
## Training Data
The training datasets are listed below:
| Language | Approx. Hours | Dataset |
|---|---|---|
| Japanese | ~22,500h | Various public HF datasets |
| English | ~500h | Libriheavy-HQ |
| English | ~4,000h | MLS-Sidon |
| English | ~9,000h | HiFiTTS-2 |
| English | ~27,000h | Emilia-YODAS |
| German | ~1,950h | MLS-Sidon |
| German | ~5,600h | Emilia-YODAS |
| Dutch | ~1,550h | MLS-Sidon |
| French | ~1,050h | MLS-Sidon |
| French | ~7,400h | Emilia-YODAS |
| Spanish | ~900h | MLS-Sidon |
| Italian | ~240h | MLS-Sidon |
| Portuguese | ~160h | MLS-Sidon |
| Polish | ~100h | MLS-Sidon |
| Korean | ~7,300h | Emilia-YODAS |
| Chinese | ~300h | Emilia-YODAS |
## Acknowledgements
- Codec Architecture: Based on the brilliant work of kanade-tokenizer.
- Decoder Design: Inspired by XCodec2.
- Training Techniques: Training objectives were inspired by XCodec2 and Inworld TTS-1.
## Citation
```bibtex
@misc{miocodec-25hz-24khz,
  author       = {Chihiro Arata},
  title        = {MioCodec: High-Fidelity Neural Audio Codec for Efficient Spoken Language Modeling},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/MioCodec-25Hz-24kHz}}
}
```