# MioCodec: High-Fidelity 44.1kHz Neural Audio Codec for Efficient Spoken Language Modeling
MioCodec-25Hz is a high-fidelity neural audio codec designed for efficient spoken language modeling. Built on the Kanade-Tokenizer implementation, MioCodec extends it to 44.1 kHz audio, delivering higher audio quality while maintaining a very low token rate.
## Overview
MioCodec decomposes speech into two distinct components:
- Content Tokens: Discrete representations that primarily capture linguistic information and phonetic content ("what" is being said) at a low frame rate (25 Hz).
- Global Embeddings: A continuous vector representing broad acoustic characteristics ("how" it is said), including speaker identity, recording environment, and microphone traits.
By disentangling these elements, MioCodec is ideal for Spoken Language Modeling.
### Key features
- High-Resolution: Supports 44.1 kHz audio (compared to the standard 24 kHz in Kanade).
- Ultra-Low Bitrate: Achieves high-fidelity reconstruction at only 341 bps.
- End-to-End Optimization: Unlike the original two-stage approach, the codec and vocoder are jointly fine-tuned to minimize waveform artifacts and jitter.
## Model Comparison
| Model | Token Rate | Vocab Size | Bit Rate | Sample Rate | SSL Encoder | Vocoder | Parameters |
|---|---|---|---|---|---|---|---|
| MioCodec-25Hz | 25 Hz | 12,800 | 341 bps | 44.1 kHz | WavLM-base+ | MioVocoder (Jointly Tuned) | 118M |
| kanade-25hz | 25 Hz | 12,800 | 341 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 118M |
| kanade-12.5hz | 12.5 Hz | 12,800 | 171 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 120M |
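The bit rates in the table follow directly from the token rate and vocabulary size: a single-codebook tokenizer carries log2(vocab_size) bits per token. A quick sanity check:

```python
import math

def bitrate(token_rate_hz: float, vocab_size: int) -> float:
    """Bits per second for a single-codebook tokenizer."""
    return token_rate_hz * math.log2(vocab_size)

print(round(bitrate(25, 12_800)))    # → 341 (MioCodec-25Hz / kanade-25hz)
print(round(bitrate(12.5, 12_800)))  # → 171 (kanade-12.5hz)
```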
## Quick Start

### Installation
```bash
# Install via pip
pip install git+https://github.com/Aratako/MioCodec

# Or using uv
uv add git+https://github.com/Aratako/MioCodec
```
### Basic Inference
Basic usage for encoding and decoding audio:
```python
from miocodec import MioCodec, load_audio
import soundfile as sf

# 1. Load the model
model = MioCodec.from_pretrained("Aratako/MioCodec-25Hz").eval().cuda()

# 2. Load audio at the model's sample rate
waveform = load_audio("input.wav", sample_rate=model.config.sample_rate).cuda()

# 3. Encode into content tokens + global embedding
features = model.encode(waveform)

# 4. Decode back to a waveform
resynth = model.decode(features=features)

# 5. Save (drop the batch dimension before writing)
sf.write("output.wav", resynth.squeeze().cpu().numpy(), samplerate=model.config.sample_rate)
```
### Voice Conversion (Zero-shot)
MioCodec allows you to swap speaker identities by combining the content tokens of a source with the global embedding of a reference.
```python
source = load_audio("source_content.wav", sample_rate=model.config.sample_rate).cuda()
reference = load_audio("target_speaker.wav", sample_rate=model.config.sample_rate).cuda()

# Combine the source's content tokens with the reference's global embedding
vc_wave = model.voice_conversion(source, reference)
sf.write("converted.wav", vc_wave.squeeze().cpu().numpy(), samplerate=model.config.sample_rate)
```
## Training Methodology
To achieve high-fidelity 44.1 kHz reconstruction, MioCodec was trained in three phases. Phases 1 and 2 strictly follow the original Kanade paper to establish feature alignment and spectral sharpness, while Phase 3 introduces a novel end-to-end waveform refinement stage.
### Phase 1: Feature Alignment
This phase corresponds to the "Main Training Phase" described in the original paper. The model is trained to minimize both Mel-spectrogram loss and SSL feature reconstruction loss (using WavLM-base+). The vocoder is not utilized; the loss is computed directly on the predicted mel-spectrograms.
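The combined Phase 1 objective can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the L1/MSE choices and the loss weights are assumptions, and `pred_ssl`/`target_ssl` stand in for predicted and ground-truth WavLM-base+ features.

```python
import numpy as np

def phase1_loss(pred_mel, target_mel, pred_ssl, target_ssl,
                mel_weight=1.0, ssl_weight=1.0):
    """Sketch of the Phase 1 objective: mel reconstruction plus SSL feature
    reconstruction. Distance functions and weights are assumptions."""
    mel_loss = np.mean(np.abs(pred_mel - target_mel))   # L1 over mel bins
    ssl_loss = np.mean((pred_ssl - target_ssl) ** 2)    # MSE over SSL features
    return mel_weight * mel_loss + ssl_weight * ssl_loss
```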
### Phase 2: Adversarial Alignment
Following the "GAN Post-Training" phase of the original paper, we introduce adversarial training to sharpen the spectrograms. In this stage, the content branch is frozen, and only the decoder and global branch are updated. The model is trained using Mel-spectrogram loss combined with GAN losses (Adversarial + Feature Matching) applied in the mel domain.
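The two GAN terms can be sketched as below. This assumes a least-squares (LSGAN) adversarial loss and L1 feature matching over discriminator intermediate activations; the paper's exact GAN formulation and discriminator architecture may differ.

```python
import numpy as np

def feature_matching_loss(real_feats, fake_feats):
    """L1 distance between discriminator intermediate features for real and
    generated mel-spectrograms, averaged over layers (layer list assumed)."""
    losses = [np.mean(np.abs(r - f)) for r, f in zip(real_feats, fake_feats)]
    return sum(losses) / len(losses)

def lsgan_generator_loss(fake_logits):
    """Least-squares adversarial loss for the generator: push the
    discriminator's outputs on generated samples toward 1 ("real")."""
    return np.mean((fake_logits - 1.0) ** 2)
```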
### Phase 3: End-to-End Waveform Refinement
To address residual artifacts such as jitter or tremor that often remain after mel-only training, we introduced a third phase that shifts the training domain to raw waveforms.
In this phase, the vocoder is unfrozen, allowing the codec decoder and vocoder to be fine-tuned jointly in an end-to-end manner. As in Phase 2, the content branch remains frozen. The training objective minimizes waveform artifacts using objectives adapted from XCodec2 and Inworld TTS-1, with specific parameters tuned for 44.1 kHz:
- Multi-Resolution Mel Spectrogram Loss: using window lengths of `[32, 64, 128, 256, 512, 1024, 2048, 4096]`.
- Multi-Period Discriminator (MPD): using periods of `[2, 3, 5, 7, 11, 17, 23, 37]`.
- Multi-Scale STFT Discriminator (MS-STFTD): using FFT sizes of `[216, 348, 568, 920, 1494, 2414, 3908, 6328]`.
- RMS Loss: adopted from Inworld TTS-1 to stabilize energy and volume.
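Two of the reconstruction terms above can be sketched with NumPy. This is an illustrative approximation: it uses plain STFT magnitudes rather than mel filterbanks, and the Hann window, hop size, and L1 distance are assumptions.

```python
import numpy as np

WINDOW_LENGTHS = [32, 64, 128, 256, 512, 1024, 2048, 4096]

def stft_mag(x, win_len, hop=None):
    """Magnitude STFT with a Hann window (framing details are assumptions)."""
    hop = hop or win_len // 4
    window = np.hanning(win_len)
    frames = [x[i:i + win_len] * window
              for i in range(0, len(x) - win_len + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multi_resolution_loss(pred, target):
    """L1 between magnitude spectrograms at each resolution, averaged."""
    losses = [np.mean(np.abs(stft_mag(pred, w) - stft_mag(target, w)))
              for w in WINDOW_LENGTHS]
    return sum(losses) / len(losses)

def rms_loss(pred, target):
    """Absolute difference of RMS energy, to stabilize loudness."""
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    return abs(rms(pred) - rms(target))
```

Averaging over multiple window lengths trades off time and frequency resolution, which is why the list spans short (32-sample) to long (4096-sample) windows.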
## Training Data
The training datasets are listed below:
| Language | Approx. Hours | Dataset | Used in Phases |
|---|---|---|---|
| Japanese | ~15,000h | Various public HF datasets | 1, 2, 3 |
| English | ~500h | Libriheavy-HQ | 1, 2, 3 |
| English | ~4,000h | MLS-Sidon | 1, 2 |
| English | ~9,000h | HiFiTTS-2 | 3 |
| German | ~1,950h | MLS-Sidon | 1, 2 |
| Dutch | ~1,550h | MLS-Sidon | 1, 2 |
| French | ~1,050h | MLS-Sidon | 1, 2 |
| Spanish | ~900h | MLS-Sidon | 1, 2 |
| Italian | ~240h | MLS-Sidon | 1, 2 |
| Portuguese | ~160h | MLS-Sidon | 1, 2 |
| Polish | ~100h | MLS-Sidon | 1, 2 |
## Acknowledgements
- Codec Architecture: Based on the brilliant work of kanade-tokenizer.
- Vocoder Base: Weights and codebase derived from AliasingFreeNeuralAudioSynthesis.
- Training Techniques: Phase 3 training objectives were heavily inspired by XCodec2 and Inworld TTS-1.
## Citation
```bibtex
@misc{miocodec-25hz,
  author       = {Chihiro Arata},
  title        = {MioCodec: High-Fidelity 44.1kHz Neural Audio Codec},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/MioCodec-25Hz}}
}
```