
MioCodec: High-Fidelity 44.1kHz Neural Audio Codec for Efficient Spoken Language Modeling

[GitHub](https://github.com/Aratako/MioCodec)

MioCodec-25Hz is a high-fidelity neural audio codec designed for efficient spoken language modeling. Built on the Kanade-Tokenizer implementation, MioCodec extends it to a 44.1 kHz sampling rate, providing superior audio quality while maintaining a very low token rate.

🌟 Overview

MioCodec decomposes speech into two distinct components:

  1. Content Tokens: Discrete representations that primarily capture linguistic information and phonetic content ("what" is being said) at a low frame rate (25 Hz).
  2. Global Embeddings: A continuous vector representing broad acoustic characteristics ("how"): speaker identity, recording environment, and microphone traits.

This disentanglement makes MioCodec well suited to spoken language modeling.

Key features

  • High-Resolution: Supports 44.1 kHz audio (compared to the standard 24 kHz in Kanade).
  • Ultra-Low Bitrate: Achieves high-fidelity reconstruction at only 341 bps.
  • End-to-End Optimization: Unlike the original two-stage pipeline, the codec and vocoder are jointly fine-tuned to minimize waveform artifacts and jitter.

πŸ“Š Model Comparison

| Model | Token Rate | Vocab Size | Bit Rate | Sample Rate | SSL Encoder | Vocoder | Parameters |
|---|---|---|---|---|---|---|---|
| MioCodec-25Hz | 25 Hz | 12,800 | 341 bps | 44.1 kHz | WavLM-base+ | MioVocoder (jointly tuned) | 118M |
| kanade-25hz | 25 Hz | 12,800 | 341 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 118M |
| kanade-12.5hz | 12.5 Hz | 12,800 | 171 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 120M |
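The bit rates above follow directly from the token rate and vocabulary size: each token carries log2(vocab) bits. A quick sanity check (this is the standard formula, not anything MioCodec-specific):

```python
import math

def bitrate_bps(token_rate_hz: float, vocab_size: int) -> float:
    """Bits per second = tokens per second * bits per token."""
    return token_rate_hz * math.log2(vocab_size)

print(round(bitrate_bps(25, 12_800)))    # -> 341 (MioCodec-25Hz, kanade-25hz)
print(round(bitrate_bps(12.5, 12_800)))  # -> 171 (kanade-12.5hz)
```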

πŸš€ Quick Start

Installation

```bash
# Install via pip
pip install git+https://github.com/Aratako/MioCodec

# Or using uv
uv add git+https://github.com/Aratako/MioCodec
```

Basic Inference

Basic usage for encoding and decoding audio:

```python
from miocodec import MioCodec, load_audio
import soundfile as sf

# 1. Load model
model = MioCodec.from_pretrained("Aratako/MioCodec-25Hz").eval().cuda()

# 2. Load audio
waveform = load_audio("input.wav", sample_rate=model.config.sample_rate).cuda()

# 3. Encode audio
features = model.encode(waveform)

# 4. Decode to waveform
resynth = model.decode(features=features)

# 5. Save
sf.write("output.wav", resynth.cpu().numpy(), samplerate=model.config.sample_rate)
```

Voice Conversion (Zero-shot)

MioCodec allows you to swap speaker identities by combining the content tokens of a source with the global embedding of a reference.

```python
source = load_audio("source_content.wav", sample_rate=model.config.sample_rate).cuda()
reference = load_audio("target_speaker.wav", sample_rate=model.config.sample_rate).cuda()

# Perform conversion
vc_wave = model.voice_conversion(source, reference)
sf.write("converted.wav", vc_wave.cpu().numpy(), samplerate=model.config.sample_rate)
```

πŸ—οΈ Training Methodology

To achieve high-fidelity 44.1 kHz reconstruction, MioCodec was trained in three phases. Phases 1 and 2 strictly follow the original Kanade paper to establish feature alignment and spectral sharpness, while Phase 3 introduces a novel end-to-end waveform refinement stage.

Phase 1: Feature Alignment

This phase corresponds to the "Main Training Phase" described in the original paper. The model is trained to minimize both Mel-spectrogram loss and SSL feature reconstruction loss (using WavLM-base+). The vocoder is not utilized; the loss is computed directly on the predicted mel-spectrograms.
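As an illustration of the Phase 1 objective, the sketch below combines an L1 mel-spectrogram loss with an L2 SSL-feature loss. The specific loss norms and weights here are assumptions for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def phase1_loss(mel_pred, mel_true, ssl_pred, ssl_true,
                w_mel=1.0, w_ssl=1.0):
    """Combined mel-spectrogram + SSL feature reconstruction loss (sketch)."""
    mel_loss = np.abs(mel_pred - mel_true).mean()   # L1 over mel bins
    ssl_loss = ((ssl_pred - ssl_true) ** 2).mean()  # L2 over WavLM features
    return w_mel * mel_loss + w_ssl * ssl_loss

# toy tensors: (frames, mel_bins) and (frames, ssl_dim)
rng = np.random.default_rng(0)
mel = rng.normal(size=(100, 128))
ssl = rng.normal(size=(50, 768))
print(phase1_loss(mel, mel, ssl, ssl))  # identical inputs -> 0.0
```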

Phase 2: Adversarial Alignment

Following the "GAN Post-Training" phase of the original paper, we introduce adversarial training to sharpen the spectrograms. In this stage, the content branch is frozen, and only the decoder and global branch are updated. The model is trained using Mel-spectrogram loss combined with GAN losses (Adversarial + Feature Matching) applied in the mel domain.
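For intuition, the GAN terms named above can be sketched with a least-squares adversarial loss and an L1 feature-matching loss. The discriminator architecture and the exact loss variant used by Kanade are not specified in this card, so treat this as a generic sketch:

```python
import numpy as np

def lsgan_generator_loss(disc_scores_fake):
    """Least-squares adversarial loss: push scores on generated mels toward 1."""
    return np.mean((disc_scores_fake - 1.0) ** 2)

def feature_matching_loss(feats_real, feats_fake):
    """L1 distance between discriminator feature maps of real and generated mels."""
    return sum(np.abs(r - f).mean() for r, f in zip(feats_real, feats_fake))

scores = np.full(16, 1.0)            # a perfectly fooled discriminator
print(lsgan_generator_loss(scores))  # -> 0.0
```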

Phase 3: End-to-End Waveform Refinement

To address residual artifacts such as jitter or tremor often found in mel-only training, we introduced a third phase that shifts the domain to raw waveforms.

In this phase, the vocoder is unfrozen, allowing the codec decoder and vocoder to be fine-tuned jointly in an end-to-end manner. As in Phase 2, the content branch remains frozen. The training objective minimizes waveform artifacts using objectives adapted from XCodec2 and Inworld TTS-1, with specific parameters tuned for 44.1 kHz:

  • Multi-Resolution Mel Spectrogram Loss: Using window lengths of [32, 64, 128, 256, 512, 1024, 2048, 4096].
  • Multi-Period Discriminator (MPD): Using periods of [2, 3, 5, 7, 11, 17, 23, 37].
  • Multi-Scale STFT Discriminator (MS-STFTD): Using FFT sizes of [216, 348, 568, 920, 1494, 2414, 3908, 6328].
  • RMS Loss: Adopted from Inworld TTS-1 to stabilize energy and volume.
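An RMS loss of the kind listed above can be sketched as the L1 distance between frame-wise RMS envelopes of the predicted and reference waveforms. The frame/hop sizes and exact formulation here are illustrative assumptions; Inworld TTS-1's version may differ:

```python
import numpy as np

def frame_rms(x, frame=1024, hop=256):
    """Frame-wise root-mean-square energy of a mono waveform."""
    frames = [x[i:i + frame] for i in range(0, len(x) - frame + 1, hop)]
    return np.array([np.sqrt(np.mean(f ** 2)) for f in frames])

def rms_loss(pred, true, frame=1024, hop=256):
    """L1 distance between RMS envelopes (stabilizes energy and volume)."""
    return np.abs(frame_rms(pred, frame, hop) - frame_rms(true, frame, hop)).mean()

t = np.linspace(0, 1, 44_100, endpoint=False)
x = np.sin(2 * np.pi * 440 * t)
print(rms_loss(x, x))            # identical signals -> 0.0
print(rms_loss(0.5 * x, x) > 0)  # attenuated copy is penalized -> True
```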

πŸ“š Training Data

The training datasets are listed below:

| Language | Approx. Hours | Dataset | Used in Phases |
|---|---|---|---|
| Japanese | ~15,000h | Various public HF datasets | 1, 2, 3 |
| English | ~500h | Libriheavy-HQ | 1, 2, 3 |
| English | ~4,000h | MLS-Sidon | 1, 2 |
| English | ~9,000h | HiFiTTS-2 | 3 |
| German | ~1,950h | MLS-Sidon | 1, 2 |
| Dutch | ~1,550h | MLS-Sidon | 1, 2 |
| French | ~1,050h | MLS-Sidon | 1, 2 |
| Spanish | ~900h | MLS-Sidon | 1, 2 |
| Italian | ~240h | MLS-Sidon | 1, 2 |
| Portuguese | ~160h | MLS-Sidon | 1, 2 |
| Polish | ~100h | MLS-Sidon | 1, 2 |

πŸ“œ Acknowledgements

πŸ–ŠοΈ Citation

```bibtex
@misc{miocodec-25hz,
  author = {Chihiro Arata},
  title = {MioCodec: High-Fidelity 44.1kHz Neural Audio Codec},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/MioCodec-25Hz}}
}
```