MioCodec-25Hz-24kHz: Lightweight Neural Audio Codec for Efficient Spoken Language Modeling


MioCodec-25Hz-24kHz is a lightweight and fast neural audio codec designed for efficient spoken language modeling. Based on the Kanade-Tokenizer implementation, this model features an integrated wave decoder (iSTFTHead) that directly synthesizes waveforms without requiring an external vocoder.

For higher audio fidelity at 44.1 kHz, see MioCodec-25Hz-44.1kHz.

🌟 Overview

MioCodec decomposes speech into two distinct components:

  1. Content Tokens: Discrete representations that primarily capture linguistic information and phonetic content ("what" is being said) at a low frame rate (25 Hz).
  2. Global Embeddings: A continuous vector representing broad acoustic characteristics ("how" it is being said), including speaker identity, recording environment, and microphone traits.

By disentangling these elements, MioCodec is well suited to spoken language modeling: a language model only needs to predict the low-rate content token stream, while speaker and channel characteristics are carried by a single global embedding.

Key features

  • Lightweight & Fast: Integrated wave decoder (iSTFTHead) enables direct waveform synthesis without an external vocoder.
  • Ultra-Low Bitrate: Achieves high-fidelity reconstruction at only 341 bps.
  • End-to-End Design: Single model architecture from audio input to waveform output.
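
The 341 bps figure follows directly from the token rate and vocabulary size: a single codebook of 12,800 entries carries log2(12800) ≈ 13.64 bits per token, emitted 25 times per second. A quick sanity check (the helper name is ours, purely for illustration):

```python
import math

def bitrate_bps(token_rate_hz: float, vocab_size: int) -> float:
    """Bits per second of a single-codebook token stream."""
    return token_rate_hz * math.log2(vocab_size)

print(round(bitrate_bps(25, 12800)))    # 341 (25 Hz models)
print(round(bitrate_bps(12.5, 12800)))  # 171 (kanade-12.5hz)
```

The same formula reproduces the 171 bps figure of the 12.5 Hz model in the comparison table below (the global embedding is a one-off per utterance and is not counted in the bitrate).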

📊 Model Comparison

| Model | Token Rate | Vocab Size | Bit Rate | Sample Rate | SSL Encoder | Vocoder | Parameters | Highlights |
|---|---|---|---|---|---|---|---|---|
| MioCodec-25Hz-24kHz | 25 Hz | 12,800 | 341 bps | 24 kHz | WavLM-base+ | - (iSTFTHead) | 132M | Lightweight, fast inference |
| MioCodec-25Hz-44.1kHz | 25 Hz | 12,800 | 341 bps | 44.1 kHz | WavLM-base+ | MioVocoder (jointly tuned) | 118M (w/o vocoder) | High-quality, high sample rate |
| kanade-25hz | 25 Hz | 12,800 | 341 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 118M (w/o vocoder) | Original 25 Hz model |
| kanade-12.5hz | 12.5 Hz | 12,800 | 171 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 120M (w/o vocoder) | Original 12.5 Hz model |

🚀 Quick Start

Installation

```bash
# Install via pip
pip install git+https://github.com/Aratako/MioCodec

# Or using uv
uv add git+https://github.com/Aratako/MioCodec
```

Basic Inference

Basic usage for encoding and decoding audio:

```python
from miocodec import MioCodecModel, load_audio
import soundfile as sf

# 1. Load model
model = MioCodecModel.from_pretrained("Aratako/MioCodec-25Hz-24kHz").eval().cuda()

# 2. Load audio
waveform = load_audio("input.wav", sample_rate=model.config.sample_rate).cuda()

# 3. Encode audio
features = model.encode(waveform)

# 4. Decode to waveform (directly, no vocoder needed)
resynth = model.decode(
    content_token_indices=features.content_token_indices,
    global_embedding=features.global_embedding,
)

# 5. Save
sf.write("output.wav", resynth.cpu().numpy(), model.config.sample_rate)
```

Voice Conversion (Zero-shot)

MioCodec allows you to swap speaker identities by combining the content tokens of a source utterance with the global embedding of a reference speaker.

```python
source = load_audio("source_content.wav", sample_rate=model.config.sample_rate).cuda()
reference = load_audio("target_speaker.wav", sample_rate=model.config.sample_rate).cuda()

# Perform conversion
vc_wave = model.voice_conversion(source, reference)
sf.write("converted.wav", vc_wave.cpu().numpy(), model.config.sample_rate)
```
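
Conceptually, this amounts to encoding both clips and mixing the two components at decode time. A hypothetical sketch built from the encode/decode API shown in Basic Inference (the helper name is ours; `model.voice_conversion` may differ internally):

```python
def zero_shot_vc(model, source_wave, reference_wave):
    """Sketch of zero-shot VC: source content + reference speaker.

    Assumes the encode()/decode() API from the Basic Inference example.
    """
    src = model.encode(source_wave)       # "what" is said (content tokens)
    ref = model.encode(reference_wave)    # "how" it sounds (global embedding)
    return model.decode(
        content_token_indices=src.content_token_indices,
        global_embedding=ref.global_embedding,
    )
```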

🏗️ Training Methodology

MioCodec-25Hz-24kHz was trained in two phases with an integrated wave decoder that directly synthesizes waveforms via iSTFT.
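
The iSTFT-head idea can be sketched as a small PyTorch module: project each frame of decoder features to per-bin log-magnitude and phase, assemble a complex spectrogram, and invert it with an inverse STFT. This is a hypothetical minimal version, not MioCodec's actual architecture; the projection layout, `n_fft`, and hop length here are assumptions:

```python
import torch
import torch.nn as nn

class ISTFTHead(nn.Module):
    """Hypothetical minimal iSTFT head (dimensions are assumptions)."""

    def __init__(self, dim: int, n_fft: int = 1024, hop_length: int = 256):
        super().__init__()
        self.n_fft, self.hop_length = n_fft, hop_length
        # One log-magnitude and one phase value per frequency bin (n_fft//2 + 1 bins).
        self.proj = nn.Linear(dim, n_fft + 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) decoder features
        log_mag, phase = self.proj(x).chunk(2, dim=-1)
        spec = torch.exp(log_mag) * torch.exp(1j * phase)  # complex spectrogram
        return torch.istft(
            spec.transpose(1, 2),            # (batch, freq_bins, frames)
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            window=torch.hann_window(self.n_fft),
        )
```

Because the head emits waveform samples directly, no separately trained vocoder is needed at inference time.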

Phase 1: Feature Alignment

The model is trained to minimize both Multi-Resolution Mel-spectrogram loss and SSL feature reconstruction loss (using WavLM-base+). The wave decoder directly generates waveforms, and losses are computed on the reconstructed audio.

  • Multi-Resolution Mel Spectrogram Loss: Using window lengths of [32, 64, 128, 256, 512, 1024, 2048].
  • SSL Feature Reconstruction Loss: Using WavLM-base+ features.
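
As a rough illustration of the multi-resolution idea, here is a simplified spectral loss over the same window lengths. It compares STFT magnitudes rather than mel spectrograms (no mel filterbank), so it is a stand-in for, not a reproduction of, the training loss:

```python
import torch

WIN_LENGTHS = (32, 64, 128, 256, 512, 1024, 2048)

def multi_resolution_spec_loss(pred, target, win_lengths=WIN_LENGTHS):
    """Mean L1 distance between STFT magnitudes at several resolutions.

    Simplified stand-in: magnitude STFT instead of mel spectrogram.
    """
    loss = 0.0
    for win in win_lengths:
        window = torch.hann_window(win)
        mag = lambda x: torch.stft(
            x, n_fft=win, hop_length=win // 4,
            window=window, return_complex=True,
        ).abs()
        loss = loss + (mag(pred) - mag(target)).abs().mean()
    return loss / len(win_lengths)
```

Short windows resolve transients; long windows resolve harmonics, which is why a range of resolutions is averaged.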

Phase 2: Adversarial Refinement

Building upon Phase 1, adversarial training is introduced to improve perceptual quality. The training objectives include:

  • Multi-Resolution Mel Spectrogram Loss: Using window lengths of [32, 64, 128, 256, 512, 1024, 2048].
  • SSL Feature Reconstruction Loss: Using WavLM-base+ features.
  • Multi-Period Discriminator (MPD): Using periods of [2, 3, 5, 7, 11, 17, 23].
  • Multi-Scale STFT Discriminator (MS-STFTD): Using FFT sizes of [118, 190, 310, 502, 814, 1314, 2128, 3444].
  • RMS Loss: To stabilize energy and volume.
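
The core trick of a Multi-Period Discriminator is to fold the 1D waveform into a 2D (frames x period) grid for each period, so that 2D convolutions can inspect periodic structure. A minimal sketch of that folding step, in the style of HiFi-GAN's MPD (the discriminator networks themselves are omitted; the helper name is ours):

```python
import torch
import torch.nn.functional as F

PERIODS = (2, 3, 5, 7, 11, 17, 23)

def fold_by_period(wave: torch.Tensor, period: int) -> torch.Tensor:
    """Fold (batch, 1, time) audio into (batch, 1, time // period, period)."""
    b, c, t = wave.shape
    pad = (period - t % period) % period  # right-pad to a multiple of period
    if pad:
        wave = F.pad(wave, (0, pad), mode="reflect")
    return wave.view(b, c, (t + pad) // period, period)
```

Each discriminator then runs 2D convolutions on one folded view; prime-valued periods minimize overlap between the views.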

📚 Training Data

The training datasets are listed below:

| Language | Approx. Hours | Dataset |
|---|---|---|
| Japanese | ~22,500 h | Various public HF datasets |
| English | ~500 h | Libriheavy-HQ |
| English | ~4,000 h | MLS-Sidon |
| English | ~9,000 h | HiFiTTS-2 |
| English | ~27,000 h | Emilia-YODAS |
| German | ~1,950 h | MLS-Sidon |
| German | ~5,600 h | Emilia-YODAS |
| Dutch | ~1,550 h | MLS-Sidon |
| French | ~1,050 h | MLS-Sidon |
| French | ~7,400 h | Emilia-YODAS |
| Spanish | ~900 h | MLS-Sidon |
| Italian | ~240 h | MLS-Sidon |
| Portuguese | ~160 h | MLS-Sidon |
| Polish | ~100 h | MLS-Sidon |
| Korean | ~7,300 h | Emilia-YODAS |
| Chinese | ~300 h | Emilia-YODAS |

📜 Acknowledgements

🖊️ Citation

```bibtex
@misc{miocodec-25hz-24khz,
  author = {Chihiro Arata},
  title = {MioCodec: High-Fidelity Neural Audio Codec for Efficient Spoken Language Modeling},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/MioCodec-25Hz-24kHz}}
}
```