---
license: mit
language:
- en
- ja
- nl
- fr
- de
- it
- pl
- pt
- es
- ko
- zh
tags:
- speech
- audio
- tokenizer
datasets:
- sarulab-speech/mls_sidon
- mythicinfinity/Libriheavy-HQ
- nvidia/hifitts-2
- amphion/Emilia-Dataset
pipeline_tag: audio-to-audio
---

# MioCodec-25Hz-24kHz: Lightweight Neural Audio Codec for Efficient Spoken Language Modeling

[GitHub Repository](https://github.com/Aratako/MioCodec)

**MioCodec-25Hz-24kHz** is a lightweight, fast neural audio codec designed for efficient spoken language modeling. Based on the [Kanade-Tokenizer](https://github.com/frothywater/kanade-tokenizer) implementation, this model features an integrated wave decoder (iSTFTHead) that synthesizes waveforms directly, without requiring an external vocoder.

For higher audio fidelity at 44.1 kHz, see [MioCodec-25Hz-44.1kHz](https://huggingface.co/Aratako/MioCodec-25Hz-44.1kHz).

## Overview

MioCodec decomposes speech into two distinct components:

1. **Content tokens:** discrete representations that primarily capture linguistic information and phonetic content ("what" is being said) at a low frame rate (25 Hz).
2. **Global embedding:** a continuous vector representing broad acoustic characteristics ("how" it is said), including speaker identity, recording environment, and microphone traits.

By disentangling these elements, MioCodec is well suited to **spoken language modeling**, since a language model only needs to operate on the compact content token stream.

### Key features

* **Lightweight & fast:** the integrated wave decoder (iSTFTHead) enables direct waveform synthesis without an external vocoder.
* **Ultra-low bitrate:** achieves high-fidelity reconstruction at only **341 bps**.
* **End-to-end design:** a single model architecture from audio input to waveform output.
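The 341 bps figure follows from the token rate and vocabulary size alone: 25 tokens per second, each drawn from a 12,800-entry codebook, carries 25 × log2(12800) ≈ 341 bits per second. A quick sanity check:

```python
import math

token_rate_hz = 25    # content tokens per second
vocab_size = 12_800   # content token vocabulary

bits_per_token = math.log2(vocab_size)        # ~13.64 bits per token
print(round(token_rate_hz * bits_per_token))  # 341 (bps)

# The same arithmetic reproduces the 12.5 Hz variant's rate:
print(round(12.5 * bits_per_token))           # 171 (bps)
```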

## Model Comparison

| Model | Token Rate | Vocab Size | Bit Rate | Sample Rate | SSL Encoder | Vocoder | Parameters | Highlights |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :--- |
| **MioCodec-25Hz-24kHz** | **25 Hz** | **12,800** | **341 bps** | **24 kHz** | **WavLM-base+** | **None (iSTFTHead)** | **132M** | **Lightweight, fast inference** |
| MioCodec-25Hz-44.1kHz | 25 Hz | 12,800 | 341 bps | 44.1 kHz | WavLM-base+ | [MioVocoder](https://huggingface.co/Aratako/MioVocoder) (jointly tuned) | 118M (w/o vocoder) | High quality, high sample rate |
| kanade-25hz | 25 Hz | 12,800 | 341 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 118M (w/o vocoder) | Original 25 Hz model |
| kanade-12.5hz | 12.5 Hz | 12,800 | 171 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 120M (w/o vocoder) | Original 12.5 Hz model |

## Quick Start

### Installation

```bash
# Install via pip
pip install git+https://github.com/Aratako/MioCodec

# Or using uv
uv add git+https://github.com/Aratako/MioCodec
```

### Basic Inference

Basic usage for encoding and decoding audio:

```python
import soundfile as sf

from miocodec import MioCodecModel, load_audio

# 1. Load the model
model = MioCodecModel.from_pretrained("Aratako/MioCodec-25Hz-24kHz").eval().cuda()

# 2. Load audio at the model's sample rate
waveform = load_audio("input.wav", sample_rate=model.config.sample_rate).cuda()

# 3. Encode to content tokens and a global embedding
features = model.encode(waveform)

# 4. Decode back to a waveform (directly, no vocoder needed)
resynth = model.decode(
    content_token_indices=features.content_token_indices,
    global_embedding=features.global_embedding,
)

# 5. Save the reconstruction
sf.write("output.wav", resynth.cpu().numpy(), model.config.sample_rate)
```
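One practical consequence of the 25 Hz token rate: sequence lengths for downstream language modeling are easy to budget, since at 24 kHz each content token covers 960 input samples. A small sketch (the helper below is illustrative, not part of the miocodec API):

```python
SAMPLE_RATE = 24_000   # Hz, the model's native sample rate
TOKEN_RATE = 25        # content tokens per second

SAMPLES_PER_TOKEN = SAMPLE_RATE // TOKEN_RATE  # 960 samples per token

def approx_token_count(num_samples: int) -> int:
    """Rough content-token count for a clip of num_samples (ignores padding)."""
    return num_samples // SAMPLES_PER_TOKEN

print(approx_token_count(SAMPLE_RATE * 10))  # 250 tokens for 10 s of audio
```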

### Voice Conversion (Zero-shot)

MioCodec allows you to swap speaker identities by combining the content tokens of a source with the global embedding of a reference.

```python
source = load_audio("source_content.wav", sample_rate=model.config.sample_rate).cuda()
reference = load_audio("target_speaker.wav", sample_rate=model.config.sample_rate).cuda()

# Perform conversion
vc_wave = model.voice_conversion(source, reference)
sf.write("converted.wav", vc_wave.cpu().numpy(), model.config.sample_rate)
```

## Training Methodology

MioCodec-25Hz-24kHz was trained in two phases, with an integrated wave decoder that synthesizes waveforms directly via iSTFT.

### Phase 1: Feature Alignment

The model is trained to minimize both a **multi-resolution Mel-spectrogram loss** and an **SSL feature reconstruction loss** (using WavLM-base+). Since the wave decoder generates waveforms directly, both losses are computed on the reconstructed audio.

* **Multi-resolution Mel-spectrogram loss:** window lengths of `[32, 64, 128, 256, 512, 1024, 2048]`.
* **SSL feature reconstruction loss:** WavLM-base+ features.

### Phase 2: Adversarial Refinement

Building on Phase 1, adversarial training is introduced to improve perceptual quality. The training objectives include:

* **Multi-resolution Mel-spectrogram loss:** window lengths of `[32, 64, 128, 256, 512, 1024, 2048]`.
* **SSL feature reconstruction loss:** WavLM-base+ features.
* **Multi-Period Discriminator (MPD):** periods of `[2, 3, 5, 7, 11, 17, 23]`.
* **Multi-Scale STFT Discriminator (MS-STFTD):** FFT sizes of `[118, 190, 310, 502, 814, 1314, 2128, 3444]`.
* **RMS loss:** stabilizes energy and volume.
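To make the multi-resolution idea concrete, here is a minimal NumPy sketch of such a loss, using plain magnitude STFTs rather than the Mel front end the model actually uses; only the window lengths are taken from the list above, everything else (Hann window, 75% overlap, L1 distance) is an illustrative assumption:

```python
import numpy as np

def stft_mag(x: np.ndarray, win_len: int) -> np.ndarray:
    """Magnitude STFT with a Hann window and 75% overlap."""
    hop = win_len // 4
    win = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * win for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))

def multi_resolution_loss(ref: np.ndarray, est: np.ndarray,
                          win_lens=(32, 64, 128, 256, 512, 1024, 2048)) -> float:
    """Average L1 distance between magnitude spectra across all resolutions."""
    return float(np.mean([np.mean(np.abs(stft_mag(ref, w) - stft_mag(est, w)))
                          for w in win_lens]))

t = np.linspace(0, 1, 24_000, endpoint=False)
clean = np.sin(2 * np.pi * 220 * t)
print(multi_resolution_loss(clean, clean))            # 0.0 for identical signals
print(multi_resolution_loss(clean, 0.5 * clean) > 0)  # True: mismatches are penalized
```

Combining several window lengths penalizes errors at both fine temporal and fine spectral scales, which a single fixed resolution cannot do.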

## Training Data

The training datasets are listed below:

| Language | Approx. Hours | Dataset |
| :--- | :--- | :--- |
| **Japanese** | ~22,500h | Various public HF datasets |
| **English** | ~500h | [Libriheavy-HQ](https://huggingface.co/datasets/mythicinfinity/Libriheavy-HQ) |
| **English** | ~4,000h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **English** | ~9,000h | [HiFiTTS-2](https://huggingface.co/datasets/nvidia/hifitts-2) |
| **English** | ~27,000h | [Emilia-YODAS](https://huggingface.co/datasets/amphion/Emilia-Dataset) |
| **German** | ~1,950h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **German** | ~5,600h | [Emilia-YODAS](https://huggingface.co/datasets/amphion/Emilia-Dataset) |
| **Dutch** | ~1,550h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **French** | ~1,050h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **French** | ~7,400h | [Emilia-YODAS](https://huggingface.co/datasets/amphion/Emilia-Dataset) |
| **Spanish** | ~900h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Italian** | ~240h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Portuguese** | ~160h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Polish** | ~100h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Korean** | ~7,300h | [Emilia-YODAS](https://huggingface.co/datasets/amphion/Emilia-Dataset) |
| **Chinese** | ~300h | [Emilia-YODAS](https://huggingface.co/datasets/amphion/Emilia-Dataset) |

## Acknowledgements

* **Codec architecture:** based on the brilliant work of [kanade-tokenizer](https://github.com/frothywater/kanade-tokenizer).
* **Decoder design:** inspired by [XCodec2](https://github.com/zhenye234/X-Codec-2.0).
* **Training techniques:** training objectives were inspired by [XCodec2](https://github.com/zhenye234/X-Codec-2.0) and [Inworld TTS-1](https://arxiv.org/html/2507.21138v1).

## Citation

```bibtex
@misc{miocodec-25hz-24khz,
  author       = {Chihiro Arata},
  title        = {MioCodec: High-Fidelity Neural Audio Codec for Efficient Spoken Language Modeling},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/MioCodec-25Hz-24kHz}}
}
```